featurebox.featurizers.state package

Submodules

featurebox.featurizers.state.extrastats module

General methods for computing property statistics from a list of values

class featurebox.featurizers.state.extrastats.PropertyStats

Bases: object

This class contains statistical operations that are commonly employed when computing features. The primary way of interacting with this class is to call the calc_stat function, which takes the name of the statistic you would like to compute and the weights/values of the data to be assessed. For example, computing the mean of a list looks like:

>>> x = [1, 2, 3]
>>> PropertyStats.calc_stat(x, 'mean') # Result is 2
>>> PropertyStats.calc_stat(x, 'mean', weights=[0, 0, 1]) # Result is 3

Some of the statistics functions take options (e.g., Holder means). You can pass them to the statistics functions by adding them after the name, separated by two colons. For example, the 0th Holder mean would be:

>>> PropertyStats.calc_stat(x, 'holder_mean::0')

You can, of course, call the statistical functions directly. All take at least two arguments. The first is the data being assessed and the second, optional, argument is the weights.
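
The name-plus-options convention for the statistic string can be pictured as a plain string split. This is a minimal sketch and not the library's implementation: split_stat is a hypothetical helper, and converting every option to float is an assumption.

```python
def split_stat(stat):
    """Split a statistic string such as 'holder_mean::2' into its parts.

    Hypothetical helper for illustration only; options are assumed numeric.
    """
    # Everything before the first '::' is the name, the rest are options.
    name, *options = stat.split("::")
    return name, [float(opt) for opt in options]


print(split_stat("mean"))            # ('mean', [])
print(split_stat("holder_mean::0"))  # ('holder_mean', [0.0])
```

Under this reading, the name selects the statistics function and the parsed options are passed on as extra arguments, such as the power of a Holder mean.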

static avg_dev(data_lst, weights=None)

Mean absolute deviation of a list of element data. This is computed by first calculating the mean of the list, and then computing the average absolute difference between each value and the mean.

Parameters:
  • data_lst (list of floats) – List of values to be assessed

  • weights (list of floats) – Weights for each value

Returns:

mean absolute deviation
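
The two-step computation described for avg_dev can be sketched in plain Python. The function name avg_dev_sketch is hypothetical, and using the same weights for both the mean and the deviations is an assumption about how weights are applied.

```python
def avg_dev_sketch(data_lst, weights=None):
    """Weighted mean absolute deviation (illustrative, not the library code)."""
    if weights is None:
        weights = [1.0] * len(data_lst)  # unweighted case: equal weights
    total = sum(weights)
    # Step 1: weighted mean of the values.
    mean = sum(w * x for w, x in zip(weights, data_lst)) / total
    # Step 2: weighted average of the absolute differences from that mean.
    return sum(w * abs(x - mean) for w, x in zip(weights, data_lst)) / total


print(avg_dev_sketch([1, 2, 3]))                     # mean is 2, deviations are 1, 0, 1
print(avg_dev_sketch([1, 2, 3], weights=[0, 0, 1]))  # all weight on 3, so the result is 0.0
```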

static calc_stat(data_lst, stat, weights=None)

Compute a property statistic

Parameters:
  • data_lst (list of floats) – list of values

  • stat (str) – name of the statistic to compute. Any options should be added after the name and separated by two colons. For example, the 2nd Holder mean would be "holder_mean::2"

  • weights (list of floats) – (Optional) weights for each element in data_lst

Returns:

float - Desired statistic

static eigenvalues(data_lst, symm=False, sort=False)

Return the eigenvalues of a matrix as a numpy array.

Parameters:
  • data_lst (matrix-like) – matrix of values

  • symm – whether to assume the matrix is symmetric

  • sort – whether to sort the eigenvalues

Returns:

eigenvalues

static flatten(data_lst, weights=None)

Returns a flattened copy of data_lst as a numpy array

static geom_std_dev(data_lst, weights=None)

Geometric standard deviation

Parameters:
  • data_lst (list of floats) – List of values to be assessed

  • weights (list of floats) – Weights for each value

Returns:

geometric standard deviation
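
For reference, the geometric standard deviation is commonly defined as the exponential of the standard deviation of the logarithms of the values. The sketch below follows that textbook definition. It assumes strictly positive inputs and a population (not sample) variance; geom_std_dev_sketch is a hypothetical name, and the library may differ in detail.

```python
import math


def geom_std_dev_sketch(data_lst, weights=None):
    """Weighted geometric standard deviation (illustrative sketch)."""
    if weights is None:
        weights = [1.0] * len(data_lst)
    total = sum(weights)
    logs = [math.log(x) for x in data_lst]  # requires x > 0
    # The weighted mean of the logs is the log of the geometric mean.
    log_mean = sum(w * l for w, l in zip(weights, logs)) / total
    # Weighted population variance of the logs.
    log_var = sum(w * (l - log_mean) ** 2 for w, l in zip(weights, logs)) / total
    return math.exp(math.sqrt(log_var))


print(geom_std_dev_sketch([1.0, math.exp(2.0)]))  # approximately e
```

A value of 1 means no spread, because the result is a multiplicative factor rather than an additive offset.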

static holder_mean(data_lst, weights=None, power=1)

Get Holder mean

Parameters:
  • data_lst (list/array) – values

  • weights (list/array) – weights

  • power (int/float/str) – which Holder mean to compute

Returns:

Holder mean
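
The Holder (power) mean with exponent p is conventionally (sum_i w_i * x_i**p / sum_i w_i) ** (1/p), with the geometric mean as the p = 0 limit. The following is a sketch of that standard definition, not the library source. The name holder_mean_sketch and the float() conversion of power (to accept strings such as "2") are assumptions.

```python
import math


def holder_mean_sketch(data_lst, weights=None, power=1):
    """Weighted Holder mean following the textbook definition."""
    power = float(power)  # accept int, float, or numeric strings
    if weights is None:
        weights = [1.0] * len(data_lst)
    total = sum(weights)
    if power == 0:
        # Limit p -> 0: weighted geometric mean (needs positive values).
        return math.exp(sum(w * math.log(x) for w, x in zip(weights, data_lst)) / total)
    # General case; zero values combined with a negative power will raise.
    return (sum(w * x ** power for w, x in zip(weights, data_lst)) / total) ** (1.0 / power)


print(holder_mean_sketch([1, 2, 3], power=1))   # arithmetic mean
print(holder_mean_sketch([1, 4], power=0))      # geometric mean
print(holder_mean_sketch([1, 3], power=-1))     # harmonic mean
```

In this picture, powers 1, 0, and -1 recover the arithmetic, geometric, and harmonic means respectively.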

static inverse_mean(data_lst, weights=None)

Mean of the inverse of each entry

Parameters:
  • data_lst (list of floats) – List of values to be assessed

  • weights (list of floats) – Weights for each value

Returns:

inverse mean

static kurtosis(data_lst, weights=None)

Kurtosis of a list of data

Parameters:
  • data_lst (list of floats) – List of values to be assessed

  • weights (list of floats) – Weights for each value

Returns:

kurtosis

static maximum(data_lst, weights=None)

Maximum value in a list

Parameters:
  • data_lst (list of floats) – List of values to be assessed

  • weights – (ignored)

Returns:

maximum value

static mean(data_lst, weights=None)

Arithmetic mean of list

Parameters:
  • data_lst (list of floats) – List of values to be assessed

  • weights (list of floats) – Weights for each value

Returns:

mean value

static minimum(data_lst, weights=None)

Minimum value in a list

Parameters:
  • data_lst (list of floats) – List of values to be assessed

  • weights – (ignored)

Returns:

minimum value

static mode(data_lst, weights=None)

Mode of a list of data. If multiple elements occur equally frequently (or have the same weight, if weights are provided), this function will return the minimum of those values.

Parameters:
  • data_lst (list of floats) – List of values to be assessed

  • weights (list of floats) – Weights for each value

Returns:

mode
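
The tie-breaking rule described for mode (return the smallest of the equally frequent or equally weighted values) can be sketched as follows. mode_sketch is a hypothetical name, and comparing accumulated weights with exact equality is a simplifying assumption.

```python
from collections import defaultdict


def mode_sketch(data_lst, weights=None):
    """Weighted mode; ties are resolved by taking the minimum value."""
    if weights is None:
        weights = [1.0] * len(data_lst)  # unweighted: count occurrences
    totals = defaultdict(float)
    for x, w in zip(data_lst, weights):
        totals[x] += w  # accumulate weight per distinct value
    best = max(totals.values())
    # Among all values that reach the top weight, return the smallest.
    return min(x for x, t in totals.items() if t == best)


print(mode_sketch([1, 2, 2, 3]))                      # 2
print(mode_sketch([3, 1, 2]))                         # 1 (three-way tie)
print(mode_sketch([1, 2, 3], weights=[0.1, 0.5, 0.5]))  # 2 (tie between 2 and 3)
```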

static quantile(data_lst, weights=None, q=0.5)

Return a specific quantile.

Parameters:
  • weights (float) – not used

  • data_lst (list or np.ndarray) – 1D data list to be used for computing quantiles

  • q (float) – The quantile, as a fraction between 0 and 1.

Returns:

(float) The computed quantile of the data_lst.

static range(data_lst, weights=None)

Range of a list

Parameters:
  • data_lst (list of floats) – List of values to be assessed

  • weights – (ignored)

Returns:

range

static skewness(data_lst, weights=None)

Skewness of a list of data

Parameters:
  • data_lst (list of floats) – List of values to be assessed

  • weights (list of floats) – Weights for each value

Returns:

skewness

static sorted(data_lst, weights=None)

Returns the sorted data_lst

static std_dev(data_lst, weights=None)

Standard deviation of a list of element data

Parameters:
  • data_lst (list of floats) – List of values to be assessed

  • weights (list of floats) – Weights for each value

Returns:

standard deviation

featurebox.featurizers.state.state_mapper module

class featurebox.featurizers.state.state_mapper.StructurePymatgenPropMap(prop_name=None, func: Optional[Callable] = None, return_type='df', **kwargs)

Bases: _StructurePymatgenPropMap

Get properties of a pymatgen structure during preprocessing. Default: ["density", "volume", "ntypesp"].

Examples

>>> tmps = StructurePymatgenPropMap()
>>> tmps.fit_transform()
Parameters:
  • prop_name – (str, list of str) property name or list of property names. Default: ["density", "volume", "ntypesp"]

  • func – (callable or list of callable) make sure its length matches that of prop_name.

featurebox.featurizers.state.statistics module

class featurebox.featurizers.state.statistics.BaseCompositionFeature(data_map: BinaryMap, n_jobs: int = 1, on_errors: str = 'raise', return_type: str = 'df', feature_labels_mark: Optional[str] = None)

Bases: BinaryMap

BaseCompositionFeature is the base class for composition data. Subclasses should re-implement mix_function, such as:

def mix_function(self, elems:List, nums:List):
    w_ = np.array(nums)
    return w_.dot(elems)

Base class for composition feature.

convert_dict(atoms: dict) ndarray

Convert atom {symbol: fraction} list to numeric features

convert_number(atoms: List)

Convert atom {symbol: fraction} list to numeric features

fit(*args, x_labels=None, **kwargs)

The fit function in BaseFeature is weakened and just passes the parameters through.

abstract mix_function(elems: List, nums: Union[List, ndarray])
Parameters:
  • elems (list) – Elements in compound.

  • nums (list) – Number of each element.

Returns:

descriptor

Return type:

numpy.ndarray
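
The elems/nums contract of mix_function can be illustrated without the library. The sketch assumes, as the dot-product example in the class description suggests, that elems holds one feature vector per element and nums holds the amount of each element. Both function names are hypothetical, and plain lists stand in for numpy arrays.

```python
def weighted_sum_mix(elems, nums):
    """Sum the per-element feature vectors, each scaled by its amount."""
    n_features = len(elems[0])
    return [sum(n * vec[j] for n, vec in zip(nums, elems)) for j in range(n_features)]


def weighted_average_mix(elems, nums):
    """Same as the weighted sum, normalised by the total amount."""
    total = sum(nums)
    return [value / total for value in weighted_sum_mix(elems, nums)]


# Two elements with two features each, in amounts 2 and 1.
elems = [[1.0, 10.0], [2.0, 20.0]]
nums = [2, 1]
print(weighted_sum_mix(elems, nums))      # [4.0, 40.0]
print(weighted_average_mix(elems, nums))  # each entry divided by 3
```

Either way, the output has one entry per feature column, which is why a composition of any size maps to a fixed-length descriptor.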

class featurebox.featurizers.state.statistics.DepartElementFeature(data_map: BinaryMap, n_composition: int, n_jobs: int = 1, on_errors: str = 'raise', return_type: str = 'df')

Bases: BaseCompositionFeature

Get the table of element data.

Examples

>>> from featurebox.featurizers.atom.mapper import AtomJsonMap
>>> from featurebox.featurizers.state.union import UnionFeature
>>> from featurebox.featurizers.state.statistics import DepartElementFeature
>>> data_map = AtomJsonMap(search_tp="name",embedding_dict="ele_megnet.json", n_jobs=1) # keep this n_jobs=1 and return_type="np"
>>> wa = DepartElementFeature(data_map,n_composition=2, n_jobs=1, return_type="pd")
>>> comp = [{"H": 2, "Pd": 1},{"He":1, "Al":4}]
>>> wa.set_feature_labels(["fea_{}".format(_) for _ in range(16)]) # 16 is the number of features in the built-in "ele_megnet.json"
>>> wa.fit_transform(comp)
   depart_fea_0_0  depart_fea_0_1  ...  depart_fea_15_0  depart_fea_15_1
0        0.352363        0.561478  ...        -0.270104        -0.212607
1       -0.067220        0.025758  ...        -0.042185         0.080350

[2 rows x 32 columns]

Base class for composition feature.

convert_dict(atoms: Union[dict, Composition]) ndarray

Convert atom {symbol: fraction} list to numeric features

convert_number(atoms: List) ndarray

Convert atom {symbol: fraction} list to numeric features

mix_function(elems: ndarray, nums=None)
Parameters:
  • elems (list) – Elements in compound.

  • nums (list) – Number of each element.

Returns:

descriptor

Return type:

numpy.ndarray

set_feature_labels(values)

Generate attribute names.

Returns:

([str]) attribute labels.

class featurebox.featurizers.state.statistics.ExtraMix(data_map: BinaryMap, stats: Tuple[str] = ('mean',), n_jobs: int = 1, on_errors: str = 'raise', return_type: str = 'df')

Bases: BaseCompositionFeature

See also

WeightedSum

Base class for composition feature.

mix_function(elems, nums)
Parameters:
  • elems (list) – Elements in compound.

  • nums (list) – Number of each element.

Returns:

descriptor

Return type:

numpy.ndarray

class featurebox.featurizers.state.statistics.GeometricMean(data_map: BinaryMap, n_jobs: int = 1, on_errors: str = 'raise', return_type: str = 'df')

Bases: BaseCompositionFeature

See also

WeightedSum

Base class for composition feature.

mix_function(elems: ndarray, nums)
Parameters:
  • elems (list) – Elements in compound.

  • nums (list) – Number of each element.

Returns:

descriptor

Return type:

numpy.ndarray

class featurebox.featurizers.state.statistics.HarmonicMean(data_map: BinaryMap, n_jobs: int = 1, on_errors: str = 'raise', return_type: str = 'df')

Bases: BaseCompositionFeature

See also

WeightedSum

Base class for composition feature.

mix_function(elems, nums)
Parameters:
  • elems (list) – Elements in compound.

  • nums (list) – Number of each element.

Returns:

descriptor

Return type:

numpy.ndarray

class featurebox.featurizers.state.statistics.MaxPooling(data_map: BinaryMap, n_jobs: int = 1, on_errors: str = 'raise', return_type: str = 'df')

Bases: BaseCompositionFeature

See also

WeightedSum

Base class for composition feature.

mix_function(elems, _)
Parameters:
  • elems (list) – Elements in compound.

  • nums (list) – Number of each element.

Returns:

descriptor

Return type:

numpy.ndarray

class featurebox.featurizers.state.statistics.MinPooling(data_map: BinaryMap, n_jobs: int = 1, on_errors: str = 'raise', return_type: str = 'df')

Bases: BaseCompositionFeature

See also

WeightedSum

Base class for composition feature.

mix_function(elems, _)
Parameters:
  • elems (list) – Elements in compound.

  • nums (list) – Number of each element.

Returns:

descriptor

Return type:

numpy.ndarray

class featurebox.featurizers.state.statistics.WeightedAverage(data_map: BinaryMap, n_jobs: int = 1, on_errors: str = 'raise', return_type: str = 'df')

Bases: BaseCompositionFeature

Examples

>>> from featurebox.featurizers.atom.mapper import AtomTableMap, AtomJsonMap
>>> data_map = AtomJsonMap(search_tp="name", n_jobs=1)
>>> wa = WeightedAverage(data_map, n_jobs=1,return_type="df")
>>> x3 = [{"H": 2, "Pd": 1},{"He":1,"Al":4}]
>>> wa.fit_transform(x3)
         0         1         2   ...        13        14        15
0  0.422068  0.360958  0.201433  ... -0.459164 -0.064783 -0.250939
1  0.007163 -0.471498 -0.072860  ...  0.206306 -0.041006  0.055843

[2 rows x 16 columns]
>>> wa.set_feature_labels(["fea_{}".format(_) for _ in range(16)])
>>> wa.fit_transform(x3)
   wt_ave_fea_0  wt_ave_fea_1  ...  wt_ave_fea_14  wt_ave_fea_15
0      0.422068      0.360958  ...      -0.064783      -0.250939
1      0.007163     -0.471498  ...      -0.041006       0.055843

[2 rows x 16 columns]

Base class for composition feature.

mix_function(elems, nums)
Parameters:
  • elems (list) – Elements in compound.

  • nums (list) – Number of each element.

Returns:

descriptor

Return type:

numpy.ndarray

class featurebox.featurizers.state.statistics.WeightedSum(data_map: BinaryMap, n_jobs: int = 1, on_errors: str = 'raise', return_type: str = 'df')

Bases: BaseCompositionFeature

Examples

>>> from featurebox.featurizers.atom.mapper import AtomTableMap, AtomJsonMap
>>> data_map = AtomTableMap(search_tp="name", n_jobs=1)
>>> wa = WeightedSum(data_map, n_jobs=1,return_type="df")
>>> x3 = [{"H": 2, "Pd": 1},{"He":1,"Al":4}]
>>> wa.fit_transform(x3)
   wt_sum_1s  wt_sum_2s  wt_sum_2p  ...  wt_sum_6d  wt_sum_6f  wt_sum_7s
0    8320.18   11837.27      11.80  ...        0.0        0.0        0.0
1    2188.73    1513.40     986.16  ...        0.0        0.0        0.0

[2 rows x 19 columns]

Base class for composition feature.

mix_function(elems, nums)
Parameters:
  • elems (list) – Elements in compound.

  • nums (list) – Number of each element.

Returns:

descriptor

Return type:

numpy.ndarray

class featurebox.featurizers.state.statistics.WeightedVariance(data_map: BinaryMap, n_jobs: int = 1, on_errors: str = 'raise', return_type: str = 'df')

Bases: BaseCompositionFeature

See also

WeightedSum

Base class for composition feature.

mix_function(elems: ndarray, nums)
Parameters:
  • elems (list) – Elements in compound.

  • nums (list) – Number of each element.

Returns:

descriptor

Return type:

numpy.ndarray

featurebox.featurizers.state.union module

class featurebox.featurizers.state.union.PolyFeature(*, degree: Union[int, List[int]] = 3, n_jobs=1, on_errors='raise', return_type='df')

Bases: BaseFeature, ABC

Extension method.

For example, degree = 2 yields the second-order terms (x1*x2, x1**2, x2**2).
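
How the terms for a list of degrees can be enumerated is sketched below for a single row of features. This is an illustration and not the PolyFeature implementation: poly_terms is a hypothetical helper, and the f0^1*f1^1 label style is chosen to resemble the column names in the example output of this section.

```python
from collections import Counter
from itertools import combinations_with_replacement
from math import prod


def poly_terms(row, degrees):
    """Return (labels, values) of all monomials of the given total degrees."""
    labels, values = [], []
    for degree in degrees:
        # Each combination of feature indices (with repetition) is one monomial.
        for combo in combinations_with_replacement(range(len(row)), degree):
            powers = Counter(combo)  # feature index -> exponent
            labels.append("*".join(f"f{i}^{p}" for i, p in sorted(powers.items())))
            values.append(prod(row[i] for i in combo))
    return labels, values


labels, values = poly_terms([2.0, 3.0], degrees=[1, 2])
print(labels)  # ['f0^1', 'f1^1', 'f0^2', 'f0^1*f1^1', 'f1^2']
print(values)  # [2.0, 3.0, 4.0, 6.0, 9.0]
```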

Examples

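>>> import numpy as np
>>> import pandas as pd
>>> from featurebox.featurizers.state.union import PolyFeature
>>> n = np.array([[0,1,2,3,4,5],[0.422068,0.360958,0.201433,-0.459164,-0.064783,-0.250939]]).T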
>>> ps = pd.DataFrame(n,columns=["f1","f2"],index= ["x0","x1","x2","x3","x4","x5"])
>>> pf = PolyFeature(degree=[1,2])
>>> pf.fit_transform(n)
   f0^1      f1^1  f0^2  f0^1*f1^1      f1^2
0   0.0  0.422068   0.0   0.000000  0.178141
1   1.0  0.360958   1.0   0.360958  0.130291
2   2.0  0.201433   4.0   0.402866  0.040575
3   3.0 -0.459164   9.0  -1.377492  0.210832
4   4.0 -0.064783  16.0  -0.259132  0.004197
5   5.0 -0.250939  25.0  -1.254695  0.062970
Parameters:
  • batch_size (int) – size of batch.

  • batch_calculate (bool) – batch_calculate or not.

  • n_jobs (int) – Parallel number.

  • on_errors (str) – How to handle exceptions raised during feature calculation. Can be nan, keep, or raise. When 'nan', return a column of np.nan whose length corresponds to the number of feature labels. The default is 'raise', which re-raises the exception.

  • return_type (str) – Specify the return type. Can be any, np, array, or df. 'array' and 'df' force the return type to np.ndarray and pd.DataFrame respectively. If 'any', no type conversion is performed. Default is 'any'

fit_transform(X: Union[ndarray, DataFrame], y=None, **kwargs)

If convert takes multiple inputs, supply inputs as a list of tuples.

Copied from TransformerMixin, the mixin class for all transformers in scikit-learn.

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
  • X (list) – list of cases.

  • y (None) – deprecated.

  • **kwargs – Additional fit or transform parameters. feature_labels_mark (str): mark for each of the feature labels, used when return_type == 'pd'. x_labels (list): label for each row, used when return_type == 'pd'.

Returns:

result data.

Return type:

X_new

set_feature_labels(input_features=None)

Generate attribute names.

Returns:

([str]) attribute labels.

class featurebox.featurizers.state.union.UnionFeature(comp: List[Dict], couple_data: Union[DataFrame, ndarray], couple=2, stats=('mean',), n_jobs: int = 1, on_errors: str = 'raise', return_type: str = 'df')

Bases: BaseFeature

The transform method should take comp_index as input rather than entries.

Examples

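>>> from featurebox.featurizers.atom.mapper import AtomTableMap, AtomJsonMap
>>> from featurebox.featurizers.state.statistics import DepartElementFeature
>>> from featurebox.featurizers.state.union import UnionFeature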
>>> data_map = AtomJsonMap(search_tp="name", n_jobs=1)
>>> wa = DepartElementFeature(data_map,n_composition=2, n_jobs=1,return_type="df")
>>> x3 = [{"H": 2, "Pd": 1},{"He":1,"Al":4}]
>>> wa.set_feature_labels(["fea_{}".format(_) for _ in range(16)])
>>> wa.fit_transform(x3)
   depart_fea_0_0  depart_fea_0_1  ...  depart_fea_15_0  depart_fea_15_1
0        0.352363        0.561478  ...        -0.270104        -0.212607
1       -0.067220        0.025758  ...        -0.042185         0.080350

[2 rows x 32 columns]
>>> couple_data = wa.fit_transform(x3)
>>> uf = UnionFeature(x3,couple_data,couple=2,stats=("mean","maximum"))
>>> uf.fit_transform()
   mean_fea_0  maximum_fea_0  ...  mean_fea_15  maximum_fea_15
0    0.422068       0.360958  ...     0.021095       -0.212607
1    0.007163      -0.471498  ...     0.165278        0.080350

[2 rows x 32 columns]
>>> couple_data = wa.fit_transform(x3)
>>> uf = UnionFeature(x3,couple_data,couple=2,stats=("std_dev",))
>>> uf.fit_transform()
   std_dev_fea_0  std_dev_fea_1  ...  std_dev_fea_14  std_dev_fea_15
0       0.147867       0.583352  ...        0.182177        0.040657
1       0.065745       0.541477  ...        0.182331        0.086646

[2 rows x 16 columns]
Parameters:
  • batch_size (int) – size of batch.

  • batch_calculate (bool) – batch_calculate or not.

  • n_jobs (int) – Parallel number.

  • on_errors (str) – How to handle exceptions raised during feature calculation. Can be nan, keep, or raise. When 'nan', return a column of np.nan whose length corresponds to the number of feature labels. The default is 'raise', which re-raises the exception.

  • return_type (str) – Specify the return type. Can be any, np, array, or df. 'array' and 'df' force the return type to np.ndarray and pd.DataFrame respectively. If 'any', no type conversion is performed. Default is 'any'

convert(comp_number=0)

Get elemental property attributes

Parameters:

  • comp – Pymatgen composition object

  • comp_number –

Returns:

Specified property statistics of features

Return type:

all_attributes

fit_transform(entries: Optional[List] = None) Any

If convert takes multiple inputs, supply inputs as a list of tuples.

Copied from TransformerMixin, the mixin class for all transformers in scikit-learn.

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
  • X (list) – list of cases.

  • y (None) – deprecated.

  • **kwargs – Additional fit or transform parameters. feature_labels_mark (str): mark for each of the feature labels, used when return_type == 'pd'. x_labels (list): label for each row, used when return_type == 'pd'.

Returns:

result data.

Return type:

X_new

set_feature_labels(self_elem_data_columns_values: List)

Generate attribute names.

Parameters:

self_elem_data_columns_values (List) – name

Return type:

([str]) attribute labels.

transform(entries: Optional[List] = None) Any

Transform a list of entries. Each element of entries corresponds to the parameters of convert. If convert takes n inputs, each element of entries should be a list or tuple of size n,

[(p1,p2),(p1,p2),(p1,p2),…,(p1,p2),(p1,p2)]

which can be built with zip or by using the built-in transform_with_zip.
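
Building the list of tuples with zip, and how each tuple would map onto the arguments of a two-input convert, can be sketched as follows. The convert function and the sample values here are hypothetical stand-ins, not part of the library.

```python
# Two parallel input columns, one value per entry.
p1_values = ["a", "b", "c"]
p2_values = [1, 2, 3]

# zip pairs them up into [(p1, p2), (p1, p2), ...].
entries = list(zip(p1_values, p2_values))


def convert(p1, p2):
    """Hypothetical two-argument convert used only for this illustration."""
    return f"{p1}{p2}"


# Each tuple is unpacked into the arguments of convert.
results = [convert(*entry) for entry in entries]
print(entries)  # [('a', 1), ('b', 2), ('c', 3)]
print(results)  # ['a1', 'b2', 'c3']
```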

Parameters:

entries (list) – A list of entries to be featured.

Returns:

result – features for each entry.

Return type:

any