featurebox.featurizers package

Subpackages

Submodules

featurebox.featurizers.base_feature module

Base

class featurebox.featurizers.base_feature.BaseFeature(n_jobs: int = 1, *, on_errors: str = 'raise', return_type: str = 'any', batch_calculate: bool = False, batch_size: int = 30, feature_labels_mark: Optional[str] = None, **kwargs)

Bases: object

Using a BaseFeature Class

To define a new feature, subclass BaseFeature and implement the convert method:

class MatFeature(BaseFeature):
    def convert(self, *x):
        ...

BaseFeature follows the sklearn.base.BaseEstimator and sklearn.base.TransformerMixin interface, so it can be used in the scikit-learn way.

feature = SomeFeature()
features = feature.fit_transform(X)

Note

The convert method should be overridden to handle a single case. The transform and fit_transform methods are then available automatically for a list of cases.
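The relationship between convert and transform can be sketched in plain Python. This is a minimal stand-in, not the featurebox implementation: MiniFeature and WordLength are hypothetical names, and the real BaseFeature also handles parallelism, error policies and return types.

```python
class MiniFeature:
    """Hypothetical stand-in for BaseFeature (illustration only)."""

    def convert(self, d):
        # Subclasses override this to handle ONE case.
        raise NotImplementedError

    def transform(self, entries):
        # Apply convert to every case in the list.
        return [self.convert(d) for d in entries]

    def fit_transform(self, X, y=None):
        # Nothing is fitted in this sketch; it only transforms.
        return self.transform(X)


class WordLength(MiniFeature):
    def convert(self, d):
        # One case in, one feature row out.
        return [len(d)]


features = WordLength().fit_transform(["Fe", "NaCl", "H2O"])
```

Only convert is written by the subclass; the list handling lives in the base class.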

Adding references

BaseFeature also lets you retrieve the proper references for a feature. __citations__ returns a list of papers that should be cited. __authors__ returns a list of people who wrote the feature. Both can also be accessed through the properties citations and authors.

These operations must be implemented for each new feature:

  • feature_labels - Generates a human-meaningful name for each of the features. Implement this as a property; it can be set with set_feature_labels.

Implementing these two properties is also recommended:

  • citations - Returns a list of citations in BibTeX format.

  • authors - Returns a list of people who contributed to writing the paper.

Note

None of these operations should change the state of the feature. That is, running each method twice should not produce different results, no class attributes should be changed, and running one operation should not affect the output of another.

Parameters:
  • batch_size (int) – size of each batch.

  • batch_calculate (bool) – whether to calculate in batches.

  • n_jobs (int) – number of parallel jobs.

  • on_errors (str) – How to handle exceptions raised during feature calculation. Can be nan, keep, or raise. When ‘nan’, a column of np.nan is returned; the length of the column corresponds to the number of feature labels. The default is ‘raise’, which re-raises the exception.

  • return_type (str) – Specifies the return type. Can be any, np, array, or df. ‘array’ and ‘df’ force the return type to np.ndarray and pd.DataFrame respectively. If ‘any’, no type conversion is performed. Default is ‘any’.
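The ‘nan’ and ‘raise’ policies of on_errors can be illustrated with a small stand-alone sketch. The helper convert_all and the function inverse are hypothetical and written only for this illustration; math.nan stands in for np.nan so that the example needs nothing beyond the standard library. The ‘keep’ option is not described here, so it is left out.

```python
import math


def convert_all(convert, entries, n_labels, on_errors="raise"):
    """Hypothetical helper: apply `convert` to each entry with an error policy."""
    rows = []
    for d in entries:
        try:
            rows.append(convert(d))
        except Exception:
            if on_errors == "nan":
                # One NaN per feature label, so every row keeps the same length.
                rows.append([math.nan] * n_labels)
            else:
                # 'raise': let the original exception propagate.
                raise
    return rows


def inverse(d):
    # Fails with ZeroDivisionError for d == 0.
    return [1.0 / d]


rows = convert_all(inverse, [2.0, 0.0, 4.0], n_labels=1, on_errors="nan")
```

With on_errors="nan" the failing entry becomes a row of NaN values and the remaining rows are unaffected; with "raise" the first failure stops the calculation.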

property authors

List of implementors of the feature.

Returns:

(list) each element should either be a string with the author name (e.g., “Anubhav Jain”) or a dictionary with required key “x_name” and other keys like “email” or “institution” (e.g., {“x_name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

property citations

Citation(s) and reference(s) for this feature.

Returns:

(list) each element should be a string citation, ideally in BibTeX format.
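The documented return shapes of authors and citations can be sketched as below. MyFeature is a hypothetical class that does not inherit from BaseFeature, and the author names and BibTeX entry are placeholders, not real references.

```python
class MyFeature:
    """Hypothetical feature showing the documented return shapes only."""

    @property
    def authors(self):
        # Either plain strings or dicts with the required key "x_name".
        return ["A. Author", {"x_name": "B. Author", "institution": "Example Lab"}]

    @property
    def citations(self):
        # Placeholder BibTeX entry, not a real reference.
        return ["@article{placeholder2020, title={Placeholder title}, year={2020}}"]


feature = MyFeature()
# Check that every author entry has one of the two documented forms.
authors_ok = all(isinstance(a, str) or "x_name" in a for a in feature.authors)
```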

convert(d)

Main feature function, which has to be implemented in any derived feature subclass.

Notes

By default, convert should not be passed an np.ndarray, except in the following cases:

1. It is useful for bond_converter. For an np.ndarray, ndim is checked; for ndim 2 or 3, the self.ndim attribute decides whether the data is passed to _converter together or separately. At most 3 dimensions are supported. This is allowed because, for some functions, using NumPy ufuncs is very efficient.

2. It keeps the size of the data and simplifies _convert.

Parameters:

d – a single input (one sample, one case).

Returns:

new_x – the new x.

static emptytonone(d)

property feature_labels

Generate attribute names.

Returns:

([str]) attribute labels.

fit(*args, **kwargs)

In BaseFeature, fit is a weakened stub that only passes its parameters through.

fit_transform(X: List, y=None, **kwargs) → Any

If convert takes multiple inputs, supply inputs as a list of tuples.

Copied from TransformerMixin, the mixin class for all transformers in scikit-learn.

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
  • X (list) – list of cases.

  • y (None) – deprecated.

  • **kwargs – Additional fit or transform parameters. feature_labels_mark (str): mark for each feature label; used when return_type == ’pd’. x_labels (list): mark for each row; used when return_type == ’pd’.

Returns:

X_new – result data.

property n_jobs

Number of parallel jobs.

Type:

int

static nonetoempty(d)

set_feature_labels(values: List[str])

Set the attribute names.

Parameters:

values ([str]) – attribute labels.

transform(entries: List) → Any

Transform a list of entries. Each element of entries corresponds to the parameters of convert. If convert takes n inputs, each element of entries should be a list or tuple of size n:

[(p1,p2),(p1,p2),(p1,p2),…,(p1,p2),(p1,p2)]

Such a list can be built with zip or by calling the built-in transform_with_zip.

Parameters:

entries (list) – A list of entries to be featured.

Returns:

result – features for each entry.

Return type:

any

transform_with_zip(*args) → Any

Second transform method, which converts Iterables to a list and then runs transform.

first: p1s,p2s -> [(p1,p2),(p1,p2),(p1,p2),…,(p1,p2),(p1,p2)]

second: run self.transform

Parameters:

args (Iterable) – each element of args must be an Iterable.

Returns:

result – features for each entry.

Return type:

any
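The two steps of transform_with_zip can be sketched with free functions. This is an assumption-laden illustration, not the library code: the functions below take convert as an explicit argument instead of being methods, weighted is a hypothetical two-input convert, and unpacking each tuple into the arguments of convert is assumed from the description of transform.

```python
def transform(convert, entries):
    # Each entry is a tuple whose items become the arguments of `convert`.
    return [convert(*entry) for entry in entries]


def transform_with_zip(convert, *args):
    # First: p1s, p2s -> [(p1, p2), (p1, p2), ...]
    entries = list(zip(*args))
    # Second: run transform on the zipped entries.
    return transform(convert, entries)


def weighted(p1, p2):
    # Hypothetical two-input convert.
    return p1 * p2


result = transform_with_zip(weighted, [1, 2, 3], [10, 20, 30])
```

Note that zip stops at the shortest input, so in this sketch the iterables should all have the same length.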

featurebox.featurizers.base_feature.Converter

alias of BaseFeature

class featurebox.featurizers.base_feature.ConverterCat(*args: BaseFeature, force_concatenate=False, n_jobs: int = 1, on_errors: str = 'raise', return_type: str = 'any')

Bases: BaseFeature

Pack the converters into one unified converter. Converters of the same type are merged, and converters of different types run in order. Therefore, keep converters of the same type next to each other, such as A(), A(), B(), B().

Examples

>>> tmp = ConverterCat(
...    AtomEmbeddingMap(),
...    AtomEmbeddingMap("ie.json"),
...    AtomTableMap(search_tp="name"))
>>> tmp.convert(x)
>>> tmp.transform(xs)

Parameters:

args (Converter) – List of Converter

convert(d)

convert and concatenate.

static sums(args)

SUM
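The idea of converting and concatenating can be sketched as follows. CatSketch is a hypothetical stand-in for ConverterCat that accepts plain functions; it does not reproduce the merging of same-type converters, and length and upper_count are made-up converters.

```python
class CatSketch:
    """Hypothetical stand-in for ConverterCat (illustration only)."""

    def __init__(self, *converters):
        self.converters = converters

    def convert(self, d):
        # Run every converter on the same input and join the outputs.
        out = []
        for c in self.converters:
            out.extend(c(d))
        return out


def length(d):
    return [len(d)]


def upper_count(d):
    return [sum(ch.isupper() for ch in d)]


row = CatSketch(length, upper_count).convert("NaCl")
```

Each converter sees the same input, and the feature row grows by one block per converter.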

class featurebox.featurizers.base_feature.ConverterSequence(*args: BaseFeature, n_jobs: int = 1, on_errors: str = 'raise', return_type: str = 'any')

Bases: BaseFeature

Pack the converters into one sequentially executed pipeline.

input -> convert1 -> temp -> convert2 -> temp -> convert3 -> output

Notes

There is no error checking. Make sure manually that each intermediate result (temp) can be passed to the next converter!

Examples

>>> tmp = ConverterSequence(
...    AtomEmbeddingMap(),
...    DummyConverter())
>>> tmp.convert(x)

Parameters:

args (Converter) – List of Converter

convert(d)

convert batched
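The chain input -> convert1 -> temp -> convert2 -> output can be sketched with functools.reduce. The function run_sequence and the two converters are hypothetical; like the class described here, the sketch does no checking that one step's output is valid input for the next.

```python
from functools import reduce


def run_sequence(converters, d):
    # input -> convert1 -> temp -> convert2 -> ... -> output
    return reduce(lambda temp, convert: convert(temp), converters, d)


def split_symbols(d):
    # First step: string -> list of symbols.
    return d.split("-")


def count_items(d):
    # Second step: list -> number of items.
    return len(d)


output = run_sequence([split_symbols, count_items], "Fe-O-H")
```

Swapping the order of the two converters would fail, because count_items returns an int that split_symbols cannot split; this is the kind of mismatch the caller has to rule out.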

class featurebox.featurizers.base_feature.DummyConverter(n_jobs: int = 1, *, on_errors: str = 'raise', return_type: str = 'any', batch_calculate: bool = False, batch_size: int = 30, feature_labels_mark: Optional[str] = None, **kwargs)

Bases: BaseFeature

Dummy converter used as a placeholder; it does nothing.

Parameters:
  • batch_size (int) – size of each batch.

  • batch_calculate (bool) – whether to calculate in batches.

  • n_jobs (int) – number of parallel jobs.

  • on_errors (str) – How to handle exceptions raised during feature calculation. Can be nan, keep, or raise. When ‘nan’, a column of np.nan is returned; the length of the column corresponds to the number of feature labels. The default is ‘raise’, which re-raises the exception.

  • return_type (str) – Specifies the return type. Can be any, np, array, or df. ‘array’ and ‘df’ force the return type to np.ndarray and pd.DataFrame respectively. If ‘any’, no type conversion is performed. Default is ‘any’.

convert(d) → ndarray

Dummy convert, does nothing to input.

Parameters:

d (Any) – input object

Returns: d

featurebox.featurizers.batch_feature module

class featurebox.featurizers.batch_feature.BatchFeature(data_type: str = 'compositions', user_convert: Optional[BaseFeature] = None, n_jobs: int = 1, on_errors: str = 'raise', return_type: str = 'df', batch_calculate: bool = False, batch_size: int = 30)

Bases: object

Script for generating batch data; it can be copied and customized by the user.

Parameters:
  • data_type (str) – predefined name, one of [“elements”, “compositions”, “structures”].

  • user_convert (BaseFeature) – user-defined converter that provides a convert method.

convert(d)

property feature_labels

fit_transform(entries: List)

set_feature_labels(values: List[str])

transform(entries: List)