featurebox.featurizers package¶

Subpackages¶

Submodules¶

featurebox.featurizers.base_feature module¶

Base

class featurebox.featurizers.base_feature.BaseFeature(n_jobs: int = 1, *, on_errors: str = 'raise', return_type: str = 'any', batch_calculate: bool = False, batch_size: int = 30)¶

Bases: object

Using a BaseFeature Class

You can implement a feature directly as a subclass of BaseFeature.

class MatFeature(BaseFeature):
    def convert(self, *x):
        ...

BaseFeature implements the sklearn.base.BaseEstimator and sklearn.base.TransformerMixin interfaces, so you can use it in the scikit-learn way.

feature = SomeFeature()
features = feature.fit_transform(X)

Note

The convert method should be overridden to handle a single case. The transform and fit_transform methods are then derived automatically for a list of cases.
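As a rough illustration of this split between convert and transform, the sketch below uses a hypothetical stand-in class (MiniFeature, which is not part of featurebox) whose transform simply loops convert over a list of cases. Parallelism, error handling, and return types are left out, so this is only an approximation of the idea, not the library's implementation.

```python
from typing import Any, List


class MiniFeature:
    """Hypothetical stand-in for BaseFeature; not the featurebox implementation."""

    def convert(self, d: Any) -> Any:
        # Subclasses override this to handle exactly one case.
        raise NotImplementedError

    def transform(self, entries: List[Any]) -> List[Any]:
        # The list version is derived from the single-case version.
        return [self.convert(d) for d in entries]

    def fit_transform(self, X: List[Any], y=None) -> List[Any]:
        # Nothing is learned in this sketch, so fitting is a no-op.
        return self.transform(X)


class LengthFeature(MiniFeature):
    """Toy feature: the number of characters in a formula string."""

    def convert(self, d: str) -> List[int]:
        return [len(d)]


features = LengthFeature().fit_transform(["NaCl", "H2O"])
```

Only convert is written by hand in LengthFeature; the list handling comes from the base class.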

Adding references

BaseFeature also lets you retrieve the proper references for a feature. __citations__ returns a list of papers that should be cited. __authors__ returns a list of people who wrote the feature. Both can also be accessed through the properties citations and authors.

These operations must be implemented for each new feature:

feature_labels - Generates a human-meaningful name for each of the features. Implement this as a property. The labels can also be set with set_feature_labels.

It is also recommended to implement these two properties:

citations - Returns a list of citations in BibTeX format.
authors - Returns a list of people who contributed to the feature.

Note

None of these operations should change the state of the feature. That is, running a method twice should not produce different results, no class attributes should be changed, and running one operation should not affect the output of another.
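The properties and the statelessness requirement described above can be sketched as follows. ToyFeature, its label, and the placeholder citation and author strings are hypothetical and chosen only for illustration; a real feature would subclass BaseFeature.

```python
from typing import List


class ToyFeature:
    """Hypothetical feature class; the names mirror the properties described above."""

    def __init__(self) -> None:
        self._feature_labels: List[str] = ["length"]

    @property
    def feature_labels(self) -> List[str]:
        # Return a copy so that callers cannot mutate internal state.
        return list(self._feature_labels)

    def set_feature_labels(self, values: List[str]) -> None:
        self._feature_labels = list(values)

    @property
    def citations(self) -> List[str]:
        # Placeholder BibTeX entry, not a real reference.
        return ["@misc{placeholder, title={Placeholder entry}}"]

    @property
    def authors(self) -> List[str]:
        # Placeholder name, not a real contributor.
        return ["Example Author"]

    def convert(self, d: str) -> List[int]:
        # Reads only its input, so repeated calls give identical results.
        return [len(d)]


feature = ToyFeature()
first = feature.convert("NaCl")
second = feature.convert("NaCl")
```

Because convert touches no attributes, first and second are equal, which is the behaviour the note asks for.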

Parameters:

batch_size (int) – Size of each batch.
batch_calculate (bool) – Whether to calculate in batches.
n_jobs (int) – Number of parallel jobs.
on_errors (str) – How to handle exceptions raised during feature calculation. Can be nan, keep, or raise. When 'nan', a column of np.nan is returned whose length equals the number of feature labels. The default is 'raise', which raises the exception.
return_type (str) – Specifies the return type. Can be any, np, array, or df. 'array' and 'df' force the return type to np.ndarray and pd.DataFrame respectively. If 'any', no type conversion is performed. The default is 'any'.
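One way the 'raise' and 'nan' settings of on_errors could behave is sketched below. The helper safe_convert and the toy inverse function are hypothetical and are not featurebox code; the 'keep' option is not modelled because its behaviour is not described here.

```python
import math
from typing import Any, Callable, List


def safe_convert(convert: Callable[[Any], List[float]], d: Any,
                 n_labels: int, on_errors: str = "raise") -> List[float]:
    """Hypothetical helper showing one way 'raise' and 'nan' could behave."""
    try:
        return convert(d)
    except Exception:
        if on_errors == "nan":
            # One NaN per feature label, so every row keeps the same width.
            return [math.nan] * n_labels
        # 'raise' (and anything not handled in this sketch) re-raises.
        raise


def inverse(d: float) -> List[float]:
    """Toy convert function that fails for d == 0."""
    return [1.0 / d]


rows = [safe_convert(inverse, d, n_labels=1, on_errors="nan") for d in (2.0, 0.0)]
```

With on_errors="nan" the failing sample yields a NaN row instead of stopping the whole run, which keeps the output aligned with the input list.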

property authors¶

List of implementors of the feature.

Returns:

(list) Each element should be either a string with the author name (e.g., "Anubhav Jain") or a dictionary with the required key "x_name" and other keys such as "email" or "institution" (e.g., {"x_name": "Anubhav Jain", "email": "ajain@lbl.gov", "institution": "LBNL"}).

property citations¶

Citation(s) and reference(s) for this feature.

Returns:

(list) Each element should be a citation string, ideally in BibTeX format.

convert(d)¶

Main feature function, which has to be implemented in any derived feature subclass.

Note

By default, convert should not be passed an np.array, except in the following case:

1. It is useful for bond_converter. For an np.array, ndim is checked; for ndim 2 or 3, the self.ndim attribute decides whether the data are passed to _converter together or separately. At most 3 dimensions are supported. The reason is that, for some functions, using NumPy ufuncs is very efficient.

This keeps the size of the data and simplifies _convert.

Parameters: d – one input datum (one sample, one case).
Returns: new x.
Return type: new_x

property feature_labels¶

Generate attribute names.

Returns: ([str]) attribute labels.

fit(*args, **kwargs)¶: The fit function in BaseFeature is deliberately minimal and only passes its parameters along.

fit_transform(X: List, y=None, **kwargs) → Any¶

If convert takes multiple inputs, supply inputs as a list of tuples.

Copied from TransformerMixin, the mixin class for all transformers in scikit-learn.

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:

X (list) – list of cases.
y (None) – deprecated.
**kwargs – Additional fit or transform parameters. feature_labels_mark: str, a mark for each of the feature_labels, used for return_type == 'pd'. x_labels: list, a mark for each row, used for return_type == 'pd'.

Returns:

result data.

Return type:

X_new

property n_jobs¶

Parallel number.

Type: int

set_feature_labels(values: List[str])¶

Set the attribute names.

Parameters: values ([str]) – attribute labels.

transform(entries: List) → Any¶

Transform a list of entries. Each element of entries corresponds to the parameters of convert. If convert takes n inputs, each element of the transform input should be a list or tuple of size n:

[(p1,p2),(p1,p2),(p1,p2),…,(p1,p2),(p1,p2)]

Such a list can be built with zip or by using the built-in transform_with_zip.

Parameters: entries (list) – A list of entries to be featurized.
Returns: result – features for each entry.
Return type: any

transform_with_zip(*args) → Any¶

A second form of transform, which converts the iterables to a list and then runs transform.

First: p1s, p2s -> [(p1,p2),(p1,p2),(p1,p2),…,(p1,p2),(p1,p2)]

Second: run self.transform.

Parameters: args (Iterable) – each of args must be an Iterable.
Returns: result – features for each entry.
Return type: any
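The two steps of transform_with_zip can be sketched with plain functions. The two-argument convert below and the module-level transform and transform_with_zip are hypothetical stand-ins for the methods of the same name, intended only to show the zip-then-unpack pattern.

```python
from typing import Iterable, List, Tuple


def convert(p1: float, p2: float) -> List[float]:
    """Toy two-argument convert: returns the sum and the product."""
    return [p1 + p2, p1 * p2]


def transform(entries: Iterable[Tuple[float, float]]) -> List[List[float]]:
    # Each entry is a tuple that is unpacked into convert's parameters.
    return [convert(*entry) for entry in entries]


def transform_with_zip(*args: Iterable[float]) -> List[List[float]]:
    # First pair up the iterables, then run the ordinary transform.
    return transform(list(zip(*args)))


p1s = [1.0, 2.0, 3.0]
p2s = [4.0, 5.0, 6.0]
result = transform_with_zip(p1s, p2s)
```

Calling transform(list(zip(p1s, p2s))) directly gives the same result; transform_with_zip only saves the explicit zip.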

featurebox.featurizers.base_feature.Converter¶: alias of BaseFeature

class featurebox.featurizers.base_feature.ConverterCat(*args: BaseFeature, force_concatenate=False, n_jobs: int = 1, on_errors: str = 'raise', return_type: str = 'any')¶

Bases: BaseFeature

Pack the converters into one unified converter. Converters of the same type are merged, and converters of different types are run in order. Therefore, keep converters of the same type next to each other, such as A(), A(), B(), B().

Examples

>>> tmp = ConverterCat(
...    AtomEmbeddingMap(),
...    AtomEmbeddingMap("ie.json"),
...    AtomTableMap(search_tp="name"))
>>> tmp.convert(x)
>>> tmp.transform(xs)

Parameters: args (Converter) – List of Converter

convert(d)¶: convert and concatenate.

static sums(args)¶: SUM
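The "convert and concatenate" idea behind ConverterCat can be sketched with plain functions. The helper concat_converters and the two toy converters are hypothetical; the merging of same-type converters that ConverterCat performs is not modelled here.

```python
from typing import Any, Callable, List

Converter = Callable[[Any], List[float]]


def concat_converters(*converters: Converter) -> Converter:
    """Hypothetical helper: run every converter on the same input and join the outputs."""

    def convert(d: Any) -> List[float]:
        out: List[float] = []
        for conv in converters:
            # Every converter sees the same input; the outputs are laid side by side.
            out.extend(conv(d))
        return out

    return convert


def length(d: str) -> List[float]:
    """Toy converter: number of characters."""
    return [float(len(d))]


def n_upper(d: str) -> List[float]:
    """Toy converter: number of upper-case characters."""
    return [float(sum(ch.isupper() for ch in d))]


combined = concat_converters(length, n_upper)
row = combined("NaCl")
```

The resulting row has one column per feature of each converter, in the order the converters were given.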

class featurebox.featurizers.base_feature.ConverterSequence(*args: BaseFeature, n_jobs: int = 1, on_errors: str = 'raise', return_type: str = 'any')¶

Bases: BaseFeature

Pack the converters into one sequentially executed assembly.

input -> convert1 -> temp -> convert2 -> temp -> convert3 -> output

Note

There is no error checking. Make sure manually that each intermediate result (temp) can be passed to the next converter!

Examples

>>> tmp = ConverterSequence(
...    AtomEmbeddingMap(),
...    DummyConverter())
>>> tmp.convert(x)

Parameters: args (Converter) – List of Converter

convert(d)¶: convert batched
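The sequential flow input -> convert1 -> temp -> convert2 -> output can be sketched as a simple function chain. The helper chain_converters and the two toy converters are hypothetical and only illustrate the composition, not the ConverterSequence implementation.

```python
from functools import reduce
from typing import Any, Callable, List


def chain_converters(*converters: Callable[[Any], Any]) -> Callable[[Any], Any]:
    """Hypothetical helper: feed each converter's output into the next one."""

    def convert(d: Any) -> Any:
        # No type checking between steps, matching the warning above.
        return reduce(lambda temp, conv: conv(temp), converters, d)

    return convert


def split_symbols(d: str) -> List[str]:
    """Toy first step: split a dash-separated string into symbols."""
    return d.split("-")


def count_items(d: List[str]) -> int:
    """Toy second step: count the symbols produced by the first step."""
    return len(d)


pipeline = chain_converters(split_symbols, count_items)
output = pipeline("Na-Cl-O")
```

Swapping the order of the two steps would fail at run time, which is why the intermediate types have to be checked by hand.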

class featurebox.featurizers.base_feature.DummyConverter(n_jobs: int = 1, *, on_errors: str = 'raise', return_type: str = 'any', batch_calculate: bool = False, batch_size: int = 30)¶

Bases: BaseFeature

Dummy converter used as a placeholder; it does nothing.

Parameters:

batch_size (int) – Size of each batch.
batch_calculate (bool) – Whether to calculate in batches.
n_jobs (int) – Number of parallel jobs.
on_errors (str) – How to handle exceptions raised during feature calculation. Can be nan, keep, or raise. When 'nan', a column of np.nan is returned whose length equals the number of feature labels. The default is 'raise', which raises the exception.
return_type (str) – Specifies the return type. Can be any, np, array, or df. 'array' and 'df' force the return type to np.ndarray and pd.DataFrame respectively. If 'any', no type conversion is performed. The default is 'any'.

convert(d) → ndarray¶

Dummy convert, does nothing to input.

Parameters: d (Any) – input object

Returns: d

featurebox.featurizers.batch_feature module¶

class featurebox.featurizers.batch_feature.BatchFeature(data_type: str = 'compositions', user_convert: Optional[BaseFeature] = None, n_jobs: int = 1, on_errors: str = 'raise', return_type: str = 'any', batch_calculate: bool = False, batch_size: int = 30)¶

Bases: BaseFeature

Script for generating batch_data; it can be copied and customized by the user.

Parameters:

data_type (str) – Predefined name, one of ["elements", "compositions", "structures"].
user_convert (BaseFeature) – A user-defined converter that provides a convert method.

convert(d)¶

Main feature function, which has to be implemented in any derived feature subclass.

Note

By default, convert should not be passed an np.array, except in the following case:

1. It is useful for bond_converter. For an np.array, ndim is checked; for ndim 2 or 3, the self.ndim attribute decides whether the data are passed to _converter together or separately. At most 3 dimensions are supported. The reason is that, for some functions, using NumPy ufuncs is very efficient.

This keeps the size of the data and simplifies _convert.

Parameters: d – one input datum (one sample, one case).
Returns: new x.
Return type: new_x

property feature_labels¶

Generate attribute names.

Returns: ([str]) attribute labels.

set_feature_labels(values: List[str])¶

Set the attribute names.

Parameters: values ([str]) – attribute labels.