featurebox.selection package

Submodules

featurebox.selection.backforward module

Forward-and-backward feature elimination for feature ranking.

class featurebox.selection.backforward.BackForward(estimator: BaseEstimator, n_type_feature_to_select: Optional[int] = None, primary_feature: Optional[int] = None, multi_grade: int = 2, multi_index: Optional[List] = None, refit=True, cv=5, min_type_feature_to_select: int = 3, must_index: Optional[List] = None, tolerant: float = 0.01, verbose: int = 1, random_state: Optional[int] = None, scoring: Optional[str] = None, note: bool = True, filter_warn: bool = False)

Bases: BaseEstimator, MetaEstimatorMixin, SelectorMixin, MultiBase

BackForward method for selecting features.

n_feature_

The final number of selected features.

Type:

int

support_

The final mask of selected features.

Type:

array of shape [n_feature]

estimator_

The best model fitted on the best features (refitted with all data).

Type:

object

best_score_

Best score of best model of best features.

Type:

float

Examples

>>> from sklearn.datasets import fetch_california_housing
>>> from sklearn.svm import SVR
>>> from featurebox.selection.backforward import BackForward
>>> X, y = fetch_california_housing(return_X_y=True)
>>> X = X[:100]
>>> y = y[:100]
>>> svr = SVR()
>>> bf = BackForward(svr, primary_feature=4, random_state=1, verbose=0, note=False)
>>> new_x = bf.fit_transform(X, y)
>>> bf.support_
array([False,  True,  True, False, False, False, False,  True])

Examples

>>> import numpy as np
>>> from sklearn.datasets import fetch_california_housing
>>> from sklearn.svm import SVR
>>> from sklearn.model_selection import cross_val_score
>>> from featurebox.selection.backforward import BackForward
>>> X, y = fetch_california_housing(return_X_y=True)
>>> X = X[:100]
>>> y = y[:100]
>>> X_train, y_train, X_test, y_test = X[:50], y[:50], X[-50:], y[-50:]
>>> svr = SVR()
>>> bf = BackForward(svr, primary_feature=4, random_state=1, refit=True, cv=5, verbose=0, note=False)
>>> bf = bf.fit(X_train, y_train)
>>> bf.best_score_  # CV score
-3.0552830696940037
>>> train_score = bf.score(X_train, y_train)  # score on the training data
>>> test_score = bf.score(X_test, y_test)  # score on held-out data
>>> np.mean(cross_val_score(bf.estimator_, X_train[:, bf.support_], y_train, cv=5))  # recompute the CV score manually
-3.0552830696940037

Notes

If score or predict is used, refit should be set to True. The refit uses all the data passed to fit, so a score or prediction on that same data is not a test score or test prediction.

Examples

If the estimator is a GridSearchCV object, refit should be set to True; best_score_ then returns the CV score.

>>> from sklearn.datasets import fetch_california_housing
>>> from sklearn.svm import SVR
>>> from sklearn import model_selection
>>> from featurebox.selection.backforward import BackForward
>>> X, y = fetch_california_housing(return_X_y=True)
>>> X = X[:100]
>>> y = y[:100]
>>> X_train, y_train, X_test, y_test = X[:50], y[:50], X[-50:], y[-50:]
>>> svr = SVR()
>>> gd = model_selection.GridSearchCV(svr, param_grid={"C": [1, 10]}, n_jobs=1)  # keep n_jobs=1 here.
>>> bf = BackForward(gd, primary_feature=4, random_state=1, refit=True,
... scoring="neg_root_mean_squared_error", cv=5, verbose=0, note=False)
Uniform parameter in SearchCV and Exhaustion:
(scoring=neg_root_mean_squared_error, cv=5, refit=True)
>>> bf = bf.fit(X_train, y_train)
>>> bf.best_score_  # CV score
-0.5919173121895709
>>> train_score = bf.score(X_train, y_train)  # score on the training data
>>> test_score = bf.score(X_test, y_test)  # score on held-out data
>>> # bf.estimator_ is the gd object (GridSearchCV)
>>> bf.estimator_.best_score_  # the same CV score, read from the GridSearchCV object
-0.5919173121895709

Parameters:
  • estimator (estimator object) – A supervised scikit-learn estimator with a fit method; it is assumed to implement the scikit-learn estimator interface.

  • n_type_feature_to_select (int) – The maximum number of features to select. If None, the features with the best score are selected.

  • min_type_feature_to_select (int) – The minimum number of features to select.

  • primary_feature (int) – The number of primary features used to start the loop; defaults to n_features//2.

  • multi_grade (int) – The binding group size.

  • multi_index – The index range of the binding groups.

  • must_index – The indices of features that must be selected.

  • tolerant – The tolerance used when comparing ranks.

  • verbose (int) – Whether to print progress.

  • random_state (int) – The random state.

  • refit (bool) – Whether to refit. If True, the model is refitted with all the data.

  • scoring (None,str) – The name of the scoring method.

  • note (bool) – Whether to print the note.

  • filter_warn (bool) – Whether to filter warnings with warnings.filterwarnings.
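
The forward-and-backward idea behind these parameters can be sketched in plain Python. This is only an illustration under stated assumptions, not the featurebox implementation: the helper forward_backward_select, the toy_score function, and the way tolerant is applied (a change counts only if it improves the score by more than tolerant) are all hypothetical.

```python
from typing import Callable, FrozenSet, Iterable, Set


def forward_backward_select(
    n_features: int,
    score: Callable[[FrozenSet[int]], float],
    start: Iterable[int],
    tolerant: float = 0.01,
) -> Set[int]:
    """Greedy add/remove loop (hypothetical helper, not featurebox code)."""
    selected = set(start)
    best = score(frozenset(selected))
    improved = True
    while improved:
        improved = False
        # Forward step: try adding each unselected feature.
        for f in range(n_features):
            if f in selected:
                continue
            trial = score(frozenset(selected | {f}))
            if trial > best + tolerant:
                selected.add(f)
                best = trial
                improved = True
        # Backward step: try dropping each selected feature.
        for f in sorted(selected):
            if len(selected) == 1:
                break
            trial = score(frozenset(selected - {f}))
            if trial > best + tolerant:
                selected.remove(f)
                best = trial
                improved = True
    return selected


def toy_score(subset: FrozenSet[int]) -> float:
    # Toy score: features 1 and 3 are useful, every feature costs 0.1.
    return 1.0 * (1 in subset) + 0.5 * (3 in subset) - 0.1 * len(subset)


chosen = forward_backward_select(5, toy_score, start=[0, 1])
print(sorted(chosen))
```

In a real run the score would come from cross-validating the estimator on the candidate columns; the toy score only keeps the sketch self-contained.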

fit(X, y)

Fit the baf model and then the underlying estimator on the selected features.

Parameters:
  • X ({array-like, sparse matrix}, shape = [n_samples, n_feature]) – The training input samples.

  • y (array-like, shape = [n_samples]) – The target values.

predict(X)

Reduce X to the selected features and then predict using the underlying estimator.

Only available when refit=True.

Parameters:

X (array of shape [n_samples, n_feature]) – The input samples.

Returns:

y – The predicted target values.

Return type:

array of shape [n_samples]

score(X, y, scoring=None)

Reduce X to the selected features and then return the score of the underlying estimator. Only available when refit=True.

Parameters:
  • X (array of shape [n_samples, n_feature]) – The input samples.

  • y (array of shape [n_samples]) – The target values.

  • scoring (str, callable, default=None) –

    Strategy to evaluate the performance of the cross-validated model on the test set.

    A single score can be specified with a single string (see scoring_parameter).

    If scoring is provided, the score it defines is returned; otherwise the estimator_.score method is used, and an error is raised if neither is available.
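
The fallback order described for scoring can be sketched as follows. This is a minimal illustration, not the featurebox code: MeanModel, resolve_score and neg_max_error are hypothetical names, and only callables with the scikit-learn scorer signature (estimator, X, y) are handled here. A string would first need to be turned into such a callable, for example with sklearn.metrics.get_scorer.

```python
from typing import Any, Callable, List, Optional, Sequence


class MeanModel:
    """Toy estimator (hypothetical): always predicts a constant."""

    def __init__(self, constant: float) -> None:
        self.constant = constant

    def predict(self, X: Sequence[Any]) -> List[float]:
        return [self.constant for _ in X]

    def score(self, X: Sequence[Any], y: Sequence[float]) -> float:
        # Negative mean absolute error, so that larger is better.
        pred = self.predict(X)
        return -sum(abs(p - t) for p, t in zip(pred, y)) / len(y)


def resolve_score(estimator, X, y, scoring: Optional[Callable] = None) -> float:
    """Use `scoring` if given, else estimator.score, else raise."""
    if scoring is not None:
        return scoring(estimator, X, y)
    if hasattr(estimator, "score"):
        return estimator.score(X, y)
    raise TypeError("no scoring given and the estimator has no score method")


def neg_max_error(estimator, X, y) -> float:
    # Custom scorer: negative largest absolute error.
    pred = estimator.predict(X)
    return -max(abs(p - t) for p, t in zip(pred, y))


model = MeanModel(2.0)
X_demo, y_demo = [[0], [1], [2]], [1.0, 2.0, 4.0]
default_score = resolve_score(model, X_demo, y_demo)
custom_score = resolve_score(model, X_demo, y_demo, scoring=neg_max_error)
```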

class featurebox.selection.backforward.BackForwardStable(estimator: BaseEstimator, n_type_feature_to_select: Optional[int] = None, min_type_feature_to_select: int = 3, primary_feature: Optional[int] = None, multi_grade: int = 2, multi_index: Optional[List] = None, must_index: Optional[List] = None, verbose: int = 0, random_state: Optional[int] = None, tolerant: float = 0.001, cv: int = 5, times: int = 5, scoring: Optional[str] = None, n_jobs: Optional[int] = None, refit=False, note=True)

Bases: MetaEstimatorMixin, SelectorMixin, BaseEstimator

BackForwardStable. Runs the selection with different feature orders for a more stable result (just for testing).

n_feature_

The number of features selected with cross-validation.

Type:

int

support_

The mask of selected features.

Type:

array of shape [n_feature]

estimator_

The model fitted on the best features (refitted with all data).

Type:

object

best_score_

Best score of best model of best features.

Type:

float

Examples

>>> from sklearn.datasets import fetch_california_housing
>>> from sklearn.svm import SVR
>>> from featurebox.selection.backforward import BackForwardStable
>>> X, y = fetch_california_housing(return_X_y=True)
>>> X = X[:100]
>>> y = y[:100]
>>> svr = SVR()
>>> bf = BackForwardStable(svr, primary_feature=3, random_state=1, verbose=0, note=False)
>>> new_x = bf.fit_transform(X, y)
>>> bf.support_
array([False,  True, False, False, False,  True,  True, False])
>>> bf.best_score_
-0.09122826477472024

If score or predict is used, refit can be set to True. Make sure the data is split beforehand, because the refit uses all the data passed to fit().

>>> from sklearn.datasets import fetch_california_housing
>>> from sklearn.svm import SVR
>>> from featurebox.selection.backforward import BackForwardStable
>>> X, y = fetch_california_housing(return_X_y=True)
>>> X = X[:100]
>>> y = y[:100]
>>> svr = SVR()
>>> bf = BackForwardStable(svr, primary_feature=4, random_state=1, refit=True, verbose=0, note=False)
>>> new_x = bf.fit_transform(X[:50], y[:50])  # fit on the first half only
>>> test_score = bf.score(X[50:], y[50:])  # score on the held-out half
>>> cv_score = bf.best_score_
...

If the estimator is a GridSearchCV object, refit can be set to True; best_score_ then returns the CV score.

>>> from sklearn.datasets import fetch_california_housing
>>> from sklearn.svm import SVR
>>> from sklearn import model_selection
>>> from featurebox.selection.backforward import BackForward
>>> X, y = fetch_california_housing(return_X_y=True)
>>> X = X[:100]
>>> y = y[:100]
>>> svr = SVR()
>>> gd = model_selection.GridSearchCV(svr, param_grid={"C": [1, 10]})
>>> bf = BackForward(gd, primary_feature=4, random_state=1, refit=True, cv=5, verbose=0, note=False)
Uniform parameter in SearchCV and Exhaustion:
(scoring=None, cv=5, refit=True)
>>> new_x = bf.fit_transform(X, y)
...

Parameters:
  • estimator (estimator object) – A supervised scikit-learn estimator with a fit method; it is assumed to implement the scikit-learn estimator interface.

  • n_type_feature_to_select (int) – The maximum number of features to select. If None, the features with the best score are selected.

  • min_type_feature_to_select (int) – The minimum number of features to select.

  • primary_feature (int) – The number of primary features used to start the loop; defaults to n_features//2.

  • multi_grade (int) – The binding group size.

  • multi_index – The index range of the binding groups.

  • must_index – The indices of features that must be selected.

  • tolerant – The tolerance used when comparing ranks.

  • verbose (int) – Whether to print progress.

  • random_state (int) – The random state.

  • refit (bool) – Whether to refit. If True, the model is refitted with all the data.

  • n_jobs (int or None) – The number of cores to run in parallel while fitting across folds. None means 1 and -1 means using all processors.

  • scoring (None,str) – The name of the scoring method.

  • note (bool) – Whether to print the note.

fit(X, y, groups=None)

Fit the baf model and automatically tune the number of selected features.

Parameters:
  • X ({array-like, sparse matrix}, shape = [n_samples, n_feature]) – Training vector, where n_samples is the number of samples and n_feature is the total number of features.

  • y (array-like, shape = [n_samples]) – Target values (integers for classification, real numbers for regression).

  • groups (array-like, shape = [n_samples], optional) – Group labels for the samples used while splitting the dataset into train/test set.

predict(X)

Reduce X to the selected features and then predict using the underlying estimator. Only available when refit=True.

Parameters:

X (array of shape [n_samples, n_feature]) – The input samples.

Returns:

y – The predicted target values.

Return type:

array of shape [n_samples]

score(X, y, scoring=None)

Reduce X to the selected features and then return the score of the underlying estimator. Only available when refit=True.

Parameters:
  • X (array of shape [n_samples, n_feature]) – The input samples.

  • y (array of shape [n_samples]) – The target values.

featurebox.selection.corr module

Calculate the correlation of columns.

class featurebox.selection.corr.Corr(threshold: float = 0.85, multi_grade: int = 2, multi_index: Optional[List] = None, must_index: Optional[List] = None, random_state: int = 0)

Bases: BaseEstimator, MetaEstimatorMixin, SelectorMixin, MultiBase

Calculate correlation. (The result changes with the random state.)

1. Used for automatic filtering

Examples

>>> from sklearn.datasets import fetch_california_housing
>>> from featurebox.selection.corr import Corr
>>> x, y = fetch_california_housing(return_X_y=True)
>>> x = x[:100]
>>> y = y[:100]
>>> co = Corr(threshold=0.5)
>>> new_x = co.fit_transform(x)
>>> select_feature = co.support_

2. Used to get the groups exceeding the threshold, step by step

Examples

>>> import numpy as np
>>> from sklearn.datasets import fetch_california_housing
>>> from featurebox.selection.corr import Corr
>>> x, y = fetch_california_housing(return_X_y=True)
>>> x = x[:100]
>>> y = y[:100]
>>> co = Corr(threshold=0.5)
>>> groups = co.count_cof(np.corrcoef(x[:,:7], rowvar=False))
>>> groups[1]
[[0, 6], [1], [2], [3], [4], [5], [0, 6]]
>>> groups[0]
[[1.0, 0.554], [1.0], [1.0], [1.0], [1.0], [1.0], [0.554, 1.0]]
>>> co.remove_coef(groups[1]) # Filter automatically by machine.
[0, 1, 2, 3, 4, 5]

The result of remove_coef changes with the random state.

Features 0 and 6 are grouped because their correlation (0.554) exceeds the threshold of 0.5.
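
The grouping and filtering steps in this example can be mimicked with a small pure-Python sketch. It is an illustration only: group_by_threshold and keep_first are hypothetical helpers, the comparison |corr| >= threshold is an assumption, and the filter here always keeps the first member of a group, whereas remove_coef depends on the random state.

```python
from typing import List


def group_by_threshold(cof: List[List[float]], threshold: float) -> List[List[int]]:
    """For each column, list the columns whose |correlation| reaches threshold."""
    n = len(cof)
    return [
        [j for j in range(n) if abs(cof[i][j]) >= threshold]
        for i in range(n)
    ]


def keep_first(groups: List[List[int]]) -> List[int]:
    """Keep a column unless a correlated column has already been kept."""
    kept: List[int] = []
    for i, group in enumerate(groups):
        if not any(j in kept for j in group if j != i):
            kept.append(i)
    return kept


# Toy 7x7 correlation matrix: identity plus one correlated pair (0, 6).
n = 7
cof = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
cof[0][6] = cof[6][0] = 0.554

groups = group_by_threshold(cof, threshold=0.5)
kept = keep_first(groups)
print(groups)  # [[0, 6], [1], [2], [3], [4], [5], [0, 6]]
print(kept)    # [0, 1, 2, 3, 4, 5]
```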

3. Used for binding correlation

Examples

>>> from sklearn.datasets import fetch_california_housing
>>> from featurebox.selection.corr import Corr
>>> x, y = fetch_california_housing(return_X_y=True)
>>> x = x[:100]
>>> y = y[:100]
>>> co = Corr(threshold=0.3,multi_index=[0,8],multi_grade=2)
>>> # in the range [0, 8), the features are bound into groups of size 2: [[0,1],[2,3],[4,5],[6,7]]
>>> co.fit(x)
Corr(multi_index=(0, 8), threshold=0.3)

Parameters:
  • threshold (float) – The correlation threshold.

  • multi_grade – The binding group size; the correlation is calculated between binding groups.

  • multi_index (list) – The range of multi_grade: [min,max).

  • must_index (list) – The indices of the columns that must be selected.

  • random_state (int) – The random state; the automatic filtering result changes with it.

count_cof(cof=None)

Check cof and count the number.

static cov_y(x_, y_)
filter()
fit(data, pre_cal=None, method='mean')
remove_by_y(y_)
remove_coef(cof_list_all)

Delete the indices of features with repeated correlation coefficients.

featurebox.selection.exhaustion module

class featurebox.selection.exhaustion.Exhaustion(estimator: BaseEstimator, n_select: Tuple = (2, 3, 4), multi_grade: Optional[int] = None, multi_index: Optional[List] = None, must_index: Optional[List] = None, n_jobs: int = 1, refit: bool = False, cv: int = 5, scoring: Optional[str] = None, note=True, filter_warn=False)

Bases: BaseEstimator, MetaEstimatorMixin, SelectorMixin, MultiBase

Exhaustive search over feature combinations.

n_feature_

The final number of selected features.

Type:

int

support_

The final mask of selected features.

Type:

array of shape [n_feature]

estimator_

The best model fitted on the best features (refitted with all data).

Type:

object

best_score_

Best score of best model of best features.

Type:

float

Examples

>>> import numpy as np
>>> from sklearn.datasets import fetch_california_housing
>>> from sklearn.model_selection import cross_val_predict, cross_val_score
>>> from sklearn.svm import SVR
>>> from featurebox.selection.exhaustion import Exhaustion
>>> X, y = fetch_california_housing(return_X_y=True)
>>> X = X[:100]
>>> y = y[:100]
>>> X_train, y_train, X_test, y_test = X[:50], y[:50], X[-50:], y[-50:]
>>> svr = SVR()
>>> bf = Exhaustion(svr, n_select=(2,), refit=True, note=False)
>>> new_x = bf.fit_transform(X, y)
>>> bf.support_
array([False, False, False,  True, False,  True, False, False])
>>> train_score = bf.score(X_train, y_train)  # train score
>>> test_score = bf.score(X_test, y_test)  # test score in more data.
>>> np.mean(cross_val_score(bf.estimator_, X_train[:, bf.support_], y_train, cv=5))  # recompute the CV score manually
-2.888471220974372
>>> np.mean(cross_val_predict(bf.estimator_, X_train[:, bf.support_], y_train, cv=5))  # CV predictions, e.g. for plotting
1.6001222987265382

Examples

>>> from sklearn.datasets import fetch_california_housing
>>> from sklearn.svm import SVR
>>> from sklearn import model_selection
>>> from featurebox.selection.exhaustion import Exhaustion
>>> X, y = fetch_california_housing(return_X_y=True)
>>> X = X[:100]
>>> y = y[:100]
>>> svr = SVR()
>>> gd = model_selection.GridSearchCV(svr, param_grid=[{"C": [1, 10]}], n_jobs=1, cv=3)
>>> bf = Exhaustion(gd, n_select=(2,), refit=True, note=False, cv=5)
Uniform parameter in SearchCV and Exhaustion:
(scoring=None, cv=5, refit=True)
>>> new_x = bf.fit_transform(X, y)
>>> bf.support_
array([False, False, False,  True, False,  True, False, False])
>>> bf.best_score_
-0.7336740728050252

Parameters:
  • estimator – A scikit-learn model or a GridSearchCV object.

  • n_select (tuple) – The numbers of features to try; defaults to n_select=(2, 3, 4).

  • multi_grade (int) – The binding group size; the correlation is calculated between binding groups.

  • multi_index (list) – The range of multi_grade: [min,max).

  • must_index (list) – The indices of the columns that must be selected.

  • n_jobs (int) – The number of parallel jobs.

  • refit (bool) – Whether to refit. If True, the model is refitted with all the data.

  • cv (int) – The cross-validation setting, used if the estimator is a scikit-learn model; otherwise it is passed over.

  • scoring (None,str) – The name of the scoring method.

  • note (bool) – Whether to print the note.

  • filter_warn (bool) – Whether to filter warnings with warnings.filterwarnings.
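
What an exhaustive search over n_select amounts to can be sketched with itertools. This is a hedged illustration rather than the featurebox code: exhaustive_search and toy_score are hypothetical, and in practice the score of each combination would be a cross-validated score of the estimator on those columns.

```python
from itertools import combinations
from typing import Callable, List, Sequence, Tuple


def exhaustive_search(
    n_features: int,
    n_select: Sequence[int],
    score: Callable[[Tuple[int, ...]], float],
) -> Tuple[List[bool], float, int]:
    """Score every combination of each size in n_select and keep the best."""
    best_combo: Tuple[int, ...] = ()
    best_score = float("-inf")
    n_tried = 0
    for k in n_select:
        for combo in combinations(range(n_features), k):
            n_tried += 1
            s = score(combo)
            if s > best_score:
                best_combo, best_score = combo, s
    # Boolean mask over all features, in the style of support_.
    support = [i in best_combo for i in range(n_features)]
    return support, best_score, n_tried


def toy_score(combo: Tuple[int, ...]) -> float:
    # Toy score: features 3 and 5 are useful, every feature costs 0.1.
    return (3 in combo) + (5 in combo) - 0.1 * len(combo)


support, best, n_tried = exhaustive_search(8, (2, 3, 4), toy_score)
print(n_tried)  # 28 + 56 + 70 = 154 combinations
```

The count grows combinatorially with the number of features, which is why n_select is kept to a few small sizes.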

fit(X, y)

Fit the baf model and then the underlying estimator on the selected features.

Parameters:
  • X ({array-like, sparse matrix}, shape = [n_samples, n_feature]) – The training input samples.

  • y (array-like, shape = [n_samples]) – The target values.

predict(X)

Reduce X to the selected features and then predict using the underlying estimator. Only available when refit=True.

Parameters:

X (array of shape [n_samples, n_feature]) – The input samples.

Returns:

y – The predicted target values.

Return type:

array of shape [n_samples]

score(X, y, scoring=None)

Reduce X to the selected features and then return the score of the underlying estimator. Only available when refit=True.

Parameters:
  • X (array of shape [n_samples, n_feature]) – The input samples.

  • y (array of shape [n_samples]) – The target values.

  • scoring (str, callable, default=None) –

    Strategy to evaluate the performance of the cross-validated model on the test set.

    A single score can be specified with a single string (see scoring_parameter).

    If scoring is provided, the score it defines is returned; otherwise the estimator_.score method is used, and an error is raised if neither is available.

featurebox.selection.exhaustion.ExhaustionCV

alias of Exhaustion

featurebox.selection.ga module

class featurebox.selection.ga.GA(estimator, n_jobs=2, pop_n=1000, hof_n=1, cxpb=0.6, mutpb=0.3, ngen=40, max_or_min='max', mut_indpb=0.05, max_=None, min_=2, random_state=None, multi_grade=2, multi_index=None, must_index=None, cv: int = 5, scoring=None, filter_warn=False)

Bases: BaseEstimator, MetaEstimatorMixin, SelectorMixin, MultiBase

Genetic algorithm (GA) with binding. Please pass only the training data.

Examples

>>> from sklearn.datasets import fetch_california_housing
>>> from sklearn.svm import SVR
>>> from featurebox.selection.ga import GA
>>> data = fetch_california_housing()
>>> X = data.data
>>> y = data.target
>>> X_train,y_train,X_test,y_test = X[:50],y[:50],X[-50:],y[-50:]
>>> svr = SVR(gamma="scale", C=100)
>>> ga = GA(estimator=svr, n_jobs=2, pop_n=50, hof_n=1, cxpb=0.8, mutpb=0.4, ngen=3,
... max_or_min="max", mut_indpb=0.1, min_=2, multi_index=[0, 5],random_state=0)
>>> ga.fit(X_train, y_train)
gen nevals  min     max
1   50      -4.9231 -1.09124
2   43      -3.83152  -1.09124
3   46      -4.9231   -1.09124
[1, 1, 1, 1, 0, 0, 1, 0] (-1.039237326973499,)
GA(cxpb=0.8, estimator=SVR(C=100), multi_index=(0, 5), mut_indpb=0.1, mutpb=0.4,
   ngen=3, pop_n=50, random_state=0)
>>> ga.score(X_test, y_test)
-28.542309712899435

Parameters:
  • estimator – A scikit-learn estimator.

  • n_jobs (int) – The number of parallel jobs.

  • pop_n (int) – The population size.

  • hof_n (int) – The hall-of-fame size.

  • cxpb (float) – The probability of crossover.

  • mutpb (float) – The probability of mutation.

  • ngen (int) – The number of generations.

  • max_or_min (str) – "max" or "min"; whether the problem is a maximization or a minimization.

  • mut_indpb (float) – The probability of mutating each node.

  • max_ (int) – The maximum size.

  • min_ (int) – The minimum size.

  • random_state (int) – The random state.

  • multi_grade – The binding grade.

  • multi_index – The binding range [min,max].

  • scoring (None,str) – The name of the scoring method.

  • cv (int) – The cross-validation setting, used if the estimator is a scikit-learn model; otherwise it is passed over.

  • filter_warn (bool) – Whether to filter warnings with warnings.filterwarnings.
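
The role of min_ and max_ can be illustrated with a sketch that draws size-constrained 0/1 individuals. The helpers random_mask and size_ok are hypothetical and are not the featurebox generate_min_max or filt functions; treating max_ as an inclusive bound is an assumption.

```python
import random
from typing import List, Optional


def random_mask(space: int, min_: int = 2, max_: Optional[int] = None,
                rng: Optional[random.Random] = None) -> List[int]:
    """Draw a 0/1 mask whose number of ones lies in [min_, max_]."""
    rng = rng or random.Random()
    upper = space if max_ is None else min(max_, space)
    k = rng.randint(min_, upper)  # number of selected features
    chosen = set(rng.sample(range(space), k))
    return [1 if i in chosen else 0 for i in range(space)]


def size_ok(ind: List[int], min_: int = 2, max_: Optional[int] = None) -> bool:
    """Check that the number of selected features lies in [min_, max_]."""
    upper = len(ind) if max_ is None else max_
    return min_ <= sum(ind) <= upper


rng = random.Random(0)
masks = [random_mask(8, min_=2, max_=4, rng=rng) for _ in range(5)]
```

Each mask has one entry per feature (or per binding group), with 1 meaning selected, like the individual printed in the example above.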

feature_fold_length(feature)
fit(X, y)

Fit data and run GA.

fitness_func(ind, model, x, y, return_model=False)
static generate_min_max(space, min_=2, max_=None)
predict(X)

Reduce X to the selected features and then predict using the underlying estimator.

Parameters:

X (array of shape [n_samples, n_feature]) – The input samples.

predict_func(ind, model, x)
score(X, y)

Reduce X to the selected features and then return the score of the underlying estimator.

Parameters:
  • X (array of shape [n_samples, n_feature]) – The input samples.

  • y (array of shape [n_samples]) – The target values.

score_cv(X, y)

Reduce X to the selected features and then return the cross-validation score of the underlying estimator.

Parameters:
  • X (array of shape [n_samples, n_feature]) – The input samples.

  • y (array of shape [n_samples]) – The target values.

socre_func(ind, model, x, y, scoring=None)
unfold(ind)
featurebox.selection.ga.eaSimple(population, toolbox, cxpb, mutpb, ngen, stats=None, n_jobs=2, halloffame=None, verbose=True)

This algorithm reproduces the simplest evolutionary algorithm.

Parameters:
  • population – A list of individuals.

  • n_jobs – The number of parallel jobs.

  • toolbox – A Toolbox that contains the evolution operators.

  • cxpb – The probability of mating two individuals.

  • mutpb – The probability of mutating an individual.

  • ngen – The number of generations.

  • stats – A Statistics object that is updated inplace, optional.

  • halloffame – A HallOfFame object that will contain the best individuals, optional.

  • verbose – Whether to log the statistics.

Returns:

The final population, and a deap.tools.Logbook with the statistics of the evolution.
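
The shape of such a simple evolutionary loop (select, mate with probability cxpb, mutate with probability mutpb, evaluate) can be sketched without DEAP. This is an assumption-laden illustration, not the featurebox or DEAP code: simple_ea is a hypothetical function, and the size-2 tournament selection, one-point crossover and bit-flip mutation are choices made for the sketch, since the real operators come from the toolbox.

```python
import random
from typing import Callable, List, Tuple

Individual = List[int]


def simple_ea(
    population: List[Individual],
    fitness: Callable[[Individual], float],
    cxpb: float,
    mutpb: float,
    ngen: int,
    rng: random.Random,
    mut_indpb: float = 0.05,
) -> Tuple[List[Individual], List[Tuple[int, float, float]]]:
    """Run ngen generations; return the final population and a (gen, min, max) log."""
    log = []
    for gen in range(1, ngen + 1):
        # Selection: size-2 tournaments, one copy of the winner per slot.
        offspring = [max(rng.sample(population, 2), key=fitness)[:]
                     for _ in population]
        # Mating: one-point crossover on consecutive pairs with probability cxpb.
        for i in range(1, len(offspring), 2):
            if rng.random() < cxpb:
                cut = rng.randint(1, len(offspring[i]) - 1)
                a, b = offspring[i - 1], offspring[i]
                a[cut:], b[cut:] = b[cut:], a[cut:]
        # Mutation: with probability mutpb, flip each bit with probability mut_indpb.
        for ind in offspring:
            if rng.random() < mutpb:
                for k in range(len(ind)):
                    if rng.random() < mut_indpb:
                        ind[k] = 1 - ind[k]
        population = offspring
        scores = [fitness(ind) for ind in population]
        log.append((gen, min(scores), max(scores)))
    return population, log


rng = random.Random(0)
pop = [[rng.randint(0, 1) for _ in range(8)] for _ in range(20)]
final_pop, log = simple_ea(pop, fitness=sum, cxpb=0.6, mutpb=0.3, ngen=5, rng=rng)
```

The log plays the role of the gen/min/max table printed by GA.fit, with the fitness here being a toy count of selected bits.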

featurebox.selection.ga.filt(ind, min_=2, max_=None)
featurebox.selection.ga.generate(space)
featurebox.selection.ga.generate_xi()

featurebox.selection.multibase module

class featurebox.selection.multibase.MultiBase(multi_grade: int = 2, multi_index: Optional[Union[List, Tuple]] = None, must_index: Optional[Union[List, Tuple]] = None)

Bases: object

Base class for binding.

Parameters:
  • multi_grade (int) – The binding group size; the correlation is calculated between binding groups.

  • multi_index (list,tuple,None) – The range of multi_grade: [min,max).

  • must_index (list,tuple,None) – The indices of the columns that must be selected.
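
How multi_grade and multi_index could partition the columns into binding groups is sketched below. The helpers binding_groups and unfold_mask are hypothetical and are not the MultiBase methods; requiring the range length to be a multiple of multi_grade is an assumption of the sketch.

```python
from typing import List, Optional, Sequence, Tuple


def binding_groups(
    n_features: int,
    multi_grade: int = 2,
    multi_index: Optional[Tuple[int, int]] = None,
) -> List[List[int]]:
    """Bind the columns in [start, stop) into chunks of size multi_grade."""
    if multi_index is None:
        return [[i] for i in range(n_features)]
    start, stop = multi_index
    if (stop - start) % multi_grade:
        raise ValueError("range length must be a multiple of multi_grade")
    head = [[i] for i in range(start)]
    bound = [list(range(i, i + multi_grade))
             for i in range(start, stop, multi_grade)]
    tail = [[i] for i in range(stop, n_features)]
    return head + bound + tail


def unfold_mask(folded: Sequence[bool], groups: List[List[int]]) -> List[bool]:
    """Expand a per-group mask to a per-column mask."""
    mask = [False] * sum(len(g) for g in groups)
    for keep, group in zip(folded, groups):
        for i in group:
            mask[i] = keep
    return mask


groups = binding_groups(8, multi_grade=2, multi_index=(0, 8))
print(groups)  # [[0, 1], [2, 3], [4, 5], [6, 7]]
column_mask = unfold_mask([True, False, False, True], groups)
```

Selecting a group then selects all of its columns together, so bound features are kept or dropped as a unit.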

property check_multi
property check_must
feature_fold(feature)
feature_unfold(feature)
inverse_transform_index(index)

Map the selected index back to the original index using support.

property must_fold_add
property must_unfold_add
transform(data: Any)
transform_index(index)

Get support index.
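
The relation between a boolean support mask and column indices can be sketched as follows. The functions support_to_index and inverse_index are hypothetical stand-ins, not the MultiBase methods, and the sketch ignores binding groups.

```python
from typing import List, Sequence


def support_to_index(support: Sequence[bool]) -> List[int]:
    """Original column indices of the selected features."""
    return [i for i, keep in enumerate(support) if keep]


def inverse_index(positions: Sequence[int], support: Sequence[bool]) -> List[int]:
    """Map positions in the reduced data back to original column indices."""
    original = support_to_index(support)
    return [original[p] for p in positions]


support = [False, True, True, False, False, False, False, True]
selected = support_to_index(support)          # [1, 2, 7]
back = inverse_index([0, 2], support)         # [1, 7]
```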