featurebox.selection package¶

Submodules¶

featurebox.selection.backforward module¶

Forward-and-backward feature elimination for feature ranking.

class featurebox.selection.backforward.BackForward(estimator: BaseEstimator, n_type_feature_to_select: Optional[int] = None, primary_feature: Optional[int] = None, multi_grade: int = 2, multi_index: Optional[List] = None, refit=True, cv=5, min_type_feature_to_select: int = 3, must_index: Optional[List] = None, tolerant: float = 0.01, verbose: int = 1, random_state: Optional[int] = None, scoring: Optional[str] = None, note: bool = True, filter_warn: bool = False)¶

Bases: BaseEstimator, MetaEstimatorMixin, SelectorMixin, MultiBase

BackForward method for selecting features.

n_feature_¶

The final number of selected features.

Type:: int

support_¶

The final mask of selected features.

Type:: array of shape [n_feature]

estimator_¶

The best model fitted on the best features (refitted on all data).

Type:: object

best_score_¶

The best score, obtained by the best model on the best features.

Type:: float
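As a rough illustration of how support_ and n_feature_ relate, the sketch below applies a boolean mask to a small table in plain Python. The names support, rows, and reduce_columns are made up for this example and are not part of featurebox.

```python
# Hypothetical stand-in for a fitted selector's support_ attribute.
support = [False, True, True, False, False, False, False, True]

# Two samples with eight features each (toy values).
rows = [
    [0, 1, 2, 3, 4, 5, 6, 7],
    [10, 11, 12, 13, 14, 15, 16, 17],
]


def reduce_columns(rows, support):
    """Keep only the columns whose mask entry is True."""
    return [[v for v, keep in zip(row, support) if keep] for row in rows]


reduced = reduce_columns(rows, support)
n_feature = sum(support)  # number of selected features

print(reduced)    # [[1, 2, 7], [11, 12, 17]]
print(n_feature)  # 3
```

With NumPy arrays the same reduction is usually written as X[:, support].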

Examples

>>> from sklearn.datasets import fetch_california_housing
>>> from sklearn.svm import SVR
>>> from featurebox.selection.backforward import BackForward
>>> X,y = fetch_california_housing(return_X_y=True)
>>> X = X[:100]
>>> y = y[:100]
>>> svr= SVR()
>>> bf = BackForward(svr,primary_feature=4, random_state=1,verbose=0,note=False)
>>> new_x = bf.fit_transform(X,y)
>>> bf.support_
array([False,  True,  True, False, False, False, False,  True])

Examples

>>> from sklearn.datasets import fetch_california_housing
>>> from sklearn.svm import SVR
>>> from sklearn.model_selection import cross_val_score
>>> import numpy as np
>>> from featurebox.selection.backforward import BackForward
>>> X,y = fetch_california_housing(return_X_y=True)
>>> X = X[:100]
>>> y = y[:100]
>>> X_train,y_train,X_test,y_test = X[:50],y[:50],X[-50:],y[-50:]

>>> svr= SVR()
>>> bf = BackForward(svr, primary_feature=4, random_state=1, refit=True, cv=5,verbose=0,note=False)
>>> bf = bf.fit(X_train,y_train)
>>> bf.best_score_         # cv score
-3.0552830696940037
>>> train_score = bf.score(X_train,y_train)  # train score
>>> test_score = bf.score(X_test,y_test) # test score on held-out data.
>>> np.mean(cross_val_score(bf.estimator_,X_train[:,bf.support_],y_train,cv=5)) # get cv_score manually.
-3.0552830696940037

Note: score and predict require refit=True. The refit uses all the data passed to fit, so a score or prediction computed on that same data is not a test score or test prediction.

Examples

If the estimator is a GridSearchCV, refit should be set to True; best_score_ is then the cv score.

>>> from sklearn.datasets import fetch_california_housing
>>> from sklearn.svm import SVR
>>> from sklearn import model_selection
>>> from featurebox.selection.backforward import BackForward
>>> X,y = fetch_california_housing(return_X_y=True)
>>> X = X[:100]
>>> y = y[:100]
>>> X_train,y_train,X_test,y_test = X[:50],y[:50],X[-50:],y[-50:]

>>> svr= SVR()
>>> gd = model_selection.GridSearchCV(svr,param_grid={"C":[1,10]},n_jobs=1)  # keep n_jobs=1 here.
>>> bf = BackForward(gd,primary_feature=4, random_state=1, refit=True,
... scoring="neg_root_mean_squared_error",cv=5,verbose=0, note=False)
Uniform parameter in SearchCV and Exhaustion:
(scoring=neg_root_mean_squared_error, cv=5, refit=True)
>>> bf = bf.fit(X_train,y_train)
>>> bf.best_score_         # cv score
-0.5919173121895709

>>> train_score = bf.score(X_train,y_train)  # train score
>>> test_score = bf.score(X_test,y_test) # test score on held-out data.
>>> # bf.estimator_ is the gd object (GridSearchCV)
>>> bf.estimator_.best_score_ # recover the cv score manually.
-0.5919173121895709

Parameters:

estimator (estimator object) – A supervised scikit-learn estimator with a fit method. It is assumed to implement the scikit-learn estimator interface.
n_type_feature_to_select (int) – The maximum number of features to select. If None, the features with the best score are selected.
min_type_feature_to_select (int) – The minimum number of features to select.
primary_feature (int) – The number of primary features used to start the loop; defaults to n_features // 2.
multi_grade (int) – Group size for feature binding.
multi_index – Index range of the bound groups.
must_index – Indices of features that must be selected.
tolerant – Tolerance for rank comparison.
verbose (int) – Whether to print progress.
random_state (int) – Random state (seed).
refit (bool) – Whether to refit. If True, the model is refitted on all the data passed to fit.
scoring (None,str) – Scoring method name.
note (bool) – Whether to print the note.
filter_warn (bool) – Whether to call warnings.filterwarnings.
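The parameters above suggest a greedy search that adds and removes features while comparing scores within a tolerance. The following is a generic sketch of that idea, not featurebox's implementation; forward_backward, toy_score, and the acceptance rule (improve by more than tol) are assumptions made for illustration.

```python
def forward_backward(n_features, score, start=(), must=(), tol=0.01):
    """Greedy forward additions followed by backward removals.

    `score` maps a tuple of feature indices to a number (higher is better).
    A move is accepted only if it improves the score by more than `tol`.
    """
    selected = sorted(set(start) | set(must))
    best = score(tuple(selected))
    improved = True
    while improved:
        improved = False
        # Forward step: try adding each unused feature.
        for f in range(n_features):
            if f in selected:
                continue
            trial = sorted(selected + [f])
            s = score(tuple(trial))
            if s > best + tol:
                selected, best, improved = trial, s, True
        # Backward step: try dropping each feature that is not mandatory.
        for f in list(selected):
            if f in must or len(selected) <= 1:
                continue
            trial = [g for g in selected if g != f]
            s = score(tuple(trial))
            if s > best + tol:
                selected, best, improved = trial, s, True
    return selected, best


def toy_score(subset):
    """Toy score: features 1 and 3 help, every feature costs a little."""
    useful = {1: 1.0, 3: 0.5}
    return sum(useful.get(f, 0.0) for f in subset) - 0.1 * len(subset)


selected, best = forward_backward(5, toy_score, start=(0,))
print(selected)  # [1, 3]
```

In practice the score would be a cross-validated estimator score; the toy function only keeps the sketch self-contained.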

fit(X, y)¶

Fit the BackForward model and then the underlying estimator on the selected features.

Parameters:

X ({array-like, sparse matrix}, shape = [n_samples, n_feature]) – The training input samples.
y (array-like, shape = [n_samples]) – The target values.

predict(X)¶

Reduce X to the selected features and then predict using the underlying estimator. Only available when refit=True.

Parameters:: X (array of shape [n_samples, n_feature]) – The input samples.
Returns:: y – The predicted target values.
Return type:: array of shape [n_samples]

score(X, y, scoring=None)¶

Reduce X to the selected features and then return the score of the underlying estimator. Only available when refit=True.

Parameters:

X (array of shape [n_samples, n_feature]) – The input samples.
y (array of shape [n_samples]) – The target values.
scoring (str, callable, default=None) –
Strategy to evaluate the performance of the cross-validated model on the test set.

If scoring represents a single score, one can use a single string (see scoring_parameter).

Returns the score defined by scoring if provided, otherwise the result of the estimator_.score method; an error is raised if neither is available.
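The fallback order described for scoring can be sketched as below. This is an assumption-laden illustration in plain Python: resolve_score, MeanModel, neg_max_error, and the scorers registry are hypothetical names, and the scorer signature (estimator, X, y) follows the scikit-learn scorer convention.

```python
def resolve_score(estimator, X, y, scoring=None, scorers=None):
    """Return a score using `scoring` if given, else `estimator.score`."""
    scorers = scorers or {}
    if scoring is not None:
        if callable(scoring):
            return scoring(estimator, X, y)
        if scoring in scorers:
            return scorers[scoring](estimator, X, y)
        raise ValueError(f"unknown scoring name: {scoring!r}")
    if hasattr(estimator, "score"):
        return estimator.score(X, y)
    raise TypeError("no scoring given and estimator has no score method")


class MeanModel:
    """Toy estimator that always predicts a constant."""

    def __init__(self, value):
        self.value = value

    def predict(self, X):
        return [self.value for _ in X]

    def score(self, X, y):
        # Negative mean absolute error, so that higher is better.
        pred = self.predict(X)
        return -sum(abs(p - t) for p, t in zip(pred, y)) / len(y)


def neg_max_error(estimator, X, y):
    """Hypothetical scorer: negative of the largest absolute error."""
    return -max(abs(p - t) for p, t in zip(estimator.predict(X), y))


model = MeanModel(2.0)
X, y = [[0], [1], [2]], [1.0, 2.0, 4.0]
default_score = resolve_score(model, X, y)
named_score = resolve_score(
    model, X, y, scoring="neg_max_error",
    scorers={"neg_max_error": neg_max_error},
)
print(default_score, named_score)  # -1.0 -2.0
```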

class featurebox.selection.backforward.BackForwardStable(estimator: BaseEstimator, n_type_feature_to_select: Optional[int] = None, min_type_feature_to_select: int = 3, primary_feature: Optional[int] = None, multi_grade: int = 2, multi_index: Optional[List] = None, must_index: Optional[List] = None, verbose: int = 0, random_state: Optional[int] = None, tolerant: float = 0.001, cv: int = 5, times: int = 5, scoring: Optional[str] = None, n_jobs: Optional[int] = None, refit=False, note=True)¶

Bases: MetaEstimatorMixin, SelectorMixin, BaseEstimator

BackForwardStable. Runs with different orders for more stable results (just for testing).

n_feature_¶

The number of selected features with cross-validation.

Type:: int

support_¶

The mask of selected features.

Type:: array of shape [n_feature]

estimator_¶

The model fitted on the best features (refitted on all data).

Type:: object

best_score_¶

The best score, obtained by the best model on the best features.

Type:: float

Examples

>>> from sklearn.datasets import fetch_california_housing
>>> from sklearn.svm import SVR
>>> from featurebox.selection.backforward import BackForwardStable
>>> X,y = fetch_california_housing(return_X_y=True)
>>> X = X[:100]
>>> y = y[:100]
>>> svr= SVR()
>>> bf = BackForwardStable(svr,primary_feature=3, random_state=1,verbose=0,note=False)
>>> new_x = bf.fit_transform(X,y)
>>> bf.support_
array([False,  True, False, False, False,  True,  True, False])
>>> bf.best_score_
-0.09122826477472024

If score and predict are used, set refit=True and make sure the data is split beforehand, because the refit uses all the data passed to fit().

>>> from sklearn.datasets import fetch_california_housing
>>> from sklearn.svm import SVR
>>> from featurebox.selection.backforward import BackForwardStable
>>> X,y = fetch_california_housing(return_X_y=True)
>>> X = X[:100]
>>> y = y[:100]
>>> svr= SVR()
>>> bf = BackForwardStable(svr,primary_feature=4, random_state=1, refit=True,verbose=0,note=False)
>>> new_x = bf.fit_transform(X[:50],y[:50])
>>> test_score = bf.score(X[50:],y[50:])  # scored on data not used in fit
>>> cv_score = bf.best_score_
...

If the estimator is a GridSearchCV, refit can be set to True; best_score_ is then the cv score.

>>> from sklearn.datasets import fetch_california_housing
>>> from sklearn.svm import SVR
>>> from sklearn import model_selection
>>> from featurebox.selection.backforward import BackForward
>>> X,y = fetch_california_housing(return_X_y=True)
>>> X = X[:100]
>>> y = y[:100]
>>> svr= SVR()
>>> gd = model_selection.GridSearchCV(svr,param_grid={"C":[1,10]})
>>> bf = BackForward(gd,primary_feature=4, random_state=1, refit=True, cv=5,verbose=0,note=False)
Uniform parameter in SearchCV and Exhaustion:
(scoring=None, cv=5, refit=True)
>>> new_x = bf.fit_transform(X,y)
...

Parameters:

estimator (estimator object) – A supervised scikit-learn estimator with a fit method. It is assumed to implement the scikit-learn estimator interface.
n_type_feature_to_select (int) – The maximum number of features to select. If None, the features with the best score are selected.
min_type_feature_to_select (int) – The minimum number of features to select.
primary_feature (int) – The number of primary features used to start the loop; defaults to n_features // 2.
multi_grade (int) – Group size for feature binding.
multi_index – Index range of the bound groups.
must_index – Indices of features that must be selected.
tolerant – Tolerance for rank comparison.
verbose (int) – Whether to print progress.
random_state (int) – Random state (seed).
refit (bool) – Whether to refit. If True, the model is refitted on all the data passed to fit.
n_jobs (int or None) – Number of cores to run in parallel while fitting across folds. None means 1 and -1 means using all processors.
scoring (None,str) – Scoring method name.
note (bool) – Whether to print the note.

fit(X, y, groups=None)¶

Fit the BackForwardStable model and automatically tune the number of selected features.

Parameters:

X ({array-like, sparse matrix}, shape = [n_samples, n_feature]) – Training vector, where n_samples is the number of samples and n_feature is the total number of features.
y (array-like, shape = [n_samples]) – Target values (integers for classification, real numbers for regression).
groups (array-like, shape = [n_samples], optional) – Group labels for the samples used while splitting the dataset into train/test set.
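To make the role of groups concrete, here is a minimal plain-Python sketch of a group-aware split, in the spirit of scikit-learn's GroupKFold. It is not taken from featurebox; group_k_fold and its round-robin assignment of groups to folds are assumptions of this example.

```python
def group_k_fold(groups, n_splits):
    """Yield (train_idx, test_idx) so no group appears on both sides."""
    unique = sorted(set(groups))
    if n_splits > len(unique):
        raise ValueError("n_splits cannot exceed the number of groups")
    for k in range(n_splits):
        # Every n_splits-th group goes to the test side of fold k.
        test_groups = set(unique[k::n_splits])
        test = [i for i, g in enumerate(groups) if g in test_groups]
        train = [i for i, g in enumerate(groups) if g not in test_groups]
        yield train, test


groups = ["a", "a", "b", "b", "c", "c"]
folds = list(group_k_fold(groups, n_splits=3))
print(folds[0])  # ([2, 3, 4, 5], [0, 1])
```

Samples that share a label always land on the same side of a split, which avoids leakage between related samples.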

predict(X)¶

Reduce X to the selected features and then predict using the underlying estimator. Only available when refit=True.

Parameters:: X (array of shape [n_samples, n_feature]) – The input samples.
Returns:: y – The predicted target values.
Return type:: array of shape [n_samples]

score(X, y, scoring=None)¶

Reduce X to the selected features and then return the score of the underlying estimator. Only available when refit=True.

Parameters:

X (array of shape [n_samples, n_feature]) – The input samples.
y (array of shape [n_samples]) – The target values.

featurebox.selection.corr module¶

Calculate the correlation of columns.

class featurebox.selection.corr.Corr(threshold: float = 0.85, multi_grade: int = 2, multi_index: Optional[List] = None, must_index: Optional[List] = None, random_state: int = 0)¶

Bases: BaseEstimator, MetaEstimatorMixin, SelectorMixin, MultiBase

Calculate correlation. (The results change with the random state.)

1. Used to filter features automatically

Examples

>>> from sklearn.datasets import fetch_california_housing
>>> from featurebox.selection.corr import Corr
>>> x, y = fetch_california_housing(return_X_y=True)
>>> x = x[:100]
>>> y = y[:100]
>>> co = Corr(threshold=0.5)
>>> new_x = co.fit_transform(x)
>>> select_feature = co.support_

2. Used to get the groups exceeding the threshold, step by step

Examples

>>> from sklearn.datasets import fetch_california_housing
>>> from featurebox.selection.corr import Corr
>>> import numpy as np
>>> x, y = fetch_california_housing(return_X_y=True)
>>> x = x[:100]
>>> y = y[:100]
>>> co = Corr(threshold=0.5)
>>> groups = co.count_cof(np.corrcoef(x[:,:7], rowvar=False))
>>> groups[1]
[[0, 6], [1], [2], [3], [4], [5], [0, 6]]
>>> groups[0]
[[1.0, 0.554], [1.0], [1.0], [1.0], [1.0], [1.0], [0.554, 1.0]]
>>> co.remove_coef(groups[1]) # Filter automatically by machine.
[0, 1, 2, 3, 4, 5]

The result of remove_coef changes with the random state.

Here features 0 and 6 have a correlation of 0.554, which is above the threshold of 0.5.
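The grouping step can be pictured with a small dependency-free sketch: for every column, list the columns whose absolute Pearson correlation reaches the threshold. The helpers pearson and correlated_groups are hypothetical and only mimic the shape of the index groups shown above; they are not featurebox code.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / sqrt(va * vb)


def correlated_groups(columns, threshold):
    """For each column, list the columns whose |correlation| >= threshold."""
    n = len(columns)
    return [
        [j for j in range(n)
         if abs(pearson(columns[i], columns[j])) >= threshold]
        for i in range(n)
    ]


columns = [
    [1.0, 2.0, 3.0, 4.0],    # feature 0
    [2.0, 4.0, 6.0, 8.0],    # feature 1, a multiple of feature 0
    [1.0, -1.0, -1.0, 1.0],  # feature 2, uncorrelated with feature 0
]
groups = correlated_groups(columns, threshold=0.5)
print(groups)  # [[0, 1], [0, 1], [2]]
```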

3. Used for binding correlation

Examples

>>> from sklearn.datasets import fetch_california_housing
>>> from featurebox.selection.corr import Corr
>>> x, y = fetch_california_housing(return_X_y=True)
>>> x = x[:100]
>>> y = y[:100]
>>> co = Corr(threshold=0.3,multi_index=[0,8],multi_grade=2)
>>> # in the range [0,8), the features are bound into groups of size 2: [[0,1],[2,3],[4,5],[6,7]]
>>> co.fit(x)
Corr(multi_index=(0, 8), threshold=0.3)

Parameters:

threshold (float) – Ranking threshold.
multi_grade – Binding group size; the correlation is calculated between bound groups.
multi_index (list) – The range of multi_grade: [min,max).
must_index (list) – The columns that are forced to be included.
random_state (int) – Random state (seed).

count_cof(cof=None)¶: Check cof and count the number.

static cov_y(x_, y_)¶

filter()¶

fit(data, pre_cal=None, method='mean')¶

remove_by_y(y_)¶

remove_coef(cof_list_all)¶: Delete the indices of features with repeated coefficients.

featurebox.selection.exhaustion module¶

class featurebox.selection.exhaustion.Exhaustion(estimator: BaseEstimator, n_select: Tuple = (2, 3, 4), multi_grade: Optional[int] = None, multi_index: Optional[List] = None, must_index: Optional[List] = None, n_jobs: int = 1, refit: bool = False, cv: int = 5, scoring: Optional[str] = None, note=True, filter_warn=False)¶

Bases: BaseEstimator, MetaEstimatorMixin, SelectorMixin, MultiBase

Exhaustive search over feature combinations.

n_feature_¶

The final number of selected features.

Type:: int

support_¶

The final mask of selected features.

Type:: array of shape [n_feature]

estimator_¶

The best model fitted on the best features (refitted on all data).

Type:: object

best_score_¶

The best score, obtained by the best model on the best features.

Type:: float

Examples

>>> from sklearn.datasets import fetch_california_housing
>>> import numpy as np
>>> from sklearn.model_selection import cross_val_predict, cross_val_score
>>> from sklearn.svm import SVR
>>> from featurebox.selection.exhaustion import Exhaustion
>>> X,y = fetch_california_housing(return_X_y=True)
>>> X = X[:100]
>>> y = y[:100]
>>> X_train,y_train,X_test,y_test = X[:50],y[:50],X[-50:],y[-50:]

>>> svr = SVR()
>>> bf = Exhaustion(svr,n_select=(2,),refit=True,note=False)
>>> new_x = bf.fit_transform(X,y)
>>> bf.support_
array([False, False, False,  True, False,  True, False, False])
>>> train_score = bf.score(X_train,y_train)  # train score
>>> test_score = bf.score(X_test,y_test) # test score on more data.
>>> np.mean(cross_val_score(bf.estimator_,X_train[:,bf.support_],y_train,cv=5)) # recompute the cv score manually.
-2.888471220974372
>>> np.mean(cross_val_predict(bf.estimator_,X_train[:,bf.support_],y_train,cv=5)) # cross-validated predictions, e.g. for plotting.
1.6001222987265382

Examples

>>> from sklearn.datasets import fetch_california_housing
>>> from sklearn.svm import SVR
>>> from sklearn import model_selection
>>> from featurebox.selection.exhaustion import Exhaustion
>>> X,y = fetch_california_housing(return_X_y=True)
>>> X = X[:100]
>>> y = y[:100]
>>> svr= SVR()

>>> gd = model_selection.GridSearchCV(svr, param_grid=[{"C": [1, 10]}], n_jobs=1, cv=3)
>>> bf = Exhaustion(gd,n_select=(2,),refit=True,note=False,cv=5)
Uniform parameter in SearchCV and Exhaustion:
(scoring=None, cv=5, refit=True)
>>> new_x = bf.fit_transform(X,y)
>>> bf.support_
array([False, False, False,  True, False,  True, False, False])
>>> bf.best_score_
-0.7336740728050252

Parameters:

estimator – scikit-learn model or GridSearchCV.
n_select (tuple) – The subset sizes to try; default n_select=(2, 3, 4).
multi_grade (int) – Binding group size; the correlation is calculated between bound groups.
multi_index (list) – The range of multi_grade: [min,max).
must_index (list) – The columns that are forced to be included.
n_jobs (int) – Number of parallel jobs.
refit (bool) – Whether to refit. If True, the model is refitted on all the data passed to fit.
cv (int) – Used for cross-validation if estimator is a plain scikit-learn model; otherwise skipped.
scoring (None,str) – Scoring method name.
note (bool) – Whether to print the note.
filter_warn (bool) – Whether to call warnings.filterwarnings.
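An exhaustive search over the sizes in n_select can be sketched with itertools.combinations. This is an illustrative outline only: candidate_subsets and best_subset are hypothetical helpers, and adding must_index features on top of the requested size is an assumption of the sketch rather than documented behavior.

```python
from itertools import combinations


def candidate_subsets(n_features, n_select=(2, 3, 4), must_index=()):
    """Yield every feature subset of the requested sizes.

    Features in `must_index` are always included, here on top of the
    requested size (an assumption of this sketch).
    """
    must = tuple(sorted(must_index))
    free = [i for i in range(n_features) if i not in must]
    for size in n_select:
        for combo in combinations(free, size):
            yield tuple(sorted(must + combo))


def best_subset(n_features, score, **kwargs):
    """Return the highest-scoring subset together with its score."""
    scored = ((s, score(s)) for s in candidate_subsets(n_features, **kwargs))
    return max(scored, key=lambda pair: pair[1])


subsets = list(candidate_subsets(4, n_select=(2,)))
print(len(subsets))  # 6, i.e. C(4, 2)

# Toy score that prefers subsets whose indices sum to 5.
best = best_subset(4, lambda s: -abs(sum(s) - 5), n_select=(2,))
print(best)  # ((2, 3), 0)
```

The number of candidates grows combinatorially with the number of features, which is why n_select is usually kept small.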

fit(X, y)¶

Fit the Exhaustion model and then the underlying estimator on the selected features.

Parameters:

X ({array-like, sparse matrix}, shape = [n_samples, n_feature]) – The training input samples.
y (array-like, shape = [n_samples]) – The target values.

predict(X)¶

Reduce X to the selected features and then predict using the underlying estimator. Only available when refit=True.

Parameters:: X (array of shape [n_samples, n_feature]) – The input samples.
Returns:: y – The predicted target values.
Return type:: array of shape [n_samples]

score(X, y, scoring=None)¶

Reduce X to the selected features and then return the score of the underlying estimator. Only available when refit=True.

Parameters:

X (array of shape [n_samples, n_feature]) – The input samples.
y (array of shape [n_samples]) – The target values.
scoring (str, callable, default=None) –
Strategy to evaluate the performance of the cross-validated model on the test set.

If scoring represents a single score, one can use a single string (see scoring_parameter).

Returns the score defined by scoring if provided, otherwise the result of the estimator_.score method; an error is raised if neither is available.

featurebox.selection.exhaustion.ExhaustionCV¶: alias of Exhaustion

featurebox.selection.ga module¶

class featurebox.selection.ga.GA(estimator, n_jobs=2, pop_n=1000, hof_n=1, cxpb=0.6, mutpb=0.3, ngen=40, max_or_min='max', mut_indpb=0.05, max_=None, min_=2, random_state=None, multi_grade=2, multi_index=None, must_index=None, cv: int = 5, scoring=None, filter_warn=False)¶

Bases: BaseEstimator, MetaEstimatorMixin, SelectorMixin, MultiBase

Genetic algorithm (GA) with binding. Pass only the training data.

Examples

>>> from sklearn.datasets import fetch_california_housing
>>> from sklearn.svm import SVR
>>> from featurebox.selection.ga import GA
>>> data = fetch_california_housing()
>>> X = data.data
>>> y = data.target
>>> X_train,y_train,X_test,y_test = X[:50],y[:50],X[-50:],y[-50:]
>>> svr = SVR(gamma="scale", C=100)
>>> ga = GA(estimator=svr, n_jobs=2, pop_n=50, hof_n=1, cxpb=0.8, mutpb=0.4, ngen=3,
... max_or_min="max", mut_indpb=0.1, min_=2, multi_index=[0, 5],random_state=0)
>>> ga.fit(X_train, y_train)
gen nevals  min     max
1   50      -4.9231 -1.09124
2   43      -3.83152  -1.09124
3   46      -4.9231   -1.09124
[1, 1, 1, 1, 0, 0, 1, 0] (-1.039237326973499,)
GA(cxpb=0.8, estimator=SVR(C=100), multi_index=(0, 5), mut_indpb=0.1, mutpb=0.4,
   ngen=3, pop_n=50, random_state=0)
>>> ga.score(X_test, y_test)
-28.542309712899435

Parameters:

estimator – scikit-learn estimator.
n_jobs (int) – Number of parallel jobs.
pop_n (int) – Population size.
hof_n (int) – Hall-of-fame size.
cxpb (float) – Probability of crossover.
mutpb (float) – Probability of mutation.
ngen (int) – Number of generations.
max_or_min (str) – "max" or "min"; whether the problem is a maximization or a minimization.
mut_indpb (float) – Probability of mutating each node.
max_ (int) – Maximum size.
min_ (int) – Minimum size.
random_state (int) – Random state (seed).
multi_grade – Binding grade.
multi_index – Binding range [min,max].
scoring (None,str) – Scoring method name.
cv (int) – Used for cross-validation if estimator is a plain scikit-learn model; otherwise skipped.
filter_warn (bool) – Whether to call warnings.filterwarnings.
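To show what cxpb-style crossover and mut_indpb-style mutation act on, the sketch below treats an individual as a list of 0/1 flags, like the [1, 1, 1, 1, 0, 0, 1, 0] printed in the example. One-point crossover and bit-flip mutation are common choices assumed here; the source does not state which operators GA actually registers, and decode, crossover, and mutate are hypothetical names.

```python
import random


def decode(individual):
    """Return the indices of the features switched on in a bit list."""
    return [i for i, bit in enumerate(individual) if bit]


def crossover(a, b, rng):
    """One-point crossover: swap the tails of two parents."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:], b[:point] + a[point:]


def mutate(individual, mut_indpb, rng):
    """Flip each bit independently with probability mut_indpb."""
    return [1 - bit if rng.random() < mut_indpb else bit
            for bit in individual]


rng = random.Random(0)
parent_a = [1, 1, 1, 1, 0, 0, 1, 0]
parent_b = [0, 0, 0, 0, 1, 1, 0, 1]
child_a, child_b = crossover(parent_a, parent_b, rng)
mutant = mutate(child_a, mut_indpb=0.1, rng=rng)

print(decode(parent_a))  # [0, 1, 2, 3, 6]
```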

feature_fold_length(feature)¶

fit(X, y)¶: Fit data and run GA.

fitness_func(ind, model, x, y, return_model=False)¶

static generate_min_max(space, min_=2, max_=None)¶

predict(X)¶

Reduce X to the selected features and then predict using the underlying estimator.

Parameters:: X (array of shape [n_samples, n_feature]) – The input samples.

predict_func(ind, model, x)¶

score(X, y)¶

Reduce X to the selected features and then return the score of the underlying estimator.

Parameters:

X (array of shape [n_samples, n_feature]) – The input samples.
y (array of shape [n_samples]) – The target values.

score_cv(X, y)¶

Reduce X to the selected features and then return the cross-validated score of the underlying estimator.

Parameters:

X (array of shape [n_samples, n_feature]) – The input samples.
y (array of shape [n_samples]) – The target values.

socre_func(ind, model, x, y, scoring=None)¶

unfold(ind)¶

featurebox.selection.ga.eaSimple(population, toolbox, cxpb, mutpb, ngen, stats=None, n_jobs=2, halloffame=None, verbose=True)¶

This algorithm reproduces the simplest evolutionary algorithm.

Parameters:

population – A list of individuals.
n_jobs – jobs.
toolbox – A Toolbox that contains the evolution operators.
cxpb – The probability of mating two individuals.
mutpb – The probability of mutating an individual.
ngen – The number of generation.
stats – A Statistics object that is updated inplace, optional.
halloffame – A HallOfFame object that will contain the best individuals, optional.
verbose – Whether to log the statistics.

Returns:

The final population and a deap.tools.Logbook with the statistics of the evolution.
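The generational loop behind a simple evolutionary algorithm (select, vary, evaluate, replace) can be outlined in plain Python as follows. This is a simplified stand-in, not the deap or featurebox code: ea_simple, the binary tournament, and the single bit-flip mutation are assumptions of the sketch, and the log is a plain list of dicts rather than a Logbook.

```python
import random


def ea_simple(population, evaluate, cxpb, mutpb, ngen, rng):
    """Minimal generational loop over bit-list individuals."""
    fitness = [evaluate(ind) for ind in population]
    best = max(zip(fitness, population))
    log = [{"gen": 0, "max": max(fitness), "min": min(fitness)}]
    for gen in range(1, ngen + 1):
        # Selection: binary tournaments, copying the winners.
        offspring = []
        for _ in population:
            i = rng.randrange(len(population))
            j = rng.randrange(len(population))
            winner = population[i] if fitness[i] >= fitness[j] else population[j]
            offspring.append(list(winner))
        # Variation: pairwise one-point crossover with probability cxpb.
        for k in range(1, len(offspring), 2):
            if rng.random() < cxpb:
                p = rng.randrange(1, len(offspring[k]))
                a, b = offspring[k - 1], offspring[k]
                a[p:], b[p:] = b[p:], a[p:]
        # Variation: flip one random bit with probability mutpb.
        for ind in offspring:
            if rng.random() < mutpb:
                pos = rng.randrange(len(ind))
                ind[pos] = 1 - ind[pos]
        # Replacement: the offspring become the new population.
        population = offspring
        fitness = [evaluate(ind) for ind in population]
        best = max(best, max(zip(fitness, population)))
        log.append({"gen": gen, "max": max(fitness), "min": min(fitness)})
    return population, log, best


rng = random.Random(1)
pop = [[rng.randint(0, 1) for _ in range(8)] for _ in range(20)]
final_pop, log, best = ea_simple(pop, evaluate=sum, cxpb=0.6, mutpb=0.3,
                                 ngen=5, rng=rng)
print(len(final_pop), len(log))  # 20 6
```

Here best plays the role of a hall of fame of size one, and each log entry corresponds to one row of the gen/min/max table printed during fitting.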

featurebox.selection.ga.filt(ind, min_=2, max_=None)¶

featurebox.selection.ga.generate(space)¶

featurebox.selection.ga.generate_xi()¶

featurebox.selection.multibase module¶

class featurebox.selection.multibase.MultiBase(multi_grade: int = 2, multi_index: Optional[Union[List, Tuple]] = None, must_index: Optional[Union[List, Tuple]] = None)¶

Bases: object

Base class for binding.

Parameters:

multi_grade (int) – Binding group size; the correlation is calculated between bound groups.
multi_index (list,tuple,None) – The range of multi_grade: [min,max).
must_index (list,tuple,None) – The columns that are forced to be included.
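How multi_grade and multi_index could partition the columns is sketched below: columns inside the half-open range are bound into consecutive blocks, the rest stay single. fold_indices is a hypothetical helper written for this page; the ordering of the groups and the requirement that the range length be a multiple of multi_grade are assumptions.

```python
def fold_indices(n_features, multi_grade=2, multi_index=None):
    """Group column indices: columns inside [start, stop) are bound
    into consecutive blocks of `multi_grade`; the rest stay single."""
    if multi_index is None:
        return [[i] for i in range(n_features)]
    start, stop = multi_index
    if (stop - start) % multi_grade:
        raise ValueError("range length must be a multiple of multi_grade")
    before = [[i] for i in range(start)]
    bound = [list(range(i, i + multi_grade))
             for i in range(start, stop, multi_grade)]
    after = [[i] for i in range(stop, n_features)]
    return before + bound + after


print(fold_indices(8, multi_grade=2, multi_index=(0, 8)))
# [[0, 1], [2, 3], [4, 5], [6, 7]]
print(fold_indices(6, multi_grade=2, multi_index=(2, 6)))
# [[0], [1], [2, 3], [4, 5]]
```

Selecting or dropping a bound group then means selecting or dropping all of its columns together.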

property check_multi¶

property check_must¶

feature_fold(feature)¶

feature_unfold(feature)¶

inverse_transform_index(index)¶: Map the selected index back to the original index using the support.

property must_fold_add¶

property must_unfold_add¶

transform(data: Any)¶

transform_index(index)¶: Get support index.
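The two index mappings can be pictured with a boolean support mask as below. The direction each featurebox method takes is not spelled out in the source, so to_original and to_reduced are hypothetical helpers that only illustrate the idea of converting between original column indices and positions in the reduced matrix.

```python
def selected_positions(support):
    """Original column index of each selected feature, in order."""
    return [i for i, keep in enumerate(support) if keep]


def to_original(index, support):
    """Map positions in the reduced matrix back to original columns."""
    positions = selected_positions(support)
    return [positions[i] for i in index]


def to_reduced(index, support):
    """Map original columns to positions in the reduced matrix."""
    lookup = {orig: new
              for new, orig in enumerate(selected_positions(support))}
    return [lookup[i] for i in index]


support = [False, True, True, False, False, False, False, True]
print(to_original([0, 2], support))  # [1, 7]
print(to_reduced([2, 7], support))   # [1, 2]
```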