featurebox.selection package
Submodules
featurebox.selection.backforward module
Forward-and-backward feature elimination for feature ranking.
- class featurebox.selection.backforward.BackForward(estimator: BaseEstimator, n_type_feature_to_select: Optional[int] = None, primary_feature: Optional[int] = None, multi_grade: int = 2, multi_index: Optional[List] = None, refit=True, cv=5, min_type_feature_to_select: int = 3, must_index: Optional[List] = None, tolerant: float = 0.01, verbose: int = 1, random_state: Optional[int] = None, scoring: Optional[str] = None, note: bool = True, filter_warn: bool = False)
Bases: BaseEstimator, MetaEstimatorMixin, SelectorMixin, MultiBase
BackForward method to select features.
- n_feature_
The final number of selected features.
- Type:
int
- support_
The mask of the finally selected features.
- Type:
array of shape [n_feature]
- estimator_
The best model with the best features (refitted with all data).
- Type:
object
- best_score_
Best score of the best model with the best features.
- Type:
float
Examples
>>> from sklearn.datasets import fetch_california_housing
>>> from sklearn.svm import SVR
>>> from featurebox.selection.backforward import BackForward
>>> X,y = fetch_california_housing(return_X_y=True)
>>> X = X[:100]
>>> y = y[:100]
>>> svr= SVR()
>>> bf = BackForward(svr,primary_feature=4, random_state=1,verbose=0,note=False)
>>> new_x = bf.fit_transform(X,y)
>>> bf.support_
array([False, True, True, False, False, False, False, True])
Examples
>>> import numpy as np
>>> from sklearn.datasets import fetch_california_housing
>>> from sklearn.svm import SVR
>>> from sklearn.model_selection import cross_val_score
>>> from featurebox.selection.backforward import BackForward
>>> X,y = fetch_california_housing(return_X_y=True)
>>> X = X[:100]
>>> y = y[:100]
>>> X_train,y_train,X_test,y_test = X[:50],y[:50],X[-50:],y[-50:]
>>> svr= SVR()
>>> bf = BackForward(svr, primary_feature=4, random_state=1, refit=True, cv=5,verbose=0,note=False)
>>> bf = bf.fit(X_train,y_train)
>>> bf.best_score_ # cv score
-3.0552830696940037
>>> train_score = bf.score(X_train,y_train) # train score
>>> test_score = bf.score(X_test,y_test) # test score on held-out data
>>> np.mean(cross_val_score(bf.estimator_,X_train[:,bf.support_],y_train,cv=5)) # recompute the cv score manually
-3.0552830696940037
- Notes
If score and predict are used, refit should be set to True. The refit uses all the data passed to the fit function, so a score or prediction on that same data is not a test score or test prediction.
Examples
If the estimator is a GridSearchCV, refit should be set to True; best_score_ then returns the cv score.
>>> from sklearn.datasets import fetch_california_housing
>>> from sklearn.svm import SVR
>>> from sklearn import model_selection
>>> from featurebox.selection.backforward import BackForward
>>> X,y = fetch_california_housing(return_X_y=True)
>>> X = X[:100]
>>> y = y[:100]
>>> X_train,y_train,X_test,y_test = X[:50],y[:50],X[-50:],y[-50:]
>>> svr= SVR()
>>> gd = model_selection.GridSearchCV(svr,param_grid={"C":[1,10]},n_jobs=1) # keep n_jobs=1 here
>>> bf = BackForward(gd,primary_feature=4, random_state=1, refit=True,
... scoring="neg_root_mean_squared_error",cv=5,verbose=0, note=False)
Uniform parameter in SearchCV and Exhaustion: (scoring=neg_root_mean_squared_error, cv=5, refit=True)
>>> bf = bf.fit(X_train,y_train)
>>> bf.best_score_ # cv score
-0.5919173121895709
>>> train_score = bf.score(X_train,y_train) # train score
>>> test_score = bf.score(X_test,y_test) # test score on held-out data
>>> # bf.estimator_ is the gd object (GridSearchCV)
>>> bf.estimator_.best_score_ # the same cv score, read from GridSearchCV
-0.5919173121895709
- Parameters:
estimator (estimator object) – A supervised scikit-learn estimator with a fit method. It is assumed to implement the scikit-learn estimator interface.
n_type_feature_to_select (int) – The maximum number of features to select. If None, select the features with the best score.
min_type_feature_to_select (int) – The minimum number of features that must be selected.
primary_feature (int) – The number of primary features used to start the loop; defaults to n_features//2.
multi_grade (int) – The binding group size.
multi_index (list) – The index range of the binding groups.
must_index (list) – The indices of features that must be selected.
tolerant (float) – The tolerance used when comparing scores for ranking.
verbose (int) – Whether to print progress.
random_state (int) – The random state.
refit (bool) – Whether to refit. If True, the model is refitted with all data.
scoring (None,str) – The scoring method name.
note (bool) – Whether to print the note.
filter_warn (bool) – Whether to apply warnings.filterwarnings.
- fit(X, y)
Fit the baf model and then the underlying estimator on the selected features.
- Parameters:
X ({array-like, sparse matrix}, shape = [n_samples, n_feature]) – The training input samples.
y (array-like, shape = [n_samples]) – The target values.
- predict(X)
Reduce X to the selected features and then predict using the underlying estimator. Only available when refit=True.
- Parameters:
X (array of shape [n_samples, n_feature]) – The input samples.
- Returns:
y – The predicted target values.
- Return type:
array of shape [n_samples]
- score(X, y, scoring=None)
Reduce X to the selected features and then return the score of the underlying estimator. Only available when refit=True.
- Parameters:
X (array of shape [n_samples, n_feature]) – The input samples.
y (array of shape [n_samples]) – The target values.
scoring (str, callable, default=None) – Strategy to evaluate the performance of the model on the given data. For a single score, pass a single string (see scoring_parameter).
- Returns:
The score defined by scoring if provided, otherwise the result of the estimator_.score method; an error is raised if neither is available.
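The forward-and-backward idea behind this class can be illustrated with a small, self-contained sketch. This is not featurebox's implementation: the function name back_forward_select, the toy scoring function, and the exact acceptance rules are assumptions made for illustration, and a real run would replace toy_score with a cross-validated estimator score.

```python
from typing import Callable, List, Set

# Made-up per-feature gains; a stand-in for a cross-validated estimator score.
TOY_GAIN = {0: 1.0, 3: 0.8, 5: 0.5}


def toy_score(features: List[int]) -> float:
    """Reward useful features and charge a small penalty per selected feature."""
    return sum(TOY_GAIN.get(f, 0.0) for f in features) - 0.05 * len(features)


def back_forward_select(
    n_features: int,
    score_fn: Callable[[List[int]], float],
    start: List[int],
    tolerant: float = 0.01,
    min_features: int = 3,
) -> List[int]:
    """Greedy forward-and-backward search (hypothetical helper, for illustration only)."""
    selected: Set[int] = set(start)
    best = score_fn(sorted(selected))
    changed = True
    while changed:
        changed = False
        # Forward step: add a feature only if it improves the score by more than `tolerant`.
        for f in range(n_features):
            if f not in selected:
                trial = score_fn(sorted(selected | {f}))
                if trial > best + tolerant:
                    selected.add(f)
                    best = trial
                    changed = True
        # Backward step: drop a feature if the score falls by at most `tolerant`.
        for f in sorted(selected):
            if len(selected) <= min_features:
                break
            trial = score_fn(sorted(selected - {f}))
            if trial >= best - tolerant:
                selected.remove(f)
                best = trial
                changed = True
    return sorted(selected)


# Start from the first n_features // 2 features, mirroring the primary_feature default.
print(back_forward_select(8, toy_score, start=[0, 1, 2, 3]))  # [0, 3, 5]
```

In this sketch the tolerance plays the same role as the tolerant parameter: small score differences are not enough to justify adding a feature, while a feature is removed whenever removing it costs no more than the tolerance.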
- class featurebox.selection.backforward.BackForwardStable(estimator: BaseEstimator, n_type_feature_to_select: Optional[int] = None, min_type_feature_to_select: int = 3, primary_feature: Optional[int] = None, multi_grade: int = 2, multi_index: Optional[List] = None, must_index: Optional[List] = None, verbose: int = 0, random_state: Optional[int] = None, tolerant: float = 0.001, cv: int = 5, times: int = 5, scoring: Optional[str] = None, n_jobs: Optional[int] = None, refit=False, note=True)
Bases: MetaEstimatorMixin, SelectorMixin, BaseEstimator
BackForwardStable. Runs with different orders for a more stable result (just for testing).
- n_feature_
The number of features selected with cross-validation.
- Type:
int
- support_
The mask of the selected features.
- Type:
array of shape [n_feature]
- estimator_
The model with the best features (refitted with all data).
- Type:
object
- best_score_
Best score of the best model with the best features.
- Type:
float
Examples
>>> from sklearn.datasets import fetch_california_housing
>>> from sklearn.svm import SVR
>>> from featurebox.selection.backforward import BackForwardStable
>>> X,y = fetch_california_housing(return_X_y=True)
>>> X = X[:100]
>>> y = y[:100]
>>> svr= SVR()
>>> bf = BackForwardStable(svr,primary_feature=3, random_state=1,verbose=0,note=False)
>>> new_x = bf.fit_transform(X,y)
>>> bf.support_
array([False, True, False, False, False, True, True, False])
>>> bf.best_score_
-0.09122826477472024
If score and predict are used, refit can be set to True. Make sure the data is split beforehand, because the refit uses all the data passed to fit().
>>> from sklearn.datasets import fetch_california_housing
>>> from sklearn.svm import SVR
>>> from featurebox.selection.backforward import BackForwardStable
>>> X,y = fetch_california_housing(return_X_y=True)
>>> X = X[:100]
>>> y = y[:100]
>>> svr= SVR()
>>> bf = BackForwardStable(svr,primary_feature=4, random_state=1, refit=True,verbose=0,note=False)
>>> new_x = bf.fit_transform(X[:50],y[:50])
>>> test_score = bf.score(X[50:],y[50:]) # score on data not used in fit
>>> cv_score = bf.best_score_
...
If the estimator is a GridSearchCV, refit can be set to True; best_score_ then returns the cv score.
>>> from sklearn.datasets import fetch_california_housing
>>> from sklearn.svm import SVR
>>> from sklearn import model_selection
>>> from featurebox.selection.backforward import BackForward
>>> X,y = fetch_california_housing(return_X_y=True)
>>> X = X[:100]
>>> y = y[:100]
>>> svr= SVR()
>>> gd = model_selection.GridSearchCV(svr,param_grid={"C":[1,10]})
>>> bf = BackForward(gd,primary_feature=4, random_state=1, refit=True, cv=5,verbose=0,note=False)
Uniform parameter in SearchCV and Exhaustion: (scoring=None, cv=5, refit=True)
>>> new_x = bf.fit_transform(X,y)
...
- Parameters:
estimator (estimator object) – A supervised scikit-learn estimator with a fit method. It is assumed to implement the scikit-learn estimator interface.
n_type_feature_to_select (int) – The maximum number of features to select. If None, select the features with the best score.
min_type_feature_to_select (int) – The minimum number of features that must be selected.
primary_feature (int) – The number of primary features used to start the loop; defaults to n_features//2.
multi_grade (int) – The binding group size.
multi_index (list) – The index range of the binding groups.
must_index (list) – The indices of features that must be selected.
tolerant (float) – The tolerance used when comparing scores for ranking.
verbose (int) – Whether to print progress.
random_state (int) – The random state.
refit (bool) – Whether to refit. If True, the model is refitted with all data.
n_jobs (int or None) – Number of cores to run in parallel while fitting across folds. None means 1 and -1 means using all processors.
scoring (None,str) – The scoring method name.
note (bool) – Whether to print the note.
- fit(X, y, groups=None)
Fit the baf model and automatically tune the number of selected features.
- Parameters:
X ({array-like, sparse matrix}, shape = [n_samples, n_feature]) – Training vector, where n_samples is the number of samples and n_feature is the total number of features.
y (array-like, shape = [n_samples]) – Target values (integers for classification, real numbers for regression).
groups (array-like, shape = [n_samples], optional) – Group labels for the samples, used while splitting the dataset into train/test sets.
- predict(X)
Reduce X to the selected features and then predict using the underlying estimator. Only available when refit=True.
- Parameters:
X (array of shape [n_samples, n_feature]) – The input samples.
- Returns:
y – The predicted target values.
- Return type:
array of shape [n_samples]
- score(X, y, scoring=None)
Reduce X to the selected features and then return the score of the underlying estimator. Only available when refit=True.
- Parameters:
X (array of shape [n_samples, n_feature]) – The input samples.
y (array of shape [n_samples]) – The target values.
featurebox.selection.corr module
Calculate the correlation of columns.
- class featurebox.selection.corr.Corr(threshold: float = 0.85, multi_grade: int = 2, multi_index: Optional[List] = None, must_index: Optional[List] = None, random_state: int = 0)
Bases: BaseEstimator, MetaEstimatorMixin, SelectorMixin, MultiBase
Calculate correlation. (The results change with the random state.)
1. Automatic filtering by machine
Examples
>>> from sklearn.datasets import fetch_california_housing
>>> from featurebox.selection.corr import Corr
>>> x, y = fetch_california_housing(return_X_y=True)
>>> x = x[:100]
>>> y = y[:100]
>>> co = Corr(threshold=0.5)
>>> new_x = co.fit_transform(x)
>>> select_feature = co.support_
2. Getting the groups that exceed the threshold, step by step
Examples
>>> import numpy as np
>>> from sklearn.datasets import fetch_california_housing
>>> from featurebox.selection.corr import Corr
>>> x, y = fetch_california_housing(return_X_y=True)
>>> x = x[:100]
>>> y = y[:100]
>>> co = Corr(threshold=0.5)
>>> groups = co.count_cof(np.corrcoef(x[:,:7], rowvar=False))
>>> groups[1]
[[0, 6], [1], [2], [3], [4], [5], [0, 6]]
>>> groups[0]
[[1.0, 0.554], [1.0], [1.0], [1.0], [1.0], [1.0], [0.554, 1.0]]
>>> co.remove_coef(groups[1]) # Filter automatically by machine.
[0, 1, 2, 3, 4, 5]
The result of remove_coef changes with the random state.
Features 0 and 6 have a correlation (0.554) above the threshold of 0.5.
3. Binding correlation
Examples
>>> from sklearn.datasets import fetch_california_housing
>>> from featurebox.selection.corr import Corr
>>> x, y = fetch_california_housing(return_X_y=True)
>>> x = x[:100]
>>> y = y[:100]
>>> co = Corr(threshold=0.3,multi_index=[0,8],multi_grade=2)
>>> # in the range [0,8), the features are bound into groups of size 2: [[0,1],[2,3],[4,5],[6,7]]
>>> co.fit(x)
Corr(multi_index=(0, 8), threshold=0.3)
- Parameters:
threshold (float) – The correlation threshold used for ranking.
multi_grade (int) – The binding group size; the correlation is calculated between binding groups.
multi_index (list) – The range of multi_grade: [min,max).
must_index (list) – The indices of columns that must be kept.
random_state (int) – The random state; the results change with it.
- count_cof(cof=None)
Check the coefficient matrix cof and count, for each feature, the features whose correlation exceeds the threshold.
- static cov_y(x_, y_)
- filter()
- fit(data, pre_cal=None, method='mean')
- remove_by_y(y_)
- remove_coef(cof_list_all)
Remove the indices of features with repeated coefficients.
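The grouping and filtering steps shown in the examples above can be sketched in plain Python. This is a rough illustration under stated assumptions, not the code behind count_cof or remove_coef: the helper names correlated_groups and drop_correlated are hypothetical, and the sketch always keeps the first member of a correlated group, whereas the library's choice depends on the random state.

```python
import math
from typing import List, Sequence


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long columns."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def correlated_groups(columns: List[Sequence[float]], threshold: float) -> List[List[int]]:
    """For each column, list the columns whose |correlation| with it reaches the threshold."""
    groups = []
    for i, col_i in enumerate(columns):
        group = [
            j
            for j, col_j in enumerate(columns)
            if i == j or abs(pearson(col_i, col_j)) >= threshold
        ]
        groups.append(group)
    return groups


def drop_correlated(groups: List[List[int]]) -> List[int]:
    """Keep the first member of each correlated group and drop the others."""
    dropped = set()
    kept = []
    for i, group in enumerate(groups):
        if i in dropped:
            continue
        kept.append(i)
        dropped.update(j for j in group if j != i)
    return kept


# Three toy columns: column 2 is almost a multiple of column 0, column 1 is unrelated.
columns = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.0, 1.0, 2.0, 1.0, 2.0],
    [2.0, 4.0, 6.0, 8.0, 11.0],
]
groups = correlated_groups(columns, threshold=0.9)
print(groups)                   # [[0, 2], [1], [0, 2]]
print(drop_correlated(groups))  # [0, 1]
```

The group layout mirrors the shape of groups[1] in the example above: every feature lists itself plus any feature it is strongly correlated with, so a correlated pair appears in both members' lists.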
featurebox.selection.exhaustion module
- class featurebox.selection.exhaustion.Exhaustion(estimator: BaseEstimator, n_select: Tuple = (2, 3, 4), multi_grade: Optional[int] = None, multi_index: Optional[List] = None, must_index: Optional[List] = None, n_jobs: int = 1, refit: bool = False, cv: int = 5, scoring: Optional[str] = None, note=True, filter_warn=False)
Bases: BaseEstimator, MetaEstimatorMixin, SelectorMixin, MultiBase
Exhaustive search over feature combinations.
- n_feature_
The final number of selected features.
- Type:
int
- support_
The mask of the finally selected features.
- Type:
array of shape [n_feature]
- estimator_
The best model with the best features (refitted with all data).
- Type:
object
- best_score_
Best score of the best model with the best features.
- Type:
float
Examples
>>> import numpy as np
>>> from sklearn.datasets import fetch_california_housing
>>> from sklearn.model_selection import cross_val_predict, cross_val_score
>>> from sklearn.svm import SVR
>>> from featurebox.selection.exhaustion import Exhaustion
>>> X,y = fetch_california_housing(return_X_y=True)
>>> X = X[:100]
>>> y = y[:100]
>>> X_train,y_train,X_test,y_test = X[:50],y[:50],X[-50:],y[-50:]
>>> svr = SVR()
>>> bf = Exhaustion(svr,n_select=(2,),refit=True,note=False)
>>> new_x = bf.fit_transform(X,y)
>>> bf.support_
array([False, False, False, True, False, True, False, False])
>>> train_score = bf.score(X_train,y_train) # train score
>>> test_score = bf.score(X_test,y_test) # test score on held-out data
>>> np.mean(cross_val_score(bf.estimator_,X_train[:,bf.support_],y_train,cv=5)) # recompute the cv score manually
-2.888471220974372
>>> np.mean(cross_val_predict(bf.estimator_,X_train[:,bf.support_],y_train,cv=5)) # cv predictions, e.g. for plotting
1.6001222987265382
Examples
>>> from sklearn.datasets import fetch_california_housing
>>> from sklearn.svm import SVR
>>> from sklearn import model_selection
>>> from featurebox.selection.exhaustion import Exhaustion
>>> X,y = fetch_california_housing(return_X_y=True)
>>> X = X[:100]
>>> y = y[:100]
>>> svr= SVR()
>>> gd = model_selection.GridSearchCV(svr, param_grid=[{"C": [1, 10]}], n_jobs=1, cv=3)
>>> bf = Exhaustion(gd,n_select=(2,),refit=True,note=False,cv=5)
Uniform parameter in SearchCV and Exhaustion: (scoring=None, cv=5, refit=True)
>>> new_x = bf.fit_transform(X,y)
>>> bf.support_
array([False, False, False, True, False, True, False, False])
>>> bf.best_score_
-0.7336740728050252
- Parameters:
estimator – A scikit-learn model or a GridSearchCV.
n_select (tuple) – The numbers of features to select; default n_select=(2, 3, 4).
multi_grade (int) – The binding group size; the correlation is calculated between binding groups.
multi_index (list) – The range of multi_grade: [min,max).
must_index (list) – The indices of columns that must be kept.
n_jobs (int) – The number of parallel jobs.
refit (bool) – Whether to refit. If True, the model is refitted with all data.
cv (int) – Used for cross-validation if the estimator is a plain scikit-learn model; otherwise ignored.
scoring (None,str) – The scoring method name.
note (bool) – Whether to print the note.
filter_warn (bool) – Whether to apply warnings.filterwarnings.
- fit(X, y)
Fit the selection model and then the underlying estimator on the selected features.
- Parameters:
X ({array-like, sparse matrix}, shape = [n_samples, n_feature]) – The training input samples.
y (array-like, shape = [n_samples]) – The target values.
- predict(X)
Reduce X to the selected features and then predict using the underlying estimator. Only available when refit=True.
- Parameters:
X (array of shape [n_samples, n_feature]) – The input samples.
- Returns:
y – The predicted target values.
- Return type:
array of shape [n_samples]
- score(X, y, scoring=None)
Reduce X to the selected features and then return the score of the underlying estimator. Only available when refit=True.
- Parameters:
X (array of shape [n_samples, n_feature]) – The input samples.
y (array of shape [n_samples]) – The target values.
scoring (str, callable, default=None) – Strategy to evaluate the performance of the model on the given data. For a single score, pass a single string (see scoring_parameter).
- Returns:
The score defined by scoring if provided, otherwise the result of the estimator_.score method; an error is raised if neither is available.
- featurebox.selection.exhaustion.ExhaustionCV
alias of Exhaustion
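To make the exhaustive search concrete, the following sketch enumerates every feature combination of the requested sizes and keeps the best-scoring one. It is only an illustration: exhaustive_search and toy_score are hypothetical names, the toy score stands in for a cross-validated estimator, and counting the must_index features toward each size in n_select is an assumption rather than documented behaviour.

```python
from itertools import combinations
from typing import Callable, Sequence, Tuple

# Made-up per-feature gains; a stand-in for a cross-validated estimator score.
TOY_GAIN = {0: 1.0, 3: 0.8, 5: 0.5}


def toy_score(features: Sequence[int]) -> float:
    """Reward useful features and charge a small penalty per selected feature."""
    return sum(TOY_GAIN.get(f, 0.0) for f in features) - 0.05 * len(features)


def exhaustive_search(
    n_features: int,
    n_select: Sequence[int],
    score_fn: Callable[[Sequence[int]], float],
    must_index: Sequence[int] = (),
) -> Tuple[Tuple[int, ...], float]:
    """Score every combination of each size in n_select and return the best one."""
    must = tuple(sorted(must_index))
    free = [i for i in range(n_features) if i not in must]
    best_subset: Tuple[int, ...] = ()
    best_score = float("-inf")
    for size in n_select:
        extra = size - len(must)  # assumption: forced features count toward the size
        if extra < 0:
            continue
        for combo in combinations(free, extra):
            subset = tuple(sorted(must + combo))
            score = score_fn(subset)
            if score > best_score:
                best_subset, best_score = subset, score
    return best_subset, best_score


subset, score = exhaustive_search(6, n_select=(2, 3), score_fn=toy_score)
print(subset)  # (0, 3, 5)

forced, _ = exhaustive_search(6, n_select=(2,), score_fn=toy_score, must_index=(1,))
print(forced)  # (0, 1)
```

The number of candidates grows combinatorially with the feature count (here 15 pairs plus 20 triples), which is why small values of n_select are used in the examples above.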
featurebox.selection.ga module
- class featurebox.selection.ga.GA(estimator, n_jobs=2, pop_n=1000, hof_n=1, cxpb=0.6, mutpb=0.3, ngen=40, max_or_min='max', mut_indpb=0.05, max_=None, min_=2, random_state=None, multi_grade=2, multi_index=None, must_index=None, cv: int = 5, scoring=None, filter_warn=False)
Bases: BaseEstimator, MetaEstimatorMixin, SelectorMixin, MultiBase
GA with binding. Please pass only the training data.
Examples
>>> from sklearn.datasets import fetch_california_housing
>>> from sklearn.svm import SVR
>>> from featurebox.selection.ga import GA
>>> data = fetch_california_housing()
>>> X = data.data
>>> y = data.target
>>> X_train,y_train,X_test,y_test = X[:50],y[:50],X[-50:],y[-50:]
>>> svr = SVR(gamma="scale", C=100)
>>> ga = GA(estimator=svr, n_jobs=2, pop_n=50, hof_n=1, cxpb=0.8, mutpb=0.4, ngen=3,
... max_or_min="max", mut_indpb=0.1, min_=2, multi_index=[0, 5],random_state=0)
>>> ga.fit(X_train, y_train)
gen nevals min max
1 50 -4.9231 -1.09124
2 43 -3.83152 -1.09124
3 46 -4.9231 -1.09124
[1, 1, 1, 1, 0, 0, 1, 0] (-1.039237326973499,)
GA(cxpb=0.8, estimator=SVR(C=100), multi_index=(0, 5), mut_indpb=0.1, mutpb=0.4, ngen=3, pop_n=50, random_state=0)
>>> ga.score(X_test, y_test)
-28.542309712899435
- Parameters:
estimator – A scikit-learn estimator.
n_jobs (int) – The number of parallel jobs.
pop_n (int) – The population size.
hof_n (int) – The hall-of-fame size.
cxpb (float) – The probability of crossover.
mutpb (float) – The probability of mutation.
ngen (int) – The number of generations.
max_or_min (str) – "max" or "min"; whether the problem is a maximization or a minimization problem.
mut_indpb (float) – The probability of mutation for each node.
max_ (int) – The maximum size.
min_ (int) – The minimum size.
random_state (int) – The random state.
multi_grade – The binding grade.
multi_index – The binding range [min,max].
scoring (None,str) – The scoring method name.
cv (int) – Used for cross-validation if the estimator is a plain scikit-learn model; otherwise ignored.
filter_warn (bool) – Whether to apply warnings.filterwarnings.
- feature_fold_length(feature)
- fit(X, y)
Fit the data and run the GA.
- fitness_func(ind, model, x, y, return_model=False)
- static generate_min_max(space, min_=2, max_=None)
- predict(X)
Reduce X to the selected features and then predict using the underlying estimator.
- Parameters:
X (array of shape [n_samples, n_feature]) – The input samples.
- predict_func(ind, model, x)
- score(X, y)
Reduce X to the selected features and then return the score of the underlying estimator.
- Parameters:
X (array of shape [n_samples, n_feature]) – The input samples.
y (array of shape [n_samples]) – The target values.
- score_cv(X, y)
Reduce X to the selected features and then return the cross-validation score of the underlying estimator.
- Parameters:
X (array of shape [n_samples, n_feature]) – The input samples.
y (array of shape [n_samples]) – The target values.
- socre_func(ind, model, x, y, scoring=None)
- unfold(ind)
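The roles of mut_indpb, min_ and max_ can be illustrated on a binary feature mask like the individual printed in the example above. The sketch below is an assumption about how such parameters are commonly used in a GA over feature masks, not a description of this class's internals; mutate_flip and repair are hypothetical helper names.

```python
import random
from typing import List, Optional


def mutate_flip(ind: List[int], indpb: float, rng: random.Random) -> List[int]:
    """Flip each gene (0 <-> 1) independently with probability indpb."""
    return [1 - gene if rng.random() < indpb else gene for gene in ind]


def repair(ind: List[int], min_: int, max_: Optional[int], rng: random.Random) -> List[int]:
    """Switch random genes on or off until min_ <= number of selected features <= max_."""
    ind = list(ind)
    upper = len(ind) if max_ is None else max_
    while sum(ind) < min_:
        off = [i for i, gene in enumerate(ind) if gene == 0]
        ind[rng.choice(off)] = 1
    while sum(ind) > upper:
        on = [i for i, gene in enumerate(ind) if gene == 1]
        ind[rng.choice(on)] = 0
    return ind


rng = random.Random(0)
individual = [1, 1, 1, 1, 0, 0, 1, 0]           # 1 means the feature is selected
mutant = mutate_flip(individual, indpb=0.1, rng=rng)
bounded = repair(mutant, min_=2, max_=4, rng=rng)
print(sum(bounded) <= 4 and sum(bounded) >= 2)  # True
```

Keeping the mask size inside a fixed window is one simple way to stop a population from drifting toward empty or full feature sets.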
- featurebox.selection.ga.eaSimple(population, toolbox, cxpb, mutpb, ngen, stats=None, n_jobs=2, halloffame=None, verbose=True)
This algorithm reproduces the simplest evolutionary algorithm.
- Parameters:
population – A list of individuals.
n_jobs – The number of parallel jobs.
toolbox – A Toolbox that contains the evolution operators.
cxpb – The probability of mating two individuals.
mutpb – The probability of mutating an individual.
ngen – The number of generations.
stats – A Statistics object that is updated in place, optional.
halloffame – A HallOfFame object that will contain the best individuals, optional.
verbose – Whether to log the statistics.
- Returns:
The final population.
- Returns:
A deap.tools.Logbook with the statistics of the evolution.
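The loop that such a simple evolutionary algorithm performs (select, mate with probability cxpb, mutate with probability mutpb, evaluate, record statistics) can be sketched without DEAP. The code below is a minimal stand-in written for illustration only: ea_simple_sketch, the tournament size, the one-point crossover, and the bit-flip mutation are assumptions, and every individual is re-evaluated each generation for simplicity.

```python
import random
from typing import Callable, List, Tuple

Individual = List[int]


def ea_simple_sketch(
    population: List[Individual],
    evaluate: Callable[[Individual], float],
    cxpb: float,
    mutpb: float,
    ngen: int,
    rng: random.Random,
    indpb: float = 0.1,
) -> Tuple[List[Individual], List[Tuple[int, float, float]], Tuple[float, Individual]]:
    """Return the final population, a (gen, min, max) log, and the best (fitness, individual)."""
    fits = [evaluate(ind) for ind in population]
    best = max(zip(fits, [list(ind) for ind in population]))  # hall of fame of size 1
    log = [(0, min(fits), max(fits))]
    for gen in range(1, ngen + 1):
        # Selection: tournaments of size 3, copying each winner.
        offspring = []
        for _ in population:
            contenders = rng.sample(range(len(population)), 3)
            winner = max(contenders, key=lambda i: fits[i])
            offspring.append(list(population[winner]))
        # Crossover: one-point exchange on consecutive pairs with probability cxpb.
        for a, b in zip(offspring[::2], offspring[1::2]):
            if rng.random() < cxpb:
                cut = rng.randrange(1, len(a))
                a[cut:], b[cut:] = b[cut:], a[cut:]
        # Mutation: with probability mutpb, flip each bit with probability indpb.
        for ind in offspring:
            if rng.random() < mutpb:
                for i in range(len(ind)):
                    if rng.random() < indpb:
                        ind[i] = 1 - ind[i]
        population = offspring
        fits = [evaluate(ind) for ind in population]
        gen_best = max(zip(fits, population))
        if gen_best[0] > best[0]:
            best = (gen_best[0], list(gen_best[1]))
        log.append((gen, min(fits), max(fits)))
    return population, log, best


rng = random.Random(0)
start = [[rng.randint(0, 1) for _ in range(8)] for _ in range(20)]
final_pop, log, best = ea_simple_sketch(start, evaluate=sum, cxpb=0.8, mutpb=0.4, ngen=3, rng=rng)
print(len(log), len(final_pop))  # 4 20
```

The log plays the part of the gen/min/max table printed by ga.fit in the example above, and the tracked best individual plays the part of the hall of fame.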
- featurebox.selection.ga.filt(ind, min_=2, max_=None)
- featurebox.selection.ga.generate(space)
- featurebox.selection.ga.generate_xi()
featurebox.selection.multibase module
- class featurebox.selection.multibase.MultiBase(multi_grade: int = 2, multi_index: Optional[Union[List, Tuple]] = None, must_index: Optional[Union[List, Tuple]] = None)
Bases: object
Base method for binding.
- Parameters:
multi_grade (int) – The binding group size; the correlation is calculated between binding groups.
multi_index (list,tuple,None) – The range of multi_grade: [min,max).
must_index (list,tuple,None) – The indices of columns that must be kept.
- property check_multi
- property check_must
- feature_fold(feature)
- feature_unfold(feature)
- inverse_transform_index(index)
Convert the selected index back to the original index using the support.
- property must_fold_add
- property must_unfold_add
- transform(data: Any)
- transform_index(index)
Get the support index.
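The binding described by multi_grade and multi_index can be pictured as grouping consecutive columns so that they are selected or dropped together. The sketch below follows the grouping shown in the Corr example ([[0,1],[2,3],[4,5],[6,7]] for multi_index=[0,8] and multi_grade=2); the function names and the treatment of columns outside the range as single-column groups are assumptions, not the behaviour of feature_fold or feature_unfold.

```python
from typing import List, Optional, Sequence


def binding_groups(
    n_features: int,
    multi_grade: int = 2,
    multi_index: Optional[Sequence[int]] = None,
) -> List[List[int]]:
    """Group columns in [start, stop) into chunks of multi_grade; others stay alone."""
    if multi_index is None:
        return [[i] for i in range(n_features)]
    start, stop = multi_index
    if (stop - start) % multi_grade != 0:
        raise ValueError("the bound range must be a multiple of multi_grade")
    groups = [[i] for i in range(start)]
    groups += [list(range(i, i + multi_grade)) for i in range(start, stop, multi_grade)]
    groups += [[i] for i in range(stop, n_features)]
    return groups


def unfold_mask(folded_mask: Sequence[bool], groups: List[List[int]]) -> List[bool]:
    """Expand a per-group mask into a per-column mask."""
    n_features = sum(len(group) for group in groups)
    mask = [False] * n_features
    for keep, group in zip(folded_mask, groups):
        for column in group:
            mask[column] = keep
    return mask


print(binding_groups(8, multi_grade=2, multi_index=(0, 8)))
# [[0, 1], [2, 3], [4, 5], [6, 7]]

groups = binding_groups(8, multi_grade=2, multi_index=(0, 4))
print(unfold_mask([True, False, True, False, False, False], groups))
# [True, True, False, False, True, False, False, False]
```

A selector that works on the folded groups only needs one decision per group, and the unfolded mask then has the same length as the original feature axis, which is the shape reported by support_.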