WbMLAlgo
- class lightautoml.ml_algo.whitebox.WbMLAlgo(default_params=None, freeze_defaults=True, timer=None, optimization_search_space={})[source]
Bases: lightautoml.ml_algo.base.TabularMLAlgo
WhiteBox - scorecard model.
https://github.com/sberbank-ai-lab/AutoMLWhitebox
default_params:
- monotonic: bool
Global condition for monotonic constraints. If True, then only monotonic binnings will be built. You can pass values to the .fit method that change this condition separately for each feature.
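The monotonic constraint means that, across ordered bins, the WoE values must move in one direction only. A minimal sketch of such a check (the function name and representation are illustrative, not part of the library API):

```python
def is_monotonic(woes):
    """Check that a sequence of per-bin WoE values is monotone
    (non-decreasing or non-increasing), as required when the
    global monotonic constraint is enabled."""
    inc = all(a <= b for a, b in zip(woes, woes[1:]))
    dec = all(a >= b for a, b in zip(woes, woes[1:]))
    return inc or dec
```

A binning like [0.1, 0.4, 0.9] passes, while [0.1, 0.9, 0.4] would be rejected under the constraint.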
- max_bin_count: int
Global limit for the number of bins. Can be specified separately for every feature in .fit.
- select_type: None or int
The type of primary feature selection. If an integer is given, that number of features with the best feature_importance is selected. If None, only features with feature_importance greater than 0 are kept.
- pearson_th: 0 < pearson_th < 1
Threshold for feature selection by correlation. All features with an absolute correlation coefficient greater than pearson_th will be discarded.
- auc_th: .5 < auc_th < 1
Threshold for feature selection by one-dimensional AUC. WoE features with AUC < auc_th will be discarded.
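One-dimensional AUC here is the AUC of a single encoded feature against the binary target. A self-contained rank-pair sketch (illustrative, not the library's implementation):

```python
def one_dim_auc(labels, scores):
    """AUC of a single feature: the probability that a randomly chosen
    positive example gets a higher score than a randomly chosen
    negative one (ties count as one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A WoE feature scoring barely above 0.5 carries almost no one-dimensional signal and would fall below a typical auc_th.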
- vif_th: vif_th > 0
Threshold for feature selection by VIF. Features with VIF > vif_th are iteratively discarded one by one, then VIF is recalculated until all VIFs are less than vif_th.
- imp_th: real >= 0
Threshold for feature selection by feature importance
- th_const:
Threshold that determines whether a feature is constant. If the number of valid values is greater than the threshold, the column is not considered constant. For a float threshold, the required number of valid values is calculated as sample size * th_const.
- force_single_split: bool
The tree parameters allow setting a minimum number of observations per leaf, so for some features a split into even two bins may be impossible. If force_single_split = True, a single split will be created for the feature even if the minimum bin size is greater than th_const.
- th_nan: int >= 0
Threshold that determines whether a separate WoE value is calculated for NaN values.
- th_cat: int >= 0
Threshold that determines which categories are considered small (rare).
- woe_diff_th: float = 0.01
The option to merge NaNs and rare categories with another bin, if the difference in WoE is less than woe_diff_th.
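For reference, the WoE of bin i is ln((good_i / G) / (bad_i / B)), where G and B are the total good and bad counts; the merge rule compares these values between bins. A minimal sketch (the function name is illustrative):

```python
import math

def bin_woes(goods, bads):
    """WoE per bin: ln((good_i / G) / (bad_i / B)),
    with G and B the total good/bad counts over all bins."""
    G, B = sum(goods), sum(bads)
    return [math.log((g / G) / (b / B)) for g, b in zip(goods, bads)]
```

A NaN bin whose WoE differs from a neighbouring bin's by less than woe_diff_th (default 0.01) would be merged into it.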
- min_bin_size: int > 1, 0 < float < 1
Minimum bin size when splitting.
- min_bin_mults: list of floats > 1
If a minimum bin size is specified, you can provide a list of multipliers to check whether larger values work better, for example [2, 4].
- min_gains_to_split: list of floats >= 0
min_gain_to_split values that will be iterated to find the best split.
- auc_tol: 1e-5 <= auc_tol <=1e-2
AUC tolerance. You can lower the auc_tol value from the maximum to make the model simpler.
- cat_alpha: float > 0
Regularizer for category encoding.
- cat_merge_to: str
How WoE values are filled in the test sample for categories that are not present in the training sample. Values: ‘to_nan’, ‘to_woe_0’, ‘to_maxfreq’, ‘to_maxp’, ‘to_minp’.
- nan_merge_to: str
How WoE values are filled in the test sample for real NaNs if they are not included in their own group. Values: ‘to_woe_0’, ‘to_maxfreq’, ‘to_maxp’, ‘to_minp’.
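A hedged sketch of the ‘to_woe_0’ and ‘to_maxfreq’ strategies (the ‘to_maxp’/‘to_minp’ variants, based on target probabilities, are omitted; the function is illustrative, not the library API):

```python
def unseen_category_woe(woe_map, train_counts, category, how="to_maxfreq"):
    """Fill the WoE of a category seen only at test time.
    woe_map: training WoE per category; train_counts: training frequencies."""
    if category in woe_map:
        return woe_map[category]
    if how == "to_woe_0":
        return 0.0  # neutral WoE
    if how == "to_maxfreq":
        # WoE of the most frequent training category
        return woe_map[max(train_counts, key=train_counts.get)]
    raise ValueError(f"strategy {how!r} not sketched here")
```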
- oof_woe: bool
Use OOF or standard encoding for WoE.
- n_folds: int
Number of folds for feature selection / encoding, etc.
- n_jobs: int > 0
Number of CPU cores to run in parallel.
- l1_base_step: real > 0
Grid size in l1 regularization
- l1_exp_step: real > 1
Grid scale in l1 regularization
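One plausible reading of these two knobs is a geometric grid of l1 strengths; the exact construction is internal to WhiteBox, so this is only a sketch with an illustrative function name:

```python
def l1_grid(l1_base_step, l1_exp_step, n_points):
    """Hypothetical geometric grid of l1 regularization strengths,
    base * scale**k: l1_base_step sets the size, l1_exp_step the scale."""
    return [l1_base_step * l1_exp_step ** k for k in range(n_points)]
```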
- population_size: None or int > 0
Feature selection type in the selector. If None, L1 boost is used. If an int is specified, a standard step will be used for the number of random subsamples indicated by this value. Can be generalized to a genetic algorithm.
- feature_groups_count: int > 0
The number of groups in the genetic algorithm. Its effect is visible only when population_size > 0.
- imp_type: str
Feature importance type; feature_imp and perm_imp are available. Used to sort the features at the first and final stages of feature selection.
- regularized_refit: bool
Use regularization at the time of model refit. Otherwise, we have a statistical model.
- p_val: 0 < p_val <= 1
When training a statistical model, perform backward selection until all p-values of the model’s coefficients are less than p_val.
- verbose: int 0-3
Verbosity level
- freeze_defaults:
True: params may be rewritten depending on the dataset. False: params may be changed only manually or with tuning.
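Putting the main knobs together — a sketch of an override dict using the keys documented above (values are illustrative, not recommended defaults); it would be passed as WbMLAlgo(default_params=...) per the signature at the top:

```python
# Illustrative subset of default_params overrides (keys from the list above).
wb_params = {
    "monotonic": True,           # only monotonic binnings
    "max_bin_count": 5,          # global bin limit
    "select_type": None,         # keep features with importance > 0
    "pearson_th": 0.9,
    "auc_th": 0.505,
    "oof_woe": True,             # out-of-fold WoE encoding
    "n_folds": 5,
    "regularized_refit": False,  # statistical refit with p-value selection
    "p_val": 0.05,
}
# algo = WbMLAlgo(default_params=wb_params, freeze_defaults=False)
```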
timer:
Timer instance or None.
- fit_predict_single_fold(train, valid)[source]
Implements training and prediction on a single fold.
- Parameters
train (PandasDataset) – Train dataset.
valid (PandasDataset) – Validation dataset.
- Return type
- Returns
Tuple (model, predicted_values).
- predict_single_fold(model, dataset)[source]
Predict target values for dataset.
- Parameters
model (Union[AutoWoE, ReportDeco]) – WhiteBox model.
dataset (PandasDataset) – Test dataset.
- Return type
- Returns
Predicted target values.
- fit(train_valid)[source]
Just to be compatible with ImportanceEstimator.
- Parameters
train_valid (TrainValidIterator) – Classic cv iterator.
- predict(dataset, report=False)[source]
Predict on new dataset.
- Parameters
dataset (PandasDataset) – Dataset.
report (bool) – Flag to generate report.
- Return type
- Returns
Dataset with predictions.