WbMLAlgo

class lightautoml.ml_algo.whitebox.WbMLAlgo(default_params=None, freeze_defaults=True, timer=None, optimization_search_space={})[source]

Bases: lightautoml.ml_algo.base.TabularMLAlgo

WhiteBox - scorecard model.

https://github.com/sberbank-ai-lab/AutoMLWhitebox

default_params:

  • monotonic: bool

    Global condition for monotonic constraints. If True, only monotonic binnings will be built. Values passed to the .fit method can override this condition for individual features.

  • max_bin_count: int

    Global limit on the number of bins. Can be specified per feature in .fit.

  • select_type: None or int

    Specifies the primary feature selection. If an integer, that number of features with the best feature importance is selected. If None, only features with feature importance greater than 0 are kept.

  • pearson_th: 0 < pearson_th < 1

    Threshold for feature selection by correlation. All features with an absolute correlation coefficient greater than pearson_th will be discarded.

  • auc_th: 0.5 < auc_th < 1

    Threshold for feature selection by one-dimensional AUC. WoE with AUC < auc_th will be discarded.

  • vif_th: vif_th > 0

    Threshold for feature selection by VIF. Features with VIF > vif_th are iteratively discarded one by one, then VIF is recalculated until all VIFs are less than vif_th.

  • imp_th: real >= 0

    Threshold for feature selection by feature importance.

  • th_const:

    Threshold that determines whether a feature is constant. If the number of valid values is greater than the threshold, the column is not considered constant. For a float value, the threshold is calculated as sample size * th_const.

  • force_single_split: bool

    The tree parameters can specify a minimum number of observations per leaf, so for some features a split into at least two bins may be impossible. If force_single_split = True, a single split will be created for such a feature when the minimum bin size is greater than th_const.

  • th_nan: int >= 0

    Threshold that determines when WoE values are calculated for NaNs.

  • th_cat: int >= 0

    Threshold that determines which categories are considered small (rare).

  • woe_diff_th: float = 0.01

    Option to merge the NaN bin and rare categories with another bin if the difference in WoE is less than woe_diff_th.

  • min_bin_size: int > 1, 0 < float < 1

    Minimum bin size when splitting.

  • min_bin_mults: list of floats > 1

    If min_bin_size is specified, a list of multipliers can be given to check whether larger values work better, for example: [2, 4].

  • min_gains_to_split: list of floats >= 0

    min_gain_to_split values that will be iterated to find the best split.

  • auc_tol: 1e-5 <= auc_tol <=1e-2

    AUC tolerance. You can lower the auc_tol value from the maximum to make the model simpler.

  • cat_alpha: float > 0

    Regularizer for category encoding.

  • cat_merge_to: str

    How WoE values are filled in the test sample for categories that are not present in the training sample. Possible values: ‘to_nan’, ‘to_woe_0’, ‘to_maxfreq’, ‘to_maxp’, ‘to_minp’.

  • nan_merge_to: str

    How WoE values are filled in the test sample for real NaNs that were not assigned to a group. Possible values: ‘to_woe_0’, ‘to_maxfreq’, ‘to_maxp’, ‘to_minp’.

  • oof_woe: bool

    Use OOF or standard encoding for WoE.

  • n_folds: int

    Number of folds for feature selection / encoding, etc.

  • n_jobs: int > 0

    Number of CPU cores to run in parallel.

  • l1_base_step: real > 0

    Grid size in l1 regularization.

  • l1_exp_step: real > 1

    Grid scale in l1 regularization.

  • population_size: None, int > 0

    Feature selection type in the selector. If None, L1 boost is used. If an int is specified, a standard step is used with that number of random subsamples. This can be generalized to a genetic algorithm.

  • feature_groups_count: int > 0

    The number of groups in the genetic algorithm. Its effect is visible only when population_size > 0.

  • imp_type: str

    Feature importance type; ‘feature_imp’ and ‘perm_imp’ are available. Used to sort the features at the first and the final stage of feature selection.

  • regularized_refit: bool

    Use regularization when refitting the model. Otherwise, a statistical model is fitted.

  • p_val: 0 < p_val <= 1

    When training a statistical model, perform backward selection until all p-values of the model’s coefficients are less than p_val.

  • verbose: int 0-3

    Verbosity level.
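The iterative procedure described under vif_th can be sketched in plain NumPy. This is only an illustration of the technique, not the library’s implementation; the data, feature names, and vif_th value below are made up:

```python
import numpy as np

def vif(X: np.ndarray, i: int) -> float:
    """VIF of column i: 1 / (1 - R^2) from regressing it on the other columns."""
    y = X[:, i]
    others = np.delete(X, i, axis=1)
    A = np.column_stack([others, np.ones(len(X))])  # add intercept
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1.0 - float(resid @ resid) / float(((y - y.mean()) ** 2).sum())
    return 1.0 / max(1.0 - r2, 1e-12)

def vif_select(X: np.ndarray, names: list, vif_th: float = 10.0) -> list:
    """Iteratively drop the feature with the largest VIF until all VIFs <= vif_th."""
    keep = list(range(X.shape[1]))
    while len(keep) > 1:
        vifs = [vif(X[:, keep], j) for j in range(len(keep))]
        worst = int(np.argmax(vifs))
        if vifs[worst] <= vif_th:
            break
        keep.pop(worst)  # discard one feature, then recalculate VIFs
    return [names[j] for j in keep]

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
# Third column is almost a duplicate of the first -> huge VIF.
X = np.column_stack([a, b, a + 0.01 * rng.normal(size=500)])
selected = vif_select(X, ["a", "b", "a_copy"], vif_th=10.0)
```

One of the two collinear features is removed, and the independent feature "b" always survives.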

freeze_defaults:
  • True : params may be rewritten depending on dataset.

  • False: params may be changed only manually or with tuning.

timer: Timer instance or None.
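The parameters above can be assembled into a default_params dict. The values below are illustrative choices, not the library defaults, and the construction lines assume lightautoml with the WhiteBox extra is installed:

```python
# Illustrative values only -- keys follow the default_params documented above.
wb_params = {
    "monotonic": False,          # allow non-monotonic binnings
    "max_bin_count": 5,          # global limit on the number of bins
    "select_type": None,         # keep features with feature importance > 0
    "pearson_th": 0.9,           # drop features with |correlation| above this
    "auc_th": 0.505,             # drop WoE features with one-dimensional AUC below this
    "vif_th": 10.0,              # iterative VIF selection threshold
    "n_folds": 6,                # folds for feature selection / encoding
    "oof_woe": True,             # out-of-fold WoE encoding
    "regularized_refit": False,  # statistical refit with p-value backward selection
    "p_val": 0.05,
    "verbose": 0,
}

# Construction (requires lightautoml installed):
# from lightautoml.ml_algo.whitebox import WbMLAlgo
# algo = WbMLAlgo(default_params=wb_params, freeze_defaults=False)
```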

fit_predict_single_fold(train, valid)[source]

Implements training and prediction on a single fold.

Parameters
  • train – Train dataset.

  • valid – Validation dataset.

Return type

Tuple[Union[AutoWoE, ReportDeco], ndarray]

Returns

Tuple (model, predicted_values).

predict_single_fold(model, dataset)[source]

Predict target values for dataset.

Parameters
  • model (Union[AutoWoE, ReportDeco]) – WhiteBox model

  • dataset (PandasDataset) – Test dataset.

Return type

ndarray

Returns

Predicted target values.

fit(train_valid)[source]

Present only for compatibility with ImportanceEstimator.

Parameters

train_valid (TrainValidIterator) – Classic cv iterator.

predict(dataset, report=False)[source]

Predict on new dataset.

Parameters
  • dataset – Dataset used for prediction.

  • report – Whether to generate a WhiteBox report.

Return type

NumpyDataset

Returns

Dataset with predictions.
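Conceptually, a fitted scorecard produces its predictions as a logistic function of the WoE-encoded features. A minimal sketch of that final step; the WoE values, coefficients, and intercept below are made up, not taken from a real fitted model:

```python
import numpy as np

def scorecard_predict(woe_features: np.ndarray,
                      coefs: np.ndarray,
                      intercept: float) -> np.ndarray:
    """Probability of the positive class from WoE-encoded features."""
    z = woe_features @ coefs + intercept
    return 1.0 / (1.0 + np.exp(-z))  # sigmoid

# Two observations, three WoE-encoded features (illustrative numbers).
woe = np.array([[ 0.4, -0.2,  0.1],
                [-0.5,  0.3, -0.7]])
probs = scorecard_predict(woe, coefs=np.array([1.2, 0.8, 0.5]), intercept=-0.1)
```

The first observation has higher WoE scores on the positively weighted features, so it receives the higher predicted probability.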