LightAutoML documentation
LightAutoML is an open-source Python library aimed at automated machine learning. It is designed to be lightweight and efficient for various tasks on tabular and text data. LightAutoML provides easy-to-use pipeline creation that enables:
Automatic hyperparameter tuning and data processing.
Automatic typing and feature selection.
Automatic time utilization.
Automatic report creation.
Easy-to-use modular scheme to create your own pipelines.
Installation Guide
Basic
You can install LightAutoML from PyPI.
pip install lightautoml
Development
You can also clone the repository and install it with poetry. First, install poetry. Then:
git clone git@github.com:AILab-MLTools/LightAutoML.git
cd LightAutoML
# Create virtual environment inside your project directory
poetry config virtualenvs.in-project true
# If you want to update dependencies, run the command:
poetry lock
# Installation
poetry install
Tutorials
Tutorial 1: Basics
The official LightAutoML GitHub repository is here.
In this tutorial you will learn how to:
- run LightAutoML training on tabular data
- obtain feature importances and reports
- configure resource usage in LightAutoML
0. Prerequisites
0.0. Install LightAutoML
[1]:
#!pip install -U lightautoml
0.1. Import libraries
Here we will import the libraries we use in this kernel:
- Standard python libraries for timing, working with OS, HTTP requests etc.
- Essential python DS libraries like numpy, pandas, scikit-learn and torch (the latter will be used in the setup cell below)
- LightAutoML modules: presets for AutoML, the task and the report generation module
[2]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
[3]:
# Standard python libraries
import os
import requests
# Essential DS libraries
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
import torch
# LightAutoML presets, task and report generation
from lightautoml.automl.presets.tabular_presets import TabularAutoML, TabularUtilizedAutoML
from lightautoml.tasks import Task
from lightautoml.report.report_deco import ReportDeco
0.2. Constants
Here we setup some parameters to use in the kernel:
- N_THREADS - number of vCPUs for LightAutoML model creation
- N_FOLDS - number of folds in LightAutoML inner CV
- RANDOM_STATE - random seed for better reproducibility
- TEST_SIZE - holdout data part size
- TIMEOUT - limit in seconds for the model to train
- TARGET_NAME - target column name in the dataset
[4]:
N_THREADS = 4
N_FOLDS = 5
RANDOM_STATE = 42
TEST_SIZE = 0.2
TIMEOUT = 300
TARGET_NAME = 'TARGET'
[5]:
DATASET_DIR = '../data/'
DATASET_NAME = 'sampled_app_train.csv'
DATASET_FULLNAME = os.path.join(DATASET_DIR, DATASET_NAME)
DATASET_URL = 'https://raw.githubusercontent.com/AILab-MLTools/LightAutoML/master/examples/data/sampled_app_train.csv'
0.3. Imported models setup
For better reproducibility, fix the numpy random seed and set the maximum number of threads for Torch (which usually tries to use all the threads on the server):
[6]:
np.random.seed(RANDOM_STATE)
torch.set_num_threads(N_THREADS)
0.4. Data loading
Let’s check the data we have:
[7]:
if not os.path.exists(DATASET_FULLNAME):
    os.makedirs(DATASET_DIR, exist_ok=True)
    dataset = requests.get(DATASET_URL).text
    with open(DATASET_FULLNAME, 'w') as output:
        output.write(dataset)
[8]:
data = pd.read_csv(DATASET_DIR + DATASET_NAME)
data.head()
[8]:
SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 313802 | 0 | Cash loans | M | N | Y | 0 | 270000.0 | 327024.0 | 15372.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
1 | 319656 | 0 | Cash loans | F | N | N | 0 | 108000.0 | 675000.0 | 19737.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
2 | 207678 | 0 | Revolving loans | F | Y | Y | 2 | 112500.0 | 270000.0 | 13500.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
3 | 381593 | 0 | Cash loans | F | N | N | 1 | 67500.0 | 142200.0 | 9630.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.0 |
4 | 258153 | 0 | Cash loans | F | Y | Y | 0 | 337500.0 | 1483231.5 | 46570.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 |
5 rows × 122 columns
[9]:
data.shape
[9]:
(10000, 122)
0.5. Data splitting for train-holdout
As we have only one file with target values, we can split it into 80%-20% for holdout usage:
[10]:
train_data, test_data = train_test_split(
data,
test_size=TEST_SIZE,
stratify=data[TARGET_NAME],
random_state=RANDOM_STATE
)
print(f'Data is split. Parts sizes: train_data = {train_data.shape}, test_data = {test_data.shape}')
train_data.head()
Data is split. Parts sizes: train_data = (8000, 122), test_data = (2000, 122)
[10]:
SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
6444 | 112261 | 0 | Cash loans | F | N | N | 1 | 90000.0 | 640080.0 | 31261.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
3586 | 115058 | 0 | Cash loans | F | N | Y | 0 | 180000.0 | 239850.0 | 23850.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 |
9349 | 326623 | 0 | Cash loans | F | N | Y | 0 | 112500.0 | 337500.0 | 31086.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 |
7734 | 191976 | 0 | Cash loans | M | Y | Y | 1 | 67500.0 | 135000.0 | 9018.0 | ... | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
2174 | 281519 | 0 | Revolving loans | F | N | Y | 0 | 67500.0 | 202500.0 | 10125.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 |
5 rows × 122 columns
Note: missing values (NaN and others) in the data should be left as is, unless the reason for their presence or their specific meaning is known. Otherwise, the AutoML model will perceive manually filled NaNs as a true pattern between the data and the target variable, without knowledge and assumptions about the missing values, which can negatively affect the model quality. LightAutoML can deal with missing values and outliers automatically.
1. Task definition
1.1. Task type
First we need to create a Task object - the class used to set up what task the LightAutoML model should solve, with a specific loss and metric if necessary (more info can be found here in our documentation).
The following task types are available:
- 'binary' - for binary classification.
- 'reg' - for regression.
- 'multiclass' - for multiclass classification.
- 'multi:reg' - for multiple regression.
- 'multilabel' - for multi-label classification.
In this example we will consider a binary classification:
[11]:
task = Task('binary')
Note: only the logloss loss is available for the binary task and it is the default loss. The default metric for binary classification is ROC-AUC. See more info about available and default losses and metrics here.
Depending on the task, you can and should choose exactly those metrics and losses that you want and need to optimize.
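For example, here is a hedged sketch of a binary Task with the loss and metric set explicitly; the 'logloss' and 'auc' string identifiers are assumptions to check against the losses/metrics table linked above:
from lightautoml.tasks import Task

# A minimal sketch, assuming 'logloss' and 'auc' are valid identifiers here
task_explicit = Task('binary', loss='logloss', metric='auc')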
1.2. Feature roles setup
To solve the task, we need to set up column roles. LightAutoML can automatically define the types and roles of data columns, but it is also possible to specify them directly through the dictionary parameter roles when training the AutoML model (see the next section, "AutoML training"). A specific role can be specified using a string with its name (any role can be set like this). In that case the key in the dictionary must be the name of the role, and the value must be a list of the names of the corresponding columns in the dataset.
The only role you must set up is the 'target' role (that is, the column with the target variable), everything else ('drop', 'numeric', 'categorical', 'group', 'weights' etc.) is up to the user:
[12]:
roles = {
'target': TARGET_NAME,
'drop': ['SK_ID_CURR']
}
You can also optionally specify the following roles:
- 'numeric' - numerical feature
- 'category' - categorical feature
- 'text' - text data
- 'datetime' - features with date and time
- 'date' - features with date only
- 'group' - features by which the data can be divided into groups and which can be taken into account for group k-fold validation (so the same group is not represented in both testing and training sets)
- 'drop' - features to drop, they will not be used in model building
- 'weights' - object weights for the loss and metric
- 'path' - image file paths (for CV tasks)
- 'treatment' - object group in uplift modelling tasks: treatment or control
Note: the role name can be written in any case. It is also possible to pass individual objects of role classes with specific arguments instead of strings with role names, for specific tasks and more optimal pipeline construction (more details).
For example, to set the date role, you can use the DatetimeRole class.
[13]:
#from lightautoml.dataset.roles import DatetimeRole
Different seasonality can be extracted from the data through the seasonality parameter: years ('y'), months ('m'), days ('d'), weekdays ('wd'), hours ('hour'), minutes ('min'), seconds ('sec'), milliseconds ('ms'), nanoseconds ('ns'). These features will be considered categorical. Another important parameter is base_date. It allows you to specify the base date and convert the feature to distances to this date (it is set to False by default). Also, all role classes have a force_input parameter: if it is True, the corresponding features will pass all further feature selections and won't be excluded (it equals False by default). It is also always possible to specify the data type for any role using the dtype argument.
Here is an example of such a role assignment through a class object for a date feature (there is no such feature in the considered dataset):
[14]:
# roles = {
# DatetimeRole(base_date=False, seasonality=('d', 'wd', 'hour')): 'date_time'
# }
Any role can be set through a class object. Information about specific parameters of specific roles and other detailed information can be found here.
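As an illustration, here is a hedged sketch of a roles dictionary mixing plain strings with role class objects, following the force_input and dtype description above; all column names and the exact constructor arguments are assumptions and are not part of this dataset:
import numpy as np
from lightautoml.dataset.roles import CategoryRole, DatetimeRole, NumericRole

roles_example = {
    'target': 'TARGET',
    # keep this (hypothetical) numeric feature through all selection stages
    NumericRole(np.float32, force_input=True): 'important_ratio',
    # categorical feature passed as a role class object
    CategoryRole(str): 'region_code',
    # date feature with day / weekday / hour seasonality and no base date distance
    DatetimeRole(base_date=False, seasonality=('d', 'wd', 'hour')): 'signup_time',
}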
1.3. LightAutoML model creation - TabularAutoML preset
Next we are going to create a LightAutoML model with the TabularAutoML class - a preset with a default model structure, in just several lines.
In general, the whole AutoML model consists of multiple levels, which can contain several pipelines with their own sets of data processing methods and ML models. The outputs of one level are the inputs of the next, and on the last level the predictions of the previous level's models are combined with a blending procedure. All of this can be combined into a model using the AutoML class and its various descendants (like TabularAutoML).
Let's look at how the LightAutoML model is arranged and what it consists of in general.
1.3.1 Reader object
First, the task and data go into the Reader object. It analyzes the data and extracts various valuable information from it. It can also detect and remove useless features, conduct feature selection, determine column types and roles etc. Let's look at these steps in more detail.
Role and types guessing
Roles can be specified as a string or a specific class object, or defined automatically. For the TabularAutoML preset, the 'numeric', 'datetime' and 'category' roles can be defined automatically. There are two ways of defining roles. The first is very simple: check whether the value can be converted to a date ('datetime'), otherwise check whether it can be converted to a number ('numeric'), otherwise declare it a category ('category'). But this method may not work well on large data or when categories are encoded with integers. The second method is based on statistics: the distributions of numerical features are considered and compared with the distributions of real or categorical values. Different ways of encoding a feature (as a number or as a category) are also compared, and based on the normalized Gini index it is decided which encoding is better. For this case a set of specific rules is created, and if at least one of them is fulfilled, the feature will be assigned to numerical, otherwise to categorical. This check can be enabled or disabled using the advanced_roles parameter.
If roles are explicitly specified, the automatic definition won't be applied to those dataset columns. When a role is specified as an object of a certain class, its arguments make it possible to set the processing parameters in more detail.
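For illustration, here is a hedged sketch of switching the statistics-based guessing off; passing advanced_roles through reader_params is an assumption, and the rest mirrors the preset created later in this tutorial:
# A sketch: rely only on the simple type-casting checks when guessing roles
automl_simple_guess = TabularAutoML(
    task=task,
    timeout=TIMEOUT,
    cpu_limit=N_THREADS,
    reader_params={
        'n_jobs': N_THREADS,
        'cv': N_FOLDS,
        'random_state': RANDOM_STATE,
        'advanced_roles': False,  # disable the statistics-based role guessing
    },
)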
Feature selection
In general, the AutoML pipeline uses pre-selection, generation and post-selection of features. TabularAutoML has no post-selection stage. There are three feature selection modes: no selection, selection based on feature importances, and a stricter selection (forward selection). A GBM model is used to evaluate feature importances. Importances can be calculated in 2 ways: based on splits (how many times a split was made on each feature in the entire ensemble) or using permutation feature importances (shuffling feature values during validation and assessing the resulting quality change). The second method is harder to compute and requires holdout data. Features with importance above a certain threshold are then selected. The stricter method is forward selection: features are sorted in descending order of importance, then a model is built on blocks of features (of size 1 by default) and its quality is measured; the next block of features is added and kept only if the quality improves with it, and so on.
Also, LightAutoML can merge some columns if it is adequate and leads to an improvement in the model quality (for example, an intersection between categorical variables). Different column join options are considered, and the best one is chosen by the normalized Gini index.
1.3.2 Machine learning pipelines architecture and training
As a result, after analyzing and processing the data, the Reader object forms and returns a LAMA Dataset. It contains the original data and markup with meta-information. In this dataset it is possible to see the roles defined by the Reader object, the selected features etc. Then ML pipelines are trained on this data.
Each such pipeline is one or more machine learning algorithms that share one post-processing block and one validation scheme. Several such pipelines can be trained in parallel on one dataset, and they form a level. The number of levels is not limited. The list of all levels of the AutoML pipeline is available via the .levels attribute of the AutoML instance. Level predictions can be inputs to other models or ML pipelines (i.e. a stacking scheme). As inputs for subsequent levels, it is also possible to use the original data by setting the skip_conn argument to True when initializing the preset instance. At the last level, if there are several pipelines, blending is used to build the prediction.
Different types of features are processed depending on the models. For the linear model, numerical features are preprocessed with standardization, replacement of missing values with the median, discretization, and log-odds (if the feature is a probability - the output of a previous level). Categories are processed using label encoding (by default), one-hot encoding, ordinal encoding, frequency encoding, or out-of-fold target encoding.
The following algorithms are available in LightAutoML: linear regression with L2 regularization, LightGBM, CatBoost, and random forest.
By default, KFold cross-validation is used during training at all levels (for hyperparameter optimization and for building out-of-fold predictions), and for each algorithm a separate model is built for each validation fold, and their predictions are averaged. So the predictions at each level and the resulting prediction during training are out-of-fold predictions. But it is also possible to just pass holdout data for validation, or to use a custom cross-validation scheme by passing a cv_iter iterator that returns the indices of the objects used for validation. LightAutoML has ready-made iterators, for example, TimeSeriesIterator for time series splits. To further reduce the effect of overfitting, it is possible to use nested cross-validation (the nested_cv parameter), but it is not used by default.
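A hedged sketch of plugging in a custom validation scheme follows; the TimeSeriesIterator import path, its constructor arguments and the cv_iter argument of fit_predict are assumptions to check against your LightAutoML version, and 'app_date' is a hypothetical datetime column:
from lightautoml.validation.np_iterators import TimeSeriesIterator  # assumed import path

# order observations by a hypothetical datetime column, validate on later chunks
ts_iterator = TimeSeriesIterator(train_data['app_date'], n_splits=N_FOLDS)

# the iterator would then replace the default KFold scheme during training:
# oof_pred = automl.fit_predict(train_data, roles=roles, cv_iter=ts_iterator, verbose=1)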
Prediction on new data is the averaging of models over all folds from validation and blending.
Hyperparameter tuning of the machine learning algorithms can be performed during training (early stopping by the number of trees in gradient boosting or the number of training epochs of neural networks etc.), based on expert rules (so-called expert parameters, derived from data characteristics and empirical recommendations), by sequential model-based optimization (SMBO, Bayesian optimization: Optuna with TPESampler) or by grid search. LightGBM and CatBoost can be used either with parameter tuning or with expert parameters and no tuning. For linear regression, parameters are always tuned using a warm-start model training technique.
At the last level, blending is used to build the prediction. There are three available blending methods: choosing the best model based on a given metric (the other models are just discarded), simple averaging of all models, or weighted averaging (the weights are selected using a coordinate descent algorithm that optimizes the given metric). TabularAutoML uses the latter strategy by default. It is worth noting that, unlike stacking, blending can exclude models from the composition.
1.3.3 Timing
When an AutoML object is created, a certain time limit is set. AutoML schedules a list of tasks that it can complete during this time, initially allocating approximately equal time to each task, and in the process of solving them it understands how to adjust the time allocated to the different subtasks. If AutoML finishes earlier than the set timeout, it means it completed the entire list of tasks. If AutoML worked up to the limit and stopped, then most likely it sacrificed something: for example, it reduced the number of algorithms to train, realized it would not have time to train the next one, or did not complete the full cross-validation cycle for one of the models (in that case, on folds where the model has not been trained, the predictions will be NaN, and the model related to those folds will not participate in the final averaging). The resulting quality is evaluated at the blending stage, and if necessary and possible, the composition will be corrected.
If you do not set a time limit for AutoML during initialization, then by default it will be equal to a very large number, that is, sooner or later AutoML will complete all tasks.
1.3.4 LightAutoML model creation
So the entire AutoML pipeline can be composed from various parts by the user (see the custom pipeline tutorial), but it is also possible to use presets - in a certain sense, fixed strategies for dynamic pipeline building.
Here is a default AutoML pipeline for binary classification and regression tasks (TabularAutoML
preset):
Let's discuss some of the params we can set up:
- task - the type of the ML task (the only required parameter)
- timeout - time limit in seconds for the model to train
- cpu_limit - vCPU count for the model to use
- reader_params - parameters of the Reader object inside the preset, which works on the first step of data preparation: automatic feature typization, preliminary filtering of almost-constant features, correct CV setup etc. For example, we set n_jobs threads for the typization algorithm, cv folds and random_state as the inner CV seed.
- general_params - a general parameters dictionary, in which it is possible to specify the list of algorithms used ('use_algos'), whether to use nested CV ('nested_cv') etc.
Important note: the reader_params key is one of the YAML config keys used inside the TabularAutoML preset. More details on its structure, with explanatory comments, can be found at the attached link. Each key from this config can be overridden with user settings during preset object initialization. To get more info about setting different parameters (for example, the ML algos which can be used in general_params->use_algos), please take a look at our article on TowardsDataScience.
Moreover, to receive the automatic report for our model we will use the ReportDeco decorator and work with the decorated version in the same way as we do with the usual one.
[15]:
automl = TabularAutoML(
task = task,
timeout = TIMEOUT,
cpu_limit = N_THREADS,
reader_params = {'n_jobs': N_THREADS, 'cv': N_FOLDS, 'random_state': RANDOM_STATE},
)
2. AutoML training
To run AutoML training, use the fit_predict method.
Main arguments:
- train_data - dataset to train on
- roles - column roles dict
- verbose - controls the verbosity: the higher, the more messages: <1: messages are not displayed; >=1: the computation process for layers is displayed; >=2: information about fold processing is also displayed; >=3: the hyperparameter optimization process is also displayed; >=4: the training process for every algorithm is displayed.
Note: the out-of-fold prediction is calculated during training and returned from the fit_predict method.
[16]:
%%time
out_of_fold_predictions = automl.fit_predict(train_data, roles = roles, verbose = 1)
[11:02:15] Stdout logging level is INFO.
[11:02:15] Copying TaskTimer may affect the parent PipelineTimer, so copy will create new unlimited TaskTimer
[11:02:15] Task: binary
[11:02:15] Start automl preset with listed constraints:
[11:02:15] - time: 300.00 seconds
[11:02:15] - CPU: 4 cores
[11:02:15] - memory: 16 GB
[11:02:15] Train data shape: (8000, 122)
[11:02:18] Layer 1 train process start. Time left 297.23 secs
[11:02:18] Start fitting Lvl_0_Pipe_0_Mod_0_LinearL2 ...
[11:02:21] Fitting Lvl_0_Pipe_0_Mod_0_LinearL2 finished. score = 0.7351175537276247
[11:02:21] Lvl_0_Pipe_0_Mod_0_LinearL2 fitting and predicting completed
[11:02:21] Time left 294.29 secs
[11:02:23] Selector_LightGBM fitting and predicting completed
[11:02:24] Start fitting Lvl_0_Pipe_1_Mod_0_LightGBM ...
[11:02:38] Fitting Lvl_0_Pipe_1_Mod_0_LightGBM finished. score = 0.7139016076564749
[11:02:38] Lvl_0_Pipe_1_Mod_0_LightGBM fitting and predicting completed
[11:02:38] Start hyperparameters optimization for Lvl_0_Pipe_1_Mod_1_Tuned_LightGBM ... Time budget is 1.00 secs
[11:02:43] Hyperparameters optimization for Lvl_0_Pipe_1_Mod_1_Tuned_LightGBM completed
[11:02:43] Start fitting Lvl_0_Pipe_1_Mod_1_Tuned_LightGBM ...
[11:02:58] Fitting Lvl_0_Pipe_1_Mod_1_Tuned_LightGBM finished. score = 0.6809442140928409
[11:02:58] Lvl_0_Pipe_1_Mod_1_Tuned_LightGBM fitting and predicting completed
[11:02:58] Start fitting Lvl_0_Pipe_1_Mod_2_CatBoost ...
[11:03:03] Fitting Lvl_0_Pipe_1_Mod_2_CatBoost finished. score = 0.7205654339932637
[11:03:03] Lvl_0_Pipe_1_Mod_2_CatBoost fitting and predicting completed
[11:03:03] Start hyperparameters optimization for Lvl_0_Pipe_1_Mod_3_Tuned_CatBoost ... Time budget is 168.11 secs
[11:04:57] Hyperparameters optimization for Lvl_0_Pipe_1_Mod_3_Tuned_CatBoost completed
[11:04:57] Start fitting Lvl_0_Pipe_1_Mod_3_Tuned_CatBoost ...
[11:05:06] Fitting Lvl_0_Pipe_1_Mod_3_Tuned_CatBoost finished. score = 0.7450030072205183
[11:05:06] Lvl_0_Pipe_1_Mod_3_Tuned_CatBoost fitting and predicting completed
[11:05:06] Time left 129.22 secs
[11:05:06] Layer 1 training completed.
[11:05:06] Blending: optimization starts with equal weights and score 0.747978018908178
[11:05:06] Blending: iteration 0: score = 0.7502378457373473, weights = [0.25709525 0.13884845 0.0858601 0.05886992 0.4593263 ]
[11:05:06] Blending: iteration 1: score = 0.7502116959937104, weights = [0.2473392 0.14313795 0.07531787 0.06068861 0.47351643]
[11:05:06] Blending: iteration 2: score = 0.7502116959937104, weights = [0.2473392 0.14313795 0.07531787 0.06068861 0.47351643]
[11:05:06] Blending: no score update. Terminated
[11:05:06] Automl preset training completed in 171.05 seconds
[11:05:06] Model description:
Final prediction for new objects (level 0) =
0.24734 * (5 averaged models Lvl_0_Pipe_0_Mod_0_LinearL2) +
0.14314 * (5 averaged models Lvl_0_Pipe_1_Mod_0_LightGBM) +
0.07532 * (5 averaged models Lvl_0_Pipe_1_Mod_1_Tuned_LightGBM) +
0.06069 * (5 averaged models Lvl_0_Pipe_1_Mod_2_CatBoost) +
0.47352 * (5 averaged models Lvl_0_Pipe_1_Mod_3_Tuned_CatBoost)
CPU times: user 11min 19s, sys: 1min 30s, total: 12min 50s
Wall time: 2min 51s
After training we can see logs with all the progress, final scores, weights assigned to the models in the final prediction etc.
Note: if the timeout is too small, fit_predict may return a model with, for example, only 3 out of 5 LightGBM fold models (you would see it from a log line like 0.25685 * (3 averaged models Lvl_0_Pipe_1_Mod_0_LightGBM) at the end); to fix this, set a bigger timeout so LightAutoML can train all the models.
3. Prediction on holdout and model evaluation
Now we can use the trained AutoML model to build predictions on the holdout set and evaluate the model quality. Note that in the case of classification tasks the LightAutoML model returns probabilities as predictions.
[17]:
%%time
test_predictions = automl.predict(test_data)
print(f'Prediction for test_data:\n{test_predictions}\nShape = {test_predictions.shape}')
Prediction for test_data:
array([[0.06620218],
[0.06621333],
[0.03255654],
...,
[0.06863909],
[0.04567214],
[0.2046678 ]], dtype=float32)
Shape = (2000, 1)
CPU times: user 3.33 s, sys: 452 ms, total: 3.78 s
Wall time: 646 ms
[18]:
print(f'OOF score: {roc_auc_score(train_data[TARGET_NAME].values, out_of_fold_predictions.data[:, 0])}')
print(f'HOLDOUT score: {roc_auc_score(test_data[TARGET_NAME].values, test_predictions.data[:, 0])}')
OOF score: 0.7502681411720484
HOLDOUT score: 0.7327955163043479
4. Model analysis
4.1. Reports
You can obtain the description of the resulting pipeline:
[19]:
print(automl.create_model_str_desc())
Final prediction for new objects (level 0) =
0.24734 * (5 averaged models Lvl_0_Pipe_0_Mod_0_LinearL2) +
0.14314 * (5 averaged models Lvl_0_Pipe_1_Mod_0_LightGBM) +
0.07532 * (5 averaged models Lvl_0_Pipe_1_Mod_1_Tuned_LightGBM) +
0.06069 * (5 averaged models Lvl_0_Pipe_1_Mod_2_CatBoost) +
0.47352 * (5 averaged models Lvl_0_Pipe_1_Mod_3_Tuned_CatBoost)
Also, for this purpose LightAutoML has ReportDeco; use it to build detailed reports:
[20]:
RD = ReportDeco(output_path = 'tabularAutoML_model_report')
automl_rd = RD(
TabularAutoML(
task = task,
timeout = TIMEOUT,
cpu_limit = N_THREADS,
reader_params = {'n_jobs': N_THREADS, 'cv': N_FOLDS, 'random_state': RANDOM_STATE}
)
)
[21]:
%%time
out_of_fold_predictions = automl_rd.fit_predict(train_data, roles = roles, verbose = 1)
[11:05:07] Stdout logging level is INFO.
[11:05:07] Task: binary
[11:05:07] Start automl preset with listed constraints:
[11:05:07] - time: 300.00 seconds
[11:05:07] - CPU: 4 cores
[11:05:07] - memory: 16 GB
[11:05:07] Train data shape: (8000, 122)
[11:05:08] Layer 1 train process start. Time left 299.03 secs
[11:05:09] Start fitting Lvl_0_Pipe_0_Mod_0_LinearL2 ...
[11:05:11] Fitting Lvl_0_Pipe_0_Mod_0_LinearL2 finished. score = 0.7351175537276247
[11:05:11] Lvl_0_Pipe_0_Mod_0_LinearL2 fitting and predicting completed
[11:05:11] Time left 296.09 secs
[11:05:14] Selector_LightGBM fitting and predicting completed
[11:05:14] Start fitting Lvl_0_Pipe_1_Mod_0_LightGBM ...
[11:05:28] Fitting Lvl_0_Pipe_1_Mod_0_LightGBM finished. score = 0.7139016076564749
[11:05:28] Lvl_0_Pipe_1_Mod_0_LightGBM fitting and predicting completed
[11:05:28] Start hyperparameters optimization for Lvl_0_Pipe_1_Mod_1_Tuned_LightGBM ... Time budget is 1.00 secs
[11:05:34] Hyperparameters optimization for Lvl_0_Pipe_1_Mod_1_Tuned_LightGBM completed
[11:05:34] Start fitting Lvl_0_Pipe_1_Mod_1_Tuned_LightGBM ...
[11:05:48] Fitting Lvl_0_Pipe_1_Mod_1_Tuned_LightGBM finished. score = 0.6809442140928409
[11:05:48] Lvl_0_Pipe_1_Mod_1_Tuned_LightGBM fitting and predicting completed
[11:05:48] Start fitting Lvl_0_Pipe_1_Mod_2_CatBoost ...
[11:05:53] Fitting Lvl_0_Pipe_1_Mod_2_CatBoost finished. score = 0.7205654339932637
[11:05:53] Lvl_0_Pipe_1_Mod_2_CatBoost fitting and predicting completed
[11:05:53] Start hyperparameters optimization for Lvl_0_Pipe_1_Mod_3_Tuned_CatBoost ... Time budget is 173.09 secs
[11:07:42] Hyperparameters optimization for Lvl_0_Pipe_1_Mod_3_Tuned_CatBoost completed
[11:07:42] Start fitting Lvl_0_Pipe_1_Mod_3_Tuned_CatBoost ...
[11:07:50] Fitting Lvl_0_Pipe_1_Mod_3_Tuned_CatBoost finished. score = 0.7450030072205183
[11:07:50] Lvl_0_Pipe_1_Mod_3_Tuned_CatBoost fitting and predicting completed
[11:07:50] Time left 137.02 secs
[11:07:50] Layer 1 training completed.
[11:07:50] Blending: optimization starts with equal weights and score 0.747978018908178
[11:07:51] Blending: iteration 0: score = 0.7502378457373473, weights = [0.25709525 0.13884845 0.0858601 0.05886992 0.4593263 ]
[11:07:51] Blending: iteration 1: score = 0.7502116959937104, weights = [0.2473392 0.14313795 0.07531787 0.06068861 0.47351643]
[11:07:51] Blending: iteration 2: score = 0.7502116959937104, weights = [0.2473392 0.14313795 0.07531787 0.06068861 0.47351643]
[11:07:51] Blending: no score update. Terminated
[11:07:51] Automl preset training completed in 163.25 seconds
[11:07:51] Model description:
Final prediction for new objects (level 0) =
0.24734 * (5 averaged models Lvl_0_Pipe_0_Mod_0_LinearL2) +
0.14314 * (5 averaged models Lvl_0_Pipe_1_Mod_0_LightGBM) +
0.07532 * (5 averaged models Lvl_0_Pipe_1_Mod_1_Tuned_LightGBM) +
0.06069 * (5 averaged models Lvl_0_Pipe_1_Mod_2_CatBoost) +
0.47352 * (5 averaged models Lvl_0_Pipe_1_Mod_3_Tuned_CatBoost)
CPU times: user 11min 17s, sys: 1min 41s, total: 12min 59s
Wall time: 2min 46s
The report will be available in the folder with the specified name (the output_path argument in the ReportDeco initialization).
[22]:
!ls tabularAutoML_model_report
feature_importance.png test_roc_curve_1.png
lama_interactive_report.html valid_distribution_of_logits.png
test_distribution_of_logits_1.png valid_pie_f1_metric.png
test_pie_f1_metric_1.png valid_pr_curve.png
test_pr_curve_1.png valid_preds_distribution_by_bins.png
test_preds_distribution_by_bins_1.png valid_roc_curve.png
[23]:
%%time
test_predictions = automl_rd.predict(test_data)
print(f'Prediction for test_data:\n{test_predictions}\nShape = {test_predictions.shape}')
Prediction for test_data:
array([[0.06620218],
[0.06621333],
[0.03255654],
...,
[0.06863909],
[0.04567214],
[0.2046678 ]], dtype=float32)
Shape = (2000, 1)
CPU times: user 14.3 s, sys: 9.01 s, total: 23.3 s
Wall time: 2.59 s
[24]:
print(f'OOF score: {roc_auc_score(train_data[TARGET_NAME].values, out_of_fold_predictions.data[:, 0])}')
print(f'HOLDOUT score: {roc_auc_score(test_data[TARGET_NAME].values, test_predictions.data[:, 0])}')
OOF score: 0.7502681411720484
HOLDOUT score: 0.7327955163043479
4.2 Feature importances calculation
For feature importance calculation we have 2 different methods in LightAutoML:
- Fast ('fast') - this method uses feature importances from the LGBM feature selector inside LightAutoML. It works extremely fast and almost always (almost, because of situations when feature selection is turned off or the selector was removed from the final model together with all GBM models). It needs no new labelled data.
- Accurate ('accurate') - this method calculates permutation feature importances for the whole LightAutoML model based on new labelled data. It always works but can take a lot of time to finish (depending on the model structure, new labelled dataset size etc.).
In the cell below we will use automl_rd.model instead of automl_rd because we want to take the importances from the model, not from the report. But be careful: everything which is calculated using automl_rd.model will not go into the report.
[25]:
%%time
# Fast feature importances calculation
fast_fi = automl_rd.model.get_feature_scores('fast')
fast_fi.set_index('Feature')['Importance'].plot.bar(figsize = (30, 10), grid = True)
CPU times: user 275 ms, sys: 117 ms, total: 393 ms
Wall time: 202 ms
[25]:
<Axes: xlabel='Feature'>

[26]:
%%time
# Accurate feature importances calculation with detailed info (Permutation importances) - can take long time to calculate
accurate_fi = automl_rd.model.get_feature_scores('accurate', test_data, silent = True)
CPU times: user 6min 32s, sys: 20.1 s, total: 6min 52s
Wall time: 1min 7s
[27]:
accurate_fi.set_index('Feature')['Importance'].plot.bar(figsize = (30, 10), grid = True)
[27]:
<Axes: xlabel='Feature'>

Bonus: where is the automatic report?
As we used automl_rd in our training and prediction cells, the report is already ready in the folder we specified - you can check the output folder and find the tabularAutoML_model_report folder with the lama_interactive_report.html report inside (or just click this link for short). It's interactive, so you can click the black triangles on the left of the texts to go deeper into the selected part.
5. Spending more of the TIMEOUT - TabularUtilizedAutoML usage
To spend (almost) all of the TIMEOUT time on building the model, we can use the TabularUtilizedAutoML preset instead of TabularAutoML; it has the same API. By default the TabularUtilizedAutoML model trains with 7 different parameter configurations (see this for more details) sequentially, and if there is time left, the whole AutoML pipeline with these configs is run again with another cross-validation seed, and so on. The results for each pipeline model are then averaged over the considered validation seeds, and all averaged results are finally combined through blending. The user can set their own set of configs by passing a list of paths to the corresponding files in the configs_list argument during TabularUtilizedAutoML instance initialization. Such configs allow the user to configure all pipeline parameters and can be used with any available preset.
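For example, here is a hedged sketch of passing a custom set of configs to TabularUtilizedAutoML (the YAML paths below are hypothetical placeholders):
# A sketch: run the utilized preset only over a custom list of config files
custom_utilized_automl = TabularUtilizedAutoML(
    task=task,
    timeout=900,
    cpu_limit=N_THREADS,
    configs_list=[
        'my_configs/conf_heavy_tuning.yml',   # hypothetical config path
        'my_configs/conf_no_selection.yml',   # hypothetical config path
    ],
)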
[28]:
utilized_automl = TabularUtilizedAutoML(
task = task,
timeout = 900,
cpu_limit = N_THREADS,
reader_params = {'n_jobs': N_THREADS, 'cv': N_FOLDS, 'random_state': RANDOM_STATE},
)
[29]:
%%time
out_of_fold_predictions = utilized_automl.fit_predict(train_data, roles = roles, verbose = 1)
[11:09:09] Start automl utilizator with listed constraints:
[11:09:09] - time: 900.00 seconds
[11:09:09] - CPU: 4 cores
[11:09:09] - memory: 16 GB
[11:09:09] If one preset completes earlier, next preset configuration will be started
[11:09:09] ==================================================
[11:09:09] Start 0 automl preset configuration:
[11:09:09] conf_0_sel_type_0.yml, random state: {'reader_params': {'random_state': 42}, 'nn_params': {'random_state': 42}, 'general_params': {'return_all_predictions': False}}
[11:09:09] Stdout logging level is INFO.
[11:09:09] Task: binary
[11:09:09] Start automl preset with listed constraints:
[11:09:09] - time: 900.00 seconds
[11:09:09] - CPU: 4 cores
[11:09:09] - memory: 16 GB
[11:09:09] Train data shape: (8000, 122)
[11:09:10] Layer 1 train process start. Time left 899.10 secs
[11:09:10] Start fitting Lvl_0_Pipe_0_Mod_0_LinearL2 ...
[11:09:13] Fitting Lvl_0_Pipe_0_Mod_0_LinearL2 finished. score = 0.7351175537276247
[11:09:13] Lvl_0_Pipe_0_Mod_0_LinearL2 fitting and predicting completed
[11:09:13] Time left 896.26 secs
[11:09:13] Start fitting Lvl_0_Pipe_1_Mod_0_LightGBM ...
[11:09:30] Fitting Lvl_0_Pipe_1_Mod_0_LightGBM finished. score = 0.7208359669101568
[11:09:30] Lvl_0_Pipe_1_Mod_0_LightGBM fitting and predicting completed
[11:09:30] Start hyperparameters optimization for Lvl_0_Pipe_1_Mod_1_Tuned_LightGBM ... Time budget is 103.73 secs
[11:11:14] Hyperparameters optimization for Lvl_0_Pipe_1_Mod_1_Tuned_LightGBM completed
[11:11:14] Start fitting Lvl_0_Pipe_1_Mod_1_Tuned_LightGBM ...
[11:11:25] Fitting Lvl_0_Pipe_1_Mod_1_Tuned_LightGBM finished. score = 0.703382394929586
[11:11:25] Lvl_0_Pipe_1_Mod_1_Tuned_LightGBM fitting and predicting completed
[11:11:25] Start fitting Lvl_0_Pipe_1_Mod_2_CatBoost ...
[11:11:30] Fitting Lvl_0_Pipe_1_Mod_2_CatBoost finished. score = 0.7216183332238446
[11:11:30] Lvl_0_Pipe_1_Mod_2_CatBoost fitting and predicting completed
[11:11:30] Start hyperparameters optimization for Lvl_0_Pipe_1_Mod_3_Tuned_CatBoost ... Time budget is 300.00 secs
[11:13:24] Hyperparameters optimization for Lvl_0_Pipe_1_Mod_3_Tuned_CatBoost completed
[11:13:24] Start fitting Lvl_0_Pipe_1_Mod_3_Tuned_CatBoost ...
[11:13:33] Fitting Lvl_0_Pipe_1_Mod_3_Tuned_CatBoost finished. score = 0.7486773651008073
[11:13:33] Lvl_0_Pipe_1_Mod_3_Tuned_CatBoost fitting and predicting completed
[11:13:33] Time left 636.25 secs
[11:13:33] Layer 1 training completed.
[11:13:33] Blending: optimization starts with equal weights and score 0.7510956848883608
[11:13:33] Blending: iteration 0: score = 0.7530033405765996, weights = [0.24248882 0.0901353 0.11209824 0.09796461 0.45731303]
[11:13:33] Blending: iteration 1: score = 0.7530437344895347, weights = [0.23387231 0.08203291 0.10811498 0.12389028 0.45208955]
[11:13:33] Blending: iteration 2: score = 0.7530445848877017, weights = [0.23470336 0.0871055 0.10718196 0.12282111 0.44818804]
[11:13:33] Blending: iteration 3: score = 0.7530445848877017, weights = [0.23470336 0.0871055 0.10718196 0.12282111 0.44818804]
[11:13:33] Blending: no score update. Terminated
[11:13:33] Automl preset training completed in 264.09 seconds
[11:13:33] Model description:
Final prediction for new objects (level 0) =
0.23470 * (5 averaged models Lvl_0_Pipe_0_Mod_0_LinearL2) +
0.08711 * (5 averaged models Lvl_0_Pipe_1_Mod_0_LightGBM) +
0.10718 * (5 averaged models Lvl_0_Pipe_1_Mod_1_Tuned_LightGBM) +
0.12282 * (5 averaged models Lvl_0_Pipe_1_Mod_2_CatBoost) +
0.44819 * (5 averaged models Lvl_0_Pipe_1_Mod_3_Tuned_CatBoost)
[11:13:33] ==================================================
[11:13:33] Start 1 automl preset configuration:
[11:13:33] conf_1_sel_type_1.yml, random state: {'reader_params': {'random_state': 43}, 'nn_params': {'random_state': 43}, 'general_params': {'return_all_predictions': False}}
[11:13:33] Stdout logging level is INFO.
[11:13:33] Task: binary
[11:13:33] Start automl preset with listed constraints:
[11:13:33] - time: 635.86 seconds
[11:13:33] - CPU: 4 cores
[11:13:33] - memory: 16 GB
[11:13:33] Train data shape: (8000, 122)
[11:13:34] Layer 1 train process start. Time left 634.90 secs
[11:13:35] Start fitting Lvl_0_Pipe_0_Mod_0_LinearL2 ...
[11:13:37] Fitting Lvl_0_Pipe_0_Mod_0_LinearL2 finished. score = 0.7342794863339953
[11:13:37] Lvl_0_Pipe_0_Mod_0_LinearL2 fitting and predicting completed
[11:13:37] Time left 631.81 secs
[11:13:41] Selector_LightGBM fitting and predicting completed
[11:13:41] Start fitting Lvl_0_Pipe_1_Mod_0_LightGBM ...
[11:13:55] Fitting Lvl_0_Pipe_1_Mod_0_LightGBM finished. score = 0.7392594180002504
[11:13:55] Lvl_0_Pipe_1_Mod_0_LightGBM fitting and predicting completed
[11:13:55] Start hyperparameters optimization for Lvl_0_Pipe_1_Mod_1_Tuned_LightGBM ... Time budget is 81.29 secs
[11:15:19] Hyperparameters optimization for Lvl_0_Pipe_1_Mod_1_Tuned_LightGBM completed
[11:15:19] Start fitting Lvl_0_Pipe_1_Mod_1_Tuned_LightGBM ...
[11:15:22] Fitting Lvl_0_Pipe_1_Mod_1_Tuned_LightGBM finished. score = 0.7456401680471817
[11:15:22] Lvl_0_Pipe_1_Mod_1_Tuned_LightGBM fitting and predicting completed
[11:15:22] Start fitting Lvl_0_Pipe_1_Mod_2_CatBoost ...
[11:15:26] Fitting Lvl_0_Pipe_1_Mod_2_CatBoost finished. score = 0.7142443181177967
[11:15:26] Lvl_0_Pipe_1_Mod_2_CatBoost fitting and predicting completed
[11:15:26] Start hyperparameters optimization for Lvl_0_Pipe_1_Mod_3_Tuned_CatBoost ... Time budget is 300.00 secs
[11:17:43] Hyperparameters optimization for Lvl_0_Pipe_1_Mod_3_Tuned_CatBoost completed
[11:17:43] Start fitting Lvl_0_Pipe_1_Mod_3_Tuned_CatBoost ...
[11:17:50] Fitting Lvl_0_Pipe_1_Mod_3_Tuned_CatBoost finished. score = 0.7420738107341084
[11:17:50] Lvl_0_Pipe_1_Mod_3_Tuned_CatBoost fitting and predicting completed
[11:17:50] Time left 378.85 secs
[11:17:50] Layer 1 training completed.
[11:17:50] Blending: optimization starts with equal weights and score 0.7493590655314701
[11:17:50] Blending: iteration 0: score = 0.7515980576055465, weights = [0.14799137 0.1691257 0.4277942 0. 0.25508872]
[11:17:50] Blending: iteration 1: score = 0.7518406336826982, weights = [0.23282948 0.12138072 0.40569347 0. 0.24009633]
[11:17:50] Blending: iteration 2: score = 0.7518406336826982, weights = [0.23282948 0.12138072 0.40569347 0. 0.24009633]
[11:17:50] Blending: no score update. Terminated
[11:17:50] Automl preset training completed in 257.29 seconds
[11:17:50] Model description:
Final prediction for new objects (level 0) =
0.23283 * (5 averaged models Lvl_0_Pipe_0_Mod_0_LinearL2) +
0.12138 * (5 averaged models Lvl_0_Pipe_1_Mod_0_LightGBM) +
0.40569 * (5 averaged models Lvl_0_Pipe_1_Mod_1_Tuned_LightGBM) +
0.24010 * (5 averaged models Lvl_0_Pipe_1_Mod_3_Tuned_CatBoost)
[11:17:50] ==================================================
[11:17:50] Start 2 automl preset configuration:
[11:17:50] conf_2_select_mode_1_no_typ.yml, random state: {'reader_params': {'random_state': 44}, 'nn_params': {'random_state': 44}, 'general_params': {'return_all_predictions': False}}
[11:17:50] Stdout logging level is INFO.
[11:17:50] Task: binary
[11:17:50] Start automl preset with listed constraints:
[11:17:50] - time: 378.53 seconds
[11:17:50] - CPU: 4 cores
[11:17:50] - memory: 16 GB
[11:17:50] Train data shape: (8000, 122)
[11:17:51] Layer 1 train process start. Time left 378.43 secs
[11:17:51] Start fitting Lvl_0_Pipe_0_Mod_0_LinearL2 ...
[11:17:53] Fitting Lvl_0_Pipe_0_Mod_0_LinearL2 finished. score = 0.7369646185464611
[11:17:53] Lvl_0_Pipe_0_Mod_0_LinearL2 fitting and predicting completed
[11:17:53] Time left 375.60 secs
[11:17:57] Selector_LightGBM fitting and predicting completed
[11:17:57] Start fitting Lvl_0_Pipe_1_Mod_0_LightGBM ...
[11:18:12] Fitting Lvl_0_Pipe_1_Mod_0_LightGBM finished. score = 0.7396506011570942
[11:18:12] Lvl_0_Pipe_1_Mod_0_LightGBM fitting and predicting completed
[11:18:12] Start hyperparameters optimization for Lvl_0_Pipe_1_Mod_1_Tuned_LightGBM ... Time budget is 2.40 secs
[11:18:23] Hyperparameters optimization for Lvl_0_Pipe_1_Mod_1_Tuned_LightGBM completed
[11:18:23] Start fitting Lvl_0_Pipe_1_Mod_1_Tuned_LightGBM ...
[11:18:38] Fitting Lvl_0_Pipe_1_Mod_1_Tuned_LightGBM finished. score = 0.707420085426748
[11:18:38] Lvl_0_Pipe_1_Mod_1_Tuned_LightGBM fitting and predicting completed
[11:18:38] Start fitting Lvl_0_Pipe_1_Mod_2_CatBoost ...
[11:18:43] Fitting Lvl_0_Pipe_1_Mod_2_CatBoost finished. score = 0.7208621166537937
[11:18:43] Lvl_0_Pipe_1_Mod_2_CatBoost fitting and predicting completed
[11:18:43] Start hyperparameters optimization for Lvl_0_Pipe_1_Mod_3_Tuned_CatBoost ... Time budget is 222.56 secs
[11:20:48] Hyperparameters optimization for Lvl_0_Pipe_1_Mod_3_Tuned_CatBoost completed
[11:20:48] Start fitting Lvl_0_Pipe_1_Mod_3_Tuned_CatBoost ...
[11:20:54] Fitting Lvl_0_Pipe_1_Mod_3_Tuned_CatBoost finished. score = 0.7435819918833748
[11:20:54] Lvl_0_Pipe_1_Mod_3_Tuned_CatBoost fitting and predicting completed
[11:20:54] Time left 195.02 secs
[11:20:54] Layer 1 training completed.
[11:20:54] Blending: optimization starts with equal weights and score 0.755773406306
[11:20:54] Blending: iteration 0: score = 0.7573862927295848, weights = [0.21980023 0.3309117 0.13515761 0.05808445 0.25604597]
[11:20:54] Blending: iteration 1: score = 0.7574197771574124, weights = [0.22310945 0.3465078 0.10554349 0.06082201 0.26401728]
[11:20:54] Blending: iteration 2: score = 0.7574281748393119, weights = [0.22096376 0.34772167 0.10591323 0.06045922 0.26494217]
[11:20:54] Blending: iteration 3: score = 0.7574281748393119, weights = [0.22096376 0.34772167 0.10591323 0.06045922 0.26494217]
[11:20:54] Blending: no score update. Terminated
[11:20:54] Automl preset training completed in 183.90 seconds
[11:20:54] Model description:
Final prediction for new objects (level 0) =
0.22096 * (5 averaged models Lvl_0_Pipe_0_Mod_0_LinearL2) +
0.34772 * (5 averaged models Lvl_0_Pipe_1_Mod_0_LightGBM) +
0.10591 * (5 averaged models Lvl_0_Pipe_1_Mod_1_Tuned_LightGBM) +
0.06046 * (5 averaged models Lvl_0_Pipe_1_Mod_2_CatBoost) +
0.26494 * (5 averaged models Lvl_0_Pipe_1_Mod_3_Tuned_CatBoost)
[11:20:54] ==================================================
[11:20:54] Blending: optimization starts with equal weights and score 0.7584169753080514
[11:20:54] Blending: iteration 0: score = 0.7593946143008483, weights = [0.26784256 0.14028242 0.59187496]
[11:20:54] Blending: iteration 1: score = 0.7594212955433396, weights = [0.26530033 0.14962645 0.58507323]
[11:20:54] Blending: iteration 2: score = 0.7594212955433396, weights = [0.26530033 0.14962645 0.58507323]
[11:20:54] Blending: no score update. Terminated
CPU times: user 46min 55s, sys: 5min 16s, total: 52min 12s
Wall time: 11min 45s
[30]:
print('out_of_fold_predictions:\n{}\nShape = {}'.format(out_of_fold_predictions, out_of_fold_predictions.shape))
out_of_fold_predictions:
array([[0.04793217],
[0.03117672],
[0.03577064],
...,
[0.03144814],
[0.18522368],
[0.11984787]], dtype=float32)
Shape = (8000, 1)
[31]:
print(utilized_automl.create_model_str_desc())
Final prediction for new objects =
0.26530 * 1 averaged models with config = "conf_0_sel_type_0.yml" and different CV random_states. Their structures:
Model #0.
================================================================================
Final prediction for new objects (level 0) =
0.23470 * (5 averaged models Lvl_0_Pipe_0_Mod_0_LinearL2) +
0.08711 * (5 averaged models Lvl_0_Pipe_1_Mod_0_LightGBM) +
0.10718 * (5 averaged models Lvl_0_Pipe_1_Mod_1_Tuned_LightGBM) +
0.12282 * (5 averaged models Lvl_0_Pipe_1_Mod_2_CatBoost) +
0.44819 * (5 averaged models Lvl_0_Pipe_1_Mod_3_Tuned_CatBoost)
================================================================================
+ 0.14963 * 1 averaged models with config = "conf_1_sel_type_1.yml" and different CV random_states. Their structures:
Model #0.
================================================================================
Final prediction for new objects (level 0) =
0.23283 * (5 averaged models Lvl_0_Pipe_0_Mod_0_LinearL2) +
0.12138 * (5 averaged models Lvl_0_Pipe_1_Mod_0_LightGBM) +
0.40569 * (5 averaged models Lvl_0_Pipe_1_Mod_1_Tuned_LightGBM) +
0.24010 * (5 averaged models Lvl_0_Pipe_1_Mod_3_Tuned_CatBoost)
================================================================================
+ 0.58507 * 1 averaged models with config = "conf_2_select_mode_1_no_typ.yml" and different CV random_states. Their structures:
Model #0.
================================================================================
Final prediction for new objects (level 0) =
0.22096 * (5 averaged models Lvl_0_Pipe_0_Mod_0_LinearL2) +
0.34772 * (5 averaged models Lvl_0_Pipe_1_Mod_0_LightGBM) +
0.10591 * (5 averaged models Lvl_0_Pipe_1_Mod_1_Tuned_LightGBM) +
0.06046 * (5 averaged models Lvl_0_Pipe_1_Mod_2_CatBoost) +
0.26494 * (5 averaged models Lvl_0_Pipe_1_Mod_3_Tuned_CatBoost)
================================================================================
Feature importance calculation for TabularUtilizedAutoML:
[32]:
%%time
# Fast feature importances calculation
fast_fi = utilized_automl.get_feature_scores('fast', silent=False)
fast_fi.set_index('Feature')['Importance'].plot.bar(figsize = (30, 10), grid = True)
CPU times: user 294 ms, sys: 117 ms, total: 411 ms
Wall time: 204 ms
[32]:
<Axes: xlabel='Feature'>

Note that in TabularUtilizedAutoML the first config doesn't have an LGBM feature selector (but the second one already does), so if there is enough time only for training with the first config, the 'fast' feature importance calculation method won't work. The 'accurate' method will still work correctly.
[33]:
%%time
# Accurate feature importances calculation
fast_fi = utilized_automl.get_feature_scores('accurate', test_data, silent=True)
fast_fi.set_index('Feature')['Importance'].plot.bar(figsize = (30, 10), grid = True)
CPU times: user 16min 44s, sys: 58.4 s, total: 17min 42s
Wall time: 3min 8s
[33]:
<Axes: xlabel='Feature'>

Prediction on holdout and metric calculation:
[34]:
%%time
test_predictions = utilized_automl.predict(test_data)
print(f'Prediction for test_data:\n{test_predictions}\nShape = {test_predictions.shape}')
Prediction for test_data:
array([[0.05962077],
[0.08053684],
[0.03293977],
...,
[0.05790396],
[0.04083381],
[0.21182759]], dtype=float32)
Shape = (2000, 1)
CPU times: user 9.32 s, sys: 386 ms, total: 9.7 s
Wall time: 1.8 s
[35]:
print(f'OOF score: {roc_auc_score(train_data[TARGET_NAME].values, out_of_fold_predictions.data[:, 0])}')
print(f'HOLDOUT score: {roc_auc_score(test_data[TARGET_NAME].values, test_predictions.data[:, 0])}')
OOF score: 0.7594212955433396
HOLDOUT score: 0.7356114130434782
It is also important to note that using the ReportDeco decorator with TabularUtilizedAutoML is not yet available.
Bonus: other task examples
Regression task
LightAutoML can solve regression problems without big differences from the binary classification case.
Here you will use the Ames Housing dataset. Load the data and split it into train and validation parts:
[36]:
data = pd.read_csv('https://raw.githubusercontent.com/reneemarama/aiming_high_in_aimes/master/datasets/train.csv')
data.head()
[36]:
Id | PID | MS SubClass | MS Zoning | Lot Frontage | Lot Area | Street | Alley | Lot Shape | Land Contour | ... | Screen Porch | Pool Area | Pool QC | Fence | Misc Feature | Misc Val | Mo Sold | Yr Sold | Sale Type | SalePrice | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 109 | 533352170 | 60 | RL | NaN | 13517 | Pave | NaN | IR1 | Lvl | ... | 0 | 0 | NaN | NaN | NaN | 0 | 3 | 2010 | WD | 130500 |
1 | 544 | 531379050 | 60 | RL | 43.0 | 11492 | Pave | NaN | IR1 | Lvl | ... | 0 | 0 | NaN | NaN | NaN | 0 | 4 | 2009 | WD | 220000 |
2 | 153 | 535304180 | 20 | RL | 68.0 | 7922 | Pave | NaN | Reg | Lvl | ... | 0 | 0 | NaN | NaN | NaN | 0 | 1 | 2010 | WD | 109000 |
3 | 318 | 916386060 | 60 | RL | 73.0 | 9802 | Pave | NaN | Reg | Lvl | ... | 0 | 0 | NaN | NaN | NaN | 0 | 4 | 2010 | WD | 174000 |
4 | 255 | 906425045 | 50 | RL | 82.0 | 14235 | Pave | NaN | IR1 | Lvl | ... | 0 | 0 | NaN | NaN | NaN | 0 | 3 | 2010 | WD | 138500 |
5 rows × 81 columns
[37]:
data.shape
[37]:
(2051, 81)
[38]:
train_data, test_data = train_test_split(
data,
test_size=TEST_SIZE,
random_state=RANDOM_STATE
)
print(f'Data is split. Parts sizes: train_data = {train_data.shape}, test_data = {test_data.shape}')
train_data.head()
Data is split. Parts sizes: train_data = (1640, 81), test_data = (411, 81)
[38]:
Id | PID | MS SubClass | MS Zoning | Lot Frontage | Lot Area | Street | Alley | Lot Shape | Land Contour | ... | Screen Porch | Pool Area | Pool QC | Fence | Misc Feature | Misc Val | Mo Sold | Yr Sold | Sale Type | SalePrice | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1448 | 452 | 528174050 | 120 | RL | 47.0 | 6904 | Pave | NaN | IR1 | Lvl | ... | 0 | 0 | NaN | NaN | NaN | 0 | 8 | 2009 | WD | 213000 |
1771 | 1697 | 528110070 | 20 | RL | 110.0 | 14226 | Pave | NaN | Reg | Lvl | ... | 0 | 0 | NaN | NaN | NaN | 0 | 7 | 2007 | New | 395000 |
966 | 2294 | 923229100 | 80 | RL | NaN | 15957 | Pave | NaN | IR1 | Low | ... | 0 | 0 | NaN | MnPrv | NaN | 0 | 9 | 2007 | WD | 188000 |
1604 | 2449 | 528348010 | 60 | RL | 93.0 | 12090 | Pave | NaN | Reg | Lvl | ... | 0 | 0 | NaN | NaN | NaN | 0 | 7 | 2006 | WD | 258000 |
1827 | 1859 | 533254100 | 80 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | ... | 0 | 0 | NaN | NaN | NaN | 0 | 8 | 2007 | WD | 187000 |
5 rows × 81 columns
Now we have a regression task, and it is necessary to specify it in the Task object. Note that the default loss and metric for the regression task is MSE, but you can use any available functions, for example MAE:
[39]:
task = Task('reg', loss='mae', metric='mae')
Specifying column roles:
[40]:
roles = {
'target': 'SalePrice',
'drop': ['Id', 'PID']
}
Building AutoML model:
[41]:
automl = TabularAutoML(
task = task,
timeout = TIMEOUT,
cpu_limit = N_THREADS,
reader_params = {'n_jobs': N_THREADS, 'cv': N_FOLDS, 'random_state': RANDOM_STATE}
)
Training:
[42]:
%%time
out_of_fold_predictions = automl.fit_predict(train_data, roles = roles, verbose = 1)
[11:24:12] Stdout logging level is INFO.
[11:24:12] Task: reg
[11:24:12] Start automl preset with listed constraints:
[11:24:12] - time: 300.00 seconds
[11:24:12] - CPU: 4 cores
[11:24:12] - memory: 16 GB
[11:24:12] Train data shape: (1640, 81)
[11:24:15] Layer 1 train process start. Time left 297.53 secs
[11:24:15] Start fitting Lvl_0_Pipe_0_Mod_0_LinearL2 ...
[11:24:22] Fitting Lvl_0_Pipe_0_Mod_0_LinearL2 finished. score = -16095.787941834984
[11:24:22] Lvl_0_Pipe_0_Mod_0_LinearL2 fitting and predicting completed
[11:24:22] Time left 290.36 secs
[11:24:25] Selector_LightGBM fitting and predicting completed
[11:24:26] Start fitting Lvl_0_Pipe_1_Mod_0_LightGBM ...
[11:24:46] Fitting Lvl_0_Pipe_1_Mod_0_LightGBM finished. score = -14962.662254668445
[11:24:46] Lvl_0_Pipe_1_Mod_0_LightGBM fitting and predicting completed
[11:24:46] Start hyperparameters optimization for Lvl_0_Pipe_1_Mod_1_Tuned_LightGBM ... Time budget is 1.00 secs
[11:24:53] Hyperparameters optimization for Lvl_0_Pipe_1_Mod_1_Tuned_LightGBM completed
[11:24:53] Start fitting Lvl_0_Pipe_1_Mod_1_Tuned_LightGBM ...
[11:25:11] Fitting Lvl_0_Pipe_1_Mod_1_Tuned_LightGBM finished. score = -15086.075512099847
[11:25:11] Lvl_0_Pipe_1_Mod_1_Tuned_LightGBM fitting and predicting completed
[11:25:11] Start fitting Lvl_0_Pipe_1_Mod_2_CatBoost ...
[11:25:19] Fitting Lvl_0_Pipe_1_Mod_2_CatBoost finished. score = -14711.966370522103
[11:25:19] Lvl_0_Pipe_1_Mod_2_CatBoost fitting and predicting completed
[11:25:19] Start hyperparameters optimization for Lvl_0_Pipe_1_Mod_3_Tuned_CatBoost ... Time budget is 144.46 secs
[11:27:41] Hyperparameters optimization for Lvl_0_Pipe_1_Mod_3_Tuned_CatBoost completed
[11:27:41] Start fitting Lvl_0_Pipe_1_Mod_3_Tuned_CatBoost ...
[11:27:46] Fitting Lvl_0_Pipe_1_Mod_3_Tuned_CatBoost finished. score = -14724.322287061737
[11:27:46] Lvl_0_Pipe_1_Mod_3_Tuned_CatBoost fitting and predicting completed
[11:27:46] Time left 86.35 secs
[11:27:46] Layer 1 training completed.
[11:27:46] Blending: optimization starts with equal weights and score -14167.255313929116
[11:27:46] Blending: iteration 0: score = -14135.606938357469, weights = [0.2613054 0.05990669 0.21512504 0.33101246 0.1326504 ]
[11:27:46] Blending: iteration 1: score = -14133.848594702744, weights = [0.2651733 0. 0.2282765 0.38584936 0.12070084]
[11:27:46] Blending: iteration 2: score = -14133.788066882622, weights = [0.26562527 0. 0.2276214 0.38920173 0.11755158]
[11:27:46] Blending: iteration 3: score = -14133.75897961128, weights = [0.26574308 0. 0.22703037 0.39105994 0.11616667]
[11:27:46] Blending: iteration 4: score = -14133.758250762196, weights = [0.2657241 0. 0.22708555 0.391032 0.11615837]
[11:27:46] Automl preset training completed in 213.77 seconds
[11:27:46] Model description:
Final prediction for new objects (level 0) =
0.26572 * (5 averaged models Lvl_0_Pipe_0_Mod_0_LinearL2) +
0.22709 * (5 averaged models Lvl_0_Pipe_1_Mod_1_Tuned_LightGBM) +
0.39103 * (5 averaged models Lvl_0_Pipe_1_Mod_2_CatBoost) +
0.11616 * (5 averaged models Lvl_0_Pipe_1_Mod_3_Tuned_CatBoost)
CPU times: user 14min 24s, sys: 1min 37s, total: 16min 2s
Wall time: 3min 33s
[43]:
%%time
test_predictions = automl.predict(test_data)
CPU times: user 2.4 s, sys: 199 ms, total: 2.6 s
Wall time: 408 ms
[44]:
print(f'Prediction for test_data:\n{test_predictions[:10]}\nShape = {test_predictions.shape}')
Prediction for test_data:
array([[134551.25],
[210638.33],
[275164.66],
[127255.11],
[200470.92],
[386540. ],
[159519.5 ],
[292220.12],
[159665.39],
[ 82598.77]], dtype=float32)
Shape = (411, 1)
[45]:
from sklearn.metrics import mean_absolute_error
print(f'OOF score: {mean_absolute_error(train_data[roles["target"]].values, out_of_fold_predictions.data[:, 0])}')
print(f'HOLDOUT score: {mean_absolute_error(test_data[roles["target"]].values, test_predictions.data[:, 0])}')
OOF score: 14133.75818883384
HOLDOUT score: 12320.59225783151
In the same way as in the previous example with binary classification, you can build a detailed report using ReportDeco
, calculate feature importances, use TabularUtilizedAutoML
etc.
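For instance, here is a hedged sketch of wrapping the regression preset in ReportDeco, mirroring section 4.1 (the output folder name is arbitrary, and the commented call shows the intended usage rather than an executed cell):
# A sketch: decorate the regression preset in the same way as in section 4.1
RD_reg = ReportDeco(output_path='tabularAutoML_regression_report')
automl_reg_rd = RD_reg(
    TabularAutoML(
        task=Task('reg', loss='mae', metric='mae'),
        timeout=TIMEOUT,
        cpu_limit=N_THREADS,
        reader_params={'n_jobs': N_THREADS, 'cv': N_FOLDS, 'random_state': RANDOM_STATE},
    )
)
# out_of_fold_predictions = automl_reg_rd.fit_predict(train_data, roles=roles, verbose=1)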
Multi-class classification
Now let's consider multi-class classification. Here you will use the Anuran Calls (MFCCs) Data Set:
[46]:
from io import BytesIO
from zipfile import ZipFile
from urllib.request import urlopen
data = pd.read_csv(
ZipFile(
BytesIO(
urlopen(
"https://archive.ics.uci.edu/ml/machine-learning-databases/00406/Anuran%20Calls%20(MFCCs).zip"
).read()
)
).open('Frogs_MFCCs.csv')
)
data.head()
[46]:
MFCCs_ 1 | MFCCs_ 2 | MFCCs_ 3 | MFCCs_ 4 | MFCCs_ 5 | MFCCs_ 6 | MFCCs_ 7 | MFCCs_ 8 | MFCCs_ 9 | MFCCs_10 | ... | MFCCs_17 | MFCCs_18 | MFCCs_19 | MFCCs_20 | MFCCs_21 | MFCCs_22 | Family | Genus | Species | RecordID | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1.0 | 0.152936 | -0.105586 | 0.200722 | 0.317201 | 0.260764 | 0.100945 | -0.150063 | -0.171128 | 0.124676 | ... | -0.108351 | -0.077623 | -0.009568 | 0.057684 | 0.118680 | 0.014038 | Leptodactylidae | Adenomera | AdenomeraAndre | 1 |
1 | 1.0 | 0.171534 | -0.098975 | 0.268425 | 0.338672 | 0.268353 | 0.060835 | -0.222475 | -0.207693 | 0.170883 | ... | -0.090974 | -0.056510 | -0.035303 | 0.020140 | 0.082263 | 0.029056 | Leptodactylidae | Adenomera | AdenomeraAndre | 1 |
2 | 1.0 | 0.152317 | -0.082973 | 0.287128 | 0.276014 | 0.189867 | 0.008714 | -0.242234 | -0.219153 | 0.232538 | ... | -0.050691 | -0.023590 | -0.066722 | -0.025083 | 0.099108 | 0.077162 | Leptodactylidae | Adenomera | AdenomeraAndre | 1 |
3 | 1.0 | 0.224392 | 0.118985 | 0.329432 | 0.372088 | 0.361005 | 0.015501 | -0.194347 | -0.098181 | 0.270375 | ... | -0.136009 | -0.177037 | -0.130498 | -0.054766 | -0.018691 | 0.023954 | Leptodactylidae | Adenomera | AdenomeraAndre | 1 |
4 | 1.0 | 0.087817 | -0.068345 | 0.306967 | 0.330923 | 0.249144 | 0.006884 | -0.265423 | -0.172700 | 0.266434 | ... | -0.048885 | -0.053074 | -0.088550 | -0.031346 | 0.108610 | 0.079244 | Leptodactylidae | Adenomera | AdenomeraAndre | 1 |
5 rows × 26 columns
[47]:
train_data, test_data = train_test_split(
data,
test_size=TEST_SIZE,
shuffle=True,
random_state=RANDOM_STATE
)
print(f'Data is splitted. Parts sizes: train_data = {train_data.shape}, test_data = {test_data.shape}')
train_data.head()
Data is splitted. Parts sizes: train_data = (5756, 26), test_data = (1439, 26)
[47]:
MFCCs_ 1 | MFCCs_ 2 | MFCCs_ 3 | MFCCs_ 4 | MFCCs_ 5 | MFCCs_ 6 | MFCCs_ 7 | MFCCs_ 8 | MFCCs_ 9 | MFCCs_10 | ... | MFCCs_17 | MFCCs_18 | MFCCs_19 | MFCCs_20 | MFCCs_21 | MFCCs_22 | Family | Genus | Species | RecordID | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3838 | 1.0 | 0.389057 | 0.283855 | 0.558597 | 0.142120 | 0.006777 | -0.100356 | 0.015060 | 0.277700 | 0.062747 | ... | 0.221304 | 0.037511 | -0.019166 | -0.042803 | 0.024793 | 0.177462 | Leptodactylidae | Adenomera | AdenomeraHylaedactylus | 22 |
293 | 1.0 | 0.339049 | -0.001276 | 0.075088 | 0.298091 | 0.190639 | 0.022295 | 0.049216 | 0.175380 | -0.007751 | ... | -0.299407 | -0.121592 | 0.108062 | 0.124870 | -0.004888 | -0.040086 | Leptodactylidae | Adenomera | AdenomeraAndre | 7 |
1593 | 1.0 | 0.211356 | 0.132368 | 0.530019 | 0.181015 | 0.047415 | -0.142114 | 0.000687 | 0.249328 | 0.032000 | ... | 0.207694 | 0.026302 | -0.167216 | -0.160102 | 0.084770 | 0.276008 | Leptodactylidae | Adenomera | AdenomeraHylaedactylus | 15 |
4669 | 1.0 | 0.069635 | 0.170713 | 0.583894 | 0.275507 | 0.086236 | -0.152521 | -0.032355 | 0.268403 | 0.054420 | ... | 0.256234 | -0.116248 | -0.230951 | -0.058546 | 0.205891 | 0.211869 | Leptodactylidae | Adenomera | AdenomeraHylaedactylus | 24 |
940 | 1.0 | 0.222777 | -0.069955 | 0.299370 | 0.318585 | 0.094394 | -0.019920 | 0.120537 | 0.192053 | 0.047852 | ... | -0.094684 | -0.104692 | -0.006204 | 0.023067 | -0.044556 | 0.006679 | Dendrobatidae | Ameerega | Ameeregatrivittata | 11 |
5 rows × 26 columns
Now we indicate that we have a multi-class classification problem. The default metric and loss is the cross-entropy function.
[48]:
task = Task('multiclass')
Set the column roles, then build and train the AutoML model:
[49]:
roles = {
'target': 'Species',
'drop': ['RecordID']
}
[50]:
automl = TabularAutoML(
task = task,
timeout = 900,
cpu_limit = N_THREADS,
reader_params = {'n_jobs': N_THREADS, 'cv': N_FOLDS, 'random_state': RANDOM_STATE}
)
Note that for multi-class classification the default pipeline architecture looks slightly different. The first level is the same as in default binary classification and regression, the second level consists of a linear model and a LightGBM model, and the final step is blending. Two levels are used in the default architecture based on experiments on different tasks and datasets, where the extra level gave an increase in model quality. Intuitively, only at the second and subsequent levels can the model that predicts the probability of one class see what the models predicting the other classes see, so the models are able to exchange information about the classes. The final prediction is blended from the last pipeline level.
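For reference, a two-level composition like the one described above could also be requested explicitly through general_params (a sketch under assumptions: the exact algorithm lists of the default multi-class preset may differ, and the variable name automl_two_level is used only for illustration):

# Hypothetical sketch: explicit two-level composition via use_algos
# (one inner list per level; blending is applied after the last level)
automl_two_level = TabularAutoML(
    task = task,
    timeout = 900,
    cpu_limit = N_THREADS,
    reader_params = {'n_jobs': N_THREADS, 'cv': N_FOLDS, 'random_state': RANDOM_STATE},
    general_params = {'use_algos': [
        ['linear_l2', 'lgb', 'lgb_tuned', 'cb', 'cb_tuned'],  # level 0
        ['linear_l2', 'lgb'],                                 # level 1
    ]}
)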
[51]:
%%time
out_of_fold_predictions = automl.fit_predict(train_data, roles = roles, verbose = 1)
[11:27:53] Stdout logging level is INFO.
[11:27:53] Task: multiclass
[11:27:53] Start automl preset with listed constraints:
[11:27:53] - time: 900.00 seconds
[11:27:53] - CPU: 4 cores
[11:27:53] - memory: 16 GB
[11:27:53] Train data shape: (5756, 26)
[11:27:55] Layer 1 train process start. Time left 897.05 secs
[11:27:56] Start fitting Lvl_0_Pipe_0_Mod_0_LinearL2 ...
[11:28:04] Fitting Lvl_0_Pipe_0_Mod_0_LinearL2 finished. score = -0.010843929660272388
[11:28:04] Lvl_0_Pipe_0_Mod_0_LinearL2 fitting and predicting completed
[11:28:04] Time left 888.50 secs
[11:28:11] Selector_LightGBM fitting and predicting completed
[11:28:11] Start fitting Lvl_0_Pipe_1_Mod_0_LightGBM ...
[11:28:40] Fitting Lvl_0_Pipe_1_Mod_0_LightGBM finished. score = -0.008332313869425251
[11:28:40] Lvl_0_Pipe_1_Mod_0_LightGBM fitting and predicting completed
[11:28:40] Start hyperparameters optimization for Lvl_0_Pipe_1_Mod_1_Tuned_LightGBM ... Time budget is 63.52 secs
[11:29:45] Hyperparameters optimization for Lvl_0_Pipe_1_Mod_1_Tuned_LightGBM completed
[11:29:45] Start fitting Lvl_0_Pipe_1_Mod_1_Tuned_LightGBM ...
[11:29:52] Fitting Lvl_0_Pipe_1_Mod_1_Tuned_LightGBM finished. score = -0.005472618099950116
[11:29:52] Lvl_0_Pipe_1_Mod_1_Tuned_LightGBM fitting and predicting completed
[11:29:52] Start fitting Lvl_0_Pipe_1_Mod_2_CatBoost ...
[11:30:17] Fitting Lvl_0_Pipe_1_Mod_2_CatBoost finished. score = -0.005273819979140362
[11:30:17] Lvl_0_Pipe_1_Mod_2_CatBoost fitting and predicting completed
[11:30:17] Start hyperparameters optimization for Lvl_0_Pipe_1_Mod_3_Tuned_CatBoost ... Time budget is 300.00 secs
[11:35:20] Hyperparameters optimization for Lvl_0_Pipe_1_Mod_3_Tuned_CatBoost completed
[11:35:20] Start fitting Lvl_0_Pipe_1_Mod_3_Tuned_CatBoost ...
[11:36:09] Fitting Lvl_0_Pipe_1_Mod_3_Tuned_CatBoost finished. score = -0.004656292055580485
[11:36:09] Lvl_0_Pipe_1_Mod_3_Tuned_CatBoost fitting and predicting completed
[11:36:09] Time left 403.14 secs
[11:36:09] Layer 1 training completed.
[11:36:09] Layer 2 train process start. Time left 403.13 secs
[11:36:09] Start fitting Lvl_1_Pipe_0_Mod_0_LinearL2 ...
[11:36:16] Fitting Lvl_1_Pipe_0_Mod_0_LinearL2 finished. score = -0.005844817771754601
[11:36:16] Lvl_1_Pipe_0_Mod_0_LinearL2 fitting and predicting completed
[11:36:16] Time left 396.05 secs
[11:36:17] Start fitting Lvl_1_Pipe_1_Mod_0_LightGBM ...
[11:36:47] Fitting Lvl_1_Pipe_1_Mod_0_LightGBM finished. score = -0.0079052970600202
[11:36:47] Lvl_1_Pipe_1_Mod_0_LightGBM fitting and predicting completed
[11:36:47] Time left 365.09 secs
[11:36:47] Layer 2 training completed.
[11:36:47] Blending: optimization starts with equal weights and score -0.006441429966915045
[11:36:47] Blending: iteration 0: score = -0.005844817771754601, weights = [1. 0.]
[11:36:48] Blending: iteration 1: score = -0.005844817771754601, weights = [1. 0.]
[11:36:48] Blending: no score update. Terminated
[11:36:48] Automl preset training completed in 535.01 seconds
[11:36:48] Model description:
Models on level 0:
5 averaged models Lvl_0_Pipe_0_Mod_0_LinearL2
5 averaged models Lvl_0_Pipe_1_Mod_0_LightGBM
5 averaged models Lvl_0_Pipe_1_Mod_1_Tuned_LightGBM
5 averaged models Lvl_0_Pipe_1_Mod_2_CatBoost
5 averaged models Lvl_0_Pipe_1_Mod_3_Tuned_CatBoost
Final prediction for new objects (level 1) =
1.00000 * (5 averaged models Lvl_1_Pipe_0_Mod_0_LinearL2)
CPU times: user 33min 8s, sys: 1min 53s, total: 35min 1s
Wall time: 8min 55s
[52]:
%%time
test_predictions = automl.predict(test_data)
print(f'Prediction for test_data:\n{test_predictions}\nShape = {test_predictions.shape}')
Prediction for test_data:
array([[9.99299645e-01, 2.99156236e-04, 5.64087532e-05, ...,
3.26732697e-05, 1.66250393e-05, 1.44051810e-05],
[2.37410968e-05, 1.63492259e-05, 1.99144088e-05, ...,
7.16923523e-06, 1.01458818e-05, 2.79938058e-06],
[8.50680008e-05, 9.98717308e-01, 1.62742770e-04, ...,
1.16484836e-04, 4.89874292e-05, 4.69718543e-05],
...,
[4.72079875e-04, 9.67716187e-05, 9.98478889e-01, ...,
3.78043551e-05, 2.92024688e-05, 2.44144667e-05],
[2.20251924e-04, 1.70577587e-05, 9.99231935e-01, ...,
2.65343951e-05, 3.58870748e-05, 2.94520050e-05],
[2.06183802e-04, 9.98132050e-01, 1.34161659e-04, ...,
8.58679778e-05, 9.00253362e-05, 5.26059594e-05]], dtype=float32)
Shape = (1439, 10)
CPU times: user 6.6 s, sys: 628 ms, total: 7.23 s
Wall time: 1.51 s
It is also important to note that the Reader object may re-label classes during training. To see the new labelling, check the class_mapping attribute of the Reader object. If the returned dict is empty, the original class order and layout have been preserved.
[53]:
automl.reader.class_mapping
[53]:
{'AdenomeraHylaedactylus': 0,
'HypsiboasCordobae': 1,
'AdenomeraAndre': 2,
'Ameeregatrivittata': 3,
'HypsiboasCinerascens': 4,
'HylaMinuta': 5,
'LeptodactylusFuscus': 6,
'ScinaxRuber': 7,
'OsteocephalusOophagus': 8,
'Rhinellagranulosa': 9}
To avoid problems when calculating metrics, it is safer to remap the original class labels to the new encoding, for example in this or a similar way:
[54]:
mapping = automl.reader.class_mapping
def map_class(x):
return mapping[x]
mapped = np.vectorize(map_class)
mapped(train_data['Species'].values)
[54]:
array([0, 2, 0, ..., 4, 4, 3])
[55]:
from sklearn.metrics import log_loss
print(f'OOF score: {log_loss(mapped(train_data[roles["target"]].values), out_of_fold_predictions.data)}')
print(f'HOLDOUT score: {log_loss(mapped(test_data[roles["target"]].values), test_predictions.data)}')
OOF score: 0.005844810740496928
HOLDOUT score: 0.002458012646828699
Feature importance calculation, report building, time management, etc. are also available for multi-class classification.
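For instance, fast (model-based) and accurate (permutation-based) feature importances could be requested roughly as follows (a sketch assuming the get_feature_scores API works for the multi-class task in the same way as for binary classification):

# Fast importances are taken from the trained models themselves
fast_fi = automl.get_feature_scores('fast')
fast_fi.set_index('Feature')['Importance'].plot.bar(figsize=(20, 10), grid=True)

# Accurate importances are permutation importances computed on the supplied data
accurate_fi = automl.get_feature_scores('accurate', test_data, silent=False)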
Multi-label classification
Now let's consider a multi-label classification task. Here you will use the same dataset as in the section above (the Anuran Calls (MFCCs) Data Set). Let's create a binary label column for each target:
[56]:
data['AdenomeraHylaedactylus'] = (data['Species'] == 'AdenomeraHylaedactylus').astype(int)
data['Adenomera'] = (data['Genus'] == 'Adenomera').astype(int)
data['Leptodactylidae'] = (data['Family'] == 'Leptodactylidae').astype(int)
[57]:
data.head()
[57]:
MFCCs_ 1 | MFCCs_ 2 | MFCCs_ 3 | MFCCs_ 4 | MFCCs_ 5 | MFCCs_ 6 | MFCCs_ 7 | MFCCs_ 8 | MFCCs_ 9 | MFCCs_10 | ... | MFCCs_20 | MFCCs_21 | MFCCs_22 | Family | Genus | Species | RecordID | AdenomeraHylaedactylus | Adenomera | Leptodactylidae | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1.0 | 0.152936 | -0.105586 | 0.200722 | 0.317201 | 0.260764 | 0.100945 | -0.150063 | -0.171128 | 0.124676 | ... | 0.057684 | 0.118680 | 0.014038 | Leptodactylidae | Adenomera | AdenomeraAndre | 1 | 0 | 1 | 1 |
1 | 1.0 | 0.171534 | -0.098975 | 0.268425 | 0.338672 | 0.268353 | 0.060835 | -0.222475 | -0.207693 | 0.170883 | ... | 0.020140 | 0.082263 | 0.029056 | Leptodactylidae | Adenomera | AdenomeraAndre | 1 | 0 | 1 | 1 |
2 | 1.0 | 0.152317 | -0.082973 | 0.287128 | 0.276014 | 0.189867 | 0.008714 | -0.242234 | -0.219153 | 0.232538 | ... | -0.025083 | 0.099108 | 0.077162 | Leptodactylidae | Adenomera | AdenomeraAndre | 1 | 0 | 1 | 1 |
3 | 1.0 | 0.224392 | 0.118985 | 0.329432 | 0.372088 | 0.361005 | 0.015501 | -0.194347 | -0.098181 | 0.270375 | ... | -0.054766 | -0.018691 | 0.023954 | Leptodactylidae | Adenomera | AdenomeraAndre | 1 | 0 | 1 | 1 |
4 | 1.0 | 0.087817 | -0.068345 | 0.306967 | 0.330923 | 0.249144 | 0.006884 | -0.265423 | -0.172700 | 0.266434 | ... | -0.031346 | 0.108610 | 0.079244 | Leptodactylidae | Adenomera | AdenomeraAndre | 1 | 0 | 1 | 1 |
5 rows × 29 columns
[58]:
targets = ['Leptodactylidae', 'Adenomera', 'AdenomeraHylaedactylus']
Split it into train and test parts:
[59]:
train_data, test_data = train_test_split(
data,
test_size=TEST_SIZE,
random_state=RANDOM_STATE
)
print(f'Data is splitted. Parts sizes: train_data = {train_data.shape}, test_data = {test_data.shape}')
train_data.head()
Data is splitted. Parts sizes: train_data = (5756, 29), test_data = (1439, 29)
[59]:
MFCCs_ 1 | MFCCs_ 2 | MFCCs_ 3 | MFCCs_ 4 | MFCCs_ 5 | MFCCs_ 6 | MFCCs_ 7 | MFCCs_ 8 | MFCCs_ 9 | MFCCs_10 | ... | MFCCs_20 | MFCCs_21 | MFCCs_22 | Family | Genus | Species | RecordID | AdenomeraHylaedactylus | Adenomera | Leptodactylidae | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3838 | 1.0 | 0.389057 | 0.283855 | 0.558597 | 0.142120 | 0.006777 | -0.100356 | 0.015060 | 0.277700 | 0.062747 | ... | -0.042803 | 0.024793 | 0.177462 | Leptodactylidae | Adenomera | AdenomeraHylaedactylus | 22 | 1 | 1 | 1 |
293 | 1.0 | 0.339049 | -0.001276 | 0.075088 | 0.298091 | 0.190639 | 0.022295 | 0.049216 | 0.175380 | -0.007751 | ... | 0.124870 | -0.004888 | -0.040086 | Leptodactylidae | Adenomera | AdenomeraAndre | 7 | 0 | 1 | 1 |
1593 | 1.0 | 0.211356 | 0.132368 | 0.530019 | 0.181015 | 0.047415 | -0.142114 | 0.000687 | 0.249328 | 0.032000 | ... | -0.160102 | 0.084770 | 0.276008 | Leptodactylidae | Adenomera | AdenomeraHylaedactylus | 15 | 1 | 1 | 1 |
4669 | 1.0 | 0.069635 | 0.170713 | 0.583894 | 0.275507 | 0.086236 | -0.152521 | -0.032355 | 0.268403 | 0.054420 | ... | -0.058546 | 0.205891 | 0.211869 | Leptodactylidae | Adenomera | AdenomeraHylaedactylus | 24 | 1 | 1 | 1 |
940 | 1.0 | 0.222777 | -0.069955 | 0.299370 | 0.318585 | 0.094394 | -0.019920 | 0.120537 | 0.192053 | 0.047852 | ... | 0.023067 | -0.044556 | 0.006679 | Dendrobatidae | Ameerega | Ameeregatrivittata | 11 | 0 | 0 | 0 |
5 rows × 29 columns
Indicate that we are solving a multi-label classification problem. The default metric and loss for this task is logloss.
[60]:
task = Task('multilabel')
multilabel isn`t supported in lgb
Specifying the roles. Now we have several columns with target variables, and it’s necessary to specify them all.
[61]:
roles = {
'target': ['Leptodactylidae', 'Adenomera', 'AdenomeraHylaedactylus'],
'drop': ['RecordID', 'Species', 'Genus', 'Family']
}
Create a TabularAutoML instance. One of the differences in this case is that, by default, a random forest algorithm will be used at the end before blending.
[62]:
automl = TabularAutoML(
task = task,
timeout = 3600,
cpu_limit = N_THREADS,
reader_params = {'n_jobs': N_THREADS, 'cv': N_FOLDS, 'random_state': RANDOM_STATE}, #TODO: N_THREADS
general_params = {'use_algos': 'auto'}
)
Training:
[63]:
%%time
out_of_fold_predictions = automl.fit_predict(train_data, roles = roles, verbose = 1)
[11:36:53] Stdout logging level is INFO.
[11:36:53] Task: multilabel
[11:36:53] Start automl preset with listed constraints:
[11:36:53] - time: 3600.00 seconds
[11:36:53] - CPU: 4 cores
[11:36:53] - memory: 16 GB
[11:36:53] Train data shape: (5756, 29)
[11:36:55] Layer 1 train process start. Time left 3597.53 secs
[11:36:55] Start fitting Lvl_0_Pipe_0_Mod_0_LinearL2 ...
[11:36:57] Fitting Lvl_0_Pipe_0_Mod_0_LinearL2 finished. score = -1.7529212493912845
[11:36:57] Lvl_0_Pipe_0_Mod_0_LinearL2 fitting and predicting completed
[11:36:57] Time left 3596.09 secs
[11:36:57] Start fitting Lvl_0_Pipe_1_Mod_0_RFSklearn ...
[11:37:17] Fitting Lvl_0_Pipe_1_Mod_0_RFSklearn finished. score = -1.7417950976311207
[11:37:17] Lvl_0_Pipe_1_Mod_0_RFSklearn fitting and predicting completed
[11:37:17] Start hyperparameters optimization for Lvl_0_Pipe_1_Mod_1_Tuned_RFSklearn ... Time budget is 300.00 secs
[11:42:18] Hyperparameters optimization for Lvl_0_Pipe_1_Mod_1_Tuned_RFSklearn completed
[11:42:18] Start fitting Lvl_0_Pipe_1_Mod_1_Tuned_RFSklearn ...
[11:42:24] Fitting Lvl_0_Pipe_1_Mod_1_Tuned_RFSklearn finished. score = -1.7253897032715284
[11:42:24] Lvl_0_Pipe_1_Mod_1_Tuned_RFSklearn fitting and predicting completed
[11:42:24] Time left 3268.90 secs
[11:42:35] Selector_CatBoost fitting and predicting completed
[11:42:35] Start fitting Lvl_0_Pipe_2_Mod_0_CatBoost ...
[11:43:09] Fitting Lvl_0_Pipe_2_Mod_0_CatBoost finished. score = -1.7239793367012388
[11:43:09] Lvl_0_Pipe_2_Mod_0_CatBoost fitting and predicting completed
[11:43:09] Start hyperparameters optimization for Lvl_0_Pipe_2_Mod_1_Tuned_CatBoost ... Time budget is 300.00 secs
[11:48:11] Hyperparameters optimization for Lvl_0_Pipe_2_Mod_1_Tuned_CatBoost completed
[11:48:11] Start fitting Lvl_0_Pipe_2_Mod_1_Tuned_CatBoost ...
[11:50:05] Fitting Lvl_0_Pipe_2_Mod_1_Tuned_CatBoost finished. score = -1.7232911778884046
[11:50:05] Lvl_0_Pipe_2_Mod_1_Tuned_CatBoost fitting and predicting completed
[11:50:05] Time left 2807.87 secs
[11:50:05] Layer 1 training completed.
[11:50:05] Blending: optimization starts with equal weights and score -1.7285958080655424
[11:50:05] Blending: iteration 0: score = -1.7225303376046726, weights = [0. 0. 0.06385981 0.5410291 0.39511114]
[11:50:05] Blending: iteration 1: score = -1.7225227486743002, weights = [0. 0. 0.05698766 0.5967933 0.34621903]
[11:50:05] Blending: iteration 2: score = -1.7231152348780834, weights = [0. 0. 0.05289694 0.618034 0.32906908]
[11:50:05] Blending: iteration 3: score = -1.7231152348780834, weights = [0. 0. 0.05289694 0.618034 0.32906908]
[11:50:05] Blending: no score update. Terminated
[11:50:05] Automl preset training completed in 792.55 seconds
[11:50:05] Model description:
Final prediction for new objects (level 0) =
0.05290 * (5 averaged models Lvl_0_Pipe_1_Mod_1_Tuned_RFSklearn) +
0.61803 * (5 averaged models Lvl_0_Pipe_2_Mod_0_CatBoost) +
0.32907 * (5 averaged models Lvl_0_Pipe_2_Mod_1_Tuned_CatBoost)
CPU times: user 36min 34s, sys: 3min 24s, total: 39min 58s
Wall time: 13min 12s
Get a prediction on the test data:
[64]:
%%time
test_predictions = automl.predict(test_data)
print(f'Prediction for test_data:\n{test_predictions}\nShape = {test_predictions.shape}')
Prediction for test_data:
array([[9.9725902e-01, 9.9696267e-01, 9.9710155e-01],
[8.3490519e-04, 4.1192214e-04, 9.3345130e-05],
[7.9921435e-04, 2.5811681e-04, 6.2886458e-05],
...,
[9.0384924e-01, 9.1573036e-01, 1.2525763e-04],
[9.9974513e-01, 9.9971306e-01, 2.4268193e-05],
[1.3019618e-03, 4.1277776e-04, 8.0916194e-05]], dtype=float32)
Shape = (1439, 3)
CPU times: user 4.21 s, sys: 572 ms, total: 4.78 s
Wall time: 1.77 s
Note that in the case of multi-label classification the class order always remains unchanged.
[65]:
automl.reader.class_mapping
[65]:
{'Leptodactylidae': None, 'Adenomera': None, 'AdenomeraHylaedactylus': None}
It is important to note that models taken into the final composition may not have had time to train on all cross-validation folds; in that case their out-of-fold predictions on the missing folds will be NaNs:
[66]:
np.unique(np.isnan(out_of_fold_predictions.data))
[66]:
array([False])
But for new data, normal numerical predictions are made:
[67]:
np.unique(np.isnan(test_predictions.data))
[67]:
array([False])
Therefore, when NaNs are present in the out-of-fold predictions, the sklearn logloss can only be calculated on the test set:
[68]:
from sklearn.metrics import log_loss
print(f'HOLDOUT score: {log_loss(test_data[targets].values, test_predictions.data)}')
HOLDOUT score: 1.7311416755276812
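If the out-of-fold predictions do contain NaNs, a simple workaround is to score only the rows whose predictions are fully defined (a minimal sketch reusing the objects from this section):

# Keep only the rows without NaNs in the out-of-fold predictions
not_nan = np.all(~np.isnan(out_of_fold_predictions.data), axis=1)
print(f'OOF score (rows without NaNs): '
      f'{log_loss(train_data[targets].values[not_nan], out_of_fold_predictions.data[not_nan])}')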
Multi-output regression
For completeness, let's consider a multi-output regression task. Here you will use the Energy Efficiency dataset.
Data loading and splitting:
[69]:
# !pip install openpyxl
[70]:
columns = [
'relative_compactness', 'surface_area', 'wall_area', 'roof_area',
'overall_height', 'orientation', 'glazing_area',
'glazing_area_distribution', 'heating_load', 'cooling_load'
]
data = pd.read_excel("https://archive.ics.uci.edu/ml/machine-learning-databases/00242/ENB2012_data.xlsx", names=columns, header=None)
data = data.drop(index=0, inplace=False)
data
[70]:
relative_compactness | surface_area | wall_area | roof_area | overall_height | orientation | glazing_area | glazing_area_distribution | heating_load | cooling_load | |
---|---|---|---|---|---|---|---|---|---|---|
1 | 0.98 | 514.5 | 294 | 110.25 | 7 | 2 | 0 | 0 | 15.55 | 21.33 |
2 | 0.98 | 514.5 | 294 | 110.25 | 7 | 3 | 0 | 0 | 15.55 | 21.33 |
3 | 0.98 | 514.5 | 294 | 110.25 | 7 | 4 | 0 | 0 | 15.55 | 21.33 |
4 | 0.98 | 514.5 | 294 | 110.25 | 7 | 5 | 0 | 0 | 15.55 | 21.33 |
5 | 0.9 | 563.5 | 318.5 | 122.5 | 7 | 2 | 0 | 0 | 20.84 | 28.28 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
764 | 0.64 | 784 | 343 | 220.5 | 3.5 | 5 | 0.4 | 5 | 17.88 | 21.4 |
765 | 0.62 | 808.5 | 367.5 | 220.5 | 3.5 | 2 | 0.4 | 5 | 16.54 | 16.88 |
766 | 0.62 | 808.5 | 367.5 | 220.5 | 3.5 | 3 | 0.4 | 5 | 16.44 | 17.11 |
767 | 0.62 | 808.5 | 367.5 | 220.5 | 3.5 | 4 | 0.4 | 5 | 16.48 | 16.61 |
768 | 0.62 | 808.5 | 367.5 | 220.5 | 3.5 | 5 | 0.4 | 5 | 16.64 | 16.03 |
768 rows × 10 columns
[71]:
train_data, test_data = train_test_split(
data,
test_size=TEST_SIZE,
random_state=RANDOM_STATE
)
print(f'Data is splitted. Parts sizes: train_data = {train_data.shape}, test_data = {test_data.shape}')
train_data.head()
Data is splitted. Parts sizes: train_data = (614, 10), test_data = (154, 10)
[71]:
relative_compactness | surface_area | wall_area | roof_area | overall_height | orientation | glazing_area | glazing_area_distribution | heating_load | cooling_load | |
---|---|---|---|---|---|---|---|---|---|---|
61 | 0.82 | 612.5 | 318.5 | 147 | 7 | 2 | 0.1 | 1 | 23.53 | 27.31 |
619 | 0.64 | 784 | 343 | 220.5 | 3.5 | 4 | 0.4 | 2 | 18.9 | 22.09 |
347 | 0.86 | 588 | 294 | 147 | 7 | 4 | 0.25 | 2 | 29.27 | 29.9 |
295 | 0.9 | 563.5 | 318.5 | 122.5 | 7 | 4 | 0.25 | 1 | 32.84 | 32.71 |
232 | 0.66 | 759.5 | 318.5 | 220.5 | 3.5 | 5 | 0.1 | 4 | 11.43 | 14.83 |
[72]:
train_data = train_data.astype('float')
test_data = test_data.astype('float')
Specify the Task object. The default loss and metric for multi-output regression is MAE.
[73]:
task = Task('multi:reg')
multi:reg isn`t supported in lgb
Roles setting:
[74]:
roles = {
'target': ['heating_load', 'cooling_load'],
}
Create a TabularAutoML instance:
[75]:
automl = TabularAutoML(
task = task,
timeout = 600,
cpu_limit = N_THREADS,
reader_params = {'n_jobs': N_THREADS, 'cv': N_FOLDS, 'random_state': RANDOM_STATE},
general_params = {'use_algos': 'auto'}
)
By default, a random forest algorithm will be used at the end before blending.
Training and getting out-of-fold prediction:
[76]:
%%time
out_of_fold_predictions = automl.fit_predict(train_data, roles = roles, verbose = 1)
[11:50:13] Stdout logging level is INFO.
[11:50:13] Task: multi:reg
[11:50:13] Start automl preset with listed constraints:
[11:50:13] - time: 600.00 seconds
[11:50:13] - CPU: 4 cores
[11:50:13] - memory: 16 GB
[11:50:13] Train data shape: (614, 10)
[11:50:16] Layer 1 train process start. Time left 597.75 secs
[11:50:16] Start fitting Lvl_0_Pipe_0_Mod_0_LinearL2 ...
[11:50:20] Fitting Lvl_0_Pipe_0_Mod_0_LinearL2 finished. score = -1.097580430530958
[11:50:20] Lvl_0_Pipe_0_Mod_0_LinearL2 fitting and predicting completed
[11:50:20] Time left 592.91 secs
[11:50:20] Start fitting Lvl_0_Pipe_1_Mod_0_RFSklearn ...
[11:50:28] Fitting Lvl_0_Pipe_1_Mod_0_RFSklearn finished. score = -1.9434915645736046
[11:50:28] Lvl_0_Pipe_1_Mod_0_RFSklearn fitting and predicting completed
[11:50:28] Start hyperparameters optimization for Lvl_0_Pipe_1_Mod_1_Tuned_RFSklearn ... Time budget is 222.28 secs
[11:53:14] Hyperparameters optimization for Lvl_0_Pipe_1_Mod_1_Tuned_RFSklearn completed
[11:53:14] Start fitting Lvl_0_Pipe_1_Mod_1_Tuned_RFSklearn ...
[11:53:18] Fitting Lvl_0_Pipe_1_Mod_1_Tuned_RFSklearn finished. score = -0.7539709750526506
[11:53:18] Lvl_0_Pipe_1_Mod_1_Tuned_RFSklearn fitting and predicting completed
[11:53:18] Time left 414.93 secs
[11:53:19] Selector_CatBoost fitting and predicting completed
[11:53:19] Start fitting Lvl_0_Pipe_2_Mod_0_CatBoost ...
[11:53:22] Fitting Lvl_0_Pipe_2_Mod_0_CatBoost finished. score = -0.5153592339866712
[11:53:22] Lvl_0_Pipe_2_Mod_0_CatBoost fitting and predicting completed
[11:53:22] Start hyperparameters optimization for Lvl_0_Pipe_2_Mod_1_Tuned_CatBoost ... Time budget is 250.24 secs
[11:54:37] Hyperparameters optimization for Lvl_0_Pipe_2_Mod_1_Tuned_CatBoost completed
[11:54:37] Start fitting Lvl_0_Pipe_2_Mod_1_Tuned_CatBoost ...
[11:54:40] Fitting Lvl_0_Pipe_2_Mod_1_Tuned_CatBoost finished. score = -0.5357584665497276
[11:54:40] Lvl_0_Pipe_2_Mod_1_Tuned_CatBoost fitting and predicting completed
[11:54:40] Time left 333.04 secs
[11:54:40] Layer 1 training completed.
[11:54:40] Blending: optimization starts with equal weights and score -0.8152277580689917
[11:54:40] Blending: iteration 0: score = -0.5153592339866712, weights = [0. 0. 0. 1. 0.]
[11:54:40] Blending: iteration 1: score = -0.5153592339866712, weights = [0. 0. 0. 1. 0.]
[11:54:40] Blending: no score update. Terminated
[11:54:40] Automl preset training completed in 267.02 seconds
[11:54:40] Model description:
Final prediction for new objects (level 0) =
1.00000 * (5 averaged models Lvl_0_Pipe_2_Mod_0_CatBoost)
CPU times: user 10min 21s, sys: 1min 44s, total: 12min 5s
Wall time: 4min 27s
Make prediction on test data:
[77]:
%%time
test_predictions = automl.predict(test_data)
CPU times: user 1.32 s, sys: 82.5 ms, total: 1.4 s
Wall time: 61.1 ms
Evaluate regression quality:
[78]:
from sklearn.metrics import mean_absolute_error
mae_h_train = mean_absolute_error(train_data["heating_load"].values, out_of_fold_predictions.data[:, 0])
mae_c_train = mean_absolute_error(train_data["cooling_load"].values, out_of_fold_predictions.data[:, 1])
mae_h_test = mean_absolute_error(test_data["heating_load"].values, test_predictions.data[:, 0])
mae_c_test = mean_absolute_error(test_data["cooling_load"].values, test_predictions.data[:, 1])
print(f'OOF score, heating_load: {mae_h_train}')
print(f'OOF score, cooling_load: {mae_c_train}')
print(f'HOLDOUT score, heating_load: {mae_h_test}')
print(f'HOLDOUT score, cooling_load: {mae_c_test}')
print(f'OOF score, general: {(mae_h_train + mae_c_train) / 2}')
print(f'HOLDOUT score, general: {(mae_h_test + mae_c_test) / 2}')
OOF score, heating_load: 0.34296714751650537
OOF score, cooling_load: 0.6877513204568373
HOLDOUT score, heating_load: 0.3168395171475101
HOLDOUT score, cooling_load: 0.5896234752605487
OOF score, general: 0.5153592339866713
HOLDOUT score, general: 0.45323149620402936
Additional materials
Tutorial 2: AutoWoE (WhiteBox model for binary classification on tabular data)
Official LightAutoML github repository is here
Scorecard
Linear model
Discretization
Selection and One-dimensional analysis
Whitebox pipeline:
General parameters
Technical
n_jobs
debug
Simple features typing and initial cleaning
1.1. Remove trash features
Medium:
th_nan
th_const
1.2. Typing (auto or user defined)
Critical:
features_type (dict), e.g. {'age': 'real', 'education': 'cat', 'birth_date': (None, ("d", "wd")), ...}
1.3. Categories and datetimes encoding
Critical:
features_type (for datetimes)
Optional:
cat_alpha (int) - greater means more conservative encoding
Pre selection (based on BlackBox model importances)
Critical:
select_type (None or int)
imp_type (if type(select_type) is int ‘perm_imt’/’feature_imp’)
Optional:
imt_th (float) - threshold for select_type is None
Binning (discretization)
Critical:
monotonic / features_monotone_constraints
max_bin_count / max_bin_count
min_bin_size
cat_merge_to
nan_merge_to
Medium:
force_single_split
Optional:
min_bin_mults
min_gains_to_split
WoE estimation, WoE = LN( ((% of class 0 in bin) / (% of class 0 in sample)) / ((% of class 1 in bin) / (% of class 1 in sample)) ); a small numeric illustration follows this parameter list:
Critical:
oof_woe
Optional:
woe_diff_th
n_folds (if oof_woe)
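A small numeric illustration of the WoE formula above (toy counts, unrelated to any dataset in this tutorial):

import numpy as np

# Counts of class 0 (non-event) and class 1 (event) objects in one bin and in the whole sample
n0_bin, n0_total = 300, 1000
n1_bin, n1_total = 50, 500

# WoE = LN( (share of all class-0 objects falling into the bin) / (share of all class-1 objects falling into the bin) )
woe = np.log((n0_bin / n0_total) / (n1_bin / n1_total))
print(woe)  # ln(0.30 / 0.10) = ln(3) ≈ 1.10, i.e. the bin is relatively dominated by class 0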
2nd selection stage:
5.1. One-dimensional importance
Critical: - auc_th
5.2. VIF
Critical: - vif_th
5.3. Partial correlations
Critical: - pearson_th
3rd selection stage (model based)
Optional:
n_folds
l1_base_step
l1_exp_step
Do not touch:
population_size
feature_groups_count
Fitting the final model
Critical:
regularized_refit
p_val (if not regularized_refit)
validation (if not regularized_refit)
Optional:
interpreted_model
l1_base_step (if regularized_refit)
l1_exp_step (if regularized_refit)
Report generation
report_params
Imports
[1]:
import pandas as pd
from pandas import Series, DataFrame
import numpy as np
import os
import requests
import joblib
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from autowoe import AutoWoE, ReportDeco
Reading the data and train/test split
[2]:
DATASET_DIR = '../data/'
DATASET_NAME = 'jobs_train.csv'
DATASET_FULLNAME = os.path.join(DATASET_DIR, DATASET_NAME)
DATASET_URL = 'https://raw.githubusercontent.com/AILab-MLTools/LightAutoML/master/examples/data/jobs_train.csv'
[3]:
%%time
if not os.path.exists(DATASET_FULLNAME):
os.makedirs(DATASET_DIR, exist_ok=True)
dataset = requests.get(DATASET_URL).text
with open(DATASET_FULLNAME, 'w') as output:
output.write(dataset)
CPU times: user 14 µs, sys: 12 µs, total: 26 µs
Wall time: 62 µs
[2]:
data = pd.read_csv(DATASET_FULLNAME)
[3]:
data
[3]:
enrollee_id | city | city_development_index | gender | relevent_experience | enrolled_university | education_level | major_discipline | experience | company_size | company_type | last_new_job | training_hours | target | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 8949 | city_103 | 0.920 | Male | Has relevent experience | no_enrollment | Graduate | STEM | 21.0 | NaN | NaN | 1.0 | 36 | 1.0 |
1 | 29725 | city_40 | 0.776 | Male | No relevent experience | no_enrollment | Graduate | STEM | 15.0 | 99.0 | Pvt Ltd | 5.0 | 47 | 0.0 |
2 | 11561 | city_21 | 0.624 | NaN | No relevent experience | Full time course | Graduate | STEM | 5.0 | NaN | NaN | 0.0 | 83 | 0.0 |
3 | 33241 | city_115 | 0.789 | NaN | No relevent experience | NaN | Graduate | Business Degree | 0.0 | NaN | Pvt Ltd | 0.0 | 52 | 1.0 |
4 | 666 | city_162 | 0.767 | Male | Has relevent experience | no_enrollment | Masters | STEM | 21.0 | 99.0 | Funded Startup | 4.0 | 8 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
19153 | 7386 | city_173 | 0.878 | Male | No relevent experience | no_enrollment | Graduate | Humanities | 14.0 | NaN | NaN | 1.0 | 42 | 1.0 |
19154 | 31398 | city_103 | 0.920 | Male | Has relevent experience | no_enrollment | Graduate | STEM | 14.0 | NaN | NaN | 4.0 | 52 | 1.0 |
19155 | 24576 | city_103 | 0.920 | Male | Has relevent experience | no_enrollment | Graduate | STEM | 21.0 | 99.0 | Pvt Ltd | 4.0 | 44 | 0.0 |
19156 | 5756 | city_65 | 0.802 | Male | Has relevent experience | no_enrollment | High School | NaN | 0.0 | 999.0 | Pvt Ltd | 2.0 | 97 | 0.0 |
19157 | 23834 | city_67 | 0.855 | NaN | No relevent experience | no_enrollment | Primary School | NaN | 2.0 | NaN | NaN | 1.0 | 127 | 0.0 |
19158 rows × 14 columns
[4]:
train, test = train_test_split(data.drop('enrollee_id', axis=1), test_size=0.2, stratify=data['target'])
AutoWoe: default settings
[5]:
auto_woe_0 = AutoWoE(interpreted_model=True,
monotonic=False,
max_bin_count=5,
select_type=None,
pearson_th=0.9,
auc_th=.505,
vif_th=10.,
imp_th=0,
th_const=32,
force_single_split=True,
th_nan=0.01,
th_cat=0.005,
auc_tol=1e-4,
cat_alpha=100,
cat_merge_to="to_woe_0",
nan_merge_to="to_woe_0",
imp_type="feature_imp",
regularized_refit=False,
p_val=0.05,
verbose=2
)
auto_woe_0 = ReportDeco(auto_woe_0, )
[6]:
auto_woe_0.fit(train,
target_name="target",
)
city processing...
city_development_index processing...
gender processing...
relevent_experience processing...
enrolled_university processing...
education_level processing...
experience processing...
company_size processing...
company_type processing...
last_new_job processing...
training_hours processing...
dict_keys(['city', 'city_development_index', 'gender', 'relevent_experience', 'enrolled_university', 'education_level', 'experience', 'company_size', 'company_type', 'last_new_job', 'training_hours']) to selector !!!!!
Feature selection...
city_development_index -0.974107
company_size -0.795953
company_type -0.400146
experience -0.184238
enrolled_university -0.251287
education_level -1.188926
dtype: float64
[7]:
test_prediction = auto_woe_0.predict_proba(test)
test_prediction
[7]:
array([0.06265852, 0.56483877, 0.04151965, ..., 0.15191705, 0.08528486,
0.0409943 ])
[8]:
roc_auc_score(test['target'].values, test_prediction)
[8]:
0.8034365349304012
[9]:
report_params = {"output_path": "HR_REPORT_1", # folder for report generation
"report_name": "WHITEBOX REPORT",
"report_version_id": 1,
"city": "Moscow",
"model_aim": "Predict if candidate will work for the company",
"model_name": "HR model",
"zakazchik": "Kaggle",
"high_level_department": "Ai Lab",
"ds_name": "Btbpanda",
"target_descr": "Candidate will work for the company",
"non_target_descr": "Candidate will work for the company"}
auto_woe_0.generate_report(report_params, )
No handles with labels found to put in legend.
No handles with labels found to put in legend.
No handles with labels found to put in legend.
No handles with labels found to put in legend.
No handles with labels found to put in legend.
No handles with labels found to put in legend.
No handles with labels found to put in legend.
AutoWoE - simpler model
[10]:
auto_woe_1 = AutoWoE(interpreted_model=True,
monotonic=True,
max_bin_count=4,
select_type=None,
pearson_th=0.9,
auc_th=.505,
vif_th=10.,
imp_th=0,
th_const=32,
force_single_split=True,
th_nan=0.01,
th_cat=0.005,
auc_tol=1e-4,
cat_alpha=100,
cat_merge_to="to_woe_0",
nan_merge_to="to_woe_0",
imp_type="feature_imp",
regularized_refit=False,
p_val=0.05,
verbose=2
)
auto_woe_1 = ReportDeco(auto_woe_1, )
[11]:
auto_woe_1.fit(train,
target_name="target",
)
city processing...city_development_index processing...
gender processing...
relevent_experience processing...
enrolled_university processing...education_level processing...
experience processing...company_type processing...company_size processing...
last_new_job processing...
training_hours processing...
dict_keys(['city', 'city_development_index', 'gender', 'relevent_experience', 'enrolled_university', 'education_level', 'experience', 'company_size', 'company_type', 'last_new_job', 'training_hours']) to selector !!!!!
Feature selection...
city -0.516274
city_development_index -0.512608
company_size -0.814922
company_type -0.397978
experience -0.175231
enrolled_university -0.219507
education_level -1.239627
dtype: float64
[12]:
test_prediction = auto_woe_1.predict_proba(test)
test_prediction
[12]:
array([0.06460692, 0.57321671, 0.0497262 , ..., 0.13746553, 0.07190761,
0.04153373])
[13]:
roc_auc_score(test['target'].values, test_prediction)
[13]:
0.8019815944109903
[14]:
report_params = {"output_path": "HR_REPORT_2", # folder for report generation
"report_name": "WHITEBOX REPORT",
"report_version_id": 2,
"city": "Moscow",
"model_aim": "Predict if candidate will work for the company",
"model_name": "HR model",
"zakazchik": "Kaggle",
"high_level_department": "Ai Lab",
"ds_name": "Btbpanda",
"target_descr": "Candidate will work for the company",
"non_target_descr": "Candidate will work for the company"}
auto_woe_1.generate_report(report_params, )
No handles with labels found to put in legend.
No handles with labels found to put in legend.
No handles with labels found to put in legend.
No handles with labels found to put in legend.
No handles with labels found to put in legend.
No handles with labels found to put in legend.
No handles with labels found to put in legend.
No handles with labels found to put in legend.
WhiteBox preset - like TabularAutoML
[15]:
from lightautoml.automl.presets.whitebox_presets import WhiteBoxPreset
from lightautoml import Task
[16]:
task = Task('binary')
automl = WhiteBoxPreset(task)
[17]:
train_pred = automl.fit_predict(train.reset_index(drop=True), roles={'target': 'target'})
Validation data is not set. Train will be used as valid in report and valid prediction
Start automl preset with listed constraints:
- time: 3600 seconds
- cpus: 4 cores
- memory: 16 gb
Train data shape: (15326, 13)
Feats was rejected during automatic roles guess: []
Layer 1 ...
Train process start. Time left 3595.0072581768036 secs
Start fitting Lvl_0_Pipe_0_Mod_0_WhiteBox ...
===== Start working with fold 0 for Lvl_0_Pipe_0_Mod_0_WhiteBox =====
features [] contain too many nans or identical values
features [] have low importance
city processing...
city_development_index processing...company_type processing...education_level processing...
enrolled_university processing...
gender processing...
major_discipline processing...
relevent_experience processing...
company_size processing...
experience processing...
last_new_job processing...
training_hours processing...
dict_keys(['city', 'city_development_index', 'company_type', 'education_level', 'enrolled_university', 'gender', 'major_discipline', 'relevent_experience', 'company_size', 'experience', 'last_new_job', 'training_hours']) to selector !!!!!
Feature selection...
Feature training_hours removed due to low AUC value 0.5031265374717342
Feature city_development_index removed due to high VIF value = 40.56438648184099
C parameter range in [0.0002603488674824265:260.3488674824265], 20 values
Result(score=0.7856775296767177, reg_alpha=0.020431136952654548, is_neg=True, min_weights=city -0.980620
company_size -0.800535
company_type -0.340185
experience -0.198176
enrolled_university -0.101047
relevent_experience 0.000000
education_level -0.624324
last_new_job 0.000000
gender 0.000000
major_discipline -0.317699
dtype: float64)
Iter 0 of final refit starts with 7 features
Validation data checks
city -0.956550
company_size -0.866063
company_type -0.402941
experience -0.329493
enrolled_university -0.230776
education_level -0.641994
major_discipline -1.596907
dtype: float64
Lvl_0_Pipe_0_Mod_0_WhiteBox fitting and predicting completed
Time left 3587.2280378341675
Automl preset training completed in 12.77 seconds.
[18]:
test_prediction = automl.predict(test).data[:, 0]
[19]:
roc_auc_score(test['target'].values, test_prediction)
[19]:
0.7966826628232216
Serialization
Important note: auto_woe_1 is the ReportDeco object (the report generator object), not AutoWoE itself. To get the underlying AutoWoE object, use auto_woe_1.model.
Using the ReportDeco object for inference is not recommended for several reasons:
- The report object needs the target column in the data because it calculates model quality metrics
- Model inference through the ReportDeco object is slower than through the plain model because of the report update procedure
[20]:
joblib.dump(auto_woe_1.model, 'model.pkl')
model = joblib.load('model.pkl')
SQL inference query
[21]:
sql_query = model.get_sql_inference_query('global_temp.TABLE_1')
print(sql_query)
SELECT
1 / (1 + EXP(-(
-1.111
-0.516*WOE_TAB.city
-0.513*WOE_TAB.city_development_index
-0.815*WOE_TAB.company_size
-0.398*WOE_TAB.company_type
-0.175*WOE_TAB.experience
-0.22*WOE_TAB.enrolled_university
-1.24*WOE_TAB.education_level
))) as PROB,
WOE_TAB.*
FROM
(SELECT
CASE
WHEN (city IS NULL OR LOWER(CAST(city AS VARCHAR(50))) = 'nan') THEN 0
WHEN city IN ('city_100', 'city_102', 'city_103', 'city_116', 'city_149', 'city_159', 'city_160', 'city_45', 'city_46', 'city_64', 'city_71', 'city_73', 'city_83', 'city_99') THEN 0.213
WHEN city IN ('city_104', 'city_114', 'city_136', 'city_138', 'city_16', 'city_173', 'city_23', 'city_28', 'city_36', 'city_50', 'city_57', 'city_61', 'city_65', 'city_67', 'city_75', 'city_97') THEN 1.017
WHEN city IN ('city_11', 'city_21', 'city_74') THEN -1.455
ELSE -0.209
END AS city,
CASE
WHEN (city_development_index IS NULL OR city_development_index = 'NaN') THEN 0
WHEN city_development_index <= 0.6245 THEN -1.454
WHEN city_development_index <= 0.7915 THEN -0.121
WHEN city_development_index <= 0.9235 THEN 0.461
ELSE 1.101
END AS city_development_index,
CASE
WHEN (company_size IS NULL OR company_size = 'NaN') THEN -0.717
WHEN company_size <= 74.0 THEN 0.221
ELSE 0.467
END AS company_size,
CASE
WHEN (company_type IS NULL OR LOWER(CAST(company_type AS VARCHAR(50))) = 'nan') THEN -0.64
WHEN company_type IN ('Early Stage Startup', 'NGO', 'Other', 'Public Sector') THEN 0.164
WHEN company_type = 'Funded Startup' THEN 0.737
WHEN company_type = 'Pvt Ltd' THEN 0.398
ELSE 0
END AS company_type,
CASE
WHEN (experience IS NULL OR experience = 'NaN') THEN 0
WHEN experience <= 1.5 THEN -0.811
WHEN experience <= 7.5 THEN -0.319
WHEN experience <= 11.5 THEN 0.119
ELSE 0.533
END AS experience,
CASE
WHEN (enrolled_university IS NULL OR LOWER(CAST(enrolled_university AS VARCHAR(50))) = 'nan') THEN -0.327
WHEN enrolled_university = 'Full time course' THEN -0.614
WHEN enrolled_university = 'Part time course' THEN 0.026
WHEN enrolled_university = 'no_enrollment' THEN 0.208
ELSE 0
END AS enrolled_university,
CASE
WHEN (education_level IS NULL OR LOWER(CAST(education_level AS VARCHAR(50))) = 'nan') THEN 0.21
WHEN education_level = 'Graduate' THEN -0.166
WHEN education_level = 'High School' THEN 0.34
WHEN education_level = 'Masters' THEN 0.21
WHEN education_level IN ('Phd', 'Primary School') THEN 0.704
ELSE 0
END AS education_level
FROM global_temp.TABLE_1) as WOE_TAB
Check the SQL query with PySpark
[23]:
from pyspark.sql import SparkSession
[ ]:
spark = SparkSession.builder \
.master("local[2]") \
.appName("spark-course") \
.config("spark.driver.memory", "512m") \
.getOrCreate()
sc = spark.sparkContext
[24]:
spark_df = spark.read.csv("jobs_train.csv", header=True)
spark_df.createGlobalTempView("TABLE_1")
[25]:
res = spark.sql(sql_query).toPandas()
[26]:
res
[26]:
PROB | city | city_development_index | company_size | company_type | experience | enrolled_university | education_level | |
---|---|---|---|---|---|---|---|---|
0 | 0.365512 | 0.213 | 0.461 | -0.717 | -0.640 | 0.533 | 0.208 | -0.166 |
1 | 0.195716 | -0.209 | -0.121 | 0.467 | 0.398 | 0.533 | 0.208 | -0.166 |
2 | 0.835002 | -1.455 | -1.454 | -0.717 | -0.640 | -0.319 | -0.614 | -0.166 |
3 | 0.476161 | -0.209 | -0.121 | -0.717 | 0.398 | -0.811 | -0.327 | -0.166 |
4 | 0.117694 | -0.209 | -0.121 | 0.467 | 0.737 | 0.533 | 0.208 | 0.210 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
19153 | 0.275602 | 1.017 | 0.461 | -0.717 | -0.640 | 0.533 | 0.208 | -0.166 |
19154 | 0.365512 | 0.213 | 0.461 | -0.717 | -0.640 | 0.533 | 0.208 | -0.166 |
19155 | 0.126794 | 0.213 | 0.461 | 0.467 | 0.398 | 0.533 | 0.208 | -0.166 |
19156 | 0.060842 | 1.017 | 0.461 | 0.467 | 0.398 | -0.811 | 0.208 | 0.340 |
19157 | 0.130552 | 1.017 | 0.461 | -0.717 | -0.640 | -0.319 | 0.208 | 0.704 |
19158 rows × 8 columns
[27]:
sc.stop()
[28]:
full_prediction = model.predict_proba(data)
full_prediction
[28]:
array([0.36557352, 0.19577798, 0.83497665, ..., 0.12678668, 0.06083813,
0.13061427])
[29]:
(res['PROB'] - full_prediction).abs().max()
[29]:
0.0002878641803194526
Tutorial 3: SQL data source
Official LightAutoML github repository is here
Preparing
Step 1. Install LightAutoML
Uncomment the cell below if you didn't clone the repository via git (e.g., on Colab or Kaggle):
[1]:
#! pip install -U lightautoml
Step 2. Import necessary libraries
[2]:
# Standard python libraries
import os
import time
import requests
# Installed libraries
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
import torch
# Imports from our package
import gensim
from lightautoml.automl.presets.tabular_presets import TabularAutoML, TabularUtilizedAutoML
from lightautoml.dataset.roles import DatetimeRole
from lightautoml.tasks import Task
Step 3. Parameters
[3]:
N_THREADS = 8 # threads cnt for lgbm and linear models
N_FOLDS = 5 # folds cnt for AutoML
RANDOM_STATE = 42 # fixed random state for various reasons
TEST_SIZE = 0.2 # Test size for metric check
TIMEOUT = 300 # Time in seconds for automl run
TARGET_NAME = 'TARGET' # Target column name
Step 4. Fix torch number of threads and numpy seed
[4]:
np.random.seed(RANDOM_STATE)
torch.set_num_threads(N_THREADS)
Step 5. Example data load
Load the dataset from the repository if you didn't clone the repository via git.
[5]:
DATASET_DIR = '../data/'
DATASET_NAME = 'sampled_app_train.csv'
DATASET_FULLNAME = os.path.join(DATASET_DIR, DATASET_NAME)
DATASET_URL = 'https://raw.githubusercontent.com/AILab-MLTools/LightAutoML/master/examples/data/sampled_app_train.csv'
[6]:
%%time
if not os.path.exists(DATASET_FULLNAME):
os.makedirs(DATASET_DIR, exist_ok=True)
dataset = requests.get(DATASET_URL).text
with open(DATASET_FULLNAME, 'w') as output:
output.write(dataset)
CPU times: user 29 µs, sys: 20 µs, total: 49 µs
Wall time: 68.4 µs
[7]:
%%time
data = pd.read_csv(DATASET_FULLNAME)
data.head()
CPU times: user 104 ms, sys: 19.8 ms, total: 123 ms
Wall time: 122 ms
[7]:
SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 313802 | 0 | Cash loans | M | N | Y | 0 | 270000.0 | 327024.0 | 15372.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
1 | 319656 | 0 | Cash loans | F | N | N | 0 | 108000.0 | 675000.0 | 19737.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
2 | 207678 | 0 | Revolving loans | F | Y | Y | 2 | 112500.0 | 270000.0 | 13500.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
3 | 381593 | 0 | Cash loans | F | N | N | 1 | 67500.0 | 142200.0 | 9630.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.0 |
4 | 258153 | 0 | Cash loans | F | Y | Y | 0 | 337500.0 | 1483231.5 | 46570.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 |
5 rows × 122 columns
Step 6. (Optional) Some user feature preparation
The cell below shows some user feature preparation that makes the task more difficult (this block can be omitted if you don't want to change the initial data):
[8]:
%%time
data['BIRTH_DATE'] = (np.datetime64('2018-01-01') + data['DAYS_BIRTH'].astype(np.dtype('timedelta64[D]'))).astype(str)
data['EMP_DATE'] = (np.datetime64('2018-01-01') + np.clip(data['DAYS_EMPLOYED'], None, 0).astype(np.dtype('timedelta64[D]'))
).astype(str)
data['constant'] = 1
data['allnan'] = np.nan
data['report_dt'] = np.datetime64('2018-01-01')
data.drop(['DAYS_BIRTH', 'DAYS_EMPLOYED'], axis=1, inplace=True)
CPU times: user 105 ms, sys: 8.82 ms, total: 114 ms
Wall time: 112 ms
Step 7. (Optional) Data splitting for train-test
The block below can be omitted if you are only going to train the model or if you have separate train and test files:
[9]:
%%time
train_data, test_data = train_test_split(data,
test_size=TEST_SIZE,
stratify=data[TARGET_NAME],
random_state=RANDOM_STATE)
print('Data splitted. Parts sizes: train_data = {}, test_data = {}'
.format(train_data.shape, test_data.shape))
Data splitted. Parts sizes: train_data = (8000, 125), test_data = (2000, 125)
CPU times: user 11.2 ms, sys: 0 ns, total: 11.2 ms
Wall time: 9.95 ms
[10]:
train_data.head()
[10]:
SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | ... | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | BIRTH_DATE | EMP_DATE | constant | allnan | report_dt | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
6444 | 112261 | 0 | Cash loans | F | N | N | 1 | 90000.0 | 640080.0 | 31261.5 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1985-06-28 | 2012-06-21 | 1 | NaN | 2018-01-01 |
3586 | 115058 | 0 | Cash loans | F | N | Y | 0 | 180000.0 | 239850.0 | 23850.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 1953-12-27 | 2018-01-01 | 1 | NaN | 2018-01-01 |
9349 | 326623 | 0 | Cash loans | F | N | Y | 0 | 112500.0 | 337500.0 | 31086.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 1975-06-21 | 2016-06-17 | 1 | NaN | 2018-01-01 |
7734 | 191976 | 0 | Cash loans | M | Y | Y | 1 | 67500.0 | 135000.0 | 9018.0 | ... | NaN | NaN | NaN | NaN | NaN | 1988-04-27 | 2009-06-05 | 1 | NaN | 2018-01-01 |
2174 | 281519 | 0 | Revolving loans | F | N | Y | 0 | 67500.0 | 202500.0 | 10125.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 1975-06-13 | 1997-01-22 | 1 | NaN | 2018-01-01 |
5 rows × 125 columns
Step 8. (Optional) Reading data from SqlDataSource
Preparing datasets as SQLite databases
[11]:
import sqlite3 as sql
for _fname in ('train.db', 'test.db'):
if os.path.exists(_fname):
os.remove(_fname)
train_db = sql.connect('train.db')
train_data.to_sql('data', train_db)
test_db = sql.connect('test.db')
test_data.to_sql('data', test_db)
Using dataset wrapper for a connection
[12]:
from lightautoml.reader.tabular_batch_generator import SqlDataSource
# train_data is replaced with a wrapper for an SQLAlchemy connection
# Wrapper requires SQLAlchemy connection string and query to obtain data from
train_data = SqlDataSource('sqlite:///train.db', 'select * from data', index='index')
test_data = SqlDataSource('sqlite:///test.db', 'select * from data', index='index')
AutoML preset usage
Step 1. Create Task
[13]:
%%time
task = Task('binary', )
CPU times: user 6.11 ms, sys: 1.41 ms, total: 7.52 ms
Wall time: 5.65 ms
Step 2. Set up column roles
The roles setup here sets the target column and the base date, which is used to calculate date differences:
[14]:
%%time
roles = {'target': TARGET_NAME,
DatetimeRole(base_date=True, seasonality=(), base_feats=False): 'report_dt',
}
CPU times: user 48 µs, sys: 32 µs, total: 80 µs
Wall time: 95.1 µs
Step 3. Create AutoML from preset
To create the AutoML model here we use the TabularAutoML preset. All params we set above can be sent into the preset to change its configuration:
[15]:
%%time
automl = TabularAutoML(task = task,
timeout = TIMEOUT,
general_params = {'nested_cv': False, 'use_algos': [['linear_l2', 'lgb', 'lgb_tuned']]},
reader_params = {'cv': N_FOLDS, 'random_state': RANDOM_STATE},
tuning_params = {'max_tuning_iter': 20, 'max_tuning_time': 30},
lgb_params = {'default_params': {'num_threads': N_THREADS}})
oof_pred = automl.fit_predict(train_data, roles = roles)
print('oof_pred:\n{}\nShape = {}'.format(oof_pred, oof_pred.shape))
oof_pred:
array([[0.0226106 ],
[0.02359573],
[0.02438388],
...,
[0.02287533],
[0.15669319],
[0.08664417]], dtype=float32)
Shape = (8000, 1)
CPU times: user 4min 19s, sys: 3.59 s, total: 4min 23s
Wall time: 1min 11s
Step 4. Predict on test data and check scores
[16]:
%%time
test_pred = automl.predict(test_data)
print('Prediction for test data:\n{}\nShape = {}'
.format(test_pred, test_pred.shape))
print('Check scores...')
print('OOF score: {}'.format(roc_auc_score(train_data.data[TARGET_NAME].values, oof_pred.data[:, 0])))
print('TEST score: {}'.format(roc_auc_score(test_data.data[TARGET_NAME].values, test_pred.data[:, 0])))
Prediction for test data:
array([[0.05828221],
[0.07749337],
[0.02520473],
...,
[0.05070161],
[0.0373171 ],
[0.23640296]], dtype=float32)
Shape = (2000, 1)
Check scores...
OOF score: 0.7500913646530726
TEST score: 0.7331657608695653
CPU times: user 1.05 s, sys: 4.05 ms, total: 1.06 s
Wall time: 449 ms
Step 5. Create AutoML with time utilization
Below we are going to create a specific AutoML preset for TIMEOUT utilization (it tries to spend as much of the time budget as possible):
[20]:
%%time
automl = TabularUtilizedAutoML(task = task,
timeout = TIMEOUT,
general_params = {'nested_cv': False, 'use_algos': [['linear_l2', 'lgb', 'lgb_tuned']]},
reader_params = {'cv': N_FOLDS, 'random_state': RANDOM_STATE},
tuning_params = {'max_tuning_iter': 20, 'max_tuning_time': 30},
lgb_params = {'default_params': {'num_threads': N_THREADS}})
oof_pred = automl.fit_predict(train_data, roles = roles)
print('oof_pred:\n{}\nShape = {}'.format(oof_pred, oof_pred.shape))
oof_pred:
array([[0.0343032 ],
[0.01933593],
[0.02276292],
...,
[0.02349434],
[0.17084229],
[0.09522362]], dtype=float32)
Shape = (8000, 1)
CPU times: user 16min 54s, sys: 12.3 s, total: 17min 6s
Wall time: 4min 29s
Step 6. Predict on test data and check scores for the utilized AutoML
[21]:
%%time
test_pred = automl.predict(test_data)
print('Prediction for test data:\n{}\nShape = {}'
.format(test_pred, test_pred.shape))
print('Check scores...')
print('OOF score: {}'.format(roc_auc_score(train_data.data[TARGET_NAME].values, oof_pred.data[:, 0])))
print('TEST score: {}'.format(roc_auc_score(test_data.data[TARGET_NAME].values, test_pred.data[:, 0])))
Prediction for test data:
array([[0.05981494],
[0.07601136],
[0.02678316],
...,
[0.04721078],
[0.03855655],
[0.19377196]], dtype=float32)
Shape = (2000, 1)
Check scores...
OOF score: 0.7586795357421285
TEST score: 0.730679347826087
CPU times: user 2.99 s, sys: 64.1 ms, total: 3.05 s
Wall time: 1.21 s
Tutorial 4: Interpretation Tutorial (requires GPU)
Official LightAutoML github repository is here
Some of the HTML static content does not load here; to solve this problem you can use nbviewer. The link to this tutorial on nbviewer is here.
In recent years deep neural networks, gradient boosting, and ensembles of models have improved solution quality for many applied tasks in natural language processing (NLP). However, aggregate quality metrics describe only part of the model's behavior and can hide errors, for example mistakes in model construction or in data collection. This can be critical in tasks related to processing medical, forensic, or banking data. In this tutorial we will examine the NLP interpretation module of AutoML.
Download library and make some imports
[1]:
# !pip install lightautoml
[2]:
import shutil
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score, mean_squared_error
from sklearn.model_selection import train_test_split
from lightautoml.automl.presets.text_presets import TabularNLPAutoML
from lightautoml.tasks import Task
from lightautoml.addons.interpretation import LimeTextExplainer, L2XTextExplainer
import transformers
transformers.logging.set_verbosity(50)
import pickle
Download data
For this tutorial we will use the train dataset (train.csv) from the Jigsaw Toxic Comment Classification Challenge. The dataset contains textual comments and 6 attributes of each text (toxic, severe_toxic, obscene, threat, insult, identity_hate). For now, we will use only the toxic attribute.
[3]:
# train.csv file from
# https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/overview
data = pd.read_csv('train.csv')
data
[3]:
id | comment_text | toxic | severe_toxic | obscene | threat | insult | identity_hate | |
---|---|---|---|---|---|---|---|---|
0 | 0000997932d777bf | Explanation\nWhy the edits made under my usern... | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 000103f0d9cfb60f | D'aww! He matches this background colour I'm s... | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 000113f07ec002fd | Hey man, I'm really not trying to edit war. It... | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 0001b41b1c6bb37e | "\nMore\nI can't make any real suggestions on ... | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 0001d958c54c6e35 | You, sir, are my hero. Any chance you remember... | 0 | 0 | 0 | 0 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
159566 | ffe987279560d7ff | ":::::And for the second time of asking, when ... | 0 | 0 | 0 | 0 | 0 | 0 |
159567 | ffea4adeee384e90 | You should be ashamed of yourself \n\nThat is ... | 0 | 0 | 0 | 0 | 0 | 0 |
159568 | ffee36eab5c267c9 | Spitzer \n\nUmm, theres no actual article for ... | 0 | 0 | 0 | 0 | 0 | 0 |
159569 | fff125370e4aaaf3 | And it looks like it was actually you who put ... | 0 | 0 | 0 | 0 | 0 | 0 |
159570 | fff46fc426af1f9a | "\nAnd ... I really don't think you understand... | 0 | 0 | 0 | 0 | 0 | 0 |
159571 rows × 8 columns
Usage of AutoML
We will use the standard lightautoml.automl.presets.text_presets.TabularNLPAutoML preset with a finetuned TinyBERT from Hugging Face.
[4]:
np.random.seed(42)
train, test = train_test_split(data, test_size=0.2, random_state=42)
roles = {
'text': ['comment_text'],
'drop': ['id', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate'],
'target': 'toxic'
}
task = Task('binary')
automl = TabularNLPAutoML(
task=task,
timeout=3600,
cpu_limit=1,
gpu_ids='0',
general_params={
'nested_cv': False,
'use_algos': [['nn']]
},
autonlp_params={
'sent_scaler': 'l2'
},
text_params={
'lang': 'en',
'bert_model': 'prajjwal1/bert-tiny'
},
nn_params={
'opt_params': {'lr': 1e-5},
'max_length': 128,
'bs': 32,
'n_epochs': 7,
}
)
[5]:
%%time
oof_pred = automl.fit_predict(train, roles=roles, verbose = 10)
test_pred = automl.predict(test)
not_nan = np.any(~np.isnan(oof_pred.data), axis=1)  # rows whose folds were skipped (e.g. due to the time limit) have NaN OOF predictions
print('Check scores:')
print('OOF score: {}'.format(roc_auc_score(train[roles['target']].values[not_nan], oof_pred.data[not_nan][:, 0])))
print('TEST score: {}'.format(roc_auc_score(test[roles['target']].values, test_pred.data[:, 0])))
[11:22:30] Stdout logging level is DEBUG.
[11:22:30] Model language mode: en
[11:22:30] Task: binary
[11:22:30] Start automl preset with listed constraints:
[11:22:30] - time: 3600.00 seconds
[11:22:30] - CPU: 1 cores
[11:22:30] - memory: 16 GB
[11:22:30] Train data shape: (127656, 8)
[11:22:30] Layer 1 train process start. Time left 3599.85 secs
[11:22:31] Start fitting Lvl_0_Pipe_0_Mod_0_TorchNN ...
[11:22:31] Training params: {'bs': 32, 'num_workers': 1, 'max_length': 128, 'opt_params': {'lr': 1e-05}, 'scheduler_params': {'patience': 5, 'factor': 0.5, 'verbose': True}, 'is_snap': False, 'snap_params': {'k': 1, 'early_stopping': True, 'patience': 1, 'swa': False}, 'init_bias': True, 'n_epochs': 7, 'input_bn': False, 'emb_dropout': 0.1, 'emb_ratio': 3, 'max_emb_size': 50, 'bert_name': 'prajjwal1/bert-tiny', 'pooling': 'cls', 'device': device(type='cuda', index=0), 'use_cont': True, 'use_cat': True, 'use_text': True, 'lang': 'en', 'deterministic': False, 'multigpu': False, 'random_state': 42, 'path_to_save': None, 'verbose_inside': None, 'verbose': 1, 'device_ids': None, 'n_out': 1, 'cat_features': [], 'cat_dims': [], 'cont_features': [], 'cont_dim': 0, 'text_features': ['concated__comment_text'], 'bias': array([[-2.24401446]])}
[11:22:31] ===== Start working with fold 0 for Lvl_0_Pipe_0_Mod_0_TorchNN =====
[11:22:36] number of text features: 1
[11:22:36] number of categorical features: 0
[11:22:36] number of continuous features: 0
train (loss=0.257356): 100%|██████████| 2660/2660 [02:12<00:00, 20.13it/s]
val: 100%|██████████| 1330/1330 [01:07<00:00, 19.83it/s]
[11:25:59] Epoch: 0, train loss: 0.25735557079315186, val loss: 0.19599375128746033, val metric: 0.9640350800072578
train (loss=0.168968): 100%|██████████| 2660/2660 [02:09<00:00, 20.61it/s]
val: 100%|██████████| 1330/1330 [01:04<00:00, 20.58it/s]
[11:29:13] Epoch: 1, train loss: 0.16896754503250122, val loss: 0.14401142299175262, val metric: 0.9713461808486132
train (loss=0.131891): 100%|██████████| 2660/2660 [02:09<00:00, 20.49it/s]
val: 100%|██████████| 1330/1330 [01:03<00:00, 20.87it/s]
[11:32:26] Epoch: 2, train loss: 0.1318911910057068, val loss: 0.12361849099397659, val metric: 0.9742718921629787
train (loss=0.114705): 100%|██████████| 2660/2660 [02:07<00:00, 20.90it/s]
val: 100%|██████████| 1330/1330 [01:04<00:00, 20.76it/s]
[11:35:38] Epoch: 3, train loss: 0.11470535397529602, val loss: 0.11394938081502914, val metric: 0.9763582643756192
train (loss=0.103179): 100%|██████████| 2660/2660 [02:09<00:00, 20.54it/s]
val: 100%|██████████| 1330/1330 [01:05<00:00, 20.36it/s]
[11:38:53] Epoch: 4, train loss: 0.10317856818437576, val loss: 0.10656153410673141, val metric: 0.9775081138714583
train (loss=0.0965996): 100%|██████████| 2660/2660 [02:09<00:00, 20.49it/s]
val: 100%|██████████| 1330/1330 [01:05<00:00, 20.24it/s]
[11:42:08] Epoch: 5, train loss: 0.09659960865974426, val loss: 0.10427780449390411, val metric: 0.9783243683208365
train (loss=0.090561): 100%|██████████| 2660/2660 [02:11<00:00, 20.24it/s]
val: 100%|██████████| 1330/1330 [01:02<00:00, 21.23it/s]
[11:45:22] Epoch: 6, train loss: 0.09056100249290466, val loss: 0.10337436944246292, val metric: 0.9788043902058639
[11:45:23] ===== Start working with fold 1 for Lvl_0_Pipe_0_Mod_0_TorchNN =====
[11:45:28] number of text features: 1
[11:45:28] number of categorical features: 0
[11:45:28] number of continuous features: 0
train (loss=0.257485): 100%|██████████| 2660/2660 [02:04<00:00, 21.30it/s]
val: 100%|██████████| 1330/1330 [01:04<00:00, 20.67it/s]
[11:48:38] Epoch: 0, train loss: 0.2574850618839264, val loss: 0.19478833675384521, val metric: 0.961936968917119
train (loss=0.170552): 100%|██████████| 2660/2660 [02:08<00:00, 20.72it/s]
val: 100%|██████████| 1330/1330 [01:06<00:00, 19.87it/s]
[11:51:53] Epoch: 1, train loss: 0.1705523431301117, val loss: 0.1437842845916748, val metric: 0.970873732336761
train (loss=0.132485): 100%|██████████| 2660/2660 [02:05<00:00, 21.15it/s]
val: 100%|██████████| 1330/1330 [01:03<00:00, 20.97it/s]
[11:55:03] Epoch: 2, train loss: 0.13248467445373535, val loss: 0.12127983570098877, val metric: 0.9751468710522353
train (loss=0.11448): 100%|██████████| 2660/2660 [02:03<00:00, 21.51it/s]
val: 100%|██████████| 1330/1330 [01:02<00:00, 21.11it/s]
[11:58:09] Epoch: 3, train loss: 0.11447965353727341, val loss: 0.11149459332227707, val metric: 0.9768346789459879
train (loss=0.103458): 100%|██████████| 2660/2660 [02:06<00:00, 20.99it/s]
val: 100%|██████████| 1330/1330 [01:04<00:00, 20.65it/s]
[12:01:20] Epoch: 4, train loss: 0.10345754027366638, val loss: 0.10722416639328003, val metric: 0.9782435623593337
train (loss=0.0963441): 100%|██████████| 2660/2660 [02:05<00:00, 21.17it/s]
val: 100%|██████████| 1330/1330 [01:03<00:00, 20.91it/s]
[12:04:30] Epoch: 5, train loss: 0.09634406119585037, val loss: 0.10441421717405319, val metric: 0.978748563376753
train (loss=0.0900231): 100%|██████████| 2660/2660 [02:05<00:00, 21.17it/s]
val: 100%|██████████| 1330/1330 [01:03<00:00, 20.84it/s]
[12:07:39] Epoch: 6, train loss: 0.09002314507961273, val loss: 0.10312184691429138, val metric: 0.9791290354336872
[12:07:40] ===== Start working with fold 2 for Lvl_0_Pipe_0_Mod_0_TorchNN =====
[12:07:44] number of text features: 1
[12:07:44] number of categorical features: 0
[12:07:44] number of continuous features: 0
train (loss=0.257448): 100%|██████████| 2660/2660 [02:04<00:00, 21.45it/s]
val: 100%|██████████| 1330/1330 [01:00<00:00, 21.91it/s]
[12:10:50] Epoch: 0, train loss: 0.2574479281902313, val loss: 0.19449889659881592, val metric: 0.9648288318293945
train (loss=0.169502): 100%|██████████| 2660/2660 [02:03<00:00, 21.52it/s]
val: 100%|██████████| 1330/1330 [01:01<00:00, 21.79it/s]
[12:13:55] Epoch: 1, train loss: 0.1695016324520111, val loss: 0.14307956397533417, val metric: 0.9706200035841146
train (loss=0.131626): 100%|██████████| 2660/2660 [02:03<00:00, 21.54it/s]
val: 100%|██████████| 1330/1330 [01:00<00:00, 21.84it/s]
[12:16:59] Epoch: 2, train loss: 0.13162554800510406, val loss: 0.12111066281795502, val metric: 0.97454294780979
train (loss=0.114015): 100%|██████████| 2660/2660 [02:03<00:00, 21.57it/s]
val: 100%|██████████| 1330/1330 [01:00<00:00, 21.83it/s]
[12:20:04] Epoch: 3, train loss: 0.11401509493589401, val loss: 0.11131983995437622, val metric: 0.9763178957078734
train (loss=0.104155): 100%|██████████| 2660/2660 [02:03<00:00, 21.56it/s]
val: 100%|██████████| 1330/1330 [01:00<00:00, 21.87it/s]
[12:23:08] Epoch: 4, train loss: 0.10415521264076233, val loss: 0.10691472887992859, val metric: 0.9772204526836245
train (loss=0.0953203): 100%|██████████| 2660/2660 [02:04<00:00, 21.41it/s]
val: 100%|██████████| 1330/1330 [01:01<00:00, 21.67it/s]
[12:26:13] Epoch: 5, train loss: 0.09532025456428528, val loss: 0.10362745076417923, val metric: 0.9780747656394276
train (loss=0.0899258): 100%|██████████| 2660/2660 [02:04<00:00, 21.34it/s]
val: 100%|██████████| 1330/1330 [01:01<00:00, 21.68it/s]
[12:29:20] Epoch: 6, train loss: 0.08992581069469452, val loss: 0.10427321493625641, val metric: 0.9781931517871759
val: 100%|██████████| 1330/1330 [01:01<00:00, 21.53it/s]
[12:30:21] Early stopping: val loss: 0.10362745076417923, val metric: 0.9780747656394276
[12:30:22] Fitting Lvl_0_Pipe_0_Mod_0_TorchNN finished. score = 0.9782371823652668
[12:30:22] Lvl_0_Pipe_0_Mod_0_TorchNN fitting and predicting completed
[12:30:22] Time left -472.15 secs
[12:30:22] Time limit exceeded. Last level models will be blended and unused pipelines will be pruned.
[12:30:22] Layer 1 training completed.
[12:30:22] Automl preset training completed in 4072.15 seconds
[12:30:22] Model description:
Final prediction for new objects (level 0) =
1.00000 * (3 averaged models Lvl_0_Pipe_0_Mod_0_TorchNN)
[12:30:22] number of text features: 1
[12:30:22] number of categorical features: 0
[12:30:22] number of continuous features: 0
test: 100%|██████████| 998/998 [00:47<00:00, 21.08it/s]
[12:31:15] number of text features: 1
[12:31:15] number of categorical features: 0
[12:31:15] number of continuous features: 0
test: 100%|██████████| 998/998 [00:46<00:00, 21.51it/s]
[12:32:08] number of text features: 1
[12:32:08] number of categorical features: 0
[12:32:08] number of continuous features: 0
test: 100%|██████████| 998/998 [00:46<00:00, 21.47it/s]
Check scores:
OOF score: 0.9782371823652668
TEST score: 0.9807740353486142
CPU times: user 18min 47s, sys: 1min 15s, total: 20min 3s
Wall time: 1h 10min 30s
[6]:
automl.set_verbosity_level(0)  # turn off automl logging
LIME
LIME builds a linear approximation of the model in the neighborhood of the selected object. The weights of this linear model are the feature attributions for AutoML's prediction on this object.
Algorithm:
1. Select the object to interpret.
2. Select the input text column that will be explained (perturb_column). All other columns of the object are fixed.
3. A dataset of size n_sample (5000 by default) is created by randomly deleting tokens (in groups). The dataset is binary (one if a token is present, zero if it is not).
4. Predict target values for the created dataset with the AutoML model.
5. Optionally, feature selection (of important tokens) is performed using LASSO (feature_selection='lasso'; you can also pass 'none' to skip selection and keep all tokens). The number of features kept after selection is n_features (10 by default).
6. A weighted linear model is trained on this dataset (the sample weights are computed with the cosine distance by default; you can also use your own function or the name of a distance from sklearn.metrics.pairwise_distances).
7. The weights of this linear model are the interpretation.
P.S. Be careful with the sentence length: detokenization works in \(O(n^2)\) time, where \(n\) is the sentence length.
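To make the steps above concrete, here is a minimal, library-agnostic sketch of the same idea. It is an illustration, not the LimeTextExplainer implementation: lime_text_sketch and predict_proba are made-up names, where predict_proba stands for any black-box scorer that maps a list of texts to probabilities.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics.pairwise import cosine_distances

def lime_text_sketch(text, predict_proba, n_sample=5000, seed=42):
    # Step 3: build a binary perturbation dataset by randomly dropping tokens
    rng = np.random.default_rng(seed)
    tokens = text.split()
    masks = rng.integers(0, 2, size=(n_sample, len(tokens)))
    masks[0] = 1  # keep the original sentence as the first row
    perturbed = [' '.join(t for t, keep in zip(tokens, row) if keep) for row in masks]
    # Step 4: score every perturbed text with the black-box model
    preds = np.asarray(predict_proba(perturbed))
    # Step 6: fit a weighted linear surrogate; weights come from cosine distance to the original
    weights = 1.0 - cosine_distances(masks, masks[:1]).ravel()
    surrogate = Ridge(alpha=1.0).fit(masks, preds, sample_weight=weights)
    # Step 7: the coefficients are the per-token attributions
    return dict(zip(tokens, surrogate.coef_))
The real LimeTextExplainer additionally supports LASSO-based token selection and custom distance functions, as described in the algorithm above.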
Scheme of work:
[7]:
# LimeTextExplainer for NLP preset
lime = LimeTextExplainer(automl, feature_selection='lasso', force_order=False)
Let’s try it on a neutral text
[8]:
exp = lime.explain_instance(test.loc[34019], labels=(0, 1), perturb_column='comment_text')
exp.visualize_in_notebook(1)
test: 100%|██████████| 157/157 [00:02<00:00, 77.34it/s]
test: 100%|██████████| 157/157 [00:01<00:00, 79.52it/s]
test: 100%|██████████| 157/157 [00:01<00:00, 78.82it/s]
Text
The lyrics is found in the German version , so I assume it ' s usable . ~
Class mapping
Class: 0 Class: 1
Toxic comments
[9]:
exp = lime.explain_instance(test.loc[78687], labels=(0, 1), perturb_column='comment_text')
exp.visualize_in_notebook(1)
test: 100%|██████████| 157/157 [00:01<00:00, 93.42it/s]
test: 100%|██████████| 157/157 [00:01<00:00, 92.47it/s]
test: 100%|██████████| 157/157 [00:01<00:00, 93.75it/s]
Text
A silly fat cow who won ' t leave me alone
Class mapping
Class: 0 Class: 1
Let’s look at uncertain examples
[10]:
exp = lime.explain_instance(test.loc[4733], labels=(0, 1), perturb_column='comment_text', n_features=20)
exp.visualize_in_notebook(1)
test: 100%|██████████| 157/157 [00:02<00:00, 71.46it/s]
test: 100%|██████████| 157/157 [00:02<00:00, 71.48it/s]
test: 100%|██████████| 157/157 [00:02<00:00, 71.15it/s]
Text
Why are you still here ? Can you not find anything more important to do , like killing yourself ?
Class mapping
Class: 0 Class: 1
Let’s delete ‘important’ from this text. We can see that AutoML increases its predicted toxicity probability for this text.
[11]:
test.loc[4733, 'comment_text'] = 'Why are you still here ? Can you not find anything more to do , like killing yourself ?'
[12]:
exp = lime.explain_instance(test.loc[4733], labels=(0, 1), perturb_column='comment_text', n_features=20)
exp.visualize_in_notebook(1)
test: 100%|██████████| 157/157 [00:02<00:00, 73.97it/s]
test: 100%|██████████| 157/157 [00:02<00:00, 73.07it/s]
test: 100%|██████████| 157/157 [00:02<00:00, 73.22it/s]
Text
Why are you still here ? Can you not find anything more to do , like killing yourself ?
Class mapping
Class: 0 Class: 1
If we add the word ‘relability’, AutoML decreases the toxicity probability.
[13]:
test.loc[4733, 'comment_text'] = 'Why are you still here ? Can you not find anything more to do , like killing yourself ? relability'
[14]:
exp = lime.explain_instance(test.loc[4733], labels=(0, 1), perturb_column='comment_text', n_features=20)
exp.visualize_in_notebook(1)
test: 100%|██████████| 157/157 [00:02<00:00, 68.66it/s]
test: 100%|██████████| 157/157 [00:02<00:00, 64.12it/s]
test: 100%|██████████| 157/157 [00:02<00:00, 68.18it/s]
Text
Why are you still here ? Can you not find anything more to do , like killing yourself ? relability
Class mapping
Class: 0 Class: 1
Another example
[15]:
exp = lime.explain_instance(test.loc[40112], labels=(0, 1), perturb_column='comment_text', n_features=20)
exp.visualize_in_notebook(1)
test: 100%|██████████| 157/157 [00:02<00:00, 57.57it/s]
test: 100%|██████████| 157/157 [00:02<00:00, 56.72it/s]
test: 100%|██████████| 157/157 [00:02<00:00, 56.36it/s]
Text
stop editing this , you dumbass . why do you have to be such a bitch ? the ghosts of bill maas ' past will haunt you forever !!! MWAHAHHAHAA
Class mapping
Class: 0 Class: 1
Let’s replace the toxic words with ‘good boy’
[16]:
test.loc[40112, 'comment_text'] = "stop editing this, you good boy. why do you have to be such a good boy? the ghosts of bill maas' past will haunt you forever!!! MWAHAHHAHAA"
[17]:
exp = lime.explain_instance(test.loc[40112], labels=(0, 1), perturb_column='comment_text', n_features=20)
exp.visualize_in_notebook(1)
test: 100%|██████████| 157/157 [00:02<00:00, 55.40it/s]
test: 100%|██████████| 157/157 [00:02<00:00, 56.35it/s]
test: 100%|██████████| 157/157 [00:02<00:00, 55.86it/s]
Text
stop editing this , you good boy . why do you have to be such a good boy ? the ghosts of bill maas ' past will haunt you forever !!! MWAHAHHAHAA
Class mapping
Class: 0 Class: 1
Let’s try to turn a neutral text into a toxic one.
[18]:
exp = lime.explain_instance(test.loc[18396], labels=(0, 1), perturb_column='comment_text', n_features=20)
exp.visualize_in_notebook(1)
test: 100%|██████████| 157/157 [00:01<00:00, 101.90it/s]
test: 100%|██████████| 157/157 [00:01<00:00, 100.88it/s]
test: 100%|██████████| 157/157 [00:01<00:00, 99.18it/s]
Text
Okay , thanks . I will do so .
Class mapping
Class: 0 Class: 1
[19]:
test.loc[18396] = "Okay , thanks . I will do so . dumbass please"
[20]:
exp = lime.explain_instance(test.loc[18396], labels=(0, 1), perturb_column='comment_text', n_features=20)
exp.visualize_in_notebook(1)
test: 100%|██████████| 157/157 [00:01<00:00, 89.68it/s]
test: 100%|██████████| 157/157 [00:01<00:00, 90.71it/s]
test: 100%|██████████| 157/157 [00:01<00:00, 90.35it/s]
Text
Okay , thanks . I will do so . dumbass please
Class mapping
Class: 0 Class: 1
Adding some happy words
[21]:
test.loc[18396] = "Okay , thanks . I will do so . happy dumbass please"
[22]:
exp = lime.explain_instance(test.loc[18396], labels=(0, 1), perturb_column='comment_text', n_features=20)
exp.visualize_in_notebook(1)
test: 100%|██████████| 157/157 [00:01<00:00, 86.02it/s]
test: 100%|██████████| 157/157 [00:01<00:00, 87.59it/s]
test: 100%|██████████| 157/157 [00:01<00:00, 85.69it/s]
Text
Okay , thanks . I will do so . happy dumbass please
Class mapping
Class: 0 Class: 1
More happy words.
[23]:
test.loc[18396] = "Okay , thanks . I will do so . happy cheerful joyfull glorious elated dumbass please"
[24]:
exp = lime.explain_instance(test.loc[18396], labels=(0, 1), perturb_column='comment_text', n_features=20)
exp.visualize_in_notebook(1)
test: 100%|██████████| 157/157 [00:02<00:00, 75.00it/s]
test: 100%|██████████| 157/157 [00:02<00:00, 74.62it/s]
test: 100%|██████████| 157/157 [00:02<00:00, 74.52it/s]
Text
Okay , thanks . I will do so . happy cheerful joyfull glorious elated dumbass please
Class mapping
Class: 0 Class: 1
L2X for Regression
For this part we will use the BeerAdvocate dataset. It contains reviews of alcoholic drinks (a textual comment plus 5 attributes: overall, taste, palate, aroma, appearance). For this experiment we will use only the appearance
attribute.
[25]:
def download_from_gdrive(file_id, file_name, chunk_size=2**15):
    import requests

    def handle_warning(res):
        # Google Drive asks for confirmation on large files; grab the token from the cookies
        for k, v in res.cookies.items():
            if k.startswith("download_warning"):
                return v

    template_url = "https://docs.google.com/uc?export=download"
    session = requests.Session()
    # First request returns either the file itself or a virus-scan warning page
    res = session.get(template_url, params={"id": file_id}, stream=True)
    print('GET: {} CODE'.format(res.status_code))
    token = handle_warning(res)
    if token:
        # Retry with the confirmation token to get the actual file
        res = session.get(template_url, params={"id": file_id, "confirm": token}, stream=True)
    print('Started downloading...')
    with open(file_name, 'wb') as f:
        for chunk in res.iter_content(chunk_size):
            if chunk:
                f.write(chunk)
    print('Downloaded.')
download_from_gdrive('1s8PG13Y0BvYM67nNL0EQpdgB5S4gJK9r', 'beeradvocate.tar.gz')
shutil.unpack_archive('beeradvocate.tar.gz', '.')
GET: 200 CODE
Started downloading...
Downloaded.
[26]:
train_data = pd.read_csv('./datasets/reviews.aspect0.train.csv')
valid_data = pd.read_csv('./datasets/reviews.aspect0.heldout.csv')
train_data.head()
[26]:
Appearance | Aroma | Palate | Taste | Overall | Review | tokens_number | |
---|---|---|---|---|---|---|---|
0 | 1.5 | 1.5 | 2.5 | 1.5 | 1.5 | the main problem with this beer is that it has... | 62 |
1 | 2.0 | 2.0 | 3.0 | 2.0 | 3.0 | it is very unfortunate this situation we have ... | 179 |
2 | 4.0 | 2.5 | 3.0 | 1.5 | 2.0 | appearance is a light golden yellow with a thi... | 79 |
3 | 4.5 | 3.5 | 2.0 | 3.5 | 3.0 | it has a great color to the body . this beer p... | 87 |
4 | 4.0 | 4.5 | 1.0 | 1.5 | 1.0 | though this beer is , or course , not carbonat... | 246 |
Train AutoML
In this part we use the BERT-base model.
[27]:
roles = {
'text': ['Review'],
'drop': ['tokens_number', 'Aroma', 'Palate', 'Taste', 'Overall'],
'target': 'Appearance'
}
task = Task('reg')
automl = TabularNLPAutoML(
task=task,
timeout=3600,
cpu_limit=1,
gpu_ids='1',
general_params={
'nested_cv': False,
'use_algos': [['nn']],
'n_folds': 3
},
reader_params={
'cv': 3
},
autonlp_params={
'sent_scaler': 'l2'
},
text_params={
'lang': 'en',
'bert_model': 'bert-base-uncased'
},
nn_params={
'opt_params': {'lr': 1e-5},
'max_length': 128,
'bs': 32,
'n_epochs': 7,
},
)
oof_pred = automl.fit_predict(train_data, roles=roles, verbose=2)
test_pred = automl.predict(valid_data)
not_nan = np.any(~np.isnan(oof_pred.data), axis=1)  # rows whose folds were skipped (e.g. due to the time limit) have NaN OOF predictions
print('Check scores:')
print('OOF score: {}'.format(mean_squared_error(train_data[roles['target']].values[not_nan], oof_pred.data[not_nan][:, 0])))
print('TEST score: {}'.format(mean_squared_error(valid_data[roles['target']].values, test_pred.data[:, 0])))
[12:38:00] Stdout logging level is INFO2.
[12:38:00] Task: reg
[12:38:00] Start automl preset with listed constraints:
[12:38:00] - time: 3600.00 seconds
[12:38:00] - CPU: 1 cores
[12:38:00] - memory: 16 GB
[12:38:00] Train data shape: (80000, 7)
[12:38:01] Layer 1 train process start. Time left 3599.63 secs
[12:38:01] Start fitting Lvl_0_Pipe_0_Mod_0_TorchNN ...
[12:38:01] ===== Start working with fold 0 for Lvl_0_Pipe_0_Mod_0_TorchNN =====
train (loss=0.755747): 100%|██████████| 1667/1667 [06:45<00:00, 4.11it/s]
val: 100%|██████████| 834/834 [02:04<00:00, 6.68it/s]
train (loss=0.442306): 100%|██████████| 1667/1667 [06:48<00:00, 4.08it/s]
val: 100%|██████████| 834/834 [02:05<00:00, 6.66it/s]
train (loss=0.344638): 100%|██████████| 1667/1667 [06:52<00:00, 4.04it/s]
val: 100%|██████████| 834/834 [02:06<00:00, 6.61it/s]
val: 100%|██████████| 834/834 [02:05<00:00, 6.64it/s]
[13:07:23] ===== Start working with fold 1 for Lvl_0_Pipe_0_Mod_0_TorchNN =====
train (loss=0.760973): 100%|██████████| 1667/1667 [06:51<00:00, 4.05it/s]
val: 100%|██████████| 834/834 [02:06<00:00, 6.62it/s]
train (loss=0.44357): 100%|██████████| 1667/1667 [06:50<00:00, 4.06it/s]
val: 100%|██████████| 834/834 [02:06<00:00, 6.61it/s]
train (loss=0.343338): 100%|██████████| 1667/1667 [06:49<00:00, 4.07it/s]
val: 100%|██████████| 834/834 [02:05<00:00, 6.66it/s]
val: 100%|██████████| 834/834 [02:05<00:00, 6.66it/s]
[13:36:29] Time limit exceeded after calculating fold 1
[13:36:29] Fitting Lvl_0_Pipe_0_Mod_0_TorchNN finished. score = -0.46728458911890136
[13:36:29] Lvl_0_Pipe_0_Mod_0_TorchNN fitting and predicting completed
[13:36:29] Time left 91.29 secs
[13:36:29] Time limit exceeded in one of the tasks. AutoML will blend level 1 models.
[13:36:29] Layer 1 training completed.
[13:36:29] Automl preset training completed in 3508.71 seconds
[13:36:29] Model description:
Final prediction for new objects (level 0) =
1.00000 * (2 averaged models Lvl_0_Pipe_0_Mod_0_TorchNN)
test: 100%|██████████| 313/313 [00:47<00:00, 6.63it/s]
test: 100%|██████████| 313/313 [00:47<00:00, 6.64it/s]
Check scores:
OOF score: 0.46728458911890136
TEST score: 0.43322843977913716
[28]:
# note: the pickled model takes about 2 GB on disk
with open('apperance_model.pkl', 'wb') as f:
pickle.dump(automl, f)
[29]:
with open('apperance_model.pkl', 'rb') as f:
automl = pickle.load(f)
automl.set_verbosity_level(2)
[13:38:29] Stdout logging level is INFO2.
L2X
Algorithm.
The general idea of the method is to find the most informative subset of tokens with respect to the target using Mutual Information. The number of tokens in this subset is fixed and equals n_important.
Note that the tokenization used inside the AutoML models and the tokenization used in this method may differ: L2X has its own tokenization. If it is not set, we infer it from the default tokenization for the language in text_params of TabularNLPAutoML. Otherwise, you can set it with a language ('ru' or 'en' for Russian and English, respectively), or specify it as a callable that produces a list of tokens from a string.
After tokenization, a sentence is represented as a matrix of embedding vectors (you can specify an embedder, otherwise randomly initialized embeddings will be used). Unimportant vectors of this matrix are masked (important tokens are selected with the Token Importance + Subset Sampler blocks), and the remaining ones are fed to the Distil model, which tries to imitate the original AutoML model (learns to predict the same outputs).
Scheme of L2X:
Some info about parameters:
- n_important - number of important tokens;
- temperature - initial temperature used in the Gumbel softmax trick;
- train_device - device used for training;
- inference_device - device used for inference;
- verbose - verbose mode;
- binning_mode - for training we use batch sampling by sequence length, so a batch is formed only from sequences of the respective bin. This parameter selects the method used to choose the bin borders automatically. There are two of them: 'linear' (min-max binning, like linspace) and 'hist' (histogram binning);
- bins_number - number of bins in the batch sampling process;
- n_epochs - number of epochs for training L2X;
- learning_rate - learning rate of the L2X model;
- patience - number of epochs before the learning rate is decreased (torch.optim.lr_scheduler.ReduceLROnPlateau);
- extreme_patience - number of epochs before early stopping on the validation dataset;
- train_batch_size - batch size for the training process;
- valid_batch_size - batch size for the validation process;
- temp_anneal_factor - annealing factor for the temperature; the temperature is multiplied by this coefficient every epoch;
- importance_sampler - specifies the method of importance sampling (there are two of them: 'gumbeltopk' - the method from the original paper, 'softsub' - an alternative method);
- max_vocab_length - maximum length of the vocabulary (the vocabulary is built from the max_vocab_length most frequent tokens). If max_vocab_length is -1, all tokens from the train set are included;
- embedder - embedding dictionary or path to fasttext / dict of embeddings.
For more info about L2X, see the original paper "Learning to Explain: An Information-Theoretic Perspective on Model Interpretation" (Chen et al., 2018).
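As a purely illustrative sketch of the relaxed subset sampling ('gumbeltopk') that the importance_sampler parameter refers to, the snippet below draws a differentiable soft k-hot mask from token-importance logits. This is a toy example under stated assumptions, not the L2XTextExplainer code: sample_token_subset and toy_scores are made-up names, and the tensor shapes are chosen only for illustration.
import torch
import torch.nn.functional as F

def sample_token_subset(scores, n_important, temperature=2.0):
    # scores: (batch, seq_len) token-importance logits produced by a token-importance network
    draws = []
    for _ in range(n_important):
        # Independent Gumbel noise per draw keeps the sampling differentiable (softmax relaxation)
        gumbel = -torch.log(-torch.log(torch.rand_like(scores) + 1e-20) + 1e-20)
        draws.append(F.softmax((scores + gumbel) / temperature, dim=-1))
    # Element-wise max over the k relaxed one-hot draws approximates a k-hot token mask
    return torch.stack(draws, dim=0).max(dim=0).values

toy_scores = torch.randn(4, 30)  # toy batch: 4 sentences, 30 tokens each
mask = sample_token_subset(toy_scores, n_important=20)  # soft mask in [0, 1] that multiplies the embedding matrix
Lowering the temperature (or annealing it with temp_anneal_factor) makes the mask closer to a hard token selection.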
[30]:
l2x = L2XTextExplainer(automl, train_device='cuda:1',
inference_device='cuda:1',
embedding_dim=300,
gamma=0.1, temperature=2, temp_anneal_factor=0.95,
n_epochs=200, importance_sampler='gumbeltopk',
n_important=20, patience=25,
extreme_patience=30, trainable_embeds=True)
l2x.fit(train_data, valid_data, cols_to_explain='Review')
test: 100%|██████████| 2500/2500 [06:14<00:00, 6.67it/s]
test: 100%|██████████| 2500/2500 [06:15<00:00, 6.66it/s]
test: 100%|██████████| 313/313 [00:47<00:00, 6.66it/s]
test: 100%|██████████| 313/313 [00:46<00:00, 6.66it/s]
train nll (loss=7.8830): 100%|█████████▉| 1249/1251 [00:41<00:00, 30.12it/s]
train nll (loss=1.4016): 100%|█████████▉| 1249/1251 [00:40<00:00, 30.63it/s]
train nll (loss=1.3859): 100%|█████████▉| 1249/1251 [00:40<00:00, 30.53it/s]
train nll (loss=1.3684): 100%|█████████▉| 1249/1251 [00:40<00:00, 30.57it/s]
train nll (loss=1.0265): 100%|█████████▉| 1249/1251 [00:41<00:00, 30.44it/s]
train nll (loss=0.7086): 100%|█████████▉| 1249/1251 [00:41<00:00, 30.46it/s]
train nll (loss=0.6344): 100%|█████████▉| 1249/1251 [00:40<00:00, 30.62it/s]
train nll (loss=0.5779): 100%|█████████▉| 1249/1251 [00:40<00:00, 30.68it/s]
train nll (loss=0.5318): 100%|█████████▉| 1249/1251 [00:41<00:00, 30.22it/s]
train nll (loss=0.4962): 100%|█████████▉| 1249/1251 [00:40<00:00, 30.56it/s]
train nll (loss=0.4575): 100%|█████████▉| 1249/1251 [00:40<00:00, 30.69it/s]
train nll (loss=0.4233): 100%|█████████▉| 1249/1251 [00:40<00:00, 30.65it/s]
train nll (loss=0.3882): 100%|█████████▉| 1249/1251 [00:40<00:00, 30.58it/s]
train nll (loss=0.3574): 100%|█████████▉| 1249/1251 [00:40<00:00, 30.59it/s]
train nll (loss=0.3326): 100%|█████████▉| 1249/1251 [00:41<00:00, 30.41it/s]
train nll (loss=0.3177): 100%|█████████▉| 1249/1251 [00:40<00:00, 30.70it/s]
train nll (loss=0.2997): 100%|█████████▉| 1249/1251 [00:40<00:00, 30.50it/s]
train nll (loss=0.2885): 100%|█████████▉| 1249/1251 [00:40<00:00, 30.50it/s]
train nll (loss=0.2768): 100%|█████████▉| 1249/1251 [00:41<00:00, 30.24it/s]
train nll (loss=0.2667): 100%|█████████▉| 1249/1251 [00:41<00:00, 30.41it/s]
train nll (loss=0.2569): 100%|█████████▉| 1249/1251 [00:42<00:00, 29.40it/s]
train nll (loss=0.2500): 100%|█████████▉| 1249/1251 [00:42<00:00, 29.54it/s]
train nll (loss=0.2439): 100%|█████████▉| 1249/1251 [00:41<00:00, 30.26it/s]
train nll (loss=0.2349): 100%|█████████▉| 1249/1251 [00:40<00:00, 30.54it/s]
train nll (loss=0.2278): 100%|█████████▉| 1249/1251 [00:41<00:00, 30.25it/s]
train nll (loss=0.2244): 100%|█████████▉| 1249/1251 [00:41<00:00, 30.35it/s]
train nll (loss=0.2220): 100%|█████████▉| 1249/1251 [00:41<00:00, 30.35it/s]
train nll (loss=0.2158): 100%|█████████▉| 1249/1251 [00:41<00:00, 30.36it/s]
train nll (loss=0.2110): 100%|█████████▉| 1249/1251 [00:41<00:00, 30.45it/s]
train nll (loss=0.2080): 100%|█████████▉| 1249/1251 [00:41<00:00, 30.43it/s]
train nll (loss=0.2050): 100%|█████████▉| 1249/1251 [00:41<00:00, 30.40it/s]
train nll (loss=0.2003): 100%|█████████▉| 1249/1251 [00:41<00:00, 30.38it/s]
train nll (loss=0.1977): 100%|█████████▉| 1249/1251 [00:41<00:00, 30.43it/s]
train nll (loss=0.1925): 100%|█████████▉| 1249/1251 [00:41<00:00, 30.42it/s]
train nll (loss=0.1919): 100%|█████████▉| 1249/1251 [00:40<00:00, 30.55it/s]
train nll (loss=0.1888): 100%|█████████▉| 1249/1251 [00:40<00:00, 30.49it/s]
train nll (loss=0.1842): 100%|█████████▉| 1249/1251 [00:41<00:00, 30.37it/s]
train nll (loss=0.1841): 100%|█████████▉| 1249/1251 [00:41<00:00, 30.44it/s]
train nll (loss=0.1820): 100%|█████████▉| 1249/1251 [00:40<00:00, 30.52it/s]
train nll (loss=0.1777): 100%|█████████▉| 1249/1251 [00:41<00:00, 30.37it/s]
train nll (loss=0.1785): 100%|█████████▉| 1249/1251 [00:41<00:00, 30.27it/s]
train nll (loss=0.1778): 100%|█████████▉| 1249/1251 [00:41<00:00, 30.40it/s]
train nll (loss=0.1748): 100%|█████████▉| 1249/1251 [00:41<00:00, 30.40it/s]
train nll (loss=0.1719): 100%|█████████▉| 1249/1251 [00:41<00:00, 30.42it/s]
train nll (loss=0.1704): 100%|█████████▉| 1249/1251 [00:40<00:00, 30.58it/s]
train nll (loss=0.1715): 100%|█████████▉| 1249/1251 [00:40<00:00, 30.74it/s]
train nll (loss=0.1715): 100%|█████████▉| 1249/1251 [00:40<00:00, 30.48it/s]
train nll (loss=0.1734): 100%|█████████▉| 1249/1251 [00:40<00:00, 30.68it/s]
train nll (loss=0.1732): 100%|█████████▉| 1249/1251 [00:40<00:00, 30.65it/s]
train nll (loss=0.1781): 100%|█████████▉| 1249/1251 [00:40<00:00, 30.58it/s]
train nll (loss=0.1770): 100%|█████████▉| 1249/1251 [00:40<00:00, 30.57it/s]
train nll (loss=0.1737): 100%|█████████▉| 1249/1251 [00:40<00:00, 30.66it/s]
train nll (loss=0.1728): 100%|█████████▉| 1249/1251 [00:40<00:00, 30.59it/s]
train nll (loss=0.1731): 100%|█████████▉| 1249/1251 [00:42<00:00, 29.48it/s]
train nll (loss=0.1708): 100%|█████████▉| 1249/1251 [00:41<00:00, 30.26it/s]
train nll (loss=0.1696): 100%|█████████▉| 1249/1251 [00:41<00:00, 30.35it/s]
train nll (loss=0.1699): 100%|█████████▉| 1249/1251 [00:41<00:00, 30.01it/s]
train nll (loss=0.1699): 100%|█████████▉| 1249/1251 [00:42<00:00, 29.39it/s]
train nll (loss=0.1681): 100%|█████████▉| 1249/1251 [00:42<00:00, 29.37it/s]
train nll (loss=0.1682): 100%|█████████▉| 1249/1251 [00:45<00:00, 27.39it/s]
train nll (loss=0.1684): 100%|█████████▉| 1249/1251 [00:43<00:00, 28.57it/s]
train nll (loss=0.1666): 100%|█████████▉| 1249/1251 [00:46<00:00, 26.91it/s]
train nll (loss=0.1659): 100%|█████████▉| 1249/1251 [00:51<00:00, 24.20it/s]
train nll (loss=0.1656): 100%|█████████▉| 1249/1251 [00:46<00:00, 27.09it/s]
train nll (loss=0.1665): 100%|█████████▉| 1249/1251 [00:43<00:00, 28.92it/s]
train nll (loss=0.1676): 100%|█████████▉| 1249/1251 [00:43<00:00, 28.65it/s]
train nll (loss=0.1657): 100%|█████████▉| 1249/1251 [00:43<00:00, 29.02it/s]
train nll (loss=0.1651): 100%|█████████▉| 1249/1251 [00:43<00:00, 28.92it/s]
train nll (loss=0.1631): 100%|█████████▉| 1249/1251 [00:43<00:00, 28.87it/s]
train nll (loss=0.1634): 100%|█████████▉| 1249/1251 [00:43<00:00, 28.79it/s]
train nll (loss=0.1634): 100%|█████████▉| 1249/1251 [00:43<00:00, 28.77it/s]
train nll (loss=0.1626): 100%|█████████▉| 1249/1251 [00:43<00:00, 28.40it/s]
train nll (loss=0.1631): 100%|█████████▉| 1249/1251 [00:42<00:00, 29.35it/s]
train nll (loss=0.1613): 100%|█████████▉| 1249/1251 [00:41<00:00, 30.44it/s]
train nll (loss=0.1614): 100%|█████████▉| 1249/1251 [00:40<00:00, 30.63it/s]
train nll (loss=0.1638): 100%|█████████▉| 1249/1251 [00:40<00:00, 30.51it/s]
train nll (loss=0.1633): 100%|█████████▉| 1249/1251 [00:40<00:00, 30.52it/s]
train nll (loss=0.1618): 100%|█████████▉| 1249/1251 [00:41<00:00, 30.39it/s]
train nll (loss=0.1612): 100%|█████████▉| 1249/1251 [00:41<00:00, 30.43it/s]
train nll (loss=0.1628): 100%|█████████▉| 1249/1251 [00:41<00:00, 30.34it/s]
train nll (loss=0.1616): 100%|█████████▉| 1249/1251 [00:41<00:00, 29.99it/s]
train nll (loss=0.1618): 100%|█████████▉| 1249/1251 [00:41<00:00, 29.77it/s]
train nll (loss=0.1594): 100%|█████████▉| 1249/1251 [00:42<00:00, 29.29it/s]
train nll (loss=0.1617): 100%|█████████▉| 1249/1251 [00:42<00:00, 29.55it/s]
train nll (loss=0.1617): 100%|█████████▉| 1249/1251 [00:42<00:00, 29.27it/s]
train nll (loss=0.1610): 100%|█████████▉| 1249/1251 [00:41<00:00, 30.35it/s]
train nll (loss=0.1590): 100%|█████████▉| 1249/1251 [00:41<00:00, 30.35it/s]
train nll (loss=0.1602): 100%|█████████▉| 1249/1251 [00:43<00:00, 28.49it/s]
train nll (loss=0.1602): 100%|█████████▉| 1249/1251 [00:42<00:00, 29.07it/s]
train nll (loss=0.1613): 100%|█████████▉| 1249/1251 [00:43<00:00, 28.80it/s]
train nll (loss=0.1620): 100%|█████████▉| 1249/1251 [00:43<00:00, 28.86it/s]
train nll (loss=0.1593): 100%|█████████▉| 1249/1251 [00:43<00:00, 28.95it/s]
train nll (loss=0.1612): 100%|█████████▉| 1249/1251 [00:43<00:00, 28.89it/s]
train nll (loss=0.1620): 100%|█████████▉| 1249/1251 [00:43<00:00, 28.93it/s]
train nll (loss=0.1614): 100%|█████████▉| 1249/1251 [00:44<00:00, 28.26it/s]
train nll (loss=0.1630): 100%|█████████▉| 1249/1251 [00:42<00:00, 29.19it/s]
train nll (loss=0.1665): 100%|█████████▉| 1249/1251 [00:43<00:00, 28.75it/s]
train nll (loss=0.1605): 100%|█████████▉| 1249/1251 [00:42<00:00, 29.48it/s]
train nll (loss=0.1605): 100%|█████████▉| 1249/1251 [00:41<00:00, 30.01it/s]
train nll (loss=0.1636): 100%|█████████▉| 1249/1251 [00:40<00:00, 30.58it/s]
train nll (loss=0.1617): 100%|█████████▉| 1249/1251 [00:41<00:00, 30.08it/s]
train nll (loss=0.1635): 100%|█████████▉| 1249/1251 [00:41<00:00, 30.44it/s]
train nll (loss=0.1606): 100%|█████████▉| 1249/1251 [00:40<00:00, 30.65it/s]
train nll (loss=0.1631): 100%|█████████▉| 1249/1251 [00:41<00:00, 30.45it/s]
train nll (loss=0.1645): 100%|█████████▉| 1249/1251 [00:40<00:00, 30.60it/s]
train nll (loss=0.1652): 100%|█████████▉| 1249/1251 [00:41<00:00, 30.37it/s]
train nll (loss=0.1641): 100%|█████████▉| 1249/1251 [00:41<00:00, 29.96it/s]
train nll (loss=0.1669): 100%|█████████▉| 1249/1251 [00:43<00:00, 28.80it/s]
train nll (loss=0.1610): 100%|█████████▉| 1249/1251 [00:42<00:00, 29.11it/s]
train nll (loss=0.1630): 100%|█████████▉| 1249/1251 [00:43<00:00, 28.96it/s]
train nll (loss=0.1644): 100%|█████████▉| 1249/1251 [00:42<00:00, 29.11it/s]
train nll (loss=0.1681): 100%|█████████▉| 1249/1251 [00:42<00:00, 29.06it/s]
train nll (loss=0.1691): 100%|█████████▉| 1249/1251 [00:42<00:00, 29.12it/s]
train nll (loss=0.1728): 100%|█████████▉| 1249/1251 [00:42<00:00, 29.15it/s]
train nll (loss=0.1710): 100%|█████████▉| 1249/1251 [00:42<00:00, 29.16it/s]
train nll (loss=0.1679): 100%|█████████▉| 1249/1251 [00:42<00:00, 29.30it/s]
train nll (loss=0.1697): 100%|█████████▉| 1249/1251 [00:42<00:00, 29.31it/s]
train nll (loss=0.1669): 100%|█████████▉| 1249/1251 [00:41<00:00, 29.92it/s]
train nll (loss=0.1708): 100%|█████████▉| 1249/1251 [00:41<00:00, 30.28it/s]
train nll (loss=0.1662): 100%|█████████▉| 1249/1251 [00:40<00:00, 30.56it/s]
train nll (loss=0.1739): 100%|█████████▉| 1249/1251 [00:41<00:00, 30.34it/s]
train nll (loss=0.1977): 100%|█████████▉| 1249/1251 [00:40<00:00, 30.68it/s]
train nll (loss=0.1844): 100%|█████████▉| 1249/1251 [00:40<00:00, 30.72it/s]
train nll (loss=0.1685): 100%|█████████▉| 1249/1251 [00:41<00:00, 30.17it/s]
train nll (loss=0.1664): 100%|█████████▉| 1249/1251 [00:40<00:00, 30.48it/s]
train nll (loss=0.1760): 100%|█████████▉| 1249/1251 [00:40<00:00, 30.63it/s]
train nll (loss=0.1678): 100%|█████████▉| 1249/1251 [00:40<00:00, 30.71it/s]
train nll (loss=0.1681): 100%|█████████▉| 1249/1251 [00:41<00:00, 30.25it/s]
train nll (loss=0.1806): 100%|█████████▉| 1249/1251 [00:40<00:00, 30.67it/s]
train nll (loss=0.1724): 100%|█████████▉| 1249/1251 [00:40<00:00, 30.70it/s]
train nll (loss=0.1691): 100%|█████████▉| 1249/1251 [00:40<00:00, 30.67it/s]
train nll (loss=0.1712): 100%|█████████▉| 1249/1251 [00:40<00:00, 30.71it/s]
train nll (loss=0.1762): 100%|█████████▉| 1249/1251 [00:40<00:00, 30.55it/s]
train nll (loss=0.1655): 100%|█████████▉| 1249/1251 [00:41<00:00, 30.00it/s]
train nll (loss=0.1860): 100%|█████████▉| 1249/1251 [00:42<00:00, 29.72it/s]
train nll (loss=0.1897): 100%|█████████▉| 1249/1251 [00:41<00:00, 29.76it/s]
train nll (loss=0.1770): 100%|█████████▉| 1249/1251 [00:41<00:00, 30.45it/s]
train nll (loss=0.1796): 100%|█████████▉| 1249/1251 [00:42<00:00, 29.65it/s]
train nll (loss=0.1816): 100%|█████████▉| 1249/1251 [00:40<00:00, 30.48it/s]
train nll (loss=0.1870): 100%|█████████▉| 1249/1251 [00:40<00:00, 30.75it/s]
train nll (loss=0.1797): 100%|█████████▉| 1249/1251 [00:40<00:00, 30.57it/s]
train nll (loss=0.1878): 100%|█████████▉| 1249/1251 [00:40<00:00, 30.75it/s]
train nll (loss=0.1903): 100%|█████████▉| 1249/1251 [00:40<00:00, 30.72it/s]
train nll (loss=0.1787): 100%|█████████▉| 1249/1251 [00:40<00:00, 30.63it/s]
train nll (loss=0.1815): 100%|█████████▉| 1249/1251 [00:40<00:00, 30.70it/s]
train nll (loss=0.1781): 100%|█████████▉| 1249/1251 [00:40<00:00, 30.68it/s]
train nll (loss=0.1749): 100%|█████████▉| 1249/1251 [00:40<00:00, 30.68it/s]
train nll (loss=0.1735): 100%|█████████▉| 1249/1251 [00:40<00:00, 30.51it/s]
train nll (loss=0.1675): 100%|█████████▉| 1249/1251 [00:40<00:00, 30.56it/s]
train nll (loss=0.1671): 100%|█████████▉| 1249/1251 [00:40<00:00, 30.58it/s]
[31]:
expl_train = l2x['Review'].explain_instances(train_data)
[32]:
expl_valid = l2x['Review'].explain_instances(valid_data)
Examples
[33]:
expl_valid[66].visualize_in_notebook()
Text
<START> lot exploder lost about 3 of this beer down the drain as foam whats left is a cloudy medium brown color with floaties plenty of head obviously which dissipates quickly aroma is tons of malt and dark fruit the flavor is again very fruity with bready malt and caramel notes a bit of roast malt no hint of spices anywhere full bodied with plenty of silky crispness <PAD>
[34]:
expl_valid[55].visualize_in_notebook()
Text
<START> whoa is right with this one this is a big brew in my opinion hence its name pours a thick creamy head and has a dark brown color with hints of amber the taste ha thick hops in here think of biting into a big juicy fruit terrapin comes out strong with this seasonal taste of alcohol is well hidden but will creap up on you in a hurry i have found this most of the year for some reason i guess they distributed alot of it in the atlanta area <PAD>
[35]:
expl_valid[77].visualize_in_notebook()
Text
<START> the beer pours an opaque light copper capped by a minimal off white head there s very little retention despite a robust pour into the glass the nose is simply divine i feel like i just pulled a freshly baked pumpkin pie out of the oven aromas of gram cracker and butterscotch covered shortbread mix with sweet potato and canned <UNK> pumpkin puree coconut cinnamon and a hint of citrus add a twist of the exotic liquid pumpkin pie is the best way to describe the flavour a fine pte sucre crust with a rich pumpkin filling spiced with cinnamon and allspice this really is devilishly good everything i found lacking in previous pumpkin beers this makes up for smooth creamy macadamia nuttiness adds another dimension hops are nearly <UNK> simple there for balance similarly absent is the taste of alcohol despite the whopping percentage only int he very finish does it pop up like a <UNK> child the medium body and medium low level of carbonation make for a surprisingly easy drinking beer dangerous this is the best pumpkin beer i ve ever had hands down <PAD>
[36]:
expl_valid[88].visualize_in_notebook()
Text
<START> og 5 p sg 046 1 abv pours out to a clear very pale golden forming a soapy white head with decent retention and good lacing carbonation is moderate aroma of weak floral hops with a touch of freesia corn and a light dryness mouthfeel is average watery with a light body and clean finish taste is predominated by corn with hardly any perceptable hop flavor or bitterness cleanly fermented with a crisp finish this is a very simple beer and only a step above budweiser if you need to transition someone from macro domestic swill to an average pale ale then this would be the one comparable to the lightest tap at any new brewpub easy to drink but then again why would i want to <PAD>
[37]:
expl_valid[121].visualize_in_notebook()
Text
<START> a blend of stout and bock cool hopefully better than a blend of wheat and <UNK> a inside joke for anyone who has worked as a grain handler not a good thing i know some beer gods frown on the whole black tan thing this is my first so in i dive in with my usual open <UNK> looked like watered down cola head fizzled fast not a good sign as it definitely was n t too cold hey decent lacing it s really trying to give me that chocolate coffee stout smell here but it s muted some slight coffe toffee taste initially with a hint of hop bitterness maybe even a little nutty but it seems to be out of balance hence the blending thing i guess very very thin and watery given it s parent ingredients got this in a beers of the world pack so bonus would certainly taste fairy exotic to a macro lager person but i wo n t be <UNK> my <UNK> account to get some more out of <UNK> usa anytime soon <PAD>
[38]:
expl_valid[888].visualize_in_notebook()
Text
<START> beer is a dark dark color with just the slightest hints of ruby at the edges and a coffee colored head beautiful to look at and almost as nice to drink the smell is coffee cocoa and just a hint of caramel or toffee with an underlying alcohol character and honey sweetness taste is very similar to the smell with the coffee and cocoa taking center stage and the alcohol almost overpowering the toffee and caramel notes luckily the sweetness helps to balance that out mouthfeel is good nice and thick with just a hint of stickiness the drinkability is n t the best i was actually surprised that this did n t have the highest abv of the beers i had at dragonmead that being said it was a perfectly enjoyable beer and i d jump at the chance to have another <PAD>
[39]:
expl_valid[999].visualize_in_notebook()
Text
<START> ten fidy another thanks to <UNK> for the trade bod 17 pours a rusted mahoghany and settles jet metal black a fingers worth of burnt caramel head sits for a short while the edges leave very little light to pass through spotty lacing clings throughout the nose brings a lot of milk chocolate that has a bittering end to it roasted malts and a presence of alcohol are also noted the taste is interesting the roasty malts bloom but an annoying metallic taste lingers there is a light hop presence as it warms the flavors intensify the mouthfeel is very full bodied and sits like an <UNK> creamy feel with good carbonation overall pretty good impy stout but that metal taste was a bit off putting i would like to try this fresh to see if there is a difference <PAD>
[40]:
expl_valid[333].visualize_in_notebook()
Text
<START> this is a wow witbier cloudy yellow with tendencies toward something darker more orange the head is a little flat though there s a good flowery perfume aroma soft citrus light coriander come up front wiht a good dry wheat in the finish this is a little heftier than the supposed style stalwart hoegaarden but the soft almost creamy mouthfeel makes this a surprisingly satisfying beer without being heavy tasty beer <PAD>
[41]:
expl_valid[111].visualize_in_notebook()
Text
<START> dark black with creamy tan head that leaves great retention and foamy lace the smell is roasty with burnt sugar edges dark chocolate coffee and smoke the taste is ashy too much black patent perhaps others enjoy this but there is a charcoal burnt taste that is a bit much for me smoky bitter chocolate and roasted coffee quite roasty and ashy tasting strong with alcohol peeking through overall an average stout <PAD>
[42]:
expl_valid[100].visualize_in_notebook()
Text
<START> reviewing the oaked arrogant bastard ale from stone brewing company a hearty thank you to beeradvocate user funhog for hooking me up with this one score appearance pours a dark red brown color with plenty of opaque ish ruby highlights with three fingers of cream colored head excellent lacing and the head really stick around if not apparent by the photo proprietary 5 smell piney citrusy hops and oak wood up front creamy chocolate a little caramel and figs oranges tangerines and malts 5 taste very sweet caramel and citrus hoppy with toasted maltiness slightly bitter finish 5 mouthfeel medium bodied oily and cream low carbonation complements the viscosity well dry bitter finish 5 overall a very solid brew but the original version arrogant bastard ale is better in my opinion double bastard is even better this beer is absolutely worth trying but a six pack seems a bit much on quantity for me i guess i have some extras for future ba trades recommendation i can certainly recommend this one to both beer geeks and casual beer drinkers as the flavors are pretty solid and not overwhelming but the oaking does not seem to add enough additional character flavor to justify the steep price jump i would most recommend this beer as one to add to a mix a six pairings hamburger cost 99 for a six pack <PAD>
[43]:
expl_valid[1021].visualize_in_notebook()
Text
<START> 750ml bottle into a tulip huge thanks to kevin for sharing this ancient oddity a muddy magenta brown body with a handful of off white bubbles meh s old musty oaky dirty vaguely reminiscent of tequila in a very weird way i do n t know that i ve ever smelled a more basementy beer and i kind of like it in a masochistic way t like liquid dementia so so old and yet still tasty some moderate sourness and acidic fruitiness is still there to provide at least a hint at what this beer used to be i dig it m smooth soft amazingly delicate o this was n t exactly delicious but it was a great experience i wish i d gotten the chance to taste this five years ago cheers <PAD>
[44]:
expl_valid[9999].visualize_in_notebook()
Text
<START> the apperance was an amber dark yellow color with not much head it did however have stuff floating in it i m not certain if that was of the fault of the manufacturer or the fault of myself for trusting my friends around my <UNK> drink any who the smell was bellow average although not always clearly present the taste was a sweet sour mix with a main taste of bitterness mouthfeel was smooth esque drinkability was average but seeing as i m a big time <UNK> and seeing as it s what my friends have i will most likely be having another very soon <PAD>
[45]:
expl_valid[7676].visualize_in_notebook()
Text
<START> pale gold with a thin film around the edge some lacing looks very flat and insipid no carbonation i do n t hold out much hope very unpleasant sticky rice nose lots of nothingness as well sweet with a very light sickly note but do n t get me wrong it s incredibly bland blech thin but fortunately not overly sweet on the palate quite clean and dry with a light lingering bitterness mouthfeel is quite crisp which is a blessing no it s not great but i was expecting a lot lot worse it s really not that bad when you get down to it it s not amazing but it s pretty clean and light i guess i m just pleased it does n t have the sweet sickly character promised on the nose <PAD>
[46]:
expl_valid[6767].visualize_in_notebook()
Text
<START> i was actually a little surprised by this one surprised it was not vile pours a clear gold color with a thin white head no real lacing to speak of and the head was short lived the aroma is lightly sweet which was another surprise light bodied with a barely average hops flavor the finish is a little sweet and a little fruity this is n t a beer i would seek out again but i would drink it in korea over a bud <PAD>
[47]:
expl_valid[3131].visualize_in_notebook()
Text
<START> pours a clear deep red brown with a big white head malty sweet no major flavors stand out though it is slightly toasty hops are clean and mellow they only come in near the end and help to balance the beer this is a solid simple brown it s so great to finally see organic beer in the store <PAD>
Tutorial 5: Uplift modeling
Official LightAutoML github repository is here
[ ]:
%load_ext autoreload
%autoreload 2
Install LightAutoML
Uncomment if you haven’t cloned the repository via git (e.g. Colab or Kaggle version).
[ ]:
#! pip install -U lightautoml
Import necessary libraries
[ ]:
# Standard python libraries
from copy import deepcopy
import os
import requests
# Installed libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
import torch
# Imports from our package
from lightautoml.automl.presets.tabular_presets import TabularAutoML
from lightautoml.dataset.roles import DatetimeRole
from lightautoml.tasks import Task
from lightautoml.addons.uplift.base import AutoUplift, BaseLearnerWrapper, MetaLearnerWrapper
from lightautoml.addons.uplift import metalearners
from lightautoml.addons.uplift.metrics import (_available_uplift_modes,
TUpliftMetric,
calculate_graphic_uplift_curve,
calculate_min_max_uplift_auc,
calculate_uplift_at_top,
calculate_uplift_auc,
perfect_uplift_curve)
from lightautoml.addons.uplift.utils import create_linear_automl
from lightautoml.report.report_deco import ReportDecoUplift
%matplotlib inline
Parameters
Setting
[ ]:
N_THREADS = 8 # threads cnt for lgbm and linear models
N_FOLDS = 5 # folds cnt for AutoML
RANDOM_STATE = 42 # fixed random state for various reasons
TEST_SIZE = 0.2 # Test size for metric check
TIMEOUT = 300 # Time in seconds for automl run
TARGET_NAME = 'TARGET' # Target column name
TREATMENT_NAME = 'CODE_GENDER'
Fix torch number of threads and numpy seed
[ ]:
np.random.seed(RANDOM_STATE)
torch.set_num_threads(N_THREADS)
Example data load
Load the dataset from the repository if you haven’t cloned the repository via git.
[ ]:
DATASET_DIR = '../data/'
DATASET_NAME = 'sampled_app_train.csv'
DATASET_FULLNAME = os.path.join(DATASET_DIR, DATASET_NAME)
DATASET_URL = 'https://raw.githubusercontent.com/AILab-MLTools/LightAutoML/master/example_data/test_data_files/sampled_app_train.csv'
[ ]:
%%time
if not os.path.exists(DATASET_FULLNAME):
os.makedirs(DATASET_DIR, exist_ok=True)
dataset = requests.get(DATASET_URL).text
with open(DATASET_FULLNAME, 'w') as output:
output.write(dataset)
[ ]:
%%time
data = pd.read_csv(DATASET_FULLNAME)
data.head()
(Optional) Some user feature preparation
[ ]:
%%time
data['BIRTH_DATE'] = (np.datetime64('2018-01-01') + data['DAYS_BIRTH'].astype(np.dtype('timedelta64[D]'))).astype(str)
data['EMP_DATE'] = (np.datetime64('2018-01-01') + np.clip(data['DAYS_EMPLOYED'], None, 0).astype(np.dtype('timedelta64[D]'))
).astype(str)
data['report_dt'] = np.datetime64('2018-01-01')
data['constant'] = 1
data['allnan'] = np.nan
data.drop(['DAYS_BIRTH', 'DAYS_EMPLOYED'], axis=1, inplace=True)
data['CODE_GENDER'] = (data['CODE_GENDER'] == 'M').astype(int)
Data splitting for train-test
[ ]:
%%time
stratify_value = data[TARGET_NAME] + 10 * data[TREATMENT_NAME]  # joint stratification on target and treatment (four strata)
train, test = train_test_split(data, test_size=3000, stratify=stratify_value, random_state=42)
test_target, test_treatment = test[TARGET_NAME].values.ravel(), test[TREATMENT_NAME].values.ravel()
Set up column roles
[ ]:
%%time
roles = {
'target': TARGET_NAME,
'treatment': TREATMENT_NAME,
DatetimeRole(base_date=True, seasonality=(), base_feats=False): 'report_dt'
}
AutoUplift (use predefined uplift methods)
Fit autouplift
[ ]:
%%time
task = Task('binary')
autouplift = AutoUplift(task,
metric='adj_qini',
has_report=True,
test_size=0.2,
timeout=200,
# timeout_metalearner=5
)
autouplift.fit(train, roles, verbose=1)
Show rating of uplift methods (meta-learners)
[ ]:
%%time
rating_table = autouplift.get_metalearners_rating()
rating_table
Get best metalearner
[ ]:
%%time
best_metalearner = autouplift.create_best_metalearner(
update_metalearner_params={'timeout': None},
update_baselearner_params={'timeout': 30}
)
best_metalearner.fit(train, roles)
_ = best_metalearner.predict(test);
Predict on test data and check metrics
[ ]:
%%time
uplift_pred, treatment_pred, control_pred = best_metalearner.predict(test)
uplift_pred = uplift_pred.ravel()
roc_auc_treatment = roc_auc_score(test_target[test_treatment == 1], treatment_pred[test_treatment == 1])
roc_auc_control = roc_auc_score(test_target[test_treatment == 0], control_pred[test_treatment == 0])
uplift_auc_algo = calculate_uplift_auc(test_target, uplift_pred, test_treatment, normed=False)
uplift_auc_algo_normed = calculate_uplift_auc(test_target, uplift_pred, test_treatment, normed=True)
auc_base, auc_perfect = calculate_min_max_uplift_auc(test_target, test_treatment)
print('--- Check scores ---')
print('OOF scores "ROC_AUC":')
print('\tTreatment = {:.5f}'.format(roc_auc_treatment))
print('\tControl = {:.5f}'.format(roc_auc_control))
print('Uplift score of test group (default="adj_qini"):')
print('\tBaseline = {:.5f}'.format(auc_base))
print('\tAlgo (Normed) = {:.5f} ({:.5f})'.format(uplift_auc_algo, uplift_auc_algo_normed))
print('\tPerfect = {:.5f}'.format(auc_perfect))
AutoUplift (custom uplift methods)
Fit autouplift
[ ]:
%%time
# Set uplift candidates to choose the best of them
# !!!ATTENTION!!!
# This is a demonstration of the possibilities;
# you may use the default set of candidates
task = Task('binary')
uplift_candidates = [
MetaLearnerWrapper(
name='TLearner__Default',
klass=metalearners.TLearner,
params={'base_task': task}
),
MetaLearnerWrapper(
name='TLearner__Custom',
klass=metalearners.TLearner,
params={
'treatment_learner': BaseLearnerWrapper(
name='__TabularAutoML__',
klass=TabularAutoML,
params={'task': task, 'timeout': 10}),
'control_learner': BaseLearnerWrapper(
name='__Linear__',
klass=create_linear_automl,
params={'task': Task('binary')})
}
),
MetaLearnerWrapper(
name='XLearner__Custom',
klass=metalearners.XLearner,
params={
'outcome_learners': [
TabularAutoML(task=task, timeout=10), # [sec]; only to speed up the example, don't change it!
create_linear_automl(task=Task('binary'))
],
'effect_learners': [BaseLearnerWrapper(
name='__TabularAutoML__',
klass=TabularAutoML,
params={'task': Task('reg'), 'timeout': 5})],
'propensity_learner': create_linear_automl(task=Task('binary')),
}
)
]
autouplift = AutoUplift(task,
uplift_candidates=uplift_candidates,
metric='adj_qini',
test_size=0.2,
threshold_imbalance_treatment=0.0, # Has no effect, see warnings
timeout=600) # Has no effect, see warnings
autouplift.fit(train, roles, verbose=1)
Show rating of uplift methods (meta-learners)
[ ]:
%%time
rating_table = autouplift.get_metalearners_rating()
rating_table
Predict on test data and check metrics
[ ]:
%%time
uplift_pred, treatment_pred, control_pred = autouplift.predict(test)
uplift_pred = uplift_pred.ravel()
roc_auc_treatment = roc_auc_score(test_target[test_treatment == 1], treatment_pred[test_treatment == 1])
roc_auc_control = roc_auc_score(test_target[test_treatment == 0], control_pred[test_treatment == 0])
uplift_auc_algo = calculate_uplift_auc(test_target, uplift_pred, test_treatment, normed=False)
uplift_auc_algo_normed = calculate_uplift_auc(test_target, uplift_pred, test_treatment, normed=True)
auc_base, auc_perfect = calculate_min_max_uplift_auc(test_target, test_treatment)
print('--- Check scores ---')
print('OOF scores "ROC_AUC":')
print('\tTreatment = {:.5f}'.format(roc_auc_treatment))
print('\tControl = {:.5f}'.format(roc_auc_control))
print('Uplift score of test group (default="adj_qini"):')
print('\tBaseline = {:.5f}'.format(auc_base))
print('\tAlgo (Normed) = {:.5f} ({:.5f})'.format(uplift_auc_algo, uplift_auc_algo_normed))
print('\tPerfect = {:.5f}'.format(auc_perfect))
AutoUplift with custom metric
Fit autouplift
[ ]:
%%time
# Using a custom metric
# (how to define a custom metric is shown below)
task = Task('binary')
class CustomUpliftMetric(TUpliftMetric):
def __call__(self, target: np.ndarray, uplift_pred: np.ndarray, treatment: np.ndarray) -> float:
up_10 = calculate_uplift_at_top(target, uplift_pred, treatment, 10)
up_20 = calculate_uplift_at_top(target, uplift_pred, treatment, 20)
return 0.5 * (up_10 + up_20)
autouplift = AutoUplift(task,
add_dd_candidates=True,
metric=CustomUpliftMetric(),
test_size=0.2,
threshold_imbalance_treatment=0.0,
cpu_limit=10,
timeout=100)
autouplift.fit(train, roles)
Show rating of uplift methods (meta-learners)
[ ]:
%%time
rating_table = autouplift.get_metalearners_rating()
rating_table
MetaLearner
TLearner
Fit on train data
[ ]:
%%time
# Default setting
tlearner = metalearners.TLearner(base_task=Task('binary'), cpu_limit=5)
tlearner.fit(train, roles)
Predict on test data and check metrics
[ ]:
%%time
uplift_pred, treatment_pred, control_pred = tlearner.predict(test)
uplift_pred = uplift_pred.ravel()
roc_auc_treatment = roc_auc_score(test_target[test_treatment == 1], treatment_pred[test_treatment == 1])
roc_auc_control = roc_auc_score(test_target[test_treatment == 0], control_pred[test_treatment == 0])
uplift_auc_algo = calculate_uplift_auc(test_target, uplift_pred, test_treatment, normed=False)
uplift_auc_algo_normed = calculate_uplift_auc(test_target, uplift_pred, test_treatment, normed=True)
auc_base, auc_perfect = calculate_min_max_uplift_auc(test_target, test_treatment)
print('--- Check scores ---')
print('OOF scores "ROC_AUC":')
print('\tTreatment = {:.5f}'.format(roc_auc_treatment))
print('\tControl = {:.5f}'.format(roc_auc_control))
print('Uplift score of test group (default="adj_qini"):')
print('\tBaseline = {:.5f}'.format(auc_base))
print('\tAlgo (Normed) = {:.5f} ({:.5f})'.format(uplift_auc_algo, uplift_auc_algo_normed))
print('\tPerfect = {:.5f}'.format(auc_perfect))
XLearner
Fit on train data
[ ]:
%%time
# Custom base algorithm
xlearner = metalearners.XLearner(
propensity_learner=TabularAutoML(task=Task('binary'), timeout=10),
outcome_learners=[
TabularAutoML(task=Task('binary'), timeout=10),
TabularAutoML(task=Task('binary'), timeout=10)
],
effect_learners=[
TabularAutoML(task=Task('reg'), timeout=10),
TabularAutoML(task=Task('reg'), timeout=10)
]
)
xlearner.fit(train, roles)
Predict on test data and check metrics
[ ]:
%%time
uplift_pred, treatment_pred, control_pred = xlearner.predict(test)
uplift_pred = uplift_pred.ravel()
roc_auc_treatment = roc_auc_score(test_target[test_treatment == 1], treatment_pred[test_treatment == 1])
roc_auc_control = roc_auc_score(test_target[test_treatment == 0], control_pred[test_treatment == 0])
uplift_auc_algo = calculate_uplift_auc(test_target, uplift_pred, test_treatment, normed=False)
uplift_auc_algo_normed = calculate_uplift_auc(test_target, uplift_pred, test_treatment, normed=True)
auc_base, auc_perfect = calculate_min_max_uplift_auc(test_target, test_treatment)
print('--- Check scores ---')
print('OOF scores "ROC_AUC":')
print('\tTreatment = {:.5f}'.format(roc_auc_treatment))
print('\tControl = {:.5f}'.format(roc_auc_control))
print('Uplift score of test group (default="adj_qini"):')
print('\tBaseline = {:.5f}'.format(auc_base))
print('\tAlgo (Normed) = {:.5f} ({:.5f})'.format(uplift_auc_algo, uplift_auc_algo_normed))
print('\tPerfect = {:.5f}'.format(auc_perfect))
Uplift metrics and graphics (using xlearner predictions)
[ ]:
%%time
UPLIFT_METRIC = 'adj_qini'
print("All available uplift metrics: {}".format(_available_uplift_modes))
Algorithm uplift curve
[ ]:
%%time
# Algorithm curve
xs_xlearner, ys_xlearner = calculate_graphic_uplift_curve(
test_target, uplift_pred, test_treatment, mode=UPLIFT_METRIC
)
Baseline, perfect curve
[ ]:
# Baseline curve: a straight line to the final cumulative uplift (random ordering)
xs_base, ys_base = xs_xlearner, xs_xlearner * ys_xlearner[-1]
# Perfect curve
perfect_uplift = perfect_uplift_curve(test_target, test_treatment)
xs_perfect, ys_perfect = calculate_graphic_uplift_curve(
test_target, perfect_uplift, test_treatment, mode=UPLIFT_METRIC)
[ ]:
plt.figure(figsize=(10, 7))
plt.plot(xs_base, ys_base, 'black')
plt.plot(xs_xlearner, ys_xlearner, 'red')
plt.plot(xs_perfect, ys_perfect, 'green')
plt.fill_between(xs_xlearner, ys_base, ys_xlearner, alpha=0.5, color='orange')
plt.xlabel('Cumulative percentage of people in T/C groups')
plt.ylabel('Uplift metric ({})'.format(UPLIFT_METRIC))
plt.grid()
plt.legend(['Baseline', 'XLearner', 'Perfect']);
Uplift TOP-K
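Uplift at top-k looks only at the k% of objects with the highest predicted uplift and compares the mean response of treated and control objects inside that slice. A minimal sketch of this idea (an illustration of the metric, not the source of the library's calculate_uplift_at_top):
import numpy as np

def uplift_at_top_sketch(target, uplift_pred, treatment, top=10):
    # Take the top `top` percent of objects by predicted uplift
    threshold = np.percentile(uplift_pred, 100 - top)
    mask = uplift_pred >= threshold
    treated = mask & (treatment == 1)
    control = mask & (treatment == 0)
    return target[treated].mean() - target[control].mean()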
[ ]:
tops = np.arange(5, 101, 5)
uplift_at_tops = []
for top in tops:
uat = calculate_uplift_at_top(test_target, uplift_pred, test_treatment, top=top)
uplift_at_tops.append(uat)
plt.figure(figsize=(10, 7))
plt.plot(tops, uplift_at_tops, marker='.')
plt.legend(['Uplift_At_K'])
plt.xticks(np.arange(0, 101, 10))
plt.grid()
Custom metric
[ ]:
# A custom metric can be used in AutoUplift
# It must be callable with the signature:
# def custom_metric(target, uplift_pred, treatment) -> float:
class CustomUpliftMetric(TUpliftMetric):
def __call__(self, target: np.ndarray, uplift_pred: np.ndarray, treatment: np.ndarray) -> float:
up_10 = calculate_uplift_at_top(target, uplift_pred, treatment, 10)
up_20 = calculate_uplift_at_top(target, uplift_pred, treatment, 20)
return 0.5 * (up_10 + up_20)
metric = CustomUpliftMetric()
metric_value = metric(test_target, uplift_pred, test_treatment)
print("Metric = {}".format(metric_value))
Report
[ ]:
%%time
RDU = ReportDecoUplift()
tlearner_deco = RDU(metalearners.TLearner(base_task=Task('binary')))
tlearner_deco.fit(train, roles)
_ = tlearner_deco.predict(test)
# Path to report: PATH_TO_CURRENT_NOTEBOOK/lama_report/lama_interactive_report.html
Tutorial 6: Custom pipeline tutorial
Official LightAutoML github repository is here
Preparing
Step 1. Install LightAutoML
Uncomment if you haven't cloned the repository via git (e.g. on Colab or Kaggle).
[1]:
#! pip install -U lightautoml
Step 2. Import necessary libraries
[2]:
# Standard python libraries
import os
import time
import requests
# Installed libraries
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
import torch
# Imports from our package
from lightautoml.automl.base import AutoML
from lightautoml.ml_algo.boost_lgbm import BoostLGBM
from lightautoml.ml_algo.tuning.optuna import OptunaTuner
from lightautoml.pipelines.features.lgb_pipeline import LGBSimpleFeatures
from lightautoml.pipelines.ml.base import MLPipeline
from lightautoml.pipelines.selection.importance_based import ImportanceCutoffSelector, ModelBasedImportanceEstimator
from lightautoml.reader.base import PandasToPandasReader
from lightautoml.tasks import Task
from lightautoml.automl.blend import WeightedBlender
Step 3. Parameters
[3]:
N_THREADS = 8 # threads cnt for lgbm and linear models
N_FOLDS = 5 # folds cnt for AutoML
RANDOM_STATE = 42 # fixed random state for various reasons
TEST_SIZE = 0.2 # Test size for metric check
TARGET_NAME = 'TARGET' # Target column name
Step 4. Fix torch number of threads and numpy seed
[4]:
np.random.seed(RANDOM_STATE)
torch.set_num_threads(N_THREADS)
Step 5. Example data load
Load the dataset from the repository if you haven't cloned it via git.
[5]:
DATASET_DIR = '../data/'
DATASET_NAME = 'sampled_app_train.csv'
DATASET_FULLNAME = os.path.join(DATASET_DIR, DATASET_NAME)
DATASET_URL = 'https://raw.githubusercontent.com/AILab-MLTools/LightAutoML/master/examples/data/sampled_app_train.csv'
[6]:
%%time
if not os.path.exists(DATASET_FULLNAME):
os.makedirs(DATASET_DIR, exist_ok=True)
dataset = requests.get(DATASET_URL).text
with open(DATASET_FULLNAME, 'w') as output:
output.write(dataset)
CPU times: user 28 µs, sys: 20 µs, total: 48 µs
Wall time: 64.4 µs
[7]:
%%time
data = pd.read_csv(DATASET_FULLNAME)
data.head()
CPU times: user 105 ms, sys: 14.5 ms, total: 119 ms
Wall time: 118 ms
[7]:
SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 313802 | 0 | Cash loans | M | N | Y | 0 | 270000.0 | 327024.0 | 15372.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
1 | 319656 | 0 | Cash loans | F | N | N | 0 | 108000.0 | 675000.0 | 19737.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
2 | 207678 | 0 | Revolving loans | F | Y | Y | 2 | 112500.0 | 270000.0 | 13500.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
3 | 381593 | 0 | Cash loans | F | N | N | 1 | 67500.0 | 142200.0 | 9630.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.0 |
4 | 258153 | 0 | Cash loans | F | Y | Y | 0 | 337500.0 | 1483231.5 | 46570.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 |
5 rows × 122 columns
Step 6. (Optional) Some user feature preparation
The cell below shows some user feature preparation to make the task more difficult (this block can be omitted if you don't want to change the initial data):
[8]:
%%time
data['BIRTH_DATE'] = (np.datetime64('2018-01-01') + data['DAYS_BIRTH'].astype(np.dtype('timedelta64[D]'))).astype(str)
data['EMP_DATE'] = (np.datetime64('2018-01-01') + np.clip(data['DAYS_EMPLOYED'], None, 0).astype(np.dtype('timedelta64[D]'))
).astype(str)
data['constant'] = 1
data['allnan'] = np.nan
data['report_dt'] = np.datetime64('2018-01-01')
data.drop(['DAYS_BIRTH', 'DAYS_EMPLOYED'], axis=1, inplace=True)
CPU times: user 108 ms, sys: 4.5 ms, total: 113 ms
Wall time: 111 ms
Step 7. (Optional) Data splitting for train-test
The block below can be omitted if you are going to train the model only, or if you have separate train and test files:
[9]:
%%time
train_data, test_data = train_test_split(data,
test_size=TEST_SIZE,
stratify=data[TARGET_NAME],
random_state=RANDOM_STATE)
print('Data split. Part sizes: train_data = {}, test_data = {}'
      .format(train_data.shape, test_data.shape))
Data split. Part sizes: train_data = (8000, 125), test_data = (2000, 125)
CPU times: user 7.85 ms, sys: 3.89 ms, total: 11.7 ms
Wall time: 10.1 ms
[10]:
train_data.head()
[10]:
SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | ... | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | BIRTH_DATE | EMP_DATE | constant | allnan | report_dt | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
6444 | 112261 | 0 | Cash loans | F | N | N | 1 | 90000.0 | 640080.0 | 31261.5 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1985-06-28 | 2012-06-21 | 1 | NaN | 2018-01-01 |
3586 | 115058 | 0 | Cash loans | F | N | Y | 0 | 180000.0 | 239850.0 | 23850.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 1953-12-27 | 2018-01-01 | 1 | NaN | 2018-01-01 |
9349 | 326623 | 0 | Cash loans | F | N | Y | 0 | 112500.0 | 337500.0 | 31086.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 1975-06-21 | 2016-06-17 | 1 | NaN | 2018-01-01 |
7734 | 191976 | 0 | Cash loans | M | Y | Y | 1 | 67500.0 | 135000.0 | 9018.0 | ... | NaN | NaN | NaN | NaN | NaN | 1988-04-27 | 2009-06-05 | 1 | NaN | 2018-01-01 |
2174 | 281519 | 0 | Revolving loans | F | N | Y | 0 | 67500.0 | 202500.0 | 10125.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 1975-06-13 | 1997-01-22 | 1 | NaN | 2018-01-01 |
5 rows × 125 columns
AutoML creation
Step 1. Create Task and PandasReader
[11]:
%%time
task = Task('binary')
reader = PandasToPandasReader(task, cv=N_FOLDS, random_state=RANDOM_STATE)
CPU times: user 4.03 ms, sys: 25 µs, total: 4.05 ms
Wall time: 2.99 ms
Step 2. Create feature selector (if necessary)
[12]:
%%time
model0 = BoostLGBM(
default_params={'learning_rate': 0.05, 'num_leaves': 64, 'seed': 42, 'num_threads': N_THREADS}
)
pipe0 = LGBSimpleFeatures()
mbie = ModelBasedImportanceEstimator()
selector = ImportanceCutoffSelector(pipe0, model0, mbie, cutoff=0)
Copying TaskTimer may affect the parent PipelineTimer, so copy will create new unlimited TaskTimer
CPU times: user 0 ns, sys: 1.91 ms, total: 1.91 ms
Wall time: 1.56 ms
Step 3.1. Create 1st level ML pipeline for AutoML
Our first level ML pipeline:
- Simple features for gradient boosting built on selected features (using step 2)
- 2 different models:
  * LightGBM with params tuning (using OptunaTuner)
  * LightGBM with heuristic params
[13]:
%%time
pipe = LGBSimpleFeatures()
params_tuner1 = OptunaTuner(n_trials=20, timeout=30) # stop after 20 iterations or after 30 seconds
model1 = BoostLGBM(
default_params={'learning_rate': 0.05, 'num_leaves': 128, 'seed': 1, 'num_threads': N_THREADS}
)
model2 = BoostLGBM(
default_params={'learning_rate': 0.025, 'num_leaves': 64, 'seed': 2, 'num_threads': N_THREADS}
)
pipeline_lvl1 = MLPipeline([
(model1, params_tuner1),
model2
], pre_selection=selector, features_pipeline=pipe, post_selection=None)
CPU times: user 51 µs, sys: 37 µs, total: 88 µs
Wall time: 96.8 µs
Step 3.2. Create 2nd level ML pipeline for AutoML
Our second level ML pipeline:
- Simple features as well, but now built on the Out-Of-Fold (OOF) predictions of the algorithms from the 1st level
- Only one LGBM model without params tuning
- No feature selection on this stage, because we want to use all OOFs here
[14]:
%%time
pipe1 = LGBSimpleFeatures()
model = BoostLGBM(
default_params={'learning_rate': 0.05, 'num_leaves': 64, 'max_bin': 1024, 'seed': 3, 'num_threads': N_THREADS},
freeze_defaults=True
)
pipeline_lvl2 = MLPipeline([model], pre_selection=None, features_pipeline=pipe1, post_selection=None)
CPU times: user 41 µs, sys: 29 µs, total: 70 µs
Wall time: 81.5 µs
Step 4. Create AutoML pipeline
The AutoML pipeline consists of:
- Reader for data preparation
- First level ML pipeline (as built in step 3.1)
- Second level ML pipeline (as built in step 3.2)
- skip_conn = False, which here means "do not pass the initial features to the second level pipeline"
[15]:
%%time
automl = AutoML(reader, [
[pipeline_lvl1],
[pipeline_lvl2],
], skip_conn=False)
CPU times: user 35 µs, sys: 24 µs, total: 59 µs
Wall time: 73.7 µs
Step 5. Train AutoML on loaded data
In the cell below we train AutoML with target column TARGET to receive the fitted model and OOF predictions:
[16]:
%%time
oof_pred = automl.fit_predict(train_data, roles={'target': TARGET_NAME})
print('oof_pred:\n{}\nShape = {}'.format(oof_pred, oof_pred.shape))
[LightGBM] [Warning] seed is set=42, random_state=42 will be ignored. Current value: seed=42
[LightGBM] [Warning] seed is set=1, random_state=42 will be ignored. Current value: seed=1
[LightGBM] [Warning] seed is set=2, random_state=42 will be ignored. Current value: seed=2
[LightGBM] [Warning] seed is set=3, random_state=42 will be ignored. Current value: seed=3
oof_pred:
array([[0.07027727],
[0.06983411],
[0.06983411],
...,
[0.04349083],
[0.09716105],
[0.12494681]], dtype=float32)
Shape = (8000, 1)
CPU times: user 4min 23s, sys: 2.63 s, total: 4min 26s
Wall time: 37.3 s
Step 6. Analyze fitted model
Below we analyze feature importances of different algos:
[17]:
print('Feature importances of selector:\n{}'
.format(selector.get_features_score()))
print('=' * 70)
print('Feature importances of top level algorithm:\n{}'
.format(automl.levels[-1][0].ml_algos[0].get_features_score()))
print('=' * 70)
print('Feature importances of lowest level algorithm - model 0:\n{}'
.format(automl.levels[0][0].ml_algos[0].get_features_score()))
print('=' * 70)
print('Feature importances of lowest level algorithm - model 1:\n{}'
.format(automl.levels[0][0].ml_algos[1].get_features_score()))
print('=' * 70)
Feature importances of selector:
EXT_SOURCE_3 1029.681686
EXT_SOURCE_2 894.265428
BIRTH_DATE 537.081401
EXT_SOURCE_1 424.764621
DAYS_LAST_PHONE_CHANGE 262.583100
...
FLAG_DOCUMENT_16 0.000000
FLAG_DOCUMENT_14 0.000000
FLAG_DOCUMENT_13 0.000000
FLAG_DOCUMENT_11 0.000000
FLAG_PHONE 0.000000
Length: 110, dtype: float64
======================================================================
Feature importances of top level algorithm:
Lvl_0_Pipe_0_Mod_0_LightGBM_prediction_0 2546.473691
Lvl_0_Pipe_0_Mod_1_LightGBM_prediction_0 1686.589227
dtype: float64
======================================================================
Feature importances of lowest level algorithm - model 0:
EXT_SOURCE_2 1500.371550
EXT_SOURCE_3 1382.049802
dtdiff__BIRTH_DATE 714.069627
EXT_SOURCE_1 573.079861
DAYS_REGISTRATION 461.927863
...
ord__HOUSETYPE_MODE 1.985318
ELEVATORS_MEDI 1.862320
FLAG_DOCUMENT_6 0.000000
REG_REGION_NOT_WORK_REGION 0.000000
ord__FLAG_OWN_CAR 0.000000
Length: 85, dtype: float64
======================================================================
Feature importances of lowest level algorithm - model 1:
EXT_SOURCE_3 2666.270588
EXT_SOURCE_2 2425.430385
dtdiff__BIRTH_DATE 1607.440484
DAYS_REGISTRATION 1217.128893
SK_ID_CURR 1136.992744
...
LIVE_REGION_NOT_WORK_REGION 9.561320
ord__EMERGENCYSTATE_MODE 7.256624
REG_REGION_NOT_WORK_REGION 5.843864
ord__NAME_CONTRACT_TYPE 3.890026
FLAG_DOCUMENT_6 3.523548
Length: 85, dtype: float64
======================================================================
Step 7. Predict on test data and check scores
[18]:
%%time
test_pred = automl.predict(test_data)
print('Prediction for test data:\n{}\nShape = {}'
.format(test_pred, test_pred.shape))
print('Check scores...')
print('OOF score: {}'.format(roc_auc_score(train_data[TARGET_NAME].values, oof_pred.data[:, 0])))
print('TEST score: {}'.format(roc_auc_score(test_data[TARGET_NAME].values, test_pred.data[:, 0])))
Prediction for test data:
array([[0.060448 ],
[0.07832611],
[0.05339179],
...,
[0.06192666],
[0.07732402],
[0.20730501]], dtype=float32)
Shape = (2000, 1)
Check scores...
OOF score: 0.6979918272484156
TEST score: 0.7158254076086956
CPU times: user 421 ms, sys: 11.6 ms, total: 433 ms
Wall time: 103 ms
Tutorial 7: ICE and PDP Interpretation Tutorial
Official LightAutoML github repository is here
Partial dependence plot (PDP) and Individual Conditional Expectation (ICE) are two model-agnostic interpretation methods (see details here).
Download library and make some imports
[1]:
# !pip install lightautoml
[2]:
# Standard python libraries
import os
import requests
# Installed libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
# Imports from our package
from lightautoml.automl.presets.tabular_presets import TabularAutoML
from lightautoml.tasks import Task
[3]:
plt.rcParams.update({'font.size': 20})
sns.set(rc={'figure.figsize':(15, 11)})
sns.set(style="whitegrid", font_scale=1.5)
N_THREADS = 8 # threads cnt for lgbm and linear models
N_FOLDS = 5 # folds cnt for AutoML
RANDOM_STATE = 42 # fixed random state for various reasons
TEST_SIZE = 0.2 # Test size for metric check
TIMEOUT = 120 # Time in seconds for automl run
TARGET_NAME = 'TARGET' # Target column name
Prepare data
Load the dataset from the repository if you haven't cloned it via git.
[4]:
DATASET_DIR = './data/'
DATASET_NAME = 'sampled_app_train.csv'
DATASET_FULLNAME = os.path.join(DATASET_DIR, DATASET_NAME)
DATASET_URL = 'https://raw.githubusercontent.com/AILab-MLTools/LightAutoML/master/examples/data/sampled_app_train.csv'
[5]:
%%time
if not os.path.exists(DATASET_FULLNAME):
os.makedirs(DATASET_DIR, exist_ok=True)
dataset = requests.get(DATASET_URL).text
with open(DATASET_FULLNAME, 'w') as output:
output.write(dataset)
data = pd.read_csv(DATASET_FULLNAME)
data['EMP_DATE'] = (np.datetime64('2018-01-01') + np.clip(data['DAYS_EMPLOYED'], None, 0).astype(np.dtype('timedelta64[D]'))
).astype(str)
CPU times: user 223 ms, sys: 52.9 ms, total: 276 ms
Wall time: 503 ms
[6]:
train_data, test_data = train_test_split(data,
test_size=TEST_SIZE,
stratify=data[TARGET_NAME],
random_state=RANDOM_STATE)
Create AutoML from preset
Also works with lightautoml.automl.presets.tabular_presets.TabularUtilizedAutoML.
[7]:
%%time
task = Task('binary', )
roles = {'target': TARGET_NAME,}
automl = TabularAutoML(task = task,
timeout = TIMEOUT,
cpu_limit = N_THREADS,
reader_params = {'n_jobs': N_THREADS, 'cv': N_FOLDS, 'random_state': RANDOM_STATE},
)
oof_pred = automl.fit_predict(train_data, roles = roles, verbose = 1, log_file = 'train.log')
[16:58:33] Stdout logging level is INFO.
[16:58:33] Copying TaskTimer may affect the parent PipelineTimer, so copy will create new unlimited TaskTimer
[16:58:33] Task: binary
[16:58:33] Start automl preset with listed constraints:
[16:58:33] - time: 120.00 seconds
[16:58:33] - CPU: 8 cores
[16:58:33] - memory: 16 GB
[16:58:33] Train data shape: (8000, 123)
[16:58:36] Layer 1 train process start. Time left 117.58 secs
[16:58:36] Start fitting Lvl_0_Pipe_0_Mod_0_LinearL2 ...
[16:58:40] Fitting Lvl_0_Pipe_0_Mod_0_LinearL2 finished. score = 0.7340989893230383
[16:58:40] Lvl_0_Pipe_0_Mod_0_LinearL2 fitting and predicting completed
[16:58:40] Time left 112.94 secs
[16:58:43] Selector_LightGBM fitting and predicting completed
[16:58:44] Start fitting Lvl_0_Pipe_1_Mod_0_LightGBM ...
[16:58:53] Time limit exceeded after calculating fold 3
[16:58:53] Fitting Lvl_0_Pipe_1_Mod_0_LightGBM finished. score = 0.7336652733096534
[16:58:53] Lvl_0_Pipe_1_Mod_0_LightGBM fitting and predicting completed
[16:58:53] Start hyperparameters optimization for Lvl_0_Pipe_1_Mod_1_Tuned_LightGBM ... Time budget is 1.00 secs
[16:59:03] Hyperparameters optimization for Lvl_0_Pipe_1_Mod_1_Tuned_LightGBM completed
[16:59:03] Start fitting Lvl_0_Pipe_1_Mod_1_Tuned_LightGBM ...
[16:59:16] Fitting Lvl_0_Pipe_1_Mod_1_Tuned_LightGBM finished. score = 0.7146425170595188
[16:59:16] Lvl_0_Pipe_1_Mod_1_Tuned_LightGBM fitting and predicting completed
[16:59:16] Start fitting Lvl_0_Pipe_1_Mod_2_CatBoost ...
[16:59:21] Fitting Lvl_0_Pipe_1_Mod_2_CatBoost finished. score = 0.7180592042951911
[16:59:21] Lvl_0_Pipe_1_Mod_2_CatBoost fitting and predicting completed
[16:59:21] Start hyperparameters optimization for Lvl_0_Pipe_1_Mod_3_Tuned_CatBoost ... Time budget is 29.20 secs
[16:59:51] Hyperparameters optimization for Lvl_0_Pipe_1_Mod_3_Tuned_CatBoost completed
[16:59:51] Start fitting Lvl_0_Pipe_1_Mod_3_Tuned_CatBoost ...
[16:59:58] Fitting Lvl_0_Pipe_1_Mod_3_Tuned_CatBoost finished. score = 0.7424781750625415
[16:59:58] Lvl_0_Pipe_1_Mod_3_Tuned_CatBoost fitting and predicting completed
[16:59:58] Time left 35.17 secs
[16:59:58] Time limit exceeded in one of the tasks. AutoML will blend level 1 models.
[16:59:58] Layer 1 training completed.
[16:59:58] Blending: optimization starts with equal weights and score 0.7470969001073415
[16:59:58] Blending: iteration 0: score = 0.7483672886691461, weights = [0.18754406 0.1279657 0.37286162 0.06386749 0.24776113]
[16:59:58] Blending: iteration 1: score = 0.7484541355819561, weights = [0.23439428 0.12674679 0.31599942 0.06325912 0.25960034]
[16:59:59] Blending: iteration 2: score = 0.748450627689517, weights = [0.23445104 0.1267374 0.315976 0.06325444 0.25958112]
[16:59:59] Blending: iteration 3: score = 0.748450627689517, weights = [0.23445104 0.1267374 0.315976 0.06325444 0.25958112]
[16:59:59] Blending: no score update. Terminated
[16:59:59] Automl preset training completed in 85.25 seconds
[16:59:59] Model description:
Final prediction for new objects (level 0) =
0.23445 * (5 averaged models Lvl_0_Pipe_0_Mod_0_LinearL2) +
0.12674 * (4 averaged models Lvl_0_Pipe_1_Mod_0_LightGBM) +
0.31598 * (5 averaged models Lvl_0_Pipe_1_Mod_1_Tuned_LightGBM) +
0.06325 * (5 averaged models Lvl_0_Pipe_1_Mod_2_CatBoost) +
0.25958 * (5 averaged models Lvl_0_Pipe_1_Mod_3_Tuned_CatBoost)
CPU times: user 10min 7s, sys: 49.1 s, total: 10min 56s
Wall time: 1min 25s
Calculate interpretation data
ICE shows the functional relationship between the predicted response and the feature separately for each instance. PDP averages the individual lines of an ICE plot.
Numeric features
For numeric features you can specify n_bins - the number of bins into which the range of feature values is divided.
Calculate data for PDP plot manually:
[8]:
%%time
grid, ys, counts = automl.get_individual_pdp(test_data, feature_name='DAYS_BIRTH', n_bins=30)
100%|██████████| 30/30 [00:18<00:00, 1.63it/s]
CPU times: user 2min 2s, sys: 7.35 s, total: 2min 9s
Wall time: 18.4 s
[9]:
%%time
X = np.array([item.ravel() for item in ys]).T
plt.figure(figsize=(15, 11))
plt.plot(grid, X[0], alpha=0.05, color='m', label='ICE plots')
for i in range(1, X.shape[0]):
plt.plot(grid, X[i], alpha=0.05, color='b')
plt.plot(grid, X.mean(axis=0), linewidth=2, color='r', label='PDP mean')
plt.legend()
plt.show()

CPU times: user 5.9 s, sys: 3.63 s, total: 9.53 s
Wall time: 2.46 s
Built-in function:
[10]:
automl.plot_pdp(test_data, feature_name='DAYS_BIRTH')
100%|██████████| 30/30 [00:17<00:00, 1.67it/s]

[11]:
automl.plot_pdp(test_data, feature_name='DAYS_BIRTH', individual=True)
100%|██████████| 30/30 [00:18<00:00, 1.63it/s]

Categorical features
[12]:
%%time
automl.plot_pdp(test_data, feature_name='ORGANIZATION_TYPE')
100%|██████████| 10/10 [00:05<00:00, 1.69it/s]

CPU times: user 43.8 s, sys: 2.54 s, total: 46.4 s
Wall time: 6.87 s
Datetime features
For datetime features you can specify the groupby level; allowed values: year, month, dayofweek.
[13]:
%%time
automl.plot_pdp(test_data, feature_name='EMP_DATE', datetime_level='year')
100%|██████████| 45/45 [00:27<00:00, 1.63it/s]

CPU times: user 3min 2s, sys: 10.2 s, total: 3min 12s
Wall time: 29.4 s
Tutorial 8: CV preset
Official LightAutoML github repository is here
In this tutorial we will look at how to apply LightAutoML to computer vision tasks.
Basically, the corresponding modules are designed for problems where the image is an auxiliary value (complementing the rest of the tabular data) rather than for full-fledged CV problems. In LightAutoML, working with images goes through tabular data: not the images themselves are used, but the paths to them. The paths should be written in a separate column assigned the corresponding 'path' role. The target variable and, optionally, other features are also specified in the table. To make predictions, numerical features are extracted from the images, such as color histograms (RGB or HSV) and image embeddings based on EfficientNet (with the option to select a version and use AdvProp weights); then the standard machine learning models available in LightAutoML (as in conventional tabular presets) are applied to them. By default, linear regression with L2 regularization and CatBoost are used: linear regression is trained on image embeddings, CatBoost is trained on histogram features, and weighted blending is finally applied to their predictions.
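To give an intuition for the histogram part, here is a minimal sketch of the kind of color-histogram features that can be built from an image path; the function name and binning here are hypothetical illustrations, not LightAutoML's internal implementation:
import numpy as np
from PIL import Image

def hsv_histogram_features(path, bins=16):
    # Flattened per-channel HSV histogram, each channel normalized to sum to 1
    img = np.asarray(Image.open(path).convert('HSV'))
    feats = []
    for ch in range(3):
        hist, _ = np.histogram(img[..., ch], bins=bins, range=(0, 256))
        feats.append(hist / max(hist.sum(), 1))
    return np.concatenate(feats)  # vector of length 3 * bins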
As an example, let's consider the Paddy Doctor competition: a multi-class classification task of determining the type of paddy leaf disease from photographs and other numerical features. The data is a set of images plus a table, each row of which corresponds to a specific image and specifies the path to it.
Importing libraries and preparing data
We will use the data from Kaggle. You can download the dataset from this link and import it in any convenient way. For example, we download the data using the Kaggle API and install the corresponding requirements. You can run the next cell to load the data and install the packages this way:
[ ]:
##Kaggle functionality for loading data; Note that you have to use your kaggle API token (see the link above):
# !pip install opendatasets
# !pip install -q kaggle
# !pip install --upgrade --force-reinstall --no-deps kaggle
# !mkdir ~/.kaggle
# !ls ~/.kaggle
# !cp kaggle.json ~/.kaggle/
# !chmod 600 ~/.kaggle/kaggle.json
# !kaggle competitions download -c paddy-disease-classification
# #Unpack data:
# !mkdir paddy-disease
# !unzip paddy-disease-classification.zip -d paddy-disease
# #Install LightAutoML, Pandas and torch EfficientNet:
# !pip install -U lightautoml[cv] #[cv] is for installing CV tasks functionality
Then we will import the libraries we use in this kernel:
- Standard python libraries for timing, working with OS etc.
- Essential python DS libraries like numpy, pandas, scikit-learn and torch (the last we will use in the next cell)
- LightAutoML modules: the TabularCVAutoML preset for AutoML model creation and the Task class to set up what kind of ML problem we solve (binary/multiclass classification or regression)
[1]:
import os
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID" # see issue #152
os.environ["CUDA_VISIBLE_DEVICES"]="0"
[2]:
# Standard python libraries
import os
import time
# Essential DS libraries
import numpy as np
import pandas as pd
from sklearn.metrics import f1_score, accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
import torch
import seaborn as sns
import matplotlib.pyplot as plt
# LightAutoML presets, task and report generation
from lightautoml.automl.presets.image_presets import TabularCVAutoML
from lightautoml.tasks import Task
'nlp' extra dependecy package 'gensim' isn't installed. Look at README.md in repo 'LightAutoML' for installation instructions.
'nlp' extra dependecy package 'nltk' isn't installed. Look at README.md in repo 'LightAutoML' for installation instructions.
'nlp' extra dependecy package 'transformers' isn't installed. Look at README.md in repo 'LightAutoML' for installation instructions.
/home/dvladimirvasilyev/LightAutoML/lightautoml/ml_algo/dl_model.py:41: UserWarning: 'transformers' - package isn't installed
warnings.warn("'transformers' - package isn't installed")
/home/dvladimirvasilyev/LightAutoML/lightautoml/text/nn_model.py:22: UserWarning: 'transformers' - package isn't installed
warnings.warn("'transformers' - package isn't installed")
/home/dvladimirvasilyev/LightAutoML/lightautoml/text/dl_transformers.py:25: UserWarning: 'transformers' - package isn't installed
warnings.warn("'transformers' - package isn't installed")
For better reproducibility, fix the numpy random seed and the max number of threads for Torch (which usually tries to use all the threads on the server):
[3]:
np.random.seed(42)
torch.set_num_threads(2)
Let’s check the data we have:
[4]:
INPUT_DIR = './paddy-disease/'
[5]:
train_data = pd.read_csv(INPUT_DIR + 'train.csv')
print(train_data.shape)
train_data.head()
(10407, 4)
[5]:
image_id | label | variety | age | |
---|---|---|---|---|
0 | 100330.jpg | bacterial_leaf_blight | ADT45 | 45 |
1 | 100365.jpg | bacterial_leaf_blight | ADT45 | 45 |
2 | 100382.jpg | bacterial_leaf_blight | ADT45 | 45 |
3 | 100632.jpg | bacterial_leaf_blight | ADT45 | 45 |
4 | 101918.jpg | bacterial_leaf_blight | ADT45 | 45 |
[6]:
train_data['label'].value_counts()
[6]:
normal 1764
blast 1738
hispa 1594
dead_heart 1442
tungro 1088
brown_spot 965
downy_mildew 620
bacterial_leaf_blight 479
bacterial_leaf_streak 380
bacterial_panicle_blight 337
Name: label, dtype: int64
[7]:
train_data['variety'].value_counts()
[7]:
ADT45 6992
KarnatakaPonni 988
Ponni 657
AtchayaPonni 461
Zonal 399
AndraPonni 377
Onthanel 351
IR20 114
RR 36
Surya 32
Name: variety, dtype: int64
[8]:
train_data['age'].value_counts()
[8]:
70 3077
60 1660
50 1066
75 866
65 774
55 563
72 552
45 505
67 415
68 253
80 225
57 213
47 112
77 42
73 38
66 36
62 5
82 5
Name: age, dtype: int64
[9]:
submission = pd.read_csv(INPUT_DIR + 'sample_submission.csv')
print(submission.shape)
submission.head()
(3469, 2)
[9]:
image_id | label | |
---|---|---|
0 | 200001.jpg | NaN |
1 | 200002.jpg | NaN |
2 | 200003.jpg | NaN |
3 | 200004.jpg | NaN |
4 | 200005.jpg | NaN |
Add a column with the full path to the images:
[10]:
%%time
train_data['path'] = INPUT_DIR + 'train_images/' + train_data['label'] + '/' + train_data['image_id']
train_data.head()
CPU times: user 4.89 ms, sys: 485 µs, total: 5.37 ms
Wall time: 5.14 ms
[10]:
image_id | label | variety | age | path | |
---|---|---|---|---|---|
0 | 100330.jpg | bacterial_leaf_blight | ADT45 | 45 | ./paddy-disease/train_images/bacterial_leaf_bl... |
1 | 100365.jpg | bacterial_leaf_blight | ADT45 | 45 | ./paddy-disease/train_images/bacterial_leaf_bl... |
2 | 100382.jpg | bacterial_leaf_blight | ADT45 | 45 | ./paddy-disease/train_images/bacterial_leaf_bl... |
3 | 100632.jpg | bacterial_leaf_blight | ADT45 | 45 | ./paddy-disease/train_images/bacterial_leaf_bl... |
4 | 101918.jpg | bacterial_leaf_blight | ADT45 | 45 | ./paddy-disease/train_images/bacterial_leaf_bl... |
[11]:
submission['path'] = INPUT_DIR + 'test_images/' + submission['image_id']
submission.head()
[11]:
image_id | label | path | |
---|---|---|---|
0 | 200001.jpg | NaN | ./paddy-disease/test_images/200001.jpg |
1 | 200002.jpg | NaN | ./paddy-disease/test_images/200002.jpg |
2 | 200003.jpg | NaN | ./paddy-disease/test_images/200003.jpg |
3 | 200004.jpg | NaN | ./paddy-disease/test_images/200004.jpg |
4 | 200005.jpg | NaN | ./paddy-disease/test_images/200005.jpg |
Let’s expand the training data with augmentations: random rotations and flips:
[ ]:
os.mkdir('./paddy-disease/modified_train')
[12]:
from PIL import Image
from tqdm.notebook import tqdm
new_imgs = []
for i, p in tqdm(enumerate(train_data['path'].values)):
if i % 1000 == 0:
print(i)
img = Image.open(p)
for it in range(10):
new_img = img.rotate(np.random.rand() * 60 - 30, resample=3)
if np.random.rand() > 0.5:
new_img = new_img.transpose(Image.FLIP_LEFT_RIGHT)
new_img_name = './paddy-disease/modified_train/' + p.split('/')[-1][:-4] + '_' + str(it) + '.jpg'
new_img.save(new_img_name)
new_imgs.append([new_img_name, p.split('/')[-2], p.split('/')[-1]])
0
1000
2000
3000
4000
5000
6000
7000
8000
9000
10000
[13]:
train_data = pd.concat([train_data, pd.DataFrame(new_imgs, columns = ['path', 'label', 'image_id'])]).reset_index(drop = True)
train_data
[13]:
image_id | label | variety | age | path | |
---|---|---|---|---|---|
0 | 100330.jpg | bacterial_leaf_blight | ADT45 | 45.0 | ./paddy-disease/train_images/bacterial_leaf_bl... |
1 | 100365.jpg | bacterial_leaf_blight | ADT45 | 45.0 | ./paddy-disease/train_images/bacterial_leaf_bl... |
2 | 100382.jpg | bacterial_leaf_blight | ADT45 | 45.0 | ./paddy-disease/train_images/bacterial_leaf_bl... |
3 | 100632.jpg | bacterial_leaf_blight | ADT45 | 45.0 | ./paddy-disease/train_images/bacterial_leaf_bl... |
4 | 101918.jpg | bacterial_leaf_blight | ADT45 | 45.0 | ./paddy-disease/train_images/bacterial_leaf_bl... |
... | ... | ... | ... | ... | ... |
114472 | 110381.jpg | tungro | NaN | NaN | ./paddy-disease/modified_train/110381_5.jpg |
114473 | 110381.jpg | tungro | NaN | NaN | ./paddy-disease/modified_train/110381_6.jpg |
114474 | 110381.jpg | tungro | NaN | NaN | ./paddy-disease/modified_train/110381_7.jpg |
114475 | 110381.jpg | tungro | NaN | NaN | ./paddy-disease/modified_train/110381_8.jpg |
114476 | 110381.jpg | tungro | NaN | NaN | ./paddy-disease/modified_train/110381_9.jpg |
114477 rows × 5 columns
Let’s do the same for the test dataset:
[22]:
os.mkdir('./paddy-disease/modified_test')
[14]:
new_imgs = []
for i, p in tqdm(enumerate(submission['path'].values)):
if i % 1000 == 0:
print(i)
img = Image.open(p)
for it in range(5):
new_img = img.rotate(np.random.rand() * 60 - 30, resample=3)
if np.random.rand() > 0.5:
new_img = new_img.transpose(Image.FLIP_LEFT_RIGHT)
new_img_name = './paddy-disease/modified_test/' + p.split('/')[-1][:-4] + '_' + str(it) + '.jpg'
new_img.save(new_img_name)
new_imgs.append([new_img_name, p.split('/')[-1]])
0
1000
2000
3000
[15]:
submission = pd.concat([submission, pd.DataFrame(new_imgs, columns = ['path', 'image_id'])]).reset_index(drop = True)
submission
[15]:
image_id | label | path | |
---|---|---|---|
0 | 200001.jpg | NaN | ./paddy-disease/test_images/200001.jpg |
1 | 200002.jpg | NaN | ./paddy-disease/test_images/200002.jpg |
2 | 200003.jpg | NaN | ./paddy-disease/test_images/200003.jpg |
3 | 200004.jpg | NaN | ./paddy-disease/test_images/200004.jpg |
4 | 200005.jpg | NaN | ./paddy-disease/test_images/200005.jpg |
... | ... | ... | ... |
20809 | 203469.jpg | NaN | ./paddy-disease/modified_test/203469_0.jpg |
20810 | 203469.jpg | NaN | ./paddy-disease/modified_test/203469_1.jpg |
20811 | 203469.jpg | NaN | ./paddy-disease/modified_test/203469_2.jpg |
20812 | 203469.jpg | NaN | ./paddy-disease/modified_test/203469_3.jpg |
20813 | 203469.jpg | NaN | ./paddy-disease/modified_test/203469_4.jpg |
20814 rows × 3 columns
Task definition
Task type
In the cell below we create the Task object: the class to set up what task the LightAutoML model should solve, with a specific loss and metric if necessary (more info can be found here in our documentation). In general, it can be any type of task available in LightAutoML (binary and multi-class classification, one-dimensional and multi-dimensional regression, multi-label classification), but in this case we have a multi-class classification task:
[16]:
task = Task('multiclass')
The default metric and loss in multi-class classification is cross-entropy.
Feature roles setup
Next we need to set up the column roles. It is necessary to specify the role of the target variable ('target'), as well as the role of the path to the images ('path') when using TabularCVAutoML. We will also group the images (the original ones and their augmentations) and apply group k-fold cross-validation, specifying the column with ids as the 'group' role:
[17]:
roles = {
'target': 'label',
'path': ['path'],
'drop': ['variety', 'age'],
'group': 'image_id'
}
Then we initialize TabularCVAutoML. It is possible to specify many parameters (reader parameters, time and memory limits etc.), including the EfficientNet parameters for getting embeddings: version (B0 by default), device, batch size (128 by default), path for weights, use of AdvProp weights (for better use of shape in images, True by default) etc. Note that the Utilized version of TabularCVAutoML, for more flexible use of time resources, is not yet available.
[18]:
automl = TabularCVAutoML(task = task,
timeout=5 * 3600,
cpu_limit = 2,
reader_params = {'cv': 5, 'random_state': 42})
AutoML training
To run AutoML training use the fit_predict method with:
- train_data - Dataset to train.
- roles - Roles dict.
- verbose - Controls the verbosity: the higher, the more messages. <1: messages are not displayed; >=1: the computation process for layers is displayed; >=2: information about folds processing is also displayed; >=3: the hyperparameters optimization process is also displayed; >=4: the training process for every algorithm is displayed.
Note: the out-of-fold prediction is calculated during training and returned by the fit_predict method.
[19]:
%%time
oof_pred = automl.fit_predict(train_data, roles = roles, verbose = 3)
[14:04:32] Stdout logging level is INFO3.
[14:04:32] Task: multiclass
[14:04:32] Start automl preset with listed constraints:
[14:04:32] - time: 18000.00 seconds
[14:04:32] - CPU: 2 cores
[14:04:32] - memory: 16 GB
[14:04:32] Train data shape: (114477, 5)
[14:04:32] Layer 1 train process start. Time left 17999.83 secs
100%|██████████| 895/895 [07:29<00:00, 1.99it/s]
[14:12:09] Feature path transformed
[14:12:16] Start fitting Lvl_0_Pipe_0_Mod_0_LinearL2 ...
[14:12:17] ===== Start working with fold 0 for Lvl_0_Pipe_0_Mod_0_LinearL2 =====
[14:12:26] Linear model: C = 1e-05 score = -0.9995305866945853
[14:12:32] Linear model: C = 5e-05 score = -0.6879959560713191
[14:12:38] Linear model: C = 0.0001 score = -0.5802952177399445
[14:12:45] Linear model: C = 0.0005 score = -0.3907926611544111
[14:12:51] Linear model: C = 0.001 score = -0.33425017155675657
[14:13:00] Linear model: C = 0.005 score = -0.2559518217619532
[14:13:07] Linear model: C = 0.01 score = -0.24141776919439237
[14:13:15] Linear model: C = 0.05 score = -0.2431661172897411
[14:13:23] Linear model: C = 0.1 score = -0.25925367786528475
[14:13:24] ===== Start working with fold 1 for Lvl_0_Pipe_0_Mod_0_LinearL2 =====
[14:13:32] Linear model: C = 1e-05 score = -0.9872444001968863
[14:13:39] Linear model: C = 5e-05 score = -0.6682540100549987
[14:13:45] Linear model: C = 0.0001 score = -0.5574685730009872
[14:13:51] Linear model: C = 0.0005 score = -0.3653461360638747
[14:13:58] Linear model: C = 0.001 score = -0.31059360297670363
[14:14:05] Linear model: C = 0.005 score = -0.2370436682635623
[14:14:14] Linear model: C = 0.01 score = -0.22495884629469698
[14:14:21] Linear model: C = 0.05 score = -0.23420873784566962
[14:14:29] Linear model: C = 0.1 score = -0.25263966927426823
[14:14:29] ===== Start working with fold 2 for Lvl_0_Pipe_0_Mod_0_LinearL2 =====
[14:14:37] Linear model: C = 1e-05 score = -0.9554531133528031
[14:14:43] Linear model: C = 5e-05 score = -0.640784196156178
[14:14:49] Linear model: C = 0.0001 score = -0.5345024606190905
[14:14:57] Linear model: C = 0.0005 score = -0.3546726337461952
[14:15:04] Linear model: C = 0.001 score = -0.30344210801693483
[14:15:12] Linear model: C = 0.005 score = -0.2331574262775805
[14:15:19] Linear model: C = 0.01 score = -0.22071779776854528
[14:15:28] Linear model: C = 0.05 score = -0.22603075278344578
[14:15:36] Linear model: C = 0.1 score = -0.24138537694410292
[14:15:36] ===== Start working with fold 3 for Lvl_0_Pipe_0_Mod_0_LinearL2 =====
[14:15:44] Linear model: C = 1e-05 score = -0.973115505822288
[14:15:51] Linear model: C = 5e-05 score = -0.6613476137718094
[14:15:56] Linear model: C = 0.0001 score = -0.5539538946164072
[14:16:04] Linear model: C = 0.0005 score = -0.3666276035478478
[14:16:10] Linear model: C = 0.001 score = -0.31130200709742806
[14:16:18] Linear model: C = 0.005 score = -0.2326339584928626
[14:16:25] Linear model: C = 0.01 score = -0.21658099282365262
[14:16:33] Linear model: C = 0.05 score = -0.21364841773406087
[14:16:42] Linear model: C = 0.1 score = -0.2256018292053085
[14:16:51] Linear model: C = 0.5 score = -0.2763179966937595
[14:16:51] ===== Start working with fold 4 for Lvl_0_Pipe_0_Mod_0_LinearL2 =====
[14:16:58] Linear model: C = 1e-05 score = -0.9531496536787142
[14:17:05] Linear model: C = 5e-05 score = -0.6270339670737181
[14:17:10] Linear model: C = 0.0001 score = -0.517302736118502
[14:17:17] Linear model: C = 0.0005 score = -0.331531311465719
[14:17:23] Linear model: C = 0.001 score = -0.27798570249468424
[14:17:32] Linear model: C = 0.005 score = -0.20448637290477473
[14:17:39] Linear model: C = 0.01 score = -0.19081673660070902
[14:17:47] Linear model: C = 0.05 score = -0.1923892363102242
[14:17:56] Linear model: C = 0.1 score = -0.20661581389305533
[14:17:56] Fitting Lvl_0_Pipe_0_Mod_0_LinearL2 finished. score = -0.21831477243925082
[14:17:56] Lvl_0_Pipe_0_Mod_0_LinearL2 fitting and predicting completed
[14:17:56] Time left 17195.98 secs
[14:22:15] Start fitting Lvl_0_Pipe_1_Mod_0_CatBoost ...
[14:22:16] ===== Start working with fold 0 for Lvl_0_Pipe_1_Mod_0_CatBoost =====
[14:22:16] 0: learn: 2.2636799 test: 2.2649649 best: 2.2649649 (0) total: 6.85ms remaining: 27.4s
[14:22:35] bestTest = 0.2436411292
[14:22:35] bestIteration = 3999
[14:22:35] ===== Start working with fold 1 for Lvl_0_Pipe_1_Mod_0_CatBoost =====
[14:22:36] 0: learn: 2.2634692 test: 2.2632526 best: 2.2632526 (0) total: 6.16ms remaining: 24.6s
[14:22:55] bestTest = 0.2658199543
[14:22:55] bestIteration = 3999
[14:22:56] ===== Start working with fold 2 for Lvl_0_Pipe_1_Mod_0_CatBoost =====
[14:22:56] 0: learn: 2.2631654 test: 2.2656298 best: 2.2656298 (0) total: 6.08ms remaining: 24.3s
[14:23:16] bestTest = 0.2753673319
[14:23:16] bestIteration = 3999
[14:23:16] ===== Start working with fold 3 for Lvl_0_Pipe_1_Mod_0_CatBoost =====
[14:23:17] 0: learn: 2.2645696 test: 2.2657045 best: 2.2657045 (0) total: 6.76ms remaining: 27s
[14:23:37] bestTest = 0.2738943611
[14:23:37] bestIteration = 3996
[14:23:37] Shrink model to first 3997 iterations.
[14:23:37] ===== Start working with fold 4 for Lvl_0_Pipe_1_Mod_0_CatBoost =====
[14:23:38] 0: learn: 2.2642805 test: 2.2644245 best: 2.2644245 (0) total: 5.84ms remaining: 23.4s
[14:23:57] bestTest = 0.2538460334
[14:23:57] bestIteration = 3999
[14:23:58] Fitting Lvl_0_Pipe_1_Mod_0_CatBoost finished. score = -0.2625123265864018
[14:23:58] Lvl_0_Pipe_1_Mod_0_CatBoost fitting and predicting completed
[14:23:58] Time left 16834.07 secs
[14:23:58] Layer 1 training completed.
[14:23:58] Blending: optimization starts with equal weights and score -0.1879588701291192
/home/dvladimirvasilyev/anaconda3/envs/myenv/lib/python3.8/site-packages/sklearn/metrics/_classification.py:2916: UserWarning: The y_pred values do not sum to one. Starting from 1.5 thiswill result in an error.
warnings.warn(
[14:23:59] Blending: iteration 0: score = -0.18573794844833624, weights = [0.63928086 0.36071914]
[14:23:59] Blending: iteration 1: score = -0.18573794844833624, weights = [0.63928086 0.36071914]
[14:23:59] Blending: no score update. Terminated
[14:23:59] Automl preset training completed in 1167.35 seconds
[14:23:59] Model description:
Final prediction for new objects (level 0) =
0.63928 * (5 averaged models Lvl_0_Pipe_0_Mod_0_LinearL2) +
0.36072 * (5 averaged models Lvl_0_Pipe_1_Mod_0_CatBoost)
CPU times: user 18min 40s, sys: 3min 1s, total: 21min 42s
Wall time: 19min 27s
Consider the out-of-fold predictions on the train data. In the case of classification, LightAutoML returns class probabilities as output.
[21]:
preds = train_data[['image_id', 'label']]
preds
[21]:
image_id | label | |
---|---|---|
0 | 100330.jpg | bacterial_leaf_blight |
1 | 100365.jpg | bacterial_leaf_blight |
2 | 100382.jpg | bacterial_leaf_blight |
3 | 100632.jpg | bacterial_leaf_blight |
4 | 101918.jpg | bacterial_leaf_blight |
... | ... | ... |
114472 | 110381.jpg | tungro |
114473 | 110381.jpg | tungro |
114474 | 110381.jpg | tungro |
114475 | 110381.jpg | tungro |
114476 | 110381.jpg | tungro |
114477 rows × 2 columns
[22]:
for i in range(10):
preds['pred_' + str(i)] = oof_pred.data[:,i]
preds
/tmp/ipykernel_12895/1432655611.py:2: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
preds['pred_' + str(i)] = oof_pred.data[:,i]
/tmp/ipykernel_12895/1432655611.py:2: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
preds['pred_' + str(i)] = oof_pred.data[:,i]
[22]:
image_id | label | pred_0 | pred_1 | pred_2 | pred_3 | pred_4 | pred_5 | pred_6 | pred_7 | pred_8 | pred_9 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 100330.jpg | bacterial_leaf_blight | 0.023245 | 0.315283 | 0.470886 | 0.002528 | 0.021895 | 0.007454 | 0.001554 | 0.157142 | 8.914904e-06 | 4.559626e-06 |
1 | 100365.jpg | bacterial_leaf_blight | 0.003717 | 0.011035 | 0.028317 | 0.000110 | 0.003178 | 0.000015 | 0.000131 | 0.953496 | 1.555987e-07 | 5.692390e-07 |
2 | 100382.jpg | bacterial_leaf_blight | 0.025734 | 0.095088 | 0.208473 | 0.000879 | 0.007030 | 0.003382 | 0.000142 | 0.659271 | 3.872871e-07 | 2.898941e-07 |
3 | 100632.jpg | bacterial_leaf_blight | 0.002876 | 0.542942 | 0.027466 | 0.000317 | 0.036005 | 0.000398 | 0.000082 | 0.389901 | 3.837710e-06 | 9.339438e-06 |
4 | 101918.jpg | bacterial_leaf_blight | 0.009988 | 0.033572 | 0.017635 | 0.000032 | 0.008310 | 0.000136 | 0.000041 | 0.930286 | 1.554736e-07 | 1.530466e-07 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
114472 | 110381.jpg | tungro | 0.001716 | 0.109143 | 0.020722 | 0.001495 | 0.845324 | 0.000177 | 0.021384 | 0.000027 | 6.304998e-06 | 6.075803e-06 |
114473 | 110381.jpg | tungro | 0.022644 | 0.137650 | 0.026389 | 0.004165 | 0.788036 | 0.001093 | 0.019688 | 0.000259 | 3.142513e-05 | 4.477663e-05 |
114474 | 110381.jpg | tungro | 0.016897 | 0.072329 | 0.010469 | 0.005554 | 0.789777 | 0.001240 | 0.103631 | 0.000060 | 1.301366e-05 | 2.972130e-05 |
114475 | 110381.jpg | tungro | 0.008637 | 0.114299 | 0.082281 | 0.003465 | 0.560001 | 0.000741 | 0.230260 | 0.000112 | 1.909918e-04 | 1.351225e-05 |
114476 | 110381.jpg | tungro | 0.004179 | 0.099988 | 0.008320 | 0.004660 | 0.822037 | 0.000663 | 0.059627 | 0.000318 | 1.922170e-04 | 1.441010e-05 |
114477 rows × 12 columns
We will average the predictions for each image over its augmentations:
[23]:
preds = preds.groupby(['image_id', 'label']).mean().reset_index()
preds
[23]:
image_id | label | pred_0 | pred_1 | pred_2 | pred_3 | pred_4 | pred_5 | pred_6 | pred_7 | pred_8 | pred_9 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 100001.jpg | brown_spot | 0.001334 | 0.000791 | 0.002372 | 5.432664e-03 | 0.005328 | 0.978495 | 0.002519 | 0.003511 | 7.897679e-05 | 1.378119e-04 |
1 | 100002.jpg | normal | 0.978428 | 0.011744 | 0.001621 | 3.187062e-03 | 0.002579 | 0.000282 | 0.000156 | 0.001969 | 3.391063e-05 | 1.971700e-07 |
2 | 100003.jpg | hispa | 0.004639 | 0.002192 | 0.992883 | 1.573081e-07 | 0.000026 | 0.000037 | 0.000005 | 0.000218 | 1.920397e-07 | 1.528186e-07 |
3 | 100004.jpg | blast | 0.000259 | 0.982406 | 0.004401 | 7.787708e-03 | 0.002372 | 0.002163 | 0.000173 | 0.000115 | 3.223106e-04 | 4.848040e-07 |
4 | 100005.jpg | hispa | 0.010951 | 0.047475 | 0.829855 | 1.200308e-05 | 0.091933 | 0.000418 | 0.018967 | 0.000370 | 1.118553e-05 | 8.759866e-06 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
10402 | 110403.jpg | tungro | 0.001664 | 0.002167 | 0.007366 | 4.507852e-03 | 0.981122 | 0.000052 | 0.001666 | 0.001455 | 1.527430e-07 | 3.928369e-07 |
10403 | 110404.jpg | normal | 0.932484 | 0.002359 | 0.049850 | 1.244102e-05 | 0.011696 | 0.000593 | 0.002646 | 0.000304 | 4.828784e-05 | 7.773816e-06 |
10404 | 110405.jpg | dead_heart | 0.000192 | 0.000044 | 0.000152 | 9.994839e-01 | 0.000001 | 0.000025 | 0.000058 | 0.000003 | 1.957294e-06 | 3.789358e-05 |
10405 | 110406.jpg | blast | 0.000226 | 0.977683 | 0.000268 | 9.254745e-03 | 0.004962 | 0.000595 | 0.004523 | 0.001717 | 5.624577e-04 | 2.080105e-04 |
10406 | 110407.jpg | brown_spot | 0.000009 | 0.000188 | 0.000539 | 4.357956e-04 | 0.000232 | 0.997215 | 0.000039 | 0.000010 | 1.319862e-03 | 1.372061e-05 |
10407 rows × 12 columns
Assign classes by maximum class probability:
[24]:
OOFs = np.argmax(preds[['pred_' + str(i) for i in range(10)]].values, axis = 1)
OOFs
[24]:
array([5, 0, 2, ..., 3, 1, 5])
Let’s see classification accuracy on train:
[25]:
accuracy = (OOFs == preds['label'].map(automl.reader.class_mapping)).mean()
print(f'Out-of-fold accuracy: {accuracy}')
Out-of-fold accuracy: 0.9686749303353512
Also to estimate the quality of classification, we can use the confusion matrix:
[26]:
cf_matrix = confusion_matrix(preds['label'].map(automl.reader.class_mapping),
OOFs)
plt.figure(figsize = (10, 10))
ax = sns.heatmap(cf_matrix, annot=True, cmap='Blues', fmt = 'd')
ax.set_title('Seaborn Confusion Matrix with labels\n\n');
ax.set_xlabel('\nPredicted Values')
ax.set_ylabel('Actual Values ');
inverse_class_mapping = {y: x for x,y in automl.reader.class_mapping.items()}
labels = [inverse_class_mapping[i] for i in range(len(inverse_class_mapping))]
ax.xaxis.set_ticklabels(labels, rotation = 90)
ax.yaxis.set_ticklabels(labels, rotation = 0)
plt.show()

Predict for test dataset
Now we are also ready to predict on our test competition dataset and create the submission file:
[27]:
%%time
te_pred = automl.predict(submission)
print(f'Prediction for te_data:\n{te_pred}\nShape = {te_pred.shape}')
100%|██████████| 163/163 [01:28<00:00, 1.84it/s]
[14:28:22] Feature path transformed
Prediction for te_data:
array([[1.57098308e-01, 2.81519257e-03, 5.96348643e-01, ...,
1.08084995e-02, 1.95845146e-07, 1.42198633e-05],
[9.83384371e-01, 6.52049668e-04, 1.45791359e-02, ...,
1.12365209e-03, 9.75986836e-07, 1.95965598e-07],
[1.68020770e-01, 3.79674375e-01, 1.86414778e-01, ...,
1.67078048e-03, 1.21877249e-03, 3.75247910e-03],
...,
[1.05072348e-03, 1.24680300e-05, 5.70231769e-03, ...,
4.37476301e-05, 1.52421890e-07, 1.81421214e-07],
[6.52685121e-04, 4.47798493e-06, 5.04824053e-03, ...,
2.13344283e-05, 1.52417726e-07, 1.62638599e-07],
[1.57185504e-03, 1.01540554e-05, 2.53849756e-02, ...,
1.17763964e-04, 1.52426963e-07, 1.77946404e-07]], dtype=float32)
Shape = (20814, 10)
CPU times: user 55.8 s, sys: 21.6 s, total: 1min 17s
Wall time: 2min 19s
[28]:
sub = submission[['image_id']]
for i in range(10):
sub['pred_' + str(i)] = te_pred.data[:,i]
sub
/tmp/ipykernel_12895/1185757098.py:3: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
sub['pred_' + str(i)] = te_pred.data[:,i]
[28]:
image_id | pred_0 | pred_1 | pred_2 | pred_3 | pred_4 | pred_5 | pred_6 | pred_7 | pred_8 | pred_9 | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 200001.jpg | 0.157098 | 0.002815 | 0.596349 | 0.020590 | 1.148577e-01 | 0.095614 | 0.001854 | 0.010808 | 1.958451e-07 | 1.421986e-05 |
1 | 200002.jpg | 0.983384 | 0.000652 | 0.014579 | 0.000139 | 6.825896e-05 | 0.000044 | 0.000008 | 0.001124 | 9.759868e-07 | 1.959656e-07 |
2 | 200003.jpg | 0.168021 | 0.379674 | 0.186415 | 0.000225 | 1.850213e-03 | 0.036919 | 0.220253 | 0.001671 | 1.218772e-03 | 3.752479e-03 |
3 | 200004.jpg | 0.000013 | 0.990730 | 0.008530 | 0.000097 | 1.116415e-04 | 0.000215 | 0.000111 | 0.000037 | 1.548404e-04 | 1.946677e-07 |
4 | 200005.jpg | 0.000340 | 0.999536 | 0.000031 | 0.000002 | 6.857538e-07 | 0.000003 | 0.000007 | 0.000029 | 5.404088e-07 | 4.985940e-05 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
20809 | 203469.jpg | 0.003061 | 0.000017 | 0.041731 | 0.943745 | 1.648944e-04 | 0.010877 | 0.000146 | 0.000258 | 1.524480e-07 | 2.509265e-07 |
20810 | 203469.jpg | 0.000430 | 0.000003 | 0.002508 | 0.993409 | 2.613632e-05 | 0.003580 | 0.000007 | 0.000036 | 1.524176e-07 | 1.595918e-07 |
20811 | 203469.jpg | 0.001051 | 0.000012 | 0.005702 | 0.989972 | 5.734707e-05 | 0.003144 | 0.000018 | 0.000044 | 1.524219e-07 | 1.814212e-07 |
20812 | 203469.jpg | 0.000653 | 0.000004 | 0.005048 | 0.990724 | 3.223727e-05 | 0.003505 | 0.000012 | 0.000021 | 1.524177e-07 | 1.626386e-07 |
20813 | 203469.jpg | 0.001572 | 0.000010 | 0.025385 | 0.965282 | 1.030424e-04 | 0.007472 | 0.000058 | 0.000118 | 1.524270e-07 | 1.779464e-07 |
20814 rows × 11 columns
[29]:
sub = sub.groupby(['image_id']).mean().reset_index()
sub
[29]:
image_id | pred_0 | pred_1 | pred_2 | pred_3 | pred_4 | pred_5 | pred_6 | pred_7 | pred_8 | pred_9 | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 200001.jpg | 0.127650 | 0.001409 | 0.599914 | 0.017568 | 0.136898 | 0.106915 | 0.001796 | 0.007829 | 8.801418e-06 | 1.216593e-05 |
1 | 200002.jpg | 0.937035 | 0.000638 | 0.060420 | 0.000098 | 0.000105 | 0.000096 | 0.000016 | 0.001586 | 6.087082e-06 | 2.249314e-07 |
2 | 200003.jpg | 0.120163 | 0.523312 | 0.106169 | 0.000473 | 0.000748 | 0.042688 | 0.201373 | 0.002807 | 1.389023e-03 | 8.788786e-04 |
3 | 200004.jpg | 0.000020 | 0.888623 | 0.006415 | 0.001150 | 0.000430 | 0.004390 | 0.000616 | 0.001799 | 9.654120e-02 | 1.466518e-05 |
4 | 200005.jpg | 0.000680 | 0.998898 | 0.000085 | 0.000009 | 0.000001 | 0.000002 | 0.000021 | 0.000172 | 1.743805e-06 | 1.304403e-04 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
3464 | 203465.jpg | 0.000224 | 0.002143 | 0.001514 | 0.990281 | 0.002657 | 0.000401 | 0.001074 | 0.000134 | 1.530934e-03 | 4.091801e-05 |
3465 | 203466.jpg | 0.250769 | 0.007148 | 0.741840 | 0.000002 | 0.000022 | 0.000013 | 0.000076 | 0.000129 | 2.629060e-07 | 2.120279e-07 |
3466 | 203467.jpg | 0.960745 | 0.004105 | 0.001135 | 0.000646 | 0.016724 | 0.008584 | 0.000062 | 0.007749 | 2.438365e-04 | 6.326832e-06 |
3467 | 203468.jpg | 0.003675 | 0.001097 | 0.038018 | 0.000038 | 0.000483 | 0.000310 | 0.000223 | 0.000208 | 9.551883e-01 | 7.596347e-04 |
3468 | 203469.jpg | 0.001372 | 0.000012 | 0.015432 | 0.977533 | 0.000086 | 0.005415 | 0.000046 | 0.000104 | 1.524300e-07 | 1.962799e-07 |
3469 rows × 11 columns
[30]:
TEs = pd.Series(np.argmax(sub[['pred_' + str(i) for i in range(10)]].values, axis = 1)).map(inverse_class_mapping)
TEs
[30]:
0 hispa
1 normal
2 blast
3 blast
4 blast
...
3464 dead_heart
3465 hispa
3466 normal
3467 bacterial_leaf_streak
3468 dead_heart
Length: 3469, dtype: object
[31]:
sub['label'] = TEs
sub[['image_id', 'label']].to_csv('LightAutoML_TabularCVAutoML_with_aug.csv', index = False)
sub[['image_id', 'label']]
[31]:
image_id | label | |
---|---|---|
0 | 200001.jpg | hispa |
1 | 200002.jpg | normal |
2 | 200003.jpg | blast |
3 | 200004.jpg | blast |
4 | 200005.jpg | blast |
... | ... | ... |
3464 | 203465.jpg | dead_heart |
3465 | 203466.jpg | hispa |
3466 | 203467.jpg | normal |
3467 | 203468.jpg | bacterial_leaf_streak |
3468 | 203469.jpg | dead_heart |
3469 rows × 2 columns
Now we can choose another embedding model from timm (by default TabularCVAutoML uses vit_base_patch16_224.augreg_in21k). Below we pass timm/tf_efficientnetv2_b0.in1k.
[35]:
automl = TabularCVAutoML(task = task,
timeout=5 * 3600,
autocv_features={"embed_model": 'timm/tf_efficientnetv2_b0.in1k'},
cpu_limit = 2,
reader_params = {'cv': 5, 'random_state': 42})
[36]:
%%time
oof_pred = automl.fit_predict(train_data, roles = roles, verbose = 3)
[14:37:43] Stdout logging level is INFO3.
[14:37:43] Task: multiclass
[14:37:43] Start automl preset with listed constraints:
[14:37:43] - time: 18000.00 seconds
[14:37:43] - CPU: 2 cores
[14:37:43] - memory: 16 GB
[14:37:43] Train data shape: (114477, 5)
[14:37:43] Layer 1 train process start. Time left 17999.80 secs
100%|██████████| 895/895 [06:43<00:00, 2.22it/s]
[14:44:31] Feature path transformed
[14:44:41] Start fitting Lvl_0_Pipe_0_Mod_0_LinearL2 ...
[14:44:41] ===== Start working with fold 0 for Lvl_0_Pipe_0_Mod_0_LinearL2 =====
[14:44:53] Linear model: C = 1e-05 score = -1.2282992628176856
[14:45:04] Linear model: C = 5e-05 score = -0.9078946864858105
[14:45:14] Linear model: C = 0.0001 score = -0.7903223383077203
[14:45:25] Linear model: C = 0.0005 score = -0.5805263796419443
[14:45:37] Linear model: C = 0.001 score = -0.5191830537228186
[14:45:48] Linear model: C = 0.005 score = -0.44237800607788724
[14:46:01] Linear model: C = 0.01 score = -0.4332587963951451
[14:46:16] Linear model: C = 0.05 score = -0.4659824021930572
[14:46:28] Linear model: C = 0.1 score = -0.49696980356910764
[14:46:29] ===== Start working with fold 1 for Lvl_0_Pipe_0_Mod_0_LinearL2 =====
[14:46:40] Linear model: C = 1e-05 score = -1.1941203869888553
[14:46:50] Linear model: C = 5e-05 score = -0.870315687726058
[14:47:00] Linear model: C = 0.0001 score = -0.7542737074009194
[14:47:11] Linear model: C = 0.0005 score = -0.5565397834768919
[14:47:23] Linear model: C = 0.001 score = -0.5021799803891854
[14:47:37] Linear model: C = 0.005 score = -0.4375446715586552
[14:47:49] Linear model: C = 0.01 score = -0.4337117229695793
[14:48:03] Linear model: C = 0.05 score = -0.47678539878379567
[14:48:16] Linear model: C = 0.1 score = -0.5100193461879381
[14:48:16] ===== Start working with fold 2 for Lvl_0_Pipe_0_Mod_0_LinearL2 =====
[14:48:27] Linear model: C = 1e-05 score = -1.1828501053814764
[14:48:39] Linear model: C = 5e-05 score = -0.8603329618510173
[14:48:48] Linear model: C = 0.0001 score = -0.7451147263666518
[14:48:59] Linear model: C = 0.0005 score = -0.5469582228988039
[14:49:12] Linear model: C = 0.001 score = -0.49160247842297417
[14:49:24] Linear model: C = 0.005 score = -0.4257572256164155
[14:49:37] Linear model: C = 0.01 score = -0.4188241529929714
[14:49:50] Linear model: C = 0.05 score = -0.4522382557188784
[14:50:03] Linear model: C = 0.1 score = -0.48277984079191094
[14:50:04] ===== Start working with fold 3 for Lvl_0_Pipe_0_Mod_0_LinearL2 =====
[14:50:15] Linear model: C = 1e-05 score = -1.1958343845422246
[14:50:26] Linear model: C = 5e-05 score = -0.878725101433787
[14:50:35] Linear model: C = 0.0001 score = -0.7660166437189271
[14:50:45] Linear model: C = 0.0005 score = -0.5679153687919936
[14:50:59] Linear model: C = 0.001 score = -0.5110457138416219
[14:51:10] Linear model: C = 0.005 score = -0.44229320617124224
[14:51:23] Linear model: C = 0.01 score = -0.43663952743918066
[14:51:37] Linear model: C = 0.05 score = -0.47363171137894655
[14:51:51] Linear model: C = 0.1 score = -0.5032655687259646
[14:51:51] ===== Start working with fold 4 for Lvl_0_Pipe_0_Mod_0_LinearL2 =====
[14:52:02] Linear model: C = 1e-05 score = -1.1804715353776323
[14:52:13] Linear model: C = 5e-05 score = -0.8529105474280552
[14:52:21] Linear model: C = 0.0001 score = -0.7373622302487922
[14:52:32] Linear model: C = 0.0005 score = -0.537561225715503
[14:52:43] Linear model: C = 0.001 score = -0.48106564988541606
[14:52:57] Linear model: C = 0.005 score = -0.4138154861612588
[14:53:09] Linear model: C = 0.01 score = -0.40990101492044817
[14:53:23] Linear model: C = 0.05 score = -0.44904189928940963
[14:53:36] Linear model: C = 0.1 score = -0.4789966864522385
[14:53:36] Fitting Lvl_0_Pipe_0_Mod_0_LinearL2 finished. score = -0.4264683916927181
[14:53:36] Lvl_0_Pipe_0_Mod_0_LinearL2 fitting and predicting completed
[14:53:36] Time left 17046.53 secs
[14:58:02] Start fitting Lvl_0_Pipe_1_Mod_0_CatBoost ...
[14:58:02] ===== Start working with fold 0 for Lvl_0_Pipe_1_Mod_0_CatBoost =====
[14:58:02] 0: learn: 2.2636799 test: 2.2649651 best: 2.2649651 (0) total: 10.4ms remaining: 41.6s
[14:58:22] bestTest = 0.2436411292
[14:58:22] bestIteration = 3999
[14:58:23] ===== Start working with fold 1 for Lvl_0_Pipe_1_Mod_0_CatBoost =====
[14:58:23] 0: learn: 2.2634693 test: 2.2632523 best: 2.2632523 (0) total: 6.07ms remaining: 24.3s
[14:58:43] bestTest = 0.2658199756
[14:58:43] bestIteration = 3999
[14:58:43] ===== Start working with fold 2 for Lvl_0_Pipe_1_Mod_0_CatBoost =====
[14:58:44] 0: learn: 2.2631659 test: 2.2656305 best: 2.2656305 (0) total: 6.52ms remaining: 26.1s
[14:59:03] bestTest = 0.2753673959
[14:59:03] bestIteration = 3999
[14:59:04] ===== Start working with fold 3 for Lvl_0_Pipe_1_Mod_0_CatBoost =====
[14:59:04] 0: learn: 2.2645703 test: 2.2657044 best: 2.2657044 (0) total: 6.13ms remaining: 24.5s
[14:59:24] bestTest = 0.2738942971
[14:59:24] bestIteration = 3996
[14:59:24] Shrink model to first 3997 iterations.
[14:59:24] ===== Start working with fold 4 for Lvl_0_Pipe_1_Mod_0_CatBoost =====
[14:59:25] 0: learn: 2.2642798 test: 2.2644247 best: 2.2644247 (0) total: 5.95ms remaining: 23.8s
[14:59:44] bestTest = 0.2538460547
[14:59:44] bestIteration = 3999
[14:59:45] Fitting Lvl_0_Pipe_1_Mod_0_CatBoost finished. score = -0.2625123265864018
[14:59:45] Lvl_0_Pipe_1_Mod_0_CatBoost fitting and predicting completed
[14:59:45] Time left 16678.32 secs
[14:59:45] Layer 1 training completed.
[14:59:45] Blending: optimization starts with equal weights and score -0.2561708318332855
/home/dvladimirvasilyev/anaconda3/envs/myenv/lib/python3.8/site-packages/sklearn/metrics/_classification.py:2916: UserWarning: The y_pred values do not sum to one. Starting from 1.5 this will result in an error.
warnings.warn(
[14:59:45] Blending: iteration 0: score = -0.23692344794948073, weights = [0.19089036 0.8091096 ]
[14:59:46] Blending: iteration 1: score = -0.23692344794948073, weights = [0.19089036 0.8091096 ]
[14:59:46] Blending: no score update. Terminated
[14:59:46] Automl preset training completed in 1323.26 seconds
[14:59:46] Model description:
Final prediction for new objects (level 0) =
0.19089 * (5 averaged models Lvl_0_Pipe_0_Mod_0_LinearL2) +
0.80911 * (5 averaged models Lvl_0_Pipe_1_Mod_0_CatBoost)
CPU times: user 20min 56s, sys: 1min 25s, total: 22min 22s
Wall time: 22min 3s
[37]:
%%time
te_pred = automl.predict(submission)
print(f'Prediction for te_data:\n{te_pred}\nShape = {te_pred.shape}')
100%|██████████| 163/163 [01:16<00:00, 2.13it/s]
[15:01:03] Feature path transformed
Prediction for te_data:
array([[5.8534566e-02, 6.8576052e-03, 4.5334366e-01, ..., 1.5735241e-02,
4.2415738e-07, 2.8625556e-05],
[9.6386713e-01, 1.4697504e-03, 3.2047924e-02, ..., 2.0407902e-03,
6.7228694e-07, 1.4319470e-07],
[3.5120246e-01, 2.9431397e-01, 1.9644174e-01, ..., 2.2667376e-04,
6.4593733e-06, 5.9228983e-05],
...,
[2.3565248e-03, 2.7670001e-05, 1.2790265e-02, ..., 9.7831573e-05,
4.5524594e-08, 1.1057142e-07],
[1.4637065e-03, 9.7479615e-06, 1.1323140e-02, ..., 4.7557736e-05,
4.5515264e-08, 6.8441139e-08],
[3.5254466e-03, 2.2519611e-05, 5.6939691e-02, ..., 2.6386546e-04,
4.5536019e-08, 1.7701051e-07]], dtype=float32)
Shape = (20814, 10)
CPU times: user 13 s, sys: 3.3 s, total: 16.3 s
Wall time: 2min 7s
Our submission scores 0.95770 accuracy on the public leaderboard and 0.95276 on the private leaderboard (Alexander Ryzhkov's account).
Additional materials
Tutorial 9: Neural Networks
Official LightAutoML github repository is here
In this tutorial you will learn how to: * train neural networks (NN) with LightAutoML on tabular data * customize model architecture and pipelines
0. Prerequisites
0.0 install LightAutoML
[ ]:
# !pip install -U lightautoml[all]
0.1 Import libraries
Here we will import the libraries we use in this kernel: - Standard python libraries for timing, working with OS etc. - Essential python DS libraries like numpy, pandas, scikit-learn and torch (the last we will use in the next cell) - LightAutoML modules: presets for AutoML, task and report generation module
[2]:
# Standard python libraries
import os
# Essential DS libraries
import optuna
import requests
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
import torch
from copy import deepcopy as copy
import torch.nn as nn
from collections import OrderedDict
# LightAutoML presets, task and report generation
from lightautoml.automl.presets.tabular_presets import TabularAutoML
from lightautoml.tasks import Task
/home/dvladimirvasilyev/anaconda3/envs/myenv/lib/python3.8/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
from .autonotebook import tqdm as notebook_tqdm
'nlp' extra dependecy package 'gensim' isn't installed. Look at README.md in repo 'LightAutoML' for installation instructions.
'nlp' extra dependecy package 'nltk' isn't installed. Look at README.md in repo 'LightAutoML' for installation instructions.
'nlp' extra dependecy package 'transformers' isn't installed. Look at README.md in repo 'LightAutoML' for installation instructions.
'nlp' extra dependecy package 'gensim' isn't installed. Look at README.md in repo 'LightAutoML' for installation instructions.
'nlp' extra dependecy package 'nltk' isn't installed. Look at README.md in repo 'LightAutoML' for installation instructions.
'nlp' extra dependecy package 'transformers' isn't installed. Look at README.md in repo 'LightAutoML' for installation instructions.
/home/dvladimirvasilyev/LightAutoML/lightautoml/ml_algo/dl_model.py:41: UserWarning: 'transformers' - package isn't installed
warnings.warn("'transformers' - package isn't installed")
/home/dvladimirvasilyev/LightAutoML/lightautoml/text/nn_model.py:22: UserWarning: 'transformers' - package isn't installed
warnings.warn("'transformers' - package isn't installed")
/home/dvladimirvasilyev/LightAutoML/lightautoml/text/dl_transformers.py:25: UserWarning: 'transformers' - package isn't installed
warnings.warn("'transformers' - package isn't installed")
0.2 Constants
Here we set up the constants to use in the kernel:
- N_THREADS - number of vCPUs for LightAutoML model creation
- N_FOLDS - number of folds in LightAutoML inner CV
- RANDOM_STATE - random seed for better reproducibility
- TEST_SIZE - holdout data part size
- TIMEOUT - time limit in seconds for the model to train
- TARGET_NAME - target column name in the dataset
[3]:
N_THREADS = 4
N_FOLDS = 5
RANDOM_STATE = 42
TEST_SIZE = 0.2
TIMEOUT = 300
TARGET_NAME = 'TARGET'
np.random.seed(RANDOM_STATE)
torch.set_num_threads(N_THREADS)
0.3 Data loading
[4]:
DATASET_DIR = '../data/'
DATASET_NAME = 'sampled_app_train.csv'
DATASET_FULLNAME = os.path.join(DATASET_DIR, DATASET_NAME)
DATASET_URL = 'https://raw.githubusercontent.com/AILab-MLTools/LightAutoML/master/examples/data/sampled_app_train.csv'
if not os.path.exists(DATASET_FULLNAME):
os.makedirs(DATASET_DIR, exist_ok=True)
dataset = requests.get(DATASET_URL).text
with open(DATASET_FULLNAME, 'w') as output:
output.write(dataset)
data = pd.read_csv(DATASET_FULLNAME)
data.head()
tr_data, te_data = train_test_split(
data,
test_size=TEST_SIZE,
stratify=data[TARGET_NAME],
random_state=RANDOM_STATE
)
1. Available built-in models
To use a different model, pass its name to the list in "use_algos". We also support custom models inherited from the torch.nn.Module class. The parameters of every built-in model are listed below; a short sketch of how to pass them follows the list.
1.1 MLP ("mlp")
- hidden_size - defines the hidden layer dimensions
1.2 Dense Light ("denselight")
- hidden_size - defines the hidden layer dimensions
1.3 Dense ("dense")
- block_config - sets the number of blocks and the number of layers within each block
- compression - portion of neurons to drop after a DenseBlock
- growth_size - output dim of every DenseLayer
- bn_factor - the size of the intermediate fc layer is increased by this factor
1.4 ResNet ("resnet")
- hid_factor - the size of the intermediate fc layer is increased by this factor
1.5 SNN ("snn")
- hidden_size - defines the hidden layer dimensions
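For instance, the parameters of the "dense" model listed above would be overridden through nn_params, roughly like this (an illustrative sketch only; the values are arbitrary, not recommendations):
nn_params = {
    "block_config": [2, 2],  # two DenseBlocks with two DenseLayers each
    "compression": 0.5,      # drop half of the neurons after each DenseBlock
    "growth_size": 256,      # output dim of every DenseLayer
    "bn_factor": 2,          # widen the intermediate fc layer by this factor
}
# and then, for example: TabularAutoML(task=Task('binary'), general_params={"use_algos": [["dense"]]}, nn_params=nn_params)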
2. Example of usage
2.1 Task definition
[5]:
task = Task('binary')
roles = {
'target': TARGET_NAME,
'drop': ['SK_ID_CURR']
}
2.2 LightAutoML model creation - TabularAutoML preset with neural network
In the next cell we are going to create a LightAutoML model with the TabularAutoML class in just several lines. Let’s discuss the params we can set up:
- task - the type of the ML task (the only required parameter)
- timeout - time limit in seconds for the model to train
- cpu_limit - vCPU count for the model to use
- nn_params - network and training params, for example "hidden_size", "bs" (batch size), "lr", etc.
- nn_pipeline_params - data preprocessing params, which affect how data is fed to the model: embeddings or target encoding for categorical columns, standard scaler or quantile transformer for numerical columns
- reader_params - parameters of the Reader object inside the preset, which works on the first step of data preparation: automatic feature typing, preliminary dropping of almost-constant features, correct CV setup, etc.
[5]:
automl = TabularAutoML(
task = task,
timeout = TIMEOUT,
cpu_limit = N_THREADS,
general_params = {"use_algos": [["mlp"]]}, # ['nn', 'mlp', 'dense', 'denselight', 'resnet', 'snn'] or custom torch model
nn_params = {"n_epochs": 10, "bs": 512, "num_workers": 0, "path_to_save": None, "freeze_defaults": True},
nn_pipeline_params = {"use_qnt": True, "use_te": False},
reader_params = {'n_jobs': N_THREADS, 'cv': N_FOLDS, 'random_state': RANDOM_STATE}
)
2.3 AutoML training
To run AutoML training, use the fit_predict method with:
- train_data - dataset to train on
- roles - roles dict
- verbose - controls the verbosity: the higher, the more messages. <1: messages are not displayed; >=1: the computation process for layers is displayed; >=2: the information about folds processing is also displayed; >=3: the hyperparameters optimization process is also displayed; >=4: the training process for every algorithm is displayed
Note: the out-of-fold prediction is calculated during training and returned from the fit_predict method.
[6]:
%%time
oof_pred = automl.fit_predict(tr_data, roles = roles, verbose = 1)
[11:35:05] Stdout logging level is INFO.
[11:35:05] Copying TaskTimer may affect the parent PipelineTimer, so copy will create new unlimited TaskTimer
[11:35:05] Task: binary
[11:35:05] Start automl preset with listed constraints:
[11:35:05] - time: 300.00 seconds
[11:35:05] - CPU: 4 cores
[11:35:05] - memory: 16 GB
[11:35:05] Train data shape: (8000, 122)
[11:35:07] Layer 1 train process start. Time left 297.35 secs
[11:35:08] Start fitting Lvl_0_Pipe_0_Mod_0_TorchNN_mlp_0 ...
[11:35:20] Fitting Lvl_0_Pipe_0_Mod_0_TorchNN_mlp_0 finished. score = 0.6951557493612979
[11:35:20] Lvl_0_Pipe_0_Mod_0_TorchNN_mlp_0 fitting and predicting completed
[11:35:20] Time left 284.31 secs
[11:35:20] Layer 1 training completed.
[11:35:20] Automl preset training completed in 15.70 seconds
[11:35:20] Model description:
Final prediction for new objects (level 0) =
1.00000 * (5 averaged models Lvl_0_Pipe_0_Mod_0_TorchNN_mlp_0)
CPU times: user 15.8 s, sys: 1.58 s, total: 17.4 s
Wall time: 15.7 s
2.4 Prediction on holdout and model evaluation
[7]:
%%time
te_pred = automl.predict(te_data)
print(f'Prediction for te_data:\n{te_pred}\nShape = {te_pred.shape}')
Prediction for te_data:
array([[0.08216639],
[0.08314921],
[0.07000729],
...,
[0.07061756],
[0.09196799],
[0.16275021]], dtype=float32)
Shape = (2000, 1)
CPU times: user 1.07 s, sys: 30.4 ms, total: 1.1 s
Wall time: 1 s
[8]:
print(f'OOF score: {roc_auc_score(tr_data[TARGET_NAME].values, oof_pred.data[:, 0])}')
print(f'HOLDOUT score: {roc_auc_score(te_data[TARGET_NAME].values, te_pred.data[:, 0])}')
OOF score: 0.6951557493612979
HOLDOUT score: 0.7132812500000001
You can obtain the description of the resulting pipeline:
[9]:
print(automl.create_model_str_desc())
Final prediction for new objects (level 0) =
1.00000 * (5 averaged models Lvl_0_Pipe_0_Mod_0_TorchNN_mlp_0)
3. Main training loop and pipeline params
3.1 Training loop params
- bs - batch size
- snap_params - early stopping and checkpoint averaging params, stochastic weight averaging (SWA)
- opt - optimizer (e.g. Adam)
- opt_params - optimizer params (e.g. lr, weight_decay)
- clip_grad - use gradient clipping for regularization
- clip_grad_params - gradient clipping params
- emb_dropout - embedding dropout for categorical columns
This set of params should be passed in nn_params as well.
3.2 Pipeline params
Transformation for numerical columns:
- use_qnt - use quantile transformation for numerical columns
- output_distribution - type of distribution of a feature after the quantile transformer
- n_quantiles - number of quantiles used to build the feature distribution
- qnt_factor - decreases n_quantiles depending on the train data shape
Transformation for categorical columns:
- use_te - use target encoding
- top_intersections - number of intersections of categorical columns to use
The full list of default parameters can be found here: nn_params, nn_pipeline_params. A sketch that puts these params together is shown below.
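Putting the params from this section together, a TabularAutoML config could look roughly like this (a minimal sketch; all values are illustrative, and clip_grad_params is left empty because its exact keys are not listed here):
automl = TabularAutoML(
    task = task,
    timeout = TIMEOUT,
    cpu_limit = N_THREADS,
    general_params = {"use_algos": [["mlp"]]},
    nn_params = {
        "n_epochs": 10, "bs": 256, "num_workers": 0, "path_to_save": None,
        "opt": "Adam", "opt_params": {"lr": 1e-3, "weight_decay": 0},                  # optimizer and its params
        "snap_params": {"k": 3, "early_stopping": True, "patience": 10, "swa": True},  # checkpoint averaging and early stopping
        "clip_grad": False, "clip_grad_params": {},                                    # enable and fill to use gradient clipping
        "emb_dropout": 0.1,                                                            # embedding dropout for categorical columns
    },
    nn_pipeline_params = {"use_qnt": True, "n_quantiles": 256, "use_te": False},
    reader_params = {'n_jobs': N_THREADS, 'cv': N_FOLDS, 'random_state': RANDOM_STATE}
)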
4. More use cases
Let’s store the default LAMA params once to keep the following examples more compact.
[5]:
default_lama_params = {
"task": task,
"timeout": TIMEOUT,
"cpu_limit": N_THREADS,
"reader_params": {'n_jobs': N_THREADS, 'cv': N_FOLDS, 'random_state': RANDOM_STATE}
}
default_nn_params = {
"bs": 512, "num_workers": 0, "path_to_save": None, "n_epochs": 10, "freeze_defaults": True
}
4.1 Custom model
Consider a simple neural network that we want to train.
[11]:
class SimpleNet(nn.Module):
def __init__(
self,
n_in,
n_out,
hidden_size,
drop_rate,
**kwargs, # kwargs must be accepted to absorb unused parameters
):
super(SimpleNet, self).__init__()
self.features = nn.Sequential(OrderedDict([]))
self.features.add_module("norm", nn.BatchNorm1d(n_in))
self.features.add_module("dense1", nn.Linear(n_in, hidden_size))
self.features.add_module("act", nn.SiLU())
self.features.add_module("dropout", nn.Dropout(p=drop_rate))
self.features.add_module("dense2", nn.Linear(hidden_size, n_out))
def forward(self, x):
"""
Args:
x: data after feature pipeline transformation
(by default concatenation of columns)
"""
for layer in self.features:
x = layer(x)
return x
[12]:
automl = TabularAutoML(
**default_lama_params,
general_params={"use_algos": [[SimpleNet]]},
nn_params={
**default_nn_params,
"hidden_size": 256,
"drop_rate": 0.1
},
)
automl.fit_predict(tr_data, roles=roles, verbose=1)
[11:39:19] Stdout logging level is INFO.
[11:39:19] Task: binary
[11:39:19] Start automl preset with listed constraints:
[11:39:19] - time: 300.00 seconds
[11:39:19] - CPU: 4 cores
[11:39:19] - memory: 16 GB
[11:39:19] Train data shape: (8000, 122)
[11:39:20] Layer 1 train process start. Time left 299.14 secs
[11:39:20] Start fitting Lvl_0_Pipe_0_Mod_0_TorchNN_0 ...
[11:39:29] Fitting Lvl_0_Pipe_0_Mod_0_TorchNN_0 finished. score = 0.7060418025974987
[11:39:29] Lvl_0_Pipe_0_Mod_0_TorchNN_0 fitting and predicting completed
[11:39:29] Time left 290.88 secs
[11:39:29] Layer 1 training completed.
[11:39:29] Automl preset training completed in 9.12 seconds
[11:39:29] Model description:
Final prediction for new objects (level 0) =
1.00000 * (5 averaged models Lvl_0_Pipe_0_Mod_0_TorchNN_0)
[12]:
array([[0.02449569],
[0.03754642],
[0.04070117],
...,
[0.06268083],
[0.19106267],
[0.13282676]], dtype=float32)
4.1.1 Define the pipeline by yourself
[13]:
from typing import Sequence
from typing import Dict
from typing import Optional
from typing import Any
from typing import Callable
from typing import Union
class CatEmbedder(nn.Module):
"""Category data model.
Args:
cat_dims: Sequence with number of unique categories
for category features
"""
def __init__(
self,
cat_dims: Sequence[int],
**kwargs
):
super(CatEmbedder, self).__init__()
emb_dims = [
(int(x), 5)
for x in cat_dims
]
self.no_of_embs = sum([y for x, y in emb_dims])
self.emb_layers = nn.ModuleList([nn.Embedding(x, y) for x, y in emb_dims])
def get_out_shape(self) -> int:
"""Output shape.
Returns:
Int with module output shape.
"""
return self.no_of_embs
def forward(self, inp: Dict[str, torch.Tensor]) -> torch.Tensor:
"""Concat all categorical embeddings
"""
output = torch.cat(
[
emb_layer(inp["cat"][:, i])
for i, emb_layer in enumerate(self.emb_layers)
],
dim=1,
)
return output
class ContEmbedder(nn.Module):
"""Numeric data model.
Class for working with numeric data.
Args:
num_dims: Sequence with number of numeric features.
input_bn: Use 1d batch norm for input data.
"""
def __init__(self, num_dims: int, **kwargs):
super(ContEmbedder, self).__init__()
self.n_out = num_dims
def get_out_shape(self) -> int:
"""Output shape.
Returns:
int with module output shape.
"""
return self.n_out
def forward(self, inp: Dict[str, torch.Tensor]) -> torch.Tensor:
"""Forward-pass."""
return (inp["cont"] - inp["cont"].mean(axis=0)) / (inp["cont"].std(axis=0) + 1e-6)
[14]:
from lightautoml.text.nn_model import TorchUniversalModel
class SimpleNet_plus(TorchUniversalModel):
"""Mixed data model.
Class for preparing input for DL model with mixed data.
Args:
n_out: Number of output dimensions.
cont_params: Dict with numeric model params.
cat_params: Dict with category model params.
**kwargs: Loss, task and other parameters.
"""
def __init__(
self,
n_out: int = 1,
cont_params: Optional[Dict] = None,
cat_params: Optional[Dict] = None,
**kwargs,
):
# init parent class (need some helper functions to be used)
super(SimpleNet_plus, self).__init__(**{
**kwargs,
"cont_params": cont_params,
"cat_params": cat_params,
"torch_model": None, # dont need any model inside parent class
})
n_in = 0
# add cont columns processing
self.cont_embedder = ContEmbedder(**cont_params)
n_in += self.cont_embedder.get_out_shape()
# add cat columns processing
self.cat_embedder = CatEmbedder(**cat_params)
n_in += self.cat_embedder.get_out_shape()
self.torch_model = SimpleNet(
**{
**kwargs,
**{"n_in": n_in, "n_out": n_out},
}
)
def get_logits(self, inp: Dict[str, torch.Tensor]) -> torch.Tensor:
outputs = []
outputs.append(self.cont_embedder(inp))
outputs.append(self.cat_embedder(inp))
if len(outputs) > 1:
output = torch.cat(outputs, dim=1)
else:
output = outputs[0]
logits = self.torch_model(output)
return logits
[15]:
automl = TabularAutoML(
**default_lama_params,
general_params={"use_algos": [[SimpleNet_plus]]},
nn_params={
**default_nn_params,
"hidden_size": 256,
"drop_rate": 0.1,
"model_with_emb": True,
},
debug=True
)
automl.fit_predict(tr_data, roles = roles, verbose = 1)
[11:39:33] Stdout logging level is INFO.
[11:39:33] Task: binary
[11:39:33] Start automl preset with listed constraints:
[11:39:33] - time: 300.00 seconds
[11:39:33] - CPU: 4 cores
[11:39:33] - memory: 16 GB
[11:39:33] Train data shape: (8000, 122)
[11:39:34] Layer 1 train process start. Time left 299.14 secs
[11:39:34] Start fitting Lvl_0_Pipe_0_Mod_0_TorchNN_0 ...
[11:39:42] Fitting Lvl_0_Pipe_0_Mod_0_TorchNN_0 finished. score = 0.680797945608108
[11:39:42] Lvl_0_Pipe_0_Mod_0_TorchNN_0 fitting and predicting completed
[11:39:42] Time left 290.91 secs
[11:39:42] Layer 1 training completed.
[11:39:42] Automl preset training completed in 9.10 seconds
[11:39:42] Model description:
Final prediction for new objects (level 0) =
1.00000 * (5 averaged models Lvl_0_Pipe_0_Mod_0_TorchNN_0)
[15]:
array([[0.06662331],
[0.05009553],
[0.05109952],
...,
[0.07657926],
[0.19059831],
[0.04237348]], dtype=float32)
4.2 Tuning network
One can try to optimize the metric with the help of Optuna. The available validation strategies are (see the sketch after this list for the cross-validation option):
- fit_on_holdout = True - holdout
- fit_on_holdout = False - cross-validation
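For example, to tune on cross-validation instead of a holdout split, fit_on_holdout can be set to False inside tuning_params (a minimal sketch mirroring the cells below):
nn_params = {
    **default_nn_params,
    "tuning_params": {"max_tuning_iter": 5, "max_tuning_time": 100, "fit_on_holdout": False},  # tune on CV
}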
4.2.1 Built-in models
Use "_tuned"
in model name to tune it.
[17]:
automl = TabularAutoML(
**default_lama_params,
general_params={"use_algos": [["denselight_tuned"]]},
nn_params={
**default_nn_params,
"n_epochs": 3,
"tuning_params": {
"max_tuning_iter": 5,
"max_tuning_time": 100,
"fit_on_holdout": True
}
},
)
automl.fit_predict(tr_data, roles = roles, verbose = 3)
[11:41:13] Stdout logging level is INFO3.
[11:41:13] Task: binary
[11:41:13] Start automl preset with listed constraints:
[11:41:13] - time: 300.00 seconds
[11:41:13] - CPU: 4 cores
[11:41:13] - memory: 16 GB
[11:41:13] Train data shape: (8000, 122)
[11:41:14] Feats was rejected during automatic roles guess: []
[11:41:14] Layer 1 train process start. Time left 299.15 secs
[11:41:14] Start hyperparameters optimization for Lvl_0_Pipe_0_Mod_0_Tuned_TorchNN_denselight_tuned_0 ... Time budget is 100.00 secs
[11:41:15] Epoch: 0, train loss: 1.0535998344421387, val loss: 0.32914862036705017, val metric: 0.6417082284266402
[11:41:16] Epoch: 1, train loss: 0.2719154357910156, val loss: 0.29687464237213135, val metric: 0.7061383111225151
[11:41:16] Epoch: 2, train loss: 0.2606324255466461, val loss: 0.26732537150382996, val metric: 0.7064643905255223
[11:41:16] Early stopping: val loss: 0.27287718653678894, val metric: 0.7066167390990586
[11:41:16] Trial 1 with hyperparameters {'bs': 128, 'weight_decay_bin': 0, 'lr': 0.029154431891537533} scored 0.7066167390990586 in 0:00:01.938430
[11:41:17] Epoch: 0, train loss: 0.28143310546875, val loss: 0.30991676449775696, val metric: 0.6123530638099972
[11:41:17] Epoch: 1, train loss: 0.27844926714897156, val loss: 0.3095921576023102, val metric: 0.6240812311902967
[11:41:17] Epoch: 2, train loss: 0.27550050616264343, val loss: 0.3089315891265869, val metric: 0.629087351861058
[11:41:17] Early stopping: val loss: 0.3095770478248596, val metric: 0.6233248338865992
[11:41:17] Trial 2 with hyperparameters {'bs': 512, 'weight_decay_bin': 0, 'lr': 5.415244119402538e-05} scored 0.6233248338865992 in 0:00:00.755362
[11:41:17] Epoch: 0, train loss: 0.2861214578151703, val loss: 0.2767995595932007, val metric: 0.6338181759866575
[11:41:18] Epoch: 1, train loss: 0.2688417136669159, val loss: 0.26842930912971497, val metric: 0.6981253107109064
[11:41:18] Epoch: 2, train loss: 0.24808287620544434, val loss: 0.2549731731414795, val metric: 0.7318772017041658
[11:41:18] Early stopping: val loss: 0.2691458761692047, val metric: 0.7062398768382059
[11:41:18] Trial 3 with hyperparameters {'bs': 1024, 'weight_decay_bin': 1, 'weight_decay': 2.9204338471814107e-05, 'lr': 0.0006672367170464204} scored 0.7062398768382059 in 0:00:00.636955
[11:41:19] Epoch: 0, train loss: 0.2786088287830353, val loss: 0.27673330903053284, val metric: 0.6143309224839767
[11:41:20] Epoch: 1, train loss: 0.2777416706085205, val loss: 0.27581408619880676, val metric: 0.6298918592406093
[11:41:21] Epoch: 2, train loss: 0.27608099579811096, val loss: 0.2738899886608124, val metric: 0.6364695757225867
[11:41:21] Early stopping: val loss: 0.2757117748260498, val metric: 0.6301484463118281
[11:41:21] Trial 4 with hyperparameters {'bs': 64, 'weight_decay_bin': 0, 'lr': 1.8205657658407255e-05} scored 0.6301484463118281 in 0:00:03.412576
[11:41:22] Epoch: 0, train loss: 0.27859607338905334, val loss: 0.2825257182121277, val metric: 0.6120082749330469
[11:41:22] Epoch: 1, train loss: 0.27754485607147217, val loss: 0.2815532982349396, val metric: 0.6283122450834175
[11:41:23] Epoch: 2, train loss: 0.27565139532089233, val loss: 0.2794336676597595, val metric: 0.6365551047463264
[11:41:23] Early stopping: val loss: 0.28141137957572937, val metric: 0.6296112171314634
[11:41:23] Trial 5 with hyperparameters {'bs': 128, 'weight_decay_bin': 0, 'lr': 3.077180271250682e-05} scored 0.6296112171314634 in 0:00:01.886276
[11:41:23] Hyperparameters optimization for Lvl_0_Pipe_0_Mod_0_Tuned_TorchNN_denselight_tuned_0 completed
[11:41:23] The set of hyperparameters {'num_workers': 0, 'pin_memory': False, 'max_length': 256, 'is_snap': False, 'input_bn': False, 'max_emb_size': 256, 'bert_name': None, 'pooling': 'cls', 'device': ['0', '1'], 'use_cont': True, 'use_cat': True, 'use_text': False, 'lang': 'en', 'deterministic': True, 'multigpu': False, 'random_state': 42, 'model': 'denselight', 'model_with_emb': False, 'path_to_save': None, 'verbose_inside': None, 'verbose': 1, 'n_epochs': 3, 'snap_params': {'k': 3, 'early_stopping': True, 'patience': 10, 'swa': True}, 'bs': 128, 'emb_dropout': 0.1, 'emb_ratio': 3, 'opt': 'Adam', 'opt_params': {'lr': 0.029154431891537533, 'weight_decay': 0}, 'sch': 'ReduceLROnPlateau', 'scheduler_params': {'patience': 5, 'factor': 0.5, 'min_lr': 1e-05}, 'loss': None, 'loss_params': {}, 'loss_on_logits': True, 'clip_grad': False, 'clip_grad_params': {}, 'init_bias': True, 'dataset': 'UniversalDataset', 'tuned': False, 'optimization_search_space': None, 'verbose_bar': False, 'freeze_defaults': True, 'n_out': None, 'hid_factor': [2, 2], 'hidden_size': [512, 512, 512], 'block_config': [2, 2], 'compression': 0.5, 'growth_size': 256, 'bn_factor': 2, 'drop_rate': 0.1, 'noise_std': 0.05, 'num_init_features': None, 'act_fun': 'ReLU', 'use_noise': False, 'use_bn': True, 'stop_by_metric': False, 'tuning_params': {'fit_on_holdout': True, 'max_tuning_iter': 5, 'max_tuning_time': 100}}
achieve 0.7066 auc
[11:41:23] Start fitting Lvl_0_Pipe_0_Mod_0_Tuned_TorchNN_denselight_tuned_0 ...
[11:41:23] ===== Start working with fold 0 for Lvl_0_Pipe_0_Mod_0_Tuned_TorchNN_denselight_tuned_0 =====
[11:41:24] Epoch: 0, train loss: 1.0535998344421387, val loss: 0.32914862036705017, val metric: 0.6417082284266402
[11:41:24] Epoch: 1, train loss: 0.2719154357910156, val loss: 0.29687464237213135, val metric: 0.7061383111225151
[11:41:25] Epoch: 2, train loss: 0.2606324255466461, val loss: 0.26732537150382996, val metric: 0.7064643905255223
[11:41:25] Early stopping: val loss: 0.27287718653678894, val metric: 0.7066167390990586
[11:41:25] ===== Start working with fold 1 for Lvl_0_Pipe_0_Mod_0_Tuned_TorchNN_denselight_tuned_0 =====
[11:41:26] Epoch: 0, train loss: 1.9826449155807495, val loss: 0.2658182680606842, val metric: 0.6983562967051631
[11:41:26] Epoch: 1, train loss: 0.27176782488822937, val loss: 0.2550528347492218, val metric: 0.7206712805706522
[11:41:27] Epoch: 2, train loss: 0.25288328528404236, val loss: 0.25439774990081787, val metric: 0.7291525135869564
[11:41:27] Early stopping: val loss: 0.25335100293159485, val metric: 0.7309198794157609
[11:41:27] ===== Start working with fold 2 for Lvl_0_Pipe_0_Mod_0_Tuned_TorchNN_denselight_tuned_0 =====
[11:41:27] Epoch: 0, train loss: 2.025099277496338, val loss: 0.3039107024669647, val metric: 0.5462911854619565
[11:41:28] Epoch: 1, train loss: 0.2893942892551422, val loss: 0.28194475173950195, val metric: 0.648352581521739
[11:41:28] Epoch: 2, train loss: 0.25209811329841614, val loss: 0.2732546031475067, val metric: 0.6559634001358696
[11:41:29] Early stopping: val loss: 0.2753060460090637, val metric: 0.6604216202445652
[11:41:29] ===== Start working with fold 3 for Lvl_0_Pipe_0_Mod_0_Tuned_TorchNN_denselight_tuned_0 =====
[11:41:29] Epoch: 0, train loss: 1.485915184020996, val loss: 0.3707677721977234, val metric: 0.590682319972826
[11:41:30] Epoch: 1, train loss: 0.27479660511016846, val loss: 0.283542662858963, val metric: 0.6689877717391306
[11:41:30] Epoch: 2, train loss: 0.27139508724212646, val loss: 0.2609306275844574, val metric: 0.7076362941576086
[11:41:30] Early stopping: val loss: 0.26281988620758057, val metric: 0.6873938519021738
[11:41:31] ===== Start working with fold 4 for Lvl_0_Pipe_0_Mod_0_Tuned_TorchNN_denselight_tuned_0 =====
[11:41:31] Epoch: 0, train loss: 1.2433826923370361, val loss: 0.2741888165473938, val metric: 0.6601270592730978
[11:41:32] Epoch: 1, train loss: 0.28257879614830017, val loss: 0.2676794230937958, val metric: 0.6493397588315217
[11:41:32] Epoch: 2, train loss: 0.26426026225090027, val loss: 0.2751479148864746, val metric: 0.6779360563858697
[11:41:32] Early stopping: val loss: 0.2678437829017639, val metric: 0.656356148097826
[11:41:32] Fitting Lvl_0_Pipe_0_Mod_0_Tuned_TorchNN_denselight_tuned_0 finished. score = 0.682772889051315
[11:41:32] Lvl_0_Pipe_0_Mod_0_Tuned_TorchNN_denselight_tuned_0 fitting and predicting completed
[11:41:32] Time left 280.97 secs
[11:41:32] Layer 1 training completed.
[11:41:32] Automl preset training completed in 19.03 seconds
[11:41:32] Model description:
Final prediction for new objects (level 0) =
1.00000 * (5 averaged models Lvl_0_Pipe_0_Mod_0_Tuned_TorchNN_denselight_tuned_0)
[17]:
array([[0.00909923],
[0.06779448],
[0.05014049],
...,
[0.04888163],
[0.18241519],
[0.07331596]], dtype=float32)
4.2.2 Custom model
There is a special flag tuned that marks that the model's parameters should be optimized.
[18]:
automl = TabularAutoML(
**default_lama_params,
general_params={"use_algos": [[SimpleNet]]},
nn_params={
**default_nn_params,
"hidden_size": 256,
"drop_rate": 0.1,
"tuned": True,
"tuning_params": {
"max_tuning_iter": 5,
"max_tuning_time": 100,
"fit_on_holdout": True
}
},
)
automl.fit_predict(tr_data, roles = roles, verbose = 2)
[11:41:56] Stdout logging level is INFO2.
[11:41:56] Task: binary
[11:41:56] Start automl preset with listed constraints:
[11:41:56] - time: 300.00 seconds
[11:41:56] - CPU: 4 cores
[11:41:56] - memory: 16 GB
[11:41:56] Train data shape: (8000, 122)
[11:41:57] Layer 1 train process start. Time left 299.16 secs
[11:41:57] Start hyperparameters optimization for Lvl_0_Pipe_0_Mod_0_Tuned_TorchNN_0 ... Time budget is 100.00 secs
[11:42:17] Hyperparameters optimization for Lvl_0_Pipe_0_Mod_0_Tuned_TorchNN_0 completed
[11:42:17] The set of hyperparameters {'num_workers': 0, 'pin_memory': False, 'max_length': 256, 'is_snap': False, 'input_bn': False, 'max_emb_size': 256, 'bert_name': None, 'pooling': 'cls', 'device': ['0', '1'], 'use_cont': True, 'use_cat': True, 'use_text': False, 'lang': 'en', 'deterministic': True, 'multigpu': False, 'random_state': 42, 'model': <class '__main__.SimpleNet'>, 'model_with_emb': False, 'path_to_save': None, 'verbose_inside': None, 'verbose': 1, 'n_epochs': 10, 'snap_params': {'k': 3, 'early_stopping': True, 'patience': 10, 'swa': True}, 'bs': 128, 'emb_dropout': 0.1, 'emb_ratio': 3, 'opt': 'Adam', 'opt_params': {'lr': 0.029154431891537533, 'weight_decay': 0}, 'sch': 'ReduceLROnPlateau', 'scheduler_params': {'patience': 5, 'factor': 0.5, 'min_lr': 1e-05}, 'loss': None, 'loss_params': {}, 'loss_on_logits': True, 'clip_grad': False, 'clip_grad_params': {}, 'init_bias': True, 'dataset': 'UniversalDataset', 'tuned': True, 'optimization_search_space': None, 'verbose_bar': False, 'freeze_defaults': True, 'n_out': None, 'hid_factor': [2, 2], 'hidden_size': 256, 'block_config': [2, 2], 'compression': 0.5, 'growth_size': 256, 'bn_factor': 2, 'drop_rate': 0.1, 'noise_std': 0.05, 'num_init_features': None, 'act_fun': 'ReLU', 'use_noise': False, 'use_bn': True, 'stop_by_metric': False, 'tuning_params': {'fit_on_holdout': True, 'max_tuning_iter': 5, 'max_tuning_time': 100}}
achieve 0.7581 auc
[11:42:17] Start fitting Lvl_0_Pipe_0_Mod_0_Tuned_TorchNN_0 ...
[11:42:17] ===== Start working with fold 0 for Lvl_0_Pipe_0_Mod_0_Tuned_TorchNN_0 =====
[11:42:21] ===== Start working with fold 1 for Lvl_0_Pipe_0_Mod_0_Tuned_TorchNN_0 =====
[11:42:26] ===== Start working with fold 2 for Lvl_0_Pipe_0_Mod_0_Tuned_TorchNN_0 =====
[11:42:30] ===== Start working with fold 3 for Lvl_0_Pipe_0_Mod_0_Tuned_TorchNN_0 =====
[11:42:34] ===== Start working with fold 4 for Lvl_0_Pipe_0_Mod_0_Tuned_TorchNN_0 =====
[11:42:38] Fitting Lvl_0_Pipe_0_Mod_0_Tuned_TorchNN_0 finished. score = 0.7200461596125074
[11:42:38] Lvl_0_Pipe_0_Mod_0_Tuned_TorchNN_0 fitting and predicting completed
[11:42:38] Time left 258.19 secs
[11:42:38] Layer 1 training completed.
[11:42:38] Automl preset training completed in 41.82 seconds
[11:42:38] Model description:
Final prediction for new objects (level 0) =
1.00000 * (5 averaged models Lvl_0_Pipe_0_Mod_0_Tuned_TorchNN_0)
[18]:
array([[0.01359685],
[0.02897443],
[0.01692689],
...,
[0.04529661],
[0.17770922],
[0.17924136]], dtype=float32)
You can also pass optimization_search_space, which describes the necessary parameter grid. See the example below, which searches:
- bs in [64, 128, 256, 512, 1024]
- hidden_size in [64, 128, 256, 512, 1024]
- drop_rate in [0.0, 0.3]
[19]:
def my_opt_space(trial: optuna.trial.Trial, estimated_n_trials, suggested_params):
'''
This function is needed for parameter tuning.
'''
# copy the suggested params and override only what we want to tune
trial_values = copy(suggested_params)
trial_values["bs"] = trial.suggest_categorical(
"bs", [2 ** i for i in range(6, 11)]
)
trial_values["hidden_size"] = trial.suggest_categorical(
"hidden_size", [2 ** i for i in range(6, 11)]
)
trial_values["drop_rate"] = trial.suggest_float(
"drop_rate", 0.0, 0.3
)
return trial_values
[20]:
automl = TabularAutoML(
**default_lama_params,
general_params={"use_algos": [[SimpleNet]]},
nn_params={
**default_nn_params,
"n_epochs": 3,
"tuned": True,
"tuning_params": {
"max_tuning_iter": 5,
"max_tuning_time": 3600,
"fit_on_holdout": True
},
"optimization_search_space": my_opt_space,
},
)
automl.fit_predict(tr_data, roles = roles, verbose = 3)
[11:42:39] Stdout logging level is INFO3.
[11:42:39] Task: binary
[11:42:39] Start automl preset with listed constraints:
[11:42:39] - time: 300.00 seconds
[11:42:39] - CPU: 4 cores
[11:42:39] - memory: 16 GB
[11:42:39] Train data shape: (8000, 122)
[11:42:39] Feats was rejected during automatic roles guess: []
[11:42:40] Layer 1 train process start. Time left 299.16 secs
[11:42:40] Start hyperparameters optimization for Lvl_0_Pipe_0_Mod_0_Tuned_TorchNN_0 ... Time budget is 156.93 secs
[11:42:40] Epoch: 0, train loss: 0.2768203020095825, val loss: 0.27835753560066223, val metric: 0.6358815636843766
[11:42:41] Epoch: 1, train loss: 0.269419401884079, val loss: 0.270462304353714, val metric: 0.6819656707880963
[11:42:41] Epoch: 2, train loss: 0.2603665292263031, val loss: 0.26111170649528503, val metric: 0.7290280161008387
[11:42:41] Early stopping: val loss: 0.2704150676727295, val metric: 0.6917587440062863
[11:42:41] Trial 1 with hyperparameters {'bs': 128, 'hidden_size': 256, 'drop_rate': 0.006175348288740734} scored 0.6917587440062863 in 0:00:01.446466
[11:42:42] Epoch: 0, train loss: 0.27193132042884827, val loss: 0.2582079768180847, val metric: 0.7192162334087058
[11:42:43] Epoch: 1, train loss: 0.25488120317459106, val loss: 0.2459825575351715, val metric: 0.763859711018811
[11:42:43] Epoch: 2, train loss: 0.2454848736524582, val loss: 0.24529244005680084, val metric: 0.7586050216228063
[11:42:44] Early stopping: val loss: 0.24799665808677673, val metric: 0.7566271629488269
[11:42:44] Trial 2 with hyperparameters {'bs': 64, 'hidden_size': 1024, 'drop_rate': 0.04184815819561255} scored 0.7566271629488269 in 0:00:02.566996
[11:42:44] Epoch: 0, train loss: 0.2784821093082428, val loss: 0.30797627568244934, val metric: 0.6284539025289864
[11:42:44] Epoch: 1, train loss: 0.27420976758003235, val loss: 0.30563732981681824, val metric: 0.6298303852547963
[11:42:44] Epoch: 2, train loss: 0.26723629236221313, val loss: 0.3022131323814392, val metric: 0.6563443826140877
[11:42:44] Early stopping: val loss: 0.30519333481788635, val metric: 0.6395967306530677
[11:42:44] Trial 3 with hyperparameters {'bs': 512, 'hidden_size': 512, 'drop_rate': 0.019515477895583853} scored 0.6395967306530677 in 0:00:00.620905
[11:42:45] Epoch: 0, train loss: 0.2780534625053406, val loss: 0.2811879515647888, val metric: 0.6194653366903475
[11:42:45] Epoch: 1, train loss: 0.2749352753162384, val loss: 0.27780336141586304, val metric: 0.6365337224903913
[11:42:46] Epoch: 2, train loss: 0.2703269124031067, val loss: 0.2734648585319519, val metric: 0.6630851387975689
[11:42:46] Early stopping: val loss: 0.27777808904647827, val metric: 0.641825830834282
[11:42:46] Trial 4 with hyperparameters {'bs': 128, 'hidden_size': 64, 'drop_rate': 0.2727961206236346} scored 0.641825830834282 in 0:00:01.440003
[11:42:46] Epoch: 0, train loss: 0.27747318148612976, val loss: 0.2802681624889374, val metric: 0.6187169577326256
[11:42:47] Epoch: 1, train loss: 0.27283716201782227, val loss: 0.27530962228775024, val metric: 0.6516670141283256
[11:42:47] Epoch: 2, train loss: 0.2660568654537201, val loss: 0.268929660320282, val metric: 0.6927262910873412
[11:42:47] Early stopping: val loss: 0.2752225697040558, val metric: 0.6592897883691219
[11:42:47] Trial 5 with hyperparameters {'bs': 128, 'hidden_size': 128, 'drop_rate': 0.17936999364332554} scored 0.6592897883691219 in 0:00:01.443614
[11:42:47] Hyperparameters optimization for Lvl_0_Pipe_0_Mod_0_Tuned_TorchNN_0 completed
[11:42:47] The set of hyperparameters {'num_workers': 0, 'pin_memory': False, 'max_length': 256, 'is_snap': False, 'input_bn': False, 'max_emb_size': 256, 'bert_name': None, 'pooling': 'cls', 'device': ['0', '1'], 'use_cont': True, 'use_cat': True, 'use_text': False, 'lang': 'en', 'deterministic': True, 'multigpu': False, 'random_state': 42, 'model': <class '__main__.SimpleNet'>, 'model_with_emb': False, 'path_to_save': None, 'verbose_inside': None, 'verbose': 1, 'n_epochs': 3, 'snap_params': {'k': 3, 'early_stopping': True, 'patience': 10, 'swa': True}, 'bs': 64, 'emb_dropout': 0.1, 'emb_ratio': 3, 'opt': 'Adam', 'opt_params': {'lr': 0.0003, 'weight_decay': 0}, 'sch': 'ReduceLROnPlateau', 'scheduler_params': {'patience': 5, 'factor': 0.5, 'min_lr': 1e-05}, 'loss': None, 'loss_params': {}, 'loss_on_logits': True, 'clip_grad': False, 'clip_grad_params': {}, 'init_bias': True, 'dataset': 'UniversalDataset', 'tuned': True, 'optimization_search_space': <function my_opt_space at 0x7f4dc0a14790>, 'verbose_bar': False, 'freeze_defaults': True, 'n_out': None, 'hid_factor': [2, 2], 'hidden_size': 1024, 'block_config': [2, 2], 'compression': 0.5, 'growth_size': 256, 'bn_factor': 2, 'drop_rate': 0.04184815819561255, 'noise_std': 0.05, 'num_init_features': None, 'act_fun': 'ReLU', 'use_noise': False, 'use_bn': True, 'stop_by_metric': False, 'tuning_params': {'fit_on_holdout': True, 'max_tuning_iter': 5, 'max_tuning_time': 3600}}
achieve 0.7566 auc
[11:42:47] Start fitting Lvl_0_Pipe_0_Mod_0_Tuned_TorchNN_0 ...
[11:42:47] ===== Start working with fold 0 for Lvl_0_Pipe_0_Mod_0_Tuned_TorchNN_0 =====
[11:42:48] Epoch: 0, train loss: 0.27193132042884827, val loss: 0.2582079768180847, val metric: 0.7192162334087058
[11:42:49] Epoch: 1, train loss: 0.25488120317459106, val loss: 0.2459825575351715, val metric: 0.763859711018811
[11:42:50] Epoch: 2, train loss: 0.2454848736524582, val loss: 0.24529244005680084, val metric: 0.7586050216228063
[11:42:50] Early stopping: val loss: 0.24799665808677673, val metric: 0.7566271629488269
[11:42:50] ===== Start working with fold 1 for Lvl_0_Pipe_0_Mod_0_Tuned_TorchNN_0 =====
[11:42:51] Epoch: 0, train loss: 0.27125343680381775, val loss: 0.26051169633865356, val metric: 0.7223749575407609
[11:42:51] Epoch: 1, train loss: 0.254526287317276, val loss: 0.25250861048698425, val metric: 0.7373842985733696
[11:42:52] Epoch: 2, train loss: 0.24166874587535858, val loss: 0.258941113948822, val metric: 0.7197000254755436
[11:42:52] Early stopping: val loss: 0.2543417811393738, val metric: 0.7311799422554348
[11:42:52] ===== Start working with fold 2 for Lvl_0_Pipe_0_Mod_0_Tuned_TorchNN_0 =====
[11:42:53] Epoch: 0, train loss: 0.2697919011116028, val loss: 0.26959753036499023, val metric: 0.6485861073369565
[11:42:54] Epoch: 1, train loss: 0.2507198452949524, val loss: 0.26720497012138367, val metric: 0.6837476647418478
[11:42:55] Epoch: 2, train loss: 0.24044080078601837, val loss: 0.2683129906654358, val metric: 0.6879935886548914
[11:42:55] Early stopping: val loss: 0.26637837290763855, val metric: 0.678169582201087
[11:42:55] ===== Start working with fold 3 for Lvl_0_Pipe_0_Mod_0_Tuned_TorchNN_0 =====
[11:42:56] Epoch: 0, train loss: 0.27104493975639343, val loss: 0.26356974244117737, val metric: 0.6876220703125
[11:42:56] Epoch: 1, train loss: 0.2532782256603241, val loss: 0.2557208836078644, val metric: 0.7101042374320653
[11:42:57] Epoch: 2, train loss: 0.2443435788154602, val loss: 0.25609511137008667, val metric: 0.7170038637907609
[11:42:57] Early stopping: val loss: 0.2564053535461426, val metric: 0.7097698709239131
[11:42:57] ===== Start working with fold 4 for Lvl_0_Pipe_0_Mod_0_Tuned_TorchNN_0 =====
[11:42:58] Epoch: 0, train loss: 0.2717946171760559, val loss: 0.26010674238204956, val metric: 0.7114735478940217
[11:42:59] Epoch: 1, train loss: 0.25437426567077637, val loss: 0.250105619430542, val metric: 0.7436258067255435
[11:43:00] Epoch: 2, train loss: 0.24573101103305817, val loss: 0.24898070096969604, val metric: 0.7468049422554348
[11:43:00] Early stopping: val loss: 0.251451313495636, val metric: 0.7411684782608696
[11:43:00] Fitting Lvl_0_Pipe_0_Mod_0_Tuned_TorchNN_0 finished. score = 0.7228784107078735
[11:43:00] Lvl_0_Pipe_0_Mod_0_Tuned_TorchNN_0 fitting and predicting completed
[11:43:00] Time left 278.70 secs
[11:43:00] Layer 1 training completed.
[11:43:00] Automl preset training completed in 21.30 seconds
[11:43:00] Model description:
Final prediction for new objects (level 0) =
1.00000 * (5 averaged models Lvl_0_Pipe_0_Mod_0_Tuned_TorchNN_0)
[20]:
array([[0.03668038],
[0.03266167],
[0.0428489 ],
...,
[0.05952494],
[0.19782786],
[0.10605511]], dtype=float32)
4.2.3 One more example
Tuning NODE params
[6]:
TIMEOUT = 3000
[15]:
default_lama_params = {
"task": task,
"timeout": TIMEOUT,
"cpu_limit": N_THREADS,
"reader_params": {'n_jobs': N_THREADS, 'cv': N_FOLDS, 'random_state': RANDOM_STATE}
}
default_nn_params = {
"bs": 512, "num_workers": 0, "path_to_save": None, "n_epochs": 10, "freeze_defaults": True
}
[16]:
def my_opt_space_NODE(trial: optuna.trial.Trial, estimated_n_trials, suggested_params):
'''
This function is needed for parameter tuning.
'''
# copy the suggested params and override only what we want to tune
trial_values = copy(suggested_params)
trial_values["layer_dim"] = trial.suggest_categorical(
"layer_dim", [2 ** i for i in range(8, 10)]
)
trial_values["use_original_head"] = trial.suggest_categorical(
"use_original_head", [True, False]
)
trial_values["num_layers"] = trial.suggest_int(
"num_layers", 1, 3
)
trial_values["drop_rate"] = trial.suggest_float(
"drop_rate", 0.0, 0.3
)
trial_values["tree_dim"] = trial.suggest_int(
"tree_dim", 1, 3
)
return trial_values
[17]:
automl = TabularAutoML(
task = task,
timeout = TIMEOUT,
cpu_limit = N_THREADS,
general_params = {"use_algos": [["node_tuned"]]}, # ['nn', 'mlp', 'dense', 'denselight', 'resnet', 'snn'] or custom torch model
nn_params = {"n_epochs": 10, "bs": 512, "num_workers": 0, "path_to_save": None, "freeze_defaults": True, "optimization_search_space": my_opt_space_NODE,},
nn_pipeline_params = {"use_qnt": True, "use_te": False},
reader_params = {'n_jobs': N_THREADS, 'cv': N_FOLDS, 'random_state': RANDOM_STATE}
)
[18]:
oof_pred = automl.fit_predict(tr_data, roles = roles, verbose = 2)
[11:58:03] Stdout logging level is INFO2.
[11:58:03] Task: binary
[11:58:03] Start automl preset with listed constraints:
[11:58:03] - time: 3000.00 seconds
[11:58:03] - CPU: 4 cores
[11:58:03] - memory: 16 GB
[11:58:03] Train data shape: (8000, 122)
[11:58:03] Layer 1 train process start. Time left 2999.19 secs
[11:58:04] Start hyperparameters optimization for Lvl_0_Pipe_0_Mod_0_Tuned_TorchNN_node_tuned_0 ... Time budget is 1574.27 secs
[12:01:57] Hyperparameters optimization for Lvl_0_Pipe_0_Mod_0_Tuned_TorchNN_node_tuned_0 completed
[12:01:57] The set of hyperparameters {'num_workers': 0, 'pin_memory': False, 'max_length': 256, 'is_snap': False, 'input_bn': False, 'max_emb_size': 256, 'bert_name': None, 'pooling': 'cls', 'device': ['0'], 'use_cont': True, 'use_cat': True, 'use_text': False, 'lang': 'en', 'deterministic': True, 'multigpu': False, 'random_state': 42, 'model': 'node', 'model_with_emb': False, 'path_to_save': None, 'verbose_inside': None, 'verbose': 1, 'n_epochs': 10, 'snap_params': {'k': 3, 'early_stopping': True, 'patience': 10, 'swa': True}, 'bs': 512, 'emb_dropout': 0.1, 'emb_ratio': 3, 'opt': 'Adam', 'opt_params': {'lr': 0.0003, 'weight_decay': 0}, 'sch': 'ReduceLROnPlateau', 'scheduler_params': {'patience': 5, 'factor': 0.5, 'min_lr': 1e-05}, 'loss': None, 'loss_params': {}, 'loss_on_logits': True, 'clip_grad': False, 'clip_grad_params': {}, 'init_bias': True, 'dataset': 'UniversalDataset', 'tuned': False, 'optimization_search_space': <function my_opt_space_NODE at 0x7fd4d11d1820>, 'verbose_bar': False, 'freeze_defaults': True, 'n_out': None, 'hid_factor': [2, 2], 'hidden_size': [512, 512, 512], 'block_config': [2, 2], 'compression': 0.5, 'growth_size': 256, 'bn_factor': 2, 'drop_rate': 0.12034524690886754, 'noise_std': 0.05, 'num_init_features': None, 'act_fun': 'ReLU', 'use_noise': False, 'use_bn': True, 'stop_by_metric': False, 'tuning_params': {'fit_on_holdout': True, 'max_tuning_iter': 25, 'max_tuning_time': 3600}, 'layer_dim': 512, 'use_original_head': False, 'num_layers': 3, 'tree_dim': 2}
achieve 0.7432 auc
[12:01:57] Start fitting Lvl_0_Pipe_0_Mod_0_Tuned_TorchNN_node_tuned_0 ...
[12:01:57] ===== Start working with fold 0 for Lvl_0_Pipe_0_Mod_0_Tuned_TorchNN_node_tuned_0 =====
[12:02:09] ===== Start working with fold 1 for Lvl_0_Pipe_0_Mod_0_Tuned_TorchNN_node_tuned_0 =====
[12:02:22] ===== Start working with fold 2 for Lvl_0_Pipe_0_Mod_0_Tuned_TorchNN_node_tuned_0 =====
[12:02:34] ===== Start working with fold 3 for Lvl_0_Pipe_0_Mod_0_Tuned_TorchNN_node_tuned_0 =====
[12:02:47] ===== Start working with fold 4 for Lvl_0_Pipe_0_Mod_0_Tuned_TorchNN_node_tuned_0 =====
[12:02:59] Fitting Lvl_0_Pipe_0_Mod_0_Tuned_TorchNN_node_tuned_0 finished. score = 0.7146780211829931
[12:02:59] Lvl_0_Pipe_0_Mod_0_Tuned_TorchNN_node_tuned_0 fitting and predicting completed
[12:02:59] Time left 2703.40 secs
[12:02:59] Layer 1 training completed.
[12:02:59] Automl preset training completed in 296.61 seconds
[12:02:59] Model description:
Final prediction for new objects (level 0) =
1.00000 * (5 averaged models Lvl_0_Pipe_0_Mod_0_Tuned_TorchNN_node_tuned_0)
4.3 Several models
[21]:
automl = TabularAutoML(
**default_lama_params,
general_params = {"use_algos": [["lgb", "mlp", "dense"]]},
nn_params = {"0": {**default_nn_params, "n_epochs": 2},
"1": {**default_nn_params, "n_epochs": 5}},
)
automl.fit_predict(tr_data, roles = roles, verbose = 3)
[11:43:12] Stdout logging level is INFO3.
[11:43:12] Task: binary
[11:43:12] Start automl preset with listed constraints:
[11:43:12] - time: 300.00 seconds
[11:43:12] - CPU: 4 cores
[11:43:12] - memory: 16 GB
[11:43:12] Train data shape: (8000, 122)
[11:43:13] Feats was rejected during automatic roles guess: []
[11:43:13] Layer 1 train process start. Time left 299.17 secs
[11:43:13] Training until validation scores don't improve for 200 rounds
[11:43:15] Selector_LightGBM fitting and predicting completed
[11:43:15] Start fitting Lvl_0_Pipe_0_Mod_0_LightGBM ...
[11:43:15] ===== Start working with fold 0 for Lvl_0_Pipe_0_Mod_0_LightGBM =====
[11:43:15] Training until validation scores don't improve for 200 rounds
[11:43:17] ===== Start working with fold 1 for Lvl_0_Pipe_0_Mod_0_LightGBM =====
[11:43:17] Training until validation scores don't improve for 200 rounds
[11:43:24] ===== Start working with fold 2 for Lvl_0_Pipe_0_Mod_0_LightGBM =====
[11:43:24] Training until validation scores don't improve for 200 rounds
[11:43:25] ===== Start working with fold 3 for Lvl_0_Pipe_0_Mod_0_LightGBM =====
[11:43:25] Training until validation scores don't improve for 200 rounds
[11:43:27] ===== Start working with fold 4 for Lvl_0_Pipe_0_Mod_0_LightGBM =====
[11:43:28] Training until validation scores don't improve for 200 rounds
[11:43:30] Fitting Lvl_0_Pipe_0_Mod_0_LightGBM finished. score = 0.7139016076564749
[11:43:30] Lvl_0_Pipe_0_Mod_0_LightGBM fitting and predicting completed
[11:43:30] Time left 281.41 secs
[11:43:31] Start fitting Lvl_0_Pipe_1_Mod_0_TorchNN_mlp_0 ...
[11:43:31] ===== Start working with fold 0 for Lvl_0_Pipe_1_Mod_0_TorchNN_mlp_0 =====
[11:43:31] Epoch: 0, train loss: 0.2787605822086334, val loss: 0.309201180934906, val metric: 0.6278257987608983
[11:43:31] Epoch: 1, train loss: 0.27014315128326416, val loss: 0.3014439344406128, val metric: 0.6727606096081167
[11:43:31] Early stopping: val loss: 0.3067486882209778, val metric: 0.6518273810478374
[11:43:31] ===== Start working with fold 1 for Lvl_0_Pipe_1_Mod_0_TorchNN_mlp_0 =====
[11:43:31] Epoch: 0, train loss: 0.2777830958366394, val loss: 0.25957152247428894, val metric: 0.6724667756453804
[11:43:32] Epoch: 1, train loss: 0.26959046721458435, val loss: 0.2489766776561737, val metric: 0.7130498471467391
[11:43:32] Early stopping: val loss: 0.2563818097114563, val metric: 0.6977326766304347
[11:43:32] ===== Start working with fold 2 for Lvl_0_Pipe_1_Mod_0_TorchNN_mlp_0 =====
[11:43:32] Epoch: 0, train loss: 0.27729859948158264, val loss: 0.26021498441696167, val metric: 0.5819967518682065
[11:43:32] Epoch: 1, train loss: 0.2688000798225403, val loss: 0.2569577395915985, val metric: 0.6118960173233696
[11:43:32] Early stopping: val loss: 0.25878584384918213, val metric: 0.5981498386548911
[11:43:32] ===== Start working with fold 3 for Lvl_0_Pipe_1_Mod_0_TorchNN_mlp_0 =====
[11:43:32] Epoch: 0, train loss: 0.2782357931137085, val loss: 0.2936389744281769, val metric: 0.6591664189877717
[11:43:33] Epoch: 1, train loss: 0.2691851854324341, val loss: 0.2860284745693207, val metric: 0.6782173488451086
[11:43:33] Early stopping: val loss: 0.29129818081855774, val metric: 0.6712646484375
[11:43:33] ===== Start working with fold 4 for Lvl_0_Pipe_1_Mod_0_TorchNN_mlp_0 =====
[11:43:33] Epoch: 0, train loss: 0.278407484292984, val loss: 0.27651435136795044, val metric: 0.6396059782608697
[11:43:33] Epoch: 1, train loss: 0.2713249921798706, val loss: 0.26803451776504517, val metric: 0.6795495074728259
[11:43:33] Early stopping: val loss: 0.2736702263355255, val metric: 0.6605489979619565
[11:43:33] Fitting Lvl_0_Pipe_1_Mod_0_TorchNN_mlp_0 finished. score = 0.649310571575994
[11:43:33] Lvl_0_Pipe_1_Mod_0_TorchNN_mlp_0 fitting and predicting completed
[11:43:33] Start fitting Lvl_0_Pipe_1_Mod_1_TorchNN_dense_1 ...
[11:43:33] ===== Start working with fold 0 for Lvl_0_Pipe_1_Mod_1_TorchNN_dense_1 =====
[11:43:34] Epoch: 0, train loss: 0.2770945131778717, val loss: 0.30580174922943115, val metric: 0.6998920196075288
[11:43:34] Epoch: 1, train loss: 0.24924148619174957, val loss: 0.286395400762558, val metric: 0.7347477695634278
[11:43:34] Epoch: 2, train loss: 0.2179156094789505, val loss: 0.30268681049346924, val metric: 0.6945170550218902
[11:43:35] Epoch: 3, train loss: 0.18389688432216644, val loss: 0.344427227973938, val metric: 0.6729530499115309
[11:43:35] Epoch: 4, train loss: 0.15367864072322845, val loss: 0.385821133852005, val metric: 0.6347750319397448
[11:43:35] Early stopping: val loss: 0.29431894421577454, val metric: 0.7355977142368406
[11:43:35] ===== Start working with fold 1 for Lvl_0_Pipe_1_Mod_1_TorchNN_dense_1 =====
[11:43:35] Epoch: 0, train loss: 0.2714083790779114, val loss: 0.25767791271209717, val metric: 0.7117336107336957
[11:43:36] Epoch: 1, train loss: 0.24716131389141083, val loss: 0.24509836733341217, val metric: 0.7210587211277173
[11:43:36] Epoch: 2, train loss: 0.21747852861881256, val loss: 0.25955381989479065, val metric: 0.6777821416440216
[11:43:36] Epoch: 3, train loss: 0.18614345788955688, val loss: 0.2698502540588379, val metric: 0.6448019276494565
[11:43:37] Epoch: 4, train loss: 0.15456536412239075, val loss: 0.28265029191970825, val metric: 0.6477210003396741
[11:43:37] Early stopping: val loss: 0.2492099106311798, val metric: 0.7247473675271738
[11:43:37] ===== Start working with fold 2 for Lvl_0_Pipe_1_Mod_1_TorchNN_dense_1 =====
[11:43:37] Epoch: 0, train loss: 0.2723230719566345, val loss: 0.2602400481700897, val metric: 0.614937160326087
[11:43:37] Epoch: 1, train loss: 0.2455139309167862, val loss: 0.2510344684123993, val metric: 0.6732390030570652
[11:43:38] Epoch: 2, train loss: 0.21499498188495636, val loss: 0.26110419631004333, val metric: 0.6599227241847827
[11:43:38] Epoch: 3, train loss: 0.18287943303585052, val loss: 0.2879733741283417, val metric: 0.6468452785326086
[11:43:38] Epoch: 4, train loss: 0.15581558644771576, val loss: 0.30908963084220886, val metric: 0.6247930112092391
[11:43:38] Early stopping: val loss: 0.2555948495864868, val metric: 0.665681258491848
[11:43:38] ===== Start working with fold 3 for Lvl_0_Pipe_1_Mod_1_TorchNN_dense_1 =====
[11:43:39] Epoch: 0, train loss: 0.274604856967926, val loss: 0.28871941566467285, val metric: 0.6861306895380436
[11:43:39] Epoch: 1, train loss: 0.242709219455719, val loss: 0.271901398897171, val metric: 0.7212710173233696
[11:43:39] Epoch: 2, train loss: 0.2166079580783844, val loss: 0.28201019763946533, val metric: 0.7093505859375001
[11:43:40] Epoch: 3, train loss: 0.18465735018253326, val loss: 0.2982863187789917, val metric: 0.7004288383152174
[11:43:40] Epoch: 4, train loss: 0.15224191546440125, val loss: 0.3260188102722168, val metric: 0.663733440896739
[11:43:40] Early stopping: val loss: 0.27753010392189026, val metric: 0.7169985563858694
[11:43:40] ===== Start working with fold 4 for Lvl_0_Pipe_1_Mod_1_TorchNN_dense_1 =====
[11:43:40] Epoch: 0, train loss: 0.2742379903793335, val loss: 0.274169921875, val metric: 0.7089684527853262
[11:43:41] Epoch: 1, train loss: 0.2457791119813919, val loss: 0.2631347179412842, val metric: 0.7099344004755435
[11:43:41] Epoch: 2, train loss: 0.2198062241077423, val loss: 0.2773990035057068, val metric: 0.6714291779891304
[11:43:41] Epoch: 3, train loss: 0.1879502236843109, val loss: 0.32326072454452515, val metric: 0.6654795771059782
[11:43:41] Epoch: 4, train loss: 0.1601661890745163, val loss: 0.357746422290802, val metric: 0.6396059782608696
[11:43:42] Early stopping: val loss: 0.266778826713562, val metric: 0.7362511676290762
[11:43:42] Fitting Lvl_0_Pipe_1_Mod_1_TorchNN_dense_1 finished. score = 0.7106457519741463
[11:43:42] Lvl_0_Pipe_1_Mod_1_TorchNN_dense_1 fitting and predicting completed
[11:43:42] Time left 270.04 secs
[11:43:42] Layer 1 training completed.
[11:43:42] Blending: optimization starts with equal weights and score 0.73113620210903
[11:43:42] Blending: iteration 0: score = 0.7330658618498413, weights = [0.35163662 0.06402954 0.58433384]
[11:43:42] Blending: iteration 1: score = 0.7332030948540493, weights = [0.371031 0. 0.628969]
[11:43:42] Blending: iteration 2: score = 0.7332030948540493, weights = [0.371031 0. 0.628969]
[11:43:42] Blending: no score update. Terminated
[11:43:42] Automl preset training completed in 30.17 seconds
[11:43:42] Model description:
Final prediction for new objects (level 0) =
0.37103 * (5 averaged models Lvl_0_Pipe_0_Mod_0_LightGBM) +
0.62897 * (5 averaged models Lvl_0_Pipe_1_Mod_1_TorchNN_dense_1)
[21]:
array([[0.08149866],
[0.04592865],
[0.04751563],
...,
[0.06561954],
[0.15983571],
[0.11311316]], dtype=float32)
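The array above is what predict() returns for the holdout set: one probability of the positive class per object. A minimal, hedged sketch of how such predictions could be scored against the holdout target (assuming the fitted automl object, the test_data split and the TARGET_NAME constant defined earlier in this tutorial):

# Sketch: score holdout predictions with the metric used throughout this tutorial
from sklearn.metrics import roc_auc_score

test_pred = automl.predict(test_data)  # prediction dataset; raw values live in .data
print(f'Test ROC-AUC: {roc_auc_score(test_data[TARGET_NAME].values, test_pred.data[:, 0]):.4f}')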
Kaggle Kernels
Others
LightAutoML crash courses
Video guides
(Russian) LightAutoML webinar for Sberloga community (Alexander Ryzhkov, Dmitry Simakov)
(Russian) LightAutoML hands-on tutorial in Kaggle Kernels (Alexander Ryzhkov)
(English) Automated Machine Learning with LightAutoML: theory and practice (Alexander Ryzhkov)
(English) LightAutoML framework general overview, benchmarks and advantages for business (Alexander Ryzhkov)
(English) LightAutoML practical guide - ML pipeline presets overview (Dmitry Simakov)
Papers
Anton Vakhrushev, Alexander Ryzhkov, Dmitry Simakov, Rinchin Damdinov, Maxim Savchenko, Alexander Tuzhilin “LightAutoML: AutoML Solution for a Large Financial Services Ecosystem”. arXiv:2109.01528, 2021.
Articles about LightAutoML
Python-API
lightautoml.automl
The main module, which includes the AutoML class, blenders and ready-made presets.
Class that compiles the full pipeline of an AutoML task.
Presets
Presets for end-to-end model training for special tasks.
Basic class for automl preset.
Classic preset - work with tabular data.
Template to make TimeUtilization from TabularAutoML.
Classic preset - work with tabular and text data.
Preset for AutoWoE - logistic regression over binned features (scorecard).
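For reference, a minimal sketch of how the tabular preset is typically instantiated; parameter values are illustrative and train_data is assumed to be a pandas DataFrame with a 'TARGET' column:

from lightautoml.automl.presets.tabular_presets import TabularAutoML
from lightautoml.tasks import Task

# Sketch: binary classification, 300-second budget, 4 CPUs, 5-fold inner CV
automl = TabularAutoML(
    task=Task('binary'),
    timeout=300,
    cpu_limit=4,
    reader_params={'n_jobs': 4, 'cv': 5, 'random_state': 42},
)
oof_pred = automl.fit_predict(train_data, roles={'target': 'TARGET'})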
Blenders
Basic class for blending.
Select best single model from level.
Simple average of level predictions.
Weighted blender based on coordinate descent; optimizes the task metric directly.
lightautoml.addons
Extensions of core functionality.
Utilization
Class that helps to utilize the given time for an AutoML preset.
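TabularUtilizedAutoML (imported in the tutorial above) is the ready-made application of this addon: it keeps retraining TabularAutoML with different settings until the time budget is spent and blends the results. A hedged sketch, with train_data assumed as above:

from lightautoml.automl.presets.tabular_presets import TabularUtilizedAutoML
from lightautoml.tasks import Task

# Sketch: spend the whole 600-second budget on several TabularAutoML runs
utilized = TabularUtilizedAutoML(task=Task('binary'), timeout=600, cpu_limit=4)
oof_pred = utilized.fit_predict(train_data, roles={'target': 'TARGET'})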
lightautoml.dataset
Provides an internal interface for working with data.
Dataset Interfaces
Basic class for pair - column, role.
Basic class to create dataset.
Dataset that contains info in np.ndarray format.
Dataset that contains pd.DataFrame features and pd.Series targets.
Dataset that contains sparse features and np.ndarray targets.
Roles
Role contains information about the column, which determines how it is processed.
Abstract class for column role.
Numeric role.
Category role.
Text role.
Datetime role.
Target role.
Group role.
Drop role.
Weights role.
Folds role.
Path role.
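Roles are usually passed to fit_predict as a mapping so the reader knows how to treat specific columns. A hedged sketch; 'report_dt' and 'AMT_CREDIT' are example column names, and automl/train_data are the objects from the preset sketch above:

from lightautoml.dataset.roles import DatetimeRole, NumericRole

# Sketch: mix of a plain string key (for the target) and role objects for specific columns
roles = {
    'target': 'TARGET',
    DatetimeRole(base_date=True): 'report_dt',
    NumericRole(): 'AMT_CREDIT',
}
oof_pred = automl.fit_predict(train_data, roles=roles)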
Utils
Utilities for working with the structure of a dataset.
Parser of roles.
Get concatenation function for datasets of different types.
Concatenation of numpy and pandas datasets.
Dataset concatenation function.
lightautoml.image
Provides an internal interface for working with image features.
Image Feature Extractors
Image feature extractors based on color histograms and CNN embeddings.
Class for parallel histogram computation.
PyTorch Image Datasets
Utils
Load images from paths.
lightautoml.ml_algo
Models used for machine learning pipelines.
Base Classes
Abstract class for machine learning algorithm.
Machine learning algorithms that accept numpy arrays as input.
Linear Models
LBFGS L2 regression based on torch.
Coordinate descent based on sklearn implementation.
Neural net for tabular datasets.
Boosted Trees
Gradient boosting on decision trees from LightGBM library.
Gradient boosting on decision trees from CatBoost library.
Neural Networks
Realisation of 'mlp' model.
Realisation of 'denselight' model.
Realisation of 'dense' model.
The ResNet model from https://github.com/Yura52/rtdl.
Realisation of 'snn' model.
WhiteBox
WhiteBox - scorecard model.
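When building a custom pipeline, these algorithms can be instantiated directly. A hedged sketch for the LightGBM wrapper; the parameter values are illustrative, not recommendations:

from lightautoml.ml_algo.boost_lgbm import BoostLGBM

# Sketch: override a few LightGBM defaults; everything else stays at the wrapper's defaults
model = BoostLGBM(default_params={'learning_rate': 0.05, 'num_leaves': 128, 'seed': 42})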
lightautoml.ml_algo.tuning
Bunch of classes for hyperparameters tuning.
Base Classes
Base abstract class for hyperparameters tuners.
Default realization of ParamsTuner - just takes the algorithm's defaults.
Tuning with Optuna
Wrapper for Optuna tuner.
Wrapper for Optuna tuner.
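A tuner is usually paired with a model when assembling an MLPipeline (see the lightautoml.pipelines.ml section below). A hedged sketch with illustrative limits:

from lightautoml.ml_algo.tuning.optuna import OptunaTuner

# Sketch: at most 20 trials or 30 seconds of tuning, validated on a holdout
tuner = OptunaTuner(n_trials=20, timeout=30, fit_on_holdout=True)
# later the pair is passed into an MLPipeline: MLPipeline([(model, tuner)], ...)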
lightautoml.ml_algo
Torch utils.
Pooling Strategies
Abstract pooling class.
CLS token pooling.
Max value pooling.
Sum value pooling.
Mean value pooling.
Identity pooling.
lightautoml.pipelines
Pipelines for solving different tasks.
Utils
Pipelines create names in the form 'prefix__feature_name'.
Search for columns with specific roles and attributes when building a pipeline.
lightautoml.pipelines.selection
Feature selection module for ML pipelines.
Base Classes
Abstract class that estimates feature importances.
Abstract class that performs feature selection.
Importance Based Selectors
Base class for performing feature selection using model feature importances.
Selector based on importance threshold.
Permutation importance based estimator.
Select features sequentially using chunks to find the best combination of chunks.
Other Selectors
Selector to remove highly correlated features.
lightautoml.pipelines.features
Pipelines for features generation.
Base Classes
Abstract class.
Dummy feature pipeline.
Helper class that contains basic feature transformations for tabular data.
Feature Pipelines for Boosting Models
Creates a simple pipeline for tree-based models.
Creates an advanced pipeline for tree-based models.
Feature Pipelines for Linear Models
Creates a pipeline for linear models and neural nets.
Feature Pipelines for WhiteBox
Simple WhiteBox pipeline.
Image Feature Pipelines
Class that contains basic feature transformations for image data.
Class that contains simple color histogram features for image data.
Class that contains EfficientNet embedding features for image data.
Text Feature Pipelines
Class that contains basic feature transformations for text data.
Class that contains embedding features for text data.
Class that contains TF-IDF features for text data.
Features pipeline for BERT.
Feature Pipelines for Neural Network Models
Creates a simple pipeline for neural network models.
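A hedged sketch of instantiating feature pipelines for a boosting model and for a linear model; import paths follow the library's custom-pipeline tutorial and should be checked against your installed version:

from lightautoml.pipelines.features.lgb_pipeline import LGBAdvancedPipeline
from lightautoml.pipelines.features.linear_pipeline import LinearFeatures

gbm_feats = LGBAdvancedPipeline()                      # tree-oriented feature transformations
linear_feats = LinearFeatures(output_categories=True)  # linear/NN-oriented feature transformations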
lightautoml.pipelines.ml
Pipelines that merge together single model training steps.
Base Classes
Single ML pipeline.
Pipeline for Nested Cross-Validation
Wrapper for MLAlgo to make it trainable over nested folds.
Wrapper for MLPipeline to make it trainable over nested folds.
Pipeline for WhiteBox
Special pipeline to handle WhiteBox model.
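A hedged sketch that ties a model, a tuner and a feature pipeline into a single MLPipeline; the model, tuner and gbm_feats objects are the ones from the sketches above:

from lightautoml.pipelines.ml.base import MLPipeline

# Sketch: one pipeline with a tuned LightGBM model and tree-oriented features
pipeline = MLPipeline(
    [(model, tuner)],              # list of ml_algos or (ml_algo, params_tuner) pairs
    features_pipeline=gbm_feats,
)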
lightautoml.reader
Utils for reading, training and analysing data.
Readers
Abstract Reader class.
Pandas Reader.
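The reader is the entry point of a hand-assembled AutoML: it parses the pandas DataFrame, infers roles and builds the CV iterator. A hedged sketch of the typical assembly, reusing the MLPipeline built in the pipelines.ml section above; class names follow the custom-pipeline tutorial:

from lightautoml.automl.base import AutoML
from lightautoml.reader.base import PandasToPandasReader
from lightautoml.tasks import Task

task = Task('binary')
reader = PandasToPandasReader(task, cv=5, random_state=42)

# Sketch: a single-level AutoML built from one MLPipeline
automl = AutoML(reader, [[pipeline]], skip_conn=False)
oof_pred = automl.fit_predict(train_data, roles={'target': 'TARGET'})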
Tabular Batch Generators
Batch Handler Classes
Class that wraps a batch of data in different formats.
Batch of csv file.
Abstract generator of batches from data.
Batch generator from DataFrame.
Generator of batches from file.
Data Read Functions
Read data for inference by batches for simple tabular data.
Get ...
lightautoml.report
Report generators and templates.
Decorator that wraps the AutoML class to generate an HTML report on fit_predict and predict.
Special report wrapper for the WhiteBox preset.
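A hedged sketch of wrapping an AutoML instance so that fit_predict and predict also produce an HTML report (as the ReportDeco import in the tutorial above suggests); the output directory name is illustrative:

from lightautoml.report.report_deco import ReportDeco

RD = ReportDeco(output_path='tabularAutoML_model_report')  # report directory (assumed name)
automl_rd = RD(automl)                                     # wrap an existing, not yet fitted AutoML object
oof_pred = automl_rd.fit_predict(train_data, roles={'target': 'TARGET'})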
lightautoml.tasks
Task Class
Specify task (binary classification, multiclass classification, regression), metrics, losses.
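For example, a short sketch of the three task types, including overriding the loss and metric for regression:

from lightautoml.tasks import Task

task_bin = Task('binary')                         # binary classification, ROC-AUC metric by default
task_reg = Task('reg', loss='mae', metric='mae')  # regression with MAE loss and metric
task_multi = Task('multiclass')                   # multiclass classification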
Common Metrics
Classes
Wrapper for the F1 score function.
Metric wrapper to get best class prediction instead of probs.
Metric wrapper to get best class prediction instead of probs for multiclass.
Functions
Computes Mean Quantile Error.
Computes Mean Huber Error.
Computes Mean Fair Error.
Computes Mean Absolute Percentage Error.
ROC-AUC One-Versus-Rest.
Root mean squared log error.
Compute multi-class metric AUC-Mu.
lightautoml.tasks.losses
Wrappers of loss and metric functions for different machine learning algorithms.
Base Classes
Wrapper for metric.
Loss function with target transformation.
Wrappers for LightGBM
Classes
Wrapper of metric function for LightGBM.
Loss used for LightGBM.
Functions
Softmax columnwise.
Custom loss for optimizing f1.
Wrappers for CatBoost
Classes
Loss used for CatBoost.
Metric wrapper class for CatBoost.
Regression metric wrapper for CatBoost.
Classification metric wrapper for CatBoost.
Multiclassification metric wrapper for CatBoost.
Functions
CatBoost loss name wrapper, if it has keyword args.
Wrappers for Sklearn
Classes
Loss used for scikit-learn.
Wrappers for Torch
Classes
Customize PyTorch-based loss.
Loss used for PyTorch.
Functions
Computes Root Mean Squared Logarithmic Error.
Computes Mean Quantile Error.
Computes Mean Fair Error.
Computes Mean Huber Error.
Computes F1 macro.
Computes Mean Absolute Percentage Error.
lightautoml.text
Provides an internal interface for working with text features.
Sentence Embedders
Deep Learning based sentence embeddings.
Class to compute Bag of Random Embedding Projections sentence embeddings from word embeddings.
Class to compute Random LSTM sentence embeddings from word embeddings.
Class to compute HuggingFace transformers word or sentence embeddings.
Weighted average of word embeddings.
Torch Datasets for Text
Dataset class with transformers tokenization.
Dataset class for extracting word embeddings.
Tokenizers
Base class for tokenizer method.
Russian tokenizer.
English tokenizer.
Utils
Set random seed and cudnn params.
Parse devices and convert the first to a torch device.
Puts each data field into a tensor with outer dimension batch size.
Get text hash.
Get hash of array with texts.
lightautoml.transformers
Basic feature generation steps and helper utils.
Base Classes
Base class for transformer method (like sklearn, but works with datasets).
Transformer that contains a list of transformers and applies them one by one sequentially.
Transformer that applies a sequence of transformers in parallel on the dataset and concatenates the results.
Select columns to pass to other transformers (or feature selection).
Apply a single-column transformer to all columns.
Apply multiple transformers and select the best.
Convert dataset to given type.
Change data roles (including dtypes etc.).
Numeric
Create NaN flags.
Fillna with median.
Fillna with mean.
Fill inf with nan to handle as nan value.
Convert probs to logodds.
Classic StandardScaler.
Discretization of numeric features by quantiles.
Transform features using quantiles information.
Categorical
Simple LabelEncoder in order of frequency.
Simple OneHotEncoder over label encoded categories.
Labels are encoded with frequency in train data.
Encoding ordinal categories into numbers.
Out-of-fold target encoding.
Out-of-fold target encoding for multiclass task.
Build label encoded intersections of categorical variables.
Datetime
Basic conversion strategy, used in selection one-to-one transformers.
Basic conversion strategy, used in selection one-to-one transformers.
Basic conversion strategy, used in selection one-to-one transformers.
Decompositions
PCA.
TruncatedSVD.
Text
Base class for ML transformers.
Simple Tfidf vectorizer.
Simple tokenizer transformer.
Out-of-fold sgd model prediction to reduce dimension of encoded text data.
Concat text features transformer.
Calculate text embeddings.
Image
Simple image histogram.
Calculate image embeddings.
lightautoml.utils
Common util tools.
Timer
Timer to limit the duration of tasks.
Timer is used to control time over the full AutoML run.
Timer is used to control time over a single ML task run.
lightautoml.validation
The module provide classes and functions for model validation.
Iterators
Abstract class for train/validation iteration.
Simple iterator which uses train data as validation.
Iterator for classic holdout - just predefined train and valid samples.
Iterator that uses a function to create fold indexes.
Classic cv iterator.
Time Series Iterator.
Iterators Getters and Utils
Creates train-validation iterator.
Get iterator for np/sparse dataset.