Tutorial 1: Basics

In this tutorial you will learn how to:

  • run LightAutoML training on tabular data

  • obtain feature importances and reports

  • configure resource usage in LightAutoML

The official LightAutoML GitHub repository is here: https://github.com/AILab-MLTools/LightAutoML

LightAutoML logo

0. Prerequisites

0.0. Install LightAutoML

[ ]:
!pip install -U lightautoml

0.1. Import libraries

Here we import the libraries used in this kernel:

  • Standard Python libraries for timing, working with the OS etc.

  • Essential Python DS libraries like numpy, pandas, scikit-learn and torch (the latter is set up in section 0.3 below)

  • LightAutoML modules: presets for AutoML, the task module and the report generation module

[1]:
# Standard python libraries
import os
import time

# Essential DS libraries
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
import torch

# LightAutoML presets, task and report generation
from lightautoml.automl.presets.tabular_presets import TabularAutoML, TabularUtilizedAutoML
from lightautoml.tasks import Task
from lightautoml.report.report_deco import ReportDeco

0.2. Constants

Here we set up the constants used in this kernel:

  • N_THREADS - number of vCPUs for LightAutoML model creation

  • N_FOLDS - number of folds in LightAutoML inner CV

  • RANDOM_STATE - random seed for better reproducibility

  • TEST_SIZE - holdout data part size

  • TIMEOUT - time limit in seconds for the model to train

  • TARGET_NAME - target column name in the dataset

[2]:
N_THREADS = 4
N_FOLDS = 5
RANDOM_STATE = 42
TEST_SIZE = 0.2
TIMEOUT = 300
TARGET_NAME = 'TARGET'
[3]:
DATASET_DIR = '../data/'
DATASET_NAME = 'sampled_app_train.csv'
DATASET_FULLNAME = os.path.join(DATASET_DIR, DATASET_NAME)
DATASET_URL = 'https://raw.githubusercontent.com/AILab-MLTools/LightAutoML/master/examples/data/sampled_app_train.csv'

0.3. Imported models setup

For better reproducibility, fix the numpy random seed and set the maximum number of threads for Torch (which otherwise usually tries to use all the threads on the server):

[4]:
np.random.seed(RANDOM_STATE)
torch.set_num_threads(N_THREADS)

0.4. Data loading

Let’s check the data we have:

[5]:
import requests

if not os.path.exists(DATASET_FULLNAME):
    os.makedirs(DATASET_DIR, exist_ok=True)

    dataset = requests.get(DATASET_URL).text
    with open(DATASET_FULLNAME, 'w') as output:
        output.write(dataset)
[6]:
data = pd.read_csv(DATASET_FULLNAME)
data.head()
[6]:
SK_ID_CURR TARGET NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY ... FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
0 313802 0 Cash loans M N Y 0 270000.0 327024.0 15372.0 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 1.0
1 319656 0 Cash loans F N N 0 108000.0 675000.0 19737.0 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0
2 207678 0 Revolving loans F Y Y 2 112500.0 270000.0 13500.0 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 1.0
3 381593 0 Cash loans F N N 1 67500.0 142200.0 9630.0 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 4.0
4 258153 0 Cash loans F Y Y 0 337500.0 1483231.5 46570.5 ... 0 0 0 0 0.0 0.0 0.0 2.0 0.0 0.0

5 rows × 122 columns

[7]:
data.shape
[7]:
(10000, 122)

0.5. Data splitting for train-holdout

As we have only one file with target values, we split it 80%-20% to create a holdout set:

[8]:
tr_data, te_data = train_test_split(
    data,
    test_size=TEST_SIZE,
    stratify=data[TARGET_NAME],
    random_state=RANDOM_STATE
)

print(f'Data split. Part sizes: tr_data = {tr_data.shape}, te_data = {te_data.shape}')

tr_data.head()
Data split. Part sizes: tr_data = (8000, 122), te_data = (2000, 122)
[8]:
SK_ID_CURR TARGET NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY ... FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
6444 112261 0 Cash loans F N N 1 90000.0 640080.0 31261.5 ... 0 0 0 0 0.0 0.0 0.0 0.0 1.0 0.0
3586 115058 0 Cash loans F N Y 0 180000.0 239850.0 23850.0 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 3.0
9349 326623 0 Cash loans F N Y 0 112500.0 337500.0 31086.0 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 2.0
7734 191976 0 Cash loans M Y Y 1 67500.0 135000.0 9018.0 ... 0 0 0 0 NaN NaN NaN NaN NaN NaN
2174 281519 0 Revolving loans F N Y 0 67500.0 202500.0 10125.0 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 2.0

5 rows × 122 columns

1. Task definition

1.1. Task type

In the cell below we create a Task object - the class that defines what task the LightAutoML model should solve, with a specific loss and metric if necessary (more info can be found here in our documentation):

[9]:
task = Task('binary')
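
If the defaults don’t suit you, the loss and metric can be passed to Task explicitly. A minimal sketch (the 'logloss' and 'auc' string aliases are assumptions based on common LightAutoML usage; this cell is optional):

[ ]:
# A sketch of an explicit task setup (assumed aliases: 'logloss' loss, 'auc' metric)
task_custom = Task('binary', loss='logloss', metric='auc')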

1.2. Feature roles setup

To solve the task, we need to set up the column roles. The only role you must specify is the target role; everything else (drop, numeric, categorical, group, weights etc.) is up to the user - LightAutoML has automatic column typization inside:

[10]:
roles = {
    'target': TARGET_NAME,
    'drop': ['SK_ID_CURR']
}
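
If you want to override the automatic typization for particular columns, additional roles can be listed in the same dict. A sketch (the 'category' and 'numeric' role keys are assumptions; the column names come from the dataset preview above):

[ ]:
# A sketch of a richer roles setup (assumed role keys: 'category', 'numeric')
roles_extended = {
    'target': TARGET_NAME,
    'drop': ['SK_ID_CURR'],
    'category': ['NAME_CONTRACT_TYPE', 'CODE_GENDER'],  # force categorical treatment
    'numeric': ['AMT_INCOME_TOTAL'],                    # force numeric treatment
}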

1.3. LightAutoML model creation - TabularAutoML preset

In the next cell we are going to create the LightAutoML model with the TabularAutoML class - a preset with the default model structure shown in the image below:

TabularAutoML preset pipeline

in just several lines. Let’s discuss the params we can set up:

  • task - the type of the ML task (the only must-have parameter)

  • timeout - time limit in seconds for the model to train

  • cpu_limit - vCPU count for the model to use

  • reader_params - parameter overrides for the Reader object inside the preset, which works on the first step of data preparation: automatic feature typization, preliminary filtering of almost-constant features, correct CV setup etc. Here, for example, we set n_jobs threads for the typization algo, the number of cv folds, and random_state as the inner CV seed.

Important note: the reader_params key is one of the YAML config keys used inside the TabularAutoML preset. More details on its structure, with explanatory comments, can be found at the link attached. Each key from this config can be overridden with user settings during preset object initialization. To get more info about setting different parameters (for example, the ML algos which can be used in general_params->use_algos, sketched in the optional cell below), please take a look at our article on TowardsDataScience.
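
For example, a sketch of restricting the first level to the linear model and a single LightGBM via general_params (the 'linear_l2' and 'lgb' algo names are assumptions based on common LightAutoML configs; this cell is optional):

[ ]:
# A sketch: limit level 0 to the linear model and LightGBM only
# (assumed algo names: 'linear_l2', 'lgb')
automl_light = TabularAutoML(
    task = task,
    timeout = TIMEOUT,
    cpu_limit = N_THREADS,
    general_params = {'use_algos': [['linear_l2', 'lgb']]},
    reader_params = {'n_jobs': N_THREADS, 'cv': N_FOLDS, 'random_state': RANDOM_STATE}
)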

Moreover, to receive an automatic report for our model we will use the ReportDeco decorator and work with the decorated version in the same way as we do with the usual one.

[11]:
automl = TabularAutoML(
    task = task,
    timeout = TIMEOUT,
    cpu_limit = N_THREADS,
    reader_params = {'n_jobs': N_THREADS, 'cv': N_FOLDS, 'random_state': RANDOM_STATE}
)

2. AutoML training

To run AutoML training, use the fit_predict method with the following parameters:

  • train_data - Dataset to train.

  • roles - Roles dict.

  • verbose - controls the verbosity: the higher, the more messages:
    <1  - messages are not displayed;
    >=1 - the computation process for layers is displayed;
    >=2 - the information about folds processing is also displayed;
    >=3 - the hyperparameters optimization process is also displayed;
    >=4 - the training process for every algorithm is displayed.

Note: the out-of-fold prediction is calculated during training and returned by the fit_predict method.

[12]:
%%time
oof_pred = automl.fit_predict(tr_data, roles = roles, verbose = 1)
[10:35:47] Stdout logging level is INFO.
[10:35:47] Copying TaskTimer may affect the parent PipelineTimer, so copy will create new unlimited TaskTimer
[10:35:47] Task: binary

[10:35:47] Start automl preset with listed constraints:
[10:35:47] - time: 300.00 seconds
[10:35:47] - CPU: 4 cores
[10:35:47] - memory: 16 GB

[10:35:47] Train data shape: (8000, 122)

[10:35:51] Layer 1 train process start. Time left 296.24 secs
[10:35:52] Start fitting Lvl_0_Pipe_0_Mod_0_LinearL2 ...
[10:35:55] Fitting Lvl_0_Pipe_0_Mod_0_LinearL2 finished. score = 0.7369471853840366
[10:35:55] Lvl_0_Pipe_0_Mod_0_LinearL2 fitting and predicting completed
[10:35:55] Time left 292.31 secs

[10:35:58] Selector_LightGBM fitting and predicting completed
[10:35:58] Start fitting Lvl_0_Pipe_1_Mod_0_LightGBM ...
[10:36:20] Fitting Lvl_0_Pipe_1_Mod_0_LightGBM finished. score = 0.7453257333249144
[10:36:20] Lvl_0_Pipe_1_Mod_0_LightGBM fitting and predicting completed
[10:36:20] Start hyperparameters optimization for Lvl_0_Pipe_1_Mod_1_Tuned_LightGBM ... Time budget is 1.00 secs
[10:36:32] Hyperparameters optimization for Lvl_0_Pipe_1_Mod_1_Tuned_LightGBM completed
[10:36:32] Start fitting Lvl_0_Pipe_1_Mod_1_Tuned_LightGBM ...
[10:36:48] Fitting Lvl_0_Pipe_1_Mod_1_Tuned_LightGBM finished. score = 0.7296807456461208
[10:36:48] Lvl_0_Pipe_1_Mod_1_Tuned_LightGBM fitting and predicting completed
[10:36:48] Start fitting Lvl_0_Pipe_1_Mod_2_CatBoost ...
[10:36:56] Fitting Lvl_0_Pipe_1_Mod_2_CatBoost finished. score = 0.7189925162835304
[10:36:56] Lvl_0_Pipe_1_Mod_2_CatBoost fitting and predicting completed
[10:36:56] Start hyperparameters optimization for Lvl_0_Pipe_1_Mod_3_Tuned_CatBoost ... Time budget is 167.78 secs
[10:39:32] Hyperparameters optimization for Lvl_0_Pipe_1_Mod_3_Tuned_CatBoost completed
[10:39:32] Start fitting Lvl_0_Pipe_1_Mod_3_Tuned_CatBoost ...
[10:39:41] Fitting Lvl_0_Pipe_1_Mod_3_Tuned_CatBoost finished. score = 0.7433999003758547
[10:39:41] Lvl_0_Pipe_1_Mod_3_Tuned_CatBoost fitting and predicting completed
[10:39:41] Time left 66.44 secs

[10:39:41] Layer 1 training completed.

[10:39:41] Blending: optimization starts with equal weights and score 0.7529907972036357
[10:39:41] Blending: iteration 0: score = 0.7552430767490724, weights = [0.17297602 0.45194352 0.12763917 0.         0.24744129]
[10:39:41] Blending: iteration 1: score = 0.7555030859886485, weights = [0.22631824 0.37608108 0.19821008 0.         0.1993906 ]
[10:39:41] Blending: iteration 2: score = 0.7554997906957511, weights = [0.23017529 0.38376114 0.18327494 0.         0.20278871]
[10:39:41] Blending: iteration 3: score = 0.755520519151073, weights = [0.23089828 0.38205293 0.18385063 0.         0.20319813]
[10:39:41] Blending: iteration 4: score = 0.7555106332723811, weights = [0.22740835 0.3837866  0.18468489 0.         0.20412019]
[10:39:41] Automl preset training completed in 234.02 seconds

[10:39:41] Model description:
Final prediction for new objects (level 0) =
         0.22741 * (5 averaged models Lvl_0_Pipe_0_Mod_0_LinearL2) +
         0.38379 * (5 averaged models Lvl_0_Pipe_1_Mod_0_LightGBM) +
         0.18468 * (5 averaged models Lvl_0_Pipe_1_Mod_1_Tuned_LightGBM) +
         0.20412 * (5 averaged models Lvl_0_Pipe_1_Mod_3_Tuned_CatBoost)

CPU times: user 18min 28s, sys: 1min 26s, total: 19min 55s
Wall time: 3min 54s

3. Prediction on holdout and model evaluation

[14]:
%%time

te_pred = automl.predict(te_data)
print(f'Prediction for te_data:\n{te_pred}\nShape = {te_pred.shape}')
Prediction for te_data:
array([[0.04951829],
       [0.06306513],
       [0.02074712],
       ...,
       [0.045104  ],
       [0.03391927],
       [0.19637743]], dtype=float32)
Shape = (2000, 1)
CPU times: user 8.65 s, sys: 454 ms, total: 9.1 s
Wall time: 704 ms
[17]:
print(f'OOF score: {roc_auc_score(tr_data[TARGET_NAME].values, oof_pred.data[:, 0])}')
print(f'HOLDOUT score: {roc_auc_score(te_data[TARGET_NAME].values, te_pred.data[:, 0])}')
OOF score: 0.7555279601350346
HOLDOUT score: 0.7310665760869565

4. Model analysis

4.1. Reports

You can obtain the description of the resulting pipeline:

[18]:
print(automl.create_model_str_desc())
Final prediction for new objects (level 0) =
         0.22741 * (5 averaged models Lvl_0_Pipe_0_Mod_0_LinearL2) +
         0.38379 * (5 averaged models Lvl_0_Pipe_1_Mod_0_LightGBM) +
         0.18468 * (5 averaged models Lvl_0_Pipe_1_Mod_1_Tuned_LightGBM) +
         0.20412 * (5 averaged models Lvl_0_Pipe_1_Mod_3_Tuned_CatBoost)

For this purpose LightAutoML also has the ReportDeco decorator - use it to build reports:

[19]:
RD = ReportDeco(output_path = 'tabularAutoML_model_report')

automl_rd = RD(
    TabularAutoML(
        task = task,
        timeout = TIMEOUT,
        cpu_limit = N_THREADS,
        reader_params = {'n_jobs': N_THREADS, 'cv': N_FOLDS, 'random_state': RANDOM_STATE}
    )
)
[20]:
%%time
oof_pred = automl_rd.fit_predict(tr_data, roles = roles, verbose = 1)
[10:43:24] Stdout logging level is INFO.
[10:43:24] Task: binary

[10:43:24] Start automl preset with listed constraints:
[10:43:24] - time: 300.00 seconds
[10:43:24] - CPU: 4 cores
[10:43:24] - memory: 16 GB

[10:43:24] Train data shape: (8000, 122)

[10:43:27] Layer 1 train process start. Time left 296.42 secs
[10:43:28] Start fitting Lvl_0_Pipe_0_Mod_0_LinearL2 ...
[10:43:31] Fitting Lvl_0_Pipe_0_Mod_0_LinearL2 finished. score = 0.7369471853840366
[10:43:31] Lvl_0_Pipe_0_Mod_0_LinearL2 fitting and predicting completed
[10:43:31] Time left 292.42 secs

[10:43:34] Selector_LightGBM fitting and predicting completed
[10:43:35] Start fitting Lvl_0_Pipe_1_Mod_0_LightGBM ...
[10:43:55] Fitting Lvl_0_Pipe_1_Mod_0_LightGBM finished. score = 0.7453257333249144
[10:43:55] Lvl_0_Pipe_1_Mod_0_LightGBM fitting and predicting completed
[10:43:55] Start hyperparameters optimization for Lvl_0_Pipe_1_Mod_1_Tuned_LightGBM ... Time budget is 1.00 secs
[10:44:08] Hyperparameters optimization for Lvl_0_Pipe_1_Mod_1_Tuned_LightGBM completed
[10:44:08] Start fitting Lvl_0_Pipe_1_Mod_1_Tuned_LightGBM ...
[10:44:24] Fitting Lvl_0_Pipe_1_Mod_1_Tuned_LightGBM finished. score = 0.7296807456461208
[10:44:24] Lvl_0_Pipe_1_Mod_1_Tuned_LightGBM fitting and predicting completed
[10:44:24] Start fitting Lvl_0_Pipe_1_Mod_2_CatBoost ...
[10:44:31] Fitting Lvl_0_Pipe_1_Mod_2_CatBoost finished. score = 0.7189925162835304
[10:44:31] Lvl_0_Pipe_1_Mod_2_CatBoost fitting and predicting completed
[10:44:31] Start hyperparameters optimization for Lvl_0_Pipe_1_Mod_3_Tuned_CatBoost ... Time budget is 172.86 secs
[10:47:03] Hyperparameters optimization for Lvl_0_Pipe_1_Mod_3_Tuned_CatBoost completed
[10:47:03] Start fitting Lvl_0_Pipe_1_Mod_3_Tuned_CatBoost ...
[10:47:11] Fitting Lvl_0_Pipe_1_Mod_3_Tuned_CatBoost finished. score = 0.7433999003758547
[10:47:11] Lvl_0_Pipe_1_Mod_3_Tuned_CatBoost fitting and predicting completed
[10:47:11] Time left 72.32 secs

[10:47:11] Layer 1 training completed.

[10:47:11] Blending: optimization starts with equal weights and score 0.7529907972036357
[10:47:12] Blending: iteration 0: score = 0.7552430767490724, weights = [0.17297602 0.45194352 0.12763917 0.         0.24744129]
[10:47:12] Blending: iteration 1: score = 0.7555030859886485, weights = [0.22631824 0.37608108 0.19821008 0.         0.1993906 ]
[10:47:12] Blending: iteration 2: score = 0.7554997906957511, weights = [0.23017529 0.38376114 0.18327494 0.         0.20278871]
[10:47:12] Blending: iteration 3: score = 0.755520519151073, weights = [0.23089828 0.38205293 0.18385063 0.         0.20319813]
[10:47:12] Blending: iteration 4: score = 0.7555106332723811, weights = [0.22740835 0.3837866  0.18468489 0.         0.20412019]
[10:47:12] Automl preset training completed in 228.15 seconds

[10:47:12] Model description:
Final prediction for new objects (level 0) =
         0.22741 * (5 averaged models Lvl_0_Pipe_0_Mod_0_LinearL2) +
         0.38379 * (5 averaged models Lvl_0_Pipe_1_Mod_0_LightGBM) +
         0.18468 * (5 averaged models Lvl_0_Pipe_1_Mod_1_Tuned_LightGBM) +
         0.20412 * (5 averaged models Lvl_0_Pipe_1_Mod_3_Tuned_CatBoost)

CPU times: user 18min 14s, sys: 1min 19s, total: 19min 34s
Wall time: 3min 49s

So the report is now available in the tabularAutoML_model_report folder:

[1]:
!ls tabularAutoML_model_report
lama_interactive_report.html           valid_distribution_of_logits.png
test_distribution_of_logits_1.png      valid_pie_f1_metric.png
test_pie_f1_metric_1.png               valid_pr_curve.png
test_pr_curve_1.png                    valid_preds_distribution_by_bins.png
test_preds_distribution_by_bins_1.png  valid_roc_curve.png
test_roc_curve_1.png
[22]:
%%time

te_pred = automl_rd.predict(te_data)
print(f'Prediction for te_data:\n{te_pred}\nShape = {te_pred.shape}')
Prediction for te_data:
array([[0.04951829],
       [0.06306513],
       [0.02074712],
       ...,
       [0.045104  ],
       [0.03391927],
       [0.19637743]], dtype=float32)
Shape = (2000, 1)
CPU times: user 9.73 s, sys: 425 ms, total: 10.2 s
Wall time: 2.19 s
[23]:
print(f'OOF score: {roc_auc_score(tr_data[TARGET_NAME].values, oof_pred.data[:, 0])}')
print(f'HOLDOUT score: {roc_auc_score(te_data[TARGET_NAME].values, te_pred.data[:, 0])}')
OOF score: 0.7555279601350346
HOLDOUT score: 0.7310665760869565

4.2. Feature importances calculation

For feature importances calculation we have 2 different methods in LightAutoML:

  • Fast (fast) - this method uses the feature importances from the feature selector LGBM model inside LightAutoML. It works extremely fast and almost always (the exceptions are the cases when feature selection is turned off, or when the selector together with all the GBM models was dropped from the final model), and it needs no new labelled data.

  • Accurate (accurate) - this method calculates permutation importances of the features for the whole LightAutoML model based on new labelled data. It always works, but can take a lot of time to finish (depending on the model structure, the size of the new labelled dataset etc.).

In the cells below we will use automl_rd.model instead of automl_rd because we want to take the importances from the model, not from the report. But be careful - anything calculated via automl_rd.model will not go into the report.

[24]:
%%time

# Fast feature importances calculation
fast_fi = automl_rd.model.get_feature_scores('fast')
fast_fi.set_index('Feature')['Importance'].plot.bar(figsize = (30, 10), grid = True)
CPU times: user 145 ms, sys: 158 µs, total: 146 ms
Wall time: 140 ms
[24]:
<AxesSubplot:xlabel='Feature'>
[bar plot of the fast feature importances]
[25]:
%%time

# Accurate feature importances calculation (permutation importances) - can take a long time to compute
accurate_fi = automl_rd.model.get_feature_scores('accurate', te_data, silent = False)
CPU times: user 14min 45s, sys: 1min 4s, total: 15min 50s
Wall time: 1min 13s
[26]:
accurate_fi.set_index('Feature')['Importance'].plot.bar(figsize = (30, 10), grid = True)
[26]:
<AxesSubplot:xlabel='Feature'>
[bar plot of the accurate (permutation) feature importances]

Bonus: where is the automatic report?

As we used automl_rd in our training and prediction cells, the report is already prepared in the folder we specified - check the kernel output folder and find the tabularAutoML_model_report folder with the lama_interactive_report.html report inside (or just click this link for short). It’s interactive, so you can click the black triangles to the left of the text to expand the selected part.

5. Spending more of the TIMEOUT - TabularUtilizedAutoML usage

Using TabularAutoML we did not use up the whole TIMEOUT: with the limit set to 5 minutes, the training above finished in about 234 seconds. To spend (almost) all of the TIMEOUT we can use the TabularUtilizedAutoML preset instead of TabularAutoML; it has the same API:

[27]:
utilized_automl = TabularUtilizedAutoML(
    task = task,
    timeout = 600,
    cpu_limit = N_THREADS,
    reader_params = {'n_jobs': N_THREADS, 'cv': N_FOLDS, 'random_state': RANDOM_STATE},
)
[28]:
%%time

oof_pred = utilized_automl.fit_predict(tr_data, roles = roles, verbose = 1)
[11:20:33] Start automl utilizator with listed constraints:
[11:20:33] - time: 600.00 seconds
[11:20:33] - CPU: 4 cores
[11:20:33] - memory: 16 GB

[11:20:33] If one preset completes earlier, next preset configuration will be started

[11:20:33] ==================================================
[11:20:33] Start 0 automl preset configuration:
[11:20:33] conf_0_sel_type_0.yml, random state: {'reader_params': {'random_state': 42}, 'general_params': {'return_all_predictions': False}}
[11:20:33] Stdout logging level is INFO.
[11:20:33] Task: binary

[11:20:33] Start automl preset with listed constraints:
[11:20:33] - time: 600.00 seconds
[11:20:33] - CPU: 4 cores
[11:20:33] - memory: 16 GB

[11:20:33] Train data shape: (8000, 122)

[11:20:37] Layer 1 train process start. Time left 596.41 secs
[11:20:37] Start fitting Lvl_0_Pipe_0_Mod_0_LinearL2 ...
[11:20:40] Fitting Lvl_0_Pipe_0_Mod_0_LinearL2 finished. score = 0.7369471853840366
[11:20:40] Lvl_0_Pipe_0_Mod_0_LinearL2 fitting and predicting completed
[11:20:40] Time left 593.02 secs

[11:20:41] Start fitting Lvl_0_Pipe_1_Mod_0_LightGBM ...
[11:20:55] Fitting Lvl_0_Pipe_1_Mod_0_LightGBM finished. score = 0.7300502436497049
[11:20:55] Lvl_0_Pipe_1_Mod_0_LightGBM fitting and predicting completed
[11:20:55] Start hyperparameters optimization for Lvl_0_Pipe_1_Mod_1_Tuned_LightGBM ... Time budget is 68.65 secs
[11:22:05] Hyperparameters optimization for Lvl_0_Pipe_1_Mod_1_Tuned_LightGBM completed
[11:22:05] Start fitting Lvl_0_Pipe_1_Mod_1_Tuned_LightGBM ...
[11:22:10] Fitting Lvl_0_Pipe_1_Mod_1_Tuned_LightGBM finished. score = 0.7333493633387822
[11:22:10] Lvl_0_Pipe_1_Mod_1_Tuned_LightGBM fitting and predicting completed
[11:22:10] Start fitting Lvl_0_Pipe_1_Mod_2_CatBoost ...
[11:22:16] Fitting Lvl_0_Pipe_1_Mod_2_CatBoost finished. score = 0.7266430170936409
[11:22:16] Lvl_0_Pipe_1_Mod_2_CatBoost fitting and predicting completed
[11:22:16] Start hyperparameters optimization for Lvl_0_Pipe_1_Mod_3_Tuned_CatBoost ... Time budget is 300.00 secs
[11:24:32] Hyperparameters optimization for Lvl_0_Pipe_1_Mod_3_Tuned_CatBoost completed
[11:24:32] Start fitting Lvl_0_Pipe_1_Mod_3_Tuned_CatBoost ...
[11:24:45] Fitting Lvl_0_Pipe_1_Mod_3_Tuned_CatBoost finished. score = 0.7495838895468844
[11:24:45] Lvl_0_Pipe_1_Mod_3_Tuned_CatBoost fitting and predicting completed
[11:24:45] Time left 348.02 secs

[11:24:45] Layer 1 training completed.

[11:24:45] Blending: optimization starts with equal weights and score 0.7540669760840397
[11:24:45] Blending: iteration 0: score = 0.7563849488878813, weights = [0.20725691 0.08775456 0.24720865 0.         0.45777985]
[11:24:45] Blending: iteration 1: score = 0.7564952880500561, weights = [0.23765722 0.05360391 0.27327472 0.         0.43546414]
[11:24:45] Blending: iteration 2: score = 0.7565193117982754, weights = [0.23026922 0.05421622 0.2750762  0.         0.4404384 ]
[11:24:45] Blending: iteration 3: score = 0.7565031542331014, weights = [0.22985707 0.05424524 0.2752235  0.         0.44067422]
[11:24:45] Blending: iteration 4: score = 0.7565031542331014, weights = [0.22985707 0.05424524 0.2752235  0.         0.44067422]
[11:24:45] Blending: no score update. Terminated

[11:24:45] Automl preset training completed in 252.46 seconds

[11:24:45] Model description:
Final prediction for new objects (level 0) =
         0.22986 * (5 averaged models Lvl_0_Pipe_0_Mod_0_LinearL2) +
         0.05425 * (5 averaged models Lvl_0_Pipe_1_Mod_0_LightGBM) +
         0.27522 * (5 averaged models Lvl_0_Pipe_1_Mod_1_Tuned_LightGBM) +
         0.44067 * (5 averaged models Lvl_0_Pipe_1_Mod_3_Tuned_CatBoost)

[11:24:45] ==================================================
[11:24:45] Start 1 automl preset configuration:
[11:24:45] conf_1_sel_type_1.yml, random state: {'reader_params': {'random_state': 43}, 'general_params': {'return_all_predictions': False}}
[11:24:46] Stdout logging level is INFO.
[11:24:46] Task: binary

[11:24:46] Start automl preset with listed constraints:
[11:24:46] - time: 347.51 seconds
[11:24:46] - CPU: 4 cores
[11:24:46] - memory: 16 GB

[11:24:46] Train data shape: (8000, 122)

[11:24:47] Layer 1 train process start. Time left 346.35 secs
[11:24:47] Start fitting Lvl_0_Pipe_0_Mod_0_LinearL2 ...
[11:24:51] Fitting Lvl_0_Pipe_0_Mod_0_LinearL2 finished. score = 0.7349637379591591
[11:24:51] Lvl_0_Pipe_0_Mod_0_LinearL2 fitting and predicting completed
[11:24:51] Time left 342.11 secs

[11:24:54] Selector_LightGBM fitting and predicting completed
[11:24:54] Start fitting Lvl_0_Pipe_1_Mod_0_LightGBM ...
[11:25:08] Fitting Lvl_0_Pipe_1_Mod_0_LightGBM finished. score = 0.7385323275674212
[11:25:08] Lvl_0_Pipe_1_Mod_0_LightGBM fitting and predicting completed
[11:25:08] Start hyperparameters optimization for Lvl_0_Pipe_1_Mod_1_Tuned_LightGBM ... Time budget is 24.07 secs
[11:25:34] Hyperparameters optimization for Lvl_0_Pipe_1_Mod_1_Tuned_LightGBM completed
[11:25:34] Start fitting Lvl_0_Pipe_1_Mod_1_Tuned_LightGBM ...
[11:25:43] Fitting Lvl_0_Pipe_1_Mod_1_Tuned_LightGBM finished. score = 0.6923898293229618
[11:25:43] Lvl_0_Pipe_1_Mod_1_Tuned_LightGBM fitting and predicting completed
[11:25:43] Start fitting Lvl_0_Pipe_1_Mod_2_CatBoost ...
[11:25:49] Fitting Lvl_0_Pipe_1_Mod_2_CatBoost finished. score = 0.7111199552520485
[11:25:49] Lvl_0_Pipe_1_Mod_2_CatBoost fitting and predicting completed
[11:25:49] Start hyperparameters optimization for Lvl_0_Pipe_1_Mod_3_Tuned_CatBoost ... Time budget is 227.34 secs
[11:29:31] Hyperparameters optimization for Lvl_0_Pipe_1_Mod_3_Tuned_CatBoost completed
[11:29:31] Start fitting Lvl_0_Pipe_1_Mod_3_Tuned_CatBoost ...
[11:29:41] Fitting Lvl_0_Pipe_1_Mod_3_Tuned_CatBoost finished. score = 0.7387838328253268
[11:29:41] Lvl_0_Pipe_1_Mod_3_Tuned_CatBoost fitting and predicting completed
[11:29:41] Time left 51.74 secs

[11:29:41] Layer 1 training completed.

[11:29:41] Blending: optimization starts with equal weights and score 0.7477858289224244
[11:29:41] Blending: iteration 0: score = 0.7495591217002691, weights = [0.23445827 0.3661265  0.16263187 0.         0.23678337]
[11:29:41] Blending: iteration 1: score = 0.7497912803998742, weights = [0.28477693 0.27548397 0.16728112 0.         0.27245793]
[11:29:42] Blending: iteration 2: score = 0.7498005284799412, weights = [0.30106387 0.26921073 0.16347185 0.         0.2662536 ]
[11:29:42] Blending: iteration 3: score = 0.7498005284799412, weights = [0.30106387 0.26921073 0.16347185 0.         0.2662536 ]
[11:29:42] Blending: no score update. Terminated

[11:29:42] Automl preset training completed in 296.13 seconds

[11:29:42] Model description:
Final prediction for new objects (level 0) =
         0.30106 * (5 averaged models Lvl_0_Pipe_0_Mod_0_LinearL2) +
         0.26921 * (5 averaged models Lvl_0_Pipe_1_Mod_0_LightGBM) +
         0.16347 * (5 averaged models Lvl_0_Pipe_1_Mod_1_Tuned_LightGBM) +
         0.26625 * (5 averaged models Lvl_0_Pipe_1_Mod_3_Tuned_CatBoost)

[11:29:42] ==================================================
[11:29:42] Blending: optimization starts with equal weights and score 0.7574169070635984
[11:29:42] Blending: iteration 0: score = 0.7580085715883249, weights = [0.6580179  0.34198216]
[11:29:42] Blending: iteration 1: score = 0.7580085715883249, weights = [0.6580179  0.34198216]
[11:29:42] Blending: no score update. Terminated

CPU times: user 41min 32s, sys: 2min 53s, total: 44min 26s
Wall time: 9min 8s
[29]:
print('oof_pred:\n{}\nShape = {}'.format(oof_pred, oof_pred.shape))
oof_pred:
array([[0.02939807],
       [0.02236847],
       [0.02997774],
       ...,
       [0.02851386],
       [0.1783773 ],
       [0.11076172]], dtype=float32)
Shape = (8000, 1)
[30]:
print(utilized_automl.create_model_str_desc())
Final prediction for new objects =
        0.65802 * 1 averaged models with config = "conf_0_sel_type_0.yml" and different CV random_states. Their structures:

            Model #0.
                ================================================================================
                Final prediction for new objects (level 0) =
                         0.22986 * (5 averaged models Lvl_0_Pipe_0_Mod_0_LinearL2) +
                         0.05425 * (5 averaged models Lvl_0_Pipe_1_Mod_0_LightGBM) +
                         0.27522 * (5 averaged models Lvl_0_Pipe_1_Mod_1_Tuned_LightGBM) +
                         0.44067 * (5 averaged models Lvl_0_Pipe_1_Mod_3_Tuned_CatBoost)
                ================================================================================


        + 0.34198 * 1 averaged models with config = "conf_1_sel_type_1.yml" and different CV random_states. Their structures:

            Model #0.
                ================================================================================
                Final prediction for new objects (level 0) =
                         0.30106 * (5 averaged models Lvl_0_Pipe_0_Mod_0_LinearL2) +
                         0.26921 * (5 averaged models Lvl_0_Pipe_1_Mod_0_LightGBM) +
                         0.16347 * (5 averaged models Lvl_0_Pipe_1_Mod_1_Tuned_LightGBM) +
                         0.26625 * (5 averaged models Lvl_0_Pipe_1_Mod_3_Tuned_CatBoost)
                ================================================================================



Feature importances calculation for TabularUtilizedAutoML:

[31]:
%%time

# Fast feature importances calculation
fast_fi = utilized_automl.get_feature_scores('fast')
fast_fi.set_index('Feature')['Importance'].plot.bar(figsize = (30, 10), grid = True)
CPU times: user 164 ms, sys: 332 µs, total: 164 ms
Wall time: 158 ms
[31]:
<AxesSubplot:xlabel='Feature'>
[bar plot of the fast feature importances for TabularUtilizedAutoML]

Prediction on holdout and metric calculation

[32]:
%%time

te_pred = utilized_automl.predict(te_data)
print(f'Prediction for te_data:\n{te_pred}\nShape = {te_pred.shape}')
Prediction for te_data:
array([[0.06196886],
       [0.07107412],
       [0.02719494],
       ...,
       [0.0573416 ],
       [0.04053508],
       [0.21660838]], dtype=float32)
Shape = (2000, 1)
CPU times: user 15.5 s, sys: 1.1 s, total: 16.6 s
Wall time: 1.17 s
[33]:
print(f'OOF score: {roc_auc_score(tr_data[TARGET_NAME].values, oof_pred.data[:, 0])}')
print(f'HOLDOUT score: {roc_auc_score(te_data[TARGET_NAME].values, te_pred.data[:, 0])}')
OOF score: 0.7580085715883249
HOLDOUT score: 0.7344904891304348

Additional materials
