Tutorial 7: ICE and PDP Interpretation

The official LightAutoML GitHub repository is here.

Partial dependence plots (PDP) and individual conditional expectation (ICE) plots are two model-agnostic interpretation methods (see details here).

Install the library and import dependencies

[1]:
# !pip install lightautoml
[2]:
# Standard python libraries
import os
import requests

# Installed libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split


# Imports from our package
from lightautoml.automl.presets.tabular_presets import TabularAutoML
from lightautoml.tasks import Task
[3]:
plt.rcParams.update({'font.size': 20})
sns.set(rc={'figure.figsize':(15, 11)})
sns.set(style="whitegrid", font_scale=1.5)

N_THREADS = 8 # threads cnt for lgbm and linear models
N_FOLDS = 5 # folds cnt for AutoML
RANDOM_STATE = 42 # fixed random state for various reasons
TEST_SIZE = 0.2 # Test size for metric check
TIMEOUT = 120 # Time in seconds for automl run
TARGET_NAME = 'TARGET' # Target column name

Prepare data

Download the dataset from the repository if you have not cloned the repository with git.

[4]:
DATASET_DIR = './data/'
DATASET_NAME = 'sampled_app_train.csv'
DATASET_FULLNAME = os.path.join(DATASET_DIR, DATASET_NAME)
DATASET_URL = 'https://raw.githubusercontent.com/AILab-MLTools/LightAutoML/master/examples/data/sampled_app_train.csv'
[5]:
%%time

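# Download the dataset only if it is not already present locally
if not os.path.exists(DATASET_FULLNAME):
    os.makedirs(DATASET_DIR, exist_ok=True)

    response = requests.get(DATASET_URL)
    response.raise_for_status()  # fail early on a bad HTTP response
    with open(DATASET_FULLNAME, 'w') as output:
        output.write(response.text)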

data = pd.read_csv(DATASET_FULLNAME)
data['EMP_DATE'] = (np.datetime64('2018-01-01') + np.clip(data['DAYS_EMPLOYED'], None, 0).astype(np.dtype('timedelta64[D]'))
                    ).astype(str)
CPU times: user 223 ms, sys: 52.9 ms, total: 276 ms
Wall time: 503 ms
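
The EMP_DATE column built above is a synthetic date feature: DAYS_EMPLOYED is clipped so that no value exceeds 0, and the resulting day offsets are added to the reference date 2018-01-01. The sketch below reproduces that idea on a few made-up values. It uses pd.to_timedelta and strftime instead of the astype calls in the cell, on the assumption that both routes give the same dates; the input values are hypothetical.

```python
import pandas as pd

# Hypothetical DAYS_EMPLOYED values: negative numbers are days before the
# reference date; the large positive number stands for any positive entry.
days_employed = pd.Series([-365, -1, 0, 365243])

# Clip positive values to 0 so they map to the reference date itself.
clipped = days_employed.clip(upper=0)

# Turn day counts into timedeltas and shift the reference date by them.
emp_date = (pd.Timestamp('2018-01-01') + pd.to_timedelta(clipped, unit='D')).dt.strftime('%Y-%m-%d')

print(emp_date.tolist())
```

Every positive offset collapses onto the reference date, so the derived column never contains dates after 2018-01-01.
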
[6]:
train_data, test_data = train_test_split(data,
                                         test_size=TEST_SIZE,
                                         stratify=data[TARGET_NAME],
                                         random_state=RANDOM_STATE)

Create AutoML from preset

The interpretation methods used in this tutorial also work with lightautoml.automl.presets.tabular_presets.TabularUtilizedAutoML.

[7]:
%%time

task = Task('binary')
roles = {'target': TARGET_NAME}

automl = TabularAutoML(
    task=task,
    timeout=TIMEOUT,
    cpu_limit=N_THREADS,
    reader_params={'n_jobs': N_THREADS, 'cv': N_FOLDS, 'random_state': RANDOM_STATE},
)
oof_pred = automl.fit_predict(train_data, roles=roles, verbose=1, log_file='train.log')
[16:58:33] Stdout logging level is INFO.
[16:58:33] Copying TaskTimer may affect the parent PipelineTimer, so copy will create new unlimited TaskTimer
[16:58:33] Task: binary

[16:58:33] Start automl preset with listed constraints:
[16:58:33] - time: 120.00 seconds
[16:58:33] - CPU: 8 cores
[16:58:33] - memory: 16 GB

[16:58:33] Train data shape: (8000, 123)

[16:58:36] Layer 1 train process start. Time left 117.58 secs
[16:58:36] Start fitting Lvl_0_Pipe_0_Mod_0_LinearL2 ...
[16:58:40] Fitting Lvl_0_Pipe_0_Mod_0_LinearL2 finished. score = 0.7340989893230383
[16:58:40] Lvl_0_Pipe_0_Mod_0_LinearL2 fitting and predicting completed
[16:58:40] Time left 112.94 secs

[16:58:43] Selector_LightGBM fitting and predicting completed
[16:58:44] Start fitting Lvl_0_Pipe_1_Mod_0_LightGBM ...
[16:58:53] Time limit exceeded after calculating fold 3

[16:58:53] Fitting Lvl_0_Pipe_1_Mod_0_LightGBM finished. score = 0.7336652733096534
[16:58:53] Lvl_0_Pipe_1_Mod_0_LightGBM fitting and predicting completed
[16:58:53] Start hyperparameters optimization for Lvl_0_Pipe_1_Mod_1_Tuned_LightGBM ... Time budget is 1.00 secs
[16:59:03] Hyperparameters optimization for Lvl_0_Pipe_1_Mod_1_Tuned_LightGBM completed
[16:59:03] Start fitting Lvl_0_Pipe_1_Mod_1_Tuned_LightGBM ...
[16:59:16] Fitting Lvl_0_Pipe_1_Mod_1_Tuned_LightGBM finished. score = 0.7146425170595188
[16:59:16] Lvl_0_Pipe_1_Mod_1_Tuned_LightGBM fitting and predicting completed
[16:59:16] Start fitting Lvl_0_Pipe_1_Mod_2_CatBoost ...
[16:59:21] Fitting Lvl_0_Pipe_1_Mod_2_CatBoost finished. score = 0.7180592042951911
[16:59:21] Lvl_0_Pipe_1_Mod_2_CatBoost fitting and predicting completed
[16:59:21] Start hyperparameters optimization for Lvl_0_Pipe_1_Mod_3_Tuned_CatBoost ... Time budget is 29.20 secs
[16:59:51] Hyperparameters optimization for Lvl_0_Pipe_1_Mod_3_Tuned_CatBoost completed
[16:59:51] Start fitting Lvl_0_Pipe_1_Mod_3_Tuned_CatBoost ...
[16:59:58] Fitting Lvl_0_Pipe_1_Mod_3_Tuned_CatBoost finished. score = 0.7424781750625415
[16:59:58] Lvl_0_Pipe_1_Mod_3_Tuned_CatBoost fitting and predicting completed
[16:59:58] Time left 35.17 secs

[16:59:58] Time limit exceeded in one of the tasks. AutoML will blend level 1 models.

[16:59:58] Layer 1 training completed.

[16:59:58] Blending: optimization starts with equal weights and score 0.7470969001073415
[16:59:58] Blending: iteration 0: score = 0.7483672886691461, weights = [0.18754406 0.1279657  0.37286162 0.06386749 0.24776113]
[16:59:58] Blending: iteration 1: score = 0.7484541355819561, weights = [0.23439428 0.12674679 0.31599942 0.06325912 0.25960034]
[16:59:59] Blending: iteration 2: score = 0.748450627689517, weights = [0.23445104 0.1267374  0.315976   0.06325444 0.25958112]
[16:59:59] Blending: iteration 3: score = 0.748450627689517, weights = [0.23445104 0.1267374  0.315976   0.06325444 0.25958112]
[16:59:59] Blending: no score update. Terminated

[16:59:59] Automl preset training completed in 85.25 seconds

[16:59:59] Model description:
Final prediction for new objects (level 0) =
         0.23445 * (5 averaged models Lvl_0_Pipe_0_Mod_0_LinearL2) +
         0.12674 * (4 averaged models Lvl_0_Pipe_1_Mod_0_LightGBM) +
         0.31598 * (5 averaged models Lvl_0_Pipe_1_Mod_1_Tuned_LightGBM) +
         0.06325 * (5 averaged models Lvl_0_Pipe_1_Mod_2_CatBoost) +
         0.25958 * (5 averaged models Lvl_0_Pipe_1_Mod_3_Tuned_CatBoost)

CPU times: user 10min 7s, sys: 49.1 s, total: 10min 56s
Wall time: 1min 25s
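
The model description in the log says the final prediction is a weighted sum of the five level-0 models. As a rough illustration of that arithmetic, the sketch below applies the logged weights to made-up per-model probabilities. The prediction values are hypothetical; only the weights come from the log.

```python
import numpy as np

# Blend weights copied from the "Model description" in the log above.
weights = np.array([0.23445, 0.12674, 0.31598, 0.06325, 0.25958])

# Hypothetical probabilities for two objects (rows) from the five
# level-0 models (columns); these numbers are made up for illustration.
model_preds = np.array([
    [0.10, 0.12, 0.08, 0.15, 0.09],
    [0.60, 0.55, 0.70, 0.50, 0.65],
])

# Final prediction = weighted sum of the individual model outputs.
blended = model_preds @ weights

print(weights.sum())
print(blended)
```

Because the weights sum to approximately 1, each blended value stays between the smallest and largest individual model prediction for that object.
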

Calculate interpretation data

ICE shows the functional relationship between the predicted response and the feature separately for each instance. PDP averages the individual lines of an ICE plot.
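
To make the relationship between the two concrete, here is a minimal sketch that computes ICE curves and the PDP by hand with NumPy. The predict function and the data are hypothetical stand-ins for a fitted model and a test set; this is not how LightAutoML implements it internally.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 rows, 2 features (a stand-in for a real test set).
X = rng.normal(size=(200, 2))

def predict(X):
    """Hypothetical fitted model: any function mapping rows to scores."""
    return 1.0 / (1.0 + np.exp(-(1.5 * X[:, 0] + 0.5 * X[:, 0] * X[:, 1])))

feature_idx = 0
grid = np.linspace(X[:, feature_idx].min(), X[:, feature_idx].max(), 10)

# ICE: for each grid value, overwrite the feature in every row and predict.
ice = np.empty((X.shape[0], grid.size))
for j, value in enumerate(grid):
    X_mod = X.copy()
    X_mod[:, feature_idx] = value
    ice[:, j] = predict(X_mod)

# PDP: the pointwise mean of the ICE curves.
pdp = ice.mean(axis=0)

print(ice.shape, pdp.shape)
```

Each row of ice is one object's curve. Because the toy model includes an interaction term, the curves differ in slope, which is exactly the information the averaged PDP hides.
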

Numeric features

For numeric features you can specify n_bins, the number of bins into which the range of feature values is divided.
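
One plausible way to turn n_bins into a grid of evaluation points is to split the observed range into equal-width bins and use the bin midpoints, as sketched below. This is an assumption for illustration and may differ from what the library does internally; the feature values are randomly generated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical numeric feature values (a stand-in for a real column).
values = rng.uniform(-25000, -7500, size=1000)

n_bins = 30

# Split the observed range into n_bins equal-width bins and count rows per bin.
counts, bin_edges = np.histogram(values, bins=n_bins)

# Use each bin's midpoint as the grid value at which predictions are evaluated.
grid = (bin_edges[:-1] + bin_edges[1:]) / 2

print(grid.shape, counts.sum())
```

The per-bin counts show how much data supports each grid point, which helps when judging how far to trust the curve in sparse regions.
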

Calculate data for PDP plot manually:

[8]:
%%time

grid, ys, counts = automl.get_individual_pdp(test_data, feature_name='DAYS_BIRTH', n_bins=30)
100%|██████████| 30/30 [00:18<00:00,  1.63it/s]
CPU times: user 2min 2s, sys: 7.35 s, total: 2min 9s
Wall time: 18.4 s

[9]:
%%time

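# Stack the per-bin predictions into an array of shape (n_samples, n_bins)
X = np.array([item.ravel() for item in ys]).T

plt.figure(figsize=(15, 11))
# One faint line per test object (ICE); only the first gets a legend label
plt.plot(grid, X[0], alpha=0.05, color='b', label='ICE plots')
for i in range(1, X.shape[0]):
    plt.plot(grid, X[i], alpha=0.05, color='b')
# The mean over objects at each grid point is the PDP
plt.plot(grid, X.mean(axis=0), linewidth=2, color='r', label='PDP mean')
plt.legend()
plt.show()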
[Figure: ICE curves and PDP mean for DAYS_BIRTH]
CPU times: user 5.9 s, sys: 3.63 s, total: 9.53 s
Wall time: 2.46 s

Alternatively, use the built-in plot_pdp function:

[10]:
automl.plot_pdp(test_data, feature_name='DAYS_BIRTH')
100%|██████████| 30/30 [00:17<00:00,  1.67it/s]
[Figure: PDP for DAYS_BIRTH]
[11]:
automl.plot_pdp(test_data, feature_name='DAYS_BIRTH', individual=True)
100%|██████████| 30/30 [00:18<00:00,  1.63it/s]
[Figure: PDP for DAYS_BIRTH with individual=True]

Categorical features

[12]:
%%time

automl.plot_pdp(test_data, feature_name='ORGANIZATION_TYPE')
100%|██████████| 10/10 [00:05<00:00,  1.69it/s]
[Figure: PDP for ORGANIZATION_TYPE]
CPU times: user 43.8 s, sys: 2.54 s, total: 46.4 s
Wall time: 6.87 s
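
For a categorical feature the grid consists of category values rather than numeric bins. The sketch below illustrates the general idea on toy data: pick the most frequent categories, assign each one to every row in turn, and average the predictions. The data, the predict function, and the top_n value are hypothetical, and the library's own category selection may work differently.

```python
import pandas as pd

# Hypothetical data: one categorical column and one numeric column.
df = pd.DataFrame({
    'org': ['School', 'Bank', 'School', 'Police', 'Bank', 'School'],
    'income': [1.0, 3.0, 1.5, 2.0, 2.5, 1.2],
})

def predict(frame):
    """Hypothetical model: the score depends on the category and on income."""
    bonus = frame['org'].map({'School': 0.1, 'Bank': 0.3, 'Police': 0.2})
    return (bonus + 0.1 * frame['income']).to_numpy()

top_n = 2  # number of most frequent categories to evaluate (illustrative)
categories = df['org'].value_counts().index[:top_n].tolist()

pdp = {}
for cat in categories:
    modified = df.copy()
    modified['org'] = cat                # give every row the same category
    pdp[cat] = predict(modified).mean()  # average prediction = PDP value

print(pdp)
```

Restricting the grid to frequent categories keeps the number of model calls bounded for high-cardinality columns.
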

Datetime features

For datetime features you can specify the grouping level with datetime_level; the allowed values are year, month, and dayofweek.
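
As a rough illustration of what these three levels mean, the sketch below extracts each of them from a few dates with pandas and builds the grid of distinct years. The dates are made up, and this only shows the grouping idea, not the library's internal procedure.

```python
import pandas as pd

# Hypothetical date strings in a YYYY-MM-DD format.
dates = pd.to_datetime(pd.Series(['2017-01-01', '2017-12-31', '2018-01-01', '2015-06-15']))

# Each level maps a date to the value used for grouping.
levels = {
    'year': dates.dt.year,
    'month': dates.dt.month,
    'dayofweek': dates.dt.dayofweek,  # Monday=0 ... Sunday=6
}

# The grid for a level is the sorted set of distinct values at that level.
year_grid = sorted(levels['year'].unique().tolist())

print(year_grid)
```

With the year level the grid grows with the time span of the data, whereas month and dayofweek are bounded at 12 and 7 values.
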

[13]:
%%time

automl.plot_pdp(test_data, feature_name='EMP_DATE', datetime_level='year')
100%|██████████| 45/45 [00:27<00:00,  1.63it/s]
[Figure: PDP for EMP_DATE with datetime_level='year']
CPU times: user 3min 2s, sys: 10.2 s, total: 3min 12s
Wall time: 29.4 s