Tutorial 7: ICE and PDP Interpretation Tutorial

Partial Dependence Plots (PDP) and Individual Conditional Expectation (ICE) plots are two model-agnostic interpretation methods.

Install the library and import the required modules

[1]:
# !pip install lightautoml
[2]:
# Standard python libraries
import os
import time
import requests

# Installed libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
import torch


# Imports from our package
from lightautoml.automl.presets.tabular_presets import TabularAutoML, TabularUtilizedAutoML
from lightautoml.dataset.roles import DatetimeRole
from lightautoml.tasks import Task
[3]:
plt.rcParams.update({'font.size': 20})
sns.set(rc={'figure.figsize':(15, 11)})
sns.set(style="whitegrid", font_scale=1.5)

N_THREADS = 8 # number of threads for LightGBM and linear models
N_FOLDS = 5 # number of CV folds for AutoML
RANDOM_STATE = 42 # fixed random state for reproducibility
TEST_SIZE = 0.2 # hold-out size for the metric check
TIMEOUT = 120 # time limit in seconds for the AutoML run
TARGET_NAME = 'TARGET' # target column name

Prepare data

Download the dataset from the repository if you have not cloned the repository with git.

[4]:
DATASET_DIR = './data/'
DATASET_NAME = 'sampled_app_train.csv'
DATASET_FULLNAME = os.path.join(DATASET_DIR, DATASET_NAME)
DATASET_URL = 'https://raw.githubusercontent.com/AILab-MLTools/LightAutoML/master/examples/data/sampled_app_train.csv'
[5]:
%%time

if not os.path.exists(DATASET_FULLNAME):
    os.makedirs(DATASET_DIR, exist_ok=True)

    dataset = requests.get(DATASET_URL).text
    with open(DATASET_FULLNAME, 'w') as output:
        output.write(dataset)

data = pd.read_csv(DATASET_FULLNAME)
# Convert the (non-positive) DAYS_EMPLOYED offset into an employment start date string
data['EMP_DATE'] = (np.datetime64('2018-01-01') + np.clip(data['DAYS_EMPLOYED'], None, 0).astype(np.dtype('timedelta64[D]'))
                    ).astype(str)
CPU times: user 130 ms, sys: 21.6 ms, total: 152 ms
Wall time: 150 ms
[6]:
train_data, test_data = train_test_split(data,
                                         test_size=TEST_SIZE,
                                         stratify=data[TARGET_NAME],
                                         random_state=RANDOM_STATE)

Create AutoML from preset

Everything in this tutorial also works with lightautoml.automl.presets.tabular_presets.TabularUtilizedAutoML.
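For example, the utilized preset could be constructed in the same way. This is a sketch only: it assumes TabularUtilizedAutoML accepts the same constructor arguments as the TabularAutoML call below, and it is not executed in this tutorial.

# Sketch: the time-utilized preset (assumes the same constructor arguments
# as TabularAutoML below; not executed in this tutorial).
automl_utilized = TabularUtilizedAutoML(
    task=Task('binary'),
    timeout=TIMEOUT,
    cpu_limit=N_THREADS,
    reader_params={'n_jobs': N_THREADS, 'cv': N_FOLDS, 'random_state': RANDOM_STATE},
)
# oof_pred_utilized = automl_utilized.fit_predict(train_data, roles={'target': TARGET_NAME}, verbose=1)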

[7]:
%%time

task = Task('binary')
roles = {'target': TARGET_NAME}

automl = TabularAutoML(task = task,
                       timeout = TIMEOUT,
                       cpu_limit = N_THREADS,
                       reader_params = {'n_jobs': N_THREADS, 'cv': N_FOLDS, 'random_state': RANDOM_STATE},
                      )
oof_pred = automl.fit_predict(train_data, roles = roles, verbose = 1, log_file = 'train.log')
[13:59:12] Stdout logging level is INFO.
[13:59:12] Copying TaskTimer may affect the parent PipelineTimer, so copy will create new unlimited TaskTimer
[13:59:12] Task: binary

[13:59:12] Start automl preset with listed constraints:
[13:59:12] - time: 120.00 seconds
[13:59:12] - CPU: 8 cores
[13:59:12] - memory: 16 GB

[13:59:12] Train data shape: (8000, 123)

[13:59:17] Layer 1 train process start. Time left 115.46 secs
[13:59:17] Start fitting Lvl_0_Pipe_0_Mod_0_LinearL2 ...
[13:59:22] Fitting Lvl_0_Pipe_0_Mod_0_LinearL2 finished. score = 0.7357245254193578
[13:59:22] Lvl_0_Pipe_0_Mod_0_LinearL2 fitting and predicting completed
[13:59:22] Time left 110.46 secs

[13:59:24] Selector_LightGBM fitting and predicting completed
[13:59:25] Start fitting Lvl_0_Pipe_1_Mod_0_LightGBM ...
[13:59:35] Time limit exceeded after calculating fold 3

[13:59:35] Fitting Lvl_0_Pipe_1_Mod_0_LightGBM finished. score = 0.7341198672505939
[13:59:35] Lvl_0_Pipe_1_Mod_0_LightGBM fitting and predicting completed
[13:59:35] Start hyperparameters optimization for Lvl_0_Pipe_1_Mod_1_Tuned_LightGBM ... Time budget is 1.00 secs
[13:59:38] Hyperparameters optimization for Lvl_0_Pipe_1_Mod_1_Tuned_LightGBM completed
[13:59:38] Start fitting Lvl_0_Pipe_1_Mod_1_Tuned_LightGBM ...
[13:59:45] Fitting Lvl_0_Pipe_1_Mod_1_Tuned_LightGBM finished. score = 0.7129725476589708
[13:59:45] Lvl_0_Pipe_1_Mod_1_Tuned_LightGBM fitting and predicting completed
[13:59:45] Start fitting Lvl_0_Pipe_1_Mod_2_CatBoost ...
[13:59:51] Fitting Lvl_0_Pipe_1_Mod_2_CatBoost finished. score = 0.713851327864848
[13:59:51] Lvl_0_Pipe_1_Mod_2_CatBoost fitting and predicting completed
[13:59:51] Start hyperparameters optimization for Lvl_0_Pipe_1_Mod_3_Tuned_CatBoost ... Time budget is 38.95 secs
[14:00:31] Hyperparameters optimization for Lvl_0_Pipe_1_Mod_3_Tuned_CatBoost completed
[14:00:31] Start fitting Lvl_0_Pipe_1_Mod_3_Tuned_CatBoost ...
[14:00:39] Fitting Lvl_0_Pipe_1_Mod_3_Tuned_CatBoost finished. score = 0.7421257913220695
[14:00:39] Lvl_0_Pipe_1_Mod_3_Tuned_CatBoost fitting and predicting completed
[14:00:39] Time left 32.75 secs

[14:00:39] Time limit exceeded in one of the tasks. AutoML will blend level 1 models.

[14:00:39] Layer 1 training completed.

[14:00:39] Blending: optimization starts with equal weights and score 0.7483295522504831
[14:00:39] Blending: iteration 0: score = 0.7506005405555949, weights = [0.25259942 0.2371451  0.20045568 0.         0.3097998 ]
[14:00:39] Blending: iteration 1: score = 0.7506953599512212, weights = [0.29590473 0.21003363 0.1946668  0.         0.29939485]
[14:00:40] Blending: iteration 2: score = 0.7506954662509919, weights = [0.2959104  0.21003765 0.19467053 0.         0.29938143]
[14:00:40] Blending: iteration 3: score = 0.7506949347521377, weights = [0.2959104  0.21003765 0.19467053 0.         0.29938143]
[14:00:40] Blending: no score update. Terminated

[14:00:40] Automl preset training completed in 87.65 seconds

[14:00:40] Model description:
Final prediction for new objects (level 0) =
         0.29591 * (5 averaged models Lvl_0_Pipe_0_Mod_0_LinearL2) +
         0.21004 * (4 averaged models Lvl_0_Pipe_1_Mod_0_LightGBM) +
         0.19467 * (5 averaged models Lvl_0_Pipe_1_Mod_1_Tuned_LightGBM) +
         0.29938 * (5 averaged models Lvl_0_Pipe_1_Mod_3_Tuned_CatBoost)

CPU times: user 10min 4s, sys: 47.5 s, total: 10min 51s
Wall time: 1min 27s
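As a quick sanity check (not part of the original notebook output), the imported roc_auc_score can be used on the out-of-fold predictions; this assumes oof_pred.data[:, 0] holds the predicted probability of the positive class, which is how preset predictions are usually accessed.

# Sanity check of the out-of-fold predictions (assumes the first prediction
# column holds the probability of the positive class).
print(f'OOF ROC-AUC: {roc_auc_score(train_data[TARGET_NAME].values, oof_pred.data[:, 0]):.4f}')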

Calculate interpretation data

ICE shows the functional relationship between the predicted response and the feature separately for each instance. PDP averages the individual lines of an ICE plot.
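To make these definitions concrete, the same curves could be computed by hand: fix the feature at each grid value for every test row, predict, and average. Below is a minimal sketch, assuming automl.predict(...).data[:, 0] holds the predicted probabilities; the built-in helpers used in the rest of this section return this kind of data (grid, per-instance predictions, and bin counts) directly.

# Minimal manual ICE/PDP sketch for one numeric feature (assumes
# automl.predict(...).data[:, 0] holds the predicted probabilities).
feature = 'DAYS_BIRTH'
sample = test_data.sample(n=500, random_state=RANDOM_STATE)  # small subsample keeps the sketch fast
grid_manual = np.linspace(sample[feature].min(), sample[feature].max(), 10)

ice_curves = []
for value in grid_manual:
    modified = sample.copy()
    modified[feature] = value                                # fix the feature at one grid value
    ice_curves.append(automl.predict(modified).data[:, 0])   # one prediction per instance

ice_curves = np.array(ice_curves)    # shape: (n_grid_points, n_instances)
pdp_curve = ice_curves.mean(axis=1)  # the PDP is the average of the ICE curves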

Numeric features

For numeric features you can specify n_bins, the number of bins into which the range of feature values is divided.

Calculate the data for the ICE and PDP plots manually:

[8]:
%%time

grid, ys, counts = automl.get_individual_pdp(test_data, feature_name='DAYS_BIRTH', n_bins=30)
100%|██████████| 30/30 [00:15<00:00,  1.92it/s]
CPU times: user 3min 53s, sys: 19 s, total: 4min 12s
Wall time: 15.7 s
[9]:
%%time

X = np.array([item.ravel() for item in ys]).T  # one row per instance, one column per grid point

plt.figure(figsize=(15, 11))
plt.plot(grid, X[0], alpha=0.05, color='m', label='ICE plots')  # first curve carries the legend label
for i in range(1, X.shape[0]):
    plt.plot(grid, X[i], alpha=0.05, color='b')
plt.plot(grid, X.mean(axis=0), linewidth=2, color='r', label='PDP mean')  # average of the ICE curves
plt.legend()
plt.show()
../../_images/pages_tutorials_Tutorial_7_ICE_and_PDP_interpretation_20_0.png
CPU times: user 2.8 s, sys: 75.3 ms, total: 2.87 s
Wall time: 2.86 s
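The third returned value, counts, was not used above. Assuming it holds the number of test objects that fall into each grid bin (an assumption worth checking against the docstring), it can be drawn under the PDP to show how well each part of the grid is supported by data:

# Sketch: show grid support under the PDP (assumes `counts` holds the number
# of test objects per grid bin).
fig, (ax_pdp, ax_cnt) = plt.subplots(2, 1, figsize=(15, 11), sharex=True,
                                     gridspec_kw={'height_ratios': [3, 1]})
ax_pdp.plot(grid, X.mean(axis=0), linewidth=2, color='r', label='PDP mean')
ax_pdp.legend()
ax_cnt.bar(grid, counts, width=(grid[1] - grid[0]) * 0.9, color='gray')
ax_cnt.set_xlabel('DAYS_BIRTH')
ax_cnt.set_ylabel('objects per bin')
plt.show()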

Built-in function:

[10]:
automl.plot_pdp(test_data, feature_name='DAYS_BIRTH')
100%|██████████| 30/30 [00:16<00:00,  1.87it/s]
../../_images/pages_tutorials_Tutorial_7_ICE_and_PDP_interpretation_22_1.png
[11]:
automl.plot_pdp(test_data, feature_name='DAYS_BIRTH', individual=True)
100%|██████████| 30/30 [00:15<00:00,  1.90it/s]
../../_images/pages_tutorials_Tutorial_7_ICE_and_PDP_interpretation_23_1.png

Categorical features

[12]:
%%time

automl.plot_pdp(test_data, feature_name='ORGANIZATION_TYPE')
100%|██████████| 10/10 [00:05<00:00,  1.95it/s]
../../_images/pages_tutorials_Tutorial_7_ICE_and_PDP_interpretation_25_1.png
CPU times: user 1min 26s, sys: 8.09 s, total: 1min 34s
Wall time: 6.13 s
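The same per-category data can be collected manually with get_individual_pdp and summarized with boxplots. This is a sketch, assuming that for a categorical feature grid holds the displayed category values and ys the per-instance predictions for each of them:

# Sketch: manual per-category summary (assumes `grid` holds the category values
# and `ys` the corresponding per-instance predictions).
grid_cat, ys_cat, counts_cat = automl.get_individual_pdp(test_data, feature_name='ORGANIZATION_TYPE')

plt.figure(figsize=(15, 11))
plt.boxplot([y.ravel() for y in ys_cat], labels=grid_cat, showfliers=False)
plt.xticks(rotation=90)
plt.ylabel('predicted probability')
plt.show()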

Datetime features

For datetime features you can specify the grouping level via datetime_level; allowed values are 'year', 'month' and 'dayofweek'.

[13]:
%%time

automl.plot_pdp(test_data, feature_name='EMP_DATE', datetime_level='year')
100%|██████████| 45/45 [00:23<00:00,  1.88it/s]
../../_images/pages_tutorials_Tutorial_7_ICE_and_PDP_interpretation_28_1.png
CPU times: user 5min 56s, sys: 27.6 s, total: 6min 23s
Wall time: 26.1 s
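The other grouping levels work the same way; for example, the employment date grouped by month (not executed in this tutorial):

# Same plot, grouped by month instead of year.
# automl.plot_pdp(test_data, feature_name='EMP_DATE', datetime_level='month')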