Tutorial 2: AutoWoE (WhiteBox model for binary classification on tabular data)

Scorecard

WB0

Linear model

WB1

Discretization

WB2

Selection and One-dimensional analysis

WB3

Whitebox pipeline:

General parameters

  1. Technical

    • n_jobs

    • debug

  2. Simple features typing and initial cleaning

    2.1. Remove trash features

    Medium:
        - th_nan
        - th_const
    

    2.2. Typing (auto or user defined)

    Critical:
        - features_type (dict) {'age': 'real', 'education': 'cat', 'birth_date': (None, ("d", "wd")), ...}
    

    2.3. Categories and datetimes encoding

    Critical:
        - features_type (for datetimes)
    
    Optional:
        - cat_alpha (int) - greater means more conservative encoding
    
  3. Pre selection (based on BlackBox model importances)

    • Critical:

      • select_type (None or int)

      • imp_type ('perm_imp' or 'feature_imp'; used if select_type is an int)

    • Optional:

      • imp_th (float) - importance threshold, used when select_type is None

  4. Binning (discretization)

    • Critical:

      • monotonic / features_monotone_constraints

      • max_bin_count / max_bin_count

      • min_bin_size

      • cat_merge_to

      • nan_merge_to

    • Medium:

      • force_single_split

    • Optional:

      • min_bin_mults

      • min_gains_to_split

  5. WoE estimation, with WoE = ln( ((% 0 in bin) / (% 0 in sample)) / ((% 1 in bin) / (% 1 in sample)) ):

    • Critical:

      • oof_woe

    • Optional:

      • woe_diff_th

      • n_folds (if oof_woe)

  6. 2nd selection stage:

    6.1. One-dimensional importance

    Critical:
        - auc_th
    

    6.2. VIF

    Critical:
        - vif_th
    

    6.3. Partial correlations

    Critical:
        - pearson_th
    
  7. 3rd selection stage (model based)

    • Optional:

      • n_folds

      • l1_base_step

      • l1_exp_step

    • Do not touch:

      • population_size

      • feature_groups_count

  8. Fitting the final model

    • Critical:

      • regularized_refit

      • p_val (if not regularized_refit)

      • validation (if not regularized_refit)

    • Optional:

      • interpreted_model

      • l1_base_step (if regularized_refit)

      • l1_exp_step (if regularized_refit)

  9. Report generation

    • report_params
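
The WoE formula in step 5 can be illustrated with a small hand calculation. The sketch below is not AutoWoE's estimator (which may differ, for example through out-of-fold estimation when oof_woe is set); the function name `woe_by_bin` and the toy data are hypothetical. It uses the algebraically equivalent form ln( (share of all 0s that fall in the bin) / (share of all 1s that fall in the bin) ) and assumes every bin contains both classes.

```python
import math
from collections import Counter


def woe_by_bin(bins, targets):
    """Return {bin_label: WoE} for a binary target (hypothetical helper).

    Assumes every bin contains at least one 0 and one 1; no smoothing is applied.
    """
    total = Counter(targets)  # number of 0s and 1s in the whole sample
    per_bin = {}
    for label, target in zip(bins, targets):
        per_bin.setdefault(label, Counter())[target] += 1

    woe = {}
    for label, counts in per_bin.items():
        share_0 = counts[0] / total[0]  # share of all 0s that fall in this bin
        share_1 = counts[1] / total[1]  # share of all 1s that fall in this bin
        woe[label] = math.log(share_0 / share_1)
    return woe


# Toy data: the "low" bin is rich in 1s, the "high" bin is rich in 0s.
bins = ["low"] * 4 + ["high"] * 6
targets = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
woe = woe_by_bin(bins, targets)
print(woe)
```

With this sign convention, a bin that holds relatively more 1s than the sample as a whole gets a negative WoE, and a bin that holds relatively more 0s gets a positive one.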

Imports

[1]:
import pandas as pd
from pandas import Series, DataFrame

import numpy as np

import os
import requests
import joblib

from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

from autowoe import AutoWoE, ReportDeco

Reading the data and train/test split

[2]:
DATASET_DIR = '../data/'
DATASET_NAME = 'jobs_train.csv'
DATASET_FULLNAME = os.path.join(DATASET_DIR, DATASET_NAME)
DATASET_URL = 'https://raw.githubusercontent.com/AILab-MLTools/LightAutoML/master/examples/data/jobs_train.csv'
[3]:
%%time

if not os.path.exists(DATASET_FULLNAME):
    os.makedirs(DATASET_DIR, exist_ok=True)

    dataset = requests.get(DATASET_URL).text
    with open(DATASET_FULLNAME, 'w') as output:
        output.write(dataset)
CPU times: user 14 µs, sys: 12 µs, total: 26 µs
Wall time: 62 µs
[2]:
data = pd.read_csv(DATASET_FULLNAME)
[3]:
data
[3]:
enrollee_id city city_development_index gender relevent_experience enrolled_university education_level major_discipline experience company_size company_type last_new_job training_hours target
0 8949 city_103 0.920 Male Has relevent experience no_enrollment Graduate STEM 21.0 NaN NaN 1.0 36 1.0
1 29725 city_40 0.776 Male No relevent experience no_enrollment Graduate STEM 15.0 99.0 Pvt Ltd 5.0 47 0.0
2 11561 city_21 0.624 NaN No relevent experience Full time course Graduate STEM 5.0 NaN NaN 0.0 83 0.0
3 33241 city_115 0.789 NaN No relevent experience NaN Graduate Business Degree 0.0 NaN Pvt Ltd 0.0 52 1.0
4 666 city_162 0.767 Male Has relevent experience no_enrollment Masters STEM 21.0 99.0 Funded Startup 4.0 8 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
19153 7386 city_173 0.878 Male No relevent experience no_enrollment Graduate Humanities 14.0 NaN NaN 1.0 42 1.0
19154 31398 city_103 0.920 Male Has relevent experience no_enrollment Graduate STEM 14.0 NaN NaN 4.0 52 1.0
19155 24576 city_103 0.920 Male Has relevent experience no_enrollment Graduate STEM 21.0 99.0 Pvt Ltd 4.0 44 0.0
19156 5756 city_65 0.802 Male Has relevent experience no_enrollment High School NaN 0.0 999.0 Pvt Ltd 2.0 97 0.0
19157 23834 city_67 0.855 NaN No relevent experience no_enrollment Primary School NaN 2.0 NaN NaN 1.0 127 0.0

19158 rows × 14 columns

[4]:
train, test = train_test_split(data.drop('enrollee_id', axis=1), test_size=0.2, stratify=data['target'])

AutoWoE: default settings

[5]:
auto_woe_0 = AutoWoE(interpreted_model=True,
                     monotonic=False,
                     max_bin_count=5,
                     select_type=None,
                     pearson_th=0.9,
                     auc_th=.505,
                     vif_th=10.,
                     imp_th=0,
                     th_const=32,
                     force_single_split=True,
                     th_nan=0.01,
                     th_cat=0.005,
                     auc_tol=1e-4,
                     cat_alpha=100,
                     cat_merge_to="to_woe_0",
                     nan_merge_to="to_woe_0",
                     imp_type="feature_imp",
                     regularized_refit=False,
                     p_val=0.05,
                     verbose=2
        )

auto_woe_0 = ReportDeco(auto_woe_0, )
[6]:
auto_woe_0.fit(train,
               target_name="target",
              )
city processing...
city_development_index processing...
gender processing...
relevent_experience processing...
enrolled_university processing...
education_level processing...
experience processing...
company_size processing...
company_type processing...
last_new_job processing...
training_hours processing...
dict_keys(['city', 'city_development_index', 'gender', 'relevent_experience', 'enrolled_university', 'education_level', 'experience', 'company_size', 'company_type', 'last_new_job', 'training_hours']) to selector !!!!!
Feature selection...
city_development_index   -0.974107
company_size             -0.795953
company_type             -0.400146
experience               -0.184238
enrolled_university      -0.251287
education_level          -1.188926
dtype: float64
[7]:
test_prediction = auto_woe_0.predict_proba(test)
test_prediction
[7]:
array([0.06265852, 0.56483877, 0.04151965, ..., 0.15191705, 0.08528486,
       0.0409943 ])
[8]:
roc_auc_score(test['target'].values, test_prediction)
[8]:
0.8034365349304012
[9]:
report_params = {"output_path": "HR_REPORT_1", # folder for report generation
                 "report_name": "WHITEBOX REPORT",
                 "report_version_id": 1,
                 "city": "Moscow",
                 "model_aim": "Predict if candidate will work for the company",
                 "model_name": "HR model",
                 "zakazchik": "Kaggle",
                 "high_level_department": "Ai Lab",
                 "ds_name": "Btbpanda",
                 "target_descr": "Candidate will work for the company",
                 "non_target_descr": "Candidate will not work for the company"}

auto_woe_0.generate_report(report_params, )
No handles with labels found to put in legend.

AutoWoE - simpler model

[10]:
auto_woe_1 = AutoWoE(interpreted_model=True,
                     monotonic=True,
                     max_bin_count=4,
                     select_type=None,
                     pearson_th=0.9,
                     auc_th=.505,
                     vif_th=10.,
                     imp_th=0,
                     th_const=32,
                     force_single_split=True,
                     th_nan=0.01,
                     th_cat=0.005,
                     auc_tol=1e-4,
                     cat_alpha=100,
                     cat_merge_to="to_woe_0",
                     nan_merge_to="to_woe_0",
                     imp_type="feature_imp",
                     regularized_refit=False,
                     p_val=0.05,
                     verbose=2
        )

auto_woe_1 = ReportDeco(auto_woe_1, )
[11]:
auto_woe_1.fit(train,
               target_name="target",
              )
city processing...
city_development_index processing...
gender processing...
relevent_experience processing...
enrolled_university processing...
education_level processing...
experience processing...
company_type processing...
company_size processing...
last_new_job processing...
training_hours processing...
dict_keys(['city', 'city_development_index', 'gender', 'relevent_experience', 'enrolled_university', 'education_level', 'experience', 'company_size', 'company_type', 'last_new_job', 'training_hours']) to selector !!!!!
Feature selection...
city                     -0.516274
city_development_index   -0.512608
company_size             -0.814922
company_type             -0.397978
experience               -0.175231
enrolled_university      -0.219507
education_level          -1.239627
dtype: float64
[12]:
test_prediction = auto_woe_1.predict_proba(test)
test_prediction
[12]:
array([0.06460692, 0.57321671, 0.0497262 , ..., 0.13746553, 0.07190761,
       0.04153373])
[13]:
roc_auc_score(test['target'].values, test_prediction)
[13]:
0.8019815944109903
[14]:
report_params = {"output_path": "HR_REPORT_2", # folder for report generation
                 "report_name": "WHITEBOX REPORT",
                 "report_version_id": 2,
                 "city": "Moscow",
                 "model_aim": "Predict if candidate will work for the company",
                 "model_name": "HR model",
                 "zakazchik": "Kaggle",
                 "high_level_department": "Ai Lab",
                 "ds_name": "Btbpanda",
                 "target_descr": "Candidate will work for the company",
                 "non_target_descr": "Candidate will not work for the company"}

auto_woe_1.generate_report(report_params, )
No handles with labels found to put in legend.

WhiteBox preset - like TabularAutoML

[15]:
from lightautoml.automl.presets.whitebox_presets import WhiteBoxPreset
from lightautoml.tasks import Task
[16]:
task = Task('binary')
automl = WhiteBoxPreset(task)
[17]:

train_pred = automl.fit_predict(train.reset_index(drop=True), roles={'target': 'target'})
Validation data is not set. Train will be used as valid in report and valid prediction
Start automl preset with listed constraints:
- time: 3600 seconds
- cpus: 4 cores
- memory: 16 gb

Train data shape: (15326, 13)
Feats was rejected during automatic roles guess: []


Layer 1 ...
Train process start. Time left 3595.0072581768036 secs
Start fitting Lvl_0_Pipe_0_Mod_0_WhiteBox ...

===== Start working with fold 0 for Lvl_0_Pipe_0_Mod_0_WhiteBox =====

 features [] contain too many nans or identical values
 features [] have low importance
city processing...
city_development_index processing...
company_type processing...
education_level processing...
enrolled_university processing...
gender processing...
major_discipline processing...
relevent_experience processing...
company_size processing...
experience processing...
last_new_job processing...
training_hours processing...
dict_keys(['city', 'city_development_index', 'company_type', 'education_level', 'enrolled_university', 'gender', 'major_discipline', 'relevent_experience', 'company_size', 'experience', 'last_new_job', 'training_hours']) to selector !!!!!
Feature selection...
Feature training_hours removed due to low AUC value 0.5031265374717342
Feature city_development_index removed due to high VIF value = 40.56438648184099
C parameter range in [0.0002603488674824265:260.3488674824265], 20 values
Result(score=0.7856775296767177, reg_alpha=0.020431136952654548, is_neg=True, min_weights=city                  -0.980620
company_size          -0.800535
company_type          -0.340185
experience            -0.198176
enrolled_university   -0.101047
relevent_experience    0.000000
education_level       -0.624324
last_new_job           0.000000
gender                 0.000000
major_discipline      -0.317699
dtype: float64)
Iter 0 of final refit starts with 7 features
Validation data checks
city                  -0.956550
company_size          -0.866063
company_type          -0.402941
experience            -0.329493
enrolled_university   -0.230776
education_level       -0.641994
major_discipline      -1.596907
dtype: float64
Lvl_0_Pipe_0_Mod_0_WhiteBox fitting and predicting completed
Time left 3587.2280378341675

Automl preset training completed in 12.77 seconds.
[18]:
test_prediction = automl.predict(test).data[:, 0]
[19]:
roc_auc_score(test['target'].values, test_prediction)
[19]:
0.7966826628232216

Serialization

Important note: auto_woe_1 is a ReportDeco object (the report generator), not the AutoWoE model itself. The underlying AutoWoE object is available as auto_woe_1.model.

Using the ReportDeco object for inference is not recommended, for two reasons:

- The report object needs the target column, because it calculates model quality metrics.
- Inference through ReportDeco is slower than through the plain model, because of the report update procedure.

[20]:
joblib.dump(auto_woe_1.model, 'model.pkl')
model = joblib.load('model.pkl')

SQL inference query

[21]:
sql_query = model.get_sql_inference_query('global_temp.TABLE_1')
print(sql_query)
SELECT
  1 / (1 + EXP(-(
    -1.111
    -0.516*WOE_TAB.city
    -0.513*WOE_TAB.city_development_index
    -0.815*WOE_TAB.company_size
    -0.398*WOE_TAB.company_type
    -0.175*WOE_TAB.experience
    -0.22*WOE_TAB.enrolled_university
    -1.24*WOE_TAB.education_level
  ))) as PROB,
  WOE_TAB.*
FROM
    (SELECT
    CASE
      WHEN (city IS NULL OR LOWER(CAST(city AS VARCHAR(50))) = 'nan') THEN 0
      WHEN city IN ('city_100', 'city_102', 'city_103', 'city_116', 'city_149', 'city_159', 'city_160', 'city_45', 'city_46', 'city_64', 'city_71', 'city_73', 'city_83', 'city_99') THEN 0.213
      WHEN city IN ('city_104', 'city_114', 'city_136', 'city_138', 'city_16', 'city_173', 'city_23', 'city_28', 'city_36', 'city_50', 'city_57', 'city_61', 'city_65', 'city_67', 'city_75', 'city_97') THEN 1.017
      WHEN city IN ('city_11', 'city_21', 'city_74') THEN -1.455
      ELSE -0.209
    END AS city,
    CASE
      WHEN (city_development_index IS NULL OR city_development_index = 'NaN') THEN 0
      WHEN city_development_index <= 0.6245 THEN -1.454
      WHEN city_development_index <= 0.7915 THEN -0.121
      WHEN city_development_index <= 0.9235 THEN 0.461
      ELSE 1.101
    END AS city_development_index,
    CASE
      WHEN (company_size IS NULL OR company_size = 'NaN') THEN -0.717
      WHEN company_size <= 74.0 THEN 0.221
      ELSE 0.467
    END AS company_size,
    CASE
      WHEN (company_type IS NULL OR LOWER(CAST(company_type AS VARCHAR(50))) = 'nan') THEN -0.64
      WHEN company_type IN ('Early Stage Startup', 'NGO', 'Other', 'Public Sector') THEN 0.164
      WHEN company_type = 'Funded Startup' THEN 0.737
      WHEN company_type = 'Pvt Ltd' THEN 0.398
      ELSE 0
    END AS company_type,
    CASE
      WHEN (experience IS NULL OR experience = 'NaN') THEN 0
      WHEN experience <= 1.5 THEN -0.811
      WHEN experience <= 7.5 THEN -0.319
      WHEN experience <= 11.5 THEN 0.119
      ELSE 0.533
    END AS experience,
    CASE
      WHEN (enrolled_university IS NULL OR LOWER(CAST(enrolled_university AS VARCHAR(50))) = 'nan') THEN -0.327
      WHEN enrolled_university = 'Full time course' THEN -0.614
      WHEN enrolled_university = 'Part time course' THEN 0.026
      WHEN enrolled_university = 'no_enrollment' THEN 0.208
      ELSE 0
    END AS enrolled_university,
    CASE
      WHEN (education_level IS NULL OR LOWER(CAST(education_level AS VARCHAR(50))) = 'nan') THEN 0.21
      WHEN education_level = 'Graduate' THEN -0.166
      WHEN education_level = 'High School' THEN 0.34
      WHEN education_level = 'Masters' THEN 0.21
      WHEN education_level IN ('Phd', 'Primary School') THEN 0.704
      ELSE 0
    END AS education_level
  FROM global_temp.TABLE_1) as WOE_TAB
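
Each CASE expression in the query above is a lookup from a raw feature value to the WoE of the bin it falls into. As a minimal sketch, the `experience` branch can be mirrored in plain Python; the bin edges and WoE values are copied from the query, while the function name `experience_woe` and the use of `bisect` are illustrative choices, not part of AutoWoE.

```python
import math
from bisect import bisect_left

# Upper bin edges and WoE values copied from the `experience` CASE expression.
EDGES = [1.5, 7.5, 11.5]
WOE = [-0.811, -0.319, 0.119, 0.533]  # the last value is the ELSE branch
NAN_WOE = 0.0                         # value used for NULL / NaN inputs


def experience_woe(value):
    """Map a raw `experience` value to its WoE (hypothetical helper)."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return NAN_WOE
    # bisect_left gives the first index i with value <= EDGES[i],
    # or len(EDGES) when value exceeds every edge (the ELSE branch).
    return WOE[bisect_left(EDGES, value)]


for raw in (None, 1.5, 5.0, 21.0):
    print(raw, experience_woe(raw))
```

Categorical features follow the same idea, with a dictionary lookup over category groups instead of a search over numeric edges.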

Check the SQL query with PySpark

[23]:
from pyspark.sql import SparkSession
[ ]:
spark = SparkSession.builder \
                    .master("local[2]") \
                    .appName("spark-course") \
                    .config("spark.driver.memory", "512m") \
                    .getOrCreate()
sc = spark.sparkContext
[24]:
spark_df = spark.read.csv(DATASET_FULLNAME, header=True)  # the CSV downloaded to DATASET_DIR above
spark_df.createGlobalTempView("TABLE_1")
[25]:
res = spark.sql(sql_query).toPandas()
[26]:
res
[26]:
           PROB    city  city_development_index  company_size  company_type  experience  enrolled_university  education_level
0      0.365512   0.213                   0.461        -0.717        -0.640       0.533                0.208           -0.166
1      0.195716  -0.209                  -0.121         0.467         0.398       0.533                0.208           -0.166
2      0.835002  -1.455                  -1.454        -0.717        -0.640      -0.319               -0.614           -0.166
3      0.476161  -0.209                  -0.121        -0.717         0.398      -0.811               -0.327           -0.166
4      0.117694  -0.209                  -0.121         0.467         0.737       0.533                0.208            0.210
...         ...     ...                     ...           ...           ...         ...                  ...              ...
19153  0.275602   1.017                   0.461        -0.717        -0.640       0.533                0.208           -0.166
19154  0.365512   0.213                   0.461        -0.717        -0.640       0.533                0.208           -0.166
19155  0.126794   0.213                   0.461         0.467         0.398       0.533                0.208           -0.166
19156  0.060842   1.017                   0.461         0.467         0.398      -0.811                0.208            0.340
19157  0.130552   1.017                   0.461        -0.717        -0.640      -0.319                0.208            0.704

19158 rows × 8 columns

[27]:
sc.stop()
[28]:
full_prediction = model.predict_proba(data)
full_prediction
[28]:
array([0.36557352, 0.19577798, 0.83497665, ..., 0.12678668, 0.06083813,
       0.13061427])
[29]:
(res['PROB'] - full_prediction).abs().max()
[29]:
0.0002878641803194526
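
The outer SELECT of the SQL query is a logistic function applied to a weighted sum of the WoE-encoded features. As a sanity check, row 0 of the Spark result can be recomputed by hand. In the sketch below the intercept and coefficients are copied from the printed query and the WoE values from row 0 of `res`; the variable names are illustrative.

```python
import math

# Intercept and coefficients as printed in the SQL query (rounded by the generator).
INTERCEPT = -1.111
COEF = {
    "city": -0.516,
    "city_development_index": -0.513,
    "company_size": -0.815,
    "company_type": -0.398,
    "experience": -0.175,
    "enrolled_university": -0.22,
    "education_level": -1.24,
}

# WoE values of row 0 in the Spark result table.
ROW_0 = {
    "city": 0.213,
    "city_development_index": 0.461,
    "company_size": -0.717,
    "company_type": -0.640,
    "experience": 0.533,
    "enrolled_university": 0.208,
    "education_level": -0.166,
}

# Linear score, then the logistic transform 1 / (1 + exp(-score)).
logit = INTERCEPT + sum(COEF[name] * ROW_0[name] for name in COEF)
prob = 1 / (1 + math.exp(-logit))
print(prob)
```

The result should agree with the PROB column for row 0 (0.365512) to the displayed precision. The small gap to `model.predict_proba` (0.36557352 for the same row) is consistent with the query using coefficients and WoE values rounded to about three decimals, which would also explain a maximum absolute difference of order 1e-4.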