Tutorial 10: Relational datasets (with star scheme)


The official LightAutoML repository is available on GitHub.

In this tutorial, we will look at how to use LightAutoML with relational datasets.

Install LightAutoML

[1]:
#! pip install -U lightautoml

Import necessary libraries

[2]:
# Standard python libraries
from os.path import join as pjoin

# ML and DS libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# Imports from lightautoml package
from lightautoml.automl.base import AutoML
from lightautoml.ml_algo.boost_lgbm import BoostLGBM

from lightautoml.pipelines.features.lgb_pipeline import LGBSimpleFeatures
from lightautoml.pipelines.ml.base import MLPipeline
from lightautoml.reader.base import DictToPandasSeqReader
from lightautoml.tasks import Task

# Import Feature Generator Transformer
from lightautoml.pipelines.features.generator_pipeline import FeatureGeneratorPipeline

Relational data

Consider data that consists of a set of linked tables. Usually there is a separate main table (the so-called fact table) that contains the object identifiers and the corresponding values of the target variable, and possibly the values of other features. The other tables (the so-called dimension tables) contain additional or auxiliary information: for example, records of all customer transactions (a user with a given identifier can have an arbitrary number of them), or the correspondence between the values of one feature and the values of another (such as between an employee's department and salary). The organization of the data may differ from this scheme, however. To apply machine learning algorithms and LightAutoML, it is necessary to create a single dataset with all the features for each object. To do this, we need to set the correspondence between the columns of the main and auxiliary tables so that features are aggregated correctly. Such tables can form different schemas.

In this example, we use the Meal delivery company dataset and consider one of the simplest and most common ways of organizing tables, the so-called star scheme. It has one main table, and connections exist only between the main table and each auxiliary table through specified columns; auxiliary tables are not linked to each other, either directly or in a chain. At present, this is the only scheme supported in LightAutoML; support for more complex schemes is in development. Note that the main table is connected to each auxiliary table by a single key, but the key may differ from table to table. Also, the linking column must be a primary key, that is, its values must uniquely identify the rows of the auxiliary table.
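To make the star scheme concrete, here is a minimal sketch in plain Python (no pandas, no LightAutoML) of how auxiliary tables can be attached to a main table through a single key each. The toy rows and the helper names `index_by_key` and `star_join` are hypothetical and only illustrate the idea; the sketch also assumes the linking column has the same name in both tables. It is not how LightAutoML performs the join internally.

```python
# Toy rows loosely mirroring the tutorial's tables; values are illustrative.
main_rows = [
    {"id": 1, "center_id": 11, "meal_id": 1885, "num_orders": 40},
    {"id": 2, "center_id": 13, "meal_id": 1993, "num_orders": 68},
]
centers = [
    {"center_id": 11, "center_type": "TYPE_A"},
    {"center_id": 13, "center_type": "TYPE_B"},
]
meals = [
    {"meal_id": 1885, "cuisine": "Thai"},
    {"meal_id": 1993, "cuisine": "Thai"},
]


def index_by_key(rows, key):
    """Index an auxiliary table by its key and reject duplicate key values."""
    index = {}
    for row in rows:
        if row[key] in index:
            raise ValueError(f"{key}={row[key]!r} is not unique")
        index[row[key]] = row
    return index


def star_join(main, aux_tables):
    """Attach the columns of every auxiliary table to the main rows.

    aux_tables maps a key column name to the rows of one auxiliary table.
    Auxiliary tables are only ever joined to the main table, never to
    each other, which is what makes this a star layout.
    """
    indexes = {key: index_by_key(rows, key) for key, rows in aux_tables.items()}
    joined = []
    for row in main:
        merged = dict(row)
        for key, index in indexes.items():
            match = index.get(row[key], {})
            merged.update({k: v for k, v in match.items() if k != key})
        joined.append(merged)
    return joined


flat = star_join(main_rows, {"center_id": centers, "meal_id": meals})
```

Each row of `flat` now carries `center_type` and `cuisine` next to the original columns. The uniqueness check in `index_by_key` reflects the primary-key requirement: with duplicate keys the lookup would be ambiguous.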

Consider an example of data organized as a star scheme. The dataset describes meal sales in a restaurant chain and consists of three tables: the main one, containing information about completed orders (train and test parts), and two auxiliary tables containing information about restaurants (fulfilment_center_info) and available dishes (meal_info). The tables and the scheme of their organization are shown in the image below.

Star scheme tables

First, we load the tables; for convenience, we will later collect them in dictionaries.

[3]:
data_dir = '../data/meal_delivery_company'

fulfilment_center_info = pd.read_csv(pjoin(data_dir, 'fulfilment_center_info.csv'))
meal_info = pd.read_csv(pjoin(data_dir, 'meal_info.csv'))
df_main = pd.read_csv(pjoin(data_dir, 'relational_main.csv.zip'))

[4]:
fulfilment_center_info.head()
[4]:
   center_id  city_code  region_code  center_type  op_area
0         11        679           56       TYPE_A      3.7
1         13        590           56       TYPE_B      6.7
2        124        590           56       TYPE_C      4.0
3         66        648           34       TYPE_A      4.1
4         94        632           34       TYPE_C      3.6
[5]:
meal_info.head()
[5]:
   meal_id   category  cuisine
0     1885  Beverages     Thai
1     1993  Beverages     Thai
2     2539  Beverages     Thai
3     1248  Beverages   Indian
4     2631  Beverages   Indian
[6]:
df_main.head()
[6]:
        id  week  center_id  meal_id  checkout_price  base_price  emailer_for_promotion  homepage_featured  num_orders
0  1476796   135         43     1770          486.03      486.03                      0                  0          40
1  1168999    65         23     2760          241.53      241.53                      0                  0          68
2  1190875   105         75     2444          709.13      708.13                      0                  0          80
3  1375454    68         10     2760          222.13      224.13                      0                  1         634
4  1397113    33         36     1438          256.08      243.50                      0                  1         122
[7]:
df_main.shape
[7]:
(45655, 9)

Create sequential star scheme dictionary

To use LightAutoML further, you need to specify the data schema. It is given as a dictionary whose keys are the names of the secondary tables and whose values are dictionaries with the remaining parameters. Each of these inner dictionaries contains the following parameters:

  • 'case' - the type of the column that serves as the linking key. If 'ids', the column is treated as a set of unique identifiers (ids); if 'next_values', it is treated as a set of timestamps.

  • 'params' - a dictionary of parameters for processing and interpreting timestamps when linking by a 'next_values' column. For 'ids' it can be left empty.

  • 'scheme' - a dictionary describing the relationship between the main and the secondary table. It consists of the following keys:

    • 'to' - the name of the table to which the secondary table is linked (in a star scheme, specify 'plain' here);

    • 'from_id' - the name of the linking column in the secondary table (from which the link originates);

    • 'to_id' - the name of the linking column in the main table (to which the link points).

In our example, the linking columns are IDs. Now we define the dictionary of linking parameters according to the table schema:

[8]:
seq_params = {
   'fulfilment_center_info': {
      'case': 'ids',
      'params': {},
      'scheme': {'to': 'plain', 'from_id': 'center_id', 'to_id': 'center_id'},
   },
   'meal_info':{
      'case': 'ids',
      'params': {},
      'scheme': {'to': 'plain', 'from_id': 'meal_id', 'to_id': 'meal_id'},
   },
}
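A schema dictionary like this is easy to mistype, so it can help to check it before passing it to the reader. The following is a minimal sketch of such a check; `check_seq_params` is a hypothetical helper, not part of LightAutoML, and it only verifies the keys described above and the presence of the linking columns.

```python
REQUIRED_KEYS = {"case", "params", "scheme"}
REQUIRED_SCHEME_KEYS = {"to", "from_id", "to_id"}


def check_seq_params(seq_params, main_columns, aux_columns):
    """Return a list of problems found in a star-scheme description.

    main_columns: column names of the main table.
    aux_columns: mapping from secondary table name to its column names.
    An empty list means that no problem was detected.
    """
    problems = []
    for table, spec in seq_params.items():
        missing = REQUIRED_KEYS - set(spec)
        if missing:
            problems.append(f"{table}: missing keys {sorted(missing)}")
            continue
        scheme = spec["scheme"]
        missing = REQUIRED_SCHEME_KEYS - set(scheme)
        if missing:
            problems.append(f"{table}: scheme is missing {sorted(missing)}")
            continue
        if scheme["to"] != "plain":
            problems.append(f"{table}: 'to' should be 'plain' in a star scheme")
        if scheme["from_id"] not in aux_columns.get(table, ()):
            problems.append(f"{table}: no column {scheme['from_id']!r}")
        if scheme["to_id"] not in main_columns:
            problems.append(f"main table: no column {scheme['to_id']!r}")
    return problems


# Example with one secondary table described as in the tutorial.
example_params = {
    "meal_info": {
        "case": "ids",
        "params": {},
        "scheme": {"to": "plain", "from_id": "meal_id", "to_id": "meal_id"},
    },
}
problems = check_seq_params(
    example_params,
    main_columns=["id", "week", "center_id", "meal_id", "num_orders"],
    aux_columns={"meal_info": ["meal_id", "category", "cuisine"]},
)
```

With pandas DataFrames, the column lists could be taken from `df.columns`; here plain lists keep the sketch dependency-free.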

Create a dictionary with the secondary tables.

[9]:
seq_data = {
       'fulfilment_center_info': fulfilment_center_info,
       'meal_info': meal_info
}

Define train and test data samples. They must be specified in the form of a dictionary, where the main dataset is specified by the 'plain' key, and the dictionary with secondary tables is specified by the 'seq' key (like the seq_data dictionary). Note that train and test data differ only in plain data, and train plain data must contain a column with the target variable.

[10]:
train, test = train_test_split(df_main.sort_values(by='week', ascending=True), shuffle=False, test_size=0.2)

train = {
    'plain': train,
    'seq': seq_data
}

test = {
    'plain': test,
    'seq': seq_data
}
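The split above sorts the main table by week and disables shuffling, so the latest weeks end up in the test part; presumably this is done to keep future observations out of training. A minimal sketch of the same idea in plain Python follows. The helper `time_ordered_split` is hypothetical, and rounding the test size up is this sketch's own choice.

```python
import math


def time_ordered_split(rows, time_key, test_size=0.2):
    """Split rows so that the latest records form the test part."""
    ordered = sorted(rows, key=lambda row: row[time_key])
    n_test = math.ceil(len(ordered) * test_size)
    cut = len(ordered) - n_test
    return ordered[:cut], ordered[cut:]


# Ten toy orders in arbitrary order; only 'week' matters for the split.
rows = [{"week": w, "num_orders": 10 * w} for w in (5, 1, 9, 3, 10, 2, 8, 4, 7, 6)]
train_part, test_part = time_ordered_split(rows, "week")
test_weeks = [row["week"] for row in test_part]
```

Every week in `train_part` precedes every week in `test_part`, which is the property a shuffled split would not guarantee.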

Create Task and Sequential Reader for the star scheme data

Tabular presets such as TabularAutoML cannot be used with linked tables in LightAutoML, so we have to assemble the whole pipeline manually. More details about creating custom pipelines are given in the LightAutoML tutorial on custom pipelines.

First we set the task and roles for our objective. Then we create a DictToPandasSeqReader to process data in the form of relational tables. It takes the task and the dictionary of sequential data parameters as arguments (more details are in the documentation of this reader):

[11]:
task = Task('reg', metric='mae')
roles={'target': 'num_orders'}
reader = DictToPandasSeqReader(task=task, seq_params=seq_params)
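The task above is a regression scored with mean absolute error (MAE), the average of the absolute differences between true and predicted values; LightGBM reports the same quantity as `l1` in its training log. A small hand computation, with made-up numbers and a hypothetical helper name `mae_by_hand`, shows what the metric measures:

```python
def mae_by_hand(y_true, y_pred):
    """Mean absolute error: average of |true - predicted|."""
    if len(y_true) != len(y_pred):
        raise ValueError("inputs must have the same length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


actual = [40, 68, 80]      # illustrative order counts
predicted = [50, 60, 80]   # illustrative model outputs
mae = mae_by_hand(actual, predicted)  # (10 + 8 + 0) / 3
```

In practice `sklearn.metrics.mean_absolute_error`, imported at the top of this tutorial, computes the same value.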

Create Feature Generator Pipeline

In addition to aggregating data from all related tables into one, LightAutoML can perform additional feature generation with FeatureGeneratorPipeline. Features can be generated using various aggregations (average, median, number of unique values, etc.), by extracting date features (year, day, difference between dates, weekend or weekday, etc.), by applying different transformations, and by using so-called interesting values, that is, by constructing features over objects that have a certain value of a categorical feature (conditional feature generation, like a "where" clause). For aggregation and transformation, LightAutoML uses the corresponding primitives from Featuretools; detailed information is available in the Featuretools documentation.

Define interesting values parameters for feature generation in corresponding tables.

[12]:
interesting_values = {
    'fulfilment_center_info': {'center_type': ['TYPE_A', 'TYPE_C'], 'city_code': [647, 456, 703]},
    'meal_info': {'category': ['Extras', 'Seafood'], 'cuisine': ['Continental', 'Thai']}
}

So, in our example we want to generate features over orders where the 'center_type' feature is equal to 'TYPE_A' or 'TYPE_C' and where the 'city_code' feature is equal to 647, 456 or 703, and similarly for the features of the ordered meal from the meal_info table.
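A feature built on an interesting value is an ordinary aggregation restricted by a condition, for instance the maximum of op_area over linked rows where center_type equals TYPE_A. The sketch below imitates this in plain Python. The helper `aggregate_where` and the toy rows are hypothetical, and returning None when nothing matches is this sketch's own convention; Featuretools may encode the empty case differently.

```python
def aggregate_where(rows, link_key, link_value, column, agg, where=None):
    """Aggregate `column` over the rows linked to one main-table object.

    where: optional (column, value) pair that restricts the rows,
    like a SQL WHERE clause. Returns None when no row matches.
    """
    selected = [
        row[column]
        for row in rows
        if row[link_key] == link_value
        and (where is None or row[where[0]] == where[1])
    ]
    return agg(selected) if selected else None


centers = [
    {"center_id": 11, "center_type": "TYPE_A", "op_area": 3.7},
    {"center_id": 13, "center_type": "TYPE_B", "op_area": 6.7},
]

# MAX(op_area WHERE center_type = TYPE_A) for two different orders.
feat_center_11 = aggregate_where(
    centers, "center_id", 11, "op_area", max, where=("center_type", "TYPE_A")
)
feat_center_13 = aggregate_where(
    centers, "center_id", 13, "op_area", max, where=("center_type", "TYPE_A")
)
```

The order placed at center 11 gets the value 3.7, while the order at center 13 gets no value because its center is of another type. This is how a conditional feature tells apart objects that an unconditional aggregate would treat alike.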

Params of feature generator:

  • seq_params: secondary tables or sequence related parameters.

  • max_gener_features: maximum number of generated features.

  • max_depth: maximum allowed depth of features (that is, the number of consecutively applied aggregation and transformation primitives in a superposition to obtain features).

  • agg_primitives: list of aggregation primitives. By default it is ["entropy", "count", "mean", "std", "median", "max", "sum", "num_unique", "min", "percent_true"].

  • trans_primitives: list of transform primitives. By default it is ["hour", "month", "weekday", "is_weekend", "day", "time_since_previous", "week", "age", "time_since"].

  • interesting_values: categorical values in the form {'table_name': {'column': [values]}} for feature generation in the corresponding slices (like the interesting_values dictionary above).

  • generate_interesting_values: whether to generate features in slices of unique categories.

  • per_top_categories: percent of most frequent categories for feature generation in the corresponding slices. If the number of unique values is less than 10, all values are used.

  • sample_size: size of the data sample on which the selection of generated features is performed.

  • n_jobs: number of processes to run in parallel.

More details are available in the FeatureGeneratorPipeline class in lightautoml/pipelines/features/generator_pipeline.py.
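The per_top_categories rule can be illustrated with a short sketch. This is only one possible reading of the description above (keep the given percent of the most frequent categories, or all of them when a column has fewer than 10 unique values). The helper `top_categories`, the rounding up, and the minimum of one category are assumptions of this sketch, not the library's code.

```python
import math
from collections import Counter


def top_categories(values, percent, small_cardinality=10):
    """Pick the categories to slice on, most frequent first.

    Assumed rule: with fewer than `small_cardinality` unique values keep
    all of them, otherwise keep the top `percent` percent of categories.
    """
    counts = Counter(values)
    if len(counts) < small_cardinality:
        return [value for value, _ in counts.most_common()]
    n_keep = max(1, math.ceil(len(counts) * percent / 100))
    return [value for value, _ in counts.most_common(n_keep)]


# Three unique cuisines: fewer than 10, so all are kept.
cuisines = ["Thai"] * 3 + ["Indian"] * 2 + ["Italian"]
kept_cuisines = top_categories(cuisines, percent=25)

# Twenty unique codes with distinct frequencies: 25 percent keeps five.
codes = [code for code in range(20) for _ in range(20 - code)]
kept_codes = top_categories(codes, percent=25)
```

Under this reading, a setting of 25 on a column with 20 categories would generate slices for the five most frequent ones only, which limits the number of conditional features.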

[13]:
generator = FeatureGeneratorPipeline(
    seq_params,
    max_gener_features=500,
    interesting_values=interesting_values,
    generate_interesting_values=True,
    per_top_categories=25,
    sample_size=None,
    n_jobs=16,
)

Create one-level ML pipeline for AutoML

Next we compose the entire pipeline. We append the simplest basic transformations to the feature generation pipeline (encoding categorical features, converting date features to the appropriate format, defining numeric types, defining roles). The set of algorithms consists only of LightGBM gradient boosting, and no pre-selection or post-selection of features is used.

[14]:
simpletransf = LGBSimpleFeatures()
feats = generator.append(simpletransf)

model = BoostLGBM()

pipeline_lvl1 = MLPipeline([model], pre_selection=None, features_pipeline=feats, post_selection=None)

Initialize AutoML instance:

[15]:
automl = AutoML(reader, [[pipeline_lvl1],], skip_conn=False)

Train AutoML on loaded data

Let’s train our model on train data and look at the logs of training. For more detailed info we will set verbosity level to 3:

[16]:
%%time

train_pred = automl.fit_predict(train, roles=roles, verbose=3)
[17:03:13] Feats was rejected during automatic roles guess: []
[17:03:13] Layer 1 train process start. Time left 9999999997.76 secs
[17:03:13] This selector only for holdout training. fit_on_holout argument added just to be compatible
[17:03:13] Copying TaskTimer may affect the parent PipelineTimer, so copy will create new unlimited TaskTimer
/home/rinchin/lama_gitlab/LightAutoML/.venv/lib/python3.8/site-packages/featuretools/synthesis/dfs.py:321: UnusedPrimitiveWarning: Some specified primitives were not used during DFS:
  trans_primitives: ['age', 'day', 'hour', 'is_weekend', 'month', 'time_since', 'time_since_previous', 'week', 'weekday']
  agg_primitives: ['percent_true']
  where_primitives: ['entropy', 'num_unique', 'percent_true']
This may be caused by a using a value of max_depth that is too small, not setting interesting values, or it may indicate no compatible columns for the primitive were found in the data. If the DFS call contained multiple instances of a primitive in the list above, none of them were used.
  warnings.warn(warning_msg, UnusedPrimitiveWarning)
EntitySet scattered to 16 workers in 4 seconds
[17:03:22] Training until validation scores don't improve for 100 rounds
[17:03:23] [100]        valid's l1: 208.397
[17:03:23] [200]        valid's l1: 208.432
[17:03:24] Early stopping, best iteration is:
[125]   valid's l1: 208.028
[17:03:24] LightGBM fitting and predicting completed
[17:03:24] Started iteration 0, chunk = ['ft__plain_center_id.MAX(fulfilment_center_info.op_area)', 'ft__plain_center_id.MAX(fulfilment_center_info.city_code WHERE center_type = TYPE_A)', 'ft__plain_meal_id.COUNT(meal_info WHERE cuisine = Italian)', 'ft__plain_meal_id.COUNT(meal_info WHERE cuisine = Continental)', 'ft__plain_center_id.MAX(fulfilment_center_info.op_area WHERE center_type = TYPE_B)', 'ft__plain_meal_id.COUNT(meal_info WHERE cuisine = Indian)', 'ft__plain_center_id.MAX(fulfilment_center_info.city_code)', 'ft__plain_center_id.MAX(fulfilment_center_info.op_area WHERE center_type = TYPE_A)', 'ft__plain_center_id.MAX(fulfilment_center_info.region_code WHERE center_type = TYPE_A)', 'ft__plain_meal_id.COUNT(meal_info WHERE cuisine = Thai)', 'ft__plain_center_id.MEDIAN(fulfilment_center_info.op_area)', 'ft__plain_center_id.MEAN(fulfilment_center_info.op_area)'], feats to check = ['ft__plain_center_id.MAX(fulfilment_center_info.op_area)', 'ft__plain_center_id.MAX(fulfilment_center_info.city_code WHERE center_type = TYPE_A)', 'ft__plain_meal_id.COUNT(meal_info WHERE cuisine = Italian)', 'ft__plain_meal_id.COUNT(meal_info WHERE cuisine = Continental)', 'ft__plain_center_id.MAX(fulfilment_center_info.op_area WHERE center_type = TYPE_B)', 'ft__plain_meal_id.COUNT(meal_info WHERE cuisine = Indian)', 'ft__plain_center_id.MAX(fulfilment_center_info.city_code)', 'ft__plain_center_id.MAX(fulfilment_center_info.op_area WHERE center_type = TYPE_A)', 'ft__plain_center_id.MAX(fulfilment_center_info.region_code WHERE center_type = TYPE_A)', 'ft__plain_meal_id.COUNT(meal_info WHERE cuisine = Thai)', 'ft__plain_center_id.MEDIAN(fulfilment_center_info.op_area)', 'ft__plain_center_id.MEAN(fulfilment_center_info.op_area)']
[17:03:24] Features in SCI = ['ft__plain_center_id.MAX(fulfilment_center_info.city_code WHERE center_type = TYPE_A)', 'ft__plain_center_id.MAX(fulfilment_center_info.city_code)', 'ft__plain_center_id.MAX(fulfilment_center_info.op_area WHERE center_type = TYPE_A)', 'ft__plain_center_id.MAX(fulfilment_center_info.op_area WHERE center_type = TYPE_B)', 'ft__plain_center_id.MAX(fulfilment_center_info.op_area)', 'ft__plain_center_id.MAX(fulfilment_center_info.region_code WHERE center_type = TYPE_A)', 'ft__plain_center_id.MEAN(fulfilment_center_info.op_area)', 'ft__plain_center_id.MEDIAN(fulfilment_center_info.op_area)', 'ft__plain_meal_id.COUNT(meal_info WHERE cuisine = Continental)', 'ft__plain_meal_id.COUNT(meal_info WHERE cuisine = Indian)', 'ft__plain_meal_id.COUNT(meal_info WHERE cuisine = Italian)', 'ft__plain_meal_id.COUNT(meal_info WHERE cuisine = Thai)']
[17:03:24] Training until validation scores don't improve for 100 rounds
[17:03:24] [100]        valid's l1: 208.366
[17:03:25] [200]        valid's l1: 208.36
[17:03:25] Early stopping, best iteration is:
[125]   valid's l1: 208.044
[17:03:25] LightGBM fitting and predicting completed
[17:03:25] Update best score from None to -208.04352023579347
[17:03:25] Started iteration 1, chunk = ['ft__plain_center_id.MAX(fulfilment_center_info.city_code WHERE center_type = TYPE_B)', 'ft__plain_center_id.MAX(fulfilment_center_info.op_area WHERE center_type = TYPE_C)', 'ft__plain_center_id.MAX(fulfilment_center_info.region_code)', 'ft__plain_center_id.MAX(fulfilment_center_info.city_code WHERE center_type = TYPE_C)', 'ft__plain_center_id.MEAN(fulfilment_center_info.op_area WHERE center_type = TYPE_B)', 'ft__plain_center_id.MEAN(fulfilment_center_info.op_area WHERE center_type = TYPE_A)', 'ft__plain_center_id.COUNT(fulfilment_center_info WHERE center_type = TYPE_C)', 'ft__plain_center_id.MEAN(fulfilment_center_info.region_code WHERE center_type = TYPE_A)', 'ft__plain_center_id.MEAN(fulfilment_center_info.city_code WHERE center_type = TYPE_A)', 'ft__plain_center_id.MEAN(fulfilment_center_info.city_code)', 'ft__plain_center_id.MAX(fulfilment_center_info.region_code WHERE center_type = TYPE_C)', 'ft__plain_center_id.MEDIAN(fulfilment_center_info.city_code WHERE center_type = TYPE_A)'], feats to check = ['ft__plain_center_id.MAX(fulfilment_center_info.op_area)', 'ft__plain_center_id.MAX(fulfilment_center_info.city_code WHERE center_type = TYPE_A)', 'ft__plain_meal_id.COUNT(meal_info WHERE cuisine = Italian)', 'ft__plain_meal_id.COUNT(meal_info WHERE cuisine = Continental)', 'ft__plain_center_id.MAX(fulfilment_center_info.op_area WHERE center_type = TYPE_B)', 'ft__plain_meal_id.COUNT(meal_info WHERE cuisine = Indian)', 'ft__plain_center_id.MAX(fulfilment_center_info.city_code)', 'ft__plain_center_id.MAX(fulfilment_center_info.op_area WHERE center_type = TYPE_A)', 'ft__plain_center_id.MAX(fulfilment_center_info.region_code WHERE center_type = TYPE_A)', 'ft__plain_meal_id.COUNT(meal_info WHERE cuisine = Thai)', 'ft__plain_center_id.MEDIAN(fulfilment_center_info.op_area)', 'ft__plain_center_id.MEAN(fulfilment_center_info.op_area)', 'ft__plain_center_id.MAX(fulfilment_center_info.city_code WHERE center_type = TYPE_B)', 
'ft__plain_center_id.MAX(fulfilment_center_info.op_area WHERE center_type = TYPE_C)', 'ft__plain_center_id.MAX(fulfilment_center_info.region_code)', 'ft__plain_center_id.MAX(fulfilment_center_info.city_code WHERE center_type = TYPE_C)', 'ft__plain_center_id.MEAN(fulfilment_center_info.op_area WHERE center_type = TYPE_B)', 'ft__plain_center_id.MEAN(fulfilment_center_info.op_area WHERE center_type = TYPE_A)', 'ft__plain_center_id.COUNT(fulfilment_center_info WHERE center_type = TYPE_C)', 'ft__plain_center_id.MEAN(fulfilment_center_info.region_code WHERE center_type = TYPE_A)', 'ft__plain_center_id.MEAN(fulfilment_center_info.city_code WHERE center_type = TYPE_A)', 'ft__plain_center_id.MEAN(fulfilment_center_info.city_code)', 'ft__plain_center_id.MAX(fulfilment_center_info.region_code WHERE center_type = TYPE_C)', 'ft__plain_center_id.MEDIAN(fulfilment_center_info.city_code WHERE center_type = TYPE_A)']
[17:03:25] Features in SCI = ['ft__plain_center_id.COUNT(fulfilment_center_info WHERE center_type = TYPE_C)', 'ft__plain_center_id.MAX(fulfilment_center_info.city_code WHERE center_type = TYPE_A)', 'ft__plain_center_id.MAX(fulfilment_center_info.city_code WHERE center_type = TYPE_B)', 'ft__plain_center_id.MAX(fulfilment_center_info.city_code WHERE center_type = TYPE_C)', 'ft__plain_center_id.MAX(fulfilment_center_info.city_code)', 'ft__plain_center_id.MAX(fulfilment_center_info.op_area WHERE center_type = TYPE_A)', 'ft__plain_center_id.MAX(fulfilment_center_info.op_area WHERE center_type = TYPE_B)', 'ft__plain_center_id.MAX(fulfilment_center_info.op_area WHERE center_type = TYPE_C)', 'ft__plain_center_id.MAX(fulfilment_center_info.op_area)', 'ft__plain_center_id.MAX(fulfilment_center_info.region_code WHERE center_type = TYPE_A)', 'ft__plain_center_id.MAX(fulfilment_center_info.region_code WHERE center_type = TYPE_C)', 'ft__plain_center_id.MAX(fulfilment_center_info.region_code)', 'ft__plain_center_id.MEAN(fulfilment_center_info.city_code WHERE center_type = TYPE_A)', 'ft__plain_center_id.MEAN(fulfilment_center_info.city_code)', 'ft__plain_center_id.MEAN(fulfilment_center_info.op_area WHERE center_type = TYPE_A)', 'ft__plain_center_id.MEAN(fulfilment_center_info.op_area WHERE center_type = TYPE_B)', 'ft__plain_center_id.MEAN(fulfilment_center_info.op_area)', 'ft__plain_center_id.MEAN(fulfilment_center_info.region_code WHERE center_type = TYPE_A)', 'ft__plain_center_id.MEDIAN(fulfilment_center_info.city_code WHERE center_type = TYPE_A)', 'ft__plain_center_id.MEDIAN(fulfilment_center_info.op_area)', 'ft__plain_meal_id.COUNT(meal_info WHERE cuisine = Continental)', 'ft__plain_meal_id.COUNT(meal_info WHERE cuisine = Indian)', 'ft__plain_meal_id.COUNT(meal_info WHERE cuisine = Italian)', 'ft__plain_meal_id.COUNT(meal_info WHERE cuisine = Thai)']
[17:03:25] Training until validation scores don't improve for 100 rounds
[17:03:26] [100]        valid's l1: 208.434
[17:03:26] [200]        valid's l1: 208.345
[17:03:27] Early stopping, best iteration is:
[125]   valid's l1: 208.067
[17:03:27] LightGBM fitting and predicting completed
[17:03:27] Started iteration 2, chunk = ['ft__plain_center_id.MAX(fulfilment_center_info.region_code WHERE center_type = TYPE_B)', 'ft__plain_center_id.MEAN(fulfilment_center_info.region_code)', 'ft__plain_center_id.MEDIAN(fulfilment_center_info.city_code)', 'ft__plain_center_id.MEAN(fulfilment_center_info.op_area WHERE center_type = TYPE_C)', 'ft__plain_center_id.COUNT(fulfilment_center_info WHERE center_type = TYPE_A)', 'ft__plain_center_id.MEAN(fulfilment_center_info.city_code WHERE center_type = TYPE_B)', 'ft__plain_center_id.MEDIAN(fulfilment_center_info.region_code WHERE center_type = TYPE_A)', 'ft__plain_center_id.MEDIAN(fulfilment_center_info.op_area WHERE center_type = TYPE_A)', 'ft__plain_center_id.MEAN(fulfilment_center_info.city_code WHERE center_type = TYPE_C)', 'ft__plain_center_id.MIN(fulfilment_center_info.op_area WHERE center_type = TYPE_B)', 'ft__plain_center_id.MIN(fulfilment_center_info.city_code)', 'ft__plain_center_id.MEDIAN(fulfilment_center_info.op_area WHERE center_type = TYPE_B)'], feats to check = ['ft__plain_center_id.MAX(fulfilment_center_info.op_area)', 'ft__plain_center_id.MAX(fulfilment_center_info.city_code WHERE center_type = TYPE_A)', 'ft__plain_meal_id.COUNT(meal_info WHERE cuisine = Italian)', 'ft__plain_meal_id.COUNT(meal_info WHERE cuisine = Continental)', 'ft__plain_center_id.MAX(fulfilment_center_info.op_area WHERE center_type = TYPE_B)', 'ft__plain_meal_id.COUNT(meal_info WHERE cuisine = Indian)', 'ft__plain_center_id.MAX(fulfilment_center_info.city_code)', 'ft__plain_center_id.MAX(fulfilment_center_info.op_area WHERE center_type = TYPE_A)', 'ft__plain_center_id.MAX(fulfilment_center_info.region_code WHERE center_type = TYPE_A)', 'ft__plain_meal_id.COUNT(meal_info WHERE cuisine = Thai)', 'ft__plain_center_id.MEDIAN(fulfilment_center_info.op_area)', 'ft__plain_center_id.MEAN(fulfilment_center_info.op_area)', 'ft__plain_center_id.MAX(fulfilment_center_info.region_code WHERE center_type = TYPE_B)', 
'ft__plain_center_id.MEAN(fulfilment_center_info.region_code)', 'ft__plain_center_id.MEDIAN(fulfilment_center_info.city_code)', 'ft__plain_center_id.MEAN(fulfilment_center_info.op_area WHERE center_type = TYPE_C)', 'ft__plain_center_id.COUNT(fulfilment_center_info WHERE center_type = TYPE_A)', 'ft__plain_center_id.MEAN(fulfilment_center_info.city_code WHERE center_type = TYPE_B)', 'ft__plain_center_id.MEDIAN(fulfilment_center_info.region_code WHERE center_type = TYPE_A)', 'ft__plain_center_id.MEDIAN(fulfilment_center_info.op_area WHERE center_type = TYPE_A)', 'ft__plain_center_id.MEAN(fulfilment_center_info.city_code WHERE center_type = TYPE_C)', 'ft__plain_center_id.MIN(fulfilment_center_info.op_area WHERE center_type = TYPE_B)', 'ft__plain_center_id.MIN(fulfilment_center_info.city_code)', 'ft__plain_center_id.MEDIAN(fulfilment_center_info.op_area WHERE center_type = TYPE_B)']
[17:03:27] Features in SCI = ['ft__plain_center_id.COUNT(fulfilment_center_info WHERE center_type = TYPE_A)', 'ft__plain_center_id.MAX(fulfilment_center_info.city_code WHERE center_type = TYPE_A)', 'ft__plain_center_id.MAX(fulfilment_center_info.city_code)', 'ft__plain_center_id.MAX(fulfilment_center_info.op_area WHERE center_type = TYPE_A)', 'ft__plain_center_id.MAX(fulfilment_center_info.op_area WHERE center_type = TYPE_B)', 'ft__plain_center_id.MAX(fulfilment_center_info.op_area)', 'ft__plain_center_id.MAX(fulfilment_center_info.region_code WHERE center_type = TYPE_A)', 'ft__plain_center_id.MAX(fulfilment_center_info.region_code WHERE center_type = TYPE_B)', 'ft__plain_center_id.MEAN(fulfilment_center_info.city_code WHERE center_type = TYPE_B)', 'ft__plain_center_id.MEAN(fulfilment_center_info.city_code WHERE center_type = TYPE_C)', 'ft__plain_center_id.MEAN(fulfilment_center_info.op_area WHERE center_type = TYPE_C)', 'ft__plain_center_id.MEAN(fulfilment_center_info.op_area)', 'ft__plain_center_id.MEAN(fulfilment_center_info.region_code)', 'ft__plain_center_id.MEDIAN(fulfilment_center_info.city_code)', 'ft__plain_center_id.MEDIAN(fulfilment_center_info.op_area WHERE center_type = TYPE_A)', 'ft__plain_center_id.MEDIAN(fulfilment_center_info.op_area WHERE center_type = TYPE_B)', 'ft__plain_center_id.MEDIAN(fulfilment_center_info.op_area)', 'ft__plain_center_id.MEDIAN(fulfilment_center_info.region_code WHERE center_type = TYPE_A)', 'ft__plain_center_id.MIN(fulfilment_center_info.city_code)', 'ft__plain_center_id.MIN(fulfilment_center_info.op_area WHERE center_type = TYPE_B)', 'ft__plain_meal_id.COUNT(meal_info WHERE cuisine = Continental)', 'ft__plain_meal_id.COUNT(meal_info WHERE cuisine = Indian)', 'ft__plain_meal_id.COUNT(meal_info WHERE cuisine = Italian)', 'ft__plain_meal_id.COUNT(meal_info WHERE cuisine = Thai)']
[17:03:27] Training until validation scores don't improve for 100 rounds
[17:03:27] [100]        valid's l1: 208.433
[17:03:28] [200]        valid's l1: 208.34
[17:03:28] Early stopping, best iteration is:
[125]   valid's l1: 208.057
[17:03:28] LightGBM fitting and predicting completed
[17:03:28] Started iteration 3, chunk = ['ft__plain_center_id.COUNT(fulfilment_center_info WHERE center_type = TYPE_B)', 'ft__plain_center_id.MIN(fulfilment_center_info.op_area WHERE center_type = TYPE_A)', 'ft__plain_center_id.MEDIAN(fulfilment_center_info.city_code WHERE center_type = TYPE_B)', 'ft__plain_center_id.MEDIAN(fulfilment_center_info.op_area WHERE center_type = TYPE_C)', 'ft__plain_center_id.MEDIAN(fulfilment_center_info.region_code)', 'ft__plain_center_id.MIN(fulfilment_center_info.op_area)', 'ft__plain_center_id.SUM(fulfilment_center_info.city_code WHERE center_type = TYPE_A)', 'ft__plain_center_id.MIN(fulfilment_center_info.city_code WHERE center_type = TYPE_B)', 'ft__plain_center_id.MEAN(fulfilment_center_info.region_code WHERE center_type = TYPE_C)', 'ft__plain_center_id.MEDIAN(fulfilment_center_info.city_code WHERE center_type = TYPE_C)', 'ft__plain_center_id.SUM(fulfilment_center_info.city_code)', 'ft__plain_center_id.MIN(fulfilment_center_info.city_code WHERE center_type = TYPE_C)'], feats to check = ['ft__plain_center_id.MAX(fulfilment_center_info.op_area)', 'ft__plain_center_id.MAX(fulfilment_center_info.city_code WHERE center_type = TYPE_A)', 'ft__plain_meal_id.COUNT(meal_info WHERE cuisine = Italian)', 'ft__plain_meal_id.COUNT(meal_info WHERE cuisine = Continental)', 'ft__plain_center_id.MAX(fulfilment_center_info.op_area WHERE center_type = TYPE_B)', 'ft__plain_meal_id.COUNT(meal_info WHERE cuisine = Indian)', 'ft__plain_center_id.MAX(fulfilment_center_info.city_code)', 'ft__plain_center_id.MAX(fulfilment_center_info.op_area WHERE center_type = TYPE_A)', 'ft__plain_center_id.MAX(fulfilment_center_info.region_code WHERE center_type = TYPE_A)', 'ft__plain_meal_id.COUNT(meal_info WHERE cuisine = Thai)', 'ft__plain_center_id.MEDIAN(fulfilment_center_info.op_area)', 'ft__plain_center_id.MEAN(fulfilment_center_info.op_area)', 'ft__plain_center_id.COUNT(fulfilment_center_info WHERE center_type = TYPE_B)', 
'ft__plain_center_id.MIN(fulfilment_center_info.op_area WHERE center_type = TYPE_A)', 'ft__plain_center_id.MEDIAN(fulfilment_center_info.city_code WHERE center_type = TYPE_B)', 'ft__plain_center_id.MEDIAN(fulfilment_center_info.op_area WHERE center_type = TYPE_C)', 'ft__plain_center_id.MEDIAN(fulfilment_center_info.region_code)', 'ft__plain_center_id.MIN(fulfilment_center_info.op_area)', 'ft__plain_center_id.SUM(fulfilment_center_info.city_code WHERE center_type = TYPE_A)', 'ft__plain_center_id.MIN(fulfilment_center_info.city_code WHERE center_type = TYPE_B)', 'ft__plain_center_id.MEAN(fulfilment_center_info.region_code WHERE center_type = TYPE_C)', 'ft__plain_center_id.MEDIAN(fulfilment_center_info.city_code WHERE center_type = TYPE_C)', 'ft__plain_center_id.SUM(fulfilment_center_info.city_code)', 'ft__plain_center_id.MIN(fulfilment_center_info.city_code WHERE center_type = TYPE_C)']
[17:03:28] Features in SCI = ['ft__plain_center_id.COUNT(fulfilment_center_info WHERE center_type = TYPE_B)', 'ft__plain_center_id.MAX(fulfilment_center_info.city_code WHERE center_type = TYPE_A)', 'ft__plain_center_id.MAX(fulfilment_center_info.city_code)', 'ft__plain_center_id.MAX(fulfilment_center_info.op_area WHERE center_type = TYPE_A)', 'ft__plain_center_id.MAX(fulfilment_center_info.op_area WHERE center_type = TYPE_B)', 'ft__plain_center_id.MAX(fulfilment_center_info.op_area)', 'ft__plain_center_id.MAX(fulfilment_center_info.region_code WHERE center_type = TYPE_A)', 'ft__plain_center_id.MEAN(fulfilment_center_info.op_area)', 'ft__plain_center_id.MEAN(fulfilment_center_info.region_code WHERE center_type = TYPE_C)', 'ft__plain_center_id.MEDIAN(fulfilment_center_info.city_code WHERE center_type = TYPE_B)', 'ft__plain_center_id.MEDIAN(fulfilment_center_info.city_code WHERE center_type = TYPE_C)', 'ft__plain_center_id.MEDIAN(fulfilment_center_info.op_area WHERE center_type = TYPE_C)', 'ft__plain_center_id.MEDIAN(fulfilment_center_info.op_area)', 'ft__plain_center_id.MEDIAN(fulfilment_center_info.region_code)', 'ft__plain_center_id.MIN(fulfilment_center_info.city_code WHERE center_type = TYPE_B)', 'ft__plain_center_id.MIN(fulfilment_center_info.city_code WHERE center_type = TYPE_C)', 'ft__plain_center_id.MIN(fulfilment_center_info.op_area WHERE center_type = TYPE_A)', 'ft__plain_center_id.MIN(fulfilment_center_info.op_area)', 'ft__plain_center_id.SUM(fulfilment_center_info.city_code WHERE center_type = TYPE_A)', 'ft__plain_center_id.SUM(fulfilment_center_info.city_code)', 'ft__plain_meal_id.COUNT(meal_info WHERE cuisine = Continental)', 'ft__plain_meal_id.COUNT(meal_info WHERE cuisine = Indian)', 'ft__plain_meal_id.COUNT(meal_info WHERE cuisine = Italian)', 'ft__plain_meal_id.COUNT(meal_info WHERE cuisine = Thai)']
[17:03:28] Training until validation scores don't improve for 100 rounds
[17:03:29] [100]        valid's l1: 208.432
[17:03:29] [200]        valid's l1: 208.338
[17:03:30] Early stopping, best iteration is:
[125]   valid's l1: 208.095
[17:03:30] LightGBM fitting and predicting completed
[17:03:30] Started iteration 4, chunk = ['ft__plain_center_id.MEAN(fulfilment_center_info.region_code WHERE center_type = TYPE_B)', 'ft__plain_center_id.MEDIAN(fulfilment_center_info.region_code WHERE center_type = TYPE_C)', 'ft__plain_center_id.MEDIAN(fulfilment_center_info.region_code WHERE center_type = TYPE_B)', 'ft__plain_center_id.MIN(fulfilment_center_info.op_area WHERE center_type = TYPE_C)', 'ft__plain_center_id.SUM(fulfilment_center_info.op_area WHERE center_type = TYPE_A)', 'ft__plain_center_id.MIN(fulfilment_center_info.region_code)', 'ft__plain_meal_id.COUNT(meal_info)', 'ft__plain_center_id.SUM(fulfilment_center_info.op_area)', 'ft__plain_center_id.SUM(fulfilment_center_info.city_code WHERE center_type = TYPE_C)', 'ft__plain_center_id.SUM(fulfilment_center_info.op_area WHERE center_type = TYPE_B)', 'ft__plain_center_id.SUM(fulfilment_center_info.op_area WHERE center_type = TYPE_C)', 'ft__plain_center_id.SUM(fulfilment_center_info.region_code WHERE center_type = TYPE_B)'], feats to check = ['ft__plain_center_id.MAX(fulfilment_center_info.op_area)', 'ft__plain_center_id.MAX(fulfilment_center_info.city_code WHERE center_type = TYPE_A)', 'ft__plain_meal_id.COUNT(meal_info WHERE cuisine = Italian)', 'ft__plain_meal_id.COUNT(meal_info WHERE cuisine = Continental)', 'ft__plain_center_id.MAX(fulfilment_center_info.op_area WHERE center_type = TYPE_B)', 'ft__plain_meal_id.COUNT(meal_info WHERE cuisine = Indian)', 'ft__plain_center_id.MAX(fulfilment_center_info.city_code)', 'ft__plain_center_id.MAX(fulfilment_center_info.op_area WHERE center_type = TYPE_A)', 'ft__plain_center_id.MAX(fulfilment_center_info.region_code WHERE center_type = TYPE_A)', 'ft__plain_meal_id.COUNT(meal_info WHERE cuisine = Thai)', 'ft__plain_center_id.MEDIAN(fulfilment_center_info.op_area)', 'ft__plain_center_id.MEAN(fulfilment_center_info.op_area)', 'ft__plain_center_id.MEAN(fulfilment_center_info.region_code WHERE center_type = TYPE_B)', 
'ft__plain_center_id.MEDIAN(fulfilment_center_info.region_code WHERE center_type = TYPE_C)', 'ft__plain_center_id.MEDIAN(fulfilment_center_info.region_code WHERE center_type = TYPE_B)', 'ft__plain_center_id.MIN(fulfilment_center_info.op_area WHERE center_type = TYPE_C)', 'ft__plain_center_id.SUM(fulfilment_center_info.op_area WHERE center_type = TYPE_A)', 'ft__plain_center_id.MIN(fulfilment_center_info.region_code)', 'ft__plain_meal_id.COUNT(meal_info)', 'ft__plain_center_id.SUM(fulfilment_center_info.op_area)', 'ft__plain_center_id.SUM(fulfilment_center_info.city_code WHERE center_type = TYPE_C)', 'ft__plain_center_id.SUM(fulfilment_center_info.op_area WHERE center_type = TYPE_B)', 'ft__plain_center_id.SUM(fulfilment_center_info.op_area WHERE center_type = TYPE_C)', 'ft__plain_center_id.SUM(fulfilment_center_info.region_code WHERE center_type = TYPE_B)']
[17:03:30] Features in SCI = ['ft__plain_center_id.MAX(fulfilment_center_info.city_code WHERE center_type = TYPE_A)', 'ft__plain_center_id.MAX(fulfilment_center_info.city_code)', 'ft__plain_center_id.MAX(fulfilment_center_info.op_area WHERE center_type = TYPE_A)', 'ft__plain_center_id.MAX(fulfilment_center_info.op_area WHERE center_type = TYPE_B)', 'ft__plain_center_id.MAX(fulfilment_center_info.op_area)', 'ft__plain_center_id.MAX(fulfilment_center_info.region_code WHERE center_type = TYPE_A)', 'ft__plain_center_id.MEAN(fulfilment_center_info.op_area)', 'ft__plain_center_id.MEAN(fulfilment_center_info.region_code WHERE center_type = TYPE_B)', 'ft__plain_center_id.MEDIAN(fulfilment_center_info.op_area)', 'ft__plain_center_id.MEDIAN(fulfilment_center_info.region_code WHERE center_type = TYPE_B)', 'ft__plain_center_id.MEDIAN(fulfilment_center_info.region_code WHERE center_type = TYPE_C)', 'ft__plain_center_id.MIN(fulfilment_center_info.op_area WHERE center_type = TYPE_C)', 'ft__plain_center_id.MIN(fulfilment_center_info.region_code)', 'ft__plain_center_id.SUM(fulfilment_center_info.city_code WHERE center_type = TYPE_C)', 'ft__plain_center_id.SUM(fulfilment_center_info.op_area WHERE center_type = TYPE_A)', 'ft__plain_center_id.SUM(fulfilment_center_info.op_area WHERE center_type = TYPE_B)', 'ft__plain_center_id.SUM(fulfilment_center_info.op_area WHERE center_type = TYPE_C)', 'ft__plain_center_id.SUM(fulfilment_center_info.op_area)', 'ft__plain_center_id.SUM(fulfilment_center_info.region_code WHERE center_type = TYPE_B)', 'ft__plain_meal_id.COUNT(meal_info WHERE cuisine = Continental)', 'ft__plain_meal_id.COUNT(meal_info WHERE cuisine = Indian)', 'ft__plain_meal_id.COUNT(meal_info WHERE cuisine = Italian)', 'ft__plain_meal_id.COUNT(meal_info WHERE cuisine = Thai)', 'ft__plain_meal_id.COUNT(meal_info)']
[17:03:30] Training until validation scores don't improve for 100 rounds
[17:03:30] [100]        valid's l1: 208.386
[17:03:31] [200]        valid's l1: 208.392
[17:03:31] Early stopping, best iteration is:
[125]   valid's l1: 208.077
[17:03:31] LightGBM fitting and predicting completed
[17:03:31] Started iteration 5, chunk = ['ft__plain_center_id.SUM(fulfilment_center_info.region_code WHERE center_type = TYPE_A)', 'ft__plain_center_id.COUNT(fulfilment_center_info)', 'ft__plain_center_id.SUM(fulfilment_center_info.region_code WHERE center_type = TYPE_C)', 'ft__plain_center_id.SUM(fulfilment_center_info.city_code WHERE center_type = TYPE_B)', 'ft__plain_meal_id.ENTROPY(meal_info.cuisine)', 'ft__plain_center_id.ENTROPY(fulfilment_center_info.center_type)', 'ft__plain_center_id.SUM(fulfilment_center_info.region_code)', 'ft__plain_center_id.STD(fulfilment_center_info.region_code WHERE center_type = TYPE_C)', 'ft__plain_center_id.STD(fulfilment_center_info.region_code)', 'ft__plain_center_id.STD(fulfilment_center_info.city_code WHERE center_type = TYPE_C)', 'ft__plain_center_id.MIN(fulfilment_center_info.region_code WHERE center_type = TYPE_A)', 'ft__plain_center_id.MIN(fulfilment_center_info.region_code WHERE center_type = TYPE_B)'], feats to check = ['ft__plain_center_id.MAX(fulfilment_center_info.op_area)', 'ft__plain_center_id.MAX(fulfilment_center_info.city_code WHERE center_type = TYPE_A)', 'ft__plain_meal_id.COUNT(meal_info WHERE cuisine = Italian)', 'ft__plain_meal_id.COUNT(meal_info WHERE cuisine = Continental)', 'ft__plain_center_id.MAX(fulfilment_center_info.op_area WHERE center_type = TYPE_B)', 'ft__plain_meal_id.COUNT(meal_info WHERE cuisine = Indian)', 'ft__plain_center_id.MAX(fulfilment_center_info.city_code)', 'ft__plain_center_id.MAX(fulfilment_center_info.op_area WHERE center_type = TYPE_A)', 'ft__plain_center_id.MAX(fulfilment_center_info.region_code WHERE center_type = TYPE_A)', 'ft__plain_meal_id.COUNT(meal_info WHERE cuisine = Thai)', 'ft__plain_center_id.MEDIAN(fulfilment_center_info.op_area)', 'ft__plain_center_id.MEAN(fulfilment_center_info.op_area)', 'ft__plain_center_id.SUM(fulfilment_center_info.region_code WHERE center_type = TYPE_A)', 'ft__plain_center_id.COUNT(fulfilment_center_info)', 
'ft__plain_center_id.SUM(fulfilment_center_info.region_code WHERE center_type = TYPE_C)', 'ft__plain_center_id.SUM(fulfilment_center_info.city_code WHERE center_type = TYPE_B)', 'ft__plain_meal_id.ENTROPY(meal_info.cuisine)', 'ft__plain_center_id.ENTROPY(fulfilment_center_info.center_type)', 'ft__plain_center_id.SUM(fulfilment_center_info.region_code)', 'ft__plain_center_id.STD(fulfilment_center_info.region_code WHERE center_type = TYPE_C)', 'ft__plain_center_id.STD(fulfilment_center_info.region_code)', 'ft__plain_center_id.STD(fulfilment_center_info.city_code WHERE center_type = TYPE_C)', 'ft__plain_center_id.MIN(fulfilment_center_info.region_code WHERE center_type = TYPE_A)', 'ft__plain_center_id.MIN(fulfilment_center_info.region_code WHERE center_type = TYPE_B)']
[17:03:31] Features in SCI = ['ft__plain_center_id.COUNT(fulfilment_center_info)', 'ft__plain_center_id.ENTROPY(fulfilment_center_info.center_type)', 'ft__plain_center_id.MAX(fulfilment_center_info.city_code WHERE center_type = TYPE_A)', 'ft__plain_center_id.MAX(fulfilment_center_info.city_code)', 'ft__plain_center_id.MAX(fulfilment_center_info.op_area WHERE center_type = TYPE_A)', 'ft__plain_center_id.MAX(fulfilment_center_info.op_area WHERE center_type = TYPE_B)', 'ft__plain_center_id.MAX(fulfilment_center_info.op_area)', 'ft__plain_center_id.MAX(fulfilment_center_info.region_code WHERE center_type = TYPE_A)', 'ft__plain_center_id.MEAN(fulfilment_center_info.op_area)', 'ft__plain_center_id.MEDIAN(fulfilment_center_info.op_area)', 'ft__plain_center_id.MIN(fulfilment_center_info.region_code WHERE center_type = TYPE_A)', 'ft__plain_center_id.MIN(fulfilment_center_info.region_code WHERE center_type = TYPE_B)', 'ft__plain_center_id.STD(fulfilment_center_info.city_code WHERE center_type = TYPE_C)', 'ft__plain_center_id.STD(fulfilment_center_info.region_code WHERE center_type = TYPE_C)', 'ft__plain_center_id.STD(fulfilment_center_info.region_code)', 'ft__plain_center_id.SUM(fulfilment_center_info.city_code WHERE center_type = TYPE_B)', 'ft__plain_center_id.SUM(fulfilment_center_info.region_code WHERE center_type = TYPE_A)', 'ft__plain_center_id.SUM(fulfilment_center_info.region_code WHERE center_type = TYPE_C)', 'ft__plain_center_id.SUM(fulfilment_center_info.region_code)', 'ft__plain_meal_id.COUNT(meal_info WHERE cuisine = Continental)', 'ft__plain_meal_id.COUNT(meal_info WHERE cuisine = Indian)', 'ft__plain_meal_id.COUNT(meal_info WHERE cuisine = Italian)', 'ft__plain_meal_id.COUNT(meal_info WHERE cuisine = Thai)', 'ft__plain_meal_id.ENTROPY(meal_info.cuisine)']
[17:03:31] Training until validation scores don't improve for 100 rounds
[17:03:32] [100]        valid's l1: 208.447
[17:03:32] [200]        valid's l1: 208.39
[17:03:33] Early stopping, best iteration is:
[125]   valid's l1: 208.101
[17:03:33] LightGBM fitting and predicting completed
[17:03:33] Started iteration 6, chunk = ['ft__plain_center_id.MIN(fulfilment_center_info.region_code WHERE center_type = TYPE_C)', 'ft__plain_center_id.NUM_UNIQUE(fulfilment_center_info.center_type)', 'ft__plain_center_id.STD(fulfilment_center_info.city_code WHERE center_type = TYPE_A)', 'ft__plain_center_id.STD(fulfilment_center_info.city_code WHERE center_type = TYPE_B)', 'ft__plain_center_id.STD(fulfilment_center_info.city_code)', 'ft__plain_center_id.MIN(fulfilment_center_info.city_code WHERE center_type = TYPE_A)', 'ft__plain_center_id.STD(fulfilment_center_info.op_area WHERE center_type = TYPE_A)', 'ft__plain_center_id.STD(fulfilment_center_info.op_area WHERE center_type = TYPE_B)', 'ft__plain_center_id.STD(fulfilment_center_info.op_area WHERE center_type = TYPE_C)', 'ft__plain_center_id.STD(fulfilment_center_info.op_area)', 'ft__plain_center_id.STD(fulfilment_center_info.region_code WHERE center_type = TYPE_A)', 'ft__plain_center_id.STD(fulfilment_center_info.region_code WHERE center_type = TYPE_B)'], feats to check = ['ft__plain_center_id.MAX(fulfilment_center_info.op_area)', 'ft__plain_center_id.MAX(fulfilment_center_info.city_code WHERE center_type = TYPE_A)', 'ft__plain_meal_id.COUNT(meal_info WHERE cuisine = Italian)', 'ft__plain_meal_id.COUNT(meal_info WHERE cuisine = Continental)', 'ft__plain_center_id.MAX(fulfilment_center_info.op_area WHERE center_type = TYPE_B)', 'ft__plain_meal_id.COUNT(meal_info WHERE cuisine = Indian)', 'ft__plain_center_id.MAX(fulfilment_center_info.city_code)', 'ft__plain_center_id.MAX(fulfilment_center_info.op_area WHERE center_type = TYPE_A)', 'ft__plain_center_id.MAX(fulfilment_center_info.region_code WHERE center_type = TYPE_A)', 'ft__plain_meal_id.COUNT(meal_info WHERE cuisine = Thai)', 'ft__plain_center_id.MEDIAN(fulfilment_center_info.op_area)', 'ft__plain_center_id.MEAN(fulfilment_center_info.op_area)', 'ft__plain_center_id.MIN(fulfilment_center_info.region_code WHERE center_type = TYPE_C)', 
'ft__plain_center_id.NUM_UNIQUE(fulfilment_center_info.center_type)', 'ft__plain_center_id.STD(fulfilment_center_info.city_code WHERE center_type = TYPE_A)', 'ft__plain_center_id.STD(fulfilment_center_info.city_code WHERE center_type = TYPE_B)', 'ft__plain_center_id.STD(fulfilment_center_info.city_code)', 'ft__plain_center_id.MIN(fulfilment_center_info.city_code WHERE center_type = TYPE_A)', 'ft__plain_center_id.STD(fulfilment_center_info.op_area WHERE center_type = TYPE_A)', 'ft__plain_center_id.STD(fulfilment_center_info.op_area WHERE center_type = TYPE_B)', 'ft__plain_center_id.STD(fulfilment_center_info.op_area WHERE center_type = TYPE_C)', 'ft__plain_center_id.STD(fulfilment_center_info.op_area)', 'ft__plain_center_id.STD(fulfilment_center_info.region_code WHERE center_type = TYPE_A)', 'ft__plain_center_id.STD(fulfilment_center_info.region_code WHERE center_type = TYPE_B)']
[17:03:33] Features in SCI = ['ft__plain_center_id.MAX(fulfilment_center_info.city_code WHERE center_type = TYPE_A)', 'ft__plain_center_id.MAX(fulfilment_center_info.city_code)', 'ft__plain_center_id.MAX(fulfilment_center_info.op_area WHERE center_type = TYPE_A)', 'ft__plain_center_id.MAX(fulfilment_center_info.op_area WHERE center_type = TYPE_B)', 'ft__plain_center_id.MAX(fulfilment_center_info.op_area)', 'ft__plain_center_id.MAX(fulfilment_center_info.region_code WHERE center_type = TYPE_A)', 'ft__plain_center_id.MEAN(fulfilment_center_info.op_area)', 'ft__plain_center_id.MEDIAN(fulfilment_center_info.op_area)', 'ft__plain_center_id.MIN(fulfilment_center_info.city_code WHERE center_type = TYPE_A)', 'ft__plain_center_id.MIN(fulfilment_center_info.region_code WHERE center_type = TYPE_C)', 'ft__plain_center_id.NUM_UNIQUE(fulfilment_center_info.center_type)', 'ft__plain_center_id.STD(fulfilment_center_info.city_code WHERE center_type = TYPE_A)', 'ft__plain_center_id.STD(fulfilment_center_info.city_code WHERE center_type = TYPE_B)', 'ft__plain_center_id.STD(fulfilment_center_info.city_code)', 'ft__plain_center_id.STD(fulfilment_center_info.op_area WHERE center_type = TYPE_A)', 'ft__plain_center_id.STD(fulfilment_center_info.op_area WHERE center_type = TYPE_B)', 'ft__plain_center_id.STD(fulfilment_center_info.op_area WHERE center_type = TYPE_C)', 'ft__plain_center_id.STD(fulfilment_center_info.op_area)', 'ft__plain_center_id.STD(fulfilment_center_info.region_code WHERE center_type = TYPE_A)', 'ft__plain_center_id.STD(fulfilment_center_info.region_code WHERE center_type = TYPE_B)', 'ft__plain_meal_id.COUNT(meal_info WHERE cuisine = Continental)', 'ft__plain_meal_id.COUNT(meal_info WHERE cuisine = Indian)', 'ft__plain_meal_id.COUNT(meal_info WHERE cuisine = Italian)', 'ft__plain_meal_id.COUNT(meal_info WHERE cuisine = Thai)']
[17:03:33] Training until validation scores don't improve for 100 rounds
[17:03:33] [100]        valid's l1: 208.41
[17:03:34] [200]        valid's l1: 208.413
[17:03:34] Early stopping, best iteration is:
[125]   valid's l1: 207.977
[17:03:34] LightGBM fitting and predicting completed
[17:03:34] Update best score from -208.04352023579347 to -207.97743742493378
[17:03:34] Started iteration 7, chunk = ['ft__plain_meal_id.NUM_UNIQUE(meal_info.cuisine)'], feats to check = ['ft__plain_center_id.MAX(fulfilment_center_info.op_area)', 'ft__plain_center_id.MAX(fulfilment_center_info.city_code WHERE center_type = TYPE_A)', 'ft__plain_meal_id.COUNT(meal_info WHERE cuisine = Italian)', 'ft__plain_meal_id.COUNT(meal_info WHERE cuisine = Continental)', 'ft__plain_center_id.MAX(fulfilment_center_info.op_area WHERE center_type = TYPE_B)', 'ft__plain_meal_id.COUNT(meal_info WHERE cuisine = Indian)', 'ft__plain_center_id.MAX(fulfilment_center_info.city_code)', 'ft__plain_center_id.MAX(fulfilment_center_info.op_area WHERE center_type = TYPE_A)', 'ft__plain_center_id.MAX(fulfilment_center_info.region_code WHERE center_type = TYPE_A)', 'ft__plain_meal_id.COUNT(meal_info WHERE cuisine = Thai)', 'ft__plain_center_id.MEDIAN(fulfilment_center_info.op_area)', 'ft__plain_center_id.MEAN(fulfilment_center_info.op_area)', 'ft__plain_center_id.MIN(fulfilment_center_info.region_code WHERE center_type = TYPE_C)', 'ft__plain_center_id.NUM_UNIQUE(fulfilment_center_info.center_type)', 'ft__plain_center_id.STD(fulfilment_center_info.city_code WHERE center_type = TYPE_A)', 'ft__plain_center_id.STD(fulfilment_center_info.city_code WHERE center_type = TYPE_B)', 'ft__plain_center_id.STD(fulfilment_center_info.city_code)', 'ft__plain_center_id.MIN(fulfilment_center_info.city_code WHERE center_type = TYPE_A)', 'ft__plain_center_id.STD(fulfilment_center_info.op_area WHERE center_type = TYPE_A)', 'ft__plain_center_id.STD(fulfilment_center_info.op_area WHERE center_type = TYPE_B)', 'ft__plain_center_id.STD(fulfilment_center_info.op_area WHERE center_type = TYPE_C)', 'ft__plain_center_id.STD(fulfilment_center_info.op_area)', 'ft__plain_center_id.STD(fulfilment_center_info.region_code WHERE center_type = TYPE_A)', 'ft__plain_center_id.STD(fulfilment_center_info.region_code WHERE center_type = TYPE_B)', 'ft__plain_meal_id.NUM_UNIQUE(meal_info.cuisine)']
[17:03:34] Features in SCI = ['ft__plain_center_id.MAX(fulfilment_center_info.city_code WHERE center_type = TYPE_A)', 'ft__plain_center_id.MAX(fulfilment_center_info.city_code)', 'ft__plain_center_id.MAX(fulfilment_center_info.op_area WHERE center_type = TYPE_A)', 'ft__plain_center_id.MAX(fulfilment_center_info.op_area WHERE center_type = TYPE_B)', 'ft__plain_center_id.MAX(fulfilment_center_info.op_area)', 'ft__plain_center_id.MAX(fulfilment_center_info.region_code WHERE center_type = TYPE_A)', 'ft__plain_center_id.MEAN(fulfilment_center_info.op_area)', 'ft__plain_center_id.MEDIAN(fulfilment_center_info.op_area)', 'ft__plain_center_id.MIN(fulfilment_center_info.city_code WHERE center_type = TYPE_A)', 'ft__plain_center_id.MIN(fulfilment_center_info.region_code WHERE center_type = TYPE_C)', 'ft__plain_center_id.NUM_UNIQUE(fulfilment_center_info.center_type)', 'ft__plain_center_id.STD(fulfilment_center_info.city_code WHERE center_type = TYPE_A)', 'ft__plain_center_id.STD(fulfilment_center_info.city_code WHERE center_type = TYPE_B)', 'ft__plain_center_id.STD(fulfilment_center_info.city_code)', 'ft__plain_center_id.STD(fulfilment_center_info.op_area WHERE center_type = TYPE_A)', 'ft__plain_center_id.STD(fulfilment_center_info.op_area WHERE center_type = TYPE_B)', 'ft__plain_center_id.STD(fulfilment_center_info.op_area WHERE center_type = TYPE_C)', 'ft__plain_center_id.STD(fulfilment_center_info.op_area)', 'ft__plain_center_id.STD(fulfilment_center_info.region_code WHERE center_type = TYPE_A)', 'ft__plain_center_id.STD(fulfilment_center_info.region_code WHERE center_type = TYPE_B)', 'ft__plain_meal_id.COUNT(meal_info WHERE cuisine = Continental)', 'ft__plain_meal_id.COUNT(meal_info WHERE cuisine = Indian)', 'ft__plain_meal_id.COUNT(meal_info WHERE cuisine = Italian)', 'ft__plain_meal_id.COUNT(meal_info WHERE cuisine = Thai)', 'ft__plain_meal_id.NUM_UNIQUE(meal_info.cuisine)']
[17:03:34] Training until validation scores don't improve for 100 rounds
[17:03:35] [100]        valid's l1: 208.41
[17:03:35] [200]        valid's l1: 208.413
[17:03:35] Early stopping, best iteration is:
[125]   valid's l1: 207.977
[17:03:35] LightGBM fitting and predicting completed
[17:03:35] Finally selected feats = ['ft__plain_center_id.MAX(fulfilment_center_info.op_area)', 'ft__plain_center_id.MAX(fulfilment_center_info.city_code WHERE center_type = TYPE_A)', 'ft__plain_meal_id.COUNT(meal_info WHERE cuisine = Italian)', 'ft__plain_meal_id.COUNT(meal_info WHERE cuisine = Continental)', 'ft__plain_center_id.MAX(fulfilment_center_info.op_area WHERE center_type = TYPE_B)', 'ft__plain_meal_id.COUNT(meal_info WHERE cuisine = Indian)', 'ft__plain_center_id.MAX(fulfilment_center_info.city_code)', 'ft__plain_center_id.MAX(fulfilment_center_info.op_area WHERE center_type = TYPE_A)', 'ft__plain_center_id.MAX(fulfilment_center_info.region_code WHERE center_type = TYPE_A)', 'ft__plain_meal_id.COUNT(meal_info WHERE cuisine = Thai)', 'ft__plain_center_id.MEDIAN(fulfilment_center_info.op_area)', 'ft__plain_center_id.MEAN(fulfilment_center_info.op_area)', 'ft__plain_center_id.NUM_UNIQUE(fulfilment_center_info.center_type)', 'ft__plain_center_id.STD(fulfilment_center_info.city_code WHERE center_type = TYPE_A)', 'ft__plain_center_id.MIN(fulfilment_center_info.region_code WHERE center_type = TYPE_C)', 'ft__plain_center_id.STD(fulfilment_center_info.city_code)', 'ft__plain_center_id.STD(fulfilment_center_info.op_area WHERE center_type = TYPE_A)', 'ft__plain_center_id.STD(fulfilment_center_info.op_area WHERE center_type = TYPE_B)', 'ft__plain_center_id.STD(fulfilment_center_info.op_area WHERE center_type = TYPE_C)', 'ft__plain_center_id.STD(fulfilment_center_info.op_area)', 'ft__plain_center_id.STD(fulfilment_center_info.region_code WHERE center_type = TYPE_A)', 'ft__plain_center_id.STD(fulfilment_center_info.region_code WHERE center_type = TYPE_B)', 'ft__plain_center_id.MIN(fulfilment_center_info.city_code WHERE center_type = TYPE_A)', 'ft__plain_center_id.STD(fulfilment_center_info.city_code WHERE center_type = TYPE_B)']
/home/rinchin/lama_gitlab/LightAutoML/.venv/lib/python3.8/site-packages/featuretools/entityset/entityset.py:1738: UserWarning: Woodwork typing information on new dataframe will be replaced with existing typing information from plain
  warnings.warn(
2023-03-30 17:03:40,001 - distributed.worker - WARNING - Could not find data: {'bytes-7832c43e218d1860da556fd287d63146': ['tcp://127.0.0.1:33721', 'tcp://127.0.0.1:43615', 'tcp://127.0.0.1:45185']} on workers: [] (who_has: {'bytes-7832c43e218d1860da556fd287d63146': ['tcp://127.0.0.1:33721', 'tcp://127.0.0.1:43615', 'tcp://127.0.0.1:45185']})
2023-03-30 17:03:40,004 - distributed.scheduler - WARNING - Worker tcp://127.0.0.1:45367 failed to acquire keys: {'bytes-7832c43e218d1860da556fd287d63146': ('tcp://127.0.0.1:33721', 'tcp://127.0.0.1:43615', 'tcp://127.0.0.1:45185')}
2023-03-30 17:03:40,143 - distributed.worker - WARNING - Could not find data: {'bytes-7832c43e218d1860da556fd287d63146': ['tcp://127.0.0.1:33721', 'tcp://127.0.0.1:43615', 'tcp://127.0.0.1:45185']} on workers: [] (who_has: {'EntitySet-12f0afe6b03337e360b57b2fea5ba056': ['tcp://127.0.0.1:42689', 'tcp://127.0.0.1:44001', 'tcp://127.0.0.1:43877'], 'bytes-7832c43e218d1860da556fd287d63146': ['tcp://127.0.0.1:33721', 'tcp://127.0.0.1:43615', 'tcp://127.0.0.1:45185']})
2023-03-30 17:03:40,144 - distributed.scheduler - WARNING - Worker tcp://127.0.0.1:45425 failed to acquire keys: {'bytes-7832c43e218d1860da556fd287d63146': ('tcp://127.0.0.1:33721', 'tcp://127.0.0.1:43615', 'tcp://127.0.0.1:45185')}
EntitySet scattered to 16 workers in 4 seconds
[17:03:42] Start fitting Lvl_0_Pipe_0_Mod_0_LightGBM ...
[17:03:42] ===== Start working with fold 0 for Lvl_0_Pipe_0_Mod_0_LightGBM =====
[17:03:42] Training until validation scores don't improve for 100 rounds
[17:03:42] [100]        valid's l1: 95.8142
[17:03:43] [200]        valid's l1: 91.9066
[17:03:43] [300]        valid's l1: 90.8876
[17:03:44] [400]        valid's l1: 90.3315
[17:03:45] [500]        valid's l1: 90.0936
[17:03:45] [600]        valid's l1: 90.0952
[17:03:46] [700]        valid's l1: 89.995
[17:03:46] Early stopping, best iteration is:
[679]   valid's l1: 89.9358
[17:03:47] ===== Start working with fold 1 for Lvl_0_Pipe_0_Mod_0_LightGBM =====
[17:03:47] Training until validation scores don't improve for 100 rounds
[17:03:47] [100]        valid's l1: 93.3286
[17:03:48] [200]        valid's l1: 89.2285
[17:03:48] [300]        valid's l1: 88.3445
[17:03:49] [400]        valid's l1: 88.2788
[17:03:50] [500]        valid's l1: 88.2329
[17:03:50] Early stopping, best iteration is:
[442]   valid's l1: 88.1347
[17:03:50] ===== Start working with fold 2 for Lvl_0_Pipe_0_Mod_0_LightGBM =====
[17:03:50] Training until validation scores don't improve for 100 rounds
[17:03:51] [100]        valid's l1: 95.2671
[17:03:51] [200]        valid's l1: 91.0065
[17:03:52] [300]        valid's l1: 90.3078
[17:03:52] [400]        valid's l1: 89.9448
[17:03:53] [500]        valid's l1: 89.8401
[17:03:53] Early stopping, best iteration is:
[484]   valid's l1: 89.7472
[17:03:54] ===== Start working with fold 3 for Lvl_0_Pipe_0_Mod_0_LightGBM =====
[17:03:54] Training until validation scores don't improve for 100 rounds
[17:03:54] [100]        valid's l1: 92.7871
[17:03:55] [200]        valid's l1: 88.5443
[17:03:55] [300]        valid's l1: 87.5072
[17:03:56] [400]        valid's l1: 87.3209
[17:03:57] [500]        valid's l1: 87.0954
[17:03:57] [600]        valid's l1: 87.1619
[17:03:57] Early stopping, best iteration is:
[509]   valid's l1: 87.0716
[17:03:58] ===== Start working with fold 4 for Lvl_0_Pipe_0_Mod_0_LightGBM =====
[17:03:58] Training until validation scores don't improve for 100 rounds
[17:03:58] [100]        valid's l1: 92.7026
[17:03:59] [200]        valid's l1: 88.2765
[17:03:59] [300]        valid's l1: 87.0135
[17:04:00] [400]        valid's l1: 86.527
[17:04:00] Early stopping, best iteration is:
[396]   valid's l1: 86.4735
[17:04:01] Fitting Lvl_0_Pipe_0_Mod_0_LightGBM finished. score = -88.27262874082102
[17:04:01] Lvl_0_Pipe_0_Mod_0_LightGBM fitting and predicting completed
[17:04:01] Time left 9999999950.01 secs

[17:04:01] Layer 1 training completed.

CPU times: user 2min 13s, sys: 5.83 s, total: 2min 19s
Wall time: 50 s

In the “Finally selected feats” line, we can see the features generated by FeatureGeneratorPipeline and selected using LightGBM. They are built from aggregations, transformations and interesting values. For example, the feature 'ft__plain_center_id.MEDIAN(fulfilment_center_info.region_code WHERE center_type = TYPE_A)' is the median of the 'region_code' column in the fulfilment_center_info table (which is linked to the 'plain' dataset by the 'center_id' key), computed over the rows where 'center_type' equals 'TYPE_A'.
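To make the naming scheme concrete, here is a minimal pandas sketch of what such a conditional aggregate represents. It is not how the feature generator computes it internally. The toy tables and their values are made up for illustration, and only the column names follow the tutorial.

```python
import pandas as pd

# Toy stand-ins for the tutorial tables (all values are invented).
plain = pd.DataFrame({"id": [1, 2, 3], "center_id": [10, 11, 10]})
fulfilment_center_info = pd.DataFrame(
    {
        "center_id": [10, 11],
        "center_type": ["TYPE_A", "TYPE_B"],
        "region_code": [56, 34],
    }
)

# WHERE center_type = TYPE_A: keep only the rows that satisfy the condition.
type_a = fulfilment_center_info[fulfilment_center_info["center_type"] == "TYPE_A"]

# MEDIAN(fulfilment_center_info.region_code): aggregate per 'center_id' key.
median_region = type_a.groupby("center_id")["region_code"].median()

# Attach the aggregate to the main table through the 'center_id' key.
# Centers with no TYPE_A rows get NaN.
feature_name = (
    "ft__plain_center_id.MEDIAN(fulfilment_center_info.region_code "
    "WHERE center_type = TYPE_A)"
)
plain[feature_name] = plain["center_id"].map(median_region)
print(plain)
```

Rows of the main table whose center has no matching 'TYPE_A' record end up with a missing value, which is one reason conditional aggregates can be sparse.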

Analyze fitted model

Let’s look at the features used by the trained model, both original and generated, together with their importances (as reported by LightGBM):

[17]:
feature_imps = model.get_features_score()
feature_imps
[17]:
ord__checkout_price                                                                       7.618603e+09
meal_id                                                                                   5.803957e+09
ord__base_price                                                                           4.911443e+09
homepage_featured                                                                         2.367179e+09
week                                                                                      2.110229e+09
ft__plain_center_id.MAX(fulfilment_center_info.op_area)                                   1.824585e+09
emailer_for_promotion                                                                     1.425533e+09
center_id                                                                                 1.280621e+09
id                                                                                        1.044121e+09
ft__plain_center_id.MEDIAN(fulfilment_center_info.op_area)                                9.298832e+08
ft__plain_center_id.MAX(fulfilment_center_info.city_code WHERE center_type = TYPE_A)      8.705599e+08
ft__plain_meal_id.COUNT(meal_info WHERE cuisine = Italian)                                8.173745e+08
ft__plain_meal_id.COUNT(meal_info WHERE cuisine = Thai)                                   8.160740e+08
ft__plain_meal_id.COUNT(meal_info WHERE cuisine = Indian)                                 8.016899e+08
ft__plain_center_id.MAX(fulfilment_center_info.city_code)                                 7.446240e+08
ft__plain_center_id.MAX(fulfilment_center_info.op_area WHERE center_type = TYPE_B)        5.920852e+08
ft__plain_center_id.MAX(fulfilment_center_info.region_code WHERE center_type = TYPE_A)    5.403963e+08
ft__plain_center_id.MAX(fulfilment_center_info.op_area WHERE center_type = TYPE_A)        5.356365e+08
ft__plain_center_id.MEAN(fulfilment_center_info.op_area)                                  3.982109e+08
ft__plain_meal_id.COUNT(meal_info WHERE cuisine = Continental)                            3.171587e+08
ft__plain_center_id.MIN(fulfilment_center_info.region_code WHERE center_type = TYPE_C)    2.466869e+08
ft__plain_center_id.MIN(fulfilment_center_info.city_code WHERE center_type = TYPE_A)      1.435612e+08
ft__plain_center_id.STD(fulfilment_center_info.city_code WHERE center_type = TYPE_B)      0.000000e+00
ft__plain_center_id.STD(fulfilment_center_info.op_area)                                   0.000000e+00
ft__plain_center_id.STD(fulfilment_center_info.region_code WHERE center_type = TYPE_B)    0.000000e+00
ft__plain_center_id.STD(fulfilment_center_info.region_code WHERE center_type = TYPE_A)    0.000000e+00
ft__plain_center_id.STD(fulfilment_center_info.op_area WHERE center_type = TYPE_C)        0.000000e+00
ft__plain_center_id.STD(fulfilment_center_info.op_area WHERE center_type = TYPE_B)        0.000000e+00
ft__plain_center_id.STD(fulfilment_center_info.city_code)                                 0.000000e+00
ft__plain_center_id.STD(fulfilment_center_info.city_code WHERE center_type = TYPE_A)      0.000000e+00
ft__plain_center_id.NUM_UNIQUE(fulfilment_center_info.center_type)                        0.000000e+00
ft__plain_center_id.STD(fulfilment_center_info.op_area WHERE center_type = TYPE_A)        0.000000e+00
dtype: float64

Quite a large number of features have non-zero importance (22 of the 32 listed above):

[18]:
feature_imps.index[feature_imps > 0]
[18]:
Index(['ord__checkout_price', 'meal_id', 'ord__base_price',
       'homepage_featured', 'week',
       'ft__plain_center_id.MAX(fulfilment_center_info.op_area)',
       'emailer_for_promotion', 'center_id', 'id',
       'ft__plain_center_id.MEDIAN(fulfilment_center_info.op_area)',
       'ft__plain_center_id.MAX(fulfilment_center_info.city_code WHERE center_type = TYPE_A)',
       'ft__plain_meal_id.COUNT(meal_info WHERE cuisine = Italian)',
       'ft__plain_meal_id.COUNT(meal_info WHERE cuisine = Thai)',
       'ft__plain_meal_id.COUNT(meal_info WHERE cuisine = Indian)',
       'ft__plain_center_id.MAX(fulfilment_center_info.city_code)',
       'ft__plain_center_id.MAX(fulfilment_center_info.op_area WHERE center_type = TYPE_B)',
       'ft__plain_center_id.MAX(fulfilment_center_info.region_code WHERE center_type = TYPE_A)',
       'ft__plain_center_id.MAX(fulfilment_center_info.op_area WHERE center_type = TYPE_A)',
       'ft__plain_center_id.MEAN(fulfilment_center_info.op_area)',
       'ft__plain_meal_id.COUNT(meal_info WHERE cuisine = Continental)',
       'ft__plain_center_id.MIN(fulfilment_center_info.region_code WHERE center_type = TYPE_C)',
       'ft__plain_center_id.MIN(fulfilment_center_info.city_code WHERE center_type = TYPE_A)'],
      dtype='object')
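The same importance series can also be split into original and generated features. The sketch below assumes that the 'ft__' prefix marks generated columns, which is what the names in the output suggest. To stay self-contained it rebuilds a small Series from a few rows copied from the output above, whereas in the notebook `feature_imps` would be used directly.

```python
import pandas as pd

# A few rows copied from the importance output above.
feature_imps = pd.Series(
    {
        "ord__checkout_price": 7.618603e09,
        "meal_id": 5.803957e09,
        "ft__plain_center_id.MAX(fulfilment_center_info.op_area)": 1.824585e09,
        "ft__plain_meal_id.COUNT(meal_info WHERE cuisine = Italian)": 8.173745e08,
        "ft__plain_center_id.STD(fulfilment_center_info.op_area)": 0.0,
    }
)

# Assumption: generated features are the ones whose name starts with 'ft__'.
is_generated = feature_imps.index.str.startswith("ft__")
generated = feature_imps[is_generated]

# Generated features the model actually used vs. those it ignored.
generated_used = generated[generated > 0]
generated_unused = generated[generated == 0]

print("Used generated features:", list(generated_used.index))
print("Unused generated features:", list(generated_unused.index))
```

Generated features with zero importance are candidates for removal in a later run, although this only reflects the single LightGBM model trained here.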

Evaluation

[21]:
test_pred = automl.predict(test)
EntitySet scattered to 16 workers in 4 seconds
[22]:
print(f"OOF MAE on train: {mean_absolute_error(train['plain'][roles['target']], train_pred.data[:, 0])}")
print(f"MAE on test: {mean_absolute_error(test['plain'][roles['target']], test_pred.data[:, 0])}")
OOF MAE on train: 88.27262874082102
MAE on test: 95.97085837200986
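The OOF MAE printed above has the same magnitude as the score logged when fitting finished (-88.27262874082102), which suggests that the logged score is the negated MAE. As a rough sketch of how to read these numbers, the snippet below computes MAE by hand on made-up values and then compares the two reported errors. The helper `mean_absolute_error_manual` is a hypothetical name introduced only for this illustration.

```python
def mean_absolute_error_manual(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up targets and predictions, for illustration only.
toy_mae = mean_absolute_error_manual([100, 250, 40], [90, 270, 45])
print(f"Toy MAE: {toy_mae:.3f}")

# Values copied from the output above.
oof_mae = 88.27262874082102
test_mae = 95.97085837200986

# Relative gap between the held-out test error and the out-of-fold estimate.
relative_gap = (test_mae - oof_mae) / oof_mae
print(f"Test MAE is {relative_gap:.1%} higher than OOF MAE")
```

A test error somewhat above the out-of-fold estimate is common. A much larger gap would be a reason to check for leakage or a distribution shift between the train and test splits.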

Additional materials