AA test multigroup split simple

Provides simple approach to split into 2 and more groups.
Uses memory more efficiently than old method “process”
[1]:
import numpy as np
import pandas as pd

from lightautoml.addons.hypex import AATest
from lightautoml.addons.hypex.utils.tutorial_data_creation import create_test_data

pd.options.display.float_format = '{:,.2f}'.format

np.random.seed(42)  # needed to create example data

Prepare some large data

[2]:
some_large_dataframe = create_test_data(rs=52, na_step=10, nan_cols=['age', 'gender'], num_users=100_000)
some_large_dataframe
[2]:
user_id signup_month treat pre_spends post_spends age gender industry
0 0 0 0 478.00 422.89 NaN M Logistics
1 1 0 0 472.50 407.22 68.00 NaN E-commerce
2 2 0 0 485.00 426.11 44.00 F Logistics
3 3 8 1 485.00 466.11 59.00 F E-commerce
4 4 1 1 539.00 522.78 54.00 M E-commerce
... ... ... ... ... ... ... ... ...
99995 99995 0 0 473.50 428.56 22.00 F Logistics
99996 99996 0 0 497.00 421.89 65.00 M E-commerce
99997 99997 3 1 485.00 517.56 20.00 M Logistics
99998 99998 0 0 458.50 410.78 69.00 M E-commerce
99999 99999 0 0 494.00 416.89 44.00 F E-commerce

100000 rows × 8 columns

Process split

[3]:
target = ['post_spends', 'pre_spends']
# Dict
#  key - group name
#  value - share
#  The sum of the shares is 1
groups = {'test1': 0.3, 'test2': 0.2, 'control': 0.5 }

aa_test = AATest()
results = aa_test.process_split(df=some_large_dataframe, target_fields=target, groups=groups)
../../_images/pages_tutorials_Tutorial_13_AA_Test_multigroup_split_5_2.png
../../_images/pages_tutorials_Tutorial_13_AA_Test_multigroup_split_5_3.png
../../_images/pages_tutorials_Tutorial_13_AA_Test_multigroup_split_5_4.png
[4]:
results.keys()
[4]:
dict_keys(['best metric', 'best split', 'all metrics', 'all splits', 'best split DataFrame', 'get_resume'])

results is a dictionary with dataframes as values.

  • ‘best split’ - result of best separation

  • ‘best metric’ - metrics of best split

  • ‘all metrics’ - metrics of all experiments (i.e. of random splits)

  • ‘all splists’ - results of all random splits

  • ‘best split DataFrame’ - pandas DataFrame with column ‘group’ (or defined by ‘group_column_name’ parameter) contains values defined as keys in ‘groups’ parameter

  • ‘get resume’ - function that plots p-values distribution of all experiments

[5]:
results['best metric']
[5]:
control post_spends mean                          451.86
test1 post_spends mean                            451.93
test2 post_spends mean                            451.90
post_spends mean delta (control - test1)           -0.06
post_spends mean delta% (control - test1)/test1    -0.01
post_spends t-test p-value (control,test1)          0.83
post_spends ks-test p-value (control,test1)         0.98
post_spends mean delta (control - test2)           -0.04
post_spends mean delta% (control - test2)/test2    -0.01
post_spends t-test p-value (control,test2)          0.91
post_spends ks-test p-value (control,test2)         1.00
post_spends mean delta (test1 - test2)              0.02
post_spends mean delta% (test1 - test2)/test2       0.01
post_spends t-test p-value (test1,test2)            0.95
post_spends ks-test p-value (test1,test2)           0.97
post_spends mean t-test p-value                     0.89
post_spends mean ks-test p-value                    0.98
post_spends t-test passed                           True
post_spends ks-test passed                          True
control pre_spends mean                           487.31
test1 pre_spends mean                             487.35
test2 pre_spends mean                             487.27
pre_spends mean delta (control - test1)            -0.04
pre_spends mean delta% (control - test1)/test1     -0.01
pre_spends t-test p-value (control,test1)           0.75
pre_spends ks-test p-value (control,test1)          0.77
pre_spends mean delta (control - test2)             0.03
pre_spends mean delta% (control - test2)/test2      0.01
pre_spends t-test p-value (control,test2)           0.83
pre_spends ks-test p-value (control,test2)          0.94
pre_spends mean delta (test1 - test2)               0.08
pre_spends mean delta% (test1 - test2)/test2        0.02
pre_spends t-test p-value (test1,test2)             0.66
pre_spends ks-test p-value (test1,test2)            0.51
pre_spends mean t-test p-value                      0.83
pre_spends mean ks-test p-value                     0.88
pre_spends t-test passed                            True
pre_spends ks-test passed                           True
mean of means t-test p-value                        0.86
mean of means ks-test p-value                       0.93
t-test passed %                                   100.00
ks-test passed %                                  100.00
mean_test_score                                     0.91
experiment_index                                     909
Name: 908, dtype: object
[ ]: