Matching without replacement

0. Import libraries

[1]:
from lightautoml.addons.hypex import Matcher

1. Create or upload your dataset

In this example we create a random dataset with a known effect size.
If you have your own dataset, skip to part 2.
[2]:
from lightautoml.addons.hypex.utils.tutorial_data_creation import create_test_data
[3]:
df = create_test_data(num_users=10000, rs=42, na_step=45, nan_cols=['age', 'gender'])
df
[3]:
      user_id  signup_month  treat  pre_spends  post_spends   age  gender    industry
   0        0             0      0       504.5   422.777778   NaN       F   Logistics
   1        1             4      1       500.0   506.333333  51.0     NaN  E-commerce
   2        2             0      0       485.0   434.000000  56.0       F   Logistics
   3        3             8      1       452.0   468.111111  46.0       M  E-commerce
   4        4             0      0       488.5   420.111111  56.0       M   Logistics
 ...      ...           ...    ...         ...          ...   ...     ...         ...
9995     9995             2      1       482.0   501.666667  31.0       M   Logistics
9996     9996             0      0       453.0   406.888889  53.0       M   Logistics
9997     9997             0      0       461.0   415.111111  52.0       F  E-commerce
9998     9998            10      1       491.5   439.222222  22.0       M  E-commerce
9999     9999             2      1       481.0   517.222222  53.0       M  E-commerce

10000 rows × 8 columns
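
To make "known effect size" concrete, here is a minimal sketch of a synthetic dataset in which treatment shifts the outcome by a fixed amount. It is not how create_test_data works internally; the function name make_toy_data, the distributions, and the effect of 50 are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd


def make_toy_data(n=1000, effect=50.0, seed=42):
    """Hypothetical generator: treatment adds `effect` to the outcome on average."""
    rng = np.random.default_rng(seed)
    treat = rng.integers(0, 2, size=n)            # random 0/1 assignment
    pre_spends = rng.normal(480.0, 20.0, size=n)  # pre-treatment covariate
    noise = rng.normal(0.0, 5.0, size=n)
    # The outcome depends on the covariate plus a fixed shift for treated users.
    post_spends = 0.9 * pre_spends + effect * treat + noise
    return pd.DataFrame(
        {
            "user_id": np.arange(n),
            "treat": treat,
            "pre_spends": pre_spends,
            "post_spends": post_spends,
        }
    )


toy = make_toy_data()
# Because assignment is random here, the raw difference in means is close to `effect`.
naive_diff = (
    toy.loc[toy["treat"] == 1, "post_spends"].mean()
    - toy.loc[toy["treat"] == 0, "post_spends"].mean()
)
print(round(naive_diff, 2))
```

Knowing the true effect in advance lets you judge how close any estimate obtained after matching comes to it.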

[4]:
df.columns
[4]:
Index(['user_id', 'signup_month', 'treat', 'pre_spends', 'post_spends', 'age',
       'gender', 'industry'],
      dtype='object')
[5]:
df['treat'].value_counts()
[5]:
treat
0    5002
1    4998
Name: count, dtype: int64
[6]:
df['gender'].isna().sum()
[6]:
223

2. Matching without replacement

2.0 Init params

info_col lists informative attributes, such as user_id, that should not take part in matching.
These columns are still kept in the output table, so you can compare matched rows directly after the computation.
[7]:
info_col = ['user_id']

outcome = 'post_spends'
treatment = 'treat'
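
The split these parameters describe can be sketched in plain pandas: informative, outcome, and treatment columns are set aside, and only the remaining columns act as matching features. This is an illustration of the idea on a hypothetical toy frame, not the library's internal code.

```python
import pandas as pd

# Hypothetical toy frame using the tutorial's column names.
toy = pd.DataFrame(
    {
        "user_id": [10, 11, 12, 13],
        "treat": [0, 1, 0, 1],
        "pre_spends": [504.5, 500.0, 485.0, 452.0],
        "post_spends": [422.8, 506.3, 434.0, 468.1],
    }
)

info_col = ["user_id"]
outcome = "post_spends"
treatment = "treat"

# Columns that must not influence which rows get paired.
excluded = set(info_col) | {outcome, treatment}
feature_cols = [c for c in toy.columns if c not in excluded]

features = toy[feature_cols]  # used to measure similarity between users
info = toy[info_col]          # carried along so matched rows can be identified

print(feature_cols)
```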

2.1 Matching

This is the simplest way to initialize the Matcher and calculate metrics for a matching task.
Use it when you understand every attribute and have no additional task conditions (for example, strict equality on certain features).
[8]:
# Standard model with base parameters
model = Matcher(input_data=df, outcome=outcome, treatment=treatment, info_col=info_col)
[18.12.2024 22:04:52 | hypex | INFO]: Number of NaN values filled with zeros: 446
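
The log line reports 446 filled values. The gender column alone has 223 missing values (cell [6]), so the remaining 223 presumably come from age, the other column passed in nan_cols. A minimal sketch of counting missing values and filling them with zeros, on a hypothetical toy frame rather than the library's own preprocessing:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing age and one missing gender.
toy = pd.DataFrame(
    {
        "age": [np.nan, 51.0, 56.0, 46.0],
        "gender": pd.Series(["F", None, "F", "M"], dtype="object"),
        "pre_spends": [504.5, 500.0, 485.0, 452.0],
    }
)

missing_per_col = toy.isna().sum()       # missing values in each column
n_missing = int(missing_per_col.sum())   # total across the whole frame
filled = toy.fillna(0)                   # replace every missing value with zero

print(missing_per_col.to_dict(), n_missing)
```

Zero-filling explains why rows with an unknown age show up as age 0.0 in the matched output.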
[9]:
df_matched = model.match_no_rep()
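
"Without replacement" means that once a unit has been used as a match it cannot be paired again. The notebook does not show how match_no_rep pairs units internally, so the sketch below uses a simple greedy nearest-neighbour rule as a stand-in; the function name and the toy feature values are assumptions.

```python
import numpy as np


def greedy_match_no_replacement(treated, control):
    """Pair each treated row with its nearest still-unused control row."""
    treated = np.asarray(treated, dtype=float)
    control = np.asarray(control, dtype=float)
    # Euclidean distance between every treated and every control row.
    dist = np.linalg.norm(treated[:, None, :] - control[None, :, :], axis=2)
    available = np.ones(len(control), dtype=bool)
    pairs = []
    for i in range(len(treated)):
        if not available.any():
            break  # no control rows left to pair with
        # Hide already-used controls by giving them infinite distance.
        masked = np.where(available, dist[i], np.inf)
        j = int(np.argmin(masked))
        pairs.append((i, j))
        available[j] = False  # this control cannot be reused
    return pairs


# Toy features: [pre_spends, age]
treated = [[500.0, 51.0], [452.0, 46.0]]
control = [[504.5, 50.0], [485.0, 56.0], [453.5, 42.0]]
pairs = greedy_match_no_replacement(treated, control)
print(pairs)  # [(0, 0), (1, 2)]
```

A greedy pass depends on row order; an optimal assignment over the whole distance matrix is a common alternative when that matters.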
[10]:
df_matched
[10]:
signup_month treat pre_spends post_spends age gender_F gender_M industry_Logistics user_id signup_month_matched treat_matched pre_spends_matched post_spends_matched age_matched gender_F_matched gender_M_matched industry_Logistics_matched user_id_matched
0 0 0 504.5 422.777778 0.0 1 0 1 0 4 1 522.5 509.777778 0.0 1 0 1 4095
1 4 1 500.0 506.333333 51.0 0 0 0 1 0 0 488.0 420.333333 50.0 0 0 0 3916
2 0 0 485.0 434.000000 56.0 1 0 1 2 2 1 492.5 510.777778 56.0 1 0 1 6057
3 8 1 452.0 468.111111 46.0 0 1 0 3 0 0 453.5 415.222222 42.0 0 1 0 5255
4 0 0 488.5 420.111111 56.0 0 1 1 4 8 1 488.0 472.111111 59.0 0 1 1 6874
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
9991 2 1 482.0 501.666667 31.0 0 1 1 9995 0 0 474.0 435.111111 28.0 0 1 1 8397
9992 0 0 453.0 406.888889 53.0 0 1 1 9996 4 1 459.0 518.777778 57.0 0 0 1 8416
9993 0 0 461.0 415.111111 52.0 1 0 0 9997 7 1 459.5 472.111111 51.0 1 0 0 5727
9994 10 1 491.5 439.222222 22.0 0 1 0 9998 0 0 492.5 411.555556 25.0 0 1 0 5675
9995 2 1 481.0 517.222222 53.0 0 1 0 9999 0 0 471.5 427.000000 53.0 0 1 0 6361

9996 rows × 18 columns
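
The row count is consistent with one-to-one pairing: the smaller group has 4998 users (cell [5]), and 2 × 4998 = 9996, which would leave 4 control users without a match. A minimal sketch of how one might check the no-reuse property, using a hypothetical toy frame that follows the same `_matched` column naming:

```python
import pandas as pd

# Hypothetical matched output: each row holds a user and the user paired with it.
toy_matched = pd.DataFrame(
    {
        "user_id": [0, 1, 2, 3],
        "treat": [0, 1, 0, 1],
        "user_id_matched": [1, 0, 3, 2],
        "treat_matched": [1, 0, 1, 0],
    }
)

# No user should appear as a match more than once.
no_reuse = bool(toy_matched["user_id_matched"].is_unique)
# Every pair should join a treated user with a control user.
opposite_groups = bool((toy_matched["treat"] != toy_matched["treat_matched"]).all())

print(no_reuse, opposite_groups)
```

The same two expressions can be run on df_matched to confirm how the real output behaves.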