Matching without replacement

0. Import libraries

[1]:

from lightautoml.addons.hypex import Matcher

In this case we will create random dataset with known effect size

If you have your own dataset, go to the part 2

[2]:

from lightautoml.addons.hypex.utils.tutorial_data_creation import create_test_data

[3]:

df = create_test_data(num_users=10000, rs=42, na_step=45, nan_cols=['age', 'gender'])
df

[3]:

	user_id	signup_month	treat	pre_spends	post_spends	age	gender	industry
0	0	0	0	504.5	422.777778	NaN	F	Logistics
1	1	4	1	500.0	506.333333	51.0	NaN	E-commerce
2	2	0	0	485.0	434.000000	56.0	F	Logistics
3	3	8	1	452.0	468.111111	46.0	M	E-commerce
4	4	0	0	488.5	420.111111	56.0	M	Logistics
...	...	...	...	...	...	...	...	...
9995	9995	2	1	482.0	501.666667	31.0	M	Logistics
9996	9996	0	0	453.0	406.888889	53.0	M	Logistics
9997	9997	0	0	461.0	415.111111	52.0	F	E-commerce
9998	9998	10	1	491.5	439.222222	22.0	M	E-commerce
9999	9999	2	1	481.0	517.222222	53.0	M	E-commerce

10000 rows × 8 columns

[4]:

df.columns

[4]:

Index(['user_id', 'signup_month', 'treat', 'pre_spends', 'post_spends', 'age',
       'gender', 'industry'],
      dtype='object')

[5]:

df['treat'].value_counts()

[5]:

treat
0    5002
1    4998
Name: count, dtype: int64

[6]:

df['gender'].isna().sum()

[6]:

info_col used to define informative attributes that should not be part of matching, such as user_id

But to explicitly store this column in the table, so that you can compare directly after computation

[7]:

info_col = ['user_id']

outcome = 'post_spends'
treatment = 'treat'

This is the easiest way to initialize and calculate metrics on a Matching task

Use it when you are clear about each attribute or if you don’t have any additional task conditions (Strict equality for certain features)

[8]:

# Standard model with base parameters
model = Matcher(input_data=df, outcome=outcome, treatment=treatment, info_col=info_col)

[18.12.2024 22:04:52 | hypex | INFO]: Number of NaN values filled with zeros: 446

[9]:

df_matched = model.match_no_rep()

[10]:

df_matched

[10]:

	signup_month	treat	pre_spends	post_spends	age	gender_F	gender_M	industry_Logistics	user_id	signup_month_matched	treat_matched	pre_spends_matched	post_spends_matched	age_matched	gender_F_matched	gender_M_matched	industry_Logistics_matched	user_id_matched
0	0	0	504.5	422.777778	0.0	1	0	1	0	4	1	522.5	509.777778	0.0	1	0	1	4095
1	4	1	500.0	506.333333	51.0	0	0	0	1	0	0	488.0	420.333333	50.0	0	0	0	3916
2	0	0	485.0	434.000000	56.0	1	0	1	2	2	1	492.5	510.777778	56.0	1	0	1	6057
3	8	1	452.0	468.111111	46.0	0	1	0	3	0	0	453.5	415.222222	42.0	0	1	0	5255
4	0	0	488.5	420.111111	56.0	0	1	1	4	8	1	488.0	472.111111	59.0	0	1	1	6874
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
9991	2	1	482.0	501.666667	31.0	0	1	1	9995	0	0	474.0	435.111111	28.0	0	1	1	8397
9992	0	0	453.0	406.888889	53.0	0	1	1	9996	4	1	459.0	518.777778	57.0	0	0	1	8416
9993	0	0	461.0	415.111111	52.0	1	0	0	9997	7	1	459.5	472.111111	51.0	1	0	0	5727
9994	10	1	491.5	439.222222	22.0	0	1	0	9998	0	0	492.5	411.555556	25.0	0	1	0	5675
9995	2	1	481.0	517.222222	53.0	0	1	0	9999	0	0	471.5	427.000000	53.0	0	1	0	6361

9996 rows × 18 columns