Matching

0. Import libraries

[1]:

from lightautoml.addons.hypex import Matcher

1. Create or upload your dataset

In this example we create a random dataset with a known effect size.

If you have your own dataset, go to part 2.

[2]:

from lightautoml.addons.hypex.utils.tutorial_data_creation import create_test_data

[3]:

df = create_test_data(num_users=10000, rs=42, na_step=45, nan_cols=['age', 'gender'])
df

[3]:

	user_id	signup_month	treat	pre_spends	post_spends	age	gender	industry
0	0	0	0	504.5	422.777778	NaN	F	Logistics
1	1	4	1	500.0	506.333333	51.0	NaN	E-commerce
2	2	0	0	485.0	434.000000	56.0	F	Logistics
3	3	8	1	452.0	468.111111	46.0	M	E-commerce
4	4	0	0	488.5	420.111111	56.0	M	Logistics
...	...	...	...	...	...	...	...	...
9995	9995	2	1	482.0	501.666667	31.0	M	Logistics
9996	9996	0	0	453.0	406.888889	53.0	M	Logistics
9997	9997	0	0	461.0	415.111111	52.0	F	E-commerce
9998	9998	10	1	491.5	439.222222	22.0	M	E-commerce
9999	9999	2	1	481.0	517.222222	53.0	M	E-commerce

10000 rows × 8 columns

[4]:

df.columns

[4]:

Index(['user_id', 'signup_month', 'treat', 'pre_spends', 'post_spends', 'age',
       'gender', 'industry'],
      dtype='object')

[5]:

df['treat'].value_counts()

[5]:

treat
0    5002
1    4998
Name: count, dtype: int64

[6]:

df['gender'].isna().sum()

[6]:

2. Matching

2.0 Init params

info_col defines informative attributes that should not take part in matching, such as user_id.

Such columns are excluded from matching but are stored explicitly, so that you can compare records directly after the computation.

[7]:

info_col = ['user_id']

outcome = 'post_spends'
treatment = 'treat'

2.1 Simple matching

This is the easiest way to initialize a Matching task and calculate its metrics.

Use it when you understand every attribute and have no additional task conditions (for example, strict equality for certain features).

[8]:

# Standard model with base parameters
model = Matcher(input_data=df, outcome=outcome, treatment=treatment, info_col=info_col,
                algo='fast')

[18.12.2024 22:05:15 | hypex | INFO]: Number of NaN values filled with zeros: 446

Feature selection models the significance of features for the accuracy of target approximation. However, it does not rule out the possibility of overlooked features, the complex impact of features on target description, or the significance of features from a business logic perspective. The algorithm will not function correctly if there are data leaks. Points to consider when selecting features:

Data leaks - these should not be present.
Influence on treatment distribution - features should not affect the treatment distribution.
The target should be describable by features.
All features significantly affecting the target should be included.
The business rationale of features.
The feature selection function can help with these tasks, but it does not solve them, and it neither relieves the user of responsibility for the selection nor justifies it.


[9]:

selected_features = model.feature_select()
selected_features

/Users/tikhomirov/PycharmProjects/Sber_New/LightAutoML/.venv/lib/python3.10/site-packages/hypex/selectors/feature_selector.py:42: UserWarning: FeatureSelector does not rule out the possibility of overlooked features, the complex impact of features on target description, or the significance of features from a business logic perspective.
  warnings.warn(

[9]:

	rank
signup_month	1
pre_spends	2
age	3
gender_F	4
gender_M	5
industry_Logistics	6

[10]:

chosen_features = selected_features[:4].index
chosen_features

[10]:

Index(['signup_month', 'pre_spends', 'age', 'gender_F'], dtype='object')

[11]:

results, quality_results, df_matched = model.estimate(features=chosen_features)

[18.12.2024 22:05:18 | Faiss hypex | INFO]: The entry of bias into the ATT is 0.1%
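Conceptually, matching pairs each unit with a similar unit from the opposite group and compares their outcomes. The sketch below illustrates that idea for the ATT with a brute-force nearest-neighbour search in plain Python. It is not the hypex implementation (which, judging by the log, relies on Faiss and also accounts for bias); the function names and the toy data are invented for illustration.

```python
import math


def nearest_control(treated_features, controls):
    """Return the control unit closest to the feature vector (Euclidean distance)."""
    return min(controls, key=lambda unit: math.dist(treated_features, unit["features"]))


def att_by_matching(treated, controls):
    """Average outcome difference between each treated unit and its nearest control."""
    diffs = []
    for unit in treated:
        match = nearest_control(unit["features"], controls)
        diffs.append(unit["outcome"] - match["outcome"])
    return sum(diffs) / len(diffs)


# Toy data, invented for illustration only.
treated = [
    {"features": [1.0], "outcome": 12.0},
    {"features": [2.0], "outcome": 15.0},
]
controls = [
    {"features": [1.1], "outcome": 10.0},
    {"features": [2.2], "outcome": 11.0},
    {"features": [5.0], "outcome": 20.0},
]

att = att_by_matching(treated, controls)
print(att)
```

Matching in the other direction (each control unit to its nearest treated unit) gives the ATC in the same way.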

What should you check in this table?

effect size - the estimated effect; ATT is the effect in the treated group, ATC is the effect in the control group, and ATE is the effect over the whole sample
standard error - measures the uncertainty of the estimate: the smaller it is, the more precisely the effect is estimated
p-value - the probability of obtaining an effect at least as large as the estimated one if the true effect were zero (for example, a p-value close to 0 means the result is very unlikely to be due to chance alone)
ci (confidence interval) - the interval that covers the true value of the parameter (ATT, ATC, ATE) with a given confidence level

[12]:

results

[12]:

	effect_size	std_err	ci_lower	ci_upper	outcome
ATE	82.636764	2.339283	78.051770	87.221757	post_spends
ATC	101.687566	4.571439	92.727546	110.647585	post_spends
ATT	63.570714	0.694726	62.209051	64.932378	post_spends
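The columns of the table above are related to each other: the interval bounds are consistent with effect_size ± 1.96 * std_err, that is, a 95% normal-approximation interval. The sketch below reproduces the ATT row under that assumption and derives a two-sided p-value from the same quantities. Whether hypex computes these values exactly this way is an assumption; the function names are illustrative.

```python
from statistics import NormalDist


def normal_ci(effect_size, std_err, z=1.96):
    """Normal-approximation confidence interval (z=1.96 corresponds to about 95%)."""
    half_width = z * std_err
    return effect_size - half_width, effect_size + half_width


def two_sided_p_value(effect_size, std_err):
    """Two-sided p-value for the null hypothesis that the true effect is zero."""
    z_score = abs(effect_size / std_err)
    return 2.0 * (1.0 - NormalDist().cdf(z_score))


# ATT row from the results table above.
ci_lower, ci_upper = normal_ci(63.570714, 0.694726)
p_value = two_sided_p_value(63.570714, 0.694726)
print(round(ci_lower, 4), round(ci_upper, 4), p_value)
```

Because the effect is many standard errors away from zero, the p-value is numerically indistinguishable from 0.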

The variable quality_results contains:

results of the PSI test
results of the Kolmogorov-Smirnov test
results of SMD
fraction of repeats

PSI (Population Stability Index)

shows the difference between treated and untreated populations

Rules:

PSI < 0.1 - No change. You can continue using the existing model.
0.1 <= PSI < 0.2 - Slight change is required.
PSI >= 0.2 - Significant change is required. Ideally, you should not use this model any more.
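For reference, the usual PSI formula sums (actual - expected) * ln(actual / expected) over the bins of a feature. The sketch below applies it to pre-binned fractions and maps the value to the rules above. It is a minimal illustration, not the hypex implementation; the binning, the eps guard and the function names are assumptions.

```python
import math


def psi(expected_fractions, actual_fractions, eps=1e-6):
    """Population Stability Index over matching bins of two distributions."""
    total = 0.0
    for expected, actual in zip(expected_fractions, actual_fractions):
        # Guard against empty bins, which would make the logarithm undefined.
        expected = max(expected, eps)
        actual = max(actual, eps)
        total += (actual - expected) * math.log(actual / expected)
    return total


def psi_verdict(value):
    """Map a PSI value to the rule-of-thumb categories listed above."""
    if value < 0.1:
        return "no change"
    if value < 0.2:
        return "slight change"
    return "significant change"


same = psi([0.25, 0.25, 0.25, 0.25], [0.25, 0.25, 0.25, 0.25])
shifted = psi([0.25, 0.25, 0.25, 0.25], [0.1, 0.2, 0.3, 0.4])
print(same, round(shifted, 3), psi_verdict(shifted))
```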

SMD (Standardized Mean Differences)

helps to check if the balance of the groups has been reached

Rules:

Smaller than 0.1. For a randomized trial, the SMD of every covariate should typically fall into this bucket.
0.1 - 0.2. Not necessarily balanced, but small enough that people are usually not too worried about them.
Greater than 0.2. Values above this threshold are considered seriously imbalanced.
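A common definition of the SMD divides the difference in group means by the pooled standard deviation. The sketch below uses that definition on invented toy samples; the exact variant used by hypex (for example, which variance estimator it applies) is not stated here, so treat this as an assumption.

```python
import math
from statistics import mean, variance


def smd(treated, control):
    """Standardized mean difference with a pooled (averaged) sample variance."""
    pooled_sd = math.sqrt((variance(treated) + variance(control)) / 2.0)
    return (mean(treated) - mean(control)) / pooled_sd


# Toy samples, invented for illustration only.
balanced = smd([1.0, 2.0, 3.0, 4.0, 5.0], [1.0, 2.0, 3.0, 4.0, 5.0])
imbalanced = smd([2.0, 3.0, 4.0, 5.0, 6.0], [1.0, 2.0, 3.0, 4.0, 5.0])
print(balanced, round(imbalanced, 4))
```

The second pair differs by one unit in the mean with a standard deviation of about 1.58, so its SMD is well above the 0.2 threshold.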

Repeats

shows the fraction of duplicated indexes
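One plausible reading of this metric is the share of matched indexes that are not unique, computed as 1 - unique / total. The sketch below follows that reading; it is an assumption about the definition, and the index values are invented.

```python
def repeats_fraction(matched_indexes):
    """Fraction of matched indexes that are duplicates of an earlier match."""
    if not matched_indexes:
        return 0.0
    return 1.0 - len(set(matched_indexes)) / len(matched_indexes)


# Index 11 is used twice as a match, so one of four matches is a repeat.
with_repeat = repeats_fraction([10, 11, 11, 12])
no_repeat = repeats_fraction([10, 11, 12, 13])
print(with_repeat, no_repeat)
```

A high value means that a small set of units is reused as matches for many others.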

[13]:

quality_results.keys()

[13]:

dict_keys(['psi', 'ks_test', 'smd', 'repeats'])

Kolmogorov-Smirnov test

The distribution of one sample is compared with the distribution of the second sample, and it is decided whether the samples come from the same distribution.

The table shows the p-values of the test. If the p-value < 0.05, we reject the null hypothesis: there is enough evidence to say that the two samples do not have the same distribution.
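To make the test concrete, the sketch below computes the two-sample KS statistic (the largest gap between the two empirical CDFs) and an asymptotic p-value in plain Python. It is an illustration only: in practice a library routine such as scipy.stats.ks_2samp is the usual choice, the asymptotic formula is inaccurate for small samples, and nothing here is taken from the hypex source.

```python
import math
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest absolute difference between the two empirical CDFs."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    d = 0.0
    for value in a + b:
        cdf_a = bisect_right(a, value) / len(a)
        cdf_b = bisect_right(b, value) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


def ks_p_value(d, n, m, terms=100):
    """Asymptotic two-sided p-value from the Kolmogorov distribution."""
    lam = math.sqrt(n * m / (n + m)) * d
    if lam < 0.2:
        # The p-value is numerically 1 here, and the series converges poorly.
        return 1.0
    series = sum(
        (-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
        for k in range(1, terms + 1)
    )
    return min(1.0, max(0.0, 2.0 * series))


# Toy samples, invented for illustration only.
sample = list(range(50))
shifted = [x + 100 for x in sample]

d_same = ks_statistic(sample, sample)
d_shifted = ks_statistic(sample, shifted)
p_same = ks_p_value(d_same, len(sample), len(sample))
p_shifted = ks_p_value(d_shifted, len(sample), len(shifted))
print(d_same, p_same, d_shifted, p_shifted)
```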

[14]:

quality_results['ks_test']

[14]:

	match_control_to_treat	match_treat_to_control
age	1.418144e-01	2.209790e-07
pre_spends	2.348715e-264	3.829212e-19
signup_month	0.000000e+00	0.000000e+00

[15]:

df_matched

[15]:

	index	signup_month	pre_spends	age	gender_F	gender_M	industry_Logistics	signup_month_matched	pre_spends_matched	age_matched	gender_F_matched	gender_M_matched	industry_Logistics_matched	index_matched	post_spends	post_spends_matched	post_spends_matched_bias	treat	treat_matched
0	956	8	487.0	22.0	0	1	0	0.0	506.0	23.0	0.0	1.0	1.0	[6352]	462.222222	408.333333	54.004165	1	0
1	7966	5	471.5	69.0	0	0	1	0.0	483.5	69.0	0.0	1.0	1.0	[5349]	505.222222	415.222222	90.068667	1	0
2	7231	4	487.0	62.0	1	0	1	0.0	496.5	62.0	1.0	0.0	1.0	[8654]	503.555556	428.777778	74.832139	1	0
3	1443	1	517.0	36.0	1	0	1	0.0	520.5	36.0	1.0	0.0	1.0	[7881]	526.111111	423.111111	103.020028	1	0
4	7973	10	501.0	65.0	1	0	0	0.0	525.0	65.0	1.0	0.0	0.0	[7333]	440.444444	416.666667	23.915112	1	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
4997	2926	0	486.0	45.0	0	0	0	2.0	479.5	46.0	0.0	1.0	1.0	[971]	421.777778	509.888889	89.948512	0	1
4998	1656	0	465.5	39.0	1	0	1	2.0	460.0	40.0	1.0	0.0	1.0	[6730]	417.666667	502.222222	86.443231	0	1
4999	2812	0	451.0	53.0	1	0	1	2.0	441.5	57.0	1.0	0.0	1.0	[6610]	417.333333	521.555556	106.002291	0	1
5000	2813	0	504.5	69.0	0	1	0	1.0	514.0	65.0	0.0	1.0	0.0	[9319]	422.888889	528.555556	107.086127	0	1
5001	1297	0	495.5	43.0	1	0	1	1.0	496.5	38.0	1.0	0.0	0.0	[3004]	415.111111	532.555556	118.405412	0	1

10000 rows × 19 columns

[16]:

df_matched[df_matched['industry_Logistics'] != df_matched['industry_Logistics_matched']]

[16]:

	index	signup_month	pre_spends	age	gender_F	gender_M	industry_Logistics	signup_month_matched	pre_spends_matched	age_matched	gender_F_matched	gender_M_matched	industry_Logistics_matched	index_matched	post_spends	post_spends_matched	post_spends_matched_bias	treat	treat_matched
0	956	8	487.0	22.0	0	1	0	0.0	506.0	23.0	0.0	1.0	1.0	[6352]	462.222222	408.333333	54.004165	1	0
5	8108	7	482.0	54.0	0	1	1	0.0	499.0	55.0	0.0	1.0	0.0	[1742]	478.333333	411.222222	67.214942	1	0
6	7228	7	493.0	35.0	1	0	1	0.0	511.0	36.0	1.0	0.0	0.0	[5653]	492.555556	419.444444	73.220665	1	0
7	7968	10	512.5	64.0	1	0	0	0.0	540.0	63.0	1.0	0.0	1.0	[4470]	436.777778	418.666667	18.261920	1	0
8	1396	6	479.0	27.0	0	0	0	0.0	493.5	27.0	0.0	1.0	1.0	[6721]	485.666667	408.888889	76.860750	1	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
4990	2924	0	493.0	32.0	1	0	1	1.0	496.5	38.0	1.0	0.0	0.0	[3004]	425.000000	532.555556	108.985004	0	1
4991	313	0	484.0	35.0	1	0	0	2.0	480.5	34.0	1.0	0.0	1.0	[7816]	426.000000	510.777778	86.703675	0	1
4992	2919	0	490.5	20.0	0	1	1	2.0	483.5	18.0	0.0	1.0	0.0	[2781]	420.333333	517.555556	98.940997	0	1
4997	2926	0	486.0	45.0	0	0	0	2.0	479.5	46.0	0.0	1.0	1.0	[971]	421.777778	509.888889	89.948512	0	1
5001	1297	0	495.5	43.0	1	0	1	1.0	496.5	38.0	1.0	0.0	0.0	[3004]	415.111111	532.555556	118.405412	0	1

4964 rows × 19 columns

2.2 Matching with a fixed variable

Use this when you have categorical feature(s) that must be matched by strict equality.
group_col specifies the categorical features to compare strictly.
In our case there is only one such attribute.
If there are several such attributes, combine them into a single attribute and use that.
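Combining several categorical attributes into one can be as simple as concatenating their values into a single key. The sketch below does this for rows stored as dictionaries; the helper name, the key column name and the separator are hypothetical.

```python
def add_group_key(rows, columns, key_name="group_key", sep="|"):
    """Return copies of the rows with a combined key built from several columns."""
    keyed = []
    for row in rows:
        new_row = dict(row)  # copy so that the input rows stay unchanged
        new_row[key_name] = sep.join(str(row[col]) for col in columns)
        keyed.append(new_row)
    return keyed


# Toy rows, invented for illustration only.
rows = [
    {"industry": "Logistics", "gender": "F"},
    {"industry": "E-commerce", "gender": "M"},
]
keyed_rows = add_group_key(rows, ["industry", "gender"])
print([row["group_key"] for row in keyed_rows])
```

With a pandas DataFrame the same key is typically built by concatenating the columns after converting them to strings, and the resulting column is then the one to pass as group_col.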

[17]:

group_col = "industry"

group_col can also be a list.

[18]:

model = Matcher(input_data=df, outcome=outcome, treatment=treatment,
                info_col=info_col, group_col=group_col)

[18.12.2024 22:05:18 | hypex | INFO]: Number of NaN values filled with zeros: 446

[19]:

selected_features = model.feature_select()
selected_features

/Users/tikhomirov/PycharmProjects/Sber_New/LightAutoML/.venv/lib/python3.10/site-packages/hypex/selectors/feature_selector.py:42: UserWarning: FeatureSelector does not rule out the possibility of overlooked features, the complex impact of features on target description, or the significance of features from a business logic perspective.
  warnings.warn(

[19]:

	rank
signup_month	1
pre_spends	2
age	3
gender_F	4
gender_M	5

[20]:

chosen_features = selected_features[:4].index
chosen_features

[20]:

Index(['signup_month', 'pre_spends', 'age', 'gender_F'], dtype='object')

[21]:

results, quality_results, df_matched = model.estimate(features=chosen_features)

[18.12.2024 22:05:22 | Faiss hypex | INFO]: The entry of bias into the ATT is 0.2%

[22]:

results

[22]:

	effect_size	std_err	ci_lower	ci_upper	outcome
ATE	81.192392	1.631028	77.995577	84.389208	post_spends
ATC	98.945415	3.116605	92.836869	105.053961	post_spends
ATT	63.425162	0.652349	62.146559	64.703765	post_spends

[23]:

df_matched[df_matched['industry'] != df_matched['industry_matched']]

[23]:

	index	signup_month	pre_spends	age	gender_F	gender_M	industry	signup_month_matched	pre_spends_matched	age_matched	gender_F_matched	gender_M_matched	industry_matched	index_matched	post_spends	post_spends_matched	post_spends_matched_bias	treat	treat_matched

3. Results

3.1 ATE, ATT, ATC

[24]:

results

[24]:

	effect_size	std_err	ci_lower	ci_upper	outcome
ATE	81.192392	1.631028	77.995577	84.389208	post_spends
ATC	98.945415	3.116605	92.836869	105.053961	post_spends
ATT	63.425162	0.652349	62.146559	64.703765	post_spends
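The three estimates are linked: when every unit is matched, the ATE equals the average of ATT and ATC weighted by group size. The numbers in this tutorial are consistent with that identity, using the group sizes from df['treat'].value_counts() (4998 treated, 5002 control). The sketch below checks it; that hypex derives the ATE exactly this way is an assumption.

```python
def ate_from_att_atc(att, atc, n_treated, n_control):
    """Size-weighted average of the effects on the treated and on the controls."""
    n_total = n_treated + n_control
    return (n_treated * att + n_control * atc) / n_total


# ATT and ATC from the table above; group sizes from the treat column counts.
ate = ate_from_att_atc(att=63.425162, atc=98.945415, n_treated=4998, n_control=5002)
print(round(ate, 4))
```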

3.2 SMD, PSI, KS-test, repeats

[25]:

quality_results.keys()

[25]:

dict_keys(['psi', 'ks_test', 'smd', 'repeats'])

[26]:

quality_results['psi']

[26]:

	column_treated	anomaly_score_treated	check_result_treated	column_untreated	anomaly_score_untreated	check_result_untreated
0	age_treated	0.01	OK	age_untreated	0.08	OK
1	gender_F_treated	0.00	OK	gender_F_untreated	0.00	OK
2	industry_treated	0.00	OK	industry_untreated	0.00	OK
3	pre_spends_treated	0.61	NOK	pre_spends_untreated	0.16	OK
4	signup_month_treated	16.14	NOK	signup_month_untreated	0.00	OK

[27]:

quality_results['ks_test']

[27]:

	match_control_to_treat	match_treat_to_control
age	1.292517e-01	2.721265e-04
pre_spends	4.164710e-263	1.725086e-24
signup_month	0.000000e+00	0.000000e+00

[28]:

quality_results['repeats']

[28]:

{'match_control_to_treat': 0.42, 'match_treat_to_control': 0.07}

4. Save model

[29]:

model.save("test_model.pickle")

[30]:

model2 = Matcher.load("test_model.pickle")

[31]:

model2.results

[31]:

	effect_size	std_err	ci_lower	ci_upper	outcome
ATE	81.192392	1.631028	77.995577	84.389208	post_spends
ATC	98.945415	3.116605	92.836869	105.053961	post_spends
ATT	63.425162	0.652349	62.146559	64.703765	post_spends

[32]:

model.results

[32]:

	effect_size	std_err	ci_lower	ci_upper	outcome
ATE	81.192392	1.631028	77.995577	84.389208	post_spends
ATC	98.945415	3.116605	92.836869	105.053961	post_spends
ATT	63.425162	0.652349	62.146559	64.703765	post_spends
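The file extension suggests that the model is stored with pickle, although this tutorial does not confirm it. The sketch below shows the same save and load round trip with the standard pickle module on a plain dictionary of results; the helper names are hypothetical and are not part of hypex.

```python
import os
import pickle
import tempfile


def save_results(results, path):
    """Serialize an object to disk with pickle."""
    with open(path, "wb") as file:
        pickle.dump(results, file)


def load_results(path):
    """Load a previously pickled object; only do this for files you trust."""
    with open(path, "rb") as file:
        return pickle.load(file)


original = {"ATE": 81.192392, "ATC": 98.945415, "ATT": 63.425162}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_results.pickle")
    save_results(original, path)
    restored = load_results(path)

print(restored == original)
```

As in cells [31] and [32], comparing the restored object with the original is a quick check that nothing was lost.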
