Student: João Gil Ribeiro
ID: 32399
We want to predict whether a user will click on an ad, with special attention to the role of the website, the position and display of the ad, and other features that can be controlled by the advertiser.
Given that most features are anonymized, we will focus on precision and accuracy rather than on interpretability of the model, since that avenue is not possible to pursue here.
This is a supervised classification problem: the goal is to build a model that generates a ranked list of probabilities of whether or not a user will click on an ad, based on the available characteristics of both the consumer and the websites where the ad is displayed.
We want to understand whether there is a pattern behind a consumer clicking on a given ad and will explore the following methodologies:
import pandas as pd
import numpy as np
from tqdm.notebook import tqdm
df = pd.read_csv('ad_clicks_100k.csv')
# Total number of values in the dataset (rows x columns)
df.shape[0]*df.shape[1]
2400000
# Check for duplicates
if len(df.id.unique()) == df.shape[0]:
print('There are no duplicates')
else:
print('There are duplicates')
There are no duplicates
The dataset features can be divided into the following categories:
id: ad identifier.
click: 0/1 for non-click/click.
banner_pos: refers to the position of the ad banner on the website.
site_id: a unique identification number allocated to a website.
site_domain: the domain name, essentially the website's equivalent of a physical address.
site_category: the theme of the site; for instance, it can belong to "Education", "Kids", "Weather", "Arts", etc.
app_id: a two-part string used to identify one or more apps from a single development team, consisting of a Team ID (supplied by Apple) and a bundle ID search string, separated by a period.
app_domain: a .app domain behaves like any other website domain, but stands out for visibility (it signals that the website concerns a specific app rather than a news vertical, video platform, or any other type of website) and for security benefits.
app_category: the theme of the app; for instance, it can belong to "Education", "Kids", "Weather", "Arts", etc.
device_id: a string reported by a device's enumerator; a device has only one device ID.
device_ip: similar to the device ID, it also uniquely identifies a user, although private and public IPs are not always the same and can change over time.
device_model: the manufacturer's device model (Apple, Samsung, etc.), such as iPhone 8, iPhone X, Galaxy 8.
device_type: a label associated with the device, used to match it to a series or model.
device_conn_type: the type of connection of the device; connected devices span everything from traditional computing hardware, such as a laptop or desktop, to mobile devices such as smartphones and tablets, and an increasingly wide range of other physical objects.
hour: format is YYMMDDHH, so 14091123 means 23:00 on Sept. 11, 2014 UTC.
C1: anonymized categorical feature.
C14-C21: anonymized categorical features.
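Before diving into each group of features, a quick look at each column's cardinality (number of distinct values) helps anticipate the encoding choices made later; a minimal check:
# Number of distinct values per column, from most to least
df.nunique().sort_values(ascending=False)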
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 100000 entries, 0 to 99999 Data columns (total 24 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 id 100000 non-null float64 1 click 100000 non-null int64 2 hour 100000 non-null int64 3 C1 100000 non-null int64 4 banner_pos 100000 non-null int64 5 site_id 100000 non-null object 6 site_domain 100000 non-null object 7 site_category 100000 non-null object 8 app_id 100000 non-null object 9 app_domain 100000 non-null object 10 app_category 100000 non-null object 11 device_id 100000 non-null object 12 device_ip 100000 non-null object 13 device_model 100000 non-null object 14 device_type 100000 non-null int64 15 device_conn_type 100000 non-null int64 16 C14 100000 non-null int64 17 C15 100000 non-null int64 18 C16 100000 non-null int64 19 C17 100000 non-null int64 20 C18 100000 non-null int64 21 C19 100000 non-null int64 22 C20 100000 non-null int64 23 C21 100000 non-null int64 dtypes: float64(1), int64(14), object(9) memory usage: 18.3+ MB
df.describe()
id | click | hour | C1 | banner_pos | device_type | device_conn_type | C14 | C15 | C16 | C17 | C18 | C19 | C20 | C21 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 1.000000e+05 | 100000.000000 | 1.000000e+05 | 100000.000000 | 100000.000000 | 100000.000000 | 100000.000000 | 100000.000000 | 100000.000000 | 100000.000000 | 100000.000000 | 100000.000000 | 100000.000000 | 100000.00000 | 100000.000000 |
mean | 9.236817e+18 | 0.170310 | 1.410256e+07 | 1004.969200 | 0.288640 | 1.015190 | 0.332260 | 18845.372740 | 318.852520 | 60.216420 | 2113.509950 | 1.432660 | 227.585890 | 53116.13077 | 83.498460 |
std | 5.333002e+18 | 0.375907 | 2.964051e+02 | 1.090561 | 0.503994 | 0.524959 | 0.853341 | 4963.742483 | 21.153502 | 47.740899 | 609.604505 | 1.325988 | 351.782614 | 49963.28370 | 70.365391 |
min | 1.460373e+14 | 0.000000 | 1.410210e+07 | 1001.000000 | 0.000000 | 0.000000 | 0.000000 | 375.000000 | 120.000000 | 20.000000 | 112.000000 | 0.000000 | 33.000000 | -1.00000 | 1.000000 |
25% | 4.615234e+18 | 0.000000 | 1.410230e+07 | 1005.000000 | 0.000000 | 1.000000 | 0.000000 | 16920.000000 | 320.000000 | 50.000000 | 1863.000000 | 0.000000 | 35.000000 | -1.00000 | 23.000000 |
50% | 9.267732e+18 | 0.000000 | 1.410260e+07 | 1005.000000 | 0.000000 | 1.000000 | 0.000000 | 20346.000000 | 320.000000 | 50.000000 | 2323.000000 | 2.000000 | 39.000000 | 100048.00000 | 61.000000 |
75% | 1.385993e+19 | 0.000000 | 1.410281e+07 | 1005.000000 | 1.000000 | 1.000000 | 0.000000 | 21894.000000 | 320.000000 | 50.000000 | 2526.000000 | 3.000000 | 171.000000 | 100086.00000 | 108.000000 |
max | 1.844666e+19 | 1.000000 | 1.410302e+07 | 1012.000000 | 7.000000 | 5.000000 | 5.000000 | 24043.000000 | 1024.000000 | 1024.000000 | 2757.000000 | 3.000000 | 1839.000000 | 100248.00000 | 255.000000 |
# Visualization Imports
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.express as px
from plotly.offline import iplot, init_notebook_mode
init_notebook_mode()
click:
0/1 for non-click/click
fig = px.pie(names=df['click'].value_counts()[:].index.tolist(),
values=df['click'].value_counts()[:].tolist(), hole=.5,
title=f'Distribution of Target Variable ` click `')
fig.show()
Looking at the target variable "click", the click rate is around 17.03%, with 17.03K clicks in this sampled dataset and 82.97% of users not clicking the ad.
print('Current dataset shape ->> ', df.shape)
shape_inital = df.shape
initial_columns = df.columns.to_list()
df.head(2)
Current dataset shape ->> (100000, 24)
id | click | hour | C1 | banner_pos | site_id | site_domain | site_category | app_id | app_domain | ... | device_type | device_conn_type | C14 | C15 | C16 | C17 | C18 | C19 | C20 | C21 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 6.448465e+18 | 0 | 14102806 | 1005 | 0 | d6137915 | bb1ef334 | f028772b | ecad2386 | 7801e8d9 | ... | 1 | 0 | 19771 | 320 | 50 | 2227 | 0 | 935 | -1 | 48 |
1 | 1.342805e+19 | 0 | 14102307 | 1002 | 0 | 85f751fd | c4e18dd6 | 50e219e0 | 9a08a110 | 7801e8d9 | ... | 0 | 0 | 21676 | 320 | 50 | 2495 | 2 | 167 | -1 | 23 |
2 rows × 24 columns
hour:
format is YYMMDDHH, so 14091123 means 23:00 on Sept. 11, 2014 UTC.
We convert hour to a datetime to check whether there is any relation between the time of day and clicks; the hypothesis is that at times with higher traffic we will naturally have more users clicking.
import datetime as dt
df['hour'] = df['hour'].apply(lambda x: dt.datetime.strptime(str(x), '%y%m%d%H'))
df.hour.describe(datetime_is_numeric=True)
count 100000 mean 2014-10-25 22:20:44.375999744 min 2014-10-21 00:00:00 25% 2014-10-23 04:00:00 50% 2014-10-26 01:00:00 75% 2014-10-28 14:00:00 max 2014-10-30 23:00:00 Name: hour, dtype: object
The hour data goes from the 21st of October 2014 to the 30th of October 2014.
df.groupby('hour')['click'].sum().values
array([ 63, 56, 67, 68, 101, 104, 100, 84, 72, 108, 93, 99, 72, 95, 84, 73, 86, 76, 61, 45, 38, 40, 49, 33, 29, 46, 34, 63, 80, 109, 99, 103, 115, 158, 159, 133, 136, 105, 91, 93, 93, 93, 78, 72, 62, 59, 41, 44, 43, 50, 65, 87, 102, 93, 58, 74, 111, 100, 74, 81, 92, 79, 82, 117, 75, 67, 80, 69, 50, 39, 55, 54, 45, 54, 64, 51, 76, 71, 78, 91, 95, 71, 69, 56, 90, 91, 93, 103, 114, 85, 68, 20, 7, 6, 5, 11, 39, 44, 14, 22, 39, 44, 24, 72, 79, 81, 66, 89, 109, 128, 119, 114, 88, 81, 69, 42, 52, 36, 34, 35, 38, 24, 41, 50, 82, 81, 84, 88, 88, 89, 83, 94, 106, 104, 106, 105, 100, 71, 72, 53, 52, 45, 39, 44, 22, 31, 48, 57, 60, 78, 69, 58, 70, 75, 48, 49, 78, 94, 46, 43, 88, 117, 91, 58, 39, 22, 46, 38, 44, 48, 68, 46, 65, 80, 72, 77, 87, 95, 98, 84, 116, 159, 136, 104, 114, 104, 84, 73, 50, 72, 43, 33, 27, 40, 58, 104, 50, 39, 69, 96, 76, 57, 84, 59, 60, 76, 87, 99, 65, 59, 42, 41, 47, 32, 43, 53, 55, 42, 38, 63, 88, 85, 76, 67, 69, 93, 93, 106, 105, 104, 111, 87, 91, 76, 85, 58, 37, 46, 46, 38], dtype=int64)
px.line(x=df.groupby('hour')['click'].sum().index,
y=df.groupby('hour')['click'].sum().values,
title='Distribution of Clicks per hour over 10 days',
labels={'y':'Number of Clicks', 'x':'Days'})
aux_data = pd.DataFrame(df.groupby(['hour','click'])['id'].count()).reset_index().rename(columns={'id':'count'})
aux_data['click'] = pd.Categorical(aux_data.click)
px.bar(aux_data, x='hour', y='count', color='click')
Let's aggregate by hour of the day and check whether there is any relationship between the hour of the day and clicks.
df['hour_of_day'] = df['hour'].map(lambda x: x.hour)
aux_data1 = pd.DataFrame(df.groupby(['hour_of_day','click'])['id'].count()).reset_index().rename(columns={'id':'count'})
aux_data1['click'] = pd.Categorical(aux_data1.click)
px.bar(aux_data1, x='hour_of_day', y='count', color='click', barmode='overlay', text='count',
title = f"""Agregated click per hour of the day over 10 days. The maximum number of clicks is {aux_data1.loc[(aux_data1['click'] == 1)]['count'].max()}""")
This gives us a better picture of how the target variable click is distributed throughout the day. It is now clear that it follows an approximately normal shape with very fat tails, peaking in the middle of the day (at 1 pm (13h) with 1035 clicks in aggregate).
If we take a closer look:
by_click = aux_data1.loc[aux_data1['click'] == 1].copy()
px.bar(by_click, x='hour_of_day', y='count', color='click', barmode='overlay', text='count',
title = f"""Agregated click per hour of the day over 10 days. The maximum number of clicks is {by_click['count'].max()}""")
This is an interesting distribution; however, it may mislead us as to when the ads are more or less effective. According to the Google Ads definition of a click, it is important and common to use "the click-through rate (CTR), which tells you how many people who've seen your ad end up clicking on it. This metric can help you gauge how enticing your ad is and how closely it matches your keywords and other targeting settings."
Therefore we need to compute, for each hour, the ratio between clicks (click == 1) and the total number of impressions.
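Since click is 0/1, the hourly CTR can also be obtained directly as the group mean; a quick sketch used here only as a sanity check:
# CTR per hour of the day: the mean of the 0/1 click column within each group
df.groupby('hour_of_day')['click'].mean().sort_values(ascending=False).head()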
by_click.reset_index(inplace=True)
by_click.drop(columns='index', inplace=True)
total_hour_clicks = pd.DataFrame(aux_data1.groupby('hour_of_day')['count'].sum()).reset_index().rename(columns={'count':'total_clicks'})
by_click['total_clicks'] = total_hour_clicks['total_clicks']
by_click['CTR_hour'] = (by_click['count']/by_click['total_clicks'])
by_click['hour_of_day'] = pd.Categorical(by_click['hour_of_day'])
fig = px.bar(by_click, x='hour_of_day', y='CTR_hour', barmode='relative', color='hour_of_day',
text = 'CTR_hour',#[str(round(i,2))+'%' for i in list(by_click['CTR_hour'])],
title = f"""Agregated click per hour of the day over 10 days. The maximum CTR is {round(by_click['CTR_hour'].max(),3)}%""")
fig.update_traces(texttemplate='%{text:.1%}', textposition='outside')
fig.update_layout(showlegend=False)
fig.show()
by_click.sort_values(by='CTR_hour', ascending=False).head(5)
hour_of_day | click | count | total_clicks | CTR_hour | |
---|---|---|---|---|---|
0 | 0 | 1 | 405 | 2161 | 0.187413 |
1 | 1 | 1 | 435 | 2337 | 0.186136 |
16 | 16 | 1 | 914 | 4970 | 0.183903 |
23 | 23 | 1 | 383 | 2099 | 0.182468 |
15 | 15 | 1 | 938 | 5164 | 0.181642 |
We can now see that even though the highest number of clicks occurs in the middle of the day, the CTR at 13h is only about 17.8%, whereas the highest CTRs occur at hours {0, 1, 16, 23, 15}, ranging from 18.74% down to 18.16%.
We now have a better understanding of how the time of day influences the CTR and the predictability of whether or not a user will click the ad.
Generalizing to the 10 days of the dataset:
df['day'] = df['hour'].apply(lambda x: x.day)
aux_df = pd.DataFrame(df.groupby(['day', 'click'])['id'].count()).reset_index().rename(columns={'id':'count'})
by_day = aux_df.loc[aux_df['click'] == 1].copy()
by_day.reset_index(inplace=True)
by_day.drop(columns='index', inplace=True)
total_day_clicks = pd.DataFrame(aux_df.groupby('day')['count'].sum()).reset_index().rename(columns={'count':'total_clicks'})
by_day['total_clicks'] = total_day_clicks['total_clicks']
by_day['CTR_day'] = (by_day['count']/by_day['total_clicks'])
by_day.sort_values(by='CTR_day').head(5)
day | click | count | total_clicks | CTR_day | |
---|---|---|---|---|---|
7 | 28 | 1 | 1952 | 13100 | 0.149008 |
8 | 29 | 1 | 1463 | 9400 | 0.155638 |
1 | 22 | 1 | 2095 | 13217 | 0.158508 |
9 | 30 | 1 | 1759 | 10377 | 0.169509 |
0 | 21 | 1 | 1767 | 10209 | 0.173083 |
by_day['day'] = pd.Categorical(by_day['day'])
fig = px.bar(by_day, x='day', y='CTR_day', barmode='relative', color='day',
text = 'CTR_day',
title = f"""Agregated click per day. The maximum CTR is {round(by_day['CTR_day'].max(),3)}%""")
fig.update_traces(texttemplate='%{text:.1%}', textposition='outside')
fig.update_layout(showlegend=False)
fig.show()
Over the 10 days of the dataset there is a clear pattern of higher CTR from the 23rd to the 27th of October 2014. A quick check reveals that the 23rd is a Thursday and the 27th a Monday; the highest CTRs therefore occur towards the end of the business week and over the weekend. However, as in the hourly case, these may not be the days with the highest number of clicks, just a larger conversion of views into clicks.
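The corresponding weekdays can be confirmed directly with pandas:
# Confirm the weekdays of the dates mentioned above
print(pd.Timestamp('2014-10-23').day_name(), pd.Timestamp('2014-10-27').day_name())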
To check this distribution we can get the corresponding days of the week.
df['weekday'] = df['hour'].apply(lambda x: x.day_name())
week_df = pd.DataFrame(df.groupby(['weekday', 'click'])['id'].count()).reset_index().rename(columns={'id':'count'})
categories = "Monday Tuesday Wednesday Thursday Friday Saturday Sunday".split()
week_df['weekday'] = pd.Categorical(week_df['weekday'], categories=categories)
week_df['click'] = pd.Categorical(week_df.click)
week_df.head()
weekday | click | count | |
---|---|---|---|
0 | Friday | 0 | 6758 |
1 | Friday | 1 | 1514 |
2 | Monday | 0 | 6449 |
3 | Monday | 1 | 1425 |
4 | Saturday | 0 | 6868 |
fig = px.bar(week_df.sort_values('weekday'), x='weekday', y='count', barmode='group', color='click',
text = 'count',
title = f"Agregated total clicks per day of the week.")
fig.update_traces(texttemplate='%{text:.1}', textposition='outside')
fig.update_layout(showlegend=True)
fig.show()
This allows us to see the traffic the ads receive throughout the week, with business days actually being the ones with the highest traffic, but not necessarily the highest conversion.
df['weekday'] = df['hour'].apply(lambda x: x.day_name())
week_df = pd.DataFrame(df.groupby(['weekday', 'click'])['id'].count()).reset_index().rename(columns={'id':'count'})
by_weekday = week_df.loc[week_df['click'] == 1].copy()
by_weekday.reset_index(inplace=True)
by_weekday.drop(columns='index', inplace=True)
total_weekday_clicks = pd.DataFrame(week_df.groupby('weekday')['count'].sum()).reset_index().rename(columns={'count':'total_clicks'})
by_weekday['total_clicks'] = total_weekday_clicks['total_clicks']
by_weekday['CTR_weekday'] = (by_weekday['count']/by_weekday['total_clicks'])
by_weekday.sort_values(by='CTR_weekday').head(5)
categories = "Monday Tuesday Wednesday Thursday Friday Saturday Sunday".split()
by_weekday['weekday'] = pd.Categorical(by_weekday['weekday'], categories=categories)
fig = px.bar(by_weekday.sort_values('weekday'), x='weekday', y='CTR_weekday', barmode='relative', color='weekday',
text = 'CTR_weekday',
title = f"""Agregated click per day. The maximum CTR is {round(by_weekday['CTR_weekday'].max()*100,1)}%""")
fig.update_traces(texttemplate='%{text:.1%}', textposition='outside')
fig.update_layout(showlegend=False)
fig.show()
Actually, when we compute the CTR per day of the week we obtain a more or less uniform distribution across the weekdays.
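As a quick cross-check, the same weekday CTRs can be computed directly from the raw rows, since click is 0/1:
# CTR per day of the week, computed directly
df.groupby('weekday')['click'].mean().sort_values(ascending=False)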
print("Initial shape ->> ", shape_inital)
print("These are the new features added from EDA", [i for i in df.columns.to_list() if i not in initial_columns])
print("Current shape ->> ", df.shape)
df.head(2)
Initial shape ->> (100000, 24) These are the new variables added from EDA ['hour_of_day', 'day', 'weekday'] Current shape ->> (100000, 27)
id | click | hour | C1 | banner_pos | site_id | site_domain | site_category | app_id | app_domain | ... | C15 | C16 | C17 | C18 | C19 | C20 | C21 | hour_of_day | day | weekday | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 6.448465e+18 | 0 | 2014-10-28 06:00:00 | 1005 | 0 | d6137915 | bb1ef334 | f028772b | ecad2386 | 7801e8d9 | ... | 320 | 50 | 2227 | 0 | 935 | -1 | 48 | 6 | 28 | Tuesday |
1 | 1.342805e+19 | 0 | 2014-10-23 07:00:00 | 1002 | 0 | 85f751fd | c4e18dd6 | 50e219e0 | 9a08a110 | 7801e8d9 | ... | 320 | 50 | 2495 | 2 | 167 | -1 | 23 | 7 | 23 | Thursday |
2 rows × 27 columns
C1
-- anonymized categorical features
C14-C21
-- anonymized categorical features
# Get All C# features
C_columns = [i for i in list(df.columns) if 'C' in i]
In order to see whether the categories inside each anonymized categorical feature are relevant, we will evaluate, for each of those categories, its corresponding CTR, i.e. for a given category of a feature, how many clicks were registered out of the total number of users that saw the ad.
new_data = {}
dataframes = []
for j in C_columns:
C = df[j].unique()
ctr = []
    count = df.groupby(j)[j].count().loc[C].to_list()  # align the counts with the category order in C
for i in C:
ctr_mean = df.loc[df[j] == i, 'click'].mean() # since this is 0 or 1 the mean will be the CTR
ctr.append(ctr_mean)
new_data = {j: C, 'count_'+j: count, 'CRT_'+j: ctr}
dataframes.append(pd.DataFrame(new_data))
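The same per-category counts and CTRs can also be obtained more compactly with a single groupby; a sketch for one feature (C1) as an example:
# Impressions per category and mean of the 0/1 click column (= CTR) for C1
df.groupby('C1')['click'].agg(count='count', CTR='mean').sort_values('count', ascending=False)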
dataframes[0].sort_values('count_C1', ascending=False)[0:10]
C1 | count_C1 | CRT_C1 | |
---|---|---|---|
2 | 1010 | 91963 | 0.095692 |
1 | 1002 | 5426 | 0.211758 |
5 | 1008 | 2205 | 0.266667 |
6 | 1001 | 292 | 0.052632 |
3 | 1012 | 80 | 0.181507 |
0 | 1005 | 19 | 0.169764 |
4 | 1007 | 15 | 0.012500 |
for i in range(0,9):
fig, ax = plt.subplots(1,2, figsize=(15,5))
n = 10 # Plot the Top 10 largest (as in number of times they appear) categories inside each feature
dataframes[i].sort_values(dataframes[i].columns[1], ascending=False, inplace=True)
sns.barplot(data=dataframes[i], x=dataframes[i][dataframes[i].columns[0]].to_list()[:n],
y=dataframes[i][dataframes[i].columns[1]].to_list()[:n],
ax=ax[0], palette='viridis', order=dataframes[i][dataframes[i].columns[0]].to_list()[:n])
ax[0].set_title(f'TOP 10 largest categories of {dataframes[i].columns[0]}')
ax[0].set_ylim(0,100000)
sns.barplot(data=dataframes[i], x=dataframes[i][dataframes[i].columns[0]].to_list()[:n],
y=dataframes[i][dataframes[i].columns[2]].to_list()[:n],
ax=ax[1], palette='viridis', order=dataframes[i][dataframes[i].columns[0]].to_list()[:n])
ax[1].set_title(f'CTR per category of {dataframes[i].columns[0]}')
ax[1].set_ylim(0,1)
fig.tight_layout()
plt.show()
Since features C1 and C14 to C21 are anonymized categorical features, it is very difficult to derive an intuition from the graphics above. However, by analysing the graphs, it is easy to see that some features have similar behaviour.
Firstly, features C1, C15 and C16 have one category that captures almost the totality of the instances with a very low CTR (all around 0.1), making these features possible good choices for filtering out no-click instances.
Secondly, features C14 and C17 have no major category, but rather a similar density across categories. However, this regularity in the categories' density is not present in the CTR, which changes considerably from category to category.
Thirdly, C19, C20 and C21 have their instances more evenly distributed than the features in the first point, but still have one or two categories with a greater allocation of instances than the others. In this case the CTR is once again very volatile.
Lastly, C18 has 4 categories holding the majority of the instances, and their CTR is relatively low, especially category 1.
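This dominance of a single category can be quantified directly, for instance as the share of instances taken by each feature's most frequent value (a small sketch):
# Share of instances captured by the most frequent category of each C feature
for c in C_columns:
    top_share = df[c].value_counts(normalize=True).iloc[0]
    print(f'{c}: largest category holds {top_share:.1%} of instances')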
banner_pos: refers to the position of the ad banner on the website.
site_id: a unique identification number allocated to a website.
site_domain: the domain name, essentially the website's equivalent of a physical address.
site_category: as the name indicates, the theme of the site; it can belong to "Education", "Kids", "Weather", "Arts", etc.
# Before plotting:
website_feat = ['banner_pos', 'site_id', 'site_domain', 'site_category']
for i in website_feat:
size = len(list(df[i].unique()))
print(f'- "{i}" has {size} unique variables')
- "banner_pos" has 7 unique variables - "site_id" has 1461 unique variables - "site_domain" has 1333 unique variables - "site_category" has 19 unique variables
web_data = {}
web_df = []
for j in website_feat:
C = df[j].unique()
ctr = []
    count = df.groupby(j)[j].count().loc[C].to_list()  # align the counts with the category order in C
for i in C:
ctr_mean = df.loc[df[j] == i, 'click'].mean() # since this is 0 or 1 the mean will be the CTR
ctr.append(ctr_mean)
web_data = {j: C, 'count_'+j: count, 'CRT_'+j: ctr}
web_df.append(pd.DataFrame(web_data))
for i in range(0,len(web_df)):
fig, ax = plt.subplots(1,2, figsize=(15,5))
n = 10 # Plot the Top 10 largest (as in number of times they appear) categories inside each feature
web_df[i].sort_values(web_df[i].columns[1], ascending=False, inplace=True)
sns.barplot(data=web_df[i], x=web_df[i][web_df[i].columns[0]].to_list()[:n],
y=web_df[i][web_df[i].columns[1]].to_list()[:n],
ax=ax[0], palette='viridis', order=web_df[i][web_df[i].columns[0]].to_list()[:n])
ax[0].set_title(f'TOP 10 largest categories of {web_df[i].columns[0]}')
ax[0].set_ylim(0,100000)
sns.barplot(data=web_df[i], x=web_df[i][web_df[i].columns[0]].to_list()[:n],
y=web_df[i][web_df[i].columns[2]].to_list()[:n],
ax=ax[1], palette='viridis', order=web_df[i][web_df[i].columns[0]].to_list()[:n])
ax[1].set_title(f'CTR per category of {web_df[i].columns[0]}')
ax[1].set_ylim(0,1)
fig.tight_layout()
plt.show()
banner_pos: represents the ad banner position on the website. The graph on the left tells us that most websites have their ad banner in position 0 or 1, translating into a CTR close to 0.2 in both cases.
site_id: a unique identification of the website. An interesting reading is that the category with the largest number of instances presents a CTR around 0, meaning that although the website has many views, no one is clicking on the ad. Conversely, category 53ef4a3 has a small number of instances but a CTR very close to 1.
site_domain: only has 4 categories with a significant CTR, and those have a very low number of instances, suggesting that the domain of a website does not have much impact on the CTR distribution.
site_category: an allocation of websites by theme. Instances are mostly distributed across 4 categories, over which the CTR varies considerably; hence, in some categories visitors are more likely to click.
app_id: a two-part string used to identify one or more apps from a single development team, consisting of a Team ID (supplied by Apple) and a bundle ID search string, separated by a period.
app_domain: a .app domain behaves like any other website domain, but stands out for visibility (it signals that the website concerns a specific app) and for security benefits.
app_category: as the name indicates, the theme of the app; it can belong to "Education", "Kids", "Weather", "Arts", etc.
# Before plotting:
app_feat = ['app_id', 'app_domain', 'app_category']
for i in app_feat:
size = len(list(df[i].unique()))
print(f'- "{i}" has {size} unique categories')
- "app_id" has 1296 unique variables - "app_domain" has 93 unique variables - "app_category" has 23 unique variables
app_df[0].sort_values(by=['count_app_id'], ascending=False).head(10)
app_id | count_app_id | CRT_app_id | |
---|---|---|---|
1195 | c6ed9835 | 63960 | 0.000000 |
746 | 90d7ba56 | 3850 | 0.000000 |
1144 | 5c27485f | 2878 | 0.000000 |
794 | f3a35279 | 1895 | 0.000000 |
1290 | 7e8c4294 | 1879 | 0.000000 |
582 | c6e52c21 | 1543 | 0.000000 |
842 | 1c9ce06b | 1217 | 0.000000 |
1069 | 5a9ca415 | 1104 | 0.000000 |
524 | e36e1baa | 967 | 0.500000 |
418 | ec4ffa92 | 915 | 0.076923 |
for i in range(0,len(app_df)):
fig, ax = plt.subplots(1,2, figsize=(15,5))
n = 10 # Plot the Top 10 largest (as in number of times they appear) categories inside each feature
app_df[i].sort_values(app_df[i].columns[1], ascending=False, inplace=True)
sns.barplot(data=app_df[i], x=app_df[i][app_df[i].columns[0]].to_list()[:n],
y=app_df[i][app_df[i].columns[1]].to_list()[:n],
ax=ax[0], palette='viridis', order=app_df[i][app_df[i].columns[0]].to_list()[:n])
ax[0].set_title(f'TOP 10 largest categories of {app_df[i].columns[0]}')
ax[0].set_ylim(0,100000)
sns.barplot(data=app_df[i], x=app_df[i][app_df[i].columns[0]].to_list()[:n],
y=app_df[i][app_df[i].columns[2]].to_list()[:n],
ax=ax[1], palette='viridis', order=app_df[i][app_df[i].columns[0]].to_list()[:n])
ax[1].set_title(f'CTR per Category of {app_df[i].columns[0]}')
ax[1].set_ylim(0,1)
fig.tight_layout()
plt.show()
app_id: although most instances are allocated to one category, the CTR of that category is around 0.
app_domain: one category has the biggest aggregation of instances and simultaneously the biggest CTR; the click-through rate in most other categories is around 0.
app_category: most instances fall in one category with a CTR close to 0.2, followed by a category with an even greater CTR. The remaining categories have a low number of instances and a very low CTR.
device_id: a string reported by a device's enumerator; a device has only one device ID.
device_ip: similar to the device ID, it also uniquely identifies a user, although private and public IPs are not always the same and can change over time.
device_model: the manufacturer's device model (Apple, Samsung, etc.), such as iPhone 8, iPhone X, Galaxy 8.
device_type: a label associated with the device, used to match it to a series or model.
device_conn_type: the type of connection of the device; connected devices span everything from traditional computing hardware, such as a laptop or desktop, to mobile devices such as smartphones and tablets, and an increasingly wide range of other physical objects.
# Before plotting:
device_feat = ['device_id', 'device_ip', 'device_model', 'device_type', 'device_conn_type']
for i in device_feat:
size = len(list(df[i].unique()))
print(f'- "{i}" has {size} unique categories')
- "device_id" has 16837 unique variables - "device_ip" has 77833 unique variables - "device_model" has 3167 unique variables - "device_type" has 4 unique variables - "device_conn_type" has 4 unique variables
device_data = {}
device_df = []
for j in tqdm(device_feat):
C = df[j].unique()
ctr = []
    count = df.groupby(j)[j].count().loc[C].to_list()  # align the counts with the category order in C
for i in C:
ctr_mean = df.loc[df[j] == i, 'click'].mean() # since this is 0 or 1 the mean will be the CTR
ctr.append(ctr_mean)
device_data = {j: C, 'count_'+j: count, 'CRT_'+j: ctr}
device_df.append(pd.DataFrame(device_data))
for i in range(0,len(device_df)):
fig, ax = plt.subplots(1,2, figsize=(15,5))
n = 10 # Plot the Top 10 largest (as in number of times they appear) categories inside each feature
device_df[i].sort_values(device_df[i].columns[1], ascending=False, inplace=True)
sns.barplot(data=device_df[i], x=device_df[i][device_df[i].columns[0]].to_list()[:n],
y=device_df[i][device_df[i].columns[1]].to_list()[:n],
ax=ax[0], palette='viridis', order=device_df[i][device_df[i].columns[0]].to_list()[:n])
ax[0].set_title(f'TOP 10 largest categories of {device_df[i].columns[0]}')
ax[0].set_ylim(0,100000)
sns.barplot(data=device_df[i], x=device_df[i][device_df[i].columns[0]].to_list()[:n],
y=device_df[i][device_df[i].columns[2]].to_list()[:n],
ax=ax[1], palette='viridis', order=device_df[i][device_df[i].columns[0]].to_list()[:n])
ax[1].set_title(f'CTR per Category of {device_df[i].columns[0]}')
ax[1].set_ylim(0,1)
fig.tight_layout()
plt.show()
device_id: characterized by one dominant category with a CTR of 0.
device_ip: instances spread evenly across categories, five of which have a CTR around 1.
device_model: many categories with low density and very diverse CTR behaviour, ranging from around 0 to around 1.
device_type and device_conn_type: features characterized by four main categories, one of which holds the majority of instances; the CTR is relatively low.
print("Current Shape ->> ", df.shape)
df.head(2)
Current Shape ->> (100000, 27)
id | click | hour | C1 | banner_pos | site_id | site_domain | site_category | app_id | app_domain | ... | C15 | C16 | C17 | C18 | C19 | C20 | C21 | hour_of_day | day | weekday | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 6.448465e+18 | 0 | 2014-10-28 06:00:00 | 1005 | 0 | d6137915 | bb1ef334 | f028772b | ecad2386 | 7801e8d9 | ... | 320 | 50 | 2227 | 0 | 935 | -1 | 48 | 6 | 28 | Tuesday |
1 | 1.342805e+19 | 0 | 2014-10-23 07:00:00 | 1002 | 0 | 85f751fd | c4e18dd6 | 50e219e0 | 9a08a110 | 7801e8d9 | ... | 320 | 50 | 2495 | 2 | 167 | -1 | 23 | 7 | 23 | Thursday |
2 rows × 27 columns
# Save Treated dataset
df.to_csv("ad_clicks_100k_Treated.csv")
import pandas as pd
import numpy as np
# allow multiple outputs per cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
df = pd.read_csv("ad_clicks_100k_Treated.csv")
df.drop(columns="Unnamed: 0", inplace=True)
df.head()
df.shape
id | click | hour | C1 | banner_pos | site_id | site_domain | site_category | app_id | app_domain | ... | C15 | C16 | C17 | C18 | C19 | C20 | C21 | hour_of_day | day | weekday | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 6.448465e+18 | 0 | 2014-10-28 06:00:00 | 1005 | 0 | d6137915 | bb1ef334 | f028772b | ecad2386 | 7801e8d9 | ... | 320 | 50 | 2227 | 0 | 935 | -1 | 48 | 6 | 28 | Tuesday |
1 | 1.342805e+19 | 0 | 2014-10-23 07:00:00 | 1002 | 0 | 85f751fd | c4e18dd6 | 50e219e0 | 9a08a110 | 7801e8d9 | ... | 320 | 50 | 2495 | 2 | 167 | -1 | 23 | 7 | 23 | Thursday |
2 | 1.048699e+19 | 0 | 2014-10-23 10:00:00 | 1005 | 0 | 9a28a858 | 64778742 | f028772b | ecad2386 | 7801e8d9 | ... | 300 | 250 | 2523 | 3 | 39 | -1 | 221 | 10 | 23 | Thursday |
3 | 8.833733e+18 | 0 | 2014-10-23 07:00:00 | 1005 | 0 | 1fbe01fe | f3845767 | 28905ebd | ecad2386 | 7801e8d9 | ... | 320 | 50 | 1722 | 0 | 35 | -1 | 79 | 7 | 23 | Thursday |
4 | 1.035453e+19 | 0 | 2014-10-28 11:00:00 | 1005 | 0 | 85f751fd | c4e18dd6 | 50e219e0 | a5184c22 | b8d325c3 | ... | 320 | 50 | 2676 | 0 | 35 | 100176 | 221 | 11 | 28 | Tuesday |
5 rows × 27 columns
(100000, 27)
We will drop the following features:
id: we do not need an identifier as a predictor; it does not make sense to use it.
hour: because hour was broken down into day, weekday and hour_of_day.
day: because we find that it adds no significant predictive value that would make sense in any future application; the dataset covers 10 days of the year and we have no way of knowing whether any of those days is representative of the year, or whether something exceptional was happening. By using hour_of_day and weekday we get an average over the days and thus make this possible effect less significant.
device_ip and device_id: given that device_id and device_ip uniquely identify the specific device and the public IP address of the person that clicked the ad (and the public IP is not immutable for the same person, i.e. you can sport different public IPs at different times), they provide no meaningful or useful information towards predicting clicks.
There is also the problem that, being very high cardinality features, they tend to play a very "important" yet misleading role when models are applied to the dataset. Furthermore, these features represent a person that may never be seen again by the model when applied in reality or to a different dataset, therefore it would be wrong to include them in the modelling process.
We are interested in the shared characteristics of anyone that clicks, and the above-mentioned features do not fit this requirement.
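For reference, the cardinality of these identifier-like columns (the device_ip and device_id counts were already reported above):
# Distinct values per identifier-like column, out of 100,000 rows
print(df[['id', 'device_ip', 'device_id']].nunique())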
print(f'Shape after EDA ---->> {df.shape}')
df.drop(columns=['id','hour','device_ip','device_id','day'], inplace=True)
df.head()
print(f'Shape after feature drop ---->> {df.shape}')
Shape after EDA ---->> (100000, 27)
click | C1 | banner_pos | site_id | site_domain | site_category | app_id | app_domain | app_category | device_model | ... | C14 | C15 | C16 | C17 | C18 | C19 | C20 | C21 | hour_of_day | weekday | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 1005 | 0 | d6137915 | bb1ef334 | f028772b | ecad2386 | 7801e8d9 | 07d7df22 | 9efa421a | ... | 19771 | 320 | 50 | 2227 | 0 | 935 | -1 | 48 | 6 | Tuesday |
1 | 0 | 1002 | 0 | 85f751fd | c4e18dd6 | 50e219e0 | 9a08a110 | 7801e8d9 | 07d7df22 | 02d14ecc | ... | 21676 | 320 | 50 | 2495 | 2 | 167 | -1 | 23 | 7 | Thursday |
2 | 0 | 1005 | 0 | 9a28a858 | 64778742 | f028772b | ecad2386 | 7801e8d9 | 07d7df22 | ecb851b2 | ... | 21837 | 300 | 250 | 2523 | 3 | 39 | -1 | 221 | 10 | Thursday |
3 | 0 | 1005 | 0 | 1fbe01fe | f3845767 | 28905ebd | ecad2386 | 7801e8d9 | 07d7df22 | 779d90c2 | ... | 15706 | 320 | 50 | 1722 | 0 | 35 | -1 | 79 | 7 | Thursday |
4 | 0 | 1005 | 0 | 85f751fd | c4e18dd6 | 50e219e0 | a5184c22 | b8d325c3 | 0f2161f8 | dc15c87e | ... | 23224 | 320 | 50 | 2676 | 0 | 35 | 100176 | 221 | 11 | Tuesday |
5 rows × 22 columns
Shape after feature drop ---->> (100000, 22)
The encoding strategy is the following:
Ordinal encoding for weekday, for which we have a clear interpretation of what each of its separate values means.
Target encoding for high cardinality features (100+ unique categories).
One-hot or label encoding for the remaining categorical features, depending on their cardinality (see the code below).
Numerical features: hour_of_day will not be encoded (it already is a discrete variable).
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.compose import make_column_transformer
# Split into features type "O" and already encoded features
categorical_columns = [i for i in list(df.columns) if df[i].dtypes in ["O"] and i not in ['id','hour','hour_of_day','click']]
encoded_categorical_columns = [i for i in list(df.columns) if df[i].dtypes not in ["O"] and i not in ['id','hour','hour_of_day','click']]
print(f'Features with dtype "O" {categorical_columns}')
print(f'Categorical Features (already with discrete values) {encoded_categorical_columns}')
Features with dtype "O" ['site_id', 'site_domain', 'site_category', 'app_id', 'app_domain', 'app_category', 'device_model', 'weekday'] Categorical Features (already with discrete values) ['C1', 'banner_pos', 'device_type', 'device_conn_type', 'C14', 'C15', 'C16', 'C17', 'C18', 'C19', 'C20', 'C21']
# Transform features of type "O" with a label encoder (weekday is ordinal-encoded first)
label_encoder = []  # keep the fitted encoders outside the loop so the list is not reset each iteration
for i in categorical_columns:
    if i == 'weekday':
        # Ordinal Encode Weekday
        categories = [['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday']]
        # Instantiate ordinal encoder
        ordinal_encoder = OrdinalEncoder(categories=categories)
        # Fit ordinal encoder
        ordinal_encoder.fit(df[['weekday']])
        # Transform the data: weekday becomes 0-6 (the LabelEncoder below preserves that ordering)
        df['weekday'] = ordinal_encoder.transform(df[['weekday']])
    le = LabelEncoder()
    le.fit(df[i])
    label_encoder.append(le)
    df[i] = le.transform(df[i])
all_features = categorical_columns + encoded_categorical_columns
print(all_features)
LabelEncoder()
LabelEncoder()
LabelEncoder()
LabelEncoder()
LabelEncoder()
LabelEncoder()
LabelEncoder()
OrdinalEncoder(categories=[['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']])
LabelEncoder()
['site_id', 'site_domain', 'site_category', 'app_id', 'app_domain', 'app_category', 'device_model', 'weekday', 'C1', 'banner_pos', 'device_type', 'device_conn_type', 'C14', 'C15', 'C16', 'C17', 'C18', 'C19', 'C20', 'C21']
data = df.copy()
one_hot = []
label_enc = []
target_enc = []
for i in all_features:
if len(df[i].unique()) <= 7:
# One hot encoding
one_hot.append(i)
data = pd.concat([data, pd.get_dummies(df[i], prefix=i)],axis=1)
# Drop the original column
data.drop(columns=[i], inplace=True)
elif 100 >= len(df[i].unique()) > 7:
# Label encoding
label_enc.append(i)
elif len(df[i].unique()) > 100:
# Mean encoding
target_enc.append(i)
print('Previous Dataset Shape ->> ', df.shape)
print('Shape After Encoding ->> ', data.shape, '\n')
print(f'One Hot Encoding the following features: {one_hot}')
print(f'Label Encoding the following features: {label_enc}')
print(f'Selected for Target Encoding: {target_enc}')
data.head()
Previous Dataset Shape ->> (100000, 22) Shape After Encoding ->> (100000, 49) One Hot Encoding the following features: ['weekday', 'C1', 'banner_pos', 'device_type', 'device_conn_type', 'C18'] Label Encoding the following features: ['site_category', 'app_domain', 'app_category', 'C15', 'C16', 'C19', 'C21'] Selected for Target Encoding: ['site_id', 'site_domain', 'app_id', 'device_model', 'C14', 'C17', 'C20']
click | site_id | site_domain | site_category | app_id | app_domain | app_category | device_model | C14 | C15 | ... | device_type_4 | device_type_5 | device_conn_type_0 | device_conn_type_2 | device_conn_type_3 | device_conn_type_5 | C18_0 | C18_1 | C18_2 | C18_3 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 1215 | 998 | 17 | 1195 | 41 | 0 | 1956 | 19771 | 320 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
1 | 0 | 770 | 1043 | 5 | 783 | 41 | 0 | 36 | 21676 | 320 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
2 | 0 | 890 | 535 | 17 | 1195 | 41 | 0 | 2936 | 21837 | 300 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
3 | 0 | 167 | 1261 | 1 | 1195 | 41 | 0 | 1476 | 15706 | 320 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
4 | 0 | 770 | 1043 | 5 | 842 | 67 | 3 | 2746 | 23224 | 320 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
5 rows × 49 columns
data.columns
Index(['click', 'site_id', 'site_domain', 'site_category', 'app_id', 'app_domain', 'app_category', 'device_model', 'C14', 'C15', 'C16', 'C17', 'C19', 'C20', 'C21', 'hour_of_day', 'weekday_0', 'weekday_1', 'weekday_2', 'weekday_3', 'weekday_4', 'weekday_5', 'weekday_6', 'C1_1001', 'C1_1002', 'C1_1005', 'C1_1007', 'C1_1008', 'C1_1010', 'C1_1012', 'banner_pos_0', 'banner_pos_1', 'banner_pos_2', 'banner_pos_3', 'banner_pos_4', 'banner_pos_5', 'banner_pos_7', 'device_type_0', 'device_type_1', 'device_type_4', 'device_type_5', 'device_conn_type_0', 'device_conn_type_2', 'device_conn_type_3', 'device_conn_type_5', 'C18_0', 'C18_1', 'C18_2', 'C18_3'], dtype='object')
from sklearn.model_selection import train_test_split
X = data.drop(columns=['click'])
y = data['click']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.2)
X_train.shape
y_train.shape
X_test.shape
y_test.shape
(80000, 48)
(80000,)
(20000, 48)
(20000,)
# Effect of Target encoding for train set
pd.options.mode.chained_assignment = None # default='warn
X_train_encoding_example = X_train.copy()
for i in target_enc:
data_train = pd.concat([X_train_encoding_example, y_train], axis=1)
target_encoder = data_train.groupby([i])["click"].mean()
X_train_encoding_example[i] = X_train_encoding_example[i].map(target_encoder)
# Transform the test set based on the train
pd.options.mode.chained_assignment = 'warn' # default='warn
print('------- Mean Encoding changes -------')
for v in target_enc:
    print(f'{v} now would have {len(X_train_encoding_example[v].unique())} categories vs {len(X_train[v].unique())} before')
------- Mean Encoding changes ------- site_id now would have 186 categories vs 1342 before site_domain now would have 155 categories vs 1219 before app_id now would have 120 categories vs 1174 before device_model now would have 350 categories vs 2939 before C14 now would have 357 categories vs 1661 before C17 now would have 239 categories vs 394 before C20 now would have 103 categories vs 154 before
print("""Check if the example transformation above changed X_train in any way:
How many unique X_train.device_model categories? """, len(X_train.device_model.unique()))
if len(X_train.device_model.unique()) > len(X_train_encoding_example.device_model.unique()):
print("------>>> There was no change in the X_train set")
else:
print("------>>> WARNING: THERE IS DATA LEAKAGE")
Check if the example transformation above changed X_train in any way: How many unique X_train.device_model categories? 2939 ------>>> There was no change in the X_train set
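For completeness, applying the train-fitted mapping to the test set would look like the sketch below, with categories unseen in training falling back to the overall training click rate; the names X_test_encoding_example and global_mean are illustrative:
# Apply the per-category means learned on the training set to the test set
X_test_encoding_example = X_test.copy()
global_mean = y_train.mean()  # fallback for categories never seen in training
for col in target_enc:
    train_means = pd.concat([X_train, y_train], axis=1).groupby(col)['click'].mean()
    X_test_encoding_example[col] = X_test_encoding_example[col].map(train_means).fillna(global_mean)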
# Imports
from imblearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier
from imblearn.over_sampling import SMOTE
from imblearn.over_sampling import RandomOverSampler
from sklearn.linear_model import LogisticRegression
from imblearn.under_sampling import RandomUnderSampler
import numpy as np
from category_encoders.target_encoder import TargetEncoder
from sklearn.model_selection import train_test_split
from imblearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, f1_score, make_scorer, accuracy_score, precision_score, recall_score, roc_auc_score
from sklearn.model_selection import cross_validate
from sklearn.metrics import plot_confusion_matrix, confusion_matrix, plot_roc_curve
from sklearn.model_selection import LeaveOneOut
# Plotting
import matplotlib.pyplot as plt
import seaborn as sns
# Ensemble methods
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import StackingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
# Boosting
from catboost import CatBoostClassifier
# Stats (util)
from scipy.stats import randint as sp_randInt
from scipy.stats import uniform as sp_randFloat
We chose to maximize the f1-score, since this is the harmonic mean of precision and recall: our business problem relies on good recall and good precision for the "click" variable, as our interest is in being able to predict whether a given user will click on the ad or not.
Maximizing recall alone would push the model towards labelling almost every instance as a click, leaving us with a high number of false positives, which we would like to avoid given that an overly positive outlook on the impact of marketing can backfire into lower-than-expected revenues when the predicted interaction does not materialise. Maximizing precision alone would make the model overly conservative, recalling very few of the "click = 1" instances that are our main interest to predict.
We are still interested in a balance because, even though we want to predict click, it remains very important to understand clearly what leads a user not to click an ad, as it can help us identify and shift away from harmful practices.
AUC is also a widely used metric to compare binary classification models; however, ROC AUC can be misleading for imbalanced datasets (as is the case here) because it considers only the True Positive Rate against the False Positive Rate, which ignores the absolute numbers of TP and FP. That is, we may have a model with a high AUC that recalls very few true positives at its operating threshold: the AUC looks good even though very few instances are predicted for the positive class, and those happen to be mostly correct.
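To illustrate this with purely synthetic numbers (not from this dataset): a scorer can rank well overall, giving a high ROC AUC, while the chosen operating threshold recovers very few positives and the f1-score stays low.
import numpy as np
from sklearn.metrics import roc_auc_score, recall_score, f1_score

rng = np.random.default_rng(0)
# 17% positives, roughly the click rate of this dataset
y_true = np.r_[np.ones(170), np.zeros(830)]
# Scores that rank positives above negatives most of the time (high AUC) ...
scores = np.r_[rng.uniform(0.4, 0.9, 170), rng.uniform(0.0, 0.6, 830)]
# ... but a threshold that flags very few instances as clicks
y_pred = (scores > 0.85).astype(int)
print('ROC AUC:', round(roc_auc_score(y_true, scores), 2))  # high (around 0.93)
print('Recall :', round(recall_score(y_true, y_pred), 2))   # low, few clicks recovered
print('F1     :', round(f1_score(y_true, y_pred), 2))       # low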
Our dataset is imbalanced, and preliminary modelling showed that without rebalancing the model's performance was severely hindered by the over-importance of the majority class; those models yielded no useful information, so we opted to use rebalancing methods.
For all models we also used a StandardScaler to control for arbitrary differences in the scales of the features, and we performed Target Encoding, as explained above (Effect of implementing Target Encoding), to reduce the effect of high cardinality features. This encoding cannot be fitted on the whole training set before cross-validation, otherwise there would be data leakage, as the model would effectively be given a feature built from the target.
Pipeline
'sampling' → RandomUnderSampler(), RandomOverSampler(), SMOTE()
'transformer' → TargetEncoder(cols=target_enc)
'scaler' → StandardScaler()
'classifier' → KNeighborsClassifier(), RandomForestClassifier(), CatBoostClassifier()
n_neighbors: the number of neighbors to use for kneighbors queries.
weights:
uniform: uniform weights; all points in each neighborhood are weighted equally.
distance: weight points by the inverse of their distance, so closer neighbors of a query point have a greater influence than neighbors which are further away.
pipe = Pipeline([('sampling', RandomOverSampler(random_state=42)),
('transformer', TargetEncoder(cols=target_enc)),
('scaler', StandardScaler()),
('classifier', KNeighborsClassifier())])
param_grid = [
{'classifier': [KNeighborsClassifier()],
'scaler': [StandardScaler()],
'transformer': [TargetEncoder(cols=target_enc)],
'sampling': [RandomUnderSampler(random_state=42), RandomOverSampler(random_state=42), SMOTE(random_state=42)],
'classifier__n_neighbors': [3, 5, 7, 9],
'classifier__weights': ['uniform', 'distance']},
]
knn_grid = GridSearchCV(pipe, param_grid, cv=10, scoring = 'f1')
knn_grid.fit(X_train, y_train)
knn_model = knn_grid.best_estimator_
print("Best cross-validation score: {:.2f}".format(knn_grid.best_score_))
print("Best params:\n{}\n".format(knn_grid.best_params_))
GridSearchCV(cv=10, estimator=Pipeline(steps=[('sampling', RandomOverSampler(random_state=42)), ('transformer', TargetEncoder(cols=['site_id', 'site_domain', 'app_id', 'device_model', 'C14', 'C17', 'C20'])), ('scaler', StandardScaler()), ('classifier', KNeighborsClassifier())]), param_grid=[{'classifier': [KNeighborsClassifier(n_neighbors=9)], 'classifier__n_neighbors': [3, 5, 7, 9], 'classifier__weights': ['uniform', 'distance'], 'sampling': [RandomUnderSampler(random_state=42), RandomOverSampler(random_state=42), SMOTE(random_state=42)], 'scaler': [StandardScaler()], 'transformer': [TargetEncoder(cols=['site_id', 'site_domain', 'app_id', 'device_model', 'C14', 'C17', 'C20'])]}], scoring='f1')
Best cross-validation score: 0.37 Best params: {'classifier': KNeighborsClassifier(n_neighbors=9), 'classifier__n_neighbors': 9, 'classifier__weights': 'uniform', 'sampling': RandomUnderSampler(random_state=42), 'scaler': StandardScaler(), 'transformer': TargetEncoder(cols=['site_id', 'site_domain', 'app_id', 'device_model', 'C14', 'C17', 'C20'])}
We found that the best number of neighbors was at the upper end of the defined interval. Thus, we decided to test whether, keeping all else equal (i.e. the previously found hyper-parameters), the model's performance would increase if we enlarged the range of n_neighbors.
pipe = Pipeline([('sampling', RandomOverSampler(random_state=42)),
('transformer', TargetEncoder(cols=target_enc)),
('scaler', StandardScaler()),
('classifier', KNeighborsClassifier())])
param_grid = [
{'classifier': [KNeighborsClassifier()],
'scaler': [StandardScaler()],
'transformer': [TargetEncoder(cols=target_enc)],
'sampling': [RandomUnderSampler(random_state=42)],
'classifier__n_neighbors': [i for i in range(7, 52, 2)],
'classifier__weights': ['uniform']},
]
knn_rand = RandomizedSearchCV(pipe, param_grid, cv=10, n_iter=10, scoring = 'f1')
knn_rand.fit(X_train, y_train)
knn_new_neigh_model = knn_rand.best_estimator_
print("Best cross-validation score: {:.2f}".format(knn_rand.best_score_))
print("Best params:\n{}\n".format(knn_rand.best_params_))
RandomizedSearchCV(cv=10, estimator=Pipeline(steps=[('sampling', RandomOverSampler(random_state=42)), ('transformer', TargetEncoder(cols=['site_id', 'site_domain', 'app_id', 'device_model', 'C14', 'C17', 'C20'])), ('scaler', StandardScaler()), ('classifier', KNeighborsClassifier())]), param_distributions=[{'classifier': [KNeighborsClassifier(n_neighbors=43)], 'classifier__n_neighbors': [7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47, 49, 51], 'classifier__weights': ['uniform'], 'sampling': [RandomUnderSampler(random_state=42)], 'scaler': [StandardScaler()], 'transformer': [TargetEncoder(cols=['site_id', 'site_domain', 'app_id', 'device_model', 'C14', 'C17', 'C20'])]}], scoring='f1')
Best cross-validation score: 0.39 Best params: {'transformer': TargetEncoder(cols=['site_id', 'site_domain', 'app_id', 'device_model', 'C14', 'C17', 'C20']), 'scaler': StandardScaler(), 'sampling': RandomUnderSampler(random_state=42), 'classifier__weights': 'uniform', 'classifier__n_neighbors': 43, 'classifier': KNeighborsClassifier(n_neighbors=43)}
print('Knn -->> {:0.5f}'.format(knn_grid.best_score_))
print('Knn Opt -->> {:0.5f}'.format(knn_rand.best_score_))
Knn -->> 0.37069 Knn Opt -->> 0.38588
pipe = Pipeline([('sampling', RandomOverSampler(random_state=42)),
('transformer', TargetEncoder(cols=target_enc)),
('scaler', StandardScaler()),
('classifier', RandomForestClassifier(random_state=42))])
param_grid = [
{'classifier': [RandomForestClassifier(random_state=42)],
'scaler': [StandardScaler()],
'transformer': [TargetEncoder(cols=target_enc)],
'sampling': [None, RandomUnderSampler(random_state=42), RandomOverSampler(random_state=42), SMOTE(random_state=42)]}
]
rfc_simple_grid = GridSearchCV(pipe, param_grid, cv=10, scoring = 'f1')
rfc_simple_grid.fit(X_train, y_train)
rfc_model = rfc_simple_grid.best_estimator_
print("Best cross-validation score: {:.2f}".format(rfc_simple_grid.best_score_))
print("Best params:\n{}\n".format(rfc_simple_grid.best_params_))
GridSearchCV(cv=10, estimator=Pipeline(steps=[('sampling', RandomOverSampler(random_state=42)), ('transformer', TargetEncoder(cols=['site_id', 'site_domain', 'app_id', 'device_model', 'C14', 'C17', 'C20'])), ('scaler', StandardScaler()), ('classifier', RandomForestClassifier(random_state=42))]), param_grid=[{'classifier': [RandomForestClassifier(random_state=42)], 'sampling': [None, RandomUnderSampler(random_state=42), RandomOverSampler(random_state=42), SMOTE(random_state=42)], 'scaler': [StandardScaler()], 'transformer': [TargetEncoder(cols=['site_id', 'site_domain', 'app_id', 'device_model', 'C14', 'C17', 'C20'])]}], scoring='f1')
Best cross-validation score: 0.37 Best params: {'classifier': RandomForestClassifier(random_state=42), 'sampling': RandomUnderSampler(random_state=42), 'scaler': StandardScaler(), 'transformer': TargetEncoder(cols=['site_id', 'site_domain', 'app_id', 'device_model', 'C14', 'C17', 'C20'])}
Even with the default parameters, the Random Forest Classifier has a good f1-score, the same as the first optimization of the KNN classifier. We will now perform some hyper-parameter tuning and try to increase the model's performance.
min_samples_leaf: the minimum number of samples required to be at a leaf node. A smaller leaf makes the model more prone to capturing noise in the training data. This parameter is similar to min_samples_split, but it constrains the number of samples at the leaves, the terminal nodes of the tree.
min_samples_split: represents the minimum number of samples required to split an internal node. It can range from requiring just a couple of samples at each split up to requiring a large share of the samples at each node. As we increase this parameter, each tree in the forest becomes more constrained, since it has to consider more samples at each node.
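A quick way to see how these constraints limit tree growth, sketched on a small subsample of the training data (the 5000-row sample and the two values compared are arbitrary, illustrative choices):
from sklearn.tree import DecisionTreeClassifier

# Compare tree complexity for two values of min_samples_leaf on a small subsample
X_small, y_small = X_train.iloc[:5000], y_train.iloc[:5000]
for leaf in (1, 50):
    tree = DecisionTreeClassifier(min_samples_leaf=leaf, random_state=42).fit(X_small, y_small)
    print(f'min_samples_leaf={leaf}: {tree.get_n_leaves()} leaves, depth {tree.get_depth()}')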
pipe = Pipeline([('sampling', RandomOverSampler(random_state=42)),
('transformer', TargetEncoder(cols=target_enc)),
('scaler', StandardScaler()),
('classifier', RandomForestClassifier(random_state=42))])
param_grid = [
{'classifier': [RandomForestClassifier(random_state=42)],
'scaler': [StandardScaler()],
'transformer': [TargetEncoder(cols=target_enc)],
'sampling': [RandomUnderSampler(random_state=42), RandomOverSampler(random_state=42), SMOTE(random_state=42)],
'classifier__n_estimators': sp_randInt(100, 1000),
'classifier__max_depth': sp_randInt(1, 50),
'classifier__min_samples_split': sp_randInt(2, 50),  # min_samples_split must be at least 2
'classifier__min_samples_leaf': sp_randInt(1, 50)},
]
rfc_rand = RandomizedSearchCV(pipe, param_grid, cv=10, scoring = 'f1', n_iter=5, random_state=42)
rfc_rand.fit(X_train, y_train)
rfc_model_opt = rfc_rand.best_estimator_
print("Best cross-validation score: {:.2f}".format(rfc_rand.best_score_))
print("Best params:\n{}\n".format(rfc_rand.best_params_))
RandomizedSearchCV(cv=10, estimator=Pipeline(steps=[('sampling', RandomOverSampler(random_state=42)), ('transformer', TargetEncoder(cols=['site_id', 'site_domain', 'app_id', 'device_model', 'C14', 'C17', 'C20'])), ('scaler', StandardScaler()), ('classifier', RandomForestClassifier(random_state=42))]), n_iter=5, param_distributions=[{'classifier': [RandomForestClass... 'classifier__n_estimators': <scipy.stats._distn_infrastructure.rv_frozen object at 0x000002A00096A700>, 'sampling': [RandomUnderSampler(random_state=42), RandomOverSampler(random_state=42), SMOTE(random_state=42)], 'scaler': [StandardScaler()], 'transformer': [TargetEncoder(cols=['site_id', 'site_domain', 'app_id', 'device_model', 'C14', 'C17', 'C20'])]}], random_state=42, scoring='f1')
Best cross-validation score: 0.40 Best params: {'classifier': RandomForestClassifier(max_depth=39, min_samples_leaf=29, min_samples_split=15, n_estimators=206, random_state=42), 'classifier__max_depth': 39, 'classifier__min_samples_leaf': 29, 'classifier__min_samples_split': 15, 'classifier__n_estimators': 206, 'sampling': RandomUnderSampler(random_state=42), 'scaler': StandardScaler(), 'transformer': TargetEncoder(cols=['site_id', 'site_domain', 'app_id', 'device_model', 'C14', 'C17', 'C20'])}
Keeping the several types of sampling (RandomUnderSampler(), RandomOverSampler(), SMOTE()) in the search is computationally costly. To further explore the hyper-parameter configuration we fixed the sampling method to RandomUnderSampler(), the method chosen for the optimized RandomForestClassifier above, and performed another RandomizedSearchCV with an increased number of iterations and a smaller number of cross-validation folds to speed up processing, narrowing the tuning intervals based on the above results: the n_estimators interval was reduced, while the lower and upper bounds of max_depth, min_samples_split and min_samples_leaf were increased.
pipe = Pipeline([('sampling', RandomOverSampler(random_state=42)),
('transformer', TargetEncoder(cols=target_enc)),
('scaler', StandardScaler()),
('classifier', RandomForestClassifier(random_state=42))])
param_grid = [
{'classifier': [RandomForestClassifier(random_state=42)],
'scaler': [StandardScaler()],
'transformer': [TargetEncoder(cols=target_enc)],
'sampling': [RandomUnderSampler(random_state=42)],# RandomOverSampler(random_state=42), SMOTE(random_state=42)],
'classifier__n_estimators': sp_randInt(150, 1000),
'classifier__max_depth': sp_randInt(20, 60),
'classifier__min_samples_split': sp_randInt(10, 60),
'classifier__min_samples_leaf': sp_randInt(20, 80)},
]
rfc_rand_2 = RandomizedSearchCV(pipe, param_grid, cv=5, scoring = 'f1', n_iter=10, random_state=42)
rfc_rand_2.fit(X_train, y_train)
rfc_model_opt_2 = rfc_rand_2.best_estimator_
print("Best cross-validation score: {:.2f}".format(rfc_rand_2.best_score_))
print("Best params:\n{}\n".format(rfc_rand_2.best_params_))
RandomizedSearchCV(cv=5, estimator=Pipeline(steps=[('sampling', RandomOverSampler(random_state=42)), ('transformer', TargetEncoder(cols=['site_id', 'site_domain', 'app_id', 'device_model', 'C14', 'C17', 'C20'])), ('scaler', StandardScaler()), ('classifier', RandomForestClassifier(random_state=42))]), param_distributions=[{'classifier': [RandomForestClassifier(max_... 'classifier__min_samples_split': <scipy.stats._distn_infrastructure.rv_frozen object at 0x000002A0008CB910>, 'classifier__n_estimators': <scipy.stats._distn_infrastructure.rv_frozen object at 0x000002A0030E8E80>, 'sampling': [RandomUnderSampler(random_state=42)], 'scaler': [StandardScaler()], 'transformer': [TargetEncoder(cols=['site_id', 'site_domain', 'app_id', 'device_model', 'C14', 'C17', 'C20'])]}], random_state=42, scoring='f1')
Best cross-validation score: 0.40 Best params: {'classifier': RandomForestClassifier(max_depth=27, min_samples_leaf=40, min_samples_split=48, n_estimators=271, random_state=42), 'classifier__max_depth': 27, 'classifier__min_samples_leaf': 40, 'classifier__min_samples_split': 48, 'classifier__n_estimators': 271, 'sampling': RandomUnderSampler(random_state=42), 'scaler': StandardScaler(), 'transformer': TargetEncoder(cols=['site_id', 'site_domain', 'app_id', 'device_model', 'C14', 'C17', 'C20'])}
To further explore the parameter tuning of the random forest, and given that the previous exploration yielded no improvement in the f1-score, we added other parameters to tune the model. These are:
class_weight
: {balanced, balanced_subsample}, dict or list of dicts, default=None. Weights associated with the classes (note that the pipeline already rebalances the data with RandomUnderSampler). The balanced mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y)); a small illustration of this formula follows below.
criterion
: {gini, entropy}, default=gini. The function used to measure the quality of a split.
pipe = Pipeline([('sampling', RandomOverSampler(random_state=42)),
('transformer', TargetEncoder(cols=target_enc)),
('scaler', StandardScaler()),
('classifier', RandomForestClassifier(random_state=42))])
param_grid = [
{'classifier': [RandomForestClassifier(random_state=42)],
'scaler': [StandardScaler()],
'transformer': [TargetEncoder(cols=target_enc)],
'sampling': [RandomUnderSampler(random_state=42)],# RandomOverSampler(random_state=42), SMOTE(random_state=42)],
'classifier__n_estimators': sp_randInt(150, 1000),
'classifier__max_depth': sp_randInt(20, 60),
'classifier__min_samples_split': sp_randInt(10, 60),
'classifier__min_samples_leaf': sp_randInt(20, 80),
'classifier__class_weight': ['balanced'],
'classifier__criterion': ['gini','entropy']}
]
rfc_rand_3 = RandomizedSearchCV(pipe, param_grid, cv=5, scoring = 'f1', n_iter=10, random_state=42)
rfc_rand_3.fit(X_train, y_train)
rfc_model_opt_3 = rfc_rand_3.best_estimator_
print("Best cross-validation score: {:.2f}".format(rfc_rand_3.best_score_))
print("Best params:\n{}\n".format(rfc_rand_3.best_params_))
RandomizedSearchCV(cv=5, estimator=Pipeline(steps=[('sampling', RandomOverSampler(random_state=42)), ('transformer', TargetEncoder(cols=['site_id', 'site_domain', 'app_id', 'device_model', 'C14', 'C17', 'C20'])), ('scaler', StandardScaler()), ('classifier', RandomForestClassifier(random_state=42))]), param_distributions=[{'classifier': [RandomForestClassifier(clas... 'classifier__min_samples_split': <scipy.stats._distn_infrastructure.rv_frozen object at 0x000002A0008D8910>, 'classifier__n_estimators': <scipy.stats._distn_infrastructure.rv_frozen object at 0x000002A0035EED30>, 'sampling': [RandomUnderSampler(random_state=42)], 'scaler': [StandardScaler()], 'transformer': [TargetEncoder(cols=['site_id', 'site_domain', 'app_id', 'device_model', 'C14', 'C17', 'C20'])]}], random_state=42, scoring='f1')
Best cross-validation score: 0.40 Best params: {'classifier': RandomForestClassifier(class_weight='balanced', max_depth=58, min_samples_leaf=37, min_samples_split=13, n_estimators=750, random_state=42), 'classifier__class_weight': 'balanced', 'classifier__criterion': 'gini', 'classifier__max_depth': 58, 'classifier__min_samples_leaf': 37, 'classifier__min_samples_split': 13, 'classifier__n_estimators': 750, 'sampling': RandomUnderSampler(random_state=42), 'scaler': StandardScaler(), 'transformer': TargetEncoder(cols=['site_id', 'site_domain', 'app_id', 'device_model', 'C14', 'C17', 'C20'])}
print('Random Forest Opt 1 -->> {:0.5f}'.format(rfc_rand.best_score_))
print('Random Forest Opt 2 -->> {:0.5f}'.format(rfc_rand_2.best_score_))
print('Random Forest Opt 3 -->> {:0.5f}'.format(rfc_rand_3.best_score_))
Random Forest Opt 1 -->> 0.39671
Random Forest Opt 2 -->> 0.39576
Random Forest Opt 3 -->> 0.39559
On all 3 hyper-parameter tuning iterations we did not manage to increase the best cross-validation f1-score, which sat at 0.40. This is, however, better than the score achieved by the two Knn Classifier models.
Beyond the Knn Classifier and the Random Forest Classifier we wanted to explore a boosting model.
CatBoost is designed for categorical data and is known to perform particularly well on it, showing state-of-the-art performance over XGBoost and LightGBM on eight datasets in its official journal article.
depth
: In most cases, the optimal depth ranges from 4 to 10. Values in the range from 6 to 10 are recommended
learning_rate
: This setting is used for reducing the gradient step. It affects the overall time of training: the smaller the value, the more iterations are required for training. Choose the value based on the performance expectations.
eval_metric
: The metric used for overfitting detection (if enabled) and best model selection (if enabled).
od_type
: The type of the overfitting detector to use. Possible values: IncToDec
, Iter
early_stopping_rounds
: The number of iterations to continue the training after the iteration with the optimal metric value.
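The search below only tunes learning_rate, depth and l2_leaf_reg, so eval_metric and od_type are not exercised there. As a hedged, standalone sketch (synthetic data and hypothetical split names, not part of the tuned pipeline), this is how the overfitting detector could be wired up directly on a CatBoostClassifier with a held-out evaluation set:
from catboost import CatBoostClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
# Synthetic, imbalanced toy data purely for illustration
X_toy, y_toy = make_classification(n_samples=5000, n_features=10, weights=[0.83, 0.17], random_state=42)
X_tr, X_val, y_tr, y_val = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy)
cb = CatBoostClassifier(depth=6,                  # within the recommended 6-10 range
                        learning_rate=0.05,
                        eval_metric='F1',         # metric used for best-model selection / overfitting detection
                        od_type='Iter',           # stop when the metric has not improved for od_wait iterations
                        od_wait=10,
                        auto_class_weights='Balanced',
                        random_seed=42,
                        verbose=False)
cb.fit(X_tr, y_tr, eval_set=(X_val, y_val))
print('best iteration:', cb.get_best_iteration())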
catb = Pipeline([('sampling', RandomUnderSampler(random_state=42)),
('transformer', TargetEncoder(cols=target_enc)),
('scaler', StandardScaler()),
('classifier', CatBoostClassifier(verbose=False, auto_class_weights='Balanced', early_stopping_rounds=10, random_seed=42))])
param_grid= [
{'classifier': [CatBoostClassifier(verbose=False, auto_class_weights='Balanced', early_stopping_rounds=10, random_seed=42)],
'scaler': [StandardScaler()],
'transformer': [TargetEncoder(cols=target_enc)],
'sampling': [RandomUnderSampler(random_state=42), RandomOverSampler(random_state=42), SMOTE(random_state=42)],
'classifier__learning_rate': sp_randFloat(0, 1),
'classifier__depth': sp_randInt(1, 10),
'classifier__l2_leaf_reg': sp_randInt(1,10)}]
catb_rand = RandomizedSearchCV(catb, param_grid, cv=10, scoring = 'f1', n_iter=10, random_state=42)
catb_rand.fit(X_train, y_train)
catb_model = catb_rand.best_estimator_
print("Best cross-validation score: {:.2f}".format(catb_rand.best_score_))
print("Best params:\n{}\n".format(catb_rand.best_params_))
RandomizedSearchCV(cv=10, estimator=Pipeline(steps=[('sampling', RandomUnderSampler(random_state=42)), ('transformer', TargetEncoder(cols=['site_id', 'site_domain', 'app_id', 'device_model', 'C14', 'C17', 'C20'])), ('scaler', StandardScaler()), ('classifier', <catboost.core.CatBoostClassifier object at 0x000002A0008C3A60>)]), param_distributions=[{'classifier': [<cat... 'classifier__learning_rate': <scipy.stats._distn_infrastructure.rv_frozen object at 0x000002A071C6FBE0>, 'sampling': [RandomUnderSampler(random_state=42), RandomOverSampler(random_state=42), SMOTE(random_state=42)], 'scaler': [StandardScaler()], 'transformer': [TargetEncoder(cols=['site_id', 'site_domain', 'app_id', 'device_model', 'C14', 'C17', 'C20'])]}], random_state=42, scoring='f1')
Best cross-validation score: 0.39 Best params: {'classifier': <catboost.core.CatBoostClassifier object at 0x000002A0008C3340>, 'classifier__depth': 9, 'classifier__l2_leaf_reg': 7, 'classifier__learning_rate': 0.013264961159866528, 'sampling': RandomUnderSampler(random_state=42), 'scaler': StandardScaler(), 'transformer': TargetEncoder(cols=['site_id', 'site_domain', 'app_id', 'device_model', 'C14', 'C17', 'C20'])}
Narrowing down on the Learning Rate.
catb = Pipeline([('sampling', RandomUnderSampler(random_state=42)),
('transformer', TargetEncoder(cols=target_enc)),
('scaler', StandardScaler()),
('classifier', CatBoostClassifier(verbose=False, auto_class_weights='Balanced', early_stopping_rounds=10, random_seed=42))])
param_grid= [
{'classifier': [CatBoostClassifier(verbose=False, auto_class_weights='Balanced', early_stopping_rounds=10, random_seed=42)],
'scaler': [StandardScaler()],
'transformer': [TargetEncoder(cols=target_enc)],
'sampling': [RandomUnderSampler(random_state=42)],
'classifier__learning_rate': sp_randFloat(0, 0.05),
'classifier__depth': [9], #sp_randInt(5, 20),
'classifier__l2_leaf_reg': [7]}] #sp_randInt(5, 10)}]
catb_rand_2 = RandomizedSearchCV(catb, param_grid, cv=5, scoring = 'f1', n_iter=5, random_state=42)
catb_rand_2.fit(X_train, y_train)
catb_model_2 = catb_rand_2.best_estimator_
print("Best cross-validation score: {:.2f}".format(catb_rand_2.best_score_))
print("Best params:\n{}\n".format(catb_rand_2.best_params_))
RandomizedSearchCV(cv=5, estimator=Pipeline(steps=[('sampling', RandomUnderSampler(random_state=42)), ('transformer', TargetEncoder(cols=['site_id', 'site_domain', 'app_id', 'device_model', 'C14', 'C17', 'C20'])), ('scaler', StandardScaler()), ('classifier', <catboost.core.CatBoostClassifier object at 0x000002A07274BE80>)]), n_iter=5, param_distributions=[{'classifie... 'classifier__depth': [9], 'classifier__l2_leaf_reg': [7], 'classifier__learning_rate': <scipy.stats._distn_infrastructure.rv_frozen object at 0x000002A0035EE5E0>, 'sampling': [RandomUnderSampler(random_state=42)], 'scaler': [StandardScaler()], 'transformer': [TargetEncoder(cols=['site_id', 'site_domain', 'app_id', 'device_model', 'C14', 'C17', 'C20'])]}], random_state=42, scoring='f1')
Best cross-validation score: 0.39 Best params: {'classifier': <catboost.core.CatBoostClassifier object at 0x000002A07274B040>, 'classifier__depth': 9, 'classifier__l2_leaf_reg': 7, 'classifier__learning_rate': 0.007800932022121826, 'sampling': RandomUnderSampler(random_state=42), 'scaler': StandardScaler(), 'transformer': TargetEncoder(cols=['site_id', 'site_domain', 'app_id', 'device_model', 'C14', 'C17', 'C20'])}
knn_grid.best_estimator_
knn_rand.best_estimator_
rfc_simple_grid.best_estimator_
rfc_rand.best_estimator_
rfc_rand_2.best_estimator_
rfc_rand_3.best_estimator_
catb_model
catb_model_2
Pipeline(steps=[('sampling', RandomUnderSampler(random_state=42)), ('transformer', TargetEncoder(cols=['site_id', 'site_domain', 'app_id', 'device_model', 'C14', 'C17', 'C20'])), ('scaler', StandardScaler()), ('classifier', KNeighborsClassifier(n_neighbors=9))])
Pipeline(steps=[('sampling', RandomUnderSampler(random_state=42)), ('transformer', TargetEncoder(cols=['site_id', 'site_domain', 'app_id', 'device_model', 'C14', 'C17', 'C20'])), ('scaler', StandardScaler()), ('classifier', KNeighborsClassifier(n_neighbors=43))])
Pipeline(steps=[('sampling', RandomUnderSampler(random_state=42)), ('transformer', TargetEncoder(cols=['site_id', 'site_domain', 'app_id', 'device_model', 'C14', 'C17', 'C20'])), ('scaler', StandardScaler()), ('classifier', RandomForestClassifier(random_state=42))])
Pipeline(steps=[('sampling', RandomUnderSampler(random_state=42)), ('transformer', TargetEncoder(cols=['site_id', 'site_domain', 'app_id', 'device_model', 'C14', 'C17', 'C20'])), ('scaler', StandardScaler()), ('classifier', RandomForestClassifier(max_depth=39, min_samples_leaf=29, min_samples_split=15, n_estimators=206, random_state=42))])
Pipeline(steps=[('sampling', RandomUnderSampler(random_state=42)), ('transformer', TargetEncoder(cols=['site_id', 'site_domain', 'app_id', 'device_model', 'C14', 'C17', 'C20'])), ('scaler', StandardScaler()), ('classifier', RandomForestClassifier(max_depth=27, min_samples_leaf=40, min_samples_split=48, n_estimators=271, random_state=42))])
Pipeline(steps=[('sampling', RandomUnderSampler(random_state=42)), ('transformer', TargetEncoder(cols=['site_id', 'site_domain', 'app_id', 'device_model', 'C14', 'C17', 'C20'])), ('scaler', StandardScaler()), ('classifier', RandomForestClassifier(class_weight='balanced', max_depth=58, min_samples_leaf=37, min_samples_split=13, n_estimators=750, random_state=42))])
Pipeline(steps=[('sampling', RandomUnderSampler(random_state=42)), ('transformer', TargetEncoder(cols=['site_id', 'site_domain', 'app_id', 'device_model', 'C14', 'C17', 'C20'])), ('scaler', StandardScaler()), ('classifier', <catboost.core.CatBoostClassifier object at 0x000002A00093D910>)])
Pipeline(steps=[('sampling', RandomUnderSampler(random_state=42)), ('transformer', TargetEncoder(cols=['site_id', 'site_domain', 'app_id', 'device_model', 'C14', 'C17', 'C20'])), ('scaler', StandardScaler()), ('classifier', <catboost.core.CatBoostClassifier object at 0x000002A0008F1400>)])
Aggregating the best single models after hyper-parameter tuning, we combined them in a Voting Classifier.
estimators = [('knn_optimized', knn_new_neigh_model),
('forest_optimized',rfc_model_opt_3),
('catboost_optmized', catb_model_2)]
voting = VotingClassifier(estimators, voting='soft')
voting.fit(X_train, y_train)
list_of_estimators = [('knn_new_neighbors', knn_new_neigh_model),
('forest_optimized_3',rfc_model_opt_3),
('catboost_opt_2', catb_model_2), ('voting', voting)]
for label, model in list_of_estimators:
cv_scores = cross_validate(model, X_train, y_train, cv=10, scoring=('f1','roc_auc'))
print(f"F1-score: {cv_scores['test_f1'].mean():0.4f} (+/- {cv_scores['test_f1'].std():0.4f}) | ROC-AUC: {cv_scores['test_roc_auc'].mean():0.4f} (+/- {cv_scores['test_roc_auc'].std():0.4f}) [{label}]")
VotingClassifier(estimators=[('knn_optimized', Pipeline(steps=[('sampling', RandomUnderSampler(random_state=42)), ('transformer', TargetEncoder(cols=['site_id', 'site_domain', 'app_id', 'device_model', 'C14', 'C17', 'C20'])), ('scaler', StandardScaler()), ('classifier', KNeighborsClassifier(n_neighbors=43))])), ('forest_optimized', Pipeline(steps=[('sampling', Rand... n_estimators=750, random_state=42))])), ('catboost_optmized', Pipeline(steps=[('sampling', RandomUnderSampler(random_state=42)), ('transformer', TargetEncoder(cols=['site_id', 'site_domain', 'app_id', 'device_model', 'C14', 'C17', 'C20'])), ('scaler', StandardScaler()), ('classifier', <catboost.core.CatBoostClassifier object at 0x000002A0008F1400>)]))], voting='soft')
F1-score: 0.3859 (+/- 0.0041) | ROC-AUC: 0.7101 (+/- 0.0049) [knn_new_neighbors]
F1-score: 0.3970 (+/- 0.0051) | ROC-AUC: 0.7233 (+/- 0.0043) [forest_optimized_3]
F1-score: 0.3938 (+/- 0.0047) | ROC-AUC: 0.7136 (+/- 0.0041) [catboost_opt_2]
F1-score: 0.3939 (+/- 0.0048) | ROC-AUC: 0.7217 (+/- 0.0044) [voting]
Combining the models in a Voting Classifier set to soft voting did not perform better than the hyper-parameter tuning setting we found for the Random Forest, albeit the differences are small and the ensemble has a slightly lower standard deviation for the f1-score. What soft voting does in practice is sketched below.
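A minimal sketch, for illustration only, of what "soft" voting means: average the class probabilities of the fitted base pipelines and take the most probable class. It reuses the estimators list defined above (the actual VotingClassifier refits its own clones on X_train).
import numpy as np
probas = [model.predict_proba(X_test) for _, model in estimators]  # one (n_samples, 2) array per base model
avg_proba = np.mean(probas, axis=0)                                # soft vote = mean of the class probabilities
manual_soft_vote = avg_proba.argmax(axis=1)                        # predicted class per instance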
Using GridSearchCV we chose the best model after hyper-parameter tuning. The metric we use to decide which model is best is the f1-score.
best_pipe = Pipeline([('sampling', RandomUnderSampler()),
('transformer', TargetEncoder(cols=target_enc)),
('scaler', StandardScaler()),
('classifier', KNeighborsClassifier())])
param_grid= [#{key: [value] for (key, value) in knn_grid.best_params_.items()},
{key: [value] for (key, value) in knn_rand.best_params_.items()},
#{key: [value] for (key, value) in rfc_simple_grid.best_params_.items()},
{key: [value] for (key, value) in rfc_rand.best_params_.items()},
{key: [value] for (key, value) in rfc_rand_2.best_params_.items()},
{key: [value] for (key, value) in rfc_rand_3.best_params_.items()},
#{key: [value] for (key, value) in catb_rand.best_params_.items()},
{key: [value] for (key, value) in catb_rand_2.best_params_.items()},
{'sampling': [None], 'transformer': [None], 'scaler': [None], 'classifier': [voting]}]
originalclass = []
predictedclass = []
# Custom scoring function: stores the true labels and predictions of each CV split, then returns accuracy
def classification_report_with_accuracy_score(y_true, y_pred):
originalclass.extend(y_true)
predictedclass.extend(y_pred)
return accuracy_score(y_true, y_pred)
scorers = {
'f1_score': make_scorer(f1_score),
'precision_score': make_scorer(precision_score),
'recall_score': make_scorer(recall_score),
'accuracy_score': make_scorer(accuracy_score),
'roc_auc_score': make_scorer(roc_auc_score),
'custom_score': make_scorer(classification_report_with_accuracy_score)
}
grid_best = GridSearchCV(
estimator=best_pipe,
param_grid=param_grid,
scoring=scorers,
cv=10,
refit='f1_score'
)
grid_best.fit(X_train, y_train)
best_model = grid_best.best_estimator_
print("Best cross-validation score: {:.2f}".format(grid_best.best_score_))
print("Best params:\n{}\n".format(grid_best.best_params_))
GridSearchCV(cv=10, estimator=Pipeline(steps=[('sampling', RandomUnderSampler()), ('transformer', TargetEncoder(cols=['site_id', 'site_domain', 'app_id', 'device_model', 'C14', 'C17', 'C20'])), ('scaler', StandardScaler()), ('classifier', KNeighborsClassifier())]), param_grid=[{'classifier': [KNeighborsClassifier(n_neighbors=43)], 'classifier__n_neighbors': [43], 'cl... 'sampling': [None], 'scaler': [None], 'transformer': [None]}], refit='f1_score', scoring={'accuracy_score': make_scorer(accuracy_score), 'custom_score': make_scorer(classification_report_with_accuracy_score), 'f1_score': make_scorer(f1_score), 'precision_score': make_scorer(precision_score), 'recall_score': make_scorer(recall_score), 'roc_auc_score': make_scorer(roc_auc_score)})
Best cross-validation score: 0.40 Best params: {'classifier': RandomForestClassifier(class_weight='balanced', max_depth=58, min_samples_leaf=37, min_samples_split=13, n_estimators=750, random_state=42), 'classifier__class_weight': 'balanced', 'classifier__criterion': 'gini', 'classifier__max_depth': 58, 'classifier__min_samples_leaf': 37, 'classifier__min_samples_split': 13, 'classifier__n_estimators': 750, 'sampling': RandomUnderSampler(random_state=42), 'scaler': StandardScaler(), 'transformer': TargetEncoder(cols=['site_id', 'site_domain', 'app_id', 'device_model', 'C14', 'C17', 'C20'])}
scores_best_models = pd.DataFrame(grid_best.cv_results_)[['param_classifier','mean_test_f1_score','std_test_f1_score','rank_test_f1_score']]
scores_best_models.sort_values(by='rank_test_f1_score')
| | param_classifier | mean_test_f1_score | std_test_f1_score | rank_test_f1_score |
|---|---|---|---|---|
| 3 | RandomForestClassifier(class_weight='balanced'... | 0.396958 | 0.005089 | 1 |
| 1 | RandomForestClassifier(max_depth=39, min_sampl... | 0.396706 | 0.003284 | 2 |
| 2 | RandomForestClassifier(max_depth=27, min_sampl... | 0.396676 | 0.005481 | 3 |
| 5 | VotingClassifier(estimators=[('knn_optimized',... | 0.393913 | 0.004823 | 4 |
| 4 | <catboost.core.CatBoostClassifier object at 0x... | 0.393792 | 0.004650 | 5 |
| 0 | KNeighborsClassifier(n_neighbors=43) | 0.385879 | 0.004114 | 6 |
mean_test_scores = ['mean_test_'+sc for sc in scorers.keys()]
pd.DataFrame(grid_best.cv_results_)[['param_classifier']+mean_test_scores+['rank_test_f1_score']].sort_values(by='rank_test_f1_score')
| | param_classifier | mean_test_f1_score | mean_test_precision_score | mean_test_recall_score | mean_test_accuracy_score | mean_test_roc_auc_score | mean_test_custom_score | rank_test_f1_score |
|---|---|---|---|---|---|---|---|---|
| 3 | RandomForestClassifier(class_weight='balanced'... | 0.396958 | 0.272905 | 0.727902 | 0.624088 | 0.665365 | 0.624088 | 1 |
| 1 | RandomForestClassifier(max_depth=39, min_sampl... | 0.396706 | 0.272687 | 0.727755 | 0.623750 | 0.665103 | 0.623750 | 2 |
| 2 | RandomForestClassifier(max_depth=27, min_sampl... | 0.396676 | 0.272557 | 0.728491 | 0.623337 | 0.665148 | 0.623337 | 3 |
| 5 | VotingClassifier(estimators=[('knn_optimized',... | 0.393913 | 0.269787 | 0.729667 | 0.618337 | 0.662603 | 0.618337 | 4 |
| 4 | <catboost.core.CatBoostClassifier object at 0x... | 0.393792 | 0.270544 | 0.723417 | 0.621450 | 0.661993 | 0.621450 | 5 |
| 0 | KNeighborsClassifier(n_neighbors=43) | 0.385879 | 0.263337 | 0.721797 | 0.609475 | 0.654135 | 0.609475 | 6 |
The custom scorer above gathers the original labels and the predictions for each CV split, which we use to create an average CV classification report.
print("-------------- Average CV Classification Report --------------\n",
classification_report(originalclass, predictedclass))
-------------- Average CV Classification Report --------------
              precision    recall  f1-score   support
           0       0.91      0.60      0.72    398412
           1       0.27      0.73      0.39     81588
    accuracy                           0.62    480000
   macro avg       0.59      0.66      0.56    480000
weighted avg       0.80      0.62      0.67    480000
When two features are correlated and one of them is permuted, the model still has access to the information through the correlated feature. This results in a lower importance value for both features, even though they might actually be important.
A negative score is returned when a random permutation of a feature's values results in a better performance metric than the original values. It does not mean that the feature has a positive impact on the model; it rather means that substituting the feature with noise works better than the original feature, i.e. the feature is worse than noise. Quite likely this indicates that the feature interacts with other features.
Typically, when there is a negative score we should remove that variable and redo the model.
In order to correct the negative values we used f1 scoring, which led to a plot with only positive values (verified in the quick check after the plot below).
# Reset again the output for only the last expression
InteractiveShell.ast_node_interactivity = 'last_expr'
from sklearn.inspection import permutation_importance
perm = permutation_importance(best_model, X_train, y_train, scoring = 'f1', n_repeats=10, random_state=42)
fi = perm.importances_mean
fig, ax = plt.subplots(figsize=(20, 5))
ax.bar(range(len(fi)), fi, align="center")
ax.set(xticks=range(len(fi)), xticklabels=X_train.columns)
plt.setp(ax.get_xticklabels(), rotation=45, ha="right", rotation_mode="anchor")
plt.show()
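As a quick sanity check on the permutation results computed above, we can list any feature whose mean importance is negative (with the f1 scoring used here the list is expected to be empty, matching the plot):
neg_importances = pd.Series(perm.importances_mean, index=X_train.columns).sort_values()
print(neg_importances[neg_importances < 0])  # features where permuting the values did not hurt the f1-score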
Although C14 is an important feature, we could not take much interpretability from it, as it is anonymized.
From the features that we can interpret, device model has the most relevant importance. A possible intuition could be differences between the software on each device, since newer versions could lead to better ad performance, or some other effect that the model is capturing.
Also, site id explains the model well. This could indicate that some sites are built to attract users to click on the ad; it is a good metric to work on.
Finally, app id stands out as another important feature.
One feature that could have had a higher importance is the hour of the day, since it could lead to different user responses; clearly, it does not explain the model that much.
best_model.fit(X_train, y_train)
y_pred = best_model.predict(X_test)
print(best_model)
Pipeline(steps=[('sampling', RandomUnderSampler(random_state=42)), ('transformer', TargetEncoder(cols=['site_id', 'site_domain', 'app_id', 'device_model', 'C14', 'C17', 'C20'])), ('scaler', StandardScaler()), ('classifier', RandomForestClassifier(class_weight='balanced', max_depth=58, min_samples_leaf=37, min_samples_split=13, n_estimators=750, random_state=42))])
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support
           0       0.92      0.60      0.72     16567
           1       0.27      0.73      0.40      3433
    accuracy                           0.62     20000
   macro avg       0.59      0.67      0.56     20000
weighted avg       0.81      0.62      0.67     20000
Our main target was to sustain a high f1-score, since our dataset is imbalanced. It is also more relevant to the business problem to guarantee a good harmonic mean between recall and precision.
As we can see, precision is not being perfected by the model, but recall has a significant value, which leads to a reasonable f1-score.
The predicted classes are represented in the columns of the matrix, whereas the actual classes are in the rows of the matrix. We then have four cases:
cf_matrix = confusion_matrix(y_test, y_pred)
# Names, counts and Percentage
group_names = ['True Neg','False Pos','False Neg','True Pos']
group_counts = ["{0:0.0f}".format(value) for value in cf_matrix.flatten()]
group_percentages = ["{0:.2%}".format(value) for value in cf_matrix.flatten()/np.sum(cf_matrix)]
# Join the above into text
labels = [f"{v1}\n{v2}\n{v3}" for v1, v2, v3 in zip(group_names,group_counts,group_percentages)]
labels = np.asarray(labels).reshape(2,2)
categories = ['Not Click (0)','Click (1)']
sns.heatmap(cf_matrix, annot=labels, fmt='', cmap='Blues',xticklabels=categories,yticklabels=categories)
plt.title('Best Model')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()
Let's interpret these values: Out of the 3433 instances of class 1 in our test set, the model identified 2514 of them as 1, and 919 predicted as 0 (false negatives, type II error). Out of 16567 users of class 0, 9923 were identified correctly as 0, and 6644 were false positives (type I error).
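The per-class metrics in the classification report can be recovered directly from these counts; the short check below ties the confusion matrix back to the reported precision, recall and f1 for class 1:
tn, fp, fn, tp = cf_matrix.ravel()                            # 9923, 6644, 919, 2514
precision_1 = tp / (tp + fp)                                  # ~0.27
recall_1 = tp / (tp + fn)                                     # ~0.73
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)  # ~0.40
print(f'precision={precision_1:.2f}, recall={recall_1:.2f}, f1={f1_1:.2f}')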
plot_roc_curve(best_model, X_test, y_test)
plt.plot([0, 1], [0, 1],'r--')
plt.title('Best Model')
plt.show()
# Imports
from sklearn.inspection import permutation_importance
import shap
The Shapley value is the average contribution of a feature value to the prediction in different coalitions. Shapley values are a robust mechanism to assess the importance of each feature to reach an output, and hence form a better alternative to standard feature importance mechanisms.
The SHAP library provides several explainers: TreeExplainer, which is optimized for tree-based models, DeepExplainer and GradientExplainer for neural networks, and KernelExplainer, which makes no assumptions about the underlying model to be explained.
Given that our best model is a Pipeline with an optimized Random Forest Classifier, we will use KernelExplainer.
We can visualize feature attributions such as Shapley values as "forces". Each feature value is a force that either increases or decreases the prediction. The prediction starts from the baseline; the baseline for Shapley values is the average of all predictions. In the plot, each Shapley value is an arrow that pushes to increase (positive value) or decrease (negative value) the prediction. These forces balance each other out at the actual prediction of the data instance.
row = 67
data_for_prediction = X_test.iloc[row]
data_for_prediction_array = data_for_prediction.values.reshape(1, -1)
def model_predict(data_asarray):
data_asframe = pd.DataFrame(data_asarray, columns=X_train.columns.to_list())
return best_model.predict_proba(data_asframe)
explainer = shap.KernelExplainer(model_predict, X_test.iloc[0:100], link='identity')
shap_values = explainer.shap_values(data_for_prediction)
shap.initjs()
shap.force_plot(explainer.expected_value[0], shap_values[0], data_for_prediction)
The output prediction is 1, which means the model classifies this observation as a click. The base value is 0.5496. Feature values that push towards no click are shown in blue - device_model - and the length of the region shows how much the feature contributes to this effect. Feature values increasing the prediction are shown in pink, namely app_id and C14.
Below, we have an example of a classification of a no click, where site_id, site_domain and device_model strongly push toward no click.
row = 1
data_for_prediction = X_test.iloc[row]
data_for_prediction_array = data_for_prediction.values.reshape(1, -1)
def model_predict(data_asarray):
data_asframe = pd.DataFrame(data_asarray, columns=X_train.columns.to_list())
return best_model.predict_proba(data_asframe)
explainer = shap.KernelExplainer(model_predict, X_test.iloc[0:100], link='identity')
shap_values = explainer.shap_values(data_for_prediction)
shap.initjs()
shap.force_plot(explainer.expected_value[0], shap_values[0], data_for_prediction)
To get an overview of which features are most important for a model on a global level, we can plot the SHAP values of every feature for every sample.
random_sample = X_test.sample(200, random_state=42)
shap_values = explainer.shap_values(random_sample)
SHAP feature importance is measured as the mean absolute Shapley value. The anonymous feature C14 was the most important one, changing the predicted absolute click probability on average by 45 percentage points (0.45 on the x-axis).
SHAP feature importance is an alternative to permutation feature importance. There is a big difference between both importance measures: Permutation feature importance is based on the decrease in model performance. SHAP is based on magnitude of feature attributions.
Molnar, Christoph. "Interpretable machine learning. A Guide for Making Black Box Models Explainable", 2019. https://christophm.github.io/interpretable-ml-book/.
This is a useful plot, however, it contains no information beyond the importances. For a more informative plot, we have to look at the summary plot.
shap.summary_plot(shap_values, random_sample, plot_type="bar")  # use the same 200-row sample the SHAP values were computed on
The summary plot combines feature importance with feature effects.
Looking at the SHAP SUMMARY PLOT
we know that each point is a Shapley value for a feature and an instance. The position on the y-axis is determined by the feature and on the x-axis by the Shapley value.
shap.summary_plot(shap_values[0], random_sample , plot_type='dot')
A positive SHAP value impact on the model output "pushes" the probability of the customer having clicked (click = 1) towards 1, and a negative SHAP value impact leads to the opposite push.
We can see that C14, device_model, site_id, app_id, site_domain, C17 and C20 are the features with the highest impact on the model.
Remember that these are encoded features, so to unpack this information properly we would have to look at the category values that correspond to the high, middle and low ranges after encoding and standardisation; a rough way to do this is sketched below.
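A sketch of that unpacking, assuming y_train is a pandas Series aligned with X_train: recompute the per-category click rate from the raw training data, which approximates the ordering the TargetEncoder learned while ignoring its smoothing and the later standardisation. Shown here for site_id; the same works for any of the target-encoded columns.
click_rate = (pd.concat([X_train[['site_id']], y_train.rename('click')], axis=1)
                .groupby('site_id')['click']
                .agg(['mean', 'size'])
                .sort_values('mean'))
print(click_rate.head())   # categories that end up at the low end of the encoding
print(click_rate.tail())   # categories that end up at the high end of the encoding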
We can see that lower values of C14 push the model towards no click more often than higher feature values do. Higher feature values, however, seem to be very spread out, with no clear pattern emerging. We can only add that impacts above 0.2 are almost exclusively due to high feature values of C14.
For device_model it is very hard to distinguish how the feature value affects the model: it has a high impact, but it is very hard to extract any information from it. This is one of the problems of a black-box model; even though a feature is clearly important, without one-hot encoding it is not possible to understand the role of each of its values in the model. However, one-hot encoding would not be reasonable given the very high cardinality of this feature.
site_id, which uniquely identifies a given website, shows middle feature values affecting the Shap values positively, between 0.0 and 0.2, thus pushing the click probability slightly, while high or low feature values tend to sit at the extremes, either affecting the Shap value negatively or very positively (look at the positive outliers).
app_id, which represents the unique id of a given app, has a very high concentration of high feature values with a small, negative impact on the model, which means these apps are probably not clicked. There are some clusters, both with a positive and a negative impact on the output, spread along the impact axis, but it is not possible to distinguish feature values from effects; they are very mixed.
site_domain can either be a unique identifier like the site id or a top-level domain name such as .com, .pt and others. However, it is probably different from site_id, as they have a different number of unique values, so we cannot know for sure whether we should discard one of these features. Now that we see that mid-to-high feature values of site_domain gather between (-0... and 0.2), there is a tendency for site domains located at the upper range of the site_domain encoding to "push" the probability towards click.
C17 is another anonymous feature. Here the mid-high feature values mostly "push" toward no click, while the high feature values tend to push the probability toward click with an impact between 0.0 and 0.2.
Some other mentions: C18_01, which, when present, almost always has a positive impact towards click, and high values of banner_pos, which seem to push slightly towards click.
The partial dependence plot (short PDP or PD plot) shows the marginal effect one or two features have on the predicted outcome of a machine learning model (J. H. Friedman 2001).
from pdpbox import pdp
features_to_plot = ['device_model', 'site_id']
inter1 = pdp.pdp_interact(model=best_model, dataset=X_test, model_features=X_test.columns, features=features_to_plot)
pdp.pdp_interact_plot(pdp_interact_out=inter1, feature_names=features_to_plot, plot_type='contour', x_quantile=True, plot_pdp=True)
plt.show()
We chose the two best features that were interpretable (excluding anonymized features). These are device_model and site_id, which we plot against each other with a partial dependence plot. This plot helps identify users who are more likely to click on an ad (lighter regions) rather than not click (darker regions), based on the interaction between the two features.
We can see that, for an encoded and standardized site_id of 0.558, values of device_model around 0.5 - 0.51, 0.5, 0.491 and 0.494 - have a higher probability of being predicted as click, i.e. 1.
The main goal was to answer the business problem question: whether a user will click or not on an ad.
We first analysed the features we had and immediately identified several which were not relevant to explain our target. We used methods like click-through rate and visualization of the instances to drop the poor features. Hence, we dropped id, device_id, device_ip, hour and day.
After that, we transformed the data using one-hot, label and target encoding, in order to better use the models we were applying and to better interpret the situation.
Following the modelling process, we used the F1 score as the evaluation metric, since it corrects for some of the imbalance. We considered AUC, but it is less suitable for this type of dataset. We also used a Pipeline to rebalance the dataset and four models to predict the target: Knn, Catboost, Random Forest and a Voting Classifier combining the previous models.
Then we arrived at the best model (Random Forest), which presented the best F1 score, and ran a permutation importance check to see how it was working. There were still some irrelevant features, but the main objective was accomplished: there were really important features that helped the model predict the target. When testing the best model, we considered that a common threshold of 0.5 was too high for this type of dataset, hence we agreed on stopping at 0.4. The recall has a significant value; on the other hand, precision is lower, but it proved difficult to increase.
In order to interpret the model, we used Shap and the Partial Dependence Plot. The first explains the deviations of the different variables from a base value, which is the average of the model's predictions. We can also use the library to see a feature-importance stacked bar chart separated by classes, which is very useful. The second explains the target based on the interaction between two chosen features.
We conclude our work by saying that we achieved an interesting result. Not only will it help identify a user who will click on an ad, it will also help identify which type of features make the user click on the ad. With that, companies are able to anticipate and make ads more interesting to the person who is going to watch them.