Student: João Gil Ribeiro

ID: 32399

INDEX

  1. The Business Problem
  2. The problem and methodologies
  3. Describing the Dataset
  4. The Dataset Features
  5. Data Analysis and Exploration
    1. Target Variable
    2. Checkpoint - Initial shape of the dataset
    3. Hour/Time Feature
    4. Defining our evaluation metric - CTR (Click Through Rate)
    5. Checkpoint - Evaluate the transformations on the dataset
    6. Anonymised Categorical Features
      1. Using the CTR metric to understand relationships between features and the target
    7. Website Features
    8. App Features
    9. Device Features
    10. Checkpoint - Before Exporting EDA to CSV for easier modeling
  6. Pre-processing
  7. Modeling
    1. Choosing an Evaluation Metric - F1_score
    2. Modeling Pipeline
    3. KNN Classifier
      1. KNN - Further Narrowing the Hyper-parameter Search
    4. Random Forest
      1. Simple Random Forest
      2. Hyper-parameter Tuning of Random Forest
      3. Random Forest - Further Narrowing the Hyper-parameter Search
      4. Random Forest - Further Hyper-parameter Research (adding other parameters)
    5. CatBoost
      1. CatBoost - Further Hyper-parameter research
    6. Stacking Classifier
  8. Selecting the Best Model
  9. Feature Importance
  10. Test Best Model
    1. Confusion matrix
  11. Interpretability
    1. SHAP (SHapley Additive exPlanations)
    2. SHAP Feature Importance Plot
    3. SHAP Summary Plot
    4. PDP
  12. Conclusion

The Business Problem

We want to predict whether or not a user will click on an ad, with special attention to the role of the website, the position and display of the ad, and other features that can be controlled by the advertiser.

Given that most features are anonymized, we will focus on precision and accuracy rather than on interpretability of the model, since the anonymization makes it largely impossible to pursue that avenue.

The problem and methodologies

This being a supervised classification problem, the goal is to build a model that generates a ranked list of probabilities of whether or not a user will click on an ad, based on the available set of characteristics of both the consumer and the websites where the ad is displayed.

We want to understand whether there is a pattern behind a consumer clicking on a given ad and will explore the following methodologies:

Methodologies

  1. Conduct data analysis and exploration in order to gain a better understanding of the dataset. During this analysis, features were grouped into categories.
  2. Feature selection, in order to drop some irrelevant features.
  3. Modelling using the F1 score as the evaluation metric, with a pipeline to correct for class imbalance.
  4. Four models - KNN, Random Forest, CatBoost and Stacking - to predict the target. Random Search to decide which hyperparameters were best for each one, and Grid Search to decide which of the four tuned models was the best.
  5. Feature importance using permutation.
  6. SHAP and PDP to interpret the model and help explain what predicts the target.

Describing the Dataset

The Dataset Features

The dataset features can be divided into the following categories:

User Identifier:

id: ad identifier

Target Variable:

click: 0/1 for non-click/click

Website Features:

banner_pos: Refers to the banner position of the ad in the website

site_id: A site ID is a unique identification number allocated to a website

site_domain: A domain name is essentially the website’s equivalent of a physical address

site_category: The theme of the site, for instance, it can belong to "Education", "Kids", "Weather", "Arts" etc...

App Features:

app_id: Two-part string used to identify one or more apps from a single development team. The string consists of a Team ID and a bundle ID search string, with a period (.) separating the two parts. The Team ID is supplied by Apple and is unique to a specific development team, while the bundle ID search string is supplied by users to match either the bundle ID of a single app or a set of bundle IDs for a group of your apps.

app_domain: A .app domain behaves the same as any other website, but there are two factors that make it stand out. The first is visibility: searching for dedicated app websites will be easier, as .app domains indicate that the website concerns a specific app or product, rather than a news vertical, video platform, or any other type of website. The other benefit relates to security.

app_category: The theme of the app, for instance, it can belong to "Education", "Kids", "Weather", "Arts" etc...

Device Features:

device_id: String reported by a device’s enumerator. A device has only one device ID.

device_ip: Similar to the device ID, it also uniquely identifies a user; however, there are private and public IPs and these are not always the same.

device_model: Refers to the manufacturer's (Apple, Samsung, etc.) device models, such as iPhone 8, iPhone X, Galaxy 8...

device_type: Label associated with the device, used to match it to a series or model.

device_conn_type: Connected devices are physical objects that can connect with each other and other systems via the internet. They span everything from traditional computing hardware, such as a laptop or desktop, to common mobile devices, such as a smartphone or tablet, to an increasingly wide range of physical devices and objects.

Hour/Time Features:

hour: format is YYMMDDHH, so 14091123 means 23:00 on Sept. 11, 2014 UTC.

Anonymised Categorical Features

C1 -- anonymized categorical feature

C14-C21 -- anonymized categorical features

Data Analysis and Exploration

Target Variable

click: 0/1 for non-click/click

Looking at the target variable "click", the click rate is around 17.03%, with 17.03K clicks in this sampled dataset and 82.97% of users not clicking the ad.

Checkpoint - Initial shape of the dataset

Hour/Time Feature

hour: format is YYMMDDHH, so 14091123 means 23:00 on Sept. 11, 2014 UTC.

Hour (converted to datetime to see if there is any relation between the time of day and clicks): the hypothesis is that at times with higher traffic we will naturally have more users clicking.

The hour data goes from the 21st of October 2014 to the 30th of October 2014.

Let's aggregate by hour of the day and check whether there is any correlation between the hour of the day and clicks.

This gives us a better picture of how the target variable click is distributed throughout the day. It is now clear that it follows an approximately normal distribution with very fat tails, peaking in the middle of the day (at 1 pm (13h) with 1035 clicks in aggregate).

If we take a closer look:

This is an interesting distribution; however, it may mislead us as to when the ads are more or less effective. According to the Google Ads definition of a click, it is important and common to use "the click-through rate (CTR), which tells you how many people who’ve seen your ad end up clicking on it. This metric can help you gauge how enticing your ad is and how closely it matches your keywords and other targeting settings."

Therefore we need to compute the ratio between the number of click == 1 instances and the total number of ad impressions.
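A minimal sketch of this computation, assuming the raw dataframe is called df and still holds the original hour column in YYMMDDHH format:

import pandas as pd

# Parse the YYMMDDHH integer into a proper datetime and extract the hour of day.
df["datetime"] = pd.to_datetime(df["hour"].astype(str), format="%y%m%d%H")
df["hour_of_day"] = df["datetime"].dt.hour

clicks_per_hour = df.groupby("hour_of_day")["click"].sum()   # raw click counts
ctr_per_hour = df.groupby("hour_of_day")["click"].mean()     # CTR = clicks / impressions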

Defining our evaluation metric - CTR (Click Through Rate)

We can now see that even though the highest number of clicks occurs in the middle of the day, at 13h for example the CTR is only 17.8%, while the highest CTRs occur at hours {0, 1, 16, 23, 15}, ranging from 18.74% down to 18.16%.

We now have a better understanding of how the time of day influences the CTR and the predictability of whether or not a user will click the ad.

Generalizing for the 10 days of the dataset.

Over the 10 days of the dataset there is a clear pattern of higher CTR from the 23rd to the 27th of October 2014. A quick check reveals that the 23rd is a Thursday and the 27th a Monday; therefore, the highest CTR occurs towards the end of the business week and over the weekend. However, as in the hourly case above, these may not be the days with the highest number of clicks, just a larger conversion of views into clicks.

To check this distribution we can get the corresponding days of the week.

This allows us to see the traffic that the ads receive throughout the week, with business days actually being the ones with the highest traffic but not necessarily the highest conversion.

In fact, when we compute the CTR per day of the week we obtain a more or less uniform distribution, with only Monday and Friday being significantly lower in terms of CTR conversion than the rest of the weekdays.
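A short sketch of this day-of-week breakdown, reusing the parsed datetime column from the sketch above (the column names are assumptions):

df["day_of_week"] = df["datetime"].dt.day_name()

traffic_per_day = df["day_of_week"].value_counts()        # impressions per weekday
ctr_per_day = df.groupby("day_of_week")["click"].mean()   # CTR per weekday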

Checkpoint - Evaluate the transformations on the dataset

Anonymised Categorical Features

C1 -- anonymized categorical features

C14-C21 -- anonymized categorical features

Using the CTR metric to understand relationships between features and the target

In order to see whether the categories inside each anonymous categorical feature are relevant, we will evaluate, for each of these categories, its corresponding CTR, that is, for a given category of a feature, how many clicks were registered out of the total number of users that saw the ad.
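A hedged sketch of this per-category CTR computation, again assuming the dataframe df; for each category the CTR is simply the mean of click within that category:

anon_features = ["C1"] + [f"C{i}" for i in range(14, 22)]

for col in anon_features:
    # Number of instances and CTR for every category of the anonymised feature.
    summary = df.groupby(col)["click"].agg(instances="count", ctr="mean")
    print(summary.sort_values("instances", ascending=False).head())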

Since features C1 and C14 to C21 are anonymized categorical features, it is very difficult to derive an intuition from the plots above. However, by analysing them, it is easy to see that some features have similar behaviour.

Firstly, features C1, C15 and C16 have one category that captures almost all of the instances with a very low CTR (all around 0.1), making these features potentially good choices for filtering out no-click instances.

Secondly, features C14 and C17 have no dominant category, but rather a similar density across categories. However, this regularity in category density is not present in the CTR, which changes considerably from category to category.

Thirdly, C19, C20 and C21 have their instances more evenly distributed than the features in the first group, but still have one or two categories with a greater allocation of instances compared to the others. In this case the CTR is once again very volatile.

Lastly, C18 has 4 categories holding the majority of the instances, and their CTR is relatively low, especially for category 1.

Website Features

banner_pos: Refers to the banner position of the ad in the website

site_id: A site ID is a unique identification number allocated to your website

site_domain: A domain name is essentially the website’s equivalent of a physical address

site_category: Site Category is, like the name indicates, the theme of the site. It can belong to "Education", "Kids", "Weather", "Arts" etc...

banner_pos: represents the ad banner position on the website. The graph on the left tells us that most websites have their ad banner in position 0 or 1, translating into a CTR close to 0.2 in both cases.

site_id: represents a unique identification of the website. An interesting reading is that the category with the largest number of instances presents a CTR around 0, meaning that although the website has many views, no one is clicking on the ad. Conversely, category 53ef4a3 has a small number of instances but a CTR very close to 1.

site_domain: only has 4 categories with a significant CTR, and those have a very low number of instances, showing that the domain of a website does not have an impact on the CTR distribution.

site_category: is an allocation of websites by theme. Instances are mostly distributed across 4 categories, for which the CTR varies considerably. Hence, in some categories visitors are more likely to click.

App Features

app_id: An App ID is a two-part string used to identify one or more apps from a single development team. The string consists of a Team ID and a bundle ID search string, with a period (.) separating the two parts. The Team ID is supplied by Apple and is unique to a specific development team, while the bundle ID search string is supplied by you to match either the bundle ID of a single app or a set of bundle IDs for a group of your apps.

app_domain: A .app domain behaves the same as any other website, but there are two factors that make it stand out. The first is visibility: searching for dedicated app websites will be easier, as .app domains indicate that the website concerns a specific app or product, rather than a news vertical, video platform, or any other type of website. The other benefit relates to security.

app_category: App Category is, like the name indicates, the theme of the app. It can belong to "Education", "Kids", "Weather", "Arts" etc...

app_id: although most instances are allocated to one category, the CTR of that category is around 0.

app_domain: one category has the biggest aggregation of instances and simultaneously the highest CTR. The click-through rate in most categories is around 0.

app_category: most instances fall in one category with a CTR close to 0.2, followed by a category with an even greater CTR. The remaining categories have a low number of instances and a very low CTR.

Device Features

device_id: A device ID is a string reported by a device’s enumerator. A device has only one device ID.

device_ip: Similar to the device ID, it also uniquely identifies a user; however, there are private and public IPs and these are not always the same.

device_model: Device Model refers to the manufacturer's (Apple, Samsung, etc.) device models, such as iPhone 8, iPhone X, Galaxy 8...

device_type: Device Type is a label associated with the device, used to match it to a series or model.

device_conn_type: Connected devices are physical objects that can connect with each other and other systems via the internet. They span everything from traditional computing hardware, such as a laptop or desktop, to common mobile devices, such as a smartphone or tablet, to an increasingly wide range of physical devices and objects.

The cell below takes around 12 minutes to run given the 77,833 unique categories of the site ID - the charts will still be displayed without re-running the notebook.

device_id: characterized by one category with a CTR of 0.

device_ip: instances spread evenly across categories, five of which have a CTR around 1.

device_model: many categories with low density and very diverse CTR behaviour, ranging from around 0 to around 1.

device_type and device_conn_type: features characterized by four main categories, one of which holds the majority of instances. The CTR is relatively low.

Checkpoint - Before Exporting EDA to CSV for easier modeling

Load Treated dataset from EDA

Drop features that will not go into the model

Given that device_id and device_ip uniquely identify the specific device and the public IP address (which is not immutable for the same person, i.e. you can have different public IPs at different times) of the person that clicked the ad, they provide no meaningful or useful information for predicting clicks.

There is also the problem that, being very high-cardinality features, they play a seemingly "important" yet misleading role when models are applied to the dataset. Furthermore, these features represent a person that may never be seen again when the model is applied in reality or to a different dataset, therefore it would be wrong to include them in the modeling process.

We are interested in knowing the shared characteristics of anyone that clicks, and the above-mentioned features do not fit this requirement.

Pre-processing

Apply

Split the Dataset

Let's take a closer look at the high-cardinality features

Effect of implementing Target Encoding

Modeling

Choosing an Evaluation Metric - F1_score

We chose to maximize the F1-score since it is the harmonic mean of precision and recall. Our business problem relies on good recall and precision for the "click" class, as our interest is in being able to predict whether a given user will click on the ad or not.


Maximizing recall of the "click" class alone would push the model to label almost every impression as a click, leaving us with a high number of False Positives, which we would like to avoid: an overly optimistic outlook on the impact of marketing can backfire into lower-than-expected revenues when the predicted interaction does not materialize. Maximizing precision alone, on the other hand, would lead to a model that commits to very few positive predictions, leaving "click = 1", our main interest, with very low recall.

We are still interested in a balance because, even though we want to predict clicks, it is also very important to clearly understand what leads a user not to click an ad, as this can help us identify and shift away from harmful practices.

AUC is also a widely used metric to compare binary classification models; however, AUC is not a good measure for imbalanced datasets (as is the case here) because it considers only the True Positive Rate against the False Positive Rate, ignoring the absolute numbers of TPs and FPs. That is, we may have a model with a high AUC that recalls very few True Positives; the AUC is high only because there are very few predictions for the positive class and those are mostly correct.
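A toy illustration (with made-up numbers roughly matching the 83/17 class split) of why a majority-class model can look good on accuracy while the F1-score exposes it:

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [0] * 83 + [1] * 17                  # ~83% no-click, ~17% click
y_pred = [0] * 83 + [0] * 14 + [1] * 3        # a model that almost always predicts no-click

print(accuracy_score(y_true, y_pred))         # 0.86, inflated by the majority class
print(precision_score(y_true, y_pred))        # 1.00, but on only 3 positive predictions
print(recall_score(y_true, y_pred))           # ~0.18, most clicks are missed
print(f1_score(y_true, y_pred))               # ~0.30, revealing the weak minority-class model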

Modeling Pipeline

Our dataset is imbalanced, and preliminary modeling showed that without rebalancing, model performance was severely hindered by the over-importance of the majority class. Those models yielded no useful information, so we opted to use rebalancing methods.

For all models we also used a StandardScaler to control for arbitrary differences in the scales of the features, and we performed Target Encoding, as explained above (Effect of implementing Target Encoding), to reduce the effect of high-cardinality features. This process cannot be conducted on the whole training set before cross-validation, otherwise there would be data leakage, as the model would already have a feature perfectly mapping the target.

Pipeline

'sampling': RandomOverSampler() / SMOTE()
'transformer': TargetEncoder(cols=target_enc)
'scaler': StandardScaler()
'classifier': KNeighborsClassifier() / RandomForestClassifier() / CatBoostClassifier()
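A minimal sketch of this pipeline using imblearn's Pipeline (which applies the sampler only during fit, inside each CV fold); the target_enc column list below is an assumption for illustration:

from imblearn.pipeline import Pipeline
from imblearn.over_sampling import RandomOverSampler
from category_encoders import TargetEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

target_enc = ["site_id", "site_domain", "app_id", "device_model"]   # assumed high-cardinality columns

pipe = Pipeline(steps=[
    ("sampling", RandomOverSampler(random_state=42)),    # rebalancing happens inside each fold
    ("transformer", TargetEncoder(cols=target_enc)),     # fitted only on the fold's training part
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(random_state=42)),
])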

KNN Classifier

n_neighbors represents the number of neighbors to use for kneighbors queries

weights:

uniform : uniform weights. All points in each neighborhood are weighted equally.
distance : weight points by the inverse of their distance. In this case, closer neighbors of a query point will have a greater influence than neighbors which are further away.

We found that the best number of neighbors was at the upper end of the defined interval. Thus, we decided to test whether, keeping all else equal - i.e. the previously found hyper-parameters - the model's performance would increase if we widened the range of n_neighbors.
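A hedged sketch of this narrowed search, reusing the pipe object sketched above; the grid values are illustrative, not the exact ones used in the notebook:

from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier

pipe.set_params(classifier=KNeighborsClassifier())
param_dist = {
    "classifier__n_neighbors": list(range(15, 61, 5)),   # widened upper bound (assumed values)
    "classifier__weights": ["uniform", "distance"],
}
knn_search = RandomizedSearchCV(pipe, param_dist, n_iter=10, scoring="f1",
                                cv=3, random_state=42, n_jobs=-1)
# knn_search.fit(X_train, y_train)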

Random Forest

Simple Random Forest

Even with the default parameters, the Random Forest Classifier has a good F1-score performance, the same as the first optimization of the KNN Classifier. We will now perform some hyper-parameter tuning and try to increase the model's performance.

Hyper-parameter Tuning of Random Forest

min_samples_leaf: The minimum number of samples required to be at a leaf node. A smaller leaf makes the model more prone to capturing noise in the training data. This parameter is similar to min_samples_split; however, it describes the minimum number of samples at the leaves, the base of the tree.

min_samples_split: Represents the minimum number of samples required to split an internal node. This can vary from considering at least one sample at each node to considering all of the samples at each node. When we increase this parameter, each tree in the forest becomes more constrained as it has to consider more samples at each node.

Keeping the several sampling methods - RandomOverSampler() and SMOTE() - in the search is computationally costly. To further explore the hyper-parameter configuration, we fixed the sampling method to RandomOverSampler(), the method chosen for the optimized RandomForestClassifier above, and performed another RandomizedSearchCV with an increased number of iterations and a smaller number of cross-validation folds to speed up processing, focusing the tuning intervals on the results above, i.e. a reduced interval for n_estimators and raised lower and upper bounds for max_depth, min_samples_split and min_samples_leaf.
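A sketch of this narrowed Random Forest search with the sampler fixed to RandomOverSampler(); the parameter ranges below are assumptions for illustration:

from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from imblearn.over_sampling import RandomOverSampler

pipe.set_params(sampling=RandomOverSampler(random_state=42),
                classifier=RandomForestClassifier(random_state=42))
param_dist = {
    "classifier__n_estimators": randint(100, 300),        # reduced interval (assumed)
    "classifier__max_depth": randint(10, 30),             # raised bounds (assumed)
    "classifier__min_samples_split": randint(5, 20),
    "classifier__min_samples_leaf": randint(2, 10),
}
rf_search = RandomizedSearchCV(pipe, param_dist, n_iter=30, scoring="f1",
                               cv=3, random_state=42, n_jobs=-1)
# rf_search.fit(X_train, y_train)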

Random Forest - Further Hyper-parameter Research (adding other parameters)

To further explore the parameter tunning of the random forest, and given the previous exploration yielded no improvements with the f1-score, we added other parameters to again tune the model. These are:

Across all 3 hyper-parameter tuning iterations we did not manage to increase the best cross-validation F1-score, which sat at 0.40. This is, however, better than that achieved by the two KNN Classifier models.

CatBoost

Beyond the KNN Classifier and the Random Forest Classifier, we wanted to explore a boosting model.

CatBoost is designed for categorical data and is known to perform best on it, showing state-of-the-art performance over XGBoost and LightGBM on eight datasets in its official journal article.

depth: In most cases, the optimal depth ranges from 4 to 10. Values in the range from 6 to 10 are recommended
learning_rate: This setting is used for reducing the gradient step. It affects the overall time of training: the smaller the value, the more iterations are required for training. Choose the value based on the performance expectations.
eval_metric: The metric used for overfitting detection (if enabled) and best model selection (if enabled).
od_type: The type of the overfitting detector to use. Possible values: IncToDec, Iter
early_stopping_rounds: The number of iterations to continue the training after the iteration with the optimal metric value.
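A hedged sketch of how these CatBoost parameters could be searched inside the same pipeline; the value ranges below are assumptions:

from catboost import CatBoostClassifier
from sklearn.model_selection import RandomizedSearchCV

pipe.set_params(classifier=CatBoostClassifier(eval_metric="F1", od_type="Iter",
                                              early_stopping_rounds=50, verbose=0))
param_dist = {
    "classifier__depth": [4, 6, 8, 10],
    "classifier__learning_rate": [0.01, 0.05, 0.1, 0.3],
}
cat_search = RandomizedSearchCV(pipe, param_dist, n_iter=8, scoring="f1",
                                cv=3, random_state=42, n_jobs=-1)
# cat_search.fit(X_train, y_train)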

CatBoost - Further Hyper-parameter research

Narrowing down on the Learning Rate.

Stacking Classifier

Aggregating the best single models after hyper-parameter tuning, we combined them in a Voting Classifier.

Stacking with the Voting Classifier set to soft voting did not perform better than the tuned RandomForest configuration we found, although the differences are small and it has a slightly lower standard deviation for the F1-score.
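A sketch of the soft-voting ensemble over the tuned base models; best_knn, best_rf and best_cat are assumed names for the tuned classifiers (not the full pipelines):

from sklearn.ensemble import VotingClassifier

voting = VotingClassifier(
    estimators=[("knn", best_knn), ("rf", best_rf), ("cat", best_cat)],  # tuned base models (assumed names)
    voting="soft",                      # average the predicted probabilities
)
pipe.set_params(classifier=voting)      # reuse the sampling / encoding / scaling steps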

Selecting the Best Model

Using GridSearchCV we choose the best model after hyper-parameter tuning. The metric we use to decide which model is best is the F1-score.

The custom scorer above gathers the true labels and the predictions for each CV split, which we use to create an average CV classification report.
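A minimal sketch of such a scorer: a callable with the (estimator, X, y) signature that stores each split's labels and predictions while still returning the F1-score; collected is a hypothetical accumulator and the GridSearchCV call is illustrative:

import numpy as np
from sklearn.metrics import f1_score, classification_report

collected = []

def f1_and_collect(estimator, X, y):
    y_pred = estimator.predict(X)
    collected.append((np.asarray(y), y_pred))   # keep this split's truth and predictions
    return f1_score(y, y_pred)

# grid = GridSearchCV(pipe, {"classifier": [tuned_knn, tuned_rf, tuned_cat, voting]},
#                     scoring=f1_and_collect, cv=5)
# After fitting, an "average" CV report can be built from the collected splits:
# print(classification_report(np.concatenate([t for t, _ in collected]),
#                             np.concatenate([p for _, p in collected])))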

Feature Importance

When two features are correlated and one of them is permuted, the model will still have access to the information through the correlated feature. This results in a lower importance value for both features, even though they might actually be important.

As such, a negative score is returned when a random permutation of a feature's values results in a better performance metric than the real values. It does not mean that the feature has a positive impact on the model; rather, it means that substituting the feature with noise is better than the original feature. Hence, the feature is worse than noise, which quite likely indicates that it interacts with other features.

Typically when there is a negative score, we should remove that variable and redo the model.

In order to correct the negative values, we used F1 scoring, which led to a plot with only positive values.
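A sketch of the permutation importance computation with F1 scoring, assuming best_model is the fitted best pipeline and X_test / y_test the held-out data:

import pandas as pd
from sklearn.inspection import permutation_importance

perm = permutation_importance(best_model, X_test, y_test,
                              scoring="f1", n_repeats=10, random_state=42)
importances = (pd.Series(perm.importances_mean, index=X_test.columns)
                 .sort_values(ascending=False))   # ready to plot as a bar chart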

C14 being one of the important features, we could not extract much interpretation from it.

Of the features that we can interpret, device_model has the most relevant importance. A possible intuition could be the difference in software between devices, since newer versions could lead to better ad performance, or some other pattern that this model is capturing.

site_id also explains the model well. This could indicate that some sites are built to attract users to click on the ad. It is a good variable to work on.

Finally, app_id stands out as another important feature.

One feature that could have had a higher importance is the hour of the day, since it could lead to different user responses; clearly, it does not explain the model that much.

Test Best Model

Our main goal was to sustain a high F1 score, since our dataset is imbalanced; it is also more relevant to the business problem to guarantee a good harmonic mean between recall and precision.

As we can see, precision is not high, but recall has a significant value, which leads to a good F1 score.

Confusion matrix

The predicted classes are represented in the columns of the matrix, whereas the actual classes are in the rows of the matrix. We then have four cases:

Let's interpret these values: out of the 3433 instances of class 1 in our test set, the model identified 2514 of them as 1, and 919 were predicted as 0 (false negatives, type II error). Out of the 16567 users of class 0, 9923 were correctly identified as 0, and 6644 were false positives (type I error).
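A sketch of how this matrix can be produced, using the 0.4 probability threshold mentioned in the conclusion; best_model is the tuned pipeline (assumed name):

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

proba = best_model.predict_proba(X_test)[:, 1]
y_pred = (proba >= 0.4).astype(int)                    # threshold chosen for this imbalanced problem

cm = confusion_matrix(y_test, y_pred, labels=[0, 1])   # rows: actual, columns: predicted
ConfusionMatrixDisplay(cm, display_labels=[0, 1]).plot()
# cm[1, 0] -> false negatives (actual clicks predicted as no-click, type II error)
# cm[0, 1] -> false positives (no-clicks predicted as click, type I error)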

Interpretability

SHAP (SHapley Additive exPlanations)

The Shapley value is the average contribution of a feature value to the prediction in different coalitions. Shapley values are a robust mechanism to assess the importance of each feature to reach an output, and hence form a better alternative to standard feature importance mechanisms.

These include the TreeExplainer which is optimized for tree-based models, DeepExplainer and GradientExplainer for neural networks, and KernelExplainer, which makes no assumptions about the underlying model to be explained.

Given that our best model is a Pipeline with an optimized Random Forest Classifier, we will use the KernelExplainer.

You can visualize feature attributions such as Shapley values as "forces". Each feature value is a force that either increases or decreases the prediction. The prediction starts from the baseline; the baseline for Shapley values is the average of all predictions. In the plot, each Shapley value is an arrow that pushes to increase (positive value) or decrease (negative value) the prediction. These forces balance each other out at the actual prediction of the data instance.
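A hedged sketch of the KernelExplainer workflow for the pipeline; the background and explained sample sizes are assumptions (KernelExplainer is slow), and the per-class indexing of shap_values / expected_value can differ between shap versions:

import shap

background = shap.sample(X_test, 100, random_state=42)        # small background sample
explainer = shap.KernelExplainer(best_model.predict_proba, background)
shap_values = explainer.shap_values(X_test.iloc[:50])          # explain a small subset

# Force plot for one observation, for the positive ("click") class:
shap.initjs()
shap.force_plot(explainer.expected_value[1], shap_values[1][0], X_test.iloc[0])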

The output prediction is 1, which means the model classifies this observation as a click. The base value is 0.5496. Feature values that push towards a no-click are shown in blue (device_model), and the length of the region shows how much the feature contributes to this effect. Feature values increasing the prediction are shown in pink, namely app_id and C14.

Below, we have an example of a classification of a no click where site_id, site_domain and device_model strongly push toward no click.

To get an overview of which features are most important for a model on a global level, we can plot the SHAP values of every feature for every sample.

SHAP Feature Importance Plot

SHAP feature importance measured as the mean absolute Shapley values. The C14 anonymous category was the most important feature, changing the predicted absolute click probability on average by 45 percentage points (0.45 on x-axis).

SHAP feature importance is an alternative to permutation feature importance. There is a big difference between the two importance measures: permutation feature importance is based on the decrease in model performance, while SHAP is based on the magnitude of feature attributions.

Molnar, Christoph. "Interpretable machine learning. A Guide for Making Black Box Models Explainable", 2019. https://christophm.github.io/interpretable-ml-book/.

This is a useful plot, however, it contains no information beyond the importances. For a more informative plot, we have to look at the summary plot.

SHAP Summary Plot

The summary plot combines feature importance with feature effects.

Looking at the SHAP summary plot, we know that each point is a Shapley value for a feature and an instance. The position on the y-axis is determined by the feature and on the x-axis by the Shapley value.

A positive SHAP value impact on the model output "pushes" the probability of the customer having clicked (click = 1) towards 1, and a negative SHAP value impact pushes it in the opposite direction.
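A sketch of the two global plots discussed in this section (reusing shap_values from the sketch above): the mean-|SHAP| bar chart used as the feature importance plot, and the beeswarm summary plot, both for the "click" class:

shap.summary_plot(shap_values[1], X_test.iloc[:50], plot_type="bar")   # feature importance plot
shap.summary_plot(shap_values[1], X_test.iloc[:50])                    # summary (beeswarm) plot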

We can see that C14, device_model, site_id, app_id, site_domain, C17 and C20 are the features with the highest impact on the model.

Remember that these are label-encoded features; therefore, to unpack this information properly we would have to look at the values that correspond to the high, middle and low ranges after encoding and standardisation.

We can see that a lower value of C14 pushes the model towards no-click more often than higher feature values do. However, higher feature values seem to be very spread out, without a clear pattern emerging. We can only add that impacts over 0.2 are almost exclusively due to high feature values of C14.

For device_model it is very hard to distinguish how the feature value affects the model: it has a high impact but it is very hard to extract any information from it. This is one of the problems of a black-box model; even though a feature is clearly important, without one-hot encoding it is not possible to understand the role of each of its values in the model. However, one-hot encoding would not be reasonable given the very high cardinality of this feature.

For site_id, which uniquely identifies a given website, the middle feature values affect the SHAP values positively, between 0.0 and 0.2, thus pushing the click probability slightly, while high or low feature values tend to be at the extremes, either affecting the SHAP value negatively or affecting the model output very positively (look at the positive outliers).

app_id, which represents the unique ID of a given app, has a very high concentration of high feature values with a small, negative impact on the model, which means these apps are probably not clicked. There are some clusters with both positive and negative impacts on the output spread along the impact axis, but it is not possible to distinguish feature values from effects; they are very mixed.

site_domain can either be a unique identifier like site_id or a high-level domain name such as .com, .pt and others. However, it is probably different from site_id, as they have a different number of unique values, so we cannot know for sure whether we should discard one of these features. Seeing that mid to high feature values of site_domain gather between (-0... and 0.2), there is a tendency for site domains located at the upper range of the site_domain encoding to "push" the probability towards click.

C17 is another anonymous feature. Here the mid-to-high feature values mostly "push" towards no-click, while the highest feature values have a tendency to push the probability towards click with an impact between 0.0 and 0.2.

Other notable mentions are C18_01, which, when present, almost always has a positive impact towards click, and high values of banner position, which seem to push slightly towards click.

PDP

The partial dependence plot (PDP or PD plot for short) shows the marginal effect that one or two features have on the predicted outcome of a machine learning model (J. H. Friedman, 2001).

We chose the two best features that were interpretable (excluding the anonymized features).

These are device_model and site_id, which we plot against each other in a partial dependence plot. This plot helps identify users who are more likely to click on an ad (lighter regions) rather than not click (darker regions), based on the interaction between the two features.
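A sketch of this two-feature PDP using scikit-learn's PartialDependenceDisplay on the fitted pipeline; the feature names are assumed to match the dataframe columns:

from sklearn.inspection import PartialDependenceDisplay

PartialDependenceDisplay.from_estimator(
    best_model, X_test,
    features=[("device_model", "site_id")],   # pairwise interaction plot
    kind="average",                           # required for two-way partial dependence
)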

We can see that for the encoded and standardized site_id value of 0.558, values of device_model around 0.5 (0.51, 0.5, 0.491 and 0.494) have a higher probability of being predicted as click, i.e. 1.

Conclusion

The main goal was to answer the business problem question: whether or not a user will click on an ad.

We first analysed the features we had and immediately identified several that were not relevant for explaining our target. We used methods like the Click-Through Rate and visualization of the instances to drop the weak features. Hence, we dropped id, device_id, device_ip, hour and day.

After that, we transformed the data using one-hot, label and target encoding, in order to better apply the models and better interpret the situation.

For the modeling process, we used the F1 score as the evaluation metric, since it handles the class imbalance better. We considered AUC, but it is not well suited to this type of dataset. We also used a Pipeline to rebalance the dataset and four models to predict the target: KNN, CatBoost, Random Forest and a Stacking Classifier built from the previous models.

We then arrived at the best model (Random Forest), which presented the best F1 score, and used permutation importance to check how it was working. There were still some irrelevant features, but the main objective was accomplished: there were clearly important features driving the predictions. When testing the best model, we considered that the common threshold of 0.5 was too high for this type of dataset, so we settled on 0.4. Recall has a significant value; precision, on the other hand, is lower, but it proved difficult to increase.

For interpretation, we used SHAP and Partial Dependence Plots. The first explains how the different variables drive deviations from a base value, which is the average prediction. We can also use the library to see a feature importance bar plot separated by class, which is very useful. The second explains the target based on the interaction between two chosen features.

We conclude our work by saying that we achieved an interesting result. Not only will it help identify a user who will click on an ad, it will also help identify which types of features make the user click on the ad. With that, companies are able to anticipate and make ads more interesting to the person who is going to see them.