Explainable AI: Application of Shapley Values in Marketing Analytics



Recently, I stumbled upon a white paper that discussed the latest AI applications in marketing analytics. It specifically talked about the application of XAI (Explainable AI) in marketing mix modelling [white paper]. This caught my attention, and I started exploring three things: XAI, the current state of marketing analytics, and XAI's potential applications in marketing analytics. After going through the available resources, I realized that XAI has huge potential to reinvent marketing analytics.

In this article, we will first talk about the specific challenges related to the current state of marketing analytics and their possible solutions. Then we will try to develop an intuition about XAI, and finally we will apply XAI to an openly available marketing dataset.

Challenges associated with the current state of Marketing Analytics and their possible solutions:

There are many challenges, but the three most significant ones associated with the current state of marketing analytics relate to the accuracy of the models used (GLMs: Generalized Linear Models), channel attribution, and the inherent non-linearity of market response. In view of these challenges, we will talk about how XAI can address the issues related to GLMs, channel attribution, and other issues linked to non-linearity. As an introduction, we will discuss the challenges and their solutions very briefly in this section; we will then cover everything in detail as we move forward in the article.

1. GLMs

Existing Challenge: Generalized Linear Models (GLMs) are used extensively for marketing mix modelling throughout the industry. Their additive nature helps us easily identify the contributions of marketing channels to sales revenue. Modern-day algorithms are much more accurate than GLMs, but they behave like black-box models and lack explainability. Thus, the main reason for the industry-wide use of GLMs is their ease of explainability.

Possible Solution: XAI provides an excellent way to interpret any black-box model, and thus opens the door to using highly accurate ensemble models in place of GLMs. With the help of such models, we can outperform GLMs not only in accuracy but also in explainability.


2. Channel Attribution

Existing Challenge: This is one of the biggest pain points for marketers. Since there are interactions between channels, it becomes almost impossible to fairly distribute or assign the payoffs to the different channels.

Possible Solution: Shapley values from cooperative game theory come to the rescue here. The Shapley value is a way to fairly distribute the total incremental gains among the collaborating players in a game. In our case, the marketing channels are the players cooperating with each other to increase a metric such as revenue, and our goal is to fairly distribute the incremental revenue across the marketing channels. Google Analytics also uses Shapley values in its Data-Driven Attribution methodology.

3. Interactions of different Marketing Channels

Existing Challenge: There are channels that, on their own, are not significant contributors but that in combination with other channels can play a significant role. Therefore, it is important for a marketer to know about the different combinations of interacting channels. The number of interactions grows quickly as the number of channels increases, and it becomes very cumbersome to include all such interaction terms in GLMs.

Possible Solution: To address this challenge we will again use Shapley values, but in a different way. We will use the SHAP algorithm to study the interactions of channels at scale. [SHAP is the implementation of Shapley values in machine learning, used to explain and interpret any black-box ML model. Since we will be replacing GLMs with highly accurate tree-based ensemble models in our example, we will use SHAP to interpret and explain channel interactions in our model.]

4. Positive Impact Threshold

Existing Challenge: In the case of TV ads, for instance, a low spend may not generate any revenue; however, after a certain minimum spend, TV ads start to show a positive impact. On the other hand, the market response function for a digital channel is generally steep, which means that the more you increase the spend on digital channels, the more impact you see on your revenue, up to a saturation point. Due to the inherently dynamic, non-linear nature of the market, it becomes very challenging to obtain the positive impact threshold value using a linear model.

Possible Solution: Instead of using a linear model, we can use a non-linear, tree-based ensemble model along with an explainer model. From the explainer model we will get the Shapley values (the contribution of each channel individually), and we can then plot these Shapley values to find the positive impact threshold for each channel. The chart below shows an example of the positive impact threshold for TV.

Graph showing the Positive Impact Threshold for TV ads: the Y-axis is the incremental revenue (Shapley value) attributable to TV ads, and the X-axis shows TV ads spend. Source: Image by Author

5. Optimal spend on each channel

Existing Challenge: Optimal spend depends upon the market response function of each channel, and these response functions are generally non-linear in nature. For instance, if we keep increasing the spend on a particular channel, then after a certain point the incremental positive impact starts to diminish. The point at which the incremental impact saturates, and eventually decreases, is the point of optimal spend on that channel.

Possible Solution: Here again, the solution is the same as in the previous point, based on Shapley values. The only difference is that here we have to find the point from which the graph becomes asymptotic, or in some cases starts to go down.

Given all these challenges and solutions, you may be curious to know the answers to five questions related to Shapley values:

  1. Why have we been using GLMs?
  2. What are Shapley values?
  3. How are Shapley values used for attribution?
  4. How are they implemented for complex machine learning models?
  5. How can they be interpreted to get actionable insights?

Moving forward, this article will try to answer the above questions.

Table of Contents

  1. Why GLMs in Marketing Analytics?
  2. Shapley Values
  3. Shapley Values: Intuition with an example
  4. Machine Learning Interpretability using Shapley Values
  5. SHAP: A Use Case in Marketing Analytics
  6. End Note
  7. Further Readings & References

Why GLMs in Marketing Analytics?

To understand this, let us divide machine learning models into two categories: simple models and complex models. Simple models are easily interpretable / explainable but low in accuracy; complex models are black-box in nature, almost impossible to explain directly, and very high in accuracy.

Examples of simple models are linear regression models and decision trees. Linear models are easy to interpret: the magnitude and sign of the feature weights in the model equation give us an idea of each feature's impact on the output. In the case of decision trees, we can move from the root node down towards the leaf nodes through the splits (based on Gini impurity, information gain, or entropy) and parse the tree to understand / interpret the model easily. Though these models are highly interpretable, they have several issues: linear models are not accurate because they cannot capture non-linear behavior of the features, whereas decision trees are also inaccurate and prone to overfitting.

As model complexity and accuracy increase, the interpretability of the model decreases. Source: Image by Author

Random forests have high accuracy and much lower variance / overfitting compared to decision trees, but they consist of a very large number of trees, which cannot all be interpreted individually. Therefore, interpretability is a big problem in such models.

If I have accurate models then why would I use the less accurate ones?

Firstly, we should understand that the ability to correctly interpret a model's output is very important. It helps in building stakeholders' trust. We cannot simply tell our stakeholders that we should deploy a model because it is very accurate, has good AUC / ROC numbers, or has a very low mean squared error. Stakeholders are always interested in knowing how a model makes its decisions.

Secondly, the interpretability of each feature in the model plays a very important role in marketing analytics. We should know the direction and the magnitude in which all the significant features impact the outcome of a model. This is easily achieved with GLMs; in the case of complex models we do get relative feature importance values, but those are not very helpful because they tell us nothing about the magnitude and direction of each feature's impact on the model output.

Therefore, in marketing analytics, linear models are preferred for their ease of interpretation, even though they are less accurate than complex models.

To address the interpretability of complex models, there are techniques (LIME, Shapley values, etc.) through which we create an explanation model: a simple linear model that explains the complex model. The method we will discuss in this article uses Shapley values from cooperative game theory to create the explanation model.

The interesting thing to note here is that when these complex models are combined with explanation models, they outperform GLMs not only in accuracy (obviously) but also in interpretability / explainability.

Next, we will try to develop an intuition about Shapley values through their definition and an example. Then we will move on to their machine learning implementation using a very simple dataset.

Shapley Values

The Shapley Value was developed by the economics Nobel Laureate Lloyd S. Shapley as an approach to fairly distribute the output of a team among the constituent team members.

We will discuss the definition of the Shapley value, but for a better understanding of this concept, I would suggest you jump to the example section (Shapley Values: Intuition with an example) first and then come back to the Definition section.

Definition

Formally, a coalition game is defined as G(N, ν), where N is the set of players in the game and ν is a function (the characteristic function) that assigns a real number to each subset of players: applied to a subset S, it gives ν(S). We also call ν(S) the worth of the coalition S.

If there are 2 players (A, B) then there are 4 subsets: {∅, {A}, {B}, {A, B}}. Thus, for the subset {A, B} we have ν(AB) as the worth of the coalition of A and B.

ν(S) describes the total expected sum of payoffs the members of S can obtain by cooperation.

The Shapley value is one way to distribute the total gains to the players, assuming that they all collaborate. According to the Shapley value, the amount that player i gets, given a coalitional game G(N, ν), is
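
$$\phi_i(\nu) \;=\; \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,(n-|S|-1)!}{n!}\,\bigl(\nu(S \cup \{i\}) - \nu(S)\bigr)$$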

Here, n is the total number of players in N, and N\{i} represents the set of players excluding player i. S therefore ranges over all the subsets that can be formed from the players in N\{i}. For each S, |S| represents the number of players in that subset.

For each subset in S, ‘|S|!’ represents the number of permutations, which could be created from the players in S. ‘(n-|S|-1)’ represents a number of remaining players excluding the player i. ‘(n-|S|-1)!’ represents the total number of permutations which could be formed from the remaining players excluding the player i.

For each subset S, the product |S|! × (n−|S|−1)! is the total number of orderings of the players in which exactly the members of S come before player i. This product is used because, for all of these |S|! × (n−|S|−1)! orderings, the marginal contribution of player i remains the same.

NOTE: The formula works with permutations because the order of players matters here. The events are not independent; therefore the Shapley value calculation considers all possible permutations and finally takes the average to calculate the contribution of each player.

Here, n! is the total number of permutations that can be formed from all the players in the game, and dividing by it averages the marginal contribution over all the different orders in which the coalitions can be formed.

This formula may seem a bit daunting at first but it becomes very easy to understand once we go through it with the help of an example. In the following section we will try to develop an intuition with an example calculation.

NOTE: There are certain minimum properties which should be fulfilled by the characteristic function:

  1. Monotonicity: If the number of players in the coalition increases, the benefits should not decrease.
  2. Superadditivity: ν(S ∪ T) ≥ ν(S) + ν(T) if S ∩ T = ∅; this implies that the grand coalition has the highest payoff.

Shapley Values: Intuition with an example

To simplify things further and get a better understanding of Shapley values, we will take a very simplistic example of an imaginary new retail store, which markets itself through TV ads, print ads, and radio ads only. The combination of these marketing mediums helps the retail store get customers and generate revenue.

The owner of the retail store has observed that the combination of these marketing mediums is fetching higher revenue for the store, but the problem lies in attributing the share of revenue coming through each marketing medium. This information is important because with it the owner can optimize the marketing budget.

This problem (or game) can be solved with the help of Shapley values. To move forward, let us have a look at the contents of Table A and draw a parallel between the definition of the Shapley value above and the actual data of the game.

The first column in Table A shows the coalitions (S, as represented above in the definition) that can be formed using the players of the game (TV, Radio, Print), including the null set (in which no player is involved). The second column shows the symbols corresponding to each coalition in the first column. The third column (ν(S), or the 'worth of the coalition' as represented above in the definition) shows the revenue corresponding to each subset / coalition given in the first column.

In other words, when only TV ads are shown, the store gets $3,000 in revenue; when only radio ads are shown, the revenue is $1,000; and similarly, when only print ads are shown, the revenue is $1,000. However, when either radio or print is combined with TV, the revenue is double the sum of their individual contributions, and when all three run together, the revenue is $10,000.

Table A: Showing the coalition (S) and the worth of that coalition v(S). Source: Image by Author

We will use the information given in Table A to derive the Shapley value (average marginal contribution) for each of TV, Radio, and Print.

Table B below shows the calculation of the Shapley values for TV, Radio, and Print.

Table B: Showing the calculation of the average marginal contribution, or Shapley value, of each marketing medium. Source: Image by Author

Let's have a look at the calculation of the Shapley value for TV as given in Table B. The first column shows the coalitions S excluding TV; there are four such coalitions, including the null set {}. The second column shows the revenue without TV, which is ν(S), the worth of the coalition without TV for each subset.

The third column shows the coalitions including TV and that is represented in the table as ‘S ∪ {i}’. The fourth column shows the revenue or the worth of the coalition including TV. The fifth column shows the change in revenue after adding TV or the marginal contribution of TV and that is represented as ‘v(S ∪ {i})-v(S)’. The sixth column shows the total number of players (n) overall.

The seventh column shows the number of players in each subset S. The eighth column is the number of permutations for which this marginal contribution occurs, and dividing it by n! gives the weight used to average that contribution. The last column sums the weighted marginal contributions to give the Shapley value for TV ($6,000).

Similarly, the average marginal contributions, or Shapley values, of Print ($2,000) and Radio ($2,000) have been calculated. I hope Table A and Table B, along with their calculations, have helped in developing a basic understanding and intuition about Shapley values.
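
To make the calculation concrete, here is a minimal Python sketch that brute-forces the same Shapley values from the coalition worths in Table A, averaging each channel's marginal contribution over all join orders. The worth of the Radio-plus-Print coalition is not stated explicitly above, so it is assumed to be $2,000 (the sum of their individual revenues), which is consistent with the values reported in Table B.

import math
from itertools import permutations
# Coalition worths v(S) from Table A. The worth of {Radio, Print} is an assumption
# (= $2,000), consistent with the Shapley values of $6,000 / $2,000 / $2,000 in Table B.
worth = {
    frozenset(): 0,
    frozenset({'TV'}): 3000,
    frozenset({'Radio'}): 1000,
    frozenset({'Print'}): 1000,
    frozenset({'TV', 'Radio'}): 8000,
    frozenset({'TV', 'Print'}): 8000,
    frozenset({'Radio', 'Print'}): 2000,  # assumed
    frozenset({'TV', 'Radio', 'Print'}): 10000,
}
players = ['TV', 'Radio', 'Print']
shapley = {p: 0.0 for p in players}
# Average each player's marginal contribution over all 3! = 6 join orders.
for order in permutations(players):
    coalition = frozenset()
    for p in order:
        shapley[p] += worth[coalition | {p}] - worth[coalition]
        coalition = coalition | {p}
shapley = {p: v / math.factorial(len(players)) for p, v in shapley.items()}
print(shapley)  # {'TV': 6000.0, 'Radio': 2000.0, 'Print': 2000.0}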

A basic understanding of and intuition about Shapley values are important because the next section of this article is about machine learning interpretability using Shapley values. We will discuss and learn how these Shapley values are used in machine learning, specifically in media mix modelling.

NOTE 1: Defining the characteristic function, or 'worth of coalition', is a critical step and should be done with utmost caution. In real-world scenarios, its definition is heavily dependent on the methodology of data collection. If you are interested in reading about how a characteristic function is defined for attribution models in online advertising campaigns, please refer to this paper.

NOTE 2: Shapley values guarantee a unique solution. The Shapley value is the only attribution method that satisfies the properties of Efficiency, Symmetry, Dummy, and Additivity, which together can be considered a definition of a fair payout.

Efficiency: The sum of the Shapley values of all players equals the value of the grand coalition, so that all the gain is distributed among the players.

Symmetry: The contributions of two players should be the same if they contribute equally to all possible coalitions.

Dummy: If player i adds no value to any coalition, then player i receives a Shapley value of zero, or in other words receives zero reward.

Additivity: Let 'a' be the Shapley value from game 'x', and 'b' be the Shapley value from game 'y'. Then the Shapley value for the combined game 'x+y' is 'a+b'.

For more details on these axioms, refer to this section of a book.

Machine Learning Interpretability using Shapley Values

Now that we have a basic understanding of Shapley values, we can discuss how they are used in machine learning interpretation, and then talk about their utility in marketing analytics.

Let's have a look at the parallel between the Shapley value calculation in cooperative / coalition game theory and its use in machine learning.

In the case of machine learning models, the feature values of each instance used as data for the model act as the team players, and the prediction of the machine learning model acts as the output or payoff.

To understand it more intuitively, let us assume that we have trained a machine learning model to predict revenue based on the spend on different marketing channels. In the dataset we have an instance of data where TV has a spend of $2,000, Radio has a spend of $500, and Print has a spend of $800. The predicted revenue for this instance is $100k, whereas the average predicted revenue across all instances is $80k (the base value). Given this situation, we want to understand and explain how TV, Radio, and Print have contributed to the difference between the base value ($80k) and the predicted value ($100k). Thus, in this example, the instance values (spend) on TV, Radio, and Print act as the players, and the predicted revenue for the instance acts as the output or payoff. For this instance, the attribution could look like this: TV is attributed $10k of incremental revenue, Radio $6k, and Print $4k. These attributed values help us understand how the spend on different marketing channels impacted our incremental revenue.
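
Written out, the attributions in this illustrative instance simply add up to bridge the gap between the base value and the prediction:

$$\$80\text{k}\ (\text{base value}) + \$10\text{k}\ (\phi_{TV}) + \$6\text{k}\ (\phi_{Radio}) + \$4\text{k}\ (\phi_{Print}) = \$100\text{k}\ (\text{prediction})$$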

In the above example, two things happened in parallel: first, we used some machine learning algorithm for prediction, and second, we used Shapley values to understand the feature behavior for a row / instance of data. Therefore, for the most accurate predictions we can use any complex black-box model, and at the same time we can use the simple, explainable / interpretable Shapley values to explain the feature behavior. This combination solves the major problem of explainability / interpretability of complex models.

Now we have a high-level understanding of how Shapley values fit into complex model interpretability, so it is the right time to delve a bit deeper and understand the algorithmic implementation of Shapley values.

SHapley Additive exPlanations (SHAP)

SHAP (SHapley Additive exPlanations) is a game-theoretic approach to explain the output of any machine learning model. It connects optimal credit allocation with local explanations using the classic Shapley values from game theory and their related extensions. 

SHAP is an explainer model and it has optimized functions for interpreting any black-box model with the help of Shapley values. To understand SHAP we first need to know the concept of the explainer model, and then we will see how this explainer model fits with any black-box model for its interpretability.

Explainer Model

In the case of complex black-box models (such as ensemble models or deep learning models), we use a different model to explain the complex model. Such a model is called the explainer model, and it is an interpretable approximation of the original model. The explainer model is a simple linear additive attribution model, which works on binary variables.

$$g(z') = \phi_0 + \sum_{i=1}^{M} \phi_i\, z'_i$$

Here, g(z′) is the explanation function. φ₀ is the base value; it represents the model output with all the inputs toggled off. φᵢ is the Shapley value of feature i. z′ᵢ is either zero (when a feature is not present) or one (when a feature is present). M is the number of simplified input features.

We should also note that the value of g(z′) matches the output of the original model f(x); this property is called local accuracy. Local accuracy is one of the three properties required for a method to have a single unique solution; the other two are Missingness and Consistency. These three properties guarantee the uniqueness of Shapley values from game theory as they apply to local explanations of predictions from machine learning models. To know more about these properties, read this paper.

For simplification, let's assume that we have a dataset with three features, TV, Radio, and Print, as predictors, and Revenue as the target variable. We have 100 rows / instances of data, and for each row / instance we will calculate g(z′).

For any instance ‘j’ of data for the feature variables (TV, Radio, Print):
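
$$g(z') = \phi_0 + \phi_{TV}\, z'_{TV} + \phi_{Radio}\, z'_{Radio} + \phi_{Print}\, z'_{Print}$$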

g(z′) is equal to the model's prediction for that instance of feature values. The value of z′ is one for TV, Radio, and Print because they all have instance values. Had there been an instance where TV had a spend of $x_tv and Radio a spend of $x_radio but Print had no spend, then our g(z′) would be as below.
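
$$g(z') = \phi_0 + \phi_{TV}\cdot 1 + \phi_{Radio}\cdot 1 + \phi_{Print}\cdot 0 = \phi_0 + \phi_{TV} + \phi_{Radio}$$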

Since the explainer model works at the instance (row) level of the feature data, it is also called a local explanation of the predictions. This explanation model directly measures local feature importance in the form of Shapley values (also called SHAP values for local feature attribution). We get the global model structure, or global explanation, when we combine the local explanations for each prediction.

We will talk about both local and global explanations, and this will become even clearer when we discuss the usage of the SHAP library with an example in Python.

How does the Explainer Model work with a Black Box Model?

Flowchart showing that the explainer model takes the data and the prediction values as input and creates an explanation based on Shapley values. Source: PyCon Sweden, Interpreting ML Models Using SHAP by Ravi Singh

This flowchart gives us a very good idea of how the explainer model works in conjunction with a black-box model.

With the data we train the black-box model, and then that trained model is passed on to the explainer model along with the data.

The black-box model is then used as a prediction function, and along with the feature values this function is used to calculate the feature importance (SHAP values) for each feature. SHAP values are calculated using the formula below, which is very similar to the Shapley value formula.

$$\phi_i = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|!\,(|F|-|S|-1)!}{|F|!}\,\bigl(f_{S \cup \{i\}}(x_{S \cup \{i\}}) - f_S(x_S)\bigr)$$

The formula for the calculation of the SHAP value for feature i, for an instance x. Source: SHAP Values for ML Explainability – Adi Watzman

We are now calculating φᵢ, the contribution of feature i to the prediction, and it is calculated in three steps, as follows.

  1. First, we create all the subsets of features without feature i (denoted S), and for each subset we calculate the prediction value with feature i (f(S ∪ {i})) and without it (f(S)).
  2. Second, we subtract the two predictions (f(S ∪ {i}) − f(S)) to get the marginal contribution of that feature for each subset.
  3. Finally, we take the (weighted) average of all the marginal contributions calculated in the previous step to get the contribution of feature i to the prediction.

The above calculation of SHAP values is very similar to the Shapley value calculation for cooperative games. The only difference is that the characteristic function is replaced by the black-box model, and we are calculating how much feature i contributes to a prediction rather than to a game payoff.
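
As a concrete illustration of these three steps, here is a minimal brute-force sketch. It is not the optimized algorithm that SHAP's TreeExplainer uses internally; it simply treats the model's prediction as the payoff and approximates "removing" a feature by filling it in from a small background sample and averaging the predictions. The names brute_force_shap, x, background, and feature_names are illustrative, and model is assumed to be any fitted regressor with a predict method.

import itertools
import math
import numpy as np

def brute_force_shap(model, x, background, feature_names):
    """Exact Shapley values for a single instance x (1-D NumPy array).
    The payoff of a coalition S is the mean model prediction when the features
    in S are fixed to the values in x and the rest come from the background rows.
    Feasible only for a handful of features (exponential in their number)."""
    n = len(feature_names)

    def payoff(coalition):
        data = background.copy()
        if coalition:
            cols = list(coalition)
            data[:, cols] = x[cols]
        return model.predict(data).mean()

    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(len(others) + 1):
            for S in itertools.combinations(others, size):
                weight = (math.factorial(len(S)) * math.factorial(n - len(S) - 1)
                          / math.factorial(n))
                phi[i] += weight * (payoff(S + (i,)) - payoff(S))
    return dict(zip(feature_names, phi))

By the efficiency property, the returned values sum to the difference between the model's prediction for x and its mean prediction over the background sample.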

Once we have the SHAP values for every feature of every instance, these SHAP values can be used to explain the model locally as well as globally.

To get detailed information about SHAP values, please go through the paper by the authors of SHAP, Scott M. Lundberg and Su-In Lee.

In the next section, we will see how we can interpret a model locally and globally to get actionable insights.

SHAP Values: A Use Case in Marketing Analytics

Now that we have an understanding of SHAP values, let's have a look at how we can apply them to a marketing dataset.

Our dataset consists of features such as TV ad spend, radio ad spend, and newspaper ad spend per week, and our target variable is the sales revenue generated per week.

We can treat this data as cross-sectional data created from the time series of a marketing mix. Though it is far from a real-world marketing mix dataset, it will still give us a very good idea of how SHAP values can help us get actionable insights when used for marketing mix modelling.

Import the libraries and read the data set.

# import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import RandomForestRegressor
import shap
shap.initjs()
# read the data
df = pd.read_csv('Advertising.csv', dtype={'sales': np.float64})

This dataset is available on Kaggle. The only change I have made to it is that I have multiplied the sales column by 100, so that it appears a bit closer to generated sales revenue.

I don't have many details about this dataset; I am just using it to show the application of SHAP values to a marketing dataset.

Create train test split, fit Random Forest Regressor, predict and check the error.

# Create train test split
Y = df['sales']
X = df[['TV', 'radio', 'newspaper']]
# Split the data into train and test sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)
# Fit a Random Forest Regressor model
rf = RandomForestRegressor(max_depth=5, random_state=0, n_estimators=20)
rf.fit(X_train, Y_train)
# Predict
Y_predict = rf.predict(X_test)
# RMSE
print('RMSE: ', mean_squared_error(Y_test, Y_predict)**0.5)

Create the explainer model and then create data frames from the SHAP values.

There are several ways to create an explainer model using the shap library in Python. We have shap.KernelExplainer, which is model-agnostic and can be used to explain any machine learning model. We also have shap.TreeExplainer, which is optimized for tree-based machine learning models. In our case we are using a random forest, which is a tree-based model, so we will use TreeExplainer.
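
For reference, the model-agnostic route would look roughly like the sketch below. It is only an illustration (we do not use it in the rest of the article), and it summarizes the background data with k-means to keep KernelExplainer's runtime manageable.

# Model-agnostic alternative (illustrative sketch, not used below):
# KernelExplainer only needs a prediction function, so it can explain any model,
# at a higher computational cost than TreeExplainer.
background = shap.kmeans(X_train, 10)                               # summarized background data
kernel_explainer = shap.KernelExplainer(rf.predict, background)
kernel_shap_values = kernel_explainer.shap_values(X_test.iloc[:5])  # explain a few rows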

# Create the explainer model by passing the trained model to shap
explainer = shap.TreeExplainer(rf)
# Get SHAP values for the training data from the explainer model
shap_values_train = explainer.shap_values(X_train)
# Get SHAP values for the test data from the explainer model
shap_values_test = explainer.shap_values(X_test)
# Create dataframes of the SHAP values for the training set and the test set
df_shap_train = pd.DataFrame(shap_values_train, columns=['TV_Shap', 'radio_Shap', 'newspaper_Shap'])
df_shap_test = pd.DataFrame(shap_values_test, columns=['TV_Shap', 'radio_Shap', 'newspaper_Shap'])

Let’s have a close look at the SHAP value dataframe.

We can clearly see from the above data frames that, corresponding to each feature value in the train or test data, we have one SHAP value in the SHAP data frames (df_shap_train, df_shap_test). In other words, for every instance of data we have one SHAP value per feature.
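
A quick check of the shapes makes this explicit (the exact row counts depend on the 80/20 split; with the 200-row Advertising dataset they would be 160 and 40):

# One SHAP value per feature per instance: the SHAP value data frames have
# the same shape as the corresponding feature data frames.
print(X_train.shape, df_shap_train.shape)  # e.g. (160, 3) (160, 3)
print(X_test.shape, df_shap_test.shape)    # e.g. (40, 3) (40, 3)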

Check the Base Value

# Base Value: the mean of the model output over the training dataset.
# TreeExplainer exposes it as expected_value; depending on the shap version
# this is a scalar or a length-1 array, so we normalise it to a float.
base_value = float(np.ravel(explainer.expected_value)[0])
print('base value: ', base_value)
Source: Image by Author

We have already discussed the base value when we discussed the explainer model above; this is the φ₀ term.

NOTE: Base Value for TreeExplainer is the mean of model output over the training dataset. 

Base Value

The explainer's expected_value attribute gives us the base value: the mean of the model's predictions over the training data. In our case, the base value is 1429.31875.

Base Value + SHAP Values = Predicted Value

# Create a new column for the base value
df_shap_train['BaseValue'] = base_value
# Add the SHAP values and the base value
df_shap_train['(ShapValues + BaseValue)'] = df_shap_train.iloc[:, 0] + df_shap_train.iloc[:, 1] + df_shap_train.iloc[:, 2] + base_value
# Add the prediction values for the training set, to compare them with (base value + SHAP values)
df_shap_train['prediction'] = rf.predict(X_train)
df_shap_train.head()

Let’s compare the values of (Base Value + SHAP Values) and prediction value.

From this data frame we can easily see that, for each instance, the base value plus the SHAP values of the features equals the prediction value.

We discussed this relationship when introducing the explainer model, and for each row of the data frame we can see that it holds true.
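
We can also verify this relationship programmatically rather than by eye; the following sanity check (using the variables defined above) should hold up to floating-point error:

# Local accuracy: base value + sum of per-feature SHAP values equals the
# model's prediction for every training row (up to floating-point error).
print(np.allclose(shap_values_train.sum(axis=1) + base_value, rf.predict(X_train)))  # True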

Model interpretation

Summary plots

Summary plots are meant to interpret the model at the global level. They come in two types: the first shows all the SHAP values for all the features, and the second shows an aggregated value (the mean of the absolute SHAP values) for each feature. We will look at these plots and then discuss how to interpret them.

# SHAP summary plot for the training data
shap.summary_plot(shap_values_train, X_train)
# SHAP summary plot for the test data
shap.summary_plot(shap_values_test, X_test)
# Plot the average impact on the model output by each feature
shap.summary_plot(shap_values_train, X_train, plot_type='bar', color=['#093333', '#ffbf00', '#ff0000'])

Summary Plots for training data:

Summary Plot showing the Mean Absolute SHAP value for the features

The above plot shows the aggregated view of the SHAP values corresponding to each feature, i.e. each feature's importance. This plot is fairly self-explanatory and doesn't need a detailed explanation; however, the plots below are more interesting to interpret.

Summary Plot showing all the SHAP Values for each feature

Components of the summary plot showing all SHAP values: on the Y-axis we can see the features TV, radio, and newspaper. The X-axis shows the range of SHAP values. The individual points for each feature are the SHAP values for all the instances in the training dataset. The vertical colour-intensity scale on the right side shows whether the feature value is on the higher or the lower side. Also note that the features are arranged in descending order of their mean absolute SHAP value.

Interpretation:

Which feature has the highest impact? Since TV has the highest mean absolute SHAP value, it has the highest impact on the model output. In our case, the model output is the sales revenue, so TV has the highest impact on sales revenue. The same reasoning applies to radio and newspaper: since radio is ranked second, it has the second-highest impact on sales revenue.

How is the spend on marketing channels impacting the sales revenue? To answer this, have a look at the colours of the SHAP values assigned to each instance of the features. For TV, most of the points on the right-hand (positive) side of the plot are red. This means that for higher TV ad spend we get higher SHAP values, i.e. higher incremental sales revenue. The plot also shows that for lower TV ad spend (blue points) we have negative SHAP values; a negative SHAP value indicates that low TV ad spend pulls the predicted sales revenue below the average (base) value. Similarly, we can explain the impact of the other channels. Newspaper doesn't impact the sales revenue significantly, and this is evident from the summary plot.

Dependence plot (without interaction):

What is the market response for my TV ads and Radio ads spend? To answer this question we can draw the plot between individual features and their SHAP Values.

# Market response for TV: plot of TV ad spend vs SHAP values (training data)
shap.dependence_plot("TV", shap_values_train, X_train, interaction_index=None)
# Market response for TV: plot of TV ad spend vs SHAP values (test data)
shap.dependence_plot("TV", shap_values_test, X_test, interaction_index=None)
# Market response for radio: plot of radio ad spend vs SHAP values (training data)
shap.dependence_plot("radio", shap_values_train, X_train, interaction_index=None)
# Market response for radio: plot of radio ad spend vs SHAP values (test data)
shap.dependence_plot("radio", shap_values_test, X_test, interaction_index=None)

The plot between ad spend and SHAP value is a very good indicator of the market response to ad spend. From the plot above it is evident that as we increase our TV ad spend, the SHAP value increases, and after a particular TV ad spend it starts to stagnate.

Similarly, we can plot it for the radio ads and observe that as the radio ad spend increases, the SHAP value also increases. Since we don't see a stagnation point in this plot, we can keep increasing the spend on radio ads until we reach one.

What is the positive impact threshold for TV ads? To answer this question we will use the same dependence plot between TV ad spend and SHAP values and check where the SHAP value crosses zero on the Y-axis. The corresponding point on the X-axis is the threshold spend on TV ads from which TV ads start to have a positive impact on sales revenue.

Positive Impact Threshold for TV ads. Source: Image by Author

Similarly, we can find the threshold value for radio.

Positive Impact Threshold for Radio ads. Source: Image by Author
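
Reading the threshold off the plot can also be backed by a rough numeric check. The snippet below is a simple heuristic (an illustration, not a function of the shap library): it brackets the spend region where the sign of the TV SHAP values flips.

# Rough positive-impact threshold for TV: bracket the zero crossing of the
# TV SHAP values. Rows of df_shap_train are positionally aligned with X_train.
tv_spend = X_train['TV'].to_numpy()
tv_shap = df_shap_train['TV_Shap'].to_numpy()
print('Largest TV spend with a negative SHAP value :', tv_spend[tv_shap < 0].max())
print('Smallest TV spend with a positive SHAP value:', tv_spend[tv_shap > 0].min())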

Interaction summary plot:

This is a very important plot: it shows the interactions between the different channels in a single view.

# Interaction summary: get the interaction values
shap_interaction_values = shap.TreeExplainer(rf).shap_interaction_values(X_train)
# Plot the interaction summary
shap.summary_plot(shap_interaction_values, X_train)

Interaction Summary. Source: Image by Author

The above plot shows the interaction summary. We can see the interactions off the diagonal; I have highlighted the possible interactions with a red circle. The only significant interaction we can see here is between TV ads and radio ads.

Since we now know that the interaction between TV ads and radio ads is significant, the next question is to understand the nature of that interaction.

Dependence plot (with Interaction):

What is the nature of the interaction between TV and radio? To answer this question we will again draw a dependence plot, but this time with the interaction shown. It is one of the most interesting plots and also a bit difficult to interpret, so let's plot it and then go through it one step at a time.

# Dependence plot between TV and radio
shap.dependence_plot("TV", shap_values_train, X_train, interaction_index='auto')
# Dependence plot between radio and TV
shap.dependence_plot("radio", shap_values_train, X_train, interaction_index='auto')

Each dot represents the SHAP value for a particular instance of TV ad spend. The Y-axis shows the SHAP values and the X-axis the TV ad spend; the colour of each dot is based on the radio ad spend for that instance.

The nature of interaction between TV and Radio gets reversed for TV ads spend below 150 and above 150. Source: Image by Author

To understand this plot, let's look at the TV ad spend window between 150 and 300. In that window the SHAP values range from low to high (0 to 600). Within that window, take a TV ad spend of around $220 (shown as a rectangular box with a black border): for that particular spend we have different SHAP values, but the higher SHAP values are associated with higher radio spend. This means there is a positive interaction between radio ad spend and TV ad spend in that window (TV ad spend 150–300).

On the other hand, there is a negative interaction in the TV ad spend window between 0 and 150.

So far we have done global interpretation of the SHAP values; now we will look at local interpretation, which can be done using a force plot.

Force plot for local interpretation:

Local interpretation means interpretation at the instance level. In our case, an instance is a week, so if we want to interpret the impact of the marketing channels in a particular week, we should use this plot.

# Local interpretation using a force plot
i = 0    # index into the array of SHAP values
j = 134  # index label of the corresponding instance in the training data
shap.force_plot(explainer.expected_value, shap_values_train[i], features=X_train.loc[j], feature_names=X_train.columns)
Force plot for an instance of data

In the plot above we can see blue and red bars showing the SHAP values for TV, radio, and newspaper. The length of each bar represents the magnitude of the SHAP value, and its colour shows the direction: positive SHAP values are red and negative ones are blue. The prediction value is shown in bold black. The base value is also marked on the line graph, and the instance value (the ad spend) is shown for each channel. We can compare these values to the values calculated in the previous section, shown below.

The TV SHAP value is negative, and its bar is the longest in the negative direction, showing a value of −563.6. Radio and newspaper are positive but very small in magnitude, 164.7 and 25.96 respectively. Since the sum of the SHAP values of TV, radio, and newspaper is negative, adding it to the base value gives a prediction that is lower than the base value in this case. In summary, the main reason for this week's lower performance is the low spend on TV ads.

So far we have seen how we can interpret any black-box model effectively with the help of SHAP. There are many other variables, such as price, region, seasonality, trend, ad-stock, and market saturation, that can be included in the marketing data with some feature engineering. Even with those additional variables and more, we can now use the most accurate models along with SHAP to interpret them holistically and get actionable insights at scale.

End Note

In this article we have focused mainly on attribution and media mix modelling; however, XAI also has huge applications in other areas of marketing and customer analytics, such as churn prediction, customer retention, and decision support. For example, in the case of churn prediction, XAI (local explanation, as discussed in this article) can help us explain why a particular customer churned or why a particular customer has a high probability of churning. From that explanation we can tailor our retention strategy to that specific customer. Furthermore, XAI also helps us debug our models by comparing business domain knowledge with model behaviour. To summarize, XAI is potentially a game changer in the analytics industry.

Finally, I hope that after going through this article, you have a fair idea of Shapley values in cooperative game theory and their high-level usage in marketing analytics.

Further Reading and References

  1. https://www.nature.com/articles/s42256-019-0138-9.epdf
  2. https://papers.nips.cc/paper/7062-a-unified-approach-to-interpreting-model-predictions.pdf
  3. Attribution models and the Cooperative Game Theory
  4. https://github.com/slundberg/shap
  5. https://www.h2o.ai/blog/the-benefits-of-budget-allocation-with-ai-driven-marketing-mix-models/
  6. Explainable AI for Science and Medicine
  7. https://edden-gerber.github.io/shapley-part-1/
  8. https://christophm.github.io/interpretable-ml-book/shapley.html
  9. https://shap.readthedocs.io/en/latest/examples.html#tree-explainer
  10. https://github.com/slundberg/shap/issues/352
  11. PyData Tel Aviv Meetup: SHAP Values for ML Explainability — Adi Watzman
  12. PyCon Sweden: Making sense of ML Black Box: Interpreting ML Models Using SHAP
  13. https://medium.com/@gabrieltseng/interpreting-complex-models-with-shap-values-1c187db6ec83
  14. https://towardsdatascience.com/explain-your-model-with-the-shap-values-bc36aac4de3d

This article was originally published on Towards Data Science and re-published to TOPBOTS with permission from the author.


Source: https://www.topbots.com/explainable-ai-marketing-analytics/
