Geo Level Models

November 26, 2021
Michael Taylor

*** Note: this is featured in a simulator-based course on "Do I have enough data?" ***

One common problem with marketing mix modeling is lack of data. How do you know if you have enough?

The rule of thumb is that you need 7-10 observations per variable you include in your model. That means even with 3 years worth of weekly data, you can only support a maximum of 22 variables. 

That’s barely enough to account for the average marketing mix, without even factoring in transformations like Diminishing Returns and Adstocks. If you’re a startup with less than 3 years of relevant data, this can limit your ability to even build an accurate model in the first place.

The best way I’ve found to overcome this limitation is to split your data into geographical regions. For example if you’re advertising in the US, suddenly you can have 50 times more data points, one for each state. 

Splitting by geography also helps you improve the accuracy of the model, by giving you more variance – regression algorithms rely on their being differences in the data, otherwise it can’t easily tease out the underlying patterns. 

For example most marketing channels are correlated (spend tends to increase and decrease at the same time). However if there happened to be differences at the regional level, the model would now be able to tell these effects apart. 

You can even proactively engineer these differences! You can set up a deprivation test by turning off spend in randomly selected regions. There’s also scale testing, which is similar, except you increase spend in different regions to understand where efficiency starts to suffer.

Now let’s look at how you would build a geo model, once you have the data. So you can follow along, I’ve included a template with some example data for you to use.

How can you build a geo-level marketing mix model?

First, let’s look at a standard marketing mix model, at the national level. We are using three variables to predict sales: the price of the pies, how much we spent on advertising and whether it was a holiday that week. The accuracy we get as measured by both the R-Squared (R2) and the Normalized Root Mean Square Error (NRMSE), is pretty great.

If you have no idea what you’re looking at here, it might make sense to check out the post Econometrics in GSheets, which goes into a bit more detail on the technique. Essentially the model is saying that these three variables predict 94% of the changes in sales (R2), and that on the average day we’re off by 8.8% (NRMSE). 

The coefficients are what explains the impact of each variable on sales, so for example Advertising is driving sales at $17 for every $1 spent, and every time we increase the price by $1, we lose $19,000 in revenue (sales is in $1,000s). However this variable is showing as statistically insignificant (above 0.05), so we wouldn’t be able to rely on these results.

One other thing to call out is the margin of error – these are pretty wide ranges, which should make us uncomfortable. For example, the model is only confident that the impact of advertising is something between -$42 and $77 return on ad spend (ROAS)… that means we can’t really be that confident in our advertising ROI, which is one of the main reasons we built this model!

What does geo-level data look like?

To bring this model down to the geo level, we need to change the way we do our prediction. Previously we were predicting national sales on any given week. Now we need to predict regional sales, for each region, for each week. So for example we’ll go from data that looks like this:

To data that is broken down by region, like this:

What if you don't have geo-level data available for some channels? Well the simplest way is a type of interpolation (the same technique you use for converting weekly or monthly into daily data). You find out the population of each geo region, the share of the national population they have, then multiply that percentage by the national figures. Of course this is open to bias, so wherever possible you want to try and obtain the actual geographical numbers.

In our geo-level model we’re now going to be predicting regional sales, so our coefficients and how we interpret them will change. This is what the new geo model looks like:

It’s now saying that Holiday is worth $66k in revenue, whereas previously it was $266k. Well we have 4 regions, so on any given holiday we’d get an additional $66k from each region, which would total  $264k in revenue – not far off what the national model estimated.

Note that the R2 actually dropped! If that was your only measure of accuracy, you’d be led to conclude that this model was worse. In fact, even the NRMSE, a standard measure of error, has doubled (which is bad). So far the extra effort of pulling geo-level data doesn’t look like it’s paying off…

However the magic of the geo model becomes apparent when you look at the margins of error. It’s now showing advertising as statistically significant, and has zeroed in on the range of $10 to $25 return on ad spend. We still have a similar result for the coefficient of advertising, of $17.88 vs $17.23, but this model is far more confident about the impact of our ad campaigns.

What was the true accuracy of the geo-level data?

To illustrate this model I actually generated this data, so we can tell how close these two models got to the truth. I left the data generator in the exercise file so you can take a look at how it works. Note that every time the spreadsheet recalculates it shows different numbers, because it contains several random number functions to generate the data.

So what was the truth? How close did we get to the underlying patterns in the data? In this case both the national and geo models were both pretty close! The geo model was much more accurate on Price, getting a coefficient of $4.91 vs $19.03 predicted in the national model. However the rest of the variables are pretty close to the Ground Truth, if you account for the effect of rolling up 4 regions by dividing the national coefficient by 4 (Holiday and Intercept).

What you should take away from this analysis is the major improvement in margin of error you can get from going more granular with the data, with a geo level model. The range tightened up considerably and we even proved advertising was a statistically significant variable. Of course in real life you don’t know the ground truth, and can never find out, so it’s margin of error and statistical significance you should pay attention to. There’s also a cautionary tale here: you can’t look only at R2, NRMSE or other measures of accuracy, because it’s not the whole story. Better accuracy is usually a good sign, but a model is only useful if it can explain the underlying behavior driving the numbers – only then should you be comfortable making decisions.

September 16, 2020

More to read