Econometrics in GSheets

October 20, 2020
Michael Taylor
Facebook Twitter WhatsApp LinkedIn

You may have heard the term Econometrics, but most don't know how it works. The World's largest advertisers like Procter & Gamble, Unilever and Coca-Cola use it to measure marketing ROI — for example calculating how many incremental sales a Superbowl ad drove.

These methods are performed on behalf of large brands by experienced specialists to make life or death decisions on where to allocate multi-million dollar marketing budgets. When done well, Econometrics can answer valuable questions:

  • Is Facebook really responsible for as many sales as they’re reporting?
  • How do I calculate ROI when I invest in less measurable channels like TV?
  • Did my agency actually drive this spike in sales, or was it some other factor?
  • How much more do I sell when I lower my prices, or run a promotion?
  • What would happen to revenue if I increased my marketing budget?

These are core strategic questions that every marketing leader must answer, yet many resort to guesswork when making these decisions. There’s a lack of awareness amongst marketers around what Econometrics is, what it can do, and how it works.

Given the value of these techniques, and the politics that come with making large-scale budget decisions, most Econometricians are hesitant to share their methods. Thankfully, the core techniques can be easily replicated in Google Sheets, which is what I’ve done here:

There’s no complex math, and you don’t need a degree in statistics to understand it. This example uses fake data for a fictional company selling pies, but it’s enough to showcase the techniques involved. You can use this template to build your own model, or just follow along to become more informed before working with a professional (get in touch if you need help). If you're familiar with Econometrics and want to learn how to do this in Python, see Econometrics in Python.

With the rise of digital advertising and technologies like micro-targeting, methods like Market Mix Modelling began to look old-fashioned vs tracking each user directly to purchase. However with 30% adoption of ad blockers, the roll out of consumer privacy legislation like GDPR and CCPA, and moves against 3rd-party cookies or unique identifiers, expect a shift back to privacy-friendly marketing attribution methods like Econometrics.

What can Econometric Modelling do?

Econometrics is the practice of applying statistical techniques to economic data to explain the underlying relationships in the data. The common application we’re concerning ourselves with is called Marketing Mix Modelling, which attempts to predict sales based on factors like how much you spent on advertising, what price you set for your products or if it was a sunny day.

These factors are input as variables in a multivariate linear regression analysis, which outputs coefficients you can plug into an equation. This is important not just because it’s useful to be able to predict sales, but because knowing how predictive each variable is, tells you the contribution of that factor to revenue. Meaning you can calculate how much revenue you made from marketing, without needing to track your user’s conversion journey’s directly.

What makes sales go up and down?

Taking our fictional pie company (data from Jalayer Academy), we can chart our sales over time and see the line moving up and down without explanation. If you go ahead and chart your sales in the same fashion, you’ll see movements up and down unique to your company.

The main question we’re trying to answer is ‘what makes sales go up and down’. If we know that, we can strip out external factors (‘do sales go up or down when it’s a holiday?’) and determine what credit internal factors deserve (‘how much revenue did my advertising generate?’).

Match spikes and dips to internal and external factors

If we take just one variable like the average price at which we sold our pies, we can see how much that factor explains the changes in sales. We intuitively expect that a lower price will sell more pies, but our analysis will tell us how much more for each dollar change in price.

If we do a simple scatter plot chart and add a trendline, we can see that Price on its own explains only about 19% of the variance in the sales data. There does visually seem to be a downwards trend where the higher the price the lower the sales, which makes sense.

Plotting out our predictions based solely on price, we can tell we’re not seeing the full picture here – our forecast is consistently off by a large enough amount to render our model useless in making predictions on sales. This is reassuring, because if price explained all of the variance, it would be the same as saying our marketing budget had no impact!

Introduce more variables to better fit the line

The aim of the game is to dig through our data to find the right mix of variables to introduce in our model, so as to explain away more of the movement in sales. As marketers we’re doing this analysis to find out what impact our marketing had on revenue, so let’s add that as a variable.

You don’t need a background in statistics to see what’s happening here. The two lines are now much closer together, meaning going from one variable to two seemed to improve our model. It’s not quite ready to bet your job on it, but now you can clearly see advertising made a contribution.

Find an equation that can predict sales

Our model with just price and advertising budget was doing better than price alone, but was still missing something because it incorporated only factors we had control over. In the real world, we aren’t in full control of the market, and we need to incorporate information about the outside world to have any hope of model accuracy.

Here we’ve incorporated a variable denoting if the week was a holiday or not. This is called a binary variable, and it’s used to flag special time periods where conditions were different. For example in your model you might want to add a variable for ‘in lockdown from a global pandemic’ as a factor to determine what your sales would have been in a world without Coronavirus. We don’t have to use binary variables for external factors – for example it could be used to note whether your top selling product was in stock, the date you implemented a new marketing strategy, or the presence of a winning ad creative.

Amazingly, with just three variables, the two lines are lining up almost perfectly. We shouldn’t expect every analysis we do to be this easy, nor should we plan to capture everything in our model. However we can use this example to illustrate how with a handful of predictive variables we can reveal some valuable insights into what drives sales for our business.

Sales = -17 * Price + 16 * Advertising + 89 * Holiday + 415

The equation the model gives us, shows what impact each factor had in driving sales. This is where decisions can be made and actions taken once we understand the implications of what this equation is telling us. The figures for sales and advertising budget are in thousands, so we can interpret the model coefficients as follows:

  • Price had a negative impact of -17, meaning we lost $17,000 in sales revenue for every dollar we increased the price of our pies
  • Advertising drove $16,000 in sales revenue for every thousand dollars spent, or our return on investment (ROI) was 16:1
  • Holidays were good for our pie company, as we made an extra $89,000 in sales for every week where there was a holiday
  • Our intercept was 415 or $415,000 in sales, meaning all things held equal, we’d get about this much in sales without these three variables

Can we really trust the data?

All models are a simplification of the real world, so by nature they’re wrong. For example there are thousands of variables that affect sales of pies, including the weather, competitor activity and the quality of your ad creative, none of which are captured here. Just because these three factors have been highly correlated to sales in the past, doesn’t mean that relationship will hold in the future. We might also mix up our correlations – in this data set adspend is highly correlated with holidays, so missing one of those variables overstates the effect of the other.

However this is not a good excuse to go back to making gut decisions. Every decision we make models future outcomes with past data. We expect based on past experience that getting on the number 42 bus at the Tower of London will take us to Liverpool St Station in about 16 minutes. Yet that’s only a prediction, not a guarantee – there might be traffic, roadworks, or a change of route. Because your past data might be wrong, do you give up estimating travel time entirely?

The difference between using your gut (where your brain models future outcomes based on past data) and a spreadsheet like the one we made (where a formula models future outcomes based on past data), is that the spreadsheet is transparent about how it's making those predictions. This means it can more easily be interrogated and improved. 

Ultimately the best test of a model is how well it improves your ability to take the right actions under uncertainty.Try and guess next month’s sales and see how you do versus the spreadsheet. Simple models like this one often reliably outperform humans, so I don’t fancy your chances. Instead I recommend spending that energy brainstorming ways the model might be wrong, and only using your judgement to deviate where you have conviction.

How can you do multivariate regression in GSheets?

The method I originally found in this video by Jalayer Academy used a Google Sheets add-on package called XLMiner, which was broken by Google’s move to verify apps in December 2019. Instead I adapted the method to use the native  LINEST function, which works well for our purposes.

LINEST(known_data_y, known_data_x, [calculate_b], [verbose])
  • known_data_y - this is the variable you want to predict, in our case sales
  • known_data_x - the variable(s) you think can predict y (price, advertising, holiday)
  • [calculate_b] - optional variable to calculate the intercept, default true
  • [verbose] - optional variable to return all descriptive statistics, default false

In our model it looks like this: 

You can see what these numbers look like if you just incorporate 1 variable, or expand to 2 variables. To adapt to your business simply create your own sheet with the same function but different data and variables and see what results you get. Be aware that for some reason the coefficients come out in backwards order!

Once you have the coefficients you can plug them into a formula to be able to predict the number of sales for any combination. In the template we can see the coefficients given by the model in cells D23 through E23, and can put them together to construct our equation.

Sales = -17 * Price + 16 * Advertising + 89 * Holiday + 415

To get our charts we simply plugged back in the past data we had into this equation to see how close the prediction would be to the number of sales we actually saw. This let’s us determine the accuracy. Ideally you would then test the model by trying this equation on data it hasn’t seen yet, and see how well it does.

Verbose output of descriptive statistics

You’ll notice further down in the model at rows 25 to 42, we’re seeing what it looks like when you choose the ‘verbose’ output option with LINEST. Unless you’re working with statistics regularly you probably won’t know what these descriptions mean, but you can get pretty far by Googling each term and reading the Wikipedia page.

The output is a little strange especially with the #NA fields, but don’t worry about them. Instead concentrate on the dictionary of descriptions I put together below in yellow. I have linked each value to these descriptions so you can interpret where this would go in your own spreadsheet. 

The main field which is important aside from the coefficients (which you get in the non-verbose version of the function) is the coefficient of determination, or the R-squared. This is the same as what you saw in the scatterplot showing Pie Sales vs Price, but this time it’s showing the variance predicted by the three variables you included in your regression analysis. It tells us that this model explains about 83% of the variance we see in sales – still some way to go, but pretty predictive considering the simplicity of the model.

***Disclaimer – IANAS (I Am Not A Statistician)***

I did get a Masters degree in Economics, and I’ve used these techniques in my work at Ladder and Saxifrage, but I’m not a professional statistician and there are gross oversimplifications in this post. I’m of the belief that any model is better than no model, so I still recommend you follow along and build your own. However if you’re doing this for any reasonably sized marketing budget ($100k+ per month), contact me and I can help connect you with an expert.

Measuring long-lasting effects with adstocks

Once you’ve built a simple model, you’ll naturally want to experiment with adding in complexity to see if that improves predictions. One common feature of Marketing Mix Models is a factor called ‘Adstocks’. The theory goes that some marketing channels like TV have a longer lasting brand advertising effect, so you need some way of incorporating this in your model.

In practice the way this is calculated, is by assuming an ‘adstock rate’, which is the amount of spillover you get across time periods. An adstock rate of 50% from week to week means an effect as large as 50% of the adstock last week, is felt the week after the spend occurred. The formula then decays by 50% a week until the effect disappears.

For the first time period, adstock is just equal to the advertising spend. For all other time periods, it equals the advertising budget for that week, plus the adstock rate multiplied by the prior week’s adstocks (which in week 2, would be week 1’s advertising budget). Adstocks is then used as a variable in the regression in place of advertising. Check out the second tab of our model to see how this works.

With a relatively low adstock rate like 15%, it has the effect of smoothing out advertising budget, so the impact doesn’t drop off as quickly. In our model the marketing budget was always on, so the effect isn’t as clear as it might be for an advertiser running one-off TV campaigns.

With a very high adstock rate like 50%, you can see the effect builds and stays high, which may or may not be realistic depending on the marketing channel and business. The trick is to experiment with multiple adstock rates until you find one that best fits the model. Typically brand advertising channels like TV, radio or PR might have longer effects than direct marketing channels like Google Ads, direct mail or cold email.

In our model adstocks doesn’t seem to be a hugely important variable – with an adstock level of 15% we can increase the R-squared of the model from 0.836 to 0.839, but this is hardly worth including. In other models for other businesses correctly incorporating adstocks is the biggest determinant of model accuracy.

---

That covers the basics of Econometrics, and we did it all in Google Sheets! To see this same process but in Python instead of GSheets, or to learn about Diminishing Returns, visit the next posts in this series:

In future we plan to cover:

October 16, 2020
September 16, 2020

More to read