Note: I built a tool that anonymizes data for you automatically, you can check it out here:
> D-Anon Data Anonymizer
I’m the co-founder of Vexpower, where I offer simulator-based courses for marketers to learn how to be more data-driven. I’m a self-trained marketer, so everything I learned was on the job, while growing a 50-person marketing agency and working with clients like Booking.com, Monzo Bank, and Time Out Magazine. I made these courses so marketers can learn the skills they’re currently not getting from their jobs, because their manager doesn’t know how to do it either, or because they’re too busy with work to make time for training.
Most people fill the gap with online courses like Udemy, or reading blog posts, but there’s one huge advantage on-the-job training has over these methods: everyone in your office is under NDA (Non-Disclosure Agreement). That means you can share sensitive data with each other: the really good stuff that never leaves the walls of the company. It also means they’re willing to be more candid about how things are really getting done, rather than pretending they use some idealized process that sounds impressive in a blog post.
My solution was to make up fictional companies – like GoolyBib, a smart bib that alerts you when your baby is choking – and get experts to record themselves showing you how they’d do certain tasks for them. The companies are fictional and the data is fake, so nobody is getting fired or sued for breaking their NDAs. However, the data still needs to be realistic! If the underlying relationships in the data aren’t preserved, and the numbers don’t look right, students will struggle to learn anything.
In this article I’ll show you some of the methods I’ve found useful in creating unique, realistic, anonymous datasets for 68 courses for Vexpower. I hope you find them useful when making your own online courses, writing blog posts, or in any other situation where you need to make sensitive data public. I’d argue that, with all the privacy legislation, consumer backlash, and cybersecurity concerns, it’s already a good idea to anonymize the data you share internally too.
DISCLAIMER: I would still recommend you tell your boss or client your plans to anonymize their data and share it. Please don’t start dumping sensitive information in the public realm without getting permission. I’ve found that often people are happy to be supportive, so long as steps have been taken to obscure anything sensitive, and they get to review it before it goes live.
The dataset we’re using is itself already anonymized, but I’ll use it to showcase how these techniques work. You can download or make a copy of the data here:
Noise & Scale
The first and most obvious anonymization technique is to ‘fuzz’ the data somehow, usually by adding random noise and/or changing the scale. Take some arbitrary number and add it to (or multiply it with) your sensitive revenue data to shift or scale it up or down. Now competitors won’t be able to reverse-engineer any reliable information about your business, which in my experience is the number one concern most people have.
Even better if you add some Gaussian noise (the ‘normal’ distribution you might remember from stats class), so that the numbers are still mostly related but the digits are unrecognizable. One function I’ve found useful in GSheets for doing this is =NORMINV(RAND(),A1+B1,(C1*A1)), where A1 is the value you’re adding noise to, B1 is the fixed amount you’re shifting it by, and C1 is the percentage of noise you want to add (i.e. 0.05 for 5%).
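The same idea works outside a spreadsheet. Here’s a minimal Python sketch mirroring that NORMINV formula; the shift amount, noise percentage, and revenue figures are placeholder assumptions, not values from any real dataset:

```python
import random

def fuzz(value, shift=1000, noise_pct=0.05, rng=random):
    """Shift a value by a fixed amount, then add Gaussian noise.

    Mirrors the spreadsheet formula NORMINV(RAND(), A1+B1, C1*A1):
    the mean is the original value plus the shift, and the standard
    deviation is a percentage of the original value.
    """
    return rng.gauss(value + shift, noise_pct * value)

revenue = [12000, 15500, 9800]  # made-up figures for illustration
random.seed(42)                 # seed only so the run is repeatable
fuzzed = [round(fuzz(v)) for v in revenue]
```

Keeping the noise proportional to the value (C1*A1 rather than a flat amount) means big numbers get big wiggle and small numbers get small wiggle, so relative patterns in the series survive.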
It’s usually fine to say “I grew a client in the automotive space by 50%” or mention “I used to work on the BMW account”, but it’s not ok to put the two together. Share the name OR the numbers, but not both. So long as you stick to this rule, even if someone *thinks* they know where the data came from, there’s plausible deniability. They have no proof, so the data is of limited value to competitors, and the company will be more comfortable with you sharing it.
Entity Replacement
What if you accidentally shared real website referrer data in a blog post? Now anyone can see which websites send you traffic, and from there it’d be pretty easy to piece together who the data is coming from. You’re basically handing competitors a playbook to steal that traffic! However, you can’t just replace these values with something arbitrary: any relationship in the data between referrer and user behavior would vanish, making it useless for training. The solution is a trick called Entity Replacement: replacing each value with a new random value that has no meaning, but is the same every time it occurs. Most people do this manually with find/replace, but it’s better and faster with a lookup.
With the website referrer example, you could replace “bmw.co.jp” with “cars.co.jp” every time it occurs. Now nobody can figure out what website the traffic is coming from, but when you do analysis on conversion rate by referrer, the relationship will still be there. This is useful for lots of fields – names, credit cards, addresses – anything that could be used to trace the identity of the company or user. To do this in GSheets, you just need to make a list of all the values that occur in the original data with the UNIQUE() function, then use the mighty VLOOKUP to match each one to an arbitrary (deduplicated) replacement value.
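The UNIQUE() + VLOOKUP approach translates directly into a dictionary lookup. A short Python sketch, with hypothetical referrer values and a made-up replacement pool standing in for real data:

```python
def build_lookup(values, replacements):
    """Map each unique original value to a stable, meaningless stand-in.

    Like UNIQUE() + VLOOKUP in a spreadsheet: the same original always
    gets the same replacement, so relationships in the data survive.
    """
    unique = sorted(set(values))
    if len(unique) > len(replacements):
        raise ValueError("need at least one replacement per unique value")
    return dict(zip(unique, replacements))

# Hypothetical data; the replacement pool is arbitrary and deduplicated.
referrers = ["bmw.co.jp", "audi.de", "bmw.co.jp", "tesla.com"]
pool = ["cars.example", "autos.example", "engines.example"]

lookup = build_lookup(referrers, pool)
anonymized = [lookup[r] for r in referrers]
```

Because the mapping is consistent, a per-referrer conversion-rate analysis run on the anonymized column gives the same groupings as the original; only the labels have lost their meaning.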
Date Series Replacement
The other neat trick I’ve been using a lot is date series replacement. We’re famous for our marketing mix modeling training at Vexpower: a privacy-friendly marketing attribution technique that requires time series data. You might be tempted to use entity replacement on dates, but that’d be a mistake. Unlike entities, with dates the order matters. If we replace a date with a randomly selected date in the past or future, we have to make sure that every other date in the series follows on. Otherwise, when we’re doing analysis, the models we build will get the timeline mixed up and won’t be able to make accurate predictions.
The simplest way I’ve found to do this in GSheets is to convert the original date into its component parts – Year, Month, Day – then piece them back together afterwards, each with a different arbitrary number added or subtracted. Those numbers stay consistent for all dates, and so long as you don’t have any formatting issues, your time series relationships will be preserved. One thing I recommend here is to do your feature engineering first, before anonymization. You need to know if a spike in the data was around a national holiday, so extract seasonal features like that first, and rename them if you have to in order to maintain secrecy.
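Here’s a hedged Python sketch of the same idea. Rather than shifting Month and Day independently (which can produce invalid dates like month 13 or February 30), it combines a fixed year shift with a fixed day offset; the shift values and sample dates are arbitrary placeholders:

```python
from datetime import date, timedelta

YEAR_SHIFT = -2                # arbitrary, but fixed for the whole series
DAY_SHIFT = timedelta(days=17) # ditto

def shift_date(d):
    """Shift a date by fixed year and day offsets, preserving order.

    A safer variant of adding constants to the Year/Month/Day parts:
    one timedelta handles the sub-year shift, so every output is a
    valid calendar date. (Feb 29 would still need special handling
    when the target year isn't a leap year.)
    """
    return d.replace(year=d.year + YEAR_SHIFT) + DAY_SHIFT

series = [date(2021, 1, 4), date(2021, 1, 11), date(2021, 2, 1)]
shifted = [shift_date(d) for d in series]
```

Because every date gets the same shift, the ordering and the gaps between observations are untouched, which is exactly what a time series model needs.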
Anonymizing data is a complicated topic, and one that hasn’t been written about enough. Most people who need to anonymize data subject themselves to lots of manual find/replace and other hand manipulation. This is one of the big barriers to sharing more content online, creating more courses, and teaching more people. If we can make data anonymization easier, more people will be able to learn how to do a better job, get recruited, and get promoted. Done well, even very sensitive data can be obscured enough to share publicly.
Since you’ve read this far: I built a tool that anonymizes data for you automatically, and would appreciate any feedback you can give:
> D-Anon Data Anonymizer