An in-depth explanation regarding the security surrounding statistical linkage keys, why they’re important and how their security can be compromised…
The security of the Australian 2016 Census has sparked much debate and consternation among privacy advocates and security professionals alike. At the core of these concerns is a move by the Australian Bureau of Statistics (the ABS) to start linking census records to other data. The mechanism proposed for linking records and data is a ‘random looking’ Statistical Linkage Key. We have been told that the linkage key is secure and will be ‘hashed’ to make it irreversible - but what exactly does that mean, and how does it secure your data?
Introducing the Statistical Linkage Key
Statistical Linkage Keys, or SLKs, are frequently used in data research. They provide some very basic anonymity and a sanity check on the data, while retaining a way of identifying an individual throughout a study.
The Australian Bureau of Statistics publishes a standard called the SLK581 cluster. It defines a method for turning “Jane Smith 01/01/2007 Female” into a random-looking serial number like “MIHAN010120072”.
The SLK581 has been registered for use for health, housing, and early childhood records. It is also designed so that the SLK can be issued once, and then follow Jane Smith around for the rest of her life. Because it’s relatively unique and nonsensical, it can be used to combine records without giving away Jane Smith’s name – except it’s not secure at all.
Rebuilding Jane’s SLK
Let’s assume that all we know is the name “Jane Smith”, and we want to work out her SLK. We can create the first ‘MIHAN’ part instantly, because the rules for the SLK581 say “use letters 2, 3 and 5 from the family name” and “use letters 2 and 3 from the first name”, like so:
SMITH (MIH) – JANE (AN)
So, if we get a bunch of records for school kids, and we know one of them is named Jane Smith, we only have to look for records that start with MIHAN. The rest of the key then gives us Jane’s birthday, and we know which records refer to Jane.
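To make the rules above concrete, here is a minimal sketch of building an SLK581 in Python. The function names `pick` and `slk581` are my own, and padding short names with the character ‘2’ is an assumption based on my reading of the standard, so treat this as an illustration rather than a reference implementation.

```python
from datetime import date


def pick(name: str, positions: tuple[int, ...]) -> str:
    """Take the given 1-based letter positions from a name.

    Assumption: positions past the end of a short name are padded with '2'.
    """
    letters = [c for c in name.upper() if c.isalpha()]
    return "".join(letters[p - 1] if p <= len(letters) else "2" for p in positions)


def slk581(family_name: str, first_name: str, dob: date, sex: str) -> str:
    """Build the key: 3 family-name letters, 2 first-name letters, DDMMYYYY, sex code."""
    sex_code = {"M": "1", "F": "2"}[sex.upper()]  # 1 = male, 2 = female
    return (
        pick(family_name, (2, 3, 5))
        + pick(first_name, (2, 3))
        + dob.strftime("%d%m%Y")
        + sex_code
    )


print(slk581("Smith", "Jane", date(2007, 1, 1), "F"))  # MIHAN010120072
```

Everything in the key is read straight off the inputs, which is why anyone who knows the name can rebuild the first five characters on sight.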
We get some very basic privacy through this process. We protect from accidental ‘glances’ by staff at personal records. It’s difficult to relate MIHAN010120072 back to Jane Smith from a quick look at a file. But, we get zero security; once we know we want Jane Smith’s details we can easily solve the puzzle by looking for records that match that key.
The use of SLKs protects Jane Smith from accidental identification by someone looking at a bit of paper on a desk, and helps to identify her records if information needs to be retrieved. But, as you can see, information like the date of birth and some of the letters in Jane Smith’s name is easy to identify.
So, within the context of the Australian Census, we assume (hope?) that the ABS is using a more sophisticated method to protect the Australian Census Statistical Linkage Keys. They’ve said publicly that they will be ‘hashing’ the keys to make them secure, but what does that mean, and does that imply security? (Spoiler: no).
Hashing and scrambling to obscure data
Hashing is an IT/Crypto nerd way of saying ‘scrambling data according to a predetermined set of rules’. There are many secure hash methods, and yes they are truly irreversible.
For the non-technical explanation, think of a hash as the final result of baking a cake; you mix a bunch of ingredients and the result is a unique cake. You can’t reverse the cake, but if you use the same ingredients and the same process, you’ll get the same cake every time.
This is handy, because if I present you with a finished cake AND a bag of ingredients, you can repeat the process and ‘test’ that your cake made from the ingredients is identical to the finished cake I already presented.
If you end up with a different cake, you know that one (or more) of the ingredients is wrong. A hash is somewhat like a baked cake: it’s the result of a standard recipe consisting of a predefined method and unique ingredients. But it’s much more secure than a cake. There is no way to identify even a single ingredient from the complete hash, and even a tiny change to the ingredients alters the hash unpredictably.
In security, we use hashes to uniquely represent a whole bunch of data. All of the components that go into the hash are combined to generate a really big jumbled nonsense identifier (the cake).
Hashes are very secure and unique constructions that are an essential tool for security professionals. Unfortunately, many technology professionals are under the false impression that slapping a ‘hash’ onto a record makes it secure. This isn’t the case.
Hashing doesn’t necessarily improve security
Let me demonstrate that merely hashing Jane’s SLK gives us zero improvement in security. It just adds one minor, frustrating layer to the matching of names:
A hash of the SLK record just scrambles it a little bit more. For example, the hashed version of ‘MIHAN010120072’ looks like:
‘CkF62pr2W92pNdBTlrZEXMpj/nyzwLXljdBzjWdxNEA’ (Tech info for the nerds: Base64 encoded SHA-256 of Jane Smith’s SLK)
The hashing process is deterministic, which means that the ingredients that are Jane Smith’s SLK (MIHAN010120072) will always encode to the same scrambled ‘key’. A tiny change to Jane Smith’s SLK details will radically change the key. Let’s say another Jane Smith is born one day later; using the SLK581 for Jane Smith 02/01/2007, Female (MIHAN020120072) we can hash the value and get a new key:
‘O3Tf1Q0RbzMtZ+83GW3Hl0KvVrQCa3t7q7uhSSKimTg’
Even comparing the first few letters you can see the keys are widely different! Sounds secure right? (Of course it’s not, I wouldn’t have asked the question otherwise.)
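If you want to try this yourself, the sketch below hashes the two SLKs with SHA-256 and Base64-encodes the result using only the Python standard library. The helper name `hash_slk` is mine. I’m assuming the SLK is encoded as ASCII and that the trailing ‘=’ padding is stripped (the keys shown above are 43 characters with no padding); the exact byte encoding behind those keys isn’t stated, so your output may not match them character for character.

```python
import base64
import hashlib


def hash_slk(slk: str) -> str:
    """SHA-256 the SLK, then Base64-encode the digest without '=' padding."""
    digest = hashlib.sha256(slk.encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii").rstrip("=")


first = hash_slk("MIHAN010120072")   # Jane Smith, 01/01/2007, Female
second = hash_slk("MIHAN020120072")  # Jane Smith, 02/01/2007, Female

print(first)
print(second)
print(first == hash_slk("MIHAN010120072"))  # True: same input, same hash
print(first == second)                      # False: one day later, different hash
```

The two properties on display, repeatability and sensitivity to small changes, are exactly what the brute force search relies on.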
Because we’re making these keys and the hashes deterministically, we can do what’s called a ‘brute force’ search. We can exhaustively scan through all of the combinations (ingredients) of MIHAN*********, creating the hashes (cakes) for each one, until we find a hashed SLK that matches a record of interest. For example, we can test MIHAN010119001, MIHAN010119002, MIHAN020119001… MIHAN311220162 - just like searching for a forgotten PIN on a bicycle code lock.
The search process is almost exactly the same for the SLK581 as it is for a hashed SLK581. In this case, the hash (which jumbles the data) is a minor inconvenience to searching and matching records.
To find our Jane Smith’s birth date and gender (and to confirm whether a file belongs to her), we can look at every birth date since 1900, and every combination of male and female. That sure sounds like a lot of work right? YES! If you’re doing it with a pen and paper, it’ll take you forever!
Unfortunately for the security conscious, even a modest desktop computer can generate and calculate these keys at rates exceeding millions per second. We don’t need to worry about ‘reversing’ the irreversible hash when we can just make a billion or trillion guesses until we find a key that matches a name we are interested in.
A real world example solving SLK581 using brute force
Here’s proof. On my laptop computer, I wrote a short program to generate every Jane Smith SLK from 1900 to 2016. The program took me about 5 minutes to write, and took just under 200 milliseconds (literally less than the blink of an eye) to generate all of the roughly 86,000 Jane Smith combinations (and corresponding hashes), both male and female. Have a look:
MIHAN171220161 - Gs13dk+PViqlAIMzVnz+n9CMdSkOutdOH0t6C+P3YM0
…
MIHAN181220162 - aG2NZIpzflnJdRqtRQy+bZMe4W0/ro03bmZgCRB1Y4I
MIHAN191220161 - X+BVbxpohSGjbB5IWRQaIGcUeA78uZxZGtolU6YvH1I
(about 86,000 of these, generated in less than the blink of an eye)
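This is not the original program, but a minimal sketch of the same idea looks something like the following. The names `hash_slk` and `candidates` are mine, the ASCII encoding and stripped Base64 padding are assumptions, and the ‘leaked’ hash is computed on the spot to stand in for a value found in a real data set.

```python
import base64
import hashlib
from datetime import date, timedelta


def hash_slk(slk: str) -> str:
    """Base64-encoded SHA-256 of the SLK (padding stripped; an assumption)."""
    digest = hashlib.sha256(slk.encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii").rstrip("=")


def candidates(name_part: str, first_day: date, last_day: date):
    """Yield every SLK for one name: each birth date in range, both sex codes."""
    day = first_day
    while day <= last_day:
        for sex_code in ("1", "2"):
            yield name_part + day.strftime("%d%m%Y") + sex_code
        day += timedelta(days=1)


# Lookup table from hash back to SLK for every possible Jane Smith.
table = {
    hash_slk(slk): slk
    for slk in candidates("MIHAN", date(1900, 1, 1), date(2016, 12, 31))
}
print(len(table))  # 85468

# Stand-in for a hashed key spotted in an 'anonymous' data set.
leaked = hash_slk("MIHAN010120072")
print(table[leaked])  # MIHAN010120072
```

Once the table exists, each lookup is a single dictionary access, so checking a whole file of hashed keys against it costs next to nothing.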
Now all we need to do is search our secret ‘anonymous’ data for the key “CkF62pr2W92pNdBTlrZEXMpj/nyzwLXljdBzjWdxNEA” and we know we have the record of: Jane Smith 01/01/2007 Female. That is, we can look at all the baked cake hashes, and conclusively say ‘this cake is Jane’s!’.
How to make things more difficult
So let’s assume the ABS made things much harder for adversaries. It sounds like they may have added your address as a component in the statistical linkage key. If we do the math based on Australia Post delivery numbers, we can see there are about 11 million addresses in Australia.
A brute force search with the base name and date of birth is made 11 million times more complex by adding a full address. This increasingly complex ‘search’ on my (6 year old) laptop would take about 25 days; on a good desktop around 5-6 days; and for about $25 I can buy some time on a powerful cloud-computer and complete a brute force search in an hour or two.
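The 25-day figure can be sanity-checked with some back-of-the-envelope arithmetic. This sketch assumes the hashing rate stays at what my laptop managed above (about 86,000 hashes in 0.2 seconds) and that every name and birth date combination is tried against every one of the 11 million addresses; the variable names are just for illustration.

```python
from datetime import date

# Every birth date from 1900 to 2016, for both male and female.
days = (date(2016, 12, 31) - date(1900, 1, 1)).days + 1
per_name = days * 2

addresses = 11_000_000           # approximate Australia Post delivery points
total = per_name * addresses     # candidates for one name across all addresses

rate = 86_000 / 0.2              # hashes per second observed on the laptop
seconds = total / rate

print(f"{per_name:,} candidates per name")
print(f"{total:.2e} candidates including addresses")
print(f"{seconds / 86_400:.1f} days at {rate:,.0f} hashes per second")
```

That comes out at a little over 25 days on a single laptop core, and the work splits perfectly across machines, which is why renting cloud time shrinks it to hours.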
This means I can generate an SLK for every single combination of Jane Smith, for every birth date from 1900 to 2016, for both male and female, for every single delivery address in Australia, with no real difficulty at all.
Unintended consequence, hashes might make things easier to attack
The bonus with cryptographic hashes is that once I find a matching combination, I can be 100% certain that I have a perfectly matched combination of name, date of birth, and address. Let’s say I’m looking for Jane Smith; I know a birth date, but no address. If I had access to a list of these ‘secure’ SLKs I could run the same search until I found all SLKs that match.
Even if I couldn’t gain access to any specific record data, simply knowing I found a matching SLK hash in a list would reveal to me Jane Smith’s address (or more accurately, the search would reveal all addresses where a ‘Jane Smith 1/1/2007 Female’ resides). This means if all other data were secure but the SLK numbers leaked, your name, address and date of birth details could be ‘brute forced’ given other pieces of information.
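Here is a toy sketch of that address search. How an address would actually be folded into a Census key has not been published, so the `SLK|address` format, the three example addresses, and the second leaked key are all made up purely for illustration.

```python
import base64
import hashlib


def hash_key(key: str) -> str:
    """Base64-encoded SHA-256 of a linkage key (padding stripped; an assumption)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return base64.b64encode(digest).decode("ascii").rstrip("=")


def find_addresses(slk: str, candidate_addresses, leaked_hashes) -> list[str]:
    """Return every candidate address whose combined key appears in the leak."""
    return [
        address
        for address in candidate_addresses
        if hash_key(f"{slk}|{address}") in leaked_hashes  # hypothetical key format
    ]


# Hypothetical stand-ins for the ~11 million real delivery addresses.
candidate_addresses = [
    "1 Example St, Sampletown",
    "2 Example St, Sampletown",
    "3 Example St, Sampletown",
]

# Hypothetical leaked list of 'secure' hashed keys, with no other record data.
leaked_hashes = {
    hash_key("MIHAN010120072|2 Example St, Sampletown"),
    hash_key("ONSOH150619801|3 Example St, Sampletown"),
}

print(find_addresses("MIHAN010120072", candidate_addresses, leaked_hashes))
# ['2 Example St, Sampletown']
```

Nothing but the list of hashes was needed: knowing the name, birth date and sex was enough to read the address back out.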
This is a little like a cake shop leaking all the photos of its cakes. We could just scan the photos and know what ingredients (Name, DOB, Address) were used to make each cake.
Perhaps there’s more to the Census SLK?
There are cryptographic ways of making this somewhat more secure. A common (and somewhat naive) approach could involve the ABS appending some piece of secret data to the SLK; this would act like a master key for every record in the system. Such a scheme would be a partial solution, and would rely on the master key never ever leaking. Once leaked, the records would be broken forever.
A secret component only protects the cake recipe while it’s secret. If there are some special spices that we don’t know about, we can’t identify all the ingredients and we can’t solve the cake puzzle. But as soon as we know the secret spices, we repeat the same brute force search for the other ingredients.
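A small sketch shows both halves of that argument. I’m assuming the simplest version of the scheme, where the secret is appended to the SLK before hashing; the secret value and the function names `keyed_hash` and `brute_force` are hypothetical, and nothing here reflects how the ABS actually builds its keys.

```python
import base64
import hashlib
from datetime import date, timedelta


def keyed_hash(slk: str, secret: str) -> str:
    """Hash the SLK with a secret appended (the naive 'master key' scheme)."""
    digest = hashlib.sha256((slk + secret).encode("utf-8")).digest()
    return base64.b64encode(digest).decode("ascii").rstrip("=")


def brute_force(target: str, name_part: str, secret_guess: str):
    """Try every birth date from 1900 to 2016 and both sex codes for one name."""
    day = date(1900, 1, 1)
    while day <= date(2016, 12, 31):
        for sex_code in ("1", "2"):
            slk = name_part + day.strftime("%d%m%Y") + sex_code
            if keyed_hash(slk, secret_guess) == target:
                return slk
        day += timedelta(days=1)
    return None


SECRET = "hypothetical-master-key"  # made-up value for illustration only
target = keyed_hash("MIHAN010120072", SECRET)

print(brute_force(target, "MIHAN", ""))      # None: without the secret, no match
print(brute_force(target, "MIHAN", SECRET))  # MIHAN010120072: secret leaked
```

The secret adds no extra work once it is known; the search is the same 86,000 or so guesses as before.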
There are many other technical ways of making this process more secure, but without detailed explanations on how the Census SLKs work, no one can tell.
It has famously been said that security by obscurity is no security at all. If the Census SLK is designed correctly, informing people about how the SLK works will not weaken the key in any way - it’ll be secure by design.
The punchline
Our Census data and SLKs must be secure forever, against all types of malicious (and inadvertent) incidents. Understanding how your Statistical Linkage Keys work is an important step to understanding your (our) security.
If anyone tells you that using a ‘hash’ (alone) makes a Statistical Linkage Key secure, it’s almost certain they’re not fully aware of the implications and difficulties of securing data, or they’re lying through their teeth.