“Anonymized” Data Isn’t

September 12, 2009

Anyone who has used the Web for more than a few hours has come across sites’ privacy policies, in which the sites describe how they will allow themselves to use your data.  There are many variants on this; survey or polling sites, for example, may promise that your responses will only be reported as part of “statistical aggregates”, or some similarly woolly term.

Ars Technica recently published an article that discusses why many of these promises, sincere though they may be, may not be worth much even when they are kept.  The article focuses on personally identifiable information (such as your name or your Social Security Number), but the underlying problem is one that bedevils all efforts to secure large collections of data.
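
To see how an aggregate-only promise can fail even when it is kept, consider the classic "differencing" attack, sketched below in Python.  Everything in the example (the names, the salaries, the queries) is invented for illustration; the point is that each query returns only a sum over several people, yet the two answers together give away one person's data.

    # Hypothetical "differencing" attack on aggregate-only reporting.
    # All names, departments, and salaries are invented.

    survey = [
        {"name": "Alice", "dept": "IT",    "hired": 2015, "salary": 90_000},
        {"name": "Bob",   "dept": "IT",    "hired": 2018, "salary": 85_000},
        {"name": "Carol", "dept": "IT",    "hired": 2021, "salary": 95_000},
        {"name": "Dave",  "dept": "Sales", "hired": 2019, "salary": 70_000},
    ]

    def total_salary(predicate):
        """Answer queries only as statistical aggregates, never row by row."""
        return sum(r["salary"] for r in survey if predicate(r))

    # Two perfectly legitimate aggregate queries...
    all_it      = total_salary(lambda r: r["dept"] == "IT")
    it_pre_2020 = total_salary(lambda r: r["dept"] == "IT" and r["hired"] < 2020)

    # ...whose difference is the salary of the one IT employee hired
    # after 2019, a fact an attacker might well know about a target.
    print(all_it - it_pre_2020)  # 95000

There are standard defenses, such as refusing queries over very small groups or adding statistical noise, but the example shows why the bare promise of aggregate reporting guarantees little.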

Now it is easy enough to say that your Social Security Number should not be made available except to authorized people; and that is, in a limited way, a good thing.  But this kind of control is not enough.  The article begins with an account of a project in Massachusetts that was intended to provide data for medical research, while strictly protecting privacy.

The Massachusetts Group Insurance Commission [GIC] had a bright idea back in the mid-1990s—it decided to release “anonymized” data on state employees that showed every single hospital visit. The goal was to help researchers, and the state spent time removing all obvious identifiers such as name, address, and Social Security number.

However, a graduate student in computer science, Latanya Sweeney, decided to test the premise that no particular person could be identified from the data.  She started with the GIC records and focused on the then-Governor of Massachusetts, William Weld.  Knowing from public statements that he lived in Cambridge, she proceeded to purchase some additional information:

For twenty dollars, she purchased the complete voter rolls from the city of Cambridge, a database containing, among other things, the name, address, ZIP code, birth date, and sex of every voter.

By law, voter registration records are public information; the data have obvious applications to preventing election fraud.  With this data, Ms. Sweeney was easily able to pick out Gov. Weld's medical records.  The population of Cambridge is about 100,000; of these, only six people shared Weld's birth date, only three of those were men, and only one of the three (Weld himself) lived in his ZIP code.
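
To make the mechanics concrete, here is a toy reconstruction in Python.  Every record below is fabricated, and the join on (ZIP code, birth date, sex) is a sketch of the general "linkage attack" idea, not of Dr. Sweeney's actual procedure:

    # Toy linkage attack; every record below is fabricated.

    # "Anonymized" hospital data: names, addresses, and SSNs removed,
    # but the quasi-identifiers (ZIP, birth date, sex) survive.
    medical = [
        {"zip": "02138", "dob": "1950-06-15", "sex": "M", "diagnosis": "X"},
        {"zip": "02139", "dob": "1962-01-15", "sex": "F", "diagnosis": "Y"},
    ]

    # Public voter roll: the same quasi-identifiers, plus a name.
    voters = [
        {"name": "J. Doe", "zip": "02138", "dob": "1950-06-15", "sex": "M"},
        {"name": "R. Roe", "zip": "02139", "dob": "1970-03-02", "sex": "M"},
    ]

    def quasi(r):
        """The quasi-identifier key shared by both datasets."""
        return (r["zip"], r["dob"], r["sex"])

    def link(medical, voters):
        """Join the two datasets on their shared quasi-identifiers."""
        names = {quasi(v): v["name"] for v in voters}
        return [(names[quasi(m)], m["diagnosis"])
                for m in medical if quasi(m) in names]

    print(link(medical, voters))  # [('J. Doe', 'X')]

Note that nothing here required breaking any access control: one input was released deliberately, and the other was bought legitimately from the public record.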

As something of a stunt, this was quite effective.  Perhaps more important, though, Dr. Sweeney later went on to show that the problem was broadly applicable:

… in 2000, she showed that 87 percent of all Americans could be uniquely identified using only three bits of information: ZIP code, birthdate, and sex.
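
A back-of-the-envelope calculation suggests why a figure like that is plausible; the numbers below are rough assumptions of mine, not Dr. Sweeney's methodology:

    # Rough estimate of how finely (ZIP, birth date, sex) partitions
    # the US population; all figures are approximate assumptions.

    zip_codes   = 42_000        # roughly the number of US ZIP codes
    birth_dates = 365 * 80      # about 80 years of plausible birth dates
    sexes       = 2
    cells       = zip_codes * birth_dates * sexes   # ~2.45 billion

    population = 280_000_000    # US population circa 2000

    print(f"{cells:,} combinations")                    # 2,452,800,000
    print(f"{population / cells:.2f} people per cell")  # 0.11

With roughly nine times as many cells as people, the average person occupies a cell alone.  Real populations cluster in large ZIP codes and common ages, which is presumably why the measured figure is 87 percent rather than nearly 100.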

Most people, when told of this result, are astonished; but their astonishment in large part reflects their lack of understanding of the issue.  We frequently, if unconsciously, operate under a set of rules and assumptions about identity and privacy that were developed for a very different world.  As I pointed out in a previous post:

At the root of many traditional methods for verifying a person’s identity is the notion that the only person likely to know a large number of disparate facts about a particular individual is the individual himself.  Even though there was a good deal of information that was legally a matter of public record (e.g., birth and death certificates, land records, wills), historically that information only existed on paper, probably tucked away in some dusty old courthouse annex.

The Massachusetts GIC example illustrates a classic problem in database security.  In general, restricting access to individual items of information is not enough; one also has to worry about the ways in which data can be cross-indexed and correlated with other sources.  The availability of so much data on the Internet, and in other electronic repositories, has dramatically reduced the cost of this sort of “statistical snooping”.  As the Ars Technica article puts it:

As increasing amounts of information on all of us are collected and disseminated online, scrubbing data just isn’t enough to keep our individual “databases of ruin” out of the hands of the police, political enemies, nosy neighbors, friends, and spies.

Dr. Paul Ohm, Associate Professor of Law at the University of Colorado, has published a lengthy paper [PDF] discussing the implications of this, the inadequacy of current privacy laws and regulations, and what might be done to address the problem.  As I’ve discussed here before, the argument that “I’ve got nothing to hide” just does not hold water; as Prof. Ohm says:

For almost every person on earth, there is at least one fact about them stored in a computer database that an adversary could use to blackmail, discriminate against, harass, or steal the identity of him or her. I mean more than mere embarrassment or inconvenience; I mean legally cognizable harm.

There is not likely to be a simple solution to this problem.  It will take careful thought, and an evaluation of the trade-off between privacy and the societal value to be gained from making large collections of data available for analysis.  The one approach that can be virtually guaranteed not to work is the ostrich-like one of sticking our heads in the sand (or some other dark place), and hoping the problem will go away.

