Bursting Bubble Anonymity

June 12, 2011

I’ve written here often about some of the issues involved with security and privacy in computer systems.  In some cases, it is possible to identify individuals with fairly high accuracy, even when there does not seem to be much information to start with.  One case in point is the identification of individual Web browsers; most users are surprised at how a collection of apparently innocuous data can pinpoint a specific individual browser.  It’s also clear that policies that focus only on securing specific data items (such as Social Security numbers) are often not adequate to prevent identification of specific individuals.

I imagine that most readers will have, at one time or another, taken a survey or standardized test, or completed an optical scan ballot form, in which one fills in a small circle or oval to select one of a set of choices.   When the form is for a test, of course, one’s identity is known; but for surveys and ballots, we are usually supposed to be anonymous.  And, on the surface, this seems likely to be true, assuming the little filled-in shapes are the only marks on the form.  But Will Clarkson, Joe Calandrino, and Ed Felten, from Princeton’s Center for Information Technology Policy, have released a new study showing that even those marks can contain enough information to identify individuals with a fair degree of accuracy.

“These forms, popular for their use with standardized tests, require respondents to select answer choices by filling in a corresponding bubble. Contradicting a widespread implicit assumption, we show that individuals create distinctive marks on these forms, allowing use of the marks as a biometric. Using a sample of 92 surveys, we show that an individual’s markings enable unique re-identification within the sample set more than half of the time.”

The team looked at various features of different users’ bubble-filling technique — qualitatively, we might think of things like the fraction of the interior space filled, and whether the user “colored” inside the lines — and developed a quantitative model to represent those features.  They then used machine learning techniques to “train” their system to recognize individuals’ patterns and characteristic features.  The resulting classification gives significantly better accuracy than would be expected by chance.
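To make the idea concrete, here is a minimal sketch of that kind of pipeline. It is not the study's actual feature set or classifier: the two features (fraction of the bubble's interior that is dark, and fraction of the area outside the circle that is dark) come from the qualitative description above, and a simple nearest-average comparison stands in for the machine learning step. The function names, the darkness threshold, and the assumption that each bubble arrives as a square grid of darkness values with the printed circle inscribed in it are all hypothetical.

```python
from math import dist


def bubble_features(bubble, threshold=0.5):
    """Return (interior fill fraction, outside-the-lines fraction).

    `bubble` is a square grid (list of rows) of darkness values in [0, 1].
    The printed circle is assumed to be inscribed in the grid.
    """
    n = len(bubble)
    center = (n - 1) / 2
    radius = n / 2
    inside_dark = inside_total = outside_dark = outside_total = 0
    for y, row in enumerate(bubble):
        for x, value in enumerate(row):
            dark = value >= threshold
            if (x - center) ** 2 + (y - center) ** 2 <= radius ** 2:
                inside_total += 1
                inside_dark += dark
            else:
                outside_total += 1
                outside_dark += dark
    outside_fraction = outside_dark / outside_total if outside_total else 0.0
    return (inside_dark / inside_total, outside_fraction)


def mean_profile(feature_vectors):
    """Average several feature vectors into one profile per respondent."""
    return tuple(sum(col) / len(col) for col in zip(*feature_vectors))


def rank_candidates(profiles, query):
    """Order known respondents from most to least similar to `query`."""
    return sorted(profiles, key=lambda name: dist(profiles[name], query))


# Hypothetical usage: two known respondents, one unlabeled mark.
profiles = {
    "respondent_a": mean_profile([(0.95, 0.30), (0.90, 0.25)]),
    "respondent_b": mean_profile([(0.40, 0.00), (0.45, 0.05)]),
}
guesses = rank_candidates(profiles, (0.42, 0.02))
```

Returning a ranked list rather than a single answer is what makes "top three" or "top ten" accuracy figures possible: the guess counts as a hit if the true respondent appears anywhere in the first k entries.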

“Our classifier achieves over 51% accuracy. The classifier is rarely far off: the correct answer falls in the classifier’s top three guesses 75% of the time (vs. 3% for random guessing) and its top ten guesses more than 92% of the time (vs. 11% for random guessing).”
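The random-guessing figures can be checked with a little arithmetic. The sketch below assumes the baseline is the simplest one: each form belongs to exactly one of the 92 respondents, and a random guesser names k distinct respondents uniformly at random, so the chance of including the right one is k/92. The function name is made up for illustration.

```python
from fractions import Fraction


def random_top_k_rate(k, n_respondents):
    """Chance that k distinct uniform random guesses include the true respondent."""
    return Fraction(min(k, n_respondents), n_respondents)


for k in (1, 3, 10):
    rate = random_top_k_rate(k, 92)
    print(f"top-{k}: {float(rate):.1%}")
# prints:
# top-1: 1.1%
# top-3: 3.3%
# top-10: 10.9%
```

Rounded to whole percentages, these match the 3% and 11% baselines quoted above, and they put the 51% single-guess accuracy against a chance rate of roughly 1%.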

As the researchers point out, this result, if confirmed by more extensive testing, may have both good and bad uses.  One of the good uses might be as a tool to help detect cheating on standardized tests.  If someone takes a test, does poorly, and then takes it again, getting a significantly better score, one might check that the first and second answer sheets were completed by the same person.  On the other hand, being able to identify voters from optical scan ballots could invite attempts at voter coercion.  (The primary reason for secret ballots is to keep intimidation and vote buying from paying off: someone who cannot verify how a person voted cannot reliably coerce or purchase that vote.)  This problem is potentially made worse by the practice, in some jurisdictions, of making scanned ballot images public.

As the team acknowledges, there is more work to be done to explore this phenomenon, especially to look at how the characteristic features may change over time.  It is another reminder, though, that a real evaluation of security and privacy risks requires more than a superficial analysis.

The paper [PDF] will be presented at the 2011 USENIX Security Symposium, to be held in San Francisco in August.
