Today’s Washington Post has a story by Brian Krebs (who also writes the “Security Fix” blog) about some research done at Carnegie Mellon University on the possibility of guessing a person’s Social Security number. They used public information on how Social Security numbers [SSNs] are assigned:
The Social Security number’s first three digits — called the “area number” — is issued according to the Zip code of the mailing address provided in the application form. The fourth and fifth digits — known as the “group number” — transition slowly, and often remain constant over several years for a given region. The last four digits are assigned sequentially.
(It was pretty common in the early days of building data bases to use identifiers that encoded some information, just as the telephone Area Code originally specified a geographic area. We now have learned that this is usually a Bad Idea, but then Social Security was started in the late 1930s.)
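To make the three-part layout in the quote concrete, here is a minimal Python sketch that splits a number into its fields. The names `SsnParts` and `split_ssn` are my own, and the sample value is a made-up placeholder, not anyone's real number. It only illustrates how much structure the identifier carries; it is not how the Social Security Administration or the researchers process anything.

```python
from typing import NamedTuple


class SsnParts(NamedTuple):
    area: str    # first three digits (the "area number")
    group: str   # fourth and fifth digits (the "group number")
    serial: str  # last four digits, assigned sequentially


def split_ssn(ssn: str) -> SsnParts:
    """Split a nine-digit number, optionally written AAA-GG-SSSS, into its fields."""
    digits = ssn.replace("-", "")
    if len(digits) != 9 or not (digits.isascii() and digits.isdigit()):
        raise ValueError("expected nine digits, optionally written as AAA-GG-SSSS")
    return SsnParts(digits[:3], digits[3:5], digits[5:])


# Hypothetical placeholder value, used only to show the field boundaries.
parts = split_ssn("000-12-3456")
print(parts.area, parts.group, parts.serial)
```

Because two of the three fields track where and when the number was issued, anyone who knows a person's place and date of birth already knows something about the number itself.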
They also used a data base that I had not heard of before, the Social Security Administration’s rather grimly named “Death Master File”. This apparently contains names, SSNs, state, and dates of birth and death for everyone who had an SSN and is deceased (to the knowledge of the Social Security Administration).
The researchers, Alessandro Acquisti and Ralph Gross, found that, by using this information and an individual’s place and date of birth, they could get a good start on discovering someone’s SSN:
The two tested their hunch using the Death Master File of people who died between 1972 and 2003, and found that on the first try they could correctly guess the first five digits of the SSN for 44 percent of deceased people who were born after 1988, and for 7 percent of those born between 1973 and 1988.
Their success rate was materially better for people born after 1988:
Acquisti and Gross found that it was far easier to predict SSNs for people born after 1988, when the Social Security Administration began an effort to ensure that U.S. newborns obtained their SSNs shortly after birth.
They were able to identify all nine digits for 8.5 percent of people born after 1988 in fewer than 1,000 attempts. For people born recently in smaller states, researchers sometimes needed just 10 or fewer attempts to predict all nine digits.
Now, a thousand tries may seem like a lot, but there are lots of Internet sites that allow on-line credit applications; it is not much of a stretch to imagine an enterprising crook writing a small computer program to automate the probing process – and then deploying it using a “botnet” of compromised PCs. As Krebs points out in his blog post, some sites do not even require all nine digits to be correct, to make life easier despite data base errors.
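For a sense of scale, here is a back-of-the-envelope calculation. This is my own arithmetic, not something from the paper. It assumes every nine-digit string is a possible number, which somewhat overstates the real space, and it models guessing as distinct tries drawn uniformly at random. The function name `chance_within` and the constants are mine.

```python
def chance_within(attempts: int, space: int) -> float:
    """Chance of hitting one fixed target within `attempts` distinct
    guesses drawn uniformly from `space` equally likely candidates."""
    return min(1.0, attempts / space)


ALL_NINE_DIGITS = 10 ** 9   # every nine-digit string (an overestimate)
LAST_FOUR_ONLY = 10 ** 4    # the serial digits, if the first five were known
REPORTED_RATE = 0.085       # figure quoted above for people born after 1988

blind = chance_within(1000, ALL_NINE_DIGITS)
narrowed = chance_within(1000, LAST_FOUR_ONLY)

print(f"blind guessing, 1,000 tries: {blind:.6%}")
print(f"reported rate vs. blind guessing: about {REPORTED_RATE / blind:,.0f} times higher")
print(f"1,000 tries over the last four digits alone: {narrowed:.0%}")
```

Under these assumptions, a thousand blind tries would succeed about once in a million, so the reported 8.5 percent is tens of thousands of times better than chance. Once the first five digits are pinned down, the remaining space is small enough that a thousand tries is a meaningful fraction of it.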
There will probably be some reaction to the effect that the process of assigning numbers needs to be changed. That entirely misses the point: the SSN was only supposed to be an account number for keeping track of Social Security taxes. My original Social Security card (yes, I still have it) says across the front, “Not to be Used for Identification”. Unfortunately, financial services firms and others more or less appropriated the SSN for an authentication role it was never meant to play. Undoubtedly, it was easier than devising a new method: virtually every working person had a number, and all you needed to do was put a 9-digit field in your data base. And, as is so often the case, the people and organizations responsible for designing the data bases and selling them for commercial purposes don’t bear the direct cost of the fraud that this sloppy design enables.
Perhaps it will be possible at some point to convince policy makers to do something about this:
Ross Anderson, a professor of security engineering at Cambridge University, said the findings suggest that businesses using SSNs as a password are being negligent, and should find other ways of verifying the claims to identity that are being made by their customers.
I’m personally not holding my breath.
The complete study is available for free download at the Proceedings of the National Academy of Sciences web site. The authors have put together a FAQ that covers the substance of their results. Perhaps the most important lesson one can draw is summarized there:
More broadly, our findings highlight the unexpected consequences of the interaction of multiple data sources in modern information economies. They show how non-sensitive personal data (such as information people reveal about themselves online) can be combined with other data sources, also non-sensitive, leading to the inference of much more sensitive information.
The fact that so much data is now available on the Internet has significantly reduced the effort involved in finding out a great deal of information about a person that heretofore would have been scattered around in various paper files. I don’t think we as a society have really come to grips with this yet.