NSF Funds Data Anonymization Project

An article at ThreatPost reports that the US National Science Foundation [NSF] has made a $1.5 million grant to a group of researchers at Purdue University, to study current methods of anonymizing data from individuals, and to evaluate the effectiveness of these methods, both in rendering the data truly anonymous, and in preserving whatever value it has for statistical analysis.  The project, which was already underway, also involves researchers from other institutions, including Indiana University, the Missouri University of Science and Technology, and the Kinsey Institute.

I posted a note here, a little over a year ago, on some of the problems with trying to ensure that data collected for statistical purposes is not individually identifiable.  In brief, the difficulty is that, while it is easy to remove specific pieces of identifying data (such as a Social Security number), it is very difficult to make sure that individuals cannot be identified using a combination of data items, possibly in connection with external information sources.  As far as anyone knows, there is no general-purpose way of accomplishing this.  Suppose, for example, we have an employee database that lists information like name, address, Social Security number, position, and salary.  Being careful of security, we ensure that the Social Security number is encrypted, and we further disallow, for most users, viewing anything but average salary data.  Nonetheless, at least some individual salary data may be easy to get.  If there is only one engineering VP living in a certain ZIP code, a query for the average salary of people in that category (all one of them) returns that person's exact salary.  This example illustrates the general point that the degree to which information is individually identifiable is to some extent a property of the data itself, not just of the restrictions on its access or manipulation.
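The engineering VP example can be made concrete with a small Python sketch.  Everything in it is an assumption made for illustration: the records, the field names, the `average_salary` function, and the `min_group_size` threshold are all hypothetical, and none of it describes what the Purdue project is actually doing.

```python
from statistics import mean

# Hypothetical employee records; every value here is invented for illustration.
EMPLOYEES = [
    {"position": "Engineering VP", "zip": "47906", "salary": 210000},
    {"position": "Engineer", "zip": "47906", "salary": 95000},
    {"position": "Engineer", "zip": "47906", "salary": 105000},
    {"position": "Engineer", "zip": "47907", "salary": 100000},
]


def average_salary(records, min_group_size=1, **criteria):
    """Return the mean salary of the records matching every criterion.

    Returns None when no records match, or when fewer than
    min_group_size records match, so small groups can be suppressed.
    """
    matches = [
        r for r in records
        if all(r[field] == value for field, value in criteria.items())
    ]
    if not matches or len(matches) < min_group_size:
        return None
    return mean(r["salary"] for r in matches)


# "Averages only" access still leaks: this average covers exactly one person.
leaked = average_salary(EMPLOYEES, position="Engineering VP", zip="47906")

# Requiring a minimum group size refuses that particular query...
suppressed = average_salary(
    EMPLOYEES, min_group_size=3, position="Engineering VP", zip="47906"
)

# ...while still answering queries over larger groups.
allowed = average_salary(EMPLOYEES, min_group_size=3, position="Engineer")
```

A minimum group size blocks this one query, but it is not a general fix.  A user who can ask for the average over a large group, and again over the same group minus one person, can recover that person's salary by subtraction, which is one reason the problem is as hard as described above.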

The underlying question has a certain practical urgency.  Many organizations, as diverse as the IRS, Google, credit card companies, and Amazon, collect large amounts of information on their customers.  Given the rash of fraud via “identity theft”, the security of these databases is an important question.  The research team hopes to come up with new ways to ensure the data can be used both effectively and safely.

Textual data, even when explicit identifiers are removed (names, dates, locations), can contain highly identifiable information. For example, a sample of chief complaint fields from the Indiana Network for Patient Care (INPC) found several instances of “phantom limb pain”. Amputees can be visually identifiable, but the HIPAA Safe Harbor rules do not list this as “identifying information”. Any policy explicitly listing all types of identifying data is likely to fail. Through a joint effort with computer science and linguistics, the project is developing new methods to remove specific details from text while preserving meaning, eliminating such highly identifiable information without a priori knowledge of what would be identifying.
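To see why a fixed list of identifier types falls short, here is a rough Python sketch of one list-free alternative: flag any word that appears in very few records, and replace it with a placeholder.  This is a crude frequency heuristic of my own choosing, not the linguistic method the project is developing; the complaint strings, the function names, and the `min_records` threshold are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical chief-complaint strings, invented for illustration only.
COMPLAINTS = [
    "chest pain",
    "chest pain and shortness of breath",
    "abdominal pain",
    "phantom limb pain",
    "shortness of breath",
    "abdominal pain and nausea",
    "nausea",
]

WORD = re.compile(r"[a-z]+")


def rare_terms(records, min_records=2):
    """Return the words that occur in fewer than min_records records."""
    record_counts = Counter()
    for text in records:
        # Count each word once per record, however often it repeats there.
        record_counts.update(set(WORD.findall(text.lower())))
    return {word for word, n in record_counts.items() if n < min_records}


def generalize(text, rare, placeholder="[detail]"):
    """Replace each rare word in text with a placeholder."""
    return WORD.sub(
        lambda m: placeholder if m.group(0) in rare else m.group(0),
        text.lower(),
    )


RARE = rare_terms(COMPLAINTS)
scrubbed = [generalize(c, RARE) for c in COMPLAINTS]
```

On this toy sample the only rare words are “phantom” and “limb”, so that complaint becomes “[detail] [detail] pain” and the others pass through unchanged.  Nobody had to decide in advance that amputation is identifying, but the result also loses clinical detail, which is exactly the trade-off between anonymity and usefulness that the project is meant to study.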

To the extent that changes in the health care system involve more use of clinical data to support “evidence-based” medicine, getting the answers to these questions right will become even more important.
