CIO

DNA hack could make medical privacy impossible

It may now be possible for anyone, even if they follow rigorous privacy and anonymity practices, to be identified by DNA data from people they do not even know.

A paper published in January in the journal Science describes a process by which it's possible to identify by name the donors of DNA samples, even without any demographic or personal information. The technique was developed by a team of geneticists at MIT's Whitehead Institute for Biomedical Research and is intended to demonstrate that science and technology have surpassed the techniques and laws currently in place for safeguarding private medical data, according to Yaniv Erlich, a fellow at Whitehead and member of the research team.

The point was not to reveal private information, but to demonstrate a systemic weakness that will require research, debate and new laws and technology to overcome, Erlich says. The technique relies on the custom of passing family names down through the father's family. By statistically modeling the distribution of family names, the researchers were able to narrow the list of possible contributors of DNA samples. They then pinpointed individuals using a range of other publicly available sources, none of which were directly connected to the original donors and none of which included protected personal data.

This isn't a specific exploit against an effective wall of security, Erlich says. Instead, it demonstrates that genomic research may have grown beyond our ability to conceal the identities of the sources of DNA samples. The team started with a list of genomes that had already been sequenced, mapped and published for the use of genetic researchers. They analyzed the material to find identifying markers on the Y chromosome -- which is present only in men -- because surnames are generally passed down through fathers. They compared those Y markers to databases that list such markers along with the surnames of those from whom the samples were taken, but were not able to match all the samples with surnames using confirmed data. They determined which surnames were most likely to belong to which samples using scientifically accepted statistical models that were designed, among other things, to track the movement of regional populations by following the spread of family names.
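The matching step described above can be illustrated with a small sketch. It is not the researchers' method: the marker values, surnames and the simple "fraction of matching markers" score are all invented for illustration, and the article's statistical surname models are replaced here by a plain best-match ranking.

```python
from collections import defaultdict

# Toy reference records: (surname, {Y marker: repeat count}).
# Every surname and value here is invented for illustration.
REFERENCE = [
    ("Archer", {"DYS19": 14, "DYS390": 24, "DYS391": 10}),
    ("Archer", {"DYS19": 14, "DYS390": 24, "DYS391": 11}),
    ("Baker", {"DYS19": 15, "DYS390": 22, "DYS391": 10}),
    ("Cole", {"DYS19": 14, "DYS390": 23, "DYS391": 10}),
]


def similarity(a, b):
    """Fraction of shared markers on which two profiles agree."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    matches = sum(1 for marker in shared if a[marker] == b[marker])
    return matches / len(shared)


def rank_surnames(query, reference=REFERENCE):
    """Rank surnames by their best-matching reference profile."""
    best = defaultdict(float)
    for surname, profile in reference:
        best[surname] = max(best[surname], similarity(query, profile))
    # Highest score first; ties broken alphabetically.
    return sorted(best.items(), key=lambda item: (-item[1], item[0]))


# An anonymous sample's Y markers (hypothetical values).
ranking = rank_surnames({"DYS19": 14, "DYS390": 24, "DYS391": 10})
```

A ranking like this only suggests likely surnames; it does not identify anyone on its own, which is why the researchers needed the further step described next.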

The next step was more hack than science: The team used record-search engines on the Internet, obituaries, genealogical websites and demographic data from the National Institutes of Health's Human Genetic Cell Repository. Researchers then linked 50 of the samples to the names of those who contributed them.
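A rough sketch of how a candidate surname plus a few demographic details can shrink a list of public records to a handful of people follows. The record fields (birth year, region), the tolerance and every entry are hypothetical; the article does not say which demographic attributes the team actually used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PublicRecord:
    # Hypothetical fields standing in for what a public record might hold.
    name: str
    surname: str
    birth_year: int
    region: str


# Invented records for illustration only.
RECORDS = [
    PublicRecord("Person A", "Archer", 1950, "North"),
    PublicRecord("Person B", "Archer", 1982, "North"),
    PublicRecord("Person C", "Archer", 1951, "South"),
    PublicRecord("Person D", "Cole", 1950, "North"),
]


def narrow(records, surnames, birth_year, region, tolerance=1):
    """Keep records matching a candidate surname, region and approximate birth year."""
    return [
        r
        for r in records
        if r.surname in surnames
        and r.region == region
        and abs(r.birth_year - birth_year) <= tolerance
    ]


matches = narrow(RECORDS, {"Archer"}, birth_year=1950, region="North")
```

Each attribute alone is harmless, but combined they leave a single candidate in this toy data set, which is the weakness the researchers set out to demonstrate.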

Until now, the risk that private genetic data could be made public was considered fairly limited. Data about samples was kept separate from data about donors, and demographic data about the donors could only be supplied after identifiers were removed.

There is a risk to more than just donors, however. Even people who have never contributed a DNA sample could be identified and genetically typed if a relative has ever donated DNA. That scenario is becoming more likely as recreational genetic genealogy sites gain popularity. These sites trace family trees in part through a genetic component, and they make contributed genetic information available to members of the public, often without the same level of controls used by research or medical institutions. Until now, the identity of donors was considered protected if demographic and genetic data were kept in different databases and certain information was masked in the demographic record.

Legislation to keep research institutes from releasing any demographic information about donors would protect patient privacy, but would eliminate the ability of researchers who have identified markers for a particular disease to also identify the ethnic or cultural background of those who might have it, Erlich says. The whole point of scientific research is to publish the results so other researchers can build on them and develop more effective treatments. On the other hand, genetic information can be misused to identify members of ethnic or racial groups targeted for discrimination or other repressive or exploitative purposes, Erlich says.