A Harvard professor has re-identified the names of more than 40% of a
sample of anonymous participants in a high-profile DNA study,
highlighting the danger that the ever-greater amounts of personal data
available in the Internet era could unravel personal secrets.
From the outset, the Personal Genome Project,
set up by Harvard Medical School Professor of Genetics George Church,
has warned participants of the risk that someone someday could identify
them, meaning anyone could look up the intimate medical histories that
many have posted along with their genome data. That day arrived on
Thursday.
Professor Latanya Sweeney, director of the Data Privacy Lab
at Harvard, along with her research assistant and two students, scraped
data on 1,130 people of the now more than 2,500 who have shared their
DNA data for the Personal Genome Project. Church’s project posts
information about the volunteers on the Internet to help researchers
gain new insights about human health and disease. Their names do not
appear, but the profiles list medical conditions including abortions,
illegal drug use, alcoholism, depression, sexually transmitted diseases,
medications and their DNA sequence.
Of the 1,130 volunteers Sweeney and her team reviewed, about 579
provided zip code, date of birth and gender, the three key pieces of
information she needs to identify anonymous people when combined with
information from voter rolls or other public records. Of these, Sweeney
succeeded in naming 241, or 42% of that group. The Personal Genome
Project confirmed that 97% of the names matched those in its database if
nicknames and first name variations were included. She describes her
findings here.
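To make the matching step concrete, here is a minimal Python sketch of this kind of record linkage. It is an illustration only, not Sweeney's actual code: the `link_records` function, the field names and every record below are invented, and a real voter roll would need cleanup (nicknames, inconsistent formatting) that is skipped here.

```python
from collections import defaultdict


def record_key(record):
    """The three fields used for matching: zip code, date of birth, gender."""
    return (record["zip"], record["dob"], record["gender"])


def link_records(profiles, voter_roll):
    """Link nameless profiles to named records that share the same key.

    A profile is linked only when exactly one named record shares its key;
    ambiguous keys (two or more names) and missing keys are skipped.
    """
    names_by_key = defaultdict(list)
    for voter in voter_roll:
        names_by_key[record_key(voter)].append(voter["name"])

    matches = {}
    for profile in profiles:
        names = names_by_key.get(record_key(profile), [])
        if len(names) == 1:
            matches[profile["id"]] = names[0]
    return matches


# Entirely made-up data for illustration.
profiles = [
    {"id": "p1", "zip": "02138", "dob": "1961-07-04", "gender": "F"},
    {"id": "p2", "zip": "02139", "dob": "1975-01-15", "gender": "M"},
    {"id": "p3", "zip": "02140", "dob": "1980-03-09", "gender": "F"},
]
voter_roll = [
    {"name": "Alice Example", "zip": "02138", "dob": "1961-07-04", "gender": "F"},
    {"name": "Bob Example", "zip": "02139", "dob": "1975-01-15", "gender": "M"},
    {"name": "Bill Example", "zip": "02139", "dob": "1975-01-15", "gender": "M"},
]

linked = link_records(profiles, voter_roll)
print(linked)  # {'p1': 'Alice Example'}

# The rate reported in the study: 241 names out of 579 complete profiles.
print(f"{241 / 579:.0%}")  # 42%
```

In the toy data only the first profile is named: the second shares its key with two voters, and the third has no counterpart in the roll.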
Sweeney has also set up a web page
for anyone to test how unique their birthdate, gender and zip are in
combination. When I tried it, I was the only match in my zip code,
suggesting that I, like so many others, would be easy to re-identify.
“This allows us to show the vulnerabilities and to show that they can be
identified by name,” she said. “Vulnerabilities exist but there are
solutions too.”
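A uniqueness check of this kind comes down to counting how many people in a population share the same three values. The Python sketch below shows that idea only; it does not reflect how Sweeney's page is built, and the `count_matches` function and the tiny population are made up for illustration.

```python
def count_matches(population, zip_code, dob, gender):
    """Count people in `population` who share all three field values."""
    target = (zip_code, dob, gender)
    return sum(
        1
        for person in population
        if (person["zip"], person["dob"], person["gender"]) == target
    )


# Invented population; a real check would use census or voter data.
population = [
    {"zip": "02138", "dob": "1961-07-04", "gender": "F"},
    {"zip": "02138", "dob": "1961-07-04", "gender": "M"},
    {"zip": "02138", "dob": "1990-12-01", "gender": "F"},
    {"zip": "02139", "dob": "1961-07-04", "gender": "F"},
]

n = count_matches(population, "02138", "1961-07-04", "F")
print(n)  # 1
```

A count of one means the combination points to a single person, which is the situation described above.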
(Personal disclosure: I work closely with Professor Sweeney in the
Harvard Department of Government on topics related to my book research
on the business of personal data, but was not involved with this study.)
On Thursday, researchers and participants in the Personal Genome Project gathered in Boston for a conference
timed to mark the 60th anniversary of James Watson and Francis Crick’s
publication of their discovery of the DNA double helix structure in
April 1953. Sweeney and her research assistant set up a table at the
conference where participants could find out whether they could easily
be identified. Sweeney sought not to out the study participants, but
rather to demonstrate to them how providing a little less
information–for example, just birth year rather than exact birth date,
and three digits rather than five or nine from the zip code–could help
preserve anonymity for participants.
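The effect of sharing coarser details can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the analysis Sweeney ran: the helper names (`exact_key`, `generalized_key`, `unique_fraction`) and the five records are invented, and dates are assumed to be stored as `YYYY-MM-DD` strings.

```python
from collections import Counter


def exact_key(record):
    """Full detail: five-digit zip, exact birth date, gender."""
    return (record["zip"], record["dob"], record["gender"])


def generalized_key(record):
    """Coarser detail: first three zip digits, birth year only, gender."""
    return (record["zip"][:3], record["dob"][:4], record["gender"])


def unique_fraction(records, key):
    """Share of records whose key is held by no other record."""
    counts = Counter(key(r) for r in records)
    return sum(1 for r in records if counts[key(r)] == 1) / len(records)


# Made-up records for illustration.
records = [
    {"zip": "02138", "dob": "1961-07-04", "gender": "F"},
    {"zip": "02139", "dob": "1961-11-23", "gender": "F"},
    {"zip": "02140", "dob": "1975-01-15", "gender": "M"},
    {"zip": "02141", "dob": "1975-06-30", "gender": "M"},
    {"zip": "90210", "dob": "1980-03-09", "gender": "F"},
]

before = unique_fraction(records, exact_key)       # every record is unique
after = unique_fraction(records, generalized_key)  # only the 90210 record is
print(before, after)  # 1.0 0.2
```

With full detail every record in this toy set stands alone; after coarsening, four of the five fall into groups of two and can no longer be singled out by these fields.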
Several participants said they expected someone would one day
re-identify them and said they were not particularly concerned.
Volunteer Gabriel Dean said he was far more worried about another future
threat forecast by the experiment, that one day criminals might be able
to replicate DNA and place some at the scene of a crime. The conference
took place a few blocks from the scene of the Boston Marathon bombing
earlier this month.
Volunteer Lenore Snyder, however, said that she did not want to be
identified and as a result did not provide her zip code and some other
identifying characteristics in her profile. She said her genetic testing
suggests she has an intellectual disability, even though she is a
molecular biologist with a PhD. “People don’t know how to interpret
this,” she said. “It’s dangerous. A little bit of information is
dangerous.”
Sweeney’s latest findings build on a 1997 study she did that showed
she could identify up to 87% of the U.S. population with just zip code,
birthdate and gender. She was also able to identify then-Massachusetts
Gov. William Weld from anonymous hospital discharge records.
The same techniques could be used to identify people in various
surveys and records, pharmacy purchases, or from a wide variety of
seemingly anonymous activities such as Internet searches. Figuring out
clues about people could also enable identity theft. “I believe that
many people in the current interconnected digital world are not aware of
how easy it is to identify them with a high level of granularity,” says
Keith Batchelder, the founder of Genomic Healthcare Strategies in
Charlestown, Massachusetts, and one of the first ten volunteers in the
Personal Genome Project.
Church, who maintains a thick mountain-man beard, says that advances
in data and in medicine make it impossible to guarantee anonymity for
most medical experiment volunteers. Church has participated as a
volunteer himself in past medical studies and scoffs at claims that such
data can remain anonymous. Every year his university sends him an
anonymous survey. He scribbles in some additional information at the
beginning of the form. “My name is George Church, you could figure that
out anyway,” he writes.
His Personal Genome Project makes no privacy promises at all. “The
Personal Genome Project is a new form of public genomics research and,
as a result, it is impossible to accurately predict all of the possible
risks and discomforts that you might experience,” the 24-page consent
form tells users. Later it specifies some possible risks: “The data
that you provide to the PGP may be used, on its own or in combination
with your previously shared data, to identify you as a participant in
otherwise private and/or confidential research.”
Volunteers take an online exam about the risks they face before they
are allowed into the program. And the test does not pose a universal
‘you do understand the risks’ question. It has 20 questions and Church
requires a perfect score. Potential volunteers can take the test as many
times as they want until they pass. One person took the test 90 times
before passing.
Given what Church sees as the flaws in preserving privacy in the
Internet age, he has embraced openness about many aspects of his own
history. On his personal home page
he posts the exact coordinates of his home, his birthdate and parents,
medical problems (heart attack, carcinoma, narcolepsy, dyslexia,
pneumonia, motion sickness) and even a copy of the 1976 letter booting
him out of Duke University for getting an F in his graduate major
subject.
Many of the early participants in the Personal Genome Project share
the same ‘let it all hang out’ ethos. Volunteer Steven Pinker, a
well-known experimental psychologist and author of the 2011 book “The
Better Angels of Our Nature,” posts his genome and a 1996 scan of his
brain on his web page. He says even data as in-depth as his genome and medical records does not provide especially deep insights into a person.
“There just isn’t going to be an ‘honesty gene’ or anything else that
would be nearly as informative as a person’s behavior, which, after
all, reflects the effect of all three billion base pairs and their
interactions together with chance, environmental effects, and personal
history,” he says. “As for the medical records, I just don’t think
anyone is particularly interested in my back pain.”
Could companies use medical information to single out people to deny
them services? Might a bank, for example, turn down a loan to someone
because their health records suggest they may die at a young age? Even
though Church expected re-identification of his volunteers, he does not
think so. “These companies are not yet highly motivated to do that and
probably judging from the way the wind’s blowing on the Genetic
Information Nondiscrimination Act they would be ill advised to do that
from a public relations standpoint,” he says, referring to the 2008 law.
In a different study released earlier this year, researcher Yaniv
Erlich at the Whitehead Institute for Biomedical Research in Cambridge,
Massachusetts, was also able to re-identify almost 50 people
participating in a different genomic study. He said that he does not
know of anyone who has suffered harm to date from such
re-identifications, but pointed out the current ethical debate “emerged
from the very bad history of the field in the first half of the 20th
century, where bad genetic and abundance of records of familial
genealogy contributed to one of the most horrific crimes.”
Misha Angrist, an assistant professor of the practice at the Duke
Institute for Genome Sciences & Policy and one of the original ten
to participate in the Personal Genome Project, praises the
re-identification experiments by researchers such as Sweeney and Erlich.
“It is a nuisance to scientists who are trying to operate under the
status quo and to tell their participants with a straight face, you
know, it’s very unlikely that you will be identified,” he says. “It is
useful for pointing out that the emperor has no clothes, that absolute
privacy and confidentiality are illusory.”