

Genetic data can be analysed to estimate your risk of certain conditions
Genetic risk scores that summarise a person’s likelihood of getting certain health conditions can be exploited through mathematical tricks to reveal hidden details about their DNA.
The method could theoretically be used by health insurers to reconstruct genetic data from a summary genomic report, revealing health risks not divulged by the patient. Alternatively, people sharing their scores anonymously could be identified by extracting the genetic data and querying public genealogy databases.
Polygenic risk scores measure the combined impact of tens to thousands of single-letter variations in the genome, known as single-nucleotide polymorphisms (SNPs). Used by researchers and DNA testing companies such as 23andMe to summarise potential health risks, the scores are sometimes shared publicly, for example by people asking for advice on interpreting them.
Unpacking a polygenic risk score is like trying to work out a phone number knowing only that the digits add up to 52. It’s an example of the knapsack problem in mathematics, known to be computationally difficult. Because of this, the scores are seen as a low privacy risk.
However, each SNP value used in a risk score is multiplied by an extremely precise weight – up to 16 digits long – that reflects its contribution to the overall disease risk. With weights that precise, very few combinations of SNP values add up to exactly the same score, which makes small risk models vulnerable to attack.
“Because the final polygenic risk score is constrained by a finite number of ways you could arrive at that number, and a statistically likely arrangement of the underlying SNPs, it can be deduced with a high degree of accuracy,” says Gamze Gürsoy at Columbia University in New York.
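To see why precise weights matter, here is a minimal Python sketch of a score computed as a weighted sum of SNP values. The four weights are made-up illustrative numbers, not taken from any real model, and the function names are hypothetical. The sketch counts how many distinct scores the 81 possible four-SNP genotypes produce, first with full-precision weights and then with the weights rounded to one decimal place.

```python
from itertools import product

# Hypothetical weights for a toy 4-SNP model (illustrative values only).
WEIGHTS = [0.1834729105518273, 0.0472910385561029,
           0.2290175548201937, 0.0918273645501827]


def polygenic_score(genotype, weights):
    """Weighted sum of SNP values (0, 1 or 2 copies of the scored allele)."""
    return sum(g * w for g, w in zip(genotype, weights))


def count_distinct_scores(weights, decimals=12):
    """Count the distinct scores produced by all 3**n possible genotypes."""
    scores = {
        round(polygenic_score(genotype, weights), decimals)
        for genotype in product((0, 1, 2), repeat=len(weights))
    }
    return len(scores)


precise = count_distinct_scores(WEIGHTS)
coarse = count_distinct_scores([round(w, 1) for w in WEIGHTS])
print(f"distinct scores with precise weights: {precise}")
print(f"distinct scores with rounded weights: {coarse}")
```

When every genotype maps to its own score, the score alone pins down the genotype. Rounding the weights makes many genotypes collide on the same score, which is what the knapsack intuition assumes.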
Gürsoy and Kirill Nikitin, also at Columbia, ran 298 polygenic risk models that use 50 SNPs or fewer on genetic data from 2353 individuals. Working backwards, they calculated all the possible genomes that could have produced each given score, filtering out those with many uncommon mutations.
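The idea of working backwards can be sketched with a brute-force search. This is an illustration only, not the researchers' actual algorithm: it tries every genotype, so it is practical only for a handful of SNPs. The weights, allele frequencies and function names are hypothetical, and the ranking step assumes each SNP follows Hardy-Weinberg proportions, standing in for the article's "filtering out those with many uncommon mutations".

```python
from itertools import product
from math import log

# Hypothetical model: weights and allele frequencies are illustrative only.
WEIGHTS = [0.1834729105518273, 0.0472910385561029,
           0.2290175548201937, 0.0918273645501827]
ALLELE_FREQS = [0.30, 0.10, 0.50, 0.20]


def score_of(genotype, weights):
    """Weighted sum of SNP values (0, 1 or 2 copies of the scored allele)."""
    return sum(g * w for g, w in zip(genotype, weights))


def genotype_log_prob(genotype, allele_freqs):
    """Log-probability of a genotype, assuming Hardy-Weinberg proportions."""
    total = 0.0
    for g, p in zip(genotype, allele_freqs):
        probs = ((1 - p) ** 2, 2 * p * (1 - p), p ** 2)
        total += log(probs[g])
    return total


def recover_genotypes(score, weights, allele_freqs, tol=1e-12):
    """Return every genotype matching the score, most plausible first."""
    candidates = [
        genotype
        for genotype in product((0, 1, 2), repeat=len(weights))
        if abs(score_of(genotype, weights) - score) <= tol
    ]
    return sorted(candidates,
                  key=lambda g: genotype_log_prob(g, allele_freqs),
                  reverse=True)


secret = (1, 0, 2, 1)                      # the genotype an attacker never sees
published = score_of(secret, WEIGHTS)      # the only value that is shared
print(recover_genotypes(published, WEIGHTS, ALLELE_FREQS))
```

The search space triples with every extra SNP, so a model with 50 SNPs is far beyond this naive loop; the sketch only shows why a precise score leaves so few candidates.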
As one SNP may be used by multiple polygenic risk models, Gürsoy and Nikitin were able to daisy-chain their attack, using SNPs revealed by smaller models to help solve larger ones.
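A rough sketch of that chaining step, again with invented SNP names and weights: once a small model has given up some SNP values, their contribution can be subtracted from a larger model's score, leaving fewer unknowns to search. The function `solve_with_known` is hypothetical and uses the same brute-force search as a stand-in for whatever solver an attacker might really use.

```python
from itertools import product

# Two hypothetical models sharing SNPs "snp_a" and "snp_b" (illustrative weights).
SMALL_MODEL = {"snp_a": 0.1834729105518273, "snp_b": 0.0472910385561029}
LARGE_MODEL = {"snp_a": 0.0731920465528374, "snp_b": 0.1129384756610293,
               "snp_c": 0.2290175548201937, "snp_d": 0.0918273645501827,
               "snp_e": 0.1540392817465019}


def model_score(model, genotype):
    """Weighted sum over the SNPs that a model uses."""
    return sum(weight * genotype[snp] for snp, weight in model.items())


def solve_with_known(score, model, known, tol=1e-12):
    """Find SNP values consistent with a score, reusing already-known SNPs."""
    fixed = {snp: known[snp] for snp in model if snp in known}
    unknown = [snp for snp in model if snp not in known]
    # Remove the part of the score already explained by known SNPs.
    residual = score - sum(model[snp] * value for snp, value in fixed.items())
    solutions = []
    for combo in product((0, 1, 2), repeat=len(unknown)):
        partial = sum(model[snp] * g for snp, g in zip(unknown, combo))
        if abs(partial - residual) <= tol:
            solutions.append({**fixed, **dict(zip(unknown, combo))})
    return solutions


person = {"snp_a": 2, "snp_b": 1, "snp_c": 0, "snp_d": 1, "snp_e": 2}
first = solve_with_known(model_score(SMALL_MODEL, person), SMALL_MODEL, {})
second = solve_with_known(model_score(LARGE_MODEL, person), LARGE_MODEL, first[0])
print(second)
```

Here the larger model drops from 243 possible genotypes to 27 once the two shared SNPs are fixed, which hints at how solved models can make bigger ones tractable.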
They were able to reconstruct the donor genotype with 94.6 per cent accuracy, correctly predicting 2450 SNPs per individual. Tests showed 27 SNPs were enough to identify an individual in a pool of half a million samples, and family members could be predicted with up to 90 per cent precision. Individuals of African and East Asian descent were more easily identified because they are less well represented in genetic databases.
According to Gürsoy, 447 small, high-precision models in a public database of polygenic scores are vulnerable to this attack.
“We wanted to point out that the risk is low, but under [some conditions], there might still be some leakage,” says Gürsoy. “We should consider this when designing research studies, especially if we are involving vulnerable populations.”
Ying Wang at Massachusetts General Hospital says existing data protections and computational bottlenecks limit the risk of polygenic risk scores being exploited in this way. “The results may serve as a caution that small models should be treated as potentially sensitive data in clinical reporting and informed consent discussions,” she says.


