Biomedical data is vital in medicine, especially when it is successfully shared between researchers and healthcare providers. Sharing raises privacy concerns, however, since even anonymized records can be re-identified.
Most organizations have robust frameworks for formally assessing re-identification risk, and these inform how and when data can be shared. Those decisions can have a significant impact on the privacy of the people who provide the data, especially when adversaries are able to draw on several resources.
Research from Vanderbilt University explores this through a “re-identification game” designed to better assess the risks involved and the trade-off between maintaining privacy and retaining utility. The researchers used huge genomic datasets to run experiments based on game theory models, introducing various adversarial capabilities and testing how data can be shared effectively while minimizing the risk of re-identification.
Biomedical data is typically collected en masse, often from electronic medical records but also from other sources, including the various genetic testing companies that exist today. This exchange is often framed as maximizing the societal value of the data, but it nonetheless has privacy implications for the people the data describes.
The researchers attempt to model and quantify the privacy trade-offs for individuals when under multistage attack. They explain that most efforts to reduce the ability of attackers to re-identify biomedical data focus excessively on worst-case scenarios, and this focus tends to result in an overestimation of the actual privacy risk we face. By introducing a risk assessment approach based on game theory, the researchers hope to provide a more realistic perspective.
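To make the contrast concrete, here is a minimal sketch in Python of the difference between a worst-case estimate and one that accounts for the adversary's incentives. It is not the researchers' model: the function names, the idea of a fixed attack cost and gain, and all the numbers are assumptions chosen purely for illustration.

```python
def worst_case_risk(p_success: float) -> float:
    """Risk estimate that assumes the adversary always attacks."""
    return p_success


def rational_adversary_risk(p_success: float, gain: float, cost: float) -> float:
    """Risk estimate that assumes the adversary attacks only when it pays off.

    gain and cost are hypothetical values: what a successful
    re-identification is worth to the adversary, and what an attempt costs.
    """
    expected_payoff = p_success * gain - cost
    return p_success if expected_payoff > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical numbers: a 20% chance of success, a gain of 100, a cost of 50.
    print(worst_case_risk(0.2))
    print(rational_adversary_risk(0.2, gain=100, cost=50))
```

With these made-up numbers the expected payoff of attacking is negative, so the second estimate treats the attack as not worth mounting, while the worst-case estimate still reports the full 20% chance.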
Through this approach, they believe they can help us to develop a better sharing strategy, one that delivers the maximum medical benefits while maintaining privacy. They tested it in a number of experiments in which multistage attacks were played out using either large-scale simulations or large real-world datasets.
The experiments show how well such an approach can assess the risks involved and support mitigation strategies that minimize them. Indeed, the researchers believe that their models could help to produce an optimal sharing strategy, one that makes the fullest use of the data while keeping it medically useful and keeping the sharing process fair to all stakeholders.
For instance, the researchers identified a scenario whereby the subject of a particular piece of data could choose how much of it would be shared with public datasets, such as the Personal Genome Project or the 1000 Genomes Project. Indeed, they could choose whether to share the entire sequence, specific aspects of their genome, or nothing at all.
Data sharing is increasingly important for the medical research sector, so the sector needs models and processes that support sharing as much data as possible while securing the rights and privacy of individuals and their data. This would help healthcare providers, medical researchers, data management teams, and others to better understand the monetary benefits of data sharing and the various risks of re-identification by an adversary.
In the simulations, the subject first decided how much of their data to share, and the adversary then decided whether or not to attempt an attack. This was followed by a response phase, in which subjects deployed masking strategies, the adversary responded to them, and so on.
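A single round of such a game can be sketched in Python using backward induction: work out the adversary's best response to each sharing choice first, then pick the choice that leaves the subject best off. This is a simplified illustration rather than the researchers' multistage model. It omits the masking and response phases, and the `SharingOption` class, the payoff formula, and every number below are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SharingOption:
    name: str
    benefit: float  # hypothetical value the subject gets from sharing this much
    p_reid: float   # hypothetical chance that an attack re-identifies the subject


def adversary_attacks(option: SharingOption, gain: float, cost: float) -> bool:
    """The adversary attacks only if the expected gain exceeds the attack cost."""
    return option.p_reid * gain > cost


def subject_payoff(option: SharingOption, gain: float, cost: float, loss: float) -> float:
    """Subject's payoff once the adversary's best response is taken into account."""
    if adversary_attacks(option, gain, cost):
        return option.benefit - option.p_reid * loss
    return option.benefit


def best_sharing_option(options, gain: float, cost: float, loss: float) -> SharingOption:
    """Backward induction: choose the option with the highest resulting payoff."""
    return max(options, key=lambda o: subject_payoff(o, gain, cost, loss))


# Three choices mirroring "nothing", "part of the genome", and "the entire sequence".
OPTIONS = [
    SharingOption("nothing", benefit=0.0, p_reid=0.0),
    SharingOption("partial", benefit=6.0, p_reid=0.1),
    SharingOption("full", benefit=10.0, p_reid=0.6),
]

if __name__ == "__main__":
    choice = best_sharing_option(OPTIONS, gain=100, cost=20, loss=50)
    print(choice.name)
```

Under these assumed values, sharing everything invites an attack and leaves the subject worse off, so partial sharing comes out ahead. Raising the adversary's cost to 70 makes even the full release unprofitable to attack, and full sharing becomes the best choice.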
The simulations included around 20 personal attributes per subject, along with 16 different genetic characteristics. Several scenarios were compared, with the adversary striving to re-identify the subjects in each of them. Among the scenarios, a “no protection” approach showed the most variation in the utility obtained from the data.
The models were tested against eight distinct parameters and another three experimental conditions to show how they might work in a real-world environment. The researchers believe that their approach allows subjects to make better and more informed decisions about data sharing, especially in the face of increasingly sophisticated and complex re-identification technologies.
They hope that such models will be increasingly used in data-sharing conversations to help answer some of the questions people have about the risks of re-identification and therefore help people to make smarter choices with regard to the precise amount of data they share and with whom.