Assessing Disclosure Risks with Genomic Data

UoM administered thesis: PhD

  • Authors:
  • Sahel Shariati Samani

Abstract

The genomics revolution promises to bring advances in every area of our lives and is generating huge quantities of data for analysis. However, these data are sensitive and their potential cannot be realised without addressing complex questions of privacy. Genomics is not the first field to face these questions; for many decades, balancing confidentiality and data utility has been a concern for data stewardship organisations such as national statistical institutes. This led to the emergence of the field of Statistical Disclosure Control (SDC) to formulate this problem statistically. In this thesis, I explore some of the privacy issues of genomic data by drawing on key concepts from SDC. In the first paper, six possible scenarios where disclosure may occur are defined and analysed. The analysis shows that, although assessing the disclosure risk of genomic data is not a straightforward task, the risk is potentially being overestimated in many cases. Several factors that affect the overall risk of disclosure have been neglected in most previous work. In particular, having detailed knowledge of the data and significant expertise in genetics and genomics is crucial. The risk also depends on the data environment, and this research suggests that the disclosure risk of each genomic dataset must be assessed individually and systematically, with a focus on the actual attack procedure. In paper two, one high-profile attack scenario, a patrilineal linkage attack, is considered in detail and a model of the risk of re-identifying genomic data (in the population of Wales and England) via this route is developed. The work demonstrates that re-identification is possible; however, the risk is low in the studied population. This work re-emphasises the importance of the data environment and external resources used in an attack and shows that they significantly affect overall risk, which also depends on the characteristics of an individual genomic dataset.
The paper also demonstrates how attaching geo-demographic metadata to genomic data can facilitate re-identification and hence advises caution with such attachments. Paper three considers the issue of linkage disequilibrium (the non-random correlation of allelic forms of different genes) and its impact on the intruder’s power to carry out inference attacks on regions of the genome which are suppressed for privacy reasons. By generating a variety of genomic data models, the work demonstrates that intruders can design more powerful attacks using higher-order correlations. The evaluation shows that this correlation cannot be captured properly using the lower-order models found in the existing literature, and therefore they cannot be relied upon when designing privacy-preserving techniques. The overarching conclusion is that SDC and genomic privacy can learn from each other. Genomic privacy can benefit from the systematic approach that SDC provides. SDC can benefit from considering the new and more complex genomic data forms and therefore enhance its relevance as we move from the world of singular rectangular databases to one of an interconnected web of variegated data.

Details

Original language: English
Awarding Institution
Supervisors/Advisors
Award date: 1 Aug 2018