Homomorphic Encryption of Genotypes and Phenotypes for Quantitative Genetics
Lead Research Organisation:
UNIVERSITY COLLEGE LONDON
Department Name: UCL Genetics Institute
Abstract
In order to identify genes that are associated with important traits like disease in humans, or improved yield in crops, it is necessary to analyse very large samples of individuals. Often this involves sharing genetic and other data collected in different studies, and there are risks to individuals' genetic privacy if these data are shared as plaintext. Homomorphic encryption refers to a type of data encryption that obscures the original plaintext data by replacing it with a ciphertext which nonetheless contains sufficient structure that it is still possible to perform the same data analyses as with the plaintext, thereby increasing the power to make discoveries whilst maintaining genetic privacy.
We have previously developed a method for homomorphic encryption of genotype and phenotype data, based on random high-dimensional rotations of data. In this proposal we will develop our method into a practical tool that can be used by geneticists and other scientists. This will involve writing a software implementation that can operate on very large datasets, and working closely with stakeholders to ensure the code is as useful as possible.
Technical Summary
Quantitative genetic analysis - such as calculating heritability, testing genetic association, and using mixed linear models to control for unequal relatedness between individuals - is a cornerstone of several important areas of genetics, including human complex disease mapping, and animal and crop improvement. To make progress it is often necessary to share data between studies, but privacy concerns sometimes prevent or delay data sharing. We previously developed a method that uses random orthogonal matrix keys to encrypt genotype and phenotype plaintext into ciphertext that closely resembles samples of Gaussian deviates. Orthogonal transformation leaves unchanged key parts of the quantitative genetic machinery, including the likelihood, parameters, heritability and the effects of a mixed model transformation. However, it scrambles the identities of individuals by replacing individual genotypes with random linear superpositions.
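The invariance described above can be illustrated with a minimal numerical sketch. It is not the project's software: the function names, the simulated data and the choice of ordinary least squares as the analysis are all assumptions made for illustration. It draws a random orthogonal key, rotates a simulated design matrix (intercept plus genotypes) and phenotype vector with it, and compares the fitted effects and residual sum of squares before and after rotation.

```python
import numpy as np

def random_orthogonal_key(n, rng):
    """Hypothetical helper: draw an n x n orthogonal matrix via QR of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    # Flip column signs using the diagonal of R so the draw is uniform over the orthogonal group.
    return q * np.sign(np.diag(r))

def ols(design, response):
    """Return least-squares coefficients and the residual sum of squares."""
    beta, *_ = np.linalg.lstsq(design, response, rcond=None)
    resid = response - design @ beta
    return beta, float(resid @ resid)

rng = np.random.default_rng(0)
n, p = 200, 5  # simulated individuals and markers (illustrative sizes)

# Simulated plaintext: genotype dosages 0/1/2, an intercept column, Gaussian errors.
genotypes = rng.integers(0, 3, size=(n, p)).astype(float)
design = np.column_stack([np.ones(n), genotypes])
true_beta = rng.standard_normal(p + 1)
phenotype = design @ true_beta + rng.standard_normal(n)

# "Encrypt" by rotating rows (individuals) with the same key for design and phenotype.
key = random_orthogonal_key(n, rng)
design_enc = key @ design
phenotype_enc = key @ phenotype

beta_plain, rss_plain = ols(design, phenotype)
beta_enc, rss_enc = ols(design_enc, phenotype_enc)

print("max |difference in effects|:", np.max(np.abs(beta_plain - beta_enc)))
print("difference in RSS:", abs(rss_plain - rss_enc))
```

Note that the intercept column is rotated together with the genotypes, so each row of the rotated data is a linear mixture of all individuals rather than a record of any one of them. The estimates agree up to floating-point error because the key satisfies Q'Q = I, so the Gaussian likelihood is unchanged.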
We propose to develop the use of random orthogonal matrix keys into a fully-fledged methodology and software package that can be used routinely by genetics researchers to share and analyse genetic data. We will also extend the methodology to other datatypes such as transcriptomic data, provided the analysis fits within a mixed model framework with Normal errors. We will aim to identify and correct any weaknesses that might permit decryption, and to work with potential users of the system in human, plant and animal genetics, to propagate its use and thereby accelerate the sharing of genetic data and the adoption of the FAIR (Findable, Accessible, Interoperable and Reusable) principles.
| People | ORCID iD |
| Richard Mott (Principal Investigator) | |
Publications
| Description | Collaboration to test homomorphic encryption for animal breeding |
| Organisation | Iowa State University |
| Country | United States |
| Sector | Academic/University |
| PI Contribution | We are collaborating to test whether the encryption methods we are developing can be used for animal breeding, using a commercial pig data set as a test case. |
| Collaborator Contribution | A seed grant application to AG2P was successful. This paid for a postdoc at UC Davis to evaluate the methodology in the context of Bayesian QTL mapping; the evaluation was successful. A paper is in preparation. A grant application was then submitted to USDA to continue the work (decision expected mid-2023). |
| Impact | None yet |
| Start Year | 2022 |
| Description | Collaboration with Gene Network |
| Organisation | University of Tennessee |
| Country | United States |
| Sector | Academic/University |
| PI Contribution | We are collaborating with researchers at the University of Tennessee Health Sciences Center, USA, to implement the HEGP genotype privacy methodology in their GeneNetwork system. |
| Collaborator Contribution | Our partners will use the HEGP system to encrypt genotypes, enhancing the genetic privacy of the GeneNetwork system and database. |
| Impact | None to date |
| Start Year | 2023 |
| Description | Homomorphic Encryption and Privacy Enhancing Technology Meeting, NIH USA September 6, 2024 |
| Form Of Engagement Activity | A formal working group, expert panel or dialogue |
| Part Of Official Scheme? | No |
| Geographic Reach | International |
| Primary Audience | Policymakers/politicians |
| Results and Impact | The aim of the meeting (held at the NIH campus in Bethesda, MD) was to bring together experts in homomorphic encryption and experts in US health policy to write a position paper on the use of privacy-preserving technologies in medical research and (eventually) healthcare. Discussions are ongoing. The meeting was the culmination of the online webinar series held in 2024, as detailed in the URL. Richard Mott gave a webinar as part of this series, available at https://www.youtube.com/watch?v=1DW25GlYAt0 |
| Year(s) Of Engagement Activity | 2024 |
| URL | https://datascience.nih.gov/homomorphic-encryption-and-privacy-enhancing-technologies-webinar-series |
| Description | Seminar at NIAB, Cambridge, UK Feb 1st 2024 |
| Form Of Engagement Activity | A talk or presentation |
| Part Of Official Scheme? | No |
| Geographic Reach | International |
| Primary Audience | Professional Practitioners |
| Results and Impact | Richard Mott gave a seminar "Sharing confidential genetic data for crop improvement" at the National Institute for Agricultural Botany, Cambridge UK on 1st February 2024. |
| Year(s) Of Engagement Activity | 2024 |
| URL | https://www.niab.com/professor-ian-mackay-seminar |
| Description | Seminar at the conference "At the forefront of plant research 2023", Barcelona Spain, organised by CRAG |
| Form Of Engagement Activity | A talk or presentation |
| Part Of Official Scheme? | No |
| Geographic Reach | International |
| Primary Audience | Professional Practitioners |
| Results and Impact | On May 10th 2023 Richard Mott gave a talk "Toolkits for animal and crop improvement: Multiparental populations to dissect standing variation, and protocols to share proprietary genetic and phenotypic data". Abstract: We present two approaches to identify alleles relevant for improvement of animals and crops. These toolboxes represent the extremes of how we can leverage information from genetic and phenotypic data. Multiparental populations are descended from a selected set of founders, such that their chromosomes are random mosaics of the founders' chromosomes. They are particularly useful for crop breeding, where genetically stable recombinant inbred lines may be bred and phenotyped across different environments to map loci relevant to agronomic traits. We present data from a multiparental population of 500 lines descended from 16 UK wheat varieties selected to represent germplasm across the past 70 years, to show how these populations can be used to understand the genetic architecture of complex traits, and to recapitulate the impact of the Green Revolution. An alternative to breeding new populations is to utilise existing genetic and phenotypic data, by combining datasets across institutions, including proprietary data held by breeding companies. To do so we must overcome the challenge of sharing data in such a way that individual-level information is hidden whilst the computations needed to discover relevant genetic loci are unimpeded. We present a protocol based on random orthogonal transformations which is suitable for all quantitative genetic analyses that employ Gaussian likelihoods, including linear mixed models and many Bayesian methods. We demonstrate its use with proprietary pig data. |
| Year(s) Of Engagement Activity | 2023 |
| URL | https://www.cragenomica.es/events/forefront-plant-research-2023 |