Machine learning approaches for clinical diagnosis of autoimmune diseases with the T-cell receptor repertoire
Lead Research Organisation: University of Liverpool
Department Name: Institute of Translational Medicine
Abstract
Genetic risk factors for some autoimmune conditions implicate T cells in disease mechanisms that are incompletely understood. The T-cell receptor (TCR) is encoded by genes that are recombined from an assortment of gene segments in the nuclei of T cells. The vastly diverse TCR repertoire arising from an individual's T cells evolved to bind to a wide variety of threats. T cell activation is initiated by TCR binding, which leads to clonal expansion. A lineage of T cells expressing the same TCR includes cells that participate in an active immune response, and some that persist to enable immunological memory. In autoimmune disease, T cells may be involved in an immune response directed against the host's own tissues or microbiome. Next generation sequencing has enabled vast libraries of TCRs to be sequenced, which presents a unique opportunity to better understand autoimmune disease. From a set of TCR repertoire samples, patterns associated with a condition might be identifiable through interpretation of a machine learning classification model. However, the limited sharing of identical TCRs between individuals with the same condition, as well as the vast outnumbering of samples by unique TCR sequences, leads to difficulty identifying signatures of TCR repertoires that are predictive of autoimmune disease status.
Promising TCR repertoire classification approaches consider relationships between non-identical TCR sequences. Methods that split TCR sequences into kmers demonstrate efficient performance that is comparable to deep learning. This work is dedicated to investigating the utility of methods that augment kmer-based representations of the TCR repertoire. Throughout, methodology is evaluated using real TCR repertoire datasets including samples from patients with coeliac disease and inflammatory bowel disease, as well as participants with cytomegalovirus infection. TCR repertoires are also simulated to guide methodological development. To assess the hypothesis that capturing similarity of kmers in a TCR repertoire representation will improve generalisability, a novel approach employing a reduced amino acid alphabet is benchmarked against alternatives to reveal the limited utility of property-informed kmers alone. However, one exception when classifying TCR repertoires from a rarer subset of T cells in the small intestine by coeliac disease status suggests that appropriate use cases may exist for the approach. Next, the notion that some kmers may be more informative than others leads to exploration of a deviation-based kmer filter, which indicates that adequate regularisation precludes the need for filtering. Further, a likelihood-based normalisation of kmer counts is found to be sensitive to inaccuracies that one might expect in real TCR repertoire data.
Methodology presented in this thesis may improve generalisability of certain TCR repertoire classification models, though this cannot be concluded universally. While results demonstrate the potential to identify TCR repertoire patterns that might be associated with autoimmune disease, further development of TCR repertoire classification approaches is warranted in coordination with more advanced TCR repertoire sequencing techniques. The ability to gain insights into the underlying mechanisms of autoimmune disease will also rely on experimental validation.
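As a rough illustration of the kmer-based representation and the reduced amino acid alphabet described in the abstract, the following minimal Python sketch counts kmers across a toy repertoire, with and without mapping amino acids to property classes. It is not the thesis implementation: the grouping in `REDUCED_ALPHABET`, the function names, the choice of k = 3 and the two example sequences are all assumptions made only for illustration.

```python
from collections import Counter

# Hypothetical reduced alphabet: the 20 amino acids grouped into broad
# physicochemical classes. This grouping is an illustrative assumption,
# not the alphabet used in the thesis.
REDUCED_ALPHABET = {
    **dict.fromkeys("AVLIMC", "h"),  # hydrophobic (assumed class)
    **dict.fromkeys("FWYH", "a"),    # aromatic (assumed class)
    **dict.fromkeys("STNQ", "p"),    # polar (assumed class)
    **dict.fromkeys("KR", "+"),      # positively charged
    **dict.fromkeys("DE", "-"),      # negatively charged
    **dict.fromkeys("GP", "s"),      # special cases (assumed class)
}


def kmers(sequence, k=3):
    """Return all overlapping substrings of length k."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def reduce_sequence(sequence):
    """Map each amino acid to the symbol of its property class."""
    return "".join(REDUCED_ALPHABET[aa] for aa in sequence)


def repertoire_kmer_counts(sequences, k=3, reduced=False):
    """Count kmers over every TCR sequence in one repertoire sample."""
    counts = Counter()
    for seq in sequences:
        if reduced:
            seq = reduce_sequence(seq)
        counts.update(kmers(seq, k))
    return counts


# Toy repertoire: two made-up sequences differing by one similar residue.
toy_repertoire = ["CASSLG", "CASSIG"]
plain = repertoire_kmer_counts(toy_repertoire)
grouped = repertoire_kmer_counts(toy_repertoire, reduced=True)
```

In this toy case the two sequences share only some exact kmers, but become identical once leucine and isoleucine fall into the same class, so the reduced representation has fewer distinct features. Whether such merging helps a classifier generalise is exactly the question the abstract reports mixed results on.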
People
| Name | ORCID iD |
|---|---|
| Hannah Kockelbergh (Student) | |
Studentship Projects
| Project Reference | Relationship | Related To | Start | End | Student Name |
|---|---|---|---|---|---|
| EP/T517975/1 | | | 30/09/2020 | 29/09/2025 | |
| 2876514 | Studentship | EP/T517975/1 | 30/09/2020 | 29/06/2024 | Hannah Kockelbergh |