A long term resource to maximise the potential of laboratory mouse strains for medical research

Lead Research Organisation: Wellcome Sanger Institute
Department Name: Computational Genomics

Abstract

Our key aim is to explore the relationship between genetic variation and medically relevant human disease phenotypes. One way to do this is to assess the genetic differences between long-established laboratory mouse strains. These strains display many important disease phenotypes, such as resistance to various forms of cancer (e.g. liver, lung, and skin cancer) and to bacterial and viral infection, and are used as models for many human diseases. The foundation for studying the genetic differences between these strains is an accurate genome sequence for each. In this project, we will first generate genome sequences for the most commonly used laboratory mouse strains and then use these sequences, together with knowledge of the gene structures, to determine the genetic causes of observed differences in disease response and behaviour between the strains. By combining sequence and phenotypic data we will determine whether sequence variants are likely to contribute to disease susceptibility.

The main aim of this project is to correctly identify all the genes in the newly completed release of genome sequences of 12 laboratory mouse strains. This will be achieved through a combination of two strategies. Initially, the genes will be identified using state-of-the-art bioinformatic programs and pipelines. Genes are identified by matching known mouse proteins, other transcribed data such as mRNAs and ESTs, or conserved proteins from other species to the genome. Because this is an automatic pipeline, there will be complex gene families that cannot be correctly identified and that require manual inspection. The HAVANA team have been involved in manual annotation of the human, mouse and zebrafish reference genomes and have developed specialist in-house tools to help identify genes accurately within different genomes. Since manual inspection is expensive and time-consuming, the manual effort will be targeted at complex gene families and genes of specific interest to the mouse research community. Engaging with the community will be essential, both to receive feedback on where annotation should be targeted and to encourage community participation in the manual inspection of genes of interest. Automatic annotation identifies around 70% of genes correctly; the aim is therefore to use bioinformatic analysis and feedback from researchers to target and improve the remaining 30% of incorrectly annotated genes.

Technical Summary

The genome represents a complete description of an organism. However, to understand the functioning of the genes and regulatory elements, and to design molecular biological experiments to test hypotheses, the genome sequence must be related to the extant functional data for that organism. In particular the set of genes must be accurately annotated. The first chromosome sequences for the laboratory mouse strains are soon to be released by the Mouse Genomes Project at the Wellcome Trust Sanger Institute. The main aim of this proposal is to take the sequences and create strain-specific annotation and targeted manual annotation in regions where the automated processes fail.

We propose to create a comprehensive evidence-based set of gene annotations for twelve laboratory mouse strains. This will be a combination of manual annotation in targeted loci and genome-wide automatic annotation. Manual annotation provides the most accurate annotation of a locus, generating every transcript for which there is evidence. Automatic annotation provides rapid genome-wide gene annotation. Together, they provide the most useful, cost-effective gene set for researchers.

Manual annotation will be targeted at loci chosen by the community as important for medical research, or where user feedback suggests automatic annotation has failed to generate good models. It will be performed using the established Otterlace/ZMap annotation tools.

An established process, used successfully in the ENCODE project, will merge the manual and automatic annotation for each Ensembl release. The gene set will be made available through the Ensembl website, through the other Ensembl access methods (BioMart data-mining interface, Perl API, flat-file dumps, MySQL database), through MGI, and to Ensembl tools such as the Variant Effect Predictor. With each release, the gene set will be further annotated by Ensembl's comparative genomics, variation and functional genomics pipelines.
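
The idea of merging manual and automatic annotation can be illustrated with a deliberately simplified sketch. It is not the established Ensembl/HAVANA merge process, which is not described here. It assumes a toy rule in which every manually annotated gene model is kept and an automatic model is kept only where it overlaps no manual model. The `GeneModel` class, its fields, the identifiers and the coordinates are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneModel:
    """Hypothetical, minimal gene model used only for this illustration."""
    gene_id: str
    chrom: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    source: str  # "manual" or "automatic"


def overlaps(a: GeneModel, b: GeneModel) -> bool:
    """True if two models share at least one base on the same chromosome.

    Strand is ignored here for brevity (an assumption of this sketch).
    """
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end


def merge_annotation(manual: list, automatic: list) -> list:
    """Toy merge rule: manual models always win at loci they cover."""
    kept_auto = [
        gene for gene in automatic
        if not any(overlaps(gene, curated) for curated in manual)
    ]
    return sorted(manual + kept_auto, key=lambda g: (g.chrom, g.start, g.end))


# Made-up example data.
manual = [GeneModel("man1", "1", 100, 500, "manual")]
automatic = [
    GeneModel("auto1", "1", 450, 900, "automatic"),    # overlaps man1, dropped
    GeneModel("auto2", "1", 1000, 1500, "automatic"),  # no overlap, kept
    GeneModel("auto3", "2", 100, 500, "automatic"),    # other chromosome, kept
]
merged = merge_annotation(manual, automatic)
for gene in merged:
    print(gene.gene_id, gene.chrom, gene.start, gene.end, gene.source)
```

A production merge would typically work at the transcript level and take strand and supporting evidence into account; the locus-level rule above is only meant to show why targeted manual annotation can coexist with genome-wide automatic annotation in a single gene set.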

Planned Impact

The most obvious beneficiaries of the genome sequences and annotation generated by this project will be the mouse genetics community involved in mapping complex disease-related traits, researchers mapping mutations in crosses involving the wild-derived strains, and researchers using crosses to identify modifiers of mutations.

Complete genome sequence and annotation is needed to explore the relationship between genetic and phenotypic variation at a number of levels. First, it is a starting point for exploring how sequence and gene structure variation impinges on gene function. The new gene structures that this project will identify will provide a resource for examining sequence function, particularly in those regions, identified by the ENCODE project, that are either transcribed or implicated in gene regulation. Importantly, complete sequence will allow unambiguous assignment of function to specific nucleotide differences.

Second, the sequence will accelerate the identification of genes involved in the increasingly large number of phenotypes available for inbred strains. To date, more than 2,000 loci that contribute to quantitative variation have been identified, with only a small number characterized at a molecular level. The de novo assemblies and corresponding annotation data will obviate the need to re-sequence candidate genes identified in genetic analysis of complex traits.

Third, in combination with accumulating expression, proteomic and metabolomic data sets, accurate genome annotation of multiple mouse strains will markedly improve our ability to understand gene function. A systems biology approach will be possible, in which the integration of genetic and functional genomic data provides a path to inferring causal associations between genes and disease.

Publications

 
Description Maximising the potential of laboratory mice for understanding the genetic basis for disease
Amount £685,000 (GBP)
Funding ID MR/R017565/1 
Organisation Medical Research Council (MRC) 
Sector Public
Country United Kingdom
Start 04/2018 
End 04/2021
 
Title Mouse Genomes Project database 
Description A catalogue of mouse strain sequences 
Type Of Material Database/Collection of data 
Year Produced 2010 
Provided To Others? Yes  
Impact Thousands of users and hundreds of papers 
URL http://www.sanger.ac.uk/resources/mouse/genomes/
 
Title Mouse genomes in ensembl 
Description Genome sequences and genome annotation for sixteen laboratory mouse genomes for free use by the public. 
Type Of Material Database/Collection of data 
Year Produced 2016 
Provided To Others? Yes  
Impact Availability for the wider mouse genetics community of whole genome sequences to reduce the number of laboratory animals required for experiments. 
URL http://www.ensembl.org/Mus_musculus/Info/Strains?db=core
 
Title UCSC Genome Browser 
Description Mouse genomes in UCSC Genome Browser 
Type Of Material Database/Collection of data 
Year Produced 2018 
Provided To Others? Yes  
Impact Public availability of first draft genome sequences for 18 laboratory mouse strains 
URL https://genome.ucsc.edu/
 
Description Ensembl Genome Browser 
Organisation EMBL European Bioinformatics Institute (EMBL - EBI)
Country United Kingdom 
Sector Academic/University 
PI Contribution In this project, we produced genome assemblies for the most widely used laboratory mouse strains. These genomes are now available to the wider community via the Ensembl Genome Browser.
Collaborator Contribution Ensembl has provided services for hosting and presenting the genome sequences to the wider research community. This provides ongoing long term sustainability for the data.
Impact Increased usage of the data; long-term sustainability and availability of the data.
Start Year 2018
 
Description Jackson Laboratory 
Organisation The Jackson Laboratory
Country United States 
Sector Charity/Non Profit 
PI Contribution Genome sequencing, genome assembly, and gene prediction.
Collaborator Contribution Supply of key samples to complete the research. Collaboration on data analysis and interpretation.
Impact First draft genome sequences for laboratory mouse genomes, a key resource for all mouse genetics research.
Start Year 2014
 
Description UCSC genome annotation 
Organisation University of California, Santa Cruz
Country United States 
Sector Academic/University 
PI Contribution Genome sequencing and genome assembly for sixteen inbred laboratory mouse genomes.
Collaborator Contribution Personnel and IT resources to complete whole-genome annotation of sixteen inbred laboratory mouse genomes.
Impact Whole-genome annotation of sixteen inbred laboratory mouse genomes.
Start Year 2015
 
Description Conference of the International Mammalian Genome Society 2015 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Third sector organisations
Results and Impact A research talk "Multiple mouse reference genomes and strain specific gene" describing the public resource being generated through this work.
Year(s) Of Engagement Activity 2015
URL http://www.imgc2015.jp/
 
Description Deep genome sequencing and variation analysis of 13 inbred mouse strains defines candidate phenotypic alleles, private variation, and homozygous truncating mutations 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Other audiences
Results and Impact A talk by Dr. Anthony Doran at The Allied Genetics Conference 2016.
Year(s) Of Engagement Activity 2016
URL http://www.genetics2016.org
 
Description Discovery, assembly, and annotation of subspecies specific haplotypes in classical and wild-derived mouse strains 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Third sector organisations
Results and Impact A talk at The Allied Genetics Conference 2016
Year(s) Of Engagement Activity 2016
URL http://www.genetics2016.org
 
Description Multiple mouse reference genomes defines subspecies specific haplotypes and novel coding sequences 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Talk at international research conference
Year(s) Of Engagement Activity 2017
URL http://imgs.org/
 
Description Talk: Discovery, assembly, and annotation of subspecies specific haplotypes in classical and wild-derived mouse strains 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Other audiences
Results and Impact A talk at The Allied Genetics Conference 2016 by Thomas Keane
Year(s) Of Engagement Activity 2016
URL http://www.genetics2016.org/