Improving the rat reference genome annotation and building community engagement

Lead Research Organisation: Wellcome Sanger Institute
Department Name: Computational Genomics

Abstract

Rats have been used in research for over 100 years as a model for examining physiology and behaviour and for providing insight into human disease. Owing to its well-characterised physiology, the rat is also the favoured rodent model in the pharmaceutical industry for assessing drug efficacy and toxicity. In 2004 the first rat reference genome sequence was made public. This changed the direction of research using the rat as a model organism, enabling the identification of rat genes associated with specific diseases.

The first release of the rat genome sequence was not of high quality and contained many gaps and missing genes. It was updated in 2012 by the Baylor College of Medicine Human sequencing group, who integrated sequence generated by new sequencing technologies and so increased the amount of the genome covered. Recently, new experimental techniques have enabled scientists to knock out genes in the rat genome, making it possible to observe what happens to the rat when a gene is deleted. As a result, it is essential that the genes targeted in this type of genetic experiment are correctly identified, i.e. "annotated", on the rat genome.

The main aim of this project is to correctly identify all the rat genes on the new release of the rat reference genome. This will be achieved through a combination of two strategies. Initially the genes will be identified using state-of-the-art bioinformatic programs and pipelines developed by the Ensembl gene build team. Genes are identified by matching known rat proteins, other transcribed data such as mRNAs and ESTs, or conserved proteins from other species to the genome. As this is an automatic pipeline, there may be complex gene families that cannot be correctly identified and require manual inspection. The HAVANA team have been involved in the manual annotation of the human, mouse and zebrafish reference genomes and have developed specialist in-house tools to help identify genes accurately within different genomes. Since manual inspection is expensive and time consuming, the manual effort will be targeted at complex gene families and at genes of specific interest to the rat research community. Engaging with the community will be essential, both to receive feedback about the targeting of annotation and to generate community participation in the manual inspection of genes of interest. Over 22000 protein-coding genes were predicted on the original rat assembly, so community input could improve and refine many of these gene models. Automatic annotation identifies around 70% of genes correctly; the aim is therefore to use bioinformatics analysis and feedback from researchers to target the 30% of genes that are incorrectly annotated and improve them.
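As a rough illustration of the scale implied by the figures above, the following sketch estimates how many gene models would need targeted review. It assumes exactly 22000 protein-coding genes (the abstract says "over 22000", so this is a lower bound) and takes the "around 70%" accuracy figure at face value; the function name is hypothetical and not part of any Ensembl or HAVANA tool.

```python
# Back-of-envelope estimate of the targeted review workload.
# The input figures come from the abstract; everything else is illustrative.

def genes_needing_review(total_genes: int, auto_accuracy: float) -> int:
    """Approximate number of gene models that automatic annotation gets wrong."""
    if not 0.0 <= auto_accuracy <= 1.0:
        raise ValueError("auto_accuracy must be between 0 and 1")
    return round(total_genes * (1.0 - auto_accuracy))


TOTAL_PROTEIN_CODING = 22_000  # "over 22000" in the abstract, so a lower bound
AUTO_ACCURACY = 0.70           # "around 70%" in the abstract

to_review = genes_needing_review(TOTAL_PROTEIN_CODING, AUTO_ACCURACY)
print(f"Approximate gene models to target: {to_review}")
```

Under these assumptions the estimate is in the region of 6600 gene models, which is why the manual effort has to be prioritised with community input rather than applied uniformly.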

The HAVANA team have previously worked with pig researchers on a community annotation project to identify immune-response genes on the pig genome. Approximately 8% of protein-coding genes were annotated by the researchers, who used the HAVANA annotation tools remotely on their own laptops in their labs after attending a workshop on how to use the in-house tools. Regular contact with the professional annotators ensured that the resulting models were consistent among all researchers and adhered to the guidelines produced by the HAVANA group. This model of community annotation will be presented to the rat community as an opportunity to improve the annotation of rat genes.

The reference rat genes can be viewed online using the Ensembl genome browser. This reference gene set will be updated approximately every three months, and updates from the manual annotation effort will be merged into the automatic gene set by the Ensembl gene builders. In addition, any new rat-specific data that helps to identify new genes, such as transcriptome data from new sequencing technologies, can be integrated into this complex gene-building pipeline.

Technical Summary

The genome represents a complete description of an organism. However, to understand the functioning of the genes and regulatory elements, and to design sensible molecular biological experiments to test hypotheses, the genome sequence must be related to the extant functional data for that organism. In particular, the set of genes must be accurately annotated. An updated genome assembly for rat (Rnor5.0) has recently been released. This improved assembly is more complete and has longer contigs, making it a better substrate for generating both automatic and manual gene annotation.

We propose to create a comprehensive, evidence-based set of gene annotation for rat. This will combine manual annotation in targeted loci with genome-wide automatic annotation produced using the established Ensembl annotation system. Manual annotation provides the most in-depth annotation of a locus, generating every transcript for which there is evidence. Automatic annotation provides rapid, genome-wide gene annotation. Together they provide the most useful and cost-effective gene set for researchers.

Manual annotation will be targeted at loci chosen by the community as important for rat-based research, or where user feedback suggests that automatic annotation has failed to generate good models. It will be performed using the established Otterlace/ZMap annotation tools. A community annotation jamboree will be organized to further increase the amount of manual annotation possible.

An established process, used successfully in the ENCODE project, will merge the manual and automatic annotation for each Ensembl release. The gene set will be made available through the Ensembl website, via the other access methods to Ensembl (BioMart data-mining interface, Perl API, flat file dumps, MySQL database), and to Ensembl tools such as the Variant Effect Predictor. In each release the gene set will be further annotated by Ensembl's comparative genomics, variation and functional genomics pipelines.
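The idea of merging the two annotation sources can be sketched as below. This is a deliberately simplified toy under stated assumptions, not the established Ensembl/HAVANA merge process, which this summary does not describe in detail: here each locus is treated as a single unit, and a manual model simply replaces the automatic model for the same locus. The `GeneModel` class, the `merge_gene_sets` function and the locus and transcript names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneModel:
    """A minimal, hypothetical stand-in for an annotated gene."""
    locus: str                # identifier shared by both annotation sources
    source: str               # "manual" or "automatic"
    transcripts: tuple        # transcript names supported by evidence


def merge_gene_sets(automatic, manual):
    """Toy merge: keep every automatic model, but let a manual model
    replace the automatic one wherever both exist for the same locus."""
    merged = {model.locus: model for model in automatic}
    merged.update({model.locus: model for model in manual})
    return [merged[locus] for locus in sorted(merged)]


# Hypothetical example data, for illustration only.
automatic_set = [
    GeneModel("locus_A", "automatic", ("A-201",)),
    GeneModel("locus_B", "automatic", ("B-201",)),
]
manual_set = [
    GeneModel("locus_B", "manual", ("B-001", "B-002", "B-003")),
]

release = merge_gene_sets(automatic_set, manual_set)
for model in release:
    print(model.locus, model.source, len(model.transcripts))
```

The design choice illustrated is the one stated in the text: automatic annotation supplies genome-wide coverage, while manual annotation, where it exists, supplies the more detailed model for a locus.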

Planned Impact

This proposal will generate a more accurate and complete annotation of the gene structures contained in the rat genome than is currently available. Accurate knowledge of the gene structures of an organism is a fundamental requirement for the interpretation of many types of experimental biological datasets and so this research is important to all individuals who carry out research concerning rats. The open availability of the data generated and the software code and tools to access it will ensure its use is maximized.

The beneficiaries of this research will include those researching the basic biology of rats and those using rats as a model in order to better understand human physiology and disease. This group includes the pharmaceutical industry, where the rat is an important model organism for human biology in drug development.

These groups will benefit from this research by having a more reliable and complete gene set to use in their analysis. This will enable them to design more precise experiments and better interpret experimental data. An improved gene annotation will also lead to more accurate and complete identification of orthologous genes in other organisms such as human and will enable detailed comparisons of gene structures. When using the rat as a model for human diseases or physiology for drug development research it is important to know how similar the biology of each species is. This in turn depends on how similar the genes in each species are, including their structure and regulatory features. An improved gene annotation will facilitate this analysis.

Research using the rat as a model organism has an important role in the understanding of human disease and in the development of new drugs. The research therefore has the potential to contribute to improved health of the UK population. The pharmaceutical industry is a major generator of wealth in the UK, so this research also has the potential to improve its research output and, through that, to help improve the competitiveness of this sector of the UK economy. The community engagement aspects of this proposal will specifically enable UK researchers in both academia and companies to propose priorities for gene annotation improvements based on the priorities of their research, and will allow them to engage with expert annotators. The improved gene set resulting from this research could also provide a starting point for commercial companies producing experimental reagents for other researchers.

The international importance of this research will also encourage links to other international rat resources and databases, such as the US-funded Rat Genome Database (RGD), the EU-funded EURATools consortium (http://euratools.rns4u.com/) and its follow-on project, the EURAtrans consortium (http://www.euratrans.eu/). Such links will enhance the access of UK researchers to other large-scale basic research projects on rat and to the data they generate.

Finally, the staff trained on this project will gain valuable expertise in computational methods for handling genome data and biological expertise in vertebrate gene structure. These bioinformatics skills, particularly in the use of high-throughput biological data, are in great demand in both academia and industry. This demand extends to the health sector, where genome data is increasingly used in medical diagnostics.

Publications

Aken BL (2016) The Ensembl gene annotation system. in Database : the journal of biological databases and curation

Cunningham F (2015) Ensembl 2015. in Nucleic acids research

Cunningham F (2015) Improving the Sequence Ontology terminology for genomic variant annotation. in Journal of biomedical semantics

Barrell D (2013) Ensembl resources for rat in 2013

Flicek P (2014) Ensembl 2014. in Nucleic acids research

Harrow JL (2014) The Vertebrate Genome Annotation browser 10 years on. in Nucleic acids research

Yates A (2016) Ensembl 2016. in Nucleic acids research

 
Description The rat genome community annotation initiative aims to improve the manual annotation of the genome and the computationally derived gene set in Ensembl. Rat researchers have been invited to nominate genes and regions of interest for annotation. The community response to our request for targeted gene sets has resulted in the manual annotation of 2609 genes, including 1615 protein-coding genes, 213 lncRNAs and 657 pseudogenes. Gene clusters are particular targets for manual annotation because they are difficult to annotate automatically. We have annotated the Major Histocompatibility Complex (RT1) region in rat, which supplements the manual annotation of this region in other vertebrate species including mouse, human, pig and wallaby. The MHC is an important region for evolutionary biologists, because of its allelic diversity, and for clinical researchers investigating complex diseases such as diabetes. Other gene families that have been manually annotated include the keratin (Krt) family, which consists of over 80 genes, including protein-coding genes and pseudogenes, found on chromosomes 7 and 10. We have updated Vega to display the latest rat genome assembly, Rnor6, which is a mixed-sex assembly made from pooled female animals of strain BN/SsNHsdMCW plus one male of strain SHR (also known as SHR-Akr). Importantly, the male Y chromosome is now also included in the rat assembly and we have completed gene annotation across its full length. Full manual gene annotation of the rat Y chromosome, combined with the recently completed equivalent manual annotation of the Y chromosomes of mouse and pig, provides a substantial resource for comparative analysis. Incorporating the new assembly necessitated considerable updating of the manually annotated gene models from Rnor5 to Rnor6 because of changes between the assemblies. These data have been supplemented with long non-coding RNA (lncRNA) models derived from next generation sequencing (NGS) data from 12 different tissues.
We have used these tissue-specific models, generated by the Ensembl pipeline from NGS data, to predict lncRNAs in rat and extend our gene models. The Vega browser is updated quarterly in conjunction with Ensembl, and the manual annotation is merged with the Ensembl gene set. Annotation is a continuous process, so between database updates we can release new annotation via a Vega update track. Updates are published as a list at http://vega.sanger.ac.uk/info/data/Rattus_norvegicus_update_genes.html.
Exploitation Route Rats have become the most relevant model organism for the study of multifactorial diseases such as hypertension, diabetes, renal failure and neurological disorders. The public release in the Ensembl browser of the reference gene set, which results from merging the Ensembl gene build with manual improvements on the latest release of the rat reference genome (Rnor6), will help clinical researchers to investigate multifactorial disease. It also helps researchers more generally who use comparative genomics to understand gene function, as the rat genome is important to evolutionary biologists.
Sectors Healthcare, Pharmaceuticals and Medical Biotechnology

URL http://vega.sanger.ac.uk/Rattus_norvegicus/Info/Index
 
Title Ensembl release 70 
Description New Rnor5 assembly with genebuild; orthologues to new human and mouse genes; gene trees; syntenic regions and pairwise alignments of rat with human and mouse; multiple species alignments with other mammals; microarray mapping; assembly mapping RGSC3.4 to Rnor_5.0; links to external databases, e.g. UniProt
Type Of Material Database/Collection of data 
Year Produced 2013 
Provided To Others? Yes  
Impact na 
URL http://jan2013.archive.ensembl.org/index.html
 
Title Ensembl release 71 
Description Added SIFT prediction and score filters and attributes, to allow users to identify the severity of variations; imported QTL data; orthologues to new human genes
Type Of Material Computer model/algorithm 
Year Produced 2013 
Provided To Others? Yes  
Impact na 
URL http://apr2013.archive.ensembl.org/index.html
 
Title Ensembl release 72 
Description Orthologues to new human and mouse genes; added an additional 'estgenes' track, created by linking together EST alignments into transcript models
Type Of Material Database/Collection of data 
Year Produced 2013 
Provided To Others? Yes  
Impact na 
URL http://jun2013.archive.ensembl.org/index.html
 
Title Ensembl release 73 
Description Orthologues to new human genes; added new "QTL chromosome name" and "QTL region" filters; dbSNP Build 138 data imported
Type Of Material Database/Collection of data 
Year Produced 2013 
Provided To Others? Yes  
Impact na 
URL http://sep2013.archive.ensembl.org/index.html
 
Title Ensembl release 74 
Description orthologues to new human and mouse genes 
Type Of Material Database/Collection of data 
Year Produced 2013 
Provided To Others? Yes  
Impact Secondary structures of non-coding RNAs are now shown on the gene summary page, using the R2R package.
URL http://dec2013.archive.ensembl.org/index.html
 
Title Ensembl release 75 
Description One gene, ENSRNOG00000042244, reported by a user, was deleted from the rat gene set
Type Of Material Database/Collection of data 
Year Produced 2014 
Provided To Others? Yes  
Impact na 
URL http://feb2014.archive.ensembl.org/index.html
 
Title Ensembl release 76 
Description New BLAST interface; BioMart: retirement of the transcript splicing events data computed by the pipeline developed as part of the ASTD project
Type Of Material Database/Collection of data 
Year Produced 2014 
Provided To Others? Yes  
Impact Not aware of any impact; however, we know the pages are highly accessed
URL http://www.ensembl.org/info/website/news.html?id=76&submit=Go
 
Title Ensembl release 77 
Description Manual annotation of rat from Havana is included. This represents the data released in Vega 57. This is the first merge for rat. This gene set is a merge of complete Ensembl gene models and the latest Havana gene annotation. 
Type Of Material Database/Collection of data 
Year Produced 2014 
Provided To Others? Yes  
Impact Not aware of any impact, but the pages are highly accessed by researchers 
URL http://www.ensembl.org/info/website/news.html?id=77&submit=Go
 
Title Ensembl release 79 
Description Imported phenotypes/diseases from the Rat Genome Database (RGD) 
Type Of Material Database/Collection of data 
Year Produced 2014 
Provided To Others? Yes  
Impact We are not aware of any impact, but the rat species pages are highly accessed 
URL http://www.ensembl.org/info/website/news.html?id=79&submit=Go
 
Title Ensembl release 80 
Description Updated rat gene annotation based on the Rnor_v6.0 assembly (Rat). This is the new gene set for rat based on Rnor_v6.0. The Y chromosome has been added to the assembly. The gene set contains RNASeq-based models. This gene set is a merge of complete Ensembl gene models and the latest Havana gene annotation; contains the data released in Vega 59. The rat variation database will be remapped to Rnor_6.0. 
Type Of Material Database/Collection of data 
Year Produced 2015 
Provided To Others? Yes  
Impact we are not aware of any impact yet, but the pages are highly accessed 
URL http://www.ensembl.org/info/website/news.html?id=80&submit=Go
 
Title Ensembl release 82 
Description We improved the database cross references for rat. 
Type Of Material Database/Collection of data 
Year Produced 2015 
Provided To Others? Yes  
Impact we are not aware of any specific impacts, but the pages are highly accessed. 
URL http://www.ensembl.org/info/website/news.html?id=82&submit=Go
 
Title Ensembl release 83 
Description Updated Ensembl-Havana rat gene set. This gene set is a merge of complete Ensembl gene models and the latest Havana gene annotation; contains the data released in Vega 63 
Type Of Material Database/Collection of data 
Year Produced 2015 
Provided To Others? Yes  
Impact we are not aware of any specific impacts, but the pages are highly accessed 
 
Title Ensembl release 84 
Description Updated phenotypes/disease data from the Rat Genome Database (RGD). 
Type Of Material Database/Collection of data 
Year Produced 2016 
Provided To Others? Yes  
Impact we are not aware of any specific impacts, but the pages are highly accessed. 
URL http://mar2016.archive.ensembl.org/Rattus_norvegicus/Info/Index
 
Title Ensembl release 85 
Description Updated Ensembl-Havana rat gene set. This gene set is a merge of complete Ensembl gene models and the latest Havana gene annotation; contains the data released in Vega 65. Updated phenotypes/disease data from the Rat Genome Database (RGD). 
Type Of Material Database/Collection of data 
Year Produced 2016 
Provided To Others? Yes  
Impact we are not aware of any specific impacts, but the pages are highly accessed. 
URL http://jul2016.archive.ensembl.org/Rattus_norvegicus/Info/Index
 
Title Ensembl release 86 
Description Updated phenotypes/disease data from the Rat Genome Database (RGD). 
Type Of Material Database/Collection of data 
Year Produced 2016 
Provided To Others? Yes  
Impact we are not aware of any specific impacts, but the pages are highly accessed 
URL http://www.ensembl.org/Rattus_norvegicus/Info/Index
 
Title Havana rat gene set 
Description Freeze of the manually annotated Rattus norvegicus gene set by the Havana group 
Type Of Material Database/Collection of data 
Year Produced 2013 
Provided To Others? No  
Impact No actual impacts realised to date 
URL ftp://ftp.sanger.ac.uk/pub/vega/rat
 
Title Rat (Vega53) 
Description Rnor5.0 (March 2012, strain BN/SsNHsdMCW). Manual annotation of the assembly from the Rat Genome Sequencing Consortium
Type Of Material Database/Collection of data 
Year Produced 2013 
Provided To Others? Yes  
Impact na 
URL http://vega.sanger.ac.uk/info/website/news.html?id=53&submit=Go
 
Title Ensembl release 76 
Description Updated ncRNA genes for rat; updated rat orthologues to human, mouse and all other Ensembl species; produced gene trees from these orthologues; updated pairwise alignment of rat to the new human assembly; updated multiple species alignments that include rat; Ensembl REST service includes useful functionality such as stable ID lookup
Type Of Technology Webtool/Application 
Year Produced 2014 
Impact na 
URL http://aug2014.archive.ensembl.org/index.html
 
Title Ensembl release 77 
Description Merged Ensembl and HAVANA rat annotation for the first time; imported RefSeq annotation for rat for the first time; RefSeq lookup should now be possible in the newly released REST service; updated rat orthologues to human, mouse and all other Ensembl species; produced gene trees from these orthologues; updated links to external databases, e.g. UniProt; updated multiple species alignments that include rat
Type Of Technology Webtool/Application 
Year Produced 2014 
Impact na 
URL http://www.ensembl.org/index.html
 
Description Community annotation of the rat genome presentation 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Talk sparked questions and discussion afterwards
Year(s) Of Engagement Activity 2014
URL http://imgs.org/?run=conference.program#O-41
 
Description Ensembl Genebuild Workshop by invitation from Yiqiang Zhao of CAU 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Undergraduate students
Results and Impact - introduction to Ensembl and the Human Genome Project
- introduction to gene building
- outreach resources (YouKu etc)
- workshop on running the Ensembl gene annotation pipeline
- workshop on running the Ensembl RNA-seq pipeline

na
Year(s) Of Engagement Activity 2014
URL http://www.ebi.ac.uk/training/workshop/ensembl-genebuild-workshop
 
Description Presentation at Rat Genomics & Models, CSHL, USA, December 2015 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Thibaut Hourlier gave a poster presentation titled 'Rat Genomics & Models' at Rat Genomics & Models, CSHL, USA, in December 2015. We gave the presentation in order to raise awareness of the rat resources provided by Ensembl. There were several questions during the poster session.
Year(s) Of Engagement Activity 2015
 
Description Workshop on Genome Resources for the Rat Community, Milwaukee, USA, September 2015 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Jane Lovland and Gaurab Mukherjee organised the workshop and taught the researchers how to use the ZMap/Otterlace software to annotate the rat genome and the genes important to their research.
Thibaut Hourlier gave an overview of Ensembl's rat annotation and resources, including comparative analysis, NGS analysis and track hubs. The intended purpose was to train people who would like to use our resources and to engage with the community, and the workshop was successful in accomplishing this.
Year(s) Of Engagement Activity 2015
URL http://rgd.mcw.edu/wg/news2/07/10-workshop-on-genome-resources-for-the-rat-community