Pig genome annotation and analysis

Lead Research Organisation: The Wellcome Trust Sanger Institute
Department Name: Bioinformatics Division


We propose to provide state-of-the-art analysis and annotation of the pig genome sequence being generated by the International Pig Genome Sequencing Project. We will make the annotated genome sequence accessible on the Web through the Ensembl site (http://www.ensembl.org). The pig genome is the entire DNA sequence of the pig, which defines all the biological molecules that make up a pig. Acquiring, managing and annotating the pig genome sequence accelerates research in both pig biology and mammalian biology. Impact on pig biology: Because of the extensive selective breeding that has occurred during domestication, there are a considerable number of breed- or line-specific features, ranging from fat/muscle ratios and litter size to skin colour. These features can be mapped genetically to broad regions of the genome, but the final identification of the genes responsible and of the causal genetic variation is very complex. The availability of a well-annotated pig genome sequence with links to other data sources, especially those on phenotypes such as growth, carcass composition or responses to infectious disease, would provide a dramatic boost to the identification of these causative genes.

Technical Summary

The genome represents a complete description of an organism. However, to understand the functioning of the genes and regulatory elements, and to design sensible molecular biological experiments to test hypotheses, the genome sequence must be related to the extant functional data for that organism. We propose to annotate and analyse the sequence being generated by the International Pig Genome Sequencing Project. We will use the well-established Ensembl system as the main tool for storage, management and dissemination of pig genome data. Pig genome sequencing is currently funded to 3-4x coverage from mapped clones, with two chromosomes at higher coverage. Experience from other low-coverage genomes, such as cow, rabbit and armadillo, is that this coverage will at a minimum provide an effective representation of exons, which can then be assembled into genes using a guide genome. By definition, this approach cannot resolve lineage-specific expansions in the pig genome. However, with this more clone-based strategy there will be new opportunities for combining assembly and annotation strategies to extract more information from a 3x assembly. We will integrate the pig genome sequence with diverse pre-existing data sets, including SNPs, ESTs and quantitative trait loci (QTL). We will integrate the sequence with maps (genetic, physical) and physical resources (clones, microarrays), providing a seamless route for interrogation and for the development of experimental tools. Finally, computational approaches that integrate the above resources and exploit the comparative genomics potential of the mammalian clade will be used to analyse and present the genome in a user-friendly format. An annotated pig genome sequence will dramatically accelerate research on the pig as an important animal for agriculture and human biology. Our aim is to make the pig genome sequence maximally useful by delivering an annotated sequence of the highest quality in a user-friendly manner.
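
As a rough guide to what 3-4x coverage implies, the classical Lander-Waterman (Poisson) model estimates the fraction of bases sequenced at least once at fold coverage c as 1 - e^(-c). This model is an assumption introduced here for illustration: it treats reads as uniformly random, which a mapped-clone strategy only approximates, and it is not taken from the proposal itself. A minimal sketch:

```python
import math


def expected_fraction_covered(coverage: float) -> float:
    """Expected fraction of bases sequenced at least once at a given fold
    coverage, under the idealised Lander-Waterman (Poisson) model."""
    if coverage < 0:
        raise ValueError("coverage must be non-negative")
    return 1.0 - math.exp(-coverage)


if __name__ == "__main__":
    for c in (3, 4):
        frac = expected_fraction_covered(c)
        print(f"{c}x coverage: ~{frac:.1%} of bases covered at least once")
```

Under these idealised assumptions roughly 95% of bases are covered at 3x and roughly 98% at 4x, which is consistent with the expectation that most exons are represented even though the assembly remains fragmented.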


Aken BL (2017) Ensembl 2017. in Nucleic acids research

Aken BL (2016) The Ensembl gene annotation system. in Database : the journal of biological databases and curation

Flicek P (2014) Ensembl 2014. in Nucleic acids research

Flicek P (2011) Ensembl 2011. in Nucleic acids research

Flicek P (2008) Ensembl 2008. in Nucleic acids research

Flicek P (2010) Ensembl's 10th year. in Nucleic acids research

Flicek P (2012) Ensembl 2012. in Nucleic acids research

Flicek P (2013) Ensembl 2013. in Nucleic acids research

Harrow JL (2014) The Vertebrate Genome Annotation browser 10 years on. in Nucleic acids research

Description 1. A clone path was generated by ordering the sequenced clones using the integrated physical map. The contigs within the clones were then ordered according to read-pair and end-sequence information and overlaps between neighbouring clones. This resulted in a highly refined genome sequence assembly that can be improved further by closing the remaining gaps. The clone path generated during this grant is a public resource and was invaluable in the generation of a new pig assembly, Sscrofa10.2. This assembly was used as a basis for research, the findings of which are reported in the Nature paper and include:
a. There is a deep phylogenetic split between European and Asian wild boars ~1 million years ago, and a selective sweep analysis indicates selection on genes involved in RNA processing and regulation.
b. Genes associated with immune response and olfaction exhibit fast evolution. Pigs have the largest repertoire of functional olfactory receptor genes, reflecting the importance of smell in this scavenging animal.
c. The pig genome sequence provides an important resource for further improvements of this important livestock species, and our identification of many putative disease-causing variants extends the potential of the pig as a biomedical model.
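
The clone-path step described in item 1 can be illustrated with a small sketch. It assumes each clone has already been placed on the physical map as a start/end interval; the `Clone` type, the clone names and the coordinates are hypothetical, and the greedy procedure below is a simplification, not the pipeline actually used for the pig assembly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Clone:
    name: str   # hypothetical clone identifier
    start: int  # start position on the physical map (arbitrary units)
    end: int    # end position on the physical map


def build_clone_path(clones):
    """Order clones by map position and skip any clone fully contained in the
    clone that currently extends furthest, giving a simple tiling path."""
    path = []
    for clone in sorted(clones, key=lambda c: (c.start, -c.end)):
        if path and clone.end <= path[-1].end:
            continue  # contained in the previous path clone; adds no new sequence
        path.append(clone)
    return path


def find_gaps(path):
    """Return (left, right) name pairs for neighbouring path clones that do not overlap."""
    return [(a.name, b.name) for a, b in zip(path, path[1:]) if b.start > a.end]


if __name__ == "__main__":
    clones = [
        Clone("cloneC", 80, 180),
        Clone("cloneA", 0, 100),
        Clone("cloneD", 200, 300),
        Clone("cloneB", 20, 60),
    ]
    path = build_clone_path(clones)
    print([c.name for c in path])
    print(find_gaps(path))
```

In this toy input, cloneB lies entirely within cloneA and is left out of the path, while the lack of overlap between cloneC and cloneD is reported as a gap of the kind that later gap-closing work would target.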

2. The Sscrofa9 assembly of the genome was annotated using the Ensembl automatic gene prediction pipelines. A set of protein-coding genes was predicted based on pig cDNA and EST evidence and on alignments from other mammals. A set of non-coding RNA genes was also generated, predicted on the basis of alignments from Rfam and miRBase.

3. Comparative genomics alignments including pig have been generated. These include pairwise alignments to human and cow, and multiple alignments to other mammals and vertebrates. Other comparative resources include gene trees showing relationships for pig genes with 48 other species.
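
For readers who want to query the comparative data in item 3 programmatically, the sketch below builds a request to the public Ensembl REST service. This is an illustration added here, not part of the grant outputs: it assumes the `homology/symbol/:species/:symbol` endpoint at https://rest.ensembl.org is available as documented by Ensembl, and the helper names are hypothetical. Check the current REST documentation before relying on it.

```python
import json
import urllib.parse
import urllib.request

REST_SERVER = "https://rest.ensembl.org"  # public Ensembl REST service


def homology_url(symbol: str, species: str = "sus_scrofa") -> str:
    """Build the URL for a homology lookup by gene symbol (hypothetical helper)."""
    path = f"/homology/symbol/{urllib.parse.quote(species)}/{urllib.parse.quote(symbol)}"
    return f"{REST_SERVER}{path}?content-type=application/json"


def fetch_homologies(symbol: str, species: str = "sus_scrofa"):
    """Fetch and decode the JSON response; requires network access."""
    with urllib.request.urlopen(homology_url(symbol, species), timeout=30) as response:
        return json.load(response)


if __name__ == "__main__":
    # MC4R is used purely as an illustrative gene symbol.
    print(homology_url("MC4R"))
```

Calling `fetch_homologies("MC4R")` would return the decoded JSON response if the service is reachable; the structure of that response is defined by the Ensembl REST documentation and is not assumed here.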
Exploitation Route The genome sequence, associated annotation and genome browser tools generated provide a resource that underpins pig genomics research. No genome sequence (not even human) is entirely complete, but the resources document how the sequence was generated and allow it to be improved by additional sequencing. The most recent version of the genome is the Sscrofa10.2 assembly, which was produced in August 2011 by the Swine Genome Sequencing Consortium (SGSC). This grant led to a successful application for a follow-on grant (BBSRC: Ensembl and enabling genetics and genomics research in farmed animal species BB/I025360), which supported updates to the pig annotation as reported by EBI (see report of outcomes). The most recent version of the annotation was released in May 2012, with minor updates carried out in February 2014.
Sectors Agriculture, Food and Drink, Education, Environment

URL http://www.ensembl.org/Sus_scrofa/Info/Index
Description The genome sequence and associated annotation are made accessible through the Ensembl genome browser. The browser is widely used by pig researchers to integrate data they have collected independently, to design specific experiments, and so on. Handling large genomes, generating annotation and providing tools to use these data require substantial IT and software infrastructure. Because the sequence is generated centrally through the Sanger Institute's sequencing facilities, and annotation and bioinformatics services are provided through the Sanger/EBI Ensembl project software, the productivity of pig researchers is greatly increased: they can share data using a common platform and avoid each investing substantially in duplicate bioinformatics analysis.
First Year Of Impact 2006
Sector Agriculture, Food and Drink, Healthcare, Pharmaceuticals and Medical Biotechnology
Impact Types Economic

Title Addition of Ensembl-Havana gene set to Pig (Ensembl 69) 
Description An Ensembl-Havana gene set was added to the annotation. The VEGA manual annotation which had been generated through a community effort was added. 
Type Of Material Database/Collection of data 
Year Produced 2012 
Provided To Others? Yes  
Impact The annotated reference genome sequences have been delivered through a series of Ensembl releases 
URL http://oct2012.archive.ensembl.org/Sus_scrofa/Info/Index

Title Ensembl release 74 
Description Orthologues to new human and mouse genes 
Type Of Material Database/Collection of data 
Year Produced 2013 
Provided To Others? Yes  
Impact Secondary structures of non-coding RNAs are now shown on the gene summary page, using the R2R package. 
URL http://dec2013.archive.ensembl.org/index.html

Title Updated Pig Ensembl Website (Ensembl 77) 
Description Secondary structures of non-coding RNAs are now shown on the gene summary page, using the R2R package 
Type Of Material Database/Collection of data 
Year Produced 2013 
Provided To Others? Yes  
Impact The annotated reference genome sequences have been delivered through a series of Ensembl releases. 
URL http://www.ensembl.org/Sus_scrofa/Info/WhatsNew?db=core

Description Ensembl Genebuild Workshop by invitation from Yiqiang Zhao of CAU 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Undergraduate students
Results and Impact - introduction to Ensembl and the Human Genome Project
- introduction to gene building
- outreach resources (Youku etc.)
- workshop on running the Ensembl gene annotation pipeline
- workshop on running the Ensembl RNA-seq pipeline

Year(s) Of Engagement Activity 2014
URL http://www.ebi.ac.uk/training/workshop/ensembl-genebuild-workshop