pubmed2ensembl: a resource for linking biological literature to genome sequences

Lead Research Organisation: University of Manchester
Department Name: Life Sciences

Abstract

Driven by advances in technology, discovery and publication in biology are accelerating. Approximately 500,000 articles on biological research are published annually, and advanced computational systems are now needed to fully access and interpret this wealth of information. On an equally grand scale, the complete genetic blueprints of a large number of species have recently been made available to the scientific community through international genome sequencing projects. These genome projects have in large part driven the explosion in biological publication; however, essentially no work has been done to develop computational systems that provide integrated access to genome sequences and the biomedical literature. This project seeks to overcome this critical limitation by developing a computational resource, called pubmed2ensembl, that will directly integrate genomic data with the biomedical literature, providing biological researchers with a unique bridge between two of the fastest growing sources of biological information. Our system will allow experimental and computational researchers alike to perform 'cross-lingual' and 'multi-lingual' queries using both textual and genomic information (e.g. querying textual data using genomic information as constraints). Additionally, our system will allow direct navigation from genome sequences to the literature, allowing researchers to browse the published literature as they would any other genomic feature (e.g. genes). pubmed2ensembl will be open-access, accessible through both human and programmatic interfaces, and integrated with established bioinformatics services and resources (such as the Ensembl Genome Browser).
By coupling the accumulated knowledge in millions of published articles directly with genome sequences, pubmed2ensembl will provide a critical and much-needed resource to decode biological processes encoded in genomes.

Technical Summary

Advances in DNA sequencing technology have drastically increased the rate of production of genomic sequence data, thereby accelerating the rate of biological discovery and publication. Genomic data are well served by genome portals, and PubMed provides widely used access to the biomedical literature. However, essentially no effort has been made to systematically integrate genome sequences directly with the biological literature, even though these are the two most heavily relied-upon sources of information for many biologists. The ability to navigate directly between genomes and the biomedical literature, and to perform cross- and multi-lingual queries using both textual and genomic constraints, would greatly aid experimental and computational researchers alike, and would provide a unique and much-needed bridge between two of the fastest growing sources of biological information. We propose to develop an open-access resource called pubmed2ensembl that links biological literature directly to genomes, allowing integrated queries over genomic and textual information via human and programmatic web interfaces. We will use both human-curated and automatically extracted gene-publication links to populate the pubmed2ensembl database, including a novel source of links based on an automated method (called text2seq) that extracts DNA sequences from text and maps them to genomes. Queries to the pubmed2ensembl system will be executed using genome- or text-based data types and will return data types in the same or the complementary domain. The capability for such cross- and multi-lingual queries over text and genomic data will be a novel and defining feature of the pubmed2ensembl system. Our system will also uniquely leverage comparative genomic data to allow cross- and multi-species retrieval of text-based information, thereby enabling one of the most common workflows in the life sciences: using published results from model organisms to guide further biological research.
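
As a rough illustration of what a query across the genomic and textual domains could look like, the following minimal Python sketch resolves a genomic region to publication identifiers through gene-publication links, and a publication identifier back to genes. It is not the pubmed2ensembl implementation: the `Gene` record, the function names, and all gene and publication identifiers are hypothetical placeholders chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    """A gene with 1-based, inclusive genomic coordinates (hypothetical record)."""
    gene_id: str
    chrom: str
    start: int
    end: int


# Toy, made-up data: three genes and their linked publication IDs.
GENES = [
    Gene("GENE_A", "1", 100, 500),
    Gene("GENE_B", "1", 800, 1200),
    Gene("GENE_C", "2", 100, 400),
]
LINKS = {
    "GENE_A": {1001, 1002},
    "GENE_B": {1002, 1003},
    "GENE_C": {1004},
}


def genes_in_region(chrom, start, end):
    """Return genes on `chrom` that overlap the interval [start, end]."""
    return [g for g in GENES
            if g.chrom == chrom and g.start <= end and g.end >= start]


def pmids_for_region(chrom, start, end):
    """Genome-to-text query: publications linked to genes in a region."""
    pmids = set()
    for gene in genes_in_region(chrom, start, end):
        pmids |= LINKS.get(gene.gene_id, set())
    return sorted(pmids)


def genes_for_pmid(pmid):
    """Text-to-genome query: genes linked to a given publication."""
    return sorted(gene_id for gene_id, ids in LINKS.items() if pmid in ids)


if __name__ == "__main__":
    print(pmids_for_region("1", 400, 900))
    print(genes_for_pmid(1002))
```

In this sketch the genomic constraint (a region) filters genes first, and the gene-publication links then carry the result into the textual domain; the reverse lookup follows the same links in the opposite direction.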

Publications

 
Description The key outcomes of this project were the development of several software systems (http://linnaeus.sourceforge.net/, http://gnat.sourceforge.net/, http://text2genome.sourceforge.net/ and https://github.com/pubmed2ensembl/) that can be used to link biomedical publications to genomic data.
Exploitation Route The publications track at the UCSC Genome Browser has the potential to be used by the ~500,000 daily visitors, including those from medical, pharmaceutical, and governmental institutions. A follow-up project extending the text2genome system developed during this project is now underway at the University of California Santa Cruz Genome Browser (http://genome.ucsc.edu/) and is described here: http://blog.openhelix.eu/?p=12420
Sectors Education

URL http://pubmed2ensembl.org
 
Description Since 2010 our data and project websites (www.pubmed2ensembl.org and www.text2genome.org) have been used by biomedical scientists worldwide. Our approach to linking publications to genomic data has now been implemented in the UCSC Genome Browser as "publications" tracks, broadening the user base of our results considerably.
First Year Of Impact 2010
Sector Education
Impact Types Economic

 
Description BBSRC INTERNATIONAL SCIENTIFIC INTERCHANGE SCHEME (ISIS)
Amount £1,390 (GBP)
Funding ID 2076 
Organisation Biotechnology and Biological Sciences Research Council (BBSRC) 
Sector Public
Country United Kingdom
Start 05/2010 
End 03/2011
 
Title pubmed2ensembl 
Description pubmed2ensembl is a customised and extended version of the Ensembl BioMart on genes. We have extended the mart with gene-related publication information, i.e. PubMed-IDs and PubMed Central-IDs including URL link-outs and other information, from the following sources: Entrez Gene, EMBL Nucleotide Sequence Database, GNAT gene recognition runs on Medline and PMC, and text2genome. 
Type Of Material Database/Collection of data 
Year Produced 2011 
Provided To Others? Yes  
Impact Elements of this database were used in the conversion of the text2genome project into the UCSC Genome Browser Publications track. 
URL http://www.pubmed2ensembl.org/
 
Title text2genome 
Description text2genome maps scientific articles to genomic locations in a unique way: all words that resemble DNA sequences are extracted from a full-text scientific article and its supplementary data files, and are then mapped to public genome sequences. The resulting matches can be displayed on genome browser websites and used in data-mining applications. 
Type Of Material Database/Collection of data 
Year Produced 2011 
Provided To Others? Yes  
Impact This database formed the basis for the UCSC Genome Browser Publications track. 
URL http://www.text2genome.org/
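
The extract-then-map idea in the text2genome description can be sketched in a few lines of Python. This is a simplified illustration only, not the text2genome code: the minimum word length of 18 is an assumed threshold, the function names are hypothetical, and exact substring search on both strands stands in for whatever sequence alignment method the real system uses.

```python
import re


def extract_dna_words(text, min_len=18):
    """Return words made up only of A/C/G/T that are at least `min_len` long.

    The length cutoff is an assumption for this sketch; it keeps short
    English words such as "tag" or "cat" from being treated as sequences.
    """
    pattern = re.compile(r"\b[ACGT]{%d,}\b" % min_len, re.IGNORECASE)
    return [match.group(0).upper() for match in pattern.finditer(text)]


def reverse_complement(seq):
    """Reverse-complement an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_all(haystack, needle):
    """Return every 0-based start position of `needle` in `haystack`."""
    hits = []
    pos = haystack.find(needle)
    while pos != -1:
        hits.append(pos)
        pos = haystack.find(needle, pos + 1)
    return hits


def map_to_reference(seq, reference):
    """Exact-match `seq` against both strands of `reference`.

    Returns (0-based start on the forward strand, strand) tuples.
    """
    seq, reference = seq.upper(), reference.upper()
    hits = [(pos, "+") for pos in find_all(reference, seq)]
    rc = reverse_complement(seq)
    if rc != seq:  # avoid reporting palindromic sequences twice
        hits += [(pos, "-") for pos in find_all(reference, rc)]
    return hits


if __name__ == "__main__":
    # Made-up reference and article text, for illustration only.
    reference = "TTTT" + "ACGTACGGTCAAGCTTGACA" + "GGGG"
    text = "We used the primer ACGTACGGTCAAGCTTGACA for PCR and a tag cat."
    for word in extract_dna_words(text):
        print(word, map_to_reference(word, reference))
```

Exact matching keeps the sketch self-contained; a production system would need an alignment tool that tolerates sequencing errors, typos, and line breaks inside sequences.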
 
Description Collaboration with UCSC Genome Bioinformatics Team 
Organisation University of California, Santa Cruz
Department GenomeBrowser
Country United States 
Sector Academic/University 
PI Contribution Our team developed the prototype text2genome system that was then incorporated into the UCSC Genome Browser as the Publications tracks for human and other model organisms. Our team was also involved in lobbying publishers to allow their closed-access text to be used for the project.
Collaborator Contribution The UCSC Genome Bioinformatics team engineered a new version of the text2genome system, obtained licenses for all the closed-access publications, ran the computations, and hosted the final data.
Impact
- Publications tracks at UCSC Genome Browser: http://genome-euro.ucsc.edu/cgi-bin/hgTrackUi?g=pubs
- Initial publication on UCSC Publications tracks: http://nar.oxfordjournals.org/content/41/D1/D64.long
Start Year 2011
 
Title GNAT Gene Name Recognition Software 
Description GNAT is a library and web service that performs gene named-entity recognition (NER) and normalization on biomedical articles. Mentions of genes and proteins in the articles are linked to Entrez Gene identifiers. GNAT is available both for local download (suitable for large-scale processing) and as a web service (suitable for more limited processing or testing). 
Type Of Technology Software 
Year Produced 2011 
Open Source License? Yes  
Impact GNAT has been used in several other text-mining applications. 
URL http://gnat.sourceforge.net/
 
Title LINNAEUS Species Name Recognition Software 
Description LINNAEUS is general-purpose dictionary-matching software capable of processing multiple document formats used in the biomedical domain (MEDLINE, PMC, BMC, OTMI, text, etc.). 
Type Of Technology Software 
Year Produced 2010 
Open Source License? Yes  
Impact LINNAEUS has been used in dozens of other text-mining projects to date, in fields ranging from biomedicine to biodiversity. 
URL http://linnaeus.sourceforge.net/
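
Dictionary matching of the kind LINNAEUS performs can be illustrated with a small Python sketch. This is not the LINNAEUS implementation or its API: the tiny dictionary, the use of NCBI Taxonomy identifiers as the target vocabulary, and the function names are assumptions made for the example, and a regular-expression alternation stands in for whatever matching engine the real software uses.

```python
import re

# Tiny illustrative dictionary: lower-cased name -> NCBI Taxonomy ID.
# Mapping names to NCBI Taxonomy IDs is an assumption of this sketch.
SPECIES = {
    "homo sapiens": 9606,
    "human": 9606,
    "mus musculus": 10090,
    "mouse": 10090,
    "drosophila melanogaster": 7227,
}


def build_matcher(dictionary):
    """Compile one case-insensitive pattern that prefers longer names."""
    names = sorted(dictionary, key=len, reverse=True)
    alternation = "|".join(re.escape(name) for name in names)
    return re.compile(r"\b(" + alternation + r")\b", re.IGNORECASE)


def tag_species(text, dictionary=SPECIES):
    """Return (start, end, matched text, identifier) for each mention."""
    matcher = build_matcher(dictionary)
    return [
        (m.start(), m.end(), m.group(0), dictionary[m.group(0).lower()])
        for m in matcher.finditer(text)
    ]


if __name__ == "__main__":
    for mention in tag_species("Homo sapiens and mouse genes were compared."):
        print(mention)
```

Sorting names by length before building the pattern makes the matcher prefer the longest dictionary entry at a given position, which matters when one name is a prefix of another.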
 
Description Interview in Guardian Article 
Form Of Engagement Activity A magazine, newsletter or online publication
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Public/other audiences
Results and Impact Interviewed for Guardian article "Text mining: what do publishers have against this hi-tech research tool?" Guardian: http://www.guardian.co.uk/science/2012/may/23/text-mining-research-tool-forbidden

I was interviewed for two other articles on text mining after this article was published:
- "Pushing the Frontier of Access for Text Mining: A Conversation with Heather Piwowar on One Researcher's Attempt to Break New Ground" SPARC News (http://www.arl.org/sparc/media/pushing-frontier-access-for-text-mining-Piwowar-interview.shtml)
- "Trouble at the text mine." Nature 483:134-135 (http://www.nature.com/news/trouble-at-the-text-mine-1.10184)
Year(s) Of Engagement Activity 2013
URL http://www.guardian.co.uk/science/2012/may/23/text-mining-research-tool-forbidden