pubmed2ensembl: a resource for linking biological literature to genome sequences

Lead Research Organisation: University of Manchester

Department Name: Life Sciences

Abstract

Advances in technology are accelerating the rate of discovery and publication in biology. Approximately 500,000 articles are published annually on biological research, and advanced computational systems are now needed to fully access and interpret this wealth of biological information. On an equally grand scale, the complete genetic blueprint for a large number of species has recently been made available to the scientific community through international genome sequencing projects. These genome projects have in large part driven the explosion in biological publication; however, essentially no work has been done to develop computational systems that provide integrated access to genome sequences and the biomedical literature. This project seeks to overcome this critical limitation in access to biological information by developing a computational resource, called pubmed2ensembl, that will directly integrate genomic data with the biomedical literature, providing biological researchers with a unique bridge between two of the fastest growing sources of biological information. Our system will allow experimental and computational researchers alike to perform 'cross-lingual' and 'multi-lingual' queries using both textual and genomic information (e.g. querying textual data using genomic information as constraints). Additionally, our system will allow direct navigation to the literature from genome sequences, allowing researchers to browse the published literature as they would any other genomic feature (e.g. genes). pubmed2ensembl will be open-access, accessible by both human and programmatic interfaces, and will be integrated with established bioinformatics services and resources (such as the Ensembl Genome Browser).
By coupling the accumulated knowledge in millions of published articles directly with genome sequences, pubmed2ensembl will provide a critical and much-needed resource to decode biological processes encoded in genomes.

Technical Summary

Advances in DNA sequencing technology have drastically increased the rate of production of genomic sequence data, thereby accelerating the rate of biological discovery and publication. Genomic data are well served by genome portals, and PubMed provides widely used access to the biomedical literature. However, essentially no effort has been made to systematically integrate genome sequences directly with the biological literature, despite the fact that these are the two most heavily relied-upon sources of information for many biologists. The ability to navigate directly between genomes and the biomedical literature, and to perform cross- and multi-lingual queries using both textual and genomic constraints, would greatly aid experimental and computational researchers alike, and would provide a unique and much-needed bridge between two of the fastest growing sources of biological information. We propose to develop an open-access resource called pubmed2ensembl that links biological literature directly to genomes, allowing integrated queries over genomic and textual information via human and programmatic web interfaces. We will use both human-curated and automatically extracted gene-publication links to populate the pubmed2ensembl database, including a novel source of links based on an automated method to extract DNA sequences from text and map them to genomes (called text2seq). Queries to the pubmed2ensembl system will be executed using genome- or text-based data types and will return data types in the same or the complementary domain. The capability for such cross- and multi-lingual queries over text and genomic data will be a novel and defining feature of the pubmed2ensembl system. Our system will also uniquely leverage comparative genomic data to allow cross- and multi-species retrieval of text-based information, thereby enabling one of the most common workflows in the life sciences: using published results from model organisms to guide further biological research.
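
The idea of a query that takes a genomic constraint and returns text-domain results (or the reverse) can be illustrated with a minimal sketch. This is not the pubmed2ensembl implementation or schema: the gene identifiers, coordinates, PubMed IDs and function names below are all hypothetical, chosen only to show how gene-publication links allow a query to cross between the two domains.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    gene_id: str  # hypothetical identifier, not a real Ensembl ID
    chrom: str
    start: int
    end: int


# Toy data: coordinates and PubMed IDs are invented for illustration.
GENES = [
    Gene("GENE_A", "2L", 1000, 5000),
    Gene("GENE_B", "2L", 8000, 12000),
    Gene("GENE_C", "3R", 500, 900),
]

# Gene-publication links (curated or text-mined in a real resource).
GENE_TO_PMIDS = {
    "GENE_A": {101, 102},
    "GENE_B": {102, 103},
    "GENE_C": {104},
}


def pmids_in_region(chrom, start, end):
    """Genome -> text: PMIDs linked to genes overlapping a region."""
    hits = set()
    for gene in GENES:
        overlaps = gene.chrom == chrom and gene.start <= end and gene.end >= start
        if overlaps:
            hits |= GENE_TO_PMIDS.get(gene.gene_id, set())
    return hits


def genes_for_pmid(pmid):
    """Text -> genome: genes linked to a given publication."""
    return sorted(
        gene.gene_id
        for gene in GENES
        if pmid in GENE_TO_PMIDS.get(gene.gene_id, set())
    )
```

Here `pmids_in_region("2L", 4000, 9000)` collects the publications attached to both genes overlapping that interval, while `genes_for_pmid(102)` goes the other way, from one article to the genes it is linked to.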

Funded Value:

£99,333

Funded Period:

Feb 09 - Feb 10

Funder:

BBSRC

Project Status:

Closed

Project Category:

Research Grant

Project Reference:

BB/G000093/1

Principal Investigator:

Casey Bergman

Research Subject:

Tools, technologies & methods (24%)

Research Topic:

Bioinformatics (12%)

eScience (12%)

Organisations

People	ORCID iD
Casey Bergman (Principal Investigator)
Goran Nenadic (Co-Investigator)

Publications

Baran J (2011) pubmed2ensembl: a resource for mining the biological literature on genes. in PLoS ONE

Gerner M (2010) LINNAEUS: a species name identification system for biomedical literature. in BMC Bioinformatics

Haeussler M (2011) Annotating genes and genomes with DNA sequences extracted from biomedical articles. in Bioinformatics (Oxford, England)

Hakenberg J (2011) The GNAT library for local and remote gene mention normalization. in Bioinformatics (Oxford, England)

Key Findings
Impact Summary
Further Funding
Research Databases and Models
Collaboration
Software and Technical Products
Engagement Activities


Description	The key outcomes of this project were the development of several software systems (http://linnaeus.sourceforge.net/, http://gnat.sourceforge.net/, http://text2genome.sourceforge.net/ and https://github.com/pubmed2ensembl/) that can be used to link biomedical publications to genomic data.
Exploitation Route	The publications track at the UCSC Genome Browser has the potential to be used by the ~500,000 daily visitors, including those from medical, pharmaceutical, and governmental institutions. A follow-up project extending the text2genome system developed during this project is now underway at the University of California Santa Cruz Genome Browser (http://genome.ucsc.edu/) and is described here: http://blog.openhelix.eu/?p=12420
Sectors	Education
URL	http://pubmed2ensembl.org


Description	Since 2010 our data and project websites (www.pubmed2ensembl.org and www.text2genome.org) have been used by biomedical scientists worldwide. Our approach to linking publications to genomic data has now been implemented in the UCSC Genome Browser as "publications" tracks, broadening the user base of our results considerably.
First Year Of Impact	2010
Sector	Education
Impact Types	Economic


Description	BBSRC INTERNATIONAL SCIENTIFIC INTERCHANGE SCHEME (ISIS)
Amount	£1,390 (GBP)
Funding ID	2076
Organisation	Biotechnology and Biological Sciences Research Council (BBSRC)
Sector	Public
Country	United Kingdom
Start	04/2010
End	03/2011


Title	pubmed2ensembl
Description	pubmed2ensembl is a customised and extended version of the Ensembl BioMart on genes. We have extended the mart with gene-related publication information, i.e. PubMed-IDs and PubMed Central-IDs including URL link-outs and other information, from the following sources: Entrez Gene, EMBL Nucleotide Sequence Database, GNAT gene recognition runs on Medline and PMC, and text2genome.
Type Of Material	Database/Collection of data
Year Produced	2011
Provided To Others?	Yes
Impact	Elements of this database were used in the conversion of the text2genome project into the UCSC Genome Browser Publications track.
URL	http://www.pubmed2ensembl.org/


Title	text2genome
Description	text2genome maps scientific articles to genomic locations in a distinctive way: from a full-text scientific article and its supplementary data files, all words that resemble DNA sequences are extracted and then mapped to public genome sequences. The mapped sequences can then be displayed on genome browser websites and used in data-mining applications.
Type Of Material	Database/Collection of data
Year Produced	2011
Provided To Others?	Yes
Impact	This database formed the basis for the UCSC Genome Browser Publications track
URL	http://www.text2genome.org/
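
The extraction step described for text2genome can be sketched roughly as follows. This is an illustrative assumption, not the text2genome code: the function names, the 20-character minimum length and the restriction to the letters A, C, G and T are all hypothetical choices, and the exact substring search stands in for the sequence alignment against genomes that a real system would need.

```python
import re


def extract_dna_like_words(text, min_len=20):
    """Return words made only of A/C/G/T that are at least min_len long.

    min_len is an assumed threshold; short matches such as "TATA" or
    ordinary words like "cat" would otherwise be picked up.
    """
    pattern = re.compile(rf"\b[ACGTacgt]{{{min_len},}}\b")
    return [match.group(0).upper() for match in pattern.finditer(text)]


def locate_exact(sequence, reference):
    """Return the 0-based offset of sequence in reference, or None.

    Exact matching on one strand only; a stand-in for a real aligner.
    """
    position = reference.find(sequence)
    return position if position >= 0 else None
```

Given article text containing a primer such as `ACGTACGTACGTACGTACGTAA`, `extract_dna_like_words` returns that word in upper case, and `locate_exact` reports where it occurs in a reference string. Mismatches, reverse-complement hits and sequences split across lines or table cells are not handled in this sketch.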


Description	Collaboration with UCSC Genome Bioinformatics Team
Organisation	University of California, Santa Cruz
Department	GenomeBrowser
Country	United States
Sector	Academic/University
PI Contribution	Our team developed the prototype text2genome system that was then incorporated into the UCSC Genome Browser as the Publications tracks for human and other model organisms. Our team was also involved in lobbying publishers to allow their closed access text to be used for the project.
Collaborator Contribution	The UCSC Genome Bioinformatics team engineered a new version of the text2genome system, obtained licenses for all the closed-access publications, ran the computations and hosted the final data.
Impact	- Publications tracks at UCSC Genome Browser: http://genome-euro.ucsc.edu/cgi-bin/hgTrackUi?g=pubs - Initial publication on UCSC Publications tracks: http://nar.oxfordjournals.org/content/41/D1/D64.long
Start Year	2011


Title	GNAT Gene Name Recognition Software
Description	GNAT is a library and web service that performs named entity recognition (NER) and normalization of gene mentions in biomedical articles. Mentions of genes and proteins in the articles are linked to Entrez Gene identifiers. GNAT is available both for local download (suitable for large-scale processing) and as a web service (suitable for more limited processing or testing).
Type Of Technology	Software
Year Produced	2011
Open Source License?	Yes
Impact	GNAT has been used in several other text-mining applications.
URL	http://gnat.sourceforge.net/


Title	LINNAEUS Species Name Recognition Software
Description	LINNAEUS is a general-purpose dictionary matching software, capable of processing multiple types of document formats in the biomedical domain (MEDLINE, PMC, BMC, OTMI, text, etc.).
Type Of Technology	Software
Year Produced	2010
Open Source License?	Yes
Impact	LINNAEUS has been used in dozens of other text-mining projects to date, in fields ranging from biomedicine to biodiversity.
URL	http://linnaeus.sourceforge.net/


Description	Interview in Guardian Article
Form Of Engagement Activity	A magazine, newsletter or online publication
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Public/other audiences
Results and Impact	Interviewed for the Guardian article "Text mining: what do publishers have against this hi-tech research tool?" (http://www.guardian.co.uk/science/2012/may/23/text-mining-research-tool-forbidden). I was interviewed for two other articles on text mining after this article was published: "Pushing the Frontier of Access for Text Mining: A Conversation with Heather Piwowar on One Researcher's Attempt to Break New Ground", SPARC News (http://www.arl.org/sparc/media/pushing-frontier-access-for-text-mining-Piwowar-interview.shtml), and "Trouble at the text mine", Nature 483:134-135 (http://www.nature.com/news/trouble-at-the-text-mine-1.10184).
Year(s) Of Engagement Activity	2013
URL	http://www.guardian.co.uk/science/2012/may/23/text-mining-research-tool-forbidden
