MINOTAUR: MINing Online Text - an Augmented User-friendly Resource

Lead Research Organisation: University of Manchester
Department Name: Life Sciences

Abstract

Biological research is generating mountains of data, and thousands of scientific papers describing those data are published every year. Paradoxically, there is now too much information for researchers to handle: there are too many papers to read and too much data to understand, and scientists just can't keep up with the deluge. How is it possible to find new things in the literature if we don't know where to look, or to learn new things if there are too many papers to read? The wealth of information now available makes it harder, not easier, to discover and assimilate new knowledge: we no longer even know what we know, because what we know is buried in the literature and is getting harder to dig out.
These difficulties have arisen partly owing to developments in 'high-throughput biology', which hinges on automated lab techniques that generate masses of data. Nowhere is this more apparent than in the sequencing of whole genomes, a feat that became possible only ~10 years ago. Sequencing a relatively small genome is now feasible in a day, but acquiring the data and attaching meaning to them are tasks at opposite ends of the technological spectrum. It is not enough to know how many genes an organism may possess: for this information to be useful, researchers must know what the genes do (i.e., what their functions are). Finding this out is hard manual labour: it involves searching databases to discover whether similar genes have been found in other organisms, and searching the literature to find out whether similar genes have been described and characterised experimentally. The problem is that searching the literature for information on a particular gene or protein can swamp researchers with thousands of publications reporting different facts, and finding the right pieces of information in this mass of publications is extremely tedious, slow and expensive.
In response to these problems, various automated (computational) approaches have been developed to help researchers find and extract relevant information from the expanding literature. However, the programs created so far have tended to address particular steps in the overall process. As a result, no software exists today that can find the most relevant articles for a researcher, extract the most pertinent facts from those articles, and summarise the findings entirely automatically. The aim of this project is to create such a tool by combining several text-mining programs that we've developed in recent years. As these tools were created in different projects to meet different needs, they don't fit together as a seamless package. We will rectify this by integrating their components into a unified suite capable of taking a protein or gene name, searching the online literature, downloading the most relevant abstracts, excising from these the most informative sentences, and presenting them in user-specified ways (individually, as paragraphs or as reports). Emphasis will be given to designing a software interface that's easy to use. This is vital: no matter how good a computer program is, it is effectively useless if its intended users find it hard or impossible to understand and use.
To ensure the suite meets the needs of the research community, we'll work with local users who've already expressed particular text-mining needs. We'll hold two workshops. In the first, we'll find out from potential users which interface requirements matter most (these will shape the design of the prototype). In the second, we'll uncover technical and usability issues (these will be fed back into the debugging process); this workshop will also help to gauge requirements for future developments that we won't have time to implement in this project but that could form the basis of future work.
Overall, then, the project will not only build on previous work, but will prepare the ground for future avenues of research.

Technical Summary

This is a 6-month project to improve in-house biological text-mining tools, making them easier to use and more generally useful. Manual annotation is known to be a major obstacle to database growth. To alleviate the manual burden, we have created various assistant tools: one (PRECIS) derives protein reports from Swiss-Prot annotation; others draw on the literature, using template- and SVM-driven sentence-classification systems (BioIE, METIS) to extract sentences on particular topics (structure, function, etc.) from PubMed abstracts; and another (BioMinT) integrates information retrieval and extraction techniques, ultimately to generate sentences suitable for output as database annotation. While useful, these tools were developed separately to meet specific needs. Hence their input requirements differ, as do their outputs, and they don't work together as a coherent package, each being tied to its own, often non-intuitive, front end. These tools represent years of work, an investment we wish to exploit. Our aim is to reimplement them, unifying their complementary components, reengineering them to eliminate redundancy, and uniting them under a single interface. This will allow different access points (protein/gene name query terms for PubMed searches, individual sequences for BLAST searching, and multiple sequences for PRECIS analysis) and user-specified output formats (single sentences, paragraphs or full protein reports). To our knowledge, no other software has this functionality. To address user needs, we will liaise with biologists and database curators through short workshops, to: ascertain the most important interface requirements, which will shape the prototype design (the interface is vital: no matter how good the software, it will be useless if it's hard to use); uncover technical and usability issues for feedback into the debugging process; and gauge requirements for future developments. The final package will be made available via the National Centre for Text Mining.
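
As a rough illustration of the extract-then-present flow described above, the following Python sketch classifies sentences from abstracts and renders them as individual sentences, a paragraph or a report. It is not the MINOTAUR implementation: the cue-word lookup is a deliberately simple stand-in for the template- and SVM-driven classifiers, the abstracts are passed in as plain strings rather than retrieved from PubMed, and every name in it (CUES, Hit, classify, extract, render) is hypothetical.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical cue words; a stand-in for the template/SVM classifiers.
CUES = {
    "function": ("catalyses", "binds", "regulates", "involved in"),
    "structure": ("domain", "helix", "fold", "crystal structure"),
}


@dataclass
class Hit:
    category: str
    sentence: str


def split_sentences(abstract: str) -> List[str]:
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", abstract)
    return [p.strip() for p in parts if p.strip()]


def classify(sentence: str) -> Optional[str]:
    # Return the first category whose cue words occur in the sentence.
    lowered = sentence.lower()
    for category, cues in CUES.items():
        if any(cue in lowered for cue in cues):
            return category
    return None


def extract(abstracts: List[str]) -> List[Hit]:
    # Keep only sentences that fall into one of the known categories.
    hits = []
    for abstract in abstracts:
        for sentence in split_sentences(abstract):
            category = classify(sentence)
            if category is not None:
                hits.append(Hit(category, sentence))
    return hits


def render(hits: List[Hit], fmt: str = "sentences") -> str:
    # Present the same hits in one of three user-selected formats.
    if fmt == "sentences":
        return "\n".join(h.sentence for h in hits)
    if fmt == "paragraph":
        return " ".join(h.sentence for h in hits)
    if fmt == "report":
        lines = []
        for category in CUES:
            selected = [h.sentence for h in hits if h.category == category]
            if selected:
                lines.append(category.upper())
                lines.extend("  " + s for s in selected)
        return "\n".join(lines)
    raise ValueError("unknown format: " + fmt)
```

Calling `render(extract(abstracts), "report")` on a list of abstract strings groups the retained sentences under one heading per category. Keeping extraction and rendering separate is one way a single front end could serve several output formats from the same set of classified sentences.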

 
Description: We developed a text-mining tool to help database curators gather information for annotating database entries.
Exploitation Route: The tool may be used by any scientist wishing to gather specific information from the literature.
Sectors: Healthcare, Pharmaceuticals and Medical Biotechnology

URL: http://www.bioinf.man.ac.uk/dbbrowser/minotaur/about.html