Scalable and Robust GRID-based Text Mining of Scientific Papers

Lead Research Organisation: University of Cambridge

Department Name: Physics

Abstract

Search engine technology is one of the most profitable sectors of information technology. Current search techniques rely on keywords and have serious limitations when applied to complex information sources. Advanced text-mining techniques extract far more useful information from documents, and promise to vastly improve the success of users' searches. We plan to deploy such techniques to search the scientific literature as a precursor to a larger commercial deployment in other domains. This requires the existing tools to be scaled up in capacity, making the use of computational grids an attractive possibility. Distributed techniques apply readily because each document can be processed independently of the others. However, early experiments have revealed a number of problems with the robustness of the text-processing pipeline and with distributed job management, which leave an unacceptably high number of papers unannotated. The main goal of this project is to address these issues in a more robust and generic way by developing better Grid-based job management tools using the STFC-funded Ganga system, and by re-implementing the less reliable components of the text-mining system. The resulting system should allow the partner company to offer a search facility superior to any currently on the market, with potentially very large commercial returns.

Funded Value:

£83,897

Funded Period:

Oct 08 - Sep 09

Funder:

STFC

Project Status:

Closed

Project Category:

Research Grant

Project Reference:

ST/G003599/1

Principal Investigator:

Michael Andrew Parker

Research Subject:

Particle physics - experiment (100%)

Research Topic:

Beyond The Standard Model (100%)

Organisations

University of Cambridge (Lead Research Organisation)

People	ORCID iD
Michael Andrew Parker (Principal Investigator)	http://orcid.org/0000-0001-9798-8411
