Scalable and Robust GRID-based Text Mining of Scientific Papers

Lead Research Organisation: University of Cambridge
Department Name: Physics


Search engine technology is one of the most profitable sectors of information technology. Current search techniques are based on keywords and have serious limitations when applied to complex information sources. Advanced text-mining techniques extract far more useful information from documents and promise to vastly improve the success of users' searches. We plan to deploy such techniques to search scientific literature as a precursor to a larger commercial deployment in other domains. This requires the existing tools to be scaled up in capacity, which makes computational grids an attractive option. Distributed techniques are easy to apply because each document can be processed in a self-contained way. However, early experiments have revealed a number of problems with the robustness of the text-processing pipeline and with distributed job management, which result in an unacceptably high number of papers not being annotated. The main goal of this project is to address these issues in a more robust and generic way by developing better Grid-based job management tools using the STFC-funded Ganga system, and by re-implementing the less reliable components of the text-mining system. The resulting system should allow the partner company to offer a search facility superior to any currently on the market, with potentially very large commercial returns.

