The Lazarus Project: Resurrecting data and knowledge from life science articles by crowd-sourcing

Lead Research Organisation: University of Manchester
Department Name: Computer Science

Abstract

The scientific literature is one of the most important knowledge resources for the life sciences, with over 200k articles downloaded each day from Elsevier's ScienceDirect system alone. PubMed indexes some 22 million articles from over 20k journals, and around two new papers are added every minute. For most scientists, reading, analysing and organising their personal library of articles is a daily task that forms a fundamental part of their scientific process. As the rate of publishing accelerates, the need for computational support to work out which articles to read, and how to interpret, reproduce and validate the claims they contain, is growing. However, traditional publications are aimed at consumption by humans (they are 'stories that persuade with data'), and their combination of nuanced natural language and complex figures does not make them easily amenable to processing by machine. In the life-science literature, drug-like molecules are typically represented as illustrations; biochemical properties as tables or graphs; protein/DNA sequences are buried amongst text; references and citations have arcane formats; and other objects of biological interest are referred to by ambiguous names. Capturing such data necessitates the familiar drudgery of re-typing figures from tables, chasing citations through digital libraries and redrawing molecules by hand: all of these are tedious, error-prone and wasteful processes that scientists carry out on a regular basis, and whose results currently go to waste. Mass-mining methods (text mining, optical recognition) to automate such tasks are not yet sufficiently reliable to be used without human validation, and are generally disallowed by the licenses under which articles are published. Thus, without the 'human computation' possible through crowd-sourcing, existing knowledge is destined to remain entombed in the literature.

The Lazarus Project aims to harness the crowd of scientists reading life-science articles to resurrect the swathes of legacy data buried in charts, tables, diagrams and free text, and to liberate processable data into a shared resource that benefits the community. To do so, Lazarus will harness activities that are currently carried out by individuals for their own purposes (annotating, cross-referencing articles with databases, organising collections of articles).

Our approach is to extend the functionality of an existing literature-enhancement platform that is currently designed for individual use. Utopia Documents is a PDF reader that enhances the experience of reading life-science literature: it analyses documents on the fly, links their content to online resources, and helps users explore associated data and knowledge bases. It has a number of 'convenience' features, such as extracting data from tables, reconstructing molecules from images or 'Markush-like' representations, and navigating citations, that make interacting with the content of an article more efficient. Its counterpart, Utopia Library, provides complementary functions for collections: automated recommendation, legitimate copyright/license-sensitive acquisition and sharing of articles, and sophisticated 'semantic' classification and organisation of personal libraries. Lazarus aims to enhance the Utopia tools such that the micro-tasks already performed by individuals can be harnessed at a crowd scale and repurposed for crowd consumption.

As a result, scientists will benefit from richer, more searchable literature and more accessible data; publishers will benefit from enriched content, without the need to develop new in-house infrastructures; and data integration initiatives will benefit from access to a rich literature/data-linking resource.

Technical Summary

Lazarus aims to harness the crowd of scientists reading life-science articles to recover the swathes of legacy data buried in charts, tables, diagrams and free text, and to liberate processable data into a shared resource that benefits the community. Scientific articles are 'stories that persuade with data', but their historical format makes accessing the data for validation or analysis difficult: small molecules are typically represented as illustrations; biochemical properties as tables or graphs; protein/DNA sequences are buried amongst text; references and citations have arcane formats; and other objects of biological interest are referred to by ambiguous names. Capturing such data necessitates the familiar drudgery of re-typing figures from tables, chasing citations through digital libraries and redrawing molecules by hand: tedious, error-prone and wasteful processes whose results currently go to waste. Mass-mining methods (text mining, optical recognition) to automate such tasks are not yet sufficiently reliable to be used without human validation, and are generally disallowed by the licenses under which articles are published. Without 'human computation', existing knowledge is thus destined to remain entombed in the literature.
Lazarus' objectives are to harness a percentage of paper readership and to leverage the Utopia document-reading platform, with which any PDF from any publisher can be read. We aim to harness individuals' 'micro-tasks' of extracting data or annotating articles for personal use, and pool them for reuse; cross-validate and feed back annotations to better train the crowd and improve data quality; produce an open-access, restriction-free, searchable and processable resource for use by computational and analytical pipelines; create a web-based observatory, gathering per-article metrics; and observe and steer the crowd toward data-resurrection campaigns. Lazarus' methods combine data-extraction micro-task design, task observation, crowd engagement and data reuse.

Planned Impact

Lazarus has the potential for exceptionally broad impact in the Life Sciences and beyond.

While the UK leads globally in terms of open access policies, the scientific community is in desperate need of tools to exploit the potential of these recent changes, and to make the most of the knowledge currently locked in the literature. The recent acquisition of Mendeley, holder of the largest 'independent' collection of bibliographic metadata and citation network data, by commercial publishing giant Elsevier makes the creation of an open, freely accessible repository of knowledge from the literature ever more pressing.

BBSRC's investment in biology generates much data that are under-exploited, and making available "what we didn't know we already knew" will have immediate and long-term benefits to biological science. Biologists lack tools tuned to aggregate, integrate and mine the data and insights currently locked in the scientific literature; this project addresses this need. This project has the potential to make an impact on the "reduction, refinement, and replacement" of animal experiments: by making the data on experiments published in the literature more available, replication can be avoided. Although in its pilot phases this project focuses on three areas of life sciences (pathways, pharmacology and sequence/structure analysis), these are merely case studies designed to enable the fine-tuning of the crowd-sourcing approach and the underlying technology; the resulting platform and approach will be applicable in any life science domain.

Scientists, whether in academia or industry, will benefit from richer, more searchable literature, and more straightforward access to the data and concepts that are currently sequestered in papers. Tasks that they are presently required to perform manually and repeatedly will be simplified, reducing the time wasted and increasing the quality of the results. The data generated by UK-funded research, past and future, will be more open and accessible to human and machine consumption.

Scientific publishers of all scales, whether commercial or scholarly, will benefit from enriched content, without the need to develop new in-house infrastructures.

Data integration initiatives and primary life science databases will benefit from open access to a rich literature/data-linking resource, the content of which has been validated by the crowd.

The pharmaceutical/biotech industry will benefit from a system that allows them to 'join up' their in-house knowledge, linking their scientists' reading habits to their in-house knowledge bases.

Publications

 
Description The software was released in 2017, and is in daily use around the world by private individuals and within academia and pharma/biotech companies. A component of the software used for the typographical analysis and reconstruction of life-science literature is currently being re-purposed for the analysis of policy and guidance documents in the UK National Archive in a project funded through the EPSRC KTA programme in collaboration with SME MirrorWeb.
First Year Of Impact 2017
Sector Government, Democracy and Justice; Pharmaceuticals and Medical Biotechnology; Other
Impact Types Policy & public services

 
Description EPSRC Impact Acceleration Award
Amount £158,950 (GBP)
Organisation University of Sheffield 
Department EPSRC KTA Knowledge Transfer Account
Sector Academic/University
Country United Kingdom
Start 11/2017 
End 12/2018
 
Title Lazarus 
Description An extension to the Utopia Documents PDF reader that crowdsources knowledge by mining the article being read and contributing it to a central server, which is then used to further enhance the reading of subsequent users. 
Type Of Technology Software 
Year Produced 2017 
Impact Licensing of underlying software to social networks and pharmaceutical companies; follow-on project using the typographical analysis component for reconstructing legal documents in the UK National Archive, especially relating to policy and Brexit. 
URL http://utopiadocs.com/lazarus