Pilot study of the utility of text mining and machine learning tools to accelerate systematic review and meta-analysis of findings of in vivo research

Lead Research Organisation: University of Edinburgh
Department Name: Centre for Clinical Brain Sciences

Abstract

Biomedical research is an incremental process, in which the findings from one experiment inform, and are challenged or confirmed in, future experiments. Where findings from one experiment are not available to those planning future experiments, the research process loses efficiency. Making results of research widely available is an important driver for open access dissemination, and has been identified as an important factor in increasing research efficiency.
There is now more research published than ever before. The primary bibliographic database for biomedical research, PubMed, adds around 3,500 new references every day. Our random sample of 2000 publications in PubMed suggests that in 2013 there were 98,000 publications describing in vivo experiments, of which 21,000 were in pharmacology and 14,500 in neuroscience. No one individual can read, let alone appraise critically or use, even a small fraction of this new information, which is the product of months of investigator effort and substantial investment of research funds. This mismatch, between the amount of research produced and the amount that can be effectively used, is a major challenge to biomedical research.
The Cochrane Collaboration has been highly successful in synthesising meta-analyses of clinical trial data and providing outcomes in an easily assimilated, widely recognised format readily usable for healthcare funding decisions and day to day clinical practice. This approach has also influenced major improvements in research quality, especially the design, conduct and reporting of clinical trials. Whilst we wish to replicate the success of Cochrane in the pre-clinical domain, we recognise that the sheer volume and publication rate of pre-clinical data mean that methodological innovations are required beyond the largely manual processes that are currently adopted for most clinical systematic reviews. For example, in our recently completed systematic reviews of neuropathic pain, data from 229 clinical trials required extraction, whereas for the corresponding ongoing pre-clinical systematic review 65,156 publications were retrieved by the search, 33,818 had to be screened, and of these, data are being extracted from ~6000.
Further, there are substantial concerns about the risk of bias (due to sub-optimal experimental design) and publication bias in the work that is published, biases that are likely to overstate observed effects. Where sample sizes are low (and sample size calculations are seldom reported), there is also a risk that important biological effects are overlooked because individual studies are underpowered.
In brief then, the challenges are:
1. Information of potential relevance to scientists is produced at such a volume and rate that "reading the literature" is not feasible
2. The risk of bias in in vivo research is such that detailed critical appraisal is required to allow judgement of whether the conclusions drawn are justified and whether a particular experimental design is appropriate
3. Publication bias means that scientists relying on selected sources (eg particular journals) are likely to be misled
4. Conventional systematic reviews can be helpful, but are usually one to two years out of date on the day of publication, a problem that is further compounded by the sheer volume of data implicit in a pre-clinical systematic review

Here we propose to exploit recent developments in text mining and machine learning to establish whether these are yet at the stage where they can be implemented in systematic reviews of in vivo data, to assist with the challenges outlined above.

Technical Summary

Firstly we will convene an expert panel to establish "required" and "desired" performance thresholds for text mining and machine learning tools. Then, for each of the three tasks of identifying and retrieving relevant publications, extracting meta-data from identified publications, and extracting outcome data from relevant publications, we will (1) where not already performed, conduct a systematic review to identify all candidate approaches; (2) implement the most promising approaches using existing systematic review datasets; and then (3) prospectively validate these approaches in ongoing systematic reviews. These tasks will be conducted by a team which brings together expertise in text mining and machine learning as applied to systematic review (Thomas, Ananiadou) and in the conduct of systematic reviews of in vivo data (Sena, Rice, Macleod), supported by external collaborators.
The development datasets are (1) a systematic review of in vivo studies in neuropathic pain, (2) a systematic review of in vivo publications from leading UK institutions, and (3) a selection of in vivo publications describing different outcome measures curated on the CAMARADES database. For the validation dataset we will use a systematic review of in vivo models of depression. For each, we will ascertain the sensitivity, specificity and, where relevant, the accuracy of the text mining/machine learning approach, and the reduction in human work (eg number of articles needed to screen) possible whilst maintaining performance at the "desired" threshold.
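As an illustration only, the performance measures named above could be computed from paired human and machine screening decisions along the lines of the following Python sketch. The function and field names are hypothetical, and the definition of work saved (the share of records the tool excludes, and which would therefore not be passed to human screeners) is an assumption rather than a definition taken from this proposal.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ScreeningMetrics:
    """Hypothetical container for the measures discussed in the text."""
    sensitivity: float  # share of truly relevant records the tool includes
    specificity: float  # share of truly irrelevant records the tool excludes
    accuracy: float     # share of all records where tool and human agree
    work_saved: float   # ASSUMED definition: share of records the tool excludes


def evaluate_screening(
    human_include: Sequence[bool],
    machine_include: Sequence[bool],
) -> ScreeningMetrics:
    """Compare machine include/exclude decisions against human decisions.

    Human decisions are treated as the reference standard (an assumption).
    """
    if len(human_include) != len(machine_include):
        raise ValueError("decision lists must have the same length")

    pairs = list(zip(human_include, machine_include))
    tp = sum(1 for h, m in pairs if h and m)          # relevant, included
    fn = sum(1 for h, m in pairs if h and not m)      # relevant, missed
    fp = sum(1 for h, m in pairs if not h and m)      # irrelevant, included
    tn = sum(1 for h, m in pairs if not h and not m)  # irrelevant, excluded

    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one included and one excluded record")

    total = len(pairs)
    return ScreeningMetrics(
        sensitivity=tp / (tp + fn),
        specificity=tn / (tn + fp),
        accuracy=(tp + tn) / total,
        work_saved=(tn + fn) / total,
    )


# Toy decisions invented for illustration; not results from any review.
human = [True, True, True, False, False, False, False, False, False, False]
machine = [True, True, False, True, False, False, False, False, False, False]
print(evaluate_screening(human, machine))
```

In practice the tool's inclusion threshold would be tuned so that sensitivity stays at or above the level agreed by the expert panel, and the work saved would be read off at that threshold.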

Planned Impact

At full implementation, the ability to provide unbiased contemporary summaries of existing knowledge would have a profound impact on the effective re-use of that knowledge in the planning of future in vivo research and would provide a firmer basis for academic and commercial decisions to embark on human clinical trials. Those impacts will not be realised during this pilot phase, but to the extent that a pilot phase is required (and we believe that it is), those benefits cannot be realised without this proposed work.

Specifically, this research addresses the first strategic aim of the MRC, to speed up the exploitation of the best ideas in medical science, by providing unbiased contemporary summaries of existing knowledge; and the fourth strategic aim of enabling the scientific community to respond effectively to current and future challenges in medical research by providing more detailed assessment of risks of bias in in vivo research.

Further, by allowing researchers and funders to know what is already known, we will provide a platform from which unnecessary replication might be prevented (thereby reducing unnecessary research involving animals). In time, we aspire also to be able to identify relevant information gaps (for instance regarding the efficacy of a drug being considered for clinical trial), thereby allowing value of information analysis to assess the comparative utility of competing funding applications.

We will maximise the impact of our research by
1. Securing consensus from the in vivo systematic review community for the required performance of machine learning/text mining tools, so that if we are able to deliver that performance it will be fit for purpose.
2. Embedding those components of our work which reach the "required" threshold in the online CAMARADES systematic review system, which has 147 registered users around the world.
3. Explaining the use of these components in a series of short video user guides.
4. Making these tools available to the community (including, for instance, funding agencies), to the extent that the components which we are able to develop can be repurposed for other tasks (searching with results ranked by increasing risk of bias, ascertainment of risks of bias in recently published work).
5. Seeking the agreement of organisations such as the NC3Rs to provide links to the tools and user guides, for instance as part of the Experimental Design Assistant.
6. One of the important issues in this work is the increasing complexity of summarising findings from laboratory research, and the nuances of explaining to a lay audience that all research is not of equal value and that the value of research is something which can be increased. We have substantial experience in public engagement work at a local level (Edinburgh Science festival, Midlothian Science Festival), nationally (British Science Association) and in the print and broadcast media. We will use these skills to engender greater societal understanding of the complexities of the challenges faced, and of the steps taken by scientists, publishers and funders to address these.
7. During the grant we will continue to work with David Carr (Wellcome Trust) and David Crosby (MRC) to build a wider consortium of potential funders in anticipation of a larger scale application.

Publications

Akl EA (2017) Living systematic reviews: 4. Living guideline recommendations. in Journal of clinical epidemiology

Bahor Z (2017) Risk of bias reporting in the recent animal focal cerebral ischaemia literature. in Clinical science (London, England : 1979)

Bannach-Brown A (2021) Technological advances in preclinical meta-research. in BMJ open science

Bannach-Brown A (2017) Understanding in vivo modelling of depression in non-human animals: a systematic review protocol in Evidence-based Preclinical Medicine

 
Description IMI 2
Amount € 9,627,162 (EUR)
Funding ID 777364 
Organisation European Commission 
Sector Public
Country European Union (EU)
Start 10/2017 
End 09/2020
 
Title training datasets for ML 
Description we have made curated training sets of systematic search results and include/ exclude decisions available to the community 
Type Of Material Database/Collection of data 
Year Produced 2017 
Provided To Others? Yes  
Impact others have used it to train and validate their algorithms 
 
Description CAMARADES 
Organisation Florey Institute of Neuroscience and Mental Health
Department National Stroke Research Institute NSRI
Country Australia 
Sector Academic/University 
PI Contribution we lead the collaboration, host the database, and have led its ongoing development
Collaborator Contribution rich research collaboration
Impact numerous publications with an h-index for the collaboration of 16
Start Year 2007
 
Description CAMARADES 
Organisation University Medical Center Utrecht (UMC)
Department Neurology UMC
Country Netherlands 
Sector Academic/University 
PI Contribution we lead the collaboration, host the database, and have led its ongoing development
Collaborator Contribution rich research collaboration
Impact numerous publications with an h-index for the collaboration of 16
Start Year 2007
 
Description CAMARADES 
Organisation University of Nottingham
Department School of Medicine
Country United Kingdom 
Sector Academic/University 
PI Contribution we lead the collaboration, host the database, and have led its ongoing development
Collaborator Contribution rich research collaboration
Impact numerous publications with an h-index for the collaboration of 16
Start Year 2007
 
Description EQIPD 
Organisation Imperial College London
Country United Kingdom 
Sector Academic/University 
PI Contribution consortium lead
Collaborator Contribution we initiated the consortium, led the development, led the application for funding, and lead the project
Impact 1st paper under review
Start Year 2016
 
Description EQIPD 
Organisation Janssen Pharmaceutica NV
Country Belgium 
Sector Private 
PI Contribution consortium lead
Collaborator Contribution we initiated the consortium, led the development, led the application for funding, and lead the project
Impact 1st paper under review
Start Year 2016
 
Description MDAR checklist development 
Form Of Engagement Activity A formal working group, expert panel or dialogue
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Development of a checklist for reporting biomedical research; I'm the academic lead, with reps from PLoS, Nature, Science, Cell, and others
Year(s) Of Engagement Activity 2017,2018,2019
 
Description keynote speaker, NHMRC/Reward meeting 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact NHMRC is ramping up its approach to reducing research waste, and this was a key contribution to that process
Year(s) Of Engagement Activity 2018