Mining the History of Medicine

Lead Research Organisation: University of Manchester
Department Name: Computer Science

Abstract

This project, a cross-disciplinary collaboration between the National Centre for Text Mining (NaCTeM) and the Centre for the History of Science, Technology and Medicine (CHSTM) at the University of Manchester, seeks to demonstrate the potential of text mining in medical history. To do this, firstly an asset will be created out of two very large, long-running digital sources, the British Medical Journal (BMJ) (1840 - present) and the London-area Medical Officer of Health (MOH) reports (1848-1972), by applying text mining techniques to enrich these data with semantic annotations.

An important aspect of this work is to build tools to identify and record terminological variation and semantic shift over time, via construction of a temporal terminological inventory from the archives. Then, a semantic search system will be developed to help historians in broadening and deepening their work to ask 'big' questions that cover long periods, without losing sensitivity to changes in terminology and meaning.

The resulting asset and tools will be used and evaluated in two case studies, exploring the modern epidemiological transition and the creation of a medical surveillance culture, two massively important and interrelated changes in British health experience, where many questions remain unanswered. The methods and results of the case studies will serve as concrete examples of how such an asset and tools can be used.

The text mining tools and derived resources will be made available to the community via an interoperable text mining environment that will be contributed to the repositories of major digital humanities infrastructures (CLARIN, DARIAH, META-SHARE).

The project will be guided by an Expert User Advisory Group. Two workshops are planned to receive further guidance from and to inform the wider community.

The project plans to extend its impact to the following sectors: public health, public policy, publishing, media and libraries, with a view to ensuring sustainability and wider uptake of methods and technologies.

Planned Impact

The direct impact of our project will be in the creation of an asset and technology that facilitate the work of humanities scholars from multiple disciplines, who seek new ways to reveal, explore, and discuss long-term, large-scale historical transformations related to medicine and health.

However, we have carefully designed our engagement strategy, supported by an Expert User Advisory Group, such that, indirectly, the project will impact on the following sectors: public health, public policy, publishing, media and libraries, thus ensuring not only high impact but also excellent potential for sustainability.

Evidence-based public health, particularly the National Institute for Health and Care Excellence (NICE), and public policy will benefit from the understanding of medical historical data at the large scale that our text mining methods will address. Text mining will uncover associations and hidden information from large-scale historical archives about disease and its impact on the environment and society, which can be used to guide and inform public health and policy. We offer a unique opportunity to advance our understanding of disease in the environment through historical data, which have hitherto been under-examined in all their complexity.

The text mining research methods, which use terms, semantic clustering and semantic metadata, will impact on the publishing industry, which is seeking to improve scholarly communications by enriching its archives using semantic methods but often lacks the expertise and technology.

Semantic analysis of historical data helps to inform the citizen: extending our links with the British Library, Wellcome Library, and the BBC, we shall also reach out to national archives (UK National Archives, National Archives of Scotland), in order to demonstrate how archives in the public domain can be enriched to allow the citizen to engage with big historical data semantically without being led astray by temporally-conditioned language.
 
Description Large-scale historical text archives are a rich and diverse source of information, but it can be difficult for historians to explore and search such large volumes without automated means.
To help historians and laypeople search and discover information, we have developed a search system which analyses these large archives automatically by using text mining methods. We support historians by showing how concepts and their relationships in the history of medicine change over time.
We developed: a) a search system; b) a terminological inventory; c) a corpus with semantically annotated concepts and their relationships; d) a text mining tool kit for other researchers working in that area.
Exploitation Route We have developed a number of tools and resources which are open and freely available to the community.
Data are available from the META-SHARE network of language repositories. The time-sensitive inventory of medical terminology is available here: http://metashare.metanet4u.eu/go2/medical-inventory. The HIMERA annotated corpus is available here: http://metashare.metanet4u.eu/go2/himera-corpus.
Sectors Environment,Healthcare,Culture, Heritage, Museums and Collections

URL http://nactem.ac.uk/hom/
 
Description We have created a publicly available search system. This was achieved by drawing upon evidence from two large and varied archives of historical medical text, the British Medical Journal (BMJ) (http://www.bmj.com/archive) and the London Medical Officer of Health reports (MOH) (http://wellcomelibrary.org/moh/), whose documents collectively span the period from 1840 to the present day, and each of which has a different focus (i.e., professional medical matters vs. public health issues).
First Year Of Impact 2015
Sector Digital/Communication/Information Technologies (including Software),Environment,Healthcare,Culture, Heritage, Museums and Collections
Impact Types Cultural,Societal,Policy & public services

 
Description Copyright and Licensing in relation to Text and Data Mining
Geographic Reach Multiple continents/international 
Policy Influence Type Participation in a national consultation
Impact The National Centre for Text Mining played a leading role in advising on policy and development of UK legislation regarding a copyright exception in relation to text mining. Contributions included talks at events at the Houses of Parliament, the European Parliament and the London School of Economics, and participation in consultations by the IPO and the EC (on the wider issue of copyright and licensing in the EU). Advice was also given on numerous occasions at the request of the IPO during development of the legislation, which came into force on 1st June 2014. It is somewhat too early to ascertain impact; however, this has already led to major initiatives such as Europe PubMed Central being able to lawfully text mine full papers, as well as increased levels of text mining within such bodies as the British Library and within institutional repositories. It has also led to increased scope and expected impact of research projects, as these can for the first time tackle large-scale text mining of full-text articles that are lawfully subscribed to, in addition to open access material.
URL http://www.jisc.ac.uk/sites/default/files/value-text-mining.pdf
 
Title Argo for Biodiversity 
Description Argo is an interoperable infrastructure for building and running text-analysis solutions. It facilitates the development of custom text mining workflows from a selection of text mining components. We have augmented Argo to include biodiversity text mining tools. 
Type Of Material Improvements to research infrastructure 
Year Produced 2017 
Provided To Others? Yes  
Impact Supports the curation of databases and user collaboration, includes numerous processing components (including third-party ones), and allows the creation of text mining workflows. Includes text mining tools for biodiversity.
URL http://argo.nactem.ac.uk
 
Title EventMine 
Description EventMine is a machine learning-based pipeline system, which extracts events from documents that already contain named entity annotations (e.g., genes/proteins, etc.). Given appropriate training data, it can be trained to extract many different types and structures of events. 
Type Of Material Improvements to research infrastructure 
Year Produced 2012 
Provided To Others? Yes  
Impact Community shared tasks; other research teams improved results; customised to different domains and application areas; part of the Argo text mining platform http://argo.nactem.ac.uk
URL http://www.nactem.ac.uk/EventMine/
 
Title Search History of Medicine 
Description Text mining tools applied to large historical archives allow the development of a sophisticated semantic search system, which provides functionalities such as the following: (1) automatically expanding user-entered query terms with synonyms, variants and other semantically related terms, in order to aid in the retrieval of a maximal number of potentially relevant documents; (2) using automatically identified semantic information (e.g., named entities (NEs) and relationships between them) as a means to isolate documents of greatest interest and/or to help users to explore the contents of large result sets from a semantic perspective.
Type Of Material Improvements to research infrastructure 
Year Produced 2015 
Provided To Others? Yes  
Impact We adapted text mining techniques to the important domain of medical history, which has previously received little attention from a text mining viewpoint. Specifically, we are concerned with the development of the necessary resources and tools to facilitate the text mining (TM) analysis of various types of published documents on medically related matters, dating back to the mid-19th century. This task presents a number of challenges, owing to the varying characteristics that such documents can exhibit and that may evolve as time progresses. These varying characteristics include not only potential shifts in terminology, but also possible variations in writing styles, according to the author, subject matter and intended audience of documents, together with changes in vocabulary and language structure over time. Such characteristics introduce difficulties not only in developing suitable terminological resources, which must account for the various ways in which concepts may be expressed in text both within and across different time periods, but also in creating annotated corpora that are fit for purpose. Historians of medicine use this search system to gain a better understanding of their research questions.
URL http://nactem.ac.uk/hom/
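
The query expansion described in this entry can be illustrated with a minimal sketch. It is not the project's implementation: the `Variant` class, the `INVENTORY` layout, the `expand_query` function and all year ranges are hypothetical and invented purely for illustration. The sketch assumes a time-sensitive inventory that records, for each concept, the terms used to name it and the period in which each term was in use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """One way of naming a concept, with the years in which it was in use."""
    term: str
    first_year: int
    last_year: int


# Hypothetical toy inventory: concept -> historical variants.
# The year ranges are invented for illustration, not taken from the project's data.
INVENTORY = {
    "tuberculosis": [
        Variant("phthisis", 1840, 1920),
        Variant("consumption", 1840, 1930),
        Variant("tuberculosis", 1880, 2015),
    ],
}


def expand_query(term, start_year, end_year, inventory=INVENTORY):
    """Return the query term plus every variant in use during the given years."""
    wanted = term.lower()
    for variants in inventory.values():
        if any(v.term == wanted for v in variants):
            overlapping = {
                v.term
                for v in variants
                if v.first_year <= end_year and v.last_year >= start_year
            }
            return sorted(overlapping | {wanted})
    return [wanted]  # unknown term: search for it unchanged


print(expand_query("Tuberculosis", 1850, 1870))
```

Restricting the expansion to variants whose period of use overlaps the searched date range is one simple way to stay sensitive to changes in terminology over time. A real system would also need to handle multi-word variants, spelling variation and terms whose meaning shifts.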
 
Title Search for Clinical Trials 
Description The large amount of clinical trial data has led to an information overload problem, making it difficult to locate the precise information that is required. Our infrastructure addresses this problem through the development of a search application that can help users to narrow down their search efficiently, and assist in the creation of new protocols.
Type Of Material Improvements to research infrastructure 
Year Produced 2012 
Provided To Others? Yes  
Impact Opened collaboration with teams working on the development of clinical trials, experimental therapeutics, and the automatic recommendation of eligibility criteria. This method has been adapted to other domains and users, notably http://www.nactem.ac.uk/DID-ISHER/, supporting a social history environment (digging into social unrest).
URL http://www.nactem.ac.uk/clinical_trials/
 
Title Text mining tool kit for history of medicine 
Description A number of text mining tools to extract: causes of a condition; symptoms associated with a condition; the parts of the body that are most typically affected by conditions; factors (e.g., therapies, drugs, environmental surroundings) that can affect a condition; and subsets of the population (e.g., children, adults, ethnic groups, people working in different occupations) that are most likely to be affected by a condition.
Type Of Material Improvements to research infrastructure 
Year Produced 2015 
Provided To Others? Yes  
Impact Reproducibility of results by other research teams. 
URL http://metashare.metanet4u.eu/repository/browse/time-sensitive-inventory-of-medical-terminology/1a17...
 
Title Data from: A cross-lingual similarity measure for detecting biomedical term translations 
Description  
Type Of Material Database/Collection of data 
Year Produced 2015 
Provided To Others? Yes  
 
Title History of Medicine Dataset 
Description A rich semantic analysis of entities (condition, anatomical, sign, symptom, environmental, therapeutic, etc) and events (causality and affect) of medical historical data from BMJ and MOH (Wellcome) containing 77,138 words 
Type Of Material Database/Collection of data 
Year Produced 2016 
Provided To Others? Yes  
Impact The first annotated corpus of its kind which will allow researchers in text mining to continue this work. Improved search results for digital libraries (Wellcome) 
URL http://nactem001.mib.man.ac.uk/brat-v1.3/#/MhM/first_version_final/
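
As a rough illustration of how entity and event annotations of this kind might be queried, the sketch below represents events as typed links between entities and retrieves the annotated causes of a condition. The `Entity` and `Event` classes, the role names `source` and `target`, the `causes_of` function and the example annotations are all hypothetical; they do not reproduce the corpus format or its contents.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    text: str
    type: str  # e.g. "Condition", "Environmental", "Therapeutic"


@dataclass(frozen=True)
class Event:
    type: str       # "Causality" or "Affect"
    source: Entity  # hypothetical role name: the entity that causes or affects
    target: Entity  # hypothetical role name: the entity that is caused or affected


def causes_of(condition, events):
    """Return the texts of entities annotated as causing the given condition."""
    wanted = condition.lower()
    return sorted({
        e.source.text
        for e in events
        if e.type == "Causality" and e.target.text.lower() == wanted
    })


# Invented example annotations, not taken from the corpus.
EVENTS = [
    Event("Causality", Entity("overcrowding", "Environmental"), Entity("typhus", "Condition")),
    Event("Causality", Entity("bad drainage", "Environmental"), Entity("Typhus", "Condition")),
    Event("Affect", Entity("quinine", "Therapeutic"), Entity("ague", "Condition")),
]

print(causes_of("typhus", EVENTS))
```

Keeping the event type separate from the entities it links lets the same structure answer other questions, such as which factors affect a condition, by filtering on "Affect" instead of "Causality".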
 
Title TerMine 
Description Automatically recognises technical terms from text 
IP Reference  
Protection Copyrighted (e.g. software)
Year Protection Granted
Licensed Yes
Impact Licensed to Elsevier and other companies
 
Description Invited talk for The Future of the History of the Human Sciences conference, York, UK
Form Of Engagement Activity A formal working group, expert panel or dialogue
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact Invited talk
Year(s) Of Engagement Activity 2016
 
Description Conference of European Statistics Stakeholders 
Form Of Engagement Activity A press release, press conference or response to a media enquiry/interview
Part Of Official Scheme? Yes
Geographic Reach International
Primary Audience Policymakers/politicians
Results and Impact The aim of the conference is to enhance the dialogue between European methodologists, producers, and users of European statistics, identifying the requirements of the users (ESAC), the best practices of production (EUROSTAT, NSIs), innovative ways of visualising and communicating statistics, and new methodological ideas for collecting and analysing data (FENStatS). Specific topics of high interest include the development of the European Statistical System towards 2020 and beyond; investigating and presenting themes of research in official statistics within the scientific community; exploring enabling instruments such as the Horizon 2020 Research Framework Programme; comparing and sharing best practices of production; and providing a good opportunity to meet national and European users of statistics. The Conference is an operative tool to facilitate the evolution of statistics towards the 2020 modernisation targets.

Funding opportunities (event will take place 24/11/2014)
Year(s) Of Engagement Activity 2014
URL http://cdss.sta.uniroma1.it/index.php/dssconference/cess2014/
 
Description Keynote at Ada Lovelace Day 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? Yes
Geographic Reach Local
Primary Audience Public/other audiences
Results and Impact Discussions about the role of women in computing

Inspiring for female students in science, engineering and technology
Year(s) Of Engagement Activity 2014
URL http://www.cs.manchester.ac.uk/study/news/full-article/?articleid=1304
 
Description Seminar given at University of Luxembourg 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact I gave a talk on text mining tools and infrastructure for biomedical applications. This was a research seminar organised by the school of computer science, University of Luxembourg. There was a lot of interest in the infrastructure and text mining tools we have developed.
Year(s) Of Engagement Activity 2016
URL http://iliasseminar.uni.lu
 
Description Talk at Society for the Social History of Medicine conference, Oxford, UK, July 2014
Form Of Engagement Activity A formal working group, expert panel or dialogue
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact The talk "taking London's Pulse" introduced the audience to the novel methods of text mining
Year(s) Of Engagement Activity 2014