Mining the History of Medicine

Lead Research Organisation: University of Manchester

Department Name: Computer Science

Abstract

This project, a cross-disciplinary collaboration between the National Centre for Text Mining (NaCTeM) and the Centre for the History of Science, Technology and Medicine (CHSTM) at the University of Manchester, seeks to demonstrate the potential of text mining in medical history. To do this, firstly an asset will be created out of two very large, long-running digital sources, the British Medical Journal (BMJ) (1840 - present) and the London-area Medical Officer of Health (MOH) reports (1848-1972), by applying text mining techniques to enrich these data with semantic annotations.

An important aspect of this work is to build tools to identify and record terminological variation and semantic shift over time, via construction of a temporal terminological inventory from the archives. Then, a semantic search system will be developed to help historians in broadening and deepening their work to ask 'big' questions that cover long periods, without losing sensitivity to changes in terminology and meaning.

The resulting asset and tools will be used and evaluated in two case studies, exploring the modern epidemiological transition and the creation of a medical surveillance culture, two massively important and interrelated changes in British health experience, where many questions remain unanswered. The methods and results of the case studies will serve as concrete examples of how such an asset and tools can be used.

The text mining tools and derived resources will be made available to the community via an interoperable text mining environment that will be contributed to the repositories of major digital humanities infrastructures (CLARIN, DARIAH, META-SHARE).

The project will be guided by an Expert User Advisory Group. Two workshops are planned to receive further guidance from and to inform the wider community.

The project plans to extend its impact to the following sectors: public health, public policy, publishing, media and libraries, with a view to ensuring sustainability and wider uptake of methods and technologies.

Planned Impact

The direct impact of our project will be in the creation of an asset and technology that facilitate the work of humanities scholars from multiple disciplines, who seek new ways to reveal, explore, and discuss long-term, large-scale historical transformations related to medicine and health.

However, we have carefully designed our engagement strategy, supported by an Expert User Advisory Group, such that, indirectly, the project will impact on the following sectors: public health, public policy, publishing, media and libraries, thus ensuring not only high impact but also excellent potential for sustainability.

Evidence-based public health, particularly the National Institute for Health and Care Excellence (NICE), and public policy will benefit from the understanding of medical historical data at the large scale that our text mining methods will address. Text mining will uncover associations and hidden information from large-scale historical archives about disease and its impact on the environment and society, which can be used to guide and inform public health and policy. We offer a unique opportunity to advance our understanding of disease in the environment through historical data, which have hitherto been under-examined in all their complexity.

The text mining research methods, which use terms, semantic clustering and semantic metadata, will have an impact on the publishing industry, which is seeking to improve scholarly communications by enriching its archives using semantic methods but often lacks the expertise and technology.

Semantic analysis of historical data helps to inform the citizen: extending our links with the British Library, Wellcome Library, and the BBC, we shall also reach out to national archives (UK National Archives, National Archives of Scotland), in order to demonstrate how archives in the public domain can be enriched to allow the citizen to engage with big historical data semantically without being led astray by temporally-conditioned language.

Funded Value:

£261,428

Funded Period:

Jan 14 - Jun 15

Funder:

AHRC

Project Status:

Closed

Project Category:

Research Grant

Project Reference:

AH/L00982X/1

Principal Investigator:

Sophia Ananiadou

Research Subject:

History (33%)

Library & information studies (33%)

Linguistics (33%)

Research Topic:

Computational Linguistics (33%)

History of Sci./Med./Technol. (33%)

Information & Knowledge Mgmt (33%)

Organisations

University of Manchester (Lead Research Organisation)

People	ORCID iD
Sophia Ananiadou (Principal Investigator)	http://orcid.org/0000-0002-4097-9191
Carsten Timmermann (Co-Investigator)
John McNaught (Co-Investigator)
Michael Worboys (Co-Investigator)
Elizabeth Toon (Researcher)

Publications

Alnazzawi N (2015) Using text mining techniques to extract phenotypic information from the PhenoCHF corpus. in BMC medical informatics and decision making

Bollegala D (2015) A cross-lingual similarity measure for detecting biomedical term translations. in PloS one

Kontonatsios, G. (2014) Combining String and Context Similarity for Bilingual Term Alignment from Comparable Corpora.

Korkontzelos I (2015) Boosting drug named entity recognition using an aggregate classifier. in Artificial intelligence in medicine

Miwa M (2015) Adaptable, high recall, event extraction system with minimal configuration. in BMC bioinformatics

Miwa, M. (2014) Comparable study of event extraction in newswire and biomedical domains.

Rak R (2014) Processing biological literature with customizable Web services supporting interoperable formats. in Database : the journal of biological databases and curation

Thompson P (2016) Text Mining the History of Medicine in PLOS ONE

Thompson P (2016) Enriching news events with meta-knowledge information in Language Resources and Evaluation

Thompson P (2015) Semantically enhanced search system for historical medical archives

Key Findings
Impact Summary
Policy Influence
Research Databases and Models
Research Tools and Methods
Intellectual Property
Engagement Activities


Description	Large-scale historical text archives are a rich and diverse source of information, but such large volumes are difficult for historians to explore and search without automated means. To help historians and laypeople search and discover information, we have developed a search system that analyses these large archives automatically using text mining methods. We support historians by showing how concepts and their relationships in the history of medicine change over time. We developed: a) a search system; b) a terminological inventory; c) a corpus with semantically annotated concepts and their relationships; d) a text mining tool kit for other researchers working in this area.
Exploitation Route	We have developed a number of tools and resources which are open and freely available to the community. Data are available from the META-SHARE network of language repositories. The time-sensitive inventory of medical terminology is available here: http://metashare.metanet4u.eu/go2/medical-inventory. The HIMERA annotated corpus is available here: http://metashare.metanet4u.eu/go2/himera-corpus.
Sectors	Environment Healthcare Culture Heritage Museums and Collections
URL	http://nactem.ac.uk/hom/


Description	We have created a publicly available search system. This was achieved by drawing upon evidence from two large and varied archives of historical medical text, the British Medical Journal (BMJ) (http://www.bmj.com/archive) and the London Medical Officer of Health reports (MOH) (http://wellcomelibrary.org/moh/), whose documents collectively span the period from 1840 to the present day, and each of which has a different focus (i.e., professional medical matters vs. public health issues).
First Year Of Impact	2015
Sector	Digital/Communication/Information Technologies (including Software),Environment,Healthcare,Culture, Heritage, Museums and Collections
Impact Types	Cultural Societal Policy & public services


Description	Copyright and Licensing in relation to Text and Data Mining
Geographic Reach	Multiple continents/international
Policy Influence Type	Contribution to a national consultation/review
Impact	The National Centre for Text Mining played a leading role in advising on policy and on the development of UK legislation regarding a copyright exception for text mining. Contributions included talks at events at the Houses of Parliament, the European Parliament and the London School of Economics, and participation in consultations by the IPO and the EC (on the wider issue of copyright and licensing in the EU). Advice was also given on numerous occasions at the request of the IPO during development of the legislation, which came into force on 1st June 2014. It is somewhat too early to ascertain impact; however, the legislation has already led to major initiatives, such as Europe PubMed Central being able to lawfully text mine full papers, as well as increased levels of text mining within bodies such as the British Library and within institutional repositories. It has also increased the scope and expected impact of research projects, as these can for the first time tackle large-scale text mining of lawfully subscribed full-text articles in addition to open access material.
URL	http://www.jisc.ac.uk/sites/default/files/value-text-mining.pdf


Title	Argo for Biodiversity
Description	Argo is an interoperable infrastructure for building and running text-analysis solutions. It facilitates the development of custom text mining workflows from a selection of text mining components. We have augmented Argo to include biodiversity text mining tools.
Type Of Material	Improvements to research infrastructure
Year Produced	2017
Provided To Others?	Yes
Impact	Supports the curation of databases and user collaboration, includes numerous processing components (including third-party ones), and allows the creation of text mining workflows. Includes text mining tools for biodiversity.
URL	http://argo.nactem.ac.uk


Title	EventMine
Description	EventMine is a machine learning-based pipeline system, which extracts events from documents that already contain named entity annotations (e.g., genes/proteins, etc.). Given appropriate training data, it can be trained to extract many different types and structures of events.
Type Of Material	Improvements to research infrastructure
Year Produced	2012
Provided To Others?	Yes
Impact	Community shared tasks; other research teams improved results. Customised to different domains and application areas. Part of the Argo text mining platform: http://argo.nactem.ac.uk
URL	http://www.nactem.ac.uk/EventMine/


Title	Search History of Medicine
Description	Text mining tools applied to large historical archives allow the development of a sophisticated semantic search system that provides functionalities such as the following: a) automatically expanding user-entered query terms with synonyms, variants and other semantically related terms, in order to aid the retrieval of a maximal number of potentially relevant documents; b) using automatically identified semantic information (e.g., named entities (NEs) and relationships between them) as a means to isolate documents of greatest interest and/or to help users explore the contents of large result sets from a semantic perspective.
Type Of Material	Improvements to research infrastructure
Year Produced	2015
Provided To Others?	Yes
Impact	We adapted text mining techniques to the important domain of medical history, which has previously received little attention from a text mining viewpoint. Specifically, we are concerned with developing the resources and tools needed to facilitate the text mining (TM) analysis of various types of published documents on medical matters, dating back to the mid-19th century. This task presents a number of challenges, owing to the varying characteristics that such documents can exhibit, which may evolve over time. These characteristics include not only potential shifts in terminology, but also possible variations in writing style according to the author, subject matter and intended audience of documents, together with changes in vocabulary and language structure over time. Such characteristics introduce difficulties not only in developing suitable terminological resources, which must account for the various ways in which concepts may be expressed in text both within and across different time periods, but also in creating annotated corpora that are fit for purpose. Historians of medicine use this search system to gain a better understanding of their research questions.
URL	http://nactem.ac.uk/hom/
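
The query expansion described in the Search History of Medicine entry above can be illustrated with a minimal sketch. This is not the project's implementation: the `Variant` class, the `INVENTORY` table, the `expand_query` function and every year in the toy data are hypothetical placeholders invented for illustration, not historical claims or real entries from the project's inventory. The sketch assumes a time-sensitive inventory that records, for each concept, its variant terms and the period in which each was in use, and expands a query only with variants whose period overlaps the searched date range.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """One variant term for a concept, with the years it was in use."""
    term: str
    first_year: int
    last_year: int


# Hypothetical toy inventory (concept -> variants). The years are invented
# placeholders for illustration, not data from the project's inventory.
INVENTORY = {
    "tuberculosis": [
        Variant("tuberculosis", 1860, 2015),
        Variant("phthisis", 1840, 1920),
        Variant("consumption", 1840, 1930),
    ],
}


def expand_query(term, start_year, end_year, inventory=INVENTORY):
    """Return the query term plus variants in use during the date range."""
    key = term.lower()
    expanded = [key]
    for variants in inventory.values():
        # Only expand within a concept that actually lists the query term.
        if key not in {v.term for v in variants}:
            continue
        for v in variants:
            # Keep a variant if its usage period overlaps the searched range.
            overlaps = v.first_year <= end_year and v.last_year >= start_year
            if overlaps and v.term not in expanded:
                expanded.append(v.term)
    return expanded
```

Calling `expand_query("phthisis", 1850, 1900)` on the toy data adds the two other variants, whereas a term absent from the inventory is returned unchanged. Filtering by date range is one possible design choice here; a real system might instead rank variants by frequency in each period.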


Title	Search for Clinical Trials
Description	The large amount of clinical trial data has led to an information overload problem, making it difficult to locate the precise information that is required. Our infrastructure addresses this problem through the development of a search application that can help users to narrow down their search efficiently, and assist in the creation of new protocols.
Type Of Material	Improvements to research infrastructure
Year Produced	2012
Provided To Others?	Yes
Impact	Opened collaboration with teams working on the development of clinical trials and experimental therapeutics, and on automatically recommending eligibility criteria. This method has been adapted to other domains and users, notably http://www.nactem.ac.uk/DID-ISHER/, supporting a social history environment (digging into social unrest).
URL	http://www.nactem.ac.uk/clinical_trials/


Title	Text mining tool kit for history of medicine
Description	A number of text mining tools to extract causes of a condition, symptoms associated with a condition, the parts of the body most typically affected by conditions, factors (e.g., therapies, drugs, environmental surroundings) that can affect a condition, and the subsets of the population (e.g., children, adults, ethnic groups, people working in different occupations) most likely to be affected by a condition.
Type Of Material	Improvements to research infrastructure
Year Produced	2015
Provided To Others?	Yes
Impact	Reproducibility of results by other research teams.
URL	http://metashare.metanet4u.eu/repository/browse/time-sensitive-inventory-of-medical-terminology/1a17...


Title	Data from: A cross-lingual similarity measure for detecting biomedical term translations
Description
Type Of Material	Database/Collection of data
Year Produced	2015
Provided To Others?	Yes


Title	History of Medicine Dataset
Description	A rich semantic analysis of entities (condition, anatomical, sign, symptom, environmental, therapeutic, etc.) and events (causality and affect) in historical medical data from the BMJ and the MOH reports (Wellcome), containing 77,138 words.
Type Of Material	Database/Collection of data
Year Produced	2016
Provided To Others?	Yes
Impact	The first annotated corpus of its kind which will allow researchers in text mining to continue this work. Improved search results for digital libraries (Wellcome)
URL	http://nactem001.mib.man.ac.uk/brat-v1.3/#/MhM/first_version_final/


Title	TerMine
Description	Automatically recognises technical terms from text
IP Reference
Protection	Copyrighted (e.g. software)
Year Protection Granted
Licensed	Yes
Impact	Licensed to Elsevier and other companies


Description	Conference of European Statistics Stakeholders
Form Of Engagement Activity	A press release, press conference or response to a media enquiry/interview
Part Of Official Scheme?	Yes
Geographic Reach	International
Primary Audience	Policymakers/politicians
Results and Impact	The aim of the conference is to enhance the dialogue between European methodologists, producers, and users of European statistics by identifying the requirements of users (ESAC), best practices in production (EUROSTAT, NSIs), innovative ways of visualising and communicating statistics, and new methodological ideas for collecting and analysing data (FENStatS). Topics of particular interest include the development of the European Statistical System towards 2020 and beyond. The conference also sets out to investigate and present research themes in official statistics within the scientific community, to explore enabling instruments such as the Horizon 2020 Research Framework Programme, and to compare and share best practices of production; it is also a good opportunity to meet national and European users of statistics. The conference is an operative tool to facilitate the evolution of statistics towards the 2020 modernisation targets. Funding opportunities (event will take place 24/11/2014).
Year(s) Of Engagement Activity	2014
URL	http://cdss.sta.uniroma1.it/index.php/dssconference/cess2014/


Description	Keynote at Ada Lovelace Day
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	Yes
Geographic Reach	Local
Primary Audience	Public/other audiences
Results and Impact	Discussions about the role of women in computing. Inspiring for female students in science, engineering and technology.
Year(s) Of Engagement Activity	2014
URL	http://www.cs.manchester.ac.uk/study/news/full-article/?articleid=1304


Description	Seminar given at University of Luxembourg
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Postgraduate students
Results and Impact	I gave a talk on text mining tools and infrastructure for biomedical applications. This was a research seminar organised by the school of computer science, University of Luxembourg. There was a lot of interest in the infrastructure and text mining tools we have developed.
Year(s) Of Engagement Activity	2016
URL	http://iliasseminar.uni.lu


Description	Invited talk for The Future of the History of the Human Sciences conference, York, UK
Form Of Engagement Activity	A formal working group, expert panel or dialogue
Part Of Official Scheme?	No
Geographic Reach	National
Primary Audience	Professional Practitioners
Results and Impact	Invited talk
Year(s) Of Engagement Activity	2016


Description	Talk at Society for the Social History of Medicine conference, Oxford, UK, July 2014
Form Of Engagement Activity	A formal working group, expert panel or dialogue
Part Of Official Scheme?	No
Geographic Reach	National
Primary Audience	Professional Practitioners
Results and Impact	The talk "Taking London's Pulse" introduced the audience to the novel methods of text mining.
Year(s) Of Engagement Activity	2014
