Semantic Annotation and Mark Up for Enhancing Lexical Searches (SAMUELS)

Lead Research Organisation: University of Glasgow
Department Name: School of Critical Studies

Abstract

As humanities datasets get ever larger, researchers have a pressing need for more sophisticated techniques of analysis. The most significant issue in big data research into textual datasets is that our primary methodology for searching, aggregating and analysing them relies not on concepts or meanings, but rather on word forms. These forms are imperfect and evasive proxies for the meanings they refer to, and with 60% of word forms in English referring to more than one meaning, and some word forms referring to close to two hundred meanings, the irrelevant "noise" which appears when searching using word forms grows with the size of the texts being searched.

In big data contexts, this problem cripples research, making any sort of detailed analysis entirely intractable and requiring impossible amounts of manual intervention. In this project, we will deliver a system for automatically annotating words in texts with their precise meanings, enabling a step-change in the way we deal with large textual data. The system is based around the unparalleled Historical Thesaurus of English, which contains 797,000 words from across the history of English arranged into 236,000 hierarchical categories of meanings alongside each word's dates of known use. The annotation software will take a text and provide for each word it contains an XML annotation giving the word meaning's Historical Thesaurus category code. The system will automatically disambiguate word meanings using a range of state-of-the-art computational techniques alongside new context-dependent methods unlocked by the Thesaurus's dating codes and its uniquely detailed and fine-grained hierarchical structure.
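The annotation step described above can be pictured with a minimal sketch. It is not the project's tagger: the lexicon, category codes, date ranges, element and attribute names (`w`, `ht`, `ht-candidates`) are all hypothetical placeholders chosen for illustration. It shows only the general idea of using dates of known use to discard meanings that were not current when a text was written, and of emitting an XML annotation per word.

```python
import xml.etree.ElementTree as ET

# Hypothetical toy lexicon: word form -> list of (category code, first year, last year).
# Codes and dates are invented placeholders, NOT real Historical Thesaurus entries.
TOY_LEXICON = {
    "union": [("AA.01", 1400, 2000), ("BB.02", 1830, 2000)],
    "strike": [("CC.03", 1300, 2000), ("BB.04", 1810, 2000)],
}


def candidates_in_use(word, text_year, lexicon=TOY_LEXICON):
    """Return the category codes whose date range covers text_year."""
    return [
        code
        for code, first, last in lexicon.get(word.lower(), [])
        if first <= text_year <= last
    ]


def annotate(tokens, text_year, lexicon=TOY_LEXICON):
    """Wrap each token in a <w> element.

    A single surviving candidate is written to the 'ht' attribute; several
    survivors are listed in 'ht-candidates' for a later disambiguation step.
    """
    root = ET.Element("text", {"year": str(text_year)})
    for token in tokens:
        w = ET.SubElement(root, "w")
        w.text = token
        codes = candidates_in_use(token, text_year, lexicon)
        if len(codes) == 1:
            w.set("ht", codes[0])
        elif codes:
            w.set("ht-candidates", " ".join(codes))
    return root


# In a text dated 1700, only the older placeholder sense of "union" survives.
xml_out = ET.tostring(annotate(["the", "union"], 1700), encoding="unicode")
```

Date filtering on its own cannot resolve every ambiguity; in this sketch, words with several surviving candidates are simply left for the other disambiguation techniques the abstract mentions.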

Textual data tagged in this way can then be accurately searched and precisely investigated, with any results also able to be aggregated at a range of levels of precision, without the need for manual intervention. A major part of the project is also the development of new techniques for working with semantically-aggregated and disambiguated data. Project partners will conduct research on resources including the Hansard Corpus, consisting of over 2.3 billion words of text, the Oxford English Corpus, the world's largest stratified corpus of modern English, and the EEBO-TCP corpus of 40,000 early modern books. As part of our work on changing the nature of how we deal with data on this scale, we will mine these text collections for frequently-occurring or statistically unusual concepts, will take advantage of our ability to search large datasets for terms realised by ambiguous word forms (such as "union" in the particular context of industrial relations rather than any of the other 33 possible meanings of this word), and will examine the data as a whole from a distant-reading perspective in order to look for striking or significant patterns of meaning changes across time.
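Aggregation "at a range of levels of precision" can be illustrated with a short sketch, assuming category codes take the dotted hierarchical form seen in the Historical Thesaurus headings quoted in this record's Thematic Set description (e.g. 01.05.11.02.04). The function names and the sample list of tags are hypothetical; the sketch simply truncates each code to a chosen depth and counts the results.

```python
from collections import Counter


def truncate_code(code, depth):
    """Keep the first `depth` dot-separated levels of a hierarchical code."""
    return ".".join(code.split(".")[:depth])


def aggregate(codes, depth):
    """Count tagged codes after truncating each one to the given depth."""
    return Counter(truncate_code(code, depth) for code in codes)


# Hypothetical sequence of tags; the codes themselves are the ones quoted
# in the Thematic Set description of this record.
tagged = ["01.05.11.02.04", "01.05.11.02", "01.03.01.05.12", "03.11.11.42.05.08"]

by_top_level = aggregate(tagged, 1)   # coarsest view
by_two_levels = aggregate(tagged, 2)  # one step finer
```

Choosing a smaller depth gives broader conceptual groupings, while a larger depth keeps finer distinctions, so the same tagged text can be summarised at whichever granularity a research question needs.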

These research projects based on tagged data will also drive the development of our tools for using this data, with teams of researchers across the UK and abroad providing a range of different demands on the data, ensuring a variety of needs and use-cases are catered for in the development of the project. In this way, we are committed to producing a set of compelling, fruitful, and practical research outcomes using semantically-tagged data during the lifetime of the project, in order to demonstrate the value of our approach and to help ensure the work of the project is as widely utilised and exploited as possible.

By doing all of this, we will enable new and transformative techniques of exploring, searching and investigating large-scale cultural, literary, historical and linguistic phenomena in big humanities datasets; through this project, it will be possible to place meaning - rather than word forms - at the heart of digital humanities research into text.

Planned Impact

Education
The modern teaching of history and of literature often operates thematically, with students introduced to specific themes across a range of time periods. Open access through the BYU site to our annotated data will provide ways for students and teachers to explore the literature in EEBO and the political and historical information in Hansard through semantic categories which relate to the broader themes they are studying. We will provide suggestions for use of the data in teaching at upper secondary and tertiary levels, and will particularly highlight and support educational uses of the aggregated data produced in the Hansard project, as well as the Time Machine project. We also see potential for SAMUELS in areas such as improving written style through increased awareness of semantic patterning, grading reading materials according to age and topic suitability, and as a reading aid for older or complex texts.

Writers
Professional writers frequently use the print and online Historical Thesaurus for information about word usage at various periods, for the sake of authenticity in their writing or of novel ways to express concepts. The Time Machine project, linking words in supplied texts to the period during which those words are known to have been used, will be a further key resource for writers, particularly those working on historical fiction. Those writing about topics such as politics, history, and the development of debating styles will find a useful resource in the Hansard project.

Third Sector
Our work on Hansard has already led to discussions with third sector agencies such as theyworkforyou.org, who aggregate and display information to the public about their MP's activities. We will offer these agencies access to the aggregated data produced in the Glasgow Hansard project to enable them to display data on the topics and concepts most often discussed by their elected representatives across time. The original Hansard data was provided to Glasgow by the UK Parliamentary Service, and our enhanced version will be offered to them for their own uses. Libraries such as the National Library of Wales have indicated the need for a resource such as SAMUELS to aid cataloguing and other topic-related activities.

Commercial Implications
While non-commercial and academic access to the annotation system's website will remain free of charge, Oxford University Press, the experienced contractual operator for commercialising the Historical Thesaurus dataset, will partner with Glasgow and Lancaster to seek opportunities in the private sector for a semantic annotation system. One key commercial opportunity is with large providers of historical texts, who would benefit from semantically-aware searches, in particular from the ability to search using concepts in areas where users are unlikely to know historical synonyms for modern terms. This group includes the large private sector providers of legal texts and historical case material (eg WestLaw, HeinOnline or LexisNexis) as well as newspapers' historical archives, and database developers such as Gale, Cengage, and ProQuest. The project therefore has the potential to benefit the economy by stimulating income and creating further commercial outputs across time, and we will work with our partners to achieve this where feasible.

Data Mining
Finally, we see potential impact in SAMUELS for groups which need to mine large data sources for socially or commercially valuable information, including researchers in such fields as health informatics, marketing sentiment analysis, and wide-scale cultural analysis. Our project, with its advantages over plain lexical searches, will be an additional tool for these groups, permitting deeper and more productive mining of their data by aggregating a range of data sources in meaning-structured ways. Such work will also further expose the usefulness of the rich and unparalleled Historical Thesaurus lexical database for their research.

Publications

Alexander M (2015) Metaphor, Popular Science, and Semantic Tagging: Distant reading with the in Digital Scholarship in the Humanities

Archer D (2017) Tracing facework over time using semi-automated methods in International Journal of Corpus Linguistics

 
Description This project produced a system for automatically annotating words in texts with their precise meanings - disambiguating between possible meanings of the same word. It enables a step-change in the way we deal with large textual data, as 62% of English word forms refer to more than one meaning.

The tagging system uses the Historical Thesaurus of English as its core dataset, and provides for each word in a text the Historical Thesaurus reference code for that concept. Textual data tagged in this way can then be accurately searched and precisely investigated, producing results which can be automatically aggregated at a range of levels of precision. The project also drew on a series of research sub-projects which employed the software as it was being developed, testing and validating the utility of the SAMUELS tagger as a tool for wide-ranging further research.

Linked sub-projects (more information available on the project webpage) used the tagger and its corpus outputs to analyse political language in Hansard, aggression in historical texts, popular science metaphors, and others. These have been documented in a range of publications and presentations.
Exploitation Route The tagger and the corpora created with it are freely available for research use by any interested party. (See the website for details.)
Sectors Digital/Communication/Information Technologies (including Software),Government, Democracy and Justice

URL http://www.gla.ac.uk/samuels/
 
Description Follow-on Funding for Impact
Amount
Funding ID AH/R007136/1 
Organisation Arts & Humanities Research Council (AHRC) 
Sector Public
Country United Kingdom
Start 01/2018 
End 12/2018
 
Title Historical Thesaurus Thematic Set 
Description The Historical Thesaurus Thematic Set (HTTS) is a list of the major concepts discussed throughout the history of English, based on the Historical Thesaurus of English (HT). It was built from the category and sub-category headings in the HT. Headings deemed 'human-scale' were kept, whilst those which seemed either too specific or too general were removed. For example, HT heading 01.03.01.05.12: 'Disorders of birds' (part of the section on animal health) was thought too specific and specialist a topic for users to be likely to search for; conversely, 01.05.11.02: 'General parts' (i.e. of animals) appeared too general to be useful, although headings nested under it in the HT hierarchy (e.g. 01.05.11.02.04: 'Covering/skin') were considered significant enough as concepts to be given a thematic dataset heading. HT categories which were too miscellaneous to act as useful search terms have also been omitted (e.g. 03.11.11.42.05.08: 'Other parts' (i.e. of machines)). No HT sub-category headings have been included in the final list of HTTS headings. 
Type Of Material Database/Collection of data 
Year Produced 2016 
Provided To Others? Yes  
Impact 2016: the online Hansard Corpus uses this dataset for semantic searches. 
 
Title Semantic Early English Books Online (SEEBO) 
Description A version of the Early English Books Online dataset created by the Text Creation Partnership (EEBO-TCP), formatted as a searchable text corpus with a user interface. The corpus is tagged using the Historical Thesaurus Semantic Tagger developed by the SAMUELS project, which allows users to search for semantic categories in the data. 
Type Of Material Database/Collection of data 
Year Produced 2016 
Provided To Others? Yes  
Impact Researchers in historical semantics can easily access the resource on the widely used Brigham Young corpus repository maintained by Mark Davies, and will be able to access raw files through the Oxford Text Archive in the near future. 
URL http://corpus.byu.edu/eebo
 
Title The Hansard Corpus 1803-2005 
Description The Hansard Corpus contains nearly every speech given in the British Parliament from 1803-2005, and it allows you to search these speeches (including semantically-based searches) in ways that are not possible with any other resource. 
Type Of Material Database/Collection of data 
Year Produced 2015 
Provided To Others? Yes  
Impact Press coverage on BBC Radio Four (Today programme), BBC Radio Scotland, the BBC World News, the Times, the Independent, the Scotsman, the National, BBC News Online, BT News, the Metro, and the i. All in November 2015. 
URL http://www.hansard-corpus.org
 
Description Oxford University Press 
Organisation Oxford University Press
Country United Kingdom 
Sector Private 
PI Contribution We have linked the University of Glasgow's Historical Thesaurus data to the Oxford English Dictionary (OED), in order for us to engage in research related to the data contained in the OED.
Collaborator Contribution OUP provided the Oxford English Dictionary data and expertise in dealing with it, and worked on providing a legal framework for this collaboration, the first of its kind.
Impact An integrated database from Glasgow and OUP which can transform lexical research in the humanities (impact in progress)
Start Year 2014
 
Title The Historical Thesaurus Semantic Tagger 
Description The Historical Thesaurus Semantic Tagger (HTST) can be used to label lexical items in running text with codes based on the semantic categories of the Historical Thesaurus of English. It is the primary output of the SAMUELS project. 
Type Of Technology Software 
Year Produced 2016 
Impact None as yet 
URL http://www.gla.ac.uk/samuels/
 
Description 'Data Analysis: SAMUELS and corpus linguistics', Old Bailey Data, Sussex Humanities Lab, Brighton 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact Talk at research symposium to discuss future paths of analysis for data from corpus of Old Bailey trials. Resulted in plans for future collaborative work with academics at the University of Sussex.
Year(s) Of Engagement Activity 2017
 
Description 'Distributions of Concepts in the Old Bailey Voices Corpus', Making Effective Use of Metadata of Historical Texts and Corpora workshop, Saarbrücken 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Presentation at conference on the subject of employing metadata for historical textual corpora, resulting in discussion with attendees and requests for further information.
Year(s) Of Engagement Activity 2017
URL http://www.sfb1102.uni-saarland.de/wp/wp-content/uploads/2017/08/SFB-Workshop-2017_Programme.pdf
 
Description 'Semantic Annotation and Analysis of the UK Hansard Record' at CLARIN-ERIC Working with Parliamentary Data, Sofia, Bulgaria 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Invited UK representative of the CLARIN research consortium, discussing work on parliamentary records and the future of resources for such work. Attendees included university researchers, parliamentary reporters, and members of the consortium's organising committee. The talk resulted in discussion throughout the conference of the issues raised and expressions of interest in further collaborative work.
Year(s) Of Engagement Activity 2017
URL https://www.clarin.eu/event/2017/clarin-plus-workshop-working-parliamentary-records
 
Description 'The Historical Thesaurus Semantic Tagger', Oxford University Press, Oxford 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Professional Practitioners
Results and Impact Around 20 members of Oxford University Press involved in dictionary production attended for information on the development and potential uses of the semantic tagger developed by the project. Resulted in editors of the OED using the semantic tagger as a form of input when updating dictionary entries.
Year(s) Of Engagement Activity 2016
 
Description Explorathon 2014 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Public/other audiences
Results and Impact A stand and public engagement activities (games, prizes, posters) at Explorathon on "European Researchers' Night" at the Glasgow Science Centre, 26 September 2014. Participants asked a lot of questions about the research undertaken and why we did it, and discussed how it could be used by people interested in words and word history.

Blog and Twitter activity
Year(s) Of Engagement Activity 2014
 
Description Explorathon 2015 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Public/other audiences
Results and Impact A stand and public engagement activities (games, prizes, posters) at Explorathon on "European Researchers' Night" at the Glasgow Science Centre, 2015 (a repeat of the 2014 work). Participants asked a lot of questions about the research undertaken and why we did it, and discussed how it could be used by people interested in words and word history.
Year(s) Of Engagement Activity 2015
 
Description Explorathon 2016 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Public/other audiences
Results and Impact A talk to the general public at Glasgow Science Centre, featured in the programme as a 'spotlight' talk. Follow-up activities.
Year(s) Of Engagement Activity 2016
 
Description Fantasy Night at the Museum 2017 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Public/other audiences
Results and Impact Stall held as part of 'Being Human Festival' event at Glasgow's Hunterian Museum. Around 200 visitors came to the stall which used word games as a lead in to discussions about linguistic research, and attendees showed real interest in following up through accessing the websites of related projects.
Year(s) Of Engagement Activity 2017
URL https://www.gla.ac.uk/hunterian/visit/events/headline_543558_en.html
 
Description HEL-LEX 5 Zurich Plenary 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Invited plenary at New Approaches to English Historical Lexis 5, Zürich, Switzerland
Year(s) Of Engagement Activity 2017
 
Description How Can Linguistics Help a Healthy Internet? at Mozilla Festival 2017 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact 30 attendees took part in an interactive session on how linguistic corpora and annotation might feed into future voice recognition technology. Questions and discussion with participants followed the session, and both the audience and the session organisers/presenters reported an increase in knowledge and ideas related to the session's topics.
Year(s) Of Engagement Activity 2017
URL https://guidebook.com/guide/114124/event/16741408/
 
Description Press interviews, Hansard Corpus 1803-2005 
Form Of Engagement Activity A press release, press conference or response to a media enquiry/interview
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Public/other audiences
Results and Impact Interviews on BBC Radio Four (Today programme), BBC Radio Scotland, and the BBC World Service; interviews with the Times, the Independent, the Scotsman, the National, the Metro, the i, and others. All discussing the AHRC-funded Hansard Corpus as part of the SAMUELS webpage. Many follow-up requests and emails.

November 2015
Year(s) Of Engagement Activity 2017
 
Description Scot-Lex 1 Plenary 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact Invited plenary, Scottish Lexicography Symposium: ScotLex-1. Royal Society of Edinburgh, organised by Scottish Language Dictionaries Ltd for professional practitioners
Year(s) Of Engagement Activity 2016
 
Description The 72nd World Science Fiction Convention 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Public/other audiences
Results and Impact An invited stand and public engagement activities (games, prizes, posters) in the Exhibits Hall of LonCon3, the 72nd World Science Fiction Convention, which had over 10,000 registrations. We counted over 650 people who engaged with us for a period of time (that is, we gave away 650 postcard "prizes" for completing our engagement activity).

Significant website traffic, blog posts, and Twitter discussions. A very large number of fiction writers have said they will use our resource to help with their work.
Year(s) Of Engagement Activity 2014
URL http://www.loncon3.org/exhibits.php