TOBIAS: Thesaurus of British and Irish History as SKOS

Lead Research Organisation: University of London
Department Name: Inst of Historical Research

Abstract

This project applies for funding under the follow-on funding for impact and engagement scheme, under the 10th anniversary highlight notice. The Bibliography of British and Irish History (BBIH) - formerly the Royal Historical Society Bibliography - was funded under the old scheme [APN15510] in 2007 as a resource enhancement award. There were five project objectives, including "to widen access to the data by exposing it to OAI (Open Archives Initiative) harvesters, such as OAIster, and to online search engines". In the intervening eight years the possibilities for exposing and enriching data have become vastly more powerful and pervasive. What is achievable now was simply not possible in 2007, and the time is right to take advantage of these developments: to expose as linked data the detailed thesaurus which underpins the Bibliography's comprehensive cataloguing system, allowing other historical projects to benefit and, as the semantic web develops, to integrate historical resources in a way which has transformative potential. Such linkage is, moreover, only possible with a detailed classification system of the kind this project will provide.
The thesaurus will be marked up in SKOS, the Simple Knowledge Organization System (http://www.w3.org/2004/02/skos/), which is the best choice for marking up a thesaurus. It is simple to use, making it ideal for use by historical projects, while integrating well with the Web Ontology Language OWL (the two can be linked using OWL's annotation properties).
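As a rough illustration of what marking up a thesaurus in SKOS involves, the sketch below writes a three-term hierarchy as SKOS concepts in Turtle syntax. The base URI, the identifiers and the terms themselves are assumptions chosen for the example and are not taken from the BBIH thesaurus; only the SKOS namespace and property names come from the W3C vocabulary.

```python
# Hypothetical base URI and terms, for illustration only; not the real BBIH data.
BASE = "http://example.org/thesaurus/"

# Each entry: (identifier, preferred label, identifier of the broader term or None).
TERMS = [
    ("economic-history", "Economic history", None),
    ("agriculture", "Agriculture", "economic-history"),
    ("enclosure", "Enclosure", "agriculture"),
]


def to_turtle(terms, base=BASE):
    """Serialise (id, label, broader) tuples as SKOS concepts in Turtle."""
    lines = [
        "@prefix skos: <http://www.w3.org/2004/02/skos/core#> .",
        f"@prefix ex: <{base}> .",
        "",
        "ex:scheme a skos:ConceptScheme .",
        "",
    ]
    for term_id, label, broader in terms:
        lines.append(f"ex:{term_id} a skos:Concept ;")
        lines.append("    skos:inScheme ex:scheme ;")
        if broader is None:
            # A term with no parent sits at the top of the scheme.
            lines.append("    skos:topConceptOf ex:scheme ;")
        else:
            # skos:broader points from the narrower term to its parent.
            lines.append(f"    skos:broader ex:{broader} ;")
        lines.append(f'    skos:prefLabel "{label}"@en .')
        lines.append("")
    return "\n".join(lines)


turtle = to_turtle(TERMS)
print(turtle)
```

Each term becomes one skos:Concept with a preferred label and a single skos:broader link to its parent, which is enough to express a simple subject tree. A real thesaurus would probably also need alternative labels and related-term links.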
The Institute of Historical Research (IHR) proposes to publish as a web ontology the Royal Historical Society's subject classification for British and Irish History - the premier thesaurus in its subject area, comprising 8,800 terms. This will provide a comprehensive, standard resource for all British and Irish history projects wishing to expose their data and link it to other projects using RDF (Resource Description Framework). As an exemplar, the IHR will mark up its own extensively used History Online data (www.history.ac.uk/history-online/) with the RHS thesaurus terms, as a demonstration of the process and its value, and will publish a human-readable version of the thesaurus, so that projects or scholars wishing to use a history ontology will quickly be able to see the scope and detail the thesaurus can offer. Additional aids for scholars and those unfamiliar with SKOS will be online tutorials, including worked examples. The IHR will hold a workshop at the end of the project, drawing on its extensive contacts with the local history and family history community, to demonstrate the benefits of the project to those inside and outside academia. To further demonstrate the applicability of the thesaurus, the Bodleian Library's John Johnson Collection of printed ephemera, produced in conjunction with ProQuest, will be marked up using the web ontology, proving its use in enhancing a disparate and heavily image-based collection.
The bibliography first produced its subject classification in the late 1990s. The thesaurus drew on the categorisation used for the earlier print volumes and was developed in consultation with the history profession. It has been revised and refined ever since. It now contains 8,800 terms. The thesaurus is known, at least in outline, to all users of the bibliography (the relevant parts of the subject tree are displayed with every record), which effectively means all historians of British and Irish History: although BBIH has been a subscription service since 2010, practically all UK Higher Education institutions with history departments subscribe. It is also used by the bibliography of Irish history, Irish History Online (http://www.irishhistoryonline.ie/). It has been used by a commercial company, Taylor & Francis, to mark up its online English Historical Documents series http://www.englishhistoricaldocuments.com/.

Planned Impact

The project's intention to mark up its History Online dataset will directly lead to impact among publishers. History Online consists of a number of datasets, for some of which data is directly supplied by publishers. Ten publishers currently subscribe to the service on History Online which lists new books and journal articles, including leading publishers like Oxford University Press, Routledge and Yale. Not only will these publishers see the benefit of their material being marked up in SKOS, the IHR's direct relationship with them will enable it to publicise the project as an enhancement to the listings, via the quarterly email that is sent to subscribers.
Furthermore History Online contains datasets directly related to the history profession in the UK: listings of Theses Completed, Theses in Progress, Teachers of History (at a university level) and Grants for historians. All of these are free and in constant use. Thus for many historians History Online may be the first introduction to linked data in practical use, and this can only disseminate knowledge and understanding of this valuable technology throughout the profession.
The IHR is leading the way in helping historians to negotiate the new terrain created by moves towards open access in academia (it has organised conferences on the subject and maintains an information resource, http://openaccess.blogs.sas.ac.uk/). Open access and linked open data are a natural fit and each will become more deeply connected with the other as both develop. This project will put the IHR at the forefront in these two areas, and the IHR will be ideally placed to help historians engage with the issue as part of its mandate as the national centre for History in the UK.
The project will hold a workshop specifically aimed at workers in the Galleries, Libraries, Archives and Museums sector. The IHR has close connections with some of the leading members of this group. It has been a collaborator on two separate Big Data projects which finished at the end of March 2015, one with the British Library and the other with the National Archives. The IHR has particularly close relations with the British Library's digital curators and will invite one of them to present at the end-of-project workshops. Members of the Museums Association will be invited to attend this workshop.
The project will publish training materials online, with screencasts, tutorials and worked examples. This will be aimed at a non-technical audience.
Perhaps the most important, and certainly the broadest, impact will be among the public. Personal interest is a thriving area in the field of British history, with family history an enormously popular activity, a proliferation of local history societies catering to an interest in topographical and regional history, and deep interest in the UK's heritage in architecture, social and political history and popular culture. All of these areas would benefit greatly from the ability to link silos of information and to query the great mass of information available via a query language such as SPARQL, which supports complex queries of a kind that non-semantic search engines such as Google simply cannot provide. Furthermore the Institute of Historical Research, which aims to be the hub of academic history in the UK, plays a role among these independent scholars. For example, the IHR publishes the standard work of county history in England, the Victoria County History, and thus coordinates the VCH's work among volunteers and independent scholars across the counties. The IHR also has links with the British Association for Local History (BALH). The IHR is uniquely placed to disseminate the importance of the thesaurus to all of these groups, and it will hold two workshops at the end of the project: one will focus on local history and one on family history, giving practical advice on how this new and powerful semantic web tool can be used.

Publications

 
Description The main project finding relates to the size of corpus required to produce reliable, scalable results. TOBIAS drew on a substantial corpus of historical data that had already been labelled with historical terms by human annotators, containing over 1,000 items, averaging about 100 words per item. Although we hoped this corpus would be large enough to generate good results, it became clear that the larger the corpus the more effective the results. We also found that multi-label sets of data are an order of magnitude harder to work with than single-label sets (such as those used in spam filtering, where each email has either the label "spam" or the label "not spam"). We intend to use machine learning to take this work forward and are investigating larger, labelled datasets in order to do so. The methodologies explored during the project, and the training resources developed as a result, will add significantly to the body of knowledge in the field and allow others to build on our work.
Exploitation Route We held two workshops in which participants from publishing and library backgrounds expressed interest in working with our datasets, so there is clear interest both in (re)using the project's published data and in deploying and building on the methodology, particularly in relation to what will and will not work with datasets of this size.
Sectors Digital/Communication/Information Technologies (including Software); Education; Culture, Heritage, Museums and Collections
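
The contrast between single-label and multi-label data noted in the findings above can be made concrete with a small sketch. It is not the project's own analysis: the four-term vocabulary and the three annotated items are hypothetical. It shows one common way to encode multi-label annotations (one 0/1 flag per vocabulary term) and compares how many distinct answers each kind of task allows.

```python
# Hypothetical vocabulary and annotations, for illustration only.
VOCAB = ["Agriculture", "Food and drink", "Trade", "Religion"]

ITEMS = {
    "item-1": {"Agriculture", "Food and drink"},
    "item-2": {"Trade"},
    "item-3": {"Food and drink", "Trade", "Religion"},
}


def indicator_vector(labels, vocab=VOCAB):
    """Encode a set of labels as one 0/1 flag per vocabulary term."""
    return [1 if term in labels else 0 for term in vocab]


vectors = {item: indicator_vector(labels) for item, labels in ITEMS.items()}

# A single-label task allows one answer per term; a multi-label task allows
# one answer per subset of terms, so the answer space doubles with each term.
single_label_outcomes = len(VOCAB)
multi_label_outcomes = 2 ** len(VOCAB)

for item, vector in vectors.items():
    print(item, vector)
print(single_label_outcomes, multi_label_outcomes)
```

With only four terms the multi-label task already has 16 possible label sets against 4 single labels. The number of possible sets doubles with every added term, which is one reason a corpus of around 1,000 items is small for a vocabulary of thousands of terms.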

 
Title TOBIAS Vocabulary Explorer 
Description Bibliography of British and Irish History Vocabulary as HTML and SKOS, with a web UI explorer for both. The /data directory contains the BBIH Vocabulary data and some transformation scripts to convert the source data into SKOS and an HTML list. The HTML list is dynamically loaded into the Vocabulary and SKOS explorer webpage and rendered as a usable, navigable tree using the jsTree jQuery plugin. The SKOS explorer page is themed with slightly customised CSS taken from the GitHub Pages Minimal theme, and the tree is themed using the jsTree Bootstrap Theme. The generated skos:Concept text fields and code block are dynamically updated using Bret Victor's Tangle.js. 
Type Of Technology New/Improved Technique/Technology 
Year Produced 2016 
Impact No known impacts to date. 
URL https://github.com/ihr-webmaster/vocab-explorer
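
The conversion of hierarchical vocabulary data into an HTML list, as described for this output, might look roughly like the sketch below. It does not reproduce the repository's transformation scripts: the input format (term and parent pairs), the terms and the function names are all assumptions. The result is a nested ul/li structure, which is a form of markup that jsTree can render as a tree.

```python
from html import escape

# Hypothetical (term, parent) pairs; the real source data format is not shown here.
TERMS = [
    ("Economic history", None),
    ("Agriculture", "Economic history"),
    ("Enclosure", "Agriculture"),
    ("Trade", "Economic history"),
]


def build_children(terms):
    """Map each parent (None for the root level) to its child terms, in input order."""
    children = {}
    for term, parent in terms:
        children.setdefault(parent, []).append(term)
    return children


def render(children, parent=None):
    """Recursively render the children of `parent` as a nested HTML list."""
    items = children.get(parent, [])
    if not items:
        return ""
    parts = ["<ul>"]
    for term in items:
        # Escape the label, then nest any narrower terms inside the same <li>.
        parts.append(f"<li>{escape(term)}{render(children, term)}</li>")
    parts.append("</ul>")
    return "".join(parts)


html_list = render(build_children(TERMS))
print(html_list)
```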
 
Description Peaches and lemons are foodstuffs. Trying to classify historical texts using the BBIH thesaurus 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Postgraduate students
Results and Impact A talk about the project at the Oxford Centre for Digital Scholarship which led to discussion afterwards and ideas for collaboration post-project.
Year(s) Of Engagement Activity 2016
URL https://blogs.bodleian.ox.ac.uk/digital/2016/09/11/baker-blaney-steer/
 
Description TOBIAS: classifying historical material using the RHS vocabulary 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact A workshop to discuss the project and engage practitioners in advising us for further work.
Year(s) Of Engagement Activity 2016
 
Description TOBIAS: classifying historical material using the RHS vocabulary - librarians and academics 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact A workshop to discuss the project with librarians and academics and take their advice on next steps.
Year(s) Of Engagement Activity 2016