Archaeotools: Data mining, facetted classification and E-archaeology

Lead Research Organisation: University of York
Department Name: Archaeology

Abstract

This project builds upon previous work to develop tools which will allow archaeologists to share and analyse datasets and publications within a collaborative environment. It has three interrelated objectives, each represented by a distinct workpackage.

The first aim is to index the ADS/AHDS Archaeology structured database of over one million metadata records describing sites and monuments in the British Isles according to three criteria: When, What and Where. The project will use techniques of facetted classification, derived from information science, to let users navigate the resulting three-dimensional space, explore the data sets it holds, and make links between specific records. A map-based interface will be developed so that the spatial dimension can be explored effectively.

Secondly, the project will employ natural language processing techniques so that automated tools can search within documents for terms which are part of known classification schemes and add them to the facetted index, providing much deeper and richer access to unpublished archaeological literature. Although this literature forms the primary record of most archaeological investigation within the UK, the level of scholarly and public access has hitherto been severely curtailed, imposing a major constraint on archaeological research. Tools will also be employed which allow users to impose their own classifications and index the documents according to their own criteria, adding further user-defined dimensions to the classification.

Thirdly, these tools will be used to investigate whether index terms can also be identified and harvested within older antiquarian literature, as represented by back runs of archaeological journals currently being digitised and made available online. As site reports in this older literature rarely give precise geospatial coordinates, it will be necessary to investigate whether natural language processing will allow the recognition and harvesting of place names. If this is achievable, the place names can be supplied to existing software which looks them up in an online gazetteer and returns precise grid coordinates which can be added to the index.

At the end of the project we will have created a major sustainable resource for archaeological research and made it available to all users via AHDS Archaeology. It will also be possible to make recommendations for the future format and indexing of grey literature, and to draw lessons for the wider humanities e-Science community.

 
Description Workpackage 1: We were able to map the 1 million records to thesauri by a combination of automatic rule-based expressions and manual techniques. The 'When' facet provides an example of the success of this combined approach. There are many ways in which archaeological dates and date ranges can be written, e.g. 1066, 1001-1100, 11th centuary (sic), C11, 11C, eleventh century. Most of these were mapped directly to MIDAS-defined date ranges. Analysis initially recovered 457 instances of irresolvable dates, equating to 114,505 records which could not be classified. After automated processing this was reduced to 148 concepts and only 7,528 records, a manageable number to correct by manual intervention. The variety of uncontrolled terminology used for the 'What' facet, combined with a significant number of records with no subject information, proved more intractable, but was not a serious problem as most records still appeared under either the 'When' or 'Where' facet. In total, of 1,001,595 records submitted for classification, 995,907 appeared in at least one facet, leaving only 5,688 records totally unclassified.
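
The rule-based side of this date mapping can be illustrated with a minimal Python sketch. It is not the project's actual rule set: the function names, the regular expressions and the convention that the nth century runs from year (n-1)*100+1 to n*100 are assumptions made for illustration, and the MIDAS-defined ranges themselves are not reproduced here.

```python
import re

# Hypothetical lookup for spelled-out ordinals ("eleventh" -> 11).
ORDINAL_WORDS = {
    word: number
    for number, word in enumerate(
        ["first", "second", "third", "fourth", "fifth", "sixth", "seventh",
         "eighth", "ninth", "tenth", "eleventh", "twelfth", "thirteenth",
         "fourteenth", "fifteenth", "sixteenth", "seventeenth", "eighteenth",
         "nineteenth", "twentieth"],
        start=1,
    )
}


def century_range(n):
    """Assumed convention: the nth century AD spans (n-1)*100+1 to n*100."""
    return ((n - 1) * 100 + 1, n * 100)


def parse_date(text):
    """Return a (start_year, end_year) tuple, or None if no rule matches."""
    s = text.strip().lower()

    # Explicit range, e.g. "1001-1100".
    m = re.fullmatch(r"(\d{1,4})\s*-\s*(\d{1,4})", s)
    if m:
        return (int(m.group(1)), int(m.group(2)))

    # Single year, e.g. "1066".
    if re.fullmatch(r"\d{1,4}", s):
        return (int(s), int(s))

    # Abbreviated century, e.g. "C11" or "11C".
    m = re.fullmatch(r"c(\d{1,2})", s) or re.fullmatch(r"(\d{1,2})c", s)
    if m:
        return century_range(int(m.group(1)))

    # Numeric ordinal; "cent\w*" tolerates misspellings such as "centuary".
    m = re.fullmatch(r"(\d{1,2})(?:st|nd|rd|th)\s+cent\w*", s)
    if m:
        return century_range(int(m.group(1)))

    # Spelled-out ordinal, e.g. "eleventh century".
    m = re.fullmatch(r"([a-z]+)\s+cent\w*", s)
    if m and m.group(1) in ORDINAL_WORDS:
        return century_range(ORDINAL_WORDS[m.group(1)])

    # Unresolved: left for manual intervention.
    return None
```

Under these assumptions, the five century-style variants quoted above all resolve to the same range, 1066 resolves to a single year, and anything returning None would be queued for manual correction.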

Workpackage 2: Relatively high levels of success were achieved when the same techniques were applied to the sample of 1000 semi-structured grey literature reports. The greatest problem encountered was that of distinguishing between 'actual' and 'reference' terms. As well as the 'actual' place name referring to the location of the archaeological intervention, most grey literature reports also refer to comparative information from other sites, here called 'reference' terms. The IE software returned all place names in the document, masking the place name for the actual site amongst large numbers of other names. However, this was solved by adopting the simple rule that the primary place name would appear within the 'summary' section of the report. If it was not possible to identify a summary, the first 10% of the document was used instead.
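
The 'summary, else first 10%' rule can be sketched in a few lines of Python. This is an illustration only, not the project's IE software: the function names are hypothetical, the 10% is measured in characters (the text does not say whether characters, words or pages were used), and matching is a plain case-insensitive substring test.

```python
def candidate_window(report_text, summary=None, fraction=0.10):
    """Return the text in which the 'actual' site name is expected to appear.

    Uses the summary section when one was identified; otherwise falls back
    to the leading fraction of the document (assumed: counted in characters).
    """
    if summary:
        return summary
    cutoff = max(1, int(len(report_text) * fraction))
    return report_text[:cutoff]


def primary_place_names(report_text, extracted_names, summary=None):
    """Keep only the extracted place names that occur inside the window.

    `extracted_names` stands in for the full list of place names returned
    by an information extraction step, 'actual' and 'reference' alike.
    """
    window = candidate_window(report_text, summary).lower()
    return [name for name in extracted_names if name.lower() in window]


# Illustrative report with no identifiable summary section.
report = (
    "Evaluation at Heslington. "
    + "Filler text. " * 50
    + "Compare with finds from Ripon."
)
names = primary_place_names(report, ["Heslington", "Ripon"])
```

Here only the name occurring in the opening tenth of the text is retained as 'actual'; the name mentioned later as a comparison is treated as a 'reference' term and dropped.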

Workpackage 3: The third strand of the project was to focus IE on the almost entirely unstructured digitised version of the Proceedings of the Society of Antiquaries of Scotland (PSAS). Despite the highly unstructured nature of the text and the antiquated use of language, we were surprised to find that, once trained on the grey literature reports, the IE software achieved comparable levels of success with the antiquarian literature. Problems were encountered with more synthetic papers and other types of document, but where the primary subject of the article was a fieldwork report it was possible to identify the key 'What', 'When' and 'Where' index terms using the same approach as adopted with the grey literature. After discounting prefatory papers, the PSAS corpus was reduced to 3,991 papers referring to archaeological discoveries. By applying the rule that the actual What, Where and When would appear in the first 10% of the paper, it was possible to identify a subject for all but 277 of the papers, although there was less success with a geospatial location (627 papers with no location), and least success with period terms (2,056 papers with no When term).
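
To put the Workpackage 3 figures on a common footing, the short Python sketch below converts the reported counts of unindexed papers into per-facet coverage. The counts come from the description above; the facet labels and the function name are illustrative, and the 'subject' count is assumed to correspond to the 'What' facet.

```python
# Counts reported for the PSAS corpus after discounting prefatory papers.
TOTAL_PAPERS = 3991
PAPERS_WITHOUT_TERM = {
    "What (subject)": 277,
    "Where (location)": 627,
    "When (period)": 2056,
}


def facet_coverage(total, missing_by_facet):
    """Return {facet: (papers_indexed, percent_indexed)} for each facet."""
    result = {}
    for facet, missing in missing_by_facet.items():
        indexed = total - missing
        result[facet] = (indexed, round(100 * indexed / total, 1))
    return result


coverage = facet_coverage(TOTAL_PAPERS, PAPERS_WITHOUT_TERM)
for facet, (indexed, percent) in coverage.items():
    print(f"{facet}: {indexed} of {TOTAL_PAPERS} papers ({percent}%)")
```

On these figures a subject was found for 3,714 papers, a location for 3,364 and a period term for 1,935, which makes the ordering of success across the three facets explicit.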
Exploitation Route Further development of Information Extraction and Natural Language Processing, building on the Archaeotools project, is underway and is being explored with EU infrastructure funding under the ARIADNE infrastructure: http://ariadne-infrastructure.eu/
Sectors Digital/Communication/Information Technologies (including Software), Culture, Heritage, Museums and Collections

URL http://archaeologydataservice.ac.uk/research/archaeotools
 
Description Some of the tools developed have been deployed in the ADS Library http://archaeologydataservice.ac.uk/library/
First Year Of Impact 2017
Sector Culture, Heritage, Museums and Collections
 
Title Faceted index of 3991 papers published in Proceedings of Society of Antiquaries of Scotland 
Description  
Type Of Material Database/Collection of data 
Provided To Others? No  
 
Title Faceted index of 906 grey literature reports 
Description  
Type Of Material Database/Collection of data 
Provided To Others? No  
 
Title Faceted index of over 1 million metadata records for sites and monuments of the UK 
Description  
Type Of Material Database/Collection of data 
Provided To Others? No  
 
Title Underlying database for new Archsearch catalogue 
Description  
Type Of Material Database/Collection of data 
Provided To Others? No