Living Virtually: Creating and Interfacing Digital Surrogates of Textual Data Embedded (Hidden) in Cultural Heritage Artefacts

Lead Research Organisation: University of Oxford
Department Name: Classics Faculty

Abstract

Since the discovery of the carbonised papyri at Herculaneum in the 18th century, there has been a great deal of interest in accessing the content of the scrolls preserved by the intense heat of the eruption of Mount Vesuvius in 79 CE. The first attempts to open these scrolls were made by hand with a knife, but this caused them to break into fragmented chunks. Subsequently, in 1756, a machine was invented to provide a safer method of unrolling, which was applied more successfully to numerous scrolls. However, in many cases it was impossible to keep the different layers of papyrus from sticking to each other, and so substantial portions of text remained hidden even in successfully opened scrolls, while hundreds of scrolls remained too firmly carbonised to unroll at all. The content of these fully intact scrolls, together with the text under the stuck-on layers, remains a mystery. New technology offers a solution. In the early 21st century the application of non-invasive CT scanning, a concept already proved by project members, has revealed new possibilities. The structure of a scroll can be rendered digitally in three dimensions, revealing the layers of papyrus around the scroll's circumference. Computational methods for algorithmically separating, unrolling, and flattening these layers have been developed by project members over the past decade. The virtual unrolling method has been successfully applied to P. Herc. 375 and 495. Despite this achievement, however, the ink does not appear with any significant clarity, and while faint traces of a handful of Greek letters have been transcribed, there is currently no means to verify and replicate such results.
This project aims to address the problem of detecting ink in this non-invasive imaging and thus definitively solve the long-standing problem posed by the Herculaneum papyri. In 2016 project members successfully applied the virtual unrolling method to a carbonised Hebrew scroll from the site of Ein Gedi in Israel. The ink was immediately visible, but only because it was contaminated with heavy trace elements and therefore appeared naturally in CT scanning. The carbon-based ink used in the Herculaneum papyri cannot be visualised in the same way. However, we now know that this ink is weakly contaminated with lead. We thus propose a new method, Dark Field X-ray Imaging, which reveals ink by isolating and capturing trace elements, such as lead, in its composition. To enhance the resulting ink signal further, we introduce a new neural network, Reference-Amplified Computed Tomography (RACT), to amplify both the ink's presence and the shapes of the Greek characters for improved legibility. This method will definitively solve the problem of reading the text hidden in the Herculaneum papyri. To add value, the project will make the data generated by this process accessible to researchers and to the curators responsible for these artefacts by developing a new digital platform, the Augmented Language Interface for Cultural Engagement (ALICE), which ensures that the data produced by the Dark Field X-ray Imaging and RACT processes is accessible and properly curated, and that the extracted text can be digitally edited. Moreover, ALICE includes functionality for integrating 3D models of the original artefact and for recording, alongside the digital edition, the metadata that explains both how the text image was generated and where in the object's geometry the text originates. This is necessary for scientifically verifying and replicating any subsequent analysis or publication of the data. Significantly, for other cultural heritage artefacts that contain hidden text, our new imaging techniques and digital platform will be built using open architecture standards; the source code will be easily adaptable for non-invasive reading of writing inside other intractable artefacts, such as burnt books, book-bindings, and mummy cartonnage.

Planned Impact

This is a project to develop new imaging and machine learning techniques to extract and make legible textual data embedded or hidden in damaged cultural heritage artefacts. We will focus on the carbonised papyrus scrolls from Herculaneum, which will impact scholars in Classics, Papyrology, and Ancient Greek and Roman History and Philosophy by providing a significant body of new texts to advance their research. Other kinds of artefacts with hidden content will also be examined, and the new techniques are applicable to further classes of cultural heritage material. Poorly preserved and burnt books contain pages stuck together, or spines that can no longer be opened without cracking and further deterioration. Book bindings contain hidden, cut-up pages of previously recycled books. The project will thus impact the curators of collections, and the visitors and users (academics, as well as students, tourists and interested amateurs) of the museums and libraries that house them, by enabling them to read these hitherto illegible artefacts virtually.

The project will facilitate the application of its imaging and machine learning techniques to other cultural heritage artefacts for the purpose of non-invasively extracting hidden textual data. The range of impact will be increased by designing and implementing these new techniques according to open architecture standards. While the essential process of imaging and then virtually separating and revealing hidden text is the same, the physical objects, languages, and methods of writing are heterogeneous. The fundamental algorithms for virtually unrolling and extracting textual data will therefore work with an updatable reference library to accommodate variation in types of physical object, language, and writing characters. The automated system contains its own established protocol for learning and adapting to new reference material. To demonstrate this transfer of knowledge in real time, the project documents and lays the groundwork for how the system may be transferred from Herculaneum papyri to other objects, by publishing the results of the project and by collaborating with curators in the Bodleian Library, the Ashmolean Museum and the Chester Beatty Library to document how our methods are impacting their work and how they can be extended to other artefacts under their care.

As a model for interaction by visitors and users of a museum or library, the system will enable them, through enhanced visualisation, to view and study artefacts that might otherwise not appear interesting or engaging. Herculaneum papyrus scrolls look like burnt logs of wood. A book with pages stuck together might just as well be a closed book on display. Artefacts exhibiting texts in ancient, dead, or rare languages are likely to be impenetrable without an expert at hand to explain and contextualise their significance. To make the visitor experience more dynamic and informative, the project offers virtual and augmented reality applications via mobile technology, tabulating use (and thereby impact) in conjunction with project interviews with museum curators and staff (tour guides, conservators and department heads). Image data (in both 2D and 3D), video, audio, and translations, for example, bring to life a static burnt papyrus scroll or a volume that cannot be opened. The data produced will be contextualised by a digital platform through which curation, annotation, and digital editing are performed to make the extracted text accessible and useful for academic research and for viewers' understanding. The system will translate the image data, and the subsequently annotated and edited images, into the formats required by mobile and wearable technology, making the data ready for use in building augmented and virtual reality applications for virtual exhibition and visitor engagement.
 
Description With respect to new imaging methods, we built a prototype non-invasive imaging system based on x-rays produced by a coherent (monochromatic) radioactive source. The construction of the experimental system revealed an important set of constraints, including imaging times, required strengths for radiation sources, and sensitivity to various concentrations of trace amounts of lead in the ink. We also built several machine learning methods: photogrammetry and neural radiance fields for 3D shape recovery and refinement; principal component methods and non-linear learning of contrast enhancement from the spectral imaging of open fragments; and NeRF-based methods for x-ray tomography/tomosynthesis of the open fragments that can potentially overcome the form factor of already-opened papyrus in order to reveal hidden layers without full tomography. Methods for reconstruction are emerging and need to be evaluated more fully before a system could be built to realize them; a key finding, however, is that such reconstruction is possible in principle.
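As an illustration of the principal-component approach to contrast enhancement, here is a minimal sketch (in Python with NumPy; not the project's actual pipeline) of deriving a false-colour image from an aligned stack of spectral bands:

    import numpy as np

    def pca_false_color(bands):
        """Project an aligned spectral stack (H, W, B), B >= 3, onto its
        first three principal components and stretch each to 8 bits."""
        h, w, b = bands.shape
        X = bands.reshape(-1, b).astype(np.float64)
        X -= X.mean(axis=0)                      # centre each band
        eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
        top3 = eigvecs[:, np.argsort(eigvals)[::-1][:3]]
        Y = X @ top3                             # scores of the 3 strongest components
        Y -= Y.min(axis=0)                       # stretch each component
        Y /= Y.max(axis=0)                       # independently to [0, 1]
        return (Y.reshape(h, w, 3) * 255).astype(np.uint8)

Low-variance differences between ink and substrate that are hard to see in any single band often separate more cleanly in one of the leading components, which is why false colouring of PCA bands is a useful first pass before non-linear methods.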

With respect to the digital platform (originally "ALICE") for curating disparate data around Herculaneum, we designed and implemented a prototype curation system, defining a coherent metadata schema for Herculaneum material and, more generally, for similarly complex digital models of manuscripts. The working prototype shows how the platform can be used within the context of a papyrologist's work to improve and augment that work. We assembled 3D models registered with historical image data from open fragments, showing this to be a powerful way to present Herculaneum data to scholars. As part of the curated digital platform, we built a suite of software tools and showed that the required software and systems pieces are achievable and that the result is a better tool than other methods for dealing with the same set of data. The datasets we have captured and will release for PHerc.118 are the best available for any Herculaneum material: 3D, spectral, and composite historical. The construction of the required tools, software, and systems to produce the data is a key finding.

With respect to source code, serialized operations, and support for peer review of digital operations: as part of the digital curation platform, the METS schema supports instrumented serialization of algorithmic operations and transformations. The tool (smgl) is a C++14 library for creating custom dataflow pipelines that are instrumented for serialization. It was designed to make it easy to convert existing processing workflows into repeatable and observable pipelines for the purposes of experimental reporting, reliability, and validation.
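The serialization idea can be sketched in Python (smgl itself is C++14, and the names below are illustrative rather than its actual API): each node in a pipeline records its parameters and content digests of its inputs and outputs, so a complete run can be written out as a manifest for later review or replay.

    import hashlib
    import json

    class Node:
        """Illustrative dataflow node: applies a function and logs a
        serializable record of the operation."""
        def __init__(self, name, fn, **params):
            self.name, self.fn, self.params = name, fn, params

        def run(self, data, log):
            out = self.fn(data, **self.params)
            log.append({
                "node": self.name,
                "params": self.params,
                "input_sha1": hashlib.sha1(repr(data).encode()).hexdigest(),
                "output_sha1": hashlib.sha1(repr(out).encode()).hexdigest(),
            })
            return out

    def run_pipeline(nodes, data, manifest_path):
        log = []
        for node in nodes:
            data = node.run(data, log)
        with open(manifest_path, "w") as f:
            json.dump(log, f, indent=2)  # the serialized record of the run
        return data

A reviewer can then re-run the pipeline and confirm that the logged digests match, which is the kind of repeatability and observability the smgl design targets.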
Exploitation Route The application of AI and machine learning approaches to the decipherment of ancient text documents in particular is only just beginning, and it will be a natural next step to bring advances in machine learning to bear on the scans and images obtained using the latest technology. Since ancient and medieval textual evidence is quite diverse in its nature and in the particular challenges it poses, cross-fertilisation of neighbouring fields is a likely outcome. There are also obvious applications for these methodologies, after suitable adaptation, in non-academic contexts.
Sectors Digital/Communication/Information Technologies (including Software), Education, Culture, Heritage, Museums and Collections

 
Title PHerc 118 primary data 
Description This project produced a novel set of primary data for the PHerc.118 manuscript and, together with software tools, a novel set of derived data, organized into a digital representation of the manuscript for use by scholars. The primary data was a set of photogrammetric inputs designed to support the metric reconstruction in 3D of the manuscript's shape. The photogrammetry system was designed to capture six images at each of a few hundred stops above the surface of the manuscript. The multi-camera head, positioned via a motorized gantry in three degrees of freedom, allowed the collection of well-lit images from known positions. This primary data for each of the trays of the opened PHerc.118 fragments is the basis for two methods of shape reconstruction: dense multi-camera stereo (i.e., photogrammetry) and NeRF-based volumetric inference. The primary data comprises the PHerc.118 photogrammetry capture (the raw images), the PHerc.118 spectral captures, and the photon counts from the proxy samples at Rutherford Appleton.
Type Of Material Data analysis technique 
Year Produced 2023 
Provided To Others? No  
Impact This data will become available in 2023. It is too early to identify impacts. 
URL https://educelab.github.io/living-virtually/
 
Title PHerc 118 secondary data 
Description This project produced a novel set of primary data for the PHerc.118 manuscript and, together with software tools, a novel set of derived data, organized into a digital representation of the manuscript for use by scholars. The second set of primary data was spectral photography at 50-megapixel resolution: the incident light, controlled via software across 14 bands, was imaged by the MegaVision digital camera to create an aligned spectral block. Derived data from the primary sets (part of the dataset too, though produced by algorithms rather than captured from devices) includes 3D reconstructions, color images with contrast enhancement, metadata payloads structured according to a designed schema, and other visualization information that subsequent tools can use for effective interaction with the data. Specifically:
- PHerc.118 3D from photogrammetry (complete set of 12 trays);
- PHerc.118 registration of historical data to the photogrammetry (varied: we have demonstrated registration of prior high-resolution color images, prior infrared images captured by Brigham Young University, and registration of digitized disegni);
- PHerc.118 color images derived from the spectral set, and false coloring for contrast enhancement derived from the spectral bands (e.g., false coloring of PCA bands);
- PHerc.118 metadata, structured into the METS/XML schema we designed;
- bin counts and visualizations from the cadmium-source/Hexitec imaging of lead concentration samples.
Novel analysis, methods, and techniques this work has influenced: although there is a plethora of structured metadata schemas for various artifacts and manuscripts, nothing was appropriately suited to capturing the nuance and uniqueness of Herculaneum papyri. The METS model that we designed (and the software we wrote to populate it from the captured data of PHerc.118) will influence the broader field, as evidenced by the "Best Paper" award at the metadata conference where the work was submitted for initial publication. The photogrammetric reconstruction of PHerc.118 has formed the basis for alternative, possibly better 3D reconstruction methods using neural radiance fields (NeRF); this novel analysis using captured data will support a direct comparison of the best photogrammetric reconstruction of the opened papyri to a NeRF-based reconstruction. Photon counting from the cadmium-source setup gives us the data to estimate imaging times based on lead concentrations in ink (sketched after this entry). No other such study exists to our knowledge, and so the technique of comparing concentrations based on photon counts in the data is a direct influence of this work.
Type Of Material Data analysis technique 
Year Produced 2023 
Provided To Others? No  
Impact This data will become available in 2023. It is too early to identify impacts. 
URL https://educelab.github.io/living-virtually/
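As a purely hypothetical illustration of how photon-count data supports imaging-time estimates (the numbers below are placeholders, not measured values from this study): if the detected fluorescence count rate scales linearly with lead concentration, then Poisson counting statistics (SNR roughly sqrt(N)) fix the exposure needed for a target signal-to-noise ratio.

    def exposure_seconds(rate_at_ref, ref_conc, conc, target_snr):
        """Poisson counting: SNR ~ sqrt(N), so N = target_snr**2 counts are
        needed; with rate assumed proportional to concentration, t = N / rate."""
        rate = rate_at_ref * (conc / ref_conc)   # counts/s at the sample's concentration
        return target_snr ** 2 / rate

    # Placeholder values: 2 counts/s at a 1% Pb reference, a 0.1% Pb sample, SNR of 10
    print(exposure_seconds(2.0, 0.01, 0.001, 10))  # -> 500.0 seconds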
 
Description Access and support from the Bodleian Library, Oxford 
Organisation University of Oxford
Department Bodleian Library
Country United Kingdom 
Sector Academic/University 
PI Contribution The material object at the centre of this research project is a fragmentary papyrus held in the Bodleian Library. The project's contribution was the study of this rare item, both in itself and with a view to creating new methodologies, on the assumption that it is in the Library's interest to have its collections of rare items studied.
Collaborator Contribution The Bodleian Library/Weston Library granted the project access to the multiple fragments of the papyrus, providing supervised space to conduct scanning and imaging, and providing specialised personnel who prepared the specimens for scanning and imaging. This involved removing them from the frames/containers in which they are normally kept for preservation purposes.
Impact See the project website. The project is multi-disciplinary in nature.
Start Year 2019
 
Description Collaboration with Rutherford Appleton Laboratory 
Organisation Rutherford Appleton Laboratory
Country United Kingdom 
Sector Academic/University 
PI Contribution The Faculty of Classics and the PI performed a coordinating function.
Collaborator Contribution Rutherford Appleton Laboratory is the employer of Co-PI Dr Jens Dopke. They contributed Dr Dopke's time and constructed the cadmium-source capture set-up; see Section 8 below.
Impact See Section 8 below for a summary, and the project website above for the actual data.
Start Year 2019
 
Title Bench-top X-ray fluorescence system 
Description Based on the provision of a radioactive source by this project, a spectral sensor has been set up in a shielding enclosure to allow tests of benchtop imaging for trace materials via X-ray fluorescence. These tests concluded in late January, and while the data is still being processed, we know that sensitivity is lower than we had hoped, in particular for lead K-line emission. L-line emissions are promising with regard to surface scanning, but owing to the lower energies involved they will not prove very useful in many-layered samples. All parts of the system exist, but a functional setup is only assembled as required.
Type Of Technology Systems, Materials & Instrumental Engineering 
Year Produced 2023 
Impact Not as yet. 
URL https://educelab.github.io/living-virtually/
 
Title Control software for photogrammetry capture + scripting to manage post-acquisition orchestration of data for cloud-based photogrammetric reconstruction 
Description The photogrammetry system is a real-time capture system that drives five cameras, a set of visible and infrared lights, and a motorized stage with three degrees of freedom in order to capture the images needed to reconstruct the 3D shape of an open Herculaneum fragment. The control software, written in Python and C++, and the post-processing software for managing the data and orchestrating the cloud-based photogrammetry software, together form a working hardware and software system that was used on site at the Bodleian to recover data from the 12 trays of PHerc.118 (the capture loop is sketched after this entry).
Type Of Technology Webtool/Application 
Year Produced 2023 
Impact Not as yet. 
URL https://educelab.github.io/living-virtually/
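A minimal sketch of the capture loop's logic (illustrative Python; the gantry, camera, and light interfaces are assumed stand-ins, not the project's actual driver API):

    def capture_grid(gantry, cameras, lights, xs, ys, z):
        """Visit a grid of stops above the fragment and trigger every
        camera at each stop, tagging images with the known position."""
        captures = []
        lights.on()
        for x in xs:
            for y in ys:
                gantry.move_to(x, y, z)          # 3-degree-of-freedom stage
                for cam_id, cam in enumerate(cameras):
                    captures.append({
                        "camera": cam_id,
                        "position": (x, y, z),   # recorded for the reconstruction
                        "image": cam.trigger(),
                    })
        lights.off()
        return captures

Recording the commanded position alongside each image gives the downstream photogrammetry solve known camera stations to start from, rather than estimating every pose from scratch.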
 
Title GUI application for analyzing x-ray fluorescence data 
Description Software scaffolding (Python-based) for processing and visualizing x-ray fluorescence data in the imaging system that uses the radioactive source.
Type Of Technology Webtool/Application 
Year Produced 2023 
Impact This is still being worked on, and it is not clear at present whether it can be made accessible through the project website. 
 
Title Improved aligned image comparison web viewer 
Description Software scaffolding to evaluate and improve the image registration. 
Type Of Technology Webtool/Application 
Year Produced 2023 
Open Source License? Yes  
Impact None as yet. 
URL https://educelab.github.io/living-virtually/
 
Title Machine Learning framework (Python and C++; orchestration software for containerization and managing execution/control). 
Description All software for running AI/ML jobs in batches is containerized for execution in cloud-based environments. This set of scripts launches jobs, collects and assembles their outputs, and reports progress, giving researchers a way to queue and orchestrate experimental runs (the pattern is sketched after this entry).
Type Of Technology Webtool/Application 
Year Produced 2023 
Open Source License? Yes  
Impact None as yet. 
URL https://educelab.github.io/living-virtually/
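A minimal sketch of the queue-and-orchestrate pattern (illustrative; the container image name and job layout are placeholders, not the project's published artifacts):

    import subprocess

    def run_batch(jobs, image="example/ml-job:latest"):
        """Launch each job in a container, collect its exit status,
        and report progress as the batch drains."""
        results = []
        for i, job in enumerate(jobs, 1):
            cmd = ["docker", "run", "--rm",
                   "-v", f"{job['data_dir']}:/data", image] + job["args"]
            proc = subprocess.run(cmd, capture_output=True, text=True)
            results.append({"name": job["name"], "returncode": proc.returncode})
            print(f"[{i}/{len(jobs)}] {job['name']}: "
                  f"{'ok' if proc.returncode == 0 else 'failed'}")
        return results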
 
Title Photogrammetry system 
Description The system is bespoke and was used to create the images of PHerc.118. This project contributed to its development, alongside other sources of support beyond its use for the 12 trays of PHerc.118.
Type Of Technology Systems, Materials & Instrumental Engineering 
Year Produced 2022 
Impact N/a. It is an imaging device. 
URL https://educelab.github.io/living-virtually/
 
Title Processing software for raw fluorescence data recorded with Hexitec 
Description It processes large amounts of raw recorded data into smaller 3D histograms for further analysis (the binning step is sketched after this entry).
Type Of Technology Webtool/Application 
Year Produced 2023 
Open Source License? Yes  
Impact None as yet. 
URL https://gitlab.cern.ch/jdopke/XRayFluorescenceImaging
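The reduction step can be sketched as follows (illustrative Python with NumPy; the actual implementation is in the linked repository): per-photon events with pixel coordinates and measured energies are binned into a 3D histogram over (row, column, energy).

    import numpy as np

    def events_to_histogram(x, y, energy, shape=(80, 80), e_bins=200, e_max=100.0):
        """Bin photon events into a (rows, cols, energy) histogram; the
        80x80 grid matches the Hexitec pixel array, e_max is in keV."""
        hist, _ = np.histogramdd(
            np.column_stack([y, x, energy]),
            bins=(shape[0], shape[1], e_bins),
            range=((0, shape[0]), (0, shape[1]), (0.0, e_max)),
        )
        return hist  # a far smaller summary of the raw event stream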
 
Title Prototype METS software for creating and exporting PHerc.118 metadata in our METS/XML model 
Description The software for creating the structured METS files, written in Python, provides a user interface for practitioners to enter metadata regarding the details of a papyrus and then export it as a compliant METS document. The software allows successive editing of the METS files and sets a standard for how other software will parse and manipulate the metadata stored there (a minimal export sketch follows this entry).
Type Of Technology Webtool/Application 
Year Produced 2023 
Open Source License? Yes  
Impact Not as yet. 
URL https://educelab.github.io/living-virtually/
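A minimal sketch of the export step using Python's standard library (the document below is a generic METS skeleton using the standard METS and xlink namespaces; the project's actual schema records far more detail):

    import xml.etree.ElementTree as ET

    METS = "http://www.loc.gov/METS/"
    XLINK = "http://www.w3.org/1999/xlink"
    ET.register_namespace("mets", METS)
    ET.register_namespace("xlink", XLINK)

    def export_mets(label, image_files, path):
        """Write a skeletal METS document: a file section listing captured
        images and a structural map referencing them."""
        mets = ET.Element(f"{{{METS}}}mets", {"LABEL": label})
        grp = ET.SubElement(ET.SubElement(mets, f"{{{METS}}}fileSec"),
                            f"{{{METS}}}fileGrp", {"USE": "capture"})
        for i, href in enumerate(image_files):
            el = ET.SubElement(grp, f"{{{METS}}}file", {"ID": f"F{i:04d}"})
            ET.SubElement(el, f"{{{METS}}}FLocat",
                          {"LOCTYPE": "URL", f"{{{XLINK}}}href": href})
        div = ET.SubElement(ET.SubElement(mets, f"{{{METS}}}structMap"),
                            f"{{{METS}}}div", {"LABEL": label})
        for i in range(len(image_files)):
            ET.SubElement(div, f"{{{METS}}}fptr", {"FILEID": f"F{i:04d}"})
        ET.ElementTree(mets).write(path, xml_declaration=True, encoding="utf-8")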
 
Title Prototype NeRF software for comparing 3D reconstruction from the photogrammetry data set, allowing a direct comparison of methods 
Description As with the NeRF framework for tomosynthesis, we adapted the framework to compare NeRF's ability to capture and render novel views against the standard photogrammetric approach of 3D reconstruction, model building, and texturing. The software framework allows comparisons of distances and measurements in the two forms of reconstruction, for the purpose of understanding which methods might be the most salient and helpful for a papyrologist using the data to understand a manuscript.
Type Of Technology Webtool/Application 
Year Produced 2023 
Open Source License? Yes  
Impact None as yet. 
URL https://educelab.github.io/living-virtually/
 
Title Prototype experimental framework based on the Neural Radiance Fields (NeRF) approach for studying the viability of reconstruction from "limited view tomography", also known as "tomosynthesis". 
Description This software, written in a combination of Python and C++, is patterned after the Neural Radiance Fields approach to volumetric reconstruction from a set of projections of that volume. The NeRF framework was originally intended to encode the structure and appearance of a visible scene, captured from multiple camera views, in order to render interpolations (e.g., novel views) and view sequences. We have adapted that framework to experiment with volumetric reconstruction of x-ray volumes as a way of reproducing the classical results of "filtered backprojection" (FB). This experimental framework has allowed us to understand how NeRF might permit the recovery of a tomographic volume without the need to collect the regular and massive numbers of projections that FB requires. The developed tool, which is a research framework, has been used to run initial experiments that show promise for reconstruction in environments where projections from all directions are difficult to collect. In manuscript studies this is exactly the situation with opened texts (flattened, perhaps pasted down on boards or other materials), such as Herculaneum papyri, which have multiple layers confounded in a flattened geometry. (The core adaptation is sketched after this entry.)
Type Of Technology Webtool/Application 
Year Produced 2023 
Open Source License? Yes  
Impact None as yet. Due to Covid delays numerous outcomes are coming together only now. 
URL https://educelab.github.io/living-virtually/
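The core adaptation can be sketched as follows (illustrative PyTorch, not the project's code): in place of NeRF's emission-absorption rendering, each detector pixel is modelled as the Beer-Lambert attenuation line integral through a learned density field, and the field is fit to the measured projections, however few or irregularly posed.

    import torch
    import torch.nn as nn

    class DensityField(nn.Module):
        """Small MLP mapping a 3D point to a non-negative attenuation density."""
        def __init__(self, hidden=128):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(3, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, 1), nn.Softplus(),
            )

        def forward(self, pts):
            return self.net(pts)

    def render_projection(field, origins, dirs, n_samples=64, near=0.0, far=2.0):
        """Beer-Lambert: I/I0 = exp(-integral of density along each ray)."""
        t = torch.linspace(near, far, n_samples)
        pts = origins[:, None, :] + dirs[:, None, :] * t[None, :, None]
        density = field(pts.reshape(-1, 3)).reshape(pts.shape[:2])
        dt = (far - near) / n_samples
        return torch.exp(-(density * dt).sum(dim=1))  # transmitted fraction per ray

    def fit(field, sample_rays, steps=1000, lr=1e-3):
        """sample_rays() yields (origins, dirs, measured) batches from the scans."""
        opt = torch.optim.Adam(field.parameters(), lr=lr)
        for _ in range(steps):
            origins, dirs, measured = sample_rays()
            loss = ((render_projection(field, origins, dirs) - measured) ** 2).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()

Because the field can be queried at any point, the rays need not follow the regular circular trajectory that filtered backprojection assumes, which is why the approach shows promise in the limited-view (tomosynthesis) setting.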
 
Title Prototype spectral image processing pipeline for text enhancement 
Description Software scaffolding to process, visualize, and interpret various approaches for improving the contrast between ink and substrate in images where low contrast makes it difficult for humans to read the text.
Type Of Technology Webtool/Application 
Year Produced 2023 
Open Source License? Yes  
Impact This is still being completed. 
 
Title Registration software to automate the calculation of the piecewise registration transformation from the photogrammetry backplate to the other images to be registered (the spectral images, the historical data) 
Description This registration software implements the automatic alignment of the historical images and the spectral images to the 3D photogrammetry. The composite representation supports visualization of the aligned data set, allowing comparisons of how shape and appearance have changed over time (one alignment step is sketched after this entry).
Type Of Technology Webtool/Application 
Year Produced 2023 
Open Source License? Yes  
Impact None as yet. 
URL https://educelab.github.io/living-virtually/
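A minimal sketch of a single alignment step (illustrative Python using OpenCV; the project's piecewise approach subdivides the images, but one global homography shows the idea):

    import cv2
    import numpy as np

    def register_to_backplate(moving, backplate):
        """Align a historical or spectral image to the photogrammetry
        backplate via ORB features and a RANSAC-estimated homography."""
        orb = cv2.ORB_create(5000)
        k1, d1 = orb.detectAndCompute(moving, None)
        k2, d2 = orb.detectAndCompute(backplate, None)
        matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
        matches = sorted(matcher.match(d1, d2), key=lambda m: m.distance)[:500]
        src = np.float32([k1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
        dst = np.float32([k2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
        H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
        h, w = backplate.shape[:2]
        return cv2.warpPerspective(moving, H, (w, h)), H

A piecewise variant estimates one such transformation per region and blends them, which copes better with local distortions of the papyrus that a single global homography cannot model.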
 
Title Web-based viewer for all photogrammetry reconstructions textured with spectral images. 
Description This viewer, written in JavaScript to run within a standard web browser and based on the Smithsonian's Voyager project, allows photogrammetry data and aligned/registered textures to be viewed without the need to download and install additional software packages. In progress: viewing all image derivatives in the web viewer.
Type Of Technology Webtool/Application 
Year Produced 2023 
Impact This is still being worked on, and the website is at present not discoverable through searches. 
URL https://educelab.gitlab.io/dri-voyager/?document=PHerc0089Cr001.json
 
Description Appearance on TV documentary [Brent Seales] 
Form Of Engagement Activity A broadcast e.g. TV/radio/film/podcast (other than news/press)
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Public/other audiences
Results and Impact Professor Brent Seales appeared in an episode of 'The UnXplained' hosted by William Shatner on Netflix (Season 4, Episode 7; episode entitled 'Mysterious Tombs'; air date May 13, 2022).
Year(s) Of Engagement Activity 2022
URL https://en.wikipedia.org/wiki/The_UnXplained
 
Description Reading Ancient Scrolls with Modern Technology: How CT scans and AI will let scientists and historians look inside carbonized papyri [Brent Seales] 
Form Of Engagement Activity A broadcast e.g. TV/radio/film/podcast (other than news/press)
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Public/other audiences
Results and Impact Podcast hosted by the Getty Trust
Year(s) Of Engagement Activity 2021
URL https://blogs.getty.edu/iris/podcast-reading-ancient-scrolls-with-modern-technology/
 
Description Silicon Valley Invitation-Only Innovation Meeting, San Francisco, CA [Brent Seales] 
Form Of Engagement Activity A formal working group, expert panel or dialogue
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Supporters
Results and Impact Professor Seales spoke at an invitation-only meeting to introduce the project and to gauge interest in collaborations on AI approaches to reading papyri.
Year(s) Of Engagement Activity 2022