The Alan Turing Institute

Lead Research Organisation: The Alan Turing Institute
Department Name: Research

Abstract

The work of the Alan Turing Institute will enable knowledge and predictions to be extracted from large-scale and diverse digital data. It will bring together the best people, organisations and technologies in data science for the development of foundational theory, methodologies and algorithms. These will inform scientific and technological discoveries, create new business opportunities, accelerate solutions to global challenges, inform policy-making, and improve the environment, health and infrastructure of the world in an 'Age of Algorithms'.

Planned Impact

The Institute will bring together leaders in advanced mathematics and computing science from the five founding universities and other partners. Its work is expected to encompass a wide range of scientific disciplines and be relevant to a large number of business sectors.
 
Title 2020-04-01 - Data Safe Havens in the Cloud - CW20 Workshop.pptx 
Description A talk given at the SSI Collaborations Workshop in April 2020 discussing the Alan Turing Institute's "Data Safe Havens in the Cloud" project. The slides are included here. 
Type Of Art Film/Video/Animation 
Year Produced 2021 
URL https://cw20.figshare.com/articles/presentation/2020-04-01_-_Data_Safe_Havens_in_the_Cloud_-_CW20_Wo...
 
Title 34-productive-research-on-sensitive-data-using-cloud-based-secure-research-environments-james-robinson-martin-oreilly.mp4 
Description A talk given at the SSI Collaborations Workshop in April 2020 discussing the Alan Turing Institute's "Data Safe Havens in the Cloud" project. A video recording of the talk plus the subsequent Q&A are included here. 
Type Of Art Film/Video/Animation 
Year Produced 2021 
URL https://cw20.figshare.com/articles/presentation/34-productive-research-on-sensitive-data-using-cloud...
 
Title Reproducible secure research environments: Talk from Safe Data Access Professionals Quarterly Meeting on 08 June 2021 
Description Overview of the challenges of supporting reproducible research on sensitive data and how the Turing addresses these in its Safe Haven secure research environment. 
Type Of Art Film/Video/Animation 
Year Produced 2021 
URL https://figshare.com/articles/presentation/Reproducible_secure_research_environments_Talk_from_Safe_...
 
Description For Key Findings and Impact, please see our Annual Report: https://www.turing.ac.uk/about-us/annual-report-2021-22
Exploitation Route Please see our Annual Report: https://www.turing.ac.uk/about-us/annual-report-2021-22
Sectors Aerospace, Defence and Marine,Agriculture, Food and Drink,Communities and Social Services/Policy,Construction,Creative Economy,Digital/Communication/Information Technologies (including Software),Education,Energy,Environment,Financial Services, and Management Consultancy,Healthcare,Leisure Activities, including Sports, Recreation and Tourism,Government, Democracy and Justice,Manufacturing, including Industrial Biotechnology,Culture, Heritage, Museums and Collections,Pharmaceuticals and Medical Biotechnology

URL https://www.turing.ac.uk/
 
Description For Key Findings and Impact, please see our Annual Report: https://www.turing.ac.uk/about-us/annual-report-2021-22
Sector Aerospace, Defence and Marine,Agriculture, Food and Drink,Communities and Social Services/Policy,Construction,Creative Economy,Digital/Communication/Information Technologies (including Software),Energy,Environment,Financial Services, and Management Consultancy,Healthcare,Government, Democracy and Justice,Manufacturing, including Industrial Biotechnology,Culture, Heritage, Museums and Collections,Pharmaceuticals and Medical Biotechnology,Security and Diplomacy,Transport,Other
Impact Types Cultural,Societal,Economic,Policy & public services

 
Title DETOX seismic tomography models 
Description DETOX tomography models. This folder contains three tomography models, DETOX-P1, DETOX-P2 and DETOX-P3, in the following formats:
- NetCDF (dirname: grid_nc4)
- VTK (dirname: vtk)
- xyz-value (dirname: txt_tetrahedron)
- JPEG for GPLATES, high velocities only (dirname: GPLATES)
The directories are organized as follows:
DETOX-P1
+-- GPLATES
+-- grid_nc4
+-- txt_tetrahedron
+-- vtk
DETOX-P2
+-- GPLATES
+-- grid_nc4
+-- txt_tetrahedron
+-- vtk
DETOX-P3
+-- GPLATES
+-- grid_nc4
+-- txt_tetrahedron
+-- vtk
Citation: Kasra Hosseini, Karin Sigloch, Maria Tsekhmistrenko, Afsaneh Zaheri, Tarje Nissen-Meyer, Heiner Igel, Global mantle structure from multifrequency tomography using P, PP and P-diffracted waves, Geophysical Journal International, Volume 220, Issue 1, January 2020, Pages 96-141, https://doi.org/10.1093/gji/ggz394 
Type Of Material Database/Collection of data 
Year Produced 2019 
Provided To Others? Yes  
URL https://zenodo.org/record/3993275
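The NetCDF grids are convenient for quick inspection in Python. The sketch below is illustrative only: the file name and the variable/coordinate names ("dvp", "depth") are assumptions, so check the contents of the grid_nc4 directory after downloading.

import xarray as xr

# Open one of the DETOX NetCDF grids (hypothetical file name).
ds = xr.open_dataset("DETOX-P1/grid_nc4/DETOX-P1.nc")
print(ds)  # lists the dimensions, coordinates and data variables actually present

# If the grid exposes depth/latitude/longitude coordinates and a velocity
# perturbation variable under these (assumed) names, a horizontal slice could
# be plotted with:
# ds["dvp"].sel(depth=600, method="nearest").plot()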
 
Title DUKweb (Diachronic UK web) 
Description We present DUKweb, a set of large-scale resources useful for the diachronic analysis of contemporary English. The dataset is derived from the JISC UK Web Domain Dataset (1996-2013), which collects resources from the Internet Archive that were hosted on domains ending in '.uk'. The dataset includes co-occurrence matrices for each year and two types of word vectors per year: Temporal Random Indexing vectors and word2vec embeddings. 
Type Of Material Database/Collection of data 
Year Produced 2020 
Provided To Others? Yes  
URL https://bl.iro.bl.uk/work/f9ff33ab-56b7-4594-8aca-49781296c0c6
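As a pointer for prospective users, the word2vec component can typically be explored with gensim. The file name and the binary/text format below are assumptions; consult the DUKweb documentation for the actual per-year file layout.

from gensim.models import KeyedVectors

# Load one year's embeddings (hypothetical file name and format).
vectors = KeyedVectors.load_word2vec_format("dukweb_word2vec_2005.bin", binary=True)

# Nearest neighbours of a word in that year's semantic space.
print(vectors.most_similar("broadband", topn=5))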
 
Title Data supporting "GABA, not BOLD, reveals dissociable learning-dependent plasticity mechanisms in the human brain" 
Description Behavioural data. BOLD change measurements. GABA change measurements. Behavioural data under tDCS intervention. 
Type Of Material Database/Collection of data 
Year Produced 2018 
Provided To Others? Yes  
 
Title Dataset for Toponym Resolution in Nineteenth-Century English Newspapers 
Description We present a new dataset for the task of toponym resolution in digitised historical newspapers in English. It consists of 343 annotated articles from newspapers based in four different locations in England (Manchester, Ashton-under-Lyne, Poole and Dorchester), published between 1780 and 1870. The articles have been manually annotated with mentions of places, which are linked---whenever possible---to their corresponding entry on Wikipedia. The dataset is published on the British Library shared research repository, and is especially of interest to researchers working on improving semantic access to historical newspaper content. We share the 343 annotated files (one file per article) in the WebAnno TSV file format version 3.2, a CoNLL-based file format. We additionally provide a TSV file with metadata at the article level, and the annotation guidelines. 
Type Of Material Database/Collection of data 
Year Produced 2021 
Provided To Others? Yes  
URL https://bl.iro.bl.uk/concern/datasets/de43a15c-e000-4fec-8b66-7ca94ae13db3
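A minimal reading sketch, assuming the general WebAnno TSV v3.2 layout (comment and metadata lines start with '#'; token lines are tab-separated with a sentence-token id, character offsets, the token form, and then the annotation columns). The file name is a placeholder and the column semantics should be checked against the annotation guidelines shipped with the dataset.

from pathlib import Path

def read_webanno_tsv(path):
    """Yield (token_id, char_offsets, form, annotation_columns) per token line."""
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if not line or line.startswith("#"):
            continue  # skip format headers, sentence text and blank lines
        cols = line.split("\t")
        yield cols[0], cols[1], cols[2], cols[3:]

# for tok_id, offsets, form, annos in read_webanno_tsv("article_0001.tsv"):  # hypothetical file
#     print(tok_id, form, annos)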
 
Title Latin lexical semantic annotation 
Description This dataset is a collection of lexical annotations of the corpus occurrences of 40 Latin lemmas. The corpus instances are from LatinISE and the process is described in Schlechtweg et al. (2020, 2021). The annotation was coordinated by Barbara McGillivray, and done by Annie Burman, Daria Kondakova, Francesca Dell'Oro, Helena Bermudez Sabel, Hugo Burgess, Paola Marongiu, and Rozalia Dobos. The pre-annotation was coordinated and designed by Barbara McGillivray and done by Manuel Márquez Cruz. References: McGillivray, B. and Kilgarriff, A. (2013). Tools for historical corpus research, and a corpus of Latin. In Paul Bennett, Martin Durrell, Silke Scheible, Richard J. Whitt (eds.), New Methods in Historical Corpus Linguistics. Tübingen: Narr. Barbara McGillivray, Dominik Schlechtweg, Haim Dubossarsky, Nina Tahmasebi, & Simon Hengchen (2021). DWUG LA: Diachronic Word Usage Graphs for Latin [Data set]. Zenodo. https://doi.org/10.5281/zenodo.5255228. Schlechtweg, D., McGillivray, B., Hengchen, S., Dubossarsky, H., Tahmasebi, N. (2020). SemEval-2020 Task 1: Unsupervised Lexical Semantic Change Detection. In Proceedings of the Fourteenth Workshop on Semantic Evaluation, 2020. International Committee for Computational Linguistics. DOI: 10.18653/v1/2020.semeval-1.1. Schlechtweg, D., Tahmasebi, N., Hengchen, S., Dubossarsky, H., McGillivray, B. (2021). DWUG: A large Resource of Diachronic Word Usage Graphs in Four Languages. In Proceedings of EMNLP 2021. 
Type Of Material Database/Collection of data 
Year Produced 2021 
Provided To Others? Yes  
URL https://kcl.figshare.com/articles/dataset/Latin_lexical_semantic_annotation/16974823
 
Title LatinISE subcorpora for SemEval 2020 task 1 
Description This data collection contains the Latin test data for SemEval 2020 Task 1 (Unsupervised Lexical Semantic Change Detection): a Latin text corpus pair (`corpus1/`, `corpus2/`) and 40 lemmas which have been annotated for their lexical semantic change between the two corpora (`targets.txt`). The corpus data have been automatically lemmatized and part-of-speech tagged, and have been partially corrected by hand. For homonyms, the lemmas are followed by the '#' symbol and the number of the homonym according to the Lewis-Short dictionary of Latin when this number is greater than 1. For example, the lemma 'dico' corresponds to the first homonym in the Lewis-Short dictionary and 'dico#2' corresponds to the second homonym. Corpus 1: based on LatinISE (McGillivray and Kilgarriff 2013), version on Sketch Engine; language: Latin; time covered: from the beginning of the second century BC to the end of the first century BC; size: ~1.7 million tokens; format: lemmatized, sentence length >= 2, no punctuation, sentences randomly shuffled; encoding: UTF-8. Corpus 2: based on LatinISE (McGillivray and Kilgarriff 2013), version on Sketch Engine; language: Latin; time covered: from the beginning of the first century AD to the end of the twenty-first century AD; size: ~9.4 million tokens; format: lemmatized, sentence length >= 2, no punctuation, sentences randomly shuffled; encoding: UTF-8. Find more information on the data in the papers referenced below. References: Dominik Schlechtweg, Barbara McGillivray, Simon Hengchen, Haim Dubossarsky and Nina Tahmasebi. SemEval 2020 Task 1: Unsupervised Lexical Semantic Change Detection. To appear in SemEval@COLING2020. McGillivray, B. and Kilgarriff, A. (2013). Tools for historical corpus research, and a corpus of Latin. In Paul Bennett, Martin Durrell, Silke Scheible, Richard J. Whitt (eds.), New Methods in Historical Corpus Linguistics, Tübingen: Narr.
 
Type Of Material Database/Collection of data 
Year Produced 2020 
Provided To Others? Yes  
URL https://zenodo.org/record/3674988
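The homonym convention described above is easy to handle programmatically; the helper below is a small illustrative sketch (the function name is ours, not part of the dataset).

def split_homonym(lemma):
    """Return (base_lemma, homonym_number) for a LatinISE lemma such as 'dico#2'."""
    if "#" in lemma:
        base, number = lemma.split("#", 1)
        return base, int(number)
    return lemma, 1  # unmarked lemmas are the first Lewis-Short homonym

assert split_homonym("dico") == ("dico", 1)
assert split_homonym("dico#2") == ("dico", 2)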
 
Title LatinISE subcorpora for SemEval 2020 task 1 
Description This data collection contains the Latin test data for SemEval 2020 Task 1 (Unsupervised Lexical Semantic Change Detection): a Latin text corpus pair (`corpus1/lemma`, `corpus2/lemma`); 40 lemmas which have been annotated for their lexical semantic change between the two corpora (`targets.txt`); and the annotated binary change scores of the targets for subtask 1, together with their annotated graded change scores for subtask 2 (`truth/`). The corpus data have been automatically lemmatized and part-of-speech tagged, and have been partially corrected by hand. For homonyms, the lemmas are followed by the '#' symbol and the number of the homonym according to the Lewis-Short dictionary of Latin when this number is greater than 1. For example, the lemma 'dico' corresponds to the first homonym in the Lewis-Short dictionary and 'dico#2' corresponds to the second homonym. Corpus 1: based on LatinISE (McGillivray and Kilgarriff 2013), version on Sketch Engine; language: Latin; time covered: from the beginning of the second century BC to the end of the first century BC; size: ~1.7 million tokens; format: lemmatized, sentence length >= 2, no punctuation, sentences randomly shuffled; encoding: UTF-8. Corpus 2: based on LatinISE (McGillivray and Kilgarriff 2013), version on Sketch Engine; language: Latin; time covered: from the beginning of the first century AD to the end of the twenty-first century AD; size: ~9.4 million tokens; format: lemmatized, sentence length >= 2, no punctuation, sentences randomly shuffled; encoding: UTF-8. Find more information on the data in the papers referenced below. References: Dominik Schlechtweg, Barbara McGillivray, Simon Hengchen, Haim Dubossarsky and Nina Tahmasebi. SemEval 2020 Task 1: Unsupervised Lexical Semantic Change Detection. To appear in SemEval@COLING2020. McGillivray, B. and Kilgarriff, A. (2013). Tools for historical corpus research, and a corpus of Latin. In Paul Bennett, Martin Durrell, Silke Scheible, Richard J. Whitt (eds.), New Methods in Historical Corpus Linguistics, Tübingen: Narr.
 
Type Of Material Database/Collection of data 
Year Produced 2020 
Provided To Others? Yes  
URL https://zenodo.org/record/3732944
 
Title LatinISE test data for SemEval 2020 task 1 
Description This data collection contains the Latin test data for SemEval 2020 Task 1 (Unsupervised Lexical Semantic Change Detection): a Latin text corpus pair (`corpus1/lemma`, `corpus2/lemma`); 40 lemmas which have been annotated for their lexical semantic change between the two corpora (`targets.txt`); and the annotated binary change scores of the targets for subtask 1, together with their annotated graded change scores for subtask 2 (`truth/`). The corpus data have been automatically lemmatized and part-of-speech tagged, and have been partially corrected by hand. For homonyms, the lemmas are followed by the '#' symbol and the number of the homonym according to the Lewis-Short dictionary of Latin when this number is greater than 1. For example, the lemma 'dico' corresponds to the first homonym in the Lewis-Short dictionary and 'dico#2' corresponds to the second homonym. Corpus 1: based on LatinISE (McGillivray and Kilgarriff 2013), version on Sketch Engine; language: Latin; time covered: from the beginning of the second century BC to the end of the first century BC; size: ~1.7 million tokens; format: lemmatized, sentence length >= 2, no punctuation, sentences randomly shuffled; encoding: UTF-8. Corpus 2: based on LatinISE (McGillivray and Kilgarriff 2013), version on Sketch Engine; language: Latin; time covered: from the beginning of the first century AD to the end of the twenty-first century AD; size: ~9.4 million tokens; format: lemmatized, sentence length >= 2, no punctuation, sentences randomly shuffled; encoding: UTF-8. Find more information on the data in the papers referenced below. References: Dominik Schlechtweg, Barbara McGillivray, Simon Hengchen, Haim Dubossarsky and Nina Tahmasebi. SemEval 2020 Task 1: Unsupervised Lexical Semantic Change Detection. To appear in SemEval@COLING2020. McGillivray, B. and Kilgarriff, A. (2013). Tools for historical corpus research, and a corpus of Latin. In Paul Bennett, Martin Durrell, Silke Scheible, Richard J. Whitt (eds.), New Methods in Historical Corpus Linguistics, Tübingen: Narr.
 
Type Of Material Database/Collection of data 
Year Produced 2020 
Provided To Others? Yes  
URL https://zenodo.org/record/3734089
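The gold scores in `truth/` can be read with a few lines of Python. This sketch assumes the SemEval-2020 Task 1 convention of tab-separated "lemma<TAB>score" lines, with one file for the binary labels (subtask 1) and one for the graded scores (subtask 2); verify the exact file names in the downloaded archive.

import csv

def read_scores(path):
    # Each line: target lemma, a tab, then a numeric score.
    with open(path, encoding="utf-8") as handle:
        return {row[0]: float(row[1]) for row in csv.reader(handle, delimiter="\t")}

# binary = read_scores("truth/binary.txt")  # assumed file name
# graded = read_scores("truth/graded.txt")  # assumed file name
# print(sorted(graded, key=graded.get, reverse=True)[:5])  # most changed lemmas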
 
Title LatinISE test data for SemEval 2020 task 1 with additional token versions of the corpora 
Description This data collection contains the Latin test data for SemEval 2020 Task 1 (Unsupervised Lexical Semantic Change Detection): a Latin text corpus pair (`corpus1/lemma`, `corpus2/lemma`); 40 lemmas which have been annotated for their lexical semantic change between the two corpora (`targets.txt`); and the annotated binary change scores of the targets for subtask 1, together with their annotated graded change scores for subtask 2 (`truth/`). The corpus data have been automatically lemmatized and part-of-speech tagged, and have been partially corrected by hand. For homonyms, the lemmas are followed by the '#' symbol and the number of the homonym according to the Lewis-Short dictionary of Latin when this number is greater than 1. For example, the lemma 'dico' corresponds to the first homonym in the Lewis-Short dictionary and 'dico#2' corresponds to the second homonym. Corpus 1: based on LatinISE (McGillivray and Kilgarriff 2013), version on Sketch Engine; language: Latin; time covered: from the beginning of the second century BC to the end of the first century BC; size: ~1.7 million tokens; format: lemmatized, sentence length >= 2, no punctuation, sentences randomly shuffled; encoding: UTF-8. Corpus 2: based on LatinISE (McGillivray and Kilgarriff 2013), version on Sketch Engine; language: Latin; time covered: from the beginning of the first century AD to the end of the twenty-first century AD; size: ~9.4 million tokens; format: lemmatized, sentence length >= 2, no punctuation, sentences randomly shuffled; encoding: UTF-8. Besides the official lemma version of the corpora for SemEval-2020 Task 1, we also provide the raw token version (`corpus1/token/`, `corpus2/token/`), which contains the raw sentences in the same order as in the lemma version. Find more information on the data and SemEval-2020 Task 1 in the papers referenced below. The creation of the data was supported by the CRETA center and the CLARIN-D grant funded by the German Ministry for Education and Research (BMBF). References: Dominik Schlechtweg, Barbara McGillivray, Simon Hengchen, Haim Dubossarsky and Nina Tahmasebi. SemEval 2020 Task 1: Unsupervised Lexical Semantic Change Detection. To appear in SemEval@COLING2020. McGillivray, B. and Kilgarriff, A. (2013). Tools for historical corpus research, and a corpus of Latin. In Paul Bennett, Martin Durrell, Silke Scheible, Richard J. Whitt (eds.), New Methods in Historical Corpus Linguistics, Tübingen: Narr.
 
Type Of Material Database/Collection of data 
Year Produced 2020 
Provided To Others? Yes  
URL https://zenodo.org/record/3992738
 
Title Living Machines atypical animacy dataset 
Description Atypical animacy detection dataset, based on nineteenth-century sentences in English extracted from an open dataset of nineteenth-century books digitized by the British Library (available via https://doi.org/10.21250/db14, British Library Labs, 2014). This dataset contains 598 sentences containing mentions of machines. Each sentence has been annotated according to the animacy and humanness of the machine in the sentence. This dataset has been created as part of the following paper: Ardanuy, M. C., F. Nanni, K. Beelen, Kasra Hosseini, Ruth Ahnert, J. Lawrence, Katherine McDonough, Giorgia Tolfo, D. C. Wilson and B. McGillivray. "Living Machines: A study of atypical animacy." In Proceedings of the 28th International Conference on Computational Linguistics (COLING2020). 
Type Of Material Database/Collection of data 
Year Produced 2020 
Provided To Others? Yes  
URL https://bl.iro.bl.uk/work/323177af-6081-4e93-8aaf-7932ca4a390a
 
Title Monthly word embeddings for Twitter random sample (English, 2012-2018) 
Description This dataset contains monthly word embeddings created from the tweets available via the statuses/sample endpoint of the Twitter Streaming API from 2012 to 2018. Full details of the creation of the dataset are given in "Room to Glo: A Systematic Comparison of Semantic Change Detection Approaches with Word Embeddings". The md5sum of the gzipped tarball file is a76888ffec8cc7aebba09d365ca55ace. 
Type Of Material Database/Collection of data 
Year Produced 2019 
Provided To Others? Yes  
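Since the md5sum is quoted above, downloads can be verified before unpacking. The tarball name below is a placeholder for whatever file name the archive is distributed under.

import hashlib

EXPECTED_MD5 = "a76888ffec8cc7aebba09d365ca55ace"

def md5_of(path, chunk_size=1 << 20):
    # Stream the file in 1 MiB chunks to avoid loading it into memory at once.
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# print(md5_of("twitter_monthly_embeddings.tar.gz") == EXPECTED_MD5)  # hypothetical file name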
 
Title Research Data Supporting "Modelling prognostic trajectories of cognitive decline due to Alzheimer's disease" 
Description  
Type Of Material Database/Collection of data 
Year Produced 2020 
Provided To Others? Yes  
URL https://www.repository.cam.ac.uk/handle/1810/301740
 
Title Research data supporting "Multimodal imaging of brain connectivity reveals predictors of individual decision strategy in statistical learning" 
Description Behavioural data, resting-state fMRI connectivity data and graph metrics data (see supporting data description .doc file for more information) 
Type Of Material Database/Collection of data 
Year Produced 2019 
Provided To Others? Yes  
 
Title Research data supporting "White-Matter Pathways for Statistical Learning of Temporal Structures" 
Description Behavioural data and DTI connectivity data (see supporting data description .doc file for more information) 
Type Of Material Database/Collection of data 
Year Produced 2019 
Provided To Others? Yes  
 
Title Supplementary material for 'A Deep Learning Approach to Geographical Candidate Selection through Toponym Matching' 
Description Supplementary material for the https://github.com/Living-with-machines/LwM_SIGSPATIAL2020_ToponymMatching repository, containing the underlying code and materials for the paper 'A Deep Learning Approach to Geographical Candidate Selection through Toponym Matching', accepted to SIGSPATIAL2020 as a poster paper. Coll Ardanuy, M., Hosseini, K., McDonough, K., Krause, A., van Strien, D. and Nanni, F. (2020): A Deep Learning Approach to Geographical Candidate Selection through Toponym Matching, SIGSPATIAL: Poster Paper. 
Type Of Material Database/Collection of data 
Year Produced 2020 
Provided To Others? Yes  
URL https://zenodo.org/record/4034818
 
Title Visual Identification of Individual Holstein Friesian Cattle via Deep Metric Learning 
Description This dataset accompanies the paper "Visual Identification of Individual Holstein Friesian Cattle via Deep Metric Learning", available at https://arxiv.org/abs/2006.09205. It consists of two components: (a) detection and localisation, and (b) identification. For an overview of the dataset, refer to Section 3 of the paper; for any queries, contact the corresponding author. Accompanying source code is available at https://github.com/CWOA/MetricLearningIdentification 
Type Of Material Database/Collection of data 
Year Produced 2020 
Provided To Others? Yes  
URL https://data.bris.ac.uk/data/dataset/10m32xl88x2b61zlkkgz3fml17/
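For readers unfamiliar with metric-learning identification, the sketch below illustrates the general idea only (it is not the authors' pipeline): each detected animal is mapped to an embedding vector by a trained network, and identity is assigned by nearest neighbour against a gallery of labelled embeddings. Random vectors stand in for real network outputs here.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
gallery_embeddings = rng.normal(size=(200, 128))  # embeddings of known individuals
gallery_ids = rng.integers(0, 40, size=200)       # 40 individual cattle identities
query_embeddings = rng.normal(size=(5, 128))      # embeddings of new detections

knn = KNeighborsClassifier(n_neighbors=1, metric="euclidean")
knn.fit(gallery_embeddings, gallery_ids)
print(knn.predict(query_embeddings))              # predicted identity for each query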
 
Title DeezyMatch 
Description DeezyMatch: A Flexible Deep Neural Network Approach to Fuzzy String Matching. DeezyMatch can be applied to the following tasks: record linkage, candidate selection for entity linking systems, and toponym matching. 
Type Of Technology Software 
Year Produced 2020 
Open Source License? Yes  
URL https://zenodo.org/record/3983554
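The snippet below illustrates the candidate-selection idea in its simplest form and is not the DeezyMatch API: it ranks candidate toponyms for a noisy query by character-trigram overlap, whereas DeezyMatch learns the similarity function with a neural network (see the linked record and repository for its actual interface).

def trigrams(s, n=3):
    s = "#" + s.lower() + "#"
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def similarity(a, b):
    # Jaccard overlap of character trigram sets.
    ga, gb = trigrams(a), trigrams(b)
    return len(ga & gb) / len(ga | gb) if ga | gb else 0.0

candidates = ["Lancaster", "Lancashire", "Doncaster", "Manchester"]
query = "Lancastr"  # e.g. an OCR-damaged toponym
print(sorted(candidates, key=lambda c: similarity(query, c), reverse=True))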
 
Title passt/miceandmen: Code released with manuscript. 
Description Source code related to Stumpf et al. (2020) Transfer learning from mouse to man. 
Type Of Technology Software 
Year Produced 2020 
URL https://zenodo.org/record/4105890
 