The Alan Turing Institute

Lead Research Organisation: The Alan Turing Institute
Department Name: Research

Abstract

The work of the Alan Turing Institute will enable knowledge and predictions to be extracted from large-scale and diverse digital data. It will bring together the best people, organisations and technologies in data science for the development of foundational theory, methodologies and algorithms. These will inform scientific and technological discoveries, create new business opportunities, accelerate solutions to global challenges, inform policy-making, and improve the environment, health and infrastructure of the world in an 'Age of Algorithms'.

Planned Impact

The Institute will bring together leaders in advanced mathematics and computing science from the five founding universities and other partners. Its work is expected to encompass a wide range of scientific disciplines and be relevant to a large number of business sectors.
 
Title 2020-04-01 - Data Safe Havens in the Cloud - CW20 Workshop.pptx 
Description A talk given at the SSI Collaborations Workshop in April 2020 discussing the Alan Turing Institute's "Data Safe Havens in the Cloud" project. The slides are included here. 
Type Of Art Film/Video/Animation 
Year Produced 2021 
URL https://cw20.figshare.com/articles/presentation/2020-04-01_-_Data_Safe_Havens_in_the_Cloud_-_CW20_Wo...
 
Title 34-productive-research-on-sensitive-data-using-cloud-based-secure-research-environments-james-robinson-martin-oreilly.mp4 
Description A talk given at the SSI Collaborations Workshop in April 2020 discussing the Alan Turing Institute's "Data Safe Havens in the Cloud" project. A video recording of the talk plus the subsequent Q&A are included here. 
Type Of Art Film/Video/Animation 
Year Produced 2021 
URL https://cw20.figshare.com/articles/presentation/34-productive-research-on-sensitive-data-using-cloud...
 
Title Reproducible secure research environments: Talk from Safe Data Access Professionals Quarterly Meeting on 08 June 2021 
Description Overview of the challenges of supporting reproducible research on sensitive data and how the Turing addresses these in its Safe Haven secure research environment. 
Type Of Art Film/Video/Animation 
Year Produced 2021 
URL https://figshare.com/articles/presentation/Reproducible_secure_research_environments_Talk_from_Safe_...
 
Description For Key Findings and Impact, please see our Annual Report: https://www.turing.ac.uk/about-us/annual-report-2021-22
Exploitation Route Please see our Annual Report: https://www.turing.ac.uk/about-us/annual-report-2021-22
Sectors Aerospace, Defence and Marine,Agriculture, Food and Drink,Communities and Social Services/Policy,Construction,Creative Economy,Digital/Communication/Information Technologies (including Software),Education,Energy,Environment,Financial Services, and Management Consultancy,Healthcare,Leisure Activities, including Sports, Recreation and Tourism,Government, Democracy and Justice,Manufacturing, including Industrial Biotechnology,Culture, Heritage, Museums and Collections,Pharmaceuticals and Medical Biotechnology

URL https://www.turing.ac.uk/
 
Description For Key Findings and Impact, please see our Annual Report: https://www.turing.ac.uk/about-us/annual-report-2021-22
Sector Aerospace, Defence and Marine,Agriculture, Food and Drink,Communities and Social Services/Policy,Construction,Creative Economy,Digital/Communication/Information Technologies (including Software),Energy,Environment,Financial Services, and Management Consultancy,Healthcare,Government, Democracy and Justice,Manufacturing, including Industrial Biotechnology,Culture, Heritage, Museums and Collections,Pharmaceuticals and Medical Biotechnology,Security and Diplomacy,Transport,Other
Impact Types Cultural,Societal,Economic,Policy & public services

 
Title DETOX seismic tomography models 
Description DETOX tomography models. This folder contains three tomography models, DETOX-P1, DETOX-P2 and DETOX-P3, in the following formats:
- NetCDF (dirname: grid_nc4)
- VTK (dirname: vtk)
- xyz-value (dirname: txt_tetrahedron)
- JPEG for GPLATES, high velocities only (dirname: GPLATES)
The directories are organized as follows:
DETOX-P1
+-- GPLATES
+-- grid_nc4
+-- txt_tetrahedron
+-- vtk
DETOX-P2
+-- GPLATES
+-- grid_nc4
+-- txt_tetrahedron
+-- vtk
DETOX-P3
+-- GPLATES
+-- grid_nc4
+-- txt_tetrahedron
+-- vtk
Citation: Kasra Hosseini, Karin Sigloch, Maria Tsekhmistrenko, Afsaneh Zaheri, Tarje Nissen-Meyer, Heiner Igel, Global mantle structure from multifrequency tomography using P, PP and P-diffracted waves, Geophysical Journal International, Volume 220, Issue 1, January 2020, Pages 96-141, https://doi.org/10.1093/gji/ggz394 
Type Of Material Database/Collection of data 
Year Produced 2019 
Provided To Others? Yes  
URL https://zenodo.org/record/3993275
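The NetCDF grids are convenient for quick inspection in Python. The sketch below is illustrative only: the file name and the variable/coordinate names ("dvp", "depth") are assumptions, so check the contents of the grid_nc4 directory after downloading.

import xarray as xr

# Open one of the DETOX NetCDF grids (hypothetical file name).
ds = xr.open_dataset("DETOX-P1/grid_nc4/DETOX-P1.nc")
print(ds)  # lists the dimensions, coordinates and data variables actually present

# If the grid exposes depth/latitude/longitude coordinates and a velocity
# perturbation variable under these (assumed) names, a horizontal slice could
# be plotted with:
# ds["dvp"].sel(depth=600, method="nearest").plot()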
 
Title DUKweb (Diachronic UK web) 
Description We present DUKweb, a set of large-scale resources useful for the diachronic analysis of contemporary English. The dataset is derived from the JISC UK Web Domain Dataset (1996-2013), which collects resources from the Internet Archive that were hosted on domains ending in '.uk'. The dataset includes co-occurrence matrices for each year and two types of word vectors per year: Temporal Random Indexing vectors and word2vec embeddings. 
Type Of Material Database/Collection of data 
Year Produced 2020 
Provided To Others? Yes  
URL https://bl.iro.bl.uk/work/f9ff33ab-56b7-4594-8aca-49781296c0c6
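As a pointer for prospective users, the word2vec component can typically be explored with gensim. The file name and the binary/text format below are assumptions; consult the DUKweb documentation for the actual per-year file layout.

from gensim.models import KeyedVectors

# Load one year's embeddings (hypothetical file name and format).
vectors = KeyedVectors.load_word2vec_format("dukweb_word2vec_2005.bin", binary=True)

# Nearest neighbours of a word in that year's semantic space.
print(vectors.most_similar("broadband", topn=5))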
 
Title Data supporting "GABA, not BOLD, reveals dissociable learning-dependent plasticity mechanisms in the human brain" 
Description Behavioural data. BOLD change measurements. GABA change measurements. Behavioural data under tDCS intervention. 
Type Of Material Database/Collection of data 
Year Produced 2018 
Provided To Others? Yes  
 
Title Dataset for Toponym Resolution in Nineteenth-Century English Newspapers 
Description We present a new dataset for the task of toponym resolution in digitised historical newspapers in English. It consists of 343 annotated articles from newspapers based in four different locations in England (Manchester, Ashton-under-Lyne, Poole and Dorchester), published between 1780 and 1870. The articles have been manually annotated with mentions of places, which are linked---whenever possible---to their corresponding entry on Wikipedia. The dataset is published on the British Library shared research repository, and is especially of interest to researchers working on improving semantic access to historical newspaper content. We share the 343 annotated files (one file per article) in the WebAnno TSV file format version 3.2, a CoNLL-based file format. We additionally provide a TSV file with metadata at the article level, and the annotation guidelines. 
Type Of Material Database/Collection of data 
Year Produced 2021 
Provided To Others? Yes  
URL https://bl.iro.bl.uk/concern/datasets/de43a15c-e000-4fec-8b66-7ca94ae13db3
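A minimal reading sketch, assuming the general WebAnno TSV v3.2 layout (comment and metadata lines start with '#'; token lines are tab-separated with a sentence-token id, character offsets, the token form, and then the annotation columns). The file name is a placeholder and the column semantics should be checked against the annotation guidelines shipped with the dataset.

from pathlib import Path

def read_webanno_tsv(path):
    """Yield (token_id, char_offsets, form, annotation_columns) per token line."""
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if not line or line.startswith("#"):
            continue  # skip format headers, sentence text and blank lines
        cols = line.split("\t")
        yield cols[0], cols[1], cols[2], cols[3:]

# for tok_id, offsets, form, annos in read_webanno_tsv("article_0001.tsv"):  # hypothetical file
#     print(tok_id, form, annos)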
 
Title Latin lexical semantic annotation 
Description This dataset is a collection of lexical annotations of the corpus occurrences of 40 Latin lemmas. The corpus instances are from LatinISE and the process is described in Schlechtweg et al. (2020, 2021). The annotation was coordinated by Barbara McGillivray, and done by Annie Burman, Daria Kondakova, Francesca Dell'Oro, Helena Bermudez Sabel, Hugo Burgess, Paola Marongiu, and Rozalia Dobos. The pre-annotation was coordinated and designed by Barbara McGillivray and done by Manuel Márquez Cruz. References: McGillivray, B. and Kilgarriff, A. (2013). Tools for historical corpus research, and a corpus of Latin. In Paul Bennett, Martin Durrell, Silke Scheible, Richard J. Whitt (eds.), New Methods in Historical Corpus Linguistics. Tübingen: Narr. Barbara McGillivray, Dominik Schlechtweg, Haim Dubossarsky, Nina Tahmasebi, & Simon Hengchen (2021). DWUG LA: Diachronic Word Usage Graphs for Latin [Data set]. Zenodo. https://doi.org/10.5281/zenodo.5255228. Schlechtweg, D., McGillivray, B., Hengchen, S., Dubossarsky, H., Tahmasebi, N. (2020). SemEval-2020 Task 1: Unsupervised Lexical Semantic Change Detection. In Proceedings of the Fourteenth Workshop on Semantic Evaluation, 2020. International Committee for Computational Linguistics. DOI: 10.18653/v1/2020.semeval-1.1. Schlechtweg, D., Tahmasebi, N., Hengchen, S., Dubossarsky, H., McGillivray, B. (2021). DWUG: A large Resource of Diachronic Word Usage Graphs in Four Languages. In Proceedings of EMNLP 2021. 
Type Of Material Database/Collection of data 
Year Produced 2021 
Provided To Others? Yes  
URL https://kcl.figshare.com/articles/dataset/Latin_lexical_semantic_annotation/16974823
 
Title LatinISE subcorpora for SemEval 2020 task 1 
Description This data collection contains the Latin test data for SemEval 2020 Task 1 (Unsupervised Lexical Semantic Change Detection): a Latin text corpus pair (`corpus1/`, `corpus2/`) and 40 lemmas which have been annotated for their lexical semantic change between the two corpora (`targets.txt`). The corpus data have been automatically lemmatized and part-of-speech tagged, and have been partially corrected by hand. For homonyms, the lemmas are followed by the '#' symbol and the number of the homonym according to the Lewis-Short dictionary of Latin when this number is greater than 1. For example, the lemma 'dico' corresponds to the first homonym in the Lewis-Short dictionary and 'dico#2' corresponds to the second homonym. Corpus 1: based on LatinISE (McGillivray and Kilgarriff 2013), version on Sketch Engine; language: Latin; time covered: from the beginning of the second century BC to the end of the first century BC; size: ~1.7 million tokens; format: lemmatized, sentence length >= 2, no punctuation, sentences randomly shuffled; encoding: UTF-8. Corpus 2: based on LatinISE (McGillivray and Kilgarriff 2013), version on Sketch Engine; language: Latin; time covered: from the beginning of the first century AD to the end of the twenty-first century AD; size: ~9.4 million tokens; format: lemmatized, sentence length >= 2, no punctuation, sentences randomly shuffled; encoding: UTF-8. Find more information on the data in the papers referenced below. References: Dominik Schlechtweg, Barbara McGillivray, Simon Hengchen, Haim Dubossarsky and Nina Tahmasebi. SemEval 2020 Task 1: Unsupervised Lexical Semantic Change Detection. To appear in SemEval@COLING2020. McGillivray, B. and Kilgarriff, A. (2013). Tools for historical corpus research, and a corpus of Latin. In Paul Bennett, Martin Durrell, Silke Scheible, Richard J. Whitt (eds.), New Methods in Historical Corpus Linguistics, Tübingen: Narr.
 
Type Of Material Database/Collection of data 
Year Produced 2020 
Provided To Others? Yes  
URL https://zenodo.org/record/3674988
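The homonym convention described above is easy to handle programmatically; the helper below is a small illustrative sketch (the function name is ours, not part of the dataset).

def split_homonym(lemma):
    """Return (base_lemma, homonym_number) for a LatinISE lemma such as 'dico#2'."""
    if "#" in lemma:
        base, number = lemma.split("#", 1)
        return base, int(number)
    return lemma, 1  # unmarked lemmas are the first Lewis-Short homonym

assert split_homonym("dico") == ("dico", 1)
assert split_homonym("dico#2") == ("dico", 2)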
 
Title LatinISE subcorpora for SemEval 2020 task 1 
Description This data collection contains the Latin test data for SemEval 2020 Task 1 (Unsupervised Lexical Semantic Change Detection): a Latin text corpus pair (`corpus1/lemma`, `corpus2/lemma`); 40 lemmas which have been annotated for their lexical semantic change between the two corpora (`targets.txt`); and the annotated binary change scores of the targets for subtask 1, together with their annotated graded change scores for subtask 2 (`truth/`). The corpus data have been automatically lemmatized and part-of-speech tagged, and have been partially corrected by hand. For homonyms, the lemmas are followed by the '#' symbol and the number of the homonym according to the Lewis-Short dictionary of Latin when this number is greater than 1. For example, the lemma 'dico' corresponds to the first homonym in the Lewis-Short dictionary and 'dico#2' corresponds to the second homonym. Corpus 1: based on LatinISE (McGillivray and Kilgarriff 2013), version on Sketch Engine; language: Latin; time covered: from the beginning of the second century BC to the end of the first century BC; size: ~1.7 million tokens; format: lemmatized, sentence length >= 2, no punctuation, sentences randomly shuffled; encoding: UTF-8. Corpus 2: based on LatinISE (McGillivray and Kilgarriff 2013), version on Sketch Engine; language: Latin; time covered: from the beginning of the first century AD to the end of the twenty-first century AD; size: ~9.4 million tokens; format: lemmatized, sentence length >= 2, no punctuation, sentences randomly shuffled; encoding: UTF-8. Find more information on the data in the papers referenced below. References: Dominik Schlechtweg, Barbara McGillivray, Simon Hengchen, Haim Dubossarsky and Nina Tahmasebi. SemEval 2020 Task 1: Unsupervised Lexical Semantic Change Detection. To appear in SemEval@COLING2020. McGillivray, B. and Kilgarriff, A. (2013). Tools for historical corpus research, and a corpus of Latin. In Paul Bennett, Martin Durrell, Silke Scheible, Richard J. Whitt (eds.), New Methods in Historical Corpus Linguistics, Tübingen: Narr.
 
Type Of Material Database/Collection of data 
Year Produced 2020 
Provided To Others? Yes  
URL https://zenodo.org/record/3732944
 
Title LatinISE test data for SemEval 2020 task 1 
Description This data collection contains the Latin test data for SemEval 2020 Task 1 (Unsupervised Lexical Semantic Change Detection): a Latin text corpus pair (`corpus1/lemma`, `corpus2/lemma`); 40 lemmas which have been annotated for their lexical semantic change between the two corpora (`targets.txt`); and the annotated binary change scores of the targets for subtask 1, together with their annotated graded change scores for subtask 2 (`truth/`). The corpus data have been automatically lemmatized and part-of-speech tagged, and have been partially corrected by hand. For homonyms, the lemmas are followed by the '#' symbol and the number of the homonym according to the Lewis-Short dictionary of Latin when this number is greater than 1. For example, the lemma 'dico' corresponds to the first homonym in the Lewis-Short dictionary and 'dico#2' corresponds to the second homonym. Corpus 1: based on LatinISE (McGillivray and Kilgarriff 2013), version on Sketch Engine; language: Latin; time covered: from the beginning of the second century BC to the end of the first century BC; size: ~1.7 million tokens; format: lemmatized, sentence length >= 2, no punctuation, sentences randomly shuffled; encoding: UTF-8. Corpus 2: based on LatinISE (McGillivray and Kilgarriff 2013), version on Sketch Engine; language: Latin; time covered: from the beginning of the first century AD to the end of the twenty-first century AD; size: ~9.4 million tokens; format: lemmatized, sentence length >= 2, no punctuation, sentences randomly shuffled; encoding: UTF-8. Find more information on the data in the papers referenced below. References: Dominik Schlechtweg, Barbara McGillivray, Simon Hengchen, Haim Dubossarsky and Nina Tahmasebi. SemEval 2020 Task 1: Unsupervised Lexical Semantic Change Detection. To appear in SemEval@COLING2020. McGillivray, B. and Kilgarriff, A. (2013). Tools for historical corpus research, and a corpus of Latin. In Paul Bennett, Martin Durrell, Silke Scheible, Richard J. Whitt (eds.), New Methods in Historical Corpus Linguistics, Tübingen: Narr.
 
Type Of Material Database/Collection of data 
Year Produced 2020 
Provided To Others? Yes  
URL https://zenodo.org/record/3734089
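The gold scores in `truth/` can be read with a few lines of Python. This sketch assumes the SemEval-2020 Task 1 convention of tab-separated "lemma<TAB>score" lines, with one file for the binary labels (subtask 1) and one for the graded scores (subtask 2); verify the exact file names in the downloaded archive.

import csv

def read_scores(path):
    # Each line: target lemma, a tab, then a numeric score.
    with open(path, encoding="utf-8") as handle:
        return {row[0]: float(row[1]) for row in csv.reader(handle, delimiter="\t")}

# binary = read_scores("truth/binary.txt")  # assumed file name
# graded = read_scores("truth/graded.txt")  # assumed file name
# print(sorted(graded, key=graded.get, reverse=True)[:5])  # most changed lemmas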
 
Title LatinISE test data for SemEval 2020 task 1 with additional token versions of the corpora 
Description This data collection contains the Latin test data for SemEval 2020 Task 1 (Unsupervised Lexical Semantic Change Detection): a Latin text corpus pair (`corpus1/lemma`, `corpus2/lemma`); 40 lemmas which have been annotated for their lexical semantic change between the two corpora (`targets.txt`); and the annotated binary change scores of the targets for subtask 1, together with their annotated graded change scores for subtask 2 (`truth/`). The corpus data have been automatically lemmatized and part-of-speech tagged, and have been partially corrected by hand. For homonyms, the lemmas are followed by the '#' symbol and the number of the homonym according to the Lewis-Short dictionary of Latin when this number is greater than 1. For example, the lemma 'dico' corresponds to the first homonym in the Lewis-Short dictionary and 'dico#2' corresponds to the second homonym. Corpus 1: based on LatinISE (McGillivray and Kilgarriff 2013), version on Sketch Engine; language: Latin; time covered: from the beginning of the second century BC to the end of the first century BC; size: ~1.7 million tokens; format: lemmatized, sentence length >= 2, no punctuation, sentences randomly shuffled; encoding: UTF-8. Corpus 2: based on LatinISE (McGillivray and Kilgarriff 2013), version on Sketch Engine; language: Latin; time covered: from the beginning of the first century AD to the end of the twenty-first century AD; size: ~9.4 million tokens; format: lemmatized, sentence length >= 2, no punctuation, sentences randomly shuffled; encoding: UTF-8. Besides the official lemma version of the corpora for SemEval-2020 Task 1, we also provide the raw token version (`corpus1/token/`, `corpus2/token/`), which contains the raw sentences in the same order as in the lemma version. Find more information on the data and SemEval-2020 Task 1 in the papers referenced below. The creation of the data was supported by the CRETA center and the CLARIN-D grant funded by the German Ministry for Education and Research (BMBF). References: Dominik Schlechtweg, Barbara McGillivray, Simon Hengchen, Haim Dubossarsky and Nina Tahmasebi. SemEval 2020 Task 1: Unsupervised Lexical Semantic Change Detection. To appear in SemEval@COLING2020. McGillivray, B. and Kilgarriff, A. (2013). Tools for historical corpus research, and a corpus of Latin. In Paul Bennett, Martin Durrell, Silke Scheible, Richard J. Whitt (eds.), New Methods in Historical Corpus Linguistics, Tübingen: Narr.
 
Type Of Material Database/Collection of data 
Year Produced 2020 
Provided To Others? Yes  
URL https://zenodo.org/record/3992738
 
Title Living Machines atypical animacy dataset 
Description Atypical animacy detection dataset, based on nineteenth-century sentences in English extracted from an open dataset of nineteenth-century books digitized by the British Library (available via https://doi.org/10.21250/db14, British Library Labs, 2014). This dataset contains 598 sentences containing mentions of machines. Each sentence has been annotated according to the animacy and humanness of the machine in the sentence. This dataset has been created as part of the following paper: Ardanuy, M. C., F. Nanni, K. Beelen, Kasra Hosseini, Ruth Ahnert, J. Lawrence, Katherine McDonough, Giorgia Tolfo, D. C. Wilson and B. McGillivray. "Living Machines: A study of atypical animacy." In Proceedings of the 28th International Conference on Computational Linguistics (COLING2020). 
Type Of Material Database/Collection of data 
Year Produced 2020 
Provided To Others? Yes  
URL https://bl.iro.bl.uk/work/323177af-6081-4e93-8aaf-7932ca4a390a
 
Title Monthly word embeddings for Twitter random sample (English, 2012-2018) 
Description This dataset contains monthly word embeddings created from the tweets available via the statuses/sample endpoint of the Twitter Streaming API from 2012 to 2018. Full details of the creation of the dataset are given in "Room to Glo: A Systematic Comparison of Semantic Change Detection Approaches with Word Embeddings". The md5sum of the gzipped tarball file is a76888ffec8cc7aebba09d365ca55ace. 
Type Of Material Database/Collection of data 
Year Produced 2019 
Provided To Others? Yes  
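Since the md5sum is quoted above, downloads can be verified before unpacking. The tarball name below is a placeholder for whatever file name the archive is distributed under.

import hashlib

EXPECTED_MD5 = "a76888ffec8cc7aebba09d365ca55ace"

def md5_of(path, chunk_size=1 << 20):
    # Stream the file in 1 MiB chunks to avoid loading it into memory at once.
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# print(md5_of("twitter_monthly_embeddings.tar.gz") == EXPECTED_MD5)  # hypothetical file name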
 
Title Research Data Supporting "Modelling prognostic trajectories of cognitive decline due to Alzheimer's disease" 
Description  
Type Of Material Database/Collection of data 
Year Produced 2020 
Provided To Others? Yes  
URL https://www.repository.cam.ac.uk/handle/1810/301740
 
Title Research data supporting "Multimodal imaging of brain connectivity reveals predictors of individual decision strategy in statistical learning" 
Description Behavioural data, resting-state fMRI connectivity data and graph metrics data (see supporting data description .doc file for more information) 
Type Of Material Database/Collection of data 
Year Produced 2019 
Provided To Others? Yes  
 
Title Research data supporting "White-Matter Pathways for Statistical Learning of Temporal Structures" 
Description Behavioural data and DTI connectivity data (see supporting data description .doc file for more information) 
Type Of Material Database/Collection of data 
Year Produced 2019 
Provided To Others? Yes  
 
Title Supplementary material for 'A Deep Learning Approach to Geographical Candidate Selection through Toponym Matching' 
Description Supplementary material for the https://github.com/Living-with-machines/LwM_SIGSPATIAL2020_ToponymMatching repository, containing the underlying code and materials for the paper 'A Deep Learning Approach to Geographical Candidate Selection through Toponym Matching', accepted to SIGSPATIAL2020 as a poster paper. Coll Ardanuy, M., Hosseini, K., McDonough, K., Krause, A., van Strien, D. and Nanni, F. (2020): A Deep Learning Approach to Geographical Candidate Selection through Toponym Matching, SIGSPATIAL: Poster Paper. 
Type Of Material Database/Collection of data 
Year Produced 2020 
Provided To Others? Yes  
URL https://zenodo.org/record/4034818
 
Title Visual Identification of Individual Holstein Friesian Cattle via Deep Metric Learning 
Description This dataset accompanies the paper "Visual Identification of Individual Holstein Friesian Cattle via Deep Metric Learning", available at https://arxiv.org/abs/2006.09205. It consists of two components: (a) detection and localisation, and (b) identification. For an overview of the dataset, refer to Section 3 of the paper; for any queries, contact the corresponding author. Accompanying source code is available at https://github.com/CWOA/MetricLearningIdentification 
Type Of Material Database/Collection of data 
Year Produced 2020 
Provided To Others? Yes  
URL https://data.bris.ac.uk/data/dataset/10m32xl88x2b61zlkkgz3fml17/
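For readers unfamiliar with metric-learning identification, the sketch below illustrates the general idea only (it is not the authors' pipeline): each detected animal is mapped to an embedding vector by a trained network, and identity is assigned by nearest neighbour against a gallery of labelled embeddings. Random vectors stand in for real network outputs here.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
gallery_embeddings = rng.normal(size=(200, 128))  # embeddings of known individuals
gallery_ids = rng.integers(0, 40, size=200)       # 40 individual cattle identities
query_embeddings = rng.normal(size=(5, 128))      # embeddings of new detections

knn = KNeighborsClassifier(n_neighbors=1, metric="euclidean")
knn.fit(gallery_embeddings, gallery_ids)
print(knn.predict(query_embeddings))              # predicted identity for each query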
 
Title DeezyMatch 
Description DeezyMatch: A Flexible Deep Neural Network Approach to Fuzzy String Matching. DeezyMatch can be applied to the following tasks: record linkage, candidate selection for entity linking systems, and toponym matching. 
Type Of Technology Software 
Year Produced 2020 
Open Source License? Yes  
URL https://zenodo.org/record/3983554
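The snippet below illustrates the candidate-selection idea in its simplest form and is not the DeezyMatch API: it ranks candidate toponyms for a noisy query by character-trigram overlap, whereas DeezyMatch learns the similarity function with a neural network (see the linked record and repository for its actual interface).

def trigrams(s, n=3):
    s = "#" + s.lower() + "#"
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def similarity(a, b):
    # Jaccard overlap of character trigram sets.
    ga, gb = trigrams(a), trigrams(b)
    return len(ga & gb) / len(ga | gb) if ga | gb else 0.0

candidates = ["Lancaster", "Lancashire", "Doncaster", "Manchester"]
query = "Lancastr"  # e.g. an OCR-damaged toponym
print(sorted(candidates, key=lambda c: similarity(query, c), reverse=True))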
 
Title passt/miceandmen: Code released with manuscript. 
Description Source code related to Stumpf et al. (2020) Transfer learning from mouse to man. 
Type Of Technology Software 
Year Produced 2020 
URL https://zenodo.org/record/4105890
 