The Alan Turing Institute

Lead Research Organisation: The Alan Turing Institute
Department Name: Research

Abstract

The work of the Alan Turing Institute will enable knowledge and predictions to be extracted from large-scale and diverse digital data. It will bring together the best people, organisations and technologies in data science for the development of foundational theory, methodologies and algorithms. These will inform scientific and technological discoveries, create new business opportunities, accelerate solutions to global challenges, inform policy-making, and improve the environment, health and infrastructure of the world in an 'Age of Algorithms'.

Planned Impact

The Institute will bring together leaders in advanced mathematics and computing science from the five founding universities and other partners. Its work is expected to encompass a wide range of scientific disciplines and be relevant to a large number of business sectors.

Publications

Abbaszadeh M (2018) Uncertainty Quantification in Molecular Signals Using Polynomial Chaos Expansion in IEEE Transactions on Molecular, Biological and Multi-Scale Communications

Abboud R (2022) Approximate weighted model integration on DNF structures in Artificial Intelligence

Abboud R (2020) Learning to Reason: Leveraging Neural Networks for Approximate DNF Counting in Proceedings of the AAAI Conference on Artificial Intelligence

Abboud R. (2020) On the approximability of weighted model integration on DNF structures in 17th International Conference on Principles of Knowledge Representation and Reasoning, KR 2020

Abboud R. (2020) Learning to reason: Leveraging neural networks for approximate dnf counting in AAAI 2020 - 34th AAAI Conference on Artificial Intelligence

Abboud R. (2020) BoxE: A box embedding model for knowledge base completion in Advances in Neural Information Processing Systems

Abboud R. (2021) The Surprising Power of Graph Neural Networks with Random Node Initialization in IJCAI International Joint Conference on Artificial Intelligence

 
Title 2020-04-01 - Data Safe Havens in the Cloud - CW20 Workshop.pptx 
Description A talk given at the SSI Collaborations Workshop in April 2020 discussing the Alan Turing Institute's "Data Safe Havens in the Cloud" project. The slides are included here. 
Type Of Art Film/Video/Animation 
Year Produced 2021 
URL https://cw20.figshare.com/articles/presentation/2020-04-01_-_Data_Safe_Havens_in_the_Cloud_-_CW20_Wo...
 
Title 300 Training Scenarios [with influence of sediments] from Probabilistic quantification of tsunami current hazard using statistical emulation 
Description Animations showing the exact finite-fault configuration, slip profile, and seabed deformation for the 300 cases (with the influence of sediments) used in the work. They also portray how the Latin hypercube sampling sweeps through the input parameter space, and the manner in which the dimensions of the 300 sources are scaled with respect to the scaling relation. 
Type Of Art Film/Video/Animation 
Year Produced 2021 
URL https://rs.figshare.com/articles/media/300_Training_Scenarios_with_influence_of_sediments_from_Proba...
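Note: the Latin hypercube design mentioned above can be illustrated with a minimal Python sketch. The parameter names and ranges below are placeholders for illustration only, not the source parameters used in the study.

# Minimal sketch of a Latin hypercube design over an earthquake-source
# parameter space; parameter names and ranges are illustrative placeholders.
import numpy as np
from scipy.stats import qmc

# Hypothetical inputs: magnitude, centroid depth (km), strike (deg)
l_bounds = [7.5, 5.0, 0.0]
u_bounds = [9.0, 40.0, 360.0]

sampler = qmc.LatinHypercube(d=3, seed=0)
unit_sample = sampler.random(n=300)             # 300 points in the unit cube
scenarios = qmc.scale(unit_sample, l_bounds, u_bounds)

# Each row is one training scenario; each parameter range is swept evenly.
print(scenarios.shape)          # (300, 3)
print(scenarios[:3].round(2))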
 
Title 300 Training Scenarios [without influence of sediments] from Probabilistic quantification of tsunami current hazard using statistical emulation 
Description Animation showing the exact finite-fault configuration, slip profile, and seabed deformation for the 300 cases (without the influence of sediments). Provided only to visually compare with corresponding scenarios that incorporate the influence of sediments. 
Type Of Art Film/Video/Animation 
Year Produced 2021 
URL https://rs.figshare.com/articles/media/300_Training_Scenarios_without_influence_of_sediments_from_Pr...
 
Title 34-productive-research-on-sensitive-data-using-cloud-based-secure-research-environments-james-robinson-martin-oreilly.mp4 
Description A talk given at the SSI Collaborations Workshop in April 2020 discussing the Alan Turing Institute's "Data Safe Havens in the Cloud" project. A video recording of the talk plus subsequent Q&A are included here. 
Type Of Art Film/Video/Animation 
Year Produced 2021 
URL https://cw20.figshare.com/articles/presentation/34-productive-research-on-sensitive-data-using-cloud...
 
Title Reproducible secure research environments: Talk from Safe Data Access Professionals Quarterly Meeting on 08 June 2021 
Description Overview of the challenges of supporting reproducible research on sensitive data and how the Turing addresses these in its Safe Haven secure research environment. 
Type Of Art Film/Video/Animation 
Year Produced 2021 
URL https://figshare.com/articles/presentation/Reproducible_secure_research_environments_Talk_from_Safe_...
 
Description For Key Findings and Impact, please see our Annual Report: https://www.turing.ac.uk/about-us/annual-report-2022-23
Exploitation Route Please see our Annual Report: https://www.turing.ac.uk/about-us/annual-report-2022-23
Sectors Aerospace, Defence and Marine

Agriculture, Food and Drink

Communities and Social Services/Policy

Construction

Creative Economy

Digital/Communication/Information Technologies (including Software)

Education

Energy

Environment

Financial Services, and Management Consultancy

Healthcare

Leisure Activities, including Sports, Recreation and Tourism

Government, Democracy and Justice

Manufacturing, including Industrial Biotechnology

Culture, Heritage, Museums and Collections

Pharmaceuticals and Medical Biotechnology

URL https://www.turing.ac.uk/
 
Description For Key Findings and Impact, please see our Annual Report: https://www.turing.ac.uk/about-us/annual-report-2022-23
Sector Aerospace, Defence and Marine; Agriculture, Food and Drink; Communities and Social Services/Policy; Construction; Creative Economy; Digital/Communication/Information Technologies (including Software); Energy; Environment; Financial Services, and Management Consultancy; Healthcare; Government, Democracy and Justice; Manufacturing, including Industrial Biotechnology; Culture, Heritage, Museums and Collections; Pharmaceuticals and Medical Biotechnology; Security and Diplomacy; Transport; Other
Impact Types Cultural

Societal

Economic

Policy & public services

 
Title A Statistical Approach to Surface Metrology for 3D-Printed Stainless Steel 
Description Surface metrology is the area of engineering concerned with the study of geometric variation in surfaces. This article explores the potential for modern techniques from spatial statistics to act as generative models for geometric variation in 3D-printed stainless steel. The complex macro-scale geometries of 3D-printed components pose a challenge that is not present in traditional surface metrology, as the training data and test data need not be defined on the same manifold. Strikingly, a covariance function defined in terms of geodesic distance on one manifold can fail to satisfy positive-definiteness and thus fail to be a valid covariance function in the context of a different manifold; this hinders the use of standard techniques that aim to learn a covariance function from a training dataset. On the other hand, the associated covariance differential operators are locally defined. This article proposes to perform inference for such differential operators, facilitating generalization from the manifold of a training dataset to the manifold of a test dataset. The approach is assessed in the context of model selection and explored in detail in the context of a finite element model for 3D-printed stainless steel. 
Type Of Material Database/Collection of data 
Year Produced 2022 
Provided To Others? Yes  
URL https://tandf.figshare.com/articles/dataset/A_Statistical_Approach_to_Surface_Metrology_for_3D-Print...
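Note: the positive-definiteness issue described above can be checked numerically. The following is a minimal sketch (not the authors' code) using points on a circle and an illustrative lengthscale; it only reports the minimum eigenvalue of the resulting kernel matrix.

# Minimal sketch: test whether a squared-exponential "covariance" built from
# geodesic distance on a manifold (here, points on a circle) is positive
# semi-definite; a negative minimum eigenvalue would indicate it is not a
# valid covariance function on that manifold.
import numpy as np

n = 200
theta = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)

# Geodesic (arc-length) distance between points on the unit circle.
diff = np.abs(theta[:, None] - theta[None, :])
d_geo = np.minimum(diff, 2.0 * np.pi - diff)

lengthscale = 1.5   # illustrative value, not taken from the study
K = np.exp(-0.5 * (d_geo / lengthscale) ** 2)

min_eig = np.linalg.eigvalsh(K).min()
print(f"minimum eigenvalue: {min_eig:.3e}")   # < 0 => not positive semi-definite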
 
Title A Statistical Approach to Surface Metrology for 3D-Printed Stainless Steel 
Description Surface metrology is the area of engineering concerned with the study of geometric variation in surfaces. This paper explores the potential for modern techniques from spatial statistics to act as generative models for geometric variation in 3D-printed stainless steel. The complex macro-scale geometries of 3D-printed components pose a challenge that is not present in traditional surface metrology, as the training data and test data need not be defined on the same manifold. Strikingly, a covariance function defined in terms of geodesic distance on one manifold can fail to satisfy positive-definiteness and thus fail to be a valid covariance function in the context of a different manifold; this hinders the use of standard techniques that aim to learn a covariance function from a training dataset. On the other hand, the associated covariance differential operators are locally defined. This paper proposes to perform inference for such differential operators, facilitating generalisation from the manifold of a training dataset to the manifold of a test dataset. The approach is assessed in the context of model selection and explored in detail in the context of a finite element model for 3D-printed stainless steel. 
Type Of Material Database/Collection of data 
Year Produced 2021 
Provided To Others? Yes  
URL https://tandf.figshare.com/articles/dataset/A_Statistical_Approach_to_Surface_Metrology_for_3D-Print...
 
Title Additional file 2 of A systematic review of natural language processing applied to radiology reports 
Description Additional file 2. Individual properties for every publication. 
Type Of Material Database/Collection of data 
Year Produced 2021 
Provided To Others? Yes  
URL https://springernature.figshare.com/articles/dataset/Additional_file_2_of_A_systematic_review_of_nat...
 
Title Additional file 2 of Algorithmic hospital catchment area estimation using label propagation 
Description Additional file 2 Supplementary data - surge capacity estimates. A curated data set of estimated acute and ITU bed capacity in the NHS and private hospitals at the start of the pandemic, in England, Wales, and Scotland. 
Type Of Material Database/Collection of data 
Year Produced 2022 
Provided To Others? Yes  
URL https://springernature.figshare.com/articles/dataset/Additional_file_2_of_Algorithmic_hospital_catch...
 
Title DETOX seismic tomography models 
Description -----------------------
DETOX tomography models
----------------------- This folder contains three tomography models, DETOX-P1, DETOX-P2 and DETOX-P3, in the following formats: - NetCDF (dirname: grid_nc4)
- VTK (dirname: vtk)
- xyz-value (dirname: txt_tetrahedron)
- JPEG for GPLATES, high velocities only (dirname: GPLATES) The directories are organized as follows:

DETOX-P1
+-- GPLATES
+-- grid_nc4
+-- txt_tetrahedron
+-- vtk
DETOX-P2
+-- GPLATES
+-- grid_nc4
+-- txt_tetrahedron
+-- vtk
DETOX-P3
+-- GPLATES
+-- grid_nc4
+-- txt_tetrahedron
+-- vtk
Citation: Kasra Hosseini, Karin Sigloch, Maria Tsekhmistrenko, Afsaneh Zaheri, Tarje Nissen-Meyer, Heiner Igel, Global mantle structure from multifrequency tomography using P, PP and P-diffracted waves, Geophysical Journal International, Volume 220, Issue 1, January 2020, Pages 96-141, https://doi.org/10.1093/gji/ggz394 
Type Of Material Database/Collection of data 
Year Produced 2019 
Provided To Others? Yes  
URL https://zenodo.org/record/3993275
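Note: the NetCDF grids (grid_nc4 directory) can be inspected with standard tools. A minimal sketch follows; the filename below is hypothetical and the actual variable names should be read from the file itself.

# Minimal sketch for inspecting one of the DETOX NetCDF grids.
import xarray as xr

ds = xr.open_dataset("DETOX-P1/grid_nc4/DETOX-P1.nc")   # hypothetical path
print(ds)                     # dimensions, coordinates and data variables
print(list(ds.data_vars))     # e.g. the velocity-anomaly field(s)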
 
Title DETOX seismic tomography models 
Description -----------------------
DETOX tomography models
----------------------- This folder contains three tomography models, DETOX-P1, DETOX-P2 and DETOX-P3, in the following formats: - NetCDF (dirname: grid_nc4)
- VTK (dirname: vtk)
- xyz-value (dirname: txt_tetrahedron)
- JPEG for GPLATES, high velocities only (dirname: GPLATES) The directories are organized as follows:

DETOX-P1
+-- GPLATES
+-- grid_nc4
+-- txt_tetrahedron
+-- vtk
DETOX-P2
+-- GPLATES
+-- grid_nc4
+-- txt_tetrahedron
+-- vtk
DETOX-P3
+-- GPLATES
+-- grid_nc4
+-- txt_tetrahedron
+-- vtk
Citation: Kasra Hosseini, Karin Sigloch, Maria Tsekhmistrenko, Afsaneh Zaheri, Tarje Nissen-Meyer, Heiner Igel, Global mantle structure from multifrequency tomography using P, PP and P-diffracted waves, Geophysical Journal International, Volume 220, Issue 1, January 2020, Pages 96-141, https://doi.org/10.1093/gji/ggz394 
Type Of Material Database/Collection of data 
Year Produced 2019 
Provided To Others? Yes  
URL https://zenodo.org/record/3993276
 
Title DUKweb (Diachronic UK web) 
Description We present DUKweb, a set of large-scale resources useful for the diachronic analysis of contemporary English. The dataset is derived from the JISC UK Web Domain Dataset (1996-2013), which collects resources from the Internet Archive that were hosted on domains ending in '.uk'. The dataset includes co-occurrence matrices for each year and two types of word vectors by year: Temporal Random Indexing vectors and word2vec embeddings. 
Type Of Material Database/Collection of data 
Year Produced 2020 
Provided To Others? Yes  
URL https://bl.iro.bl.uk/work/f9ff33ab-56b7-4594-8aca-49781296c0c6
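Note: a typical use of the yearly word2vec embeddings is comparing a word's nearest neighbours across years. The sketch below assumes plain-text word2vec files with hypothetical names; the actual file names and format depend on how the DUKweb archive is unpacked.

# Minimal sketch: compare a word's neighbours in two yearly DUKweb models.
from gensim.models import KeyedVectors

kv_2000 = KeyedVectors.load_word2vec_format("dukweb_2000.w2v", binary=False)  # hypothetical
kv_2013 = KeyedVectors.load_word2vec_format("dukweb_2013.w2v", binary=False)  # hypothetical

word = "tweet"
print(kv_2000.most_similar(word, topn=5))   # neighbours in the 2000 model
print(kv_2013.most_similar(word, topn=5))   # neighbours in the 2013 model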
 
Title Data From: Analysing social media forums to discover potential causes of phasic shifts in cryptocurrency price series 
Description The recent extreme volatility in cryptocurrency prices occurred in the setting of popular social media forums devoted to the discussion of cryptocurrencies. We develop a framework that discovers potential causes of phasic shifts in the price movement captured by social media discussions. This draws on principles developed in healthcare epidemiology where, similarly, only observational data are available. Such causes may have a major, one-off effect or recurring effects on the trend in the price series. We find a one-off effect of regulatory bans on bitcoin, the repeated effects of rival innovations on ether and the influence of technical traders, captured through discussion of market price, on both cryptocurrencies. The results for Bitcoin differ from Ethereum, which is consistent with the observed differences in the timing of the highest price and the price phases. This framework could be applied to a wide range of cryptocurrency price series where there exists a relevant social media text source. Identified causes with a recurring effect may have value in predictive modelling, whilst one-off causes may provide insight into unpredictable black swan events that can have a major impact on a system. 
Type Of Material Database/Collection of data 
Year Produced 2019 
Provided To Others? Yes  
URL https://datadryad.org/stash/dataset/doi:10.5061/dryad.q2bvq83f6
 
Title Data from: Global network centrality of university rankings 
Description Universities and higher education institutions form an integral part of the national infrastructure and prestige. As academic research benefits increasingly from international exchange and cooperation, many universities have increased investment in improving and enabling their global connectivity. Yet the relationship between university performance and global physical connectedness has not been explored in detail. We conduct the first large-scale data-driven analysis into whether there is a correlation between a university's relative ranking performance and its global connectivity via the air transport network. The results show that local access to global hubs (as measured by air transport network betweenness) strongly and positively correlates with ranking growth (statistical significance in different models ranges between the 5% and 1% level). We also show that the local airport's aggregate flight paths (degree) and capacity (weighted degree) have no effect on university ranking, further showing that global connectivity distance is more important than the capacity of flight connections. We also examined the effect of local city economic development as a confounding variable and observed no effect, suggesting that access to global transportation hubs outweighs economic performance as a determinant of university ranking. The impact of this research is that we have determined the importance of the centrality of global connectivity and, hence, established initial evidence for further exploring potential connections between university ranking and regional investment policies on improving global connectivity. 
Type Of Material Database/Collection of data 
Year Produced 2017 
Provided To Others? Yes  
URL https://datadryad.org/stash/dataset/doi:10.5061/dryad.fv5mn
 
Title Data from: Social capital predicts corruption risk in towns 
Description Corruption is a social plague: gains accrue to small groups, while its costs are borne by everyone. Significant variation in its level between and within countries suggests a relationship between social structure and the prevalence of corruption, yet large-scale empirical studies thereof have been missing due to lack of data. In this paper, we relate the structural characteristics of social capital of settlements to corruption in their local governments. Using datasets from Hungary, we quantify corruption risk by suppressed competition and lack of transparency in the settlement's awarded public contracts. We characterize social capital using social network data from a popular online platform. Controlling for social, economic and political factors, we find that settlements with fragmented social networks, indicating an excess of bonding social capital, have higher corruption risk, and settlements with more diverse external connectivity, suggesting a surplus of bridging social capital, are less exposed to corruption. We interpret fragmentation as fostering in-group favouritism and conformity, which increase corruption, while diversity facilitates impartiality in public life and stifles corruption. 
Type Of Material Database/Collection of data 
Year Produced 2019 
Provided To Others? Yes  
URL https://datadryad.org/stash/dataset/doi:10.5061/dryad.jb48dg0
 
Title Data from: Using deep learning to quantify the beauty of outdoor places 
Description Beautiful outdoor locations are protected by governments and have recently been shown to be associated with better health. But what makes an outdoor space beautiful? Does a beautiful outdoor location differ from an outdoor location that is simply natural? Here, we explore whether ratings of over 200 000 images of Great Britain from the online game Scenic-Or-Not, combined with hundreds of image features extracted using the Places Convolutional Neural Network, might help us understand what beautiful outdoor spaces are composed of. We discover that, as well as natural features such as 'Coast', 'Mountain' and 'Canal Natural', man-made structures such as 'Tower', 'Castle' and 'Viaduct' lead to places being considered more scenic. Importantly, while scenes containing 'Trees' tend to rate highly, places containing more bland natural green features such as 'Grass' and 'Athletic Fields' are considered less scenic. We also find that a neural network can be trained to automatically identify scenic places, and that this network highlights both natural and built locations. Our findings demonstrate how online data combined with neural networks can provide a deeper understanding of what environments we might find beautiful and offer quantitative insights for policymakers charged with design and protection of our built and natural environments. 
Type Of Material Database/Collection of data 
Year Produced 2017 
Provided To Others? Yes  
URL https://datadryad.org/stash/dataset/doi:10.5061/dryad.rq4s3
 
Title Data supporting "GABA, not BOLD, reveals dissociable learning-dependent plasticity mechanisms in the human brain" 
Description Behavioural data. BOLD change measurements. GABA change measurements. Behavioural data under tDCS intervention. 
Type Of Material Database/Collection of data 
Year Produced 2018 
Provided To Others? Yes  
 
Title Data supporting NSPN publication "Structural covariance networks are coupled to expression of genes enriched in supragranular layers of the human cortex " Neuroimage 
Description Gene expression matrices for the Desikan-Killiany atlas (68 regions) and the high-resolution parcellation that includes 308 regions of approximately 500 mm² each. 
Type Of Material Database/Collection of data 
Year Produced 2018 
Provided To Others? Yes  
URL https://www.repository.cam.ac.uk/handle/1810/274101
 
Title Dataset for Toponym Resolution in Nineteenth-Century English Newspapers 
Description We present a new dataset for the task of toponym resolution in digitised historical newspapers in English. It consists of 343 annotated articles from newspapers based in four different locations in England (Manchester, Ashton-under-Lyne, Poole and Dorchester), published between 1780 and 1870. The articles have been manually annotated with mentions of places, which are linked---whenever possible---to their corresponding entry on Wikipedia. The dataset is published on the British Library shared research repository, and is especially of interest to researchers working on improving semantic access to historical newspaper content. We share the 343 annotated files (one file per article) in the WebAnno TSV file format version 3.2, a CoNLL-based file format. We additionally provide a TSV file with metadata at the article level, and the annotation guidelines. 
Type Of Material Database/Collection of data 
Year Produced 2021 
Provided To Others? Yes  
URL https://bl.iro.bl.uk/concern/datasets/de43a15c-e000-4fec-8b66-7ca94ae13db3
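Note: the annotations are distributed as WebAnno TSV v3.2 files plus an article-level metadata TSV. The sketch below is a minimal, hypothetical reader (file and directory names are placeholders); in WebAnno TSV, lines starting with '#' carry format headers and sentence text, and the remaining lines are tab-separated token-level annotations.

# Minimal sketch for loading the article-level metadata and counting annotated
# tokens per article; file and directory names are hypothetical.
import glob
import pandas as pd

metadata = pd.read_csv("metadata.tsv", sep="\t")        # hypothetical filename
print(metadata.head())

for path in glob.glob("annotated_tsv/*.tsv"):           # hypothetical directory
    with open(path, encoding="utf-8") as fh:
        rows = [line.rstrip("\n").split("\t")
                for line in fh if line.strip() and not line.startswith("#")]
    print(path, len(rows), "annotated tokens")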
 
Title Dataset for: Learning Direct Optimization for Scene Understanding 
Description The dataset consists of a large number of realistic synthetic images that feature a number of objects on a table-top, of three classes: staplers, mugs and bananas. These are taken under a variety of lighting, viewpoint and object configuration conditions. In addition, the dataset includes a set of annotated real images that were manually taken to feature a number of objects of the considered classes. The dataset includes over 22000 realistic synthetic images that can be used for training and testing, and 135 annotated real images for testing. All datasets include object annotations and their masks. Image resolution is 256 x 256. Synthetic datasets include all the latent variables of the 3D scene (scene graph). The synthetic scenes were rendered using the Blender software: www.blender.org. For each object its associated latent variables are its position, scaling factor, azimuthal rotation, shape (1-of-K encoding) and colour (RGB). The ground plane has a random RGB colour. The camera is taken to be at a random height above the origin and to be looking down with a random angle of elevation. The illumination model is uniform lighting plus a directional source (specified by the strength, azimuth and elevation of the source). Real dataset: for each object we annotated its class, instance mask, and the contact point using the LabelMe software. 
Type Of Material Database/Collection of data 
Year Produced 2020 
Provided To Others? Yes  
URL https://data.mendeley.com/datasets/gr62b6d33h
 
Title Dataset for: Learning Direct Optimization for Scene Understanding 
Description The dataset consists of a large number of realistic synthetic images that feature a number of objects on a table-top, of three classes: staplers, mugs and bananas. These are taken under a variety of lighting, viewpoint and object configuration conditions. In addition, the dataset includes a set of annotated real images that were manually taken to feature a number of objects of the considered classes. The dataset includes over 22000 realistic synthetic images that can be used for training and testing, and 135 annotated real images for testing. All datasets include object annotations and their masks. Image resolution is 256 x 256. Synthetic datasets include all the latent variables of the 3D scene (scene graph). The synthetic scenes were rendered using the Blender software: www.blender.org. For each object its associated latent variables are its position, scaling factor, azimuthal rotation, shape (1-of-K encoding) and colour (RGB). The ground plane has a random RGB colour. The camera is taken to be at a random height above the origin and to be looking down with a random angle of elevation. The illumination model is uniform lighting plus a directional source (specified by the strength, azimuth and elevation of the source). Real dataset: for each object we annotated its class, instance mask, and the contact point using the LabelMe software. 
Type Of Material Database/Collection of data 
Year Produced 2020 
Provided To Others? Yes  
URL https://data.mendeley.com/datasets/gr62b6d33h/1
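Note: the per-object latent variables listed in the description can be represented with a simple container, sketched below. The field names are illustrative, not identifiers used by the dataset itself.

# Minimal sketch of the per-object scene-graph latents described above.
from dataclasses import dataclass
from typing import Tuple

@dataclass
class ObjectLatents:
    position: Tuple[float, float, float]    # location on the table-top
    scale: float                            # scaling factor
    azimuth: float                          # azimuthal rotation (radians)
    shape_onehot: Tuple[int, ...]           # 1-of-K shape encoding
    colour_rgb: Tuple[float, float, float]  # object colour

mug = ObjectLatents(position=(0.1, -0.2, 0.0), scale=1.0, azimuth=0.7,
                    shape_onehot=(0, 1, 0), colour_rgb=(0.8, 0.1, 0.1))
print(mug)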
 
Title Faster indicators of chikungunya incidence using Google searches 
Description Data underlying: Miller, S., Preis, T., Mizzi, G., Bastos, L. S., Gomes, M. F. d. C., Coelho, F. C., Codeço, C. T., & Moat, H. S. (2022). Faster indicators of chikungunya incidence using Google searches. PLOS Neglected Tropical Diseases, 16, e0010441. doi:10.1371/journal.pntd.0010441.
MillerEtAl_ChikungunyaCaseCountData.csv: weekly chikungunya case counts in the city of Rio de Janeiro, aggregated by the week in which the case was first diagnosed (the notification week) and the delay in number of weeks in entering the case in the surveillance system. Columns: notification_week_commencing (the start date of the epidemiological week in which cases were notified); notification_week (the epidemiological week in which cases were notified); delay_in_weeks (the delay in number of weeks in entering the cases in the surveillance system); case_count (the number of cases notified in the specified week with the specified delay in number of weeks).
MillerEtAl_Fig1A.csv: the data underlying Fig. 1A. Columns: pct_entered (the percentage of cases notified in the specified epidemiological week that had been entered by the end of the week commencing 26 May 2019); notification_week_commencing (the start date of the epidemiological week in which cases were notified); notified_cases (the number of cases notified in the specified epidemiological week); entered_cases (the number of cases notified in the specified epidemiological week and entered by the end of the week commencing 26 May 2019).
MillerEtAl_Fig1B.csv: the data underlying Fig. 1B. Columns as for Fig. 1A, but with entry measured by the end of the week commencing 21 July 2019.
MillerEtAl_Fig1C.csv: the data underlying Fig. 1C. Columns as for Fig. 1A, but with entry measured by the end of the week commencing 15 September 2019.
MillerEtAl_Fig2A.csv: the data underlying Fig. 2A. Columns: notification_week_commencing (the start date of the epidemiological week in which cases were notified); notified_cases (the number of cases notified in the specified epidemiological week); entered_cases (the number of cases notified in the specified epidemiological week and entered by the end of the same week).
MillerEtAl_Fig3FigS1A.csv: the data underlying Fig. 3 in the main text and Fig. A in S1 Appendix. Columns: notification_week_commencing (the start date of the epidemiological week in which cases were notified); notification_week (the epidemiological week in which cases were notified); notified_cases (the number of cases notified in the specified epidemiological week); baseline_mean (the baseline nowcasting model's mean estimate of the number of cases notified in the specified epidemiological week); baseline_2.5 and baseline_97.5 (the lower and upper bounds of the baseline nowcasting model's 95% prediction interval); baseline_in_interval (whether the true number of notified cases fell within the baseline nowcasting model's 95% prediction interval); baseline_error (the difference between the baseline nowcasting model's mean estimate and the true number of notified cases); baseline_interval_width (the size of the baseline nowcasting model's 95% prediction interval); google_mean (the mean estimate produced by the nowcasting model using Google searches); google_2.5 and google_97.5 (the lower and upper bounds of the 95% prediction interval produced by the nowcasting model using Google searches); google_in_interval (whether the true number of notified cases fell within the 95% prediction interval produced by the nowcasting model using Google searches); google_error (the difference between the mean estimate produced by the nowcasting model using Google searches and the true number of notified cases); google_interval_width (the size of the 95% prediction interval produced by the nowcasting model using Google searches); heuristic (the heuristic model's estimate of the number of cases notified in the specified epidemiological week). 
Type Of Material Database/Collection of data 
Year Produced 2022 
Provided To Others? Yes  
URL https://figshare.com/articles/dataset/Faster_indicators_of_chikungunya_incidence_using_Google_search...
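Note: using the column names documented above, the raw case-count file can be aggregated into weekly totals with a few lines of Python. This is a minimal sketch, not the authors' analysis code.

# Minimal sketch: total weekly case counts, collapsing the reporting-delay
# dimension of MillerEtAl_ChikungunyaCaseCountData.csv.
import pandas as pd

cases = pd.read_csv("MillerEtAl_ChikungunyaCaseCountData.csv",
                    parse_dates=["notification_week_commencing"])

weekly_totals = (cases.groupby("notification_week_commencing")["case_count"]
                      .sum()
                      .sort_index())
print(weekly_totals.head())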
 
Title Generalised Bayesian Inference for Discrete Intractable Likelihood 
Description Discrete state spaces represent a major computational challenge to statistical inference, since the computation of normalisation constants requires summation over large or possibly infinite sets, which can be impractical. This paper addresses this computational challenge through the development of a novel generalised Bayesian inference procedure suitable for discrete intractable likelihood. Inspired by recent methodological advances for continuous data, the main idea is to update beliefs about model parameters using a discrete Fisher divergence, in lieu of the problematic intractable likelihood. The result is a generalised posterior that can be sampled from using standard computational tools, such as Markov chain Monte Carlo, circumventing the intractable normalising constant. The statistical properties of the generalised posterior are analysed, with sufficient conditions for posterior consistency and asymptotic normality established. In addition, a novel and general approach to calibration of generalised posteriors is proposed. Applications are presented on lattice models for discrete spatial data and on multivariate models for count data, where in each case the methodology facilitates generalised Bayesian inference at low computational cost. 
Type Of Material Database/Collection of data 
Year Produced 2023 
Provided To Others? Yes  
URL https://tandf.figshare.com/articles/dataset/Generalised_Bayesian_Inference_for_Discrete_Intractable_...
 
Title Generalised Bayesian Inference for Discrete Intractable Likelihood 
Description Discrete state spaces represent a major computational challenge to statistical inference, since the computation of normalisation constants requires summation over large or possibly infinite sets, which can be impractical. This paper addresses this computational challenge through the development of a novel generalised Bayesian inference procedure suitable for discrete intractable likelihood. Inspired by recent methodological advances for continuous data, the main idea is to update beliefs about model parameters using a discrete Fisher divergence, in lieu of the problematic intractable likelihood. The result is a generalised posterior that can be sampled from using standard computational tools, such as Markov chain Monte Carlo, circumventing the intractable normalising constant. The statistical properties of the generalised posterior are analysed, with sufficient conditions for posterior consistency and asymptotic normality established. In addition, a novel and general approach to calibration of generalised posteriors is proposed. Applications are presented on lattice models for discrete spatial data and on multivariate models for count data, where in each case the methodology facilitates generalised Bayesian inference at low computational cost. 
Type Of Material Database/Collection of data 
Year Produced 2023 
Provided To Others? Yes  
URL https://tandf.figshare.com/articles/dataset/Generalised_Bayesian_Inference_for_Discrete_Intractable_...
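Note: the generalised Bayesian update described above replaces the intractable likelihood with a divergence-based loss. The sketch below is a generic random-walk Metropolis sampler targeting prior(theta) * exp(-beta * loss(theta)); the loss function is a placeholder, not the paper's discrete Fisher divergence, and the data and prior are illustrative.

# Minimal sketch of generalised Bayesian inference by MCMC on a loss-based
# posterior; loss() is a placeholder discrepancy, not the authors' method.
import numpy as np

rng = np.random.default_rng(0)
data = rng.poisson(lam=4.0, size=50)          # illustrative count data

def loss(theta, x=data):
    # Placeholder discrepancy between model parameter (log-rate) and data.
    return np.sum((x - np.exp(theta)) ** 2) / (2.0 * np.var(x))

def log_prior(theta):
    return -0.5 * theta ** 2                  # standard normal prior

beta = 1.0                                    # calibration weight
theta, samples = 0.0, []
for _ in range(5000):
    prop = theta + 0.1 * rng.standard_normal()
    log_accept = (log_prior(prop) - beta * loss(prop)) - \
                 (log_prior(theta) - beta * loss(theta))
    if np.log(rng.uniform()) < log_accept:
        theta = prop
    samples.append(theta)

print("posterior mean of theta:", np.mean(samples[1000:]))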
 
Title Geochemical data from volcanic rocks drilled in deep sea drilling project (DSDP) leg 81 site/hole 555 in the Rockall Plateau of the northeast Atlantic Ocean 
Description NdData.csv 143Nd/144Nd and associated eNd measurements of tuffs, lavas and hyaloclastites from DSDP Leg 81 Site 555. The sample ID number includes the site number (555), core box reference (e.g., 65-1), and the depth from the top of a given core (in cm). The 143Nd/144Nd ratios and associated eNd values are corrected to an age of 55 Ma. Also provided are published 143Nd/144Nd and associated eNd measurements from Site 555 lavas (from Macintyre and Hamilton, 1984). Errors on discrete measurements are 2 and 1 standard error (SE). XRF_PETM.csv Analysis of major and trace element compositions of volcanic tuffs from DSDP Site 555 in the northeast Atlantic. Note that Mg# = 100 x molecular MgO/(MgO + FeO), where FeO is assumed to be 0.9FeOT. 
Type Of Material Database/Collection of data 
Year Produced 2022 
Provided To Others? Yes  
URL https://figshare.com/articles/dataset/Geochemical_data_from_volcanic_rocks_drilled_in_deep_sea_drill...
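Note: the Mg# definition quoted above (Mg# = 100 x molar MgO/(MgO + FeO), with FeO assumed to be 0.9 FeOT) translates directly into a small helper; the oxide concentrations in the example call are illustrative, not values from the dataset.

# Minimal sketch of the Mg# calculation stated in the description.
MGO_MOLAR_MASS = 40.304   # g/mol
FEO_MOLAR_MASS = 71.844   # g/mol

def magnesium_number(mgo_wt_pct: float, feo_total_wt_pct: float) -> float:
    feo_wt_pct = 0.9 * feo_total_wt_pct          # FeO assumed to be 0.9 * FeO_T
    mgo_mol = mgo_wt_pct / MGO_MOLAR_MASS
    feo_mol = feo_wt_pct / FEO_MOLAR_MASS
    return 100.0 * mgo_mol / (mgo_mol + feo_mol)

print(round(magnesium_number(7.5, 10.2), 1))     # illustrative composition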
 
Title Global Consensus Monte Carlo 
Description To conduct Bayesian inference with large datasets, it is often convenient or necessary to distribute the data across multiple machines. We consider a likelihood function expressed as a product of terms, each associated with a subset of the data. Inspired by global variable consensus optimization, we introduce an instrumental hierarchical model associating auxiliary statistical parameters with each term, which are conditionally independent given the top-level parameters. One of these top-level parameters controls the unconditional strength of association between the auxiliary parameters. This model leads to a distributed MCMC algorithm on an extended state space yielding approximations of posterior expectations. A trade-off between computational tractability and fidelity to the original model can be controlled by changing the association strength in the instrumental model. We further propose the use of an SMC sampler with a sequence of association strengths, allowing both the automatic determination of appropriate strengths and for a bias correction technique to be applied. In contrast to similar distributed Monte Carlo algorithms, this approach requires few distributional assumptions. The performance of the algorithms is illustrated with a number of simulated examples. Supplementary materials for this article are available online. 
Type Of Material Database/Collection of data 
Year Produced 2020 
Provided To Others? Yes  
URL https://tandf.figshare.com/articles/dataset/Global_Consensus_Monte_Carlo/12931061/1
 
Title Global Consensus Monte Carlo 
Description To conduct Bayesian inference with large datasets, it is often convenient or necessary to distribute the data across multiple machines. We consider a likelihood function expressed as a product of terms, each associated with a subset of the data. Inspired by global variable consensus optimization, we introduce an instrumental hierarchical model associating auxiliary statistical parameters with each term, which are conditionally independent given the top-level parameters. One of these top-level parameters controls the unconditional strength of association between the auxiliary parameters. This model leads to a distributed MCMC algorithm on an extended state space yielding approximations of posterior expectations. A trade-off between computational tractability and fidelity to the original model can be controlled by changing the association strength in the instrumental model. We further propose the use of an SMC sampler with a sequence of association strengths, allowing both the automatic determination of appropriate strengths and for a bias correction technique to be applied. In contrast to similar distributed Monte Carlo algorithms, this approach requires few distributional assumptions. The performance of the algorithms is illustrated with a number of simulated examples. Supplementary materials for this article are available online. 
Type Of Material Database/Collection of data 
Year Produced 2020 
Provided To Others? Yes  
URL https://tandf.figshare.com/articles/dataset/Global_Consensus_Monte_Carlo/12931061
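Note: the instrumental-model idea described above can be sketched with a toy Gibbs sampler. Gaussian shard likelihoods and a flat prior are assumed purely so that the conditionals are in closed form; this is a simplified stand-in, not the authors' algorithm or code.

# Minimal sketch of the Global Consensus idea: auxiliary parameters z_j ~
# N(theta, rho) per data shard, alternating updates of the z_j and theta.
import numpy as np

rng = np.random.default_rng(1)
shards = [rng.normal(2.0, 1.0, size=200) for _ in range(4)]  # 4 data subsets
rho = 0.05      # association strength between z_j and theta (smaller = tighter)
sigma2 = 1.0    # known observation variance (assumed)

theta, draws = 0.0, []
z = np.zeros(len(shards))
for _ in range(3000):
    # z_j | theta, shard j (conjugate Gaussian update; parallelisable in principle)
    for j, x in enumerate(shards):
        prec = len(x) / sigma2 + 1.0 / rho
        mean = (x.sum() / sigma2 + theta / rho) / prec
        z[j] = rng.normal(mean, np.sqrt(1.0 / prec))
    # theta | z under a flat prior: Gaussian centred at the mean of the z_j
    theta = rng.normal(z.mean(), np.sqrt(rho / len(z)))
    draws.append(theta)

print("approximate posterior mean:", np.mean(draws[500:]))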
 
Title Latin lexical semantic annotation 
Description This dataset is a collection of lexical annotations of the corpus occurrences of 40 Latin lemmas. The corpus instances are from LatinISE and the process is described in Schlechtweg et al. (2020, 2021). The annotation was coordinated by Barbara McGillivray, and done by Annie Burman, Daria Kondakova, Francesca Dell'Oro, Helena Bermudez Sabel, Hugo Burgess, Paola Marongiu, and Rozalia Dobos. The pre-annotation was coordinated and designed by Barbara McGillivray and done by Manuel Márquez Cruz.
References:
McGillivray, B. and Kilgarriff, A. (2013). Tools for historical corpus research, and a corpus of Latin. In Paul Bennett, Martin Durrell, Silke Scheible, Richard J. Whitt (eds.), New Methods in Historical Corpus Linguistics. Tübingen: Narr.
Barbara McGillivray, Dominik Schlechtweg, Haim Dubossarsky, Nina Tahmasebi, & Simon Hengchen. (2021). DWUG LA: Diachronic Word Usage Graphs for Latin [Data set]. Zenodo. https://doi.org/10.5281/zenodo.5255228
Schlechtweg, D., McGillivray, B., Hengchen, S., Dubossarsky, H., Tahmasebi, N. (2020). SemEval-2020 Task 1: Unsupervised Lexical Semantic Change Detection. In Proceedings of the Fourteenth Workshop on Semantic Evaluation, 2020. International Committee for Computational Linguistics. DOI: 10.18653/v1/2020.semeval-1.1
Schlechtweg, D., Tahmasebi, N., Hengchen, S., Dubossarsky, H., McGillivray, B. (2021). DWUG: A large Resource of Diachronic Word Usage Graphs in Four Languages. In Proceedings of EMNLP 2021. 
Type Of Material Database/Collection of data 
Year Produced 2021 
Provided To Others? Yes  
URL https://kcl.figshare.com/articles/dataset/Latin_lexical_semantic_annotation/16974823
 
Title LatinISE subcorpora for SemEval 2020 task 1 
Description This data collection contains the Latin test data for SemEval 2020 Task 1: Unsupervised Lexical Semantic Change Detection]: a Latin text corpus pair (`corpus1/`, `corpus2/`) 40 lemmas which have been annotated for their lexical semantic change between the two corpora (`targets.txt`) The corpus data have been automatically lemmatized and part-of-speech tagged, and have been partially corrected by hand. For homonyms, the lemmas are followed by the '\#' symbol and the number of the homonym according to the Lewis-Short dictionary of Latin when this number is greater than 1. For example, the lemma 'dico' corresponds to the first homonym in the Lewis-Short dictionary and 'dico\#2' corresponds to the second homonym, cf. Lewis-Short dictionary. __Corpus 1__ based on: LatinISE (McGillivray and Kilgarriff 2013), version on Sketch Engine language: Latin time covered: from the beginning of the second century before Christ (BC) to the end of the first century BC size: ~1.7 million tokens format: lemmatized, sentence length >= 2, no punctuation, sentences randomly shuffled encoding: UTF-8 __Corpus 2__ based on: LatinISE (McGillivray and Kilgarriff 2013) , version on Sketch Engine language: Latin time covered: from the beginning of the first century after Christ (AD) to the end of the twenty-first century AD size: ~9.4 million tokens format: lemmatized, sentence length >= 2, no punctuation, sentences randomly shuffled encoding: UTF-8 Find more information on the data in the papers referenced below. References Dominik Schlechtweg, Barbara McGillivray, Simon Hengchen, Haim Dubossarsky and Nina Tahmasebi SemEval 2020 Task 1: Unsupervised Lexical Semantic Change Detection. To appear in SemEval@COLING2020. McGillivray, B. and Kilgarriff, A. (2013). Tools for historical corpus research, and a corpus of Latin. In Paul Bennett, Martin Durrell, Silke Scheible, Richard J. Whitt (eds.), New Methods in Historical Corpus Linguistics, Tübingen: Narr.
 
Type Of Material Database/Collection of data 
Year Produced 2020 
Provided To Others? Yes  
URL https://zenodo.org/record/3674988
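Note: the homonym convention described above (a '#' marker and Lewis-Short homonym number, e.g. 'dico#2') can be parsed with a few lines of Python. The sketch assumes targets.txt lists one lemma per line, which is an assumption about the file layout, not a documented guarantee.

# Minimal sketch: read target lemmas and split off the Lewis-Short homonym number.
targets = []
with open("targets.txt", encoding="utf-8") as fh:
    for line in fh:
        entry = line.strip()
        if not entry:
            continue
        lemma, _, homonym = entry.partition("#")
        targets.append((lemma, int(homonym) if homonym else 1))

print(targets[:5])   # e.g. [('dico', 2), ...] for homonymous lemmas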
 
Title LatinISE subcorpora for SemEval 2020 task 1 
Description This data collection contains the Latin test data for SemEval 2020 Task 1: Unsupervised Lexical Semantic Change Detection]: a Latin text corpus pair (`corpus1/lemma`, `corpus2/lemma`) 40 lemmas which have been annotated for their lexical semantic change between the two corpora (`targets.txt`) the annotated binary change scores of the targets for subtask 1, and their annotated graded change scores for subtask 2 (`truth/`) The corpus data have been automatically lemmatized and part-of-speech tagged, and have been partially corrected by hand. For homonyms, the lemmas are followed by the '\#' symbol and the number of the homonym according to the Lewis-Short dictionary of Latin when this number is greater than 1. For example, the lemma 'dico' corresponds to the first homonym in the Lewis-Short dictionary and 'dico\#2' corresponds to the second homonym, cf. Lewis-Short dictionary. __Corpus 1__ based on: LatinISE (McGillivray and Kilgarriff 2013), version on Sketch Engine language: Latin time covered: from the beginning of the second century before Christ (BC) to the end of the first century BC size: ~1.7 million tokens format: lemmatized, sentence length >= 2, no punctuation, sentences randomly shuffled encoding: UTF-8 __Corpus 2__ based on: LatinISE (McGillivray and Kilgarriff 2013) , version on Sketch Engine language: Latin time covered: from the beginning of the first century after Christ (AD) to the end of the twenty-first century AD size: ~9.4 million tokens format: lemmatized, sentence length >= 2, no punctuation, sentences randomly shuffled encoding: UTF-8 Find more information on the data in the papers referenced below. References Dominik Schlechtweg, Barbara McGillivray, Simon Hengchen, Haim Dubossarsky and Nina Tahmasebi SemEval 2020 Task 1: Unsupervised Lexical Semantic Change Detection. To appear in SemEval@COLING2020. McGillivray, B. and Kilgarriff, A. (2013). Tools for historical corpus research, and a corpus of Latin. In Paul Bennett, Martin Durrell, Silke Scheible, Richard J. Whitt (eds.), New Methods in Historical Corpus Linguistics, Tübingen: Narr.
 
Type Of Material Database/Collection of data 
Year Produced 2020 
Provided To Others? Yes  
URL https://zenodo.org/record/3732944
 
Title LatinISE test data for SemEval 2020 task 1 
Description This data collection contains the Latin test data for SemEval 2020 Task 1: Unsupervised Lexical Semantic Change Detection: a Latin text corpus pair (`corpus1/lemma`, `corpus2/lemma`) 40 lemmas which have been annotated for their lexical semantic change between the two corpora (`targets.txt`) the annotated binary change scores of the targets for subtask 1, and their annotated graded change scores for subtask 2 (`truth/`) The corpus data have been automatically lemmatized and part-of-speech tagged, and have been partially corrected by hand. For homonyms, the lemmas are followed by the '\#' symbol and the number of the homonym according to the Lewis-Short dictionary of Latin when this number is greater than 1. For example, the lemma 'dico' corresponds to the first homonym in the Lewis-Short dictionary and 'dico\#2' corresponds to the second homonym, cf. Lewis-Short dictionary. __Corpus 1__ based on: LatinISE (McGillivray and Kilgarriff 2013), version on Sketch Engine language: Latin time covered: from the beginning of the second century before Christ (BC) to the end of the first century BC size: ~1.7 million tokens format: lemmatized, sentence length >= 2, no punctuation, sentences randomly shuffled encoding: UTF-8 __Corpus 2__ based on: LatinISE (McGillivray and Kilgarriff 2013) , version on Sketch Engine language: Latin time covered: from the beginning of the first century after Christ (AD) to the end of the twenty-first century AD size: ~9.4 million tokens format: lemmatized, sentence length >= 2, no punctuation, sentences randomly shuffled encoding: UTF-8 Find more information on the data in the papers referenced below. References Dominik Schlechtweg, Barbara McGillivray, Simon Hengchen, Haim Dubossarsky and Nina Tahmasebi SemEval 2020 Task 1: Unsupervised Lexical Semantic Change Detection. To appear in SemEval@COLING2020. McGillivray, B. and Kilgarriff, A. (2013). Tools for historical corpus research, and a corpus of Latin. In Paul Bennett, Martin Durrell, Silke Scheible, Richard J. Whitt (eds.), New Methods in Historical Corpus Linguistics, Tübingen: Narr.
 
Type Of Material Database/Collection of data 
Year Produced 2020 
Provided To Others? Yes  
URL https://zenodo.org/record/3734089
 
Title LatinISE test data for SemEval 2020 task 1 with additional token versions of the corpora 
Description This data collection contains the Latin test data for SemEval 2020 Task 1: Unsupervised Lexical Semantic Change Detection: a Latin text corpus pair (`corpus1/lemma`, `corpus2/lemma`) 40 lemmas which have been annotated for their lexical semantic change between the two corpora (`targets.txt`) the annotated binary change scores of the targets for subtask 1, and their annotated graded change scores for subtask 2 (`truth/`) The corpus data have been automatically lemmatized and part-of-speech tagged, and have been partially corrected by hand. For homonyms, the lemmas are followed by the '\#' symbol and the number of the homonym according to the Lewis-Short dictionary of Latin when this number is greater than 1. For example, the lemma 'dico' corresponds to the first homonym in the Lewis-Short dictionary and 'dico\#2' corresponds to the second homonym, cf. Lewis-Short dictionary. __Corpus 1__ based on: LatinISE (McGillivray and Kilgarriff 2013), version on Sketch Engine language: Latin time covered: from the beginning of the second century before Christ (BC) to the end of the first century BC size: ~1.7 million tokens format: lemmatized, sentence length >= 2, no punctuation, sentences randomly shuffled encoding: UTF-8 __Corpus 2__ based on: LatinISE (McGillivray and Kilgarriff 2013) , version on Sketch Engine language: Latin time covered: from the beginning of the first century after Christ (AD) to the end of the twenty-first century AD size: ~9.4 million tokens format: lemmatized, sentence length >= 2, no punctuation, sentences randomly shuffled encoding: UTF-8 Find more information on the data in the papers referenced below. Besides the official lemma version of the corpora for SemEval-2020 Task 1 we also provide the raw token version ( corpus1/token/, corpus2/token/). It contains the raw sentences in the same order as in the lemma version. Find more information on the data and SemEval-2020 Task 1 in the paper referenced below. The creation of the data was supported by the CRETA center and the CLARIN-D grant funded by the German Ministry for Education and Research (BMBF). References Dominik Schlechtweg, Barbara McGillivray, Simon Hengchen, Haim Dubossarsky and Nina Tahmasebi SemEval 2020 Task 1: Unsupervised Lexical Semantic Change Detection. To appear in SemEval@COLING2020. McGillivray, B. and Kilgarriff, A. (2013). Tools for historical corpus research, and a corpus of Latin. In Paul Bennett, Martin Durrell, Silke Scheible, Richard J. Whitt (eds.), New Methods in Historical Corpus Linguistics, Tübingen: Narr.
 
Type Of Material Database/Collection of data 
Year Produced 2020 
Provided To Others? Yes  
URL https://zenodo.org/record/3674098
 
Title LatinISE test data for SemEval 2020 task 1 with additional token versions of the corpora 
Description This data collection contains the Latin test data for SemEval 2020 Task 1: Unsupervised Lexical Semantic Change Detection: a Latin text corpus pair (`corpus1/lemma`, `corpus2/lemma`) 40 lemmas which have been annotated for their lexical semantic change between the two corpora (`targets.txt`) the annotated binary change scores of the targets for subtask 1, and their annotated graded change scores for subtask 2 (`truth/`) The corpus data have been automatically lemmatized and part-of-speech tagged, and have been partially corrected by hand. For homonyms, the lemmas are followed by the '\#' symbol and the number of the homonym according to the Lewis-Short dictionary of Latin when this number is greater than 1. For example, the lemma 'dico' corresponds to the first homonym in the Lewis-Short dictionary and 'dico\#2' corresponds to the second homonym, cf. Lewis-Short dictionary. __Corpus 1__ based on: LatinISE (McGillivray and Kilgarriff 2013), version on Sketch Engine language: Latin time covered: from the beginning of the second century before Christ (BC) to the end of the first century BC size: ~1.7 million tokens format: lemmatized, sentence length >= 2, no punctuation, sentences randomly shuffled encoding: UTF-8 __Corpus 2__ based on: LatinISE (McGillivray and Kilgarriff 2013) , version on Sketch Engine language: Latin time covered: from the beginning of the first century after Christ (AD) to the end of the twenty-first century AD size: ~9.4 million tokens format: lemmatized, sentence length >= 2, no punctuation, sentences randomly shuffled encoding: UTF-8 Find more information on the data in the papers referenced below. Besides the official lemma version of the corpora for SemEval-2020 Task 1 we also provide the raw token version ( corpus1/token/, corpus2/token/). It contains the raw sentences in the same order as in the lemma version. Find more information on the data and SemEval-2020 Task 1 in the paper referenced below. The creation of the data was supported by the CRETA center and the CLARIN-D grant funded by the German Ministry for Education and Research (BMBF). References Dominik Schlechtweg, Barbara McGillivray, Simon Hengchen, Haim Dubossarsky and Nina Tahmasebi SemEval 2020 Task 1: Unsupervised Lexical Semantic Change Detection. To appear in SemEval@COLING2020. McGillivray, B. and Kilgarriff, A. (2013). Tools for historical corpus research, and a corpus of Latin. In Paul Bennett, Martin Durrell, Silke Scheible, Richard J. Whitt (eds.), New Methods in Historical Corpus Linguistics, Tübingen: Narr.
 
Type Of Material Database/Collection of data 
Year Produced 2020 
Provided To Others? Yes  
URL https://zenodo.org/record/3992738
 
Title Living Machines atypical animacy dataset 
Description Atypical animacy detection dataset, based on nineteenth-century sentences in English extracted from an open dataset of nineteenth-century books digitized by the British Library (available via https://doi.org/10.21250/db14, British Library Labs, 2014). This dataset contains 598 sentences containing mentions of machines. Each sentence has been annotated according to the animacy and humanness of the machine in the sentence. This dataset has been created as part of the following paper: Ardanuy, M. C., F. Nanni, K. Beelen, Kasra Hosseini, Ruth Ahnert, J. Lawrence, Katherine McDonough, Giorgia Tolfo, D. C. Wilson and B. McGillivray. "Living Machines: A study of atypical animacy." In Proceedings of the 28th International Conference on Computational Linguistics (COLING2020). 
Type Of Material Database/Collection of data 
Year Produced 2020 
Provided To Others? Yes  
URL https://bl.iro.bl.uk/work/323177af-6081-4e93-8aaf-7932ca4a390a
 
Title MapReader_Data_SIGSPATIAL_2022 
Description MapReader in GeoHumanities workshop (SIGSPATIAL 2022): Gold standards and outputs Refer to: https://github.com/Living-with-machines/MapReader/wiki/GeoHumanities-workshop-in-SIGSPATIAL-2022 
Type Of Material Database/Collection of data 
Year Produced 2022 
Provided To Others? Yes  
URL https://zenodo.org/record/7116800
 
Title Metadata record for: DUKweb, diachronic word representations from the UK Web Archive corpus 
Description This dataset contains key characteristics about the data described in the Data Descriptor "DUKweb, diachronic word representations from the UK Web Archive corpus". Contents: (1) a human-readable metadata summary table in CSV format; (2) a machine-readable metadata file in JSON format. 
Type Of Material Database/Collection of data 
Year Produced 2021 
Provided To Others? Yes  
URL https://springernature.figshare.com/articles/dataset/Metadata_record_for_DUKweb_diachronic_word_repr...
 
Title Monthly word embeddings for Twitter random sample (English, 2012-2018) 
Description This dataset contains monthly word embeddings created from the tweets available via the statuses/sample endpoint of the Twitter Streaming API from 2012 to 2018. Full details of the creation of the dataset are given in "Room to Glo: A Systematic Comparison of Semantic Change Detection Approaches with Word Embeddings". The md5sum of the gzipped tarball file is a76888ffec8cc7aebba09d365ca55ace; a checksum verification sketch follows this record. 
Type Of Material Database/Collection of data 
Year Produced 2019 
Provided To Others? Yes  
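Since the record quotes an md5 checksum for the gzipped tarball, a downloaded copy can be verified by hashing it locally with the Python standard library; the sketch below does exactly that, with a placeholder filename.

```python
# Verify a downloaded tarball against the md5sum quoted in this record.
# The filename is a placeholder: use whatever name the archive was saved under.
import hashlib

EXPECTED_MD5 = "a76888ffec8cc7aebba09d365ca55ace"

def md5_of(path, chunk_size=1 << 20):
    """Compute the md5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

print(md5_of("monthly_twitter_embeddings.tar.gz") == EXPECTED_MD5)
```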
 
Title Neural Language Models for Nineteenth-Century English (dataset; language model zoo) 
Description This dataset contains four types of neural language models trained on a large historical dataset of books in English published between 1760 and 1900, comprising ~5.1 billion tokens. The language model architectures include static (word2vec and fastText) and contextualized models (BERT and Flair). For each architecture, we trained a model instance using the whole dataset. Additionally, we trained separate instances on text published before 1850 for the two static models, and four instances covering different time slices for BERT. An illustrative loading sketch follows this record. GitHub repository: https://github.com/Living-with-machines/histLM 
Type Of Material Database/Collection of data 
Year Produced 2021 
Provided To Others? Yes  
URL https://zenodo.org/record/4779090
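The record lists both static (word2vec, fastText) and contextualised (BERT, Flair) models; how the files are packaged is documented in the histLM repository. Purely as a hedged illustration, the sketch below shows how such models are typically loaded with gensim and Hugging Face Transformers. The paths are placeholders and the file formats are assumptions, not the archive's actual layout.

```python
# Illustrative loading of static and contextualised language models of the
# kinds listed in this record. Paths and formats are placeholders; consult
# the histLM repository for the actual archive layout and file names.
from gensim.models import Word2Vec
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Static embeddings (gensim-saved word2vec model assumed for illustration).
w2v = Word2Vec.load("histLM/word2vec/whole_period/w2v.model")
print(w2v.wv.most_similar("engine", topn=5))

# Contextualised model (Hugging Face directory layout assumed for illustration).
tokenizer = AutoTokenizer.from_pretrained("histLM/bert/whole_period")
bert = AutoModelForMaskedLM.from_pretrained("histLM/bert/whole_period")
```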
 
Title Real-World Network Data from Goal-directed graph construction using reinforcement learning 
Description Contains the raw data for the real-world networks used in the experiments (Euroroad and Scigrid). 
Type Of Material Database/Collection of data 
Year Produced 2021 
Provided To Others? Yes  
URL https://rs.figshare.com/articles/dataset/Real-World_Network_Data_from_Goal-directed_graph_constructi...
 
Title Research Data Supporting "Modelling prognostic trajectories of cognitive decline due to Alzheimer's disease" 
Description  
Type Of Material Database/Collection of data 
Year Produced 2020 
Provided To Others? Yes  
URL https://www.repository.cam.ac.uk/handle/1810/301740
 
Title Research data supporting "Multimodal imaging of brain connectivity reveals predictors of individual decision strategy in statistical learning" 
Description Behavioural data, resting-state fMRI connectivity data and graph metrics data (see supporting data description .doc file for more information) 
Type Of Material Database/Collection of data 
Year Produced 2019 
Provided To Others? Yes  
 
Title Research data supporting "White-Matter Pathways for Statistical Learning of Temporal Structures" 
Description Behavioural data and DTI connectivity data (see supporting data description .doc file for more information) 
Type Of Material Database/Collection of data 
Year Produced 2019 
Provided To Others? Yes  
 
Title Source Code from Goal-directed graph construction using reinforcement learning 
Description Contains all source code used to obtain the results reported in the manuscript. 
Type Of Material Database/Collection of data 
Year Produced 2021 
Provided To Others? Yes  
URL https://rs.figshare.com/articles/dataset/Source_Code_from_Goal-directed_graph_construction_using_rei...
 
Title Supplementary Information Files for Early childhood weight gain: latent patterns and body composition outcomes 
Description Supplementary information files for "Early childhood weight gain: latent patterns and body composition outcomes".
Background: Despite early childhood weight gain being a key indicator of obesity risk, we do not have a good understanding of the different patterns that exist.
Objectives: To identify and characterise distinct groups of children displaying similar early life weight trajectories.
Methods: A growth mixture model captured heterogeneity in weight trajectories between 0 and 60 months in 1,390 children in the Avon Longitudinal Study of Parents and Children. Differences between the classes in characteristics and body size/composition at 9 years were investigated.
Results: The best model had five classes. The "Normal" (45%) and "Normal after initial catch-down" (24%) classes were close to the 50th centile of a growth standard between 24 and 60 months. The "High-decreasing" (21%) and "Stable-high" (7%) classes peaked at the ~91st centile at 12-18 months, but while the former declined to the ~75th centile and comprised constitutionally big children, the latter did not. The "Rapidly-increasing" (3%) class gained weight from below the 50th centile at 4 months to above the 91st centile at 60 months. By 9 years, their mean body mass index (BMI) placed them at the 98th centile. This class was characterised by the highest maternal BMI, highest parity, highest levels of gestational hypertension and diabetes, and the lowest socio-economic position. At 9 years, the "Rapidly-increasing" class was estimated to have 68.2% (48.3, 88.1) more fat mass than the "Normal" class, but only 14.0% (9.1, 18.9) more lean mass.
Conclusions: Criteria used in growth monitoring practice are unlikely to consistently distinguish between the different patterns of weight gain reported here. (A simplified illustration of latent trajectory clustering follows this record.) 
Type Of Material Database/Collection of data 
Year Produced 2021 
Provided To Others? Yes  
URL https://repository.lboro.ac.uk/articles/dataset/Supplementary_Information_Files_for_Early_childhood_...
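The analysis described here is a growth mixture model fitted to longitudinal weight data, which is normally done with dedicated tooling (for example Mplus, or the lcmm package in R). As a loose, clearly simplified stand-in, the sketch below clusters per-child polynomial growth-curve coefficients with a Gaussian mixture; it illustrates the idea of latent trajectory classes only, is not the paper's model, and uses entirely synthetic data.

```python
# Very simplified stand-in for a growth mixture model: fit a low-order
# polynomial to each child's weight trajectory, then cluster the fitted
# coefficients with a Gaussian mixture to obtain latent trajectory classes.
# Illustrative only; the paper uses a proper growth mixture model and the
# data below are synthetic.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
ages = np.array([0, 4, 12, 24, 36, 48, 60], dtype=float)  # months

# Synthetic weight curves (kg) for 200 hypothetical children.
n_children = 200
base = 3.5 + 0.18 * ages - 0.0008 * ages**2
weights = base + rng.normal(0, 1.0, size=(n_children, ages.size)).cumsum(axis=1)

# Summarise each trajectory by its quadratic polynomial coefficients.
coefs = np.array([np.polyfit(ages, w, deg=2) for w in weights])

# Cluster the coefficient vectors into five latent classes.
gmm = GaussianMixture(n_components=5, random_state=0).fit(coefs)
classes = gmm.predict(coefs)
print(np.bincount(classes) / n_children)  # class proportions
```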
 
Title Supplementary Information showing a list of all the ligand features considered from A simple spatial extension to the extended connectivity interaction features for binding affinity prediction 
Description The representation of the protein-ligand complexes used in building machine learning models plays an important role in the accuracy of binding affinity prediction. The Extended Connectivity Interaction Features (ECIF) is one such representation. We report that (i) including the discretized distances between protein-ligand atom pairs in the ECIF scheme improves predictive accuracy, and (ii) in an evaluation using gradient boosted trees, we found that the resampling method used in selecting the best hyperparameters has a strong effect on predictive performance, especially for benchmarking purposes. (An illustrative sketch of distance-binned atom-pair counting follows this record.) 
Type Of Material Database/Collection of data 
Year Produced 2022 
Provided To Others? Yes  
URL https://rs.figshare.com/articles/dataset/Supplementary_Information_showing_a_list_of_all_the_ligand_...
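The key idea in the description, augmenting ECIF-style protein-ligand atom-pair counts with discretised inter-atomic distances, can be illustrated with a small counting sketch. This is a generic re-implementation of the idea for illustration only; the atom typing and bin edges are hypothetical, not the authors' feature definition.

```python
# Illustrative sketch of distance-binned atom-pair counting, the idea behind
# adding discretised protein-ligand distances to ECIF-style features.
# Atom "types" and bin edges are hypothetical simplifications.
from collections import Counter
import numpy as np

BIN_EDGES = np.array([0.0, 2.0, 4.0, 6.0])  # Angstrom shells (illustrative)

def pair_features(protein_atoms, ligand_atoms):
    """protein_atoms / ligand_atoms: lists of (atom_type, xyz ndarray) pairs."""
    counts = Counter()
    for p_type, p_xyz in protein_atoms:
        for l_type, l_xyz in ligand_atoms:
            d = float(np.linalg.norm(p_xyz - l_xyz))
            if d >= BIN_EDGES[-1]:
                continue                          # beyond the distance cutoff
            shell = int(np.digitize(d, BIN_EDGES))  # shell index 1..3
            counts[(p_type, l_type, shell)] += 1
    return counts

protein = [("C", np.array([0.0, 0.0, 0.0])), ("N", np.array([1.5, 0.0, 0.0]))]
ligand = [("O", np.array([0.0, 3.0, 0.0]))]
print(pair_features(protein, ligand))
```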
 
Title Supplementary material for 'A Deep Learning Approach to Geographical Candidate Selection through Toponym Matching' 
Description Supplementary material for the https://github.com/Living-with-machines/LwM_SIGSPATIAL2020_ToponymMatching repository, containing the underlying code and materials for the paper 'A Deep Learning Approach to Geographical Candidate Selection through Toponym Matching', accepted to SIGSPATIAL2020 as a poster paper. Coll Ardanuy, M., Hosseini, K., McDonough, K., Krause, A., van Strien, D. and Nanni, F. (2020): A Deep Learning Approach to Geographical Candidate Selection through Toponym Matching, SIGSPATIAL: Poster Paper. 
Type Of Material Database/Collection of data 
Year Produced 2020 
Provided To Others? Yes  
URL https://zenodo.org/record/4034818
 
Title Supplementary material for 'A Deep Learning Approach to Geographical Candidate Selection through Toponym Matching' 
Description Supplementary material for the https://github.com/Living-with-machines/LwM_SIGSPATIAL2020_ToponymMatching repository, containing the underlying code and materials for the paper 'A Deep Learning Approach to Geographical Candidate Selection through Toponym Matching', accepted to SIGSPATIAL2020 as a poster paper. Coll Ardanuy, M., Hosseini, K., McDonough, K., Krause, A., van Strien, D. and Nanni, F. (2020): A Deep Learning Approach to Geographical Candidate Selection through Toponym Matching, SIGSPATIAL: Poster Paper. 
Type Of Material Database/Collection of data 
Year Produced 2020 
Provided To Others? Yes  
URL https://zenodo.org/record/4034819
 
Title Supplementary table showing the selected important features from A simple spatial extension to the extended connectivity interaction features for binding affinity prediction 
Description The representation of the protein-ligand complexes used in building machine learning models plays an important role in the accuracy of binding affinity prediction. The Extended Connectivity Interaction Features (ECIF) is one such representation. We report that (i) including the discretized distances between protein-ligand atom pairs in the ECIF scheme improves predictive accuracy, and (ii) in an evaluation using gradient boosted trees, we found that the resampling method used in selecting the best hyperparameters has a strong effect on predictive performance, especially for benchmarking purposes. 
Type Of Material Database/Collection of data 
Year Produced 2022 
Provided To Others? Yes  
URL https://rs.figshare.com/articles/dataset/Supplementary_table_showing_the_selected_important_features...
 
Title Visual Identification of Individual Holstein Friesian Cattle via Deep Metric Learning 
Description This dataset accompanies the paper "Visual Identification of Individual Holstein Friesian Cattle via Deep Metric Learning", available at https://arxiv.org/abs/2006.09205. It consists of two components: (a) detection and localisation, and (b) identification. For an overview of the dataset, refer to Section 3 in the paper; for any queries, contact the corresponding author. For accompanying source code, see https://github.com/CWOA/MetricLearningIdentification. (A generic sketch of the embedding-matching step follows this record.) 
Type Of Material Database/Collection of data 
Year Produced 2020 
Provided To Others? Yes  
URL https://data.bris.ac.uk/data/dataset/10m32xl88x2b61zlkkgz3fml17/
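The identification component of a deep metric learning pipeline typically works by embedding each image and assigning the identity of its nearest labelled neighbour in embedding space. The sketch below shows that matching step in generic NumPy form; the embeddings and labels are random placeholders, and this is not the paper's released code (see the linked repository for that).

```python
# Generic nearest-neighbour identification in an embedding space: the matching
# step used once a deep metric learning model has produced per-image
# embeddings. Embeddings and labels here are random placeholders.
import numpy as np

rng = np.random.default_rng(0)
gallery = rng.normal(size=(50, 128))        # embeddings of known individuals
gallery_ids = rng.integers(0, 10, size=50)  # identity label per embedding
queries = rng.normal(size=(5, 128))         # embeddings of new images

# L2-normalise so that dot products are cosine similarities.
gallery /= np.linalg.norm(gallery, axis=1, keepdims=True)
queries /= np.linalg.norm(queries, axis=1, keepdims=True)

similarities = queries @ gallery.T          # (5, 50) cosine similarities
predicted_ids = gallery_ids[similarities.argmax(axis=1)]
print(predicted_ids)
```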
 
Title rw_network_data.zip from Planning spatial networks with Monte Carlo Tree Search 
Description Real-world network data (Internet dataset). 
Type Of Material Database/Collection of data 
Year Produced 2022 
Provided To Others? Yes  
URL https://rs.figshare.com/articles/dataset/rw_network_data_zip_from_Planning_spatial_networks_with_Mon...
 
Title rw_network_data.zip from Planning spatial networks with Monte Carlo tree search 
Description Real-world network data (Internet dataset). 
Type Of Material Database/Collection of data 
Year Produced 2023 
Provided To Others? Yes  
URL https://rs.figshare.com/articles/dataset/rw_network_data_zip_from_Planning_spatial_networks_with_Mon...
 
Title source_code.zip from Planning spatial networks with Monte Carlo Tree Search 
Description Source code that enables reproducing the results reported in the paper. 
Type Of Material Database/Collection of data 
Year Produced 2022 
Provided To Others? Yes  
URL https://rs.figshare.com/articles/dataset/source_code_zip_from_Planning_spatial_networks_with_Monte_C...
 
Title source_code.zip from Planning spatial networks with Monte Carlo tree search 
Description Source code that enables reproducing the results reported in the paper. 
Type Of Material Database/Collection of data 
Year Produced 2023 
Provided To Others? Yes  
URL https://rs.figshare.com/articles/dataset/source_code_zip_from_Planning_spatial_networks_with_Monte_C...
 
Title Craystack: lossless compression tools for machine learning researchers 
Description Craystack is a modular Python package for lossless compression. The low-level core is a vectorized version of Asymmetric Numeral Systems (ANS), implemented using NumPy; high-level composable 'Codecs' are provided for easily building compression and decompression functions. (A toy illustration of the ANS idea follows this record.) 
Type Of Technology Software 
Year Produced 2021 
Open Source License? Yes  
URL https://zenodo.org/record/4572728
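Craystack's own API is documented in the linked repository. Purely to illustrate the ANS idea the package builds on, the sketch below implements a toy, non-vectorised rANS encoder/decoder for a fixed three-symbol distribution, with no state renormalisation; it is not Craystack's API.

```python
# Toy rANS (range Asymmetric Numeral Systems) encoder/decoder for a small
# alphabet with fixed frequencies. Illustrative only: real coders renormalise
# the state and stream out bits; Craystack additionally vectorises this.
import numpy as np

freqs = np.array([3, 3, 2])                    # symbol frequencies
cum = np.concatenate(([0], np.cumsum(freqs)))  # cumulative frequencies
M = int(cum[-1])                               # total frequency mass

def encode(symbols, state=1 << 16):
    for s in reversed(symbols):                # encode in reverse (ANS is LIFO)
        f, c = int(freqs[s]), int(cum[s])
        state = (state // f) * M + (state % f) + c
    return state

def decode(state, n):
    out = []
    for _ in range(n):
        r = state % M
        s = int(np.searchsorted(cum, r, side="right") - 1)
        f, c = int(freqs[s]), int(cum[s])
        state = f * (state // M) + r - c
        out.append(s)
    return out, state

msg = [0, 1, 2, 1, 0]
decoded, _ = decode(encode(msg), len(msg))
assert decoded == msg
```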
 
Title DeezyMatch 
Description DeezyMatch: A Flexible Deep Neural Network Approach to Fuzzy String Matching. DeezyMatch can be applied to the following tasks: record linkage; candidate selection for entity linking systems; and toponym matching. (An illustrative baseline for the underlying matching task follows this record.) 
Type Of Technology Software 
Year Produced 2020 
Open Source License? Yes  
URL https://zenodo.org/record/3983555
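DeezyMatch's own training and inference interface is documented in its repository. Rather than guess at that API, the sketch below illustrates the underlying fuzzy string matching task with a simple character trigram Jaccard baseline, the kind of comparison a learned matcher such as DeezyMatch is designed to improve on; all names here are illustrative.

```python
# Character trigram Jaccard similarity: a simple baseline for the fuzzy
# string (e.g. toponym) matching task that DeezyMatch addresses with a
# learned model. This is not DeezyMatch's API.
def trigrams(text):
    padded = f"  {text.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def jaccard(a, b):
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)

candidates = ["London", "Londinium", "Londonderry", "Lisbon"]
query = "Lundon"  # hypothetical OCR/variant spelling
print(sorted(candidates, key=lambda c: jaccard(query, c), reverse=True))
```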
 
Title DeezyMatch 
Description DeezyMatch: A Flexible Deep Neural Network Approach to Fuzzy String Matching. DeezyMatch can be applied to the following tasks: record linkage; candidate selection for entity linking systems; and toponym matching. 
Type Of Technology Software 
Year Produced 2020 
Open Source License? Yes  
URL https://zenodo.org/record/3983554
 
Title Subtle variation in sepsis-III definitions influences predictive performance of machine learning 
Description This is the official implementation of the paper entitled "Subtle variation in sepsis-III definitions influences predictive performance of machine learning". 
Type Of Technology Software 
Year Produced 2021 
Open Source License? Yes  
URL https://zenodo.org/record/5168788
 
Title The Laplace Microarchitecture for Tracking Data Uncertainty and Its Implementation in a RISC-V Processor 
Description Source code of the evaluated benchmarks of the "The Laplace Microarchitecture for Tracking Data Uncertainty and Its Implementation in a RISC-V Processor" research paper accepted to appear in the 54th IEEE/ACM International Symposium on Microarchitecture (MICRO), 2021. 
Type Of Technology Software 
Year Produced 2021 
Open Source License? Yes  
URL https://zenodo.org/record/5150148
 
Title passt/miceandmen: Code released with manuscript. 
Description Source code related to Stumpf et al. (2020) Transfer learning from mouse to man. 
Type Of Technology Software 
Year Produced 2020 
URL https://zenodo.org/record/4105891
 
Title passt/miceandmen: Code released with manuscript. 
Description Source code related to Stumpf et al. (2020) Transfer learning from mouse to man. 
Type Of Technology Software 
Year Produced 2020 
URL https://zenodo.org/record/4105890