Manipulation and extraction of knowledge from large datasets with desktop tools optimised for the analysis of 'Big Data'
Lead Research Organisation:
Loughborough University
Department Name: Wolfson Sch of Mech, Elec & Manufac Eng
Abstract
We live in a 'data tsunami': there is often too much data to handle without specialist skills, and this is especially true for analysts (e.g. volunteers or small organisations) with limited resources. Large systems and communities of interest can generate huge amounts of data that can be kept almost indefinitely in low-cost digital archives. There are also activities converting historical data from many different domains into searchable digital formats, sometimes by automated processes and sometimes by volunteer effort. For example, citizen science data assembled from diverse records going back decades is used to analyse and understand biodiversity; this data is assembled and analysed almost entirely by volunteer enthusiasts. The ability to effectively analyse Big Data with tools designed for desktop computers (or equivalent) has not kept pace with the ability to generate it. Familiar techniques, such as spreadsheets, do not scale effectively once the dataset grows beyond what can be checked by hand. Furthermore, programming languages for analysing large datasets with limited resources rely on the technical expertise of the programmer to write efficient code that will execute quickly. Currently there is no effective method to encapsulate that expertise into a template that can be reused on related sets of data. The reliability of such datasets may be compromised by inconsistent data cleansing techniques, which in any case all rely on knowing an expected form and range of the data; this is not a valid assumption for some datasets.
This work will develop a novel conceptual framework, using task-oriented templates, for the analysis of Big Data that effectively separates the curation, analysis, and reporting of large datasets, while creating a reproducible analysis. Outputs should include secondary data, with metadata that captures all the transformations that have been applied, providing an auditable connection to the source data. The objective is to support the reuse of secondary data with confidence in downstream analysis. Task-oriented templates should allow a more structured approach to data analysis and facilitate reuse of data through verifiable digital signatures. The template-based approach is intended to reduce the programming skills required for the analysis of large data for a wide range of commercial, academic and social applications on desktop computers (or equivalent).
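The auditable connection between secondary data and source data described above could, for illustration, be realised as a hash chain over the recorded transformations. The sketch below (in Python, with hypothetical names; the project itself works in R) shows one minimal way to capture every transformation in metadata so that the history can later be verified:

```python
import hashlib
import json

def digest(data: bytes) -> str:
    """Hex digest used to fingerprint data and metadata."""
    return hashlib.sha256(data).hexdigest()

def fingerprint(obj) -> str:
    """Stable fingerprint of a JSON-serialisable object."""
    return digest(json.dumps(obj, sort_keys=True).encode())

def apply_step(record, name, transform):
    """Apply one named transformation and extend the audit trail.

    `record` holds the current data plus a list of metadata entries;
    each entry chains the signature of the previous entry, so any
    tampering with the history invalidates every later signature.
    """
    new_data = transform(record["data"])
    prev_sig = (record["steps"][-1]["signature"]
                if record["steps"] else fingerprint(record["data"]))
    entry = {
        "step": name,
        "output_hash": fingerprint(new_data),
        "prev_signature": prev_sig,
    }
    entry["signature"] = fingerprint(entry)
    return {"data": new_data, "steps": record["steps"] + [entry]}

# Source data plus an empty audit trail.
record = {"data": [3, 1, None, 2], "steps": []}
record = apply_step(record, "drop_missing",
                    lambda xs: [x for x in xs if x is not None])
record = apply_step(record, "sort", sorted)

print(record["data"])                        # cleaned, sorted secondary data
print([s["step"] for s in record["steps"]])  # auditable list of transformations
```

The secondary data travels together with its step metadata, so a downstream analyst can recompute each fingerprint to confirm the claimed chain of transformations before reusing the data.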
Fundamentally, this research will challenge the sequential nature of data handling (load, transform, analyse, output) to reduce the risk of irreversible load/transform/analysis-induced error propagation, enabling users with limited IT resources and moderate algorithmic and programming skills to reliably extract new knowledge from large and complex datasets.
The relevant EPSRC sub-theme is Data, Information and Knowledge.
Organisations
People | ORCID iD
---|---
Michael Henshaw (Primary Supervisor) |
P Palmer (Student) |
Studentship Projects
Project Reference | Relationship | Related To | Start | End | Student Name
---|---|---|---|---|---
EP/N509516/1 | | | 30/09/2016 | 29/09/2021 |
2051786 | Studentship | EP/N509516/1 | 01/01/2018 | 29/06/2021 | P Palmer
Description | Novel Contributions
===================
- Template theory
  - Introduced the concept of **reusability** as an extension of **reproducibility**. This sets the context in which reusability complements the existing literate programming techniques that underpin reproducibility; essentially, reusability is a subset of reproducibility.
  - The **task**-oriented (rather than **data**-oriented) approach implied by reusability requires new terminology to clearly convey the state of data in relation to the task:
    - Data *sensu lato*: data defined in the loose sense. The presentation of data inside spreadsheets and many other formats is typically Data *sensu lato*.
    - Data *sensu stricto*: data defined in the strict sense. When data are transformed into Data *sensu stricto*, the state and format are completely defined and analysis may commence.
    - Data *sensu nascent*: data not yet formed. This state arises in many research situations where there is a belief that the potential data exists and may be collected and transformed into Data *sensu stricto* for analysis as part of a defined experimental process.
  - The concept of reusability is developed into a **mathematical framework** that describes the essential properties required for a template to be reusable.
  - The framework and derived properties lead in turn to a **systematic method** for the functional implementation of reusability.
- Empirical demonstration of theory
  - Created a utility that generates a skeleton reusable template framework to CRAN standards: a technically unforgiving 20+ step process.
  - Reusable templates implemented using the mathematical theory as a guide, with RStudio as the GUI.
  - The R package framework is used to manage the generation of searchable package documentation.
  - Code verified locally to ensure it meets CRAN requirements for formal publication as a user-contributed package.
  - Development versions of the templates published on public and private GitHub repositories, along with supporting documentation.
  - Travis CI (Continuous Integration) used to ensure the GitHub version continues to meet CRAN requirements after updates are applied.
  - Templates shared remotely with end users via the internet, using GitHub and R development package support services. |
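The continuous-integration step above can be expressed as a small Travis CI configuration using Travis's community R support. The fragment below is an illustrative sketch only, not the project's actual `.travis.yml`:

```yaml
# Illustrative .travis.yml for an R package checked against CRAN requirements.
language: r               # Travis CI community R support
r: release                # build against the current R release
cache: packages           # cache installed packages between builds
warnings_are_errors: true # treat R CMD check warnings as failures, mirroring CRAN strictness
```

With a configuration along these lines, every push to the GitHub repository triggers `R CMD check`, so regressions against the CRAN publication requirements are caught as soon as updates are applied.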
Exploitation Route | Shared templates have already been remotely accessed by end users via the internet, using GitHub and R development package support services. This has allowed end users to access the benefits of R without being R programmers. |
Sectors | Digital/Communication/Information Technologies (including Software); Education; Environment; Government, Democracy and Justice; Culture, Heritage, Museums and Collections |
Description | Local government has a statutory requirement to report on biodiversity status. In Leicestershire this reporting is made in conjunction with various bodies, including the Leicestershire and Rutland Wildlife Trust (LRWT). Analysis of biodiversity data for the LRWT by PJP using draft templates highlighted issues with missing WeBS (Wetland Bird Survey) data that were previously unknown to stakeholders. A revised dataset was successfully analysed and used to support the statutory reporting of the current status of protected environmental areas. An extension of this analysis was used to provide an inventory of species in Leicestershire that occur in managed nature reserves compared with the county as a whole. This inventory is being incorporated by the county's conservation committee to guide the selection of additional land purchases for the management of biodiversity. |
First Year Of Impact | 2019 |
Sector | Environment |
Impact Types | Societal; Policy & Public Services |