Manipulation and extraction of knowledge from large datasets with desktop tools optimised for the analysis of 'Big Data'

Lead Research Organisation: Loughborough University
Department Name: Wolfson Sch of Mech, Elec & Manufac Eng

Abstract

We live in a 'data tsunami': there is often too much data to handle without specialist skills, especially for analysts with limited resources (e.g. volunteers or small organisations). Large systems and communities of interest can generate huge amounts of data that can be kept almost indefinitely in low-cost digital archives. Historical data from many different domains are also being converted into searchable digital formats, sometimes by automated processes and sometimes by volunteer effort. For example, citizen science data assembled from diverse records that go back decades are used to analyse and understand biodiversity; these data are assembled and analysed almost entirely by volunteer enthusiasts. The ability to analyse Big Data effectively with tools designed for desktop computers (or equivalent) has not kept pace with the ability to generate it. Familiar tools, such as spreadsheets, do not scale effectively once a dataset grows beyond a size that can be checked by hand. Furthermore, programming languages for analysing large datasets with limited resources rely on the technical expertise of the programmer to write efficient code that will execute quickly. Currently there is no effective method to encapsulate that expertise in a template that can be reused on related sets of data. The reliability of such datasets may also be compromised by inconsistent data cleansing techniques, all of which rely on knowing the expected form and range of the data; this is not a valid assumption for some datasets.
This work will develop a novel conceptual framework, using task-orientated templates, for the analysis of Big Data. The framework effectively separates the curation, analysis, and reporting of large datasets while creating a reproducible analysis. Outputs should include secondary data with metadata that capture all the transformations that have been applied, providing an auditable connection to the source data. The objective is to support the reuse of secondary data with confidence in downstream analysis. Using task-orientated templates should allow a more structured approach to data analysis and facilitate the reuse of data through verifiable digital signatures. The template-based approach is intended to reduce the programming skills required for the analysis of large datasets on desktop computers (or equivalent) across a wide range of commercial, academic and social applications.
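As a rough illustration of secondary data carrying an auditable record of its transformations, the following minimal Python sketch logs each step and a digest of its output. It is not the project's implementation (the project's templates are R packages); the function names, the metadata layout and the sample rows are all hypothetical, and a plain SHA-256 digest stands in for the verifiable digital signature, which in practice would also need a signing key.

```python
import hashlib
import json


def digest(obj):
    """SHA-256 hex digest of a JSON-serialisable object, using a canonical encoding."""
    canonical = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def apply_with_provenance(source_rows, steps):
    """Apply named transformation steps in order and record each one.

    `steps` is a list of (name, function) pairs; each function maps a list of
    rows to a new list of rows. Returns (secondary_rows, metadata).
    """
    rows = source_rows
    log = []
    for name, func in steps:
        rows = func(rows)
        log.append({"step": name, "rows_out": len(rows), "digest_out": digest(rows)})
    metadata = {
        "source_digest": digest(source_rows),
        "transformations": log,
        "secondary_digest": digest(rows),
    }
    return rows, metadata


def verify(source_rows, steps, metadata):
    """Re-run the steps on the source and check that the recorded metadata still matches."""
    _, fresh = apply_with_provenance(source_rows, steps)
    return fresh == metadata


# Invented sample rows and steps, for illustration only.
source = [
    {"species": " Mallard ", "count": 12},
    {"species": "Teal", "count": None},
]
steps = [
    ("drop_missing_counts", lambda rows: [r for r in rows if r["count"] is not None]),
    ("trim_species", lambda rows: [{**r, "species": r["species"].strip()} for r in rows]),
]
secondary, meta = apply_with_provenance(source, steps)
```

Anyone holding the source rows and the same steps can call `verify(source, steps, meta)` to confirm that the secondary data were derived as recorded; a change to the source or to any step alters at least one digest.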
Fundamentally, this research will challenge the sequential nature of data handling (load, transform, analyse, output) in order to reduce the risk of irreversible error propagation induced by loading, transformation or analysis. This should enable users with limited IT resources and moderate algorithmic and programming skills to reliably extract new knowledge from large and complex datasets.
The relevant EPSRC sub theme is Data Information and knowledge

Publications


Studentship Projects

| Project Reference | Relationship | Related To | Start | End | Student Name |
|---|---|---|---|---|---|
| EP/N509516/1 | | | 30/09/2016 | 29/09/2021 | |
| 2051786 | Studentship | EP/N509516/1 | 01/01/2018 | 29/06/2021 | P Palmer |
 
Description

Novel Contributions
===================

- Template theory

- Introduced the concept of **reusability** as an extension of
**reproducibility**. This sets the context in which reusability
complements the existing literate programming techniques that
underpin reproducibility. Essentially reusability is a subset of
reproducibility.

- The **task** rather than **data** oriented approach implied by
reusability requires the introduction of new terminology to
clearly convey the state of data in relation to the task.

- Data *sensu lato*: Data defined in the loose sense.
Typically presentation of data inside spreadsheets and many
other formats is Data *sensu lato*.

- Data *sensu stricto*: Data defined in the strict sense. When
data are transformed into Data *sensu stricto* the state and
format are completely defined and analysis may commence.

- Data *sensu nascent*: Data not yet formed. This state arises
in many research situations, where there is a belief that
the potential data exists and may be collected and
transformed into Data *sensu stricto* for analysis as part
of a defined experimental process.

- The concept of reusability is developed into a **mathematical
framework** that describes the essential properties that are
required for a template to be reusable.

- The framework and derived properties lead in turn to a
**systematic method** for the functional implementation of
reusability.

- Empirical demonstration of theory

- Created a utility that generates a skeleton reusable template
framework to CRAN standards.

- A technically unforgiving 20+ step process.

- Reusable templates implemented using the mathematical theory as
a guide.

- Using RStudio as the GUI.

- R package framework used to manage the generation of
searchable package documentation.

- Code verified locally to ensure it meets CRAN requirements
for formal publication as a user-contributed package.

- Development versions of the templates published on public and
private GitHub repositories along with supporting
documentation.

- Used Travis CI (Continuous Integration) to ensure the GitHub
version meets CRAN requirements after updates are applied.

- Shared templates remotely with end users via the internet using
GitHub and R development package support services.
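
The distinction drawn above between Data *sensu lato* and Data *sensu stricto* could be sketched as follows. This is a minimal Python illustration under stated assumptions, not the project's R implementation: the `Record` fields, the `to_strict` helper, the `dd/mm/yyyy` date format and the sample rows are all hypothetical. Loosely formatted rows are promoted to a completely defined form, and anything that cannot be promoted is kept with a reason rather than silently dropped.

```python
from dataclasses import dataclass
from datetime import date, datetime


@dataclass(frozen=True)
class Record:
    """One row of Data sensu stricto: every field has a fixed type and meaning."""
    species: str
    observed_on: date
    count: int


def to_strict(raw_rows):
    """Split loosely formatted rows into strict records and rejected rows.

    Rows that cannot be converted are returned with the reason for rejection.
    """
    strict, rejected = [], []
    for row in raw_rows:
        try:
            species = str(row["species"]).strip()
            if not species:
                raise ValueError("empty species")
            observed_on = datetime.strptime(str(row["date"]).strip(), "%d/%m/%Y").date()
            count = int(str(row["count"]).strip())
            if count < 0:
                raise ValueError("negative count")
            strict.append(Record(species, observed_on, count))
        except (KeyError, ValueError) as err:
            rejected.append((row, repr(err)))
    return strict, rejected


# Invented sample rows, as they might arrive from a spreadsheet export.
raw = [
    {"species": " Mallard", "date": "03/01/2019", "count": "12"},
    {"species": "Teal", "date": "2019-01-03", "count": "4"},  # unexpected date format
    {"species": "Coot", "date": "03/01/2019"},                # count missing
]
strict, rejected = to_strict(raw)
```

Only the first row is promoted to a strict record here; the other two are returned in `rejected` for the analyst to inspect before analysis commences.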
Exploitation Route Shared templates have already been accessed remotely by end users via the internet, using GitHub and R development package support services. This has allowed end users to access the benefits of R without being R programmers.
Sectors Digital/Communication/Information Technologies (including Software)

Education

Environment

Government

Democracy and Justice

Culture

Heritage

Museums and Collections

 
Description Local government has a statutory requirement to report on biodiversity status. In Leicestershire this reporting is carried out in conjunction with various bodies, including the Leicestershire and Rutland Wildlife Trust (LRWT). Analysis of biodiversity data for the LRWT by PJP using draft templates highlighted issues with missing WeBS (Wetland Bird Survey) data that were previously unknown to stakeholders. A revised dataset was successfully analysed and used to support the statutory reporting of the current status of protected environmental areas. An extension of this analysis was used to provide an inventory of species in Leicestershire that occur in managed nature reserves compared with the county as a whole. This inventory is being incorporated by the conservation committee for the county to guide the selection of additional land purchases for the management of biodiversity.
First Year Of Impact 2019
Sector Environment
Impact Types Societal

Policy & public services