Improving Data Quality and Data Analytics Performance

Lead Research Organisation: University of Aberdeen
Department Name: Computing Science

Abstract

More and more industrial activities rely on the analysis of huge amounts of data (Big Data). However, quite often data can be "noisy" (when data items are known to be incorrect and/or may have gaps), negatively affecting the performance of machine learning algorithms, and data analysis in general. The proposed research will provide novel techniques to achieve noise reduction and to reconstruct data points in very large collections of numerical data from sensorial sources, with proven guarantees (i.e. conservative estimation of missing data points, correction of data points within limits of neighbouring data points, and so on). The research uses data collections and scenarios from the Oil & Gas sector, but the techniques and results are of far wider importance, benefiting for instance health, finance and other sectors. The research aims to provide better data quality and drastically improve the performance of data interpretation.

Publications


Studentship Projects

Project Reference: EP/N509814/1 (Start: 01/10/2016, End: 30/09/2021)
Studentship: 1957361, related to EP/N509814/1 (Start: 01/11/2016, End: 31/10/2019), Student Name: Milen Marev
 
Description My research focuses on evaluating and improving numerical data quality by taking into consideration how the data is used. Such data is derived from a variety of sensor and computing sources: physical measurements (e.g. temperature, speed), calculations, or other algorithmic processes. Whatever the origin of the data, the quality of a dataset and of each of its individual components can be influenced by a number of different factors (e.g. noise, inaccuracy, imprecision, gaps, or inconsistencies). Some of these factors can critically affect numerical datasets, making numerical data quality a central issue in data science, particularly where "big" datasets are concerned.

In the last year, my research has focused on the hypothesis that data quality is best measured and improved by knowing how the data is used in context. To address this problem, I devised a context-dependent numerical data evaluation framework, which initially contained eight dimensions. The rationale for introducing the framework is rooted in our work on improving the quality of Oil and Gas exploration and production datasets. In this area, the result of a computational sequence (a "workflow") may be deemed unreliable if the quality of the petrophysical datasets that make up the workflow's numerical input falls short of specific standards. In such circumstances, either data quality improvement techniques (e.g. curation) are used to improve the problematic dataset or, failing this, alternative workflows may have to be pursued that could deliver reliable results without requiring any modification of the input dataset.

During the last year, I have written two papers, after conducting an extensive literature review of current work in the domain of data quality. One of these papers, "Towards a context-dependent numerical data quality evaluation framework", was submitted to the Intelligent Data Analysis 2018 conference in the Netherlands (IDA 18 - https://ida2018.org/), which is rated grade A by ERA. In the last few months, I have reduced the number of dimensions to four: Accuracy, Consistency, Completeness and Precision. This reduction was made to ensure that every dimension in the framework can evaluate dataset quality quantitatively. I have almost fully designed the algorithms needed to measure uncertainty using these metrics. Conducting this evaluation will address the criticism received in the peer review of my paper and will allow me to publish an updated version as the research progresses.
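As an illustration of what "quantitative" scoring along these four dimensions could look like, the following minimal Python sketch computes one possible score per dimension for a one-dimensional numerical series. The formulas are assumptions chosen for illustration, not the framework's published definitions.

# Illustrative sketch only: these per-dimension scores (accuracy, consistency,
# completeness, precision) are assumed definitions for a 1-D numerical series,
# each normalised to [0, 1]; they are not the framework's exact formulas.
import numpy as np

def dimension_scores(values, expected_step=None):
    x = np.asarray(values, dtype=float)
    n = x.size

    # Completeness: fraction of entries that are present (not NaN).
    completeness = 1.0 - np.isnan(x).sum() / n

    v = x[~np.isnan(x)]

    # Accuracy (assumed proxy): fraction of points within 3 standard deviations
    # of the mean, i.e. not flagged as gross outliers.
    z = np.abs(v - v.mean()) / (v.std() + 1e-12)
    accuracy = float((z <= 3.0).mean())

    # Consistency (assumed proxy): regularity of successive differences;
    # large jumps relative to the reference step lower the score.
    steps = np.abs(np.diff(v))
    ref = expected_step if expected_step is not None else np.median(steps) + 1e-12
    consistency = float((steps <= 3.0 * ref).mean()) if steps.size else 1.0

    # Precision (assumed proxy): inverse of the relative spread (coefficient of variation).
    precision = float(1.0 / (1.0 + v.std() / (np.abs(v.mean()) + 1e-12)))

    return {"accuracy": accuracy, "consistency": consistency,
            "completeness": completeness, "precision": precision}

print(dimension_scores([20.1, 20.2, np.nan, 20.3, 35.0, 20.4]))  # series with a gap and a spike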

In the period 2019-2020, the research advanced considerably. We further developed our approach to numerical data quality by devising several intrinsic indicators, namely the Intrinsic Quality Factor, the Distance Based Factor and Information Entropy. These indicators measure the quality of a dataset by comparing best-fit curve values with the actual recorded values; feeding these two values into the indicators' algorithms gives a clear overview of the state of the dataset. The indicators provide an initial indication of the dataset's quality before any "trivial" data quality improvement methods are applied. We call these methods "trivial" because they do not consider how the dataset is used: without knowing how the dataset is used, important information can be disregarded and lost. Our novel methodology still includes the notion of scientific workflows. Our approach now involves four steps (a sketch of the indicators follows the list below):
• Initial data quality measurement: this phase uses our novel data quality indicators in conjunction with well-known metrics such as the mean and standard deviation.
• Initial data quality improvement: this phase improves the data enough for further processing, closing any gaps and eliminating some of the outliers.
• Context-dependent data quality improvement: in this step we take into consideration how the data is used. Knowing what the data needs to represent, we apply the most appropriate "curation" method at the specific place where it is needed.
• If these steps fail, the choice of how to proceed is left to the user.
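A minimal sketch of the three intrinsic indicators is given below. The exact published formulas are not reproduced here; the forms of the Intrinsic Quality Factor, the Distance Based Factor and the Information Entropy are assumptions, built only on the idea described above of comparing a best-fit curve with the recorded values.

# Minimal sketch, not the published algorithms: assumed forms of the three intrinsic
# indicators, computed from a best-fit curve versus the recorded values.
import numpy as np

def intrinsic_indicators(depth, recorded, poly_degree=3, n_bins=16):
    d = np.asarray(depth, dtype=float)
    y = np.asarray(recorded, dtype=float)

    # Best-fit curve (a simple polynomial fit here; the choice of model is an assumption).
    coeffs = np.polyfit(d, y, poly_degree)
    fitted = np.polyval(coeffs, d)
    residuals = y - fitted

    # Intrinsic Quality Factor (assumed): 1 minus the normalised residual energy.
    iqf = 1.0 - np.sum(residuals**2) / (np.sum((y - y.mean())**2) + 1e-12)

    # Distance Based Factor (assumed): mean absolute distance to the fitted curve,
    # scaled by the data range so it is comparable across datasets.
    dbf = np.mean(np.abs(residuals)) / (np.ptp(y) + 1e-12)

    # Information Entropy: Shannon entropy of the residual distribution.
    hist, _ = np.histogram(residuals, bins=n_bins)
    p = hist / hist.sum()
    p = p[p > 0]
    entropy = float(-np.sum(p * np.log2(p)))

    return {"intrinsic_quality_factor": float(iqf),
            "distance_based_factor": float(dbf),
            "information_entropy": entropy}

# Example on a synthetic temperature-versus-depth trace with added noise.
depth = np.linspace(0, 1000, 200)
temp = 15 + 0.03 * depth + np.random.normal(0, 0.5, depth.size)
print(intrinsic_indicators(depth, temp))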
In our experiments we use datasets provided by the Oil & Gas sector, more specifically Distributed Temperature Sensing (DTS) data. This data is gathered by placing a fibre optic cable into the wellbore, which records the surrounding temperature at predefined depths. These datasets suffer from specific data quality issues which, if not eliminated correctly, will affect the subsequent results of the workflow. More information will be included in the thesis that is currently being produced.
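To make the DTS setting concrete, the following toy example (synthetic numbers, not real well data) shows how a depth-indexed temperature trace might be screened for two of the issues mentioned above, missing depth readings and obvious temperature spikes, before it enters a workflow.

# Hypothetical illustration only: a DTS-like trace as (depth, temperature) pairs,
# screened for missing depth steps and obvious temperature outliers.
import numpy as np

depth = np.array([0.0, 1.0, 2.0, 4.0, 5.0, 6.0, 7.0])        # metres; the 3.0 m reading is missing
temp = np.array([15.2, 15.3, 15.4, 15.6, 48.0, 15.7, 15.8])  # deg C; 48.0 is a spike

# Missing depth steps: gaps larger than the nominal sampling interval.
nominal_step = np.median(np.diff(depth))
gap_at = depth[:-1][np.diff(depth) > 1.5 * nominal_step]

# Outliers: points far from the local trend (here, a simple median-filter residual).
half_window = 3
local_median = np.array([np.median(temp[max(0, i - half_window):i + half_window + 1])
                         for i in range(temp.size)])
outlier_at = depth[np.abs(temp - local_median) > 5.0]  # 5 deg C threshold is arbitrary

print("gaps after depth:", gap_at)       # -> [2.]
print("outliers at depth:", outlier_at)  # -> [5.]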
Exploitation Route This research can be taken forward by adopting the methodology in real-world settings to evaluate numerical data quality.
Sectors Aerospace, Defence and Marine; Energy; Environment; Manufacturing, including Industrial Biotechnology

 
Description In this new millennium, characterised by the Internet of Things (IoT), many devices worldwide generate digital data, and Oil & Gas (O&G) industry devices are no exception. Digital technologies improve on an almost daily basis, generating ever more data as a result. By applying new tools, the O&G industry is now able to tap into hydrocarbon reservoirs that were considered impossible or uneconomical to exploit in the past. Unconventional wells, where the oil sits in small underground pockets usually mixed with sand, are a typical example; such pockets are spread around the geological structure, making it hard to predict the outcome of an extraction activity. Conversely, conventional reservoirs are much easier to model, as hydrocarbons are contained in a much better-defined structure. One way to optimise hydrocarbon production from unconventional reservoirs is thus to develop comprehensive models of the geological features in a basin. Using Big Data analytics, predictions can be made about future hydrocarbon production. Analytics provides tools to analyse huge quantities of data in depth, thus preventing cases of misinterpretation. With fewer unknowns, a more accurate model can be created, giving operators a better understanding of what to expect and reducing losses. Last but not least, the number of errors in production operations would be drastically reduced, notably increasing production levels.
With technological improvements, the sensors used in Exploration & Production (E&P) activities generate more data, which requires more digital storage space. However, as capital expenditure on hardware is not preferable in the current economic climate, this triggers the need for scalability and concurrency, something that current software productivity tools like spreadsheets dramatically lack. Sensor data are often structured in a way that makes them difficult to store efficiently in a relational database (RDBMS), and RDBMS systems are not designed to cope with the huge quantities of data generated by E&P devices. One example of the latest technologies for well analysis involves the use of fibre optic cables, with which the operator can gather insight at any location along the length of the cable at any time; such information can include thermal and acoustic data, among others. Depending on the use case, the information gathered may exceed terabytes of storage per day.
With these vast amounts of data comes the problem of processing and interpreting the information collected. At present only a small part of these large datasets is analysed, so important data and patterns within them are missed, yet all parts of the recorded data are vital for the successful operation of the asset. All well operations are monitored by sensors, each feeding a stream of data back to the controller, which raises the question of how reliable the gathered data is. The most common data problems identified during the research were gaps, outliers, noise and, last but not least, bias. Each of these artefacts presents different issues that can affect the evaluation process. During this research we concluded that data quality is context-dependent: by knowing how data is going to be used, we can pick the most appropriate technique to address the problems. By properly evaluating and subsequently correcting the issues with the datasets, we can gain a full understanding of them.
With this additional knowledge, companies can run more effective E&P activities, thus delivering significant economic and environmental impact.
First Year Of Impact 2018
Sector Aerospace, Defence and Marine; Energy; Environment; Manufacturing, including Industrial Biotechnology
Impact Types Economic

 
Description Hyperdap 
Organisation HyperDAP
Sector Private 
PI Contribution More insight into the datasets provided
Collaborator Contribution Datasets and help with research
Impact More insight into O&G datasets
Start Year 2019
 
Description Sensalytx Limited. 
Organisation Sensalytx Limited
Country United Kingdom 
Sector Private 
PI Contribution A better understanding of the datasets received from the company, thus achieving better economic performance.
Collaborator Contribution The provision of realistic datasets from the Oil & Gas industry. This collaboration also provided valuable insight into the current issues experienced in the industry, and it has been providing valuable feedback regarding the project.
Impact This collaboration has informed the research activities in my PhD.
Start Year 2018
 
Title NDEF 
Description The algorithms consist of the detection and subsequent correction of data artefacts found within a dataset. The problems identified by the algorithms include the major issues that affect numerical datasets, such as gaps, noise, outliers and bias. The algorithms expect a flat file containing the dataset and output the processed, "cleaned" data. As argued in the paper "Towards a context-dependent numerical data quality evaluation framework", data quality is context-dependent, so the user of the algorithms is expected to be familiar with the particular expectations (the context) of the project. With that understanding, the user can tailor the processing to their requirements. The program then decides on the most suitable curation algorithm and applies it to the dataset. Curation algorithms include convolution with a low-pass filter, linear/polynomial regression, and the Fast Fourier Transform (via FFTW, the "Fastest Fourier Transform in the West" library). 
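The following is a hedged sketch of what such a curation step could look like, using simple stand-ins: a moving-average convolution as the low-pass filter, polynomial regression to fill gaps, and NumPy's FFT in place of the FFTW library. It illustrates the kind of operations named above rather than the NDEF implementation itself.

# Sketch of the kind of curation steps named above, using assumed stand-ins;
# not the NDEF implementation.
import numpy as np

def fill_gaps_polynomial(x, y, degree=2):
    """Replace NaN values in y with values from a polynomial fitted to the valid points."""
    mask = ~np.isnan(y)
    coeffs = np.polyfit(x[mask], y[mask], degree)
    filled = y.copy()
    filled[~mask] = np.polyval(coeffs, x[~mask])
    return filled

def low_pass_convolution(y, window=5):
    """Smooth high-frequency noise with a moving-average (boxcar) convolution."""
    kernel = np.ones(window) / window
    return np.convolve(y, kernel, mode="same")

# Example: a noisy series with one gap (NaN).
x = np.linspace(0.0, 10.0, 101)
y = np.sin(x) + np.random.normal(0.0, 0.1, x.size)
y[40] = np.nan

cleaned = low_pass_convolution(fill_gaps_polynomial(x, y), window=5)

# Frequency content of the cleaned signal (NumPy FFT as a stand-in for FFTW).
spectrum = np.fft.rfft(cleaned)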
Type Of Technology New/Improved Technique/Technology 
Year Produced 2019 
Impact Only a small amount of the data that is gathered is actually evaluated, so small deviations and patterns may be missed. By detecting the problems with the data, we can identify those anomalies much more easily and correct them if needed. A better understanding of the data brings improved performance of the assets, and thus higher profits.