ED3: Enabling analytics over Diverse Distributed Datasources

Lead Research Organisation: University of Oxford
Department Name: Computer Science

Abstract

Enterprises and government entities have a growing need for systems that provide decision support based on descriptive and predictive analytics over large volumes of data. Examples include supporting decisions on pricing and promotions based on analyses of revenue and demand data; supporting decisions on the operation of complex equipment based on analyses of sensor data; and supporting decisions on website content based on analyses of user behaviour. Such support may be critical for safety and regulatory compliance as well as for competitiveness.

Current data analytics technology and workflows are well-suited to settings where the data has a uniform structure and is easy to access. Problems can arise, however, when performing data analytics in real-world settings, where datasources are not only large but often also distributed, heterogeneous, and dynamic.

Consider, for example, the case of Siemens Energy Services, which runs over 50 service centres, each of which provides remote monitoring and diagnostics for thousands of gas/steam turbines and ancillary equipment located in hundreds of power plants. Effective monitoring and diagnosis are essential for maintaining high availability of equipment and avoiding costly failures. A typical descriptive analytics procedure might be: "based on sensor data from an SGT-400 gas turbine, detect abnormal vibration patterns during the period prior to the shutdown and compare them with data on similar patterns in similar turbines over the last 5 years".

Such diagnostic tasks employ sophisticated data analytics tools, and operate on many TBs of current and historical data. In order to perform the analysis it is first necessary to identify, acquire and transform the relevant data. This data may be stored on-site (at a power-plant), at the local service centre or at other service centres; it comes in a wide range of different formats, ranging from flat files to XML and relational stores; access may be via a range of different interfaces, and incur a range of different costs; and it is constantly being augmented, with new data arriving at a rate of more than 30 GB per centre per day.

Acquiring the relevant data is thus very challenging, and is typically achieved via a combination of complex queries and bespoke data processing code, with numerous variants being required in order to deal with distribution and heterogeneity of the data. Given the large number of different analytics tasks that service centres need to perform, the development and maintenance of such procedures becomes a critical bottleneck.

In ED3 we will address this problem by developing an abstraction layer that mediates between analytics tools and datasources. This abstraction layer will adapt Ontology Based Data Access (OBDA) techniques, using an ontology to provide a uniform conceptual schema, declarative mappings to establish connections between ontological terms and data sources, and logic-based rewriting techniques to transform ontological queries into queries over the data sources. For OBDA to be effective in this new setting, however, it will need to be extended in several different directions. Firstly, it needs to provide greatly extended support for basic arithmetic and aggregation operations. Secondly, it needs to deal more effectively with heterogeneous and distributed data sources. Thirdly, it will be necessary to support the development, maintenance and evolution of suitable ontologies and mappings.
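The three OBDA ingredients named above (ontology, declarative mappings, query rewriting) can be illustrated with a deliberately small sketch. Everything in it is an assumption made for illustration: the class names, the table and column names, and the restriction to subclass axioms and single-class queries are hypothetical and are not taken from the ED3 prototype, which would need far richer ontologies, mappings and rewriting algorithms.

```python
import sqlite3

# Hypothetical ontology: subclass axioms (child -> parent).
SUBCLASS_OF = {
    "GasTurbine": "Turbine",
    "SteamTurbine": "Turbine",
}

# Hypothetical mappings: ontology class -> SQL query over one data source,
# each returning a single column of equipment identifiers.
MAPPINGS = {
    "GasTurbine": "SELECT serial_no FROM plant_a_gas_units",
    "SteamTurbine": "SELECT unit_id FROM plant_b_equipment WHERE kind = 'steam'",
}


def subclasses_of(cls):
    """Return cls together with every class that is transitively a subclass of it."""
    result = {cls}
    changed = True
    while changed:
        changed = False
        for child, parent in SUBCLASS_OF.items():
            if parent in result and child not in result:
                result.add(child)
                changed = True
    return result


def rewrite(cls):
    """Rewrite the ontological query 'all instances of cls' into SQL over the sources."""
    parts = [MAPPINGS[c] for c in sorted(subclasses_of(cls)) if c in MAPPINGS]
    if not parts:
        raise ValueError(f"no mapping reaches class {cls!r}")
    return " UNION ".join(parts)


def demo():
    """Build two differently structured toy sources and answer a query for Turbine."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE plant_a_gas_units (serial_no TEXT);
        CREATE TABLE plant_b_equipment (unit_id TEXT, kind TEXT);
        INSERT INTO plant_a_gas_units VALUES ('G-1'), ('G-2');
        INSERT INTO plant_b_equipment VALUES ('S-7', 'steam'), ('P-3', 'pump');
    """)
    sql = rewrite("Turbine")
    rows = sorted(r[0] for r in conn.execute(sql))
    conn.close()
    return sql, rows
```

In this sketch the analytics tool asks only for `Turbine`; the ontology supplies the subclasses, the mappings supply the source-specific SQL, and the rewriting step combines them, so the tool never needs to know that the two sources use different schemas.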

In ED3 we will address all of these issues, laying the foundations for a new generation of data access middleware with the conceptual modelling, query processing, and rapid-development infrastructure necessary to support analytic tasks. Moreover, we will develop a prototypical implementation of a suitable abstraction layer, and will evaluate our prototype in real-life deployments with our industrial partners.

Planned Impact

We foresee two classes of non-academic beneficiaries: data owners struggling to "make sense of their data", and a growing subset of the information technology industry for which data analytics represents an important component of their products and/or services.

Regarding data owners, we have already described the difficulties facing energy services companies such as Siemens and EDF. Similar challenges can be found in domains ranging from government and healthcare to the aerospace, energy and finance industries, and it is our belief that ED3 has the potential to have wide impact in all these sectors of the economy.

Regarding the technology industry, the needs of data owners have created great interest in developing more flexible information management layers. We are already working with several of the major players in this area, including IBM and Oracle, and also with LogicBlox, a new and rapidly growing company whose customers include retailers such as Home Depot, Walgreens, and Toys R Us in the US, Harrods in the UK, and M-Video in Russia.

ENGAGEMENT, DISSEMINATION AND EXPLOITATION

Engagement with non-academic beneficiaries is an integral part of ED3, with industry partners making a significant contribution to the project. This engagement will provide a direct pathway to impact via dissemination and possible exploitation.

Regarding dissemination, we will make regular visits to Siemens and EDF, during which we will give presentations and demonstrations, not only to those parts of the companies that are directly involved in the project, but also to other divisions for which the proposed technology could be of interest. LogicBlox will provide another set of opportunities for dissemination to their customer base in the retail domain.

We will also exploit our wider network of non-academic collaborators, including the partners in our DBOnto platform grant, for dissemination and exploitation activities. The platform grant can support visits and exploratory collaborations, which will provide an ideal mechanism for exploring applications of ED3 technology.

Regarding exploitation, we will actively pursue opportunities arising from all of the above engagements, and explore a range of mechanisms, including both licensing and spin-offs. Exploitation of IP resulting from the project will be managed by Isis Innovation, a wholly-owned subsidiary of Oxford University, founded to exploit know-how arising out of Oxford's research activities.

We will additionally undertake a range of more broadly focussed activities in order to ensure the widest possible dissemination of our results and engagement with potential beneficiaries.

Firstly, we will showcase the achievements of the project to industry and research leaders via dedicated workshops; these will include both events specific to ED3, and broader showcase events organised as part of DBOnto.

Secondly, we will continue our established pattern of publishing the results of our research in leading conferences and journals. In order to maximise the impact on non-academic partners, we will target "in-use" and "industry" tracks at conferences such as ISWC, SIGMOD, VLDB and WWW, wherever possible co-authoring papers with industry partners.

Thirdly, we will participate in relevant international coordination and standardisation efforts within groups and organisations such as the World Wide Web Consortium (W3C) and the OWL Experiences and Directions Group (OWLED). Through these activities we can help to foster awareness of our work and ensure that it has the maximum possible impact on any future standards.

Finally, we will continue to make all research outputs freely available from our web site, including papers, presentations, tutorials and software.

TRACK RECORD:

Our research has already been highly influential outside academia, and has been the basis for international standards, widely used and/or commercialised software systems, and spin-off companies.

 
Description Motivated by the need for OBDA systems supporting database-style aggregate queries, we have proposed a bag semantics for OBDA, where duplicate tuples in the views defined by the mappings are taken into account. We have shown, however, that bag semantics makes query answering coNP-hard in data complexity. To regain tractability, we have proposed the rather general class of anchored queries and have shown that such queries are first-order rewritable under bag semantics over DL-Lite_core ontologies.
Exploitation Route Extending practical OBDA systems to support bag semantics.
Sectors Aerospace, Defence and Marine; Energy; Financial Services, and Management Consultancy; Healthcare; Manufacturing, including Industrial Biotechnology; Culture, Heritage, Museums and Collections; Retail

URL http://www.cs.ox.ac.uk/projects/ED3/
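
The motivation given in the description above, that duplicate tuples in mapping views matter for aggregate queries, can be seen in a minimal sketch. The tables, rows and the use of SQL `UNION` versus `UNION ALL` are assumptions chosen for illustration only; the sketch shows why the choice of semantics changes a count, and does not reproduce the formal bag semantics or the anchored-query rewriting developed in the project.

```python
import sqlite3


def count_alarms(bag):
    """Count alarm tuples exposed by a two-source view under set or bag semantics."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical sources: an on-site log and a service-centre log.
    conn.executescript("""
        CREATE TABLE site_log (turbine TEXT, alarm TEXT);
        CREATE TABLE centre_log (turbine TEXT, alarm TEXT);
        INSERT INTO site_log VALUES ('T1', 'vibration'), ('T1', 'vibration');
        INSERT INTO centre_log VALUES ('T1', 'vibration'), ('T2', 'overheat');
    """)
    # UNION discards duplicate tuples (set semantics);
    # UNION ALL keeps every copy (bag semantics).
    combine = "UNION ALL" if bag else "UNION"
    sql = f"""
        SELECT COUNT(*) FROM (
            SELECT turbine, alarm FROM site_log
            {combine}
            SELECT turbine, alarm FROM centre_log
        )
    """
    (n,) = conn.execute(sql).fetchone()
    conn.close()
    return n
```

Here `count_alarms(bag=False)` returns 2 and `count_alarms(bag=True)` returns 4, so the same aggregate query gives different answers depending on whether duplicates in the view are retained.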
 
Description EDF ED3 
Organisation EDF Energy
Department EDF Innovation and Research
Country France 
Sector Private 
PI Contribution Expertise in accessing distributed and heterogeneous data sources.
Collaborator Contribution Use cases, testing and evaluation in the electricity distribution domain.
Impact .
Start Year 2016
 
Description LogicBlox DBOnto & ED3 
Organisation LogicBlox
Country United States
Sector Private 
PI Contribution Expertise in access to distributed and heterogeneous data sources.
Collaborator Contribution Use cases, testing and evaluation from their customer base in the retail domain, which includes Target, Home Depot, Walgreens and Toys R Us in the USA, Harrods in the UK, and M-Video in Russia.
Impact Impact on LogicBlox products, as well as joint research and publications, e.g., Todd J. Green, Dan Olteanu, Geoffrey Washburn: Live Programming in the LogicBlox System: A MetaLogiQL Approach. PVLDB 8(12): 1782-1793 (2015).
Start Year 2014
 
Description Oracle DBOnto 
Organisation Oracle Corporation
Country Global 
Sector Private 
PI Contribution Expertise in semantic technologies, in particular in RDF and OWL reasoning.
Collaborator Contribution Access to Oracle products and to large scale computing facilities for testing and evaluation purposes.
Impact Several joint publications that include details of the testing work carried out at Oracle.
Start Year 2014
 
Description Siemens ED3 
Organisation Siemens AG
Country Germany 
Sector Private 
PI Contribution Helping Siemens to analyse data from steam turbines.
Collaborator Contribution Providing domain knowledge, data and resources for testing and evaluation.
Impact Tools for the development and evolution of conceptual models at Siemens.
Start Year 2011
 
Description Keynote at conference in Lima, Peru 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Other audiences
Results and Impact Invited talk at SimBig18 in Lima, Peru
Year(s) Of Engagement Activity 2018
 
Description Keynote at workshop in Germany 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Other audiences
Results and Impact Keynote in workshop on logic
Year(s) Of Engagement Activity 2017
URL http://2017.soqe.org/
 
Description Keynote speech at Database conference 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Other audiences
Results and Impact I was the keynote speaker at one of the main conferences for database researchers, Principles of Database Systems (PODS). I gave an overview of work on reasoning within data management.
Year(s) Of Engagement Activity 2018
URL https://sigmod2018.org/