ED3: Enabling analytics over Diverse Distributed Datasources

Lead Research Organisation: University of Oxford
Department Name: Computer Science

Abstract

Enterprises and government entities have a growing need for systems that provide decision support based on descriptive and predictive analytics over large volumes of data. Examples include supporting decisions on pricing and promotions based on analyses of revenue and demand data; supporting decisions on the operation of complex equipment based on analyses of sensor data; and supporting decisions on website content based on analyses of user behaviour. Such support may be critical for safety and regulatory compliance as well as for competitiveness.

Current data analytics technology and workflows are well-suited to settings where the data has a uniform structure and is easy to access. Problems can arise, however, when performing data analytics in real-world settings, where datasources are not only large but often also distributed, heterogeneous, and dynamic.

Consider, for example, the case of Siemens Energy Services, which runs over 50 service centres, each of which provides remote monitoring and diagnostics for thousands of gas/steam turbines and ancillary equipment located in hundreds of power plants. Effective monitoring and diagnosis is essential for maintaining high availability of equipment and avoiding costly failures. A typical descriptive analytics procedure might be: "based on sensor data from an SGT-400 gas turbine, detect abnormal vibration patterns during the period prior to the shutdown and compare them with data on similar patterns in similar turbines over the last 5 years".

Such diagnostic tasks employ sophisticated data analytics tools, and operate on many TBs of current and historical data. In order to perform the analysis it is first necessary to identify, acquire and transform the relevant data. This data may be stored on-site (at a power-plant), at the local service centre or at other service centres; it comes in a wide range of different formats, ranging from flat files to XML and relational stores; access may be via a range of different interfaces, and incur a range of different costs; and it is constantly being augmented, with new data arriving at a rate of more than 30 GB per centre per day.

Acquiring the relevant data is thus very challenging, and is typically achieved via a combination of complex queries and bespoke data processing code, with numerous variants being required in order to deal with distribution and heterogeneity of the data. Given the large number of different analytics tasks that service centres need to perform, the development and maintenance of such procedures becomes a critical bottleneck.

In ED3 we will address this problem by developing an abstraction layer that mediates between analytics tools and datasources. This abstraction layer will adapt Ontology Based Data Access (OBDA) techniques, using an ontology to provide a uniform conceptual schema, declarative mappings to establish connections between ontological terms and data sources, and logic-based rewriting techniques to transform ontological queries into queries over the data sources. For OBDA to be effective in this new setting, however, it will need to be extended in several different directions. Firstly, it needs to provide greatly extended support for basic arithmetic and aggregation operations. Secondly, it needs to deal more effectively with heterogeneous and distributed data sources. Thirdly, it will be necessary to support the development, maintenance and evolution of suitable ontologies and mappings.
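The three OBDA ingredients named above (an ontology, declarative mappings, and query rewriting) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the class names, the table names, the sample rows and the helper functions are hypothetical, and the rewriting handles only a simple subclass hierarchy rather than the logic-based techniques the project targets.

```python
import sqlite3

# Hypothetical ontology: each class points to its direct superclass.
SUBCLASS_OF = {"GasTurbine": "Turbine", "SteamTurbine": "Turbine"}

# Hypothetical declarative mappings: ontology class -> SQL over one data source.
MAPPINGS = {
    "GasTurbine": "SELECT serial FROM plant_a_gas",
    "SteamTurbine": "SELECT unit_id FROM plant_b_steam",
}


def subclasses(cls):
    """Return cls together with every class the ontology places below it."""
    found = {cls}
    changed = True
    while changed:
        changed = False
        for sub, sup in SUBCLASS_OF.items():
            if sup in found and sub not in found:
                found.add(sub)
                changed = True
    return found


def rewrite(cls):
    """Rewrite the ontological query 'all instances of cls' into a single SQL query."""
    parts = [MAPPINGS[c] for c in sorted(subclasses(cls)) if c in MAPPINGS]
    return " UNION ".join(parts)


def demo():
    """Run the rewritten query against two made-up source tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE plant_a_gas (serial TEXT);
        CREATE TABLE plant_b_steam (unit_id TEXT);
        INSERT INTO plant_a_gas VALUES ('GT-1');
        INSERT INTO plant_b_steam VALUES ('ST-1');
        """
    )
    sql = rewrite("Turbine")
    return sql, sorted(row[0] for row in conn.execute(sql))
```

In this sketch the analyst asks only for "Turbine"; the ontology and mappings decide which source tables are consulted, which is the kind of decoupling the abstraction layer is meant to provide.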

In ED3 we will address all of these issues, laying the foundations for a new generation of data access middleware with the conceptual modelling, query processing, and rapid-development infrastructure necessary to support analytic tasks. Moreover, we will develop a prototypical implementation of a suitable abstraction layer, and will evaluate our prototype in real-life deployments with our industrial partners.

Planned Impact

We foresee two classes of non-academic beneficiaries: data owners struggling to "make sense of their data", and a growing subset of the information technology industry for which data analytics represents an important component of their products and/or services.

Regarding data owners, we have already described the difficulties facing energy services companies such as Siemens and EDF. Similar challenges can be found in domains ranging from government and healthcare to the aerospace, energy and finance industries, and it is our belief that ED3 has the potential to have wide impact in all these sectors of the economy.

Regarding the technology industry, the needs of data owners have created great interest in developing more flexible information management layers. We are already working with several of the major players in this area, including IBM and Oracle, and also with LogicBlox, a new and rapidly growing company whose customers include retailers such as Home Depot, Walgreens, and Toys R Us in the US, Harrods in the UK, and M-Video in Russia.

ENGAGEMENT, DISSEMINATION AND EXPLOITATION

Engagement with non-academic beneficiaries is an integral part of ED3, with industry partners making a significant contribution to the project. This engagement will provide a direct pathway to impact via dissemination and possible exploitation.

Regarding dissemination, we will make regular visits to Siemens and EDF, during which we will give presentations and demonstrations, not only to those parts of each company that are directly involved in the project, but also to other divisions for which the proposed technology could be of interest. LogicBlox will provide another set of opportunities for dissemination to their customer base in the retail domain.

We will also exploit our wider network of non-academic collaborators, including the partners in our DBOnto platform grant, for dissemination and exploitation activities. The platform grant can support visits and exploratory collaborations, which will provide an ideal mechanism for exploring applications of ED3 technology.

Regarding exploitation, we will actively pursue opportunities arising from all of the above engagements, and explore a range of mechanisms, including both licensing and spin-offs. Exploitation of IP resulting from the project will be managed by Isis Innovation, a wholly-owned subsidiary of Oxford University, founded to exploit know-how arising out of Oxford's research activities.

We will additionally undertake a range of more broadly focussed activities in order to ensure the widest possible dissemination of our results and engagement with potential beneficiaries.

Firstly, we will showcase the achievements of the project to industry and research leaders via dedicated workshops; these will include both events specific to ED3, and broader showcase events organised as part of DBOnto.

Secondly, we will continue our established pattern of publishing the results of our research in leading conferences and journals. In order to maximise the impact on non-academic partners, we will target "in-use" and "industry" tracks at conferences such as ISWC, SIGMOD, VLDB and WWW, wherever possible co-authoring papers with industry partners.

Thirdly, we will participate in relevant international coordination and standardisation efforts within groups and organisations such as the World Wide Web Consortium (W3C) and the OWL Experiences and Directions Group (OWLED). Through these activities we can help to foster awareness of our work and ensure that it has the maximum possible impact on any future standards.

Finally, we will continue to make all research outputs freely available from our web site, including papers, presentations, tutorials and software.

TRACK RECORD:

Our research has already been highly influential outside academia, and has been the basis for international standards, widely used and/or commercialised software systems, and spin-off companies.

Publications

Kharlamov E (2017) Semantic access to streaming and static data at Siemens in Journal of Web Semantics

Mehdi G (2017) SemDia

Kharlamov E (2017) SemFacet

Benedikt M. (2017) Source information disclosure in ontology-based data integration in 31st AAAI Conference on Artificial Intelligence, AAAI 2017

Diaz G (2016) SPARQLByE: querying RDF data by example in Proceedings of the VLDB Endowment

Kaminski M (2021) The Complexity and Expressive Power of Limit Datalog in Journal of the ACM

Ronca A (2022) The delay and window size problems in rule-based stream reasoning in Artificial Intelligence

Zheleznyakov D. (2017) Trust-sensitive evolution of DL-lite knowledge bases in 31st AAAI Conference on Artificial Intelligence, AAAI 2017

Amarilli A (2022) When Can We Answer Queries Using Result-Bounded Data Interfaces? in Logical Methods in Computer Science

 
Description Motivated by the need for OBDA systems supporting database-style aggregate queries, we have proposed a bag semantics for OBDA, where duplicate tuples in the views defined by the mappings are taken into account. We have shown, however, that bag semantics makes query answering coNP-hard in data complexity. To regain tractability, we have proposed the rather general class of anchored queries and have shown that such queries are first-order rewritable under bag semantics over DL-Lite_core ontologies.
Exploitation Route Extending practical OBDA systems to support bag semantics.
Sectors Aerospace, Defence and Marine; Energy; Financial Services, and Management Consultancy; Healthcare; Manufacturing, including Industrial Biotechnology; Culture, Heritage, Museums and Collections; Retail

URL http://www.cs.ox.ac.uk/projects/ED3/
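
The difference between set and bag semantics noted in the description above matters as soon as an aggregate is computed over a mapping view. The following is a minimal sketch under stated assumptions: the source rows, the view and the function names are hypothetical, and the code models neither ontological reasoning nor the anchored-query class; it shows only why duplicates in a view change a COUNT.

```python
from collections import Counter

# Hypothetical source rows: (sensor_id, turbine).
ROWS = [("s1", "turbine_a"), ("s2", "turbine_a"), ("s3", "turbine_b")]


def turbine_view(rows):
    """Mapping view that projects out the turbine column; duplicates are kept."""
    return [turbine for _, turbine in rows]


def count_under_set_semantics(view):
    """Duplicates collapse first, so every turbine is counted once."""
    return Counter(set(view))


def count_under_bag_semantics(view):
    """Duplicates are retained, so the count reflects the number of sensors."""
    return Counter(view)
```

Under set semantics both turbines receive a count of one, whereas under bag semantics `turbine_a` is counted twice, which is the answer a database-style aggregate query would be expected to return.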
 
Description In this project we carried out foundational research into extensions of Datalog that can support recursive rules with aggregation and other numeric operations while still guaranteeing that materialisation of implied tuples will terminate. As well as profoundly influencing academic research (e.g., best paper award at IJCAI and JACM paper), this work also influenced our development of the RDFox materialisation-based Datalog/RDF reasoner (the reasoner uses Datalog materialisation to realise efficient reasoning over large RDF graphs augmented with Datalog rules). The RDFox reasoner uses an extension of Datalog that allows for aggregation functions in rule bodies, and the design of this language was influenced by our work on Limit Datalog. RDFox is now being commercialised by Oxford Semantic Technologies (OST), an Oxford University spinout company. OST has raised GBP4,100,000 in investment, including GBP3,000,000 in Series A investment led by Samsung Ventures, announced in June 2019, and OST now employs 10 FTEs. OST's patented technology is sold under licence to customers for a fee of approximately GBP50,000 per licence. Since April 2018, the company has secured licence sales worth over GBP1,500,000. Customers include Festo, a German multinational production line equipment company, electronics giant Samsung, and several major financial services companies including Dow Jones and JP Morgan Chase.
First Year Of Impact 2018
Sector Digital/Communication/Information Technologies (including Software); Financial Services, and Management Consultancy; Healthcare; Manufacturing, including Industrial Biotechnology
Impact Types Economic
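
The termination issue mentioned in the description above can be seen in a small sketch of materialisation for a recursive rule with arithmetic. It is an illustration only, not the formal Limit Datalog semantics and not RDFox code: the rule, the function name and the example graph are assumptions, and edge weights are assumed to be non-negative.

```python
def materialise_shortest(edges, source):
    """Fixpoint materialisation of the (hypothetical) rules

        dist(source, 0).
        dist(Y, D + W) :- dist(X, D), edge(X, Y, W).

    keeping only the minimum D derived for each node. Without that
    restriction a cycle would let the rule derive ever larger distances
    and materialisation would never stop.
    """
    dist = {source: 0}
    changed = True
    while changed:
        changed = False
        for x, y, w in edges:
            if x in dist:
                candidate = dist[x] + w
                # Keep a derived fact only if it improves on the current minimum.
                if y not in dist or candidate < dist[y]:
                    dist[y] = candidate
                    changed = True
    return dist


# Hypothetical weighted graph containing the cycle a -> b -> c -> a.
EDGES = [("a", "b", 1), ("b", "c", 2), ("a", "c", 5), ("c", "a", 1)]
```

Calling `materialise_shortest(EDGES, "a")` reaches a fixpoint even though the graph is cyclic, because each node stores a single minimal value instead of every derivable distance.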

 
Description ConCur: Knowledge Base Construction and Curation
Amount £1,131,073 (GBP)
Funding ID EP/V050869/1 
Organisation Engineering and Physical Sciences Research Council (EPSRC) 
Sector Public
Country United Kingdom
Start 12/2021 
End 11/2024
 
Description Collaboration with Bosch 
Organisation Bosch Group
Department Bosch
Country Germany 
Sector Private 
PI Contribution PhD research
Collaborator Contribution Real-life problems and funding for PhD student
Impact PhD funding
Start Year 2021
 
Description Collaboration with Oxford Semantic Technologies 
Organisation Oxford Semantic Technologies
Country United Kingdom 
Sector Private 
PI Contribution Testing and evaluation of RDFox graph DB
Collaborator Contribution RDFox licence and support
Impact Publications
Start Year 2017
 
Description Collaboration with Samsung Research UK 
Organisation Samsung
Department Samsung, UK
Country United Kingdom 
Sector Private 
PI Contribution Collaboration with Samsung Research UK
Collaborator Contribution Research problems and funding for PhD students and PDRAs
Impact Publications and funding
Start Year 2019
 
Description Collaboration with Siemens 
Organisation Siemens AG
Country Germany 
Sector Private 
PI Contribution PhD research
Collaborator Contribution Real-life problems and funding for PhD student
Impact PhD funding
Start Year 2019
 
Description EDF ED3 
Organisation EDF Energy
Department EDF Innovation and Research
Country France 
Sector Private 
PI Contribution Expertise in accessing distributed and heterogeneous data sources.
Collaborator Contribution Use cases, testing and evaluation in the electricity distribution domain.
Impact .
Start Year 2016
 
Description LogicBlox DBOnto & ED3 
Organisation Logicblox
Country United States 
Sector Private 
PI Contribution Expertise in access to distributed and heterogeneous data sources.
Collaborator Contribution Use cases, testing and evaluation from their customer base in the retail domain, which includes Target, Home Depot, Walgreens and Toys R Us in the USA, Harrods in the UK, and M-Video in Russia.
Impact Impact on Logicblox products, as well as joint research and publications, e.g., Todd J. Green, Dan Olteanu, Geoffrey Washburn: Live Programming in the LogicBlox System: A MetaLogiQL Approach. PVLDB 8(12): 1782-1793 (2015).
Start Year 2014
 
Description Oracle DBOnto 
Organisation Oracle Corporation
Country United States 
Sector Private 
PI Contribution Expertise in semantic technologies, in particular in RDF and OWL reasoning.
Collaborator Contribution Access to Oracle products and to large scale computing facilities for testing and evaluation purposes.
Impact Several joint publications that include details of the testing work carried out at Oracle.
Start Year 2014
 
Description Siemens ED3 
Organisation Siemens AG
Country Germany 
Sector Private 
PI Contribution Helping Siemens to analyse data from steam turbines.
Collaborator Contribution Providing domain knowledge, data and resources for testing and evaluation.
Impact Tools for the development and evolution of conceptual models at Siemens.
Start Year 2011
 
Title COMPLEX QUERY EVALUATION USING SIDEWAYS INFORMATION PASSING 
Description A program stored on non-transitory computer-readable storage medium executes a method of evaluating a query over a graph. Decomposition instructions decompose the query into a plurality of subqueries. Evaluation instructions evaluate a subquery of the plurality of subqueries and generate a substitution multiset representing a result of the evaluation of the subquery. Filtration instructions or expansion instructions may operate upon the generated substitution multiset before passing the substitution multiset to a next subquery to be evaluated. The filtration instructions identify one or more mappings in the substitution multiset that cannot be safely passed to the second subquery and delete the identified one or more mappings from the substitution multiset. The expansion instructions determine, in a case where the subquery is operated upon by a non-distributive query operator, an expansion of the substitution multiset based at least on adding one or more new substitutions to the substitution multiset.
IP Reference US2022067042 
Protection Patent granted
Year Protection Granted 2022
Licensed Yes
Impact Founding of Oxford Semantic Technologies
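
The general idea of passing substitutions from one subquery to the next can be sketched as follows. This is a loose illustration and not the patented method: the graph, the pattern syntax and the function names are hypothetical, and the filtration and expansion steps described above are not modelled.

```python
from collections import Counter

# Hypothetical graph of (subject, predicate, object) triples.
GRAPH = [
    ("t1", "type", "Turbine"),
    ("t2", "type", "Turbine"),
    ("t1", "locatedIn", "plantA"),
    ("p9", "locatedIn", "plantB"),
]


def match(pattern, binding):
    """Yield every extension of binding that matches one triple pattern.

    Terms starting with '?' are variables; all other terms are constants.
    """
    for triple in GRAPH:
        extended = dict(binding)
        ok = True
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # Bind the variable, or check it against its existing value.
                if extended.setdefault(term, value) != value:
                    ok = False
                    break
            elif term != value:
                ok = False
                break
        if ok:
            yield extended


def evaluate(subqueries):
    """Evaluate subqueries in order, passing a substitution multiset along."""
    current = Counter({frozenset(): 1})  # one empty substitution
    for pattern in subqueries:
        passed_on = Counter()
        for substitution, multiplicity in current.items():
            for extended in match(pattern, dict(substitution)):
                passed_on[frozenset(extended.items())] += multiplicity
        current = passed_on
    return current


QUERY = [("?x", "type", "Turbine"), ("?x", "locatedIn", "?p")]
```

Because the bindings for `?x` produced by the first pattern are handed to the second, the triple about `p9` is never joined, and multiplicities are carried through the `Counter` so that duplicate answers are preserved.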
 
Company Name Covatic 
Description Covatic develops software that analyses a user's online engagement to deliver personalised advertising. 
Year Established 2016 
Impact Although a new startup, the company already has contracts with the BBC and with ITN.
Website https://covatic.com/
 
Company Name Oxford Semantic Technologies 
Description Oxford Semantic Technologies develops software that uses machine learning to analyse semantic data and its ontologies, which can be used when combining or ordering multiple datasets, and in simulating predictive relationships between data. 
Year Established 2016 
Impact The company has only recently been established, but we are already in discussions with several large companies in the financial services sector who are interested in both data integration and compliance verification.
Website http://www.oxfordsemantic.tech
 
Description Invited talk at Huawei research centre Edinburgh 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Industry/Business
Results and Impact Invited talk at Huawei research centre Edinburgh
Year(s) Of Engagement Activity 2021
 
Description Invited talk at IJCKG 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact Invited talk at IJCKG
Year(s) Of Engagement Activity 2021
URL https://language-semantic.org/ijckg2021/
 
Description Invited talk at K-CAP 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact Invited talk at K-CAP
Year(s) Of Engagement Activity 2021
URL https://www.k-cap.org/2021/
 
Description Invited talk at NeSY 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact Invited talk at NeSY
Year(s) Of Engagement Activity 2021
URL https://sites.google.com/view/nesy20/home
 
Description Invited talk at ODSC 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Invited talk at ODSC
Year(s) Of Engagement Activity 2020
URL https://odsc.com/dublin/schedule-overview/
 
Description Invited talk at WEBIST 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Invited talk at WEBIST
Year(s) Of Engagement Activity 2021
URL https://webist.scitevents.org/?y=2021
 
Description Keynote at conference in Lima, Peru 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Other audiences
Results and Impact Invited talk at SimBig18 in Lima, Peru
Year(s) Of Engagement Activity 2018
 
Description Keynote at workshop in Germany 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Other audiences
Results and Impact Keynote in workshop on logic
Year(s) Of Engagement Activity 2017
URL http://2017.soqe.org/
 
Description Keynote speech at Database conference 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Other audiences
Results and Impact I was the keynote speaker at one of the main conferences for database researchers, Principles of Database Systems (PODS). I gave an overview of work on reasoning within data management.
Year(s) Of Engagement Activity 2018
URL https://sigmod2018.org/
 
Description Presentation at Bosch research workshop 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact Presentation at Bosch research workshop
Year(s) Of Engagement Activity 2022
 
Description Presentation at Google Research, San Francisco 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Presentation at Google Research, San Francisco
Year(s) Of Engagement Activity 2019
 
Description Presentation at Samsung Research, California 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Presentation at Samsung Research, California
Year(s) Of Engagement Activity 2019
 
Description Presentation at Siemens Research, Munich 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Presentation at Siemens Research, Munich
Year(s) Of Engagement Activity 2019
 
Description Presentation at eBay, California 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Presentation at eBay, California
Year(s) Of Engagement Activity 2022
 
Description Presentation at eBay, California 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Presentation at eBay, California
Year(s) Of Engagement Activity 2019