ED3: Enabling analytics over Diverse Distributed Datasources
Lead Research Organisation:
University of Oxford
Department Name: Computer Science
Abstract
Enterprises and government entities have a growing need for systems that provide decision support based on descriptive and predictive analytics over large volumes of data. Examples include supporting decisions on pricing and promotions based on analyses of revenue and demand data; supporting decisions on the operation of complex equipment based on analyses of sensor data; and supporting decisions on website content based on analyses of user behaviour. Such support may be critical for safety and regulatory compliance as well as for competitiveness.
Current data analytics technology and workflows are well-suited to settings where the data has a uniform structure and is easy to access. Problems can arise, however, when performing data analytics in real-world settings, where as well as being large, datasources are often distributed, heterogeneous, and dynamic.
Consider, for example, the case of Siemens Energy Services, which runs over 50 service centres, each of which provides remote monitoring and diagnostics for thousands of gas/steam turbines and ancillary equipment located in hundreds of power plants. Effective monitoring and diagnosis is essential for maintaining high availability of equipment and avoiding costly failures. A typical descriptive analytics procedure might be: "based on sensor data from an SGT-400 gas turbine, detect abnormal vibration patterns during the period prior to the shutdown and compare them with data on similar patterns in similar turbines over the last 5 years".
Such diagnostic tasks employ sophisticated data analytics tools, and operate on many TBs of current and historical data. In order to perform the analysis it is first necessary to identify, acquire and transform the relevant data. This data may be stored on-site (at a power plant), at the local service centre or at other service centres; it comes in a wide range of formats, from flat files to XML and relational stores; access may be via a range of different interfaces, and incur a range of different costs; and it is constantly being augmented, with new data arriving at a rate of more than 30 GB per centre per day.
Acquiring the relevant data is thus very challenging, and is typically achieved via a combination of complex queries and bespoke data processing code, with numerous variants being required in order to deal with distribution and heterogeneity of the data. Given the large number of different analytics tasks that service centres need to perform, the development and maintenance of such procedures becomes a critical bottleneck.
In ED3 we will address this problem by developing an abstraction layer that mediates between analytics tools and datasources. This abstraction layer will adapt Ontology Based Data Access (OBDA) techniques, using an ontology to provide a uniform conceptual schema, declarative mappings to establish connections between ontological terms and data sources, and logic-based rewriting techniques to transform ontological queries into queries over the data sources. For OBDA to be effective in this new setting, however, it will need to be extended in several different directions. Firstly, it needs to provide greatly extended support for basic arithmetic and aggregation operations. Secondly, it needs to deal more effectively with heterogeneous and distributed data sources. Thirdly, it will be necessary to support the development, maintenance and evolution of suitable ontologies and mappings.
In ED3 we will address all of these issues, laying the foundations for a new generation of data access middleware with the conceptual modelling, query processing, and rapid-development infrastructure necessary to support analytic tasks. Moreover, we will develop a prototypical implementation of a suitable abstraction layer, and will evaluate our prototype in real-life deployments with our industrial partners.
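As a rough illustration of the OBDA idea described above, the following Python sketch rewrites a query over an ontology class into SQL over the underlying sources. It is a minimal sketch under stated assumptions, not the project's implementation: the class names, table names, mappings and the `rewrite` and `answer` helpers are all hypothetical, and the "ontology" is reduced to a subclass hierarchy.

```python
import sqlite3

# Hypothetical ontology: each class lists its direct subclasses.
SUBCLASSES = {"Turbine": ["GasTurbine", "SteamTurbine"]}

# Hypothetical declarative mappings: ontology class -> SQL over a source table.
MAPPINGS = {
    "GasTurbine": "SELECT id FROM gas_units",
    "SteamTurbine": "SELECT serial AS id FROM steam_log",
}

def rewrite(cls):
    """Rewrite the ontological query cls(x) into SQL over the sources."""
    todo, seen, parts = [cls], set(), []
    while todo:
        c = todo.pop()
        if c in seen:
            continue
        seen.add(c)
        if c in MAPPINGS:
            parts.append(MAPPINGS[c])
        # Instances of a subclass are also instances of the queried class.
        todo.extend(SUBCLASSES.get(c, []))
    return " UNION ".join(sorted(parts))

def answer(conn, cls):
    """Evaluate the rewritten query; classes with no mapping have no answers."""
    sql = rewrite(cls)
    if not sql:
        return []
    return sorted(row[0] for row in conn.execute(sql))

# Two illustrative sources with different schemas.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE gas_units (id TEXT);
CREATE TABLE steam_log (serial TEXT);
INSERT INTO gas_units VALUES ('gt-1');
INSERT INTO steam_log VALUES ('st-7');
""")
turbines = answer(conn, "Turbine")
print(turbines)
```

The analytics tool asks only for `Turbine`; the subclass axioms and mappings decide which tables are touched. Arithmetic, aggregation and distributed sources, the extensions listed above, are not modelled here.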
Planned Impact
We foresee two classes of non-academic beneficiaries: data owners struggling to "make sense of their data", and a growing subset of the information technology industry for which data analytics represents an important component of their products and/or services.
Regarding data owners, we have already described the difficulties facing energy services companies such as Siemens and EDF. Similar challenges can be found in domains ranging from government and healthcare to the aerospace, energy and finance industries, and it is our belief that ED3 has the potential to have wide impact in all these sectors of the economy.
Regarding the technology industry, the needs of data owners have created great interest in developing more flexible information management layers. We are already working with several of the major players in this area, including IBM and Oracle, and also with LogicBlox, a new and rapidly growing company whose customers include retailers such as Home Depot, Walgreens, and Toys R Us in the US, Harrods in the UK, and M-Video in Russia.
ENGAGEMENT, DISSEMINATION AND EXPLOITATION
Engagement with non-academic beneficiaries is an integral part of ED3, with industry partners making a significant contribution to the project. This engagement will provide a direct pathway to impact via dissemination and possible exploitation.
Regarding dissemination, we will be making regular visits to Siemens and EDF, during which we will give presentations and demonstrations, not only to those parts of the company who are directly involved in the project, but also to other divisions for which the proposed technology could be of interest. LogicBlox will provide another set of opportunities for dissemination to their customer base in the retail domain.
We will also exploit our wider network of non-academic collaborators, including the partners in our DBOnto platform grant, for dissemination and exploitation activities. The platform grant can support visits and exploratory collaborations, which will provide an ideal mechanism for exploring applications of ED3 technology.
Regarding exploitation, we will actively pursue opportunities arising from all of the above engagements, and explore a range of mechanisms, including both licensing and spin-offs. Exploitation of IP resulting from the project will be managed by Isis Innovation, a wholly-owned subsidiary of Oxford University, founded to exploit know-how arising out of Oxford's research activities.
We will additionally undertake a range of more broadly focussed activities in order to ensure the widest possible dissemination of our results and engagement with potential beneficiaries.
Firstly, we will showcase the achievements of the project to industry and research leaders via dedicated workshops; these will include both events specific to ED3, and broader showcase events organised as part of DBOnto.
Secondly, we will continue our established pattern of publishing the results of our research in leading conferences and journals. In order to maximise the impact on non-academic partners, we will target "in-use" and "industry" tracks at conferences such as ISWC, SIGMOD, VLDB and WWW, wherever possible co-authoring papers with industry partners.
Thirdly, we will participate in relevant international coordination and standardisation efforts within groups and organisations such as the World Wide Web Consortium (W3C) and the OWL Experiences and Directions Group (OWLED). Through these activities we can help to foster awareness of our work and ensure that it has the maximum possible impact on any future standards.
Finally, we will continue to make all research outputs freely available from our web site, including papers, presentations, tutorials and software.
TRACK RECORD:
Our research has already been highly influential outside academia, and has been the basis for international standards, widely used and/or commercialised software systems, and spin-off companies.
Organisations
- University of Oxford (Lead Research Organisation)
- Logicblox (Collaboration, Project Partner)
- Bosch Group (Collaboration)
- Oracle Corporation (Collaboration)
- Oxford Semantic Technologies (Collaboration)
- Siemens AG (Collaboration)
- Samsung (South Korea) (Collaboration)
- EDF Energy (United Kingdom) (Collaboration)
- Siemens (Germany) (Project Partner)
- EDF Group R&D, Clamart (Project Partner)
Publications
Benedikt M
(2018)
Logical foundations of information disclosure in ontology-based data integration
in Artificial Intelligence
Motik B
(2019)
Maintenance of datalog materialisations revisited
in Artificial Intelligence
Kaminski M
(2016)
Datalog rewritability of Disjunctive Datalog programs and non-Horn ontologies
in Artificial Intelligence
Cucala D.T.
(2017)
Consequence-based reasoning for description logics with disjunction, inverse roles, and nominals
in CEUR Workshop Proceedings
Kharlamov E.
(2017)
Ranking, aggregation, and reachability in faceted search with SemFacet
in CEUR Workshop Proceedings
Potter A
(2018)
Dynamic Data Exchange in Distributed RDF Stores
in IEEE Transactions on Knowledge and Data Engineering
Fazzinga B
(2018)
Ontological query answering under many-valued group preferences in Datalog+/-
in International Journal of Approximate Reasoning
Grau B
(2019)
Logical Foundations of Linked Data Anonymisation
in Journal of Artificial Intelligence Research
Bate A
(2018)
Consequence-Based Reasoning for Description Logics with Disjunctions and Number Restrictions
in Journal of Artificial Intelligence Research
Kaminski M
(2021)
The Complexity and Expressive Power of Limit Datalog
in Journal of the ACM
Description | Motivated by the need for OBDA systems supporting database-style aggregate queries, we have proposed a bag semantics for OBDA, where duplicate tuples in the views defined by the mappings are taken into account. We have shown, however, that bag semantics makes query answering coNP-hard in data complexity. To regain tractability, we have proposed the rather general class of anchored queries and have shown that such queries are first-order rewritable under bag semantics over DL-Lite_core ontologies. |
Exploitation Route | Extending practical OBDA systems to support bag semantics. |
Sectors | Aerospace, Defence and Marine; Energy; Financial Services, and Management Consultancy; Healthcare; Manufacturing, including Industrial Biotechnology; Culture, Heritage, Museums and Collections; Retail |
URL | http://www.cs.ox.ac.uk/projects/ED3/ |
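To illustrate why duplicates in mapping views matter for aggregate queries, the sketch below counts the answers of a single mapping view under bag and under set semantics. The `alarm` table, its contents and the mapping view are hypothetical, and the sketch shows only the motivation for bag semantics; it does not implement the anchored-query rewriting mentioned above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE alarm (turbine TEXT, day TEXT);
INSERT INTO alarm VALUES ('gt-1', 'mon'), ('gt-1', 'tue'), ('gt-2', 'mon');
""")

# Hypothetical mapping view: projecting away `day` produces duplicate tuples.
MAPPING_VIEW = "SELECT turbine FROM alarm"

# Bag semantics: duplicates in the view are counted.
bag_count = conn.execute(
    f"SELECT COUNT(*) FROM ({MAPPING_VIEW})"
).fetchone()[0]

# Set semantics: duplicates are eliminated before counting.
set_count = conn.execute(
    f"SELECT COUNT(*) FROM (SELECT DISTINCT turbine FROM ({MAPPING_VIEW}))"
).fetchone()[0]

print(bag_count, set_count)
```

The two counts differ (three alarm tuples versus two distinct turbines), so an OBDA system that silently applies set semantics would give a different answer from the one a database user expects for a `COUNT` query.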
Description | In this project we carried out foundational research into extensions of Datalog that can support recursive rules with aggregation and other numeric operations while still guaranteeing that materialisation of implied tuples will terminate. As well as profoundly influencing academic research (e.g., a best paper award at IJCAI and a JACM paper), this work also influenced our development of the RDFox materialisation-based Datalog/RDF reasoner (the reasoner uses Datalog materialisation to realise efficient reasoning over large RDF graphs augmented with Datalog rules). The RDFox reasoner uses an extension of Datalog that allows for aggregation functions in rule bodies, and the design of this language was influenced by our work on Limit Datalog. RDFox is now being commercialised by Oxford Semantic Technologies (OST), an Oxford University spinout company. OST has raised GBP4,100,000 in investment, including GBP3,000,000 in Series A investment led by Samsung Ventures, announced in June 2019, and OST now employs 10 FTEs. OST's patented technology is sold under licence to customers for a fee of approximately GBP50,000 per licence. Since April 2018, the company has secured licence sales worth over GBP1,500,000. Customers include Festo, a German multinational production line equipment company, electronics giant Samsung, and several major financial services companies including Dow Jones and JP Morgan Chase. |
First Year Of Impact | 2018 |
Sector | Digital/Communication/Information Technologies (including Software); Financial Services, and Management Consultancy; Healthcare; Manufacturing, including Industrial Biotechnology |
Impact Types | Economic |
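The termination issue mentioned above can be illustrated with a small Python sketch of materialisation for a recursive rule that performs arithmetic. It is written in the spirit of keeping only a limit value per fact; it is not the Limit Datalog semantics from the cited papers, nor RDFox's rule language. The rule, the `materialise` function and the edge data are hypothetical, and edge weights are assumed to be non-negative.

```python
def materialise(edges, source):
    """Fixpoint of the hypothetical rules
         dist(source, 0).
         dist(y, d + w) <- dist(x, d), edge(x, y, w).
    keeping only the minimum d derived for each node."""
    dist = {source: 0}
    changed = True
    while changed:
        changed = False
        for x, y, w in edges:
            if x in dist:
                candidate = dist[x] + w
                # Store a new fact only if it improves on the current minimum.
                if y not in dist or candidate < dist[y]:
                    dist[y] = candidate
                    changed = True
    return dist

# The cycle a -> b -> a would yield infinitely many dist facts
# if every derived distance were stored rather than only the minimum.
edges = [("a", "b", 4), ("a", "c", 1), ("c", "b", 2), ("b", "a", 1)]
dist = materialise(edges, "a")
print(dist)
```

Storing every derived `dist` fact would never reach a fixpoint on cyclic data, whereas keeping one minimum per node does, provided the weights are non-negative.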
Description | ConCur: Knowledge Base Construction and Curation |
Amount | £1,131,073 (GBP) |
Funding ID | EP/V050869/1 |
Organisation | Engineering and Physical Sciences Research Council (EPSRC) |
Sector | Public |
Country | United Kingdom |
Start | 12/2021 |
End | 11/2024 |
Description | Collaboration with Bosch |
Organisation | Bosch Group |
Department | Bosch |
Country | Germany |
Sector | Private |
PI Contribution | PhD research |
Collaborator Contribution | Real-life problems and funding for PhD student |
Impact | PhD funding |
Start Year | 2021 |
Description | Collaboration with Oxford Semantic Technologies |
Organisation | Oxford Semantic Technologies |
Country | United Kingdom |
Sector | Private |
PI Contribution | Testing and evaluation of RDFox graph DB |
Collaborator Contribution | RDFox licence and support |
Impact | Publications |
Start Year | 2017 |
Description | Collaboration with Samsung Research UK |
Organisation | Samsung |
Department | Samsung, UK |
Country | United Kingdom |
Sector | Private |
PI Contribution | Collaboration with Samsung Research UK |
Collaborator Contribution | Research problems and funding for PhD students and PDRAs |
Impact | Publications and funding |
Start Year | 2019 |
Description | Collaboration with Siemens |
Organisation | Siemens AG |
Country | Germany |
Sector | Private |
PI Contribution | PhD research |
Collaborator Contribution | Real-life problems and funding for PhD student |
Impact | PhD funding |
Start Year | 2019 |
Description | EDF ED3 |
Organisation | EDF Energy |
Department | EDF Innovation and Research |
Country | France |
Sector | Private |
PI Contribution | Expertise in accessing distributed and heterogeneous data sources. |
Collaborator Contribution | Use cases, testing and evaluation in the electricity distribution domain. |
Impact | . |
Start Year | 2016 |
Description | LogicBlox DBOnto & ED3 |
Organisation | Logicblox |
Country | United States |
Sector | Private |
PI Contribution | Expertise in access to distributed and heterogeneous data sources. |
Collaborator Contribution | Use cases, testing and evaluation from their customer base in the retail domain, which includes Target, Home Depot, Walgreens and Toys R Us in the USA, Harrods in the UK, and M-Video in Russia. |
Impact | Impact on Logicblox products, as well as joint research and publications, e.g., Todd J. Green, Dan Olteanu, Geoffrey Washburn: Live Programming in the LogicBlox System: A MetaLogiQL Approach. PVLDB 8(12): 1782-1793 (2015). |
Start Year | 2014 |
Description | Oracle DBOnto |
Organisation | Oracle Corporation |
Country | United States |
Sector | Private |
PI Contribution | Expertise in semantic technologies, in particular in RDF and OWL reasoning. |
Collaborator Contribution | Access to Oracle products and to large scale computing facilities for testing and evaluation purposes. |
Impact | Several joint publications that include details of the testing work carried out at Oracle. |
Start Year | 2014 |
Description | Siemens ED3 |
Organisation | Siemens AG |
Country | Germany |
Sector | Private |
PI Contribution | Helping Siemens to analyse data from steam turbines. |
Collaborator Contribution | Providing domain knowledge, data and resources for testing and evaluation. |
Impact | Tools for the development and evolution of conceptual models at Siemens. |
Start Year | 2011 |
Title | COMPLEX QUERY EVALUATION USING SIDEWAYS INFORMATION PASSING |
Description | A program stored on non-transitory computer-readable storage medium executes a method of evaluating a graph over a query. Decomposition instructions decompose the query into a plurality of subqueries. Evaluation instructions evaluate a subquery of the plurality of subqueries and generate a substitution multiset representing a result of the evaluation of the subquery. Filtration instructions or expansion instructions may operate upon the generated substitution set before passing the substitution set to a next subquery to be evaluated. The filtration instructions identify one or more mappings in the substitution multiset that cannot be safely passed to the second subquery and delete the identified one or more mappings from the substitution multiset. The expansion instructions determine, in a case where the subquery is operated upon by a non-distributive query operator, an expansion of the substitution multiset based at least on adding one or more new substitutions to the substitution multiset. |
IP Reference | US2022067042 |
Protection | Patent granted |
Year Protection Granted | 2022 |
Licensed | Yes |
Impact | Founding of Oxford Semantic Technologies |
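The basic idea of passing substitutions from one subquery to the next can be sketched in Python as follows. This is a simplified illustration, not the patented method: the filtration and expansion steps described above are not modelled, and the triple data, the `?`-prefixed variable convention and the `match` and `evaluate` functions are all hypothetical.

```python
def match(pattern, triple, binding):
    """Extend `binding` so that `pattern` matches `triple`, or return None."""
    out = dict(binding)
    for p, t in zip(pattern, triple):
        if p.startswith("?"):
            # A variable must agree with any value it was bound to earlier.
            if out.setdefault(p, t) != t:
                return None
        elif p != t:
            return None
    return out

def evaluate(patterns, triples):
    """Evaluate subqueries left to right, passing bindings sideways."""
    bindings = [{}]  # a list, so duplicate substitutions are kept (a multiset)
    for pattern in patterns:
        next_bindings = []
        for b in bindings:
            for t in triples:
                extended = match(pattern, t, b)
                if extended is not None:
                    next_bindings.append(extended)
        bindings = next_bindings
    return bindings

triples = [
    ("gt-1", "type", "Turbine"),
    ("gt-1", "locatedIn", "plant-A"),
    ("pump-1", "locatedIn", "plant-A"),
]
query = [("?x", "type", "Turbine"), ("?x", "locatedIn", "?p")]
result = evaluate(query, triples)
print(result)
```

Because the bindings for `?x` produced by the first subquery are handed to the second, the second subquery never returns `pump-1`, which is the saving that sideways information passing aims for.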
Company Name | Covatic |
Description | Covatic develops software that analyses a user's online engagement to deliver personalised advertising. |
Year Established | 2016 |
Impact | Although a new startup the company already has contracts with the BBC and with ITN. |
Website | https://covatic.com/ |
Company Name | Oxford Semantic Technologies |
Description | Oxford Semantic Technologies develops software that uses machine learning to analyse semantic data and its ontologies, which can be used when combining or ordering multiple datasets, and in simulating predictive relationships between data. |
Year Established | 2016 |
Impact | The company has only recently been established, but we are already in discussions with several large companies in the financial services sector who are interested in both data integration and compliance verification. |
Website | http://www.oxfordsemantic.tech |
Description | Invited talk at Huawei research centre Edinburgh |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Industry/Business |
Results and Impact | Invited talk at Huawei research centre Edinburgh |
Year(s) Of Engagement Activity | 2021 |
Description | Invited talk at IJCKG |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Postgraduate students |
Results and Impact | Invited talk at IJCKG |
Year(s) Of Engagement Activity | 2021 |
URL | https://language-semantic.org/ijckg2021/ |
Description | Invited talk at K-CAP |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Postgraduate students |
Results and Impact | Invited talk at K-CAP |
Year(s) Of Engagement Activity | 2021 |
URL | https://www.k-cap.org/2021/ |
Description | Invited talk at NeSY |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Postgraduate students |
Results and Impact | Invited talk at NeSY |
Year(s) Of Engagement Activity | 2021 |
URL | https://sites.google.com/view/nesy20/home |
Description | Invited talk at ODSC |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | Invited talk at ODSC |
Year(s) Of Engagement Activity | 2020 |
URL | https://odsc.com/dublin/schedule-overview/ |
Description | Invited talk at WEBIST |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | Invited talk at WEBIST |
Year(s) Of Engagement Activity | 2021 |
URL | https://webist.scitevents.org/?y=2021 |
Description | Keynote at conference in Lima, Peru |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Other audiences |
Results and Impact | Invited talk at SimBig18 in Lima, Peru |
Year(s) Of Engagement Activity | 2018 |
Description | Keynote at workshop in Germany |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Other audiences |
Results and Impact | Keynote in workshop on logic |
Year(s) Of Engagement Activity | 2017 |
URL | http://2017.soqe.org/ |
Description | Keynote speech at Database conference |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Other audiences |
Results and Impact | I was the keynote speaker at one of the main conferences for database researchers, Principles of Database Systems (PODS). I gave an overview of work on reasoning within data management. |
Year(s) Of Engagement Activity | 2018 |
URL | https://sigmod2018.org/ |
Description | Presentation at Bosch research workshop |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Postgraduate students |
Results and Impact | Presentation at Bosch research workshop |
Year(s) Of Engagement Activity | 2022 |
Description | Presentation at Google Research, San Francisco |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | Presentation at Google Research, San Francisco |
Year(s) Of Engagement Activity | 2019 |
Description | Presentation at Samsung Research, California |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | Presentation at Samsung Research, California |
Year(s) Of Engagement Activity | 2019 |
Description | Presentation at Siemens Research, Munich |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | Presentation at Siemens Research, Munich |
Year(s) Of Engagement Activity | 2019 |
Description | Presentation at eBay, California |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | Presentation at eBay, California |
Year(s) Of Engagement Activity | 2022 |
Description | Presentation at eBay, California |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | Presentation at eBay, California |
Year(s) Of Engagement Activity | 2019 |