LIQUID: Logic-based Integration and Querying of Unindexed Internet Data
Lead Research Organisation:
Birkbeck, University of London
Department Name: Computer Science and Information Systems
Abstract
The Deep Web is constituted by data that are stored outside web pages, normally
in databases or files, and are accessible by querying them through web forms.
Deep Web data sources, apart from having local integrity constraints enforced
on them, are characterised by so-called access limitations, due to the input
requirements of the forms, and such limitations restrict the set of data that
can be "reached" by queries. Moreover, in general, recursive query plans are
needed to provide the maximum information as answer to a query. In the project
LIQUID (Logic-based Integration of Unindexed Internet Data), we address the
problem of integrating Deep Web sources and making them queryable through a
global schema, representing all underlying data. The global schema employs
constraints to model properties of the domain of interest, which are not
necessarily reflected by the data stored at the heterogeneous sources. In this
project we plan to study formalisms to represent Deep Web integration systems,
and to devise scalable query processing techniques for such systems. We plan
to study static and dynamic techniques to reduce the query processing cost, by
employing logic-based reasoning techniques on constraints and access
limitations. We plan to build a prototype implementing our findings. LIQUID
will face ambitious challenges, due to the variety of formalisms and problems
in play.
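As an illustration of why access limitations restrict the reachable data and call for recursive query plans, consider the minimal Python sketch below. It is not taken from the LIQUID prototype: the two sources, their contents and all function names are hypothetical. One source requires a city as input and returns artists; the other requires an artist as input and returns cities. Starting from a single known constant, values obtained from one form are fed back as inputs to the other until nothing new is found.

```python
from collections import deque

# Hypothetical in-memory stand-ins for two Deep Web sources (assumed data).
CONCERTS = {("London", "Ana"), ("Rome", "Ben"), ("Oslo", "Cy")}
TOURS = {("Ana", "Rome"), ("Ben", "London"), ("Dee", "Oslo")}


def concerts_by_city(city):
    """Simulated web form: the city is a mandatory input, artists are output."""
    return {artist for (c, artist) in CONCERTS if c == city}


def tours_by_artist(artist):
    """Simulated web form: the artist is a mandatory input, cities are output."""
    return {city for (a, city) in TOURS if a == artist}


def reachable_artists(seed_city):
    """Collect every artist obtainable from one known city.

    Each newly discovered value is reused as input to the other form,
    which is the recursive behaviour forced by the access limitations.
    """
    known_cities = {seed_city}
    known_artists = set()
    queue = deque([("city", seed_city)])
    while queue:
        kind, value = queue.popleft()
        if kind == "city":
            for artist in concerts_by_city(value):
                if artist not in known_artists:
                    known_artists.add(artist)
                    queue.append(("artist", artist))
        else:
            for city in tours_by_artist(value):
                if city not in known_cities:
                    known_cities.add(city)
                    queue.append(("city", city))
    return known_artists


if __name__ == "__main__":
    print(sorted(reachable_artists("London")))
```

In this toy data set the artist "Cy" is stored at the first source but cannot be reached from "London", because the only path to "Oslo" goes through an artist ("Dee") that no accessible form ever returns.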
Planned Impact
The long-term beneficiaries of the results of LIQUID are all organisations and
individuals that use Deep Web data. Among these, we mention:
- Private firms, especially aggregators of services and meta-search
engines focused on a vertical business (e.g., search engines for flights).
- Public sector organisations who need integration of web data, e.g. for crime
detection or building price indexes.
- Users in the wider public who use data integration services. Such services
are gaining in importance; we mention, for instance, that companies like
Microsoft and Oracle are now more focused on data interoperability than on
traditional data management.
- The research community working in areas such as database theory,
web information integration, knowledge representation and reasoning, and
intelligent database systems.
The research planned in LIQUID, once properly transferred, may in the medium to
long term help the UK IT industry to increase its international
competitiveness.
People |
ORCID iD |
Andrea Cali (Principal Investigator) |
Publications
Andrea Cali (2013) Querying Data Through Ontologies
Andrea Cali (2014) Semantic Search in RealFoodTrade
Andrea Cali (2015) Keyword Search in the Deep Web
Anelli V (2016) Exposing Open Street Map in the Linked Data cloud
Anelli V (2016) Trends in Applied Knowledge-Based Systems and Data Science
Anelli V (2017) Querying deep web data sources as linked data
Bertossi L (2016) Query Answering on Expressive Datalog+/- Ontologies
Cal (2017) Non-FPT lower bounds for structural restrictions of decision DNNF, in arXiv e-prints
Description | The research has produced the findings summarised below.
1. To assess the impact of integrity constraints on Deep Web information systems, we analysed the complexity of query answering and query containment (a fundamental task in query optimisation) under expressive ontologies, including fuzzy ontologies, which are suited to the Web and to uncertain information. The ontologies are intended here as constraints, as they impose properties on the actual data in the form of rules of various kinds.
2. We analysed the computational complexity of query answering and query containment for queries over the Deep Web, that is, in the presence of so-called access limitations. We investigated fundamental complexity problems for query answering and containment in order to give a clear picture of the problem in its core version. Query containment is especially important for query optimisation, and we have studied it in a "core" context, namely that of binary relations.
3. To address the case where Web data are wrapped in RDF (in the form of Linked Data), we studied the complexity of flexible queries (SPARQL queries that can be relaxed and approximated) on Web data. To this end, we also built a prototype software system to run tests with sample SPARQL queries.
4. We addressed the problem of keyword search on the Deep Web; in this context, we built the Dataplex software framework in Lisp. Dataplex offers a flexible facility to query Deep Web sources, which can also be wrapped within the framework itself. The flexibility in managing data in Dataplex will enable us to integrate sources of different kinds. We proposed a novel notion of keyword search on the Deep Web, together with efficient algorithms for query processing. We tested our techniques with the Dataplex system, and more experiments are being carried out.
5. We investigated the problem of semantic search on Web data, with application to search in electronic marketplaces. In this work, we integrate a man-made taxonomy for sea life (from FAO) with Linked Data and Deep Web data, devising novel algorithms for semantic search of goods through keywords. We built the RealFoodTrade system for semantic search in fish markets, which is in fact a marketplace with a match-making system for making demand and supply meet.
6. We proposed models for the integration of Linked Data and Deep Web data. We proposed completely novel techniques, based on distributed information retrieval, to select the Deep Web sources relevant to a given query, and to efficiently process conjunctive queries over such sources.
7. We proposed a framework to seamlessly integrate the Linked Data cloud with geographic information in the Open Street Map framework. We envision an application scenario in the context of, among others, disaster management: our techniques would be able to efficiently and effectively retrieve information about resources in a given geographic area by making use of Linked Data and the Deep Web.
Part of the research findings are not (yet) published in peer-reviewed journals or conferences. We have a few works that are submitted or awaiting submission to international conferences and journals.
- "On the complexity of query containment under access limitations" (draft), with Igor Razgon.
- "Answering queries over Distributed Deep Web Information Resources" (draft), with Umberto Straccia.
- "Linking geographical data: the case of Open Street Map", with Tommaso Di Noia, Azzurra Ragone, Andrea Maurino, Matteo Palmonari, Vito Walter Anelli. Submitted to the SEBD 2016 Conference.
- "Exposing Open Street Map in the Linked Data Cloud", with Tommaso Di Noia, Azzurra Ragone, Andrea Maurino, Matteo Palmonari, Vito Walter Anelli. Submitted to the IEA/AIE 2016 Conference.
- "Keyword Search in the Deep Web", with Davide Martinenghi and Riccardo Torlone. Submitted to the ER 2016 Conference.
Finally, we disseminated our research in two tutorials: SBBD 2013 (Brazilian Symposium on Databases) and DASFAA 2015 (International Conference on Database Systems for Advanced Applications). |
Exploitation Route | We believe our findings have potential impact in several areas. Our prototypes, if further developed, could be deployed in the scenarios described below.
1. The academic community will be able to build on our novel scenarios in further investigations where more expressive information (e.g. ontological) comes into play in processing queries.
2. Our techniques and algorithms, rooted in the theoretical investigation carried out under the grant, can be of interest to firms and organisations wishing to integrate corporate information with less structured Web data, such as that of the Linked Data cloud and the Deep Web. Examples include governmental organisations that need data tightly bound to a certain geographic area (e.g. in the field of disaster recovery) and companies that need to perform data analytics on certain markets in certain geographic areas.
3. Our match-making techniques that make use of ontologies and Linked Data will be useful for semantic search in several areas, including electronic marketplaces. |
Sectors | Agriculture, Food and Drink; Digital/Communication/Information Technologies (including Software); Other |
Description | The research carried out during the grant (and after its end) has aimed at having impact on real-world scenarios. While the outcomes of LIQUID are not yet used outside the grant, we have at least two lines of development in this respect.
1. We developed a semantic search engine as a back-end recommender system for marketplaces. In particular, we integrated an ontology produced by experts with suitably extracted Linked Data ontologies in the area of fisheries. This has provided useful methodologies and results for practitioners wanting to build systems to match demand and supply in markets. Our prototype, which is being re-engineered, is potentially useful for real markets.
2. Together with the Politecnico di Bari, Italy, we worked on the integration of geographic information with the Linked Data cloud. To this end, we developed the LOSM (Linked Open Street Maps) prototype, which allows Open Street Map data to be queried seamlessly together with RDF data sets such as Freebase, DBpedia or Wikidata. While this software is not currently in the hands of end-users in real-world scenarios, the prototype could be deployed in the area of disaster recovery. |
Sector | Agriculture, Food and Drink; Other |
Description | Birkbeck BEI School Research Grant |
Amount | £4,300 (GBP) |
Organisation | Birkbeck, University of London |
Sector | Academic/University |
Country | United Kingdom |
Start | 09/2015 |
End | 08/2016 |
Description | Concurso Anual de Incentivo a la Investigación |
Amount | S/ 20,000 (PEN) |
Organisation | Peruvian University of Applied Sciences |
Sector | Academic/University |
Country | Peru |
Start | 01/2016 |
End | 12/2016 |
Description | COST Action "KeyStone" |
Organisation | European Cooperation in Science and Technology (COST) |
Country | Belgium |
Sector | Public |
PI Contribution | We have investigated techniques for keyword search on the Deep Web. Andrea Cali is a Management Committee Member of the Action; he has also served on the Programme Committee of the Keystone Conference, and contributed to all activities of the Action, including the Summer School. |
Collaborator Contribution | Discussion and exchange with other partners contributed to a better understanding of our research. A visit to University Roma Tre was funded by the Action. |
Impact | We have published (and submitted) papers on keyword search on the Deep Web, in collaboration with the Politecnico di Milano (Italy) and the University Roma Tre (Italy). |
Start Year | 2013 |
Title | LIQUID Dataplex |
Description | The framework offers a highly flexible way of processing queries on Deep Web sources. It is developed in Racket LISP and offers a very easy way of handling data in a semi-structured way, without a fixed schema for each entity. LIQUID Dataplex contains a Web scraper and can process queries on real Web sources. LIQUID Dataplex is not intended for direct use by end-users; instead, it provides a layer for querying the Deep Web. |
Type Of Technology | Software |
Year Produced | 2015 |
Open Source License? | Yes |
Impact | LIQUID Dataplex is a fundamental tool for the LIQUID project. In particular, we have run experiments (which we will continue to run after the end of the grant duration, as a continuation of the research) on processing keyword searches on the Deep Web, and on the use of integrity constraints for reducing the query processing time. |
URL | https://github.com/Antigonus/liquid |
Title | RealFoodTrade (RFT) |
Description | RFT is an e-commerce platform for the sale of food. The current version focusses on fish. The idea is to open the market to competition by making producers visible to both wholesalers and end-buyers. The back-end of the system supports semantic search so as to provide the user with products that are semantically related to the entered keywords. To this end, (1) we merge an ontology designed by experts with the Linked Data cloud; (2) we employ techniques from Information Retrieval to compute similarities between products. |
Type Of Technology | Software |
Year Produced | 2014 |
Impact | The adoption of RFT should reduce market inefficiency, lower end-prices and increase producers' income. We are currently setting up a trial study in Talcahuano, Chile, in collaboration with the Universidad del Desarrollo (Santiago and Concepcion, Chile). As the trial has not yet started, we are currently unable to report on any actual impact. |