LIQUID: Logic-based Integration and Querying of Unindexed Internet Data

Lead Research Organisation: Birkbeck, University of London
Department Name: Computer Science and Information Systems

Abstract

The Deep Web is constituted by data that are stored outside web pages, normally
in databases or files, and are accessible by querying them through web forms.
Deep Web data sources, apart from having local integrity constraints enforced
on them, are characterised by so-called access limitations, due to the input
requirements of the forms; such limitations restrict the set of data that
can be "reached" by queries. Moreover, in general, recursive query plans are
needed to provide the maximum information as answer to a query. In the project
LIQUID (Logic-based Integration and Querying of Unindexed Internet Data), we address the
problem of integrating Deep Web sources and making them queryable through a
global schema, representing all underlying data. The global schema employs
constraints to model properties of the domain of interest, which are not
necessarily reflected by the data stored at the heterogeneous sources. In this
project we plan to study formalisms to represent Deep Web integration systems,
and to devise scalable query processing techniques for such systems. We plan
to study static and dynamic techniques to reduce the query processing cost, by
employing logic-based reasoning techniques on constraints and access
limitations. We plan to build a prototype implementing our findings. LIQUID
will face ambitious challenges, due to the variety of formalisms and problems
involved.
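
To illustrate why access limitations call for recursive query plans, the following Python sketch may help. It is not taken from LIQUID: the two sources, their access patterns and their rows are invented for this example, and the sketch ignores attribute domains, which a real system would use to avoid pointless accesses. Each source can be queried only when its input position is filled with a constant that is already known, so values returned by one access have to be fed into further accesses until nothing new can be obtained.

```python
from itertools import product

# Hypothetical sources (assumed for illustration only). In "inputs", True marks
# a position that must be bound to a known constant before the form can be used.
SOURCES = {
    # flight(origin, destination): the form requires an origin.
    "flight": {"inputs": (True, False),
               "rows": {("LHR", "FCO"), ("FCO", "BRI"), ("JFK", "LAX")}},
    # hotel(airport, name): the form requires an airport code.
    "hotel": {"inputs": (True, False),
              "rows": {("BRI", "Hotel Adria"), ("LAX", "Sunset Inn")}},
}


def access(name, binding):
    """Simulate one form submission: return the rows matching the bound inputs."""
    src = SOURCES[name]
    input_positions = [i for i, is_input in enumerate(src["inputs"]) if is_input]
    return {row for row in src["rows"]
            if all(row[p] == v for p, v in zip(input_positions, binding))}


def reachable(seed_constants):
    """Extract every row obtainable from the seed constants (naive fixpoint)."""
    known = set(seed_constants)                 # constants usable as form inputs
    extracted = {name: set() for name in SOURCES}
    tried = set()                               # (source, binding) pairs already submitted
    changed = True
    while changed:
        changed = False
        for name, src in SOURCES.items():
            n_inputs = sum(src["inputs"])
            # sorted() copies the set, so it is safe to grow `known` in the loop
            for binding in product(sorted(known), repeat=n_inputs):
                if (name, binding) in tried:
                    continue
                tried.add((name, binding))
                changed = True
                for row in access(name, binding):
                    extracted[name].add(row)
                    known.update(row)           # new constants enable new accesses
    return extracted


if __name__ == "__main__":
    for source, rows in reachable({"LHR"}).items():
        print(source, sorted(rows))
```

Starting from the single constant "LHR", the hotel in "BRI" is obtained only after two chained flight accesses, while the rows involving "JFK" and "LAX" are never reached: they exist in the sources but lie outside what the access limitations allow a query to retrieve.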

Planned Impact

The long-term beneficiaries of the results of LIQUID are all organisations and
individuals that use Deep Web data. Among these, we mention:

- Private firms, especially aggregators of services and meta-search
engines focused on a vertical business (e.g., search engines for flights).

- Public sector organisations that need integration of web data, e.g. for crime
detection or building price indexes.

- Users in the wider public who use data integration services. Such services
are gaining in importance; we mention, for instance, that companies like
Microsoft and Oracle are now more focused on data interoperability than on
traditional data management.

- The research community engaged in research in areas such as database theory,
web information integration, knowledge representation and reasoning,
and intelligent database systems.

The research planned in LIQUID, once properly transferred, may in the medium to
long term help the UK IT industry increase its international
competitiveness.
 
Description The research has produced the findings
summarised below.

1. In order to assess the impact of integrity
constraints on Deep Web information systems,
we analysed the complexity of query answering
and query containment (a fundamental task in
query optimisation) under expressive
ontologies, including fuzzy ontologies, which
are suitable for the context of the Web and
uncertain information. The ontologies here
are intended as constraints, as they impose
properties on the actual data in the form of
rules of various kinds.

2. We analysed the computational complexity
of query answering and query containment in
the case of queries over the Deep Web, thus
in the presence of so-called access
limitations. We investigated fundamental
problems of complexity of query answering and
containment in order to give a clear picture
of the problem in its core version. The
query containment problem is important for
query optimisation especially, and we have
studied it in a "core" context, namely that
of binary relations.

3. To address the case where Web data are
wrapped in RDF (in the form of Linked Data),
we have studied the complexity of flexible
queries (SPARQL queries that can be relaxed
and approximated) on Web data. To this aim, we
have also built a prototype software system
to run tests with sample SPARQL queries.

4. We have also addressed the problem of
keyword search on the Deep Web; in this
context, we have built the Dataplex software
framework in Lisp. Dataplex offers a flexible
facility to query Deep Web sources, which can
also be wrapped within the framework itself.
The flexibility in managing data in Dataplex
will enable us to integrate sources of
different kinds. We have proposed a novel
notion of keyword search on the Deep Web, and
proposed efficient algorithms for query
processing. We tested our techniques
(and more experiments are being carried out)
with the Dataplex system.

5. We have investigated the problem of
semantic search on Web data, with application
to search in electronic marketplaces. In
this work, we integrate a man-made taxonomy
for sea life (from FAO) with linked data and
Deep Web data, devising novel algorithms for
semantic search of goods through keywords.
We built the RealFoodTrade system for
semantic search in fish markets, which is
actually a marketplace with a match-making
system for making demand and supply meet.

6. We proposed models for integration of
Linked Data and Deep Web data. We proposed
completely novel techniques, based on
distributed information retrieval, to select
the Deep Web sources relevant to a given query,
and to efficiently process conjunctive
queries over such sources.

7. We proposed a framework to seamlessly
integrate the Linked Data cloud with
geographic information in the Open Street Map
framework. We envision an application
scenario in the context, among others, of
disaster management: our techniques would be
able to efficiently and effectively retrieve
information about resources on a certain
geographic area by making use of Linked Data
and the Deep Web.

Some of the research findings have not (yet)
been published in peer-reviewed journals or
conferences. A few works have been submitted,
or are awaiting submission, to
international conferences/journals.

- "On the complexity of query containment
under access limitations" (draft) with Igor
Razgon.

- "Answering queries over Distributed Deep
Web Information Resources" (draft) with
Umberto Straccia.

- "Linking geographical data: the case of
Open Street Map" with Tommaso Di Noia,
Azzurra Ragone, Andrea Maurino, Matteo
Palmonari, Vito Walter Anelli. Submitted
to the SEBD 2016 Conference.

- "Exposing Open Street Map in the Linked Data
Cloud" with Tommaso Di Noia, Azzurra
Ragone, Andrea Maurino, Matteo Palmonari,
Vito Walter Anelli. Submitted to the
IEA/AIE 2016 Conference.

- "Keyword Search in the Deep Web" with
Davide Martinenghi and Riccardo Torlone.
Submitted to the ER 2016 Conference.

Finally, we disseminated our research in two
tutorials: SBBD 2013 (Brazilian Database
Conference) and DASFAA 2015 (International
Conference on Databases for Advanced
Applications).
Exploitation Route We believe our findings have a potential
impact in several areas. Our prototypes, if
further developed, could be applied in the
scenarios described below.

1. The academic community will take advantage
of our novel scenarios in further
investigations where more expressive
information (e.g. ontological) comes into
play in processing queries.

2. Our techniques and algorithms, rooted in
our theoretical investigation carried out under
the grant, can be of interest to firms and
organisations willing to integrate corporate
information with less structured Web data
such as that of the Linked Data cloud and the
Deep Web -- for instance: governmental
organisations that need data that are tightly
bound to a certain geographic area (e.g. in
the field of disaster recovery); companies
that need to perform data analytics on
certain markets in certain geographic areas;
and so on.

3. Our match-making techniques that make use
of ontologies and Linked Data will be useful
for semantic search in several areas,
including electronic marketplaces.
Sectors Agriculture, Food and Drink, Digital/Communication/Information Technologies (including Software), Other

 
Description The research carried out during the grant (and after its end) has aimed at having an impact on real-world scenarios. While the outcomes of LIQUID are not yet used outside the grant, we have at least two lines of development in this respect.

1. We developed a semantic search engine as a back-end recommender system for marketplaces. In particular, we integrated an ontology produced by experts with Linked Data ontologies, suitably extracted, in the area of fisheries. This has provided useful methodologies and results for practitioners wanting to build systems to match demand and supply in markets. Our prototype, which is being re-engineered, is potentially useful for real markets.

2. Together with the Politecnico di Bari, Italy, we worked on the integration of geographic information with the Linked Data cloud. To this aim, we have developed the LOSM (Linked Open Street Maps) prototype, which allows the Open Street Maps data to be queried in a seamless way together with RDF data sets such as Freebase, DBpedia or Wikidata. While this software is not currently in the hands of end-users in real-world scenarios, the prototype could be deployed in the area of disaster recovery.
Sector Agriculture, Food and Drink, Other
 
Description Birkbeck BEI School Research Grant
Amount £4,300 (GBP)
Organisation Birkbeck, University of London 
Sector Academic/University
Country United Kingdom
Start 09/2015 
End 08/2016
 
Description Concurso Anual de Incentivo a la Investigación
Amount S/ 20,000 (PEN)
Organisation Peruvian University of Applied Sciences 
Sector Academic/University
Country Peru
Start 01/2016 
End 12/2016
 
Description COST Action "KeyStone" 
Organisation European Cooperation in Science and Technology (COST)
Country Belgium 
Sector Public 
PI Contribution We have investigated techniques for keyword search on the Deep Web. Andrea Cali is a Management Committee Member of the Action; he has also served on the Programme Committee of the Keystone Conference, and contributed to all activities of the Action, including the Summer School.
Collaborator Contribution Discussion and exchange with other partners contributed to a better understanding of our research. A visit to University Roma Tre was funded by the Action.
Impact We have published (and submitted) papers on keyword search on the Deep Web, in collaboration with the Politecnico di Milano (Italy) and the University Roma Tre (Italy).
Start Year 2013
 
Title LIQUID Dataplex 
Description The framework offers a highly flexible way of processing queries on Deep Web sources. It is developed in Racket LISP and offers a very easy way of handling data in a semi-structured way, without a fixed schema for each entity. LIQUID Dataplex contains a Web scraper and can process queries on real Web sources. LIQUID Dataplex is not meant to be used directly by end-users; instead, it provides a layer for querying the Deep Web.
Type Of Technology Software 
Year Produced 2015 
Open Source License? Yes  
Impact LIQUID Dataplex is a fundamental tool for the LIQUID project. In particular, we have run experiments (which we will continue to run after the end of the grant duration, as a continuation of the research) on processing keyword searches on the Deep Web, and on the use of integrity constraints for reducing the query processing time. 
URL https://github.com/Antigonus/liquid
 
Title RealFoodTrade (RFT) 
Description RFT is an e-commerce platform for the sale of food. The current version focusses on fish. The idea is to open the market to competition by making producers visible to both wholesalers and end-buyers. The back-end of the system supports semantic search so as to provide the user with products that are semantically related to the entered keywords. To this aim, (1) we merge an ontology designed by experts with the Linked Data cloud; (2) we employ techniques from Information Retrieval to compute similarities between products.
Type Of Technology Software 
Year Produced 2014 
Impact The adoption of RFT should reduce market inefficiency, lower end-prices and increase producers' income. We are currently setting up a trial study in Talcahuano, Chile, in collaboration with the Universidad del Desarrollo (Santiago and Concepcion, Chile). Because the trial is still being set up, we are not yet able to report on any impact.