On-Demand Data Integration: Dataspaces by Refinement

Lead Research Organisation: University of Manchester

Department Name: Computer Science

Abstract

Web search engines, such as Google or Yahoo, provide access to large numbers of distributed resources. However, the questions such search engines can support are limited, and do not exploit structure within the accessed resources. For example, it is not possible to ask the question "what is the phone number of the department where Suzanne Embury works?", even though this information can be obtained by navigating from the result of a search for "Suzanne Embury". One feature of search engines that has made them successful, though, is that they need minimal configuration; for example, no manual annotation of pages is required before they can be searched. As a result, search engines can be seen as providing low-cost, low-quality access to distributed data resources.
Data integration infrastructures from the database community, by contrast, provide relatively high-cost, high-quality solutions. Where there are multiple data resources, distributed query processing systems provide the illusion that there is only one data resource, and allow complex questions to be answered that refer to data from multiple resources. For example, they could support the question about phone numbers above, even when the information about who Suzanne works for is stored in a different database from the phone number of her department. However, this precision in question answering can be supported only where the relationships between data sources have been manually identified, and inconsistencies resolved, as part of a time-consuming and largely manual data integration process.
This proposal seeks to explore the space between search engines and distributed data management systems by providing several of the benefits of the latter with much reduced configuration costs. The term "dataspace" has been coined to refer to infrastructures that support precise question answering over resources that have been integrated at minimal cost.
At present, dataspaces are more a vision than a reality; many design decisions need to be made that explore cost/quality trade-offs, and new techniques will be required for inter-relating data resources, for ranking query answers, and for interacting with users about the likely quality of answers obtained. The proposed research hypothesizes that there is no single best position in the cost/quality trade-off between fully automated and manually constructed data integration. As a result, we propose to develop a flexible software architecture in which it is possible to experiment with different components for constructing mappings between resources, annotating the mappings with measures of their quality, and ranking results according to user-specified criteria. This architecture, in turn, enables exploration of alternative approaches to the design of the components, in particular with a view to allowing incremental refinement of an initial integration that was constructed automatically.
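As an illustration only, the idea of annotating mappings with quality measures and refining them incrementally might be sketched as follows. This is not the project's architecture: the `Mapping` class, the feedback counters, the smoothed quality estimate, and the example mapping names are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Mapping:
    """A candidate mapping between two resources (hypothetical structure)."""
    name: str
    confirmed: int = 0  # answers a user marked as correct
    rejected: int = 0   # answers a user marked as incorrect

    def quality(self) -> float:
        # Assumed quality measure: fraction of confirmed answers, with
        # add-one smoothing so a mapping with no feedback scores 0.5.
        return (self.confirmed + 1) / (self.confirmed + self.rejected + 2)


def record_feedback(mapping: Mapping, is_correct: bool) -> None:
    """Refine a mapping's annotation from one piece of user feedback."""
    if is_correct:
        mapping.confirmed += 1
    else:
        mapping.rejected += 1


def rank(mappings: List[Mapping]) -> List[Mapping]:
    """Order mappings from highest to lowest estimated quality."""
    return sorted(mappings, key=lambda m: m.quality(), reverse=True)


# Two automatically generated candidate mappings (names are made up).
auto = Mapping("staff.dept -> departments.name")
alt = Mapping("staff.office -> departments.name")

# Users then comment on answers produced through each mapping.
for verdict in (True, True, False):
    record_feedback(auto, verdict)
record_feedback(alt, False)

ranked_names = [m.name for m in rank([auto, alt])]
```

Here the initial integration costs nothing to set up, and each piece of feedback shifts the ranking a little, which is one way to read "incremental refinement"; a real system would also need to let users choose the ranking criteria.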

Funded Value:

£572,897

Funded Period:

Jul 08 - Jun 11

Funder:

EPSRC

Project Status:

Closed

Project Category:

Research Grant

Project Reference:

EP/F031092/1

Principal Investigator:

Norman Paton

Research Subject:

Info. & commun. Technol. (100%)

Research Topic:

Information & Knowledge Mgmt (100%)

Organisations

University of Manchester (Lead Research Organisation)

People	ORCID iD
Norman Paton (Principal Investigator)
Suzanne Embury (Co-Investigator)	http://orcid.org/0000-0002-3711-0778
Alvaro Fernandes (Co-Investigator)	http://orcid.org/0000-0002-6100-7199
Khalid Belhajjame (Researcher Co-Investigator)

Publications

Alvaro Fernandes (2010) User Feedback as a First Class Citizen in Information Integration Systems

Belhajjame K (2014) Enabling community-driven information integration through clustering in Distributed and Parallel Databases

Belhajjame K (2013) Incrementally improving dataspaces based on user feedback in Information Systems

Belhajjame K (2010) Feedback-based annotation, selection and refinement of schema mappings for dataspaces

Hedeler C (2011) Pay-as-you-go mapping selection in dataspaces

Hedeler C (2012) Transactions on Large-Scale Data- and Knowledge-Centered Systems V

Hedeler C (2009) Dataspace: The Final Frontier

Paton N (2016) SOFSEM 2016: Theory and Practice of Computer Science

Impact Summary

Description	The findings were used in a collaboration with Greater Manchester Police, relating to data quality management.
First Year Of Impact	2012
Sector	Digital/Communication/Information Technologies (including Software)
Impact Types	Policy & public services

Further Funding

Description	EPSRC
Amount	£29,708 (GBP)
Funding ID	EPSRC Knowledge Transfer Account (Reference - 150)
Organisation	Engineering and Physical Sciences Research Council (EPSRC)
Sector	Public
Country	United Kingdom
Start


Description	Programme Grant
Amount	£1,435,718 (GBP)
Funding ID	EP/M025268/1
Organisation	Engineering and Physical Sciences Research Council (EPSRC)
Sector	Public
Country	United Kingdom
Start	04/2015
End	03/2020
