On-Demand Data Integration: Dataspaces by Refinement

Lead Research Organisation: University of Manchester
Department Name: Computer Science

Abstract

Web search engines, such as Google or Yahoo, provide access to large numbers of distributed resources. However, the questions such search engines can support are limited, and do not exploit structure within the accessed resources. For example, it is not possible to ask the question "what is the phone number of the department where Suzanne Embury works?", even though this information can be obtained by navigating from the result of a search for "Suzanne Embury". One feature that has made search engines successful, though, is that they need minimal configuration; for example, no manual annotation of pages is required before they can be searched. As a result, search engines can be seen as providing low-cost, low-quality access to distributed data resources.
Data integration infrastructures from the database community, by contrast, provide relatively high-cost, high-quality solutions. Where there are multiple data resources, distributed query processing systems provide the illusion that there is only one data resource, and allow complex questions to be answered that refer to data from multiple resources. For example, they could support the question about phone numbers above, even when the information about who Suzanne works for is stored in a different database from the phone number of her department. However, this precision in question answering can be supported only where the relationships between data sources have been manually identified, and inconsistencies resolved, as part of a time-consuming and largely manual data integration process.
This proposal seeks to explore the space between search engines and distributed data management systems by providing several of the benefits of the latter with much reduced configuration costs. The term "dataspace" has been coined to refer to infrastructures that support precise question answering over resources that have been integrated at minimal cost.
At present, dataspaces are more a vision than a reality; many design decisions need to be made that explore cost/quality trade-offs, and new techniques will be required for inter-relating data resources, ranking query answers, and interacting with users about the likely quality of answers obtained. The proposed research hypothesizes that there is no single best position in the cost/quality trade-off between fully automated and manually constructed data integration. As a result, we propose to develop a flexible software architecture in which it is possible to experiment with different components for constructing mappings between resources, annotating the mappings with measures of their quality, and ranking results according to user-specified criteria. This architecture, in turn, enables exploration of alternative approaches to the design of the components, in particular with a view to allowing incremental refinement of an initial integration that was constructed automatically.
 
Description The findings were used in a collaboration with Greater Manchester Police, relating to data quality management.
First Year Of Impact 2012
Sector Digital/Communication/Information Technologies (including Software)
Impact Types Policy & public services

 
Description EPSRC
Amount £29,708 (GBP)
Funding ID EPSRC Knowledge Transfer Account (Reference - 150) 
Organisation Engineering and Physical Sciences Research Council (EPSRC) 
Sector Public
Country United Kingdom
Start  
 
Description Programme Grant
Amount £1,435,718 (GBP)
Funding ID EP/M025268/1 
Organisation Engineering and Physical Sciences Research Council (EPSRC) 
Sector Public
Country United Kingdom
Start 04/2015 
End 03/2020