On-Demand Data Integration: Dataspaces by Refinement
Lead Research Organisation: University of Manchester
Department Name: Computer Science
Abstract
Web search engines, such as Google or Yahoo, provide access to large numbers of distributed resources. However, the questions such search engines can support are limited and do not exploit structure within the accessed resources. For example, it is not possible to ask the question "what is the phone number of the department where Suzanne Embury works?", even though this information can be obtained by navigating from the result of a search for "Suzanne Embury". One feature that has made search engines successful, however, is that they need minimal configuration; for example, no manual annotation of pages is required before they can be searched. As a result, search engines can be seen as providing low-cost, low-quality access to distributed data resources.
Data integration infrastructures from the database community, by contrast, provide relatively high-cost, high-quality solutions. Where there are multiple data resources, distributed query processing systems provide the illusion that there is only one data resource, and allow complex questions to be answered that refer to data from multiple resources. For example, they could support the question about phone numbers above, even when the information about who Suzanne works for is stored in a different database from the phone number of her department. However, this precision in question answering can be supported only where the relationships between data sources have been manually identified, and inconsistencies resolved, as part of a time-consuming and largely manual data integration process.
This proposal seeks to explore the space between search engines and distributed data management systems by providing several of the benefits of the latter with much reduced configuration costs. The term "dataspace" has been coined to refer to infrastructures that support precise question answering over resources that have been integrated at minimal cost.
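As a rough illustration of the phone-number question, the sketch below joins two separate sources through a manually identified relationship. The source names, field names, the `MAPPING` constant, the `department_phone` function, and all values are hypothetical placeholders, not part of the project.

```python
# Hypothetical data: two separate sources, as in the phone-number example.
# All field names and values are placeholders made up for illustration.
staff_db = [
    {"name": "Suzanne Embury", "department": "Computer Science"},
]
department_db = [
    {"dept_name": "Computer Science", "phone": "PHONE-PLACEHOLDER"},
]

# The manually identified relationship between the two sources:
# staff_db["department"] denotes the same thing as department_db["dept_name"].
MAPPING = ("department", "dept_name")


def department_phone(person):
    """Return the phone numbers of the departments where `person` works."""
    staff_key, dept_key = MAPPING
    phones = []
    for staff in staff_db:
        if staff["name"] != person:
            continue
        # Join the staff record to the department source via the mapping.
        for dept in department_db:
            if dept[dept_key] == staff[staff_key]:
                phones.append(dept["phone"])
    return phones


print(department_phone("Suzanne Embury"))
```

The join works only because `MAPPING` was written by hand; identifying such relationships at lower cost is the configuration burden that a dataspace aims to reduce.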
At present, dataspaces are more a vision than a reality; many design decisions need to be made that explore cost/quality trade-offs, and new techniques will be required for inter-relating data resources, ranking query answers, and interacting with users about the likely quality of answers obtained. The proposed research hypothesizes that there is no single best position in the cost/quality trade-off that exists between fully automated and manually constructed data integration. As a result, we propose to develop a flexible software architecture in which it is possible to experiment with different components for constructing mappings between resources, annotating the mappings with measures of their quality, and ranking results according to user-specified criteria. This architecture, in turn, enables exploration of alternative approaches to the design of the components, in particular with a view to allowing incremental refinement of an initial integration that was constructed automatically.
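A minimal sketch of how such components might fit together is given below, purely as an illustration. The `Mapping` class, the `answer`, `rank_results`, and `refine` functions, and the feedback update rule are all assumptions introduced here; they are not the project's actual architecture or algorithms.

```python
from dataclasses import dataclass


@dataclass
class Mapping:
    """A candidate correspondence between attributes of two sources (hypothetical)."""
    source_attr: str
    target_attr: str
    quality: float  # assumed estimate, in [0, 1], that the mapping is correct


def rank_results(results, key=lambda r: r["score"]):
    """Order results best-first; `key` stands in for a user-specified criterion."""
    return sorted(results, key=key, reverse=True)


def answer(mappings, evaluate):
    """Evaluate a query under every mapping and score each answer by its mapping's quality."""
    results = []
    for m in mappings:
        for value in evaluate(m):
            results.append({"value": value, "score": m.quality, "mapping": m})
    return rank_results(results)


def refine(mapping, is_correct, step=0.2):
    """Move the quality estimate toward 1 or 0 after one piece of user feedback.

    This simple update rule is an assumption made for illustration only.
    """
    target = 1.0 if is_correct else 0.0
    mapping.quality += step * (target - mapping.quality)


# Two automatically generated candidate mappings with initial quality estimates.
auto = [Mapping("department", "dept_name", 0.4), Mapping("department", "dept_code", 0.7)]
evaluate = lambda m: [m.target_attr]  # stand-in for real query evaluation

before = [r["value"] for r in answer(auto, evaluate)]
for _ in range(3):
    refine(auto[1], is_correct=False)  # the user rejects answers from the second mapping
after = [r["value"] for r in answer(auto, evaluate)]
```

In this sketch the initial integration is fully automatic, and each piece of feedback adjusts one quality annotation, so the ranking improves incrementally without rebuilding the mappings.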
Publications
Belhajjame K (2014) Enabling community-driven information integration through clustering, in Distributed and Parallel Databases
Belhajjame K (2013) Incrementally improving dataspaces based on user feedback, in Information Systems
Paton N (2016) SOFSEM 2016: Theory and Practice of Computer Science
Hedeler C (2012) Transactions on Large-Scale Data- and Knowledge-Centered Systems V
Hedeler C (2011) Pay-as-you-go mapping selection in dataspaces
Hedeler C (2009) Dataspace: The Final Frontier
Alvaro Fernandes (2010) User Feedback as a First Class Citizen in Information Integration Systems
Description | The findings were used in a collaboration with Greater Manchester Police, relating to data quality management. |
First Year Of Impact | 2012 |
Sector | Digital/Communication/Information Technologies (including Software) |
Impact Types | Policy & public services |
Description | EPSRC |
Amount | £29,708 (GBP) |
Funding ID | EPSRC Knowledge Transfer Account (Reference - 150) |
Organisation | Engineering and Physical Sciences Research Council (EPSRC) |
Sector | Public |
Country | United Kingdom |
Start |
Description | Programme Grant |
Amount | £1,435,718 (GBP) |
Funding ID | EP/M025268/1 |
Organisation | Engineering and Physical Sciences Research Council (EPSRC) |
Sector | Public |
Country | United Kingdom |
Start | 03/2015 |
End | 03/2020 |