ESPRESSO: Efficient Search over Personal Repositories - Secure and Sovereign

Lead Research Organisation: University of Southampton


Recent controversies over access to and processing of personal data have highlighted the significance of individuals' sovereignty over their personal data and are leading to new paradigms for application development based on personal online datastores (pods), where individuals have complete control over which applications can gain access to their personal data and for what purpose. Emerging frameworks and ecosystems, such as SOLID and the one by Dataswift, support the development of such decentralised applications which, when granted access by the individuals concerned, can access the data stored in pods to provide services to users in areas such as health and well-being, social networking, and collaborative authoring. However, this decentralisation presents significant performance challenges when searching or querying data stored in pods on a large scale, a capability that is critical to fulfilling the potential of such applications.

The current state of the art in searching for data by supplying keywords or phrases would require separate indexes to be created and maintained for each user (or group of users who share identical access rights to all available pods), leading to significant increases in storage, network and computation costs. Similarly, the current state of the art in searching for data by supplying a database query (e.g. using the SPARQL query language) would require separate metadata to be created and maintained for each user, and additional checks to control access to, and caching of, data during the query evaluation process. The current state of the art does not provide techniques for the efficient generation and maintenance of the necessary indexes and meta-information data structures, nor algorithms for evaluating search queries and aggregating query results on a large scale in such decentralised settings.

The ESPRESSO project will research, develop and evaluate appropriate algorithms, indexes and meta-information data structures to enable large-scale data search across distributed pods. Our techniques will handle varying access rights and data caching requirements, as set by each individual pod owner. We will address both keyword-based search where the most important (top-k) or all search results may be required, and distributed querying using SPARQL.

We will evaluate our techniques over pods that are implemented using the SOLID framework and SOLID-compatible data ecosystems such as the one by Dataswift, which is based on HAT Microserver technology. The numbers, distribution and content of these pods will be determined by existing Information Retrieval benchmarks, extended with ESPRESSO-specific data about owners' access and caching restrictions, and by the requirements of real-world scenarios elicited from the Health domain, which provides a wealth of settings to investigate and demonstrate the efficiency gains in pod data search achieved by our new techniques. We will not use real personal data, but will employ synthetic data generated using statistical patterns obtained in the aggregate from publicly available anonymised datasets of human subjects.

The project will collaborate with the NExT++ centre in Singapore and with the SOLID and HAT project teams. We will actively engage with the academic community and industrial stakeholders through dedicated events. The project's findings will inform current research in distributed systems, databases, the digital economy and cybersecurity, as well as the design of innovative decentralised applications, and will help to address the policy challenges relating to data sovereignty and privacy.

