ESPRESSO: Efficient Search over Personal Repositories - Secure and Sovereign
Lead Research Organisation:
University of Southampton
Department Name: Sch of Electronics and Computer Sci
Abstract
Recent controversies over access to and processing of personal data have highlighted the significance of the sovereignty of individuals over their personal data and are leading to new paradigms for application development based on personal online datastores (pods), where individuals have complete control over which applications can gain access to their personal data and for what purpose. Emerging frameworks and ecosystems, such as SOLID and the one by Dataswift, support the development of such decentralised applications which, when granted access by the individuals concerned, can access the data stored in pods to provide services to users in areas such as health and well-being, social networking, and collaborative authoring. However, this decentralisation presents significant performance challenges when searching or querying data stored in pods on a large scale, which is critical to fulfil the potential of such applications.
The current state of the art in searching for data by supplying keywords or phrases would require separate indexes to be created and maintained for each user (or group of users who share identical access rights to all available pods), leading to significant increases in storage, network and computation costs. Similarly, the current state of the art in searching for data by supplying a database query (e.g. using the SPARQL query language) would require separate metadata to be created and maintained for each user, and additional checks to control access to, and caching of, data during the query evaluation process. The current state of the art does not provide techniques for the efficient generation and maintenance of the necessary indexes and meta-information data structures, nor algorithms for evaluating search queries and aggregating query results on a large scale in such decentralised settings.
The ESPRESSO project will research, develop and evaluate appropriate algorithms, indexes and meta-information data structures to enable large-scale data search across distributed pods. Our techniques will handle varying access rights and data caching requirements, as set by each individual pod owner. We will address both keyword-based search where the most important (top-k) or all search results may be required, and distributed querying using SPARQL.
We will evaluate our techniques over pods that are implemented using the SOLID framework and SOLID-compatible data ecosystems such as the one by Dataswift based on HAT Microserver technology. The numbers, distribution and content of these pods will be determined by existing Information Retrieval benchmarks, extended with ESPRESSO-specific data about owners' access and caching restrictions; and by the requirements of real-world scenarios elicited from the Health domain, which provides a wealth of settings to investigate and demonstrate the efficiency gains in pod data search achieved by our new techniques. We will not use real personal data, but will employ synthetic data generated using statistical patterns obtained in the aggregate from publicly available anonymised datasets of human subjects.
The project will collaborate with the NExT++ centre in Singapore and with the SOLID and HAT project teams. We will actively engage with the academic community and industrial stakeholders through dedicated events. The project's findings will inform current research in distributed systems, databases, the digital economy and cybersecurity, as well as the design of innovative decentralised applications, and will help to address the policy challenges relating to data sovereignty and privacy.
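The contrast between per-user indexes and a single shared structure can be made concrete with a toy example. The sketch below is an illustration only, not the ESPRESSO algorithms: it assumes a hypothetical in-memory index (`AccessAwareIndex`) in which each resource carries the set of agents allowed to read it, so that one index serves every searcher and access rights are checked at query time. Ranking by raw term frequency is likewise an assumption made for brevity.

```python
import heapq
from collections import defaultdict


class AccessAwareIndex:
    """Toy inverted index shared by all searchers (hypothetical, for illustration)."""

    def __init__(self):
        # keyword -> {resource_id: term frequency}
        self.postings = defaultdict(dict)
        # resource_id -> set of agent ids allowed to read the resource
        self.readers = {}

    def add(self, resource_id, text, readers):
        """Index a resource together with the agents its owner lets read it."""
        self.readers[resource_id] = set(readers)
        for word in text.lower().split():
            counts = self.postings[word]
            counts[resource_id] = counts.get(resource_id, 0) + 1

    def search(self, keyword, agent, k=None):
        """Return (frequency, resource_id) hits visible to `agent`.

        With k=None all hits are returned (exhaustive search);
        otherwise only the k highest-frequency hits (top-k search).
        """
        hits = [
            (freq, rid)
            for rid, freq in self.postings.get(keyword.lower(), {}).items()
            if agent in self.readers[rid]
        ]
        order = lambda hit: (-hit[0], hit[1])  # highest frequency first
        if k is None:
            return sorted(hits, key=order)
        return heapq.nsmallest(k, hits, key=order)


index = AccessAwareIndex()
index.add("pod1/run.txt", "heart rate heart", {"alice", "bob"})
index.add("pod2/note.txt", "heart", {"alice"})
print(index.search("heart", "alice"))
print(index.search("heart", "bob", k=1))
```

The same keyword returns different result sets for different agents, which is the behaviour a per-user index would otherwise have to provide at a much higher storage cost.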
Publications
Ragab M
(2024)
The 1st Workshop on Decentralised Search and Recommendation
Ragab M
(2024)
ESPRESSO: A Framework to Empower Search on the Decentralized Web
in Data Science and Engineering
| Description | The project has been exploring algorithms and metadata structures to improve the performance of keyword search and queries across personal online datastores (deploying on architectures such as Solid and Dataswyft). Health and well-being scenarios were considered for the experimentation across up to 50 servers with thousands of personal online datastores each. Key findings of experimentation so far include: 1. It is possible for decentralised search to efficiently preserve privacy by encoding data-owner imposed access constraints to different searching parties. 2. Decentralised keyword search can involve long response times when exhaustive search is run across thousands of personal online datastores. However, the use of metadata can significantly improve performance. Further, metadata that strike the right balance between privacy preservation and source selection are crucial for both top-k search and exhaustive search. Specifically, when searching across 475K pods on 50 Solid servers in health and well-being data scenarios, metadata can improve search time by up to 24 times (13.4 times on average). The performance of decentralised search using rare keywords can benefit the most from metadata. 3. Architectures for decentralised storage such as Solid would benefit from compute components (in addition to storage). The reason is that decentralised search can require significant computational power that may not be readily available to searching parties. Further, the framework of the Community Solid Server could support better performance for decentralised search by making use of multi-threading. 4. The use of Bloom Filters can provide adequate performance and more privacy safeguards than raw metadata, and still outperforms decentralised exhaustive search without metadata. |
| Exploitation Route | There are large communities of developers of applications over decentralised datastores, especially around the Solid and the Dataswyft ecosystems, who would benefit from the algorithms and metadata structures to support search and queries within applications or as independent services in those ecosystems. The research community will benefit from approaches to address the problem of information retrieval across datastores where different search parties may have different visibility of resources. This problem has not been sufficiently explored before because the scenarios requiring such algorithms were scarce. The community on health and well-being data collection and processing will also benefit from these approaches as they enable privacy-aware information discovery across datastores. |
| Sectors | Communities and Social Services/Policy; Creative Economy; Digital/Communication/Information Technologies (including Software); Healthcare |
| URL | https://espressoproject.org/ |
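Finding 4 above mentions Bloom filters as a privacy-friendlier alternative to raw metadata for source selection. The following is a minimal sketch of that general idea, assuming one filter per pod that summarises the pod's keywords; the class, its parameters (`num_bits`, `num_hashes`) and the `select_pods` helper are hypothetical and are not taken from the ESPRESSO code base. A Bloom filter never misses a keyword that was added, but it may occasionally report a keyword that was not, so selected pods still need an actual search.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter backed by a Python int used as a bit array."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


def select_pods(pod_filters, keyword):
    """Return the pods whose filter does not rule the keyword out."""
    return [pod for pod, bf in pod_filters.items() if bf.might_contain(keyword)]


# Hypothetical pods and keywords, for illustration only.
pod_filters = {}
for pod, words in {"pod-a": ["heart", "sleep"], "pod-b": ["diet"]}.items():
    bf = BloomFilter()
    for word in words:
        bf.add(word)
    pod_filters[pod] = bf

candidates = select_pods(pod_filters, "heart")
print(candidates)
```

The filter exposes only bit patterns rather than the keywords themselves, which is one way to read the "more privacy safeguards than raw metadata" remark; the filter size and number of hash functions trade false-positive rate against space.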
| Title | ESPRESSO Search System |
| Description | The open source license of the software is AGPL-3.0. The ESPRESSO project (espressoproject.org) researches, develops, and evaluates decentralised algorithms, meta-information data structures, and indexing techniques to enable large-scale data search across personal online datastores, taking into account varying access rights and caching requirements. The system involves a number of Solid servers (see solidproject.org) that are interconnected via an overlay GaianDB network (https://github.com/gaiandb/). The ESPRESSO system contains the following components, which are installed alongside each Solid server in the network: - An indexing app (Brewmaster) that indexes the pods, and creates and maintains indexes inside each pod, along with a meta-index for the Solid server as a whole. - A search app (CoffeeFilter) that performs the local search on the pods of a Solid server. - An overlay network (the prototype system uses a custom build of GaianDB) that connects the servers, and routes and propagates the queries. - A user interface app (Barista) that receives search queries from the user and presents the search results. |
| Type Of Technology | Webtool/Application |
| Year Produced | 2024 |
| Open Source License? | Yes |
| Impact | The software release is only a few weeks old. It has enabled the project to run experiments that will benefit the research community with initial insights on performance and trade-offs of decentralised search. The results of that experimentation have informed the project work that has been published so far and two additional publications that are to be presented at the Web Conference 2024 and published in its proceedings. The open source publication of the software is expected to enable members of the research community beyond the project to engage with this research. |
| URL | https://github.com/espressogroup/ESPRESSO |
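The division of labour between the components described above (per-pod indexes, a server-level meta-index, local search, and query propagation across servers) can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the ESPRESSO implementation: the `Server` class and `federated_search` function are hypothetical, a plain loop stands in for the GaianDB overlay network, and access control is omitted.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    """Stand-in for one Solid server with its pods (hypothetical model)."""

    name: str
    # pod_id -> {keyword: [resource, ...]}; plays the role of the per-pod indexes
    pods: dict
    # keyword -> set of pod ids; plays the role of the server-level meta-index
    meta_index: dict = field(default_factory=dict)

    def build_meta_index(self):
        """Indexing step: record which pods mention each keyword."""
        self.meta_index = {}
        for pod_id, pod_index in self.pods.items():
            for keyword in pod_index:
                self.meta_index.setdefault(keyword, set()).add(pod_id)

    def local_search(self, keyword):
        """Local search step: visit only the pods the meta-index points to."""
        results = []
        for pod_id in sorted(self.meta_index.get(keyword, ())):
            for resource in self.pods[pod_id][keyword]:
                results.append((self.name, pod_id, resource))
        return results


def federated_search(servers, keyword):
    """Propagation step: send the query to every server and merge the results."""
    results = []
    for server in servers:
        results.extend(server.local_search(keyword))
    return results


# Hypothetical servers, pods and resources, for illustration only.
servers = [
    Server("s1", {"alice": {"heart": ["hr.ttl"]}, "bob": {"diet": ["meals.ttl"]}}),
    Server("s2", {"carol": {"heart": ["ecg.ttl", "notes.txt"]}}),
]
for server in servers:
    server.build_meta_index()

results = federated_search(servers, "heart")
print(results)
```

The meta-index lets each server skip pods that cannot contain the keyword, which is the source-selection effect that the key findings credit for the reduction in search time.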
| Description | ESPRESSO Workshop on Decentralised Search |
| Form Of Engagement Activity | Participation in an activity, workshop or similar |
| Part Of Official Scheme? | No |
| Geographic Reach | International |
| Primary Audience | Industry/Business |
| Results and Impact | In a world where health data is at the forefront of our digital lives, the need for data privacy and control has become paramount. Today's Web landscape confines user-generated health data within centralized data silos, limiting individuals' autonomy and insight into the management and utilization of their own health information. These developments have highlighted the crucial importance of individual sovereignty when it comes to personal health data. This paradigm shift has given rise to new approaches in application development, focusing on personal online data stores known as "pods." In these systems, individuals have full authority over which applications can access their personal health data and specify the purposes for which such access is granted. However, the journey toward decentralization presents its own set of challenges, particularly when it comes to enabling secure and efficient search and distributed queries over such decentralized platforms. To this end, our workshop sought to blend research and industry engagement, encouraging active participation from the public. Our primary goal was to collectively explore viable solutions and engage in fruitful discussions that address the challenges of decentralized web search and privacy-preserving information retrieval, especially in the context of health data. This workshop offered a structured and interactive platform, fostering knowledge exchange and collaboration within the specified timeframe and objectives. We firmly believe that workshops that bridge industry and academia will play a pivotal role in advancing our understanding and development of decentralized online services. Through these collaborative efforts, we aim to drive the creation of new techniques and technologies that can revolutionize the way we store and process our personal (health) data. 
Ultimately, this research trajectory has the potential to pave the way for innovative approaches that could influence the global adoption of transformative systems, revolutionizing how personal health data is managed and enhancing data sovereignty for individuals. This two-day event brought together participants from industry and academia across the UK, Europe, the US and Asia. The workshop took place on the first day, and a hackathon event took place on the second day. |
| Year(s) Of Engagement Activity | 2024 |
| URL | https://espressoproject.org/london-june-24/ |
| Description | Research Visit and Workshop at the NExT Research Centre, National University of Singapore |
| Form Of Engagement Activity | Participation in an activity, workshop or similar |
| Part Of Official Scheme? | No |
| Geographic Reach | International |
| Primary Audience | Postgraduate students |
| Results and Impact | The workshop explored the research challenges of decentralised information systems from the perspective of search and information retrieval. The challenges of decentralised recommendation and AI decentralised information retrieval were also discussed. Researchers from the project partners and other visiting researchers were involved. A proposal for a workshop at the Web Conference 2024 was submitted to enable wider discussion on the topics. The workshop proposal was accepted for inclusion in the Web Conference programme and is to take place on 13 May 2024. |
| Year(s) Of Engagement Activity | 2023 |
| Description | The 1st Workshop on Decentralised Search and Recommendation (DeSeRe'24) at the Web Conference 2024, Singapore, Singapore |
| Form Of Engagement Activity | Participation in an activity, workshop or similar |
| Part Of Official Scheme? | No |
| Geographic Reach | International |
| Primary Audience | Professional Practitioners |
| Results and Impact | A workshop on decentralised search and recommendation was accepted at the Web Conference 2024 in Singapore and organised in collaboration with the National University of Singapore. It was the most attended of the workshops running in parallel. |
| Year(s) Of Engagement Activity | 2024 |
| URL | https://desere.org/ |
