Physical Sciences Data Infrastructure (PSDI) Phase 1 Pilot

Lead Research Organisation: Science and Technology Facilities Council
Department Name: Scientific Computing Department

Abstract

Today, each physical science research infrastructure, from individual laboratories to large facilities, has essentially its own isolated data infrastructure. In contrast, many other domains have data-centric infrastructures for collecting and reusing data which act as community hubs and drivers of new methods and discoveries. There is a clear need within physical sciences for an additional infrastructure layer to enable researchers to share and use existing resources whilst ensuring that each resource can remain dedicated to its specific application.

There is a need to preserve and exploit outputs from past research while keeping pace with the increasing rate of data generation, the latter posing the greatest challenge and potential for innovation. New chemicals, materials and devices are key to a sustainable future, both environmentally and financially. The UK needs to invent its way out of seemingly conflicting targets of maintaining economic growth whilst making unprecedented strides towards an imminent net zero carbon output.

This pilot builds on a proposed physical science data infrastructure (PSDI) outlined in a Community Statement of Need response to the EPSRC 'Large Infrastructures' call, which will enable the wider community to do more with existing resources and build an ecosystem for data discovery. This short scoping exercise (November 2021 - March 2022) is intended to inform larger subsequent construction and scale-up phases. Accordingly, this pilot study will gather recommendations to feed into specifications addressing operating models, governance, systems design and architecture, capabilities and remit. To do this, the pilot will engage with physical science research communities and facilities to shape the key areas to address and their requirements.

The pilot activities therefore focus on four key areas: strategy, stakeholder engagement, technical architecture, and case studies. Strategy concentrates on defining the operational model, that is, how the technical infrastructure, expert service providers and new research communities function together to deliver the whole infrastructure, and on proposing a governance structure to ensure efficient and sustainable delivery. Stakeholder engagement is a key aspect of the pilot that will inform multiple research communities at the same time as capturing the bulk of the requirements. The work on technical architecture will test a number of key platform components in order to propose a system design for future construction. Eight case studies are designed to probe specific key areas of the infrastructure through a combination of focussed desk-based research and test implementations, and are split into two areas: disciplinary science-based and underpinning techniques/methods.

Each of the tasks in the work packages will produce recommendations, which will be synthesised into a single PSDI design recommendations and specifications report.
 
Description The PSDI-Pilot grant supported the first, five-month phase of the development of a data infrastructure for the physical sciences (PSDI). The pilot undertook a number of engagement and requirements capture activities, as well as some case studies that demonstrated scenarios of potential use, to shape the design of the PSDI.

The main findings are a set of top-level recommendations for future development of data infrastructures in the Physical Sciences (https://www.psdi.ac.uk/the-pilot/recommendations/). These are supported by the results of the eight case studies, which demonstrated some potential benefits of PSDI and indicated the work to be continued in later phases (https://www.psdi.ac.uk/the-pilot/wp4-case-studies/).

The recommendations are grouped into 4 areas.

The first area concerned the need to connect existing technologies. The engagement activities confirmed the view that the current data landscape is fragmented, making data analysis and reuse unduly complex, and that physical sciences research would be greatly accelerated by more integration of the systems that handle data. The PSDI would enable researchers not only to undertake their own analyses more effectively, but also to make their data products available as inputs for further research. This data infrastructure should connect existing systems, widening their applicability and adding value through aggregation. Such an integration would support data workflows, enabling researchers to concentrate on their science rather than spend time on data management activities. It should also be trustworthy and enduring: without assurance of its longevity, researchers would be reluctant to invest the time required to engage with a system which may be temporary or fail to gain traction as key infrastructure.

The second set of recommendations concerned the best use of data. There is a need to open up data for reuse and aggregation into collections that add value, and to link up with data sources from other domains for cross-disciplinary, multiscale modelling and multimodal research. There is also a particular need to bridge between experimental and computational activities. It should be possible to readily access provenanced data, including reference quality data, and secondary data underpinning publications. Availability of data should support reproducibility and validation of research, in addition to application in further research including machine learning and AI. The new infrastructure should support the overarching principles of data being as open and FAIR as possible, and drive international collaborations and interdisciplinary research through the use of open standards.

The third area of recommendations concerns the best use of people. It was clearly recognised that an effective research ecosystem requires not only investment in technology, but also support professionals to make it usable and appropriately trained people to fully exploit it. We observed a wide variation in levels of data skills in different groups. This highlighted an opportunity for sharing knowledge and best practice between projects, disciplines and research domains. Much of a physical sciences researcher's time is spent finding, cleaning, transforming and moving data. There is a need for dedicated professionals who can either fully support researchers' data workflows, enabling them to concentrate on research without being impeded by cumbersome data management, or provide streamlined tools supporting data-intensive research in the physical sciences, enabling researchers to more easily support themselves. The role of these professionals must be fully established, recognised and sustained.

The fourth area of recommendations concerns the best use of technology. Physical science researchers currently have to navigate a wide diversity of IT in a highly heterogeneous technological environment. Tools for data management and data analysis are changing rapidly and often diverging, with important new functionality emerging continuously. For physical science research workflows to "ride the wave" of this technological evolution and make use of the latest developments, an integrated infrastructure is required in which researchers can easily adopt diverse tools that readily work together. Delivering this infrastructure will require agreement on and maintenance of the vocabularies, interfaces and tools that enable interoperability. An essential feature of a data infrastructure for physical sciences should thus be to develop and maintain interoperability standards and the associated supporting tools that enable sharing and discovery of metadata and data.
Exploitation Route The PSDI Phase 1 Pilot supported the first five-month phase of a longer-term ambition to develop a data infrastructure for the physical sciences. Later phases will involve a broader range of partners, contributing to the collaborative development and operation of a federated data infrastructure complementing and connecting existing provision.
Sectors Chemicals

Digital/Communication/Information Technologies (including Software)

Education

Manufacturing, including Industrial Biotechnology

URL https://doi.org/10.5281/zenodo.7684860
 
Description This project produced a number of recommendations (https://www.psdi.ac.uk/the-pilot/recommendations/) which then led to continuation of the work through the PSDI Phase 1 (https://www.psdi.ac.uk/psdi-phase-1/) and Phase 2 (https://www.psdi.ac.uk/psdi-phase-2/) projects.
First Year Of Impact 2022
Sector Chemicals, Digital/Communication/Information Technologies (including Software), Pharmaceuticals and Medical Biotechnology
Impact Types Policy & public services

 
Description Presentations to EPSRC Strategic Advisory Teams for Physical sciences, Research infrastructure e-infrastructure, and Research infrastructure capital equipment.
Geographic Reach National 
Policy Influence Type Participation in a guidance/advisory committee
 
Description PSDI Phase 1b
Amount £755,911 (GBP)
Funding ID EP/X032663/1 
Organisation Engineering and Physical Sciences Research Council (EPSRC) 
Sector Public
Country United Kingdom
Start 09/2022 
End 09/2023
 
Description PSDI Phase 1b - SCD Extension
Amount £2,716,308 (GBP)
Funding ID EP/X032663/1 
Organisation Engineering and Physical Sciences Research Council (EPSRC) 
Sector Public
Country United Kingdom
Start 09/2023 
End 12/2024
 
Description Physical Sciences Data Infrastructure Phase 1b
Amount £752,936 (GBP)
Funding ID EP/X032701/1 
Organisation Engineering and Physical Sciences Research Council (EPSRC) 
Sector Public
Country United Kingdom
Start 09/2022 
End 09/2023
 
Description Physical Sciences Data Infrastructure Phase 1b - Southampton Extension
Amount £2,146,968 (GBP)
Funding ID EP/X032701/1 
Organisation Engineering and Physical Sciences Research Council (EPSRC) 
Sector Public
Country United Kingdom
Start 09/2023 
End 12/2024
 
Title PSDI Case Study 3: Examples of PSDI-LD proof-of-concept 
Description Dataset accompanying work discussed in PSDI Pilot Case Study 3 - Combining data sources in Materials Physics. Proof-of-concept examples according to the PSDI-LD draft, with CS3 additions to illustrate problem-solving approaches. The examples use a data bundle supplied by the University of Sheffield. This dataset contains a readme.md file outlining the examples. There are three JSON-LD files containing the whole dataset, the subparts of the dataset and a flattened representation.
Type Of Material Database/Collection of data 
Year Produced 2022 
Provided To Others? Yes  
URL https://zenodo.org/record/7704823
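
The difference between a nested dataset description and a flattened representation, as mentioned in the description above, can be illustrated with a minimal Python sketch. It is not taken from the dataset and is not the full JSON-LD flattening algorithm: it only hoists nested node objects into a top-level "@graph" list and leaves references in their place. The `flatten` helper, the `nested` document, its `https://example.org/` identifiers and its vocabulary terms are all hypothetical and chosen purely for illustration.

```python
import json
from itertools import count


def flatten(doc):
    """Simplified flattening (an assumption-laden sketch, not the W3C algorithm):
    move every nested node object into a top-level "@graph" list and replace it
    in its parent with an {"@id": ...} reference."""
    graph = []
    blank_ids = count()

    def convert(value):
        if isinstance(value, list):
            return [convert(v) for v in value]
        if isinstance(value, dict):
            # Value objects and bare references are kept as they are.
            if "@value" in value or set(value) == {"@id"}:
                return value
            return visit(value)
        return value

    def visit(node):
        # Nodes without an identifier get a blank node label such as "_:b0".
        node_id = node.get("@id") or f"_:b{next(blank_ids)}"
        flat = {"@id": node_id}
        graph.append(flat)
        for key, value in node.items():
            if key not in ("@id", "@context"):
                flat[key] = convert(value)
        return {"@id": node_id}

    visit(doc)
    result = {"@graph": graph}
    if "@context" in doc:
        # Carry the context over unchanged.
        result = {"@context": doc["@context"], **result}
    return result


# Hypothetical nested description: one dataset with two sub-parts.
nested = {
    "@context": {"@vocab": "https://example.org/vocab#"},
    "@id": "https://example.org/dataset/1",
    "@type": "Dataset",
    "name": "Example bundle",
    "hasPart": [
        {"@id": "https://example.org/dataset/1/part-a",
         "@type": "Dataset", "name": "Part A"},
        {"@type": "Dataset", "name": "Part B"},
    ],
}

flat = flatten(nested)
print(json.dumps(flat, indent=2))
```

In the flattened form every node sits at the same level and is linked by identifier, which tends to make simple lookups and merging easier at the cost of readability. The sketch does not merge nodes that share an identifier; a conforming JSON-LD processor would do that.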
 
Description AI4SD, PSDS & PSDI Skills4Scientists Series 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Undergraduate students
Results and Impact This series was organised as a joint venture between the Artificial Intelligence for Scientific Discovery Network+ (AI4SD), the Physical Sciences Data-Science Service (PSDS), and the Physical Sciences Data Infrastructure (PSDI). This series was initially run over summer 2021 and aimed to educate scientists and improve their skills in a range of areas including research data management, Python, version control, ethics, and career development. The first iteration of this series was primarily aimed at final-year undergraduates / early-stage PhD students.

This series has now been run again in 2022 and 2023 and is in further development for 2024 to create a flipped/blended learning course, and to make a wide range of materials available online alongside the initial video content.
Year(s) Of Engagement Activity 2021,2022,2023
URL https://eprints.soton.ac.uk/453198/
 
Description Data Management Strategies of Tomorrow: Bridging the Gap Between Retired Data Systems and Digital Innovation 
Form Of Engagement Activity A press release, press conference or response to a media enquiry/interview
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Media (as a channel to the public)
Results and Impact This was a writeup of a panel I chaired at the Pharma Data & Smart Labs Conference.
Year(s) Of Engagement Activity 2022,2023
URL https://www.oxfordglobal.co.uk/resources/data-management-strategies-of-tomorrow-bridging-the-gap-bet...
 
Description Digitisation and Beyond: Understanding the Lab of the Future 
Form Of Engagement Activity A press release, press conference or response to a media enquiry/interview
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Media (as a channel to the public)
Results and Impact This was a write-up of my talk, and those of some others, given as a key opinion leader.
Year(s) Of Engagement Activity 2022,2023
URL https://www.oxfordglobal.co.uk/resources/digitisation-and-beyond-understanding-the-lab-of-the-future...
 
Description Digitization: The Lab of the Future 
Form Of Engagement Activity A press release, press conference or response to a media enquiry/interview
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Other audiences
Results and Impact I was invited for an interview based on the research conducted for PSDI after presenting on our work at Future Labs Live.
Year(s) Of Engagement Activity 2022
URL https://www.technologynetworks.com/informatics/articles/digitization-the-lab-of-the-future-363356
 
Description Lab Insider Interview 
Form Of Engagement Activity A press release, press conference or response to a media enquiry/interview
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Industry/Business
Results and Impact I was invited to speak on the YouTube show LabInsider about my research, including my work on PSDI.
Year(s) Of Engagement Activity 2022
URL https://www.labinsider.com/lab-transformation/let-your-voice-be-heard-soon-we-ll-be-speaking-to-our-...
 
Description Panel Session at RSECon2022 
Form Of Engagement Activity A formal working group, expert panel or dialogue
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact Hosted an expert panel and discussion session at the national research software engineering conference RSECon. The discussion covered the data needs of the long-tail community.
Year(s) Of Engagement Activity 2022
 
Description Presentation at AI4SD Annual Conference "PSDI - Shaping the Physical Sciences Roadmap" 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Other audiences
Results and Impact Invited to present on the PSDI project and the landscape for physical sciences data. Interactive engagement with the hybrid audience of over 100 to discuss their data needs and requirements. This was also published as a video on the organisers' YouTube channel and has over 300 views.
Year(s) Of Engagement Activity 2022
URL https://www.youtube.com/watch?v=4Ukn7TawAhs&list=PLyeHH3bEQqIYYcv2ZmgJ50wCaOreX8Dvn&index=23
 
Description Presentation at UK Catalysis Hub Core Science Meeting 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact Presentation at UK Catalysis Hub Core Science Meeting, raising awareness of work carried out in PSDI pilot.
Year(s) Of Engagement Activity 2022
 
Description Presented PSDI Poster at Drug Discovery World 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Industry/Business
Results and Impact Presented our PSDI poster at Drug Discovery World.
Year(s) Of Engagement Activity 2022
URL https://www.elrig.org/portfolio/drug-discovery-2022/
 
Description Presented PSDI Poster at Royal Society of Chemistry Meeting 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Other audiences
Results and Impact Presented our PSDI poster at the RSC Ultra Large Chemical Libraries Meeting.
Year(s) Of Engagement Activity 2022
URL https://www.rsc.org/events/detail/73675/ultra-large-chemical-libraries
 
Description Requirements analysis with National Research Facilities 
Form Of Engagement Activity A formal working group, expert panel or dialogue
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact Representatives from National Research Facilities in the UK joined a workshop with PSDI to discuss their data needs and requirements. Several sessions were run with active discussion among participants. Follow-up discussions have been held about further activities to explore with PSDI.
Year(s) Of Engagement Activity 2022