CoED: Deep reinforcement learning for improving research productivity in the life science sector.

Lead Research Organisation: University of York
Department Name: Biology

Abstract

There have recently been significant leaps in deep reinforcement learning algorithms, with notable successes in games such as Atari arcade games and Go; however, there is still a need to adapt these techniques to be more widely applicable in other domains, such as the life science sector. Identifying regulatory relationships between genes is one of the primary research activities carried out by molecular biologists and geneticists, since learning the structure of gene regulatory networks is critical for many applications, for example understanding the origins of many diseases and how crops respond to their environments. Biologists sequentially conduct experiments that provide information about the gene network structure, but they must operate under strict cost and time limits. This project aims to formulate this experiment design procedure in a reinforcement-learning framework, to ascertain how biologists should prioritise experiments to maximise information about the gene networks, under constraints. The primary deliverable will be a Computer-aided Experimental Design (CoED) software tool to aid researchers in utilising their resources most effectively. This reinforcement-learning framework could also be used to identify the bottlenecks for biomedical research, such as the pricing model or the time-intensity of certain experiments, thereby identifying the most impactful areas for further development in experimental methodology. We will deliver impact by providing consultation services to laboratory supply and service providers, and through our collaboration with our industrial partner Google Brain Genomics. This project primarily aligns with the new approaches to data science and high productivity services through specialised artificial intelligence priority areas of this call.

Planned Impact

There are two main target beneficiaries of this research proposal: (i) biology researchers in academia and industry and (ii) laboratory supply and service providers.

In academia and industry, many biologists work on inferring the structures of gene regulatory networks, and the CoED software will help them design these experiments more effectively. Current approaches to network inference assume that biologists either conduct all their experiments at once or conduct them strictly sequentially, whereas in fact biologists usually run their experiments with partial concurrency. My team will develop software that better reflects the realities of the biology laboratory, and so will be of greater utility to biology researchers. Biologists will use the software to design the experiments most likely to reveal the structure of a gene network without exceeding deadlines or budgets. Gene networks are extremely important for reaching the EPSRC Prosperity outcomes of creating a 'Healthy Nation' and 'Resilient Nation'. For instance, gene networks help us understand the onset and progression of cancer (Healthy Nation) and how crops will respond to climate change (Resilient Nation).

Secondly, this project will help laboratory supply and service providers. According to 'The Scientist', there are approximately 1000 UK-based companies that provide key support services to academic and industrial biologists. However, since these companies are often small-to-medium sized, they rarely have the resources for an internal research team to evaluate what their customers (academic and industrial biologists) need to boost their research productivity. My team will analyse the schedules generated by CoED to identify which experimental protocols are the primary bottlenecks in experimental biology. From this analysis, my team will develop a business strategy to help these companies deliver the products and services that would have the greatest impact on research productivity.

Finally, there will be a significant outreach component of the project that will specifically target undergraduate women who major in computer science or related fields. In order to encourage this group of women to apply for graduate school in STEM subjects, I will apply to present my research and participate in a career roundtable at the Grace Hopper Celebration of Women in Computing. This EPSRC project will give me experience in working on a multidisciplinary project and developing industrial collaborations, knowledge that I could impart to other women as part of this outreach activity.

Publications

 
Description Please note that this is a continuation of a different award after a change of institution. The findings overlap between the two awards, because the same work was undertaken before and after the move.
The goal of this study is to develop strategies for improving the efficiency of experimental biology, saving time and money.
There are a number of key research outcomes that are currently in manuscripts that are under review:
1. We have developed a software tool (NITPicker) to help academic or industrial scientists select the optimal time points for an experiment, saving research time and money.
2. We have utilised new statistical methods for predicting spring onion growth rates based on environmental inputs, using a dataset that was collected in a distributed way. This will allow researchers to use crowdsourced science practices to develop hypotheses that can be tested in the lab, which can be cheaper than investing in dozens of expensive plant growth chambers.
3. We have developed a software tool (PAFway) to help academic or industrial scientists interpret a large biological network in a way that allows them to make new hypotheses about relevant biological pathways, increasing the efficiency of research about gene regulation.
4. We used the tools described in (1) and (3) to infer and validate a biological network that senses light in the morning and controls growth pathways in plants. Since plants are most resilient to many environmental perturbations in the morning, this network provides us with many candidates for future crop breeding activities.
5. We developed a website to allow biologists to easily search through the gene network in (4) and apply the results in their own research.
6. We developed a new software tool that allows a more in-depth analysis of time-series gene expression data, finding groups of relevant genes that would be interesting targets for future research. We are currently applying this new tool to learn how plants respond to the change in seasons, a topic with many agricultural applications because plants decide when to flower, grow fruit, etc. based on how they perceive the seasons.

Note that we have now successfully applied the methodology to a case study in Arabidopsis, developing a new gene network, in which specific predicted edges have been confirmed experimentally.
Exploitation Route We have developed four different novel statistical and/or data visualisation tools that are now available to the research community to enhance research productivity.
Sectors Agriculture, Food and Drink, Education, Pharmaceuticals and Medical Biotechnology, Other

 
Description A wide variety of research tools have been developed, which can be used to increase research productivity, leading to more efficient use of public money for research. Moreover, the researchers have become involved in AI for Social Good initiatives, such as working with NGOs to increase the efficiency of their workflows in the light of AI advances. This includes co-design of AI initiatives alongside NGOs during two Dagstuhl Seminars (2018 and 2024). Therefore, the reach of this work goes beyond academic impact.
First Year Of Impact 2020
Sector Agriculture, Food and Drink, Education, Healthcare
Impact Types Societal, Economic, Policy & public services

 
Title Dawn burst gene network visualiser 
Description Allows the user to search through and visualise a gene network that describes how genes are regulated in the early morning. Several edges of the network have already been experimentally validated. 
Type Of Technology Webtool/Application 
Year Produced 2020 
Impact While this network visualisation tool is already publicly available on my website, the publication that will introduce this to the broader community is currently under revision, so it is unlikely to be widely used yet. 
URL https://www-users.york.ac.uk/~de656/dawnBurst/dawnBurst.html
 
Title FRECL (Functional Regression Clustering) 
Description This is a statistical package for clustering time series RNA-seq data on the basis of the functional map between time series datasets. This can help with identifying groups of genes that are regulated in the same way as a result of environmental perturbation and help with experimental design. 
Type Of Technology Software 
Year Produced 2021 
Open Source License? Yes  
Impact We recently submitted a manuscript describing the novel statistical method underlying the software. We do not expect this to have a high impact until after the publication is peer reviewed in a statistics journal. 
URL https://gitlab.com/sconde778/frmm_rpackagefunctions.git
 
Title PAFway 
Description Summarises biological gene networks in a format that makes it easier to interpret and develop new biological hypotheses 
Type Of Technology Software 
Year Produced 2020 
Open Source License? Yes  
Impact Available for other researchers to use. We currently know that this is being used by at least one other research project. 
URL https://cran.r-project.org/web/packages/PAFway/index.html
 
Description Biological Physical Sciences Interdisciplinary Network (BPSInet) Summer Research Scheme 
Form Of Engagement Activity A formal working group, expert panel or dialogue
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Undergraduate students
Results and Impact Many undergraduate students lost their planned summer internships or summer jobs due to Covid-19. I helped five such students by providing them with a research placement over the summer. All were biology students, and I taught them advanced topics in statistics and machine learning to help them perform their analysis, training them in transferable skills. At least one now plans to pursue data science as a career.
Year(s) Of Engagement Activity 2020
URL https://www.york.ac.uk/physics/bpsi/
 
Description Discussion leader at Gatsby Summer School 
Form Of Engagement Activity A formal working group, expert panel or dialogue
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Undergraduate students
Results and Impact Led discussions for a tutorial group of undergraduate students, to encourage them to enter the field of plant sciences. I received very positive feedback and was particularly proud that some of the students took an interest in learning to program in R afterwards.
Year(s) Of Engagement Activity 2020
URL https://www.gatsby.org.uk/plant-science/programmes/gatsby-plant-science-summer-school
 
Description New Scientist article highlighting research 
Form Of Engagement Activity A press release, press conference or response to a media enquiry/interview
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Media (as a channel to the public)
Results and Impact New Scientist published an article highlighting one of our recent papers.
Year(s) Of Engagement Activity 2020
URL https://www.newscientist.com/article/2232410-privacy-of-hundreds-of-thousands-of-genetic-volunteers-...
 
Description Training experimental biologists in data science skills during Covid-19 
Form Of Engagement Activity Engagement focused website, blog or social media channel
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact A YouTube channel was created to help experimental biologists keep performing research (using data science) during the Covid-19 pandemic, while their labs were closed. Later, this was distributed to undergraduate biology students to help with transferable skills training. It is also generally available online and has been accessed over 100 times.
Year(s) Of Engagement Activity 2020
URL https://www.youtube.com/channel/UCc_UlQOUOwESFGu51hwxGPg