Ribosomal DNA variation in multi-locus systems

Lead Research Organisation: Quadram Institute
Department Name: National Collection of Yeast Cultures

Abstract

Recent technological advances have led to a dramatic drop in both the cost and the time taken to obtain the genome sequence of a chosen organism. As a consequence, the genomes of thousands of organisms are currently being sequenced around the world. Once these genome sequences have been obtained, researchers may then analyse them using a growing toolkit of software. Much of the effort analysing these sequences is naturally spent on examining the genes, which make proteins that are used in cells for growth and development.

Despite the quantity of genome sequences now publicly available, one part of them that has received scant attention is the ribosomal DNA (rDNA). The rDNA is essential for life, as it is involved in "reading" the sequence of a gene and from that sequence constructing a protein. The rDNA itself is a short sequence (a few thousand "letters" long) that in many organisms is repeated over and over again, in tens or hundreds of copies, at one or more locations within a genome. Until recently, researchers believed that all the tens or hundreds of copies of the rDNA within a single organism were identical. However, recent studies have shown that there are indeed differences between rDNA copies, both in terms of the number of copies and their DNA sequences. Furthermore, the rDNA is now being shown to play a role in important biological processes such as ageing, but we have yet to discover how these rDNA differences affect such processes.

Over the last decade, we have meticulously analysed the rDNA in species of yeast that package it within just a single location within their genome. We have shown that the differences between copies of the rDNA both within and between organisms encapsulate a rich source of evolutionary information. An important part of this work was developing two software tools, TURNIP and VariantLister, that enabled us to find those rDNA differences. Here, we will extend our knowledge of rDNA differences to include species that organise their rDNA across two or more genomic locations. We will do this by analysing special sequence datasets that comprise just a single chromosome within a genome - analogous to a chapter within a book - for the yeast species Candida glabrata (2 rDNA locations) and bread wheat (5 rDNA locations). Such an analysis is important as many species that humans depend upon, including farm animals and cereal crops, organise their rDNA across multiple locations and finding out how the rDNA differs between locations may help us to develop better breeds and varieties in the future. We will then test whether we could in fact have used DNA sequences from whole genomes to determine the same information, which will have broad implications for how we analyse organisms with multiple rDNA locations in the future. These tasks will require us to first improve the VariantLister software so that it can accurately find rDNA differences without the need for us to edit its results by eye.

We will then determine which of the rDNA differences that we have identified are actually used by yeast and wheat to construct proteins. In particular, we will discover if the rDNA differences they use depend on the genomic location at which they are found and the environmental conditions in which the organism is living (e.g. temperature, water availability). These results, which may indicate rDNA differences that change aspects of how an organism functions (i.e. its traits), will be communicated to relevant crop and yeast improvement projects that are aiming to develop new varieties and strains tailored to specific purposes (e.g. crops that grow best in certain environmental conditions).

Finally, we will make all project datasets and the VariantLister tool freely available on a dedicated project website, to the benefit of researchers around the world, so that others may carry out their own studies on rDNA variation, evolution and function.

Technical Summary

The ribosomal DNA (rDNA) evolves under a balance of heterogeneity-inducing point mutations acting against homogeneity-inducing concerted evolutionary processes. This balance is now known to be imperfect, leading to intra- and inter-organism variation in rDNA copy number and unit sequence. This situation is made even more complex by the presence, in many organisms, of multiple rDNA loci that are believed to be homogenised by processes such as gene conversion. Evidence is growing rapidly that rDNA variation is tightly linked to phenotype. Furthermore, the rDNA has now been implicated in vital biological phenomena such as genome integrity and ageing. We therefore urgently need to discover how rDNA variation is organised within a genome, how it is expressed and how it underpins the functionality of an organism.

We have developed VariantLister, a software tool that enables us to systematically characterise rDNA variation in organisms with a single rDNA locus. However, to date we have been unable to attribute rDNA variants to a specific locus in organisms with multiple rDNA loci. Through analysis of single chromosome datasets and further VariantLister development, we will carry out such an analysis for the first time, in yeast and wheat. We will also carefully assess whether rDNA variants called from whole genome sequence datasets can be clustered and accurately ascribed to distinct loci in organisms with multiple rDNA loci. Finally, by analysis of transcriptome datasets, we will discover which of the identified rDNA variants are expressed and how a variant's expression fate depends on its locus and the environment.

The results of these tasks will provide vital new knowledge on the organisation and expression of rDNA variants in two key eukaryotes, which will underpin future investigations of rDNA function and ultimately species improvement programs. Finally, we will disseminate all project software and data on a dedicated project website.

Planned Impact

This project has considerable promise to impact on UK society and the economy, the general public and the project participants. While economic and societal impact will be derived in the long term, benefits to the general public and project participants are expected to be realised both during and following the project.

1) The UK society and economy
Our society is currently facing significant challenges stemming from threats ranging from climate change to a growing and ageing population. We have an urgent need to secure and optimise future food production while also utilising food and agricultural waste in the replacement of petroleum as sustainable sources of key chemicals. This project will impact on both of these needs, to the benefit of our society and economy.

a) Crop breeders/Agri-food industry
Wheat is the UK's most important cereal crop, yielding 16.68 million tonnes in 2015, and a vital component of the UK diet. New knowledge of rDNA variation and expression in bread wheat will be communicated to crop breeders and the agri-food industry, to be used for the development of new varieties of wheat tailored to specific environmental conditions. In particular, analysis of RNA-Seq datasets will kick-start the identification of rDNA variants that are preferentially expressed under stress, including conditions of high temperature and low water. Dr Davey's position in the wheat community, including the BBSRC Wheat ISP, will be key to effective knowledge dissemination in the pursuit of continued food security.

b) Industrial biotechnology/Biopharma
The vast quantities of wheat straw left over from food production (e.g. 6.3 million tonnes in 2007), in particular in the East of England close to the project's location, are a key target substrate for secondary biorefining in the UK. Here, sugars released from the straw are fermented by yeast to produce a wide spectrum of platform chemicals and fuels. Harnessing the vast biodiversity of yeast is a fast-emerging area of interest to a wide range of UK companies, and NCYC has recently developed a new collaboration on yeast natural products with Croda, a FTSE 250 company. Yeast rDNA variants and expression profiles discovered in this project are expected to lead to the development of new strains that efficiently produce optimal quantities of a required chemical product. Consortia such as Sc2.0 and existing relationships with key companies such as Croda will ensure broad communication of our results.

2) The general public
There is a growing public appetite for scientific knowledge, with a wide recognition of the enormous impact that science has on our prosperity and continued well-being. The project team are highly committed to public scientific outreach, each tending to focus on a different part of this broad sector. Within this project, we will engage directly with members of the public, from schoolchildren to our society's most senior members, to educate them in its most important aspects. In particular, we will use our existing contacts within local organisations such as the SAW Trust and BBC Look East to introduce concepts such as genetic variation, synthetic biology and industrial biotechnology, to explain why we are carrying out this project and what benefits we anticipate it will bring to the local population and to the wider UK community.

3) The project participants
The three project investigators are all highly skilled in the training of new members of their field, and combined they have passed on a wide range of scientific and transferable skills to dozens of scientists in the UK and beyond, many of them now holding senior scientific positions of their own. Within this project, the post-doctoral research assistant will benefit from this expertise, gaining excellent inter-disciplinary training, with the project focus ensuring they possess the skill sets essential to the next generation of UK scientists.
 
Description The postdoctoral researcher on the project, Dr Ziauddin Ursani, was in post for 24 months. In that time, Dr Ursani developed two computational pipelines for the prediction of rDNA sequence variants from Next Generation Sequencing datasets. The first pipeline, a "linear" pipeline, is complete and has been tested successfully on yeast genomic datasets, detecting variants with greater breadth and accuracy than we had previously been able to achieve and, most importantly, without the need for time-consuming manual checking of results. A Python software program named Parsley that encapsulates this pipeline, and which includes additional facilities such as dataset simulation and evaluation of software accuracy, has been published at https://github.com/ziaursani/parsley_root and is currently being used to investigate rDNA sequence and copy number variation in a range of real datasets, with the results of these investigations expected to lead to a series of publications. The second pipeline, a "graphical" pipeline, is at the prototype stage. Dr Ursani showed in simulations that the graphical pipeline is capable of detecting an even greater number of sequence variants than the linear pipeline, but it is more complex to use. In future, as graphical variant calling becomes more embedded in everyday bioinformatics, it is likely that this pipeline will be ported into a future version of the Parsley software.
Exploitation Route We will test the Parsley software on wheat single chromosome and yeast transcriptome datasets in the coming months. This will generate new knowledge on the mechanisms of rDNA evolution and the potential functional effects of rDNA variation, which could impact on areas such as (for example) plant breeding. Several publications are anticipated to arise from this new knowledge.

The Parsley software developed within this project is also now freely available to all. Members of the biological and bioinformatics communities can therefore now use the software for the analysis of ribosomal DNA in any eukaryotic organism from Next Generation Sequencing datasets.

Sectors Agriculture, Food and Drink; Education; Healthcare; Pharmaceuticals and Medical Biotechnology

 
Description Collaboration with Dr Conrad Nieduszynski, Earlham Institute 
Organisation Earlham Institute
Country United Kingdom 
Sector Academic/University 
PI Contribution Research time by me and my PhD student (Ms Laura Tingley) on the development of software for rDNA variation discovery and on the analysis of datasets developed within the collaboration.
Collaborator Contribution Collaboration within the CELLGEN ISP. Development of datasets to gain a greater understanding of rDNA variation and evolution. Additional staff time (Dr Graham Etherington) for software development and dataset analysis.
Impact Inter-disciplinary collaboration very recently begun, including generation of biological datasets, analysis of prior biological datasets and software development.
Start Year 2023
 
Description Collaboration with Dr Jane Usher, University of Exeter 
Organisation University of Exeter
Country United Kingdom 
Sector Academic/University 
PI Contribution I am providing expertise in the computational analysis of ribosomal DNA, gained over the last 13 years. This collaboration with Dr Usher will help us to extend our expertise from computational analysis of single-locus rDNA systems (e.g. Saccharomyces cerevisiae) to multi-locus rDNA systems (e.g. Candida glabrata).
Collaborator Contribution Dr Usher is providing expert training and guidance on the single chromosome sequencing of strains of the yeast species Candida glabrata. Dr Usher is also providing expertise on the downstream characterisation and analysis of the resulting data, based on her own previous experience in this area.
Impact No outcomes yet in this multi-disciplinary project, but publications and new computer software anticipated in the coming months.
Start Year 2018
 
Description Collaboration with Dr Robert Davey, Earlham Institute 
Organisation Earlham Institute
Country United Kingdom 
Sector Academic/University 
PI Contribution The collaboration with Dr Davey primarily consists of weekly project meetings with me and Ziauddin Ursani, the postdoctoral researcher employed on this project. The meetings, where Dr Ursani and I present recent work to Dr Davey, collectively discuss its implications and decide on next steps to be taken, enable the project to progress more rapidly.
Collaborator Contribution The Earlham Institute, through Dr Davey, have offered the use of desk space and computing facilities to Dr Ursani for focussed project development, in addition to his QIB facilities.
Impact No outcomes yet in this multi-disciplinary project, but publications and new computer software anticipated in the coming months.
Start Year 2018
 
Title Parsley Root: Pipeline for Analysis of Ribosomal Locus Evolution in Yeast (Reusable for Other Organisms Though) 
Description The Parsley software predicts sequence and copy number variation for ribosomal DNA from Next Generation Sequencing datasets. It can also simulate rDNA sequence variation and compare simulated and predicted variation datasets in VCF format. It is written in the Python programming language and is freely available on GitHub at https://github.com/ziaursani/parsley_root .
Type Of Technology Software 
Year Produced 2020 
Impact Parsley is, to the best of our knowledge, the only software available worldwide for variant prediction in ribosomal DNA datasets from Next Generation Sequencing reads. Parsley has already been used successfully for MSc student training on the University of Edinburgh's Bioinformatics course.
URL https://github.com/ziaursani/parsley_root
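
The comparison of simulated and predicted variants in VCF format described above can be illustrated with a minimal sketch. This is not Parsley's own code or interface: the function names, the contig name "rDNA" and the example records are all hypothetical, and the sketch assumes that two variants match only when contig, position, reference allele and alternate allele are identical.

```python
from typing import Dict, Iterable, Set, Tuple

# A variant is identified here by (CHROM, POS, REF, ALT); this matching rule
# is an assumption made for illustration, not Parsley's documented behaviour.
Variant = Tuple[str, int, str, str]


def parse_vcf_lines(lines: Iterable[str]) -> Set[Variant]:
    """Collect (CHROM, POS, REF, ALT) keys from tab-separated VCF lines."""
    variants: Set[Variant] = set()
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines, meta-information and the header row
        fields = line.split("\t")
        chrom, pos, ref, alts = fields[0], int(fields[1]), fields[3], fields[4]
        for alt in alts.split(","):  # one key per alternate allele
            variants.add((chrom, pos, ref, alt))
    return variants


def compare_variants(simulated: Set[Variant], predicted: Set[Variant]) -> Dict[str, float]:
    """Score predicted variants against the simulated (known) variants."""
    true_pos = len(simulated & predicted)
    false_pos = len(predicted - simulated)
    false_neg = len(simulated - predicted)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(simulated) if simulated else 0.0
    return {
        "true_positives": true_pos,
        "false_positives": false_pos,
        "false_negatives": false_neg,
        "precision": precision,
        "recall": recall,
    }


# Hypothetical example records (not real data).
HEADER = ["##fileformat=VCFv4.2", "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO"]
SIMULATED_VCF = HEADER + [
    "rDNA\t120\t.\tA\tG\t.\t.\t.",
    "rDNA\t455\t.\tC\tT,A\t.\t.\t.",
    "rDNA\t900\t.\tG\tC\t.\t.\t.",
]
PREDICTED_VCF = HEADER + [
    "rDNA\t120\t.\tA\tG\t.\t.\t.",
    "rDNA\t455\t.\tC\tT\t.\t.\t.",
    "rDNA\t1300\t.\tT\tC\t.\t.\t.",
]

result = compare_variants(parse_vcf_lines(SIMULATED_VCF), parse_vcf_lines(PREDICTED_VCF))
print(result)
```

Exact matching on position and alleles is the simplest possible rule. A fuller evaluation would probably also need to normalise indel representation and compare variant frequencies across rDNA copies, which this sketch does not attempt.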
 
Description Annual rDNA lecture and practical (University of Edinburgh) 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Postgraduate students
Results and Impact Provision of an hour-long lecture and a two-hour-long practical session on the use of computer software for the identification of sequence variants within ribosomal DNA. The audience consisted of approximately 30-50 students enrolled on the MSc in Bioinformatics course at the University of Edinburgh. The activity takes place on an annual basis. The students appear highly engaged with the topic and ask lots of pertinent questions throughout the activity.
Year(s) Of Engagement Activity 2018, 2019, 2020, 2021, 2022, 2023
URL http://www.drps.ed.ac.uk/18-19/dpt/cxpgbi11115.htm
 
Description Challenges and opportunities of ribosomal DNA micro-heterogeneity detection and analysis in yeast NGS datasets 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact Research scientists and postgraduate students attended the Midsummer Phylogenetics Workshop at the University of East Anglia on July 17th 2018 to hear a varied program of research talks on computational biology, focussed mainly on phylogenetic analysis.
Year(s) Of Engagement Activity 2018
URL http://www.uea.ac.uk/computing/news-and-events/conferences/midsummer-phylogenetics-meeting-2018
 
Description Evaluating the use of variation graphs for the characterisation of yeast rDNA arrays 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Research scientists attended the IEA/AIE-2019: 32nd International Conference on Industrial, Engineering & Other Applications of Applied Intelligent Systems, Graz University of Technology, Graz, Austria, July 9-11, 2019, where they heard a variety of research talks on machine learning-related topics.
Year(s) Of Engagement Activity 2019
URL https://ieaaie2019.ist.tugraz.at/
 
Description Evaluating the use of variation graphs for the characterisation of yeast rDNA arrays - poster presentation 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Research scientists and postgraduate students from the international yeast community attended the British Yeast Group 2019 conference in Newcastle from 26th-28th June 2019.
Year(s) Of Engagement Activity 2019
URL https://microbiologysociety.org/event/society-events-and-meetings/byg-discovery-to-impact.html
 
Description Machine learning for precision variant detection in ribosomal DNA repeats 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact UK bioscience community members attended the BBSRC Artificial Intelligence in Biology Workshop in Norwich on October 4th 2018 to hear about current community machine learning research and to discuss how ML/AI would likely impact on the biosciences in the near future.
Year(s) Of Engagement Activity 2018
URL https://www.earlham.ac.uk/bbsrc-artificial-intelligence-biology-workshop