The Dundee Resource for Sequence Analysis and Structure Prediction (DRSASP) - 2023 and Beyond

Lead Research Organisation: University of Dundee
Department Name: School of Life Sciences

Abstract

This resource application is focused on continuing to support computer tools and techniques developed at the University of Dundee that are in daily use by thousands of biological research scientists and students throughout the UK and the world. The resource will not only ensure that these tools are readily available to all, but also improve the ability of scientists and students to use them through better interfaces, training videos and other online materials. The tools focus on the analysis of protein sequences and structures, which are briefly introduced here.
The plans to make a plant, animal or micro-organism are encoded in the molecule DNA and are known as its genome. The genome can be represented as a long word made up of four different letters (A, C, G, T). A genome ranges from a few thousand letters long for a virus to several billion letters for plants and animals. The genome is divided up into regions called genes, which are translated by complex molecular machines into other molecules such as proteins. Humans and other animals have 20,000-30,000 genes that code for proteins, and each protein is made up of a sequence of 20 different amino acid types joined together in a chain. Protein sequences from an organism vary in length from a few amino acids to several thousand, and each can be represented as a word made up of 20 different letter types. The protein chain folds up into a complex three-dimensional shape that is defined primarily by its sequence. The shape of the protein, its "conformation", dictates its biological function, so understanding the conformation of a protein is vitally important to understanding what the protein does. Over recent years there have been huge advances in DNA sequencing technology, and so the genomes of many different organisms have been determined. As a consequence, the sequences of several million proteins are now known, but fewer than 150,000 have had their detailed three-dimensional structures worked out.
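The picture of genomes and proteins as "words" over small alphabets can be made concrete with a minimal Python sketch. The helper names and the example sequences are invented for illustration and are not part of any DRSASP tool.

```python
# Illustrative only: helper names and example sequences are made up for this sketch.

# The four letters used to write a DNA sequence.
DNA_LETTERS = set("ACGT")
# One-letter codes for the 20 standard amino acid types.
PROTEIN_LETTERS = set("ACDEFGHIKLMNPQRSTVWY")


def is_dna(word: str) -> bool:
    """Return True if the non-empty word uses only the four DNA letters."""
    return bool(word) and set(word.upper()) <= DNA_LETTERS


def is_protein(word: str) -> bool:
    """Return True if the non-empty word uses only the 20 amino acid letters."""
    return bool(word) and set(word.upper()) <= PROTEIN_LETTERS


genome_fragment = "ATGGCCATTGTAATGGGCCGC"   # a short, made-up stretch of DNA
protein_fragment = "MAIVMGR"                # a short, made-up protein sequence

print(is_dna(genome_fragment), len(genome_fragment))
print(is_protein(protein_fragment), len(protein_fragment))
```

Because A, C, G and T are also amino acid codes, a DNA word passes the protein check too, so real tools usually need to be told which kind of sequence they have been given.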
The computational tools that will make up this resource help to bridge this information gap by classifying protein sequences and making predictions of protein structure available in an easy-to-interpret way that can guide biologists to design more efficient and effective experiments. This proposal will provide support, maintenance and training for the popular JPred protein structure prediction server, which performs around 30,000 predictions monthly for scientists in more than 100 countries, and for other techniques that we have developed. It will enhance JPred to include structure predictions from the latest methods for predicting protein three-dimensional structures. Web sites are good for humans to interact with, but less useful for computer software to interface with. Since our tools are useful for large analyses that might be done on many thousands of proteins, the resource also supports a novel "web services" interface to the tools using a technology called "Slivka", developed in Dundee, that makes it easy to add new services. Web services allow a program or application to be run remotely from within another program. For example, I might have a program running on my desktop computer but call for an intensive calculation to be done on a remote high-performance computer system. Our services are complex computer systems, but we will use technologies called Docker and Conda to package them in a way that lets other institutions install and run them. This will help ensure the services remain available even if those at Dundee fail.
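The idea of one program asking a remote service to do the heavy calculation can be sketched in Python as follows. This is a generic illustration only: the address, the JSON field name and the function names are placeholders assumed for the example, and they do not describe the actual Slivka or JPred interfaces.

```python
import json
import urllib.request

# Hypothetical endpoint: a placeholder, NOT the real Slivka or JPred address.
SERVICE_URL = "https://example.org/api/jobs"


def build_job_request(sequence: str, url: str = SERVICE_URL) -> urllib.request.Request:
    """Package a protein sequence as a JSON POST request for a remote service."""
    # The "sequence" field name is an assumption made for this sketch.
    payload = json.dumps({"sequence": sequence}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def submit_job(request: urllib.request.Request, timeout: float = 30.0) -> dict:
    """Send the request and decode the JSON reply (needs a real, reachable service)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Build the request locally; nothing is sent until submit_job() is called.
request = build_job_request("MAIVMGR")
print(request.get_method(), request.full_url)
```

Separating the building of the request from sending it means the same local code can be pointed at a different installation of a service simply by changing the address.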

Technical Summary

DRSASP supports a suite of key tools and makes them available through the web, APIs and as downloadable packages. We will continue to support the secondary structure prediction and solvent accessibility server JPred, which performs 30,000 predictions/month for scientists in >100 countries. We will enhance JPred by presenting its predictions alongside information from 3D structure (PDB, AlphaFoldDB, SwissModel, ESM Atlas, etc.) to highlight regions where it provides additional information. The JABAWS platform gives access to 8 multiple sequence alignment (MSA) methods, 4 disorder predictors, an RNA secondary structure predictor and 18 conservation calculations from alignment, and serves >19,000 jobs/month. Its successor, "Slivka", will provide a long-term solution for web-service deployment. Slivka will replace JABAWS during 2023, when it will support Jalview 2.12 and its JavaScript version JalviewJS. Further services will be added to Slivka over the course of the grant. The ProteoCache stores all new JPred results (>450,000 proteins to date). We will continue ProteoCache and add modules to allow output in SQL and flat-file formats. ProIntVar has powerful methods to integrate population variation data from VCF files with multiple protein 3D structures, MSAs, and protein-protein and protein-ligand interactions. We will deploy ProIntVar as a Slivka service and provide an interactive web interface. An objective of this proposal is the long-term support of services that remain useful to DRSASP's large international community. This will be achieved by migrating DRSASP to an institutional core resource at Dundee and packaging all legacy services in Docker/Conda. The users of DRSASP are very diverse, ranging from experimental biologists and undergraduate and school students to software developers. Accordingly, we will continue to develop extensive manuals, videos and e-learning materials to inform and educate users at all levels.
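One possible reading of "regions where it provides additional information" is the stretches of a protein that no 3D structure covers, so that a sequence-based prediction is the only structural information available there. The Python sketch below illustrates that reading only; the function name, the input format and the example numbers are assumptions, not the project's stated method.

```python
def uncovered_regions(length, covered):
    """Return 1-based inclusive (start, end) ranges with no structural coverage.

    length  -- number of residues in the protein
    covered -- list of 1-based inclusive (start, end) ranges seen in 3D structures
    """
    has_structure = [False] * (length + 1)  # index 0 is unused
    for start, end in covered:
        # Clip each range to the protein so out-of-range input is ignored.
        for pos in range(max(start, 1), min(end, length) + 1):
            has_structure[pos] = True

    gaps, gap_start = [], None
    for pos in range(1, length + 1):
        if not has_structure[pos] and gap_start is None:
            gap_start = pos                      # a new uncovered stretch begins
        elif has_structure[pos] and gap_start is not None:
            gaps.append((gap_start, pos - 1))    # the stretch ended at the previous residue
            gap_start = None
    if gap_start is not None:
        gaps.append((gap_start, length))         # stretch runs to the end of the protein
    return gaps


# Made-up example: a 20-residue protein with structures covering residues 3-8 and 12-15.
print(uncovered_regions(20, [(3, 8), (12, 15)]))
```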
