PAML 5: A friendly and powerful bioinformatics resource for phylogenomics

Lead Research Organisation: University College London
Department Name: Genetics Evolution and Environment

Abstract

PAML (Phylogenetic Analysis by Maximum Likelihood) is a bioinformatics software package widely used in molecular evolution, molecular phylogenetics, virology, biochemistry, and genomics. It currently attracts over 1,000 citations per year (more than 16,000 in total since its first release in 1993) and has a well-established user base. One of the major strengths of the package is its rich collection of sophisticated models of DNA and protein sequence evolution, which are useful in maximum likelihood and Bayesian methods in phylogenetics. The package can be used, among other things, to compare different evolutionary trees, to infer adaptive molecular evolution affecting protein-coding genes, and to reconstruct sequences in extinct ancestral species. Despite being widely used, PAML has a poor user interface and a steep learning curve. It is not parallelized and is computationally inefficient when applied to large datasets. These issues have hindered its adoption by new users and impose a heavy computational burden on existing users.

In this project, we propose to redesign and reimplement the key algorithms in the PAML programs, parallelize the code, and develop an R interface. We will also expand the package's functionality by developing new mutation-selection models of codon evolution, which will lead to more accurate ancestral sequence reconstruction and more robust detection of genes under positive selection. These changes will greatly improve the usability and computational performance of the package, making it an extremely valuable bioinformatics resource for the biosciences research community.

This project is low-risk and high-impact. HPC technologies such as Pthreads and MPI have been around for many years. In phylogenetics, all the major likelihood and Bayesian programs have already been successfully parallelized, including PhyML, RAxML, MrBayes/RevBayes, and BEAST. Indeed, PAML is unusual among these programs in not having been parallelized. A frequently asked question at the Google discussion group for PAML is whether a parallel version exists. Given the ubiquity of multicore architectures, from laptops to supercomputers, parallelisation will radically boost the performance of PAML analyses and the scale of data that can be processed. The improved computational performance will benefit all existing users of PAML, and the improved user interface may help to attract new users.

The program will be distributed at its GitHub site (https://github.com/abacus-gene/paml) under the GPL 3 license. Support is mostly provided at the Google discussion group (https://groups.google.com/g/pamlsoftware), where users post and answer questions about the software. The PI visits the group regularly, in particular to answer the more technical user queries.

Technical Summary

The PAML package implements a number of statistical models for phylogenetic analyses of DNA and protein sequences. Its main strength lies in the rich repertoire of evolutionary models implemented, which can be used to estimate parameters in models of sequence evolution and to test interesting biological hypotheses. The program is widely used for teaching in molecular evolution courses worldwide, including the high-profile Woods Hole Workshop on Molecular Evolution and the Wellcome/EMBO Advanced Workshop on Computational Molecular Evolution (held in Hinxton and Crete in alternate years). Over the past three decades, the software package has become an important bioinformatics resource, attracting more than 15k citations. Annual citations have been rising steadily, indicating significant and continuing demand for the software.

Nevertheless, the package has a minimalist user interface based on control files and the command line, and is known for its steep learning curve. In this project we propose to parallelise the code to make use of modern multiprocessor, multicore computer architectures and to implement an R interface. The improved software will become a friendly and important bioinformatics resource, suitable for analysing genome-scale datasets. We will also develop advanced models of codon substitution for robust inference of positive selection and accurate reconstruction of ancestral sequences.

This will be a low-risk, high-gain project. HPC technologies such as Pthreads and MPI have been around for many years and have been successfully applied in several other phylogenetic programs. Given the ubiquity of multicore architectures, from laptops to supercomputers, the proposed parallelisation will radically boost the performance of PAML analyses and the scale of data that can be processed.
