PLUTo: Phyloinformatic Literature Unlocking Tools. Software for making published phyloinformatic data discoverable, open, and reusable

Lead Research Organisation: University of Bath
Department Name: Biology and Biochemistry

Abstract

Phylogenetic data, and the trees inferred from them, represent a hugely valuable resource for evolutionary biological research. The data are often expensive and time-consuming to acquire, and the results from analyses of these data - typically trees - represent a vast investment of effort and expertise across the global community of bioinformaticians and systematists. Trees, and their underlying character data, are often repurposed in other areas of biology; notably in evolutionary studies that seek to test patterns of genomic evolution or macroevolutionary trends. Despite their enormous value, recent research by the PDRA estimates that less than 4% of the phylogenetic trees published in 2010 are available in machine-readable form.

Our proposal stands at the leading edge of content mining technology. We will create Open Source 'data liberation' software tools that will allow us to unlock the greater proportion of phyloinformatic data from where they are currently buried in the literature. These will include phylogenetic trees, branch lengths and support values (extracted from the SVG content of PDF files), analytical methods and indices of data quality (from figure legends and the main body of the text) and the underlying molecular and morphological character data. We will also derive full bibliographic and geographical data for each source paper. We will test, refine and perfect these tools by applying them to PLoS, BMC, Elsevier, Wiley and Springer online content from the 21st century. Once the data are extracted, we will ensure that their immense interdisciplinary (evolutionary biology, ecology, ethology, palaeobiology and conservation) and legacy potential is realised by making them available online in an explicitly open manner. We will also use the data ourselves in order to address several related questions concerning research effort, phyloinformatic data quality and the progress of systematic research.

While there is renewed interest and emphasis on curating underlying research data and results (exemplified by projects such as TreeBASE, Dryad, BMC's partnership with LabArchives, and FigShare) these ventures rely upon author submission, which is rarely mandated by journals. Uptake has been slow and coverage is woeful. The data archiving success of NCBI/GenBank for nucleotide sequences (N.B., not alignments, trees or other results, and certainly not morphology) is the exception rather than the rule in the Biological Sciences. For the foreseeable future, therefore, there is a pressing need to retrospectively gather data from the published literature.

This project is extremely novel in its scale and ambition. If successful in re-extracting the majority of phylogenetic data from the last decade, the software will easily be adapted and modified by others to suit the data re-extraction needs of other areas of science. This will better harness the billions of pounds of research money hitherto invested into obtaining and analyzing data, only for it to have been locked down and subsequently obfuscated in PDF publications when projects are completed.

The project is also widely trans-disciplinary, bringing together a macroevolutionary phylogeneticist (Wills), a chemoinformaticist (Murray-Rust), and a young, up-and-coming researcher (Mounce). The potential wider benefits of this project are vast and diverse; content mining techniques are estimated to be capable of generating up to £200 billion annually in added value for Europe alone. We cannot claim to generate those benefits directly, but we will create open tools and generate open data that will greatly facilitate other commercial, industrial and academic ventures.

Technical Summary

While there are well-established and excellent repositories for molecular sequence data (NCBI), there are no comparable resources for alignments or morphological data (Dryad is the best), still less for trees or other meta-data (measures of tree support, indices of homoplasy, etc.). These data remain locked down into PDFs, and are currently not machine-readable. This is hugely detrimental to many biological disciplines.

We will develop and perfect tools (PLUTo) enabling researchers to unlock phyloinformatic data from published PDFs. These will generate Newick/NeXML tree files (with branch lengths and support metrics) by interpreting SVG and other graphics, and parsing the text/legends for other data.
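As an illustration of the target output format, the following minimal sketch serialises a small tree, with branch lengths and support values, as a Newick string. The `Node` class, the `to_newick` function and the example taxa are hypothetical and are not part of PLUTo or AMI2; the sketch assumes support values are written as internal node labels, which is a common Newick convention.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One node of an extracted tree (hypothetical structure, not PLUTo's own)."""
    name: str = ""                    # taxon label for tips, empty for internal nodes
    length: Optional[float] = None    # length of the branch leading to this node
    support: Optional[float] = None   # e.g. bootstrap value on an internal node
    children: List["Node"] = field(default_factory=list)


def to_newick(node: Node) -> str:
    """Serialise a node and its descendants as a Newick fragment (no trailing ';')."""
    if node.children:
        inner = ",".join(to_newick(child) for child in node.children)
        # Assumption: support values are written as the internal node label.
        label = f"{node.support:g}" if node.support is not None else node.name
        text = f"({inner}){label}"
    else:
        # Unquoted Newick labels cannot contain spaces, so use underscores.
        text = node.name.replace(" ", "_")
    if node.length is not None:
        text += f":{node.length:g}"
    return text


# Illustrative three-taxon tree with made-up branch lengths and support.
tree = Node(children=[
    Node(children=[Node("Homo sapiens", 0.1), Node("Pan troglodytes", 0.12)],
         length=0.05, support=98),
    Node("Gorilla gorilla", 0.2),
])
newick = to_newick(tree) + ";"
print(newick)
```

A NeXML export would carry the same information as XML elements and annotations rather than as a single string.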

We will use AMI2 extraction technology, based on PDFBox, JUMBO and AMI-code. This is presently in prototype. The full code system (PLUTo) will comprise AMI2 and SOLR.

The beta will be presented to BMC, PLoS and EuPMC staff/boards. We will also contact selected TA publishers to seek CC0 extraction agreements. The corpus of data will then be checked and annotated in detail by a data clerk (Bath) and via PyBossa (an OKF crowdsourcing community platform).

We will explore the possibility of EuPMC and publisher-adopted installation for sustainable CC0 tree extraction. We will develop annotation tools for testing and validating PLUTo on new content. We will set up a PLUTo server based on SOLR (OKF already has Pubcrawler with extracted bibliographic metadata for 25 million STM publications in CKAN). PLUTo content will be uploaded to the OKF CKAN/Datahub (following the model for data.gov.uk).

We will use the corpus to address key questions in phyloinformatics and systematics. Which clades are the foci of phylogenetic research and what types of data are being used? Importantly, how does this research effort relate to the diversity of clades? Are some groups disproportionately under-sampled? Is the quality of phylogenetic data variable across higher taxa?

Planned Impact

Academic Impact

This project will have international academic impact in five areas.
1. Our results will be indispensable for any researcher conducting a systematic review of the phylogenetic literature. We will have identified where all papers containing phylogenetic trees have been published in the last decade, and be able to extract trees, their meta-data and underlying character data from many of these publications. These resources will be repurposed in many additional projects.
2. Our resources will be invaluable for evolutionary biologists, ecologists, ethologists and palaeobiologists needing to test evolutionary hypotheses against a phylogeny. There is also vast potential for developing phylogenetically-informed indices of conservation priorities.
3. The project will complement and enhance the published literature by providing discoverable, open, reusable data. Our resources will also complement projects elsewhere, especially the Assembling and Visualising the Tree of Life NSF projects. In particular, we will author tools that will enable much of the backlog of phyloinformatic data to be liberated from the literature. We are aware of no strategy to achieve this implemented elsewhere.
4. All systematic researchers will benefit from the project, as it relieves them from the responsibility of submitting their trees and meta-data to a separate repository (e.g., TreeBASE). The woeful coverage of such repositories (<4%) speaks to the inefficiency with which self-archiving captures the overall research investment. Making the data within a paper available and re-usable also increases the probability that the paper will be cited.
5. The new tools and data will revolutionise the process of supertree construction. They will integrate with other tools under development for this purpose, notably the Supertree Toolkit (STK).

Economic and Societal Impact

A recent appraisal of the economic potential of content mining techniques as applied to the scientific literature estimated that the value to Europe's economy could be £200 billion annually. Our project will develop cutting-edge technologies; considerably more advanced than straightforward text-mining, because we are also extracting images and amalgamating both techniques. Our proposal will promote technological progress in this area by providing open source software tools that can be applied transferably to other problems. A recent JISC report found that data mining techniques can result in substantial cost savings, productivity gains and innovative service development. The same report also found considerable potential for societal benefit; most significantly the provision of better visualisation techniques for large volumes of data. These new tools will ultimately allow researchers "to better convey research findings and other complex ideas to general audiences".

New developments in content mining technology within academia also highlight the need for a fresh appraisal of UK Copyright law. Currently there are no exceptions allowed for research purposes. However, the independent Hargreaves Review of intellectual property and growth suggested that exceptions should be made; particularly and especially where there are clearly identified scientific benefits. Thus, our findings (and the benefits from them) will lend considerable weight to these Hargreaves recommendations. This has the potential to influence legislative change that will affect UK society as a whole.

Publications

 
Description There are huge numbers of phylogenetic trees and associated data in the biological literature, but most of these data are locked into figures that are not easily machine-readable. Phylogenetic trees are research outputs that often represent the distillation of hundreds of hours of research and computer time, but their inaccessibility in electronic form means that their utility is severely limited. Unlike molecular sequence data - which are archived and curated systematically in central online repositories - the archiving of morphological data is in its infancy, and the deposition of trees within servers such as 'TreeBASE' is piecemeal. In this project, we have explored ways to extract phylogenetic trees from the literature. We have developed a variety of tools (PLUTo) within the ContentMine workflow: some enable us to extract trees and metadata from PDF files, and others enable us to systematically scrape journal content from open journal repositories. We have used these tools and data for three scientific applications thus far. Firstly, we have produced a supertree of microorganisms from the trees contained within 4,300 papers from the International Journal of Systematic and Evolutionary Microbiology. Secondly, we have analysed 1,300 morphological data matrices extracted by our scripts (and derived from other repositories) to explore the distribution of homoplasy (phylogenetic noise) across major groups of animals. Some clades (e.g., arthropods) contain significantly more homoplasy than others (e.g., mammals). Thirdly, we have used a small subset of our data to explore the heterogeneity of phylogenetic signal across the cranial and postcranial skeletons of vertebrates, finding that there is often significant disagreement between the trees derived from the partitions.
Exploitation Route Our software and tools require further development, but they have already allowed us to tackle analyses that would have hitherto been intractable. There is potential to develop the accuracy of our character recognition algorithms, and the efficiency with which we check taxonomic names. Many phylogenetic figures are complex, composite diagrams that are difficult to decompose into tree branching structure, support values and taxon names because of conflating annotation. We are beginning to get to grips with these, but further development of the algorithms would be desirable. More generally, there is a vast demand for tools that will enable phyloinformatic and other scientific facts (currently tied up in pdf and other file types) to be made discoverable, indexable and available for repurposing. Our open source PLUTo tools, as part of the ContentMine workflow, will continue to develop and contribute to this objective.
Sectors Digital/Communication/Information Technologies (including Software),Education,Environment,Culture, Heritage, Museums and Collections,Other

 
Description This work was made possible in large part by the introduction of a copyright reform in the UK in 2014 (the "Hargreaves" exemption). This allowed researchers to mine documents for non-commercial purposes without the permission of the rights holder. Before Hargreaves, we would have had to write 5,000 emails to authors requesting permission to mine the documents. From past experience (e.g. Dr Max Haeussler mining sequences) we would expect that roughly 1/3 would refuse, 1/3 would fail to reply and 1/3 would give permission (but often with restrictive conditions). This would have added many months to the research and drastically reduced its value. The PLUTo work is seen as a pre-eminent example of "The Right to Read is the Right to Mine". In 2015 Europe, through the Parliament and also the Commission, has sought to emulate the UK legislation to reform copyright. PM-R has worked closely with the European lead, MEP Julia Reda, and has also given evidence at the Commission and presented the work to other interested NGOs (LIBER, LERU, RLUK) and H2020 projects (LEARN, OpenMinTed, FutureTDM). PLUTo is therefore one of the archetypal examples of the importance of legal and directive reform.
First Year Of Impact 2015
Sector Digital/Communication/Information Technologies (including Software)
Impact Types Societal,Policy & public services

 
Description INCREASING COMPLEXITY: THE FIRST RULE OF EVOLUTION?
Amount $793,291 (USD)
Funding ID 61408 
Organisation The John Templeton Foundation 
Sector Academic/University
Country United States
Start 09/2019 
End 05/2022
 
Description NERC GW4+ Studentship (£70,000; 2019-2023)
Amount £70,000 (GBP)
Organisation GW4 
Sector Academic/University
Country United Kingdom
Start 09/2019 
End 08/2023
 
Description SWBIO DTP Studentship
Amount £70,000 (GBP)
Organisation Biotechnology and Biological Sciences Research Council (BBSRC) 
Sector Public
Country United Kingdom
Start 09/2017 
End 08/2021
 
Title Computational results of PLUTo ami-phylo analysis of trees from Int. J. Syst. Evol. Microbiol 
Description Computational results of PLUTo ami-phylo analysis of trees from Int. J. Syst. Evol. Microbiol 
Type Of Material Database/Collection of data 
Year Produced 2015 
Provided To Others? Yes  
Impact Supertree paper in preparation 
URL https://github.com/ContentMine/ijsem
 
Title Corpus of data from International Journal of Systematic and Evolutionary Microbiology 
Description Computational results of PLUTo ami-phylo analysis of trees from Int. J. Syst. Evol. Microbiol. 
Type Of Material Database/Collection of data 
Year Produced 2015 
Provided To Others? Yes  
Impact Paper in prep. 
URL https://github.com/ContentMine/ijsem
 
Title Edited TNT files from Mounce et al. 2016, http://dx.doi.org/10.1111/evo.12884 
Description Mounce RCP, Sansom R, Wills MA (2016) Data from: Sampling diverse characters improves phylogenies: Craniodental and postcranial characters of vertebrates often imply different trees. Dryad Digital Repository. http://dx.doi.org/10.5061/dryad.7hb7r 
Type Of Material Database/Collection of data 
Year Produced 2016 
Provided To Others? Yes  
Impact Publication http://dx.doi.org/10.1111/evo.12884 
URL http://dx.doi.org/10.5061/dryad.7hb7r
 
Title Supertree Toolkit Website: Data on Caridea 
Description Data from: https://www.nature.com/articles/s42003-018-0018-6 doi:10.1038/s42003-018-0018-6 
Type Of Material Database/Collection of data 
Year Produced 2018 
Provided To Others? Yes  
Impact Paper in 'Communications Biology' 
URL http://supertreetoolkit.org/?q=data
 
Title Supplementary files for: Why should we compare morphological and molecular disparity? 
Description 1. Indices of morphological disparity seek to summarise the highly multivariate morphological variation across groups of species within clades, time bins or other groups. Morphological variation can be quantified using geometric morphometric, outline or surface-based methods. These are most effective when morphological differences are relatively modest and there are numerous ubiquitous landmarks and phase-aligned features of shape variation. The most disparate samples, such as those across classes and phyla, typically necessitate the use of discrete characters. Unfortunately, such characters are often compiled subjectively in a manner reflecting the level of morphological and taxonomic focus and the intensity of taxon sampling. 2. Sampling intensity is often highly variable within a single data set, especially in repurposed and amalgamated cladistic matrices. Here we propose indices of molecular disparity analogous to those of morphological disparity. Molecular sequence data can be obtained in a more objective, automated and scaleable manner than morphological data. 3. Comparisons of the morphological and molecular disparity of subclades in sixteen large data sets suggest that molecular disparity is less susceptible to sampling biases than morphological disparity. Moreover, distance matrices inferred from individual genes tend to correlate strongly with each other and with distances from all concatenated genes. By contrast, morphological and molecular disparity are typically not significantly correlated across subclades, such that comparisons for groups can help to give a fuller picture of their evolution. Within mammals, Afrotheria have conspicuously high morphological disparity but modest molecular disparity, suggesting unusually high morphological plasticity. Even more strikingly, the molecular disparity of rodents is over five times that for Artiodactyla, despite having only half of their morphological disparity. 
These contrasts suggest the differential operation of geometric, biomechanical, ontogenetic and environmental constraints on form. 4. Given the increasing abundance of total evidence data sets in the literature and the widespread and sometimes uncritical repurposing of discrete morphological characters, we propose the comparison of morphological and molecular disparity as a useful tool to understand subclade evolution more fully. 
Type Of Material Database/Collection of data 
Year Produced 2022 
Provided To Others? Yes  
URL http://datadryad.org/stash/dataset/doi:10.5061/dryad.4b8gthtgs
 
Title Thesis: Morphological disparity across clades: Correlates, limitations and alternatives - Chapter 6: What determines disparity in avian clades? 
Description Morphological disparity is an aspect of avian evolution that remains understudied and has rarely been quantified explicitly, despite its importance for inferring patterns of avian evolution. Morphological and molecular data are now routinely used both in combination and in isolation to infer phylogeny and to study evolutionary rates. Similarly, parallel studies of phylogeny, diversity and morphological disparity are now commonplace in both the neontological and palaeontological literature (Giribet, 2015; Bromham et al., 2002; Hopkins and Gerber, 2017; Deline et al., 2018; Prum et al., 2015). Most recently, the concept of molecular disparity has been introduced as an analogue of morphological disparity, although there are few studies that attempt to deploy it (Deline et al., 2018; van den Ende et al., 2022). This chapter addresses ten related questions using the largest available morphological and molecular data set for birds. 
Type Of Material Database/Collection of data 
Year Produced 2022 
Provided To Others? Yes  
URL http://datadryad.org/stash/dataset/doi:10.5061/dryad.w3r2280vc
 
Title DiagramAnalyzer 
Description Part of the AMI system. DiagramAnalyzer takes a set of graphics primitives from ImageAnalysis (mixture of paths and pixels) originating from diagrams and turns them into an approximation of the original diagram. Current aspirations are: binary and other phylogenetic trees (mainly done) PhyloTreeAnalyzer. 
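To illustrate the kind of problem described here, the following minimal sketch assembles the horizontal and vertical line segments of a simple rectangular tree drawing into a Newick string. It is not the DiagramAnalyzer or PhyloTreeAnalyzer algorithm; the function name, the tuple layouts and the tolerance `tol` are assumptions made for illustration. It also assumes a clean drawing in which every branch is longer than `tol` and the leftmost branch is the root stem.

```python
from typing import List, Tuple

HSeg = Tuple[float, float, float]  # horizontal branch: (x_left, x_right, y)
VSeg = Tuple[float, float, float]  # vertical connector: (x, y_min, y_max)
Label = Tuple[float, str]          # tip label: (y, text)


def segments_to_newick(hsegs: List[HSeg], vsegs: List[VSeg],
                       labels: List[Label], tol: float = 0.5) -> str:
    """Assemble a rectangular tree drawing into Newick; branch length = drawn width."""

    def subtree(branch: HSeg) -> str:
        x_left, x_right, y = branch
        length = x_right - x_left
        for vx, y_min, y_max in vsegs:
            # A connector touching the right-hand end marks an internal node.
            if abs(vx - x_right) <= tol and y_min - tol <= y <= y_max + tol:
                children = sorted(
                    (h for h in hsegs
                     if abs(h[0] - vx) <= tol and y_min - tol <= h[2] <= y_max + tol),
                    key=lambda h: h[2],
                )
                return "(" + ",".join(subtree(c) for c in children) + f"):{length:g}"
        # No connector: the branch ends in a tip, labelled by the nearest text.
        name = min(labels, key=lambda lab: abs(lab[0] - y))[1]
        return f"{name}:{length:g}"

    root = min(hsegs, key=lambda h: h[0])  # assumption: leftmost branch is the root stem
    return subtree(root) + ";"


# Hypothetical primitives for a three-tip drawing (coordinates in pixels).
hsegs = [(0, 10, 50), (10, 30, 20), (10, 60, 80), (30, 60, 10), (30, 60, 30)]
vsegs = [(10, 20, 80), (30, 10, 30)]
labels = [(10, "A"), (30, "B"), (80, "C")]
newick = segments_to_newick(hsegs, vsegs, labels)
print(newick)
```

Real figures are much harder: segments arrive as noisy paths or pixels, and annotation often overlaps the branches, so a production tool needs far more robust matching than this.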
Type Of Technology Software 
Year Produced 2014 
Open Source License? Yes  
Impact In progress 
URL https://bitbucket.org/petermr/diagramanalyzer
 
Title PLUTo - Code associated with project 
Description AMI provides a generic infrastructure where plugins can search, index or transform structured documents on a high-throughput basis. The typical input is structured, normalized, tagged XHTML, possibly containing (or linked to) SVG and PNG files. The plugins are designed to analyse text or graphics or a combination according to the discipline. 
Type Of Technology Software 
Year Produced 2014 
Open Source License? Yes  
Impact Papers in prep. 
URL https://bitbucket.org/petermr/ami-core
 
Description Article for The Conversation • Crabs have evolved five separate times - why do the same forms keep appearing in nature? 
Form Of Engagement Activity A magazine, newsletter or online publication
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Public/other audiences
Results and Impact Crabs have evolved five separate times - why do the same forms keep appearing in nature?
120,000 reads
Year(s) Of Engagement Activity 2022
URL https://theconversation.com/crabs-have-evolved-five-separate-times-why-do-the-same-forms-keep-appear...
 
Description Article for The Conversation • Evolutionary tree of life: modern science is showing how we got so much wrong 
Form Of Engagement Activity A magazine, newsletter or online publication
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Public/other audiences
Results and Impact Evolutionary tree of life: modern science is showing how we got so much wrong
Year(s) Of Engagement Activity 2022
URL https://theconversation.com/evolutionary-tree-of-life-modern-science-is-showing-how-we-got-so-much-w...
 
Description Article in 'The Conversation': Dinosaurs could have avoided mass extinction if the killer asteroid had landed almost anywhere else. 36,000 reads 
Form Of Engagement Activity Engagement focused website, blog or social media channel
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Public/other audiences
Results and Impact Popular science article with 36,000 reads.
Year(s) Of Engagement Activity 2017
URL https://theconversation.com/dinosaurs-could-have-avoided-mass-extinction-if-the-killer-asteroid-had-...
 
Description Be Ready Webinar Programme - What is the Evidence for Evolution? 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Schools
Results and Impact A wide-ranging series of interactive online talks and workshops offering support and information for students, teachers and parents/carers.
Year(s) Of Engagement Activity 2020
URL https://www.bath.ac.uk/campaigns/be-ready-webinar-programme/
 
Description Blog on Caridean diversity paper 
Form Of Engagement Activity Engagement focused website, blog or social media channel
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Media (as a channel to the public)
Results and Impact Press release planned for next week
Year(s) Of Engagement Activity 2018
URL https://natureecoevocommunity.nature.com/users/85098-katie-davis/posts/30571-unravelling-the-secrets...
 
Description Content Mining: Talk at JISC DigiFest Birmingham, 2/3/2016 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact Content Mining: Talk at JISC DigiFest Birmingham, 2/3/2016
Including coverage of PLUTo and supertree analyses
Year(s) Of Engagement Activity 2016
URL http://www.slideshare.net/petermurrayrust/contentmine-tdm-at-jisc-digifest
 
Description ContentMine tools: mining images and texts for phylogenetic and species-related information 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Postgraduate students
Results and Impact Presentation at Workshop:

Tools and methods for constructing the Tree of Life
University of York, 19th and 20th December, 2016

Evolutionary history, or phylogeny, is the backbone of systematics and knowledge of phylogeny is essential in a variety of fields within evolutionary biology, such as macroevolution, palaeontology, evolutionary ecology, and conservation. Non-specialists are increasingly in need of large, inclusive phylogenies for which there are two main methods of construction: the supermatrix and the supertree. For both methods, the main issues are collecting, curating and processing the source data. The individual sources will have individual tips and character data that do not match other sources due to misspellings, synonyms, or use of higher-level taxa. How do you correct this for hundreds or thousands of data sources? How do you efficiently collate hundreds of data sources? Supertrees combine a number of these overlapping source trees (source data) to create the "supertree". In contrast, supermatrices take primary information from characters (including genes or morphological characters) and combine them into a single, large matrix. Both methods can be cumbersome and time-consuming when creating large phylogenies. And, less obviously but no less vitally, how do you even begin to visualise an output of thousands of tips in any meaningful way?
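The tip-matching problem described above can be sketched as follows. This is a minimal illustration, not a tool presented at the workshop: the synonym table, the function names and the example labels are all hypothetical, and a real workflow would check names against a curated taxonomic database rather than a hand-written dictionary.

```python
import re
from typing import Dict, Iterable, Set

# Hypothetical lookup: misspelling or older combination -> accepted name.
SYNONYMS: Dict[str, str] = {
    "pan troglodites": "Pan troglodytes",  # misspelling
    "felis concolor": "Puma concolor",     # older combination
}


def normalise_label(label: str, synonyms: Dict[str, str] = SYNONYMS) -> str:
    """Map a raw tip label onto one accepted binomial where one is known."""
    cleaned = re.sub(r"[_\s]+", " ", label).strip()
    key = cleaned.lower()
    if key in synonyms:
        return synonyms[key]
    # Otherwise fall back to standard capitalisation: 'Genus species'.
    parts = cleaned.split(" ")
    return " ".join([parts[0].capitalize()] + [p.lower() for p in parts[1:]])


def shared_taxa(source_a: Iterable[str], source_b: Iterable[str]) -> Set[str]:
    """Return the taxa two source trees have in common after normalisation."""
    return ({normalise_label(t) for t in source_a}
            & {normalise_label(t) for t in source_b})


tree_1_tips = ["Homo_sapiens", "Pan_troglodites", "Felis concolor"]
tree_2_tips = ["homo sapiens", "Pan troglodytes", "Puma_concolor", "Gorilla gorilla"]
overlap = shared_taxa(tree_1_tips, tree_2_tips)
print(sorted(overlap))
```

Without normalisation these two tip lists would share no labels at all, which is why label reconciliation has to precede any supertree or supermatrix assembly.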
Year(s) Of Engagement Activity 2016
URL https://jonxhill.wordpress.com/2016/11/15/tools-and-methods-for-constructing-the-tree-of-life/
 
Description Evolution and Bio science taster day 11th March 2020 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Schools
Results and Impact Lecture on convergent evolution and phylogeny and practical class on hominid evolution. 25 Pupils from postcode areas with low recruitment to tertiary education.
Year(s) Of Engagement Activity 2020
URL https://www.bath.ac.uk/announcements/widening-participation-programme-evolution-and-bio-science-tast...
 
Description Extinction in a Macroevolutionary Context - Talk 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Public/other audiences
Results and Impact Culture, Memory & Extinction

Recent months have seen an explosion of public, media and academic interest in the idea, threat and reality of extinction. This acknowledgement has contributed to debates over climate change and other, related, ways that humanity has altered environments and ecosystems in this epoch we have begun to call the Anthropocene. This one-day conference asks what role can culture play in widening the understanding, representation and, indeed, remembrance of this unfolding and catastrophic species loss. With this in mind, the event aims to foster dialogue between academics, journalists, museum curators, charities, writers, environmental groups, and the media to explore how societies engage with the complexities of the processes of extinction and remember the extinct. More specifically, the event examines how increased dialogue between these communities and constituencies contributes to the public re-evaluation and remembrance of life on our planet.

Speakers:

Dan Barnard & Rachel Briscoe. Lead Artists, fanSHEN Environmental Theatre Collective.
Fae Brauer. Professor of Art and Visual Culture, University of East London.
Sebastian Brooke. Director, MEMO (Mass Extinction Monitoring Observatory).
Melanie Challenger. Author, On Extinction.
Cathy Dean. Director, Save the Rhino.
Sebastian Groes. English and Creative Writing, Roehampton University.
Steve Parker. Author, Extinction: Not the End of the World?.
Jules Pretty. Professor of Environment and Society, University of Essex.
Bernd Scherer. Director, Haus der Kulturen der Welt, Berlin.
Matt Williams. Associate Director, A Focus on Nature.
Matthew Wills. Biodiversity Lab, University of Bath.

Free tickets: register at https://www.eventbrite.co.uk/e/culture-memory-and-extinction-tickets-19379987063
Website: https://naturalhistoryofmemory.wordpress.com/london-2015/
Email: memoryandextinction@gmail.com

Organised by The Natural History of Memory: Dr Lucy Bond (Westminster), Dr Rick Crownshaw (Goldsmiths), Dr Jessica Rapson (King's College London); Research assistant: Ifor Duncan (Goldsmiths).
Year(s) Of Engagement Activity 2015
URL http://instituteformodern.co.uk/2015/culture-memory-extinction
 
Description Fossil Roadshow @ Bath Festival of Nature (June 25th, 2017) 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Public/other audiences
Results and Impact Join experts from Milner Centre for Evolution at the University of Bath and BRLSI for Festival of Nature's first ever Jurassic Road Show!

Families from across Bath are invited to empty their closets and lofts and bring along their favourite fossils at Bath Festival of Nature in Parade Gardens, which is a free event from 11:00 - 18:00 on Sunday, June 25th. Whether it has been found on the beach or bought as a gift, our experts will attempt to identify them, say interesting things about them, and place them on a timeline stretching back millions of years across time.

There will also be plenty of showcase fossils on-site to explore, including some beautiful fossil ferns found in Radstock, some Mosasaur teeth (see below for images), and specimens from a favourite Bath venue, Bath Royal Literary and Scientific Institution.
Year(s) Of Engagement Activity 2017
URL http://www.bnhc.org.uk/whats-my-fossil/
 
Description Homoplasy and clade support across higher taxa and through research time - Systematics Biennial Meeting, Oxford, August 2015 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Homoplasy and clade support across higher taxa and through research
time
Ross Mounce1, Graeme Lloyd2, Mark Wilkinson3 & Matthew Wills1
1. Department of Biology and Biochemistry, University of Bath, Bath BA2 7AY, UK.
2. Department of Geography and Planning, Macquarie University, Sydney, Australia.
3. Department of Life Sciences, Natural History Museum, Cromwell Road, London SW7 5BD, UK.
E-mail: rcpm20@bath.ac.uk; m.a.wills@bath.ac.uk; graeme.lloyd@mq.edu.au; m.wilkinson@nhm.ac.uk
We quantify the levels of homoplasy in over 1,200 morphological character matrices
of animals using parsimony, and investigate its distribution across taxa and
throughout research time. No index of homoplasy is entirely satisfactory, and we explore
the empirical relationship between the ensemble consistency index (CI), the ensemble
retention index (strictly an index of retained synapomorphy), and the
homoplasy excess ratio (HER). We propose a refinement to the latter; specifically controlling
for the distribution of missing entries. We also investigate whether levels of homoplasy
predict levels of tree support; specifically mean non-parametric bootstrap
support, total support index (TSI) and proportional support index (PSI). Surprisingly, the
relationship is not especially strong. Lastly, we model the extent to which all homoplasy
and support indices are biased by data matrix dimensions, complementing the theoretical
work of Hoyal-Cuthill and colleagues. We find the expected inverse relationship
between the number of taxa and the CI, but also between the number of characters
and CI. The CI is therefore a poor measure of homoplasy between data sets. The RI
and modified HER are much less biased by dataset parameters, and yield similar results,
although we prefer the scaling of the latter. We demonstrate a significant decline
in the CI inferred for data sets over the last 30 years of research time, but this is
largely attributable to an increase in data set dimensions over the same period. Residual
CI and modified HER show no such trends. There are also significant differences
in homoplasy and branch support between higher taxa, which remain after modeling
out data set dimensions.
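The indices named in this abstract can be illustrated with a minimal sketch. It uses the standard textbook definitions of the ensemble CI (sum of minimum steps over sum of observed steps), the ensemble RI, and the original, unmodified homoplasy excess ratio; the refinement for missing entries proposed in the abstract is not reproduced here. The class name, function names and per-character step counts below are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class CharacterSteps:
    """Step counts for one character (hypothetical container for this sketch)."""
    min_steps: int  # m: fewest changes possible on any tree
    obs_steps: int  # s: changes required on the tree being assessed
    max_steps: int  # g: most changes the character could need on any tree


def ensemble_ci(chars: Sequence[CharacterSteps]) -> float:
    """Ensemble consistency index: sum(m) / sum(s)."""
    m = sum(c.min_steps for c in chars)
    s = sum(c.obs_steps for c in chars)
    if s == 0:
        raise ValueError("observed tree length is zero")
    return m / s


def ensemble_ri(chars: Sequence[CharacterSteps]) -> float:
    """Ensemble retention index: (sum(g) - sum(s)) / (sum(g) - sum(m))."""
    m = sum(c.min_steps for c in chars)
    s = sum(c.obs_steps for c in chars)
    g = sum(c.max_steps for c in chars)
    if g == m:
        raise ValueError("no parsimony-informative characters")
    return (g - s) / (g - m)


def homoplasy_excess_ratio(mean_random_length: float,
                           observed_length: float,
                           min_length: float) -> float:
    """Original (unmodified) HER: (MEANNS - L) / (MEANNS - MINL).

    mean_random_length is the mean tree length obtained from randomised
    copies of the matrix; it has to be supplied by a separate analysis.
    """
    if mean_random_length == min_length:
        raise ValueError("randomised and minimum lengths are equal")
    return (mean_random_length - observed_length) / (mean_random_length - min_length)


# Illustrative numbers only, not taken from any matrix in the study.
chars = [CharacterSteps(1, 1, 3), CharacterSteps(1, 2, 4), CharacterSteps(2, 4, 6)]
ci = ensemble_ci(chars)                        # 4 / 7
ri = ensemble_ri(chars)                        # (13 - 7) / (13 - 4)
her = homoplasy_excess_ratio(12.0, 7.0, 4.0)   # (12 - 7) / (12 - 4)
```

Because the CI has the observed tree length in its denominator, it tends to fall as taxa and characters are added, which is the dimension bias the abstract describes; the RI and HER rescale the observed length between a minimum and a reference maximum.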
Year(s) Of Engagement Activity 2015
URL http://www.systass.org/biennial2015/SystematicsAssociationBiennialConferenceProgramme2015v2.pdf
 
Description Inaugural Lecture open to the Public: Re-running the Tape of Life - Video also on Vimeo 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Public/other audiences
Results and Impact Purpose: to explain my research programme to a broad audience.
Year(s) Of Engagement Activity 2017
URL https://vimeo.com/219521162
 
Description Keynote talk to LEARN (LERU/H2020 project) for research data management. 29/01/2016 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact Keynote talk to LEARN (LERU/H2020 project) for research data management. Emphasizes that problems are cultural not technical. Promotes modern approaches such as Git / continuous integration, announces DAT. Asserts that the Right to Read is the Right to Mine. Calls for widespread development of content mining (TDM).

PLUTo project showcased as part of this
Year(s) Of Engagement Activity 2016
URL http://www.slideshare.net/petermurrayrust/the-culture-of-researchdata
 
Description PLUTo - Short talk at Chicago Computation Institute, November 2014 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact PLUTo - Short talk at Chicago Computation Institute, November 2014
Year(s) Of Engagement Activity 2014
URL https://github.com/ContentMine/Chicago-20141114/blob/master/Figure-Image-Mining/ami-tree_demo.md
 
Description PLUTo: Phyloinformatic Literature Unlocking Tools. Talk at iEvoBio, Raleigh, June 2014 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact iEvoBio is a forum bringing together biologists working in evolution, systematics, and biodiversity, with software developers and mathematicians. The goal of iEvoBio is both to catalyse the development of new tools, and to increase awareness of the possibilities offered by existing technologies (ranging from standards and reusable toolkits to mega-scale data analysis to rich visualization). The meeting extends over two full days and features traditional elements, including a keynote presentation at the beginning of each day and contributed talks, as well as more dynamic and interactive elements, such as lightning talk-style sessions, a software bazaar, and unconference sessions. The conference has established itself as a self-sustaining annual event and a must-attend for researchers, developers, and users of informatics resources at the intersection of phylogenetics, evolution, and biodiversity science.
Year(s) Of Engagement Activity 2014
URL http://www.slideshare.net/rossmounce/the-pluto-project-ievobio-2014
 
Description Phyloinformatic Literature Unlocking Tools. BOSC 2014 - Talk at the 15th Annual Bioinformatics Open Source Conference (BOSC 2014), held at the Hynes Convention Center in Boston on July 11-12 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Phyloinformatic Literature Unlocking Tools. BOSC 2014 - Talk at the 15th Annual Bioinformatics Open Source Conference (BOSC 2014), held at the Hynes Convention Center in Boston on July 11-12

The Bioinformatics Open Source Conference is held annually as a Special Interest Group (SIG) of the ISCB's annual ISMB conference. BOSC promotes and facilitates the open source development of bioinformatics tools and open science. The aims of the conference are to:
Provide developers with a forum for displaying the results of their development efforts to the wider research community;
Provide a focused environment for developers and users to interact and share ideas about software development, open science, and practical techniques in bioinformatics;
Promote Open Science, with its focus on sharing data and tools, transparency, reproducibility, and data provenance;
Inform the research community of important developments occurring within the Open Source Bioinformatics Developer community.
Year(s) Of Engagement Activity 2014
URL https://www.youtube.com/watch?v=qX0ocTMc8MY
 
Description Podcast as part of 42evolution.org 
Form Of Engagement Activity A broadcast e.g. TV/radio/film/podcast (other than news/press)
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Public/other audiences
Results and Impact Series of podcasts on evolutionary biology alongside those of various luminaries including Sir David Attenborough
Year(s) Of Engagement Activity 2015
URL http://www.42evolution.org/videos/researcher/professor-matthew-wills/
 
Description Ross Youtube video on PLUTo 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Talk on the PLUTo project aimed at academics, postgrads and a wider audience
Year(s) Of Engagement Activity 2014
URL https://www.youtube.com/watch?v=qX0ocTMc8MY
 
Description School Visit - Sexey's School - April 2014 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Public/other audiences
Results and Impact Discussion with pupils, parents and staff from Sexey's and other schools.

Discussion with headmaster regarding the teaching of evolution
Year(s) Of Engagement Activity 2014
URL http://sexeyshead.blogspot.co.uk/2014/04/head-masters-weekly-notes-25th-april.html
 
Description Slides on the Architecture of ContentMine 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Professional Practitioners
Results and Impact This is the evolving architecture of ContentMine (contentmine.org). It includes an overview (slide #2) showing getpapers, quickscrape, norma and ami.

The key container is the CTree and the architecture shows where components are added or transformed to this.

These slides are dated and may be out-of-date wrt code. Some diagrams are autogenerated from *.dot files.

Please use http://discuss.contentmine.org/c/software as the main source of up-to-date info. Feel free to ask questions, offer help, critique, etc.

All s/w is Open (BSD, Apache2)
Year(s) Of Engagement Activity 2015
URL http://www.slideshare.net/petermurrayrust/architecture-of
 
Description Talk - Content Mining in Europe. Open Forum Europe OFA, Brussels. 22/10/15 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Talk - Content Mining in Europe. Open Forum Europe OFA, Brussels. 22/10/15
Showcased PLUTo project

Talk to OpenForum Academy (Open Forum Europe) about Text and Data Mining. Four use cases selected for non-scientists. Also discussion of the latest on European copyright reform and TDM exceptions.
Year(s) Of Engagement Activity 2015
URL http://www.slideshare.net/petermurrayrust/content-mining-of-science-in-europe
 
Description Talk - Content Mining in Neuroscience. UNAM MX, 09/10/2015 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Professional Practitioners
Results and Impact Talk - Content Mining in Neuroscience. UNAM MX, 09/10/2015
Showcasing PLUTo's tree reading capabilities

How content mining, especially of diagrams, can help neuroscientists to read the literature effectively
Year(s) Of Engagement Activity 2015
URL http://www.slideshare.net/petermurrayrust/contentmining-in-neuroscience
 
Description Talk - Content Mining of Science and Medicine. FTDM Knowledge Cafe, Leiden, 29/02/16 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Professional Practitioners
Results and Impact Talk - Content Mining of Science and Medicine. FTDM Knowledge Cafe, Leiden, 29/02/16
Including coverage of PLUTo project
Year(s) Of Engagement Activity 2016
URL http://www.slideshare.net/petermurrayrust/text-and-data-mining-explained-at-ftdm
 
Description Talk - Digital Scholarship: Enlightenment or Devastated Landscape? - IT Futures Conference, Informatics Forum, Edinburgh, 17/12/2015 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact Talk - Digital Scholarship: Enlightenment or Devastated Landscape? - IT Futures Conference, Informatics Forum, Edinburgh, 17/12/2015
Showcased PLUTo project

Every year 500 Billion USD of public funding is spent on research, but much of this lies hidden in papers that are never read. I describe how machines can help us to read the literature. However there is massive opposition from publishers who are trying to prevent open scholarship and who build walled gardens that they control
Year(s) Of Engagement Activity 2015
URL http://www.slideshare.net/petermurrayrust/digital-scholarship-56229527
 
Description Talk - Mining scientific diagrams for facts. DAMTP, Cambridge UK, 27/02/16 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Professional Practitioners
Results and Impact Talk - Mining scientific diagrams for facts. DAMTP, Cambridge UK, 27/02/16
Coverage of PLUTo project and our IJSEM corpus of data and supertree
Year(s) Of Engagement Activity 2016
URL http://www.slideshare.net/petermurrayrust/mining-scientific-diagrams-for-facts
 
Description Talk - The technology for managing research data is already here...but we need a change of culture. Open Notebook Science. LEARN, London UK, 29/02/16 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact Talk - The technology for managing research data is already here...but we need a change of culture. Open Notebook Science. LEARN, London UK, 29/02/16
Including discussion of PLUTo project
Year(s) Of Engagement Activity 2016
URL http://www.slideshare.net/petermurrayrust/the-culture-of-researchdata
 
Description Talk at Bath Royal Scientific and Literary Institution 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Public/other audiences
Results and Impact Re-running the Tape of Life. Is Evolution Predictable?
Is evolution an essentially open-ended process of unlimited potential, or is its outcome predictable? If we could re-run the Tape of Life would small perturbations to starting conditions yield radically different outcomes, or would the course of evolution follow a familiar path, differing only in details? Matthew Wills will explore how major animal groups have evolved according to a common template, seeking evidence for actively driven evolutionary trends in morphological complexity and possible rules governing mass extinctions.
Year(s) Of Engagement Activity 2017
URL https://www.brlsi.org/node/90082
 
Description Talk at Evolution 2015 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Hill, J., Tovar, J., Davis, K.E., Wills, MA. Automated processing of taxonomy, metadata, and source trees for supertree construction.
Evolution 2015, Guarujá, Brazil (talk)
Year(s) Of Engagement Activity 2015
 
Description Talk at GSA Seattle 2017 - Climate change and extinction risk in an important group of marine invertebrates (Decapoda): Inferences from the geological past 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Talk at GSA. A promising new strategy in conservation biology is to use the geological record to inform present day intervention priorities. We illustrate three ways in which macroevolutionary data can be used to better understand biotic responses to current and ongoing environmental change.
We present a complete phylogeny of all 16,083 species of Decapoda; an order of crustaceans of great economic importance. We use this tree in combination with Species Distribution Models (SDMs) under three IPCC climate scenarios (A1B, B1 and A2) to highlight species most at risk of extinction due to climate change. We also investigate whether there is any link between Evolutionary Distinctiveness (ED) on our phylogeny and the likelihood of extinction under each IPCC scenario.

To further explore the value of this approach, we present macroevolutionary and macroecological data from two decapod infraorders: Anomura and Caridea. In Anomura (hermit crabs), habitat influences geohistorical speciation rates, with marine species under greater threat of extinction from global warming than freshwater species. In Caridea (shrimp), species that live in commensal or parasitic association with reef organisms (such as corals) experience lower speciation rates than free-living species. The risk of extinction in the former group may be elevated with the ongoing destruction of the fragile ecosystems they tend to inhabit. Although we do not suggest that these relationships are ubiquitous, they highlight the importance of macroecological and macroevolutionary perspectives for understanding the effects of current and ongoing climate change.

We give examples of three ways in which a macroevolutionary perspective can inform our understanding of the likely responses of extant species to current and projected environmental changes. 1) Phylogenies underpin valuable macroevolutionary indices, including measures of Evolutionary Distinctiveness. These can be used in conjunction with SDMs to infer proportions of biodiversity currently at risk of extinction. 2) The role of past climate change upon speciation rates can be modelled. 3) Some ecological traits may render species at greater risk of extinction. A future synthesis of these approaches may enable us to identify the Earth's most vulnerable species, and thereby to inform conservation strategies.
Year(s) Of Engagement Activity 2017
URL https://gsa.confex.com/gsa/2017AM/webprogram/Paper302970.html
 
Description Talk at Westonbirt School • What is the Evidence for Evolution 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Schools
Results and Impact Two talks to sixth formers on The Evidence for Evolution as part of British Science Week
Year(s) Of Engagement Activity 2023
 
Description Talk by Katie Davis at the University of Leeds (November): Climate change and extinction risk in an important group of marine invertebrates (Decapoda): Inferences from the geological past 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact Departmental talk
Year(s) Of Engagement Activity 2017
 
Description Talk on "Teeth" to children from Chapmanslade Primary School. 21st June 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Schools
Results and Impact Talk on "Teeth" to children from Chapmanslade Primary School
Year(s) Of Engagement Activity 2018
 
Description Talk on the Evidence for Evolution to 50 Sixth Formers from Villiers Park. A Widening Participation Activity organised through the University. 4th April. 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Schools
Results and Impact An activity aimed at attracting students from less advantaged backgrounds into science at University
Year(s) Of Engagement Activity 2017
 
Description Text Mining for Biologists - Workshop in Bath run by Ross Mounce & Peter Murray Rust, 28th July 2015 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact Text Mining for Biologists - Workshop in Bath run by Ross Mounce & Peter Murray Rust, 28th July 2015
25 Participants:
Content Discovery with GetPapers
Systematic Downloading of Journal Articles with Quickscrape
Recognition and Extraction of Species Names
Phylogenetic Tree Extraction with PLUTo
ElasticSearch and D3-based Visualisation of Results
Year(s) Of Engagement Activity 2015
URL http://www.eventbrite.com/e/text-data-mining-for-biologists-registration-17603659018#
 
Description The University of Bath and Bath STAR (Student Action for Refugees) open day for refugee-background students aged 16 - 19. "Are Aliens Real?". 19th Nov, 2019. 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Undergraduate students
Results and Impact The University of Bath and Bath STAR (Student Action for Refugees) open day for refugee-background students aged 16 - 19. "Are Aliens Real?". 19th Nov, 2019.
Year(s) Of Engagement Activity 2019
 
Description Three talks "What is the Evidence for Evolution?" to Years 9 and 10 RE classes at St Augustine's School, Trowbridge. 21st October, 2019 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Schools
Results and Impact Three talks "What is the Evidence for Evolution?" to Years 9 and 10 RE classes at St Augustine's School, Trowbridge. 21st October, 2019
Year(s) Of Engagement Activity 2019
 
Description Vimeo Video - Does Evolution Have Direction? 
Form Of Engagement Activity A broadcast e.g. TV/radio/film/podcast (other than news/press)
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Public/other audiences
Results and Impact A video for outreach and public understanding. Evolution is usually regarded as lacking any direction or goal, but the history of Life on Earth certainly seems to show some consistent patterns. How can we reconcile these two observations? Used for undergraduate teaching and available worldwide. Released 30/1/21
Year(s) Of Engagement Activity 2021
URL https://vimeo.com/showcase/8072205/video/506154182
 
Description Vimeo Video - What is irreducible complexity? 
Form Of Engagement Activity A broadcast e.g. TV/radio/film/podcast (other than news/press)
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Public/other audiences
Results and Impact A video for outreach and public understanding. Are some biological structures so complex that they couldn't possibly have evolved through a series of intermediates? Used for undergraduate teaching and available worldwide. Released 30/1/21
Year(s) Of Engagement Activity 2021
URL https://vimeo.com/showcase/8072205/video/506153679
 
Description Visit to Malmesbury School - talk open to members of the public (1st July) 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Public/other audiences
Results and Impact Talk to approximately 60 members of the public and a small number of pupils. There were excellent discussions after my talk, and I have been in correspondence with two of the attendees since the event.
Year(s) Of Engagement Activity 2015
URL http://www.malmesbury.wilts.sch.uk/assets/Attachments/Prof-Matthew-Wills-1st-July-2015.pdf
 
Description What limits the morphological disparity of clades? - Talk at Systematics Association Biennial, Oxford, August 2015 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact What limits the morphological disparity of clades?
Jack Oyston1, Martin Hughes2, Peter Wagner3, Sylvain Gerber4 & Matthew Wills1
1. Department of Biology and Biochemistry, University of Bath, Bath BA2 7AY, UK.
2. Department of Life Sciences, Natural History Museum, Cromwell Road, London SW7 5BD, UK.
3. Smithsonian Institution, National Museum of Natural History, Washington, DC 20560-0121, USA.
4. Department of Earth Sciences, University of Cambridge, Downing Street, Cambridge, CB2 3EQ, UK.
E-mail: jwo22@bath.ac.uk
Patterns of morphological and taxonomic diversity are often at odds. Specifically,
there is a tendency for groups to reach maximum levels of morphological disparity
relatively early in their evolutionary histories, even while species richness or
diversity is comparatively low. Early high disparity is evident not only in a diverse range
of animal clades but also major groups of vascular plants, suggesting it may represent
a universal evolutionary phenomenon. The shapes of disparity profiles through time
can be quantified in terms of their centre of gravity, with bottom heaviness (CG < 0.50)
being typical of extinct clades that do not terminate at a mass extinction. It is widely
supposed that increasing developmental constraints or ecological restrictions limit the
range of morphologies that can evolve within a clade; consistent with an observed
decrease in the rate of origination of novel bodyplans and higher taxa through time.
It has also been demonstrated that the rate of evolution of new character states decreases
through time, although the relationship between this 'character exhaustion'
and overall disparity has hitherto been untested. Here, we quantify the rate of character
exhaustion in 93 published phylogenies of extinct animal clades, and test for a
relationship with disparity profile centre of gravity. We find no significant correlation,
and conclude that patterns of early high disparity are not shaped by exhaustion of
the state space in any straightforward manner.
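The centre-of-gravity statistic mentioned in this abstract can be sketched as a disparity-weighted mean time, rescaled so that 0 marks the first time bin and 1 the last. This is a minimal illustration that assumes times are given as elapsed time since the clade's origin (bin midpoints) and that no correction is made for unequal bin durations; the function name and the example profiles are hypothetical, not data from the study.

```python
from typing import Sequence


def centre_of_gravity(times: Sequence[float], disparity: Sequence[float]) -> float:
    """Scaled centre of gravity of a disparity profile.

    times: bin midpoints, measured as time elapsed since clade origin
           (assumption of this sketch: larger values are more recent).
    disparity: disparity value for each bin.
    Returns a value in [0, 1]; below 0.5 indicates a bottom-heavy profile.
    """
    if len(times) != len(disparity) or len(times) < 2:
        raise ValueError("need at least two matching time/disparity values")
    total = sum(disparity)
    if total <= 0:
        raise ValueError("total disparity must be positive")
    start, end = min(times), max(times)
    if start == end:
        raise ValueError("time bins must span a non-zero interval")
    # Disparity-weighted mean time, then rescaled to the clade's duration.
    weighted_mean = sum(t * d for t, d in zip(times, disparity)) / total
    return (weighted_mean - start) / (end - start)


# Illustrative profiles only.
early_peak = centre_of_gravity([0, 1, 2, 3, 4], [4, 5, 3, 2, 1])
symmetric = centre_of_gravity([0, 1, 2, 3, 4], [1, 2, 3, 2, 1])
```

In this sketch the early-peaking profile gives a value below 0.5 (bottom heavy), while the symmetric profile gives exactly 0.5.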
Year(s) Of Engagement Activity 2015
URL http://www.systass.org/biennial2015/SystematicsAssociationBiennialConferenceProgramme2015v2.pdf
 
Description Widening Participation Event - First School 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Schools
Results and Impact Combe Down Yr 1. 'Dinosaur' visit 24/11/15
The year 1 curriculum is all about classification. The teachers have chosen dinosaurs (identification and common names) to illustrate structure, diet and environment. The curriculum extends this to include the structure of common animals including the human body, fish, amphibians, reptiles, birds and mammals.

The talk will include dinosaurs and some features which give clues to their structure and function (eye sockets and teeth?). The lab activities will give an opportunity to look closely at skulls (spot the difference - eye sockets and dentition); diet of herbivores, omnivores and carnivores; eggs from fish, snails, manduca moths, locusts, chickens, quails; spiral timeline - colouring activity; skeletons; classification of mammals, amphibians, birds etc. There will be a 5 minute 'ask the expert' session at the end of the session. The children will depart the lab at 11:45.
Year(s) Of Engagement Activity 2015
 
Description Widening Participation Event - First School (11th March 2015) 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Schools
Results and Impact Talk to first school pupils about extinction with a free discussion afterwards
Year(s) Of Engagement Activity 2015