Characterization and correction of ascertainment bias in protein interaction network analysis

Lead Research Organisation: Imperial College London
Department Name: Life Sciences


Increasingly, research in the biological sciences is driven by the need to understand complex cellular systems as a whole. The production and analysis of very large amounts of data therefore play a central role in modern biology. These data are frequently organised around the concept of the genome, the set of all genes for each organism, and so are said to be 'genome-scale'. One area of particular interest is the study of protein interactions, that is, the various ways in which proteins (which make up the molecular machinery of the cell) are able to combine and communicate with one another. By understanding more about protein interactions and the biological networks that they support, researchers hope to explain many of the processes that keep cells alive.
However, we only have a partial view of the complete set of protein interactions, even for a relatively simple organism such as yeast. It is known that almost all of the available data are biased towards reporting certain types of interactions, and the nature of the bias may vary depending on the type of experiment used. Interactions drawn from small-scale experiments will often be biased towards well-studied proteins, leaving large parts of the genome untested. In addition, these experiments are not 100% accurate, so a certain amount of 'noise' enters the data in the form of incorrect or missing interactions. Together, these factors have a large impact on computational analyses of protein interaction networks, meaning that some features of the network might appear to be significantly different from those expected, when in reality they are not. Detecting, measuring and correcting for these errors and biases is therefore essential before we can produce reliable assessments of the properties of protein interaction networks and their implications for the function and evolution of biological systems.
In this project, we will use statistical modeling to address these issues and produce software that can correct for the biases present in the data. Using this software, we will re-test many of the currently held ideas about protein interaction networks to see if they are still valid when bias is taken into account. To help other researchers make the best use of the available interaction data, we will also produce a web-driven service to provide a confidence score for each possible interaction and the probability that it has been tested correctly.

Technical Summary

Increasingly, research in the biological sciences is driven by the need to understand complex cellular systems in their entirety. As a result, the production and analysis of genome-scale data play a central role in modern biology. One area of particular importance is the study of protein-protein interactions, as interrogated by yeast two-hybrid, tandem affinity purification plus mass spectrometry, protein fragment complementation or other direct or indirect methods. The networks formed by these interactions constitute an essential framework of cellular processes, within which more detailed models are being constructed. The problem of biased, noisy and incomplete protein interaction data is well known and has a substantial impact on conclusions drawn from the analysis of these networks, rendering much of the published research on network biology questionable. In this project we aim to develop a statistical modeling framework for the quantification of bias and error characteristics in genome-scale network data. This methodology will be applied to construct appropriate null samples for hypothesis testing of network properties, in order to re-assess the validity of existing claims of biological significance taken from the literature on biological networks. Using the statistical models developed, we will integrate the available protein interaction data to produce a probabilistic view of each organism's interactome and the extent to which it has been sampled, to be provided as a resource for the research community. This work will provide a crucial contribution to the ongoing development of systems biology by enabling a more 'cordial' meeting between top-down and bottom-up approaches.
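
The idea of constructing null samples that respect the bias in the data can be illustrated with a minimal sketch. It assumes, purely for illustration, that each protein carries a single study-bias covariate (a hypothetical publication count) and that each null set is drawn from proteins falling in the same covariate bin as the members of the set under test. The function names, the log-scale binning and the toy data are all assumptions and do not represent the statistical models developed in the project.

```python
import math
import random
from collections import defaultdict


def study_bin(pub_count):
    """Bin a publication count on a log2 scale (an illustrative choice)."""
    return int(math.log2(pub_count + 1))


def matched_null_sample(query, pub_counts, rng):
    """Draw one null set with the same study-bias profile as `query`.

    Each query protein is matched by a random protein from the same
    publication-count bin, sampled without replacement within the bin.
    Assumes the query proteins are distinct and all present in pub_counts.
    """
    bins = defaultdict(list)
    for protein in sorted(pub_counts):
        bins[study_bin(pub_counts[protein])].append(protein)
    needed = defaultdict(int)
    for protein in query:
        needed[study_bin(pub_counts[protein])] += 1
    sample = []
    for b, k in needed.items():
        sample.extend(rng.sample(bins[b], k))
    return sample


def empirical_p_value(query, statistic, pub_counts, n_null=1000, seed=0):
    """One-sided empirical p-value of statistic(query) against matched nulls."""
    rng = random.Random(seed)
    observed = statistic(query)
    hits = sum(
        statistic(matched_null_sample(query, pub_counts, rng)) >= observed
        for _ in range(n_null)
    )
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_null + 1)


# Toy data (hypothetical): publication counts and network degrees.
pub_counts = {"A": 120, "B": 100, "C": 90, "D": 2, "E": 1, "F": 2, "G": 1}
degree = {"A": 30, "B": 28, "C": 25, "D": 3, "E": 2, "F": 4, "G": 1}


def mean_degree(proteins):
    """Mean number of interaction partners over a set of proteins."""
    return sum(degree[p] for p in proteins) / len(proteins)


p_matched = empirical_p_value(["A", "B"], mean_degree, pub_counts)
```

In this toy example the well-studied proteins A and B have a high mean degree compared with the whole protein set, but not compared with other equally well-studied proteins, so the matched null does not flag them as unusual.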

Planned Impact

Research in the fundamentals of systems biology is expected to have substantial long-term impact for the general public in the context of drug development, genomic medicine, post-genomic agriculture and the biotechnological applications of synthetic biology, including biofuels. More immediate impacts will be concentrated in the pharmaceutical and biotech industries, where the effective exploitation of protein interaction data is of great interest. Engagement with the industrial beneficiaries of the research will chiefly be through presentations and demonstrations at national and international conferences. Our existing websites will be used to promote the resources developed, and a dedicated website and web service will provide unrestricted public access to the research outputs. We aim to publish the research in high-quality, high-impact journals such as the Nucleic Acids Research webserver issue, and have requested funds so that all publications can be made open access and remain accessible beyond the academic community. Further opportunities for engagement with the wider public will be sought through the Imperial College media office and the Royal Society, which funds JP as a University Research Fellow. We are very keen to maximize the exploitation of the research and to encourage its re-use in both the academic and non-academic communities. By the end of three years we therefore aim to have established a permanent, automatically updated website and DASMI service that will continue to provide access to the resources developed beyond the lifetime of the project. These resources will be freely available to the public to enable third parties to develop their own applications of our research. We expect that this project will provide an excellent opportunity for JP to develop collaborative relationships with industrial partners in the form of future research proposals and joint studentships.
The resources developed are likely to be of immediate benefit to all pharmaceutical and biotech companies with an interest in network biology.


Description: We have developed an effective and widely applicable method to detect and correct for biases in biological data sets that are caused by differences in the amount of study that different genes have received. This method can provide better justification for statistical statements based on analyses of biased data.
Exploitation Route: Our method is broadly applicable and will be of use in any situation where ascertainment biases in large data sets are suspected.
Sectors: Digital/Communication/Information Technologies (including Software); Environment; Healthcare; Government, Democracy and Justice; Manufacturing, including Industrial Biotechnology; Pharmaceuticals and Medical Biotechnology; Security and Diplomacy

Description: Ascertainment biases affect data sources from many fields, in particular data collected from WWW participation. Our models are finding uses in statistical analyses of these data sets, including sociological studies based on friendship networks derived from social media.
First Year Of Impact: 2014
Sector: Digital/Communication/Information Technologies (including Software)
Impact Types: Cultural, Societal

Title: Ascertainment bias detection and correction methods
Description: Ascertainment bias is a difficult factor to eliminate from computational analysis of genomic or interaction data. The methods developed are applicable to any situation where bias linked to a known ascertainment variable (e.g. number of publications mentioning a given gene) is suspected, and are capable of detecting and correcting for such biases in downstream analyses such as hypothesis testing.
Type Of Material: Data analysis technique
Provided To Others? No
Impact: In ongoing work in the group, this computational tool is permitting rigorous statistical analysis of the biological factors influencing the genes involved in viral-human interactions.
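
As a rough illustration of what detecting a bias linked to a known ascertainment variable might look like, the sketch below computes a Spearman rank correlation between a per-gene ascertainment variable (such as a publication count) and a per-gene measurement of interest. This is a generic, swapped-in diagnostic and not the method developed in the project; the function names and the example values are assumptions.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks.

    Assumes x and y have equal length and neither is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical values: publication counts and reported interaction counts.
publications = [120, 100, 90, 2, 1, 2, 1]
interactions = [30, 28, 25, 3, 2, 4, 1]
rho = spearman(publications, interactions)
```

A correlation close to 1 between the ascertainment variable and the measurement would suggest that downstream tests on that measurement should account for study effort, for example by comparing only against equally well-studied genes.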