Cosmology and Cancer - Astronomical Statistics In Healthcare

Lead Research Organisation: Durham University
Department Name: Physics

Abstract

Astronomy has entered the big-data era, with modern survey telescopes set to acquire imaging of tens of billions of galaxies. To learn more about the Universe, one of the biggest challenges facing astronomers is applying modern statistical techniques in a way that can be scaled up to these ever-growing data-sets. This places the emphasis on automated methodology and software, because the sheer volume of data means that it is no longer feasible for a scientist to inspect the results of an analysis of each individual galaxy.

As data-sets grow, previously rare phenomena become common, for example strong gravitational lensing. When two galaxies are aligned perfectly down the line-of-sight to Earth, the background galaxy's light is bent by the intervening mass of the foreground galaxy. Its light can be fully bent around the foreground galaxy, traversing multiple paths to the Earth, meaning that the background galaxy is observed multiple times. This by-chance alignment of two galaxies, called a strong gravitational lens, is rare, with the dedicated effort of astronomers around the world discovering some 500 examples of such objects over the past 50 years. In the next 10 years, over 100000 strong lenses will be found, a testament to how the big-data era is changing the scale of astronomical data-sets.

Historically, the analysis of a strong lens takes months of human time and input. With 100000 objects, such an approach does not scale. Therefore, our team at Durham University developed software, called PyAutoLens, that fully automates the analysis. This combines contemporary techniques in computational statistics with a framework that breaks the fitting procedure down into smaller and simpler fitting phases that can be combined to build an automated fitting pipeline. By mapping out the traversal of light in 100000 strong lenses, we will measure the constituent mass of each foreground lens galaxy and determine the role that dark matter plays in their formation and evolution.
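The idea of chaining simpler fitting phases into a pipeline can be illustrated with a minimal sketch. This is not PyAutoLens code and does not use its API; the straight-line model, the brute-force grid search and every function name below (make_data, grid_search, run_pipeline) are assumptions chosen only to show the pattern. Phase 1 fits a simplified model over a broad parameter range, and phase 2 fits the full model over a range narrowed around the phase 1 result.

```python
import random


def make_data(slope=2.0, intercept=0.5, n=50, noise=0.05, seed=0):
    """Generate noisy samples of a straight line (a toy stand-in for real data)."""
    rng = random.Random(seed)
    xs = [i / (n - 1) for i in range(n)]
    ys = [slope * x + intercept + rng.gauss(0.0, noise) for x in xs]
    return xs, ys


def chi_squared(params, xs, ys):
    """Sum of squared residuals between the line model and the data."""
    slope, intercept = params
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


def axis(low, high, steps):
    """Grid values along one parameter; a single value if the parameter is fixed."""
    if low == high:
        return [low]
    return [low + (high - low) * k / (steps - 1) for k in range(steps)]


def grid_search(objective, bounds, steps=41):
    """Brute-force search over a regular grid inside per-parameter (low, high) bounds."""
    (s_lo, s_hi), (i_lo, i_hi) = bounds
    best_score, best_params = None, None
    for s in axis(s_lo, s_hi, steps):
        for c in axis(i_lo, i_hi, steps):
            score = objective((s, c))
            if best_score is None or score < best_score:
                best_score, best_params = score, (s, c)
    return best_params


def run_pipeline(xs, ys):
    """Two chained phases: a simple broad fit, then a full fit in a narrowed range."""

    def objective(params):
        return chi_squared(params, xs, ys)

    # Phase 1: simplified model (intercept fixed at zero), broad slope range.
    slope_1, _ = grid_search(objective, [(-10.0, 10.0), (0.0, 0.0)])

    # Phase 2: full model, with the slope range narrowed around the phase 1 result.
    return grid_search(objective, [(slope_1 - 1.5, slope_1 + 1.5), (-2.0, 2.0)])


if __name__ == "__main__":
    xs, ys = make_data()
    slope, intercept = run_pipeline(xs, ys)
    print(f"slope={slope:.3f}, intercept={intercept:.3f}")
```

In this toy version the only information passed between phases is a narrowed search range; a real pipeline would typically pass richer information, such as priors informed by earlier results.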

The challenge of applying modern statistical techniques to extremely large data-sets is not unique to astronomy, but one facing many data-science companies in the UK today. For this project, we will generalize the statistical methodology developed at Durham to automatically analyse strong lenses, so that it can help data science enterprises scale up their analyses to large digital data-sets.

This project will run in collaboration with ConcR, a UK-based data science venture aiming to develop software that predicts the evolution of a patient's response to pancreatic cancer treatment. ConcR are facing a similar challenge of scale, in that they need to apply genomic models of tumour growth to large clinical trial data-sets to extract evidence-based measurements of when cancer treatment is effective. If successful, this will allow for individualized tracking of patient responses to cancer treatments, saving lives and the NHS millions of pounds every year. Thus, by sharing Durham's expertise in statistical modeling, we aim to transfer the knowledge we have gained from studying cosmology and dark matter to improve the treatment of cancer for NHS patients in the UK.

Planned Impact

With our research, we are targeting data-science enterprises that require advanced statistical modeling toolkits to fit large digital data-sets. There is currently significant investment in the data analysis and analytics sector. UK investment is expected to double to £25 billion by 2025, whereas the USA invested over £50 billion in 2017. Industry use-cases in data science that would benefit from Bayesian inference tools are broad in scope, and include applications in agriculture, banking, finance and advanced image recognition. Thus, it is key that the software we develop is general and can reach as large an audience as possible.

The benefit to these companies is access to statistical tools and software that can solve big-data model fitting problems, which otherwise may have taken a significant amount of time and money to develop. Furthermore, our software framework will utilize modern Bayesian inference tools developed in Astronomy, Particle Physics and other scientific disciplines which would otherwise be unknown or inaccessible to industry-based data-science companies. Broadly speaking, we anticipate that our statistics framework can act as a 'bridge' between the development and release of cutting-edge statistical methods in academia and their uptake and use in industry. This exploitation of scientific knowledge and methodology will enhance the research capabilities of these organizations and contribute towards their efficiency of operations.

Health data science is an area the software developed for this grant can target more directly, given our collaboration with ConcR. Aspects of our statistics framework will be directly applicable to modeling large patient data-sets, a common use-case amongst health data companies. This makes the open source nature of our software appealing, as health companies may otherwise have to seek financing through corporates, which opens up possible conflicts of interest and may ultimately restrict their freedom to operate and the pricing and equitable distribution of their outcomes.

Online lecture courses teaching skills such as programming, data analysis and visualization have made a huge contribution to improving the skill set and quality of workers in the data science industry. Accompanying the public release of our software, we will freely provide similar online materials that explain relevant concepts of Bayesian inference and computational statistics. These materials will be aimed at an introductory level and assume limited prior knowledge, and will therefore contribute to the training of skilled workers in non-academic jobs.

Publications

Amorisco N (2022) Halo concentration strengthens dark matter constraints in galaxy-galaxy strong lensing analyses in Monthly Notices of the Royal Astronomical Society

Etherington A (2023) Beyond the bulge-halo conspiracy? Density profiles of early-type galaxies from extended-source strong lensing in Monthly Notices of the Royal Astronomical Society

Etherington A (2022) Automated galaxy-galaxy strong lens modelling: No lens left behind in Monthly Notices of the Royal Astronomical Society

He Q (2023) Testing strong lensing subhalo detection with a cosmological simulation in Monthly Notices of the Royal Astronomical Society

Kegerreis J (2022) Immediate Origin of the Moon as a Post-impact Satellite in The Astrophysical Journal Letters

Nightingale J (2021) PyAutoFit: Classy probabilistic programming in Astrophysics Source Code Library

Nightingale J (2023) Abell 1201: detection of an ultramassive black hole in a strong gravitational lens in Monthly Notices of the Royal Astronomical Society

Nightingale J (2021) PyAutoLens: Open-Source Strong Gravitational Lensing in Journal of Open Source Software

 
Title: PyAutoFit: A Classy Probabilistic Programming Language for Model Composition and Fitting
Description: A major trend in academia and data science is the rapid adoption of Bayesian statistics for data analysis and modeling, leading to the development of probabilistic programming languages (PPL). A PPL provides a framework that allows users to easily specify a probabilistic model and perform inference automatically. PyAutoFit is a Python-based PPL which interfaces with all aspects of the modeling (e.g., the model, data, fitting procedure, visualization, results) and therefore provides complete management of every aspect of modeling. This includes composing high-dimensionality models from individual model components, customizing the fitting procedure and performing data augmentation before a model-fit. Advanced features include database tools for analysing large suites of modeling results and exploiting domain-specific knowledge of a problem via non-linear search chaining.
Type Of Technology: Software
Year Produced: 2020
Open Source License?: Yes
Impact