Systemic sclerosis: Statistical considerations in an epidemiological study of a rare disease

Lead Research Organisation: University of Bath
Department Name: Mathematical Sciences

Abstract

The project will explore the statistical issues involved in the analysis of a matched cohort study based on a large electronic health database, the Clinical Practice Research Datalink (CPRD). It will combine analytical work and simulation to explore the statistical properties of different analysis methods for handling the issues raised by the study. The outputs of the project will be recommendations on which analysis methods to use to robustly estimate the relationship between a rare disease and the subsequent incidence of other outcomes using such databases.
The lead supervisor has an overarching study that uses these data to investigate the relationship (if any) between systemic sclerosis (SSc) and cancer. The study's protocol specifies how the dataset is to be constructed to answer this question, including that patients who develop SSc during the study are matched, on a number of variables, to patients who do not have SSc. Since SSc is rare, to increase the sample size the study also aims to use data from patients who have an SSc diagnosis recorded in the CPRD database but whose diagnosis occurred before the patient joined the CPRD or before the CPRD was set up.
The study raises a number of statistical and epidemiological issues. The first is that patients who developed SSc before the CPRD was set up, then developed cancer and died before it started, are not included in the database, whereas those who developed SSc but did not die before the CPRD started are present. Including such patients in an analysis of time to cancer or death in patients with SSc potentially introduces bias, because the analysis sample is weighted towards those who survive longer. This problem is known as left truncation or length-biased sampling. There is an existing literature on this, which the student will need to assimilate in order to explore its implications for the analysis in question. A combination of analytical and simulation work will likely be necessary to explore how different assumptions affect the bias of different analysis methods.
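The left truncation problem described above can be illustrated with a small simulation. The following is a minimal sketch, not the study's analysis: it assumes an exponential time from diagnosis to event, diagnosis dates spread uniformly over a look-back window before the database starts, and no censoring. The function name, the rate, and the window length are all hypothetical choices made for illustration.

```python
import numpy as np


def simulate_left_truncation(n=200_000, rate=0.1, lookback=20.0, seed=0):
    """Compare a naive and a delayed-entry rate estimate under left truncation.

    Assumptions (illustrative only): exponential event times with a known
    rate, diagnoses uniform over `lookback` years before database start,
    and every observed patient followed until their event.
    """
    rng = np.random.default_rng(seed)

    # Years between diagnosis and the start of the database (time 0).
    years_before_start = rng.uniform(0.0, lookback, size=n)

    # Time from diagnosis to the event (for example cancer or death).
    time_to_event = rng.exponential(1.0 / rate, size=n)

    # Only patients still event-free at database start appear in the data.
    observed = time_to_event > years_before_start
    entry = years_before_start[observed]
    t = time_to_event[observed]

    # Naive analysis: pretend each observed patient was at risk from diagnosis.
    naive_rate = observed.sum() / t.sum()

    # Delayed entry: count person-time only from database start onwards.
    adjusted_rate = observed.sum() / (t - entry).sum()

    return naive_rate, adjusted_rate


if __name__ == "__main__":
    naive, adjusted = simulate_left_truncation()
    print(f"true rate 0.100, naive {naive:.3f}, delayed entry {adjusted:.3f}")
```

Under these assumptions the naive estimate understates the event rate, because it credits survivors with person-time during which they could not have been observed to fail, while the delayed-entry estimate recovers the true rate. With non-exponential hazards or informative entry the picture is less simple, which is part of what the project would need to explore.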
A second issue with the inclusion of patients with a historical diagnosis of SSc is how they should be matched to a non-SSc patient, and moreover how the time scale should be chosen in any analysis. The natural time scale is time since diagnosis of SSc, but if this time scale were used there would be no contemporaneous non-SSc matches available within the time frame of the CPRD. If a time scale is chosen within the time frame of the CPRD, time zero is essentially arbitrary, raising doubt about the epidemiological validity of any analyses.
A third issue is how the matching should be optimally performed, and indeed what the advantages and disadvantages of using matching in such an analysis are. There is a relatively limited literature on matched cohort studies, and it would be important to explore the properties of estimators based on different matching strategies. One aspect is the efficiency of the resulting estimates; another is the ability of matching to adequately adjust for confounding by common causes of the exposure (SSc) and outcome (cancer).
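As a rough illustration of how matching can adjust for confounding in a cohort, the sketch below simulates a single binary confounder that raises both the chance of exposure and the risk of the outcome, with no true effect of exposure. It then compares the crude risk ratio with the risk ratio after 1:1 matching on the confounder. All names and probabilities are hypothetical, and a real matched analysis of CPRD data would involve several matching variables and time-to-event outcomes.

```python
import numpy as np


def matched_vs_crude(n=800_000, seed=1):
    """Crude and 1:1 matched risk ratios when exposure has no true effect.

    Assumptions (illustrative only): one binary confounder, a rare
    exposure, and a binary outcome that depends on the confounder alone.
    """
    rng = np.random.default_rng(seed)

    # Binary confounder, for example an older versus younger age band.
    older = rng.random(n) < 0.5

    # Exposure is rare and more common in the older group.
    exposed = rng.random(n) < np.where(older, 0.02, 0.005)

    # Outcome risk depends on the confounder only, not on exposure.
    outcome = rng.random(n) < np.where(older, 0.20, 0.05)

    crude_rr = outcome[exposed].mean() / outcome[~exposed].mean()

    # 1:1 matching: within each confounder level, sample as many
    # unexposed people as there are exposed people.
    matched = []
    for level in (False, True):
        n_exposed = int(np.sum(exposed & (older == level)))
        pool = np.flatnonzero(~exposed & (older == level))
        matched.append(rng.choice(pool, size=n_exposed, replace=False))
    controls = np.concatenate(matched)

    matched_rr = outcome[exposed].mean() / outcome[controls].mean()
    return crude_rr, matched_rr


if __name__ == "__main__":
    crude, matched = matched_vs_crude()
    print(f"crude RR {crude:.2f}, matched RR {matched:.2f} (true RR 1.00)")
```

In this setup the crude risk ratio is well above 1 even though exposure does nothing, while the matched comparison is close to 1 because the matched unexposed group has the same confounder distribution as the exposed group. The sketch says nothing about efficiency, which would need repeated simulation across matching ratios and strategies.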
A fourth issue is exactly how to define causal effects of one disease on another, given that it is neither feasible nor desirable to intervene to give otherwise healthy patients SSc. If pursued, this component of the project would involve exploring how to cast the target of estimation within the modern causal inference framework, which is typically applied to estimating the effects of treatments, or at least of exposures whose levels could in principle be manipulated.
Early project-specific training will be largely delivered through supervision meetings, with advice and guidance on the CPRD database, programming in R, and epidemiology. In addition, the student will participate in the APTS training programme.

Planned Impact

The impact of the SAMBa CDT will occur principally through the following two pathways:

1. Direct engagement with industrial partners, leading to PhD projects that are collaborative with industry, and that are focussed on topics with direct industrial impact.

2. The production of PhD graduates with
(a) the mathematical, statistical and computational technical skill sets that have been identified as in crucial demand both by EPSRC and by our industrial partners, coupled to
(b) extensive experience of industrial collaboration.

The underlying opportunity that SAMBa provides is to train graduates to have the ability to combine complex models with 'big data'. Such people will be uniquely equipped to deliver impact: whether they continue with academic careers or move directly to posts in industry, through quantitative modelling, they will provide the information that gives UK businesses competitive advantages. Our industrial partners make it clear to us that competitiveness in the energy, manufacturing, service, retail and financial sectors is increasingly dependent on who can best and most quickly analyse the huge datasets made available by the present information revolution.

During their training as part of SAMBa, these students will have already gained experience of industrial collaboration through their PhD projects and/or the Integrated Think Tanks (ITTs) that we propose, which will give all SAMBa students opportunities to develop these transferable skills. PhD projects that involve industrial collaboration, whether arising from ITTs or not, will themselves deliver economic and social benefits to the UK through the private companies and public sector organisations with which SAMBa will collaborate.

We emphasise that Bath is at the forefront of knowledge transfer (KT) activities of the kind needed to translate our research into impact. Our KT agenda has recently been supported by KT Accounts and Impact Acceleration Accounts from EPSRC (£4.9M in total) and a current HEFCE HEIF allocation of £2.4M. Bath is at the forefront of UK activity in KTPs, having completed 150 and currently holding 16 KTP contracts worth around £2.5M.

The SAMBa ITTs are an exciting new mechanism through which we will actively look for opportunities to turn industrial links into research partnerships, supported in the design of these projects by the substantial experience available across the University.

More widely, we envisage impact stemming from a range of other activities within SAMBa:

- We will look to feed the results of projects involving ecological or epidemiological data directly into environmental and public health policy. We have done this successfully many times and have three REF Case Studies describing work of this nature.

- Students will be encouraged to make statistical tools available as open source software. This will promote dissemination of their research results, particularly beyond academia. There is plenty of recent evidence that such packages are taken up and used.

- Students will discuss how to use new media to promote the public understanding of science, for example contributing to projects such as Wikipedia.

- Students will be encouraged to engage in at least one outreach activity. Bath is well known for its varied, and EPSRC-supported, public engagement activities that include Royal Institution Masterclasses, coaching the UK Mathematics Olympiad team, and reaching 50 000 people in ten days with an exhibit at the Royal Society's 350th Anniversary Summer Exhibition in 2010.
