Gene-environment interplay in the generation of health inequalities

Lead Research Organisation: University of Bristol
Department Name: Economics

Abstract

This project will advance understanding of gene-environment interactions in the generation of health inequalities. Combining methods from genetics and social science, it will test whether privileged environments protect against genetic susceptibility to risky health behaviours, using 'natural experiments' to deal with unmeasured confounding. The aim of this project is to study the interplay between genes and the environment (GxE) in the formation of life course inequalities in health. The project will explore whether an advantageous environment cushions genetic susceptibility to risky health behaviours and outcomes, including obesity, smoking and drinking. Although GxE analysis is not new, only a few studies have attempted to take into account the possible endogeneity of the environment, such as unobserved confounding arising from self-selection into environments or from environments being created in response to the genotype. Endogenous environments are problematic for two main reasons. First, conceptually, it is important to distinguish between the actions of people and the circumstances in which these actions occur. Second, endogeneity hampers causal inference, leading to misinformed policy. This project focuses on the socioeconomic and policy environments faced by individuals. To take into account the endogeneity of the environment, it will exploit well-established 'exogenous shocks' or 'natural experiments', enabling us to study GxE causality.

For example, we could explore whether income or education interacts with genetic make-up to predict smoking intensity. Since risky health behaviours are more prevalent among those of lower socio-economic status (SES), and advantageous environments can cushion susceptibility to health shocks, we hypothesize that advantageous environments may protect individuals with elevated genetic risk from developing risky behaviours such as smoking. To deal with unobserved confounding relating to, in this example, income or education, the project will explore natural experiments, such as the 1947 change in the compulsory school-leaving age or the recent recession, to identify exogenous changes in education, wages or wealth. The main data sources that will be exploited are the UK Biobank and ALSPAC.
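A GxE hypothesis of this kind is commonly tested with a regression that includes a product term between a genetic score and an environment measure. The following is a minimal sketch only: the names `pgs` (a polygenic score), `env` (a binary indicator of an advantageous environment) and the noise-free numbers are hypothetical and are not taken from the project, and the sketch does not address the endogeneity of the environment that the natural experiments are meant to handle.

```python
import numpy as np

def fit_gxe(pgs, env, outcome):
    """OLS fit of outcome = b0 + b1*pgs + b2*env + b3*(pgs*env)."""
    pgs = np.asarray(pgs, dtype=float)
    env = np.asarray(env, dtype=float)
    # Design matrix: intercept, genetic main effect, environment main
    # effect, and the gene-environment product term.
    X = np.column_stack([np.ones_like(pgs), pgs, env, pgs * env])
    coef, *_ = np.linalg.lstsq(X, np.asarray(outcome, dtype=float), rcond=None)
    return dict(zip(["intercept", "pgs", "env", "pgs_x_env"], coef))

# Hypothetical, noise-free illustration: the genetic effect on the
# outcome is 3.0 when env = 0 and 3.0 - 1.5 = 1.5 when env = 1.
pgs = np.array([-1.0, 0.0, 1.0, 2.0, -1.0, 0.0, 1.0, 2.0])
env = np.array([0, 0, 0, 0, 1, 1, 1, 1])
outcome = 10 + 3 * pgs - 2 * env - 1.5 * pgs * env

coefs = fit_gxe(pgs, env, outcome)
```

A negative `pgs_x_env` coefficient in such a model would be consistent with the advantageous environment dampening the effect of genetic risk, although a causal reading still depends on the environment being exogenous.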

Publications


Studentship Projects

Project Reference | Relationship | Related To   | Start      | End        | Student Name
MR/N013794/1      |              |              | 01/10/2016 | 30/09/2025 |
2131777           | Studentship  | MR/N013794/1 | 01/10/2018 | 31/03/2022 | Samuel Baker
 
Title A 3D animation showing weekly changes of Scarlet Fever over 40 years in UK Districts 
Description This work allows for a visual representation of changes in disease over 2080 weeks from 1940-1974. The animation is created in 3D, allowing it to be used both within papers as a figure, by positioning the camera looking down on England and Wales, and with a range of camera animations for creating informatics / public outreach material.
Type Of Art Film/Video/Animation 
Year Produced 2020 
Impact This figure was used to present to researchers at a DIAL conference in November, but it is designed for the public / press once the data is released, offering a simple way to explore and absorb all the information in a fun, non-tabular way.
 
Title BIO-HGIS 
Description BIO-HGIS focuses on data ranging between 1931 and 1974 to target the years of birth of those within the UK Biobank, which range from 1934 to 1971. Briefly, the UK Biobank is a prospective cohort study of UK adults aged 40-69 at time of recruitment. The UK Biobank contains extensive later life information on the health and well-being of its participants, most of whom have been genotyped. External sources of data have already been linked to the UK Biobank, such as the Hospital Episode Statistics, allowing the UK Biobank's potential to grow over time. BIO-HGIS was constructed to add information on early life exposures or characteristics of early life environments to the UK Biobank participants. The UK Biobank itself has collected relatively little information on individuals' early life circumstances but did collect the place of birth as coordinates. The birth coordinates within the UK Biobank have a level of detail close to half a km, but for confidentiality reasons they were generalised to a 1 km grid coordinate. To ensure sufficient variation, where possible, BIO-HGIS has aimed to digitise and link records that were reported frequently and at a sufficiently detailed geographical level. However, when investigating exposures over prolonged periods, individuals born a single month apart in the same location will still share largely the same exposure. Identifying variation, therefore, comes more from individuals who were born many years apart within the same place of birth. Fortunately, given that the UK Biobank has close to half a million participants born over nearly 40 years, this is often possible.
Type Of Material Database/Collection of data 
Year Produced 2020 
Provided To Others? No  
Impact BIO-HGIS has been designed to continually add new and meaningful data for research and public information. Whilst BIO-HGIS itself is for UK data, many of the methods in this thesis that have been used to construct it are generalisable to other countries' data. If others utilise these methods to undertake similar efforts in their own countries, then the research body would benefit, but this is beyond my capacity. The papers that have been started or shown within the thesis represent a fraction of the potential of the database. Most research projects have been focused exclusively on the UK Biobank, despite the huge potential for use in the many additional cohort studies that exist throughout the 20th century. Whilst the work behind the construction of the data remains focused on health, many disciplines may find extensive uses for such data that are outside the remit of our knowledge. Whilst not currently available, due to the complexity of the task for a single individual, it will hopefully be available to other researchers soon.
 
Title Labour Gazette 
Description We digitised monthly unemployment figures from the Labour Gazette that were reported for a select group of districts, in addition to unemployment within all regions. We then used a generalised imputation process, detailed separately, and census data on unemployment from CASWEB and NOMIS to extrapolate the unemployment in every district from 1946-1971.
Type Of Material Database/Collection of data 
Year Produced 2022 
Provided To Others? No  
Impact Defining deprivation is complex, with multiple historical measures existing, such as the Townsend deprivation index or the Jarman index. The original Townsend deprivation index was a function of four variables: non-car ownership, non-home ownership, the log of unemployment, and the log of overcrowding. Many of these variables are only available in census years, making a time-varying measure difficult. However, simple regional unemployment is often highly correlated with such deprivation measures. Regional standardised unemployment rates alone are 92.4% correlated with the Townsend deprivation index and 86.6% with the Jarman index. We created multiple deprivation indexes that span the early half of the 20th century and linked them to the UK Biobank. By using PheWAS on health outcomes from ICD-10, we showed that, even using a deprivation index purely based on unemployment, those in the UK Biobank born in districts with higher unemployment in their early life had worse health outcomes. Moreover, as we do not currently have a good index of deprivation in the early half of the 20th century, the output of this research will also be useful for controlling for time-varying environments that may otherwise add confounding to a study.
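To make the structure of a Townsend-style index concrete, here is a minimal sketch. It assumes the commonly described construction, in which unemployment and overcrowding are transformed as ln(x + 1), all four variables are standardised to z-scores, and the z-scores are summed; the function names and the three districts' percentages are hypothetical and do not come from the project data.

```python
import numpy as np

def zscore(x):
    """Standardise to mean 0 and standard deviation 1 (population SD)."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

def townsend_style_score(unemployment, overcrowding, no_car, not_owner):
    """Sum of z-scores; unemployment and overcrowding are log-transformed.

    All inputs are percentages per district. Higher scores indicate
    greater deprivation relative to the other districts supplied.
    """
    log_unemp = np.log(np.asarray(unemployment, dtype=float) + 1)
    log_crowd = np.log(np.asarray(overcrowding, dtype=float) + 1)
    return (zscore(log_unemp) + zscore(log_crowd)
            + zscore(no_car) + zscore(not_owner))

# Hypothetical percentages for three districts (A, B, C).
scores = townsend_style_score(
    unemployment=[2.0, 5.0, 9.0],
    overcrowding=[1.0, 4.0, 8.0],
    no_car=[30.0, 50.0, 70.0],
    not_owner=[40.0, 55.0, 80.0],
)
```

Because three of the four inputs are only observed in census years, an index built this way cannot vary between censuses, which is why a frequently reported series such as unemployment is attractive as a proxy.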
 
Title Regional Imputation 
Description Many administrative statistics report only a subset of the total number of locations when reporting at a highly detailed geographical level. This limits the potential power of such data when linked to cohort studies, such as the UK Biobank, as many of the locations have missing data. However, using multi-level records and infrequently reported full estimates within each level, it is possible to reconstruct the dataset.
Type Of Material Computer model/algorithm 
Year Produced 2022 
Provided To Others? No  
Impact This algorithm was made specifically for two datasets, one digitised and another yet to be digitised. The Labour Gazette produced records of unemployment for certain districts, which we have digitised and which are detailed separately. The second resource recorded weekly infant mortality, which again was only reported in certain districts. In 1971 and 1961, every district within Great Britain and England and Wales, respectively, has a record from the Census of the unemployment it experienced within that year. Every district also has an annual count of total infant deaths in the Registrar General's Statistical Review of England and Wales. This imputation method constructs a ratio of expected values, which can then be used to partition a higher-order estimate. For the Labour Gazette, we digitised regional unemployment so that we knew the total for each region in each month. From the census years, we constructed a district-region ratio for each district. Then, for districts where no unemployment was reported within the records, we could utilise the ratios from census years to impute their values from the regional total.
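The ratio step described above can be sketched as follows. This is an illustrative reading of the description, not the project's implementation: the function names, district labels and counts are hypothetical, and the sketch uses a single census year where the real process draws on more than one.

```python
def district_shares(census_counts):
    """District-region ratio: each district's share of the regional
    total in a census year, when every district is observed."""
    total = sum(census_counts.values())
    return {district: count / total for district, count in census_counts.items()}

def impute_districts(regional_total, reported, shares):
    """Keep reported district values and impute the unreported ones
    as (census share) x (regional total for that month)."""
    imputed = dict(reported)
    for district, share in shares.items():
        if district not in reported:
            imputed[district] = share * regional_total
    return imputed

# Hypothetical census-year unemployment counts for one region.
shares = district_shares({"A": 100, "B": 300, "C": 600})

# Hypothetical month: the regional total and district A are reported,
# districts B and C are not.
monthly = impute_districts(regional_total=2000, reported={"A": 220}, shares=shares)
```

In this simple form the reported and imputed values need not sum exactly to the regional total; one possible refinement, not stated in the source, is to distribute only the remainder left after subtracting the reported districts.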
 
Title Registrar General's Statistical Review of England and Wales 
Description The Registrar General's office produced annual reports on many administrative statistics. Two tables were digitised. One contained the births, deaths, population, and infant mortality for each district within England and Wales. The second contained mortality data by age bands and sex by month of birth across England and Wales. Vision of Britain had already digitised part of the population-based database from 1930 to 1974, but it was missing the years 1958 to 1962 and had errors within it. The second dataset on monthly mortality, however, was purely our own contribution.
Type Of Material Database/Collection of data 
Year Produced 2020 
Provided To Others? No  
Impact One of the principal uses has been to convert the data collected during this research period into rates. However, the data on infant mortality has also been used in a forthcoming paper from myself and co-authors, 'Beyond' Barker: Infant mortality at Birth and Ischemic Heart Disease in Older Age. We replicated the principle of the Barker hypothesis, showing how exposure to infant mortality was associated with ischemic heart disease. However, instead of using ICD-10 records within regions, we crucially showed this association at an individual level, using individual-level measures of ischemic heart disease, by linking the data to the UK Biobank. Even adding the polygenic score for ischemic heart disease and allowing for gene-environment interplay did not notably change the result. However, when we utilised district fixed effects, we showed that half of this association between ischemic heart disease and early life environment was capturing time-invariant differences between districts. This means that infant mortality rates are likely to capture far more than just the nutritional deficiencies that Barker suggested.
 
Title The Blitz 
Description War, State, and Society digitised the records of each individual air raid that occurred during the Second World War. We used the Google Maps API to geo-locate each of these individual air raids and link them to districts for use with the UK Biobank.
Type Of Material Database/Collection of data 
Year Produced 2021 
Provided To Others? No  
Impact This data set is being used to explore the potential risks of a stillbirth or spontaneous miscarriage later in life from early life exposure to wartime conditions. Whilst this paper was formulated at the end of 2021, due to recent events it will also offer a potential warning of further consequences of the current conflict as of March 2022.
 
Title The Registrar General's weekly return for England and Wales: Births and Deaths - Infectious Diseases - Weather 
Description The General Register Office produced weekly reports on matters such as births, deaths, and infections by location. Unlike the currently available annual estimates, these reports contain some of the most detailed records of notifications of infectious diseases for the UK, which are also available at a highly detailed regional level: approximately 1500 separate locations per week. We digitised approximately 40,000 tables, or just over 15 million characters, of weekly records from 1941 to 1974 using a new software solution, ArchiveOCR, detailed in its own section. The completed dataset gives weekly records of notifications of acute meningitis, acute non-paralytic polio, acute paralytic polio, diphtheria, dysentery, food poisoning, infective jaundice, measles, pneumonia, scarlet fever, respiratory tuberculosis, and pertussis. Not all diseases were notifiable across the full range of weeks, but this dataset represents one of the most detailed datasets of notifications of disease to date. We hope to release this dataset once papers have gone through to publication.
Type Of Material Database/Collection of data 
Year Produced 2020 
Provided To Others? No  
Impact The dataset has been vital to multiple papers, but they are yet to be published. Other researchers have used the dataset, both within the University of Bristol and as part of NORFACE DIAL. Papers that I have specifically worked on relevant to this dataset are as follows. "Early life exposure to scarlet fever is associated with ischemic heart disease later in life": by linking the dataset to the UK Biobank (see BIO-HGIS), this paper found long-term associations between exposure to scarlet fever in childhood and later life cardiovascular diseases, fluid intelligence, and educational attainment. "Can gene-environment interactions explain the rise in asthma incidence in the 20th century?": this paper utilised a pre-existing plausible biological mechanism from the literature to explore whether a gene-environment interaction between declining disease incidence and genetic predisposition to asthma explains part of the increase in asthma prevalence in the 20th century. We found little evidence of an association between exposure to childhood diseases and asthma later in life. However, we found that those with a high genetic predisposition to asthma who were exposed to higher rates of scarlet fever and pertussis in early life were much less likely to report having asthma later in life. "Duration of measles immunosuppression on respiratory bacterial disease rates": measles can cause immune amnesia, but the length of said amnesia is not fully established. This paper investigated to what extent increases in notifications of measles affected the rates of other notifiable diseases within the 20th century, using weekly notifications of measles, scarlet fever, pertussis, pneumonia, and tuberculosis.
 
Title ArchiveOCR 
Description Digitisation is the process of transferring media from its original physical analogue format to a digital one using computational hardware and software. Projects often require a range of skills not commonly held by a single individual and can be manually intensive without computational assistance. Software packages exist, such as ABBYY FineReader and Google's Tesseract, but they have significant financial, skill and / or time costs. These barriers lead to significant problems in digitisation. Only 8% of the documentation held by the National Archives in the UK has been digitised, so a simpler solution is needed. ArchiveOCR is a Python-based software solution for the digitisation of tables. It can scale from notebooks to super-computers, whilst also reducing barriers to entry through graphical user interface front ends. Whilst not a derivative of pre-existing digitisation software packages, it does make extensive use of the pre-existing OpenCV (open computer vision) library and NumPy. However, ArchiveOCR specifically uses a custom-written wrapper for CV2, imageObjects, that makes CV2 work in an image-based Object-Oriented Programming (OOP) approach; the wrapper is publicly available on GitHub.
Type Of Technology Software 
Year Produced 2019 
Impact ArchiveOCR was built originally to assist digitisation of weekly reports of notifiable diseases from the Registrar-General's Weekly Return (1941-73). Weekly returns between February 1941 and December 1973 were digitised, leading to around 1670 weeks, 20,000 pages, and 40,000 tables requiring digitisation. Using ArchiveOCR allowed for 15,098,150 characters to be processed and cleaned within the span of two months mostly by a single individual, something which would simply not be possible using presently available means. It has since been used for other digitisation projects, and when stable it will be made public. 
 
Title weightGIS 
Description Geo-spatial data is often time variant in nature. When seeking to exploit within-region variation as the identifying variation in a model, regions being abolished can significantly reduce power, and even for regions that remain, any change to them can mean that within-region comparisons are no longer like for like. weightGIS is designed to standardize geographic information systems (GIS) data in the form of Shapefiles to a standardized base year. Once regions have been standardized, weightGIS can then use the set of weights produced by creating a single coherent set of regions over time to weight external datasets to a base year.
Type Of Technology Software 
Year Produced 2020 
Open Source License? Yes  
Impact weightGIS was designed to assist in the standardisation and construction of time-invariant district data within England and Wales from 1941 to 1973 for BIO-HGIS.
URL https://github.com/sbaker-dev/weightGIS
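
The idea of weighting an external dataset to a base year can be illustrated with a small sketch. This is not the weightGIS API: the function `reweight_to_base`, the region names, and the weights are hypothetical, and the sketch simply assumes that each source-year region comes with the share of it (for example by area or population) that falls inside each base-year region.

```python
def reweight_to_base(values, weights):
    """Redistribute source-year region values onto base-year regions.

    values  : {source_region: count}
    weights : {source_region: {base_region: share of source in base}}
              Shares for each source region are assumed to sum to 1.
    """
    base_values = {}
    for source, value in values.items():
        for base, share in weights[source].items():
            base_values[base] = base_values.get(base, 0.0) + value * share
    return base_values

# Hypothetical example: region Y was abolished and split between two
# base-year regions, while region X maps entirely onto base region X_base.
weights = {
    "X": {"X_base": 1.0},
    "Y": {"X_base": 0.25, "Z_base": 0.75},
}
counts = {"X": 100, "Y": 40}

standardised = reweight_to_base(counts, weights)
```

Because each source region's shares sum to one, the total count is preserved, while every year's data ends up expressed on the same set of base-year regions.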