Rigorous Training in Longitudinal Data Science (RADIANCE)
Lead Research Organisation:
University College London
Department Name: Institute of Child Health
Abstract
We live in a world where data are collected on nearly everything we do. Such information has the potential to be extremely useful if we wish to improve our health. However, doing this safely is not easy.
There are many examples where data have been misused or erroneous interpretations of the evidence have been drawn. This is increasingly apparent this year, as scientists and governments are struggling to communicate the uncertainties in their understanding of the current pandemic. Because the available evidence is limited, scientists stress that more data are needed to compare regions and subgroups of people and, crucially, to study the evolution of the epidemic over time. Only with more data will we be able to understand variations in the population and explain health inequalities as well as time trends. For example, to assess if a local or national lock-down is working, we need to count how many cases arise in different communities over certain periods of time, how many of them are hospitalised, and how many die. Ideally, we should follow each individual from diagnosis, to hospitalisation, to recovery or death, and then compare the incidence of each of these events by, for example, region, sex, occupation, and ethnicity. To achieve this, we need to account for when each of these events occurs. This requires linking information on the same individual over time.
The same principle applies to the study of other diseases. For this reason, access to linked individual medical and administrative records is crucial for biomedical and public health research. Having the data is not sufficient, however. They need to be: (a) safely stored, cleaned, and prepared for analysis; (b) properly analysed; and (c) interpreted together with evidence from other countries and other published research. We label these steps: data stewardship, analysis, and context. Our proposal aims to train health and social data scientists in the core skills needed to achieve these steps. We will use different formats, all of them on-line, to reach the broadest community of data scientists. We will produce short introductory videos (which we call "Appetisers"), and then various on-line materials delivered at intermediate and more advanced levels. Some of this will be in the form of recorded lectures, some as live tutorials where the material covered by the lectures is reinforced with practical computer-based exercises. We will also run specific courses on specialised topics which will include live (but on-line) interactions with members of the training team, and "data clinics" where participants can have one-to-one discussions with us.
In summary, we will endeavour to develop and run an accessible and inclusive training programme for data scientists involved in the management, analysis and interpretation of complex longitudinal biosocial data.
Technical Summary
Researchers in the biomedical and social sciences now have access to increasingly large data resources generated by linkage between administrative, cohort and panel databases, many of them spanning decades. Their longitudinal nature provides huge opportunities for describing and investigating medical and behavioural histories, as well as socio-economic changes, and for studying their relationship with population health outcomes. Evidence informed by complex longitudinal biosocial data must be based on rigorous data stewardship (i.e. data linkage, manipulation, cleaning, and documentation) twinned with appropriate targets of analysis, transparent analytical plans and accurate interpretation of results. However, the high- and multi-dimensional nature of these newly created data resources requires skills that are often compartmentalised within different disciplines.
With this training programme we aim to provide a comprehensive, cohesive and rigorous portfolio targeted at the broad community of quantitative researchers involved in the management, analysis and interpretation of longitudinal biosocial data and the production of rigorous and transparent scientific evidence. The training will be framed around the three core themes of "Data Stewardship", "Analysis" and "Context".
We will use multiple delivery formats, including short videos on basic concepts ("appetizers"), modules, short courses and clinics all delivered on-line. The training will be developed and delivered by a team of methodological and applied researchers working in health and social science, each bringing their own considerable expertise in both research and education.
In summary, this training programme will enhance knowledge, self-confidence and expertise that is much in demand among researchers who want to utilize complex longitudinal biosocial data.
Organisations
Description | 9 introductory videos |
Form Of Engagement Activity | A broadcast e.g. TV/radio/film/podcast (other than news/press) |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | We produced 9 "appetizers" to the training course on the following topics: 1. Questions, data and research methods: https://youtu.be/OBfMkzAP7fU 2. Causal Questions: https://youtu.be/N0kMRRbuPCE 3. Information Governance for users of Administrative Data: https://youtu.be/Tie4Ih5SJns 4. Trusted Research Environments: https://youtu.be/_mQWDvjAU0M 5. Data Handling: https://youtu.be/T2WE5cY4IRg 6. Ethical Considerations for data scientists: https://youtu.be/mBIYw2W6yFg 7. Reproducible and Open Data Science: https://youtu.be/ENcpbmGRfAk 8. Longitudinal Data Structures: https://youtu.be/-xctS1yNjns 9. Introduction to Missing Data: https://youtu.be/ZFDk13WrdmM |
Year(s) Of Engagement Activity | 2022 |
URL | https://radiance.org.uk/training/ |
Description | Addressing causal questions: an introduction |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | This introductory course was for anyone wishing to understand how causal questions can be investigated using real world data (RWD), that is, data on the everyday experiences of individuals that are collected through surveys, cohort studies, administrative and clinical databases or accrued for reasons other than research. These data are observational, as opposed to experimental. Because of this, using them to address causal questions raises many concerns and difficulties. The course described the main sources of bias affecting RWD and possible strategies to deal with them. It started by distinguishing between different types of studies (e.g., RCTs, cross-sectional and longitudinal) and data sources (e.g., research-based, administrative databases). It then described the sources of bias that are likely to affect observational data, in particular those arising from the non-randomized allocation of exposures (denoted confounding bias in epidemiology and selection bias in the social sciences), from missing participation (including missing data), and from measurement errors. It then introduced two main design-based approaches for dealing with (some of) these biases: the framework of target trial emulation and the exploitation of natural experiments. |
Year(s) Of Engagement Activity | 2023 |
Description | Analysis of administrative health data |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Postgraduate students |
Results and Impact | Administrative data, sometimes referred to as routinely collected data, provide large and rich datasets for research. However, they require careful cleaning, management and interpretation. This online course was for those who are interested in whether they might want to use administrative data for research and would like a short introduction to this topic. The course used administrative health data (national hospital inpatient data - the Hospital Episode Statistics database) as an example, but the principles apply to all administrative data. |
Year(s) Of Engagement Activity | 2023 |
Description | Analysis of electronic health records |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | Administrative data, sometimes referred to as routinely collected data, provide large and rich datasets for research. However, they require careful cleaning, management and interpretation. This online course is for those who are interested in whether they might want to use administrative data for research and would like a short introduction to this topic. The course will use administrative health data (national hospital inpatient data - the Hospital Episode Statistics database) as an example, but the principles apply to all administrative data. |
Year(s) Of Engagement Activity | 2022 |
Description | Causal Diagrams, Jan 23 |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | This introductory course is for anyone wishing to learn how to graphically draw our assumptions regarding how an exposure and an outcome may be related, either causally or via common associations with other variables. Learning about how to draw such assumptions is useful to guide: the design of observational studies aiming to investigate the causal relationship between exposure and outcome and the analysis of such studies. We will introduce the language of potential outcomes before describing the fundamental rules for drawing and interrogating causal diagrams. |
Year(s) Of Engagement Activity | 2023 |
Description | Causal questions: an introduction Nov 22 |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | This introductory course is for anyone wishing to understand how causal questions can be investigated using real world data (RWD), that is data on the everyday experiences of individuals that are collected through surveys, cohort studies, administrative and clinical databases or accrued for reasons other than research. These data are observational, as opposed to experimental. Because of this, using them to address causal questions raises many concerns and difficulties. In this course we will describe the main sources of bias affecting RWD and possible strategies to deal with them. The course will start by distinguishing between different types of studies (e.g., RCTs, cross-sectional and longitudinal) and data sources (e.g., research-based, administrative databases). It will then describe the sources of bias that are likely to affect observational data, in particular those arising from the non-randomized allocation of exposures (denoted confounding bias in epidemiology and selection bias in the social sciences), from missing participation (including missing data), and from measurement errors. We will then introduce two main design-based approaches to attempt dealing with (some of) these biases: the framework of target trial emulation and the exploitation of natural experiments. |
Year(s) Of Engagement Activity | 2022 |
Description | Causal questions: an introduction March 22 |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | This introductory course is for anyone wishing to understand how causal questions can be investigated using real world data (RWD), that is, data on the everyday experiences of individuals that are collected through surveys, cohort studies, administrative and clinical databases or accrued for reasons other than research. These data are observational, as opposed to experimental. Because of this, using them to address causal questions raises many concerns and difficulties. In this course we will describe the main sources of bias affecting RWD and possible strategies to deal with them. |
Year(s) Of Engagement Activity | 2022 |
Description | Estimating Causal Effects |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Postgraduate students |
Results and Impact | This course introduced participants to the two main approaches to estimating causal effects from observational data: those based on the assumption of no unmeasured confounding and those that exploit the availability of instrumental variables. The course focussed on settings where the exposure/intervention is time fixed as well as the more general case when exposures/treatments are time-varying (and hence may be affected by time varying confounding). |
Year(s) Of Engagement Activity | 2024 |
Description | Estimation of causal effects in real world data |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | This course introduced participants to the two main approaches to estimating causal effects from observational data: those based on the assumption of no unmeasured confounding and those that exploit the availability of instrumental variables. The course focussed on settings where the exposure/intervention is time fixed but also gave an introduction to the more general case when exposures/treatments are time-varying (and hence may be affected by time-varying confounding). |
Year(s) Of Engagement Activity | 2023 |
Description | Introduction to Mediation Analysis |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Postgraduate students |
Results and Impact | This introductory course was for anyone wishing to have an overview of main concepts of mediation analysis. Various approaches were presented with an emphasis on comparing standard approaches with those from the causal inference framework. The course consisted of a lecture followed by a computer practical exercise. During the computer practical session students were given a data set and a set of questions to answer using a statistical software (Stata or R), under the guidance of tutors. |
Year(s) Of Engagement Activity | 2023 |
Description | Introduction to target trial emulation |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | This introductory course was for anyone wishing to understand how comparisons of the effectiveness of alternative therapies or interventions can be performed using real world data (RWD) when adopting the framework of target trial emulation (TTE). RWD are data on the everyday experiences of individuals that are collected through surveys, cohort studies, and administrative and clinical databases. These data are observational, as opposed to experimental. Because of this, using them to address causal questions such as those of comparative effectiveness raises many concerns and difficulties. In this course we described the main sources of bias affecting RWD, described how TTE can address some of them, and discussed its application in group discussions and computer practicals (in Stata and R). |
Year(s) Of Engagement Activity | 2023 |
Description | Longitudinal Data Preparation & Visualisation For Epidemiological And Social Research, March 23 |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | This online course is for anyone that needs to prepare longitudinal data for analysis. It will cover the main procedures needed from converting raw longitudinal data to cleaned data that can be readily analysed. The course will have two sessions, one covering data preparation and the other covering data description and visualization. Both will focus on longitudinal data and real-world data. You can take the module either in R or in Stata, each will have its own videos and practical exercises. |
Year(s) Of Engagement Activity | 2023 |
Description | Longitudinal data analysis |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Postgraduate students |
Results and Impact | Longitudinal data (data collected multiple times from the same cases) are becoming increasingly popular due to the important insights they can bring. For example, they can be used to track how individuals change over time and what causes that change. They can also be used to understand causal relationships or as part of impact evaluation. Unfortunately, traditional models such as OLS regression are not appropriate, as repeated measures are nested within individuals. For this reason, specialised statistical models are needed. Multilevel Modelling (MLM) and Structural Equation Modelling (SEM) offer flexible frameworks in which longitudinal data can be analysed. They offer a series of advantages compared to other approaches, such as: the separation of within and between variation, the inclusion of both time-constant and time-varying variables, the inclusion of multiple relationships (path analysis, mediation, etc.), the inclusion of measurement error, the estimation of change in measurement error, multi-group analysis, etc. The course gave an introduction to the Multilevel Model for change and the Latent Growth Model (LGM) using Stata and R. |
Year(s) Of Engagement Activity | 2023 |
Description | Longitudinal data preparation and visualisation |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Postgraduate students |
Results and Impact | This course was for anyone who needed to prepare longitudinal data for analysis. It covered the main procedures needed to convert raw longitudinal data into cleaned data that can be readily analysed. The course had two sessions, one covering data preparation and the other covering data description and visualization. Both focussed on longitudinal data and real-world data. |
Year(s) Of Engagement Activity | 2023 |
Description | Longitudinal data preparation and visualisation |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | This online course is for anyone that needs to prepare longitudinal data for analysis. It will cover the main procedures needed from converting raw longitudinal data to cleaned data that can be readily analysed. The course will have two sessions, one covering data preparation and the other covering data description and visualization. Both will focus on longitudinal data and real-world data. You can take the module either in R or in Stata, each will have its own videos and practical exercises. |
Year(s) Of Engagement Activity | 2022 |
Description | Machine learning/Regression classification |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Postgraduate students |
Results and Impact | This course introduced a variety of machine learning (ML) methods to analyse continuous and categorical outcomes and discussed how these methods can be applied in both prediction and causal inference settings. This was an introductory course, and therefore ideas were explained at a beginner level, with a particular focus given to practical applications of ML in real-world studies. Tutorials included readily available methods and solutions. |
Year(s) Of Engagement Activity | 2023 |
Description | Multiple Imputation of Missing data |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | This online course is for anyone needing to address the issue of missing information in their quantitative data. It covers the most important principles of missing data analysis and how to effectively address the issues in analyses. |
Year(s) Of Engagement Activity | 2022 |
Description | Multiple Imputation of Missing data |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Postgraduate students |
Results and Impact | This online course was for anyone needing to address the issue of missing information in their quantitative data. It covered the most important principles of missing data analysis and how to effectively address the issues in analyses. |
Year(s) Of Engagement Activity | 2023 |
Description | Regression Models |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | This online course gives you an overview of commonly used regression methods to examine the relationship between an outcome of interest and an explanatory variable. You will be introduced to classical linear regression and generalised linear models (e.g. logistic, Poisson, ordinal/multinomial models) depending on the distribution of the outcome. The course covers the basic concept, formulation, interpretation, and validation of the models. Real-world data will be used to demonstrate the practical applications of these models. |
Year(s) Of Engagement Activity | 2022 |
Description | Regression Models |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Postgraduate students |
Results and Impact | This course gave an overview of commonly used regression methods to examine the relationship between an outcome of interest and an explanatory variable. It included classical linear regression and generalised linear models (e.g. logistic, Poisson, ordinal/multinomial models). The course covered the basic concepts, formulation, interpretation, and validation of the models. Real-world data were used to demonstrate the practical applications of these models. |
Year(s) Of Engagement Activity | 2023 |
Description | Statistics Clinic |
Form Of Engagement Activity | A formal working group, expert panel or dialogue |
Part Of Official Scheme? | No |
Geographic Reach | National |
Primary Audience | Professional Practitioners |
Results and Impact | This was a one-to-one clinic on how to compare, in terms of age, sex, ethnicity and IMD, the population who access the alcohol service with the population that should be accessing the service because they tested positive on the alcohol test. |
Year(s) Of Engagement Activity | 2022 |