Machine learning, broadly defined as analytic techniques that fit models algorithmically by adapting to patterns in data, is increasingly used in epidemiology. This workshop will explore how epidemiologists can use machine learning to advance their research and practice, while reflecting on the ethical and scientific considerations that arise from the use of data-driven techniques. The workshop will use a flipped classroom format to maximize time for discussion and programming activities during the SER workshop. Prior to the workshop, attendees will be sent two or three readings and links to two or three 30-minute videos. These videos will introduce key terms, commonly used algorithms, evaluation techniques, and examples of epidemiologic studies that incorporated machine learning. During the workshop, these topics will be reinforced through a review of concepts, guided discussions, presentations of case studies, and demonstrations of analytic pipelines using R and RStudio. Attendees will work individually and in small groups on hands-on programming exercises using publicly available data, while also discussing the ethical and scientific challenges presented by different research scenarios. At the conclusion of this workshop, attendees will be able to discuss scenarios where machine learning can benefit epidemiologic analysis, analyze public health data using commonly used algorithms, and feel empowered to pursue additional training or collaborate with scientists with expertise in machine learning.
Recent developments by the R community have revolutionized the data analysis pipeline in R, from manipulating and visualizing data to communicating results. Our workshop will provide hands-on training in tools from the tidyverse ecosystem, using real epidemiologic data. In the first section, we will teach data manipulation with dplyr, a package that makes data cleaning easy, flexible, and enjoyable. In the next section, we will teach data visualization with ggplot2, the most popular plotting package in R, with a focus on creating publication-quality plots. We will then put these tools together to make reproducible documents. Using R Markdown, we will weave code and text together and learn to write papers and reports, exported to PDF, Word, or HTML, entirely in R. This workflow easily propagates upstream changes to data or analyses throughout a document and eliminates copy and paste errors. Together, these tools form a data analysis pipeline for reproducible, publication-ready work.
When randomized experiments are infeasible, analysts must rely on observational data in which treatment (or exposure) is not randomly assigned. Although randomized trials are the gold standard, many important epidemiological questions can be addressed using observational data. Drawing unbiased inferences from such data requires appropriate statistical methods, such as causal inference methods, to account for the non-randomized design. This workshop will introduce the potential outcomes framework and the use of inverse probability (or propensity) of treatment weights (IPTW) to estimate causal effects. We will present step-by-step guidelines on how to estimate the weights and perform diagnostic checks on them for settings with two or more treatment groups and for continuous exposures. We will provide an overview of how to implement omitted variable analyses, which are critical to any IPTW analysis because the validity of the estimated causal effects depends on the assumption of no unobserved confounders. Attendees will gain hands-on experience estimating each type of weight using gradient (or generalized) boosting models (GBM) and using the weights to estimate the causal effects of interest. These analyses can be run with the TWANG package/suite of commands in Stata, SAS, or R; code will be shared. We will also showcase a new, free, menu-driven Shiny app. Attendees should be familiar with linear and logistic regression, but prior knowledge of IPTW and GBM is not necessary.
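As a rough illustration of the weighting idea, the following Python sketch estimates an average treatment effect with IPTW on a small made-up dataset and runs one simple balance check. It is not the TWANG workflow: the propensity score is estimated here by the treated proportion within each level of a single binary confounder rather than by GBM, and the data and every function name are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy data: (confounder stratum x, treatment a, outcome y).
# Treatment is more common when x = 1, and x = 1 also raises the outcome,
# so the unweighted comparison is confounded. The true effect is 2 in both strata.
toy = (
    [(0, 1, 3.0)] * 2 + [(0, 0, 1.0)] * 8 +
    [(1, 1, 7.0)] * 8 + [(1, 0, 5.0)] * 2
)

def propensity_by_stratum(records):
    """Estimate P(A = 1 | X = x) as the treated proportion in each stratum.

    Assumption: a stand-in for a fitted propensity model (for example GBM).
    """
    n, n_treated = defaultdict(int), defaultdict(int)
    for x, a, _ in records:
        n[x] += 1
        n_treated[x] += a
    return {x: n_treated[x] / n[x] for x in n}

def iptw_weights(records):
    """ATE weights: 1/ps for treated, 1/(1 - ps) for untreated.

    Requires 0 < ps < 1 in every stratum (positivity); otherwise this divides by zero.
    """
    ps = propensity_by_stratum(records)
    return [1 / ps[x] if a == 1 else 1 / (1 - ps[x]) for x, a, _ in records]

def group_contrast(records, weights, column):
    """Weighted mean of one column among the treated minus that among the untreated."""
    def arm_mean(a_value):
        pairs = [(r[column], w) for r, w in zip(records, weights) if r[1] == a_value]
        return sum(v * w for v, w in pairs) / sum(w for _, w in pairs)
    return arm_mean(1) - arm_mean(0)

unit_weights = [1.0] * len(toy)
weights = iptw_weights(toy)

naive_diff = group_contrast(toy, unit_weights, column=2)        # confounded contrast
iptw_ate = group_contrast(toy, weights, column=2)               # weighted contrast
imbalance_before = group_contrast(toy, unit_weights, column=0)  # difference in mean x
imbalance_after = group_contrast(toy, weights, column=0)        # should be near zero

print(f"naive difference: {naive_diff:.2f}")
print(f"IPTW estimate:    {iptw_ate:.2f}")
print(f"imbalance in x before / after weighting: {imbalance_before:.2f} / {imbalance_after:.2f}")
```

In this toy example the weighted groups have the same distribution of x, which is the kind of property a balance diagnostic looks for before the weighted outcome contrast is interpreted.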
This workshop will introduce participants to the Causal Roadmap for epidemiologic questions: 1) clear statement of the scientific question, 2) definition of the causal model and parameter of interest, 3) assessment of identifiability, that is, linking the causal effect to a parameter estimable from the observed data distribution, 4) choice and implementation of estimators, including parametric and semi-parametric approaches, and 5) interpretation of findings. The focus will be on estimation with a simple substitution estimator (parametric G-computation), inverse probability of treatment weighting (IPTW), and targeted maximum likelihood estimation (TMLE) with Super Learner. Participants will work through the Roadmap using an applied example and implement these estimators in R during the workshop session.
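To make the simple substitution estimator concrete, here is a minimal Python sketch of G-computation on made-up data. It assumes a single binary covariate and uses a saturated outcome model (the mean outcome in each covariate-by-treatment cell) in place of a fitted regression; the dataset and function names are hypothetical and are not taken from the workshop materials.

```python
from collections import defaultdict

# Hypothetical toy data: (covariate l, treatment a, outcome y).
data = (
    [(0, 1, 3.0), (0, 1, 5.0)] +
    [(0, 0, 1.0), (0, 0, 2.0), (0, 0, 3.0), (0, 0, 2.0)] +
    [(1, 1, 8.0), (1, 1, 9.0), (1, 1, 10.0), (1, 1, 9.0)] +
    [(1, 0, 5.0), (1, 0, 7.0)]
)

def fit_outcome_model(rows):
    """Saturated outcome model: estimate E[Y | L = l, A = a] by the cell mean."""
    total, count = defaultdict(float), defaultdict(int)
    for l, a, y in rows:
        total[(l, a)] += y
        count[(l, a)] += 1
    return {cell: total[cell] / count[cell] for cell in total}

def g_computation(rows):
    """Substitution estimator: predict every person's outcome under a = 1 and
    under a = 0 (keeping their own l), then average over the sample."""
    model = fit_outcome_model(rows)
    mean_y1 = sum(model[(l, 1)] for l, _, _ in rows) / len(rows)
    mean_y0 = sum(model[(l, 0)] for l, _, _ in rows) / len(rows)
    return mean_y1, mean_y0, mean_y1 - mean_y0

mean_y1, mean_y0, ate = g_computation(data)
print(f"E[Y^1] = {mean_y1:.2f}, E[Y^0] = {mean_y0:.2f}, difference = {ate:.2f}")
```

The estimate is only as good as the outcome model and the identifiability assumptions behind it, which is why the Roadmap places identifiability before estimation.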
Session Chair: Rachel Sippy, University of Florida
Machine learning (ML) is a popular approach for prediction of outcomes, including forecasting and spatial predictions. It is well-suited to large datasets with many potential predictor variables and has been applied to many problems in public health and healthcare. This workshop is intended for participants with some statistical modeling background who are interested in using ML for prediction. In this hands-on workshop, you will learn to identify appropriate questions for ML, the principles of ML, and how it relates to other modeling approaches. We will apply ML methods to a sample dataset and review the tools and other resources available for ML. This workshop assumes a working knowledge of R, and a laptop with R and RStudio installed will be required for the workshop.
Session Chair: Sam Harper, McGill University
Generating transparent and reproducible research is both ethical and necessary for making epidemiologic science useful. This workshop will provide participants with an overview of why funders, investigators, and students of epidemiologic research should aim to make their research transparent and fully reproducible, as well as hands-on experience with a selection of tools needed to do so. The workshop will provide: 1) an introductory, high-level overview of what it means to engage in reproducible research; 2) guidance on how to create a management plan for a research project and a structured workspace for the project that facilitates a reproducible workflow; 3) a discussion of pre-registration and pre-analysis plans for both experimental and observational research designs; 4) an introduction to version control and dynamic documents; and 5) tools and guidance for how to ethically and responsibly share the outputs of a research project, including data, code, and research reports. The format for the workshop will be a combination of short lectures, collaborative group work, and hands-on exercises. The workshop will be conducted using both R and Stata, but will focus on general practices and core principles that can be adapted to any software platform. The aim is for participants to leave with a strong grasp of why and how to use transparent and reproducible practices throughout the research life cycle.
Session Chair: Moyses Szklo, JHSPH
In this half-day workshop, participants will critically review a paper as initially submitted to the American Journal of Epidemiology (AJE), but not yet published. The paper will be sent to participants in advance of the workshop for their critical review. The workshop will open with a presentation on the main points to consider when preparing or reviewing a manuscript. Small-group work will follow the presentation so that participants can compare their reviews and prepare a consolidated list of critical comments on the paper. Each group will designate a leader who will present the group’s review of the paper to all participants. At the end of the workshop, participants will receive copies of the manuscript’s AJE reviews, the initial editorial decision, and the final accepted version of the paper.
It is becoming increasingly clear that producing causal estimates from studies with acceptable internal validity is not sufficient to guide interventions and policy analysis for population health. External validity is critical for applying internally valid results from a study population to a target population that may or may not have given rise to the study population. Novel developments in causal inference allow us to state the necessary and sufficient conditions for generalizability and transportability. This workshop will provide an accessible theoretical and practical introduction to the concepts of internal and external validity and show how to generalize or transport internally valid effect estimates from study populations to source or target populations. The concept of data fusion will be introduced to workshop participants for the purposes of generalizing or transporting data and effect estimates across populations and settings. The workshop will use structural and graphical language to make it accessible to epidemiologists interested in causal inference for informing interventions and policy. It will show how g-methods, particularly g-computation, inverse-probability weighting, and inverse-odds weighting, with or without augmentation, can be used to generalize or transport effect estimates. Ample applications using empirical datasets and software code will be provided in SAS, Stata, and R.
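The inverse-odds weighting idea can be sketched in a few lines of Python. The example below is only an illustration under strong assumptions: treatment is randomized within the study, a single binary effect modifier x explains all differences between the study and target populations, and the odds of target versus study membership are estimated from simple counts within each level of x. The data and function names are hypothetical.

```python
from collections import Counter

# Hypothetical study sample: (effect modifier x, treatment a, outcome y).
# The treatment effect is 1 when x = 0 and 3 when x = 1; x = 1 is rare in the study.
study = (
    [(0, 1, 2.0)] * 3 + [(0, 0, 1.0)] * 3 +
    [(1, 1, 5.0)] + [(1, 0, 2.0)]
)

# Hypothetical target sample: covariates only, with x = 1 much more common.
target_x = [0] * 2 + [1] * 6

def inverse_odds_weights(study_rows, target_covariates):
    """Weight each study member by the odds of target versus study membership
    given x, estimated in the stacked data as n_target[x] / n_study[x]."""
    n_study = Counter(x for x, _, _ in study_rows)
    n_target = Counter(target_covariates)
    return [n_target[x] / n_study[x] for x, _, _ in study_rows]

def weighted_difference(rows, weights):
    """Weighted mean outcome among the treated minus that among the untreated."""
    def arm_mean(a_value):
        pairs = [(y, w) for (_, a, y), w in zip(rows, weights) if a == a_value]
        return sum(y * w for y, w in pairs) / sum(w for _, w in pairs)
    return arm_mean(1) - arm_mean(0)

study_effect = weighted_difference(study, [1.0] * len(study))
transported_effect = weighted_difference(study, inverse_odds_weights(study, target_x))

print(f"effect in the study population:   {study_effect:.2f}")
print(f"effect transported to the target: {transported_effect:.2f}")
```

Because the effect is larger when x = 1 and x = 1 is more common in the target, the transported estimate exceeds the study estimate; with no effect modification by x the two would coincide.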