Causal Inference
Handling Missing Data in Cluster Randomized Trials: A Comparison of Multiple Imputation and Two-Stage TMLE
Joy Zora Nakato*, University of California, Berkeley
Missing data are ubiquitous in research studies and pose particular challenges in cluster randomized trials (CRTs), which randomize groups (e.g., communities or clinics) to intervention or control. CRTs often have complex dependence structures and are prone to missing data due to their pragmatic nature. A range of methods has been proposed to handle missing data in CRTs. Here, we compare multiple imputation (MI), a model-based method relying on strong assumptions about the data generating process (DGP), and Two-Stage TMLE, a machine learning method facilitating flexible estimation of both the outcome and missingness processes. Although MI and TMLE have been compared in other settings, their relative performance in CRTs remains underexplored. To address this gap, we conducted a simulation study reflecting OPAL, the motivating CRT designed to improve use of HIV prevention in rural East Africa. Specifically, we simulated 1000 CRTs, each with 100 clusters of roughly 100 individuals, under two scenarios: a simple DGP in which the outcome and missingness processes were linear, and a complex DGP in which these processes were highly non-linear. In the simple scenario, both methods achieved low bias and nominal 95% confidence interval coverage. In the complex scenario, however, MI was meaningfully biased, resulting in below-nominal coverage (89.2%), whereas Two-Stage TMLE retained robust performance, with negligible bias and near-nominal coverage (93.4%). Our findings demonstrate that while MI performs adequately when its models are correctly specified, its performance deteriorates under more realistic settings. In contrast, Two-Stage TMLE maintains robust performance under complex missingness and outcome mechanisms, highlighting its potential as a flexible approach for handling missing data in CRTs.
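To make the simulation design concrete, the following is a minimal sketch of how one CRT could be generated under a "simple" scenario, with a linear outcome model and a logistic-linear missingness model. It is not the study's actual DGP: the function names (`simulate_crt`, `arm_difference`), the continuous outcome, and every coefficient are hypothetical values chosen for illustration. Neither MI nor Two-Stage TMLE is implemented here; the sketch only shows the data structure those methods would operate on.

```python
import numpy as np

def simulate_crt(n_clusters=100, cluster_size=100, effect=0.5, seed=0):
    """Simulate one CRT with clustered outcomes and covariate-dependent missingness.

    All coefficients are illustrative assumptions, not values from the study.
    """
    rng = np.random.default_rng(seed)
    n = n_clusters * cluster_size
    cluster = np.repeat(np.arange(n_clusters), cluster_size)

    # Randomize exactly half of the clusters to the intervention arm.
    arm_by_cluster = rng.permutation(np.arange(n_clusters) % 2)
    arm = arm_by_cluster[cluster]

    # A cluster-level random intercept induces within-cluster dependence.
    u = rng.normal(0.0, 0.3, size=n_clusters)[cluster]
    w = rng.normal(size=n)  # individual-level covariate

    # Linear outcome model (the "simple" scenario).
    y_full = 1.0 + effect * arm + 0.8 * w + u + rng.normal(size=n)

    # Outcome is observed with a probability that depends on arm and covariate.
    p_obs = 1.0 / (1.0 + np.exp(-(1.0 + 0.5 * arm - 0.7 * w)))
    observed = rng.random(n) < p_obs
    y = np.where(observed, y_full, np.nan)

    return {"cluster": cluster, "arm_by_cluster": arm_by_cluster, "w": w,
            "y": y, "y_full": y_full, "observed": observed}

def arm_difference(cluster, arm_by_cluster, y):
    """Difference between arms in the mean of cluster-level outcome means.

    Missing outcomes (NaN) are dropped; assumes every cluster retains
    at least one observed outcome.
    """
    keep = ~np.isnan(y)
    n_clusters = len(arm_by_cluster)
    sums = np.bincount(cluster[keep], weights=y[keep], minlength=n_clusters)
    counts = np.bincount(cluster[keep], minlength=n_clusters)
    cluster_means = sums / counts
    return (cluster_means[arm_by_cluster == 1].mean()
            - cluster_means[arm_by_cluster == 0].mean())

sim = simulate_crt(seed=1)
full = arm_difference(sim["cluster"], sim["arm_by_cluster"], sim["y_full"])
complete_case = arm_difference(sim["cluster"], sim["arm_by_cluster"], sim["y"])
print(f"full data: {full:.3f}, complete case: {complete_case:.3f}")
```

Because missingness in this sketch depends on both the arm and the covariate, the complete-case contrast will generally differ from the full-data contrast. Repeating the simulation across many seeds and comparing each estimator with the true effect is how bias and confidence interval coverage would be assessed.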
