Methods/Statistics
Bias-Variance Implications of Outcome Missingness and Misclassification in EHR-Based Studies: A Simulation Study
Elaona Lemoto*
Duke University School of Medicine, Department of Biostatistics and Bioinformatics
Real-world data (RWD) are increasingly used to generate real-world evidence (RWE) for clinical and policy decision-making, yet such data are often affected by outcome missingness and misclassification, particularly for binary disease outcomes. Missing data are commonly categorized as missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR), but less is known about how these mechanisms compare when missing outcomes are inadvertently misclassified as non-events. We conducted a simulation study motivated by electronic health record (EHR) data on the association between diabetes and cardiovascular disease (CVD). Data were generated with one exposure, one observed confounder, and one unobserved confounder, with correlation among covariates ranging from 0.1 to 0.8. Outcomes were generated as binary (CVD diagnosis) or continuous (CVD severity). Outcome missingness, optionally followed by misclassification of the missing outcomes as non-events, was imposed under MCAR, MAR (dependent on the observed confounder), and MNAR (dependent on the unobserved confounder), with the proportion of missingness ranging from 5% to 75% and sample sizes from 1,000 to 50,000. Logistic or linear regression models excluding the unobserved confounder were fit, and bias and 95% confidence interval coverage were evaluated using the full model with neither missingness nor misclassification as the reference. For continuous outcomes, coverage was around 95% under MCAR and remained above 85% under MAR, but dropped below 80% under MNAR at 75% missingness and below 30% with high correlation (ρ=0.8), ≥25% missingness, and N≥50,000. Binary outcomes with missingness maintained >90% coverage except under MNAR with high correlation or a high proportion of missingness. Misclassification of binary outcomes produced extreme bias (up to 120%) and near-zero coverage across all mechanisms as sample size increased.
These findings indicate that outcome misclassification may be substantially more harmful than outcome missingness in EHR-based studies and underscore the importance of explicitly addressing misclassification when diagnoses are incompletely captured.
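To make the binary-outcome design concrete, the following is a minimal sketch of a single simulation replicate, not the study's actual code. Everything numeric or named here is an illustrative assumption: the regression coefficients, the dichotomized-normal exposure, the logistic missingness model, the helper names (`expit`, `fit_logistic`, `run_once`), and the choice of reference (the same confounder-omitting model fit to complete data). It contrasts a complete-case analysis with an analysis in which missing outcomes are recoded as non-events.

```python
import numpy as np

def expit(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_logistic(X, y, iters=25):
    """Logistic regression by Newton-Raphson; returns [intercept, slopes...]."""
    X = np.column_stack([np.ones(len(y)), X])
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = expit(X @ beta)
        grad = X.T @ (y - p)
        hess = (X * (p * (1.0 - p))[:, None]).T @ X
        beta = beta + np.linalg.solve(hess, grad)
    return beta

def run_once(n=20000, rho=0.5, p_miss=0.25, mechanism="MNAR", seed=0):
    rng = np.random.default_rng(seed)
    # Latent trivariate normal with common correlation rho (assumed structure).
    cov = np.full((3, 3), rho)
    np.fill_diagonal(cov, 1.0)
    z = rng.multivariate_normal(np.zeros(3), cov, size=n)
    x = (z[:, 0] > 0).astype(float)  # binary exposure, e.g. diabetes
    c, u = z[:, 1], z[:, 2]          # observed and unobserved confounders

    # True outcome model; coefficients are illustrative assumptions.
    y = rng.binomial(1, expit(-2.0 + 0.7 * x + 0.5 * c + 0.5 * u)).astype(float)

    # Missingness depends on nothing (MCAR), on c (MAR), or on u (MNAR).
    driver = {"MCAR": np.zeros(n), "MAR": c, "MNAR": u}[mechanism]
    lo, hi = -20.0, 20.0
    for _ in range(60):  # bisection so the mean missingness probability is p_miss
        mid = (lo + hi) / 2.0
        if expit(mid + driver).mean() < p_miss:
            lo = mid
        else:
            hi = mid
    missing = rng.random(n) < expit(mid + driver)

    X = np.column_stack([x, c])  # analysis model omits u
    return {
        "missing_fraction": missing.mean(),
        "reference": fit_logistic(X, y),                         # no missingness
        "complete_case": fit_logistic(X[~missing], y[~missing]), # drop missing rows
        "misclassified": fit_logistic(X, np.where(missing, 0.0, y)),  # missing -> non-event
    }

result = run_once()
for name in ("reference", "complete_case", "misclassified"):
    print(name, np.round(result[name], 3))
```

Each returned array holds the intercept followed by the exposure and observed-confounder coefficients. Repeating `run_once` over many seeds and over a grid of `rho`, `p_miss`, `n`, and `mechanism` would give the per-scenario bias and, with standard errors added, the confidence interval coverage summarized in the abstract.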
