Thinking Beyond Simulation: Benchmarking as an Empirical Tool for Method Comparison

Presenting Author

Haidong Lu

Yale University

Submitting Author

Haidong Lu

Additional Authors

Kaicheng Wang, Lindsey Rosman

Abstract

Background and Objective: Simulation studies dominate methodological comparisons of propensity score estimators, yet their conclusions depend heavily on design choices that may not reflect the complexity of real-world data. In contrast, benchmarking, defined as comparing observational estimates against results from randomized trials addressing the same question, offers an empirical alternative for evaluating causal methods. We used benchmarking to assess whether machine learning–based propensity score estimation outperforms the conventional logistic regression approach in practice.

Methods: Using electronic health record data from the Veterans Affairs health system, we emulated a target trial comparing sacubitril–valsartan with angiotensin-converting enzyme inhibitors and angiotensin receptor blockers on all-cause mortality among patients with heart failure and implantable cardioverter-defibrillators between 2016 and 2020. We compared three propensity score approaches: (1) logistic regression with pre-specified confounders; (2) generalized boosted models (GBM) using the same pre-specified confounders; and (3) GBM with expanded covariates and automated feature selection. Observational effect estimates were benchmarked against results from an existing randomized trial addressing the same research question.
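The abstract does not state how the estimated propensity scores were used (weighting, matching, or stratification). As a minimal sketch, assuming inverse probability of treatment weighting, the step that follows any of the three propensity score models could look like the following. The function name `stabilized_iptw` and the toy values are hypothetical and are not taken from the study.

```python
def stabilized_iptw(treated, ps):
    """Return stabilized inverse probability of treatment weights.

    treated: list of 0/1 treatment indicators (1 = sacubitril-valsartan, by assumption)
    ps:      list of estimated propensity scores P(treated = 1 | covariates),
             from any model (logistic regression, GBM, ...)
    """
    p_treat = sum(treated) / len(treated)  # marginal probability of treatment
    weights = []
    for a, e in zip(treated, ps):
        if not 0.0 < e < 1.0:
            raise ValueError("propensity scores must lie strictly between 0 and 1")
        # Treated: P(A=1) / e(x); untreated: P(A=0) / (1 - e(x))
        weights.append(p_treat / e if a == 1 else (1 - p_treat) / (1 - e))
    return weights


# Hypothetical toy values, for illustration only
treated = [1, 1, 0, 0, 0, 1]
ps = [0.8, 0.6, 0.3, 0.2, 0.5, 0.4]
weights = stabilized_iptw(treated, ps)
```

Only the model that produces `ps` changes across the three approaches; the weighting step stays the same, which is what makes their downstream effect estimates comparable.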

Results: The logistic regression–based propensity score approach yielded estimates closest to the randomized trial (HR = 0.93, 95% CI: 0.61–1.42 vs. trial HR = 0.81, 95% CI: 0.61–1.06; 23-month RR = 0.86, 95% CI: 0.57–1.24). Despite superior predictive performance, GBM with pre-specified confounders showed no improvement over logistic regression (HR = 0.97, 95% CI: 0.68–1.37; RR = 0.96, 95% CI: 0.89–1.98). In contrast, GBM with expanded covariates and data-driven feature selection substantially increased bias (HR = 0.61, 95% CI: 0.30–1.23; RR = 0.69, 95% CI: 0.36–1.04).
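One simple way to make "closest to the randomized trial" concrete is the difference between each observational log hazard ratio and the trial log hazard ratio. The abstract does not say which agreement metric the authors used, so the log-HR gap below is an assumption; the hazard ratios themselves are the ones reported above.

```python
import math

TRIAL_HR = 0.81  # randomized trial hazard ratio reported in the abstract

# Observational hazard ratios reported in the abstract
observational_hr = {
    "logistic regression": 0.93,
    "GBM, pre-specified confounders": 0.97,
    "GBM, expanded covariates": 0.61,
}


def log_hr_gap(hr_obs, hr_trial):
    """Difference in log hazard ratios (observational minus trial)."""
    return math.log(hr_obs) - math.log(hr_trial)


gaps = {name: log_hr_gap(hr, TRIAL_HR) for name, hr in observational_hr.items()}

# The approach with the smallest absolute gap is the closest to the trial
closest = min(gaps, key=lambda name: abs(gaps[name]))

for name, gap in gaps.items():
    print(f"{name}: log-HR gap = {gap:+.3f}, HR ratio = {math.exp(gap):.2f}")
print("closest to trial:", closest)
```

Under this metric the ordering matches the description above: logistic regression is closest to the trial and the expanded-covariate GBM is farthest. The gaps are based on point estimates only and ignore the wide confidence intervals, so they are descriptive rather than a formal test of agreement.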

Conclusions: Machine learning–based propensity score estimation does not inherently outperform conventional approaches and may exacerbate bias due to overfitting or causal model misspecification. These findings underscore the value of benchmarking against randomized trials as an innovative and important empirical tool for comparing causal inference methods using real-world data. Rather than serving as stand-alone solutions, machine learning algorithms should be embedded within principled causal frameworks, such as doubly robust estimators with cross-fitting, while retaining explicit, subject-matter–driven confounder specification.
