Causal Inference
Including Trial-Selection Variables in Machine-learning for Generalizable Conditional Average Treatment Effect Estimation
Rikuta Hamaya*, Konan Hara, Etsuji Suzuki
Machine-learning approaches for estimating the conditional average treatment effect (CATE) can inform individualized treatment decisions. However, generalizing such estimates beyond randomized controlled trial (RCT) participants remains challenging because trial participation can introduce selection bias. We therefore investigate whether including covariates that determine trial participation improves the performance of CATE-estimating algorithms. Using theoretical derivations, we show that unbiased CATE estimation in a source population requires conditioning on trial-selection variables, whether the goal is to estimate the CATE for specific covariates or the individual treatment effect (ITE). Simulations demonstrate that simply including all relevant covariates in a Causal Forest can reduce bias but may inflate variance unless the sample size is large (e.g., >5,000 for a continuous outcome with 5 CATE covariates). We further evaluate an inverse probability weighting (IPW) approach that leverages data on the source population. IPW reduces selection bias more efficiently than simply adding covariates in high dimensions. In a real-world application using the VITamin D and OmegA-3 TriaL (VITAL), we compare CATE estimates for the effect of omega-3 fatty acid supplementation on coronary heart disease incidence. Including trial-selection variables in the Causal Forest model yields stronger effect estimates among those most likely to benefit, though the evaluation is limited to the trial sample. Our findings highlight that identifying and incorporating variables that determine trial participation is crucial for generalizable CATE estimates, and thus RCTs may be better designed to collect such variables. However, simply including these variables in a Causal Forest does not necessarily lead to better estimates, even when the aim is to estimate the ITE. Combining RCT data with baseline information from the source population can improve estimation performance, particularly when the aim is to estimate the CATE for specific covariates.
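The weighting idea can be illustrated with a minimal simulation sketch. It is not the authors' Causal Forest or IPW implementation: it uses a single hypothetical binary trial-selection variable `s` that also modifies the treatment effect, made-up participation probabilities and effect sizes, and it targets the average effect in the source population rather than a CATE. Under these assumptions, the unweighted trial contrast is pulled toward the over-represented stratum, while weighting participants by the inverse of their estimated participation probability recovers the source-population average.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000  # size of the simulated source population (assumed)

# Hypothetical binary trial-selection variable that also modifies the effect.
s = rng.binomial(1, 0.5, size=n)

# Trial participation depends on s (assumed probabilities 0.8 vs 0.2).
in_trial = rng.binomial(1, np.where(s == 1, 0.8, 0.2)).astype(bool)

# Randomized treatment; assumed effect is 2.0 when s == 1 and 0.5 when s == 0.
a = rng.binomial(1, 0.5, size=n)
tau = np.where(s == 1, 2.0, 0.5)
y = 1.0 + tau * a + rng.normal(0.0, 1.0, size=n)

# Average effect in the source population (analytically 0.5*2.0 + 0.5*0.5 = 1.25).
true_ate = tau.mean()

# Outcomes and treatment are observed only for trial participants.
s_t, a_t, y_t = s[in_trial], a[in_trial], y[in_trial]

# Unweighted contrast: targets the trial-sample average effect
# (analytically 0.8*2.0 + 0.2*0.5 = 1.7), not the source-population one.
naive = y_t[a_t == 1].mean() - y_t[a_t == 0].mean()

# Participation probabilities estimated from source-population baseline data,
# here simply as the participation frequency within each stratum of s.
p_hat = np.array([in_trial[s == k].mean() for k in (0, 1)])
w = 1.0 / p_hat[s_t]  # inverse probability of participation weights


def weighted_mean(values, weights):
    """Normalized (Hajek-style) weighted mean."""
    return np.sum(weights * values) / np.sum(weights)


ipw = (weighted_mean(y_t[a_t == 1], w[a_t == 1])
       - weighted_mean(y_t[a_t == 0], w[a_t == 0]))

print(f"source-population effect: {true_ate:.3f}")
print(f"unweighted trial estimate: {naive:.3f}")
print(f"IPW estimate: {ipw:.3f}")
```

With a continuous or high-dimensional set of selection variables, the stratum frequencies would be replaced by a fitted participation model, and the quality of that model would then drive the quality of the weights.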