Big Data/Machine Learning/AI
Virtual Pooling: Accurate, Scalable, and Privacy-preserving Analytics Without Aggregating Patient Data
Ishtiyaque Ahmad*, Trinabh Gupta
DataUnite
Background: Concerns regarding patient privacy and data security preclude many important medical studies that require multicenter datasets for maximal robustness. Although federated learning methods address this problem, existing implementations are often complex to use and less accurate, and they lack support for common biostatistical and multi-iterative analyses. Here we test Virtual Pooling (VP), a new approach for analyzing multicenter data without aggregating patient data into a single repository.
Methods: We used VP to replicate a high-impact study that measured the association between chronic immunosuppressant use and in-hospital outcomes among COVID-19 patients using the N3C dataset from 59 health systems (N=334,754). The study used a common biostatistical pipeline: propensity score matching followed by a Cox proportional hazards model. We compared VP results against those from direct pooling, the gold standard. To test VP’s robustness under imbalanced cross-center data distributions, we selected two covariates and induced center-specific correlations (positive, negative, or none) via probabilistic patient assignment.
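The probabilistic patient assignment described above can be illustrated with a minimal sketch. It is not the study's actual procedure: the function names `assign_centers` and `share_high`, the `strength` parameter, the binary covariate, and the split of centers into a low-index and a high-index half are all assumptions made for illustration.

```python
import random

def assign_centers(covariate, n_centers, direction, strength=0.8, seed=0):
    """Assign each patient to a center, optionally skewed by a binary covariate.

    direction: "positive", "negative", or "none".
    strength:  hypothetical knob; probability that a patient follows the skew.
    """
    rng = random.Random(seed)
    half = n_centers // 2
    low, high = range(0, half), range(half, n_centers)
    centers = []
    for x in covariate:
        if direction == "none":
            # No induced correlation: every center is equally likely.
            centers.append(rng.randrange(n_centers))
            continue
        # positive: patients with x == 1 favour high-index centers;
        # negative: patients with x == 0 favour high-index centers.
        favour_high = (x == 1) if direction == "positive" else (x == 0)
        if rng.random() >= strength:
            favour_high = not favour_high  # a minority goes against the skew
        centers.append(rng.choice(high if favour_high else low))
    return centers

def share_high(covariate, centers, n_centers):
    """Fraction of patients with x == 1 who landed in the high-index half."""
    half = n_centers // 2
    flagged = [c for x, c in zip(covariate, centers) if x == 1]
    return sum(c >= half for c in flagged) / len(flagged)
```

With `strength=0.8`, roughly 80% of patients carrying the covariate land in the high-index centers under the positive setting and roughly 20% under the negative setting, which yields the kind of cross-center imbalance the robustness test calls for.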
Results: VP generated a matched cohort identical to that obtained via direct pooling. Following propensity score matching, Cox regression analyses produced the same covariate coefficients and p-values under both approaches. This concordance persisted in simulations with extreme distributional imbalances. VP required only 17 seconds of additional computation time relative to direct pooling.
Conclusions: Virtual pooling enables accurate, scalable multicenter data analysis without sharing patient-level data, ensuring strong privacy. It achieves accuracy comparable to direct data pooling while supporting advanced time-to-event analyses such as the Cox proportional hazards model. Thus, VP can accelerate multicenter collaborations and enhance the robustness and generalizability of medical research.

