Big Data/Machine Learning/AI
Error Patterns in Machine Learning Algorithms for Healthcare: A Multicohort Study Across Brazil’s Five Regions Júlia Neuenschwander*, Alexandre Chiavegatto Filho
Machine learning (ML) algorithms are increasingly used in healthcare, yet studies examining their error patterns remain limited. Understanding whether misclassified patients share common traits, and how algorithmic architecture influences error profiles, is essential to improving diagnostic accuracy and fairness. This study therefore aimed to assess error patterns in ML models predicting COVID-19 mortality, exploring the relationship between algorithmic architecture, error profiles, and dataset characteristics. Data were sourced from the IACOV-BR initiative, which includes 15,598 adult patients with RT-PCR-confirmed COVID-19 across 21 hospitals in Brazil. The multicohort design, spanning Brazil’s vast geographical area and diverse population, enhances the generalizability of the findings. Five models—Random Forest, XGBoost, CatBoost, LightGBM, and TabPFN—were evaluated using demographic and laboratory data. Performance was assessed using the area under the receiver operating characteristic curve (AUROC), and Shapley values were used to determine variable importance. TabPFN exhibited the best overall predictive performance, particularly in smaller datasets, which further supports its utility in resource-limited settings. Approximately 10% of patients were misclassified by all models, and 54% were correctly classified by all of them. Key predictors of misclassification included age, platelet count, and C-reactive protein. Strong error correlation was observed between the gradient boosting models (XGBoost and LightGBM: R² = 0.85), while Random Forest showed moderate correlation with both boosting models and TabPFN. Validation across hospitals demonstrated improved performance with larger datasets, with hospital-specific factors influencing variability. In conclusion, this study highlights shared error patterns across ML models and the influence of specific predictors on misclassification.
Additionally, it suggests that no single algorithm consistently outperforms others, emphasizing the need for hospitals to select ML models tailored to their specific contexts. These insights can guide the development of diagnostic tools and strategies to mitigate errors, enhancing the robustness and equity of ML applications in healthcare.
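The cross-model error analysis described above can be sketched in code. The snippet below is a minimal illustration, not the study's actual pipeline: it assumes each fitted model yields binary mortality predictions, builds a per-model error indicator matrix, and from it derives the fraction of patients misclassified by all models, the fraction correctly classified by all, and the pairwise error correlation matrix. The model names and random toy data are placeholders.

```python
import numpy as np


def error_overlap(y_true, preds):
    """Summarize shared error patterns across models.

    y_true : 1-D array of true binary outcomes.
    preds  : dict mapping model name -> 1-D array of binary predictions.

    Returns the error-indicator matrix (models x samples), the fraction
    of samples misclassified by every model, the fraction classified
    correctly by every model, and the pairwise error correlation matrix.
    """
    names = list(preds)
    # 1 where the model is wrong, 0 where it is right
    E = np.vstack([(preds[m] != y_true).astype(int) for m in names])
    all_wrong = E.all(axis=0).mean()        # misclassified by all models
    all_right = (E == 0).all(axis=0).mean() # correct under all models
    corr = np.corrcoef(E)                   # pairwise error correlation
    return E, all_wrong, all_right, corr


# Toy illustration with synthetic labels and ~80%-accurate predictions
# (hypothetical data, not the IACOV-BR cohort).
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 500)
models = ["random_forest", "xgboost", "catboost", "lightgbm", "tabpfn"]
preds = {m: np.where(rng.random(500) < 0.8, y, 1 - y) for m in models}

E, all_wrong, all_right, corr = error_overlap(y, preds)
```

In the study's terms, `all_wrong` corresponds to the ~10% of patients misclassified by every model and `all_right` to the ~54% classified correctly by all; off-diagonal entries of `corr` correspond to the reported pairwise error correlations (e.g., XGBoost vs. LightGBM).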