Big Data/Machine Learning/AI
Machine Learning to Predict Tuberculosis Treatment Abandonment in São Paulo, Brazil
Marcela Quaresma Soares* Marcela Soares Fabiano Barcellos Filho Alexandre Dias Porto Chiavegatto Filho
Tuberculosis (TB) treatment abandonment is a significant public health challenge, contributing to increased disease transmission and worse patient outcomes. This study aimed to develop and compare machine learning models to predict treatment abandonment, using data from the Brazilian Unified Health System (SUS) in São Paulo for the period from 2013 to 2023. We included a total of 232,809 TB cases, of which 34,034 (14.6%) were classified as treatment abandonment. The analysis considered a wide range of independent variables, including demographic, clinical, and contextual factors, with missing data grouped into ‘ignored’ categories. We evaluated popular machine learning algorithms for structured data: Random Forest, XGBoost, LightGBM, CatBoost, and TabNet. All models were optimized through hyperparameter tuning and validated using 5-fold cross-validation to enhance robustness and generalizability. Performance was assessed using AUC-ROC, precision, recall, and F1-score. CatBoost achieved the highest performance, with an AUC-ROC of 0.912, followed by LightGBM (0.911) and XGBoost (0.910). CatBoost also showed the best combination of precision (0.73) and F1-score (0.61), significantly outperforming the other models in identifying cases of treatment abandonment. Hyperparameter tuning notably improved the performance of the boosting algorithms, particularly their precision and recall. These results highlight the potential of machine learning not only to predict treatment abandonment but also to inform evidence-based decision-making. Identifying high-risk patients can enable health authorities to design targeted interventions, prioritize resources, and ultimately reduce TB-related morbidity and mortality. This study reinforces the importance of integrating advanced analytical methods into public health strategies, especially in resource-limited settings where TB remains a critical health issue.
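For readers less familiar with the reported metrics, the following is a minimal sketch of how precision, recall, F1-score, and AUC-ROC can be computed for a binary abandonment label. It is not the study's code: the function names, the toy labels and scores, and the 0.5 decision threshold are all hypothetical and chosen only for illustration. The only figures taken from the abstract are the case counts used to reproduce the 14.6% abandonment proportion.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for the positive class (1 = abandonment)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def auc_roc(y_true, scores):
    """AUC-ROC as the fraction of (positive, negative) pairs in which the
    positive case receives the higher score; ties count as half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC-ROC needs at least one case of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Class proportion from the counts reported in the abstract.
abandonment_rate = 34_034 / 232_809

# Hypothetical toy data, NOT from the study: 1 = abandonment, 0 = other outcome.
y_true = [1, 0, 1, 0, 0, 1, 0, 0]
scores = [0.9, 0.2, 0.4, 0.6, 0.1, 0.8, 0.3, 0.05]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
auc = auc_roc(y_true, scores)
print(f"abandonment rate: {abandonment_rate:.1%}")
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} auc={auc:.3f}")
```

Because abandonment makes up only about one case in seven, precision, recall, and F1 depend on the chosen decision threshold, whereas AUC-ROC summarizes ranking quality across all thresholds. This is one reason a model can combine a high AUC-ROC with a more moderate F1-score.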