Big Data/Machine Learning/AI
CARDIOVASCULAR HEALTH PREDICTION IN CHILDREN USING MACHINE LEARNING
Tiago Almeida de Oliveira*, Mateus Silva Rocha, Keisyanne De Araujo-Moura, Marcus Vinícius Nascimento-Ferreira, Augusto César Ferreira De Moraes
Aim: To develop and validate risk scores for predicting cardiovascular health in children using Machine Learning algorithms based on extrinsic and intrinsic variables.
Methods: Data came from the SAYCARE study, a cross-sectional, multicenter, school-based study conducted in São Paulo and Fortaleza, Brazil, and included 462 children aged 5 to 8 years. The primary outcome was cardiovascular health (CVH), which comprises eight components: healthy diet, participation in physical activity, avoidance of nicotine, restorative sleep, healthy weight, and healthy levels of blood lipids, glycated hemoglobin, and blood pressure. Each component is scored from 0 to 100 points, and the component scores are combined into a composite CVH score that also ranges from 0 to 100 points. Potential predictors covered sociodemographic, maternal, environmental, and behavioral factors, as well as nutritional status. The CVH score was transformed into a binary outcome, 1 (high CVH) or 0 (lower CVH), and the data were split 70/30 into training and test sets. Three Machine Learning algorithms (Random Forest, XGBoost, and LightGBM) were tuned via grid search with 5-fold cross-validation. Performance was assessed with precision, recall, F1 score, AUC, and the area under the precision-recall curve. Variable importance was interpreted using Shapley values, which also informed a nomogram.
Results: The outcome was imbalanced: 32.47% of children had high CVH (1) and 67.53% had lower CVH (0). Despite this imbalance, the Random Forest model performed best in predicting the binary CVH outcome, achieving an AUC of 0.88, precision of 0.85, recall of 0.77, F1 score of 0.79, and an area under the precision-recall curve of 0.78, suggesting robustness and clinical applicability. The most influential variables were sedentary behavior, weight, negative environmental factors, and household income. These variables underscore the interplay between individual and environmental contributors to cardiovascular risk. A nomogram was developed to translate these contributions into a visual scale illustrating each variable’s relative importance.
Conclusion: The Random Forest model showed the best performance in predicting cardiovascular health in children, and environmental factors and lifestyle behaviors were the most important predictors.
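
Illustrative sketch: the tuning and evaluation steps described in the Methods (70/30 split, grid search with 5-fold cross-validation, and the five reported metrics) could look roughly like the following with scikit-learn. This is not the study's code. The data are synthetic (generated to mimic 462 children with about one third in the high-CVH class), and the stratified split, the hyperparameter grid, the random seeds, and the use of AUC as the tuning criterion are all assumptions made for illustration.

```python
# Illustrative sketch only: synthetic data and a hypothetical grid,
# not the SAYCARE data or the authors' actual configuration.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (average_precision_score, f1_score,
                             precision_score, recall_score, roc_auc_score)
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in: 462 samples, roughly 32% in class 1 ("high CVH").
X, y = make_classification(n_samples=462, n_features=10, n_informative=5,
                           weights=[0.675, 0.325], random_state=0)

# 70/30 split; stratifying on the outcome is an assumption that keeps
# the class imbalance similar in the training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0)

# Hypothetical, deliberately small hyperparameter grid.
param_grid = {"n_estimators": [100, 200], "max_depth": [3, None]}

# Grid search with 5-fold cross-validation on the training set only;
# scoring by ROC AUC is an assumption.
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5, scoring="roc_auc")
search.fit(X_train, y_train)

# Evaluate the refitted best model on the held-out 30%.
y_pred = search.predict(X_test)
y_prob = search.predict_proba(X_test)[:, 1]

results = {
    "precision": precision_score(y_test, y_pred, zero_division=0),
    "recall": recall_score(y_test, y_pred, zero_division=0),
    "f1": f1_score(y_test, y_pred, zero_division=0),
    "roc_auc": roc_auc_score(y_test, y_prob),
    # Average precision summarizes the precision-recall curve.
    "pr_auc": average_precision_score(y_test, y_prob),
}

print("best parameters:", search.best_params_)
for name, value in results.items():
    print(f"{name}: {value:.2f}")
```

With an imbalanced outcome such as the one reported here, the area under the precision-recall curve is a useful complement to AUC, because it reflects performance on the minority (high CVH) class. Numbers from this synthetic example are not comparable to the study's results.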