Big Data/Machine Learning/AI
Machine Learning for Predicting Cardiovascular Health in South American Pediatric Population
Tiago Almeida de Oliveira*; Mateus Silva Rocha, Tiago Almeida de Oliveira, Marcus Vinicius Nascimento-Ferreira, Alexandre Chiavegatto Filho, Augusto Cesar Ferreira De Moraes
This study aimed to develop cardiovascular health (CVH) prediction models for South American children and adolescents, incorporating a range of extrinsic and intrinsic factors. We analyzed data from the South American Youth/Child Cardiovascular and Environmental (SAYCARE) Study, an observational multicenter feasibility study of individuals aged 3-18 years in five South American cities. Of the initial 475 participants, we excluded those with incomplete data on dietary intake, physical activity, nicotine exposure, sleep health, body mass index, blood lipids, fasting glucose, or blood pressure, and those with missing covariate data. After preprocessing and data cleaning, the sample comprised 297 observations. The models considered sociodemographic, maternal, environmental, and behavioral factors, including nutritional status. CVH was categorized as low (Class 1, prevalence 0.56), moderate (Class 2, prevalence 0.24), or high (Class 3, prevalence 0.20). The data were split 70/30 for training and testing the algorithms. Exploratory analysis with Spearman and biserial correlations addressed multicollinearity concerns by dropping variables with a correlation above 0.90. Features were standardized with z-scores, and categorical features were one-hot encoded. Six algorithms commonly used for tabular data (XGBoost, KNN, LightGBM, SVM, multinomial logistic regression, and Random Forest) were compared under three multiclass strategies (One vs One, One vs Rest, and Multiclass). Random Forest performed best, and after refinement and validation using grid search with 5-fold cross-validation, the Multiclass model was selected. It achieved an area under the ROC curve of 0.88 in the test set, with strong discrimination across CVH categories (Class 1: 0.85, Class 2: 0.79, Class 3: 0.99).
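The workflow described above (70/30 split, z-score standardization, one-hot encoding, a Random Forest tuned by grid search with 5-fold cross-validation, and per-class ROC AUC) can be sketched roughly as follows. This is a minimal illustration with scikit-learn on synthetic data, not the authors' code: the feature names (`age`, `screen_hours`, `city`), the hyperparameter grid, the stratified split, and the random seeds are all assumptions made for the example, and the SAYCARE variables are not reproduced.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler, label_binarize

rng = np.random.default_rng(0)
n = 297  # matches the analytic sample size; the values themselves are synthetic

# Hypothetical stand-in features (assumption): NOT the SAYCARE variables.
X = pd.DataFrame({
    "age": rng.uniform(3, 18, n),
    "screen_hours": rng.uniform(0, 6, n),
    "city": rng.choice(["A", "B", "C", "D", "E"], n),
})
# Synthetic three-class outcome drawn with roughly the reported prevalences.
y = rng.choice([1, 2, 3], size=n, p=[0.56, 0.24, 0.20])

# 70/30 split; stratifying on the outcome is an assumption of this sketch.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0
)

# Z-score the numeric columns and one-hot encode the categorical column.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age", "screen_hours"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])
pipeline = Pipeline([
    ("prep", preprocess),
    ("rf", RandomForestClassifier(random_state=0)),
])

# Small illustrative grid (assumption), searched with 5-fold cross-validation.
param_grid = {"rf__n_estimators": [100, 200], "rf__max_depth": [None, 5]}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="roc_auc_ovr")
search.fit(X_train, y_train)

# Test-set evaluation: macro one-vs-rest AUC plus one AUC per class.
proba = search.predict_proba(X_test)
classes = search.classes_
overall_auc = roc_auc_score(y_test, proba, multi_class="ovr", labels=classes)
y_bin = label_binarize(y_test, classes=classes)
per_class_auc = {
    int(c): roc_auc_score(y_bin[:, i], proba[:, i])
    for i, c in enumerate(classes)
}

print("best params:", search.best_params_)
print("overall AUC:", round(overall_auc, 3))
print("per-class AUC:", {c: round(a, 3) for c, a in per_class_auc.items()})
```

Because the synthetic outcome is drawn independently of the features, the AUCs printed here should hover around chance; the sketch shows only the mechanics of the pipeline, not the reported results. Fitting the scaler and encoder inside the pipeline keeps the preprocessing statistics confined to each training fold during cross-validation.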