Machine learning and natural language processing for the early detection of potential mental disorders among school-age children: a prospective birth cohort study

Presenting Author

Shanquan Chen

The University of Hong Kong

Submitting Author

Shanquan Chen

Additional Authors

Ting Dang, Mengjie Qian, Quinette Abegail Louw

Abstract

Background This study aimed to assess whether integrating natural language processing (NLP) of children’s essays with traditional risk factors could enhance the early detection of potential mental health disorders compared with models using either data type alone.

Methods We conducted a prospective analysis using data from the UK-based National Child Development Study (NCDS), a national birth cohort initiated in 1958. Data from birth, age 7, and age 11 assessments were analyzed. The final sample included 8,981 children (4,428 [49.3%] female) who completed a creative writing essay at age 11 describing their imagined life at age 25. Predictors comprised traditional risk factors (perinatal, socioeconomic, and parental engagement variables) and linguistic features computationally extracted from the essays. The primary outcome was potential mental health disorder at age 11, defined as scoring above the 95th or 90th percentile on the teacher-completed Bristol Social Adjustment Guide (BSAG). The mother-completed Rutter A Scale was used for sensitivity analysis. Machine learning models incorporating various predictor combinations were developed, and their predictive performance was evaluated using receiver operating characteristic (ROC) curves.
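The outcome definition and evaluation metric described above can be illustrated with a minimal sketch. It assumes a nearest-rank percentile rule and uses hypothetical helper names (`percentile_cutoff`, `label_high_scorers`, `roc_auc`). The abstract does not state how the percentile cutoffs or ROC statistics were actually computed, so this is an illustration rather than the study's implementation.

```python
import math
from typing import List, Sequence


def percentile_cutoff(scores: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest score with at least pct% of values at or below it."""
    ordered = sorted(scores)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def label_high_scorers(scores: Sequence[float], pct: float) -> List[int]:
    """Label 1 for scores strictly above the pct-th percentile cutoff, else 0."""
    cutoff = percentile_cutoff(scores, pct)
    return [1 if s > cutoff else 0 for s in scores]


def roc_auc(labels: Sequence[int], predictions: Sequence[float]) -> float:
    """Area under the ROC curve, computed as the probability that a randomly
    chosen positive case receives a higher prediction than a randomly chosen
    negative case (ties count as one half)."""
    pos = [p for y, p in zip(labels, predictions) if y == 1]
    neg = [p for y, p in zip(labels, predictions) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (made up for illustration, not from the NCDS cohort).
toy_scores = list(range(1, 21))                 # stand-in for teacher-rated scores
toy_labels = label_high_scorers(toy_scores, 90)  # outcome at the 90th percentile
toy_predictions = [s / 20 for s in toy_scores]   # stand-in for model output
toy_auc = roc_auc(toy_labels, toy_predictions)
```

The pairwise formulation is quadratic in sample size but keeps the definition of the statistic visible. A rank-based implementation would be preferable for a cohort of several thousand children.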

Results Using the BSAG 95th percentile threshold, models combining the top five selected variables with essay features achieved significantly higher predictive performance (area under the ROC curve [AUC] 0.77, 95% CI 0.71-0.83) than models using all variables (AUC 0.70, 95% CI 0.63-0.76) or essay features alone (AUC 0.67, 95% CI 0.60-0.74). At the 90th percentile threshold, this integrated approach showed a similar improvement (AUC 0.81, 95% CI 0.78-0.85). Key predictors included gestational length, maternal parity, parental age, residential characteristics, parental engagement metrics, and children’s BMI. Sensitivity analyses using the Rutter A Scale confirmed these findings.
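The 95% confidence intervals reported above accompany each ROC statistic, but the abstract does not say how they were obtained. One common option is a percentile bootstrap, sketched below purely as an assumption. The function names, the number of resamples, and the seed are all hypothetical choices for illustration.

```python
import random
from typing import Sequence, Tuple


def roc_auc(labels: Sequence[int], predictions: Sequence[float]) -> float:
    """Probability that a random positive case outranks a random negative case (ties = 0.5)."""
    pos = [p for y, p in zip(labels, predictions) if y == 1]
    neg = [p for y, p in zip(labels, predictions) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(
    labels: Sequence[int],
    predictions: Sequence[float],
    n_boot: int = 1000,   # assumed number of resamples
    alpha: float = 0.05,  # 1 - alpha is the confidence level
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for the ROC statistic.

    Cases are resampled with replacement; resamples that contain only one
    outcome class are skipped because the statistic is undefined for them.
    """
    if len(set(labels)) < 2:
        raise ValueError("need both outcome classes to bootstrap the AUC")
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue
        stats.append(roc_auc(ys, [predictions[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper
```

Calling `bootstrap_auc_ci(labels, predictions)` on held-out labels and model outputs would return a `(lower, upper)` pair. Other interval methods, such as DeLong's, may give different bounds.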

Conclusions In this prospective birth cohort study, integrating NLP analysis of children’s essays with a small set of key risk factors substantially improved the prediction of potential mental health disorders. This integrated approach represents a potential paradigm for developing scalable, objective screening tools.
