Big Data/Machine Learning/AI
Improving Detection of Adolescent Prescription Opioid Misuse Using Machine Learning with Synthetic Oversampling
Asef Raiyan Hoque*
1) Central Michigan University College of Medicine 2) University of Texas at Austin Department of Statistics and Data Sciences
Background: Prescription opioid misuse among adolescents is a major public health concern. Population-based surveillance systems provide valuable data for developing predictive models to support early identification of youth at risk of opioid misuse. However, supervised machine learning models applied to these datasets often face severe class imbalance, resulting in misleadingly high accuracy and area under the receiver-operating characteristic curve (AUC), driven primarily by correct classification of non-misuse cases. Consequently, model sensitivity is often low, limiting the ability to identify adolescents at risk.
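To illustrate why accuracy can mislead under class imbalance, the sketch below computes the usual metrics from a confusion matrix. The counts are hypothetical and chosen only for illustration; they are not taken from the study data, and the function name `classification_metrics` is an assumption of this example.

```python
def classification_metrics(tp, fn, fp, tn):
    """Compute common metrics from confusion-matrix counts.

    tp/fn: misuse cases correctly/incorrectly classified
    fp/tn: non-misuse cases incorrectly/correctly classified
    """
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # recall on misuse cases
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


# Hypothetical example: 1,000 respondents, 50 of whom report misuse.
# The model finds only 10 of the 50 cases but rarely flags non-cases.
metrics = classification_metrics(tp=10, fn=40, fp=5, tn=945)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these made-up counts, accuracy exceeds 95% even though four out of five misuse cases are missed, which is the pattern the background describes.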
Methods: This study used the CDC’s nationally representative Youth Risk Behavior Surveillance System (YRBSS) data from 2017-2021. We evaluated the performance of eight supervised machine learning models. To address the pronounced outcome imbalance, the Synthetic Minority Over-sampling Technique (SMOTE) was applied to the training data only, to assess whether correcting class imbalance improves minority-class detection without substantially compromising overall performance. Model performance was evaluated using accuracy, sensitivity, specificity, ROC AUC, precision, and F1-score.
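The core idea of SMOTE can be sketched in a few lines: each synthetic minority point is placed on the line segment between a real minority point and one of its nearest minority-class neighbours. The code below is a minimal standard-library illustration, not the study's implementation; the function name `smote`, the toy data, and the parameter values are assumptions. It also assumes continuous features, whereas survey items are often categorical and may call for a variant designed for them.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between minority points.

    minority: list of equal-length numeric tuples (at least two points)
    k: number of nearest minority-class neighbours to choose from
    """
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the k nearest minority neighbours, excluding the base point.
        neighbours = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: math.dist(base, minority[j]),
        )[:k]
        neighbour = minority[rng.choice(neighbours)]
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical toy TRAINING split; the test split is left untouched.
train_majority = [(0.1 * i, 0.2 * i) for i in range(20)]
train_minority = [(5.0, 5.0), (5.5, 4.8), (6.0, 5.2), (5.2, 5.9)]

# Oversample the minority class until both classes are the same size.
n_needed = len(train_majority) - len(train_minority)
synthetic = smote(train_minority, n_needed, k=3)
balanced_minority = train_minority + synthetic
print(len(train_majority), len(balanced_minority))
```

Oversampling only the training split keeps the evaluation data free of synthetic records, so the reported metrics reflect performance on real observations.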
Results: Across all models, application of SMOTE improved sensitivity and F1-score, indicating improved detection of adolescents at risk of prescription opioid misuse. Random Forest and Extreme Gradient Boosting demonstrated the most favorable performance metrics, with the largest gains in sensitivity: from 18.37% to 84.33% and from 21.10% to 87.39%, respectively. These improvements indicate a marked reduction in false-negative classifications. While accuracy declined for some models, ROC AUC and F1-score improved consistently, and specificity remained high.
Conclusion: Application of SMOTE resulted in substantial improvements in sensitivity and F1-score across models, enabling more accurate identification of adolescents at risk of prescription opioid misuse. These gains were achieved with minimal loss in specificity, preserving strong classification of non-misuse cases. Our findings suggest that resampling approaches such as SMOTE can be a practical and effective tool for improving case detection when applying machine learning to imbalanced public health data.
