Cancer
Machine learning classification of HER2 mutations and treatment response using gene expression in breast cancer
Muyiwa Ategbole*, University of South Carolina
Background:
HER2 hotspot mutations are actionable targets in breast cancer, yet transcriptomic signatures that reliably distinguish clinically relevant variants, including V777L, and predict treatment-related phenotypes are not well characterized.
Methods:
We analyzed RNA sequencing data from recount3 (SRP166112), comprising 864 breast cancer cell samples representing multiple HER2 genotypes and drug exposures. After variance-stabilizing transformation and filtering, we used LASSO regularization to identify a sparse panel of 29 informative genes. Machine learning models (random forest, LASSO, elastic net, support vector machine, and gradient boosting) were trained to classify treated versus control samples using a 70/30 stratified train–test split. Model performance was evaluated on the held-out 30% test set using accuracy, sensitivity, specificity, and area under the receiver operating characteristic curve (AUC).
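As an illustration only, the split-and-select step described in the Methods could be sketched as follows. This is not the study's code: it assumes scikit-learn, uses a synthetic matrix from make_classification in place of the transformed expression data, and uses an L1-penalized logistic regression as the LASSO-type classifier. The feature count, penalty strength C, and random seeds are arbitrary assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a samples x genes expression matrix (assumption:
# 864 samples as in the abstract; 500 features is an arbitrary choice).
X, y = make_classification(
    n_samples=864, n_features=500, n_informative=20,
    n_redundant=0, class_sep=2.0, random_state=0,
)

# 70/30 split, stratified on the treated (1) versus control (0) label.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0
)

# L1 (LASSO-type) penalty: shrinks most coefficients exactly to zero,
# leaving a sparse set of features. C=0.1 is a hypothetical setting.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="l1", solver="liblinear", C=0.1),
)
model.fit(X_train, y_train)

# Indices of features with non-zero coefficients (the "selected panel").
coef = model.named_steps["logisticregression"].coef_.ravel()
selected = np.flatnonzero(coef)

# Evaluate only on the held-out 30%.
prob = model.predict_proba(X_test)[:, 1]
accuracy = accuracy_score(y_test, model.predict(X_test))
auc = roc_auc_score(y_test, prob)

print(f"selected features: {selected.size} of {X.shape[1]}")
print(f"test accuracy: {accuracy:.3f}, test AUC: {auc:.3f}")
```

Fitting the scaler and the penalized model inside one pipeline on the training portion keeps the held-out samples out of both steps. The numbers this sketch prints come from synthetic data and have no bearing on the reported results.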
Results:
All models demonstrated strong discrimination between treated and control samples. Test-set accuracy ranged from 98.1% to 99.6% across models, with AUC values near 1.00. The LASSO model achieved the highest accuracy (99.6%), sensitivity (97.6%), and specificity (100%). Random forest, elastic net, support vector machine, and gradient boosting models showed similarly high performance (AUC 0.995–1.000). The LASSO-selected gene panel included biologically relevant predictors such as CYP1A1, SERPINA6, CISH, and CYP1B1, which were strongly associated with treatment exposure.
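For readers less familiar with the reported metrics, the following minimal sketch shows how accuracy, sensitivity, specificity, and AUC are defined for a binary treated-versus-control classifier. The function name binary_metrics, the 0.5 threshold, and the toy labels and scores are hypothetical and are not taken from the study.

```python
def binary_metrics(y_true, scores, threshold=0.5):
    """Return accuracy, sensitivity, specificity, and AUC for binary labels.

    y_true: iterable of 0 (control) or 1 (treated)
    scores: predicted probability of the treated class
    """
    y_true = list(y_true)
    scores = list(scores)
    tp = fp = tn = fn = 0
    for label, score in zip(y_true, scores):
        pred = 1 if score >= threshold else 0
        if label == 1 and pred == 1:
            tp += 1
        elif label == 1 and pred == 0:
            fn += 1
        elif label == 0 and pred == 1:
            fp += 1
        else:
            tn += 1

    # AUC: probability that a random treated sample scores higher than a
    # random control sample, with ties counted as one half.
    pos = [s for l, s in zip(y_true, scores) if l == 1]
    neg = [s for l, s in zip(y_true, scores) if l == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos for n in neg
    )

    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
        "auc": wins / (len(pos) * len(neg)),
    }


# Toy example (made-up values): 3 treated and 5 control samples.
toy_labels = [1, 1, 1, 0, 0, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.1, 0.6, 0.05]
metrics = binary_metrics(toy_labels, toy_scores)
print(metrics)
```

Sensitivity and specificity depend on the chosen threshold, whereas AUC summarizes ranking quality across all thresholds, which is why a model can reach an AUC of 1.000 while its sensitivity at a fixed cutoff is below 100%.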
Conclusions:
A sparse set of gene expression markers enables highly accurate classification of treatment exposure in HER2-mutant breast cancer cell models. These transcriptomic signatures provide a foundation for biomarker discovery and future translational studies aimed at stratifying HER2-mutant tumors by therapeutic response.
