Big Data/Machine Learning/AI
Text Classification Models for Natal Sex Identification from Electronic Health Records of Transgender Population
Qi Zhang*, Yuting Guo, Timothy L. Lash, Abeed Sarker, Michael Goodman
Transgender and gender diverse (TGD) people assigned male or female sex at birth (natal sex) are often referred to as transfeminine (TF) and transmasculine (TM), respectively. TF and TM persons may have health outcomes related to gender affirming therapy. Electronic health records (EHR) allow systematic identification of TGD people in large health systems, but the EHR administrative designation of sex/gender is unreliable for TGD people because it may designate either gender or natal sex. In this study, we developed natural language processing (NLP) models to distinguish TF from TM persons based on free-text clinical notes. The Study of Transition Outcomes and Gender (STRONG) cohort includes TGD members enrolled in Kaiser Permanente healthcare plans from 2006 through 2022. For model development, we used text strings containing relevant keywords for 6,150 members with gold standard labels. Data were divided into training (64%), validation (16%), and test (20%) sets. We first applied support vector machines (SVM), random forests (RF), shallow neural networks, and k-nearest neighbor models. We then applied two deep learning models: a bidirectional long short-term memory network (BiLSTM) and a Transformer (RoBERTa). Models were evaluated on micro F1-score, precision (positive predictive value), and recall (sensitivity). An adaptive validation study, based on manual review of text strings, is ongoing using data on 40,305 new TGD candidates from multiple sites. As a first step, we validate the natal sex of 100 subjects from each site with predicted scores above 0.98 and below 0.02, respectively. Our results show that SVM produced the highest F1-score (0.97; 95% CI: 0.96, 0.98), with a recall of 0.97 and a precision of 0.97. RF yielded comparable performance (F1-score: 0.96; 95% CI: 0.95, 0.97), followed by RoBERTa (F1-score: 0.95; 95% CI: 0.94, 0.97). These NLP models can provide an efficient means of automated EHR-based identification of natal sex in the TGD population, with SVM achieving the best performance.
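The data split and evaluation metrics described above can be illustrated with a minimal Python sketch that uses only the standard library. It is not the study's code: the function names, the random seed, and the percentile bootstrap used for the confidence interval are all assumptions made for illustration, since the abstract does not say how the intervals were computed.

```python
import random


def split_indices(n, seed=0):
    """Shuffle indices 0..n-1 and split them 64/16/20 into
    training, validation, and test sets (hypothetical helper)."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = round(0.20 * n)
    n_val = round(0.16 * n)
    test = idx[:n_test]
    val = idx[n_test:n_test + n_val]
    train = idx[n_test + n_val:]
    return train, val, test


def precision_recall_f1(y_true, y_pred, positive):
    """Precision (PPV), recall (sensitivity), and F1 for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def micro_f1(y_true, y_pred):
    """Micro-averaged F1. When every record has exactly one true label
    and one predicted label, this equals accuracy."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def bootstrap_ci(y_true, y_pred, metric, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for a metric.
    Assumption: the abstract does not state its CI method."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        sample = [rng.randrange(n) for _ in range(n)]
        stats.append(metric([y_true[i] for i in sample],
                            [y_pred[i] for i in sample]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper
```

With 6,150 labeled members, `split_indices(6150)` yields 3,936 training, 984 validation, and 1,230 test indices. The metric functions take parallel lists of gold standard and predicted labels (for example `"TF"` and `"TM"`), so they can be applied to the output of any of the classifiers named above.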