Big Data/Machine Learning/AI

A validated transformer-based model to identify transgender and gender diverse people in electronic health records

Qi Zhang*, Yuting Guo, Mohammed Al-Garadi, Timothy L. Lash, Lee Cromwell, Abeed Sarker, Michael Goodman

Background: Natural language processing (NLP) of free-text information from electronic health records (EHR) holds promise for efficient large-scale identification of hard-to-reach populations, including transgender and gender diverse (TGD) people.

Objective: To develop and validate NLP models for automated creation of a large de-identified TGD cohort from EHR across multiple institutions.

Methods: Free-text EHR data were collected from the Study of Transition, Outcomes, and Gender (STRONG), which includes individuals enrolled at multiple sites of the Kaiser Permanente healthcare network between January 1, 2006, and February 28, 2022. The research protocol comprised two studies: 1) model development and evaluation, using TGD keyword-containing text excerpts pertaining to 11,529 previously confirmed TGD individuals; and 2) assessment of model validity, in which the model was applied to a larger group (n=371,909) of TGD candidates and a stratified random sample was validated by trained reviewers. The validation strata were based on NLP-predicted class, geographic location, and additional evidence (TGD-specific diagnostic codes and/or self-reported gender identity). Model performance was assessed using sensitivity, positive and negative predictive values (PPV and NPV), and F1 score.
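The four performance measures named in the Methods can be computed from a 2x2 confusion table. The following is a minimal sketch under standard textbook definitions, not the study's own code; the `Confusion` class, the `metrics` function, and the counts in the usage lines are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    """Counts from comparing model predictions with reviewer labels (hypothetical container)."""
    tp: int  # predicted TGD, confirmed TGD
    fp: int  # predicted TGD, not confirmed
    fn: int  # predicted non-TGD, confirmed TGD
    tn: int  # predicted non-TGD, not confirmed


def _ratio(numerator: float, denominator: float) -> float:
    # Return NaN instead of raising when a denominator is empty.
    return numerator / denominator if denominator else float("nan")


def metrics(c: Confusion) -> dict:
    """Sensitivity, PPV, NPV, and F1 under their standard definitions."""
    sensitivity = _ratio(c.tp, c.tp + c.fn)
    ppv = _ratio(c.tp, c.tp + c.fp)
    npv = _ratio(c.tn, c.tn + c.fn)
    # F1 is the harmonic mean of PPV (precision) and sensitivity (recall).
    f1 = _ratio(2 * ppv * sensitivity, ppv + sensitivity)
    return {"sensitivity": sensitivity, "ppv": ppv, "npv": npv, "f1": f1}


# Illustrative counts only; these are not data from the study.
example = metrics(Confusion(tp=90, fp=10, fn=5, tn=95))
```

In a stratified validation design such as the one described here, these measures would be computed separately within each stratum, since PPV and NPV depend on how common true TGD status is in the group being evaluated.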

Results: In the first study, the transformer-based RoBERTa model outperformed the other models, achieving an F1 score of 0.95, with a sensitivity of 0.97 and a PPV of 0.94. In the validation study, among participants with TGD evidence beyond keywords, RoBERTa predictions had a high PPV (0.92–1.00) but a low NPV (0.11–0.59). Conversely, among candidates with keywords alone, the model yielded a high NPV (0.98–0.99) but a low PPV (0.20–0.36).
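As a consistency check on the first-study figures, the reported F1 score can be recovered from the reported sensitivity and PPV. This assumes the conventional definition of F1 as the harmonic mean of PPV and sensitivity; the abstract does not state the formula used.

```latex
\[
F_1 \;=\; \frac{2 \cdot \mathrm{PPV} \cdot \mathrm{Sens}}{\mathrm{PPV} + \mathrm{Sens}}
    \;=\; \frac{2 \,(0.94)(0.97)}{0.94 + 0.97}
    \;=\; \frac{1.8236}{1.91}
    \;\approx\; 0.95
\]
```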

Conclusion: NLP models provide an efficient and scalable approach for identifying TGD individuals in EHR. Transformer-based models outperformed other algorithms and showed potential for transportability to external populations when additional TGD evidence is available.