Methods/Statistics

Evaluating methods for imputing race and ethnicity in electronic health record data

Sarah Conderino*, Jasmin Divers, John A. Dodson, Lorna E. Thorpe, Mark G. Weiner, Samrachana Adhikari

Background: Race/ethnicity is often missing for a substantial proportion of patients due to challenges with data collection. Gold standard imputation approaches rely on identifiable information, including names and addresses, that is not readily available in many research databases. Using electronic health record (EHR) data from NYU Langone Health and the INSIGHT Clinical Research Network, we compared methods to assess whether anonymized variables are sufficient for the imputation of race/ethnicity.

Methods: We first conducted simulation analyses under different missing data mechanisms to compare the performance of Bayesian Improved Surname Geocoding (BISG), single imputation with neighborhood majority, random forest imputation, and multiple imputation with chained equations (MICE). Performance was measured against self-reported race/ethnicity using sensitivity, positive predictive value, and overall accuracy, and agreement was measured with Cohen’s kappa (κ). We then applied these methods to impute race/ethnicity in two EHR-based data sources and compared chronic disease burden by race/ethnicity across imputation approaches.
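As an illustration of the performance measures named above, the following is a minimal sketch of how overall accuracy, Cohen's kappa, and per-category sensitivity and positive predictive value could be computed from paired self-reported and imputed labels. The function name `agreement_metrics` and the toy labels "A" and "B" are hypothetical and are not taken from the study; the abstract does not describe the authors' actual code.

```python
from collections import Counter
from math import nan


def agreement_metrics(truth, imputed):
    """Compare imputed labels with self-reported labels (hypothetical helper)."""
    if len(truth) != len(imputed) or not truth:
        raise ValueError("truth and imputed must be non-empty and equal in length")

    n = len(truth)
    labels = sorted(set(truth) | set(imputed))
    pairs = Counter(zip(truth, imputed))   # (self-reported, imputed) cell counts
    true_totals = Counter(truth)           # row totals of the confusion matrix
    pred_totals = Counter(imputed)         # column totals of the confusion matrix

    # Overall accuracy: share of records where the imputed label matches.
    accuracy = sum(pairs[(c, c)] for c in labels) / n

    # Cohen's kappa: observed agreement corrected for chance agreement.
    expected = sum(true_totals[c] * pred_totals[c] for c in labels) / (n * n)
    kappa = (accuracy - expected) / (1 - expected) if expected < 1 else nan

    per_class = {}
    for c in labels:
        tp = pairs[(c, c)]
        per_class[c] = {
            # Sensitivity: correctly imputed share of those self-reporting c.
            "sensitivity": tp / true_totals[c] if true_totals[c] else nan,
            # PPV: share of those imputed as c who self-report c.
            "ppv": tp / pred_totals[c] if pred_totals[c] else nan,
        }

    return {"accuracy": accuracy, "kappa": kappa, "per_class": per_class}


# Toy data for illustration only; not study data.
toy_truth = ["A", "A", "B", "B"]
toy_imputed = ["A", "B", "B", "B"]
toy_result = agreement_metrics(toy_truth, toy_imputed)
print(toy_result)
```

Computing sensitivity and positive predictive value per category, rather than pooled, makes it visible when an imputation method performs well overall but poorly for smaller groups.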

Results: In simulation analyses, non-anonymized BISG imputation provided the most accurate classification of race/ethnicity, with accuracy ranging from 66% to 73% across missing data mechanisms. Anonymized imputation methods were more sensitive to the missing data mechanism, with agreement dropping when race/ethnicity was missing not at random (MNAR) (κ = 0.25 for MICE, 0.25 for single imputation, and 0.33 for random forest). When these methods were applied to the NYU and INSIGHT cohorts, racial/ethnic distributions and chronic disease burden were consistent across all imputation methods.

Conclusions: BISG imputation may provide a more accurate racial/ethnic classification than single or multiple imputation using anonymized covariates, particularly if the missing data mechanism is MNAR. Descriptive studies of disease burden may not be sensitive to methods for imputing missing data.