Abstract Search – Society for Epidemiologic Research

Big Data/Machine Learning/AI

Application of data imputation using a generative adversarial network
Hayden L. Smith*

Background: Missing data are common in medical research. Imputation fills in missing values with substituted estimates, which are traditionally produced by statistical models. Advances in generative machine learning provide additional methods, such as generative adversarial networks (GANs). These networks train two competing models (e.g., neural networks) to learn the underlying data structure and generate synthetic data. Objective: To present a basic process for imputing missing data using a GAN architecture.

Methods: A version of the Framingham Heart Study dataset was used in this example and included age, sex, weight, diastolic BP, systolic BP, cholesterol, cigarettes per day, and mortality. Data were removed at random, and a GAN was fit to create synthetic values for the removed entries using functions from Generative Adversarial Imputation Nets (GAIN) and PyTorch. Next, missing rates, sample sizes, and hyperparameters (i.e., alpha and learning rate) were varied to explore the limits of the approach. Lastly, a mortality classification model was fit with varied amounts of imputed covariate data.
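The evaluation loop described in the Methods (remove values at random, impute, score with RMSE on the removed cells) can be sketched as follows. This is a minimal illustration under stated assumptions, not the author's pipeline: the data are synthetic stand-ins rather than the Framingham data, the column meanings are hypothetical, min-max scaling is assumed, and a column-mean imputer takes the place of the GAN so the example stays dependency-free. All function names are illustrative.

```python
import math
import random

def min_max_scale(rows):
    """Scale each column to [0, 1] so RMSE is comparable across variables."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    spans = [(max(c) - min(c)) or 1.0 for c in cols]
    return [[(v - lo) / sp for v, lo, sp in zip(row, lows, spans)] for row in rows]

def mcar_mask(n_rows, n_cols, miss_rate, rng):
    """Return a mask with 1 = observed and 0 = removed completely at random."""
    return [[0 if rng.random() < miss_rate else 1 for _ in range(n_cols)]
            for _ in range(n_rows)]

def mean_impute(rows, mask):
    """Placeholder imputer (NOT a GAN): fill removed cells with the observed column mean."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [r[j] for r, m in zip(rows, mask) if m[j] == 1]
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[v if m_ij else means[j] for j, (v, m_ij) in enumerate(zip(r, m))]
            for r, m in zip(rows, mask)]

def masked_rmse(truth, imputed, mask):
    """RMSE computed only over the cells that were removed."""
    sq = [(t - p) ** 2
          for tr, pr, mr in zip(truth, imputed, mask)
          for t, p, m in zip(tr, pr, mr) if m == 0]
    return math.sqrt(sum(sq) / len(sq)) if sq else 0.0

rng = random.Random(0)
# Synthetic stand-in data (hypothetical age and systolic BP columns).
data = [[rng.uniform(30, 70), rng.gauss(130, 15)] for _ in range(500)]
scaled = min_max_scale(data)
mask = mcar_mask(len(scaled), 2, miss_rate=0.2, rng=rng)
imputed = mean_impute(scaled, mask)
rmse = masked_rmse(scaled, imputed, mask)
print(f"RMSE on removed cells: {rmse:.3f}")
```

Swapping `mean_impute` for a trained GAN imputer while keeping the same mask and scoring function would allow missing rate, sample size, and hyperparameters to be varied against a fixed baseline.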

Results: The sample included 5,000 complete-case observations. After data were randomly removed from the sample, GAN-based imputation was performed; results and model specifics are presented in the Figure. When sample size was varied from 0.1 to 0.7, root mean square error (RMSE) changed from 0.26 to 0.23. When the missing rate was increased from 0.2 to 0.8, RMSE changed from 0.25 to 0.30. Adjusting alpha from 0 to 15 reduced RMSE from 0.45 to 0.25, and changing the learning rate from 0.1 to 0.01 changed RMSE from 0.28 to 0.24. Lastly, when missingness was increased from 0.0 to 0.9, the area under the receiver operating characteristic curve (AUROC) of the mortality classification model went from 0.66 to 0.58.
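The abstract does not define alpha. Assuming it plays the role it has in the published GAIN formulation, where it weights a reconstruction error on observed cells against the adversarial term in the generator loss, the effect of moving alpha from 0 to 15 can be sketched as below. The function name and the toy inputs are hypothetical, and this is a plain-Python illustration rather than the PyTorch code used in the study.

```python
import math

def gain_style_generator_loss(x, x_hat, mask, d_prob, alpha, eps=1e-8):
    """Generator loss = adversarial term + alpha * reconstruction term.

    x      : true values (only observed cells are used), flat list
    x_hat  : generator output for every cell
    mask   : 1 = observed, 0 = missing
    d_prob : discriminator's probability that each cell is observed
    alpha  : assumed weight on the reconstruction term
    """
    n = len(x)
    # Adversarial term: on missing cells, reward outputs the discriminator takes for observed.
    adversarial = -sum((1 - m) * math.log(d + eps) for m, d in zip(mask, d_prob)) / n
    # Reconstruction term: mean squared error over observed cells only.
    n_obs = sum(mask)
    recon = sum(m * (xi - gi) ** 2 for m, xi, gi in zip(mask, x, x_hat)) / max(n_obs, 1)
    return adversarial + alpha * recon

# Toy inputs (hypothetical): two observed cells followed by two missing cells.
x = [0.2, 0.5, 0.9, 0.4]
x_hat = [0.3, 0.5, 0.7, 0.1]
mask = [1, 1, 0, 0]
d_prob = [0.9, 0.8, 0.5, 0.5]

loss_alpha_0 = gain_style_generator_loss(x, x_hat, mask, d_prob, alpha=0)
loss_alpha_15 = gain_style_generator_loss(x, x_hat, mask, d_prob, alpha=15)
```

Under this assumption, alpha = 0 leaves the generator with no direct incentive to reproduce observed values, which would be consistent with the higher RMSE reported at that setting.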

Conclusions: The presented process can be used to impute missing values with synthetic data and can be repeated to conduct multiple imputation. The process can be computationally intensive. Further analyses should examine GAN utility in different scenarios and include comparisons with other imputation approaches.
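If the imputation is repeated to produce several completed datasets, the per-dataset estimates need to be combined. One standard way to do this is Rubin's rules, sketched below as an assumption about how pooling could be done; the abstract does not state which pooling method, if any, was used. The function name and the example numbers are hypothetical.

```python
from statistics import mean, variance

def pool_rubin(estimates, variances):
    """Pool m point estimates and their variances using Rubin's rules.

    Returns (pooled_estimate, total_variance), where
    total_variance = within + (1 + 1/m) * between.
    """
    m = len(estimates)
    pooled = mean(estimates)          # average of the m point estimates
    within = mean(variances)          # average within-imputation variance
    between = variance(estimates)     # sample variance across imputations (divides by m - 1)
    total = within + (1 + 1 / m) * between
    return pooled, total

# Hypothetical coefficient estimates from five imputed datasets.
estimates = [0.50, 0.54, 0.46, 0.52, 0.48]
variances = [0.01, 0.01, 0.01, 0.01, 0.01]
pooled_estimate, total_variance = pool_rubin(estimates, variances)
```

The between-imputation term is what carries the uncertainty due to missingness into the final standard error, which a single imputed dataset cannot provide.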