Big Data/Machine Learning/AI
Use of generative adversarial networks to create synthetic data
Hayden Smith*
Background: Generative adversarial networks (GANs) consist of two competing models (e.g., neural networks), a generator and a discriminator. The generator creates synthetic data from random noise, while the discriminator attempts to distinguish between real and generated data. The two networks are trained iteratively against each other under a defined value function, so that the generator improves at producing realistic data as the discriminator improves at detecting it. This approach can be used to generate data for teaching, research, and sharing of sensitive information.
Objective: To describe two processes for creating synthetic data using GANs.
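The abstract does not state which value function was used. As an illustration only, the original minimax formulation of the GAN objective is one common choice; here D is the discriminator, G the generator, p_data the distribution of the real data, and p_z the noise distribution:

```latex
\min_G \max_D V(D, G)
  = \mathbb{E}_{x \sim p_{\text{data}}(x)}\left[\log D(x)\right]
  + \mathbb{E}_{z \sim p_z(z)}\left[\log\bigl(1 - D(G(z))\bigr)\right]
```

The discriminator is trained to increase V by assigning high probability to real samples and low probability to generated ones, while the generator is trained to decrease it.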
Methods: Two generative examples were constructed. The first example consisted of simulating data based on a simple function (e.g., Var1 ~ N(0, 1), Var2 = Var1^2 + U(-0.9, 0.9)). This example served to introduce the basic process using a GAN. Next, an available version of the Framingham Heart Study data was used to generate a synthetic dataset using a conditional tabular GAN (CTGAN). Synthetic and real data were then evaluated and contrasted.
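The "real" data for the first example can be simulated directly from the stated function. The following is a minimal sketch using only the Python standard library; the function name, sample size, and seed are assumptions made for illustration and are not taken from the abstract.

```python
import random


def simulate_real_data(n, seed=0):
    """Draw n (Var1, Var2) pairs with Var1 ~ N(0, 1) and
    Var2 = Var1^2 + U(-0.9, 0.9)."""
    rng = random.Random(seed)  # seeded generator for reproducibility (assumed)
    rows = []
    for _ in range(n):
        var1 = rng.gauss(0.0, 1.0)                 # standard normal draw
        var2 = var1 ** 2 + rng.uniform(-0.9, 0.9)  # parabola plus uniform noise
        rows.append((var1, var2))
    return rows


# Hypothetical sample size; the abstract does not report one.
real_data = simulate_real_data(1000)
```

Pairs like these form a noisy parabola, which gives a simple visual target for checking whether generated points fall in the same region as the real ones.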
Results: The top panel of the Figure shows the model-building progress for the simple example, overlaid on the real data. The last pane of the top panel shows the benefit of using dropout to regularize network output. The bottom panel of the Figure shows summary displays of the synthetic data generated in the second example. The conference presentation will also include statistical model comparisons based on synthetic and real data, as well as a basic attempt to use synthetic data to augment a power analysis.
Conclusions: The presented processes can be used to create synthetic data based on supplied data files. Different types of GANs, value functions, and hyperparameter tuning can be explored to attempt to meet project goals. For example, if data appear to be too deterministic or overfit, approaches like dropout can be incorporated. Future work may include exploring GAN applications for power analysis, outcome imbalance, and imputation.
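To make the dropout idea concrete, the sketch below applies inverted dropout, a common variant, to a list of hidden-unit activations. The abstract does not say where or at what rate dropout was applied, so the function name, the rate of 0.5, and the example activations are all assumptions for illustration.

```python
import random


def dropout(activations, rate, rng):
    """Inverted dropout: zero each activation with probability `rate`
    and rescale the survivors by 1 / (1 - rate)."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    keep = 1.0 - rate
    # A unit survives when a uniform draw falls below the keep probability.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)
hidden = [0.5, -1.2, 0.8, 2.0]      # hypothetical hidden-layer activations
dropped = dropout(hidden, 0.5, rng)  # each entry is either 0.0 or doubled
```

Because a different random subset of units is silenced on each pass, the same noise input can map to slightly different outputs, which is one way generated data can become less deterministic.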