Synthetic tabular health data generation: a practical comparison between correlation and model-based statistical approach and the conditional generative adversarial network approach
Synthetic datasets are vital in various areas of health, for example when sharing sensitive human data, protecting patient privacy, and validating prediction model performance with limited sample sizes. Generating synthetic data for these purposes is not new: statistical data simulation approaches were traditionally used before the development of generative adversarial networks. Will statistical methods in this context become less relevant? Which of these two approaches is better when learning from health data? With these questions in mind, we aim to review existing approaches to synthetic tabular health data generation, to compare them empirically on real-world datasets, and ultimately to provide practical guidance on the choice of method. Our empirical study reveals that either technique generates synthetic datasets that closely resemble the structure of the real data and that contribute to evaluating prediction model performance.
Details
Authors/Creators
Yunwei Zhang - Murdoch University, School of Mathematics, Statistics, Chemistry and Physics
Samuel Muller - The University of Sydney
Publication Details
Journal of Statistical Computation and Simulation
Publisher
Informa UK Limited, trading as Taylor & Francis Group.