Abstract
Introduction: High-quality healthcare data play an important role in research across all domains of medicine, including the study of multiple sclerosis (MS). However, medical data are sensitive and subject to stringent privacy regulations. The size of currently available cohorts is also a limiting factor for some complex research questions.
Objectives/Aims: The aim of this study is to generate a representative synthetic longitudinal cohort of patients with MS.
Methods: MSBase was used as the training cohort for this project. The study combined Bayesian networks with domain knowledge to build the generative framework for the synthetic dataset. During the generative process, Bayesian regression models were used to simulate the following synthetic variables, conditional on the structure of the Bayesian network: country, sex, date of MS onset, age at MS onset, age at MS diagnosis, first symptoms (supratentorial, spinal cord, brainstem, optic pathways), MS phenotype, time to visit, age at visit, disease duration, relapse, Expanded Disability Status Scale (EDSS) score, treatment, and magnetic resonance imaging (MRI).
Results: Ten cross-sectional and five longitudinal demographic and clinical variables were generated for 2.8 million synthetic patient records. The variables comprise a mixture of continuous and discrete data types. Baseline characteristics of the simulated records at the first clinical visit include a mean age at MS onset of 32.6 (±11.4) years, a mean age of 34.2 (±11.7) years, a median EDSS of 2 (IQR: 1-4), 95.6% with a relapsing phenotype, and 34.4% treated with any disease-modifying therapy. The simulated follow-up was at least 237 days per patient. The marginal distributions of the synthetic data are comparable to those of the real-world cohort. Differences between corresponding correlation coefficients (r) in the baseline correlation matrices of the training cohort and the simulated data ranged from 0.00 to 0.12.
Conclusion: Bayesian regression models are suitable for generating a realistic longitudinal cohort of patients with MS.
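As a rough illustration of the two ideas summarised above (simulating each variable conditional on its parents in a network, and comparing correlation matrices between cohorts), the following minimal Python sketch may help. It is not the study's model: the four-variable structure, every coefficient and proportion, and the function names `simulate_cohort` and `correlation_difference_range` are hypothetical, and fixed-parameter conditional distributions stand in for the fitted Bayesian regression models.

```python
import numpy as np


def simulate_cohort(n, rng):
    """Ancestral sampling: draw each variable after its parents in a toy DAG.

    Hypothetical structure: sex -> onset age -> diagnosis age -> EDSS.
    All numeric parameters are made up for illustration.
    """
    # Root node: sex (1 = female); the 0.7 proportion is an assumption.
    female = rng.binomial(1, 0.7, size=n)
    # Age at onset depends on sex through a simple linear-Gaussian model.
    onset_age = np.clip(rng.normal(32.0 + 1.0 * female, 11.0, size=n), 0.0, None)
    # Diagnostic delay is non-negative, so diagnosis never precedes onset.
    diagnosis_age = onset_age + rng.gamma(shape=1.5, scale=1.0, size=n)
    # Disability score depends on diagnosis age; rounded to 0.5 steps in [0, 10].
    latent = 1.0 + 0.04 * diagnosis_age + rng.normal(0.0, 1.5, size=n)
    edss = np.clip(np.round(latent * 2) / 2, 0.0, 10.0)
    return np.column_stack([female, onset_age, diagnosis_age, edss])


def correlation_difference_range(a, b):
    """Smallest and largest absolute difference between off-diagonal Pearson r."""
    diff = np.abs(np.corrcoef(a, rowvar=False) - np.corrcoef(b, rowvar=False))
    off_diag = diff[np.triu_indices_from(diff, k=1)]
    return off_diag.min(), off_diag.max()


rng = np.random.default_rng(0)
training_like = simulate_cohort(5000, rng)  # stand-in for a training cohort
synthetic = simulate_cohort(5000, rng)      # independent draw from the same model
low, high = correlation_difference_range(training_like, synthetic)
print(f"range of |r_train - r_synth|: {low:.2f} to {high:.2f}")
```

Because both cohorts in this sketch come from the same toy model, the reported range only reflects sampling noise; with real training data, the first array would be replaced by the observed baseline variables.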