Abstract
Background: High-quality healthcare data play an important role in research across all domains of medicine, including the study of multiple sclerosis (MS). However, medical data are sensitive and subject to stringent privacy regulations. The size of presently available cohorts is also a limiting factor for some complex research questions.
Objective: The aim of this study is to generate a representative synthetic longitudinal cohort of patients with MS.
Methods: MSBase was used as the training cohort for this project. Bayesian networks and Bayesian regression models were used to model and generate the following synthetic patient variables: country, sex, date of MS onset, age at MS onset, age at MS diagnosis, first symptoms (supratentorial, spinal cord, brainstem, optic pathways), MS phenotype, time to visit, age at visit, pregnancy, disease duration, relapse, Expanded Disability Status Scale (EDSS) score, treatment, magnetic resonance imaging (MRI), and cerebrospinal fluid oligoclonal bands (CSF OCB).
Results: Baseline characteristics of the simulated records at the first clinical visit include a mean age at MS onset of 32.6 years (±11.4), a median EDSS score of 2 (IQR: 1-4), 95.6% with a relapsing phenotype, and 34.4% treated with any disease-modifying therapy. The simulated follow-up was at least 237 days per patient. The marginal distributions and the r coefficients of the correlation matrices of the synthetic data at baseline are comparable to those of the real-world cohort.
Conclusion: Bayesian networks and Bayesian regression models are suitable for generating a realistic longitudinal cohort of patients with MS.
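
Illustration: the general idea of generating synthetic patients from a Bayesian network can be sketched with ancestral sampling, in which each variable is drawn after its parents. The following is a minimal sketch only, not the study's model: the three-node structure (sex, then phenotype, then treatment), every probability, and the names `draw`, `sample_patient`, and `sample_cohort` are hypothetical and do not come from MSBase.

```python
import random

# Hypothetical toy network: sex -> phenotype -> treated.
# All probabilities are invented for illustration; none come from MSBase.
P_SEX = {"female": 0.7, "male": 0.3}
P_PHENOTYPE_GIVEN_SEX = {
    "female": {"relapsing": 0.95, "progressive": 0.05},
    "male": {"relapsing": 0.90, "progressive": 0.10},
}
P_TREATED_GIVEN_PHENOTYPE = {
    "relapsing": {True: 0.4, False: 0.6},
    "progressive": {True: 0.2, False: 0.8},
}


def draw(dist, rng):
    """Draw one value from a {value: probability} table."""
    values = list(dist)
    weights = [dist[v] for v in values]
    return rng.choices(values, weights=weights, k=1)[0]


def sample_patient(rng):
    """Ancestral sampling: draw each node only after its parent is known."""
    sex = draw(P_SEX, rng)
    phenotype = draw(P_PHENOTYPE_GIVEN_SEX[sex], rng)
    treated = draw(P_TREATED_GIVEN_PHENOTYPE[phenotype], rng)
    return {"sex": sex, "phenotype": phenotype, "treated": treated}


def sample_cohort(n, seed=0):
    """Generate n synthetic baseline records with a fixed seed."""
    rng = random.Random(seed)
    return [sample_patient(rng) for _ in range(n)]


cohort = sample_cohort(10_000)
# Marginal share of relapsing phenotype in the synthetic cohort.
share_relapsing = sum(p["phenotype"] == "relapsing" for p in cohort) / len(cohort)
```

In this toy setup the implied marginal probability of a relapsing phenotype is 0.7 × 0.95 + 0.3 × 0.90 = 0.935, so `share_relapsing` should fall close to that value. Comparing such synthetic marginals against the training data is the same kind of check the Results describe for the baseline distributions.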