Abstract
Background: Data scarcity and privacy concerns impede research that requires large datasets in neurology. Synthetic data hold the promise of facilitating research that requires substantial analytical power. Longitudinal data in multiple sclerosis (MS) pose a particular challenge because of the high complexity of this neurological condition.
Objective: This study presents a dual generative framework for producing synthetic MS data and assesses both the utility of the generated data in predictive modelling and their credibility.
Methods: We used MSBase data to train two models: an autoencoder for cross-sectional data, trained on clinico-demographic information from 77,215 patients, and a Long Short-Term Memory (LSTM) model for longitudinal data, trained on 850,000 patient sequences. The autoencoder generated 13 cross-sectional variables, and the LSTM generated time to visit, Expanded Disability Status Scale (EDSS) score, relapses, treatment, treatment change, and MRI. The combined process was used to generate 2.8 million synthetic patient records.
Results: The simulated cohort had a mean age at onset of 31.6 years (SD: 10.5), a mean disease duration of 6.8 years, a mean EDSS of 2.95 (SD: 2.06), a female prevalence of 70.5%, and a mean follow-up duration of 8.9 years (95% CI: 8.7–9.1); 22.2% of patients were on high-efficacy therapies for 44.1% of the total follow-up. All of these values were comparable to the original dataset. Spearman correlation analysis confirmed that inter-variable relationships were preserved in the simulated data, with coefficients (r = 0.2–0.9) consistent with those in the real data.
Conclusion: A dual autoencoder-LSTM approach is suitable for generating cross-sectional and longitudinal synthetic data in multiple sclerosis. This solution has the potential to augment research requiring large and representative clinical datasets.
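The Spearman comparison mentioned in the Results can be illustrated with a minimal sketch. It is not the study's analysis code: the function names (`rankdata`, `spearman_matrix`, `max_abs_difference`) and the toy arrays are hypothetical, and the sketch assumes that the real and synthetic cohorts are available as numeric arrays with one column per variable. It computes a Spearman correlation matrix for each cohort and reports the largest absolute difference between corresponding coefficients.

```python
import numpy as np


def rankdata(values):
    """Return 1-based ranks; tied values share their average rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    # Replace the ranks of tied values with their mean rank.
    for v in np.unique(values):
        mask = values == v
        ranks[mask] = ranks[mask].mean()
    return ranks


def spearman_matrix(data):
    """Spearman correlation matrix for the columns of a 2-D array.

    Spearman's coefficient is the Pearson correlation of the ranks.
    """
    data = np.asarray(data, dtype=float)
    ranked = np.column_stack([rankdata(col) for col in data.T])
    return np.corrcoef(ranked, rowvar=False)


def max_abs_difference(real, synthetic):
    """Largest absolute gap between real and synthetic coefficients."""
    diff = np.abs(spearman_matrix(real) - spearman_matrix(synthetic))
    return float(diff.max())


if __name__ == "__main__":
    # Toy stand-ins for two cohorts (hypothetical, not MSBase data):
    # two correlated columns drawn with different random seeds.
    def toy_cohort(seed, n=500):
        rng = np.random.default_rng(seed)
        x = rng.normal(size=n)
        y = 0.6 * x + rng.normal(scale=0.8, size=n)
        return np.column_stack([x, y])

    real, synthetic = toy_cohort(0), toy_cohort(1)
    print("max |rho_real - rho_synth|:", max_abs_difference(real, synthetic))
```

A small maximum difference would indicate that the synthetic cohort reproduces the rank-based relationships between variables; what counts as acceptably small is a study-specific choice that this sketch does not prescribe.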