Machine Learning-Generated Longitudinal Synthetic International Data in Multiple Sclerosis
Conference proceeding   Peer reviewed

Hassam Iqbal, Zhe Qiang, Sifat Sharmin, Gareth Ball, Aida Brankovic, Allan G Kermode, Marzena Pedrini, William Carroll, Katherine Buzzard, Olga Skibina, …
Multiple sclerosis, Vol.32(1_suppl), P.18
MS Australia: 10th Progress in MS Research Conference 2025 (Sofitel Brisbane Central, Queensland, 03/12/2025–05/12/2025)
12/2025

Abstract

EBV Ms T-cells Single-cell transcriptomics
Background: Data scarcity and privacy concerns impede research requiring large datasets in neurology. Synthetic data holds the promise of facilitating research that requires significant analytical power. Longitudinal data in MS represents a significant challenge because of the high complexity of this neurological condition.

Objective: This study presents a dual generative framework for producing synthetic MS data and assesses the utility of the generated data in predictive modelling, as well as its credibility.

Methods: We used the MSBase data to train two models: an autoencoder for cross-sectional data, which used clinico-demographic information from 77,215 patients, and a Long Short-Term Memory (LSTM) model for longitudinal data, which was trained on 850,000 patient sequences. The autoencoder generated 13 cross-sectional variables, and the LSTM generated time to visit, EDSS, relapses, treatment, treatment change, and MRI. The process was used to generate 2.8 million synthetic patient records.

Results: The simulated cohort had a mean age at onset of 31.6 years (SD: 10.5), mean disease duration of 6.8 years, mean EDSS of 2.95 (SD: 2.06), female prevalence of 70.5%, mean follow-up duration of 8.9 years (95% CI: 8.7–9.1), and 22.2% of patients on high-efficacy therapies for 44.1% of the total follow-up, all comparable to the original dataset. Spearman analysis confirmed that inter-variable relationships were preserved in the simulated data, with coefficients (r = 0.2–0.9) consistent with those in the real data.

Conclusion: The dual autoencoder-LSTM approach is suitable for generating cross-sectional and longitudinal synthetic data in multiple sclerosis. This solution has the potential to augment research requiring large and representative clinical datasets.
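
The abstract does not describe how the Spearman comparison was implemented. The following is a minimal sketch of one way such a check could be run, assuming two variables (disease duration and EDSS) and comparing their rank correlation in a "real" and a "synthetic" cohort. The function names, the toy data generator, and all numeric parameters are hypothetical illustrations, not the authors' code or MSBase data.

```python
import random
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


def toy_cohort(n, seed):
    """Hypothetical toy data: EDSS loosely increasing with disease duration."""
    rng = random.Random(seed)
    duration = [rng.uniform(0, 20) for _ in range(n)]
    edss = [min(10.0, max(0.0, 1.0 + 0.2 * d + rng.gauss(0, 1.5)))
            for d in duration]
    return duration, edss


# Stand-ins for the real and synthetic cohorts (both simulated here).
real_duration, real_edss = toy_cohort(500, seed=1)
synth_duration, synth_edss = toy_cohort(500, seed=2)

rho_real = spearman(real_duration, real_edss)
rho_synth = spearman(synth_duration, synth_edss)
gap = abs(rho_real - rho_synth)
print(f"real rho={rho_real:.2f}, synthetic rho={rho_synth:.2f}, gap={gap:.2f}")
```

In practice the same comparison would be repeated over every pair of variables, and a small gap between the real and synthetic coefficients for each pair would suggest that the relationship is preserved.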
