Machine Learning-Generated Longitudinal Synthetic International Data in Multiple Sclerosis

Allan G Kermode; Marzena Pedrini

doi:10.1177/13524585251358343

Introduction: Data scarcity and privacy concerns impede research requiring large datasets in neurology. Synthetic data holds the promise of facilitating research that requires significant analytical power. Longitudinal data in MS represents a significant challenge because of the high complexity of this neurological condition. Objectives/Aims: This study presents a dual generative framework to produce synthetic MS data and assesses its utility in predictive modelling and the credibility of the generated data. Methods: We used the MSBase data to train two models: an autoencoder for cross-sectional data, which used clinico-demographic information from 77,215 patients, and a Long Short-Term Memory (LSTM) model for longitudinal data, which was trained on 261,939 patient sequences. The process was used to generate 2.8 million synthetic patient records. Results: Synthetic cross-sectional data generated by the autoencoder showed a mean age at onset of 31.6 years (SD: 10.5), a mean disease duration of 6.89 years, and a female prevalence of 70.54%, closely matching the real dataset. A Kolmogorov-Smirnov test for age at onset showed no significant difference (p=0.85, statistic=0.0022). The synthetic EDSS distribution (mean: 2.95, SD: 2.06) similarly approximated the real data. Spearman correlation analysis revealed preserved intra-variable relationships, with synthetic correlation coefficients (r=0.248–0.329) aligning well with those of the real dataset. LSTM-generated longitudinal data exhibited a mean follow-up duration of 8.9 years (95% CI: 8.7–9.1), an annualized relapse rate of 0.19, and ocrelizumab accounting for 13.2% of treatments across all visits in the synthetic predictions, closely matching the real data and its treatment patterns. Spearman correlation analysis again confirmed preserved intra-variable relationships in the synthetic data, with correlation coefficients (r=0.28–0.95) consistent with those observed in the real data. Conclusion: The dual autoencoder-LSTM approach is suitable for generating cross-sectional and longitudinal synthetic data in multiple sclerosis. This solution has the potential to augment research requiring large and representative clinical datasets.
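The Kolmogorov-Smirnov comparison reported in the Results can be illustrated with a minimal sketch. It is not the authors' code and does not use MSBase data: the function name `ks_statistic`, the sample sizes, and the random seed are assumptions made for illustration, and the two toy samples are simply drawn from a normal distribution with the mean and SD reported above for synthetic age at onset. The sketch computes only the test statistic (the largest gap between the two empirical distribution functions), not the p-value.

```python
import bisect
import random


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    max_gap = 0.0
    # Both empirical CDFs are step functions, so the largest gap
    # is reached at one of the observed values.
    for x in a + b:
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return max_gap


# Toy data only (hypothetical): both samples come from the same normal
# distribution, using the mean and SD reported for synthetic age at onset.
rng = random.Random(0)
real_like = [rng.gauss(31.6, 10.5) for _ in range(2000)]
synthetic_like = [rng.gauss(31.6, 10.5) for _ in range(2000)]

d_same = ks_statistic(real_like, synthetic_like)
d_shifted = ks_statistic(real_like, [x + 10.0 for x in synthetic_like])
print(f"KS statistic, same distribution: {d_same:.4f}")
print(f"KS statistic, shifted by 10 years: {d_shifted:.4f}")
```

A small statistic, such as the 0.0022 reported above, indicates that the two empirical distributions nearly coincide; shifting one toy sample by 10 years produces a much larger value.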
