Navigating synthetic data: developing a utility evaluation framework
Published 29 May 2024
Digital health intern at Beamtree | Statistics and biochemistry and molecular biology student at University of Sydney
LinkedIn: Sadiq Dohadwalla
Electronic health record (EHR) data has the potential to transform healthcare. In an age of machine learning (ML) and AI, EHR data could catalyse a technological revolution in the sector. However, ethical, legal and resource constraints make accessing high-quality data nearly impossible. To overcome these issues, synthetic data has taken centre stage, offering a myriad of opportunities and solutions. But generating synthetic data is only half the challenge. The other half, and the less discussed one, is evaluating the synthetic data produced by different generative models. In particular, the evaluation of utility, meaning the synthetic data’s usability in downstream applications, is underdeveloped.
Through a literature review, we developed a utility evaluation framework (UEF) and identified representative studies to apply it to. We recognised that machine learning and traditional statistical modelling are equally important and should be integrated to assess the utility of synthetic EHR data holistically. For the machine learning aspect, we encourage augmenting the real data with synthetic data, and vice versa, to improve the robustness of the chosen ML models. Because the literature shows an alarming absence of tests for whether the temporal structure of synthetic EHR data is maintained, we also emphasise the use of longitudinal models to check that the temporality of the data is preserved.
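As a rough illustration of the augmentation idea, the toy sketch below trains the same model on real data only, synthetic data only, and the two combined, then scores each on held-out real data. Everything in it is an assumption made for illustration and is not part of the UEF itself: the data is simulated rather than drawn from any EHR, the "synthetic" set is simply a slightly shifted copy of the real distribution, the nearest-centroid classifier is a stand-in for whichever ML model is chosen, and the names make_data, fit_centroids and accuracy are hypothetical.

```python
import random
from statistics import mean


def make_data(n, shift, seed):
    """Toy two-class data: one feature; class 1 is centred at `shift`, class 0 at 0."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        label = rng.randint(0, 1)
        rows.append((rng.gauss(label * shift, 1.0), label))
    return rows


def fit_centroids(rows):
    """Nearest-centroid 'model': the mean feature value of each class."""
    return {c: mean(x for x, y in rows if y == c) for c in (0, 1)}


def accuracy(centroids, rows):
    """Share of rows whose nearest class centroid matches the true label."""
    hits = sum(
        min(centroids, key=lambda c: abs(x - centroids[c])) == y
        for x, y in rows
    )
    return hits / len(rows)


# Simulated stand-ins (assumptions): "real" data and an imperfect synthetic copy.
real_train = make_data(200, shift=2.0, seed=0)
real_test = make_data(200, shift=2.0, seed=1)
synthetic = make_data(200, shift=1.6, seed=2)

training_sets = {
    "real only": real_train,
    "synthetic only": synthetic,
    "real + synthetic": real_train + synthetic,
}

# Every model is evaluated on the same held-out real data.
scores = {
    name: accuracy(fit_centroids(rows), real_test)
    for name, rows in training_sets.items()
}

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

Comparing the three scores on the same real test set gives one crude view of utility: if the synthetic-only or augmented models fall well below the real-only baseline, the synthetic data is probably not preserving the signal the downstream task depends on.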
As researchers strive to harness the power of EHR data for disease phenotyping, predictive modelling, personalised medicine and clinical decision support, ensuring the reliability and applicability of synthetic data becomes paramount. As synthetic data goes from the “new, shiny, and exciting thing” to the norm, thinking more carefully about what counts as “good” synthetic data will be crucial for rigorous research. By providing a structured framework for evaluating synthetic data, our work begins to lay a foundation for responsible and impactful healthcare and digital health research.