Cedars-Sinai says it can generate privacy-preserving, patient-like synthetic datasets in about an hour, with the aim of reducing research access bottlenecks and speeding AI/ML experimentation. The move also draws a clear line between de-identified data and synthetic data, positioning the latter as a privacy-enhancing alternative for broader internal and external collaboration.
Cedars-Sinai adopts Syntho to speed AI/ML research with synthetic clinical data
Cedars-Sinai is adopting a synthetic data platform as part of its push to expand AI and machine learning capabilities for research and clinical initiatives. The health system says synthetic datasets that replicate real-world patient patterns can be generated quickly—often in about an hour—so teams can work with “patient-like” data without waiting through the typical access and approval cycle for real patient records.
The effort is being implemented with Syntho, an Amsterdam-based company that participated in the Cedars-Sinai Accelerator program in 2022 and focuses on AI-based privacy-enhancing technology for anonymous synthetic data generation. Cedars-Sinai also ties the synthetic data work to its newly launched Digital Innovation Platform, which aims to develop solutions with staff, investors, and venture-builder Redesign Health. Research leadership is anchored in the Department of Computational Biomedicine, with Jason Moore (chair) and Nicholas Tatonetti (vice chair) leading the synthetic data research efforts. Cedars-Sinai CIO Craig Kwiatkowski framed the initiative as part of adopting “cutting-edge technologies” to advance research and patient care.
Operationally, Cedars-Sinai positions synthetic data as distinct from de-identified data: de-identified data is real patient data with identifiers removed, while synthetic data is fully artificial data generated to preserve relationships and patterns from the original. The organization argues this reduces the risk of re-identification while still supporting modeling, collaboration, and training. Tatonetti described three goals: lowering barriers to clinical research by reducing lengthy approvals, shortening the time it takes to start and stop studies so hypotheses can be tested faster, and providing students and trainees with realistic datasets for learning and analysis. Moore also flags limitations: synthetic data may not handle all data types well, and he cites discrete genetic data as an example. That caveat underscores the need to set expectations and define appropriate use cases.
- Faster iteration loops for ML teams: If “about an hour” generation holds for common tables, teams can prototype features, pipelines, and model approaches earlier—before getting access to sensitive source data.
- A practical privacy engineering lever beyond de-identification: Cedars-Sinai’s framing highlights a shift from “remove identifiers” to “avoid using real records,” which can lower re-identification risk and simplify some sharing scenarios.
- Governance still matters—just differently: Synthetic data can reduce access bottlenecks, but organizations still need validation against real distributions, clear labeling, and guardrails for where synthetic data is unsuitable (e.g., certain genetic modalities).
- Enablement and training become first-class use cases: Providing realistic datasets for students/trainees is a reminder that synthetic data ROI often shows up in onboarding, education, and cross-team alignment—not only in production modeling.
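The "validation against real distributions" point above can be made concrete with a small fidelity check. The following is a minimal sketch, not Cedars-Sinai's or Syntho's actual validation method: it computes a two-sample Kolmogorov-Smirnov statistic (the largest gap between two empirical distributions) for one numeric column. The function name `ks_statistic`, the "age" column, and all values are hypothetical, and the data is simulated purely for illustration.

```python
import random
from bisect import bisect_right


def ks_statistic(real, synthetic):
    """Two-sample Kolmogorov-Smirnov statistic.

    Returns the largest absolute gap between the empirical CDFs of the
    two samples: 0.0 means identical distributions, 1.0 means no overlap.
    """
    real_sorted = sorted(real)
    synth_sorted = sorted(synthetic)
    n, m = len(real_sorted), len(synth_sorted)
    max_gap = 0.0
    # The largest gap always occurs at one of the observed values.
    for x in real_sorted + synth_sorted:
        cdf_real = bisect_right(real_sorted, x) / n
        cdf_synth = bisect_right(synth_sorted, x) / m
        max_gap = max(max_gap, abs(cdf_real - cdf_synth))
    return max_gap


# Simulated stand-ins for a numeric column such as patient age (hypothetical).
rng = random.Random(0)
real_ages = [rng.gauss(60, 15) for _ in range(2000)]
faithful_synth = [rng.gauss(60, 15) for _ in range(2000)]  # same distribution
drifted_synth = [rng.gauss(45, 15) for _ in range(2000)]   # shifted mean

faithful_gap = ks_statistic(real_ages, faithful_synth)
drifted_gap = ks_statistic(real_ages, drifted_synth)
print(f"faithful synthetic column: KS = {faithful_gap:.3f}")
print(f"drifted synthetic column:  KS = {drifted_gap:.3f}")
```

A check like this covers only one column at a time. It says nothing about cross-column relationships or re-identification risk, so it would be one guardrail among several rather than a complete validation.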
“The use of synthetic data at Cedars-Sinai reflects our pursuit of cutting-edge technologies to advance medical research and improve patient care.” — Craig Kwiatkowski, PharmD, SVP and CIO, Cedars-Sinai
