European I3LUNG Project Validates Multimodal Synthetic Data for Lung Cancer Research

The EU-funded I3LUNG project reports validation results for multimodal synthetic data in non-small cell lung cancer (NSCLC), showing synthetic cohorts can match real patient distributions while reducing dependence on identifiable patient data. The work points to a practical blueprint for cross-border research and model development under tighter health data governance.

I3LUNG shows multimodal synthetic NSCLC cohorts can match real distributions

The European I3LUNG project (reported Nov. 10, 2025) validated a pipeline for generating multimodal synthetic data for lung cancer research, aiming to support analysis and model development without requiring broad access to real patient records. The project focused on NSCLC and generated synthetic patient cohorts designed to closely mirror real-world distributions.

According to the report, I3LUNG combined a cross-modal autoencoder to integrate clinical variables (including PD-L1 expression and smoking status) with pathology imagery, then used Gaussian copula sampling in a joint latent space to produce synthetic patient records. A HistoXGAN model generated corresponding histology images. The evaluation referenced data from 1,813 NSCLC patients and reported that the resulting synthetic cohorts matched the original distributions and were usable for downstream statistical work, including Cox proportional hazards modeling.
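To make the sampling step concrete, here is a minimal sketch of Gaussian copula sampling over latent vectors, written with the Python standard library only. It is an illustration under stated assumptions, not the project's implementation: it assumes the cross-modal autoencoder has already encoded each patient into a numeric latent vector, and every function name below (such as `sample_gaussian_copula`) is hypothetical. Decoding sampled latents back into clinical variables or images is out of scope here.

```python
import random
from statistics import NormalDist

_STD = NormalDist()  # standard normal, used for cdf / inverse cdf


def to_normal_scores(column):
    """Rank-transform one latent dimension to standard-normal scores."""
    n = len(column)
    order = sorted(range(n), key=lambda i: column[i])
    scores = [0.0] * n
    for rank, i in enumerate(order):
        # (rank + 0.5) / n keeps the probability strictly inside (0, 1)
        scores[i] = _STD.inv_cdf((rank + 0.5) / n)
    return scores


def correlation_matrix(cols):
    """Pearson correlation between columns (each a list of floats)."""
    n = len(cols[0])
    centered = [[x - sum(c) / n for x in c] for c in cols]
    norms = [sum(x * x for x in c) ** 0.5 for c in centered]
    d = len(cols)
    return [[sum(a * b for a, b in zip(centered[i], centered[j]))
             / (norms[i] * norms[j]) for j in range(d)] for i in range(d)]


def cholesky(m):
    """Lower-triangular Cholesky factor; assumes m is positive definite."""
    d = len(m)
    low = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = ((m[i][i] - s) ** 0.5 if i == j
                         else (m[i][j] - s) / low[j][j])
    return low


def empirical_quantile(sorted_col, u):
    """Linear-interpolated quantile of the observed values at level u."""
    pos = u * (len(sorted_col) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_col) - 1)
    frac = pos - lo
    return sorted_col[lo] * (1 - frac) + sorted_col[hi] * frac


def sample_gaussian_copula(rows, n_samples, seed=0):
    """Draw synthetic latent vectors that keep each dimension's empirical
    marginal and the rank-based dependence between dimensions."""
    rng = random.Random(seed)
    cols = [list(c) for c in zip(*rows)]
    chol = cholesky(correlation_matrix([to_normal_scores(c) for c in cols]))
    sorted_cols = [sorted(c) for c in cols]
    d = len(cols)
    samples = []
    for _ in range(n_samples):
        noise = [rng.gauss(0, 1) for _ in range(d)]
        # Correlated normals: z = L @ noise
        z = [sum(chol[i][k] * noise[k] for k in range(i + 1)) for i in range(d)]
        # Map each normal score back through the empirical marginal
        samples.append([empirical_quantile(sorted_cols[i], _STD.cdf(z[i]))
                        for i in range(d)])
    return samples
```

Calling `sample_gaussian_copula(latents, 1000)` on a list of latent vectors returns 1,000 new vectors whose values stay within the observed range of each dimension. The sketch assumes continuous latents with few ties and no perfectly collinear dimensions; a production pipeline would more likely rely on an established numerical library.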

Faster prototyping with fewer governance blockers: Data and ML teams can stand up baselines, run feature engineering, and validate modeling pipelines on synthetic cohorts that preserve key distributions, all before requesting access to protected health information (PHI).
Multimodal matters for clinical AI: Many clinical use cases depend on aligning structured variables with images. A validated approach that generates both modalities together helps lift the “single-table synthetic data” ceiling that often limits utility.
Cross-border collaboration becomes more realistic: If institutions can exchange high-fidelity synthetic cohorts instead of raw records, multi-site studies face less friction, and the synthetic cohorts can complement federated learning setups while staying within local governance constraints.
Privacy engineering still needs proof, not promises: Utility validation is only half the bar. Teams operationalizing this approach will still need measurable privacy risk assessments, clear release criteria, and monitoring for leakage or re-identification risk when synthetic data is shared externally.
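One commonly used leakage check is distance to closest record (DCR): if synthetic records sit much closer to the training records than genuinely unseen real records do, the generator may be memorizing patients. The sketch below is an assumption-laden illustration, not a method attributed to I3LUNG, whose report as summarized here does not name a privacy metric. It assumes records are numeric feature vectors already scaled to comparable ranges, and `dcr_report` is a hypothetical helper name.

```python
import math
from statistics import median


def nearest_distance(point, others):
    """Euclidean distance from point to its closest record in others."""
    return min(math.dist(point, other) for other in others)


def dcr_report(train, synthetic, holdout, exact_tol=1e-9):
    """Compare how close synthetic and held-out real records sit to the
    training records. A synthetic median far below the holdout median,
    or any exact copies, is a signal to investigate before release."""
    syn = [nearest_distance(s, train) for s in synthetic]
    hold = [nearest_distance(h, train) for h in holdout]
    return {
        "exact_copies": sum(d <= exact_tol for d in syn),
        "median_dcr_synthetic": median(syn),
        "median_dcr_holdout": median(hold),
    }
```

A release criterion could then be expressed as a threshold on these numbers, for example blocking any export with a nonzero `exact_copies` count. DCR is only one signal and does not replace a fuller re-identification risk assessment.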

Daily Brief · Jul 17, 2026 · 2 min