RECORDTWIN: Towards Creating Safe Synthetic Clinical Corpora

Seiji Shimizu

Ibrahim Baroud

Lisa Raithel

Shuntaro Yada

Shoko Wakamiya

Eiji Aramaki

July 26, 2025

The scarcity of publicly available clinical corpora hinders developing and applying NLP tools in clinical research. While existing work tackles this issue by utilizing generative models to create high-quality synthetic corpora, their methods require learning from the original in-hospital clinical documents, turning them unfeasible in practice. To address this problem, we introduce RECORDTWIN, a novel synthetic corpus creation method designed to generate synthetic documents from anonymized clinical entities. In this method, we first extract and anonymize entities from in-hospital documents to ensure the information contained in the synthetic corpus is restricted. Then, we use a large language model to fill the context between anonymized entities. To do so, we use a small, privacy-preserving subset of the original documents to mimic their formatting and writing style. This approach only requires anonymized entities and a small subset of original documents in the generation process, making it more feasible in practice. To evaluate the synthetic corpus created with our method, we conduct a proof-of-concept study using a publicly available clinical database. Our results demonstrate that the synthetic corpus has a utility comparable to the original data and a safety advantage over baselines, highlighting the potential of RECORDTWIN for privacy-preserving synthetic corpus creation.
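
The two-step idea in the abstract (anonymize extracted entities, then have a language model fill in the text between them, guided by a few style examples) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Entity` class, the date-masking rule, the prompt wording, and the function names are all hypothetical, and the call to a language model is deliberately left out, so the sketch only shows how a prompt might be assembled from anonymized entities.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """A clinical entity (hypothetical structure, not taken from the paper)."""
    text: str   # surface form of the entity
    label: str  # illustrative entity type, e.g. "EVENT" or "DRUG"


# Toy anonymization rule (an assumption): mask ISO-style dates only.
DATE_PATTERN = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def anonymize(entity: Entity) -> Entity:
    """Return a copy of the entity with date-like substrings masked."""
    return Entity(DATE_PATTERN.sub("[DATE]", entity.text), entity.label)


def build_prompt(entities: list[Entity], style_examples: list[str]) -> str:
    """Assemble a prompt from anonymized entities and a few style examples.

    The prompt wording is illustrative; the resulting string would be sent
    to a language model of the reader's choice.
    """
    entity_lines = "\n".join(f"- {e.label}: {e.text}" for e in entities)
    example_block = "\n\n".join(style_examples)
    return (
        "Write a clinical note that mentions the entities below, in order.\n"
        "Imitate the formatting and writing style of the example notes.\n\n"
        f"Example notes:\n{example_block}\n\n"
        f"Entities:\n{entity_lines}\n"
    )


if __name__ == "__main__":
    entities = [
        anonymize(Entity("admitted 2020-01-05", "EVENT")),
        anonymize(Entity("aspirin", "DRUG")),
    ]
    print(build_prompt(entities, ["Pt stable. Continue current meds."]))
```

In this sketch the language model would only ever see the anonymized entity list and the small set of style examples, which mirrors the restriction described in the abstract; real anonymization would need far more than a single date pattern.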

https://aclanthology.org/2025.findings-acl.759.pdf

BIFOLD AUTHORS

Lisa Raithel