Synthetic Data and Habsburg AI

Synthetic data is attracting growing interest within healthcare AI. Proponents argue that it solves two problems at once: it provides more data to train medical AI, and it does so without compromising patient privacy protections. By generating artificial patient records that mimic real-world populations, researchers can train models without exposing sensitive information.

On paper, it sounds like we can have our cake and eat it too: generating more of a rare and sensitive resource without needing to worry about privacy. But some researchers are beginning to ask an uncomfortable question: what happens when an AI is trained on AI-generated data?

The technical term for this concern is “model collapse”, though many refer to it more informally as “AI inbreeding”.

This phenomenon occurs when AI systems are repeatedly trained on synthetic outputs rather than genuine real-world data. Over time, models begin to lose important information, particularly rare events and outliers. It is analogous to photocopying a photocopy of a photocopy: each pass loses detail and, in healthcare, the lost detail is likely to include the signals that matter most.
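The loss of rare cases can be illustrated with a toy simulation. The sketch below is a simplified illustration under stated assumptions, not the method of any published study: the "model" is just a table of category frequencies, each generation is refitted only on samples drawn from the previous generation, and the function names (`simulate_generations`, `support_size`), the three categories and their probabilities are all hypothetical.

```python
import random
from collections import Counter


def simulate_generations(weights, sample_size, generations, seed=0):
    """Refit a frequency-table 'model' on samples drawn from the previous model.

    Returns a list of distributions; index 0 is the original (real) one.
    """
    rng = random.Random(seed)
    categories = list(weights)
    current = dict(weights)
    history = [current]
    for _ in range(generations):
        # Only categories the current model still knows about can be sampled.
        live = [c for c in categories if current[c] > 0]
        sample = rng.choices(live, weights=[current[c] for c in live], k=sample_size)
        counts = Counter(sample)
        # The next generation sees only synthetic samples, never the real data.
        current = {c: counts[c] / sample_size for c in categories}
        history.append(current)
    return history


def support_size(dist):
    """Number of categories the model still assigns non-zero probability."""
    return sum(1 for p in dist.values() if p > 0)


# Hypothetical patient mix: a rare event occurs in 1% of real records.
real = {"common": 0.90, "uncommon": 0.09, "rare": 0.01}
history = simulate_generations(real, sample_size=50, generations=30)
for generation in range(0, len(history), 5):
    dist = history[generation]
    print(generation, support_size(dist), dist)
```

In this sketch, a category that fails to appear in one generation's sample is assigned zero probability and can never return, so the number of surviving categories can only stay the same or shrink. Real generative models are far more complex, but the one-way nature of the loss is the point of the analogy.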

Oxford researcher Ilia Shumailov, whose team published one of the landmark papers on model collapse, has warned that as synthetic data is recycled through successive generations of AI systems, uncommon but clinically important patterns can be gradually eroded.

Others have used more colourful language. Technology researcher Jathan Sadowski coined the phrase “Habsburg AI”, comparing the process to generations of inbreeding within the Habsburg royal dynasty. Each generation may appear functional on the surface, but underlying weaknesses accumulate subtly until the system’s quality noticeably degrades.

Many of the most valuable medical insights come from unusual patient cases, rare adverse events and unexpected treatment responses. If these cases become diluted or lost within synthetic datasets, models may come to represent less of what reality is and more of what hundreds of other AI systems guessed it to be.

This does not mean synthetic data is inherently flawed. Some healthcare applications have shown real value when synthetic data is generated from large volumes of validated patient data and carefully benchmarked against real-world outcomes. The risk emerges when synthetic data becomes a substitute rather than a supplement.

As AI adoption accelerates, healthcare organisations must avoid creating feedback loops where models increasingly learn from each other rather than from patients themselves. Otherwise, there is a legitimate risk that future AI systems become very good at recognising patterns that exist only within their own fictional world.
