A new study published in Nature warns that vulnerabilities in the data used to train artificial‑intelligence models could allow personal medical records to be exposed, particularly for underrepresented groups. The researchers, led by a team at the University of Cambridge, examined how large language models ingest and store sensitive information and found that the models can inadvertently reproduce protected data when queried in certain ways.
The paper, “Identification risks are more severe for underrepresented groups in the training data,” was released online on June 24, 2026 (doi:10.1038/d41586-026-02032-3). It documents a systematic analysis of how language models trained on publicly available datasets that include health information can generate text that closely mirrors the original data. The authors demonstrate that for some individuals, particularly those from minority or low‑representation groups, models reproduce personal details more accurately, which raises privacy concerns.
Analysis: The study highlights a mismatch between the diversity of training data and the safeguards needed to protect sensitive information. “We found that the more a demographic is underrepresented in the training corpus, the higher the risk that the model will reproduce that person’s data,” the authors note. This suggests that privacy risks are not evenly distributed across populations, potentially widening existing disparities in data security.
The researchers also note that the problem is compounded because many health datasets are not fully anonymized or are scraped from public sources without proper consent. They argue that current regulatory frameworks, which often treat all data uniformly, may be insufficient to address risks that fall unevenly across groups.
The findings have implications for companies developing AI applications in healthcare, as well as for policy makers overseeing data protection. The authors call for more robust de‑identification techniques, better auditing of training datasets, and stricter controls on model outputs that could reveal personal information.
The paper also draws a broader point about the “unevenness of the Universe,” a metaphor the authors use to describe how data distribution can mirror societal inequalities. They suggest that AI systems may inherit and amplify these imbalances unless deliberate steps are taken to correct them.
Sources
– Nature. “Identification risks are more severe for underrepresented groups in the training data.” Published online 24 June 2026. https://www.nature.com/articles/d41586-026-02032-3
Corrections
If you believe this article contains an error, contact Herald Express with the source URL and supporting evidence.

