A benchmark for trustworthy clinical AI

BIFOLD and Aignostics researchers introduce PathoROB

A new study published in Nature Communications by researchers from BIFOLD and Aignostics shows that today's pathology foundation models can be fooled by the hospital a tissue sample comes from. The team developed PathoROB, a first-of-its-kind benchmark to measure and mitigate the problem. PathoROB has already begun to influence how the next generation of pathology AI models is built and evaluated.

Artificial intelligence is expected to support doctors in diagnosing and characterizing cancer faster and more precisely. So-called foundation models, large AI systems pre-trained on millions of microscope images of tissue, are starting to support cancer detection, disease subtyping, and biomarker prediction in clinical workflows. A new study by an interdisciplinary research team from the Berlin Institute for the Foundations of Learning and Data (BIFOLD) at TU Berlin, the Berlin-based AI company Aignostics, Ludwig Maximilian University (LMU) Munich, and the Netherlands Cancer Institute (NKI) now reveals a critical blind spot in these models: they can secretly rely on the hospital a tissue sample came from, rather than on its actual biology, to make their predictions.

Every pathology laboratory leaves a subtle signature on its tissue slides: differences in how a biopsy is cut, stained, and scanned. These differences are medically irrelevant, but they are visible to AI, and the models internalize them. The researchers showed that current foundation models can identify the hospital of origin of a slide with 88–98% accuracy from their learned features. In some cases, the model's internal "map" of the data was organized primarily by hospital, and only secondarily by whether the tissue was healthy or cancerous.
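The idea of reading the hospital of origin out of a model's learned features can be illustrated with a small linear probe. The following is a minimal sketch on synthetic data, not the study's code: the feature generator, the function names, and the signal strengths are all hypothetical, and the probe is a simple least-squares linear classifier trained on frozen feature vectors.

```python
import numpy as np


def make_synthetic_features(n=2000, dim=32, bio_strength=2.0,
                            center_strength=6.0, seed=0):
    """Toy stand-in for foundation-model features (hypothetical, not PathoROB data)."""
    rng = np.random.default_rng(seed)
    bio = rng.integers(0, 2, size=n)      # 0 = normal, 1 = tumor
    center = rng.integers(0, 2, size=n)   # 0 / 1 = two hospitals
    bio_dir = rng.normal(size=dim)
    bio_dir /= np.linalg.norm(bio_dir)
    center_dir = rng.normal(size=dim)
    center_dir /= np.linalg.norm(center_dir)
    x = rng.normal(size=(n, dim))         # unstructured noise
    x += bio_strength * np.outer(bio - 0.5, bio_dir)           # biological signal
    x += center_strength * np.outer(center - 0.5, center_dir)  # hospital signature
    return x, bio, center


def linear_probe_accuracy(x, y, train_frac=0.7, seed=0):
    """Fit a least-squares linear classifier on frozen features; return test accuracy."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_train = int(train_frac * len(y))
    tr, te = idx[:n_train], idx[n_train:]
    classes = np.unique(y)
    onehot = (y[tr][:, None] == classes[None, :]).astype(float)
    x_tr = np.hstack([x[tr], np.ones((len(tr), 1))])  # append bias column
    x_te = np.hstack([x[te], np.ones((len(te), 1))])
    w, *_ = np.linalg.lstsq(x_tr, onehot, rcond=None)
    pred = classes[np.argmax(x_te @ w, axis=1)]
    return float(np.mean(pred == y[te]))


x, bio, center = make_synthetic_features()
center_acc = linear_probe_accuracy(x, center)
bio_acc = linear_probe_accuracy(x, bio)
print(f"hospital probe accuracy: {center_acc:.3f}")
print(f"biology probe accuracy:  {bio_acc:.3f}")
```

In this toy setup the hospital signature is deliberately made stronger than the biological signal, so the hospital probe is the more accurate of the two. With real foundation models, the features would come from the model itself rather than from a generator.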

The consequences can be severe. In one striking example, an AI model learned to use the hospital signature as a shortcut. As a result, it confidently mislabelled a clearly malignant tissue patch as normal, simply because the patch originated from a hospital it had associated with healthy tissue.

Hospital "fingerprints" hidden in the models

To make this problem measurable, the researchers built PathoROB, a first-of-its-kind public benchmark dedicated to the robustness of pathology foundation models against technical variation. It combines four datasets covering around 100,000 tissue patches, 28 biological classes, and 34 medical centers, and introduces a new "robustness index" that quantifies how much a model's internal representation is driven by real biology rather than hospital artifacts.
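One way to picture such an index is to look at each patch's nearest neighbors in feature space and ask whether they share its biological class or merely its hospital. The sketch below is an assumption-laden illustration, not the definition used in the study: the function name, the neighbor count k, and the exact ratio are hypothetical. A value near 1 means neighborhoods are organized by biology, and a value near 0 means they are organized by hospital.

```python
import numpy as np


def robustness_index_sketch(features, bio, center, k=10):
    """Hypothetical neighbor-based index in [0, 1]; not the paper's exact formula.

    Counts, over all k-nearest-neighbor pairs, how often a neighbor shares only
    the biological class versus only the medical center.
    """
    sq = np.sum(features ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (features @ features.T)
    np.fill_diagonal(d2, np.inf)               # a patch is not its own neighbor
    nn = np.argsort(d2, axis=1)[:, :k]         # indices of the k nearest patches
    same_bio = bio[nn] == bio[:, None]
    same_center = center[nn] == center[:, None]
    n_bio_only = int(np.sum(same_bio & ~same_center))     # same class, other hospital
    n_center_only = int(np.sum(~same_bio & same_center))  # other class, same hospital
    total = n_bio_only + n_center_only
    return n_bio_only / total if total else float("nan")


def toy_features(bio_scale, center_scale, n=400, seed=0):
    """Balanced 2D toy features: axis 0 encodes biology, axis 1 encodes hospital."""
    rng = np.random.default_rng(seed)
    bio = np.arange(n) % 2
    center = (np.arange(n) // 2) % 2
    x = np.stack([bio_scale * bio, center_scale * center], axis=1).astype(float)
    x += 0.3 * rng.normal(size=x.shape)
    return x, bio, center


bio_driven = robustness_index_sketch(*toy_features(bio_scale=5.0, center_scale=0.5))
center_driven = robustness_index_sketch(*toy_features(bio_scale=0.5, center_scale=5.0))
print(f"biology-driven features:  {bio_driven:.2f}")
print(f"hospital-driven features: {center_driven:.2f}")
```

Neighbors that share both class and hospital, or neither, are ignored in this sketch because they do not show which of the two factors dominates.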

Applied to 20 widely used foundation models, PathoROB exposed robustness deficits in every one of them. Larger models trained on more diverse data, as well as models that combine images with text reports (vision-language models), performed best. The researchers also tested several post-hoc "robustification" techniques and found that they can substantially reduce, but not yet fully eliminate, the risk of hospital-driven errors, without requiring costly retraining of the underlying model.
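This article does not name the robustification techniques that were tested. Purely as an illustration of what a post-hoc correction on frozen features can look like, the sketch below subtracts each hospital's mean feature vector. This per-center mean centering is an assumed example, not a method confirmed by the source, and all names in it are hypothetical. It needs hospital labels, and it would also remove biological signal wherever class and hospital are confounded.

```python
import numpy as np


def center_features_per_hospital(features, center):
    """Subtract each hospital's mean feature vector (illustrative technique only)."""
    out = features.astype(float).copy()
    for c in np.unique(center):
        mask = center == c
        out[mask] -= out[mask].mean(axis=0)
    return out


def nearest_centroid_accuracy(x, y, seed=0):
    """Measure how well labels y can be recovered from x with a nearest-centroid rule."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    half = len(y) // 2
    tr, te = idx[:half], idx[half:]
    classes = np.unique(y)
    centroids = np.stack([x[tr][y[tr] == c].mean(axis=0) for c in classes])
    dists = np.linalg.norm(x[te][:, None, :] - centroids[None, :, :], axis=2)
    return float(np.mean(classes[np.argmin(dists, axis=1)] == y[te]))


# Synthetic features with a hospital-specific offset (hypothetical data).
rng = np.random.default_rng(0)
n, dim = 1000, 16
center = rng.integers(0, 2, size=n)
offset = rng.normal(size=dim)
x = rng.normal(size=(n, dim)) + 1.5 * np.outer(center, offset)

before = nearest_centroid_accuracy(x, center)
after = nearest_centroid_accuracy(center_features_per_hospital(x, center), center)
print(f"hospital recoverable before centering: {before:.2f}")
print(f"hospital recoverable after centering:  {after:.2f}")
```

The foundation model itself is left untouched here. Only its output features are adjusted, which is what makes this kind of correction cheap compared with retraining.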

"Foundation models for pathology are advancing quickly, and that is exciting. But our results show that strong performance on a standard benchmark is not enough to trust a model in the clinic," says Julius Hense, co-first author of the study and researcher at BIFOLD and TU Berlin. "PathoROB gives developers and clinicians a tool to check whether a model has truly learned biology, or whether it has learned which hospital a slide came from."

Shaping the next generation of pathology AI models

PathoROB has already begun to reshape how pathology AI is built and compared. Aignostics' next-generation foundation model Atlas 2, released in January 2026 in collaboration with Mayo Clinic, was explicitly positioned to address the performance–robustness trade-offs that PathoROB exposed and uses the benchmark to demonstrate state-of-the-art robustness alongside prediction accuracy. Furthermore, the community is adopting PathoROB as a standard yardstick for foundation model robustness: new open-weight models such as GenBio-PathFM report PathoROB scores alongside accuracy, and platforms such as Waiv's Histoboard feature PathoROB among the benchmarks that researchers and clinicians can use to compare pathology AI models head-to-head.

By making the benchmark, datasets, and code openly available, the authors hope to establish robustness evaluation as a routine step in validating any biomedical foundation model before it is used to inform patient decisions in the clinic.

Publication

Towards robust foundation models for digital pathology. Jonah Kömen, Edwin D. de Jong, Julius Hense, Hannah Marienwald, Jonas Dippel, Philip Naumann, Eric Marcus, Lukas Ruff, Maximilian Alber, Jonas Teuwen, Frederick Klauschen, Klaus-Robert Müller. 11 June 2026.
Nature Communications publication
Code and benchmark
Datasets

Figure: PathoROB benchmark and foundation model representation space exploration. (a) We subsampled balanced multi-center datasets to compare biological class and technical/medical center information, extracted features from 20 pathology foundation models (FMs), and analyzed the resulting representations from different perspectives. (b) The PathoROB benchmark consists of four datasets from three public sources, together with three metrics to quantify FM robustness and its consequences. Each dataset matrix element depicts the number of patches per combination of biological class and medical center. For TCGA 2 × 2, we extracted 94 unique class-class-center-center quartets. (c) t-SNE plot of the representation spaces of Phikon-v2 and Virchow2 from a subset of the Camelyon tumor detection dataset (other FMs in Supplementary Note 6.1). The representation space of Phikon-v2 is organized by medical center (RUMC/UMCU), showing that at the highest level, the model distinguishes images based on the medical center. Virchow2's representation space is primarily split by biological information (normal/tumor), with a secondary organization by medical center. (d) Accuracy of predicting medical center vs. biological class from the feature vectors via linear probing. We report mean prediction accuracies with 95% confidence intervals on held-out test sets from three datasets (Camelyon, TCGA 4 × 4, Tolkach ESCA) with 20 repetitions each (n = 60), corrected to remove common variance due to dataset (Masson & Loftus⁹³). Across all FMs, the medical center origin of most patches could be recovered from the FM representations. Source data are provided as a Source Data file.