
Generalization of a Deep Learning Model for HER2 Status Prediction on H&E-Stained Whole Slide Images from 3 Neoadjuvant Clinical Studies

Miriam Hägele
Klaus-Robert Müller
Carsten Denkert
Andreas Schneeweiss
Bruno Sinn
Michael Untch
Marion T. Van Mackelenbergh
Christian Jackisch
Valentina Nekljudova
Thomas Karn
Maximilian Alber
Frederik Marmé
Christian Schem
Elmar Stickeler
Peter A. Fasching
Volkmar Müller
Karsten E. Weber
Bianca Lederer
Sibylle Loibl
Frederick Klauschen

September 13, 2022

Background
Because HER2 status plays a decisive role in selecting targeted treatment options in breast cancer, we investigated whether machine learning (ML) models can identify HER2-positive tumours from routinely obtained haematoxylin and eosin (H&E)-stained whole slide images.
Methods
We trained a machine learning model to predict HER2 status on 844 patients from the GeparSixto and GeparSepto studies and validated it on 1567 independent patients from GeparSixto, GeparSepto and GeparOcto. Each gigapixel image carries only a single label, with no additional granular annotations from domain experts; we address this challenge by aggregating learned features per patient. The network uses a gated attention head, and patches are filtered using previously segmented tumour regions. Final predictions are made by an ensemble obtained from a 5-fold outer cross-validation. Additionally, we use selective prediction analysis to choose a subset of patients for which the model is highly certain.
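The per-patient aggregation described above can be illustrated with a minimal sketch of gated attention pooling, a common formulation in multiple-instance learning. It is not the authors' implementation: the parameter names `V`, `U` and `w`, the function names, and the use of a plain mean over the five fold models are all assumptions made for illustration.

```python
import math


def gated_attention_pool(patches, V, U, w):
    """Aggregate patch feature vectors into one patient-level vector.

    patches: list of K feature vectors, each of length D
    V, U:    hypothetical projection matrices (lists of rows of length D)
    w:       hypothetical scoring vector, one entry per row of V and U
    """
    def matvec(M, x):
        return [sum(m * xi for m, xi in zip(row, x)) for row in M]

    # One unnormalised attention score per patch:
    # w . (tanh(V h) * sigmoid(U h)), the "gated" attention form.
    scores = []
    for h in patches:
        tanh_part = [math.tanh(z) for z in matvec(V, h)]
        gate = [1.0 / (1.0 + math.exp(-z)) for z in matvec(U, h)]
        scores.append(sum(wi * t * g for wi, t, g in zip(w, tanh_part, gate)))

    # Softmax over patches, shifted by the maximum for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Attention-weighted sum of the patch features.
    dim = len(patches[0])
    pooled = [sum(a * h[d] for a, h in zip(weights, patches)) for d in range(dim)]
    return pooled, weights


def ensemble_mean(fold_probabilities):
    """Assumed ensembling rule: average the per-fold predicted probabilities."""
    return sum(fold_probabilities) / len(fold_probabilities)
```

In such a setup the attention weights sum to one over all patches of a patient, so the pooled vector stays on the same scale regardless of how many patches pass the tumour-region filter.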
Results
Our model was evaluated on 1488 patients of the validation set (5% were excluded by automatic quality control). With respect to the centrally assessed clinical HER2 status, it achieved an area under the receiver operating characteristic curve (AUROC) of 0.81 (95% CI 0.79-0.83) and a balanced accuracy (BA) of 73.1%. In the held-out GeparOcto subcohort, we observed an AUROC of 0.80 (95% CI 0.77-0.83) and a BA of 70.4%. Taking the model's confidence into account, the BA increases to 84.7% for a selected subset of 13% of cases and to 81.5% for a selected subset of 32% of cases. Considering different thresholds illustrates the trade-off between coverage and performance.
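The coverage and performance trade-off reported above can be made concrete with a small sketch. Balanced accuracy is the mean of sensitivity and specificity. The confidence measure used here, max(p, 1 - p) for a predicted HER2-positive probability p, and the 0.5 decision threshold are assumptions for illustration; the abstract does not state how confidence was defined.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for binary labels (1 = positive)."""
    pos = sum(1 for t in y_true if t == 1)
    neg = sum(1 for t in y_true if t == 0)
    if pos == 0 or neg == 0:
        return None  # undefined when one class is absent
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return 0.5 * (tp / pos + tn / neg)


def selective_balanced_accuracy(y_true, probs, min_confidence, threshold=0.5):
    """Return (coverage, BA) restricted to sufficiently confident cases.

    Confidence is assumed to be max(p, 1 - p); this definition is
    illustrative and not taken from the study.
    """
    selected = [
        (t, 1 if p >= threshold else 0)
        for t, p in zip(y_true, probs)
        if max(p, 1.0 - p) >= min_confidence
    ]
    coverage = len(selected) / len(y_true)
    if not selected:
        return coverage, None
    labels = [t for t, _ in selected]
    preds = [p for _, p in selected]
    return coverage, balanced_accuracy(labels, preds)
```

Sweeping `min_confidence` from 0.5 upwards yields pairs of coverage and balanced accuracy; a stricter threshold keeps fewer patients and, for a well-calibrated model, tends to raise the balanced accuracy on those who remain.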
Conclusions
The trained ML model predicts HER2 status with state-of-the-art performance. Beyond that, we demonstrate that this performance generalizes to a held-out clinical study (GeparOcto). In addition, our approach shows that substantial performance increases can be achieved for subsets of patients selected by the model's confidence. Both the generalization to a held-out clinical study and the subset analysis of the most confident predictions demonstrate the robustness of our approach.