Study design
The study had two distinct steps: development and evaluation of an AI tissue detection algorithm, and evaluation of the downstream performance of a Gleason grading algorithm when using the AI tissue detection compared with a classical thresholding-based tissue detection method. The first step was done by procuring an evaluation set of high-quality tissue segmentation masks to compare against, then empirically finding a high-performing architecture and training set (more details below). The second step utilized an end-to-end Gleason grading model presented recently35. A separate study protocol46 provides details on the development and evaluation of the Gleason grading model, including how reference grading by pathologists was obtained for each cohort.
Prior to conducting this study, we had generated segmentation masks for tissue detection for every WSI using Otsu’s thresholding, to be used for development of the Gleason grading algorithm35. The parameters of the thresholding algorithm had been selected individually for each cohort. These segmentation masks were used as labels for the training set of the AI tissue detection algorithm, which used a UNet++ architecture47. Subsets of the segmentation masks had been checked visually, and in certain cases manually edited to improve quality (see Table 1). A set of 6,823 WSIs with high-quality masks was generated by utilising some of these masks and by iteratively re-running the thresholding algorithm with parameters tweaked on a WSI-by-WSI basis until high mask quality was confirmed by visual inspection. This set was used both for continuously validating the AI during training and for evaluating its performance after training for the purpose of model selection. For model selection, the model that achieved the best sensitivity was chosen, provided that its precision was not deemed unacceptable (< 90%).
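The model selection rule described above can be sketched as a constrained maximisation: among candidates whose precision is at least 90%, pick the one with the highest sensitivity. The following is a minimal illustration only; the `Candidate` type, the candidate names and the metric values are hypothetical and do not correspond to the models trained in this study.

```python
from typing import NamedTuple, Optional, Sequence


class Candidate(NamedTuple):
    """Hypothetical record of one trained model's evaluation metrics."""
    name: str
    sensitivity: float
    precision: float


def select_model(candidates: Sequence[Candidate],
                 min_precision: float = 0.90) -> Optional[Candidate]:
    """Return the highest-sensitivity candidate with acceptable precision."""
    # Discard models whose precision falls below the acceptance limit.
    eligible = [c for c in candidates if c.precision >= min_precision]
    if not eligible:
        return None
    return max(eligible, key=lambda c: c.sensitivity)


# Illustrative values only (not results from the study).
candidates = [
    Candidate("model_a", sensitivity=0.999, precision=0.85),
    Candidate("model_b", sensitivity=0.995, precision=0.93),
    Candidate("model_c", sensitivity=0.990, precision=0.97),
]
best = select_model(candidates)
```

Here `model_a` has the highest sensitivity but is rejected on precision, so `model_b` is selected.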
Table 1 Summary of data and partitions for training, validation and evaluation for development of the segmentation model. The labels were considered strong if tissue segmentation masks from these cohorts had been checked visually to verify their correctness, and weak if the quality of almost all segmentation masks in the cohort had not been checked. Certain cohorts that had been checked visually had also had a small number of segmentation masks manually edited to improve the labels, the number of which is documented in the rightmost column.
Two sets of segmentation masks were generated for each WSI in the test set of the Gleason grading AI model. One set was generated using thresholding, with a single set of manually fine-tuned parameters for all WSIs. The other set was generated by running the segmentation AI model from the previous step, with no additional processing steps. The Gleason grading algorithm was then evaluated twice, once with tissue detection based on each set of segmentation masks. To allow for a direct comparison of downstream task performance, only those WSIs where both segmentation algorithms detected tissue were included. This excluded 140 difficult slides where one or both algorithms failed to detect any tissue (see Table 2 for details).
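The inclusion rule for the downstream comparison can be expressed as a simple filter: a WSI is kept only if both masks contain at least one tissue pixel. The sketch below assumes masks are held in memory as nested lists of 0/1 values keyed by a slide identifier; these data structures and the function names are illustrative assumptions.

```python
def has_tissue(mask) -> bool:
    """True if a binary mask (rows of 0/1 values) contains any tissue pixel."""
    return any(any(row) for row in mask)


def comparable_slides(threshold_masks: dict, ai_masks: dict) -> list:
    """Slide IDs for which both methods detected at least some tissue."""
    shared_ids = threshold_masks.keys() & ai_masks.keys()
    return sorted(
        slide_id
        for slide_id in shared_ids
        if has_tissue(threshold_masks[slide_id]) and has_tissue(ai_masks[slide_id])
    )


# Toy example with hypothetical slide identifiers.
threshold_masks = {"s1": [[0, 1], [0, 0]], "s2": [[0, 0], [0, 0]], "s3": [[1, 1], [0, 0]]}
ai_masks = {"s1": [[1, 1], [0, 0]], "s2": [[0, 1], [0, 0]], "s3": [[0, 0], [0, 0]]}
included = comparable_slides(threshold_masks, ai_masks)
```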
Table 2 The number of WSIs in the test cohorts of the downstream Gleason grading model where tissue segmentation by AI, by thresholding or by both methods failed to detect any tissue. Only WSIs with tissue detected by both algorithms were included in the comparison (Fig. 2).
Datasets and data partitioning
Dataset for Gleason grading AI
A complete description of the development of the Gleason grading AI is given in the study protocol46; here we give a brief overview of the datasets involved. The dataset represents digitized hematoxylin and eosin (H&E) stained prostate core needle biopsies from patients who underwent biopsy between 2012 and 2023. Samples were obtained from 15 clinical sites, of which this study utilized slides from 13, excluding the Aquesta Uropathology (“AQ”) and Karolinska University Hospital morphological subtypes (“KUH-2”) cohorts, which represent non-gradable rare variants (see the study protocol for descriptions). The included slides were scanned using 13 whole slide scanners comprising 9 different models from 5 different vendors. The majority of these scanners use 40x magnification, with a few using 20x magnification. The Gleason grading AI was trained on 55,798 WSIs (Stockholm3, “STHLM3”; Stavanger University Hospital, “SUH”) and tuned on 1,177 WSIs (STHLM3; Radboud University Medical Center, “RUMC”; Karolinska University Hospital, “KUH-1”). For this study, 27,272 WSIs were initially segmented. A small number of these were excluded due to missing information such as ISUP grading, and for cohorts with several WSIs per slide (STHLM3, SUH) only one WSI was chosen per slide, at random. This resulted in an evaluation set for the Gleason grading AI of 18,848 WSIs (Aarhus University Hospital, “AUH”; Mehiläinen Länsi-Pohja, “MLP”; Medical University of Lodz, “MUL”; RUMC; Synlab Switzerland, “SCH”; Synlab Finland, “SFI”; Synlab France, “SFR”; Spear Prostate Biopsy 2020, “SPROB20”; STHLM3; SUH; University Hospital Cologne, “UKK”; Hospital Wiener Neustadt, “WNS”) from the internal and external validation cohorts. Internal validation cohorts represent data from the same lab and/or WSI scanner as the training data but from independent patients, and external validation cohorts represent data from different labs, scanners, and patients than the training data46.
MLP, SCH and SFI were evaluated at an anatomical location aggregation level, and SPROB20 at a patient aggregation level, resulting in a final evaluation dataset of 13,549 unique cases.
Dataset for tissue detection AI
The WSIs for developing the segmentation AI model were a subset of the development set of the Gleason grading AI to ensure that the combined system respected the held-out internal and external test set splits specified in the study protocol. Multiple segmentation models were trained using a few different subsets of the data before choosing the UNet++ architecture trained on 33,823 WSIs (Table 1), as this achieved the highest tissue detection sensitivity on the segmentation evaluation set. Of these 33,823 WSIs, 6,172 (18.2%) had strong labels that were either checked visually to verify their quality, or, in 54 cases (0.16% of total), manually edited. The other 27,651 (81.8%) WSIs had weak labels that had not been checked for quality, apart from during the initial mask creation process when parameters were tuned empirically using a small random subset of the WSIs. Validation for early stopping and evaluation for model selection used the set of 6,823 WSIs with manually curated high-quality segmentation masks (see “Study design”), split 30%/70% at the patient level (2,523 WSIs for validation and 4,305 WSIs for evaluation, see Table 1).
Thresholding algorithm
The thresholding algorithm was based on Otsu’s method36 followed by morphological operations. It served three purposes: as the comparator method, to generate labels for training the segmentation AI model, and to generate the validation and evaluation sets. For details on the specific functions used, see the supplementary materials. The original purpose of the implementation of the algorithm was to generate segmentation masks to be used for the Gleason grading AI, and these are the masks that constituted our labels. For these, parameters of the algorithm were chosen in a cohort-specific manner based on what was deemed optimal for each cohort. For the validation and evaluation sets of the tissue detection algorithm, the parameters were instead tuned to every individual WSI, and re-tuned until the visually evaluated quality of the mask was very high. Finally, for the comparison between thresholding and AI-based tissue detection, a single uniform set of parameters was used for all WSIs, selected manually based on what empirical testing revealed to be the most consistently good parameters during the validation set curation.
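To make the core of the thresholding step concrete, the sketch below computes an Otsu threshold from a grayscale histogram by maximising the between-class variance, then marks pixels at or below the threshold as tissue. It is not the implementation used in the study: it assumes 8-bit grayscale input, assumes that tissue is darker than the slide background, and omits the morphological operations and tunable parameters, whose specifics are given only in the supplementary materials. All function names are illustrative.

```python
def histogram_256(gray):
    """Count occurrences of each 8-bit intensity in a nested-list image."""
    hist = [0] * 256
    for row in gray:
        for px in row:
            hist[px] += 1
    return hist


def otsu_threshold(hist):
    """Intensity t maximising between-class variance (class 0: values <= t)."""
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels in class 0
    sum0 = 0    # intensity sum of class 0
    for t, h in enumerate(hist):
        w0 += h
        sum0 += t * h
        w1 = total - w0
        if w0 == 0:
            continue
        if w1 == 0:
            break
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (mean0 - mean1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def tissue_mask(gray, threshold):
    """Binary mask, assuming tissue is darker than background."""
    return [[1 if px <= threshold else 0 for px in row] for row in gray]


# Toy image: bright background (220) and dark tissue (40).
gray = [[220, 220, 40], [220, 40, 40]]
t = otsu_threshold(histogram_256(gray))
mask = tissue_mask(gray, t)
```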
Segmentation AI model
U-Net is a convolutional neural network developed for segmentation of biomedical images48, and UNet++ is an extension of this architecture47. They are supervised learning models and hence require training data with annotated labels: WSIs or image patches with corresponding segmentation masks. The segmentation AI model architecture in this paper was that of UNet++, implemented using version 0.3.3 of the SegmentationModels Python library49. The implementation used the resnext101_32x4d encoder with a depth of 5, and five decoder channels (512, 256, 128, 54, 32). For details on augmentations, see the supplementary materials.
Training was done with an AdamW optimizer50 (torch.optim.AdamW) with a base learning rate of 1e-6, epsilon constant of 1e-6 for stability, and weight decay 0.01. Binary cross entropy (torch.nn.BCEWithLogitsLoss with pos_weight = 5.0) was used as a loss function for training, while F1-score (torchmetrics.classification.BinaryF1Score) was used as a metric on the validation set for early stopping.
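The effect of the positive-class weight in the loss can be illustrated with a small, library-free sketch of the per-pixel weighted binary cross entropy on logits, following the documented definition of torch.nn.BCEWithLogitsLoss: the term for tissue pixels (target 1) is multiplied by pos_weight, so a missed tissue pixel costs five times as much as an equally confident false positive. The function below is a scalar illustration written for this purpose, not the training code.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def weighted_bce_with_logits(logit: float, target: float,
                             pos_weight: float = 1.0) -> float:
    """Per-pixel loss: -[pos_weight * y * log(s(x)) + (1 - y) * log(1 - s(x))].

    Uses log(s(x)) = -softplus(-x) and log(1 - s(x)) = -softplus(x),
    where s is the logistic sigmoid.
    """
    return (pos_weight * target * softplus(-logit)
            + (1.0 - target) * softplus(logit))


# An uncertain prediction (logit 0) on a tissue pixel vs. a background pixel.
loss_tissue = weighted_bce_with_logits(0.0, 1.0, pos_weight=5.0)
loss_background = weighted_bce_with_logits(0.0, 0.0, pos_weight=5.0)
```

With pos_weight = 5.0 the two losses differ by exactly a factor of five at logit 0, which pushes the model towards higher sensitivity at some cost in precision.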
Tissue detection and patch extraction
For both the thresholding algorithm and the segmentation AI, a resolution of 8.0 μm per pixel was used, which is heavily downsampled from the original resolution of the WSIs, and segmentation masks were stored as binary images. For training the segmentation AI, patches of size 512 × 512 pixels were extracted with no overlap. To fit an exact number of such patches, each WSI was first mirrored around each edge by an appropriate amount. During inference, patches were generated such that they overlapped by 128 pixels at each edge, allowing the edges of each predicted mask to be discarded. This avoids issues near tile edges caused by a lack of neighboring pixels providing context.
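The tiling arithmetic can be sketched along a single image axis as follows. This is an illustration under stated assumptions rather than the study's implementation: the total padding is computed without specifying how it is divided between the two edges, "overlap" is taken to mean the number of pixels shared by adjacent tiles, and the last tile is placed flush with the far edge. The function names are hypothetical.

```python
def mirror_padding(size: int, patch: int = 512) -> int:
    """Total pixels to add along one axis so that size is a multiple of patch."""
    return (-size) % patch


def tile_origins(size: int, patch: int = 512, overlap: int = 0) -> list:
    """Start coordinates of tiles along one axis.

    Adjacent tiles share `overlap` pixels (assumed interpretation); the final
    tile is aligned with the far edge so the whole axis is covered.
    """
    if size <= patch:
        return [0]
    stride = patch - overlap
    origins = list(range(0, size - patch, stride))
    origins.append(size - patch)
    return origins


# Training: pad, then tile without overlap.
pad = mirror_padding(1300)                        # 1300 + pad is a multiple of 512
train_tiles = tile_origins(1300 + pad)
# Inference: tiles sharing 128 pixels with their neighbours.
infer_tiles = tile_origins(1024, overlap=128)
```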
For both tasks, patches were downsampled from the closest higher resolution level in the WSI resolution pyramid using Lanczos resampling. For training the grading model, a higher resolution of 1.0 μm per pixel was used, with patches of size 256 × 256 pixels. Only patches with at least 10% of tissue pixels according to the segmentation masks were kept. Patches were extracted without overlap for training and with 128 pixel overlap during inference. Extracted patches were stored in TFRecord format, with each WSI saved as a separate file.
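The tissue-content criterion for grading patches amounts to a fraction test on the corresponding region of the segmentation mask. As a worked example of the scales involved, a 256-pixel patch at 1.0 μm per pixel spans 256 μm, which corresponds to 32 mask pixels at 8.0 μm per pixel, assuming the mask is read at its stored resolution. The sketch below applies the 10% rule to such a mask region; the function names and the nested-list representation are illustrative assumptions.

```python
def tissue_fraction(mask_region) -> float:
    """Fraction of pixels marked as tissue in a binary mask region."""
    pixels = [px for row in mask_region for px in row]
    return sum(1 for px in pixels if px) / len(pixels)


def keep_patch(mask_region, min_fraction: float = 0.10) -> bool:
    """Keep a patch if at least `min_fraction` of its mask pixels are tissue."""
    return tissue_fraction(mask_region) >= min_fraction


# Side length of the mask region covered by one grading patch.
mask_side = int(256 * 1.0 / 8.0)

# Toy 10 x 10 regions with exactly 10 and 9 tissue pixels.
region_10 = [[1] * 10] + [[0] * 10 for _ in range(9)]
region_9 = [[1] * 9 + [0]] + [[0] * 10 for _ in range(9)]
```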
Gleason grading model
The grading model used for evaluation in this study is a weakly-supervised algorithm relying on an attention-based multiple instance learning (ABMIL) architecture51. The model utilizes an EfficientNet-V2-S encoder52 initialized with ImageNet weights that produces patch-level feature embeddings. These are then aggregated into slide-level representations through the ABMIL and classified into primary and secondary Gleason patterns (i.e. 3, 4, or 5), and further translated into Gleason scores and International Society of Urological Pathology (ISUP) grades. The model was trained in an end-to-end fashion, jointly optimizing all model parameters for cross-entropy loss using the AdamW optimizer with a base learning rate of 0.0001. Details on the model design, hyperparameters, and complete training strategy are given in the original publication35. The model was trained on 10 cross-validation folds, stratified by patient and ISUP grade. During model predictions, test time augmentation (TTA) was applied on three iterations for each of the 10 folds, and the final predictions were obtained as a majority vote of the resulting 30 Gleason scores.
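The final aggregation step, a majority vote over the 30 Gleason scores obtained from 10 folds with three test-time augmentation iterations each, can be sketched as below. How ties are resolved is not stated in the text; this sketch assumes the score that appears first among the tied candidates wins, which is the behaviour of Counter.most_common. The example scores are invented for illustration.

```python
from collections import Counter


def majority_vote(scores):
    """Most frequent score; ties go to the score encountered first (assumption)."""
    return Counter(scores).most_common(1)[0][0]


# 10 folds x 3 TTA iterations = 30 hypothetical Gleason scores for one case.
votes = ["3+4"] * 16 + ["4+3"] * 10 + ["4+4"] * 4
final_score = majority_vote(votes)
```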
Statistical analysis
The segmentation masks produced by thresholding and AI were compared using the pixel-wise metrics of sensitivity (true positive rate) and precision (positive predictive value). For our purposes, a model with high sensitivity is crucial, as low sensitivity indicates large missed regions. Precision is included to ensure that excessive amounts of background are not detected as tissue.
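Both pixel-wise metrics follow directly from counts of true positives (TP), false positives (FP) and false negatives (FN): sensitivity is TP / (TP + FN) and precision is TP / (TP + FP). A minimal sketch, assuming the reference and predicted masks are binary images of identical shape held as nested lists:

```python
def sensitivity_precision(reference, predicted):
    """Pixel-wise sensitivity and precision of a predicted binary mask."""
    tp = fp = fn = 0
    for ref_row, pred_row in zip(reference, predicted):
        for ref_px, pred_px in zip(ref_row, pred_row):
            if pred_px and ref_px:
                tp += 1          # tissue correctly detected
            elif pred_px:
                fp += 1          # background detected as tissue
            elif ref_px:
                fn += 1          # tissue missed
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    return sensitivity, precision


reference = [[1, 1, 0, 0], [1, 1, 0, 0]]
predicted = [[1, 1, 1, 0], [1, 0, 0, 0]]
sens, prec = sensitivity_precision(reference, predicted)
```

In this toy example three of four tissue pixels are found (sensitivity 0.75) and three of four detected pixels are tissue (precision 0.75).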
The Gleason grading model was trained using pre-existing segmentation masks generated with thresholding, and evaluated once using the UNet++ segmentation masks and once using masks generated with thresholding to detect tissue in the evaluation slides. In this step, all thresholding masks were created using a uniform set of thresholding parameters, selected via empirical testing. The two evaluations were compared using quadratic weighted Kappa, a modification of Cohen’s Kappa statistic that measures agreement between two sets of ratings: in this case, the model’s predictions and the pathologists’ labels for each WSI or group of WSIs graded together. Confidence intervals were computed using bootstrapping with 1000 replicates.
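For reference, quadratic weighted Kappa and a bootstrap confidence interval can be sketched as follows. Several details are assumptions not stated in the text: grades are encoded as integers 0 to n_classes - 1, cases are resampled with replacement as the bootstrap unit, the interval is taken from percentiles of the replicate distribution, and replicates in which Kappa is undefined are dropped. The data in the usage example are invented.

```python
import math
import random


def quadratic_weighted_kappa(y_true, y_pred, n_classes):
    """1 - sum(w * observed) / sum(w * expected), w_ij = (i - j)^2 / (k - 1)^2."""
    n = len(y_true)
    observed = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1
    row = [sum(r) for r in observed]
    col = [sum(observed[i][j] for i in range(n_classes)) for j in range(n_classes)]
    num = den = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            w = (i - j) ** 2 / (n_classes - 1) ** 2
            num += w * observed[i][j]
            den += w * row[i] * col[j] / n   # expected count under independence
    return 1.0 - num / den if den else float("nan")


def bootstrap_ci(y_true, y_pred, n_classes, n_replicates=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval (assumed method), resampling cases."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_replicates):
        idx = [rng.randrange(n) for _ in range(n)]
        k = quadratic_weighted_kappa([y_true[i] for i in idx],
                                     [y_pred[i] for i in idx], n_classes)
        if not math.isnan(k):
            stats.append(k)
    stats.sort()
    m = len(stats)
    lower = stats[int(alpha / 2 * m)]
    upper = stats[max(int((1 - alpha / 2) * m) - 1, 0)]
    return lower, upper


# Invented grades for 40 cases, six classes (0 to 5).
y_true = [i % 6 for i in range(40)]
y_pred = [min(g + (1 if i % 7 == 0 else 0), 5) for i, g in enumerate(y_true)]
kappa = quadratic_weighted_kappa(y_true, y_pred, n_classes=6)
ci_low, ci_high = bootstrap_ci(y_true, y_pred, n_classes=6, n_replicates=200)
```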

