Sample
This study was considered a service evaluation (NHS Grampian Quality Improvement and Assurance (ID no. 5834)) by the local Research Governance team and did not require ethical approval. The evaluation was registered with the Scottish National Screening Organisation Research and Innovation group and received NHS Grampian Caldicott Approval. Women invited to routine breast screening received an amended invitation letter providing information about the study, a participant information sheet, a frequently asked questions guide and details on how to opt out of the evaluation.
The source population for the GEMINI study included women attending routine breast screening within NHS Grampian, part of the Scottish Breast Screening Service (SBSS), who did not opt out of the study and whose mammograms were read between 27 February and 5 October 2023. SBSS uses the Community Health Index number, a unique patient identifier used in Scotland, to identify women. In the UK NHS Breast Screening Programme, women aged 50 to 71 years are invited triennially for routine mammography screening. A standard mammography examination consisted of mediolateral oblique and craniocaudal full-field digital mammographic views of each breast using a Selenia Dimensions system (Hologic; software v.1.10). All acquisitions followed UK guidelines. Cancers were defined as histologically confirmed cancers present in the breast after surgery (if available) or biopsy.
The AI system
Mia v.3 (Kheiron Medical Technologies Ltd) is a CE-marked AI system that uses deep learning. Based on a woman’s screening mammograms, it outputs a continuous malignancy prediction value ranging from 0 to 1. The AI indicates that a woman should be invited back for additional examinations if her mammogram’s malignancy prediction value is above a certain decision threshold. The vendor prescribes six pre-set OPs based on different decision thresholds for different configurations. These thresholds are validated using locally acquired data with known outcomes and set before live AI use. During the AI-Additional Read workflow used in the GEMINI study, the AI findings were presented to the arbitration group as a recommended recall opinion with up to three regions of interest, depicted by a yellow outline in each view. The AI system was trained on images from real-world screening programs across different countries and centers, and with equipment from different hardware vendors over more than ten years16. However, no data from the GEMINI study NHS Grampian evaluation site was used in the development, training or calibration of the AI system’s machine learning model. The AI OPs used in this study (OP1 to OP4) were tuned and validated locally using a retrospective dataset from the same screening setting11.
Study design
For all women, routine screening practice and the standard of care were maintained, with all mammograms initially read by two human readers (Fig. 2). Per the center’s routine practice, inexperienced readers were paired with experienced readers (≥3 years of reading experience). In cases of discrepancy, a third human reader made the final clinical decision. None of the first, second or third (arbitrating) human readers could view the AI opinion.
The AI tool was used live in the clinical workflow with the intention of increasing the CDR (‘AI-Additional Read’ workflow at OP2). If the final clinical decision was not to recall but the AI tool gave a recall opinion, these cases were additionally arbitrated by 2–3 senior human readers drawn from a pool of five readers with at least 5 years of experience. During this additional arbitration, the human readers were presented with the AI-generated regions of interest. The arbitration panel was instructed to recall women with visibly suspicious findings in the breast as well as those with subtle changes that might point to a malignancy, reflecting the UK breast screening program’s interval cancer criteria, that is, recalling at the ‘unsatisfactory’ or ‘satisfactory with learning points’ threshold17. A study team member (B.T.) recorded the time needed to perform this additional review with a stopwatch. At their return consultation, recalled women were met by a clinician who explained that the AI tool had detected a region of interest requiring further evaluation.
The AI opinion for all mammograms was downloaded at the end of the study and provided to the study team in the Grampian Data Safe Haven (DaSH) independent research environment (see ‘Statistics and reproducibility’ section). This design allowed the ‘triage’ and ‘triage negatives’ workflows to be simulated. In the ‘triage’ workflow, the AI replaces the second reader whenever the AI and reader 1 have the same recall/no recall opinion, maximizing workload savings. In the ‘triage negatives’ workflow, the AI replaces the second reader only when the AI and reader 1 agree not to recall, reducing workload while avoiding additional recalls. The design also enabled analyses of AI triage in combination with AI as an additional reader (‘AI-Additional Read’) at multiple different OPs, allowing 17 different workflows to be assessed.
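The recall-and-triage logic described above can be sketched in code. The study’s analyses were performed in R; the following is an illustrative Python sketch, and the function and variable names are ours rather than the study’s.

```python
def ai_recall(malignancy_score, op_threshold):
    """The AI recommends recall when its continuous malignancy score (0-1)
    exceeds the decision threshold of the chosen operating point."""
    return malignancy_score > op_threshold

def second_human_read_needed(reader1_recall, malignancy_score, op_threshold, mode):
    """Return True when a human second read is still required.

    'triage': the AI replaces the second reader whenever its recall/no-recall
    opinion matches reader 1, maximizing workload savings.
    'triage_negatives': the AI replaces the second reader only when both the AI
    and reader 1 agree not to recall, so the AI can never add recalls.
    """
    agree = ai_recall(malignancy_score, op_threshold) == reader1_recall
    if mode == "triage":
        return not agree
    if mode == "triage_negatives":
        return not (agree and not reader1_recall)
    raise ValueError(f"unknown mode: {mode}")
```

For example, under ‘triage negatives’ with a hypothetical threshold of 0.5, a no-recall first read paired with an AI score of 0.2 means the second human read is skipped, whereas any recall opinion by either party sends the case back to a human second reader.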
The primary workflow (specified a priori) combines the live AI workflow (AI-Additional Read at OP2) with the ‘triage negatives’ workflow (at OP3). All workflow definitions, including the choice of the primary workflow, were specified in the Evaluation plan before commencement of data collection.
Further information on study design is available in the Nature Research Reporting Summary linked to this article.
Data processing
Mammograms from screening attendees who had exactly one of each of the four standard views (left and right mediolateral oblique and craniocaudal) and were identified as female in the Digital Imaging and Communications in Medicine (DICOM) data were de-identified and sent to Mia’s cloud service, along with the relevant metadata. The Mia server processed the cases and returned the Mia results to the Virtual Machine Gateway, from which they were returned to the SBSS. Customized Business Objects XI reports containing clinical data, including age, reader opinions, AI opinions and cancer diagnoses, were exported from the SBSS and transferred to the DaSH, a Trusted Research Environment, for analysis.
Statistics and reproducibility
Independence and reproducibility
This study adheres to the STARD-AI reporting statement18.
The DaSH team pseudonymized and provisioned the dataset into a secure DaSH workspace, accessible only to the NHS Grampian and University of Aberdeen study team (C.F.d.V., J.A.D., G.L. and L.A.A.). The AI vendor could not access this workspace, ensuring that the evaluation was performed independently of the industry partner.
The AI vendor was given access to the data within a separate workspace, where they verified the accuracy of the reported results. As this was a real-world evaluation and all participants underwent routine screening, no blinding or randomization was required.
Statistical analysis
A Statistical Analysis Plan was created before commencement of data analysis. CDR, recall rate, sensitivity, specificity and PPV were calculated for the AI workflows and for routine double reading, with 95% Wilson confidence intervals (CIs). Bootstrapping with 50,000 repetitions was used to generate two-sided 90% CIs for the relative differences between the AI workflows and routine double reading on each metric. These CIs were then used to perform non-inferiority tests with a relative margin of 0.1 and an alpha of 0.05, comparing the combination AI workflows with routine double reading on each metric. If non-inferiority was established, a superiority test was performed with an alpha of 0.1; superiority was established when the lower or upper confidence bound of the ratio lay above or below 1, respectively, depending on the direction of the specific metric. The percentile bootstrap interval was used to estimate the CIs. The relative changes in performance for each AI workflow compared to routine double reading, with corresponding 90% CIs, are reported in Supplementary Table 1. A sensitivity analysis was undertaken for the primary workflow, assuming that fewer additional cancers were detected.
All non-inferiority and superiority tests were performed on the lower or upper bound of a CI for the ratio of two metrics. Although the test itself involves no assumptions of normality or equal variances, it does require an accurately calculated CI. Because the assumption of normality usually does not hold for ratios, a bootstrap procedure was used to calculate these CIs; the data distribution was not assumed to be normal.
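The interval estimation and non-inferiority logic above can be sketched in code. The study used R with the ‘boot’ package; the following is an illustrative Python sketch of a Wilson score interval, a percentile-bootstrap CI for a ratio of paired detection indicators, and a non-inferiority check against the relative margin of 0.1. Function names and the toy inputs are ours.

```python
import math
import random

def wilson_ci(successes, n, z=1.96):
    """95% Wilson score interval for a proportion successes/n."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

def bootstrap_ratio_ci(x, y, reps=50_000, alpha=0.10, seed=1):
    """Two-sided (1 - alpha) percentile bootstrap CI for sum(x)/sum(y),
    where x and y are paired per-case 0/1 indicators (e.g. cancer detected
    by the AI workflow vs by routine double reading)."""
    rng = random.Random(seed)
    n = len(x)
    ratios = []
    for _ in range(reps):
        idx = [rng.randrange(n) for _ in range(n)]
        sy = sum(y[i] for i in idx)
        if sy:  # skip degenerate resamples with a zero denominator
            ratios.append(sum(x[i] for i in idx) / sy)
    ratios.sort()
    return (ratios[int(len(ratios) * alpha / 2)],
            ratios[int(len(ratios) * (1 - alpha / 2)) - 1])

def non_inferior(ci_lower, margin=0.1):
    """For a 'higher is better' metric, non-inferiority holds when the
    lower 90% bound of the AI/routine ratio exceeds 1 - margin."""
    return ci_lower > 1 - margin
```

The percentile interval simply reads off the empirical 5th and 95th percentiles of the bootstrapped ratios, which avoids any normality assumption on the ratio’s sampling distribution.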
A gating strategy was defined for each AI combination workflow for the non-inferiority and superiority testing (Table 5). The order of the tests was defined a priori. No corrections for alpha were applied because the prespecified gating limits the number of hypotheses tested. Furthermore, each combination workflow represents a distinct product configuration that would be deployed independently. If a test was not passed, the next test was exploratory instead of confirmatory.
Table 5 Gating strategy for each AI combination workflow
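The gating rule can be sketched as follows; this is an illustrative Python sketch in which the ordered results stand in for the per-workflow test sequence prespecified in Table 5, and the function name is ours.

```python
def apply_gating(ordered_results):
    """Walk the a priori ordered (test name, passed) results for one
    combination workflow. Tests run in the prespecified order; once a test
    fails, the tests that follow are reported as exploratory rather than
    confirmatory."""
    labels = []
    confirmatory = True
    for name, passed in ordered_results:
        labels.append((name, passed,
                       "confirmatory" if confirmatory else "exploratory"))
        if not passed:
            confirmatory = False
    return labels
```

For example, if the CDR non-inferiority test passes but the subsequent superiority test fails, any later test in that workflow’s sequence is reported as exploratory.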
The human reader workload was quantified for both the AI workflows and routine double reading, measured as the number of mammogram examinations read by the first, second and third (arbitration) readers plus the number of examinations additionally arbitrated after being flagged by the AI multiplied by the number of potential arbiters. The time taken for the additional arbitration was recorded in four categories: 0–30, 30–60, 60–120 and over 120 s. At the center, first and second reads average 30–60 s, and arbitration cases average 60–120 s.
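The workload measure amounts to simple counting; a minimal Python sketch (the study’s analyses used R, and the variable names here are ours):

```python
def human_reader_workload(n_exams, n_arbitrated, n_ai_flagged, panel_size):
    """Total human reads: every examination is read by two readers,
    discordant examinations receive a third (arbitration) read, and each
    examination flagged by the AI is additionally reviewed by `panel_size`
    potential arbiters."""
    return 2 * n_exams + n_arbitrated + n_ai_flagged * panel_size
```

For a hypothetical 10,000 examinations with 300 arbitrations, 120 AI-flagged cases and a three-reader panel, this gives 20,000 + 300 + 360 = 20,660 human reads.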
Statistical analyses were performed in R (v.4.2.1)19. The R library ‘boot’ was used to perform bootstrapping and estimate the CIs20,21.
Predetermined sample size
The sample size was based on a non-inferiority test for the relative difference between the routine double reading workflow and the primary AI workflow in detecting screen-detected cancers. The agreement rate between the two workflows was expected to be 95.0%; 1.9% of the confirmed positives were expected to be detected only by the routine double reading workflow, while 3.0% were expected to be detected only by the AI workflow. Using a one-sided alpha of 0.05 and a non-inferiority margin of 10% relative to the routine double reading proportion, a sample size of 65 confirmed positives was determined before study commencement to provide a power of 91.5%. Achieving this sample size ensures that the power for the secondary endpoints of the AI workflow is at least 90%. Because of natural variation in the CDR and the time lag between mammography assessment and a potential cancer diagnosis, the study conclusion date was estimated so as to allow 65 confirmed positives to be reached, resulting in 106 confirmed positives in this sample.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

