Study 1: Standalone performance and integration feasibility
The first study was split into two phases. In the first phase, we conducted a large-scale, multicenter retrospective evaluation of the standalone performance of the AI system. In the second phase, we conducted a prospective, non-interventional deployment study to evaluate the feasibility and challenges of integrating a live system into real clinical workflows.
Phase 1: Multicenter standalone performance evaluation
The first, retrospective phase involved mammograms from 125,000 women (115,973 after applying inclusion/exclusion criteria) who were screened at five NHS screening services in the UK. The services covered three different clinical workflows, which varied by whether the second reader was blinded to the first reader and by how cases were selected for arbitration (see figure below). AI operating points (the score thresholds that determine how conservatively the AI flags cases) were set separately at each screening service to account for local differences in screening populations and workflows.
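To make the idea of a per-service operating point concrete, here is a minimal Python sketch. It assumes, purely for illustration, that each service's threshold is chosen as the lowest score cutoff that reaches a target specificity on that service's own cases; the selection rule used in the study is not described here and may well differ. The function names, site names, scores, and targets are all hypothetical.

```python
from collections import defaultdict


def pick_operating_point(scores, labels, target_specificity):
    """Return the lowest threshold whose specificity meets the target.

    A case is flagged when score >= threshold. This selection rule is an
    assumption for illustration, not the study's documented method.
    """
    if not 0.0 <= target_specificity <= 1.0:
        raise ValueError("target_specificity must be between 0 and 1")
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not negatives:
        raise ValueError("need at least one non-cancer case")
    # Try every observed score as a cutoff, then a cutoff that flags nothing.
    candidates = sorted(set(scores)) + [float("inf")]
    for threshold in candidates:
        true_negatives = sum(1 for s in negatives if s < threshold)
        if true_negatives / len(negatives) >= target_specificity:
            return threshold


def operating_points_by_site(cases, targets):
    """Pick one threshold per site from (site, score, label) tuples."""
    scores_by_site = defaultdict(list)
    labels_by_site = defaultdict(list)
    for site, score, label in cases:
        scores_by_site[site].append(score)
        labels_by_site[site].append(label)
    return {
        site: pick_operating_point(
            scores_by_site[site], labels_by_site[site], targets[site]
        )
        for site in scores_by_site
    }


# Made-up toy data (label 1 = cancer, 0 = no cancer); not study data.
toy_cases = [
    ("site_a", 0.1, 0), ("site_a", 0.2, 0), ("site_a", 0.3, 0), ("site_a", 0.9, 1),
    ("site_b", 0.1, 0), ("site_b", 0.4, 0), ("site_b", 0.6, 1), ("site_b", 0.8, 0),
]
toy_targets = {"site_a": 1.0, "site_b": 0.66}  # hypothetical per-site targets
toy_points = operating_points_by_site(toy_cases, toy_targets)
print(toy_points)
```

Because each service gets its own threshold, the same AI score can lead to a different flagging decision at two services, which is the point of adjusting for local populations and workflows.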
The primary endpoints of the study compared the sensitivity and specificity of the AI system in detecting cancer with those of the historical (original) first reader for each case. The study used a rigorous ground truth based on a 39-month follow-up window, which allowed us to study the AI system’s incremental benefit in detecting interval and next-round cancers long before they became clinically symptomatic. In addition to the primary endpoints, the study compared the AI system with second and consensus readers, and included lesion-level localization (whether the correct abnormality in the breast was identified) and fairness analyses. The lesion-level analysis addressed whether the AI system was localizing the precise regions of interest rather than relying on potentially spurious correlations. This phase was retrospective to enable validation of AI performance at large scale; it did not involve collecting additional interpretations from human readers or any prospective deployment.
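For readers less familiar with the two primary endpoints, the following short Python sketch shows how sensitivity and specificity are computed from recall decisions and ground-truth labels, and how the AI system and a first reader can be compared on the same cases. The function name and all values are made up for illustration; they are not study data or results.

```python
def sensitivity_specificity(decisions, truth):
    """Compute (sensitivity, specificity) for binary recall decisions.

    decisions: 1 if the reader (or AI) flagged the case, else 0.
    truth:     1 if cancer was confirmed within the follow-up window, else 0.
    """
    pairs = list(zip(decisions, truth))
    tp = sum(1 for d, t in pairs if d == 1 and t == 1)  # cancers flagged
    fn = sum(1 for d, t in pairs if d == 0 and t == 1)  # cancers missed
    tn = sum(1 for d, t in pairs if d == 0 and t == 0)  # normals cleared
    fp = sum(1 for d, t in pairs if d == 1 and t == 0)  # normals flagged
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one cancer and one non-cancer case")
    return tp / (tp + fn), tn / (tn + fp)


# Made-up toy example: eight cases, three with confirmed cancer.
toy_truth = [1, 1, 1, 0, 0, 0, 0, 0]
toy_first_reader = [1, 1, 0, 0, 0, 0, 0, 1]
toy_ai = [1, 1, 1, 0, 0, 0, 1, 1]

reader_sens, reader_spec = sensitivity_specificity(toy_first_reader, toy_truth)
ai_sens, ai_spec = sensitivity_specificity(toy_ai, toy_truth)
print(f"first reader: sensitivity={reader_sens:.2f}, specificity={reader_spec:.2f}")
print(f"AI system:    sensitivity={ai_sens:.2f}, specificity={ai_spec:.2f}")
```

In this toy example the AI flags one more cancer than the first reader but also recalls one more healthy case. Comparisons against the first reader have to weigh exactly this kind of trade-off.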

