Artificial intelligence for imaging-based COVID-19 detection: systematic review comparing added value of AI versus human readers


 Purpose
 A growing number of studies have examined whether artificial intelligence (AI) systems can support imaging-based diagnosis of COVID-19-caused pneumonia, including gains in both diagnostic performance and speed. However, a combined appraisal of studies comparing human readers and AI is currently missing.
 
 Methods
 We followed PRISMA-DTA guidelines for our systematic review, searching EMBASE, PUBMED and Scopus databases. To gain insights into the potential value of AI methods, we focused on studies comparing the performance of human readers versus AI models or versus AI-supported human readings.
 
 Results
 Our search identified 1270 studies, of which 12 fulfilled specific selection criteria. Concerning diagnostic performance, in testing datasets reported sensitivity was 42-100% (human readers, n=9 studies), 60-95% (AI systems, n=10) and 81-98% (AI-supported readers, n=3), whilst reported specificity was 26-100% (human readers, n=8), 61-96% (AI systems, n=10) and 78-99% (AI-supported readings, n=2). One study highlighted the potential of AI-supported readings for the assessment of lung lesion burden changes, whilst two studies indicated potential time savings for detection with AI.
 
 Conclusions
 Our review indicates that AI systems or AI-supported human readings show less performance variability (interquartile range) in general, and may support the differentiation of COVID-19 pneumonia from other forms of pneumonia when used in high-prevalence and symptomatic populations. However, inconsistencies related to study design, reporting of data, areas of risk of bias, as well as limitations of statistical analyses complicate clear conclusions. We therefore support efforts for developing critical elements of study design when assessing the value of AI for diagnostic imaging.



Introduction
The field of medical imaging has seen rapid progress in recent years in the application of artificial intelligence (AI) methodologies, especially machine learning (ML) and deep learning (DL). 1 AI systems have progressed in image-recognition tasks relevant to disease diagnosis and detection, thus mimicking expert data interpretation capacities. 2 Recent studies have shown that AI-supported imaging technologies for specific diseases have a diagnostic performance comparable to medical experts. 3 According to the WHO, there had been over 247 million confirmed cases of COVID-19 globally as of November 3, 2021, and the pandemic had claimed over 5 million deaths worldwide. 4 COVID-19 diagnosis is largely based on reverse transcriptase-polymerase chain reaction (RT-PCR) testing, but problems related to test kit availability, test reliability, and test turnaround time have persisted in many countries. Additional fast, low-cost, and easily scalable tools for triaging and detecting patients with suspected COVID-19 are therefore crucial. 5 The COVID-19 pandemic has provided considerable momentum to this research area, with high expectations from the clinical community, which warrants an overview of the current evidence regarding the potential of AI methods to support the accurate and fast detection of COVID-19 pneumonia. However, the accurate differentiation between COVID-19 and pneumonia of other origins remains challenging, due to subtle radiologic differences, especially in asymptomatic patients and those with early onset of symptoms. 5 We performed a systematic review of peer-reviewed publications that used AI for the evaluation of lung imaging to support the detection of COVID-19, and compared findings between a selected AI system and human readers, or AI-supported readings versus human readers alone, to obtain a comprehensive view of the current status of published evidence. 4 To our knowledge, this is the first published systematic review to date on this focused topic.
Due to the large heterogeneity in the reporting of relevant results, the information was not amenable to quantitative data grouping or meta-analysis. We therefore focused our study on the following aims: (a) to compare the evidence from published studies that used AI methodologies for supporting the detection of COVID-19 in lung imaging and included a comparison of the performance of human readers versus AI models or versus AI-supported readings, (b) to examine the consistency of outcome reporting across the different studies, and (c) to assess the risk of bias in the published studies according to a standard appraisal tool.

Search strategy and selection criteria
Our systematic review was performed in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses of Diagnostic Test Accuracy (PRISMA-DTA), including the related checklist. 6,7 We used the systematic review methods outlined in the York Centre for Reviews and Dissemination's guidance for undertaking reviews in healthcare, specifically focusing on the guidance for systematic reviews of clinical tests. 8 The relevant population included COVID-19 pneumonia positive patients versus those with other pneumonia symptoms who tested negative for COVID-19. The intervention or index test was imaging, which was evaluated by both AI and human readers. As a comparator or reference standard, a positive PCR test was most commonly used, but in some cases it was supported by radiologists' findings and a clinical consensus. We were interested in outcome measures related to the support of COVID-19 detection, comparing AI and human readers, i.e. per-patient sensitivity and specificity, and related values such as the AUC (area under the receiver operating characteristic (ROC) curve). We also explored additional outcomes, such as the time required for diagnosis.
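As a minimal illustrative sketch (not taken from any of the reviewed studies), the per-patient diagnostic-accuracy measures named above can be computed from a 2×2 confusion matrix; the counts used here are hypothetical.

```python
# Illustrative sketch: per-patient diagnostic-accuracy measures from a
# 2x2 confusion matrix. All counts below are hypothetical examples.
def diagnostic_measures(tp, fp, fn, tn):
    """Return sensitivity, specificity, PPV and NPV as fractions."""
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    ppv = tp / (tp + fp)           # positive predictive value
    npv = tn / (tn + fn)           # negative predictive value
    return sensitivity, specificity, ppv, npv

# Hypothetical counts: 90 true positives, 10 false negatives,
# 20 false positives, 80 true negatives.
sens, spec, ppv, npv = diagnostic_measures(tp=90, fp=20, fn=10, tn=80)
print(round(sens, 2), round(spec, 2))  # 0.9 0.8
```

Note that, unlike sensitivity and specificity, PPV and NPV depend on the case mix of the sample, which is one reason per-patient comparisons across studies with different populations are difficult.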
We only included peer-reviewed studies in our review. The included studies had to focus on supporting the detection of COVID-19 with a lung-imaging modality and to apply an AI methodology, including ML and DL, for the analysis of the imaging outcomes. Furthermore, the studies had to include a comparison between human readers versus AI models or human readers versus AI-supported readings, and to report outcomes related to at least sensitivity and specificity, but ideally also further outcome measures. We excluded studies not focusing on lung imaging of COVID-19, for example those using ex-vivo imaging and pathology studies, and those focused on aspects related to segmentation, feature extraction, treatment, survival and disease risk prediction (see appendix, table 1 for the full list of selection criteria).
We conducted a systematic search of the literature including EMBASE, PubMed and Scopus databases for papers published from January 1, 2019 in the English language (see appendix, table 2 for search strings). The searches were performed on November 30, 2020. Additional papers were identified through automated database notifications of new publications according to our search terms until January 31, 2021. The reference lists from all included papers were checked to identify and include any other potential studies.
Two reviewers independently performed a screening of the citations by title and abstract, with discrepancies resolved by consensus. The citations identified in the systematic search were uploaded to EndNote reference manager and duplicates were automatically deleted.

Quality assessment
Two reviewers acquired the full-text versions of the included papers and independently assessed their methodological quality using the QUADAS-2 tool, which has been adapted to the systematic review objectives (see overview of signalling questions in appendix, table 3).
Any discrepancies were resolved through discussion between the reviewers. QUADAS-2 is an appraisal tool recommended for use in systematic reviews to evaluate the risk of bias in diagnostic accuracy studies. 9 QUADAS-2 consists of the following four domains: patient selection, index test, reference standard, and flow and timing. We used an additional domain of data management and assessed all five domains for risk of bias using relevant signalling questions. While QUADAS-2 is not intended to generate an overall score, the tool highlights a high, medium or unclear risk of bias according to the domains assessed. We excluded studies with a high risk of bias in at least two domains from the final analysis of our systematic review. In addition, one reviewer evaluated the methodologies applied for statistical analysis on all available datasets where appropriate performance measurements were reported in the included papers.

Data analysis
Two reviewers extracted the following information on study characteristics for each paper: study category, imaging modality, country where the imaging took place, experience of human readers, selection criteria in the original research studies, reference standards used and information on blinding, as well as sample sizes for different datasets and validation type (Table 1a). In addition, information on the AI method and the data source, together with the data acquisition time period, was extracted (Table 1b). Along with the original study authors' conclusions, the following outcome data were extracted per patient where applicable: values for sensitivity, specificity, accuracy, positive predictive value, negative predictive value and AUC. Where indicated, values for reading speed and potential time savings were also extracted (Table 2).

Results
The search strategy identified a total of 1261 articles on November 30, 2020, with a further six studies identified via database notifications of the same search until January 31, 2021.
Overall, 771 articles were screened for meeting the selection criteria. A total of 20 studies that met these criteria were assessed for the risk of bias with the QUADAS-2 tool. 5,10-28 Eight studies with a high risk of bias rating in at least two domains were excluded (figure 1; appendix, table 3). A total of 12 studies were included in our systematic review. 5,[16][17][18][20][21][22][23][24][26][27][28] (See figure 2 for the study selection, appendix, table 4 for the PRISMA-DTA checklist and appendix, table 5 for information on statistical analysis).

Study characteristics
We identified that seven studies used computed tomography (CT) as the main imaging modality, 16,[22][23][24][26][27][28] with the remaining five studies focusing on chest X-ray (CXR) imaging. 5,17,18,20,21 (See table 1a for details on study characteristics). All studies applied a DL model for their AI methodology that included neural networks (table 1b). In four studies, the largest group, China was indicated as the country where the imaging took place, 23,24,27,28 with two studies using imaging data from both China and the United States. 16,23 In terms of data source origins, imaging data came from various geographical locations, with a focus on patients from the United States in two studies, 17,18 two studies from the Netherlands, 20,26 one study from Italy, 21 and one study from Hong Kong. 5 There was considerable heterogeneity between the study designs, data collection and patient selection criteria applied throughout the studies, i.e. with different inclusion criteria for the selection of COVID-19 positive patients and the related characteristics required for imaging, as well as the use of automatic assessment scoring and subgroup analysis. For the reference standard indicated in the included papers, three studies used RT-PCR, 20,21,28 three studies applied consensus findings between readers, 13,16,18 two studies provided outcome values for both these options, 17,19 and in four studies the use of the reference standard was not sufficiently detailed, but included the above options (see table 1a). 5,23,26,27 All studies based patient selection on a diagnosis by RT-PCR or nucleic acid amplification test (NAAT). The total number of human readers ranged from two readers to a panel of ten, with practice levels ranging from less than five years 26 to over 30 years of experience in thoracic imaging. 20 In ten studies, various hospitals, academic centres and clinics were indicated as the data source. 
[16][17][18][21][22][23][24][26][27][28] Two studies used a mix of data from publicly available databases as well as from hospitals. 5,20 Total dataset sizes varied widely, ranging from 216 to 25,146 patients. Studies that used CXR as the imaging modality generally had larger sample sizes. All studies used differentiated datasets for training, testing and validation (table 3). Ten studies provided performance measurements for testing datasets, reporting different sample sizes between 18 and 2,193 patients. [16][17][18][20][21][22][23][24]26,27 Four studies reported values for validation, with sample sizes ranging between 18 and 910 patients. 5,[22][23]27 While three studies provided performance measurements for external validation, 22-23,27 one study did this for independent internal validation. 5 In eight of the 12 studies, men were overrepresented in the patient characteristics, 16,20,21,23,24,26-28 ranging from 54-59% of males 28,23 of the total number of patients, and up to 65.8% for specific datasets used for training. 24 Three studies reported equal numbers of males and females, 5,18,22 and one study, with 53%, indicated a larger representation of women in the patient characteristics. 17 Age values were reported in a heterogeneous way across the studies. Three studies reported mean ages for the total patient population, 17,20,28 ranging from 40.7 28 to 58 years. 17 Three studies reported mean ages for COVID and non-COVID groups, 14,16,19 with one study indicating a younger average age of 48 years in the COVID group vs. 62 years in the non-COVID group, 16 and two studies reporting more balanced ages for the two groups. 18,21

Performance outcomes
We focused on the performance data from 126 scenarios reporting sensitivity (118 for specificity) in the 12 studies, and 65 scenarios for AUC values in 11 studies, applying our grouping of the categories into human readers, AI models and AI-supported human readings, as visualised in the accompanying figures.

General findings
Regarding overall conclusions as stated by the authors of the 12 original studies, six reported a diagnostic performance of the AI model comparable to human readers, three focused on CT 26-28 and three with CXR as the imaging modality (figure 3 and table 2). 17,20,21 Two further studies using CXR indicated that their AI model outperformed human readers. 5,18 Compared to a reader-only approach, three studies reported that AI augmentation improved human readers' performance using CT as the imaging modality. 16,23,24 One study reported positive results for an AI system to aid radiologists in the assessment of changes in lung lesion burden on pairs of CT scans, with comparable performance of the AI model to human readers. 22

Validation datasets: sensitivity
Three of the studies reported sensitivity at 49-100% for human readers, 5,22,27 with four studies indicating ranges of 71.2-96.3% for AI models. 5,[22][23]27 No values were provided for AI-supported human readings. Median sensitivity was higher in AI models (79.4%) than for human readers (64.2%). The IQR of the above studies was lower in AI models (14.8%) than for human readers (46.4%).

Validation datasets: specificity
Three of the studies reported specificity at 72-97.9% for human readers, and 57.9-89.2% for AI models. 5,22,27 No values were provided for AI-supported human readings. Median specificity was 81% for AI models and 82.2% for human readers. The IQR of the above studies was lower in AI models (9.6%) than for human readers (18.7%).

Validation datasets: AUC
Two studies reported AUC ranges of 0.61-0.802 for human readers, 5,27 with three studies indicating values of 0.71-0.98 for AI models. 5,23,27 No values were provided for AI-supported human readings. Median AUC values were higher in AI models (0.87) compared to human readers (0.64). The IQR of the above studies was lower for AI models (0.05) than for human readers (0.121).
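The median and IQR comparisons in the three subsections above can be reproduced over any set of per-study performance values; the following sketch illustrates the computation with hypothetical numbers, not the review's actual per-study data.

```python
# Illustrative sketch: computing the median and interquartile range (IQR)
# over per-study performance values, as compared in the text above.
# The value lists below are hypothetical, not the review's data.
from statistics import median, quantiles

def median_and_iqr(values):
    """Return (median, IQR) of a list of performance values."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return median(values), q3 - q1

human_sens = [49.0, 64.2, 100.0]      # hypothetical per-study sensitivities (%)
ai_sens = [71.2, 77.0, 81.8, 96.3]

for label, vals in [("human readers", human_sens), ("AI models", ai_sens)]:
    med, iqr = median_and_iqr(vals)
    print(f"{label}: median={med:.1f}%, IQR={iqr:.1f}%")
```

A lower IQR for the AI models, as reported in the review, indicates less spread between studies, not necessarily better central performance; the median and IQR must therefore be read together.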

Comparison of performance between testing and validation datasets
Only scenarios for AI models and human readers were comparable between testing and validation datasets. AI-supported readings showed the highest median values, and in general, the medians for sensitivity, specificity and AUC values were higher in the testing datasets. The IQR for sensitivity was in general lower in the testing datasets, while being comparable for specificity. For the only comparable datasets, related to testing and external validation in two studies, the medians in the testing datasets were higher, showing an overall reduced performance in external validation (figure 5).

Time savings
AI-supported triage improved the efficiency of scan-to-fever-clinician triage at each hospital in the study by Wang and colleagues, with a median reduction in triage time ranging between 18.77 and 198.28 minutes across different hospitals. 22 Wehbe and colleagues reported that analysing a data subset with AI took approximately 18 minutes, compared to approximately 2.5-3.5 hours for each radiologist. 17

Discussion
Our study shows promising performance results for AI-supported imaging-based detection of COVID-19, specifically comparing the performance of AI and human readers; however, these need to be interpreted in the context of risks of bias. The medians of all performance values in the testing datasets were in general highest for AI-supported human readings, followed by AI models, and then human readers alone, as shown in Table 4. However, human readers alone reached the highest maximum, but also the lowest minimum values in testing datasets, with the latter especially notable for specificity. Variability of diagnostic performance, as measured by the IQR, was in general lower for AI systems or AI-supported readings. In addition, some studies reported time savings with AI models, as detailed in Table 2. Reporting related to speedier analysis with AI also included AI-supported triage, whereby time-to-triage was faster compared to a standard clinical workflow across different clinics. Notably, this implied an ideal scenario in which clinicians would respond instantly to AI notifications and could thereby potentially shorten the time to diagnosis, with multiple benefits for the isolation and treatment of affected patients. 22
In the validation datasets, there was an overall reduction in performance compared to the testing datasets in terms of median values, as shown in Figure 5. However, a detailed analysis of the same performance measurements across all studies was difficult, as most studies did not report all performance measurements disaggregated by dataset and differentiating all scenarios for human readers, AI models and AI-supported readings. There were also differences in how studies attempted to evaluate these merits, i.e. through a comparison of AI and human readers as well as AI-supported readings 16,23,24 versus human readings alone. 5,[17][18][20][21][22][26][27][28] The studies also applied different strategies regarding subgroup analysis, and the differentiation of training, (external) testing and (external) validation. To ensure the robustness of final results, future studies should clearly describe the splitting of all datasets for training, testing and validation, not only subsets of these (i.e. training and validation only). In addition, the documentation of relevant patient characteristics should be applied consistently for all datasets. A variety of different approaches were used for the scenarios of human readers, including different levels of experience, as well as thresholds and cut-off points (see figure 4). This heterogeneity in scenarios also applies to the AI models, which reported different model types, design features, classifiers and reference standards. Due to such heterogeneity in the study methodology and analysis applied in the different studies, and the resulting lack of comparability, it is difficult to draw firm conclusions regarding improved performance measurements with AI. 
There were also major differences in baseline characteristics between patient groups with COVID-19 and those with non-COVID-19 pneumonia, introducing possible selection bias through imbalances regarding gender and age, with an overrepresentation of younger males, including across different datasets within a study, as shown in table 1a. In addition, reporting on the details of patient characteristics for different cohorts varied; for example, there was limited information on patients who may have been immunosuppressed. 16 Notably, some CT-focused studies featured smaller sample sizes for certain datasets (mostly for validation), 23,27 and for patients with early-stage COVID-19. The limitations related to patient populations were further complicated by variations in image quality and heterogeneity in imaging acquisition and post-processing parameters, as shown in our QUADAS-2 assessment.
Future studies should focus on testing AI models with data from more populations and geographical areas. Imaging should also include more data and information disaggregated by ethnicity, not limited to Caucasian and Asian populations, 21 as well as older patients, with more equal gender representation to allow for further generalisation. To assess and validate the robustness of AI models, training with larger multi-centre datasets and consistent external validation are required, as well as more prospective study data and evaluations. 26,29 Overly positive interpretations of promising performance results attributed to AI models or AI-supported readings should be treated with caution, as there is a risk of overestimating results due to several potential confounding factors. For example, Chiu and colleagues have discussed how the use of RT-PCR as the ground truth for training models may not reflect the real performance of AI systems, since false-negative rates of RT-PCR have been reported to be as high as 30%. 5 In addition, AI systems are potentially susceptible to 'learning' dataset characteristics instead of disease classification, a bias described as 'shortcut learning'. 30 In these instances, models may rely on features that are not related to correct object classification for a disease pathology, relying instead on differences in the background, such as textual markers in obtained images, for example related to patient positioning. 31 Another area of attention relates to an intrinsic feature of AI, its "black box" character: it is difficult to rule out that an algorithm is using findings outside the lungs when discriminating for COVID-19. 5 The black-box problem underscores a further value of human readers, since AI results are difficult to explain. 27 Moreover, human analysis is required to rule out motion artefacts, which may cause errors when detecting COVID-19. 26
Our QUADAS-2 analysis highlighted the need to address possible risks of bias in future studies, especially the clear need for adequate and consistent descriptions of patient selection and the study population. Further areas of attention are clear information on the use of index tests, reference standards, and the flow and timing between these. In the context of imaging, detailed descriptions of image acquisition and processing methods are required.
In addition, clear descriptions of data sources are important. An analysis of the statistical methods applied in the studies revealed further methodological concerns, such as the inconsistent use of 95% confidence intervals for all performance values, or of p-values for interpreting lack of significance, as shown in Table 5 of the supplementary appendix. We found considerable limitations concerning the methods applied to compare performance between AI systems and human readers. For example, the selection of sample sizes can strongly influence performance measurements, and only one study analysed the threshold above which no performance gain could be obtained. 18 Several studies did not consistently apply an agreement index, using average values of reader performance instead. To enable appropriate assessment of inter-rater agreement, studies should focus on agreement rather than correlation indices. 32,33 In addition, future statistical analyses should consistently account for disease prevalence, for example by using negative predictive value and positive predictive value parameters, as well as positive and negative likelihood ratios, to reduce uncertainties regarding the validity of diagnostic tests, 34 with the latter ratios reported in only one of the 12 studies. 21 In this way, meaningful analysis of the overall merit of AI-supported imaging-based COVID-19 detection beyond high-prevalence populations could be ensured. 18 Future studies need to address concerns regarding the transparency, reproducibility, ethics and effectiveness of AI methodology, as well as specific reporting standards for AI studies. 35 This includes attention to critical elements required for coherent use of methodology, including study design, clear description of datasets and patient characteristics, as well as consistent application of assessment methods that analyse potential risks of bias.
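The prevalence argument above can be made concrete with a short sketch: the same sensitivity and specificity yield very different predictive values in high- versus low-prevalence populations, while likelihood ratios remain prevalence-independent. All numbers below are hypothetical illustrations, not values from the reviewed studies.

```python
# Illustrative sketch: predictive values depend on disease prevalence,
# whereas likelihood ratios do not. Numbers are hypothetical examples.
def predictive_values(sens, spec, prev):
    """PPV and NPV from sensitivity, specificity and prevalence (Bayes' rule)."""
    ppv = sens * prev / (sens * prev + (1 - spec) * (1 - prev))
    npv = spec * (1 - prev) / (spec * (1 - prev) + (1 - sens) * prev)
    return ppv, npv

def likelihood_ratios(sens, spec):
    lr_pos = sens / (1 - spec)   # LR+: factor by which a positive result raises the odds
    lr_neg = (1 - sens) / spec   # LR-: factor by which a negative result lowers the odds
    return lr_pos, lr_neg

sens, spec = 0.90, 0.85
for prev in (0.40, 0.02):        # high- vs. low-prevalence setting
    ppv, npv = predictive_values(sens, spec, prev)
    print(f"prevalence {prev:.0%}: PPV={ppv:.2f}, NPV={npv:.2f}")
print("LR+={:.1f}, LR-={:.2f}".format(*likelihood_ratios(sens, spec)))
```

With these hypothetical inputs, PPV collapses as prevalence falls, which is why performance claims from high-prevalence cohorts cannot be transferred directly to screening settings.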
We acknowledge certain limitations in our study. We only included peer-reviewed studies in English from the published literature, potentially increasing the likelihood of publication bias influencing our findings. We did not evaluate the methodology of the AI and DL approaches applied in the reported studies, which may further affect the risk of bias. Importantly, we were unable to provide a meta-analysis of the studies included in our review due to the large heterogeneity in the methodologies and results of the studies, precluding conclusions on the pooled diagnostic accuracy of AI versus human readers.

Conclusions
While the studies included in our systematic review reported promising results for AI, these need to be seen in the context of COVID-19 severity, overall disease prevalence in a population, as well as several risks of bias regarding the study methodologies applied. For early-onset patients and those who are asymptomatic or have mild disease, the performance of imaging may not be satisfactory. AI-supported imaging would therefore be most promising in high-prevalence areas, and as a support tool for rapid diagnosis to identify suspected patients as a priority within a triage setting. 17,21,28 At the same time, the involvement of human readers remains crucial.
Our study identified and highlighted several inconsistencies in data reporting and presentation, heterogeneity in the study methodologies and evaluation methods applied, certain areas of risk of bias (e.g. selection bias, "shortcut learning" bias) and limitations of statistical analyses.
While our study presented an overall promising potential of AI models for COVID-19 imaging diagnostic decision-making, this can only be confirmed by future studies with an improved and harmonised overall methodology that allows for reliable generalisability of results. In this context, we support efforts for the development and future implementation of a set of methodological 'critical elements' regarding study design and assessment methodology to facilitate data aggregation, comparability and conclusions on the possible added value of AI-based imaging systems for diagnostic decision-making.
Legend: Red colour indicates "Better performance of AI"; blue colour indicates "Better performance of human readers"; orange colour indicates "Increased performance of human readers with AI support"; grey colour indicates "Comparable performance of AI and human readers to differentiate COVID-19"; green colour indicates "Comparable performance of AI and human readers for lesion changes"; no colour and "n.a." indicate "Values in comparison not indicated".