
Convolutional neural network performance compared to radiologists in detecting intracranial hemorrhage from brain computed tomography: A systematic review and meta-analysis

  • Mia Daugaard Jørgensen
    Faculty of Health and Medical Sciences, Copenhagen University, Copenhagen, Denmark
  • Ronald Antulov
    Department of Radiology and Nuclear Medicine, Hospital of South West Jutland, University Hospital of Southern Denmark, Esbjerg, Denmark
    Department of Regional Health Research, Faculty of Health Sciences, University of Southern Denmark, Odense, Denmark
  • Søren Hess
    Department of Radiology and Nuclear Medicine, Hospital of South West Jutland, University Hospital of Southern Denmark, Esbjerg, Denmark
    Department of Regional Health Research, Faculty of Health Sciences, University of Southern Denmark, Odense, Denmark
  • Simon Lysdahlgaard (corresponding author: Haraldsgade 58, 1. Sal th, 6700 Esbjerg, Denmark)
    Department of Radiology and Nuclear Medicine, Hospital of South West Jutland, University Hospital of Southern Denmark, Esbjerg, Denmark
    Department of Regional Health Research, Faculty of Health Sciences, University of Southern Denmark, Odense, Denmark
Open Access. Published: November 24, 2021. DOI: https://doi.org/10.1016/j.ejrad.2021.110073

      Highlights

      • Convolutional neural networks show high sensitivity and specificity for detecting bleeds in patients with ICH.
      • Methodological issues and diversity of reference standards were found in the small number of included studies.
      • There is a need for larger studies including external datasets with more robust reference standards.

      Abstract

      Purpose

      To compare the diagnostic accuracy of convolutional neural networks (CNNs), with radiologists as the reference standard, in the diagnosis of intracranial hemorrhage (ICH) on non-contrast computed tomography of the cerebrum (NCTC).

      Methods

      PubMed, Embase, Scopus, and Web of Science were searched for the period from 1 January 2012 to 20 July 2020. Eligible studies included patients with and without ICH as the target condition undergoing NCTC, used deep learning algorithms based on CNNs, and used radiologists' reports as the minimum reference standard. Pooled sensitivities, pooled specificities and a summary receiver operating characteristics (SROC) curve were employed for the meta-analysis.

      Results

      5,119 records were identified through database searching. Title screening left 47 studies for full-text assessment and 6 studies for meta-analysis. Comparison of CNN performance with the reference standard in the retrospective studies yielded a pooled sensitivity of 96.00% (95% CI: 93.00% to 97.00%), a pooled specificity of 97.00% (95% CI: 90.00% to 99.00%) and an area under the SROC curve of 98.00% (95% CI: 97.00% to 99.00%); combining retrospective studies with studies using external datasets yielded a pooled sensitivity of 95.00% (95% CI: 91.00% to 97.00%), a pooled specificity of 96.00% (95% CI: 91.00% to 98.00%) and an area under the SROC curve of 98.00% (95% CI: 97.00% to 99.00%).

      Conclusion

      This review found the diagnostic performance of CNNs to be equivalent to that of radiologists in retrospective studies. Pooling out-of-sample external validation studies with the retrospective studies gave slightly worse CNN performance. There is a critical need for studies with a robust reference standard and external dataset validation.

      Abbreviations:

      AI (Artificial Intelligence), AUC (Area under the curve), CNN (Convolutional neural network), EDH (Epidural haemorrhage), ICH (Intracranial haemorrhage), IPH (Intraparenchymal haemorrhage), IVH (Intraventricular haemorrhage), NCTC (Non-contrast computed tomography of cerebrum), RNN (Recurrent neural network), SAH (Subarachnoid haemorrhage), SDH (Subdural haemorrhage), SROC (Summary receiver operating characteristics curve)

      1. Introduction

      Intracranial hemorrhage (ICH) is a potentially life-threatening condition and accounts for 2 million strokes worldwide [Qureshi et al., 2009], with an estimated incidence rate of approximately 25 per 100,000 person-years [van Asch et al.]. There are many underlying causes of ICH, which can be classified as primary (80–85%) or secondary (15–20%); the most common nontraumatic secondary causes include vascular malformation, ischaemic stroke and intracranial tumor [Elliott and Smith]. Hospital admissions due to ICH have increased in the last decade, mainly related to population aging together with increased use of blood-thinning agents and/or lack of blood-pressure control [Qureshi et al., 2007]. Based on the clinical presentation, an acute ICH may be difficult to distinguish from other diagnoses such as ischaemic stroke, and neuroimaging is therefore crucial for the diagnosis [Anderson et al.].
      A key part of the ICH diagnostic pathway is the timely completion of non-contrast computed tomography of the cerebrum (NCTC), the most readily available and rapid tool for diagnosing ICH. NCTC provides morphological information on basic ICH characteristics, such as location, extension into the ventricular system, edema, midline shift and development of mass effect [Panagos et al.]. However, increased use of NCTC may cause delays in the diagnosis of ICH, and an increasing workload may cause job-related stress and burnout in radiology departments, whereas artificial intelligence (AI) has been shown to affect radiology practice by decreasing workload [Glover et al.; Chetlen et al.; Codari et al.].
      The past decade has shown significant progress in the performance of machine learning models, including the translation and development of deep learning systems for medical image computer vision. The 2012 ImageNet Large Scale Visual Recognition Challenge was won by the convolutional neural network (CNN) AlexNet of Krizhevsky et al. [Krizhevsky et al.], which improved the error rate by 10.8 percentage points to a final 15.3%. In the years since, evidence has shown that CNNs can be used for medical imaging classification tasks, e.g. identifying diabetic retinopathy [Gulshan et al.] and classifying skin lesions as benign or malignant with specialist-physician accuracy [Esteva et al.]. CNNs have made noteworthy progress across the entire field of medical imaging, with computed tomography (CT) among the most studied imaging modalities [Domingues et al.]. Developing a CNN with an accurate algorithm for imaging examinations requires an appropriate model, a large volume of accurately labelled data for algorithm training, and a final algorithm that generalizes to unfamiliar data [Litjens et al.]. Liu et al. compared deep learning performance with that of health-care professionals in different clinical domains, including breast cancer, skin cancer and hepatology, and found a pooled sensitivity of 87.00% (95% CI: 83.00% to 90.20%) and a pooled specificity of 92.50% (95% CI: 85.10% to 96.40%). CNNs have the potential to identify ICH seconds after the scan is performed, decreasing the delay before the scan is interpreted by the radiologist [O'Neill et al.].
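As a purely illustrative sketch of what such a CNN classification setup can look like (not the pipeline of any included study), the snippet below fine-tunes a torchvision ResNet-18 backbone for binary ICH classification of single NCTC slices; the one-channel input handling, loss and hyper-parameters are assumptions made for the example.

```python
import torch
import torch.nn as nn
from torchvision import models

# Illustrative only: an ImageNet-pretrained ResNet-18 repurposed to emit a
# single ICH logit per CT slice. Replacing conv1 discards its pretrained
# weights; real pipelines handle windowing/channels in various other ways.
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)  # 1-channel CT slice
model.fc = nn.Linear(model.fc.in_features, 1)                                   # single ICH logit

criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_step(slices: torch.Tensor, labels: torch.Tensor) -> float:
    """One gradient step on a batch of (N, 1, H, W) slices with 0/1 labels."""
    optimizer.zero_grad()
    logits = model(slices).squeeze(1)
    loss = criterion(logits, labels.float())
    loss.backward()
    optimizer.step()
    return loss.item()
```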
      The purpose of this systematic review and meta-analysis was to critically appraise the evidence of CNNs in per-patient diagnosis of ICH on NCTC, considering issues of study design, reporting and clinical value. To our knowledge, no previous meta-analysis has assessed if CNNs can reliably diagnose ICH on NCTC with radiologists as the reference standard.

      2. Methods/materials

      PubMed, Embase, Scopus, and Web of Science were searched by M.D.J. and S.L. for articles published between 1 January 2012 and 20 July 2020. The search was first performed for the period 1 January 2012 to 1 April 2020 and then updated to 20 July 2020. Search strings were documented (Suppl 1), and the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) were followed [McInnes et al.]. The Rayyan QCRI online platform was used for article screening [Ouzzani et al.]. Screening included removing duplicates and excluding non-relevant studies by title screening. Full-text versions of eligible articles were retrieved; when the full text was not accessible, the authors were contacted. Eligible articles were assessed independently by M.D.J. and S.L. according to pre-defined inclusion criteria, and any conflicts were resolved by consensus.
      Original studies had to fulfill all of the following pre-defined inclusion criteria to be eligible: a) patients undergoing NCTC for the detection of ICH, including intraparenchymal haemorrhage (IPH), epidural haemorrhage (EDH), subdural haemorrhage (SDH), subarachnoid haemorrhage (SAH) and intraventricular haemorrhage (IVH); b) radiologists' reports or clinical reports used as the reference standard (cf. below); and c) application of CNN algorithms for ICH detection. The publication period was censored at 1 January 2012 on the basis of the recognized change in deep learning performance in the ImageNet classification challenge [Krizhevsky et al.].
      Based on the current literature, the reference standard (ground-truth labelling) was considered acceptable when the study included detailed and specific definitions for annotations. The minimum acceptable reference standard was manual, semi-automated or automated image labelling extracted from radiology reports or electronic health records using natural language processing or recurrent neural networks (RNNs). Labelling by independent readers was also acceptable when the number of human annotators and their qualifications were specified, including a detailed description of the annotation workflow [Willemink et al.].
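For illustration only, a minimal sketch of rule-based report labelling in the spirit of the minimum reference standard described above; the keyword and negation patterns are assumptions, and the included studies used natural language processing or RNN models rather than simple keyword matching.

```python
import re

# Hypothetical patterns; real pipelines need far richer negation and
# uncertainty handling than this sketch provides.
POSITIVE = re.compile(
    r"(intracranial|intraparenchymal|subdural|epidural|subarachnoid|"
    r"intraventricular)\s+h(a)?emorrhage|\bbleed\b",
    re.IGNORECASE)
NEGATION = re.compile(
    r"\bno (evidence of |acute )?(intracranial )?h(a)?emorrhage\b",
    re.IGNORECASE)

def label_report(report_text: str) -> int:
    """Return 1 if the report text suggests ICH, 0 otherwise (weak label)."""
    if NEGATION.search(report_text):
        return 0
    return 1 if POSITIVE.search(report_text) else 0

print(label_report("No evidence of intracranial haemorrhage."))               # 0
print(label_report("Acute subdural haemorrhage along the right convexity."))  # 1
```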
      Conference papers, editorials, commentaries, reviews, guidelines, book chapters, technical articles, and papers with insufficient reference standards were excluded.
      Data extraction was conducted with the Checklist for Artificial Intelligence in Medical Imaging (CLAIM) [Mongan et al.]. Risk of bias and applicability were assessed independently by M.D.J. and S.L. using a modified Quality Assessment of Diagnostic Accuracy Studies (QUADAS-2) tool [Whiting et al.] (Suppl 2); conflicts were resolved by consensus. Studies reporting diagnostic measures on a per-patient basis for ICH were included in the meta-analysis. When extraction of diagnostic accuracy measures and construction of contingency tables was not possible, we calculated the data from the available information or contacted the authors.
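As an illustration of the kind of back-calculation used when 2x2 tables were not reported directly, a small sketch assuming that sensitivity, specificity, and the numbers of ICH-positive and ICH-negative examinations are available; the function name and example values are hypothetical.

```python
def contingency_from_accuracy(sens: float, spec: float, n_pos: int, n_neg: int):
    """Back-calculate TP/FP/FN/TN from reported sensitivity, specificity and
    the number of ICH-positive / ICH-negative examinations (rounded counts)."""
    tp = round(sens * n_pos)
    fn = n_pos - tp
    tn = round(spec * n_neg)
    fp = n_neg - tn
    return tp, fp, fn, tn

# e.g. a hypothetical study reporting 95% sensitivity and 90% specificity
# on 200 positive and 300 negative scans:
print(contingency_from_accuracy(0.95, 0.90, 200, 300))  # (190, 30, 10, 270)
```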
      Meta-analyses were performed for sensitivity and specificity. To compare the CNNs in the retrospective studies with the reference standard, we used a unified hierarchical model developed for the meta-analysis of diagnostic accuracy studies and plotted summary receiver operating characteristics (SROC) curves. The SROC curve provides an estimate of the average sensitivity and specificity of the included studies with 95% confidence intervals [Harbord et al.]. We first investigated SROC curves for the studies with retrospective results, and then for the retrospective studies combined with studies including validation on an external dataset. Studies were considered retrospective when no external test data were used and prospective when external test data were used. A fixed-effects model was applied, and heterogeneity was assessed with the I-squared statistic. The significance level was 5%. Statistical analyses and presentations were done using STATA/MP 15.0 (StataCorp).
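The published analysis used a hierarchical model in Stata; purely as an illustration of the simpler building blocks (logit-scale fixed-effect pooling and the I-squared heterogeneity statistic), a minimal sketch with hypothetical study values is given below.

```python
import numpy as np

def pool_logit(p, n):
    """Fixed-effect pooling of proportions (e.g. per-study sensitivities) on the
    logit scale, returning the pooled proportion and the I-squared statistic.
    p: per-study proportions; n: corresponding denominators. Proportions of
    exactly 0 or 1 would need a continuity correction, omitted here."""
    p = np.asarray(p, dtype=float)
    n = np.asarray(n, dtype=float)
    logit = np.log(p / (1 - p))
    var = 1.0 / (n * p * (1 - p))                # delta-method variance of logit(p)
    w = 1.0 / var                                # inverse-variance weights
    pooled_logit = np.sum(w * logit) / np.sum(w)
    q = np.sum(w * (logit - pooled_logit) ** 2)  # Cochran's Q
    df = len(p) - 1
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    pooled = 1.0 / (1.0 + np.exp(-pooled_logit))
    return pooled, i2

# Three hypothetical studies with sensitivities 0.98, 0.94 and 0.96:
sens, i2 = pool_logit([0.98, 0.94, 0.96], [200, 491, 299])
print(f"pooled sensitivity = {sens:.3f}, I-squared = {i2:.1f}%")
```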

      3. Results

      The flowchart in Fig. 1 shows 5,119 identified records, from which 6 studies based on a total of 380,382 NCTCs were included. The PRISMA checklist is presented in Suppl 3. Table 1 presents the characteristics of the included studies, and the data extraction chart is presented in Suppl 4 [Arbabshirani et al.; Chang et al.; Chilamkurthy et al.; Grewal et al.; Kuo et al.; Lee et al.; Ye et al.; Nagendran et al.]. Risk of bias and the QUADAS-2 evaluation are presented in Table 2 and Fig. 2. The inter-rater agreement for QUADAS-2 was 85.71%, with a kappa value of 0.69 (p < 0.001).
      Table 1. Characteristics of included studies. Each entry lists: study design; dataset size; target condition; reference standard; CNN model type; training set size; validation set size; external test set size; sensitivity; specificity; AUC. Where a study reports two evaluations (e.g. retrospective and prospective, or two datasets), the two values are given in the same order, separated by a slash.
      • Arbabshirani et al. (2018): P; 46,583 CTs; ICH; clinical radiology reports; CNN; 37,074; 9,499; 347; 70.00%; 87.00%; 84.00%.
      • Lee et al. (2019): R / P; 1,300 CTs; ICH + subtypes; retrospective set labelled by consensus of five neuroradiologists, prospective set by consensus of three neuroradiologists; CNN (ensemble of VGG16, ResNet-50, Inception-v3 and Inception-ResNet-v2); 704; 200; 200 / 196; 98.00% / 92.40%; 95.00% / 94.90%; 99.30% / 96.10%.
      • Chilamkurthy et al. (2018): R (Qure25k) / R (CQ500); 313,318 CTs; ICH + subtypes; clinical radiology reports (Qure25k), consensus by three independent readers (CQ500); CNN transfer learning (ResNet18); 290,055; 21,095 / NA; NA / 491; 90.10% / 94.20%; 90.00% / 90.20%; 91.90% / 94.20%.
      • Kuo et al. (2019): P; 4,596 CTs; ICH + subtypes; consensus by two radiologists; CNN (ensemble); 4,396; NA; 200; 100.00%; 90.00%; 98.20%.
      • Chang et al. (2018): R / P; 11,021 CTs; ICH + subtypes; reports confirmed by one radiologist; Mask R-CNN + hybrid 3D/2D CNN; 10,159; NA; NA / 682; 97.10% / 95.10%; 97.50% / 97.30%; 98.40% / 97.20%.
      • Ker et al. (2019): R; 399 CTs; ICH; clinical reports; 3D CNN; 399; NA; NA; NA; NA; NA.
      • Ye et al. (2019): R; 2,836 CTs; ICH + subtypes; consensus by three radiologists; CNN (Sub-lab); 2,255; 282; 299; 98.00%; 99.00%; 99.00%.
      • Grewal et al. (2018): R; 329 CTs; ICH; clinical radiology reports with consultation from two radiologists; CNN (RADnet); 185; 67; 77; 88.60%; 72.70%; 81.80%.
      AUC: area under the curve. NA: not applicable (not reported in the study). CNN: convolutional neural network. ICH: intracranial haemorrhage. R: retrospective. P: prospective.
      Table 2. QUADAS-2 evaluation of risk of bias and applicability. Risk of bias is given as patient selection / index test / reference standard / flow and timing; applicability concerns as patient selection / index test / reference standard. Ratings are low, high or unclear risk.
      • Arbabshirani et al. (2018): risk of bias unclear / low / unclear / low; applicability unclear / unclear / low.
      • Lee et al. (2019): risk of bias low / low / low / low; applicability low / low / low.
      • Chilamkurthy et al. (2018): risk of bias low / unclear / unclear / low; applicability low / unclear / unclear.
      • Kuo et al. (2019): risk of bias low / low / low / unclear; applicability low / low / low.
      • Chang et al. (2018): risk of bias low / low / high / low; applicability low / low / high.
      • Ye et al. (2019): risk of bias unclear / low / low / high; applicability unclear / low / low.
      • Grewal et al. (2018): risk of bias unclear / unclear / low / low; applicability unclear / unclear / low.
      Arbabshirani et al. included 46,583 NCTCs from inpatients, outpatients and emergency patients for ICH detection, using a CNN architecture with five layers plus two fully connected layers. The reference standard was extracted from official clinical radiology reports: a trained research assistant, supervised by an experienced neuroradiologist, classified 25% of the data as ICH-positive or ICH-negative, and natural language processing was used for the remainder.
      A two-fold study was conducted by Chang et al., with a training cohort of 10,159 NCTCs and a test cohort of 682 NCTCs. The training cohort was used to develop a custom hybrid 3D/2D mask-based CNN architecture, and the trained network was then applied to the external validation data. Clinical reports were used to identify ICH-positive cases (IPH, EDH/SDH, and SAH) in both cohorts, based on the assessment of one board-certified radiologist.
      Chilamkurthy et al. retrospectively included patients with ICH (IPH, EDH/SDH and SAH) in the Qure25k and CQ500 datasets, comprising 21,095 and 491 scans, respectively. A ResNet18 CNN with slight modifications was used for the detection of ICH subtypes, with electronic clinical reports as the reference standard for the Qure25k dataset and consensus by three radiologists for the CQ500 dataset. Patients younger than 7 years were removed from the Qure25k dataset, and the rest was used for training.
      Grewal et al. used data from two hospitals to obtain a dataset of 329 NCTCs for training, validation and testing. Their final Recurrent Attention DenseNet (RADnet) architecture performed binary classification of NCTCs with or without ICH. A web-based annotation tool was used for slice-level reference standard labelling of the training and validation datasets, performed by consensus of two senior radiologists in correlation with the patients' medical history.
      Over a 7-year period, Kuo et al. constructed a dataset of 4,396 CT scans with pixelwise labels for ICH confirmed by two American Board of Radiology-certified radiologists. Data preprocessing consisted of removing skull and facial bones while retaining the intracranial structures for network training. An external validation set of 200 CT scans, collected over a two-month period, was used to test the fully trained CNN.
      Lee et al. included patients with and without acute ICH (IPH, EDH/SDH, SAH and IVH), resulting in 704 cases for algorithm training and 200 cases for validation. An additional dataset was retrospectively and non-consecutively collected from the same database, resulting in 100 ICH-positive and 100 ICH-negative cases for testing. Five subspecialty board-certified neuroradiologists annotated each slice during the development phase. Over a 4-month period, another dataset of 195 cases with and without ICH was annotated independently by two neuroradiologists, with consensus adjudication by a third neuroradiologist.
      Three hospitals contributed 2,102, 511 and 516 CT scans, respectively, to the retrospective study of Ye et al. Three independent radiologists annotated all images with ICH (IPH, EDH/SDH, SAH and IVH), with majority voting as the reference standard at both the per-patient and the per-slice level. The algorithm consisted of two components: a CNN component for extracting image slice features, followed by an RNN component that used the slice features to generate an ICH probability.
      Grewal et al. and Ker et al. were excluded from the meta-analysis because of insufficient diagnostic measures and a high risk of bias, respectively. Fig. 3 shows the sensitivities and specificities for the retrospective data, and Fig. 4 shows the SROC curve for the same data. Fig. 5 and Fig. 6 present the pooled sensitivity, pooled specificity and SROC curve for the retrospective studies combined with the studies using external dataset validation; Fig. 5 indicates for each study whether the data come from an external validation test set or a retrospective set. Funnel plots for the sensitivities and specificities, which may suggest publication bias, are found in Suppl 5–8. Comparing CNN performance with the reference standard in the retrospective studies found a pooled sensitivity of 96.00% (95% CI: 93.00% to 97.00%), a pooled specificity of 97.00% (95% CI: 90.00% to 99.00%) and an area under the SROC curve of 98.00% (95% CI: 97.00% to 99.00%); combining the retrospective and external validation set studies found a pooled sensitivity of 95.00% (95% CI: 91.00% to 97.00%), a pooled specificity of 96.00% (95% CI: 91.00% to 98.00%) and an area under the SROC curve of 98.00% (95% CI: 97.00% to 99.00%). It should be noted that the I-squared analysis showed considerable heterogeneity between studies, with an I-squared of 99.07 (95% CI: 98.78 to 99.35) for the retrospective studies and 99.96 (95% CI: 99.95 to 99.96) for the pooling of retrospective and external validation test sets.
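For readers who want to reproduce this kind of funnel plot, a minimal sketch plotting per-study effect estimates (here logit sensitivities) against their standard errors; the values are illustrative placeholders, not the data extracted from the included studies.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical per-study logit sensitivities and standard errors.
effect = np.array([3.9, 2.8, 3.2, 3.5])
se = np.array([0.45, 0.20, 0.30, 0.55])

pooled = np.average(effect, weights=1 / se**2)    # fixed-effect pooled estimate
se_grid = np.linspace(0.01, se.max() * 1.1, 100)

fig, ax = plt.subplots()
ax.scatter(effect, se)
ax.plot(pooled - 1.96 * se_grid, se_grid, "k--")  # pseudo 95% confidence limits
ax.plot(pooled + 1.96 * se_grid, se_grid, "k--")
ax.axvline(pooled, color="k", linewidth=0.8)
ax.invert_yaxis()                                 # most precise studies at the top
ax.set_xlabel("logit sensitivity")
ax.set_ylabel("standard error")
plt.show()
```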
      Fig. 3. Pooled sensitivities and specificities of studies with retrospective data.
      Fig. 4. Summary receiver operating characteristics of studies with retrospective data.
      Fig. 5. Pooled sensitivities and specificities of studies with retrospective and prospective data. P: prospective; R: retrospective.
      Fig. 6. Summary receiver operating characteristics of studies with retrospective and prospective data.

      4. Discussion

      Our meta-analysis suggests that CNN algorithms accurately detect ICH based on retrospective data alone [Chang et al.; Chilamkurthy et al.; Lee et al.; Ye et al.], as well as in our combined exploratory analysis of retrospective studies and studies with external validation datasets [Arbabshirani et al.; Chang et al.; Kuo et al.; Lee et al.]. However, our results are still too limited for firm conclusions, and several limitations need to be addressed.
      The total number of included examinations reached 380,382 NCTCs, with the majority (313,318 examinations) collected from one study [Chilamkurthy et al.]. Four key conclusions were established in this review. Firstly, all included studies had non-randomised retrospective designs, and only four studies tested CNN performance on external datasets. Algorithm performance compared with clinicians, or comparison between clinicians with and without algorithm support, is crucial but difficult to evaluate in the artificial in silico context of these types of studies [Nagendran et al.]. The high area under the curve (AUC) of the included studies might not reflect the real clinical benefit of the algorithms, analogous to surrogate end points in clinical trials [Fleming]. Liu et al. pooled 14 out-of-sample external validation studies, restricting the analysis to the highest reported accuracy, and found a pooled sensitivity of 87.00% (95% CI: 83.00% to 90.20%) and a pooled specificity of 92.50% (95% CI: 85.10% to 96.40%) for deep learning models compared with the reference standard. However, Liu et al. pooled studies with deep learning algorithms against their respective reference standards over a wide range of subspecialities, i.e. retinopathy, mammography and dermoscopy. Our pooling of retrospective studies together with studies using external validation datasets found a pooled sensitivity of 95.00% (95% CI: 91.00% to 97.00%) and a pooled specificity of 96.00% (95% CI: 91.00% to 98.00%), compared with 96.00% and 97.00%, respectively, for the retrospective studies alone. Retrospective studies using internal validation data that were also used to develop the model may therefore substantially overestimate algorithm performance, and without evaluation on external data an algorithm might cause unintended adverse events of misdiagnosis in clinical environments. Researchers should use external validation datasets and strive for false positive and false negative rates that are as low as reasonably achievable [Nagendran et al.]. Lee et al. and Chang et al. conducted studies with both retrospective and external validation data, and both found only small decreases in AUC from the retrospective to the external validation evaluation. External validation testing of algorithms is not necessary during algorithm development, but it is crucial for ensuring functional robustness and for final evaluations before implementation [Fleming].
      Secondly, huge datasets are needed to develop CNN algorithms for medical diagnosis or prediction; subjects should be recruited consecutively, according to explicitly defined eligibility criteria, to cover the entire spectrum of ICH and its subtypes. All studies included ICH subtypes except Arbabshirani et al., who only confirmed ICH detection without defining a subtype, which may be problematic as a final diagnosis for emergency department patients with suspected ICH [Shetty et al.]. Huge datasets are also needed to ensure generalizability for populations with substantial heterogeneity or subtle differences between imaging phenotypes; algorithm performance increases logarithmically with increasing training data volume, and a saturated sample size is needed for each variation and subtype of ICH [Sun et al.]. Another clinically important and relevant issue is the misdetection of ICHs that are hard to differentiate from bone, or of undetected microbleeds in trauma imaging; a possible solution would be the introduction of heat or saliency maps [Samek et al.]. Kuo et al. removed the skull and face from the NCTCs using a series of image processing techniques; in an external test set of 200 NCTCs they achieved a sensitivity of 100%, which could be due to easier bleeding detection when only intracranial structures are included. Exclusion or removal of patients due to image artifacts can also increase apparent algorithm performance, and patient-related CT artifacts, including metallic materials, patient motion and incomplete projections, are well known in NCTC [Barrett and Keat]. Additionally, the heterogeneity of the CT scanners and image reconstruction techniques used makes direct comparison between studies difficult.
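As an illustration of the kind of preprocessing Kuo et al. describe, a crude sketch of brain extraction by Hounsfield-unit thresholding and connected-component analysis is shown below; the thresholds and morphological steps are assumptions for the example, not the pipeline used in that study.

```python
import numpy as np
from scipy import ndimage

def crude_brain_mask(ct_hu: np.ndarray) -> np.ndarray:
    """Very rough brain extraction from a CT volume in Hounsfield units:
    keep the largest soft-tissue component and fill internal holes.
    Threshold values are illustrative only."""
    soft_tissue = (ct_hu > 0) & (ct_hu < 80)                # brain-window range
    soft_tissue = ndimage.binary_opening(soft_tissue, iterations=2)
    labels, n = ndimage.label(soft_tissue)
    if n == 0:
        return np.zeros_like(ct_hu, dtype=bool)
    sizes = ndimage.sum(soft_tissue, labels, index=range(1, n + 1))
    brain = labels == (int(np.argmax(sizes)) + 1)           # largest component
    return ndimage.binary_fill_holes(brain)

# Example use: set everything outside the mask to air (-1000 HU).
# masked = np.where(crude_brain_mask(volume), volume, -1000)
```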
      Thirdly, all studies based their reference standard classification on supervised labelling, except Arbabshirani et al., who used semi-supervised learning with labels derived from NCTC radiology reports. Medical imaging labelled by consensus of expert interpretation is considered the reference standard for ICH detection [Mongan et al.]. Radiology reports often contain unstructured text, which is not per se ideal as an algorithm substrate and therefore yields rather low-quality labels [Jing et al.]. This could explain the relatively low sensitivity, specificity and AUC obtained by Arbabshirani et al. Chang et al. identified cases positive for ICH and its subtypes (IPH, EDH/SDH, and SAH) from clinical reports, additionally confirmed by one board-certified radiologist who assessed the corresponding scans. The remaining studies used consensus between two and five radiologists for reference standard assignment. Furthermore, it is important to take into consideration the large discrepancy between what radiologists visually perceive and what they clinically report, as shown by Olatunji et al. in a dataset of 1,000 chest x-rays.
      Fourthly, the CNN architectures varied greatly between studies, from ensemble learning and transfer learning to custom neural networks. Chilamkurthy et al. used ResNet18 with slight modifications, increasing the number of fully connected layers from one to five, and Lee et al. ensembled a final CNN consisting of VGG16, ResNet-50, Inception-v3 and Inception-ResNet-v2. Standard ImageNet architectures such as ResNet18 have been shown not to improve performance significantly compared with smaller and simpler CNNs [Zhang et al.]; however, averaging several models in an ensemble can achieve better performance than using single transfer models. Chang et al. used a custom Mask R-CNN architecture for the detection and segmentation of ICHs, with a pyramid-featured hybrid 3D/2D CNN as the backbone network. Fine-tuned 3D networks can be applied to medical images and have shown excellent performance in different domains; however, 3D networks require a large number of training parameters and a large dataset, in which the image depth can vary from 20 to 400 slices per scan, and they are more demanding in terms of computational efficiency [Singh et al.].
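A minimal sketch of the ensemble-averaging idea discussed above, using two torchvision ResNet backbones as stand-ins for the architectures reported by Lee et al.; the single-logit heads and the simple probability averaging are illustrative assumptions, not that study's implementation.

```python
import torch
import torch.nn as nn
from torchvision import models

def binary_head(backbone: nn.Module) -> nn.Module:
    """Replace the backbone's final layer with a single ICH logit (illustrative)."""
    backbone.fc = nn.Linear(backbone.fc.in_features, 1)
    return backbone

# Stand-in ensemble members; each expects 3-channel image tensors.
members = [binary_head(models.resnet18(weights=None)),
           binary_head(models.resnet50(weights=None))]

@torch.no_grad()
def ensemble_probability(x: torch.Tensor) -> torch.Tensor:
    """Average the sigmoid outputs of all members for a batch of images."""
    probs = [torch.sigmoid(m.eval()(x)) for m in members]
    return torch.stack(probs).mean(dim=0)

# Example: a batch of 4 inputs of size 224x224 -> a (4, 1) tensor of ICH probabilities.
print(ensemble_probability(torch.randn(4, 3, 224, 224)).shape)
```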
      Limitations of CNNs should be accounted for, but ICH detection by CNNs in routine studies may reduce the time to diagnosis, which is of critical clinical importance since nearly half of ICH-related mortality occurs in the first 24 h [O'Neill et al.; Elliott and Smith]. Additionally, building a clinical setting in which the radiologist is assisted by a CNN could increase diagnostic efficiency, and this should also be evaluated from a socioeconomic and patient perspective [Nagendran et al.]. Authors of CNN studies should appropriately report the development and evaluation of their models and be transparent in machine learning prediction studies to ensure correct and reproducible research [Moons et al.]. Another limitation is that the application of CNNs in clinical workflows requires an advanced setup integrated into the medical imaging systems.
      As with any comprehensive search, we might have missed relevant studies that could have been included, and we were limited to published data only. The guidelines we used to assess the non-randomised studies (CLAIM) were designed specifically for AI studies but, owing to their novelty, are not yet well established. Our focus was specifically on CNNs for detecting ICH overall, and it might not be appropriate to generalize our findings to other types of AI studies. Furthermore, the I-squared results indicate that pooling data from these studies might be inappropriate, which underlines the small number of external validation studies. As previously mentioned, comparison of these types of studies is difficult because of factors such as scanning techniques, scanner types, CNN architectures and reference standards, which reduces the validity and generalizability of the conclusions. Finally, with different levels of experience among the authors, some subjective bias may have been introduced into the assessment of articles.

      5. Conclusion

      The evidence for using CNNs in ICH detection remains promising, both for retrospective studies and for retrospective and prospective external test set studies combined, but it is limited by the small number of available studies. There is a critical need for more studies using external test sets with uniform methods, including a robust reference standard established by independent experts in unselected patient cohorts.

      CRediT authorship contribution statement

      Mia Daugaard Jørgensen: Conceptualization, Investigation, Resources, Writing – original draft, Writing – review & editing. Ronald Antulov: Methodology, Writing – original draft, Writing – review & editing, Supervision. Søren Hess: Methodology, Writing – original draft, Writing – review & editing, Supervision. Simon Lysdahlgaard: Methodology, Resources, Writing – original draft, Writing – review & editing, Supervision, Formal analysis, Visualization, Project administration.

      Declaration of Competing Interest

      The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

      Appendix A. Supplementary material


      References

        • Qureshi A.I.
        • Mendelow A.D.
        • Hanley D.F.
        Intracerebral haemorrhage.
        Lancet. 2009; 373: 1632-1644https://doi.org/10.1016/S0140-6736(09)60371-8
        • van Asch C.JJ.
        • Luitse M.JA.
        • Rinkel G.JE.
        • van der Tweel I.
        • Algra A.
        • Klijn C.JM.
        Incidence, case fatality, and functional outcome of intracerebral haemorrhage over time, according to age, sex, and ethnic origin: a systematic review and meta-analysis.
        Lancet Neurol. 2010; 9: 167-176https://doi.org/10.1016/S1474-4422(09)70340-0
        • Elliott J.
        • Smith M.
        The Acute Management of Intracerebral Hemorrhage.
        Anesthesia Analgesia. 2010; https://doi.org/10.1213/ANE.0b013e3181d568c8
        • Qureshi A.I.
        • Suri M.F.K.
        • Nasar A.
        • Kirmani J.F.
        • Ezzeddine M.A.
        • Divani A.A.
        • Giles W.H.
        Changes in cost and outcome among US patients with stroke hospitalized in 1990 to 1991 and those hospitalized in 2000 to 2001.
        Stroke. 2007; 38: 2180-2184https://doi.org/10.1161/STROKEAHA.106.467506
        • Anderson C.S.
        • Huang Y.
        • Wang J.G.
        • Arima H.
        • Neal B.
        • Peng B.
        • Heeley E.
        • Skulina C.
        • Parsons M.W.
        • Kim J.S.
        • Tao Q.L.
        • Li Y.C.
        • Jiang J.D.
        • Tai L.W.
        • Zhang J.L.
        • Xu E.n.
        • Cheng Y.
        • Heritier S.
        • Morgenstern L.B.
        • Chalmers J.
        Intensive blood pressure reduction in acute cerebral haemorrhage trial (INTERACT): a randomised pilot trial.
        Lancet Neurol. 2008; 7: 391-399https://doi.org/10.1016/S1474-4422(08)70069-3
        • Panagos P.D.
        • Jauch E.C.
        • Broderick J.P.
        Intracerebral hemorrhage.
        Emerg. Med. Clin. North Am. 2002; 20: 631-655https://doi.org/10.1016/S0733-8627(02)00015-9
        • Glover McKinley
        • Almeida R.R.
        • Schaefer P.W.
        • Lev M.H.
        • Mehan W.A.
        Quantifying the Impact of Noninterpretive Tasks on Radiology Report Turn Around Times.
        J Am Coll Radiol. 2017; 14: 1498-1503https://doi.org/10.1016/j.jacr.2017.07.023
        • Chetlen A.L.
        • Chan T.L.
        • Ballard D.H.
        • Frigini L.A.
        • Hildebrand A.
        • Kim S.
        • Brian J.M.
        • Krupinski E.A.
        • Ganeshan D.
        Addressing Burnout in Radiologists.
        Acad. Radiol. 2019; 26: 526-533https://doi.org/10.1016/j.acra.2018.07.001
        • Codari M.
        • Melazzini L.
        • Morozov S.P.
        • van Kuijk C.C.
        • Sconfienza L.M.
        • Sardanelli F.
        Impact of artificial intelligence on radiology: a EuroAIM survey among members of the European Society of Radiology.
        Insights Imaging. 2019; https://doi.org/10.1186/s13244-019-0798-3
        • Krizhevsky A.
        • Sutskever I.
        • Hinton G.E.
        ImageNet classification with deep convolutional neural networks.
        in: I: Advances in Neural Information Processing Systems. 2012https://doi.org/10.1145/3065386
        • Gulshan V.
        • Peng L.
        • Coram M.
        • Stumpe M.C.
        • Wu D.
        • Narayanaswamy A.
        • Venugopalan S.
        • Widner K.
        • Madams T.
        • Cuadros J.
        • Kim R.
        • Raman R.
        • Nelson P.C.
        • Mega J.L.
        • Webster D.R.
        Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs.
        JAMA J. Am. Med. Assoc. 2016; 316: 2402https://doi.org/10.1001/jama.2016.17216
        • Esteva A.
        • Kuprel B.
        • Novoa R.A.
        • Ko J.
        • Swetter S.M.
        • Blau H.M.
        • Thrun S.
        Dermatologist-level classification of skin cancer with deep neural networks.
        Nature. 2017; 542: 115-118https://doi.org/10.1038/nature21056
        • Domingues I.
        • Pereira G.
        • Martins P.
        • Duarte H.
        • Santos J.
        • Abreu P.H.
        Using deep learning techniques in medical imaging: a systematic review of applications on CT and PET.
        Artif. Intell. Rev. 2020; 53: 4093-4160https://doi.org/10.1007/s10462-019-09788-3
        • Litjens G.
        • Kooi T.
        • Bejnordi B.E.
        • Setio A.A.A.
        • Ciompi F.
        • Ghafoorian M.
        • van der Laak J.A.W.M.
        • van Ginneken B.
        • Sánchez C.I.
        A survey on deep learning in medical image analysis.
        Med. Image Anal. 2017; 42: 60-88https://doi.org/10.1016/j.media.2017.07.005
        • Liu X.
        • Faes L.
        • Kale A.U.
        • Wagner S.K.
        • Fu D.J.
        • Bruynseels A.
        • Mahendiran T.
        • Moraes G.
        • Shamdas M.
        • Kern C.
        • Ledsam J.R.
        • Schmid M.K.
        • Balaskas K.
        • Topol E.J.
        • Bachmann L.M.
        • Keane P.A.
        • Denniston A.K.
        A comparison of deep learning performance against health-care professionals in detecting diseases from medical imaging: a systematic review and meta-analysis.
        Lancet Digit Health. 2019; 1: e271-e297. https://doi.org/10.1016/S2589-7500(19)30123-2
        • O’Neill T.J.
        • Xi Y.
        • Stehel E.
        • Browning T.
        • Ng Y.S.
        • Baker C.
        • Peshock R.M.
        Active Reprioritization of the Reading Worklist Using Artificial Intelligence Has a Beneficial Effect on the Turnaround Time for Interpretation of Head CTs with Intracranial Hemorrhage.
        Radiol. Artif. Intell. 2020; https://doi.org/10.1148/ryai.2020200024
        • McInnes M.D.F.
        • Moher D.
        • Thombs B.D.
        • McGrath T.A.
        • Bossuyt P.M.
        • Clifford T.
        • Cohen J.F.
        • Deeks J.J.
        • Gatsonis C.
        • Hooft L.
        • Hunt H.A.
        • Hyde C.J.
        • Korevaar D.A.
        • Leeflang M.M.G.
        • Macaskill P.
        • Reitsma J.B.
        • Rodin R.
        • Rutjes A.W.S.
        • Salameh J.-P.
        • Stevens A.
        • Takwoingi Y.
        • Tonelli M.
        • Weeks L.
        • Whiting P.
        • Willis B.H.
        Preferred Reporting Items for a Systematic Review and Meta-analysis of Diagnostic Test Accuracy Studies: The PRISMA-DTA Statement.
        JAMA J. Am. Med. Assoc. 2018; 319: 388. https://doi.org/10.1001/jama.2017.19163
        • Ouzzani M.
        • Hammady H.
        • Fedorowicz Z.
        • Elmagarmid A.
        Rayyan-a web and mobile app for systematic reviews.
        Syst. Rev. 2016; 5. https://doi.org/10.1186/s13643-016-0384-4
        • Willemink M.J.
        • Koszek W.A.
        • Hardell C.
        • Wu J.
        • Fleischmann D.
        • Harvey H.
        • Folio L.R.
        • Summers R.M.
        • Rubin D.L.
        • Lungren M.P.
        Preparing medical imaging data for machine learning.
        Radiology. 2020; 295: 4-15. https://doi.org/10.1148/radiol.2020192224
        • Mongan J.
        • Moy L.
        • Kahn C.E.
        Checklist for Artificial Intelligence in Medical Imaging (CLAIM): A Guide for Authors and Reviewers.
        Radiol. Artif. Intell. 2020; 2: e200029. https://doi.org/10.1148/ryai.2020200029
        • Whiting P.F.
        • Rutjes A.W.S.S.
        • Westwood M.E.
        • Mallett S.
        • Deeks J.J.
        • Reitsma J.B.
        • Leeflang M.M.G.
        • Sterne J.A.C.
        • Bossuyt P.M.M.
        QUADAS-2: A revised tool for the quality assessment of diagnostic accuracy studies.
        Ann. Intern. Med. 2011; https://doi.org/10.7326/0003-4819-155-8-201110180-00009
        • Harbord R.M.
        • Whiting P.
        • Sterne J.A.C.
        • Egger M.
        • Deeks J.J.
        • Shang A.
        • Bachmann L.M.
        An empirical comparison of methods for meta-analysis of diagnostic accuracy showed hierarchical models are necessary.
        J. Clin. Epidemiol. 2008; 61: 1095-1103. https://doi.org/10.1016/j.jclinepi.2007.09.013
        • Arbabshirani M.R.
        • Fornwalt B.K.
        • Mongelluzzo G.J.
        • Suever J.D.
        • Geise B.D.
        • Patel A.A.
        • Moore G.J.
        Advanced machine learning in action: identification of intracranial hemorrhage on computed tomography scans of the head with clinical workflow integration.
        Npj Digit Med. 2018; 1. https://doi.org/10.1038/s41746-017-0015-z
        • Chang P.D.
        • Kuoy E.
        • Grinband J.
        • Weinberg B.D.
        • Thompson M.
        • Homo R.
        • Chen J.
        • Abcede H.
        • Shafie M.
        • Sugrue L.
        • Filippi C.G.
        • Su M.-Y.
        • Yu W.
        • Hess C.
        • Chow D.
        Hybrid 3D/2D convolutional neural network for hemorrhage evaluation on head CT.
        Am. J. Neuroradiol. 2018; 39: 1609-1616. https://doi.org/10.3174/ajnr.A5742
        • Chilamkurthy S.
        • Ghosh R.
        • Tanamala S.
        • Biviji M.
        • Campeau N.G.
        • Venugopal V.K.
        • Mahajan V.
        • Rao P.
        • Warier P.
        Deep learning algorithms for detection of critical findings in head CT scans: a retrospective study.
        Lancet. 2018; 392: 2388-2396. https://doi.org/10.1016/S0140-6736(18)31645-3
        • Grewal M.
        • Srivastava M.M.
        • Kumar P.
        • Varadarajan S.
        RADnet: Radiologist level accuracy using deep learning for hemorrhage detection in CT scans.
        In: Proceedings - International Symposium on Biomedical Imaging. 2018; https://doi.org/10.1109/ISBI.2018.8363574
        • Kuo W.
        • Häne C.
        • Mukherjee P.
        • Malik J.
        • Yuh E.L.
        Expert-level detection of acute intracranial hemorrhage on head computed tomography using deep learning.
        Proc. Natl. Acad. Sci. USA. 2019; 116: 22737-22745. https://doi.org/10.1073/pnas.1908021116
        • Lee H.
        • Yune S.
        • Mansouri M.
        • Kim M.
        • Tajmir S.H.
        • Guerrier C.E.
        • Ebert S.A.
        • Pomerantz S.R.
        • Romero J.M.
        • Kamalian S.
        • Gonzalez R.G.
        • Lev M.H.
        • Do S.
        An explainable deep-learning algorithm for the detection of acute intracranial haemorrhage from small datasets.
        Nat. Biomed. Eng. 2019; 3: 173-182. https://doi.org/10.1038/s41551-018-0324-9
        • Ye H.
        • Gao F.
        • Yin Y.
        • Guo D.
        • Zhao P.
        • Lu Y.
        • Wang X.
        • Bai J.
        • Cao K.
        • Song Q.
        • Zhang H.
        • Chen W.
        • Guo X.
        • Xia J.
        Precise diagnosis of intracranial hemorrhage and subtypes using a three-dimensional joint convolutional and recurrent neural network.
        Eur. Radiol. 2019; 29: 6191-6201. https://doi.org/10.1007/s00330-019-06163-2
        • Ker J.
        • Singh S.P.
        • Bai Y.
        • Rao J.
        • Lim T.
        • Wang L.
        Image thresholding improves 3-dimensional convolutional neural network diagnosis of different acute brain hemorrhages on computed tomography scans.
        Sensors. 2019; 19: 2167. https://doi.org/10.3390/s19092167
        • Nagendran M.
        • Chen Y.
        • Lovejoy C.A.
        • Gordon A.C.
        • Komorowski M.
        • Harvey H.
        • Topol E.J.
        • Ioannidis J.P.A.
        • Collins G.S.
        • Maruthappu M.
        Artificial intelligence versus clinicians: Systematic review of design, reporting standards, and claims of deep learning studies in medical imaging.
        BMJ. 2020; https://doi.org/10.1136/bmj.m689
        • Fleming T.R.
        Surrogate End Points in Clinical Trials: Are We Being Misled?.
        Ann. Intern. Med. 1996; 125: 605. https://doi.org/10.7326/0003-4819-125-7-199610010-00011
        • Shetty V.S.
        • Reis M.N.
        • Aulino J.M.
        • Berger K.L.
        • Broder J.
        • Choudhri A.F.
        • Kendi A.T.
        • Kessler M.M.
        • Kirsch C.F.
        • Luttrull M.D.
        • Mechtler L.L.
        • Prall J.A.
        • Raksin P.B.
        • Roth C.J.
        • Sharma A.
        • West O.C.
        • Wintermark M.
        • Cornelius R.S.
        • Bykowski J.
        ACR Appropriateness Criteria Head Trauma.
        J. Am. Coll. Radiol. 2016; 13: 668-679. https://doi.org/10.1016/j.jacr.2016.02.023
        • Sun C.
        • Shrivastava A.
        • Singh S.
        • Gupta A.
        Revisiting Unreasonable Effectiveness of Data in Deep Learning Era.
        arXiv:1707.02968 (2017).
        • Samek W.
        • Binder A.
        • Montavon G.
        • Lapuschkin S.
        • Müller K.-R.
        Evaluating the Visualization of What a Deep Neural Network Has Learned.
        arXiv:1509.06321 (2017).
        • Barrett J.F.
        • Keat N.
        Artifacts in CT: Recognition and avoidance.
        Radiographics. 2004; 24: 1679-1691. https://doi.org/10.1148/rg.246045065
        • Jing B.
        • Xie P.
        • Xing E.
        On the Automatic Generation of Medical Imaging Reports.
        arXiv:1711.08195 (2017).
        • Olatunji T.
        • Yao L.
        • Covington B.
        • Rhodes A.
        • Upton A.
        Caveats in Generating Medical Imaging Labels from Radiology Reports.
        arXiv:1905.02283 (2019).
        • Raghu M.
        • Zhang C.
        • Kleinberg J.
        • Bengio S.
        Transfusion: Understanding Transfer Learning for Medical Imaging.
        arXiv:1902.07208 (2019).
        • Singh S.P.
        • Wang L.
        • Gupta S.
        • Goli H.
        • Padmanabhan P.
        • Gulyás B.
        3D Deep Learning on Medical Images: A Review.
        Sensors. 2020; 20: 5097. https://doi.org/10.3390/s20185097
        • Elliott J.
        • Smith M.
        The acute management of intracerebral hemorrhage: A clinical review.
        Anesth. Analg. 2010; https://doi.org/10.1213/ANE.0b013e3181d568c8
        • Moons K.G.M.
        • Altman D.G.
        • Reitsma J.B.
        • Ioannidis J.P.A.
        • Macaskill P.
        • Steyerberg E.W.
        • Vickers A.J.
        • Ransohoff D.F.
        • Collins G.S.
        Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD): Explanation and Elaboration.
        Ann. Intern. Med. 2015; 162: W1-W73. https://doi.org/10.7326/M14-0698