Deep-Learning Based Semantic Labeling for 2D Mammography & Comparison of Complexity for Machine Learning Tasks
Paul H. Yi, MD; Abigail Lin, BSE; Jinchi Wei, BSE; Haris I. Sair, MD; Ferdinand K. Hui, MD; Gregory D. Hager, PhD; & Susan C. Harvey, MD
Radiology Artificial Intelligence Lab (RAIL), Johns Hopkins University School of Medicine & Malone Center for Engineering in Healthcare, Johns Hopkins University Whiting School of Engineering
Introduction
Large chest radiograph datasets have recently been released (arXiv 2017: 112,000 frontal CXRs!), and deep learning itself has proven "unreasonably useful" for working with such medical image datasets:
https://lukeoakdenrayner.wordpress.com/2018/04/30/theunreasonable-usefulness-of-deep-learning-in-medical-image-datasets/
Introduction
Radiologist PACS workflow depends on accurate semantic labeling. Although DICOM headers store metadata, their inclusion is inconsistent, varies between equipment manufacturers, and can be inaccurate. An automated method for semantic labeling could: 1) improve radiologist workflow; and 2) facilitate curation of medical imaging datasets for machine learning purposes.
How Many Images Do You Need?
arXiv 2015: ~5,000 images per class. J Digit Imaging 2017: this number may depend on the difficulty of the task.
Mammography as a Model?
Mammography is recommended annually for all women over age 40 by the ACR. Because of strict national regulations for DICOM labeling, mammography serves as a potential model for exploring the nuances of developing semantic labeling algorithms. Lessons learned can be applied to other modalities and to more complex but analogous problems!
Purpose
1. Develop deep convolutional neural networks (DCNNs) for automated classification of 2D mammography by: (a) view; (b) laterality; (c) breast density; and (d) normal/benign vs. malignant masses.
2. Compare the performance of the DCNNs across these tasks of varying complexity.
Methods: Dataset
Digital Database for Screening Mammography*: 3,034 2D mammography images (2,620 patients) from 4 US hospitals (MGH, Wake Forest, Sacred Heart, WUSTL), comprising normal cases and benign or malignant masses (pathology ground truth).
Labels: mammographic view (craniocaudal [CC] vs. mediolateral oblique [MLO]); breast laterality (right vs. left); breast density (4 BI-RADS categories); and normal/benign mass vs. malignant mass.
*Updated CBIS-DDSM: Rebecca Sawyer Lee, Francisco Gimenez, Assaf Hoogi, Daniel Rubin (2016). Curated Breast Imaging Subset of DDSM. The Cancer Imaging Archive.
Methods: Dataset
Data were split into training (70%), validation (10%), and testing (20%) sets.

| Label | Total (n = 3034) | Training (70%) | Validation (10%) | Testing (20%) |
|---|---|---|---|---|
| Mammographic view | CC: 1429 (47%); MLO: 1605 (53%) | CC: 1000; MLO: 1123 | CC: 143; MLO: 161 | CC: 288; MLO: 323 |
| Laterality | Left: 1560 (51%); Right: 1474 (49%) | Left: 1092; Right: 1032 | Left: 156; Right: 148 | Left: 314; Right: 296 |
| Breast density (BI-RADS) | A: 416 (14%); B: 1182 (39%); C: 928 (31%); D: 508 (16%) | A: 291; B: 827; C: 649; D: 355 | A: 42; B: 119; C: 93; D: 51 | A: 85; B: 238; C: 188; D: 104 |
| Benign vs. malignant | Benign: 1704 (56%); Malignant: 1330 (44%) | Benign: 1193; Malignant: 931 | Benign: 171; Malignant: 131 | Benign: 342; Malignant: 268 |

A = Fatty; B = Scattered fibroglandular; C = Heterogeneously dense; D = Dense
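One reasonable way to produce such a 70/10/20 split is a stratified two-stage split. This is an illustrative scikit-learn sketch with placeholder data, not the authors' code; the stratification and random seed are assumptions:

```python
# Two-stage 70/10/20 split, stratified so class proportions are preserved.
# The file paths and labels below are placeholders for illustration only.
from sklearn.model_selection import train_test_split

images = [f"img_{i}.png" for i in range(3034)]   # placeholder file paths
labels = [i % 2 for i in range(3034)]            # placeholder binary labels

# First carve off the 70% training set, then split the remaining 30%
# into validation and testing (1/3 vs. 2/3 of the remainder = 10%/20%).
train_x, rest_x, train_y, rest_y = train_test_split(
    images, labels, train_size=0.70, stratify=labels, random_state=42)
val_x, test_x, val_y, test_y = train_test_split(
    rest_x, rest_y, train_size=1/3, stratify=rest_y, random_state=42)
```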
Methods: DCNN
Transfer learning was used to train, validate, and test a ResNet-50 DCNN pretrained on ImageNet on these mammography datasets. The last fully connected layer was fine-tuned using these datasets. During each training epoch, images were augmented via rotations, cropping, and horizontal flipping.
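A minimal PyTorch sketch of this transfer-learning setup, assuming a 224 x 224 input size, a frozen backbone, and an Adam optimizer; the rotation range, learning rate, and other hyperparameters are illustrative assumptions, not the configuration reported here:

```python
# ResNet-50 transfer learning: freeze the ImageNet-pretrained backbone and
# fine-tune only a new final fully connected layer, as described above.
import torch
import torch.nn as nn
from torchvision import models, transforms

# Per-epoch augmentations: rotations, cropping, horizontal flips.
train_transform = transforms.Compose([
    transforms.RandomRotation(degrees=15),            # assumed rotation range
    transforms.RandomResizedCrop(224),                # random crop to ImageNet input size
    transforms.RandomHorizontalFlip(),                # omitted for the laterality task
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),  # ImageNet statistics
])

model = models.resnet50(pretrained=True)
for param in model.parameters():
    param.requires_grad = False                       # freeze backbone weights
num_classes = 2                                       # e.g., CC vs. MLO
model.fc = nn.Linear(model.fc.in_features, num_classes)

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()
```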
Methods
Receiver operating characteristic (ROC) curves with areas under the curve (AUCs) were generated. AUCs were compared between DCNNs using the nonparametric DeLong method (significance set at p < 0.05).
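A sketch of the ROC/AUC computation using scikit-learn on hypothetical scores; the DeLong comparison of AUCs is not part of scikit-learn and would require a separate implementation (e.g., roc.test in the R pROC package):

```python
# ROC curve and AUC from model scores; the labels and scores below are
# hypothetical stand-ins for a DCNN's test-set outputs.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])                     # hypothetical labels
y_score = np.array([0.1, 0.4, 0.85, 0.7, 0.6, 0.2, 0.9, 0.3])   # hypothetical softmax outputs

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(f"AUC = {roc_auc_score(y_true, y_score):.2f}")
```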
Methods
Heatmaps were created through class activation mapping: http://cnnlocalization.csail.mit.edu
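A minimal sketch of class activation mapping in the spirit of the method linked above (Zhou et al.), reconstructed for a ResNet-50; the placeholder input and the ImageNet-pretrained model are illustrative assumptions, not the authors' exact code:

```python
# Class activation mapping: weight the final convolutional feature maps by
# the fully connected weights of the predicted class, then upsample.
import torch
import torch.nn.functional as F
from torchvision import models

model = models.resnet50(pretrained=True).eval()
x = torch.randn(1, 3, 224, 224)                    # placeholder preprocessed image

features = {}
model.layer4.register_forward_hook(lambda m, i, o: features.update(maps=o))
with torch.no_grad():
    logits = model(x)
cls = logits.argmax(dim=1).item()                  # predicted class index

# (2048,) class weights x (2048, 7, 7) feature maps -> (7, 7) activation map
cam = torch.einsum("c,chw->hw", model.fc.weight[cls], features["maps"][0])
cam = F.relu(cam)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)   # normalize to [0, 1]
heatmap = F.interpolate(cam[None, None], size=(224, 224),
                        mode="bilinear", align_corners=False)[0, 0]
```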
Results: Simplest Task
The DCNN trained to classify mammographic view achieved an AUC of 1.0.
Results: Slightly More Difficult Task
The DCNN trained to classify breast laterality initially misclassified right and left breasts not infrequently (AUC 0.89, 77% accuracy). However, after discontinuing horizontal flips during augmentation, the AUC significantly improved to 1.0 (p < 0.0001)!
Results: Most Difficult Tasks
Classification of normal/benign vs. malignant masses proved more difficult: AUC of 0.75 (p < 0.0001 compared with both the view and laterality DCNNs). Similarly, breast density classification was less successful, with 68% accuracy.
Discussion
Semantic labeling DCNNs achieved an AUC of 1.0 for mammographic view and laterality (tasks with obvious differences) using 2,427 training/validation images. This echoes Lakhani et al. (J Digit Imaging 2017): "Deep convolutional neural networks perform rather well in distinguishing images that have many obvious differences, such as chest vs. abdominal radiographs (AUC = 1.00), and require only a small amount of training data." In that study, 45 chest and 45 abdominal radiographs were sufficient!
Discussion
The DCNNs were less successful at more complex tasks, likely owing to increased subtlety within these categories. Datasets for training high-performing DCNNs on more complex tasks will need to be larger than those used for simple tasks such as semantic labeling.
Discussion
More augmentation did not always improve performance! Laterality DCNNs demonstrated significantly improved performance when horizontal flips were omitted. This mirrors a prior observation in chest radiographs (ARRS 2018): "Interestingly, the network initially miscategorized large right for large left effusions; however, 100% accuracy was achieved for correct laterality identification after discontinuing horizontal flipping during data augmentation."
Discussion: Is More Better?
Canonical ML wisdom is to always perform data augmentation during training to decrease overfitting. However, augmentation should be performed thoughtfully! N.B. We don't yet know why certain techniques help or hurt.
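As a concrete illustration of task-aware augmentation, here is a hypothetical helper (the function name and task labels are ours, not from the poster) that omits horizontal flips only for the laterality task, where flipping destroys the label:

```python
# Choose augmentations per task: flipping a right breast yields a plausible
# left breast, destroying the laterality label, so flips are reserved for
# the tasks where they are label-preserving (view, density, masses).
from torchvision import transforms

def build_augmentation(task: str) -> transforms.Compose:
    """Assemble per-task training augmentations (task names are illustrative)."""
    ops = [transforms.RandomRotation(10), transforms.RandomResizedCrop(224)]
    if task != "laterality":
        ops.append(transforms.RandomHorizontalFlip())
    ops.append(transforms.ToTensor())
    return transforms.Compose(ops)

view_tf = build_augmentation("view")              # includes horizontal flips
laterality_tf = build_augmentation("laterality")  # no horizontal flips
```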
Discussion: How low can we go?
We know that DCNNs for simpler tasks require less data! But how low can we go?

| Work | Imaging Views | Total # Training & Validation Images | AUC |
|---|---|---|---|
| Rajkomar et al. J Digit Imaging. 2017. | Frontal vs. lateral chest X-ray | 150,000* | 1 |
| Yi et al. (current work) | CC vs. MLO mammography | 2,427 | 1 |
| Lakhani et al. J Digit Imaging. 2017. | Chest vs. abdomen X-ray | 90 | 1 |
| Future work? | | | |

*Augmented
Conclusions
DCNN semantic labeling of 2D mammography is feasible using relatively small image datasets. However, automated classification for more difficult tasks will likely require larger datasets. The risks of image augmentation also warrant consideration!
Thank you!