Building an Ensemble System for Diagnosing Masses in Mammograms

Yu Zhang, Noriko Tomuro, Jacob Furst, Daniela Stan Raicu
College of Computing and Digital Media, DePaul University, Chicago, IL 60604, USA
{jzhang2, tomuro, jfurst, draicu}@cs.depaul.edu

ABSTRACT

Purpose. Classification of a suspicious mass (region of interest, ROI) in a mammogram as malignant or benign may be achieved using mass shape features. An ensemble system was built for this purpose and tested.

Methods. Multiple contours were generated from a single ROI using various parameter settings of the image enhancement functions for the segmentation. For each segmented contour, the mass shape features were computed. For classification, the dataset was partitioned into four subsets based on the patient age (young/old) and the ROI size (large/small). We built an ensemble learning system consisting of four single classifiers, where each classifier is a specialist trained specifically for one of the subsets. Each specialist is also the optimal classifier for its subset, selected from several candidate classifiers through preliminary experiments. In this scheme, the final diagnosis (malignant or benign) of an instance is the classification produced by the classifier trained for the subset to which the instance belongs.

Results. The ensemble system was tested on masses from the Digital Database for Screening Mammography (DDSM) from the University of South Florida and achieved a 72% overall accuracy. This ensemble of specialist classifiers performed better than the best single classifier trained without partitioning (56%).

Conclusion. An ensemble classifier for mammography-detected masses may provide superior performance to any single classifier in distinguishing benign from malignant cases.

Keywords: Mass Classification, Mass Segmentation, CADx, Ensemble Learning

1 Introduction

Breast cancer is the second leading cause of cancer-related deaths for women in the U.S., after lung cancer [1]. At present, the most effective method for the early detection of breast cancer is mammography screening. Many Computer-Aided Diagnosis (CADx) systems have been developed as a second opinion to assist radiologists in detecting or diagnosing abnormalities in mammography screening. Masses and microcalcifications are the two most common types of abnormalities associated with breast cancer. The research presented in this paper is part of an ongoing project to develop an image-based CADx system that classifies suspicious masses in mammograms as malignant or benign.

For radiologists, the shape and margin of masses are the two most important criteria in distinguishing malignant from benign masses [2]. In a CADx system for diagnosing masses in mammograms, segmentation separates a mass from its background and captures the shape and boundary of the mass. After segmentation, the contour of a mass is identified, and the shape features and spiculation level can be computed for classifying the mass as benign or malignant [3].

In this research, we present an ensemble system for classifying a suspicious mass region in a mammogram as malignant or benign by using mass shape features. Our CADx system processes a suspicious mass region (region of interest, ROI) in four stages: 1) mass segmentation, 2) feature extraction, 3) feature selection, and 4) classification. Figure 1 below depicts the schematic framework of our approach.

In our CADx system, multiple mass contours are extracted from each ROI image by applying multiple segmentations. Each segmentation uses a different image enhancement function, defined by a particular combination of parameter values. We call each such segmentation a weak segmentor, since no single set of image enhancement parameter values produces the optimal segmentation for all images. Then, for each segmented contour, we compute the mass shape features for classification. Finally, for classification, we partition the dataset into four subsets based on the patient age (young/old) and the ROI pixel size (large/small), and build an ensemble learning system consisting of four single classifiers, where each classifier is trained specifically for one of the subsets. Thus, the final diagnosis (malignant or benign) of an instance is the classification produced by the optimal classifier trained for the subset to which the instance belongs.

Fig. 1. Overall Framework of Our Approach

This paper is organized as follows: Section 2 reviews mass segmentation, mass CADx systems and ensemble learning; Section 3 describes the dataset, multiple segmentation, shape feature extraction and ensemble learning methods; Section 4 presents the experimental results; and Section 5 discusses the conclusions and future work.

2 Mass Segmentation and Mass CADx System

2.1 Mass Segmentation

Masses are thickenings of breast tissue which appear as lesions, with sizes ranging from 3mm to 30mm. The shape and margin of masses are two important criteria for distinguishing malignant from benign masses. Usually a mass with a poorly defined shape is more likely to be malignant than a well-circumscribed mass. The margin is the border of a mass. Ill-defined margins or spiculated lesions are much more likely to be malignant [2].

In a CADx system for classifying a mass, segmentation is an essential step: it separates a mass from its background and captures the contour (boundary) of the mass from a suspicious area. Then, the shape and spiculation features can be computed from the detected contour for mass classification. Previous studies have shown that improving the mass segmentation can significantly improve the accuracy of mass diagnosis [4].

Many mass segmentation methods for mammogram images have been developed. There are two common mass segmentation approaches: 1) region-based methods and 2) edge-based methods. In region-based methods, mass regions are iteratively grown by comparing all neighboring pixels and including the pixels that are similar to the respective regions. The similarity can be measured with intensity or pixel texture features. In edge-based methods, segmentation is commonly done by techniques based on edge detection.

Jiang et al. proposed a mass segmentation method which obtains initial segmented regions by a threshold based on the principle of maximum entropy [4]. Kupinski et al. developed two methods for segmenting lesions in mammographic images: a radial gradient index (RGI) algorithm and a probabilistic algorithm [5]. Mencattini et al. proposed a modified region-based mass segmentation procedure which optimized the similarity criteria [6]. Xu et al. developed a mass segmentation algorithm for two types of mass models. In their system, the Canny edge information of each pixel was computed, and the region growing technique was applied to merge adjacent regions [7]. Yuan et al. developed a method for mass lesion segmentation using a geometric active contour model [8]. Byrd et al. presented a comprehensive analysis to evaluate the performance of three existing digital mammography segmentation algorithms against the manual segmentation results produced by two radiologists [9].

Mammogram images have varied intensity contrast ranges and different noise levels. Also, patients have different breast density levels. These variability factors make it difficult to find one optimal segmentor which fits all mass images [10]. To alleviate this problem, we built multiple weak segmentors for each ROI by using various image enhancements. The results showed that using multiple weak segmentors is an effective method to generate a strong mass segmentation for mammograms [11].

2.2 CADx Systems for Mass Diagnosis in Mammograms

With varied feature extraction, feature selection and classification methods, many CADx systems have been developed to assist radiologists in diagnosing masses as benign or malignant. Domínguez et al. applied two segmentation methods to obtain two sets of mass contours, and the simplified contours were used to extract features [12]. In their CADx system, three classifiers (a Bayesian classifier, Fisher's linear discriminant, and a Support Vector Machine (SVM)) were used to classify masses as benign or malignant. In the work of Delogu et al., gradient-based segmentation was applied, and mass shape, size and intensity features were computed [13]. Finally, a neural network (NN) was applied to classify the mass type. Sampat et al. computed the Beamlet transform features, and applied these features to a K-Nearest Neighbor (K-NN) classifier to predict mass BI-RADS shape categories [14]. Ghosh et al. computed three categories of features (statistical, structural and grey level dependency) from suspicious areas, and applied a genetic algorithm for feature selection. In their CADx system, the selected features were fed to a NN classifier to diagnose suspicious areas in mammograms [15]. Zhang et al. compared a few classification and feature selection models for mass classification [16].

2.3 Ensemble Learning

An ensemble of classifiers is a set of individually trained classifiers whose predictions are combined to classify new instances. Previous research has shown that ensemble learning often achieves better classification accuracy than the individual classifiers that make up the ensemble [17, 18]. The most commonly used ensemble learning methods include bagging, stacked generalization (stacking) and boosting. In the bagging method, multiple subsets of the training data are formed, and each of these subsets is used to train a classifier (using the same classification algorithm for each subset). Finally, an aggregated predictor is generated from those classifiers [19]. Stacking uses several base-level classifiers to generate multiple predictions, and combines them to generate a final classification by using a meta-level classifier [20, 21]. The boosting method repeatedly runs a classification algorithm and generates a sequence of weak classifiers. In each iteration, a classifier is trained with greater weights assigned to the instances which were not correctly classified in the previous iteration, and lower weights assigned to those correctly classified [22].

In our previous research, we developed a content-based classification system which used BI-RADS features for classifying masses [23]. In that research, the ensemble learning partitioned the whole dataset into subsets based on the patient age and the mass shape category. For each subset, we tested several classifiers and selected the classifier which produced the highest accuracy as the optimal base classifier for the subset. In this research, we apply a similar ensemble learning approach to our image-based CADx system.
3 Data Description and Methodology

3.1 Data Description and Mass ROI Extraction

In this work, all mass ROI images were extracted from the Digital Database for Screening Mammography (DDSM) from the University of South Florida. DDSM is the largest publicly available resource for the mammogram analysis research community. In DDSM images, BI-RADS information is annotated for each abnormal region [24]. In DDSM, mammogram images are digitized by different scanners with different resolutions. For data consistency, all images in this research were collected from the same type of scanner at the same resolution. We use all mammograms from the LUMISYS scanner type, because this type of scanner digitized the largest number of cases in DDSM.

In DDSM, a suspicious region (ROI) is marked by experienced radiologists, and chain codes recorded in an overlay file indicate the location of the ROI in the mammogram. For each suspicious mass, we extracted a rectangular image as the mass ROI, which includes the suspicious mass and its surrounding area. In our study, we removed instances with extreme digitization artifacts (e.g. incorrectly ordered scan lines) and instances of extremely large size (over 2000 x 2000 pixels). We also removed instances with mixed BI-RADS descriptors and those ROI images which displayed only a portion of a mass. After this filtering, a total of 543 mass ROI images were left for this study, of which 272 were benign and 271 were malignant. Figures 2 and 3 show the distribution of the mass BI-RADS shape and margin features respectively.

Fig. 2. Mass BI-RADS shape distribution

Fig. 3. Mass BI-RADS margin distribution

3.2 Building Multiple Mass Segmentors

Mammogram images have varied intensity contrast ranges and noise levels. These variability factors make it difficult to select a single segmentor (one setting of parameters) that produces the optimal segmentation results for all images. To address this problem, we applied multiple segmentations, which we call weak segmentors, to each ROI image. For each mass ROI, k enhanced images are generated by applying various gamma corrections and Gaussian filters (k different settings of parameters). Then, from each enhanced image, we compute the energy descriptor of each pixel and extract an energy texture image. Finally, we use an edge-based segmentation method to detect the mass contour from each energy texture image, so that k segmentation results are generated for each mass ROI [10]. Each segmentation result is evaluated as successful or unsuccessful by its overlapping ratio. For a successful segmentation, the boundary of the detected region is used as the mass contour.

Table 1 shows examples of enhanced images and segmentations generated by three weak segmentors (using gamma corrections γ = 1, 2, 5, and Gaussian filter σ = 5) for the same ROI. In the segmentation examples, the green line is the mass contour identified by our edge-based segmentation, while the red line is the mass outline marked by the radiologist. Also note that the three weak segmentors individually produced overall successful segmentation rates of 66%, 73% and 77% respectively (with respect to the whole image set).

Table 1 Multiple Weak Segmentors

Weak Segmentor                 Segmentor 1     Segmentor 2     Segmentor 3
Image Enhancement              γ = 1, σ = 5    γ = 2, σ = 5    γ = 5, σ = 5
Successful Segmentation Rate   66%             73%             77%
Enhanced Image                 (image)         (image)         (image)
Segmentation Result            (image)         (image)         (image)
Segmentation Evaluation        Successful      Unsuccessful    Successful
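The enhancement step behind the three weak segmentors can be illustrated with a minimal sketch. It assumes the common power-law form of gamma correction on intensities normalized to [0, 1]; the paper does not spell out the exact formula, so this form, the 8-bit `max_value`, and the function names `gamma_correct` and `enhance_variants` are all illustrative assumptions. The Gaussian smoothing (σ = 5), the energy texture image, and the edge-based segmentation are omitted.

```python
def gamma_correct(image, gamma, max_value=255):
    """Power-law correction of a 2-D list of pixel intensities.

    Assumption: out = max_value * (in / max_value) ** gamma.
    max_value=255 is a hypothetical 8-bit range, not taken from the paper.
    """
    return [[max_value * (pixel / max_value) ** gamma for pixel in row]
            for row in image]


def enhance_variants(image, gammas=(1, 2, 5)):
    """Return one enhanced image per gamma setting (one per weak segmentor)."""
    return {gamma: gamma_correct(image, gamma) for gamma in gammas}


# Toy 1 x 3 "image": black, mid-gray, white.
variants = enhance_variants([[0, 128, 255]])
for gamma, enhanced in variants.items():
    print(gamma, [round(value, 1) for value in enhanced[0]])
```

Under this form, γ = 1 leaves the image unchanged, while larger γ values darken mid-range intensities and leave black and white fixed, which changes the contrast that the later edge detection sees.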

3.3 Mass Shape Feature Extraction

The shape and margin of masses are two important criteria for distinguishing malignant from benign masses. In this step, for each successfully segmented contour, we compute the mass shape features which measure the properties of a mass. The following 14 shape features are computed: area, convex, perimeter, circularity, compactness, solidity, convex, roughness, equivalent diameter, elongation, major axis length, minor axis length, eccentricity and extent [25]. Then, for each ROI image, we concatenate the shape features from the k weak segmentors and represent each mass instance by a total of 14*k shape features:

$\{\{f_{1,1}, f_{1,2}, \ldots, f_{1,14}\}, \{f_{2,1}, f_{2,2}, \ldots, f_{2,14}\}, \ldots, \{f_{k,1}, f_{k,2}, \ldots, f_{k,14}\}\}$

where $f_{i,j}$ ($1 \le i \le k$, $1 \le j \le 14$) denotes the value of the jth shape feature produced by the ith segmentor. For unsuccessful segmentations, no shape features can be computed, and their values are set to a default value so that they have no influence on classification.

3.4 Ensemble Learning

Previous research has shown that ensemble learning often achieves better classification accuracy than individual classifiers. In this research, we propose an ensemble learning scheme which partitions a dataset into several subsets and develops an optimal classifier for each subset. By applying the best classification algorithm to each subset, we expect the overall classification accuracy for the whole dataset to improve. In our previous research, we used ensemble learning with the data partitioned by patient age and mass BI-RADS shape feature, and achieved better performance than the best classification with no data partitioning [23]. In this study, using a similar approach, we computed the mean patient age and the mean ROI size as splitting thresholds, and partitioned the data into four subsets, which are displayed in Table 2.
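The 14*k feature representation from Section 3.3 can be sketched as follows. This is only an illustration: the paper does not state which default value is used for unsuccessful segmentations, so `DEFAULT_VALUE = 0.0` is a hypothetical placeholder, and `build_feature_vector` is an illustrative name.

```python
FEATURES_PER_CONTOUR = 14   # number of shape features per segmented contour
DEFAULT_VALUE = 0.0         # hypothetical; the paper only says "a default value"


def build_feature_vector(per_segmentor_features, default=DEFAULT_VALUE):
    """Concatenate the shape features of k weak segmentors into one vector.

    per_segmentor_features: list of length k; each entry is either a list of
    14 feature values, or None when that segmentation was unsuccessful.
    """
    vector = []
    for features in per_segmentor_features:
        if features is None:
            # Unsuccessful segmentation: fill the whole block with the default.
            vector.extend([default] * FEATURES_PER_CONTOUR)
        else:
            if len(features) != FEATURES_PER_CONTOUR:
                raise ValueError("expected 14 shape features per contour")
            vector.extend(features)
    return vector


# k = 3 segmentors; the second segmentation failed for this ROI.
instance = build_feature_vector([[1.0] * 14, None, [2.0] * 14])
print(len(instance))  # 42 features, matching 3 segmentors x 14 features
```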
Table 2 Partitioning into Four Subsets based on Patient Age and Mass ROI Size

Data Subset                  Patient Age (years)   Mass ROI Size (pixels)   Instances
Young Age, Small ROI Size    < 57                  < 643200                 184
Young Age, Large ROI Size    < 57                  >= 643200                87
Old Age, Small ROI Size      >= 57                 < 643200                 172
Old Age, Large ROI Size      >= 57                 >= 643200                100

Then, for each subset, we performed feature selection to remove potentially irrelevant or redundant features in order to improve classification. To select attributes, we first computed the Information Gain (IG) of all features and ranked them from high to low for each subset. IG is a measure of purity based on entropy, and indicates the amount of information an attribute gives: a larger IG means the attribute is more informative [26]. We then removed the features with very low IG (close to 0), since such features are likely to be nearly irrelevant for classification.

After feature selection, we selected an optimal classifier for each data subset. To do so, we ran three classification algorithms (Decision Tree, SVM and K-NN) and selected the one which produced the best accuracy as the optimal classifier for the subset. In this way, every selected classifier is a specialist trained specifically for a given population of instances that have certain characteristics, and the ensemble of those specialists forms a system which could diagnose masses more accurately over all instances in the whole dataset than one general classifier or an ensemble of general classifiers.
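The specialist scheme described above can be sketched as a routing table: each instance is assigned to a subset by the two thresholds from Table 2, and the classifier with the best accuracy on that subset makes the diagnosis. The function names, the dictionary layout, and the idea of representing a trained model as a plain callable are illustrative assumptions; the actual models in this study were built with Weka.

```python
AGE_THRESHOLD = 57           # mean patient age (years), from Table 2
ROI_SIZE_THRESHOLD = 643200  # mean ROI size (pixels), from Table 2


def subset_key(age, roi_size):
    """Map an instance to one of the four subsets of Table 2."""
    age_group = "young" if age < AGE_THRESHOLD else "old"
    size_group = "small" if roi_size < ROI_SIZE_THRESHOLD else "large"
    return (age_group, size_group)


def select_specialists(accuracy_by_subset):
    """Pick, per subset, the candidate classifier with the highest accuracy.

    accuracy_by_subset: {subset_key: {classifier_name: accuracy}}.
    Ties go to the first candidate listed, since max() keeps the first maximum.
    """
    return {key: max(scores, key=scores.get)
            for key, scores in accuracy_by_subset.items()}


def diagnose(instance, specialists, trained_models):
    """Classify an instance with the specialist of the subset it belongs to.

    trained_models: {subset_key: {classifier_name: callable(features) -> label}}
    (hypothetical stand-ins for trained classifiers).
    """
    key = subset_key(instance["age"], instance["roi_size"])
    model = trained_models[key][specialists[key]]
    return model(instance["features"])


# Toy usage with made-up accuracies and constant "models".
toy_accuracy = {("young", "small"): {"Decision Tree": 0.60, "SVM": 0.70}}
toy_models = {("young", "small"): {"Decision Tree": lambda f: "benign",
                                   "SVM": lambda f: "malignant"}}
chosen = select_specialists(toy_accuracy)
print(diagnose({"age": 45, "roi_size": 1000, "features": []}, chosen, toy_models))
```

Unlike bagging, boosting or stacking, no predictions are combined here: exactly one specialist is consulted per instance, so the quality of the partition determines how much the ensemble can gain.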

Note that we chose those three candidate algorithms because they have diverse characteristics. For example, Decision Tree is based on information gain; SVMs are known to be robust to noise; K-NN decides the classification based on local information. Also note that in this study, we used the well-known machine learning tool Weka [27] to build the classification models, and its 10-fold cross validation option for training and testing on all data (sub)sets.

4 Results and Discussion

In our experiment, we built three weak segmentors from different image enhancements (gamma corrections γ = 1, 2, 5, with a Gaussian filter using σ = 5 for all three gamma values). Those segmentors achieved successful segmentation rates of 66%, 73% and 77% respectively. A total of 42 mass shape features (3 segmentors x 14 shape features) were computed for each mass instance. Then, we partitioned the data into four subsets (Young age - small ROI, Young age - large ROI, Old age - small ROI and Old age - large ROI) as described in the previous section.

For feature selection, we computed the IG of all features in all subsets. In the Young age - large ROI subset, three shape features (solidity, eccentricity and elongation) from two segmentors were selected for classification. In the other three subsets, all features had the same IG value, so we kept all 42 features for classification.

For each subset, we applied three single classifiers (Decision Tree, SVM and K-NN) to find the optimal classifier. Table 3 shows the classification accuracies. Column (a), Overall Accuracy With Partition, gives the average of the subset accuracies weighted by the proportion of instances in each subset. Column (b) gives the classification accuracies without dataset partitioning. The ensemble learning with weighted classification accuracy performed better than the best classification (by one classifier, SVM) with no data partitioning (72% vs. 56%), and the difference was statistically significant (p < 0.05).
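The information gain ranking used for feature selection can be sketched as below. This is a minimal sketch of the standard entropy-based definition for a discrete attribute; it assumes the continuous shape features have already been discretized into bins, and the function names are illustrative rather than taken from Weka.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())


def information_gain(values, labels):
    """IG = H(labels) - sum over attribute values v of p(v) * H(labels | v).

    values: discretized attribute value per instance (assumption: binned).
    labels: class label per instance (e.g. "malignant" / "benign").
    """
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(group) / total * entropy(group)
                    for group in groups.values())
    return entropy(labels) - remainder


labels = ["malignant", "malignant", "benign", "benign"]
print(information_gain(["high", "high", "low", "low"], labels))  # fully informative
print(information_gain(["high", "low", "high", "low"], labels))  # uninformative
```

An attribute whose IG is close to 0, like the second one in the toy example, is the kind of feature that is removed before classification.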
Note that the overall accuracy is pulled down largely by the poor performance on the Old age - small ROI subset, whereas the three other groups achieved considerably better classification accuracies. This result is similar to our previous mass BI-RADS feature study [23], where the Old age regular shape subset had the worst classification performance. In our future work, we plan to investigate the data distribution and the classifications made by the classifiers for the Old age - small ROI subset.

Table 3. Classification Accuracies of Datasets Partitioned by Age and ROI Size

Subsets                  Young Age     Young Age       Old Age       Old Age       Overall Accuracy     Accuracy
                         Small ROI     Large ROI       Small ROI     Large ROI     With Partition (a)   No Partition (b)
Segmentor                γ = 1, 2, 5   γ = 1, 2        γ = 1, 2, 5   γ = 1, 2, 5                        γ = 1, 2, 5
Number of features       3x14 shape    2x3 shape       3x14 shape    3x14 shape                         3x14 shape
Instances                184           87              172           100           543                  543
Decision Tree            73%           74%             54%           78%           68%                  55%
SVM                      76%           64%             62%           83%           71%                  56%
K-NN (k=5)               71%           66%             53%           81%           66%                  51%
K-NN (k=15)              76%           64%             53%           82%           68%                  51%
The Optimal Classifiers  SVM, K-NN     Decision Tree   SVM           SVM                                SVM
The Best Accuracy        76%           74%             62%           83%           72%*                 56%

* Weighted accuracy computed from the best classifiers.
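The weighted overall accuracy in column (a) can be reproduced from the subset sizes and best per-subset accuracies reported in Table 3. The sketch below is only a worked check of that arithmetic; the variable and function names are illustrative.

```python
# Subset sizes and best accuracies as reported in Table 3.
subset_sizes = {
    "young_small": 184,
    "young_large": 87,
    "old_small": 172,
    "old_large": 100,
}
best_accuracy = {
    "young_small": 0.76,
    "young_large": 0.74,
    "old_small": 0.62,
    "old_large": 0.83,
}


def weighted_accuracy(sizes, accuracies):
    """Average of subset accuracies weighted by the number of instances."""
    total = sum(sizes.values())
    return sum(sizes[name] * accuracies[name] for name in sizes) / total


overall = weighted_accuracy(subset_sizes, best_accuracy)
print(f"{overall:.4f}")
```

With these inputs the result is about 0.725, in line with the reported 72% given that the per-subset accuracies in the table are themselves rounded to whole percentages.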

In this study, classification accuracy was used to measure the performance of the ensemble learning system. Sensitivity measures how reliably a system makes positive (malignant) identifications, and specificity measures how well a system makes negative (benign) identifications. For the ensemble learning, after data partitioning, some subsets became unbalanced (for example, in the Old age - large ROI subset, the majority of training instances are malignant). An unbalanced dataset could lead to poor specificity or sensitivity of classification. In our future work, we will perform data balancing for each subset to improve sensitivities and specificities. In addition to classification accuracy, we also plan to add ROC (receiver operating characteristic) analysis to evaluate the ensemble learning system.

5 Conclusions and Future Work

In our proposed CADx system for classifying masses in mammograms, multiple weak segmentors are built. From the segmented mass contours, we construct the mass shape feature sets. Then, we build an ensemble learning system, which partitions the whole dataset into four categories by patient age and ROI size. In this study, the ensemble system achieved 72% overall accuracy. These preliminary results showed that our ensemble learning system improved the overall diagnosis accuracy for classifying masses in mammograms from 56% to 72%.

In this experiment, we found that masses in the Old age - small ROI subset had much lower classification accuracies than those in the other groups. In our future work, we need to further investigate the segmentations and features in this group to improve the classification.

Currently, only 66% to 77% of ROI images were successfully segmented by each of the weak segmentors. In our future work, we will investigate other segmentation methods, such as region-based segmentation, as alternatives for those ROI images which could not be successfully segmented by our edge-based segmentation.
In this study, we only computed mass shape features for classification. In our future work, we plan to add mass texture and spiculation features to the proposed CADx system. We also plan to investigate a different ensemble learning model to generate the final diagnosis for a suspicious mass.

References

1. National Cancer Institute (2010) American Cancer Society Cancer Facts & Figures 2010. http://www.cancer.org.
2. Winchester DJ, Winchester DP, Hudis CA, Norton L (2007) Breast Cancer (Second Edition). Springer, New York
3. Cheng HD, Shi XJ, Min R, Hu LM, Cai XP et al (2006) Approaches for Automated Detection and Classification of Masses in Mammograms. Pattern Recognition. doi:10.1016/j.patcog.2005.07.006
4. Jiang L, Song E, Xu X, Ma G, Zhang B (2008) Automated Detection of Breast Mass Spiculation Levels and Evaluation of Scheme Performance. Acad Radiol. doi:10.1016/j.acra.2008.07.015
5. Kupinski MA, Giger ML (1998) Automated Seeded Lesion Segmentation on Digital Mammograms. IEEE Transactions on Medical Imaging. doi:10.1109/42.730396
6. Mencattini A, Rabottino G, Salmeri M, Lojacono R, Colini E (2008) Breast Mass Segmentation in Mammographic Image by an Effective Region Growing Algorithm. Advanced Concepts for Intelligent Vision Systems Conference. doi:10.1007/978-3-540-88458-3_86
7. Xu W, Xia S, Xiao M, Duan H (2005) A Model-based Algorithm for Mass Segmentation in Mammograms. Engineering in Medicine and Biology 27th Annual Conference. doi:10.1109/IEMBS.2005.1616987
8. Yuan Y, Giger ML, Li H, Suzuki K, Sennett C (2007) A Dual-stage Method for Lesion Segmentation on Digital Mammograms. Med Phys 34:4180-4193. doi:10.1118/1.2790837
9. Byrd K, Zeng J, Chouikha M (2005) Performance Assessment of Mammography Image Segmentation Algorithms. 34th Applied Imagery and Pattern Recognition Workshop. pp. 152-157
10. Zhang Y, Tomuro N, Furst JD, Raicu DS (2010) Image Enhancement and Edge-based Mass Segmentation in Mammogram. 2010 SPIE Medical Imaging Conference. doi:10.1117/12.844492
11. Zhang Y, Tomuro N, Furst JD, Raicu DS (2011) Multiple Weak Segmentors for Strong Mass Segmentation in Mammogram. 2011 SPIE Medical Imaging Conference. doi:10.1117/12.877450
12. Domínguez AR, Nandi AK (2009) Toward Breast Cancer Diagnosis Based on Automated Segmentation of Masses in Mammograms. Pattern Recognition. doi:10.1016/j.patcog.2008.08.006
13. Delogu P, Fantacci M, Kasae P, Retico A (2007) Characterization of Mammographic Masses Using a Gradient-based Segmentation Algorithm and a Neural Classifier. Computers in Biology and Medicine. doi:10.1016/j.compbiomed.2007.01.009
14. Sampat MP, Markey MK, Bovik AC (2005) Computer-Aided Detection and Diagnosis in Mammography. Elsevier Academic Press
15. Ghosh R, Ghosh M, Yearwood J (2004) A Modular Framework for Multicategory Feature Selection in Digital Mammography. European Symposium on Artificial Neural Networks. 175-180
16. Zhang P, Verma B, Kumar K (2005) Neural vs. Statistical Classifier in Conjunction with Genetic Algorithm Based Feature Selection. Pattern Recognition Letters. doi:10.1016/j.patrec.2004.09.053
17. Opitz D, Maclin R (1999) Popular Ensemble Methods: An Empirical Study. Journal of Artificial Intelligence Research. doi:10.1613/jair.614
18. Dzeroski S, Zenko B (2004) Is Combining Classifiers with Stacking Better than Selecting the Best One? Machine Learning. doi:10.1023/B:MACH.0000015881.36452.6e
19. Breiman L (1996) Bagging Predictors. Machine Learning. 24:123-140
20. Wolpert DH (1992) Stacked Generalization. Neural Networks. 5:241-259
21. Ting K, Witten I (1999) Issues in Stacked Generalization. Journal of Artificial Intelligence Research. 10:271-289
22. Freund Y, Schapire R (1996) Experiments with a New Boosting Algorithm. Proceedings of the 13th International Conference on Machine Learning. 148-156
23. Zhang Y, Tomuro N, Furst JD, Raicu DS (2009) Using BI-RADS Descriptors and Ensemble Learning for Classifying Masses in Mammograms. Medical Content-based Retrieval for Clinical Decision Support (MCR-CDS). doi:10.1007/978-3-642-11769-5_7
24. Heath M, Bowyer K, Kopans D, Moore R, Kegelmeyer WP (2001) The Digital Database for Screening Mammography. Proceedings of the 5th International Workshop on Digital Mammography. 212-218
25. Choras R (2008) Shape and Texture Feature Extraction for Retrieval Mammogram in Databases. Information Tech. Biomedicine. 47:121-128
26. Quinlan R (1993) C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA
27. Witten I, Frank E (2005) Data Mining: Practical Machine Learning Tools and Techniques (2nd edition). Morgan Kaufmann