Final Project Report. Detection of Cervical Cancer in Pap Smear Images

Hana Sarbortova
COMPSCI/ECE/ME 539 Introduction to Artificial Neural Networks and Fuzzy Systems
University of Wisconsin - Madison
December 17, 2013

Summary

Cervical cancer is one of the most common cancers but also one of the most preventable. A regular Pap smear test can uncover pre-cancerous changes in cervical cells, so that treatment can be given before cancer fully develops. Automatic analysis of a Pap smear has to deal with various problems such as cell occlusion and cell type variability. Some research focused on specific parts of this problem has been done, and some methods, mostly semi-automatic, are also used in practice. This project focuses on classification based on features extracted from Pap smear images. The source code is not intended to be released.

1 Introduction

Cervical cancer is the second most common cancer affecting women worldwide, but at the same time it is one of the most preventable and treatable cancers. Since the most common form of cervical cancer starts with pre-cancerous changes and develops very slowly, up to 90% of cervical cancers may be prevented if cell changes are detected and treated early [1]. Early detection is undertaken using a Pap smear.

There are two types of Pap smears, Conventional and ThinPrep, which differ in the way they are obtained. The conventional method uses a cytobrush, while ThinPrep uses a broom-like device. The advantage of ThinPrep is that it contains fewer contaminants and reduces clumping, which makes seeing unobstructed cells much easier [2]. ThinPrep is widely used in most developed countries, and samples obtained by this method are used in this project.

Each sample is usually stained before microscopic examination. The most common method, and also the method used for samples in this project, is Haematoxylin and Eosin (H&E) staining [3]. It can help differentiate cell components but carries little meaning for recognition: the color of a cell indicates its age, which is not significant information for cancer identification [4]. Therefore, all samples are converted to grayscale images.

Although cervical cancer can be prevented, Pap smear images must be evaluated properly in order to achieve that. Detection errors can be caused by inappropriate smear thickness, which causes cell overlapping, or by unwanted particles in the smear. Also, a diagnosis made by cytotechnologists and cytopathologists may be faulty if the number of cancerous cells is small or if their experience is not sufficient. Automatic detection can help to increase cancer cell awareness and diagnosis objectivity and decrease testing cost at the same time.

The cancer detection process consists of image preprocessing, segmentation, feature extraction and classification.
These can be very challenging problems due to the variability of cells in a single sample and to their clumping and occlusion. There are two types of cervical cells, squamous cells (exocervix) and glandular cells (endocervix). Cervical cancer is usually caused

by the first type and is called squamous cell carcinoma [5]. Additionally, white blood cells (neutrophils), metaplastic cells, yeast strands, cell debris or bacteria can appear in samples. They can all be clustered in an arbitrary way, which makes even segmentation very difficult, as we are mostly interested just in the separation of squamous cell cytoplasm and nuclei.

2 Related work

Many research papers focused on segmentation, classification, or both have been published over the past 30 years. However, the majority of them try to solve very specific problems while working with very restricted datasets. Segmentation usually focuses on localisation of the nucleus and does not deal with overlapping cytoplasm. Proposed classification methods usually deal with already separated cervical cells and, again, do not consider overlap. Very few methods that consider a realistic Pap smear have been published recently.

Segmentation

The most popular and common choices for the segmentation task in the literature are automatic thresholding, morphological operations, and active contour models. Bamford and Lovell [14] segmented the nucleus using an active contour model that was estimated by using dynamic programming to find the boundary with the minimum cost within a bounded space around the darkest point in the image. Wu et al. [15] used a parametric cost function with an elliptical shape assumption for the region of interest. Yang-Mao et al. [16] applied automatic thresholding to the image gradient in order to identify the edge pixels corresponding to nucleus and cytoplasm boundaries. This method was improved by Tsai et al. [17], who replaced the thresholding step with k-means clustering into two partitions. Harandi et al. [18] identified the cervical cell boundaries by using the active contour algorithm and then used thresholding to identify the nucleus within each cell. The cytoplasm corresponding to each nucleus was identified by a separate active contour. Plissiti et al.
[1] detected the locations of nuclei centroids in Pap smear images by using the local minima of the image gradient, eliminated the candidate centroids that were too

close to each other, and used a support vector machine (SVM) classifier for the final selection of points using color values in square neighbourhoods. In [19], they used the detected centroids as markers in marker-based watershed segmentation to find the nuclei boundaries, and eliminated the false-positive regions by using a binary SVM classifier with shape, texture, and intensity features.

Most of the described methods focus only on nuclei segmentation [14] [15] [1] [19], which usually cannot be easily used for classification, as the cytoplasm area has to be considered too. Genctav et al. [6] focus on correct identification of the individual nuclei in the presence of overlapping cells, while assuming that the overlapping cytoplasm area is shared by different cells in the rest of the analysis. However, they cannot segment the cytoplasm of individual cells under overlap. The first step in the segmentation process proposed by [6] separates the cell regions from the background using morphological operations and automatic thresholding that can handle varying staining and illumination levels. The second step builds a hierarchical segmentation tree by using a multi-scale watershed segmentation procedure, and automatically selects the regions that maximize a joint measure of homogeneity and circularity, with the goal of identifying the nuclei at different scales. The third step finalizes the separation of nuclei from cytoplasm within the segmented cell regions by using a binary classifier.

Classification

The classification can either determine a cell to be normal or abnormal, or assign it one of various levels of dysplasia. Automatic [7] [8] [9] and semi-automatic [10] [11] methods have been proposed to discriminate normal from abnormal (dysplastic) cells. However, the state of squamous cell dysplasia can be described more specifically, see Figure 1, which has been considered in [12] [6] [13].
[12] and [13] classified cervical cells into normal, LSIL and HSIL classes, but they did not distinguish the types of normal cells. Moreover, [6] distinguished 7 different types, including 3 normal (superficial squamous, intermediate squamous, columnar) and 4 abnormal (mild dysplasia, moderate dysplasia, severe dysplasia, carcinoma in situ). As it is crucial to determine whether a patient has to be treated or not, this project considers only normal and abnormal cervical cells.

Published methods for cervical cancer classification work with cell features that are extracted either manually by a human expert [22] [23] or automatically [6] [21]. Furthermore, most of the classification methods (feature extractors) work with single-cell images, in which the cytoplasm area can be computed relatively easily [21]. The most important features that have to be extracted are nucleus and cytoplasm area, nucleus and cytoplasm brightness/minima/maxima, and nucleus roundness. Many methods make use of Artificial Neural Networks (ANN) [22] [23]. Mat-Isa et al. [22] developed a Hybrid Multilayer Perceptron, which obtains better results than a classical ANN.

A lot of research work has been done, but the majority focuses only on specific areas of the problem on a limited dataset. These methods are usually not applicable in the full chain of operations that is needed in order to achieve a full analysis from an image of a whole Pap smear. An open question is the analysis of occluded cells, especially determining the area of a cervical cell under occlusion.

Figure 1: Possible states of squamous cell dysplasia

3 Method proposal

3.1 Cancerous (abnormal) cell characteristics

Abnormal and normal cervical cells have some very significant features that are necessary for distinguishing between them. The most important characteristic is the nucleus-cytoplasm area ratio. Abnormal cells have larger nuclei and a much smaller area of cytoplasm than normal cervical cells. Abnormal cells also tend to have a more ellipsoidal shape, while normal cells are more or less rounded. On the other hand, abnormal cells usually have a more rounded shape of cytoplasm, but this rule is not necessarily true for cancerous cells with heavily developed dysplasia. Abnormal cell nuclei have a darker intensity value and a much more pronounced structural pattern. Also, abnormal cells tend to be clustered in Pap smear images. Some examples of cancerous and normal cells are shown in Figure 2 and Figure 3 respectively.

Figure 2: Examples of abnormal cervical cells
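The two geometric cues from Section 3.1, the nucleus-cytoplasm area ratio and how elongated a region is, can be illustrated with a short NumPy sketch. It is not the project's code (which was not released): the function names `nc_ratio` and `elongation` are hypothetical, the inputs are assumed to be binary masks in which the cytoplasm mask excludes the nucleus, and elongation is measured here as the ratio of the principal-axis spreads of the pixel coordinates, which is only one of several possible definitions.

```python
import numpy as np

def nc_ratio(nucleus_mask, cytoplasm_mask):
    """Nucleus area divided by cytoplasm area, both counted in pixels.

    Assumption: cytoplasm_mask does not include the nucleus pixels.
    """
    nucleus_area = np.count_nonzero(nucleus_mask)
    cytoplasm_area = np.count_nonzero(cytoplasm_mask)
    return nucleus_area / cytoplasm_area if cytoplasm_area else float("inf")

def elongation(mask):
    """Ratio of the spreads along the major and minor principal axes.

    About 1 for a disc or square, larger for stretched regions.
    Assumption: the region is at least two pixels wide in every direction.
    """
    ys, xs = np.nonzero(mask)
    coords = np.stack([xs, ys]).astype(float)
    # Eigenvalues of the 2x2 coordinate covariance, in ascending order.
    evals = np.linalg.eigvalsh(np.cov(coords))
    return float(np.sqrt(evals[1] / evals[0]))

# Toy example: a 4x4 "nucleus" inside an 8x8 "cell".
cell = np.zeros((10, 10), dtype=bool)
cell[1:9, 1:9] = True
nucleus = np.zeros_like(cell)
nucleus[3:7, 3:7] = True
cytoplasm = cell & ~nucleus

ratio = nc_ratio(nucleus, cytoplasm)   # 16 nucleus pixels / 48 cytoplasm pixels
square_elongation = elongation(nucleus)

# A 2x8 bar is clearly stretched along one axis.
bar = np.zeros((10, 10), dtype=bool)
bar[4:6, 1:9] = True
bar_elongation = elongation(bar)
```

Under the description above, a larger `ratio` would point towards an abnormal cell, while `elongation` gives a rotation-invariant number for how far a region departs from a round shape.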

Figure 3: Examples of normal cervical cells

3.2 Feature extraction

Pap smear test images are first converted from color to grayscale. A median filter is then applied in order to remove small noise while preserving edge sharpness.

3.2.1 Segmentation

Segmentation consists of two major steps when working with full Pap smear images. The first is segmentation of regions of cells and clusters of cells, i.e. background removal. The second is segmentation of cells within the previously located regions, i.e. nuclei segmentation. The hardest part, even for the human eye, is segmentation of cell cytoplasm within cell clusters. We consider only an estimate of its area.

Background removal

The background consists of pixels that have relatively uniform intensity values. Usually, they form the largest part of the image. These two observations naturally lead to the idea that the highest peak in the image histogram corresponds to the background, see Figure 4. In this project, the region segmentation has been done by using a global thresholding method.

The threshold has been estimated from a Gaussian-smoothed histogram; the sought threshold lies at the first saddle towards the lower values.

Figure 4: Gaussian smoothed histogram of Pap smear image

Nuclei extraction

Nuclei are significantly darker than the surrounding cytoplasm; however, a global threshold within a region generally cannot be used. Nevertheless, a nucleus usually belongs to an area of local maxima. The segmentation used in this project works with the maxima of four single-direction gradient values. Four single-direction gradient images are computed: leftward horizontal, rightward horizontal, upward vertical and downward vertical. Each pixel of the Pap smear is then assigned a class according to the maximum value of those four gradient images (or no class if the variance between the gradient values is not large enough). Connected areas with the same label can be seen as nodes of a graph; edges exist only between two classes (i.e. a directed edge is between leftward and downward, downward and rightward, etc.). Therefore, nuclei can be seen as four-node cycles in a directed graph. This locates the nuclei; however, the nuclei edges have to be estimated a bit more precisely. In order to do that, a minimum path in an all-direction gradient image around the located area is found.
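The two pixel-level steps described above can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the project's implementation: the names `background_threshold` and `direction_labels` and the parameters `sigma` and `min_var` are hypothetical, the "first saddle towards the lower values" is interpreted as the first point below the background peak where the smoothed histogram stops decreasing, and each single-direction gradient is taken as the intensity increase when stepping one pixel in that direction. The graph-cycle search and the minimum-path boundary refinement are not covered.

```python
import numpy as np

def background_threshold(gray, sigma=3.0):
    """Global threshold from a Gaussian-smoothed 256-bin histogram (uint8 image)."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-x**2 / (2 * sigma**2))
    kernel /= kernel.sum()
    smooth = np.convolve(hist, kernel, mode="same")
    t = int(np.argmax(smooth))  # highest peak, assumed to be the background
    # Walk towards darker grey levels while the smoothed histogram still falls.
    while t > 0 and smooth[t - 1] < smooth[t]:
        t -= 1
    return t

def direction_labels(gray, min_var=25.0):
    """Label each pixel by its strongest single-direction gradient.

    Labels: 1 = leftward, 2 = rightward, 3 = upward, 4 = downward,
    0 = no class (variance of the four gradient values below min_var).
    """
    g = gray.astype(float)
    left, right, up, down = (np.zeros_like(g) for _ in range(4))
    left[:, 1:] = g[:, :-1] - g[:, 1:]    # step one pixel to the left
    right[:, :-1] = g[:, 1:] - g[:, :-1]  # step one pixel to the right
    up[1:, :] = g[:-1, :] - g[1:, :]      # step one pixel up
    down[:-1, :] = g[1:, :] - g[:-1, :]   # step one pixel down
    stack = np.stack([left, right, up, down])
    labels = np.argmax(stack, axis=0) + 1
    labels[stack.var(axis=0) < min_var] = 0
    return labels

# Toy image: bright background with one dark square "nucleus".
image = np.full((20, 20), 200, dtype=np.uint8)
image[8:12, 8:12] = 50

t = background_threshold(image)
foreground = image < t
labels = direction_labels(image)
```

In this toy image the pixels just inside the left, right, top and bottom borders of the dark square receive the leftward, rightward, upward and downward labels respectively, so the four label regions surround the nucleus, which is the arrangement that the four-node cycle search then looks for.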

Cytoplasm estimation

The cytoplasm area can be precisely determined for single cells; however, an estimate has to be used for cell clusters. The area of the Voronoi diagram cell is used as an upper bound, and the distance to the nearest Voronoi edge represents the lower bound (or rather, the area of a circle with a radius equal to half of that distance).

3.2.2 Extracted features

Several features describing shape and structure have been used to construct the feature vector for classification.

Shape descriptor [5]

A circle is centred on the nucleus centroid. The number of pixels belonging to the nucleus is counted along rays going from the centroid in specific directions (at 15-degree steps). The differences between the values of consecutive rays are computed, and finally a histogram of 5 bins is computed from these values. This description is invariant to intensity changes and rotation.

Structure descriptor [7]

An area around the nucleus centroid is taken in order to analyse the nucleus structure. The variance within small areas is computed, and a histogram is constructed from all the resulting values. This description is invariant to intensity changes.

Nucleus shape features [7]

Nucleus perimeter, major axis length, minor axis length, roundness, estimated cytoplasm area, nucleus-cytoplasm area ratio, nucleus area.

3.3 Classification

The data used for training and testing consist of feature vectors with 19 features each. The classification classes are cancerous cell and normal cell. The features were chosen so that the types of normal cells do not have to be distinguished. The best classification result has been obtained by using

Feedforward Artificial Neural Network. The Matlab Neural Network Toolbox has been used to train and test the network. The best network had 20 hidden layer neurons. Cross-validation has been used for more reliable training and testing.

4 Results

4.1 Dataset

Full Pap smear images have been used to obtain the training and test data. The dataset consists of 40 labelled images, see Figure 5. There were 10 cancerous and 30 normal Pap smears in the dataset. A Pap smear is considered to be cancerous (abnormal) if at least one cell is cancerous.

Figure 5: Example of dataset labelling (cancer image)

4.2 Classification results

The segmentation yielded 548 nuclei, of which 31 belonged to cancerous cells and the remainder to any other cells (not necessarily only normal cervical cells). 37

images were successfully classified. The result on cell classification is 79%. Due to the variability of cell clusters, a much bigger dataset would be needed in order to get better results. It would also help with the preliminary analysis and feature selection.

References

[1] M. E. Plissiti, C. Nikou, A. Charchanti, Automated detection of cell nuclei in Pap smear images using morphological reconstruction and clustering, IEEE Transactions on Information Technology in Biomedicine 15(2) (2011) 233-241.
[2] S. Rogers, Collection of specimens for conventional & thinprep pap tests, hpv tests, & gc/ct [http://www.frhg.org/documents/lab Manuals/Collection-of-Specimens-for-Conventional-and-Thin-Prep-Pap-Tests,-HPV-Tests,-and-gc-ct-tests.pdf]
[3] The histology guide, University of Leeds [http://www.cancer.org/cancer/cervicalcancer/index]
[4] Haematoxylin eosin (h&e) staining [http://protocolsonline.com/histology/dyes-and-stains/haematoxylineosin-he-staining/]
[5] Pathology of an abnormal Pap Smear [http://wdavidstinsonmd.com/pap%20test.htm]
[6] A. Genctav, S. Aksoy, and S. Önder, Unsupervised segmentation and classification of cervical cell images, Pattern Recognition (2012) 45 4151-4168.
[7] M. E. Plissiti, C. Nikou, and A. Charchanti, Automated detection of cell nuclei in Pap smear images using morphological reconstruction and clustering, IEEE Transactions on Information Technology in Biomedicine (2011) 15(2) 233-241.
[8] P. Sobrevilla, E. Montseny, F. Vaschetto, and E. Lerma, Fuzzy-based analysis of microscopic color cervical pap smear images: Nuclei detection, Computational Intelligence and Applications (2010) 9(3) 187-206.

[9] N. M. Harandi, S. Sadri, N. A. Moghaddam, and R. Amirfattahi, An automated method for segmentation of epithelial cervical cell images of ThinPrep, Journal of Medical Systems (2010) 34 1043-1058.
[10] S. N. Sulaiman, N. A. M. Isa, and N. H. Othman, Semi-automated pseudo colour features extraction technique for cervical cancers pap smear images, Knowledge-based and Intelligent Engineering Systems (2011) 15 131-143.
[11] C. Bergmeir, M. García Silvente, and J. M. Benítez, Segmentation of cervical cell nuclei in high-resolution microscopic images: A new algorithm and a web-based software framework, Computer Methods and Programs in Biomedicine (2012) 107(3) 497-512.
[12] B. Sokouti, S. Haghipour, and A. D. Tabrizi, A framework for diagnosing cervical cancer disease based on feedforward MLP neural network and ThinPrep histopathological cell image features, Neural Computing and Applications (2012) DOI 10.1007/s00521-012-1220-y.
[13] N. A. Mat-Isa, M. Y. Mashor, N. H. Othman, An automated cervical pre-cancerous diagnostic system, Artificial Intelligence in Medicine (2008) 42 1-11.
[14] P. Bamford, B. Lovell, Unsupervised cell nucleus segmentation with active contours, Signal Processing 71 (2) (1998) 203-213.
[15] H.-S. Wu, J. Barba, J. Gil, A parametric fitting algorithm for segmentation of cell images, IEEE Transactions on Biomedical Engineering 45 (3) (1998) 400-408.
[16] S.-F. Yang-Mao, Y.-K. Chan, Y.-P. Chu, Edge enhancement nucleus and cytoplasm contour detector of cervical smear images, IEEE Transactions on Systems, Man, and Cybernetics Part B: Cybernetics 38 (2) (2008) 353-366.
[17] M.-H. Tsai, Y.-K. Chan, Z.-Z. Lin, S.-F. Yang-Mao, P.-C. Huang, Nucleus and cytoplasm contour detector of cervical smear image, Pattern Recognition Letters 29 (9) (2008) 1441-1453.

[18] N. M. Harandi, S. Sadri, N. A. Moghaddam, R. Amirfattahi, An automated method for segmentation of epithelial cervical cells in images of ThinPrep, Journal of Medical Systems 34 (6) (2010) 1043-1058.
[19] M. E. Plissiti, C. Nikou, A. Charchanti, Combining shape, texture and intensity features for cell nuclei extraction in Pap smear images, Pattern Recognition Letters 32 (6) (2011) 838-853.
[20] K. Li, Z. Lu, W. Liu, J. Yin, Cytoplasm and nucleus segmentation in cervical smear images using Radiating GVF snake, Pattern Recognition 45 (4) (2012) 1255-1264.
[21] Y. Chen, P. Huang, et al., Semi-Automatic segmentation and classification of Pap smear Cells, IEEE Journal of Biomedical and Health Informatics (2013) PP.
[22] R. Ashfaq, B. Solares, M. Saboorian, Detection of endocervical component by Papnet system on negative cervical smears, Diagnostic Cytopathology (1996) 15(2) 121-124.
[23] C. Balas, A novel optical imaging method for the early detection, quantitative grading and mapping of cancerous and precancerous lesions of cervix, IEEE Transactions on Biomedical Engineering (2001) 48(1) 96-104.