COMPARISON OF BREAST CANCER STAGING IN NATURAL LANGUAGE TEXT AND SNOMED ANNOTATED TEXT

Volume 116 No. 21 2017, 243-249
ISSN: 1311-8080 (printed version); ISSN: 1314-3395 (on-line version)
url: http://www.ijpam.eu

1 Johanna Johnsi Rani G, 2 Dennis Gladis, 3 Joy John Mammen
1 Department of Computer Science, Madras Christian College, Chennai 600 059, South India
2 Department of Computer Science, Presidency College, Chennai 600 005, South India
3 Department of Transfusion Medicine, Christian Medical College, Vellore - 632 004, South India
1 johanna.g@mcc.edu.in, 2 Christophergladis67@gmail.com, 3 joymammen@cmcvellore.ac.in

Abstract: In recent times, medical reports are generated electronically and stored in databases so that automated systems can collate, process, analyze and interpret the patient data. A collection of such reports can help in population studies on the disease domain. Automated systems can also verify the manual diagnosis presented in the reports by experts. The corpus for the automated system discussed here is a set of breast cancer pathology reports, retrieved and processed using Natural Language Processing (NLP) techniques. According to the protocol of the American Joint Committee on Cancer, the pTNM classification is used to determine the pathological staging of breast cancer. The characteristics and classifications of Tumour (T), Lymph node (N) and Distant Metastasis (M) determine the stage of cancer. M is not evident from pathology reports, hence it is given a default value of M0. The T and N classifications in the reports are validated and modified by the domain experts to give the Gold Standard, and a discrepancy report is generated for the reports whose values differ. The cancer staging parameters extracted by the automated system are compared against the Gold Standard for analysis.
The focus of the work is to extract the parameters required to determine the cancer stage of patients from two kinds of reports, namely reports with natural language text and reports with SNOMED annotated text. The cancer staging process on both types of reports is compared, and the results indicate that staging on SNOMED annotated pathology reports yields better results than staging on natural language text.

Keywords: Breast cancer; Pathology reports; Natural Language Processing; Annotated text; Cancer stage

1. Introduction
Most medical reports, in both written and electronic form, contain descriptive narrations in natural language, mostly in English. Textual data from these documents can be processed using natural language processing methods. Most hospitals in India generate and store medical reports in databases. Processing these medical reports with automated systems can provide valuable information for analysis and interpretation about the patient population. Statistics indicate that India ranks at the top in breast cancer deaths. With the available set of breast cancer pathology reports, an automated system is developed to determine the cancer stage of patients. The required parameters are extracted both from natural language reports and from reports annotated using Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT). The set of breast cancer pathology reports was obtained from a hospital in South India. Each report has the following sections: Specimen, Clinical, Gross, Micro, and Impression. The Impression section of the de-identified pathology reports is processed to derive the pathological classification pTNM, in which T represents Tumour, N represents Lymph node and M represents Distant Metastasis. The grouping of the T, N and M classifications is used to determine the stage of cancer of patients.
The American Joint Committee on Cancer (AJCC) has created resource materials that provide in-depth and easy-to-access information for doctors and other medical professionals who perform the staging of cancer patients, and for cancer registrars who abstract the cancer cases [11]. The existence of a primary tumour and its size are the prime values required to classify T. Breast cancer may spread to the axillary lymph nodes in the armpit. The conditions for classifying the lymph node category N are more complex than those for the T classification. Distant Metastasis M cannot be classified from the details in a pathology report. Hence the system sets a default value of M0 to derive the cancer stage. The stage of breast cancer in a patient describes the extent of the spread of cancer in the body, and the grouping of T, N and M specifies the extent of the disease in a patient. The cancer stage is determined through the grouping of T, N and M as recommended by AJCC.
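As an illustration of how T, N and M group into a stage, the following Python sketch encodes a simplified version of the AJCC 7th edition anatomic stage groups for breast cancer. It is not the system described in this paper: the names `t_category`, `STAGE_GROUPS` and `stage_group` are hypothetical, sub-categories such as T1a or N1mi are collapsed (so stage IB, which depends on N1mi, is never produced), and Tis and T4 cannot be inferred from tumour size alone.

```python
# Simplified, illustrative sketch of pTNM stage grouping (AJCC 7th edition, breast).
# Assumption: sub-categories (T1a, N1mi, N2b, ...) are collapsed to T0-T4 / N0-N3.


def t_category(size_mm: float) -> str:
    """Return a size-based T category for an invasive tumour.

    Tis and T4 depend on findings other than size, so they are not derived here.
    """
    if size_mm <= 0:
        return "T0"   # no evidence of primary tumour
    if size_mm <= 20:
        return "T1"   # 20 mm or less
    if size_mm <= 50:
        return "T2"   # more than 20 mm, up to 50 mm
    return "T3"       # more than 50 mm


# (T, N) -> stage group, assuming M0.
STAGE_GROUPS = {
    ("Tis", "N0"): "0",
    ("T1", "N0"): "IA",
    ("T0", "N1"): "IIA", ("T1", "N1"): "IIA", ("T2", "N0"): "IIA",
    ("T2", "N1"): "IIB", ("T3", "N0"): "IIB",
    ("T0", "N2"): "IIIA", ("T1", "N2"): "IIIA", ("T2", "N2"): "IIIA",
    ("T3", "N1"): "IIIA", ("T3", "N2"): "IIIA",
    ("T4", "N0"): "IIIB", ("T4", "N1"): "IIIB", ("T4", "N2"): "IIIB",
}


def stage_group(t: str, n: str, m: str = "M0") -> str:
    """Group T, N and M into a stage; M defaults to M0 as in the paper."""
    if m == "M1":
        return "IV"     # any T, any N, M1
    if n == "N3":
        return "IIIC"   # any T, N3, M0
    return STAGE_GROUPS.get((t, n), "unknown")
```

For example, a 60 mm tumour with no positive nodes would be grouped via `stage_group(t_category(60), "N0")`. Keeping the grouping in a lookup table makes it easy to check against the published staging table.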

Prior to determining the cancer stage on natural language text and on SNOMED annotated text, the textual content is pre-processed. The pre-processing steps required for natural language text include Natural Language Processing (NLP) related tasks and standardization of numerical and non-numerical values in the text. The SNOMED annotated text requires a major pre-processing step: extracting a disease-specific subset from the SNOMED database. In the subsequent pre-processing step, the SNOMED subset extracted for the disease domain is used to annotate the text with SNOMED terms and their codes. Of these processing steps, extraction of the SNOMED subset for the breast cancer domain and annotation of natural language text using the subset are out of the scope of this paper. The work uses regional data collected from a hospital in India. Hence it has practical applicability in the diagnosis, treatment and population-based studies of breast cancer in women in India.

The paper is organized as follows: Section 2 describes related works in Natural Language Processing (NLP), SNOMED annotation of text, and cancer staging. Section 3 explains the Materials and Methods used. Section 4 describes the Results obtained. Section 5 presents the Conclusions.

2. Related Works
Electronic Health Records (EHR), especially those in narrative text form, are processed by applying Natural Language Processing (NLP) and Information Extraction (IE) techniques. Cambria and White review various approaches, including those that use production rules and semantic categories and those based on First-Order Logic (FOL), Bayesian networks and semantic networks [1]. Dunham et al. [12], Schadow and McDonald [4], Xu et al. [3], Coden et al. [6], and Nguyen et al. [5] used domain-specific lexicons and rules in processing pathology reports. Nelson et al. developed a web-based search application with sequential queries [9]. Buckley et al. converted free-text EHRs to a machine-readable form using NLP techniques [2]. Coden et al. automatically extracted cancer disease characteristics from pathology reports [6]. David Martinz and Yue Li used text mining tools to extract information with minimal human intervention [8]. In this work, cancer staging is done by extracting the required parameters through pattern matching on free text and annotated text.

Clinical reports in the developed countries use medical terminologies such as SNOMED or ICD. Buckley et al. used ICD and Current Procedural Terminology (CPT) codes to identify the reports pertaining to breast [2]. Schadow and McDonald developed a method for extracting details about specimens and their related findings from coded text [4]. Nguyen et al. applied a symbolic rule-based classification methodology to identify SNOMED CT concepts in free text [14]. Napolitano, Fox, Middleton and Connolly used pattern-based extraction from pathology reports [7].

Many breast cancer related research works in India use the Wisconsin Breast Cancer dataset. This work uses regional data, and hence the results have practical relevance and applicability. The system uses pattern-matching rules for extraction and cancer staging on both natural language text [17] and annotated text. The annotation is done using SNOMED. Ching-Heng Lin, Nai-Yuan Wu, Wei-Shao Lai and Der-Ming Liou developed an auto-annotation tool that selects terms using a suggesting and ranking algorithm to annotate reports with terms from a SNOMED subset [16].

The two essential processing steps in this work are the use of pattern-matching algorithms to extract the parameters necessary for cancer staging, and the annotation of text using SNOMED. The cancer staging process is run on both natural language text and annotated text from the dataset, and the results are compared and analyzed to determine which performs better.

3. Materials and Methods
The dataset and the methods applied to determine the cancer stage of patients are explained in this section.
The process applies natural language processing steps and pattern-matching rules to determine the cancer stage.

A. Dataset
One hundred and fifty de-identified breast cancer pathology reports constitute the corpus used in this work. Each report, written by a pathologist, narrates the patient's condition as determined by examining cells and tissues under a microscope. The report has the following sections: Demographic information; the Specimen section, indicating the body part from which the tissue samples are taken; the Clinical history, describing the breast abnormality and the kind of surgery done; and the Gross description, giving the size, weight, and color of each piece of tissue removed. The Microscopic description describes how the cancer cells look under the microscope, their relationship to the normal surrounding tissue, the size of the cancer, the results of special tests and the growth rate of cells. The Impression section summarizes all the important findings from the tissues examined.
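Since staging relies on the Impression section, a report first has to be split into its sections. The following Python sketch shows one way this could be done with a regular expression. It is only an assumption about the layout: it presumes each section starts on its own line with a header such as "Impression:" or "Gross description:", and the names `SECTION_NAMES`, `HEADER_RE` and `split_sections` as well as the sample report are made up for illustration.

```python
import re

# Hypothetical header keywords, taken from the section names listed in the paper.
SECTION_NAMES = ["Specimen", "Clinical", "Gross", "Micro", "Impression"]

# A header is assumed to start a line, begin with one of the keywords,
# and end with a colon (e.g. "Microscopic description:").
HEADER_RE = re.compile(
    r"^[ \t]*(" + "|".join(SECTION_NAMES) + r")[^:\n]*:",
    re.IGNORECASE | re.MULTILINE,
)


def split_sections(report: str) -> dict:
    """Map each detected section keyword to the text that follows its header."""
    matches = list(HEADER_RE.finditer(report))
    sections = {}
    for i, match in enumerate(matches):
        # A section runs until the next header, or to the end of the report.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(report)
        sections[match.group(1).capitalize()] = report[match.end():end].strip()
    return sections


# Made-up sample text, not taken from the study corpus.
sample = """Specimen: Left breast, mastectomy.
Gross description: Tumour measuring 2.5 cm.
IMPRESSION: Infiltrating ductal carcinoma. One out of three lymph nodes positive.
"""
impression = split_sections(sample)["Impression"]
```

A body line that happens to start with one of the keywords and contains a colon would be mistaken for a header, so a real corpus would likely need stricter header patterns.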

B. Cancer staging
The stage of cancer indicates how far the cancer has spread. There are two types of cancer staging: Clinical staging and Pathological staging. Of the two, Pathological staging is the more accurate. The T, N and M classifications are derived from the Impression section by applying the AJCC protocol, and their grouping determines the stage of cancer. The stage is determined on reports with natural language text and on reports with SNOMED annotated text.

C. Preprocessing for cancer staging on Plain text
Retrieval of reports, pre-processing of the report content, extraction of the details required for TNM classification, and staging are the major tasks performed by the automated system. The pathology report is retrieved as either a .pdf or a .txt file, and the preprocessing steps are performed on plain text reports. The precision of the results of any process on natural language text depends on the preprocessing steps applied to homogenize and standardize the data. The preprocessing steps applied to the breast cancer pathology reports are listed below.
Report segregation: Separating multiple reports into individual reports.
Section segmentation: Extracting the contents of the sections in the reports as separate sections.
Standardization of measures: Tumour sizes are given in either centimeters or millimeters. This step converts all measures into millimeters.
Date formats: All dates are converted to a uniform DD/MM/YYYY format.
Sentence segmentation: The contents of each section are separated into individual sentences. The period (.) is used to identify sentence boundaries, with exceptions handled for fractional values.
Standardization of numerical values: The pathology reports have numeric values represented in numerals (3) or in English words (three). Such numerical values are standardized to Arabic numerals.
Alpha numeric representations: The number of lymph nodes is represented as 1/3, or 1 out of three, or one out of three.
This value is converted into complete textual form as one out of three.
Abbreviations: Abbreviations are expanded by the system.
Spelling variations: All discrepancies in spelling between British and American English are standardized to British English.
Whitespace removal: Extra whitespace is removed from the document. This improves the data extraction process.
Handling parenthesized terms: Parentheses ( ) and brackets [ ] in the document are homogenized into [ ].
Case sensitivity: All text comparisons are made after converting the terms to lower case. Medical terms such as Ductal Carcinoma in situ are converted to the form found in SNOMED.
Missing headers: The pre-processing module adds missing headers to the document whenever necessary.

The application of the above pre-processing steps homogenizes the reports and improves the parameter extraction process for cancer staging. The efficiency and precision of annotating medical terms in the report using SNOMED also improve with the preprocessing steps. Fig. 1 shows the workflow for the cancer staging process on natural language text in pathology reports and on SNOMED annotated reports. The diagram shows that both archived reports and newly generated reports are processed to determine the cancer stage of patients.

Figure 1. Workflow of Comparison on Breast Cancer Staging

D. Preprocessing for cancer staging on SNOMED Annotated text
The pre-processing steps required for extraction of the cancer stage on SNOMED annotated text are manually building a Lexicon of breast cancer terms, and extracting a SNOMED subset for the breast cancer domain using the Lexicon and queries. These are part of our earlier work in the development of the automated system. The Lexicon is built in two ways: i. through a manual process of examining the reports to accumulate terms and store them in a database, and ii. through application of NLP-based tasks such as sectioning, sentence splitting, tokenization and stop word removal, after tagging the medical terms found in the manual lexicon [18]. The above two preprocessing steps are out of the scope of this paper.

In the annotation process using the subset, each medical term in the report is replaced with its corresponding SNOMED term and its code. Pattern-matching algorithms are applied on the SNOMED annotated text to find the pTNM classification. The patterns used for cancer staging on annotated text are composed of several components: SNOMED Concept Ids identified using the CliniClue SNOMED browser, numerical values, negation values (No / Not), and logical connectives (and, or). The conditions are the same classification conditions specified in the AJCC protocol. When the free text is annotated with the SNOMED codes for medical terms, the pTNM classifications are also annotated with their respective codes.

4. Results
The automated system determined the cancer stage for each patient from the natural language text and the annotated text in all 150 reports. The pattern-matching rules applied in the process extracted the details required for the classification of T and N and for cancer staging. Figure 2 to Figure 4 present the analysis of the T, N and Cancer Stage values extracted from natural language text.

E. Gold Standard for the Cancer Staging
The system has three pTNM classifications: i. the pTNM given at the end of the report, manually derived by the Pathologist by examining the parameters in the report; ii. the Gold Standard pTNM, verified and validated by the Pathologist through a graphical interface; and iii. the pTNM classification automatically derived by the application. The pTNM specified in the Impression section of each pathology report is verified by the Pathologists, to correct erroneous and missing classifications. This is the Gold Standard that is used to validate the automatically derived pTNM classification. The pTNM is of prime importance as it determines the stage of cancer in patients.

Figure 2.
Analysis of T-Classification on Natural Language text

F. Analysis of Cancer staging process
The derived cancer stage values are analyzed by finding the True Positive (TP), True Negative (TN), False Positive (FP) and False Negative (FN) values. The evaluation parameters used in the analysis are listed below.
Precision (P) = TP / (TP + FP)
Recall (R) = TP / (TP + FN)
Specificity = TN / (TN + FP)
Accuracy = (TP + TN) / (TP + TN + FP + FN)
F-measure = (2 * Precision * Recall) / (Precision + Recall)
Error Rate = (FP + FN) / (TP + TN + FP + FN)

Figure 3. Analysis of N-Classification on Natural Language text
Figure 4. Analysis of Cancer Staging on Natural Language text
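The evaluation parameters defined in this subsection can be computed directly from the four counts. The Python sketch below is a minimal illustration of those formulas; the function name `evaluation_metrics`, the helper `_safe_div`, and the choice to return 0.0 when a denominator is zero are assumptions, not details given in the paper.

```python
def _safe_div(numerator: float, denominator: float) -> float:
    """Divide, returning 0.0 when the denominator is zero (an assumed convention)."""
    return numerator / denominator if denominator else 0.0


def evaluation_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute the evaluation parameters from TP, TN, FP and FN counts."""
    total = tp + tn + fp + fn
    precision = _safe_div(tp, tp + fp)
    recall = _safe_div(tp, tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "specificity": _safe_div(tn, tn + fp),
        "accuracy": _safe_div(tp + tn, total),
        "f_measure": _safe_div(2 * precision * recall, precision + recall),
        "error_rate": _safe_div(fp + fn, total),
    }
```

By these definitions, Accuracy and Error Rate for the same set of counts always sum to 1, which gives a quick consistency check on any computed values.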

Cancer staging on natural language text yields an average Precision of 94.72%, an average Recall of 95.94%, an average Accuracy of 92.12% and an average Specificity of 80.96%. The average F-measure for the process is 95.89% and the average Error is 3.94%. The results show that the system performs well in extracting the cancer stage of patients. This can be attributed to the numerous pre-processing steps applied to the textual content before the extraction process. The results of cancer staging on SNOMED annotated text in pathology reports are presented in Figure 5 to Figure 7.

Figure 5. Analysis of T-Classification on SNOMED Annotated text

The cancer staging process on SNOMED annotated reports yields the following results. The average Precision of the process is 95.48%, the average Recall is 100%, the average Accuracy is 97.97% and the average Specificity is 96.27%. The average F-measure for the process is 97.66% and the average Error is 0.04%. As the analysis parameters indicate, the cancer staging process on SNOMED annotated text yields better results. This can be attributed to the following reasons.
i. The preprocessing steps extensively applied to the medical text contribute to homogenization and standardization of the text in the reports. This cleans the dataset for efficient processing.
ii. The correctness of the process is ensured by manually collating a Lexicon of medical terms relating to breast cancer from the pathology reports and using it for the annotation process. The Lexicon has been obtained and verified using manual and automated means, which standardized the subset extraction process.
iii. The Lexicon generated by the system is used in SNOMED subset extraction. The comprehensiveness and completeness of the lexicon terms contribute to effective subset extraction.
iv. The SNOMED subset for cancer consists of about 1% of all the SNOMED CT concepts in the database.
The extraction of a SNOMED subset for the breast cancer domain, instead of using the complete SNOMED database, results in faster and more precise annotation of reports, thus giving better results for cancer staging than on natural language text.
v. The annotation process standardizes every medical term in the report by replacing it with its equivalent term in the SNOMED medical vocabulary.

Figure 6. Analysis of N-Classification on SNOMED Annotated text
Figure 7. Analysis of Cancer staging on SNOMED Annotated text

5. Conclusions
The objective of the work, to derive the stage of cancer from natural language textual reports and from SNOMED annotated reports, was achieved. The use of the standard AJCC protocol for cancer staging and of a globally accepted medical vocabulary such as SNOMED yielded better results in the staging process. Natural language text is heterogeneous, but the pre-processing steps bring homogeneity to the text. Even so, the lower performance of cancer staging on natural language reports can be attributed to the use of only the Impression section of the report for the staging process. Processing other sections would improve the results. The accuracy of automated systems in the medical domain, especially in a task as critical as cancer staging, is of vital importance, as it involves diagnostic and treatment decisions on a human being. This critical factor necessitates that reports be annotated and processed for better results, analysis and decision-making. Annotation of the reports using SNOMED also makes it possible to apply numerous queries on any annotated disease dataset to get a better understanding of the patient population. The work indicates that, of the cancer staging processes on natural language text and on SNOMED annotated text, the process on annotated text yields the better results. As an extension of this work, the annotation process can be performed on reports of other disease domains for the required processing and decision making.

6. Acknowledgement
The authors would like to thank the Department of Pathology, Christian Medical College and Hospital, Vellore for providing the sample data for the study. The authors would also like to acknowledge S. Pradeep Vignesh, student of MCA in the Department of Computer Science, Madras Christian College, for his contributions towards developing the automated system.

References
[1] Erik Cambria, Bebo White, Jumping NLP Curves: A Review of Natural Language Processing Research, IEEE Computational Intelligence Magazine, pp. 48-57, May 2014.
[2] Buckley JM, Coopey SB, Sharko J, et al. The feasibility of using natural language processing to extract clinical information from breast pathology reports. Journal of Pathology Informatics. 2012;3:23. doi:10.4103/2153-3539.97788.
[3] Xu H, Friedman C. Facilitating research in pathology using natural language processing. AMIA Annual Symp. Proc. 2003:1057.
[4] Schadow G, McDonald CJ. Extracting Structured Information from Free Text Pathology Reports. AMIA Annual Symposium Proceedings, pp. 584-588, 2003.
[5] Nguyen, Moore, Lawley, Hansen, Colquist, Automatic extraction of cancer characteristics from free-text pathology reports for cancer notifications, Stud Health Technol Inform. 2011;168:117-24.
[6] Anni Coden et al., Automatically extracting cancer disease characteristics from pathology reports into a Disease Knowledge Representation Model, Elsevier, Journal of Biomedical Informatics 42, pp. 937-949, 2009.
[7] Napolitano G, Fox C, Middleton R, Connolly D, Pattern-based information extraction from pathology reports for cancer registration, Cancer Causes Control, 2010 Nov;21(11):1887-94. doi: 10.1007/s10552-010-9616-4. Epub 2010 Jul 23.
[8] David Martinz, Yue Li, Information Extraction from Pathology reports in a hospital setting, Proceedings of the 20th ACM international conference on Information and knowledge management, pp. 1877-1882, 2010.
[9] Nelson HD, Weerasinghe R, Martel M, Bifulco C, Assur T, Elmore JG, et al. Development of an electronic breast pathology database in a community health system. J Pathol Inform 2014;5:26.
[10] McCowan I, Moore D, Nguyen AN, Bowman RV, Clarke BE, Duhig EE, et al. Application of Information Technology: Collection of Cancer Stage Data by Classifying Free-text Medical Reports. JAMIA. 2007;14(6):736-745.
[11] AJCC Cancer Staging Manual. 7th ed. New York, NY: Springer, pp. 347-76, 2010.
[12] G.S. Dunham, M.G. Pacak and A.W. Pratt, Automatic indexing of Pathology data, Journal of the American Society for Information Science, 29(2):81-90, Mar. 1978.
[13] David A. Hanauer et al., The registry case finding engine: an automated tool to identify cancer cases from unstructured, free-text pathology reports and clinical notes, Journal of the American College of Surgeons, 205(5): pp. 690-697, Nov. 2007.
[14] Anthony N Nguyen et al., Symbolic rule-based classification of lung cancer stages from free-text pathology reports, Journal of the American Medical Informatics Association (JAMIA), 17:440-445, 2010.
[15] Carlos Rodrigues-Solano, Leonardo Lezcano, Miguel-Angel Sicilia, Information Systems and Technologies for Enhancing Health and Social Care, 2013, pp. 15.
[16] Lin C-H, Wu N-Y, Lai W-S, Liou D-M. Comparison of a semi-automatic annotation tool and a natural language processing application for the generation of clinical statement entries. Journal of the American Medical Informatics Association (JAMIA). 2015;22(1):132-142. doi:10.1136/amiajnl-2014-002991.
[17] Johanna Johnsi Rani G., Dennis Gladis, Marie Therese Manipadam, Gunadala Ishitha, Breast Cancer Staging using Natural Language Processing, 2015, IEEE Conference publications, pp. 1552-1558, DOI: 10.1109/ICACCI.2015.7275834.
[18] Johanna Johnsi Rani G., Dennis Gladis, Joy John Mammen, Lexicon-based and Query-based Autoannotation of Medical Reports using SNOMED, Proceedings of the International Conference on Computing Paradigms (ICCP), 2017.
[19] Johanna Johnsi Rani G., Dennis Gladis, Joy John Mammen, SNOMED Subset Extraction for Annotation of Breast Cancer Pathology Reports, Proceedings of National Conference on ICT Solutions for Challenges and Issues in e-health (NCICTEH'17), 2017.
[20] K. Srikar, M. Akhil, V. Krishna Reddy, Execution of Cloud Scheduling Algorithms, International Innovative Research Journal of Engineering and Technology, vol. 02, no. 04, pp. 108-111, 2017.
