A Study of Abbreviations in Clinical Notes

Hua Xu, MS, MA 1; Peter D. Stetson, MD, MA 1,2; Carol Friedman, PhD 1

1 Department of Biomedical Informatics, Columbia University, New York, NY, USA
2 Department of Medicine, Columbia University, New York, NY, USA

Various natural language processing (NLP) systems have been developed to unlock patient information from narrative clinical notes in order to support knowledge-based applications such as error detection, surveillance, and decision support. In many clinical notes, abbreviations are widely used without mention of their definitions, which is very different from the use of abbreviations in the biomedical literature. Thus, it is critical, but more challenging, for NLP systems to correctly interpret abbreviations in these notes. In this paper we describe a study of a two-step model for building a clinical abbreviation database: first, abbreviations in a text corpus were detected, and then a sense inventory was built for those that were found. Four detection methods were developed and evaluated. Results showed that the best detection method had a precision of 91.4% and a recall of 80.3%. A simple method was used to build sense inventories from two different knowledge sources: the Unified Medical Language System (UMLS) and a MEDLINE abbreviation database (ADAM). Evaluation showed that the inventory from the UMLS appeared to be the more appropriate of the two for defining the senses of abbreviations, but was not ideal: it covered 35% of the senses and had an ambiguity rate of 40% for those that were covered. However, annotation by domain experts appears necessary for uncovered abbreviations and to determine the correct senses.

INTRODUCTION

Natural language processing (NLP) systems 1-3 have been developed in the clinical domain to unlock clinical information from free text. Information retrieved by these systems has been used for various knowledge-based applications, such as decision support systems, and has been shown to improve the quality of health care. Studies have shown that different types of clinical notes pose different challenges for NLP systems. Long 4 discussed several issues encountered when parsing free-text nursing notes, including tokenization, recognition of special forms, determining the meaning of abbreviations, and spelling correction. Stetson et al. 5 performed a study of clinical signout notes and found that they contain more abbreviations and have denser content than ambulatory and discharge notes. In general, the wide use of abbreviations appears to be a common problem for most types of clinical notes.

In the biomedical literature, abbreviations usually occur together with their expanded forms at least once in the document, typically in the format "short form (long form)", e.g., CABG (coronary artery bypass graft). Various approaches have been developed to match such abbreviation-definition patterns, and these have been applied to MEDLINE abstracts to build databases 6-8 containing the detected abbreviations along with their possible senses (also called a sense inventory). In most clinical reports, such as admission notes, occurrences of abbreviations differ from those in the literature because in clinical reports they usually do not occur along with their expanded forms, making the task of identification more difficult. Therefore, approaches based on abbreviation-definition patterns that work for the literature are not applicable to the patient record. Abbreviations are also highly ambiguous; e.g., RA could be right atrium or rheumatoid arthritis.
Liu and colleagues 9 reported that 33.1% of abbreviations found in the UMLS were ambiguous. In this paper, we describe an initial study of abbreviations in inpatient admission notes. We developed and evaluated four different methods that automatically detect abbreviations in the notes. To the best of our knowledge, this is the first time machine learning techniques have been developed for detecting abbreviations in clinical notes. We also studied the coverage of available biomedical knowledge sources for the abbreviations that were detected, in order to explore the adequacy of those knowledge sources for building sense inventories. Although we applied our methods to admission notes, they are generalizable and could be applied to other types of notes as well.

BACKGROUND

Detection of abbreviations in unrestricted text is a challenging task. In the domain of general English, different methods have been developed for detecting them. Park and Byrd 10 defined a set of manually created rules for abbreviation detection. Toole 11 described a decision tree based method to identify abbreviations among words that were not recognized by his NLP system, and reported a precision of 91.1% on free text from an Air Safety Reporting System database. For abbreviations containing a period, identifying them involves solving the sentence boundary problem as well (e.g., the period in "Dr." should not be considered the end of a sentence).

In the biomedical domain, several knowledge sources are available that contain abbreviations and their possible senses. The UMLS 12, which combines many biomedical vocabularies, is a comprehensive source of medical terms, including clinical abbreviations. The Metathesaurus file MRCONSO.RRF contains information associated with medical terms, such as the concept unique identifier (CUI) and the source vocabulary, and can therefore be used as a source of medical terms, including abbreviations. Additionally, Liu et al. 9 reported on a method to extract abbreviations embedded in UMLS terms. For example, the abbreviation CAD is extracted from the UMLS term "CAD - Coronary artery disease". ADAM 8 is another knowledge source concerning abbreviations, and contains both acronym and non-acronym abbreviations; the different definitions of the abbreviations were extracted automatically from MEDLINE titles and abstracts based on short form / long form patterns. In the clinical domain, Berman 13 reviewed thousands of clinical abbreviations from pathology reports and classified them based on the relationship between the abbreviations and their corresponding long forms, which is useful for the implementation of abbreviation detection and expansion algorithms.

METHODS

In this study, we developed four methods to automatically detect abbreviations in clinical notes and evaluated their performance using a manually annotated test set. We also used the test set to evaluate the coverage of sense inventories generated from the UMLS and ADAM.

Data Set

The New York Presbyterian Hospital (NYPH) Clinical Data Repository (CDR) is a central pool of clinical data, which includes narrative data consisting of different types of clinical reports. In 2004, NYPH implemented a new physician data entry system called enote 14, which allows physicians to directly key in various types of notes, such as hospital admission notes and discharge summaries. For this study, we collected all the admission notes from the internal medicine service that were entered via the enote system during 2004-2006, amounting to 16,949 notes. All the admission notes were then tokenized in order to separate the text into a sequence of individual strings, or tokens. A simple heuristic rule-based tokenizer was developed based on observation of a few admission notes in the data set. Ten admission notes were randomly selected from the collection, and a hospitalist attending physician from the internal medicine service manually reviewed the selected notes, listed all the abbreviations they contained, and specified their full forms. The set was then randomly partitioned: six annotated notes were used as a training set for the methods, and the remaining four, which served as the reference standard in this study, were used as a test set. To evaluate the coverage of available knowledge sources for clinical abbreviations, 100 distinct abbreviations, with their expanded forms based on the annotations, were randomly selected from the test set.
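The tokenizer itself is not published in the paper; the following is a minimal sketch, in Python, of the kind of simple heuristic splitter described above. The regular expression and its exact behavior (always splitting periods off, keeping "/" and "-" inside tokens) are illustrative assumptions, chosen here because they also reproduce the failure mode discussed later in the paper.

```python
import re

# A minimal sketch of a heuristic rule-based tokenizer of the kind the study
# describes (the actual tokenizer was not published). It keeps '/' and '-'
# inside tokens, since contractions such as "t/d/a" and "2/2" depend on them,
# but always splits periods off as separate tokens.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+(?:[/\-][A-Za-z0-9]+)*|\S")

def tokenize(text):
    return TOKEN_RE.findall(text)

# Example: note the known failure mode discussed in the paper's error
# analysis -- "S. Aureus" is broken into the tokens "S", ".", "Aureus".
print(tokenize("Pt c/o SOB; hx of S. Aureus bacteremia."))
# ['Pt', 'c/o', 'SOB', ';', 'hx', 'of', 'S', '.', 'Aureus', 'bacteremia', '.']
```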
Abbreviations Detection

We developed the following four methods for detecting abbreviations in clinical notes.

The first is a simple method used to measure baseline performance on the abbreviation detection task. This method labels every unknown token in the text as an abbreviation. To determine whether a token is unknown, we used two word lists and labeled any word appearing in neither list as unknown (i.e., as an abbreviation). The first list is an English word list (Knuth's list of 110,573 American English words 15), which also contains morphological variants of normal English words. The second list is a medical term list of 9,721 words obtained from two of the lexical files of an NLP system called MedLEE 3 (Medical Language Extraction and Encoding System); these files contain single medical words and single-word drug names commonly found in clinical reports. To improve the performance of this baseline method, known medical abbreviations in the MedLEE lexicon were eliminated from this list by manual review so that they would be considered unknown.

The second method is a heuristic rule-based program developed by observing several admission notes. It uses information about word formation, such as capital letters, numeric and alphabetic characters and their combinations, together with the two word lists of English and medical terms described above. A word is considered an abbreviation if it meets one of the following criteria (a sketch of these rules appears after the list):

1) the word contains special characters such as "-" and ".";
2) the word contains fewer than 6 characters and contains one of the following: a) a mixture of numeric and alphabetic characters; b) capital letter(s), except when only the first letter is uppercase and the word follows a period; c) lower-case letters only, where the word is in neither the English nor the medical list.
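For concreteness, here is a minimal sketch of the baseline and rule-based detectors as just described. The word-list file names are hypothetical, and the handling of punctuation-only tokens is an assumption, since the paper does not specify it.

```python
import re

def load_wordlist(path):
    # One word per line, lower-cased (e.g. Knuth's American English list or
    # the MedLEE single-word lexicon; these file names are hypothetical).
    with open(path) as f:
        return {line.strip().lower() for line in f if line.strip()}

english = load_wordlist("knuth_english.txt")
medical = load_wordlist("medlee_terms.txt")

def is_known(token):
    return token.lower() in english or token.lower() in medical

def baseline_is_abbrev(token):
    """Baseline detector: any token found in neither word list is labeled an
    abbreviation. Tokens with no letters (punctuation, bare numbers) are
    skipped here; the paper does not specify their handling."""
    if not re.search(r"[A-Za-z]", token):
        return False
    return not is_known(token)

def rule_based_is_abbrev(token, prev_token=None):
    """Heuristic rules as enumerated above ('/' is included under rule 1
    since the paper's 'such as' list is non-exhaustive and contractions
    like 't/d/a' contain it)."""
    # 1) contains special characters such as '-' and '.'
    if "-" in token or "." in token or "/" in token:
        return True
    if len(token) < 6:
        # 2a) mixture of numeric and alphabetic characters
        if re.search(r"[A-Za-z]", token) and re.search(r"\d", token):
            return True
        # 2b) capital letter(s), but not a plain sentence-initial capital
        if re.search(r"[A-Z]", token):
            sentence_initial = (token[:1].isupper() and token[1:].islower()
                                and prev_token in (None, "."))
            if not sentence_initial:
                return True
        # 2c) all lower case but unknown to both word lists
        if token.islower() and not is_known(token):
            return True
    return False
```

On a token such as "BP" mid-sentence, rule 2b fires; on "t/d/a", rule 1 fires; on a short unknown lower-case token, rule 2c fires.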

For the third and fourth methods, we trained decision tree classifiers on the training set, using the J48 decision tree in Weka 3.4 16, an implementation of the C4.5 decision tree learning algorithm. Method 3, the decision tree 1 (DT1) method, used features concerning word formation plus a corpus-derived frequency feature, as described below (and sketched after this section). Method 4, the decision tree 2 (DT2) method, used the same features as method 3 plus features derived from outside knowledge sources. The word-formation features include: 1) special characters such as "-" and "."; 2) alphabetic/numeric characters and their combination; 3) information about upper-case letters and their positions in the word; and 4) the length of the word. The corpus-derived feature is the average document frequency of a word, defined as the total number of occurrences of the word divided by the number of notes in the corpus. The features derived from outside knowledge sources, namely the English and medical term lists, record whether a word is an English word and whether it is a known medical term.

To evaluate performance, the test set was processed by each of the abbreviation detection methods, and a list of predicted abbreviations was generated for each: Baseline, Rule-Based, DT1 (decision tree based only on word information and frequency), and DT2 (decision tree based on word information, frequency, and external knowledge). The automatically generated abbreviations were compared to the reference standard, and precision and recall were reported for each method. Precision is defined as the number of correctly predicted abbreviations divided by the total number of abbreviations predicted by the automated method. Recall is defined as the number of correctly predicted abbreviations divided by the total number of abbreviations in the reference standard.
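The study trained Weka's J48 on these features; the sketch below is a rough stand-in that builds the described feature vector and fits a scikit-learn decision tree instead (an assumption for illustration, not the paper's setup). It reuses the english/medical word lists and tokenize() from the earlier sketches, and the training-data names in the comments are hypothetical.

```python
import re
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def features(token, doc_freq, n_notes, external=True):
    """Feature vector for DT1/DT2 as described in the text.
    doc_freq is the total number of occurrences of the token in the corpus;
    dividing by n_notes gives the average document frequency feature."""
    f = [
        int("-" in token or "." in token),          # special characters
        int(bool(re.search(r"[A-Za-z]", token))),   # has alphabetic chars
        int(bool(re.search(r"\d", token))),         # has numeric chars
        int(token[:1].isupper()),                   # initial capital
        int(any(c.isupper() for c in token[1:])),   # capital after position 0
        len(token),                                 # word length
        doc_freq / n_notes,                         # average document frequency
    ]
    if external:  # DT2 adds the two knowledge-source features
        f += [int(token.lower() in english), int(token.lower() in medical)]
    return f

# Hypothetical training call: tokens and gold labels would come from the six
# annotated training notes, corpus counts from all 16,949 notes.
# counts = Counter(tok for note in corpus for tok in tokenize(note))
# X = [features(t, counts[t], len(corpus)) for t in train_tokens]
# y = [t in gold_abbrevs for t in train_tokens]
# clf = DecisionTreeClassifier(criterion="entropy").fit(X, y)
```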
Sense Inventory Study

After detecting the abbreviations, a sense inventory was created for them. We created three sense inventories from two available knowledge sources, the UMLS and the ADAM abbreviation database, and evaluated the coverage of the generated inventories. We used the UMLS 2006AB to generate the UMLS sense inventory, which was obtained by 1) using all the terms in the Metathesaurus file MRCONSO.RRF, and 2) adding abbreviations derived from the UMLS as described in Liu et al. 9. All terms in the UMLS were normalized to lower case. CUIs that had a corresponding term matching the abbreviation were considered possible senses for that abbreviation. The sense inventory from the UMLS contained the CUIs and their corresponding preferred strings for the abbreviations found in the UMLS. For example, the UMLS sense inventory for the abbreviation ESR would be C0086250: Erythrocyte Sedimentation Rate and C0013845: Electron Spin Resonance Spectroscopy. Similarly, we obtained a sense inventory of the abbreviations from the ADAM abbreviation database; for example, for the abbreviation ESR we obtained forms such as electron spin resonance, erythrocyte sedimentation rate, and estrogen receptor. A sketch of this lookup appears below.

For an abbreviation, we studied two types of coverage: 1) abbreviation coverage, which determines whether the sense inventory contains an entry for the abbreviated term, though not necessarily the correct sense; and 2) sense coverage, which determines whether the sense inventory contains the correct sense of the abbreviation as determined by the expert. We computed both abbreviation and sense coverage for three sense inventories: 1) the UMLS directly; 2) UMLS+Abbr, which includes the UMLS plus abbreviated terms derived from the UMLS; and 3) ADAM, from the ADAM database. One evaluator was used for this study and was shown 1) the abbreviation, 2) the corresponding sense in the clinical note as determined by the expert when annotating the note, and 3) the possible senses from each of the three inventories; if an inventory did not have an entry for the abbreviation, that field was left blank. The evaluator then determined which inventories contained an entry for the abbreviated term and which contained the correct sense for each abbreviation. If a sense inventory covered a clinical abbreviation, we also noted whether the abbreviation was ambiguous in that inventory, signifying that the abbreviation mapped to more than one expanded sense.
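A minimal sketch of building the UMLS half of such an inventory, assuming the standard pipe-delimited MRCONSO.RRF layout (CUI in field 0, term string in field 14); the abbreviation-derivation step of Liu et al. 9 and the ADAM inventory are omitted here.

```python
from collections import defaultdict

def build_umls_inventory(mrconso_path):
    """Map each lower-cased UMLS term string to its set of CUIs.
    MRCONSO.RRF is pipe-delimited; field 0 is the CUI, field 14 the string."""
    inventory = defaultdict(set)
    with open(mrconso_path, encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("|")
            cui, term = fields[0], fields[14]
            inventory[term.lower()].add(cui)
    return inventory

inventory = build_umls_inventory("MRCONSO.RRF")
senses = inventory.get("esr", set())
covered = bool(senses)        # abbreviation coverage, as defined above
ambiguous = len(senses) > 1   # ambiguity: more than one expanded sense
# e.g. senses might include C0086250 (erythrocyte sedimentation rate)
# and C0013845 (electron spin resonance spectroscopy).
```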

RESULTS

Based on the expert annotations, the training set for the decision tree contained 3,007 tokens, of which 415 were punctuation tokens and 418 were abbreviations; the test set contained 2,611 tokens, of which 363 were punctuation tokens and 411 were abbreviations. Table 1 shows the precision and recall of each abbreviation detection method. Both the rule-based and the decision tree-based methods achieved better precision and recall than the baseline method. Among them, the DT2 method reached the highest precision, 91.4%, while retaining a recall of 80.3%.

Table 1. Results of abbreviation detection.
Method       Precision           Recall
Baseline     286/387 = 73.9%     286/411 = 69.6%
Rule-Based   345/404 = 85.4%     345/411 = 83.9%
DT1          294/336 = 87.5%     294/411 = 71.5%
DT2          330/361 = 91.4%     330/411 = 80.3%

Table 2 shows the abbreviation coverage, sense coverage, and ambiguity rate of the three sense inventories. Note that although abbreviation coverage was greater than 50% for all inventories, sense coverage was lower. ADAM had the highest sense coverage, 38.0%, but 71.1% of its covered abbreviations were ambiguous. UMLS+Abbr had slightly lower sense coverage (35.0%) than ADAM, but a much lower ambiguity rate of 40.0%.

Table 2. Results of sense coverage and ambiguity study.
Sense Resource   Abbr. coverage    Sense coverage    Ambiguity if covered
UMLS             56.0% (56/100)    24.0% (24/100)    33.3% (8/24)
UMLS+Abbr        67.0% (67/100)    35.0% (35/100)    40.0% (14/35)
ADAM             66.0% (66/100)    38.0% (38/100)    71.1% (27/38)

DISCUSSION

When analyzing the abbreviations, we observed that most were either acronyms or shortened words, but we also observed a few other formation patterns. Table 3 summarizes the different types of abbreviation formation, with examples and frequency estimates based on the 100-abbreviation coverage test set. As shown in Table 3, acronyms are usually associated with multi-word phrases and are formed by taking the first letter of each word in the phrase. Another type is the shortened form, which is usually, but not always, a substring of a longer word. Contraction is a third type, consisting of an abbreviated contraction of multiple words with a separator (usually "/") between the words. We also noted the semantic classes of the abbreviations and found that disease/symptom occurred most frequently (33%), followed by procedure (11%) and lab test (11%).

Table 3. Different types of abbreviations.
Abbr. Type        Examples                                Frequency
Acronym           BP (blood pressure)                     50.0%
Shortened Words   Pt (patient), Sx (symptoms)             32.0%
Contraction       t/d/a (tobacco, drugs or alcohol),       9.0%
                  2/2 (secondary to)
Others            etoh (alcohol)                           9.0%

We performed an error analysis of the abbreviation detection methods and noted several issues. First, and not surprisingly, the tokenization step substantially affected the precision and recall of all detection methods. For example, "S. Aureus" (Staphylococcus aureus) was broken into three tokens, "S", ".", and "Aureus", by our simple tokenizer; in the predicted abbreviation list, both "S" and "Aureus" were detected individually, but not "S. Aureus", which should have been treated as a single token. Similar problems were observed when a single token contains a space (e.g., "ex tol", representing the abbreviation "exercise tolerance"). A number of studies 17-18 concerning tokenization have discussed methods for handling problems such as ambiguous periods. Developing a sophisticated tokenizer for clinical notes is not within the scope of this study, but a tokenizer developed specifically for clinical notes would improve the performance of abbreviation detection systems.
The baseline method is simple, but its performance was not good. Many tokens that were absent from the two word lists were classified as abbreviations even though they were not, which lowered precision the most. A significant source of error that lowered recall was that many abbreviations actually appeared in the English word list and were therefore not flagged by the method; for example, cc (e.g., chief complaint) and CT (e.g., CAT scan) were included in the English word list. If we manually reviewed the English word list and removed abbreviations, the baseline's recall would improve, but this would be a time-consuming task. Compared to the baseline, the rule-based method achieved dramatically better performance because it included some simple rules about word formation. As expected, both decision tree methods had better precision than the rule-based method, since the rules generated by the decision tree algorithm were optimized on the training set. DT2, which used external knowledge about words, performed better still, although its recall was slightly lower than that of the rule-based method.

We noticed that some all-lower-case abbreviations, such as prob and uri, were not captured by either decision tree method. This may be related to the small size of the training set and to the uneven distribution of positive and negative samples in it. Since we used only six notes for training, we anticipate a larger performance gain when more training data is used; the test set is relatively small too, and we plan to use a larger one in the future. Another interesting observation is that the decision tree method successfully excluded most misspelled words (e.g., "givn" for "given"), while the rule-based method did not. After inspecting the decision tree that was generated, we believe this performance gain came mostly from the frequency feature. Another advantage of the decision tree method is that we can adjust the weights in the tree to maximize either precision or recall, depending on the application.

The sense inventory from the UMLS together with the derived abbreviations had slightly lower sense coverage but a much lower ambiguity rate than the sense inventory generated from the MEDLINE-derived database, which indicates that it would be a more appropriate source for defining the senses of abbreviations. However, none of the generated sense inventories had adequate coverage, owing to the large number of unusual abbreviations (e.g., "2/2" meaning "secondary to"). Therefore, a clinical expert is likely to be required to determine the appropriate interpretation, which could vary depending on the note type and clinical domain. Future work might include 1) a study to estimate the training set size needed for the decision tree-based detection method to achieve higher performance, 2) a method to automatically identify the senses of an abbreviation from existing knowledge sources, 3) ways to facilitate building a sense inventory for abbreviations that are not covered by available knowledge sources, and 4) better tokenizers.

CONCLUSIONS

In this paper, we developed and evaluated several methods for detecting abbreviations in hospital admission notes and compared their relative performance. Among them, the decision tree method with external knowledge reached the highest precision, 91.4%, with a reasonable recall of 80.3%. Sense inventories were generated from different knowledge sources via a simple method. Evaluation of the coverage of the generated sense inventories showed that abbreviation coverage could reach 67%, but at best only 38% of the senses were covered. The sense inventory from the UMLS with derived abbreviations, which covers 35% of abbreviation senses and has an ambiguity rate of 40% for covered abbreviations, may help build a sense inventory automatically, but annotation by domain experts appears necessary for uncovered abbreviations.

Acknowledgement

This study was supported by grants LM007659, LM008635, and K22LM008805 from the NLM and grant NSF-IIS-0430743 from the NSF.

References

1. Haug PJ, et al. A natural language parsing system for encoding admitting diagnoses. AMIA 1997:4-8.
2. Aronson AR. Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. AMIA 2001:17-21.
3. Friedman C, et al. A general natural language text processor for clinical radiology. JAMIA 1994;1:161-174.
4. Long WJ. Parsing free text nursing notes. AMIA 2003:917.
5. Stetson PD, Johnson SB, Scotch M, Hripcsak G. The sublanguage of cross-coverage. AMIA 2002:742-6.
6. Chang JT, Schutze H, Altman RB. Creating an online dictionary of abbreviations from MEDLINE. JAMIA 2002;9:612-620.
7. Adar E. SaRAD: a Simple and Robust Abbreviation Dictionary. Bioinformatics 2004;20:527-533.
8. Zhou W, Torvik VI, Smalheiser NR. ADAM: another database of abbreviations in MEDLINE. Bioinformatics 2006;22(22):2813-8.
9. Liu H, Lussier YA, Friedman C. A study of abbreviations in the UMLS. AMIA 2001:393-397.
10. Park Y, Byrd R. Hybrid text mining for finding abbreviations and their definitions. Proc. EMNLP 2001.
11. Toole J. A hybrid approach to the identification and expansion of abbreviations. RIAO 2000.
12. Bodenreider O. The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Research 2004;32:D267-D270.
13. Berman JJ. Pathology abbreviated: a long review of short terms. Arch Pathol Lab Med 2004;128(3):347-52.
14. Stetson PD, et al. Electronic discharge summaries. AMIA 2005:1121.
15. http://rabbit.eng.miami.edu/dics/knuthus.txt
16. Witten I, Frank E. Data Mining: Practical Machine Learning Tools and Techniques, 2nd edition. Morgan Kaufmann, San Francisco, 2005.
17. Mikheev A. Document centered approach to text normalization. Proc. SIGIR 2000.
18. Kiss T, Strunk J. Scaled log likelihood ratios for the detection of abbreviations in text corpora. ACL 2002:1-5.