Code2Vec: Embedding and Clustering Medical Diagnosis Data


2017 IEEE International Conference on Healthcare Informatics

David Kartchner, Tanner Christensen, Jeffrey Humpherys, Sean Wade
Department of Mathematics, Brigham Young University, Provo, Utah, USA

Abstract

Identifying disease comorbidities and grouping medical diagnoses into disease incidents are two important problems in health care delivery and assessment. Using vector space embeddings produced by the Global Vectors (GloVe) algorithm, we find useful vector representations of diagnosis codes that can identify related diagnoses and thus improve identification of related disease incidents.

Keywords: Diagnosis Codes, Embeddings, Clustering, GloVe, Word2Vec

I. INTRODUCTION

One of the fundamental problems of health care is foreseeing and preventing the future health problems of patients. To do so, physicians often identify individuals as high risk for diseases when they observe co-occurring conditions called comorbidities. Diabetes and hypertension, for instance, are strong indicators that an individual is at risk of developing chronic kidney disease. While some comorbidities are obvious, others are more subtle and difficult to detect. In this analysis, we explore the viability of identifying comorbidities through insurance claims data using statistical clustering algorithms on vector space embeddings of medical diagnosis and procedure codes.

Creating meaningful embeddings is useful for at least three reasons. First, grouping claims into disease episodes is fundamental to calculating both the monetary cost and life impact of a particular disease. A stroke, for instance, can have cascading effects beyond initial treatment, such as subsequent falls or injuries caused by mobility impairment. While clinical classification software (CCS) groupings have been created as one means of grouping related diagnoses, such classification systems could miss more subtle relationships. Since we would expect related diagnoses to group together in our embedded data, such embeddings would provide an additional tool to group related diseases and to classify an individual's medical history into major medical incidents. Second, by taking advantage of large patient populations, our embeddings could both identify comorbidities for hard-to-predict diseases such as epilepsy and provide medical researchers with leads in identifying new comorbidities of more common diseases.

While vector space embeddings are commonly used in natural language processing to capture word meaning, few have translated these concepts into the medical field. One notable exception is [1], which classifies diagnoses, procedures, prescriptions, and various other medical terminology using embeddings learned from insurance claims data. This paper extends the work of [1] by demonstrating that diagnosis code embeddings can be created effectively without neural networks, using the Global Vectors (GloVe) algorithm. It further demonstrates that clustering these embeddings can lend insight into related disease incidents above and beyond that provided by CCS groupings.

II. DATA

To explore the viability of comorbidity identification via vector space embeddings, we used a database of approximately 2 million insurance claims generated by roughly 90,000 individuals over the course of 5 years. Though traditional medical insurance claims include up to four diagnoses and a procedure code, our data is limited to a single diagnosis code for each insurance claim, so we restrict our analysis to medical diagnoses. To make our data analogous to a corpus of text, we list each individual's diagnoses in roughly chronological order (it is impossible to distinguish between claims filed on the same day) and add a dummy code in place of every month in which the individual received no medical diagnosis.
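
The sequence construction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the token name `<NO_DX>`, the function names, the `(patient_id, date, code)` claim layout, and the sample ICD-9 codes are all hypothetical, as is the choice to pad only the months between a patient's first and last claim.

```python
from collections import defaultdict
from datetime import date

GAP = "<NO_DX>"  # hypothetical dummy token for a month with no diagnosis


def month_index(d):
    """Map a date to a running month count so consecutive months differ by 1."""
    return d.year * 12 + d.month - 1


def build_corpus(claims):
    """Turn (patient_id, date, code) claims into one code sequence per patient.

    Claims are sorted by date. Same-day claims keep their input order, since
    they cannot be ordered further. Each fully empty month between two claims
    contributes one GAP token (an assumption about how padding is applied).
    """
    by_patient = defaultdict(list)
    for patient_id, claim_date, code in claims:
        by_patient[patient_id].append((claim_date, code))

    corpus = {}
    for patient_id, rows in by_patient.items():
        rows.sort(key=lambda r: r[0])  # stable sort: ties keep input order
        sequence = []
        previous_month = None
        for claim_date, code in rows:
            current_month = month_index(claim_date)
            if previous_month is not None:
                empty_months = max(0, current_month - previous_month - 1)
                sequence.extend([GAP] * empty_months)
            sequence.append(code)
            previous_month = current_month
        corpus[patient_id] = sequence
    return corpus


# Illustrative claims only; the codes are examples, not data from the paper.
claims = [
    ("p1", date(2015, 1, 10), "250.00"),
    ("p1", date(2015, 1, 25), "401.9"),
    ("p1", date(2015, 4, 2), "585.9"),
    ("p2", date(2016, 6, 1), "367.1"),
]
corpus = build_corpus(claims)
```

In this toy input, patient `p1` has no claims in February or March 2015, so two dummy tokens separate the January codes from the April code.
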
Adding dummy codes ensures that diagnoses that occur years apart do not co-occur, which could otherwise happen if an individual temporarily switched insurance providers or simply did not consume any health services for an extended period of time. We then used these claims to train a GloVe model (described in Section III below) to obtain 25-dimensional representations of each of the 8,477 codes that appear at least 5 times in our database.

III. METHODS

A. GloVe

GloVe is a method originally developed for finding vector space word embeddings based on word context (i.e., co-occurrence with other words) [2]. GloVe assumes that words with similar meaning occur in similar contexts and uses this information to find vectors that capture this similarity. GloVe learns to represent semantic meaning by considering words in the corpus pairwise and comparing the probability that each co-occurs with a given context word. To do so, define $X_{ij}$ to be the number of times word $j$ appears in the context of word $i$, and $X_i = \sum_j X_{ij}$ to be the total number of times any word appears in the context of word $i$. GloVe seeks word vectors $w_i$ and context vectors $\tilde{w}_j$ that minimize the cost functional

$$J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2 \qquad (1)$$

which is essentially a weighted least squares problem. Here $V$ is the number of distinct words in the vocabulary. The pieces of $J$ are as follows:

1) The weighting function $f$ is a non-decreasing, bounded function that increases the weight of frequent co-occurrences while down-weighting infrequent co-occurrences. Moreover, $f$ is chosen to be relatively small for large $x$ so that frequent words are not excessively overweighted. In practice, $f$ is chosen to be

$$f(x) = \begin{cases} (x / x_{\max})^{\alpha} & x \le x_{\max} \\ 1 & x > x_{\max} \end{cases}$$

where $\alpha = 0.75$ and $x_{\max} = 100$ work well empirically.

2) $w_i$ and $\tilde{w}_j$ are our word vector and context vector representations, respectively, with respective biases (intercepts) $b_i$ and $\tilde{b}_j$. These are the parameters we seek to learn with our model.

3) $\log X_{ij}$ comes from taking the log of the probabilities $P_{ij} = P(j \mid i) = X_{ij} / X_i$ and noting that the denominator $X_i$ is independent of $j$ and can thus be absorbed into $b_i$.

Reproducing the full derivation of equation (1) is beyond the scope of this paper, but it can be found in [2]. The cost is minimized numerically using gradient descent, yielding the desired word embeddings.

For our purposes, we consider each diagnosis or procedure code to be a word and the sequence of codes for a given patient to be a document in a corpus. We then train GloVe on our corpus of patients, separating patients with a few instances of a dummy code so that codes from unrelated patients never fall in the same context window.

While there is no standardized metric for evaluating the quality of diagnosis code embeddings, a visual inspection suggests that our embeddings are quite accurate. An example of the nearest neighbors of diabetes is shown in Table I. The first seven of these results are quite obviously related to diabetes. Of the last three, the connection to myopia is subtle, even though diabetes is well known for impairing vision by reducing the clarity of the lens of the eye.
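
As an aside on the mechanics, the weighting function $f$, the cost in equation (1), and a nearest-neighbor lookup of the kind behind Table I can be sketched in a few lines. This is an illustrative sketch, not the authors' implementation: the function names are hypothetical, the cost is only evaluated (not minimized), and cosine similarity is an assumed choice because the text does not state which distance was used to rank neighbors.

```python
import math

X_MAX = 100.0  # cutoff x_max reported in the text
ALPHA = 0.75   # exponent alpha reported in the text


def weight(x):
    """Weighting function f: grows as (x / x_max)^alpha, then saturates at 1."""
    return (x / X_MAX) ** ALPHA if x <= X_MAX else 1.0


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def glove_cost(cooc, w, w_ctx, b, b_ctx):
    """Evaluate J from equation (1) over the nonzero co-occurrence counts.

    cooc maps (i, j) -> X_ij > 0. Pairs with X_ij = 0 are left out of cooc:
    f(0) = 0, so they would contribute nothing, and log(0) is undefined.
    """
    total = 0.0
    for (i, j), x_ij in cooc.items():
        residual = dot(w[i], w_ctx[j]) + b[i] + b_ctx[j] - math.log(x_ij)
        total += weight(x_ij) * residual ** 2
    return total


def nearest_neighbors(query, vectors, k=10):
    """Rank all other codes by cosine similarity to `query` (assumed metric).

    Assumes every vector is nonzero, otherwise the norm division fails.
    """
    q = vectors[query]
    q_norm = math.sqrt(dot(q, q))
    scored = []
    for code, v in vectors.items():
        if code != query:
            similarity = dot(q, v) / (q_norm * math.sqrt(dot(v, v)))
            scored.append((similarity, code))
    scored.sort(reverse=True)
    return [code for _, code in scored[:k]]
```

With trained embeddings stored as a dictionary from code to vector, `nearest_neighbors(code, vectors, k=10)` would return a ranked list of the same shape as Table I.
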
A justification for the link between diabetes and myopia has been established quite recently and is currently an area of active research [3], [4], [5]. Recent studies have also shown that diabetics are more likely to suffer from nail fungi, such as dermatophytosis of the nail, which accounts for both (8) and (10) [6]. This last result is particularly compelling since the effect of diabetes on dermatophytosis has been disputed over the years and has only been established relatively recently as a medical fact [7].

TABLE I
NEAREST NEIGHBORS OF DIABETES MELLITUS GIVEN BY EMBEDDED POINTS.

No. | Diabetes Mellitus Nearest Neighbors
1 | Diabetes mellitus without mention of complication, type I [juvenile type], uncontrolled
2 | Diabetes mellitus without mention of complication, type II or unspecified type, uncontrolled
3 | Diabetes with neurological manifestations, type II or unspecified type, not stated as uncontrolled
4 | Diabetes with renal manifestations, type II or unspecified type, not stated as uncontrolled
5 | Diabetes with ophthalmic manifestations, type I [juvenile type], not stated as uncontrolled
6 | Diabetes with ophthalmic manifestations, type II or unspecified type, not stated as uncontrolled
7 | Diabetes mellitus without mention of complication, type II or unspecified type, not stated as uncontrolled
8 | Dermatophytosis of nail
9 | Myopia
10 | Other specified diseases of nail

B. Clustering

Once we have obtained embeddings for our data, we cluster them using the K-Means algorithm, since agglomerative clustering methods are too computationally intensive to cluster our data efficiently. We find these clusters using the following two methods:

1) Let initial cluster centroids be the mean of the embedded vector representations for each CCS group represented in our data. This presents an intuitive choice for clusters, but also fixes the number of clusters at 259, which could be unnecessarily restrictive.

2) Choose initial centroids according to the k-means++ procedure described in [8] as follows:
   a) Randomly choose the first centroid to be a point from the dataset to be clustered.
   b) For each successive centroid, pick $x_j$ to be the next centroid with probability

   $$p_j = \frac{D(x_j)}{\sum_{i=1}^{n} D(x_i)},$$

   where $D(x_j)$ is the distance from $x_j$ to the nearest centroid already chosen. This step is repeated until $k$ centroids have been chosen.

Fig. 1 shows a visualization of various 2-dimensional projections of the clusters found in our data using t-distributed stochastic neighbor embeddings (t-SNE), principal component projections (PCA), and linear discriminant projections. To make this figure more meaningful, we limit our plots to clusters pertaining to select diagnoses. We select these clusters by picking a major diagnosis code representative of the disease in question and choosing the cluster to which it belongs. Though limited by low dimensionality, one can observe that at least some of the data form relatively distinct clusters, especially under the t-SNE projection. For the sake of brevity, Fig. 1 shows only data clustered using the means of CCS groupings as centroids because the clusters generated using k-means++ are qualitatively similar.

Fig. 1. Visual comparison of clusters with initial centroids set at means of CCS groupings. The results using k-means++ centroids are qualitatively similar.

IV. RESULTS

We now return to our initial question of whether clusters of embedded diagnoses can help us identify comorbidities in our data. Since clustering is an inherently unsupervised task, we acknowledge that we do not have a simple, absolute metric by which to assess cluster validity. Many conditions may be comorbidities of multiple diseases, but we restrict each to exactly one cluster, and we do not have exhaustive information on known comorbidities. In the absence of such data, we heuristically assess both cluster validity and the presence of comorbidities by inspecting a few key clusters element by element and attempting to determine how closely each data point is related to the main theme of the cluster.

To illustrate how this procedure works, consider the cluster containing the diagnosis code for advanced chronic kidney disease, known as end-stage renal disease (ESRD). Once an individual enters end-stage kidney disease, his or her kidneys have lost so much function that dialysis is required multiple times a week to properly filter toxins from the blood. Worse, chronic kidney disease is irreversible, so individuals in end-stage must either receive a kidney transplant or receive dialysis for the rest of their lives. Thus, we would expect codes in our ESRD cluster to be related to advanced kidney damage, dialysis, kidney transplants, and associated conditions. Table II summarizes codes contained in our ESRD cluster.

TABLE II
SUBSET OF DIAGNOSES AND PROCEDURES CONTAINED IN THE CLUSTER CORRESPONDING TO END-STAGE RENAL DISEASE. NOTE THAT ALL BUT TWO OF THESE CONDITIONS EXPLICITLY RELATE TO RENAL PROBLEMS OR DIALYSIS.

No. | End-Stage Renal Disease Related Code | CCS Group
1 | Kidney replaced by transplant |
2 | Chronic kidney disease, unspecified |
3 | Chronic kidney disease, Stage IV (severe) |
4 | Anemia in chronic kidney disease | 59
5 | Diabetes with renal manifestations, type II or unspecified type, not stated as uncontrolled | 50
6 | Hypertensive chronic kidney disease, unspecified, with chronic kidney disease stage I through stage IV, or unspecified | 99
7 | Chronic kidney disease, Stage II (mild) |
8 | Chronic kidney disease, Stage V |
9 | Secondary hyperparathyroidism (of renal origin) |
10 | Diabetes with renal manifestations, type I [juvenile type], not stated as uncontrolled |
11 | Hypertensive chronic kidney disease, benign, with chronic kidney disease stage I through stage IV, or unspecified | 99
12 | Hypertensive chronic kidney disease, unspecified, with chronic kidney disease stage V or end stage renal disease | 99
13 | Other malignant lymphomas, lymph nodes of multiple sites |
14 | Complications of transplanted bone marrow |
15 | Complications of transplanted kidney |
16 | Polycystic kidney, unspecified type |
17 | End stage renal disease | 158

It is readily apparent that the entries in the table correspond directly to renal problems, with the possible exception of entries 13 and 14. Further inspection, however, reveals that kidney failure often follows bone marrow transplants [9] and that impaired kidney function is often a symptom of lymphoma [10]. Thus each of the codes we investigated from our ESRD cluster is closely linked to ESRD, which lends strength to the hypothesis that clusterings could be useful in identifying disease comorbidities. We further see that combining clustering with our embeddings captures relationships between diagnoses and procedures not captured in CCS groupings alone, as can be seen by the diversity of CCS groupings in Table II.
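
The two seeding schemes of Section III-B and the cluster lookup used for this kind of inspection can be sketched end to end as below. This is a hedged, standard-library illustration rather than the authors' pipeline: all function names, the toy 2-dimensional vectors, and the short code labels are invented for the example. The `power` parameter is also an assumption: `power=1` follows the seeding probability as written in the text, while `power=2` gives the squared-distance weighting used by k-means++ in [8].

```python
import math
import random
from collections import defaultdict


def dist(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def mean(points):
    return [sum(column) / len(points) for column in zip(*points)]


def ccs_mean_centroids(vectors, ccs_group):
    """Method 1: one initial centroid per CCS group, the mean of its codes."""
    members = defaultdict(list)
    for code, v in vectors.items():
        members[ccs_group[code]].append(v)
    return [mean(vs) for _, vs in sorted(members.items())]


def kmeans_pp_centroids(points, k, rng, power=1):
    """Method 2: seed k centroids, favoring points far from those chosen."""
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        # D(x_j): distance from each point to its nearest chosen centroid.
        weights = [min(dist(p, c) for c in centroids) ** power for p in points]
        centroids.append(rng.choices(points, weights=weights, k=1)[0])
    return centroids


def kmeans(vectors, initial_centroids, iterations=50):
    """Plain Lloyd iterations from the given centroids; returns code -> cluster."""
    centroids = [list(c) for c in initial_centroids]  # do not mutate the input
    labels = {}
    for _ in range(iterations):
        new_labels = {
            code: min(range(len(centroids)),
                      key=lambda c: dist(vectors[code], centroids[c]))
            for code in vectors
        }
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(len(centroids)):
            assigned = [vectors[code] for code in vectors if labels[code] == c]
            if assigned:  # an emptied cluster keeps its previous centroid
                centroids[c] = mean(assigned)
    return labels


def cluster_members(query, labels):
    """All codes sharing a cluster with `query`, for manual inspection."""
    return sorted(code for code, c in labels.items() if c == labels[query])


# Toy data, invented for illustration only.
vectors = {
    "esrd": [0.0, 0.0], "ckd4": [0.2, 0.0], "dm_renal": [0.3, 0.1],
    "dm_plain": [4.0, 0.0], "dm_eye": [4.2, 0.2],
}
ccs_group = {"esrd": 158, "ckd4": 158, "dm_renal": 50, "dm_plain": 50, "dm_eye": 50}

labels = kmeans(vectors, ccs_mean_centroids(vectors, ccs_group))
members = cluster_members("esrd", labels)
groups_in_cluster = sorted({ccs_group[code] for code in members})
```

In this toy example the code labeled `dm_renal` starts in CCS group 50 but ends up in the same cluster as the kidney codes, so `groups_in_cluster` spans two CCS groups, which is the kind of cross-group relationship the inspection above looks for.
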
We note that while the results of clustering around ESRD are promising, other diagnoses exhibit worse results. Burns, for example, are clustered together with insect bites, presumably because both involve skin irritation and blistering. In spite of this, we believe that our clusters could be a useful tool because we would expect chronic diseases (e.g., kidney disease) to show more consistent, long-term patterns of comorbidities than incidental injuries (e.g., burns). A broader, more systematic exploration of these patterns is a potential area of future research.

V. CONCLUSION

Vector space embeddings can be a powerful means of mining meaning from medical text data. Using the GloVe algorithm to create 25-dimensional embeddings of medical diagnosis codes, we cluster diagnoses and use these clusters to identify related diseases, even when they fall in different CCS categories. This success indicates that embeddings capture some level of inherent meaning present in the diagnosis codes, suggesting that these embeddings could be useful features for disease prediction algorithms. This is a promising area of future research.

VI. ACKNOWLEDGEMENTS

This work was supported in part by the National Science Foundation, Grant Number and the Defense Threat Reduction Agency, Grant Number HDRTA

REFERENCES

[1] Y. Choi, C. Y.-I. Chiu, and D. Sontag, "Learning low-dimensional representations of medical concepts," in AMIA Summits on Translational Science Proceedings, 2016.
[2] J. Pennington, R. Socher, and C. D. Manning, "GloVe: Global vectors for word representation," in EMNLP, vol. 14, 2014.
[3] H. C. Fledelius, "Myopia and diabetes mellitus with special reference to adult-onset myopia," Acta Ophthalmologica, vol. 64, no. 1, Feb.
[4] M. A. A. Paul Chous, "Ocular manifestations of diabetes: Some clues for eyecare professionals," May. [Online]. Available: modernmedicine.com/optometrytimes/content/tags/diabetes/ocular-manifestations-diabetes-some-clues-eyecare-professionals
[5] M. Young, "Connecting diabetes and myopia," April. [Online]. Available: article-connecting-diabetes-and-myopia
[6] T. C. Vlahovic and J. A. Sebag, Onychomycosis in Diabetics. Cham: Springer International Publishing, 2017. [Online]. Available: 17
[7] A. Lugo-Somolinos and J. Sanches, "Prevalence of dermatophytosis in patients with diabetes," Journal of the American Academy of Dermatology, vol. 26, March.
[8] D. Arthur and S. Vassilvitskii, "K-means++: The advantages of careful seeding," in Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, ser. SODA '07. Philadelphia, PA, USA: Society for Industrial and Applied Mathematics, 2007.
[9] B. Pulla, Y. Barry, and A. E., "Acute renal failure following bone marrow transplantation," Renal Failure, vol. 20, May.
[10] L. J. Cohen, H. G. Rennke, J. P. Laubach, and B. D. Humphreys, "The spectrum of kidney involvement in lymphoma: a case report and review of the literature," American Journal of Kidney Diseases: The Official Journal of the National Kidney Foundation, vol. 56.
[11] L. van der Maaten and G. Hinton, "Visualizing high-dimensional data using t-SNE."
[12] M. K. Kuhlmann, A. Kribben, M. Wittwer, and W. H. Hörl, "OPTA-malnutrition in chronic renal failure," Nephrology Dialysis Transplantation, vol. 22, no. 3, p. iii13.
[13] H. C. I. I. Institute. (2017) Prometheus analytics.
[14] F. Hildebrant, "Renal medicine 1: Genetic kidney diseases," The Lancet, vol. 375, no. 9722.
[15] S. J. Ryu, "Intracranial hemorrhage in patients with polycystic kidney disease," Stroke, vol. 21, no. 2.
[16] W. McKinney, "Data structures for statistical computing in Python," in Proceedings of the 9th Python in Science Conference, S. van der Walt and J. Millman, Eds., 2010.
[17] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, "Scikit-learn: Machine learning in Python," Journal of Machine Learning Research, vol. 12.
[18] W. H. Organization et al., International classification of diseases: [9th] ninth revision, basic tabulation list with alphabetic index.
