A Context-Sensitive Support System for Medical Diagnosis Discovery based on Symptom Matching

Marko KNEŽEVIĆ a,1, Vladimir IVANČEVIĆ a and Ivan LUKOVIĆ a
a University of Novi Sad, Faculty of Technical Sciences, Novi Sad, Serbia

Abstract. In this paper, we present a prototype of a clinical decision-support system. This prototype relies on a two-phase algorithm that is based on the differential diagnosis method from medical diagnostics and on predictive models for disease occurrence in a subpopulation. The algorithm requires a data set containing information about diseases and their corresponding symptoms, and a data set with registered disease cases. The main output of this algorithm is a ranked list of diagnoses that might explain the manifested symptoms. The ranking is influenced by the patient's context, i.e., disease trends within a subpopulation to which the patient belongs. In the context of medical diagnosis discovery based on symptom matching, we present a short rationale for developing such a system, a brief review of similar systems, an algorithm for ranking diagnoses, and ideas for future research. Furthermore, we elaborate on the required data sets and illustrate the application of the proposed solution with a typical use scenario.

Keywords. clinical decision support system, symptom checker, data mining, medical diagnosis, health informatics

Introduction

With the increased availability of databases containing vast knowledge regarding diseases and medical conditions, new possibilities arise for non-expert users who are interested in obtaining information about illnesses that they constantly face. However, the use of Web search as a diagnostic procedure, where queries describing symptoms are input and the resulting information, together with the associated rank, is interpreted as a diagnostic conclusion, may lead users to believe that common symptoms are likely the result of a serious illness.
Such escalation from common symptoms to serious concerns may cause unnecessary anxiety, investment of time and expensive engagements with healthcare professionals [1]. Therefore, there is a strong need for trusted health and medical information about a set of manifested symptoms. Any software-based solution for such a requirement should also take into account the fuzzy and incomplete nature of user queries and the lack of precise knowledge regarding symptom names in the general population. What complicates this issue even more is the fact that the relationship between symptoms and diseases is of the many-to-many type, in which one symptom may be linked to many disorders and vice versa. In addition, the symptoms of a disease may vary between individuals or disease subtypes.

Our goal is to create a pilot version of a software-based solution that could support, at least partially, one-way symptoms-to-diseases matching while satisfying the aforementioned criteria. Moreover, as the Serbian healthcare system is undergoing extensive and significant modernization [2], this solution should primarily motivate stakeholders to more seriously regard IT solutions as an indispensable component in the modern provision of health services in Serbia. The proposed software system needs to support a scenario in which a user provides the list of present symptoms and relevant personal information in order to obtain a list of potential diagnoses. What sets this solution apart from many other similar systems is its reliance on a broader patient context. This context sensitivity may be observed in the refinement of the initial ranked set of diagnoses that match a set of symptoms according to medical knowledge: the ranking of a matching diagnosis is increased if that diagnosis is often observed in the subpopulation to which a patient belongs and decreased if the diagnosis is rare (or not present) in the subpopulation. In this manner, medical diagnosticians could utilize the system to more rapidly evaluate potential diagnoses in individual cases.

1 Corresponding Author: Marko Knežević; University of Novi Sad, Faculty of Technical Sciences, Trg Dositeja Obradovića 6, 21000 Novi Sad, Serbia; E-mail: marko.knezevic@uns.ac.rs.
Tasks that need to be carried out during the implementation of such a system include: (i) collection of medical knowledge about diseases and their symptoms; (ii) collection of data that could help narrow down the search space of diseases by containing a valid excerpt of registered cases for different subpopulations; (iii) formulation of an algorithm that forms a list of possible diagnoses matching a list of symptoms by utilizing information from the two abovementioned data sets; and (iv) implementation of the algorithm and its integration into software that is also designed for non-expert users. This paper is divided into four sections besides the Introduction and Conclusion: Section 1, which gives an overview of similar solutions; Section 2, which presents data sets utilized in the proposed algorithm; Section 3, which describes the algorithm for symptoms-to-disease matching; and Section 4, which illustrates a use scenario for the algorithm in a software system for epidemiological research and monitoring.

1. Related Work

There are many clinical decision support systems (CDSSs) [3] and software systems that provide information on the most common causes of provided symptoms. One group of such systems includes symptom checkers, online tools used to help educate users and suggest what condition certain symptoms could indicate. WebMD Symptom Checker [4] is one such example of a tool designed to help determine the underlying cause for a set of medical symptoms and learn about possible conditions. Different areas where discomfort or pain is being felt may be selected on the displayed human figure. Based on the selection, symptoms that are being experienced and related to that area may be further specified. Depending on the input, WebMD Symptom Checker provides information needed to determine the next steps in dealing with the symptoms, including recommendations to consult with a physician, as well as lifestyle changes.
Another similar solution is the Isabel Symptom Checker [5], which takes a pattern of symptoms in everyday language and instantly computes the most likely diseases. When compared to our solution, both WebMD Symptom Checker and Isabel Symptom Checker ignore data about disease trends in a community to which the user belongs. Furthermore, they provide information only on the basis of symptoms being consistent with a diagnosis. In this manner, those systems provide answers to such questions as whether some symptoms might indicate a serious, perhaps chronic or fatal condition, or whether such fears are unfounded. Unlike our system, their results are not influenced by the results of an analysis that is performed over a set of registered disease cases.

The second group of such systems helps physicians to diagnose diseases. Internist [6] is one such example of a diagnostic program. It tries to explicitly capture the way human experts make their diagnoses, using a complex problem-solving strategy based on the technique of differential diagnosis that clinicians use every day. Its main strengths with respect to our solution are two parameters supplied for each finding (such as symptoms and test results), indicating the correlation between disease and finding. The first parameter represents the likelihood of the disease given that the finding occurs. The second parameter represents the likelihood of the finding given that the disease occurs. Values of both parameters for each association between disease and finding within Internist's knowledge base are a result of many man-years of effort provided by a team of physicians. On the other hand, our solution estimates these parameters on the basis of the number of symptoms consistent with a disease and the number of diseases associated with a symptom. However, there is a conceptual difference between these two systems regarding algorithm and usage scope. Our solution is intended to be used outside the physician's office as well. In this manner, it may be used by people other than healthcare professionals, in which case it has only an informative purpose. Furthermore, it is a web-based system that may be used through a regular web browser and easily made public. As far as the algorithm is concerned, unlike Internist, our solution utilizes predictive models to refine the initial results that are based solely on observed symptoms.
The user may also set how much the final score assigned to each diagnosis is going to be influenced by the results of the corresponding predictive model. The final score depends on a parameter that determines the ratio of the scores provided by the separate phases of the algorithm, one of which relies on predictive models.

DXplain [7] is a decision support system developed at the Laboratory of Computer Science at the Massachusetts General Hospital. It utilizes a set of clinical findings (signs, symptoms, and laboratory data) to produce a ranked list of diagnoses that might explain the clinical manifestation. Furthermore, it suggests what further clinical information would be useful to collect for each disease, and lists what clinical manifestations, if any, would be unusual or atypical for each of the specific diseases. Similarly to our solution, it is not intended to be used as a substitute for professional medical advice. Another similarity is that it also offers a ranked list of diagnoses with corresponding sets of not observed symptoms. These symptoms are provided as guidance in making decisions about which laboratory test to order or which symptoms to observe. On the other hand, the principal difference lies in the ability of our solution to rank the diseases based on the epidemiological trend in the subpopulation.

The strong point of our solution when compared to the aforementioned solutions is the possibility to reduce errors caused by different probabilistic relationships between findings and diagnoses in different patient populations. The reduction is done by utilizing a data set with registered disease cases in a subpopulation. In this manner, among diseases with similar sets of symptoms, higher ranking is given to those more consistent with disease trends within the specified subpopulation. Therefore, the relationship between the symptoms and a disease to which the subpopulation is more susceptible has a greater significance.
Our solution is a complex system that has traits typical of both symptom checkers and CDSSs: (i) it may provide information about the meaning of observed symptoms; and (ii) support medical diagnosis discovery.

2. Data Sets

From a data point of view, the algorithm that we propose in this paper requires: (i) a data set with diseases and their corresponding symptoms; and (ii) a data set with registered disease cases and sufficient information to determine the subpopulation to which a case belongs. Each of these data sets is needed in a different phase of the algorithm and directly influences the generated results. The development of the diagnostics-support system was primarily driven by the quality of existing data sets. The devised algorithm had to conform to the structure of the two data sets, especially because the data set on registered cases, which was obtained from an official public health institution, illustrates well what type of information may be retrieved from the majority of electronic health records that are presently available in Serbia. In these circumstances, acknowledging additional data dimensions in the algorithm would only impede the diagnosis discovery process due to a general lack of additional information about diagnoses. The exact use of these data is covered in Section 3, while in the rest of this section we elaborate on the origin and structure of the two data sets.

2.1. Symptoms Data Set

The data set with diagnoses and their corresponding findings was acquired from Freebase, an open repository of structured data [8]. The content of Freebase is divided into topics, which may be associated with different types, also known as Freebase ontologies (structured categories). Each type has a number of defined predicates called properties. For example, an entry for lung cancer would be entered as a topic that would include a variety of types describing it as a disease or medical condition, a cause of death, and a medical condition in fiction. The first step in the extraction process was to download a list of all diagnoses together with the locations of the related resources within Freebase.
Using that list, we were able to download JavaScript Object Notation (JSON) [9] files that contain structured information on findings consistent with the diagnoses. The next step implied parsing downloaded files and inserting data into a relational database whose database schema is presented in Figure 1. Figure 1. Relational schema of the database.
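The parse-and-insert step described above can be sketched in a few lines. This is a minimal illustration only: the layout of the downloaded JSON files and the schema of Figure 1 are not reproduced in the text, so the JSON keys (`name`, `symptoms`), the table names, and the helper functions below are hypothetical, and SQLite stands in for whichever relational database was actually used.

```python
import json
import sqlite3

# Hypothetical shape of one downloaded file; the real Freebase layout is not
# given in the text, so these keys are assumptions made for illustration.
SAMPLE_JSON = """
{
  "name": "Common cold",
  "symptoms": ["Cough", "Sneezing", "Nasal congestion"]
}
"""


def create_schema(conn):
    # Assumed minimal schema: two entity tables and a many-to-many link table.
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS diagnosis (
            id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
        CREATE TABLE IF NOT EXISTS symptom (
            id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
        CREATE TABLE IF NOT EXISTS diagnosis_symptom (
            diagnosis_id INTEGER NOT NULL REFERENCES diagnosis(id),
            symptom_id INTEGER NOT NULL REFERENCES symptom(id),
            PRIMARY KEY (diagnosis_id, symptom_id));
    """)


def get_or_create(conn, table, name):
    # `table` is always one of the fixed names above, never user input.
    conn.execute(f"INSERT OR IGNORE INTO {table} (name) VALUES (?)", (name,))
    row = conn.execute(
        f"SELECT id FROM {table} WHERE name = ?", (name,)).fetchone()
    return row[0]


def load_diagnosis(conn, raw_json):
    # Parse one file and store the diagnosis with its symptom associations.
    record = json.loads(raw_json)
    diagnosis_id = get_or_create(conn, "diagnosis", record["name"])
    for symptom_name in record.get("symptoms", []):
        symptom_id = get_or_create(conn, "symptom", symptom_name)
        conn.execute(
            "INSERT OR IGNORE INTO diagnosis_symptom VALUES (?, ?)",
            (diagnosis_id, symptom_id))
    conn.commit()
    return diagnosis_id


conn = sqlite3.connect(":memory:")
create_schema(conn)
load_diagnosis(conn, SAMPLE_JSON)
```

The link table reflects the many-to-many relationship between symptoms and diseases noted in the Introduction; loading the same file twice leaves the database unchanged.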

The obtained sample has approximately 9,500 records about diagnoses and 1,700 records about corresponding symptoms. Besides symptoms, our database contains information about disease risk factors, causes of diseases, treatments and medical specialties with approximately 900, 600, 1,550 and 70 records, respectively.

2.2. Registered Cases Data Set

The health information system (HIS) of The Institute of Public Health of Vojvodina in Novi Sad, Serbia, was used as a source in the extraction of approximately 8,500 records concerning workplace absence. Each record contains information about absence cause, disease codes for initial and final diagnosis, municipality code, gender, age, start and end date of absence, and business activity code of the person involved. All of the codes in the data set are part of the officially recognized codebooks of causes, municipalities, business activities and diseases. The data warehouse that contains the aforementioned medical data is modeled using a star schema. It contains one fact table and eight dimensions, two of which are role-playing dimensions. Therefore, each recorded event may be observed in the context of the gender of the person involved, time (when an event occurred or ended), absence cause, the person's profession, data source, the person's age, the place where it happened, and the diagnosis that was established. The structure of the data warehouse is given in more detail in [10]. As the collected data needed to be transformed and loaded into the data warehouse, an Extract, Transform, Load (ETL) process was specially designed for that purpose. Through a series of transformations within the ETL steps, disease information was extended with disease name, subcategory, and category, while various errors and inconsistencies within the data set were detected and eliminated.

3. An Algorithm for Matching Symptoms to Diseases

The proposed algorithm is based on Bayes' theorem and the method of differential diagnosis that clinicians regularly use in establishing a diagnosis. It consists of two phases, each of which independently contributes to the final score (rank) of each suggested diagnosis. The result of the algorithm is a set of possible diagnoses with the corresponding final score and the individual scores from each phase. The algorithm input consists of two sets of data. The first set is a list of symptoms, while the other, which is optional, includes personal data of a patient: municipality, age, gender, and the quarter of the year when the patient first started experiencing symptoms. The output is a set of possible diagnoses with a list of symptoms associated with each diagnosis but not currently observed. Besides symptoms, each diagnosis is given a score meant to help the physician in the decision-making process. Furthermore, the output contains risk factors, causes, a description of the diagnosis and known treatments associated with each diagnosis from the set. The algorithm relies heavily on associations between diagnoses and symptoms. Besides associations, two values are supplied, indicating the correlation between diagnoses and symptoms. The first value estimates the likelihood of a certain diagnosis being correct if an associated symptom occurred. The second value estimates the likelihood that a symptom manifests given that the diagnosis is correct. Both values are numbers from the [0, 1] interval. The first value is based on the number of symptoms consistent with the diagnosis, c(D), and the number of diagnoses associated with the symptom, f(S). Eq. (1) is

used in the calculation of the estimated value for every pair of diagnosis and associated symptom. The second value is initially set to one. It may be adjusted manually or automatically for all symptoms associated with a disease. After an arbitrary number of cases in which the diagnosis was confirmed, the value assigned to the second estimation may be updated to the mean number of the onsets of the symptom in those cases. In this manner, as the number of these cases increases, the second value gives a better approximation of the probability that a symptom manifests in case the diagnosis is correct.

\[ L(D \mid S) = \frac{1}{c(D)} \cdot \frac{1}{f(S)} \qquad (1) \]

Depending on the input, the system selects diagnoses that have at least one symptom present in the input set of symptoms. For every diagnosis from the selection, four lists (L1-L4) are created. They contain: (i) observed symptoms consistent with the diagnosis (L1); (ii) observed symptoms not associated with the diagnosis (L2); (iii) symptoms associated with the diagnosis that are observed not to be present in a patient (L3); and (iv) symptoms not yet observed but that are associated with the diagnosis (L4). The first three lists are given a score that is calculated using Eq. (2).

\[ s(L) = \sum_{S_i \in L} L(D \mid S_i) \cdot L(S_i \mid D) \qquad (2) \]

With symptoms in the list L1 contributing positively to the score given to the associated diagnosis and the other lists contributing negatively, we specified Eq. (3) for calculating a score based on observed symptoms. Eq. (3) features three parameters p, q and r with initial values set to 1, 0.1, and 1, respectively. These parameters are introduced in order to control the impact of each list on the final score. Therefore, their values may be changed within the system settings.

\[ w(D) = p \cdot s(L_1) - q \cdot s(L_2) - r \cdot s(L_3) \qquad (3) \]

In the next phase, naïve Bayes classifiers are used to estimate the individual share of each diagnosis in all work absences [10]. Estimation is based on the second input data set.
In this manner, we form the second list of diagnoses with a coarse prediction of the share for each diagnosis. Finally, both lists are merged and each diagnosis is given a combined score based on Eq. (4), which utilizes previously calculated values from both phases of the algorithm: the score based on symptoms w(D) and the prediction p(D) based on recorded cases. The default value of the proportion parameter x is 0.5.

\[ f(D) = x \cdot w(D) + (1 - x) \cdot p(D) \qquad (4) \]

The portion of each value in the combined score may be set manually, which gives the user a possibility to favor one of the values to a certain extent. Furthermore, symptoms that are not yet observed but are associated with the diagnoses and present in the output list of symptoms may be marked as observed to be present or not. Therefore, the process of calculating diagnosis scores may be repeated with the inclusion of new observations.
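The scoring described in this section can be sketched as follows. This is an illustrative sketch under stated assumptions, not the system's actual implementation: it takes the first value as the product 1/c(D) times 1/f(S), defaults the second value to one, applies the same formula to observed symptoms not associated with a diagnosis (list L2), and gives diagnoses suggested only by the prevalence model a symptom score of zero when the two lists are merged. The function name `rank_diagnoses`, the dictionary layout, and the toy knowledge base are hypothetical.

```python
from collections import Counter


def rank_diagnoses(knowledge, present, absent, prevalence,
                   p=1.0, q=0.1, r=1.0, x=0.5, l_s_given_d=None):
    """Rank diagnoses by the combined score x*w(D) + (1-x)*p(D)."""
    l_s_given_d = l_s_given_d or {}
    # c(D): number of symptoms consistent with diagnosis D.
    c = {d: len(symptoms) for d, symptoms in knowledge.items()}
    # f(S): number of diagnoses associated with symptom S.
    f = Counter(s for symptoms in knowledge.values() for s in symptoms)

    def list_score(d, symptoms):
        # Eq. (2): sum of L(D|S_i) * L(S_i|D) over the symptoms in one list.
        total = 0.0
        for s in symptoms:
            if f[s] == 0:
                continue  # symptom unknown to the knowledge base (assumption)
            l_d_given_s = 1.0 / (c[d] * f[s])  # Eq. (1), product form assumed
            total += l_d_given_s * l_s_given_d.get((d, s), 1.0)
        return total

    ranking = []
    for d, symptoms in knowledge.items():
        l1 = present & symptoms           # observed and consistent
        if not l1:
            continue                      # keep only diagnoses with a match
        l2 = present - symptoms           # observed, not associated
        l3 = absent & symptoms            # associated, observed as absent
        l4 = symptoms - present - absent  # associated, not yet observed
        # Eq. (3): L1 raises the score, L2 and L3 lower it.
        w = (p * list_score(d, l1) - q * list_score(d, l2)
             - r * list_score(d, l3))
        ranking.append({"diagnosis": d, "w": w,
                        "p": prevalence.get(d, 0.0), "unobserved": sorted(l4)})

    # Merge in diagnoses that only the prevalence model suggests (assumption).
    selected = {entry["diagnosis"] for entry in ranking}
    for d, share in prevalence.items():
        if d not in selected:
            ranking.append({"diagnosis": d, "w": 0.0, "p": share,
                            "unobserved": []})

    for entry in ranking:
        # Eq. (4): weighted combination of both phases.
        entry["score"] = x * entry["w"] + (1.0 - x) * entry["p"]
    return sorted(ranking, key=lambda e: e["score"], reverse=True)


# Toy knowledge base and prevalence values, invented for illustration only.
KNOWLEDGE = {
    "Common cold": {"cough", "sneezing", "nasal congestion", "headache"},
    "Sinusitis": {"cough", "nasal congestion", "headache",
                  "facial pain", "toothache", "fatigue"},
    "Pertussis": {"cough", "sneezing", "nasal congestion", "anorexia"},
}
ranking = rank_diagnoses(
    KNOWLEDGE,
    present={"cough", "sneezing", "nasal congestion"},
    absent={"anorexia"},
    prevalence={"Common cold": 0.3, "Viral infection": 0.4},
    x=0.8,
)
for entry in ranking:
    print(f'{entry["diagnosis"]}: {entry["score"]:.3f}')
```

Marking a symptom from the `unobserved` list as present or absent and calling the function again corresponds to repeating the calculation with new observations.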

4. Implementation and Use Scenario

Dr Warehouse is an intelligent software system for epidemiological monitoring, prediction, and research [10]. It consists of a data warehouse with epidemiological data, an extensible application server, an extensible web client application, and a mobile device client application. The system is designed to be relatively easy to integrate with an existing HIS. Dr Warehouse currently supports: access to the data warehouse containing medical records; use of advanced analytical operations over data stored in a data cube; execution of advanced analyses and forecasts; services for mobile device users; and use of extensions. The diagnosis support system presented in this paper is integrated into the existing Dr Warehouse solution and to some extent relies on its services. Its user is initially presented with a list of symptoms. From the presented list, the user is able to specify manifested symptoms. After the initial set of symptoms is submitted, the system provides a list of associated diagnoses consistent with at least one of the specified symptoms. For each diagnosis in the list, the system provides a list of not observed symptoms that are consistent with that diagnosis. Optionally, from the list of not observed symptoms, the user may specify symptoms that are observed not to be present. Based on the specification of user observations, each diagnosis is given a score. The second phase of the algorithm is responsible for supplementing the diagnosis estimation with information about diagnosis prevalence in a given subpopulation. In order to define a subpopulation, the user must specify the year quarter when symptoms first manifested and the subpopulation to which the person reporting the symptoms belongs, as defined by age group and gender, in a selected municipality. Subpopulation definition is an optional part of the diagnosis support process. Diagnosis prevalence in a subpopulation is determined by a predictive data mining model.
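A predictive model of this kind could look roughly like the sketch below: a small categorical naive Bayes classifier, one per municipality and year quarter, that maps an age group and gender to estimated disease shares. The class name `CategoricalNaiveBayes`, the helper `train_models`, the record field names, the Laplace smoothing, and the toy records are all assumptions made for illustration; the text does not specify how the classifiers in Dr Warehouse are implemented.

```python
from collections import Counter, defaultdict


class CategoricalNaiveBayes:
    """Naive Bayes over categorical features with Laplace smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.class_counts = Counter()
        self.feature_counts = defaultdict(Counter)  # (index, class) -> counts
        self.feature_values = defaultdict(set)      # index -> seen values

    def fit(self, rows, labels):
        for row, label in zip(rows, labels):
            self.class_counts[label] += 1
            for i, value in enumerate(row):
                self.feature_counts[(i, label)][value] += 1
                self.feature_values[i].add(value)
        return self

    def predict_proba(self, row):
        # Prior times smoothed per-feature likelihoods, then normalization.
        total = sum(self.class_counts.values())
        scores = {}
        for label, count in self.class_counts.items():
            score = count / total
            for i, value in enumerate(row):
                n_values = len(self.feature_values[i])
                seen = self.feature_counts[(i, label)][value]
                score *= (seen + self.alpha) / (count + self.alpha * n_values)
            scores[label] = score
        norm = sum(scores.values())
        return {label: s / norm for label, s in scores.items()}


def train_models(cases):
    # One classifier per (municipality, quarter), as in the described setup.
    grouped = defaultdict(lambda: ([], []))
    for case in cases:
        rows, labels = grouped[(case["municipality"], case["quarter"])]
        rows.append((case["age_group"], case["gender"]))
        labels.append(case["diagnosis"])
    return {key: CategoricalNaiveBayes().fit(rows, labels)
            for key, (rows, labels) in grouped.items()}


def make_case(age_group, gender, diagnosis):
    return {"municipality": "Novi Sad", "quarter": 1, "age_group": age_group,
            "gender": gender, "diagnosis": diagnosis}


# Toy registered cases, invented for illustration only.
CASES = (
    [make_case("60-69", "F", "Common cold")] * 3
    + [make_case("60-69", "F", "Viral infection")]
    + [make_case("20-29", "M", "Viral infection")] * 2
)
models = train_models(CASES)
shares = models[("Novi Sad", 1)].predict_proba(("60-69", "F"))
for diagnosis, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{diagnosis}: {share:.2f}")
```

The resulting dictionary of shares plays the role of the prediction that the second phase contributes to the combined score.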
For each municipality and year quarter, there is a naïve Bayes classifier that, for a given age group and gender, provides a list of frequent diseases and probabilities of their occurrence. In other words, these classifiers generate coarse predictions of the distribution of the most common diseases in a selected subpopulation. They are trained on the data set with registered disease cases. However, as classification algorithms support a limited number of classes, it is not uncommon that a list of diagnoses given by a prediction model has little or no overlap with the list of diagnoses obtained on the basis of symptoms. The final score for each diagnosis is calculated based on the scores given by each phase of the algorithm. Furthermore, the level of contribution of each phase to the final score may be set manually, which allows the user to rely more on the results of one of the algorithm phases, depending on the actual situation.

The use of the proposed algorithm is illustrated by the example of the manifested symptoms: cough, fatigue, nasal congestion, headache and sneezing. Based on these symptoms, the system displays a list of 177 possible diagnoses including Pertussis, Upper respiratory tract infection, Common cold, Rhinitis and Sinusitis. The principal reason for such a long list is the fatigue symptom, which is consistent with many different diagnoses. For each diagnosis in the list, we may view the lists of associated symptoms L1, L3, and L4, which are described in Section 3. As pertussis is consistent with nasal congestion, cough and sneezing, we examined the symptoms listed in L4 in order to provide more information on that disease. Based on observation, we discovered that anorexia is not present and moved it to the L3 list. More information on each diagnosis gives us a more precise score in the final calculation. On the other hand, some observations may be expensive and time consuming, like laboratory tests that are intended to confirm or deny the presence of a symptom. Therefore, we should first gather more information on the most probable diagnoses.

In order to increase the efficiency of the diagnosis support process, we performed an initial calculation to determine which diagnoses are more probable. As part of the calculation, the system utilized a previously trained naïve Bayes classification model for a 62-year-old female person in the city of Novi Sad with the first occurrence of symptoms during the first quarter of the year. In this manner, we enhance the result with coarse predictions of the distribution of the most common diagnoses in the subpopulation to which this patient belongs. The proportion parameter is set to 0.8 in order to give precedence to diagnoses with scores calculated on the basis of symptoms. The calculation result is displayed in the form of a data grid presented in Figure 2. The second column contains scores given on the basis of observed symptoms, the third column contains prediction values, and the last column contains the combined score from the second and third columns. These scores indicate that common cold is the most probable diagnosis, and that viral infection, although not consistent with the observed symptoms, is a very common diagnosis for this period among the subpopulation to which the patient belongs. Furthermore, sinusitis is consistent with all manifested symptoms, but the score given to it based on those observations is lower than the score given to pertussis. The main reason for this is a higher number of symptoms related to sinusitis. Because of this, the observed symptoms contribute less to the score for sinusitis. On the other hand, since the algorithm contains a number of variable parameters, this contribution could be increased. By changing the parameter q from 0.1 to 1, the impact of symptoms that are present but not associated with the diagnosis is increased, resulting in a decreased score for pertussis.
However, this modification leaves the scores given to common cold and sinusitis unchanged. The initial calculation provided us with a suggestion concerning the order of our future observations with the purpose of detecting present symptoms. Observing the symptoms listed in L4 for common cold, sinusitis and pertussis, we specified pyrexia, toothache and facial pain as present symptoms and laryngitis and rhinitis as observed not to be present. After another calculation, sinusitis, being consistent with all present symptoms, becomes the most probable diagnosis with a score of 0.139. However, it is not uncommon for a patient to receive more than one diagnosis. Therefore, diagnoses with lower scores should not be left out of consideration. On the other hand, common, likely innocuous symptoms may lead to considering serious and rare conditions that are linked to common symptoms. Such considerations may escalate to unnecessary changes in health behavior [11].

Figure 2. Scores from initial calculation.

Within our solution, such escalation is prevented by utilizing information about disease prevalence. Classification models that are trained using data about officially registered cases yield probabilities of a disease occurrence in a subpopulation. We consider this to be an advantage over other similar systems, as the persistence of post-escalation concerns and the effects of these concerns may affect and interfere with patients' activities over a prolonged time period [1]. Within the presented example, pertussis could be considered a serious disease. However, because it is uncommon for the subpopulation to which the patient belongs, its overall score is lower.

Conclusion and Future Work

In this paper, we present an algorithm that matches a user-provided list of symptoms to a list of potential diagnoses, explain in more detail the data needed to make such an inference, and demonstrate an application of the algorithm through a typical use scenario. The input to the algorithm includes two data sets, one concerning symptoms and the other concerning information on the subpopulation to which a patient belongs. Therefore, the presented algorithm consists of two phases, each of which uses the corresponding input data set. The equations that are an integral part of the algorithm contain different parameters that may be changed by the user. In this manner, the user is able to favor one type of findings. The presented solution is designed to help users understand various medical symptoms, provide an estimated probability of diagnoses consistent with the input data, and support the medical diagnosis discovery process. It is not meant to give medical advice, a final diagnosis or treatment. However, this system is designed to reduce user stress and support more informed medical decision making.
Within Serbian healthcare, this is one of the first attempts to replace numerous disconnected spreadsheets with a business intelligence (BI) system and make the most of such a replacement with a context-sensitive diagnosis support system. We are aware that, in other contexts, more complex algorithms for diagnosis support are applied. However, the presented algorithm is primarily designed to be used on structurally limited electronic health records that are available in official healthcare institutions in Serbia. Furthermore, as one of our goals is also to motivate stakeholders to consider new solutions in this field, the proposed algorithm had to be generally uncomplicated and accessible to its future users, namely medical experts. Some of our future research may include evaluation of different parameter values and validation of the results produced by the algorithm. We also consider supporting different names for a single symptom and optimizing the prevalence-based phase of the algorithm by calculating disease shares for various subpopulations in advance. At the moment, our solution has a passive role in giving advice to clinicians. Therefore, the clinician must recognize when the presented advice would be useful and then make an effort to describe a case by entering additional medical data and requesting a diagnostic assessment. For this reason, our future research will also include integration with diverse sources of patient information within the healthcare enterprise. This integration would give a more active role to our solution, as laborious data entry would not be required. Our solution falls into the group of clinical decision-support programs that assist with decisions about what observations on symptoms, laboratory tests, and treatments should be applied, rather than those that assist healthcare workers by establishing a single, highly probable diagnosis from a specific disease category.

Therefore, we also plan to incorporate into our solution a way to present and balance the costs and benefits of actions suggested by the solution. At the moment, the algorithm favors more common diseases. Therefore, we may extend the existing algorithm (or formulate a different one) to support detection of rare disease cases by utilizing techniques for discovering anomalies and outliers. Furthermore, the algorithm is based on the Bayesian model that assumes that there are no conditional dependencies among findings. Due to the fact that the presence of some symptoms affects the likelihood of the presence of other symptoms, we intend to provide a more expressive algorithm in which conditional dependencies are modeled explicitly rather than ignored. One of our long-term goals is to create a software system that would be based on a set of diagnostics algorithms which could support the everyday work of medical professionals in primary health care. Our future collaboration with The Institute of Public Health of Vojvodina in Novi Sad could lead to adoption of this pilot project. Such a scenario would require defining suitable measures for the evaluation of the proposed system in practice. We could measure whether the use of the proposed system would lower the number of incorrect diagnoses and shorten the time necessary to make a diagnosis when compared to the procedure currently employed by the diagnosticians in Serbia.

Acknowledgements

The research was supported by the Ministry of Education, Science and Technological Development of the Republic of Serbia, Grant III-44010. The authors are most grateful to The Institute of Public Health of Vojvodina in Novi Sad for the provided data sample featuring reported disease cases and valuable comments.

References

[1] R.W. White, and E. Horvitz, Cyberchondria: Studies of the escalation of medical concerns in Web search, ACM Transactions on Information Systems, vol. 27, no. 4, pp. 23:1-23:37, 2009.
[2] Strategija razvoja informacionog društva u Republici Srbiji do 2020. godine [The Strategy for the Development of Information Society in the Republic of Serbia until the Year 2020], (in Serbian), Službeni glasnik Republike Srbije, vol. 51, 2010.
[3] M.A. Musen, Y. Shahar, and E.H. Shortliffe, Biomedical Informatics, New York: Springer, pp. 698-736, 1995.
[4] WebMD Symptom Checker, http://symptoms.webmd.com/ [Feb. 17, 2013].
[5] Isabel Symptom Checker, http://symptomchecker.isabelhealthcare.com/ [Feb. 17, 2013].
[6] R.A. Miller, H.E. Pople, Jr., and J.D. Myers, INTERNIST-1: An Experimental Computer-Based Diagnostic Consultant for General Internal Medicine, New England Journal of Medicine, vol. 307, pp. 427-433, 1982.
[7] G.O. Barnett, K.T. Famiglietti, R.J. Kim, E.P. Hoffer, and M.J. Feldman, DXplain on the Internet, in Proceedings of the AMIA Annual Fall Symposium 1998, pp. 607-611.
[8] Freebase, http://www.freebase.com/ [Feb. 17, 2013].
[9] JavaScript Object Notation, http://www.json.org/ [Feb. 17, 2013].
[10] V. Ivančević, M. Knežević, M. Simić, I. Luković, and D. Mandić, Dr Warehouse - An Intelligent Software System for Epidemiological Monitoring, Prediction, and Research, in Proceedings of DBKDA 2013, pp. 204-210.
[11] S.L. Ayers, and J.J. Kronenfeld, Chronic illness and health-seeking information on the Internet, Health, vol. 11, no. 3, pp. 327-347, 2007.