Using an Integrated Ontology and Information Model for Querying and Reasoning about Phenotypes: The Case of Autism

Samson W. Tu, MS, Lakshika Tennakoon, RMP, MSC, MPhil, Martin O'Connor, MS, Ravi Shankar, MS, Amar Das, MD, PhD
Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, CA, USA

ABSTRACT

The Open Biomedical Ontologies (OBO) Foundry is a coordinated community-wide effort to develop ontologies that support the annotation and integration of scientific data. In work supported by the National Database for Autism Research (NDAR), we are developing an ontology of autism that extends the ontologies available in the OBO Foundry. We undertook a systematic literature review to identify domain terms and relationships relevant to autism phenotypes. To enable user queries and inferences about such phenotypes using data in the NDAR repository, we augmented the domain ontology with an information model. In this paper, we show how our approach, using a combination of description logic and rule-based reasoning, enables high-level phenotypic abstractions to be inferred from subject-specific data. Our integrated domain-ontology and information-model approach allows scientific data repositories to be augmented with rule-based abstractions that make it easier for researchers to undertake data analysis.

INTRODUCTION

Recent studies have reported an increased prevalence of autism spectrum disorder (ASD). 1 To advance research in identifying common genetic or other factors that influence the etiology of ASD, the NIH established the National Database for Autism Research (NDAR) (http://ndar.nih.gov/). The goal of NDAR is to provide investigators with a public resource for collecting, archiving, retrieving, sharing, and analyzing data on autism. A central function of NDAR is to store and to link genetic, clinical, imaging, and other data on subjects who participate in NIH-funded autism research studies.
The underlying architecture uses a federated data repository based on the Biomedical Informatics Research Network (BIRN) Grid (http://www.nbirn.net/).

One of the functionalities of NDAR is to provide a query tool to construct data sets for answering specific questions relevant to autism researchers. A question such as "Use head circumference to categorize macroencephaly. Then see if the subjects differ in their ADOS and ADI-R cognitive and language profiles, and in their genetic data." requires that phenotypic abstractions (e.g., macroencephaly) be inferred from available data collected from standard autism assessment instruments (e.g., the Autism Diagnostic Interview-Revised (ADI-R) 2 and the Autism Diagnostic Observation Schedule (ADOS) 3 ). Performing such reasoning on subject-specific data in NDAR requires the integration of an information model representing research and clinical data about study subjects with an ontology that defines the terms and relationships in the domain. Ultimately, such a combined modeling approach can allow a researcher to formulate a query at the conceptual level, using terms and relationships from the ontology, and have it translated automatically into specific queries that take into account the schemas and source vocabularies of the underlying data sources.

The BIRN Grid does not currently address the need to query phenotypic abstractions. It consists of tools, including a mediator and an infrastructure, to handle all phases of the data integration process, such as registration of sources, registry queries, and semantic data queries. 4 The mediator supports a form of semantic data querying in which a user can annotate data sources, views, attributes, and attribute values with terms from ontologies. To perform a query, a user can submit a list of keywords, using terms from ontologies available in BIRN. The mediator can then search for these terms and produce a ranked list of relevant sources and relations.
A user can request data from individual views generated for each candidate source or from joined views.

To develop the ontology of autism that supports the querying of phenotype abstractions in the BIRN Grid, we need to extend the ontologies currently supported in that environment. The BIRN community has developed BIRNLex, a controlled lexicon for neuroscience that can be used to annotate BIRN data sources (http://birnlex.nbirn.net/ontology/birnlex.owl). Developers of BIRNLex have adopted and refined practices for ontology development being promoted by the Open Biomedical Ontologies (OBO) Foundry. 5 Part of these practices is the reuse of existing ontologies covering domains of interest, such as the Basic Formal Ontology (BFO), the Ontology for Biomedical Investigations (OBI), and the Phenotype and Trait Ontology (PATO). The BFO is a foundational ontology that provides a set of upper-level distinctions shared by all OBO Foundry ontologies. 6 It promotes a realism-based approach to ontology modeling, which holds that classes in an ontology are universal categories of objects that represent things and processes in reality. PATO, for example, models phenotypes as qualities and dispositions (both classes in BFO) that inhere in organisms, which are bearers-of such phenotypes.

AMIA 2008 Symposium Proceedings Page - 727

Our task is to formulate the ontology of autism for NDAR as an extension of the BIRNLex ontology. However, the extension cannot be merely the addition of autism-related terms and relationships. We have found that the use cases for the autism ontology extend beyond the capabilities currently supported by BIRN, because the BIRN data integration environment conceives of an ontology only as a network of typed nodes and relations. 4 The research questions addressed in this paper include:

1. How can terms and definitions used in studies of autism be formulated as extensions of the BIRNLex ontology and, when appropriate, as PATO phenotypes?
2. How can the structure of clinical and research data be reflected in the ontology so that the formulation of analytical questions is informed by available data?
3. How can abstractions defined in terms of clinical and research data be incorporated into the ontology and related to terms in the literature?

METHODS

To gather terms, relationships, and abstractions for building our autism ontology, we conducted a literature search and reviewed the data dictionary codebook used by NDAR. Our literature search of the PubMed database used the key words (ADI-R or ADOS or Vineland) and (genes or genetics) and autism. We found 43 published research papers as of March 1, 2008. We selected 26 of these papers as relevant, based on the inclusion criteria that a study enrolled subjects with a diagnosis of autism and was published in the English language. We supplemented the corpus with standard sources on autism diagnosis, such as the Diagnostic and Statistical Manual of Mental Disorders (DSM-IV). 7 We then examined the data-analytic requirements of the investigations reported in the papers, focusing on the types of terms and relationships that would be needed to retrieve and abstract a data set for the analysis.
From NDAR, we obtained a codebook specification of the data elements and their format in the data repository (available at http://ndar.nih.gov/ndarpublicweb/datasubmission.go) in MS Excel spreadsheets.

We manually extracted terms and their definitions from the literature corpus and modeled them as extensions of the BIRNLex ontology in Protégé, an ontology-editing tool that supports the Web Ontology Language (OWL). 8 OWL is the description-logic-based language for constructing ontologies endorsed by the World Wide Web Consortium (W3C). We wrote scripts to import the NDAR codebook data specification into Protégé OWL. To represent definitions and mappings that specify data abstractions and their relationships to other terms in the ontology, we used the Semantic Web Rule Language (SWRL), 9 which allows rules to be used as additional axioms in an OWL ontology. Furthermore, a query extension to SWRL allowed us to specify, using terms and relationships in the ontology, how to extract data for statistical analysis.

RESULTS

The autism ontology integrates (1) an information model that represents research and clinical data; (2) terms and relationships specified in the domain of interest; and (3) data abstractions that relate observable data to research and clinical terms. To demonstrate the development and use of the autism ontology, we provide illustrative examples of how the abstractions used in a publication by Hus and colleagues (Hus2007) 10 and the terms used in a fragment of the DSM-IV definition of autism are modeled in our autism ontology. We also show the use of the ontology in abstracting and retrieving the data needed for performing the statistical analysis in Hus2007.

Hus2007 characterizes autism phenotypes based on language acquisition, restricted and repetitive behaviors, and savant skills.
The phenotype trait for the acquisition of first word, for example, is based on ADI-R item 9 (Age of First Words) and is categorized as follows:

   NDW (not delayed words): acquired words ≤ 24 months
   DW (delayed words): acquired words > 24 months
   NW (no words): no words at time of ADI-R assessment

The statistical analysis presented in Hus2007 correlates the phenotype categorizations of individual subjects with their verbal and nonverbal IQ, age, and ADOS and ADI-R domain scores. We next show how data for such statistical analysis may be formulated using the terms and relationships from the ontology.

The DSM-IV diagnostic criteria for autistic disorder say, in part, that the subject should have

   (A) qualitative impairment in social interaction, as manifested by at least two of the following:
       1. marked impairments in the use of multiple nonverbal behaviors such as eye-to-eye gaze, facial expression, to regulate social interaction
   (B) qualitative impairments in communication as manifested by at least one of the following:
       1. delay in, or total lack of, the development of spoken language
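As an illustration only, the counting structure of the criteria excerpted above ("at least two of the following", "at least one of the following") can be sketched in Python. The finding names and sample values below are hypothetical placeholders, not DSM-IV wording or NDAR data elements, and the sketch covers only the two fragments quoted here rather than the full diagnostic definition.

```python
from typing import Iterable


def meets_at_least(findings: Iterable[bool], minimum: int) -> bool:
    """Return True when at least `minimum` of the listed findings are present."""
    return sum(1 for present in findings if present) >= minimum


# Hypothetical findings for one subject. Only the first item of each list is
# quoted in the excerpt; the other keys are placeholders for the elided items.
social_interaction_findings = {
    "impaired_nonverbal_behaviors": True,
    "placeholder_item_2": True,
    "placeholder_item_3": False,
}
communication_findings = {
    "delayed_or_absent_spoken_language": False,
}

# Criterion (A): at least two social-interaction impairments.
criterion_a = meets_at_least(social_interaction_findings.values(), 2)
# Criterion (B): at least one communication impairment.
criterion_b = meets_at_least(communication_findings.values(), 1)

print("Criterion A met:", criterion_a)
print("Criterion B met:", criterion_b)
```

The point of the sketch is that each criterion is a threshold over a set of findings, which is the kind of condition a cardinality restriction can express declaratively.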
Figure 1. The representation of data collected through the ADI-2003 autism assessment instrument as part of the autism ontology.

First, we extended the BIRNLex ontology's information content entity, currently populated with Narrative object, such as Book, Data object, and bibliographical database, with an Assessment result class that represents the results of study assessments done on subjects, such as instrument, imaging, or genetic testing. Using a Python script, we imported the NDAR data specification as subclasses of Assessment result. Thus data collected through the use of the ADI-2003 instrument, for example, can be seen as instances of the ADI-2003 class, where properties such as acqorlossoflang_aword represent the assessed value of an ADI-2003 item (i.e., the age at which the first word is uttered) (Figure 1). Because all subject data submitted to NDAR are de-identified and are referenced instead by an NDAR GUID (Global Unique Identifier), it will be possible to join the data across multiple types of assessments.

We use OWL's capability to define super-properties and sub-properties to classify the items of assessment instruments into domains. For example, the acqorlossoflang_aword property is a sub-property of the Acquisition and Loss of Language and Other Skills domain in the ADI instrument. The property hierarchy allows us to query for relationships at multiple levels of abstraction.

We model the Status of age of word as an OWL class defined by the necessary and sufficient condition of being the disjoint union of its Delayed word, No word, and Not delayed word subclasses (Figure 2). We annotate the Status of age of word class with the bibliographic citation where the phenotype trait is defined. Each of the phenotype values, such as Delayed word, is defined in terms of the underlying assessment result using a SWRL rule (Figure 3). The SWRL rule makes use of the Assessment_result information model described previously.
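To make the idea of querying at multiple levels of abstraction concrete, the following Python sketch mimics a small property hierarchy with a dictionary. It is not the OWL representation itself: the domain identifier, the second item, and the second domain are hypothetical names introduced for illustration, and only acqorlossoflang_aword appears in the NDAR specification discussed here. The sketch treats a property as a sub-property of itself and of every ancestor.

```python
from typing import Dict, Optional

# Hypothetical sub-property -> super-property map (None marks a root).
SUPER_PROPERTY: Dict[str, Optional[str]] = {
    "acqorlossoflang_aword": "acquisition_and_loss_of_language_domain",
    "hypothetical_other_item": "hypothetical_other_domain",
    "acquisition_and_loss_of_language_domain": "adi_item",
    "hypothetical_other_domain": "adi_item",
    "adi_item": None,
}


def is_subproperty_of(prop: str, ancestor: str) -> bool:
    """Walk up the hierarchy from `prop`, looking for `ancestor`."""
    current: Optional[str] = prop
    while current is not None:
        if current == ancestor:
            return True
        current = SUPER_PROPERTY.get(current)
    return False


def values_in_domain(assessment: Dict[str, object], domain: str) -> Dict[str, object]:
    """Select the item values of one assessment that fall under `domain`."""
    return {
        prop: value
        for prop, value in assessment.items()
        if is_subproperty_of(prop, domain)
    }


# Hypothetical assessment record keyed by property name.
record = {
    "subject_id": "GUID-001",
    "acqorlossoflang_aword": 30,
    "hypothetical_other_item": 2,
}
print(values_in_domain(record, "acquisition_and_loss_of_language_domain"))
print(values_in_domain(record, "adi_item"))
```

Asking for the domain returns only the items classified under it, while asking for the root returns every instrument item; subject_id is excluded in both cases because it is not part of the item hierarchy.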
Specifically, the SWRL rule in Figure 3 says that if the acqorlossoflang_aword value (represented by the variable ?wordage) of an ADI 2003 assessment (represented by the variable ?assessment) is greater than 24 (months), then the human (represented by the variable ?subject) whose subject id is that of the assessment is the bearer-of Delayed_word. In this example, we use a SWRL greater-than built-in (a method for extending the capabilities of SWRL) to perform the comparison of an assessment value with a cut-off. The Protégé implementation of the SWRL language allows the use of mathematical expressions, temporal comparisons, and terminological reasoning. The definition of the savant skill trait, for example, requires averaging of multiple assessment scores, which cannot be done in OWL alone.

Figure 2. The representation of the Status of age of words phenotype group as an OWL class partitioned by the possible statuses.

SWRL rules, like the one shown in Figure 3, are abstraction rules that, based on statements made in the information model (assessment results in the format of the NDAR data model), infer assertions about some quality of study subjects. Transforming such an assertion into data input for statistical analysis requires a further step. Instead of using a logical formalism to represent assertions, we need to represent qualities such as Delayed word as the value of a variable. We can easily do that by augmenting the ADI information model with attributes like age_of_word_status that have coded values (e.g., 0 for No word, 1 for Not delayed word, and 2 for Delayed word). A SWRL rule similar to the one in Figure 3 can assert that, for a subject who is the bearer of the Delayed word quality, the age_of_word_status attribute of the subject's ADI 2003 assessment is 2. Once the phenotype abstractions have been turned into data values, we query them as part of the available data set.
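The step from asserted qualities to coded values can be illustrated with a short Python sketch. The attribute name age_of_word_status and the codes 0, 1, and 2 come from the text above; the dictionary-based records, the function name, and the subject identifiers are assumptions made for illustration and do not reflect the SWRL implementation.

```python
from typing import Dict, Optional

# Coded values as given in the text.
STATUS_CODES: Dict[str, int] = {
    "No word": 0,
    "Not delayed word": 1,
    "Delayed word": 2,
}


def add_status_attribute(
    assessment: Dict[str, object],
    bearer_of: Dict[str, str],
) -> Dict[str, object]:
    """Return a copy of `assessment` with age_of_word_status filled in.

    `bearer_of` maps a subject id to the quality asserted for that subject
    (a hypothetical stand-in for the inferred bearer-of assertions).
    """
    quality: Optional[str] = bearer_of.get(str(assessment["subject_id"]))
    coded = dict(assessment)
    # Leave the attribute empty when no quality was inferred for the subject.
    coded["age_of_word_status"] = STATUS_CODES.get(quality) if quality else None
    return coded


# Hypothetical inferred assertions and one assessment record.
assertions = {"GUID-001": "Delayed word"}
row = add_status_attribute(
    {"subject_id": "GUID-001", "acqorlossoflang_aword": 30}, assertions
)
print(row["age_of_word_status"])
```

Keeping the coded value on the assessment record means the abstraction can be selected alongside the raw items when a data set is assembled.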
Using an SQL-like extension of SWRL that we have developed within Protégé, 11 we can formulate the query that constructs the data set for correlating phenotype groups with verbal and nonverbal IQ, age, and ADOS and ADI-R scores in terms of the augmented NDAR data model.

   ADI_2003_result(?assessment) &
   acqorlossoflang_aword(?assessment, ?wordage) &
   swrlb:greaterThan(?wordage, 24) &
   subject_id(?assessment, ?subjectid) &
   orgtax:human(?subject) &
   subject_id(?subject, ?subjectid)
   → birn_obo_ubo:bearer_of(?subject, Delayed_word)

Figure 3. A SWRL rule concluding that a subject has 'Delayed word'
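For readers less familiar with SWRL, the logic of the rule in Figure 3 can be paraphrased procedurally. The Python sketch below is only an analogue under stated assumptions: records are plain dictionaries, the function name and sample identifiers are hypothetical, and the item value is assumed to be an age in months (any instrument-specific special codes would have to be handled before applying the cut-off). It is not how the Protégé rule engine evaluates the rule.

```python
from typing import Dict, List, Tuple


def infer_delayed_word(
    assessments: List[Dict[str, object]],
    subjects: List[Dict[str, object]],
    cutoff_months: int = 24,
) -> List[Tuple[str, str]]:
    """Return (subject id, 'Delayed_word') pairs, mirroring the Figure 3 rule."""
    # Index subjects by id so each assessment can be joined to its subject.
    subjects_by_id = {s["subject_id"]: s for s in subjects}
    inferred: List[Tuple[str, str]] = []
    for assessment in assessments:
        wordage = assessment.get("acqorlossoflang_aword")
        subject = subjects_by_id.get(assessment.get("subject_id"))
        if wordage is None or subject is None:
            continue  # the rule body cannot be satisfied without both parts
        if wordage > cutoff_months:  # analogue of the greater-than built-in
            inferred.append((subject["subject_id"], "Delayed_word"))
    return inferred


# Hypothetical data: two ADI 2003 results and the matching human subjects.
adi_results = [
    {"subject_id": "GUID-001", "acqorlossoflang_aword": 30},
    {"subject_id": "GUID-002", "acqorlossoflang_aword": 18},
]
humans = [{"subject_id": "GUID-001"}, {"subject_id": "GUID-002"}]
print(infer_delayed_word(adi_results, humans))
```

The join on the shared subject id corresponds to the two subject_id atoms in the rule, and the comparison with the cut-off corresponds to the built-in atom.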
Figure 4. Partial hierarchy showing the autism-related developmental capabilities

The second part of our results concerns the formulation of autism phenotypes in terms of the PATO ontology of qualities and dispositions. Delayed is a PATO quality for occurrence. The status of Delayed word can be formally decomposed as Delay in the onset of the capability to use words, where capability to use words is a kind of developmental capability, and onset is the time of a developmental capability's occurrence.

Terms used in the DSM-IV diagnostic criteria for autism can be modeled in exactly the same way as the decomposition of phenotype traits. By creating an ontology of developmental capabilities, such as the one shown in Figure 4, we can define dispositions toward autism in terms of PATO qualities of the developmental capabilities. For example, qualitative impairment in social interaction can be defined as the PATO quality impaired of some Social_interaction developmental capability. Because of the subsumption hierarchy of developmental capabilities, any assessment of, say, impaired eye-to-eye_social_interaction will be classified as an impairment of social interaction. Using OWL 2.0's description logic capability to define qualified cardinality restrictions, we create OWL classes that represent criteria like "qualitative impairment in social interaction, as manifested by at least two of the following" as assessment results where there are at least two instances of impairment of social interaction. Based on an information model of assessment results, we can represent the DSM-IV diagnostic criteria depicted in the sample text quoted earlier as necessary and sufficient conditions for inferring the presence of autism.

DISCUSSION

Most of the existing work on the use of ontologies to facilitate data integration focuses on the annotation of data, images, and literature with terms from ontologies.
12 In this paper we described a modeling approach that extends the use of ontologies by integrating them with information models so that inferences can be made with the data themselves. While we have established the feasibility of combining such reasoning with data queries to derive data sets to be used in statistical analysis, we do not expect that autism experts will be able to write OWL-based rules and queries themselves. In work parallel to that reported in this paper, we are trying to find patterns that would allow us to design templates that can ease the task of writing such abstraction rules and queries.

We found that BIRNLex and PATO provide an adequate framework for specifying the qualities and dispositions that represent phenotypes described in the autism literature. Nevertheless, we found that OWL does not allow a direct description of a subject bearing a phenotype if the relationship between the subject and the phenotype needs qualifications. Instead, such descriptions could be represented as statements in an information model. The realism-based principle of BFO and PATO, which models relations between real-life entities, is problematic when we want to extend the model to state that a subject is the bearer-of a quality during a certain time. In this case, the bearer-of relationship is a ternary relation involving a subject, the phenotype quality, and time. OWL, like other entity-attribute-value (EAV) languages, allows only unary (class) and binary (property) relations. As noted by Mungall, 13 OWL does not permit the direct representation of an assertion saying that a subject is the bearer of a quality during a certain time. The alternatives that Mungall considered were judged unsatisfactory.

The standard method to represent n-ary relationships in an EAV language is to reify the relationships as objects. 14 Our approach of integrating an information model of assessment results with an ontology provides such a mechanism. Instead of modeling a subject as the bearer of a phenotype, and not being able to state the temporal extent of the relationship, we model the assertion as a statement in the information model. Thus, for example, we can create a phenotype-description entity in our information model. Similar to an assessment result, it has a subject id, the qualitative or quantitative phenotype value, the time interval during which the assertion is valid, and any other contextual or qualifying information.

Our current implementation relies on a SWRL/database integration that fetches the necessary data into memory for rule and query processing. 11 For large data sets, a more scalable architecture is needed. It may be possible to treat the SWRL rules and OWL class expressions as specifications that can be translated into more conventional database triggers and queries. Another possibility is to precompute and store abstractions so that they are readily available for querying.

Our work to construct an autism ontology has so far focused on modeling terms so that they conform to and extend the existing BFO and BIRNLex ontology development framework. In contrast, Petric and colleagues 15 used machine-learning and text-mining techniques to construct ontologies of the autism domain. Using the semi-automatic tool OntoGen (http://ontogen.ijs.si/), they constructed the ontologies as hierarchies of terms, and used the underlying document corpus to discover suggestive relationships among terms in the ontologies. Such a semi-automated machine-learning approach can usefully supplement the labor-intensive work that our manual ontology development entails.

Acknowledgement

This work was funded in part by an NIMH supplement to the Protégé resource funded by NLM grant LM007885. The authors thank Lynn Young, Bill Bug, Dan Hall, and Matt McAulliffe for discussions and assistance on this project and Mor Peleg for her extensive comments.
REFERENCES

1. Rice, C.E., et al., A public health collaboration for the surveillance of autism spectrum disorders. Paediatr Perinat Epidemiol, 2007. 21(2): p. 179-190.
2. Lord, C., M. Rutter, and A. Le Couteur, Autism Diagnostic Interview-Revised: a revised version of a diagnostic interview for caregivers of individuals with possible pervasive developmental disorders. J Autism Dev Disord, 1994. 24(5): p. 659-685.
3. Lord, C., et al., The autism diagnostic observation schedule-generic: a standard measure of social and communication deficits associated with the spectrum of autism. J Autism Dev Disord, 2000. 30(3): p. 205-223.
4. Astakhov, V., et al. Semantic Data Integration Environment for Biomedical Research. in Proceedings of the 19th IEEE Symposium on Computer-Based Medical Systems. 2006. p. 171-176.
5. Smith, B., et al., The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration. Nat Biotechnol, 2007. 25(11): p. 1251-1255.
6. Grenon, P., B. Smith, and L. Goldberg, Biodynamic Ontology: Applying BFO in the Biomedical Domain, in Ontologies in Medicine, D.M. Pisanelli, Editor. 2004, IOS Press: Amsterdam. p. 20-38.
7. American Psychiatric Association, Diagnostic and Statistical Manual of Mental Disorders DSM-IV-TR. 2000.
8. O'Connor, M.J., et al. Supporting Rule System Interoperability on the Semantic Web with SWRL. in Fourth International Semantic Web Conference. 2005. Galway, Ireland.
9. Horrocks, I., et al. SWRL: A Semantic Web Rule Language Combining OWL and RuleML. W3C Member Submission 21 May 2004. 2004 [cited 2008]; Available from: http://www.w3.org/submission/swrl/.
10. Hus, V., et al., Using the autism diagnostic interview-revised to increase phenotypic homogeneity in genetic studies of autism. Biol Psychiatry, 2007. 61(4): p. 438-448.
11. O'Connor, M.J., et al. Efficiently Querying Relational Databases using OWL and SWRL. in The First International Conference on Web Reasoning and Rule Systems. 2007. Innsbruck, Austria: Springer. p. 361-363.
12. Louie, B., et al., Data integration and genomic medicine. J Biomed Inform, 2007. 40(1): p. 5-16.
13. Mungall, C., et al. Representing Phenotypes in OWL. in Third International Workshop of OWLED. 2007. Innsbruck, Austria.
14. Noy, N. and A.L. Rector. Defining N-ary Relations on the Semantic Web. 2006 [cited June 17, 2008]; Available from: http://www.w3.org/tr/swbp-naryrelations/.
15. Petric, I., T. Urbanicic, and B. Cestnik, Discovering Hidden Knowledge from Biomedical Literature. Informatica, 2007. 31: p. 15-20.