Comparing Techniques for Classifying Patients with Schizophrenia and Healthy Controls using Machine Learning and Magnetic Resonance Imaging


by Julie Lynne Winterburn

A thesis submitted in conformity with the requirements for the degree of Masters of Applied Science
Institute for Biomaterials and Biomedical Engineering
University of Toronto

Copyright by Julie Winterburn 2015

Abstract

Schizophrenia and related psychoses are debilitating mental illnesses that are associated with an abnormal neuroanatomical phenotype on magnetic resonance images. Machine learning is a powerful statistical tool that can recognize unique feature patterns in data and use this information to perform class separation in unseen instances. This study explores the effect of machine learning algorithm selection, dataset characteristics, and feature choice on classification performance within a rigorous validation framework. Specifically, penalized logistic regression, support vector machines, and linear discriminant analysis are compared using three large, independently collected datasets with multiple neuroanatomically based features. This study replicates existing studies in the literature, and provides a direct comparison of techniques within a systematic structure. Additionally, this study illustrates many of the challenges inherent to performing patient classification, and aims to establish a benchmark for future studies in the field.

Acknowledgments

This project would not have been possible without the help and support of a great team of people. First and foremost, I would like to thank my supervisor Dr. Mallar Chakravarty, who has given me so many opportunities over the past four years, and who has helped me navigate this process from beginning to end. I would also like to sincerely thank Dr. Aristotle Voineskos who has offered invaluable advice and wisdom, both on and off my committee, and who generously accommodated me in his lab during the final year of my degree. My committee members Jo Knight and Tomáš Paus provided valuable insight, and my project was immensely improved by their contributions.

To all past and present members of the Kimel Family Translational Imaging-Genetics Research Laboratory, thank you for being great company, and for always making research so much fun. A big thank-you to Jon Pipitone, who has taught me almost everything I know about computers, and is always there to lend a hand when you need it. Gabriel Devenyi was vital in helping me troubleshoot my many computer problems, and answered endless questions about every topic imaginable with infinite patience. Nikhil Bhagwat was a great companion to navigate graduate school with, and was always willing to talk through tough technical concepts, or help me see the lighter side of things. To the members of the Computational Brain Anatomy Laboratory, thank you for helping me develop my ideas, making me think critically about my project, and inspiring me with your fantastic research.

To my family, thank you for all the support and encouragement you have given me throughout this long process. I don't know where I would be today without you. To Josh, I can't say thank you enough for your unwavering support and patience. Finally, I'd like to thank the University of Toronto Rowing Team, who kept me sane throughout the last two years. You are an inspiring group of people, and you helped me push myself to be the best that I can be, both on and off the water.

Table of Contents

Acknowledgments
Table of Contents
List of Tables
List of Figures
List of Abbreviations
List of Appendices

Chapter 1: Introduction
  Schizophrenia
  History and Definition
  Diagnosis
  Magnetic Resonance Imaging
  Image Acquisition
  Image Formation
  Image Processing
  Image Registration
  Voxel-Based Morphometry
  Cortical Thickness
  Neuroanatomical Phenotype of Schizophrenia
  Postmortem Evidence for Abnormal Neuroanatomy in Schizophrenia
  MR Evidence for Abnormal Neuroanatomy in Schizophrenia
  Region-of-Interest Approach
  Voxel-Based Morphometry Approach
  Cortical Thickness
  5 Machine Learning
  Univariate versus Multivariate Analyses
  Feature Reduction
  Machine Learning Algorithms
  5.3.1 Regularized Logistic Regression
  Support Vector Machines
  Linear Discriminant Analysis
  Comparing Algorithms
  Performance Evaluation
  Validation
  Machine Learning and Schizophrenia
  Goals and Contributions

Chapter 2: Comparing techniques for classifying patients with schizophrenia and healthy controls using machine learning and magnetic resonance imaging
  Introduction
  Methods
  Datasets Evaluated
  Centre for Addiction and Mental Health (CAMH)
  Northwestern University Schizophrenia Data and Software Tool (NUSDAST)
  National Institute of Neurology and Neurosurgery of Mexico (INNN)
  Image Processing
  Image Registration
  Modulated Voxel-Based Morphometry
  RAVENS Maps
  Cortical Thickness
  Machine Learning Algorithms
  Logistic Regression with Elastic Net Regularization
  Support Vector Machines
  Linear Discriminant Analysis
  COMPARE
  Validation
  Results
  Modulated Voxel-Based Morphometry
  Quality Control
  Classifier Performance
  RAVENS Maps
  Classifier Performance
  Cortical Thickness
  Quality Control
  Classifier Performance
  Discussion

Chapter 3: General Discussion & Future Directions
  Summary of Results
  Image Processing
  Inhomogeneity Correction
  Atlas Selection
  Non-linear Registration Algorithm
  Cortical Thickness Analysis
  Tissue Density Analysis
  Alternative Machine Learning Algorithms
  Additional Metrics for Classification
  Schizophrenia Diagnosis: Present and Future
  Limitations
  Conclusion

References
Appendices

List of Tables

Table 1: Demographic characteristics of all datasets

Table 2: All modulated VBM results in the training set using 10-fold cross-validation. Overall classification accuracy (%) is reported. The p-value of a binomial test on the % accuracy is reported in brackets

Table 3: All modulated VBM results in the validation set. The % sensitivity (# true positives), specificity (# true negatives), and overall classification accuracy (in bold) are reported. The p-value of a binomial test on the % accuracy is reported in brackets

Table 4: All RAVENS maps results in the training set using 10-fold cross-validation. Overall classification accuracy (%) is reported. The p-value of a binomial test on the % accuracy is reported in brackets

Table 5: All RAVENS maps results in the validation set. The % sensitivity (# true positives), specificity (# true negatives), and overall classification accuracy (in bold). The p-value of a binomial test on the % accuracy is reported in brackets

Table 6: All cortical thickness results in the training set using 10-fold cross-validation. Overall classification accuracy (%) is reported. The p-value of a binomial test on the % accuracy is reported in brackets

Table 7: All cortical thickness results in the validation set. The % sensitivity (# true positives), specificity (# true negatives), and overall classification accuracy (in bold). The p-value of a binomial test on the % accuracy is reported in brackets

Table 8: Summary of studies using structural MR imaging to classify healthy controls and schizophrenia patients

List of Figures

Figure 1: Image registration and voxel-based morphometry pipeline

Figure 2: Validation scheme for training algorithms. Data is split into training and validation subsets (ratio 2:1), and the validation data is set aside. The training subset is further divided to tune algorithm parameters using 10-fold cross-validation. The accuracy of the algorithm is assessed on the validation set only once tuning is complete

Figure 3: Quality control stage. A: Correctly classified subject; B: Subject excluded from analyses because of failed registration

Figure 4: Percent of variance explained by all PCs for the modulated voxel-based morphometry for each of the three datasets. Only PCs that individually explained >1% of the total variance in the dataset were retained as input to the classifiers

Figure 5: Sagittal views of voxel-wise contributions to the first principal component of the modulated VBM and RAVENS data for each dataset

Figure 6: Parameter optimization and selection in elastic net regularized LR model for CAMH modulated VBM dataset. Panel A illustrates selection of the λ parameter based on 10-fold cross-validation. Each coloured line represents a separate variable. The optimal λ is that which shows the lowest model deviance. The selection of λ affects the number of variables retained in the model. In this case, the optimal log(λ) = -2.2, so optimal λ = . Panel B shows selection of the α parameter. The λ parameter optimization shown in Panel A is performed at 21 values of α ranging from 0-1, and the optimal model is that which shows the lowest model deviance for a given combination of λ and α, in this case optimal α = . Panel C illustrates the effect of λ selection on fractional deviance explained by the model, number of variables retained, and their coefficients

List of Abbreviations

ANTs      Advanced Normalization Tools
BPD       Brief Psychotic Disorder
CAMH      Centre for Addiction and Mental Health dataset
COMPARE   Classification of Morphological Patterns using Recursive Feature Elimination
CSF       Cerebrospinal Fluid
DSM       Diagnostic and Statistical Manual of Mental Disorders
GM        Grey Matter
HC        Healthy Control
INNN      National Institute of Neurology and Neurosurgery of Mexico dataset
LASSO     Least Absolute Shrinkage and Selection Operator
LDA       Linear Discriminant Analysis
LR        Logistic Regression
MR        Magnetic Resonance
NUSDAST   Northwestern University Schizophrenia Data and Software Tool dataset
PCA       Principal Component Analysis
RAVENS    Regional Analysis of Volumes in Normalized Space
RDoC      Research Domain Criteria
RF        Radio Frequency
ROI       Region of Interest
SA        Schizoaffective
SVM       Support Vector Machine
SZ        Schizophrenia
VBM       Voxel-Based Morphometry
WM        White Matter

List of Appendices

Supplementary Paper 1: Voineskos*, AN, Winterburn*, JL, Felsky, D, Pipitone, J, Rajji, TK, Mulsant, BH, & Chakravarty, MM (2015). Hippocampal (subfield) volume and shape in relation to cognitive performance across the adult lifespan. Human Brain Mapping. *These authors contributed equally.

Supplementary Paper 2: Winterburn, JL, Pruessner, JC, Chavez, S, Schira, MM, Lobaugh, NJ, Voineskos, AN, Chakravarty, MM (In Press). High-resolution in vivo manual segmentation protocol for human hippocampal subfields using 3T magnetic resonance imaging. Journal of Visualized Experiments.

Chapter 1: Introduction

1 Schizophrenia

1.1 History and Definition

Schizophrenia is one of the most common and most debilitating mental disorders, yet it remains poorly understood. Recent studies estimate that % of the population suffers from schizophrenia, which places it among the largest disease burdens internationally (Messias et al., 2007; McGrath et al., 2008; Regier et al., 1993; WHO, 2004 Global Burden of Disease Update, n.d.).

Evidence for schizophrenia dates back as early as 1550 BC, to the Ebers Papyrus, an Egyptian medical document which describes symptoms that are now associated with depression, dementia, and psychosis ("History of Schizophrenia," 2010). Throughout history, it has been associated with demonic possession, evil spirits, and divine punishment. During the time of the ancient Greeks, in the 4th century BC, the study of mental illness moved into the scientific realm with Hippocrates' hypothesis that madness stemmed from an imbalance of the four bodily humors, specifically an excess of black bile ("A Brief History of Schizophrenia," 2012).

Schizophrenia was first identified as a distinct mental disorder in 1887 by the psychiatrist Emil Kraepelin, who used the term "dementia praecox," or "dementia of early life," to describe the mental deterioration he saw in his young patients, and to differentiate them from his patients with mood disorders (such as depression and bipolar disorder) (Kraepelin, 1971). Kraepelin also noted a general weakening of mental processes in his patients, which he described as "defect" and which could co-exist with more "productive" symptoms, such as hallucinations and delusions (Jablensky, 2010). This understanding of schizophrenia symptomatology persists today in the distinction between negative and positive symptoms.

In 1911, Eugen Bleuler advocated for changing the name of the disorder from dementia praecox to schizophrenia, as he disagreed with Kraepelin that it has its roots in dementia (Bleuler, 1950). Bleuler was the first to systematically record details about the disorder, including its acute and chronic stages, primary and secondary symptoms, positive and negative symptoms, and remission. Bleuler's work was rooted in psychoanalysis, and he felt strongly that the disorder should be characterized and diagnosed based on broad psychological presentations (Moskowitz and Heim, 2011), whereas the Kraepelinian school of thought placed more emphasis on psychotic symptoms. The first edition of the Diagnostic and Statistical Manual of Mental Disorders (DSM-I) was developed in 1952 as a guide for clinicians in diagnosing mental disorders, and the definition of schizophrenia included in the manual was heavily influenced by Bleuler's work (Moskowitz and Heim, 2011).

In 1950, the German psychiatrist Kurt Schneider defined nine "first-rank" symptoms, in which he included delusions and hallucinations, and argued, like Kraepelin, that these symptoms were the most strongly associated with schizophrenia, as opposed to other types of psychosis (Schneider, 1950). Although these positive symptoms have since been shown not to be unique to schizophrenia (Strauss and Carpenter, 1974; Kluft, 1987), and not to have high discriminative value (Strauss and Carpenter, 1974), they continue to be included as diagnostic criteria. Some argue that cognitive or negative symptoms, like those emphasized by Bleuler, would be more discriminative for diagnosis (Andreasen, 1999; Fischer and Carpenter, 2009).

During the twentieth century, efforts to better identify schizophrenia included the development of the Present State Examination, a structured diagnostic instrument that emphasized psychotic symptoms (Wing et al., 1974). Additional disease categories were also added to the schizophrenia family, including schizoaffective disorder (Kasanin, 1933), schizophreniform psychoses (Langfeld, 1956), process-nonprocess (Stephens and Astrup, 1963), and paranoid-nonparanoid schizophrenia (Tsuang and Winokur, 1974).

Over the course of the twentieth century and the beginning of the twenty-first, evidence has been mounting from the brain imaging and genetics communities that schizophrenia is an illness with a biological etiology. A number of biological markers have pointed to neurocognitive dysfunction (Kremen et al., 2000), brain dysmorphology (Wheeler et al., 2014), and neurochemical abnormalities (Torrey et al., 2005). In spite of the large amount of research data available, however, there is no conclusive evidence that points to a single cause of the disorder, which has led some to call the very existence of schizophrenia as an independent entity into question (Jablensky, 2011). Psychiatric and clinical neuroscientific researchers continue to attempt to pinpoint the etiology of the illness, but as Osmond and Smythies noted in their 1952 paper, this is no easy task:

"We must therefore account for a situation in which a person of any age, but usually a young adult, in response to stress or with little evidence of it, becomes slowly and insidiously, or with overwhelming speed and accompanied by acute confusion, subjected to disturbances of association, changes in affect, thought disorder, hallucinations and delusions and catatonic symptoms to such an extent that life outside a mental hospital becomes impossible. The sick person, on the other hand, may never even need to visit a doctor but may simply appear odd and eccentric. The illness may terminate quickly either with or without medical aid, or may be completely resistant to any form of treatment and continue for years without any pathognomonic physical changes being demonstrable" (Osmond and Smythies, 1952).

The heterogeneity in presentation and functional outcome of schizophrenia patients makes it an especially challenging area of study for researchers and clinicians alike.

1.2 Diagnosis

The current definition of schizophrenia is contained within the DSM-5 (American Psychiatric Association, 2013). To meet criteria for a schizophrenia diagnosis, patients must exhibit two or more of the following symptoms: delusions, hallucinations, disorganized speech, grossly disorganized or catatonic behaviour, and negative symptoms (i.e., diminished emotional expression or avolition) (Tandon et al., 2013). Symptoms must have been present for at least six months, with at least one month of active symptoms. Notably absent from this newest edition of the DSM is a definition of schizophrenia sub-types (disorganized, catatonic, paranoid, residual, and undifferentiated), which were last seen in the DSM-IV-TR (American Psychiatric Association, 2000). These sub-types were shown to have low diagnostic reliability, and were not effective in predicting disease outcome.

Schizophrenia patients tend to experience their first psychotic episode in their late teens or early twenties, with a later age of onset in females than males (Eranti et al., 2013; Rabinowitz et al., 2006). Patients are generally considered to be experiencing their first episode of psychosis if they have been experiencing symptoms for a period of less than six months. Within this general umbrella of psychosis, there are a number of specific diagnoses, including schizophrenia, schizophreniform disorder, schizoaffective disorder, and brief psychotic disorder ("The Different Types of Psychosis," 2015). Schizophreniform disorder is the same as schizophrenia, but with symptoms present for less than six months. Schizoaffective disorder combines symptoms of schizophrenia and a concomitant mood disorder, such as bipolar disorder or depression. In brief psychotic disorder, symptoms come on suddenly, often in response to a stressful life event, and generally last less than one month.

2 Magnetic Resonance Imaging

2.1 Image Acquisition

Much of the established neurobiological insight into schizophrenia has been derived from in vivo magnetic resonance (MR) imaging. Structural MR imaging is a noninvasive tool that allows visualization of soft tissues and organs by taking advantage of the inherent magnetic properties of various nuclei (most commonly hydrogen) found in the human body. Hydrogen atoms are present throughout the human body in water molecules and in fatty tissues (which are plentiful in the human brain). The nucleus of a hydrogen atom is a single proton that spins about its own axis. As a charged particle, a spinning proton creates an electromagnetic field that can be influenced by electromagnetic waves (Nishimura, 2010). When an external magnetic field is present (such as that of an MR machine, often referred to as B0), spinning nuclei align such that they precess around the direction of the external field (Nishimura, 2010).

To obtain an MR image, a radio frequency (RF) pulse is applied to the system (normally at a 90° angle) at the Larmor frequency of the nuclei to tip the magnetization into the plane that is orthogonal to the main B0 field (Nishimura, 2010). This excites the protons into an unstable high-energy state, in which they begin to precess in phase with each other. At this point, there are two types of magnetization acting upon the protons: the longitudinal magnetization, from the MR magnet (B0); and the transverse magnetization in the plane perpendicular to the main B0 field, resulting from application of the RF excitation pulse. Once in this excited state, the nuclei undergo relaxation back to equilibrium, during which they emit a signal (Nishimura, 2010). This signal is measured by the receiving coil. Positional information of the signal is encoded using spatial gradients, which vary the strength of the magnetic field spatially (Nishimura, 2010). The Larmor frequency of hydrogen atoms is dependent on field strength, so varying the field strength spatially means that the frequencies of the signals produced by hydrogen atoms will vary with position as well.

There are multiple relaxation processes. T1 (the spin-lattice or longitudinal relaxation time) is the characteristic time constant for precessing nuclei to realign themselves with the external magnetic field (Nishimura, 2010). It is the time required for the longitudinal magnetization to return to 63% of its equilibrium value after being exposed to the 90° RF pulse. T2 (the spin-spin or transverse relaxation time) is the characteristic time constant for spins to lose phase coherence. Interactions between the spins result in a loss of transverse magnetization. Contrast between tissue types is achievable because excited hydrogen atoms return to the equilibrium state at different rates in different tissues. In a clinical setting, most MR imaging machines have a field strength of between 1.5T and 3T, although more powerful machines (3T - 11T) are becoming increasingly common (Nishimura, 2010).

2.2 Image Formation

Raw signal data is collected from the MR scanner in a matrix called k-space (Nishimura, 2010). This signal is collected as a function of time, and must be converted to a function of frequency using a Fourier transformation. K-space has coordinates of spatial frequencies with units of cycles/mm. K-space is filled based on the strength of two gradients, one in the x (frequency-encoding) and the other in the y (phase-encoding) direction. One line of k-space is filled per repetition time (the time between RF pulses). The outer areas of this matrix, which store the higher frequencies, influence the image resolution, while the inner areas, which store the lower frequencies, have more influence over contrast. Once the k-space matrix is filled, the data are Fourier-transformed in the x, then y directions to create a magnitude image which shows the MR signal at each point (Nishimura, 2010).

3 Image Processing

3.1 Image Registration

In order to compare the anatomy of two different subjects, their images must be in the same common space. This can be achieved by deforming one image to fit the other, or by registering them both to some common reference frame (Klein et al., 2010). Over the years, efforts have been made to define three-dimensional coordinate systems or labeling protocols that achieve this (Fischl et al., 1999; Talairach and Tournoux, 1988). A common reference frame allows data to be compared across subjects, images to be classified, and patterns across subjects to be detected (Klein et al., 2009).

Anatomical correspondence between images can be achieved using feature-based or intensity-based methods. Feature-based methods use distinct features or points in images to align them, such as manually identified sulci or gyri in the brain (Davatzikos et al., 2001). Intensity-based methods use similarity metrics to compare intensity patterns in the images. Brain shape is often so heterogeneous that point correspondences are not sufficient to capture the full variability, so intensity-based methods are more widely used in neuroimaging.

Images are first aligned with a linear (affine) transformation, which uses a series of translations, rotations, scales, and shears to align the images (Pipitone et al., 2014). An affine transformation is fully invertible, and the same transformation is applied to all voxels in an image.
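As a rough illustration of the linear step described above, the following sketch composes a translation, a rotation, a shear, and a scaling into a single 4x4 homogeneous matrix, applies it to a set of 3D points, and then undoes it with the matrix inverse. The function names (`affine_matrix`, `apply_affine`), the restriction to a rotation about the z axis and a single shear term, and the composition order are assumptions made for this example; they are not taken from the registration software used in this thesis.

```python
import numpy as np

def affine_matrix(translation=(0.0, 0.0, 0.0), rotation_z=0.0,
                  scale=(1.0, 1.0, 1.0), shear_xy=0.0):
    """Build a 4x4 homogeneous affine matrix (hypothetical helper).

    The components are composed as translate @ rotate @ shear @ scale,
    so scaling is applied to a point first and translation last.
    """
    T = np.eye(4)
    T[:3, 3] = translation                 # translation in x, y, z

    c, s = np.cos(rotation_z), np.sin(rotation_z)
    R = np.eye(4)
    R[:2, :2] = [[c, -s], [s, c]]          # rotation about the z axis

    H = np.eye(4)
    H[0, 1] = shear_xy                     # shear x as a function of y

    S = np.diag([*scale, 1.0])             # per-axis scaling
    return T @ R @ H @ S

def apply_affine(A, points):
    """Apply the same 4x4 affine matrix A to every (x, y, z) point."""
    pts = np.asarray(points, dtype=float)
    homog = np.c_[pts, np.ones(len(pts))]  # append homogeneous coordinate
    return (homog @ A.T)[:, :3]

# The same matrix moves every point, and its inverse maps them back.
A = affine_matrix(translation=(5.0, -2.0, 1.0), rotation_z=np.pi / 6,
                  scale=(1.1, 0.9, 1.0), shear_xy=0.05)
pts = np.array([[0.0, 0.0, 0.0], [10.0, 0.0, 0.0], [0.0, 10.0, 5.0]])
moved = apply_affine(A, pts)
restored = apply_affine(np.linalg.inv(A), moved)
print(np.allclose(restored, pts))
```

Because one matrix is shared by all voxels, a transformation of this kind can only capture global differences in position, orientation, size, and skew between two brains.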

Linear registration is global, so it does not sufficiently model local anatomical differences between images. Therefore, a non-linear registration step is required as well. There are many different non-linear registration techniques, each with their own similarity measures, transformation models, regularization methods, and optimization strategies (Avants et al., 2008; Chakravarty et al., 2009; Collins et al., 1995; Klein et al., 2009). Generally, registration is performed through the optimization of an image similarity metric, which measures the degree of similarity in intensity patterns between two images. Commonly used similarity metrics include cross-correlation, mutual information, and the sum of squared intensity differences, although other metrics exist (Avants et al., 2008; Klein et al., 2009).

One of the most widely used non-linear algorithms is Advanced Normalization Tools (ANTs), which is an invertible diffeomorphic registration algorithm that uses the SyN image normalization technique with a cross-correlation similarity metric (Avants et al., 2008). In this algorithm, transformations are estimated in a hierarchical fashion in which the data is subsampled, and large deformations are estimated first (Chakravarty et al., 2009; Pipitone et al., 2014). These deformations are successively refined into smaller deformations, in which the data is subsampled into a finer grid. Both the deformation field and the objective function are regularized with a Gaussian kernel at each level of the hierarchy (Pipitone et al., 2014). Non-linear deformations are represented with a 4D deformation field that is regularized with a tradeoff for smoothness and rigidity. The transformation is smooth and continuous, with invertible derivatives (i.e., the Jacobian determinant is nonzero) (Ashburner, 2007).
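To make the invertibility condition concrete, the sketch below estimates the Jacobian determinant of a transformation phi(x, y) = (x + ux, y + uy) defined by a displacement field on a regular grid. It is a simplified 2D finite-difference illustration, not the computation performed inside ANTs; the function name `jacobian_determinant_2d` and the toy displacement fields are assumptions made for this example. A determinant above 1 indicates local expansion, a value between 0 and 1 indicates local compression, and a value at or below 0 indicates that the mapping has folded and is no longer invertible at that location.

```python
import numpy as np

def jacobian_determinant_2d(ux, uy, spacing=1.0):
    """Jacobian determinant of phi(x, y) = (x + ux, y + uy) at each grid point.

    ux, uy : 2D arrays of displacements, indexed [row (y), column (x)].
    spacing : grid spacing, assumed equal along both axes (illustrative choice).
    """
    # np.gradient returns the derivative along axis 0 (y) first, then axis 1 (x).
    dux_dy, dux_dx = np.gradient(ux, spacing)
    duy_dy, duy_dx = np.gradient(uy, spacing)
    # det(I + grad u) for a 2x2 Jacobian matrix
    return (1.0 + dux_dx) * (1.0 + duy_dy) - dux_dy * duy_dx

# Toy displacement fields on an 8 x 8 grid.
ys, xs = np.mgrid[0:8, 0:8].astype(float)

# Uniform 10% expansion in x and y: every voxel grows by the same factor.
det_expand = jacobian_determinant_2d(0.1 * xs, 0.1 * ys)

# A field that reverses the x axis: the mapping folds, so the determinant is negative.
det_fold = jacobian_determinant_2d(-2.0 * xs, np.zeros_like(ys))

print(det_expand.min(), det_expand.max())
print(bool(np.all(det_fold < 0)))
```

The same quantity reappears in modulated voxel-based morphometry, where tissue maps are scaled by the Jacobian determinant so that local volume changes introduced by registration are preserved.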
Several other non-linear registration algorithms exist, such as ANIMAL (Collins et al., 1995), DARTEL (Ashburner, 2007), FNIRT (Andersson et al., 2010), the SPM algorithms (Friston et al., 1995a), and ROMEO (Hellier et al., 2001), and have been used throughout the literature for various purposes; however, ANTs has consistently been shown to be the best-performing non-linear registration algorithm across a wide range of applications (Avants et al., 2008; Klein et al., 2009; Murphy et al., 2011; Pipitone et al., 2014). In addition to its performance, advantages of ANTs include that it is publicly available, and has the most adjustable parameters of the leading algorithms (Klein et al., 2009).

3.2 Voxel-Based Morphometry

Traditionally, structural MR studies have used a region-of-interest (ROI) approach to detect neuroanatomical differences between the brains of patients and controls. In this approach, regions or structures of the brain are identified as being of interest to the disease being studied, and are delineated from the images of subjects via manual or automated segmentation. The volume or shape of these structures is then compared between disease groups. Although often effective at identifying differences at the group level, ROI-based analyses require predefined brain regions, which are time-consuming to produce, and require that regions have consistently definable boundaries (Bora et al., 2011; Zarogianni et al., 2013).

Unlike the ROI approach, voxel-based morphometry (VBM) allows for voxel-wise comparison of local tissue (i.e., grey matter (GM), white matter (WM), or cerebrospinal fluid (CSF)) concentration between two groups (Ashburner and Friston, 2000). Briefly, it involves spatially normalizing images with a linear registration to the same stereotactic space, extracting tissue from the normalized images, smoothing, and conducting statistical analyses in template space to make conclusions about group differences (Ashburner and Friston, 2000). To account for volumetric changes to the unit cube (voxel) during the registration, tissue volumes are modulated by the Jacobian determinant of the non-linear transformation (Chung et al., 2001). As opposed to the ROI-based approach, VBM does not require any a priori manual segmentation of images, can capture features across the whole brain, and can detect differences that may be present at the sub-structural level.

3.3 Cortical Thickness

The human cerebral cortex contains over 14 billion neurons (Malcolm, 1985), and its thickness has been shown to be a metric of interest across many disease groups, including schizophrenia (Narr et al., 2005; Wheeler et al., 2013). Cortical thickness is defined as the sum of the layers of the cerebral cortex (six in the neocortex, three or four elsewhere in the allocortex), or the distance between the inner subcortical WM surface and the outer cortical GM surface. The thickness of the cortex is thought to reflect the arrangement, morphology, and density of neurons (Parent et al., 1995).

A number of techniques exist to measure cortical thickness on MR images. One of the most widely used methods, CIVET, is an automatic pipeline that takes standard T1-weighted images as input, and outputs an estimation of the thickness of the cerebral cortex at 81,924 vertices in each subject (Ad-Dab'bagh et al., 2005; Ad-Dab'bagh et al., 2006; Lerch & Evans, 2005). Due to surface-based registration steps carried out during the pipeline, these vertices have spatial correspondence among subjects, so thickness values can be compared directly between groups (Lerch & Evans, 2005; MacDonald et al., 2000). Another popular method for estimating cortical thickness uses the FreeSurfer software package (Dale et al., 1999; Fischl and Dale, 2000). Like CIVET, the FreeSurfer pipeline involves pre-processing steps for spatial normalization, intensity inhomogeneity correction, skull stripping, and tissue classification. The major difference between the pipelines is the method used for extracting the GM and WM surfaces. CIVET uses a deformable sphere polygon model to fit the inner WM surface, while FreeSurfer uses a method based on triangular tessellation. CIVET has been shown to generate more reproducible results and less geometric error than FreeSurfer (Lee et al., 2006), and also to be more sensitive to subtle cortical thickness abnormalities in patients with mild cognitive impairment (Redolfi et al., 2015).

4 Neuroanatomical Phenotype of Schizophrenia

4.1 Postmortem Evidence for Abnormal Neuroanatomy in Schizophrenia

Postmortem studies of patients with a schizophrenia diagnosis have revealed significant structural and volumetric differences from healthy controls. Numerous histological studies have shown increased neuronal density and decreased neuronal somal size in schizophrenia patients relative to healthy controls, particularly in the frontal regions (Selemon et al., 1995, 1998, 2003; Rajkowska, 1997), but increased neuronal density has also been recorded in the anterior cingulate (Bouras et al., 2001) and occipital regions (Selemon et al., 2005). Increased microglia density has been observed in the frontal and temporal cortices (Radewicz et al., 2000). These neuronal differences correspond to decreases in cortical thickness, as measured by manual segmentation of postmortem histological images (Narr et al., 2005).

4.2 MR Evidence for Abnormal Neuroanatomy in Schizophrenia

The advent of MR imaging allowed scientists to investigate neuroanatomy in vivo, and has allowed for significant increases in study sample sizes. In schizophrenia in particular, it is not well understood whether neuroanatomical differences identified between healthy controls and patients reflect the cause of the illness or the effect of the illness on the brain. However, recording the nature of these differences in as much detail as possible will potentially lead to a greater understanding of the disease and of the bidirectional relationship between brain measures and symptomatology, and will perhaps one day, in combination with studies using other modalities and metrics, be used to answer questions about the underlying mechanism of the disorder. A number of different approaches have been used to study neuroanatomical abnormalities in schizophrenia, namely the region-of-interest (ROI) approach, VBM, and cortical thickness analyses.

Region-of-Interest Approach

Studies that have used structural MR to examine schizophrenia patients have consistently shown abnormalities across the three primary tissue classifications of the brain, namely GM, WM, and CSF (Bora et al., 2011). Although abnormalities are reported in regions across the entire brain, the temporal and prefrontal lobes are repeatedly implicated. A number of studies have shown decreased GM volumes in patients relative to controls, specifically in the hippocampus (Csernansky et al., 2002), amygdala (Ganzola et al., 2014), thalamus, striatum, and superior temporal cortex (Gaser et al., 2004). Additionally, increased ventricular volume has been frequently reported (Gaser et al., 2004; Shenton et al., 2001). Studies have described these abnormalities in the context of the fronto-striatal, fronto-temporal, and anterior limbic networks (Benes, 2000; Fletcher et al., 1999; Pantelis et al., 1992; Woodruff et al., 1997). These differences have been shown to be present even in patients experiencing their first episode of psychosis, early in the disease course. Specifically, first-episode patients have been shown to have smaller whole-brain (Vita et al., 2007), frontal, and temporo-limbic volumes (Ananth et al., 2002; Job et al., 2002; Kubicki et al., 2002; Salgado-Pineda et al., 2003), smaller thalamic and basal nuclei volumes (Salgado-Pineda et al., 2003; Chua et al., 2007), and larger lateral ventricular volume (Chan et al., 2009; Vita et al., 2007). In chronic patients, who have had the disease for an extended period of time, whole-brain volume is likewise smaller than in controls, as are hippocampal and parahippocampal volumes, and ventricular volumes are larger (Shenton et al., 2001; Lawrie et al., 1998; Wright et al., 2000).

Voxel-Based Morphometry Approach

VBM has been used successfully to detect brain structural differences between schizophrenia patient and control groups, often replicating and expanding upon ROI-based results (Hulshoff et al., 2001; Park et al., 2004; Gaser et al., 2001; Job et al., 2002). In a meta-analysis published in 2011, Bora and colleagues examined 79 VBM studies that reported GM and WM abnormalities in schizophrenia using a number of different modalities (Bora et al., 2011). Across the structural MR studies (49 in total), they found significant GM reductions in the bilateral insula/inferior frontal cortex, superior temporal gyrus, anterior cingulate gyrus/medial frontal cortex, thalamus, and left amygdala in patients relative to controls. In the WM, they found decreased fractional anisotropy or decreased volume in interhemispheric fibers, anterior thalamic radiation, inferior longitudinal fasciculi, inferior frontal occipital fasciculi, cingulum, and fornix. They found that GM abnormalities were more severe in male patients, patients with chronic illness, and patients with negative symptoms, and that WM deficits were more severe in more chronic patients. The overlapping GM/WM findings were bilateral anterior cortical, limbic, and subcortical abnormalities, together with abnormalities in the WM that connects these structures.

Similarly, Chan and colleagues compared 41 VBM studies focusing on individuals at high risk of schizophrenia (8 studies), patients experiencing their first episode of schizophrenia (14 studies), and patients with chronic schizophrenia (19 studies) (Chan et al., 2009). In the high-risk group, they observed decreased GM in the anterior cingulate, left amygdala, and right insula relative to controls. In the first-episode group, they also found decreased GM in the anterior cingulate and right insula, but not in the amygdala. In the chronic group, GM decreases were the most extensive, and included the areas found in the two less severe groups, as well as the superior temporal gyri, thalamus, posterior cingulate, and parahippocampal gyrus in patients relative to controls. Of note, the first-episode group had lower GM volumes in the ventral-dorsal anterior cingulate, right insula, left amygdala, and thalamus than the chronic group. They conclude that the frontotemporal brain is implicated across disease severity, beginning in high-risk individuals and becoming more extensive through first-episode and chronic patients, and that schizophrenia is therefore a progressive disorder of the cortico-striato-thalamic loop.

Cortical Thickness

A number of studies have shown cortical thickness decreases in patient populations relative to controls.
In a population of 54 schizophrenia patients and 68 controls, Wheeler and colleagues used a vertex-wise approach to show regional cortical thinning in the frontal gyri (left superior, left and right middle, and right inferior), right anterior cingulate, bilateral entorhinal cortex, bilateral lingual gyrus, right middle temporal gyrus, left temporal pole, left angular gyrus and gyrus rectus, and right striate and extrastriate cortex (in summary, regions predominantly in the frontal and temporal areas) in a heterogeneous patient population relative to controls (Wheeler et al., 2013). Goldman and colleagues examined the thickness of the cortex on a surface-wide, node-by-node basis in a large population of 160 controls, 115 schizophrenia patients, and 192 unaffected siblings (Goldman et al., 2009). They reported reductions in thickness in the patient group relative to controls, especially in the frontal lobe and temporal cortex. Unaffected siblings differed from the control group at the trend level only. Kuperberg and colleagues used a population of 33 schizophrenia patients and 32 controls to test the hypothesis that cortical thinning in patients is more pronounced in the temporal and prefrontal regions than in the superior, parietal, calcarine, postcentral, central, and precentral cortices (Kuperberg et al., 2003). Their hypothesis was supported by a significant interaction between group and region type. Specifically, thinning was observed in the bilateral orbitofrontal cortices, the left inferior frontal, inferior temporal, and occipitotemporal cortices, and the right medial temporal and medial frontal cortices. They also found that the superior parietal, primary somatosensory, and motor cortices were spared. Results are similar in first-episode populations. Using a cohort of 72 first-episode patients and 78 controls, Narr and colleagues showed regional changes in cortical thickness in the bilateral frontal, temporal, and parietal heteromodal association cortices (Narr et al., 2005). Taken together, these ROI, VBM, and cortical thickness studies all conclude that patients with schizophrenia have a brain phenotype distinct from that of healthy controls.
Specifically, patients have decreased GM volumes and densities throughout the brain, most pronounced in the temporal and frontal lobes, as well as enlarged ventricles and decreased WM integrity. Given these differences, it is reasonable to expect that patients and controls could be automatically differentiated using sophisticated multivariate statistical techniques, as described below.

5 Machine Learning

5.1 Univariate versus Multivariate Analyses

Traditionally, the ROI and VBM analyses discussed above are conducted using a General Linear Model, wherein data are modeled as a linear combination of variables (Friston et al., 1995; Zarogianni et al., 2013). One of the major problems with this method is that statistical tests are performed on each variable (i.e., each ROI or each voxel) independently, so in VBM studies, where the number of variables can reach into the millions, extreme corrections for multiple comparisons are necessary, and only the most robust results survive. Additionally, these univariate approaches provide group-wise statistics, so they can only comment on the trends of the disease groups as a whole. As such, they cannot provide diagnostic information on a patient-by-patient basis. Machine learning, a family of multivariate analyses, is able to overcome both of these limitations by considering all variables simultaneously and by providing a hard threshold to divide patients from controls. In this way, researchers are able to elucidate a more complete, network-like picture of the brain, and use this information to predict the class (patient or control) of an unseen subject.
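To make the scale of the multiple-comparisons problem concrete, the following minimal sketch computes a per-test significance threshold. It assumes a Bonferroni correction, which is only one of several possible corrections and is not named in the text; the test counts (100 regions, one million voxels) and the function names are hypothetical illustrations.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test threshold that keeps the family-wise error rate at alpha.

    Bonferroni is assumed here purely for illustration.
    """
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


def surviving_tests(p_values, alpha=0.05):
    """Return the indices of tests whose p-value survives the correction."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return [i for i, p in enumerate(p_values) if p <= threshold]


# Hypothetical comparison: an ROI-sized analysis versus a voxel-sized one.
roi_threshold = bonferroni_threshold(0.05, 100)          # 100 regions
voxel_threshold = bonferroni_threshold(0.05, 1_000_000)  # one million voxels

print(roi_threshold, voxel_threshold)
print(surviving_tests([0.001, 0.04, 0.2]))
```

Under these assumptions the voxel-wise threshold is four orders of magnitude stricter than the ROI-wise one, which is why only the most robust univariate effects survive.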

5.2 Feature Reduction

Due to the extremely large number of features available in neuroimaging data, a feature reduction technique must often be applied before applying machine learning algorithms. Feature reduction techniques range from the more basic, such as downsampling the voxel dimensions of an image or using an a priori disease hypothesis to exclude a portion of the brain, to the more complex, such as re-expressing the full dataset as a smaller number of continuous latent variables. The best feature reduction techniques significantly reduce the number of variables while also minimizing information loss. One of the most commonly used data reduction techniques is principal component analysis (PCA), which is the orthogonal projection of the data onto a lower-dimensional subspace chosen such that the variance of the projected data is maximized (Bishop, 2006; Hotelling, 1933). PCA is an unsupervised analysis technique, as it finds inherent patterns in the data without requiring class labels. Linear discriminant analysis (LDA) is also commonly used to reduce feature sets. LDA aims to find a linear transformation that maps the input variables to a lower-dimensional space in which the variance between classes is maximized and the variance within each class is minimized (Gu, n.d.). Unlike PCA, LDA is a supervised form of machine learning, where the class labels are known in advance and required for feature reduction. Other feature reduction methods involve ranking features based on their discriminatory ability, and retaining only a subset for classification (Fan et al., 2007).
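As an illustration of the PCA definition above, the sketch below finds the single direction of greatest variance using power iteration on the sample covariance matrix, then projects the data onto it. This is a toy, standard-library-only implementation on made-up two-feature data; the function names are hypothetical, and a real neuroimaging analysis would use an optimized numerical library rather than this approach.

```python
def first_principal_component(rows, n_iter=200):
    """Return (feature means, unit vector along the direction of greatest variance).

    Uses power iteration on the sample covariance matrix. The all-ones
    starting vector is an arbitrary choice and would fail if it happened
    to be orthogonal to the leading component.
    """
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
           for i in range(p)]
    v = [1.0] * p
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0.0:  # no variance in the data
            break
        v = [x / norm for x in w]
    return means, v


def project(rows, means, direction):
    """Re-express each row as a single score along the given direction."""
    return [sum((r[j] - means[j]) * direction[j] for j in range(len(direction)))
            for r in rows]


# Made-up data lying exactly on the line y = 2x.
points = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
means, direction = first_principal_component(points)
scores = project(points, means, direction)
print(direction, scores)
```

Because no class labels enter the computation, the sketch also shows why PCA is described as unsupervised: two features are reduced to one score per subject using the data alone.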

5.3 Machine Learning Algorithms

Regularized Logistic Regression

Numerous algorithms have been developed by the machine learning community, each with its own advantages and disadvantages. Logistic regression is analogous to the more common linear regression, where a number of continuous independent variables describe a single dependent variable, except that the dependent variable is categorical. In cases where the number of variables (p) is much greater than the number of subjects (n), as is often the case in neuroimaging datasets, there is a very high probability of over-fitting the model to the data. To avoid this, regularization can be added to the logistic regression algorithm. There are two common types of regularization: LASSO (Least Absolute Shrinkage and Selection Operator) and ridge regression. The LASSO penalty discards unimportant variables (and can retain at most n variables), while the ridge penalty retains all variables but shrinks their coefficients toward zero. The elastic net penalty combines the LASSO and ridge penalties. A parameter (α) controls the relative weight of each penalty used (Hastie and Qian, 2014). Another parameter, λ, controls the overall strength of the penalties. The values of λ and α are tuned to determine the optimal model, defined as that with the lowest model deviance (Hastie and Qian, 2014).

Support Vector Machines

Another method commonly used in the machine learning literature is the support vector machine (SVM). An SVM is essentially a quadratic optimization problem that seeks to separate two distinct classes by constructing a separating hyperplane between the classes, and maximizing the margin between the two classes using the datapoints from each class that lie closest to the hyperplane (Meyer, 2015). For problems that are not linearly separable, the raw data can be mapped, using a kernel, to a higher-dimensional space in which the data are linearly separable. There are a number of different kernel types available; however, the most common is the radial basis function (RBF) kernel (also sometimes called the Gaussian kernel). Most schizophrenia classification studies use the RBF kernel, although it has been suggested that it is not suitable when the number of features is very large, as with neuroimaging studies (Hsu et al., 2010). In these cases, the feature space is already very high-dimensional, so it does not benefit from being mapped to an even higher-dimensional space (i.e., the linear kernel is sufficient). Linear SVMs are parameterized by the cost parameter (C), which influences the width of the support vector margin. Non-linear SVMs (those using an RBF kernel) require optimization of another parameter, γ, which controls the mapping of datapoints to the higher-dimensional space. A disadvantage of SVMs is that their performance is highly dependent on parameter tuning (Burges, 1998).

Linear Discriminant Analysis

Linear discriminant analysis (LDA) is a multivariate method that performs classification based on continuous variables (Zarogianni et al., 2013). LDA aims to map the input features to a lower-dimensional space using a linear transformation (Gu, n.d.). In this lower-dimensional space, between-class variance is maximized and within-class variance is minimized. LDA uses an optimal weighted combination of variables to construct discriminant functions, such that the first discriminant function provides the best class separation, and so on. The method then selects the linear combination of these discriminant functions that provides the best overall class separation. LDA does not require any parameter tuning, and can also be used as a form of data reduction.
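The idea of maximizing between-class variance relative to within-class variance can be illustrated for the two-class, two-feature case, where the discriminant direction has the closed form w = S_w^{-1}(m_1 - m_0), with S_w the pooled within-class scatter matrix and m_0, m_1 the class means. The sketch below is a toy illustration on made-up data, not the implementation used in any of the studies discussed; the function names and the midpoint decision threshold are assumptions.

```python
def class_mean(points):
    """Mean of a list of (x, y) points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def fisher_lda_2d(class0, class1):
    """Return (w, threshold) for a two-class, two-feature Fisher discriminant."""
    m0, m1 = class_mean(class0), class_mean(class1)
    # Pooled within-class scatter matrix [[sxx, sxy], [sxy, syy]].
    sxx = sxy = syy = 0.0
    for points, m in ((class0, m0), (class1, m1)):
        for x, y in points:
            dx, dy = x - m[0], y - m[1]
            sxx += dx * dx
            sxy += dx * dy
            syy += dy * dy
    det = sxx * syy - sxy * sxy
    if det == 0.0:
        raise ValueError("within-class scatter matrix is singular")
    d0, d1 = m1[0] - m0[0], m1[1] - m0[1]
    # w = inverse(S_w) @ (m1 - m0), written out for the 2x2 case.
    w = ((syy * d0 - sxy * d1) / det, (-sxy * d0 + sxx * d1) / det)
    # Assumed decision rule: threshold halfway between the projected class means.
    threshold = 0.5 * ((w[0] * m0[0] + w[1] * m0[1]) + (w[0] * m1[0] + w[1] * m1[1]))
    return w, threshold


def predict(w, threshold, point):
    """Assign class 1 if the projected point lies above the threshold, else class 0."""
    return 1 if w[0] * point[0] + w[1] * point[1] > threshold else 0


# Made-up training data: class 0 centred at (2, 2), class 1 centred at (6, 6).
controls = [(1.0, 2.0), (2.0, 1.0), (2.0, 3.0), (3.0, 2.0)]
patients = [(5.0, 6.0), (6.0, 5.0), (6.0, 7.0), (7.0, 6.0)]
w, threshold = fisher_lda_2d(controls, patients)
print(w, threshold, predict(w, threshold, (2.5, 2.5)), predict(w, threshold, (6.0, 6.5)))
```

Note that nothing in the sketch is tuned: the direction and threshold follow directly from the class means and scatter, which is consistent with the observation that LDA requires no parameter tuning. With many more features than subjects, however, the scatter matrix becomes singular, which is one reason feature reduction is applied first.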

5.4 Comparing Algorithms

Applying the elastic net penalty to LR serves as an intrinsic feature selection process, which is advantageous in models with a large number of variables. Additionally, features selected from LR analyses are easy to interpret, as they have not been significantly transformed. LDA is the best method to use when the class distributions are Gaussian; since SVMs are non-parametric, they can handle data with an unknown distribution (Fan et al., 2007; Zarogianni et al., 2013). LDA works only on datasets with linearly-separable data (Zarogianni et al., 2013), while SVMs can better represent nonlinear relationships in data. SVMs also show stronger generalizability, as they emphasize samples that are located close to the decision boundary (Hastie et al., 2009; Kambeitz et al., 2015). The importance of tuning the SVM cost parameter C to improve model performance has previously been shown (Franke et al., 2010). In a recent meta-analysis of 36 studies performing schizophrenia/control classification, 72% used either SVMs or some kind of discriminant analysis (Kambeitz et al., 2015). SVMs do not scale well with data size because of the quadratic optimization algorithm and the kernel optimization (Meyer, 2015). Additionally, SVM performance is very sensitive to parameter choice, which means extensive and computationally-expensive tuning is required.

5.5 Performance Evaluation

The performance of classifiers is typically reported using three metrics: sensitivity, the proportion of patients correctly classified as patients (true positives); specificity, the proportion of controls correctly classified as controls (true negatives); and accuracy, the proportion of all subjects correctly classified.

5.6 Validation

When constructing an algorithm, it is important to make sure that it generalizes to new data, or in other words is not over-fit to the training data (Zarogianni et al., 2013). There are a number of ways to achieve this. One of the most widely used is leave-one-out cross-validation, in which a single data instance is left out of the training, and the classification algorithm is used to predict the class of this single excluded subject. This process is then repeated such that each subject in the training set is left out once (i.e., number of rounds of validation = number of training instances), and the accuracy of the algorithm on unseen data is estimated as the proportion of subjects correctly classified across the rounds of validation. This method uses almost all of the data available to create models, and is therefore prone to over-fitting and over-estimating the performance of the models (Nieuwenhuis et al., 2012). It is considerably more robust to validate with multiple subjects that have not been used to train the algorithm, or even with a completely independent dataset, if one is available (Hsu et al., 2010). N-fold cross-validation, another popular technique, is similar to leave-one-out, except that a group of subjects instead of a single subject is left out.

6 Machine Learning and Schizophrenia

Machine learning methods have been applied to schizophrenia datasets with varying degrees of success. In 2005, Davatzikos and colleagues published their paper "Whole-Brain Morphometric Study of Schizophrenia Revealing a Spatially Complex Set of Focal Abnormalities" and introduced machine learning classification methods to the schizophrenia neuroimaging community. In their seminal paper, they performed deformation-based morphometry on 1.5T T1-weighted 0.94x0.94x1mm images from a cohort of 69 patients with a DSM-IV schizophrenia diagnosis and 79 matched controls. They reported reduced whole-brain GM volumes and increased CSF volumes in patients relative to controls. Specifically, reduced GM was observed in the hippocampus, the cingulate cortex, the orbitofrontal cortex, the frontotemporal area, the parietotemporal area, and the occipital area near the lingual gyrus. The only WM group differences reported were in the areas adjacent to the right hippocampus and the left occipital area. Many previous VBM studies have shown similar results (Ananth et al., 2002; Gaser et al., 2004; Hulshoff et al., 2001; Kubicki et al., 2002; Park et al., 2004; Sigmundsson et al., 2001; Sowell et al., 2000; Wright et al., 1999). As with previous studies comparing VBM and ROI approaches (Hulshoff et al., 2001; Park et al., 2004; Gaser et al., 2001; Job et al., 2002), the whole-brain derived results in this study replicated ROI volumetric results previously published on the same cohort (Gur et al., 1999). They conclude from this that VBM is at least as sensitive as the traditional ROI-based approach. They then went a step further and used their VBM metric to construct a classifier to separate patients from controls. They used a non-linear SVM, and validated their classifier using the leave-one-out method. They report 100% classification accuracy on their training set, and 81.1% accuracy (87.3% specificity and 73.9% sensitivity) overall, with a slightly higher accuracy for males than females.
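The leave-one-out procedure used for validation here can be sketched in a few lines. The example below is a minimal illustration under stated assumptions: it substitutes a simple nearest-centroid classifier for the non-linear SVM used by Davatzikos and colleagues, and it runs on a made-up one-feature dataset; all function names are hypothetical.

```python
def nearest_centroid_fit(features, labels):
    """Compute one mean feature vector (centroid) per class."""
    centroids = {}
    for label in sorted(set(labels)):
        members = [f for f, l in zip(features, labels) if l == label]
        centroids[label] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def nearest_centroid_predict(centroids, x):
    """Return the label of the centroid closest to x (squared Euclidean distance)."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def leave_one_out_accuracy(features, labels):
    """Hold out each subject once, train on the rest, and score the held-out subject.

    Assumes every class keeps at least one member when a subject is held out.
    """
    correct = 0
    for i in range(len(features)):
        train_x = features[:i] + features[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = nearest_centroid_fit(train_x, train_y)
        if nearest_centroid_predict(model, features[i]) == labels[i]:
            correct += 1
    return correct / len(features)


# Made-up data: one feature per subject; label 0 = control, 1 = patient.
toy_features = [[1.0], [1.2], [0.9], [2.8], [3.0], [3.1], [2.9], [3.2]]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]
loo_accuracy = leave_one_out_accuracy(toy_features, toy_labels)
print(loo_accuracy)
```

In this toy dataset the control with feature value 2.8 lies closer to the patient centroid whenever it is held out, so it is the one subject that is misclassified. Each of the eight rounds trains on seven subjects, which illustrates both why the number of rounds equals the number of subjects and why the training sets in different rounds are nearly identical.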
The excellent review by Zarogianni and colleagues, "Towards the identification of imaging biomarkers in schizophrenia, using multivariate pattern classification at a single-subject level", profiles 11 studies that use structural MR to perform schizophrenia/control classification, including the original Davatzikos study. The study with the most impressive results is by Karageorgiou and colleagues, who report an overall classification accuracy of 92%. The goal of this group was to develop a classifier that performed early disease diagnosis. They used a cohort of 28 patients with recent-onset schizophrenia (psychotic symptoms of no more than 5 years duration with limited exposure to antipsychotic medication, or symptoms of no more than 2 years with no more than 6 months of antipsychotic exposure) and 47 controls, for which they had 1.5T T1-weighted 0.63x0.63x1.5mm structural MR images and 75 neuropsychological (NP) variables. They obtained 95 volumetric measurements from their MR images using the FreeSurfer software package (Version 4.0.1) (Fischl et al., 2002). They used two approaches: LDA on their raw data (MR and NP separately and combined), and LDA after reducing their data with a PCA and selecting PCs using a permutation analysis. They validated their classifiers using leave-one-out cross-validation. With MR data only, they report similar results with the LDA (64.3% sensitivity and 76.6% specificity) and the PCA-LDA (64.3% sensitivity and 72.3% specificity). The results are better with the NP data only (LDA: 71.4% sensitivity and 80.9% specificity; PCA-LDA: 78.5% sensitivity and 91.5% specificity). Their best results are with the two data types combined (LDA: 64.3% sensitivity and 83.0% specificity; PCA-LDA: 89.3% sensitivity and 93.6% specificity). MR variables selected by the classifier were in the frontal, temporal, and subcortical regions. The NP variables most useful for the classifier were related to memory (verbal and visual), motor dexterity and speed, visuospatial abilities, and executive function.
They conclude that the stronger NP results indicate that cognitive deficits can be detected earlier than structural abnormalities in recent-onset patients, and that the PCA-LDA MR and NP classifier can be used to assist with early intervention and evaluation of treatment response.
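The relationship between sensitivity, specificity, and overall accuracy can be checked against the figures reported above. The sketch below computes the three metrics from confusion-matrix counts and recombines the reported combined-data PCA-LDA rates (89.3% sensitivity, 93.6% specificity) using the cohort sizes of 28 patients and 47 controls. The per-subject counts used in the second call (25 of 28 patients and 44 of 47 controls correct) are an inference from the reported percentages, not values stated in the source, and the function names are hypothetical.

```python
def classification_metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy) from confusion-matrix counts.

    tp/fn: patients classified correctly/incorrectly;
    tn/fp: controls classified correctly/incorrectly.
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy


def accuracy_from_rates(sensitivity, specificity, n_patients, n_controls):
    """Overall accuracy as the group-size-weighted average of the two rates."""
    return (sensitivity * n_patients + specificity * n_controls) / (n_patients + n_controls)


# Reported combined-data PCA-LDA rates with the reported cohort sizes.
combined_accuracy = accuracy_from_rates(0.893, 0.936, 28, 47)

# Inferred counts (an assumption) consistent with those percentages.
sens, spec, acc = classification_metrics(tp=25, fn=3, tn=44, fp=3)
print(round(combined_accuracy, 3), round(sens, 3), round(spec, 3), round(acc, 3))
```

Both routes give an accuracy of approximately 92%, matching the overall figure quoted for this study. Because accuracy is a weighted average, it is pulled toward the rate of the larger group, which is worth keeping in mind when comparing studies with unbalanced cohorts.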

Another strongly performing study is that of Fan and colleagues, a follow-up from the same group as the original Davatzikos paper. They use two cohorts, one of all females (38 controls and 23 schizophrenia patients) and another of all males (41 controls and 46 patients). For both they have 1.5T 1mm isotropic MR data, and report overall classification accuracies of 91.8% for the female cohort and 90.8% for the male cohort, although no real rationale is given for this sex-based split. As with the Davatzikos paper, they perform deformation-based morphometry on their imaging data, and construct classifiers using non-linear SVMs. In this paper, however, they introduce a new tool, COMPARE (Classification of Morphological Patterns Using Adaptive Regional Elements), that combines a novel feature selection and extraction algorithm with non-linear SVMs, and constitutes an easy black-box classification program. Throughout, they validate their feature selection and classification using leave-one-out cross-validation. Their feature reduction technique depends on wavelet decomposition, which can represent data at multiple scales and has been shown to be effective when applied to MR data (Lao et al., 2004). Morphological features are clustered into regions using a watershed segmentation method (Vincent and Soille, 1991), and the most robust, discriminative features are selected by grouping voxels that show similar relationships to the classification variable. A final feature selection method removes irrelevant and redundant features. It should be noted that they do not perform deformation-based morphometry in the way pioneered by Ashburner and Friston. They argue that measurements such as voxel-wise displacement fields, Jacobian determinants, and tissue density maps are too localized and have excessive variability between subjects.
They concede that these voxel-wise approaches can be improved by applying Gaussian smoothing, but argue that since smoothing is applied uniformly to the entire brain and across all individuals (i.e., it is not adaptive to specific structures), measurements are not reliable. In their method, they apply the deformation field resulting from the spatial registration to segmented images, which generates mass-preserving volumetric maps that they call RAVENS (Regional Analysis of Volumes Examined in Normalized Space) maps (Shen and Davatzikos, 2003; Zanetti et al., 2013). Like Jacobian-modulated images, RAVENS maps represent tissue density at each voxel, which reflects the amount of tissue present in each subject's image at a given location (Shen and Davatzikos, 2003; Zanetti et al., 2013). RAVENS maps, however, are based on a high-dimensional elastic transformation driven by point correspondences on anatomical surfaces throughout the brain (Davatzikos et al., 2001), while VBM uses smoother parametric transformations (Ashburner et al., 1998). There is also a difference in how the two methods deal with global shape differences. VBM uses low-parameter shape transformations (Ashburner et al., 1998), and any remaining variability is interpreted as inherent group differences in morphology. RAVENS maps preserve the volume information across the spatial normalization at both the local and global levels (Davatzikos et al., 2001). Davatzikos and colleagues have shown that RAVENS maps demonstrate superior performance compared with the standard VBM available through the SPM 99 software (Friston et al., 1995) in assessing simulated atrophy in the precentral and superior temporal gyri (Davatzikos et al., 2001). Following the work by Fan and colleagues described above, Zanetti and colleagues applied the COMPARE algorithm to their sample of 62 first-episode schizophrenia/schizophreniform disorder patients (both medicated and unmedicated) and 62 age-, gender-, and education-matched controls, with 1-year follow-up (age of patients and controls 18-50). 1.5T T1-weighted MR images with 0.86x0.86x1.5mm voxel dimensions were collected for all subjects.
They calculated RAVENS maps for their images, and corrected these maps for total brain volume before smoothing them with an 8mm Gaussian kernel (Zanetti et al., 2013). They report a classification accuracy of 73.4%, considerably more modest than the results reported by Fan and colleagues. They mapped their most discriminative features back onto their MR images, and found that the fronto-temporal-occipital GM and WM regions bilaterally, including the inferior fronto-occipital fasciculus, as well as the third and lateral ventricles, were important for distinguishing between first-episode patients and controls. Using their 1-year follow-up data, their classifier was not able to predict prognosis (remitting versus non-remitting course), as they were only able to achieve an accuracy of 58.3%. They argue that their subject population, recruited using epidemiological methods (which they argue reduces selection biases by ensuring controls truly represent the population from which the patients were selected) and with variable patterns of comorbidity and disease course, represents a more real-world sample than that used in the Fan study. They conclude that pattern classification techniques are perhaps not effective in such heterogeneous, real-world situations. They also note that differences in the pipelines for image processing, feature extraction/dimensionality reduction, and pattern recognition methods likely account for the discrepancies in accuracies between studies. The studies completed by Nieuwenhuis and colleagues represent the largest population samples reported to date in the schizophrenia classification literature (Nieuwenhuis et al., 2012). In their 2012 paper, they perform classification using a linear SVM on a training sample of 128 chronic patients (including schizophrenia, schizophreniform, and schizoaffective patients) and 111 controls matched on age, sex, and highest parental level of education (age of all subjects <50 years). They validate their algorithms on a sample of 155 patients with chronic schizophrenia and 122 controls.
They have 1.5T T1-weighted 1mmx1mmx1.2mm MR images for all subjects, and derive Jacobian-modulated GM tissue densities, blurred with an 8mm Gaussian kernel, from these images. They reduce their feature set by downsampling their images by a factor of two in each dimension. Prior to classification, they regress out the effects of age, sex, and handedness. They report classification accuracies of 71.4% in their training sample (using leave-one-out cross-validation), and 70.4% in their validation sample. The discriminative pattern of the classifier showed decreases in GM density in the frontal and superior temporal lobes and hippocampus, and increases in GM density in the basal ganglia and the left occipital lobe, in patients relative to controls. They also perform an analysis of the stability of their model, and conclude that accuracies are only stable in population sizes greater than 130 subjects. They report that even with 140 subjects in the training sample, accuracies fluctuate between 52% and 74% in the validation sample, illustrating that the model still depends heavily on the subjects chosen.

Only one major study to date has explored using cortical thickness for schizophrenia classification (Yoon et al., 2009). Using 1mm isotropic 1.5T T1-weighted MR images in a population of 53 right-handed patients and 52 controls matched for age, sex, handedness, and socioeconomic status, they calculated the thickness of the cortical mantle at 81,924 vertices. They perform a Principal Component Analysis (PCA) on the thicknesses in each lobe individually and use the PCs as input into an SVM. They train their classifier using leave-one-out cross-validation, and keep 30 subjects aside as a validation dataset. They conclude that some PCs are more effective than others for distinguishing between patients and controls, but that this effectiveness is not necessarily correlated with the amount of variance explained by the PC. To this end, they re-order their PCs by performing a two-sample t-test between groups to determine effectiveness. Using this method, they report mean classification accuracies of % (a different accuracy is reported for each hemisphere of each lobe of the brain, i.e., 8 in total). They report that the precentral, postcentral, superior frontal and temporal, cingulate, and parahippocampal gyri are all regions that are important for classification. Although their classifier appears to be effective on their sample, it is evidently very dataset-specific, and it is unlikely that their model, or even their data pre-processing steps, would generalize to a new sample. When they order their PCs in a more conventional way, according to the variance they explain, accuracies drop to 63%-77%. Since no solid justification is given for why they decided to reorder their PCs, or why they chose a two-sample t-test to do so, this raises the question of what other preprocessing steps were tried unsuccessfully, but not reported, before they arrived at their published method.

In a recent meta-analysis of multivariate pattern recognition studies in schizophrenia, Kambeitz and colleagues used a bivariate random-effects model to investigate 38 studies with a total of 1602 patients and 1637 controls (Kambeitz et al., 2015). Across the 20 studies that use structural MR, they report a mean sensitivity of 76.4% and a mean specificity of 76.9%. Across all studies, they found that sensitivity was higher in older subjects and in chronic versus first-episode patients, and that it was also affected by imaging modality (functional versus structural MR). They also found higher specificity in patients with more positive symptoms, and higher specificity in patients with higher antipsychotic medication doses. They found no effect of sex, illness duration, PANSS (Positive and Negative Syndrome Scale, a tool used for measuring symptom severity in schizophrenia patients) positive scores, PANSS negative scores, or analysis methods (SVM or LDA) on sensitivity or specificity.
The authors note that there are significant differences among studies in the demographic characteristics of the populations, clinical symptoms of the patients, imaging modalities, preprocessing of data, statistical models, and validation structure, and that these differences make it difficult to directly compare reported sensitivities and specificities. They also make the very important point that most machine learning studies test a range of analysis pipelines and report only the one that gives the highest performance, meaning that little is published on the effect of different factors on model success (Pers et al., 2009). As noted by Kambeitz and colleagues, although there are many studies that successfully perform classification between schizophrenia patients and healthy controls based on structural MR, it is difficult to directly compare results among multiple studies due to the inherent differences in the sample population and study design (Kambeitz et al., 2015). To date, there has been no study that systematically compares the effects of population composition and machine learning algorithm on accuracy. If the research and clinical communities wish to move forward with classification studies, and perhaps even implement them in the clinic to supplement symptom-based diagnoses, it is necessary first to establish benchmarks and guidelines for best practices.

7 Goals and Contributions

In this project, the performance of numerous machine learning algorithms is compared on multiple independent datasets that represent unique data acquisition techniques and patient populations. Specifically, regularized logistic regression, support vector machines, and linear discriminant analysis are all applied to three datasets: two representing chronic schizophrenia patients (one at 1.5T and the other at 3T), and a third representing 3T data of unmedicated, first-episode patients. Cortical thicknesses and two measures of GM tissue density are extracted from each dataset and used for classification. The goal of this project is to elucidate the effect of dataset characteristics, image processing technique, brain region studied, and machine learning algorithm on classification performance, and to draw a conclusion about which combination of methods is optimal for schizophrenia applications.

Chapter 2: Comparing techniques for classifying patients with schizophrenia and healthy controls using machine learning and magnetic resonance imaging

8 Introduction

Schizophrenia is a debilitating mental illness that is associated with impaired social functioning, impaired cognition, and a decreased quality of life (Jobe & Harrow, 2005). Diagnosis of the disease is currently achieved using criteria defined in The Diagnostic and Statistical Manual of Mental Disorders (5th ed.; DSM-5; American Psychiatric Association, 2013), based on cognitive and social reports from the patient and the patient's family, as well as cognitive assessments conducted by a qualified clinician (First, 1995; Kay et al., 1987). Although diagnoses made using this method show good reliability (as reported and evaluated by the authors; Regier et al., 2013; Dice's kappa = ), schizophrenia (and related disorders, including schizoaffective disorder, schizophreniform disorder, and psychosis) remains poorly understood from a mechanistic, treatment, and neurobiological perspective, and there is significant symptom overlap with other psychiatric illnesses, such as bipolar disorder (American Psychiatric Association, 2013; Demirci & Calhoun, 2009; Kim et al., 2015). In an effort to enforce complete diagnostic objectivity and to limit the effects of factors that confound diagnoses, many groups have begun to develop and train computer-algorithm-based methodologies that take only neurobiological factors as input and output either healthy control or schizophrenia patient status.

In this regard, many recent studies have used structural magnetic resonance (MR) imaging to develop a more reliable, fully-automated, and neuroanatomically-informed diagnosis for schizophrenia. However, MR images and the neuroanatomical features that are derived from them (such as structure segmentations, voxel-based morphometry information, or cortical thickness) suffer from high dimensionality: a standard 1 mm isotropic MR image of the brain is composed of over 8 million voxels, each containing important signal intensity information (Davatzikos et al., 2005). Mass univariate methods (such as a voxel-wise general linear model) can identify brain regions that differ between schizophrenia patients and normal controls; for example, differences in the fronto-temporal regions, the thalamus, and the hippocampus, as well as increased ventricular volumes, have consistently been described (Csernansky et al., 2002; Ganzola et al., 2014; Gaser et al., 2004; Narr et al., 2005; Wheeler et al., 2013). While the valuable findings from these studies can be used to better understand neuroanatomical differences, they may not be useful for diagnosis on a patient-by-patient basis (Davatzikos et al., 2005; Fan et al., 2007). As an alternative, many groups have turned to multivariate pattern analysis techniques, as they can be trained to identify the anatomy of a schizophrenia patient (Fan et al., 2007).

Such multivariate methods, often referred to as machine learning methods, have been shown to be effective at identifying patients suffering from neuropsychiatric disorders based solely on information contained in MR images. One of the most notable applications of this technology has been in the diagnosis of Alzheimer's disease, where patient identification accuracy consistently exceeds 90% (Falahati et al., 2014). Similar methods have been applied to schizophrenia patient populations, with varying degrees of success.
It is difficult, however, to directly compare reported accuracies among studies, as each group uses different subject populations, data processing steps, classification algorithms, and validation techniques (Kambeitz et al., 2015). For researchers with novel datasets, or eventually clinicians seeking to support their diagnoses, there is no clear preferred method.

In this manuscript, we present a comprehensive comparison of the most prominent machine learning algorithms used in the schizophrenia versus healthy control classification literature. Namely, we compare the performance of logistic regression, support vector machines, and linear discriminant analysis on three independent datasets (435 subjects total) using both voxel- and vertex-based anatomy measures. These datasets include imaging data from multiple field strengths (1.5T and 3T), and patients from across the spectrum of disease severity (first-episode to chronic). This study is the most expansive and systematic on this topic to date. Our aim is to draw conclusions about the effect of dataset and algorithm selection on the effectiveness of the classifier, with the end goal of providing a benchmark for future analyses and commenting on the feasibility of these approaches.

9 Methods

9.1 Datasets Evaluated

Classification performance was evaluated over three independently collected datasets, all of which have been published previously and are described briefly below. Each dataset represents different disease severities and MR field strengths.

9.1.1 Centre for Addiction and Mental Health (CAMH)

A total of 191 volunteers (103 healthy control, 88 schizophrenia; age 18-59) were recruited at the Centre for Addiction and Mental Health (CAMH), Toronto, Canada, as part of an ongoing neuroimaging, genetics, and cognition research program in neuropsychiatric disorders (Voineskos et al., 2011; Wheeler et al., 2013) (Table 1). Persons with previous head trauma and loss of consciousness, a neurological disorder, a history of primary psychotic disorder in a first-degree relative, current substance abuse (urine toxicology screens were obtained from all potential subjects), or a history of substance dependence were excluded from the study. All study procedures complied with the Declaration of Helsinki and were approved by the CAMH Research Ethics Board; all subjects provided written, informed consent.

Subjects were assessed with the Edinburgh Handedness Inventory (Oldfield, 1971), the Hollingshead Four-Factor Index of Socioeconomic Status (Hollingshead, 1975), and the Wechsler Test of Adult Reading (WTAR) (Wechsler, 2001) for IQ. They completed the Structured Clinical Interview for DSM-IV-TR Axis I Disorders (First et al., 2001) to ensure they were free of neuropsychiatric disorders, and were screened for dementia using the Mini Mental State Examination (MMSE) (Folstein et al., 1975) (Table 1).

T1-weighted MR images were acquired for each subject using an 8-channel head coil on a 1.5T GE Echospeed system (General Electric Medical Systems; Fairfield, Connecticut). Images were acquired using an axial inversion recovery-prepared spoiled gradient-recalled sequence with echo time 5.3 ms, repetition time 12.3 ms, time to inversion ms, flip angle 20°, and 1 excitation, for a total of 124 contiguous slices with 1.5 mm thickness and 0.78 mm x 0.78 mm in-plane voxel size.

9.1.2 Northwestern University Schizophrenia Data and Software Tool (NUSDAST)

One hundred and fifty-eight subjects (91 schizophrenia, 67 controls; age 18-59) were used from the Northwestern University Schizophrenia Data and Software Tool (NUSDAST), a publicly available database of schizophrenia subject data, including structural MR images and clinical, cognitive, and genetic information (Wang et al., 2013) (Table 1). Patients were assessed with the Scale for the Assessment of Positive Symptoms (Andreasen, 1984) and the Scale for the Assessment of Negative Symptoms (Andreasen, 1983) (Table 1).

For all subjects, T1-weighted images were acquired on a 1.5T Siemens MR scanner (Siemens Medical Systems; Erlangen, Germany) at the Mallinckrodt Institute of Radiology at Washington University School of Medicine. Images were acquired using a 3D turbo-flash sequence (TR = 20 ms; TE = 5.4 ms; flip angle = 30°; 256 x 256 pixel matrix; 1 mm x 1 mm x 1.25 mm voxel dimensions).

9.1.3 National Institute of Neurology and Neurosurgery of Mexico (INNN)

One hundred subjects were recruited at the National Institute of Neurology and Neurosurgery (INNN) in Mexico City, Mexico (de la Fuente-Sandoval et al., 2011, 2013) (Table 1). Fifty were patients experiencing their first episode (less than two years of psychotic symptoms) of nonaffective psychosis (including a diagnosis of brief psychotic disorder, schizophreniform disorder, and schizophrenia), and were all antipsychotic naïve (age 18-47). Fifty right-handed age- and sex-matched controls were also recruited (age 18-42, mean 23.1 (SD 4.91)). Exclusion criteria included high risk for suicide, psychomotor agitation, comorbidity with other Axis I disorders, concomitant medical or neurological illness, and current or previous substance abuse. The study was approved by the ethics and scientific committees of the INNN, and all participants provided written, informed consent (Table 1).

All participants were scanned on a 3T GE MR scanner (GE Healthcare; Milwaukee, Wisconsin) with a high-resolution 8-channel head coil at the Neuroimaging Department of the INNN. T1-weighted spoiled gradient-echo 3-dimensional axial acquisition (SPGR) images were collected (echo time, 5.7 ms; repetition time, 13.4 ms; inversion time, 450 ms; flip angle, 20°; field of view, 25.6 cm; 256 x 256-pixel matrix; 1 mm x 1 mm x 1 mm or 0.47 mm x 0.47 mm x 0.6 mm or 0.47 mm x 0.47 mm x 1.2 mm voxel dimensions).

Table 1: Demographic characteristics of all datasets

Column groups, in order:
CAMH (1.5T): Schizophrenia Patients (n=88); Healthy Controls (n=103)
NUSDAST (3T): Schizophrenia Patients (n=91); Healthy Controls (n=67)
INNN (3T): FEP Patients (n=50); Healthy Controls (n=50)

Rows reported as Mean and SD for each group:
Age
Education (years) 13.4 a a a
Parental Education (years)
WTAR (IQ) a
MMSE a
CIRS-G 1.64 a
Age of onset NA NA NA NA NA NA
Illness Duration (weeks) 1, NA NA NA NA NA NA
Chlorpr. Equiv (mg) NA NA NA NA NA NA
PANSS Positive NA NA NA NA NA NA
PANSS Negative NA NA NA NA NA NA
PANSS General NA NA NA NA NA NA

Rows reported as N for each group:
Diagnosis: 65 SCZ 23 SA NA NA 11 BPD/ 18 SFD/ 21 SZ NA
Antipsychotic Treatment: NA NA NA
Sex: 57 M 30 F; 51 M 43 F; 61 M 30 F; 39 M 28 F; 31 M 19 F; 32 M 18 F
Handedness: 80 R 6 L; 89 R 7 L; 80 R 9 L; 58 R 9 L; 50 R 0 L; 50 R 0 L

Significance in independent samples t-test: a p<0.05
WTAR = Wechsler Test of Adult Reading; MMSE = Mini Mental State Examination; CIRS-G = Cumulative Illness Rating Scale for Geriatrics; PANSS = Positive and Negative Syndrome Scale; BPD = Brief Psychotic Disorder; SA = Schizoaffective Disorder; SFD = Schizophreniform Disorder; SZ = Schizophrenia

9.2 Image Processing

We used three different input feature sets to evaluate all machine learning methods described below. The first is a modulated voxel-based morphometry method intended to best replicate what others have used, using the current state of the art in registration methods (Karageorgiou et al., 2011; Nieuwenhuis et al., 2012; Schnack et al., 2013). The second is a methodology based on tissue classification and brain parcellation developed entirely by another group (Fan, Shen, & Gur, 2007; Zanetti et al., 2013). Finally, we use cortical thickness measures to determine the dependence of algorithmic performance on input-feature type (Yoon et al., 2007).

9.2.1 Image Registration

All images were first converted to the MINC file format (bic.mni.mcgill.ca/servicessoftware/homepage), then corrected for radio-frequency (RF) inhomogeneity using the N4 algorithm (Tustison et al., 2010). Images were registered in a multi-step process to a model image (nonlinear ICBM c) that represents the mean anatomy of a population and shows superior contrast and anatomical consistency relative to a single image (Fonov et al., 2011; Fonov et al., 2009) (Figure 1).

In the first step of this process, images were registered to the model image with a 12-parameter affine registration (3 each of rotations, translations, scales, and shears) using the bestlinreg command (part of the MINC tool kit). The transformation from this initial registration was passed to a version of the ANTS algorithm that uses MINC images (Avants et al., 2013; Klein et al., 2009; Murphy et al., 2011; github.com/vfonov/mincants), which affine registered each subject image to MNI space (using the ICBM c model (Fonov et al., 2011; Fonov et al., 2009)) using a brain mask derived using the BEaST brain extraction tool (Eskildsen et al., 2013). The subject image was then transformed along the final affine pathway before being passed to the non-linear stage of the registration.

In the non-linear stage, the affine images were aligned to the brain-masked model using the ANTs algorithm for MINC images as follows:

mincants 3 -m CC[input.mnc,model.mnc,1,4] -r Gauss[3,0] -t SyN[0.5] -i 100x100x100x100x20 -o output.xfm --use-histogram-matching --continue-affine false

where -m CC indicates use of the cross-correlation objective function for similarity during registration, -r selects the type of regularization (in this case Gaussian), -t indicates the type of transformation model used during registration (in this case, diffeomorphic image registration with a gradient step size of 0.5), -i indicates the number of iterations at successively finer resolutions, and -o specifies the output file. The non-linear transformation from this registration was retained for subsequent analyses. After registration, all images from all datasets had 1 mm x 1 mm x 1 mm voxel dimensions.

9.2.2 Modulated Voxel-Based Morphometry

Voxel-based morphometry (VBM) was performed by estimating voxel-wise tissue densities on the affine model-aligned images of all subjects. First, the brain was extracted using the BEaST pipeline (Eskildsen et al., 2013; Figure 1). The extracted brain of each subject was then classified into grey matter (GM), white matter (WM), and cerebrospinal fluid (CSF) using an Artificial Neural Networks classifier trained by stereotaxic space probability masks (Kollokian, 1996; Lerch et al., 2005; Zijdenbos et al., 2002). These classified images were then blurred with an 8 mm full width at half-maximum Gaussian kernel to obtain a voxel-wise estimate of tissue density (Ashburner & Friston, 2000), and modulated with the Jacobian determinant of the inverse non-linear subject-model transformation to account for volume changes during registration. The modulated, blurred tissue masks were then transformed along the forward non-linear transformation such that each subject image resided in the same (model) space, thereby allowing for voxel-wise image analyses.

All images were checked for registration and segmentation quality by examining GM, WM, and CSF masks on three axial, two sagittal, and three coronal slices for each subject. Images were assigned either pass or fail status. Masked and blurred images were downsampled by a factor of two in each dimension to reduce dimensionality (Nieuwenhuis et al., 2012).
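The blurring and modulation steps described above can be illustrated with a minimal Python sketch. It is not the MINC-based pipeline used in this work: the function names, the 1 mm voxel size, the three-sigma kernel truncation, and the example values are all assumptions made for illustration. It shows the standard conversion from a full width at half-maximum to a Gaussian standard deviation, a normalized one-dimensional kernel, and modulation as a voxel-wise product with the Jacobian determinant.

```python
import math

def fwhm_to_sigma(fwhm_mm):
    """Convert a Gaussian full width at half maximum to its standard deviation."""
    return fwhm_mm / (2.0 * math.sqrt(2.0 * math.log(2.0)))

def gaussian_kernel_1d(sigma_mm, voxel_mm=1.0, truncate=3.0):
    """Return a normalized 1-D Gaussian kernel sampled at voxel centres.

    The truncation at three standard deviations is an assumption of this sketch.
    """
    radius = int(math.ceil(truncate * sigma_mm / voxel_mm))
    weights = [math.exp(-0.5 * ((i * voxel_mm) / sigma_mm) ** 2)
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def modulate(density, jacobian_det):
    """Scale each blurred tissue-density value by the local Jacobian determinant."""
    return [d * j for d, j in zip(density, jacobian_det)]

sigma = fwhm_to_sigma(8.0)           # 8 mm FWHM kernel, as in the text
kernel = gaussian_kernel_1d(sigma)   # would be applied along each image axis in turn
example = modulate([0.5, 0.2], [2.0, 0.5])  # hypothetical densities and determinants
```

A separable kernel is used here only because it keeps the example short; any Gaussian smoothing implementation with the same standard deviation would serve the same purpose.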

Figure 1: Image registration and voxel-based morphometry pipeline

9.2.3 RAVENS Maps

In order to compare the analyses presented in this manuscript with the current classification literature, we also constructed RAVENS (Regional Analysis of Volumes in Normalized Space) maps from our T1-weighted data. A number of classification studies use RAVENS maps instead of modulated VBM to perform their analyses (Davatzikos et al., 2005; Fan et al., 2007; Zanetti et al., 2013). Like modulated VBM, RAVENS maps are a method of analyzing regional tissue densities in a subject's brain. RAVENS maps reflect the amount of tissue present in each subject's image at a given location (Shen and Davatzikos, 2003; Zanetti et al., 2013). Unlike modulated VBM, a subject's brain is classified into tissue types in native space, and a high-dimensional, volume-preserving elastic transformation driven by point correspondences on anatomical surfaces throughout the brain is used to register the classified image to a model.

In order to estimate RAVENS maps of our images, we used the image registration software DRAMMS (Deformable Registration via Attribute Matching and Mutual-Saliency weighting) (Davatzikos et al., 2001). DRAMMS requires existing tissue classifications for input images in the subject space in order to estimate the RAVENS map. As such, all tissue masks (not blurred or modulated with the Jacobian determinant) were transformed along the inverse model-subject pathway to bring them into native subject space. These labels, along with the native subject images, were used as inputs into the DRAMMS-RAVENS pipeline, and RAVENS maps were output in the model space. To maintain consistency with the Jacobian-based VBM analyses, these maps were downsampled by a factor of two in each dimension (Nieuwenhuis et al., 2012). RAVENS maps were used as input into the classification algorithms to compare their performance with the Jacobian-modulated VBM estimates.

9.2.4 Cortical Thickness

The thickness of the cortex was estimated using the automated CIVET pipeline (version , Montreal Neurological Institute at McGill University), which has been described previously (Lerch & Evans, 2005; Yoon et al., 2007). Briefly, T1-weighted images are first intensity corrected for nonuniformity (Sled et al., 1998) and spatially normalized to the ICBM 152 average (Collins et al., 1994). Brains are then classified into GM, WM, CSF, and background using existing stereotaxic space probability maps (Kollokian, 1996; Zijdenbos et al., 2002). An automated surface extraction algorithm is then used to extract the inner and outer cortical surfaces (MacDonald et al., 2000). Finally, a surface-based diffusion smoothing kernel is applied to the thickness data (Chung et al., 2002). T1-weighted images for all subjects from all three datasets were input into this pipeline.

9.3 Machine Learning Algorithms

All statistical analyses were performed using R, an open-source software package (R Core Team, 2013). Cortical thicknesses were read into R in the form of a dataframe of size n (number of subjects) x p (number of thickness vertices). GM tissue density images were imported into R using the RMINC library (github.com/mouse-imaging-Centre/RMINC) as 3-dimensional arrays. The array for each tissue type and each subject was then converted into a vector. Vectors for all subjects' GM images were merged vertically to create a matrix of size n (number of subjects) x p (number of voxels). At this point, subjects in each dataset were divided randomly into training and validation subsets, and the validation subset was set aside. After subtracting the mean from each variable, the dimensionality of the training data was reduced via a Principal Component Analysis (PCA) using the prcomp function in R. A PCA constructs n principal components (PCs) for a dataset. Only those PCs that each explained >1% of the variance in the data were retained for further analyses.

9.3.1 Logistic Regression with Elastic Net Regularization

For all datasets, logistic regression (LR) with elastic net regularization was performed using the glmnet library in R (Friedman et al., 2010) for binomial models. In LR, a number of continuous independent variables describe a single dependent categorical variable. To avoid over-fitting, elastic net regularization was added to the models. Elastic net combines two types of regularization: LASSO (Least Absolute Shrinkage and Selection Operator) and ridge regression. The LASSO penalty discards unimportant variables, while the ridge penalty retains all variables but forces their coefficients to be low. The relative contribution of each penalty is controlled by the parameter α (Hastie and Qian, 2014). Another parameter, λ, controls the overall strength of the penalties. The values of λ and α are tuned to determine the optimal model, defined as that with the lowest model deviance (Hastie and Qian, 2014). Twenty-one values of α (ranging from 0 to 1, in increments of 0.05) were tested. A 10-fold cross-validation scheme was used at each value of α to determine the optimal value of λ. All PCs retained from the PCA (i.e., those that each explain >1% of the total variance) were used as input to the classifier.
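To make the roles of α and λ concrete, the following Python sketch evaluates the elastic net penalty in the form documented for glmnet, λ[(1 − α)/2 · ‖β‖₂² + α · ‖β‖₁], and builds the grid of 21 α values described above. It is illustrative only: the analyses in this work were run with glmnet in R, and the function name and example coefficients are hypothetical.

```python
def elastic_net_penalty(coefs, alpha, lam):
    """Elastic net penalty: lam * (alpha * L1 + (1 - alpha) / 2 * squared L2).

    alpha = 1 gives the pure LASSO penalty and alpha = 0 the pure ridge penalty.
    """
    l1 = sum(abs(b) for b in coefs)
    l2_squared = sum(b * b for b in coefs)
    return lam * (alpha * l1 + 0.5 * (1.0 - alpha) * l2_squared)

# Grid of mixing parameters: 0 to 1 in increments of 0.05 (21 values).
alphas = [round(0.05 * i, 2) for i in range(21)]

# Hypothetical coefficient vector, used only to show how the penalty changes.
beta = [1.0, -2.0]
lasso_only = elastic_net_penalty(beta, alpha=1.0, lam=0.5)
ridge_only = elastic_net_penalty(beta, alpha=0.0, lam=0.5)
```

In a full analysis, λ would be chosen by cross-validation at each α and the pair with the lowest cross-validated deviance retained, as described in the text.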

9.3.2 Support Vector Machines

As with the LR analysis, PCs that each explained >1% of the variance in the data were used as the input for the support vector machine (SVM) algorithms. SVMs are quadratic optimizers that seek to separate two distinct classes by constructing a separating hyperplane between the classes and maximizing the margin between the two classes using the datapoints that are closest together (Meyer, 2015). The LIBSVM implementation in R was used for SVM analyses (Chang & Lin, 2011; Meyer, 2014), and the performance of both linear and radial basis function (RBF, also referred to as Gaussian) kernels was assessed. For the linear kernel, the cost parameter (C), which influences the narrowness of the support vector margin, was optimized using 10-fold cross-validation by scanning the space from to in increments of , as well as from 0.1 to 1000 in increments of 1 (Nieuwenhuis et al., 2013; Meyer, 2015). The RBF kernel required tuning of the γ parameter as well, which controls the mapping of the data to a higher dimensional space. As suggested by Hsu et al. (2010), the C and γ parameters for the RBF kernel were selected using a grid search with exponentially growing sequences. Specifically, values of 2^x, where x ranges from -14 to 10 in increments of 0.5, were tested for each parameter. The optimal model that was retained for each dataset was the one that showed the highest mean overall classification accuracy across the 10 rounds of cross-validation.

9.3.3 Linear Discriminant Analysis

Linear discriminant analysis (LDA) performs classification based on continuous variables, and aims to map input features to a lower dimensional space using a linear transformation (Gu, n.d.). In this lower dimensional space, between-class variance is maximized and within-class variance is minimized. As with the prior analyses, the PCs explaining >1% of the data's variance were used as input into the LDA algorithm. The lda function, part of the MASS package in R, was used for all analyses (Venables and Ripley, 2002). A 10-fold cross-validation scheme was used to find the best linear discriminants to separate the two classes, and the best model was retained.

9.3.4 COMPARE

In addition to the SVM models calculated using LIBSVM in R, the performance of the SVM classifier COMPARE (Classification of Morphological Patterns using Recursive Feature Elimination) was assessed. This algorithm has been described in detail previously (Fan et al., 2007), and is used by a number of groups performing SZ/HC classification in the literature (Davatzikos et al., 2005; Fan et al., 2007; Zanetti et al., 2013). The required input to this algorithm is GM tissue densities, and the output is an optimal model and model performance. The exact command used for the program was:

Compare list_of_files output_file -m model -S spatial_map

where -m stores the final model in model, and -S stores the variable loadings in spatial_map. The model was then applied to the validation subset of subjects using:

Compare_test model list_of_files output_file

Both the modulated VBM and RAVENS tissue densities (downsampled by a factor of two in each dimension) were used as input into this algorithm, and their results were compared.
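The exponential grid search described for the RBF kernel in Section 9.3.2 can be sketched as follows. This Python example is a schematic only: the analyses used LIBSVM in R, `grid_search` and `toy_score` are hypothetical names, and `toy_score` is a stand-in for what would in practice be the mean 10-fold cross-validated accuracy of an SVM fitted with the given cost and γ.

```python
import itertools
import math

# Exponents from -14 to 10 in increments of 0.5 give 49 candidate values per parameter.
exponents = [-14 + 0.5 * i for i in range(49)]
grid_values = [2.0 ** x for x in exponents]

def grid_search(score_fn):
    """Return (best_score, cost, gamma) over every cost/gamma pair on the grid."""
    best = None
    for cost, gamma in itertools.product(grid_values, grid_values):
        score = score_fn(cost, gamma)
        if best is None or score > best[0]:
            best = (score, cost, gamma)
    return best

def toy_score(cost, gamma):
    """Placeholder objective that peaks at cost = 2 and gamma = 2**-3 (assumption)."""
    return -abs(math.log2(cost) - 1.0) - abs(math.log2(gamma) + 3.0)

best_score, best_cost, best_gamma = grid_search(toy_score)
```

With 49 values for each of the two parameters, the search evaluates 2,401 combinations, which is why each evaluation is usually kept cheap by working on the reduced PC space.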

9.3.5 Validation

To accurately evaluate the performance of all classification algorithms, each dataset was divided into training and validation subsets (ratio 2:1; SZ:HC ratio maintained). The validation data was set aside and not used until all parameter tuning was completed. The training data was further divided into ten subsets to perform 10-fold cross-validation. Nine-tenths of the training data was used to train the algorithms, and training set accuracy was assessed on the remaining one-tenth. This process was repeated ten times, swapping out the subset used for evaluation. Unlike the more commonly-used leave-one-out cross-validation scheme, which is prone to over-fitting since almost all of the data is used in every training sample, the n-fold cross-validation structure allows for tuning of algorithm parameters without excessive over-fitting, and as such represents a more reliable, although more stringent, estimate of true algorithm performance (Hsu et al., 2010; Nieuwenhuis et al., 2012). The full validation scheme is shown in Figure 2. Within the training data, only overall accuracies are reported.

All raw validation data were first projected onto the PC space of their corresponding training datasets. For all metrics (cortical thickness, modulated VBM, and RAVENS maps), saved models from the LR, SVM, and LDA analyses were then applied to these new data, and the sensitivity (proportion of patients correctly identified), specificity (proportion of controls correctly identified), and overall accuracy of the algorithms were recorded. To determine if accuracies were significantly different from chance, the binomial probability of each result was calculated.
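The significance calculation can be illustrated with a short Python sketch. It assumes that "binomial probability of each result" means the probability of observing exactly that number of correct classifications when each subject is classified at chance (p = 0.5); an upper-tail probability is included for comparison because it is the more conventional one-sided test. The function names and the example counts (37 correct out of 59) are hypothetical and are not taken from the analysis.

```python
import math

def binomial_probability(n_correct, n_total, chance=0.5):
    """Probability of exactly n_correct successes in n_total chance-level trials."""
    return (math.comb(n_total, n_correct)
            * chance ** n_correct
            * (1.0 - chance) ** (n_total - n_correct))

def binomial_upper_tail(n_correct, n_total, chance=0.5):
    """Probability of n_correct or more successes (one-sided test against chance)."""
    return sum(binomial_probability(k, n_total, chance)
               for k in range(n_correct, n_total + 1))

# Hypothetical validation result: 37 of 59 subjects classified correctly.
point_probability = binomial_probability(37, 59)
tail_probability = binomial_upper_tail(37, 59)
```

The tail probability is always at least as large as the point probability, so a result judged significant by the point probability may not be significant under the one-sided test.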

Figure 2: Validation scheme for training algorithms. Data is split into training and validation subsets (ratio 2:1), and the validation data is set aside. The training subset is further divided to tune algorithm parameters using 10-fold cross-validation. The accuracy of the algorithm is assessed on the validation set only once tuning is complete.
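The validation scheme in Figure 2 can be sketched in Python as follows. This is an illustration rather than the R code used for the analyses: the function names, the fixed random seed, and the use of the pre-quality-control CAMH group sizes are assumptions, and the ten folds here are drawn at random without stratification.

```python
import random

def stratified_split(labels, validation_fraction=1 / 3, seed=0):
    """Split indices 2:1 into training and validation, preserving class ratios."""
    rng = random.Random(seed)
    train, validation = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_val = round(len(idx) * validation_fraction)
        validation.extend(idx[:n_val])
        train.extend(idx[n_val:])
    return sorted(train), sorted(validation)

def k_folds(indices, k=10, seed=0):
    """Return k (training, held-out) index pairs that partition the given indices."""
    rng = random.Random(seed)
    shuffled = list(indices)
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    return [([j for f in folds[:i] + folds[i + 1:] for j in f], folds[i])
            for i in range(k)]

# Group sizes of the CAMH dataset before quality control, as given in the text.
labels = ["SZ"] * 88 + ["HC"] * 103
train_idx, val_idx = stratified_split(labels)
cv_splits = k_folds(train_idx)   # parameter tuning uses only the training indices
```

The validation indices are never passed to `k_folds`, which mirrors the requirement that the validation data stay untouched until tuning is complete.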


10 Results

10.1 Modulated Voxel-Based Morphometry

10.1.1 Quality Control

In the modulated VBM analysis, 12 subjects failed to pass the quality control stage in the CAMH dataset (3 HC and 9 SZ); 10 subjects in the NUSDAST dataset (3 HC and 7 SZ); and 1 subject in the INNN dataset (1 HC; the age-matched FEP subject was removed as well). As a result, 179 subjects were included in the CAMH analyses; 141 in the NUSDAST analyses; and 98 in the INNN analyses. Figure 3 illustrates the quality control process.

Figure 3: Quality control stage. A: Correctly classified subject; B: Subject excluded from analyses because of failed registration

10.1.2 Classifier Performance

After downsampling the data by a factor of two in each dimension, the size of all images was 94 x 116 x 98 voxels (2 mm x 2 mm x 2 mm voxel dimensions), resulting in a matrix for each dataset with dimensions of n (number of subjects) x 1,068,592 variables. These matrices were first reduced using PCA, and PCs that each explained >1% of the variance in the data were retained (CAMH: 29 PCs; INNN: 47 PCs; NUSDAST: 35 PCs). All modulated VBM results for the 10-fold cross-validation in the training data are summarized in Table 2. Results for the validation dataset are in Table 3. Figure 4 shows the variance profile of the PCs for each of the three datasets, and the top row of Figure 5 illustrates the PC loadings of the first PC for each VBM dataset.

Figure 6 illustrates the parameter optimization procedure for the LR analysis. Panels A and B illustrate the selection of the λ and α parameters. In Panel A, 100 values of λ are tested using a 10-fold cross-validation structure. The optimal λ is that which shows the lowest mean cross-validated error between the model and the data. The selection of λ affects the number of variables retained in the model. The program loops over 21 values of α, shown in Panel B, and at each value performs the λ optimization. The optimal model is that which shows the lowest model deviance at given values of λ and α. Panel C shows the effect of the λ parameter on the fractional deviance explained by the model (analogous to R² in continuous output models), the number of variables retained, and the coefficient values of the variables. The far right of the graph, with all variables retained and high variable coefficients, represents an over-trained model. In this case, the parameters chosen were α = 0.45 and λ =

Within the 10-fold cross-validation, the non-linear SVMs showed the highest performance, with a mean accuracy of 63.2% across the three datasets (compared with 59.8% for LR, 55.1% for linear SVM, and 57.7% for LDA). The non-linear SVMs performed the best within each dataset as well. Performance was approximately the same for all datasets when averaged across the four algorithms (59.0% mean accuracy for CAMH, 58.1% for NUSDAST, and 59.9% for INNN). The single best performing algorithm was the non-linear SVM in the NUSDAST dataset, with 64.2% accuracy. The mean accuracy across all datasets and algorithms was 59.9%. Significance tests indicate that the algorithms perform better than chance (p<0.05), with the exception of the linear SVM in the NUSDAST and INNN datasets (p=0.059 and p=0.061, respectively).

In the validation data, no single method significantly outperformed the others; however, as opposed to the training datasets, LR performed slightly better than the other algorithms, with an average accuracy of 63.5% across the three datasets, compared to 60.5% for linear SVM, 58.2% for non-linear SVM, and 57.6% for LDA. LR performed as well as the non-linear SVM within the CAMH dataset, linear SVM was the best algorithm in the NUSDAST dataset, and LR and linear SVM performed best in the INNN dataset. Across the four algorithms, each dataset performed equally well (mean accuracy 59.7% for CAMH, 59.2% for NUSDAST, and 58.4% for INNN). The overall best performing algorithm was LR in the NUSDAST dataset, with 65.2%. There is no fixed trend for sensitivity and specificity performance; however, a number of the models selectively show very poor (<50%) sensitivity (i.e., ability to detect patients) or specificity (i.e., ability to detect controls). Tests for significance indicate that many of the algorithms do not perform better than chance (p>0.05), with the exception of LR (p=0.016) and the non-linear SVM (p=0.016) in the CAMH dataset and the LR (p=0.014), non-linear SVM (p=0.025), and LDA (p=0.025) in the NUSDAST dataset. None of the algorithms in the INNN dataset performed better than chance.

The datasets used as input into the COMPARE algorithm were identical to those used for the modulated VBM analyses. The training and validation subsets were likewise identical. On the training sets, COMPARE performed with an overall accuracy of 63.3%, 71.3%, and 71.2% on the CAMH, NUSDAST, and INNN datasets, respectively. On the validation set, the performance was 55.9% (65.4% sensitivity, 48.5% specificity), 61% (81.5% sensitivity, 31.6% specificity), and 67.7% (56.3% sensitivity, 80.0% specificity) for the CAMH, NUSDAST, and INNN datasets, respectively.

Table 2: All modulated VBM results in the training set using 10-fold cross-validation. Overall classification accuracy (%) is reported. The p-value of a binomial test on the % accuracy is reported in brackets.

              CAMH            NUSDAST         INNN
LR            60.0 (0.0066)   57.4 (0.029)    62.1 (0.014)
SVM (linear)  55.0 (0.040)    54.3 (0.059)    56.1 (0.061)
SVM (RBF)     61.7 (0.0028)   64.2 (0.0022)   63.8 (0.0085)
LDA           59.2 (0.0097)   56.4 (0.038)    57.6 (0.046)

Table 3: All modulated VBM results in the validation set. The % sensitivity (# true positives), specificity (# true negatives), and overall classification accuracy are reported as sensitivity/specificity/accuracy. The p-value of a binomial test on the % accuracy is reported in brackets.

              CAMH                     NUSDAST                  INNN
LR            23.1/93.9/62.7 (0.016)   100/15.8/65.2 (0.014)    68.8/56.3/62.5 (0.053)
SVM (linear)  38.5/69.7/55.9 (0.069)   63.0/63.2/63.0 (0.025)   62.5/62.5/62.5 (0.053)
SVM (RBF)     38.5/81.8/62.7 (0.016)   92.6/5.26/56.5 (0.080)   56.3/62.5/59.4 (0.081)
LDA           34.6/75.8/57.6 (0.053)   77.8/42.1/63.0 (0.025)   56.3/56.3/56.3 (0.11)
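For reference, the three quantities reported in Table 3 can be computed from true and predicted labels as in the Python sketch below. The function name and the example labels are hypothetical; the example assumes a validation subset of 26 patients and 33 controls, chosen only because such counts yield percentages of the kind shown in the table, and they are not taken from the analysis records.

```python
def classification_metrics(true_labels, predicted_labels, positive="SZ"):
    """Return (sensitivity, specificity, accuracy) as percentages."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    # Guard against empty classes so the function never divides by zero.
    sensitivity = 100.0 * tp / (tp + fn) if tp + fn else float("nan")
    specificity = 100.0 * tn / (tn + fp) if tn + fp else float("nan")
    accuracy = 100.0 * (tp + tn) / len(pairs) if pairs else float("nan")
    return sensitivity, specificity, accuracy

# Hypothetical predictions: 6 of 26 patients and 31 of 33 controls identified.
truth = ["SZ"] * 26 + ["HC"] * 33
predicted = ["SZ"] * 6 + ["HC"] * 20 + ["HC"] * 31 + ["SZ"] * 2
sens, spec, acc = classification_metrics(truth, predicted)
```

The example shows how a classifier can reach an overall accuracy above 60% while detecting fewer than a quarter of patients, which is why sensitivity and specificity are reported alongside accuracy.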

Figure 4: Percent of variance explained by all PCs for the modulated voxel-based morphometry for each of the three datasets. Only PCs that individually explained >1% of the total variance in the dataset were retained as input to the classifiers.
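The retention rule illustrated in Figure 4 amounts to a simple threshold on each component's share of the total variance. The Python sketch below shows the rule on a hypothetical variance profile; the function name and the example values are assumptions, and in the analyses themselves the variances came from the prcomp function in R.

```python
def retained_components(variances, threshold=0.01):
    """Return indices of components whose share of total variance exceeds threshold."""
    total = sum(variances)
    return [i for i, v in enumerate(variances) if v / total > threshold]

# Hypothetical per-component variances (arbitrary units) that sum to 100.
example_variances = [50.0, 30.0, 15.0, 4.0, 0.6, 0.4]
kept = retained_components(example_variances)   # the last two fall below 1%
```

Because the threshold is applied to each component individually, the number of retained PCs depends on how evenly the variance is spread, which is one reason the count differs between datasets.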

Figure 5: Sagittal views of voxel-wise contributions to the first principal component of the modulated VBM and RAVENS data for each dataset.

Figure 6: Parameter optimization and selection in elastic net regularized LR model for CAMH modulated VBM dataset. Panel A illustrates selection of the λ parameter based on 10-fold cross-validation. Each coloured line represents a separate variable. The optimal λ is that which shows the lowest model deviance. The selection of λ affects the number of variables retained in the model. In this case, the optimal log(λ) = -2.2, so optimal λ = Panel B shows selection of the α parameter. The λ parameter optimization shown in Panel A is performed at 21 values of α ranging from 0-1, and the optimal model is that which shows the lowest model deviance for a given combination of λ and α, in this case optimal α = Panel C illustrates the effect of λ selection on fractional deviance explained by the model, number of variables retained, and their coefficients.
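For reference, the roles of λ and α described in the Figure 6 caption can be written out explicitly. The form below assumes the widely used glmnet-style parameterization of the elastic net for logistic regression; this section does not state which parameterization the thesis used, so the symbols $N$, $\ell$, $\beta_0$, and $\beta$ are introduced here for illustration only.

```latex
\min_{\beta_0,\,\beta}\;
  -\frac{1}{N}\sum_{i=1}^{N} \ell\!\left(y_i,\ \beta_0 + x_i^{\top}\beta\right)
  \;+\; \lambda\left[\frac{1-\alpha}{2}\,\lVert\beta\rVert_2^2
  \;+\; \alpha\,\lVert\beta\rVert_1\right]
```

Here $\ell$ is the log-likelihood contribution of subject $i$. Under this convention $\alpha = 1$ gives the lasso penalty, $\alpha = 0$ gives the ridge penalty, and a larger $\lambda$ shrinks more coefficients to zero whenever $\alpha > 0$, which is why the choice of λ changes the number of variables retained.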

10.2 RAVENS Maps

Classifier Performance

As with the modulated VBM analysis, RAVENS maps had a final image size of 94 x 116 x 98 voxels (2 mm x 2 mm x 2 mm voxel dimensions; 1,068,592 variables total). 8 PCs were retained in the CAMH dataset, 28 in the NUSDAST dataset, and 48 in the INNN dataset. RAVENS maps results for the training data are summarized in Table 4, and results for the validation data are in Table 5. The bottom row of Figure 5 illustrates the PC loadings of the first PC for each RAVENS dataset.

In the training data, LR performed the best across all datasets (65.6% versus 60.1% for linear SVM, 62.0% for non-linear SVM, and 57.6% for LDA). None of the datasets significantly outperformed the others, and the highest overall accuracy was achieved by the NUSDAST dataset using LR. The mean accuracy across all datasets and algorithms was 61.3%. The linear SVM in the CAMH dataset and the LDA in the INNN dataset do not perform better than chance (p>0.05).

Within the validation dataset, the RAVENS accuracies were slightly higher than the modulated VBM results (61.4% versus 59.9%). LDA outperformed all other methods with 65.7% accuracy averaged across the three datasets (compared to 61.7% for LR, 58.5% for linear SVM, and 59.8% for non-linear SVM). None of the datasets significantly outperformed the others. The best performing algorithm was LDA on the CAMH dataset with an accuracy of 69.5%. All algorithms in the CAMH and NUSDAST datasets perform better than chance (p<0.05) except for the linear SVM in the CAMH dataset (p=0.096) and the non-linear SVM in the NUSDAST dataset

(p=0.059). Within the INNN dataset, only the linear SVM performs better than chance (p=0.030).

The subjects selected for inclusion in the RAVENS COMPARE analysis were the same as those used in the modulated VBM analysis (see Section ), as were the subjects in the training and validation subsets. COMPARE performed with an overall accuracy of 56%, 51%, and 57% for the CAMH, NUSDAST, and INNN datasets, respectively.

Table 4: All RAVENS maps results in the training set using 10-fold cross-validation. Overall classification accuracy (%) is reported. The p-value of a binomial test on the % accuracy is reported in brackets.

Algorithm | CAMH | NUSDAST | INNN
LR | 60.8 (0.0044) | 66.0 ( ) | 70.0 ( )
SVM (linear) | 51.7 (0.068) | 55.3 (0.048) | 66.7 (0.0025)
SVM (RBF) | 63.3 (0.0010) | 60.9 (0.0098) | 70.0 ( )
LDA | 59.2 (0.0097) | 57.5 (0.029) | 56.1 (0.061)

Table 5: All RAVENS maps results in the validation set. The % sensitivity (# true positives), specificity (# true negatives), and overall classification accuracy are reported, in that order, separated by slashes. The p-value of a binomial test on the % accuracy is reported in brackets.

Algorithm | CAMH | NUSDAST | INNN
LR | 30.1/97.0/67.8 (0.0024) | 92.6/15.8/60.9 (0.040) | 68.8/43.8/56.3 (0.11)
SVM (linear) | 11.5/84.9/52.5 (0.096) | 77.8/57.9/69.6 (0.0034) | 68.8/62.5/65.6 (0.030)
SVM (RBF) | 46.2/81.8/66.1 (0.0048) | 100/0/58.7 (0.059) | 87.5/18.8/53.1 (0.13)
LDA | 38.5/94.0/69.5 (0.0011) | 81.5/42.1/65.2 (0.014) | 68.8/56.3/62.5 (0.053)

10.3 Cortical Thickness

Quality Control

In the cortical thickness analysis, all subjects from all three datasets passed a quality control inspection of the GM/WM surface extraction. As such, cortical thickness data were included for 191 subjects (103 HC, 88 SZ) in the CAMH dataset (age 18-59; HC mean 35.2; SZ mean 36.5); 100 subjects (50 HC, 50 FEP) in the INNN dataset (age 18-47; HC mean 23.8; FEP mean 25.8); and 144 subjects (60 HC, 84 SZ) in the NUSDAST dataset (age 18-59; HC mean 26.9; SZ mean 33.0).

Classifier Performance

The CIVET pipeline estimates the thickness of the cortex at 81,924 vertices (40,962 vertices per hemisphere), resulting in a matrix of n (number of subjects/independent dataset analyzed) x 81,924 for each dataset. All cortical thickness results for the training data are summarized in Table 6, while the validation data is summarized in Table PCs in CAMH dataset explained >1% of the variance in the data, 32 PCs in the INNN dataset, and 18 PCs in the NUSDAST dataset.

As with the modulated VBM results, non-linear SVMs outperformed all other methods in the training data (67.0% mean accuracy versus 59.1% for LR, 64.2% for linear SVM, and 59.6% for LDA). The NUSDAST dataset significantly outperformed the CAMH and INNN datasets, with a mean accuracy of 68.8% across the four algorithms (compared with 58.6% for CAMH and 60.0% for INNN), and also had the best performing algorithm, with 71.9% accuracy using a linear SVM. Tests for significance indicate that most algorithms perform better than chance

(p<0.05), with the exception of LR in the CAMH and INNN datasets (p=0.069 and p=0.46, respectively). In both the training and validation datasets, the cortical thickness results were better than both of the tissue density datasets (64.4% overall training mean accuracy and 62.5% overall validation mean accuracy).

As with the validation VBM results, LR outperformed all other methods in the validation set, with 66.9% mean accuracy across the three datasets (compared with 62.9% for linear SVM, 63.7% for non-linear SVM, and 63.9% for LDA). LDA performed the best within the CAMH dataset, non-linear SVM within the NUSDAST dataset, and LR within the INNN dataset. The NUSDAST dataset performed better overall across the four algorithms, with 68.1% mean accuracy (compared with 63.5% in the CAMH dataset and 61.5% in the INNN dataset). The best performance was using LR in the INNN dataset, with an accuracy of 69.7%. Similar to the modulated VBM data, a number of the models show very poor (<50%) sensitivity. All algorithms perform better than chance (p<0.05) with the exception of the linear SVM and LDA in the INNN dataset (p=0.081 for both).

Table 6: All cortical thickness results in the training set using 10-fold cross-validation. Overall classification accuracy (%) is reported. The p-value of a binomial test on the % accuracy is reported in brackets.

Algorithm | CAMH | NUSDAST | INNN
LR | 50.8 (0.069) | 68.8 ( ) | 57.6 (0.46)
SVM (linear) | 60.2 (0.0050) | 71.9 ( ) | 60.6 (0.022)
SVM (RBF) | 68.0 ( ) | 68.8 ( ) | 64.1 (0.0085)
LDA | 55.5 (0.033) | 65.6 ( ) | 57.6 (0.046)

Table 7: All cortical thickness results in the validation set. The % sensitivity (# true positives), specificity (# true negatives), and overall classification accuracy are reported, in that order, separated by slashes. The p-value of a binomial test on the % accuracy is reported in brackets.

Algorithm | CAMH | NUSDAST | INNN
LR | 24.1/91.1/60.3 (0.026) | 78.6/60.0/70.8 (0.0017) | 59.0/76.5/67.6 (0.015)
SVM (linear) | 65.5/58.8/61.9 (0.017) | 75.0/60.0/68.1 (0.0039) | 52.9/64.7/58.8 (0.081)
SVM (RBF) | 44.8/79.4/63.5 (0.010) | 78.6/60.0/70.8 (0.0017) | 58.8/88.2/73.5 (0.0031)
LDA | 65.5/70.6/68.3 (0.0015) | 67.9/60.0/64.6 (0.015) | 41.2/76.5/58.8 (0.081)
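The three figures in each cell of Table 7 are related through the underlying confusion-matrix counts. The sketch below shows that relationship. The counts are hypothetical: they were chosen only because they are arithmetically consistent with the linear SVM entry for CAMH, and the thesis does not list the counts in this section.

```python
def classification_metrics(tp: int, fn: int, tn: int, fp: int):
    """Return (sensitivity, specificity, accuracy) as fractions."""
    sensitivity = tp / (tp + fn)  # patients correctly labelled as patients
    specificity = tn / (tn + fp)  # controls correctly labelled as controls
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy


# Hypothetical counts: 29 patients (19 detected) and 34 controls (20 detected).
sens, spec, acc = classification_metrics(tp=19, fn=10, tn=20, fp=14)
summary = f"{100 * sens:.1f}/{100 * spec:.1f}/{100 * acc:.1f}"
print(summary)  # prints 65.5/58.8/61.9
```

Overall accuracy is a count-weighted average of sensitivity and specificity, so a classifier that labels nearly every subject as a control can still score above 50% when controls outnumber patients. This is worth keeping in mind when reading the cells with very low sensitivity.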

11 Discussion

The effectiveness of three machine learning algorithms (logistic regression, support vector machines, and linear discriminant analysis) commonly used in the literature for classifying patients with schizophrenia and related illnesses from healthy controls was compared on three independent datasets using both cortical thickness and tissue densities. This is the only study to date that uses multiple algorithms (including two algorithms for non-linear support vector machines), numerous neuroanatomical metrics (including two different methods for estimating tissue densities), and multiple independently-collected datasets that represent different disease severities and magnetic field strengths to compare methods.

The performance of all algorithms on all datasets was poor; however, non-linear support vector machines marginally outperformed the other methods using 10-fold cross-validation with modulated VBM and cortical thickness, maximizing at 71.9% accuracy using cortical thickness in the NUSDAST dataset. As well, logistic regression was slightly superior to the other methods using a left-out validation dataset with both modulated VBM and cortical thickness, with a maximum accuracy of 70.8% for cortical thickness, also in the NUSDAST dataset. The best algorithms using RAVENS maps were logistic regression in the training set and linear discriminant analysis in the validation set. These results illustrate that, contrary to many of the results reported in the literature, reliable and versatile schizophrenia/control classifiers are difficult to construct, and existing classifiers in the literature may be over-estimating performance.

Tests for significance in the accuracy results indicated that in many cases the algorithms perform better than chance; however, performance is poor overall, especially in the VBM datasets. Performing significantly above chance indicates that the algorithms are detecting patterns in the

data and are functioning as classifiers, albeit not strong ones. The cortical thickness results have the fewest algorithms that do not reach significance, reinforcing the conclusion that cortical thickness is a better metric for classification than tissue density. Fewer algorithms in the validation set performed better than chance in comparison to the training set, but this is due to the lower statistical power associated with a smaller sample size (validation set size is half that of training set size). Algorithms in the INNN dataset tended to perform below significance, but this is again likely largely driven by lack of power from the relatively small sample size. There does not appear to be a consistent effect of algorithm on above-chance performance.

Previous studies that perform schizophrenia/control classification using structural MR data are summarized in Table 8. To the best of our ability we have attempted to replicate the methodologies used in these studies through the use of state-of-the-art implementations of the image processing and classification pipelines they describe, or through the direct use of pipelines made available by other groups.

To our knowledge, only one study uses PCs of cortical thickness, and reports 88.8% % accuracy using a linear support vector machine (Yoon et al., 2007). However, this is only after performing separate PCA analyses for each lobe of the brain, and rearranging their PCs in increasing order based on the p-value associated with a two-sample t-test, both of which represent a direct attempt at over-fitting the data. If the PCs are ordered more conventionally, according to the variance they explain, accuracies drop to 63%-77%. In our datasets, we show validation set accuracies of 58.8% % using linear SVMs on PC-reduced cortical thickness data, lower than what is reported by Yoon and colleagues, but within their range when they use conventional ordering for their PCs.
It is likely that their accuracies are boosted for their dataset by performing their PCA and classification on a lobe-wise basis, but since they do not report the whole-brain PCA results, or

give a justification for why they performed their PCA on a lobe-wise basis, it is not possible to tell if this boosted their performance, or would boost performance in all datasets. Additionally, although they report that they have 53 patients and 52 controls, they reveal they leave out 30 subjects from their training data to form a validation group. This means that they are training on a dataset composed of only 37 patients and 36 controls, thereby constituting a small training sample size. As such, the results presented in this manuscript, which are based on larger training datasets (especially the CAMH and NUSDAST cohorts), should be considered more reliable.

Classification with logistic regression consistently out-performed all other methods across two of the three metrics (cortical thickness and modulated VBM) in the validation datasets. In the modulated VBM dataset, we were able to achieve accuracies of 62.7% for the CAMH dataset, 65.2% for the NUSDAST dataset, and 62.5% for the INNN dataset using LR. Sun and colleagues report 86.1% classification accuracy using sparse multinomial logistic regression and a recent-onset schizophrenia dataset (most similar to the INNN dataset in this manuscript) (Sun et al., 2009). However, their population was relatively small (36 patients and 36 controls), and their validation scheme was less rigorous (leave-one-out cross-validation) than the one used in this study, which likely means their classifier was over-fit, and as such their accuracy was overestimated.

The motivation behind constructing machine learning algorithms is to create a machine that has identified patterns in data, and can be applied to novel datasets to identify class assignments. Validation of machine learning algorithms is important to gain an understanding of how the algorithms will perform on such unseen data.
A high training accuracy does not necessarily mean the algorithm is a good one; often, it is over-fit to the training data and has low predictive performance. Leave-one-out cross-validation (LOOCV) is used frequently in the literature to

validate algorithms (Davatzikos et al., 2005; Fan et al., 2007; Karageorgiou et al., 2011; Nieuwenhuis et al., 2012; Sun et al., 2009; Zanetti et al., 2013). In this method, a single subject is left out of the dataset, and the algorithm is trained on the remaining subjects. The left-out subject is then classified using the trained algorithm. This is repeated n times such that each subject is left out once, and the accuracy is the fraction of subjects correctly classified. This method is often used on smaller datasets in which there are not enough subjects available to allocate to a separate validation set. However, it has been shown that LOOCV does not lead to a consistent estimate of the model (Shao, 1993). It is limited, as almost all of the data is used to train the algorithm every time, meaning that the classifier is often over-fit to the data, and as such classifier performance is over-estimated. Additionally, since only one subject is classified in each round, there is no estimate for how the trained classifier performs on a group of heterogeneous subjects.

A more rigorous method involves leaving a set of subjects out of the training step. For example, in n-fold cross-validation, the data is split into n subsets, and 1/n of the data is left out at each iteration. Another method is to leave a subset of the data, typically one third, out of training entirely, and only use it once algorithm tuning is complete to give an estimate of how the algorithm will perform on unseen data. Logically, the most rigorous validation method is therefore a combination of these two, in which the algorithm is tuned using n-fold cross-validation (in our case, n=10), and performance on unseen data is estimated on a separate validation subset of the data. This ensures that parameters are selected in a rigorous way, and algorithm performance is then assessed on completely unseen data, simulating performance on a novel dataset.
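The combined scheme described above (a held-out validation subset plus 10-fold cross-validation on the remainder) can be sketched in a few lines. This is an illustration under simplifying assumptions rather than the procedure used in the thesis: the function names are hypothetical, the split is a plain random shuffle, and no stratification by diagnosis is attempted.

```python
import random


def split_holdout_and_folds(subject_ids, holdout_fraction=1 / 3, n_folds=10, seed=0):
    """Set aside a validation subset, then partition the rest into CV folds."""
    rng = random.Random(seed)
    ids = list(subject_ids)
    rng.shuffle(ids)
    n_holdout = round(len(ids) * holdout_fraction)
    holdout, training = ids[:n_holdout], ids[n_holdout:]
    # Deal the training subjects round-robin into n_folds near-equal folds.
    folds = [training[i::n_folds] for i in range(n_folds)]
    return training, holdout, folds


def cross_validation_rounds(folds):
    """Yield (train_ids, test_ids); each fold is left out exactly once."""
    for i, test_fold in enumerate(folds):
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield train, test_fold


# Hypothetical cohort of 180 subject IDs.
training, holdout, folds = split_holdout_and_folds(range(180))
rounds = list(cross_validation_rounds(folds))
print(len(training), len(holdout), len(rounds))
```

The held-out subjects never appear in any cross-validation round, so parameters tuned on the folds are assessed on data that played no part in tuning. Setting `n_folds` equal to the number of training subjects recovers leave-one-out cross-validation as a special case.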
Many of the existing studies in the literature use support vector machines for classification. In our results, we showed low validation set accuracies of % using a linear SVM and

modulated VBM, and % using a non-linear SVM. Nieuwenhuis and colleagues use a linear SVM to perform classification in their large sample of chronic patients (including schizophrenia, schizophreniform, and schizoaffective patients), a dataset very similar to the CAMH sample presented here. Using leave-one-out cross-validation, they report an accuracy of 71.4%, much higher than our 55.9%. However, if the results from our COMPARE analysis in only the training dataset are considered, which uses leave-one-out cross-validation, the accuracy in our sample is boosted to 63.3%. Nieuwenhuis and colleagues also evaluate their classifier performance on an entirely independent dataset of 155 patients and 122 controls, and report an impressive accuracy of 70.4%. Their higher result is likely due to their very large sample sizes, and strongly suggests that larger sample sizes are the solution to the highly variable accuracies reported in the literature. In fact, they test their model using different training sample sizes, and show that even with 140 subjects, the accuracies on the validation set can fluctuate between 52% and 74% (i.e., the accuracy achieved on the single test of the validation set could be an anomaly). In spite of this, based on their methodology, sample sizes, and validation structure, their accuracy is likely the most reliable reported in the literature.

Using RAVENS maps and non-linear SVMs, we were able to achieve accuracies of % in our validation set. Davatzikos and colleagues, one of the first groups to perform schizophrenia/control classification, use RAVENS maps, and report an overall accuracy of 81.1% using a non-linear SVM. However, they use leave-one-out cross-validation, and report an incredible 100% accuracy on their training data, which suggests that their model is likely over-fit to their data.

Using the COMPARE program, we were only able to achieve accuracies of % using modulated VBM, and 51-57% using RAVENS maps. Fan and colleagues use the COMPARE

program with RAVENS maps to classify two chronic patient populations, one all female (38 controls and 23 patients) and one all male (41 controls and 46 patients). Within these two datasets, they report impressive overall accuracies of 91.8% and 90.8%, respectively. However, as with many other studies, their sample size is very small, and they perform only leave-one-out cross-validation. Also, the uniform sex of their populations makes them unrealistically homogeneous, and they do not report results for the combined datasets. Zanetti and colleagues also use RAVENS maps and COMPARE on their population of first-episode schizophrenia/schizophreniform disorder patients, a sample most similar to the INNN dataset analyzed in our study, except their patients are both medicated and unmedicated. They report a more modest overall accuracy than the original Fan study, just 73.4%. They argue that their dataset represents a more real-world sample than that used in the Fan study. Using RAVENS maps and COMPARE in our first-episode population, we achieved only 57% accuracy; however, it should be noted that this result was reached using the option in COMPARE to leave out a portion of the data as an independent validation dataset. Using the leave-one-out cross-validation method (i.e., the result from only the training set), COMPARE produced a classification accuracy of 71.2%, much closer to what was reported by the Zanetti study. This is further evidence of the tendency for leave-one-out cross-validation to overestimate the generalizability of the classifier.

Karageorgiou and colleagues use a small sample of 28 patients with recent-onset schizophrenia (most similar to the INNN dataset in this manuscript) and 47 controls to perform an LDA classification, with PCA feature reduction (Karageorgiou et al., 2011). Using only structural MR data, they report 64.3% sensitivity and 72.3% specificity (overall accuracy is not reported).
The PCA-LDA classifier constructed on the INNN data, a first-episode dataset most similar to the

one used in the Karageorgiou study, performs with a lower sensitivity (56.3%), and a lower specificity (56.3%).

Table 8: Summary of studies using structural MR imaging to classify healthy controls and schizophrenia patients.

Study | Population Characteristics | Feature Set | Classification Method | Validation | Accuracy
Davatzikos et al., | HC, 69 SZ | GM/WM/CSF densities | SVM, RBF | LOOCV | 81.10%
Fan et al., 2007 | Female Sample: 38 HC, 23 SZ; Male Sample: 41 HC, 46 SZ | RAVENS maps | SVM, RBF (COMPARE) | LOOCV | %
Karageorgiou et al., | ROS, 47 HC | GM densities | LDA | LOOCV | 64.3% sensitivity, 76.6% specificity
Nieuwenhuis et al., 2012 | Training Sample: 128 chronic SZ, 111 HC; Validation Sample: 155 chronic SZ, 122 HC | GM densities | SVM, linear | LOOCV in training; independent validation set | Training set: 71.4%; Validation set: 70.4%
Sun et al., | ROS, 36 HC | GM densities | LR | LOOCV | 86.10%
Zanetti et al., | FE, 62 HC | RAVENS maps | SVM, RBF (COMPARE) | LOOCV | 73.40%
Yoon et al., | chronic SZ, 52 HC | Cortical thickness | SVM | LOOCV & validation set | 89-94%

It is tempting to adjust and restructure datasets and algorithms to try to boost accuracy after seeing the final results, or test a variety of pipelines and then only report the one with the highest performance (Kambeitz et al., 2015). In this study, we endeavored to minimize data preprocessing, and report only those results that we achieved during the first pass over our data. We believe this better represents the true discriminative ability of the features being studied (cortical thicknesses and tissue densities), and also better reflects the intrinsic performance of the classification methods. Our results, when taken in context with existing results in the literature, demonstrate the sensitivity of machine learning algorithms when applied to neuroimaging data. A small change in the composition or structure of the dataset can have a huge effect on the final accuracy. The next steps of the neuroimaging community should be to work towards developing techniques that are more stable and can be applied to novel datasets with more consistent results.

It is important to consider the effect of demographic variables and medication status on brain structure and classifier performance, as there are significant differences in these metrics among samples. Notably, the CAMH dataset represents the oldest population (mean age 36.2 years), while the NUSDAST dataset is slightly younger (mean age 29.0 years), and the INNN dataset is the youngest (mean age 24.8 years). In terms of medication status, the INNN dataset represents a completely antipsychotic-naïve population, while the CAMH sample is more heterogeneous, with both medicated and unmedicated patients, and the medication status of the NUSDAST dataset is not known.
Looking at the overall accuracies for each dataset (averaged across all feature types and algorithms), performance is 62.3% for NUSDAST, 61.8% for INNN, and 58.8% for CAMH in the training set, and 64.5% for NUSDAST, 62.4% for CAMH, and 60.4% for INNN in the validation set. In spite of the differences in age and medication status among the datasets, overall performance does not seem to be significantly affected. This would suggest,

contrary to the findings of Kambeitz and colleagues discussed below, that age and medication status do not have as significant an effect on classifier performance as may be expected (Kambeitz et al., 2015). Despite the recently demonstrated effects of antipsychotic medications on cortical gray matter volume and cortical thickness reductions, including in first-episode schizophrenia (Ho et al., 2011; Lesh et al., 2015; Vita et al., 2015), our classifier performed similarly in our three samples, one of which consisted exclusively of patients with first-episode psychosis and controls. The similar results in unmedicated first-episode and chronic patients should mitigate concern that our classifier might have been picking up antipsychotic medication effects between groups, rather than effects of diagnosis.

A recent meta-analysis aimed to systematically compare the performance of classifiers across published studies, and articulated the need for the work presented in this manuscript (Kambeitz et al., 2015). Across a cohort of 38 studies, they reported a mean overall sensitivity of 76.4% and specificity of 79.0%. They showed a higher sensitivity in older cohorts, as well as in chronic versus first-episode patients. As discussed above, the CAMH dataset is the oldest (mean age 36.2 years), while the INNN dataset is the youngest (mean age 24.8 years). Within each classification method, there was no consistent difference in sensitivity between the two groups. The CAMH and NUSDAST datasets both represent chronic patients, and the INNN dataset is a first-episode dataset, but again there was no trend in sensitivity to support the results of the meta-analysis. Kambeitz and colleagues also showed that antipsychotic medication dose had a positive effect on specificity, and although this was observed between the CAMH (medicated) and INNN (unmedicated) datasets, the opposite effect was seen between the INNN and NUSDAST datasets.
The meta-analysis does not indicate an effect of classification algorithm on the results; however,

only studies using support vector machines and linear discriminant analysis were included. The authors emphasize that there are significant unknown effects of subject demographics, imaging modalities and scanning sequences, data processing techniques, algorithm selection, and validation structure among studies, and that a comprehensive study examining these factors, like the one presented in this manuscript, is necessary to answer important questions about classifier effectiveness.

We are limited in this study by a number of factors. Although we examined three datasets and our sample sizes are large relative to those reported elsewhere in the literature, the number of subjects we have access to is still much lower than the number of variables we have. Increasing sample size would improve the reliability of our results. Additionally, the number of features available from an MR image using a voxel-wise approach is enormous. In order to make the data more manageable, and significantly reduce computational times, we were forced to downsample our images and then further reduce them with a PCA, which inevitably results in some loss of information, although this is a step taken by other groups in the literature as well (Karageorgiou et al., 2011; Yoon et al., 2007). With improved computational power, in the future it may be possible to retain all variables in an analysis, and as such pick up on more of the subtle variability that exists between the brains of schizophrenia patients and controls.

The next step in this process is to test additional study variants, such as different feature sets (region-of-interest approach, inclusion of psychological and genetic data), feature reduction techniques (a priori region selection, LDA as feature reduction), and classification techniques (random forests, k-nearest neighbor, neural networks). It is not possible to sample every potential combination in a single study. The data presented here, however, is the most comprehensive

evaluation to date of each of these variables on accuracy, and reports a consistent result across multiple datasets.

In conclusion, three classification methods were compared (logistic regression, support vector machines, and linear discriminant analysis) using cortical thickness and tissue densities. None of the methods produced high-performing classifiers; however, logistic regression marginally outperformed all other methods in both the cortical thickness and tissue density datasets. These results were consistent across three independent datasets. This study illustrates some of the limitations of applying machine learning to neuroimaging data, and suggests that perhaps cortical thickness and tissue densities are not reliable methods for distinguishing between schizophrenia patients and controls on a patient-by-patient basis. Along with future, more extensive studies, the results from this study can be used to construct guidelines for performing classifications on novel datasets, which can eventually be applied in the clinic to supplement symptom-based diagnoses.

Chapter 3: General Discussion & Future Directions

12 Summary of Results

12.1 Image Processing

In this study, the performances of three machine learning algorithms (logistic regression, support vector machines, and linear discriminant analysis) were compared on three large, independent datasets. Two of the datasets represented chronic, heterogeneous patient populations (one with 1.5T data, the other 3T), while the third represented a medication-naïve first-episode population (3T data). Three neuroanatomically-based metrics were assessed: cortical thickness (Yoon et al., 2009), and two estimates of tissue densities (Ashburner and Friston, 2000; Davatzikos et al., 2001). All algorithms were thoroughly validated. No single combination of algorithm, dataset, and neuroanatomical metric significantly outperformed the others; however, algorithms constructed using cortical thicknesses tended to perform better, as did the logistic regression classifiers. Additionally, no single dataset outperformed the others. The classifiers built using support vector machines performed significantly worse than those reported elsewhere in the literature, but the classifiers presented here were more stringently validated and were performed on larger datasets. This study conducted the most comprehensive comparison to date of the factors influencing classifier performance, and illustrated the importance of stringent validation in realistically estimating accuracy and generalizability of classifiers.

In the following sections, factors that may influence performance, such as image pre-processing steps and non-linear registration algorithms, are discussed. Additionally, alternative machine learning algorithms and classification metrics are considered, as is the future of machine learning in the context of schizophrenia.

Inhomogeneity Correction

There are a number of alternatives for many of the data preprocessing steps performed in this study. In the pipeline used here, images were first corrected for image intensity nonuniformity using the improved nonparametric nonuniform intensity normalization (N3) algorithm (known as the N4 algorithm; Sled et al., 1998; Tustison et al., 2010). This algorithm models artefacts due to poor radio frequency coil uniformity, gradient-driven eddy currents, and patient anatomy as a smoothly varying non-parametric function, then corrects MR data on a voxel-by-voxel basis using this function estimate (Sled et al., 1998). Performing this correction is important before using automatic segmentation pipelines because these pipelines assume intensity homogeneity within each tissue type. The N3 correction has been shown to substantially reduce intensity nonuniformity and improve image quality (Sled et al., 1998; Tustison et al., 2010). The improved N4 algorithm builds on the success of the N3 algorithm, and includes a faster and more robust method for B-spline approximation and an improved method for bias field correction (Tustison et al., 2010).
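The smoothly varying function described above is usually expressed as a multiplicative bias field. The following is a sketch of that image formation model as it is commonly stated for N3-type methods; the symbols are introduced here for illustration and do not appear in the text above.

```latex
v(\mathbf{x}) = u(\mathbf{x})\, f(\mathbf{x}) + n(\mathbf{x})
\qquad\Longrightarrow\qquad
\log v(\mathbf{x}) \approx \log u(\mathbf{x}) + \log f(\mathbf{x})
\quad \text{(noise neglected)}
```

Here $v$ is the measured intensity at voxel $\mathbf{x}$, $u$ is the uncorrupted intensity, $f$ is the smooth bias field, and $n$ is noise. Once $f$ has been estimated, the voxel-by-voxel correction amounts to dividing the measured image by the estimated field, $\hat{u}(\mathbf{x}) = v(\mathbf{x}) / \hat{f}(\mathbf{x})$.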

Atlas Selection

All images were registered to the ICBM c model, which is the most recent in a series of models released by the McConnell Brain Imaging Centre (Montreal Neurological Institute, Montreal, Quebec, Canada; Fonov et al., 2011; Fonov et al., 2009). This model represents the mean anatomy of a population of 152 images and shows high spatial resolution and signal-to-noise ratio, without being subject to the biases of a single subject (Fonov et al., 2011). This model is built through an iterative process in which individual native MR images are nonlinearly fit to the average image from the previous iteration, starting with the MNI 152 linear template. There are a number of advantages of registering images to a model. Having all images in a common space allows direct comparisons to be made among images, including images acquired in different studies, and the anatomical variability of a healthy young population is well-represented. Other machine learning studies in the literature also register their images to a population-based model (Karageorgiou et al., 2011; Nieuwenhuis et al., 2013; Yoon et al., 2009).

Instead of a population-based model, a single subject can be selected from the group as the reference image, and all other subjects can be registered to that one image (Fonov et al., 2011; Jelacic et al., 2006; Shan et al., 2006). The success of this method, however, is highly dependent on the subject chosen as the target, as a single subject cannot adequately represent the inherent anatomical variability of the population. The proper method for choosing such a target from an ensemble of images is still unclear; however, there are potential advantages to this methodology. For example, choosing an atlas target from within the dataset under analysis would allow for the reference image to possess the same intensity and contrast characteristics of

86 the dataset under analysis. This type of homologous intensity profile between images could, theoretically, improve registration accuracy between pairs of images. A template specific to the study could have been created by registering and averaging all images, similar to the process used to create the ICBM 152 atlas. This method is used in the machine learning literature (Sun et al., 2009). A template such as this, however, would represent the mean anatomy of both control and patient groups. This a considerable advantage as the data being analyzed will be used in the creation of this final template. As a result, the final atlas will have limited dependence on an a priori choice for a template and its associated neuroanatomy and intensity characteristics (e.g.: noise profile, intensity range, and contrast). However, using a wellknown and well-used template such as ICBM 152 is good for benchmarking purposes. The goal of this project was to evaluate the performance of different machine learning algorithms using a common platform for image processing and downstream machine learning-based classification. As such, using the same template across experiments in different datasets is critical for downstream evaluation and for the comparison of techniques across laboratories and studies. Further, this template is well-known and is commonly used in many other univariate studies in the neuroimaging field; therefore results obtained using this template as a reference will be well understood by the community Non-linear Registration Algorithm Non-linear registration estimates anatomical correspondence to be established among images. There are many different non-linear algorithms available, each with their own similarity measures, transformation models, regularization methods, and optimization strategies. The 86

87 method used in this study, Advanced Normalization Tools (ANTs) is an example of a diffeomorphic, or large-deformation framework, which estimates invertible deformations (diffeomorphisms) that allow for preservation of topology (Ashburner, 2007). A diffeomorphic algorithm is important in anatomical studies because the transformation is smooth and continuous, with invertible derivatives (ie. the Jacobian determinant is nonzero) (Ashburner, 2007). Given the established accuracy, these well-documented transformation properties, and our familiarity with this algorithm it was an ideal choice for this project (Avants et al., 2008; Klein et al., 2009; Murphy et al., 2011; Pipitone et al., 2014; Avants et al., 2009). Another popular non-linear registration algorithm is Automatic Normalization and Image Matching and Anatomical Labeling (ANIMAL), which uses an iterative process wherein a 3D deformation field is estimated between two MR images (Collins et al., 1995; Pipitone et al., 2014), and has previously been used extensively by our group. ANIMAL uses a two-step algorithm. In the first step, known as the outer loop, large deformations on data blurred with a Gaussian kernel are estimated. This fit is subsequently refined with increasingly smaller deformations. In the second step, known as the inner loop, which occurs with each step of the outer loop, ANIMAL optimizes the non-linear transformation to maximize similarity between the source and target images with an objective function composed of the local similarity measure and the cost function, which penalizes large deformations (Chakravarty et al., 2009). The transformation is represented by a deformation field that estimates local translations for each voxel defined by optimizing the optimization function, which is smoothed to ensure the deformation field is continuous (Chakravarty et al., 2009). 
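The invertibility condition noted above for diffeomorphic transformations (a nonzero Jacobian determinant) can be checked numerically on a discrete deformation field. The following is a minimal two-dimensional sketch that assumes the transformation is stored as a displacement field u, so that phi(p) = p + u(p), and that tests the stricter and commonly used condition of a strictly positive determinant. The function name and the toy fields are hypothetical; this is not how ANTs or ANIMAL represent or validate their transformations.

```python
import numpy as np


def jacobian_determinant_2d(disp_y, disp_x, spacing=1.0):
    """Determinant of the Jacobian of phi(p) = p + u(p) on a regular 2D grid."""
    # np.gradient returns derivatives along axis 0 (y) and then axis 1 (x).
    duy_dy, duy_dx = np.gradient(disp_y, spacing)
    dux_dy, dux_dx = np.gradient(disp_x, spacing)
    # Jacobian of phi is the identity plus the gradient of the displacement.
    return (1.0 + duy_dy) * (1.0 + dux_dx) - duy_dx * dux_dy


yy, xx = np.mgrid[0:8, 0:8].astype(float)

# Uniform 10% expansion: u(p) = 0.1 * p, so each voxel's area scales by 1.1 * 1.1.
det_expand = jacobian_determinant_2d(0.1 * yy, 0.1 * xx)

# Collapse along x: u_x = -x maps every point to x = 0, so the determinant is zero.
det_collapse = jacobian_determinant_2d(np.zeros_like(yy), -1.0 * xx)

expand_is_invertible = bool(np.all(det_expand > 0))
collapse_is_invertible = bool(np.all(det_collapse > 0))
```

The same determinant also measures local volume change, since values above one indicate local expansion and values between zero and one indicate local compression.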
ANIMAL performed the best in a study that compared four different non-linear registration methods applied to atlas-to-patient registration of an atlas of the basal ganglia and thalamus (Chakravarty et al., 2009). However, ANTs was not included in this comparison, and Klein and colleagues have subsequently shown that ANIMAL was outperformed by ANTs (Klein et al., 2009).

ANTs has consistently been shown to be the best performing non-linear registration algorithm compared with others in the field (Avants et al., 2008; Klein et al., 2009; Murphy et al., 2011; Pipitone et al., 2014), including DARTEL (Ashburner, 2007), FNIRT (Andersson et al., 2010), the SPM algorithms (Friston et al., 1995a), and ROMEO (Hellier et al., 2001). Further, Avants and colleagues have demonstrated that an increase in algorithm performance corresponds directly with the accuracy of results when trying to make clinically meaningful measurements (Avants et al., 2008). In a study examining the efficacy of 14 different non-linear algorithms in performing registrations of human MR brain images, ANTs emerged as the method that delivered the most consistently high accuracy across numerous error measurements (Klein et al., 2009). Algorithm performance was measured using volume and surface overlap, volume similarity, and distance measures to evaluate similarity of corresponding labeled regions. Algorithms were also ranked based on permutation tests, ANOVA tests, and indifference-zone rankings. ANTs performed the best across multiple subject populations, labeling protocols, and reliability measures. Additional advantages of ANTs are that it is publicly available and has easily adjustable parameters (Klein et al., 2009).

Cortical Thickness Analysis

The cortical thickness analyses in this study were performed using the CIVET algorithm (version , Montreal Neurological Institute at McGill University), which provides fully automated extraction of the inner white matter and outer grey matter surfaces of the brain, as well as an estimate of the thickness of the cortex at 81,924 vertices (Ad-Dab'bagh et al., 2005; Ad-Dab'bagh et al., 2006). The rationale behind using this algorithm was that it was the same one used by Yoon and colleagues to perform patient classification, and we wished to compare our results directly with theirs (Yoon et al., 2007).

There are other methods available for estimating cortical thickness, notably the FreeSurfer software package (Dale et al., 1999; Fischl and Dale, 2000). In both CIVET and FreeSurfer, the input data are pre-processed in a series of steps that include spatial normalization, intensity inhomogeneity correction, skull stripping, and tissue classification. In FreeSurfer, tissue classification is performed based on the geometric structure of the GM/WM interface (Lee et al., 2006). After classification, the tissues are partitioned using a connected component algorithm, and the resulting volume is covered with a triangular tessellation and deformed to represent the inner WM/GM surface as well as the outer GM/CSF surface. This triangular approach often leads to topological defects in the WM surface (Lee et al., 2006). The surfaces are then touched up with manual editing. In CIVET, tissue classification is performed using the Intensity-Normalized Stereotaxic Environment for Classification of Tissues (INSECT) method (Zijdenbos et al., 1996, 1998), and the inner cortical surface is estimated by deforming a spherical polygon model to the boundary between WM and GM in a hierarchical iterative process (Lee et al., 2007). The inner WM surface is then smoothly expanded to estimate the outer GM surface.

CIVET has previously been shown to produce more accurate estimations of classified images and surfaces than FreeSurfer (Lee et al., 2006). Additionally, CIVET showed the smallest errors in both geometric accuracy and reproducibility when compared with FreeSurfer (Lee et al., 2006).
Overall, CIVET showed the best geometric/topologic accuracy, while FreeSurfer showed a more even distribution of points (Lee et al., 2006). In another study, it was shown that CIVET was more sensitive than FreeSurfer to the cortical atrophy of patients with mild cognitive impairment (MCI), but that both were able to detect the more significant differences present in Alzheimer's disease (AD) patients (Redolfi et al., 2015). The cortical thickness differences between schizophrenia patients and controls can be very subtle, so detecting them would be more akin to MCI analyses than AD analyses (if any parallels are to be drawn), suggesting that CIVET may be a better pipeline for schizophrenia applications. It may be of interest in the future to assess the effect of cortical thickness software selection on classification performance; however, the higher performance record of CIVET, especially on difficult-to-classify patient populations, would suggest that results would be more reliable using the CIVET pipeline.

Tissue Density Analysis

For the tissue density analyses, we used both the voxel-based morphometry method proposed by Ashburner and Friston (Ashburner and Friston, 2000; Good et al., 2001) and RAVENS maps (Davatzikos et al., 2001; Shen and Davatzikos, 2003). Grey matter densities have been used by a number of groups performing schizophrenia/control classification (Nieuwenhuis et al., 2012; Sun et al., 2009), while others use RAVENS maps (Davatzikos et al., 2001; Fan et al., 2007; Zanetti et al., 2013). In the interest of replicating the literature, we included both of these methods in our comparisons.

Controlling for volume changes during the non-linear registration is important when performing voxel-by-voxel analyses in a common space. Different studies address this using different approaches. Sun and colleagues perform their tissue classification in a subject's native space, and then use point-correspondences to the cortical surface of a model image to allow for direct comparisons of tissue densities between subjects. We chose to modulate our tissue density estimates using the Jacobian determinant of the non-linear transformation, which corrects for changes to the unit cube (voxel) during registration (Chung et al., 2001). RAVENS maps are constructed via an inherently mass-preserving shape transformation (Davatzikos et al., 2001; Fan et al., 2007), so they are already volume-corrected.

Alternative Machine Learning Algorithms

In addition to the three algorithms compared in this study (logistic regression, support vector machines, and linear discriminant analysis), there are many additional machine learning algorithms used throughout the computer science and neuroimaging literature to perform classifications.

Within the schizophrenia classification literature, Greenstein and colleagues used random forests (RFs) to perform classification between childhood-onset schizophrenia patients and healthy controls, and report an overall classification accuracy of 73.7% (Greenstein et al., 2012). Unlike most of the studies addressed in the present manuscript, they used volumes from 74 brain regions as their input data. RFs use a subset of the data to construct a classification tree, wherein at each node, a variable is selected that best splits the data into two additional nodes (Breiman, 2001; Greenstein et al., 2012). The output measure is classification error, or the number of subjects incorrectly classified after traversing the tree. This method performs well when groups have distinctly different features. This is often not the case in heterogeneous schizophrenia patient populations, and as such RFs have been used only in this one schizophrenia study.

Within the Alzheimer's disease (AD) classification literature, which is more mature than its schizophrenia counterpart, a number of other classification algorithms are used. One of the newer, promising methods is the artificial neural network (ANN). The ANN is a powerful non-linear algorithm modeled after the structure of the central nervous system. Advantages of the ANN method include flexibility in handling various distributions of the input data (no assumptions about distribution have to be made), and an ability to accurately handle non-linear and complex patterns in the data. Disadvantages include difficulty in interpreting important discriminatory features, and a high computational cost (Falahati et al., 2014). Studies have reported strong results using ANNs to classify AD patients from controls (>88%) (Aguilar et al., 2013; Escudero et al., 2011). Ensemble methods, which combine multiple algorithms to draw on each of their unique advantages, have also been used (Falahati et al., 2014).

Additional Metrics for Classification

Although tissue densities are the focus of this project, schizophrenia/control classifications are performed throughout the literature using other metrics. Included among these are metrics derived from diffusion-tensor imaging (DTI), resting-state functional magnetic resonance imaging (rsfMRI), task-based fMRI, positron emission tomography (PET), genetics, and neuropsychological variables. It has been suggested that disease heterogeneity is greater in the neuroanatomical domain, so some of these other metrics may inherently produce superior classifiers (Kambeitz et al., 2015; Zhang et al., 2014).

Studies using rsfMRI have shown classification accuracies ranging from 75% to 92% (Zarogianni et al., 2013). In their meta-analysis, Kambeitz and colleagues showed that studies using rsfMRI had a higher mean sensitivity (84.5%) than studies using structural MRI (76.4%) (Kambeitz et al., 2015). Shen and colleagues (Shen et al., 2010) used a resting-state paradigm to construct an unsupervised classifier based on C-means, and demonstrated an accuracy of 92.3%. Venkataraman and colleagues used a random forest classifier on their rsfMRI data, and showed a 75% classification accuracy. Sample sizes in fMRI datasets, however, are often very small (often around 20 subjects per group), even smaller than those typically seen in structural MR datasets, which significantly decreases the reliability of the results.

Caprihan and colleagues used a modified PCA and fractional anisotropy (FA) measures from DTI images to distinguish schizophrenia patients and controls in their population (Caprihan et al., 2008). After a number of data preprocessing steps, including selecting the top six tracts based on their discriminative power, they report a classification error of zero (using leave-one-out cross-validation). Ardekani and colleagues used FA and mean diffusivity (MD) maps and a linear discriminant analysis to classify their patient population (Ardekani et al., 2011). Using a training/testing set validation structure, they achieved 94% accuracy in the testing set with FA maps, and 98% accuracy with MD maps. Using the two metrics combined did not significantly improve accuracy. Both of these studies show considerable promise for developing reliable classifiers using diffusion-based WM metrics.

Several studies have demonstrated excellent classification accuracies by combining multiple data types. Karageorgiou and colleagues demonstrated 64.3% sensitivity and 72.3% specificity using only structural MR data and a PCA-LDA classifier (Karageorgiou et al., 2011). With only neuropsychological data, they achieved 78.5% sensitivity and 91.5% specificity. By combining the two metrics, they were able to boost their results to 89.3% sensitivity and 93.6% specificity.
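For reference, sensitivity and specificity figures such as those quoted above are derived from a confusion table of true and predicted labels. A minimal sketch follows, assuming patients are coded as 1 and controls as 0; the counts in the example are made up for illustration and do not come from any of the cited studies.

```python
def sensitivity_specificity(y_true, y_pred, positive=1):
    """Return (sensitivity, specificity, accuracy) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    sensitivity = tp / (tp + fn)  # fraction of patients correctly identified
    specificity = tn / (tn + fp)  # fraction of controls correctly identified
    accuracy = (tp + tn) / len(pairs)
    return sensitivity, specificity, accuracy


# Hypothetical example: 40 patients (30 detected) and 50 controls (45 detected).
y_true = [1] * 40 + [0] * 50
y_pred = [1] * 30 + [0] * 10 + [0] * 45 + [1] * 5
sens, spec, acc = sensitivity_specificity(y_true, y_pred)
```

Reporting sensitivity and specificity separately matters when group sizes differ, because overall accuracy alone can hide poor performance on the smaller group.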
Likewise, Yang and colleagues used an SVM-based ensemble method to construct a classifier for schizophrenia patients (Yang et al., 2010). With fMRI data alone, they reported an accuracy of 83%, and with genetic data alone, their accuracy was 74%. With the two data types combined, however, they achieved an accuracy of 87%. These results suggest that combining data sources may lead to the best classification models, and that this area of research warrants significant attention from the field moving forward.

Schizophrenia Diagnosis: Present and Future

Diagnosis of schizophrenia is currently based on observed and reported symptomatology. The newest edition of the Diagnostic and Statistical Manual of Mental Disorders (DSM-5; American Psychiatric Association, 2013) aims to increase the consistency of psychiatric diagnoses over the previous version, the DSM-IV (American Psychiatric Association, 2000). Field trials were conducted to estimate the reliability of the DSM-5. Using patients meant to represent the kind of heterogeneous population observed in a real-world clinical setting, two clinicians independently assessed each patient, and an intraclass kappa was calculated to measure the reliability of the diagnosis. The intraclass kappa for schizophrenia was assessed across two psychiatric institutions, the Centre for Addiction and Mental Health (CAMH) and the University of Texas at San Antonio (UTSA). CAMH reported a kappa of 0.50, which falls into what the authors of the study refer to as the good range ( ), while UTSA reported a kappa of 0.39, which falls into the questionable range ( ) (Regier et al., 2013). Although the reliability of schizophrenia diagnosis and the general understanding of the illness have greatly improved in recent years, there is clearly still significant room for improvement.
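To make the kappa values above more concrete, the following sketch computes a chance-corrected agreement statistic for two raters making a binary diagnosis. The text does not give the field trials' exact estimator, so the code assumes the common two-rater form in which chance agreement is derived from the pooled prevalence across both raters; the function name and the ratings are hypothetical.

```python
def pooled_kappa(rater_a, rater_b):
    """Chance-corrected agreement for two raters with binary (0/1) ratings.

    Assumes chance agreement is computed from the prevalence pooled over
    both raters; other kappa variants use each rater's own marginals.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    observed = sum(1 for a, b in zip(rater_a, rater_b) if a == b) / n
    prevalence = (sum(rater_a) + sum(rater_b)) / (2 * n)
    expected = prevalence**2 + (1 - prevalence) ** 2
    if expected == 1.0:
        raise ValueError("kappa is undefined when all ratings are identical")
    return (observed - expected) / (1 - expected)


# Hypothetical ratings for 10 patients (1 = schizophrenia diagnosed).
clinician_1 = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
clinician_2 = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
kappa = pooled_kappa(clinician_1, clinician_2)
```

In this made-up example the clinicians agree on 8 of 10 patients, yet the chance-corrected value is noticeably lower than 0.8, which illustrates why raw agreement overstates diagnostic reliability.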

The field is moving away from a categorical definition of schizophrenia. Evidence of this can be seen in the fact that the DSM-5 (American Psychiatric Association, 2013) does not have the schizophrenia sub-types (paranoid, disorganized, catatonic, undifferentiated, and residual) that were present in the previous version of the DSM, the DSM-IV-TR (American Psychiatric Association, 2000; Tandon et al., 2013). Instead, efforts such as the Research Domain Criteria (RDoC) are becoming more prominent. The RDoC is a new classification framework for research on mental disorders which emphasizes behavioural dimensions and neurobiological measures (Insel et al., 2010). The RDoC is a matrix of functional dimensions, and aims to integrate modern research approaches into a more clinical framework. The objective is to use an understanding of basic mechanisms to contextualize symptoms, instead of the other way around. This brings important research findings closer to the clinic, and is an exciting future direction for the neuroscience and psychiatric fields.

A promising application of automated classifiers is distinguishing between patients with schizophrenia and mood disorders, which show significant symptom overlap (American Psychiatric Association, 2013; Demirci et al., 2009; Kim et al., 2015). Some progress has already been made in this area. Koutsouleris and colleagues showed that they could achieve individualized differential diagnosis of schizophrenia and major depression using neuroanatomical biomarkers (Koutsouleris et al., 2015). Schnack and colleagues compared two samples of patients with schizophrenia and bipolar disorder, and were able to achieve a classification accuracy between the two disease groups of 88% in the training data, and 66% in the validation data (Schnack et al., 2013). They conclude that the grey matter pathologies within the two groups are unique enough to be used as a reliable metric for differentiating between the diseases using machine learning. Certainly this is a promising area of research.

12.5 Limitations

In addition to the factors addressed in Section 11, there are a number of important additional limitations to consider in this study. Although we have endeavoured to perform the most robust image registration possible by using the strongest algorithm (ANTs), the most up-to-date atlas, and a comprehensive quality control step, there are always inaccuracies associated with non-linear image registration. It is not possible to perfectly align every voxel of every subject, so some amount of registration error is bound to affect the final results.

Estimation of tissue density is likewise limited by imperfect registrations and classifications. Additionally, at the voxel size used in this study, it is not possible to avoid partial volume issues. Especially at tissue borders, a full voxel is unlikely to be 100% GM, WM, or CSF. We have tried to mitigate this effect by smoothing our classified images with a Gaussian kernel.

Tissue density estimates were used in this study in an effort to replicate classification studies in the literature. However, some argue that voxel-based morphometry (and other tissue density estimates, including RAVENS maps) does not reflect true neuroanatomical characteristics, and that tissue density is an artificial metric. In spite of this criticism, tissue densities have continually shown strong performance in detecting anatomical differences between patient and control groups, both in the schizophrenia literature (as addressed in Section 1.0) and elsewhere.

Although the datasets used in this study are significantly larger than those used in many other machine learning studies, we are still limited by sample size. Nieuwenhuis and colleagues suggest that algorithms are only stable when there are more than 130 subjects in the training set, something we are not able to satisfy with any of our datasets (Nieuwenhuis et al., 2013). Further, although our cross-validation strategy leads to a rigorous assessment of accuracy, dividing each dataset up multiple times often means that each subgroup has only a small number of subjects. For example, in the INNN dataset, after removing subjects who failed the quality control step, the 98 remaining subjects were divided into training and validation subsets (i.e., 33 each of patients and controls in the training set and 16 each in the validation set). Within the training subset, 10-fold cross-validation means that only 3 subjects are used to test the parameters within each fold, which may lead to unstable results.

Conclusion

In this study, we have examined the efficacy of machine learning algorithms when applied to schizophrenia patient classification. Specifically, we compared the performance of logistic regression, support vector machines, and linear discriminant analysis in classifying patients from controls in three heterogeneous datasets representing patients from multiple disease stages, diagnoses, and medication statuses. Classifiers were built on cortical thickness or tissue density data. No single method significantly outperformed the others, and overall accuracies were generally lower than those reported in the literature. However, the results presented here are likely more reliable and representative of classifier performance in real-life datasets because of the rigorous validation scheme used, as well as the limited use of population-specific data and algorithm manipulation.

In general, this study showed that reliable schizophrenia classifiers are difficult to construct, and that classification accuracies currently reported in the literature are likely over-estimated. In order for disease classifiers to make the leap to the clinic, guidelines addressing which methods are best and how algorithm performance is assessed must be established. If this is achieved, machine learning presents an exciting opportunity to improve the reliability and specificity of psychiatric diagnoses, especially in patient populations with overlapping symptomatology, as well as to gain novel insights into disease pathologies.
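The subset sizes quoted for the INNN dataset in the Limitations discussion follow from simple arithmetic, sketched below. The code assumes that folds are formed separately within each diagnostic group (stratified folds), which the text does not state, and the helper name is hypothetical.

```python
def fold_sizes(n_subjects, n_folds):
    """Sizes of the held-out portions when n_subjects are split into n_folds."""
    base, extra = divmod(n_subjects, n_folds)
    # The first `extra` folds receive one additional subject.
    return [base + 1] * extra + [base] * (n_folds - extra)


# Figures from the text: 98 subjects, 33 per group for training, 16 per group held out.
n_total = 98
train_per_group = 33
validation_per_group = 16
assert 2 * (train_per_group + validation_per_group) == n_total

# Held-out subjects per group in each of the 10 inner folds (stratification assumed).
per_group_folds = fold_sizes(train_per_group, 10)
```

Under these assumptions each inner fold tests on only three or four subjects per group, so a single misclassified subject shifts the fold's accuracy estimate by a large amount.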

References

Ad-Dab'bagh, Y., Singh, V., & Robbins, S. (2005). Native-space cortical thickness measurement and the absence of correlation to cerebral volume. Presented at the 11th Annual Organization of Human Brain Mapping Meeting, Toronto, Ontario, Canada.
Ad-Dab'bagh, Y. et al., "The CIVET image-processing environment: A fully automated comprehensive pipeline for anatomical neuroimaging research", in "Proceedings of the 12th Annual Meeting of the Organization for Human Brain Mapping", M. Corbetta, ed. (Florence, Italy, NeuroImage),
Aguilar C, Westman E, Muehlboeck JS, Mecocci P, Vellas B, Tsolaki M, Kloszewska I, Soininen H, Lovestone S, Spenger C, Simmons A, Wahlund LO (2013) Different multivariate techniques for automated classification of MRI data in Alzheimer's disease and mild cognitive impairment. Psychiatry Res 212,
Allen JS, Bruss J, Mehta S, Grabowski T, Brown CK, Damasio H. Effects of spatial transformation on regional brain volume estimates. NeuroImage 2008;42:
American Psychiatric Association. (2000). Diagnostic and Statistical Manual of Mental Disorders (DSM-IV-TR), 4th edition (Text Revision). American Psychiatric Association, Washington D.C.
American Psychiatric Association. (2013). Diagnostic and statistical manual of mental disorders (5th ed.). Washington, DC.
Ananth H, Popescu I, Critchley HD, Good CD, Frackowiak RSJ, Dolan RJ. Cortical and subcortical gray matter abnormalities in schizophrenia determined through structural magnetic resonance imaging with optimized volumetric voxel-based morphometry. Am J Psychiatry. 2002;159:
Andersson JLR, Jenkinson M, Smith S (2010) Non-linear registration, aka spatial normalisation. FMRIB technical report TR07JA2
Andreasen NC. A unitary model of schizophrenia: Bleuler's fragmented phrene as schizencephaly. Arch Gen Psychiatry. 1999;56:
Ardekani BA, Tabesh A, Sevy S, Robinson DG, Bilder RM, Szeszko PR. Diffusion tensor imaging reliably differentiates patients with schizophrenia from healthy volunteers. Hum Brain Mapp 2011;32:1-9.
Ashburner, J. (2007). A fast diffeomorphic image registration algorithm. NeuroImage, 38(1), doi: /j.neuroimage
Ashburner, J., & Friston, K. J. (2000). Voxel-based morphometry--the methods. NeuroImage, 11(6 Pt 1), doi: /nimg

Avants, B. B., Epstein, C. L., Grossman, M., & Gee, J. C. (2008). Symmetric diffeomorphic image registration with cross-correlation: Evaluating automated labeling of elderly and neurodegenerative brain. Medical Image Analysis, 12(1), doi: /j.media
Avants, B., Tustison, N., & Song, G. (2009). Advanced Normalization Tools (ANTS). Insight Journal,
Benes, F.M., Emerging principles of altered neural circuitry in schizophrenia. Brain Res. Rev. 31,
Bleuler, E. (1950). Dementia Praecox, or the group of Schizophrenias. (J. Zinkin, Trans.). New York: International University Press.
Bora E, Fornito A, Radua J, Walterfang M, Seal M, Wood SJ, et al. Neuroanatomical abnormalities in schizophrenia: a multimodal voxelwise meta-analysis and meta-regression analysis. Schizophr Res 2011;127:46-57
Borgwardt, S., Koutsouleris, N., Aston, J., Studerus, E., Smieskova, R., Riecher-Rössler, A., & Meisenzahl, E. M. (n.d.). Supplementary_Methods: Distinguishing prodromal from first-episode psychosis using neuroanatomical single-subject pattern recognition.
Bouras C, Kovari E, Hof PR, Riederer BM, Giannakopoulos P (2001) Anterior cingulate cortex pathology in schizophrenia and bipolar disorder. Acta Neuropathol (Berl) 102:
A Brief History of Schizophrenia. (2012, September 8). Retrieved from:
Breiman, L. (2001). Random forests. Mach. Learn. 45, 5-32
Burges, C.J.C., A tutorial on support vector machines for pattern recognition. Data Mining Knowl. Discov. 2 (2),
Caprihan A, Pearlson GD, Calhoun VD. Application of principal component analysis to distinguish patients with schizophrenia from healthy controls based on fractional anisotropy measurements. Neuroimage 2008;42:675-82
Carpenter WT, Strauss JS. Cross-cultural evaluation of Schneider's first-rank symptoms of schizophrenia: a report from the International Pilot Study of Schizophrenia. Am J Psychiatry. 1974;131:
Chakravarty, M. M., Sadikot, A. F., Germann, J., Hellier, P., Bertrand, G., & Collins, D. L. (2009). Comparison of piece-wise linear, linear, and nonlinear atlas-to-patient warping techniques: analysis of the labeling of subcortical nuclei for functional neurosurgical applications. Human Brain Mapping, 30(11), doi: /hbm
Chan RCK, Di X, McAlonan GM, Gong Q-Y (2009). Brain anatomical abnormalities in high-risk individuals, first-episode, and chronic schizophrenia: an activation likelihood estimation meta-analysis of illness progression. Schizophr Bull 37:

Chua SE, Cheung C, Cheung V, et al. Cerebral grey, white matter and CSF in never-medicated, first-episode schizophrenia. Schizophr Res. 2007;89:
Chung, M. K., Worsley, K. J., Paus, T., Cherif, C., Collins, D. L., Giedd, J. N., Evans, A. C. (2001). A unified statistical approach to deformation-based morphometry. NeuroImage, 14(3), doi: /nimg
Collins, D.L., Neelin, P., Peters, T.M., Evans, A.C., Automatic 3D intersubject registration of MR volumetric data in standardized Talairach space. J. Comput. Assist. Tomogr. 180 (2), (ISSN )
Collins DL, Holmes CJ, Peters T, Evans AC (1995): Automatic 3-D model based neuroanatomical segmentation. Hum Brain Mapp 3:
Collins DL, Evans AC (1997): ANIMAL: Validation and application of nonlinear registration-based segmentation. Int J Pattern Recog Artif Intell 11:
Csernansky, John G., Lei Wang, Donald Jones, Devna Rastogi-Cruz, Joel A. Posener, Gitry Heydebrand, J. Philip Miller, and Michael I. Miller (2002). "Hippocampal deformities in schizophrenia characterized by high dimensional brain mapping." American Journal of Psychiatry 159, no. 12:
Cutting, John (Ed); Shepherd, M. (Ed), (1987). The clinical roots of the schizophrenia concept: Translations of seminal European contributions on schizophrenia., (pp ). New York, NY, US: Cambridge University Press, vii, 238 pp.
Dale, A.M., Fischl, B., Sereno, M.I., Cortical surface-based analysis. I. Segmentation and surface reconstruction. NeuroImage 9 (2),
Davatzikos, C., Genc, A., Xu, D., & Resnick, S. M. (2001). Voxel-based morphometry using the RAVENS maps: methods and validation using simulated longitudinal atrophy. NeuroImage, 14(6), doi: /nimg
Davatzikos, C., Shen, D., Gur, R. C., Wu, X., Liu, D., Fan, Y., Gur, R. E. (2005). Whole-brain morphometric study of schizophrenia revealing a spatially complex set of focal abnormalities. Archives of General Psychiatry, 62(11), doi: /archpsyc
Demirci, O., Calhoun, V.D., Functional magnetic resonance imaging implications for detection of schizophrenia. Eur. Neurol. Rev. 4,
The Different Types of Psychosis. (2012). Retrieved from ormation/psychosis/first_episode_psychosis_information_guide/pages/fep_types.aspx

Eranti, S. V., MacCabe, J. H., Bundy, H., & Murray, R. M. (2012). Gender difference in age at onset of schizophrenia: a meta-analysis. Psychological Medicine, (September), doi: /s x
Escudero J, Zajicek JP, Ifeachor E, Alzheimer's Disease Neuroimaging I (2011) Machine Learning classification of MRI features of Alzheimer's disease and mild cognitive impairment subjects to reduce the sample size in clinical trials. Conf Proc IEEE Eng Med Biol Soc 2011,
Eskildsen, SF, Coupé, P, García-Lorenzo, D, Fonov, V, Pruessner, JC, Collins, DL, Alzheimer's Disease Neuroimaging Initiative Prediction of Alzheimer's disease in subjects with mild cognitive impairment from the ADNI cohort using patterns of cortical thinning. Neuroimage. 65:
Falahati, F., Westman, E., & Simmons, A. (2014). Multivariate Data Analysis and Machine Learning in Alzheimer's Disease with a Focus on Structural Magnetic Resonance Imaging. Journal of Alzheimer's Disease : JAD. doi: /jad
Fischer BA, Carpenter WT. Will the Kraepelinian dichotomy survive DSM-V? Neuropsychopharmacology. 2009;34:
Fan, Y., Shen, D., & Gur, R. (2007). COMPARE: classification of morphological patterns using adaptive regional elements. Medical Imaging, IEEE, 26(1),
First, M. B. (1995). Structured Clinical Interview for the DSM (SCID). The Encyclopedia of Clinical Psychology.
Fischl B, Sereno MI, Tootell RBH, Dale AM. High-resolution intersubject averaging and a coordinate system for the cortical surface. Hum. Brain Map 1999;8:
Fischl, B., Dale, A.M., Measuring the thickness of the human cerebral cortex from magnetic resonance images. Proc Natl Acad Sci U S A 97,
Fischl, B., Salat, D. H., Busa, E., Albert, M., Dieterich, M., Haselgrove, C., et al. (2002). Whole brain segmentation: automated labeling of neuroanatomical structures in the human brain. Neuron, 33,
Fletcher, P., McKenna, P.J., Friston, K.J., Frith, C.D., Dolan, R.J., Abnormal cingulate modulation of fronto-temporal connectivity in schizophrenia. Neuroimage 9,
Fonov, V., Evans, A. C., Botteron, K., Almli, C. R., McKinstry, R. C., & Collins, D. L. (2011). Unbiased average age-appropriate atlases for pediatric studies. NeuroImage, 54(1), doi: /j.neuroimage
Fonov, V., Evans, A., McKinstry, R., Almli, C., & Collins, D. (2009). Unbiased nonlinear average age-appropriate brain templates from birth to adulthood. NeuroImage, 47, S102. doi: /s (09)

Franke, K., Ziegler, G., Kloppel, S., Gaser, C., Estimating the age of healthy subjects from T1-weighted MRI scans using kernel methods: exploring the influence of various parameters. Neuroimage 50,
Friston KJ, Ashburner CD, Frith CD, Poline J-B, Heather JD, Frackowiak RSJ (1995a): Spatial registration and normalization of images. Hum Brain Mapp 3:
Friston KJ, Holmes AP, Worsley KJ, Poline J-B, Frith CD, Frackowiak RSJ (1995b): Statistical parametric maps in functional imaging: A general linear approach. Hum Brain Mapp 2:
Ganzola, R., Maziade, M., & Duchesne, S. (2014). Hippocampus and amygdala volumes in children and young adults at high-risk of schizophrenia: Research synthesis. Schizophr Res, 156, doi: /j.schres
Gaser C, Nenadic I, Buchsbaum BR, Hazlett EA, Buchsbaum MS. Deformation-based morphometry and its relation to conventional volumetry of brain lateral ventricles in MRI. Neuroimage. 2001;13:
Gaser C, Nenadic I, Buchsbaum BR, Hazlett EA, Buchsbaum MS. Ventricular enlargement in schizophrenia related to volume reduction of the thalamus, striatum, and superior temporal cortex. Am J Psychiatry. 2004;161:
Goldman AL, Pezawas L, Mattay VS, et al. Widespread reductions of cortical thickness in schizophrenia and spectrum disorders and evidence of heritability. Arch Gen Psychiatry. 2009;66:
Greenstein, D., Malley, J. D., Weisinger, B., Clasen, L., & Gogtay, N. (2012). Using multivariate machine learning methods and structural MRI to classify childhood onset schizophrenia and healthy controls. Frontiers in Psychiatry, 3(June), 53. doi: /fpsyt
Gu, Q., Li, Z., & Han, J. (n.d.). Linear Discriminant Dimensionality Reduction. Analysis.
Gur R, Turetsky B, Bilker W, Gur R. Reduced gray matter volume in schizophrenia. Arch Gen Psychiatry. 1999;56:
Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning: Data Mining, Inference and Prediction. New York, NY: Springer-Verlag;
Hastie T, Tibshirani R, Friedman J (2009). The elements of statistical learning. Springer, 2.
Hellier, P. (2003). Consistent intensity correction of MR images; International Conference on Image Processing: ICIP p
History of Schizophrenia. (2010). Retrieved from
Ho, B-C., Andreasen, N. C., Ziebell, S., Pierson, R., & Magnotta, V. (2011). Long-term Antipsychotic Treatment and Brain Volumes, 68(2),

Holmes CJ, Blanton RE, Kornsand DS, Caplan R, McCracken J, Asarnow R, Toga AW. Brain abnormalities in early-onset schizophrenia spectrum disorder observed with statistical parametric mapping of structural magnetic resonance images. Am J Psychiatry. 2000;157:
Hulshoff Pol HE, Schnack HG, Mandl RCW, van Haren NE, Koning H, Collins DL, Evans AC, Kahn RS. Focal gray matter density changes in schizophrenia. Arch Gen Psychiatry. 2001;58:
Insel, T., Cuthbert, B., Garvey, M., Heinssen, R., Pine, D. S., Quinn, K., Wang, P. (2010). Research Domain Criteria (RDoC): Toward a new classification framework for research on mental disorders. American Journal of Psychiatry, 167(7), doi: /appi.ajp
Jablensky, A. (2010). The diagnostic concept of schizophrenia: its history, evolution, and future prospects. Dialogues in Clinical Neuroscience, 12(3), doi: /aln.0b013e318212ba87
Jelacic, S., de Regt, D., et al., Interactive digital MR Atlas of the pediatric brain. Radiographics 26 (2),
Job DE, Whalley HC, McConnell S, Glabus M, Johnstone EC, Lawrie SM. Structural gray matter differences between first-episode schizophrenics and normal controls using voxel-based morphometry. Neuroimage. 2002;17:
Kambeitz, J., Kambeitz-Ilankovic, L., Leucht, S., Wood, S., Davatzikos, C., Malchow, B., Koutsouleris, N. (2015). Detecting Neuroimaging Biomarkers for Schizophrenia: A Meta-Analysis of Multivariate Pattern Recognition Studies. Neuropsychopharmacology, 40(7), doi: /npp
Karageorgiou, E., Schulz, S. C., Gollub, R. L., Andreasen, N. C., Ho, B. C., Lauriello, J., Georgopoulos, A. P. (2011). Neuropsychological testing and structural magnetic resonance imaging as diagnostic biomarkers early in the course of schizophrenia and related psychoses. Neuroinformatics, 9(4), doi: /s
Kasanin J. The acute schizoaffective psychosis. Am J Psychiatry. 1933;90:
Kay, Stanley R.; Fiszbein, Abraham; Opler, Lewis A. (1987). The Positive and Negative Syndrome Scale (PANSS) for Schizophrenia. Schizophrenia Bulletin, 13(2),
Kim, D., Kim, J., Koo, T., Yun, H., & Won, S. (2015). Shared and Distinct Neurocognitive Endophenotypes of Schizophrenia and Psychotic Bipolar Disorder, 13(1),
Klein, A., Andersson, J., Ardekani, B. A., Ashburner, J., Avants, B., Chiang, M.-C., Parsey, R. V. (2009). Evaluation of 14 nonlinear deformation algorithms applied to human brain MRI registration. NeuroImage, 46(3), doi: /j.neuroimage

Klein, S., Staring, M., Murphy, K., Viergever, M. a., & Pluim, J. P. W. (2010). Elastix: A toolbox for intensity-based medical image registration. IEEE Transactions on Medical Imaging, 29(1), doi: /tmi
Kluft RP. First-rank symptoms as a diagnostic clue to multiple personality disorder. Am J Psychiatry. 1987;144:
Koutsouleris, N., Meisenzahl, E. M., Borgwardt, S., Riecher-Rossler, a., Frodl, T., Kambeitz, J., Davatzikos, C. (2015). Individualized differential diagnosis of schizophrenia and mood disorders using neuroanatomical biomarkers. Brain. doi: /brain/awv111
Kraepelin, E. (1971). Dementia Praecox and Paraphrenia. (R. G. Krieger, Ed.). New York, NY, US.
Kremen, W. S., Seidman, L. J., Faraone, S. V., Toomey, R., & Tsuang, M. T. (2000). The paradox of normal neuropsychological function in schizophrenia. Journal of Abnormal Psychology, 109(4),
Kubicki M, Shenton ME, Salisbury DF, Hirayasu Y, Kasai K, Kikinis R, Jolesz FA, McCarley RW. Voxel-based morphometric analysis of gray matter in first episode schizophrenia. Neuroimage. 2002;17:
Kuperberg GR, Broome MR, McGuire PK, David AS, Eddy M, Ozawa F, Goff D, West WC, Williams SC, van der Kouwe AJ, Salat DH, Dale AM, Fischl B. Regionally localized thinning of the cerebral cortex in schizophrenia. Arch Gen Psychiatry. 2003;60:
Langfeld G. The prognosis of schizophrenia. Acta Psychiatr Neurol Scand (suppl 110).
Lao, Z., D. Shen, Z. Xue, B. Karacali, S. M. Resnick, and C. Davatzikos, Morphological classification of brains via high-dimensional shape transformations and machine learning methods, NeuroImage, vol. 21, pp ,
Lawrie SM, Abulkmeil SS. Brain abnormality in schizophrenia: a systematic and quantitative review of volumetric magnetic resonance imaging studies. Br J Psychiatry. 1998;172:
Lee, J. K., Lee, J. M., Kim, J. S., Kim, I. Y., Evans, A. C., & Kim, S. I. (2006). A novel quantitative cross-validation of different cortical surface reconstruction algorithms using MRI phantom. Neuroimage, 31(2),
Lerch, J. P., & Evans, A. C. (2005). Cortical thickness analysis examined through power analysis and a population simulation. NeuroImage, 24(1), doi: /j.neuroimage
Lesh, T. A., Tanase, C., Geib, B. R., Niendam, T. A., Yoon, J. H., Minzenberg, M. J., Ragland, J. D., Solomon, M., Carter, C. S. (2015). A Multimodal Analysis of Antipsychotic Effects on Brain Structure and Function in First-Episode Schizophrenia. JAMA Psychiatry, 72(3),

Magnetic Resonance: A Peer-Reviewed, Critical Introduction. April Retrieved from
MacDonald, D., Kabani, N., Avis, D., & Evans, A. C. (2000). Automated 3-D extraction of inner and outer surfaces of cerebral cortex from MRI. NeuroImage, 12(3),
Malcolm B. Carpenter. Core Text of Neuroanatomy. Williams & Wilkins, third edition, 1985
McGrath J, Saha S, Chant D, Welham J (2008). Schizophrenia: a concise overview of incidence, prevalence, and mortality. Epidemiol Rev 30:
Messias, E. L., Chen, C. Y., & Eaton, W. W. (2007). Epidemiology of Schizophrenia: Review of Findings and Myths. Psychiatric Clinics of North America, 30(3), doi: /j.psc
Moskowitz, A., & Heim, G. (2011). Eugen Bleuler's Dementia Praecox or the Group of Schizophrenias (1911): A centenary appreciation and reconsideration. Schizophrenia Bulletin, 37(3), doi: /schbul/sbr016
Murphy, K., Van Ginneken, B., Reinhardt, J. M., Kabus, S., Ding, K., Deng, X., Pluim, J. P. W. (2011). Evaluation of registration methods on thoracic CT: The EMPIRE10 challenge. IEEE Transactions on Medical Imaging, 30(11), doi: /tmi
Narr K, Thompson P, Sharma T, Moussai J, Zoumalan C, Rayman J, Toga A. Three-dimensional mapping of gyral shape and cortical surface asymmetries in schizophrenia: gender effects. Am J Psychiatry. 2001;158:
Narr, K. L., Bilder, R. M., Toga, A. W., Woods, R. P., Rex, D. E., Szeszko, P. R., Thompson, P. M. (2005). Mapping cortical thickness and gray matter concentration in first episode schizophrenia. Cerebral Cortex (New York, N.Y. : 1991), 15(6), doi: /cercor/bhh172
Nieuwenhuis, M., van Haren, N. E. M., Hulshoff Pol, H. E., Cahn, W., Kahn, R. S., & Schnack, H. G. (2012, July 2). Classification of schizophrenia patients and healthy controls from structural MRI scans in two large independent samples. NeuroImage. Elsevier Inc. doi: /j.neuroimage
Nishimura, D. W. (2010). Principles of Magnetic Resonance Imaging. Redwood City, CA: Stanford University Press.
Osmond, H., & Smythies, J. (1952). Schizophrenia: A New Approach. The British Journal of Psychiatry, 98(411), doi: /bjp
Pantelis, C., Barnes, T.R.E., Nelson, H.E., Is the concept of frontal subcortical dementia relevant to schizophrenia? Br. J. Psychiatry 160,

Parent A, Carpenter MB. Human Neuroanatomy. Baltimore, MD: Williams & Wilkins;
Park HJ, Levitt J, Shenton ME, Salisbury DF, Kubicki M, Kikinis R, Jolesz FA, McCarley RW. An MRI study of spatial probability brain map differences between first-episode schizophrenia and normal controls. Neuroimage. 2004; 22:
Pers TH, Albrechtsen A, Holst C, Sørensen TIA, Gerds TA (2009). The validation and assessment of machine learning: a game of prediction from high-dimensional data. PLoS One 4: e6287.
Pipitone, J., Park, M. T. M., Winterburn, J., Lett, T. A., Lerch, J. P., Pruessner, J. C.,... & Alzheimer's Disease Neuroimaging Initiative. (2014). Multi-atlas segmentation of the whole hippocampus and subfields using multiple automatically generated templates. Neuroimage, 101,
Rabinowitz, J., Levine, S. Z., & Häfner, H. (2006). A population based elaboration of the role of age of onset on the course of schizophrenia. Schizophrenia Research, 88(1-3), doi: /j.schres
Radewicz K, Garey LJ, Gentleman SM, Reynolds R (2000) Increase in HLA-DR immunoreactive microglia in frontal and temporal cortex of chronic schizophrenics. J Neuropathol Exp Neurol 59:
Rajkowska G (1997) Morphometric methods for studying the prefrontal cortex in suicide victims and psychiatric patients. Ann N Y Acad Sci 836:
Redolfi, A., Manset, D., Barkhof, F., Wahlund, L.-O., Glatard, T., Mangin, J.-F., & Frisoni, G. B. (2015). Head-to-Head Comparison of Two Popular Cortical Thickness Extraction Algorithms: A Cross-Sectional and Longitudinal Study. Plos One, 10(3), e doi: /journal.pone
Regier DA, Narrow WE, Rae DS, Manderscheid RW, Locke BZ, Goodwin FK. The de facto mental and addictive disorders service system. Epidemiologic Catchment Area prospective 1-year prevalence rates of disorders and services. Archives of General Psychiatry Feb;50(2):
Salgado-Pineda P, Baeza I, Pérez-Gómez M, et al. Sustained attention impairment correlates to gray matter decreases in first episode neuroleptic-naive schizophrenic patients. NeuroImage. 2003;19:
Schnack, H. G., Nieuwenhuis, M., van Haren, N. E. M., Abramovic, L., Scheewe, T. W., Brouwer, R. M., Kahn, R. S. (2013). Can structural MRI aid in clinical classification? A machine learning study in two independent samples of patients with schizophrenia, bipolar disorder and healthy subjects. NeuroImage. doi: /j.neuroimage

Schneider K. Klinische Psychopathologie, 8th edition, Stuttgart, Germany: Thieme; English translation by Hamilton MW, Anderson EW. Clinical Psychopathology. New York, NY: Grune and Stratton;
Selemon LD, Rajkowska G, Goldman-Rakic PS (1995) Abnormally high neuronal density in the schizophrenic cortex. A morphometric analysis of prefrontal area 9 and occipital area 17. Arch Gen Psychiatry 52:
Selemon LD, Rajkowska G, Goldman-Rakic PS (1998) Elevated neuronal density in prefrontal area 46 in brains from schizophrenic patients: application of a three-dimensional, stereologic counting method. J Comp Neurol 392:
Selemon LD, Mrzljak J, Kleinman JE, Herman MM, Goldman-Rakic PS (2003) Regional specificity in the neuropathologic substrates of schizophrenia: a morphometric analysis of Broca's area 44 and area 9. Arch Gen Psychiatry 60:
Shan, Z., Parra, C., et al., A digital pediatric brain structure atlas from T1-weighted MR images. Medical Image Computing and Computer-Assisted Intervention MICCAI 2006, pp
Shao, J. (1993). Linear model selection by cross-validation. Journal of the American Statistical Association, 88(422),
Shen D, Davatzikos C. Very high-resolution morphometry using mass-preserving deformations and HAMMER elastic registration. Neuroimage 2003;18:
Shen, H., Wang, L., Liu, Y., Hu, D., Discriminative analysis of resting-state functional connectivity patterns of schizophrenia using low dimensional embedding of fMRI. NeuroImage 49,
Shenton, M.E., Dickey, C.C., Frumin, M., McCarley, R.W. (2001). A review of MRI findings in schizophrenia. Schizophr. Res. 49,
Sigmundsson T, Suckling J, Maier M, Williams S, Bullmore E, Greenwood K, Fukuda R, Ron M, Toone B. Structural abnormalities in frontal, temporal, and limbic regions and interconnecting white matter tracts in schizophrenic patients with prominent negative symptoms. Am J Psychiatry. 2001;158:
Sowell ER, Levitt J, Thompson PM,
Sled, J. G., Zijdenbos, P., & Evans, C. (1998). A nonparametric method for automatic correction of intensity nonuniformity in MRI data. IEEE Transactions on Medical Imaging, 17(1), doi: /
Smieskova, R., Fusar-Poli, P., Allen, P., Bendfeldt, K., Stieglitz, R.D., Drewe, J., Radue, E.W., McGuire, P.K., Riecher-Rossler, A., Borgwardt, S.J., The effects of antipsychotics on the brain: what have we learnt from structural imaging of schizophrenia? a systematic review. Curr. Pharm. Des. 15,
Stephens JH, Astrup C. Prognosis in process and non-process schizophrenia. Am J Psychiatry. 1963;119:

Strauss JS, Carpenter WT. Characteristic symptoms and outcome in schizophrenia. Arch Gen Psychiatry. 1974;30:
Talairach, J.; Tournoux, P. Co-planar Stereotaxic Atlas of the Human Brain. Thieme Medical Publishers; New York:
Tandon, R., Gaebel, W., Barch, D. M., Bustillo, J., Gur, R. E., Heckers, S., Carpenter, W. (2013). Definition and description of schizophrenia in the DSM-5. Schizophrenia Research, 150(1), doi: /j.schres
Torrey, E. F., Barci, B. M., Webster, M. J., Bartko, J. J., Meador-Woodruff, J. H., & Knable, M. B. (2005). Neurochemical markers for schizophrenia, bipolar disorder, and major depression in postmortem brains. Biological Psychiatry, 57(3), doi: /j.biopsych
Tsuang MT, Winokur G. Criteria for subtyping schizophrenia. Arch Gen Psychiatry. 1974;31:
Venkataraman, A., Whitford, T.J., Westin, C.-F., Golland, P., Kubicki, M., 2012. Whole brain resting state functional connectivity abnormalities in schizophrenia. Schizophr. Res. 139, 7-12.
Vincent, L., and P. Soille, Watersheds in digital spaces: An efficient algorithm based on immersion simulations, IEEE Trans. Pattern Anal. Mach. Intell., vol. 13, no. 6, pp , Jun
Vita A, De Peri L. Hippocampal and amygdala volume reductions in first-episode schizophrenia. Br J Psychiatry. 2007;190:271.
Vita, A., De Peri, L., Deste, G., Barlati, S., & Sacchetti, E. (2015). The effect of antipsychotic treatment on cortical gray matter changes in schizophrenia: does the class matter? A meta-analysis and meta-regression of longitudinal magnetic resonance imaging studies. Biological Psychiatry, 78(6),
Wang, Q., Seghers, D., et al., Construction and Validation of Mean Shape Atlas Templates for Atlas-Based Brain Image Segmentation. Information Processing in Medical Imaging, pp
Wheeler, A. L., Chakravarty, M. M., Lerch, J. P., Pipitone, J., Daskalakis, Z. J., Rajji, T. K., Voineskos, A. N. (2013). Disrupted Prefrontal Interhemispheric Structural Coupling in Schizophrenia Related to Working Memory Performance. Schizophrenia Bulletin, doi: /schbul/sbt100
Wheeler, A. L., & Voineskos, A. N. (2014). A review of structural neuroimaging in schizophrenia: from connectivity to connectomics. Frontiers in Human Neuroscience, 8(August), doi: /fnhum

Wing JK, Cooper JE, Sartorius N. The Description and Classification of Psychiatric Symptoms: An Instruction Manual for the PSE and Catego system. London, UK: Cambridge University Press;
World Health Organization. (1973). The International Pilot Study of Schizophrenia. Geneva, Switzerland: World Health Organization Press.
World Health Organization (2004). Global Burden of Disease Update. Available at
Woodruff, P.W.R., Wright, I.C., Shuriquie, N., Russouw, H., Rushe, T., Howard, R.J., Graves, M., Bullmore, E.T., Murray, R.M., Structural brain abnormalities in male schizophrenics reflect fronto-temporal dissociation. Psychol. Med. 27,
Wright IC, Sharma T, Ellison ZR, McGuire PK, Friston KJ, Brammer MJ, Murray RM, Bullmore ET. Supra-regional brain systems and the neuropathology of schizophrenia. Cereb Cortex. 1999;9:
Wright IC, Rabe-Hesketh S, Woodruff PWR, David AS, Murray RM, Bullmore ET. Meta-analysis of regional brain volumes in schizophrenia. Am J Psychiatry. 2000;157:
Yang, H., Liu, J., Sui, J., Pearlson, G., Calhoun, V., A hybrid machine learning method for fusing fMRI and genetic data: combining both improves classification of schizophrenia. Hum. Neurosci. 4, 192 (2010).
Yoon, U., Lee, J. M., Im, K., Shin, Y. W., Cho, B. H., Kim, I. Y., Kim, S. I. (2007). Pattern classification using principal components of cortical thickness and its discriminative pattern in schizophrenia. NeuroImage, 34(4),
Zanetti, M. V, Schaufelberger, M. S., Doshi, J., Ou, Y., Ferreira, L. K., Menezes, P. R., Busatto, G. F. (2013). Neuroanatomical pattern classification in a population-based sample of first-episode schizophrenia. Progress in Neuro-Psychopharmacology & Biological Psychiatry, 43,
Zarogianni, E., Moorhead, T. W. J., & Lawrie, S. M. (2013). Towards the identification of imaging biomarkers in schizophrenia, using multivariate pattern classification at a single-subject level. NeuroImage. Clinical, 3,

Appendices

Supplementary Paper 1: Hippocampal (subfield) volume and shape in relation to cognitive performance across the adult lifespan

Aristotle N. Voineskos MD, PhD, FRCP(C) 1,2,3*, Julie Winterburn BEng 1,4,5*, Daniel Felsky BSc 1,3, Jon Pipitone MSc 1, Tarek K. Rajji MD, FRCP(C) 2,3, Benoit H. Mulsant MD, MS, FRCP(C) 2,3, M. Mallar Chakravarty PhD 4,5

*Aristotle Voineskos and Julie Winterburn contributed equally to the manuscript

1 Kimel Family Translational Imaging Genetics Laboratory, Research Imaging Centre, Campbell Family Mental Health Institute, Centre for Addiction and Mental Health
2 Geriatric Mental Health Service and Campbell Family Mental Health Institute, Centre for Addiction and Mental Health
3 Department of Psychiatry and Institute of Medical Science, University of Toronto
4 Institute of Biomaterials and Biomedical Engineering, University of Toronto
5 Computational Brain Anatomy Laboratory, Douglas Institute, McGill University

Correspondence should be addressed to: Dr. Aristotle Voineskos, Kimel Family Translational Imaging-Genetics Laboratory, Research Imaging Centre, Centre for Addiction and Mental Health, 250 College St, Room 105, M5T1R8, Toronto, Canada. Aristotle.Voineskos@camh.ca

Keywords: Aging, MRI, morphometry, cognition, memory, hippocampus

Running Title: Hippocampal volume, shape, and age-related cognitive performance
5 Tables
6 Figures
Supplementary Material
Word Count: 6,

Abstract

Newer approaches to characterizing hippocampal morphology can provide novel insights regarding cognitive function across the lifespan. We comprehensively assessed the relationships among age, hippocampal morphology, and hippocampal-dependent cognitive function in 137 healthy individuals across the adult lifespan (18-86 years of age). They underwent MRI, cognitive assessments, and genotyping for Apolipoprotein E status. We measured hippocampal subfield volumes using a new multi-atlas segmentation tool (MAGeT-Brain) and assessed vertex-wise (inward and outward displacements) and global surface-based descriptions of hippocampal morphology. We examined the effects of age on hippocampal morphology, as well as the relationship among age, hippocampal morphology, and episodic and working memory performance. Age and volume were modestly correlated across hippocampal subfields. Significant patterns of inward and outward displacement in the hippocampal head and tail were associated with age. The first principal shape component of the left hippocampus, characterized by a lengthening of the antero-posterior axis, was prominently associated with working memory performance across the adult lifespan. In contrast, no significant relationships were found among subfield volumes and cognitive performance. Our findings demonstrate that hippocampal shape plays a unique and important role in hippocampal-dependent cognitive aging across the adult lifespan, meriting consideration as a biomarker in strategies targeting the delay of cognitive aging.

Introduction

Despite the predominant view of age-related structural susceptibility of the hippocampus and its role in episodic memory (Squire LR 1992; Tulving E 2002; Van Petten C 2004) and working memory (Beauchamp MH et al. 2008; Yonelinas AP 2013) performance, there are conflicting results in the literature on both counts. While some studies have shown a relationship of volume with age, others have not (Sullivan EV et al. 1995; Jack CR, Jr. et al. 1997; Jernigan TL et al. 2001; Pruessner JC et al. 2001; Head D et al. 2005; Jernigan TL and AC Gamst 2005; Sullivan EV et al. 2005; Lupien SJ et al. 2007; Malykhin NV et al. 2008). These inconsistent results may be due to the use of volumetric measurements focused on the whole hippocampus or on its anterior or posterior sections (Van Petten C 2004; Raz N and KM Rodrigue 2006; Shing YL et al. 2011).

In recent years, approaches indexing hippocampal subfield volume and shape have provided novel alternatives for characterization of hippocampal morphology in vivo (Yushkevich PA et al. 2009; Yushkevich PA, H Wang et al. 2010; Winterburn JL et al. 2013; Yang X et al. 2013). Since the hippocampus is not a homogeneous structure, application of these new approaches may help to address the inconsistent findings from more conventional measurements of hippocampal structure (Mueller SG et al. 2007; Mueller SG and MW Weiner 2009; La Joie R et al. 2010; Yang X et al. 2013). They may also help to clarify whether age-related cognitive decline is associated with changes in hippocampal morphology.

Similar to studies of hippocampal volume and age, studies of the relationships of hippocampal volume with episodic and working memory performance, particularly in the context of aging, have not been consistent (Della-Maggiore V et al. 2002; Van Petten C 2004; Raz N and KM Rodrigue 2006). Higher-field scanners and novel segmentation approaches enable studies examining the relationship both of hippocampal subfield volume with age (Mueller SG et al. 2007; Mueller SG and MW Weiner 2009; La Joie R et al. 2010), and of subfield volume with cognitive performance (Shing YL et al. 2011; Engvig A et al. 2012). Hippocampal subfields seem susceptible to age-related effects, although not consistently across studies, with findings overlapping with changes observed in pathological aging, in particular in the CA1 subregion (Mueller SG et al. 2007; Mueller SG and MW Weiner 2009; Mueller SG et al. 2010; Kerchner GA et al. 2012).

Recent studies of hippocampal shape metrics have demonstrated differences between normal and pathological aging (particularly Alzheimer's disease) (Carmichael O et al. 2012; Kerchner GA et al. 2012; Tondelli M et al. 2012). Shape metrics may offer classification or distinction among groups beyond conventional volumetric measures of the hippocampus (Achterberg HC et al. 2013), and predict conversion to disease (Shen KK et al. 2012). Volume metrics are global measures of an entire structure or substructure (e.g., a single-number representation), whereas shape metrics provide more nuanced, multidimensional, and local descriptions of difference (e.g., an entire mesh of displacements). This capacity for nuance may lead to better discrimination (Raznahan A et al. 2014; Shaw P et al. 2014b). However, the relationship between hippocampal shape metrics and cognitive function across adult life is not known.

We have built a freely available (imaging-genetics.camh.ca/hippocampus) high-resolution hippocampal atlas that reliably delineates subfields (Winterburn JL et al. 2013). In combination with the Multiple Automatically Generated Template (MAGeT-Brain) segmentation approach, we have shown that this atlas
can be successfully used to segment the hippocampus and its subfields in a population-based study (Pipitone J et al. 2014). The combination of these approaches permits high-throughput analysis of hippocampal morphology in a large sample. Within the present investigation, we also developed an approach that used our subfield atlas to assess both local and global hippocampal shape. Taken together, these approaches allow for a comprehensive investigation of hippocampal morphology.

Therefore, we applied these approaches in a relatively large sample (n=137) of healthy individuals across the adult lifespan (age range: 18-86) to determine relationships among age, hippocampal morphology, and episodic and working memory performance. We also analyzed the effects of Apolipoprotein E (APOE) genetic variation, because an effect of APOE on hippocampal volume has been reported by some (Mueller SG and MW Weiner 2009) but not all studies (Morra JH et al. 2009). We hypothesized that: i) all subfields would decrease in volume with age; ii) hippocampal shape would change with age; and iii) hippocampal shape and subfield volumes would be associated with hippocampal-dependent cognitive function.

Methods

Subjects

A total of 137 healthy volunteers (mean (SD) age: 45.4 (19.0); range: 18-86) were recruited at the Centre for Addiction and Mental Health (CAMH) in Toronto, Canada, as part of an ongoing neuroimaging, genetics, and cognition research program in neuropsychiatric disorders. Persons with previous head trauma and loss of consciousness, a neurological disorder, a history of primary psychotic disorder in a first-degree relative, current substance abuse (urine toxicology screens were obtained in all potential subjects), or a history of substance dependence were excluded from the study. All study procedures complied with the Declaration of Helsinki and were approved by the CAMH Research Ethics Board; all subjects provided written, informed consent. Subjects were assessed with the Edinburgh Handedness Inventory (Oldfield RC 1971), the Hollingshead Four-Factor Index of Socioeconomic Status (Hollingshead AB 1975), and the Wechsler Test of Adult Reading (WTAR) (Wechsler D 2001) for IQ. They completed the Structured Clinical Interview for DSM-IV-TR Axis I Disorders (First MB et al. 2001) to ensure they were free of neuropsychiatric disorders, and were screened for dementia using the Mini Mental State Examination (MMSE) (Folstein MF et al. 1975) (Table 1).

Neurocognitive Assessment

Each subject underwent a comprehensive neurocognitive battery. The present analyses focus on verbal episodic memory, visuospatial episodic memory, and working memory. Tests within the Repeatable Battery for the Assessment of Neuropsychological Status (Hobart MP et al. 1999) were used to assess verbal episodic memory ("list recall") and visuospatial episodic memory ("figure recall"). The Letter Number Span (LNS) test was used to assess verbal working memory performance. Complete cognitive data were available for 133 subjects (Table 2).

APOE ε4 Genotyping

APOE ε4 carrier status was obtained by combining genotypes at rs7412 and rs. These SNPs were genotyped directly using standard ABI TaqMan Assay-on-Demand protocols and
10% of sample genotypes were duplicated for quality control with 100% reliability. Genetic information was available for 133 subjects (Table 1).

Image Acquisition

T1-weighted magnetic resonance (MR) images were acquired for each subject using an 8-channel head coil on a 1.5 T GE Echospeed system (General Electric Medical Systems; Fairfield, Connecticut). Images were acquired using an axial inversion recovery-prepared spoiled gradient-recalled sequence with echo time 5.3 ms, repetition time 12.3 ms, time to inversion ms, flip angle 20°, and 1 excitation, for a total of 124 contiguous slices with 1.5 mm thickness and 0.78 mm × 0.78 mm in-plane voxel size.

Image Segmentation

The first step in the analysis was to segment the hippocampus and hippocampal subfields on the T1-weighted images of all subjects. Segmentations were performed using the MAGeT-Brain multi-atlas segmentation tool, which leverages the neuroanatomical variability of a subject population to boost segmentation accuracy (Figure 1) (Chakravarty MM et al. 2013; Park MT et al. 2014; Pipitone J et al. 2014). Five high-resolution (300 µm isotropic voxels) in vivo atlases of the hippocampus and hippocampal subfields were used for the automatic segmentation pipeline (Winterburn JL et al. 2013). These atlases include definitions for the right and left CA1, CA2/CA3, CA4/dentate gyrus, strata radiatum/lacunosum/moleculare (SR/SL/SM), and subiculum. For further information on the MAGeT-Brain segmentation pipeline, see the Supplementary Materials. Segmented images were inspected by one of the authors (JLW), an expert in hippocampal subfield segmentation, to ensure high segmentation quality. Total brain volume (TBV) was estimated using mincbeast (Eskildsen SF et al. 2013), an automated pipeline that is a part of the MINC toolkit (MNI, Montreal).

Morphometric Analyses

Model creation

Analysis of the shape characteristics of a structure can provide information that is neuroanatomically unique in relation to volumetric assessment (Csernansky JG et al. 2005; Zhao Z et al. 2008; Miller MI et al. 2009). As in most neuroimaging analyses, a common volumetric (Mazziotta JC et al. 1995) or surface-based (Raznahan A et al. 2014) model is required to facilitate group analyses. For the analyses presented in this manuscript, we derived a model that is the average neuroanatomical representation of all five expertly segmented atlases used as input to the MAGeT-Brain pipeline (referred to henceforth as the model). This model has superior contrast, signal, and definition when compared to a single atlas and provides a common space for analysis of surface-based metrics (see below). From this model we derived surface models of the hippocampus with ~10,000 vertices/hemisphere, which we used as input to the local (vertex-based) and global (point distribution model) analyses. See the Supplementary Materials for further details.

Local (univariate) morphometric analysis
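As a rough orientation for the displacement measure used in this analysis, the projection of a deformation vector onto the surface normal at a vertex can be sketched in a few lines of Python. This is a minimal illustration and not the pipeline used in the study: the function names, the input layout, and the sign convention (positive values for outward displacement, assuming outward-pointing normals) are assumptions made here. Averaging the projected displacements over transformations is equivalent, by linearity of the dot product, to projecting the averaged deformation vectors.

```python
import math


def normal_displacement(deformation, normal):
    """Signed length of one deformation vector along one surface normal."""
    length = math.sqrt(sum(c * c for c in normal))
    if length == 0.0:
        raise ValueError("surface normal must be non-zero")
    # Dot product with the unit normal gives the signed projection.
    return sum(d * n for d, n in zip(deformation, normal)) / length


def mean_vertex_displacements(deformation_sets, normals):
    """Average the per-vertex displacement over several transformations.

    deformation_sets: list of transformations, each a list of (x, y, z)
        deformation vectors, one per vertex (hypothetical input layout).
    normals: list of (x, y, z) surface normals on the model, one per vertex.
    """
    n_sets = len(deformation_sets)
    return [
        sum(normal_displacement(defs[i], normal) for defs in deformation_sets) / n_sets
        for i, normal in enumerate(normals)
    ]


# Toy example: two vertices, two transformations (made-up numbers).
normals = [(0.0, 0.0, 2.0), (1.0, 0.0, 0.0)]
deformation_sets = [
    [(0.5, 0.0, 1.0), (-0.2, 0.3, 0.0)],
    [(0.0, 0.0, 3.0), (-0.4, 0.0, 0.1)],
]
print(mean_vertex_displacements(deformation_sets, normals))
```

In this toy example the first vertex moves along its normal (outward under the assumed convention) and the second moves against it (inward).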

120 We used a surface-based metric proposed by Lerch et al. for analyzing hippocampal shape differences in a population (Lerch et al. 2008). This method, along with local surface area metrics, has been used recently in morphometric analyses of the striatum, thalamus, and pallidum (Chakravarty MM et al. 2014; Magon S et al. 2014; Raznahan A et al. 2014; Shaw P et al. 2014b). In the present work, we have refined this technique and applied it to hippocampal morphology in a multi-atlas framework. Briefly, this technique estimates the local shape differences of the hippocampus using surface displacements, and a univariate (vertex-wise) analysis is conducted using the surface-based representations of the left and right whole hippocampi created on the final model (described above). All possible combinations of nonlinear transformations for each subject were mapped and averaged to reduce noise and increase signal. Surface displacements at each vertex relative to the model were estimated between each deformation vector and the normal at each vertex on the surface using a dot product (Figure 2). See the Supplementary Materials for further details. Global (multivariate) morphometric analysis While measuring local surface displacement is one descriptor of morphometry, it does not describe global shape patterns for a structure. We used a point-distribution model of the vertices across all subjects and analyzed their variance using a principal component analysis (PCA) to explore global hippocampal shape (Cootes TF et al. 1995; Chakravarty MM et al. 2011). Using each possible transformation mapping each subject to each model we were able to derive a median surface-based representation of each hippocampus. All hippocampi were then normalized for 12-parameter linear dimensions (3 each of translations, rotations, scales, and 120

shears) and were then input into the PCA. Each PC represents a dominant shape-mode in the data, and the PC-score for each subject represents how strongly that subject's hippocampus loads on that PC.

Statistical Analysis

Volumetric analyses

A general linear model (GLM) was used to assess all relationships, and all tests were corrected for multiple comparisons using FDR (Benjamini Y and Hochberg Y. 1995). Comparisons surviving 5% FDR were deemed to be significant. The relationship between hippocampal volumes and age was assessed first, with sex, years of education, APOE ε4 carrier status, and total brain volume (TBV) included in the model. Next, the relationship between volume and cognitive scores (list recall, figure recall, and LNS scores) was assessed with age, sex, years of education, APOE ε4 carrier status, and TBV included in the model.

Local (univariate) morphometric analysis

First, a vertex-wise GLM was used to evaluate the relationship between age and local shape differences in the right and left whole hippocampus, with sex, years of education, and APOE ε4 carrier status included in the model. The shape metric is already normalized for TBV, so TBV was not included as a covariate. The effect of cognition was then evaluated in additional models including terms for cognitive test performance and co-varying for age, sex, years of

education, and APOE ε4 status. A 5% false discovery rate (FDR) correction was used to correct for multiple comparisons across the vertices (Genovese CR et al. 2002).

Global (multivariate) morphometric analysis

Global hippocampal shape was analyzed using a PCA. The relationship between PC score and cognitive performance was assessed using a GLM with age, sex, years of education, and APOE ε4 status included in the model. FDR was used to correct for multiple comparisons, and results that survived the 5% threshold were considered to be significant.

Results

Relationship of TBV and Hippocampal Volume

Total brain volume (TBV) was significantly, though weakly, correlated with total (right + left) hippocampal volume (R²=0.022, p<0.05), so it was included in all volumetric statistical models.

Relationship of Volumetric Measures and Age

All right and left whole hippocampal and subfield volumes were inversely related to age after correction for multiple comparisons, except for the right and left CA1 (R²=0.018, q=0.069 for both) (Figure 3). Table 3 summarizes the results of all volumetric analyses involving age and cognitive tests. Quadratic effects of age were assessed, but found to be not significant (R²<0.02,

q>>0.05). Among the subfields studied, the right and left CA1 showed the least prominent relationship with age, with a 9.8×10⁻⁴% and 1.0×10⁻³% decrease in volume per year, respectively (based on the mean subfield volumes of the population, co-varying for sex, education, APOE ε4 status, and TBV). The right and left CA2/CA3 showed the most prominent relationship with age, with a 2.6×10⁻³% and 3.6×10⁻³% decrease in volume per year, respectively. Age-by-sex interactions were also explored in all subfields, and were significant in the left CA2/CA3 (R²=0.025, p=0.037), but this interaction did not survive 5% FDR correction. No relationships between APOE ε4 status and subfield volume survived FDR correction. Given some investigations showing effects of APOE ε4 only in later decades of life (Felsky D and AN Voineskos 2013), analyses were repeated in a subset of the subjects older than 50 years of age, but results remained not significant in this subgroup.

Relationship of Local Morphometry and Age

Significant relationships between local shape of the whole hippocampus and age survived 5% FDR (right hippocampus: DOF=124, t=2.47; left hippocampus: DOF=124, t=2.56) (Figure 4). Table 4 summarizes all results of local morphometry analyses. Patterns of inward and outward displacement were mainly symmetric along both hippocampi. In both hemispheres, inward displacement was localized in the inferior and medial hippocampal head (anterior subiculum and inferior SR/SL/SM of the hippocampal uncus; Figure 4 i), the medial hippocampal body (along the CA1-CA4/DG border; Figure 4 ii), and the tip of the hippocampal tail (posterior CA1; Figure 4 iii). Outward displacement in both hemispheres was localized to the lateral hippocampal body

124 (CA1; Figure 4 iv), the lateral edge of the uncus in the hippocampal head (SR/SL/SM; Figure 4 v), and the base of the uncus (CA1; Figure 4 vi). In the right hemisphere only, outward displacements were also observed on the superior tail (a region including parts of the CA4/dentate gyrus, SR/SL/SM, and CA1; Figure 4 vii). Relationship of Local Morphometry with APOE ε4 Status Local morphological relationships based on APOE ε4 status were analyzed (Table 4). At the 5% FDR level, no relationships were significant; however relationships were found at a more liberal correction of 10% FDR (Figure 5). Inward surface displacements were observed in both the right (DOF=124, t=2.43) and left hippocampi (DOF=124, t=2.58) in ε4 carriers relative to non-carriers along the medial hippocampal head, especially the base of the hippocampal uncus (CA1; Figure 5 i), the antero-medial portion of the hippocampal body (CA2/CA3; Figure 5 ii), and the medial side of the hippocampal uncus (SR/SL/SM; Figure 5 iii). The area of significant inward displacement in the hippocampal head was larger in the right hemisphere than the left. Outward surface displacements were also observed in both hemispheres in the hippocampal head, particularly on the anterior portion of the inferior surface (subiculum; Figure 5 iv) and the inferior surface of the uncus (SR/SL/SM; Figure 5 v). Some outward displacements were also observed on the lateral surface of the uncus in both hemispheres (SR/SL/SM; Figure 5 vi). Relationship of Volumetric Measures with Episodic and Working Memory Performance 124

A greater left CA4/DG volume was associated with a higher figure recall score (R²=0.060, p=0.027) (Table 3), but this relationship did not survive FDR correction at the 5% level. Similarly, a greater right CA2/CA3 volume was associated with a higher LNS score (R²=0.037, p=0.019), but not after correcting for multiple comparisons. Direct correlations between cognitive scores and hippocampal volumes (no covariates) are reported in Table S1.

Relationship of Local Morphometry with Episodic and Working Memory Performance

No significant local shape differences were found that explained LNS, list recall, or figure recall performance after correction at 5% FDR (or at a more liberal 10% FDR threshold) (Table 4).

Relationship of Global Morphometry to Age and Memory Performance

The 1st PC of the whole left hippocampus (eigenvalue=4.92), which identifies an elongation along the anterior-posterior axis, explained 65% of the variance in the data and showed a significant linear relationship with performance on the LNS test (R²=0.066, q=0.021) (Table 5), even after FDR correction at the 5% level. The patterns of the shape distribution for the top three right and left PCs are shown in Figure 6. The 1st PC of the whole right hippocampus (eigenvalue=5.40) has a significant relationship with age (R²=0.045, p=0.0094), and the 3rd right PC (eigenvalue=3.05) has a significant relationship with figure recall performance (R²=0.041, p=0.015); however, neither of these associations survives FDR correction. All PC tests were repeated with only right-handed subjects, and the results remained significant (R²=0.050, q=0.030).

Discussion

We conducted a comprehensive examination of hippocampal morphology across the adult lifespan, and then assessed relationships among age, shape, and episodic and working memory performance. Relationships between age and hippocampal volumes were found, but not between cognitive scores and hippocampal volumes. Relationships were also found between age and hippocampal shape, using both local (univariate, vertex-wise) and global (multivariate, principal component analysis) indices of hippocampal shape. In addition, relationships were found between global hippocampal shape and cognition. These results suggest that hippocampal shape may be a more informative biomarker of age- and cognition-related effects on the hippocampus than subfield volume. After 5% FDR correction, none of the subfield volumes predicted episodic or working memory performance. In contrast, hippocampal shape analysis provided a significant relationship with working memory performance, consistent with prominent age-related effects on shape. The first principal component characterizing left hippocampal shape, an elongation along the antero-posterior axis of the left hippocampus, significantly predicted working memory performance across the adult lifespan. Overall, our findings suggest that people with healthy ("normal") cognitive aging have relatively preserved hippocampal subfield volumes. However, hippocampal shape appears to provide unique information regarding age and cognition effects on hippocampal morphometry. Specific shape changes may serve as a novel biomarker of working memory performance in a normal aging population.

Shape analysis using local vertex-wise metrics provided substantive findings in relation to age and cognitive performance. We found considerable age-related inward

displacement of the bilateral hippocampus, particularly at the hippocampal head, and some outward displacement bilaterally, particularly at the hippocampal body. These results are consistent with the only other adult lifespan study of hippocampal shape (Yang X et al. 2013). However, that study did not include cognitive data or APOE ε4 genotype data. With respect to cognitive data, we found that the first principal component explained most of the variance of left hippocampal shape (which shows a longer longitudinal axis), and significantly predicted working memory performance. Although working memory performance (a composite of executive function, attention, and memory) is classically considered a frontally-based task, considerable evidence supports a prefrontal-hippocampal circuit, or a prefrontal-parietal-hippocampal circuit (Oztekin I, B McElree et al. 2009), as one of the main neural mechanisms underlying working memory performance. Verbal working memory is also supported by a medial temporal lobe-prefrontal circuit (Oztekin I, CE Curtis et al. 2009). Therefore, a longer or wider hippocampal axis may provide more surface area for projections to cortical regions, which can support working memory performance.

The hippocampus has rich connectivity within the medial temporal lobe (Yassa MA et al. 2010), among subfields, and to the anterior thalamus (as part of the so-called Papez circuit) (Bezaire MJ and I Soltesz 2013; Bennett IJ et al. 2014), and may be influenced by many of the same mechanisms that shape the geometry of the neocortex. The surface area and the complexity of the human cerebral cortex have been postulated to result, in part, from the rate of neuronal proliferation and programmed cell death of neurons through the neurodevelopmental period. These neurons migrate from the ventricular zone, across the intermediate zone on a scaffold of radial glia, and then go on to differentiate into neuronal subtypes that allow for the laminar organization of the cerebral cortex (Rakic P 1988). Increased surface area has been hypothesized to be one of the substrates for increased short- and long-range cortical connectivity (Rakic P 1988; Van Essen DC 1997). Since the hippocampus is one of the best-preserved structures across vertebrates, it is likely to be partly governed by many processes similar to the ones described above (Eckenhoff MF and P Rakic 1984). Therefore, although purely speculative, the lengthening of the long axis of the hippocampus (which is likely to correspond to an increase in surface area) may also be indicative of enhanced intra- and/or extra-structural connectivity. The hippocampal head shows connections to the white matter in the amygdala and uncinate fasciculus and in other regions including the prefrontal cortex (Travis SG et al. 2014). Furthermore, posterior hippocampal activity has been correlated with cingulate, precuneus, and inferior parietal cortical activity (Travis SG et al. 2014). Future work investigating anatomical relationships between hippocampal shape and cortical regions may help further improve our understanding of network-based connectivity mechanisms underlying working memory performance.

Only modest relationships of hippocampal subfield volumes with age were found. In particular, there was an absence of a relationship between the CA1 subfield and age after correction for multiple comparisons. In contrast, others have reported a decreased CA1 subfield volume in people with Alzheimer's disease (Mueller SG et al. 2010; Kerchner GA et al. 2012) and mild cognitive impairment (Mueller SG et al. 2010; Pluta J et al. 2012) compared to healthy controls. Such a decrease has also been reported in older healthy individuals compared to younger healthy individuals (Mueller SG et al. 2007). Others have shown that age-related changes in CA1 may be dependent on the presence of hypertension (Shing YL et al. 2011), rather than being directly due to age itself. However, studies of older apparently healthy individuals typically include a heterogeneous group, with some subjects in a preclinical pathological aging stage: in these

subjects, neuroimaging can detect preclinical disease change, or predict onset of Alzheimer's disease using CA1 subfield volume (Apostolova LG et al. 2010; Devanand DP et al. 2012). CA1 volume data from our sample of healthy, carefully screened subjects provides support for preserved CA1 volume as a marker of healthy aging. In contrast, all other subfields had significant, although modest, negative relationships with age. This includes the CA4/DG subfield, which is typically considered to be spared from the effects of pathological aging, at least in early phases, yet is consistently shown to decrease in volume in healthy aging (Mueller SG and MW Weiner 2009; Small SA et al. 2011). Furthermore, unlike the CA1 volume, DG volume does not appear to be a predictive marker of conversion to Alzheimer's disease. One could view the similar rates of decrease across these subfield regions in our sample as supporting potentially shared mechanisms of age-related change in these subfields. However, modest, rather than strong, relationships with age are not surprising given that cell numbers are preserved in normal aging in humans, nonhuman primates and rodents in the principal cell types of the hippocampus (granule cells, CA1 and CA3 pyramidal cells) (Samson RD and CA Barnes 2013).

We were surprised to find only a weakly significant correlation between total brain volume (TBV) and hippocampal volumes. While this may be a somewhat unexpected finding, it is difficult to compare this result to other results in the field due to methodological differences in brain volume correction, the use of total intra-cranial volumes (rather than TBV) in many cases, and the infrequent explicit examination of the correlation of total brain and hippocampal volumes (Fjell AM et al. 2013). Many studies use the intracranial volume (ICV) measure similar to the total intracranial volume (TIV) measure developed by Buckner and colleagues, which is

well-correlated with manually segmented volumes that account, in part, for TBV using an automated estimate (Buckner RL et al. 2004). In Buckner's work, hippocampal volume is shown to be directly correlated with the TIV (Buckner RL et al. 2004). However, other more recent publications that use this measure do not explicitly report correlations among ICV, brain volume, and hippocampal volumes (Fjell AM et al. 2013; Krogsrud SK et al. 2014). In publications where these correlations are not directly reported, it is possible to infer the presence or absence of a relationship among these variables based on whether volumetric relationships of the hippocampus with age do or do not change following normalization for TBV. For instance, in a recent paper by Li and colleagues, normalization by TBV did not change the direction of the slopes of the hippocampal volume vs. age relationship in a lifespan analysis (Li W et al. 2014). However, in other studies, the inclusion of brain volume in the analysis completely changes the direction of the effect, suggesting correlation between the measures (Maller JJ et al. 2006).

With respect to APOE genotype, we were somewhat surprised that we did not find any significant effect of ε4 carrier status on hippocampal and subfield volume and only trend-level relationships (10% FDR) with respect to shape. Our findings align with some groups (Morra JH et al. 2009), but not with others (Mueller SG and MW Weiner 2009). It may be that effects of ε4 status are more prominent in studies that include only late-life participants, whereas we studied people across the adult lifespan (although there are studies that demonstrate ε4 effects in early adult life (Felsky D and AN Voineskos 2013; Nichols LM et al. 2012)).

131 It should be noted that other groups that perform manual and automated segmentation of the hippocampal subfields have done so using T2-weighted (Mueller SG et al., 2007, 2009, 2010 ; Wisse LEM et al., 2012, 2014 ; Kerchner GA et al., 2010, 2013 ; Pluta J et al., 2012; Winterburn JL et al. 2013) or proton density images (La Joie R et al., 2010, 2013 ; Shing YL et al., 2011; de Flores R et al., 2014 ; Raz N et al., 2014). While this has been an extremely useful addition to the field, many of these techniques only segment a small subset of the hippocampus along the anterior-posterior axis (Mueller SG et al., 2007, 2009, 2010; Kerchner GA et al., 2010, 2013; Raz N et al., 2014). In addition, many of these acquisitions are high resolution in the coronal plane but are acquired with a slice thickness of 2-3 mm. While we previously demonstrated the feasibility of automated segmentation of the subfields on standard T1-weighted segmentations through a simulation (Pipitone J et al. 2014), we further note that these experiments in the original validation used images with 0.9 mm isotropic voxels measured at 3T. Nonetheless, previous studies from our group (Pipitone J et al., 2014; Treadway MT et al., 2015) and by others who have adopted the Winterburn protocol demonstrate the feasibility of its implementation in a wide range of applications (Iglesias JE et al. 2015; Winterburn JL et al. 2013). Our study was the first, to our knowledge, to assess the relationship of either hippocampal subfield volumes or hippocampal shape metrics with cognitive performance across the adult lifespan. The different aspects of cognitive performance that we assessed were susceptible to the effects of age to varying degrees. However, we did not find significant age-specific effects of whole hippocampal or subfield volumes on these cognitive functions. Only modest variance in visuospatial memory performance was explained by the CA4/DG subfield (e.g., the left CA4/DG 131

explained 6% of the variance) and these findings did not survive corrections for multiple comparisons. Likewise, the right CA2/CA3 subfield explained 3.7% of the variance in working memory, but the relationship did not survive correction for multiple comparisons. Both animal studies and more recent human neuroimaging studies have attempted to clarify the function of individual subfields in relation to memory performance. For instance, the DG and CA3 have been shown to support pattern separation and completion, functions that underlie visuospatial memory (Leutgeb JK et al. 2007; Bakker A et al. 2008; Yassa MA et al. 2010). This demonstrated function of these subfields aligns with our finding of DG subfield volume providing modest explanation of the variance in visuospatial memory performance. Subjects were asked to draw the Rey-Osterrieth figure from memory, which is a complicated diagram requiring pattern separation and completion ability. On the other hand, novelty detection and allocentric encoding are considered functions of the CA1 subfield (Suthana NA et al. 2009). We did not, however, find a relationship between CA1 volume and any type of memory performance. It is worth noting that much of our knowledge of specific subfield function emerges from mouse and rat studies (Mizumori SJ et al. 1989; Brun VH et al. 2002; Brun VH et al. 2008). The standard paper and pencil neurocognitive tests in humans that we used may not be adequately designed to reflect some of those functions, such as allocentric encoding.

The relationships between hippocampal shape and cognitive performance were consistent in both younger and older adults. Others have shown that compensatory neural mechanisms help ensure normal cognitive performance in healthy aging populations (Raz N et al. 2005; Raz N and KM Rodrigue 2006; Sullivan EV and A Pfefferbaum 2006; Voineskos AN et al. 2012), rather than preservation of regions important for cognitive performance in young adult life. It is possible that

our subjects who were in their seventh, eighth, or ninth decade of life, without experiencing mild cognitive impairment or a dementia, may exhibit compensatory brain change in other regions. There is considerable evidence for such changes in the cortex of healthy older individuals (Raz N et al. 1997; Raz N et al. 2005). Other compensatory changes can also occur in white matter fiber connections (Voineskos AN et al. 2012), which can help ensure normal cognitive performance in aging, particularly in the executive function/working memory domain (Sullivan EV and A Pfefferbaum 2006; Voineskos AN et al. 2012).

Some limitations of our study deserve consideration. Hippocampal subfield definition and measurement remain ongoing sources of disagreement and technical challenge, respectively (Yushkevich PA et al. 2009; Yushkevich PA et al. 2010; Yushkevich PA et al. 2015; Winterburn JL et al. 2013). We used a CA2/CA3 subfield and a CA4/DG subfield among our subfield classifications, and challenges in differentiating CA3 from the DG have been described, with CA2/CA3 and DG often considered together (Chakeres DW et al. 2005; Carr VA et al. 2010). Given the role of CA3 in pattern completion and rapid and flexible acquisition of spatial memories (Lavenex P and P Banta Lavenex 2013), it is possible that the CA3 subfield may in part contribute to our finding that the CA4/DG subfield explained 6% of the variance in visuospatial memory performance. However, we have previously demonstrated that the CA4/DG subfield can be segmented with excellent reliability (Winterburn JL et al. 2013). Also, the very existence of a CA4 as defined by Duvernoy (Duvernoy HM 2005) is widely debated within the field, and is often included as the dentate hilus (Adler DH et al. 2014). Although we have previously demonstrated good reliability of subfield segmentations, some regions such as the CA2/CA3 region are less reliably segmented (Winterburn JL et al. 2013); this is not

134 surprising due to its overall size and thickness in comparison to the resolution of standard T1- weighted MR imaging data. Thus, for those subfields, use of a higher field scanner might have further improved resolution, and thus subfield segmentation accuracy. In turn, discovery of relationships among subfield volume and cognitive performance may have been facilitated. Although relationships of hippocampal subfield volume and shape with memory performance that we did find were consistent across the adult lifespan, our study was cross-sectional and not longitudinal. While our older individuals were no different from our younger individuals in sex and education, cross-sectional studies may be subject to a cohort effect. Furthermore, it remains an open question of whether variability of brain morphometry and cognitive measures is greater in late adult life compared to early adult life (Raz N and U Lindenberger 2011; Salthouse TA 2011). These factors might limit the capability of a study such as ours to determine associations between brain morphometry and cognitive performance. Other studies have found more pronounced differences in subfield volume or shape in older vs. younger individuals, or relationships with age of greater negative magnitude (Mueller SG et al. 2007; Mueller SG and MW Weiner 2009; Yang X et al. 2013). One possible explanation for our findings of a more limited relationship between age and hippocampal subfield volumes may be the preserved health and cognition of our older subjects, as documented by a detailed assessment. Therefore, one interpretation is that preserved hippocampal subfield volumes may be important for successful cognitive aging. Our finding that the brain-behaviour relationships in the hippocampus in our sample are similar in our younger and older subjects further supports this interpretation. 134

135 The interpretation of results based on local and global shape measures may not be as intuitive as those based on volumetric measures. Volume measures are sometimes thought of as a proxy for total neuronal or overall cell counts; however, there is little evidence for this in the human MR literature and limited supporting evidence in the murine MR literature (Lerch JP et al. 2008; Lerch JP et al. 2011). There has been support in recent reports that the shape of subcortical structures represents a neurodevelopmental phenotype (Chakravarty MM et al. 2015; Raznahan A et al. 2014; Shaw P et al. 2014a; Shaw P et al. 2014b). However given the variability of brain anatomy in the late stages of life and the potential influence of genetic, lifestyle, and environmental factors, it is unclear whether the shape descriptions in our manuscript can truly be considered neurodevelopmental. In summary, our study is the first examination of the relationship among hippocampal subfield volumes, hippocampal shape, and memory performance across the adult lifespan. We observed only modest relationships between subfield volumes and episodic memory performance. In contrast, characteristics of hippocampal shape emerged as the most powerful predictors of hippocampal-dependent cognitive performance. Several interventions have been shown to change hippocampal structure and function (Pajonk FG et al. 2010; Erickson KI et al. 2011; Engvig A et al. 2012; Kuhn S et al. 2013). Thus, our findings may help to select the interventions that specifically preserve aspects of hippocampal shape important for working memory performance. Such a strategy to attempt to prolong healthy aging and delay the onset of dementia would not be an exclusive one, but rather complementary to other strategies that 135

136 currently target compensatory structures or networks that are involved in successful cognitive aging. Funding This work was supported in part by the CAMH Foundation thanks to the Kimel Family, Koerner New Scientist Award, and Paul E Garfinkel New Investigator Catalyst Award, as well as the Canadian Institutes of Health Research, Brain Canada, Weston Foundation, Michael J. Fox Foundation, Alzheimer Society of Canada, Ontario Mental Health Foundation, Brain and Behavior Research Foundation and NIMH grants R01MH and R01MH These organizations did not play a role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; and preparation, review, or approval of the manuscript; and decision to submit the manuscript for publication. ANV, JW, and MMC had full access to all the data in the study and take responsibility for the integrity of the data and the accuracy of the data analysis. Conflict of Interest Disclosures/ Acknowledgements: BHM has received research support in the form of medications for NIH-funded clinical trials from Bristol-Myers Squibb, Pfizer and Eli-Lilly. All other authors declare no conflict of interest. 136

137 Figures & Figure Captions Figure 1: MAGeT-Brain image registration and automatic segmentation pipeline: i) 5 manually labeled atlases are registered non-linearly to a subset of the subject population (templates, in this case 20). ii) Each image from the template library is registered to every subject image. iii) Labels are propagated along each possible registration pathway such that there are 100 (5 atlases x 20 templates) candidate segmentations for each subject. iv) Creation of a single fused label for each subject via voxel-wise majority voting. v) A single model image is created from the 5 atlas images, and a single surface-based representation of the hippocampus is created. vi) The model surface is propagated along the MAGeT-Brain registration pathways. vii) A single surface is created for each subject from the 100 candidate surfaces by determining the median position of each vertex. Figure 2: Surface-based local morphometric analysis: Projection of the deformation vector at each vertex onto the unit vector of the surface normal at the same vertex (dot product) to determine the magnitude of displacement in the direction perpendicular to the surface at each vertex. Figure 3: Negative relationships of: a) whole right and left hippocampus with age; and b) all right and left hippocampal subfields (CA1, CA2/CA3, CA4/DG, SR/SL/SM, subiculum) with age. All analyses have 134 degrees of freedom. *Indicates significant negative relationship of 137

volume with age (with sex, years of education, APOE ε4 carrier status, and total brain volume included in the model) after 5% FDR correction.

Figure 4: Relationships between right and left whole hippocampal shape and age (with sex, years of education, and APOE ε4 carrier status included in the model). Blue colour maps on the hippocampal surfaces indicate inward displacement after 1%-5% FDR correction; red colour maps indicate outward displacement after 1%-5% FDR correction. In both hemispheres, inward displacement was localized in i) the inferior and medial hippocampal head (anterior subiculum and inferior SR/SL/SM of the hippocampal uncus); ii) the medial hippocampal body (along the CA1-CA4/DG border); and iii) the tip of the hippocampal tail (posterior CA1). Outward displacement in both hemispheres was localized to iv) the lateral hippocampal body (CA1); v) the lateral edge of the uncus in the hippocampal head (SR/SL/SM); and vi) the base of the uncus (CA1). vii) In the right hemisphere only, outward displacements were also observed on the superior tail (a region including parts of the CA4/dentate gyrus, SR/SL/SM, and CA1).

Figure 5: Relationships between right and left whole hippocampal shape and APOE ε4 status (with age, sex, and years of education included in the model) in ε4 carriers relative to non-carriers. Blue colour maps on the hippocampal surfaces indicate inward displacement after 10%-15% FDR correction; red colour maps indicate outward displacement after 10%-15% FDR correction. Inward surface displacements were observed in ε4 carriers relative to non-carriers along the medial hippocampal head, especially i) the base of the hippocampal uncus (CA1); ii) the anteromedial portion of the hippocampal body (CA2/CA3); and iii) the medial side of the hippocampal

uncus (SR/SL/SM). Outward surface displacements were also observed in both hemispheres in the hippocampal head, particularly on iv) the anterior portion of the inferior surface (subiculum); and v) the inferior surface of the uncus (SR/SL/SM). vi) Some outward displacements were also observed on the lateral surface of the uncus in the left hemisphere (SR/SL/SM).

Figure 6: Shape principal components (PCs) of the right and left whole hippocampus viewed from above (axial). The reference shape is from the hippocampal model (created from 5 manually-segmented atlases), and −ve and +ve indicate negative and positive PC contributions to the hippocampal model, respectively. The 1st PC on the left side explained 65% of the variance in the data, and showed a significant linear relationship with LNS test performance (R²=0.066, q<0.021). This PC represents a lengthening along the anterior-posterior axis (results have been normalized for volume). The 1st right PC explained 74% of the variance in the data, and showed a significant relationship with age (R²=0.045, p=0.0094), but this did not survive correction for multiple comparisons. This PC also represents a lengthening along the anterior-posterior axis of the hippocampus. The 3rd right PC explained 7.5% of the variance in the data, and showed a significant relationship with figure recall performance (R²=0.041, p=0.015), but this relationship did not survive correction for multiple comparisons. This PC represents a thickening along the medial-lateral axis of the hippocampus.
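The point-distribution PCA summarized in the Figure 6 caption can be illustrated with a minimal sketch. It assumes that each subject's surface has already been aligned (the 12-parameter normalization is not reproduced here) and flattened into one list of vertex coordinates. The function name first_shape_mode, the power-iteration solver, and the toy data are illustrative assumptions, not the pipeline used in this work.

```python
import math


def first_shape_mode(shapes, iterations=200):
    """Return (mode, scores, explained) for the leading shape mode.

    shapes: one flattened coordinate list per subject, all the same length.
    mode: unit vector in coordinate space; scores: one PC score per subject;
    explained: fraction of total variance carried by this mode.
    """
    n, dim = len(shapes), len(shapes[0])
    mean = [sum(s[j] for s in shapes) / n for j in range(dim)]
    centred = [[s[j] - mean[j] for j in range(dim)] for s in shapes]

    # Subject-by-subject Gram matrix; it shares its nonzero eigenvalues
    # with the (much larger) coordinate covariance matrix.
    gram = [[sum(a * b for a, b in zip(ci, cj)) / (n - 1) for cj in centred]
            for ci in centred]
    total_variance = sum(gram[i][i] for i in range(n))
    if total_variance == 0:
        raise ValueError("all shapes are identical")

    # Power iteration for the leading eigenvector (non-constant start vector).
    vec = [1.0 + i for i in range(n)]
    for _ in range(iterations):
        new = [sum(gram[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in new))
        vec = [x / norm for x in new]
    eigenvalue = sum(vec[i] * sum(gram[i][j] * vec[j] for j in range(n))
                     for i in range(n))

    # Map the subject-space eigenvector back to coordinate space.
    mode = [sum(vec[i] * centred[i][j] for i in range(n)) for j in range(dim)]
    mode_norm = math.sqrt(sum(m * m for m in mode))
    mode = [m / mode_norm for m in mode]
    scores = [sum(c[j] * mode[j] for j in range(dim)) for c in centred]
    return mode, scores, eigenvalue / total_variance


# Toy data (made up): four "subjects" varying along a single direction.
toy_shapes = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
mode, scores, explained = first_shape_mode(toy_shapes)
```

Working in subject space keeps the matrix small when there are far more vertex coordinates than subjects, which is the usual situation for surfaces with thousands of vertices.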

Figure 1:

Figure 2:

Figure 3:

Figure 4:

Figure 5:

Figure 6:

Tables

Table 1: Demographic characteristics of subjects

Demographic          Mean (SD)
Age                  (19.02)
Education (years)    (1.95)
WTAR (IQ)            (7.83) +
MMSE                 (0.92) ++

Demographic          N
Gender               72M, 65F
Handedness           129R, 9L
APOE ε4              35C (4 homozygous), 98NC +++

WTAR = Wechsler Test of Adult Reading
MMSE = Mini Mental State Examination
C = carrier; NC = non-carrier
NA = Test not administered
+ 3 NA; ++ 2 NA; NA

Table 2: Effect of age on cognition based on a general linear model (with sex included in the model). R² values are adjusted and apply to the effect of age only. * Indicates p<0.05.

Cognitive Task                  Mean Score (SD)    p           R²
Letter Number Sequence (LNS)    (3.35)             0.021*
List Recall                     7.20 (2.15)        <0.0001*    0.14
Figure Recall                   (4.16)             <0.0001*

Table 3: Summary of all volumetric statistical tests. R² values are adjusted and apply to the variable of interest only. * Indicates significance before/after 5% FDR correction (p/q<0.05). TBV = total brain volume.

Variable of Interest  Covariates                                Hemisphere  Structure  R²   p    q
Age                   Sex, Education, APOE ε4 status, TBV       Right       CA1
                                                                            Subiculum       *    0.028*
                                                                            CA4/DG          *    *
                                                                            CA2/CA3         *    *
                                                                            SR/SL/SM        *    *
                                                                            Whole           *    *
                                                                Left        CA1
                                                                            Subiculum       *    0.040*
                                                                            CA4/DG          *    *
                                                                            CA2/CA3         *    *
                                                                            SR/SL/SM        *    *
                                                                            Whole           *    *
List Recall           Age, Sex, Education, APOE ε4 status, TBV  Right       CA1
                                                                            Subiculum
                                                                            CA4/DG
                                                                            CA2/CA3
                                                                            SR/SL/SM
                                                                            Whole
                                                                Left        CA1
                                                                            Subiculum
                                                                            CA4/DG
                                                                            CA2/CA3
                                                                            SR/SL/SM
                                                                            Whole
Figure Recall         Age, Sex, Education, APOE ε4 status, TBV  Right       CA1
                                                                            Subiculum
                                                                            CA4/DG
                                                                            CA2/CA3
                                                                            SR/SL/SM
                                                                            Whole
                                                                Left        CA1
                                                                            Subiculum
                                                                            CA4/DG          *    0.44
                                                                            CA2/CA3
                                                                            SR/SL/SM
                                                                            Whole
LNS                   Age, Sex, Education, APOE ε4 status, TBV  Right       CA1
                                                                            Subiculum
                                                                            CA4/DG
                                                                            CA2/CA3         *    0.44
                                                                            SR/SL/SM
                                                                            Whole
                                                                Left        CA1
                                                                            Subiculum
                                                                            CA4/DG
                                                                            CA2/CA3
                                                                            SR/SL/SM
                                                                            Whole

Table 4: Summary of all local morphometry statistical tests. All tests had 124 degrees of freedom. Note: Although 5% FDR was set as the threshold for significance throughout this manuscript, 10% FDR is presented for the APOE ε4 and cognition results, as only trend-level results survive correction for multiple comparisons in these tests.

Variable of Interest           Covariates                           Structure         FDR  t-stat
Age                            Sex, Education, APOE ε4 status       Right Whole       5%   2.47
Age                            Sex, Education, APOE ε4 status       Left Whole        5%   2.56
APOE ε4 status                 Age, Sex, Education                  Right Whole       10%  2.43
APOE ε4 status                 Age, Sex, Education                  Left Whole        10%  2.58
List Recall/Figure Recall/LNS  Age, Sex, Education, APOE ε4 status  Right/Left Whole  10%  NA

Table 5: Summary of principal components on the right and left whole hippocampus from the shape analysis. * Indicates significance after 5% FDR correction (q<0.05).

Principal Component  Eigenvalue  Variance Explained (%)  Linear Variables  R²   p    q
Right PC1                                                Age
Right PC2
Right PC3                                                Figure Recall
Left PC1                                                 LNS                         *
Left PC2
Left PC3
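The 5% and 10% FDR thresholds reported in the tables above can be illustrated with a short sketch. It assumes the Benjamini-Hochberg step-up procedure, which the tables do not name explicitly; the function name and the p-values in the usage note are hypothetical and are not taken from this study.

```python
import numpy as np

def benjamini_hochberg(p_values, q=0.05):
    """Return (rejected, q_values) for a set of p-values.

    Assumes the Benjamini-Hochberg step-up procedure; q is the
    FDR threshold (e.g. 0.05 for 5% FDR, 0.10 for 10% FDR).
    """
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)                      # ascending p-values
    ranked = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity: q_(i) = min over j >= i of p_(j) * m / j
    q_sorted = np.minimum.accumulate(ranked[::-1])[::-1]
    q_sorted = np.clip(q_sorted, 0.0, 1.0)
    q_values = np.empty(m)
    q_values[order] = q_sorted                 # back to input order
    return q_values <= q, q_values
```

For example, `benjamini_hochberg([0.5, 0.001, 0.04, 0.01], q=0.05)` marks the second and fourth tests as surviving correction; rerunning with `q=0.10` relaxes the threshold in the way the 10% FDR rows do.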


Supplementary Materials

Volumetric Analysis

MAGeT-Brain Segmentation

MAGeT-Brain was configured using a template library composed of 20 images selected from the unlabeled subjects to match the age range of the entire sample. Each atlas label is first propagated to each template image using an estimated non-linear transformation that matches each atlas to each template. Registration was carried out using the Advanced Normalization Tools (ANTs; Avants BB et al. 2008) for MINC-formatted images (McConnell Brain Imaging Centre, Montreal Neurological Institute, McGill University). Based on a five-image atlas library, each template image therefore received five separate labels (one from each atlas). Labels from each template image are then similarly propagated to each unlabeled subject image, resulting in 100 candidate labelings for each subject (5 atlases × 20 templates). To limit errors due to resampling or registration, voxel-wise majority voting was then used to fuse the candidate labels for each subject into a single consensus label (Park MT et al. 2014; Pipitone J et al. 2014).

Morphometric Analysis

Model Creation

The model was created using atlas-creation methods described previously (Borghammer P et al. 2010; Dorr AE et al. 2008; Frey S et al. 2011). Briefly, one of the five atlases is selected as the target, and the other four atlases are registered to this target using a 6-parameter (3 translations and 3 rotations) linear registration (carried out using ANTs for MINC-formatted images). These images are then registered to each other on a pairwise basis using a 12-parameter linear registration (3 translations, 3 rotations, 3 scales, and 3 shears), and resampled to normalize each image for average linear brain size. The resampled images are then averaged to create an initial model (M0). Next, the resampled images are non-linearly registered to M0 using ANTs, resampled again, and averaged to create M1. This step is repeated twice more, with each additional step improving the accuracy of the model in representing the mean anatomy of the five original atlases. Using the atlas labels (for right and left whole hippocampus and all subfields), surface-based representations of the hippocampus and its subfields were defined on the final model (M3) using the marching cubes algorithm in the Display software (part of the MINCtools package). To make data processing steps more feasible, the surfaces were subsampled and smoothed using the AMIRA software package (Visage Imaging; San Diego, CA, USA), from ~10,000 vertices/hemisphere to 1329 and 1328 vertices for the right and left whole hippocampal surfaces, respectively.

Local (vertex-wise) morphometric analysis

To determine surface displacement, all of the atlas-to-subject non-linear transformations from the MAGeT-Brain algorithm were concatenated with the atlas-to-model non-linear transformations from the model creation process to create model-to-subject transformations (resulting in a total of 100 possible deformation fields). These transformations were averaged into a single non-linear transformation for each model-to-subject pathway to reduce noise in the transformation and to increase precision and accuracy (Holmes CJ et al. 1998). To determine the shape difference at corresponding vertices of a subject's hippocampus and the model hippocampal surface, the dot product of the unit vector lying normal to the model surface at each vertex and the vector from the final averaged non-linear deformation field at that same vertex was evaluated. This determined the magnitude of local inward or outward displacement in the direction normal to the model surface at each vertex (Figure 2). These displacements, by definition, are normalized for brain volume. In addition, any residual global linear effects from the non-linear transformation have been modeled and removed.

Global (multivariate) morphometric analysis

The first step in this analysis was to obtain a surface-based representation of the hippocampus for each subject. The model hippocampal surfaces were transformed along all concatenated transformations for each subject, resulting in 100 candidate surfaces for each subject. A single right and left surface was estimated from these candidate surfaces by taking the median value of the Cartesian coordinates at each vertex (Figure 1). Visual inspection of the surfaces showed that there were positional differences between subjects. To avoid capturing these positional differences in the PCA, further steps were required to ensure that the surfaces for all subjects were completely aligned. First, each subject surface (right or left whole hippocampus) was registered on a vertex-wise basis to the model surface via a 6-parameter (3 translations and 3 rotations) linear registration (MacDonald 1998). Then, each subject surface was registered (using mincants) to every other subject surface via a 12-parameter (3 translations, 3 rotations, 3 scales, and 3 shears) linear registration (e.g., in a set of 137 subjects, there would be 136 registrations per subject). These surfaces, by definition, are normalized for volume. For each subject, these transformations were then averaged, and the hippocampal surfaces were propagated along these average transformations. The Cartesian coordinates of each vertex of these aligned surfaces were used as the input to the PCA.
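The numerical steps described in the sections above (voxel-wise majority voting, normal-direction displacement, median candidate surface, and PCA on vertex coordinates) can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the pipeline used in this work, which relies on MINC and ANTs tools. All function names and array layouts are hypothetical. Ties in the vote are broken toward the lowest label value, and positive displacement means outward only if the supplied normals point outward; both are assumptions.

```python
import numpy as np

def majority_vote(candidate_labels):
    """Fuse (n_candidates, n_voxels) integer labelings into one consensus label per voxel."""
    labels = np.asarray(candidate_labels)
    classes = np.unique(labels)
    # Count votes for each class at each voxel; ties go to the lowest label (assumption)
    votes = np.stack([(labels == c).sum(axis=0) for c in classes])
    return classes[np.argmax(votes, axis=0)]

def normal_displacement(normals, deformations):
    """Signed displacement along the surface normal at each vertex.

    normals, deformations: (n_vertices, 3) arrays. Normals are
    rescaled to unit length before taking the dot product.
    """
    n = np.asarray(normals, dtype=float)
    n = n / np.linalg.norm(n, axis=1, keepdims=True)
    return np.einsum("ij,ij->i", n, np.asarray(deformations, dtype=float))

def median_surface(candidate_surfaces):
    """Vertex-wise median of (n_candidates, n_vertices, 3) candidate surfaces."""
    return np.median(np.asarray(candidate_surfaces, dtype=float), axis=0)

def shape_pca(surfaces):
    """PCA on aligned surfaces of shape (n_subjects, n_vertices, 3).

    Returns (components, explained_variance_ratio, scores).
    """
    X = np.asarray(surfaces, dtype=float).reshape(len(surfaces), -1)
    Xc = X - X.mean(axis=0)                    # centre on the mean shape
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    var = S ** 2 / (X.shape[0] - 1)
    return Vt, var / var.sum(), U * S
```

In this layout, the 100 candidate labelings per subject would form the first axis of `candidate_labels`, and the explained variance ratio returned by `shape_pca` corresponds to the "variance explained" reported for each principal component.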

Supplementary Tables

Table S1: Correlations between hippocampal volumes (with subfields) and cognitive scores (no covariates). * Indicates significance before/after 5% FDR correction (p/q<0.05).

Cognitive Measure  Hemisphere  Structure  R²   p    q
List Recall        Right       Whole
                               CA1
                               CA2/CA3
                               CA4/DG
                               SR/SL/SM
                               Subiculum       *    0.15
                   Left        Whole           *    0.11
                               CA1
                               CA2/CA3
                               CA4/DG          *    0.11
                               SR/SL/SM        *    0.11
                               Subiculum
Figure Recall      Right       Whole
                               CA1
                               CA2/CA3
                               CA4/DG          *    0.15
                               SR/SL/SM
                               Subiculum       *    0.12
                   Left        Whole           *    0.12
                               CA1
                               CA2/CA3
                               CA4/DG          *    0.11
                               SR/SL/SM        *    0.11
                               Subiculum
LNS                Right       Whole
                               CA1
                               CA2/CA3
                               CA4/DG
                               SR/SL/SM
                               Subiculum
                   Left        Whole
                               CA1
                               CA2/CA3
                               CA4/DG
                               SR/SL/SM
                               Subiculum

References

Avants BB, Epstein CL, Grossman M, Gee JC. 2008. Symmetric diffeomorphic image registration with cross-correlation: evaluating automated labeling of elderly and neurodegenerative brain. Med Image Anal 12:

Borghammer P, Ostergaard K, Cumming P, Gjedde A, Rodell A, Hall N, Chakravarty MM. 2010. A deformation-based morphometry study of patients with early-stage Parkinson's disease. Eur J Neurol 17:

Dorr AE, Lerch JP, Spring S, Kabani N, Henkelman RM. 2008. High resolution three-dimensional brain atlas using an average magnetic resonance image of 40 adult C57Bl/6J mice. Neuroimage 42:60-9.

Frey S, Pandya DN, Chakravarty MM, Bailey L, Petrides M, Collins DL. 2011. An MRI based average macaque monkey stereotaxic atlas and space (MNI monkey space). Neuroimage 55:

Holmes CJ, Hoge R, Collins L, Woods R, Toga AW, Evans AC. 1998. Enhancement of MR images using registration for signal averaging. J Comput Assist Tomogr 22:

MacDonald D. 1998. A Method for Identifying Geometrically Simple Surfaces from Three Dimensional Images. Montreal: McGill University.

Park MT, Pipitone J, Baer LH, Winterburn JL, Shah Y, Chavez S, Schira MM, Lobaugh NJ, Lerch JP, Voineskos AN and others. 2014. Derivation of high-resolution MRI atlases of the human cerebellum at 3T and segmentation using multiple automatically generated templates. Neuroimage 95:

Pipitone J, Park MT, Winterburn J, Lett TA, Lerch JP, Pruessner JC, Lepage M, Voineskos AN, Mallar Chakravarty M. 2014. Multi-atlas segmentation of the whole hippocampus and subfields using multiple automatically generated templates. Neuroimage.

170 170 Supplementary Paper 2: High-resolution in vivo manual segmentation protocol for human hippocampal subfields using 3T magnetic resonance imaging Authors: Winterburn, Julie L 1 Institute of Biomaterials and Biomedical Engineering University of Toronto Toronto, Canada 2 Computational Brain Anatomy Laboratory Douglas Institute McGill University Montreal, Canada winterburn.julie@gmail.com Pruessner, Jens C McGill Centre for Studies in Aging McGill University Montreal, Canada jens.pruessner@mcgill.ca Chavez, Sofia 1 MRI Unit, Research Imaging Centre, Campbell Family Mental Health Research Institute Centre for Addiction and Mental Health Toronto, Canada 2 Department of Psychiatry University of Toronto Toronto, Canada sofia.chavez@camhpet.ca

Schira, Mark M
1 School of Psychology, University of Wollongong, Wollongong, Australia
2 Neuroscience Research Australia, Sydney, Australia
mark.schira@gmail.com

Lobaugh, Nancy J
1 MRI Unit, Research Imaging Centre, Campbell Family Mental Health Research Institute, Centre for Addiction and Mental Health, Toronto, Canada
2 Division of Neurology, Department of Medicine, University of Toronto, Toronto, Canada
nancy.lobaugh@gmail.com

Voineskos, Aristotle N
1 Kimel Family Translational Imaging Genetics Research Laboratory, Research Imaging Centre, Campbell Family Mental Health Research Institute, Centre for Addiction and Mental Health, Toronto, Canada
2 Department of Psychiatry, University of Toronto, Toronto, Canada
aristotlevoineskos@gmail.com

Chakravarty, M. Mallar
1 Computational Brain Anatomy Laboratory, Douglas Institute, McGill University, Montreal, Canada
2 Institute of Biomaterials and Biomedical Engineering, University of Toronto, Toronto, Canada
mallar.chak@gmail.com

Corresponding authors:
Winterburn, Julie L, winterburn.julie@gmail.com, (416) x33980
Chakravarty, M. Mallar, mallar.chak@gmail.com, (416) x33980

Keywords: Structural magnetic resonance imaging, High resolution, Neuroanatomy, Hippocampus, Hippocampal subfields, Manual segmentation, Atlas

Short Abstract: The goal of this manuscript is to study the hippocampus and hippocampal subfields using MRI. The manuscript describes a protocol for segmenting the hippocampus and five hippocampal substructures: cornu ammonis (CA) 1, CA2/CA3, CA4/dentate gyrus, strata radiatum/lacunosum/moleculare, and subiculum.

Long Abstract: The human hippocampus has been broadly studied in the context of memory and normal brain function, and its role in different neuropsychiatric disorders has been heavily studied. While many imaging studies treat the hippocampus as a single unitary neuroanatomical structure, it is, in fact, composed of several subfields that have a complex three-dimensional geometry. These subfields are known to perform specialized functions and to be differentially affected over the course of different disease states. Magnetic resonance (MR) imaging can be used as a powerful tool to interrogate the morphology of the hippocampus and its subfields. Many groups use advanced imaging software and hardware (>3T) to image the subfields; however, this type of technology may not be readily available in most research and clinical imaging centres. To address this need, this manuscript provides a detailed step-by-step protocol for segmenting the full anterior-posterior length of the hippocampus and its subfields: cornu ammonis (CA) 1, CA2/CA3, CA4/dentate gyrus, strata radiatum/lacunosum/moleculare, and subiculum. This protocol has been applied to five subjects (3F, 2M; age 29-57, avg. 37). Protocol reliability is assessed by resegmenting either the right or left hippocampus of each subject and computing the overlap using the Dice's kappa metric. Mean Dice's kappa values (range) across the five subjects are: whole hippocampus, 0.91 ( ); CA1, 0.78 ( ); CA2/CA3, 0.64 ( ); CA4/dentate gyrus, 0.83 ( ); strata radiatum/lacunosum/moleculare, 0.71 ( ); and subiculum, 0.75 ( ). The segmentation protocol presented here provides other laboratories with a reliable method to study the hippocampus and hippocampal subfields in vivo using commonly available MR tools.

Introduction: The hippocampus is a widely studied medial temporal lobe structure that is associated with episodic memory, spatial navigation, and other cognitive functions 10,31. Its role in neurodegenerative and neuropsychiatric disorders such as Alzheimer's disease, schizophrenia, and bipolar disorder is well-documented 4,5,18,24,30. The goal of this manuscript is to provide additional detail to the manual segmentation protocol published previously 35 for human hippocampal subfields on high-resolution magnetic resonance (MR) images acquired at 3T. Additionally, the video component accompanying this manuscript will provide further assistance to researchers who wish to implement the protocol on their own datasets.

The hippocampus can be divided into subfields based on cytoarchitectonic differences observed in histologically-prepared post-mortem specimens 12,22. Such post-mortem specimens define the ground truth for the identification and study of hippocampal subfields; however, preparations of this nature require specialized skills and equipment for staining, and are limited by the availability of fixed tissue, especially in diseased populations. In vivo imaging has the advantage of a much larger pool of subjects, and also presents the opportunity for follow-up studies and observing changes in populations. Although it has been shown that signal intensities in T2-weighted ex vivo MR images reflect cellular density 13, it is still difficult to identify undisputed borders between subfields using solely MR signal intensities. As such, a number of different approaches for identifying histology-level detail on MR images have been developed.

Some groups have made efforts to reconstruct and digitize histological datasets and then use these reconstructions along with image registration techniques to localize hippocampal subfield

neuroanatomy on in vivo MR 1,2,8,9,14,15,17,32. Although this is an effective technique for mapping a version of the histological ground truth directly onto MR images, reconstructions of this nature are difficult to complete. Projects such as these are limited by the availability of intact medial temporal lobe specimens, histological techniques, data loss during histological processing, and the fundamental morphological inconsistencies between fixed and in vivo brains.

Other groups have used high-field scanners (7T or 9.4T) in an effort to acquire in vivo or ex vivo images with a small enough ( mm isotropic) voxel size to visualize spatially localized differences in image contrast that are used to infer boundaries between subfields 35,37. Even at 7T-9.4T and with such a small voxel size, the cytoarchitectonic characteristics of hippocampal subfields are not visible. As such, manual segmentation protocols have been developed that approximate the known histological boundaries on MR images. These protocols determine subfield boundaries by interpreting local image contrast differences and defining geometric rules (such as straight lines and angles) relative to visible structures. Although images taken at a high field strength are able to offer detailed insight into hippocampal subfields, high-field scanners are not yet common in clinical or research settings, so 7T and 9.4T protocols currently have limited applicability.

Similar protocols have been developed for images collected on 3T and 4T scanners 11,20,21,23,24,25,28,33. Many of these protocols are based on images with sub-1 mm voxel dimensions in the coronal plane, but have large slice thicknesses (0.8-3 mm) 11,20,21,23,25,28,33 or large inter-slice distances 20,28, both of which result in a significant measurement bias in the estimation of volumes of the individual subfields. Additionally, many of the existing 3T protocols exclude subfields in all or part of the hippocampal head or tail 20,23,25,33 or do not provide detailed segmentations of important substructures (i.e., combine the DG with CA2/CA3 or do not include the strata radiatum/lacunosum/moleculare of the CA) 11,20,21,23,24,25,28,33. There is

therefore a need in the field for a detailed description of a protocol that can reliably identify relevant subfields throughout the head, body, and tail of the hippocampus and that is based on a scanner commonly available in clinical and research settings. Efforts are currently underway by the Hippocampal Subfields Group to harmonize the hippocampal subfield segmentation process between laboratories, similar to an existing harmonization effort for whole hippocampal segmentation 6, and an initial paper comparing 21 existing protocols was recently published 38. The work from this group will further elucidate optimal segmentation procedures.

This manuscript provides detailed written and video instructions for reliably implementing the hippocampal subfield segmentation protocol described previously by Winterburn and colleagues 34 on high-resolution 3T MR images. The protocol has been implemented on five images of healthy controls for the whole hippocampus and five hippocampal subfields (CA1, CA2/CA3, CA4/dentate gyrus, strata radiatum/lacunosum/moleculare, and subiculum). These segmented images are available to the public online (cobralab.ca/atlases/hippocampus). The protocol and the segmented images will be useful for groups who wish to study detailed hippocampal neuroanatomy in MR images.

Protocol:

Study Participants

The protocol in this manuscript was developed for five representative high-resolution images collected from healthy volunteers (3F, 2M; age 29-57, avg. 37) who were free of neurological and neuropsychiatric disorders and cases of severe head trauma. All subjects were recruited at

the Centre for Addiction and Mental Health (CAMH). The study was approved by the CAMH Research Ethics Board and was conducted in keeping with the Declaration of Helsinki. All subjects provided written, informed consent for data acquisition and sharing. For details about the acquisition sequence used to collect these images, please refer to Winterburn et al., 2013 and Park et al., ,34. Images for all five subjects were checked for quality and retained. The hippocampus spanned an average of 118 coronal slices in these images.

1 Software Set-up

1.1 Open Display from the terminal using the following command: Display image_name.mnc -label label_name.mnc. The program will open 3 windows: a 3D visualization window, a 3-orientation image viewing window, and a navigation window. The terminal will also be used to run the program. Enlarge the coronal view, as the segmentations will be performed coronally. Zoom in on the hippocampus. Select F (Segmenting) in the navigation window. Select F (XY Radius:0.1). The terminal window will prompt the user to "Enter xy brush size:". Set it to 0.1. This sets the size of the paintbrush. The user can now begin drawing the hippocampus onto the MR image.

2 Whole Hippocampus Manual Segmentation

2.1 Set-up: Using a T1-weighted image, scroll to the anterior-most coronal slice of the hippocampus. To advance slices in the anterior direction, use the '+' key; use the '-' key to move in the posterior direction.

2.2 Slice A: Anterior-Most Slice: Using the right-click on the mouse, draw the outer-most border of the hippocampal grey matter where it meets the surrounding temporal lobe white matter, and use the high-intensity white matter of the alveus to assist with the superior border, where the hippocampus meets the amygdala 12,22. Use the E (Label Fill) key in the segmentation menu of the navigation window to fill in the label inside the border. Continue to apply these borders throughout the anterior hippocampal head.

2.3 Slice B: Hippocampal Head 1:

Superior, inferior, lateral, medial borders: Continue to draw the borders as described in step 2.2, using the white matter of the temporal lobe and alveus as a guide.

Supero-medial border: Using the axial view, draw a horizontal line from the anterior edge of the lateral hippocampus 29, and include anything below this line as hippocampus. NOTE: The supero-medial border becomes more ambiguous in these slices, where the grey matter of the hippocampus blends with the grey matter of the amygdala.

2.4 Slice C: Hippocampal Head 2 with Dentations: Depending on the subject, the dentations of the hippocampus may be visible for 3-4 slices (typically, they are more visible on T2-weighted than on T1-weighted images). In these slices, use the white matter of the alveus and temporal lobe to guide border segmentation 12,22. For further details, follow steps

2.5 Slice D: Hippocampal Head 3:

Superior, inferior, lateral, medial borders: Draw the inferior border of the hippocampus at the white matter of the temporal lobe, the lateral border at the inferior horn of the lateral ventricle, the superior border (following the curve of the dentations) at the white matter of the alveus/fimbria, and the medial border at the hypointense region of the ambient cistern 12.

Supero-medial and infero-medial borders: Continue to define the supero-medial border as described in step 2.3. Draw the inferior portion of the medial border where the hippocampus thins slightly and extends into the mildly hyperintense grey matter of the entorhinal cortex 12.

2.6 Slice E: Hippocampal Head 4 with Uncus: Continue to draw the inferior, lateral, and superior borders described in step 2.5. Include the uncus (which is located medial to the main body of the hippocampus and is surrounded by low-intensity CSF) in the hippocampal segmentation 12.

2.7 Slice F: Hippocampal Body: Continue to draw the inferior, lateral, medial, and superior borders described in the preceding steps. Draw the infero-medial border at the point where the hippocampus thins as it transitions to entorhinal cortex/para-hippocampal gyrus 12,22. Do not include the low-intensity CSF of the vestigial hippocampal sulcus in the segmentation of the hippocampal head.

2.8 Slice G: Hippocampal Tail: Begin segmenting hippocampal tail-type slices when the crus of the fornix is first visible. Exclude the fascicular gyrus (a grey matter structure which blends with the hippocampus in parts of the hippocampal tail) from the segmentation by extrapolating

the shape of the fascicular gyrus into the hippocampal tail from more anterior slices 12,22. This extrapolation is only possible for 2-3 slices, after which the two structures cannot be accurately distinguished; at this point, treat all visible grey matter in this area as hippocampus.

2.9 Slice H: Hippocampal Tail 2: Segment the low-intensity grey matter of the posterior hippocampal tail from the surrounding high-intensity white matter.

2.10 Slice I: Posterior-Most Slice: Segment the small remaining area of hippocampal grey matter from the surrounding white matter of the temporal lobe.

3 Hippocampal Subfield Manual Segmentation

3.1 Set-up: Using a T2-weighted image, scroll to the anterior-most coronal slice of the hippocampus (as in step 2.1). To change the colour of the paintbrush, select D (Set Paint Lbl:) on the segmenting menu in the navigation window. The command terminal will prompt: "Enter current paint label:". Enter a number between 1 and 255. Each number corresponds to a different label colour.

3.2 Slice A: Anterior-Most Slice: Since subfield divisions are not yet visible in the anterior-most slice, draw a line dividing the visible hippocampal grey matter along its longest visible axis (which is not necessarily parallel to any of the cardinal axes) into two equal sections to approximate the true anatomy 12,22. Label the superior of these two sections as CA1 and the inferior section as subiculum by choosing a different coloured label for each subfield 23,35.

3.3 Slice B: Hippocampal Head 1: Label the low-intensity area in the middle of the hippocampal formation as SR/SL/SM (strata radiatum/lacunosum/moleculare) 13,37. When the bend along the inferior edge of the hippocampus becomes clear, use this landmark as the lateral border separating the subiculum from the CA1 12,22. Continue to follow the longest axis of the hippocampus to draw the CA1-subiculum border on the supero-medial tip.

3.4 Slice C: Hippocampal Head 2 with Dentations:

SR/SL/SM, CA4/DG, and subiculum: Label the SR/SL/SM, CA4/DG, and subiculum as described for slice D (step 3.5.1).

CA2/CA3 and CA1: Define the border between CA1 and CA2/CA3 as a 45° angle line extending in the supero-lateral direction from the most supero-lateral edge of the SR/SL/SM 12,22. Extend the CA2/CA3 medially along the superior edge to the trough between the dentations 12,22. Label the rest of the superior edge as CA1 12.

3.5 Slice D: Hippocampal Head 3:

SR/SL/SM, CA4/DG, and subiculum: Label the dark SR/SL/SM band first, which will follow the curve of the CA1 37. Label any high-intensity grey matter inside of the SR/SL/SM as CA4/DG 12,22,23,35,37. This may not be a continuous region, as in Figure 2C. Continue to define the subiculum-CA1 border using the bend in the inferior hippocampus 12,22.

CA2/CA3 and CA1: Continue to define the CA1 and CA2/CA3 border as in step 3.4. Extend the CA2/CA3 medially halfway along the superior edge of the hippocampus 12,22 and label the other half of the superior edge as CA1 12.

Supero-medial hippocampal head: In this slice, divide the supero-medial hippocampal head vertically in half. Label the medial half as SR/SL/SM 12. Divide the lateral half in half again, this time horizontally. Label the superior portion as CA4/DG and the inferior portion as CA2/CA3.

3.6 Slice E: Hippocampal Head 4 with Uncus:

Lateral hippocampal head (subiculum): In the lateral portion of these slices, define the subiculum-CA1 border as a vertical line extending in the inferior direction from the most medial edge of the CA4/DG 12.

Lateral hippocampal head (CA1, CA2/CA3, CA4/DG, SR/SL/SM): Define the CA1-CA2/CA3 border in the same way as in step 3.4. Continue to label the SR/SL/SM as the low-intensity region following the curve of the CA regions. Label the CA4/DG as the centre cavity inside the SR/SL/SM, as in step 3.5.1.

Uncal hippocampal head (SR/SL/SM): Label the uncus of the hippocampus for approximately 10 slices as the hippocampal head transitions into the hippocampal body. In the uncus, label the low-intensity region in the centre as SR/SL/SM (when this is difficult to see, approximate the anatomy by segmenting a line 2-3 voxels wide up the centre of the uncus) 12.

Uncal hippocampal head (CA2/CA3, CA4/DG): Draw a line at the superior edge of the SR/SL/SM section along the infero-lateral/supero-medial axis of the uncus. Label all grey matter above this line as CA2/CA3 12. Label any unlabelled grey matter below this line (on either side of the SR/SL/SM) as CA4/DG.

3.7 Slice F: Hippocampal Body: Continue to apply the borders described for the lateral hippocampal head in step 3.6.

3.8 Slice G: Hippocampal Tail 1: Continue to apply the rules described in step 3.7. The subiculum-CA1 border becomes a 45° angle line extending in the infero-medial direction from the medial edge of the CA4/DG 12.

3.9 Slice H: Hippocampal Tail 2: Once the fascicular gyrus can no longer be distinguished from the hippocampal formation, label the entire outer layer as CA1, the low-intensity area inside of this as SR/SL/SM (as in previous slices), and any remaining grey matter in the middle as CA4/DG 12.

3.10 Slice I: Posterior-Most Slice: Once the dark SR/SL/SM is no longer visible in the centre of the hippocampal formation, label the entire structure as CA1 12,22.

4 Protocol Reliability

4.1 Resegment either the right or left hippocampus of each subject after waiting approximately one month from performing the original segmentation. Segment all of the subfields along the

entire anterior-posterior length of the hippocampus, trying to follow the protocol rules as consistently as possible.

4.2 Calculate the Dice's kappa between the original and resegmented volumes:

$$k = \frac{2\,|A \cap B|}{|A| + |B|}$$

where k is Dice's kappa and A and B are label volumes.

Representative Results: Results from the protocol reliability test are summarized in Table 2. For the whole bilateral hippocampus, mean spatial overlap as measured by Dice's kappa is 0.91 and ranges from . Subfield kappa values range from 0.64 (CA2/CA3) to 0.83 (CA4/dentate gyrus). Mean volumes for all subfields and the whole hippocampus are reported in Table 3. Volumes for the whole hippocampus range from mm³. The CA2/CA3 is the smallest subfield at mm³, while the CA1 is the largest at mm³.

Figure Legends & Figures:

Figure 1: Segmentation of the whole hippocampus for 9 coronal slices (A-I) using T1-weighted images. The vertical red lines on the hippocampal surface illustrate the location of each coronal slice. The hippocampus was present in an average of 118 coronal slices in each of the five subjects included in this study. Images progress from anterior (slice 1) at the top to posterior (slice 118) at the bottom. Images are shown in the left column without segmentation

and with segmentation in the right column. The scale bar shows 3 mm for reference. Roman numerals point to specific features identified in the protocol manuscript.
i. The alveus distinguishes the hippocampal grey matter from the grey matter of the amygdala in the anterior-most slice.
ii. The white matter of the temporal lobe defines the inferior border of the hippocampus in the hippocampal head.
iii. The lateral border of the hippocampus in the hippocampal head is the inferior horn of the lateral ventricle.
iv. The superior border is defined by the white matter of the alveus/fimbria.
v. The medial border of the hippocampal head is the ambient cistern.
vi. The infero-medial hippocampus extends into the entorhinal cortex, which shows up as a mildly hyper-intense band in T1-weighted images.
vii. The uncus of the hippocampus is present in the hippocampal head and can be easily distinguished from the surrounding CSF.
viii. In the infero-medial direction, the border between the subiculum and the para-hippocampal gyrus is defined by a slight thinning of the hippocampal grey matter.
ix. The CSF of the vestigial hippocampal sulcus is not included in the segmentation.
x. The fascicular gyrus is not included in the segmentation of the hippocampal tail when it is possible to differentiate it.
xi. When it is no longer possible to distinguish between the fascicular gyrus and the hippocampal tail, the fascicular gyrus is included in the segmentation.

Figure 2: Segmentation of the hippocampal subfields for 9 coronal slices (A-I) using T2-weighted images. The vertical red lines on the hippocampal surface illustrate the location of each coronal slice. The hippocampus was present in an average of 118 coronal slices in each of the five subjects included in this study. Images progress from anterior (slice 1) at the top to posterior (slice 118) at the bottom. Images are shown in the left column without segmentation and with segmentation in the right column. The scale bar shows 3 mm for reference. Roman

numerals point to specific features identified in the protocol manuscript.
i. The low-intensity region in the centre of the hippocampal head is the SR/SL/SM.
ii. The uncal-shaped bend on the infero-lateral edge of the hippocampus marks the border between the CA1 and the subiculum.
iii. The subiculum-CA1 border continues to be defined at the bend in the inferior hippocampus in the hippocampal head.
iv. The border between CA1 and CA2/CA3 is defined as a 45° angle extending in the supero-lateral direction from the most supero-lateral edge of the SR/SL/SM.
v. The CA2/CA3 extends halfway along the superior edge of the hippocampus, to the trough of the dentations, medial to which it is labelled as CA1.
vi. The grey matter in the centre of the hippocampal head is labelled as CA4/DG.
vii. Continue to define the CA1-CA2/CA3 border as a 45° angle extending in the supero-lateral direction from the most supero-lateral edge of the SR/SL/SM.
viii. The CA2/CA3 continues to extend halfway along the superior edge of the hippocampus, medial to which it is labelled as CA1.
ix. In slice D, the supero-medial hippocampal head is divided into subfields (see step 3.5.3).
x. The subiculum-CA1 border is defined as a vertical line extending from the most medial border of the CA4/DG.
xi. The SR/SL/SM continues to be the low-intensity region following the curve of the CA regions.
xii. In the uncal portion of the hippocampal head, the SR/SL/SM is the low-intensity region in the centre of the uncus. If this cannot be seen, draw a line 2-3 pixels wide up the centre of the uncus.
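The overlap measure used for the reliability test in step 4.2 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation. It assumes each label volume has already been read from its file and flattened into a sequence of integer labels, one per voxel, in the same voxel order. The function name `dice_kappa`, the convention that 0 marks unlabelled background, and the label numbers in the toy example are all hypothetical; the protocol does not state which number is assigned to which subfield.

```python
def dice_kappa(labels_a, labels_b, label=None):
    """Dice's kappa, k = 2|A ∩ B| / (|A| + |B|), between two label volumes.

    labels_a, labels_b: equal-length sequences of integer voxel labels
        (assumed to be flattened in the same voxel order).
    label: the label to compare; if None, every non-zero voxel is treated
        as one structure (assumes 0 marks unlabelled background).
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("label volumes must contain the same number of voxels")

    def selected(value):
        # True when this voxel belongs to the structure being compared.
        return value != 0 if label is None else value == label

    size_a = size_b = overlap = 0
    for a, b in zip(labels_a, labels_b):
        in_a, in_b = selected(a), selected(b)
        size_a += in_a             # |A|: voxels labelled in the original
        size_b += in_b             # |B|: voxels labelled in the resegmentation
        overlap += in_a and in_b   # |A ∩ B|: voxels labelled in both

    if size_a + size_b == 0:
        raise ValueError("label is absent from both volumes")
    return 2.0 * overlap / (size_a + size_b)


# Toy example with hypothetical label numbers (not the protocol's mapping).
original = [0, 1, 1, 2, 2, 2, 0, 5]
resegmented = [0, 1, 2, 2, 2, 0, 0, 5]

for lab in (1, 2, 5):
    print(lab, round(dice_kappa(original, resegmented, lab), 2))
print("whole", round(dice_kappa(original, resegmented), 2))
```

Calling the function once per subfield label gives the per-subfield values, and calling it without a label gives a whole-hippocampus value by pooling all non-zero labels. A kappa of 1 means the two segmentations coincide exactly, and 0 means they share no voxels.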

Figure 1:

Figure 2:
