Classification of ADHD and Non-ADHD Using AR Models and Machine Learning Algorithms


Juan Lopez Marcano

Thesis submitted to the Faculty of the Virginia Polytechnic Institute and State University in partial fulfillment of the requirements for the degree of Master of Science in Electrical Engineering

A. A. (Louis) Beex, Chair
Scott Bailey
JoAnn Paul

October 26, 2016
Blacksburg, Virginia

Keywords: EEG, ADHD, Classification, Machine Learning, KNN, SVM, GMM, Autoregressive Coefficients

Copyright 2016 by Juan Lopez Marcano. All rights reserved.

ABSTRACT

As of 2016, diagnosis of ADHD in the US is controversial. Diagnosis of ADHD is based on subjective observations, and treatment is usually done through stimulants, which can have negative side effects in the long term. Evidence shows that the probability of diagnosing a child with ADHD depends not only on the observations of parents, teachers, and behavioral scientists, but also on state-level special education policies. In light of these facts, unbiased, quantitative methods are needed for the diagnosis of ADHD. This problem has been tackled since the 1990s, and has resulted in methods that have not made it past the research stage and methods for which claimed performance could not be reproduced.

This work proposes a combination of machine learning algorithms and signal processing techniques applied to EEG data in order to classify subjects with and without ADHD with high accuracy and confidence. More specifically, the K-nearest Neighbor (KNN) algorithm and Gaussian-Mixture-Model-based Universal Background Models (GMM-UBM), along with autoregressive (AR) model features, are investigated and evaluated for the classification problem at hand. In this effort, classical KNN and GMM-UBM were also modified in order to account for uncertainty in diagnoses.

Some of the major findings reported in this work include classification performance as high as, if not higher than, that of the highest performing algorithms found in the literature. One of the major findings reported here is that activities that require attention help the discrimination of ADHD and Non-ADHD subjects. Mixing in EEG data from periods of rest or with eyes closed leads to loss of classification performance, to the point of approximating guessing when only resting EEG data is used.

GENERAL AUDIENCE ABSTRACT

As of 2016, diagnosis of ADHD in the US is controversial. Diagnosis of ADHD is based on subjective observations, and treatment is usually done through stimulants, which can have negative side effects in the long term. Evidence shows that the probability of diagnosing a child with ADHD depends not only on the observations of parents, teachers, and behavioral scientists, but also on state-level special education policies. In light of these facts, unbiased, quantitative methods are needed for the diagnosis of ADHD. This problem has been tackled since the 1990s, and has resulted in methods that have not made it past the research stage and methods for which claimed performance could not be reproduced.

This work proposes a combination of machine learning algorithms and signal processing techniques applied to EEG data in order to classify subjects with and without ADHD with high accuracy and confidence. Signal processing techniques are used to extract autoregressive (AR) coefficients, which contain information about brain activities and are used as features. Then, the features, extracted from datasets containing ADHD and Non-ADHD subjects, are used to create or train models that can classify subjects as either ADHD or Non-ADHD. Lastly, the models are tested using datasets that are different from the ones used in the previous stage, and performance is analyzed based on how many of the predicted labels (ADHD or Non-ADHD) match the expected labels.

Some of the major findings reported in this work include classification performance as high as, if not higher than, that of the highest performing algorithms found in the literature. One of the major findings reported here is that activities that require attention help the discrimination of ADHD and Non-ADHD subjects. Mixing in EEG data from periods of rest or with eyes closed leads to loss of classification performance, to the point of approximating guessing when only resting EEG data is used.

ACKNOWLEDGEMENTS

I would like to thank Dr. Beex for giving me the opportunity to work on such an interesting and demanding project. His constant guidance, sense of humor, encouragement, and discouragement over the past 10 months gave me the strength to successfully complete this thesis, publish multiple papers, and make relevant contributions to the field. Dr. Beex always encouraged me to go above and beyond, to question everything, and to go where no one has gone.

I also want to thank Dr. Paul and Dr. Bailey, for having been my professors in the past, for diligently responding to my questions, and for being part of my thesis committee.

Needless to say, none of this would have been possible without the help of my family, my extended family, and most importantly, God. Thank you for your jokes, words, hugs, and in short, for being there one way or the other. You, along with my close and distant friends, shaped my life experiences, and they transcended to the work that is presented in this thesis. I also want to thank family and friends who could not live long enough to hear the news of this thesis. You will always be remembered.

TABLE OF CONTENTS

1. INTRODUCTION
   History of ADHD
   Diagnosis of ADHD
   EEG
   Limitations
   Outline

2. LITERATURE REVIEW
   Spectral Analysis
   Theta-to-Beta Power Ratio (TBPR)
   Other Approaches for the Classification of A and NA Subjects
   Classification of Other EEG Patterns

3. FEATURE EXTRACTION
   AR Modeling
   Burg Method
   LSF
   Akaike Information Criterion (AIC)
   Channel Choice
   Dataset

4. KNN CLASSIFICATION
   K-nearest Neighbor Algorithm
   Confidence in KNN
   4.3. Choosing the Value of K
   Disadvantages of KNN
   Performance Evaluation
   KNN Experiments
   Window Size and Choice of K
   Additional Test Subjects
   Increasing the Training Dataset
   Reflection Coefficients and Line Spectral Frequencies

5. UNIVERSAL BACKGROUND MODEL
   Gaussian Mixture Models
   Expectation Maximization Algorithm
   K-means Clustering Algorithm
   UBM Adaptation
   Performance Evaluation
   Experiments
   Number of Mixture Components
   Effect of Activities
   GMM-UBM with EC and ANT data
   AR vs RC and LSF

6. CLASSIFICATION USING SOFT LABELS
   Soft KNN
   Soft GMM-UBM
   Performance Evaluation
   Experiments
   Membership Functions
   Setting the Value of K
   Soft KNN vs Hard KNN
   Number of Mixture Components
   Soft GMM-UBM vs GMM-UBM

7. OPTIMAL CHANNEL REDUCTION
   Channel Ranking
   Experiments
   Best Channel Combinations for KNN
   Best Channel Combinations for GMM-UBM

8. CONCLUSIONS

REFERENCES

LIST OF FIGURES

Fig. 1.1: Cool map showing the use of behavioral therapy and medications to treat ADHD in the US
Fig. 1.2: EEG electrode configuration
Fig. 1.3: EEG signals of an epileptic patient during seizure
Fig. 3.1: Linear prediction model
Fig. 3.2: AIC vs model order
Fig. 3.3: Cross ratios for all channels
Fig. 4.1: Example of a 2D, 2-class classification using KNN with K set to
Fig. 4.2: Accuracy for different window sizes and values of K
Fig. 4.3: Classification accuracy for 4 pairings as window size changes
Fig. 4.4: TPR (left) and TNR (right) for 4 pairings as window size changes
Fig. 4.5: A Confidence (left) and NA Confidence (right) levels for 4 pairings as window size changes
Fig. 4.6: Confidence histograms from training pairings (in title) for test cases (in legend box)
Fig. 4.7: Confidence histograms for original training pairings, when testing with 18776A and 18716NA
Fig. 4.8: Confidence histograms for original training pairings, when testing with 32436A and 32386NA
Fig. 4.9: Confidence histograms for pairings involving subject 32386NA
Fig. 4.10: Confidence histograms for pairings involving subject 32386NA and displaying two test subjects only
Fig. 4.11: Confidence histograms for pairings involving subject 32386NA and all other NA subjects
Fig. 4.12: Confidence histograms for pairings involving subject 32386NA
Fig. 4.13: Accuracy values (top) and confidence levels (bottom) obtained from all 30 combinations of 2 NA subjects and 2 A subjects for training
Fig. 4.14: Distribution of TNRs (top) and NA conf (bottom) obtained from all 30 combinations of 2 NA subjects and 2 A subjects for training
Fig. 4.15: Distribution of TPRs (top) and A conf (bottom) obtained from all 30 combinations of 2 NA subjects and 2 A subjects for training
Fig. 4.15: Accuracy values (top) and confidence levels (bottom) when using RC (blue), AR (green), and LSF (yellow) as features
Fig. 4.16: TPRs (top) and A conf levels (bottom) when using RC (blue), AR (green), and LSF (yellow) as features
Fig. 4.17: TNRs (top) and NA conf levels (bottom) when using RC (blue), AR (green), and LSF (yellow) as features
Fig. 5.1: EM iterations (left) and final EM iteration (right)
Fig. 5.2: Convergence of EM algorithm with random initialization (left) and convergence using K-means clustering for initialization (right)
Fig. 5.3: Example of K-means clustering algorithm with data before clustering (left) and after clustering (right)
Fig. 5.4: GMM-UBM for the classification of A/NA subjects
Fig. 5.5: Effect of the number of mixture components
Fig. 5.6: Distribution of AUCs (top) and EERs (bottom) when training/testing = EC/EC (dark blue), ANT/ANT (blue), ANT/EC (olive), and EC/ANT (yellow); all combinations of 2 subjects (1 A and 1 NA) used for training and all other non-overlapping subjects for testing
Fig. 5.7: Distribution of AUCs (top) and EERs (bottom) for training/testing cases EC/EC (dark blue), ANT/ANT (blue), ANT/EC (olive), and EC/ANT (yellow); all combinations of 4 subjects (2 A and 2 NA) used for training and all other non-overlapping subjects for testing
Fig. 5.8: Distribution of AUCs (top) and EERs (bottom) when ANT feature vectors from 4 subjects are used for training and from another 4 ANT subjects for testing
Fig. 5.9: AUCs (top) and EERs (bottom) of GMM-UBMs with ANT+EC (mixed) composition training datasets
Fig. 5.10: Sample ROC plots with different training datasets
Figure 5.11: Sample DET curves with different datasets
Fig. 5.12: AUCs (top) and EERs (bottom) of GMM-UBMs with ANT+EC+VIDEO (mixed) composition training datasets and same composition testing sets
Fig. 5.14: AUCs (top) and EERs (bottom) of GMM-UBMs trained with RC (blue), AR (green), and LSF (yellow) coefficients
Fig. 6.1: KNN example (top) and Soft KNN example (bottom) with K =
Fig. 6.2: Last EM iterations for hard GMM (left) and for soft GMM (right)
Fig. 6.3: Distribution of a posteriori probabilities of all the vectors extracted from subjects 18316NA, 18396A, 18586NA, and 18606A
Fig. 6.4: Distribution of a posteriori probabilities of all the vectors extracted from subjects 18716NA, 18776A, 32386NA, and 32436A
Fig. 6.5: Mean accuracy of classification for different values of K
Fig. 6.6: Distribution of overall accuracy values (top) and overall confidence levels (bottom) when using Hard KNN and Soft KNN
Fig. 6.7: Distribution of TPRs (top) and distribution of A conf levels (bottom) when using Hard KNN and Soft KNN
Fig. 6.8: Histogram of TNRs (top) and distribution of NA conf (bottom) when using KNN and Soft KNN
Fig. 6.9: Mean AUC for all softening scenarios involving [...] 1 (top) and [...] 2 (bottom)
Fig. 6.10: Mean EER for all softening scenarios involving [...] 1 (top) and [...] 2 (bottom)
Fig. 6.11: Distribution of AUCs (top) and EERs (bottom) for all softening scenarios using
Fig. 6.12: Distribution of AUCs (top) and EERs (bottom) for all softening scenarios using
Fig. 6.13: Comparison of 0, 1 SPL, and SPL in terms of AUCs (top) and EERs (bottom)
Fig. 6.14: Comparison of DET curves for the average (left) and worst (right) cases
Fig. 7.1: Mean accuracy for all 2-channel combinations
Fig. 7.2: Accuracy of 3-channel combinations that include Fc1-Pz
Fig. 7.3: Accuracy of all 4-channel combinations that include Fc1-Pz-Cp
Fig. 7.4: AUCs (above diagonal) and 1-EERs (below diagonal) of all 2-channel combinations
Fig. 7.5: AUCs and EERs of all 3-channel combinations that include Fc1-Pz
Fig. 7.6: AUCs and EERs of all 4-channel combinations that include Fc1-Pz-Cp
Fig. 7.7: ROC when pair Fc1-Pz is used in GMM-UBM
Fig. 7.8: DET curves when pair Fc1-Pz is used in GMM-UBM

LIST OF TABLES

Table 1.1: Combinations of training and testing subjects when 32386NA is used for training
Table 1.2: Combinations of training and testing subjects when subject 32386NA is used for training
Table 5.1: Summary of AUC under different training and testing scenarios (percentage in mix)

LIST OF ABBREVIATIONS

AIC: Akaike Information Criterion
ANT: Attention Network Task
AR: Autoregressive
ATT: Attention Deficit without Hyperactivity
AUC: Area Under the Curve
BCI: Brain-Computer Interface
BMD: Bipolar Mood Disorders
CDF: Cumulative Distribution Function
CT: Computed Tomography
DET: Detection Error Trade-off
EC: Eyes Closed
EEG: Electroencephalogram
EP: Evoked Potentials
EER: Equal Error Rate
EM: Expectation Maximization
FN: False Negatives
FP: False Positives
FCM: Fuzzy C-means
FFT: Fast Fourier Transform
GMM: Gaussian Mixture Model
KNN: K-Nearest Neighbor
LLR: Log-Likelihood Ratio
LSF: Line Spectral Frequency
MAP: Maximum a Posteriori
MEG: Magnetoencephalogram
MRI: Magnetic Resonance Imaging
MRS: Magnetic Resonance Spectroscopy
OCD: Obsessive-Compulsive Disorders
PET: Positron Emission Tomography
RC: Reflection Coefficient
ROC: Receiver Operating Characteristics
SNR: Signal-to-Noise Ratio
SVM: Support Vector Machine
TBPR: Theta-to-Beta Power Ratio
TN: True Negatives
TP: True Positives
UBM: Universal Background Model

1. INTRODUCTION

In the US, ADHD is a condition that affects approximately 11% of children ages 4 to 17 [1]. Diagnosis of ADHD is done by using the Diagnostic and Statistical Manual of Mental Disorders (DSM), published by the American Psychiatric Association (APA) [2], which provides a list of symptoms that behavioral scientists use to determine whether or not a subject has a mental disorder.

1.1. History of ADHD

Although ADHD has been part of many people's lives, it was not acknowledged until the 1920s, and started to be treated in the late 1930s. Between 1918 and 1925, researchers noted that there was an unusual inattentive, impulsive, and hyperactive behavior in children who had had influenza [3-5]. In the 1930s, researchers agreed that some children had mild brain dysfunctions, which consisted of poor attention, hyperactivity, and behavioral dysfunctions. As early as 1937, children with these symptoms were treated with amphetamines, a class of stimulants still found in ADHD medications such as Adderall. Dr. Charles Bradley was the first physician to administer amphetamines to children with these symptoms, and he found that, depending on the dosage, academic achievement improved drastically [6].

Between the 1940s and the 1960s, there were great advances in ADHD research. In 1947, the terms minimal brain dysfunction (MBD) and Strauss Syndrome were formally coined to describe people with the aforementioned symptoms [7]. A large number of visual-motor and intelligence tests were created to differentiate people with MBD from people without MBD. By the early 1970s, it became evident that there were too many disorders to place under the MBD umbrella, so MBD was divided into 4 categories: learning disabilities, hyperkinetic disorders, conduct disorders, and attention disorders [5].

Since the late 1980s, there have been changes in the definition of ADHD. In 1987, the DSM-III-R changed attention disorders to attention-deficit disorders (ADD) with or without hyperactivity [8]. In 1994, with DSM-IV [9], ADD with or without hyperactivity was narrowed down to ADHD, and three subtypes of ADHD were defined: inattentive, hyperactive, and combined.

Finally, in 2013, DSM-V [2] placed ADHD under the umbrella of neurodevelopmental disorders. The symptoms and the criteria for diagnosis of ADHD have changed from version to version.

1.2. Diagnosis of ADHD

Diagnosis of ADHD is done through subjective observations by teachers and/or parents and finally by behavioral scientists. When teachers or parents suspect, based on their observations, that a child exhibits symptoms of ADHD, the child is taken to a behavioral scientist to investigate whether or not the child has the condition [10]. The behavioral scientist, in turn, observes the behavior of the child and compares it with the symptoms of ADHD described in the DSM. According to the DSM-V, someone with ADHD:

- often fails to give close attention to details or makes careless mistakes;
- often has difficulty sustaining attention to tasks;
- often does not seem to listen when spoken to directly;
- often fails to follow instructions carefully and completely;
- loses or forgets important things;
- feels restless, often fidgeting with hands or feet, or squirming;
- runs or climbs excessively;
- often talks excessively;
- often blurts out answers before hearing the whole question;
- often has difficulty awaiting turn.

Unfortunately, these subjective observations have a large error associated with them. Snyder et al. [10] conducted a study with 159 participants, 101 males and 58 females, aged 6 to 18, where 61% of the subjects (97) were diagnosed ADHD (A) and 39% (62) were diagnosed Non-ADHD (NA). The subjects participated in clinical interviews and were diagnosed according to the DSM-IV and to Conners' Rating Scales-Revised (CRS-R), which is another manual used for the diagnosis of behavioral disorders. The study revealed that parents and teachers can predict ADHD with an accuracy of 47% to 58%. The 47% figure was obtained when comparing the teachers' predictions with the diagnoses based on DSM-IV, and the 58% figure when the teachers' predictions were compared to the diagnoses based on CRS-R. Likewise, parents' predictions matched 56% of the CRS-R diagnoses and 55% of the DSM-IV diagnoses. Therefore, teachers and parents can predict ADHD with an accuracy that is only slightly better than a guess.

ADHD is generally treated using behavioral interventions and/or pharmacological interventions [11]. Behavioral interventions consist of therapies that teach the person with ADHD to self-control, self-monitor, and self-evaluate, with the objective of improving the symptoms. According to the guidelines of the American Academy of Pediatrics (AAP), behavioral interventions should be used on children before considering medication, which consists of the use of amphetamine-based or methylphenidate-based stimulants [12].

Unfortunately, stimulants used for ADHD treatment have side effects. In the short term, these stimulants are known to cause weight loss, loss of appetite, and sleeping troubles. If taken over a long period of time, these medicines could cause high blood pressure and higher heart rate, and they could also increase the risk of acquiring heart arrhythmias [13]. Indeed, in 2006, the FDA's Drug Safety and Risk Management Committee decided to assign a black box warning, the strongest warning used by the FDA, to ADHD medications to indicate cardiovascular risks.

Fig. 1.1: Cool map showing the use of behavioral therapy and medications to treat ADHD in the US [1].

Figure 1.1 shows how ADHD was treated, between 2009 and 2010, in the US. On a different note, it is unknown what week the legends of the figure refer to. Note that there is a considerable number of states where over 80% of people diagnosed with ADHD use medication. In fact, in 2011, it was estimated that 6.1% of children aged 4 to 17 years, approximately 3.5 million, used ADHD medication [1]. Moreover, diagnosis of ADHD grows every year, and with it, consumption of stimulants.

One of the reasons diagnosis of ADHD is growing every year may be financial incentives. In some states, schools receive additional funding from the state depending on how many of their students have special needs. It is in those states where diagnosis of ADHD at an early age is highest and grows every year, whereas diagnosis of ADHD fluctuates only slightly in the states where there is no additional funding that depends on the number of students with learning and other disabilities [14, 15].

Chances are, a large portion of the people who are diagnosed with ADHD, especially children, do not have ADHD. Elder conducted a study that analyzed the data from the Early Childhood Longitudinal Study-Kindergarten cohort (ECLS-K), which includes parent and teacher reports of ADHD symptoms, diagnoses, and stimulant-based treatments [16]. His research found that school cut-off dates and a child's date of birth greatly affect whether or not he or she will be diagnosed with ADHD. The study found that children born right before their state's kindergarten eligibility cut-off date are two times more likely to use stimulants, based on ADHD diagnosis, than those who are born right after the cut-off date and have to wait another year to start kindergarten. Elder estimates that 20% of the 4.5 million children diagnosed with ADHD as of 2005 do not have ADHD. As a result of inadequate expectations and perceptions from their teachers, almost 900,000 children are taking medications that they do not need and that, furthermore, will result in negative long-term effects.

As argued, the cost of misdiagnosis is very high. Stimulants such as Ritalin and Adderall affect not only the behavior and experiences of their consumers, but also their lifetime. Since the source of error seems to be the subjective observations of parents and teachers, this work will explore quantitative, data-based ways to prescreen and/or diagnose ADHD.

In this study, information was extracted from electroencephalogram (EEG) data in order to discriminate between A and NA subjects. Multi-channel EEG data was used not only because of its availability, but also because it has multiple advantages over other methods. One of the advantages of having multi-channel EEG is that the channels to be processed can be selected, which minimizes the needed processing; processing data that does not contribute to higher discrimination performance may be detrimental. However, keeping additional channels may provide robustness in the event that one of the channels fails.

1.3. EEG

EEG signals are low-power signals (usually below 0.6 mV) that are measured on the scalp of a human or non-human head and are currently being used for many applications. To record EEG data from a subject, electrodes are placed in different locations on the scalp of the subject. The recordings are then processed by a data acquisition device, which typically performs sampling and low-pass filtering. As of today, EEG data is used in the health care industry, for the diagnosis of epilepsy, brain tumors, and stroke; in psychological and neurological research, for the study of the development and understanding of the human brain; and also for brain-computer interfaces (BCIs), which use EEG data as inputs to a system [17].

Electrode placement follows a standardized configuration, which indicates how and where the electrodes should be placed. The configuration used, typically 10-20, indicates the distance between two adjacent electrodes. For 10-20, the distance is 10% or 20% of the front-back or right-left distance of the scalp [18]. The distances are measured from the nasion (bridge of the nose) to the inion (the bony protuberance on the back of the head) and from preauricular point to preauricular point.

Fig. 1.2: EEG electrode configuration.
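To make the 10-20 spacing concrete, the following minimal sketch computes where the midline electrodes would fall along the nasion-to-inion line. It assumes the conventional 10-20 midline sites (Fpz, Fz, Cz, Pz, Oz), which are not named in the text above, and a hypothetical head measurement of 36 cm; the function name `midline_positions` is illustrative only.

```python
def midline_positions(nasion_to_inion_cm):
    """Distances (in cm from the nasion) of the midline sites in a 10-20 layout."""
    # Assumed conventional spacing: the first site sits 10% of the way from the
    # nasion, each following site is a further 20%, and the last 10% of the
    # distance separates the final site from the inion.
    sites = ["Fpz", "Fz", "Cz", "Pz", "Oz"]
    steps = [0.10, 0.20, 0.20, 0.20, 0.20]
    positions = {}
    fraction = 0.0
    for name, step in zip(sites, steps):
        fraction += step
        positions[name] = round(fraction * nasion_to_inion_cm, 2)
    return positions


# Hypothetical nasion-to-inion distance of 36 cm.
positions = midline_positions(36.0)
for name, distance in positions.items():
    print(f"{name}: {distance} cm from the nasion")
```

The same percentages would apply along the line between the two preauricular points, using that distance in place of the nasion-to-inion distance.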

Figure 1.2 shows the configuration used in this work. Every channel name starts with F, T, P, C, or O, which stand for frontal, temporal, parietal, central, and occipital and represent the respective lobes of the brain. When electrode names have 2 letters, e.g. Fc or Cp, this indicates that the electrode is between the two lobes denoted by the two letters. Finally, electrode names also contain a number, 1 through 8 in the example. Odd numbers indicate the left side of the scalp and even numbers the right side of the scalp [18].

EEG has been gaining popularity over the years because it has some advantages over other brain-scanning techniques such as magnetic resonance imaging (MRI), computed tomography (CT), positron emission tomography (PET), magnetic resonance spectroscopy (MRS), and magnetoencephalogram (MEG). These advantages include, but are not limited to:

- Cost efficiency: Hardware cost for the collection of EEG data is much lower than that of other techniques, especially because magnetically shielded rooms are not required.
- Resolution: EEG can provide data at the millisecond level, which is impossible with MRI, CT, and other techniques [19].
- Ubiquity: Although collection of EEG data requires placement of electrodes on the scalp, it is more ubiquitous than other techniques, such as MRI, MRS, and PET, which can even elicit claustrophobia. Further, EEG is silent [20].
- Low radiation: Usage of EEG does not require exposure to intense magnetic fields (over 1 Tesla), which is the case for MRS, MRI, and MEG [21].

All of these advantages make EEG a great option for neuroscience research. However, EEG could be inconvenient for the following reasons:

- Low SNR: Signal-to-noise ratio is low in EEG. This can be mitigated by filtering, either through hardware or software.
- Preparation: EEG electrodes have to be placed correctly on the scalp by using gels, saline solutions, or other methods so that the electrodes are in contact with the scalp for the duration of the test. This makes the preparation time for EEG data collection longer than for the other methods.
- Cryptic: EEG data is not in the form of images, which makes it difficult to observe the interaction between different brain regions or the activation of neurotransmitters, which could be done with techniques that use resonance [19].

The last disadvantage may not be a disadvantage depending on the application. EEG data can be considered raw brainwave data, and useful information about events in the brain can be obtained with some processing and even with no processing at all.

In the health sciences, derivatives of EEG data are used. These derivatives are Evoked Potentials (EPs) and Event-related Potentials (ERPs). The former consist of averaging the EEG data over a time interval as a stimulus (e.g. auditory or visual) is presented to the subject [22]. The latter consist of averaging the EEG over a period of time when the subject is performing a motor or cognitive task repeatedly. These markers are widely used in cognitive psychology and neuroscience research [23].

By filtering EEG signals, the behavior of the brain in different frequency bands, colloquially known as brainwaves, can be observed. In the literature, 5 different frequency bands are defined: Delta (0.1 to 4 Hz), Theta (4 to 7 Hz), Alpha (8 to 13 Hz), Beta (13 to 30 Hz), and Gamma (over 31 Hz) [24]. These band designations are fairly consistent but, depending on the source, there is some variation. The power in frequency bands (computed using the FFT) has been studied for the classification of A and NA subjects.

Fig. 1.3: EEG signals of an epileptic patient during seizure. Image taken from [25].
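As a rough illustration of how power in the frequency bands listed above can be computed, the sketch below sums squared spectral magnitudes per band for a synthetic signal and forms a theta-to-beta power ratio. It is not the processing used in this work: the 128 Hz sampling rate, the two-tone test signal, and the name `band_powers` are assumptions, the gamma band is left out because the text gives it no upper edge, and a direct DFT replaces the FFT only to keep the example dependency-free (both yield the same spectrum).

```python
import cmath
import math

# Band edges in Hz as listed in the text (gamma omitted: no upper edge is given).
BANDS = {
    "delta": (0.1, 4.0),
    "theta": (4.0, 7.0),
    "alpha": (8.0, 13.0),
    "beta": (13.0, 30.0),
}


def band_powers(signal, fs):
    """Sum the squared DFT magnitudes that fall inside each band."""
    n = len(signal)
    powers = {name: 0.0 for name in BANDS}
    for k in range(1, n // 2):  # positive frequencies only, skipping DC
        freq = k * fs / n
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(signal))
        for name, (low, high) in BANDS.items():
            if low <= freq < high:
                powers[name] += abs(coeff) ** 2
    return powers


# Synthetic two-second "EEG" segment: a 6 Hz (theta) tone with amplitude 2
# plus a 20 Hz (beta) tone with amplitude 1. All values are assumptions.
fs = 128
samples = [i / fs for i in range(2 * fs)]
signal = [2.0 * math.sin(2 * math.pi * 6 * t) + math.sin(2 * math.pi * 20 * t)
          for t in samples]

powers = band_powers(signal, fs)
tbpr = powers["theta"] / powers["beta"]
print(f"theta-to-beta power ratio: {tbpr:.2f}")
```

Because power scales with the square of amplitude, the theta tone with twice the amplitude of the beta tone gives a ratio of about 4 here. On real recordings the segment would usually be windowed and averaged over several segments before the ratio is taken.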

For epilepsy, EEG data can be used for diagnosis even without processing. Epilepsy is characterized by abnormal and/or excessive synchronization of the brain cells. This synchronization can be observed with little effort in EEG data, as shown in Fig. 1.3. In the figure, it can easily be seen that there are groups of EEG channels (boxed) that are synchronized [26].

For BCIs and classification problems, more processing is needed because the objective is to automatically and accurately detect patterns in the brainwaves. Machine learning algorithms are generally used in order to create models for the classes (e.g. activities, thoughts, or conditions) that will be detected by the algorithm. The work done in this thesis falls in this category.

This is not the first time signal processing and machine learning algorithms have been used for the classification of A and NA. This will be covered in more detail in Chapter 2, but this problem has been approached since 1992, and high performance has been reported. To the best of our knowledge, the closest competitors to the algorithms presented in this thesis are [27] and [28]. The former reported performance in terms of AUC, with an average AUC of 0.97 (see Section 5.5 for a definition of AUC), and the latter reported performance in terms of sensitivity (A subjects classified correctly), achieving 95.6% sensitivity, that is, missing 4.4% of the A subjects on average. In this thesis, the method that maximized performance achieved an average AUC of 0.99 and missed 3.57% of the A subjects on average.

1.4. Limitations

As will become evident in the next sections, the work presented in this thesis is preliminary and has its limitations. To begin with, this work is a pilot study because the number of subjects whose EEG data was available to us is 8 (4 A and 4 NA). It is understandable if the reader questions whether or not the results presented in this thesis will generalize to larger datasets. It is worth noting, however, that 250 test vectors were extracted from each subject, and training and testing of the classification models developed in this work were performed for 30 unique selections of training and testing subjects, since the data that was used to create and test the models was switched in order to provide average, best, and worst performance values for the classification algorithm.

Besides availability of data, a major limiting factor is the definition of ADHD itself. ADHD is defined by a set of symptoms, which overlap with those of 15 other conditions, such as Bipolar Mood Disorders (BMD), Sleep Disorders, and Obsessive-Compulsive Disorder (OCD) [29, 30].

23 Therefore, it is possible that a child evalated by a clinician may be diagnosed with ADHD when the accrate diagnosis is BMD. In fact, it has been reported that a child may exhibit ADHD symptoms as a reslt of dietary habits, exposre to toxins sch as lead, and deficiencies in vitamins, bt less than 26% of clinicians in the US se laboratory tests to determine if the observed ADHD symptoms are cased by nerodevelopmental disorders or by organic reactions [29]. Conseqently, it is difficlt to say if children diagnosed with ADHD actally have mental disorders; and those who do have mental disorders may not necessarily have ADHD. The way ADHD is diagnosed in the US has changed over the years and it may contine to change. With every new version of the DSM, ADHD symptoms have changed and/or the nmber of minimm symptoms reqired in order to receive ADHD diagnosis has changed, bt these symptoms are still sbjectively observed. Moreover, reports from parents or teachers play a role in the diagnosis of ADHD, and it is known that they can detect it with 47-58% of accracy [10]. Another concern is whether or not diagnosis of ADHD shold be tailored to different age grops. As of now, the DSM-V has different reqirements for adolescents and adlts over 17 years and children p to age 16. Shold there be more grops? Research shows that the age grop that a child is compared to has an impact on whether or not the child receives an ADHD diagnosis or not [16], and age grop alone is a factor that seems to have cased an estimate of 20% incorrect diagnoses. This occrs, again, becase behavioral patterns are observed sbjectively. In this thesis, the eight sbjects were ages 6 to 8. However, the algorithms presented here do not take into consideration age becase it was nknown to s which sbjects were 6, 7, or 8 years old. If the pattern detection work was performed by an objective nit that evalates objective metrics or featres, we hypothesize that these errors wold be mitigated. 
Be mindful, however, that the goal of this work is not to provide a solution to reduce that very high 20% error. The goal of this work is to study the feasibility of automatic classification of A and NA subjects, based on autoregressive coefficient features extracted from multi-channel EEG data, which are then used to create classification models.

The fact that EEG data is used brings up other concerns because EEG is task-dependent. Assuming there are particular mental tasks known to be useful for the classification of A and NA, how would a classification of A and NA be affected if a subject is not performing the activity that he or she is instructed to perform? The answer is that performance most likely decreases. EEG data is so sensitive to tasks that BCIs can detect different mental tasks with over 90% accuracy, depending on the classification algorithms used [31-33]. Therefore, an A subject may be labeled as NA if inadequate data is used for diagnosis.

Something else that should be considered is severity levels of ADHD. The DSM-V defines three severity levels: mild, moderate, and severe [2]. The level of severity depends on how much of the person's social and occupational functioning is affected by ADHD, and the level of severity is assigned subjectively as well. This thesis performs binary classification (either A or NA), and does not provide information on severity levels since that information was not available to us. Moreover, we believe that ADHD diagnoses should come in a continuum of values and should indicate how confident the diagnosis was. Alas, current diagnosis methods do not provide any of these metrics.

Last but not least, gender may be something that should be taken into consideration. It has been reported that ADHD is more common in boys than girls, and ADHD is manifested differently in boys and girls [34]. This work does not take gender into consideration as the gender of the subjects was not provided to us.

All of these reasons make it difficult to obtain the gold standard ADHD subject, which is another limitation to the work presented here. The quality of a classification model depends on the quality of the data that was used to create the model. Since it is difficult to find the gold standard, it is difficult to create ideal classification models. In fact, it is questionable whether or not any data is labeled correctly. This is an issue that is faced later in this thesis, where the label of one NA subject was questioned and later flipped to A.
Maybe this subject was indeed NA but was not performing the activity that he or she was instructed to perform; or maybe he or she was slightly more active or inattentive than the other NAs; or maybe the EEG data of this subject appeared to be similar to that of A subjects as an organic response to dietary habits; or maybe he or she was actually mislabeled.

In summary, studies that concern the classification of A and NA face many challenges. There are many sources of error, to the point that the labels used in a study may not be trustworthy. Nevertheless, if the A and NA labels used are correct, frameworks for the classification of A and NA can be created, and this work argues that if optimization is done at every stage of the framework, classification can be done with high levels of accuracy for the best and worst case scenarios.

Outline

The rest of the thesis is organized as follows. Chapter 2 contains a review of the research that has been done in order to classify A and NA subjects. Special attention is given to controversial methods that were investigated between 1992 and the present time. In Chapter 3 the mathematical techniques used in feature extraction are reviewed. This chapter also contains a review of how EEG channel selection was done in this thesis. In Chapter 4 the use of the K-nearest Neighbor (KNN) algorithm for classifying A and NA subjects is summarized. In Chapter 5 the mathematical background for, and the results obtained from, a Gaussian-Mixture-Model-based (GMM) Universal Background Model (UBM) for the classification of A and NA are provided. Chapter 6 shows how KNN and GMM-UBM can be modified to account for the uncertainty in the diagnoses or labels used for training, which in this document are referred to as soft labels. An iterative approach to minimizing the number of channels for KNN and GMM-UBM is described in Chapter 7. Lastly, in Chapter 8 the conclusions and suggestions for future work are provided.

2. LITERATURE REVIEW

Abnormal EEG patterns have been observed in ADHD subjects over the last 70 years. To the best of our knowledge, in 1938, Dr. Bradley presented evidence showing that there were EEG abnormalities in the children he administered amphetamines to, making him the first to report this observation [6]. Numerous studies between the 1950s and 1960s also noted abnormal EEG signals for A subjects. Quantitative classification of A and NA subjects dates back to the 1980s. Some of the approaches that researchers proposed produced results that had large errors or could not be reproduced. In the last 5 to 8 years, on the other hand, the methods that have been proposed have much higher accuracy. This chapter discusses how EEG has been used over the years in order to classify NA and A subjects.

Spectral Analysis

Since 1992, advances have been made towards quantitatively finding differences between A subjects and NA subjects. In 1992, Mann et al. [35] studied the power in frequency bands of 25 A subjects and 27 NA subjects while they were in baseline activity, reading, and drawing. The number of windows or epochs and their duration were not reported, but there were between 90 and 100 s of EEG recordings for each activity. The study found that the power in the theta band was higher for A subjects than for NA subjects, and the power in the beta band was much lower for the A subjects than for the NA subjects in channels F3 and F4. These features were used in a discriminant function whose type was not disclosed, and the study reported that A subjects were classified correctly 80% of the time and NA subjects 74% of the time.

In 1996, a study used a 19-channel configuration to collect EEG signals from 310 control subjects and 407 ADHD/ATT (other hyperactivity disorders) subjects aged 6 to 17 [36]. EEG signals were recorded during eyes-closed activity, and windows of 2.5 s per subject were used.
The study evaluated the use of mean coherence, mean frequency, and absolute power in frequency bands in order to create a discriminant function to classify control subjects and ADHD/ATT subjects [36]. Although it was not specified what kind of decision function was based on these features (e.g. decision tree, linear discriminant, quadratic discriminant, etc.), the approach achieved 93.1% sensitivity (correct classification of ADHD/ATT subjects) and 94.8% specificity (correct classification of control subjects).

Theta-to-Beta Power Ratio (TBPR)

In 1999, a study reported that the θ/β power ratio (TBPR) of A subjects tends to be higher than that of NA subjects [37]. The hypothesis was that, since theta brainwaves (4-7 Hz) are associated with hyperactivity and beta brainwaves (13-30 Hz) are associated with attention, the ratio of the power in those bands would be larger for A than for NA subjects. To test this hypothesis, EEG signals of 482 subjects 6-17 years old were recorded; 85 of the subjects were NA and 397 (82.37%) were A. The data was recorded from a single EEG channel, Cz, while the subjects were resting with eyes open, resting with eyes closed, reading, and performing visual and motor activities. At least 15 2-s windows were collected from each subject during each task.

In the study, the TBPR was obtained by computing the PSD estimates from the FFT for the theta and beta bands. For classification, the TBPR of NA subjects was averaged, and power ratios that were more than 1.5 standard deviations above the average TBPR of NA (the threshold) were classified as associated with A subjects, whereas those that fell below the threshold were classified as NA. As performance metrics, sensitivity (A subjects classified correctly) and specificity (NA subjects classified correctly) were used. This simple decision rule was reported to yield 86% sensitivity, 98% specificity, and 99% overall predictive power, as confusing and strange as this may sound.

Some of the authors of the latter study replicated their approach in 2001 [38]. The same methods were used, but this time a population of 129 subjects, aged 6 to 20, was used. In this set, there were 96 A and 33 NA. Sensitivity was reported to be 90% and specificity 94%. There were other studies, but sensitivity and specificity could not be analyzed because their datasets only had A subjects.
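The threshold rule described above for [37] can be sketched in a few lines. This is only an illustrative sketch: the function names and the ratio values are hypothetical, and the use of the sample standard deviation (`ddof=1`) is an assumption, since the text does not say which estimator the study used.

```python
import numpy as np

def tbpr_threshold(na_tbpr, n_std=1.5):
    """Threshold = mean of the NA group's TBPR plus n_std standard deviations."""
    na_tbpr = np.asarray(na_tbpr, dtype=float)
    # ddof=1 (sample standard deviation) is an assumption of this sketch.
    return na_tbpr.mean() + n_std * na_tbpr.std(ddof=1)

def classify_tbpr(tbpr, threshold):
    """Label a ratio 'A' when it exceeds the threshold, otherwise 'NA'."""
    return np.where(np.asarray(tbpr, dtype=float) > threshold, "A", "NA")

def sensitivity_specificity(true_labels, predicted_labels):
    """Fraction of A classified as A, and fraction of NA classified as NA."""
    true_labels = np.asarray(true_labels)
    predicted_labels = np.asarray(predicted_labels)
    is_a = true_labels == "A"
    sensitivity = np.mean(predicted_labels[is_a] == "A")
    specificity = np.mean(predicted_labels[~is_a] == "NA")
    return sensitivity, specificity

# Hypothetical ratios, invented purely for illustration.
na_reference = [2.0, 2.5, 3.0, 2.2, 2.8]
threshold = tbpr_threshold(na_reference)

test_tbpr = [5.1, 2.4, 4.0, 2.9]
test_labels = ["A", "NA", "A", "NA"]
predicted = classify_tbpr(test_tbpr, threshold)
sens, spec = sensitivity_specificity(test_labels, predicted)
```

Because the threshold is set only from the NA group, the rule needs no A subjects for training, which may partly explain its appeal in the studies reviewed here.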
The methods developed by [37] have been replicated in numerous studies. Snyder et al. used TBPR in their study [10], which had 159 participants, 97 A and 62 NA. EEG data was recorded from 19 channels in a configuration that follows the international system while the subjects were resting with eyes closed, resting with eyes open, reading, and listening. TBPRs were computed for channels Fp1, Fp2, F3, F4, F7, and F8 for at least 15 windows of 4 seconds for each activity. Sensitivity, specificity, and overall accuracy were found to be 87%, 94%, and 89% respectively, which is in line with reported results [37, 38]. Other studies, between 2004 and 2008, used TBPR for a similar number of windows and similar activities, but a different number of subjects, and reported results between 87% and 96% accuracy [39-41].

Another study [27] used power in frequency bands along with semi-supervised learning in order to classify A and NA subjects. EEG data was recorded from 10 subjects, 7 A and 3 NA, while they were performing an activity that requires attention and that lasted approximately 2 minutes. In this study, the power and power ratios in the α, β, θ, and γ frequency bands were computed over windows of 1 s from channels F3, F7, F8, Fz, Fp1, Fp2, and Cz of the recorded configuration. The mutual information criterion was used to choose the least redundant features for training of a Gaussian support vector machine (SVM). The accuracy of classification, measured in AUC, was 0.92 for TBPR and 0.97 for theta; miss rates were not reported.

Although the problem of classifying A and NA subjects seemed to be solved, recent studies failed to replicate the results [42, 43]. In 2014, a study involving 62 A and 55 NA subjects reported accuracy rates between 49.2% and 54.8%. In this study, EEG data was recorded for 3 minutes of eyes-closed activity. Although 128 EEG channels were available, only channels Cz, Fz, and Pz were used. The TBPR was computed for 2 s windows with overlaps of 1 s, which results in a large number of windows to test with for every subject. For classification, logistic regression was used, and AUC values were reported. Similarly, another study [44], which involved 54 A and 51 NA subjects during eyes-closed and eyes-open activity, reported 40% to 53% overall accuracy for TBPR using stepwise discriminant analyses and other discriminant functions that were not disclosed. In this study, TBPRs were computed from channel Cz of the recorded configuration. In total, there were approximately 20 2-second windows for each activity.
Although the validity of TBPR became questionable, the studies that question TBPR did not exactly use the method described earlier [37]. For instance, [42] used logistic regression rather than a threshold based on standard deviations; likewise, [44] used discriminant functions, which were not thoroughly explained, that may be different from the threshold concept [37]. These factors may have had an impact on the performance of TBPR. Moreover, the data acquisition devices used vary from study to study.

Since it is difficult to say whether or not TBPR is the solution to the classification of A and NA, other methods must be explored. Unfortunately, most of the studies have focused on reproducing the results found for the TBPR or using the TBPR. However, there are a few studies that have used signal processing techniques and machine learning to approach the classification of A and NA.

Other Approaches for the Classification of A and NA Subjects

The effectiveness of event-related potentials (ERPs) was studied as well [45]. In that study, 74 A and 74 NA subjects performed a visual two-stimulus GO/NOGO task while their EEG data was recorded, which lasted approximately 22 minutes. Independent component analysis (ICA) was performed on the ERPs, and these features were used to train an SVM classifier, which achieved 92% classification accuracy (90% sensitivity and 94% specificity).

Another method that has been explored is feed-forward neural networks [28]. The study had 54 subjects, 47 A and 7 NA, whose EEG signals were recorded during eyes-closed activity for 3 minutes, which resulted in 14,000 samples for each of the 19 channels used in this study, for each subject. With the wavelet algorithm (no details specified), the EEG data was decomposed into the 5 frequency bands between 0 and 60 Hz (alpha, beta, delta, theta, and gamma) plus the original signal. This was done for every channel, so there were 6x19 = 114 features at the input layer of the neural network. The hidden layer combined the features non-linearly, and returned a value of 1 or 0 to indicate A or NA respectively. By using this approach, a sensitivity of 95.6% was achieved.

There are other studies, but they focus more on replication than on creation. Therefore, other approaches that have been successful at finding EEG patterns should be considered for the classification of NA and A subjects. The detection of stroke, epilepsy, and the classification of mental tasks for BCIs have inspired many successful algorithms, which should also be considered for the problem at hand.

Classification of Other EEG Patterns

Autoregressive (AR) coefficients are good candidates to use as features for the classification of A and NA.
Autoregressive coefficients have been widely used as features in BCIs, yielding highly accurate results for the classification of mental tasks (reading, performing arithmetic operations, etc.) using short windows of time [31-33]. If the functioning of an NA brain is modeled as an activity and the functioning of an A brain is modeled as another activity, then the results that were found in the latter studies [31-33] should extrapolate to NA and A.

A study explored the use of AR coefficients for the classification of ADHD (A) subjects and bipolar mood disorder (BMD) subjects [46]. In the study, EEG data was recorded from 21 A and 22 BMD subjects during eyes-closed and eyes-open activities, 3 minutes each. A configuration with 22 channels was used, and the AR coefficients, of unspecified order, were extracted from 1-second intervals from each channel. Multiple classifiers were trained and the overall accuracy was slightly over 70%.

As far as algorithms go, Gaussian Mixture Models (GMM) have been used for classification, but not specifically for the classification of A and NA subjects. GMMs have been used for neonatal seizure detection [47]. The study recorded EEG data from 17 subjects who were between 39 and 42 weeks old. The data was recorded using eight combinations of two channels for approximately 15 hours per subject. Approximately 691 seizures were observed during that time for all the subjects. As features, the power in multiple frequency bands and peak frequencies in the spectrum were used, and the performance was measured in AUC.

3. FEATURE EXTRACTION

Feature extraction is an essential component in machine learning, especially when dealing with large datasets. There are two reasons for performing feature extraction. First, feature extraction is done to represent a vector of samples of arbitrary size as a vector of lower size, in order to reduce the number of computations needed to classify the vector. Second, depending on the data, raw data may not be enough to detect patterns or perform classification. For stroke detection, it is clear when there is a stroke and when there is not, because a stroke causes EEG data to spike. Thus, feature extraction may not be needed. Nevertheless, when the changes in the data are very subtle or not visible, features must be extracted to maximize the difference between the classes that need to be detected.

In this chapter, the mathematical background and algorithms used to compute the features used throughout the thesis are presented. This chapter focuses on AR modeling and the method used to select the EEG channels used throughout this thesis.

AR Modeling

The objective of linear prediction is to predict the current value of a signal based on its previous values. Linear prediction theory states that, for an adequate value of p, the current value of a discrete stochastic process $x[n]$ can be predicted based on its p previous values, incurring a prediction error $e[n]$, which is a white process:

$$x[n] = -\sum_{k=1}^{p} a_k\, x[n-k] + e[n] \qquad (3.1)$$

where the coefficients $a_k$ are AR coefficients. Then, defining the predicted estimates as

$$\hat{x}[n] = -\sum_{k=1}^{p} a_k\, x[n-k] \qquad (3.2)$$

yields the prediction error

$$e[n] = x[n] - \hat{x}[n] \qquad (3.3)$$
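As a small numerical illustration of (3.1) through (3.3), the sketch below synthesizes an AR(2) process from white noise and checks that the one-step predictor recovers that noise as its prediction error. The coefficient values are hypothetical, and the sign convention (prediction filter $1 + \sum_k a_k z^{-k}$) is an assumption of this sketch.

```python
import numpy as np

def predict(x, a):
    """One-step prediction per (3.2): x_hat[n] = -sum_k a[k-1] * x[n-k]."""
    p = len(a)
    x_hat = np.zeros_like(x)
    for n in range(p, len(x)):
        past = x[n - p:n][::-1]        # x[n-1], x[n-2], ..., x[n-p]
        x_hat[n] = -np.dot(a, past)
    return x_hat

# Hypothetical, stable AR(2) coefficients a_1, a_2 (illustration only).
a = np.array([-0.75, 0.5])
p = len(a)

rng = np.random.default_rng(0)
e = rng.standard_normal(2000)          # white driving process e[n]

# Synthesize x[n] according to (3.1).
x = np.zeros_like(e)
for n in range(len(x)):
    x[n] = e[n]
    for k in range(1, p + 1):
        if n - k >= 0:
            x[n] -= a[k - 1] * x[n - k]

# Prediction error per (3.3); for n >= p it equals the driving noise.
err = x - predict(x, a)
```

With the true coefficients, the prediction error is exactly the white process that generated the signal; with estimated coefficients it is only approximately white, which is what the estimation methods discussed next try to achieve.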

From a deterministic point of view, the discrete stochastic process $x[n]$ in (3.1) has a Z-transform $X(z)$, which can be expressed as a function of $e[n]$, which has a Z-transform $E(z)$:

$$X(z) = \frac{E(z)}{1 + \sum_{k=1}^{p} a_k z^{-k}} \qquad (3.4)$$

From (3.4), $E(z)$ can be modeled as the output of a prediction filter $A_p(z)$ when driven by $X(z)$. This process is summarized in Fig. 3.1.

Fig. 3.1: Linear prediction model.

$A_p(z)$ in Fig. 3.1 is a prediction filter of order p that can be expressed as

$$A_p(z) = 1 + \sum_{k=1}^{p} a_k z^{-k} \qquad (3.5)$$

AR coefficients are computed to minimize the error associated with prediction. The error criterion that is minimized depends on the algorithm. This chapter limits the development to the Burg method, since that is the one that was used. This choice was made to minimize a particular error criterion and also guarantee model stability, as will be shown in the next section.

Burg Method

The Burg method finds the AR coefficients by minimizing the sum of the forward and backward prediction errors, $f_p[n]$ and $b_p[n]$, in the least-squares sense over a time interval of length L samples, in an order-recursive fashion:

$$E_i = \sum_{m=i}^{L-1} \left( f_i^2[m] + b_i^2[m] \right) \qquad (3.6)$$

where $f_i[n]$ and $b_i[n]$ are expressed as

$$f_i[n] = x[n] + a_{i1}\, x[n-1] + a_{i2}\, x[n-2] + \dots + a_{ii}\, x[n-i] \qquad (3.7)$$

$$b_i[n] = x[n-i] + a_{i1}\, x[n-i+1] + a_{i2}\, x[n-i+2] + \dots + a_{ii}\, x[n] \qquad (3.8)$$

where each $a_{ik}$ is an autoregressive coefficient of a model of order i. For the order-recursive development, in the Burg method, $f_i[n]$ and $b_i[n]$ are rewritten as

$$f_i[n] = f_{i-1}[n] + \kappa_i\, b_{i-1}[n-1] \qquad (3.9)$$

$$b_i[n] = b_{i-1}[n-1] + \kappa_i\, f_{i-1}[n] \qquad (3.10)$$

where $\kappa_i$ are the reflection coefficients and i = 1, 2, …, p. Reflection coefficients are a different representation of the AR coefficients. They carry the same information, but their values have different distributions. When using Burg's method, the reflection coefficients are constrained to the range [-1, 1]. By substituting (3.9) and (3.10) into (3.6), (3.11) is obtained:

$$E_i = \sum_{m=i}^{L-1} \left( f_{i-1}[m] + \kappa_i\, b_{i-1}[m-1] \right)^2 + \sum_{m=i}^{L-1} \left( b_{i-1}[m-1] + \kappa_i\, f_{i-1}[m] \right)^2 \qquad (3.11)$$

By taking the derivative of (3.11) with respect to $\kappa_i$ and setting it to 0, the minimum is found:

$$\frac{\partial E_i}{\partial \kappa_i} = 2 \sum_{m=i}^{L-1} \left( f_{i-1}[m] + \kappa_i\, b_{i-1}[m-1] \right) b_{i-1}[m-1] + 2 \sum_{m=i}^{L-1} \left( b_{i-1}[m-1] + \kappa_i\, f_{i-1}[m] \right) f_{i-1}[m] = 0 \qquad (3.12)$$

and solving for $\kappa_i$ results in

$$\kappa_i = \frac{-2 \sum_{m=i}^{L-1} f_{i-1}[m]\, b_{i-1}[m-1]}{\sum_{m=i}^{L-1} \left( f_{i-1}^2[m] + b_{i-1}^2[m-1] \right)} \qquad (3.13)$$

which can be used to compute the values of $\kappa_i$. (3.13) is used recursively along with (3.14) in order to find the AR coefficients:

$$\mathbf{a}_p = \mathbf{a}_{p-1} + \kappa_p\, \mathbf{a}_{p-1}^{R} \qquad (3.14)$$

where $\mathbf{a}_p$ is the vector of AR coefficients of order p and $\mathbf{a}_{p-1}^{R}$ is the time-reversed vector of AR coefficients of order p-1. (3.14) can be written in matrix form as

$$\begin{bmatrix} a_{p,1} \\ a_{p,2} \\ \vdots \\ a_{p,p-1} \\ a_{p,p} \end{bmatrix} = \begin{bmatrix} a_{p-1,1} \\ a_{p-1,2} \\ \vdots \\ a_{p-1,p-1} \\ 0 \end{bmatrix} + \kappa_p \begin{bmatrix} a_{p-1,p-1} \\ a_{p-1,p-2} \\ \vdots \\ a_{p-1,1} \\ 1 \end{bmatrix} \qquad (3.15)$$

which is known as the Levinson-Durbin recursion [48]. In short, the Burg algorithm can be summarized by the following steps:

0. Initialize the parameters $f_0[n] = b_0[n] = x[n]$, $A_0(z) = 1$, and $E_0 = \frac{1}{L} \sum_{m=0}^{L-1} x^2[m]$
1. At stage p-1, the following information is available: $f_{p-1}[n]$, $b_{p-1}[n]$, and $A_{p-1}(z)$
2. Compute $\kappa_p$ using (3.13)
3. Compute $A_p(z)$ using (3.15)
4. Compute $f_p[n]$ and $b_p[n]$
5. Compute the error using $E_p = \left( 1 - \kappa_p^2 \right) E_{p-1}$
6. Go to stage p+1

The Levinson-Durbin recursion could also be executed backwards: if the quantities $f_p[n]$, $b_p[n]$, $E_p$, $\kappa_p$, and $A_p(z)$ are known, the steps could be done in reverse to find $f_{p-1}[n]$, $b_{p-1}[n]$, $E_{p-1}$, $A_{p-1}(z)$, and $\kappa_{p-1}$. However, trouble arises when any $|\kappa_i| = 1$ because $E_{p-1} = E_p / \left( 1 - \kappa_p^2 \right)$. Fortunately, $|\kappa_i| = 1$ is unlikely to happen because $\kappa_i$ is a partial correlation coefficient.
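The steps listed above can be sketched in NumPy as follows. This is a minimal sketch and not the implementation used in the thesis: the function name `burg`, the synthetic AR(2) test signal, and the sign convention $A_p(z) = 1 + \sum_k a_k z^{-k}$ are assumptions made for illustration.

```python
import numpy as np

def burg(x, order):
    """Estimate AR coefficients with the Burg recursion.

    Returns (a, kappa, err): a = [1, a_1, ..., a_p], the reflection
    coefficients kappa_1..kappa_p, and the final prediction error power.
    """
    x = np.asarray(x, dtype=float)
    f = x.copy()                  # step 0: f_0[n] = x[n]
    b = x.copy()                  # step 0: b_0[n] = x[n]
    a = np.array([1.0])           # step 0: A_0(z) = 1
    err = np.mean(x ** 2)         # step 0: E_0
    kappa = np.zeros(order)
    for i in range(1, order + 1):
        fm = f[i:].copy()         # f_{i-1}[m],   m = i .. L-1
        bm = b[i - 1:-1].copy()   # b_{i-1}[m-1], m = i .. L-1
        # step 2: reflection coefficient, as in (3.13)
        k = -2.0 * np.dot(fm, bm) / (np.dot(fm, fm) + np.dot(bm, bm))
        # step 3: Levinson-Durbin update, as in (3.15)
        padded = np.append(a, 0.0)
        a = padded + k * padded[::-1]
        # step 4: update forward and backward errors, as in (3.9) and (3.10)
        f[i:] = fm + k * bm
        b[i:] = bm + k * fm
        # step 5: update the error power
        err *= 1.0 - k ** 2
        kappa[i - 1] = k
    return a, kappa, err

# Synthetic check on a hypothetical AR(2) process driven by unit-variance noise.
rng = np.random.default_rng(0)
true_a = np.array([1.0, -0.75, 0.5])
e = rng.standard_normal(5000)
x = np.zeros_like(e)
for n in range(len(x)):
    x[n] = e[n]
    for k in range(1, len(true_a)):
        if n - k >= 0:
            x[n] -= true_a[k] * x[n - k]

a_hat, kappa_hat, err_hat = burg(x, order=2)
```

Since each reflection coefficient is a ratio of the form $-2\,\mathbf{f}^T\mathbf{b} / (\|\mathbf{f}\|^2 + \|\mathbf{b}\|^2)$, its magnitude cannot exceed 1, which is the stability property mentioned in the text.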

There are other algorithms to find AR models, but they have disadvantages. The autocorrelation method minimizes only the forward prediction error and zero-pads the ends of $x[n]$, which introduces a bias that increases errors on the one hand, but guarantees stability; the covariance method has stability issues because the reflection coefficients are not constrained to [-1, 1]. Levinson-Durbin recursion solves the Yule-Walker equations without performing any kind of minimization. Although the coefficients may not vary too much in practice, these are some of the reasons the Burg algorithm is considered to be more robust than other methods.

Hence, the Burg method produces 2 vectors that contain the same information: a vector of AR coefficients, and a vector of reflection coefficients. Throughout this work, AR coefficients are constantly used. Reflection coefficients and Line Spectral Frequencies (LSF) were used for some experiments in order to investigate which one of these quantities maximized the accuracy of classification.

LSF

LSF are a different representation of AR coefficients, just like reflection coefficients. LSF are computed from the AR coefficients. This is done by expressing $A_p(z)$ as the sum of two polynomials, $P(z)$ and $Q(z)$, both of order p+1,

$$A_p(z) = 1 + \sum_{k=1}^{p} a_k z^{-k} = \frac{P(z) + Q(z)}{2} \qquad (3.17)$$

where $P(z)$ and $Q(z)$ can be expressed as

$$P(z) = A_p(z) + z^{-(p+1)} A_p(z^{-1}) \qquad (3.18)$$

$$Q(z) = A_p(z) - z^{-(p+1)} A_p(z^{-1}) \qquad (3.19)$$

The LSF are the phases of the roots of both polynomials, which are defined as

$$\mathbf{w}_P = \left\{ w_{P1}, w_{P2}, \dots, w_{P,p/2} \right\} \qquad (3.20)$$

$$\mathbf{w}_Q = \left\{ w_{Q1}, w_{Q2}, \dots, w_{Q,p/2} \right\} \qquad (3.21)$$

and are interspersed:

$$\mathbf{L}_p = \left\{ w_{P1}, w_{Q1}, w_{P2}, w_{Q2}, \dots, w_{P,p/2}, w_{Q,p/2} \right\} \qquad (3.22)$$

3.4. Akaike Information Criterion (AIC)

Although AR coefficients and their different variations are said to model the behavior of a signal, there is a parameter that has to be tuned: the order of the model. If the order of a model is too low, it will not be able to model the data it was created with (underfit). On the other hand, if the order is too high, it will accurately model the data it was created with, but it will not be able to generalize to model data at future time instants (overfit). Therefore, a reasonable order must be chosen in order to accurately model the existing data and predict data at future time instants with a reasonable error.

A way to compute the goodness of fit of a model is by using the Akaike Information Criterion (AIC) [49]. There are other methods, but a study that investigated order selection methods for EEG signals deemed most, if not all, of the methods to be useless, and found AIC to be the only method that tended to not underestimate the order of the AR model [50]. The AIC is computed as follows:

$$\mathrm{AIC}(p) = L \ln \sigma_p^2 + 2p \qquad (3.23)$$

where L is the length of the window used, $\sigma_p^2$ is the prediction error variance of the model, and p is the order of the model. The order p of a model is taken to be that which best fits, i.e. the value of p that minimizes AIC.

In order to choose the order of a model, the AIC was computed on 100 intervals of 51 samples for AR models of orders 1 through 15. The AR coefficients, of orders 1 through 15, were computed on the 100 intervals to obtain $\sigma_p^2$ for every model for every interval.
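The order-selection experiment described above can be sketched as follows. This is a hedged illustration only: the data here is a synthetic AR(2) signal rather than EEG, the helper names are hypothetical, and the error variance for each order is taken from a compact Burg recursion, which is an assumption about how $\sigma_p^2$ is obtained.

```python
import numpy as np

def burg_error_powers(x, max_order):
    """Prediction error power for orders 1..max_order from one Burg pass."""
    x = np.asarray(x, dtype=float)
    f, b = x.copy(), x.copy()
    err = np.mean(x ** 2)
    powers = np.empty(max_order)
    for i in range(1, max_order + 1):
        fm, bm = f[i:].copy(), b[i - 1:-1].copy()
        k = -2.0 * np.dot(fm, bm) / (np.dot(fm, fm) + np.dot(bm, bm))
        f[i:], b[i:] = fm + k * bm, bm + k * fm
        err *= 1.0 - k ** 2
        powers[i - 1] = err
    return powers

def mean_aic(windows, max_order):
    """Average AIC(p) = L ln(sigma_p^2) + 2p over all windows."""
    orders = np.arange(1, max_order + 1)
    aics = [len(w) * np.log(burg_error_powers(w, max_order)) + 2 * orders
            for w in windows]
    return orders, np.mean(aics, axis=0)

# Synthetic stand-in for EEG: a hypothetical AR(2) process.
rng = np.random.default_rng(1)
e = rng.standard_normal(100 * 51)
x = np.zeros_like(e)
for n in range(len(x)):
    x[n] = e[n]
    if n >= 1:
        x[n] += 0.75 * x[n - 1]
    if n >= 2:
        x[n] -= 0.5 * x[n - 2]

windows = x.reshape(100, 51)          # 100 intervals of 51 samples
orders, aic = mean_aic(windows, max_order=15)
best_order = int(orders[np.argmin(aic)])
```

Averaging the AIC over many short windows, as in the text, smooths out the window-to-window variability of the estimated error variance before the minimum is located.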

Fig. 3.2: AIC vs model order.

Figure 3.2 shows the results of the experiment. The plot shows the normalized AICs, which were obtained by averaging all 100 AICs for every order p. The graph shows that AIC is minimum when the order is 8. Note that 7 would also be a good choice, but not as good as 8 or 9. Since AIC is known to slightly overestimate the order of a model [51], the order was set to 7, instead of 8 or 9, for the rest of the thesis.

Channel Choice

To reduce the number of channels to be used for the analyses described herein, the aim was to determine five channels that are likely to provide good discrimination. Previous research indicates that resting state eyes-open and eyes-closed theta/beta power ratios (TBPR) tend to be higher for A subjects than for NA subjects [35, 37]. The preliminary step of channel reduction is therefore executed based on TBPR; however, TBPRs were evaluated here during ANT activity for all recorded channels, for all subjects, i.e. not during resting state. TBPRs were computed using the FFT over the entire duration of the ANT task to compute the power spectral densities.

$$\mathrm{TBPR}_{csk} = \frac{\mathrm{PSD}_{\theta}(c, s, k)}{\mathrm{PSD}_{\beta}(c, s, k)} \qquad (3.24)$$

where $\mathrm{TBPR}_{csk}$ indicates the TBPR of class c, subject s, for channel k, for c = 0, 1, s = 0, 1, …, and k = 0, 1, …, 23. c = 0 indicates A and c = 1 indicates NA. The next step was computation of all cross ratios, defined as the ratio of TBPR-A over TBPR-NA:

$$\mathrm{CR}_{slk} = \frac{\mathrm{TBPR}_{0sk}}{\mathrm{TBPR}_{1lk}} \qquad (3.25)$$

Fig. 3.3: Cross ratios for all channels.

Figure 3.3 reflects the distribution of cross ratios for each of the EEG channels, in the form of boxplots. The red lines in the middle of the boxes represent the means, while the top and bottom sides of the boxes represent the 75th and 25th percentiles respectively, with the upper and lower horizontal lines representing the maximum and minimum values respectively. The cross ratios range upward from about 0.9. Note that these are not centered about 1, i.e. they are higher for A than for NA, so there is some truth to [35, 37]. Based on this preliminary analysis step, the channels chosen to proceed with are Fc2, Fc1, Fc5, Cp6, and C3. This choice does not necessarily mean that these five channels produce the very best possible discrimination; after all, the performance of various methods is yet to be analyzed, and the results of such analysis may indicate that optimization of the channel choice needs refinement when targeting a specific application.

Fc2, Fc1, Fc5, Cp6, and C3 are the 5 channels used for the experiments described in Chapters 4 through 7. Since the order of the AR models was set to 7, the AR(7) coefficients were extracted from each of the 5 channels, for every time interval/window of the ANT activity, and/or the activity under study. This resulted in 35-D (5x7) feature vectors. Note that there are 8 AR coefficients for an AR(7) model, but the first coefficient is always normalized to 1, regardless of the data. Therefore, the first parameter was not included in the feature vector.

Dataset

The data used in this study was made available to us by our collaborator from the Psychology Department at Virginia Tech, Dr. Martha Ann Bell. The dataset consisted of 8 subjects: 4 A and 4 NA children between the ages of 6 and 8 years, who visited the research lab as part of an ongoing longitudinal study focused on frontal lobe development from infancy through childhood. Information regarding diagnosis of ADHD was obtained via maternal report.

EEG was recorded using a stretch cap (Electro-Cap, Inc., Eaton, OH: E1-series cap) in the extended 10/20 system pattern. Recordings were made from 26 electrodes located equidistant across the scalp. Electrode impedances were kept under 20 kΩ. The electrical activity from each lead was amplified using separate bioamps (James Long Company, Caroga Lake, NY). During data collection, the high-pass filter was a single-pole RC filter with a 0.1 Hz cut-off (3 dB or half-power point) and 6 dB/octave roll-off. The low-pass filter was a two-pole Butterworth type with a 100 Hz cut-off (3 dB or half-power point) and 12 dB/octave roll-off. The EEG signal was digitized at 512 samples per second for each channel so that data were not affected by aliasing. The acquisition software was Snapshot-Snapstream (HEM Data Corp, Southfield, MI).
Prior to the recording of each subject, a 10 Hz, 50 µV peak-to-peak sine wave was input through each amplifier and digitized for 30 s. This signal was analyzed and the resulting power values used to calibrate the EEGs.

After the EEG electrodes were applied, children participated in eyes open, eyes closed, and quiet VIDEO baseline events to collect resting EEG data. Then the children completed a battery of cognitive tasks designed to assess various aspects of attention [52] using the child version [53] of the Attention Network Task (ANT) and various aspects of cognition associated with executive functions (e.g., number Stroop, Dimensional Change Card Sort Task, Digit Span Task). Data from the ANT were used in the analyses that are the focus of this report.

The ANT was designed to assess Posner's brain-based attention networks [15] and yields measures of conflict, alerting, and orienting. The test requires the child to respond to a central target (a yellow fish on a light blue background) displayed on a computer screen and indicate whether the fish is facing left or right. The child is instructed to look at the fixation point, above or below which the target will appear. The target may appear with or without flankers (other fish), which may or may not be congruent with respect to the direction they are facing. Reaction time responses to the alert cues, spatial cues, and flankers are manipulated to provide an assessment of the efficiency of each of the attention networks. The ANT is divided into 3 blocks of ~5 minutes each, with a brief rest period between blocks. The EEG recorded during the first and second blocks was used in these analyses.

After the research visit, EEG data were analyzed using EEG Analysis software developed by the James Long Company. Data were re-referenced via software to an average reference configuration and then analyzed with a discrete Fourier transform (DFT) using a Hanning window of 1 second width and 50% overlap. Power values were computed at each electrode site for theta (4-7 Hz) and beta (13-30 Hz) frequency bands. Power was expressed as mean square microvolts.
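The band-power computation described above (1-second Hanning windows with 50% overlap, theta 4-7 Hz, beta 13-30 Hz) can be sketched in NumPy as follows. This is not the James Long Company software: the function names are hypothetical, the test signal is synthetic, and absolute scaling to mean square microvolts is deliberately omitted because it cancels in the theta/beta ratio.

```python
import numpy as np

def band_power(x, fs, band, win_sec=1.0, overlap=0.5):
    """Average unnormalized power in `band` (Hz) over Hann-windowed segments."""
    n_win = int(round(win_sec * fs))
    step = int(round(n_win * (1.0 - overlap)))
    window = np.hanning(n_win)
    freqs = np.fft.rfftfreq(n_win, d=1.0 / fs)
    in_band = (freqs >= band[0]) & (freqs <= band[1])
    powers = []
    for start in range(0, len(x) - n_win + 1, step):
        segment = x[start:start + n_win] * window
        spectrum = np.abs(np.fft.rfft(segment)) ** 2
        powers.append(spectrum[in_band].sum())
    return float(np.mean(powers))

def tbpr(x, fs, theta=(4.0, 7.0), beta=(13.0, 30.0)):
    """Theta-to-beta power ratio of a single-channel signal."""
    return band_power(x, fs, theta) / band_power(x, fs, beta)

# Synthetic single-channel example sampled at 512 Hz: a strong 6 Hz (theta)
# component plus a weaker 20 Hz (beta) component, so the ratio is well above 1.
fs = 512
t = np.arange(10 * fs) / fs
x = 2.0 * np.sin(2 * np.pi * 6 * t) + 1.0 * np.sin(2 * np.pi * 20 * t)
ratio = tbpr(x, fs)
```

With 1-second windows the frequency resolution is 1 Hz, so the theta and beta band edges fall exactly on DFT bins at this sampling rate.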

4. KNN CLASSIFICATION

In this chapter, we present our first approach for the classification of A and NA subjects. Using the features and channel selection method described in the previous chapters along with the K-nearest Neighbor (KNN) algorithm, this chapter explores how separable the A and NA classes are in the feature domains used. Further, in this chapter a confidence level is proposed. Said confidence level speaks to how much confidence there is in the decision that a subject belongs to one class or the other.

The performance of the KNN algorithm is explored when using AR coefficients, reflection coefficients, and line spectral frequencies as features for the classification of A and NA subjects. Since there is no processing, dimensionality reduction, or space warping associated with KNN, the performance of KNN models is an indicator of how separable the NA and A classes are. To the best of our knowledge, we are the first to evaluate this family of features for the classification of A and NA subjects.

In this chapter, the objective is not only to obtain high accuracy, but also to obtain high confidence of classification. This is an important factor to keep in mind because a decision that comes with 100% confidence is a confident decision. If the decision is right, it means that the subject clearly is part of that class, but if the decision is wrong, it should be investigated why the subject is so strongly classified as being part of the wrong class. On the other hand, a decision that comes with near 50% confidence is nothing more than a guess, regardless of whether the decision is right or wrong.

K-nearest Neighbor Algorithm

The K-nearest Neighbor algorithm, also known as KNN, is a machine learning algorithm that can be used for classification and regression. Unlike other machine learning algorithms, the process of training a KNN model consists of storing the data used in training, which makes it one of the simplest machine learning algorithms [54].
For classification, a KNN algorithm finds the K training vectors that are closest in distance to a test vector x. Although Euclidean distance is usually used, any other distance metric (e.g. Hamming distance) or user-defined function can be used to compute the distance between two vectors. Once the K closest training vectors have been found, the label assigned to x is the most frequent label among the K nearest neighbors. Figure 4.1 provides an illustration of how the algorithm works.

Fig. 4.1: Example of a 2D, 2-class classification using KNN with K set to 9.

Figure 4.1 presents a 2D, 2-class classification problem using KNN. In the example, the label of a test vector, denoted by X, is unknown. Since the value of K was set to 9, the 9 training vectors that are closest to X are found (circled in the figure). Note that 5 of the training vectors belong to Class 1 and 4 of the training vectors belong to Class 2. Therefore, X is assigned the label Class 1.

Confidence in KNN

Since KNN classification is based on vote counts over the total number of votes, a confidence level can be obtained to reflect the level of confidence with which a decision was made. For the example of Fig. 4.1, there are two confidence values:

43 # Class1votes Class1conf (4.1) K # Class2votes Class2conf (4.2) K Confidence levels are bonded to the range [0,1], with 1 being highest confidence and 0 meaning no confidence. For the example of Fig. 4.1, the test vector was labeled as Class 1 with a confidence of 5/9. Since this vale is between 0.4 and 0.6, it can be considered to be close to gessing Choosing the Vale of K K is the only parameter that can be explored in order to optimize performance. The vale of K that maximizes performance always depends on the data. Therefore, a line search from 1 to, typically, half of the size of the training dataset is performed. In other words, KNN models have to be made and then tested for K = 1,2,3,, T, where T can be 50% of the nmber of training vectors or a lower nmber. Then, K is set to the vale that maximizes performance [54]. The vale of K is also chosen to make the algorithm robst to ties. For example, in two-class classification problems, it is recommended that K be an odd nmber. In short, it is recommended that for a C-class problem, K is set so that C is not divisible by K. If C is inevitably a mltiple of K, then heristics are sed to resolve ties Disadvantages of KNN There are two known disadvantages associated with KNN. First, jst like with any other machine learning model, the performance of a KNN model is highly dependent on the training dataset. However, since no processing is done on the training dataset to create a KNN model, the measred performance of the model will depend on how separable the training dataset is and on how similar the testing dataset is to the training dataset. Other machine learning algorithms, on the other hand, involve processing, which is done to the benefit or the detriment of the model. The other disadvantage of KNN is the crse of dimensionality, which affects classification algorithms that are highly dependent on distance metrics. 
If the number of dimensions is too high and/or the N scalar values of the N-dimensional training vectors are large, the distance between two vectors may become very large (approach infinity), even for neighboring elements, which causes misclassification. Fortunately, the curse of dimensionality can be addressed by reducing the number of dimensions and/or normalizing the data so that the distances do not approach infinity [55].

4.5. Performance Evaluation

Performance is defined in terms of accuracy of classification, which is defined as the number of true positives (TP) plus the number of true negatives (TN) over the total number of tests.

$$\text{Accuracy} = \frac{TP + TN}{\#\,\text{tests}} \tag{4.3}$$

4.6. KNN Experiments

In this section, the experiments performed with KNN models are covered. Parameter selection is discussed first, followed by the experiments in which 2 subjects (1 A and 1 NA) were used for training and 2 others (1 A and 1 NA) were used for testing.

4.6.1. Window Size and Choice of K

The preliminary experiments were based on 4 subjects (all are identified by a numerical value together with the given label): 18316NA, 18396A, 18586NA, and 18606A. For the selected channels, during the ANT, the distribution of estimated AR orders based on 20 random sets of 0.1 sec of data (51 samples) peaked at 7, 8, and 9. To compensate for the tendency of AIC to overestimate the order of AR models, the order used in this study is set to 7.

The 4 subjects were variously paired for training as follows: AB (18396A, 18316NA), AD (18606A, 18316NA), CB (18396A, 18586NA), and CD (18606A, 18586NA); for each of these cases, the 2 subjects not part of the pairing for training were used for testing.

To have an idea of the effect of observation window length on classification performance, AR(7) coefficients were computed from windows of 0.05, 0.1, 0.2, 0.5, 1, and 2 seconds long. Given that 5 channels were selected, the feature vectors consist of the concatenation of 5 sets of 7 AR coefficients, i.e. 35-D vectors.

By using two subjects for training (one A and one NA) and the other two for testing, KNN classifiers were built. For training, 200 random observation intervals were used; for testing, using an overlap of 50%, all windows possible over the ANT interval were used (from sec windows to sec windows). This process was executed for K = 1, 3, 5, ..., 99.

Fig. 4.2: Accuracy for different window sizes and values of K.

Figure 4.2 summarizes the optimization process executed to determine the window size and K to use. The accuracy values displayed in Fig. 4.2 reflect the mean accuracy of the 4 pairs of subjects (AB, AD, CB, and CD) for hundreds of test vectors. From the graph, it is clear that windows of 0.5 s or less should not be used, since the accuracy is below 0.85 for any value of K. Windows of 2 s seem to achieve higher accuracy than any of the shorter window lengths. For windows of 2 s, there are several local maxima, at K = 5 and K = 99, but the global maximum is at K = 51. Thus, K was set to 51 and the window size to 2 s.

Since the value of K for KNN classifiers always changes depending on the data and the application, the effect of window length is explored in more detail. Figures 4.3 through 4.5 examine how accuracy, true positive rate (TPR), true negative rate (TNR), and confidence levels (A_conf and NA_conf) change as window length changes.
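Accuracy as defined in Eq. (4.3), together with TPR and TNR, can be computed directly from per-window decisions. The sketch below is illustrative only: it assumes A is the positive class (TPR counts A vectors classified correctly, TNR counts NA vectors classified correctly), and the function name and example labels are made up.

```python
def classification_rates(true_labels, predicted_labels, positive="A"):
    """Return (accuracy, TPR, TNR), treating `positive` as the positive class."""
    tp = tn = fp = fn = 0
    for truth, guess in zip(true_labels, predicted_labels):
        if truth == positive:
            if guess == positive:
                tp += 1  # A vector classified as A
            else:
                fn += 1  # A vector classified as NA
        else:
            if guess == positive:
                fp += 1  # NA vector classified as A
            else:
                tn += 1  # NA vector classified as NA
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total  # Eq. (4.3)
    tpr = tp / (tp + fn) if (tp + fn) else float("nan")
    tnr = tn / (tn + fp) if (tn + fp) else float("nan")
    return accuracy, tpr, tnr


# Made-up decisions for six test windows (four from an A subject, two from an NA subject).
truth = ["A", "A", "A", "A", "NA", "NA"]
decided = ["A", "A", "A", "NA", "NA", "NA"]
accuracy, tpr, tnr = classification_rates(truth, decided)
```

Here one of the four A windows is missed, so TPR is 3/4 while TNR is 1, and accuracy is 5/6.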

Fig. 4.3: Classification accuracy for 4 pairings as window size changes.

Figure 4.3 shows the results that were aggregated to obtain the dark blue curve of Fig. 4.2 (K = 51, 2-second windows). As seen previously, accuracy increases as window length increases. For windows of 2-sec duration, classification accuracy varies from 85% to 95%, depending on which pairings were used for training and testing. From a classification point of view, these results imply that the two classes (A and NA) are separable in the feature domain selected. For windows of 1 s, accuracy varies from 82% to 92%, which is not far from its 2 s counterpart. For windows of 0.5 s, accuracy varies from 82% to 91%, which is almost equal to the 1 s case. However, for windows below 0.2 s, accuracy is below 80% even in the best case. Interestingly enough, accuracy increases sharply in going from 50 msec to 100 msec; the latter is perhaps indicative of the size of time-frequency atoms for EEG [56].
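The test windows in these experiments overlap by 50%. A minimal sketch of how such windows could be indexed is given below. The helper name is hypothetical, and the 512 Hz sampling rate and 10-second recording length are assumptions made for illustration (the text only states that 0.1 s corresponds to 51 samples).

```python
def window_starts(n_samples, window_len, overlap=0.5):
    """Start indices of all full windows of `window_len` samples."""
    if not 0 <= overlap < 1:
        raise ValueError("overlap must be in [0, 1)")
    # Hop between consecutive windows; 50% overlap means half a window.
    hop = max(1, int(round(window_len * (1 - overlap))))
    return list(range(0, n_samples - window_len + 1, hop))


FS = 512                  # assumed sampling rate in Hz
window_len = 2 * FS       # 2-second window
starts = window_starts(10 * FS, window_len)  # 10 s of made-up data
windows = [(s, s + window_len) for s in starts]
```

Each (start, stop) pair selects the samples from which one feature vector would be computed; shorter windows yield more test vectors from the same recording.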

Fig. 4.4: TPR (left) and TNR (right) for 4 pairings as window size changes.

Since accuracy is computed as a function of TPR (A vectors classified correctly) and TNR (NA vectors classified correctly), these are examined in more detail in the left and right graphs of Fig. 4.4 respectively. As suggested by Fig. 4.4, TPR and TNR tend to increase as window length increases. For pairings CB and CD, the TNR displays pronounced up-down-up behavior as window length increases. Unlike these two cases, the TNRs obtained from pairs AD and AB seem to increase more monotonically with window length, and reach TNRs of 1 and 0.99 respectively. Interestingly enough, the TPRs obtained from CB and CD continuously increase. The TPR obtained from AB reaches almost 0.90 at 0.5 s and stays at that value; the TPR obtained from AD reaches 0.78 at 0.2 s, but then behaves erratically.

Figure 4.4 reveals that there may be some biasing. For TNR at 2 sec, the highest value is 1, for pair AD, and the lowest TNR is 0.87, for pair CB. Note that there is another large TNR, of 0.99, for pair AB. For TPR, the largest value is 0.99, for pair CB, which happens to correspond to the lowest TNR. Likewise, the lowest TPR of 0.70 is achieved by pair AD, which achieved the highest TNR. Hence, pair AD seems to be biased to classify test vectors as NA and pairs CB and CD are biased towards classifying test vectors as A. Lastly, pair AB is slightly biased to classify subjects as NA.

The left graph of Fig. 4.5 shows how A_conf changes with window length and the right graph of Fig. 4.5 shows how NA_conf changes with window length. The y-axis label refers to 1 - A_conf.

Fig. 4.5: A Confidence (left) and NA Confidence (right) levels for 4 pairings as window size changes.

The pattern observed in Fig. 4.4 is also seen in Fig. 4.5. Values of A_conf that are close to zero represent high confidence in classification as A, and values of NA_conf that are close to one represent high confidence in classification as NA. Note that A_conf for pairs CB and CD decreases with window length, as expected, and they reach 0.05 and 0.10 respectively for windows of 2 s. These A_conf levels are the lowest in the left graph, which indicates high confidence when these pairs are used in training. Nevertheless, the NA_conf levels for CB increase as window length increases until they reach 0.84, and those of CD behave somewhat erratically, but reach . This makes CB and CD the pairs with the lowest NA_conf levels. The opposite is observed for pair AD, whose A_conf reaches 0.35 for windows of 2 s, the largest in the left graph, but its NA_conf level of 0.98 is the largest in the right graph. Pair AB seems to have good performance for both cases, but its NA_conf is slightly better than its A_conf.

In each of Figs. 4.6 through 4.12 the window duration used was 2 sec and K was set to 51. The figure titles on top indicate the sources of the training data, and the legends indicate the sources of the testing data. The Confidence x-axis label refers to NA_conf (so that mostly correct NA decisions concentrate the histogram on the right, and vice versa). Note that classification confidence less than 0.5 implies a classification error for the associated test vector when classifying NA subjects. On the other hand, confidence levels greater than 0.5 are considered classification errors when classifying A subjects. Generally, when the fraction of votes is between 0.4 and 0.6, the confidence is equivalent to guessing.

Fig. 4.6: Confidence histograms from training pairings (in title) for test cases (in legend box).
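The rules for reading the confidence histograms can be written down as a small helper. This is a sketch under stated assumptions: the function name is hypothetical, NA_conf is taken to be the fraction of the K = 51 votes cast for NA, and the example vote counts are made up.

```python
def read_na_conf(na_conf, true_label, guess_band=(0.4, 0.6)):
    """Interpret one NA_conf value for a test vector with a known label.

    Returns (decision, is_correct, is_guess).
    """
    if true_label not in ("A", "NA"):
        raise ValueError("true_label must be 'A' or 'NA'")
    # More than half of the votes for NA means the vector is labeled NA.
    # An exact 0.5 cannot occur when K is odd, as with K = 51.
    decision = "NA" if na_conf > 0.5 else "A"
    is_correct = decision == true_label
    # Vote fractions inside the band are treated as equivalent to guessing.
    is_guess = guess_band[0] <= na_conf <= guess_band[1]
    return decision, is_correct, is_guess


K = 51
confident_na = read_na_conf(46 / K, "NA")  # 46 of 51 votes for NA
wrong_guess = read_na_conf(27 / K, "A")    # 27 of 51 votes for NA, subject is A
confident_a = read_na_conf(3 / K, "A")     # 3 of 51 votes for NA
```

Keeping the correctness flag separate from the guess flag matters because a vote fraction such as 27/51 is both a classification error for an A subject and a near guess.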

Figure 4.6 shows the confidence histograms when using 2-sec windows in order to classify A subjects (yellow) and NA subjects (blue) based on the same training pairings used for Figs. 4.2 through 4.5. These histograms serve to explore in greater detail the meaning of Fig. 4.5. In each histogram, 240 to 260 test vectors were used. For most test vectors, the confidence of the subject belonging to the NA class is over 0.8. In the top right and top left graphs, which correspond to pairs AB and AD respectively, a large concentration of NA_conf levels is close to 1. In fact, for the top right histogram, 250 test vectors were classified as NA with confidence over 0.9. The lowest NA_conf level for this pair was . For the bottom left and bottom right graphs, NA_conf is lower. There is a small portion of values between 0.4 and 0.6, which indicate guesses, and there is an even smaller portion of values below 0.4, which bring the mean NA_conf to 93.2%.

Similarly, when testing vectors from the A class most of the decisions are made with over 80% confidence. However, the top right histogram shows more guesses than any of the others and more misclassification errors when testing with A subjects. For the top right histogram, the overall A_conf value, for subject 18396A, was 83%. Still, averaging the A_conf values over all test cases (95.3%, 95.1%, and 92.5% for pairings AB, CB, and CD respectively) yields 91.9% for A subjects.

Additional Test Subjects

After an additional set of subjects became available, the classification approach was repeated: training with a pair of subjects and testing with a different pair. The number of training subjects was kept at 2 to test whether or not KNN, with the chosen parameters, would generalize and correctly classify new test subjects.

Fig. 4.7: Confidence histograms for original training pairings, when testing with 18776A and 18716NA.

In Fig. 4.7, using the same training pairings as in Section 4.6.1, the additional subject 18716NA is classified correctly for all test windows, and with a very high level of averaged confidence (99.6%), whereas additional subject 18776A is correctly classified for three out of the four training cases shown. It is worth noting that 18776A is highly misclassified when subjects 18606A and 18316NA are used in training, and Section 4.6.1 revealed that this combination is biased towards classifying test vectors as NA. Even so, the average or overall decision for 18776A is that it belongs to the A class, with 92.6% confidence.

Fig. 4.8: Confidence histograms for original training pairings, when testing with 32436A and 32386NA.

Figure 4.8 shows intriguing results. Subject 32436A was classified correctly for most of its test vectors and for all the original training pairings, with an averaged confidence of 98.4%. However, subject 32386NA is misclassified for almost all test windows, and the confidence in these classifications is a very high 91.4%. The latter result could be due to several reasons. One might be that this subject was the least calm of all NA subjects. Another reason might be that the subject was not performing the activity as instructed. Finally, there is the possibility that the subject was mislabeled. In any case, from a classification perspective, Fig. 4.8 shows that subject 32386NA is much more similar to A subjects than to NA subjects. Moreover, the confidence levels reflected in Fig. 4.8 show that subject 32386NA is very distant from the NA training subjects.

To try to diagnose what might be off with subject 32386NA, it was used for training. Subject 32386NA was paired with each of the 4 original A subjects for training, and for testing, the remaining subjects were used (3 A and 3 NA). Table 1.1 summarizes which subjects were used for training and which ones were used for testing for each case.

Table 1.1: Combinations of training and testing subjects when 32386NA is used for training.

Training | Testing
18396A and 32386NA | 18316NA, 18586NA, 18716NA, 18606A, 18776A, 32436A
18606A and 32386NA | 18316NA, 18586NA, 18716NA, 18396A, 18776A, 32436A
18776A and 32386NA | 18316NA, 18586NA, 18716NA, 18396A, 18606A, 32436A
32436A and 32386NA | 18316NA, 18586NA, 18716NA, 18396A, 18606A, 18776A

For these experiments, approximately 500 vectors were used for training and 1500 for testing. Since 32386NA could not be correctly classified, we anticipate that classification will be poor if this subject is used for training.

Fig. 4.9: Confidence histograms for pairings involving subject 32386NA.

Figure 4.9 summarizes what occurs when subject 32386NA is used for training along with an A subject. All four graphs support our hypothesis that subject 32386NA carries the wrong label. For the top right and top left graphs, classification was highly biased towards the NA class; most of the test vectors from the A subjects were classified as NA. For the top left graph, NA_conf seems to be slightly Gaussian, with certainties going from almost 0 to almost 1. The top left graph shows high NA_conf, which is desired, and high A_conf, which is not desired. The bottom graphs of Fig. 4.9 also show poor performance. For the bottom left graph, histogram values of NA_conf are highly concentrated below 0.3, meaning that NA subjects tend to be mislabeled. For the same graph, the distribution of A_conf values indicates that classification is usually done correctly for A vectors, but there is a large number of guesses and low confidence in the decision. Lastly, the bottom right graph shows randomness in the classification of A subjects, but NA test vectors are correctly classified most of the time.

Fig. 4.10: Confidence histograms for pairings involving subject 32386NA and displaying two test subjects only.

Figure 4.10 shows examples of how individual subjects are classified when 32386NA is used in training. The top left graph shows pseudo-Gaussian behavior for the classification of both subjects; the top right graph displays a highly biased classifier; the bottom left graph shows low A_conf values and relatively high NA_conf values; finally, the bottom right graph shows poor A_conf and relatively high NA_conf values. Overall, the average confidence level for these cases is 63.91% for A subjects and 62.31% for NA subjects. These results are only slightly above guessing.

Since classification seemed to be poor when training with 32386NA and an A subject, this subject was next used for training along with an NA subject. In other words, the label of 32386NA was flipped (designated 32386a) to test what effect that would have on training and classification. For these experiments, training was done by pairing 32386a with each of the 3 NA subjects whose labels do not seem questionable, and testing was done with the remaining subjects (4 A and 2 NA). The combinations are shown in Table 1.2.

Table 1.2: Combinations of training and testing subjects when subject 32386NA is used for training

Training | Testing
32386a and 18316NA | 18586NA, 18716NA, 18396A, 18606A, 18776A, 32436A
32386a and 18586NA | 18316NA, 18716NA, 18396A, 18606A, 18776A, 32436A
32386a and 18716NA | 18316NA, 18586NA, 18396A, 18606A, 18776A, 32436A

Just as for Figs. 4.9 and 4.10, approximately 500 vectors were used for training and approximately 1500 were used for testing.

Fig. 4.11: Confidence histograms for pairings involving subject 32386NA and all other NA subjects.

Figure 4.11 summarizes what occurs when subject 32386a is used for training along with an NA subject. In all the graphs, it is difficult to see the NA_conf values because the number of A test vectors is almost double that of the NA test vectors. As can be seen, classification of A subjects is done with very high confidence. Some A vectors were misclassified or correctly classified with low confidence, but the proportion is negligible compared to the high confidence decisions. As far as NA_conf goes, the top right graph shows a large concentration of NA_conf levels close to 1. The top left histogram shows that classification is relatively poor, and for the bottom left graph, NA_conf levels tend to be over

Fig. 4.12: Confidence histograms for pairings involving subject 32386NA.

Figure 4.12 shows some examples of how individual subjects are classified when 32386a is used in training. The graphs repeat the pattern shown in Fig. 4.10: high confidence (low A_conf values) across all graphs; NA_conf levels are mostly concentrated close to 1 for most test vectors across all graphs, except for the bottom left graph. There is a relatively high proportion of NA_conf values under 0.5, but these do not outweigh the performance obtained from the other tests. Overall, the average confidence level for these cases is 94.58% for A subjects and 92.87% for NA subjects. After flipping the label assigned to 32386NA, correct results were obtained, making it likely that subject 32386NA had been mislabeled early in the process.

Since flipping the label of 32386NA from NA to A yielded the best results, this subject is used in the subsequent experiments as an A subject. Therefore, there are 3 NA subjects and 5 A subjects.

4.6.2. Increasing the Training Dataset

In this section, the effect of increasing the size of the training dataset from 2 subjects (1 A and 1 NA) to 4 subjects (2 A and 2 NA) is investigated. For testing, 3 A subjects and 1 NA subject are used. As a consequence, the number of vectors used for training is approximately 1000, and the number of vectors used for testing is approximately 1000 as well. For these experiments, the number of possible combinations of 2 A subjects and 2 NA subjects for training is 30. Our hypothesis is that the effect of outliers will be suppressed as more data is added, which is expected to result in either an increase in performance or no change.

Figure 4.13 shows the distribution of accuracy values obtained when training with 2 A and 2 NA and testing with 3 A and 1 NA. The confidence values (bottom) were obtained by averaging A_conf (converting 1 - A_conf to NA_conf prior to averaging) and NA_conf.

Fig. 4.13: Accuracy values (top) and confidence levels (bottom) obtained from all 30 combinations of 2 NA subjects and 2 A subjects for training.
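The count of 30 follows from choosing 2 of the 5 A subjects and 2 of the 3 NA subjects, i.e. C(5,2) x C(3,2) = 10 x 3. The short sketch below enumerates these splits using the subject identifiers from this chapter, with 32386a counted as an A subject; the function and variable names are illustrative.

```python
from itertools import combinations

A_SUBJECTS = ["18396A", "18606A", "18776A", "32436A", "32386a"]
NA_SUBJECTS = ["18316NA", "18586NA", "18716NA"]


def training_splits(a_subjects, na_subjects, n_a=2, n_na=2):
    """List every (training, testing) split with n_a A and n_na NA training subjects."""
    splits = []
    for train_a in combinations(a_subjects, n_a):
        for train_na in combinations(na_subjects, n_na):
            train = set(train_a) | set(train_na)
            # Every subject not used for training is used for testing.
            test = [s for s in a_subjects + na_subjects if s not in train]
            splits.append((sorted(train), test))
    return splits


splits = training_splits(A_SUBJECTS, NA_SUBJECTS)
print(len(splits))  # 30
```

Each split leaves 3 A subjects and 1 NA subject for testing, which is the source of the class imbalance in the test set.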

Figure 4.13 shows that increasing the number of subjects used in training has an unexpected effect on performance. In the previous section, accuracy varied from 85% to 95%, and so did confidence. However, the top and bottom graphs in Fig. 4.13 show distributions between 75% and 100%, which may suggest that performance in the worst case deteriorated. Further, the mean accuracy for all 30 cases is 89.63% and the mean confidence is 88.14%, whereas the mean accuracy was 91% and the mean confidence was 90.48% when 2 subjects were used for training. Since what was obtained is the opposite of what was expected, the meaning of these results is examined more carefully. The TPR, TNR, NA_conf, and A_conf are explored in more detail to understand why performance did not improve after adding more data to the training dataset.

Fig. 4.14: Distribution of TNRs (top) and NA_conf (bottom) obtained from all 30 combinations of 2 NA subjects and 2 A subjects for training.

The results shown in Fig. 4.14 are more in line with our expectations. The distribution of TNRs (top graph) shows that most of the TNRs are over 0.9. In fact, the mean TNR is , and the worst . The NA_conf levels display a similar behavior: most of the NA_conf levels are accumulated over 0.9, with a mean of and a worst case of .

Fig. 4.15: Distribution of TPRs (top) and A_conf (bottom) obtained from all 30 combinations of 2 NA subjects and 2 A subjects for training.

The results shown in Fig. 4.15 may provide an explanation for why performance seems to have deteriorated. The distribution of TPRs (top graph) shows that half of the TPRs are over 0.9, but the other half is scattered between and 0.9. The mean TPR is and the worst is . A similar pattern can be observed for the A_conf levels (bottom). The A_conf levels tend to be under 0.15, and the average and worst levels are (87.73%) and (79.31%) respectively.

The TPR and A_conf levels decrease the performance of the classifier not only because they are smaller than the TNR and NA_conf levels, but also because the classes are skewed. Since testing is done for 3 A subjects and 1 NA subject, TPRs have a higher weight in performance than TNRs.
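This weighting can be made concrete. The sketch below assumes that every test subject contributes the same number of test vectors, so that overall accuracy is the subject-count-weighted mean of TPR and TNR; the function name is illustrative. It uses the TPR (0.668) and TNR (0.9872) reported in this section for the lowest-TPR case.

```python
def overall_accuracy(tpr, tnr, n_a, n_na):
    """Accuracy as the mean of TPR and TNR weighted by test-set composition.

    Assumes each of the n_a A subjects and n_na NA subjects contributes
    the same number of test vectors.
    """
    return (tpr * n_a + tnr * n_na) / (n_a + n_na)


# Skewed test set used here: 3 A subjects and 1 NA subject.
skewed = overall_accuracy(0.668, 0.9872, n_a=3, n_na=1)
# Hypothetical balanced test set: equal weight for both classes.
balanced = overall_accuracy(0.668, 0.9872, n_a=1, n_na=1)
print(f"{skewed:.4f} {balanced:.4f}")  # 0.7478 0.8276
```

With three times as many A test vectors, a low TPR pulls the overall accuracy down by roughly eight percentage points relative to the balanced case.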

For instance, the case where TPR is lowest (0.668) happens to be one of the cases where TNR was high (0.9872). For this case, the overall accuracy was 74.78% because it was computed as (0.668 × 3 + 0.9872) / 4. If the classes had not been skewed, the accuracy would have been computed as (0.668 + 0.9872) / 2, which equals 82.76%. This value is still below 85%, but it is closer to 85%.

The random variations in the performance of the KNN models trained with 4 subjects were also investigated. The first 0.5 seconds of the ANT activity were removed from the dataset to induce a 0.5-second delay on the time-series data, and training and testing were performed as explained at the beginning of Section 4.6.2. Histograms are not shown because the differences are difficult to tell. The standard deviation of the overall accuracy values for the dataset that has a delay is , and the standard deviation for the original experiments (which were used to generate Figs. 4.13 through 4.15) is . Further, the mean overall accuracy values over all 30 combinations are and for the original and delayed versions. Therefore, random variations have a minuscule effect on performance.

In conclusion, contrary to expectations, increasing the size of the training dataset does not have a tremendous impact on performance. The initial results even suggested a decrease in performance, but the TPR and TNR showed that the results were partially due to the fact that class sizes became unbalanced after subject 32386NA was turned into 32386a. Also, the models appear to be robust to random variations.

Reflection Coefficients and Line Spectral Frequencies

In this section we report on the investigation of how classification performance varies when reflection coefficients (RC) or line spectral frequencies (LSF) are used as features, instead of AR coefficients. The results obtained so far indicate that AR coefficients concatenated in feature vectors create a high dimensional space where classification is done with high accuracy and confidence.
Since LSF and RC contain the same information as AR coefficients, we explore whether KNN classification using RC and LSF can be done with the same, better, or worse level of accuracy and confidence as with AR coefficients. For these experiments, training and testing are performed as in Section 4.6.2: 2 A and 2 NA subjects for training and 3 A and 1 NA for testing. K was set to 51 and window length to 2 seconds. Just as in the previous sections, the accuracy, confidence level, TPR, and TNR are explored.

Fig. 4.15: Accuracy values (top) and confidence levels (bottom) when using RC (blue), AR (green), and LSF (yellow) as features.

As seen in Fig. 4.15, classification performance is highest when AR coefficients are used as features. The averaged accuracies (top) reveal that for the 3 kinds of features, the results are concentrated above 0.7, but the worst cases for RC and LSF are 0.54 and 0.47 respectively, whereas the worst case for AR is . For these 3 cases, the mean accuracy values are 0.89 for AR, 0.83 for RC, and 0.82 for LSF.

The observations made for the accuracy values carry over to the confidence levels. Confidence is highest when AR coefficients are used as features, since the average confidence level is 88.14% and the worst is 76.50%. On the other hand, the mean confidence levels when RC and LSF are used as features are 82.72% and 81.28% respectively, and their worst confidence levels are 59.16% and 51.13% respectively.

Fig. 4.16: TPRs (top) and A_conf levels (bottom) when using RC (blue), AR (green), and LSF (yellow) as features.

The answer as to why RC and LSF were outperformed by AR may lie in Fig. 4.16. The TPRs (top) for the 3 kinds of features are concentrated above 0.65, but the worst cases are between 0.38 and 0.43 for RC, and between 0.30 and 0.36 for LSF, whereas the worst case for AR is . The mean TPRs are , , and for LSF, RC, and AR respectively. This evidence shows that KNN classifiers using RC or LSF features are more likely to misclassify A subjects.

The A_conf levels (bottom) agree with these observations. The mean A_conf values in the bottom graph are (87.76%), (80.27%), and (79.12%) for AR, RC, and LSF respectively. The worst cases for KNN classifiers using these features are (37.22%), (46.04%), and (70.49%) for LSF, RC, and AR respectively.

Fig. 4.17: TNRs (top) and NA_conf levels (bottom) when using RC (blue), AR (green), and LSF (yellow) as features.

Lastly, Fig. 4.17 provides further insights into the performance of KNN classifiers using these features. The TNRs (top) obtained when using LSF as features (yellow) are either concentrated above 0.95 or below 0.85, with a worst case of , the lowest TNR in the graph, and a mean of . Interestingly enough, the TNRs obtained when using RC are higher than those obtained when using AR. The worst case for RC is and that of AR is ; the mean TNR for RC is and that of AR is . The NA_conf levels repeat the pattern observed previously: highest confidence is achieved by RC (mean = 90.14%, worst = 80.59%), followed by AR (mean = 89.36%, worst = 79.80%), and LSF (mean = 87.84%, worst = 79.54%).

These experiments suggest that the feature spaces created by AR feature vectors work better with KNN classifiers for the classification problem at hand than those created by RC or LSF. It seems that classification performance of a KNN classifier using AR as features is better than that of a KNN classifier using LSF in terms of accuracy, TPR, TNR, and confidence. Much the same can be said about KNN using AR features versus RC features: although NA_conf and TNR improved for KNN classifiers using RC, this improvement came at the expense of A_conf and TPR, which makes AR coefficients better than RC for the problem at hand.

These results are both surprising and unsurprising. They are surprising because even though AR, RC, and LSF coefficients carry the same information, the accuracy found when using AR coefficients as features exceeded that found when using RC and LSF. On the other hand, for KNN, the representation of the data has a large impact on performance. Since AR(7), RC, and LSF have different representations, it is understandable that performance varies somewhat.

To summarize, in this chapter, the KNN algorithm was presented and used for the classification of A and NA. After careful selection of the value of K and of the window length, experiments were conducted to evaluate the performance of the KNN algorithm in conjunction with AR(7) coefficients as features to classify A and NA test vectors. A confidence metric was introduced and was used to question the validity of the label of one of the subjects. The subject was originally labeled as NA, but the confidence histograms showed that the subject was distant from the NA class. As a result, the label of the subject was switched to A, and that label is used throughout the rest of this thesis. High accuracy (85% to 95% for 2 subjects, and % for 4 subjects) was observed along with high certainty (over 90% for 2 subjects, % for 4 subjects), which shows that the A and NA classes are separable in the feature domain created by the AR coefficients, even without any processing.

5. UNIVERSAL BACKGROUND MODEL

As reported in the previous chapter, KNN classification performance was encouraging, even though no processing was done on the feature vectors to maximize performance. As mentioned there, KNN is an indicator of how separable the classes are in the particular feature domain that was explored. Moreover, we argue that the performance values obtained in the previous chapter can be seen as a lower bound. Therefore, in this chapter the use of a machine learning algorithm is explored in which statistics are used to maximize the separation between the two classes in order to further maximize performance.

In this chapter, Gaussian-Mixture-Model-based (GMM) universal background models (UBM) are explored for the classification of A and NA subjects. UBMs have been used in the past for speaker verification and identification, and have achieved high levels of accuracy under different noise conditions [57, 58]. Moreover, GMMs and UBMs have recently been studied for the detection and classification of EEG patterns [47, 59].

The hypothesis addressed here is that a UBM can potentially address the shortcomings of other classification schemes. Over the last 30 years, the A/NA classification problem has been tackled by extracting features from EEG data when the subjects are resting with their eyes closed or performing some activity. However, when test subjects do not perform the activity they are instructed to perform, classification accuracy is more likely to suffer (perhaps even to the point of resembling guessing). Therefore, a UBM trained using a large number of feature vectors, extracted from several activities, may make classification more robust.

5.1. Gaussian Mixture Models

A Gaussian Mixture Model (GMM) is a model for a probability density function (pdf) expressed as a weighted sum of Gaussian probability density functions. The main reason for using GMMs for classification problems is that mixtures of Gaussian distributions can approximate arbitrary pdfs [58].
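The pdf of a GMM λ is expressed as

$$p(\mathbf{v} \mid \lambda) = \sum_{m=1}^{M} w_m \, g_m(\mathbf{v} \mid \boldsymbol{\mu}_m, \boldsymbol{\Sigma}_m) \tag{5.1}$$

where v is an N-dimensional feature vector, w_m are the weights, and g_m the individual N-variate Gaussian pdfs, which have the following form:

$$g_m(\mathbf{v} \mid \boldsymbol{\mu}_m, \boldsymbol{\Sigma}_m) = \frac{1}{(2\pi)^{N/2} \, |\boldsymbol{\Sigma}_m|^{1/2}} \exp\left( -\frac{1}{2} (\mathbf{v} - \boldsymbol{\mu}_m)^{T} \boldsymbol{\Sigma}_m^{-1} (\mathbf{v} - \boldsymbol{\mu}_m) \right) \tag{5.2}$$

where μ_m is an N-dimensional column vector of feature element means and Σ_m is the covariance matrix of the feature element vector.

Expectation Maximization Algorithm

To train the GMMs, i.e. to find the model parameters, the expectation maximization (EM) algorithm was used. In the expectation step (E-step) of the EM algorithm, the a posteriori probabilities of a feature vector ν_t belonging to mixture component k of the Gaussian mixture model λ_c, also known as class membership weights, are computed iteratively over variable i in this fashion:

$$P(w_{c,k} \mid \boldsymbol{\nu}_t, \lambda_{c,i}) = \frac{w_{c,k,i} \, g_k(\boldsymbol{\nu}_t \mid \boldsymbol{\mu}_{c,k,i}, \boldsymbol{\Sigma}_{c,k,i})}{\sum_{m=1}^{M} w_{c,m,i} \, g_m(\boldsymbol{\nu}_t \mid \boldsymbol{\mu}_{c,m,i}, \boldsymbol{\Sigma}_{c,m,i})} \tag{5.3}$$

where i = 1, 2, ..., I, c = 1, 2, ..., C, m = 1, 2, ..., M, and t = 1, 2, ..., T, where I is the total number of iterations, C is the total number of classes, M is the total number of mixture components, and T is the total number of feature vectors. In the maximization step (M-step), the weights, means, and covariance matrices that parameterize the Gaussian mixture models λ_c are computed as follows:

$$w_{c,k,i+1} = \frac{1}{T} \sum_{t=1}^{T} P(w_{c,k} \mid \boldsymbol{\nu}_t, \lambda_{c,i}) \tag{5.4}$$

$$\boldsymbol{\mu}_{c,k,i+1} = \frac{\sum_{t=1}^{T} P(w_{c,k} \mid \boldsymbol{\nu}_t, \lambda_{c,i}) \, \boldsymbol{\nu}_t}{T \, w_{c,k,i+1}} \tag{5.5}$$

$$\boldsymbol{\Sigma}_{c,k,i+1} = \frac{\sum_{t=1}^{T} P(w_{c,k} \mid \boldsymbol{\nu}_t, \lambda_{c,i}) \, (\boldsymbol{\nu}_t - \boldsymbol{\mu}_{c,k,i+1}) (\boldsymbol{\nu}_t - \boldsymbol{\mu}_{c,k,i+1})^{T}}{T \, w_{c,k,i+1}} \tag{5.6}$$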

With every iteration $i$, the likelihood of the data under the computed parameters does not decrease, so that a maximum in the likelihood occurs at the last iteration; however, that maximum may already have reached a plateau at an earlier iteration. In other words,

$$l(\boldsymbol{\nu}_t \mid \lambda_{c,i+1}) \geq l(\boldsymbol{\nu}_t \mid \lambda_{c,i}) \tag{5.7}$$

Figure 5.1 illustrates how EM iterations occur when creating 2 models to characterize randomly generated data. The randomly generated data are 2 Gaussians, one with 0.75 variance and mean equal to 1 on both dimensions, and the other with 0.5 variance and mean of -1 on both dimensions. The left graph shows the initial EM iterations, where every ellipsoid in every cluster represents an iteration. Model 0 appears to show only 2 ellipsoids because several overlay the 2 that are separately visible. For Model 1, most of the iteration results are quite visible and distinct from one another. The right graph shows the final iteration results, i.e. the model that was obtained after 10 iterations.

Fig. 5.1: EM iterations (left) and final EM iteration (right).
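The E-step and M-step in (5.3) through (5.6), and the non-decreasing likelihood in (5.7), can be sketched in NumPy as follows. This is only a minimal illustration under stated assumptions, not the implementation used in this study: the function names `gaussian_pdf` and `em_step` are hypothetical, a single class model is fitted, the synthetic data loosely follow the two Gaussians described for Figure 5.1, and the initial means are drawn at random from the data rather than obtained with K-means.

```python
import numpy as np

def gaussian_pdf(V, mu, Sigma):
    """N-variate Gaussian density (5.2), evaluated at every row of V."""
    N = V.shape[1]
    diff = V - mu
    inv = np.linalg.inv(Sigma)
    norm = np.sqrt((2.0 * np.pi) ** N * np.linalg.det(Sigma))
    # Quadratic form (v - mu)^T Sigma^{-1} (v - mu) for each row.
    quad = np.einsum("tj,jk,tk->t", diff, inv, diff)
    return np.exp(-0.5 * quad) / norm

def em_step(V, w, mu, Sigma):
    """One EM iteration; also returns the log-likelihood of the input parameters."""
    T, M = V.shape[0], len(w)
    # Weighted component densities w_k * g(v_t | mu_k, Sigma_k), shape (T, M).
    dens = np.column_stack(
        [w[k] * gaussian_pdf(V, mu[k], Sigma[k]) for k in range(M)]
    )
    total = dens.sum(axis=1, keepdims=True)
    gamma = dens / total                      # E-step, (5.3)
    loglik = np.log(total).sum()
    n = gamma.sum(axis=0)                     # equals T * w_new
    w_new = n / T                             # (5.4)
    mu_new = (gamma.T @ V) / n[:, None]       # (5.5)
    Sigma_new = np.empty_like(Sigma)
    for k in range(M):
        d = V - mu_new[k]
        Sigma_new[k] = (gamma[:, k, None] * d).T @ d / n[k]   # (5.6)
    return w_new, mu_new, Sigma_new, loglik

# Synthetic 2-D data (an assumption for illustration only).
rng = np.random.default_rng(0)
V = np.vstack([
    rng.normal(1.0, np.sqrt(0.75), (200, 2)),
    rng.normal(-1.0, np.sqrt(0.5), (200, 2)),
])
M = 2
w = np.full(M, 1.0 / M)
mu = V[rng.choice(len(V), M, replace=False)]   # random init, not K-means
Sigma = np.stack([np.cov(V.T)] * M)

logliks = []
for _ in range(15):                            # 15 iterations, as in the text
    w, mu, Sigma, ll = em_step(V, w, mu, Sigma)
    logliks.append(ll)
```

Tracking `logliks` across iterations gives a simple way to see where the likelihood plateaus, which is the behavior described around (5.7).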

In this study, EM was executed for a maximum of 15 iterations. This number represents a safe choice. For speech data, when the parameters are initialized randomly, the number of iterations needed tends to be over 10 [60], whereas fewer than 10 iterations are needed when K-means clustering is used for initialization, as is done here.

Figure 5.2 shows an example of how EM with random initialization compares to EM using K-means clustering for initialization. The advantage of using K-means is quite visible: the starting points in the two plots differ, and the one obtained with K-means initialization is closer to the convergence value than the one obtained with random initialization. As a result, EM with K-means initialization converges to the final value faster than EM with random initialization.

Fig. 5.2: Convergence of EM algorithm with random initialization (left) and convergence using K-means clustering for initialization (right).

K-means Clustering Algorithm

The objective of K-means clustering is to find $N$ centroids to partition a dataset $X$ that contains $T$ vectors $\mathbf{x}_t$, where $t = 1, 2, \ldots, T$, into clusters $X_1, X_2, \ldots, X_N$ with centroids $\boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \ldots, \boldsymbol{\mu}_N$ so that the cumulative distance $J$ between the centroids and the vectors that lie within the clusters is minimized. $J$ can be expressed as

$$J = \sum_{n=1}^{N} \sum_{\mathbf{x}_t \in X_n} \left\| \boldsymbol{\mu}_n - \mathbf{x}_t \right\|^2 \tag{5.8}$$

The algorithm is initialized by randomly choosing $N$ vectors from the dataset and setting them as the initial centroids $\boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \ldots, \boldsymbol{\mu}_N$. At every iteration, the algorithm begins with some initial estimates (random estimates for the first iteration), and assigns to every cluster $X_n$ the vectors that are closest to the centroid $\boldsymbol{\mu}_n$:

$$X_{n,i} = \left\{ \mathbf{x}_t : \left\| \boldsymbol{\mu}_{n,i} - \mathbf{x}_t \right\|^2 \leq \left\| \boldsymbol{\mu}_{p,i} - \mathbf{x}_t \right\|^2 \right\}, \quad 1 \leq n \leq N, \; 1 \leq t \leq T, \; 1 \leq p \leq N \tag{5.9}$$

The next step is to re-estimate the centroids of the clusters, which is done as follows:

$$\boldsymbol{\mu}_{n,i+1} = \frac{1}{|X_{n,i}|} \sum_{\mathbf{x}_t \in X_{n,i}} \mathbf{x}_t \tag{5.10}$$

where $|X_{n,i}|$ is the cardinality of set $X_{n,i}$ and $i = 1, 2, \ldots, I$, where $i$ represents the current iteration and $I$ the maximum number of iterations. Once $\boldsymbol{\mu}_{n,i+1}$ has been computed, the next iteration starts, with $\boldsymbol{\mu}_{n,i+1}$ being the latest centroid estimates. In this study, the total number of iterations used was […]. Moreover, to reduce the chance of settling on a poor clustering, the entire process is repeated 100 times, starting with different random initializations.
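The assignment step (5.9), the centroid update (5.10), and the cost (5.8), together with repeated random restarts, can be sketched as below. This is a hedged illustration rather than the code used in this study: the function name `kmeans`, the default iteration and restart counts, the early-stopping check, and the synthetic two-blob data are all assumptions made for the example.

```python
import numpy as np

def kmeans(X, N, iters=100, restarts=10, seed=0):
    """K-means with random restarts; returns (J, centroids, labels) of the best run."""
    rng = np.random.default_rng(seed)
    best_J, best_mu, best_labels = np.inf, None, None
    for _ in range(restarts):
        # Initialize the centroids with N vectors drawn at random from the data.
        mu = X[rng.choice(len(X), N, replace=False)]
        for _ in range(iters):
            # Squared distance of every vector to every centroid, shape (T, N).
            d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
            labels = d2.argmin(axis=1)                      # assignment, (5.9)
            # Centroid update (5.10); an empty cluster keeps its old centroid.
            new_mu = np.array([
                X[labels == n].mean(axis=0) if np.any(labels == n) else mu[n]
                for n in range(N)
            ])
            if np.allclose(new_mu, mu):                     # assumed stopping rule
                break
            mu = new_mu
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        J = d2[np.arange(len(X)), labels].sum()             # cost, (5.8)
        if J < best_J:
            best_J, best_mu, best_labels = J, mu, labels
    return best_J, best_mu, best_labels

# Two well-separated synthetic clusters (illustrative data only).
rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal(-3.0, 0.5, (100, 2)),
    rng.normal(3.0, 0.5, (100, 2)),
])
J, centroids, labels = kmeans(X, N=2)
```

Keeping the run with the lowest `J` across restarts mirrors the idea of repeating the whole procedure from different random initializations to avoid a poor local optimum.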

Fig. 5.3: Example of K-means clustering algorithm with data before clustering (left) and after clustering (right).

Figure 5.3 shows an example of using K-means clustering for finding centroids. The number of centroids to compute was arbitrarily set to 2. As seen, the algorithm ends up with 2 centroids that may not be globally optimal, but should be locally optimal. Observe in the figure that all the data points to the left of the equidistant line are assigned to Cluster 1 and all the data points to the right of the equidistant line are assigned to Cluster 2.

UBM Adaptation

Once the GMMs have been formed, a UBM $\lambda_{UBM}$ can be created based on the GMMs. To do so, each GMM $\lambda_c$ that will be part of the UBM is adapted by performing maximum a posteriori (MAP) adaptation. This is done by, first, computing the a posteriori probabilities of each feature vector belonging to the UBM.

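$$\gamma_{t,c,k} = P(w_{c,k} \mid \boldsymbol{\nu}_t, \lambda_{UBM}) = \frac{w_{c,k} \, g(\boldsymbol{\nu}_t \mid \boldsymbol{\mu}_{c,k}, \boldsymbol{\Sigma}_{c,k})}{\sum_{m=1}^{M} w_{c,m} \, g(\boldsymbol{\nu}_t \mid \boldsymbol{\mu}_{c,m}, \boldsymbol{\Sigma}_{c,m})} \tag{5.11}$$

In our classification problem, $C = 2$. Therefore, there will be 2 GMMs, $\lambda_{ADHD}$ and $\lambda_{Non\text{-}ADHD}$, and $\lambda_{Non\text{-}ADHD}$ will be adapted to form $\lambda_{UBM}$. Sufficient statistics are then computed to obtain the weights, means, and variances of $\lambda_{UBM}$. These statistics are the count and the first and second moments, weighted by the posterior probabilities, and are found in (5.12) through (5.14).

$$n_{c,k} = \sum_{t=1}^{T} \gamma_{t,c,k} \tag{5.12}$$

$$E_{c,k}(\boldsymbol{\nu}_t) = \frac{\sum_{t=1}^{T} \gamma_{t,c,k} \, \boldsymbol{\nu}_t}{\sum_{t=1}^{T} \gamma_{t,c,k}} \tag{5.13}$$

$$E_{c,k}(\boldsymbol{\nu}_t \boldsymbol{\nu}_t^T) = \frac{\sum_{t=1}^{T} \gamma_{t,c,k} \, \boldsymbol{\nu}_t \boldsymbol{\nu}_t^T}{\sum_{t=1}^{T} \gamma_{t,c,k}} \tag{5.14}$$

Once the sufficient statistics have been computed, the weights, means, and variances are adapted. In theory, adaptation should improve performance by making the mixtures in the target class tighter [61]. Adaptation is performed by using (5.15) through (5.17):

$$\hat{w}_{c,k} = \left[ \frac{a_{c,k,w} \, n_{c,k}}{T} + (1 - a_{c,k,w}) \, w_{c,k} \right] s_c \tag{5.15}$$

$$\hat{\boldsymbol{\mu}}_{c,k} = a_{c,k,\mu} \, E_{c,k}(\boldsymbol{\nu}_t) + (1 - a_{c,k,\mu}) \, \boldsymbol{\mu}_{c,k} \tag{5.16}$$

$$\hat{\boldsymbol{\sigma}}^2_{c,k} = a_{c,k,\sigma} \, E_{c,k}(\boldsymbol{\nu}_t \boldsymbol{\nu}_t^T) + (1 - a_{c,k,\sigma}) \left( \boldsymbol{\sigma}^2_{c,k} + \boldsymbol{\mu}^2_{c,k} \right) - \hat{\boldsymbol{\mu}}^2_{c,k} \tag{5.17}$$

where $s_c$ is a scale factor that makes the adapted weights sum to one, and $a_{c,k,w}$, $a_{c,k,\mu}$, $a_{c,k,\sigma}$ are the adaptation coefficients for the weights, means, and variances, respectively. They control the balance between the new and old coefficients and are computed by using the following formulae [61]: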

$$a_{c,k,w} = \frac{n_{c,k}}{n_{c,k} + r_w} \tag{5.18}$$

$$a_{c,k,\mu} = \frac{n_{c,k}}{n_{c,k} + r_\mu} \tag{5.19}$$

$$a_{c,k,\sigma} = \frac{n_{c,k}}{n_{c,k} + r_\sigma} \tag{5.20}$$

where $r_w$, $r_\mu$, $r_\sigma$ are the relevance factors of the weights, means, and variances. In this study, the relevance factors were set to 10, since relevance factors between 8 and 20 do not seem to affect performance [61]. Outside of this range, performance may be affected positively or negatively [62].

In this study, GMM-UBMs were found using features extracted during various EEG tasks from the NA subjects (impostors). The activities that were experimented with were eyes closed, VIDEO, and ANT. Models were also found to fit the class of ADHD subjects (targets). Figure 5.4 summarizes how classification is done in this study.

Fig. 5.4: GMM-UBM for the classification of A/NA subjects.

For classification, the log-likelihood ratio (LLR) is used, i.e. the ratio of the likelihood of a test vector $\boldsymbol{\nu}_t$ belonging to the ADHD model over the likelihood of $\boldsymbol{\nu}_t$ belonging to the universal background model.
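The MAP adaptation steps in (5.11) through (5.20) can be sketched as follows. This is an illustrative sketch under several assumptions, not the implementation used in this study: the function name `map_adapt` is hypothetical, covariances are assumed diagonal so that only per-dimension variances are adapted, a single relevance factor `r` is shared by weights, means, and variances, the adapted weights are renormalized to sum to one, and the prior model and adaptation data are synthetic.

```python
import numpy as np

def map_adapt(V, w, mu, var, r=10.0):
    """MAP-adapt a diagonal-covariance GMM (w, mu, var) to the feature vectors V."""
    T = V.shape[0]
    # log of w_k * g(v_t | mu_k, var_k) for diagonal covariances, shape (T, M).
    log_dens = (
        np.log(w)
        - 0.5 * np.sum(np.log(2.0 * np.pi * var), axis=1)
        - 0.5 * np.sum((V[:, None, :] - mu) ** 2 / var, axis=2)
    )
    log_dens -= log_dens.max(axis=1, keepdims=True)   # numerical stability
    gamma = np.exp(log_dens)
    gamma /= gamma.sum(axis=1, keepdims=True)         # posteriors, (5.11)

    n = gamma.sum(axis=0)                             # counts, (5.12)
    n_safe = np.maximum(n, 1e-12)                     # guard against division by zero
    E1 = gamma.T @ V / n_safe[:, None]                # first moment, (5.13)
    E2 = gamma.T @ V**2 / n_safe[:, None]             # second moment (diagonal), (5.14)

    a = n / (n + r)                                   # coefficients, (5.18)-(5.20)
    w_new = a * n / T + (1.0 - a) * w                 # (5.15) before scaling
    w_new /= w_new.sum()                              # scale so the weights sum to one
    mu_new = a[:, None] * E1 + (1.0 - a[:, None]) * mu                    # (5.16)
    var_new = (a[:, None] * E2
               + (1.0 - a[:, None]) * (var + mu**2)
               - mu_new**2)                                               # (5.17)
    return w_new, mu_new, var_new

# Synthetic prior model and adaptation data (assumptions for illustration).
rng = np.random.default_rng(2)
w0 = np.array([0.5, 0.5])
mu0 = np.array([[-1.0, -1.0], [1.0, 1.0]])
var0 = np.ones((2, 2))
V = rng.normal(2.0, 0.5, (200, 2))
w1, mu1, var1 = map_adapt(V, w0, mu0, var0, r=10.0)
```

In this toy setup almost all adaptation vectors are attributed to the second component, so its adaptation coefficient is close to one and its mean moves most of the way toward the data, while the first component stays near its prior values.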
