Classification of ADHD Using Heterogeneity Classes and Attention Network Task Timing


Sarah Elizabeth Hanson

Thesis submitted to the faculty of the Virginia Polytechnic Institute and State University in partial fulfillment of the requirements for the degree of Master of Science in Electrical Engineering

A. A. (Louis) Beex, Chair
Martha Ann Bell
William T. Baumann

May 7, 2018
Blacksburg, Virginia

Keywords: ADHD, EEG, KNN, K-Means, Heterogeneity, Attention Network Task

Copyright 2018 by Sarah Elizabeth Hanson. All rights reserved.

ABSTRACT

ADHD diagnosis and medication rates increased rapidly throughout the 1990s, and this trend continues today. These sharp increases have been met with both public and clinical criticism, with detractors stating that over-diagnosis is a problem and that healthy children are being unnecessarily medicated and labeled as disabled. However, others say that ADHD is being underdiagnosed in some populations. Critics often state that multiple factors introduce subjectivity into the diagnosis process, meaning that a final diagnosis may be influenced by more than the desire to protect a patient's wellbeing. Some of these factors include standardized testing, legislation affecting special education funding, and the diagnostic process itself. In an effort to circumvent these extraneous factors, this work aims to further develop a potential method of using EEG signals to accurately discriminate between ADHD and non-ADHD children using features that capture spectral and perhaps temporal information from evoked EEG signals. KNN has been shown in prior research to be an effective tool in discriminating between ADHD and non-ADHD subjects; therefore, several different KNN models are created using features derived in a variety of fashions. One takes into account the heterogeneity of ADHD, and another seeks to exploit differences in executive functioning between ADHD and non-ADHD subjects.

The results of this classification method vary widely depending on the sample used to train and test the KNN model. With unfiltered Dataset 1 data over the entire ANT1 period, the most accurate EEG channel pair achieved an overall vector classification accuracy of 94%, and the 5th percentile of classification confidence was 80%. These metrics suggest that KNN classification of EEG signals recorded during the ANT could be a useful diagnostic tool. However, the most accurate channel pair for unfiltered Dataset 2 data achieved an overall accuracy of 65% and a 5th percentile of classification confidence of 17%. The same method that worked so well for Dataset 1 did not work well for Dataset 2, and no conclusive reason for this difference was identified, although several methods to remove possible sources of noise were used.

Using target time linked intervals did appear to marginally improve results in both Dataset 1 and Dataset 2. However, the changes in accuracy of intervals relative to target presentation vary between Dataset 1 and Dataset 2. Separating subjects into heterogeneity classes does appear to result in good (up to 83%) classification accuracy for some classes, but results are poor (about 50%) for other heterogeneity classes. A much larger data set is necessary to determine whether or not the very positive results found with Dataset 1 extend to a wide population.

GENERAL AUDIENCE ABSTRACT

ADHD diagnosis and medication rates increased rapidly throughout the 1990s, and this trend continues today. These sharp increases have been met with both public and clinical criticism, with detractors stating that over-diagnosis is a problem and that healthy children are being unnecessarily medicated and labeled as disabled. However, others say that ADHD is being underdiagnosed in some populations. Critics often state that multiple factors introduce subjectivity into the diagnosis process, meaning that a final diagnosis may be influenced by more than the desire to protect a patient's wellbeing. Some of these factors include standardized testing, legislation affecting special education funding, and the diagnostic process itself. In an effort to circumvent these extraneous factors, this work aims to further develop a potential method of using EEG signals to accurately discriminate between ADHD and non-ADHD children using features that capture spectral and perhaps temporal information from evoked EEG signals. KNN has been shown in prior research to be an effective tool in discriminating between ADHD and non-ADHD subjects; therefore, several different machine learning models are created using features derived in a variety of fashions. One takes into account the heterogeneity of ADHD, and another seeks to exploit differences in executive functioning between ADHD and non-ADHD subjects. The results of this classification method vary widely depending on the sample used to train and test the KNN model: classification accuracy ranged from 65% to 94%, and the cause of this variation was not identified. A much larger data set is necessary to determine whether or not the very positive results found with Dataset 1 extend to a wide population.

Acknowledgements

I would like to express my gratitude towards Dr. Beex for the chance to work on an interesting project that incorporates technical material relating to digital signal processing and machine learning, but also has implications in the healthcare industry. Thanks also go to Dr. Bell for providing her data, her willingness to answer many questions, and her service on my committee. Thank you to Dr. Baumann for service on my committee and for help in building foundational programming skills in the past. Thank you to my parents, Van and Polly Hanson, for their unconditional love and support, for always encouraging the pursuit of education, and for all the opportunities they provide through many sacrifices. I would also like to thank Paul Kennedy, Elaine Khuu, and David Evans for their friendship and support through a challenging program. Of course, thank you to God for opening the right doors at the right time and always making a way possible.

Table of Contents

Acknowledgements
List of Tables
List of Abbreviations
1. Introduction
Overview of ADHD
Subjectivity in ADHD Diagnosis
Potential Impact of Objective Diagnosis
Limitations
Outline
Literature Review
EEG
Theta Beta Power Ratio
ADHD Heterogeneity
Attention Network Task
Test Data and Features
Dataset and Equipment
Autoregressive Coefficients
Reflection Coefficients
Line Spectral Frequencies
KNN Over Entire ANT1 Period
K-Nearest Neighbors
Choosing an Appropriate K Value
Performance Metrics
Challenges of Using KNN
4.5 KNN Results Using AR Coefficients: Male 6-Year-Olds
Unfiltered Data
Low Pass Filtered Data
High Pass Filtered Data
Hz Notch Filtered Data
Line Frequency (60 Hz) Notch Filter and Delta Band Filter
Line Frequency (60 Hz) Notch Filter and Delta and Theta Band Filter
Zero Mean Over All Time
Zero Mean Over Four Second Intervals
Physical Location of Channels
Interpolating a Channel with Dataset
Dataset 2 Tested Against Dataset
KNN Results Using Line Spectral Frequencies
KNN Results Using Reflection Coefficients
KNN Using Target Time Linked Intervals
Target Time Linked Intervals using Dataset
Target Time Linked Intervals using Dataset
Heterogeneity of ADHD and non-ADHD
K-Means Clustering and K-Means Five-Class Separation Methods
Manual Separation
Classes from Unseeded K-Means
Classes from Seeded K-Means
Discussion
References

List of Figures

Figure 2.1 Different Cues in the ANT
Figure 2.2 Different Configuration of Flankers around the Target
Figure 4.1 Example of 2D KNN model with K =
Figure 4.2 Accuracy of KNN using a Range of K Values
Figure 4.3 Average Accuracy and 5th Percentile of Confidence of Dataset 1 (Unfiltered)
Figure 4.4 Distribution of Class A Neighbors for All Dataset 1 Training-Testing Combinations (T7-Pz) with N Subjects (Top) and A Subjects (Bottom)
Figure 4.5 Example of Distribution of Class A Neighbors with 1 Incorrect Diagnosis (C3-C4)
Figure 4.6 Average Accuracy and 5th Percentile of Confidence of Dataset 2 (Unfiltered)
Figure 4.7 Distribution of Class A Neighbors for All Dataset 2 Training-Testing Combinations (Fp2-P7) with N Subjects (Top) and A Subjects (Bottom)
Figure 4.8 LPF used on Dataset 1 and Dataset
Figure 4.9 Impulse Response of LPF
Figure 4.10 Average Accuracy and 5th Percentile of Confidence of Dataset 1 (LPF)
Figure 4.11 Average Accuracy and 5th Percentile of Confidence of Dataset 2 (LPF)
Figure 4.12 HPF used on Dataset 1 and Dataset
Figure 4.13 Impulse Response of HPF
Figure 4.14 Average Accuracy and 5th Percentile of Confidence of Dataset 1 (HPF)
Figure 4.15 Average Accuracy and 5th Percentile of Confidence of Dataset 2 (HPF)
Figure 4.16 Spectra Examples from C3 of Dataset 1 and Dataset
Figure Hz Notch Filter
Figure 4.18 Impulse Response of 60 Hz Notch Filter
Figure 4.19 Average Accuracy and 5th Percentile of Confidence of Dataset 1 (Notch)
Figure 4.20 Average Accuracy and 5th Percentile of Confidence of Dataset 2 (Notch)
Figure Hz Notch and Delta Band Filter
Figure 4.22 Impulse Response of 60 Hz Notch and Delta Band Filter
Figure 4.23 Average Accuracy and 5th Percentile of Confidence of Dataset 1 (No Delta/60 Hz)
Figure 4.24 Average Accuracy and 5th Percentile of Confidence of Dataset 2 (No Delta/60 Hz)
Figure 4.25 Delta/Theta/60 Hz Notch Filter
Figure 4.26 Impulse Response of 60 Hz Notch and Delta and Theta Band Filter
Figure 4.27 Average Accuracy and 5th Percentile of Confidence of Dataset 1 (No Delta, Theta/60 Hz)
Figure 4.28 Average Accuracy and 5th Percentile of Confidence of Dataset 2 (No Delta, Theta/60 Hz)
Figure 4.29 Average Accuracy and 5th Percentile of Confidence of Dataset 1 (Zero Mean Overall)
Figure 4.30 Average Accuracy and 5th Percentile of Confidence of Dataset 2 (Zero Mean Overall)
Figure 4.31 Average Accuracy and 5th Percentile of Confidence of Dataset 1 (Zero Mean Intervals)
Figure 4.32 Average Accuracy and 5th Percentile of Confidence of Dataset 2 (Zero Mean Intervals)
Figure 4.33 Physical Location of High Accuracy Channels of Dataset 1 (left/blue) and Dataset 2 (right/red)
Figure 4.34 Histograms of Overall Accuracy of NA Tests (top left) and A Tests (bottom right) and Confidence of NA Tests (top right) and A Tests (bottom right)
Figure 4.35 Dataset 2 Tested Against Dataset 1 (AR Coefficients)
Figure 4.36 Average Accuracy and 5th Percentile of Confidence of Dataset 1 (LSF Features)
Figure 4.37 Average Accuracy and 5th Percentile of Confidence of Dataset 2 (LSF Features)
Figure 4.38 Average Accuracy and 5th Percentile of Confidence of Dataset 2 vs Dataset
Figure 4.39 Average Accuracy and 5th Percentile of Confidence of Dataset 1 (RC Features)
Figure 4.40 Average Accuracy and 5th Percentile of Confidence of Dataset 2 (RC Features)
Figure 4.41 Average Accuracy and 5th Percentile of Confidence of Dataset 2 against Dataset 1 (RC Features)
Figure 5.1 Overall Accuracies for Varying Interval Lengths and Starting Times Relative to Target Presentation (Dataset 1)
Figure 5.2 Overall Accuracies for Varying Interval Lengths and Starting Time Relative to Target Presentation (Dataset 2)
Figure 6.1 Estimated Relative Power Distribution for Proposed Five Heterogeneity EEG Classes [28]
Figure 6.2 Relative Power Distribution of Five Heterogeneity Classes Separated with Unseeded K-Means
Figure 6.3 Average Accuracy and 5th Percentile of Confidence of the Delta Class (Unseeded K-Means)
Figure 6.4 Distribution of Class A Neighbors for Unseeded Delta Class (F7-T7) with N Subjects (Top) and A Subjects (Bottom)
Figure 6.5 Average Accuracy and 5th Percentile of Confidence of the Theta Class (Unseeded K-Means)
Figure 6.6 Distribution of Class A Neighbors for Unseeded Theta Class (Fp2-C4) with N Subjects (Top) and A Subjects (Bottom)
Figure 6.7 Average Accuracy and 5th Percentile of Confidence of the Alpha Class (Unseeded K-Means)
Figure 6.8 Distribution of Class A Neighbors for Unseeded Alpha Class (Fp5-O2) with N Subjects (Top) and A Subjects (Bottom)
Figure 6.9 Average Accuracy and 5th Percentile of Confidence of the Beta Class (Unseeded K-Means)
Figure 6.10 Average Accuracy and 5th Percentile of Confidence of the NSE Class (Unseeded K-Means)
Figure 6.11 Relative Power Distribution of Five Heterogeneity Classes Separated with Seeded K-Means
Figure 6.12 Average Accuracy and 5th Percentile of Confidence of the Delta Class (Seeded K-Means)
Figure 6.13 Distribution of Class A Neighbors for Seeded Delta Class (Fp1-Fc1) with N Subjects (Top) and A Subjects (Bottom)
Figure 6.14 Average Accuracy and 5th Percentile of Confidence of the Theta Class (Seeded K-Means)
Figure 6.15 Distribution of Class A Neighbors for Seeded Theta Class (Fc2-Pz) with N Subjects (Top) and A Subjects (Bottom)
Figure 6.16 Average Accuracy and 5th Percentile of Confidence of the Alpha Class (Seeded K-Means)

Figure 6.17 Average Accuracy and 5th Percentile of Confidence of the Beta Class (Seeded K-Means)
Figure 6.18 Average Accuracy and 5th Percentile of Confidence of the NSE Class (Seeded K-Means)
Figure 6.19 Distribution of Class A Neighbors for Seeded NSE Class (Fc1-Cp2) with N Subjects (Top) and A Subjects (Bottom)

List of Tables

Table 1 Highest average classification accuracy across various processing conditions
Table 2 Class of Each Subject Across Classification Methods
Table 3 Highest average classification accuracy across different heterogeneity classes

List of Abbreviations

AD      Alzheimer's Disease
ADHD    Attention-Deficit/Hyperactivity Disorder
ANT     Attention Network Task
AR      Autoregressive
BMD     Bipolar Mood Disorder
DSM     Diagnostic and Statistical Manual of Mental Disorders
EEG     Electroencephalography
KNN     K-Nearest Neighbor
LSF     Line Spectral Frequencies
RC      Reflection Coefficients
TBPR    Theta/Beta Power Ratio

1. Introduction

Attention-deficit/hyperactivity disorder (ADHD) is considered a neurodevelopmental disorder, hallmarked by three primary traits: inattention, hyperactivity, and impulsivity. ADHD is commonly treated with medication, behavior therapy, or some combination of the two, and is one of the most common disorders found in children.

1.1 Overview of ADHD

Conditions involving hyperactivity and inattention in children are not new. A condition named "Hyperkinetic disease of infancy" was first identified and distinguished from other similar conditions in 1932, although symptoms were described as early as the mid-1800s [1]. The cause of ADHD is still unknown, but a genetic component appears to play a significant role in its manifestation [2]. The condition was first standardized in the second edition of the Diagnostic and Statistical Manual of Mental Disorders (DSM), the diagnostic manual published by the American Psychiatric Association, under the name "Hyperkinetic Reaction of Childhood"; the next edition, DSM III, included Attention Deficit Disorder (ADD) as a condition [3, 4]. The DSM III description listed two different types of ADD, one with hyperactivity and one without [4]. Criteria for diagnosis continued to change up through the most recent edition, DSM V. This shows a shift in attitudes towards ADHD, from thinking that either hyperactivity or inattention was the root cause to thinking that there are two tiers of symptoms that sometimes are coupled and sometimes exist independently [1]. The latest edition identifies three different types of ADHD: predominantly inattentive type, predominantly hyperactive and impulsive type, and combined type [5]. The current diagnostic criteria require that symptoms appear before age 12, that the symptoms are present in at least two different settings, that the symptoms interfere with normal function for the setting, and that at least 6 of 18 listed symptoms are present for at least 6 months [5]. Some of these symptoms include "Often fails to give close attention to detail or makes mistakes," "Often has difficulty sustaining attention in tasks or activities," and "Often unable to play or engage in leisure activities quietly." Many of the symptoms, when isolated, sound like descriptors of typical childhood behavior, which adds an element of ambiguity to the diagnosis process.

Other medical conditions often occur alongside ADHD, including anxiety, depression, autism, Conduct Disorder, and Oppositional Defiant Disorder [6]. Some of these other conditions have symptoms that overlap with ADHD symptoms, so it can be difficult to discern whether a child has ADHD alone, another condition alone, or some combination. Diagnosing physicians face the challenge of identifying the actual root cause of the symptoms; otherwise, symptoms may worsen without appropriate treatment, or a patient may be unnecessarily exposed to medications that can carry risky side effects.

While this disorder is relatively common among children, it is difficult to determine the actual proportion of children who have it. Recent estimates from the 2016 National Survey of Children's Health (NSCH) report that 9.4% of children aged 2-17 have at some point been diagnosed with ADHD, and 62% of those were taking some kind of medication to manage the symptoms [6]. Breakdowns of ADHD diagnosis rates by gender vary, but historically boys are diagnosed at a much higher rate than girls. The 2011 NSCH reported that approximately 20% of boys in high school and approximately 9% of high school girls had received a diagnosis of ADHD at some point (although they may no longer exhibit symptoms) [7]. Boys are also about 2.3 times more likely than girls to be medicated for ADHD [8]. However, the diagnosis rate of ADHD is rising faster in girls than in boys, which is attributed both to an overall increase in diagnosis and to the medical community becoming better at recognizing ADHD symptoms in girls [8].

One of the primary reasons for the controversy surrounding ADHD is its rapid increase in diagnosis. Previous NSCH surveys from 2003, 2007, and report ADHD diagnosis rates of 7.8%, 9.5%, and 11.0%, respectively, in children aged 4-17 [6]. The 2016 survey was redesigned from previous years, so it is difficult to directly compare the most recent and earlier estimates. However, it is clear that from 2003 to 2012 ADHD diagnoses rose significantly, by more than 41% over 9 years. A sharp increase also occurred in the late 80s through the 90s, from a reported 0.9% of children in 1987 to a reported 3.4% of children in 1997; the 1997 rate is more than 370% of the 1987 rate, a nearly fourfold rise within one decade [9]. Many people are critical of this increase in diagnosis, suggesting that ADHD is over-diagnosed and that typical childhood behavior is now being labeled as symptoms of a neurological disorder and subsequently medicated. Such over-diagnosis would unnecessarily give children a medical label that carries a stigma. It would also lead to children taking stimulants for years in order to minimize disruptive behaviors. However, others claim that this rate increase merely means that more children and families are getting the treatment and support they need to cope with a condition that, when left untreated, can put children at an educational disadvantage.
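The relative changes quoted above follow from simple arithmetic, and a short sketch can make the distinction between "percent increase" and "percent of the starting value" explicit. The helper name `relative_increase` is hypothetical and introduced only for illustration; the rates are the ones quoted in the text.

```python
def relative_increase(old, new):
    """Percent increase of `new` over `old`, relative to `old`."""
    return (new - old) / old * 100.0


# NSCH estimates for children aged 4-17: 7.8% rising to 11.0%.
nsch_increase = relative_increase(7.8, 11.0)

# Reported rates of 0.9% (1987) and 3.4% (1997).
late_increase = relative_increase(0.9, 3.4)

# The same 1987-1997 change expressed as a percentage of the starting rate.
late_ratio = 3.4 / 0.9 * 100.0

print(f"7.8% -> 11.0%: {nsch_increase:.1f}% increase")
print(f"0.9% -> 3.4%: {late_increase:.1f}% increase, "
      f"i.e. {late_ratio:.1f}% of the 1987 rate")
```

Under this definition the NSCH change is an increase of about 41%, while the 1987 to 1997 change is an increase of about 278%, which is the same as saying the 1997 rate is about 378% of the 1987 rate.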

Another source of the concern surrounding ADHD is its treatment. Both behavioral therapy and medication are used in treatment. The CDC states that for children under the age of 6, behavioral treatment should be used and its effectiveness evaluated before prescribing medication, and that for children 6 and older behavioral therapy, medication, or both are good options [10]. There are two primary classes of ADHD medications: stimulants and non-stimulants. Stimulant medications include amphetamine (which includes Adderall) and methylphenidate (which includes Ritalin). As with any medication, these drugs carry additional health risks and may cause adverse events. The FDA label for Adderall warns that its use increases the risk of cardiovascular adverse events in children with cardiac abnormalities, causes a slight increase in blood pressure, can provoke symptoms in patients who also have a psychotic disorder, and can cause circulation issues and suppression of growth rate in children and adolescents [11]. The label also states that long-term effects of amphetamines in children have not been well established [11]. Given the severity of some of the side effects, it is concerning that the long-term risks of medicating children with Adderall have not been thoroughly vetted. While some of the severe effects are not commonly reported, others are. One study reported that up to 48% of patients experience some kind of side effect from ADHD medication, yet only 20% of medicated patients reported these side effects to their physician [12]. Younger children may face a greater risk of adverse events: one study revealed that 11% of preschoolers who were prescribed methylphenidate for treatment of ADHD terminated the treatment because the side effects were found to outweigh the benefits [13].
1.2 Subjectivity in ADHD Diagnosis

Searching for explanations as to why there has been such a sharp increase in ADHD diagnosis rates has led researchers to find several factors that introduce subjectivity into deciding whether or not a subject has ADHD. This subjectivity is concerning because both false positive and false negative diagnoses carry negative consequences, although the sharp increase in diagnosis suggests that false positive diagnoses would be more common.

One of the best-supported indicators of subjectivity is the role of relative age in ADHD diagnosis. When enrolling students, schools often have a cutoff date, meaning that students born right before that date are relatively younger than their classmates, and students born right after that date are relatively older than their classmates. The relatively younger and older students can be nearly a year apart in age, and this difference can result in significant maturity differences in a classroom, especially in the younger years. One study of children in the United States revealed that children born just before the cutoff date were over 60% more likely to receive an ADHD diagnosis than peers who were born just after the cutoff date, and this trend has been supported in other studies of US children and with different cutoff dates [14, 15]. Additionally, these students are more than twice as likely to use stimulants for ADHD [15]. Perhaps a more interesting and revealing finding from this study is that teachers tend to perceive the relatively younger students as more poorly behaved than the parents of those students do: the discontinuity in reported ADHD symptoms around the school cutoff dates is four times larger in teacher reports than in parent reports [15]. This suggests that in a classroom immaturity due to age is often confused with ADHD symptoms, and that teacher reports play a significant role in potential over-diagnosis of ADHD. A study of Canadian students found similar results: boys born in the month before the school cutoff date were 30% more likely to be diagnosed with ADHD than boys who were born in the month following the cutoff date, and there was a 70% increase in ADHD diagnosis for girls [16]. It is unclear whether this is because the relatively younger students are more likely to be over-diagnosed or because the older students are more likely to be under-diagnosed. A German study reported that areas with worse teaching conditions (such as larger class sizes) and more educated adults tend to have higher rates of ADHD diagnosis [17]. While one should be careful not to confuse correlation and causation, it is possible that children act up more frequently in more difficult classroom conditions, and teachers interpret this as more students exhibiting ADHD behavior.
Since ADHD lacks a definitive and measurable biomarker, and diagnosis instead relies on observations of behavior compared to peers, it is understandable that students who are immature due to age (but following typical developmental patterns) appear more hyperactive and less attentive in a classroom than peers who may be nearly a year older. Interestingly, this pattern is not replicated in Denmark. One study revealed that there was no significant difference in diagnosis rate between those born just before and just after the school enrollment cutoff date [18]. While Denmark does use a different diagnostic standard (the International Classification of Diseases), this is not believed to be the root of this discrepancy. Researchers instead attribute this absence of a relative age effect to a difference in diagnosis methods; in Denmark only specialists can provide an ADHD diagnosis, whereas in the United States and other nations an ADHD diagnosis can come from other medical professionals, such as a pediatrician [18]. This implies that the diagnosis method can have a significant impact on whether or not a child is diagnosed with ADHD, and that using teachers as primary reporters of symptoms increases the likelihood of a child being labeled as ADHD.

The idea that diagnostic outcome is influenced by the diagnostic process is supported in other literature. A meta-analysis covering 175 studies that span the use of DSM versions III, III-R, and IV revealed a change in diagnostic methods among DSM versions [19]. The use of clinicians in the diagnostic process decreased from 55% in DSM III to 28% in DSM IV [19]. Additionally, use of a child interview in the process decreased from 82% to 62% from DSM III to DSM III-R, and then decreased to 39% with DSM IV [19]. Skounti reports in a meta-analysis that when a clinical assessment, rather than reports alone, is included as part of the diagnosis process, ADHD diagnosis rates decrease; it would follow that if fewer clinical assessments were used over time, ADHD diagnosis rates would increase [20]. If a child interview was not used, the source of the reports also appears to affect the outcome. According to Willcutt, diagnosis rates are higher when based on a teacher report alone than when based on a parent report alone (13.3% versus 8.8%) [21]. If reports come from both parents and teachers, the diagnosis rate is lower than when using parents or teachers alone [22]. This high variability in the diagnostic method appears to contribute to the high variability in diagnosis rates.

Another compelling statistic is the uneven distribution of ADHD diagnosis rates across different states of the US. One might expect the diagnosis rate to be roughly the same across all states, but this is not the case. Based on 2011 data from the CDC on children 4-17 years old who had ever been diagnosed with ADHD, the highest rate was 18.7% in Kentucky, and the lowest was 5.6% in Nevada [23]. The rate in Kentucky is over 3 times that in Nevada.
Nevada is also the only state to report a decrease in the rate of children who had ever been diagnosed with ADHD [23]. Diagnosis rates tend to be lower in western states and higher in the Southeast and Mid-Atlantic regions [23]. Some research suggests that at least some of this disparity between states is related to the way special education is funded (ADHD is considered a disability, and students can receive special accommodations). One study found that states providing special education funding on a per-student basis classified more children as disabled than states without this incentive; ADHD diagnosis rates increased by 1.6% and medication rates increased by 1.3% [24]. Morrill asserts that this is a causal relationship, based on rates that changed over time in two states that introduced this kind of legislation [24].

Other legislation has been suggested to factor into ADHD diagnosis rates. Fulton asserts that educational accountability reforms (stemming from the No Child Left Behind Act) correlate with higher ADHD diagnosis rates in low-income children, an increase of about 2.8%. This is thought to result mostly from increased pressure to meet testing benchmarks [25]. Fulton also states that a decrease of about 2.2% in diagnosis is found in states that have laws in place forbidding public schools from suggesting or requiring that students use medication [25]. While correlation is not causation, one can see how different pressures and methods might affect the subjective diagnosis of whether or not a child has ADHD. The use of an objective diagnosis tool would help ensure that children are diagnosed and treated correctly, based only on their medical state and not on other factors.

1.3 Potential Impact of Objective Diagnosis

An effective objective diagnosis has the potential to address several concerns around ADHD diagnosis. Measurements would not be influenced by comparative behavior in a classroom, by legislation, or by standardized testing, all factors that may contribute to increases in ADHD diagnosis. Of course, medical professionals would be the ones to decide what threshold separates ADHD from non-ADHD, but this decision would be much less likely to be influenced by the seemingly irrelevant factors listed above. Also, if the measurement process took data from a 5 to 10 minute test, it would be less time consuming than diagnostic interviews with the patient, parent, and/or teacher, thus reducing the cost of the diagnostic process. If developed into a fully contained device with analysis software, a doctor may not even need to be present for the testing, potentially making this technology more accessible to lower-income patients and areas.
Concerns about both over-diagnosis and underdiagnosis could be alleviated with an objective technique, facilitating the diagnosis for those whose symptoms tend not to draw attention while also protecting children with natural behavior from unnecessary medication.

1.4 Limitations

One of two primary limitations in this study is the availability of data. Safety protocols make gathering data challenging. The subjects were from the local area, so the overall population is limited. It can also be challenging for subjects to fully comply with the instructions, as subjects sometimes decide to quit participating in a task. A small sample size is often not suitable for generalizing to an entire population because the margin of error can be large. This can have far-reaching consequences in the case of medical diagnosis. This work does have more data available than its predecessor, since a greater number of subjects were available. Here there were 54 subjects with eyes-closed data and 53 with ANT1 data, compared to 8 subjects available in [26]. To make the best use of the limited number of subjects, different combinations of subjects were used to train models, and then the remaining subjects were tested with these different models, increasing the number of possible tests. For instance, if a subset of the data had 3 N subjects and 4 A subjects, then a total of 54 training-testing combinations are possible with just 7 subjects if 2 N and 2 A subjects are used to train the model and the remaining 3 subjects are tested (3 x 6 = 18 distinct training sets, each tested on 3 held-out subjects).

The other significant challenge in working with this data is the imprecision of the ADHD and non-ADHD labels, which is the very challenge this thesis is working to mitigate. The ADHD diagnosis is reported by the mother on a questionnaire that asks whether or not her child has been diagnosed by a clinician. If the child has been diagnosed incorrectly, then the child will be labeled incorrectly in the training and testing. This can significantly alter the results. In fact, Marcano presumably encountered an instance like this: a subject labeled as N appeared to be a very clear A in KNN results [26]. However, it was revealed afterwards that this subject was female, and given that separate profiles for males and females are gaining support, this subject was tested separately from the male subjects [27].
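The count of 54 training-testing combinations in the example above can be reproduced by brute-force enumeration. The following is a minimal sketch, not the procedure used in this work: the subject identifiers and the function name `train_test_pairs` are hypothetical, and it assumes that each held-out subject counts as one test against each trained model.

```python
from itertools import combinations
from math import comb


def train_test_pairs(n_ids, a_ids, n_train=2, a_train=2):
    """Yield (training set, held-out test subject) pairs.

    Every choice of n_train N subjects and a_train A subjects forms one
    training set; each remaining subject is tested against that model.
    """
    everyone = list(n_ids) + list(a_ids)
    for n_chosen in combinations(n_ids, n_train):
        for a_chosen in combinations(a_ids, a_train):
            training = set(n_chosen) | set(a_chosen)
            for subject in everyone:
                if subject not in training:
                    yield tuple(sorted(training)), subject


# Hypothetical labels for 3 non-ADHD (N) and 4 ADHD (A) subjects.
n_subjects = ["N1", "N2", "N3"]
a_subjects = ["A1", "A2", "A3", "A4"]

pairs = list(train_test_pairs(n_subjects, a_subjects))
n_models = comb(len(n_subjects), 2) * comb(len(a_subjects), 2)

print(n_models, "training sets,", len(pairs), "training-testing combinations")
```

With 3 N and 4 A subjects there are 3 ways to pick the two N training subjects and 6 ways to pick the two A training subjects, so 18 models are trained and each is tested on the 3 subjects left out.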

1.5 Outline

The following work will describe past uses of EEG signals to characterize and classify ADHD versus non-ADHD subjects. EEG data from both ADHD subjects and non-ADHD subjects will be reduced into different features that still represent the spectral information of the EEG signal. These features will then be classified using two machine learning methods, KNN and K-means. Marcano showed that KNN worked well to classify a small dataset [26]. This work seeks to reproduce, with an additional dataset, the high accuracy Marcano achieved. This work will also explore heterogeneity of power spectra in ADHD subjects and whether or not heterogeneity classification can be leveraged to improve accuracy of ADHD classification.

2. Literature Review

2.1 EEG

Electroencephalography (EEG) is a procedure that uses electrodes to measure electrical activity in the brain. Electrodes are typically placed on the surface of the scalp, but are sometimes intracranial. When neurons fire, their electric potential changes, and this produces a measurable voltage change. However, it is difficult to get EEG signals from parts of the brain that are beneath the cortex. Electrodes are placed in a standardized configuration, with a combination of letters and numbers describing where each electrode lies on the scalp. The underlying frequency components have been grouped into different frequency bands. The band edges may vary slightly from investigator to investigator, but Delta (1-4 Hz), Theta (4-7 Hz), Alpha (8-12 Hz), Beta1 (12-16 Hz), and Beta2 (16-21 Hz) is a representative example of how these bands are defined [28].

Atypical EEG patterns can be helpful in diagnosing a variety of medical conditions, including epilepsy, Bipolar Mood Disorder (BMD), autism, and Alzheimer's disease (AD) [29-31]. Machine learning techniques have been investigated for several of these conditions. For instance, one study explored several different classification methods, including LDA and XCSF, for BMD patients using features from eyes-closed resting state and eyes-open resting state [29]. This showed initial success; out of 22 BMD patients, only 1 was diagnosed incorrectly (for both eyes-open and eyes-closed), and 9 achieved a classification rate of at least 90% for all 5 classification methods explored [29]. Another study investigated how well different classification methods could use modified multiscale entropy as features to identify autism in young children [32]. When all subjects were taken together (boys and girls), but stratified by age, KNN produced 90% classification accuracy at 9 months and 18 months [32]. A study on EEG and AD found specific EEG markers that linked to receptors for a neurotransmitter affected by the disease [33].
This proved to be an especially effective method, as using this model and data resulted in an accuracy of 95%, a sensitivity of 96%, and a specificity of 93% [33]. These examples show that machine learning, when using the right model with an appropriate feature, can be a useful diagnostic tool.

However, there is limited literature on using machine learning with EEG to diagnose ADHD and an even smaller amount that uses KNN as the classification algorithm. One study used support vector machine (SVM) classifiers to distinguish both between ADHD and non-ADHD and between the three different subtypes of ADHD [34]. EEG data from four different tasks were used, and when classifying between non-ADHD and ADHD (all subtypes included) the individual SVMs had accuracies ranging from 69.2% to 71.7% on their own, but when all four SVMs were used in the final decision, the accuracy was about 82.3% [34]. The 4-SVM model had greater success separating ADHD subjects into EEG subtypes as defined by Kropotov (which are different subtypes than those defined in the DSM V): ADHD III and ADHD IV were separable in 100% of the dataset, and ADHD II and ADHD III were separable 96.7% of the time [34, 35]. Another study incorporated Independent Component Analysis (ICA), K-means, and KNN on EEG data from a Continuous Performance Test, and the best accuracy based on one feature was 86.36% [36]. Marcano published one of the most promising studies to date, which uses AR coefficients in KNN and GMM-UBM models [26]. When using only features from the ANT period (which will be discussed in Section 2.4), a mean AUC of 0.98 was reported, and all AUC values ranged from 0.92 to 1.00 [26]. Classification with KNN also showed high accuracy (4 of 7 subjects showed accuracy above 75%) and confidence (4 of 7 showed confidence above 80%) [26]. This work seeks to build upon the work surrounding KNN models by examining their use on a larger dataset, as well as by incorporating heterogeneity of EEG and ADHD and considering the timing of the cues and targets within the ANT testing.

2.2 Theta Beta Power Ratio

The theta to beta power ratio (TBPR) is found by comparing the average power in the theta frequency band (4-7 Hz) to the average power in the beta band (13-30 Hz). Earlier research seemed to indicate this ratio was a promising biomarker of ADHD. Monastra reported that using TBPR to differentiate between ADHD and non-ADHD achieved a specificity of 98% and a sensitivity of 86% and duplicated these results in a later study [37, 38].
Lubar suggested that the TBPR could be used to diagnose pure ADD (ADD is ADHD as defined by the DSM-III-R) when hyperactivity and learning disabilities are not present [39]. The FDA even approved a device that measured the TBPR, the NEBA System, to be used as a medical device [40, 41]. However, later studies and meta-analyses call these findings into question, saying that the TBPR is not a reliable indicator of ADHD [42-45]. A study by Ogrim reported that the TBPR correctly differentiated ADHD patients from non-ADHD patients in only 58% of cases, and that the TBPR was more indicative of age than ADHD [43]. Buyck found similar results; TBPR predicted ADHD diagnosis with accuracies ranging from 49.2% to 54.8%, and again was more predictive of age [42]. One interesting note of this study is that a TBPR difference was found between the Inattentive subtype and Combined subtype [42]. Additionally, Marcano examined the use of diagnoses made using the TBPR on 6-year-old subjects of Dataset 1, and results indicated that TBPR was not an accurate method of diagnosing ADHD [26]. For these reasons, namely the inconsistency of TBPR results and the irreproducibility in Dataset 1 subjects, TBPR will not be explored in this work.

2.3 ADHD Heterogeneity

ADHD is heterogeneous by its clinical definition: the DSM V lists three different subtypes, Inattentive, Hyperactive/Impulsive, and Combined Type, each corresponding to a different profile of behavior [5]. Patients are also heterogeneous. They vary in age, gender, personality, and comorbidities, among other factors. This can make identifying a definitive EEG biomarker of ADHD challenging, given that age and gender significantly affect EEG patterns.

One gender difference is readily seen in the disparity in diagnosis rates. As stated above, the CDC ADHD diagnosis rates from 2011 show that boys are over twice as likely as girls to be diagnosed with ADHD at some point between the ages of 4 and 17 [23]. Most literature on ADHD is based on samples consisting of boys. However, literature on girls with ADHD reveals that ADHD symptoms in girls often manifest themselves differently. Girls with ADHD tend to have greater intellectual impairment, lower levels of hyperactivity, and lower rates of other externalizing behaviors compared to boys with ADHD [46]. Girls with ADHD are also less likely to have a comorbid behavior disorder than boys with ADHD [47]. Additionally, EEG patterns for girls with ADHD differ from EEG patterns of boys with ADHD.
Clarke found that the EEG differences between ADHD and non-ADHD are smaller in girls than in boys, which generally makes separation more difficult [27]. Female ADHD patients were also found to have higher total power compared to males [27]. Another study compared typical boys to ADHD boys and typical girls to ADHD girls, and found that several differences between typical and ADHD groups were common to boys and girls; in girls, absolute Delta power was also more elevated in ADHD, a difference that was not reported in boys [48]. For these reasons, in parts of the following classification methods only boys will be tested against models trained with boys.

EEG patterns of patients with ADHD change with age, but so do the EEG patterns of healthy patients, so care must be taken not to confuse these changes with one another. For instance, one study that involved children, adolescents, and adults reported that Delta and Theta power are negatively correlated with age whether or not a patient had ADHD [49]. If younger patients were compared to older non-ADHD patients, one could mistakenly attribute a difference in Delta and Theta power to the younger patients having ADHD when this may not be true. An example of an ADHD and non-ADHD difference that does change with time is power and central frequencies: ADHD and non-ADHD adults could be separated with a sensitivity of 67% and a specificity of 83% when using a shift in the central alpha frequency and increases in alpha 1 power (defined by Poil as 8-10 Hz) and beta power as features [49]. However, this could not be reproduced in children, indicating a change in EEG spectra from childhood to adulthood [49]. Clarke reported that one ADHD component, thought to be related to hyperactivity, appeared to become more typical as the patient aged while another component, thought to be related to inattention, did not [27]. These components are thought to be independent of one another, and the component that lessens may be related to maturational lag, which is discussed in the next paragraph [27]. In this work, unless heterogeneity is being leveraged in the classification method or otherwise stated, only 6-year-olds are used for training and testing of the models.

While the actual cause and mechanism of ADHD are not truly known, there are several proposed models. One is the hypoarousal model, which says that since those with ADHD are less aroused, they seek more stimulation from their environment, which leads to what is considered disruptive behavior [50, 51]. A second suggestion is a maturational lag model [52-54].
This model is said to describe children who have normal developmental patterns, but with a delay in some of that development [53]. These models serve to help explain some of the different EEG patterns and trajectories in ADHD patients. However, these models do not explain all EEG patterns found in ADHD children, leaving some other mechanisms at the root of EEG differences [52].

Some more recent studies have used cluster analysis to separate patients. A study by Clarke separated a set of ADHD patients into 5 different classes with different EEG profiles [52]. Not only did each cluster have a different EEG trait, but these traits also correlated with behavioral differences. For example, one cluster had elevated activity in the beta band, and this group also was more likely to commit acts of vandalism, while showing far less guilt compared to other groups [52]. Another group tended to prefer spending time with peers of a younger age and to be more impulsive than other groups, suggesting a developmental delay. Another important finding of this study is that the patient pool included patients with comorbidities, and these did not significantly impact the EEG profile, which could be of great value considering how easily symptoms of other conditions can be mistaken for ADHD symptoms [52].

Another recent study also classifies ADHD patients into 5 groups; however, this study classified non-ADHD patients as well [28]. Loo asserts that all subjects fall into one of 5 heterogeneity classes, whether or not they have ADHD [28]. The proportion of ADHD subjects in each class was very similar, except for one class, indicating that ADHD is probably nearly independent of heterogeneity class. However, a subject's heterogeneity class was not independent of age and gender. These findings suggest that creating two EEG classes, non-ADHD and ADHD, may not be the best separation method and that using more classes may be a better suited approach. This work describes attempted methods to separate based on two classes, and later presents results from separation methods that also take a 5-class model of EEG into account.

2.4 Attention Network Task Test

Much of the research that seeks to distinguish ADHD from non-ADHD involves EEG data taken from the eyes-closed resting state, where the subject is instructed to lie or sit still with his or her eyes closed for about a minute. However, this research moves away from that and instead uses data collected during the child version of the Attention Network Task (ANT) test. During this test, the subject is presented with a directional target (in this case, a fish) that points either left or right. The subject is instructed to press the arrow key on a keyboard that corresponds to the direction of the fish, either left or right.
A cue is presented to alert the patient that the target is coming soon. There are four different kinds of cues: a central cue, a double cue, a spatial cue, or no cue at all, shown in Figure 2.1.


Figure 2.1 Different Cues in the ANT

The target appears 600 ms after the cue. When the target appears, it may have flankers surrounding it, either all in the same direction as the target, or all facing the opposite direction from the target. Alternatively, there may be no flankers, and this is considered the neutral case. The target appears either slightly above or slightly below the center of the screen. Figure 2.2 displays different ways the flankers and target may be displayed to the subject.

Figure 2.2 Different Configuration of Flankers around the Target

The child has a limited time to respond before the individual trial is over. After each trial, the child receives an audio cue indicating whether or not he or she successfully chose the right direction of the target. Considering the two different target directions, two spatial orientations, three flanker cases, and four cue types, there are 2 × 2 × 3 × 4 = 48 different cases occurring during each ANT period.

This test is of particular interest because it utilizes three different functions that are typically associated with attention: executive, orienting, and alerting. The executive functions are related to effortful control of attention, the orienting functions involve the movement of visual attention in space, and the alerting functions work to achieve and maintain an optimally alert attentional state [55-58]. In this test, the congruent/incongruent flankers are meant to affect executive attention, the left/right direction of the target is meant to utilize orienting attention, and visual cues that warn of an approaching target are meant to evoke the alerting function [58]. The executive and alerting functions have been shown to be weaker in children with ADHD than in those without it [58]. Given this difference in performance, all features used in KNN in this work are taken from the ANT period in order to help magnify differences between EEG signals of ADHD and non-ADHD children.
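The 48 trial cases follow from crossing the four factors described above. The sketch below enumerates them; the factor labels are illustrative names chosen here and are not taken from the ANT software.

```python
from itertools import product

# Illustrative labels for the four trial factors described in the text.
directions = ["left", "right"]                      # direction the target faces
positions = ["above", "below"]                      # target position relative to center
flankers = ["congruent", "incongruent", "neutral"]  # flanker configuration
cues = ["none", "central", "double", "spatial"]     # cue type preceding the target

conditions = list(product(directions, positions, flankers, cues))
print(len(conditions))  # 2 * 2 * 3 * 4 = 48
```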

3. Data and Features

Feature selection and extraction are important parts of implementing a machine learning algorithm. They often aid in reducing the number of computations to be made and the amount of hardware and memory needed. For instance, a single EEG channel sampled at 512 Hz for 5 minutes, which is the approximate length of the signals being considered, results in 153,600 single data points to consider. However, if a 5-dimensional feature were used to characterize 2 s intervals, with those intervals overlapping by 50%, this would result in 298 5-dimensional features, or 1,490 numbers in all. The latter represents a roughly 100-fold reduction in required memory and, depending on the machine learning algorithm, an even greater reduction of computational complexity.

In this work, feature selection is largely empirical. Marcano determined with the Akaike Information Criterion that an AR model of the 7th order was one of the best fits for this data [26]. He also determined that 2 s windows produced higher accuracy in KNN models than windows of shorter length [26]. The use of 7th order models and 2 s windows (unless otherwise noted) was continued in the following analyses. AR coefficients were used primarily; however, other equivalent representations of spectral information were also used, including Reflection Coefficients and Line Spectral Frequencies.

3.1 Dataset and Equipment

The data for the testing and training of these algorithms was provided by Dr. Martha Ann Bell of the Cognition, Affect, and Psychophysiology Lab of the Department of Psychology of Virginia Tech. The data came from two different locations, Virginia Tech in Blacksburg (which will be referred to as Dataset 1) and University of North Carolina at Greensboro (which will be referred to as Dataset 2).
For Dataset 1, 28 EEG channels were recorded on left and right regions of the scalp: Fp1 and Fp2 of the frontal polar region; F2, F4, Fz, F7, and F8 of the frontal region; Fc1, Fc2, Fc5, and Fc6 of the frontocentral region; C3 and C4 of the central region; Cp1, Cp2, Cp5, and Cp6 of the centroparietal region; T7 and T8 of the temporal region; P3, P4, Pz, P7, and P8 of the parietal region; and O1 and O2 of the occipital region. For Dataset 2, 18 EEG channels were recorded: Fp1 and Fp2 of the frontal polar region; F2, F4, F7, and F8 of the frontal region; C3 and C4 of the central region; T7 and T8 of the temporal region; P3, P4, P7, and P8 of the parietal region; and O1 and O2 of the occipital region. The left and right mastoids also had electrodes placed on them; however, this data was not considered in any of the following analyses. Channel Cz served as the reference for all electrode recordings.

The electrodes were placed with a stretch cap that used tin electrodes in the 10/20 layout. Once the cap was in place, abrasive gel was placed at each electrode contact and rubbed gently. Then conductive gel was added to each electrode contact site. To ensure proper contact was made, the impedance of each electrode was measured and approved as long as it was below 20 kΩ.

During recording, each EEG signal was separately amplified with James Long Company Bioamps (James Long Company; Caroga Lake, NY). Separate tasks were marked by an electrical pulse at the beginning and end of each task. All signals were also high- and low-pass filtered. The high-pass filter was an RC filter with one pole, with a 3 dB point at 0.1 Hz and a roll-off of 6 dB per octave. The low-pass filter was a second-order Butterworth with a 3 dB point at 100 Hz and a roll-off of 12 dB per octave. These signals were then sampled at 512 samples per second, well above twice the 100 Hz cutoff, to limit aliasing effects, and Snapshot-Snapstream was used to collect the EEG data (HEM Data Corp.; Southfield, MI). The psychology publication guidelines for EEG collection were followed as described in [59].

For each subject, there is EEG data covering the span of over one hour during which each child performed several tasks. However, this thesis primarily focuses on ANT period data. This consists of the same ANT being administered three times (ANT1, ANT2, ANT3). A CND file with time stamps of the beginning and end of each activity was used to approximate the interval of interest (which was ANT1 unless otherwise noted). Then channel 1 of the data (which contains the pulses marking when each test begins and ends) was used to find the exact samples at which the ANT round begins and ends.

3.2 Autoregressive Coefficients

An autoregressive (AR) model is a linear prediction model that uses a linear combination of past discrete-time values to estimate the current discrete-time value. A $p$-th order linear predictor uses a linear combination of $p$ previous values. The signal is $x[n]$, the $p$-th order linear prediction coefficients are $a_p$, and the forward prediction error is $f[n]$, where $f[n]$ will be white noise when $x[n]$ is an AR($p$) process.

The AR($p$) signal model is

$$x[n] = -\sum_{k=1}^{p} a_{pk}\, x[n-k] + f[n] \tag{3.1}$$

The forward predictor $\hat{x}[n]$ of order $m$ is

$$\hat{x}[n] = -\sum_{k=1}^{m} a_{mk}\, x[n-k] \tag{3.2}$$

and the backward predictor $\hat{x}[n-m]$ of order $m$ is

$$\hat{x}[n-m] = -\sum_{k=1}^{m} a_{mk}\, x[n+k-m] \tag{3.3}$$

Therefore, the forward prediction error is

$$f_m[n] = x[n] - \hat{x}[n] \tag{3.4}$$

and the backward prediction error is

$$b_m[n] = x[n-m] - \hat{x}[n-m] \tag{3.5}$$

The Burg method was used to estimate $a_p$. This is an order-recursive method that uses the reflection coefficients $R_k$ of the lattice structure filter. The objective is that the reflection coefficients, and thus the linear prediction coefficients, are chosen such that the sum of the squared forward and backward errors over $N$ data points is minimized,

$$E_m = \sum_{n=m}^{N-1} \left[\, |f_m[n]|^2 + |b_m[n]|^2 \,\right] \tag{3.6}$$

where $m$ is the order of the model, $1 \le m \le p$, and $1 \le k \le m$. Using the forward and backward predictor representations, $\hat{x}[n]$ and $\hat{x}[n-m]$, in (3.2) and (3.3), the $f_m[n]$ and $b_m[n]$ equations can be rewritten as

$$f_m[n] = x[n] + a_{m1}\, x[n-1] + a_{m2}\, x[n-2] + \dots + a_{mm}\, x[n-m] \tag{3.7}$$

$$b_m[n] = x[n-m] + a_{m1}\, x[n-m+1] + a_{m2}\, x[n-m+2] + \dots + a_{mm}\, x[n] \tag{3.8}$$

Additionally, $f_m[n]$ and $b_m[n]$ can be represented in terms of each other and the reflection coefficients $R_m$.

$$f_m[n] = f_{m-1}[n] + R_m\, b_{m-1}[n-1] \tag{3.9}$$

$$b_m[n] = b_{m-1}[n-1] + R_m\, f_{m-1}[n] \tag{3.10}$$

Then $f_m[n]$ and $b_m[n]$ in the form of (3.9) and (3.10) can be substituted back into (3.6) to put the $m$-th order error in terms of the order-recursive forward and backward prediction errors.

$$E_m = \sum_{n=m}^{N-1} \left[\, \left| f_{m-1}[n] + R_m\, b_{m-1}[n-1] \right|^2 + \left| b_{m-1}[n-1] + R_m\, f_{m-1}[n] \right|^2 \,\right] \tag{3.11}$$

In order to find the minimum error, the derivative of $E_m$ is taken with respect to the reflection coefficient $R_m$ and set to 0. Solving for $R_m$ yields the following

$$R_m = \frac{-\sum_{n=m}^{N-1} f_{m-1}[n]\, b_{m-1}[n-1]}{\frac{1}{2} \sum_{n=m}^{N-1} \left[\, |f_{m-1}[n]|^2 + |b_{m-1}[n-1]|^2 \,\right]} \tag{3.12}$$

These reflection coefficients are the coefficients of a lattice filter with stable roots. The prediction error filter can be represented as a FIR filter, which is always stable. Consistent with (3.7), this FIR filter is given by

$$A(z) = 1 + \sum_{k=1}^{p} a_p[k]\, z^{-k} \tag{3.13}$$

The Burg method can be summarized in the following steps [60]:

1. Initialize the counter $i$ and the forward and backward prediction error vectors $f_0$ and $b_0$, both with all of the values of $x[n]$

$$i = 0 \tag{3.14}$$

$$f_0 = \left[ x[0], x[1], x[2], \dots, x[N-1] \right]^T \tag{3.15}$$

$$b_0 = \left[ x[0], x[1], x[2], \dots, x[N-1] \right]^T \tag{3.16}$$

2. Find the actual forward and backward prediction error values by removing the first value of the most recent forward prediction error vector and the last value of the most recent backward prediction error vector

$$f'_i = f_i(1 : N-i-1) \tag{3.17}$$

$$b'_i = b_i(0 : N-i-2) \tag{3.18}$$

3. Calculate the reflection coefficient $r_i$ using the forward and backward prediction error values

$$r_i = \frac{-2\, (b'_i)^H f'_i}{(f'_i)^H f'_i + (b'_i)^H b'_i} \tag{3.19}$$

4. If counter $i$ equals the desired order $m$ of coefficients, exit here

5. Else, update the prediction errors of the next iteration using the actual prediction errors found in Step 2 and the reflection coefficient found in Step 3

$$f_{i+1} = f'_i + r_i\, b'_i \tag{3.20}$$

$$b_{i+1} = b'_i + r_i\, f'_i \tag{3.21}$$

6. Increment counter $i$ by 1 to represent the next order of coefficients and return to Step 2.

3.3 Reflection Coefficients

Reflection coefficients (RC) are another way to represent the same system as described with AR coefficients. In fact, in the Burg algorithm the RC are calculated in the process of solving for the AR coefficients. The difference between the two is that AR coefficients represent the coefficients of a FIR filter (when the prediction error input is not known) or an IIR filter (when the prediction error input is known), whereas reflection coefficients are used in a lattice filter. They are calculated in (3.12) above.

3.4 Line Spectral Frequencies

Line spectral frequencies (LSF) constitute another equivalent set of parameters representing the same information about a system as is contained in either the AR coefficients or the reflection coefficients. LSFs are often used in speech recognition and processing and often result in reduced errors due to quantization and transmission losses. A $p$-th order prediction error filter can be represented by the polynomial

$$A(z) = 1 + \sum_{i=1}^{p} a_i\, z^{-i} \tag{3.22}$$

This polynomial can be decomposed into two other polynomials,

$$A(z) = \frac{P(z) + Q(z)}{2} \tag{3.23}$$

where $P(z)$ and $Q(z)$ are defined as

$$P(z) = A(z) + z^{-(p+1)} A(z^{-1}) \tag{3.24}$$

$$Q(z) = A(z) - z^{-(p+1)} A(z^{-1}) \tag{3.25}$$

The roots of these latter two polynomials can be shown to always lie on the unit circle when $A(z)$ is minimum phase, so their magnitude is fixed at one, and they can be completely characterized by the angles of the roots. These angles correspond to the line spectral frequencies.
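The Burg recursion of Section 3.2 and the 2 s, 50% overlapping windows described at the start of this chapter can be sketched together in a few lines. This is a minimal pure-Python illustration for real-valued signals, not the implementation used to produce the results in this work. The function names `burg` and `frame_starts`, the synthetic test signal, and the step-up update that converts reflection coefficients to AR coefficients are assumptions introduced here; the sign convention follows (3.7), so the returned coefficient vector starts with 1.

```python
import math
import random

def burg(x, order):
    """Burg estimate for a real signal x (assumed not identically zero).

    Returns (a, r): a = [1, a_1, ..., a_order] are the prediction error
    filter coefficients and r holds the reflection coefficients.
    """
    f = list(x)  # forward prediction errors, initialized with x[n]
    b = list(x)  # backward prediction errors, initialized with x[n]
    a = [1.0]
    r = []
    for _ in range(order):
        f = f[1:]    # drop the first forward error (Step 2)
        b = b[:-1]   # drop the last backward error (Step 2)
        num = -2.0 * sum(fi * bi for fi, bi in zip(f, b))
        den = sum(fi * fi for fi in f) + sum(bi * bi for bi in b)
        k = num / den  # reflection coefficient (Step 3)
        r.append(k)
        # Step-up (Levinson) update of the AR coefficients.
        ext = a + [0.0]
        a = [ext[j] + k * ext[len(ext) - 1 - j] for j in range(len(ext))]
        # Lattice update of both error sequences (Step 5).
        f, b = ([fi + k * bi for fi, bi in zip(f, b)],
                [bi + k * fi for fi, bi in zip(f, b)])
    return a, r

def frame_starts(n_samples, window, hop):
    """Start indices of all full windows of `window` samples, `hop` apart."""
    return list(range(0, n_samples - window + 1, hop))

# Synthetic stand-in for one EEG channel: 4 s of a noisy 10 Hz sinusoid.
fs = 512
rng = random.Random(0)
signal = [math.sin(2 * math.pi * 10 * n / fs) + 0.1 * rng.gauss(0, 1)
          for n in range(4 * fs)]

window, hop = 2 * fs, fs  # 2 s windows with 50% overlap
features = [burg(signal[s:s + window], 7)[0][1:]  # drop the leading 1
            for s in frame_starts(len(signal), window, hop)]
print(len(features), len(features[0]))  # 3 windows, 7 coefficients each
```

Each window contributes one 7-dimensional feature vector. How a trailing partial window is handled changes the total number of features by at most one, which is why `frame_starts` keeps only full windows.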

4. KNN Over Entire ANT1 Period

4.1 K-Nearest Neighbors

K-Nearest Neighbors (KNN) is a supervised machine learning algorithm, and it is among the simplest machine learning algorithms. A set of training features is used to train the model, and the model knows the classification, i.e. the class label, of each training feature. Any test feature is then compared to its nearest K training features. In order to determine the closest features, the distance between the test feature and every training feature must be found. The Euclidean distance was used in all of the following KNN tests. Once the K nearest neighbors are found, the number of neighbors of each class is counted. Each neighbor's class is weighted equally in all KNN usages in this work; however, there are times when different weighting based on distance is appropriate. The class that occurs most commonly among the neighbors is deemed to be the class of the test feature. If there is a tie, which can only occur if an even number of neighbors is contemplated, an additional method should be implemented to determine the final classification.

Figure 4.1 illustrates a simple 2D example. There are two classes within the training data: Class A denoted by blue squares, and Class B denoted by red diamonds. The green dot is the test feature. K is chosen to be 5, so the five neighbors with the smallest Euclidean distance between themselves and this test feature are identified and circled in black.

Figure 4.1 Example of 2D KNN model with K = 5

Of these five neighbors, four belong to Class A and only one belongs to Class B. Since most of the neighbors belong to Class A, the test feature is decided to belong to Class A as well, with 80% confidence in the classification.

4.2 Choosing an Appropriate K Value

Selecting an appropriate value of K is a critical part of KNN implementation. If it is too small, it is easy for a few outliers to reduce the confidence in classification, even to the point of flipping the classified label entirely. If it is too large, the neighborhood may extend beyond natural class borders and reduce the confidence in results as well. Since there are only two classes, N (non-ADHD) and A (ADHD), only odd integers will be considered, since this avoids a tie in the number of neighbors in each class.

An empirical method was used to determine an appropriate K value. Previous work on Dataset 1 showed that the 5-channel combination C3-Cp6-Fc1-Fc2-Fc5 was effective in discriminating between classes A and N during ANT1 [26]. AR coefficients of a 7th order model were found for 2 s long, 50% overlapping intervals of the entire ANT1 period. The results were found by using KNN models trained with 1-N and 1-A data and testing all remaining subjects for K values 3 through 51. All possible training-testing combinations of male 6-year-old subjects from Dataset 1 were explored. The average N classification accuracy, average A classification accuracy, and overall classification accuracy are shown in Figure 4.2.

Figure 4.2 Accuracy of KNN using a Range of K Values

One can see from this plot that the value of K does not significantly affect the overall accuracy. While the lower K values have a slight peak in accuracy, K values below 11 were not considered in order to minimize possible feature noise that causes misclassification. Between K = 11 and K = 17 there is a slight rise in accuracy. Beyond K = 17, there is a slight downward trend, but the change is less than one hundredth. A value of 13 was decided to be appropriate. Unless otherwise noted, K = 13 is used throughout the results shown. This is consistent with previous work done on this dataset: KNN analysis with randomly selected intervals showed that K values between 3 and 99 resulted in very little variation in overall accuracy [26].

4.3 Performance Metrics

The primary way performance is reflected in this thesis is in terms of accuracy, although a few different terms are used to represent similar ideas. For an individual training-testing combination, accuracy is defined as the proportion of features that are correctly classified to the total number of features tested. In this case, all features tested belong to the same class.

$$\text{Overall Accuracy} = \frac{\#\,\text{True Positives} + \#\,\text{True Negatives}}{\#\,\text{Features Tested}} \tag{4.1}$$

Overall accuracy is also a metric used to summarize overall performance. When this term is used in this chapter, it signifies the proportion of correctly classified features (true positives and true negatives) to the total number of features for all training-testing combinations for a specific channel combination. Tested features include features that belong to both classes.

Confidence in classification is another metric used. It represents how many votes a feature receives indicating it belongs to Class A. Features belonging to Class A should have a high confidence (tending towards 1), and features belonging to Class N should have a low confidence value (tending towards 0). Features that have a mid-range confidence (40%-60%) do not strongly suggest which class a feature belongs to. Given that K is selected to be 13, all confidence values for a single feature should be multiples of 1/13 between 0 and 1, inclusive. The confidence value of a subject's classification is the proportion of the total number of neighbors of all features that belong to Class A.

$$\text{Confidence}_A = \frac{1}{n} \sum_{1}^{n} \frac{\text{A votes}}{K}, \quad n = \text{number of tested features} \tag{4.2}$$

$$\text{Confidence}_N = \frac{1}{n} \sum_{1}^{n} \frac{\text{N votes}}{K}, \quad n = \text{number of tested features} \tag{4.3}$$

It is important to note that when the confidence values are represented in the heat maps shown in subsequent subsections, all of the N confidence values are transformed to $1 - \text{Confidence}_A$ so that they can be more easily compared to the $\text{Confidence}_A$ values.

Another metric used is the C-score, which is used to rank the performance of different channel combinations in classification. It is defined as the product of the overall accuracy and the 5th percentile confidence for the A case, and as the product of the overall accuracy and the 95th percentile confidence for the N case. The 5th percentile is found by ranking the confidences of each training-testing combination within a channel pair, and selecting the 5th percentile from that ranking.
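
$$\text{C-score}_A = \text{Overall Accuracy} \times 5^{\text{th}}\ \text{percentile Confidence}_A \tag{4.4}$$

$$\text{C-score}_N = \text{Overall Accuracy} \times \left( 95^{\text{th}}\ \text{percentile Confidence}_N \right) \tag{4.5}$$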

This metric is used because it factors in both the accuracy and the confidence when trying to determine which channel pair is best for diagnosis purposes. For instance, one model may diagnose 85% of the vectors correctly, but the 5th percentile of confidence for all training-testing combinations may be 30%. Another model may diagnose 80% of vectors accurately, but the 5th percentile of confidence may be 70%. While the accuracy of the first model is higher than the accuracy of the second model, the 5th percentile confidence of the second model indicates that at least 95% of the tested subjects were diagnosed correctly, making it more appropriate to use for diagnosis.

4.4 Challenges of Using KNN

One of the main drawbacks of KNN is that it requires many calculations, and the more training data is used (which is generally assumed to improve the quality of the model), the more calculations have to be done. For example, if four subjects are used to train the ADHD/non-ADHD model, and each subject has 250 features, and a test subject also has 250 features, then 1,000 × 250 = 250,000 Euclidean distances have to be calculated and then sorted. These features are often multi-dimensional, and this further adds to the demand on the software. The math may be simple, but it is not readily scalable.

Another potential drawback of KNN is that classification only occurs at the level of a single feature rather than considering the distribution as a whole. For instance, two different classes may share the same average value, but subjects belonging to Class A may have a much wider variance than subjects belonging to Class B. Class B would then be much more concentrated around this shared center point, and any individual Class A test point may appear to be Class B instead. All of the data considered together, however, may reveal the larger variance, a characteristic of Class A that KNN cannot acknowledge.
A final challenge that comes with using KNN, but is not unique to KNN, is that the accuracy of the model depends heavily on the training data. If the data used for training is distorted, mislabeled, or for some reason unrepresentative of the class, the performance of this algorithm can be severely undermined.

4.5 KNN Results Using AR Coefficients: Male 6-Year-Olds

In the remainder of this chapter, only 6-year-old male subjects were used in the training and testing of the classification algorithm in order to minimize feature variations due to age and gender differences. Dataset 1 contains EEG data for three 6-year-old male subjects labeled N (for non-ADHD, meaning the mother of the subject reported that a physician had not diagnosed the child with ADHD) and four 6-year-old male subjects labeled A (meaning the mother of the subject reported that the child had been diagnosed with ADHD by a physician). Dataset 2 contains EEG data for four 6-year-old male subjects labeled N and five 6-year-old male subjects labeled A.

For each test subject from Dataset 1, 26 EEG channels were recorded. This results in 325 different two-channel combinations. For each test subject from Dataset 2, 16 EEG channels were recorded, resulting in 120 two-channel combinations. All of the channels recorded for Dataset 2 form a subset of the channels recorded in Dataset 1. Dataset 1 contains 10 additional EEG channel recordings.

Unless otherwise noted, Dataset 1 subjects were tested against Dataset 1 subjects, and Dataset 2 subjects were tested against Dataset 2 subjects. In each test, a combination of two subjects labeled N and two subjects labeled A was used to train the KNN model, and the remaining subjects (those not used for training) were then tested against that 2 N - 2 A training combination. This is repeated for all possible unique 2 N - 2 A training combinations, resulting in 54 training-testing combinations for Dataset 1 and 300 training-testing combinations for Dataset 2.

The features used in this subsection are seventh order AR coefficients with the first coefficient, which always equals 1, removed since it is deterministic and provides no information about the signal.
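The channel-pair and training-testing counts quoted above follow from simple combinatorics. The short Python check below enumerates them; the function names are hypothetical and the enumeration is only a sketch of the procedure as described, not the code used in this work.

```python
from itertools import combinations


def count_channel_pairs(n_channels):
    """Number of unordered two-channel combinations."""
    return sum(1 for _ in combinations(range(n_channels), 2))


def count_train_test(n_subjects_n, n_subjects_a, per_class=2):
    """Count (training combination, held-out test subject) pairs when each
    model is trained on per_class N subjects and per_class A subjects."""
    subjects_n = [("N", i) for i in range(n_subjects_n)]
    subjects_a = [("A", i) for i in range(n_subjects_a)]
    total = 0
    for train_n in combinations(subjects_n, per_class):
        for train_a in combinations(subjects_a, per_class):
            training = set(train_n) | set(train_a)
            held_out = [s for s in subjects_n + subjects_a if s not in training]
            total += len(held_out)
    return total


pairs_dataset_1 = count_channel_pairs(26)
pairs_dataset_2 = count_channel_pairs(16)
tests_dataset_1 = count_train_test(3, 4)
tests_dataset_2 = count_train_test(4, 5)
print(pairs_dataset_1, pairs_dataset_2, tests_dataset_1, tests_dataset_2)
```

For Dataset 1 there are 3 x 6 = 18 training combinations with 3 held-out subjects each, and for Dataset 2 there are 6 x 10 = 60 training combinations with 5 held-out subjects each.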
Unless otherwise noted, the features in this subsection are extracted from 2 s, 50% overlapping intervals from the beginning of ANT1 to the end of ANT1.

4.5.1 Unfiltered Data

Initially, the testing and training data was not processed at all, meaning this KNN execution only uses raw, unfiltered data. Figure 4.3 shows the overall accuracy of all two-channel combinations for the male 6-year-olds of Dataset 1 below the diagonal, and shows the 5th percentile of confidence values for all training-testing combinations above the diagonal.
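The feature extraction described at the start of this subsection (seventh order AR coefficients from 2 s, 50% overlapping intervals, with the leading coefficient of 1 dropped) can be sketched in Python as follows. The text does not state which AR estimation method was used; the autocorrelation (Yule-Walker) method solved with the Levinson-Durbin recursion is an assumption here, as are the function names and the synthetic test signal. The 512 Hz sampling rate is the value quoted later in this chapter.

```python
import random


def autocorrelation(x, max_lag):
    """Biased autocorrelation estimates r[0..max_lag]."""
    n = len(x)
    return [sum(x[i] * x[i + lag] for i in range(n - lag)) / n
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the Yule-Walker equations; returns [1, a1, ..., a_order]
    for the model x[n] + a1*x[n-1] + ... + a_order*x[n-order] = e[n]."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for m in range(1, order + 1):
        acc = r[m] + sum(a[j] * r[m - j] for j in range(1, m))
        k = -acc / err
        new_a = a[:]
        for j in range(1, m):
            new_a[j] = a[j] + k * a[m - j]
        new_a[m] = k
        a = new_a
        err *= (1.0 - k * k)
    return a


def ar_features(signal, fs=512, window_s=2.0, overlap=0.5, order=7):
    """One feature vector per window: AR coefficients without the leading 1."""
    win = int(window_s * fs)
    hop = int(win * (1.0 - overlap))
    features = []
    for start in range(0, len(signal) - win + 1, hop):
        segment = signal[start:start + win]
        coeffs = levinson_durbin(autocorrelation(segment, order), order)
        features.append(coeffs[1:])  # the first coefficient always equals 1
    return features


# Synthetic stand-in for one EEG channel: x[n] = 0.5*x[n-1] + noise
random.seed(0)
signal, prev = [], 0.0
for _ in range(4096):
    prev = 0.5 * prev + random.gauss(0.0, 1.0)
    signal.append(prev)

features = ar_features(signal)
first_order = levinson_durbin(autocorrelation(signal, 1), 1)
print(len(features), len(features[0]), round(first_order[1], 2))
```

With 4096 samples at 512 Hz this yields seven windows of seven coefficients each. A two-channel feature would presumably combine the coefficients of both channels, but how they are combined is not specified in this section.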

Figure 4.3 Average Accuracy and 5th Percentile of Confidence of Dataset 1 (Unfiltered)

Observe that there are many (over 150) channel combinations that yield an overall accuracy of at least 80%, and 3 combinations achieve an overall accuracy above 90%. The highest overall accuracy was 94% for channel pair Fc1-Pz. This would suggest that there is little more that can be done to improve the classification accuracy for Dataset 1. Additionally, 10 channel combinations have a 5th percentile confidence of at least 80%, meaning that 95% of the training-testing combinations have a confidence of 80% or higher in the correct direction. The highest 5th percentile of confidence for individual training-testing combinations is 84% with channel pair T7-Pz, so this model is heavily skewed toward correct diagnosis. The highest C-score is 0.78, and this was achieved by channel pair T7-Pz. Most vectors are classified correctly, and all tested subjects are classified correctly and with high confidence. This will serve as the baseline for future comparisons after any kind of processing and for other datasets.

It is also interesting to look at the individual distributions of the number of neighbors of class A for all possible training and testing combinations. Figure 4.4 shows the proportion of features from channels T7-Pz (this pair had the highest C-score out of all possible two-channel combinations) that indicate the confidence that the tested subject belongs to Class A. Tested subjects belonging to class N should have a low confidence (of belonging to Class A), and tested subjects belonging to class A should have a high confidence. Since this channel pair had a high average classification accuracy, it is expected that the individual trials show strong skewness.

Figure 4.4 Distribution of Class A Neighbors for All Dataset 1 Training-Testing Combinations (T7-Pz) with N Subjects (Top) and A Subjects (Bottom)

The individual distributions show the expected behavior. The distributions of tested N subjects are heavily skewed towards the left-hand side, indicating that few features had many (if any) neighbors belonging to class A. The distributions of the class of k neighbors of tested A subjects (in the lower half of Figure 4.4) are heavily skewed to the right, illustrating that most tested A features had very few N neighbors. It is also worth noting that with this channel combination, no subject was ever misclassified for any individual test combination.
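The neighbor counts plotted in these distributions can be illustrated with a small KNN sketch. The value k = 13 is the one quoted later for the LSF features and is assumed here to apply to the AR features as well. The subject-level confidence below (the fraction of a subject's feature vectors whose neighbor majority is class A) is likewise an assumed definition for illustration, and the two-dimensional toy vectors are hypothetical.

```python
import math


def knn_class_a_counts(train_vectors, train_labels, test_vectors, k=13):
    """For each test vector, count how many of its k nearest training
    vectors (Euclidean distance) carry the label 'A'."""
    counts = []
    for t in test_vectors:
        neighbors = sorted(
            (math.dist(t, v), label)
            for v, label in zip(train_vectors, train_labels)
        )
        counts.append(sum(1 for _, label in neighbors[:k] if label == "A"))
    return counts


def subject_confidence_a(counts, k=13):
    """Assumed definition: fraction of vectors with a class-A majority."""
    return sum(1 for c in counts if c > k / 2) / len(counts)


# Toy training set: ten N vectors near the origin, ten A vectors near (5, 5)
train_vectors = [(0.1 * i, 0.0) for i in range(10)] + \
                [(5.0 + 0.1 * i, 5.0) for i in range(10)]
train_labels = ["N"] * 10 + ["A"] * 10

counts_a = knn_class_a_counts(train_vectors, train_labels,
                              [(5.0, 4.8), (5.3, 5.1), (4.9, 5.2)])
counts_n = knn_class_a_counts(train_vectors, train_labels,
                              [(0.2, 0.1), (0.5, -0.1)])
conf_a = subject_confidence_a(counts_a)
conf_n = subject_confidence_a(counts_n)
print(counts_a, counts_n, conf_a, conf_n)
```

A histogram of such per-vector counts, one curve per training-testing combination, would correspond to the kind of distribution shown in Figure 4.4: counts near k for a well-separated A subject and counts near zero for a well-separated N subject.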

It is important to point out that if a subject is misclassified for a single training-testing combination, the model is not necessarily an overall poor choice. In Figure 4.5 another distribution of test vector neighbors is shown from a different channel pair, C3-C4. This time only the tested A subjects are shown.

Figure 4.5 Example of Distribution of Class A Neighbors with 1 Incorrect Diagnosis (C3-C4)

One of the tests indicated that one A subject belonged to class N. One can also see that its distribution is close to a uniform distribution, with an uptick at both ends. Just under 50% of its features were classified as A for one training combination. However, this subject was classified as an A subject for all other training-testing combinations, so overall this model would classify this subject correctly when considering all training-testing combinations.

The same classification process was applied to Dataset 2 using the raw, unfiltered data. Once again, in Figure 4.6 the average accuracy resulting from all 300 training-testing combinations is shown below the diagonal, and the 5th percentile of the confidence of the individual training-testing combinations is shown above the diagonal.

Figure 4.6 Average Accuracy and 5th Percentile of Confidence of Dataset 2 (Unfiltered)

Dataset 2 yielded a much lower average classification accuracy overall compared to Dataset 1. The highest average classification accuracy achieved was 64% for channels Fp2-O1, one of only 5 channel combinations with an overall accuracy above 60%. Many (80 out of 120) of the channel pairs produced an overall accuracy of 40-60%, which is comparable to being diagnosed based solely on a coin flip. The highest 5th percentile of confidence was 20% for channel pair Fp2-P7, which is not indicative of skewness towards a correct diagnosis for the training-testing combinations. The highest C-score is 0.11 from channel combination Fp2-P7, another indicator of poor results using this algorithm on Dataset 2. The performance of this algorithm on Dataset 2 plummeted compared to Dataset 1, and the source of this change is not clear.

For each training-testing combination using only the subjects of Dataset 2, the distribution of class A neighbors for each feature vector from the channels Fp2-P7 (which yielded the highest C-score) is shown in Figure 4.7. Given that the overall accuracy was poor, one would expect to see many subjects misclassified. Recall that A subjects should have a skewness towards the right (indicating many neighbors that belong to class A) and that N subjects should have a skewness towards the left (indicating few neighbors that belong to class A).


Figure 4.7 Distribution of Class A Neighbors for All Dataset 2 Training-Testing Combinations (Fp2-P7) with N Subjects (Top) and A Subjects (Bottom)

As seen in Figure 4.7, and as predicted, very few training-testing combinations of Dataset 2 produce the heavy skew of neighbors of one class that was seen in Figure 4.4. In the tested A subjects, there is some skew towards correct diagnosis, but there is also an uptick in the tail that points in the opposite direction. However, the overall aggregation of training-testing combination results in Figure 4.7 for the A subjects still indicates the correct diagnosis. In the N case, for several subjects there is some skew in the opposite direction. Many individual training-testing combinations, as can be seen with some of the N individual tests (drawn in black), appear to have an almost uniform distribution across all K values, indicating that there is no real separation between the N and A classes for Dataset 2 when using AR coefficient feature vectors.

It is surprising that there is such a large difference in performance between Dataset 1 and Dataset 2. One possible explanation for this difference is that there are subclasses of EEG distributions that are independent of ADHD and that Dataset 1 is, by chance or otherwise, more homogeneous, i.e. that Dataset 2 is a more representative sample of these different classes.

4.5.2 Low Pass Filtered Data

Given that the defined EEG rhythms (delta, alpha, beta, theta, and gamma) fall below 40 Hz, it seems reasonable to assume that the information relevant to classification accuracy resides below 40 Hz and that spectral content at higher frequencies constitutes noise. To test the latter hypothesis, the testing and training data was passed through a linear phase FIR lowpass filter with a cutoff frequency at 45 Hz, as shown in Figure 4.8.

Figure 4.8 LPF used on Dataset 1 and Dataset 2

The impulse response of the LPF is in Figure 4.9.
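For readers who want to experiment with a comparable filter, the Python sketch below designs a linear phase FIR lowpass with a 45 Hz cutoff at a 512 Hz sampling rate. The design method (a Hamming-windowed sinc) and the number of taps (201) are assumptions chosen for illustration; the text does not state how the filter in Figure 4.8 was designed.

```python
import math


def lowpass_fir(cutoff_hz, fs, num_taps):
    """Hamming-windowed-sinc lowpass. An odd num_taps gives a symmetric
    (type I) impulse response and therefore exactly linear phase."""
    if num_taps % 2 == 0:
        raise ValueError("num_taps must be odd")
    mid = (num_taps - 1) // 2
    fc = cutoff_hz / fs  # cutoff in cycles per sample
    taps = []
    for n in range(num_taps):
        m = n - mid
        if m == 0:
            ideal = 2.0 * fc
        else:
            ideal = math.sin(2.0 * math.pi * fc * m) / (math.pi * m)
        window = 0.54 - 0.46 * math.cos(2.0 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    gain = sum(taps)
    return [t / gain for t in taps]  # normalize to unity gain at DC


def magnitude_response(taps, freq_hz, fs):
    """Magnitude of the filter's frequency response at one frequency."""
    w = 2.0 * math.pi * freq_hz / fs
    re = sum(t * math.cos(w * n) for n, t in enumerate(taps))
    im = -sum(t * math.sin(w * n) for n, t in enumerate(taps))
    return math.hypot(re, im)


lpf = lowpass_fir(cutoff_hz=45.0, fs=512.0, num_taps=201)
passband = magnitude_response(lpf, 10.0, 512.0)
stopband = magnitude_response(lpf, 100.0, 512.0)
print(round(passband, 3), stopband < 0.01)
```

The symmetry of the taps about the center is what guarantees linear phase, so all frequency components are delayed by the same number of samples.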

Figure 4.9 Impulse Response of LPF

The results from using the LPF data from Dataset 1 are shown below in Figure 4.10, with the overall accuracy shown below the diagonal and the 5th percentile of confidence of all training-testing combinations shown above the diagonal.

Figure 4.10 Average Accuracy and 5th Percentile of Confidence of Dataset 1 (LPF)

Surprisingly, using LPF data from Dataset 1 reduced the average accuracy greatly, with a highest average classification accuracy of 75% for Cp1-Pz. This is about a 20 percentage point drop in accuracy for the high accuracy channel pairs compared to unfiltered data. The 5th percentiles of confidence overall do not indicate a strong skew towards accurate diagnosis, although there is some skew. The highest 5th percentile of confidence is 45%, and this comes from Fp1-C4. This channel pair also has the highest C-score. This suggests that while the main EEG bands are below 40 Hz, the difference between A and N subjects is more readily differentiated in higher frequencies using this dataset.

The results for using KNN with LPF data from Dataset 2 are below in Figure 4.11. The exact same filter as described and shown above in Figure 4.8 was used on this data.

Figure 4.11 Average Accuracy and 5th Percentile of Confidence of Dataset 2 (LPF)

The highest average classification accuracy is 57% for P4-O1. This is a 6 percentage point drop in peak accuracy for Dataset 2. While this is a decrease, it is not as drastic as seen with Dataset 1. The highest 5th percentile of confidence is 27% from channel pair Fp1-F7. This is a higher 5th percentile than when using unfiltered data; however, it is far from suggesting that most of the subjects are classified accurately. The channel pair with the highest C-score is also Fp1-F7. All metrics indicate that low pass filtered data does not improve performance with these datasets and this algorithm.

4.5.3 High Pass Filtered Data

Given that lowpass filtering the data reduced the classification performance, it now becomes of interest to see how good classification performance can be using highpass filtered data. A linear phase highpass FIR filter, with the cutoff frequency seen in Figure 4.12, was used to process the data.

Figure 4.12 HPF used on Dataset 1 and Dataset 2

The impulse response of this HPF is shown in Figure 4.13.

Figure 4.13 Impulse Response of HPF

The overall accuracies and 5th percentiles of confidence of each channel combination from using the HPF data from Dataset 1 are presented in Figure 4.14.

Figure 4.14 Average Accuracy and 5th Percentile of Confidence of Dataset 1 (HPF)

The highest average classification accuracy of 92% is for channels Fc2-Cp2. This is approximately a 2-point drop in comparison to the highest average accuracy when using unprocessed data. Channel pair Fz-Cp2 achieved the highest 5th percentile of 77%. This is less than 2 points lower than the highest 5th percentile recorded with this dataset using unfiltered data. The highest C-score is 0.71, and is associated with channel pair Fz-Cp2. While using the high pass filtered data does reduce performance, several channel pairs still performed well and diagnosed most training-testing combinations with at least 70% confidence. This is significant because it signals that important information lies in frequencies above 60 Hz, which is a higher frequency range than most EEG ADHD research investigates. Additionally, the channel pairs that performed well with the unfiltered data also performed relatively well with the highpass filtered data.

The same highpass filter was applied to Dataset 2, and the average accuracies and 5th percentiles of confidence are presented in Figure 4.15.

Figure 4.15 Average Accuracy and 5th Percentile of Confidence of Dataset 2 (HPF)

For HPF Dataset 2, the highest overall classification accuracy is 57% for channels F8-O1. Dataset 2 accuracy was about 7 percentage points worse after highpass filtering compared to using all frequency information. The highest 5th percentile of confidence is 26%, and this comes when channel pair F8-O1 is used. The highest C-score is 0.15, and it comes from using channel pair F8-O1. This drop in performance is about the same as the drop that comes from using the lowpass filtered data. This suggests that perhaps for the UNC data neither the content above 60 Hz nor the content below 60 Hz holds the dominant information relevant to ADHD diagnosis.

4.5.4 60 Hz Notch Filtered Data

To further investigate the apparent discrepancy between Dataset 1 and Dataset 2, the EEG spectral content was compared. Figure 4.16 shows the result of the spectral estimation process using the averaged Welch periodogram, executed with the following Matlab statement: pwelch(data,hamming(window),window/2, 2^16, fs), where data is the EEG channel from ANT1, Hamming windowed data segments of 400 samples are used with 50% overlap of the segments, the segments are zero padded to 2^16, and the sampling frequency fs is 512 Hz.

Figure 4.16 Spectra Examples from C3 of Dataset 1 and Dataset 2

Looking at this representative sample of spectra of subjects from Dataset 1 and Dataset 2, one sees a strong spike at 60 Hz (sometimes with over a 50 dB magnitude) in the spectra of Dataset 2 subjects, but not in the Dataset 1 subplots. In order to see if this power grid interference adversely affects the accuracy results, the EEG signals from both Dataset 1 and Dataset 2 were processed with a 60 Hz notch filter with a stopband from about 59 Hz to 61 Hz and a stopband attenuation of at least 53 dB, as shown in Figure 4.17. It was expected that this filter would not significantly impact the performance of the models trained and tested with Dataset 1, because the 60 Hz noise either does not appear or appears with a small magnitude in the spectrum. For Dataset 2, it was expected that this filtering would improve the performance of the KNN classification, because the 60 Hz signal component is an interference that is not of interest.
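The Welch estimate described above can be illustrated without a signal processing toolbox. The Python sketch below averages Hamming-windowed periodograms of 400-sample segments with 50% overlap at a 512 Hz sampling rate, but evaluates them only at chosen frequencies instead of on the zero padded 2^16-point grid. It is not a reimplementation of Matlab's pwelch (for example, it returns a two-sided density), and the synthetic signal with an added 60 Hz sinusoid is a hypothetical stand-in for a contaminated EEG channel.

```python
import math
import random


def welch_power_at(x, freq_hz, fs=512.0, window=400):
    """Average of Hamming-windowed periodograms (50% overlap) evaluated at a
    single frequency, scaled as a two-sided power spectral density."""
    hop = window // 2
    w = [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (window - 1))
         for n in range(window)]
    norm = fs * sum(v * v for v in w)
    omega = 2.0 * math.pi * freq_hz / fs
    cos_t = [math.cos(omega * n) for n in range(window)]
    sin_t = [math.sin(omega * n) for n in range(window)]
    powers = []
    for start in range(0, len(x) - window + 1, hop):
        seg = x[start:start + window]
        re = sum(s * wn * c for s, wn, c in zip(seg, w, cos_t))
        im = sum(s * wn * sn for s, wn, sn in zip(seg, w, sin_t))
        powers.append((re * re + im * im) / norm)
    return sum(powers) / len(powers)


# Hypothetical channel: unit-variance noise plus a strong 60 Hz interferer
random.seed(1)
fs = 512.0
x = [random.gauss(0.0, 1.0) + 10.0 * math.sin(2.0 * math.pi * 60.0 * n / fs)
     for n in range(4096)]

p60 = welch_power_at(x, 60.0, fs)
p40 = welch_power_at(x, 40.0, fs)
spike_db = 10.0 * math.log10(p60 / p40)
print(round(spike_db, 1))
```

Comparing the estimate at 60 Hz with a nearby interference-free frequency gives a rough measure of how far the power line spike rises above the surrounding spectrum.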

Figure 4.17 60 Hz Notch Filter

The impulse response of this notch filter is below in Figure 4.18.

Figure 4.18 Impulse Response of 60 Hz Notch Filter

The overall accuracies and 5th percentile of confidence from performing KNN classification on Dataset 1 are presented in Figure 4.19.

Figure 4.19 Average Accuracy and 5th Percentile of Confidence of Dataset 1 (Notch)

For Dataset 1, the highest average classification accuracy was 94% for the Fc1-Pz pair, i.e. nominally the same as for the unfiltered data case. The highest 5th percentile of confidence is 85% for channel pair Cp2-Pz, which is about 2 points higher than the 5th percentile of confidence using unfiltered data. The highest C-score is 0.80 with the channel pair Cp2-Pz. Overall the performance was approximately the same as in the unfiltered data case (Figure 4.3), suggesting that the level of 60 Hz interference does not negatively impact classification performance for Dataset 1. Considering that there was not a large 60 Hz spike in the Dataset 1 spectra, this is not a surprising result.

The same 60 Hz notch filter was applied to Dataset 2 and the results, overall accuracy and 5th percentile of confidence, are shown in Figure 4.20.

Figure 4.20 Average Accuracy and 5th Percentile of Confidence of Dataset 2 (Notch)

This processing of Dataset 2 showed a slight (about 3 percentage points) decrease in the overall accuracy compared to the unfiltered case (Figure 4.6). The highest overall accuracy of 62% was for the C3-O1 pair. Channel pair F8-O1 achieved the highest 5th percentile of confidence, 24%. The F8-O1 channel pair also received the highest C-score. None of these metrics indicates that using the notch filter in combination with KNN makes this a suitable method for diagnosing ADHD. This is somewhat unexpected since the magnitude of the spike is so large; however, it does show that the 60 Hz interference does not appear to have a major impact on the classification results.

4.5.5 Line Frequency (60 Hz) Notch Filter and Delta Band Filter

The spectra of the unfiltered data also showed a very large low frequency (<20 Hz) component present in all subjects. The EEG band with the lowest frequency is the delta band, which goes up to a frequency of about 3 Hz. The delta rhythm is not typically associated with attention activities, so it is plausible that frequency components from this region do not help differentiate between ADHD and non-ADHD. In this next procedure, the same 60 Hz notch filter as used previously is cascaded with a highpass filter with a cutoff at 3 Hz (Figure 4.21).

Figure 4.21 60 Hz Notch and Delta Band Filter

The impulse response of this filter is below in Figure 4.22.

Figure 4.22 Impulse Response of 60 Hz Notch and Delta Band Filter

After filtering, KNN is performed as usual, and the results of Dataset 1 and Dataset 2 are presented in Figure 4.23 and Figure 4.24.

Figure 4.23 Average Accuracy and 5th Percentile of Confidence of Dataset 1 (No Delta/60 Hz)

For Dataset 1, this filtering resulted in a highest overall accuracy of 94% for the channel pair Fc1-Pz, a highest 5th percentile of confidence of 84% with channel pair Cp2-Pz, and a highest C-score of 0.79 for channel pair Cp2-Pz (Figure 4.23). Generally, the results were similar to those for the unfiltered data.

Figure 4.24 Average Accuracy and 5th Percentile of Confidence of Dataset 2 (No Delta/60 Hz)

Dataset 2 yielded a highest overall accuracy of 64% for channel pair C3-O1, a highest 5th percentile of confidence of 24% for channel pair F8-O1, and a highest C-score of 0.14 for channel pair F8-O1 (Figure 4.24). Again, overall performance was poor, with overall accuracy in the 40% range for many channel combinations. There is no significant change in classification accuracy relative to the unfiltered data case, suggesting that information important to ADHD discrimination does not reside in the delta band.

4.5.6 Line Frequency (60 Hz) Notch Filter and Delta and Theta Band Filter

The next trial consisted of using a filter similar to the previous one, only this time (in addition to the 60 Hz notch filter) the highpass filter has a cutoff frequency of about 8 Hz, as seen in Figure 4.25, which roughly corresponds to eliminating both the delta and theta bands.

Figure 4.25 Delta/Theta/60 Hz Notch Filter

The impulse response of this 60 Hz notch and delta and theta band filter is in Figure 4.26.

Figure 4.26 Impulse Response of 60 Hz Notch and Delta and Theta Band Filter

The accuracy results of using this filtered data are shown in the following figures, Figure 4.27 and Figure 4.28.

Figure 4.27 Average Accuracy and 5th Percentile of Confidence of Dataset 1 (No Delta, Theta/60 Hz)

For Dataset 1, the channels Fc1-Pz produced the highest average classification accuracy of 94%, which is marginally lower than when using unfiltered data (Figure 4.27). Channel pair T7-Pz has the highest 5th percentile of confidence, which is 83%. This channel pair also results in the highest C-score for this filtering.

Figure 4.28 Average Accuracy and 5th Percentile of Confidence of Dataset 2 (No Delta, Theta/60 Hz)

Dataset 2 resulted in a highest average classification accuracy of 61% for the channel pair C3-O1, but overall performance remained poor, with an accuracy of around 40% for many channel pairs (Figure 4.28). The highest 5th percentile of confidence is 25%, and this comes from using channel pair F8-O1. This channel pair also received the highest C-score. Once again, the A and N subjects of Dataset 2 are not separated with accuracy or confidence.

4.5.7 Zero Mean Over All Time

Another possible source for the large low-end frequency component is a DC offset in the signals. To remove this offset, the mean of the entire ANT1 signal was subtracted from the signal, then features were extracted and KNN was performed as usual. The resulting classification performance is reflected in Figure 4.29 and Figure 4.30.

Figure 4.29 Average Accuracy and 5th Percentile of Confidence of Dataset 1 (Zero Mean Overall)

Removing the overall mean before processing resulted in a marginal improvement for Dataset 1. The peak overall accuracy was 95% for channels T7-Pz. The other highest performing pairs for zero mean data also performed slightly better than the other high performing pairs in the unfiltered data. With this processing, the highest 5th percentile of confidence was 83% for channel pair Cp2-Pz. The highest C-score was 0.79, also for channel pair Cp2-Pz. Performance seems to be slightly improved for some of the overall accuracies; however, these improvements are marginal.

Figure 4.30 Average Accuracy and 5th Percentile of Confidence of Dataset 2 (Zero Mean Overall)

Dataset 2 produced a slightly lower highest overall accuracy of 63% for channel pair C3-O1; however, across all channel combinations there were only marginal differences in the accuracies between the original data and the data after any DC offset was removed. The highest 5th percentile of confidence was 19% with channel pair Cp2-Pz, which is a drop from what was reported after processing with other schemes. The highest C-score is 0.11 from channel pair Fp2-P7. This is the lowest best C-score out of all the methods examined so far, suggesting that it would be harmful to implement this overall zero mean processing in a realized system.

4.5.8 Zero Mean Over Four Second Intervals

This next method was also intended to remove any existing DC offset. This time, however, the data was broken up into 4 second, 50% overlapping intervals. Each interval had its own mean subtracted from it, and then the features were extracted from that zero-mean 4 second interval. The intent behind using zero mean interval data was to remove more of the DC offset, as well as low frequency content that is not important to the classification. The results of this particular processing scheme are presented in Figure 4.31 and Figure 4.32.
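The interval-wise mean removal described above can be sketched as follows in Python. The 512 Hz sampling rate is the value quoted earlier in this chapter; the function name and the synthetic signal with a constant offset of 100 are hypothetical.

```python
def zero_mean_intervals(signal, fs=512, interval_s=4.0, overlap=0.5):
    """Split a signal into overlapping intervals and subtract each
    interval's own mean before feature extraction."""
    win = int(interval_s * fs)
    hop = int(win * (1.0 - overlap))
    intervals = []
    for start in range(0, len(signal) - win + 1, hop):
        seg = signal[start:start + win]
        mean = sum(seg) / win
        intervals.append([v - mean for v in seg])
    return intervals


# Twelve seconds of a toy signal riding on a DC offset of 100
signal = [100.0 + (n % 7) for n in range(3 * 2048)]
intervals = zero_mean_intervals(signal)
residual_means = [sum(seg) / len(seg) for seg in intervals]
print(len(intervals), max(abs(m) for m in residual_means) < 1e-9)
```

Unlike subtracting one mean from the whole ANT1 recording, this also suppresses slow drifts whose period is long compared with 4 seconds, because each interval is re-centered independently.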

Figure 4.31 Average Accuracy and 5th Percentile of Confidence of Dataset 1 (Zero Mean Intervals)

The channel pair producing the highest overall accuracy for Dataset 1 was Fc1-Pz, with an average classification accuracy of 95% (Figure 4.31). This is approximately the same accuracy as when removing the mean from the entire ANT1 interval. The highest 5th percentile of confidence was 83% when using channel pair Cp2-Pz. The highest C-score was 0.79 and also came from using Cp2-Pz. Removing the average from the 4 s intervals does not significantly impact the overall classification performance when using Dataset 1.

Figure 4.32 Average Accuracy and 5th Percentile of Confidence of Dataset 2 (Zero Mean Intervals)

Dataset 2 produced a highest average classification accuracy of 63% for channel pair C3-O1 (Figure 4.32). The highest 5th percentile of confidence was 18% for channel pair Fp2-P7. The highest C-score is 0.11 with channel pair Fp2-T8. All of these metrics indicate poor performance.

To summarize, the best performing channel pairs, along with their overall accuracy and percentiles of confidence, are presented in Table 1.

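Table 1 Highest average classification accuracy across various processing conditions

| Processing | VT Data: Channel Pair | Overall Accuracy | 5th Percentile Confidence | 33rd Percentile Confidence | UNC Data: Channel Pair | Overall Accuracy | 5th Percentile Confidence | 33rd Percentile Confidence |
|---|---|---|---|---|---|---|---|---|
| No Filtering | Fc1-Pz | 94% | 80% | 90% | C3-O1 | 65% | 17% | 51% |
| No Filtering | Cp2-Pz | 94% | 83% | 89% | C3-T7 | 62% | 12% | 42% |
| No Filtering | T7-Pz | 93% | 84% | 88% | C3-T8 | 62% | 12% | 46% |
| No Filtering | Fc2-Pz | 93% | 81% | 89% | Fp2-O1 | 61% | 15% | 50% |
| No Filtering | Cp1-Pz | 92% | 64% | 90% | C3-P3 | 60% | 5% | 48% |
| LPF | Cp1-Pz | 75% | 34% | 60% | P4-O1 | 57% | 4% | 46% |
| LPF | Fc1-Cp1 | 75% | 23% | 63% | Fp2-F7 | 56% | 13% | 45% |
| LPF | C4-Cp1 | 72% | 41% | 58% | Fp1-P4 | 54% | 4% | 37% |
| LPF | Fc1-Pz | 72% | 22% | 57% | Fp2-F8 | 54% | 14% | 42% |
| LPF | Fc1-T7 | 72% | 32% | 60% | Fp2-P4 | 53% | 7% | 33% |
| 60 Hz Notch | Fc1-Pz | 94% | 82% | 90% | C3-O1 | 62% | 9% | 40% |
| 60 Hz Notch | T7-Pz | 94% | 84% | 89% | Fp2-O1 | 62% | 10% | 52% |
| 60 Hz Notch | Cp2-Pz | 94% | 85% | 88% | F8-O1 | 59% | 24% | 48% |
| 60 Hz Notch | Fc5-Pz | 93% | 76% | 89% | C3-P3 | 59% | 4% | 37% |
| 60 Hz Notch | Cp1-Pz | 93% | 62% | 91% | C3-T7 | 58% | 8% | 41% |
| HPF | Fc2-Cp2 | 92% | 71% | 89% | F8-O1 | 57% | 26% | 48% |
| HPF | Fc1-Cp2 | 92% | 71% | 89% | C3-O1 | 57% | 14% | 44% |
| HPF | Fz-Cp2 | 92% | 77% | 87% | F8-O2 | 54% | 23% | 42% |
| HPF | Cp2-Pz | 91% | 74% | 85% | F8-C4 | 54% | 21% | 44% |
| HPF | Fz-Fc1 | 91% | 72% | 85% | Fp2-F8 | 54% | 17% | 38% |
| 60 Hz Notch & Delta Filtered | Fc1-Pz | 94% | 82% | 90% | C3-O1 | 62% | 8% | 40% |
| 60 Hz Notch & Delta Filtered | Cp2-Pz | 94% | 84% | 88% | Fp2-O1 | 60% | 9% | 51% |
| 60 Hz Notch & Delta Filtered | T7-Pz | 94% | 83% | 90% | F8-O1 | 59% | 24% | 47% |
| 60 Hz Notch & Delta Filtered | Fc5-Pz | 93% | 76% | 89% | C3-P3 | 59% | 4% | 36% |
| 60 Hz Notch & Delta Filtered | Fc2-Pz | 93% | 78% | 89% | C3-T7 | 59% | 8% | 41% |
| 60 Hz Notch & Delta & Theta Filtered | Fc1-Pz | 94% | 80% | 90% | C3-O1 | 61% | 9% | 40% |
| 60 Hz Notch & Delta & Theta Filtered | Cp2-Pz | 93% | 83% | 88% | Fp2-O1 | 60% | 11% | 52% |
| 60 Hz Notch & Delta & Theta Filtered | T7-Pz | 93% | 83% | 89% | C3-P3 | 58% | 4% | 34% |
| 60 Hz Notch & Delta & Theta Filtered | Fc2-Pz | 92% | 74% | 88% | F8-O1 | 58% | 25% | 48% |
| 60 Hz Notch & Delta & Theta Filtered | Cp1-Pz | 92% | 61% | 89% | C3-P7 | 57% | 7% | 39% |
| Zero Mean Overall | T7-Pz | 95% | 82% | 90% | C3-O1 | 63% | 12% | 50% |
| Zero Mean Overall | Fc1-Pz | 95% | 79% | 91% | Fp2-O1 | 61% | 16% | 52% |
| Zero Mean Overall | Cp2-Pz | 95% | 83% | 91% | C3-P3 | 60% | 1% | 47% |
| Zero Mean Overall | Fc2-Pz | 93% | 80% | 90% | C3-T8 | 60% | 8% | 41% |
| Zero Mean Overall | Cp1-Pz | 93% | 70% | 90% | C3-T7 | 60% | 8% | 41% |
| Zero Mean Intervals | Fc1-Pz | 95% | 79% | 91% | C3-O1 | 63% | 11% | 50% |
| Zero Mean Intervals | T7-Pz | 95% | 82% | 89% | Fp2-O1 | 61% | 16% | 52% |
| Zero Mean Intervals | Cp2-Pz | 95% | 83% | 91% | C3-P3 | 60% | 1% | 48% |
| Zero Mean Intervals | Fc2-Pz | 93% | 80% | 90% | C3-T7 | 60% | 8% | 41% |
| Zero Mean Intervals | Cp1-Pz | 93% | 70% | 90% | C3-T8 | 60% | 8% | 40% |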

4.5.9 Physical Location of Channels

Each of the EEG channels corresponds to a physical location on the scalp of the subject. To look for spatial relationships among the better performing channels, note that these appear as the yellower channels on the heatmaps (e.g., channel Pz for Dataset 1 in Figure 4.3). For Dataset 1, channels Fc1, Fc2, Cp1, Cp2, Fz, Pz and T7 are part of the highest performing pairs. For Dataset 2, channels Fp1, Fp2, F8, O1, and P7 stand out when compared to other channels, even though objectively their accuracy is not very high.

Figure 4.33 shows the locations of all of the different channels used to collect the EEG data. The channels covered by Dataset 1 are colored in blue on the left-hand side of the circles, and the channels covered by Dataset 2 are colored in red on the right-hand side. The best performing channels have a darker half ring on the corresponding side (Dataset 1 on the left, Dataset 2 on the right).

Figure 4.33 Physical Location of High Accuracy Channels of Dataset 1 (left/blue) and Dataset 2 (right/red)

The channels that provide the most discrimination for Dataset 1 are all located around the center of the electrode array. The channels that produce the highest average performance for Dataset 2 do not seem to be clustered at all. This is somewhat surprising given that the mechanisms in the brain that are responsible for the functions required in focusing and attention have a physical location, and this central area seems to be advantageous in Dataset 1 when differentiating between N and A subjects. However, it is also worth noting that the best performing channels of Dataset 1 are simply not part of the physical system that was used to collect the EEG signals for Dataset 2. It is possible that if those channels were available, results would be more favorable; however, this is speculation and cannot be verified.

4.5.10 Interpolating a Channel with Dataset 2

It was noticed from observing the physical layout of the EEG channels on the electrode cap that there were several EEG channels in Dataset 2 that straddle the location of the informative channels of Dataset 1. This was most evident for channels Fz and Pz. For channel Fz, channels F3 and F4 are in close proximity and on opposite sides of Fz. For channel Pz, channels P3 and P4 are also in close proximity and on opposite sides of Pz. It was considered that perhaps the information supplied by Fz and Pz, which proved useful for Dataset 1, could be inferred in Dataset 2 from the other nearby electrodes. To test this, channels F3 and F4 were averaged to simulate channel Fz for Dataset 2, and channels P3 and P4 were averaged to simulate Pz. There was a total of 162 training-testing combinations. So many combinations make plotting all individual distributions prohibitive. However, the distributions of overall accuracy for each training-testing combination and of confidence (overall confidence, not 5th percentile) are shown below in Figure 4.34.

Figure 4.34 Histograms of Overall Accuracy of NA Tests (top left) and A Tests (bottom left) and Confidence of NA Tests (top right) and A Tests (bottom right)

The overall accuracy of feature vector classification (including both N and A feature vectors) is 44%, with a 5th percentile of confidence of 1%. The sensitivity is 68% and the specificity is 14%, meaning that N subjects are classified as A more frequently than they are classified as N. This can be plainly seen in the top left-hand plot, which shows that very few vectors were classified correctly across all training-testing combinations. Additionally, the confidence of N tests (recall that when testing the N subjects, confidence in an A diagnosis should be low, close to 0) shows that not only are the N subjects diagnosed incorrectly, but the model is very confident in this incorrect conclusion, as this plot is very skewed. This is unacceptable performance.

Most A vectors are classified correctly, but many are not. The confidence of the A vectors tends to be skewed in the correct direction; however, it is not strongly skewed. This is moderate performance.

On the one hand, this is discouraging, as Dataset 2 still does not produce an effective means of classification. However, this does suggest that the signals on individual electrodes are not merely an average of the nearby electrode signals, but do individually represent something informative about the mental processes occurring near that location.

Dataset 2 Tested Against Dataset 1

From the previous results, it appears that there is more significant separation between the N and A classes of Dataset 1 than between the N and A classes of Dataset 2. It was considered that perhaps the Dataset 2 feature vectors, while hard to distinguish on their own, might fall in between the N and A classes of Dataset 1 but still favor one class over the other. To evaluate whether this could be the case, KNN models trained with every possible 2-N and 2-A combination of subjects from Dataset 1 were used to test all subjects in Dataset 2. The overall accuracy and 5th percentile of confidence are presented in the figure below.

Figure 4.35 Dataset 2 Tested Against Dataset 1 (AR Coefficients)

As can be seen in the plot, testing Dataset 2 against a model trained with Dataset 1 does not improve the performance of the model. Nearly all (99 out of 120) of the overall accuracies of the different channel combinations fall between 40% and 60%, which is essentially the same as flipping a coin and being diagnosed by chance. Additionally, the highest 5th percentile of confidence is 19%, from channel pair T8-P8. This same channel pair received the highest C-score, a value of 0.10. All of these metrics indicate that the subjects in Dataset 2 were not diagnosed accurately or with confidence when Dataset 1 combinations were used to train the KNN model.

From these many different iterations, it seems that AR coefficients may not be a useful means of diagnosing all forms of ADHD. Of course, there are different subtypes, so this model may be able to distinguish some subtypes and not others. However, that information is not available at this time. Future data collection could also record the subtype of each subject, and this could be considered for future work.

4.6 KNN Results Using Line Spectral Frequencies

Line spectral frequencies (LSF) represent the same information about the spectrum of a signal as AR coefficients do; however, they are often considered to reduce error due to quantization. It was thought that using feature vectors that are the LSFs of the EEG signal segments might produce comparable or improved accuracy. To investigate this, the AR coefficients of the 50% overlapping 2s intervals of the entire ANT1 period were converted into LSFs, and then KNN with k = 13 was used to classify test feature vectors. Each KNN model was trained with 2-N and 2-A subjects, using only the 6-year-old male subjects. The results of training and testing with Dataset 1 using this method are shown in the figure below.
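The AR-to-LSF conversion mentioned in Section 4.6 can be illustrated with a minimal sketch. It assumes the AR polynomial is written as A(z) = 1 + a1 z^-1 + ... + ap z^-p and is minimum phase; the function name `ar_to_lsf` and the use of NumPy root finding are illustrative choices, not necessarily the routine used in this work.

```python
import numpy as np

def ar_to_lsf(a):
    """Convert AR coefficients [1, a1, ..., ap] to line spectral frequencies.

    Illustrative sketch: forms the symmetric and antisymmetric polynomials
    P(z) = A(z) + z^-(p+1) A(1/z) and Q(z) = A(z) - z^-(p+1) A(1/z), then
    returns the angles of their unit-circle roots in (0, pi), sorted.
    """
    a = np.asarray(a, dtype=float)
    a = a / a[0]                           # normalise so the leading term is 1
    a_ext = np.concatenate([a, [0.0]])     # extend to degree p + 1
    p_poly = a_ext + a_ext[::-1]           # symmetric polynomial P
    q_poly = a_ext - a_ext[::-1]           # antisymmetric polynomial Q
    roots = np.concatenate([np.roots(p_poly), np.roots(q_poly)])
    angles = np.angle(roots)
    # Keep one angle per conjugate pair and drop the trivial roots at z = +1, -1
    tol = 1e-9
    keep = (angles > tol) & (angles < np.pi - tol)
    return np.sort(angles[keep])

# Example: a first-order model has a single LSF
lsf = ar_to_lsf([1.0, -0.9])
```

For a seventh-order model this yields seven frequencies in radians per sample, which would take the place of the seven AR coefficients in the feature vector.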

Figure 4.36 Average Accuracy and 5th Percentile of Confidence of Dataset 1 (LSF Features)

The highest overall accuracy of feature classification was 91%, and this was achieved with channel pair Fc1-Pz. This combination also yielded the highest sensitivity (96%) and specificity (89%). Overall, the accuracy rates are lower than when using AR coefficients as features (Figure 4.3). The highest 5th percentile of confidence was 74%, and this came from channels Fc1-T7. The highest C-score also came from Fc1-T7.

Observe that there are six channels that seem to consistently yield higher overall accuracy than their peers and that these stand out visually as the generally brighter orange stripes: Fz, Pz, Fc1, Fc2, Cp1, and Cp2. This is consistent with the results found using the unfiltered AR coefficients: channels clustered around the center of the EEG cap provide greater differentiation than channels closer to the edge of the cap.

Next, the same method was applied to the 6-year-old male subjects in Dataset 2. The results of this classification method for these subjects are shown in the figure below.

Figure 4.37 Average Accuracy and 5th Percentile of Confidence of Dataset 2 (LSF Features)

Unsurprisingly, the overall accuracies with Dataset 2 are significantly lower than when using Dataset 1. Most channel combinations, 85 out of 120, have an overall accuracy between 40% and 60%, which is comparable to diagnosis based on pure chance. The highest overall accuracy is 78%, which comes from using the channel pair Fp2-C3. This combination yielded a sensitivity of 71.12% and a specificity of 88%. The highest 5th percentile of confidence is 17%, and results from Fp1-Fp2. The highest C-score is 0.11, and is associated with channel pair Fp2-C3. Using LSF for Dataset 2 appears to have improved confidence for some subjects; however, there are still enough outliers not affected by this change of feature type that the overall performance would still be unacceptable for clinical use.

One interesting detail of these overall accuracies is that the 15 highest values all involved channel C3. This also stands out in Figure 4.31 as a line that is more yellow than any of the squares that surround it. This channel is physically close to the center of the EEG cap (see the EEG cap layout figure). Perhaps the physical proximity to the central region is why this channel outperforms the others, although it is still puzzling why other relatively central channels, like C4, do not also yield relatively improved performance.

Given that some measures of performance for Dataset 2 improved with LSF feature vectors, it was of interest to see whether Dataset 2 subjects would be classified correctly when using subjects from Dataset 1 to train the KNN model. The overall accuracy and 5th percentile confidence values are presented below in Figure 4.38.

Figure 4.38 Average Accuracy and 5th Percentile of Confidence of Dataset 2 vs Dataset 1

These results are underwhelming, having neither high average accuracies nor high 5th percentiles of confidence. The highest overall accuracy is 62% for channel pair P8-P3, and the highest 5th percentile of confidence is 15% for channel pair F8-T7. This pair also received the highest C-score. This is lower than when training and testing with Dataset 2 alone, indicating that performance worsened when these datasets were used together. Based on these results, LSF as features for KNN would not be a viable method for diagnosis of ADHD.

4.7 KNN Results Using Reflection Coefficients

Reflection coefficients (RC) are a third way of representing the spectral information of a signal, equivalent to the information in AR coefficients and LSF. Given the change in performance between using AR coefficients and LSF, especially for Dataset 2, it was thought that using RC in the KNN model might also result in different performance. To test this, AR coefficients from 50% overlapping 2s intervals covering the entire ANT1 period were converted into reflection coefficients, and then these features were used in the KNN model. The model was trained with 2-N subjects and 2-A subjects, and all remaining subjects were tested. All possible 2-N and 2-A combinations were used to train the model. The results of using this method to test all channel combinations for Dataset 1 are shown in the figure below.

Figure 4.39 Average Accuracy and 5th Percentile of Confidence of Dataset 1 (RC Features)

The average accuracies overall were not as high as when using AR coefficients, but they were slightly better than when using LSF features. The channel pair that produces the highest average accuracy, 92%, is Fc1-Pz. This pair yields a sensitivity of 89% and a specificity of 97%. The channel pair producing the highest 5th percentile of confidence, a value of 75%, was T7-Cp1. The highest C-score was 0.67, and resulted from channel pair T7-Cp1. This performance is not as good as using AR coefficients of unfiltered Dataset 1 data; however, it still diagnoses at least 95% of all tested subjects correctly and with high confidence.

The overall accuracies and 5th percentile confidences of classifying subjects with RC from Dataset 2 are shown in the figure below.
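The AR-to-RC conversion described in Section 4.7 can be sketched with the step-down (backward Levinson) recursion. This is a minimal illustration under the assumed sign convention A(z) = 1 + a1 z^-1 + ... + ap z^-p, in which the m-th reflection coefficient equals the last coefficient of the order-m model; the function name `ar_to_rc` is hypothetical and other toolboxes may use the opposite sign.

```python
import numpy as np

def ar_to_rc(a):
    """Convert AR coefficients [1, a1, ..., ap] to reflection coefficients.

    Illustrative sketch of the step-down recursion:
        k_m        = a_m[m]
        a_{m-1}[i] = (a_m[i] - k_m * a_m[m - i]) / (1 - k_m**2),  i = 1..m-1
    Assumes a stable model, so that |k_m| < 1 at every stage.
    """
    a = np.asarray(a, dtype=float)
    cur = (a / a[0])[1:].copy()        # a_p[1..p], leading 1 dropped
    p = len(cur)
    k = np.zeros(p)
    for m in range(p, 0, -1):
        km = cur[m - 1]                # last coefficient of the order-m model
        k[m - 1] = km
        if m > 1:
            # cur[m-2::-1] lists a_m[m-1], ..., a_m[1], i.e. a_m[m-i] for i = 1..m-1
            cur = (cur[:m - 1] - km * cur[m - 2::-1]) / (1.0 - km ** 2)
    return k

# Example: a second-order model built from k1 = 0.5, k2 = -0.3
rc = ar_to_rc([1.0, 0.35, -0.3])
```

Because the mapping is one-to-one for stable models, the RC vector carries the same spectral information as the AR vector; only the geometry seen by the KNN distance measure changes.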

Figure 4.40 Average Accuracy and 5th Percentile of Confidence of Dataset 2 (RC Features)

The highest overall accuracy is 77%, and this resulted from using channel pair Fp2-C3. Channel pair Fp1-Fp2 obtained the highest 5th percentile of confidence and C-score, which are 18% and 0.10, respectively. This is better performance than using LSF as feature vectors; however, it is not better than using AR coefficients, based on the C-scores. Interestingly, C3 provides noticeably higher overall accuracies, as it did when using LSF. It appears that using RC changes the distribution of individual confidences; however, the subjects that were not classified correctly or confidently before are still not being classified correctly or confidently.

In the next figure, Figure 4.41, the overall accuracies and 5th percentile confidence values are shown for when Dataset 2 subjects are tested against Dataset 1.

Figure 4.41 Average Accuracy and 5th Percentile of Confidence of Dataset 2 against Dataset 1 (RC Features)

The highest overall accuracy resulted from channel pair P3-O2, and had a value of 65%. The highest 5th percentile of confidence was 17%, from channel pair Fp1-T7. This channel pair also produced the highest C-score. Overall, training the KNN model with Dataset 1 and testing with Dataset 2 when using RC feature vectors did not prove to be an effective way to distinguish between ADHD and non-ADHD.

5. KNN Using Target Time Linked Intervals

As discussed in Section 2.4, the ANT test is designed to engage the alerting, orienting, and executive functions by presenting cues, changing the location and direction of a target, and asking the subject to respond to that target. Each of these functions is triggered by a certain part of the task, which always follows the same temporal pattern. Within the ANT1 test, a subject is presented with a target 48 times. By using timing information that records precisely when each target is presented, each of the 48 presentations can be aligned relative to the appearance of the target. This allows features of different-length intervals to be taken from the same point of each target presentation, e.g. a 300ms interval starting when the cue is presented or a 400ms interval starting 100ms after the target is presented.

Given that different components of the ANT test target different networks (alerting, orienting, and executive), the hypothesis is that observations taken specifically during the time intervals in which those networks are in use might produce significant differences between the EEG signals of ADHD and non-ADHD subjects. Specifically, since ADHD is related to deficiencies in the executive network, parts of the ANT1 test that rely on the executive network might provide better discrimination.

This approach was evaluated on the 6-year-old males of Dataset 1 and Dataset 2, and these subgroups were tested separately. While the interval length and starting location of the interval changed, the features used were all seventh-order AR coefficients found using the Burg method. The interval lengths used were 200ms, 300ms, 400ms, 500ms, 600ms, 700ms, 800ms, 900ms, and 1000ms. The starting points ranged from 600ms before the target presentation to 1000ms after the target presentation, spaced 50ms apart.
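The interval extraction and feature computation described above can be sketched as follows. This is a minimal illustration, not the code used in this work: the function names, the single-channel input, and the sampling rate `fs` are assumptions introduced here, and the Burg recursion is a textbook form that returns the full polynomial [1, a1, ..., a7].

```python
import numpy as np

def burg_ar(x, order):
    """Estimate AR coefficients [1, a1, ..., a_order] with Burg's method."""
    x = np.asarray(x, dtype=float)
    f = x[1:].copy()       # forward prediction errors, f[n]
    b = x[:-1].copy()      # backward prediction errors, b[n-1]
    a = np.array([1.0])
    for _ in range(order):
        # Reflection coefficient minimising forward plus backward error power
        k = -2.0 * np.dot(f, b) / (np.dot(f, f) + np.dot(b, b))
        a_ext = np.concatenate([a, [0.0]])
        a = a_ext + k * a_ext[::-1]          # Levinson-style order update
        f_new = f + k * b
        b_new = b + k * f
        f, b = f_new[1:], b_new[:-1]         # re-align errors for the next order
    return a

def target_locked_segments(eeg, target_samples, fs, start_ms, length_ms):
    """Cut one segment per target, offset by start_ms relative to target onset."""
    start = int(round(start_ms * fs / 1000.0))
    n = int(round(length_ms * fs / 1000.0))
    segs = [eeg[t + start:t + start + n] for t in target_samples
            if t + start >= 0 and t + start + n <= len(eeg)]
    return np.array(segs)

# Grid of interval lengths and starting offsets stated in the text
interval_lengths_ms = list(range(200, 1001, 100))    # 200ms ... 1000ms
start_offsets_ms = list(range(-600, 1001, 50))       # -600ms ... +1000ms

def features_for_interval(eeg, target_samples, fs, start_ms, length_ms, order=7):
    """One AR feature vector per target presentation (hypothetical helper)."""
    segs = target_locked_segments(eeg, target_samples, fs, start_ms, length_ms)
    return np.array([burg_ar(s, order)[1:] for s in segs])
```

With this grid there are 9 interval lengths and 33 starting offsets, so each channel pair is evaluated on 297 interval definitions, each contributing up to 48 feature vectors per subject.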
5.1 Target Time Linked Intervals using Dataset 1

This method of classification was first used on the 6-year-old boys of Dataset 1, and the results are presented in Figure 5.1. Once again, 2-N and 2-A subjects were used to create the KNN model, and all training-testing combinations of subjects were investigated. The channel pair used was Fc1-Pz, selected because it produced the highest accuracy for Dataset 1 in an earlier section.
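The exhaustive training-testing scheme (every choice of 2 N and 2 A training subjects, with all remaining subjects tested) can be enumerated as in the sketch below. The subject identifiers and the generator name `train_test_splits` are hypothetical and are used only for illustration.

```python
from itertools import combinations

def train_test_splits(n_ids, a_ids, n_train=2, a_train=2):
    """Yield (N training ids, A training ids, test ids) for every combination."""
    all_ids = list(n_ids) + list(a_ids)
    for n_tr in combinations(n_ids, n_train):
        for a_tr in combinations(a_ids, a_train):
            train = set(n_tr) | set(a_tr)
            test = [s for s in all_ids if s not in train]
            yield n_tr, a_tr, test

# Example with made-up subject labels: 4 N subjects and 3 A subjects
splits = list(train_test_splits(["N1", "N2", "N3", "N4"], ["A1", "A2", "A3"]))
```

The number of models grows as the product of the two binomial coefficients, so this example with 4 N and 3 A subjects produces 6 x 3 = 18 training-testing combinations.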

Figure 5.1 Overall Accuracies for Varying Interval Lengths and Starting Times Relative to Target Presentation (Dataset 1)

The highest overall accuracy value reported was 95%, which was found in two instances: one with an 800ms interval starting 600ms before the target, and the other with a 1000ms interval starting 400ms after the target. This is slightly higher than the highest overall accuracy across all channel pairs over the entire ANT1 period of Dataset 1, which was 94%, also from channel pair Fc1-Pz.
