SPECTRAL-TEMPORAL MODALITY BASED EEG SLEEP STAGING


SPECTRAL-TEMPORAL MODALITY BASED EEG SLEEP STAGING

A Thesis Presented to The Academic Faculty by Mark McCurry

In Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the School of Electrical and Computer Engineering

Georgia Institute of Technology, May 2017

Copyright 2017 by Mark McCurry

TABLE OF CONTENTS

I INTRODUCTION
  What Are Sleep Stages?
  The Signal Domain
  AASM Classification Standard
  Classical Approach and Modern Approach
  Low Cost Signal Acquisition Alternatives
  Approaching Structured Sparsity
  Contributions

II LITERATURE REVIEW
  Prior EEG Methods
  Commercial Offerings
  Vision and Seeing The Data
  Texture
  Ordered Visual Structure
  Change-Point Techniques
  Sparse Approximation
  Variational Methods

III METHODS
  Model
  Denoising
  Movement Artifacts
  Normalization
  Median Filtering
  Robust PCA
  Region Based Features
  Organizing Geometrically Structured Labels
  Reducing To Segments
  Denoising
  Supervised Learning On Segments

IV INITIAL INVESTIGATION
  Prior Work
  Initial Application Of Methods
  Dense Un-quantized Experiments
  Low Rank Quantized Experiments
  Initial Discussion

V RESULTS & DISCUSSION
  5.1 Datasets
  Time Domain Features
  Cross Validation
  Parameter Sensitivity Analysis
  Random Forest Configuration
  Amount Of Denoising
  Tolerance To Noise
  Coarseness & Type Of Time-Frequency Segmentation
  Internal Region Estimating Thresholds
  Amount Of Training Data
  Statistical Significance

VI CONCLUSIONS & FUTURE WORK
  Aims of Research
  Assumptions
  Future Work
  Multi-Sensor Sleep Staging
  Extensions Beyond Sleep
  Generalization Of Thresholds
  Frequency Band Refinement
  6.4 Conclusions

APPENDIX A SOFTWARE USED IN DATA ANALYSIS

REFERENCES

LIST OF TABLES

1 AASM Sleep Stage Classifying Features
2 Confusion Matrix Over All Trials With Un-quantized & Denoised Spectra
3 Confusion Matrix Over First 10 Subjects With Quantized & Denoised Spectra
4 Median Accuracy Across Datasets And Methods
5 Cross Data Set Training and Testing
6 Random forest parameter sensitivity (deviation from mean % accuracy)
7 Time-Frequency Segmentation Variants
8 P-value of DDS features having a higher mean accuracy

LIST OF FIGURES

1 Rescaled time domain traces of each AASM sleep stage
2 Variance and mean of sleep state's log spectrum over a single subject recording
3 Lower Rank Approximations
4 Figure of the median spectral energy for subjects 1 through 10 in the DREAMS subject database
5 Region of EEG spectra generated by pulsing activity
6 Growth of candidate regions
7 Cover of thresholded regions
8 Decision Tree
9 Peak power features versus epoch for ST7241. The black dots at bottom indicate the expert identified sleep state: waking (lowest y-axis value), REM sleep, sleep stages 1-4 (shown with respectively increasing y-axis value)
10 Filtered version of figure
11 A plot of subject 2 labels and predictions
12 Accuracy of Random Forest before (green) and after (blue) denoising method
13 Results of using Modal Regions Directly For Classification
14 Predictions of Bandpower estimate
15 Example of FDA at different scales
16 Example of DFA at different scales
17 Pseudo-Shannon Entropy
18 Trend across denoising amounts
19 Effect of Denoising Amount On ST7011's classification Accuracy
20 Response to varying levels of pink noise
21 Fine grained analysis of the effect of pink noise
22 Candidate regions on noisy input
23 Raw spectrum of subject ST
24 Examples of denoised subject ST7192's spectra under different approaches
25 Accuracy with varying time segmentation coarseness
26 Inhibition in blue, Excitation in red, neutral in gray
27 Accuracy with varying inhibition/excitation levels
28 Accuracy with varying amounts of training data

CHAPTER I INTRODUCTION

Sleep is a key component of an individual's health [5], and its qualitative evaluation typically requires a sleep study. These sleep studies can be expensive, with costs ranging from $600 to $5,000 for a single night, and they remove individuals from their normal sleeping environments. This environmental change is not strictly necessary, as studies conducted within a patient's home can provide data of similar quality [34]. Some researchers propose low-cost take-home equipment as an alternative to traditional studies and their associated equipment [28]. These lower cost recording devices, however, tend to produce lower fidelity signals. Signal noise becomes problematic because lower cost options may have fewer sensors, while standardized means of sleep staging can require additional electrodes for electrooculogram (EOG) and electromyograph (EMG) recording. The lack of these additional signals, together with the increased noise and artifacts, makes low-cost home-use sleep EEG processing a challenging pattern recognition task. This dissertation provides a method of processing sleep EEG that can be used practically and robustly to address signal fidelity issues through a novel Dense Denoised Spectral (DDS) feature analysis.

1.1 What Are Sleep Stages?

As the name implies, sleep stages are sequential sections of activity during sleep. These stages can roughly be broken down into wakefulness, Rapid Eye Movement (REM) sleep, and non-REM sleep. REM sleep is the familiar portion of the night when individuals can dream. Non-REM sleep is broken into stages under the AASM standard: N1, N2, and N3. Stages N1 and N3 are more commonly known as light sleep and deep sleep. The absence of one of these stages or the overabundance of another can be used in the diagnosis of numerous conditions, including sleep apnea, hypersomnia, insomnia, and sleep talking [5]. Additionally, the sequence of sleep stages can indicate the quality of sleep, as normal cycles exist in addition to the typical frequency of each sleep stage.

1.2 The Signal Domain

EEG signals typically provide the information needed to separate these stages, as illustrated in Fig. 1. In both of the official classification standards [53, 61], the characteristics of these waveforms are described largely in terms of time-domain representations. This dissertation views sleep stages through a dense spectral representation. Differences in the spectral shape between stages can be observed in Fig. 2. Viewing and understanding sleep stages as dense spectra differs from classical approaches, which tend to use either a series of time signatures or a low rank representation of the power present in a handful of bands. Prior art is discussed in section 2.1.

Figure 1: Rescaled time domain traces of each AASM sleep stage

Depending upon which time-frequency windows are observed, these frequency domain representations can initially appear to be rather similar. Spectral features, however, can provide a powerful means of separating these stages. Instead of relying on the local statistics of each temporal frame, the more nuanced differences present in the spectral-temporal evolution of the signal hold the key to understanding and classifying these EEG recordings.

Figure 2: Variance and mean of sleep state's log spectrum over a single subject recording

1.2.1 AASM Classification Standard

Sleep stages are defined by two main bodies of work, the Rechtschaffen and Kales (R&K) standard [61] and the American Academy of Sleep Medicine (AASM) standard [53]. The older R&K classification standard [61] has been superseded by the one formulated by the AASM, though the two share a considerable number of similarities. From the AASM's manual for the visual scoring of adults, the following characteristics of each sleep stage have been identified:

Table 1: AASM Sleep Stage Classifying Features

Sleep stage   AASM features
W             Rapid Eye Movement (REM); 8-13 Hz activity attenuated with open eyes; … Hz eye blinks
N1            Slow eye movement; low amplitude 4-7 Hz activity; sharp time domain waves with duration < 0.5 sec
N2            K-complex (sharp negative then positive wave 0.5 sec); sleep spindle (train of waves at … Hz) duration 0.5 sec
N3            … Hz activity and peak-peak amplitude > 75 µV
R             REM; EMG in the chin matches sleeping levels; sawtooth 2-6 Hz waves; sporadic muscle activity in under quarter second bursts

Most of the listed features are either bandpass spectral features or time domain signatures. Features such as K-complexes and sleep spindles have been studied in detail; however, their inter-rater agreement is quite low, ranging from 33% to 59% in some of the publicly available datasets [40]. Additionally, general inter-rater agreement has been studied, showing that between 80% and 82% agreement should be expected depending on whether AASM or R&K classification is utilized [21].

1.3 Classical Approach and Modern Approach

In both the AASM and R&K sleep staging methods, band power has remained an important concept, dating back to the 1960s [11]. Over time this approach has evolved into more complex methods for sleep classification. Both classical and more modern approaches, however, tend not to be robust to artifacts and thus devote considerable resources to identifying these confounding factors and eliminating them from the data recordings [68, 12].

1.4 Low Cost Signal Acquisition Alternatives

For now, most sleep EEG data for both clinical and research purposes is recorded in strictly controlled medical and research environments. Low cost recording devices intended for use in less controlled environments are becoming a new option for sleep and other EEG projects. Lower cost devices can be either sent home with individuals or brought to remote regions where facilities for sleep studies are rare. These low cost EEG acquisition technologies are not an entirely new concept. One of the larger organized efforts in do-it-yourself electrodes and recording equipment, OpenEEG, started back in 2000 [29]. Interest in this area has grown, and there are a number of consumer grade hardware devices for both multi-channel and single channel EEG/EMG recordings. A quick survey of the current market shows the NeuroSky MindWave ($…), the Muse Brain Sensing Headband ($…), a 19-Channel EEG Headband ($…), the Emotiv EPOC ($399, footnote 3), and OpenBCI ($…).

Footnotes: 2, price from amazon.com; 3, price from …; 4, price from ….

This consumer-grade technology, however, poses some major trade-offs. The hardware itself will likely have fewer electrodes with higher noise floors and, in some cases, issues with synchronizing large numbers of channels. Non-professional use of these products will tend to result in suboptimal electrode connections, inaccurate electrode placement, increased movement based artifacts, and generally higher noise floors. The technology does provide interested individuals with easier access to data, but because of the faults of lower cost options in the hands of generally untrained individuals, there is a higher burden on the analysis phase, as more signal pre-conditioning is needed. Solving these problems will be key in making EEGs usable in environments which are not strictly monitored.

1.5 Approaching Structured Sparsity

The proposed work views the sleep state labeling problem primarily as an unsupervised 2D change point detection problem on the EEG spectra. By using the relatively sparse change points within each EEG recording, it is possible to denoise the spectrogram in a way that roughly agrees with observable signal modality. Given knowledge of the temporal and spectral change points, a new dense spectrogram can be built using closer to optimal estimates of the band power at any given time. This approach is not seen in the literature for sleep EEG classification, and it makes the resulting classifications more robust to noise and to variation between subjects than the other tested methods. In this way sparse methods are useful within each EEG recording, while the dense realization is utilized in inter-patient comparisons.

Figure 3: Lower Rank Approximations. (a) Spectral centroid of EEG spectra (REM in red, Deep Sleep in blue). (b) Before and after rectangular quantization.

Identifying the modal regions is a nontrivial process. Earlier work applying spectral centroids based upon [44] showed that such modal regions can be roughly observed in a single dimension. In Fig. 3a, clear shifts in the centroid are visible between REM and adjacent non-REM phases. When observing regions of activity in the raw spectra, rough frequency bands can be seen and temporal shifts can be identified. By identifying temporal-spectral windows with consistent excitation or inhibition, it is possible to extract both time and spectral modes, as can be seen in Fig. 3b.

1.6 Contributions

This dissertation offers several new contributions to the sleep staging task. First, very dense features are used to represent and classify the output of a single EEG channel. Most prior single EEG channel methods generate low dimensional features for each observation. For comparison, Dense Denoised Spectral (DDS), cross frequency coupling, time-domain, bandpower, and Philip Low spectral centroid features have a respective dimensionality of 1000, 36, 9, 6, and 1 for each observation. Second, this dissertation presents a novel adaptive denoising approach on the dense features not previously seen in the computational neuroscience literature. The DDS features use a novel method to learn an aligned block constant structure, and they produce results that outperform other single EEG channel based features. Third, this work presents a comparison between bandpower, time domain, cross frequency coupling (footnote 5), a modified Philip Low method (footnote 5), Dense Raw Spectral (DRS), and DDS features using a cross-validation approach not commonly seen in the literature. The cross validation approach explicitly avoided training and testing on data within individual patient recordings, to avoid learning nuanced variations which do not generalize to other subjects. To further evaluate how well DDS features generalize compared to other techniques, a series of cross-dataset experiments were performed. Cross-dataset trials showed that DDS features tended to generalize well to situations with different data collection environments and labels from different clinicians.

Footnote 5: A partial evaluation was performed on this feature representation due to poor relative performance.

CHAPTER II LITERATURE REVIEW

2.1 Prior EEG Methods

Initial work in classifying and characterizing EEG modalities started with time domain based visual analysis [61, 53], which was performed manually. Manual analysis focused heavily on several fixed frequency bands which were standardized in the early 1960s [11]. These bands were incorporated into the R&K standard, which is based on the delta (0.1-3 Hz), theta (4-7 Hz), alpha (8-15 Hz), beta (16-31 Hz), and gamma (32-100 Hz) bands. Initial numeric analysis involved simple first and second order statistics on these bands. This information was enough to show that basic classification was possible, and the statistical significance of these features was repeatedly touted in early analyses. Numerous papers and talks look at these band power estimates [7, 8, 49, 2, 9]; however, these techniques are no longer considered state-of-the-art. Analysis techniques have since expanded to look at coupling of both the individual bands and the spatial sources within multi-electrode systems.

One such technique which has risen to popularity is cross frequency coupling [14, 18]. When two frequency bands are coupled, one band modulates the other. This is typically expressed in terms of how the phase of one band affects the amplitude of another. When the bands are uncorrelated, the amplitude is expected to be roughly uniformly distributed across phase. When they are related, a measure of divergence from that uniform distribution, such as the Kullback-Leibler divergence, can be used to express how tightly the two bands are coupled within a given time period. This approach has similarly been used to look at spatial correlations of bands across or within the skull.
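
The divergence-based coupling measure described above can be illustrated with a short sketch. It assumes the phase of the slow band and the amplitude envelope of the fast band have already been extracted (for example with a band-pass filter and a Hilbert transform, which is not shown), bins the amplitude by phase, and reports the Kullback-Leibler divergence from the uniform distribution, normalized by log(n_bins). The function name `modulation_index` and the choice of 18 bins are illustrative assumptions, not details taken from this thesis.

```python
import math


def modulation_index(phases, amplitudes, n_bins=18):
    """KL-divergence-based phase-amplitude coupling, normalized to [0, 1].

    phases:     slow-band phase per sample, in radians within [-pi, pi)
    amplitudes: fast-band amplitude envelope per sample (non-negative)
    """
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for ph, amp in zip(phases, amplitudes):
        # Map the phase onto one of n_bins equal-width bins.
        b = int((ph + math.pi) / (2 * math.pi) * n_bins) % n_bins
        sums[b] += amp
        counts[b] += 1
    # Mean amplitude per phase bin, normalized into a distribution p.
    means = [s / c if c else 0.0 for s, c in zip(sums, counts)]
    total = sum(means)
    if total == 0.0:
        return 0.0
    p = [m / total for m in means]
    # KL(p || uniform) = sum_i p_i * log(p_i * n_bins)
    kl = sum(pi * math.log(pi * n_bins) for pi in p if pi > 0.0)
    return kl / math.log(n_bins)


# Synthetic check: a flat envelope shows no coupling, while an envelope
# that follows the slow phase shows clear coupling.
phases = [-math.pi + 2 * math.pi * k / 3600 for k in range(3600)]
flat = modulation_index(phases, [1.0] * len(phases))
coupled = modulation_index(phases, [1.0 + math.cos(ph) for ph in phases])
```

A value near zero means the amplitude is spread evenly over phase; larger values mean the amplitude concentrates at particular phases.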

Extending the idea of band power estimates, other measures have been applied to the spectra, such as central frequency, autoregressive models, relative band power, and harmonics [24]. The spectral center of mass underwent a thorough analysis by Philip Low, in which shifts in this feature characterized various sleep states very well; however, this approach does not generalize well on publicly available datasets [44]. This reflects the notion that while the fixed bands of earlier methods did have significant meaning, they did not effectively extract all information from the full spectra. As the spectral centroid method explored a dimensionality reduction upon a full spectrogram, it served as the initial inspiration for the proposed work. Other techniques have been utilized to avoid the fixed bands, each with varying success. In [25], spectral energy concentration, spectral entropy, and phase space based non-linearity measures were used. These varying approaches are often not assessed under similar conditions or with the same datasets, so it is difficult to provide a direct comparison from the literature alone.

2.1.1 Commercial Offerings

Commercial applications for the task of sleep state classification have been created, though adoption has been somewhat limited. While Fully Automated Sleep Stagers (FASS) do offer a much quicker way to create sleep stagings, they produce errors, and their output may differ critically from that of trained raters. While different raters do not produce identical classifications, they have substantial similarities, as can be seen from the Cohen's kappa values in [45]. Commercial automated options such as Morpheus, Somnolyzer, Twin, and other products do offer a means of getting similar classifications, though they fall short of what is needed in practical situations. In [46] the author summarizes the current FASS quality with "Experience with these systems has been quite disappointing and in practice their use is limited to simple tasks..." The paper further states "we do not endorse [pure FASS]." This problem in the accuracy of existing systems can be seen in Twin's documentation, where hard limits on sleep state durations are imposed due to likely errors from the underlying classifier [72].

2.2 Vision and Seeing The Data

When looking at more general machine learning problems, understanding complicated data often starts with a researcher looking for visual structure in both the raw signal and several feature space realizations. Once some ideas and heuristics are formulated based upon this first understanding of the problem domain, further work can be used to target the underlying macro-structure.

2.2.1 Texture

In the traditional sense of the word, a texture is a unifying tactile or visual structure of a surface. This is the sense of texture which is mostly used in this work; however, many computer vision papers focus on a somewhat different problem, involving textures which are represented well by a quasi-periodic set of image patches, using techniques like DCTs or bag-of-words on learned dictionaries. Identification of these patches has been a relatively large and important problem within the realm of computer vision for years. Textons, for example, are one of the more recent approaches to understanding these patches, training dictionaries over small regions and generating bags of patches to describe larger regions [37, 77]. While this approach tends to be best suited to image problems where textures are exactly realized, the general technique of training a few binary feature extractors and bagging the results is still quite useful.

2.2.2 Ordered Visual Structure

Outside of these relatively local texture based features, many vision based solutions focus on points of interest, which typically have relatively large gradients compared to the rest of the domain. These lines or corners work quite well for addressing a number of real world objects, though for something like EEG spectra the question of "what is a meaningfully large edge or corner?" is much more ambiguous than in the normal domain of images. Moving away from edge based features, there is the simple problem of finding the extent of an object, which under some feature representation can be viewed as a connected graph of components where each component shares a similar trait or does not differ from its neighbors in a significant way. This is a problem addressed by mean-shift [20, 19], Markov fields [81], connected components, convolutional networks, and a variety of other approaches. As many approaches are very scale sensitive, multi-scale representations are used quite commonly. While the exact details of the multi-scale representations vary, they all effectively modulate the density of events, labels, and strengths of gradients [42, 19]. Doing this provides very real flexibility in terms of scale invariant behavior, though fusing information seen in different layers for segmentation is still a relatively new task [43].

2.3 Change-Point Techniques

Another area that looks at structure, and at when different regions begin and end, is the change-point literature. In 1D a simple example would be to identify when the mean of a signal changes in a recording. Different levels of certainty mean that online algorithms can more quickly or more slowly identify when a shift happens. Higher dimensional detection routines do exist, though as [79] states, "The performance of classical methods for change-point detection typically scales poorly with the dimensionality of the data." In these higher dimensional cases, a dimensionality reduction is typically involved to more easily estimate each parameter. Additionally, much of the change-point literature focuses on identifying anomalous events or only attempts to identify a singular event (or a known number of events) within one recording.

2.4 Sparse Approximation

Given that high dimensional change point detection approaches are fragile, utilizing structure to reduce and approximate the input data is an appealing option. There is a large community working in this area, but for this work two techniques are of particular interest: matching pursuit [47] and robust PCA [80, 59]. Robust PCA is an extension of dimensionality reduction through PCA which attempts to decompose some input A into L + C, where L is a low rank matrix and C is sparse. The sparsity is typically element-wise, though some work on column-wise sparse approaches has been done to handle outliers [59]. Matching pursuit, on the other hand, simply takes a known dictionary and attempts to approximate an input with a small number of elements from the dictionary.

2.5 Variational Methods

The Total Variation (TV) is a seminorm which combines the notions of sparsity, global structure, and change points without imposing any specific dictionary model. The most common use of TV is to add a cost on the absolute variation (the l1 norm of the derivative) of the output. With this penalty the result tends toward a piecewise constant version of the input. In an ideal sense, a number of the change point approaches seek exactly these piecewise transitions. This approach can be generalized to the case where piecewise segments have some transition time by using group sparsity operators [69, 17]. This variant of total variation in particular has already seen some use within the EEG sleep community for identifying a sleep stage marker called the sleep spindle [54]. Other extensions exist for this technique, including vector total variation and generalized total variation, among others.
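
For reference, the two optimization problems discussed in Sections 2.4 and 2.5 are commonly written as below. This is the standard textbook form (convex robust PCA via principal component pursuit, and 1D total variation denoising), given here as an assumption about the usual formulation rather than the exact variants used later in this work. The symbols $\lambda$ (a trade-off weight), $y$ (the noisy input), and $\hat{x}$ (the denoised output) are introduced for illustration, while $A = L + C$ follows the notation of Section 2.4.

```latex
% Robust PCA (principal component pursuit): nuclear norm promotes low rank,
% the element-wise l1 norm promotes sparsity.
\min_{L,\,C} \; \|L\|_{*} + \lambda \|C\|_{1}
\quad \text{subject to} \quad A = L + C

% 1D total variation denoising: the l1 penalty on first differences
% drives the solution toward a piecewise constant signal.
\hat{x} = \arg\min_{x} \; \tfrac{1}{2}\,\|y - x\|_{2}^{2}
          + \lambda \sum_{i} \left| x_{i+1} - x_{i} \right|
```

In both problems a larger $\lambda$ shifts weight toward the l1 term: in robust PCA this yields a sparser $C$, and in TV denoising it yields fewer, larger constant segments.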

CHAPTER III METHODS

3.1 Model

As briefly touched upon in the introduction, this work models EEG signals by a power spectrum which is composed of time-frequency windows that have consistent region-local features. Similar region-based segmentation can be observed in other computer vision tasks; however, constraining the segmentation to an aligned grid, as well as the application itself, appears to be unique. The aligned grid structure agrees with the existence of spectrally observable brain states which appear in the neuroscience literature. In this context a brain state is a set of frequency bands which at some time $t$ are either excited, inhibited, or have normal levels of activity. Brain states can be further constrained by hypothesizing that activity at a particular frequency $f$ will be similar across $[f - a, f + b)$. In other words, a brain state lasts from $t_1$ to $t_2$ and consists of a set of frequency bands $F = \{f_1, f_2, f_3, f_4\}$, each of which has some level of activity. These brain states can be observed over a wide number of spectral bands. Using brain state models, it is possible to recognize that during sleep several of the sleep stage transitions provide a sharp contrast in portions of the spectrum. Additionally, sleep EEG recordings have a very strong correlation within neighborhoods of temporal-spectral samples contained in the same mode, and this correlation dominates the noise present in the recordings. These structured neighborhoods make the segmentation problem tractable.
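
As a hedged illustration of the aligned grid model, the sketch below builds a noise-free block constant spectrogram from shared band edges and shared state boundaries. The function name, the edge-list representation, and the use of +1, 0, and -1 for excited, neutral, and inhibited activity are assumptions made for this example only.

```python
def block_constant_spectrogram(freq_edges, time_edges, levels):
    """Build an aligned block-constant time-frequency image.

    freq_edges: sorted bin indices [0, ..., n_freq] separating bands
    time_edges: sorted frame indices [0, ..., n_time] separating brain states
    levels[b][s]: activity of band b during state s
                  (+1 excited, 0 neutral, -1 inhibited)
    Returns spec[f][t] as a list of rows, one row per frequency bin.
    """
    n_freq, n_time = freq_edges[-1], time_edges[-1]
    spec = [[0] * n_time for _ in range(n_freq)]
    for b in range(len(freq_edges) - 1):
        for s in range(len(time_edges) - 1):
            # Every bin inside band b and state s shares one level.
            for f in range(freq_edges[b], freq_edges[b + 1]):
                for t in range(time_edges[s], time_edges[s + 1]):
                    spec[f][t] = levels[b][s]
    return spec


# Two bands (bins 0-1 and 2-4) and two states (frames 0-2 and 3).
example = block_constant_spectrogram(
    freq_edges=[0, 2, 5],
    time_edges=[0, 3, 4],
    levels=[[1, 0], [-1, 1]],
)
```

Because every band shares the same time boundaries and every state shares the same band edges, the resulting image has at most min(bands, states) distinct row patterns, which is the low rank property the later denoising steps rely on.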

3.2 Denoising

The noise model consists of three different types of noise: temporally sparse pulses, dense Gaussian noise, and sparse perturbations. Broadband temporally sparse noise corresponds well to motion based artifacts. Gaussian noise is present in the recordings through the whole spectrum, and its variance is correlated with the general decay of energy in higher bands. Speckling exists throughout in sparse temporal-spectral regions. The sparse noise created by speckling produces large shifts from the mean, though the shifts are not large enough to treat it as the traditional salt-and-pepper noise commonly seen in image processing tasks. Therefore we consider the observation $O$ to be composed of some lower rank signal $T$ with an aligned block constant structure, dense noise $N$, pixel sparse noise $S$, and column sparse artifacts $C$:

$O = T + \sigma N + S + C$   (1)

3.2.1 Movement Artifacts

The first step in removing noise is to identify column sparse artifacts. Any motion artifact tends to be limited to less than 10 seconds of movement. Each frame of data in the spectrum uses 20 seconds of data, so either one or two frames can be expected to be corrupted. To detect outliers, a per-frame difference $d_i$ is first computed over the frequency bins $j$:

$d_i = \sum_j O_{j,i} \, [\ldots] \, O_{j,i}$   (2)

Random subsets of the difference function are used to estimate the mean and variance of the difference distribution without outliers. A threshold is then initialized as $\mu + 1.5\sigma$. Any frame which is above the threshold, or which is surrounded by other frames above the threshold, is considered to be an outlier. Several recordings also included sections when the EEG electrode was disconnected, so any frame with a difference below […] is considered to be invalid as well.
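
The outlier test described above can be sketched as follows. Several details are assumptions made only for this illustration: the per-frame difference is taken as the summed absolute change between adjacent frames (the exact form of Eq. (2) is not assumed here), the robust mean and standard deviation are taken as medians of the estimates from many random subsets, and the names `frame_differences`, `robust_mean_std`, and `flag_outlier_frames` are hypothetical.

```python
import random
import statistics


def frame_differences(spec):
    """spec[j][i] is the log power at frequency bin j and frame i.

    Assumed form of the difference: summed absolute change between
    adjacent frames. The first frame has no predecessor and gets 0.
    """
    n_frames = len(spec[0])
    d = [0.0]
    for i in range(1, n_frames):
        d.append(sum(abs(row[i] - row[i - 1]) for row in spec))
    return d


def robust_mean_std(d, n_subsets=50, subset_size=None, seed=0):
    """Estimate mean and std from random subsets so that a handful of
    outlier frames has limited influence (median of subset estimates)."""
    rng = random.Random(seed)
    size = subset_size or max(2, len(d) // 4)
    means, stds = [], []
    for _ in range(n_subsets):
        sample = rng.sample(d, size)
        means.append(statistics.mean(sample))
        stds.append(statistics.pstdev(sample))
    return statistics.median(means), statistics.median(stds)


def flag_outlier_frames(d, k=1.5):
    """Flag frames above mu + k*sigma, plus frames whose two neighbours
    are both above the threshold."""
    mu, sigma = robust_mean_std(d)
    threshold = mu + k * sigma
    above = [x > threshold for x in d]
    flagged = list(above)
    for i in range(1, len(d) - 1):
        if above[i - 1] and above[i + 1]:
            flagged[i] = True
    return flagged


# Forty quiet frames with three artifact frames; frame 21 sits between two.
d = [1.0] * 40
for i in (10, 20, 22):
    d[i] = 100.0
flags = flag_outlier_frames(d)
```

The neighbour rule marks frame 21 even though its own difference is small, matching the statement that a frame surrounded by outliers is itself treated as an outlier.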

3.2.2 Normalization

Once some of the temporally sparse noise is eliminated, normalization is needed to make different frames within a recording comparable. As with most physically generated signals, there is a decay in the activity of the signal going from low to higher frequencies. The log scaling of the spectra helps with this issue, but a linear decrease still exists.

Figure 4: Figure of the median spectral energy for subjects 1 through 10 in the DREAMS subject database

To correct for the trend shown in Fig. 4 and to flatten the spectrum, the median energy is removed from all frequency bins, producing $X'$. Then the remaining median is removed from temporal frames, resulting in $X''$:

$X'_{j,k} = X_{j,k} - \mathrm{median}(X_{:,k})$   (3)

$X''_{j,k} = X'_{j,k} - \mathrm{median}(X'_{j,:})$   (4)

Some information is lost by discarding these medians, which is why this normalization process is only used for learning the time-frequency structure and not for producing the final DDS features.

3.2.3 Median Filtering

Median filtering provides a means of reducing both general Gaussian noise and the pixel sparse perturbations. In order to avoid smoothing out noisy change points in the data, a small window of 3 × 3 was chosen for the median operator $M(\cdot)$. A single pass of this operator, however, proved insufficient. After observing the perceptual improvements from using a recursive median, $M_r(\cdot)$, an iterated median was applied until convergence, when $M_I^{N+1}(\cdot) \approx M_I^{N}(\cdot)$. This proved to have similar gains without the diagonal distortion which is visible with the recursive formulation.

$M(x, i, j, ext) = \mathrm{median}(x_{i-ext..i+ext,\; j-ext..j+ext})$   (5)

$M_r(x, i, j, k) = M(M_r(x_{i-k..i-1,\; j-k..j+k},\; x_{i,\; j-k..j-1}),\; x_{i,\; j..j+k},\; x_{i+1..i+k,\; j-k..j+k})$   (6)

$M_I^{0}(x, i, j, ext) = x_{i,j}$   (7)

$M_I^{N}(x, i, j, ext) = M_I^{N-1}(M(x, i, j, ext))$   (8)

Adaptive medians [73, 33] were examined, but the noise present in the observed recordings consisted of large sparse perturbations rather than traditional salt-and-pepper noise. This difference made it impractical to identify which pixels should be considered wholly corrupted by noise. Experiments indicated that no adaptive form consistently performed well based upon visual examination.

3.2.4 Robust PCA

Robust PCA is one class of techniques which is applicable to the problem with the sparse noise model. It serves as an alternative to simple medians. Our assumption that a recording can be segmented into a handful of regions implies that it exists in a low rank space. Robust PCA provides an improved way of estimating this space when a large amount of noise still affects it.

3.3 Region Based Features

Given an image denoised with a combination of Robust PCA and iterated medians, the next step is to search for features to utilize in segmentation. There are both local and global methods to consider in this search. Local features for identifying texture were first investigated using filter banks, dictionary-based methods, and local shifts to identify boundaries. Some of these techniques included Textons [37, 77], mean-shift [20, 19], and random fields [81]. These efforts were initially motivated by stripes such as those seen in Fig. 5. Directly applying such features to classification performed worse than existing methods.

Figure 5: Region of EEG spectra generated by pulsing activity

Local features such as stripes may represent meaningful structures indicating particular phases of sleep, including the presence of K-complexes or spindles. These structures, however, provide limited classification power. In figures such as Fig. 5 a more global structure can be seen by inspecting which regions are colored red, yellow, and blue. The colored regions roughly correspond to $\mu + \sigma$, $\mu$, and $\mu - \sigma$. Observing the red regions, Fig. 5 shows a region of activity at … Hz from 0-3 minutes and another one at 0-15 Hz from 3-7 minutes. By thresholding full recordings into these three sets of levels, a speckled version of the spectra can be obtained. The goal of this process is to identify whether neural activity is excited, neutral, or inhibited at a given time-frequency bin. The set of regions created by thresholding provides the initial basis for where time-frequency changes occur. These thresholded images are still very noisy, though they start to reveal the global time-frequency structure. In computer vision, this kind of binary-mask noise is a very common subproblem within segmentation tasks. The typical solution is to use standard binary morphology operators, including erosion and dilation. Here an additional non-iterated 5 × 5 median operator is applied to each of the three binary masks, though other binary operators would also have worked well. What is left are either sections with reasonably well connected components or areas with reasonably high densities of speckled values.

3.4 Organizing Geometrically Structured Labels

The binary masks constructed in the previous step form a rough function of where time and frequency boundaries exist between brain states. Any location where neighboring time-frequency bins are above or below the threshold indicates a possible global shift in the brain state or a meaningful frequency band switch. The edge based estimate is essentially a blurred version of the target separation function. To refine the separation function, the geometry of the binary masks can be exploited. Through visual inspection, each one of these masks is roughly rectangular in shape. The signal model agrees with this structure, as it predicts sharp changes at spectral and temporal boundaries. When these boundaries are broken up, the cause is a combination of the Gaussian and sparse noise. These boundaries create the sparse approximation problem of representing the spectrogram mask as the union of a series of rectangles. Finding an optimal or near optimal solution is difficult, as the solution may involve the space of all rectangles, which for a mask of width $w$ and height $h$ has size $O(w^2 h^2)$.
When applied to the typical image size of a full night recording, it is impractical to use these approaches as the binary masks

tend to be around 2048 × 1400 pixels. Other approaches to solve the optimization problem used smaller dictionaries and image sizes which were at least an order of magnitude smaller [1].
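To put the $O(w^2 h^2)$ search space in concrete terms, the short sketch below counts the axis-aligned rectangles available in a mask of the size quoted above. The function name is illustrative and is not taken from the software used in this work.

```python
def count_axis_aligned_rectangles(width: int, height: int) -> int:
    """Number of distinct axis-aligned rectangles on a width x height pixel grid.

    A rectangle is fixed by choosing a left/right column pair and a
    top/bottom row pair, giving w(w+1)/2 * h(h+1)/2 = O(w^2 h^2) options.
    """
    column_spans = width * (width + 1) // 2
    row_spans = height * (height + 1) // 2
    return column_spans * row_spans


if __name__ == "__main__":
    # Mask size quoted in the text for a full night recording.
    print(count_axis_aligned_rectangles(2048, 1400))
```

For a 2048 × 1400 mask this gives roughly 2 × 10^12 candidate rectangles, which is why an exhaustive dictionary is not practical.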

In order to make the problem approachable, a greedy algorithm is used to generate the dictionary of possible rectangles. Matching pursuit is used to constrain the somewhat redundant set of rectangles. Each rectangle starts from a non-zero pixel. From here it can be grown or shrunk either up, down, left, or right. The rectangles are grown over the binary mask $M$ by the following cost:

$$\sum_{x,y} 4M_{x,y} - P_{\text{out-of-bounds}}(x, y) \qquad (9)$$

Figure 6: Growth of candidate regions

Each iteration increases the total value of the rectangle and is guaranteed to


terminate after a finite number of iterations. Rectangles are repeatedly grown from seeds until at least 1000 unique rectangles are identified with an area of at least 20 pixels each. Matching pursuit is used to identify the coarse structures which make up the image while discarding rectangles from the dictionary which do not significantly contribute to covering regions of interest. The collection of rectangles at this point forms a series of candidate time-frequency regions within the recording. The edges of either the union of the candidate regions or the individual candidates can be used to obtain a more concentrated version of the band and temporal mode boundary estimates. One example of the candidate regions is shown in Fig. 7. The largest regions within the figures show the most significant temporal and frequency changes, which in the left and right figures can be seen to model a repeated structure. The process of generating candidates is summarized in Algorithm 1.

Figure 7: Cover of thresholded regions. (a) Excited Bands (b) Average Bands (c) Inhibited Bands

Algorithm 1 Greedy Rectangle Image Cover Problem

procedure RectCover(I, N, M)    (N: size of the candidate set S1, M: size of the cover S2)
    S1 ← ∅
    while |S1| < N do
        R ← seed rectangle
        while R ≠ R′ do
            R′ ← R
            R ← expand(R)
        end while
        S1 ← S1 ∪ {R}
    end while
    S2 ← ∅
    while |S2| < M do
        R ← argmax_{r ∈ S1} ⟨r, I − Σ_{k ∈ S2} k⟩
        S2 ← S2 ∪ {R}
    end while
end procedure
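The two phases of Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the implementation used in this work: the scoring rule (+1 per set pixel, a fixed penalty per unset pixel) stands in for the cost in Eq. 9, candidates are only grown and never shrunk, one candidate is seeded from every set pixel instead of sampling seeds until N unique rectangles exist, and all function names are hypothetical.

```python
from typing import List, Tuple

Rect = Tuple[int, int, int, int]  # (top, left, bottom, right), inclusive bounds
Mask = List[List[int]]


def rect_score(mask: Mask, rect: Rect, penalty: int = 4) -> int:
    """Assumed cost: +1 for each set pixel, -penalty for each unset pixel."""
    top, left, bottom, right = rect
    return sum(1 if mask[y][x] else -penalty
               for y in range(top, bottom + 1)
               for x in range(left, right + 1))


def grow(mask: Mask, seed: Tuple[int, int]) -> Rect:
    """Expand one side at a time while the score strictly improves."""
    height, width = len(mask), len(mask[0])
    rect = (seed[0], seed[1], seed[0], seed[1])
    while True:
        top, left, bottom, right = rect
        moves = [(top - 1, left, bottom, right), (top, left - 1, bottom, right),
                 (top, left, bottom + 1, right), (top, left, bottom, right + 1)]
        moves = [m for m in moves
                 if m[0] >= 0 and m[1] >= 0 and m[2] < height and m[3] < width]
        best = max(moves, key=lambda m: rect_score(mask, m), default=rect)
        if rect_score(mask, best) <= rect_score(mask, rect):
            return rect  # no expansion helps, so growth has converged
        rect = best


def covered_energy(residual: Mask, rect: Rect) -> int:
    """Inner product of the rectangle indicator with the residual mask."""
    top, left, bottom, right = rect
    return sum(residual[y][x]
               for y in range(top, bottom + 1)
               for x in range(left, right + 1))


def rect_cover(mask: Mask, n_select: int) -> List[Rect]:
    """Grow a candidate from every set pixel, then greedily pick a cover."""
    candidates = sorted({grow(mask, (y, x))
                         for y, row in enumerate(mask)
                         for x, value in enumerate(row) if value})
    residual = [row[:] for row in mask]
    chosen: List[Rect] = []
    while candidates and len(chosen) < n_select:
        best = max(candidates, key=lambda r: covered_energy(residual, r))
        if covered_energy(residual, best) <= 0:
            break  # nothing left to explain
        chosen.append(best)
        top, left, bottom, right = best
        for y in range(top, bottom + 1):
            for x in range(left, right + 1):
                residual[y][x] = 0  # remove the explained pixels
    return chosen
```

Because the score increases strictly at every accepted expansion and is bounded by the number of set pixels, the growth loop terminates, mirroring the termination argument given for the candidate regions.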

3.5 Reducing To Segments

Now that a collection of candidate regions is available, the regions need to be aligned to produce the hypothesized global structure. Instead of perturbing candidate regions, their bounding edges are used to establish a prior on where global segment edges exist. The initial segmentation is very much an over-segmentation, as the edges tend not to agree over the exact location of a segment boundary. The discrepancy in edge location is not a major problem as segments tend to be at least a few time steps long. To refine the candidate region segmentation three primary phases are used.

First, temporal segment edges were initialized by summing the frequency bands covered by each candidate region boundary at time $t_i$. Then $\hat{t}$ was constructed to be a Parzen window filtered version of $t$. Any peaks of $\hat{t}$ were then considered to be an initial global time segment edge. This process was repeated for the excited, inhibited, and normal energy threshold masks independently.

Second, frequency segments, $S_f$, were initialized by creating boundaries between bands which were divided by the candidate regions. Under the proposed model, a frequency boundary should only exist if regions $f_1 = [a, b]$ and $f_2 = [b + 1, c]$ are expected to have a different distribution over time. Based solely on the existence of a frequency edge there is a large number of bands within the segmentation $S_f$, on the order of 30% of the total frequency bins. Most of these share the same distribution as their neighbors and as such they would be reasonably expected to exist in the same cluster of data if a standard clustering technique was run. K-means clustering can be used to merge similar clusters, but a single run of k-means is sensitive to initialization and very sensitive to the number of clusters. Multiple runs of k-means can be used to give a more robust consensus. The consensus in this case provides a means of estimating $P(\mu(f_i) = \mu(f_{i+1}))$.
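The consensus estimate can be illustrated with a small, self-contained sketch. It is an assumption-laden example, not the thesis code: each frequency band is represented by a short feature vector, k-means is a plain Lloyd iteration with random initial centers, and the names `kmeans_labels` and `adjacent_cooccurrence` are hypothetical. The fraction of trials in which two adjacent bands share a cluster serves as the estimate of $P(\mu(f_i) = \mu(f_{i+1}))$.

```python
import random
from typing import List, Sequence


def kmeans_labels(points: Sequence[Sequence[float]], k: int,
                  rng: random.Random, n_iter: int = 20) -> List[int]:
    """Plain Lloyd's algorithm with random initial centers; one label per point."""
    centers = [list(p) for p in rng.sample(list(points), k)]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [min(range(k),
                      key=lambda c: sum((a - b) ** 2
                                        for a, b in zip(p, centers[c])))
                  for p in points]
        # Update step: move each non-empty center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def adjacent_cooccurrence(bands: Sequence[Sequence[float]], k: int,
                          trials: int = 100, seed: int = 0) -> List[float]:
    """Fraction of k-means trials in which band i and band i+1 share a cluster."""
    rng = random.Random(seed)
    counts = [0] * (len(bands) - 1)
    for _ in range(trials):
        labels = kmeans_labels(bands, k, rng)
        for i in range(len(bands) - 1):
            counts[i] += labels[i] == labels[i + 1]
    return [c / trials for c in counts]
```

Adjacent bands with a co-occurrence close to 1 are candidates for merging, while a value close to 0 supports keeping the frequency edge between them.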
For refining the frequency edges one round of k-means consensus clustering (trials=100, k=20) is used to combine all frequency bands which commonly occur in the same cluster and

are adjacent. Typically one round of k-means consensus clustering is sufficient, but in cases where the thresholds do not match a secondary pass is used to prune excessively small frequency bands. The process takes frequency bands predicted by consensus clustering and prunes transitions which occur too closely. Additionally the process assumes frequency bands are more compact in the lower frequency bands and more sparse in higher frequencies. Algorithm 2 denoises the transition sequence and produces the initial frequency segmentation $S_f$.

Algorithm 2 Minimum Band Size Regulation

procedure CleanTransition(T)
    T - input binary transition sequence
    M - input minimum freq distance
    Initially there are no refined transitions
    transition_i ← 0 for every index i of T
    minfreq ← 0
    for each index i of T do
        F ← frequency_i
        if log(F) ≥ minfreq and T_i = 1 then
            minfreq ← log(F) + M
            transition_i ← 1
        end if
    end for
    return transition
end procedure

Lastly, the time segmentation needs to be coarsened. From the initial set of bounding boxes, the normalized spectrum can be used to produce a denoised version $D$. The denoised version $D$ is created by taking each time-frequency region from segmentation $S = (S_t, S_f)$ and replacing all values within the region with a single value. The single value can be the mean, the median, or a quartile. Different representations

were used to decorrelate the error presented by any one representation. Consensus k-means clustering is used on $D$ to coarsen the time segmentation $S_t$. Additionally each denoised image $D$ for the different binary masks is used simultaneously in the clustering process. Given the estimates of co-occurrence, frames are coalesced into new and larger time clusters. The denoising process of creating $D$ and coarsening $S_t$ is repeated 6 times to produce a sufficiently coarse representation.

3.6 Denoising

As mentioned in the previous section, denoising is done on the basis of segmented time-frequency regions, $(S_t, S_f)$. Each region has a denoising function applied independently. Several functions were applied to these 2D images, including a mean function, a median function, a bi-linear approximation of the region, and a bi-quadratic approximation. Unless specified otherwise the evaluated denoising function for region $r$ on image $I$ and denoised image $\hat{I}$ was

$$\hat{I}_r = \frac{1}{2}\left(\mathrm{median}(I_r) + \mathrm{mean}(I_r)\right). \qquad (10)$$

The mean/median combination appeared to give the best results in initial tests. An alternative bi-quadratic approach was shown in section to produce images better for visualization, and similar classification accuracy. In those cases the denoising was found by finding the best fit $\phi_r$ to minimize

$$\hat{I}_r^{j,k} = \langle \phi_r, [1, j, j^2, k, k^2] \rangle \qquad (11)$$

$$\phi = \arg\min_{\phi} \lVert I - \hat{I} \rVert_2. \qquad (12)$$

When learning edges, regions are defined based upon the frequency and temporal edges provided by the refined candidate edges. The final denoising, however, is provided a set of temporal edges and frequency edges from each of the excited, inhibited, and normal level segmentations. The temporal edges are identical in each, but frequency

edges differ. Unless stated otherwise the final denoising only uses the temporal edges. In section 5.4.4, it was shown that choosing edges associated with excitation may yield better results than discarding the frequency band information. The features generated by applying the aligned grid segmentation and region based denoising are known as Dense Denoised Spectral (DDS) features.

3.7 Supervised Learning On Segments

The DDS features provide a representation which reduces nuisance variation, though the task of classifying frames of data still has its own difficulties. Even after denoising there are still outliers within the data due to noise, and the information presented by different frequency bands varies considerably. In particular lower frequency bands tend to provide much more information than higher frequency bands. These constraints indicate that the selected supervised learning tool needs to be robust to data outliers and work well with high dimensional feature spaces which may be either redundant or corrupted by noise. The ensemble method of random forests immediately springs to mind. Random forests are a popular method used widely in computer vision, typically only superseded by deep neural networks. Figure 8 shows an example tree which uses a set of thresholds, $t$, and a set of features, $f$, to label an input as class 1 or 2. Each tree within a random forest is composed of threshold based decisions which each use a single feature. As binary decisions repeatedly split data into smaller and smaller groups, few features tend to be involved in the labeling of data. With roughly equal splits the labeling process can approach the logarithmic depth of a binary search tree. As few features are needed, a decision tree scales excellently with high dimensional features and it is rare for a tree to use any specific feature. In aggregate, the occasional decision tree which uses a corrupted feature is averaged out with several other trees which do not.
Thus the aggregation results in a robust random forest.

Figure 8: Decision Tree (internal nodes test $f_a < t_a$ through $f_e < t_e$; leaves are labeled class 1 or class 2)

In theory, random forests have properties which make them an ideal option in terms of both performance and the computational resources needed for them to process the large datasets generated by sleep polysomnographs. Rather than overlooking other methods, several were applied to a small subset of the DREAMS subject dataset. First, generative methods were applied, including Gaussian Models, Gaussian Mixture Models (GMM), Hidden Markov Models (HMM), and AdaBoost with simple stumps. Generative models struggled with frequent outliers in the data and took a large amount of computational resources to train. In particular any Gaussian or Gaussian Mixture model tested on the sample set either neglected significant higher order variance or had insufficient data to properly estimate it. Applying an HMM resulted in improved accuracy for very similar patients, but very long sleep states made estimating the stage duration difficult, and sleep stage lengths do vary considerably between subjects given different ages, genders, or sleep disorders. The level of variation between subjects biases any state-duration based model considerably and makes such models unsuitable for the classification task. Discriminative methods such as SVM, Linear Discriminant Analysis (LDA), neural networks, and random forests were also tested. Both SVM and LDA proved difficult to tune given the high dimensional feature space. Neural networks showed some initial promise, but required much longer

training times as well as much more data than random forests.

CHAPTER IV

INITIAL INVESTIGATION

In order to gauge the effectiveness of the approach detailed in Chapter 3, it was initially applied to open datasets including Physionet and DREAMS based on recommendations in [36]. This work builds on earlier projects in both EEG and LFP domains.

4.1 Prior Work

This work initially began when investigating the generalizability of cross-frequency-coupling (CFC) features used for Parkinsonism [65]. In neuroscience literature it is common to see energy in a low frequency band predict the energy in a higher frequency band. CFC features establish the amount of phase-amplitude coupling that exists between different bands [14, 18]. Typically the coupling is observed as a high frequency band rhythmically pulsing in activity at the same frequency as the lower frequency band. The amount of coupling can be used to establish what state the brain is in at a given time. Similarly this approach has been used to look at spatial correlations of bands across or within the skull. CFC features appeared to be well suited for the slowly varying nature of sleep and a complement to the information presented by classic bandpower features. CFC features were compared to prior work as well as a spectral centroid tracking method based upon the work of Philip Low [44]. In Low's method, the maximal indices of STFT frames were used to establish the brain state at a given moment. The original 1D feature representation proposed by Low proved insufficient when

applied to the sleep EEG datasets, so a modified procedure was used. Let $X$ be the STFT, $\omega \in [0, \pi]$, of the signal with no overlap and five second frames. Next, normalize the z-score of every frequency bin across the patient's entire recording. Let $Y_{j,i} \leftarrow 1$ if $j = \arg\max_{k} X_{k,i}$ and 0 otherwise, where $i$ indexes frames and $j$ indexes frequency bins. Lastly, blur $Y$ columnwise such that the energy placed in one frequency bin is spread to adjacent ones. The modified procedure transformed the originally described features as shown in Fig. 9 into the image shown in Fig. 10.

Figure 9: Peak power features versus epoch for ST7241. The black dots at bottom indicate the expert identified sleep state: waking (lowest y-axis value), REM sleep, sleep stages 1-4 (shown with respectively increasing y-axis value).

This last step of blurring was essential as the data appeared to have a much wider distribution of preferential frequency bands at each state compared to the results presented in [44]. To obtain better frequency estimates, additional blurring can be done through time along each row as seen below. This, however, was not done on the features given to the classification routines to avoid any over optimistic classifications through nearest neighbor cases.
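The modified procedure can be sketched as follows, starting from an already computed STFT magnitude matrix. This is an illustrative reading of the steps above and not the original code: the layout `spec[j][i]` (frequency bin `j`, frame `i`), the three-tap blur kernel, and the function name are assumptions, and the STFT itself (five second frames, no overlap) is taken as given.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def peak_frequency_features(spec: Sequence[Sequence[float]],
                            kernel: Sequence[float] = (0.25, 0.5, 0.25)
                            ) -> List[List[float]]:
    """spec[j][i] is the STFT magnitude of frequency bin j in frame i."""
    n_freq, n_frames = len(spec), len(spec[0])

    # 1. z-score every frequency bin across the whole recording.
    z = []
    for row in spec:
        mu, sd = mean(row), pstdev(row)
        z.append([(v - mu) / sd if sd > 0 else 0.0 for v in row])

    # 2. One-hot indicator of the strongest normalized bin in each frame.
    y = [[0.0] * n_frames for _ in range(n_freq)]
    for i in range(n_frames):
        peak = max(range(n_freq), key=lambda j: z[j][i])
        y[peak][i] = 1.0

    # 3. Blur along the frequency axis so a peak also marks its neighbors.
    half = len(kernel) // 2
    out = [[0.0] * n_frames for _ in range(n_freq)]
    for j in range(n_freq):
        for offset, weight in enumerate(kernel):
            src = j + offset - half
            if 0 <= src < n_freq:
                for i in range(n_frames):
                    out[j][i] += weight * y[src][i]
    return out
```

Each column of the result is a smoothed indicator of the dominant frequency for that frame, which is the higher dimensional representation the 1D peak index was mapped back to.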

Figure 10: Filtered version of Figure 9

The CFC features were compared to classic bandpower features and the modified Low method in [67]. All methods were compared using a 10-fold cross validation on 13 full-night recordings from Physionet using R&K labelings. First, band power features were extracted using 30 second windows as described in Section 2.1. This method performed relatively poorly under most classifiers, and keeping NREM3 and NREM4 as separate classes negatively impacted classification due to their relative rarity. Using a random forest classifier, an accuracy of 70% was obtained after fusing the NREM3 and NREM4 states into one. Cross-frequency-coupling based upon wavelets was compared to this classical approach. These features were evaluated with the same cross fold validation, random forest classifier, and altered R&K labels. This approach yielded 65% accuracy, though when combined with band power estimates a 75% accuracy could be observed. Lastly the modified Low method was tested on the same dataset. The accuracy for the Low method was comparable to cross-frequency-coupling without being combined with band power estimates. The Low method did show a higher order structure which showed which time-frequency regions were typically active within certain phases of sleep, and the modification which mapped the 1D features back to a high dimensional


representation hinted at possible extensions.

4.2 Initial Application Of Methods

Initial work was performed on a larger dataset from DREAMS consisting of 20 healthy patients. Unlike the previous work in [67], this dataset was labeled with the more modern AASM labels. While the previous methods were simply testing various approaches on sleep recordings, the methods discussed in this work aim to be more robust to noise and produce more consistent labelings.

4.2.1 Dense Un-quantized Experiments

Figure 11: A plot of subject 2 labels and predictions

Using the methods as described in Chapter 3, good performance was observed, comparable with inter-analysis error rates. In this analysis a random forest of 100 decision trees with 20 features for each decision layer was used. The methods were compared using a leave-one-out methodology. This approach worked quite well for most states, but the transitory N1 state was very rarely found. If this state was included, a 77% median overall accuracy was obtained, or 83% in the case where this state was neglected.
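The evaluation protocol can be sketched as below, assuming that leave-one-out is performed over subjects (each subject's full night is held out in turn) and that the reported figure is the median of the per-subject accuracies. The `evaluate` callback is a hypothetical stand-in for training a classifier on the training subjects and scoring it on the held-out subject.

```python
from statistics import median
from typing import Callable, Iterator, List, Tuple


def leave_one_subject_out(subject_ids: List[str]
                          ) -> Iterator[Tuple[List[str], str]]:
    """Yield (training subjects, held-out subject) for every subject."""
    for held_out in subject_ids:
        train = [s for s in subject_ids if s != held_out]
        yield train, held_out


def median_accuracy(subject_ids: List[str],
                    evaluate: Callable[[List[str], str], float]) -> float:
    """Median of the per-subject accuracies returned by `evaluate`."""
    return median(evaluate(train, test)
                  for train, test in leave_one_subject_out(subject_ids))
```

Holding out whole subjects keeps frames from the same night out of both the training and the testing set, which matters because neighboring frames of sleep EEG are highly correlated.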


This method offers a significant improvement over classifying frames which were not denoised at all. In those cases an accuracy of 58.3% or 63.0% was observed. The per subject improvements can be seen in Fig. 12 below.

Table 2: Confusion Matrix Over All Trials With Un-quantized & Denoised Spectra (rows: prediction; columns: expert labeling; classes NREM3, NREM2, NREM1, REM, WAKE)

Figure 12: Accuracy of Random Forest before (green) and after (blue) denoising method


4.2.2 Low Rank Quantized Experiments

Figure 13: Results of using Modal Regions Directly For Classification. (a) Quantized Representation (b) Subject 2 prediction based off quantized regions

As an alternative approach to see if a very low rank and low precision realization would produce meaningful results, the regions of activity from section 3.4 were used. These regions can be combined to create a quantized representation of the recordings as seen in Fig. 13a. Very little training data is needed for this lower dimensional realization, though it does produce generally lower accuracy numbers given all data. Using the first 10 subjects for a leave-one-out cross validation, this generated 74.1% accuracy including N1 and 79.7% excluding it.

4.3 Initial Discussion

While the multi-stage algorithm discussed in this work may have a complicated implementation, the steps are designed to provide rich visual information about what is getting modeled at each stage. As such, analyzing the intermediates shows how various stages are missed by this algorithm and why specific errors occur. Outside of the easily measured percentage of frames classified correctly, there is a much more subjective question about the believability of some result based upon state durations and correct transitions. In the case of DDS features, one failing is the

fact that transitions between sleep stages do not always match up with the labels that are used in training. Transition offsets, however, should be considered relatively minor compared to the much more realistic state durations prevalent throughout the predicted label sequences.

Table 3: Confusion Matrix Over First 10 Subjects With Quantized & Denoised Spectra (rows: prediction; columns: expert labeling)

Prediction    NREM3   NREM2   NREM1   REM   WAKE
NREM3          1640     349       3     0      1
NREM2          1265    5194     288   401    149
NREM1             7      69
REM
WAKE

Figure 14: Predictions of Bandpower estimate

When band power or other features are used, it is normal to see jittering between

two or three states as occurs in Fig. 14. While jittered results produce the best accuracy for those feature sets, they are less believable as a sequence of stages throughout the night. Some commercial software solutions work around unrealistic state sequences by imposing hard minimum limits on how long each state can be, and they may also force some state transitions to occur [72]. By comparison, state sequence issues are not a major issue using the proposed approach, as can be seen in Figs. 13b and 11. This is part of the advantage of seeking and enhancing the inherent modality present in the data under the current model.

CHAPTER V

RESULTS & DISCUSSION

Both academic and commercial solutions to the single channel EEG sleep staging problem are lacking. The most frequently discussed methods have been band power and time signature based methods. These more common methods use a small number of features which are plausibly linked to the underlying biological processes. This status quo largely ignored the possible use of very dense feature representations. Thus it was hypothesized that high dimensional features generated from a single electrode could outperform existing techniques by a sizable margin.

This search initially began by mapping well known signal processing and machine learning techniques from speech processing onto the EEG domain. After testing MFCC, DCT, and raw spectra features with a GMM on a small subset of the DREAMS subject dataset it was observed that the raw spectrum performed the best. The performance was still significantly lacking with the raw spectrum, and visual inspection could easily result in improved manual classifications. Traditionally the time relation between frames of data could be encoded by time delayed copies, by including gradients as features, or by adding a hidden Markov layer. Unfortunately the feature dimensionality was very large, which made data augmentation consume an excessive amount of computational resources. A Markov layer on the other hand added the possibility of biasing the output classification towards any sleep pathology the system could be trained on, possibly resulting in misclassifications which could result in a misdiagnosis. As visual inspection seemed to beat a frame-by-frame GMM classification, that would mean that either the existing noisy feature space or the supervised classification

layer needed to be improved. Improved supervised learning will be addressed later in this section. First the feature quality will be addressed.

Visual inspection hinted that there was both a block structure and a texture which existed in local neighboring bands of 6x6 to 20x50 time-frequency bins. This was reinforced by the classification performance gains after running a trivial 3x3 median filter. A variety of approaches were used to verify the block-like regions with limited success. Initially mean-shift was used to identify regions, however the results did not match regions identified manually, nor did it significantly enhance classification rates beyond the basic median filter. Mean-shift did not identify the visual blobs of the spectra which defined the sleep stages and neither did any of the blob tracking algorithms.

As this initial 2D denoising did not fare as well as was expected, a spread of vector denoising techniques was applied. The vector denoising techniques started with very standard tools including PCA, ICA, and TV denoising. These techniques fared better than the previous 2D denoising techniques, and more complex techniques including sparse PCA, Robust PCA, spectral clustering, K-SVD, and LLE yielded some further improvements. Even with these techniques combined, though, performance on the dense spectral features was little better than the classical bandpower features.

At this stage I hypothesized that higher order 2D methods must be the solution. The search resumed looking at 2D denoising techniques including conditional random fields, superpixels, Markov random fields, patch based dictionary learning methods, and 2D extensions for total variation denoising. These techniques still had one fatal flaw. They did not sharpen the differences in the spectrum across all frequencies when there was a change point in time. As there was no method in the literature which appeared to work, a problem description was made.
Sleep EEG spectrum consists of a series of observations with relatively quick

changes that correspond to changing the underlying sleep state.
Changes in the spectrum roughly consist of independent bands.
Changes can be characterized by shifts in the overall energy present in the individual bands.

Existing 1D and 2D denoising techniques did not encode this approximate structure well, so a means for extracting and denoising this structure was built.

5.1 Datasets

In order to evaluate the quality of Dense Denoised Spectral, or DDS, features, four primary datasets were used. The first two datasets are from the DREAMS research project. This dataset was collected around 2004 and is currently maintained by Stéphanie Devuyst at University of Mons in Belgium. The DREAMS project consists of eight datasets:

The DREAMS Subjects Database
The DREAMS Patients Database
The DREAMS Artifacts Database
The DREAMS Sleep Spindles Database
The DREAMS K-complexes Database
The DREAMS REMs Database
The DREAMS PLMs Database
The DREAMS Apnea Database

The subjects and the patients datasets were used due to two key features. First, they provide full night polysomnographs with expert AASM labeling. Secondly, they provide a variety of signals to inspect the recordings including multiple EEG channels, EKG, EMG, EOG, and more. The subjects dataset consists of 20 full night recordings of healthy subjects recorded at 200Hz. The patients dataset consists of 27 full night recordings featuring patients experiencing various pathologies recorded at 200Hz.

The second source of data was the expanded Sleep EDF Physionet dataset. The Physionet C dataset consisted of sleep cassettes which were recorded between 1987 and This dataset consisted of 20 hour recordings of 39 different subjects recorded at 100Hz within the subjects' homes. The Physionet T dataset consisted of sleep trials studying Temazepam made in subjects from this dataset were evaluated and each one consisted of a full night recording at 100Hz. Both datasets used R&K labelings which were translated to the equivalent AASM sleep states.

This set of four datasets provides an open means of comparing the developed DDS features with other methods as well as a way to test how these features perform in various conditions. The Physionet T and C datasets provide data collected with older instruments and in subjects' homes, rather than in a clinic, resulting in a higher noise floor. The Physionet C dataset provides longer recordings which should show when various methods have problems with the distribution of data being biased towards the wake state. The DREAMS subject dataset shows a relatively ideal operating environment. Lastly, the DREAMS patient dataset shows how this system should be expected to perform when it is used with individuals that have trouble sleeping.

5.2 Time Domain Features

As mentioned in section 2.1 there are a number of means of extracting time domain features to classify sleep stages. One particular survey paper on this topic is [63].

This paper focuses on the task of using entropy measures to identify different levels of structure within sleep EEG recordings. In a coarse sense each one of these entropy measures offers a way to understand the relationship between the overall signal energy and the relationship between high frequency and low frequency components. Some of these measures additionally can measure features of key sleep EEG events such as the slew rate within spindles and K-complexes.

Figure 15: Example of FDA at different scales

The first method, Fractal Dimension Analysis (FDA), is influenced by the idea that functions at different subsamplings have different effective segment or arc lengths. At each scale, $S_i$, the full function is segmented into equal length rectangles which bound the min/max of that segment. Then the diagonals of the rectangles are summed to provide the length at the scale, $L_i$. This process is repeated at different scales and the trend between the scale $S$ and length $L$ is extracted to provide a scalar value for each epoch [60].
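A minimal sketch of the FDA feature described above is given below. Several details are assumptions made for illustration: the rectangle width is taken to be the scale in samples, the "trend" between $S$ and $L$ is taken to be the least-squares slope in log-log coordinates, and the scale set and function names are hypothetical.

```python
import math
from typing import Sequence


def curve_length(signal: Sequence[float], scale: int) -> float:
    """Sum of bounding-box diagonals over consecutive windows of `scale` samples."""
    total = 0.0
    for start in range(0, len(signal) - scale + 1, scale):
        window = signal[start:start + scale]
        # Box width is the window length, box height is the min/max range.
        total += math.hypot(scale, max(window) - min(window))
    return total


def loglog_slope(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Least-squares slope of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    num = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    den = sum((a - mx) ** 2 for a in lx)
    return num / den


def fractal_feature(signal: Sequence[float],
                    scales: Sequence[int] = (2, 4, 8, 16)) -> float:
    """One scalar per epoch: how quickly the measured length shrinks with scale."""
    lengths = [curve_length(signal, s) for s in scales]
    return loglog_slope(scales, lengths)
```

A flat epoch has the same measured length at every scale and therefore a slope of zero, while a rapidly oscillating epoch loses length as the scale grows and produces a negative slope.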


Figure 16: Example of DFA at different scales

Similar to FDA, Detrended Fluctuation Analysis (DFA) works by comparing multi-scale representations of the underlying function [55]. At each scale, $S_i$, a piecewise-linear approximation is created of the signal with the length of each piece being $S_i$. The error at each scale is recorded as $E_i$. Then the DFA value is the best fit linear relation between $S_i$ and $E_i$.

Figure 17: Pseudo-Shannon Entropy
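The DFA description above can be sketched as follows. The sketch follows the text literally and fits the piecewise-linear approximation to the signal itself; the commonly used form of DFA first integrates the mean-removed signal, which is not done here. The RMS error measure, the log-log fit, the scale set, and the function names are assumptions made for illustration.

```python
import math
from typing import Sequence


def linear_fit_sse(segment: Sequence[float]) -> float:
    """Sum of squared residuals after removing the least-squares line."""
    n = len(segment)
    mean_x, mean_y = (n - 1) / 2, sum(segment) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(segment))
    slope = sxy / sxx if sxx else 0.0
    return sum((y - (mean_y + slope * (x - mean_x))) ** 2
               for x, y in enumerate(segment))


def fluctuation(signal: Sequence[float], scale: int) -> float:
    """RMS error E of a piecewise-linear fit with pieces of `scale` samples."""
    n_pieces = len(signal) // scale
    sse = sum(linear_fit_sse(signal[p * scale:(p + 1) * scale])
              for p in range(n_pieces))
    return math.sqrt(sse / (n_pieces * scale))


def dfa_feature(signal: Sequence[float],
                scales: Sequence[int] = (4, 8, 16, 32),
                floor: float = 1e-12) -> float:
    """Least-squares slope of log(E) against log(S) across the given scales."""
    log_s = [math.log(s) for s in scales]
    # The floor avoids log(0) when a signal is fit perfectly at some scale.
    log_e = [math.log(max(fluctuation(signal, s), floor)) for s in scales]
    mean_s, mean_e = sum(log_s) / len(log_s), sum(log_e) / len(log_e)
    num = sum((a - mean_s) * (b - mean_e) for a, b in zip(log_s, log_e))
    den = sum((a - mean_s) ** 2 for a in log_s)
    return num / den
```

The signal must be at least as long as the largest scale; for 30 second epochs at 100 or 200 Hz this is easily satisfied.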

The authors of [25] refer to Eqn. 13 as an analogue for Shannon entropy. The pseudo-entropy equates $x^2$ with the probability-density function. The choice of a pdf allows the pseudo-entropy measure to track two features of each epoch. Firstly, it tracks the amount of variance that the signal has. Secondly, it records information about the mean deviation from zero as the mean is not removed from the epoch.

$$\sum_i x_i^2 \log(\epsilon + x_i^2) \qquad (13)$$

Approximate Entropy

Approximate Entropy (ApEn) is a measure which identifies the regularity of patterns within a sequence [57, 62]. First, the sequence is broken into segments of length $m$. Then, for each segment the average number of segments within distance $r$ of the template is counted via $C_i$. The template matches are then combined and the result is compared to the same process using segments of length $m + 1$. Formally this is defined via:

$$\mathrm{ApEn} = \phi^{m}(x, r) - \phi^{m+1}(x, r) \qquad (14)$$

$$\phi^{m}(x, r) = \sum_i \log(C_i^{m}(x, r)) \qquad (15)$$

$$C_i^{m}(x, r) = \left(\log(d_m(x) \le r)\right) \qquad (16)$$

$$d(a, b) = \max_i (a_i - b_i)^2 \qquad (17)$$

Sample Entropy

Sample entropy (SampEn) is a variation on the ApEn method [62]. SampEn varies $r$ across samples based upon the variance of each epoch while ApEn assumes a fixed value. Additionally SampEn uses two alternate definitions:

$$C_i^{m}(x, r) = (d_m(x) < r) \qquad (18)$$

$$\phi^{m}(x, r) = \sum_i C_i^{m}(x, r) \qquad (19)$$
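The entropy features can be made concrete with a short sketch. The pseudo-entropy follows Eq. 13 as printed. The ApEn function follows the commonly used textbook definition (Chebyshev distance on absolute differences, self-matches included, averages rather than plain sums), which may differ in normalization from Eqs. 14-17; the default values of `m`, `r`, and `eps` and the function names are assumptions made for illustration.

```python
import math
from typing import Sequence


def pseudo_shannon_entropy(epoch: Sequence[float], eps: float = 1e-12) -> float:
    """Eq. 13 as printed: sum of x^2 * log(eps + x^2) over the epoch."""
    return sum(x * x * math.log(eps + x * x) for x in epoch)


def _phi(x: Sequence[float], m: int, r: float) -> float:
    """Average log-fraction of length-m templates within distance r of each template."""
    n = len(x) - m + 1
    templates = [x[i:i + m] for i in range(n)]
    total = 0.0
    for a in templates:
        # Count templates (self-match included) within Chebyshev distance r.
        matches = sum(max(abs(u - v) for u, v in zip(a, b)) <= r
                      for b in templates)
        total += math.log(matches / n)
    return total / n


def approximate_entropy(x: Sequence[float], m: int = 2, r: float = 0.2) -> float:
    """ApEn = phi^m - phi^(m+1); near zero for regular, predictable sequences."""
    return _phi(x, m, r) - _phi(x, m + 1, r)
```

A constant or strictly periodic epoch yields an ApEn close to zero, while irregular epochs yield larger values. SampEn would differ by excluding self-matches and by scaling `r` with the variance of each epoch, as described above.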

Testing Information

The originally referenced paper, [63], used multiple EEG signals to achieve its reported error rate of 49.4% across the whole Physionet T dataset. To compare the performance of that method to the single channel EEG analysis used within this dissertation, 30 second epochs of only Fpz-Cz were analyzed using nine features: mean, max, min, std, FD, DFA, Shannon Entropy, ApEn, and SampEn. The resulting representation was used to train random forests with 40 trees each and 3 features at each level.

Time Domain Performance

The extracted features were first analyzed using a leave-one-out methodology for the DREAMS subject database and the DREAMS patient database. This provided a respective mean/median error rate of 31.5%/31.3% and 42.2%/41.3%. To identify how well this set of features generalizes across different datasets, the features were trained on the patient database and tested on the subject database, producing an error rate of 39.4%/35.4%. Training on the subject database and testing on the patient database yielded an error rate of 42.1%/40.2%.

5.3 Cross Validation

One question that is not investigated frequently within existing sleep study literature is: How well does a given feature representation generalise? It is not unusual to observe authors training a classifier and testing it on random time observations within a single session from a single patient. This validation scheme, however, is extremely inadequate for real world scenarios. A common task for a fully automated sleep staging system would be to train on a set of data which is collected and analysed by one set of clinicians at lab A. This sleep staging system would then be provided to lab B which would then collect their own data, label it according to their technicians' judgement, and then compare the results with those of the fully automated sleep staging system. When training and testing across different labs a more challenging and interesting problem is created.
In particular this is one step beyond training and testing across different individuals. Training and testing on random samples from a single individual can be close to training and testing on the same data due to the slowly time-varying nature of sleep EEG. Testing across different individuals introduces differences in overall sleep patterns, which can affect normalization methods, and it introduces small differences in electrode placement within each session. Moving to testing across different labs results in different noise conditions (e.g. moving from a country with 50 Hz mains noise to one with 60 Hz), and it can result in coarse differences in electrode placement due to lab procedure, different spectral shaping due to different data acquisition toolchains, and disagreement on what labels should be assigned to ambiguous sleep stages.

To best evaluate these various conditions, my method along with others has been evaluated within each dataset as well as across different datasets. First, each method was evaluated within each of the datasets mentioned in section 5.1 using a leave-one-out cross validation method. The results can be seen in Table 4. In this set of experiments the best performing accuracy for each dataset is bolded. These results show that DDS features outperform other methods by a large margin on each dataset. For the four datasets, DDS was between 5 and 13 percentage points more accurate than the next comparable method. This shows that when labeling is provided by a technician for a given dataset, DDS features outperform existing methods and provide overall accuracy similar to that of human raters.

Table 4: Median Accuracy Across Datasets And Methods

                  Dense Spectral   Raw Spectral   Bandpower   Time-Domain
Physionet T       76.38%           63.91%         70.78%      67.19%
Physionet C       92.74%           86.37%         85.30%      78.54%
Dreams Subject    83%              63%            70%         68.70%
Dreams Patient    71.27%           60.15%         63.93%      58.71%
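The margin quoted above can be checked directly against Table 4. The short sketch below transcribes the table (the dictionary layout and the helper name `best_margin` are illustrative assumptions) and reports, for each dataset, the best method, the runner-up, and the gap between them in percentage points.

```python
# Median accuracies (%) transcribed from Table 4.
TABLE_4 = {
    "Physionet T": {"Dense Spectral": 76.38, "Raw Spectral": 63.91,
                    "Bandpower": 70.78, "Time-Domain": 67.19},
    "Physionet C": {"Dense Spectral": 92.74, "Raw Spectral": 86.37,
                    "Bandpower": 85.30, "Time-Domain": 78.54},
    "Dreams Subject": {"Dense Spectral": 83.0, "Raw Spectral": 63.0,
                       "Bandpower": 70.0, "Time-Domain": 68.70},
    "Dreams Patient": {"Dense Spectral": 71.27, "Raw Spectral": 60.15,
                       "Bandpower": 63.93, "Time-Domain": 58.71},
}


def best_margin(row):
    """Return (best method, runner-up, gap in percentage points) for one dataset."""
    ranked = sorted(row.items(), key=lambda item: item[1], reverse=True)
    (best, best_acc), (second, second_acc) = ranked[0], ranked[1]
    return best, second, round(best_acc - second_acc, 2)


MARGINS = {name: best_margin(row) for name, row in TABLE_4.items()}

for name, (best, second, gap) in MARGINS.items():
    print(f"{name}: {best} leads {second} by {gap} points")
```

The gaps come out at 5.6, 6.37, 13.0, and 7.34 points, which matches the stated range of 5 to 13.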

With similar patients, similar recording environments, and similar technicians doing the labeling, dense denoised spectral features provide excellent accuracy. This alone does not show the whole picture, though. If this methodology is used in cases like commercial fully automated sleep classifiers, those similarities are removed and worse performance is expected. Table 5 shows the performance of all four methods when training and testing across all four datasets. To make comparisons easier, each training-testing table cell is color-coded to indicate whether the method was the most accurate (green), second most (blue), third most (yellow), or worst (red). With the notable exception of training on the Physionet T dataset and testing on the DREAMS datasets, Table 5a shows that dense denoised features outperformed other methods in most cases.
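The two evaluation protocols and the color coding described above can be sketched as follows. This is only an outline under stated assumptions: the function names and the subject identifiers are hypothetical, no classifier is trained, and the example input to `color_code` is the Physionet T row of Table 4, used purely to show the ranking.

```python
from itertools import permutations

DATASETS = ["Physionet T", "Physionet C", "Dreams Subject", "Dreams Patient"]

# Color coding described for Table 5: best, second, third, worst.
RANK_COLORS = ["green", "blue", "yellow", "red"]


def leave_one_subject_out(subject_ids):
    """Yield (training subjects, held-out subject); no subject appears in both."""
    subjects = sorted(set(subject_ids))
    for held_out in subjects:
        yield [s for s in subjects if s != held_out], held_out


def cross_dataset_pairs(dataset_names):
    """Every ordered (train dataset, test dataset) pair of distinct datasets."""
    return list(permutations(dataset_names, 2))


def color_code(accuracy_by_method):
    """Map each of up to four methods to its rank color, most accurate first."""
    ranked = sorted(accuracy_by_method, key=accuracy_by_method.get, reverse=True)
    return {method: RANK_COLORS[i] for i, method in enumerate(ranked)}


# Hypothetical subject labels, one entry per epoch.
folds = list(leave_one_subject_out(["s1", "s1", "s2", "s3"]))
pairs = cross_dataset_pairs(DATASETS)
example_colors = color_code({"Dense Spectral": 76.38, "Raw Spectral": 63.91,
                             "Bandpower": 70.78, "Time-Domain": 67.19})
```

Splitting by subject rather than by epoch is what keeps slowly varying EEG from one night out of both the training and testing sets.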

Table 5: Cross Data Set Training and Testing. Panels: (a) Dense Denoised Spectral, (b) Raw Spectral, (c) Bandpower, (d) Time Domain. Each panel is indexed by training dataset (Tr) and testing dataset (Ts) over Phys T, Phys C, Dream S, and Dream P.

Differences in the individual datasets and how they are labeled impact how well all of the methods can accurately label new examples. This in itself is not surprising, and it shows the difficult challenges fully automated sleep stagers experience in real world conditions. It is not unusual for the overall error to be halved when training on data within the target dataset, even though there is no overlap between patients. To address the cross dataset accuracy problem, either the classifier would need to learn to adapt to the new dataset or the process could use more than one sensor, as discussed in a later section.

Parameter Sensitivity Analysis

In order to evaluate how well my method would perform under data variation and variations within implementations, a sensitivity analysis was conducted. Importantly, this analysis should address the minimum requirements for the supervised layer, the performance across different levels of noise, the amount of denoising that can be tolerated, and the sensitivity to internal parameter changes. Each of these parameters has some effect on the resulting classification accuracy of the overall system, but it is expected that the tests in section 4.2 were run with a choice of parameters that was relatively insensitive to changes.

Random Forest Configuration

The first experiment was to identify the minimum requirements for the random forest classifier used throughout the rest of the experiments. This was done to ensure that further experiments could be sampled more densely if retraining of the supervised classifier was required. The two tunable parameters that random forests possess are the number of features considered for each decision and the number of trees in the forest. When building each tree, at each binary decision a single feature is chosen to best split the data based upon the distribution of labels. As this is a simple greedy algorithm, trees would be identical if they were permitted to select from all features within the feature vector. By restricting each decision to a random subset of n choices, each tree is unique with high probability. If n is too high, then tree similarity increases, and if n is too low, then many decisions will be made using low information features, which results in very deep decision trees with poor generalizability. When these trees are aggregated together a random forest is produced. A higher number of trees typically results in higher classification accuracy, but the forest takes longer to both train and evaluate.

To evaluate what was a sensible n and number of trees, a complete leave-one-out cross validation was performed on the Physionet T dataset. n was varied from 5 to 50 features per layer and the number of trees per classifier was varied between 5 and 50. The results are shown in Table 6. These results show that using fewer than 15 features per layer and fewer than 20 trees consistently performs poorly compared to alternative parameter combinations.
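To make the role of n concrete, the sketch below shows a single greedy split restricted to n randomly drawn features. It is a minimal illustration rather than the forest implementation used in these experiments: the Gini impurity criterion, the function names, and the toy rows and labels are all assumptions.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 for a pure node)."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())


def best_split(rows, labels, n, rng):
    """Greedy split search over n randomly chosen features.

    Returns ((weighted impurity, feature index, threshold) or None, candidates).
    """
    num_features = len(rows[0])
    candidates = rng.sample(range(num_features), min(n, num_features))
    best = None
    for feature in candidates:
        for threshold in sorted({row[feature] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
            if not left or not right:
                continue  # a split must send data to both children
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, feature, threshold)
    return best, candidates


# Toy data (assumed): feature 0 separates the classes, feature 1 is constant.
rows = [[0.1, 5.0], [0.2, 5.0], [0.8, 5.0], [0.9, 5.0]]
labels = ["N2", "N2", "W", "W"]
full_search, _ = best_split(rows, labels, n=2, rng=random.Random(0))
restricted, drawn = best_split(rows, labels, n=1, rng=random.Random(0))
```

With n equal to the full feature count every tree would see the same candidates at every node; with n = 1 the split may be forced onto an uninformative feature, which is the tradeoff described above.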
Table 6: Random forest parameter sensitivity (deviation from mean % accuracy), indexed by n and number of trees.

As expected, higher numbers of trees and features per layer tended to increase overall accuracy. Results were relatively insensitive to the number of features per layer once more than 20 were offered.

Amount Of Denoising

The next parameter to be inspected was the amount of denoising. Initially this may seem like a non-issue; however, smoothing over transitions which do occur is a possibility with virtually any denoising solution. As sleep consists of sleep stages which vary in duration from 30 seconds to hours, it was thought that some classifications would perform best on the raw spectral information. These short sleep stages, though, are thought to compose a small portion of an overall night's sleep. To evaluate the tradeoff between high levels of denoising and consistently representing very short time events, two experiments were run. First, the whole Physionet dataset was cross-validated with varying mixtures of the pure denoised signal and the raw spectrum. This was done using 0% to 100% denoising at 10% intervals, with the random forest classifier trained using 20 features per layer and 16 trees.

Figure 18: Trend across denoising amounts. (a) Classification accuracy with different levels of denoising. (b) Average performance across subjects. Axes: Error Rate % against Denoising Amount % and Denoising Mixture %.

While the trend is difficult to see in Fig. 18, it can be seen somewhat more clearly in the second experiment. The second experiment trained a classifier on 70% denoised Physionet subjects and varied the denoising amount on a single subject (ST7011). The accuracy of the model on the subject was recorded between 0% and 100% denoising at 1% intervals. The fine grained sweep resulted in Fig. 19. Based upon the results in Fig. 19, it is clear that any denoising amount between 60% and 85% performs relatively similarly.

Figure 19: Effect of Denoising Amount On ST7011's Classification Accuracy. Axes: Error Rate % against Denoising Amount %.

Tolerance To Noise

With a selection of supervised parameters and denoising amounts from the previous two experiments, the next step was to identify the limits of the system in terms of the noise floor that could be tolerated. As the EEG signals are generated by a physiological process, a $1/f$ falloff of power over frequency can be observed. Thus pink noise was used, scaled in relation to the signal's variance. It was expected that this dissertation's method would perform nearly identically across different amounts of noise until regions of excitation and inhibition looked visually identical. Once these two levels of activity become indistinguishable, accuracy is expected to drop to chance, or below chance, levels. A single representative individual was chosen for this experiment (ST7192), and noise levels between 80 dB and -20 dB SNR with 10 dB increments were evaluated. This resulted in a region insensitive to noise and a section where the classifier failed to identify any meaningful structure in the results.
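Scaling noise in relation to the signal's variance can be sketched as below. The sketch assumes the conventional definition SNR = 10 log10(var(signal) / var(noise)); the text does not spell out the exact convention used. It also substitutes white Gaussian noise for pink noise to stay within the standard library, and all function names and the toy sinusoid are illustrative.

```python
import math
import random
import statistics


def scale_noise_to_snr(signal, noise, snr_db):
    """Rescale `noise` so that 10*log10(var(signal)/var(noise)) equals snr_db."""
    target_noise_var = statistics.pvariance(signal) / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_var / statistics.pvariance(noise))
    return [gain * v for v in noise]


def measured_snr_db(signal, noise):
    """SNR in dB from the population variances of signal and noise."""
    return 10 * math.log10(statistics.pvariance(signal) / statistics.pvariance(noise))


def snr_sweep(start_db, stop_db, step_db):
    """SNR levels from start_db down to stop_db inclusive, in steps of step_db."""
    count = int(round((start_db - stop_db) / step_db)) + 1
    return [start_db - i * step_db for i in range(count)]


# Toy stand-ins (assumed): a sinusoid for the EEG epoch, white noise for pink noise.
rng = random.Random(0)
signal = [math.sin(0.1 * i) for i in range(1000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(1000)]

coarse_levels = snr_sweep(80, -20, 10)   # the 10 dB sweep described above
scaled = scale_noise_to_snr(signal, noise, 20)
noisy = [s + v for s, v in zip(signal, scaled)]
```

Swapping in a pink noise generator only changes how `noise` is produced; the gain computation is unchanged.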

Figure 20: Response to varying levels of pink noise. Axes: Error Rate % against Signal Noise Ratio (dB).

To supplement the initial analysis, a more fine grained analysis was performed between 60 dB and 20 dB SNR at 2.5 dB increments. This produced a similar plot, as seen in Fig. 21.

Figure 21: Fine grained analysis of the effect of pink noise. Axes: Error Rate % against Signal Noise Ratio (dB).

Taking a look at the combined spectra and noise at 40 and 50 dB SNR, it is clear that the initial suspicion about losing regions of inhibition and excitation was accurate. As Fig. 22 shows, the region corresponding to bands 380-700 (18.5-34 Hz) is no longer visible under higher levels of noise. The corresponding segmentations seen in Fig. 22 show that this prevents this dissertation's method from finding regions corresponding to proper time-frequency bands above 45 dB for this individual.

Figure 22: Candidate regions on noisy input. (a) Candidate excitation/inhibition of SUBJECT at +40 dB SNR resulting in an error rate of 16.9%. (b) Candidate excitation/inhibition of SUBJECT at +50 dB SNR resulting in an error rate of 90.6%.

Coarseness & Type Of Time-Frequency Segmentation

This next experiment focused on understanding the coarseness of the final segmentation. After refining the spectrum as detailed in section 3.5, the final denoising has to be performed using the underlying structure found. At this stage the structure is represented in terms of a series of unrefined frequency bands and a collection of time change points, each with a likeliness score. The method used in section 3.6 divides the spectrum into individual frequency bands and time regions between change points with likeliness above zero. Each time-frequency zone is then replaced with a block constant representation. This approach creates an under-segmentation in terms of frequency and it does not reject any unlikely time change points. To investigate whether a suitable scale of time and frequency segmentation was being used, four different denoising setups were used and the time coarseness was adjusted for each. These different configurations are summarized in Table 7.

Table 7: Time-Frequency Segmentation Variants

Configuration   Time Coarseness   Frequency Coarseness   Denoising Type
Type A          Variable          Fine                   Block Constant
Type B          Variable          Fine                   Block Quadratic
Type C          Variable          Coarse                 Block Quadratic
Type D          Variable          Coarse                 Block Constant

The results in other sections were produced using a type A configuration, as it appeared to limit the worst case performance observed in a small number of patients. The type A configuration does discard a considerable amount of information, however. If the hypothesis that the data is represented well by rectangular time-frequency regions holds, then it would be expected that type C and type D configurations should increase the overall classification accuracy.

Figure 23: Raw spectrum of subject ST7192. Axes: Frequency (Hz) against Time (hours).

In particular a number of differences for these methods can be visually observed.
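The block constant replacement used by the type A and type D configurations can be sketched as follows. This is a simplified illustration, not the dissertation's code: the spectrum is assumed to be a list of frequency rows indexed as `spectrum[f][t]`, the band edges and change points are assumed to be given as indices, the block value is assumed to be the zone mean, and the function names are hypothetical.

```python
def select_change_points(scored_points, threshold=0.0):
    """Keep the time indices whose likeliness score exceeds the threshold."""
    return sorted(t for t, score in scored_points if score > threshold)


def block_constant(spectrum, band_edges, change_points):
    """Replace every time-frequency zone of spectrum[f][t] with its mean value."""
    num_freqs = len(spectrum)
    num_times = len(spectrum[0])
    f_bounds = [0] + sorted(band_edges) + [num_freqs]
    t_bounds = [0] + sorted(change_points) + [num_times]
    denoised = [row[:] for row in spectrum]
    for f0, f1 in zip(f_bounds, f_bounds[1:]):
        for t0, t1 in zip(t_bounds, t_bounds[1:]):
            cells = [spectrum[f][t] for f in range(f0, f1) for t in range(t0, t1)]
            if not cells:
                continue  # empty zone, e.g. a repeated boundary index
            mean = sum(cells) / len(cells)
            for f in range(f0, f1):
                for t in range(t0, t1):
                    denoised[f][t] = mean
    return denoised


# Toy 4x4 spectrum (assumed values) with one band edge and one change point.
toy_spectrum = [
    [1.0, 3.0, 10.0, 10.0],
    [1.0, 3.0, 10.0, 10.0],
    [5.0, 5.0, 0.0, 2.0],
    [5.0, 5.0, 0.0, 2.0],
]
kept_points = select_change_points([(2, 0.7), (3, 0.0)])
toy_denoised = block_constant(toy_spectrum, band_edges=[2], change_points=kept_points)
```

Raising the `threshold` in `select_change_points` coarsens the time segmentation, while passing fewer `band_edges` coarsens the frequency segmentation; a block quadratic variant would fit a low-order surface per zone instead of a mean.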

Fig. 23 shows the raw spectrum for subject ST7192. Focusing on the first hour of this recording, you can see that at 2 Hz and 12 Hz there is some form of excitation and 10 Hz is somewhat inhibited. In Fig. 24, types A, B, & C show this pattern of excitation and inhibition. Type D captures the regions of excitation, but averages over the region of inhibition. Looking at the 2 Hz excitation more closely, the boundary where very-high (red) activity transitions to high (yellow) activity is not fixed over the first hour. This very-high activity grows from 1 Hz to 3 Hz over the first hour and generally intensifies. Block constant methods fail to capture this behavior, with type C producing the results which visually match the structure best.

Figure 24: Examples of denoised subject ST7192's spectra under different approaches. (a) Type A (b) Type B (c) Type C (d) Type D

Visual inspection provides some insights into what each approach removes from the noisy spectrum; however, it overlooks some of the nuances in the job of a discriminative classifier. For example, the inhibition at 12 Hz which occurs at hours 2.9, 4.9, 6.1, and 7.1 is represented accurately in all four of the types. To understand which structure is a nuisance and which affects classification, all approaches were run with
