Assessment of Automated Scoring of Polysomnographic Recordings in a Population with Suspected Sleep-disordered Breathing

Stephen D. Pittman, MSBME, RPSGT 1; Mary M. MacDonald, RPSGT 1; Robert B. Fogel, MD 1,2; Atul Malhotra, MD 1,2; Koby Todros 3; Baruch Levy 3; Amir B. Geva, DSc 3; David P. White, MD 1,2

1 Division of Sleep Medicine, Brigham and Women's Hospital, Boston, Mass; 2 Harvard Medical School, Boston, Mass; 3 WideMed Ltd., Omer, Israel

Study Objectives: To assess the accuracy of an automated system (Morpheus I Sleep Scoring System) for analyzing and quantifying polysomnographic data from a population with sleep-disordered breathing.

Setting: Sleep laboratory affiliated with a tertiary care academic medical center.

Measurements and Results: Thirty-one diagnostic polysomnograms were blindly analyzed prospectively with the investigational automated system (A) and manually by 2 registered polysomnography technologists (M1 & M2) from the same laboratory. Sleep stages, arousals, periodic limb movements, and respiratory events (apneas and hypopneas) were scored by all 3. Agreement, Cohen κ, and intraclass correlation coefficients were tabulated for each variable and compared between scoring pairs (A-M1, A-M2, M1-M2). A total of 26,876 epochs (224 hours of recording time) were analyzed. For sleep staging, agreement/κ were A-M1: 78%/0.67, A-M2: 73%/0.61, and M1-M2: 82%/0.73. The mean respiratory disturbance indexes were M1: 20.6 ± 23.0, M2: 22.5 ± 24.5, and A: 23.7 ± 23.4 events per hour of sleep. The respiratory disturbance index concordance between each scoring pair was excellent (intraclass correlation coefficients ≥ 0.95 for all pairs), although there was disagreement in the classification of moderate sleep-disordered breathing (percentage of positive agreement: A-M1, 37.5% and A-M2, 44.4%), defined as a respiratory disturbance index between 15 and 30 events per hour of sleep. For respiratory-event detection, agreement/κ were A-M1 and A-M2: 90%/0.66 and M1-M2: 95%/0.82. The agreement and κ for limb movement detection were A-M1: 93%/0.68, A-M2: 92%/0.66, and M1-M2: 96%/0.77. The scoring of arousals was less reliable (agreement range: 76%-84%, κ range: 0.28-0.57) for all pairs.

Conclusions: Agreement between manual scorers in a population with moderate sleep-disordered breathing was close to the average pairwise agreement of 87% reported in the Sleep Heart Health Study. The automated classification of sleep stages was also close to this standard. The automated scoring system holds promise as a rapid method to score polysomnographic records, but expert verification of the automated scoring is required.

Key Words: automated scoring, obstructive sleep apnea, sleep apnea, sleep-disordered breathing, arousal, periodic limb movement, polysomnography

Citation: Pittman SD; MacDonald MM; Fogel RB et al. Assessment of automated scoring of polysomnographic recordings in a population with suspected sleep-disordered breathing. SLEEP 2004;27(7):1394-1403.

Disclosure Statement
This was an industry supported study by WideMed Ltd. Mr. Pittman has received research support from Respironics Inc., Itamar Medical, WideMed Ltd., and the Alfred E. Mann Foundation; served as a consultant to Itamar Medical; and is currently employed by Respironics Inc. Dr. Geva is the president and CEO of WideMed Ltd. Ms. MacDonald has received support from Cephalon, Respironics Inc., Itamar Medical, and WideMed Ltd. Dr. Malhotra receives research support from Respironics Inc. Dr. White has received research support and consulting fees from Respironics Inc., Itamar Medical, the Alfred E. Mann Foundation, and WideMed Ltd; and has served as a consultant for Aspire Medical. Dr. Fogel is a member of the Visiting Speaker's Bureau for Wyeth Pharmaceuticals. Mr. Todros and Mr. Levy are employed by WideMed Ltd. The manual scoring and statistical analyses for this study were done at the sleep laboratory affiliated with Brigham and Women's Hospital. The automated algorithms used were developed by WideMed Ltd.; however, the actual analysis of the data was performed onsite at the sleep laboratory by the authors with some technical assistance from WideMed Ltd. The authors maintained control of the data throughout the entire study. The paper was written by the authors.

Submitted for publication April 2004
Accepted for publication July 2004

Address correspondence to: David P. White MD, Sleep Disorders Research Program @BIDMC, Division of Sleep Medicine, Brigham and Women's Hospital, 75 Francis Street, Boston, MA 02115-5817; Tel: (617) 732-5778; Fax: (617) 732-7337; Email: dpwhite@rics.bwh.harvard.edu

SLEEP, Vol. 27, No. 7, 2004 1394

INTRODUCTION

Polysomnography (PSG) remains the standard diagnostic test for the evaluation of sleep and the detection of sleep pathologies such as sleep-disordered breathing (SDB) and periodic limb movement (PLM) disorder. Digital PSG acquisition systems have allowed the diagnostic capacity of sleep disorder centers across the country to increase in recent years, but the manual scoring of sleep stages, brief arousals from sleep, SDB events, and PLMs remains the standard. To date, computer-assisted scoring methods have rarely been rigorously validated and, when evaluated, have rarely performed well. Manual scoring relies on the application of standard rules to identify specific waveforms and quantify their duration. This process is time consuming, expensive, and subject to variability in interpretation. 1 Previous studies have reported a range of agreement in sleep-stage classification among human scorers, with scoring pairs from different laboratories tending to agree less than scorers from the same laboratory.
2 One investigation of interscorer variability was conducted as part of the Sleep Heart Health Study following thorough training of the scoring technologists at a central scoring site to maximize the uniform application of scoring rules and improve reproducibility. 3 The average pairwise agreement of sleep staging among 3 scorers on 30 randomly selected PSG records using epoch-by-epoch comparisons was 87%, although the amount of SDB was relatively low in this population (average respiratory disturbance index [RDI] was 6 events per hour using a 4% desaturation requirement).

Investigations of automated scoring of adult human sleep to date have considered single components of PSG scoring such as sleep staging, 4-7 detection of respiratory events, 8 arousals from sleep, 9 and PLMs. 10 Recent sleep-stage scoring methods have utilized power spectral analysis, 4 a neural network model based on feature extraction, 5 and a Gaussian Observation Hidden Markov Model. 6 Epoch-by-epoch agreement in sleep-stage scoring between these methods and manual scoring varied from 74% to 84.5%, but these comparisons are for normal populations without SDB. Taha and coworkers 8 reported an epoch-by-epoch agreement of 93.1% between manual and automated scoring of respiratory events, including apneas and hypopneas, on 10 PSG records. The method of automated arousal scoring by Pillar and coworkers 9 correlated well with the manual scoring of arousals using standard criteria, 11 but the automated scoring was performed on a measure of peripheral arterial vasomotor tone that is not a standard PSG measure. Thus, this method may have limited clinical utility.

The purpose of the present study was to evaluate interscorer agreement in sleep-stage assignment and the detection of defined abnormal events during sleep between 2 expert human scorers and an automated sleep-scoring system. The study was performed on a data set from subjects with suspected SDB.

MATERIALS AND METHODS

All-night, in-laboratory, diagnostic PSG records from adult patients referred to the clinical sleep laboratory of Brigham and Women's Hospital with suspected obstructive sleep apnea were included in this study. These were not consecutive records, but a random sample of records from patients who disclosed on a comprehensive questionnaire between June and December of 2002 that they were interested in being contacted about research studies conducted at the sleep laboratory. The analysis of these records was approved by the Human Research Committee of Brigham and Women's Hospital.

Subjects

Thirty-one subjects (9 women) with suspected SDB participated in the study. The mean age of the subjects was 44.3 ± 11.3 (range 21-62) years, and the mean body mass index was 33.7 ± 7.0 kg/m². The mean Epworth Sleepiness Scale score was 9.3 ± 4.7.
Protocol

Recorded signals included electroencephalogram ([EEG] C4-A1, C3-A2, O2-A1, and O1-A2), electrooculogram (EOG), submental and bilateral tibial electromyogram (EMG), electrocardiogram (ECG), airflow (nasal-oral thermistor and nasal pressure [PTAF2, Pro-Tech Services, Woodinville, Wash]), chest and abdominal movement (piezo bands), arterial oxyhemoglobin saturation (Model 930 Pulse-Oximeter, Respironics, Murrysville, Penn), body position, and snoring intensity. All physiologic data were collected and stored using the ALICE3 digital PSG system (Respironics). The EEG and EMG channels were sampled at 100 Hz.

The PSGs were then scored manually and independently from the raw data by 2 registered PSG technologists (M1 & M2). Sleep was staged according to standard criteria in 30-second epochs from a high-resolution computer display. 12 The predominant sleep stage was scored for each epoch as wake, stage 1, stage 2, stage 3, stage 4, or rapid eye movement (REM). Arousals from sleep were identified according to American Sleep Disorders Association (ASDA) guidelines. 11 An arousal index was calculated as the number of arousals per hour of sleep. Respiratory events were scored according to recent American Academy of Sleep Medicine (AASM) Medicare guidelines. 13 In particular, an apnea was scored if airflow was absent for 10 seconds, and a hypopnea was scored if thoracoabdominal movement or airflow was reduced by 30% compared to baseline for at least 10 seconds with at least a 4% oxygen desaturation. An RDI was then calculated based on the number of apneas plus hypopneas per hour of sleep. The PLMs were scored according to ASDA criteria. 14 A PLM index (PLMI) was calculated as the number of PLMs per hour of sleep.

Unscored copies of all records were then queued for automated batch scoring using the Morpheus I Sleep Scoring System (A) to classify sleep stages and then detect arousals, SDB events, and PLMs according to the same ASDA and AASM criteria.
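The indexes defined in the Protocol (arousal index, RDI, PLMI) all share one form: a count of events divided by hours of sleep. The sketch below illustrates that arithmetic together with the hypopnea rule as stated above. The function names and the example numbers are hypothetical, and treating the thresholds as inclusive is an assumption rather than something the text specifies.

```python
def events_per_hour(event_count: int, total_sleep_time_min: float) -> float:
    """Events per hour of sleep; the same form is used for RDI, arousal index, and PLMI."""
    if total_sleep_time_min <= 0:
        raise ValueError("total sleep time must be positive")
    return event_count / (total_sleep_time_min / 60.0)


def is_hypopnea(reduction_pct: float, duration_s: float, desaturation_pct: float) -> bool:
    """Hypopnea rule as described in the Protocol (thresholds assumed inclusive):
    a 30% reduction in airflow or thoracoabdominal movement, lasting at least
    10 seconds, with at least a 4% oxygen desaturation."""
    return reduction_pct >= 30.0 and duration_s >= 10.0 and desaturation_pct >= 4.0


# Hypothetical example: 120 apneas plus hypopneas over 348 minutes of sleep.
rdi = events_per_hour(event_count=120, total_sleep_time_min=348.0)
print(f"RDI = {rdi:.1f} events per hour of sleep")
```

The same helper would give an arousal index or PLMI by passing the corresponding event count.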
A database containing the scores for each epoch for all PSGs for all scorers (M1, M2, A) was created. Stages 3 and 4 were grouped into delta sleep. Only epochs scored as sleep were analyzed for arousals, respiratory events, and limb movements.

Automated Algorithms

The Morpheus I system performs fully automated PSG data analysis that includes sleep staging, arousal scoring, PLM detection, snoring detection, single-channel ECG analysis, and respiratory-event scoring. These analyses are enabled by utilizing novel algorithms to process EEG, EOG, EMG, ECG, snoring, and respiratory-related signals.

The Morpheus I EEG analysis is based on dynamic state modeling of the EEG frequency during sleep. Accordingly, a dynamic, 4-frequency-state model with high-frequency, low-energy mixed-frequency, high-energy mixed-frequency, and low-frequency states is applied for modeling the behavior of EEG signals for a given patient. The EEG signals are initially partitioned into quasi-stationary segments by applying a novel adaptive segmentation algorithm. The patient's intrinsic states are then adaptively recognized by utilizing fuzzy clustering 15,16 of features extracted from these quasi-stationary EEG segments. Membership levels in each of the recognized states are attributed to each EEG segment, creating for each state a membership function, which sketches a continuous measurement of intrinsic EEG activity. The membership level in the high-frequency state is dominant in wakefulness, while the membership levels in the low-energy mixed-frequency, high-energy mixed-frequency, and low-frequency states are dominant in sleep stages 1 and rapid eye movement (REM), sleep stage 2, and delta sleep, respectively. The EEG analysis also includes detection of K-complexes, sleep spindles, movement artifacts, and electrode artifacts.

The 2 EOG signals are simultaneously partitioned into quasi-stationary segments, while features such as variance and cross-correlation are extracted from parallel segments. Segments with enhanced relative energy with respect to their environment, for which the cross-correlation measure decreases below a certain threshold, are classified as REM.

The EMG signal is also partitioned into quasi-stationary segments, and time-frequency features are extracted from each segment. Segments with enhanced relative energy with respect to their environment are classified as EMG bursts, which are useful for arousal scoring. Zones of reduced EMG energy level are detected by applying hierarchical fuzzy clustering 17 of features extracted from each EMG segment.
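The text does not give the clustering equations used by the system, so the following is only a sketch of how membership levels in the 4 EEG states might be computed. It assumes the standard fuzzy c-means membership formula, and the 2-dimensional feature space, the cluster centers, and the segment values are all hypothetical. It is not the Morpheus I algorithm.

```python
import math

# Hypothetical cluster centers in an illustrative feature space
# (dominant frequency in Hz, relative energy). These values are assumptions.
CENTERS = {
    "high-frequency": (12.0, 1.0),
    "low-energy mixed-frequency": (7.0, 1.0),
    "high-energy mixed-frequency": (7.0, 4.0),
    "low-frequency": (2.0, 6.0),
}


def fuzzy_memberships(features, centers=CENTERS, m=2.0):
    """Standard fuzzy c-means membership of one segment in each state.

    m > 1 is the fuzzifier; memberships are non-negative and sum to 1.
    """
    dists = {name: math.dist(features, c) for name, c in centers.items()}
    # A segment lying exactly on a center belongs fully to that state.
    exact = [name for name, d in dists.items() if d == 0.0]
    if exact:
        return {name: (1.0 / len(exact) if name in exact else 0.0) for name in dists}
    exponent = 2.0 / (m - 1.0)
    return {
        name: 1.0 / sum((d_i / d_j) ** exponent for d_j in dists.values())
        for name, d_i in dists.items()
    }


# Hypothetical quasi-stationary EEG segment with fast, low-energy activity.
segment = (11.0, 1.2)
memberships = fuzzy_memberships(segment)
dominant_state = max(memberships, key=memberships.get)
print(dominant_state, {k: round(v, 3) for k, v in memberships.items()})
```

Evaluating the memberships segment by segment yields one continuous curve per state, which is the kind of membership function the paragraph above describes.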

Artifacts such as ECG tracings and 60-Hz noise are filtered from the EEG, EOG, and EMG by applying adaptive noise-canceling techniques.

The information gathered from the EEG, EOG, and EMG signals is fused according to the Rechtschaffen and Kales criteria for automatic sleep staging, in the following manner: (1) the membership level in the high-frequency state and detected movements are used for scoring stage wake; (2) the membership level in the low-energy mixed-frequency state is used for scoring stage 1; (3) the membership level in the high-energy mixed-frequency state, detected K-complexes, and spindles are used for scoring sleep stage 2; (4) the membership level in the low-energy mixed-frequency state, detected zones of reduced EMG activity, and detected REM are used for scoring stage REM; and (5) the membership level in the low-frequency state and the peak-to-peak amplitudes of the EEG segments are used for scoring delta sleep.

The automatic arousal-scoring algorithm is based on the ASDA set of rules for arousal scoring. The algorithm utilizes membership levels for each EEG segment, together with scored sleep stages and detected EMG energy bursts. The algorithm also utilizes detected respiratory events, which are related to EEG alternating patterns in a time window of 10 seconds before and after a respiratory event.

Both limb channels are adaptively segmented into quasi-stationary segments. Segments with specific duration and with enhanced relative energy with respect to their environment are detected as limb movements. Standard criteria for the classification of PLMs are applied to the detected limb movements in order to score a PLMI. The snoring detection algorithm works like the algorithm for detection of EMG energy bursts.
By applying wavelet-based and template-matching algorithms, the Morpheus I system performs full 1-lead (lead II) ECG analysis, which includes full-beat segmentation; pathologic beat detection (premature atrial contraction, premature ventricular contraction); heart-rate variability analysis; heart-rate analysis (bradycardia, tachycardia, heart-rate drops, and heart-rate elevations); ST-segment analysis; QT analysis; and arrhythmia detection (supraventricular tachycardia, ventricular tachycardia). We did not validate the ECG analysis in the present study.

For respiratory signals, the Morpheus I system utilizes a fuzzy-logic decision algorithm to reach more realistic decisions regarding the detection of apnea or hypopnea. The respiratory recordings are first segmented using a fuzzy-logic minimum model-derived distance approach. In addition to the respiratory tracing, the system searches around each suspected event for arousals (5 seconds before and 5 seconds after the suspected event), desaturations (5 seconds before and 70 seconds after the suspected event), and limb movements (5 seconds before and 70 seconds after the suspected event). A fuzzy-logic decision algorithm is then applied to calculate the probability that the suspected event complies with the respiratory-event rules by integrating the information gathered from the different sources. Once an event is detected, adaptive segmentation and spectral analysis are applied to classify the type of the respiratory event.

Data Analysis

The variables assessed for each scorer included total sleep time, sleep efficiency (total sleep time / time in bed), sleep-onset latency (time from lights out to the first epoch of sleep), REM-onset latency (time from lights out to the first epoch of REM), minutes of stage 1 sleep, minutes of stage 2 sleep, minutes of delta sleep, minutes of REM sleep, minutes of wake, RDI, Arousal Index, and PLMI.
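The sleep variables listed above follow directly from an epoch-by-epoch hypnogram. The sketch below derives them under stated assumptions: the hypnogram is a list of stage labels in 30-second epochs that starts at lights out and ends at lights on, and the label strings and the example night are hypothetical.

```python
EPOCH_MIN = 0.5  # 30-second epochs, as used for sleep staging in this study
SLEEP_STAGES = {"1", "2", "delta", "REM"}  # hypothetical label strings


def sleep_variables(hypnogram):
    """Derive summary variables from a list of per-epoch stage labels.

    Assumes the list starts at lights out and ends at lights on.
    """
    time_in_bed = len(hypnogram) * EPOCH_MIN
    total_sleep_time = sum(1 for s in hypnogram if s in SLEEP_STAGES) * EPOCH_MIN
    first_sleep = next((i for i, s in enumerate(hypnogram) if s in SLEEP_STAGES), None)
    first_rem = next((i for i, s in enumerate(hypnogram) if s == "REM"), None)
    return {
        "total_sleep_time_min": total_sleep_time,
        # Sleep efficiency = total sleep time / time in bed, as a percentage.
        "sleep_efficiency_pct": 100.0 * total_sleep_time / time_in_bed if time_in_bed else 0.0,
        # Latencies are measured from lights out, following the definitions above.
        "sleep_onset_latency_min": None if first_sleep is None else first_sleep * EPOCH_MIN,
        "rem_onset_latency_min": None if first_rem is None else first_rem * EPOCH_MIN,
    }


# Hypothetical 10-minute recording: 20 epochs.
night = ["wake"] * 4 + ["1"] * 2 + ["2"] * 6 + ["REM"] * 4 + ["wake"] * 4
summary = sleep_variables(night)
print(summary)
```

A scorer that misses the first REM period would leave total sleep time nearly unchanged while pushing `rem_onset_latency_min` out to the next REM period, which is why the latency is the more fragile variable.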
Comparisons of each scoring pair (M1-M2, A-M1, A-M2) included epoch-by-epoch assessments of agreement and Cohen κ 18 for the classification of sleep stages (5 × 5 matrix: wake, stage 1, stage 2, delta, and REM sleep; 4 × 4 matrix: wake, stage 1 + stage 2, delta, and REM sleep; and 3 × 3 matrix: wake, non-REM, and REM sleep) and detection of arousals (3 × 3 matrix: 0, 1, and 2 arousals per epoch), SDB events (3 × 3 matrix: 0, 1, and 2 events per epoch), and PLMs (3 × 3 matrix: 0, 1, and 2 events per epoch). The percentage of agreement for a scoring pair was calculated by summing the main diagonal and dividing by the total number of epochs. The κ represents the nonrandom component of this agreement. Bland-Altman plots were used to evaluate the mean difference between scorers for selected parameters. 19 Concordance was assessed using the intraclass correlation coefficient (ICC).

Receiver-operator characteristic curves were constructed for the RDIs calculated for each scoring pair to assess the performance of the automated algorithm across the spectrum of SDB severity (RDI cutoffs of 5, 10, 15, 20, and 30 events per hour for defining obstructive sleep apnea). The area under the receiver-operator characteristic curve was then calculated for each threshold and reported with the standard error and the limits of the 95% confidence interval. Using AASM criteria, 20 normal breathing was defined as an RDI less than 5 events per hour, mild SDB as an RDI of 5 or more but less than 15 events per hour, moderate SDB as an RDI of 15 or more but less than 30 events per hour, and severe SDB as an RDI of 30 or more events per hour of sleep.

All results are provided as means ± 1 SD except for the Bland-Altman mean difference, which used ± 2 SD for the limits of agreement. Statistical significance was considered to be present when P < .05.
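As an illustration of the agreement statistics described above, the sketch below computes the percentage of agreement (main diagonal divided by total epochs) and Cohen κ from a square matrix of epoch counts. The function name and the 2 × 2 example matrix are hypothetical and are not data from this study.

```python
def agreement_and_kappa(matrix):
    """Return (observed agreement, Cohen kappa) for a square count matrix.

    matrix[i][j] is the number of epochs given category i by one scorer
    and category j by the other.
    """
    k = len(matrix)
    n = sum(sum(row) for row in matrix)
    # Observed agreement: sum of the main diagonal over the total number of epochs.
    observed = sum(matrix[i][i] for i in range(k)) / n
    # Agreement expected by chance, from the two scorers' marginal totals.
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(matrix[i][j] for i in range(k)) for j in range(k)]
    expected = sum(row_totals[i] * col_totals[i] for i in range(k)) / (n * n)
    kappa = (observed - expected) / (1.0 - expected)
    return observed, kappa


# Hypothetical 2 x 2 example (event absent / event present in an epoch).
example = [[20, 5],
           [10, 15]]
observed, kappa = agreement_and_kappa(example)
print(f"agreement = {observed:.0%}, kappa = {kappa:.2f}")
```

In this example the scorers agree on 70% of epochs, but half of that agreement is expected by chance, so κ is only 0.40. The same effect explains how a high raw agreement can coexist with a low κ when one category dominates.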
The paired t test was used to assess significance when the means were normally distributed for each scoring pair; otherwise, the nonparametric Wilcoxon signed-rank test was used.

RESULTS

A total of 26,876 epochs (224.0 hours of recording time) from 31 diagnostic PSG records were analyzed in this study. Epoch-by-epoch sleep-stage scoring agreement for each scoring pair is shown in Table 1. Agreement between the 2 manual scorers was 82.1% with a κ of 0.73. Automated scoring identified 57 epochs (0.02%) that did not meet predefined reliability criteria for sleep-stage scoring. Thus, 26,819 epochs were used to calculate agreement between the automated scoring system and each manual scorer. Agreement between M1 and A was 77.7% with a κ of 0.67. Agreement between M2 and A was 73.3% with a κ of 0.61. Representative hypnograms for each scorer are shown for a single PSG in Figure 1.

Agreement between M1 and A improved to 82.6% with a κ of 0.71 when stages 1 and 2 sleep were combined to yield a 4 × 4 matrix (wake, light sleep, delta sleep, and REM sleep). Agreement between M2 and A was 79.9% with a κ of 0.65 for the same 4 × 4 matrix. Likewise, agreement between M1 and M2 improved to 88.7% with a κ of 0.80 when stages 1 and 2 sleep were combined. When stages 1, 2, and delta sleep were combined to yield a 3 × 3 matrix (wake, non-REM, and REM sleep), agreement between M1 and A was 88.0% with a κ of 0.75. Agreement between M2 and A improved to 88.0% with a κ of 0.74 for the same 3 × 3 matrix. Likewise, agreement between M1 and M2 was 93.5% with a κ of 0.87.

Table 2 shows the variables commonly used to assess sleep and sleep disorders for each scorer (M1, M2, and A). The mean total sleep time was 8.9 minutes longer for A than for M1 (P < .05) and 12.5 minutes longer than for M2 (P < .01). The total sleep time was highly concordant among all scoring pairs (ICC ≥ 0.92), as shown in Table 2. The mean sleep-onset latency was 3.8 minutes shorter for A than for M1 (NS) and 3.9 minutes shorter than for M2 (NS). Concordance was high for calculating the sleep-onset latency among all scoring pairs (ICC ≥ 0.86), as shown in Figure 2, with the manual scoring pair demonstrating the best concordance. On the other hand, the mean REM-onset latency was considerably longer for A than for M1 (45.7-minute difference, P < .01) and for M2 (48.1-minute difference, P < .01) because the automated system missed the first REM period in 10 (32%) subjects.

In terms of sleep staging, the automated system tended to score less stage 2 sleep (17.2 minutes less than M1, P < .01, and 7.8 minutes less than M2, NS) and wakefulness (9.8 minutes less than M1, P < .05, and 13.4 minutes less than M2, P < .01) and more delta sleep (11.9 minutes more than M1, P < .01, and 29.4 minutes more than M2, P < .01) than either manual scorer. Significant differences were observed for the mean duration of stage 1 sleep for all scoring pairs (P < .01). The differences between the REM sleep duration for the automated system and both M1 and M2 were not significant. The ICCs are provided for all variables in Table 2.

The automated system scored fewer arousals than either manual scorer (8.1 arousals per hour less than M1, P < .01, and 14.2 arousals per hour less than M2, P < .01). Concordance was better between M1 and A (ICC = 0.72) than between M2 and A (ICC = 0.58). The agreement for detecting arousals between A and manual scoring seems acceptable (76.2% agreement with M1 and 76.1% agreement with M2), but the κ values shown in Table 3 suggest the agreement is only marginally better than chance (κ = 0.28 for A-M1 and 0.30 for A-M2). However, the κ between manual scorers for arousals was only 0.57, suggesting agreement corrected for chance was relatively poor regardless of the scoring pair.

The automated system scored slightly more respiratory events than either manual scorer (3.1 events per hour more than M1, P < .01, and 1.2 events per hour more than M2, NS) as shown in

Table 1. Epoch-by-Epoch Comparisons for Sleep-Stage Scoring for Each Scoring Pair*

Stage      Comparison                     M1-M2     A-M1      A-M2
Wake       Wake                           4,835     4,037     4,162
           PPA+, %                        80.9      68.7      69.6
           Stage 1                        535       669       857
           Stage 2                        376       792       609
           Delta                          4         28        18
           REM                            223       353       334
Stage 1    Stage 1                        731       408       892
           PPA+, %                        21.3      13.1      19.9
           Stage 2                        1,765     1,327     1,765
           Delta                          0         0         4
           REM                            408       710       966
Stage 2    Stage 2                        12,233    11,694    11,022
           PPA+, %                        77.1      73.5      68.9
           Delta                          1,293     1,427     2,080
           REM                            194       675       529
Delta      Delta                          1,181     2,013     1,146
           PPA+, %                        47.7      58.0      35.3
           REM                            0         0         0
REM        REM                            3,098     2,686     2,435
           PPA+, %                        79.0      60.7      57.1
Total epochs, no.                         26,876    26,819    26,819
Total agreements, no.                     22,078    20,838    19,657
Total agreement, %                        82.1      77.7      73.3
κ (wake, 1, 2, delta, REM)                0.73      0.67      0.61
κ (wake, 1+2, delta, REM)                 0.80      0.71      0.65
κ (wake, non-REM, REM)                    0.87      0.75      0.74

*Comparisons with automated scoring excluded 57 epochs (0.02%) that were identified as epochs that did not meet reliability criteria for automated scoring. The number of occurrences of the specified stage combinations relative to total number of epochs scored. A refers to automated scoring with the Morpheus I Sleep Scoring System; M1, manual scoring for Scorer 1; M2, manual scoring for Scorer 2; PPA+, the percentage of positive agreement for the specified sleep stage [eg, PPA+ associated with wake (M1-M2) = 4,835/(4,835 + 535 + 376 + 4 + 223) = 80.9%]; REM, rapid eye movement.
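The PPA+ example in the Table 1 footnote can be reproduced for every stage. The sketch below does so for the M1-M2 column, using the counts exactly as printed in Table 1. The dictionary layout and function names are hypothetical conveniences, not part of the study's software.

```python
# M1-M2 column of Table 1: epochs per unordered stage combination.
M1_M2 = {
    ("Wake", "Wake"): 4835, ("Wake", "Stage 1"): 535, ("Wake", "Stage 2"): 376,
    ("Wake", "Delta"): 4, ("Wake", "REM"): 223,
    ("Stage 1", "Stage 1"): 731, ("Stage 1", "Stage 2"): 1765,
    ("Stage 1", "Delta"): 0, ("Stage 1", "REM"): 408,
    ("Stage 2", "Stage 2"): 12233, ("Stage 2", "Delta"): 1293, ("Stage 2", "REM"): 194,
    ("Delta", "Delta"): 1181, ("Delta", "REM"): 0,
    ("REM", "REM"): 3098,
}


def ppa(counts, stage):
    """Percentage of positive agreement: epochs both scorers gave this stage,
    divided by all epochs in which at least one scorer gave this stage."""
    agreed = counts[(stage, stage)]
    involved = sum(n for pair, n in counts.items() if stage in pair)
    return 100.0 * agreed / involved


def total_agreement(counts):
    """Sum of the matching-stage counts divided by the total number of epochs."""
    agreed = sum(n for (a, b), n in counts.items() if a == b)
    return 100.0 * agreed / sum(counts.values())


for stage in ["Wake", "Stage 1", "Stage 2", "Delta", "REM"]:
    print(stage, round(ppa(M1_M2, stage), 1))   # 80.9, 21.3, 77.1, 47.7, 79.0
print("Total", round(total_agreement(M1_M2), 1))  # 82.1
```

The low PPA+ for stage 1 (21.3%) beside an overall agreement of 82.1% shows how the total is dominated by the far more numerous stage 2 and wake epochs.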

Figure 1. Representative hypnograms for each scorer are shown for 1 subject. M1 refers to manual scoring for Scorer 1; M2, manual scoring for Scorer 2; A, automated scoring with the Morpheus I Sleep Scoring System; REM, rapid eye movement.

Table 2 Standard Variables Commonly Used to Quantify Sleep and Sleep Disorders for Each Scorer. Variable Scorer* Intraclass Correlation Coefficient M 1 M 2 A M 1 vs M 2 M 1 vs A M 2 vs A Total sleep time, min 348.1 ± 63.2 344.5 ± 61.1 357.0 ± 64.5 0.98 0.92 0.94 υ Sleep efficiency, % 82.7 ± 11.9 81.9 ± 11.4 84.9 ± 12.3 0.96 0.87 0.91 υ Sleep-onset latency, min 25.9 ± 24.3 26.0 ± 24.2 22.1 ± 21.1 1.00 0.86 0.86 υ REM-onset latency, min 129.5 ± 78.6 127.1 ± 74.1 175.2 ± 81.0 0.99 0.46 0.43 υ Stage wake, min 85.3 ± 48.5 88.9 ± 45.9 75.5 ± 50.2 0.96 0.87 0.91 υ Stage 1, min 18.6 ± 15.1 48.5 ± 19.9 38.1 ± 27.2 0.22 0.37 0.53 ΣΝ Γ Stage 2, min 231.2 ± 54.0 221.8 ± 45.6 214.0 ± 47.9 0.80 0.84 0.72 Ν Stage 1 & 2, min 249.9 ± 55.4 270.4 ± 52.5 252.2 ± 58.6 0.86 0.87 0.73 Σ Γ Stage delta, min 38.2 ± 26.1 20.7 ± 23.2 50.1 ± 25.5 0.57 0.53 0.18 συ Stage REM, min 59.9 ± 29.7 53.3 ± 26.4 54.7 ± 34.4 0.92 0.72 0.76 σ Arousal Index, events/h 30.1 ± 18.5 36.2 ± 17.4 22.0 ± 15.5 0.81 0.72 0.58 ΣΝ Γ PLM Index, events/h 13.1 ± 18.6 16.3 ± 21.3 18.8 ± 25.1 0.93 0.61 0.65 συ RDI, events/h 20.6 ± 23.0 22.5 ± 24.5 23.7 ± 23.4 0.99 0.95 0.95 συ *Data for M 1 (manual scoring for Scorer 1), M 2 (manual scoring for Scorer 2), and A (automated scoring with the Morpheus I Sleep Scoring System) are expressed as mean ± 1 SD. A significant difference in the mean for each scoring pair is shown using the upper case Greek letter Σ, Ν, or Γ for the paired t test and the corresponding lower case Greek letter σ, υ, or for the nonparametric Wilcoxon signed-rank test for scoring pairs M 1 - M 2, M 1 -A, and M 2 A, respectively. Intraclass correlation coefficients are also given for each scoring pair to assess concordance. REM refers to rapid eye movement; PLM, periodic limb movement; RDI, respiratory disturbance index. Figure 2 Scatter plots of the sleep-onset latencies (minutes) are shown for each scoring pair. 
The solid lines represent the linear regression line, and the dashed lines represent the 95% confidence interval. A refers to automated scoring with the Morpheus I Sleep Scoring System; M 1, manual scoring for Scorer 1; M 2, manual scoring for Scorer 2. SLEEP, Vol. 27, No. 7, 2004 1399

Table 2. Concordance for the RDI between each manual scorer and the automated system was excellent (ICC = 0.95 for both comparisons), as shown in Figure 3. Table 4 illustrates that the automated algorithm for calculating the RDI has high sensitivity and specificity across the spectrum of SDB severity using receiver-operator characteristic curve analyses. The epoch-by-epoch agreement for detecting respiratory events between manual scoring and A was high (89.7% agreement with both manual scorers), with a κ of 0.66 for both comparisons, as shown in Table 3. The pairwise agreement for the classification of SDB (normal, mild, moderate, and severe) is provided in Table 5.

Table 3. Epoch-by-Epoch Agreement and Cohen κ Statistic for Detecting Arousals, Respiratory Events (Apneas and Hypopneas), and Periodic Limb Movements of Sleep for Each Scoring Pair*

| Parameter | M1 vs A, agreement | M1 vs A, κ | M2 vs A, agreement | M2 vs A, κ | M1 vs M2, agreement | M1 vs M2, κ |
|---|---|---|---|---|---|---|
| Arousals | 76.2% | 0.28 | 76.1% | 0.30 | 83.7% | 0.57 |
| Respiratory Events | 89.7% | 0.66 | 89.7% | 0.66 | 94.9% | 0.82 |
| Limb Movements | 93.1% | 0.68 | 92.2% | 0.66 | 95.6% | 0.77 |

*For each analysis, a 3 × 3 matrix was created to identify 0, 1, or 2 events per epoch for each scorer. A refers to automated scoring with the Morpheus I Sleep Scoring System; M1, manual scoring for Scorer 1; M2, manual scoring for Scorer 2. N = 20,940 epochs (only epochs scored as sleep by all 3 scorers were included in the analysis).

Figure 3. (A-C) Scatter plots of the respiratory disturbance indexes (RDI) calculated for each scoring pair. The solid lines represent the linear regression line, and the dashed lines the 95% confidence interval. (D-F) Bland-Altman plots showing the difference in the RDI versus the mean of the RDI calculated for each scoring pair. The solid lines represent the mean difference, and the dotted lines represent the limits of agreement (± 2 SD). A refers to automated scoring with the Morpheus I Sleep Scoring System; M1, manual scoring for Scorer 1; M2, manual scoring for Scorer 2.

The detection of PLMs during sleep yielded a somewhat higher PLMI for the automated system than for either manual scorer (5.7 more limb movements per hour than M1, P < .01, and 2.5 more limb movements per hour than M2, NS), as shown in Table 2. There was also considerable scatter in this variable for each
scoring pair as shown in Figure 4. However, the 2 manual scorers showed a closer relationship with each other than the automated system did with either manual scorer. Concordance measured by ICC was relatively poor between A and manual scoring (ICC = 0.61 for A vs M1 and ICC = 0.65 for A vs M2). However, epoch-by-epoch agreement for scoring limb movements between the automated and the manual scorers was better, as shown in Table 3.

Table 4. Receiver-Operator Characteristic Curve Data for the Evaluation of the Automated Algorithm Across the Spectrum of Sleep-disordered Breathing Severity

| OSA defined by M1 RDI cutoff, events/h | OSA Prevalence | ROC Area Under the Curve, M1 vs M2 | ROC Area Under the Curve, M1 vs A |
|---|---|---|---|
| 5 | 0.71 | 1.0 (SE: 0) | 0.90 (SE: 0.06, 95% CI 0.78-1.0) |
| 10 | 0.58 | 0.97 (SE: 0.03, 95% CI 0.91-1.0) | 1.0 (SE: 0) |
| 15 | 0.45 | 1.0 (SE: 0) | 0.98 (SE: 0.02, 95% CI 0.94-1.0) |
| 20 | 0.35 | 1.0 (SE: 0) | 0.98 (SE: 0.02, 95% CI 0.95-1.0) |
| 30 | 0.26 | 1.0 (SE: 0) | 0.97 (SE: 0.03, 95% CI 0.90-1.0) |

Obstructive sleep apnea (OSA) defined by various M1-determined respiratory disturbance index (RDI) cutoffs. A refers to automated scoring with the Morpheus I Sleep Scoring System; M1, manual scoring for Scorer 1; M2, manual scoring for Scorer 2; ROC, receiver-operator characteristic; SE, standard error; CI, confidence interval.

Figure 4. Scatter plots of the periodic limb movement indexes (PLMI) are shown for each scoring pair. The solid lines represent the linear regression line, and the dashed lines represent the 95% confidence interval. A refers to automated scoring with the Morpheus I Sleep Scoring System; M1, manual scoring for Scorer 1; M2, manual scoring for Scorer 2.

DISCUSSION

In the present study, we observed reasonably good agreement between automated PSG scoring algorithms and the currently accepted standard (ie, manual scoring according to established criteria). This applies to sleep staging, respiratory events, and PLMs. An automated technique would be a major advance
for the field of sleep medicine because the manual scoring process is subjective, labor intensive, and expensive. Although manual scoring is currently considered the accepted standard for PSG interpretation, the concordance we observed between the 2 manual scorers was relatively comparable to that between manual and automated scoring. Our manual scoring data are consistent with previous studies assessing interobserver correlations for manual PSG scoring. In a report by Schaltenbrand and coworkers,[5] interobserver agreement was 87.5% in a population with relatively unfragmented sleep. In the Sleep Heart Health Study, after extensive training of professional scorers within a single laboratory, the average pairwise agreement for manual scoring was reportedly 87%.[3] However, the patients in the Sleep Heart Health Study population had relatively mild apnea (RDI = 5.7 events per hour based on Medicare criteria). Overall, the literature supports better agreement between observers when they are within the same laboratory and for patients with minimal sleep fragmentation.[2] Thus, we believe that our agreement between manual scorers is comparable to the existing literature in this area.

The automated algorithm that we employed in the present study performed reasonably well based on epoch-by-epoch agreement for sleep staging, respiratory event detection, and limb movements but less well for arousals. However, given the clinical population that we studied, with moderate obstructive sleep apnea and associated sleep fragmentation, we consider the results to be generally acceptable for clinical purposes. For example, the disagreement between the automated-detection algorithm and the manual scoring of sleep was partially driven by discrepancies in scoring the various stages of non-REM sleep. For clinical purposes, we do not believe that this distinction is likely to be important. The ability of the automated algorithm to distinguish non-REM sleep, REM sleep, and wakefulness is quite good.

The lack of concordance in arousal detection has been previously reported.[21] This may reflect ambiguity in implementing the ASDA arousal criteria rather than failure of the automated system per se. In addition, arousal frequency and duration are poor predictors of daytime sequelae and therefore may not be critical in most clinical scenarios.[22,23]

On the other hand, the automated analysis did generate occasional errors in important sleep variables. For RDI, as indicated in Figure 3 and Table 4, several subjects would have been misclassified in terms of the presence or absence of sleep apnea. Table 5 also illustrates that there was significant disagreement in the classification of moderate SDB (RDI between 15 and 30 events per hour) between the automated analysis and both manual scorers. The same principle applies to PLMs as well. One potential clinical limitation of the automated analysis is the difficulty in detecting the first REM period in 10 of the 31 subjects in this study. This resulted in poor agreement between automated scoring and the manual scorers (ICC = 0.46 for M1 vs A and 0.43 for M2 vs A) for calculating the REM-onset latency. As a result of these discrepancies, some manual editing will likely be required to verify the automated analysis before clinical decisions are reached. The difficulty in detecting the first REM period also suggests that the automated sleep scoring cannot be applied to Multiple Sleep Latency Test recordings when narcolepsy is suspected.

Table 5. Pairwise Comparisons for Classification of Sleep-disordered Breathing

| SDB Severity* | M1-M2 | A-M1 | A-M2 |
|---|---|---|---|
| Normal / Normal | 8 | 6 | 6 |
| PPA+ (normal), % | 100.0 | 66.7 | 66.7 |
| Normal / Mild | 0 | 2 | 2 |
| Normal / Moderate | 0 | 1 | 1 |
| Normal / Severe | 0 | 0 | 0 |
| Mild / Mild | 7 | 7 | 6 |
| PPA+ (mild), % | 77.8 | 63.6 | 60.0 |
| Mild / Moderate | 2 | 2 | 2 |
| Mild / Severe | 0 | 0 | 0 |
| Moderate / Moderate | 5 | 3 | 4 |
| PPA+ (moderate), % | 71.4 | 37.5 | 44.4 |
| Moderate / Severe | 0 | 2 | 2 |
| Severe / Severe | 9 | 8 | 8 |
| PPA+ (severe), % | 100.0 | 80.0 | 80.0 |
| Total subjects, no. | 31 | 31 | 31 |
| Total agreements, no. | 29 | 24 | 24 |
| Total agreement, % | 93.5 | 77.4 | 77.4 |

SDB refers to sleep-disordered breathing; A, automated scoring with the Morpheus I Sleep Scoring System; M1, manual scoring for Scorer 1; M2, manual scoring for Scorer 2; PPA+, the percentage of positive agreement for the specified level of severity [eg, PPA+ associated with normal SDB (A-M1) = 6/(6 + 2 + 1 + 0) = 66.7%].
*The number of occurrences of the specified SDB severity relative to the total number of subjects.

This study had many strengths, including blinded manual scoring by 2 observers, the use of a clinical sleep population rather than relatively healthy subjects, and the use of automated algorithms that analyze all important PSG outcome variables rather than just sleep staging or respiratory events. However, it also has
the following limitations. First, we were unable to study consecutive patients referred to our sleep laboratory and thus have the potential for enrollment bias. Our institutional review board does not allow the inclusion of consecutive patients, and therefore only willing subjects were allowed to participate. We did take careful measures to ensure that systematic bias did not occur among the subjects whom we approached; however, we cannot be certain that unrecognized factors did not affect participation. Second, although the agreement between the manual scorers in the present study compared favorably with past reports of interobserver reliability, there were scoring differences between M1 and M2, especially in the scoring of stage 1 and delta sleep. Third, one could argue that the ideal assessment of an automated system would include computer analysis followed by expert verification and validation. Thus, it is possible that the combination of our automated algorithm and manual verification would have been superior to any of the approaches in our present study. In the future, a more definitive study assessing the optimal scoring techniques and the associated cost effectiveness of each of these approaches will need to be undertaken. Fourth, Medicare criteria were the only criteria used for the definition of hypopneas. Thus, we cannot comment on the ability of the algorithm to detect more subtle events, such as those described in the Chicago criteria, that do not require an oxyhemoglobin desaturation.[20] Fifth, previous reports have suggested greater variability across sleep laboratories in PSG scoring. Thus, our single-center study would need to be reproduced in other centers to be broadly generalizable. Sixth, we studied only an adult population with suspected SDB; thus, our results cannot be generalized to other populations such as insomniacs, children, or patients with parasomnias.
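The stage 1 and delta disagreement between M1 and M2 noted in the second limitation can be checked against the PPA+ definition given with the epoch-by-epoch stage comparison table. The following is a minimal sketch in Python, not the authors' analysis code: the counts are the M1-M2 column of that table, while the names `M1_M2_COUNTS`, `ppa_plus`, and `total_agreement` are hypothetical and introduced only for illustration. It assumes, following the table's worked example for wake, that the PPA+ denominator is every epoch in which either scorer assigned the stage.

```python
# Stage-pair counts for the M1-M2 comparison, copied from the epoch-by-epoch
# stage comparison table. Pairs are unordered, as in the table.
# (Names in this sketch are hypothetical, not taken from the study.)
M1_M2_COUNTS = {
    ("Wake", "Wake"): 4835,
    ("Wake", "Stage 1"): 535,
    ("Wake", "Stage 2"): 376,
    ("Wake", "Delta"): 4,
    ("Wake", "REM"): 223,
    ("Stage 1", "Stage 1"): 731,
    ("Stage 1", "Stage 2"): 1765,
    ("Stage 1", "Delta"): 0,
    ("Stage 1", "REM"): 408,
    ("Stage 2", "Stage 2"): 12233,
    ("Stage 2", "Delta"): 1293,
    ("Stage 2", "REM"): 194,
    ("Delta", "Delta"): 1181,
    ("Delta", "REM"): 0,
    ("REM", "REM"): 3098,
}


def ppa_plus(counts, stage):
    """Percentage of positive agreement for one stage.

    Numerator: epochs that both scorers assigned to `stage`.
    Denominator: epochs that at least one scorer assigned to `stage`.
    """
    agreed = counts[(stage, stage)]
    involved = sum(n for pair, n in counts.items() if stage in pair)
    return 100.0 * agreed / involved


def total_agreement(counts):
    """Percentage of all epochs on which the two scorers chose the same stage."""
    agreed = sum(n for (a, b), n in counts.items() if a == b)
    return 100.0 * agreed / sum(counts.values())


if __name__ == "__main__":
    for stage in ("Wake", "Stage 1", "Stage 2", "Delta", "REM"):
        print(f"PPA+ {stage}: {ppa_plus(M1_M2_COUNTS, stage):.1f}%")
    print(f"Total agreement: {total_agreement(M1_M2_COUNTS):.1f}%")
```

Under these assumptions the sketch reproduces the tabulated M1-M2 values, including the low PPA+ for stage 1 (21.3%) and delta (47.7%) despite a total agreement of 82.1%. The same functions could be applied to the A-M1 and A-M2 columns by substituting their counts.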
The good epoch-by-epoch agreement for sleep stages, respiratory events, and limb movements leads us to conclude that this automated system could be successfully implemented clinically. However, expert verification of the automated scoring is essential. During the initial implementation of the system, this might consist of verifying the scoring of every epoch. With time and expert experience with the automated system, we would expect that such verification could become more limited, for example, identifying the first REM period and selectively validating the sensitivity and specificity of the algorithm for detecting respiratory events and PLMs within an individual PSG.

REFERENCES

1. Collop NA. Scoring variability between polysomnography technologists in different sleep laboratories. Sleep Med 2002;3:43-7.
2. Norman RG, Pal I, Stewart C, Walsleben JA, Rapoport DM. Interobserver agreement among sleep scorers from different centers in a large dataset. Sleep 2000;23:901-8.
3. Whitney CW, Gottlieb DJ, Redline S, et al. Reliability of scoring respiratory disturbance indices and sleep staging. Sleep 1998;21:749-57.
4. Prinz PN, Larsen LH, Moe KE, Dulberg EM, Vitiello MV. C STAGE, automated sleep scoring: development and comparison with human sleep scoring for healthy older men and women. Sleep 1994;17:711-7.
5. Schaltenbrand N, Lengelle R, Toussaint M, et al. Sleep stage scoring using the neural network model: comparison between visual and automatic analysis in normal subjects and patients. Sleep 1996;19:26-35.
6. Grube G, Flexer A, Dorffner G. Unsupervised continuous sleep analysis. Methods Find Exp Clin Pharmacol 2002;24 Suppl D:51-6.
7. Agarwal R, Gotman J. Computer-assisted sleep staging. IEEE Trans Biomed Eng 2001;48:1412-23.
8. Taha BH, Dempsey JA, Weber SM, et al. Automated detection and classification of sleep-disordered breathing from conventional polysomnography data. Sleep 1997;20:991-1001.
9. Pillar G, Bar A, Shlitner A, Schnall R, Shefy J, Lavie P. Autonomic arousal index: an automated detection based on peripheral arterial tonometry. Sleep 2002;25:543-9.
10. Kayed K, Roberts S, Davies WL. Computer detection and analysis of periodic movements in sleep. Sleep 1990;13:253-61.
11. EEG arousals: scoring rules and examples: a preliminary report from the Sleep Disorders Atlas Task Force of the American Sleep Disorders Association. Sleep 1992;15:173-84.
12. Rechtschaffen A, Kales A, eds. A Manual of Standardized Terminology, Techniques, and Scoring System for Sleep Stages of Human Subjects. Los Angeles: Brain Information Service/Brain Research Institute, UCLA; 1968.
13. Meoli AL, Casey KR, Clark RW, et al. Hypopnea in sleep-disordered breathing in adults. Sleep 2001;24:469-70.
14. Recording and scoring leg movements. The Atlas Task Force. Sleep 1993;16:748-59.
15. Geva AB, Kerem DH. Brain state identification and forecasting of acute pathology using unsupervised fuzzy clustering of EEG temporal patterns. In: Teodorescu HN, Kandel A, Jain LC, eds. Fuzzy and Neuro-Fuzzy Systems in Medicine. CRC Press; 1999:57-93.
16. Gath I, Feuerstein C, Geva AB. Unsupervised classification and adaptive definition of sleep patterns. Pattern Recognition Letters 1994;15:977-84.
17. Geva AB. Feature extraction and state recognition in biomedical signals with hierarchical unsupervised fuzzy clustering methods. Medical & Biological Engineering & Computing 1998;36:608-14.
18. Cohen J. A coefficient of agreement for nominal scales. Educ Psychol Meas 1960;20:37-46.
19. Bland JM, Altman DG. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet 1986;1:307-10.
20. Sleep-related breathing disorders in adults: recommendations for syndrome definition and measurement techniques in clinical research. The Report of an American Academy of Sleep Medicine Task Force. Sleep 1999;22:667-89.
21. Drinnan MJ, Murray A, Griffiths CJ, Gibson GJ. Interobserver variability in recognizing arousal in respiratory sleep disorders. Am J Respir Crit Care Med 1998;158:358-62.
22. Martin SE, Engleman HM, Kingshott RN, Douglas NJ. Microarousals in patients with sleep apnoea/hypopnoea syndrome. J Sleep Res 1997;6:276-80.
23. Kingshott RN, Engleman HM, Deary IJ, Douglas NJ. Does arousal frequency predict daytime function? Eur Respir J 1998;12:1264-70.