Shaheen N. Awan 1, Nancy Pearl Solomon 2, Leah B. Helou 3, & Alexander Stojadinovic 2

1 Bloomsburg University of Pennsylvania; 2 Walter Reed National Military Medical Center; 3 University of Pittsburgh

Dr. S. N. Awan is a consultant to KayPENTAX (Montvale, NJ) for the development of commercial computer software, including algorithms for cepstral analysis of continuous speech. KayPENTAX licenses the algorithms that form the basis of the Analysis of Dysphonia in Speech and Voice (ADSV) program from Dr. Awan. The views expressed in this presentation are those of the authors and do not reflect official policy of the United States Army, Department of Defense, or US Government.

Time-based perturbation measures have at least two key limitations:
1. Difficulty analyzing more severely dysphonic vowel samples
2. Lack of validity of traditional perturbation measures in the analysis of continuous speech

In contrast to traditional perturbation analyses, spectral-based acoustic measures can characterize the voice signal by extracting characteristics such as the fundamental frequency (F0) and the relative amplitude of harmonics vs. noise (de Krom, 1993) without the need to identify cycle boundaries. Spectral-based methods analyze frames of data rather than cycles. Spectral/cepstral measures are able to provide valid and reliable correlates of vocal quality in continuous speech contexts.

The cepstrum (a Fourier transform of the log power spectrum of the voice signal) can graphically display the extent to which the dominant rahmonic (an anagram of "harmonic"; a cepstral peak often associated with the vocal fundamental frequency) is individualized and emerges out of the background noise level. Measures of the relative amplitude of the cepstral peak in relation to extraneous cepstral components have been reported to provide an effective method for quantifying the severity of the dysphonic voice (Awan, Roy, & Dromey, 2009; Awan & Roy, 2005; Awan & Roy, 2006; Heman-Ackah, Michael, & Goding, 2002; Wolfe & Martin, 1997; Hillenbrand, Cleveland, & Erickson, 1994).

Examples of the log spectrum of a voiced signal and subsequent cepstrum (from Papamichalis, 1987)

Typical cepstrum for a normal male sustained vowel production. The cepstral peak (circled) corresponds to the fundamental frequency (115.91 Hz) and lies at a quefrency (the x-axis value, in units of time) of approx. 8.63 ms.
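The relationship between the cepstral peak, quefrency, and F0 can be illustrated with a minimal sketch. It assumes NumPy, a single Hann-windowed frame, and a synthetic harmonic test signal; the function names (real_cepstrum, cepstral_peak, synth_voice), the 60 to 300 Hz search range, and the straight-line trend used for the prominence value are illustrative assumptions, not the ADSV implementation.

```python
import numpy as np

def real_cepstrum(frame, fs):
    """Cepstrum of one frame: inverse FFT of the log power spectrum (dB)."""
    windowed = frame * np.hanning(len(frame))
    power = np.abs(np.fft.rfft(windowed)) ** 2
    log_power_db = 10.0 * np.log10(power + 1e-12)    # small floor avoids log(0)
    cepstrum = np.abs(np.fft.irfft(log_power_db))
    quefrency_s = np.arange(len(cepstrum)) / fs       # x-axis in seconds
    return quefrency_s, cepstrum

def cepstral_peak(frame, fs, f0_min=60.0, f0_max=300.0):
    """Return (peak quefrency in s, peak height above a fitted trend line)."""
    q, cep = real_cepstrum(frame, fs)
    lo, hi = int(fs / f0_max), int(fs / f0_min)       # assumed search range
    peak = lo + int(np.argmax(cep[lo:hi + 1]))
    # Straight-line trend over the search range (an illustrative choice).
    slope, intercept = np.polyfit(q[lo:hi + 1], cep[lo:hi + 1], 1)
    prominence = cep[peak] - (slope * q[peak] + intercept)
    return q[peak], prominence

def synth_voice(f0, fs, n_samples):
    """Hypothetical test signal: harmonics of f0 with 1/k amplitude roll-off."""
    t = np.arange(n_samples) / fs
    n_harmonics = int((fs / 2) // f0)
    return sum(np.sin(2 * np.pi * k * f0 * t) / k
               for k in range(1, n_harmonics + 1))

fs = 16000
frame = synth_voice(115.91, fs, 2048)
q_peak, prominence = cepstral_peak(frame, fs)
print(f"peak quefrency = {q_peak * 1000:.2f} ms, implied F0 = {1.0 / q_peak:.1f} Hz")
```

For a periodic signal the peak quefrency should land close to the period 1/F0, which for 115.91 Hz is about 8.63 ms, the value quoted in the caption. Quefrency resolution is one sample (1/fs), so the implied F0 is only approximate.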

Analysis of Dysphonia in Speech and Voice (ADSV) from KayPENTAX is the first commercial program of its kind, allowing for voice quality assessment of sustained vowel and continuous speech samples in normal and mild-to-severely dysphonic voices. This program provides several key spectral and cepstral measures of the voice sample along with a graphic display of how these values change over time. The program also incorporates the Cepstral/Spectral Index of Dysphonia (CSID), a multifactorial estimate of vocal severity that correlates with the VAS (%) severity scale used in the CAPE-V.

ADSV main screen - spectral/cepstral analyses of a normal female voice sample ("We were away a year ago"). Analysis windows include: (A) sound spectrogram; (B) sound wave; (C) low/high spectral ratio (L/H Ratio) over time; (D) cepstral peak prominence (CPP) over time; (E) focused spectral analysis per data frame; (F) focused cepstral analysis per data frame.

In 2010, Awan, Roy, Jetté, Meltzner, & Hillman reported that an algorithm incorporating measures from cepstral and spectral analyses was able to produce estimates of dysphonia severity that strongly correlated with auditory-perceptual judgments of dysphonia severity. Using measures of the cepstral peak prominence (CPP), a ratio of low vs. high frequency spectral energy (L/H ratio), and the respective standard deviations of these measures, Awan et al. (2010) reported R = 0.81 between acoustic and auditory-perceptual estimates of dysphonia severity in CAPE-V sentences, and R = 0.96 between acoustic and auditory-perceptual estimates of dysphonia severity in sustained vowel productions.
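A ratio of low vs. high frequency spectral energy, together with its mean and standard deviation across frames, could be computed along the following lines. This is a sketch under stated assumptions: the 4 kHz cutoff, the frame length, and the hop size are illustrative defaults that the text above does not specify, and the function names are hypothetical.

```python
import numpy as np

def low_high_ratio_db(frame, fs, cutoff_hz=4000.0):
    """Ratio (dB) of spectral energy below vs. above cutoff_hz for one frame.

    cutoff_hz is an assumed value chosen for illustration.
    """
    windowed = frame * np.hanning(len(frame))
    power = np.abs(np.fft.rfft(windowed)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    low = power[freqs < cutoff_hz].sum()
    high = power[freqs >= cutoff_hz].sum()
    return 10.0 * np.log10((low + 1e-12) / (high + 1e-12))

def framewise_mean_sd(signal, fs, measure, frame_len=2048, hop=1024):
    """Apply a per-frame measure across a signal; return (mean, sample SD)."""
    values = [measure(signal[i:i + frame_len], fs)
              for i in range(0, len(signal) - frame_len + 1, hop)]
    return float(np.mean(values)), float(np.std(values, ddof=1))

# Hypothetical test signal: a strong 200 Hz tone plus a weak 6 kHz tone.
fs = 16000
t = np.arange(8192) / fs
signal = np.sin(2 * np.pi * 200 * t) + 0.1 * np.sin(2 * np.pi * 6000 * t)
mean_lh, sd_lh = framewise_mean_sd(signal, fs, low_high_ratio_db)
print(f"L/H ratio: mean = {mean_lh:.1f} dB, SD = {sd_lh:.2f} dB")
```

Because the low-frequency tone has ten times the amplitude of the high-frequency tone, the ratio for this test signal should be near 20 dB. A breathier voice, with more high-frequency noise energy, would be expected to yield a lower ratio.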

[Figure: Estimated rating vs. listener rating of dysphonia severity (100-pt. VAS) by group (Normal, Mild, Moderate, Severe). Panels: all CAPE-V sentences combined (n = 128; 32 subjects per group) and sustained vowel (n = 32; 8 subjects per group).]

[Figure: Estimated rating vs. listener rating of dysphonia severity (100-pt. VAS) by group (Normal, Mild, Moderate, Severe) for each CAPE-V sentence (n = 32; 8 subjects per group in each panel). Panels: "How hard did he hit him?"; "We were away a year ago."; "We eat eggs every Easter."; "Peter will keep at the peak."]

Because any research finding (strong or weak) may simply be a reflection of the particular sample of subjects being studied, it is essential that the results of any single study be replicated with alternative samples. In this way, the external validity (i.e., the ability to reproduce results with alternative subjects and in settings outside of the original study) of a particular research finding can be established. The goal of the present study was to assess the external validity of the acoustic algorithm and analysis methods reported in Awan et al. (2010) with a completely new and independent set of normal and disordered CAPE-V samples and associated listener judgments.

Samples were obtained from previously recorded voices of patients scheduled for partial or total thyroidectomy. Perceptual and acoustic analyses were conducted for subjects pre- and post-thyroidectomy. CAPE-V sentences and sustained vowel samples were elicited from each subject at comfortable pitch and loudness levels.

- 3 experienced SLPs rated the CAPE-V samples
- Custom automated program
- Blinded, randomized order, blocked for subject
- Presented over headphones in a sound-treated booth
- A single rating for vowel and sentences combined
- Separate sessions for male and female samples
- Accompanied by an anchor for moderate severity
- Median ratings (% of 100-mm line, labeled for severity) for Severity, Roughness, Breathiness, and Strain

Vowels: the central 2 s of each sustained /ɑ/ was analyzed, with onset and offset trimmed.

CAPE-V sentences, with the behaviors each one probes: soft glottal attacks and voiceless-to-voiced transitions ("How hard did he hit him?"); the presence of possible voiced stoppages or spasms, and the ability to maintain consistent voicing ("We were away a year ago"); the presence of hard glottal attacks ("We eat eggs every Easter"); and the ability to transition easily between voiceless stop-plosive production and vowel production ("Peter will keep at the peak").

Measures of the cepstral peak prominence (CPP) and the ratio of low vs. high frequency spectral energy (L/H Ratio), as well as the standard deviations of these measures, were obtained.

In addition to the aforementioned acoustic measures, the ADSV program produces an estimate of dysphonia severity called the CSID (Cepstral/Spectral Index of Dysphonia). Separate CSID values are provided for sustained vowels vs. individual speech samples. For the purposes of this study, the CSID values for the vowel and each analyzed sentence were averaged to provide a single acoustic estimate of dysphonia.

Samples of 40 voices (20 normal [mean CAPE-V severity = 8.52%, SD = 4.90] and 20 dysphonic [mean CAPE-V severity = 30.84%, SD = 13.69]) were selected from the initial corpus of data. These voices reflected a relatively wide range of perceived dysphonia severity (CAPE-V range = 1 to 65), which allowed a focus on the ability to discriminate normal/typical voices from those judged to have mild-to-moderate degrees of dysphonia severity. Males and females were equally represented.

Statistical evaluation of the CSID values (i.e., acoustically estimated dysphonia severity) vs. auditory-perceptual dysphonia severities (i.e., CAPE-V ratings) revealed the following:
- No significant difference between CSID and auditory-perceptual dysphonia severity in normal subjects.
- No significant difference between CSID and auditory-perceptual dysphonia severity in disordered subjects.
- Significant differences in dysphonia severity between normal and disordered subjects, whether estimated via acoustic analyses (CSID; t(38) = -5.01, p < .001) or via auditory-perceptual judgment (t(38) = -6.87, p < .001).

| Subjects | Mean Cepstral/Spectral Index of Dysphonia (CSID) | Mean Auditory-Perceptual Rating (CAPE-V) |
|---|---|---|
| 20 Normal Voices | 5.72 (SD = 9.97) | 8.52 (SD = 4.90) |
| 20 Disordered Voices | 28.50 (SD = 17.71) | 30.84 (SD = 13.69) |
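The reported group comparisons can be checked approximately from the means and SDs in the table above. The sketch below assumes an independent-samples Student's t-test with pooled variance, which is consistent with the reported 38 degrees of freedom (20 + 20 - 2) but is not stated explicitly in the text; the helper name pooled_t is hypothetical.

```python
import math

def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Student's t statistic and degrees of freedom from summary statistics."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    std_err = math.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    return (mean1 - mean2) / std_err, df

# Normal vs. disordered groups, using the values from the table above.
t_csid, df = pooled_t(5.72, 9.97, 20, 28.50, 17.71, 20)
t_capev, _ = pooled_t(8.52, 4.90, 20, 30.84, 13.69, 20)
print(f"CSID:   t({df}) = {t_csid:.2f}")
print(f"CAPE-V: t({df}) = {t_capev:.2f}")
```

Both statistics should come out close to the reported t(38) = -5.01 and t(38) = -6.87. Any difference in the second decimal place is to be expected, because the published means and SDs are themselves rounded.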

Across all 40 subjects, a strong and significant correlation between CSID values and CAPE-V auditory-perceptual ratings of dysphonia severity was observed (r = 0.85; r² = 0.73; p < .001).

| Measure | Correlation with CAPE-V Severity Rating |
|---|---|
| CSID: Vowel /ɑ/ | r = 0.67 |
| CSID: Easy Onset Sentence | r = 0.79 |
| CSID: All Voiced Sentence | r = 0.82 |
| CSID: Glottal Sentence | r = 0.78 |
| CSID: Plosive Sentence | r = 0.78 |

All Pearson's r correlations significant at p < .01.

The results of this study indicate that the acoustic algorithm reported by Awan et al. (2010) and incorporated in the CSID is externally valid and an effective correlate of perceived CAPE-V dysphonia severity. Results for the multifactor CSID measure are somewhat stronger than those observed in Awan et al. (2010), although sample characteristics differed in that the voices in the current study tended to be in the normal-to-moderate range of severity. As in Awan et al. (2010), the all-voiced sentence was observed to provide the best individual correlate of dysphonia severity.

In this study, spectral/cepstral measures from sentences correlated best with perceived dysphonia severity, in contrast to Awan et al.'s (2010) sample, in which measures from vowels provided the strongest correlate. Dysphonia may be more prominent in vowels than in sentences, or vice versa, in different cases; therefore, both samples are necessary.

The CSID provides an objective, multivariate measure of dysphonia severity that is effective in both sustained vowel and continuous speech contexts. The objective, automatic estimation of dysphonia severity in continuous speech and vowel samples is a potentially valuable and easily communicated method of categorizing voice and voice change without requiring multiple trained judges.

Awan, S. N., & Roy, N. (2005). Acoustic prediction of voice type in women with functional dysphonia. Journal of Voice, 19(2), 268-282.
Awan, S. N., & Roy, N. (2006). Toward the development of an objective index of dysphonia severity: a four-factor acoustic model. Clinical Linguistics & Phonetics, 20(1), 35-49.
Awan, S. N., Roy, N., & Dromey, C. (2009). Estimating dysphonia severity in continuous speech: application of a multi-parameter spectral/cepstral model. Clinical Linguistics & Phonetics, 23(11), 825-841.
Awan, S. N., Roy, N., Jetté, M. E., Meltzner, G. S., & Hillman, R. E. (2010). Quantifying dysphonia severity using a spectral/cepstral-based acoustic index: Comparisons with auditory-perceptual judgments from the CAPE-V. Clinical Linguistics & Phonetics, 24(9), 742-758.
de Krom, G. (1993). A cepstrum-based technique for determining a harmonics-to-noise ratio in speech signals. Journal of Speech and Hearing Research, 36(2), 254-266.
Heman-Ackah, Y. D., Michael, D. D., & Goding, G. S., Jr. (2002). The relationship between cepstral peak prominence and selected parameters of dysphonia. Journal of Voice, 16(1), 20-27.
Hillenbrand, J., Cleveland, R. A., & Erickson, R. L. (1994). Acoustic correlates of breathy vocal quality. Journal of Speech and Hearing Research, 37(4), 769-778.
Papamichalis, P. E. (1987). Practical Approaches to Speech Coding. Englewood Cliffs, NJ: Prentice-Hall.
Wolfe, V., & Martin, D. (1997). Acoustic correlates of dysphonia: type and severity. Journal of Communication Disorders, 30(5), 403-415.