Integral and Diagnostic Speech-Quality Measurement: State of the Art, Problems, and New Approaches


U. Heute *, S. Möller **, A. Raake ***, K. Scholz *, M. Wältermann **

* Institute for Circuit and System Theory, Christian-Albrechts-University, Kaiserstr. 2, Kiel, Germany, {uh,ks}@tf.uni-kiel.de
** Institute of Communication Acoustics, Ruhr-University, D-44780 Bochum, Germany, {sebastian.moeller,marcel.waeltermann}@rub.de
*** LIMSI - CNRS, Université de Paris-Sud, Orsay, France, alexander.raake@limsi.fr

The user's overall impression of a speech signal which has been influenced by some system can be described in terms of the integral quality. There are well-defined auditory methods to assess integral quality: Mostly, an absolute-category rating in a listening-only situation is used, resulting in the mean-opinion score (MOS). For this integral quality index, proposals exist for instrumental measurements, yielding MOS estimates. Such estimates have problems with distortions not taken into account during the model development. Furthermore, they do not allow the quality of speech signals to be characterized in any detail. A different approach, trying to overcome these problems, aims at quality attributes which concern distinct distortions, thus allowing for a system diagnosis, and which together form an integral-quality-impression model that is also able to cope with future degradations. Earlier work in that direction had problems, too. Avoiding some weaknesses and re-defining more suitable attributes, the diagnostic approach is re-visited. First results are reported, and further work is outlined.

1 Introduction

The quality attributed by a user to some speech-transmission, speech-enhancement, or, more generally, speech-processing system is the result of a perception and a judgment process. Both depend on factors not directly visible (or audible) in the speech signal, so that, after all, only users, i.e., human subjects, can give valid quality statements.
Such statements will often concern the overall, i.e., integral quality. There are well-defined auditory methods to assess integral quality: Mostly, an absolute-category rating (ACR) in a listening-only situation is used, resulting in the mean-opinion score (MOS). Numerous other methods exist, using different scales or more refined test situations. Smaller quality degradations may better be judged in a paired-comparison test, because the differential sensitivity of the ear is higher than the absolute one. This is relevant in our context, because we would like to discriminate relatively small deteriorations, like those in a telephone-transmission situation.

Unfortunately, auditory experiments with human test subjects are expensive and time-consuming. System developers would thus like to get information on quality directly from the speech signal, without asking human listeners. For the integral quality, proposals exist for efficient instrumental, signal-based measurements yielding MOS estimates, like the ITU-T standard P.862 [12] or the TOSQA model [3][4]. Models relying on signal-based measurements have problems with distortions not taken into account during the model development. Furthermore, the MOS estimates they provide do not characterize the quality of speech signals in any detail: They might provide the same MOS for two signals which are perceived to sound different in an auditory test.

A different, well-known approach tries to overcome these problems by aiming at quality attributes which
- concern detailed and distinct distortion effects,
- thus allow for a system diagnosis, and
- together may form an integral-quality-impression model
- which is also able to cope with future degradations not yet taken into account when the model was developed.

Quality attributes can be investigated both on an auditory and on a speech-signal level.
On the auditory level, perceptual dimensions have to be identified and their importance for integral quality has to be determined; on the speech-signal level, instrumental measures for the quality attributes have to be found and combined to provide an integral-quality estimate. Earlier work on quality attributes on both levels [1][5] is known to have had problems, too. Avoiding some weaknesses and re-defining more suitable attributes, the diagnostic approach is re-visited, including an integral-quality measure to be finally derived. New, important types of degradations, like time-varying, bursty data-packet loss in internet transmission, are to be included.

In Sec. 2 and 3, present auditory and instrumental measurements for integral quality are motivated and explained. Corresponding procedures towards diagnostic quality estimation are presented in Sec. 4, motivating an improved attribute selection outlined in Sec. 5. Sec. 6 shows results of first experimental investigations carried out to identify underlying perceptual attributes. The necessities of further work are summarized in Sec. 7.

2 Auditory Integral-Quality Measurement

Speech quality is an important factor for the acceptance (i.e., after all, the economic success) of a communication system. For valid judgments, users have to be asked for their impressions after using the communication system. Involving human subjects gives rise to the common term "subjective tests"; in fact, measuring quality involves a perception and a judgment process which can only take place in a human subject. Still, a properly designed test quite rigidly prescribes the choice of speech material, speakers, and listeners in order to objectify the test results, i.e., to make them generalizable. Because of this, the term "auditory measurement" is preferred, although the above-named points of choice always leave some subjectivity.

A communication background would actually require two-way tests, i.e., conversation tests, as a model. Because of their excessive time consumption, conversation tests with simplifications have been devised [17]. Still, listening-only tests (LOT) are preferred when new communication systems are to be designed, e.g., for the selection of the best candidate for a new standard system by the International Telecommunication Union / Telecommunication Sector (ITU-T). Even in an LOT the expenditure is quite high, so the possibility of direct (e.g., paired) system comparisons is often discarded. Instead, each system is independently graded on an absolute scale between 5 = excellent and 1 = bad according to a listener's impression of a speech sample, and the judgments of many listeners in this ACR are averaged, yielding a MOS value for the system under test. The procedures for the choice of speech material and subjects as well as for the evaluation have been defined in international standards [10].
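The averaging step of an ACR test can be illustrated with a minimal sketch. The function name, the example ratings, and the normal-approximation confidence interval (factor 1.96) are assumptions made here for illustration; they are not taken from the cited standards.

```python
import math
import statistics


def mean_opinion_score(ratings):
    """Return (MOS, half-width of an approximate 95% confidence interval).

    `ratings` holds ACR judgments on the five-point scale (5 = excellent).
    The interval uses a normal approximation, which is an assumption of
    this sketch and only reasonable for a sufficiently large listener panel.
    """
    if any(r not in (1, 2, 3, 4, 5) for r in ratings):
        raise ValueError("ACR ratings must be integers from 1 to 5")
    mos = statistics.mean(ratings)
    if len(ratings) < 2:
        return mos, float("nan")  # no spread estimate from a single rating
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width


# Hypothetical judgments of eight listeners for two systems under test.
ratings_by_system = {
    "system_a": [4, 5, 4, 3, 4, 4, 5, 4],
    "system_b": [2, 3, 2, 3, 3, 2, 2, 3],
}

for name, ratings in ratings_by_system.items():
    mos, ci = mean_opinion_score(ratings)
    print(f"{name}: MOS = {mos:.2f} +/- {ci:.2f}")
```

Reporting the interval together with the mean makes it visible when two systems cannot be separated by the available number of judgments.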
3 Instrumental Integral-Quality Estimation

Even in the form of an ACR-LOT, auditory evaluation may be too expensive, especially for frequent tests during system development in a laboratory. Therefore, it is attractive to search for signal-based features which can be measured purely instrumentally and from which the LOT-MOS result can be estimated reliably. The basis is, usually, a signal comparison. It should be noted that, here, the already quite poor LOT model of the communication situation is itself being modeled, and not in a really consistent way.

There are two distinct approaches for the comparison: One, due to Atal [20], relates the system-output signal to an error signal, i.e., the difference between input and output signals (see Fig. 1); the other one, due to Gersho [24], compares the input and output signals directly (see Fig. 2). The first idea generates more or less refined and extended variants of the classical SNR. The latter approach led to a series of advanced techniques (e.g., PSQM in [2][11], q_C in [6], SQET in [7]); further ones are available commercially (TOSQA in [4]) or have even been standardized (PESQ in [12]). Their common idea is the measurement of distances (or similarities) between the input signal and the output or noise signal. The comparison is performed on a perception-based representation, e.g., using non-linearly transformed inner-ear excitations on a warped (Bark) frequency scale. The methods differ in measurement details (like their determination of spectra), but also in the amount of psychoacoustics involved. Fig. 3 shows SQET as an example. PESQ, e.g., ignores loudness-compression laws which have been known from psychoacoustics for many years.

Figure 1: Comparison between input and error signal for MOS estimation, according to Hauenstein [7], type
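As an illustration of the error-signal idea, the following is a minimal sketch of a segmental SNR, one of the simple refinements of the classical SNR. The frame length of 160 samples and the clamping limits of -10 dB and 35 dB are common conventions assumed here; they are not values taken from this paper, and the function name is hypothetical.

```python
import math


def segmental_snr_db(reference, degraded, frame_len=160,
                     floor_db=-10.0, ceil_db=35.0):
    """Average the per-frame SNR (in dB) between an input and an output signal.

    The error signal is the sample-wise difference between `reference`
    (system input) and `degraded` (system output). Frame length and the
    clamping limits are assumptions of this sketch.
    """
    if len(reference) != len(degraded):
        raise ValueError("signals must be time-aligned and of equal length")
    frame_snrs = []
    for start in range(0, len(reference) - frame_len + 1, frame_len):
        ref = reference[start:start + frame_len]
        deg = degraded[start:start + frame_len]
        signal_energy = sum(x * x for x in ref)
        error_energy = sum((x - y) ** 2 for x, y in zip(ref, deg))
        if signal_energy == 0.0:
            continue  # skip silent frames, which carry no SNR information
        if error_energy == 0.0:
            snr = ceil_db  # identical frame: use the upper clamping limit
        else:
            snr = 10.0 * math.log10(signal_energy / error_energy)
        frame_snrs.append(min(max(snr, floor_db), ceil_db))
    if not frame_snrs:
        raise ValueError("no non-silent frame found")
    return sum(frame_snrs) / len(frame_snrs)


# Toy example: a 200 Hz tone sampled at 8 kHz, attenuated by the "system".
reference = [math.sin(2.0 * math.pi * 200.0 * n / 8000.0) for n in range(320)]
degraded = [0.9 * x for x in reference]
print(f"segmental SNR: {segmental_snr_db(reference, degraded):.1f} dB")
```

Such a measure works directly on waveforms; the perception-based methods named above instead compare transformed representations, which this sketch does not attempt.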

Figure 2: Comparison between input and output signal for MOS estimation, according to Hauenstein [7], type 1

Figure 3: Speech-quality evaluation tool (SQET, [8])

Investigations in [7] have indicated, however, that as much knowledge as is available and realizable should indeed be employed: In view of a generally applicable measure (i.e., one valid outside the "system-training world"), this is preferable to any combination of "magical" factors optimized blindly with respect to a maximum correlation between real and estimated MOS values.

4 Diagnostic Quality Measures

While a single-value (MOS) result is useful for a system selection, it is not sufficient during the system-development phase: More information about the influence of system parameters on sound perception is needed. Some distortions may change in a counteracting way such that the MOS remains the same. Even more, in instrumental measurement, a constant average distance may lead to an unchanged MOS estimate where in fact a listener would weight the distortions differently and change the judgment; on the other hand, the distance may change while the listener feels that, overall, the quality is the same.

The latter two problems can also be observed in the best available measures. Fig. 4 shows auditory MOS values and their estimates using the PESQ or the TOSQA model. In some cases, instrumental measures predict the same MOS where the auditory ratings in fact differ considerably, resulting in vertical groups of the corresponding test conditions (see ellipse in the right panel). In other cases, differences in MOS are predicted where the auditory MOS remains the same, leading to horizontal groups (ellipse in the left panel).

Diagnostic information can be obtained through auditory tests (see Sec. 4.1) as well as through instrumental measurements (see Sec. 4.2). When using auditory tests, attributes can either be pre-defined, or they result from the test itself. Both methods will be discussed in the following paragraphs.

4.1 Auditory Diagnostic Measures

Barnwell's group [18] used the diagnostic acceptability measure (DAM, [22]) for very detailed investigations: Listeners were asked for their impression of both the integral quality and details in terms of L = 10 attributes like brightness, noisiness, naturalness, etc. It was also tried to mimic human judgment by combining weighted attribute judgments into an integral-quality estimate. Running such tests, however, requires huge resources, because listeners have to be trained to judge the attributes.

While the DAM relies on pre-defined attributes, a multi-dimensional analysis [15][16][23] assumes only a fixed number L of dimensions: Listeners evaluate the perceived distances between system variants (in a comparison test), and N > L systems are then interpreted as points on an L-dimensional map. Here, similarities and differences between the systems become visible, and from system knowledge, names may be attributed to the L axes. The dimensionality has to be selected as a compromise between an optimal fit of the distance judgments and the interpretability of the resulting axes.
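The step from judged distances to points on an L-dimensional map can be sketched with classical (Torgerson) metric scaling. This is a textbook technique chosen here for illustration only; it is not necessarily the algorithm used in the cited studies. The power iteration stands in for a library eigen-solver, and the function name and toy distances are assumptions.

```python
import math


def classical_mds(dist, n_dims=2, n_iter=500):
    """Embed N items in `n_dims` dimensions from a symmetric distance matrix.

    Classical scaling: double-center the squared distances and take the
    leading eigenvectors, found here by power iteration with deflation.
    Assumes the distances are (close to) Euclidean; for strongly
    non-Euclidean judgments a non-metric method would be more appropriate.
    """
    n = len(dist)
    sq = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in sq]  # equals column means (symmetry)
    grand_mean = sum(row_mean) / n
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]
    coords = [[0.0] * n_dims for _ in range(n)]
    for d in range(n_dims):
        v = [math.sin(i + 1.0 + d) for i in range(n)]  # deterministic start
        start_norm = math.sqrt(sum(x * x for x in v))
        v = [x / start_norm for x in v]
        for _ in range(n_iter):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        bv = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        eigval = sum(v[i] * bv[i] for i in range(n))  # Rayleigh quotient
        if eigval <= 1e-12:
            break  # no further positive dimension; remaining axes stay zero
        scale = math.sqrt(eigval)
        for i in range(n):
            coords[i][d] = scale * v[i]
        for i in range(n):  # deflation: remove the component just found
            for j in range(n):
                b[i][j] -= eigval * v[i] * v[j]
    return coords


# Toy example: four "systems" whose distances form a 3-by-4 rectangle.
points = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0), (3.0, 4.0)]
dist = [[math.dist(p, q) for q in points] for p in points]
embedding = classical_mds(dist, n_dims=2)
for row in embedding:
    print([round(x, 3) for x in row])
```

The recovered map reproduces the distances only up to rotation and reflection, which is why naming the axes still requires system knowledge and listening.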

Figure 4: MOS values and their estimates for two versions of enhanced Bark-spectral distances, both with high correlations: a) TOSQA, b) PESQ

4.2 Instrumental Diagnostic Measures

Quality attributes which have been determined in an auditory test may also be predicted with the help of instrumental measures. For example, Barnwell's group investigated the correlation between the auditory DAM results and measured, known, or varied quality indicators like the global or segmental SNR, a frequency-dependent SNR, $L_p$ norms of (LPC) spectral distances, etc. The results show relatively loose relations even with optimized estimation parameters (like the value p in a norm), unless the class of systems was strongly restricted. Much higher correlations were achieved with a complicated, provisional combined measure [18].

Halka et al. [5] followed a different way: Using an improved noise-loading measurement [21], the linear distortion was described by an average frequency response $H(e^{j\omega})$, the non-linearities by an average noise spectrum $\Phi_{nn}(e^{j\omega})$, and the output signal by its average power density $\Phi_{yy}(e^{j\omega})$, with a speech-like random process as an input signal rather than the commonly used speech-signal samples. Then, these measured functions were mapped onto features highly correlated with the main factors of an auditory attribute analysis [1], and a combined measure for the integral quality was derived. A total correlation of some 95% was achieved, but with unexplained outliers. Later work which included systems of other ("foreign") types showed that more than just a parameter adjustment or additional sub-features was needed [7].

5 Improved Attribute Approach

The above-named direct integral-quality estimators (PESQ, TOSQA, SQET, etc.) outperform our earlier attribute-matching approach by far.
After a careful analysis, the disadvantages are seen
- not in the use of a speech-model signal,
- not in the detour via attributes: this does fit the human judgment process;
- indeed in the extraction of quality information from system descriptions averaged a priori rather than from instantaneous behaviors with an averaging at the end (as in the direct procedures);
- indeed in the comparison between output and error / distortion signals rather than input and output signals (as in the direct procedures: Gersho's approach is, unlike in audio-quality assessment, more appropriate than Atal's);
- indeed in the use of pre-defined attributes which include clearness and, especially, also naturalness, i.e., actually no detail attributes but rather composite features almost like quality itself.

We aim now at
- more suitable attributes with
  - better separation of perceptual details,
  - i.e., orthogonality such that, e.g., naturalness is replaced by a decomposition of un-natural effects,
- model systems for such effects, and
- immediate measuring procedures for these new attributes.

In a first step, we try to identify perceptual attributes for a large number of degradations which are common in modern telephone transmission. In a second step, these attributes will be predicted by adequate signal-based measures. From the auditorily or instrumentally determined attribute values, indices for integral quality will be derived as a final step. The instrumental models as well as real-world test systems are run and evaluated on a real-time simulator able to simulate a duplex speech communication with large variability [17][19].

6 First Results

In order to identify attributes which characterize a wide range of modern telecommunication systems, a multidimensional scaling experiment has been carried out [23]. 14 test subjects had to rate the similarity of stimuli which were presented in pairs, on a continuous scale. The stimuli correspond to 14 different circuit conditions reflecting typical degradations, namely those associated with
- codecs,
- linear distortions and bandwidth limitations,
- terminal equipment such as hands-free terminals (HFTs),
- time-varying packet loss in IP networks,
- additive background noise, and
- noise-reduction algorithms.

The test was carried out with source signals from two speakers (1m, 1f), resulting in 2 × 14 stimuli and the corresponding pairwise comparisons to be performed by each listener. The similarity judgments were mapped to an L-dimensional space, using the INDSCAL algorithm, which accounts for individual differences in the weighting of perceptual dimensions by each test subject [14]. L = 4 was found to be a good compromise between the goodness of fit (S-stress value, explained variance) and the interpretability of the derived dimensions.
After several expert listeners had informally listened to the stimuli in the order of their ranking along each dimension, the extracted dimensions were labeled as follows:
1. "Interruptedness": This dimension is mainly loaded by packet loss, as well as by musical tones stemming from imperfect noise reduction.
2. "Clearness" / "directness": This dimension shows high loadings for the HFT stimuli; it seems to be linked to peaks and valleys in the amplitude spectrum.
3. "Frequency content": This dimension is characterized by the presence or absence of high and low frequencies; it seems to be speaker-dependent.
4. "Noisiness": Stimuli which show high loadings on this dimension are the background-noise and circuit-noise conditions.

Some of these dimensions have already been found in earlier experiments, e.g., "interruptedness" [15]. Thus, they seem to be context-independent. Interestingly, the codec conditions all group in one cluster of the perceptual space. Earlier investigations which focused on such stimuli [1] thus cover only a small fraction of the perceptual space of modern telecommunication systems.

7 Summary and Outlook

In the present paper, different methods for the auditory measurement and the instrumental estimation of speech quality were discussed. In particular, we focussed on measures which provide information on different perceptual quality dimensions, and not just integral quality values. We argued that instrumental measures which rely on quality attributes may provide more diagnostic and more generic information on quality.

For a set of transmission conditions which are typical for modern telecommunication situations, relevant dimensions were derived in an auditory multidimensional scaling experiment. Four dimensions could be extracted; the one which explained most of the auditory data was labelled "interruptedness".
Apparently, time-varying characteristics have a strong impact on the perception of the transmitted speech samples, and it can be expected that their impact on quality is also very high. Interestingly, these characteristics seem to form a distinct perceptual dimension; this could be a result of the judgment procedure, where only one rating was solicited after listening to an entire sample pair. In future experiments, it may be helpful to collect time-varying ratings for the time-varying stimuli, e.g., using a slider method as proposed in [13].

The question which could not be answered so far is how the individual quality attributes contribute to integral quality. On the one hand, the relationship between integral quality and individual quality attributes can be investigated auditorily, using the external preference mapping procedure applied in [15]. On the other hand, it will be interesting to see whether the same relationships also hold for instrumental estimates of individual quality attributes.

Developing instrumental predictors for individual quality attributes is our ultimate aim, because we think that they will make quality prediction more robust against changes in the physical system, and open to new types of degradations. It is desirable that an instrumental predictor can estimate not only the time-average rating of a certain quality attribute but also its time variations in the case of stimuli with time-varying characteristics. Such predictors may work with real speech signals or with artificial speech-like signals. In order to develop such predictors, tools which allow individual quality attributes to be adjusted may be helpful.

The work towards the above aims is based on grants of Deutsche Forschungsgemeinschaft (DFG grants He 4465 and MO 1038); this is gratefully acknowledged.

References

[1] Bappert, V., Blauert, J.: Auditory Quality Evaluation of Speech-Coding Systems. Acta Acust., vol. 2 (1994), pp.
[2] Beerends, J.G., Stemerdink, J.A.: A Perceptual Speech-Quality Measure Based on a Psychoacoustic Sound Representation. J. Audio Eng. Soc., vol. 42 (1994), pp.
[3] Berger, J.: Instrumental Methods for Speech-Quality Estimation - Models of Auditory Tests. Doctoral dissertation (in German), Univ. Kiel,
[4] Berger, J.: TOSQA - Telecommunication Objective Speech-Quality Assessment. ITU-T, Contr. COM-12-34, Geneva,
[5] Halka, U., Heute, U.: A New Approach to Speech-Quality Measures Based on Attribute Matching. Speech Comm., vol. 11 (1992), pp.
[6] Hansen, M.: Assessment and Prediction of Speech Transmission Quality with an Auditory Processing Model. Doctoral dissertation, Univ. Oldenburg,
[7] Hauenstein, M.: Psychoacoustically Motivated Measures for Instrumental Speech-Quality Judgment. Doctoral dissertation (in German), Univ. Kiel,
[8] Hauenstein, M.: Homepage.
[9] ITU-T Handbook on Telephonometry. ITU-T, Geneva,
[10] ITU-T Rec. P.800: Methods for Subjective Determination of Transmission Quality. ITU-T, Geneva,
[11] ITU-T Rec. P.861: Objective Quality Measurement of Telephone-band Speech Codecs. ITU-T, Geneva,
[12] ITU-T Rec. P.862: Perceptual Evaluation of Speech Quality (PESQ), an Objective Method for End-to-end Speech Quality Assessment of Narrowband Telephone Networks and Speech Codecs. ITU-T, Geneva,
[13] ITU-T Rec. P.880: Continuous Evaluation of Time-varying Speech Quality. ITU-T, Geneva,
[14] Kruskal, J., Wish, M.: Multidimensional Scaling. Quantitative Applications in the Social Sciences, Vol (E.M. Uslaner, ed.), Sage, Newbury Park CA,
[15] Mattila, V.-V.: Perceptual Dimensions of Speech Quality in Mobile Communications. PhD thesis, Tampere Univ. Techn.,
[16] McDermott, B.J.: Multidimensional Analyses of Circuit Quality Judgments. JASA, vol. 45 (1969), pp.
[17] Möller, S.: Assessment and Prediction of Speech Quality in Telecommunications. Kluwer Academic Publ., Boston MA,
[18] Quackenbush, S.R., Barnwell, T.P., Clement, M.A.: Objective Measures of Speech Quality. Prentice Hall, Englewood Cliffs,
[19] Raake, A.: Assessment and Parametric Modelling of Speech Quality in Voice-over-IP Networks. Doctoral dissertation (unpubl.), Ruhr-Univ. Bochum,
[20] Schroeder, M.R., Atal, B.S., Hall, J.L.: Optimizing Digital Speech Coders by Exploiting Masking Properties of the Human Ear. JASA, vol. 66 (1979), pp.
[21] Schüssler, H.W., Dong, Y.: A New Method for Measuring the Performance of Weakly Nonlinear and Noisy Systems. Frequenz, vol. 44 (1990), pp.
[22] Voiers, W.D.: Diagnostic Acceptability Measure for Speech Communication Systems. Proceed. IEEE ICASSP, Munich, 1977, pp.
[23] Wältermann, M.: Determination of Relevant Quality Dimensions in Speech Transmission via Modern Telecommunication Links. Diploma thesis (in German), Ruhr-Univ., Bochum,
[24] Wang, S., Sekey, A., Gersho, A.: Auditory Distortion Measure for Speech Coding. Proceed. IEEE ICASSP, 1991, pp.
