Application of neural networks for prediction of subjectively assessed interior aircraft noise DENIZ HADZALIC


Master's Degree Project, Stockholm, Sweden, June 2018

Deniz Hadzalic, January 2018

Acknowledgement

First, I would like to acknowledge my supervisor at Siemens PLM Software, PhD Fabio Santos. I am thankful for your guidance and help during the execution of my master thesis. I also want to thank Professor Peter Göransson, KTH, for all the advice and mentoring along the way. I would especially like to thank Herman Van Der Auweraer, Global Director RTD at Siemens PLM, for making me aware of the opportunity of conducting a thesis within the company. I am also very grateful for the support I have received during my application time and stay. Last, but far from least, I want to thank my family for all the encouragement and love I have received during this period; it has helped more than you will ever know. I am fortunate to have you by my side.

Abstract

Products are increasingly judged by their acoustic performance, and during the last decades sound quality in general has gained a lot of attention, both from academia and from industry. An obstacle in the evaluation of sound quality is that jury testing is time consuming and requires human resources. In an attempt to overcome these limitations, neural networks have been applied in this work with the objective of finding a relation between human acoustic perception and quantities that can be physically measured. For this purpose, 30 of 170 sound samples of interior aircraft noise have been subjectively assessed during jury testing with 40 participants. With psychoacoustic features extracted from the sound samples and the results obtained from the jury testing, a shallow neural network (SNN) with one hidden layer is trained and tested. The prediction performance of the SNN is compared with an alternative method, multiple linear regression. The trained SNN then predicts the assessment of the remaining unassessed sound samples, and these predictions are used in the deep learning networks, namely a Convolutional Neural Network (CNN) and a Recurrent Neural Network (RNN).

Contents

1 Introduction
    1.1 Objective
    1.2 Structure of the report
2 Theoretical background
    2.1 Psychoacoustical metrics
        2.1.1 Bark spectrum
        2.1.2 Loudness
        2.1.3 Sharpness
        2.1.4 Tonality
        2.1.5 Roughness
        2.1.6 Fluctuation strength
    2.2 Jury testing
        2.2.1 Preparation of sound files
        2.2.2 Design of test
        2.2.3 Design of procedure
        2.2.4 Result analysis
    2.3 Artificial Neural Network (ANN)
        2.3.1 Neural Network Toolbox Matlab
        2.3.2 Recurrent Neural Network (RNN)
        2.3.3 Convolutional Neural Network (CNN)
3 Subjective assessment of aircraft interior noise
    3.1 Sample preparation
    3.2 Adjective and type of grading
    3.3 Defining the participating juror
    3.4 Execution of the jury test
4 Results and analysis of the jury test
    4.1 Evaluated annoyance of aircraft interior noise
    4.2 Additional questions
5 Sound quality prediction based on machine learning
    5.1 Correlation between psychoacoustics and annoyance
    5.2 Validation of the model
    5.3 Selection of input data
    5.4 Multiple linear regression with two predictors
    5.5 Final model and prediction of remaining points in the aircraft
6 Results of the Shallow Neural Network
    6.1 Comparison between different input data combinations
    6.2 Multiple linear regression versus Shallow Neural Network
    6.3 Prediction of the remaining unassessed stimuli
7 Sound quality prediction with deep learning
    7.1 Convolutional Neural Network - spectrogram images
        7.1.1 Pre-processing of input data - CNN
        7.1.2 Applied layers - CNN
        7.1.3 Training process and validation
    7.2 Recurrent Neural Network type LSTM
        7.2.1 Pre-processing of input data - RNN
        7.2.2 Applied layers - RNN
        7.2.3 Training process and validation
8 Results of multiple hidden layer neural networks
    8.1 Results of CNN with spectrograms as input
        CNN - data randomised with seed
        CNN - data randomised with seed
    8.2 Results of RNN with time signal as input
        RNN - data randomised with seed
        RNN - data randomised with seed
    8.3 Summary of results
9 Discussion
10 Concluding part
    10.1 Conclusion
    10.2 Future work
References
Appendix A - Matlab code
    Shallow Neural Network code
    Convolutional Neural Network code
    Recurrent Neural Network (LSTM) code

1 Introduction

Sound quality plays a major role in how a product is perceived by the end-user. Judgements on quality, robustness and reliability are largely based on the sound emitted. With growing economies and development, awareness of and attention to human sensitivity to sound quality increase as well; sound quality has become a higher priority. Subjective assessments - jury testing - are vital for understanding how the acoustic performance is perceived. Conducting a traditional subjective evaluation of acoustic performance with jurors is time consuming, and for reliable results a large number of participants is necessary. The relationship between human hearing and acoustic performance is tremendously complex. The hypothesis of this work is thus: an Artificial Neural Network (ANN) is an alternative way to mimic human acoustic perception and thereby increase the understanding of sound quality, capture the relation, and mitigate the time-consuming process of re-assessments.

The noise and sound quality concept dates back to 1883, when Stumpf proposed the acoustic characteristics concept, used for describing the physical characteristics of various impressions of sounds with the same associated sound pressure level. Many equate sound quality with merely the A-weighted sound pressure level, i.e. the lower the dB(A), the better the perceived sound quality. However, the attenuation of low-frequency sound by the A-weighting filter can result in two completely different subjectively assessed sounds having the same dB(A) value; it is the character of the sound that should be investigated in order to take reasonable measures. In some cases it is not possible to minimise the sound pressure level, given constraints of cost, weight, lack of space, or the demand of creating an adequately capable product. Sometimes the requested task is to create a powerful sound which still meets the sound pressure level requirements.
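To illustrate the low-frequency attenuation of the A-weighting filter mentioned above, the standard IEC 61672 weighting curve can be evaluated directly. The sketch below (in Python, purely illustrative and not part of the thesis code) shows how strongly a 100 Hz component is attenuated relative to 1 kHz:

```python
import math

def a_weight_db(f):
    """A-weighting gain in dB at frequency f (Hz), per the IEC 61672 formula."""
    f2 = f * f
    ra = (12194.0**2 * f2**2) / (
        (f2 + 20.6**2)
        * math.sqrt((f2 + 107.7**2) * (f2 + 737.9**2))
        * (f2 + 12194.0**2)
    )
    return 20.0 * math.log10(ra) + 2.00  # +2.00 dB normalises the gain to 0 dB at 1 kHz

# A 100 Hz component is attenuated by roughly 19 dB while 1 kHz passes unchanged,
# which is why sounds with very different low-frequency content can share one dB(A) value.
```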
The relationship between human hearing and acoustic performance cannot be fully reflected by a single psychoacoustic metric, due to its complexity. Many previous investigations have been performed within the area, see e.g. references [1], [2], [3] and [4]. Understanding and performing the extraction of psychoacoustic features from sound samples requires expertise and experience. Another drawback of using extracted features, when given as scalars, is the lack of time dependency of the signal; a large loss of information is then unavoidable. Time dependency is essential in the process of how humans perceive the sound they are exposed to. Deep learning methods allow one to utilise the time signal or the images of the spectrogram - showing the time dependency - as inputs. ANNs in the deep learning area have multiple hidden layers, creating a larger network compared to shallow ANNs - Shallow Neural Networks (SNNs) - with only one hidden layer besides the input and output layer. The Convolutional Neural Network (CNN) and the Recurrent Neural Network (RNN) are deep learning networks often used for image processing and speech recognition (classification), respectively; however, SNNs require less data than ANNs with multiple hidden layers. 1.1 Objective The objective of this report is to investigate the possibility of applying machine learning and deep learning methods for modelling human acoustic perception, by predicting the subjectively perceived annoyance of interior aircraft noise. To acquire the subjective assessment of the annoyance of the aircraft interior, a jury test must be executed. A trade-off must be made between the number of signals that can be subjectively assessed and the demand for large amounts of data for the application of deep learning.
A cross-over between machine learning and deep learning allows one to use fewer sound samples for jury testing - reducing the required time - by predicting the assessments of sound samples not yet subjectively

assessed. Deep learning models, compared to machine learning models, allow one to use different inputs, such as whole time signals and the corresponding spectrograms, instead of extracted psychoacoustic features. With these kinds of inputs, the prediction models could possibly take the time dependency into consideration.

1.2 Structure of the report

A theoretical background to the current work is given in chapter 2, where the psychoacoustic metrics are introduced, followed by jury testing and the ANN models that are applied. The methods for conducting the jury test on aircraft interior noise are presented in chapter 3. Since the results obtained from the jury test are used in the prediction models, it is more convenient for the reader to be presented with the results right after the jury testing methodology, in chapter 4. The method for constructing the SNN with extracted features (machine learning) is presented in chapter 5, and the results are shown right after, in chapter 6. The developed multi-layer neural networks (deep learning) are described in chapter 7, and the results of the deep learning networks are shown in chapter 8. The discussion follows in chapter 9, and lastly the concluding part is given in chapter 10, where the conclusions and suggested improvements for future work can be found.

2 Theoretical background

2.1 Psychoacoustical metrics

Psychoacoustics is the psychophysical study of acoustics and describes the relationship between the subjective perception (psychology) and the objective physical variable (physics), in the present work acoustic sound pressure variations. Psychoacoustics is usually used for describing how humans perceive sound [6] and, as in this case, for linking it with the subjectively assessed annoyance metric. For the sake of convenience, the psychoacoustical metrics used in this study are defined and explained below.

2.1.1 Bark spectrum

Before defining the psychoacoustic metrics, the Bark frequency scale should be introduced. Bark is a psychoacoustical scale proposed by Eberhard Zwicker and is named after Heinrich Barkhausen, who proposed the first subjective measurement of loudness [7]. Dividing scales linearly or geometrically (logarithmically), e.g. into one-third octave or octave bands, is usually useful for mathematical and physical purposes. However, some problems require a subdivision more closely related to the way in which the ear itself appears to process sound. For that reason, critical bands are very useful [8]. The relation between frequency and the critical-band function is demonstrated in figure 1. The scale goes from 1 to 24, corresponding to the first 24 critical bands of hearing, each with the width of one Bark [9]; see table 1.

Figure 1: Relation between the frequency and the critical-band function [8].
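For reference, the critical-band rate can also be approximated directly from frequency; a common closed-form fit is the Zwicker-Terhardt approximation, sketched here in Python (illustrative only; the thesis itself relies on toolbox implementations):

```python
import math

def hz_to_bark(f):
    """Critical-band rate z (Bark) from frequency f (Hz),
    using the Zwicker-Terhardt approximation."""
    return 13.0 * math.atan(0.00076 * f) + 3.5 * math.atan((f / 7500.0) ** 2)

# The mapping is roughly linear below ~500 Hz and logarithmic above:
# hz_to_bark(100) is close to 1 Bark, hz_to_bark(1000) is close to 8.5 Bark.
```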

Table 1: Bark scale critical bands [8], listing for each band number the centre frequency, cut-off frequency and bandwidth in Hz.

2.1.2 Loudness

Loudness of a sound is a perceptual measure of the effect of the sound energy content in the ear, i.e. the sensation corresponding to the physical sound intensity; loudness should not be confused with the magnitude of the sound level. When the sound power doubles, the sound level on the logarithmic scale increases by 3 dB. However, the same does not apply to this psychoacoustical metric: loudness depends on additional sound properties such as frequency and temporal characteristics [10]. Several different standardised methods for measuring the loudness of a sound are available. The unit phon was introduced by Barkhausen in the 1920s. The loudness level of a sound is defined as the sound pressure level of a 1 kHz tone, in a plane wave at frontal incidence, that is as loud as the sound itself [9]. However, as sounds become more comprehensive and complex in character, the critical bandwidth becomes a contributing factor. If two tones are closer to one another than one critical bandwidth, the tones risk masking each other and sounding like one tone instead of two, which makes the loudness measurement more difficult. With the assumption that a relative change in loudness is a relative change in intensity [9], a specific loudness can be derived from the dB level of each third-octave band. By utilising a power law, the specific loudness, N′, with the unit sone/Bark, can be computed. The final value of the loudness, N, is obtained by integrating the constructed masking curves

that represent the effect of critical bands according to:

N = ∫₀^{24 Bark} N′ dz    (1)

where z is the critical band rate measured in Bark. The relation between phon and sone is given as:

N = 2^{(P − 40)/10}    (2)

Both methods have their limitations, and other measurement approaches have been proposed by Moore, Glasberg and Baer [11], who suggested using a non-zero value for the loudness at the threshold. Further investigation of methods for determining loudness will not be conducted in this study.

2.1.3 Sharpness

To measure the high-frequency content of a signal, one can utilise the psychoacoustical metric sharpness, whose unit is the acum. Sharpness is useful where the high-frequency content of the sound is important for a product's quality; it allows one to classify sounds as shrill (sharp) or dull. If the proportion of high-frequency content increases, the sound becomes sharper. For sounds from, for example, engines, vacuum cleaners and hair dryers, sharpness has been used to quantify sound quality. The metric sharpness has not yet been standardised and there are various ways to calculate its value. Zwicker and Fastl's method [9], which is a version of Von Bismarck's [12], derives sharpness as the weighted first moment of the specific loudness, N′. The partial moment at z, N′ z dz, is weighted by multiplying it with the function g′(z).
The weighted partial moments are summed and divided by the total loudness, N, defined as:

N = ∫₀^{24 Bark} N′ dz    (3)

The result is multiplied by the constant c = 0.11 in order to obtain the final value of the sharpness, S:

S = c (∫₀^{24 Bark} N′ g′(z) z dz) / (∫₀^{24 Bark} N′ dz)    (4)

where the weighting function is defined as:

g′(z) = 1 for z ≤ 14 Bark; for z > 14 Bark, g′(z) exceeds 1 and increases with z according to a fitted polynomial [9].    (5)

2.1.4 Tonality

Tonality has to be judged subjectively; however, there are various methods and standards for calculating a value, which may not reflect the human perception linearly. A tone at around 700 Hz in a noise spectrum gives the highest impression of tonality. Narrowband sounds can be perceived as tonal, and the smaller the bandwidth, the more tonal the noise is considered by the receiver. The tonality metric is based on a model by W. Aures; the tonal components are identified by utilising the pitch extraction method developed by Terhardt [13]. The spectral lines that are at least 7 dB higher than their two lower and two higher (in frequency) neighbours are isolated. A new spectrum, free from tonal components, is then built by removing the detected sequences of five spectral lines considered as pure tones.
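The 7 dB peak-picking rule described above can be sketched as follows (a simplified illustration in Python; the actual Terhardt/Aures procedure involves further steps not shown here, and the example spectrum is made up):

```python
def tonal_line_indices(levels_db, threshold_db=7.0):
    """Return indices of spectral lines exceeding their two lower and
    two higher neighbours by at least threshold_db (simplified rule)."""
    peaks = []
    for i in range(2, len(levels_db) - 2):
        neighbours = [levels_db[i - 2], levels_db[i - 1],
                      levels_db[i + 1], levels_db[i + 2]]
        if all(levels_db[i] - n >= threshold_db for n in neighbours):
            peaks.append(i)
    return peaks

# A flat 40 dB noise floor with one 60 dB line at index 4 yields a single tonal peak.
spectrum = [40.0] * 9
spectrum[4] = 60.0
```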

The ratio of the loudness due to the tonal components to the total loudness of both spectra is calculated and denoted W_N. Further, an extra weighting function, W_T, is derived from the pitch weights of the tonal components that are relevant to pitch perception, a frequency dependent function where the perception of tonality is maximal at 700 Hz. Additionally, a constant C is introduced to scale the tonality and standardise the result. The constant is adjusted so that a 1 kHz sine tone at 60 dB gives a tonality of 1 t.u. (tonality unit). The formula for calculating the tonality is:

K = C W_T^{0.29} W_N^{0.79}    (6)

2.1.5 Roughness

Roughness is another psychoacoustic quantity often used by sound quality engineers; in the automotive field, for example, the level of roughness determines the sportiness of a car. Roughness is substantially described by the temporal masking of sound. Figure 2 illustrates the temporal variation of sound (hatched areas) and the amplitude in the time domain.

Figure 2: Temporal variation of a sound with almost 100 % modulated amplitude, and the inputs to the model for calculating the roughness of a signal [7].

Theoretically, the modulation of the level between the peaks reaches nearly minus infinity on the dB scale. In practical applications this is not the case, since the minimum value is controlled by the dynamics of the hearing system. Post-masking, represented by the decay of psychoacoustic excitation in the hearing system, occurs, and hence the modulation depth, ΔL, reaches much smaller values [7]. With the modulation frequency, f_mod, and the modulation depth, ΔL, represented by the solid black line in figure 2, it is possible to estimate the roughness, R:

R ∝ ΔL f_mod    (7)

The unit of roughness is named asper, meaning rough. One asper of roughness is obtained when the sound is a 1 kHz tone at 60 dB, 100 % amplitude modulated at a modulation frequency of 70 Hz.
A trade-off between f_mod and ΔL occurs, since rapid changes are perceived differently by objective measurements and subjective perception. For low modulation frequencies, the product in equation (7) becomes small, i.e. low roughness. For high modulation frequencies, the subjectively perceived temporal depth, ΔL, becomes very small and the roughness vanishes. Since the value of ΔL depends on the critical band rate, the approximation in (7) is substituted with:

R = c ∫₀^{24 Bark} f_mod ΔL(z) dz    (8)

where c is a constant that depends on the boundary conditions applied. The boundary condition that a 1 kHz tone at 60 dB, 100 % amplitude modulated, produces 1 asper gives the constant the value c = 0.3 [9].
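A minimal numerical sketch of equation (8), assuming the modulation depth ΔL(z) has already been sampled on a uniform Bark grid (Python, illustrative only):

```python
def roughness(f_mod, delta_L, dz, c=0.3):
    """Discrete version of R = c * integral of f_mod * dL(z) dz over 0..24 Bark.
    delta_L: samples of the modulation depth; dz: Bark step between samples."""
    return c * f_mod * sum(delta_L) * dz

# With a constant modulation depth the integral reduces to c * f_mod * dL * 24 Bark,
# recovering the proportionality R ~ dL * f_mod of equation (7).
```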

2.1.6 Fluctuation strength

Fluctuation strength is a similar concept to roughness, but for modulation frequencies up to 20 Hz; the roughness character takes over for modulation frequencies above 20 Hz, and the limit between the two psychoacoustical quantities is vague. The unit of fluctuation strength is called vacil, and one vacil is produced by a 1000 Hz tone of 60 dB which is 100 % amplitude modulated at 4 Hz. The relation for the fluctuation strength, F, contains the modulation frequency, f_mod, and the masking depth, ΔL:

F ∝ (∫₀^{24 Bark} ΔL dz) / (f_mod/4 Hz + 4 Hz/f_mod)    (9)

The masking depth should not be mistaken for the modulation depth; the masking depth is smaller due to post-masking [9].

2.2 Jury testing

Several different ways of conducting a sound quality test with the help of a jury exist. The main objective of a jury test is to determine how the average human perceives a product depending on the sound it generates, and to assess it according to a given adjective. One example, beyond the one investigated in this report, is how annoying the sound of an electric shaver is perceived to be by the user, which is investigated in article [1]. The results are later utilised for analysing possible improvements of the sound quality, by defining the correct attributes to re-design in order to impact the appeal of the sound investigated [14]. The structure for conducting a jury test consists of four main steps, see figure 3.

Figure 3: A common structure of how subjective jury testing is conducted.

2.2.1 Preparation of sound files

The first step is mainly to prepare the sound files that represent the sound of the product to be evaluated. The sound is measured either monaurally or binaurally. The recording of the sound should be executed in an environment similar to the one in which the receiver will be exposed to it: if the sound of a leaf blower is examined, the recordings could be made in a semi-anechoic room, i.e.
no reflections except from the floor, since a leaf blower normally operates outdoors. Other aspects to take into consideration are the decomposition into synthetic sound and the length of the actual stimulus; the memory of the ear is up to five seconds, which complicates matters in the case of a long and non-stationary signal. Many purposes exist for decomposing and synthesizing the recorded sound samples; two main reasons are: (i) the possibility to focus on the parts of the sound that are of actual interest, e.g. the low-frequency sound of a car engine instead of the wind fluctuation noise that was included in the measurement recorded in situ, and/or, as in the case of this study, (ii) to enable modifications of the sound sample - re-synthesizing - for re-assessment by a jury or a prediction model.

2.2.2 Design of test

The second step contains the design process of the test in which the sounds will be graded by the jurors. Settings such as the sounds that will be used, optional reference answers, and pictures that display the product itself are designed for the

specific test. The questions are formulated, together with how the different sounds should be graded by the jurors. An A-B comparison can be conducted between two different products: the juror chooses a preference based on the criterion imposed by the test designer. Semantic differential is an alternative method of asking the juror for his/her subjective opinion in the form of two opposing adjectives, e.g. insufficient-good, with steps in between to grade the level of the given adjectives. An additional way of grading is to display different categories for the juror to choose from when grading an imposed criterion; e.g. the criterion is labelled pleasant and the juror must choose between categories such as not at all, slightly, moderately, very and extremely, see figure 4.

Figure 4: Grading a coffee machine with categories according to the criterion pleasant [15].

2.2.3 Design of procedure

The third step consists of conducting the actual test with the jury, see figure 5. With the help of jury testing programs, such as Test.Lab [15], it is feasible to conduct tests with a large number of jurors, with easy management and evaluation of the submitted results.

Figure 5: A common structure of how the actual subjective testing is conducted [28].

2.2.4 Result analysis

The fourth and last step includes analysing the results obtained from the subjective testing. Quantities such as the consistency and the concordance among the jurors are evaluated, and a threshold can be specified to exclude grades with high discrepancy from the average, so-called outliers. Consistency is a measure of how consistently each juror answers; it is often tested by repeating an arbitrary number of sound samples during the test. Concordance is a measure of how similarly the jurors have answered relative to each other. The final results of the assessment can at a later stage be used for

creating a linear regression model or, as is the objective of this thesis, a prediction model of human acoustic perception with an ANN. In the cases where extracted features are utilised as input, the correlation between the subjective quantities decided by the jurors and the objective quantities is investigated to find the most suitable features for the model.

2.3 Artificial Neural Network (ANN)

The topic of ANNs is of such size that covering it all reaches beyond the scope of this dissertation. Nevertheless, it is important to introduce the subject for the reader who may be new to ANNs, which is quite probable: the dissertation is, after all, in the field of acoustics. First, however, it is important to understand where the classes being used in this study sit in the tree of Artificial Intelligence (AI). What is AI, briefly put? AI is the field of machines designed to become intelligent in general, as humans are. To achieve this, several different classes exist within AI, of which machine learning is a large subset. Machine learning is used for creating functions that mimic patterns in data. In traditional programming, the user provides the data and a hand-written algorithm to obtain the output; in machine learning, the user instead provides the input and the output to obtain the algorithm that gives the relation between them. Deep learning, in turn, is the subset of machine learning that comprises all neural networks with more than one hidden layer - multi-layer networks, see figure 6.

Figure 6: Illustration of the subsets of artificial intelligence [30].

One of the key objectives of this work is to find the relation between a signal and the human perception of it. In machine learning, the features of interest are already extracted from the signal itself; for instance, the psychoacoustic metrics are evaluated and used as inputs to the network.
In deep learning, the network is supposed to extract the important features on its own from the raw data given as input; see figure 7.

Figure 7: The difference between machine and deep learning [16].
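The feature-selection idea mentioned earlier, correlating each extracted psychoacoustic metric with the mean jury grade, can be sketched with a plain Pearson correlation (Python, illustrative only; the metric and grade values below are made up):

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical example: loudness values per stimulus vs. mean annoyance grades.
loudness = [4.0, 5.5, 6.1, 7.8, 9.0]
annoyance = [2.1, 3.0, 3.2, 4.4, 5.1]
```

A feature whose correlation with the mean grades is close to ±1 is a strong candidate input for the regression or SNN model.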

So what is an ANN? An ANN imitates the human biological neural system, which is, briefly explained, built from a series of interconnected neurons that interact with each other through axon terminals connected via synapses; see figure 8. The neural system of a human being is still so tremendously complex and non-transparent that an ANN is, beyond a certain point, inconsistent with the biological neural network. The neural system is the reason why a person can, for example, learn to identify patterns and distinguish different animals or objects. The same is desired for machines today, i.e. that they improve and learn on their own over time; without machine learning, one would be required to maintain the algorithms by hand. It may seem as if ANNs are something new; however, the beginning of neurocomputing is often said to have been in 1943, when an article published by McCulloch and Pitts showed that simple systems of neural networks could compute any arithmetic or logical function [17], using something denoted threshold logic. ANNs have a substantial impact on the world today, and one is in daily contact with programs in which ANNs have been implemented. A few examples, among a vast number of applications, are search engines such as Google, which study the pattern of what the user is searching for in order to display suitable ads. Another example of where ANNs are applied is in the medical field, for identifying tumours in x-ray images; a well-trained ANN has the potential to achieve a lower level of observation errors than a human being.

Figure 8: Illustration of a biological neuron [31].

A deep neural network architecture starts with an input layer, has multiple hidden layers in the middle, and ends with an output layer. The layers are connected via nodes - called neurons - where the input of one node is the output of the previous node.
With synapses connected in between, the multiple hidden layers communicate with each other; see figure 9.

Figure 9: Demonstration of a neural network with input, output and multiple hidden layers [16].

In this work, a shallow neural network with one hidden layer is created with the Neural Net Fitting app of Matlab; in addition, neural networks with multiple hidden layers - a Recurrent Neural Network and a Convolutional Neural Network - are created in Matlab from scratch.

2.3.1 Neural Network Toolbox Matlab

With a neural network created with the MATLAB Neural Network Toolbox, one can rather quickly acquire a moderately accurate prediction model that solves a data-fitting problem. The user feeds the program with extracted features of the different sound samples, and it maps between the data set of numeric inputs and a set of numeric targets. The Neural Net Fitting app of MATLAB allows one to construct a shallow neural network, built as a two-layer feed-forward network. One chooses the share of inputs and targets that the network trains on, and the remainder is used for validation and/or testing. The validation makes sure that the training proceeds properly without creating a network that over-fits, i.e. one with high accuracy on the trained inputs and targets but low accuracy on the new inputs and targets used for testing. The tool also conveniently supplies the user with opportunities to visualise the results, in terms of regression fit, error histogram and training performance, among others.

2.3.2 Recurrent Neural Network (RNN)

A feed-forward network channels its information by feeding it forward through the neurons, i.e. the information does not pass one neuron twice, see figure 10. Recurrent Neural Networks, on the other hand, are designed with a loop: the output of a hidden layer (or of the network) is fed back to the input layer, allowing the network to base its decision on two sources of input instead of one [18].
An RNN is suitable for predictions on a sequence of inputs where the form of the sequence is of importance, i.e. when it matters in what order the information arrives, e.g. time series. RNNs can produce an output for every entity used as input or, as in this case, one single output value for the entire sequence [19].

Figure 10: The difference between an RNN and a feed-forward neural network [32].

An often-used RNN architecture is the Long Short-Term Memory (LSTM); it solves the problem of the gradients of some weights becoming either too small or too large when the network is unfolded for too many time steps [25]. The LSTM expands the memory of the RNN and is therefore very well suited for signals that change notably over time - it enables the RNN to remember the inputs over a longer time period.

2.3.3 Convolutional Neural Network (CNN)

Convolutional Neural Networks (CNNs) are well established and recognised in the world of deep learning. CNNs are of great utility - and widely used - in the field of image processing. Fully connected networks often become too big, and hence unfeasible to train. CNNs overcome this issue by limiting the connections of each neuron in the hidden layer to a few of the neurons in the input layer, thus reducing the number of parameters to be learned. The convolutional layer is composed of several groups of neurons, and each group has as many neurons as required to cover the whole image. The weights are shared among all the neurons in the corresponding group. Each group extracts a so-called feature by calculating a convolution of the image with its weights - a processed version of the image [18]. Further, layers named average pooling and max pooling are applied to filter the image. The technique yields the average or max value over each patch (whose size depends on the properties of the chosen layer) of the processed images; the new images are smaller than the original size. This can be repeated as many times as desired, until the layers become small enough for a fully connected layer and finally the output layer, see figure 11. After the network has performed all the steps of the feed-forward propagation, the error between the target value and the obtained output is derived.
Back-propagation then calculates the gradients of the error with respect to the weights; gradient descent is used to update the weights and parameter values so as to decrease the error. This is repeated until the training error reaches a value that is adequate for the application.

Figure 11: Convolutional neural network for its most common application - image classification [26].
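The feed-forward, error and weight-update loop described above can be condensed into a minimal sketch for a single linear neuron (Python, purely illustrative; the thesis networks themselves are built in Matlab):

```python
def train_neuron(samples, lr=0.1, epochs=100):
    """Gradient-descent training of one linear neuron y_hat = w * x
    on (x, y) pairs, minimising the squared error."""
    w = 0.0
    for _ in range(epochs):
        for x, y in samples:
            y_hat = w * x        # feed-forward pass
            error = y_hat - y    # error between output and target
            w -= lr * error * x  # gradient descent on the squared error
    return w

# Toy data generated by y = 2x; the learned weight should approach 2.
samples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
```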

3 Subjective assessment of aircraft interior noise

A jury test was conducted with the help of the LMS Test.Lab jury testing program. In total, 40 jurors participated in the test. The consecutive steps of designing the layout of the evaluation test were the following:

- Sample preparation - choosing the appropriate sound samples for the jurors to evaluate, and the number of samples
- Adjective to assess and type of grading - determining the adjective(s) of interest for subjective assessment and deciding how to grade the sound samples
- Additional questions - formulating questions that define the juror

3.1 Sample preparation

The data acquired are sound samples recorded inside the cabin fuselage of a Dornier 228 propeller aircraft during a project called ASANCA II, see figure 12. 85 points in the aircraft were measured during two cruising conditions: with both engines working (i) synchronously and (ii) asynchronously. The recorded sound samples were further decomposed into synthetic sounds, work conducted by Lorenzo Angeloni [20]. Synthetic sounds allow one to investigate which individual noise components contribute to a lower perceived sound quality.

Figure 12: The interior noise recordings were made at 85 positions with a grid of 17 times 5 microphones.

Similar psychoacoustic metric values are obtained in specific concentrated areas of the aircraft; hence, four different clusters are defined for each of the two cases (synchronisation ON/OFF), see figure 13. The clusters are derived with the K-means technique and the results are obtained from the dissertation of Lorenzo Angeloni [20]. From these 8 clusters, 14 sound samples in synchronous mode and 16 sound samples in asynchronous mode are chosen.

Figure 13: Clusters, derived with the K-means technique, displaying the samples with similar combinations of psychoacoustic features [20].

A reference sound sample, called the anchor, is chosen for the juror to compare every sound sample with and grade against. The reference sample is from point 38 in the synchronous case; the aim was to find an anchor that was neutral - i.e. having neither high nor low feature values compared to the rest of the points in the aircraft. The anchor sound sample is played for the juror and subsequently one randomly selected stimulus is played. The juror subjectively assesses how annoying he/she perceives the stimulus to be relative to the anchor. The complete sound sample that the juror is exposed to is composed of 6.5 seconds of the anchor, followed by a silent pause of 1.5 seconds, and subsequently the actual stimulus for evaluation, played for 6.5 seconds, see figure 14. The silent pause is sufficient for the juror to recognise the change of stimulus, yet short enough not to exceed the acoustic memory of the ear, see section 2.2. The sound samples from inside the aircraft are stationary; however, they contain modulation. For the juror to be able to give a true judgement he/she needs to perceive the sound for a longer time - hence 6.5 seconds.

Figure 14: The structure of the sound sample that the juror is exposed to.

For the sake of good results, it is highly important to have an appropriate number of samples

for assessment in order to avoid tiring out the juror. During the assessment process the jurors hear a total of 35 sound samples (also called stimuli), where 5 samples are repeated with the purpose of investigating the consistency of each juror. If the juror does not assess a repeated stimulus with the same grade as previously, an averaged value is computed.

3.2 Adjective and type of grading

The adjective being assessed is annoyance and the grading is executed with a semantic differential. The juror grades the given stimulus against the anchor by choosing between the following formulations of the adjective: Much less Annoying, Less Annoying, Slightly less Annoying, Equally Annoying, Slightly more Annoying, More Annoying and Much more Annoying. Pictures of a similar type of aircraft and the inside of the cabin are placed above the semantic differential, for the juror to relate to and obtain a feeling of the environment where the sound occurs. The final design of the interface is presented in figure 15.

Figure 15: The interface that is displayed for the juror to assess the sound sample.

3.3 Defining the participating juror

The jurors have the possibility to be anonymous; however, it is important - for the sake of the performed experiment - to acquire information about who participated. Among other possible reasons, the information about the jurors reveals whether the right target group has performed the evaluation. Additional questions were formulated to define the following details of the juror:

- Age of the participating juror
- Gender
- Commercial flying experience
- The juror's opinion regarding the performed test

3.4 Execution of the jury test

The test was performed at Siemens PLM Software in Leuven, Belgium. All participants were employees at the same company with different working roles. Before initiation, the

jurors were prepared by being informed about the background and objective of the test, as well as each step of the execution. No blind test was required since each assessment was compared to the same anchor; the different sound samples did not represent different products, only different locations inside the aircraft, hence no known motivation for bias was identified. The headphones utilised during the test were over-ear Sennheiser HD600. In total, 40 people participated. The test lasted 30 minutes. Nine sessions were held with a maximum of six jurors at the same time. All computers used were linked and the assessment per stimulus/additional question was conducted synchronously, i.e. the jurors could not move on to the next stimulus/additional question until everyone had finished grading/answering the current one. In figure 16 it is possible to see one of the sessions performed.

Figure 16: During one of the jury test sessions. All sessions were executed in the facility of Siemens PLM Software in Leuven, Belgium.

4 Results and analysis of the jury test

4.1 Evaluated annoyance of aircraft interior noise

After all the sessions had been conducted, the resulting consistency and concordance were analysed; see figure 17.

Figure 17: The obtained total concordance and consistency of each juror.

High values were obtained for both measures - concordance and consistency. All jurors met the thresholds of 0.7 in consistency and 0.8 in concordance. The numerical values obtained from the grading are set by the software, see figure 18.

Figure 18: The numerical values given when the juror grades the sound sample.

After all 40 participating jurors had conducted the test, the mean of the grading numbers for each stimulus was computed. The subjectively assessed annoyance for each point in the aircraft is displayed in figure 19.
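The target computation described above - one mean grade per stimulus across all jurors - can be sketched as follows. This is an illustrative Python snippet with a hypothetical data layout (the thesis software, Test.Lab, performs this step); the grades use the software's 0-to-1 numeric scale:

```python
# Sketch, hypothetical data layout: the target for each stimulus is the mean
# of the grades given by the participating jurors.
def mean_targets(grades_per_juror):
    """grades_per_juror: list of lists, one row of grades per juror."""
    n_jurors = len(grades_per_juror)
    n_stimuli = len(grades_per_juror[0])
    return [sum(row[i] for row in grades_per_juror) / n_jurors
            for i in range(n_stimuli)]

jury = [[0.2, 0.8], [0.4, 0.6], [0.3, 0.7]]   # 3 jurors, 2 stimuli (made up)
targets = mean_targets(jury)                   # one target per stimulus
```

In the thesis, this averaging is done over 40 jurors and 30 stimuli, producing the target vector used for training the networks.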

Figure 19: The mean obtained of 30 evaluated sound samples by 40 jurors. The 's' after the point number, for some of the point names, denotes stimuli from the synchronous case. The points without an 's' in the point name are obtained during the asynchronous case.

A good spread between high and low annoyance grades is obtained; however, there is a lack of stimuli perceived as equally or nearly as annoying as the anchor. To obtain a good predictor, it should be trained on data with broad variety. A network only trained on e.g. high annoyance will only be able to predict signals with high annoyance well, but not the signals assessed with low annoyance. These results are further used as targets for the training of the neural networks.

4.2 Additional questions

The results of the additional questions are displayed in circle diagrams provided by the software Test.Lab. 28/40 jurors were in the youngest age group, while 10/40 were between the ages of 30 and 40, and one juror each fell in the two remaining age groups; see figure 20.

Figure 20: Circle diagram in percentage presenting the age distribution of the 40 jurors participating in the jury test.

Of all the jurors assessing the sound samples, 35/40 were male and 5/40 were female; see figure 21.

Figure 21: The distribution of gender among the participating jurors.

Some jurors either missed or declined to answer some of the additional questions, hence the uneven percentage values in the following circle diagrams. The jurors were asked about their experience of travelling by aircraft; the answers are presented in figure 22.

Figure 22: The distribution of how many times the jurors fly per year.

To gain certainty regarding the quality of the executed assessment, the jurors were also asked at the end what their opinion of the test was, see figure 23. This would indicate whether difficulties had occurred during the execution, affecting the subjective assessment.

Figure 23: The jurors' opinions of the performed test.

A clear majority (approximately 67%) perceived the test as "easy and well described", and the second largest group perceived the test as "easy" (approximately 29%). Two small portions of the jurors thought the sound samples were either too short or too long (approximately 2% each). The conclusion is that the vast majority of the jurors had no issues performing the assessment; thus the judgement was not negatively affected.

5 Sound quality prediction based on machine learning

With the help of the Matlab Neural Network Toolbox it is possible to design single-hidden-layer neural networks that map between a data set of numeric inputs and numeric targets. By choosing the most suitable psychoacoustic features, the most accurate prediction model can be obtained. The features are, in this case, psychoacoustic metrics that have been derived by L. Angeloni [20] using LMS Test.Lab. The work was initiated by designing a shallow neural network, with which it was possible to find a dependency between the inputs and the targets. The development then continued with the objective to improve this relation, i.e. decrease the discrepancy between the expected and predicted values. The most suitable combination of psychoacoustic metrics is investigated by performing correlation analyses and K-fold cross-validation to compare the performances.

5.1 Correlation between psychoacoustic metrics and annoyance

With the obtained results from the jury test it is possible to proceed with creating a shallow neural network. The input to the data-fitting program is, as explained, the extracted features - psychoacoustic metrics - see table 2.

Table 2: Numerical values of the subjectively assessed annoyance [-] and the derived psychoacoustic metrics - loudness [sone], fluctuation strength [vacil], tonality [-], sharpness [acum] and roughness [asper] - extracted for each of the 30 stimuli evaluated during the jury testing (an 's' after the stimulus number denotes the synchronous case) and for the anchor.

Correlation analysis is performed to investigate which potential inputs have a linear dependency on the output. The magnitude of the linear dependency [21] between the psychoacoustic metrics and the annoyance given by the subjective evaluation is evaluated as:

ρ(X, Y) = C(X, Y) / (D(X) D(Y))    (10)

where C(X, Y) is the covariance between the variables X and Y, while D(X) and D(Y) are the standard deviations of X and Y respectively. The obtained correlations are given in table 3.
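Equation (10) can be transcribed directly into code. The following is an illustrative Python snippet (the thesis work itself used Matlab), with made-up metric values standing in for the table entries:

```python
# Direct transcription of equation (10): Pearson correlation using the
# population covariance C(X, Y) and standard deviations D(X), D(Y).
from math import sqrt

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n   # C(X, Y)
    dx = sqrt(sum((a - mx) ** 2 for a in x) / n)               # D(X)
    dy = sqrt(sum((b - my) ** 2 for b in y) / n)               # D(Y)
    return cov / (dx * dy)

loudness = [10.0, 12.0, 14.0, 16.0]    # made-up metric values
annoyance = [0.2, 0.4, 0.6, 0.8]       # made-up annoyance grades
rho = pearson(loudness, annoyance)     # exactly linear data give rho = 1
```

A value near +1 or -1 indicates strong linear dependency; a value near 0 only rules out a linear relation, not a non-linear one, as discussed below.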

Table 3: The correlation between the subjectively assessed adjective - annoyance - and the extracted psychoacoustic features.

The psychoacoustic metrics with the highest correlation to the subjectively assessed annoyance are loudness, sharpness and tonality. These metrics also correlate strongly with each other. Fluctuation strength and roughness also obtain a significant negative correlation to annoyance. The correlation coefficient determines the linear dependency between two variables. If no dependency exists, neither is there any correlation. However, the converse is not true: if no correlation exists, or it is very small, the two quantities can still have a relation - a non-linear one. Thus, one should not only choose the metrics with the highest correlation; nevertheless, the correlation still gives a clue about which metrics to start with.

5.2 Validation of model

Methods for validating the neural network models are crucial for establishing their reliability. The absolute error for each stimulus i is defined as the absolute value of the difference between the predicted annoyance value and the corresponding target:

e_i = |f_i - y_i|    (11)

The absolute errors are compared and the maximum error, e_max, can be found. All absolute errors are summed and divided by the number of error terms, n, in order to obtain the mean:

ē = (1/n) Σ_{i=1}^{n} e_i    (12)

An additional way to analyse the accuracy of the network is to compute the root-mean-square error, which is defined as:

RMSE = sqrt( (1/n) Σ_{i=1}^{n} (f_i - y_i)² )    (13)

In order to determine which network configuration performs best, i.e. predicts with the highest accuracy, K-fold cross-validations are performed. The vector containing the targets is randomised together with the associated psychoacoustic quantities and divided into K parts.
The 30 targets are divided into 6 sets of 5, so that 17% of the stimuli are used for testing in each run. For each run, the test set is shifted to 5 other stimuli until every stimulus has been used for testing; see figure 24.
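The validation loop around equations (11)-(13) can be sketched as follows. This is an illustrative Python version (the thesis trains a Matlab SNN in each fold); the predictor passed in is a stand-in, and the fold layout mirrors the K = 6 split described above:

```python
# Sketch of K-fold cross-validation with the error measures of
# equations (11)-(13); `train_and_predict` is a placeholder for the
# model trained in each fold.
import random
from math import sqrt

def fold_metrics(pred, target):
    errors = [abs(f - y) for f, y in zip(pred, target)]        # eq (11)
    mean_err = sum(errors) / len(errors)                       # eq (12)
    rmse = sqrt(sum((f - y) ** 2
                    for f, y in zip(pred, target)) / len(pred))  # eq (13)
    return mean_err, max(errors), rmse

def k_fold(inputs, targets, k, train_and_predict):
    idx = list(range(len(targets)))
    random.Random(0).shuffle(idx)          # randomise, reproducibly
    fold = len(idx) // k                   # 30 targets, k = 6 -> folds of 5
    results = []
    for j in range(k):
        test = idx[j * fold:(j + 1) * fold]
        train = [i for i in idx if i not in test]
        pred = train_and_predict(train, test, inputs, targets)
        results.append(fold_metrics(pred, [targets[i] for i in test]))
    final_avg = sum(r[0] for r in results) / k   # final average over runs
    return results, final_avg

# Stand-in predictor that simply returns the true targets (zero error):
perfect = lambda train, test, X, y: [y[i] for i in test]
demo_targets = [i / 29 for i in range(30)]
runs, final_avg = k_fold(list(range(30)), demo_targets, 6, perfect)
```

Shifting the test fold through all six positions guarantees that every stimulus is held out exactly once, which is what makes the final average a fair accuracy estimate.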

Figure 24: Each row represents one of the 6 runs (K=6) performed to validate that the model predicts well for all targets. The red boxes represent the part of the target vector used for testing, and the blue boxes the part used for training the model.

For each run the average error, RMSE value and max error are observed and saved for comparison. When the final run has been conducted, all average errors are summed and divided by the number of runs performed, and the final average is obtained:

Final accuracy = (1/K) Σ_{j=1}^{K} ē_j    (14)

5.3 Selection of input data

High correlations (both negative and positive) between the annoyance and the psychoacoustic metrics loudness, sharpness, tonality, fluctuation strength and roughness are obtained, and a comparison is conducted. Seven different combinations of inputs are selected for comparison:

1. Loudness (L)
2. Loudness and sharpness (L,S)
3. Loudness and tonality (L,T)
4. Loudness, sharpness and tonality (L,S,T)
5. Loudness, sharpness, tonality and fluctuation strength (L,S,T,F)
6. Loudness, sharpness, tonality and roughness (L,S,T,R)
7. Loudness, sharpness, tonality, fluctuation strength and roughness (L,S,T,F,R)

The training function utilised for the SNN is Bayesian regularisation back-propagation; it updates the weights and bias values according to Levenberg-Marquardt optimisation. By minimising a combination of squared errors and weights, it chooses the combination that generalises well. The pre-processing and post-processing chosen is the mapstd function; it processes the input and target data by mapping their means to 0 and their standard deviations to 1. The distribution of the data chosen as inputs and targets is randomised. The randomisation is performed the same way for the different combinations, i.e. the differently trained networks have the same sound samples as input and the same corresponding targets for testing, to avoid biased testing results.

5.4 Multiple Linear Regression with two predictors

Multiple linear regression (MLR) is one of the most common methods for deriving a function expressing the targets when several extracted features are used as input variables with one response variable (the target value). It is commonly used with the goal of investigating and improving sound quality. As in the method utilising the SNN, the annoyance values obtained from the jury test are used as targets and the psychoacoustic metrics as predictor variables (inputs). A comparison is executed between the final SNN design and an MLR model. Two MLR models are created: (i) without and (ii) with interaction terms. The function for the simple MLR model is defined as:

y = b_0 + b_1 x_1 + b_2 x_2    (15)

where b_i are the regression coefficients to be estimated and x_i the inputs. The same method used for K-fold cross-validation of the SNN is applied to the MLR model, and the results are later compared with those from the cross-validation of the SNN.
The same psychoacoustic metric inputs - from 25 stimuli - are used for deriving the multiple linear regression model, and the same 5 remaining inputs and targets are used for testing and comparing the models. The coefficients b_i are derived with the Matlab function regress. The function with quadratic interaction terms is defined accordingly:

y = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_1 x_2    (16)

5.5 Final model and prediction of remaining points in the aircraft

When the final combination of psychoacoustic metrics has been chosen, the network is trained with the features of all 30 stimuli. The same architecture of the shallow network is utilised as in the investigation of the selection of input data. Ten neurons are used; due to the high linear dependency, the number of neurons does not significantly affect the average error results of the K-fold cross-validation. The shallow neural network trains better with a larger amount of data, and therefore all 30 stimuli were used for the final training of the model, which was later applied for predicting new targets for the deep learning models. One of the main reasons for investigating the application of an SNN to interior aircraft noise was the opportunity to use fewer stimuli during the jury testing, but still perform deep learning computations that require more data. As mentioned earlier in section 3.1, 85 points were measured during two cruising conditions and 170 sound samples were acquired, see figure 25. 30 of the 170 sound samples are subjectively assessed. With 25 of the subjectively assessed sound samples the SNN is trained and with the remaining 5 it is tested; the network is then used for predicting the remaining 140 stimuli.
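The least-squares fits of equations (15) and (16) can be reproduced without any toolbox. The following is a pure-Python sketch (the thesis uses Matlab's regress) that solves the normal equations directly; the data in the example are made up:

```python
# Sketch of fitting y = b0 + b1*x1 + b2*x2 (+ b3*x1*x2) by solving the
# normal equations X'X b = X'y with Gauss-Jordan elimination.
def solve(A, b):
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))   # partial pivoting
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * ac for a, ac in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def mlr_fit(x1, x2, y, interaction=False):
    rows = [[1.0, a, b] + ([a * b] if interaction else [])
            for a, b in zip(x1, x2)]
    p = len(rows[0])
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
           for i in range(p)]
    Xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(p)]
    return solve(XtX, Xty)   # [b0, b1, b2] or [b0, b1, b2, b3]

x1 = [0.0, 1.0, 2.0, 3.0, 1.0]                 # made-up predictor values
x2 = [1.0, 0.0, 2.0, 1.0, 3.0]
y = [1 + 2 * a + 3 * b for a, b in zip(x1, x2)]  # exact model, b = (1, 2, 3)
coef = mlr_fit(x1, x2, y)
coef_int = mlr_fit(x1, x2, y, interaction=True)  # b3 should come out ~0
```

Since the synthetic data follow equation (15) exactly, the fit recovers the coefficients, and the interaction coefficient b_3 is driven to zero.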

Figure 25: The geometry and point numbers of the Dornier aircraft with the 85 points measured - a grid of 17 times 5 microphones. The measurement point names start with point number 1 in the front row to the left, point number 3 in the middle - on the aisle of the fuselage - and point number 5 all the way to the right. Point number 6 starts again behind point number 1; point number 85 thus ends up in the back row to the right.

6 Results of the Shallow Neural Network

6.1 Comparison between different input data combinations

A comparison is performed to investigate how accurately the SNN predicts the targets depending on the features used as input. The K-fold cross-validation is applied to each combination stated; see figure 26 for the results of the combination loudness and sharpness (L,S).

Figure 26: The corresponding absolute error for each stimulus. The red bars represent the stimuli tested during the run and are the only errors taken into consideration.

One can note that the trend of how the model predicts each stimulus is similar for each run - the model is consistent. The average absolute error, max absolute error and RMSE per run for each combination of features are computed and compared, see tables 4 and 5.

Table 4: The average absolute error per run (runs 1-6), the final average, and the maximum absolute error observed for the SNN model with each feature combination (L; L,S; L,T; L,S,T; L,S,T,F; L,S,T,R; L,S,T,F,R) as input.

Table 5: The RMSE per run and the final RMSE observed for the SNN model with the same feature combinations as input.

The combination of loudness and sharpness as input obtains the lowest discrepancy from the targets. The variation between the absolute average errors and RMSE of each run should not be volatile; high volatility would indicate a model that does not generalise well. In order to create a better visualisation of how well the model predicts, the results of run 6 are compared to the target values; see figure 27.

Figure 27: The prediction obtained during run 6 with the loudness and sharpness combination utilised as input.

The pink circles are the means of all 40 subjective assessments made by the jurors -

the targets. The blue dots are the predicted values of the stimuli that were also used for training the network, while the red dots are the predicted values of the stimuli only used for testing - totally unknown to the neural network.

6.2 Multiple Linear Regression versus Shallow Neural Network

Two multiple linear regression (MLR) models are created: (i) a basic model without interaction terms and (ii) a model with quadratic interaction terms in the prediction function; both are compared with the SNN model. The same stimuli used for training the SNN are used for deriving the coefficients of the MLR model; the same remaining stimuli are then used for testing both models. In order to validate which model is the most accurate, only the errors between the predictions and the five testing targets per run are taken into consideration. The absolute errors of each testing target per run obtained from both models are displayed in figure 28.

Figure 28: The errors between the predictions and testing targets of the two different approaches - SNN and MLR. The blue bars represent the absolute difference between the predictions performed by the SNN model and the corresponding testing targets, and the red bars the same for the MLR model.

The conclusion is easily drawn by merely observing figure 28 - the SNN model predicts the testing targets more precisely. The MLR model is more accurate on only 3 of the 30 predictions compared to the SNN model. Quadratic interaction terms are then added to the function and new b_i coefficients are derived. The absolute differences and the new comparison are presented in figure 29.

Figure 29: The new errors obtained when interaction terms are added to the MLR function.

The additional MLR model with interaction terms plainly shows improved accuracy compared to the basic MLR model without interaction terms. However, the SNN predicts 20 out of 30 stimuli more accurately than the MLR model with interaction terms. Finally, the average absolute error and maximum absolute error are observed after the conducted cross-validation, in order to draw a final conclusion of the comparison, see table 6.

Table 6: The average absolute error per run (runs 1-6) for the SNN, the MLR and the MLR with interaction terms, together with the final average after the cross-validation and the maximum absolute error.

The numerical values support the first observations from the previous plots, indicating that the SNN model is the more accurate approach for predicting the annoyance when utilising extracted features as input. All the measures in the table clearly show the benefit of the SNN over the MLR model in this application.

6.3 Prediction of the remaining un-assessed stimuli

All 30 subjectively assessed sound samples are now utilised for training and testing the SNN. The trained SNN is then used for predicting the annoyance of the remaining 140 sound samples that have not been subjectively evaluated by jurors. The annoyance is predicted for both cases (asynchronous and synchronous) and presented on the geometry of the aircraft to illustrate the distribution. The annoyance distribution of the asynchronous case is presented in figure 30.

Figure 30: The predicted distribution of the annoyance when the engines of the Dornier 228 aircraft work asynchronously. Yellow colour indicates high annoyance, as the colour bar on the right side of the figure shows.

In the case of the engines working asynchronously, the annoyance is highest in the front of the fuselage and lower in the back. The distribution of annoyance in the synchronous case is presented in figure 31.

Figure 31: The predicted distribution of the annoyance when the engines of the Dornier 228 aircraft work synchronously.

The difference between the two cases is clearly visible. The highest predicted annoyance is still in the front; however, the annoyance level is mitigated along the aisle and close to the left wing. The sound pressure level is lower in the fuselage when the propeller engines are working synchronously, as stated in [22]. Part of the explanation for the lower sound pressure level when both propeller engines work synchronously is destructive interference. The propellers rotate clockwise when observed from behind, facing in the direction of travel. The pressure variations induced by the propeller on the right side arise from the bottom and thus act on the floor of the fuselage; the floor has a higher stiffness compared to the roof, hence less structure-borne noise. On the left side,

the pressure variations act on the roof, creating a moving membrane that cancels the airborne noise inside the fuselage, since the propellers act as correlated sound sources. When working synchronously, the propellers are phase-shifted to induce cancellation of the sound pressure. The loudness of a sound is highly correlated with annoyance; hence, if the synchronous case results in attenuation of the sound pressure level in the fuselage, the perceived annoyance often decreases as well. An additional reason for the higher annoyance levels obtained in the asynchronous case is the acoustic beating that arises when the engines work asynchronously.

7 Sound quality prediction with deep learning

This chapter presents the methods used for building the deep learning networks with the software Matlab. With the SNN, the annoyance of the un-assessed stimuli has been predicted and implemented in the deep learning networks - which usually require more data than SNNs due to their complex structure. The deep learning networks are tested both with the 140 predicted targets (from the SNN) and the 30 assessed targets (from the jury testing). Additionally, to keep the error induced by the SNN consistent across all targets, the deep learning neural networks are also trained with all 170 targets predicted by the SNN - all having a deviation. Although all 30 assessed targets have been used for training the SNN, they still show a discrepancy to the real targets after being predicted by the SNN; see figure 32 and table 7.

Figure 32: To retain the consistency of the error between the targets applied to the deep learning neural networks, a data configuration with all targets predicted by the SNN is tested. The comparison shows small differences.

Table 7: Jury-testing (JT) targets, SNN-modified targets and their differences for the 30 assessed stimuli. The neural network is fed with 2 different sets of 170 targets each. The first contains the 30 real assessed targets from the jury testing together with the 140 un-assessed targets predicted by the SNN. The second - the SNN-modified targets - replaces the 30 assessed values with the annoyance values predicted by the SNN after it was trained on all 30 real jury testing targets.

Two different kinds of deep learning networks are presented in this chapter:

- Convolutional Neural Network - image processing
- Recurrent Neural Network - Long Short-Term Memory

Convolutional Neural Networks (CNN) are often applied for classification of images, while Recurrent Neural Networks (RNN) with Long Short-Term Memory (LSTM) are applied to prediction problems where the sequential form of the input data is important - signals with time steps. The targets (subjectively assessed grades of stimuli) are graded values between 0 and 1 - a continuous output - defining this as a regression problem; hence a regression layer is applied in both networks. Optionally, a classification layer could be chosen, where the desired outputs are categorised; however, that is not appropriate in this case. The pre-processing and the structure of the layers are presented for each neural network in the corresponding sub-chapter.

7.1 Convolutional Neural Network - Spectrogram images

A CNN is designed to find the relation between subjective human acoustic perception and something possible to measure physically. The spectrogram of each stimulus is derived and utilised as input, see figure 33. The size of the input data partly determines the performance of the network; thus, the pre-processing is a vital part of creating well-performing networks.

Figure 33: Convolutional Neural Network with the images of the signals' spectrograms as input. Observe that this figure demonstrates an example of the order of the layers; the final architecture of the CNN is presented further down in this chapter.

7.1.1 Pre-processing of input data - CNN

With a good method of pre-processing the data, both computation time and accuracy can be improved. Several different methods were applied and subsequently validated. The sound signals contain both negative and positive values with a maximum amplitude of around 28, see figure 34.

Figure 34: The sound sample with the highest amplitude out of the 170 used.

The original sampling frequency of the raw time signals is 44100 Hz. Re-sampling is performed by a factor of ten - the new sampling frequency is 4410 Hz. No filtering is applied to the signals. All signals used in the investigation have their frequency content under 1000 Hz; hence, no filter needs to be implemented, according to the Nyquist-Shannon theorem (f_s/2 > f_max). The frequency spectrum of the same signal as in figure 34 is displayed below in figure 35.

Figure 35: The frequency spectrum of the same signal as in figure 34. All dominant harmonics are found below 500 Hz.

It is possible to see the 4 dominating harmonics under 500 Hz, which is also where the vast majority of the sound energy is located. The re-sampled time signal is plotted above the original time signal with the initial sampling frequency, see figure 36.

Figure 36: The effect of re-sampling the time signal by a factor of ten.

It is hardly possible to see any of the original time signal (blue) in the plot; only when zooming in are small deviations spotted on the peaks. With the re-sampling performed, the spectrogram of each stimulus is derived, see figure 37.
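The factor-ten down-sampling relies on the Nyquist argument above: dropping samples without an anti-aliasing filter is only safe because the content lies below the new Nyquist frequency. A small Python sketch (hypothetical helper, not the thesis code) that makes this guard explicit:

```python
# Sketch of factor-ten decimation with an explicit Nyquist check: keeping
# every `factor`-th sample is only valid if f_max < fs_new / 2.
def decimate(signal, fs, factor, f_max):
    fs_new = fs / factor
    if f_max >= fs_new / 2:
        raise ValueError("content above new Nyquist frequency; filter first")
    return signal[::factor], fs_new

x = [float(i % 7) for i in range(44100)]        # stand-in 1-second signal
y, fs_new = decimate(x, 44100, 10, f_max=1000)  # 44100 Hz -> 4410 Hz
```

With f_max = 1000 Hz and fs_new/2 = 2205 Hz the condition holds, so no aliasing is introduced; content above 2205 Hz would have required a low-pass filter before decimation.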

Figure 37: The image of the spectrogram, extracted from one of the stimuli, used as input to the CNN.

The images of the spectrograms are derived with the function spectrogram in Matlab. The images obtained have the size 1025x1157x1, representing height x width x channels. The images fed to the network do not contain colours, which would require 3 channels for the RGB values. To decrease the amount of data, each image is re-sized down to 227x227x1. The re-sizing of the images results in a lower resolution; the time and memory required during the training and testing process are decreased.

7.1.2 Applied layers - CNN

The final architecture of the CNN is demonstrated in figure 38.

Figure 38: The layers applied in the CNN network where images - of the spectrograms - are used as input.

The software Matlab is utilised for creating the neural networks; hence all the information regarding the functions of the layers belonging to the network architecture is obtained from [23]. The explanation of each layer and its function is found below:

- imageInputLayer: The input layer when using images as input to a network; applies data normalisation. The input size is chosen depending on the size of the images.
- convolution2dLayer: This layer applies sliding filters to the input by moving along the input vertically and horizontally while computing the dot product of the weights and the inputs, and also adding a bias term.
- reluLayer: A threshold operation is applied to each element of the input: any value less than zero is set to zero. This layer is known as an activation function; it transforms the values, or states the conditions for, the decision of the output neuron.
- batchNormalizationLayer: A batch normalisation layer takes each input channel across a mini-batch and normalises it. First it normalises the activations of each channel by subtracting the mini-batch mean and dividing by the mini-batch standard deviation. The layer then shifts the input by a learnable offset β and scales it by a learnable scale factor γ. This layer is used between convolutional layers and non-linearities - such as ReLU layers - which increases the training speed of convolutional neural networks and reduces the sensitivity to network initialisation.
- averagePooling2dLayer: This layer performs down-sampling by dividing the input into rectangular pooling regions and computing the average of each region.
- fullyConnectedLayer: This layer multiplies the input by a weight matrix and then adds a bias vector.
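As an illustration of what the batch normalisation layer computes per channel, here is a minimal pure-Python sketch (not the Matlab implementation); β and γ are the learnable offset and scale, left at their initial values 0 and 1:

```python
# Sketch of batch normalisation for one channel of a mini-batch:
# subtract the mini-batch mean, divide by the mini-batch standard
# deviation, then apply the learnable shift beta and scale gamma.
from math import sqrt

def batch_norm(channel, beta=0.0, gamma=1.0, eps=1e-5):
    n = len(channel)
    mean = sum(channel) / n
    var = sum((v - mean) ** 2 for v in channel) / n
    return [gamma * (v - mean) / sqrt(var + eps) + beta for v in channel]

normalised = batch_norm([2.0, 4.0, 6.0, 8.0])
# normalised activations have (approximately) zero mean and unit variance
```

The small eps term guards against division by zero for channels with near-constant activations; the actual layer additionally keeps running statistics for use at inference time.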
The settings of each layer for the CNN architecture are presented below:

imageinputlayer([227 227 1])
convolution2dlayer(3,32,'Padding','same')
batchnormalizationlayer
relulayer
averagepooling2dlayer(16,'Stride',2)
convolution2dlayer(3,8,'Padding','same')
batchnormalizationlayer
relulayer
dropoutlayer(0.3)
fullyconnectedlayer(40)
relulayer
fullyconnectedlayer(10)
relulayer
fullyconnectedlayer(1)
regressionlayer
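To see how the 227x227 input shrinks through the convolutional part of this stack, the spatial sizes can be tracked with some simple bookkeeping. The sketch below is Python for illustration (the thesis uses Matlab) and assumes 'same' padding preserves the spatial size while the pooling layer uses no padding:

```python
# 'same' padding preserves height/width; a pooling window k with
# stride s and no padding maps n -> (n - k) // s + 1.

def same_conv(n):
    return n                      # 'Padding','same' keeps height/width

def pool(n, k, s):
    return (n - k) // s + 1       # valid pooling, window k, stride s

n = 227                           # input size after re-sizing the images
n = same_conv(n)                  # convolution2dlayer(3,32,...)
n = pool(n, 16, 2)                # averagepooling2dlayer(16,'Stride',2)
n = same_conv(n)                  # convolution2dlayer(3,8,...)
print(n)                          # spatial size entering the dense part
```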

7.1.3 Training process and validation

The properties for the training were set accordingly: the initial learning rate is set to 10^-5, the maximum number of epochs is 60 and the mini-batch size is 17 (= 136/8). The mini-batch size defines how many stimuli are used to evaluate the loss function and update the weights. One epoch is defined as all mini-batches having been passed through the training algorithm; one iteration is performed after passing one mini-batch, so eight iterations are performed to finish one epoch. The network is trained with the solver adam as a stochastic gradient descent optimisation. The learn drop factor is set to 0.1 and the learn drop period has the value 20: after every 20 epochs, the learning rate is multiplied by 0.1, resulting in more reliable training.

The network was monitored during its training process. Monitoring makes it possible to observe early whether the accuracy of the training is improving, or whether the network is over-fitting. A percentage of the training data was used as validation (during the design phase) to investigate whether the error increases with the training time; however, the training is stable and the error decreases steadily with time; see figure 39.

Figure 39: The monitor, feeding the system subsequently with selected data for validation. The monitor, from one of the runs, displays the RMSE and, as can be observed, the validation results (black dotted line) are close to the RMSE of the training data (blue dotted line), indicating that the network is not over-fitting.

To finally test the reliability of the model, a K-fold cross-validation is performed (same as for the SNN). The K-number is set to 5, resulting in 5 runs where 136 stimuli are used for training the network and the remaining 34 for testing the prediction accuracy; see figure 40.
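The piecewise-constant learning-rate schedule described above can be sketched in a few lines (Python for illustration; the thesis configures this through Matlab's training options):

```python
# Starting from 1e-5, the rate is multiplied by the drop factor 0.1
# once every drop period of 20 epochs.

def learning_rate(epoch, initial=1e-5, drop_factor=0.1, drop_period=20):
    return initial * drop_factor ** (epoch // drop_period)

stimuli, mini_batch = 136, 17
iterations_per_epoch = stimuli // mini_batch  # 8 iterations per epoch
print(learning_rate(0), learning_rate(20), iterations_per_epoch)
```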

Figure 40: In terms of testing the reliability of the model, a K-fold cross-validation is performed for both the CNN and RNN: 5 runs where 20% of the data is tested and 80% is used for training. For each run new targets are used for testing to avoid biased results.

The stimuli are randomly shuffled and placed in five different testing sets. The cross-validation process is performed twice: (i) when the data has been shuffled with the random seed number 55 and (ii) with seed number 56. It is conducted on both target data sets from table 32. All data is required to be utilised for testing in order to validate the neural networks as thoroughly as possible. To avoid biased results in-between the iterations (specified as runs in the tables), the testing sets (for each seed number) are plotted to investigate the distribution; see figures 41 and 42.
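The shuffled split described above can be sketched as follows. This is a minimal Python illustration under the stated assumptions (170 stimuli, five disjoint test sets of 34, a fixed seed); the thesis performs the equivalent in Matlab:

```python
import random

def five_fold_split(n_stimuli=170, k=5, seed=55):
    # Shuffle all stimulus indices with a fixed seed, then cut them
    # into k disjoint test folds; the stimuli outside a fold form the
    # training set for that run.
    indices = list(range(n_stimuli))
    random.Random(seed).shuffle(indices)
    fold_size = n_stimuli // k
    return [indices[i * fold_size:(i + 1) * fold_size] for i in range(k)]

folds = five_fold_split()
```

Every stimulus lands in exactly one test fold, so across the five runs all data is used for testing, as required.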

Figure 41: The distribution of the targets for each set when the seed number for randomisation is 55.

Figure 42: The distribution of the targets for each set when the seed number for randomisation is 56.

Each randomised set - distributed by the two seed numbers 55 and 56 - covers most of the grading scope, indicating that no test set has acquired only similar targets from the same level of annoyance. After the training is executed, the prediction is conducted and, per iteration, an average of the absolute errors - between the predicted values and the targets - is computed over the 34 testing stimuli. A final average error is then derived when all runs are performed.
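The error measure just described amounts to a mean absolute error per run, averaged over the runs. A short NumPy sketch (the prediction and target values below are made up purely for illustration):

```python
import numpy as np

def run_error(predicted, targets):
    # Mean absolute difference between predictions and targets for one run.
    return float(np.mean(np.abs(np.asarray(predicted) - np.asarray(targets))))

run_errors = [run_error([0.50, 0.70], [0.55, 0.65]),   # run 1 (toy data)
              run_error([0.30, 0.90], [0.35, 0.80])]   # run 2 (toy data)
final_error = float(np.mean(run_errors))               # average over runs
```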

7.2 Recurrent Neural Network type LSTM

LSTM, the type of RNN utilised, requires a large amount of memory for the computations. The pre-processing (up until the extraction of spectrograms) conducted on the CNN input data is conveniently applied on the RNN model, and the modifications performed are described here.

7.2.1 Preprocessing of input data - RNN

Instead of the images used as inputs for training the CNN, the time signal acts as input for the RNN. The same operation is performed as in the pre-processing of the input data for the CNN: all time signals are re-sampled by a factor ten. As earlier mentioned, the time signals contain both negative and positive numbers, where the maximum amplitude found is 28. Considering that the network must predict values between zero and one, normalisation of the signals is conducted in order to optimise the training process. The normalisation is performed by computing the standard deviation of each stimulus and taking the maximum; the mean is computed from the same signal. All stimuli are then shifted by this mean and divided by this standard deviation; see figure 43. Sorting the time signals into batches according to their time length usually improves the results; in this case, all signals have the same length, hence sorting the data according to length is not considered necessary.

Figure 43: The red time signal displays the normalised version of the blue original time signal. The normalisation is performed after the re-sampling.

7.2.2 Applied layers - RNN

The final architecture of the RNN is demonstrated in figure 44.
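The normalisation scheme described in the preprocessing subsection above can be sketched as follows. This is a NumPy illustration of one reading of the text - the standard deviation is computed per stimulus, the maximum is kept, the mean is taken from that same signal, and every stimulus is shifted and scaled by this one mean/std pair; the random signals merely stand in for the re-sampled time signals:

```python
import numpy as np

rng = np.random.default_rng(0)
# Three stand-in "stimuli" with different spreads (toy data only).
stimuli = [rng.normal(0.0, s, 1000) for s in (1.0, 3.0, 2.0)]

stds = [x.std() for x in stimuli]
reference = stimuli[int(np.argmax(stds))]   # signal with the largest std
mean, std = reference.mean(), reference.std()
# All stimuli are shifted by the same mean and divided by the same std.
normalised = [(x - mean) / std for x in stimuli]
```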

Figure 44: The layers applied in the RNN with LSTM.

All layers utilised for the model are described below:

sequenceinputlayer: The sequence input layer inputs sequence data to a network. The input size is the length of the vector; in our case the vector length of the signal is after re-sampling.

dropoutlayer: A layer that randomly chooses input elements and sets them to zero with a given probability. The layer forces the network to become more redundant and obtain a good prediction although some of the elements are set to zero.

lstmlayer: This is a recurrent neural network (RNN) layer that enables support for time series and sequence data in a network. The Long Short-Term Memory layers are best suited for learning long-term dependencies, such as those between distant time steps. By performing additive interactions, the gradient flow can be improved over long sequences during training. The lstmlayer is given a number of hidden units, also known as the hidden size.

fullyconnectedlayer: This layer multiplies the input by a weight matrix and then adds a bias vector.

regressionlayer: Creates a regression output layer - applied when the output is continuous.

The settings of each layer for the RNN (LSTM) architecture are presented below:

sequenceinputlayer(inputsize)
dropoutlayer(0.75)
lstmlayer(250)
fullyconnectedlayer(1)
regressionlayer

7.2.3 Training process and validation

The properties for the training were set accordingly: the initial rate is set to , the learn drop factor is set to 0.1 and the learn drop period has the value 125. After the training has reached 125 epochs, the learning rate is multiplied by 0.1. The maximum number of epochs is 400 and the mini-batch size is 68 (= 136/2) - 2 iterations are performed to finish one epoch. The gradient threshold is set to one to prevent the gradient from exploding. The RNN is, like the CNN, trained with the solver adam. The network was also monitored during its training process; see figure 45.
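The gradient threshold of one mentioned above amounts to gradient clipping. A minimal sketch of one common variant - rescaling by the L2 norm - follows; Matlab offers several threshold methods, so norm rescaling is an assumption here, shown in Python for illustration:

```python
import numpy as np

def clip_gradient(grad, threshold=1.0):
    # If the L2 norm of the gradient exceeds the threshold, rescale the
    # gradient so its norm equals the threshold; this keeps a single
    # large update from "exploding" the training.
    norm = np.linalg.norm(grad)
    if norm > threshold:
        return grad * (threshold / norm)
    return grad

large = clip_gradient(np.array([3.0, 4.0]))   # norm 5 -> rescaled to 1
small = clip_gradient(np.array([0.1, 0.2]))   # norm < 1 -> unchanged
```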

Figure 45: The monitor clearly shows how the training error is converging and reaching low levels. The monitor displays how the RMSE of the prediction of the training data has low deviation from the corresponding targets.

The K-fold cross-validation and the random distribution of the targets were performed in the same way as for the CNN.

8 Results of multiple hidden layer neural networks

With the cross-validation, the models are iterated 5 times, each time with a new set (136 stimuli) for training and a new set of unknown stimuli (34) - which the trained neural network has never seen before - for testing the accuracy of the predictions. The results of this work are displayed in plots and summarised in tables. The plots display the comparison of the predicted values and the targets, to make the accuracy easier to visualise. Not all five iterations per input configuration are displayed in plots, but rather the iteration with the lowest absolute average error and the iteration obtaining the highest accuracy. First the results of the CNN are presented, and last the results obtained from the RNN (LSTM). Both the CNN and the RNN are trained and tested with the 30 jury-evaluated targets among the 170, and with the fully SNN modified targets. At the end of both sub-chapters, a table can be found containing the numerical values of the average error of each iteration, the final average error and the maximum error obtained.

8.1 Results of CNN with spectrograms as input

CNN - data randomised with seed 55

First, the results obtained when the randomisation of the data is performed with the seed number 55. The lowest accuracy is given by run four, see figure 46; and the highest accuracy is obtained during run five, see figure 47.

Figure 46: Comparison between the targets and the predicted values by the CNN (55) after performing run number four, which obtained the lowest accuracy out of five runs.

Figure 47: Comparison between the targets and the predicted values by the CNN (55) after performing run number five, which obtained the highest accuracy out of five runs. The right y-axis displays the subjective grades of annoyance utilised during the jury test, in order to interpret the impact of the error.

The absolute average value difference between run 4 and run 5 is . The results obtained with the SNN modified targets are presented with a * after the seed number. The same seed number is used for the fully SNN modified targets in order to compare the results; see figures 48 and 49 for the lowest and highest accuracy among the iterations, respectively.

Figure 48: Comparison between the SNN modified targets and the predicted values by the CNN (55*) after performing run number four, which obtained the lowest accuracy out of five runs. The maximum absolute error (0.16) is found in this comparison - stimulus two.

Figure 49: Comparison between the SNN modified targets and the predicted values by the CNN (55*) after performing run number two, which obtained the highest accuracy out of five runs.

CNN - data randomised with seed 56

The data is now randomised with the seed number 56, giving a different distribution of the data across the training and testing sets during the cross-validation. The iterations with the lowest and highest accuracy are displayed in figures 50 and 51, respectively.

Figure 50: Comparison between the targets and the predicted values by the CNN (56) after performing run number one, which obtained the lowest average accuracy out of five runs.

Figure 51: Comparison between the targets and the predicted values by the CNN (56) after performing run number five, which obtained the highest average accuracy out of five runs.

The 30 real subjectively assessed targets are exchanged for the SNN modified targets in figure 52 (lowest average accuracy) and figure 53 (highest average accuracy).

Figure 52: Comparison between the SNN modified targets and the predicted values by the CNN (56*) after performing run number one, which obtained the lowest average accuracy out of five runs.

Figure 53: Comparison between the SNN modified targets and the predicted values by the CNN (56*) after performing run number five, which obtained the highest average accuracy out of five runs.

Even in the worst-case scenarios - the iterations obtaining the lowest average accuracy - the results still converge well. The maximum deviation is obtained during seed 55* with the fully SNN modified targets. The step between the numerical values representing the subjective grades is 0.167; with a maximum error of 0.16, the network predicts less than one grade step off the target. In table 8 the obtained results from the CNN are summarised.

Absolute average error | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Final | Max
CNN (55)
CNN (55)*
CNN (56)
CNN (56)*

Table 8: Summary of the results obtained from the CNN.

One can notice that variation of the average error between the different runs exists. This can be explained by the low amount of data available. The model still predicts with a satisfactory accuracy, taking the low amount of training data into consideration. The difference in maximum errors between the seeds 55 and 56 (when the same targets are used) is respectively .

8.2 Results of RNN with time signal as input

RNN - data randomised with seed 55

The predictions performed by the RNN (LSTM), when the seed is 55 and the 30 real targets from the jury test are among the 170, are presented in figure 54 (lowest average accuracy among the 5 iterations) and figure 55 (highest average accuracy among the 5 iterations).

Figure 54: Comparison between the targets and the predicted values by the RNN (55) after performing run number four, which obtained the lowest accuracy out of five runs.

Figure 55: Comparison between the targets and the predicted values by the RNN (55) after performing run number five, which obtained the highest average accuracy out of five runs.

The accuracy is noticeably lower compared to the CNN; the maximum error obtained is almost 2 grading steps. The SNN modified targets are then applied, and the iteration with the lowest accuracy is presented in figure 56, with the highest accuracy displayed in figure 57.

Figure 56: Comparison between the SNN modified targets and the predicted values by the RNN (55*) after performing run number one, which obtained the lowest average accuracy out of five runs.

Figure 57: Comparison between the SNN modified targets and the predicted values by the RNN (55*) after performing run number five, which obtained the highest average accuracy out of five runs.

The maximum error has increased significantly for stimulus number five in run 1, which is the same stimulus that obtained a high error with the 30 jury-assessed targets among the 170 targets.

RNN - data randomised with seed 56

As for the CNN, an additional seed number is tested to investigate the level of generalisation by the network. The lowest accuracy is obtained during run 2 and is presented in figure 58; the highest average accuracy is obtained during iteration 3 and is displayed in figure 59.

Figure 58: Comparison between the targets and the predicted values by the RNN (56) after performing run number two, which obtained the lowest average accuracy out of five runs.

Figure 59: Comparison between the targets and the predicted values by the RNN (56) after performing run number three, which obtained the highest average accuracy out of five runs.

With the modified SNN targets applied on seed 56, the result obtained with the lowest average accuracy is shown in figure 60, and with the highest average accuracy in figure 61.

Figure 60: Comparison between the SNN modified targets and the predicted values by the RNN (56*) after performing run number one, which obtained the lowest average accuracy out of five runs.

Figure 61: Comparison between the SNN modified targets and the predicted values by the RNN (56*) after performing run number three, which obtained the highest average accuracy out of five runs.

The summary of the results obtained when utilising the RNN (LSTM) for prediction of annoyance is presented in table 9.

Absolute average error | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Final | Max
RNN (55)
RNN (55)*
RNN (56)
RNN (56)*

Table 9: The obtained errors of the RNN (LSTM) model with different input configurations.


Prelude Envelope and temporal fine. What's all the fuss? Modulating a wave. Decomposing waveforms. The psychophysics of cochlear The psychophysics of cochlear implants Stuart Rosen Professor of Speech and Hearing Science Speech, Hearing and Phonetic Sciences Division of Psychology & Language Sciences Prelude Envelope and temporal

More information

University of Cambridge Engineering Part IB Information Engineering Elective

University of Cambridge Engineering Part IB Information Engineering Elective University of Cambridge Engineering Part IB Information Engineering Elective Paper 8: Image Searching and Modelling Using Machine Learning Handout 1: Introduction to Artificial Neural Networks Roberto

More information

SOUND QUALITY EVALUATION OF AIR CONDITIONER NOISE

SOUND QUALITY EVALUATION OF AIR CONDITIONER NOISE SOUND QUALITY EVALUATION OF AIR CONDITIONER NOISE Yoshiharu Soeta 1), Ryota Shimokura 2), and Yasutaka Ueda 3) 1) Biomedical Research Institute, National Institute of Advanced Industrial Science and Technology

More information

Application of Artificial Neural Networks in Classification of Autism Diagnosis Based on Gene Expression Signatures

Application of Artificial Neural Networks in Classification of Autism Diagnosis Based on Gene Expression Signatures Application of Artificial Neural Networks in Classification of Autism Diagnosis Based on Gene Expression Signatures 1 2 3 4 5 Kathleen T Quach Department of Neuroscience University of California, San Diego

More information

Two Modified IEC Ear Simulators for Extended Dynamic Range

Two Modified IEC Ear Simulators for Extended Dynamic Range Two Modified IEC 60318-4 Ear Simulators for Extended Dynamic Range Peter Wulf-Andersen & Morten Wille The international standard IEC 60318-4 specifies an occluded ear simulator, often referred to as a

More information

Reference: Mark S. Sanders and Ernest J. McCormick. Human Factors Engineering and Design. McGRAW-HILL, 7 TH Edition. NOISE

Reference: Mark S. Sanders and Ernest J. McCormick. Human Factors Engineering and Design. McGRAW-HILL, 7 TH Edition. NOISE NOISE NOISE: It is considered in an information-theory context, as that auditory stimulus or stimuli bearing no informational relationship to the presence or completion of the immediate task. Human ear

More information

Errol Davis Director of Research and Development Sound Linked Data Inc. Erik Arisholm Lead Engineer Sound Linked Data Inc.

Errol Davis Director of Research and Development Sound Linked Data Inc. Erik Arisholm Lead Engineer Sound Linked Data Inc. An Advanced Pseudo-Random Data Generator that improves data representations and reduces errors in pattern recognition in a Numeric Knowledge Modeling System Errol Davis Director of Research and Development

More information

UvA-DARE (Digital Academic Repository) Perceptual evaluation of noise reduction in hearing aids Brons, I. Link to publication

UvA-DARE (Digital Academic Repository) Perceptual evaluation of noise reduction in hearing aids Brons, I. Link to publication UvA-DARE (Digital Academic Repository) Perceptual evaluation of noise reduction in hearing aids Brons, I. Link to publication Citation for published version (APA): Brons, I. (2013). Perceptual evaluation

More information

CHAPTER I From Biological to Artificial Neuron Model

CHAPTER I From Biological to Artificial Neuron Model CHAPTER I From Biological to Artificial Neuron Model EE543 - ANN - CHAPTER 1 1 What you see in the picture? EE543 - ANN - CHAPTER 1 2 Is there any conventional computer at present with the capability of

More information

Subjective impression of copy machine noises: an improvement of their sound quality based on physical metrics

Subjective impression of copy machine noises: an improvement of their sound quality based on physical metrics Subjective impression of copy machine noises: an improvement of their sound quality based on physical metrics Osamu Takehira a Ricoh Co. Ltd., JAPAN Sonoko Kuwano b Seiichiro Namba c Osaka University,JAPAN

More information

Acoustics, signals & systems for audiology. Psychoacoustics of hearing impairment

Acoustics, signals & systems for audiology. Psychoacoustics of hearing impairment Acoustics, signals & systems for audiology Psychoacoustics of hearing impairment Three main types of hearing impairment Conductive Sound is not properly transmitted from the outer to the inner ear Sensorineural

More information

A NOVEL HEAD-RELATED TRANSFER FUNCTION MODEL BASED ON SPECTRAL AND INTERAURAL DIFFERENCE CUES

A NOVEL HEAD-RELATED TRANSFER FUNCTION MODEL BASED ON SPECTRAL AND INTERAURAL DIFFERENCE CUES A NOVEL HEAD-RELATED TRANSFER FUNCTION MODEL BASED ON SPECTRAL AND INTERAURAL DIFFERENCE CUES Kazuhiro IIDA, Motokuni ITOH AV Core Technology Development Center, Matsushita Electric Industrial Co., Ltd.

More information

Effects of partial masking for vehicle sounds

Effects of partial masking for vehicle sounds Effects of partial masking for vehicle sounds Hugo FASTL 1 ; Josef KONRADL 2 ; Stefan KERBER 3 1 AG Technische Akustik, MMK, TU München, Germany 2 now at: ithera Medical GmbH, München, Germany 3 now at:

More information

! Can hear whistle? ! Where are we on course map? ! What we did in lab last week. ! Psychoacoustics

! Can hear whistle? ! Where are we on course map? ! What we did in lab last week. ! Psychoacoustics 2/14/18 Can hear whistle? Lecture 5 Psychoacoustics Based on slides 2009--2018 DeHon, Koditschek Additional Material 2014 Farmer 1 2 There are sounds we cannot hear Depends on frequency Where are we on

More information

INTRODUCTION J. Acoust. Soc. Am. 103 (2), February /98/103(2)/1080/5/$ Acoustical Society of America 1080

INTRODUCTION J. Acoust. Soc. Am. 103 (2), February /98/103(2)/1080/5/$ Acoustical Society of America 1080 Perceptual segregation of a harmonic from a vowel by interaural time difference in conjunction with mistuning and onset asynchrony C. J. Darwin and R. W. Hukin Experimental Psychology, University of Sussex,

More information

Loudness Processing of Time-Varying Sounds: Recent advances in psychophysics and challenges for future research

Loudness Processing of Time-Varying Sounds: Recent advances in psychophysics and challenges for future research Loudness Processing of Time-Varying Sounds: Recent advances in psychophysics and challenges for future research Emmanuel PONSOT 1 ; Patrick SUSINI 1 ; Sabine MEUNIER 2 1 STMS lab (Ircam, CNRS, UPMC), 1

More information

Topics in Linguistic Theory: Laboratory Phonology Spring 2007

Topics in Linguistic Theory: Laboratory Phonology Spring 2007 MIT OpenCourseWare http://ocw.mit.edu 24.91 Topics in Linguistic Theory: Laboratory Phonology Spring 27 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

More information

Characterizing individual hearing loss using narrow-band loudness compensation

Characterizing individual hearing loss using narrow-band loudness compensation Characterizing individual hearing loss using narrow-band loudness compensation DIRK OETTING 1,2,*, JENS-E. APPELL 1, VOLKER HOHMANN 2, AND STEPHAN D. EWERT 2 1 Project Group Hearing, Speech and Audio Technology

More information

Binaural Hearing. Why two ears? Definitions

Binaural Hearing. Why two ears? Definitions Binaural Hearing Why two ears? Locating sounds in space: acuity is poorer than in vision by up to two orders of magnitude, but extends in all directions. Role in alerting and orienting? Separating sound

More information

Appendix E: Basics of Noise. Table of Contents

Appendix E: Basics of Noise. Table of Contents E Basics of Noise Table of Contents E.1 Introduction... ii E.2 Introduction to Acoustics and Noise Terminology... E-1 E.3 The Decibel (db)... E-1 E.4 A-Weighted Decibel... E-2 E.5 Maximum A-Weighted Noise

More information

APPENDIX G NOISE TERMINOLOGY

APPENDIX G NOISE TERMINOLOGY Appendix G - Noise Terminology page G-1 APPENDIX G NOISE TERMINOLOGY Introduction To assist reviewers in interpreting the complex noise metrics used in evaluating airport noise, this appendix introduces

More information

Noise Cancellation using Adaptive Filters Algorithms

Noise Cancellation using Adaptive Filters Algorithms Noise Cancellation using Adaptive Filters Algorithms Suman, Poonam Beniwal Department of ECE, OITM, Hisar, bhariasuman13@gmail.com Abstract Active Noise Control (ANC) involves an electro acoustic or electromechanical

More information

Issues faced by people with a Sensorineural Hearing Loss

Issues faced by people with a Sensorineural Hearing Loss Issues faced by people with a Sensorineural Hearing Loss Issues faced by people with a Sensorineural Hearing Loss 1. Decreased Audibility 2. Decreased Dynamic Range 3. Decreased Frequency Resolution 4.

More information

The role of low frequency components in median plane localization

The role of low frequency components in median plane localization Acoust. Sci. & Tech. 24, 2 (23) PAPER The role of low components in median plane localization Masayuki Morimoto 1;, Motoki Yairi 1, Kazuhiro Iida 2 and Motokuni Itoh 1 1 Environmental Acoustics Laboratory,

More information

Hearing Lectures. Acoustics of Speech and Hearing. Auditory Lighthouse. Facts about Timbre. Analysis of Complex Sounds

Hearing Lectures. Acoustics of Speech and Hearing. Auditory Lighthouse. Facts about Timbre. Analysis of Complex Sounds Hearing Lectures Acoustics of Speech and Hearing Week 2-10 Hearing 3: Auditory Filtering 1. Loudness of sinusoids mainly (see Web tutorial for more) 2. Pitch of sinusoids mainly (see Web tutorial for more)

More information

Proceedings of Meetings on Acoustics

Proceedings of Meetings on Acoustics Proceedings of Meetings on Acoustics Volume 19, 13 http://acousticalsociety.org/ ICA 13 Montreal Montreal, Canada - 7 June 13 Engineering Acoustics Session 4pEAa: Sound Field Control in the Ear Canal 4pEAa13.

More information

A neural network model for optimizing vowel recognition by cochlear implant listeners

A neural network model for optimizing vowel recognition by cochlear implant listeners A neural network model for optimizing vowel recognition by cochlear implant listeners Chung-Hwa Chang, Gary T. Anderson, Member IEEE, and Philipos C. Loizou, Member IEEE Abstract-- Due to the variability

More information

Proceedings of Meetings on Acoustics

Proceedings of Meetings on Acoustics Proceedings of Meetings on Acoustics Volume 19, 2013 http://acousticalsociety.org/ ICA 2013 Montreal Montreal, Canada 2-7 June 2013 Noise Session 3aNSa: Wind Turbine Noise I 3aNSa5. Can wind turbine sound

More information

Walkthrough

Walkthrough 0 8. Walkthrough Simulate Product. Product selection: Same look as estore. Filter Options: Technology levels listed by descriptor words. Simulate: Once product is selected, shows info and feature set Order

More information

3M Center for Hearing Conservation

3M Center for Hearing Conservation 3M Center for Hearing Conservation Key Terms in Occupational Hearing Conservation Absorption A noise control method featuring sound-absorbing materials that are placed in an area to reduce the reflection

More information

Colin Cobbing ARM Acoustics

Colin Cobbing ARM Acoustics Colin Cobbing ARM Acoustics colin.cobbing@armacoustics.co.uk 1997 2014 Single correction of 5 db for one or several features Graduated corrections for tonal and impulsive. Corrections for other features.

More information

Perceptual Effects of Nasal Cue Modification

Perceptual Effects of Nasal Cue Modification Send Orders for Reprints to reprints@benthamscience.ae The Open Electrical & Electronic Engineering Journal, 2015, 9, 399-407 399 Perceptual Effects of Nasal Cue Modification Open Access Fan Bai 1,2,*

More information

New measurement methods for signal adaptive hearing aids

New measurement methods for signal adaptive hearing aids SIEMENS Josef Chalupper 1 New measurement methods for signal adaptive hearing aids Josef Chalupper Siemens Audiologische Technik, Erlangen SIEMENS Josef Chalupper 2 Motivation modern hearing aids signal

More information

3-D Sound and Spatial Audio. What do these terms mean?

3-D Sound and Spatial Audio. What do these terms mean? 3-D Sound and Spatial Audio What do these terms mean? Both terms are very general. 3-D sound usually implies the perception of point sources in 3-D space (could also be 2-D plane) whether the audio reproduction

More information

Comparison of some listening test methods : a case study

Comparison of some listening test methods : a case study Acta Acustica united with Acustica 9 (), - Author manuscript, published in "Acta Acustica united with Acustica 9 () -" Comparison of some listening test methods : a case study E. Parizet, N. Hamzaoui,

More information

Assessment of Reliability of Hamilton-Tompkins Algorithm to ECG Parameter Detection

Assessment of Reliability of Hamilton-Tompkins Algorithm to ECG Parameter Detection Proceedings of the 2012 International Conference on Industrial Engineering and Operations Management Istanbul, Turkey, July 3 6, 2012 Assessment of Reliability of Hamilton-Tompkins Algorithm to ECG Parameter

More information

Development of a new loudness model in consideration of audio-visual interaction

Development of a new loudness model in consideration of audio-visual interaction Development of a new loudness model in consideration of audio-visual interaction Kai AIZAWA ; Takashi KAMOGAWA ; Akihiko ARIMITSU 3 ; Takeshi TOI 4 Graduate school of Chuo University, Japan, 3, 4 Chuo

More information

Psychology of Perception PSYC Spring 2017 Laboratory 2: Perception of Loudness

Psychology of Perception PSYC Spring 2017 Laboratory 2: Perception of Loudness PSYC 4165-100 Laboratory 2: Perception of Loudness Lab Overview Loudness is a psychological dimension of sound experience that depends on several physical dimensions of the sound stimulus (intensity, frequency,

More information

Lung Tumour Detection by Applying Watershed Method

Lung Tumour Detection by Applying Watershed Method International Journal of Computational Intelligence Research ISSN 0973-1873 Volume 13, Number 5 (2017), pp. 955-964 Research India Publications http://www.ripublication.com Lung Tumour Detection by Applying

More information

Binaural processing of complex stimuli

Binaural processing of complex stimuli Binaural processing of complex stimuli Outline for today Binaural detection experiments and models Speech as an important waveform Experiments on understanding speech in complex environments (Cocktail

More information

Advanced Audio Interface for Phonetic Speech. Recognition in a High Noise Environment

Advanced Audio Interface for Phonetic Speech. Recognition in a High Noise Environment DISTRIBUTION STATEMENT A Approved for Public Release Distribution Unlimited Advanced Audio Interface for Phonetic Speech Recognition in a High Noise Environment SBIR 99.1 TOPIC AF99-1Q3 PHASE I SUMMARY

More information

Some methodological aspects for measuring asynchrony detection in audio-visual stimuli

Some methodological aspects for measuring asynchrony detection in audio-visual stimuli Some methodological aspects for measuring asynchrony detection in audio-visual stimuli Pacs Reference: 43.66.Mk, 43.66.Lj Van de Par, Steven ; Kohlrausch, Armin,2 ; and Juola, James F. 3 ) Philips Research

More information

REVIEW ON ARRHYTHMIA DETECTION USING SIGNAL PROCESSING

REVIEW ON ARRHYTHMIA DETECTION USING SIGNAL PROCESSING REVIEW ON ARRHYTHMIA DETECTION USING SIGNAL PROCESSING Vishakha S. Naik Dessai Electronics and Telecommunication Engineering Department, Goa College of Engineering, (India) ABSTRACT An electrocardiogram

More information

Question 1 Multiple Choice (8 marks)

Question 1 Multiple Choice (8 marks) Philadelphia University Student Name: Faculty of Engineering Student Number: Dept. of Computer Engineering First Exam, First Semester: 2015/2016 Course Title: Neural Networks and Fuzzy Logic Date: 19/11/2015

More information

Chapter 11: Sound, The Auditory System, and Pitch Perception

Chapter 11: Sound, The Auditory System, and Pitch Perception Chapter 11: Sound, The Auditory System, and Pitch Perception Overview of Questions What is it that makes sounds high pitched or low pitched? How do sound vibrations inside the ear lead to the perception

More information

Outline. 4. The Ear and the Perception of Sound (Psychoacoustics) A.1 Outer Ear Amplifies Sound. Introduction

Outline. 4. The Ear and the Perception of Sound (Psychoacoustics) A.1 Outer Ear Amplifies Sound. Introduction 4. The Ear and the Perception of Sound (Psychoacoustics) 1 Outline A. Structure of the Ear B. Perception of Loudness C. Perception of Pitch D. References Updated May 13, 01 Introduction 3 A. The Structure

More information

Evaluation of noise barriers for soundscape perception through laboratory experiments

Evaluation of noise barriers for soundscape perception through laboratory experiments Evaluation of noise barriers for soundscape perception through laboratory experiments Joo Young Hong, Hyung Suk Jang, Jin Yong Jeon To cite this version: Joo Young Hong, Hyung Suk Jang, Jin Yong Jeon.

More information

Noise-Robust Speech Recognition Technologies in Mobile Environments

Noise-Robust Speech Recognition Technologies in Mobile Environments Noise-Robust Speech Recognition echnologies in Mobile Environments Mobile environments are highly influenced by ambient noise, which may cause a significant deterioration of speech recognition performance.

More information

MRI Image Processing Operations for Brain Tumor Detection

MRI Image Processing Operations for Brain Tumor Detection MRI Image Processing Operations for Brain Tumor Detection Prof. M.M. Bulhe 1, Shubhashini Pathak 2, Karan Parekh 3, Abhishek Jha 4 1Assistant Professor, Dept. of Electronics and Telecommunications Engineering,

More information

Hearing the Universal Language: Music and Cochlear Implants

Hearing the Universal Language: Music and Cochlear Implants Hearing the Universal Language: Music and Cochlear Implants Professor Hugh McDermott Deputy Director (Research) The Bionics Institute of Australia, Professorial Fellow The University of Melbourne Overview?

More information