
Accepted Manuscript

Title: Hidden Markov Modeling of Frequency-Following Responses to Mandarin Lexical Tones

Authors: Fernando Llanos, Zilong Xie, Bharath Chandrasekaran

PII: S (17)
DOI:
Reference: NSM 7818

To appear in: Journal of Neuroscience Methods

Received date:
Revised date:
Accepted date:

Please cite this article as: Llanos, Fernando, Xie, Zilong, Chandrasekaran, Bharath. Hidden Markov Modeling of Frequency-Following Responses to Mandarin Lexical Tones. Journal of Neuroscience Methods.

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

TITLE
Hidden Markov Modeling of Frequency-Following Responses to Mandarin Lexical Tones

AUTHORS
Fernando Llanos 1, Zilong Xie 1, Bharath Chandrasekaran 1,2,3

INSTITUTIONS
1 Department of Communication Sciences & Disorders, Moody College of Communication, The University of Texas at Austin
2 Department of Psychology, College of Liberal Arts, The University of Texas at Austin
3 Institute for Neuroscience, College of Liberal Arts, The University of Texas at Austin

TYPE OF ARTICLE
Research paper

CORRESPONDING AUTHOR
Bharath Chandrasekaran (bchandra@utexas.edu; )
2504 Whitis Ave., Austin, TX 78712, USA

RUNNING TITLE
Hidden Markov modeling of FFRs

HIGHLIGHTS
- A new method of neural pitch encoding assessment is described
- This method uses a hidden Markov model to decode neural pitch patterns reflected by frequency-following responses
- The accuracy of this model provides a measure that accounts for language experience-dependent plasticity in the subcortical encoding of pitch
- This method enables neural pitch encoding assessment with reduced datasets

ABSTRACT
Background. The frequency-following response (FFR) is a scalp-recorded electrophysiological potential reflecting phase-locked activity from neural ensembles in the auditory system. The FFR is often used to assess the robustness of subcortical pitch processing. Due to the low signal-to-noise ratio at the single-trial level, FFRs are typically averaged across thousands of stimulus repetitions. Prior work using this approach has shown that the subcortical encoding of linguistically-relevant pitch patterns is modulated by long-term language experience.

New method. We examine the extent to which a machine learning approach using hidden Markov modeling (HMM) can be utilized to decode Mandarin tone categories from scalp-recorded electrophysiological activity. We then assess the extent to which the HMM can capture biologically-relevant effects (language experience-driven plasticity). To this end, we recorded FFRs to four Mandarin tones from 14 adult native speakers of Chinese and 14 adult native speakers of English. We trained an HMM to decode tone categories from the FFRs with varying averaging sizes.

Results and comparisons with existing methods. Tone categories were decoded with above-chance accuracies using the HMM. The HMM-derived metric (decoding accuracy) revealed a robust effect of language experience, such that FFRs from native Chinese speakers yielded greater accuracies than those from native English speakers.

Critically, the language experience-driven plasticity was captured with averaging sizes significantly smaller than those used in the extant literature.

Conclusions. Our results demonstrate the feasibility of the HMM in assessing the robustness of neural pitch encoding. Machine learning approaches can complement extant analytical methods that capture auditory function, and could reduce the number of trials needed to capture biological phenomena.

KEYWORDS
Frequency-following response, hidden Markov model, machine learning, pitch encoding, plasticity

HIDDEN MARKOV MODELING OF FREQUENCY-FOLLOWING RESPONSES TO MANDARIN LEXICAL TONES

1 INTRODUCTION
Pitch is critical to speech and music processing (Ladefoged & Maddieson, 1998; Patel, 2010). For example, speakers of tonal languages (e.g., Chinese) rely on phonologically-relevant pitch patterns (i.e., lexical tones) to convey different word meanings (Gandour, 1983; Ladefoged & Maddieson, 1998). The main cues used for perceptual categorization of such lexical tones are pitch height and pitch direction (Gandour, 1994; Francis & Ciocca, 2003).

The neural encoding of pitch is often assessed with the frequency-following response (FFR). The FFR is a scalp-recorded electrophysiological potential that reflects phase-locked activity from neural ensembles involved in the processing of low-level sound characteristics (Bidelman, 2015; Chandrasekaran & Kraus, 2010; Krishnan, Gandour, & Bidelman, 2010; Coffey, Herholz, Chepesiuk, Baillet, & Zatorre, 2016; Smith, Marsh, & Brown, 1975; Sohmer, Pratt, & Kinarti, 1977). Although it is generally considered that the FFR is entirely generated by auditory subcortical structures (e.g., Krishnan et al., 2005), recent evidence suggests a contribution from the auditory cortex (Coffey et al., 2016). An important property of the FFR is that it captures the spectro-temporal correlates of pitch (e.g., the fundamental frequency, F0) with high fidelity (Chandrasekaran & Kraus, 2010; Krishnan, Xu, Gandour, & Cariani, 2004) (see Fig-1).

Figure 1. Waveform (A) and spectrogram (B) of a mid-rising Mandarin tone (tone 2), as well as the waveform (C) and spectrogram (D) of the corresponding FFR, elicited from a representative Chinese participant. The FFR was averaged across 1000 trials. The pre-stimulus portion of the FFR in panel C is delimited with a rectangle. Panel E shows the F0 contours of the four stimuli included in the present study.

Prior work has shown that the FFR is malleable to auditory experiences across several timescales (Skoe & Chandrasekaran, 2014). These experiences include long-term language experience (Krishnan, Xu, Gandour, & Cariani, 2005; Krizman, Marian, Shook, Skoe, & Kraus, 2012; Krizman, Skoe, Marian, & Kraus, 2014; Xie, Reetzke, & Chandrasekaran, 2017), musical training (Musacchia, Sams, Skoe, & Kraus, 2007; Wong, Skoe, Russo, Dees, & Kraus, 2007), short-term auditory training (Song, Skoe, Wong, & Kraus, 2008; Xie et al., 2017), online context (Lau, Wong, & Chandrasekaran, 2016), and stimulus probability and FFR timing (Gorina-Careta, Zarnowiec, Costa-Faidella, & Escera, 2016; Shiga, Althen, Cornella, Zarnowiec, Yabe, & Escera, 2015; Slabu, Grimm, & Escera, 2012). Following this line of research, recent studies have documented a more faithful subcortical encoding of lexical tones (e.g., Mandarin tones) in native speakers of tonal languages (e.g., Chinese), relative to native speakers of non-tonal languages (e.g., English) (Krishnan et al., 2010; Krishnan, Gandour, Bidelman, & Swaminathan, 2009; Krishnan et al., 2005; Xie et al., 2017). These results are theoretically interesting because they demonstrate that subcortical structures are not just passive relay stations, but relate to individual differences in auditory processing (Chandrasekaran, Skoe, & Kraus, 2014).

In the extant literature, the subcortical encoding of pitch is usually assessed with metrics that evaluate the similarity of F0 between the stimulus and the evoked FFR, or the degree of FFR periodicity (Krishnan et al., 2009; Krishnan et al., 2005; Lau, Wong, & Chandrasekaran, 2016; Wong et al., 2007). The stimulus-to-response similarity is quantified with Euclidean norms that calculate the distance between the stimulus and the evoked FFR's F0 contour (F0 error metric), or with normalized cross-correlation coefficients that maximize at latencies for which the F0 contours being compared become highly correlated (stimulus-to-response correlation metric). To estimate the degree of periodicity, an FFR is compared to itself using the autocorrelation function (peak autocorrelation metric). Due to the low SNR of single-trial responses, FFRs are usually averaged across thousands of stimulus repetitions to facilitate the estimation of these metrics (Krishnan et al., 2010; Krishnan et al., 2005; Xu, Krishnan, & Gandour, 2006).
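For concreteness, the MATLAB sketch below shows one way to compute these three conventional metrics. The variable names (f0stim, f0ffr, ffr) are ours, and the exact windowing and lag-range choices vary across the studies cited above; this is an illustrative sketch, not the implementation used in any of those papers.

```matlab
% Hedged sketch of the three conventional pitch-encoding metrics.
% f0stim and f0ffr: stimulus and FFR F0 contours (vectors of equal length);
% ffr: an averaged FFR waveform. Names and details are illustrative only.
f0err  = mean(abs(f0stim - f0ffr));              % F0 error: mean contour distance (Hz)
srcorr = max(xcorr(f0stim - mean(f0stim), ...    % stimulus-to-response correlation:
             f0ffr - mean(f0ffr), 'coeff'));     % peak normalized cross-correlation
r      = xcorr(ffr, 'coeff');                    % normalized FFR autocorrelation
r      = r((numel(ffr)+1):end);                  % keep positive, nonzero lags
peakAC = max(r);                                 % peak autocorrelation (periodicity)
```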

In the present study, we examine the extent to which a machine learning approach using the hidden Markov model (HMM) can be utilized to decode Mandarin tones from FFRs, and to capture language experience-dependent plasticity in the neural encoding of pitch. Critically, we examine the extent to which the HMM can capture subtle language experience-dependent neural plasticity using subaverages (of the FFR) and datasets that are smaller than those reported in the existing literature. Our data-driven approach builds upon emerging applications of machine learning methods to model electrophysiological responses (Hausfeld, De Martino, Bonte, & Formisano, 2012; Mesgarani, Cheung, Johnson, & Chang, 2014; Pei, Barbour, Leuthardt, & Schalk, 2011; Yi, Xie, Reetzke, Dimakis, & Chandrasekaran, 2017). The goal of machine learning models (Alpaydin, 2014; Mohri, Rostamizadeh, & Talwalkar, 2012) is to learn target patterns (e.g., lexical tones) without being explicitly programmed. Instead, machine learning models are trained to develop broad mathematical representations that capture the variability of target patterns in the input. These representations are then used to decode target patterns from novel input signals. An important characteristic of machine learning approaches is that they incorporate the modeling of complex distributional input properties (Mohri et al., 2012). Previous studies have taken advantage of machine learning approaches to decode cortical electrophysiological responses, even on a single-trial basis (Mesgarani et al., 2014). Due to their brainstem sites of origin, FFRs are smaller in magnitude than cortical electrophysiological responses at the scalp, and hence have relatively lower SNR (Chandrasekaran & Kraus, 2010). Although this may limit the power of machine learning models to decode FFRs with small averages, a previous study has demonstrated the feasibility of machine learning methods to decode FFRs to steady-state vowels on a single-trial basis (Yi et al., 2017). Building upon this finding, we aim to investigate the extent to which machine learning methods can also decode FFRs to lexical tones. To the best of our knowledge, this is the first study using the HMM to decode FFR signal properties.

Among extant machine learning models, we chose the HMM because of the characteristic time-varying profile of lexical tones. HMMs are designed to decode series of emissions (e.g., F0 contours) by assuming that the emissions in these series are produced, with certain probabilities, by a series of states that cannot be directly observed (e.g., target lexical tone categories). In the HMM, hidden states are assumed to follow each other with certain transition probabilities established from the distribution of the emissions across sequences (Rabiner, 1989; Rabiner & Juang, 1986). For example, an HMM trained with rising pitch contours will generalize a stochastic representation in which lower F0s are followed by higher F0s with a higher probability than the opposite trend (higher F0s followed by lower F0s). Two properties in particular make the HMM a good candidate for modeling time-varying pitch patterns. First, the HMM can handle a wide variety of time-varying speech signals (Gales & Young, 2008; Huang, Ariki, & Jack, 1990), including Mandarin tones (Yang, Lee, Chang, & Wang, 1988). Second, the HMM can decode patterns at moderate SNRs (Viikki & Laurila, 1997). Considering the low SNR characteristic of single-trial FFRs, the latter property can be useful for decoding lexical tone categories in small FFR datasets.

We recorded 1000 trials of FFRs to each of the four Mandarin lexical tones (T1: high-level; T2: mid-rising; T3: falling-rising; and T4: high-falling) elicited from 14 native speakers of Mandarin Chinese and 14 native speakers of English with no prior tonal language experience. We then assessed the neural encoding of Mandarin tones by using an HMM to decode these tones from the FFRs. Critically, we manipulated the number of trials used to train and test the HMM, as well as the number of trials used to average the FFRs. The rationale behind these manipulations was to assess the validity of the HMM in decoding FFRs averaged across fewer trials in smaller datasets. We also examined the performance of the HMM at different time points of the FFR. The goal of this additional analysis was to evaluate the effects of language experience-dependent plasticity on the decoding of Mandarin tones over time, as well as to elucidate the relative contributions of different portions of the FFR to tone classification. Here, the HMM was trained with complete FFRs and tested with FFR portions gradually increasing in 20 steps of 12.5 ms from the onset (12.5 ms corresponded to 5% of the FFR duration).

To anticipate, the HMM decoded tone categories with above-chance accuracies in both Chinese and English participants. The HMM-derived metric (decoding accuracy) exhibited the same pattern of neural plasticity documented in the literature: the Chinese group exhibited a more reliable neural encoding of native lexical pitch patterns than the English group. Critically, these language group differences were elucidated with datasets and averaging sizes considerably smaller than those documented in the existing literature. Further, FFRs in Chinese listeners yielded faster decoding of Mandarin tone categories relative to English listeners.

2 METHODS

2.1 Participants
Participants were recruited, tested, and compensated in Austin, Texas, under a research protocol approved by the Institutional Review Board of the University of Texas at Austin. We collected data from fourteen native speakers of Chinese (4 male; age: M = 24 years, SD = 3.4 years) and fourteen native speakers of English with no prior tonal language experience (9 male; age: M = 21.9 years, SD = 2.7 years); FFRs are not modulated by gender (Krizman, Skoe, & Kraus, 2012). All participants were right-handed by self-report. They did not report any history of musical training in the past 9 years. English participants were not fluent in a language other than English and did not report any training in tonal languages. Chinese speakers were late unbalanced bilinguals of English (onset of learning: M = 11 years, SD = 2.8 years; self-rated English proficiency, from 1 to 10 points: M = 6.3 points, SD = 0.9 points). All participants reported no history of hearing problems, neurodevelopmental disorders, or traumatic brain injuries. Audiograms revealed pure-tone thresholds (from 250 to 8000 Hz, in octave steps) lower than 25 dB for both air and bone conduction in each ear.

2.2 Stimuli
Stimuli consisted of four 250 ms synthetic Chinese words minimally distinguished by their lexical tones: /yi1/ "clothing" (T1), a high-level tone with F0 equal to 129 Hz; /yi2/ "aunt" (T2), a mid-rising tone with F0 rising from 109 to 133 Hz; /yi3/ "chair" (T3), a low-dipping tone with F0 onset falling from 103 to 89 Hz and F0 offset rising from 89 to 111 Hz; and /yi4/ "easy" (T4), a falling tone with F0 falling from 140 to 92 Hz. The F0 contours of these four tones are shown in Fig-1 (panel E). The synthesis was derived from natural male production data (Chandrasekaran, Krishnan, & Gandour, 2007; Xu, 1997). We adjusted all stimulus magnitudes to the same root-mean-square (RMS) intensity value, equivalent to 72 dB sound pressure level (SPL). SPL levels were measured with a Brüel & Kjaer hand-held analyzer (type 2250-L) connected to a Brüel & Kjaer artificial ear (type 4152).

2.3 Electrophysiological data acquisition
Electrophysiological responses to the stimuli were recorded (sampling rate: 25 kHz) in a sound-attenuated booth, using one Ag-AgCl scalp electrode placed on the top-middle part of the head (Cz), with the left and right mastoids as the ground and reference, respectively. The use of a mastoid reference is consistent with our previous work (Xie et al., 2017) and prior work from other labs (Bidelman & Krishnan, 2010; Krishnan et al., 2010). This setup also provided us with a good SNR of FFRs, as shown in Fig-8. All electrode impedances were lower than 5 kΩ. Participants watched a silent movie of their choice with subtitles while listening to the stimuli, binaurally delivered through two insert earphones (ER-3; Etymotic Research, Elk Grove Village, IL). They were instructed to focus their attention on their movies and to ignore the auditory stimuli. Stimulus presentation was controlled by a customized E-Prime interface (Schneider et al., 2002), using an inter-stimulus interval (ISI) randomly jittered between 122 and 148 ms. Each of the four Mandarin tones was presented in separate blocks with alternating polarities to minimize the impact of the cochlear microphonic artifact on the average FFR (Bidelman, Moreno, & Alain, 2013; Chandrasekaran & Kraus, 2010). The order of the blocks was counterbalanced across participants, with half of the participants listening to blocks of T1 and T2 first, and the other half listening to blocks of T3 and T4 first. The whole recording lasted about 60 minutes.

2.4 Preprocessing of electrophysiological data
Raw electrophysiological responses were preprocessed off-line with BrainVision Analyzer 2.0 (Brain Products, Gilching, Germany). First, responses were bandpass-filtered from 80 to 1,000 Hz (12 dB/octave, zero phase-shift). Then, they were segmented into epochs of 310 ms (40 ms before stimulus onset and 20 ms after stimulus offset). Responses were baseline-corrected to the mean voltage of the noise floor (-40 to 0 ms), and trials with amplitudes exceeding the range of ±35 µV were rejected. For each tone, we collected 1000 artifact-free trials (500 for each polarity) for further analysis.
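Although preprocessing was carried out in BrainVision Analyzer, the steps can be summarized with a rough MATLAB equivalent. The variable names (eeg, onsets) and the Butterworth filter design are our assumptions, not the Analyzer's internals; this is a minimal sketch of the pipeline described above.

```matlab
% Minimal preprocessing sketch, assuming the raw recording is in 'eeg'
% (a single-channel column vector) and stimulus onsets in 'onsets'
% (sample indices). Filter design is our assumption.
fs     = 25000;                          % sampling rate (Hz)
[b, a] = butter(2, [80 1000]/(fs/2));    % bandpass, ~12 dB/octave per skirt
eegF   = filtfilt(b, a, eeg);            % zero phase-shift filtering

pre  = round(0.040*fs);                  % 40 ms before stimulus onset
post = round(0.270*fs);                  % 250 ms stimulus + 20 ms after offset
epochs = zeros(numel(onsets), pre + post);
for k = 1:numel(onsets)
    seg = eegF(onsets(k)-pre : onsets(k)+post-1);
    epochs(k, :) = seg - mean(seg(1:pre));   % baseline to the noise-floor mean
end
epochs = epochs(max(abs(epochs), [], 2) <= 35, :);  % reject trials beyond +/-35 uV
```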

3 HIDDEN MARKOV MODELING OF NEURAL PITCH PATTERNS
The HMM classifier consisted of four independent HMMs connected in parallel and implemented in MATLAB R2015b (please see the supplementary methods for the implemented MATLAB scripts). These HMMs were trained to decode FFR F0 contours to the four Mandarin tones in individual participants. Tone categories were selected as the HMMs providing the most optimal decoding of target FFRs; therefore, the decoding of a particular tone category was also influenced by the decoding of the other categories. The F0 contours were estimated from averaged FFRs using the method described in section 3.1 and quantized into sequences of discrete F0 intervals, or codewords, previously established by clustering. The accuracy of the HMM classifier was cross-validated across multiple training and testing dataset selections. Critically, we manipulated the number of FFRs used to train and test the classifier, as well as the number of trials used to average FFRs (averaging size). These steps are schematically illustrated in Fig-2 and described in more detail throughout the next few sections. For more general information about HMMs, we refer the reader to the specialized literature (Rabiner, 1989; Rabiner & Juang, 1993).

Figure 2. A flowchart illustrating the main steps followed at each cross-validation step. FFRs are labeled with the number of the corresponding tone (e.g., 1 or 4). Stimulus polarities are indexed with a plus or minus symbol attached to each tone number. FFRs to the same tone (e.g., T1) were separated into training and testing subsets using a K-fold partition, where K is the possible number of disjoint testing selections available in a given tone dataset (e.g., T1). For example, if the testing size is 100 and the size of the tone dataset is 500, the possible number K of disjoint testing selections is 5. The flowchart illustrates the steps followed at each testing selection. Training and testing FFRs were averaged within their corresponding training and testing subsets. F0 contours were extracted from averaged FFRs and quantized into sequences of discrete F0 intervals. Each HMM (one per tone category) was trained with the F0 sequences of the corresponding tone class. Each testing sequence was classified into the tone class of the HMM giving the highest log-likelihood of producing the sequence. HMM accuracy was estimated from the confusion matrix that resulted from this classification method across all testing selections.

3.1 F0 extraction procedure
Averaged FFRs were sliced into 22 consecutive time frames of 40 ms each, separated by sliding steps of 10 ms. This frame duration was chosen to improve F0 tracking by including a minimum of 3 cycles per frame. The F0 in each frame was estimated with the short-term autocorrelation method described in Boersma (1993). We searched for the time lag that maximized the normalized autocorrelation function within a time interval spanning periods of 4 to 14.3 ms. We chose this interval because its reciprocals, 250 and 70 Hz respectively, encompassed the F0 range of all the stimuli. The reciprocal of the time lag that maximized the autocorrelation function was taken as the F0 value of the corresponding frame. Time frames were mean de-trended and fully ramped with a standard Hann window, as described in Boersma (1993). Autocorrelation coefficients were divided by their corresponding coefficients in the Hann window's autocorrelation function. We applied the same method to extract the F0 contour of each stimulus, previously resampled to the sampling rate of the FFR (i.e., 25 kHz).
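A minimal MATLAB sketch of this F0 extraction procedure is given below, assuming x holds one averaged FFR as a column vector (250 ms at 25 kHz). The normalization by the window's own autocorrelation follows Boersma (1993), but the details are our simplification of the scripts in the supplementary methods.

```matlab
% Sketch of the short-term autocorrelation F0 tracker (after Boersma, 1993).
fs = 25000;
frameLen = round(0.040*fs);  hop = round(0.010*fs);    % 40 ms frames, 10 ms steps
minLag = round(fs/250);  maxLag = round(fs/70);        % periods of 4 to 14.3 ms
nFrames = floor((numel(x) - frameLen)/hop) + 1;        % 22 frames for 250 ms
f0 = zeros(1, nFrames);
w  = hann(frameLen);
rw = xcorr(w, maxLag, 'coeff');  rw = rw(maxLag+1:end);  % window's own ACF
for i = 1:nFrames
    frame = x((i-1)*hop + (1:frameLen));
    frame = (frame - mean(frame)) .* w;            % mean de-trend, Hann ramp
    r = xcorr(frame, maxLag, 'coeff');             % normalized autocorrelation
    r = r(maxLag+1:end) ./ rw;                     % divide by Hann window ACF
    [~, k] = max(r(minLag+1:maxLag+1));            % best lag within 4-14.3 ms
    f0(i) = fs / (minLag + k - 1);                 % reciprocal of the best lag
end
```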

3.2 Stochastic topology and vector quantization
Each of the four HMMs in the HMM classifier was defined as one stochastic chain of three self-connected hidden states, feedforward-connected to the next two states (i.e., the first hidden state was connected to itself as well as to the second and third states, whereas the second state was connected to itself and the third state). The F0 contours were quantized into sequences of 22 discrete F0 intervals, or codewords, using a codebook of 50 centroids. To create this codebook, we clustered all the F0 data points available for training into 50 Voronoi cells with a Linde-Buzo-Gray algorithm (Chang & Hu, 1998; Equitz, 1989). The codebook consisted of all the centroids provided by these cells. This choice of parameters (HMM architecture and codebook length) was derived from a previous study that used the HMM to recognize Mandarin tone productions (Yang et al., 1988).

3.3 Training, testing, and cross-validation
Each HMM in the classifier was trained with the Viterbi algorithm (Durbin, 1998) to learn F0 sequences from the same tone class. Trained HMMs were then combined in a parallel architecture to classify novel F0 sequences from any of the four tone classes. Each novel F0 sequence was assigned to the tone category of the HMM providing the highest log-likelihood. The accuracy (or correct classification rate) of the classifier was computed directly from the confusion matrix that resulted from this classification method, as the number of true positives and true negatives divided by the number of true positives, true negatives, false positives, and false negatives (Fielding & Bell, 1997; Gardner & Altman, 1989). HMM accuracy was first estimated for each tone, resulting in a total of four tone-specific accuracy scores (Acc1, Acc2, Acc3, and Acc4) ranging from 0 (minimal accuracy) to 1 (maximal accuracy). We also calculated a general accuracy score across all tones (Acc0), which was computed from the total number of true and false positives and negatives provided by the classifier for all tone classes.

FFRs in the same tone dataset (T1, T2, T3, or T4) were divided into disjoint training and testing subsets. To avoid stimulus artifact bias, training and testing subsets alternated the same number of trials with opposite stimulus polarity (Bidelman et al., 2013; Skoe & Kraus, 2010). To avoid sample selection bias, the selection of testing and training subsets was cross-validated using a K-fold partition, with K equal to the possible number of disjoint testing selections available in the corresponding tone dataset (Kuhn & Johnson, 2013; Siuly, Li, & Zhang, 2017). For example, if the testing size is 100 trials and the size of the tone dataset is 500, the possible number of disjoint testing selections is 5. The HMM classifier was trained and tested K times, using a different cross-validated selection of testing and training datasets each time. The accuracy scores (Acc1, Acc2, Acc3, Acc4, and Acc0) were computed from the confusion matrix that resulted from the classification of all F0 sequences at each cross-validation step.
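The following MATLAB sketch illustrates sections 3.2 and 3.3 using the built-in hmmtrain and hmmdecode functions. Here kmeans stands in for the Linde-Buzo-Gray codebook used in the paper, and all variable names (f0train, labels, f0test) are hypothetical; the actual scripts are in the supplementary methods.

```matlab
% Sketch of the four-HMM classifier (sections 3.2-3.3).
nStates = 3; nSymbols = 50;

% Codebook: cluster all training F0 points and keep the 50 centroids.
% f0train: cell array of training F0 contours; labels: their tone classes (1-4).
allF0 = cell2mat(cellfun(@(s) s(:), f0train(:), 'UniformOutput', false));
[~, centroids] = kmeans(allF0, nSymbols);        % stand-in for Linde-Buzo-Gray
quantize = @(f0) knnsearch(centroids, f0(:))';   % F0 contour -> codeword row

% Left-to-right topology: each state connects to itself and the next two
% states; structural zeros forbid backward transitions.
transGuess = [1/3 1/3 1/3; 0 1/2 1/2; 0 0 1];
emisGuess  = ones(nStates, nSymbols)/nSymbols;

% Train one HMM per tone class with the Viterbi training algorithm.
trans = cell(1,4); emis = cell(1,4);
for t = 1:4
    seqs = cellfun(quantize, f0train(labels == t), 'UniformOutput', false);
    [trans{t}, emis{t}] = hmmtrain(seqs, transGuess, emisGuess, ...
                                   'Algorithm', 'Viterbi');
end

% Decode a novel sequence: pick the tone whose HMM yields the highest
% log-likelihood of the quantized F0 sequence.
logp = zeros(1, 4);
for t = 1:4
    [~, logp(t)] = hmmdecode(quantize(f0test), trans{t}, emis{t});
end
[~, predictedTone] = max(logp);
```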

3.4 Averaging method
FFRs in the training and testing subsets were averaged with a moving average window (MAW) of length L (we relied on different MAW sizes, which are provided in the next section). This means that each FFR was averaged with the L/2 preceding and L/2 following FFRs within the same training or testing subset. This averaging method helped increase the SNR of each FFR while keeping an adequate number of trials for training and testing. However, it also increased the level of redundancy across trials, and thus the risk of underfitting (Kuhn & Johnson, 2013). To reduce this risk, we performed an exploratory series of HMM runs across different training and MAW sizes. The purpose of this exploratory series was to identify the MAW sizes for which the accuracy of the classifier did not yield much improvement. By excluding these MAW sizes from subsequent runs, we aimed to minimize the risk of underfitting without decreasing the accuracy of the classifier. In this exploratory series, the training size was independently fixed to a small size (250 trials), a medium size (500 trials), and a large size (750 trials). Since the goal here was not to find the minimum number of trials required for the classifier to identify cross-language differences in decoding accuracy, FFR trials that were not used for training were left for testing. The size of the MAW was manipulated, for each training size, from 50 trials to the corresponding training size, in increments of 25. Results (see Fig-3) revealed no significant accuracy improvements for MAW sizes larger than approximately 50% of the training size (mean across training sizes = 55%). The effects of underfitting were especially salient in the large sample size (see Fig-3, right panel), which covered a wider range of averaging sizes.

Figure 3. Mean HMM accuracy scores for the Chinese and English groups (in solid red and dashed blue, respectively) as a function of the averaging size, for training sizes of 250 trials (left panel), 500 trials (middle panel), and 750 trials (right panel). Mean standard errors are color-shaded. Dashed vertical lines highlight the onset of the moving average range that did not yield much improvement in HMM accuracy.
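A sketch of this moving-average procedure is given below, assuming trials is an nTrials-by-nSamples matrix from one training or testing subset; the handling of edge trials (a truncated window) is our simplification.

```matlab
% Sketch of the moving average window (MAW) of length L (section 3.4).
L = 100;  half = L/2;                             % example MAW size
nTrials  = size(trials, 1);
averaged = zeros(size(trials));
for k = 1:nTrials
    lo = max(1, k - half);  hi = min(nTrials, k + half);
    averaged(k, :) = mean(trials(lo:hi, :), 1);   % trial k +/- L/2 neighbors
end
```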

3.5 Manipulation of training, testing, and averaging size
We manipulated the training, testing, and averaging size across different runs in individual participants. The sample size was defined as the training size plus the testing size. The training size ranged from 100 to 900 trials per tone, in steps of 50 across runs. The averaging size was manipulated as a function of the training size to minimize the risk of underfitting (see section 3.4). Specifically, for each training size, the averaging size varied from 50 trials to half the corresponding training size, in increments of 25. The testing size in each run was set to double the corresponding averaging size, to minimize the risk of underfitting while keeping a reduced but adequate number of trials per run. Combinations of training and testing sizes that led to sample sizes larger than 1000 were excluded. The resulting combinations of training, testing, and averaging size are depicted in Fig-5 and Fig-6 (section 4.2). For each combination of sizes, we obtained a total of five accuracy scores per participant: four tone-specific accuracy scores (Acc1, Acc2, Acc3, and Acc4) and one general accuracy score for all tones altogether (Acc0).

3.6 Tone decoding accuracy over time
In addition to the modeling described in the previous sections (3.1-3.5), we assessed the performance of the classifier in decoding Mandarin tones over time. Our goal here was to investigate the extent to which FFRs from Chinese listeners yielded earlier decoding of tone categories than those from English listeners. In such a case, we would expect the HMM to require less time (shorter FFR portions) to decode Mandarin tones from FFRs in Chinese listeners. To test this, we trained the HMM with complete F0 sequences. Then, we estimated the tone class of each novel F0 sequence at 20 consecutive time points, gradually increasing in steps of 12.5 ms (12.5 ms corresponded to 5% of the FFR duration). The tone class at each time point was estimated with the classification criterion described in section 3.3 (maximum log-likelihood): portions of F0 sequences from the onset of the FFR to the corresponding time point were decoded to the tone category that provided the highest log-likelihood. This analysis differs from existing analyses of pitch-tracking accuracy across different FFR chunks (Krishnan et al., 2009) in that our goal here was to assess the amount of information required over time for the HMM to decode tone categories correctly, rather than identifying the portions of the FFR that were more reliably decoded. In this context, the use of complete FFRs during the training phase was meant to replicate the experimental circumstances in which the FFRs were elicited (participants were listening to complete stimulus waveforms).

Our analysis of time-dependent accuracy focused on the combinations of training, testing, and averaging size that maximized and minimized cross-language differences in general accuracy (Acc0). We focused on these combinations of sizes because they represented the best and worst circumstances in which language group differences were expected to occur. The exact size values are provided in section 4.4.
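Reusing the trained models and the quantize function from the sketch in section 3.3, the time-course analysis can be outlined as follows; the frame-to-milliseconds rounding is our assumption.

```matlab
% Sketch of the time-course decoding (section 3.6): the HMMs are trained on
% complete F0 sequences, then each test sequence is decoded from growing
% prefixes, one per 12.5 ms step (5% of the 250 ms FFR).
testSeq = quantize(f0test);                 % full quantized test sequence
nSteps  = 20;
pred    = zeros(1, nSteps);
for s = 1:nSteps
    n = max(1, round(numel(testSeq) * s/nSteps));  % frames up to 12.5*s ms
    logp = zeros(1, 4);
    for t = 1:4
        [~, logp(t)] = hmmdecode(testSeq(1:n), trans{t}, emis{t});
    end
    [~, pred(s)] = max(logp);               % decoded tone at this time point
end
```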

4 STATISTICAL ANALYSES AND RESULTS

4.1 FFR dataset validation
Prior to the HMM modeling, we used our FFR dataset to validate the pattern of language experience-dependent plasticity documented in the literature with the existing metrics: the F0 error metric, the stimulus-to-response correlation, and the peak autocorrelation (see Xie et al., 2017, for a more detailed description of these metrics). These metrics have been used to evaluate the effects of long-term auditory experience (Bidelman, Gandour, & Krishnan, 2011; Krishnan et al., 2005; Wong et al., 2007; Xu et al., 2006) and short-term auditory training (Chandrasekaran, Kraus, & Wong, 2012; Song et al., 2008; Xie et al., 2017) on the subcortical encoding of Mandarin tones in native speakers of tonal and non-tonal languages. As expected, the Chinese group revealed a more optimal neural encoding of pitch than the English group across the three metrics, as evidenced by significant effects of language group [F0 error: F(1,26) = , p = 0.002, ηp² = 0.18; stimulus-to-response correlation: F(1,26) = 6.61, p = 0.01, ηp² = 0.09; peak autocorrelation: F(1,26) = 7.44, p = 0.01, ηp² = 0.12]. Specifically, FFRs from Chinese listeners, relative to those from English listeners, exhibited lower F0 errors and higher stimulus-to-response and peak autocorrelation coefficients [F0 error: Chinese, M = 4.92 (SD = 4.23) vs. English, M = 9.97 (SD = 7); stimulus-to-response correlation: Chinese, M = 0.56 (SD = 0.23) vs. English, M = 0.48 (SD = 0.21); peak autocorrelation: Chinese, M = 0.60 (SD = 0.13) vs. English, M = 0.43 (SD = 0.13)].

Fig-4 depicts the F0 contours of the four stimulus waveforms together with the F0 contours of their elicited FFRs, averaged by language group. In this figure, stimulus pitch contours are more faithfully reproduced in Chinese than in English listeners. We note that the results for the HMM metric, introduced in the next sections, are not entirely consistent with the fidelity of the FFR.

The HMM classifier is structured as four tone-specific HMMs connected in parallel. Therefore, its performance is expected to be modulated by F0 contour similarities across tone categories. This type of modulation is not expected to occur with previous FFR metrics, which focus on stimulus and FFR signal properties within the same tone category, but not across tone categories.

Figure 4. F0 contours of the stimulus signals (thick green line) and mean F0 contours of averaged FFRs across Chinese speakers (thin red line) and English speakers (thin blue dashed line) listening to T1 (top left), T2 (top right), T3 (bottom left), and T4 (bottom right).

4.2 Tone decoding accuracy
Mean accuracy scores for each combination of language group (Chinese and English) and tone (T1, T2, T3, and T4) were higher than chance level (0.25) and ranged from 0.74 to 0.88 across these combinations. Fig-5 depicts the accuracy scores (Acc0, Acc1, Acc2, Acc3, and Acc4), averaged across all participants, for each combination of training, testing, and averaging size. T4 provided the poorest mean accuracy scores. These means were lower than 0.75 when the averaging size was smaller than 100 trials. In contrast, the mean accuracies of the other tones were higher than 0.75 for averaging sizes as small as 50 trials, with a few exceptional cases, and were higher than 0.85 for averaging sizes approximately larger than 125 trials (see Fig-5).

Figure 5. HMM accuracy scores (from 0 to 1), averaged across all participants, as a function of the training, testing, and averaging size. General accuracy scores (Acc0) for all tones are shown in the top panel. The panels below are labeled with the names of their corresponding tone-specific accuracy scores (Acc1, Acc2, Acc3, and Acc4). Accuracy scores are encoded with a color-shape scale that groups them into different intervals from lower to higher accuracy. Numbers on top of each colored square correspond to the sample size (training plus testing size).

Mean accuracy scores were consistently higher for the Chinese group, relative to the English group [Acc1: Chinese, M = 0.88 (SD = 0.1) vs. English, M = 0.79 (SD = 0.1); Acc2: Chinese, M = 0.86 (SD = 0.1) vs. English, M = 0.78 (SD = 0.1); Acc3: Chinese, M = 0.88 (SD = 0.09) vs. English, M = 0.78 (SD = 0.1); Acc4: Chinese, M = 0.81 (SD = 0.12) vs. English, M = 0.74 (SD = 0.11)]. When all the accuracy scores were collapsed by tone, Acc4 (M = 0.77, SD = 0.12) was smaller than Acc1 (M = 0.84, SD = 0.11), Acc2 (M = 0.82, SD = 0.11), and Acc3 (M = 0.83, SD = 0.11).

Pairwise comparisons via two-sample t-tests (Mann-Whitney tests for non-normal and/or heteroscedastic score distributions), with T4 as the baseline tone, confirmed that Acc4 was significantly lower than the other three accuracy scores (Acc1, Acc2, and Acc3; all ps < 0.001). This indicates that T4 was more poorly decoded than the other three tones.

4.3 Language group differences
Language group differences for each accuracy score (Acc1, Acc2, Acc3, Acc4, and Acc0) at each combination of training, testing, and averaging size were evaluated using two-sample t-tests (Mann-Whitney tests for non-normal and/or heteroscedastic score distributions). In each test, accuracy score and language group were the dependent and independent variables, respectively.

Figure 6. T-test results for language group differences (Chinese vs. English) in decoding accuracy as a function of the training, testing, and averaging size. General accuracy scores for all tones (Acc0) are shown in the top panel. The panels below are labeled with the name of the corresponding tone-specific accuracy score (Acc1, Acc2, Acc3, and Acc4). Statistical significance is encoded with a color-shape scale. Numbers on top of each colored square correspond to the sample size (training plus testing size).

Test results are displayed in Fig-6. This figure uses a color scale to display the statistical significance of each test. In all tests yielding p < 0.05, the mean accuracy score of the Chinese group was higher than that of the English group. Test results for general accuracy (Acc0; Fig-6, top panel) revealed a broad region of statistical significance across almost all combinations of training, testing, and averaging size, with only six combinations that were not statistically significant. Test results for Acc3 (Fig-6, bottom-left panel) revealed statistically significant language group differences across all size combinations except one (training size = 200, averaging size = 50, testing size = 400). In contrast, the results for Acc4 (Fig-6, bottom-right panel) provided the smallest number of tests reaching statistical significance (12 out of 81). In this case, language group differences became statistically significant with sample sizes larger than approximately 700 trials and averaging sizes larger than approximately 350 trials. With regard to Acc1 and Acc2 (Fig-6, mid-left and mid-right panels), language group differences reached statistical significance at sample and averaging sizes larger than approximately 550 and 150 trials, respectively, with a total of 51 (Acc1) and 61 (Acc2) tests showing statistically significant language group differences. A series of four binary Chi-square tests confirmed that Acc4 provided the smallest proportion of statistically significant language group differences (qs > 17.92, ps < 0.001).

4.4 Tone decoding accuracy over time
Language group differences in decoding accuracy over time (FFR portions) were statistically evaluated via two-sample t-tests (Mann-Whitney tests for non-normal and heteroscedastic distributions). Language group differences were evaluated for each decoding accuracy score (Acc0, Acc1, Acc2, Acc3, and Acc4) at each time point (20 time points in total) in each condition (optimal and suboptimal). In the optimal condition, we focused on the combination of training, averaging, and testing size that maximized language group differences in general accuracy (training size = 500 trials, averaging size = 200 trials, sample size = 900 trials).

In the suboptimal condition, we focused on the combination of sizes that did not yield differences between language groups (training size = 150 trials, averaging size = 50 trials, sample size = 250 trials). These two conditions represented the best and worst circumstances for classifier performance (with respect to the ability to capture language differences). Test results are shown in Fig-7 (panels B-E).

Figure 7. Panel A shows the evolution of the tone-specific accuracy scores (Acc1, Acc2, Acc3, and Acc4) over time, in Chinese (solid lines) and English (dashed lines), in the optimal condition (training size = 500, averaging size = 200, sample size = 900). The other panels (B-E) illustrate the statistical significance (p-values) of language group differences in accuracy scores over time. P-values in B and D refer to language group differences in tone-specific accuracies over time, and p-values in C and E refer to the same differences in general accuracy over time. P-values in B and C refer to the optimal condition, and p-values in D and E refer to the suboptimal condition (training size = 150, averaging size = 50, sample size = 250). The dotted lines in B-E mark the level of statistical significance (0.05).

In the optimal condition, the Chinese tone-specific accuracy scores were higher than the English ones across all time points (Fig-7, panel A). Within the Chinese group, T3 was decoded earlier than T1 and T2. T4 was poorly decoded across time, relative to the other tones. In English, however, T1 provided the best decoding accuracies during the first 100 ms. After 100 ms, the evolution of tone-specific accuracy scores over time was similar across language groups: T3 provided the best over-time accuracy scores, followed by T2 and T1 at similar over-time accuracy scores, which, in turn, were better than those of T4.

Further, language group differences in accuracy reached statistical significance (i.e., p < 0.05) during the first 200 ms (Fig-7, panel B). The first tone to reach p < 0.05 was T3, at 37.5 ms, followed by T2 (50 ms), T1 (137.5 ms), and T4 (137.5 ms), respectively. From these results, we infer that, in the optimal condition, Chinese FFRs yielded faster decoding of tone categories than English FFRs. Also, T3 was decoded faster than the other three tones, except in English FFRs, which provided a faster decoding of T1, at least during the first 100 ms. Finally, T4 was decoded more slowly than the other three tones in both language groups. In the suboptimal condition (Fig-7, panel D), only T3 (150 ms) and T2 (187.5 ms) reached statistical significance in language group comparisons of accuracy over time (p < 0.05). The temporal evolution of language group differences in general accuracy (Acc0) is shown in panels C (optimal condition) and E (suboptimal condition) of Fig-7. In this case, language group differences reached statistical significance as early as 50 ms (optimal condition) and ms (suboptimal condition). Altogether, these results indicate that the decoding of Mandarin tones from FFRs was modulated, over time, by both language-dependent factors and tone categories.

5 GENERAL DISCUSSION

5.1 Results and major contributions
We examined the extent to which hidden Markov modeling (HMM) can be utilized to decode Mandarin tones from FFRs. A second goal was to assess the extent to which the derived metrics (decoding accuracy and tone decoding accuracy over time) were able to capture the subtle language experience-dependent plasticity in the neural encoding of Mandarin tones documented in the literature (Krishnan et al., 2005; Krizman, Marian, et al., 2012; Krizman et al., 2014; Xie et al., 2017). Specifically, we investigated the extent to which Chinese FFRs provided a better and faster decoding of Mandarin tones than English ones.

The HMM decoded tone category information from FFR data with high, above-chance accuracy scores. Importantly, these scores were achieved with averaging sizes as small as 50 or 100 trials, depending on the tone (see Fig-5). Mandarin tones were more accurately decoded from Chinese FFRs, suggesting a more reliable neural encoding of pitch in the Chinese group, relative to the English group (see Fig-6). This finding is consistent with the language experience-dependent plasticity documented in the literature (Krishnan et al., 2005; Xie et al., 2017; Xu et al., 2006).

Figure 8. Mean signal-to-noise ratios (SNRs) as a function of the averaging size, language group (Chinese and English), and tone (T1, T2, T3, and T4). SNRs (y-axes) were estimated with the averaging procedure described in section 3.4. For every averaging size L (x-axes), FFR and pre-stimulus signals were averaged with respect to the L/2 preceding and L/2 following signals. The mean amplitude of the averaged FFR signals was then divided by the mean amplitude of the corresponding averaged pre-stimulus signals, and the ratio was taken as a value of SNR. The resulting SNRs were then averaged into a single SNR score per participant. Language group variances are shaded in the corresponding color.
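A minimal sketch of this SNR estimate is given below, assuming epochs holds baseline-corrected single trials (rows) and pre indexes the -40 to 0 ms noise-floor samples; interpreting "mean amplitude" as RMS amplitude is our assumption.

```matlab
% SNR sketch per the Figure 8 caption: amplitude of the averaged FFR portion
% divided by the amplitude of the averaged pre-stimulus portion.
avgFFR   = mean(epochs(:, pre+1:end), 1);   % average over trials: response
avgNoise = mean(epochs(:, 1:pre), 1);       % average over trials: noise floor
snr      = rms(avgFFR) / rms(avgNoise);     % amplitude ratio taken as the SNR
```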

Critically, in the present study, this pattern of results was elucidated with FFR datasets as small as 200, 550, or 700 trials, and was found to be tone-dependent (T3 and T4 required the smallest and largest sample sizes, respectively). Also, Chinese FFRs provided a faster decoding of tone categories than English FFRs, as early as 50 ms in the optimal condition, relative to ms in the suboptimal one (see Fig-7, panels C and E). Taken together, these results demonstrate that the HMM is robust in capturing biologically-relevant information in the FFR, even at the low SNRs evidenced with smaller sample and averaging sizes (these SNRs are shown in Fig-8).

5.2 Language-dependent neural plasticity
The language experience-dependent plasticity in the neural encoding of pitch patterns (e.g., Mandarin tones) has been attributed to differential amounts of exposure to dynamic pitch patterns and to relative differences in the linguistic relevance of dynamic pitch patterns (Jeng, Dickman, Montgomery-Reagan, Tong, Wu, & Lin, 2011; Krishnan & Gandour, 2009; Krishnan et al., 2012). Mandarin Chinese listeners, relative to English listeners, receive extensive exposure to dynamic, linguistically-relevant pitch patterns during their language development. As a result, the subcortical auditory circuitry in Chinese listeners is likely to be reorganized to facilitate the encoding of dynamic pitch patterns that occur frequently in their auditory environment. This account is supported by animal models suggesting that neural circuitry calibrates locally to selectively fine-tune the representation of frequently occurring signals in one's auditory environment during development (Keuroghlian & Knudsen, 2007; Knudsen, 2002). Such local reorganization can persist through adulthood and continue to modulate auditory processing (Keuroghlian & Knudsen, 2007; Linkenhoker, von der Ohe, & Knudsen, 2005).

Our results indicate that the neural encoding of pitch is affected not only by language experience but also by tone category. In particular, the decoding of T4 was considerably less accurate than that of the other three tones. This result is in line with previous findings showing a suboptimal subcortical encoding of falling tonal patterns (Krishnan et al., 2010; Krishnan et al., 2009; Krishnan et al., 2004; Liu, Maggu, Lau, & Wong, 2015). For example, Krishnan et al. (2010) used normalized autocorrelation to estimate the strength of periodicity in the FFR (peak autocorrelation metric). They found a suboptimal neural encoding of periodicity in FFRs to falling linguistically-relevant tonal contours, relative to rising contours. This pitch direction asymmetry has also been documented in FFRs to non-speech signals (Krishnan & Parkinson, 2000; Krishnan et al., 2004) and in psychoacoustic experiments (Collins & Cullen Jr, 1978) in which differences between rising tones were better discriminated than those between falling tones. In the present study, the poorer decoding of the falling tone (T4) contrasts with the optimal accuracy scores of the Mandarin tones with rising F0 trajectories (T3 and T2). This result is also reflected in the FFR literature (Krishnan, Gandour, & Bidelman, 2012; Krishnan et al., 2004): speakers of tonal languages tend to exhibit a better neural encoding of pitch directional cues (critical to dynamic tones like T3 and T2) relative to speakers of non-tonal languages.

In our study, SNRs were also tone-specific (see Fig-8). Chinese SNRs were higher than the English ones for all tones except the level tone T1. When the averaging size was maximal (1000 trials, optimal SNR), T2 and T3 maximized the differences between mean Chinese SNRs (T1: 2.00; T2: 2.47; T3: 2.94; T4: 2.28) and mean English SNRs (T1: 2.00; T2: 2.18; T3: 2.11; T4: 2.05). Language group differences in SNR reached statistical significance (p < 0.05) for T3, the tone involving the most complex pitch directional changes. These group differences systematically reflect the relative differences in tuning to dynamic pitch.

5.3 Tone decoding accuracy over time
Our results suggest that tone decoding accuracy over time also depends on the tone category. Decoding of T4 was considerably more delayed than that of the other three tones in both Chinese and English speakers. This result could be a consequence of the poor decoding accuracy of T4 combined with the acoustic similarity of the onsets of T4 and T1 during approximately the first 70 ms. Language experience also mediated tone-specific differences in tone decoding accuracies over time. In Chinese, T3 and T2 were decoded faster than T1 and T4. Accuracies over time differed in English, especially during the first 100 ms, where T1 was decoded faster than the other tones. Interestingly, also in English, the decoding accuracy of T3 increased considerably in magnitude after approximately 150 ms, consistent with the T3 turning point (Krishnan et al., 2010). In general, Chinese scores for general accuracy over time (Acc0, all tones combined) were significantly better than the English ones, even with small sample and averaging sizes (see Fig-7, panels C and E). This result demonstrates that the time-based decoding of Mandarin tones, as reflected by the HMM, is highly influenced by long-term auditory experience. Despite this, our time-based analysis has two important limitations. First, we used duration-normalized stimuli, so the results may not reflect real-world differences. Second, we did not collect behavioral data (e.g., with a tone gating paradigm) that could provide information on behavioral relevance. These limitations can be addressed in future studies.

In previous tone gating studies, Chinese listeners required more time to identify T3, relative to the other three Mandarin tones (Whalen & Xu, 1992; Wu & Shu, 2003), while T1 is typically identified the earliest (e.g., Whalen & Xu, 1992; however, T2 was identified earliest in Wu & Shu, 2003). Consistent with previous studies, our HMM analysis showed that T1 was decoded earlier than T4. In contrast to previous studies (Whalen & Xu, 1992), we found that T3 was decoded faster relative to the other three tones. This divergence could be due to differences in tone-specific duration across studies. Previous tone gating studies preserved the tone-specific duration differences observed in Mandarin Chinese production, such that T3 is longer than T2, which in turn is usually longer than T4 and T1.

The longer duration of T3 might have contributed to its longer identification latency, relative to the other Mandarin tones. In our study, however, all the tones had the same duration. This, combined with the lower F0 onset of T3 relative to the onsets of the other tones, might have facilitated an earlier temporal decoding of T3. Another important difference worth noting here is that listeners' performance in previous tone gating tasks was modulated by attentional and decisional factors. Our data acquisition paradigm, based on passive listening, may be less affected by such factors, and this difference between our study and prior tone gating work may explain the disparity of findings. The influence of attentional and decisional factors on pitch representation could be further investigated with FFR elicitation paradigms that engage varying levels of attention.

5.4 Conclusion and future directions
Our results demonstrate the feasibility of machine learning approaches to characterize the FFR, which is considered a promising biomarker of auditory function (Chandrasekaran & Kraus, 2010; Chandrasekaran et al., 2012; Chandrasekaran, Skoe, & Kraus, 2014). Importantly, our study shows that machine learning-derived metrics can reflect biologically-relevant influences on auditory processing. One important finding worth highlighting is that this machine learning approach can be used to quantify the FFR with trial numbers well below those reported in the existing literature. Studies using Mandarin tones often recorded FFRs to at least 1500 stimulus repetitions (Bidelman et al., 2013; Krishnan et al., 2010; Krishnan et al., 2005; Wong et al., 2007; Xu et al., 2006). When all tones are considered, the language


More information

Effects of aging on temporal synchronization of speech in noise investigated in the cortex by using MEG and in the midbrain by using EEG techniques

Effects of aging on temporal synchronization of speech in noise investigated in the cortex by using MEG and in the midbrain by using EEG techniques Hearing Brain Lab Computational Sensorimotor Systems Lab Effects of aging on temporal synchronization of speech in noise investigated in the cortex by using MEG and in the midbrain by using EEG techniques

More information

Supplementary materials for: Executive control processes underlying multi- item working memory

Supplementary materials for: Executive control processes underlying multi- item working memory Supplementary materials for: Executive control processes underlying multi- item working memory Antonio H. Lara & Jonathan D. Wallis Supplementary Figure 1 Supplementary Figure 1. Behavioral measures of

More information

Variation in spectral-shape discrimination weighting functions at different stimulus levels and signal strengths

Variation in spectral-shape discrimination weighting functions at different stimulus levels and signal strengths Variation in spectral-shape discrimination weighting functions at different stimulus levels and signal strengths Jennifer J. Lentz a Department of Speech and Hearing Sciences, Indiana University, Bloomington,

More information

Editorial Hearing Aids and the Brain

Editorial Hearing Aids and the Brain International Otolaryngology, Article ID 518967, 5 pages http://dx.doi.org/10.1155/2014/518967 Editorial Hearing Aids and the Brain K. L. Tremblay, 1 S. Scollie, 2 H. B. Abrams, 3 J. R. Sullivan, 1 and

More information

Introduction to Computational Neuroscience

Introduction to Computational Neuroscience Introduction to Computational Neuroscience Lecture 5: Data analysis II Lesson Title 1 Introduction 2 Structure and Function of the NS 3 Windows to the Brain 4 Data analysis 5 Data analysis II 6 Single

More information

Proceedings of Meetings on Acoustics

Proceedings of Meetings on Acoustics Proceedings of Meetings on Acoustics Volume 19, 2013 http://acousticalsociety.org/ ICA 2013 Montreal Montreal, Canada 2-7 June 2013 Speech Communication Session 4pSCa: Auditory Feedback in Speech Production

More information

PERCEPTION OF UNATTENDED SPEECH. University of Sussex Falmer, Brighton, BN1 9QG, UK

PERCEPTION OF UNATTENDED SPEECH. University of Sussex Falmer, Brighton, BN1 9QG, UK PERCEPTION OF UNATTENDED SPEECH Marie Rivenez 1,2, Chris Darwin 1, Anne Guillaume 2 1 Department of Psychology University of Sussex Falmer, Brighton, BN1 9QG, UK 2 Département Sciences Cognitives Institut

More information

Analysis of in-vivo extracellular recordings. Ryan Morrill Bootcamp 9/10/2014

Analysis of in-vivo extracellular recordings. Ryan Morrill Bootcamp 9/10/2014 Analysis of in-vivo extracellular recordings Ryan Morrill Bootcamp 9/10/2014 Goals for the lecture Be able to: Conceptually understand some of the analysis and jargon encountered in a typical (sensory)

More information

ACOUSTIC AND PERCEPTUAL PROPERTIES OF ENGLISH FRICATIVES

ACOUSTIC AND PERCEPTUAL PROPERTIES OF ENGLISH FRICATIVES ISCA Archive ACOUSTIC AND PERCEPTUAL PROPERTIES OF ENGLISH FRICATIVES Allard Jongman 1, Yue Wang 2, and Joan Sereno 1 1 Linguistics Department, University of Kansas, Lawrence, KS 66045 U.S.A. 2 Department

More information

MULTI-CHANNEL COMMUNICATION

MULTI-CHANNEL COMMUNICATION INTRODUCTION Research on the Deaf Brain is beginning to provide a new evidence base for policy and practice in relation to intervention with deaf children. This talk outlines the multi-channel nature of

More information

A HMM-based Pre-training Approach for Sequential Data

A HMM-based Pre-training Approach for Sequential Data A HMM-based Pre-training Approach for Sequential Data Luca Pasa 1, Alberto Testolin 2, Alessandro Sperduti 1 1- Department of Mathematics 2- Department of Developmental Psychology and Socialisation University

More information

Temporal Location of Perceptual Cues for Cantonese Tone Identification

Temporal Location of Perceptual Cues for Cantonese Tone Identification Temporal Location of Perceptual Cues for Cantonese Tone Identification Zoe Wai-Man Lam, Kathleen Currie Hall and Douglas Pulleyblank Department of Linguistics University of British Columbia 1 Outline of

More information

THRESHOLD PREDICTION USING THE ASSR AND THE TONE BURST CONFIGURATIONS

THRESHOLD PREDICTION USING THE ASSR AND THE TONE BURST CONFIGURATIONS THRESHOLD PREDICTION USING THE ASSR AND THE TONE BURST ABR IN DIFFERENT AUDIOMETRIC CONFIGURATIONS INTRODUCTION INTRODUCTION Evoked potential testing is critical in the determination of audiologic thresholds

More information

ABR assesses the integrity of the peripheral auditory system and auditory brainstem pathway.

ABR assesses the integrity of the peripheral auditory system and auditory brainstem pathway. By Prof Ossama Sobhy What is an ABR? The Auditory Brainstem Response is the representation of electrical activity generated by the eighth cranial nerve and brainstem in response to auditory stimulation.

More information

Issues faced by people with a Sensorineural Hearing Loss

Issues faced by people with a Sensorineural Hearing Loss Issues faced by people with a Sensorineural Hearing Loss Issues faced by people with a Sensorineural Hearing Loss 1. Decreased Audibility 2. Decreased Dynamic Range 3. Decreased Frequency Resolution 4.

More information

AUDITORY BRAINSTEM PROCESSING OF COMPLEX SPEECH SOUNDS IN YOUNGER AND OLDER ADULTS USING SEABR - A REVIEW

AUDITORY BRAINSTEM PROCESSING OF COMPLEX SPEECH SOUNDS IN YOUNGER AND OLDER ADULTS USING SEABR - A REVIEW IJCRR Section: Healthcare Sci. Journal Impact Factor 4.016 Review Article AUDITORY BRAINSTEM PROCESSING OF COMPLEX SPEECH SOUNDS IN YOUNGER AND OLDER ADULTS USING SEABR - A REVIEW Muhammed Ayas 1, Rajashekhar

More information

SUPPLEMENTAL MATERIAL

SUPPLEMENTAL MATERIAL 1 SUPPLEMENTAL MATERIAL Response time and signal detection time distributions SM Fig. 1. Correct response time (thick solid green curve) and error response time densities (dashed red curve), averaged across

More information

HCS 7367 Speech Perception

HCS 7367 Speech Perception Babies 'cry in mother's tongue' HCS 7367 Speech Perception Dr. Peter Assmann Fall 212 Babies' cries imitate their mother tongue as early as three days old German researchers say babies begin to pick up

More information

Target-to-distractor similarity can help visual search performance

Target-to-distractor similarity can help visual search performance Target-to-distractor similarity can help visual search performance Vencislav Popov (vencislav.popov@gmail.com) Lynne Reder (reder@cmu.edu) Department of Psychology, Carnegie Mellon University, Pittsburgh,

More information

Behavioral generalization

Behavioral generalization Supplementary Figure 1 Behavioral generalization. a. Behavioral generalization curves in four Individual sessions. Shown is the conditioned response (CR, mean ± SEM), as a function of absolute (main) or

More information

Hello Old Friend the use of frequency specific speech phonemes in cortical and behavioural testing of infants

Hello Old Friend the use of frequency specific speech phonemes in cortical and behavioural testing of infants Hello Old Friend the use of frequency specific speech phonemes in cortical and behavioural testing of infants Andrea Kelly 1,3 Denice Bos 2 Suzanne Purdy 3 Michael Sanders 3 Daniel Kim 1 1. Auckland District

More information

ACOUSTIC ANALYSIS AND PERCEPTION OF CANTONESE VOWELS PRODUCED BY PROFOUNDLY HEARING IMPAIRED ADOLESCENTS

ACOUSTIC ANALYSIS AND PERCEPTION OF CANTONESE VOWELS PRODUCED BY PROFOUNDLY HEARING IMPAIRED ADOLESCENTS ACOUSTIC ANALYSIS AND PERCEPTION OF CANTONESE VOWELS PRODUCED BY PROFOUNDLY HEARING IMPAIRED ADOLESCENTS Edward Khouw, & Valter Ciocca Dept. of Speech and Hearing Sciences, The University of Hong Kong

More information

Spectro-temporal response fields in the inferior colliculus of awake monkey

Spectro-temporal response fields in the inferior colliculus of awake monkey 3.6.QH Spectro-temporal response fields in the inferior colliculus of awake monkey Versnel, Huib; Zwiers, Marcel; Van Opstal, John Department of Biophysics University of Nijmegen Geert Grooteplein 655

More information

Binaural processing of complex stimuli

Binaural processing of complex stimuli Binaural processing of complex stimuli Outline for today Binaural detection experiments and models Speech as an important waveform Experiments on understanding speech in complex environments (Cocktail

More information

An Auditory-Model-Based Electrical Stimulation Strategy Incorporating Tonal Information for Cochlear Implant

An Auditory-Model-Based Electrical Stimulation Strategy Incorporating Tonal Information for Cochlear Implant Annual Progress Report An Auditory-Model-Based Electrical Stimulation Strategy Incorporating Tonal Information for Cochlear Implant Joint Research Centre for Biomedical Engineering Mar.7, 26 Types of Hearing

More information

The neural code for interaural time difference in human auditory cortex

The neural code for interaural time difference in human auditory cortex The neural code for interaural time difference in human auditory cortex Nelli H. Salminen and Hannu Tiitinen Department of Biomedical Engineering and Computational Science, Helsinki University of Technology,

More information

The development of a modified spectral ripple test

The development of a modified spectral ripple test The development of a modified spectral ripple test Justin M. Aronoff a) and David M. Landsberger Communication and Neuroscience Division, House Research Institute, 2100 West 3rd Street, Los Angeles, California

More information

Morton-Style Factorial Coding of Color in Primary Visual Cortex

Morton-Style Factorial Coding of Color in Primary Visual Cortex Morton-Style Factorial Coding of Color in Primary Visual Cortex Javier R. Movellan Institute for Neural Computation University of California San Diego La Jolla, CA 92093-0515 movellan@inc.ucsd.edu Thomas

More information

What you re in for. Who are cochlear implants for? The bottom line. Speech processing schemes for

What you re in for. Who are cochlear implants for? The bottom line. Speech processing schemes for What you re in for Speech processing schemes for cochlear implants Stuart Rosen Professor of Speech and Hearing Science Speech, Hearing and Phonetic Sciences Division of Psychology & Language Sciences

More information

Nature Methods: doi: /nmeth Supplementary Figure 1. Activity in turtle dorsal cortex is sparse.

Nature Methods: doi: /nmeth Supplementary Figure 1. Activity in turtle dorsal cortex is sparse. Supplementary Figure 1 Activity in turtle dorsal cortex is sparse. a. Probability distribution of firing rates across the population (notice log scale) in our data. The range of firing rates is wide but

More information

What Is the Difference between db HL and db SPL?

What Is the Difference between db HL and db SPL? 1 Psychoacoustics What Is the Difference between db HL and db SPL? The decibel (db ) is a logarithmic unit of measurement used to express the magnitude of a sound relative to some reference level. Decibels

More information

Hearing the Universal Language: Music and Cochlear Implants

Hearing the Universal Language: Music and Cochlear Implants Hearing the Universal Language: Music and Cochlear Implants Professor Hugh McDermott Deputy Director (Research) The Bionics Institute of Australia, Professorial Fellow The University of Melbourne Overview?

More information

Categorical Perception

Categorical Perception Categorical Perception Discrimination for some speech contrasts is poor within phonetic categories and good between categories. Unusual, not found for most perceptual contrasts. Influenced by task, expectations,

More information

HEARING AND PSYCHOACOUSTICS

HEARING AND PSYCHOACOUSTICS CHAPTER 2 HEARING AND PSYCHOACOUSTICS WITH LIDIA LEE I would like to lead off the specific audio discussions with a description of the audio receptor the ear. I believe it is always a good idea to understand

More information

Neuro-Audio Version 2010

Neuro-Audio Version 2010 ABR PTA ASSR Multi-ASSR OAE TEOAE DPOAE SOAE ECochG MLR P300 Neuro-Audio Version 2010 one device for all audiological tests Auditory brainstem response (ABR)/Brainstem evoked response audiometry (BERA)

More information

Modulation and Top-Down Processing in Audition

Modulation and Top-Down Processing in Audition Modulation and Top-Down Processing in Audition Malcolm Slaney 1,2 and Greg Sell 2 1 Yahoo! Research 2 Stanford CCRMA Outline The Non-Linear Cochlea Correlogram Pitch Modulation and Demodulation Information

More information

But, what about ASSR in AN?? Is it a reliable tool to estimate the auditory thresholds in those category of patients??

But, what about ASSR in AN?? Is it a reliable tool to estimate the auditory thresholds in those category of patients?? 1 Auditory Steady State Response (ASSR) thresholds have been shown to be highly correlated to bh behavioral thresholds h in adults and older children with normal hearing or those with sensorineural hearing

More information

Infant Hearing Aid Evaluation Using Cortical Auditory Evoked Potentials

Infant Hearing Aid Evaluation Using Cortical Auditory Evoked Potentials Infant Hearing Aid Evaluation Using Cortical Auditory Evoked Potentials Suzanne C Purdy Purdy 1,2,3,4 1. Speech Science, Dept of Psychology, The University of Auckland, New Zealand 2. National Acoustic

More information

Distinguishing concept categories from single-trial electrophysiological activity

Distinguishing concept categories from single-trial electrophysiological activity Distinguishing concept categories from single-trial electrophysiological activity Brian Murphy, Michele Dalponte, Massimo Poesio & Lorenzo Bruzzone CogSci08 25 th July 2008 Motivation The long-term question:

More information

21/01/2013. Binaural Phenomena. Aim. To understand binaural hearing Objectives. Understand the cues used to determine the location of a sound source

21/01/2013. Binaural Phenomena. Aim. To understand binaural hearing Objectives. Understand the cues used to determine the location of a sound source Binaural Phenomena Aim To understand binaural hearing Objectives Understand the cues used to determine the location of a sound source Understand sensitivity to binaural spatial cues, including interaural

More information

Hearing Research 268 (2010) 60e66. Contents lists available at ScienceDirect. Hearing Research. journal homepage:

Hearing Research 268 (2010) 60e66. Contents lists available at ScienceDirect. Hearing Research. journal homepage: Hearing Research 268 (2010) 60e66 Contents lists available at ScienceDirect Hearing Research journal homepage: www.elsevier.com/locate/heares Research paper Neural representation of pitch salience in the

More information

Congruency Effects with Dynamic Auditory Stimuli: Design Implications

Congruency Effects with Dynamic Auditory Stimuli: Design Implications Congruency Effects with Dynamic Auditory Stimuli: Design Implications Bruce N. Walker and Addie Ehrenstein Psychology Department Rice University 6100 Main Street Houston, TX 77005-1892 USA +1 (713) 527-8101

More information

Discrimination of temporal fine structure by birds and mammals

Discrimination of temporal fine structure by birds and mammals Auditory Signal Processing: Physiology, Psychoacoustics, and Models. Pressnitzer, D., de Cheveigné, A., McAdams, S.,and Collet, L. (Eds). Springer Verlag, 24. Discrimination of temporal fine structure

More information

Proceedings of Meetings on Acoustics

Proceedings of Meetings on Acoustics Proceedings of Meetings on Acoustics Volume 19, 2013 http://acousticalsociety.org/ ICA 2013 Montreal Montreal, Canada 2-7 June 2013 Speech Communication Session 4aSCb: Voice and F0 Across Tasks (Poster

More information

Early Learning vs Early Variability 1.5 r = p = Early Learning r = p = e 005. Early Learning 0.

Early Learning vs Early Variability 1.5 r = p = Early Learning r = p = e 005. Early Learning 0. The temporal structure of motor variability is dynamically regulated and predicts individual differences in motor learning ability Howard Wu *, Yohsuke Miyamoto *, Luis Nicolas Gonzales-Castro, Bence P.

More information

The clinical Link. Distortion Product Otoacoustic Emission

The clinical Link. Distortion Product Otoacoustic Emission Distortion product otoacoustic emissions: Introduction Michael P. Gorga, Ph.D. Boys Town National Research Hospital Work supported by the NIH Collaborators on BTNRH OAE Projects Stephen Neely Kathy Beauchaine

More information

INTRODUCTION J. Acoust. Soc. Am. 103 (2), February /98/103(2)/1080/5/$ Acoustical Society of America 1080

INTRODUCTION J. Acoust. Soc. Am. 103 (2), February /98/103(2)/1080/5/$ Acoustical Society of America 1080 Perceptual segregation of a harmonic from a vowel by interaural time difference in conjunction with mistuning and onset asynchrony C. J. Darwin and R. W. Hukin Experimental Psychology, University of Sussex,

More information

Although considerable work has been conducted on the speech

Although considerable work has been conducted on the speech Influence of Hearing Loss on the Perceptual Strategies of Children and Adults Andrea L. Pittman Patricia G. Stelmachowicz Dawna E. Lewis Brenda M. Hoover Boys Town National Research Hospital Omaha, NE

More information

Over-representation of speech in older adults originates from early response in higher order auditory cortex

Over-representation of speech in older adults originates from early response in higher order auditory cortex Over-representation of speech in older adults originates from early response in higher order auditory cortex Christian Brodbeck, Alessandro Presacco, Samira Anderson & Jonathan Z. Simon Overview 2 Puzzle

More information

Spectrograms (revisited)

Spectrograms (revisited) Spectrograms (revisited) We begin the lecture by reviewing the units of spectrograms, which I had only glossed over when I covered spectrograms at the end of lecture 19. We then relate the blocks of a

More information

Effects of Remaining Hair Cells on Cochlear Implant Function

Effects of Remaining Hair Cells on Cochlear Implant Function Effects of Remaining Hair Cells on Cochlear Implant Function 5th Quarterly Progress Report Neural Prosthesis Program Contract N01-DC-2-1005 (Quarter spanning July-Sept, 2003) K.V. Nourski, P.J. Abbas,

More information

ABR PTA ASSR Multi-ASSR OAE TEOAE DPOAE SOAE VEMP ECochG MLR P300

ABR PTA ASSR Multi-ASSR OAE TEOAE DPOAE SOAE VEMP ECochG MLR P300 ABR PTA ASSR Multi-ASSR OAE TEOAE DPOAE SOAE VEMP ECochG MLR P300 Neuro-Audio one device for all audiological tests Auditory brainstem response (ABR)/Brainstem evoked response audiometry (BERA) (air and

More information

How is the stimulus represented in the nervous system?

How is the stimulus represented in the nervous system? How is the stimulus represented in the nervous system? Eric Young F Rieke et al Spikes MIT Press (1997) Especially chapter 2 I Nelken et al Encoding stimulus information by spike numbers and mean response

More information

Hearing in the Environment

Hearing in the Environment 10 Hearing in the Environment Click Chapter to edit 10 Master Hearing title in the style Environment Sound Localization Complex Sounds Auditory Scene Analysis Continuity and Restoration Effects Auditory

More information

Enhanced cognitive and perceptual processing: A computational basis for the musician advantage in speech learning

Enhanced cognitive and perceptual processing: A computational basis for the musician advantage in speech learning Cognition Enhanced cognitive and perceptual processing: A computational basis for the musician advantage in speech learning Kirsten Smayda, Bharath Chandrasekaran and W. Todd Maddox Journal Name: Frontiers

More information

Using Electrocochleography to Assess the Afferent Pathway in the Cochlea. Senior Thesis

Using Electrocochleography to Assess the Afferent Pathway in the Cochlea. Senior Thesis Cochlear afferent pathway 1 RUNNING HEAD: Cochlear afferent pathway Using Electrocochleography to Assess the Afferent Pathway in the Cochlea Senior Thesis Presented in Partial Fulfillment of the Requirements

More information

Binaural Hearing. Why two ears? Definitions

Binaural Hearing. Why two ears? Definitions Binaural Hearing Why two ears? Locating sounds in space: acuity is poorer than in vision by up to two orders of magnitude, but extends in all directions. Role in alerting and orienting? Separating sound

More information

Timbre-colour synaesthesia: Exploring the consistency of associations based on timbre

Timbre-colour synaesthesia: Exploring the consistency of associations based on timbre Accepted Manuscript Timbre-colour synaesthesia: Exploring the consistency of associations based on timbre Konstantina Menouti, Lilach Akiva-Kabiri, Michael J. Banissy, Lauren Stewart PII: S0010-9452(14)00262-7

More information

Quick Guide - eabr with Eclipse

Quick Guide - eabr with Eclipse What is eabr? Quick Guide - eabr with Eclipse An electrical Auditory Brainstem Response (eabr) is a measurement of the ABR using an electrical stimulus. Instead of a traditional acoustic stimulus the cochlear

More information

Frequency refers to how often something happens. Period refers to the time it takes something to happen.

Frequency refers to how often something happens. Period refers to the time it takes something to happen. Lecture 2 Properties of Waves Frequency and period are distinctly different, yet related, quantities. Frequency refers to how often something happens. Period refers to the time it takes something to happen.

More information

Aging & Making Sense of Sound

Aging & Making Sense of Sound Aging & Making Sense of Sound Nina Kraus, Ph.D. www.brainvolts.northwestern.edu Disclosures Current funding: NIH, Dana Foundation, Med-EL, NAMM Other financial relationships: Equity in Synaural, Inc.,

More information

Speech (Sound) Processing

Speech (Sound) Processing 7 Speech (Sound) Processing Acoustic Human communication is achieved when thought is transformed through language into speech. The sounds of speech are initiated by activity in the central nervous system,

More information

Objective Information-Theoretic Algorithm for Detecting Brainstem-Evoked Responses to Complex Stimuli DOI: /jaaa

Objective Information-Theoretic Algorithm for Detecting Brainstem-Evoked Responses to Complex Stimuli DOI: /jaaa J Am Acad Audiol 25:711 722 (2014) Objective Information-Theoretic Algorithm for Detecting Brainstem-Evoked Responses to Complex Stimuli DOI: 10.3766/jaaa.25.8.2 Gavin M. Bidelman* Abstract Background:

More information

Infants, auditory steady-state responses (ASSRs), and clinical practice

Infants, auditory steady-state responses (ASSRs), and clinical practice Infants, auditory steady-state responses (ASSRs), and clinical practice Susan Small, PhD University of British Columba Hamber Professor of Clinical Audiology Phonak 2016, Atlanta, GA October 2-5, 2016

More information

Jitter, Shimmer, and Noise in Pathological Voice Quality Perception

Jitter, Shimmer, and Noise in Pathological Voice Quality Perception ISCA Archive VOQUAL'03, Geneva, August 27-29, 2003 Jitter, Shimmer, and Noise in Pathological Voice Quality Perception Jody Kreiman and Bruce R. Gerratt Division of Head and Neck Surgery, School of Medicine

More information

Auditory Scene Analysis: phenomena, theories and computational models

Auditory Scene Analysis: phenomena, theories and computational models Auditory Scene Analysis: phenomena, theories and computational models July 1998 Dan Ellis International Computer Science Institute, Berkeley CA Outline 1 2 3 4 The computational

More information

The Deaf Brain. Bencie Woll Deafness Cognition and Language Research Centre

The Deaf Brain. Bencie Woll Deafness Cognition and Language Research Centre The Deaf Brain Bencie Woll Deafness Cognition and Language Research Centre 1 Outline Introduction: the multi-channel nature of language Audio-visual language BSL Speech processing Silent speech Auditory

More information

PERIPHERAL AND CENTRAL AUDITORY ASSESSMENT

PERIPHERAL AND CENTRAL AUDITORY ASSESSMENT PERIPHERAL AND CENTRAL AUDITORY ASSESSMENT Ravi Pachigolla, MD Faculty Advisor: Jeffery T. Vrabec, MD The University of Texas Medical Branch At Galveston Department of Otolaryngology Grand Rounds Presentation

More information

Music: A help for traumatic brain injury? a neural perspective

Music: A help for traumatic brain injury? a neural perspective Music: A help for traumatic brain injury? a neural perspective Nina Kraus Northwestern University www.brainvolts.northwestern.edu The Kraus Lab experience with SOUND changes the BRAIN music bilingualism

More information

C HAPTER FOUR. Audiometric Configurations in Children. Andrea L. Pittman. Introduction. Methods

C HAPTER FOUR. Audiometric Configurations in Children. Andrea L. Pittman. Introduction. Methods C HAPTER FOUR Audiometric Configurations in Children Andrea L. Pittman Introduction Recent studies suggest that the amplification needs of children and adults differ due to differences in perceptual ability.

More information

Electrophysiological Substrates of Auditory Temporal Assimilation Between Two Neighboring Time Intervals

Electrophysiological Substrates of Auditory Temporal Assimilation Between Two Neighboring Time Intervals Electrophysiological Substrates of Auditory Temporal Assimilation Between Two Neighboring Time Intervals Takako Mitsudo *1, Yoshitaka Nakajima 2, Gerard B. Remijn 3, Hiroshige Takeichi 4, Yoshinobu Goto

More information

Essential feature. Who are cochlear implants for? People with little or no hearing. substitute for faulty or missing inner hair

Essential feature. Who are cochlear implants for? People with little or no hearing. substitute for faulty or missing inner hair Who are cochlear implants for? Essential feature People with little or no hearing and little conductive component to the loss who receive little or no benefit from a hearing aid. Implants seem to work

More information

Structure and Function of the Auditory and Vestibular Systems (Fall 2014) Auditory Cortex (3) Prof. Xiaoqin Wang

Structure and Function of the Auditory and Vestibular Systems (Fall 2014) Auditory Cortex (3) Prof. Xiaoqin Wang 580.626 Structure and Function of the Auditory and Vestibular Systems (Fall 2014) Auditory Cortex (3) Prof. Xiaoqin Wang Laboratory of Auditory Neurophysiology Department of Biomedical Engineering Johns

More information

Journal of Neuroscience Methods

Journal of Neuroscience Methods Journal of Neuroscience Methods 196 (2011) 308 317 Contents lists available at ScienceDirect Journal of Neuroscience Methods journal homepage: www.elsevier.com/locate/jneumeth Cross-phaseogram: Objective

More information

Nature Neuroscience: doi: /nn Supplementary Figure 1. Behavioral training.

Nature Neuroscience: doi: /nn Supplementary Figure 1. Behavioral training. Supplementary Figure 1 Behavioral training. a, Mazes used for behavioral training. Asterisks indicate reward location. Only some example mazes are shown (for example, right choice and not left choice maze

More information

Temporal offset judgments for concurrent vowels by young, middle-aged, and older adults

Temporal offset judgments for concurrent vowels by young, middle-aged, and older adults Temporal offset judgments for concurrent vowels by young, middle-aged, and older adults Daniel Fogerty Department of Communication Sciences and Disorders, University of South Carolina, Columbia, South

More information

Attention Response Functions: Characterizing Brain Areas Using fmri Activation during Parametric Variations of Attentional Load

Attention Response Functions: Characterizing Brain Areas Using fmri Activation during Parametric Variations of Attentional Load Attention Response Functions: Characterizing Brain Areas Using fmri Activation during Parametric Variations of Attentional Load Intro Examine attention response functions Compare an attention-demanding

More information