Jackknife-based method for measuring LRP onset latency differences

Psychophysiology, 35 (1998). Cambridge University Press. Printed in the USA. Copyright 1998 Society for Psychophysiological Research

METHODOLOGY

Jackknife-based method for measuring LRP onset latency differences

JEFF MILLER,a TUI PATTERSON,a and ROLF ULRICHb
a Department of Psychology, University of Otago, Dunedin, New Zealand
b Department of Psychology, University of Wuppertal, Germany

Abstract

A new method based on jackknifing is presented for measuring the difference between two conditions in the onset latencies of the lateralized readiness potential (LRP). The method can be used with both stimulus- and response-locked LRPs, and simulations indicate that it provides accurate estimates of onset latency differences in many common experimental conditions.

Descriptors: Lateralized readiness potential, Onset latency measurement, Jackknifing, Computer simulations

Author note: This work was supported by cooperative research funds from the Deutsche Raum- und Luftfahrtgesellschaft e.V. and the New Zealand Ministry of Research, Science, and Technology. During her work on this project, Tui Patterson was supported by an Otago University summer bursary. We thank Michael Coles, Patricia Haden, Aaron Ilan, Bert Mulder, Allen Osman, and Fren Smulders for helpful comments on earlier drafts of the manuscript. Address reprint requests to: Jeff Miller, Department of Psychology, University of Otago, Dunedin, New Zealand; miller@otago.ac.nz.

In recent years, the lateralized readiness potential (LRP) has become an important tool in psychophysiological studies of choice reaction time (RT) tasks (e.g., Coles, 1989). In such tasks, a stimulus is presented and people must respond as quickly as possible. A classic problem in the study of such tasks is to determine which mental processes are influenced by an experimental manipulation, such as stimulus probability. For example, it is easy to establish that people respond faster to probable stimuli than to improbable stimuli, but discerning whether this difference in RT is due to faster processing at the level of perception, decision making, or motor response is difficult (e.g., Gehring, Gratton, Coles, & Donchin, 1992; Miller & Pachella, 1973). In principle, this problem can be addressed by using the LRP, an electrophysiological indicator of response preparation that is obtained by comparing electroencephalographic (EEG) activity over the left and right motor cortices prior to movements of the left and right hands (for a complete description of LRP derivation, see Coles, 1989). The onset of the LRP provides a time marker intervening between stimulus and response, indicating the beginning of side-specific response preparation (Coles, 1989), and this time marker can be used to localize the effects of experimental manipulations. For example, an experimenter could compare the time between stimulus onset and LRP onset for high and low probability stimuli. If this time is the same for the two types of stimuli, then it can be inferred that processing of high probability stimuli is faster after the moment when response preparation begins. Alternatively, if this time is shorter for high probability stimuli, then it can be inferred that processing is faster before response preparation begins. This general inferential technique can be used with almost any experimental manipulation. Unfortunately, determining exactly when LRP onset occurs is difficult because EEG signals have a low signal-to-noise ratio.
Previous investigators have devised and used various methods to determine differences in LRP onset latencies but have reported evaluations of the efficiency of their methods in only one case (Smulders, Kenemans, & Kok, 1996). This article presents a class of new methods that are computationally simpler than previous methods and reports computer simulations indicating that the new methods are also more accurate.

Measurement of Latencies Versus Measurement of Latency Differences

Previous investigators have focused on accurate measurement of LRP latency within a condition. Viewed from this perspective, the problem is to determine what level the measured LRP might reach by chance when true LRP onset had not yet occurred. A criterion for LRP onset can then be set at a level just beyond what is likely to be reached by chance, and LRP onset can be estimated as the moment at which the LRP reaches this criterion level. Both parametric (Osman, Bashore, Coles, Donchin, & Meyer, 1992) and nonparametric (Van Dellen, Brookhuis, Mulder, Okita, & Mulder, 1985) statistical techniques have been used to determine when the LRP exceeds chance levels. Regardless of the statistical technique, however, the problem is extremely difficult for at least three reasons: (a) there is considerable noise in the LRP, (b) at its onset, the LRP usually rises gradually rather than sharply, and (c) statistical testing at many time points, which is necessitated by the high frequencies at which EEG is sampled, inflates the Type I error probability by providing many opportunities to conclude incorrectly that LRP onset has occurred.

In contrast, the new approach suggested in the present study focuses on accurate measurement of the difference in latencies between two conditions. Specifically, we suggest measuring in each condition the latency at which the LRP reaches a fairly large criterion value, much larger than the minimum LRP needed to say that LRP onset has occurred (cf. Smulders et al., 1996). Clearly, each of these latencies will tend to be too large in isolation because the large criterion will only be reached some time after onset. Nonetheless, as long as both latencies are too large by the same amount, the difference in latencies will accurately reflect the true difference in onset latencies (see Footnote 1). In short, we acknowledge that the new procedure may produce a poor measure of LRP onset latencies per se but nonetheless argue that this technique produces a good measure of differences in onset latencies. Thus, the new procedure should be especially useful in situations in which differences are of primary theoretical interest, although other methods may also be useful for determining absolute rather than relative LRP onset latencies.

Footnote 1: The two latency estimates might be too large by different amounts in cases in which the initial portions of the two LRPs have different shapes, either because of differences in response activation or, for stimulus-locked LRPs, because of differences in onset variability of the response process. We are primarily concerned with common cases in which the initial portion of the LRP does not differ much in shape across conditions, but the effects of LRP shape differences on the new method will be considered in a later section.

The new approach begins with a simple measure of the onset latency within a given condition, also used by Smulders et al. (1996). A criterion level of LRP, say 0.5 µV, is chosen large enough to be certain that this level would not be crossed by chance in the grand-average waveforms. LRP onset latency in each condition is then measured as the first time point at which the grand-average LRP for that condition exceeds this criterion. Figure 1, for example, shows the stimulus- and response-locked LRPs obtained in two conditions, with cutoffs of 0.5 µV used to obtain stimulus-locked LRP onsets of 208 ms and 280 ms and response-locked LRP onsets of -216 ms and -244 ms in the control and experimental conditions, respectively. Thus, for each type of waveform, the estimated latency difference, D, is the difference between these values. Although the latency estimates could be questioned in absolute terms, the corresponding differences of D = 280 - 208 = 72 ms for the stimulus-locked waveforms and D = (-244) - (-216) = -28 ms for the response-locked waveforms seem to be reasonable summaries of the data.

Figure 1. Examples of measured lateralized readiness potential (LRP) latency differences, D, in observed stimulus- and response-locked LRPs. In both types of waveforms, latencies were scored with a criterion of 0.5 µV.

Measurement of onset latencies in this fashion suffers minimally from each of the three problems just described as plaguing previous methods. First, LRP noise is kept to a minimum because grand averages are used. Second and more importantly, the problematic gradual initial increase is avoided by setting a high criterion for onset, which will only be reached in the sharply rising portion of the LRP, where noise has less effect on the estimated latency. Third, no explicit statistical testing is done to identify the moment of onset, so there is no need to worry about obtaining a significant LRP by chance given a preselected alpha level.
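To make the scoring rule concrete, the following minimal sketch (in Python; the article itself gives no code, so the function and variable names, the time grid, and the choice of a per-waveform relative criterion are illustrative assumptions) finds the first sample at which a grand-average LRP exceeds a criterion and takes the difference D between two conditions.

import numpy as np

def onset_latency(grand_avg_lrp, times_ms, criterion_uv):
    # First time (ms) at which the grand-average LRP exceeds the criterion;
    # returns None if the criterion is never reached.
    above = np.nonzero(grand_avg_lrp > criterion_uv)[0]
    return None if above.size == 0 else times_ms[above[0]]

def latency_difference(lrp_control, lrp_experimental, times_ms, rel_criterion=0.5):
    # Estimated difference D (experimental minus control), scored on grand averages.
    # Here the criterion is defined relative to each waveform's own maximum amplitude;
    # an absolute criterion (e.g., 0.5 microvolts) can be passed to onset_latency instead.
    # The sketch assumes the criterion is actually reached in both conditions.
    t_c = onset_latency(lrp_control, times_ms, rel_criterion * lrp_control.max())
    t_e = onset_latency(lrp_experimental, times_ms, rel_criterion * lrp_experimental.max())
    return t_e - t_c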
Estimating the Standard Error of the Latency Difference

As described so far, the proposed new method produces a single estimate of the latency difference between two conditions based on the overall mean LRP across all participants (cf. Smulders et al., 1996). It is also necessary, however, to have an estimate of the standard error of this difference in order to test hypotheses about the true difference (e.g., is it zero?) and to construct confidence intervals.

Classically, the standard error of a difference is estimated with the following formula. If d_i, i = 1, ..., N, are the latency differences observed for each of N participants and \bar{d} is the mean of those differences, then the standard error of the mean difference, s_{\bar{d}}, is

s_{\bar{d}} = \sqrt{ \sum_{i=1}^{N} (d_i - \bar{d})^2 / [N(N - 1)] }.   (1)

In essence, this formula uses a measure of the individual variation of the difference scores (i.e., the numerator of the fraction) to estimate the random error associated with the sample summary statistic (i.e., the mean difference score). Unfortunately, the classical approach to the computation of standard error may not work well in the case of LRP onset latency differences because the individual-subject LRPs contain much more noise than the grand-average LRPs. Even if the experimenter chooses a fairly large criterion value, there is some chance that an individual participant's LRP will cross the criterion by chance in one condition or another. Moreover, when the criterion is set to a reasonably large value, there is also a substantial chance that an individual participant's LRP will never reach that criterion at all, leading to an awkward problem of missing data.

As part of the new method, we propose to use an alternative technique known as jackknifing (Efron, 1981; Jackson, 1986; Miller, 1974; Mosteller & Tukey, 1977) to measure the standard error, s_D, of the difference. Using this technique, a researcher would compute the values D_i, i = 1, ..., N, where D_i is the difference in latencies computed from a subsample including all subjects except for subject i. More specifically, to obtain each D_i, one first computes the grand-average LRP for each of the two conditions, averaging across all subjects except subject i. Then, one checks each of these two grand averages to see at which point the criterion level of LRP is reached, thereby obtaining a latency estimate for each grand average. The value of D_i is then the difference between these two latency estimates. If \bar{J} is the mean of the differences obtained in the subsamples (i.e., \bar{J} = \sum_{i=1}^{N} D_i / N), then the jackknife estimate of the standard error of the difference, s_D, is

s_D = \sqrt{ \frac{N-1}{N} \sum_{i=1}^{N} (D_i - \bar{J})^2 }.   (2)

(The Appendix presents a numerical example to illustrate the proposed computations in more detail.) Unlike the traditional measure of standard error, this technique compares variation in the quantity of interest across subsets of the total sample rather than across individuals.
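A sketch of this leave-one-out computation follows (Python; it reuses the hypothetical latency_difference helper from the earlier sketch, and the array layout is an assumption for illustration rather than part of the published description).

import numpy as np

def jackknife_se_of_difference(lrps_control, lrps_experimental, times_ms, rel_criterion=0.5):
    # lrps_control, lrps_experimental: arrays of shape (N participants, n samples),
    # each row holding one participant's average LRP in that condition.
    n = lrps_control.shape[0]
    d_sub = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i                          # leave participant i out
        grand_c = lrps_control[keep].mean(axis=0)         # grand average without i
        grand_e = lrps_experimental[keep].mean(axis=0)
        d_sub[i] = latency_difference(grand_c, grand_e, times_ms, rel_criterion)
    j_bar = d_sub.mean()                                  # mean of the subsample differences
    s_d = np.sqrt((n - 1) / n * np.sum((d_sub - j_bar) ** 2))   # Equation 2
    return s_d, d_sub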

In brief, its conceptual basis is to judge the variability between subjects by temporarily leaving each subject out of the calculation. If all participants show approximately the same effect, then the values of D_i should be quite similar whichever subject is omitted. If the participants show different effects, however, then the results should fluctuate substantially depending on which subject is left out. Although some may find its conceptual basis less intuitive than that of the traditional techniques, jackknifing nonetheless deserves serious consideration because it has a sound theoretical basis and useful distributional properties and because it is sometimes superior to traditional methods (Efron, 1981). In various standard cases (e.g., standard error of a sample mean), the jackknife estimate of standard error is mathematically equivalent to the classical estimate of standard error.

In summary, we propose to (a) measure the difference in LRP onset latencies between two conditions by taking the difference, D, in the times at which the grand-average LRPs cross a certain fairly large criterion value, and (b) estimate the standard error of this difference with the jackknife standard error, s_D, obtained from Equation 2. In the remainder of this article, we report simulations designed to evaluate this procedure and some alternative procedures. In these simulations, we also examined the effects of using various different cutoff values in conjunction with the new procedure. These cutoff values were defined either in absolute terms (e.g., 0.5 µV, 1.0 µV) or in relative terms (e.g., 30% of the maximum value of the LRP), and it turned out that criteria defined in relative terms lead to more accurate difference estimates than criteria defined in absolute terms.

Simulations

To evaluate the proposed new measurement procedure and compare its accuracy with that of other procedures, we need to determine the sampling distributions of the values produced by each procedure under a known set of conditions.
With these sampling distributions, for example, it would be a simple matter to see which procedure yielded the best estimate on average, which one produced the least random variation in estimation, and so on. Unfortunately, the measurement procedures are complicated and the distributional properties of the underlying sources of noise are unknown, so it is not possible to calculate these sampling distributions analytically. It is possible to estimate these sampling distributions by simulation, however, so that is the approach we have taken.

Overview and General Method

In each of the present simulations, a two-step procedure was iterated 1,000 times, with each iteration simulating one whole LRP experiment and its analysis. The results of the 1,000 analyses were tabulated in various ways to see how well the analysis recovered the true LRP onset latency difference, as described further below. Each iteration had two main steps: (a) Generate a random set of data corresponding to the outcome of a single experiment. As described below, this data set was drawn from a population with known true differences in stimulus- and response-locked LRP onset latencies. (b) Analyze the generated data to obtain estimates of the between-condition differences in stimulus-locked LRP latencies and in response-locked LRP latencies.

To get valid simulation results, it is crucial that the generated data sets be as realistic as possible, but this realism is not easy to achieve. To generate simulated EEG data, previous researchers have had to make somewhat arbitrary assumptions about the size and temporal pattern of EEG noise and about the characteristics of the signal of interest (e.g., the LRP), including its average shape, subject-to-subject variability in its shape, and trial-to-trial variability in its shape within each participant. To generate the response-locked LRPs needed in the present case, we would also have to make assumptions about the distributions of RTs both within and between subjects and about the precise temporal relationship of the EEG to the overt manual response. Fortunately, in the present case, it is possible to derive appropriate simulated data sets from actual observed data sets, thereby eliminating the need for such arbitrary assumptions. Thus, the simulations reported in this article were based on actual data, and we used two different data sets as a means of cross-validation.

Figure 2 illustrates the method used to generate simulated data corresponding to an experiment with two conditions (referred to as the experimental and control conditions), starting from actual observed data. With this method, the true mean RT is adjusted to be 100 ms larger, on average, in the experimental condition than in the control condition, and this effect on RT is entirely due to a 100-ms increase in stimulus-locked LRP onset latency, with no change in response-locked LRP.

Figure 2. Illustration of the procedure for generating simulated data for two conditions with a 100-ms difference in stimulus-locked onset latencies. A represents the pool of all actual experimental trials for a given real participant. B and C show reaction times (RTs) and lateralized readiness potentials (LRPs) for two trials sampled randomly from this pool and assigned to the experimental and control conditions. D and E show the data actually used in the simulations, as derived from B and C, respectively. For the control condition, the randomly selected RT and LRP are used without modification. For the experimental condition, the selected LRP is shifted 100 ms later in time, and RT is increased by 100 ms.

Figure 2A represents the LRPs obtained in a single pool of actual experimental trials from a single participant in a single condition from the observed data (e.g., trials with left-hand responses) (see Footnote 2). Figures 2B and 2C show observed LRPs on two trials, each with its own RT, drawn randomly from this pool and then assigned randomly to the experimental and control conditions. Figures 2D and 2E represent the trials actually used in the simulation, as derived from those depicted in Figures 2B and 2C, respectively. In the control condition, the observed data are used in the simulation without any modification; that is, the simulated data in Figure 2D are identical to the observed data in Figure 2B. In the experimental condition, the data are modified in two ways before being used in the simulation. First, the entire EEG waveform is shifted 100 ms to the right along the time line, so that each simulated EEG reading occurs 100 ms later in the simulated data than in the actual data (see Footnote 3). Second, 100 ms are added to the RT, so that the relation of the LRP to the response is the same in the simulated data as in the observed data. Adding 100 ms to each RT randomly assigned to the experimental condition increases the mean RT for that condition by 100 ms relative to the mean RT in the control condition, although RTs still vary considerably within both conditions.

This method of constructing simulated data has several useful properties. First, the method guarantees that the true difference in stimulus-locked LRP onsets is 100 ms, despite the fact that we do not know precisely when LRP onset occurred either in the original pool of trials or in the constructed experimental and control conditions. Whenever the LRP started without the shift, it must start 100 ms later with the shift, because the EEG is shifted 100 ms later in the experimental condition. Second, the construction guarantees that there is no true difference in the onsets of response-locked LRPs because RT and EEG are shifted by the same amount. Third, because they are derived from real data, the simulated data are realistic with respect to EEG noise, LRP signals, trial-to-trial and subject-to-subject variability in EEG and RT, and so forth (see Footnote 4).

Footnote 2: Technically, it is inaccurate to use "LRP" to refer to an asymmetry observed on trials with a single response hand because the LRP is derived by averaging C3'/C4' asymmetries over trials with both left- and right-hand responses. Nonetheless, it is simplest to describe our method in terms of what happened with trials for a single hand; the same method was used for the other hand, and then the results were averaged across hands to obtain a true LRP.

Footnote 3: The figure is somewhat oversimplified because it suggests that we simply shifted the LRP difference score, C3' minus C4' or C4' minus C3', depending on the condition. In fact, we actually shifted both C3' and C4' because one of the methods to be examined (i.e., the Wilcoxon) requires that these two channels be kept separate. A further complication is that shifting all the EEGs 100 ms later necessarily creates a gap for the first 100 ms of the baseline period. In most cases, this gap was filled in with the original values of the first 100 ms as they had existed before the shift, except in reverse order to avoid creating a discontinuity at 100 ms. This procedure was not appropriate for the baseline deviation method (described later) because it artifactually reduced the estimate of variability during the baseline period. For simulations involving this method, we shifted only the EEG readings after the end of the baseline period to the right along the time axis (i.e., to later time points). With this procedure, the gap was created during the first 100 ms after the end of the baseline period, and we again filled it by reinstating its original values in reverse order.

Footnote 4: In one respect, the simulated data could be unrealistic in the experimental condition with the stimulus-locked effect. Because EEG was shifted in this condition, any event-related desynchronization (e.g., Pfurtscheller & Aranibar, 1977) present in the EEG would start later in the experimental condition than in the control condition.
It should be emphasized that the LRPs derived from the simulated data will also vary realistically from one simulated experiment to the next because data from different combinations of participants were used in different simulated experiments, and different trials were randomly assigned to the experimental and control conditions in each simulated experiment. In one simulated experiment, for example, the trials with faster responses might by chance tend to be assigned to the control condition, in which case the observed effect of condition on RT would be greater than 100 ms for this simulated experiment. Similarly, in another simulated experiment, the control condition might contain trials with especially large or small observed LRPs, with especially early or late observed stimulus-locked LRPs, and so forth. In short, although the only true differences between experimental and control conditions are the 100-ms differences in RT and in stimulus-locked LRP onsets, the simulated data will contain various observed differences that might arise by chance when sampling actual trials from two such conditions (see Footnote 5).

The method illustrated in Figure 2 can easily be modified to construct simulated data sets that would be obtained if the experimental manipulation increased response-locked LRP latency or increased both stimulus- and response-locked latencies. To generate data for an experiment with a 100-ms effect on response-locked LRP, for example, 100 ms are added to RT for trials in the experimental condition, but EEG is not shifted for either condition. With the response moved 100 ms later and the EEG left as it was, the LRP must by definition start 100 ms earlier, relative to the response, in the experimental condition than in the control. The experimental and control conditions have identical underlying stimulus-locked LRPs because the relation of EEG to stimulus onset is not altered. As another example, it is also possible to generate data for an experiment with 100-ms effects on both stimulus- and response-locked LRP onsets. To generate these data, 200 ms are added to RTs in the experimental condition, but EEGs in this condition are shifted by only 100 ms.
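The core of this trial-level construction is small enough to sketch directly. The following Python fragment is an illustration only, not the authors' code: the function and variable names and the array layout are invented, the 250-Hz sampling rate is taken from the data set described later, and the mirror fill of the 100-ms gap follows the description in Footnote 3.

import numpy as np

SAMPLE_MS = 4                 # 250-Hz sampling, as in the data set described below
SHIFT_PTS = 100 // SAMPLE_MS  # number of samples in a 100-ms shift

def make_experimental_trial(eeg, rt_ms):
    # Derive a simulated "experimental" trial with a 100-ms stimulus-locked delay.
    # eeg: 1-D array of samples for one channel (both C3' and C4' are shifted the
    # same way in the actual procedure); rt_ms: the trial's reaction time in ms.
    shifted = np.empty_like(eeg)
    shifted[SHIFT_PTS:] = eeg[:-SHIFT_PTS]       # every reading occurs 100 ms later
    shifted[:SHIFT_PTS] = eeg[:SHIFT_PTS][::-1]  # fill the gap with the original first
                                                 # 100 ms in reverse order (no discontinuity)
    return shifted, rt_ms + 100                  # RT increases by the same 100 ms

# Control trials are used exactly as observed; for a purely response-locked effect,
# only the RT is increased and the EEG is left unshifted.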
Estimation of 100-ms Effects

Table 1 summarizes the results of two simulations conducted to see how accurately the new method estimates 100-ms effects on stimulus- or response-locked LRPs. For the simulation shown in the left half of the table, trials selected for the experimental condition were modified to implement a 100-ms effect on stimulus-locked LRP onset latency (i.e., as shown in Figure 2). For the simulation shown in the right half of the table, trials selected for the experimental condition were modified to implement a 100-ms effect on response-locked LRP onset latency (i.e., 100 ms were added to RT but EEG was not shifted). For each simulated experiment in both types of simulations, N = 8 participants were sampled without replacement from a pool of 20 actual experimental participants whose data had been collected in connection with another project (Miller, Ulrich, & Rinkenauer, 1997) (see Footnote 6).

Footnote 5: The only real simplification embodied in the simulation procedure is that the experimental effect is 100 ms for all participants and all trials. This simplification is unlikely to be exactly true because there is surely some variation in effect size across trials and participants. However, the evidence suggests that the effect size variance is small enough to be safely ignored. Consider RT: To the extent that effect size varies from trial to trial, the standard deviations of individual trial RTs, computed across trials within a participant, should tend to be larger in the experimental condition than in the control condition. In our experience, however, there is usually not much difference in these standard deviations, suggesting that trial-to-trial variance in effect size is a rather small proportion of the within-subject variance. Similarly, if the effect size varies across participants, then the standard deviation of the mean RTs should be larger in the experimental condition than in the control condition; again, no such large differences are evident. Similar considerations suggest that the effects of condition on the size and onset latencies of the LRP have little variance relative to other sources.

Table 1. Mean (M) and Standard Deviation (SD) of Estimated Differences (D) in Stimulus-Locked (SL) and Response-Locked (RL) Onset Latency. The table crosses the two simulated effects (100-ms stimulus-locked effect; 100-ms response-locked effect) with the SL and RL onset estimates (M and SD for each), for the following methods and criteria: absolute criteria (µV), relative criteria (% of maximum amplitude), Wilcoxon (critical p level for determining onset), baseline deviation (number of noise SDs), and half-amplitude (% of maximum amplitude). [Numeric table entries not reproduced.]

Footnote 6: For each participant in this experiment, we had observed approximately 150 artifact-free trials per hand. On each trial, we had recorded a prestimulus baseline period of 200 ms and a poststimulus epoch of 2,000 ms, sampling at 250 Hz, with bandpass settings of [...] Hz and impedances below 5 kΩ. The raw EEG was filtered off-line by using a low-pass filter with a half-power cutoff of 4 Hz. All of the participants had discernible response-locked LRPs.

A new random sample of participants was chosen for each simulated experiment so that the simulated data sets would vary in between-subjects variation, just as actual data sets do. For each participant within a simulated experiment, 50 trials were randomly assigned to each of the two simulated conditions.

The rows of Table 1 correspond to different scoring methods and criteria, including both the proposed new method, discussed in this section, and several alternative methods, discussed in the next section. Each scoring method was used on the identical simulated data sets, so that the accuracy of different methods could be compared directly. We tried several different scoring criteria with each method because a given method may be more or less effective depending on the exact criterion chosen. For example, the criterion of 0.5 µV considered above (e.g., Figure 1) is a somewhat arbitrary choice. To help identify the most effective criterion to use with the new method, we scored the data with five different criteria defined in absolute terms (0.2, 0.4, 0.6, 0.8, or 1.0 µV) and also with five criteria defined in relative terms (10, 30, 50, 70, or 90% of the maximum LRP amplitude). Similarly, we also tried various criteria with the other methods because there is no way to judge a priori which criterion is likely to be most effective (e.g., Osman & Moore, 1993; Smulders et al., 1996).

In Table 1, the left-most column labeled M shows the mean estimated difference in stimulus-locked onset latencies (experimental condition minus control condition), averaging across 1,000 simulated experiments. Evidently, the new method is accurate in the long run because it produces mean differences that are very close to the true value of 100 ms for almost every criterion tried. The adjacent column, labeled SD, shows the standard deviations of the 1,000 estimated differences computed with each criterion (inspection of frequency distributions indicated that the distributions of estimated differences were approximately normal for all criteria). The relative criteria of 30-70% produced the smallest standard deviations, which indicates that they provide the best estimates of the difference (i.e., they are subject to less random variability around the true value from one experiment to the next) (see Footnote 7).

Footnote 7: The means and standard deviations shown in this table should not be interpreted as precise quantitative estimates in cases in which the SD is very large.
Such large standard deviations indicate that onset latency estimation by that method was contaminated by outliers arising in many simulated experiments. For example, the criterion might be satisfied at the end of the baseline period, producing a latency of zero in that data set, or the criterion might never be satisfied anywhere in the epoch, in which case the end of the epoch (2,200 ms) was taken as the latency.

The third and fourth columns of the table show the analogous results for the estimated differences in response-locked onset latencies. The means of the estimated differences are approximately zero, in accordance with the fact that the data for this simulation were constructed to have equal response-locked onsets in both conditions. In this case, the 50-90% relative criteria produced the smallest standard deviations, indicating that they are the most accurate estimators. The rightmost four columns of the table show the results for the complementary simulation in which the data were constructed with a 100-ms difference in response-locked onsets and no difference in stimulus-locked onsets. On average, the estimated differences in stimulus-locked onset latencies are quite close to zero, as they should be, and the estimated differences in response-locked onsets are close to 100 ms, which is consistent with the fact that LRP onset was 100 ms earlier in the experimental condition than in the control condition. The relative criteria again produced the smallest standard deviations, with the optimal criteria again being 30-70% for the measurement of stimulus-locked onsets and 50-90% for the measurement of response-locked onsets.

In summary, the results presented in Table 1 indicate that the new method is very promising, especially when used with a relative criterion of approximately 50% for stimulus-locked onsets and 90% for response-locked onsets. In the long run, the estimated differences are essentially identical to their true values, and the estimates do not vary much around the true value from one experiment to the next. Many questions about the new method remain, however, before we can recommend that LRP researchers routinely apply it.

It is also interesting to note that estimation of onset differences is more accurate in response-locked waveforms than in stimulus-locked ones. This finding would seem to be a natural consequence of the fact that the LRP is better time locked to the response than to the stimulus (Coles, 1989), which implies a higher signal-to-noise ratio for response-locked LRPs.

Comparison with Other Methods

Perhaps the most obvious question is how the new method compares with other possible methods. We report on the accuracy of three alternative methods, two of which have been used in several previous studies.

The Wilcoxon method was first described by Van Dellen et al. (1985) and subsequently used by De Jong, Wierda, Mulder, and Mulder (1988) and Smid, Lamain, Hogeboom, Mulder, and Mulder (1991), among others. In brief, for each participant and condition, a time series of Wilcoxon statistics is computed, with each Wilcoxon providing a nonparametric test of the null hypothesis that the C3'/C4' difference at that time point is the same for trials on which the left and right hands are activated. Then, t tests across subjects are computed by using the values of the Wilcoxon statistic as data points. The first time point yielding a significant t value in the expected direction can be taken as the LRP onset latency (see Footnote 8). In our simulations, we examined the effectiveness of this method using t critical values with two-tailed significance levels of .05, .025, and [...].

Footnote 8: Smid, Böcker, van Touw, Mulder, and Brunia (1996) noted that the Wilcoxon test is usually found to be significantly different from zero for more than 200 ms (p. 7) and thus defined LRP onset as the point at which such extended significance (one-tailed) was obtained (Mulder, personal communication, 1996). We became aware of this variant of the Wilcoxon procedure too late to include it in the present simulations.
The baseline deviation method was described and used by Osman et al. (1992). In brief, the stimulus-locked LRP waveform for each participant in each condition is examined to find the first time point at which the LRP begins to exceed consistently a criterion value set to 2.5 times the standard deviation of noise LRP, estimated from LRP fluctuations during the baseline period. "Consistently exceed" meant that the average LRP during each of the two 50-ms windows following the estimated LRP onset also had to exceed the criterion (Osman, personal communication, 1995). In our simulations, we also examined the effectiveness of this method with the criterion defined as 2.0 or 3.0 times the noise standard deviation.

The third method we examined was a half-amplitude method carried out at the level of the individual participant data rather than the grand averages. With this method, the LRP for each participant in each condition is examined to find the moment at which it reaches half of its maximum amplitude, and this moment is taken as the LRP onset for that participant and condition. Relative to the proposed jackknife procedure, this method has the advantage that it can be used with traditional statistical tests (e.g., t tests) because an onset is obtained for each participant in each condition. In our simulations, we also examined the effectiveness of this method with the criterion defined as 10, 30, 70, or 90% of the maximum amplitude.

Two other problems arose when scoring the simulated data with the baseline deviation and half-amplitude methods because these methods yield a separate measure of onset latency for each participant. First, it is possible for a participant's data to satisfy the criterion at time zero (i.e., the time of stimulus onset). Second, with the baseline deviation method, it is possible that a participant's data do not satisfy the criterion at any time point, leading to an undefined latency. Simulated participants with either of these problems were excluded from the sample when computing the overall results (e.g., mean onset latency) for a simulated data set, as they would have to be in the analysis of real data sets.

Table 1 also shows summary statistics for the baseline deviation, half-amplitude, and Wilcoxon methods computed from the same 2,000 simulated data sets used to estimate differences for the jackknife-based method. In most cases, these methods are also reasonably unbiased, but they produce higher standard deviations than the jackknife-based method, indicating that they are less accurate estimators of the true difference.

To gain some intuition about why the jackknifing method works so well relative to the other methods, it is helpful to compare this method with the half-amplitude method carried out at the level of the individual-subject LRPs. When the jackknifing method is defined in relative terms (e.g., 50% of maximum amplitude), it differs from the half-amplitude method only with respect to the averaging done prior to latency determination. Using the half-amplitude method, LRPs are averaged and latencies determined for each participant separately, and then latencies are averaged across participants to overcome experimental noise. With the proposed new method, LRPs are averaged across participants before latencies are determined, so the experimental noise is overcome at an earlier stage of the analysis.
The simulation results suggest that it is better to average out experimental noise earlier rather than later because the latency in the average, provided by the new method, is much more stable than the average of the latencies, provided by the old method (cf. Table 1). Intuitively, it is easy to see why this is true. If the LRP noise is large enough that the criterion can be crossed before the LRP has started, then the criterion level will be reached at an almost entirely random latency.

Even an average of these random latencies will not be a stable quantity. Conversely, if LRP noise is kept small enough that the criterion will not be crossed until the LRP has started (e.g., by averaging across participants), then the noise will cause only small random variation in the moment of reaching the criterion level. In short, a small increase in LRP noise can produce a big increase in the variability of latency estimates, and the new method works well because it obtains latency estimates only when LRP noise is small.

Hypothesis Testing

Although the jackknife-based method clearly gives good estimates of latency differences on the average across experiments, an equally important question for researchers is whether the method can be used to make inferences from a single set of experimental results. Typically, for example, a researcher will want to test the null hypothesis that the true difference is zero, both for stimulus-locked and for response-locked onsets. Using the new method, the researcher would compute

t_J = D / s_D,   (3)

where D is the difference in stimulus- or response-locked latencies observed in the overall sample, and s_D is the estimate of its standard error obtained from Equation 2. According to the null hypothesis, the quantity t_J should have approximately a t distribution with N - 1 degrees of freedom because D is approximately normal with a mean of zero and s_D is an estimate of the standard deviation of D. Thus, the researcher would reject the null hypothesis if the observed value of t_J exceeded the critical cutoff obtained from a standard t table.
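Continuing the earlier sketches (and reusing the hypothetical latency_difference and jackknife_se_of_difference helpers defined there, under the same illustrative assumptions), the test in Equation 3 amounts to only a few lines of Python:

from scipy import stats

def jackknife_t_test(lrps_control, lrps_experimental, times_ms, rel_criterion=0.5):
    # t_J = D / s_D, referred to a t distribution with N - 1 degrees of freedom.
    n = lrps_control.shape[0]
    d = latency_difference(lrps_control.mean(axis=0),
                           lrps_experimental.mean(axis=0),
                           times_ms, rel_criterion)                    # full-sample D
    s_d, _ = jackknife_se_of_difference(lrps_control, lrps_experimental,
                                        times_ms, rel_criterion)       # Equation 2
    t_j = d / s_d
    p = 2 * stats.t.sf(abs(t_j), df=n - 1)                             # two-tailed p value
    return t_j, p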
To see how well this procedure would work, we computed 4,000 values of t_J from the 2,000 simulated experiments used to construct Table 1 (one value of t_J to test the null hypothesis of zero difference in stimulus-locked onsets and one to test the null hypothesis of zero difference in response-locked onsets). We then tabulated the proportion of simulated experiments in which the resulting t_J was significant by using significance cutoffs for p levels of .05 and .01. We conducted this exercise for each of the different scoring methods and criteria. Computation of t values was straightforward for the baseline deviation and half-amplitude methods because these provide difference estimates for each participant, thus allowing computation of the standard error of the difference by using the usual formula (i.e., Equation 1). As described previously, the Wilcoxon procedure does not supply any estimate of standard error. Thus, we also employed jackknifing with this procedure to obtain estimated standard errors by computing the estimated latency difference via this method for each of the subsamples of N - 1 participants and applying Equation 2.

Table 2 shows the most informative results. For each method, it shows the probability of correctly rejecting the null hypothesis (i.e., power) separately for tests of a stimulus-locked effect in the simulation where this effect was present and for tests of a response-locked effect in the simulation having the response-locked effect. We also estimated Type I error probabilities by tabulating the proportion of times that a significant response-locked effect was obtained in the simulation where the stimulus-locked effect was present, and vice versa, but these are not shown because they were close to or smaller than the nominal values of .05 and .01 in all cases.

Table 2. Power as a Function of Stimulus- Versus Response-Locked Effect, Scoring Method and Criterion, and Significance Level (.05 vs. .01). Rows cover the same methods and criteria as Table 1: absolute criteria (µV), relative criteria (%), Wilcoxon (critical p level for determining onset), baseline deviation (number of noise SDs), and half-amplitude (% of maximum amplitude). [Numeric table entries not reproduced.]

With appropriate criteria, the jackknife-based method once again does extremely well. The power of this method is quite large, even at the more stringent .01 significance level. In combination with the low Type I error probabilities, this finding suggests that researchers can be appropriately confident of reaching the correct conclusion when using this method to look for a 100-ms effect in an experiment of this size (i.e., eight participants with 50 trials per participant) and with an amount of EEG noise comparable to that present in this data set. We will consider in a later section the question of how well the method does with smaller effects and with experiments of other sizes. In contrast, the baseline deviation, half-amplitude, and Wilcoxon methods do not perform well. Clearly, they have much less power than the jackknife-based method under the conditions implemented in these simulations.

Confidence Intervals

Besides testing null hypotheses, experimenters often wish to compute confidence intervals for the sizes of their experimental effects. Using the present method, the bounds of a 95% confidence interval for a true difference could be constructed by using the formula

bounds = D \pm t_{.05} s_D,   (4)

where D is the latency difference computed from the grand averages including the full sample, s_D is the jackknife-based estimate of its standard error, and t_{.05} is the critical t value with N - 1 degrees of freedom for the .05 alpha level.
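A corresponding interval computation, under the same assumptions as the earlier sketches, is equally short:

from scipy import stats

def jackknife_confidence_interval(d, s_d, n, confidence=0.95):
    # Equation 4: D +/- t_crit * s_D with N - 1 degrees of freedom.
    t_crit = stats.t.ppf(1 - (1 - confidence) / 2, df=n - 1)
    return d - t_crit * s_d, d + t_crit * s_d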

For each of the 2,000 simulated experiments discussed to this point, we computed 95% confidence intervals around the true mean by using each method with the same error terms described in the previous section. All methods produced confidence intervals with a high likelihood of containing the true value (0 or 100 ms, depending on the effect simulated), so in this sense all methods performed well. As shown in Table 3, however, the methods differed widely in the mean half-width of the computed confidence intervals (averaging across simulated experiments). The smallest half-widths, which reflect the most precise difference estimates, were obtained with relative criteria of 50% when estimating stimulus-locked onsets and of 90% when estimating response-locked onsets. That is, under experimental conditions comparable to those simulated here, these methods would on average allow an experimenter to estimate an onset latency difference to within ±36 ms for stimulus-locked waveforms and to within ±19 ms for response-locked ones. Clearly, these methods would also be quite powerful in detecting effects smaller than 100 ms, a point to which we will return in a subsequent section (see Footnote 9).

Table 3. Mean Half-Widths of 95% Confidence Intervals Computed for Differences in Stimulus-Locked (SL) and Response-Locked (RL) Onsets as a Function of Simulated Effect and Scoring Method and Criterion. Columns give SL and RL onsets under the stimulus-locked and response-locked effects; rows cover the same methods and criteria as Tables 1 and 2. [Numeric table entries not reproduced.]

Footnote 9: Some readers may be puzzled by apparent inconsistencies that arise when comparing Tables 1 and 3. Specifically, the mean half-widths shown in Table 3 do not order the different methods and criteria in exactly the same fashion as the SD values shown in Table 1. For example, the mean half-width shown in Table 3 is 36 ms for estimating a stimulus-locked effect with the relative criterion of 50% (eighth row and second column in the table), whereas it is 44 ms for the relative criterion of 70%. This comparison of half-widths suggests that the 50% criterion is more accurate than the 70% criterion. However, the SD values shown in Table 1 for these same two criteria are 14.7 and 14.1 ms, respectively, suggesting that the 70% criterion is more accurate. The discrepancy between the half-widths in Table 3 and the SDs in Table 1 arises because the two tables use somewhat different measures of variability. A confidence interval is computed by using a standard error value estimated from a single simulated experiment with Equation 1 or Equation 2, as appropriate. Thus, the mean confidence interval half-width is sensitive to the mean of these individual-experiment standard error estimates, averaging across simulated experiments. In contrast, the SDs shown in Table 1 reflect variation in the estimated difference scores across simulated experiments. Discrepancies between these two measures of variability can arise when the standard error of a difference estimated from a single experiment is a biased estimate of the true variability of differences across experiments. In fact, such biases were present in the current simulation results. For all procedures, means of the single-experiment standard errors were slightly larger than the actual standard deviations of the difference scores across experiments. Moreover, the amount of overestimation varies slightly from one method and criterion to the next. With the 50% criterion, overestimation was slight: The mean of the individual-experiment standard errors was 15.25, which is only slightly larger than the actual standard deviation of 14.7. Overestimation was larger with the 70% criterion: The mean of the standard errors was 18.65, which is substantially larger than the actual standard deviation of 14.1. The greater overestimation of standard error with the 70% criterion than with the 50% criterion is responsible for the discrepancy. In summary, the 70% criterion provides a better point estimate of the difference, but the 50% criterion provides a better interval estimate because of its superior estimation of standard error. The same considerations apply in the case of other apparent discrepancies that can be found in comparing these two tables.

Generality of Application

Because we have discussed only two specific simulations thus far, a salient additional question about the new method is whether it works well under a wide variety of circumstances. This section describes, albeit briefly, additional simulations designed to see how well the method works under various other conditions. In the interests of brevity, it is convenient to summarize the results of the additional simulations by using a single number to measure the ability of each method to discriminate between situations with true versus false null hypotheses under the conditions embodied in the simulation.
One such measure of discriminability is the quantity d' used in signal detection theory (e.g., Green & Swets, 1966). For each scoring method, an estimate of d' can be computed from

d' = (\bar{X}_0 - \bar{X}_1) / \sqrt{ (S_0^2 + S_1^2) / 2 },   (5)

where \bar{X}_0 and S_0 are the mean and standard deviation, respectively, of the difference estimates produced by that scoring method across 1,000 simulated experiments in which the null hypothesis is true, and \bar{X}_1 and S_1 are the corresponding mean and standard deviation, respectively, obtained from 1,000 simulated experiments in which the null hypothesis is false. Using values shown in Table 1, for example, it is possible to estimate the d' with which the jackknife-based measure with a relative criterion of 50% discriminates between cases with stimulus-locked onset differences of 0 versus 100 ms; the result is d' = 6.8. In brief, d' increases with the separation between the distributions of scores obtained with true versus false null hypotheses, so a larger d' indicates a method that discriminates better between the two situations.
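A minimal computation of this index from two sets of simulated difference estimates (Python; the use of the sample standard deviation and of the absolute value in the numerator are assumptions for illustration, since the text does not specify these details):

import numpy as np

def d_prime(est_null, est_alt):
    # Equation 5: separation of the difference estimates obtained when the true
    # difference is 0 (null true) versus 100 ms (null false).
    pooled_sd = np.sqrt((est_null.std(ddof=1) ** 2 + est_alt.std(ddof=1) ** 2) / 2)
    return abs(est_null.mean() - est_alt.mean()) / pooled_sd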


More information

Regression Discontinuity Analysis

Regression Discontinuity Analysis Regression Discontinuity Analysis A researcher wants to determine whether tutoring underachieving middle school students improves their math grades. Another wonders whether providing financial aid to low-income

More information

Comparing Pre-Post Change Across Groups: Guidelines for Choosing between Difference Scores, ANCOVA, and Residual Change Scores

Comparing Pre-Post Change Across Groups: Guidelines for Choosing between Difference Scores, ANCOVA, and Residual Change Scores ANALYZING PRE-POST CHANGE 1 Comparing Pre-Post Change Across Groups: Guidelines for Choosing between Difference Scores, ANCOVA, and Residual Change Scores Megan A. Jennings & Robert A. Cribbie Quantitative

More information

Technical Specifications

Technical Specifications Technical Specifications In order to provide summary information across a set of exercises, all tests must employ some form of scoring models. The most familiar of these scoring models is the one typically

More information

Describe what is meant by a placebo Contrast the double-blind procedure with the single-blind procedure Review the structure for organizing a memo

Describe what is meant by a placebo Contrast the double-blind procedure with the single-blind procedure Review the structure for organizing a memo Please note the page numbers listed for the Lind book may vary by a page or two depending on which version of the textbook you have. Readings: Lind 1 11 (with emphasis on chapters 10, 11) Please note chapter

More information

Full title: A likelihood-based approach to early stopping in single arm phase II cancer clinical trials

Full title: A likelihood-based approach to early stopping in single arm phase II cancer clinical trials Full title: A likelihood-based approach to early stopping in single arm phase II cancer clinical trials Short title: Likelihood-based early stopping design in single arm phase II studies Elizabeth Garrett-Mayer,

More information

Title: Healthy snacks at the checkout counter: A lab and field study on the impact of shelf arrangement and assortment structure on consumer choices

Title: Healthy snacks at the checkout counter: A lab and field study on the impact of shelf arrangement and assortment structure on consumer choices Author's response to reviews Title: Healthy snacks at the checkout counter: A lab and field study on the impact of shelf arrangement and assortment structure on consumer choices Authors: Ellen van Kleef

More information

Supporting Information

Supporting Information 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 Supporting Information Variances and biases of absolute distributions were larger in the 2-line

More information

LAB 1: MOTOR LEARNING & DEVELOPMENT REACTION TIME AND MEASUREMENT OF SKILLED PERFORMANCE. Name: Score:

LAB 1: MOTOR LEARNING & DEVELOPMENT REACTION TIME AND MEASUREMENT OF SKILLED PERFORMANCE. Name: Score: LAB 1: MOTOR LEARNING & DEVELOPMENT REACTION TIME AND MEASUREMENT OF SKILLED PERFORMANCE Name: Score: Part I: Reaction Time Environments Introduction: Reaction time is a measure of how long it takes a

More information

Analysis of data in within subjects designs. Analysis of data in between-subjects designs

Analysis of data in within subjects designs. Analysis of data in between-subjects designs Gavin-Ch-06.qxd 11/21/2007 2:30 PM Page 103 CHAPTER 6 SIMPLE EXPERIMENTAL DESIGNS: BEING WATCHED Contents Who is watching you? The analysis of data from experiments with two conditions The test Experiments

More information

Sum of Neurally Distinct Stimulus- and Task-Related Components.

Sum of Neurally Distinct Stimulus- and Task-Related Components. SUPPLEMENTARY MATERIAL for Cardoso et al. 22 The Neuroimaging Signal is a Linear Sum of Neurally Distinct Stimulus- and Task-Related Components. : Appendix: Homogeneous Linear ( Null ) and Modified Linear

More information

MODULE S1 DESCRIPTIVE STATISTICS

MODULE S1 DESCRIPTIVE STATISTICS MODULE S1 DESCRIPTIVE STATISTICS All educators are involved in research and statistics to a degree. For this reason all educators should have a practical understanding of research design. Even if an educator

More information

Detection Theory: Sensitivity and Response Bias

Detection Theory: Sensitivity and Response Bias Detection Theory: Sensitivity and Response Bias Lewis O. Harvey, Jr. Department of Psychology University of Colorado Boulder, Colorado The Brain (Observable) Stimulus System (Observable) Response System

More information

Bayesian and Frequentist Approaches

Bayesian and Frequentist Approaches Bayesian and Frequentist Approaches G. Jogesh Babu Penn State University http://sites.stat.psu.edu/ babu http://astrostatistics.psu.edu All models are wrong But some are useful George E. P. Box (son-in-law

More information

Statistical reports Regression, 2010

Statistical reports Regression, 2010 Statistical reports Regression, 2010 Niels Richard Hansen June 10, 2010 This document gives some guidelines on how to write a report on a statistical analysis. The document is organized into sections that

More information

Louis Leon Thurstone in Monte Carlo: Creating Error Bars for the Method of Paired Comparison

Louis Leon Thurstone in Monte Carlo: Creating Error Bars for the Method of Paired Comparison Louis Leon Thurstone in Monte Carlo: Creating Error Bars for the Method of Paired Comparison Ethan D. Montag Munsell Color Science Laboratory, Chester F. Carlson Center for Imaging Science Rochester Institute

More information

Comments on Significance of candidate cancer genes as assessed by the CaMP score by Parmigiani et al.

Comments on Significance of candidate cancer genes as assessed by the CaMP score by Parmigiani et al. Comments on Significance of candidate cancer genes as assessed by the CaMP score by Parmigiani et al. Holger Höfling Gad Getz Robert Tibshirani June 26, 2007 1 Introduction Identifying genes that are involved

More information

Pooling Subjective Confidence Intervals

Pooling Subjective Confidence Intervals Spring, 1999 1 Administrative Things Pooling Subjective Confidence Intervals Assignment 7 due Friday You should consider only two indices, the S&P and the Nikkei. Sorry for causing the confusion. Reading

More information

MMI 409 Spring 2009 Final Examination Gordon Bleil. 1. Is there a difference in depression as a function of group and drug?

MMI 409 Spring 2009 Final Examination Gordon Bleil. 1. Is there a difference in depression as a function of group and drug? MMI 409 Spring 2009 Final Examination Gordon Bleil Table of Contents Research Scenario and General Assumptions Questions for Dataset (Questions are hyperlinked to detailed answers) 1. Is there a difference

More information

MEA DISCUSSION PAPERS

MEA DISCUSSION PAPERS Inference Problems under a Special Form of Heteroskedasticity Helmut Farbmacher, Heinrich Kögel 03-2015 MEA DISCUSSION PAPERS mea Amalienstr. 33_D-80799 Munich_Phone+49 89 38602-355_Fax +49 89 38602-390_www.mea.mpisoc.mpg.de

More information

CHAPTER OBJECTIVES - STUDENTS SHOULD BE ABLE TO:

CHAPTER OBJECTIVES - STUDENTS SHOULD BE ABLE TO: 3 Chapter 8 Introducing Inferential Statistics CHAPTER OBJECTIVES - STUDENTS SHOULD BE ABLE TO: Explain the difference between descriptive and inferential statistics. Define the central limit theorem and

More information

Chapter 5: Field experimental designs in agriculture

Chapter 5: Field experimental designs in agriculture Chapter 5: Field experimental designs in agriculture Jose Crossa Biometrics and Statistics Unit Crop Research Informatics Lab (CRIL) CIMMYT. Int. Apdo. Postal 6-641, 06600 Mexico, DF, Mexico Introduction

More information

Student Performance Q&A:

Student Performance Q&A: Student Performance Q&A: 2009 AP Statistics Free-Response Questions The following comments on the 2009 free-response questions for AP Statistics were written by the Chief Reader, Christine Franklin of

More information

A Comparison of Several Goodness-of-Fit Statistics

A Comparison of Several Goodness-of-Fit Statistics A Comparison of Several Goodness-of-Fit Statistics Robert L. McKinley The University of Toledo Craig N. Mills Educational Testing Service A study was conducted to evaluate four goodnessof-fit procedures

More information

Statistics for Psychology

Statistics for Psychology Statistics for Psychology SIXTH EDITION CHAPTER 3 Some Key Ingredients for Inferential Statistics Some Key Ingredients for Inferential Statistics Psychologists conduct research to test a theoretical principle

More information

Measuring the User Experience

Measuring the User Experience Measuring the User Experience Collecting, Analyzing, and Presenting Usability Metrics Chapter 2 Background Tom Tullis and Bill Albert Morgan Kaufmann, 2008 ISBN 978-0123735584 Introduction Purpose Provide

More information

Psy201 Module 3 Study and Assignment Guide. Using Excel to Calculate Descriptive and Inferential Statistics

Psy201 Module 3 Study and Assignment Guide. Using Excel to Calculate Descriptive and Inferential Statistics Psy201 Module 3 Study and Assignment Guide Using Excel to Calculate Descriptive and Inferential Statistics What is Excel? Excel is a spreadsheet program that allows one to enter numerical values or data

More information

Ambiguous Data Result in Ambiguous Conclusions: A Reply to Charles T. Tart

Ambiguous Data Result in Ambiguous Conclusions: A Reply to Charles T. Tart Other Methodology Articles Ambiguous Data Result in Ambiguous Conclusions: A Reply to Charles T. Tart J. E. KENNEDY 1 (Original publication and copyright: Journal of the American Society for Psychical

More information

Citation for published version (APA): Ebbes, P. (2004). Latent instrumental variables: a new approach to solve for endogeneity s.n.

Citation for published version (APA): Ebbes, P. (2004). Latent instrumental variables: a new approach to solve for endogeneity s.n. University of Groningen Latent instrumental variables Ebbes, P. IMPORTANT NOTE: You are advised to consult the publisher's version (publisher's PDF) if you wish to cite from it. Please check the document

More information

Describe what is meant by a placebo Contrast the double-blind procedure with the single-blind procedure Review the structure for organizing a memo

Describe what is meant by a placebo Contrast the double-blind procedure with the single-blind procedure Review the structure for organizing a memo Business Statistics The following was provided by Dr. Suzanne Delaney, and is a comprehensive review of Business Statistics. The workshop instructor will provide relevant examples during the Skills Assessment

More information

Confidence Intervals On Subsets May Be Misleading

Confidence Intervals On Subsets May Be Misleading Journal of Modern Applied Statistical Methods Volume 3 Issue 2 Article 2 11-1-2004 Confidence Intervals On Subsets May Be Misleading Juliet Popper Shaffer University of California, Berkeley, shaffer@stat.berkeley.edu

More information

Statistical Techniques. Masoud Mansoury and Anas Abulfaraj

Statistical Techniques. Masoud Mansoury and Anas Abulfaraj Statistical Techniques Masoud Mansoury and Anas Abulfaraj What is Statistics? https://www.youtube.com/watch?v=lmmzj7599pw The definition of Statistics The practice or science of collecting and analyzing

More information

Issues Surrounding the Normalization and Standardisation of Skin Conductance Responses (SCRs).

Issues Surrounding the Normalization and Standardisation of Skin Conductance Responses (SCRs). Issues Surrounding the Normalization and Standardisation of Skin Conductance Responses (SCRs). Jason J. Braithwaite {Behavioural Brain Sciences Centre, School of Psychology, University of Birmingham, UK}

More information

The Effect of Guessing on Item Reliability

The Effect of Guessing on Item Reliability The Effect of Guessing on Item Reliability under Answer-Until-Correct Scoring Michael Kane National League for Nursing, Inc. James Moloney State University of New York at Brockport The answer-until-correct

More information

THE EFFECT OF A REMINDER STIMULUS ON THE DECISION STRATEGY ADOPTED IN THE TWO-ALTERNATIVE FORCED-CHOICE PROCEDURE.

THE EFFECT OF A REMINDER STIMULUS ON THE DECISION STRATEGY ADOPTED IN THE TWO-ALTERNATIVE FORCED-CHOICE PROCEDURE. THE EFFECT OF A REMINDER STIMULUS ON THE DECISION STRATEGY ADOPTED IN THE TWO-ALTERNATIVE FORCED-CHOICE PROCEDURE. Michael J. Hautus, Daniel Shepherd, Mei Peng, Rebecca Philips and Veema Lodhia Department

More information

Post Hoc Analysis Decisions Drive the Reported Reading Time Effects in Hackl, Koster-Hale & Varvoutis (2012)

Post Hoc Analysis Decisions Drive the Reported Reading Time Effects in Hackl, Koster-Hale & Varvoutis (2012) Journal of Semantics, 2017, 1 8 doi: 10.1093/jos/ffx001 Article Post Hoc Analysis Decisions Drive the Reported Reading Time Effects in Hackl, Koster-Hale & Varvoutis (2012) Edward Gibson Department of

More information

Inferential Statistics

Inferential Statistics Inferential Statistics and t - tests ScWk 242 Session 9 Slides Inferential Statistics Ø Inferential statistics are used to test hypotheses about the relationship between the independent and the dependent

More information

Section on Survey Research Methods JSM 2009

Section on Survey Research Methods JSM 2009 Missing Data and Complex Samples: The Impact of Listwise Deletion vs. Subpopulation Analysis on Statistical Bias and Hypothesis Test Results when Data are MCAR and MAR Bethany A. Bell, Jeffrey D. Kromrey

More information

RAG Rating Indicator Values

RAG Rating Indicator Values Technical Guide RAG Rating Indicator Values Introduction This document sets out Public Health England s standard approach to the use of RAG ratings for indicator values in relation to comparator or benchmark

More information

Running head: INDIVIDUAL DIFFERENCES 1. Why to treat subjects as fixed effects. James S. Adelman. University of Warwick.

Running head: INDIVIDUAL DIFFERENCES 1. Why to treat subjects as fixed effects. James S. Adelman. University of Warwick. Running head: INDIVIDUAL DIFFERENCES 1 Why to treat subjects as fixed effects James S. Adelman University of Warwick Zachary Estes Bocconi University Corresponding Author: James S. Adelman Department of

More information

Still important ideas

Still important ideas Readings: OpenStax - Chapters 1 11 + 13 & Appendix D & E (online) Plous - Chapters 2, 3, and 4 Chapter 2: Cognitive Dissonance, Chapter 3: Memory and Hindsight Bias, Chapter 4: Context Dependence Still

More information

Conditional spectrum-based ground motion selection. Part II: Intensity-based assessments and evaluation of alternative target spectra

Conditional spectrum-based ground motion selection. Part II: Intensity-based assessments and evaluation of alternative target spectra EARTHQUAKE ENGINEERING & STRUCTURAL DYNAMICS Published online 9 May 203 in Wiley Online Library (wileyonlinelibrary.com)..2303 Conditional spectrum-based ground motion selection. Part II: Intensity-based

More information

MULTIPLE LINEAR REGRESSION 24.1 INTRODUCTION AND OBJECTIVES OBJECTIVES

MULTIPLE LINEAR REGRESSION 24.1 INTRODUCTION AND OBJECTIVES OBJECTIVES 24 MULTIPLE LINEAR REGRESSION 24.1 INTRODUCTION AND OBJECTIVES In the previous chapter, simple linear regression was used when you have one independent variable and one dependent variable. This chapter

More information

European Federation of Statisticians in the Pharmaceutical Industry (EFSPI)

European Federation of Statisticians in the Pharmaceutical Industry (EFSPI) Page 1 of 14 European Federation of Statisticians in the Pharmaceutical Industry (EFSPI) COMMENTS ON DRAFT FDA Guidance for Industry - Non-Inferiority Clinical Trials Rapporteur: Bernhard Huitfeldt (bernhard.huitfeldt@astrazeneca.com)

More information

Still important ideas

Still important ideas Readings: OpenStax - Chapters 1 13 & Appendix D & E (online) Plous Chapters 17 & 18 - Chapter 17: Social Influences - Chapter 18: Group Judgments and Decisions Still important ideas Contrast the measurement

More information

The lateralized readiness potential and response kinetics in response-time tasks

The lateralized readiness potential and response kinetics in response-time tasks Psychophysiology, 38 ~2001!, 777 786. Cambridge University Press. Printed in the USA. Copyright 2001 Society for Psychophysiological Research The lateralized readiness potential and response kinetics in

More information

Reliability, validity, and all that jazz

Reliability, validity, and all that jazz Reliability, validity, and all that jazz Dylan Wiliam King s College London Published in Education 3-13, 29 (3) pp. 17-21 (2001) Introduction No measuring instrument is perfect. If we use a thermometer

More information

REPRODUCTIVE ENDOCRINOLOGY

REPRODUCTIVE ENDOCRINOLOGY FERTILITY AND STERILITY VOL. 74, NO. 2, AUGUST 2000 Copyright 2000 American Society for Reproductive Medicine Published by Elsevier Science Inc. Printed on acid-free paper in U.S.A. REPRODUCTIVE ENDOCRINOLOGY

More information

Reliability of Ordination Analyses

Reliability of Ordination Analyses Reliability of Ordination Analyses Objectives: Discuss Reliability Define Consistency and Accuracy Discuss Validation Methods Opening Thoughts Inference Space: What is it? Inference space can be defined

More information

Comparative efficacy or effectiveness studies frequently

Comparative efficacy or effectiveness studies frequently Economics, Education, and Policy Section Editor: Franklin Dexter STATISTICAL GRAND ROUNDS Joint Hypothesis Testing and Gatekeeping Procedures for Studies with Multiple Endpoints Edward J. Mascha, PhD,*

More information

Readings: Textbook readings: OpenStax - Chapters 1 11 Online readings: Appendix D, E & F Plous Chapters 10, 11, 12 and 14

Readings: Textbook readings: OpenStax - Chapters 1 11 Online readings: Appendix D, E & F Plous Chapters 10, 11, 12 and 14 Readings: Textbook readings: OpenStax - Chapters 1 11 Online readings: Appendix D, E & F Plous Chapters 10, 11, 12 and 14 Still important ideas Contrast the measurement of observable actions (and/or characteristics)

More information

PSYCHOLOGY 300B (A01) One-sample t test. n = d = ρ 1 ρ 0 δ = d (n 1) d

PSYCHOLOGY 300B (A01) One-sample t test. n = d = ρ 1 ρ 0 δ = d (n 1) d PSYCHOLOGY 300B (A01) Assignment 3 January 4, 019 σ M = σ N z = M µ σ M d = M 1 M s p d = µ 1 µ 0 σ M = µ +σ M (z) Independent-samples t test One-sample t test n = δ δ = d n d d = µ 1 µ σ δ = d n n = δ

More information

Analysis of Environmental Data Conceptual Foundations: En viro n m e n tal Data

Analysis of Environmental Data Conceptual Foundations: En viro n m e n tal Data Analysis of Environmental Data Conceptual Foundations: En viro n m e n tal Data 1. Purpose of data collection...................................................... 2 2. Samples and populations.......................................................

More information

BASIC CONCEPTS IN RESEARCH AND DATA ANALYSIS

BASIC CONCEPTS IN RESEARCH AND DATA ANALYSIS 1 Chapter 1 BASIC CONCEPTS IN RESEARCH AND DATA ANALYSIS Introduction: A Common Language for Researchers... 2 Steps to Follow when Conducting Research... 2 The research question... 3 The hypothesis...

More information

INVESTIGATING FIT WITH THE RASCH MODEL. Benjamin Wright and Ronald Mead (1979?) Most disturbances in the measurement process can be considered a form

INVESTIGATING FIT WITH THE RASCH MODEL. Benjamin Wright and Ronald Mead (1979?) Most disturbances in the measurement process can be considered a form INVESTIGATING FIT WITH THE RASCH MODEL Benjamin Wright and Ronald Mead (1979?) Most disturbances in the measurement process can be considered a form of multidimensionality. The settings in which measurement

More information

Goodness of Pattern and Pattern Uncertainty 1

Goodness of Pattern and Pattern Uncertainty 1 J'OURNAL OF VERBAL LEARNING AND VERBAL BEHAVIOR 2, 446-452 (1963) Goodness of Pattern and Pattern Uncertainty 1 A visual configuration, or pattern, has qualities over and above those which can be specified

More information

Designs. February 17, 2010 Pedro Wolf

Designs. February 17, 2010 Pedro Wolf Designs February 17, 2010 Pedro Wolf Today Sampling Correlational Design Experimental Designs Quasi-experimental Design Mixed Designs Multifactioral Design Sampling Overview Sample- A subset of a population

More information

METHODS FOR DETECTING CERVICAL CANCER

METHODS FOR DETECTING CERVICAL CANCER Chapter III METHODS FOR DETECTING CERVICAL CANCER 3.1 INTRODUCTION The successful detection of cervical cancer in a variety of tissues has been reported by many researchers and baseline figures for the

More information

A Comparison of Three Measures of the Association Between a Feature and a Concept

A Comparison of Three Measures of the Association Between a Feature and a Concept A Comparison of Three Measures of the Association Between a Feature and a Concept Matthew D. Zeigenfuse (mzeigenf@msu.edu) Department of Psychology, Michigan State University East Lansing, MI 48823 USA

More information

Business Statistics Probability

Business Statistics Probability Business Statistics The following was provided by Dr. Suzanne Delaney, and is a comprehensive review of Business Statistics. The workshop instructor will provide relevant examples during the Skills Assessment

More information

Appendix III Individual-level analysis

Appendix III Individual-level analysis Appendix III Individual-level analysis Our user-friendly experimental interface makes it possible to present each subject with many choices in the course of a single experiment, yielding a rich individual-level

More information

A Brief (very brief) Overview of Biostatistics. Jody Kreiman, PhD Bureau of Glottal Affairs

A Brief (very brief) Overview of Biostatistics. Jody Kreiman, PhD Bureau of Glottal Affairs A Brief (very brief) Overview of Biostatistics Jody Kreiman, PhD Bureau of Glottal Affairs What We ll Cover Fundamentals of measurement Parametric versus nonparametric tests Descriptive versus inferential

More information

CHAPTER NINE DATA ANALYSIS / EVALUATING QUALITY (VALIDITY) OF BETWEEN GROUP EXPERIMENTS

CHAPTER NINE DATA ANALYSIS / EVALUATING QUALITY (VALIDITY) OF BETWEEN GROUP EXPERIMENTS CHAPTER NINE DATA ANALYSIS / EVALUATING QUALITY (VALIDITY) OF BETWEEN GROUP EXPERIMENTS Chapter Objectives: Understand Null Hypothesis Significance Testing (NHST) Understand statistical significance and

More information

A likelihood ratio test for mixture effects

A likelihood ratio test for mixture effects Journal Behavior Research Methods 26,?? 38 (?), (1),???-??? 92-16 A likelihood ratio test for mixture effects JEFF MILLER University of Otago, Dunedin, New Zealand Under certain circumstances, it is theoretically

More information

Glossary From Running Randomized Evaluations: A Practical Guide, by Rachel Glennerster and Kudzai Takavarasha

Glossary From Running Randomized Evaluations: A Practical Guide, by Rachel Glennerster and Kudzai Takavarasha Glossary From Running Randomized Evaluations: A Practical Guide, by Rachel Glennerster and Kudzai Takavarasha attrition: When data are missing because we are unable to measure the outcomes of some of the

More information

Are Retrievals from Long-Term Memory Interruptible?

Are Retrievals from Long-Term Memory Interruptible? Are Retrievals from Long-Term Memory Interruptible? Michael D. Byrne byrne@acm.org Department of Psychology Rice University Houston, TX 77251 Abstract Many simple performance parameters about human memory

More information

Bayesian Analysis by Simulation

Bayesian Analysis by Simulation 408 Resampling: The New Statistics CHAPTER 25 Bayesian Analysis by Simulation Simple Decision Problems Fundamental Problems In Statistical Practice Problems Based On Normal And Other Distributions Conclusion

More information

Hierarchy of Statistical Goals

Hierarchy of Statistical Goals Hierarchy of Statistical Goals Ideal goal of scientific study: Deterministic results Determine the exact value of a ment or population parameter Prediction: What will the value of a future observation

More information

USE AND MISUSE OF MIXED MODEL ANALYSIS VARIANCE IN ECOLOGICAL STUDIES1

USE AND MISUSE OF MIXED MODEL ANALYSIS VARIANCE IN ECOLOGICAL STUDIES1 Ecology, 75(3), 1994, pp. 717-722 c) 1994 by the Ecological Society of America USE AND MISUSE OF MIXED MODEL ANALYSIS VARIANCE IN ECOLOGICAL STUDIES1 OF CYNTHIA C. BENNINGTON Department of Biology, West

More information