Predictive Bias Correction for Sequential Quantitative Visual Assessments


Aletta Nonyane
Department of Primary Care and General Practice, University of Birmingham, Edgbaston, Birmingham, England, B15 2TT, U.K.

Chris Theobald
School of Mathematics, University of Edinburgh and Biomathematics and Statistics Scotland, The King's Buildings, Edinburgh, Scotland EH9 3JZ, U.K.
cmt@bioss.ac.uk

and

Alec Mann
Biomathematics and Statistics Scotland, The King's Buildings, Edinburgh, Scotland EH9 3JZ, U.K.
alec@bioss.ac.uk

January 6, 2005

Abstract

Bayesian predictive calibration is used here to correct visual assessment scores which are prone to subjective bias. It is applied to correct for bias in a long sequence of responses which are subject to carry-over and order effects, as well as autocorrelation. The method is illustrated in the context of experiments in which the scoring of crop leaves for disease severity is simulated. This correction is done under the assumption of a parametric regression model which relates the scores to the true values and some covariates. Assessors' precision in the scoring tasks is compared using a criterion based on the Shannon information contained in the predictive distribution of future scores.

Key Words: Bayesian inference; Predictive distribution; Visual assessment.

1 Introduction

Despite advances in automated image analysis, there are many quantitative visual perception tasks in agricultural and medical research for which trained human assessors are required. Examples of this are seen in plant breeding, where the susceptibility of large numbers of rival varieties to diseases such as mildew and leaf rust is assessed by recording the percentage of leaf area covered by lesions, and the stems of winter cereals may be assessed for damage from eyespot. Similar assessments are required for epidemiological and fungicide studies of plants, for estimating the proportions of quadrants covered by foliage, for measuring needle loss from evergreen trees caused by acid rainfall, and for scoring carcass fat depth in breeding experiments on meat animals. In other contexts, assessments may be required of the numbers of dots appearing in a defined area, for example plant bugs such as aphids on a leaf, and microorganisms on a plate.

Human judgements are always affected by the context within which they are made. Visual assessments carried out by humans are subject to bias from various sources: these include the order of the stimulus in a sequence and carry-over, the tendency for the judgement of one stimulus to be influenced by the previous one. For example, Parker et al. (1995) found that visual estimates of wheat disease severity were substantially biased and varied considerably over short time-scales: visual assessment errors were large enough to alter the conclusions of a comparison of fungicide treatments. In the context of measuring fat in the carcasses of meat animals, Fisher (1990) argued that differences between operators and between periods of assessment are causes for concern even when photographic standards are used. Morris and Rule (1988) found that for estimation of line length and numerosity the average score declined over a sequence of images, and Sawyer and Wesenstein (1994) and Ferris et al. (2001) reported evidence of carry-over in judgements of numerosity and percentage cover.

The effect of carry-over is said to be assimilative when the current score is biased towards the previous stimulus and contrastive when the score is biased away from the previous stimulus (Cross (1973)): the direction of any bias could, though, depend on the magnitude of the current stimulus. DeCarlo and Cross (1990) discussed different regression models for sequential effects when assessors are asked to give their estimates of stimulus intensity. Most recently, Ferris et al. (2001) discussed the carry-over effects seen in a visual assessment experiment similar to the one being studied here. They proposed modelling carry-over as a logistic function of the difference in stimulus intensity.

Automated systems for quantitative image measurement should not suffer from order and carry-over effects, but may not be practicable as alternatives to human judgement. In the context of crop disease assessment, where rapid measurement is required, objective assessment methods may be time-consuming, expensive or impractical for field use, and may not permit repeated inspection of the same plants (Newton and Hackett (1994)).
Where there is no adequate alternative to visual assessment, two complementary approaches to reducing the bias of human assessors may be pursued.

(a) Assessors may be trained to improve their performance. For example, the computer-based training program DISTRAIN has been developed by Tomerlin and Howell (1988) to train assessors estimating disease severity on cereals: leaf images showing random amounts of damage are displayed on a computer monitor and feedback is provided about the true percentage cover after each response is entered.

(b) If individual biases can be treated as fairly consistent within the period in which assessments are required and if a sufficiently simple model can be fitted to them, biases can be largely removed by testing the assessors in circumstances which closely mimic the field situation: responses to sequences of images with known stimulus values can be used to estimate biases and to correct future responses by calibration. It may be advantageous to incorporate carry-over effects into the calibration, possibly together with autocorrelation in the responses.

We concentrate in this paper on the second method, assuming that a parametric regression model which includes order and carry-over effects can be found to relate each assessor's responses to the true stimulus values. The assessors' performance is likely to change as they become more experienced, so calibration would need to be repeated over time.

In Section 2 of this paper we describe experiments carried out for two purposes:

1. to investigate the biases, including order and carry-over effects, of human assessors in tasks which mimic subjective scoring of crop disease severity;

2. to illustrate the adjustment of assessors' responses in a future task of the same type using Bayesian prediction.

The design used, as well as the data obtained and the model fitted, are also discussed in this section. Section 3 considers calibration for the two tasks. Section 4 discusses the evaluation of the assessors' precision, based on their performance in the calibration experiment. This precision is measured by the amount of Shannon information contained in the predictive distribution of future scores. The overall discussion is given in Section 5.

2 The Visual Assessment Experiments

Twenty-five undergraduate students aged between 18 and 24 and following statistics modules took part in experiments intended to resemble the visual scoring of disease severity. They were unpaid, but prizes were offered for the three most accurate sets of responses. The experiments were developed from those of Ferris et al. (2001), and comprised two tasks, which we refer to below as the cover and count tasks. Examples of images used are given in Figures 1 and 2.

In the cover task, images comprising possibly overlapping black circular disks of different sizes within a white square were displayed on a computer monitor, and subjects were required to give estimates of the percentages of the square which were covered by the disks in a sequence of images. Each image was displayed for 6 seconds above a slider for recording the assessor's responses. This was graduated from 0 to 100 and controlled by a computer mouse. Before recording began, 9 images were displayed for 6 seconds each with stimuli in the same range as those used for the task itself. Subjects were allowed to enter a response during these intervals, after which the true percentage cover was shown until the subject pressed the return key to view the next one. No feedback was given once recording had begun.

The count task was similar, except that the images comprised disjoint black circular disks of the same size, and estimates were required of the numbers of disks. The slider was graduated from 0 to 250, but to avoid the impression of an upper limit it was extended to the edge of the screen, at about 300. The order of the two tasks was chosen at random for each assessor, and they were allowed an unlimited break between them: the average break lasted 3 minutes.

The images in the cover task were generated before the experiment was performed by choosing points at random in a square of pixels to be centres of circular disks with radii taken from a uniform distribution on the interval (5, 13). For the count task, the disks had common radius 10 pixels, and points were rejected if disks centred on them would overlap a side of the square or a previously generated disk.

2.1 Design of the image sequences

The sequence of images presented to each individual may be regarded as forming a changeover design with an unusually large number of periods. Since our interest was to investigate the effects of the current and previous images and the effect of order, the magnitudes of the stimuli in these sequences were chosen to be approximately balanced for these effects. One way to achieve approximate balance would be to select a range of values for the amounts of cover or the counts, choose a probability distribution over this range, and sample randomly from this distribution for successive images: this method was used, for example, by DeCarlo and Cross (1990) for magnitude estimation of sound stimuli. However, to achieve more exact balance we chose to define nominal stimulus levels corresponding to disjoint intervals within the selected ranges.

With $m$ distinct levels, we seek to define a sequentially balanced sequence of $m^2 + 1$ levels comprising a single, leading level followed by $m$ blocks, each block containing a permutation of the $m$ levels. Sequential balance means that all of the $m^2$ ordered pairs of levels occur once each. The following is an example of such a sequence with 7 nominal levels denoted by the letters A to G and blocks separated by spaces.

A  ABCDEFG  GADCFBE  EGFDBAC  CGBFAED  DFEAGCB  BDGECAF  FCEBGDA

Notice that the $m$ self-adjacencies must occur at the beginning of the sequence and at the boundaries of the blocks, and that the use of the blocks of permutations provides approximate balance between the order and both the current and the previous stimulus effects. The response to the first image may be ignored in the analysis since it includes no carry-over effect.

This type of design was proposed by Finney and Outhwaite (1956) for bioassay studies with several treatments. They observed that no such sequences are available for $m$ equal to 3, 4 and 5. Sampford (1957) gave examples with $m$ from 6 to 11, but provided no theorems on existence and no general methods of construction. For $m$ equal to 6 and 7, we have systematically enumerated 324 and 175,588 sequences respectively with the first block written in natural order: our methods and results will be reported elsewhere. In principle, one might choose a sequence at random from all those available: we merely selected at random from the four then known to us with 7 letters.

To generate a sequence of images from a given sequence of levels, the $m$ nominal levels were allocated at random to $m$ intervals for the appropriate type of stimulus, and images were chosen with stimulus values in the corresponding intervals: no images were presented to a subject more than once. For our experiment, $m$ was taken to be 7, giving a sequence of 50 images. In the cover task, the majority of images were shown with percentage cover less than 50 to reflect the values likely to be of interest in crop disease assessment: the intervals used were 5 ± 0.5, 10 ± 1.0, 17 ± 1.7, 26 ± 2.6, 37 ± 3.7, 50 ± 5.0, 65 ± 6.5. The intervals used for count similarly included values within one tenth of 7 central values, these values being 17, 27, 40, 60, 90, 135, 200. The smallest central value was taken as 17 since lower numbers of disks might be counted individually rather than estimated.

For each task two sequences of 50 images each were displayed with no break between them. For convenience, the leading image in the second sequence was included, although this made the overall sequence slightly unbalanced. Thus each combination of assessor and task resulted in 100 responses, including any missing ones. The first response is ignored, and those analysed are numbered from 1 to 99.

2.2 Data from the two tasks

The top-left plot in Figure 3 shows the responses of all the assessors in the cover task against the true amounts of cover, along with the line of equality. This plot suggests that there is some upward bias at low amounts of cover, and shows a few outlying observations. As might be expected with proportions, there is an initial increase in variability with the true percentage cover. The bottom-left plot in this figure shows responses from these subjects in the count task. They show both variability and downward bias increasing with true count.

Logarithmic transformations are commonly applied to magnitude-scaling stimuli and responses (DeCarlo and Cross (1990)), sometimes in the hope that response is roughly proportional to a power of stimulus intensity. This transformation is not appropriate for percentage cover because we can expect the responses to become more accurate when the true cover approaches 100%: a logit transformation of both axes seems more natural.
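The sequential-balance property claimed for the 7-level example sequence in Section 2.1 can be checked mechanically. The following Python sketch is an illustration only and is not taken from the study; the function names are hypothetical. It tests that each block is a permutation of the levels and that every ordered pair of levels occurs exactly once among adjacent positions.

```python
from itertools import product

# Example sequence from Section 2.1: a leading level followed by 7 blocks.
SEQUENCE = "A ABCDEFG GADCFBE EGFDBAC CGBFAED DFEAGCB BDGECAF FCEBGDA"


def is_sequentially_balanced(sequence: str) -> bool:
    """True if every ordered pair of levels occurs exactly once among the
    adjacent pairs of the sequence (spaces between blocks are ignored)."""
    levels = sequence.replace(" ", "")
    alphabet = sorted(set(levels))
    pairs = list(zip(levels, levels[1:]))
    # m^2 + 1 items give m^2 adjacent pairs; they must all be distinct
    # and together cover every ordered pair of levels.
    return (len(pairs) == len(alphabet) ** 2
            and set(pairs) == set(product(alphabet, repeat=2)))


def blocks_are_permutations(sequence: str) -> bool:
    """True if the sequence is one leading level followed by blocks that
    are each a permutation of all the levels."""
    leading, *blocks = sequence.split()
    alphabet = sorted(set(sequence.replace(" ", "")))
    return len(leading) == 1 and all(sorted(b) == alphabet for b in blocks)


if __name__ == "__main__":
    print(is_sequentially_balanced(SEQUENCE), blocks_are_permutations(SEQUENCE))
```

The same check applies to any candidate sequence, which is convenient when enumerating or randomly selecting designs.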

The top-right plot in Figure 3 shows the same data after this transformation: it seems to have been reasonably successful in stabilizing the variance and achieving linearity, so we analyse the cover data on the logit scale. The log-transformed data from the count task in the bottom-right plot show roughly constant variance, so we use this scale for analysing the counts.

In the context of examining leaves for disease, we might expect to find some with no lesions, corresponding to a logit response of $-\infty$. Any linear adjustment on this scale would leave this value unchanged, so that assessors are assumed to identify undamaged leaves correctly, and a correction would be applied only for leaves given positive responses.

To assess the validity of the linearity and equal-variance assumptions for each task, residuals were examined from the regression of transformed response on sequence order and transformed current and previous stimuli for each assessor. For the cover task, the residuals tended to be positive when the logit of the current stimulus was positive, that is when cover exceeded 50%. For the count task, the residuals for several assessors showed slight curvature when plotted against current stimulus. For both tasks, the effect of order appeared to be slightly non-linear for a few assessors, but there was no dependence on previous stimulus: the occasional gross error occurred, but there was no consistent non-normality.

From now on, we take stimulus and response to refer to the data after any transformation intended to induce linearity and homoscedasticity in the relationship.
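The two transformations can be written down directly. The sketch below is illustrative: the function names are hypothetical, and the use of natural logarithms is an assumption, since the base is not stated in the text.

```python
import math


def logit_percent(p: float) -> float:
    """Logit of a percentage cover p, defined for 0 < p < 100."""
    if not 0.0 < p < 100.0:
        raise ValueError("the logit is finite only for 0 < p < 100")
    return math.log(p / (100.0 - p))


def inverse_logit_percent(z: float) -> float:
    """Map a value on the logit scale back to a percentage cover."""
    return 100.0 / (1.0 + math.exp(-z))


def log_count(n: float) -> float:
    """Natural logarithm of a positive count (base assumed, see lead-in)."""
    if n <= 0:
        raise ValueError("the count must be positive")
    return math.log(n)
```

A cover of exactly 50% maps to zero on the logit scale, so a positive logit stimulus corresponds to cover above 50%, as used in the residual analysis above.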

2.3 Statistical model

We consider predictive calibration under a homoscedastic normal linear regression model which includes carry-over, order effects and autocorrelation, allowing the expectation $\mu_t$ of the $t$th response, $y_t$ say, to be a linear function of the current stimulus $x_t$, the previous stimulus $x_{t-1}$ and $t$ itself. The last of these effects might arise from fatigue or from learning. The expected responses for a particular subject may therefore be expressed as

$$\mu_t = E(y_t \mid x_t, x_{t-1}, t, \beta) = \beta_0 + \beta_1 (x_t - \bar{x}) + \beta_2 (x_{t-1} - \bar{x}) + \beta_3 (t - \bar{t}), \qquad (t = 1, 2, \ldots, n) \tag{1}$$

where $\beta = [\beta_0 \; \beta_1 \; \beta_2 \; \beta_3]^T$ is a vector of unknown regression coefficients and $t$ indexes the $n$ responses following the initial one; $\bar{x}$ is the mean stimulus level and $\bar{t}$ is $(n+1)/2$. The intercept is assumed to be the overall expected level of transformed cover or count, and to minimise its correlation with other coefficients, the predictors are centred about their means. Lastly, the responses are assumed to follow an AR(1) process defined by

$$y_t - \mu_t = \rho\,(y_{t-1} - \mu_{t-1}) + u_t \qquad (t = 2, 3, \ldots, n), \tag{2}$$

where $\rho$ is the autocorrelation parameter and the $u_t$ are taken to be independent and $N(0, \sigma^2)$. Using the Markov property of this process, the joint distribution of $y_1, \ldots, y_n$ is expressible as the product of the marginal distribution of $y_1$, $N(\mu_1, \sigma^2/(1 - \rho^2))$, and the conditional distributions $N(\mu_t + \rho(y_{t-1} - \mu_{t-1}), \sigma^2)$ of the $y_t$ given the previous responses $(t = 2, 3, \ldots, n)$. This model will be henceforth referred to as Model 1.

Fitting the above model to the data from the 25 assessors showed all parameters to be significant for both tasks (that is, $p < 0.001$), except for the carry-over effect in the cover task, which was nonsignificant for all assessors. The correlation between $\beta_0$ and other coefficients has been reduced by centring the independent variables about their respective means. This then makes it possible to assume that these coefficients are independent when specifying their priors for the Bayesian predictive calibration.

3 Predictive Calibration with Carry-over and Covariates

Aitchison and Dunsmore (1975) and Aitchison (1977) make a persuasive case that statistical calibration should be carried out within a Bayesian predictive framework. Aitchison and Dunsmore (1975) consider calibration only for a single future response, but the inclusion of carry-over effects and autocorrelation in our regression of responses on stimuli requires that calibration be performed simultaneously for a sequence of future responses. We therefore generalize the predictive calibration procedure of Aitchison and Dunsmore (1975) to allow for carry-over, for covariates and also for correlated responses, as in (1) and (2), and illustrate the use of Markov chain Monte Carlo to apply the more general calibration procedure to sequences of human visual assessments.

In the following, probability density functions are denoted by $p$, and the argument of $p$ indicates the random variable being considered. The predictive method assumes that we have data from a calibration experiment on an assessor in which a response is recorded at selected values of a stimulus. Let $x$ and $y$ denote the vectors of stimuli and responses for the assessor. We also want to allow the distribution of $y$ to depend on the sequence order of the stimulus, so let $t$ denote a vector defining this order. The regression model specifies the probability density function $p(y \mid x, t, \theta)$ of $y$ given $x$, $t$ and a vector $\theta$ of unknown parameters. Given a prior density $p(\theta)$ for $\theta$, the posterior density $p(\theta \mid x, t, y)$ of $\theta$ is proportional to $p(\theta)\, p(y \mid x, t, \theta)$. We then assume vectors $y^*$ and $t^*$ to be recorded for the same assessor in order to estimate the unknown vector of stimuli $x^*$. The vectors $t$ and $t^*$ could be generalized to include any other covariates whose values are to be specified in the calibration experiment and in the future. The predictive density function of $y^*$ is then given by

$$p(y^* \mid x, t, y, x^*, t^*) = \int p(y^* \mid x^*, t^*, \theta)\, p(\theta \mid x, t, y)\, d\theta. \tag{3}$$

Aitchison and Dunsmore (1975) distinguish designed and natural calibration experiments. In the former, $x$ is chosen to cover whatever range of stimuli is of interest; in the latter, $x$ is assumed to be drawn from the same population as $x^*$, so that $x$ provides information on the prior distribution of $x^*$. Ours is a designed calibration experiment, so it is necessary to specify a prior density $p(x^*)$ for $x^*$ which is not dependent on $x$ or $\theta$. In the context of assessing disease lesions on leaves, this prior might be influenced by the amounts of such disease recorded in previous seasons and an overall impression of the extent of disease in the current season. For fairness, it should not depend on the crop variety.

With a designed experiment, inferences about $x^*$ are based on the calibrative density $p(x^* \mid x, t, y, t^*, y^*)$, which may be derived as follows from the joint density of $x^*$ and $\theta$ given $x$, $t$, $y$, $t^*$ and $y^*$. Assuming that $x^*$ and $\theta$ are independent a priori, and that $y$ and $y^*$ are independent given $x$, $t$, $x^*$, $t^*$ and $\theta$, we have

$$\begin{aligned}
p(x^*, \theta \mid x, t, y, t^*, y^*) &\propto p(x^*, \theta)\, p(y, x, y^* \mid t, t^*, x^*, \theta) \\
&\propto p(x^*)\, p(\theta)\, p(y \mid x, t, \theta)\, p(y^* \mid x^*, t^*, \theta) \\
&\propto p(x^*)\, p(y^* \mid x^*, t^*, \theta)\, p(\theta \mid x, t, y),
\end{aligned}$$

where the proportionality is in $x^*$ and $\theta$. Integrating with respect to $\theta$ and using (3) gives

$$p(x^* \mid x, t, y, t^*, y^*) \propto p(x^*)\, p(y^* \mid x, t, y, x^*, t^*). \tag{4}$$
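To make (4) concrete, consider a deliberately simplified special case: a single future response, no carry-over, order effect or autocorrelation, and regression parameters treated as known rather than integrated over their posterior. These are assumptions made here for illustration and are not the setting of the paper. With a Normal prior for $x^*$, the calibrative density is then Normal with closed-form moments. The function name `calibrate_scalar` and its arguments are hypothetical.

```python
def calibrate_scalar(y_star, b0, b1, x_bar, sigma2, prior_mean, prior_var):
    """Posterior mean and variance of a scalar future stimulus x*, assuming
    y* | x* ~ N(b0 + b1 * (x* - x_bar), sigma2) with known b0, b1, sigma2
    and prior x* ~ N(prior_mean, prior_var). Requires b1 != 0."""
    # Value of x* that reproduces y* exactly on the regression line
    # (classical inverse estimate).
    x_inverse = x_bar + (y_star - b0) / b1
    # As a function of x*, the likelihood is N(x_inverse, sigma2 / b1^2),
    # so the precisions of prior and likelihood add.
    likelihood_precision = b1 ** 2 / sigma2
    precision = 1.0 / prior_var + likelihood_precision
    mean = (prior_mean / prior_var + likelihood_precision * x_inverse) / precision
    return mean, 1.0 / precision


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    print(calibrate_scalar(2.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0))
```

In this special case the calibrated value is a precision-weighted compromise between the prior mean and the inverse estimate, so a tighter prior pulls calibrated values more strongly towards the prior mean.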

Expressions (3) and (4) represent a generalization of the calibrative method for designed experiments described on page 189 of Aitchison and Dunsmore (1975) to include a covariate and vector (rather than scalar) future stimuli and responses. Although this extension is straightforward in theory, iterative methods are likely to be required in practice for calculating the calibrative distribution of $x^*$ except in special cases. We follow the modern practice of representing our model by a directed acyclic graph whose nodes correspond to the data and the model parameters, as discussed in Gilks et al. (1996).

3.1 Implementation

The WinBUGS program (Spiegelhalter et al. (2003)) can be used to fit the autoregressive model above to the data provided by each assessor, and thus approximate the posterior distribution of $\beta_0$, $\beta_1$, $\beta_2$, $\beta_3$, $\sigma$ and $\rho$ for that individual. The program, which is freely available, allows models to be specified graphically. The user defines the conditional distribution for each stochastic node given the values of its parents, and the program selects and implements an appropriate Markov chain Monte Carlo method. The output includes convergence diagnostics, and summary statistics and kernel density estimates of the posterior probability density functions for any node.

The graphical representation of the model can be extended to include nodes for future response vectors $y^*$ recorded on the same individual and for the corresponding stimuli $x^*$. Unknown future stimuli are treated as parameters, and the calibrative distribution of any element of $x^*$ is its marginal posterior distribution.
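The conditional distributions that have to be supplied for the response nodes are those of Model 1 in Section 2.3. As an illustration of that factorisation, and not a reproduction of the WinBUGS specification used in the study, the following Python sketch evaluates the Model 1 log-likelihood for a complete sequence of responses. The function names are hypothetical, and missing responses are not handled.

```python
import math


def normal_logpdf(y, mean, var):
    """Log density of N(mean, var) evaluated at y."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (y - mean) ** 2 / var)


def model1_means(x, beta, x_bar):
    """Expected responses from equation (1).
    x[0] is the leading stimulus and x[1..n] are the stimuli for the n
    analysed responses; beta = (b0, b1, b2, b3)."""
    n = len(x) - 1
    t_bar = (n + 1) / 2.0
    b0, b1, b2, b3 = beta
    return [b0 + b1 * (x[t] - x_bar) + b2 * (x[t - 1] - x_bar) + b3 * (t - t_bar)
            for t in range(1, n + 1)]


def model1_loglik(y, x, beta, rho, sigma2, x_bar):
    """Log-likelihood of responses y[0..n-1] (responses 1..n) under Model 1,
    using the stationary marginal for the first response and the AR(1)
    conditionals for the rest. Requires |rho| < 1."""
    mu = model1_means(x, beta, x_bar)
    loglik = normal_logpdf(y[0], mu[0], sigma2 / (1.0 - rho ** 2))
    for t in range(1, len(y)):
        cond_mean = mu[t] + rho * (y[t - 1] - mu[t - 1])
        loglik += normal_logpdf(y[t], cond_mean, sigma2)
    return loglik
```

Combined with priors on the parameters, a function of this kind is all that a general-purpose sampler would need in order to approximate the posterior distribution outside WinBUGS.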

3.2 Predictive calibration for the two visual tasks

To illustrate predictive calibration for correcting the bias in an individual's assessments, we consider data recorded by one assessor who completed the two tasks on two occasions one week apart. The values of the stimuli are known for both occasions, but those given on the second occasion are treated as unknown for the predictive analysis, so that we can examine the nature and magnitude of any improvement arising from the calibration.

The area of application influences the choice of the prior distribution for $x^*$ and the range of stimuli for a designed calibration experiment. Prior distributions for severity of plant disease may be determined by the plant pathologist's observation of disease severity during the current and previous seasons. Here the Normal prior for $x^*$ was based on the calibration experiment, with mean equal to that of $x$ in the experiment, but with twice the standard deviation to allow for the possibility of higher variability in future scores.

Normal prior distributions were assumed for the coefficients $\beta_0$, $\beta_1$, $\beta_2$ and $\beta_3$. Two possibilities for the values of their prior expectations were considered. First, it may be assumed a priori, for Model 1, that the assessors' expected bias, carry-over effects and order effects are zero. The prior expectations for $\beta_0$, $\beta_1$, $\beta_2$ and $\beta_3$ would then be $\bar{x}$, 1, 0 and 0, respectively. This will be referred to as Prior 1. Another option would be to base prior expectations and variances of the coefficients on their estimates after fitting Model 1 to the data from the experiment described above. These priors are shown in Table 1 and will be referred to as Prior 2. In both cases, the correlation parameter, $\rho$, has a beta prior, centred around 0 because very little autocorrelation was found when fitting Model 1. It is assumed that, as in the design of the experiment, $x^*_0$ comes from the same stimulus interval as $x^*_1$.

The prior distribution for $\sigma^2$ was taken to be inverse gamma, and specified by assuming that $d s^2/\sigma^2$ has a $\chi^2(d)$ distribution, where $s^2$ is a prior estimate of $\sigma^2$ and the degrees of freedom $d$ reflect the precision of the estimate. The values of these are taken from the analysis of data from the above experiment.

The calibration method is illustrated here for 5 assessors who did not take part in the experiment above, but carried out each task on two occasions for the purpose of calibration. The first two plots in the upper row of Figure 4 show the logits of the true and recorded cover for one of the assessors (labelled C later) on the first and second occasions; the third plot shows the posterior expected logit cover against logit true cover on the second occasion. The lower row of Figure 4 shows the corresponding plots for the count task. Note that posterior expected stimuli may be calculated for missing responses, but that these are omitted from the plots.

One measure of effectiveness of the calibration method for any assessor is the ratio of the mean square errors of their recorded and calibrated responses. These mean square errors are given in Tables 2 and 3; the values under the heading Model 2 are defined later. Moreover, if calibration is effective, we expect to see closer agreement with the line of equality in the third plot in each row than in the second. This holds more clearly for the cover task than for the count task. These plots illustrate one drawback of this method, namely its reliance on the assumption of consistent bias between the first and second performances. When this assumption does not hold, as in the count task where the plots show more bias in the second performance than in the first, calibration does not improve the scores of the second performance as much. In a similar way, for the cover task, the first occasion shows both over- and under-estimation and the second one has under-estimation only. Hence, although there is a lot more correction in this task, there is some over-compensation as well.
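The mean-square-error ratio used above as a measure of effectiveness is simple to compute. The sketch below is illustrative: the function names are hypothetical, a missing recorded response is represented by `None`, and restricting both mean square errors to the images with a recorded response is an assumption made here so that the two are compared on the same images.

```python
def mean_square_error(estimates, truths):
    """Mean of squared differences between estimates and true values."""
    if len(estimates) != len(truths) or not estimates:
        raise ValueError("need two non-empty sequences of equal length")
    return sum((e - t) ** 2 for e, t in zip(estimates, truths)) / len(estimates)


def mse_ratio(recorded, calibrated, truths):
    """Ratio of the MSE of recorded responses to that of calibrated responses,
    on the transformed scale. Values above 1 indicate that calibration has
    reduced the error. The calibrated MSE must be non-zero."""
    keep = [i for i, r in enumerate(recorded) if r is not None]
    rec = [recorded[i] for i in keep]
    cal = [calibrated[i] for i in keep]
    tru = [truths[i] for i in keep]
    return mean_square_error(rec, tru) / mean_square_error(cal, tru)


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    print(mse_ratio([1.0, 2.0, None], [0.5, 1.0, 3.0], [0.0, 1.0, 3.0]))
```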

The calibration procedure was tested for robustness to changes in the prior distribution of the future stimulus level $x^*$. It was not robust to this change for either task. In particular, when the prior variance of $x^*$ was halved, the calibrated responses, which are the posterior expected stimuli, were drawn towards the mean, hence introducing bias at the extreme levels of the stimulus.

Another test of the calibration procedure was done by changing the regression model assumed for calibration. The carry-over from the previous stimulus level and the order effects were removed, while the errors were still assumed to be correlated, following an AR(1) model. This resulted in the model with expectation

$$\mu_t = \beta_0 + \beta_1 (x_t - \bar{x}), \tag{5}$$

hence

$$y_t = \mu_t + \varepsilon_t \tag{6}$$

with

$$\varepsilon_t = \rho\, \varepsilon_{t-1} + u_t. \tag{7}$$

This model, referred to as Model 2 henceforth, was fitted using Prior 1 because this prior was seen to perform better than Prior 2. The mean square errors under this model are given in the last columns of Tables 2 and 3.

The result of this change in the model reflects what was shown by the analysis of variance for the two tasks in the experiment. For the count task, analysis of variance showed carry-over from the previous image level to be significant. Thus, calibration under a model without this carry-over term (Model 2) does not improve the responses as well as calibration under the model with the carry-over term. Hence the mean square errors of calibrated responses under Model 2 are higher than those under Model 1. On the other hand, for the cover task, carry-over from the previous image level was not significant, as shown by the analysis of variance results. Thus, calibration under a model without this carry-over term (Model 2) improves the responses (for 4 out of 5 assessors) more than calibration under the model with the carry-over term. Hence the mean square errors of calibrated responses under Model 2 are lower than those under Model 1.

4 Assessing the Assessors

A researcher who has to address a quantitative visual assessment task of the sort referred to in the Introduction might recruit some potential assessors, instruct them using examples of the type of assessment required, test the accuracy of each candidate on a set of objects or images with carefully measured stimulus values and then offer employment to the candidates whose responses appeared to be most precise.

In a frequentist comparison of the accuracy of the assessors' results, one might compare the mean square errors of their responses relative to the stimulus values: if linear bias correction is allowed, the accuracy of assessors might then be compared using their coefficients of determination, $R^2$.

For a Bayesian analysis, the quality of an assessor could be measured with reference to some notional future sequence of stimuli $x^*$. As in Spezzaferri (1985), we consider the expected gain in Shannon information about $x^*$ arising from the calibration experiment. This expected gain compares the probability density of $x^*$ conditional on $x$, $y$ and $y^*$ with the unconditional (prior) density: it depends on the number of stimuli, and is given by

$$\iint p(y^* \mid x, y)\, p(x^* \mid x, y, y^*) \ln\left\{ \frac{p(x^* \mid x, y, y^*)}{p(x^*)} \right\} dx^*\, dy^*. \tag{8}$$

It is shown in the Appendix that for a designed calibration experiment, expression (8) may be approximated by

$$\frac{1}{2}\left[ \ln\det\{V(y^* \mid x, y)\} - \int p(x^*) \ln\det\{V(y^* \mid x, y, x^*)\}\, dx^* \right], \tag{9}$$

where $V$ denotes a variance matrix. For a scalar future response, (9) becomes the expectation, over the prior distribution of $x^*$, of $\tfrac{1}{2} \ln\{V(y^* \mid x, y)/V(y^* \mid x, y, x^*)\}$. Large values indicate that responses could be predicted accurately from future stimulus values.

The numbers of notional future stimuli and responses to be considered might reflect the number of assessments typically made in a sequence. Potentially, the ranking of the assessors depends on this choice, but for simplicity we consider a sequence comprising two stimuli $x^*_0$ and $x^*_1$, say, and the corresponding responses $y^*_0$ and $y^*_1$, the first of which is ignored as before. Again, we take $x^*_0$ and $x^*_1$ to be independent and from a common prior distribution with density $p(x^*)$. The flexibility of defining this prior distribution is what makes the Bayesian approach more desirable than the use of $R^2$, for example, when high precision in a particular interval of the stimulus scale is emphasized.

The evaluation of the second term in (9) using Markov chain Monte Carlo appears to require the generation of responses for every combination of values of $x^*_0$ and $x^*_1$. It is therefore convenient to approximate the prior distribution for $x^*$ by a discrete distribution: we may calculate the $r$ quantiles of $p(x^*)$ for some positive integer $r$ and give equal probability to each of them. Then values of $y^*_1$ may be generated from the posterior predictive distribution for each of the $r^2$ combinations of $(x^*_0, x^*_1)$, $V(y^*_1 \mid x, y, x^*_0, x^*_1)$ may be estimated from the set of generated values for each combination, and $\ln V(y^*_1 \mid x, y, x^*_0, x^*_1)$ averaged over these combinations. The variance $V(y^* \mid x, y)$ in the first term of (9) may be estimated from the combined set of values of $y^*_1$.

The ranking of assessors according to the information criterion in (9) is illustrated here for the count task carried out by the first 25 assessors. The plots of the individual transformed data sets are given in Figure 6 so that the information criterion can be compared with the level of agreement between responses and true counts. As mentioned earlier, $R^2$ provides a frequentist measure of assessor performance: this is plotted against the values of the information criterion (9) in Figure 5, showing good agreement in the ranking of assessors. Both criteria, in turn, agree with the plots of observed individual data in Figure 6. As an example, the responses of assessor 19 agree well with the true values and this assessor is ranked as the most accurate by the two criteria. Assessors 12 and 13 are the worst performers.

5 Discussion

The use of the design balanced for carry-over effects in the visual assessment experiments allows for effective estimation of carry-over and order effects, and this is discussed further in Nonyane (2004). Bayesian predictive calibration was successfully generalised to calibration for a vector of responses with carry-over and order effects, as well as autoregressive errors, and implementation of this was made possible by the availability of Bayesian MCMC methods. Implementing this method in the Bayesian framework has the advantages that prior information about the parameters can be incorporated, and prediction for missing responses can be made.

The calibration does indeed improve the estimation of the stimuli. It relies, though, on the assumption that bias stays consistent between the two successive experiments. This problem may be reduced by testing assessors repeatedly over time.

The measure of Shannon information in the predictive distribution of future responses, though not so easy to implement, may be used for selecting or ranking assessors. Ranking assessors by $R^2$ appears to give similar results, its only drawback being the inability to incorporate one's prior beliefs about future stimuli.

Acknowledgements

The work of the first author was supported by the Cecil Renaud Charitable and Educational Trust for PhD studies at the University of Edinburgh. We are grateful to the late Rob Kempton for suggesting this study and to Dirk Husmeier for useful discussions.

Appendix A: Approximation to the expected gain in information about a future sequence of stimuli

We wish to approximate the expected gain in Shannon information about a future sequence of stimuli $x^*$ which arises from the calibration experiment, given by

$$\iint p(y^* \mid x, y)\, p(x^* \mid x, y, y^*) \ln\left\{\frac{p(x^* \mid x, y, y^*)}{p(x^*)}\right\} dx^*\, dy^*. \tag{10}$$

Since we have a designed calibration experiment, $p(x^* \mid x, y)$ is equal to $p(x^*)$, so that

$$p(y^* \mid x, y)\, p(x^* \mid x, y, y^*) = p(x^*, y^* \mid x, y) = p(y^* \mid x, y, x^*)\, p(x^*).$$

Thus (10) becomes

$$\int p(x^*) \int p(y^* \mid x, y, x^*) \ln\{p(y^* \mid x, y, x^*)\}\, dy^*\, dx^* - \iint p(x^*, y^* \mid x, y) \ln\{p(y^* \mid x, y)\}\, dx^*\, dy^*$$

or

$$\int p(x^*) \int p(y^* \mid x, y, x^*) \ln\{p(y^* \mid x, y, x^*)\}\, dy^*\, dx^* - \int p(y^* \mid x, y) \ln\{p(y^* \mid x, y)\}\, dy^*. \tag{11}$$

For the case of Normal linear regression with a single stimulus $x^*$ and response $y^*$, and suitable prior distributions, $p(y^* \mid x, y, x^*)$ is a Student density (Aitchison and Dunsmore, 1975). In other cases, (11) may be evaluated for any assessor by approximating the distributions of $y^*$ given $x, y, x^*$ and of $y^*$ given only $x, y$ using Normal distributions with the appropriate first and second moments. Then, writing $q$ for the dimension of $y^*$, the first and second terms in (11), the latter taken with its sign, become

$$-\frac{q}{2}\{\ln(2\pi) + 1\} - \frac{1}{2}\int p(x^*) \ln\det\{V(y^* \mid x, y, x^*)\}\, dx^*$$

and

$$\frac{q}{2}\{\ln(2\pi) + 1\} + \frac{1}{2}\ln\det\{V(y^* \mid x, y)\},$$

where $V$ denotes a variance matrix. Thus (11) is approximately

$$\frac{1}{2}\left[\ln\det\{V(y^* \mid x, y)\} - \int p(x^*) \ln\det\{V(y^* \mid x, y, x^*)\}\, dx^*\right]. \tag{12}$$
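The Normal approximation above, combined with a discrete approximation to the prior for the future stimuli, can be sketched numerically for a scalar future response. The following Python sketch is illustrative only and is not the authors' implementation. The function names, the use of midpoint probabilities to place the support points, and the fixed regression coefficients standing in for posterior predictive sampling are all assumptions made for this example. In an actual analysis, the draws would come from the MCMC output for each stimulus combination.

```python
import itertools
from statistics import NormalDist

import numpy as np


def prior_quantiles(mean, sd, r):
    """Return r equally probable support points for a Normal prior on x*.

    Placing the points at probabilities (i + 0.5) / r is an assumption
    of this sketch, not a choice stated in the paper.
    """
    dist = NormalDist(mean, sd)
    return np.array([dist.inv_cdf((i + 0.5) / r) for i in range(r)])


def information_criterion(draws):
    """Approximate expected information gain for a scalar future response.

    draws has shape (K, n): n predictive draws of the future response for
    each of K equally probable stimulus combinations.
    """
    draws = np.asarray(draws, dtype=float)
    # Variance of the pooled draws, i.e. marginal over the future stimuli.
    marginal_var = draws.var(ddof=1)
    # Variance within each stimulus combination.
    conditional_var = draws.var(axis=1, ddof=1)
    return 0.5 * (np.log(marginal_var) - np.mean(np.log(conditional_var)))


def simulate_draws(beta, sigma, points, n, rng):
    """Hypothetical stand-in for posterior predictive sampling.

    Uses fixed coefficients (intercept, current level, previous level)
    instead of posterior draws, purely for illustration.
    """
    b0, b1, b2 = beta
    combos = list(itertools.product(points, repeat=2))  # r**2 pairs (x0, x1)
    means = np.array([b0 + b1 * x1 + b2 * x0 for x0, x1 in combos])
    return means[:, None] + sigma * rng.standard_normal((len(combos), n))


rng = np.random.default_rng(0)
points = prior_quantiles(mean=4.0, sd=1.2, r=5)  # illustrative prior values
beta = (0.5, 0.9, 0.02)                          # illustrative coefficients
precise = information_criterion(simulate_draws(beta, 0.2, points, 2000, rng))
noisy = information_criterion(simulate_draws(beta, 1.0, points, 2000, rng))
print(precise > noisy)
```

Under these assumptions, the assessor with the smaller residual standard deviation receives the larger criterion value, which matches the interpretation that large values indicate responses that can be predicted accurately from the stimuli.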

References

Aitchison, J. and Dunsmore, I. R. (1975). Statistical Prediction Analysis. Cambridge University Press.
Aitchison, J. (1977). A calibration problem in statistical diagnosis: The system transfer problem. Biometrika, 64:
Cross, D. V. (1973). Sequential dependencies and regression in psychophysical judgements. Perception and Psychophysics, 14:
DeCarlo, L. T. and Cross, D. V. (1990). Sequential effects in magnitude scaling: Models and theory. Journal of Experimental Psychology: General, 119:
Ferris, S. J., Kempton, R. A., Deary, I. J., Austin, E. J., and Shotter, M. V. (2001). Carryover bias in visual assessment. Perception, 30:
Finney, D. J. and Outhwaite, A. D. (1956). Serially balanced sequences in bioassay. Proceedings of the Royal Society B, 145:
Fisher, A. (1990). Reducing Fat in Meat Animals, chapter New approaches to measuring fat in the carcasses of meat animals, pages London: Elsevier.
Gilks, W., Richardson, S., and Spiegelhalter, D. J. (1996). Markov Chain Monte Carlo in Practice. London: Chapman and Hall.
Morris, R. B. and Rule, S. J. (1988). Sequential judgement effects in magnitude estimation. Canadian Journal of Psychology, 42:
Newton, A. C. and Hackett, C. A. (1994). Subjective components of mildew assessment on spring barley. European Journal of Plant Pathology, 100:

Nonyane, B. A. S. (2004). Design and Analysis for Subjective Assessment of Visual and Taste Stimuli. Doctor of philosophy, School of Mathematics, University of Edinburgh.
Parker, S. R., Shaw, M. W., and Royle, D. J. (1995). The reliability of visual estimates of disease severity on cereal leaves. Plant Pathology, 44:
Sampford, M. R. (1957). Methods of construction and analysis of serially balanced sequences. Journal of the Royal Statistical Society B, 19:
Sawyer, T. F. and Wesenstein, N. J. (1994). Anchoring effects on judgment, estimation and discrimination of numerosity. Perceptual and Motor Skills, 78:
Spezzaferri, F. (1985). A note on multivariate calibration experiments. Biometrics, 41:
Spiegelhalter, D. J., Thomas, A., Best, N. G., and Gilks, W. (2003). WinBUGS: Bayesian Inference Using Gibbs Sampling. MRC Biostatistics Unit, Cambridge, 1.4 edition.
Tomerlin, J. R. and Howell, T. A. (1988). DISTRAIN: A computer program for training people to estimate disease severity on cereal leaves. Plant Disease, 72:

Table 1: Prior 2 distributions for the parameters in Model 1

Parameter                   Cover task             Count task
Expected response           β 0 N( 1.008, )        β 0 N(4.062, )
Coef. for current level     β 1 N(0.888, )         β 1 N(0.846, )
Coef. for previous level    β 2 N(0.012, )         β 2 N(0.017, )
Coef. for seq. position     β 3 N(0.002, )         β 3 N( 0.001, )
Standard deviation          σ 2 χ 2 (20)           σ 2 χ 2 (20)
Autocorrelation             1/2 (ρ + 1) Be(5, 5)
Future stimulus             x N( 1.09, 0.74)       x N(4.09, 1.49)

Table 2: Mean square errors for count task

            Observed    Posterior expected response
                        Model 1    Model 1    Model 2
Assessor    Response    Prior 1    Prior 2    Prior 1
A
B
C
D
E

22 28.68
C 81.36 28.92 34.65 27.06
D 21.96 23.84 17.51 20.92
E 83.42 24.59 58.47 23.

Table 3: Mean square errors for cover task

            Observed    Posterior expected response
                        Model 1    Model 1    Model 2
Assessor    Response    Prior 1    Prior 2    Prior 1
A
B
C
D
E

Figure 1: Sample images for count: 27, 60, and 135

Figure 2: Sample images for cover: 10, 26, and 50 percent

Figure 3: Plots of responses versus true values for cover and count tasks, and their corresponding transformations. Panels: Recorded Cover (%) versus True Cover (%); Logit Recorded Cover versus Logit True Cover; Recorded Count versus True Count; Log Recorded Count versus Log True Count.

Figure 4: Plots for assessor C's responses versus stimuli for first and second occasions and posterior expected stimuli; row 1 corresponds to cover task and row 2 corresponds to count task. Row 1 vertical axes: Logit Recorded Cover, Logit Recorded Cover, Posterior Expected Logit Cover; horizontal axes: Logit True Cover. Row 2 vertical axes: Recorded Count, Recorded Count, Posterior Expected Log Count; horizontal axes: Log True Count.

Figure 5: Information criterion versus R-squared

Figure 6: Plots of log responses versus log true count for Experiment
