Reliability Analysis: Its Application in Clinical Practice

Nahathai Wongpakaran, Department of Psychiatry, Faculty of Medicine, Chiang Mai University, Thailand
Tinakon Wongpakaran, Department of Psychiatry, Faculty of Medicine, Chiang Mai University, Thailand

1 Introduction

1.1 What Do We Need from Measurement?

In order to evaluate the empirical applicability of theoretical propositions, we need indicators that are neither weak nor unreliable. But how can we evaluate the degree to which the ten items of Rosenberg's scale, a scale used to measure self-esteem, correctly represent that concept? In essence, two basic measurement properties need to be present. First, an indicator needs to be reliable: reliability represents the extent to which a measurement yields the same results over repeated trials. In practice, repeated measurements rarely yield exactly the same results, but they should tend to be consistent. Second, an indicator is more useful if it provides an accurate representation of the concept being measured; this property is called validity. For example, a neuroticism scale is considered a valid measure if it is bound to the neuroticism trait shown by the respondent, in accordance with Eysenck's theory of personality, rather than to other phenomena such as anxiety disorders. Put another way, reliability is concerned with the degree to which a measurement is consistent across repeated uses, while validity reflects the relationship between an indicator and a concept. Like reliability, validity is a matter of degree.

2 Types of Reliability Measure

Reliability refers to the accuracy and precision of a scale's score. In this chapter, reliability is treated within classical test theory (CTT), which provides a basis for evaluating the effectiveness of a measurement or scale. Most psychological measurements attempt to capture variability that, according to CTT, is made up of actual variation across individuals in the phenomenon the scale measures (the true score) plus error. Observed scores are therefore determined by the true score (T) and the measurement error (E): O_x = T_x + E_x. True scores are the signal we wish to detect, whereas measurement errors are the noise that clouds the signal. Errors that affect measurement include random and non-random errors; noise refers to random error, that is, anything that confounds measurement. Reliability can accordingly be viewed as the degree to which observed-score variance is uncorrelated with and unaffected by error variance; for example, a reliability of 0.65 means that 65% of the variability in the observed scale score is attributable to true-score variability. Since a true score cannot be observed directly, a number of methods have been devised to assess reliability: (i) the test-retest method, in which the same group of respondents completes the same measure at different times; (ii) the measure of equivalence, which uses equivalent forms over the same period of time (to test inter-method reliability); (iii) the measure of stability and equivalence; (iv) the measure of internal consistency, meaning the consistency between items on the same test, or on the same sub-scale of a test, that reflect the same underlying latent construct - under this method, internal consistency is assessed using the split-half method, the Kuder-Richardson method (Feldt, 1969), Cronbach's alpha (Cronbach, 1951) or Hoyt's analysis-of-variance method (Hoyt, 1941); and (v) the inter-rater reliability method, which refers to measurements taken by different persons using the same method or instrument.
In this chapter, we will focus on the most commonly used reliability measures, these being the internal consistency and test-retest methods.
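To make the O = T + E decomposition concrete, the short simulation below (a minimal sketch with made-up numbers, not data from any study cited in this chapter) generates true scores and random errors and recovers reliability as the share of observed-score variance that is due to true-score variance.

```python
import numpy as np

# Classical test theory sketch: observed score = true score + random error.
rng = np.random.default_rng(42)

n_people = 10_000
true_scores = rng.normal(loc=50, scale=10, size=n_people)   # T
errors = rng.normal(loc=0, scale=6, size=n_people)          # E (random noise)
observed = true_scores + errors                              # O = T + E

reliability = true_scores.var() / observed.var()
print(f"Simulated reliability: {reliability:.2f}")
# Expected value is var(T) / (var(T) + var(E)) = 100 / (100 + 36) ≈ 0.74
```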

For all of these measures, reliability ranges between 0 and 1: the higher the value, the better the scale's psychometric properties in terms of reliability and validity. A value of 0.7 is generally considered acceptable, 0.8 or higher is considered good, and 0.9 or higher excellent; a value below 0.7 signals psychometric problems. Reliability affects statistical significance (the t-test or F-test), effect size and the factor structure of the scale. Since a high level of reliability means that an individual score is a good estimate of the true score, low reliability prevents the real results from emerging. Furr (2011) suggested that if the estimated reliability is acceptable, researchers should then examine the appropriate psychological interpretation of the scale's score (i.e. its validity).

2.1 Internal Consistency

2.1.1 What Determines Internal Consistency?

Internal consistency is used to estimate the reliability of a test from a single administration of one form, and is commonly applied to multiple-item scales whose composite score is computed by summing or averaging the items. Internal consistency depends on the individual's performance from item to item, based on the standard deviation of the test and the standard deviations of the items. Internal consistency approaches include the split-half method, the Kuder-Richardson method, Cronbach's alpha and Hoyt's analysis-of-variance method.

One of the indices most commonly used for exploring internal consistency is Cronbach's alpha, an item-level approach that uses inter-item correlation to determine the level of reliability. In short, the purpose of Cronbach's alpha is to show how closely the items are linked; put another way, it provides an index of reliability. A high Cronbach's alpha does not by itself guarantee that the scale is reliable over repeated administrations, although this can be tested through exact repetition.

There are two types of Cronbach's alpha: raw and standardized. The raw version is defined as

\alpha = \frac{K}{K-1}\left(1 - \frac{\sum_{i=1}^{K}\sigma^2_{Y_i}}{\sigma^2_X}\right) \quad \text{or, equivalently,} \quad \alpha = \frac{K\bar{c}}{\bar{v} + (K-1)\bar{c}},

where K is the number of items, \sigma^2_X is the variance of the observed total test scores, \sigma^2_{Y_i} is the variance of item i for the current sample of people (DeVellis, 1991), \bar{v} is the average item variance and \bar{c} is the average of all inter-item covariances in the current sample. Under the assumption that the item variances are all equal, this ratio reduces to a function of the average inter-item correlation \bar{r}, and the result is known as the standardized item alpha:

\alpha_{std} = \frac{K\bar{r}}{1 + (K-1)\bar{r}}.

In general, the raw and standardized alphas are similar or very close in value, except when different item response formats are combined; for example, when a 5-item scale using a 7-point Likert format is combined with an ordinal or a dichotomous scale. Inter-item covariances are normally positive when a scale is internally consistent, and an appropriate average inter-item correlation lies in the range 0.3 to 0.5. Whenever a value is negative or near zero, the items do not measure the same variable, or there is measurement error (for example, particular items are unclear, so respondents cannot express their true scores). The Rosenberg Self-esteem Scale item "I wish I could have more respect for myself", for instance, has proved problematic in many studies (see below for details).
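As an illustration of the two formulas above, the following sketch (hypothetical data only) computes the raw and standardized alpha directly from a respondents-by-items matrix.

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> tuple[float, float]:
    """Return (raw, standardized) Cronbach's alpha for an n_persons x K_items matrix."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)                 # sigma^2_Yi
    total_var = items.sum(axis=1).var(ddof=1)             # sigma^2_X
    raw = (k / (k - 1)) * (1 - item_vars.sum() / total_var)

    corr = np.corrcoef(items, rowvar=False)               # K x K inter-item correlations
    mean_r = corr[np.triu_indices(k, k=1)].mean()         # average inter-item correlation
    standardized = (k * mean_r) / (1 + (k - 1) * mean_r)
    return raw, standardized

# Made-up responses: 200 respondents, 6 items tapping a single trait.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 1))
responses = latent + rng.normal(scale=1.0, size=(200, 6))
raw, std = cronbach_alpha(responses)
print(f"raw alpha = {raw:.3f}, standardized alpha = {std:.3f}")
```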

As for the item-total statistic, this follows the inter-item correlation and should be higher than 0.2.

The example in Figure 1 is drawn from a study by Wongpakaran and Wongpakaran (2010d) on the therapeutic alliance (data modified for simplicity), and shows the results of a Cronbach's alpha analysis when two therapeutic alliance scales with different response formats are combined. The standardized alpha is higher than the raw alpha (0.711 as compared with 0.655).

Reliability Statistics
Cronbach's Alpha = .655; Cronbach's Alpha Based on Standardized Items = .711; No. of Items = 6

Item Statistics (N = 231)
Item     Mean     Std. Deviation
Item 1   4.9567   1.97986
Item 2   5.0349   1.69875
Item 3   5.8001   1.46079
Item 4   2.7275   0.52480
Item 5   2.5968   0.62713
Item 6   2.4122   0.82589

Item-Total Statistics
Item     Scale Mean if   Scale Variance if   Corrected Item-      Squared Multiple   Cronbach's Alpha
         Item Deleted    Item Deleted        Total Correlation    Correlation        if Item Deleted
Item 1   18.5714         12.157              .477                 .290               .595
Item 2   18.4933         13.167              .536                 .303               .547
Item 3   17.7281         15.325              .455                 .211               .585
Item 4   20.8007         20.904              .308                 .341               .651
Item 5   20.9313         20.100              .385                 .368               .635
Item 6   21.1160         19.275              .373                 .245               .627

Figure 1: Reliability analysis results using SPSS
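The corrected item-total correlations and the "Cronbach's alpha if item deleted" column of such output can be reproduced with a short routine like the one below (an illustrative sketch; it can be applied to any respondents-by-items matrix, for instance the hypothetical `responses` array from the previous sketch via `item_total_statistics(responses)`).

```python
import numpy as np

def item_total_statistics(items: np.ndarray) -> list[dict]:
    """Corrected item-total correlation and alpha-if-item-deleted for each item."""
    k = items.shape[1]
    out = []
    for i in range(k):
        rest = np.delete(items, i, axis=1)                 # all items except item i
        rest_total = rest.sum(axis=1)
        # correlation of the item with the total of the remaining items
        r_item_total = np.corrcoef(items[:, i], rest_total)[0, 1]
        # Cronbach's alpha recomputed without item i
        alpha_wo = ((k - 1) / (k - 2)) * (1 - rest.var(axis=0, ddof=1).sum()
                                          / rest_total.var(ddof=1))
        out.append({"item": i + 1,
                    "corrected_item_total_r": round(float(r_item_total), 3),
                    "alpha_if_item_deleted": round(float(alpha_wo), 3)})
    return out
```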

2.1.2 Factors Affecting Reliability Coefficients

Several factors affect reliability coefficients:

1. Group homogeneity. We can see from the formula that the greater the variance, the higher the reliability coefficient. Why is a large variance good? A large variance means a wide spread of scores, which makes it easier to differentiate respondents. If a test yields little variance, the scores in the group are close together (a high level of homogeneity). It is important to note that the reliability reported is not a property of the test itself, but of the test in that particular sample.
2. Test length. The longer the test, the more accurate the reliability coefficient; however, a long test may compromise the motivation or concentration of the respondents. As a result, when shortening a measurement, we need to be cautious about its reliability.
3. Inter-item correlation. The more the items correlate with one another, the higher the reliability, because of greater homogeneity (Guilford, 1954).
4. Time limitation: too little time given for completing the test.
5. Test factors: poor instructions and ambiguous questionnaires.
6. Examinee factors: poor concentration, poor motivation and fatigue.
7. Administrator factors: the influence of staff on respondent biases.

Cronbach's alpha is a single-administration index and may be affected by some of the factors above, but not necessarily in the same way on different occasions; for example, reliability may be poor the first time because the participants have poor levels of concentration, but better the second time when their concentration improves. In addition, the alpha coefficient will vary with the group of respondents involved, for example students versus clinical patients. In patient groups, alpha tends to be lower than in student groups, because patients are more likely to be affected by compromising factors such as poorer concentration and lower cognition (T. Wongpakaran et al., 2012b). Finally, the alpha coefficient does not indicate that a scale or measurement is uni-dimensional; if multiple dimensions exist, alpha should be calculated for each dimension or sub-scale in addition to the overall scale.

2.2 Test-Retest

Assumptions underlying the test-retest approach are: 1) that the true scores are the same at the first and second administrations, and 2) that the error variances of the two administrations are equal. However, a number of factors can prevent these assumptions from holding, especially the first. For example, some attributes probed by particular items may change over time, such as feelings of boredom, whereas interpersonal style or personality traits take longer to change. As a result, the interval between the tests needs to be chosen carefully, a two- to eight-week interval being common in psychological and psychiatric practice. Another factor that can affect test-retest reliability is the measurement itself: we have found that the longer the test, the lower the reliability (Wongpakaran et al., 2012b).

In general, Pearson's product-moment correlation coefficient is commonly used for test-retest reliability. However, Yen and Lo (2002) demonstrated that the intra-class correlation (ICC) is more sensitive to the detection of systematic error and that, since Pearson's product-moment correlation is theoretically the correlation between two different variables, it should not be used in reliability analyses, because it cannot detect the existence of systematic error. For example, in one data set the Pearson's product-moment correlation between test 1 and test 2 is 0.816, and the ICC is also 0.816; when a systematic error of 18 points is introduced, the Pearson correlation between test 1 and test 2 remains 0.816, while the ICC falls to 0.807. The ICC is therefore the more appropriate coefficient for evaluating test-retest reliability.
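The difference between the two coefficients is easy to demonstrate with simulated data (a hedged sketch, not the data behind the 0.816/0.807 figures above): adding a constant shift to the retest leaves Pearson's r unchanged but lowers an absolute-agreement ICC, here computed as ICC(2,1).

```python
import numpy as np

def icc_2_1(scores: np.ndarray) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single measurement.
    `scores` is an n_subjects x k_occasions (or raters) matrix."""
    n, k = scores.shape
    grand_mean = scores.mean()
    row_means = scores.mean(axis=1)
    col_means = scores.mean(axis=0)

    ss_rows = k * ((row_means - grand_mean) ** 2).sum()
    ss_cols = n * ((col_means - grand_mean) ** 2).sum()
    ss_error = ((scores - grand_mean) ** 2).sum() - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )

# Hypothetical test-retest data: the retest adds noise plus a constant (systematic) shift.
rng = np.random.default_rng(1)
test1 = rng.normal(loc=20, scale=5, size=100)
test2 = test1 + rng.normal(scale=2, size=100) + 4

pearson_r = np.corrcoef(test1, test2)[0, 1]
icc = icc_2_1(np.column_stack([test1, test2]))
print(f"Pearson r = {pearson_r:.3f}, ICC(2,1) = {icc:.3f}")
# Pearson r ignores the constant shift; the absolute-agreement ICC is pulled down by it.
```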

2.3 Inter-Rater Reliability

Inter-rater reliability measures the extent to which ratings are consistent (in agreement) between raters. It is commonly examined by clinicians who want to use an instrument with diagnostic criteria, where the responses are usually dichotomous (a yes/no format); for example, general practitioners using the Structured Clinical Interview for DSM-IV personality disorders (SCID-II) (First et al., 1997) or the CAM algorithm for detecting delirium in inpatient units (Wongpakaran et al., 2011). A number of factors are associated with the degree of inter-rater reliability:

1. The number of raters: the more raters there are, the lower the reliability.
2. The rating-scale format: an ordinal or Likert-type rating is likely to give higher reliability than dichotomous responses.
3. The rating method used (nested, joint, independent or mixed designs, as used with the SCID-II). Joint rating (i.e. both raters see the interviewee at the same time, or the second rater rates a video of the first rater's interview) yields the highest reliability, while a nested design, in which each rater sees the interviewee independently, yields the lowest. If an interval between ratings is involved, agreement may be even lower, since the interval introduces variance in interviewee factors (as in the test-retest situation) (T. Wongpakaran et al., 2012a). In short, the rating method determines which type of variation comes into play: rater variation (e.g. differences in data interpretation) or patient variation (e.g. inconsistency in giving information on different occasions and/or to different raters).

2.3.1 How to Calculate Inter-Rater Reliability

Statistical methods used to assess inter-rater reliability include Cohen's kappa (Cohen, 1988), Fleiss' kappa (Fleiss, 1981) or a weighted kappa, inter-rater correlation, the concordance correlation coefficient and the intra-class correlation. Which is chosen depends on the type of data being analyzed: kappa statistics are used for nominal data, while correlation coefficients and the intra-class correlation are used for ordinal (Spearman's rho) or continuous (Pearson's r) scales. However, the intra-class correlation coefficient (ICC) tends to be preferred over Spearman's rho and Pearson's r because it takes into account the differences between individual ratings as well as the association between raters. It is also flexible in terms of the number of raters, the number of participants, the response format and the handling of missing data.
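For dichotomous ratings of the kind described above, Cohen's kappa can be obtained as in the following sketch (hypothetical ratings; the `cohen_kappa_score` helper is from scikit-learn and is not part of the original analyses).

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Hypothetical dichotomous ratings (1 = diagnosis present, 0 = absent)
# from two raters on the same 20 interviewees.
rater_1 = np.array([1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1])
rater_2 = np.array([1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1])

observed_agreement = (rater_1 == rater_2).mean()
kappa = cohen_kappa_score(rater_1, rater_2)
print(f"Observed agreement = {observed_agreement:.2f}, Cohen's kappa = {kappa:.2f}")
# Kappa corrects the raw percentage agreement for the agreement expected by chance alone.
```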

Table 1 shows an example of inter-rater reliability between two raters using the SCID-II. With categorical (yes/no) diagnoses, the kappa statistics indicate that obsessive-compulsive personality disorder (PD) yields the highest agreement between raters, while depressive PD yields the lowest, at 0.70. When the summed score (continuous data) is used, the ICC for passive-aggressive PD is considerably higher than the corresponding Cohen's kappa.

Personality disorder (PD)   No. of diagnoses (%),   No. of diagnoses (%),   % observed   Kappa   ICC
                            1st rater               2nd rater               agreement
Avoidant                    11 (24.49)              11 (20.4)               96           0.89    0.84
Obsessive-compulsive        14 (25.9)               14 (25.9)               96           0.90    0.87
Passive-aggressive           8 (14.8)                7 (13.0)               94           0.77    0.86
Depressive                   5 (9.3)                 6 (11.1)               94           0.70    0.79
Paranoid                     9 (16.7)                6 (11.1)               94           0.78    0.82
Schizoid                     7 (13.0)                8 (14.8)               94           0.77    0.78
Borderline                   7 (13.0)                7 (13.0)               96           0.84    0.90
Anti-social                  5 (9.3)                 5 (9.3)                96           0.78    0.90

Table 1: Kappa values and ICCs

3 Relationship Between Reliability and Validity

Validity concerns what a scale or measurement actually represents. A measurement that is internally consistent can be said to gauge something, but internal consistency alone cannot tell us what that something is. To examine validity therefore means to test hypotheses about the constructs related to the measurement; there are several ways of doing so, for example by examining inter-correlations with other designated constructs. Dimensionality and reliability are important elements of a measurement's psychometric properties, but validity is even more important, because it tells you how useful a measurement or scale is. Some important points about validity should be noted:

1. It is not a yes/no quality; it is a matter of degree.
2. It varies from study setting to setting and from sample to sample; for example, participants may be depressed, or a patient from a non-depressed part of the population may still respond to the Hamilton Depression Rating Scale.
3. It can be assessed in a variety of ways.
4. It needs both scientific evidence and a theoretically based structure to compare against.
5. It concerns the interpretation of a scale score, not the scale itself (Furr, 2011).

The techniques used for studying validity include:

1. Criterion-related validity, which compares the score on a measurement or scale with scores on other variables (criteria). Nunnally and Bernstein (1994) define criterion-related validity as using an instrument to estimate some important form of behavior that is external to the measuring instrument itself, the latter being referred to as the criterion. For example, Chan-Ob and Boonyanaruthee (1999) used high school students' performance in mathematics to predict how well students with high scores would obtain high grade-point averages (GPA) in their medical school studies.

There are two types of criterion-related validity: predictive validity, as just described, and concurrent validity, which assesses the correlation between the measurement and the criterion (another measurement) obtained at the same time. For example, Wongpakaran et al. (2011b) studied the concurrent validity of the Experiences of Close Relationships-Revised questionnaire (ECR-R) by computing the Pearson product-moment correlation between the ECR-R anxiety sub-scale and Spielberger's State-Trait Anxiety Inventory, finding them to be positively correlated.

2. Content validity, meaning the extent to which a measurement represents all aspects of the given construct. For example, when assessing content validity for a depression measure, the test items should cover the required domains of the depressive syndrome, such as dysphoric mood, poor concentration, lack of energy, anxiety and sleep problems. In developing a scale or measurement, experts can carefully select items that are relevant to a test specification drawn up for the particular subject domain.

3. Construct validity, which refers to validity evidence that requires empirical and theoretical support in order to allow interpretation of the construct. This evidence includes the internal structure, the degree of association with other variables, the test's content, response processes and the consequences of its use. In this chapter, we focus on internal structure, using exploratory factor analysis and confirmatory factor analysis.

3.1 How Reliability Affects Validity

Validity is often assessed alongside reliability, the extent to which a measurement gives consistent results. A measure can have high reliability without having validity (in which case the measure is not useful at all); on the other hand, any attempt to establish validity is futile if a test is not reliable. Within classical test theory, predictive or concurrent validity (the correlation between the predictor and the predicted) cannot exceed the square root of the product of the two reliabilities:

r_{12\max} = \sqrt{r_{11}\, r_{22}},

where r_{11} and r_{22} are the reliabilities of the two variables. Internal-consistency error and time-sampling error are two sources of unreliability that reduce the validity of tests.

3.2 Confirmatory Factor Analysis (CFA)

CFA is used to evaluate the dimensionality of a scale, that is, to verify the factor structure of a set of observed variables. It allows researchers to test the hypothesis that a relationship exists between the observed variables and their underlying latent constructs. It is the step that follows exploratory factor analysis (EFA), because it is used to confirm the factor structure that has been theoretically proposed. More importantly, CFA can also be used to determine, compare and revise models (model modification and re-analysis), which EFA cannot do. Statistics used for CFA include the χ² test, which quantifies the difference between the expected and observed covariance matrices; a good fit is indicated by a probability greater than 0.05, with the χ² value close to zero. Other indices of how well a model fits the data include the Normed Fit Index (NFI), the Comparative Fit Index (CFI), the Tucker-Lewis Index (TLI) and the Goodness-of-Fit Index (GFI); these produce values ranging from 0 to 1, with larger values indicating better fit. There are also badness-of-fit indices: the standardized root-mean-square residual (SRMR), which should be no more than 0.08, and the root-mean-square error of approximation (RMSEA), for which an acceptable value is no more than 0.06 (Hu & Bentler, 1995; 1998; 1999).
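As a hedged illustration of how such a model might be specified and its fit indices obtained (this is not the authors' code; it assumes the Python `semopy` package, a lavaan-style model string and a hypothetical item-level data file), a two-factor CFA could look like this:

```python
import pandas as pd
import semopy

# Hypothetical two-factor CFA; x1..x6 and scale_responses.csv are placeholders,
# not data or syntax from this chapter.
model_desc = """
F1 =~ x1 + x2 + x3
F2 =~ x4 + x5 + x6
"""

df = pd.read_csv("scale_responses.csv")     # one column per observed item

model = semopy.Model(model_desc)
model.fit(df)

# calc_stats reports fit indices (chi-square, CFI, TLI, RMSEA, ...) that can be
# compared against the cut-offs quoted above.
print(semopy.calc_stats(model).T)
```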

Since validity presupposes reliability, there also needs to be an acceptable level of internal consistency (at least 0.7) for the scale or sub-scale. For example, one study examined perceived stress in a non-clinical sample using the Perceived Stress Scale-10 (PSS-10) (N. Wongpakaran & Wongpakaran, 2010), a scale that measures the level of stress perceived by respondents at a particular point in time. The scale has ten items and two sub-scales (latent factors): stress (Factor I) and control (Factor II). The items loaded onto their designated factors, items 1, 2, 3, 6, 9 and 10 on Factor I and items 4, 5, 7 and 8 on Factor II, as hypothesized (Table 2). The communalities (the reliability of each indicator, h²) also reflected the inter-item correlations, and all values were acceptable. The Cronbach's alpha for each sub-scale was very good (0.90 and 0.83), meaning that the inter-item correlation within each designated factor was also good, as reflected in acceptable communalities (all above 0.3), as shown.

Item (In the last month, how often have you ...)                                Factor I   Factor II   h²
1. ... been upset because of something that happened unexpectedly?               0.881     -0.020      0.76
2. ... felt unable to control the important things in your life?                 0.867      0.045      0.79
3. ... felt nervous and "stressed"?                                              0.663     -0.107      0.38
4. ... found yourself unable to cope with all the things that you had to do?     0.547      0.162      0.42
5. ... been angered because of things outside of your control?                   0.759      0.035      0.61
6. ... felt difficulties piling up so high that you could not overcome them?     0.801      0.054      0.69
7. ... felt confident about your ability to handle your personal problems?      -0.054      0.638      0.37
8. ... felt that things were going your way?                                    -0.006      0.761      0.57
9. ... been able to control the irritations in your life?                        0.381      0.798      0.49
10. ... felt on top of things?                                                   0.516      0.868      0.78
Eigenvalue                                                                        5.05       1.60
% variance                                                                       50.53      15.97
M                                                                                 5.25       5.10
SD                                                                                4.90       3.77
Cronbach's alpha                                                                  0.90       0.83

Table 2: Factor structure of the PSS-10
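A loading and communality table of this kind could be generated by an exploratory factor analysis along the following lines (a sketch assuming the `factor_analyzer` package and a hypothetical `pss10_responses.csv` file with one column per item; it is not the analysis performed in the original study).

```python
import pandas as pd
from factor_analyzer import FactorAnalyzer

# Hypothetical item-level PSS-10 data, one column per item.
df = pd.read_csv("pss10_responses.csv")

efa = FactorAnalyzer(n_factors=2, rotation="oblimin")
efa.fit(df)

loadings = pd.DataFrame(efa.loadings_, index=df.columns,
                        columns=["Factor I", "Factor II"])
loadings["h2"] = efa.get_communalities()          # communality of each item
print(loadings.round(3))
print("Proportion of variance:", efa.get_factor_variance()[1].round(3))
```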

CFA was then carried out to see to what extent the factor structure matched the hypothesized construct. The original PSS-10 has been shown to have two latent constructs, i.e. stress and control. In this sample, EFA yielded two factors, and the fit statistics from the CFA indicated a good fit: χ² = 59.214, df = 31, p < 0.002; Goodness-of-Fit Index (GFI) = 0.97; Non-Normed Fit Index (NNFI or TLI) = 0.95; Normed Fit Index (NFI) = 0.93; Comparative Fit Index (CFI) = 0.97; Standardized Root Mean Square Residual (SRMR) = 0.041; and Root Mean Square Error of Approximation (RMSEA) = 0.050 (Figure 2).

Figure 2: Measurement model with parameter estimates for the PSS-10

3.3 CFA and Reliability

As noted above, Cronbach's alpha may not accurately reflect the reliability of a measurement, because reliability is affected by the nature of the psychometric properties, in particular the presence of correlated error terms (Furr, 2011; Miller, 1995). Rather than using the traditional alpha, in this example CFA is used to assess reliability. The example below shows a modified model, with parameter estimates, for the uni-dimensional modified Rosenberg Self-Esteem Scale (m-RSES) (T. Wongpakaran & Wongpakaran, 2012a), in which items 5 and 7 are revised (see Table 3 for the revised wording). Figure 3 shows the final model after modification, a one-factor solution with correlated error terms (among the positively worded items and among the negatively worded items). The model yields good fit statistics: χ² = 47.50, df = 27, p = 0.009; Goodness-of-Fit Index (GFI) = 0.96; Non-Normed Fit Index (NNFI or TLI) = 0.96; Normed Fit Index (NFI) = 0.95; Comparative Fit Index (CFI) = 0.98; Standardized Root Mean Square Residual (SRMR) = 0.038; Root Mean Square Error of Approximation (RMSEA) = 0.055; and Cronbach's alpha for the scale is 0.84.

If the reliability of the scale is instead calculated through structural equation modeling, using the CFA solution and the following formula, the estimated reliability (Brown, 2006) is

\rho = \frac{\left(\sum_i \lambda_i\right)^2}{\left(\sum_i \lambda_i\right)^2 + \sum_i \theta_{ii} + 2\sum_{i<j} \theta_{ij}},

where \lambda_i is the factor loading of item i, \theta_{ii} is the error variance of item i, and \theta_{ij} is the covariance between the error terms of items i and j. For the m-RSES, (\sum \lambda_i)^2 = (.74 + .50 + .62 + .71 + .72 + .34 + .46 + .53 + .52 + .49)² = 29.48, and \sum \theta_{ii} + 2\sum \theta_{ij} = (.55 + .25 + .39 + .50 + .52 + .52 + .11 + .21 + .28 + .27 + .24) + 2(.41 + .37 + .50 + .31 + .43 + .53 + .13 + .20) = 9.59, giving 29.48 / (29.48 + 9.59) = 0.755. We can see a difference between the reliability estimates yielded by Cronbach's alpha and by CFA; as Miller (1995) noted, alpha has a tendency to misestimate reliability.

Figure 3: Measurement model with parameter estimates for the modified Rosenberg Self-Esteem Scale (m-RSES)
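The same calculation is easy to script; the sketch below defines a generic composite-reliability function (a minimal illustration, not the authors' code) and applies it to a small hypothetical three-item example rather than to the m-RSES values above.

```python
import numpy as np

def composite_reliability(loadings, error_variances, error_covariances=()):
    """Composite reliability from a CFA solution:
    (sum of loadings)^2 / ((sum of loadings)^2 + sum of error variances
                           + 2 * sum of error covariances)."""
    signal = np.sum(loadings) ** 2
    noise = np.sum(error_variances) + 2 * np.sum(error_covariances)
    return signal / (signal + noise)

# Hypothetical three-item example: loadings .7/.6/.5, error variances .51/.64/.75,
# no correlated errors.
print(round(composite_reliability([.7, .6, .5], [.51, .64, .75]), 3))   # 0.63
```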

4 How to Solve the Problem of Low Reliability in a Measurement

As mentioned earlier, a number of factors influence reliability and validity. In clinical practice, we commonly find that clinical samples respond differently to questionnaires or psychological measurements than non-clinical samples. This does not simply mean that clinical participants tend to score higher on problems than non-clinical participants; rather, the ways in which they respond lead to different factor structures. For example, in a review of the factor structure of the Multidimensional Scale of Perceived Social Support (MSPSS), which is designed to tap three sub-scales (friends, family and significant others), we found that clinically distressed patients tended to merge family members with significant others, whereas non-clinical respondents, especially younger ones, tended to merge significant others with friends (Wongpakaran et al., 2011a). Invariance tests within the CFA showed a significant difference in factor structure, which is reflected in poor reliability for some of the MSPSS sub-scales; this in turn may compromise the effect size and affect differences found in any analyses that use the measurement. As a result, we modified the instructions of the questionnaire in order to make respondents aware of the existence of "significant others" (Wongpakaran & Wongpakaran, 2010c).

Other ways to address low internal consistency include increasing the number of items, increasing the sample size and removing items that cause low inter-item correlation; any such trimming, however, should first pass theoretical evaluation. Rather than increasing the number of items, researchers tend to reduce them to the smallest number possible, as long as the test retains acceptable psychometric properties. A long test, by contrast, can compromise adherence to the test and thereby reliability. This is clearly seen in clinical settings, especially among patients with mental health or psychiatric problems, whose minds tend to wander easily.

Regarding items that cause low inter-item correlation, we have found in our previous studies that negatively worded items can cause problems for the internal consistency and factor structure of a scale. As mentioned before, in a clinical setting patients tend to have poorer cognitive ability, regardless of their motivation; they usually perform particularly poorly on negated or double-negated sentences, or on sentences containing the word "not" (Wongpakaran & Wongpakaran, 2012a; Wongpakaran & Wongpakaran, 2012b). This is found not only in clinical samples but also in general samples. Consider a well-known example, the Rosenberg Self-Esteem Scale. The item that seems most problematic, tending to create poor inter-item correlation and low reliability, is item 5, "I wish I could have more respect for myself". This is clearly a negatively worded statement, and in a number of studies it has yielded the poorest correlation with the other items (that is, it is the least internally consistent item) (Beeber et al., 2007; Carmines, 1978; Wongpakaran & Wongpakaran, 2011). This item also leads to an indeterminable factor structure.
For this item, Table 3 shows a communality value of 0.077, which is unacceptable, so we rephrased the statement in a positive direction, as "I think I am able to give myself more respect", and re-analyzed it with another sample. This produced a better result in terms of the item's factor loading (h² increased to 0.468) and an acceptable model fit in the CFA. However, it then exposed a problem with item 7, "I feel that I'm a person of worth, at least on an equal plane with others" (h² fell to 0.149). Therefore, in the latest revision, we altered this sentence into a more positive form, "I feel that I'm a person of worth, and I feel this emotion is stronger in me than in many other people", and tested it on a third sample. This version provides an acceptable factor loading for all the items, with the same (good) model fit as found with the second revision.

Interestingly, the Cronbach's alphas are similar, around 0.84, for all three versions of the test.

                                                            Original (n = 664)          Revised #5 (n = 187)        Revised #5 and #7 (n = 251)
Item                                                        m     SD   F.L.  h²         m     SD   F.L.  h²         m     SD   F.L.  h²
1. On the whole, I am satisfied with myself.                3.19  .66  .721  .526       3.27  .63  .602  .395       3.44  .57  .599  .573
2. At times I think I am no good at all.                    2.99  .76  .629  .664       3.10  .82  .819  .672       3.39  .67  .827  .691
3. I am able to do things as well as most other people.     3.09  .61  .675  .514       3.18  .61  .661  .442       3.25  .63  .635  .413
4. I feel I do not have much to be proud of.                3.05  .76  .734  .593       3.07  .88  .764  .584       3.38  .69  .825  .723
5. I wish I could have more respect for myself.             2.23  .82  .277  .077       -     -    -     -          -     -    -     -
[5]¹ I think I am able to give myself more respect.         -     -    -     -          3.18  .66  .657  .468       3.37  .56  .738  .570
6. I certainly feel useless at times.                       3.25  .80  .808  .756       3.36  .84  .805  .660       3.68  .55  .814  .701
7. I feel that I'm a person of worth, at least on an
   equal plane with others.                                 2.93  .72  .415  .194       2.95  .90  .361  .149       -     -    -     -
[7]¹ I feel that I'm a person of worth, and I feel this
   emotion is stronger in me than in many other people.     -     -    -     -          -     -    -     -          3.01  .67  .550  .302
8. All in all, I am inclined to feel that I am a failure.   3.25  .82  .776  .661       3.29  .86  .718  .519       3.65  .54  .719  .570
9. I feel that I have a number of good qualities.           3.16  .61  .758  .611       3.23  .66  .814  .663       3.47  .53  .737  .611
10. I take a positive attitude toward myself.               3.31  .74  .782  .619       3.34  .75  .781  .667       3.54  .56  .685  .584

m = mean; SD = standard deviation; F.L. = factor loading; h² = communality; ¹ re-worded in a positive direction

Table 3: Comparison of means, SDs, factor loadings and communalities (h²) between the original version, the revision of item 5, and the revision of items 5 and 7

5 Conclusion

Reliability can be assessed in a number of ways, and its importance lies in the impact it has on the construct validity of a scale or measurement. Since reliability and validity are specific to clinical settings and samples, researchers should look for possible sources of error and modify problematic items that cause low reliability, so that scales or measurements do not mislead or skew the outcomes of analysis.

References

Beeber, L. B., Seeherunwong, A., Schwartz, T., Funk, S. G., & Vongsirimas, N. (2007). Validity of the Rosenberg Self-esteem Scale in young women from Thailand and the USA. Thai Journal of Nursing Research, 11(4), 240-250.
Brown, T. A. (2006). Confirmatory factor analysis for applied research. New York: Guilford.
Carmines, E. G. (1978). Psychological origins of adolescent political attitudes: Self-esteem, political salience, and political involvement. American Politics Quarterly, 6, 167-186.
Chan-Ob, T., & Boonyanaruthee, V. (1999). Medical student selection: Which matriculation scores and personality factors are important? Journal of the Medical Association of Thailand, 82(6), 604-610.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences. New Jersey: Lawrence Erlbaum Associates.
Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16, 297-334.
DeVellis, R. F. (1991). Scale development: Theory and applications. Newbury Park: Sage.
Feldt, L. S. (1969). A test of the hypothesis that Cronbach's alpha or Kuder-Richardson coefficient twenty is the same for two tests. Psychometrika, 34, 363-373.
First, M. B., Gibbon, M., Spitzer, R. L., Williams, J. B. W., & Benjamin, L. S. (1997). Structured Clinical Interview for DSM-IV Axis II Personality Disorders (SCID-II). Washington, DC: American Psychiatric Press.
Fleiss, J. (1981). Statistical methods for rates and proportions (2nd ed.). New York: John Wiley.
Furr, R. M. (2011). Scale construction and psychometrics for social and personality psychology. London, UK: Sage Publications.
Guilford, J. P. (Ed.). (1954). Psychometric methods. New York: McGraw-Hill Book Company.
Hoyt, C. (1941). Test reliability estimated by analysis of variance. Psychometrika, 6, 153-160.
Hu, L., & Bentler, P. M. (1995). Evaluating model fit. In R. H. Hoyle (Ed.), Structural equation modeling: Concepts, issues and applications (pp. 76-99). California: Sage.
Hu, L., & Bentler, P. M. (1998). Fit indices in covariance structure modeling: Sensitivity to underparameterized model misspecification. Psychological Methods, 3, 424-453.
Hu, L., & Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance structure analysis: Conventional criteria versus new alternatives. Structural Equation Modeling, 6, 1-55.
Miller, M. B. (1995). Coefficient alpha: A basic introduction from the perspectives of classical test theory and structural equation modeling. Structural Equation Modeling, 2(3), 255-273.
Nunnally, J., & Bernstein, I. (1994). Psychometric theory. New York, USA: McGraw-Hill Book Company.

Wongpakaran, N., & Wongpakaran, T. (2010). The Thai version of the PSS-10: An investigation of its psychometric properties. BioPsychoSocial Medicine, 4, 6. doi:10.1186/1751-0759-4-6
Wongpakaran, N., Wongpakaran, T., Bookamana, P., Pinyopornpanish, M., Maneeton, B., Lerttrakarnnon, P., & Jiraniramai, S. (2011). Diagnosing delirium in elderly Thai patients: Utilization of the CAM algorithm. BMC Family Practice, 12, 65.
Wongpakaran, T., & Wongpakaran, N. (2011). Confirmatory factor analysis of the Rosenberg Self-Esteem Scale: A study of a Thai student sample. Journal of the Psychiatric Association of Thailand, 65(1), 59-70.
Wongpakaran, T., Wongpakaran, N., & Ruktrakul, R. (2011a). Reliability and validity of the Multidimensional Scale of Perceived Social Support (MSPSS): Thai version. Clinical Practice & Epidemiology in Mental Health, 7, 161-166. doi:10.2174/1745017901107010161
Wongpakaran, T., Wongpakaran, N., & Wannarit, K. (2011b). Validity and reliability of the Thai version of the Experiences of Close Relationships-Revised questionnaire. Singapore Medical Journal, 52(2), 100-106.
Wongpakaran, T., & Wongpakaran, N. (2012a). A comparison of reliability and construct validity between the original and revised versions of the Rosenberg Self-Esteem Scale. Psychiatry Investigation, 9(1), 54-58.
Wongpakaran, T., & Wongpakaran, N. (2012b). A short version of the Revised Experience of Close Relationships Questionnaire: Investigating non-clinical and clinical samples. Clinical Practice & Epidemiology in Mental Health, 8, 36-42.
Wongpakaran, T., & Wongpakaran, N. (2010c). A revised Thai Multi-dimensional Scale of Perceived Social Support (MSPSS). Spanish Journal of Psychology, 15(3), 1503-1509.
Wongpakaran, T., & Wongpakaran, N. (2010d). How the interpersonal and attachment styles of therapists impact upon the therapeutic alliance and therapeutic outcomes? Journal of the Medical Association of Thailand, 95(12), 1583-1592.
Wongpakaran, T., Wongpakaran, N., Bukkamana, P., Boonyanaruthee, V., Pinyopornpanish, M., Likhitsathian, S., & Srisutadsanavong, U. (2012a). Interrater reliability of the Thai version of the Structured Clinical Interview for DSM-IV Axis II Disorders (T-SCID II). Journal of the Medical Association of Thailand, 95(2), 264-269.
Wongpakaran, T., Wongpakaran, N., Sirithepthawee, U., Pratoomsri, W., Burapakajornpong, N., Rangseekajee, P., & Temboonkiat, A. (2012b). Interpersonal problems among psychiatric outpatients and non-clinical samples. Singapore Medical Journal, 53(7), 481-487.
Yen, M., & Lo, L. H. (2002). Examining test-retest reliability: An intra-class correlation approach. Nursing Research, 51(1), 59-62.