Scaling the quality of clinical audit projects: a pilot study


International Journal for Quality in Health Care 1999; Volume 11, Number 3: pp. 241-249

ANDREW D. MILLARD
Scottish Clinical Audit Resource Centre, Departments of Postgraduate Medical Education and Public Health, University of Glasgow, Scotland

Abstract

Objective. To pilot the development of a scale measuring the quality of audit projects through audit project reports.

Design. Statements about clinical audit projects were selected from existing instruments assessing the quality of clinical audit projects, to form a Likert scale.

Setting. The audit facilitators were based in Scottish health boards and trusts.

Study participants. The participants were audit facilitators known to have over 2 years' experience of supporting clinical audit. The response at the first test was 11 out of 14 and at the second test 27 out of 46.

Interventions. The draft scale was tested by 27 audit facilitators who expressed their strength of agreement or disagreement with each statement for three reports.

Main outcome measures. Validity was assessed by test-retest, item-total, and total-global indicator correlations.

Results. Of the 20 statements, 15 had satisfactory correlations with scale totals. Scale totals had good correlations with global indicators. Test-retest correlation was modest.

Conclusions. The wide range of responses means further research is needed to measure the consistency of audit facilitators' interpretations, perhaps comparing a trained group with an untrained group. There may be a need for a separate scale for reaudits. Educational impact is distinct from project impact generally. It may be more meaningful to treat the selection of projects and aims, methodology and impact separately as subscales and take a project profiling approach rather than attempting to produce a global quality index.

Keywords: evaluation, medical audit, quality

Clinical audit is a method of improving the quality of clinical care.
It was introduced in the UK in 1989 as a government-funded initiative. The clinical audit process is controlled by the health professionals involved at the local level. The development of a measure of the quality of clinical audit projects has been seen as a key part of evaluating and improving clinical effectiveness and audit [1]. The purpose of this planned measure was to evaluate audit projects through project reports in order to improve audit programmes.

A previous study revealed no single definitive view about the quality of audit projects, because they served different purposes for different people [2]. This applied particularly to the aims and the outcomes of audit projects. There was consensus on the methodological features needed, and that audit should facilitate improvement to clinical practice, but there were still different emphases. Even for randomized controlled trials, which have a more defined and quantitatively rigorous methodology than audit projects, rating methodological quality is a complex and subjective process [3]. Consensus existed that audit was itself evaluated by asking whether improvements had happened. Evaluation has been defined as a method for determining the degree to which a planned programme reaches its objective [4].

As views of the quality of audit projects are subjective, the methods developed to measure attitudes were thought potentially suitable for measuring quality in this context. Attitude scaling involves the construction of a valid scale through item analysis and assessment of the correlation of individual scale items with the scale total [5]. A scale for the measurement of staff attitudes to trust clinical audit programmes has been successfully developed and validated elsewhere [6]. A quantitative means of assessing the quality of clinical audit projects has not yet been developed. It is needed to allow the usefulness of audit projects to be compared, and to identify and improve weaker audit programmes.

Since this project was completed, Walshe and Spurgeon [7] have developed a framework of criteria for both the assessment and improvement of audit projects and programmes. The criteria are explained as an integral part of the framework. They are intended for use for self-assessment, and for trust and health authority reviews of services. The English National Centre for Clinical Audit has also published its criteria for clinical audit [8]. These form a checklist for good practice in planning audit projects rather than for evaluating audit project reports.

Address correspondence to Andrew D. Millard, Scottish Clinical Audit Resource Centre, Department of Postgraduate Medical Education, University of Glasgow, 1 Horselethill Road, Glasgow G12 9LX, UK. Tel: +44 141 330 6195. Fax: +44 141 330 6192. E-mail: andrewm@pgm.gla.ac.uk
(c) 1999 International Society for Quality in Health Care and Oxford University Press

Methods

Study objectives

The aim of the study was to pilot the use of a scaling approach as the basis of a quantitative index of the quality of audit projects seen in audit project reports.

Study design

A list of items was developed (see Appendix) using two other audit quality questionnaires [9,10]. Information was also taken into account from the more general literature on the quality of audit [11-13]. In selecting items for the scale from the two other questionnaires, overlapping and duplicated items were amalgamated by choosing the simplest phrasing. Purely descriptive items were excluded, e.g. the audit approach used by the project. In some cases thinking had moved on since these questionnaires had been designed. For example, several questions on the development of standards could be replaced by one question specifically on the use of evidence in setting standards. The choice of a topic including potential for change was taken to be an essential part of an audit project [13]. The final scale related to a model including reaudit [14]. The earlier questionnaires were not useful for validating the new scale because of the changes in audit since they were developed: they could not be used as quantitative measures as they stood, because they were not designed or tested for this purpose.

Criteria were expressed in the form of agree/disagree questions using 5-point Likert scales, including five negatively worded items whose scores were reversed for analysis. The criteria included one question about the priority of the topic and one about the quality of description of the problem addressed. Most questions covered methodology and impact. At the second stage, two single-answer questions, asking the respondents to give their overall assessment of the project methodology and their assessment of the worthwhileness of the project's aims, were added to the instrument. They were not included in the scale, but were used as additional checks on its validity. The full instrument is given in the appendix. Not applicable (N/A) and Don't know (D/K) options were available to respondents in addition to the five Likert-style agreement/disagreement categories. N/A could apply to questions on the use of statistics, for example, whereas D/K indicated a deficiency of information in the report. Low scores were for low satisfaction of criteria. Both N/A and D/K items were scored zero.

Study participants

The tool was piloted by 27 respondents from a sample of 46 staff members known to have been on a list of 210 audit support staff for over 2 years, and who were still in an audit support role. The list was developed separately from this project by the Scottish Audit Network (SAN), an organization for audit staff, to support their networking activities. The staff members were chosen from the list by the SAN secretary, on personal knowledge of their time in audit. It is likely that the members chosen were a majority of those facilitators with over 2 years in audit.

Of the 27 respondents, nine (33%) were male and 18 were female. Of the sample, 11 (24%) were male and 35 were female. The sample reflected the proportions in the whole list (55 male, 155 female); two more males responded than expected. This was not significantly different from the sample composition (chi-squared = 2.7, d.f. = 1). Responses were received from facilitators in acute and community trusts in 12 out of the 15 health board areas.

Interventions

Three project reports were selected at random by the researcher from 50 short reports published by the Clinical Resource and Audit Group, a Scottish Health Service national committee of senior health professionals. One of the projects was evaluated by an initial 11 respondents ('time 1'). One month later ('time 2') this project was re-evaluated by the same respondents, and two other projects were evaluated by these and another 16 respondents with the same level of experience. This gave 27 complete responses to the scale at time 2, making 81 different project-respondent combinations in total at time 2 (Figure 1). Three audit projects were thought to be the maximum that respondents would have the time to evaluate for the study.

Figure 1 Project design.
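The scoring rules described above can be sketched in code. This is a minimal reading of the paper's scheme rather than the authors' implementation, and the positions chosen for the five reverse-scored items are hypothetical, for illustration only.

```python
# Sketch of the scoring rules: 20 items rated on a 5-point agree/disagree
# scale, five negatively worded items reverse-scored, and "N/A" or "D/K"
# responses scored zero (both options score zero, per the paper).

REVERSED = {3, 8, 11, 16, 19}   # hypothetical positions of the 5 negative items

def score_item(position, response):
    """Score one response: an int 1-5, or the strings 'N/A' / 'D/K'."""
    if response in ("N/A", "D/K"):
        return 0                 # missing/inapplicable information scores zero
    if position in REVERSED:
        return 6 - response      # reverse the 5-point scale (1<->5, 2<->4)
    return response

def scale_total(responses):
    """Sum of item scores across the 20 items for one rated report."""
    return sum(score_item(pos, r) for pos, r in enumerate(responses, start=1))
```

Under this reading, a fully reported bad project (every item scoring 1 after reversal) totals the paper's baseline of 20, a fully reported good project including a reaudit totals the maximum of 100, and an entirely unreported project (all D/K) scores nothing.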

Main outcome measures

Reliability

Using Kendall's Tau-b [5], the correlation between the results from time 1 and time 2 was calculated for each question. Using Pearson's correlation coefficient [15], the correlation was calculated for the sums of scores for each respondent from time 1 and time 2.

Validity

There was no external criterion for the quality of audit reports available against which to measure the validity of the instrument. In the absence of an external criterion of validity, the scale totals were tested for validity by correlating them with respondents' answers to single global questions about the quality of the project and the worthwhileness of the aims. To avoid a pre-test effect, the time 1 respondents were excluded from the correlations with global indicators carried out at time 2.

A bad project with a good report would score a 1 in each category. This made a baseline of 20 for both an audit and a reaudit with a full report [see questions 16(i), 16(ii) and 17 (Appendix)]. However, a bad project with an incomplete or unclear report could score less than 20. A good project with no report would score nothing. For a good project with a complete report, the maximum score was 100 if the project included a reaudit. Projects with a complete report not including a reaudit could score a maximum of 96 (project 2, which was ranked highest, was in fact the only reaudit). Assuming the evaluator to be fully trained, the completeness of the report is measured by the number of D/K responses, which are solely a reflection of missing information.

Results

Figures 2 and 3 show the shape of the data obtained.

Figure 2 Distribution of scale totals for time 1 and time 2 (n = 11).
Figure 3 Boxplots of scale sums for the three projects.

Figure 2 shows the distribution of scale totals aggregated for all three projects. Totals were slightly higher at time 2. Figure 3 shows boxplots of the scale sums for the three projects individually. The central box shows the range of the middle 50% of values, with the dark line indicating the median. The whiskers indicate the range of the upper and lower quartiles. The boxplot is a simple means of showing the inter-observer agreement between the 27 observers; in this case it suggests considerable variation.

Reliability

The answers to three questions showed positive correlation between time 1 and time 2 at significant levels (P < 0.01) (Table 1). The answers to all but one question were positively correlated.

To improve the reliability of the scale, it is necessary to measure how strongly the scores for each item correlate with the scale total. To do this fairly, the score for each item is correlated with the scale total minus the value for that item. D/K and N/A were scored zero.

The low correlation for individual items between time 1 and time 2 implies that there was little memory effect. Although the low correlation of individual items does imply considerable variation in interpretation, most items and the scale totals were positively correlated. The low correlation level may have resulted from using only one project for test and repeat test. If several projects had been rated by each respondent, there would have been more variability in quality and hence more consistency between the first- and second-time ratings for each project.

The item-total correlations suggest that the three questions about reaudit fitted less well into the scale (only one of the projects was a reaudit). The only negative test-retest correlation was for the reaudit item 16b. This may imply the need for a separate scale for reaudits. The answers to the question about the prioritization of the topic also stood out as less strongly related to the scale, almost certainly because the majority of questions were about methodology and impact. The one question on identification of areas requiring educational input was also less strongly related to the total. More research is required to determine why.

For the full 81 cases, correlations of item values with the scale total are shown in Table 1. The items should have an association significantly greater than zero with the scale total to be included in the scale. The association should preferably be greater than 0.2 [16].

Table 1 Correlations between the first-time and second-time answers to individual questions (columns 2 and 3) and item-total correlations at time 2 for a group of 81 (columns 4 and 5)

                                               Reliability, time 1 to time 2 (n = 11)   Item-total correlations (n = 81)
Item                                           Tau-b    Significance                    Tau-b    Significance
1. Priority topic                              0.54     0.07                            0.18     0.04
2. Problem description                         0.78     0.006                           0.29     0.001
3. Audit question answered                     0.45     0.13                            0.33     <0.001
4. Fair interpretation                         0.15     0.58                            0.39     <0.001
5. Comparison with criteria                    0.29     0.28                            0.40     <0.001
6. Evidence-based                              0.18     0.52                            0.30     <0.001
7. Methods description                         0.43     0.13                            0.38     <0.001
8. Redundant data                              0.25     0.35                            0.29     0.001
9. Suitable sample size                        0.53     0.04                            0.46     <0.001
10. Correct statistical methods                0.77     0.006                           0.24     0.004
11. Collaboration                              0.63     0.03                            0.13     0.14
12. Identified useful change                   0.06     0.83                            0.34     <0.001
13. Implementation of change                   0.31     0.28                            0.47     <0.001
14. Involvement in change                      0.37     0.18                            0.43     <0.001
15. Change description                         0.62     0.03                            0.31     <0.001
16. Reaudit                                    0.21     0.43                            0.16     0.05
16b. No reaudit but needed one                 -0.43    0.12                            0.26     0.003
17. Reaudit showed more cost-effective care    0.03     0.92                            0.11     0.21
18. More patient satisfaction                  0.53     0.05                            0.26     0.002
19. Educational needs clear                    0.15     0.61                            0.12     0.14
20. Exit strategy identified                   0.82     0.003                           0.33     <0.001

Fifteen items had an association with the scale total significantly greater than zero at the 1% level. These items were retained (they all had an association greater than 0.2) and the scale total recalculated.

The correlation between the scale totals for the 15-item scale at the first and second tests was calculated using Pearson's correlation coefficient (r = 0.63, P = 0.07). This level is ranked as modest [17], but the sample was small. This, together with the large interquartile range (Figure 3), suggests that interpretations of questions varied.

Validity

The scale total was correlated with the global assessments at time 2 (n = 69). The correlation of global indicators with the sum was significant for both at P < 0.001. For quality of methods Tau-b was 0.56, and for worthwhileness of aims Tau-b was 0.50. The association between the recalculated scale total and the global evaluations increased (Table 2). As the accepted level for this association is between 0.4 and 0.8 [16], the scale passed this test as a valid measure of the quality of an audit project as expressed in the project report.

A comment received with the completed scales included a statement of uncertainty about the meaning of question 20: what was an exit strategy from an audit project? One respondent expressed uncertainty about the use of the D/K option: did it mean something different from neither agreeing nor disagreeing?
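The rank-based association used above for the global-indicator checks can be illustrated from first principles. This sketch computes Kendall's Tau in its simple form without tie correction (Tau-a); the study reported the tie-corrected Tau-b, which differs only when tied ranks occur.

```python
# Kendall's Tau (Tau-a, no tie correction): concordant minus discordant
# pairs, divided by the total number of pairs.
from itertools import combinations

def kendall_tau(x, y):
    """Rank correlation between two equal-length sequences, assuming no ties."""
    conc = disc = 0
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        s = (xi - xj) * (yi - yj)   # same sign of difference => concordant pair
        if s > 0:
            conc += 1
        elif s < 0:
            disc += 1
    n = len(x)
    return (conc - disc) / (n * (n - 1) / 2)
```

Identical orderings give 1.0, fully reversed orderings give -1.0, and values near the study's 0.5-0.6 would indicate the moderate-to-good agreement reported between scale totals and global indicators.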

Table 2 Increase in association of scale total with global indicators when the five items were removed from the calculation (n = 69)

                          Tau-b old                  Tau-b new
                          Value    Significance      Value    Significance
Quality of methods        0.56     <0.001            0.60     <0.001
Worthwhileness of aims    0.50     <0.001            0.51     <0.001

Table 3 Significance of differences between sums of scores for pairs of projects using the Wilcoxon matched-pairs test (n = 27 for each project)

            Significance of difference between    Significance of difference between    Significance of differences between     Significance of differences between
Projects    scale totals                          scale medians                         values for global methods indicator     values for global aims indicator
1:2         0.12                                  0.30
2:3         0.001                                 0.008                                 0.03                                    0.02
3:1         0.03                                  0.06

Table 4 Ranking of projects' quality using sums of scores and medians of scores

Project          Medians (n = 81)    Sums (n = 81)
2 (mean rank)    2.33                2.41
1 (mean rank)    2.09                2.06
3 (mean rank)    1.57                1.54
Kendall's W      0.19                0.20
Significance     0.005               0.005

Distinguishing between projects

In order to assess whether the instrument could distinguish between projects, the Wilcoxon matched-pairs test was used to identify differences between the quality of projects using the 20-item scale. Sums and medians were compared in this way. The greatest difference between project scores for scale medians and totals was for projects 2 and 3 (Table 3). Global indicators showed similar differences. Projects were given a strikingly large range of sums of scores. Project 2, which was selected as the highest quality, had the smallest range of scores, and hence the greatest consensus about its quality (Figure 3). Table 3 shows that there was a significant difference only between projects 2 and 3.

To rank all three projects in order of quality, scale totals (20 items) and medians were compared using Kendall's W. Table 4 shows the result: project 2 was ranked highest, with project 1 next and finally project 3.
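Kendall's W (the coefficient of concordance used above to rank the three projects across raters) can be computed from rank sums. A from-scratch sketch, without tie correction, under the standard definition W = 12S / (m^2(n^3 - n)), where S is the sum of squared deviations of the rank sums from their mean:

```python
# Kendall's W: agreement among m raters ranking the same n items.
def kendalls_w(ratings):
    """ratings: list of raters, each a list of scores for the same n items.
    Returns W in [0, 1]; 1 = perfect agreement. No tie handling."""
    m, n = len(ratings), len(ratings[0])
    # convert each rater's scores to ranks (1 = lowest score)
    rank_rows = []
    for row in ratings:
        order = sorted(range(n), key=lambda i: row[i])
        ranks = [0] * n
        for r, i in enumerate(order, start=1):
            ranks[i] = r
        rank_rows.append(ranks)
    col_sums = [sum(rr[i] for rr in rank_rows) for i in range(n)]
    mean = m * (n + 1) / 2                      # expected rank sum per item
    s = sum((c - mean) ** 2 for c in col_sums)  # spread of rank sums
    return 12 * s / (m * m * (n ** 3 - n))
```

Two raters in perfect agreement give W = 1; two raters in complete disagreement give W = 0. The study's W of around 0.2 therefore indicates only weak concordance among the 27 raters, consistent with the wide interquartile ranges in Figure 3.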
Discussion

Usefulness of the instrument

While consensus is important in audit, groups can suffer from a groupthink bias that could prevent needed change. A further problem is that groups in different settings may use different rules for evaluating their audit. A recent instrument [13] is dependent on group consensus to support it. The current instrument is intended to provide a more objective method of comparing audit projects with each other. It shows potential but needs further development to improve test-retest reliability. Guidance in interpreting the questions is needed to reduce the variability in response; for example, 'much' and 'appropriately involved' need to be defined. Used by audit facilitators to assess their trust audit projects, the instrument could be used to decide the content of an education programme targeted on specific features of clinical audit and specific groups of staff.

To deal with reaudits and audits in the same scale is a problem, as implied by the lower correlation of the questions on reaudit with the scale totals. To distinguish a project as a reaudit, a yes/no rather than a Likert-type question is needed. A separate scale or subscale would allow this.

A possible model for applying the instrument

A successful approach to assessing audit projects, one that helps to indicate where the instrument tested may be lacking, comes from Lough, McKay and Murray [13], who developed an audit marking schedule. This is for use in a marking system involving three levels of assessment arranged to filter out the worst projects for reconsideration and, at the highest level,

resubmission. The marking criteria were developed through focus group and consensus methods. Cells of three assessors discussed their marks with another cell of assessors who marked the same project. Thus the assessors were trained, and the reliability of the instrument must depend on this. Groups of between one and three markers, and pass/referral criteria, were tested to find the most sensitive and specific combination: the one passing the greatest number of good projects and referring the greatest number of bad projects first time. Three markers, with a pass from each required to pass overall, was found to be the most sensitive and specific combination.

This instrument works because it is implemented in a system that incorporates reliability checks. Marking is done twice only for projects referred at the first level; those that pass the first time are not re-marked. Although appropriate in an educational setting, this may be too lenient as an evaluation of the effectiveness of funding, or of the quality of a local audit programme, and too high-level to compare strengths and weaknesses at the process level in order to identify improvements systematically. Training of assessors, provision of supporting information, double and treble marking, and the setting of pass levels all help to improve reliability in this case. The scale tested could be implemented in individual settings in a similar way, by specifying locally appropriate pass marks, in order to reduce the number of markers required.

A profile rather than an index?

The scale tested was composed of a number of related dimensions: report quality, worth of project aims, soundness of methodology, and the extent of the benefits accruing. Although they are related, as implied by the correlation tables, it may be more meaningful to treat these dimensions separately as subscales, and to take a project profiling approach rather than attempting to produce a global quality index. The phrasing of individual questions could be improved by discussion of the topic and the current scale with groups of audit facilitators. The sensitivity of the scale might be improved by using a 7-point rather than a 5-point range. A global question on the impact of the project could be used to validate the questions on this area. A global question on the quality of the report, its completeness and clarity, would also be a useful addition, perhaps as a filter question to select valid reports. Questions testing audit facilitators' knowledge could be used to exclude those with inadequate knowledge of audit. Explanatory information about each question would improve the consistency of interpretations.

Conclusions

The wide range of responses means further research is needed to measure the consistency of audit facilitators' interpretations, perhaps comparing a trained group with an untrained group, and using more audit project reports. The ideal number of markers for a valid and reliable assessment needs further research. The instrument needs to be more robust to distinguish between projects using a smaller number of people. Supporting explanatory information for each question, for example on the words 'much' and 'appropriate', would help here. Because respondents may tend to mark consistently low or high, if the instrument were further developed, markers may need to be calibrated to produce comparable results. To test whether markers do consistently mark differently from each other, a larger number of projects would need to be scored by a smaller number of markers. To enhance the comparability of results further, audit reports could be pre-tested for completeness and audit facilitators tested to ensure that they used the intended interpretations of questions.

There may be a need for a separate scale for reaudits. Educational impact is distinct from project impact generally. It may be more meaningful to treat the selection of projects and aims, methodology and impact separately as subscales and take a project profiling approach rather than attempting to produce a global quality index. The quality of the report can be measured by its completeness using this scale: the larger the number of D/K responses, the less complete the report. Project quality can be assessed only for complete reports.

Acknowledgements

The author would like to thank H. Gilmour and I. Crombie for comments on this paper, and all the audit facilitators who piloted the instrument. The project was supported by a grant from the Clinical Resource and Audit Group. The views expressed are the author's, not those of the funding body.

References

1. National Centre for Clinical Audit. Evidence to Support Criteria for Clinical Audit. London: NCCA, 1996.
2. Millard A. Perceptions of clinical audit: a preliminary evaluation. J Clin Effect 1996; 1: 96-99.
3. Sindhu F, Carpenter L, Seers K. Development of a tool to rate the quality assessment of randomised controlled trials using a Delphi technique. J Adv Nursing 1997; 25: 1262-1268.
4. Suchman EA. Evaluative Research: Principles in Public Service and Action Programmes. New York: Russell Sage, 1967.
5. Robson C. Real World Research. Oxford: Blackwell, 1993.
6. Lord J, Littlejohns P. Development of an instrument to assess staff perceptions of the impact of trust-based clinical audit programmes. J Clin Effect 1996; 1: 83-89.
7. Walshe K, Spurgeon P. Clinical Audit Assessment Framework. Handbook Series 24. Birmingham: HSMC, University of Birmingham, 1997.
8. National Centre for Clinical Audit. Information for Better Healthcare. London: NCCA, 1997.
9. Bhopal R, Thomson R. A form to help learn and teach about assessing medical audit papers. Br Med J 1991; 303: 1520-1522.

10. Crombie IK, Davies HTO. Towards good audit. Br J Hosp Med 1992; 48: 182-185.
11. Hopkins A. Clinical audit: time for a reappraisal? J R Coll Phys Lond 1996; 30: 415-425.
12. Walshe K, Coles J. Evaluating Audit: Developing a Framework. London: CASPE Research, 1993.
13. Lough JRM, McKay J, Murray TS. Audit and summative assessment: system development and testing. Med Educ 1997; 31: 219-224.
14. Mann T. Clinical Audit in the NHS. Using Clinical Audit in the NHS: A Position Statement. Leeds: NHS Executive, 1996.
15. Rust J, Golombok S. Modern Psychometrics, the Science of Psychological Assessment. London: Routledge, 1989.
16. Streiner DL, Norman GR. Health Measurement Scales: a Practical Guide to their Development and Use. Oxford: Oxford University Press, 1995.
17. Cohen L, Holliday M. Statistics for Social Scientists. London: Harper & Row, 1982.

Accepted for publication 28 October 1998
