Introduction. Richard Jeanneret. Comments on the standards for educational and psychological testing

Introduction

Richard Jeanneret

As you may recall, the revision to the 1999 Standards for Educational and Psychological Testing is underway, and requests for recommendations on topics needing revision were solicited by APA. In July, Lois Tetrick sent a notice to all SIOP members requesting their input on revision issues. Additionally, a revision recommendations task force was assembled with the following members: Winfred Arthur, Jose Cortina, Marilyn Gowing, Dick Jeanneret (co-chair), Jerry Kehoe, Jim Outtz, Bob Ramos, Paul Sackett, Suzanne Tsacoumis, and Shelly Zedeck (co-chair). The task force was charged with responsibilities for receiving member input and conducting a comprehensive review of the Standards to develop a set of SIOP revision recommendations. This fall the task force completed its work and revision recommendations were submitted to the Standards revision committee. The revision recommendations submitted to APA are as follows:

Comments on the Standards for Educational and Psychological Testing

General Comment Regarding the Standards: Given the increasing number of testing-related standards (both inside and outside the US) that have surfaced or are underway, the revision process should be sensitive and (as appropriate) responsive to these developments. Examples of topics that should be evaluated include: adaptive testing, computer-based and internet-delivered testing, assessment centers, and performance testing. (Critical)

Careful consideration should be given to the definition and application of the term "construct" as it is used throughout the Standards. There appears to be a lack of scientific rigor communicated by the Standards as to what evidence is required to support the existence of a construct. (Critical)

COMMENTS ON APA STANDARDS - PART I

Part I General Comments. Concerning the Tests as Measures of Constructs (p. 5), there needs to be a clear and robust delineation of the predictor construct and the predictor method. (Critical)

Chapters 1 and 2 should be switched around; that is, present reliability before validity. (Improvement)

Validity (Chapter 1)

When discussing the interpretation of validity and the use of multiple tests, describe the strategies of compensatory models and multiple cutoffs (and how the validity is established for the latter strategy and its components). (Improvement)

Discuss the importance of the concept of construct-irrelevant variance (e.g., Messick 1995). It is mentioned in Standard 7.2 in the context of subgroup differences but it is a much broader issue. (Very Important)

Broaden the term validity generalization to include transportability, synthetic validity/job component validity, and meta-analysis. (Critical)

Reliability and Errors of Measurement (Chapter 2)

Discuss the value of corrected reliabilities. (Improvement)

Discuss what might be the best reliability estimate as a function of use of test/criterion and the intent of the measure (e.g., Cronbach's alpha is not useful if the construct being measured is unknown or multidimensional). (Improvement)

Discuss use of meta-analyses as a basis for reliability estimation. (Improvement)

Test Development and Revision (Chapter 3)

Consider discussing: what to do when you cannot cross-validate; empirical vs. rational keying; when/how often to revise test/performance measures; threats to validity (e.g., self-report measurement). (Improvements)

Discuss implications of unproctored internet testing for test developers. (Critical)

Scales, Norms, and Score Comparability (Chapter 4)

Discuss measurement invariance and how it is established. (Improvement)

Discuss rank ordering. (Very Important)

Test Administration, Scoring, and Reporting (Chapter 5)

Discuss online testing. (Improvement)

COMMENTS ON APA STANDARDS - PART II

7. Fairness in Testing and Test Use

Background. There are two places that address the issue of whether subgroup mean differences alone can be interpreted as signaling unfairness:

P. 74: "The idea that fairness requires equal passing rates has been almost entirely repudiated in the testing literature."

P. 75: "Most testing professionals would probably agree ..."

Including the words "almost" and "probably" suggests that there could be some chance that the presence of subgroup mean differences, absent other information, indicates unfairness. We recommend stronger language stating that one cannot simply draw such inferences; mean differences alone do not per se signal bias or unfairness. (Critical)

8. The Rights and Responsibilities of the Test Takers

No Comments

9. Testing Individuals of Diverse Linguistic Backgrounds

General Comment. This section, as well as Section 10 (Testing Individuals with Disabilities), does not provide real direction for someone seeking information in these areas. Part of the problem is that little relevant research is available. Even though small sample sizes preclude any significant research, we understand that these communities need to be represented. Nevertheless, it would be valuable if the information presented in these sections could be more directive. (Improvement)

10. Testing Individuals with Disabilities

General Comment. See comment for Section 9. (Improvement)

COMMENTS ON APA STANDARDS - PART III

11. The Responsibilities of Test Users

General Comments. As with Chapter 8, several standards here are redundant of APA's ethical principles. For example, Standards 11.6, 11.11, 11.14, among others, seem to be directly redundant of ethical principles. The Standards revision committee should scour these standards and eliminate those that are redundant of APA's Ethical Principles. (The larger issue here is the appropriateness of Standards that are, at root, ethical imperatives. The recommendation here is to reconsider whether ethically-based Standards are appropriate and, if not, minimize them.) (Very Important)

P. 112, first paragraph. Please keep in the following sentence: There are circumstances in which selection based exclusively on test scores may be appropriate. For example, this may be the case in pre-employment screening. (Keep in)

P. 112, second paragraph. In the discussion of test users defending their testing practices, please include the concept of return on investment. As an example, based on Defense Department (DoD) data, for every sixty thousand dollars ($60,000.00) spent each month on pre-admission candidate testing at the Baghdad Police Academy, there was a very real savings of approximately 1.8 million dollars that would have been wasted in fruitless efforts to turn unsuitable applicants into satisfactory law enforcement officers. Thus, for every dollar spent on the testing program, DoD saved $30, or a 30 to 1 ROI. (Improvement)

Standard 11.3. Please change to read, "those individuals who have the training, professional credentials and/or experience necessary." (Many local government human resource managers and public sector assessment professionals with training and experience, but not necessarily professional credentials, rent tests from professional associations for fire and police testing programs. They are the test users.) (Improvement)

Standard 11.6. Please keep in the sentence stating, "Feedback in the form of a score report or interpretation is not typically provided when tests are administered for personnel selection or promotion." (Keep in)

Standard 11.7. Please keep in the last sentence stating, "When tests are involved in litigation, inspection of the instruments should be restricted to the extent permitted by law to those who are legally or ethically obligated to safeguard test security." (Keep in)

Standard 11.8. This standard is trivial in that it is nothing more than the point that test users should be law-abiding. It could be eliminated with no loss. (Improvement)

Standard 11.16. The Standard mentions "their modes of test administration." The comment should include a discussion of the impact of on-line testing. (Issues of online testing are emerging based, to a great extent, on wide-ranging practices. Many psychologists are concerned about questionable practices based on issues of proctoring, access, feedback, etc. It appears this is now its own issue and needs to have some language addressed specifically to the professional issues associated with it.) (Critical)

Standard 11.21. Keep in the sentence stating, "Automated, narrative reports are not a substitute for sound professional judgment." Sophisticated simulations (assessment centers on computer) have been developed by a number of test publishers and they are being used to make high-stakes promotion decisions for organizational leaders. These simulations rely on some automated scoring and reports. The APA Standards committee may want to request a viewing of the new sophisticated line of products such as Aon's LEADeR. (Keep in)

Standard 11.19. The comment should address un-proctored testing. (See comment re Standard 11.16.) (Critical)

12. Psychological Testing and Assessment

Chapter title. This title is too broad and could be improved by including some reference to clinical or personal applications. (Improvement)

Background.

The first paragraph should probably also mention biodata tests that measure counterproductive work behaviors. (Improvement)

Consider a comment on the potential misuse of certain types of tests, particularly in the employment selection context. (Improvement)

13. Educational Testing and Assessment

No Comments

14. Testing in Employment and Credentialing

P. 151, Col 1, Background, first paragraph. Add a sentence indicating that employment testing is also used to eliminate those with counterproductive work behaviors (P. 152 makes a reference to negative selection or screening out; P. 158 mentions counterproductive work behaviors). (Improvement)

P. 152, Col 1, Short-term vs. long-term focus. Please retain the sentence stating, "Concerns about changing job tasks and job requirements also can lead to a focus on characteristics projected to be necessary for performance on the target job in the future, even if not a part of the job as currently constituted." This is an important sentence. (Keep in)

P. 152, Col 1 - Col 2, Mechanical vs. judgmental decision making. In discussing the use of test information, one other use is to develop Flag Reports that are then used to target interview questions in a follow-up structured interview. (Improvement)

P. 152, Col 2, Ongoing vs. one-time use of a test. The last sentence is not clear. The key question is whether advance knowledge of test content impacts the candidate's performance unfairly, as opposed to "changes the constructs measured by the test." The latter phrase is true, but may not be understood by practitioners. (Improvement)

P. 153, Col 2, last sentence. Expand the last sentence to clarify the meaning of "operationalizing these domains." (Improvement)

Pp. 153-155. Somewhere in here it would be helpful to add some language addressing the distinction between the validity of tests used to make selection decisions and the appropriateness of the decisions themselves. The language of pp. 153-155 really focuses on the test itself as the focal object of validation. It might be helpful to be explicit about the distinction between this view and a potentially larger view of validation, or evaluation, that focuses on the whole selection system, including hiring decisions, etc. The purpose of this comment is simply to broaden the discussion a bit to provide a clearer context for the focus on tests themselves. (Improvement)

P. 155, Col 1, last paragraph. Keep in the following sentence: "A meta-analytic integration of this research can form an integral part of the strategy for linking test information to the construct domain of interest." (Keep in)

P. 155, Col 2, last paragraph. Please add ROI, which is different than utility analysis evaluation. Also, decisions about test use are influenced by considerations of safety (workplace violence) and security (anti-terrorism). (Improvement)

P. 158, Col 1, middle paragraph and Standard 14.17. Although widely accepted, this paragraph is really a social policy, not a testing standard. In its broadest sense one could easily imagine a different social policy that allows bona fide gatekeepers to set credentialing standards that are either normative or that help modulate the distribution of professionals who are credentialed. Of course, this changes the meaning of the credential, but the point is that this is a social policy issue, not a testing standard issue. (Improvement)

Standard 14.1. Don't use the language of objectives. Replace "objectives" with "purpose of intended use." The more substantive point about this standard is that it is not always the case that validation efforts should evaluate the extent to which purposes are achieved. The point is that some purposes of the use of testing for employment are irrelevant to the meaning of the test scores. For example, an organization may implement a test to take the place of a subjective interview. That's a purpose, or objective, that is not relevant to the meaning of the test score, and the validation effort would not be obligated to evaluate the extent to which all interviews were replaced by the test. (Improvement)

Standard 14.4. Relating back to the comment above about differential prediction, it would be helpful if this standard made the additional point that the criterion(a) chosen to evaluate the validity of a particular predictor should be a criterion(a) that has theoretical relevance to the predictor. For example, it is not useful to evaluate the validity of a predictor measuring teamwork orientation against a criterion assessing, say, work speed. (Improvement)

Modify the last sentence to state, "There is not a clear need for job analysis to support criterion use when measures such as absenteeism, turnover, or workplace safety/security are the criteria of interest." (Improvement)

Standard 14.6. Additional language should be added making it clear that the judgment of correspondence should be based on those factors that are likely to moderate validity and is not required to consider factors unlikely to moderate validity. (Very Important)

Standard 14.8. Somewhere, maybe here, it would be helpful to have language that tackles the issue relating to the assumed accuracy of subject matter expert ratings as part of content validation (or, for that matter, job analysis in general). The point is, job SMEs are not likely to be equally accurate in their ratings of, say, the importance of job knowledge requirements and the importance of mathematical reasoning. At root, this issue has to do with the degree to which job and test content can be specified to accurately reflect the intended underlying construct, if any, and at the same time enable job SMEs to make accurate judgments about it. (The larger issue here is the need for an enhanced treatment of content validation and the limitation it imposes on validity inferences.) (Very Important)

Standard 14.9. This Standard seems fine, but the Comment is too restrictive. A close link between test content and job content may be established not only by job samples or job knowledge, but also by resemblance between the specific skills and abilities assessed by the test and the specific skills and abilities required by the job, whether or not there is close resemblance between the test as a whole and the job as a whole. See for example Section 14 (6) Prior training or experience. (Improvement)

General Comments. Standards regarding the development and validation of predictors place too much emphasis on entry-level selection to the exclusion of a discussion of predictors used in making promotion decisions. (Improvement)

Standards regarding combining predictors need to include a discussion of weighting of predictor components and predictor redundancy. (Improvement)

This chapter should include a discussion of predictor format and medium as important factors in predictor choice. (Improvement)

We need an equivalent of Standard 12.1. People should not be engaged in employment testing unless they have been trained in IO psychology. Just as certain credentials are required for clinical or school testing, so are they required for employment testing (although Standard 11.3 appears to care for this). (Improvement)

15. Testing in Program Evaluation and Public Policy

No Comments