Creating Multiple Cohorts Using the SAS DATA Step
Jonathan Steinberg, Educational Testing Service, Princeton, NJ

ABSTRACT

The challenge of creating multiple cohorts of people within a data set, based on one or more common characteristics, is driven by the nature of the data that is collected and how it is structured. A major benefit of establishing such a data set is the ease and flexibility with which you can statistically examine differences among respondents using various indicator variables, such as test dates, particularly if the data layout is consistent over time. If it is not, the data analyst may need to spend additional time creating such a layout.

A standardized test may have several administrations during the course of a calendar year. The SAS frequency procedure, PROC FREQ, could easily identify patterns in the data. However, while some candidates will take a test once, others may take the same test multiple times. Each time a candidate takes a test, a record is appended to a database with the score, along with an indicator of whether or not the respondent is a repeat test-taker. An in-depth cohort analysis would focus on the patterns of testing dates, including test-retest behavior, as the time between testing dates may affect performance.

There are two important issues to address at the outset. Multiple years of data may be needed to ensure adequate sample sizes for certain analyses. Additionally, the combination of administration date and repeat test-taking behavior can lead to a large series of frequency distributions that would need to be organized and interpreted, which may not be the most efficient way to start the cohort analysis.

This paper describes two DATA Step processes that are required before starting work on such a project, with the aim of improving the overall efficiency of the analysis. The first process is to create a single row for each respondent with the test score and repeater status for each possible administration.
This requires combining several small files into one large file. The second process creates a single variable of repeater status across administrations. It is created within the new file so that testing dates, notably repeats, can be identified. This becomes the basis for the cohort analysis. This paper is intended for those with moderate experience in DATA Step processing.

INTRODUCTION

In general, when data is stored so that a date reference is a separate variable, the analyst may run into difficulties if the scope of work involves restructuring the data to look at patterns of test-taking behavior over time. The TABLES statement within PROC FREQ can easily generate counts of test-takers by a given date. However, that would not show patterns across multiple testing administrations.

An issue that often surfaces when working with standardized test data is how to incorporate repeat test-taking behavior into the analysis. Repeater status is a variable often included in such data sets, since most testing programs have many administrations during the course of a calendar year. Some candidates will take a test once while others will take the same test several times for various reasons, which can sometimes confound the results of the analysis. Examining the combination of testing dates and repeat test-taking behavior therefore requires more work than running a single PROC FREQ.

The basis for this paper came from an attempt to explore patterns of repeat standardized test-taking over a rolling three-year period, identified by administration dates. The procedure by which the data set was rearranged will be described. The extraction of the test score information that follows is a relatively routine procedure using the DATA Step. Finally, it will be shown that more complex DATA Step processing is necessary to create time-dependent respondent profiles of repeat test-taking status.
The end result is a simplified research question, for which the results from a single PROC FREQ can direct future analyses.
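As a point of comparison, the date-by-date counts that PROC FREQ produces directly can be sketched as follows. This is a minimal illustration under stated assumptions, not the author's code: the data set name test1 and the variables idnum, month, year, and repeat follow the names used in the code later in this paper, and the five DATALINES rows are a small simulated sample.

```sas
/* Small simulated sample; rows are illustrative only */
data test1;
  input idnum month year repeat $;
  datalines;
6 3 2003 Y
6 4 2003 Y
8 4 2005 N
8 9 2005 Y
9 11 2002 Y
;
run;

/* Counts by administration date, then by date and repeater status */
proc freq data=test1;
  tables year*month / list nocum;
  tables year*month*repeat / list nocum;
run;
```

The output gives one count per administration date (and per status within date), but it does not link the two records for respondent 6 or respondent 8 to each other, which is the limitation the restructuring described in this paper addresses.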

CREATION OF INDIVIDUAL TEST INFORMATION PROFILES

Each time a respondent takes a test, a record is appended to the database with a unique identifier so that repeater status can easily be determined, along with possible other background information. The file is sorted by respondent identification number, calendar year, and month in ascending order. Table 1 displays the file layout along with simulated data.

Table 1: Original File Layout with Simulated Data

    ID #   Month   Year   Repeat Status   Test Score   Form
       1      11   2002   N                      179   A
       2      11   2002   N                      176   A
       3      11   2002   N                      180   A
       4      11   2002   Y                      163   A
       5      11   2002   N                      178   A
       6      11   2002   N                      161   B
       7      11   2002   N                      189   A
       8      11   2002   N                      182   A
       9      11   2002   N                      173   B
      10      11   2002   N                      157   A
      11      11   2002   N                      175   B
      12      11   2002   N                      187   A

The variable named Repeat Status has two possible values based on identification number: N for first-time test-takers and Y for repeat test-takers. However, not every respondent has a first-time record in the data file. Because this is a three-year snapshot of this particular testing environment, respondents may be classified as repeat test-takers because their first administration occurred before the beginning of the selected date range of the data file, as is the case in Row 4. Additionally, this test has multiple forms.

After the data has been sorted by the key variables, a counter is applied beginning with the first instance of each respondent identifier. Table 2 displays an example of the resulting data set.

    proc sort data=test1 out=test1_sorted;
      by idnum year month;
    run;

    data test1_unique;
      set test1_sorted;
      by idnum year month;
      if first.idnum then records = 0;  /* reset the counter for each respondent */
      records + 1;                      /* sum statement: value is retained across rows */
    run;
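To try the counter step on a few rows, a small input file can be simulated first. The following is a minimal sketch rather than the author's production code: the DATALINES rows are copied from respondents 6 and 8 in Table 2 and deliberately entered out of order, and the variable name form is an assumption, since the paper's code never names that column.

```sas
/* Simulated input; the variable name FORM is an assumption */
data test1;
  input idnum month year repeat $ score form $;
  datalines;
8 9 2005 Y 143 G
6 6 2003 Y 143 M
6 3 2003 Y 134 C
8 4 2005 N 154 A
6 4 2003 Y 138 D
;
run;

/* Sort by respondent and administration date */
proc sort data=test1 out=test1_sorted;
  by idnum year month;
run;

/* Number each respondent's records in date order */
data test1_unique;
  set test1_sorted;
  by idnum year month;
  if first.idnum then records = 0;
  records + 1;
run;

proc print data=test1_unique noobs;
run;
```

After the sort, the three records for respondent 6 are numbered 1 through 3 in date order and the two records for respondent 8 are numbered 1 and 2, matching the Records column in Table 2.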

Table 2: Sorted Simulated Data File with Counter

    ID #   Month   Year   Repeat Status   Test Score   Form   Records
       1      11   2003   N                      160   A            1
       2      11   2003   N                      170   B            1
       3       3   2005   N                      184   B            1
       4       6   2004   Y                      134   A            1
       5       6   2005   N                      200   A            1
       6       3   2003   Y                      134   C            1
       6       4   2003   Y                      138   D            2
       6       6   2003   Y                      143   M            3
       7       1   2004   N                      186   C            1
       8       4   2005   N                      154   A            1
       8       9   2005   Y                      143   G            2
       9      11   2002   Y                      137   A            1
       9      11   2003   Y                      129   B            2
       9       6   2004   Y                      145   A            3
       9       3   2005   Y                      134   D            4

This extract from the data file shows that Respondent 6 took this test on three separate occasions. Four key variables are extracted from the original database, based on the last instance of the respondent identifier: the administration date (month and year), respondent identifier, repeater status, and test score. This is accomplished using the KEEP= data set option on the DATA statement:

    data test1_unique2 (keep = idnum month year repeat score);
      set test1_unique;
      by idnum;
      if last.idnum;   /* keep only the last record for each respondent */
    run;

The next step is to create an individual data file for each administration date, containing only the respondent identifier, repeater status, and test score. For example:

    data test1_admin1102 (keep = idnum repeat score);
      set test1_sorted;
      where month = 11 and year = 2002;
    run;

    data test1_admin0103 (keep = idnum repeat score);
      set test1_sorted;
      where month = 1 and year = 2003;
    run;

Then, using the RENAME= data set option, a suffix identifying the particular administration date is appended to only the repeat status and score variables in the date-specific files. Two examples are listed below:

    data test1_admin1102 (rename = (repeat = repeat1102 score = score1102));
      set test1_admin1102;
    run;

    data test1_admin0103 (rename = (repeat = repeat0103 score = score0103));
      set test1_admin0103;
    run;

Once this has been completed for all administration dates, a data set consisting of only the identification number is created, again using the KEEP= option. The smaller files are then merged together using the unique respondent identifier as the matching variable. All of these files are already in idnum order because they derive from the sorted file, so no further sorting is needed before the merge.

    data test1_unique3 (keep = idnum);
      set test1_unique2;
    run;

    data test1_total;
      merge test1_unique3 test1_admin1102 test1_admin0103;
      by idnum;
    run;

Figure 3 displays a partial data layout after the merging process is completed:

    ID #   Repeat Status   Test Score   Repeat Status   Test Score
           (Nov 02)        (Nov 02)     (Jan 03)        (Jan 03)
       1                   .                            .
       2                   .                            .
       3                   .                            .
       4                   .                            .
       5                   .            N               168
       6   Y               137                          .
       7   N               164                          .
       8   Y               145                          .
       9   N               180                          .
      10   Y               134                          .
      11   N               176                          .
      12   N               121                          .
      13                   .            Y               153
      14                   .            N               200
      15   Y               153          Y               160
      16   Y               145                          .

It is evident from this small extract that the first four respondents did not take the test in November 2002 or in January 2003. This indicates that their first appearance in the data file occurred at a later administration date. Rows 6 through 12, as well as Row 16, have test data from November 2002 but not January 2003, with different values of repeat status. Rows 5, 13, and 14 have test data from January 2003 but not November 2002, again with varying values for repeat status. In Row 15, however, the respondent took this test in both November 2002 and January 2003 as a repeat test-taker, meaning his or her first testing date occurred prior to November 2002.

IDENTIFICATION OF REPEAT TEST-TAKING BEHAVIOR PROFILES

As displayed in Figure 3, the data is now arranged so that each unique respondent in the original file has a set of key variables grouped by administration date. For any administration date on which the respondent did not take the test, the key fields are blank (for repeat test-taking status) or missing (for test score). The creation of the repeater profile is done in two steps.
First, because repeater status is a character variable rather than a numeric one, the status letters are concatenated in administration order. The width of the resulting variable, named string, is equal to the number of administrations in the original data file.

    data test1_total2;
      set test1_total;
      /* one term per administration, in date order; three are shown here */
      string = repeat1102 || repeat0103 || repeat0303;

The concatenation operator (||) preserves the blank of any administration the respondent did not take, so, assuming each repeater status variable is a one-character field, column k of string corresponds to the k-th administration.

The second step is to use the FIND function to identify the column positions of four key test-taking dates. Each position is a numeric value from zero (indicating non-existence) to the number of administrations, and is stored in a new variable: the position of the N (if it exists), the position of the first Y (which marks the first repeat in the file), the second Y, and the third Y, if any or all exist in the data file.

      firstn  = find(string, 'N', 1);
      firsty  = find(string, 'Y', 1);
      secondy = find(string, 'Y', firsty + 1);
      thirdy  = 0;
      if secondy > 0 then thirdy = find(string, 'Y', secondy + 1);
    run;

FIND is case-sensitive by default, so the search strings must match the uppercase status codes stored in the data. The function begins at the first column; for the variables called secondy and thirdy, it begins one column after the previous repeat. The search for thirdy is carried out only when a second Y exists, because otherwise it would restart at column 1 and return the position of the first Y again.

Whenever these variables are nonzero, they are strictly ordered: if an N exists for a respondent in the data file, the value of firsty will be greater than that of firstn, the value of secondy will be greater than that of firsty, and the value of thirdy will be greater than that of secondy. A value of zero means that the corresponding event does not appear in the data file, and the ordering does not apply to it.

Figure 4 displays an example data set after this has been executed. It is worth noting the varying pattern of test-taking profiles, even among this subset of thirty observations. Most of these respondents took the test only once during the three-year period in the data file, but that first test date ranges from Date 2 to Date 17. Row 14 shows that the respondent took the test on three consecutive administration dates. Row 27 shows that the respondent took the test for the first time at Time 16 and then repeated at Time 19.
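The two profile-building steps can be tried on a three-administration toy file. This is a hedged sketch, not the author's program: the three rows are invented for illustration, the || operator, the uppercase search strings, and the guards against a zero start position are choices made for this sketch, and each status variable is assumed to be exactly one character wide so that column positions line up with administrations.

```sas
/* Toy merged file: three administrations, blank = did not test */
data test1_total;
  length idnum 8 repeat1102 repeat0103 repeat0303 $1;
  idnum = 1; repeat1102 = 'N'; repeat0103 = ' '; repeat0303 = 'Y'; output;
  idnum = 2; repeat1102 = ' '; repeat0103 = 'Y'; repeat0303 = 'Y'; output;
  idnum = 3; repeat1102 = ' '; repeat0103 = ' '; repeat0303 = 'N'; output;
run;

data test1_total2;
  set test1_total;
  length string $3;
  /* Step 1: concatenate status codes in administration order */
  string = repeat1102 || repeat0103 || repeat0303;
  /* Step 2: locate the N and up to three Y codes (0 = not present) */
  firstn = find(string, 'N');
  firsty = find(string, 'Y');
  if firsty > 0 then secondy = find(string, 'Y', firsty + 1);
  else secondy = 0;
  if secondy > 0 then thirdy = find(string, 'Y', secondy + 1);
  else thirdy = 0;
run;

proc print data=test1_total2 noobs;
  var idnum string firstn firsty secondy thirdy;
run;
```

For respondent 2 the concatenated value is a blank followed by YY, so the first repeat is found in column 2 and the second in column 3; respondent 1 has an N in column 1 and a single repeat in column 3. Declaring the status variables with a length of 1 is what keeps the column positions aligned with the administration order.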

Figure 4: Data File Extract after Creating Respondent Profiles

A variable called status is appended to the data set above and is created based on the results from building the respondent profile. The goal here is to summarize the possible profiles in a way that will help inform the future cohort analysis and try to simplify the research question for the analyst. There are seven possible classifications, formed in the following way:

    status = .;
    if firstn > 0 and firsty = 0 and secondy = 0 and thirdy = 0 then status = 1;
    if firstn = 0 and firsty > 0 and secondy = 0 then status = 2;
    if firstn > 0 and firsty > firstn and secondy = 0 then status = 3;
    if firstn = 0 and firsty > 0 and secondy > firsty and thirdy = 0 then status = 4;
    if firstn > 0 and firsty > firstn and secondy > firsty and thirdy = 0 then status = 5;

    if firstn = 0 and firsty > 0 and secondy > firsty and thirdy > secondy then status = 6;
    if firstn > 0 and firsty > firstn and secondy > firsty and thirdy > secondy then status = 7;

Table 5 summarizes each group according to the status variable, with one data example per group. The sample proportions, generated from a frequency distribution of simulated data, are used to examine repeat test-taking profile behavior patterns.

Table 5: Summary of Respondent Profiles with Sample Proportions

    Status   FirstN   FirstY   SecondY   ThirdY   % Sample
         1        7        0         0        0      80.80
         2        0       11         0        0       5.24
         3       16       19         0        0       2.01
         4        0        3         4        0       5.88
         5       14       16        17        0       0.50
         6        0        6         7       10       5.42
         7       13       15        16       18       0.15

STUDY RESULTS

Approximately 80% of this test-taking population took this test for the first and only time within the time frame of the database. That leaves the remaining 20% with varying degrees of repeat test-taking behavior. Most repeat test-takers (Groups 2, 4, and 6, about 16.5% of the sample, or roughly 86% of the repeaters in Table 5) first took the test before the start of the date range of the file and then repeated it one, two, or three times within that range, in approximately equal proportions. The remaining respondents took the test for the first time during the date range of the file and then mainly repeated just once (about 2%, Group 3), with a few repeating more often (less than 1% each in Groups 5 and 7).

IMPLICATIONS FOR FURTHER ANALYSES

According to the results from this study, the next steps for Group 1 are the most straightforward, as only a single test administration date defines the cohort, and the group is the easiest to explain since all of its members are first-time test-takers. The issue of test-taking behavior can now be explored for Group 1 through one PROC FREQ based on administration date. The sample size should be adequate for this group if the distribution of repeat test-taking status is similar to what was outlined in the study.
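The single PROC FREQ for Group 1 might look like the sketch below. It is an illustration under assumptions rather than the author's code: the data set name test1_profiles and its five rows are hypothetical, standing in for the profile file that holds firstn and status for each respondent.

```sas
/* Hypothetical profile file: one row per respondent */
data test1_profiles;
  input idnum firstn status;
  datalines;
1 7 1
2 7 1
3 12 1
4 0 2
5 16 3
;
run;

/* Group 1 only: counts of first-time test-takers by administration position */
proc freq data=test1_profiles;
  where status = 1;
  tables firstn / nocum;
run;
```

Because every member of Group 1 has exactly one record in the file, the value of firstn alone identifies the administration that defines the cohort, so a one-way table is sufficient.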
However, sample size concerns may arise for the other groups, despite the apparent self-explanatory nature of respondent membership for Groups 3, 5, and 7. Groups 2, 4, and 6 would be affected if a need arises to define the first testing date, which would mean extending the date range of the file. The use of hierarchical clustering based on frequency distributions of, for example, values of firstn or firsty may be needed to alleviate problems of insufficient sample size. As mentioned previously, multiple years of data may be necessary for this type of analysis, but sufficient testing volume may also be a requirement to proceed, given the example cohort proportions reported here.

The resulting data shown in this paper serves as the basis for asking questions such as:

(1) Does there tend to be a clustering of respondents, whether first-time test-takers or repeaters, at certain administration dates?
(2) What is the average change in test score if one repeats the test one, two, or three times?

(3) Given a three-year snapshot, does the effect on test performance differ when the period between repeat administrations varies from a few months to as much as three years?

CONCLUSION

The procedure described in this paper takes into account a historically consistent data layout as applied to a standardized testing program. For the intended purpose of analyzing multiple cohorts of respondents based on test-taking behavior, considerations such as date range, the inclusion or exclusion of repeat test-takers, and perhaps testing volume are primary concerns for the data analyst before proceeding with a project such as this. A moderate level of DATA Step processing is required to carry out the tasks described in creating the cohorts for analysis, as the KEEP= and RENAME= options are used, along with merging and a function for finding text within a concatenated variable.

The results shown through this analysis may or may not be typical of all testing programs or all data types. However, the procedure offers a straightforward way to create multiple cohorts for further analyses, as well as to identify possible areas of research. It aims to reduce the amount of output generated from what would otherwise be many uses of PROC FREQ and, more importantly, can reduce the amount of additional exploration time expended by the data analyst.

ACKNOWLEDGMENTS

The author would like to thank Ted Blew, Catherine Trapani, and Jennifer Minsky for their support and assistance in proofreading this paper.

REFERENCES

SAS 9.1.3 Online Help and Documentation, Cary, NC: SAS Institute Inc., 2004.

Cody, Ronald P., and Smith, Jeffrey K. (1991), Applied Statistics and the SAS Programming Language, Third Edition. Englewood Cliffs, NJ: Prentice Hall.

Feng, Ying (2006), PROC SQL: When and How to Use It?. Proceedings of 2006 NESUG Conference, Paper CC20.
Online at http://www.nesug.org/proceedings/nesug06/cc/cc20.pdf

Zhang, Rodger (2006), Creating a Report Showing Monthly, Quarter-To-Date and Year-To-Date Information Without Changing Date Parameters Monthly. Proceedings of 2006 NESUG Conference, Paper CC09. Online at http://www.nesug.org/proceedings/nesug06/cc/cc09.pdf

TRADEMARKS

SAS and all other SAS Institute, Inc. product or service names are registered trademarks or trademarks of SAS Institute, Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are registered trademarks or trademarks of their respective companies.

CONTACT INFORMATION

Your comments and questions are appreciated and encouraged. Please contact the author at:

    Jonathan Steinberg
    Research Data Analyst
    Center for Data Analysis Research
    Educational Testing Service
    Rosedale Road, Mail Stop 20-T
    Princeton, NJ 08541
    Telephone: (609) 734-5324
    E-mail: jsteinberg@ets.org