Decision Models for Placement of Students Based on State Test Scores in Grades 4 and 8 Gerald E. DeMauro December, 1999
Considerations are examined for interpreting statewide assessment results. This paper is intended to balance the expressed desire of local administrators for increasingly finer divisions of populations and item groupings with the requirements for reliability in a standards-based assessment system. Decision models are presented that proceed systematically from the highest levels of reliability to the lowest. In this way, the most reliable data form the foundation for decisions, while the least reliable data are examined in view of other sources of information.

Overview

The standard setting studies to determine proficiency levels in grades 4 and 8 in English Language Arts and in Mathematics, and on the New York State Regents Examinations in Comprehensive English and Mathematics A, used a procedure called item mapping. This procedure requires student scores and test item difficulty to be scaled together. On that common scale, students whose scores have higher scale values than an item's difficulty value have a greater than 50 percent chance of answering that item correctly; students whose scores have lower scale values than an item's difficulty value have a less than 50 percent chance of answering it correctly. Through a deliberative process of expert judgment, items are classified as representing no achievement, partial achievement, or full achievement of the standards. Because the items are on the same scale as student scores, this deliberative classification of test items yields the scale scores that demarcate the proficiency levels of the examinations. This process rests heavily on the definitions used to describe achievement of the Learning Standards.
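The 50 percent property described above follows from scaling students and items together under an item response theory model. As an illustrative sketch only, the one-parameter (Rasch) logistic model exhibits the property directly; the State's operational scaling may differ in detail:

```python
import math

def p_correct(theta, b):
    """Rasch model: probability that a student with scale value theta
    answers an item of difficulty b correctly."""
    return 1.0 / (1.0 + math.exp(b - theta))

# When the student's scale value exceeds the item difficulty, the
# probability of a correct answer exceeds 50 percent; when it falls
# below, the probability is under 50 percent; at equality it is exactly 0.5.
print(p_correct(1.5, 1.0))  # above 0.5
print(p_correct(0.5, 1.0))  # below 0.5
print(p_correct(1.0, 1.0))  # exactly 0.5
```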
The experts who participated in standard setting for the fourth and eighth grade tests were advised that Level 1 students demonstrated no achievement of the Learning Standards, while Level 2 students demonstrated either some achievement of each Learning Standard or full achievement of some, but not all, of the Learning Standards. Essentially, then, Level 1 students have not demonstrated success in meeting the Learning Standards. This recommends a placement that is qualitatively different from the current instructional program. Level 2 students, on the other hand, have quantitative deficits that might be addressed by quantitatively different programs, e.g., those that differentially stress identified weaknesses.

Multiple Measures in a Hierarchical Model

The statewide tests can never reveal more about the skills and deficits of students than local teachers and administrators can. However, the statewide tests do provide a single uniform measure across all school districts and classrooms, and they provide broadly diagnostic information that is most reliable when it is sampled over larger numbers of students and includes larger numbers of test questions. Interpretation should therefore begin with a local self study of the instructional program that examines:
- When each aspect of the standard or key idea, including each performance indicator is taught during the child's academic career;
- How much instructional time is devoted to each standard or key idea;
- How the acquisition of the standard or key idea is evaluated;
- How feedback is provided to the child on the evaluation;
- What the consequences of different levels of performance on local assessments are for the child in terms of implementing the standards;
- How the instructional program varies from building to building, from classroom to classroom.
The self study is a necessary first step in interpreting the results of statewide assessment. Without critical attention to the instructional program, the test results cannot be meaningfully interpreted. Beyond the self study, the reliability of any interpretation depends chiefly on two factors:
- The number of students in a group;
- The number of test questions on which decisions are made.
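The influence of group size on reliability can be illustrated with the standard error of a group mean score. This is a sketch with hypothetical numbers; actual scale-score standard deviations vary by examination:

```python
import math

def standard_error(sd, n):
    """Standard error of a mean score: sd / sqrt(n)."""
    return sd / math.sqrt(n)

sd = 40.0  # hypothetical scale-score standard deviation

# A classroom of 20 yields a far noisier mean than a district of 2,000,
# so the same observed difference is far less trustworthy.
print(round(standard_error(sd, 20), 2))    # 8.94
print(round(standard_error(sd, 2000), 2))  # 0.89
```

The same reasoning applies to the number of test questions: conclusions drawn from a single item rest on far less information than conclusions drawn from a whole scale.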
Too often, with the best intentions of deriving all possible information from test results, local administrators can make unreliable or wrong decisions about children, because a data base that is too small in terms of numbers of students or items can mislead them. For example, one item may show that the students have not mastered a certain concept, e.g., the main idea of a passage. In fact, that item might have a very high difficulty statewide and might be difficult for all children. Similarly, students who did poorly on a certain item might fare better on the test as a whole simply because that particular item was not as good a discriminator as other items tapping the same concept. Programs might also be redesigned in error because of the performance of a few students. Even though the test is designed to minimize extraneous factors, when low numbers of students are involved, the influence of such factors is much greater, and programs should be cautious about making large-scale changes on the basis of little evidence. Student performance analyses should proceed from the largest aggregation of students to the smallest:
- Statewide descriptive statistics such as means, frequencies, and standard deviations;
- Public or nonpublic descriptive statistics;
- Program-wide descriptive statistics, e.g., statewide special or general education results;
- Regional or resource-need-category descriptive statistics for the population of interest;
- District-level descriptive statistics for the population of interest;
- Building-level descriptive statistics for the population of interest;
- Classroom-level descriptive statistics for the population of interest;
- Student-level performance for the population of interest.
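The progression from the largest aggregation to the smallest can be sketched with descriptive statistics computed at successive grouping levels. The records, field layout, and scores below are hypothetical:

```python
from statistics import mean, stdev
from collections import defaultdict

# Hypothetical student records: (district, building, classroom, scale_score)
records = [
    ("D1", "B1", "C1", 658), ("D1", "B1", "C1", 671),
    ("D1", "B1", "C2", 640), ("D1", "B2", "C3", 702),
    ("D2", "B3", "C4", 625), ("D2", "B3", "C4", 689),
]

def describe(scores):
    """Descriptive statistics for one group of scale scores."""
    return {"n": len(scores), "mean": round(mean(scores), 1),
            "sd": round(stdev(scores), 1) if len(scores) > 1 else None}

# Proceed from the largest aggregation (statewide) to the smallest
# (classroom); each step is evaluated against the step above it.
for level in range(4):  # 0=state, 1=district, 2=building, 3=classroom
    groups = defaultdict(list)
    for rec in records:
        key = rec[:level] or ("State",)
        groups[key].append(rec[3])
    for key, scores in sorted(groups.items()):
        print(level, key, describe(scores))
```

Note how the standard deviation is undefined for a single-student group: the smallest aggregations simply cannot support the same inferences as the larger ones.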
Each step should be evaluated in terms of observed discrepancies with the higher steps. Interpretation of program-, district-, building-, and classroom-level information must take the self study results into account. The second dimension to be considered is the level of test aggregation. This progresses differently for different subject areas, but in general it should be considered from the largest aggregation (most reliable) to the smallest:
- Whole test, scale score or raw score;
- Reading Scale (where available);
- Standard Performance Indicator (where available);
- Content component of Learning Standard;
- Major concepts, e.g., main idea;
- Item.
The item-by-child level analysis is by far the least informative and most deceptive, while the whole-test-by-whole-state analysis is clearly the most reliable. By proceeding from the most reliable to the least reliable data, local administrators can decide how much credibility the data have. Statewide test results should also be weighed against other available measures:
- Results of other standardized examinations in the same year and on the same construct;
- Course grades in the appropriate subject in the same year;
- Other course grades that reflect on skills needed to respond to test questions;
- Local assessment results in the appropriate subject in the same year;
- History of standardized examination performance on the same construct.
Because there are scale differences in each of these measures, analyses of multiple measures may be facilitated as follows:
- Identify a period of time, e.g., two years, over which the performance of every child in the statewide test cohort can be tracked;
- Rank all of the students in the cohort from highest (1) to lowest (n, where n is the number of students in the cohort) on each measure;
- Note the range of ranks within each performance level on the statewide test;
- Rank the cohort on the Standards Performance Indicators, as well;
- Look for large discrepancies, e.g., large changes in ranks from one measure to the next, or for each individual measure compared to the student's average rank over all measures. For example, if, among 40 students in the cohort, a student ranks 12th on the overall English Language Arts scale but 8th on the Reading scale, the student may have greater difficulty in Writing than in Reading.
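The ranking steps above can be sketched as follows, flagging any measure on which a student's rank departs sharply from that student's average rank. The cohort, scores, and measure names are hypothetical:

```python
from statistics import mean

# Hypothetical cohort: each student's score on several measures.
cohort = {
    "A": {"ela_scale": 680, "reading": 672, "course_grade": 88},
    "B": {"ela_scale": 655, "reading": 690, "course_grade": 75},
    "C": {"ela_scale": 640, "reading": 610, "course_grade": 91},
    "D": {"ela_scale": 700, "reading": 665, "course_grade": 70},
}

def ranks_on(measure):
    """Rank students from highest (1) to lowest (n) on one measure."""
    ordered = sorted(cohort, key=lambda s: cohort[s][measure], reverse=True)
    return {s: i + 1 for i, s in enumerate(ordered)}

measures = ["ela_scale", "reading", "course_grade"]
rank_table = {m: ranks_on(m) for m in measures}

# Flag measures on which a student's rank differs from that student's
# average rank by two or more places (an arbitrary illustrative threshold).
for student in cohort:
    avg = mean(rank_table[m][student] for m in measures)
    for m in measures:
        if abs(rank_table[m][student] - avg) >= 2:
            print(f"{student}: rank {rank_table[m][student]} on {m}, "
                  f"average rank {avg:.1f}")
```

With these hypothetical scores, student C is flagged: a course-grade rank of 1 against an average rank of 3.0, the kind of discrepancy that warrants a closer look at the other sources of information.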
Particular attention should be given to variables on which the rank of a student is outside the range of ranks of the student's performance level on the statewide examination. For example, if a student scores at Level 2 on the Grade 4 English Language Arts examination, and there are 42 students in fourth grade in the district, the range of ranks for Level 2 students may be 7 to 15. Particular attention should then be given to any variables on which any Level 2 child ranks from 1 to 6 or from 16 to 42. The most persuasive evidence for student placement should be data from the most reliable source, on measures most focused on the State Learning Standards, and on data collected nearest the date of the administration of the statewide test. The self study should be very useful in explaining discrepancies in the ranked data. Particular attention should be given to differences in ranks between statewide test data aggregated to the level of the Learning Standards, such as the Standard Performance Indicators, and the overall scale scores on the statewide assessment. This will identify areas of particular strengths and weaknesses.

Intervention

Any student below a passing scale score (local or state) on the Regents examinations or below Level 3 on the statewide Grade 4 and Grade 8 English Language Arts or Mathematics examinations should receive some form of instructional intervention. As mentioned earlier, for students nearest the criterion score, the intervention is more quantitative in nature, requiring more instructional time for certain deficits. For students in Level 1 of the Grade 4 or Grade 8 English Language Arts or Mathematics examinations, a qualitatively different instructional program, in which the intervention may require entirely different strategies for instruction, may better suit the student's needs. The decision about the required degree of special intervention is a matter of local discretion, in accordance with Part 100 of the Commissioner's Regulations.
Multiple measures should both identify the intensity of the needed intervention and indicate the types of feedback mechanisms that should be in place to decide when the special instructional intervention has been successful and which future intervention or placement is most appropriate. For example, if the ranks analyses indicate that students in Level 2 of the Grade 4 English Language Arts examination have a history of strength in writing and a relative weakness on the reading scale, then feedback mechanisms should be built into the intervention to discern progress in reading and maintenance of skill levels in writing.

Using Strengths to Address Weaknesses

The multiple measures analyses should identify strengths as well as weaknesses. In the example above, writing can be used to improve reading skills when a program is designed in which students edit their own work, or when students are asked to write from their interpretation of reading passages. It is beyond the scope of this paper to suggest intervention models, but the multiple measures and assessment models lend themselves readily to effective intervention.

Profile Analyses

A more complex analysis is available by reviewing each student's performance on parts of the examination in terms of standardized differences from mean scores of certain populations. One way of accomplishing this is to compute the mean and standard deviation on each Learning Standard or key idea for the populations who scored exactly at the cutoff scores defining each performance level. Dividing each difference by the standard deviation standardizes these differences, making them comparable for the purposes of identifying areas of strength and weakness. For example, although all of the Standard Performance Indicators (SPI) range from 1 to 100, in actuality the scores of the students of the State may be bunched within a certain small range (small standard deviation) for one standard.
Hence, scoring below that range may indicate a particular deficit for a student, since the other students in the state or class who were exposed to the standard were more likely to score higher. The same student's SPI score on a standard in which the scores are more evenly scattered (larger standard deviation) may indicate less of a problem for that student but somewhat more of a problem for a group of students. Table 1 shows the means and standard deviations of the SPI for the Grade 4 and Grade 8 English Language Arts and Mathematics examinations.

Table 1 Standards Performance Indicators Means and Standard Deviations
Table 2 presents some fictitious data as an example of student-level analysis. The standardization (difference from the mean divided by the standard deviation) places all of the comparisons on the same scale. This enables an analysis of how far each student is from the profile of a student just minimally achieving Level 2, Level 3, or Level 4. The data in Table 2 show that this student is relatively strongest in Key Idea 2 (+0.86) and weakest in Key Idea 3 (-1.13).

Resource Needs Comparisons

Table 3 presents the means and standard deviations for the resource need categories on the 1999 Grade 4 and Grade 8 Mathematics and English Language Arts assessments. In an absolute sense, these data are not very informative, but they do provide a general description of how students statewide in the same resource need category fared on the four examinations. Again, standardized differences may be computed as shown in Table 2, using the means and standard deviations in Table 3 to aid the profile analyses.

Table 2 Analysis on Fictitious Student Data of Relative Strengths and Weaknesses
*(Student - Mean)/standard deviation

Table 3 Means and Standard Deviations of Standards Performance Indicators
Table 3 (continued)
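The profile analysis illustrated in Table 2 can be sketched directly: each key-idea score is compared with the mean and standard deviation of the students scoring exactly at a cutoff, yielding standardized differences on a common scale. The cutoff means, standard deviations, and student scores below are hypothetical; the operational values are those reported in Tables 1 and 3:

```python
# Hypothetical SPI means and standard deviations for students exactly
# at a performance-level cutoff score: {key idea: (mean, sd)}.
cutoff_profile = {
    "Key Idea 1": (62.0, 12.0),
    "Key Idea 2": (55.0, 14.0),
    "Key Idea 3": (70.0, 10.0),
}

# Hypothetical SPI scores for one student.
student = {"Key Idea 1": 65, "Key Idea 2": 67, "Key Idea 3": 58}

# Standardized difference: (student - mean) / standard deviation.
for key_idea, score in student.items():
    m, sd = cutoff_profile[key_idea]
    z = (score - m) / sd
    print(f"{key_idea}: {z:+.2f}")
# Key Idea 1: +0.25, Key Idea 2: +0.86, Key Idea 3: -1.20
```

As in the Table 2 example, the signed standardized differences identify the relative strength (Key Idea 2) and relative weakness (Key Idea 3) at a glance, and the same computation applies unchanged with the resource-need-category means and standard deviations of Table 3.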
Conclusion

The focus of this paper is on maximizing the information available from statewide assessments while containing the threats to reliability. Several models are presented for working down from the most reliable sources of data in order to make program inferences from the least reliable sources. More information will be provided in future papers to continue to provide a sound base for decisions for those responsible for educational programs.