Grade 8 Intermediate Level
Social Studies Examination--Data and Information Related to Standard Setting
A study performed for the New York State Education Department by
April 27, 2001
The New York State Board of Regents has established learning standards all students must meet to graduate from high school. One set of learning standards is for social studies. The standards pertain to:
Key ideas, performance indicators, and sample tasks further describe each learning standard. Standards are also broken down by educational level--elementary, intermediate, and commencement. To assess the extent to which students have met the learning standards, the New York State Education Department has developed a testing program. The content of the tests reflects accomplishment of the learning standards. For social studies, the State Education Department has developed a Grade 8 Intermediate Level Social Studies Test to reflect accomplishment of the learning standards at the intermediate level. Most students will take this test at the end of 8th grade. Schools must provide students who fail the test with special academic intervention programs.
Although scores for the test are placed on a numerical scale, essentially there are only three scores—not meeting the standards, meeting the standards, and meeting the standards with distinction. The test items have been developed by New York State teachers using professionally established procedures and have been pretested and field-tested on samples of students.
The purpose of the study described in this report is to obtain information that the State Education Department can use to establish scores that will classify test takers into the "does not meet standards," "meets standards," and "meets standards with distinction" categories. Setting cut-scores requires judgment. This study employs professionally established methods to quantify and summarize the judgments of experts related to how individuals who have met the learning standards will perform on the test.
The Grade 8 Intermediate Level Social Studies Test
The Grade 8 Intermediate Social Studies Test is a three-part test administered in two one-and-a-half-hour sessions. Test content is based on the intermediate-level key ideas and performance indicators found in the Learning Standards for Social Studies and the Social Studies core curriculum, developed and adopted by the Board of Regents. The three parts of the examination are as follows:
Items for the test were developed and pretested by a consortium of teachers, supervisors, and administrators from school districts across the State; Erie I BOCES staff; and State Education Department staff. All constructed-response, scaffold, and essay responses are scored holistically by trained teachers in their districts, using scoring rubrics.
A complete description of the examination, including test specifications and scoring rubrics, is given in a test sampler.
Data related to the performance standards for the test were obtained from a committee of experts. Judgments from committee members were quantified using standard practices employed by psychometricians who conduct standard setting studies. The committee made their judgments with respect to the difficulty scale resulting from the scaling and equating of field test items. In the field testing, each item, or score category if the item has multiple scores, is given a difficulty parameter obtained through item response theory methods. Test items corresponding to various points on the difficulty scale are presented as examples of test items at that difficulty level. The items used came from the anchor test form. The anchor test form is the test form upon which the cut-scores are set and the form to which all later forms of the test will be equated.
Committee members were given definitions of three performance categories—not meeting standards, meeting standards, and meeting standards with distinction. The State Education Department has developed these category definitions and they are applied to all of the Intermediate Level tests that are being developed. In addition, committee members were given an exercise in which they were asked to categorize some of their students into the performance categories as defined by the State Education Department.
The committee met as a group on March 20, 2001 at the State Education Department.
The standard setting study used the bookmarking approach because all the multiple-choice and constructed-response items had been scaled using item response theory methods and because the bookmarking procedure enables committee members to consider these two item types together.
In the bookmarking procedure, multiple choice items and constructed response items are ordered in terms of their difficulty parameters. The purpose of the items is to illustrate the meaning of the difficulty scale at specific points. Committee members are asked to apply their judgments to these ordered items. The committee meeting is conducted in rounds. The rounds and the activities employed in each round are given below.
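The core translation in the bookmarking procedure can be sketched in a few lines of code. The sketch below is illustrative only: the item identifiers and difficulty parameters are hypothetical, and the midpoint rule for converting a bookmark position into a difficulty-scale cut-point is one common convention, not necessarily the one used in this study.

```python
# Illustrative sketch of the bookmarking idea, using made-up item data.
# Each item (or score category of a constructed-response item) carries an
# IRT difficulty parameter; items are presented in ascending difficulty,
# and a panelist's bookmark marks the last item a borderline "meeting
# standards" student would be expected to answer correctly.

items = [  # (item id, IRT difficulty parameter) -- hypothetical values
    ("MC-07", -1.8), ("MC-02", -1.1), ("CR-01a", -0.6),
    ("MC-15", -0.2), ("CR-01b", 0.4), ("MC-21", 0.9), ("CR-02", 1.6),
]

# Order items from easiest to hardest, as in the ordered item booklet.
ordered = sorted(items, key=lambda item: item[1])

def cut_difficulty(bookmark_position):
    """Translate a bookmark placed after `bookmark_position` items
    (1-indexed) into a difficulty-scale cut-point, taken here as the
    midpoint between the two items that bracket the bookmark."""
    lo = ordered[bookmark_position - 1][1]
    hi = ordered[bookmark_position][1]
    return (lo + hi) / 2

# A panelist who places the bookmark after the 4th item judges that a
# borderline student masters the four easiest items but not the rest.
print(cut_difficulty(4))
```

The resulting difficulty value is then mapped, through the test characteristic curve from the scaling, to a raw-score cut on the anchor form.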
Committee members were also asked four overall questions about accomplishment of the learning standards and test performance. Answers to these questions might aid New York in setting appropriate performance standards on the test. These questions asked:
The New York State Education Department's Office of Curriculum and Instruction assembled a committee of 20 people to provide judgments for the study. Committee members were current or former social studies classroom teachers. All committee members were recognized as very knowledgeable of the learning standards for social studies and of how students perform on standardized tests similar to the Grade 8 Intermediate Level Social Studies Test. Some had worked on an aspect of either the standards or development of the test.
Committee members, their schools, the number of years of experience each has teaching Intermediate Level Social Studies, and the number of students in their grade 8 social studies classes are given in the table below.
Committee members were chosen so that they would represent a wide range of schools and different types of students. Each committee member was asked to complete a short background questionnaire that included questions about their sex, ethnic background, and the setting for their school. Results of the questionnaire tabulations are given in the table below.
Findings related to the bookmarking procedure
In round 2 every committee member independently placed his or her own bookmark for meeting standards. The results of the placements are given in the table below. The table gives the difficulty, the corresponding raw score, and the corresponding percent of students who fall below that cut-score based on the field test data. The cut-points include the committee average plus or minus one or two standard deviations (i.e., standard deviations of the committee estimates) and the median committee cut-point, as well as the cut-points corresponding to the 75th and 25th percentile ranks of committee estimates.
In round 3 every committee member independently placed his or her own bookmark for meeting standards with distinction. The results of the placements are given in the table below. The table gives the difficulty, the corresponding raw score, and the corresponding percent falling above that cut-score based on the field test data. The cut-points include the committee average plus or minus one or two standard deviations (i.e., standard deviations of the committee estimates) and the median committee cut-point, as well as the cut-points corresponding to the 75th and 25th percentile ranks of committee estimates.
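The summary statistics reported for each round can be computed with standard tools. The sketch below uses hypothetical raw-score cut judgments for a 20-member committee; it shows the mean plus or minus one and two standard deviations, the median, and the 25th and 75th percentile ranks described above.

```python
# A sketch of how committee judgments might be summarized for a round.
# The raw-score cut values below are hypothetical, for illustration only.
import statistics

round2_cuts = [24, 26, 27, 28, 28, 29, 30, 30, 31, 31,
               32, 32, 33, 33, 34, 35, 36, 38, 40, 44]

mean = statistics.mean(round2_cuts)
sd = statistics.stdev(round2_cuts)       # SD of the committee estimates
median = statistics.median(round2_cuts)
q1, _, q3 = statistics.quantiles(round2_cuts, n=4)  # 25th and 75th percentiles

print(f"mean = {mean:.2f}")
print(f"mean +/- 1 SD = ({mean - sd:.2f}, {mean + sd:.2f})")
print(f"mean +/- 2 SD = ({mean - 2 * sd:.2f}, {mean + 2 * sd:.2f})")
print(f"median = {median}, 25th pct = {q1}, 75th pct = {q3}")
```

The median and percentile-rank summaries are less sensitive than the mean to a few extreme judgments, which is why they figure in the recommendations later in the report.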
In round 4, committee members received a report of their round 2 results. They were also placed in small groups, where individual results were discussed. After the discussion, committee members were asked to place another bookmark for meeting standards based on the information and knowledge they had gained up to that point. The round 4 results, which generally show less variation than the round 2 results, are given in the table below.
In round 5, committee members received a report of their round 3 results. They were also placed in small groups, where individual results were discussed. After the discussion, committee members were asked to place another bookmark for meeting standards with distinction based on the information and knowledge they had gained up to that point. The round 5 results, which generally show less variation than the round 3 results, are given in the table below.
In round 6, committee members received a report of their round 4 and round 5 judgments. They also received a report of the impact of their estimates from those rounds. Impact was reported in terms of the frequency distributions of the field test scores. The committee was also advised that scores from field testing generally underestimate operational test performance, but that the amount of the underestimate was not known. Committee members then returned to their groups and discussed the report and their judgments. At the end of the discussion, committee members were asked to place new bookmarks for both meeting standards and meeting standards with distinction based on the information and knowledge they had at that time. Results of this final placement are given in the table below.
Other Judgments Obtained
Committee members were asked to provide their best judgment of the percentage of their students who are currently meeting the learning standards as well as the percentage of their students who are meeting those standards at the distinction level. These judgments were made not with respect to the test, but with respect to the learning standards and the definitions of meeting standards and meeting standards with distinction. The results showing the committee average plus or minus one or two standard deviations (i.e., standard deviations of the committee estimates) and the median committee results and the 75th and 25th percentile ranks of committee estimates are given below:
The data in the table above relate to the passing scores for the test in that the committee on average was indicating that, in their judgment, about 15% to 25% of students in the state are not currently achieving at the minimum level suggested by the learning standards. This assessment was made not on the basis of test scores, but on the basis of the teachers' observations and judgment. Similarly, the committee on average judged that roughly 10% to 20% of students were achieving at the distinction level.
Also noteworthy is the relatively large variation in the estimates. This reflects the very real variation in achievement among classrooms. For example, estimates of the percentage of students meeting standards ranged from 10% to 96%. For meeting standards with distinction, the estimates ranged from 5% to 40%.
With respect to the relative severity of the errors of classification, three-fourths of the committee indicated that for the passing cut-point, the most severe error was to pass a student on the examination who in fact had not met the learning standards. Similarly, about two of three committee members indicated that the most severe error with respect to the passing with distinction cut-point was to give a student a passing with distinction score, though the student had not met the learning standards at that level.
Discussion and Recommendations
The purpose of this study is to obtain data and information that New York may use in setting passing points for the Grade 8 Intermediate Social Studies Examination. The data should be used to guide those decisions.
The committee that provided the data was diverse and well represented the diversity of New York students, teachers, and school districts. With that diversity, it is not surprising that committee judgments varied.
The final bookmarks from the procedure are given in the table below:
The committee also indicated that based on their own assessments of classroom performance, currently about 15% to 25% of students are not achieving the learning standards and that about 10% to 20% of students are meeting standards with distinction. Further, the committee overwhelmingly believes that the error of passing a student who should fail should be minimized. Though the view was less strongly held, the committee also believed that the error of granting distinction when that level of achievement had not been attained should be minimized. Generally, the committee thought favorably of the academic intervention that would take place for failing students.
The chief concern the study author has is the nature of the field test results, especially those for the constructed response and essay items. The item difficulty parameters used in the bookmarking procedure are derived from field test results. But there are significant questions about the validity of those results. Based on the field test results, the constructed response items appear very difficult. But throughout the committee meeting, committee members indicated that they did not believe those items were as difficult as the parameters suggested.
Field testing often results in poorer performance than might be expected in operational testing. The motivation of students is not the same. Students are not prepared by teachers for field testing as they are for operational testing. And during field testing, not all teachers are following the curriculum guidelines to which the test is tied. So, although it is known that field test performance underestimates operational test performance, what is not known is the amount of improvement that might be expected once operational testing begins.
What should be made of these results?
The study author recognizes that New York has the responsibility and duty to set cut-points in such a way that the purpose of the testing program is best accomplished. That requires judgment and consideration of all the data and information that is available at the time cut-points are set.
The first recommendation is that New York should repeat the standard setting study after operational testing has begun. That study should use recalculated difficulty parameters obtained from the first operational testing. Performance data from that administration should also be used to guide the panelists. Using operational test data should result in more valid information concerning cut-points than can be obtained with data from field testing.
Given that cut-points must be in place once operational testing begins, however, data from the current study should be used. The study author has little confidence in the normative data provided—i.e., the percentage of students who might fail and the percentage of students who might achieve distinction. Therefore, the second recommendation is that New York ignore these estimates as it deliberates over the choice of initial cut-points.
The study author recommends that, because committee member judgments are not normally distributed, the median and the first and third quartiles of committee judgments be used to establish the cut-points. The study author recommends that the cut-point for passing fall within the range of raw scores from 24 to 33, and that the cut-point for passing with distinction fall between raw scores of 72 and 77.
Unlike what is often found for tests that students must pass to graduate, Grade 8 Intermediate Social Studies teachers feel much less negative about failing a student who has actually met the standards. Generally, they believe that the academic intervention given to a student as a result of failing the test is a positive experience. Consequently, the state might set the cut-point for passing at the upper end of the range.
The study author recognizes that the state, not the study author, has the responsibility to set the cut-points, and that state staff have greater knowledge of curricular practice and other factors that may affect the choice of cut-points. If forced to choose, however, the study author would select cut-points of 32 for passing and 72 for passing with distinction.
Once cut-points are established for operational testing, New York can use data from the study to evaluate the choice of those cut-points. If the cut-points are appropriate, then somewhere between 15% and 25% of students should receive failing scores while 10% to 20% of students should receive passing with distinction scores. Regardless of the cut-points chosen, the study author recommends that the whole standard setting study be repeated, collecting the same data as the current study, but with statistics from the operational testing available.
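As a rough illustration of this evaluation step, the sketch below computes the percent of examinees failing and the percent earning distinction for a chosen pair of cut-points. The operational scores are made up for illustration; the cut-points of 32 and 72 are the study author's suggested values from the discussion above.

```python
# Hypothetical check of chosen cut-points against the committee's
# classroom-based expectations (15%-25% failing, 10%-20% with distinction).

def classification_rates(scores, pass_cut, distinction_cut):
    """Return the fraction failing (raw score below pass_cut) and the
    fraction earning distinction (at or above distinction_cut)."""
    n = len(scores)
    failing = sum(1 for s in scores if s < pass_cut) / n
    distinction = sum(1 for s in scores if s >= distinction_cut) / n
    return failing, distinction

# Made-up operational raw scores, for illustration only.
scores = [18, 22, 25, 30, 31, 35, 40, 45, 50, 55,
          58, 60, 62, 65, 68, 70, 73, 75, 80, 85]

failing, distinction = classification_rates(scores, pass_cut=32,
                                            distinction_cut=72)
print(f"{failing:.0%} failing, {distinction:.0%} with distinction")
```

If the observed rates fell well outside the committee's expected ranges, that would be a signal to revisit the cut-points when the study is repeated with operational data.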