Grade 8 Intermediate-Level Science Test--Data and Information Related to Standard Setting
A study performed for the New York State Education Department by
April 27, 2001
The New York State Board of Regents has established learning standards that all students must meet to graduate from high school. One set of learning standards covers mathematics, science, and technology. Key ideas, performance indicators, and sample tasks further describe each learning standard. Standards are also broken down by educational level--elementary, intermediate, and commencement. To assess the extent to which students have met the learning standards, the New York State Education Department has developed a testing program. The content of the tests reflects accomplishment of the learning standards. For science, the State Education Department has developed a Grade 8 Intermediate-Level Science Test to reflect accomplishment of the learning standards at the intermediate level. Most students will take this test at the end of 8th grade. Schools must provide students who do not meet the standards with special academic intervention programs.
Although scores for the test are placed on a numerical scale, essentially there are only three scores—does not meet standards, meets standards, and meets standards with distinction. The test items have been developed by New York State teachers using professionally established procedures and the items have been pretested and field-tested on samples of students.
The purpose of the study described in this report is to obtain information that the State Education Department can use to establish scores that will classify test takers into the score categories. Setting passing scores requires judgment. This study employs professionally established methods to quantify and summarize the judgments of experts related to how individuals who have met the learning standards will perform on the test.
The Grade 8 Intermediate-Level Science Test
The Grade 8 Intermediate-Level Science Test assesses student achievement in Learning Standards 1, 2, 4, 6, and 7. Items for the exam were developed through the cooperative efforts of teachers, school districts, State Education Department staff, and science educators.
Questions are content- and skills-based and require students to graph, complete a data table, label diagrams, design experiments, make calculations, and write responses. Test takers are also asked to hypothesize; to interpret, analyze, and evaluate data; and to apply their scientific knowledge and skills to real-world situations.
Test content is based on the intermediate-level key ideas and performance indicators found in the Learning Standards for Mathematics, Science and Technology developed and adopted by the Board of Regents. The test has a written part and a performance part. The written test has three parts and is administered to students at the end of grade 8. The written test has a two-hour time limit. The three parts of the written examination are as follows:
The performance test is a laboratory performance test and is intended to make up 15% of the total score. Students complete the test by working at three laboratory stations. They earn points at each of the stations. A total score for the performance test is obtained by adding the points attained at the three stations.
A complete description of the examination, including test specifications and scoring rubrics, is given in a test sampler.
Data related to the performance standards for the test were obtained from a committee of experts. Judgments from committee members were quantified using standard practices employed by psychometricians who conduct standard setting studies. Committee members made their judgments with respect to the difficulty scale resulting from the scaling and equating of field test items. In the field testing, each item, or each score category if the item has multiple score points, is assigned a difficulty parameter obtained through item response theory methods. Test items corresponding to various points on the difficulty scale are presented as examples of test items at that difficulty level. The items used came from the anchor test form. The anchor test form is the form upon which the cut-points are set and the form to which all later forms of the test will be equated.
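The difficulty parameters described above come from an item response model. As a hedged illustration (assuming the one-parameter Rasch form; the difficulty values below are invented, not the actual field-test parameters), the relationship between a student's ability and an item's difficulty can be sketched as:

```python
import math

def rasch_p_correct(theta, b):
    """Rasch (one-parameter IRT) model: probability that a student
    with ability theta answers an item of difficulty b correctly."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# Invented difficulty parameters for illustration only.
item_difficulties = {"item_A": -1.0, "item_B": 0.0, "item_C": 1.5}

# A student whose ability equals an item's difficulty has a 50%
# chance of answering it; harder items have lower probabilities.
theta = 0.5
for name, b in sorted(item_difficulties.items(), key=lambda kv: kv[1]):
    print(f"{name}: difficulty {b:+.1f}, P(correct) = {rasch_p_correct(theta, b):.2f}")
```

Ordering items by their estimated difficulties in this way is what produces the difficulty scale committee members worked with.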
Committee members were given definitions of the performance categories—not meeting standards, meeting standards, and meeting standards with distinction. The State Education Department has developed these category definitions and they are applied to all of the tests that are being developed. In addition, committee members were given an exercise designed to help familiarize them with the test and an exercise in which they were asked to categorize some of their students into the performance categories as defined by the State Education Department.
The committee met as a group on March 27, 2001 at the State Education Department.
The standard setting study used the bookmarking approach because all the multiple choice items and constructed response items had been scaled using item response theory methods and because the bookmarking procedure enables committee members to consider these two item types together. The state employs the bookmarking procedure in all of its standard setting studies.
In the bookmarking procedure, multiple choice items and constructed response items are ordered in terms of their difficulty parameters. The purpose of the items is to illustrate the meaning of the difficulty scale at specific points. Committee members are asked to apply their judgments to these ordered items. The committee meeting is conducted in rounds. The rounds and the activities employed in each round are given below.
After the above process was completed, committee members went back and placed bookmarks (similar to those of steps 2 and 3 above) for hypothetical students who, although they had not met the learning standards overall, had demonstrated some proficiency with the learning standards.
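A minimal sketch of how a bookmark placed in an ordered item booklet maps to a cut-point on the difficulty scale (the convention of taking the bookmarked item's own difficulty is an assumption for illustration; operational bookmark procedures often apply a response-probability criterion instead):

```python
def bookmark_to_cut(difficulties, bookmark):
    """Map a committee member's bookmark to a difficulty-scale cut-point.

    difficulties: difficulty parameters for the items (any order).
    bookmark: 1-indexed position in the booklet ordered from easiest
    to hardest; the bookmark sits on the last item a borderline
    student is expected to master.
    """
    ordered = sorted(difficulties)
    return ordered[bookmark - 1]

# Invented difficulties for six items (illustration only).
diffs = [0.3, -1.2, 0.9, -0.4, 1.6, 2.1]
cut = bookmark_to_cut(diffs, bookmark=4)  # 4th-easiest item
print(cut)  # -> 0.9
```

Each committee member's bookmark thus yields one cut-point on the difficulty scale, and these individual cut-points are the values summarized across rounds.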
Committee members were also asked four overall questions about accomplishment of the learning standards and test performance. Answers to these questions might aid New York in setting appropriate performance standards on the test. These questions asked:
After these activities were completed, the committee provided judgments about student performance on the performance test. Committee members were asked to estimate the number of points the two hypothetical students they had considered earlier would achieve at each station. Committee members were also given the average number of points students in the field test had achieved at each station.
Because making such judgments was deemed related to knowledge of the scoring rubric, the committee was divided into two groups—those who had been trained in the scoring rubric and those who had not. Those who had not been trained were further divided into three groups, and staff from the State Education Department worked with each of these groups, which provided judgments for only one station each. Committee members who had been trained in the scoring rubric provided judgments for all three stations.
The New York State Education Department assembled a committee of 22 people to provide judgments for the study. Committee members were current or former science classroom teachers. All committee members were recognized as very knowledgeable of the learning standards related to science at the intermediate level and of how students perform on standardized tests similar to the Grade 8 Intermediate-Level Science Test. Some had worked on an aspect of either the standards or development of the tests.
Committee members, their schools, and the number of years of experience each has in teaching Intermediate-Level Science are given in the table below.
Committee members were chosen so that they would represent a wide range of schools and different types of students. Each committee member was asked to complete a short background questionnaire that included questions about their sex, ethnic background, and the setting for their school. Results of the questionnaire tabulations are given in the table below.
Findings related to the bookmarking procedure
In round 2 every committee member independently placed his or her own bookmarks for meeting standards. The results of the placements are given in the table below. The table gives the difficulty, the corresponding raw score, and the corresponding percent of students below that raw score, based on the field test data. The cut-points shown include the committee average plus or minus one and two standard deviations (i.e., standard deviations of the committee estimates), the median committee cut-point, and the cut-points corresponding to the 25th and 75th percentile ranks of the committee estimates.
In round 3 every committee member independently placed his or her own bookmarks for meeting standards with distinction. The results of the placements are given in the table below. The table gives the difficulty, the corresponding raw score, and the corresponding percent of students above each cut-point, based on the field test data. The cut-points shown include the committee average plus or minus one and two standard deviations (i.e., standard deviations of the committee estimates), the median committee cut-point, and the cut-points corresponding to the 25th and 75th percentile ranks of the committee estimates.
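The summary statistics these tables report (committee mean plus or minus one and two standard deviations, the median, and the 25th/75th percentile ranks) can be computed as in the following sketch; the bookmark values are invented, not the committee's actual judgments:

```python
import statistics

def summarize_cuts(cuts):
    """Summarize committee cut-point judgments: mean +/- 1 and 2
    standard deviations (SDs of the committee estimates), median,
    and 25th/75th percentiles."""
    mean = statistics.mean(cuts)
    sd = statistics.stdev(cuts)
    p25, median, p75 = statistics.quantiles(cuts, n=4)
    return {
        "mean - 2sd": mean - 2 * sd,
        "mean - 1sd": mean - sd,
        "mean": mean,
        "mean + 1sd": mean + sd,
        "mean + 2sd": mean + 2 * sd,
        "p25": p25,
        "median": median,
        "p75": p75,
    }

# Invented raw-score bookmarks for a hypothetical 22-member committee.
cuts = [38, 40, 41, 39, 42, 40, 37, 43, 40, 41, 39,
        40, 38, 42, 41, 40, 39, 44, 40, 41, 38, 40]
for label, value in summarize_cuts(cuts).items():
    print(f"{label}: {value:.1f}")
```

The same summary is applied in each round, so shrinking standard deviations across rounds directly show the convergence of committee judgments.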
In round 4, committee members received a report of their round 2 results. They also were placed in small groups where individual results were discussed. After the discussion, committee members were asked to place another bookmark for meeting standards based on the information and knowledge they had gained up to this point. The round 4 results, which generally show less variation than the round 2 results, are given in the table below.
In round 5, committee members received a report of their round 3 results. They also were placed in small groups where individual results were discussed. After the discussion, committee members were asked to place another bookmark for meeting standards with distinction based on the information and knowledge they had gained up to this point. The round 5 results, which generally show less variation than the round 3 results, are given in the table below.
In round 6, committee members received a report of their round 4 and round 5 judgments. They also received a report of the impact of their estimates from those rounds. Impact was reported in terms of the frequency distributions of the field test scores. The committee was also advised that scores from field testing generally underestimate operational test performance, but that the amount of the underestimate was not known. Committee members then returned to their groups and discussed the report and their judgments. At the end of the discussion, committee members were asked to place new bookmarks for both meeting standards and meeting standards with distinction based on the information and knowledge they had at that time. Results of this final placement are given in the table below.
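The impact report amounts to tabulating the field-test score distribution against a candidate pair of cut-points. A minimal sketch with invented scores:

```python
def impact(scores, cut_meets, cut_distinction):
    """Percent of the field-test distribution falling into each
    performance category for a candidate pair of raw-score cuts."""
    n = len(scores)
    below = sum(s < cut_meets for s in scores)
    distinction = sum(s >= cut_distinction for s in scores)
    return {
        "does not meet standards": 100.0 * below / n,
        "meets standards": 100.0 * (n - below - distinction) / n,
        "meets standards with distinction": 100.0 * distinction / n,
    }

# Invented raw scores for ten field-test students (illustration only).
scores = [22, 35, 41, 47, 52, 58, 63, 69, 74, 80]
print(impact(scores, cut_meets=40, cut_distinction=73))
# -> 20% below, 60% meets, 20% distinction
```

Reporting impact this way lets committee members see the percentage consequences of their bookmarks before placing their final ones.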
Other Judgments Obtained
Committee members were asked to provide their best judgment of the percentage of their students who are currently meeting the learning standards as well as the percentage of their students who are meeting those standards at the distinction level. These judgments were made not with respect to the test, but with respect to the learning standards and the definitions of meeting standards and meeting standards with distinction. The results showing the committee average plus or minus one or two standard deviations (i.e., standard deviations of the committee estimates) and the median committee results and the 75th and 25th percentile ranks of committee estimates are given below:
The data in the table above relate to the lower cut-point for the test: the committee, on average, judged that about 20% to 30% of students in the state are not currently achieving at the minimum level suggested by the learning standards. This assessment was made not on the basis of test scores, but on the basis of teacher observations and judgment. Similarly, the committee on average judged that roughly 10% to 15% of students were achieving at the distinction level.
Also noteworthy is the relatively large variation for the estimates. This reflects the very real variation in achievement among classrooms. For example, estimates of the percentage of students currently achieving the learning standards ranged from 5% to 95%. For meeting the standards with distinction, the estimates ranged from 2% to 90%.
With respect to the relative severity of the errors of classification, about three-fourths (77%) of the committee indicated that for the meeting standards cut-point, the most severe error was to fail a student on the examination who in fact had met the learning standards. Similarly, about four of five (82%) committee members indicated that the most severe error with respect to the meeting standards with distinction cut-point was to give a student a meeting standards with distinction score when the student had not met the learning standards at that level. In more common terms, the committee favored being "lenient" in setting the lower cut-point and "strict" in setting the higher cut-point.
Results for the performance test
Results of the standard setting for the performance test are given in the table below:
Standard deviations for the estimates by section were between one and two score points.
Discussion and Recommendations
The purpose of this study is to obtain data and information that New York may use in setting cut-points for the Grade 8 Intermediate-Level Science Test. These cut-points should relate to the intended use of the test. The data should be used to guide those decisions.
The committee that provided the data was diverse and well represented the diversity of New York students, teachers, and school districts. With that diversity, it is not surprising that committee judgments varied.
The final bookmarks from the procedure are given in the table below.
The committee also indicated that, based on their own assessments of classroom performance, about 20% to 30% of students are not currently achieving the learning standards and that about 10% to 15% of students are currently achieving at the distinction level with respect to the learning standards. Further, the committee overwhelmingly believes that the error of failing a student who has met the learning standards should be minimized. At the distinction level, the committee felt the opposite: the more serious error was to grant distinction when it was not deserved. Thus, the committee was lenient at the lower cut-point, but strict at the higher cut-point.
The chief concern the study author has is the nature of the field test results, especially those for the constructed response and essay items. The item difficulty parameters used in the bookmarking procedure are derived from field test results, but there are significant questions about the validity of those results. Based on the field test results, the constructed response items appear very difficult. Yet throughout the committee meeting, committee members indicated that they did not believe those items were as difficult as the parameters suggested.
Field testing often results in poorer performance than might be expected in operational testing. Student motivation is not the same as in operational testing. Teachers do not prepare students for field testing as they do for operational testing. And schools vary in the emphasis they place on different content areas. So, although field test performance underestimates operational test performance, the amount of improvement that might be expected once operational testing begins is not known.
What then should be made of these results?
The study author recognizes that New York has the responsibility and duty to set cut-points in such a way that the purpose of the testing program is best accomplished. That requires judgment and consideration of all the data and information available at the time cut-points are set. The study author strongly encourages New York not to adopt the mean committee bookmarks routinely, but to consider all of the relevant data presented and to exercise its judgment within the parameters suggested in this report.
The first recommendation is that New York should repeat the standard setting study after operational testing has begun. That study should use recalculated difficulty parameters obtained from the first operational testing. Performance data from that administration should also be used to guide the panelists. Using operational test data should result in more valid information concerning cut-points than can be obtained with data from field testing.
Given that cut-points must be in place for operational testing to begin, however, data from the current study should be used.
The clearest cut-point is that for meeting standards with distinction. The study author recommends that New York adopt a cut-point that falls in the range of 70-78 raw score points. In the study author’s opinion, the best choice is a raw score of 73. This is both the mean and median panel cut-point. Based on field test results, about 8% of students would score at or above this level. But field test results underestimate operational test performance, so implementing 73 as the cut-point has a good chance of producing a result consistent with the committee's judgment that 10% to 15% of students are currently achieving at the distinction level.
Less clear is the cut-point for meeting standards. The study author believes the data related to the percentage of students who are currently meeting standards provide good evidence about the number of students who might score below this cut-point. The committee indicated that the percentage is now in the range of about 20% to 30%. Implementation of the committee bookmark mean or median as a cut-point will likely result in fewer than 20% scoring in the does not meet standards category. At the same time, the committee has indicated a definite preference for leniency, which argues for setting the cut-point slightly lower.
The study author recommends that New York adopt a cut-point for meeting standards that falls in the range of 30-40 raw score points. In the study author’s opinion, the best choice is a raw score of 40, the highest score in the range. Based on field test results, about 27% of students could be expected to score lower than 40. In operational testing, that percentage will almost certainly decrease and is thus more likely to be consistent with committee estimates of current performance.
With respect to the performance test, committee estimates were reasonably consistent for each of the test's three stations. The study author recommends that New York also calculate difficulty parameters for the items making up the performance test and place those difficulties on the same scale as for the non-performance items. Data relating to the cut-points from this study can then be applied to determine cut-points for the performance test. Until those difficulty estimates are available, however, the study author suggests that New York adopt 24 as the cut-point for meets standards and 44 as the cut-point for meets standards with distinction.
Once cut-points are established for operational testing, New York can use data from this study to evaluate the choice of those cut-points. If the cut-points are appropriate, then somewhere between 20% and 30% of students should receive does not meet standards scores, while 10% to 15% of students should receive meets standards with distinction scores. Regardless of the results, the study author recommends that the whole standard setting study be repeated, collecting the same data as the current study, but with statistics from the operational testing available.