Physical Setting/Earth Science Regents
A study performed for the New York State Education Department by
April 27, 2001
The New York State Board of Regents has established learning standards that all students must meet to graduate from high school. One set of learning standards is for mathematics, science, and technology. Within those learning standards, some apply to earth science. In terms of general content, the earth science standards cover the following areas:
Key ideas, performance indicators, and sample tasks further describe each learning standard. Standards are also broken down by educational level: elementary, intermediate, and commencement. To assess the extent to which students have met the learning standards, the New York State Education Department has developed a testing program. The content of the tests reflects accomplishment of the learning standards. For earth science, the State Education Department has developed a Regents Examination in Physical Setting/Earth Science to reflect accomplishment of the learning standards pertaining to the above content areas.
Although scores for the Physical Setting/Earth Science Regents Examination are placed on a numerical scale, essentially there are only three classifications: does not meet standards, meets standards, and meets standards with distinction. New York State teachers, using professionally established procedures, developed the test items, and the items have been pretested and field-tested on samples of students.
The purpose of the study described in this report is to obtain information that the State Education Department can use to establish scores that will classify test takers into the does not meet standards, meets standards, and meets standards with distinction categories. Setting cut-points requires judgment. This study employs professionally established methods to quantify and summarize the judgments of experts about how individuals who have met the learning standards will perform on the test.
The Physical Setting/Earth Science Regents Examination
The Physical Setting/Earth Science Regents Examination assesses student achievement at the commencement level. Items for the examination were developed through the cooperative efforts of teachers, school districts, other science educators, and New York State Education Department staff. The examination consists of two parts. The first part is a written examination. The second part is a performance examination. The written portion of the examination is administered in a 3-hour period and will first be offered in June 2001.
The written part of the examination has three sections (or parts):
The performance part of the examination, termed Part D, assesses laboratory skills. Part D must be administered prior to the written examination.
The examination blueprint, taken from the test sampler, is given in the table below:
A complete description of the examination, including test specifications and scoring rubrics, is given in a test sampler.
Data related to the performance standards for the test were obtained from a committee of experts. Judgments from committee members were quantified using standard practices employed by psychometricians who conduct standard setting studies. The committee made its judgments with respect to the difficulty scale resulting from the scaling and equating of field test items. In the field testing, each item, or score category if the item has multiple score points, was assigned a difficulty parameter obtained through item response theory methods. Test items corresponding to various points on the difficulty scale were presented as examples of test items at those difficulty levels. The items used came from the anchor test form. The anchor test form is the test form upon which the cut-points are set and the form to which all later forms of the test will be equated.
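The role of a difficulty parameter can be illustrated with a simple sketch. The report does not state which item response theory model was used, so the one-parameter (Rasch) model and all numeric values below are illustrative assumptions only.

```python
import math

def p_correct(theta, b):
    """Probability that a student with ability theta answers an item
    with difficulty parameter b correctly (illustrative Rasch model)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# The difficulty parameter is the point on the ability scale at which
# a student has a 50% chance of answering the item correctly.
assert abs(p_correct(0.5, 0.5) - 0.5) < 1e-9

# At a fixed ability level, harder items (larger b) are answered
# correctly less often.
print(round(p_correct(0.0, -1.0), 3))  # easier item
print(round(p_correct(0.0, 1.0), 3))   # harder item
```

Because every item's difficulty is expressed on this common scale, items from the anchor form can serve as concrete markers of what performance "looks like" at a given scale point.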
Committee members were given definitions of three performance categories: not meeting standards, meeting standards, and meeting standards with distinction. The State Education Department developed these category definitions, and they are applied to all of the Regents tests under development. In addition, committee members were given an exercise designed to familiarize them with the examination and an exercise in which they were asked to categorize some of their students into the performance categories as defined by the State Education Department.
The committee met as a group on April 2, 2001 at the State Education Department.
The standard setting study used the bookmarking approach because all of the multiple-choice and constructed-response items had been scaled using item response theory methods and because the bookmarking procedure enables committee members to consider these two item types together.
In the bookmarking procedure, multiple-choice and constructed-response items are ordered by their difficulty parameters. The items serve to illustrate the meaning of the difficulty scale at specific points. Committee members are asked to apply their judgments to these ordered items. The committee meeting is conducted in rounds. The rounds and the activities employed in each round are given below.
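The core mechanics of the bookmarking procedure can be sketched as follows. The difficulty values, the bookmark position, and the simple "difficulty of the bookmarked item" conversion rule are all illustrative assumptions; the report does not state the response-probability criterion the study actually applied.

```python
# Hypothetical difficulty parameters, one per item or score category,
# listed in the arbitrary order the items appeared on the form.
difficulties = [-1.2, 0.4, -0.3, 1.1, 0.0, 1.8, -0.8, 0.7]

# Step 1: build the ordered item booklet -- items sorted from
# easiest to hardest by difficulty parameter.
booklet = sorted(difficulties)

# Step 2: a committee member places a bookmark after the last item a
# borderline "meets standards" student should be expected to answer
# correctly.  Here the bookmark follows the fifth item in the booklet.
bookmark = 5

# Step 3: take the cut-point on the difficulty scale at the bookmarked
# item (a simplification; operational studies apply a stated
# response-probability criterion when converting bookmarks to cuts).
cut_theta = booklet[bookmark - 1]
print(cut_theta)
```

Each member's cut-point on the difficulty scale is then mapped to a raw score on the anchor form, which is the metric in which the round-by-round results below are reported.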
Committee members were also asked four overall questions about accomplishment of the learning standards and test performance. Answers to these questions might aid New York in setting appropriate performance standards on the test. These questions asked:
Committee members provided judgments relating to the performance test using the following procedure.
The New York State Education Department assembled a committee of 24 people to provide judgments for the study. Committee members were, with one exception, current or former classroom teachers. Some were supervisors. One committee member was a representative from the business community. All committee members were recognized as very knowledgeable about the learning standards pertaining to the physical setting and earth science and about how students perform on standardized tests similar to the Physical Setting/Earth Science Examination. Some had worked on an aspect of either the standards or the development of the tests.
Committee members, their schools, the number of years of experience each has teaching Earth Science, and the number of students currently in their Earth Science classes are given in the table below.
Committee members were chosen so that they would represent a wide range of schools and different types of students. Each committee member was asked to complete a short background questionnaire that included questions about their sex, ethnic background, and the setting for their school. Results of the questionnaire tabulations are given in the table below.
Findings Related to the Bookmarking Procedure
In round two, every committee member independently placed his or her own bookmark for meeting standards. The results of the placements are given in the table below. The table gives the difficulty parameter, the corresponding raw score, and the percentage of students below that raw score based on the field test results. The cut-points reported include the committee average plus or minus one and two standard deviations (i.e., standard deviations of the committee estimates) and the median committee cut-point, as well as the cut-points corresponding to the 25th and 75th percentile ranks of the committee estimates.
In round three, every committee member independently placed his or her own bookmark for meeting standards with distinction. The results of the placements are given in the table below. The table gives the difficulty parameter, the corresponding raw score, and the percentage of students above that raw score based on the field test results. The cut-points reported include the committee average plus or minus one and two standard deviations (i.e., standard deviations of the committee estimates) and the median committee cut-point, as well as the cut-points corresponding to the 25th and 75th percentile ranks of the committee estimates.
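The summary statistics reported for each round (committee mean plus or minus one and two standard deviations, the median, and the 25th and 75th percentile ranks) can be computed with Python's standard library. The placements below are invented values for illustration; the committee's actual judgments appear in the report's tables.

```python
import statistics

# Hypothetical raw-score cut-point judgments from one round
# (illustrative values only, not the committee's actual data).
placements = [38, 41, 35, 44, 39, 40, 37, 42, 36, 43, 40, 38]

mean = statistics.mean(placements)
sd = statistics.stdev(placements)        # SD of the committee estimates
median = statistics.median(placements)
q1, _, q3 = statistics.quantiles(placements, n=4)  # 25th/75th percentiles

print(f"mean = {mean:.2f}, sd = {sd:.2f}")
print(f"mean - 1 sd = {mean - sd:.2f}, mean + 1 sd = {mean + sd:.2f}")
print(f"median = {median}, 25th = {q1}, 75th = {q3}")
```

Reporting the spread alongside the mean and median is what allows later rounds to be checked for the expected convergence in committee judgments.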
In round four, committee members received a report of their round two results. They were also placed in small groups in which individual results were discussed. After the discussion, committee members were asked to place another bookmark for meeting standards based on the information and knowledge they had gained up to that point. The round four results, which generally show less variation than the round two results, are given in the table below.
In round five, committee members received a report of their round three results. They were again placed in small groups in which individual results were discussed. After the discussion, committee members were asked to place another bookmark for meeting standards with distinction based on the information and knowledge they had gained up to that point. The round five results, which generally show less variation than the round three results, are given in the table below.
In round six, committee members received a report of their round four and round five judgments. They also received a report of the impact of their estimates from those rounds. Impact was reported in terms of the frequency distributions of the field test scores. The committee was also advised that scores from field testing generally underestimate operational test performance, but that the amount of the underestimate was not known. Committee members then returned to their groups and discussed the report and their judgments. At the end of the discussion, committee members were asked to place new bookmarks for both meeting standards and meeting standards with distinction based on the information and knowledge they had at that time. Results of this final placement are given in the table below.
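The impact calculation described above amounts to reading the percentage of field-test takers falling below a candidate cut-point off the score frequency distribution. The distribution below is invented for illustration; the actual field test frequencies are those behind the report's tables.

```python
from collections import Counter

# Hypothetical field-test raw-score frequency distribution:
# raw score -> number of students (illustrative values only).
freq = Counter({30: 50, 35: 120, 40: 200, 45: 260, 50: 210,
                55: 110, 60: 50})

def percent_below(cut):
    """Percent of field-test takers scoring below a candidate cut-point."""
    total = sum(freq.values())
    below = sum(n for score, n in freq.items() if score < cut)
    return 100.0 * below / total

# Impact of two candidate "meets standards" cut-points.
for cut in (40, 45):
    print(f"cut = {cut}: {percent_below(cut):.1f}% below")
```

Because field-test scores understate operational performance, these percentages overstate the eventual failure rate by an unknown amount, which is why the committee was cautioned before its final placements.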
Other Judgments Obtained
Committee members were asked to provide their best judgment of the percentage of their current students who are not achieving the learning standards as well as the percentage of their current students who are achieving the learning standards with distinction. These judgments were made not with respect to the test, but with respect to the learning standards and the definitions of meeting standards and meeting standards with distinction. Results appear in the table below.
The data in the table above relate to the cut-points for the test in that the committee, on average, was indicating that in its judgment almost one in four students in the state is not currently achieving at the level suggested by the learning standards. This assessment was made without test scores and is independent of them. Similarly, the committee on average judged that about 15%-25% of students were achieving at the distinction level.
Also noteworthy are the relatively large standard deviations of the estimates. These reflect the very real variation in achievement among classrooms. For example, estimates of the percentage of students achieving at least at the meets standards level ranged from 1% to 100%. For meeting standards with distinction, the estimates ranged from 0% to 75%.
With respect to the relative severity of the errors of classification, 71% of the committee said that classifying a student as having not met standards who in reality has met the learning standards was more serious than classifying a student as having met the standards who in reality has not met the learning standards. Twenty-nine percent of the committee said the opposite. With respect to meeting standards with distinction, the committee was about evenly divided. Fifty-four percent said that not granting a student distinction who in reality has attained that level of achievement was the more serious error.
Thus, the committee might be considered "lenient" with respect to setting the lower cut-point, but indifferent about setting the higher cut-point.
Cut-Points for the Performance Test
Results for the performance test are given in the table below.
Discussion and Recommendations
The purpose of this study was to obtain data and information that New York may use in setting cut-points for the Physical Setting/Earth Science Examination. The data should be used to guide those decisions.
The committee that provided the data was diverse and well represented the diversity of New York students, teachers, and school districts. With that diversity, it is not surprising that committee judgments varied.
The final bookmarks from the procedure are given in the table below.
The committee also indicated that currently about 25% of students are not meeting the learning standards and that about 15%-25% of students are meeting the standards at the distinction level. Further, a clear majority of the committee believes that the error of classifying a student as not meeting standards who in reality has met standards should be minimized. The committee seems indifferent with respect to classification errors at the distinction level.
Finally, it is well known that student performance improves once operational testing begins. What is not known is the amount of improvement that might be expected.
Final judgments for the performance test are given in the table below.
What should be made of these results?
The study author recognizes that New York has the responsibility and duty to set cut-points in such a way that the purpose of the testing program is best accomplished. That requires judgment and consideration of all the data and information that is available at the time cut-points are set. The study author strongly encourages New York not to routinely adopt the mean for the bookmarking procedure as the final cut-points. Final cut-points should result from staff deliberations using all of the data presented in this report.
It is well known that field test results underestimate how well students perform in operational testing. Underperformance in field testing is due to several factors, chief among them students' recognition that the test scores do not count and teaching practices that are not yet congruent with the standards on which the tests are based. The amount of underestimation for the Physical Setting/Earth Science Examination is unknown, yet the difficulty parameters and impact estimates used in the standard setting were based on field test statistics.
Thus, the study author's first recommendation is to repeat the standard setting study once the test becomes operational. The repeated study should use item difficulty and impact estimates obtained from operational testing, not simply the same data. If that is not possible, the study author encourages New York to repeat the standard setting study with other methods, such as the contrasting groups method, which does not rely on the state collecting item-level data on a large scale.
For initial operational testing, the study author recommends that the cut-point for meeting standards be set within the raw score range of 33-45. The committee means and medians fall within this range. Within this range, the study author recommends that a final cut-point be set based on the state's best judgment as to the improvement that will actually occur once operational testing begins. That judgment should be informed by discussions with test developers, curriculum specialists, and teachers. The study author would choose a raw score of 40, which is lenient and would likely result in fewer than about 20% of students failing the test, a lower rate than the percentage of students that teachers indicated are not currently meeting standards. This is only the personal opinion of the study author, however.
For initial operational testing, the study author recommends that the cut-point for meeting standards with distinction be set within the raw score range of 69-74. Again, all committee means and medians fall within that range. And again, within that range the choice should be based on the estimated improvement from field testing to operational testing and should be informed by discussions with test developers, curriculum specialists, and teachers. The state should realize, however, that improvement in the upper range of scores is likely to be smaller than improvement in the lower range. The study author would choose a raw score of 71, but again that is the personal opinion of the study author only.
With respect to the performance test, until more data on performance can be collected, the study author recommends that cut-points of 13 and 21 (the committee medians) be used.