Living Environment Regents
A study performed for the New York State Education Department by
April 27, 2001
The New York State Board of Regents has established learning standards all students must meet to graduate from high school. One set of learning standards is for mathematics, science and technology. Within those learning standards, some apply to Living Environment.
Key ideas, performance indicators, and sample tasks further describe each learning standard. Standards are also broken down by educational level--elementary, intermediate, and commencement. To assess the extent to which students have met the learning standards, the New York State Education Department has developed a testing program. The content of the tests reflects accomplishment of the learning standards. For Living Environment, the State Education Department has developed a Regents Examination in Living Environment to reflect accomplishment of the pertinent learning standards.
Although scores for the Living Environment Regents Examination are placed on a numerical scale, there are essentially only three scores--does not meet standards, meets standards, and meets standards with distinction. New York State teachers, using professionally established procedures, have developed the test items, and the items have been pretested and field-tested on samples of students.
The purpose of the study described in this report is to obtain information that the State Education Department can use to establish scores that will classify test takers into does not meet standards, meets standards, and meets standards with distinction categories. Setting cut-scores requires judgment. This study employs professionally established methods to quantify and summarize the judgments of experts related to how individuals who have met the learning standards will perform on the test.
The Living Environment Regents Examination
The Living Environment Regents Examination assesses student achievement at the commencement level. Items for the examination were developed through the cooperative efforts of teachers, school districts, other science educators, and New York State Education Department staff. The concepts and skills tested can be found in the Living Environment Core Curriculum. Students are asked to graph, complete a data table, label diagrams, design experiments, analyze data, and write responses. In addition, questions require students to hypothesize, interpret, evaluate, and apply their scientific knowledge to real-world situations.
The examination is administered in a three-hour period and has three parts:
Part A consists of multiple choice questions assessing the student’s knowledge and understanding of core material.
The examination blueprint, taken from the test sampler, is given in the table below:
A complete description of the examination, including test specifications and scoring rubrics, is given in a test sampler.
Data related to the performance standards for the test were obtained from a committee of experts. Judgments from committee members were quantified using standard practices employed by psychometricians who conduct standard setting studies. The committee made their judgments with respect to the difficulty scale resulting from the scaling and equating of field test items. In the field testing, each item, or score category if the item has multiple score points, is assigned a difficulty parameter obtained through item response theory methods. Test items corresponding to various points on the difficulty scale are presented as examples of test items at that difficulty level. The items used for the study came from the anchor test form. The anchor test form is the test form upon which the cut-points are set and the form to which all later forms of the test will be equated.
Committee members were given definitions of three performance categories—does not meet standards, meets standards, and meets standards with distinction. The State Education Department has developed these category definitions and they are applied to all of the Regents tests that are being developed. In addition, committee members were given an exercise designed to help familiarize them with the examination and an exercise in which they were asked to categorize some of their students into the three performance categories.
The committee met as a group on March 30, 2001 at the State Education Department.
The standard setting study used the bookmarking approach because all of the multiple choice items and constructed response items had been scaled using item response theory methods and because the bookmarking procedure enables committee members to consider these two item types together.
In the bookmarking procedure, multiple choice items and constructed response items are ordered in terms of their difficulty parameters. The purpose of the items is to illustrate the meaning of the difficulty scale at specific points. Committee members are asked to apply their judgments to these ordered items. The committee meeting is conducted in rounds. The rounds and the activities employed in each round are given below.
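The core of the bookmarking logic described above can be sketched in a few lines. The difficulty parameters and bookmark position below are hypothetical illustrations, not data from the study: a judge's bookmark in the ordered item booklet corresponds to a point on the difficulty scale, which becomes that judge's cut-point.

```python
# Sketch of a judge's bookmark placement mapping to a cut-point on the
# difficulty (logit) scale. All values here are hypothetical.

def bookmark_cut(difficulties, bookmark_index):
    """Given item difficulty parameters and a judge's bookmark position
    (0-indexed into the difficulty-ordered booklet), return the difficulty
    value at the bookmark, i.e. the judge's cut-point."""
    ordered = sorted(difficulties)
    return ordered[bookmark_index]

# Hypothetical difficulty parameters for ten ordered items/score categories.
item_difficulties = [-2.1, -1.4, -0.9, -0.3, 0.0, 0.4, 0.8, 1.3, 1.9, 2.5]

# A judge placing a bookmark at the sixth item implies a cut-point of 0.4.
cut = bookmark_cut(item_difficulties, 5)
print(cut)  # 0.4
```

In the operational procedure, each judge's cut-point is then averaged or summarized across the committee, as described in the rounds below.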
Committee members were also asked four overall questions about accomplishment of the learning standards and test performance. Answers to these questions might aid New York in setting appropriate performance standards on the test. These questions asked:
What percentage of the students in your classes are currently meeting the learning standards?
What percentage of the students in your classes are currently meeting the learning standards with distinction?
Which is the more serious error--to categorize a student as having not met the standards when in reality the student has met the standards, or to categorize a student as having met the standards when in reality the student has not met the standards?
Which is the more serious error--to grant distinction to a student who has not met the learning standards at that level, or to fail to grant distinction to a student who has achieved that level of proficiency?
The New York State Education Department's Office of Curriculum and Instruction assembled a committee of 22 people to provide judgments for the study. Committee members were, with one exception, current or former classroom teachers. All committee members were recognized as very knowledgeable about the learning standards pertaining to the Living Environment and about how students perform on standardized tests similar to the Living Environment Examination. Some had worked on an aspect of either the standards or the development of the tests.
Committee members, their schools, the number of years of experience each has teaching Living Environment or Biology, and the number of students in their Living Environment or Biology classes are given in the table below.
Committee members were chosen so that they would represent a wide range of schools and different types of students. Each committee member was asked to complete a short background questionnaire that included questions about their sex, ethnic background, and the setting for their school. Results of the questionnaire tabulations are given in the table below.
Findings related to the bookmarking procedure
In round 2 every committee member independently placed his or her own bookmark for meeting standards. The results of the placements are given in the table below. The table gives the difficulty parameter, its corresponding raw score, and the corresponding percentage of students who would fall below that cut-point based on the field test data. The cut-points include the committee average plus or minus one or two standard deviations (i.e., standard deviations of the committee estimates) and the median committee cut-point, along with the cut-points corresponding to the 75th and 25th percentile ranks of the committee estimates.
In round 3 every committee member independently placed his or her own bookmark for meeting standards with distinction. The results of the placements are given in the table below. The table gives the resulting difficulty parameter, its corresponding raw score, and the corresponding percentage of students who would fall above that cut-point based on the field test data. The cut-points include the committee average plus or minus one or two standard deviations (i.e., standard deviations of the committee estimates) and the median committee cut-point, along with the cut-points corresponding to the 75th and 25th percentile ranks of the committee estimates.
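The summary statistics reported in these tables can be sketched as follows. The cut-point values used here are hypothetical placeholders, not the committee's actual judgments; the sketch only shows how the mean, plus or minus one standard deviation, the median, and the 25th/75th percentile ranks of committee estimates would be computed.

```python
# Summary statistics over a committee's cut-point judgments on the
# difficulty (logit) scale. The values below are hypothetical.
import statistics

cuts = [-0.5, -0.3, -0.2, -0.2, -0.1, 0.0, 0.0, 0.1, 0.2, 0.4]

mean = statistics.mean(cuts)
sd = statistics.stdev(cuts)                   # SD of the committee estimates
median = statistics.median(cuts)
q1, _, q3 = statistics.quantiles(cuts, n=4)   # 25th and 75th percentiles

print(round(mean, 2), round(median, 2))       # central tendency
print(round(mean - sd, 2), round(mean + sd, 2))  # mean +/- 1 SD band
print(round(q1, 2), round(q3, 2))             # interquartile band
```

The raw-score and percent-impact columns of the tables would then be obtained by mapping each of these logit values back through the field-test score distribution.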
In round 4, committee members received a report of their round 2 results. They were also placed in small groups, where individual results were discussed. After the discussion, committee members were asked to place another bookmark for meeting standards based on the information and knowledge they had gained up to this point. The round 4 results, which generally show less variation than the round 2 results, are given in the table below.
In round 5, committee members received a report of their round 3 results. They were also placed in small groups, where individual results were discussed. After the discussion, committee members were asked to place another bookmark for meeting standards with distinction based on the information and knowledge they had gained up to this point. The round 5 results, which generally show less variation than the round 3 results, are given in the table below.
In round 6, committee members received a report of their round 4 and round 5 judgments. They also received a report of the impact of their estimates from that round. Impact was reported in terms of the frequency distributions of the field test scores. The committee was also advised that scores from field testing generally underestimate operational test performance, but that the amount of the underestimate was not known. Committee members then returned to their groups and discussed the report and their judgments. At the end of the discussion, committee members were asked to place final bookmarks for both meeting standards and meeting standards with distinction based on the information and knowledge they had at that time. Results of this final placement are given in the table below.
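The impact figures reported to the committee amount to tallying the share of field-test scores falling below (or above) a candidate cut-point. The score distribution below is a hypothetical placeholder, not the actual field-test data:

```python
# "Impact" of a candidate raw-score cut-point: the percentage of field-test
# examinees who would fall below it. Scores here are hypothetical.

def impact_below(scores, cut):
    """Percent of examinees scoring below the raw-score cut-point."""
    return 100.0 * sum(1 for s in scores if s < cut) / len(scores)

field_test_scores = [22, 28, 31, 35, 36, 40, 44, 51, 58, 63]

print(impact_below(field_test_scores, 36))  # percent below a cut of 36
```

Because field-test scores underestimate operational performance, the true operational impact of any cut-point would be somewhat smaller than this field-test tally suggests.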
Other Judgments Obtained
Committee members were asked to provide their best judgment of the percentage of their students who are currently achieving the learning standards as well as the percentage of their students who are achieving with distinction with respect to the learning standards. These judgments were made not with respect to the test, but with respect to the learning standards and the definitions of meeting standards and meeting standards with distinction. The committee averages and standard deviations are presented in the table below.
The data in the table above relate to the cut-points for the test: the committee, on average, indicated that in their judgment almost one in four students in the state is not currently achieving at the minimum level suggested by the learning standards. This assessment was made without test scores and is independent of them. Similarly, the committee on average judged that about 15%-20% of students were achieving at the distinguished level.
Also noteworthy is the relatively large variation in the estimates. This reflects the very real variation in achievement among classrooms. For example, estimates of the percentage of students achieving at least at the meets standards level ranged from 20% to 90%. For meeting standards with distinction, the estimates ranged from 0% to 85%.
With respect to the relative severity of the errors of classification, 73% of the committee said that categorizing a student as having not met the standards who in reality has met them was a more serious error than classifying a student as having met the standards who in reality has not met the learning standards. Twenty-seven percent of the committee said the opposite. With respect to meeting the standards with distinction, the committee was nearly evenly divided. Fifty-five percent said that granting distinction to a student who in reality has not reached that level of achievement was more serious than failing to grant distinction to a student who has achieved at that level. Forty-five percent of the committee said the opposite.
The above suggests that the committee might be regarded as "lenient" with respect to the cut-point for meeting standards, but could be considered indifferent with respect to errors of classification for the cut-point for meets standards with distinction.
Discussion and Recommendations
The purpose of this study is to obtain data and information that New York may use in setting cut-points for the Living Environment Examination. The data should be used to guide those decisions.
The committee that provided the data was diverse and well represented the diversity of New York students, teachers, and school districts. With that diversity, it is not surprising that committee judgments varied.
The final bookmarks from the procedure are given in the table below.
The committee also indicated that currently about 20%-30% of students are not achieving the learning standards and that about 15%-20% of students are achieving at the distinction level with respect to the learning standards. Further, the committee overwhelmingly believes that the error of classifying a student as having not met standards who has met the standards should be minimized. The committee seems indifferent with respect to classification errors at the distinction level.
Finally, it is well known that student performance improves once operational testing begins. What is not known is the amount of improvement that might be expected.
What should be made of these results?
The study author recognizes that New York has the responsibility and duty to set cut-points in such a way that the purpose of the testing program is best accomplished. That requires judgment and consideration of all the data and information available at the time cut-points are set. The study author strongly encourages New York not to adopt the mean committee bookmarks routinely, but to consider all of the relevant data presented and to exercise its judgment within the parameters suggested in this report.
To the study author, one item stands out in importance. Committee members overwhelmingly indicated that it was more serious to fail a student who had the requisite level of skill and knowledge than to pass someone who did not. To the study author, that implies setting a cut-point that might best be described as "lenient," or "giving the benefit of any doubt" to the student.
At the same time, committee members indicated that in their judgment about 25% of students are not currently performing at least at the meets standards level.
Although the committee was willing to give the benefit of the doubt to the student when it came to meeting standards, that was not the case when it came to meeting standards with distinction. There committee members were equally divided about the severity of the two types of classification error. To the study author, that implied that one need not, and should not, be willing to give any benefit of the doubt to the student at this level of performance.
A second item that is extremely important has to do with the impact of any cut-point. It is well known that field test results underestimate how well students perform on operational testing. Underperformance in field testing is due to several factors, chief among which are students' recognition that the test scores do not count and teaching practices that are not yet congruent with the standards on which the tests are based. The amount of underestimation for Living Environment is unknown. This suggests to the study author that the state closely monitor the initial operational administrations and repeat the standard setting with difficulty parameters and impact data based on operational data.
For initial operational testing, the study author recommends that the cut-point for meeting standards be set within the raw score range of 30-40. The committee means and medians fall within this range. Within this range, the final cut-point should be set based on the state's best judgment as to the improvement that will actually occur once operational testing begins. That judgment should be informed by discussions with test developers, curriculum specialists, and teachers. The study author recommends choosing a raw score of 36, which is high with respect to the committee bookmarks, but more in line with committee estimates of current performance.
For initial operational testing, the study author recommends that the cut-point for meeting standards with distinction be set within the raw-score range of 72-78. Again, all committee mean and median judgments fall within that range. And again, within that range choice should be made based on the estimated improvement from field testing to operational testing. The state should realize, however, that improvement in the upper range of scores is likely to be less than improvement in the lower range of scores. The study author recommends choosing a raw score of 73 for that cut-point.
Reconvened Supplement: Reconvened Round of Living Environment Standard Setting
Six committee members were reconvened for a final round of judgment for the standard setting of Living Environment. Their instructions are given below. The final round was convened to confirm or adjust the judgments made to this point in the study. The same criteria for judgments were employed in this round.
After about two hours of deliberation, the six judges set the logit values for passing and for passing with distinction as follows:
The passing cut score is about 0.639 standard deviations higher than the mean judgment to this point. The passing with distinction cut score shifted by about the same amount. Both cut scores represent closer agreement with the definitions of achieving the standards in the best judgment of the panelists.
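The arithmetic behind locating a cut score a given number of committee-estimate standard deviations above the mean can be illustrated as follows. The mean and standard deviation used here are placeholder values, not the study's actual figures:

```python
# A cut score expressed as a shift above the mean committee judgment,
# measured in standard deviations of the committee estimates.
# The mean and SD values below are hypothetical placeholders.

def shifted_cut(mean, sd, shift_in_sds):
    """Cut score located `shift_in_sds` standard deviations above the mean."""
    return mean + shift_in_sds * sd

# E.g., with a hypothetical mean logit of 0.0 and SD of 0.25, a shift of
# 0.639 SDs places the cut at roughly 0.16 on the logit scale.
print(round(shifted_cut(0.0, 0.25, 0.639), 2))
```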
Final Round: Questions for Panel
You have made your judgments and have had the opportunity to review and adjust them in terms of the judgments made by the other panelists. I am going to ask you to:
Please record what you think was your original judgment for passing or for passing with distinction.
The cutoff scores recommended thus far are -0.2 and 1.95, respectively. If there are any adjustments you would like to make based on these original judgments, please record them.