Final Report

Physical Setting/Earth Science Regents Examination:
Data and Information Related to Standard Setting

A study performed for the New York State Education Department by

Gary Echternacht
Gary Echternacht, Inc.
4 State Park Drive
Titusville, NJ 08560

(609) 737-8187
garyecht@aol.com

April 27, 2001

Introduction

The New York State Board of Regents has established learning standards that all students must meet to graduate from high school. One set of learning standards covers mathematics, science, and technology. Within that set, some standards apply to earth science. In terms of general content, the earth science standards address:

  • Astronomy
  • Meteorology and weather
  • Geology

Key ideas, performance indicators, and sample tasks further describe each learning standard. Standards are also broken down by educational level--elementary, intermediate, and commencement. To assess the extent to which students have met the learning standards, the New York State Education Department has developed a testing program whose content reflects accomplishment of those standards. For earth science, the State Education Department has developed the Regents Examination in Physical Setting/Earth Science, which addresses the learning standards pertaining to the content areas above.

Although scores for the Physical Setting/Earth Science Regents Examination are placed on a numerical scale, there are essentially only three score categories--does not meet standards, meets standards, and meets standards with distinction. New York State teachers, using professionally established procedures, developed the test items, which were pretested and field-tested on samples of students.

The purpose of the study described in this report is to obtain information that the State Education Department can use to establish scores that will classify test takers into the does-not-meet-standards, meets-standards, and meets-standards-with-distinction categories. Setting cut-points requires judgment. This study employs professionally established methods to quantify and summarize the judgments of experts about how individuals who have met the learning standards will perform on the test.

The Physical Setting/Earth Science Regents Examination

The Physical Setting/Earth Science Regents Examination assesses student achievement at the commencement level. Items for the examination were developed through the cooperative efforts of teachers, school districts, other science educators, and New York State Education Department staff. The examination consists of two parts: a written examination and a performance examination. The written portion of the examination is administered in a 3-hour period and will first be offered in June 2001.

The written part of the examination has three sections (or parts):

  • Part A consists of multiple-choice questions assessing the student's knowledge and understanding of core material.
  • Part B consists of multiple-choice and constructed-response questions assessing the student's ability to apply, analyze, and evaluate material.
  • Part C consists of constructed-response and extended-response questions assessing the student's ability to apply knowledge of science concepts and skills.

The performance part of the examination, termed Part D, assesses laboratory skills. Part D must be administered prior to the written examination.

The examination blueprint, taken from the test sampler, is given in the table below:

Content                                                          Approximate Weight (%)
----------------------------------------------------------------------------------------
Standard 1 (Analysis, Inquiry, and Design):
  Mathematical Analysis; Scientific Inquiry; Engineering Design         15-20
Standard 2: Information Systems                                          0-5
Standard 6 (Interconnectedness: Common Themes):
  Systems Thinking; Models; Magnitude and Scale; Equilibrium
  and Stability; Patterns of Change; Optimization                       15-20
Standard 7 (Interdisciplinary Problem Solving):
  Connections; Strategies                                                5-10
Standard 4: Key Idea 1                                                  20-25
Standard 4: Key Idea 2                                                  20-25
Standard 4: Key Idea 3                                                   0-5

A complete description of the examination, including test specifications and scoring rubrics, is given in the test sampler.

Methods Employed

Data related to the performance standards for the test were obtained from a committee of experts. Judgments from committee members were quantified using standard practices employed by psychometricians who conduct standard setting studies. The committee made its judgments with respect to the difficulty scale resulting from the scaling and equating of field test items. In the field testing, each item, or each score category if the item has multiple score points, is given a difficulty parameter obtained through item response theory methods. Test items corresponding to various points on the difficulty scale are presented as examples of test items at each difficulty level. The items used came from the anchor test form, the form upon which the cut-points are set and to which all later forms of the test will be equated.
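The report does not specify which item response theory model was used in the calibration. As a purely illustrative sketch, the snippet below shows how items calibrated under a Rasch (one-parameter logistic) model might be ordered by difficulty to build an ordered item booklet; the model choice, item names, and difficulty values are assumptions, not the Department's actual calibration.

```python
# Illustrative sketch only: a Rasch (1PL) item characteristic curve and the
# ordering of items by difficulty, as used to build an ordered item booklet.
# Item names and difficulty values are hypothetical.
import math

def p_correct(theta: float, b: float) -> float:
    """Rasch probability that an examinee of ability theta answers an
    item of difficulty b correctly."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# Hypothetical difficulty parameters from a field-test calibration.
item_difficulties = {"item_01": 0.23, "item_02": -1.15, "item_03": 0.91}

# Order items from easiest to hardest, as in the ordered item booklet.
for name, b in sorted(item_difficulties.items(), key=lambda kv: kv[1]):
    print(f"{name}: b = {b:+.2f}, P(correct | theta = 0) = {p_correct(0.0, b):.2f}")
```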

Committee members were given definitions of three performance categories--not meeting standards, meeting standards, and meeting standards with distinction. The State Education Department developed these category definitions, and they are applied to all of the Regents examinations being developed. In addition, committee members were given an exercise designed to help familiarize them with the examination and an exercise in which they were asked to categorize some of their students into the performance categories as defined by the State Education Department.

The committee met as a group on April 2, 2001 at the State Education Department.

The standard setting study used the bookmarking approach because all of the multiple-choice and constructed-response items had been scaled using item response theory methods and because the bookmarking procedure enables committee members to consider the two item types together.

In the bookmarking procedure, multiple-choice and constructed-response items are ordered in terms of their difficulty parameters. The ordered items illustrate the meaning of the difficulty scale at specific points. Committee members are asked to apply their judgments to these ordered items. The committee meeting is conducted in rounds. The rounds and the activities employed in each round are given below.

Round 1: Committee members review the Learning Standards for the content area and consider ways of measuring accomplishment of the performance indicators and key ideas. Committee members review the ordered items and come to understand the increasing complexity of the items and the responses required.

Round 2: Working individually, committee members set their bookmark for meeting the standards. That is, committee members conceive of an individual who has the minimum level of skill and knowledge needed to meet the learning standards and indicate the last item (or difficulty level) that this hypothetical individual is likely to answer correctly two-thirds of the time (or to construct a response that is at least as good). (A sketch of the two-thirds convention follows this list.)

Round 3: Working individually, committee members set their bookmark for meeting standards with distinction. That is, committee members conceive of an individual who has the minimum level of skill and knowledge needed to meet the standards with distinction and indicate the last item such a student is likely to answer correctly (or to construct a response that is at least as good).

Round 4: A report of the round 2 results is given to committee members. The committee is divided into small groups, and the individual results are discussed. Committee members revise their judgments in light of the discussion.

Round 5: The same procedure as in round 4 is used with the round 3 results.

Round 6: A report of rounds 4 and 5 is given to the committee, along with the impacts (the percent below the committee median for meeting standards and the percent above for meeting standards with distinction, based on field test results). Committee members make final judgments based on the accumulated judgments and data.
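Round 2's two-thirds criterion corresponds to a common response-probability convention (often called RP67). Under a Rasch model, an examinee answers an item of difficulty b correctly with probability 2/3 when his or her ability equals b + ln 2. The sketch below illustrates that algebra; it is a generic illustration under an assumed model, not the study's operational computation, which is expressed directly on the field-test difficulty scale.

```python
# Illustrative sketch of the RP67 ("two-thirds") convention under a Rasch
# model: solve 1 / (1 + exp(-(theta - b))) = rp for theta, giving
# theta = b + ln(rp / (1 - rp)). An assumption for illustration only.
import math

def rp_ability(b: float, rp: float = 2.0 / 3.0) -> float:
    """Ability at which the probability of answering a Rasch item of
    difficulty b correctly equals rp."""
    return b + math.log(rp / (1.0 - rp))

# Example: a bookmarked item of difficulty 0.30 implies an ability cut of
# roughly 0.30 + ln 2 = 0.99 under this convention.
print(round(rp_ability(0.30), 2))  # 0.99
```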

Committee members were also asked four overall questions about accomplishment of the learning standards and test performance. Answers to these questions might aid New York in setting appropriate performance standards on the test. The questions were:

  • What percentage of the students in your classes are currently meeting the learning standards?
  • What percentage of the students in your classes are currently meeting the learning standards with distinction?
  • Which is the more serious error--to categorize a student as having met the learning standards when, in fact, that student has not, or to categorize a student as having not met the learning standards when, in fact, that student has?
  • Which is the more serious error--to grant distinction to a student who has not met the learning standards at that level, or to fail to grant distinction to a student who has achieved that level of proficiency?

Committee members provided judgments relating to the performance test using the following procedure.

  • Committee members reviewed the directions to the student and the scoring rubric for each item. All committee members were familiar with the rubric, and most had used it in scoring the performance test.
  • Committee members then estimated the score that the borderline student who meets standards (i.e., a student who minimally meets the standards) would achieve on the test.
  • Committee members then estimated the score that the borderline student who meets the standards with distinction would achieve on the test.
  • Score distributions of the individual committee members' judgments were obtained (a sketch of such a summary follows this list).
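The following is a minimal sketch of how individual judgments might be summarized into the statistics reported later in this report (mean plus or minus one and two standard deviations, median, and quartiles). The judgment values are hypothetical, not the committee's actual data.

```python
# Minimal sketch: summarizing hypothetical committee judgments for the
# performance test into the statistics reported in the tables below.
import statistics

judgments = [11, 12, 12, 13, 13, 13, 14, 14, 15, 15]  # hypothetical scores

mean = statistics.mean(judgments)
sd = statistics.stdev(judgments)  # sample standard deviation
q1, median, q3 = statistics.quantiles(judgments, n=4)  # quartiles

print(f"mean = {mean:.1f}, sd = {sd:.1f}")
print(f"mean + 2 SD = {mean + 2 * sd:.1f}, mean - 2 SD = {mean - 2 * sd:.1f}")
print(f"25th = {q1}, median = {median}, 75th = {q3}")
```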

Committee Members

The New York State Education Department assembled a committee of 24 people to provide judgments for the study. Committee members were, with one exception, current or former classroom teachers. Some were supervisors. One committee member was a representative of the business community. All committee members were recognized as very knowledgeable about the learning standards pertaining to physical setting and earth science and about how students perform on standardized tests similar to the Physical Setting/Earth Science Examination. Some had worked on some aspect of either the standards or the development of the tests.

Committee members, their schools, the number of years of experience each has teaching earth science, and the number of students currently in their earth science classes are given in the table below.

Committee Member       School and Location                             Years Teaching   Current Students
----------------------------------------------------------------------------------------------------------
Sue Ellen Ali          Highland Residential Center, Highland                 17               10
David Banker           Stamford Central School, Stamford                     21               33
Mary Bishop            Saugerties High School, Saugerties                    30               78
Kathleen Champney      Retired                                               37                0
Dennis Conklin         Retired                                               34                0
Kathy Conway           Sand Creek Middle School, Albany                      12               25
Dennis DeSain          Retired                                               30                0
Lisa Gottlieb          Ardsley High School, Ardsley                           2               74
Frances Hess           Cooperstown High School, Cooperstown                  36               85
Susan Hoffmire         Phoenix Central High School, Phoenix                   3               60
Faye Landsman          Community District 10, Bronx                           5                0
Janette Liddle         Adirondack High School, Boonville                     17               26
Michael McDonnell      Millwood High School, Brooklyn                         7              100
Glenn Meyer            Marlboro High School, Marlboro                        17                0
Glen Olf               Hoosac School, Hoosick                                26               13
George Pafumi          Geologist                                              0                0
John Pritchard         Grover Cleveland High School, Ridgewood                8               10
Jack Ridolph           Roy C. Ketcham High School, Wappingers Falls          31               75
Len Sharp              Liverpool High School, Liverpool                      30              105
Sue Marie Soto         Health Opportunities High School, Bronx                2                0
Nancy Spaulding        Elmira Free Academy, Elmira                           35                0
Wendy Taylor           Schenectady High School, Schenectady                   6              110
Bernadette Tomaselli   Lancaster High School, Lancaster                      24               70
Ruth Wahl              Allegany-Limestone High School, Allegany              13              125

Committee members were chosen so that they would represent a wide range of schools and different types of students. Each committee member was asked to complete a short background questionnaire that included questions about their sex, ethnic background, and the setting for their school. Results of the questionnaire tabulations are given in the table below.

Characteristic                            Percent of Committee
----------------------------------------------------------------
Sex
  Female                                         58%
  Male                                           42%
Ethnic Background of Committee Member
  Hispanic                                        4%
  White                                          96%
School Setting
  New York City                                  21%
  Other urban                                    13%
  Suburban                                       33%
  Rural                                          33%

Findings Related to the Bookmarking Procedure

Findings--Round 2

In round 2, every committee member independently placed his or her bookmark for meeting standards. The results of the placements are given in the table below. The table gives the difficulty parameter, the corresponding raw score, and the percentage of students below that raw score based on the field test results. The cut-points shown include the committee mean plus or minus one and two standard deviations (i.e., standard deviations of the committee estimates), the committee median, and the cut-points corresponding to the 75th and 25th percentiles of the committee estimates. (A sketch of how a difficulty cut-point might be translated into a raw score and an impact percentage follows the table.)

Cut-point          Difficulty   Raw Score (Max = 83)   Percent Below
----------------------------------------------------------------------
Mean + 2 SD           1.60              73                  93%
Mean + 1 SD           0.91              60                  70%
Mean                  0.23              28                   9%
Mean - 1 SD          -0.46              11                   1%
Mean - 2 SD          -1.15               4                   0%
75th percentile       0.60              45                  34%
Median                0.30              31                  13%
25th percentile      -0.30              17                   2%
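The report does not spell out how a difficulty cut is converted into a raw score and an impact percentage. One common approach maps an ability cut to an expected raw score through the test characteristic curve and then reads the impact off the field-test score distribution. The sketch below illustrates that approach under an assumed Rasch model with hypothetical item difficulties and scores; the operational computation, including the handling of multi-point constructed-response items, may differ.

```python
# Hedged sketch: translating a difficulty cut-point into a raw score via the
# test characteristic curve, then computing impact (percent below) from a
# field-test score distribution. The Rasch model, item difficulties, and
# scores are assumptions for illustration only.
import math

def p_correct(theta: float, b: float) -> float:
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def expected_raw_score(theta: float, difficulties: list) -> float:
    """Test characteristic curve: expected raw score at ability theta,
    treating all items as dichotomous."""
    return sum(p_correct(theta, b) for b in difficulties)

def percent_below(cut_score: float, scores: list) -> float:
    """Impact: percentage of field-test examinees scoring below the cut."""
    return 100.0 * sum(s < cut_score for s in scores) / len(scores)

# Hypothetical calibration: 83 dichotomous items spread across difficulties.
difficulties = [-2.0 + 4.0 * i / 82 for i in range(83)]
field_test_scores = [28, 31, 35, 40, 44, 47, 52, 58, 63, 70]  # hypothetical

raw_cut = expected_raw_score(0.30, difficulties)  # cut at difficulty 0.30
print(f"raw cut ~ {raw_cut:.0f}; impact = "
      f"{percent_below(raw_cut, field_test_scores):.0f}% below")
```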

Findings--Round 3

In round 3, every committee member independently placed his or her bookmark for meeting standards with distinction. The results of the placements are given in the table below. The table gives the difficulty parameter, the corresponding raw score, and the percentage of students above that raw score based on the field test results. The cut-points shown include the committee mean plus or minus one and two standard deviations, the committee median, and the cut-points corresponding to the 75th and 25th percentiles of the committee estimates.

Cut-point          Difficulty   Raw Score (Max = 83)   Percent Above
----------------------------------------------------------------------
Mean + 2 SD           2.36              82                   0%
Mean + 1 SD           1.83              76                   4%
Mean                  1.30              69                  13%
Mean - 1 SD           0.76              56                  40%
Mean - 2 SD           0.23              28                  91%
75th percentile       1.70              74                   6%
Median                1.50              71                  10%
25th percentile       0.93              60                  30%

Findings--Round 4

In round 4, committee members received a report of their round 2 results. They were also placed in small groups, where the individual results were discussed. After the discussion, committee members were asked to place another bookmark for meeting standards based on the information and knowledge they had gained to that point. The round 4 results, which generally show less variation than the round 2 results, are given in the table below.

Cut-point          Difficulty   Raw Score (Max = 83)   Percent Below
----------------------------------------------------------------------
Mean + 2 SD           1.41              69                  87%
Mean + 1 SD           0.77              57                  62%
Mean                  0.13              27                   8%
Mean - 1 SD          -0.52              10                   1%
Mean - 2 SD          -1.16               4                   0%
75th percentile       0.50              40                  25%
Median               -0.05              23                   5%
25th percentile      -0.40              15                   2%

Findings--Round 5

In round 5, committee members received a report of their round 3 results. They were also placed in small groups, where the individual results were discussed. After the discussion, committee members were asked to place another bookmark for meeting standards with distinction based on the information and knowledge they had gained to that point. The round 5 results, which generally show less variation than the round 3 results, are given in the table below.

Cut-point          Difficulty   Raw Score (Max = 83)   Percent Above
----------------------------------------------------------------------
Mean + 2 SD           2.10              82                   0%
Mean + 1 SD           1.73              74                   6%
Mean                  1.37              69                  13%
Mean - 1 SD           1.00              63                  24%
Mean - 2 SD           0.63              46                  63%
75th percentile       1.50              71                  10%
Median                1.50              71                  10%
25th percentile       1.18              66                  19%

Findings--Round 6

In round 6, committee members received a report of their round 4 and round 5 judgments. They also received a report of the impact of their estimates from those rounds. Impact was reported in terms of the frequency distributions of the field test scores. The committee was also advised that scores from field testing generally underestimate operational test performance, but that the amount of the underestimate was not known. Committee members then returned to their groups and discussed the report and their judgments. At the end of the discussion, committee members were asked to place new bookmarks for both meeting standards and meeting standards with distinction based on the information and knowledge they had at that time. Results of this final placement are given in the table below.

                         Meeting Standards               Meeting Standards with Distinction
Cut-point          Difficulty   Raw Score   % Below     Difficulty   Raw Score   % Above
-------------------------------------------------------------------------------------------
Mean + 2 SD           1.31         69         87%          1.90         79          2%
Mean + 1 SD           0.84         58         65%          1.72         74          6%
Mean                  0.36         33         15%          1.53         73          7%
Mean - 1 SD          -0.11         21          4%          1.34         69         13%
Mean - 2 SD          -0.59          8          0%          1.16         66         19%
75th percentile       0.61         45         34%          1.53         73          7%
Median                0.50         40         25%          1.50         71         10%
25th percentile       0.10         26          8%          1.50         71         10%

Other Judgments Obtained

Committee members were asked to provide their best judgment of the percentage of their current students who are meeting the learning standards, as well as the percentage of their current students who are meeting the learning standards with distinction. These judgments were made not with respect to the test, but with respect to the learning standards and the definitions of meeting standards and meeting standards with distinction. Results appear in the table below.

Statistic          % Meeting Standards   % Meeting Standards with Distinction
-------------------------------------------------------------------------------
Mean + 2 SD               100%                          63%
Mean + 1 SD                97%                          44%
Mean                       73%                          24%
Mean - 1 SD                49%                           5%
Mean - 2 SD                24%                           0%
75th percentile            89%                          37%
Median                     75%                          15%
25th percentile            64%                          10%

The data in the table above relate to the cut-points for the test: the committee, on average, indicated that in its judgment almost one in four students in the state is not currently achieving at the level suggested by the learning standards. This assessment was made without reference to test scores and is independent of them. Similarly, the committee judged that about 15% to 25% of students were achieving at the distinction level.

Also noteworthy are the relatively large standard deviations of the estimates. These reflect the very real variation in achievement among classrooms. For example, estimates of the percentage of students achieving at least at the meets-standards level ranged from 1% to 100%. For meeting standards with distinction, the estimates ranged from 0% to 75%.

With respect to the relative severity of the errors of classification, 71% of the committee said that classifying a student as having not met standards who in reality has met the learning standards was more serious than classifying a student as having met the standards who in reality has not met the learning standards. Twenty-nine percent of the committee said the opposite. With respect to meeting standards with distinction, the committee was about evenly divided. Fifty-four percent said that not granting a student distinction who in reality has attained that level of achievement was the more serious error.

Thus, the committee might be considered "lenient" with respect to setting the lower cut-point, but indifferent about setting the higher cut-point.

Cut-Points for the Performance Test

Results for the performance test are given in the table below.

Statistic          Meeting Standards   Meeting Standards with Distinction
---------------------------------------------------------------------------
Mean + 2 SD              17.1                      23.7
Mean + 1 SD              15.0                      22.2
Mean                     12.9                      20.7
Mean - 1 SD              10.8                      19.2
Mean - 2 SD               8.7                      17.7
75th percentile          14.0                      21.3
Median                   13.0                      21.0
25th percentile          11.8                      20.0

Discussion and Recommendations

The purpose of this study was to obtain data and information that New York may use in setting cut-points for the Physical Setting/Earth Science Examination. The data should be used to guide those decisions.

The committee that provided the data was diverse and well represented the diversity of New York students, teachers, and school districts. With that diversity, it is not surprising that committee judgments varied.

The final bookmarks from the procedure are given in the table below.

                         Meeting Standards               Meeting Standards with Distinction
Cut-point          Difficulty   Raw Score   % Below     Difficulty   Raw Score   % Above
-------------------------------------------------------------------------------------------
Mean + 2 SD           1.31         69         87%          1.90         79          2%
Mean + 1 SD           0.84         58         65%          1.72         74          6%
Mean                  0.36         33         15%          1.53         73          7%
Mean - 1 SD          -0.11         21          4%          1.34         69         13%
Mean - 2 SD          -0.59          8          0%          1.16         66         19%
75th percentile       0.61         45         34%          1.53         73          7%
Median                0.50         40         25%          1.50         71         10%
25th percentile       0.10         26          8%          1.50         71         10%

The committee also indicated that currently about 25% of students are not meeting the learning standards and that about 15% to 25% of students are meeting the standards at the distinction level. Further, the committee overwhelmingly believes that the error of classifying a student as not meeting standards who in reality has met standards should be minimized. The committee seems indifferent with respect to classification errors at the distinction level.

Finally, it is well known that student performance improves once operational testing begins. What is not known is the amount of improvement that might be expected.

Final judgments for the performance test are given in the table below.

Statistic          Meeting Standards   Meeting Standards with Distinction
---------------------------------------------------------------------------
Mean + 2 SD              17.1                      23.7
Mean + 1 SD              15.0                      22.2
Mean                     12.9                      20.7
Mean - 1 SD              10.8                      19.2
Mean - 2 SD               8.7                      17.7
75th percentile          14.0                      21.3
Median                   13.0                      21.0
25th percentile          11.8                      20.0

What should be made of these results?

The study author recognizes that New York has the responsibility and duty to set cut-points in such a way that the purpose of the testing program is best accomplished. That requires judgment and consideration of all the data and information available at the time cut-points are set. The study author strongly encourages New York not to adopt the means from the bookmarking procedure routinely as the final cut-points. Final cut-points should result from staff deliberations using all of the data presented in this report.

It is well known that field test results underestimate how well students perform on operational tests. Underperformance in field testing is due to several factors, chief among them students' recognition that field test scores do not count and teaching practices that are not yet congruent with the standards on which the tests are based. The amount of underestimation for the Physical Setting/Earth Science Examination is unknown, yet the difficulty parameters and impact estimates used in the standard setting were based on field test statistics.

Thus, the study author's first recommendation is to repeat the standard setting study once the test becomes operational. The repeated study should use item difficulty and impact estimates obtained from operational testing, not simply repeat the analysis with the same data. If that is not possible, the study author encourages New York to repeat the standard setting study with other methods, such as the contrasting groups method, which does not rely on the state collecting item-level data on a large scale.
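For readers unfamiliar with it, the contrasting groups method classifies students as meeting or not meeting the standard based on teacher judgment and then locates the cut score where the two groups' score distributions separate. The sketch below chooses the candidate cut that minimizes misclassifications; the data and the specific decision rule are assumptions for illustration, not a prescription for the operational study.

```python
# Hedged sketch of the contrasting groups method: teachers classify students
# as masters or non-masters, and the cut is placed where misclassification
# against the test score is smallest. All data are hypothetical.

students = [(22, "non-master"), (28, "non-master"), (31, "master"),
            (35, "non-master"), (38, "master"), (42, "master"),
            (45, "master"), (50, "master")]  # (score, teacher judgment)

def misclassifications(cut: int) -> int:
    """Masters scoring below the cut plus non-masters scoring at or above it."""
    return sum((label == "master") != (score >= cut)
               for score, label in students)

candidate_cuts = sorted({score for score, _ in students})
best_cut = min(candidate_cuts, key=misclassifications)
print(f"contrasting-groups cut ~ {best_cut} "
      f"({misclassifications(best_cut)} misclassifications)")
```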

For initial operational testing, the study author recommends that the cut-point for meeting standards be set within the raw score range of 33-45. The committee means and medians fall within this range. Within this range, the study author recommends that the final cut-point be set based on the state's best judgment as to the improvement that will actually occur once operational testing begins. That judgment should be informed by discussions with test developers, curriculum specialists, and teachers. The study author would choose a raw score of 40, which is lenient and would likely result in fewer than about 20% of students failing the test--a lower percentage than the teachers' estimates of students not currently meeting standards. This is only the personal opinion of the study author, however.

For initial operational testing, the study author recommends that the cut-point for meeting standards with distinction be set within the raw score range of 69-74. Again, all committee mean and median judgments fall within that range. And again, within that range the choice should be based on the estimated improvement from field testing to operational testing, informed by discussions with test developers, curriculum specialists, and teachers. The state should realize, however, that improvement in the upper range of scores is likely to be smaller than improvement in the lower range of scores. The study author would choose a raw score of 71, but again, that is the personal opinion of the study author only.

With respect to the performance test, until more data on performance can be collected, the study author recommends that cut-points of 13 and 21 (the committee medians) be used.