Final Report

Grade 8 Intermediate Level

Social Studies Examination--Data and Information Related to Standard Setting

A study performed for the New York State Education Department by

Gary Echternacht
Gary Echternacht, Inc.
4 State Park Drive
Titusville, NJ 08560

(609) 737-8187
garyecht@aol.com

April 27, 2001

Introduction

The New York State Board of Regents has established learning standards all students must meet to graduate from high school. One set of learning standards is for social studies. The standards pertain to:

  • History of the United States and New York
  • World History
  • Geography
  • Economics
  • Civics, Citizenship, and Government

Key ideas, performance indicators, and sample tasks further describe each learning standard. Standards are also broken down by educational level--elementary, intermediate, and commencement. To assess the extent to which students have met the learning standards, the New York State Education Department has developed a testing program. The content of the tests reflects accomplishment of the learning standards. For social studies, the State Education Department has developed the Grade 8 Intermediate Level Social Studies Test to reflect accomplishment of the learning standards at the intermediate level. Most students will take this test at the end of grade 8. Schools must provide students who fail the test with special academic intervention programs.

Although scores for the test are placed on a numerical scale, there are essentially only three score categories--not meeting the standards, meeting the standards, and meeting the standards with distinction. The test items have been developed by New York State teachers using professionally established procedures and have been pretested and field-tested on samples of students.

The purpose of the study described in this report is to obtain information that the State Education Department can use to establish the scores that will classify test takers into the "does not meet standards," "meets standards," and "meets standards with distinction" categories. Setting cut-scores requires judgment. This study employs professionally established methods to quantify and summarize the judgments of experts about how individuals who have met the learning standards will perform on the test.

The Grade 8 Intermediate Level Social Studies Test

The Grade 8 Intermediate Social Studies Test is a three-part test administered in two one-and-a-half-hour sessions. Test content is based on the intermediate-level key ideas and performance indicators found in the Learning Standards for Social Studies and the Social Studies core curriculum, developed and adopted by the Board of Regents. The three parts of the examination are as follows:

  • Part I consists of 45 multiple-choice questions. The questions test knowledge of content included in all social studies units and cover the five social studies standards--US and New York history, world history, geography, economics, and civics, citizenship, and government. The multiple-choice part makes up 50% of the total test score.
  • Part II consists of 3-4 constructed-response prompts with questions that follow each prompt. Most questions are scored on either a 0-1 or 0-2 scale. This part of the examination is designed to test a student's ability to draw inferences from a stimulus (e.g., a chart, picture, or map) and makes up 20% of the total score.
  • Part III consists of a document-based question and makes up 30% of the total test score. The document-based question requires students to identify and explore multiple perspectives on events or issues by examining, analyzing, and evaluating textual and visual primary and secondary documents. It consists of a series of short-answer scaffold questions based on individual documents (10% of the total score) and a document-based essay question (20% of the total score). A sketch of how these part weights combine into a total score follows this list.
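
To make the part weighting concrete, the following is a minimal sketch of how part scores could be combined into a 0-100 composite. The part maximums, example scores, and function name are illustrative assumptions only; the official conversion charts in the test sampler govern actual scoring.

    # A sketch of the part weighting described above (Python).
    # Part maximums below are illustrative assumptions, not official values.

    WEIGHTS = {
        "multiple_choice": 0.50,       # Part I
        "constructed_response": 0.20,  # Part II
        "dbq_scaffold": 0.10,          # Part III scaffold questions
        "dbq_essay": 0.20,             # Part III document-based essay
    }

    def composite_score(raw, maximums):
        """Combine part raw scores into a 0-100 weighted composite."""
        return sum(WEIGHTS[part] * (raw[part] / maximums[part]) * 100
                   for part in WEIGHTS)

    # Example with assumed part maximums and scores.
    maximums = {"multiple_choice": 45, "constructed_response": 16,
                "dbq_scaffold": 10, "dbq_essay": 10}
    raw = {"multiple_choice": 38, "constructed_response": 12,
           "dbq_scaffold": 8, "dbq_essay": 7}
    print(round(composite_score(raw, maximums), 1))  # prints 79.2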

Items for the test were developed and pretested by a consortium of teachers, supervisors, and administrators from school districts across the State; Erie I BOCES staff; and State Education Department staff. All constructed-response, scaffold, and essay responses are scored holistically by trained teachers in their districts using scoring rubrics.

A complete description of the examination, including test specifications and scoring rubrics, is given in a test sampler.

Methods Employed

Data related to the performance standards for the test were obtained from a committee of experts. Judgments from committee members were quantified using standard practices employed by psychometricians who conduct standard setting studies. The committee made its judgments with respect to the difficulty scale resulting from the scaling and equating of field test items. In the field testing, each item (or score category, if the item has multiple score points) is given a difficulty parameter obtained through item response theory methods. Test items corresponding to various points on the difficulty scale are presented as examples of test items at those difficulty levels. The items used came from the anchor test form--the form on which the cut-scores are set and to which all later forms of the test will be equated.
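
To illustrate what a difficulty parameter means on this scale, the sketch below assumes a Rasch (one-parameter logistic) model; the report does not state which item response model was used, so the model and the numbers here are assumptions for illustration only.

    import math

    def p_correct(theta, b):
        """Rasch model: probability that a student with ability theta
        answers an item of difficulty b correctly."""
        return 1.0 / (1.0 + math.exp(-(theta - b)))

    # An item with difficulty b = 0.5 is a 50/50 item for a student
    # whose ability is 0.5; it is easier for students above that point
    # on the scale and harder for students below it.
    for theta in (-0.5, 0.5, 1.5):
        print(f"theta = {theta:+.1f}  P(correct) = {p_correct(theta, 0.5):.2f}")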

Committee members were given definitions of three performance categories--not meeting standards, meeting standards, and meeting standards with distinction. The State Education Department developed these category definitions, and they apply to all of the Intermediate Level tests being developed. In addition, committee members completed an exercise in which they categorized some of their own students into the performance categories as defined by the State Education Department.

The committee met as a group on March 20, 2001 at the State Education Department.

The standard setting study used the bookmarking approach because all of the multiple-choice and constructed-response items had been scaled using item response theory methods and because the bookmarking procedure enables committee members to consider the two item types together.

In the bookmarking procedure, multiple-choice and constructed-response items are ordered by their difficulty parameters; the ordered items illustrate the meaning of the difficulty scale at specific points. Committee members apply their judgments to these ordered items. The committee meeting is conducted in rounds. The rounds and the activities employed in each round are given below.

Round 1. Committee members review the Learning Standards for the content area and consider ways of measuring accomplishment of the performance indicators and key ideas. They also review the ordered items to understand the increasing complexity of the items and of the responses required.

Round 2. Working individually, committee members set their bookmark for meeting standards. That is, each member conceives of an individual who has the minimum level of skill and knowledge needed to meet the learning standards and indicates the last item (or difficulty level) that this hypothetical individual is likely to answer correctly two-thirds of the time (or for which the individual would construct a response at least that good).

Round 3. Working individually, committee members set their bookmarks for meeting the learning standards with distinction. That is, each member conceives of an individual who has the minimum level of skill and knowledge needed to meet the standards with distinction and indicates the last item that such a student is likely to answer correctly (or for which the student would construct a response at least that good).

Round 4. Committee members receive a report of the round 2 results. The committee is divided into small groups, and the individual results are discussed. Committee members revise their judgments in light of the discussion. Responses are recorded both on data sheets and in the notebook of ordered items.

Round 5. The same procedure as in round 4 is applied to the round 3 results.

Round 6. The committee receives a report of the round 4 and round 5 results, along with the impacts (percent failing and percent passing with distinction, based on field test results). Committee members make final judgments based on the accumulated judgments and data. Responses are recorded both on data sheets and in the notebook of ordered items.
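
Round 2 asks for the last item a borderline student would answer correctly two-thirds of the time, which corresponds to the response-probability convention (RP67) common in bookmark studies. Under the same assumed Rasch model sketched earlier, the ability at which P(correct) = 2/3 is b + ln 2. The item names and difficulties below are invented for illustration; they are not the actual anchor-form values.

    import math

    RP = 2.0 / 3.0  # response-probability criterion used in round 2

    def rp_location(b, rp=RP):
        """Ability theta at which P(correct) = rp under the Rasch model:
        solving rp = 1 / (1 + exp(-(theta - b))) gives
        theta = b + ln(rp / (1 - rp))."""
        return b + math.log(rp / (1.0 - rp))

    # Hypothetical item difficulties (not the actual anchor-form values).
    items = {"item_07": -0.40, "item_23": 0.15, "item_31": 0.62, "item_12": 1.10}

    # Order items as in the ordered-item booklet; a bookmark is placed
    # after the last item the borderline student is expected to answer
    # correctly two-thirds of the time.
    for name, b in sorted(items.items(), key=lambda kv: rp_location(kv[1])):
        print(f"{name}: b = {b:+.2f}, RP67 location = {rp_location(b):+.2f}")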

Committee members were also asked four overall questions about accomplishment of the learning standards and test performance. Answers to these questions might aid New York in setting appropriate performance standards on the test. The four questions were:

  • What percentage of students in your classes are currently meeting the learning standards?
  • What percentage of students in your classes are currently meeting the learning standards with distinction?
  • Which is the more serious error--to classify a student as having met the standards based on the test result when that student has not met the learning standards, or to classify a student as having not met the learning standards when that student actually has?
  • Which is the more serious error--to classify a student as having met the standards with distinction based on the test result when that student has not met the learning standards with distinction, or to classify a student as having not met the learning standards with distinction when that student actually has?

Committee Members

The New York State Education Department's Office of Curriculum and Instruction assembled a committee of 20 people to provide judgments for the study. Committee members were current or former social studies classroom teachers, all recognized as very knowledgeable about the learning standards for social studies and about how students perform on standardized tests similar to the Grade 8 Intermediate Level Social Studies Test. Some had worked on an aspect of the standards or of the test's development.

Committee members, their schools, the number of years of experience each has teaching Intermediate Level Social Studies, and the number of students currently in their grade 8 social studies classes are given in the table below.

Committee Member         School and Location                            Years Teaching   Current Students
Jacqueline Andrews       Peru Middle School, Peru                             21               96
Sari Bacon               Shulamith School for Girls                            7               60
Ann Marie Carter         Ballston Spa Middle School, Ballston Spa              4              146
Jack Daly                Willsboro Central School, Willsboro                  13               30
Rosemary Damm            Diocese of Rockville Centre, Rockville Centre        27                0
Brian Freeland           Hughes Magnet, Syracuse                               7               88
Paul Gold                West Babylon Junior High, West Babylon                4               50
Cathleen Hayes           Johnson City High School, Johnson City               21              100
Joseph Holloway          Hoosick Falls Middle School, Hoosick Falls           36              104
Lesley Hughes            Diocese of Syracuse, Syracuse                        13                0
Kathleen Anne Jankosky   Cunningham Junior High, Brooklyn                     11               35
Martha Kerr              Shaker Junior High, Latham                           11              115
Jack Lyons               Westbury Middle School, Westbury                     31              120
Claire Machosky          Woodmere Middle School, Hewlett                      34               51
Maureen Mackin           Downsville Central School, Downsville                10               22
Vanessa Randle           Huntington Middle School, Syracuse                   16              125
Linda Romano             Community School District 20, Brooklyn                6                0
Mike Smith               Marion Jr-Sr High School, Marion                      9               87
Gloria Towle-Hilt        A.H. Farnsworth Middle School, Guilderland           29              110
George Whitton           New Hartford High School, New Hartford                3               90

Committee members were chosen to represent a wide range of schools and different types of students. Each committee member completed a short background questionnaire that included questions about sex, ethnic background, and school setting. Tabulated results of the questionnaire are given in the table below.

Characteristic                            Percent of Committee

Sex
  Female                                          65%
  Male                                            35%

Ethnic Background of Committee Member
  African-American                                10%
  White                                           90%

School Setting
  New York City                                   15%
  Other urban                                     15%
  Suburban                                        40%
  Rural                                           30%

Findings Related to the Bookmarking Procedure

Findings--Round 2

In round 2, every committee member independently placed his or her bookmark for meeting standards. The results of the placements are given in the table below, which shows each cut-point's difficulty, the corresponding raw score, and the percent of students falling below that cut-score based on the field test data. The cut-points shown are the committee mean plus or minus one and two standard deviations (that is, standard deviations of the committee estimates), the committee median, and the cut-points corresponding to the 75th and 25th percentiles of committee estimates.

Cut-point          Difficulty   Raw Score (Max = 81)   Percent Below
Mean + 2 SD           1.67              69                  95%
Mean + 1 SD           1.09              49                  73%
Mean                  0.51              33                  44%
Mean - 1 SD          -0.07              14                   6%
Mean - 2 SD          -0.65               6                   1%
75th percentile       0.70              37                  52%
Median                0.70              37                  52%
25th percentile      -0.13              12                   4%
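
As a brief illustration of how rows like those above can be produced from the individual placements, the sketch below computes the mean, standard deviation, median, and quartiles of a set of committee bookmark difficulties. The bookmark values are invented for illustration; converting each difficulty to a raw score and a percent below would use the field test lookup, which is not reproduced here.

    import statistics

    # Hypothetical committee bookmark difficulties (not the actual data).
    bookmarks = [0.21, 0.35, 0.44, 0.51, 0.55, 0.62, 0.70, 0.70, 0.84, 1.02]

    mean = statistics.mean(bookmarks)
    sd = statistics.stdev(bookmarks)  # SD of the committee estimates
    q1, median, q3 = statistics.quantiles(bookmarks, n=4)

    rows = [
        ("Mean + 2 SD", mean + 2 * sd), ("Mean + 1 SD", mean + sd),
        ("Mean", mean), ("Mean - 1 SD", mean - sd),
        ("Mean - 2 SD", mean - 2 * sd), ("75th percentile", q3),
        ("Median", median), ("25th percentile", q1),
    ]
    for label, diff in rows:
        print(f"{label:16s} difficulty = {diff:+.2f}")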

Findings--Round 3

In round 3, every committee member independently placed his or her bookmark for meeting standards with distinction. The results of the placements are given in the table below, which shows each cut-point's difficulty, the corresponding raw score, and the percent of students falling above that cut-score based on the field test data. The cut-points shown are the committee mean plus or minus one and two standard deviations (that is, standard deviations of the committee estimates), the committee median, and the cut-points corresponding to the 75th and 25th percentiles of committee estimates.

Cut-point          Difficulty   Raw Score (Max = 81)   Percent Above
Mean + 2 SD           4.30              81                   0%
Mean + 1 SD           3.49              81                   0%
Mean                  2.69              81                   0%
Mean - 1 SD           1.88              73                   2%
Mean - 2 SD           1.07              48                  28%
75th percentile       3.30              81                   0%
Median                3.30              81                   0%
25th percentile       2.18              78                   1%

Findings--Round 4

In round 4, committee members received a report of their round 2 results. They were also placed in small groups, where individual results were discussed. After the discussion, committee members were asked to place another bookmark for meeting standards based on the information and knowledge they had gained to that point. The round 4 results, which generally show less variation than the round 2 results, are given in the table below.

Cut-point          Difficulty   Raw Score (Max = 81)   Percent Below
Mean + 2 SD           1.23              52                  78%
Mean + 1 SD           0.87              43                  64%
Mean                  0.52              33                  44%
Mean - 1 SD           0.17              23                  21%
Mean - 2 SD          -0.19              11                   3%
75th percentile       0.70              37                  52%
Median                0.60              36                  50%
25th percentile       0.28              24                  24%

Findings--Round 5

In round 5, committee members received a report of their round 3 results. They were also placed in small groups, where individual results were discussed. After the discussion, committee members were asked to place another bookmark for meeting standards with distinction based on the information and knowledge they had gained to that point. The round 5 results, which generally show less variation than the round 3 results, are given in the table below.

Cut-point          Difficulty   Raw Score (Max = 81)   Percent Above
Mean + 2 SD           3.77              81                   0%
Mean + 1 SD           3.20              81                   0%
Mean                  2.63              80                   0%
Mean - 1 SD           2.06              75                   2%
Mean - 2 SD           1.49              58                  14%
75th percentile       3.30              81                   0%
Median                2.30              80                   0%
25th percentile       2.20              79                   0%

Findings--Round 6

In round 6, committee members received a report of their round 4 and round 5 judgments, along with a report of the impact of those judgments. Impact was reported in terms of the frequency distributions of the field test scores. The committee was also advised that scores from field testing generally underestimate operational test performance, but that the amount of the underestimate was not known. Committee members then returned to their groups and discussed the report and their judgments. At the end of the discussion, committee members were asked to place new bookmarks for both meeting standards and meeting standards with distinction based on the information and knowledge they had at that time. Results of this final placement are given in the table below.

 

                        Meeting Standards              Meeting Standards with Distinction
Cut-point          Diff   Raw Score   % Below        Diff   Raw Score   % Above
Mean + 2 SD        1.01      46         69%          3.37      81          0%
Mean + 1 SD        0.72      38         54%          2.84      81          0%
Mean               0.44      32         43%          2.31      80          0%
Mean - 1 SD        0.16      23         21%          1.77      72          2%
Mean - 2 SD       -0.13      12          4%          1.24      52         22%
75th percentile    0.55      33         44%          2.30      80          0%
Median             0.40      29         36%          2.15      77          1%
25th percentile    0.30      24         24%          1.90      73          2%
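
The impact figures reported to the committee (percent below a passing cut, percent at or above a distinction cut) can be read directly from the distribution of field test raw scores. The sketch below uses made-up scores, and whether "below" is strict or inclusive is an assumption here.

    # Hypothetical field test raw scores on the 0-81 scale (not the actual data).
    scores = [12, 18, 23, 27, 29, 31, 33, 36, 40, 45, 52, 58, 63, 70, 74, 77]

    def percent_below(cut):
        """Percent of scores strictly below the passing cut-point."""
        return 100.0 * sum(s < cut for s in scores) / len(scores)

    def percent_at_or_above(cut):
        """Percent of scores at or above the distinction cut-point."""
        return 100.0 * sum(s >= cut for s in scores) / len(scores)

    # Example using the author's recommended cut-points of 32 and 72.
    print(f"percent failing:          {percent_below(32):.1f}%")
    print(f"percent with distinction: {percent_at_or_above(72):.1f}%")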

Other Judgments Obtained

Committee members were asked to provide their best judgment of the percentage of their students who are currently meeting the learning standards, as well as the percentage of their students who are meeting those standards at the distinction level. These judgments were made not with respect to the test, but with respect to the learning standards and the definitions of meeting standards and meeting standards with distinction. The results, showing the committee mean plus or minus one and two standard deviations (that is, standard deviations of the committee estimates), the committee median, and the 75th and 25th percentiles of committee estimates, are given below:

Statistic          % Meeting Standards   % Meeting Standards with Distinction
Mean + 2 SD               100%                          36%
Mean + 1 SD                97%                          26%
Mean                       74%                          16%
Mean - 1 SD                51%                           6%
Mean - 2 SD                27%                           0%
75th percentile            85%                          20%
Median                     83%                          10%
25th percentile            75%                          10%

The data in the table above relate to the passing scores for the test: the committee on average judged that about 15% to 25% of students in the state are not currently achieving at the minimum level suggested by the learning standards. This assessment was made not on the basis of test scores, but on the basis of the teachers' observations and judgment. Similarly, the committee on average judged that roughly 10% to 20% of students are achieving at the distinction level.

Also noteworthy is the relatively large variation in the estimates, which reflects the very real variation in achievement among classrooms. For example, estimates of the percentage of students meeting standards ranged from 10% to 96%; for meeting standards with distinction, the estimates ranged from 5% to 40%.

With respect to the relative severity of the errors of classification, three-fourths of the committee indicated that, for the passing cut-point, the more serious error was to pass a student on the examination who in fact had not met the learning standards. Similarly, about two out of three committee members indicated that the more serious error at the passing-with-distinction cut-point was to award a passing-with-distinction score to a student who had not met the learning standards at that level.

Discussion and Recommendations

The purpose of this study is to obtain data and information that New York may use in setting passing points for the Grade 8 Intermediate Social Studies Examination. The data should be used to guide those decisions.

The committee that provided the data was diverse and well represented the diversity of New York students, teachers, and school districts. With that diversity, it is not surprising that committee judgments varied.

The final bookmarks from the procedure are given in the table below:

 

                        Meeting Standards              Meeting Standards with Distinction
Cut-point          Diff   Raw Score   % Below        Diff   Raw Score   % Above
Mean + 2 SD        1.01      46         69%          3.37      81          0%
Mean + 1 SD        0.72      38         54%          2.84      81          0%
Mean               0.44      32         43%          2.31      80          0%
Mean - 1 SD        0.16      23         21%          1.77      72          2%
Mean - 2 SD       -0.13      12          4%          1.24      52         22%
75th percentile    0.55      33         44%          2.30      80          0%
Median             0.40      29         36%          2.15      77          1%
25th percentile    0.30      24         24%          1.90      73          2%

The committee also indicated, based on their own assessments of classroom performance, that currently about 15% to 25% of students are not achieving the learning standards and that about 10% to 20% of students are meeting standards with distinction. Further, the committee overwhelmingly believed that the error of passing a student who should fail is the error to minimize. Though the view was less strong, the committee also believed that the error of granting distinction when that level of achievement was not attained should be minimized. Generally, the committee thought favorably of the academic intervention that would take place for failing students.

The study author's chief concern is the nature of the field test results, especially those for the constructed response and essay items. The item difficulty parameters used in the bookmarking procedure are derived from field test results, but there are significant questions about the validity of those results. Based on the field test results, the constructed response items appear very difficult; yet throughout the committee meeting, committee members indicated that they did not believe those items were as difficult as the parameters suggested.

Field testing often results in poorer performance than might be expected in operational testing. The motivation of students is not the same, students are not prepared by teachers for field testing as they are for operational testing, and during field testing not all teachers are following the curriculum guidelines to which the test is tied. So although it is known that field test performance underestimates operational test performance, what is not known is the amount of improvement that might be expected once operational testing begins.

What should be made of these results?

The study author recognizes that New York has the responsibility and duty to set cut-points in such a way that the purpose of the testing program is best accomplished. That requires judgment and consideration of all the data and information available at the time cut-points are set.

The first recommendation is that New York repeat the standard setting study after operational testing has begun. That study should use recalculated difficulty parameters obtained from the first operational administration, and performance data from that administration should be used to guide the panelists. Using operational test data should result in more valid information concerning cut-points than can be obtained with data from field testing.

Given that cut-points must be in place before operational testing begins, however, data from the current study should be used. The study author has little confidence in the normative data provided--that is, the percentage of students who might fail and the percentage of students who might achieve distinction. Therefore, the second recommendation is that New York ignore these estimates as it deliberates over the choice of initial cut-points.

Because committee member judgments are not normally distributed, the study author recommends that the median and the 25th and 75th percentiles (the first and third quartiles) of committee judgments be used to establish the cut-points. The recommended cut-point for passing is within the range of raw scores from 24 to 33; the recommended cut-point for passing with distinction is between raw scores of 72 and 77.

Unlike what is often found for tests that students must pass to graduate, Grade 8 Intermediate Social Studies teachers feel much less negative about failing a student who has actually met the standards. They generally believe that the academic intervention a student receives as a result of failing the test is a positive experience. Consequently, the state might set the passing cut-point at the upper end of the range.

The study author recognizes that the state, not the study author, has the responsibility to set the cut-points, and that state staff have greater knowledge of curricular practice and other factors that may affect the choice of cut-points. If forced to choose, however, the study author would select cut-points of 32 for passing and 72 for passing with distinction.

Once cut-points are established for operational testing, New York can use data from this study to evaluate the choice of those cut-points. If the cut-points are appropriate, then somewhere between 15% and 25% of students should receive failing scores, while 10% to 20% of students should receive passing-with-distinction scores. Regardless of the outcome, the study author recommends that the whole standard setting study be repeated, collecting the same data as the current study, but with statistics from the operational testing available.