How the Scale Scores Are Calculated for New York State Regents Examinations
Gerald E. DeMauro
(Revised July, 2002)
Rationale for Scaling
The industry’s standards for psychological and educational testing define validity as
"the degree to which evidence and theory support the interpretations of test scores
entailed by proposed uses of the tests."1 (p. 9)
In New York State, the explicit use of State examinations is to determine if students have achieved New York State Learning Standards. Therefore, validity of the instruments demands that interpretation of test scores must be referenced to achievement of State Learning Standards.
In the past, before New York State Learning Standards were adopted, there was no such explicit interpretation of the test scores in terms of a scale referenced, point by point, to an array of skills. The adoption of State Learning Standards has made it necessary that the passing scores, or performance standards, of a test carry the interpretation that a student has achieved those learning standards. Moreover, every point on the scale score continuum should be related to the student's status with respect to achieving State Learning Standards.

This is accomplished on the State examinations by deriving the scale scores so that each score is related to a point on the array of skills that comprise State Learning Standards. In this way, a student at any given scale score can be said to have an expected probability of having any of the skills or knowledge that make up State Learning Standards. For a given skill, such as plotting points on a graph, higher scale scores indicate a higher probability that the student has mastered that skill, while lower scale scores indicate a lower probability. More importantly, this probability can be computed precisely from the student's scale score, so that educators can determine from the scale score which skills are more likely to be within the child's capability. Other kinds of scores, such as percentages correct, do not have this built-in reference to achievement, and therefore are less valid measures of achieving State Learning Standards.
Also, because the scale scores represent an invariant level of achievement, they address the basic equity issue of having the same meaning across forms of the test. In this way, students can be assured that with the same level of skills and knowledge, their score does not depend on whether this is an easier form of the test or a more difficult form of the test. The same scale score always represents the same level of achievement of State Learning Standards.
Nevertheless, these two elegant and necessary features of the scale scores, interpretation in terms of achieving State Learning Standards, and equity across test forms, come at a cost of technical complexity. Because these technical requirements were not necessary before State Learning Standards were adopted, they are often confusing to people who are unfamiliar with psychometric techniques and the demands of fairness and validity. This document addresses the technical underpinnings of the New York State Regents scale scores to assist in making them more comprehensible.
New York State Regents examinations have traditionally used 65 percent of all questions correct as a passing score. This score was not based on analyses of what would be needed to be competent with respect to a specific body of knowledge and skills, or to achieve any explicit standards. Rather, this passing standard reflected what was considered to be general competence in the subject matter being tested.
Because the tests were based on course work, they were field tested on students finishing the appropriate courses. Test forms were equated, that is, kept equivalent in difficulty, by averaging the percent of students who answered the field test questions correctly and matching that average percent from test form to test form. However, this method assumed that the year-to-year field test samples were equivalent in skills, an assumption better supported when coursework has had years of stability than in a standards-based system. In a standards-based environment, where students are continually improving in the skills and knowledge they are acquiring, and where the educational objectives are explicitly defined by State Learning Standards rather than course completion, other methods of determining passing scores and equating test forms are likely to be more appropriate.
Three Processes to Address These Changes
For the sake of equity, it is important that determination of passing and failing is made with respect to State Learning Standards and what constitutes achievement of those standards. It is also vital that the scores achieved on various forms of the same test, that is, the tests administered at each administration, are equivalent in difficulty so that determination of passing and failing is indifferent to when the test was taken. The three steps initiated to address these concerns were: (1) formal scaling and equating through field testing and follow-up with Department review of operational (counting) testing; (2) formal standard setting studies to determine what level of performance constitutes achievement of the standards; and (3) formal scale determination to set that level of performance on a 0 – 100 scale in which 0 represents no performance, 100 represents the highest possible performance, 65 represents achievement of the standards, and 85 represents achievement of the standards with distinction.
Field Test Equating
The new Regents examinations go through three rounds of administration, with evaluation after each round: pretesting, field testing, and operational (counting) testing. Each round involves deliberation of the results with New York State teachers. Pretests and field tests are evaluated by content committees. After operational testing, student test papers are sampled and rescored to evaluate the performance of the test questions before being returned to schools.
Pretesting consists of administering small sets of new test questions to small groups of students. Changes are made to the questions based on student responses and statistical evaluations. These statistics describe the test questions in terms of their difficulty and the degree of relationship between performance on each question and performance on whole sets of questions designed to measure the same learning standards.
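The two item statistics described above can be sketched as follows. This is an illustrative computation of classical item difficulty (the proportion answering correctly) and a corrected item-total correlation, not the Department's actual software; all names are made up for the example.

```python
# Sketch of the two pretest item statistics, assuming a 0/1 response
# matrix (rows = students, columns = items). Illustrative only.

def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

def item_statistics(responses):
    """Return (difficulty, discrimination) for each item.

    difficulty: proportion of students answering the item correctly.
    discrimination: correlation between the item score and the total
    score on the remaining items (corrected item-total correlation).
    """
    n_students = len(responses)
    n_items = len(responses[0])
    stats = []
    for j in range(n_items):
        item = [row[j] for row in responses]
        rest = [sum(row) - row[j] for row in responses]  # total minus this item
        stats.append((sum(item) / n_students, pearson(item, rest)))
    return stats
```

A question that nearly everyone answers correctly has a difficulty near 1.0, and a question whose results track the rest of the set has a high positive discrimination.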
Field testing involves large parts of the examinations (thirds, fourths, or fifths) administered to many thousands of students. These new sections are administered together with sections called "anchor" questions. The anchor questions come from forms of the tests that are administered repeatedly in field testing over a few years. Eventually, the anchor forms are used as operational tests.
Knowing the difficulties of the anchor questions from previous field testing makes it possible first to gauge the skills of the students taking each field test and then to determine the difficulty of the new field test questions. This statistical evaluation of the skills of the students who take each field test is a necessary component of gauging the difficulty of the new test questions: a highly skilled population would otherwise make the questions seem inordinately easy, while a poorly skilled population would make them seem inordinately hard.
All field test questions are reviewed for difficulty, fit to these statistical models, agreement with the test as a whole, and reliability, and are also analyzed for bias using the methods described in other reports. After these evaluations, student scores are put on the same scale as the item difficulty estimates that have been calculated with respect to the skills of the field test population.
First Process: Scaling and Equating
Student scores are put on the same scale as item difficulty, using what are called IRT, or item response theory, models. These models work as follows: student performance is located on the same scale as item difficulties. If the student is higher on that scale than an item's difficulty, a student performing at that level has a greater than 50 percent chance of answering that item correctly. If a student is lower on the scale than an item's difficulty, then a student performing at that level has less than a 50 percent chance of answering the item correctly. Remember, the questions are screened to have these scale properties, so that these determinations of probabilities are accurate for any given set of questions. The further a student's performance is above or below an item's difficulty, the closer the probability of a correct answer is to one or to zero.
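As a concrete illustration, one common IRT model, the Rasch model, expresses this relationship in a single formula. The sketch below is illustrative and does not assume which particular IRT model the State program uses; it simply shows the 50 percent property described above.

```python
import math

def p_correct(theta, b):
    """Rasch-model probability that a student at ability theta answers
    an item of difficulty b correctly (both on the same logit scale).

    When theta == b the probability is exactly 0.5; as theta moves
    above or below b, the probability approaches 1 or 0.
    """
    return 1.0 / (1.0 + math.exp(-(theta - b)))
```

A student one logit above an item's difficulty has roughly a 73 percent chance of answering it correctly, and a student one logit below it roughly a 27 percent chance.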
Also remember that the item difficulties have been adjusted, before the students are located on the scale, through using anchor questions on the field testing, as described above. This ensures that the performance of the students is referenced, through the common questions used in field testing, to the same common scale each time the examination is administered. As a result of these careful and deliberative processes, it does not matter which form of an operational (counting) examination a student takes, as long as the test questions on that form have been calibrated (the difficulties of the questions have been estimated with respect to the anchor questions) through the field test process.
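The anchor-based adjustment described above can be sketched with a simple mean/mean shift: the new calibration is moved so that the anchor questions recover the difficulties they had on the base scale. Operational programs use more elaborate linking methods, but the underlying idea is the same; the function and variable names here are illustrative.

```python
def equate_to_base_scale(new_calibration, base_anchor_difficulties):
    """Shift a fresh field-test calibration onto the base scale using
    the anchor items (a mean/mean adjustment; a simplified sketch).

    new_calibration: dict item_id -> difficulty from this field test
    base_anchor_difficulties: dict item_id -> difficulty on the base scale
    """
    anchors = base_anchor_difficulties.keys() & new_calibration.keys()
    # The constant that moves the anchors' mean onto the base scale ...
    shift = (sum(base_anchor_difficulties[i] for i in anchors)
             - sum(new_calibration[i] for i in anchors)) / len(anchors)
    # ... is applied to every item, anchors and new questions alike.
    return {i: d + shift for i, d in new_calibration.items()}
```

After the shift, the difficulties of the new questions are expressed on the same common scale as every previously calibrated form.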
Second Process: Formal Standard Setting
It is most important to note that standard setting is not based on an a priori notion of the percentages of students who should pass and fail a test. That would not be a means to determine whether or not a student has achieved State Learning Standards. Rather, a technique is used, called item mapping, that gauges student performance with respect to what New York State expert teachers determine to be proof of meeting the standards.
Teachers who have been identified as content experts are sampled from around the State and brought to a central meeting point; the sample is representative of the State in demographic variables and region. They are first given definitions of passing, low passing (55 – 64), and passing with distinction in the very general terms of 'meeting State Learning Standards' and 'going beyond State Learning Standards'. Because the Board of Regents defined passing and passing with distinction only as specific scale scores, 65 and 85, respectively, standard setting seeks to define operationally exactly what those scores imply about the required levels of student knowledge and skills.
The teachers discuss these definitions at some length, are given one or more forms of the tests to review, and consider how these definitions apply item by item. For example, a student might be required by State Learning Standards to know how to find the area of a circle. That student should be expected to know the relationship between the circumference and the diameter of the circle, to square a radius, etc. In this made-up example, all of this would constitute passing the Regents examination. However, such a student might not know how to compute the volume of a cone.
Teachers are then given books of the test questions on the forms they have just discussed. The books arrange the questions in order of their scale values, from easiest to hardest. Open-ended questions (e.g., essays) are given separate scale values for each possible point. For example, a "4" on an essay may be harder than item 27, so typical "4" responses are located after item 27 in the book, while a "2" may be easier, so typical "2" responses are located before item 27.
The expert teachers then divide the book to delineate which questions would constitute passing or passing with distinction, and which would not, placing bookmarks at the appropriate points. Because student scores have been gauged, based on the field tests, to item difficulties, these divisions made by the teachers correspond to points of student performance at which scores can be identified. The teachers receive feedback on where the entire group of teachers made their demarcations, and are given the opportunity to discuss their reasoning, with several iterations of judgments, until they converge on specific points. These points operationally define passing and passing with distinction.
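The bookmark procedure above can be sketched as follows. The response-probability convention (how likely a just-passing student must be to answer the bookmarked question correctly) varies across programs and is not stated here, so it is left as a hedged parameter; all names are illustrative.

```python
from statistics import median

def bookmark_cut_score(sorted_difficulties, bookmark_index, rp_logit=0.0):
    """Translate one panelist's bookmark into a cut score on the logit scale.

    sorted_difficulties: item (or score-point) difficulties in ascending
        order, as in the ordered item book described above.
    bookmark_index: index of the last question judged necessary to pass.
    rp_logit: offset for the response-probability convention; 0.0 means a
        50 percent chance on the bookmarked question, while math.log(2)
        would correspond to the common two-thirds convention. The State's
        exact convention is not assumed here.
    """
    return sorted_difficulties[bookmark_index] + rp_logit

def panel_cut_score(difficulties, bookmark_indices, rp_logit=0.0):
    """Combine the panel's bookmarks (here, by taking the median)."""
    ordered = sorted(difficulties)
    return median(bookmark_cut_score(ordered, i, rp_logit)
                  for i in bookmark_indices)
```

Iterations of discussion and feedback would narrow the spread of bookmarks before a final cut score is taken.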
The original scale of item difficulty and student performance is in logarithmic units, based on theories of probability. Each value also has a raw total equivalent on each test form. It is these logarithmic values that are estimated in the field testing and are ultimately expressed on a common scale across test forms. These logarithmic values, and the raw total scores they represent on each test form, have to be converted to a scale in which 65 represents passing and 85 represents passing with distinction, according to the results of the standard setting study. This conversion is accomplished algebraically.
The highest value (a perfect score) is assigned 100, and the lowest possible value is assigned 0. Passing is assigned 65 and passing with distinction is assigned 85. All values can then be converted to the scale using an algebraic transformation of the form:

SS(x) = m3x^3 + m2x^2 + m1x + m0

In this equation, the value of x in each of four equations represents perfect scoring, lowest possible scoring, just passing, or just passing with distinction, respectively. The coefficients (m3, m2, m1, m0) can be determined because each equation is set equal to one of the four known scale values: 100, 0, 65, and 85. These values yield four simultaneous equations. Thereafter, every logarithmic value, which, through equating, has the exact same meaning in performance from test form to test form, can be assigned a scale value from 0 to 100 in which 65 is passing. In this way, a logarithmic scale having meaning with reference to achievement of State Learning Standards can be converted to a scale that ranges from 0 to 100 and carries the same meaning with reference to achieving State Learning Standards.
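The conversion described above, in which four known points determine the coefficients of a third-degree polynomial, can be sketched as follows. The anchor logit values used here are invented for the example; only the four target scale values (0, 65, 85, 100) come from the text.

```python
def solve_linear(A, b):
    """Solve the linear system A x = b by Gauss-Jordan elimination
    with partial pivoting. Intended for small systems like this one."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col and M[r][col]:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * p for a, p in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def fit_scale_transformation(x_points, scale_points=(0.0, 65.0, 85.0, 100.0)):
    """Determine (m3, m2, m1, m0) of SS(x) = m3*x**3 + m2*x**2 + m1*x + m0
    so that the four anchor logit values (lowest possible, just passing,
    just passing with distinction, perfect) map to 0, 65, 85, and 100."""
    A = [[x**3, x**2, x, 1.0] for x in x_points]
    return solve_linear(A, list(scale_points))

def scale_score(x, coeffs):
    """Apply the fitted transformation to any logit value x."""
    m3, m2, m1, m0 = coeffs
    return m3 * x**3 + m2 * x**2 + m1 * x + m0
```

Because the polynomial passes exactly through the four anchor points, every other logit value on the common scale receives a 0 – 100 scale score consistent with the standard-setting results.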
As described above, there is no State manipulation of test results to have a certain level of students pass or fail. Rather, the object is to operationally define what it means to achieve State Learning Standards, and New York State teachers accomplish this through a formal study. Once the standard setting is completed, it is applied to the scale values of the questions that comprise the test form. These scale values determine the scores above which students are more likely than not to answer the questions correctly, and below which they are not. Standard setting has associated these levels of performance with passing, passing with distinction, and so on. Of course, there is much greater detail involved in these processes, and the Office of State Assessment welcomes opportunities to discuss these details and the philosophy of fair and equitable test scoring and scaling with interested audiences.