Homogeneity of the Regents Comprehensive Examination in English
Samples of English Language Learner (ELL) students and monolingual English curriculum students were analyzed to estimate the homogeneity of the June 1999 Regents Comprehensive Examination in English (CEE). Five ELL populations were compared to the Department Review sample of 488.
The ELL samples were drawn from large and small cities. The Department Review sample represents a secondary random sample of the randomly sampled 10 percent of all papers re-read as part of the State Education Department audit of test scores.
Because test validity is largely concerned with the support for score-based inferences (see Joint Standards, Validity 2000), it is important that, if the test has the same consequences based on those inferences for different population groups, the evidence supporting the inferences be comparable across those groups.
Among the most effective ways to measure the relative degree of support for inferences across populations is to assess homogeneity, that is, the concurrence of the meaning of test scores across population groups, expressed as agreement in the ordering of test questions from least to most difficult. The logic is that a given level of skill and knowledge implies mastery of some content and difficulty with other content. There should therefore be a greater probability of correctly answering the items that measure the mastered content than the items that measure the difficult content. If the same inferences about strengths and weaknesses can be drawn across populations from the same score, then the ordering of item difficulties should be homogeneous across populations, because this ordering operationally defines which content is mastered and which is difficult.
A common demonstration of the agreement of item ordering is the correlation of item difficulties (Angoff and Modu, 1973). Item difficulty is measured by the proportion of students who answer an item correctly. The proportion is converted to an equal-interval scale, and the converted item values are correlated across populations.
The current study assesses the homogeneity of the construct measured by the CEE for ELL and monolingual curriculum students, and for groups that vary according to three ability levels based on overall scale score: 0-54 (fail), 55-64 (local pass), and 65-100 (Regents pass).
A small city, two large cities, and a suburban district were sampled because they volunteered. The means and standard deviations are presented below.
Means and Standard Deviations on the June 1999 Administration of the
Item Difficulty Correlation
Item difficulty values were converted from proportion correct to delta values. These values are centered at 13 with a standard deviation of 4. They form an equal-interval scale, which permits statistical manipulations (e.g., averaging, correlations) that would not be possible with p-values (proportion correct).
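The conversion from proportion correct to delta can be sketched as follows. The report does not state the formula it used; this sketch assumes the conventional delta transformation, delta = 13 - 4z, where z is the standard normal deviate corresponding to the proportion correct, so that harder items receive larger delta values. The function name `p_to_delta` and the example proportions are hypothetical.

```python
from statistics import NormalDist


def p_to_delta(p: float) -> float:
    """Convert a proportion correct to the delta scale.

    Assumption (not stated in the report): delta = 13 - 4 * z, where z is
    the standard normal deviate for p. Harder items (low p) get larger deltas.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return 13.0 - 4.0 * NormalDist().inv_cdf(p)


# Hypothetical proportions correct for three items (not data from the study).
for p in (0.25, 0.50, 0.75):
    print(f"p = {p:.2f} -> delta = {p_to_delta(p):.2f}")
```

Under this assumption an item answered correctly by half of the students sits at the scale center of 13, and items with proportions of 0 or 1 cannot be placed on the scale at all.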
The samples were divided into three groups each: fail, low pass, and pass. The test has 26 multiple-choice questions and four open-ended questions, worth a maximum of one and six points each, respectively. The raw total is computed as the multiple-choice total plus 2 × the open-ended total. This raw value is converted to a scale score ranging from 0 to 100, on which students may be eligible for a local diploma at 55 (as decided by their school board) and are eligible for a State-endorsed Regents diploma at 65. For the purpose of this research, low pass was 55-64 and pass was 65-100.
Each open-ended question total was first divided by six, the maximum possible point value, before being converted to the delta scale.
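The scoring rules above can be illustrated with a short sketch. The raw-to-scale conversion table is not given in this report, so it is not reproduced; the function names are hypothetical, and `open_ended_proportion` assumes that "divided by six" means the mean score on an open-ended question divided by its six-point maximum, giving a value comparable to a proportion correct.

```python
MC_ITEMS = 26    # multiple-choice questions, one point each
OPEN_ITEMS = 4   # open-ended questions
OPEN_MAX = 6     # maximum points per open-ended question


def raw_total(mc_correct: int, open_scores: list[int]) -> int:
    """Raw total = multiple-choice total + 2 x open-ended total."""
    if not 0 <= mc_correct <= MC_ITEMS:
        raise ValueError("multiple-choice total out of range")
    if len(open_scores) != OPEN_ITEMS or any(
        not 0 <= s <= OPEN_MAX for s in open_scores
    ):
        raise ValueError("expected four open-ended scores between 0 and 6")
    return mc_correct + 2 * sum(open_scores)


def ability_group(scale_score: int) -> str:
    """Classify a 0-100 scale score into the three groups used in the study."""
    if not 0 <= scale_score <= 100:
        raise ValueError("scale score must be between 0 and 100")
    if scale_score < 55:
        return "fail"
    if scale_score < 65:
        return "low pass"
    return "pass"


def open_ended_proportion(scores: list[int]) -> float:
    """Assumed reading: mean score on one open-ended question divided by six."""
    return sum(scores) / (OPEN_MAX * len(scores))


# Hypothetical student: 20 multiple-choice items correct, open-ended 4, 4, 3, 5.
print(raw_total(20, [4, 4, 3, 5]))
print(ability_group(58))
```

With these rules the maximum raw total is 26 + 2 × 24 = 74, so the open-ended questions carry roughly two thirds of the raw score.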
Table 2 shows the item difficulty correlations. Note that no correlation, whether within or between the two student groups (ELL and monolingual curriculum), was below .35. The lowest correlation was between failing ELL students and low passing ELL students (.581). The highest correlation was between passing and low passing monolingual curriculum students (.941). On the whole, the correlations among the monolingual curriculum groups (.938, .890, and .941) were higher than the respective correlations among the ELL groups (.581, .685, and .680), although both sets show considerable homogeneity. The correlations between the ELL and monolingual curriculum students were .638, .648, and .698, again showing substantial agreement in the measurement properties of the test for the two groups.
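A table of this kind can be produced by correlating the item delta values for every pair of groups. The following is a minimal sketch using the Pearson correlation; the group labels and delta values are invented for illustration and are not the study's data, and `pearson` and `pairwise_correlations` are hypothetical helper names.

```python
from itertools import combinations
from math import sqrt


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson product-moment correlation of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def pairwise_correlations(groups: dict[str, list[float]]) -> dict[tuple[str, str], float]:
    """Correlate item deltas for every pair of groups."""
    return {
        (a, b): pearson(groups[a], groups[b])
        for a, b in combinations(groups, 2)
    }


# Invented delta values for five items in three groups (illustration only).
deltas = {
    "ELL low pass": [14.2, 12.8, 15.1, 11.0, 13.5],
    "ELL pass": [12.9, 11.5, 14.0, 9.8, 12.7],
    "Monolingual pass": [12.1, 11.9, 13.2, 9.5, 11.8],
}

for (a, b), r in pairwise_correlations(deltas).items():
    print(f"{a} vs {b}: r = {r:.3f}")
```

A high correlation indicates that the two groups rank the items from easiest to hardest in much the same order, which is the sense of homogeneity used in this study.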
Standard Errors of Estimate
Finally, the two groups were matched on scale scores. The delta values for the ELL group were regressed onto the delta values for the monolingual curriculum students to yield a predicted value and a residual value. The square root of the average squared residual, called the standard error of estimate, is a good measure of the accuracy of predicting the difficulty. The smaller the value, the more accurate the prediction.
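The computation described above can be sketched as a simple least-squares regression followed by the root mean squared residual. Following the wording of the text, the squared residuals are averaged over the number of items rather than divided by n - 2; the function name and the delta values are hypothetical and are not taken from the study.

```python
from math import sqrt


def standard_error_of_estimate(x: list[float], y: list[float]) -> float:
    """Regress y on x by least squares and return sqrt(mean squared residual).

    Here x holds the monolingual-curriculum deltas and y the ELL deltas.
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [b - (intercept + slope * a) for a, b in zip(x, y)]
    return sqrt(sum(r * r for r in residuals) / n)


# Invented delta values for five items (illustration only).
monolingual = [12.1, 11.9, 13.2, 9.5, 11.8]
ell = [12.9, 11.5, 14.0, 9.8, 12.7]
print(f"SEE = {standard_error_of_estimate(monolingual, ell):.3f}")
```

Because the result is expressed in delta units, values from different score ranges can be compared directly: a smaller value means the monolingual item difficulties predict the ELL item difficulties more closely.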
Table 3 shows that the greatest agreement (smallest standard error of estimate) occurred at scale scores between 55 and 76, right where it should be to maximize the inferences supported by the test scores. This finding agrees with the Table 2 evidence that the construct measured for the ELL and monolingual curriculum students is substantially equivalent.
The generalizations supported by this use of available samples are limited. Nevertheless, the results of these analyses suggest that the Regents Comprehensive Examination in English defines the same strengths and weaknesses for ELL and monolingual curriculum students. This supports the validity and use of this examination for ELL students.
American Educational Research Association, American Psychological Association, and National Council on Measurement in Education. Standards for Educational and Psychological Testing (Joint Standards).
Angoff, W. H., & Modu, C. C. (1973). Equating the Scales of the Prueba de Aptitud Académica and the Scholastic Aptitude Test.