The Homogeneity of the Grade 4 English Language Arts Test (ELA-4)
Gerald E. DeMauro
Office of State Assessment
As the State Assessments become more universal, it is important to ensure that each test measures the same construct or trait for the different populations whose scores are given the same interpretation. A study of the homogeneity, or universal meaning, of the test scores of the Grade 4 English Language Arts examination (ELA-4) was conducted based on the 1999 administration. Based on the stability of the relative difficulties of the test questions, the test appears to be homogeneous for groups in different ranges of overall scores, and for students who have a disability and those who do not. This construct evaluation is important because it supports making the same inferences about student skills for different populations with the same scores.
Several analyses are presented of construct validity of the New York State Grade 4 English Language Arts assessment (ELA-4) for students who indicated that they had disabilities. The focus was not on the difficulty of the test for these students as compared to other students, but on whether or not the score supported the same inference about student skills, independent of disability status.
One method of measuring this is to evaluate whether students with disabilities found the same test questions to be more difficult and less difficult as did the students without disabilities. The analyses examine this in terms of correlations of item difficulties. The analyses were also extended to consider several levels of scoring. The results provide strong evidence for the homogeneity, or lack of differential construct validity, of the ELA-4 across populations.
A common means of evaluating the homogeneity of the construct measured by a test is to correlate the item difficulties between populations (Angoff and Modu, 1973). In a homogeneous test the most difficult questions are most difficult for all students, and the least difficult are least difficult for all students. This property ensures that the test results share the same interpretation for all students. At its very base, evaluation of this property focuses on the construct validity of the test in terms of the support for the score-based inferences. Quite simply, these inferences cannot be supported if the test measures different things for different children.
One application of homogeneity is to evaluate the construct properties of the test for special populations. For the test to be a useful, interpretable measure for students with disabilities, there should be high correlations between the item difficulties for these students and those for all other students. This study assesses the homogeneity of the 1999 Grade 4 English Language Arts (ELA-4) examination, that is, its differential construct properties for students with disabilities and those without disabilities at various score ranges.
Sample. The sample consisted of 237,281 grade four students who took the ELA-4 in January 1999 and had valid scores and codes indicating either that they were without a disability or that they had a disability. In all, 211,676 students (89.2 percent) indicated no disability and 25,605 (10.8 percent) had a disability, as shown in Table 1.
Of the 25,605 students identified as having a disability, 16,947 (66.2 percent) were boys.
Distributions. Overall, ELA-4 scale scores are distributed in four proficiency ranges: Level 1 (low) through Level 4 (high). There are smaller numbers of students with disabilities in the higher ranges of test scores, as shown by the lower means for students with disabilities in Table 1. Therefore, to investigate the ELA-4 measurement properties for students with disabilities over the range of scores, students were divided into ten performance groups: three groups of equal number within Level 1, three groups of equal number within Level 2, two groups of equal number within Level 3, and two groups of equal number within Level 4.
Analyses were conducted on three levels. First, a General Linear Model regression assessed two main variables, sex and disability classification, and their interaction. The second level of analysis was concerned with the homogeneity of the test over two variables: range of scores and reported disability or no reported disability.
For the homogeneity analysis, performance on each of the 35 items of the test was first divided by the number of possible points to control the range and contribution of each item to the total raw score. The average difficulties, or p-values, were then converted to the delta scale, which has a mean of 13 and a standard deviation of four and is inverted so that low delta values represent less difficult items. This conversion permits the mathematical manipulations necessary for subsequent correlational analyses of item difficulties because the delta scale is an interval scale, while the p-value scale is ordinal.
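The conversion described above can be sketched as follows. This is a minimal illustration, not the study's own code: it assumes the conventional delta definition, 13 + 4z, where z is the standard normal deviate cutting off the proportion p in the upper tail. The function names `item_p_value` and `p_to_delta` and the sample item values are hypothetical.

```python
from statistics import NormalDist


def item_p_value(mean_points: float, max_points: float) -> float:
    """Proportion of possible points earned on an item (between 0 and 1)."""
    if max_points <= 0:
        raise ValueError("max_points must be positive")
    return mean_points / max_points


def p_to_delta(p: float) -> float:
    """Convert a p-value to the delta scale (mean 13, SD 4).

    Assumption: delta = 13 + 4 * z, with z the normal deviate for 1 - p,
    so that easier items (high p) receive lower delta values.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    z = NormalDist().inv_cdf(1.0 - p)
    return 13.0 + 4.0 * z


if __name__ == "__main__":
    # Hypothetical items: (mean points earned, points possible)
    for mean_points, max_points in [(0.85, 1), (2.0, 4), (0.30, 1)]:
        p = item_p_value(mean_points, max_points)
        print(f"p = {p:.2f} -> delta = {p_to_delta(p):.2f}")
```

Under this definition an item answered correctly by half the group has a delta of 13, and each four-point step in delta corresponds to one standard deviation on the underlying normal scale.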
Regression and correlational analyses then followed on two levels. The first analysis was a straightforward bivariate correlation of the item delta values for each disability category with the item delta values for the group that had no indication of a disability. This was a preliminary evaluation of test homogeneity for each category.
The second analysis tested homogeneity over performance groups. In all, over the 35 items, ten performance levels, and two disability categories (reported or not), 700 delta values were computed. Each bivariate correlation was computed over 35 delta values, that is, over the estimated difficulties of the 35 items for each pair of groups being compared.
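As an illustration of the bivariate correlations described above, the sketch below computes a Pearson correlation between the item deltas of two groups. It assumes the ordinary product-moment coefficient. The function name `pearson_r` and the five sample delta values are hypothetical; in the study each correlation is taken over 35 values.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


# Counts taken from the text: 35 items, 10 performance levels, 2 statuses.
N_ITEMS, N_LEVELS, N_STATUS = 35, 10, 2
TOTAL_DELTAS = N_ITEMS * N_LEVELS * N_STATUS  # 700 delta values

if __name__ == "__main__":
    # Hypothetical deltas for five items in two groups (illustration only).
    deltas_group_a = [9.5, 11.2, 13.0, 14.8, 16.1]
    deltas_group_b = [10.9, 12.1, 14.2, 15.5, 17.4]
    print(f"r = {pearson_r(deltas_group_a, deltas_group_b):.3f}")
    print(f"total delta values = {TOTAL_DELTAS}")
```

A high correlation here means the two groups rank the items in nearly the same order of difficulty, even if one group finds every item harder than the other does.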
Mean Scale Scores on the January 1999 Administration
Sex by Disability analysis
The main effect for sex (F (1, 237255) = 0.81, ns) was not significant (means = 645.13 for girls and 638.21 for boys). There were significant scoring differences attributed to disability category (F (12, 237255) = 2144.10, p < .001) and to the interaction of sex and disability (F (12, 237255) = 13.58, p < .0001) (refer to Table 2 for means). Higher mean delta values indicate the item was harder for that group.
Post hoc Tukey comparisons showed that the students without disabilities outscored all other students, and that learning disabilities were associated with the lowest scores.
Table 2 also shows that the correlations of item difficulties for students with disabilities with students without disabilities are high (above .35). The lowest correlation involves girls with autism and non-disabled girls (.683). In general, the correlations for boys with disabilities and boys without disabilities were higher than for girls with disabilities and girls without disabilities.
An analysis was also made of the homogeneity of regression of the mean item difficulty for girls onto the mean item difficulty for boys in relation to disability category. The results showed no differences in relation to disability category (F (12, 2) = 1.40, ns). Table 3 shows the regression coefficients.
Item Difficulty (Delta) Correlations on the Grade Four ELA
Regression Slopes and Intercepts of Item Difficulties for
Homogeneity Across Disability Status and Scoring Levels
Campbell and Fiske (1959) set four criteria for the construct validity of tests that focus on the convergent properties of test scores with other measures of the same construct and the discriminant properties of test scores with measures of different constructs. These criteria, termed multitrait-multimethod validation, are based on correlations of the scores with other variables, and can be summarized as follows:
The evaluation of the homogeneity of the ELA-4 construct involves a modification of the criteria. If we consider level of score range to be the trait and students with disabilities or without disabilities to be the method, then it is clear that, because we expect the ELA-4 to measure the same trait throughout the range of scores, we do not expect to necessarily meet criterion 3. Naturally, uniform levels of high correlations would also make it difficult to discern patterns in the data (criterion 4).
To enable statistical manipulation of the correlations, the coefficients were transformed to z-scores.* All correlations were above .350 (criterion 1). A general linear regression model revealed that there were no reliable differences in the correlations of item difficulties for students with disabilities across scoring levels, compared to those for students without disabilities across scoring levels (criterion 2). However, the mean correlation across disability status and within score levels (mean = .969) was significantly higher than either the mean correlation within disability status across score levels (mean = .872) or the mean correlation across disability status and across score levels (mean = .845) (F (2, 178) = 12.33, p < .0001).
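The transformation of correlations to z-scores can be sketched as below. The report's footnote on the method is not available here, so this assumes the standard Fisher r-to-z transformation, z = atanh(r). The function names and sample correlations are hypothetical.

```python
import math


def fisher_z(r: float) -> float:
    """Fisher r-to-z transformation (assumed method): z = atanh(r)."""
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    return math.atanh(r)


def inverse_fisher_z(z: float) -> float:
    """Back-transform a Fisher z value to a correlation."""
    return math.tanh(z)


def mean_correlation(correlations):
    """Average correlations on the z scale, then back-transform to r."""
    if not correlations:
        raise ValueError("need at least one correlation")
    zs = [fisher_z(r) for r in correlations]
    return inverse_fisher_z(sum(zs) / len(zs))


if __name__ == "__main__":
    # Hypothetical correlations, for illustration only.
    sample = [0.95, 0.97, 0.98]
    print(f"mean r (via z) = {mean_correlation(sample):.3f}")
```

Working on the z scale is a common choice because the sampling distribution of r is strongly skewed near 1, which is where the correlations in this study fall.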
In terms of the pattern of correlations, at all ten score levels the correlations were highest either across disability status within the same score level or within disability status at a score level one step away. That is, the item difficulty correlations for students with disabilities in eight of the ten score levels were highest with those for other students with disabilities in an adjoining score level. Table 5 shows that the mean correlations, both for students with disabilities and for those without disabilities, were ordered according to the distance from the score level.
A second general linear model was performed in which the type of comparison (across score levels, across disability status) and the degree of score level difference were main effects. The interaction of these two was also assessed.
There were no significant differences attributable to type of correlation (F (1, 162) = 2.72, ns) or the interaction of score level difference and type of comparison (F (8, 162) = 0.28, ns). However, as expected, there was a significant difference attributable to score level difference (F (8, 162) = 36.99, p < .001). Keep in mind that the only correlation having no score level difference involved item difficulties across disability status groups but within score level. Thus, this analysis confounds type of correlation with score level difference. Nevertheless, the mean correlations within both disability status groups decrease monotonically as the score level differences increase (see Table 5), thus addressing Campbell and Fiske's criterion 4.
Homogeneity of Regression
Finally, an analysis was made of the homogeneity of regression of item difficulties for students without disabilities onto those of students with disabilities. The overall r-square for the regression was .962. Analyses revealed no significant variation of this regression across score levels (F (9, 9) = 0.63, ns). Nor was there a significant difference in the residuals (observed item difficulties minus predicted item difficulties) of the regression in relation to score level (F (9, 340) = 0.57, ns). Table 6 shows the regression coefficients.
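The regression and residuals described above can be illustrated with a simple least-squares fit. This is a sketch under the assumption of an ordinary one-predictor linear regression, with item deltas for students with disabilities as the predictor and those for students without disabilities as the outcome. It does not reproduce the F tests across score levels, and the function names and sample deltas are hypothetical.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for predicting y from x."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    if sxx == 0:
        raise ValueError("predictor values are constant")
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def residuals(x, y, slope, intercept):
    """Observed minus predicted item difficulties."""
    return [b - (intercept + slope * a) for a, b in zip(x, y)]


def r_squared(y, resid):
    """Proportion of variance in y accounted for by the fitted line."""
    mean_y = sum(y) / len(y)
    ss_total = sum((b - mean_y) ** 2 for b in y)
    if ss_total == 0:
        raise ValueError("outcome values are constant")
    ss_resid = sum(e * e for e in resid)
    return 1.0 - ss_resid / ss_total


if __name__ == "__main__":
    # Hypothetical item deltas, for illustration only.
    with_disability = [11.0, 12.5, 14.1, 15.8, 17.2]
    without_disability = [9.6, 11.3, 12.7, 14.6, 15.9]
    slope, intercept = fit_line(with_disability, without_disability)
    resid = residuals(with_disability, without_disability, slope, intercept)
    print(f"slope = {slope:.3f}, intercept = {intercept:.3f}")
    print(f"r-square = {r_squared(without_disability, resid):.3f}")
```

Fitting such a line separately within each score level, and comparing the slopes and residuals, is one way to examine whether the regression is homogeneous across levels.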
Item Difficulty Correlations of Students With Disabilities with
1999 Grade 4 ELA, by Score Level
Mean Item Difficulty (in Deltas) Correlations by
Regression Slopes and Intercepts of Item Difficulties of
The ELA-4 is very robust in terms of the construct being measured across levels of scoring and disability status. This is good news, because, at least within the scope and precision of these analyses, the inference one can make of student strengths and weaknesses, given the score, is substantially independent of disability status, supporting the validity of this test for students with disabilities.
Angoff, W. H. & Modu, C. C. "Equating the Scales of the Prueba de Aptitud Academica and the Scholastic Aptitude Test." (New York: College Entrance Examination Board, Research Report 3, 1973).
Campbell, D. T. & Fiske, D. W. Convergent and discriminant validation by the multitrait-multimethod matrix. In D. N. Jackson & S. Messick (Eds.), "Problems in Human Assessment." (New York: McGraw-Hill, 1962), pp. 124-131.