The Homogeneity of the Grade 4 English Language Arts Test (ELA-4)
Gerald E. DeMauro
August 5, 1999
As the State Assessments become more universal, it is important to ensure that each test measures the same construct or trait in every population, so that the scores carry the same interpretation for all students. A study of the homogeneity, or universal meaning, of the scores of the Grade 4 English Language Arts examination (ELA-4) was conducted based on the 1999 administration. Based on the stability of the relative difficulties of the test questions, the test appears to be homogeneous across groups in different ranges of overall scores and for Limited English Proficient (LEP) and non-LEP students.
The Homogeneity of the Grade 4 English Language Arts Test (ELA-4)
A common means of evaluating the homogeneity of the construct measured by a test is to correlate the item difficulties between populations (Angoff and Modu, 1973). In a homogeneous test, the most difficult questions are the most difficult for all students, and the least difficult are the least difficult for all students. This property ensures that the test results share the same interpretation for all students.
One application of homogeneity is to evaluate the construct properties of the test for special populations. For the test to be a useful, interpretable measure for Limited English Proficient (LEP) students, for example, there should be high correlations between the item difficulties for LEP and non-LEP students. This study assesses the homogeneity of the 1999 Grade 4 English Language Arts (ELA-4) examination and its differential construct properties for LEP and non-LEP students and for students in various score ranges.
Sample. The sample consisted of 234,503 fourth graders who took the ELA-4 in January 1999. Of these, 7,717 (3.3 percent) were currently LEP students. Former LEP students were included with monolingual curriculum students. Because the analyses were conducted within specific score ranges, there was no contamination of the results by including these students. That is, LEP students who scored high were compared to monolingual curriculum students who scored high, regardless of whether they had been LEP students in the past.
Distributions. The total population was divided into 10 score intervals based on the ELA-4 total score scale. This division provides some precision at the upper and lower ranges of the scale, yet each interval is wide enough, within each LEP status group, to permit meaningful comparisons among correlations while controlling for the effect of restriction of range on the correlation coefficients. In this way, the stability of the meaning of the test scores can be evaluated for students of all ability levels. The distributions by LEP status for these 10 intervals are given in Table 1.
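As a rough sketch of this interval assignment (the actual ELA-4 cut points are not reproduced in this report, so an equal-width split of a hypothetical 455-800 scale range is assumed here purely for illustration):

```python
# Illustrative assignment of total scores to 10 intervals.  The true
# ELA-4 cut points are not given in the report; equal-width bins over a
# hypothetical 455-800 scale are an assumption made for this sketch.
def make_intervals(lo, hi, k=10):
    """Return k equal-width (start, end) intervals covering [lo, hi]."""
    width = (hi - lo) / k
    return [(lo + i * width, lo + (i + 1) * width) for i in range(k)]

def interval_of(score, intervals):
    """Index (0 to k-1) of the interval containing score; the top edge closes the last bin."""
    for i, (start, end) in enumerate(intervals):
        if start <= score < end:
            return i
    if score == intervals[-1][1]:
        return len(intervals) - 1
    raise ValueError(f"score {score} is outside the scale")

intervals = make_intervals(455, 800)
```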
Evaluation of construct stability. The methodology draws on the multitrait-multimethod process developed by Campbell and Fiske (1962). These analyses usually evaluate three properties of tests: convergent validity, discriminant validity, and differential construct validity. In this study, the trait under consideration is the test construct as manifested in the item difficulties for 10 scoring levels. The variable substituted for the method of measurement is LEP status: LEP or non-LEP. This analysis tests the hypothesis that the ELA-4 measures the same construct for LEP and non-LEP groups.
In the usual application of this methodology, we would expect to see high correlations within traits (score levels) between LEP and non-LEP students, but low correlations within methods (LEP or not) between scoring levels. In contrast to usual applications of this procedure, the current focus on the homogeneity of the test would lead to the hypothesis that there are high correlations between item difficulties for all scoring ranges, and especially high correlations within scoring ranges between the difficulties for LEP and non-LEP students.
The criteria originally developed by Campbell and Fiske were adapted to evaluate the 20 × 20 correlation matrix generated by comparing all pairs of the 10 skill levels and the two LEP status groups. The four criteria are listed below, each with a comment on how it applies to the current questions:
1. Non-nominal correlations within traits. In this case, the item difficulties should exhibit high correlations (at least .35) across ability-level groups and LEP status.
2. Higher correlations within traits (ability levels) across methods (LEP status) than across both traits and methods. That is, within a score interval, the item difficulties for LEP and non-LEP students should correlate more highly with each other than with the item difficulties of the other LEP status group in different score intervals. While we expect all correlations to be high, there should be some decline between ability levels that are farthest apart, e.g., the item difficulties for the lowest-scoring students compared with those for the highest-scoring students.
3. Higher correlations within traits across methods than across traits within methods. That is, within a score interval, the item difficulties for LEP and non-LEP students should correlate more highly with each other than with the difficulties of LEP students in other score intervals or of non-LEP students in other score intervals (between traits). We expect groups of the same ability to be most closely related to each other, rather than to groups of different abilities that share the same LEP status.
4. Relationships among traits that follow the same pattern regardless of method. This requires that the pattern of higher and lower correlations of item difficulties across score intervals be the same for LEP students as for non-LEP students. These criteria focus on the differential validity of the ELA-4. We also expect the correlations of item difficulties across score intervals to be high for both LEP and non-LEP students; this is the convergent property of homogeneity.
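As an illustration, criteria 1 and 3 can be checked mechanically on a labeled correlation matrix. This is a toy sketch, not the operational analysis; the function name and data layout are assumptions, with the .35 floor taken from criterion 1:

```python
# Toy check of criteria 1 and 3 on a labeled correlation matrix.
# Keys are (LEP status, ability level) pairs; corr[a][b] is the
# item-difficulty correlation between cells a and b.
def check_criteria(corr, floor=0.35):
    """Return (criterion_1_ok, criterion_3_ok) for a dict-of-dicts matrix."""
    keys = list(corr)
    # Criterion 1: every off-diagonal correlation is non-nominal.
    c1 = all(corr[a][b] >= floor for a in keys for b in keys if a != b)
    c3 = True
    for st, lv in keys:
        # The validity coefficient: same ability level, other LEP status.
        other = ("LEP" if st == "non-LEP" else "non-LEP", lv)
        v = corr[(st, lv)][other]
        for b in keys:
            if b not in ((st, lv), other) and corr[(st, lv)][b] >= v:
                c3 = False  # some other-level correlation matches or beats it
    return c1, c3
```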
Item Difficulties. Item difficulties were estimated by p-values converted to the delta scale, on which the mean difficulty is 13 and the standard deviation is 4. Unlike p-values, the scale increases with difficulty, i.e., the higher the delta value, the more difficult the question, and the scale is equal interval (Angoff, 1984). For the three questions scored 0-2, the two questions scored 0-3, and the two questions scored 0-4, the mean scores were divided by the maximum scores. The resulting proportions of maximum score were then converted to delta values.
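The conversion can be sketched as follows. This is a minimal version of the standard delta transformation, delta = 13 + 4z with z the normal deviate corresponding to the proportion incorrect; the operational procedure may differ in details such as smoothing or rounding, and the function names are illustrative:

```python
# Sketch of the p-value-to-delta conversion described above:
# delta = 13 + 4*z, where z is the normal deviate corresponding to the
# proportion of students answering incorrectly (1 - p).
from statistics import NormalDist

def p_to_delta(p):
    """Delta difficulty for a proportion-correct p, 0 < p < 1."""
    z = NormalDist().inv_cdf(1.0 - p)  # low p (hard item) -> large z
    return 13.0 + 4.0 * z

def polytomous_to_delta(mean_score, max_score):
    """For the 0-2, 0-3, and 0-4 items: treat mean/max as a p-value."""
    return p_to_delta(mean_score / max_score)
```

An item half the students answer correctly lands at the scale mean of 13, and delta rises as p falls, matching the direction described in the text.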
Difficulties were computed separately for LEP students and for non-LEP students in each of the 10 score intervals, yielding 20 sets of item difficulties.
Analyses. Two analyses were performed. The first examined whether prediction was consistently high across score levels and whether the slope of the regression of item difficulties for non-LEP students onto those for LEP students varied by score interval. If the construct remains stable, there should be high R-square values (prediction) across score groups and homogeneity of regression slopes. That is, the relationship between the item difficulties for the two groups should be consistently high.
A regression model was used in which the difficulties for the non-LEP population were regressed onto the difficulties for the LEP population. The interaction of score interval by slope was used to evaluate the homogeneity of regression slopes.
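The slope-homogeneity test amounts to comparing a model with a separate slope per score interval against one with a common slope; the interaction F statistic measures the improvement from allowing slopes to differ. A sketch with hypothetical data (the function and variable names are assumptions, not the report's actual code):

```python
# Sketch of the homogeneity-of-regression-slopes test: non-LEP item
# difficulties (y) are regressed onto LEP difficulties (x) with separate
# intercepts per score interval, and an F test compares the
# separate-slopes model to the common-slope model (the interaction term).
import numpy as np

def slope_homogeneity_F(x, y, interval):
    """F statistic for H0: a single common slope across intervals."""
    levels = sorted(set(interval))
    g, n = len(levels), len(x)
    # Reduced model: one intercept dummy per interval plus a common slope.
    Xr = np.zeros((n, g + 1))
    # Full model: one intercept dummy and one slope per interval.
    Xf = np.zeros((n, 2 * g))
    for i, lab in enumerate(interval):
        j = levels.index(lab)
        Xr[i, j] = Xf[i, j] = 1.0
        Xr[i, g] = Xf[i, g + j] = x[i]
    sse = lambda X: float(np.sum((y - X @ np.linalg.lstsq(X, y, rcond=None)[0]) ** 2))
    sse_r, sse_f = sse(Xr), sse(Xf)
    df1, df2 = g - 1, n - 2 * g
    return ((sse_r - sse_f) / df1) / (sse_f / df2)
```

When the interval slopes genuinely differ, the reduced model's lack of fit inflates the numerator and F is large; when they coincide, F is near zero.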
The second analysis compared the item difficulty correlations among the ten proficiency intervals and the two student groups (LEP and non-LEP), a 20 × 20 matrix. The correlation coefficients were first converted to Fisher z scores to give them interval-scale properties.
The correlations were divided into two groups: those between difficulties for the same type of student (LEP with LEP, or non-LEP with non-LEP) and those between difficulties for different types of students (LEP with non-LEP). They were also blocked by the difference between the two overall score intervals: zero levels, 1-3 levels, 4-6 levels, and 7-9 levels. Thus, a correlation of item difficulties for non-LEP students in level 5 with those for LEP students in level 8 would be classified as a comparison between different types of students, 3 levels apart. A general linear model regression assessed the effects of these two variables in terms of the adapted Campbell and Fiske criteria.
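The Fisher r-to-z averaging and the blocking scheme can be sketched as follows (function names are illustrative):

```python
# Sketch of the Fisher r-to-z conversion used to average correlations on
# an interval scale, and of the blocking scheme described above.
import math

def fisher_mean(rs):
    """Mean correlation computed on the Fisher z scale, then back-transformed."""
    zs = [math.atanh(r) for r in rs]  # z = 0.5 * ln((1 + r) / (1 - r))
    return math.tanh(sum(zs) / len(zs))

def classify(status_a, level_a, status_b, level_b):
    """(same/different student type, level-difference block) for one coefficient."""
    same = "same" if status_a == status_b else "different"
    d = abs(level_a - level_b)
    block = "0" if d == 0 else "1-3" if d <= 3 else "4-6" if d <= 6 else "7-9"
    return same, block
```

With these definitions, the example in the text, non-LEP students in level 5 against LEP students in level 8, falls in the different-type, 1-3 levels block.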
Note that the zero-difference correlations compare LEP difficulties with non-LEP difficulties for students of the same ability level. These correlation coefficients are called validity coefficients and test Campbell and Fiske's criterion #3.
Homogeneity of Regression Slope (Table 2)
There was a significant status by ability level interaction (F(9, 330) = 2.67, p < .01), indicating that the slopes of the regressions of non-LEP item difficulties onto LEP item difficulties varied by ability level. Table 2 shows slopes ranging from 1.099 for the lowest-scoring group to 0.896 for the second highest-scoring group. The high levels of prediction (R-square values) suggest that these regression differences are not very meaningful. An analysis of the regression residuals suggests that the item difficulties for the lowest-scoring LEP students slightly over-predicted those for non-LEP students; that is, the questions were actually harder for the non-LEP students than would have been predicted from LEP student performance. The difficulties of the next four intervals and of the next-to-highest interval (F(9, 340) = 2.23, p < .05) slightly under-predicted those of non-LEP students, meaning the items were easier than expected for non-LEP students.
Non-Nominal Correlations
The range of correlations was .628 to 1.000. Of the 400 correlation coefficients, 254 (63.5 percent) were .90 or above. Clearly, these correlations were not nominal, indicating that the test is homogeneous (measures the same construct) across LEP and non-LEP populations and ability groups.
Within Traits, Within Abilities
The general linear model found a significant effect for the difference in score levels (F(3, 240) = 155.54, p < .0001), but no effect for whether the correlations were within LEP status groups or between them (F(1, 240) = 0.0, ns), and no interaction of these two variables (F(2, 240) = 0.0, ns).
The mean correlations, converted back to r values from z-scores, are given in Table 3.
1999 ELA-4 for LEP and non-LEP Groups
As the difference in ability levels increased, the mean correlations between ability groups decreased somewhat. However, there were no correlations below .628.
As is shown, the item difficulty correlations within ability levels across LEP status (the validity coefficients) were the highest, satisfying criterion 3. Correlations between ability levels within LEP status or within non-LEP status were not higher, nor were correlations across both LEP status and ability levels (criterion 2).
Finally, the same pattern was observed among the 400 correlations for both LEP and non-LEP groups. Table 4 shows that the highest correlation for each of the 20 groups (2 LEP statuses by 10 levels) was either zero levels of ability apart (a validity coefficient) or one level apart. In every case where the validity coefficient was not the highest for a group, the highest correlation was with item difficulties for a group of the same LEP status.
(Entry=Ability Level, 0-9)
The first analysis, of regression differences, suggests that there are some differences in the prediction of item difficulties for non-LEP students from those for LEP students. However, the prediction coefficients (R-square values) are so high that these differences have little practical import. The failure to find differences in constructs, as measured by the item difficulty correlations, supports the hypothesis that the ELA-4 is a homogeneous test, suited to all ability levels in the grade 4 population and to both LEP and non-LEP students.
Angoff, W. H. (1984). Scales, Norms, and Equivalent Scores. Princeton, NJ: Educational Testing Service.
Angoff, W. H. & Modu, C. C. (1973). Equating the Scales of the Prueba de Aptitud Académica and the Scholastic Aptitude Test (Research Report No. 3). New York: College Entrance Examination Board.
Campbell, D. T. & Fiske, D. W. (1962). Convergent and Discriminant Validation by the Multitrait-Multimethod Matrix. In D. N. Jackson & S. Messick (Eds.), Problems in Human Assessment (pp. 124-131). New York: McGraw-Hill.