The Homogeneity of the Grade 4 English Language Arts Test (ELA4)

Gerald E. DeMauro
August 5, 1999

Abstract

As the State Assessments become more universal, it is important to ensure that each test measures the same construct or trait for the different populations for which the scores have the same interpretation. A study of the homogeneity, or universal meaning, of the test scores of the Grade 4 English Language Arts examination (ELA4) was conducted based on the 1999 administration. Based on the stability of the relative difficulties of the test questions, the test appears to be homogeneous both for groups in different ranges of overall scores and for Limited English Proficient (LEP) and non-LEP students.

The Homogeneity of the Grade 4 English Language Arts Test (ELA4)

Overview

A common means of evaluating the homogeneity of the construct measured by a test is to correlate the item difficulties between populations (Angoff & Modu, 1973). In a homogeneous test, the most difficult questions are most difficult for all students, and the least difficult are least difficult for all students. This property ensures that the test results share the same interpretation for all students. One application of homogeneity is to evaluate the construct properties of the test for special populations. For the test to be a useful, interpretable measure for Limited English Proficient (LEP) students, for example, there should be high correlations between the item difficulties for LEP and non-LEP students. This study assesses the homogeneity of the 1999 Grade 4 English Language Arts (ELA4) examination and its differential construct properties for LEP and non-LEP students and for students at various score ranges.

Methods

Sample. The sample consisted of 234,503 fourth graders who took the ELA4 in January 1999. Of these, 7,717 (3.3 percent) were currently LEP students. Former LEP students were included with monolingual curriculum students.
Because the analyses were conducted within specific score ranges, there was no contamination of the results by including these students. That is, LEP students who scored high were compared to monolingual curriculum students who scored high, regardless of whether they had been LEP students in the past.

Distributions. The total population was divided into 10 score intervals based on the ELA4 total score scale. This division provides some precision at the upper and lower ranges of the scale, but each interval is wide enough for each LEP status group to permit meaningful comparisons among correlations while controlling for the effects of restriction of range on the correlation coefficients. In this way, the stability of the meaning of the test scores can be evaluated for students of all ability levels. The distributions by LEP status for these 10 intervals are given in Table 1.

Table 1

Evaluation of construct stability. The methodology draws on the multitrait-multimethod process developed by Campbell and Fiske (1962). These analyses usually evaluate three properties of tests: convergent validity, discriminant validity, and differential construct validity. In this study, the trait under consideration is the test construct as manifest in the item difficulties for 10 scoring levels. The variable substituted for the method of measurement is LEP status: LEP or not. This analysis tests the hypothesis that the ELA4 measures the same construct for LEP and non-LEP groups. In the usual application of this methodology, we would expect to see high correlations within traits (score levels) between LEP and non-LEP students, but low correlations within methods (LEP or not) between scoring levels. In contrast to usual applications of this procedure, the current focus on the homogeneity of the test leads to the hypothesis that there are high correlations between item difficulties for all scoring ranges, and especially high correlations within scoring ranges between the difficulties for LEP and non-LEP students. The criteria originally developed by Campbell and Fiske were adapted to evaluate the 20 × 20 correlation matrix generated by comparing all pairs of the 10 skill levels and the two LEP status groups. Four criteria are listed below, along with a comment on how each is applied to the current questions:

1. Non-nominal correlations within traits. In this case, item difficulties should exhibit high correlations (at least .35) across ability level groups and LEP status.

2. Higher correlations within traits (ability levels) across methods (LEP status) than across both traits and methods.
This means that, within score intervals, the item difficulties for LEP and non-LEP students should have higher correlations than the item difficulties for LEP or for non-LEP students across LEP status in different score intervals. While we expect all correlations to be high, there should be some decline between ability levels that are farthest apart, e.g., the item difficulties for the lowest scoring students compared to those for the highest scoring students.

3. Higher correlations within traits across methods than within methods across traits. This means that, within score intervals, the item difficulties for LEP and non-LEP students should have higher correlations than the correlations of the difficulties for LEP students of different score intervals or for non-LEP students of different score intervals (between traits). We expect groups of the same ability to be most closely related to each other, rather than to groups of different abilities that share the same LEP status.

4. Relationships among traits that follow the same pattern regardless of method. This requires that the patterns of higher and lower correlations of item difficulties across score intervals are the same for LEP students as for non-LEP students.

These criteria focus on the differential validity of the ELA4. We also expect that the correlations across score intervals of the item difficulties will be high for both LEP and non-LEP students. This is the convergent property of homogeneity.

Item Difficulties. Item difficulties were estimated by p-values converted to the delta scale, in which the mean difficulty is 13 and the standard deviation is 4. Unlike p-values, the scale is positive for difficulty (the higher the delta value, the more difficult the question) and is equal interval (Angoff, 1984). For the three questions scored 0–2, the two questions scored 0–3, and the two questions scored 0–4, the mean scores were divided by the maximum scores.
The resulting proportions of maximum scores were then converted to delta scores. Difficulties were computed for LEP students and for non-LEP students in each of the 10 score intervals, yielding 20 group-by-interval sets of difficulties.

Analyses. Two analyses were performed. The first examined whether prediction was consistently high across score levels and whether the slope of the regression of item difficulties for non-LEP students onto those for LEP students varied by score interval. If the construct remains stable, there should be high R-square values (prediction) across score groups and homogeneity of regression slopes. That is, the relationship between the item difficulties for the two groups should be consistently high. A regression model was used in which the difficulties for the non-LEP population were regressed onto the difficulties for the LEP population. The interaction of score interval by slope was used to evaluate the homogeneity of regression slopes.

The second analysis compared the item difficulty correlations of the 10 proficiency intervals and two student groups (LEP and non-LEP), a 20 × 20 matrix. The correlation coefficients were first converted to z scores (Fisher's r-to-z transformation) to retain interval scale properties. The correlations were divided into two groups: whether they were difficulties for the same type of students (either LEP or non-LEP) or for different types of students (LEP compared to non-LEP). They were also blocked, or grouped, by differences in the 10 overall score intervals: zero levels difference, 1–3 levels difference, 4–6 levels difference, and 7–9 levels difference. Thus, a correlation of item difficulties for non-LEP students in level 5 with those for LEP students in level 8 would be classified as a comparison among different types of students, 3 levels apart. A general linear model regression assessed the effects of these two variables in terms of the adapted Campbell and Fiske criteria. Note that the zero-difference correlations compared LEP difficulties and non-LEP difficulties for students of the same ability levels.
These correlation coefficients are called validity coefficients and test Campbell and Fiske's criterion 3.

Table 2 Item Difficulties
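The p-value-to-delta conversion described in the Methods can be sketched as follows. This is a minimal illustration, not the study's actual code; the function name and the sample values are hypothetical.

```python
from statistics import NormalDist

def p_to_delta(p: float) -> float:
    """Convert a proportion-correct p-value to the delta scale
    (mean 13, SD 4). Harder items (lower p) receive higher deltas."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    # Delta is the normal deviate corresponding to (1 - p),
    # rescaled to a mean of 13 and a standard deviation of 4.
    return 13.0 + 4.0 * NormalDist().inv_cdf(1.0 - p)

# An item answered correctly by half the group sits at the scale mean.
print(round(p_to_delta(0.5), 2))      # 13.0

# A polytomous item scored 0-2 with a mean score of 1.3 is first
# reduced to a proportion of the maximum score, as in the study.
print(round(p_to_delta(1.3 / 2), 2))  # below 13: easier than average
```

Because the transformation is monotone decreasing in p, the rank order of difficulties is preserved while the scale gains the equal-interval property noted above.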
Results

Homogeneity of Regression Slope (Table 2)

There was a significant status by ability level interaction (F(9, 330) = 2.67, p < .01). This indicates that the slopes of the regression of non-LEP item difficulties onto LEP item difficulties varied by ability level. Table 2 shows differences in the slopes ranging from 1.099 for the lowest scoring group to 0.896 for the second highest-scoring group. The high levels of prediction (R-square values) suggest that these regression differences are not very meaningful. An analysis of the regression residuals suggests that, for the lowest scoring group, the questions were actually somewhat harder for non-LEP students than would have been predicted from LEP student performance. For the next four intervals and the next-to-highest interval (F(9, 340) = 2.23, p < .05), the items were somewhat easier for non-LEP students than predicted.

Non-Nominal Correlations

The range of correlations was .628 to 1.000. Of the 400 correlation coefficients, 254 (63.5 percent) were .90 or above. Clearly these were not nominal, indicating that the test is homogeneous (measures the same construct) across LEP and non-LEP populations and ability groups.

Within Traits, Within Abilities

The general linear model found a significant effect for the relationships among score levels (F(3, 240) = 155.54, p < .0001), but not for differences in relationships within LEP or non-LEP groups compared to between groups (F(1, 240) = 0.0, ns), nor for an interaction of these two variables (F(2, 240) = 0.0, ns). The mean correlations, converted back to r values from z scores, are given in Table 3.

Table 3 1999 ELA4 for LEP and non-LEP Groups
As the difference in ability levels increased, the mean correlations between ability groups decreased somewhat. However, there were no correlations below .628. As shown, the item difficulty correlations (validity coefficients) within ability levels across LEP status were highest (criterion 3). There were not higher correlations between ability levels within LEP status or within non-LEP status, nor were there higher correlations across both LEP status and ability levels (criterion 2). Finally, the same pattern was observed among the 400 item difficulty correlations for both LEP and non-LEP groups (criterion 4). Table 4 shows that all of the highest correlations for each of the 20 groups (2 LEP status groups × 10 levels) were either zero levels of ability apart (validity coefficients) or within one level of ability apart. In all cases where the validity coefficient was not the highest for a group, the highest correlations were for item difficulties for groups of the same LEP status.

Table 4 Highest Correlations (Entry = Ability Level, 0–9)
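The averaging of correlations within blocks, as described in the Methods, can be sketched as follows. This is a minimal sketch with made-up correlation values; it assumes the z conversion mentioned in the Methods is Fisher's r-to-z transformation, which is the standard choice for averaging correlations.

```python
import math

def mean_correlation(rs):
    """Average correlations on Fisher's z scale, then convert back to r.
    Averaging on the z scale respects the interval-scale properties
    noted in the Methods."""
    zs = [math.atanh(r) for r in rs]     # r -> z
    return math.tanh(sum(zs) / len(zs))  # mean z -> back to r

# Hypothetical correlations blocked by how many ability levels apart
# the two compared groups are (0, 1-3, 4-6, or 7-9 levels), as in
# the study's grouping scheme.
blocks = {
    "0 levels (validity coefficients)": [0.99, 0.98, 0.99],
    "1-3 levels apart": [0.95, 0.93, 0.96],
    "4-6 levels apart": [0.85, 0.88, 0.83],
    "7-9 levels apart": [0.70, 0.66, 0.63],
}
for label, rs in blocks.items():
    print(f"{label}: mean r = {mean_correlation(rs):.3f}")
```

With values patterned like these, the block means decline as the level difference grows, which is the shape of result reported in Table 3.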
Conclusion

The first analysis, of regression differences, suggests that there are some differences in the prediction of item difficulties from LEP to non-LEP students. However, the prediction coefficients (R-square values) are so high as to make these differences meaningless. The failure to find differences in constructs, as measured by item difficulty correlations, supports the hypothesis that the ELA4 is a homogeneous test, suited for all ability levels in the grade 4 population and for LEP and non-LEP students.

References

Angoff, W. H. (1984). Scales, Norms, and Equivalent Scores. Princeton, NJ: Educational Testing Service.

Angoff, W. H., & Modu, C. C. (1973). Equating the Scales of the Prueba de Aptitud Académica and the Scholastic Aptitude Test (Research Report No. 3). New York: College Entrance Examination Board.

Campbell, D. T., & Fiske, D. W. (1962). Convergent and Discriminant Validation by the Multitrait-Multimethod Matrix. In D. N. Jackson & S. Messick (Eds.), Problems in Human Assessment (pp. 124–131). New York: McGraw-Hill.