The Homogeneity of the Grade 4 English Language Arts Test (ELA-4)
for the Range of Ability Levels and LEP Status

Gerald E. DeMauro
Office of State Assessment

August 5, 1999

Abstract

As the State Assessments become more universal, it is important to ensure that each test measures the same construct or trait for the different populations whose scores are given the same interpretation. A study of the homogeneity, or universal meaning, of the test scores of the Grade 4 English Language Arts examination (ELA-4) was conducted based on the 1999 administration. Based on the stability of the relative difficulties of the test questions, the test appears to be homogeneous for groups in different ranges of overall scores and for Limited English Proficient (LEP) and non-LEP students.

Overview

A common means of evaluating the homogeneity of the construct measured by a test is to correlate the item difficulties between populations (Angoff and Modu, 1973). In a homogeneous test, the most difficult questions are most difficult for all students, and the least difficult are least difficult for all students. This property ensures that the test results share the same interpretation for all students.

One application of homogeneity is to evaluate the construct properties of the test for special populations. For the test to be a useful, interpretable measure for Limited English Proficient (LEP) students, for example, there should be high correlations among item difficulties for LEP and non-LEP students. This study assesses the homogeneity of the 1999 Grade 4 English Language Arts (ELA-4) examination, its differential construct properties for LEP and non-LEP students and for students at various score ranges.

Methods

Sample. The sample consisted of 234,503 fourth graders who took the ELA-4 in January 1999. Of these, 7,717 (3.3 percent) were currently LEP students. Former LEP students were included with monolingual curriculum students. Because the analyses were conducted within specific score ranges, there was no contamination of the results by including these students. That is, LEP students who scored high were compared to monolingual curriculum students who scored high, regardless of whether they had been LEP students in the past.

Distributions. The total population was divided into 10 score intervals based on the ELA-4 total score scale. This division provides some precision at the upper and lower ranges of the scale, while keeping each interval wide enough within each LEP status group to support meaningful comparisons among correlations and to control for the effects of restriction of range on the correlation coefficients. In this way, the stability of the meaning of the test scores can be evaluated for students of all ability levels. The distributions by LEP status for these 10 intervals are given in Table 1.

Table 1
Mean 1999 ELA-4 Scale Scores for LEP and Non-LEP Students
by Score Interval

Overall Score                        LEP Students  Non-LEP Students    % of
Interval           Mean         N        N      %          N      %   Total
455-600         575.490    23,689    3,989   16.8     19,363   81.7     9.9
601-616         609.315    23,839    1,332    5.6     22,089   92.7     9.9
617-627         622.279    25,086      787    3.1     23,796   94.9    10.5
628-635         631.582    22,406      451    2.0     21,504   96.0     9.3
636-643         639.533    24,727      345    1.4     23,875   96.6    10.3
644-651         647.486    25,594      270    1.1     24,713   96.6    10.7
652-658         654.965    21,804      173    0.7     21,127   96.9     9.1
659-667         662.858    24,237      149    0.6     23,506   97.0    10.1
668-680         673.409    24,456      131    0.5     23,707   96.9    10.2
681-800         697.293    23,891       90    0.4     23,106   96.7    10.0
Total           641.454  239,729*    7,717    3.2    226,786   94.6   100.0
Overall Mean    641.454            595.203           642.902
SD              641.454             43.141            33.000
_______________________________________
*includes 5,226 students of unknown LEP status
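
The assignment of students to the 10 score intervals can be illustrated with a short Python sketch. It assumes the interval boundaries shown in Table 1 and numbers the intervals 0 through 9 from lowest to highest scoring. The function name score_level is hypothetical and is not part of the original analysis.

```python
from bisect import bisect_right

# Lower bounds of the 10 score intervals, as listed in Table 1.
LOWER_BOUNDS = [455, 601, 617, 628, 636, 644, 652, 659, 668, 681]
SCALE_MIN, SCALE_MAX = 455, 800  # ends of the lowest and highest intervals


def score_level(scale_score: int) -> int:
    """Return the 0-9 level whose Table 1 interval contains scale_score."""
    if not SCALE_MIN <= scale_score <= SCALE_MAX:
        raise ValueError(f"scale score {scale_score} is outside 455-800")
    # bisect_right counts how many lower bounds are <= scale_score,
    # so subtracting 1 gives the zero-based interval index.
    return bisect_right(LOWER_BOUNDS, scale_score) - 1
```

Under these assumptions, a scale score of 600 falls in level 0 and a score of 601 falls in level 1.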

Evaluation of construct stability. The methodology draws on the multitrait-multimethod process developed by Campbell and Fiske (1962). These analyses usually evaluate three properties of tests: convergent validity, discriminant validity, and differential construct validity. In this study, the trait under consideration is the test construct as manifested in the item difficulties for 10 scoring levels. The variable substituted for the method of measurement is LEP status: LEP or not. This analysis tests the hypothesis that the ELA-4 measures the same construct for LEP and non-LEP groups.

In the usual application of this methodology, we would expect to see high correlations within traits (score levels) between LEP and non-LEP students, but low correlations within methods (LEP or not) between scoring levels. In contrast to usual applications of this procedure, the current focus on the homogeneity of the test would lead to the hypothesis that there are high correlations between item difficulties for all scoring ranges, and especially high correlations within scoring ranges between the difficulties for LEP and non-LEP students.

The criteria originally developed by Campbell and Fiske were adapted to evaluate the 20 x 20 correlation matrix generated by comparing all pairs of the 10 score levels and the two LEP status groups. Four criteria are listed below, each with a comment on how it will be applied to the current questions:

    1. Non-nominal correlations within traits. In this case, this means that item difficulties should be highly correlated (.35 or higher) across ability level groups and LEP status.

    2. Higher correlations within traits (ability levels) across methods (LEP status) than across both traits and methods. This means that within score intervals, the item difficulties for LEP and non-LEP students should have higher correlations than item difficulties compared across both LEP status and score interval. While we expect all correlations to be high, there should be some decline between ability levels that are farthest apart, e.g., the item difficulties for the lowest scoring students compared to those for the highest scoring students.

    3. Higher correlations within traits across methods than across traits within methods. This means that within score intervals, the item difficulties for LEP and non-LEP students should have higher correlations than the difficulties for LEP students of different score intervals or for non-LEP students of different score intervals (between traits). We expect to see that groups of the same ability are more closely related to each other than to groups of different abilities that share the same LEP status.

    4. Relationships among traits that follow the same pattern regardless of method. This requires that the patterns of higher and lower correlations of item difficulties across score intervals are the same for LEP students as they are for non-LEP students. These criteria focus on the differential validity of the ELA-4. We also expect that the correlations across score intervals of the item difficulties will be high for both LEP and non-LEP students. This is the convergent property of homogeneity.

Item Difficulties. Item difficulties were estimated by p-values converted to the delta scale, in which the mean difficulty is 13 and the standard deviation is 4. Unlike p-values, the scale is positive for difficulty, i.e., the higher the delta value, the more difficult the question, and the scale is equal interval (Angoff, 1984). For the three questions scored 0-2, the two questions scored 0-3, and the two questions scored 0-4, the mean scores were divided by the maximum scores. The resulting proportions of the maximum score were then converted to delta values.
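
As an illustration of this conversion, the following Python sketch assumes the conventional delta definition (mean 13, standard deviation 4), in which delta = 13 - 4z and z is the normal deviate corresponding to the proportion correct. The function names are hypothetical. The treatment of multi-point questions follows the description above: the mean score is divided by the maximum score before conversion.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def p_to_delta(p: float) -> float:
    """Convert a proportion correct (0 < p < 1) to the delta scale."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    # Easier items (higher p) have a higher normal deviate, so the
    # deviate is subtracted: higher delta means a harder question.
    return 13.0 - 4.0 * _STD_NORMAL.inv_cdf(p)


def polytomous_delta(mean_score: float, max_score: int) -> float:
    """Delta for a multi-point question, using mean score / maximum score."""
    return p_to_delta(mean_score / max_score)
```

Under this definition, a question answered correctly by half of the students has a delta of 13, and a question on which students earn, on average, half of the available points is treated the same way.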

Difficulties were computed for LEP students and for non-LEP students in each of the 10 score intervals, yielding 20 sets of item difficulties.

Analyses. Two analyses were performed. The first examined whether prediction was consistently high across score levels and whether the slope of the regression of non-LEP item difficulties on LEP item difficulties varied by score interval. If the construct remains stable, then there should be high R-square values (prediction) across score groups and homogeneity of regression slopes. That is, the relationship between the item difficulties for the two groups should be consistently high.

A regression model was used in which the difficulties for the non-LEP population were regressed onto the difficulties for the LEP population. The interaction of score interval by slope was used to evaluate the homogeneity of regression slopes.
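
The per-level quantities involved in this analysis (slope, intercept, and R-square) can be sketched in Python as an ordinary least squares fit of non-LEP deltas on LEP deltas. This is only a minimal illustration: the function name fit_line and the sample delta values are hypothetical, and the sketch does not carry out the interaction test for homogeneity of slopes.

```python
def fit_line(x, y):
    """Least squares fit of y on x; returns (slope, intercept, r_square)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length of at least 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_square = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_square


# Hypothetical deltas for five items within one score interval.
lep_deltas = [15.2, 13.8, 12.1, 10.4, 9.0]      # predictor
non_lep_deltas = [15.0, 13.9, 12.0, 10.6, 9.1]  # outcome
slope, intercept, r_square = fit_line(lep_deltas, non_lep_deltas)
```

Fitting one such line per score interval gives a set of slopes whose similarity across intervals is what the homogeneity of regression analysis evaluates.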

The second analysis compared the item difficulty correlations among the ten proficiency intervals and two student groups (LEP and non-LEP), which form a 20 x 20 matrix. The correlation coefficients were first converted to z scores to retain interval scale properties.
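
The conversion between correlations and z scores can be sketched in Python as follows. The sketch assumes Fisher's transformation, z = atanh(r), which is consistent with the z and r pairs reported in Table 3 (for example, a mean z of 1.955 corresponds to r = .961). The function names are hypothetical.

```python
import math


def r_to_z(r: float) -> float:
    """Fisher z transformation; undefined for r = 1 or r = -1."""
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    return math.atanh(r)


def z_to_r(z: float) -> float:
    """Inverse Fisher transformation, returning a correlation."""
    return math.tanh(z)


def mean_correlation(correlations) -> float:
    """Average correlations on the z scale, then convert back to r."""
    z_values = [r_to_z(r) for r in correlations]
    return z_to_r(sum(z_values) / len(z_values))
```

Averaging on the z scale rather than on the r scale avoids the compression of the correlation scale near 1.0, where most of the coefficients in this study lie.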

The correlations were divided into two groups: those between difficulties for the same type of students (either LEP or non-LEP) and those between difficulties for different types of students (LEP compared to non-LEP). They were also blocked, or grouped, by differences in the ten overall score intervals: zero differences, 1-3 levels difference, 4-6 levels difference, and 7-9 levels difference. Thus, a correlation of item difficulties for non-LEP students in level 5 and LEP students in level 8 would be classified as a comparison between different types of students, 3 levels apart. A general linear model regression assessed the effects of these two variables in terms of the adapted Campbell and Fiske criteria.

Note that the zero-difference correlations compare LEP difficulties and non-LEP difficulties for students of the same ability level. These correlation coefficients are called validity coefficients and test Campbell and Fiske's criterion #3.
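
The grouping of correlations described above can be sketched in Python. The function name classify_pair and the string labels are hypothetical; the block boundaries (zero, 1-3, 4-6, and 7-9 levels apart) are those stated in the text.

```python
def classify_pair(status_a: str, level_a: int, status_b: str, level_b: int):
    """Classify one correlation by LEP status comparison and level gap."""
    for level in (level_a, level_b):
        if not 0 <= level <= 9:
            raise ValueError("score levels must be between 0 and 9")
    # Same type of students (within) or different types (between).
    comparison = "within" if status_a == status_b else "between"
    gap = abs(level_a - level_b)
    if gap == 0:
        block = "0"      # across LEP status, this is a validity coefficient
    elif gap <= 3:
        block = "1-3"
    elif gap <= 6:
        block = "4-6"
    else:
        block = "7-9"
    return comparison, block


# The example from the text: non-LEP level 5 compared with LEP level 8.
example = classify_pair("non-LEP", 5, "LEP", 8)
```

For the example from the text, the sketch returns a between-status comparison in the 1-3 levels block.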

Table 2
Regression of Monolingual Curriculum Student Item Difficulties
(in Deltas) Onto Limited English Proficient (LEP) Student
Item Difficulties

Overall Score                                                     Mean
Level           Mean Delta     Slope   Intercept   R-Square   Residual
0                   14.233     1.099      -1.875      0.981     -0.231
1                   12.499     1.011      -0.210      0.991      0.093
2                   11.519     0.995       0.012      0.991      0.101
3                   10.787     0.987       0.098      0.988      0.102
4                   10.116     0.976       0.224      0.995      0.106
5                    9.475     1.049      -0.602      0.982     -0.015
6                    8.774     1.087      -1.005      0.975     -0.112
7                    8.180     0.938       0.390      0.961     -0.026
8                    7.423     0.896       0.787      0.957      0.102
9                    5.932     0.975      -0.348      9.750     -0.120
Overall              9.894     0.986       0.019      0.984      0.000

Results

Homogeneity of Regression Slope (Table 2)

There was a significant status by ability level interaction (F(9, 330) = 2.67, p < .01). This indicates that the slopes of the regressions of non-LEP item difficulties on LEP item difficulties varied by ability level. Table 2 shows differences in the slopes ranging from 1.099 for the lowest scoring group to 0.896 for the second highest scoring group. The high levels of prediction (R-square values) suggest that these regression differences are not very meaningful. An analysis of the regression residuals suggests that the item difficulties for the lowest scoring LEP students slightly overpredicted those for non-LEP students. This means that the questions were actually harder for the non-LEP students than would have been predicted based on LEP student performance. Those of the next four intervals and the next to the highest interval (F(9, 340) = 2.23, p < .05) slightly underpredicted those of non-LEP students, meaning the items were easier than expected for non-LEP students.

Non-Nominal Correlations

The correlations ranged from .628 to 1.000. Of the 400 correlation coefficients, 254 (63.5 percent) were .90 or above. Clearly these were not nominal, indicating that the test is homogeneous (measures the same construct) across LEP and non-LEP populations and ability groups.

Within Traits, Within Abilities

The general linear model found a significant effect for the relationships among score levels (F(3, 240) = 155.54, p < .0001), but not for differences in relationships within LEP or non-LEP groups compared to between groups (F(1, 240) = 0.0, ns), nor for an interaction of these two variables (F(2, 240) = 0.0, ns).

The mean correlations, converted back to r values from z-scores, are given in Table 3.

Table 3
Mean Correlations of Item Difficulties on the
1999 ELA-4 for LEP and non-LEP Groups

                            Levels of
Comparison                  Deciles Apart     n   Mean Z-Score      r
Within LEP Status           1                53          1.955   .961
                            2                44          1.371   .879
                            3                23          0.974   .750
Total Within LEP Status     -               120          1.553   .914
Between LEP Status          0 (valid.)        7          2.340   .982
                            1                53          1.963   .961
                            2                44          1.374   .880
                            3                23          0.972   .750
Total Between LEP Status    -               127          1.600   .922
Total                       Validity          7          2.340   .982
                            1               106          1.959   .962
                            2                88          1.373   .880
                            3                46          0.973   .750
All                                         247          1.577   .918

As the difference in ability levels increased, the mean correlations between ability groups decreased somewhat. However, there were no correlations below .628.

As is shown, the item difficulty correlations (validity coefficients) within ability levels across LEP status were highest (criterion 3). There were not higher correlations between ability levels within LEP status, or within non-LEP status. Nor were there higher correlations across both LEP status and ability groups (criterion 2).

Finally, the same pattern was observed among the 400 item difficulty correlations for both LEP and non-LEP groups. Table 4 shows that all of the highest correlations for each of the 20 groups (2 LEP statuses, 10 levels) were either no levels of ability apart (validity coefficients) or one level of ability apart. In all cases where the validity coefficient was not the highest for a group, the highest correlations were for item difficulties for groups of the same LEP status.

Table 4
Patterns of Item Difficulty Correlations by 10 Ability Levels
and LEP Status on the 1999 ELA-4

Highest Correlations (Entry = Ability Level, 0-9)

                       Difference in Ability Levels
                  +1                      -1
LEP Status    Same    Opposite    Same             Opposite    Validity
Non-LEP       2, 6    -           3, 5, 7, 8, 9    -           0, 1, 4
LEP           -       -           8                -           5, 6, 9

Conclusion

The first analysis of regression differences suggests that there are some differences in the prediction of item difficulties from LEP to non-LEP students. However, the prediction coefficients (R-square values) are so high as to make these differences meaningless. The failure to find differences in constructs, as measured by item difficulty correlations, supports the hypothesis that the ELA-4 is a homogeneous test, suited for all ability levels in the grade 4 population and for LEP and non-LEP students.

References

Angoff, W. H. (1984). Scales, Norms, and Equivalent Scores. Princeton, N.J.: Educational Testing Service.

Angoff, W. H. & Modu, C. C. (1973). Equating the Scales of the Prueba de Aptitud Académica and the Scholastic Aptitude Test. Research Report #3. New York: College Entrance Examination Board.

Campbell, D. T. & Fiske, D. W. (1962). Convergent and Discriminant Validation by the Multitrait-Multimethod Matrix. In D. N. Jackson and S. Messick (Eds.), Problems in Human Assessment, New York: McGraw Hill, Inc., pp. 124-131.