The Homogeneity of the Grade 4 English Language Arts Test (ELA-4)
for the Range of Ability Levels and Disability Status

Gerald E. DeMauro

Office of State Assessment

November, 1999

Abstract

As the State Assessments become more universal, it is important to ensure that each test measures the same construct or trait for the different populations for which the scores have the same interpretation. A study of the homogeneity, or universal meaning, of the test scores of the Grade 4 English Language Arts examination (ELA-4) was conducted based on the 1999 administration. Based on the stability of the relative difficulties of the test questions, the test appears to be homogeneous for groups in different ranges of overall scores, and for students who have a disability and those who do not. This construct evaluation is essential to support making the same inferences about student skills for different populations with the same scores.

Executive Summary

Several analyses of the construct validity of the New York State Grade 4 English Language Arts assessment (ELA-4) are presented for students who indicated that they had disabilities. The focus was not on the difficulty of the test for these students as compared to other students, but on whether or not the score supported the same inference about student skills, independent of disability status.

One method of measuring this is to evaluate whether students with disabilities found the same test questions to be more difficult, and the same questions to be less difficult, as did students without disabilities. The analyses examine this in terms of correlations of item difficulties. The analyses were extended to consider several levels of scoring, as well. The results provide strong evidence for the homogeneity, or lack of differential construct validity, of the ELA-4 across populations.

Overview

A common means of evaluating the homogeneity of the construct measured by a test is to correlate the item difficulties between populations (Angoff and Modu, 1973). In a homogeneous test, the most difficult questions are most difficult for all students, and the least difficult are least difficult for all students. This property ensures that the test results share the same interpretation for all students. At its very base, evaluation of this property focuses on the construct validity of the test in terms of the support for the score-based inferences. Quite simply, these inferences cannot be supported if the test measures different things for different children.

One application of homogeneity is to evaluate the construct properties of the test for special populations. For the test to be a useful, interpretable measure for students with disabilities, there should be high correlations between the item difficulties for these students and those for all other students. This study assesses the homogeneity of the 1999 Grade 4 English Language Arts (ELA-4) examination, that is, its differential construct properties for students with disabilities and those without disabilities at various score ranges.

Methods

Sample. The sample consisted of 237,281 grade four students who took the ELA-4 in January 1999 and had valid scores and codes indicating either that they were without a disability or that they had a disability. In all, 211,676 (89.2 percent) of the students indicated no disability and 25,605 (10.8 percent) had a disability, as shown in Table 1.

Of the 25,605 students identified as having a disability, 16,947 (66.2 percent) were boys.

Distributions. Overall, ELA-4 scale scores are distributed in four proficiency ranges: level 1 (low) through level 4 (high). There are smaller numbers of students with disabilities in the higher ranges of test scores, as shown by the lower means for students with disabilities in Table 1. Therefore, to investigate the ELA-4 measurement properties for students with disabilities over the range of scores, students were divided into three groups of equal number within level 1, three groups of equal number within level 2, two groups of equal number within level 3, and two groups of equal number within level 4, for a total of ten score levels.

Analyses

Analyses were conducted on three levels. A General Linear Model regression assessed two main variables, sex and disability classification, and their interaction. The second level of analysis was concerned with the homogeneity of the test over two variables: range of scores and reported disability or no reported disability.

For the homogeneity analysis, performance on each of the 35 items of the test was first divided by the number of possible points to control the range and contribution of each item to the total raw score. The average difficulties, or p-values, were then converted to the delta scale, which has a mean of 13 and a standard deviation of four and is inverted so that low delta values represent less difficult items. This conversion permits the mathematical manipulations necessary for subsequent correlational analyses of item difficulties because the delta scale is an interval scale, while the p-value scale is ordinal.
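The conversion described above can be sketched as follows. This is a minimal illustration, not the report's own procedure: the report does not state its formula, so the sketch assumes the conventional delta definition, delta = 13 - 4 * z, where z is the standard normal deviate corresponding to the proportion of points earned. The function names and the clipping constant `eps` are hypothetical.

```python
from statistics import NormalDist


def item_p_value(total_points_earned, n_students, max_points):
    """Mean proportion of available points earned on one item.

    Dividing by max_points puts multi-point items on the same 0-1
    scale as one-point items.
    """
    return total_points_earned / (n_students * max_points)


def p_to_delta(p, eps=1e-6):
    """Convert a proportion (p-value) to the delta scale.

    Assumes the conventional definition delta = 13 - 4 * z, so that
    easier items (high p) receive lower delta values.
    """
    # Clip to avoid infinite deviates at p = 0 or p = 1 (an assumption
    # made for this sketch, not something the report describes).
    p = min(max(p, eps), 1.0 - eps)
    return 13.0 - 4.0 * NormalDist().inv_cdf(p)
```

Under this assumed definition, an item on which students earn half of the available points has a delta of 13, and easier items fall below 13.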

Regression and correlational analyses then followed on two levels. The first analysis was a straightforward bivariate correlation of the item delta values for each disability category with the item delta values for the group that had no indication of a disability. This was a preliminary evaluation of test homogeneity for each category.
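A bivariate correlation of this kind can be sketched with a plain Pearson coefficient computed over paired item deltas. The function name and the five sample deltas are hypothetical and only illustrate the calculation; the study used 35 deltas per group.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences of item deltas."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length (at least 2 items)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical deltas for five items, listed in the same item order.
deltas_no_disability = [9.8, 11.2, 12.5, 13.9, 15.1]
deltas_disability = [10.9, 12.0, 13.8, 14.6, 16.4]
r = pearson_r(deltas_no_disability, deltas_disability)
```

A high value of `r` means the two groups rank the items similarly by difficulty, even if one group finds every item harder than the other does.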

The second analysis tested homogeneity over performance groups. In all, over the 35 items, ten performance levels, and two disability categories (reported or not), 700 delta values were computed. Each bivariate correlation was computed over 35 delta values, that is, over the estimated difficulties of the 35 items for each pair of groups being compared.
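The bookkeeping behind these counts can be sketched as below. The sketch assumes that every pair among the 20 groups (ten score levels by two disability statuses) is correlated; the report does not list the pairs explicitly, and the group labels are hypothetical.

```python
from itertools import combinations

N_ITEMS = 35
SCORE_LEVELS = range(10)                     # ten score levels, lowest to highest
STATUSES = ("disability", "no_disability")   # hypothetical labels

# One group per (status, score level) combination.
groups = [(status, level) for status in STATUSES for level in SCORE_LEVELS]
n_deltas = N_ITEMS * len(groups)             # one delta per item per group


def classify(g1, g2):
    """Label a pair of groups by the type of comparison it represents."""
    same_status = g1[0] == g2[0]
    same_level = g1[1] == g2[1]
    if same_level:
        return "across status, same level"
    if same_status:
        return "within status, different levels"
    return "across status, different levels"


counts = {}
for g1, g2 in combinations(groups, 2):
    kind = classify(g1, g2)
    counts[kind] = counts.get(kind, 0) + 1
```

Under that assumption there are 700 delta values and 190 pairwise correlations: 10 across status at the same level, 90 within status at different levels, and 90 across status at different levels.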

Table 1

Mean Scale Scores on the January 1999 Administration
of the Grade 4 English Language Arts Examination
by Sex and Disability

 

                                  Girls                Boys                 Both
Disability Status               N       Mean        N       Mean        N       Mean
None                        108,278   647.96    103,398   642.87    211,676   645.48
Autistic                         14   624.43         47   611.02         61   614.10
Emotionally Disturbed           439   604.43      1,567   597.92      2,006   599.34
Learning Disabled             5,462   609.86      9,829   610.58     15,291   610.32
Mentally Retarded               121   577.75        128   565.39        249   571.40
Deaf                             15   604.07         17   606.35         32   605.28
Hard of Hearing                 117   607.97        147   608.41        264   608.22
Speech Impaired               1,792   609.37      3,481   608.12      5,273   608.55
Visually Impaired                35   627.43         60   638.90         95   634.67
Orthopedically Impaired         102   629.56        124   630.48        226   630.07
Other Health Impaired           349   622.36      1,125   627.76      1,474   626.48
Multiple Disabilities           198   602.71        382   603.44        580   603.19
Traumatic Brain Injury           14   610.21         40   607.88         54   608.48

Results

Sex by Disability analysis

The main effect for sex was not significant (F (1, 237,255) = 0.81, ns; means = 645.13 for girls and 638.21 for boys). There were significant scoring differences attributable to disability category (F (12, 237,255) = 2144.10, p < .001) and to the interaction of sex and disability (F (12, 237,255) = 13.58, p < .0001) (refer to Table 2 for means). Higher mean delta values indicate that the items were harder for that group.

Post hoc Tukey comparisons showed that the students without disabilities outscored all other students, and that mental retardation was associated with the lowest scores.

Table 2 also shows that the correlations of item difficulties for students with disabilities with those for students without disabilities are high (above .35). The lowest correlation involves girls with autism and non-disabled girls (.677). In general, the correlations for boys with disabilities and boys without disabilities were higher than those for girls with disabilities and girls without disabilities.

An analysis was also made of the homogeneity of regression of the mean item difficulty for girls onto the mean item difficulty for boys in relation to disability category. The results showed no differences in relation to disability category (F (12, 2) = 1.40, ns). Table 3 shows the regression coefficients.

Table 2

Item Difficulty (Delta) Correlations on the Grade Four ELA
for Each Category of Disability and Students without Identified Disabilities (Mean Delta = 12.73)

Cor. = Correlation with Group Without Disabilities

                                 Both               Girls               Boys          Girls/Boys
Disability Category        Mean Delta   Cor.   Mean Delta   Cor.   Mean Delta   Cor.     Cor.
Autistic                      12.09     .921      11.45     .677      12.23     .827     .718
Emotionally Disturbed         12.57     .934      12.41     .947      12.63     .943     .783
Learning Disabled             12.10     .950      12.20     .952      12.03     .957     .984
Mentally Retarded             14.05     .790      13.97     .814      14.13     .740     .901
Deaf                          12.20     .829      12.46     .796      11.92     .761     .738
Hard of Hearing               12.16     .938      12.27     .916      12.07     .919     .962
Speech Impaired               12.28     .964      12.30     .957      12.29     .970     .986
Visually Impaired             10.76     .960      11.26     .929      10.39     .943     .875
Orthopedically Impaired       11.02     .971      11.12     .964      10.92     .967     .961
Other Health Impaired         11.11     .965      11.38     .966      11.02     .974     .970
Multiple Disabilities         12.40     .891      12.43     .862      12.38     .906     .984
Traumatic Brain Injury        12.35     .938      11.99     .768      12.44     .930     .676
Median Correlations                     .938                .923                .937     .931

Table 3

Regression Slopes and Intercepts of Item Difficulties for
Girls onto Item Difficulties for Boys, by Category of Disability

Type of Disability           Slope    Intercept   R-Square
Autism                       0.958     -0.372       .515
Emotionally Disturbed        1.081     -1.242       .965
Learning Disabled            1.012      0.025       .969
Mentally Retarded            0.937      0.720       .813
Deaf                         0.720      3.883       .545
Hard of Hearing              1.043     -0.317       .925
Speech Impaired              0.997      0.044       .971
Visually Impaired            0.668      4.309       .766
Orthopedically Impaired      0.856      1.777       .923
Other Health Impaired        0.992      0.451       .941
Multiple Disabilities        0.986      0.222       .968
Traumatic Brain Injury       1.086     -1.523       .457
Without Disability           1.006     -0.283       .972
Overall                      0.906      1.163       .785

 

Homogeneity Across Disability Status and Scoring Levels

Campbell and Fiske (1959) set four criteria for the construct validity of tests that focus on the convergent properties of test scores with other measures of the same construct and the discriminant properties of test scores with measures of different constructs. These criteria, termed multitrait-multimethod validation, are based on correlations of the scores with other variables, and can be summarized as follows:

    1. Correlations with other measures of the same trait are not nominal (equal to or greater than .350);
    2. The correlations of item difficulties across disability status but within score range, e.g., students in score level 3 without disabilities and students in score level 3 with disabilities, are greater than both those in the same disability status and those in different disability status and different score levels;
    3. The correlations of item difficulties of different scoring levels over disability status groups are greater than those of different disability status groups over scoring levels;
    4. The pattern of across disability status groups and scoring levels correlations of item difficulties is uniform for each score level.

The evaluation of the homogeneity of the ELA-4 construct involves a modification of the criteria. If we consider level of score range to be the trait and students with disabilities or without disabilities to be the method, then it is clear that, because we expect the ELA-4 to measure the same trait throughout the range of scores, we do not expect to necessarily meet criterion 3. Naturally, uniform levels of high correlations would also make it difficult to discern patterns in the data (criterion 4).

To enable statistical manipulation of the correlations, the coefficients were transformed to z-scores.* All correlations were above .350 (criterion 1). A general linear regression model revealed that there were no reliable differences in the correlations of item difficulties for students with disabilities across scoring levels, compared to those for students without disabilities across scoring levels (criterion 2). However, the mean correlation across disability status and within score levels (mean = .969) was significantly higher than either the mean within disability status across score levels (mean = .872) or the mean across disability status and across score levels (mean = .845) (F (2, 178) = 12.33, p < .0001).

In terms of the pattern of correlations, in all ten score levels, the correlations were highest either across disability status within score level or with item difficulties within one score level and within disability status. That is, the item difficulty correlations for students with disabilities in eight of the ten score levels were highest with those for other students with disabilities in an adjoining score level. Table 5 shows that the mean correlations were ordered, both for students with disabilities and for those without disabilities, according to the distance from the score level.
_______________________________
*Although statistical manipulations were performed on z-scores, mean z-scores were transformed back to correlation coefficients for the purposes of this report.
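The averaging described in the footnote can be sketched as below, assuming the z-scores are Fisher's r-to-z transformation (the inverse hyperbolic tangent of r). The report does not name the transformation, so this is an assumption, and the function names are hypothetical.

```python
import math


def fisher_z(r):
    """Fisher r-to-z transformation (assumed to be the report's z-score)."""
    return math.atanh(r)


def inverse_fisher_z(z):
    """Transform a (mean) z value back to a correlation coefficient."""
    return math.tanh(z)


def mean_correlation(correlations):
    """Average correlations on the z scale, then convert the mean back to r."""
    zs = [fisher_z(r) for r in correlations]
    return inverse_fisher_z(sum(zs) / len(zs))


# Same-score-level correlations across disability status, taken from the
# first data column of Table 4.
same_level = [.967, .980, .981, .982, .980, .978, .966, .958, .892, .944]
mean_same_level = mean_correlation(same_level)
```

Applied to the ten same-level correlations in Table 4, this procedure gives a value of about .969, in line with the mean reported in the text, whereas a simple arithmetic mean of the same coefficients is slightly lower.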

A second general linear model was performed in which the type of comparison (across score levels, across disability status) and the degree of score level difference were main effects. The interaction of these two was also assessed.

There were no significant differences attributable to type of correlation (F (1, 162) = 2.72, ns) or to the interaction of score level difference and type of comparison (F (8, 162) = 0.28, ns). However, as expected, there was a significant difference attributable to score level difference (F (8, 162) = 36.99, p < .001). Keep in mind that the only correlation having no score level difference involved item difficulties across disability status groups but within score level. Thus, this analysis confounds type of correlation with score level difference. Nevertheless, the mean correlations within both disability status groups decrease monotonically as the score level differences increase (see Table 5), thus addressing Campbell and Fiske's criterion 4.

Homogeneity of Regression

Finally, an analysis was made of the homogeneity of regression of item difficulties for students without disabilities onto those of students with disabilities. The overall r-square for the regression was .962. Analyses revealed no significant variation of this regression across score levels (F (9, 9) = 0.63, ns). Nor was there a significant difference in the residuals (observed item difficulties minus predicted item difficulties) of the regression in relation to score level (F (9, 340) = 0.57, ns). Table 6 shows the regression coefficients.
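The regression underlying this analysis can be sketched as an ordinary least-squares line fitted to paired item deltas, with the deltas for students with disabilities as the predictor and those for students without disabilities as the outcome (the direction given in the footnote to Table 6). The function names are hypothetical, and the sketch covers only the fit and the residuals, not the F tests across score levels.

```python
def fit_line(x, y):
    """Least-squares fit of y = slope * x + intercept, plus R-square.

    x: item deltas for students with disabilities
    y: item deltas for students without disabilities
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_square = (sxy ** 2) / (sxx * syy)
    return slope, intercept, r_square


def residuals(x, y, slope, intercept):
    """Observed item difficulties minus predicted item difficulties."""
    return [b - (slope * a + intercept) for a, b in zip(x, y)]
```

A slope near 1 with a small intercept, as in most rows of Table 6, indicates that the items are spaced similarly in difficulty for the two groups.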

Table 4

Item Difficulty Correlations of Students With Disabilities with
Students Without Disabilities on the

1999 Grade 4 ELA, by Score Level

 

                              Correlation with Dis
                               Dis.                      No. Dis.
Score Level     No. Dis.*      Different Score Ranges*   Different Score Ranges*
455-577           .967           .638                      .561
578-593           .980           .821                      .837
594-602           .981           .874                      .837
603-621           .982           .880                      .898
622-634           .980           .902                      .902
635-644           .978           .902                      .911
652-661           .966           .908                      .881
662-691           .958           .859                      .850
692-701           .892           .828                      .774
702-800           .944           .750                      .711

______________________________
*Median correlations

Table 5

Mean Item Difficulty (in Deltas) Correlations by
Distance in Level of Scoring and Disability Status

Distance Between     Both        Within Disability   Across Disability
Score Levels         Statuses    Status              Status
0                     .969         -                   .969
1                     .958        .968                 .948
2                     .918        .927                 .908
3                     .860        .868                 .852
4                     .797        .805                 .784
5                     .740        .754                 .727
6                     .683        .698                 .669
7                     .640        .666                 .618
8                     .618        .647                 .587
9                     .616        .652                 .576
Mean                  .870        .872                 .845

Table 6

Regression Slopes and Intercepts of Item Difficulties of
Students Without Disabilities onto Item Difficulties
of Students with Disabilities, by Scoring Level*

Scoring Level     Slope    Intercept   R-Square
455-577           1.069     -1.279       .904
578-593           1.044     -0.454       .961
594-602           1.044     -0.392       .960
603-621           1.052     -0.520       .964
622-634           1.039     -0.262       .955
635-644           1.042     -0.236       .953
652-661           1.061     -0.411       .932
662-691           0.989      0.136       .917
692-701           0.913      0.590       .760
702-800           0.880      0.718       .753
Overall           0.999      0.093       .960

_______________________________
*without disability = disability X slope + intercept

Conclusion

The ELA-4 is very robust in terms of the construct being measured across levels of scoring and disability status. This is good news, because, at least within the scope and precision of these analyses, the inference one can make of student strengths and weaknesses, given the score, is substantially independent of disability status, supporting the validity of this test for students with disabilities.

NOTES

Angoff, W. H. & Modu, C. C. "Equating the Scales of the Prueba de Aptitud Academica and the Scholastic Aptitude Test." (New York: College Entrance Examination Board, Research Report 3, 1973).

Campbell, D. T. & Fiske, D. W. Convergent and discriminant validation by the multitrait-multimethod matrix. In D. N. Jackson & S. Messick (Eds.), "Problems in Human Assessment." (New York: McGraw-Hill, 1962), pp. 124-131.