Reliability and Classical Test Theory

Learning Objectives

Explain the importance of reliability in measurement
Explain what true score and error score are in classical test theory (CTT)
Define and derive reliability in CTT
Explain what parallel, tau-equivalent, and congeneric tests are

Reliability

A test is reliable if we would obtain very similar scores were we to repeat the test

Also known as dependability, or consistency across some condition

E.g., across time, items, forms, raters

Also called precision

Important

Reliability concerns observed scores
A reliability coefficient is defined not for a single score, but for a set of (hypothetical) scores
On the other hand, precision can be defined for each score

What is Considered Error?

Variability in scores is not necessarily error

E.g., variations in measurement of a person’s weight vs. height across days

What about when an examinee answers a question incorrectly at first, and then answers the same question correctly on a second try? Should we consider the difference in responses between the two trials “error”?

Classical Test Theory

\[X = T + E\]

\(X\): Observed score
\(E\): Random error/inconsistencies
\(T\): “True” score
- A hypothetical average score if we could repeatedly test a person, and “brainwash” them after each testing

Propensity Distribution

See Figure 7.1

Standard deviation of the propensity distribution (PD) = standard error of measurement
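As a rough illustration of the propensity distribution, the sketch below simulates one person being tested many times. The true score, the error SD, and the normal error distribution are all hypothetical choices made for this example; CTT itself does not require normality.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

true_score = 10.0         # hypothetical T for one person (assumed value)
sem = 1.5                 # hypothetical SD of the propensity distribution
n_replications = 100_000  # imagined repeated testings with "brainwashing"

# Each replication yields an observed score X = T + E, with E a random error.
errors = rng.normal(loc=0.0, scale=sem, size=n_replications)
observed = true_score + errors

estimated_true_score = observed.mean()  # approximates T = E(X)
estimated_sem = observed.std(ddof=1)    # approximates the SD of the PD

print(f"mean of X: {estimated_true_score:.2f}")
print(f"SD of X:   {estimated_sem:.2f}")
```

With enough replications, the mean of the simulated scores should settle near the assumed true score and their SD near the assumed standard error of measurement.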

Random vs. Systematic Error

CTT assumes random error: \(E_1\), \(E_2\), \(\ldots\) are independent

Errors are also random across persons

In practice, error can be systematic

E.g., raters are too lenient; blood pressure meter not calibrated

True Score and Error Score

CTT defines \(T\) = \(E(X)\), the expected value of \(X\) over the person’s propensity distribution

So \(T\) can contain systematic error
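A minimal sketch of this point, using the uncalibrated blood pressure meter as the running example. The actual value, the size of the bias, and the random error SD are hypothetical numbers chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=3)

actual_value = 120.0  # hypothetical quantity of interest (e.g., systolic BP)
bias = 5.0            # hypothetical constant miscalibration of the meter
sd_random = 4.0       # hypothetical SD of the random error

# Repeated readings for one person: the bias is the same on every occasion.
readings = actual_value + bias + rng.normal(0.0, sd_random, 50_000)

ctt_true_score = readings.mean()        # T = E(X), close to 125 rather than 120
ctt_errors = readings - ctt_true_score  # E = X - T, averages to zero

print(f"CTT true score: {ctt_true_score:.2f}")
print(f"mean CTT error: {ctt_errors.mean():.2e}")
```

Under these assumptions the constant bias ends up inside \(T\), and only the occasion-to-occasion fluctuation counts as \(E\).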

Important

By construction of CTT

The expected value of \(E\) is zero
Corr(\(E\), \(T\)) = 0
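A short sketch of why both properties follow from the definition of \(T\). To avoid confusion with the error score \(E\), the sketch writes \(\mathbb{E}\) for expectation and \(\mathbb{E}(\cdot \mid p)\) for expectation over person \(p\)'s propensity distribution; this notation is introduced only here.

For a given person, \(E = X - T\) and \(T = \mathbb{E}(X \mid p)\), so

\[\mathbb{E}(E \mid p) = \mathbb{E}(X \mid p) - T = T - T = 0\]

Averaging over persons gives \(\mathbb{E}(E) = 0\). Because \(T\) is fixed within a person,

\[Cov(T, E) = \mathbb{E}(T E) - \mathbb{E}(T) \mathbb{E}(E) = \mathbb{E}[T \, \mathbb{E}(E \mid p)] - 0 = 0\]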

Reliability in CTT

\(\sigma^2_T\) = variance of \(T\) across persons
\(\sigma^2_X\) = variance of \(X\) across persons

\[\rho_{X X'} = \frac{\sigma^2_T}{\sigma^2_X}\]
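A brief worked step may help connect this ratio to error variance. Write \(\sigma^2_E\) for the variance of \(E\) across persons (a symbol introduced here for the sketch). Because Corr(\(E\), \(T\)) = 0,

\[\sigma^2_X = Var(T + E) = \sigma^2_T + \sigma^2_E\]

so that

\[\rho_{X X'} = \frac{\sigma^2_T}{\sigma^2_T + \sigma^2_E} = 1 - \frac{\sigma^2_E}{\sigma^2_X}\]

For instance, with the hypothetical values \(\sigma^2_T = 4\) and \(\sigma^2_E = 1\), \(\sigma^2_X = 5\) and \(\rho_{X X'} = 4 / 5 = 0.8\).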

Note

Reliability in CTT is sample-specific

Because \(T\) is not observed, \(\rho_{X X'}\) cannot be computed directly from its definition

This is solved using the concept of parallel tests

Parallel Tests

Two tests \(X_1\) and \(X_2\), with true scores \(T_1\) and \(T_2\), are parallel if and only if

\(T_1\) = \(T_2\); \(Var(E_1)\) = \(Var(E_2)\)

     t1   x11   x12   x13    t2   x21   x22   x23
1  8.77  9.17  9.21  7.94  8.77  8.96  8.22  9.14
2  9.42  9.99  8.45  9.82  9.42  9.54  8.53 10.20
3 10.98 11.52 10.74 10.70 10.98 12.37 11.02  9.56
4 10.22  9.51 12.00  9.15 10.22  8.85 10.93 10.88
5 11.47 11.81 10.51 12.07 11.47  9.43 14.06 10.90

Without loss of generality, assume \(X_1\) and \(X_2\) have been centered

\[E(X_1 X_2) = E(T_1 T_2) + E(T_1 E_2) + E(T_2 E_1) + E(E_1 E_2)\]

The last three terms are zero by construction: errors are uncorrelated with true scores and with each other

Because \(T_1\) = \(T_2\) = \(T\), and \(E(T)\) = 0,

\[E(X_1 X_2) = E(T^2) = \sigma^2_T\]

With parallel tests, \(Var(X_1)\) = \(Var(X_2)\) = \(\sigma^2_X\), so

\[\rho_{X X'} = \frac{\sigma^2_T}{\sigma^2_X} = \frac{E(X_1 X_2)}{\sqrt{Var(X_1) Var(X_2)}}\]

where the last term is the correlation between \(X_1\) and \(X_2\)

So,

Note

Reliability = Correlation between two parallel tests
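The result can be checked numerically. The sketch below simulates two parallel tests for a large group of persons and compares their correlation with \(\sigma^2_T / \sigma^2_X\). The means, SDs, sample size, and normal distributions are assumptions made for the example, not part of CTT.

```python
import numpy as np

rng = np.random.default_rng(seed=2)

n_persons = 200_000
sd_true = 2.0   # assumed SD of true scores across persons
sd_error = 1.0  # assumed error SD, equal for both tests (parallel)

t = rng.normal(10.0, sd_true, n_persons)       # T_1 = T_2 = T
x1 = t + rng.normal(0.0, sd_error, n_persons)  # X_1 = T + E_1
x2 = t + rng.normal(0.0, sd_error, n_persons)  # X_2 = T + E_2, independent of E_1

theoretical_reliability = sd_true**2 / (sd_true**2 + sd_error**2)
parallel_correlation = np.corrcoef(x1, x2)[0, 1]
variance_ratio = t.var(ddof=1) / x1.var(ddof=1)

print(f"sigma^2_T / sigma^2_X (theory): {theoretical_reliability:.3f}")
print(f"var(T) / var(X_1) (simulated):  {variance_ratio:.3f}")
print(f"corr(X_1, X_2) (simulated):     {parallel_correlation:.3f}")
```

In real data only the last quantity is available, since \(T\) is unobserved; the simulation can compute the variance ratio only because it generates \(T\) itself.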

Tau-Equivalent Tests

\(T_1\) = \(T_2\); \(Var(E_1)\) may be different from \(Var(E_2)\)

     t1   x11   x12   x13    t2   x21   x22   x23
1  7.96  7.38  7.58  8.91  7.96 15.05  1.89  6.93
2  8.16  7.37  8.18  8.94  8.16 11.89  4.36  8.23
3  6.80  7.51  5.59  7.30  6.80  8.61  2.18  9.60
4 14.86 14.20 14.57 15.81 14.86 14.02 22.74  7.83
5  9.72  9.19 10.23  9.73  9.72  1.53 13.86 13.77

Essentially Tau-Equivalent Tests

\(T_1\) = \(a\) + \(T_2\); \(Var(E_1)\) may be different from \(Var(E_2)\)

     t1   x11   x12   x13    t2   x21   x22   x23
1 10.36 10.85  9.84 10.39 12.36 11.48 14.77 10.84
2  8.64  8.37  8.48  9.07 10.64 14.46  4.00 13.46
3 10.51 11.07 11.16  9.29 12.51 13.06 11.76 12.70
4  6.69  6.20  7.69  6.16  8.69  9.53  6.11 10.42
5  7.32  7.61  7.38  6.97  9.32  8.47 10.77  8.72

Congeneric Tests

\(T_1\) = \(a\) + \(\color{red}{b}\) \(T_2\); \(Var(E_1)\) may be different from \(Var(E_2)\)

     t1   x11   x12   x13    t2   x21   x22   x23
1 10.48 10.14  9.63 11.68  9.34  8.92  8.25 10.85
2  9.62  8.71 11.03  9.12  8.73  6.97 12.45  6.79
3  7.13  7.01  7.38  7.02  6.99  2.55  7.57 10.86
4 12.92 12.24 13.61 12.90 11.04 10.77 11.10 11.25
5  9.98  8.80 11.52  9.61  8.98 11.36  9.36  6.23
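The four definitions differ only in the constants \(a\) and \(b\) and in whether the error variances are equal. The sketch below makes that explicit with one hypothetical helper, `simulate_test_pair`, whose parameter values are chosen for illustration and are not the ones behind the tables shown here.

```python
import numpy as np

def simulate_test_pair(n_persons, a=0.0, b=1.0, sd_e1=1.0, sd_e2=1.0, seed=0):
    """Simulate two tests with T_1 = a + b * T_2 and independent normal errors."""
    rng = np.random.default_rng(seed)
    t2 = rng.normal(10.0, 2.0, n_persons)  # assumed true score distribution
    t1 = a + b * t2
    x1 = t1 + rng.normal(0.0, sd_e1, n_persons)
    x2 = t2 + rng.normal(0.0, sd_e2, n_persons)
    return t1, x1, t2, x2

# Each design relaxes one more constraint than the previous one.
designs = {
    "parallel": dict(a=0.0, b=1.0, sd_e1=1.0, sd_e2=1.0),
    "tau-equivalent": dict(a=0.0, b=1.0, sd_e1=1.0, sd_e2=3.0),
    "essentially tau-equivalent": dict(a=-2.0, b=1.0, sd_e1=1.0, sd_e2=3.0),
    "congeneric": dict(a=3.0, b=0.8, sd_e1=1.0, sd_e2=3.0),
}

results = {}
for name, params in designs.items():
    t1, x1, t2, x2 = simulate_test_pair(100_000, **params)
    results[name] = {
        "mean_diff": x1.mean() - x2.mean(),
        "true_sd_ratio": t1.std(ddof=1) / t2.std(ddof=1),
        "error_var_1": (x1 - t1).var(ddof=1),
        "error_var_2": (x2 - t2).var(ddof=1),
    }
    print(name, {k: round(float(v), 2) for k, v in results[name].items()})
```

Reading the printed summaries from top to bottom shows first the error variances, then the means, and finally the true score scale being allowed to differ between the two tests.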

Additional Note on Reliability in CTT

Theoretically speaking, \(0 \leq \rho_{X X'} \leq 1\)
\(\rho_{X X'}\) = squared correlation between \(T\) and \(X\) (think of \(R^2\))
Only one error variance is estimated, which is the average of error variance across persons
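The squared correlation statement can be sketched in two steps, using only \(X = T + E\) and Corr(\(E\), \(T\)) = 0. First,

\[Cov(T, X) = Cov(T, T + E) = \sigma^2_T\]

Squaring the correlation then gives

\[[Corr(T, X)]^2 = \frac{(\sigma^2_T)^2}{\sigma^2_T \, \sigma^2_X} = \frac{\sigma^2_T}{\sigma^2_X} = \rho_{X X'}\]

In this sense \(\rho_{X X'}\) plays the role of \(R^2\) in a regression of \(X\) on \(T\): it is the proportion of observed score variance accounted for by true scores.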