Reliability and Classical Test Theory
Learning Objectives
- Explain the importance of reliability in measurement
- Explain what true score and error score are in classical test theory (CTT)
- Define and derive reliability in CTT
- Explain what parallel, tau-equivalent, and congeneric tests are
Reliability
A test is reliable if we would obtain very similar scores were we to repeat the test
Also known as dependability, or consistency across some condition
Also called precision
E.g., across time, items, forms, raters
Reliability concerns observed scores
The reliability coefficient is not defined for a single score, but for a set of (hypothetical) scores
On the other hand, precision can be defined for each score
What is Considered Error?
Variability in scores is not necessarily error
E.g., day-to-day variation in a person’s measured weight (which genuinely fluctuates) vs. in their measured height (which should stay constant)
What about an examinee who answers a question incorrectly on the first try and then answers the same question correctly on a second try? Should we consider the difference between the two responses “error”?
Classical Test Theory
\[X = T + E\]
- \(X\): Observed score
- \(E\): Random error/inconsistencies
- \(T\): “True” score
- A hypothetical average score we would get if we could test a person repeatedly and “brainwash” them after each testing
Propensity Distribution
See Figure 7.1
Standard deviation of the propensity distribution (PD) = standard error of measurement for that person
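A small simulation shows the idea for one examinee. It is a sketch under stated assumptions: the true score (10) and the standard error of measurement (1.5) are hypothetical values, not taken from Figure 7.1, and errors are drawn from a normal distribution for convenience (CTT itself does not require normality). Each simulated replication plays the role of one “brainwashed” retest.

```python
import numpy as np

rng = np.random.default_rng(2024)  # fixed seed so the sketch is reproducible

# Hypothetical values for ONE examinee (not taken from the figure):
true_score = 10.0  # T, the mean of this person's propensity distribution
sem = 1.5          # SD of the propensity distribution = standard error of measurement

# Each replication is an independent draw X = T + E, with E ~ N(0, sem^2) (assumed normal).
n_replications = 100_000
errors = rng.normal(loc=0.0, scale=sem, size=n_replications)
observed = true_score + errors

print(f"mean of X = {observed.mean():.3f}  (approximates T = {true_score})")
print(f"SD of X   = {observed.std(ddof=1):.3f}  (approximates SEM = {sem})")
```

The mean of the simulated scores recovers \(T\), and their standard deviation recovers the SEM. In real testing only one draw from this distribution is observed per person.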
Random vs. Systematic Error
CTT assumes random error: the errors \(E_1, E_2, \ldots\) from repeated measurements are independent
- Errors are also random across persons
In practice, error can be systematic
- E.g., a rater who is consistently too lenient; a blood pressure meter that is not calibrated
True Score and Error Score
CTT defines \(T\) = \(E(X)\)
So \(T\) can contain systematic error
By construction of CTT
- The expected value of \(E\) is zero
- Corr(\(E\), \(T\)) = 0
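The following sketch shows why \(T = E(X)\) can absorb systematic error. It assumes a hypothetical rater who adds a constant +2 points to every rating, plus normally distributed random noise. All numbers are illustrative. Each person's true score is approximated by their mean over many replications.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical setup: a rater who is systematically lenient by +2 points.
n_persons, n_ratings = 2_000, 500
latent_skill = rng.normal(50, 10, size=n_persons)             # what we intend to measure
leniency_bias = 2.0                                            # systematic error, identical every time
random_noise = rng.normal(0, 3, size=(n_persons, n_ratings))   # random error

# Row i holds repeated ratings of person i by the same lenient rater.
X = latent_skill[:, None] + leniency_bias + random_noise

# CTT true score T = E(X), approximated by each person's mean over replications.
T = X.mean(axis=1)
E = X - T[:, None]  # CTT error score for every replication

print(f"mean(T - latent_skill) = {np.mean(T - latent_skill):.3f}")      # about 2: the bias sits inside T
print(f"mean(E)                = {E.mean():.3f}")                       # 0 (up to rounding) by construction
print(f"corr(E, T)             = {np.corrcoef(E[:, 0], T)[0, 1]:.3f}")  # about 0
```

The leniency never shows up as "error" in the CTT sense. It shifts every replication equally, so it becomes part of \(T\).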
Reliability in CTT
- \(\sigma^2_T\) = variance of \(T\) across persons
- \(\sigma^2_X\) = variance of \(X\) across persons
\[\rho_{X X'} = \frac{\sigma^2_T}{\sigma^2_X}\]
Reliability in CTT is sample-specific: it depends on how much true scores vary in the sample
Because \(T\) is not observed, \(\rho_{X X'}\) cannot be computed directly from data
This is solved using the concept of parallel tests
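In a simulation, \(T\) is known, so \(\sigma^2_T / \sigma^2_X\) can be computed directly. The sketch below applies the same test (error SD fixed at 5) to two hypothetical populations that differ only in true-score spread. It illustrates why reliability is sample-specific. Normal distributions and the same error SD for every person are simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(11)

def ctt_reliability(true_sd, error_sd, n=200_000):
    """Simulate X = T + E in a population and return var(T) / var(X)."""
    T = rng.normal(0, true_sd, size=n)
    E = rng.normal(0, error_sd, size=n)  # same error SD for everyone (assumption)
    X = T + E
    return T.var() / X.var()

# Same test (error SD = 5), two hypothetical populations with different true-score SDs.
heterogeneous = ctt_reliability(true_sd=10, error_sd=5)  # theory: 100 / 125 = 0.80
homogeneous = ctt_reliability(true_sd=5, error_sd=5)     # theory:  25 /  50 = 0.50
print(f"heterogeneous sample: {heterogeneous:.3f}")
print(f"homogeneous sample:   {homogeneous:.3f}")
```

The measurement error is identical in both cases. Reliability drops only because examinees in the second population differ less from one another.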
Parallel Tests
Two tests \(X_1\) and \(X_2\), with true scores \(T_1\) and \(T_2\), are parallel if and only if
\(T_1\) = \(T_2\); \(Var(E_1)\) = \(Var(E_2)\)
(t = true score; xij = observed score on test i, replication j)
Person     t1    x11    x12    x13     t2    x21    x22    x23
     1   8.77   9.17   9.21   7.94   8.77   8.96   8.22   9.14
     2   9.42   9.99   8.45   9.82   9.42   9.54   8.53  10.20
     3  10.98  11.52  10.74  10.70  10.98  12.37  11.02   9.56
     4  10.22   9.51  12.00   9.15  10.22   8.85  10.93  10.88
     5  11.47  11.81  10.51  12.07  11.47   9.43  14.06  10.90
Without loss of generality, assume \(X_1\) and \(X_2\) have been centered
\[E(X_1 X_2) = E(T_1 T_2) + E(T_1 E_2) + E(T_2 E_1) + E(E_1 E_2)\]
The last three terms are zero: errors have mean zero, are uncorrelated with true scores, and (as random errors) are uncorrelated with each other
Because \(T_1\) = \(T_2\) = \(T\), and \(E(T)\) = 0,
\[E(X_1 X_2) = E(T^2) = \sigma^2_T\]
Parallel tests have equal true-score and error variances, so \(Var(X_1)\) = \(Var(X_2)\) = \(\sigma^2_X\), and
\[\rho_{X X'} = \frac{\sigma^2_T}{\sigma^2_X} = \frac{E(X_1 X_2)}{\sqrt{Var(X_1) Var(X_2)}}\]
where the last term is the correlation between \(X_1\) and \(X_2\)
So,
Reliability = Correlation between two parallel tests
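A quick simulation can check this result under stated assumptions. True scores and errors are normal, with hypothetical SDs \(\sigma_T = 2\) and \(\sigma_E = 1\) on both forms, so the theoretical reliability is \(4 / (4 + 1) = 0.8\). The two forms share the same true score and have independent errors of equal variance, as the definition of parallel tests requires.

```python
import numpy as np

rng = np.random.default_rng(3)

n = 100_000
sigma_T, sigma_E = 2.0, 1.0  # hypothetical SDs; theory: rho = 4 / (4 + 1) = 0.8

T = rng.normal(10, sigma_T, size=n)        # one true score per person, shared by both forms
X1 = T + rng.normal(0, sigma_E, size=n)    # form 1: independent error
X2 = T + rng.normal(0, sigma_E, size=n)    # form 2: independent error with the same variance

theoretical = sigma_T**2 / (sigma_T**2 + sigma_E**2)
parallel_forms_r = np.corrcoef(X1, X2)[0, 1]  # uses observed scores only; T is not needed
print(f"sigma_T^2 / sigma_X^2 = {theoretical:.3f}")
print(f"corr(X1, X2)          = {parallel_forms_r:.3f}")
```

The true scores are not centered here. That is harmless because a correlation does not depend on the means, which is why the derivation could assume centering without loss of generality.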
Tau-Equivalent Tests
\(T_1\) = \(T_2\); \(Var(E_1)\) may be different from \(Var(E_2)\)
Person     t1    x11    x12    x13     t2    x21    x22    x23
     1   7.96   7.38   7.58   8.91   7.96  15.05   1.89   6.93
     2   8.16   7.37   8.18   8.94   8.16  11.89   4.36   8.23
     3   6.80   7.51   5.59   7.30   6.80   8.61   2.18   9.60
     4  14.86  14.20  14.57  15.81  14.86  14.02  22.74   7.83
     5   9.72   9.19  10.23   9.73   9.72   1.53  13.86  13.77
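With tau-equivalent tests, the derivation for \(E(X_1 X_2)\) still gives \(\sigma^2_T\), because it never used equal error variances. The denominator changes, however. The sketch below assumes hypothetical error SDs of 1 and 3. It shows that the covariance still recovers \(\sigma^2_T = 4\), while the correlation \(\sigma^2_T / (\sigma_{X_1}\sigma_{X_2})\) equals neither form's reliability.

```python
import numpy as np

rng = np.random.default_rng(5)

n = 200_000
sigma_T = 2.0
sigma_E1, sigma_E2 = 1.0, 3.0  # hypothetical: form 2 is much noisier

T = rng.normal(10, sigma_T, size=n)  # T1 = T2 = T
X1 = T + rng.normal(0, sigma_E1, size=n)
X2 = T + rng.normal(0, sigma_E2, size=n)

rel_1 = sigma_T**2 / (sigma_T**2 + sigma_E1**2)  # 4 / 5  = 0.80
rel_2 = sigma_T**2 / (sigma_T**2 + sigma_E2**2)  # 4 / 13, about 0.31
cov_12 = np.cov(X1, X2)[0, 1]
r_12 = np.corrcoef(X1, X2)[0, 1]

print(f"cov(X1, X2)  = {cov_12:.3f}  (true-score variance = {sigma_T**2})")
print(f"corr(X1, X2) = {r_12:.3f}  vs. reliabilities {rel_1:.3f} and {rel_2:.3f}")
```

So the parallel-forms correlation cannot be used as "the" reliability once error variances differ.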
Essentially Tau-Equivalent Tests
\(T_1\) = \(a\) + \(T_2\); \(Var(E_1)\) may be different from \(Var(E_2)\)
Person     t1    x11    x12    x13     t2    x21    x22    x23
     1  10.36  10.85   9.84  10.39  12.36  11.48  14.77  10.84
     2   8.64   8.37   8.48   9.07  10.64  14.46   4.00  13.46
     3  10.51  11.07  11.16   9.29  12.51  13.06  11.76  12.70
     4   6.69   6.20   7.69   6.16   8.69   9.53   6.11  10.42
     5   7.32   7.61   7.38   6.97   9.32   8.47  10.77   8.72
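The constant \(a\) can be read off the table. This short check copies the true-score columns for the five persons and computes \(T_1 - T_2\), which under \(T_1 = a + T_2\) should be the same for everyone. It also confirms that a constant shift leaves the true-score variance unchanged.

```python
import numpy as np

# True scores copied from the essentially tau-equivalent table above (persons 1 to 5).
t1 = np.array([10.36, 8.64, 10.51, 6.69, 7.32])
t2 = np.array([12.36, 10.64, 12.51, 8.69, 9.32])

a = t1 - t2  # under T1 = a + T2 this difference is constant
print(np.round(a, 2))                  # → [-2. -2. -2. -2. -2.]
print(np.isclose(t1.var(), t2.var()))  # → True: shifting by a constant does not change variance
```

In this example \(a = -2\). Because adding a constant changes neither variances nor covariances, the two tests share the same \(\sigma^2_T\), just as tau-equivalent tests do.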
Congeneric Tests
\(T_1\) = \(a\) + \(\color{red}{b}\) \(T_2\); \(Var(E_1)\) may be different from \(Var(E_2)\)
Person     t1    x11    x12    x13     t2    x21    x22    x23
     1  10.48  10.14   9.63  11.68   9.34   8.92   8.25  10.85
     2   9.62   8.71  11.03   9.12   8.73   6.97  12.45   6.79
     3   7.13   7.01   7.38   7.02   6.99   2.55   7.57  10.86
     4  12.92  12.24  13.61  12.90  11.04  10.77  11.10  11.25
     5   9.98   8.80  11.52   9.61   8.98  11.36   9.36   6.23
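The congeneric relation can also be recovered from the table. Because the columns are true scores with no error, a least-squares line of \(t_1\) on \(t_2\) should fit almost perfectly. Any residual should reflect only the two-decimal rounding. The sketch uses `np.polyfit` with degree 1, which returns the slope followed by the intercept.

```python
import numpy as np

# True scores copied from the congeneric table above (persons 1 to 5).
t1 = np.array([10.48, 9.62, 7.13, 12.92, 9.98])
t2 = np.array([9.34, 8.73, 6.99, 11.04, 8.98])

# Least-squares line T1 = a + b * T2; for deg=1, np.polyfit returns [slope, intercept].
b, a = np.polyfit(t2, t1, deg=1)
residuals = t1 - (a + b * t2)
print(f"a = {a:.2f}, b = {b:.2f}, max |residual| = {np.abs(residuals).max():.3f}")
```

On these five rows the fit comes out near \(a \approx -2.86\) and \(b \approx 1.43\), or equivalently \(T_2 \approx 2 + 0.7\,T_1\). Because \(b \neq 1\), the two tests have different true-score variances, which separates congeneric tests from the (essentially) tau-equivalent cases.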
Additional Note on Reliability in CTT
Theoretically speaking, \(0 \leq \rho_{X X'} \leq 1\)
\(\rho_{X X'}\) = squared correlation between \(T\) and \(X\) (think of \(R^2\)): since \(Cov(T, X) = \sigma^2_T\), \(Corr(T, X) = \sigma_T / \sigma_X\)
Only one error variance is estimated, which is the average of the error variances across persons
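Both notes can be checked numerically. This sketch assumes hypothetical heteroscedastic errors: each person gets their own error SD, drawn uniformly from 0.5 to 1.5. It then compares the variance-ratio definition of \(\rho_{X X'}\) with \(Corr(T, X)^2\). It also shows that \(\sigma^2_X - \sigma^2_T\) matches the average of the person-level error variances.

```python
import numpy as np

rng = np.random.default_rng(9)

n = 200_000
T = rng.normal(0, 2.0, size=n)

# Hypothetical heteroscedastic errors: each person has their own error SD.
person_error_sd = rng.uniform(0.5, 1.5, size=n)
E = rng.normal(0, 1, size=n) * person_error_sd
X = T + E

rho = T.var() / X.var()                        # variance-ratio definition
r_TX_squared = np.corrcoef(T, X)[0, 1] ** 2    # squared correlation between T and X
avg_error_var = np.mean(person_error_sd ** 2)  # average of person-level error variances

print(f"var(T) / var(X)        = {rho:.3f}")
print(f"corr(T, X)^2           = {r_TX_squared:.3f}")
print(f"var(X) - var(T)        = {X.var() - T.var():.3f}")
print(f"mean person error var  = {avg_error_var:.3f}")
```

The single error variance that CTT works with (\(\sigma^2_X - \sigma^2_T\)) summarizes these unequal person-level variances as their average. Individual examinees may be measured more or less precisely than that average suggests.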