Intraclass Correlation and Agreement

PSYC 520

Learning Objectives

  • Distinguish between reliability and agreement
  • Articulate the limitations of Cohen’s \(\kappa\) for interrater agreement
  • Choose the right intraclass correlation (ICC) formula for different study designs and decisions
  • Obtain confidence intervals for ICCs

Rater Data

Person R1 R2 R3
1 1 1 1
2 2 1 3
3 2 2 3
4 3 3 4
5 3 2 4
6 3 3 4
7 4 4 4
8 4 3 5
9 5 5 5
10 5 5 5
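
For the examples that follow, the rating table can be entered in R as below (a sketch; the object name dat_9_1 is taken from the analysis code later in this handout):

# Ratings from the table above, in wide format (one row per person)
dat_9_1 <- data.frame(
    Person = 1:10,
    R1 = c(1, 2, 2, 3, 3, 3, 4, 4, 5, 5),
    R2 = c(1, 1, 2, 3, 2, 3, 4, 3, 5, 5),
    R3 = c(1, 3, 3, 4, 4, 4, 4, 5, 5, 5)
)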

Repeated Measures

  • Concepts from interrater reliability have also been applied to repeated measures

    • R1, R2, R3 → T1, T2, T3
  • Important assumption: the construct does not change across the time points

  • One can also consider R1, R2, R3 as items

Two-Way Contingency Table

Rows: R1 ratings; columns: R2 ratings

        1   2   3   4   5   Sum
1       1   0   0   0   0     1
2       1   1   0   0   0     2
3       0   1   2   0   0     3
4       0   0   1   1   0     2
5       0   0   0   0   2     2
Sum     2   2   3   1   2    10

Nominal and Chance Agreement

Nominal agreement: 1 + 1 + 2 + 1 + 2 = 7 (sum of diagonal)

Chance agreement (expected number of agreements by chance): \(\frac{1}{N} \sum_i n_{i+} n_{+i}\)

  • \(n_{i+}\): row \(i\) total (R1)
  • \(n_{+i}\): column \(i\) total (R2)

Example

  • R1 gave a 4 to 2 out of 10 ratees (\(n_{4+} = 2\));
    • i.e., R1 has a 20% chance of giving a 4
  • R2 gave a 4 to 1 out of 10 ratees (\(n_{+4} = 1\))
  • Among the 1 person R2 rated as a 4, by chance 1 × 0.2 = 0.2 of them would also be rated as a 4 by R1
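
A quick check of these quantities in R, using the R1 and R2 columns of dat_9_1 entered earlier (a sketch):

tab <- table(R1 = dat_9_1$R1, R2 = dat_9_1$R2)
# Nominal agreement: number of ratees on the diagonal
sum(diag(tab))
# Chance agreement: (1 / N) * sum of (row total x column total)
sum(rowSums(tab) * colSums(tab)) / sum(tab)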

Cohen’s \(\kappa\)

\[ \kappa = \frac{p_o - p_e}{1 - p_e} \]

  • \(p_o\): observed proportion agreement
  • \(p_e\): expected proportion agreement by chance
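
Continuing the sketch above, the observed and chance proportions of agreement, and hence \(\kappa\), can be computed as:

n <- sum(tab)
p_o <- sum(diag(tab)) / n                       # observed proportion agreement
p_e <- sum(rowSums(tab) * colSums(tab)) / n^2   # chance proportion agreement
(p_o - p_e) / (1 - p_e)                         # Cohen's kappa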

Thinking

  • What is the \(\kappa\) for the table?
  • What does the denominator represent?

Criticism of \(\kappa\)

  • Only for two raters
  • Sensitive to marginal distributions
    • A small or negative \(\kappa\) may be due to high prevalence of one category
    • \(\kappa\) can behave counterintuitively when marginal distributions differ substantially across raters
  • Designed for nominal data
    • Weighted \(\kappa\) for ordinal data (see the sketch below)
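
For reference, psych::cohen.kappa() computes both unweighted and weighted \(\kappa\) (the latter suited to ordinal categories), along with confidence bounds; a minimal sketch using R1 and R2:

# Two-column matrix of ratings (one row per ratee);
# a square contingency table can also be supplied
psych::cohen.kappa(cbind(dat_9_1$R1, dat_9_1$R2))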

Agreement vs. Reliability

  • Agreement: extent to which raters give the same score
  • Reliability: extent to which raters give consistent scores
R4  R5
 1   3
 2   4
 2   4
 3   5
 3   5
  • Agreement = 0 (the raters never give the same score)
  • Reliability = 1 (the rank order is identical; see the R check below)
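
A quick verification of this example in R (a sketch; r4 and r5 are simply the two columns above):

r4 <- c(1, 2, 2, 3, 3)
r5 <- c(3, 4, 4, 5, 5)
mean(r4 == r5)   # proportion of exact agreement: 0
cor(r4, r5)      # correlation (consistency): 1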

Intraclass Correlation (ICC)

Key references

  • Shrout and Fleiss (1979) (ST)
  • McGraw and Wong (1996) (MW)
  • Ten Hove, Jorgensen, and Van Der Ark (2024)

Terminology

  • One-way (nested) vs. two-way (crossed) designs
  • Consistency vs. absolute agreement
  • Single vs. average ratings
  • Fixed vs. Random?

See Table 9.5 of the textbook

Variance Decomposition

\[ \mathrm{Var}(Y) = \underbrace{\sigma^2_p}_{\text{Person}} + \underbrace{\sigma^2_r}_{\text{Rater}} + \underbrace{\sigma^2_{pr}}_{\text{Person $\times$ Rater}} + \underbrace{\sigma^2_e}_{\text{Error}} \]

  • To estimate \(\sigma^2_r\), one needs multiple observations per rater

  • To estimate \(\sigma^2_{pr}\), one needs multiple observations per combination of person and rater

Nested Design

  • Different raters for each person
  • \(\sigma^2_r\), \(\sigma^2_{pr}\), \(\sigma^2_e\) cannot be separated

Examples

  • Child behavior ratings by own parents
  • Individuals measured on different dates
  • Respondents (egos) reported by people in their network (alters), with no overlap in alters

Crossed Design

  • Same raters for each person
  • \(\sigma^2_{pr}\), \(\sigma^2_e\) cannot be separated, unless each rater rates each person two or more times
  • Partially crossed: different sets of raters for different persons, with some overlap

Consistency vs. Agreement

ICC: \(\dfrac{\sigma^2_p}{\sigma^2_p + \text{Error} / k}\), where \(k\) is the number of ratings averaged (\(k = 1\) for a single rating)

  • Consistency: raters are consistent if they rank-order persons in the same way
    • Error = \(\sigma^2_{pr} + \sigma^2_e\)
  • Agreement: raters give the same ratings
    • Error = \(\sigma^2_r + \sigma^2_{pr} + \sigma^2_e\)
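
Substituting these error terms into the general form gives, for example,

\[ \mathrm{ICC}(C, 1) = \frac{\sigma^2_p}{\sigma^2_p + \sigma^2_{pr} + \sigma^2_e}, \qquad \mathrm{ICC}(A, k) = \frac{\sigma^2_p}{\sigma^2_p + (\sigma^2_r + \sigma^2_{pr} + \sigma^2_e) / k} \]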

Random vs. Fixed

  • Fixed raters: no interest in generalizing to other raters
  • Ten Hove, Jorgensen, and Van Der Ark (2024): Raters should rarely be treated as fixed
  • “Fixed” is sometimes confused with consistency in some R packages (e.g., psych::ICC)

Single Rating Consistency/Agreement

Type         Design   ICC (ST)    ICC (MW)
Consistency  Nested   ICC(1, 1)   ICC(1)
Agreement    Nested   ICC(1, 1)   ICC(1)
Consistency  Crossed  ICC(3, 1)   ICC(C, 1)
Agreement    Crossed  ICC(2, 1)   ICC(A, 1)

Average of \(k\) Ratings Consistency/Agreement

Type         Design   ICC (ST)    ICC (MW)
Consistency  Nested   ICC(1, k)   ICC(k)
Agreement    Nested   ICC(1, k)   ICC(k)
Consistency  Crossed  ICC(3, k)   ICC(C, k)
Agreement    Crossed  ICC(2, k)   ICC(A, k)
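
For crossed designs, one way to obtain these coefficients, together with confidence intervals, is psych::ICC(), which reports the ST-style ICC1-ICC3 and their average-rating counterparts ICC1k-ICC3k; a sketch, assuming the wide-format dat_9_1 entered earlier:

# Rows = persons, columns = raters
psych::ICC(dat_9_1[, c("R1", "R2", "R3")])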

Analysis

Wide to long format

library(dplyr)
dat_long <- dat_9_1 |>
    tidyr::pivot_longer(-Person, names_to = "Rater", values_to = "Score")
Person Rater Score
1 R1 1
1 R2 1
1 R3 1
2 R1 2
2 R2 1
2 R3 3
3 R1 2
3 R2 2
3 R3 3
4 R1 3

Variance Component Estimation

Full mixed-effect model

\[ Y_{ij} = \mu + P_i + R_j + (P \times R)_{ij} + e_{ij} \]

translates to the R formula

y ~ 1 + (1 | P) + (1 | R) + (1 | P:R)
library(lme4)
# If nested design:
# lmer(Score ~ (1 | Person), data = dat_long)
# If crossed design:
m1 <- lmer(Score ~ (1 | Person) + (1 | Rater), data = dat_long)
summary(m1)
Linear mixed model fit by REML ['lmerMod']
Formula: Score ~ (1 | Person) + (1 | Rater)
   Data: dat_long

REML criterion at convergence: 73.5

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-1.5147 -0.5966  0.1211  0.4753  1.2630 

Random effects:
 Groups   Name        Variance Std.Dev.
 Person   (Intercept) 1.5704   1.2531  
 Rater    (Intercept) 0.1889   0.4346  
 Residual             0.2111   0.4595  
Number of obs: 30, groups:  Person, 10; Rater, 3

Fixed effects:
            Estimate Std. Error t value
(Intercept)   3.3000     0.4765   6.926
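
The variance components can be extracted from the fitted model and plugged into the ICC formulas; a minimal sketch for the single-rating versions (the exercise below extends this to the average of 3 raters):

# Pull variance components from the lme4 fit
vc <- as.data.frame(VarCorr(m1))
v_p <- vc$vcov[vc$grp == "Person"]
v_r <- vc$vcov[vc$grp == "Rater"]
v_e <- vc$vcov[vc$grp == "Residual"]
# Single-rating ICCs for a crossed design
v_p / (v_p + v_e)        # consistency: ICC(3, 1) / ICC(C, 1)
v_p / (v_p + v_r + v_e)  # agreement: ICC(2, 1) / ICC(A, 1)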

Exercise

From the R output, compute ICC for agreement when averaging across 3 raters

See R notes for another example

References

McGraw, Kenneth O., and S. P. Wong. 1996. “Forming Inferences about Some Intraclass Correlation Coefficients.” Psychological Methods 1 (1): 30–46. https://doi.org/10.1037/1082-989X.1.1.30.
Shrout, Patrick E., and Joseph L. Fleiss. 1979. “Intraclass Correlations: Uses in Assessing Rater Reliability.” Psychological Bulletin 86 (2): 420–28. https://doi.org/10.1037/0033-2909.86.2.420.
Ten Hove, Debby, Terrence D. Jorgensen, and L. Andries Van Der Ark. 2024. “Updated Guidelines on Selecting an Intraclass Correlation Coefficient for Interrater Reliability, with Applications to Incomplete Observational Designs.” Psychological Methods 29 (5): 967–79. https://doi.org/10.1037/met0000516.