Intraclass Correlation and Agreement

PSYC 520

Learning Objectives

  • Distinguish between reliability and agreement
  • Articulate the limitations of Cohen’s \(\kappa\) for interrater agreement
  • Choose the right intraclass correlation (ICC) formula for different study designs and decisions
  • Obtain confidence intervals for ICCs

Rater Data

Person R1 R2 R3
1 1 1 1
2 2 1 3
3 2 2 3
4 3 3 4
5 3 2 4
6 3 3 4
7 4 4 4
8 4 3 5
9 5 5 5
10 5 5 5
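
For the examples that follow, the rating table can be entered in R as below (a sketch; the object name dat_9_1 is taken from the analysis code later in this handout):

# Ratings from the table above, in wide format (one row per person)
dat_9_1 <- data.frame(
    Person = 1:10,
    R1 = c(1, 2, 2, 3, 3, 3, 4, 4, 5, 5),
    R2 = c(1, 1, 2, 3, 2, 3, 4, 3, 5, 5),
    R3 = c(1, 3, 3, 4, 4, 4, 4, 5, 5, 5)
)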

Repeated Measures

  • Concepts from interrater reliability have also been applied to repeated measures

    • R1, R2, R3 → T1, T2, T3
  • Important assumption: the construct does not change across the time points

  • One can also consider R1, R2, R3 as items

Two-Way Contingency Table

Rows: R1 ratings; columns: R2 ratings

        1   2   3   4   5   Sum
1       1   0   0   0   0     1
2       1   1   0   0   0     2
3       0   1   2   0   0     3
4       0   0   1   1   0     2
5       0   0   0   0   2     2
Sum     2   2   3   1   2    10

Nominal and Chance Agreement

Nominal agreement: 1 + 1 + 2 + 1 + 2 = 7 (sum of diagonal)

Chance agreement (expected number of agreements by chance): \(\frac{1}{N} \sum_i n_{i+} n_{+i}\)

  • \(n_{i+}\): row \(i\) total (R1)
  • \(n_{+i}\): column \(i\) total (R2)

Example

  • R1 gave a 4 to 2 out of 10 ratees (\(n_{4+} = 2\));
    • i.e., R1 has a 20% chance of giving a 4
  • R2 gave a 4 to 1 out of 10 ratees (\(n_{+4} = 1\))
  • Among the 1 person R2 rated as a 4, by chance 1 × 0.2 = 0.2 of them would also be rated as a 4 by R1
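
A quick check of these quantities in R, using the R1 and R2 columns of dat_9_1 entered earlier (a sketch):

tab <- table(R1 = dat_9_1$R1, R2 = dat_9_1$R2)
# Nominal agreement: number of ratees on the diagonal
sum(diag(tab))
# Chance agreement: (1 / N) * sum of (row total x column total)
sum(rowSums(tab) * colSums(tab)) / sum(tab)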

Cohen’s \(\kappa\)

\[ \kappa = \frac{p_o - p_e}{1 - p_e} \]

  • \(p_o\): observed proportion agreement
  • \(p_e\): expected proportion agreement by chance
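
Continuing the sketch above, the observed and chance proportions of agreement, and hence \(\kappa\), can be computed as:

n <- sum(tab)
p_o <- sum(diag(tab)) / n                       # observed proportion agreement
p_e <- sum(rowSums(tab) * colSums(tab)) / n^2   # chance proportion agreement
(p_o - p_e) / (1 - p_e)                         # Cohen's kappa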

Thinking

  • What is the \(\kappa\) for the table?
  • What does the denominator represent?

Criticism of \(\kappa\)

  • Only for two raters
  • Sensitive to marginal distributions
    • A small or negative \(\kappa\) may be due to high prevalence of one category
    • \(\kappa\) can behave counterintuitively when marginal distributions differ substantially across raters
  • Designed for nominal data
    • Weighted \(\kappa\) for ordinal data (see the sketch below)
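
For reference, psych::cohen.kappa() computes both unweighted and weighted \(\kappa\) (the latter suited to ordinal categories), along with confidence bounds; a minimal sketch using R1 and R2:

# Two-column matrix of ratings (one row per ratee);
# a square contingency table can also be supplied
psych::cohen.kappa(cbind(dat_9_1$R1, dat_9_1$R2))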

Agreement vs. Reliability

  • Agreement: extent to which raters give the same score
  • Reliability: extent to which raters give consistent scores
R4  R5
 1   3
 2   4
 2   4
 3   5
 3   5
  • Agreement = 0 (the raters never give the same score)
  • Reliability = 1 (the rank order is identical; see the R check below)
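
A quick verification of this example in R (a sketch; r4 and r5 are simply the two columns above):

r4 <- c(1, 2, 2, 3, 3)
r5 <- c(3, 4, 4, 5, 5)
mean(r4 == r5)   # proportion of exact agreement: 0
cor(r4, r5)      # correlation (consistency): 1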

Intraclass Correlation (ICC)

Key references

  • Shrout and Fleiss (1979) (ST)
  • McGraw and Wong (1996) (MW)
  • Ten Hove, Jorgensen, and Van Der Ark (2024)

Terminology

  • One-way (nested) vs. two-way (crossed) designs
  • Consistency vs. absolute agreement
  • Single vs. average ratings
  • Fixed vs. Random?

See Table 9.5 of the textbook

Variance Decomposition

\[ \mathrm{Var}(Y) = \underbrace{\sigma^2_p}_{\text{Person}} + \underbrace{\sigma^2_r}_{\text{Rater}} + \underbrace{\sigma^2_{pr}}_{\text{Person $\times$ Rater}} + \underbrace{\sigma^2_e}_{\text{Error}} \]

  • To estimate \(\sigma^2_r\), one needs multiple observations per rater

  • To estimate \(\sigma^2_{pr}\), one needs multiple observations per combination of person and rater

Nested Design

  • Different raters for each person
  • \(\sigma^2_r\), \(\sigma^2_{pr}\), \(\sigma^2_e\) cannot be separated

Examples

  • Child behavior ratings by own parents
  • Individuals measured on different dates
  • Respondents (egos) reported by people in their network (alters), with no overlap in alters

Crossed Design

  • Same raters for each person
  • \(\sigma^2_{pr}\), \(\sigma^2_e\) cannot be separated, unless each rater rates each person two or more times
  • Partially crossed: different sets of raters for different persons, with some overlap

Consistency vs. Agreement

ICC: \(\dfrac{\sigma^2_p}{\sigma^2_p + \text{Error} / k}\), where \(k\) is the number of ratings averaged (\(k = 1\) for a single rating)

  • Consistency: raters are consistent if they rank-order persons in the same way
    • Error = \(\sigma^2_{pr} + \sigma^2_e\)
  • Agreement: raters give the same ratings
    • Error = \(\sigma^2_r + \sigma^2_{pr} + \sigma^2_e\)
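
Substituting these error terms into the general form gives, for example,

\[ \mathrm{ICC}(C, 1) = \frac{\sigma^2_p}{\sigma^2_p + \sigma^2_{pr} + \sigma^2_e}, \qquad \mathrm{ICC}(A, k) = \frac{\sigma^2_p}{\sigma^2_p + (\sigma^2_r + \sigma^2_{pr} + \sigma^2_e) / k} \]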

Random vs. Fixed

  • Fixed raters: no interest in generalizing to other raters
  • Ten Hove, Jorgensen, and Van Der Ark (2024): Raters should rarely be treated as fixed
  • “Fixed” is sometimes confused with consistency in some R packages (e.g., psych::ICC)

Single Rating Consistency/Agreement

Type         Design   ICC (ST)    ICC (MW)
Consistency  Nested   ICC(1, 1)   ICC(1)
Agreement    Nested   ICC(1, 1)   ICC(1)
Consistency  Crossed  ICC(3, 1)   ICC(C, 1)
Agreement    Crossed  ICC(2, 1)   ICC(A, 1)

Average of \(k\) Ratings Consistency/Agreement

Type         Design   ICC (ST)    ICC (MW)
Consistency  Nested   ICC(1, k)   ICC(k)
Agreement    Nested   ICC(1, k)   ICC(k)
Consistency  Crossed  ICC(3, k)   ICC(C, k)
Agreement    Crossed  ICC(2, k)   ICC(A, k)
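
For crossed designs, one way to obtain these coefficients, together with confidence intervals, is psych::ICC(), which reports the ST-style ICC1-ICC3 and their average-rating counterparts ICC1k-ICC3k; a sketch, assuming the wide-format dat_9_1 entered earlier:

# Rows = persons, columns = raters
psych::ICC(dat_9_1[, c("R1", "R2", "R3")])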

Analysis

Wide to long format

library(dplyr)
dat_long <- dat_9_1 |>
    tidyr::pivot_longer(-Person, names_to = "Rater", values_to = "Score")
Person Rater Score
1 R1 1
1 R2 1
1 R3 1
2 R1 2
2 R2 1
2 R3 3
3 R1 2
3 R2 2
3 R3 3
4 R1 3

Variance Component Estimation

Full mixed-effect model

\[ Y_{ij} = \mu + P_i + R_j + (P \times R)_{ij} + e_{ij} \]

translates to the R formula

y ~ 1 + (1 | P) + (1 | R) + (1 | P:R)
library(lme4)
# If nested design:
# lmer(Score ~ (1 | Person), data = dat_long)
# If crossed design:
m1 <- lmer(Score ~ (1 | Person) + (1 | Rater), data = dat_long)
summary(m1)
Linear mixed model fit by REML ['lmerMod']
Formula: Score ~ (1 | Person) + (1 | Rater)
   Data: dat_long

REML criterion at convergence: 73.5

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-1.5147 -0.5966  0.1211  0.4753  1.2630 

Random effects:
 Groups   Name        Variance Std.Dev.
 Person   (Intercept) 1.5704   1.2531  
 Rater    (Intercept) 0.1889   0.4346  
 Residual             0.2111   0.4595  
Number of obs: 30, groups:  Person, 10; Rater, 3

Fixed effects:
            Estimate Std. Error t value
(Intercept)   3.3000     0.4765   6.926
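
The variance components can be extracted from the fitted model and plugged into the ICC formulas; a minimal sketch for the single-rating versions (the exercise below extends this to the average of 3 raters):

# Pull variance components from the lme4 fit
vc <- as.data.frame(VarCorr(m1))
v_p <- vc$vcov[vc$grp == "Person"]
v_r <- vc$vcov[vc$grp == "Rater"]
v_e <- vc$vcov[vc$grp == "Residual"]
# Single-rating ICCs for a crossed design
v_p / (v_p + v_e)        # consistency: ICC(3, 1) / ICC(C, 1)
v_p / (v_p + v_r + v_e)  # agreement: ICC(2, 1) / ICC(A, 1)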

Exercise

From the R output, compute ICC for agreement when averaging across 3 raters

See R notes for another example

References

McGraw, Kenneth O., and S. P. Wong. 1996. “Forming Inferences about Some Intraclass Correlation Coefficients.” Psychological Methods 1 (1): 30–46. https://doi.org/10.1037/1082-989X.1.1.30.
Shrout, Patrick E., and Joseph L. Fleiss. 1979. “Intraclass Correlations: Uses in Assessing Rater Reliability.” Psychological Bulletin 86 (2): 420–28. https://doi.org/10.1037/0033-2909.86.2.420.
Ten Hove, Debby, Terrence D. Jorgensen, and L. Andries Van Der Ark. 2024. “Updated Guidelines on Selecting an Intraclass Correlation Coefficient for Interrater Reliability, with Applications to Incomplete Observational Designs.” Psychological Methods 29 (5): 967–79. https://doi.org/10.1037/met0000516.