# Define a function for formatting numbers
comma <- function(x, d = 2) format(x, digits = d, big.mark = ",")
Interrater Reliability
library(psych)
library(lme4)
library(tidyverse)
library(plotly)
Data analyzed in this paper: https://www.mdpi.com/2076-3425/12/8/1011/
slp_dat <- read.csv("https://osf.io/download/p9gqk/")
slp_vas_wide <- slp_dat |>
  select(slpID, Speaker, slp_VAS) |>
  pivot_wider(names_from = slpID, values_from = slp_VAS)
head(slp_vas_wide)
Plot
p <- ggplot(slp_dat, aes(x = slpID, y = slp_VAS, group = Speaker, color = Speaker)) +
  geom_line(alpha = 0.5) +
  labs(x = "Rater", y = "VAS Score") +
  theme_bw()
ggplotly(p)
For Two Raters
We’ll first select just the first two raters.
slp_2rater <- slp_vas_wide |>
  select(slp14, slp15)
Nominal Agreement
To compute nominal agreement, we count a pair of ratings as an agreement only when the two raters give exactly the same score. If we go by that definition, we have
with(slp_2rater, mean(slp14 == slp15))
[1] 0.1
If we instead relax the definition a little and count an agreement whenever the two scores are the same after rounding (here, dividing the VAS score by 10 and rounding to the nearest integer, giving 11 categories from 0 to 10),
slp_2round <- round(slp_2rater / 10)
slp_2round <- lapply(slp_2round, FUN = factor, levels = 0:10)
# Contingency table
table(slp_2round)
slp15
slp14 0 1 2 3 4 5 6 7 8 9 10
0 0 1 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 0
5 0 0 0 0 0 0 1 0 0 0 0
6 0 0 0 0 0 1 0 0 0 0 2
7 0 0 0 0 0 0 0 1 0 0 1
8 0 0 0 0 0 0 0 0 0 2 0
9 0 0 0 0 0 0 0 0 1 0 2
10 0 0 0 0 0 0 0 1 0 0 7
p0 <- with(slp_2round, mean(slp14 == slp15))
p0
[1] 0.4
Cohen’s Kappa
\[ \kappa = \frac{P_0 - P_c}{1 - P_c} \]
\[ P_c = \frac{1}{N^2} \sum_{i = 1}^c (n_{i+}) (n_{+i}) \]
slp_2tab <- table(slp_2round)
pc <- sum(colSums(slp_2tab) * rowSums(slp_2tab)) / sum(slp_2tab)^2
(kappa <- (p0 - pc) / (1 - pc))
[1] 0.1666667
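As a sanity check, kappa can also be obtained from psych::cohen.kappa(), which takes a square table of counts and additionally reports a weighted kappa; this sketch is not part of the original analysis:
# Sketch: reproduce the hand computation with psych (psych is already loaded above)
psych::cohen.kappa(as.matrix(slp_2tab))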
Drawback: kappa tends to be small when the scores are unevenly distributed across categories (e.g., most scores fall into just a few categories), which is clearly the case above.
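To see why, consider a toy example with made-up counts (not from these data): two raters who agree on 90% of cases can still get a kappa near zero when almost all ratings fall into one category, because chance agreement \(P_c\) is already very high.
# Made-up 2 x 2 counts: the raters agree on 90 of 100 cases, almost all in one category
toy_tab <- matrix(c(90, 5, 5, 0), nrow = 2,
                  dimnames = list(raterA = c("typical", "atypical"),
                                  raterB = c("typical", "atypical")))
toy_p0 <- sum(diag(toy_tab)) / sum(toy_tab)
toy_pc <- sum(colSums(toy_tab) * rowSums(toy_tab)) / sum(toy_tab)^2
c(p0 = toy_p0, pc = toy_pc, kappa = (toy_p0 - toy_pc) / (1 - toy_pc))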
Multiple Raters
Coefficient \(\alpha\)
Treat each rater as an “item.”
psych::alpha(slp_vas_wide[-1])
Number of categories should be increased in order to count frequencies.
Warning in cor.smooth(r): Matrix was not positive definite, smoothing was done
In smc, smcs > 1 were set to 1.0
In smc, smcs < 0 were set to .0
Reliability analysis
Call: psych::alpha(x = slp_vas_wide[-1])
raw_alpha std.alpha G6(smc) average_r S/N ase mean sd median_r
0.98 0.98 1 0.69 46 0.0076 62 25 0.7
95% confidence boundaries
lower alpha upper
Feldt 0.96 0.98 0.99
Duhachek 0.96 0.98 0.99
Reliability if an item is dropped:
raw_alpha std.alpha G6(smc) average_r S/N var.r med.r
slp10 0.97 0.98 0.99 0.68 43 0.014 0.70
slp11 0.97 0.98 0.99 0.68 43 0.014 0.70
slp13 0.97 0.98 0.99 0.69 44 0.015 0.70
slp14 0.97 0.98 0.99 0.69 44 0.015 0.70
slp15 0.97 0.98 0.99 0.69 44 0.014 0.70
slp16 0.98 0.98 0.99 0.70 46 0.014 0.71
slp17 0.97 0.98 0.99 0.68 43 0.015 0.69
slp18 0.97 0.98 0.99 0.69 45 0.013 0.70
slp19 0.97 0.98 0.99 0.69 45 0.014 0.71
slp2 0.97 0.98 1.00 0.69 45 0.014 0.71
slp20 0.97 0.98 0.99 0.68 43 0.015 0.69
slp21 0.97 0.98 0.99 0.68 43 0.014 0.69
slp22 0.97 0.98 0.99 0.69 44 0.015 0.70
slp23 0.97 0.98 0.99 0.68 43 0.015 0.69
slp25 0.97 0.98 0.99 0.69 45 0.014 0.71
slp3 0.98 0.98 0.99 0.70 46 0.014 0.71
slp4 0.97 0.98 0.99 0.69 45 0.014 0.71
slp5 0.97 0.98 0.99 0.69 45 0.015 0.70
slp6 0.98 0.98 0.99 0.70 47 0.010 0.71
slp8 0.97 0.98 0.99 0.69 44 0.015 0.70
slp9 0.97 0.98 1.00 0.68 43 0.014 0.69
Item statistics
n raw.r std.r r.cor r.drop mean sd
slp10 18 0.89 0.89 0.90 0.86 81 28
slp11 18 0.89 0.90 0.90 0.90 69 33
slp13 19 0.86 0.86 0.85 0.86 54 33
slp14 20 0.82 0.84 0.84 0.81 80 25
slp15 20 0.81 0.83 0.83 0.80 85 22
slp16 20 0.77 0.76 0.76 0.73 50 42
slp17 20 0.90 0.91 0.91 0.90 63 34
slp18 20 0.82 0.80 0.77 0.79 70 30
slp19 19 0.83 0.82 0.82 0.81 56 25
slp2 20 0.83 0.82 0.82 0.80 64 32
slp20 20 0.90 0.90 0.90 0.89 49 25
slp21 19 0.91 0.91 0.90 0.91 59 25
slp22 20 0.86 0.86 0.86 0.84 50 27
slp23 20 0.89 0.89 0.89 0.86 71 32
slp25 20 0.79 0.79 0.79 0.77 79 28
slp3 19 0.76 0.75 0.76 0.74 52 40
slp4 20 0.79 0.79 0.78 0.77 47 27
slp5 19 0.82 0.82 0.82 0.81 43 23
slp6 13 0.80 0.71 0.71 0.54 76 23
slp8 20 0.86 0.87 0.87 0.86 53 31
slp9 19 0.89 0.90 0.90 0.90 58 23
Intraclass Correlation
One-way ANOVA for nested design
Let’s say we have data where each person is rated by two different raters:
slp_vas_nested <- slp_dat |>
  mutate(SpeakerID = as.numeric(as.factor(Speaker))) |>
  # Select only 10 speakers
  filter(SpeakerID <= 10) |>
  group_by(Speaker) |>
  # Filter specific raters
  filter(row_number() %in% (SpeakerID[1] * 2 - (1:0)))
We have a design with raters nested within ratees. With this design, we cannot distinguish the rater effect from random error. We can now run a one-way random-effects ANOVA, which is the same as a random-intercept multilevel model:
m1 <- lmer(slp_VAS ~ 1 + (1 | Speaker), data = slp_vas_nested)
summary(m1)
Linear mixed model fit by REML ['lmerMod']
Formula: slp_VAS ~ 1 + (1 | Speaker)
Data: slp_vas_nested
REML criterion at convergence: 179.6
Scaled residuals:
Min 1Q Median 3Q Max
-1.50049 -0.77609 -0.05948 0.87584 1.35589
Random effects:
Groups Name Variance Std.Dev.
Speaker (Intercept) 308.9 17.58
Residual 816.8 28.58
Number of obs: 19, groups: Speaker, 10
Fixed effects:
Estimate Std. Error t value
(Intercept) 53.613 8.627 6.214
# Extract variance components (Ratee, error)
vc_m1 <- as.data.frame(VarCorr(m1))
ICC for single rating: \(\frac{\sigma^2_\text{ratee}}{\sigma^2_\text{ratee} + \sigma^2_E}\)
vc_m1$vcov[1] / (vc_m1$vcov[1] + vc_m1$vcov[2])
[1] 0.2744076
ICC for average rating: \(\frac{\sigma^2_\text{ratee}}{\sigma^2_\text{ratee} + \sigma^2_E / k}\), where \(k\) is the number of raters per ratee
vc_m1$vcov[1] / (vc_m1$vcov[1] + vc_m1$vcov[2] / 2)
[1] 0.4306434
Two-way ANOVA
For consistency (relative decisions), the rater effect is not counted as error, because rater bias applies to everyone and does not change the rank order of the ratees.
For agreement (absolute decisions), the rater effect is counted as error because it changes the absolute scores.
m2 <- lmer(slp_VAS ~ 1 + (1 | Speaker) + (1 | slpID), data = slp_dat)
summary(m2)
Linear mixed model fit by REML ['lmerMod']
Formula: slp_VAS ~ 1 + (1 | Speaker) + (1 | slpID)
Data: slp_dat
REML criterion at convergence: 3544.1
Scaled residuals:
Min 1Q Median 3Q Max
-3.3281 -0.5813 0.0862 0.6611 2.5747
Random effects:
Groups Name Variance Std.Dev.
slpID (Intercept) 132.8 11.53
Speaker (Intercept) 585.7 24.20
Residual 291.2 17.07
Number of obs: 403, groups: slpID, 21; Speaker, 20
Fixed effects:
Estimate Std. Error t value
(Intercept) 61.882 6.028 10.27
vc_m2 <- as.data.frame(VarCorr(m2))
ICC for single rating:
- Agreement: \(\frac{\sigma^2_\text{ratee}}{\sigma^2_\text{ratee} + \sigma^2_\text{rater} + \sigma^2_E}\)
# Note: ratee is in position 2, but may be different in different data sets
vc_m2$vcov[2] / (vc_m2$vcov[1] + vc_m2$vcov[2] + vc_m2$vcov[3])
[1] 0.5800175
- Consistency: \(\frac{\sigma^2_\text{ratee}}{\sigma^2_\text{ratee} + \sigma^2_E}\)
# Note: ratee is in position 2, but may be different in different data sets
vc_m2$vcov[2] / (vc_m2$vcov[2] + vc_m2$vcov[3])
[1] 0.6678742
ICC for average rating:
- Agreement: \(\frac{\sigma^2_\text{ratee}}{\sigma^2_\text{ratee} + (\sigma^2_\text{rater} + \sigma^2_E) / k}\)
# Note: ratee is in position 2, but may be different in different data sets
vc_m2$vcov[2] / (vc_m2$vcov[2] + (vc_m2$vcov[1] + vc_m2$vcov[3]) / 21)
[1] 0.966669
- Consistency: \(\frac{\sigma^2_\text{ratee}}{\sigma^2_\text{ratee} + \sigma^2_E / k}\)
vc_m2$vcov[2] / (vc_m2$vcov[2] + vc_m2$vcov[3] / 21)
[1] 0.9768674
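For convenience, the four model-based coefficients can also be collected into one named vector (a sketch reusing vc_m2 from above; the names are ad hoc), which makes them easy to compare against the psych::ICC() output further below:
# Sketch: the four two-way ICCs from the variance components in vc_m2
c(single_agreement    = vc_m2$vcov[2] / sum(vc_m2$vcov),
  single_consistency  = vc_m2$vcov[2] / (vc_m2$vcov[2] + vc_m2$vcov[3]),
  average_agreement   = vc_m2$vcov[2] / (vc_m2$vcov[2] + (vc_m2$vcov[1] + vc_m2$vcov[3]) / 21),
  average_consistency = vc_m2$vcov[2] / (vc_m2$vcov[2] + vc_m2$vcov[3] / 21))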
Q1
In practice, for someone to be seen by a pathologist, which coefficient from the above is relevant? And what is its value?
Answer:
Q2
While it is highly unlikely that someone could obtain opinions from \(k\) = 21 pathologists and take their average to inform their dysarthria status, how many pathologists would be needed to achieve an interrater agreement of at least .90?
Hint: change the value of \(k\) in the formula to get updated numbers.
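One way to follow the hint is to wrap the average-rating agreement formula into a small helper function of \(k\) (a sketch reusing vc_m2 from above; the function name is made up):
# Agreement ICC for the average of k raters, using the variance components in vc_m2
avg_agreement_icc <- function(k) {
  vc_m2$vcov[2] / (vc_m2$vcov[2] + (vc_m2$vcov[1] + vc_m2$vcov[3]) / k)
}
# Try a few candidate numbers of raters, e.g.:
avg_agreement_icc(c(2, 5, 10))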
Answer:
You can use the psych::ICC() function for these calculations; it requires wide-format data.
psych::ICC(slp_vas_wide[-1])
Call: psych::ICC(x = slp_vas_wide[-1])
Intraclass correlation coefficients
type ICC F df1 df2 p lower bound upper bound
Single_raters_absolute ICC1 0.58 30 19 400 1.6e-64 0.43 0.75
Single_random_raters ICC2 0.58 43 19 380 1.0e-82 0.43 0.75
Single_fixed_raters ICC3 0.67 43 19 380 1.0e-82 0.53 0.81
Average_raters_absolute ICC1k 0.97 30 19 400 1.6e-64 0.94 0.98
Average_random_raters ICC2k 0.97 43 19 380 1.0e-82 0.94 0.98
Average_fixed_raters ICC3k 0.98 43 19 380 1.0e-82 0.96 0.99
Number of subjects = 20 Number of Judges = 21
See the help file for a discussion of the other 4 McGraw and Wong estimates,
Be aware that the terminology in the above output may cause confusion. Here is the translation:
- Single_raters_absolute, ICC(1, 1): ICC for single rating for one-way ANOVA (assuming each ratee is rated by a different set of raters)
- Average_raters_absolute, ICC(1, k): ICC for average rating for one-way ANOVA (assuming each ratee is rated by a different set of raters)
- Single_random_raters, ICC(2, 1): ICC for interrater agreement for single rating for two-way ANOVA (“random” here really means agreement)
- Average_random_raters, ICC(2, k): ICC for interrater agreement for average rating for two-way ANOVA (“random” here really means agreement)
- Single_fixed_raters, ICC(3, 1): ICC for interrater consistency for single rating for two-way ANOVA (“fixed” here really means consistency)
- Average_fixed_raters, ICC(3, k): ICC for interrater consistency for average rating for two-way ANOVA (“fixed” here really means consistency)
For this example, ICC(1, 1) and ICC(1, k) are not relevant, as it is a crossed (not nested) design.
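If you want these numbers programmatically rather than reading them off the printed table, a sketch like the following works, assuming the object returned by psych::ICC() stores the table in its results element (as in recent versions of psych):
# Sketch: extract single-rating agreement (ICC2) and average-rating consistency (ICC3k)
icc_res <- psych::ICC(slp_vas_wide[-1])
subset(icc_res$results, type %in% c("ICC2", "ICC3k"))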