Interrater Reliability

# Define a function for formatting numbers
comma <- function(x, d = 2) format(x, digits = d, big.mark = ",")
library(psych)
library(lme4)
library(tidyverse)

The data analyzed here come from this paper: https://www.mdpi.com/2076-3425/12/8/1011/

slp_dat <- read.csv("https://osf.io/download/p9gqk/")
slp_vas_wide <- slp_dat |>
    select(slpID, Speaker, slp_VAS) |>
    pivot_wider(names_from = slpID, values_from = slp_VAS)
head(slp_vas_wide)

For Two Raters

We’ll first select just the first two raters.

slp_2rater <- slp_vas_wide |>
    select(slp14, slp15)

Nominal Agreement

To compute nominal agreement under the strictest definition, we count two raters as agreeing only when their ratings are exactly the same. By that definition, we have

with(slp_2rater, mean(slp14 == slp15))
[1] 0.1

If we instead relax the definition a little and say that agreement is reached when the ratings match after rescaling to an 11-point scale (dividing by 10 and rounding):

slp_2round <- round(slp_2rater / 10)
# Convert to factors so all 11 categories (0-10) appear in the table
slp_2round <- lapply(slp_2round, FUN = factor, levels = 0:10)
# Contingency table
table(slp_2round)
     slp15
slp14 0 1 2 3 4 5 6 7 8 9 10
   0  0 1 0 0 0 0 0 0 0 0  0
   1  0 0 0 0 0 0 0 0 0 0  0
   2  0 0 0 0 0 0 0 0 0 0  0
   3  0 0 0 0 0 0 0 0 0 0  0
   4  0 0 0 0 0 0 0 0 0 0  0
   5  0 0 0 0 0 0 1 0 0 0  0
   6  0 0 0 0 0 1 0 0 0 0  2
   7  0 0 0 0 0 0 0 1 0 0  1
   8  0 0 0 0 0 0 0 0 0 2  0
   9  0 0 0 0 0 0 0 0 1 0  2
   10 0 0 0 0 0 0 0 1 0 0  7
p0 <- with(slp_2round, mean(slp14 == slp15))
p0
[1] 0.4
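
A further variant (hypothetical; not part of the original analysis) is tolerance-based agreement, counting two ratings as agreeing when they differ by no more than some margin, say 10 points on the original 0–100 scale:

with(slp_2rater, mean(abs(slp14 - slp15) <= 10))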

Cohen’s Kappa

\[\kappa = \frac{P_0 - P_c}{1 - P_c}\]

\[P_c = \frac{1}{N^2} \sum_{i = 1}^c (n_{i+}) (n_{+i})\]

slp_2tab <- table(slp_2round)
pc <- sum(colSums(slp_2tab) * rowSums(slp_2tab)) / sum(slp_2tab)^2
(kappa <- (p0 - pc) / (1 - pc))
[1] 0.1666667
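
As a check (not in the original), the same unweighted kappa should be recoverable with psych::cohen.kappa(), which accepts a square contingency table and additionally reports a weighted kappa with confidence bounds:

psych::cohen.kappa(slp_2tab)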

Drawback: Kappa tends to be small when scores are unevenly distributed (e.g., most scores fall in a few categories), which is certainly the case above.

Multiple Raters

Coefficient \(\alpha\)

Treat each rater as an “item.”

psych::alpha(slp_vas_wide[-1])
Number of categories should be increased  in order to count frequencies. 
Warning in cor.smooth(r): Matrix was not positive definite, smoothing was done
In smc, smcs > 1 were set to 1.0
In smc, smcs < 0 were set to .0
(... the two smc warnings above repeat many times; omitted ...)

Reliability analysis   
Call: psych::alpha(x = slp_vas_wide[-1])

  raw_alpha std.alpha G6(smc) average_r S/N    ase mean sd median_r
      0.98      0.98    0.99      0.68  45 0.0076   62 25      0.7

    95% confidence boundaries 
         lower alpha upper
Feldt     0.96  0.98  0.99
Duhachek  0.96  0.98  0.99

 Reliability if an item is dropped:
      raw_alpha std.alpha G6(smc) average_r S/N var.r med.r
slp10      0.97      0.98    0.98      0.68  42 0.019  0.70
slp11      0.97      0.98    1.00      0.68  42 0.018  0.70
slp13      0.97      0.98    0.99      0.68  42 0.019  0.70
slp14      0.97      0.98    0.99      0.68  43 0.019  0.71
slp15      0.97      0.98    1.00      0.68  43 0.019  0.71
slp16      0.98      0.98    0.99      0.69  45 0.018  0.71
slp17      0.97      0.98    0.99      0.68  42 0.019  0.69
slp18      0.97      0.98    1.00      0.69  44 0.017  0.71
slp19      0.97      0.98    0.99      0.68  43 0.019  0.71
slp2       0.97      0.98    0.99      0.68  43 0.018  0.71
slp20      0.97      0.98    0.99      0.68  42 0.019  0.69
slp21      0.97      0.98    0.99      0.68  42 0.018  0.69
slp22      0.97      0.98    0.99      0.68  43 0.020  0.69
slp23      0.97      0.98    0.99      0.68  42 0.019  0.69
slp25      0.97      0.98    0.99      0.69  44 0.018  0.71
slp3       0.98      0.98    0.99      0.69  44 0.019  0.71
slp4       0.97      0.98    0.99      0.69  44 0.018  0.71
slp5       0.97      0.98    0.99      0.68  43 0.019  0.70
slp6       0.98      0.98    0.99      0.71  48 0.011  0.71
slp8       0.97      0.98    0.99      0.68  42 0.020  0.70
slp9       0.97      0.98    0.99      0.68  42 0.018  0.69

 Item statistics 
       n raw.r std.r r.cor r.drop mean sd
slp10 18  0.89  0.88  0.88   0.86   81 28
slp11 18  0.89  0.91  0.92   0.90   69 33
slp13 19  0.86  0.88  0.88   0.86   54 33
slp14 20  0.82  0.84  0.84   0.81   80 25
slp15 20  0.81  0.82  0.81   0.80   85 22
slp16 20  0.77  0.76  0.76   0.73   50 42
slp17 20  0.90  0.91  0.91   0.90   63 34
slp18 20  0.82  0.81  0.78   0.79   70 30
slp19 19  0.83  0.83  0.83   0.81   56 25
slp2  20  0.83  0.82  0.82   0.80   64 32
slp20 20  0.90  0.91  0.91   0.89   49 25
slp21 19  0.91  0.91  0.91   0.91   59 25
slp22 20  0.86  0.86  0.85   0.84   50 27
slp23 20  0.89  0.88  0.88   0.86   71 32
slp25 20  0.79  0.79  0.79   0.77   79 28
slp3  19  0.76  0.76  0.76   0.74   52 40
slp4  20  0.79  0.80  0.78   0.77   47 27
slp5  19  0.82  0.82  0.81   0.81   43 23
slp6  13  0.80  0.57  0.57   0.54   76 23
slp8  20  0.86  0.88  0.88   0.86   53 31
slp9  19  0.89  0.91  0.90   0.90   58 23

Intraclass Correlation

Based on the ANOVA/mixed-effect model framework.

  • Independent variables: raters and ratees
  • Dependent variable: scores

Sources of variance:

  • Ratees (variance of interest)
  • Raters
  • Ratees \(\times\) raters interaction*
  • Random error*

The variances due to ratees \(\times\) raters interaction and random error can only be separated when each rater rates each ratee more than once, which is uncommon.

Reliability = ratee variance / (ratee variance + error variance)

ICC Designs

However, what is considered error depends on multiple considerations, including

  1. whether the design is crossed (i.e., all raters rate each ratee) or nested (different sets of raters rate each ratee);
  2. whether raters are considered fixed (not generalizable) or random (generalizable to a “population of raters”); this is only relevant when rater \(\times\) ratee interaction can be estimated;
  3. consistency (relative decision; only rank-ordering is of interest) vs. absolute agreement (absolute decision; actual score is of interest).

One-way ANOVA for nested design

Let’s say we have data where each person is rated by two different raters:

slp_vas_nested <- slp_dat |>
    mutate(SpeakerID = as.numeric(as.factor(Speaker))) |>
    # Select only 10 speakers
    filter(SpeakerID <= 10) |>
    group_by(Speaker) |>
    # Keep two ratings per speaker: rows 2j - 1 and 2j for speaker j
    filter(row_number() %in% (SpeakerID[1] * 2 - (1:0)))
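
As a quick check (added here; not part of the original code), each speaker should contribute up to two ratings; one rating is missing in these data, which is why the model below reports 19 observations:

with(slp_vas_nested, tapply(!is.na(slp_VAS), Speaker, sum))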

We have a design with raters nested within ratees. With this design, we cannot distinguish rater effect from random error. We can now run a one-way random-effect ANOVA, which is the same as a random-intercept multilevel model:

m1 <- lmer(slp_VAS ~ 1 + (1 | Speaker), data = slp_vas_nested)
summary(m1)
Linear mixed model fit by REML ['lmerMod']
Formula: slp_VAS ~ 1 + (1 | Speaker)
   Data: slp_vas_nested

REML criterion at convergence: 179.6

Scaled residuals: 
     Min       1Q   Median       3Q      Max 
-1.50049 -0.77609 -0.05948  0.87584  1.35589 

Random effects:
 Groups   Name        Variance Std.Dev.
 Speaker  (Intercept) 308.9    17.58   
 Residual             816.8    28.58   
Number of obs: 19, groups:  Speaker, 10

Fixed effects:
            Estimate Std. Error t value
(Intercept)   53.613      8.627   6.214
# Extract variance components (Ratee, error)
vc_m1 <- as.data.frame(VarCorr(m1))

ICC for single rating: \(\frac{\sigma^2_\text{ratee}}{\sigma^2_\text{ratee} + \sigma^2_E}\)

vc_m1$vcov[1] / (vc_m1$vcov[1] + vc_m1$vcov[2])
[1] 0.2744076

ICC for average rating: \(\frac{\sigma^2_\text{ratee}}{\sigma^2_\text{ratee} + \sigma^2_E / k}\), where \(k\) is the number of raters per ratee

vc_m1$vcov[1] / (vc_m1$vcov[1] + vc_m1$vcov[2] / 2)
[1] 0.4306434

Two-way ANOVA

For consistency (relative decisions), the rater effect is not error, because a rater's bias applies to everyone and does not change the rank order.

For agreement (absolute decisions), the rater effect is error, as it changes the absolute scores.

m2 <- lmer(slp_VAS ~ 1 + (1 | Speaker) + (1 | slpID), data = slp_dat)
summary(m2)
Linear mixed model fit by REML ['lmerMod']
Formula: slp_VAS ~ 1 + (1 | Speaker) + (1 | slpID)
   Data: slp_dat

REML criterion at convergence: 3544.1

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-3.3281 -0.5813  0.0862  0.6611  2.5747 

Random effects:
 Groups   Name        Variance Std.Dev.
 slpID    (Intercept) 132.8    11.53   
 Speaker  (Intercept) 585.7    24.20   
 Residual             291.2    17.07   
Number of obs: 403, groups:  slpID, 21; Speaker, 20

Fixed effects:
            Estimate Std. Error t value
(Intercept)   61.882      6.028   10.27
vc_m2 <- as.data.frame(VarCorr(m2))

ICC for single rating:

  • Agreement: \(\frac{\sigma^2_\text{ratee}}{\sigma^2_\text{ratee} + \sigma^2_\text{rater} + \sigma^2_E}\)
# Note: ratee is in position 2, but may be different in different data sets
vc_m2$vcov[2] / (vc_m2$vcov[1] + vc_m2$vcov[2] + vc_m2$vcov[3])
[1] 0.5800175
  • Consistency: \(\frac{\sigma^2_\text{ratee}}{\sigma^2_\text{ratee} + \sigma^2_E}\)
# Note: ratee is in position 2, but may be different in different data sets
vc_m2$vcov[2] / (vc_m2$vcov[2] + vc_m2$vcov[3])
[1] 0.6678742

ICC for average rating:

  • Agreement: \(\frac{\sigma^2_\text{ratee}}{\sigma^2_\text{ratee} + (\sigma^2_\text{rater} + \sigma^2_E) / k}\)
# Note: ratee is in position 2, but may be different in different data sets
vc_m2$vcov[2] / (vc_m2$vcov[2] + (vc_m2$vcov[1] + vc_m2$vcov[3]) / 21)
[1] 0.966669
  • Consistency: \(\frac{\sigma^2_\text{ratee}}{\sigma^2_\text{ratee} + \sigma^2_E / k}\)
vc_m2$vcov[2] / (vc_m2$vcov[2] + vc_m2$vcov[3] / 21)
[1] 0.9768674

You can use the psych::ICC() function for these calculations. This requires wide-format data.

psych::ICC(slp_vas_wide[-1])
Call: psych::ICC(x = slp_vas_wide[-1])

Intraclass correlation coefficients 
                         type  ICC  F df1 df2       p lower bound upper bound
Single_raters_absolute   ICC1 0.58 30  19 400 1.6e-64        0.43        0.75
Single_random_raters     ICC2 0.58 43  19 380 1.0e-82        0.43        0.75
Single_fixed_raters      ICC3 0.67 43  19 380 1.0e-82        0.53        0.81
Average_raters_absolute ICC1k 0.97 30  19 400 1.6e-64        0.94        0.98
Average_random_raters   ICC2k 0.97 43  19 380 1.0e-82        0.94        0.98
Average_fixed_raters    ICC3k 0.98 43  19 380 1.0e-82        0.96        0.99

 Number of subjects = 20     Number of Judges =  21
See the help file for a discussion of the other 4 McGraw and Wong estimates,

Be aware that the terminology in the above output may cause confusion. Here is the translation:

  • Single_raters_absolute, ICC(1, 1): ICC for single rating for one-way ANOVA (assuming each ratee is rated by a different set of raters)
  • Average_raters_absolute, ICC(1, k): ICC for average rating for one-way ANOVA (assuming each ratee is rated by a different set of raters)
  • Single_random_raters, ICC(2, 1): ICC for interrater agreement for single rating for two-way ANOVA (“random” here really means agreement)
  • Average_random_raters, ICC(2, k): ICC for interrater agreement for average rating for two-way ANOVA (“random” here really means agreement)
  • Single_fixed_raters, ICC(3, 1): ICC for interrater consistency for single rating for two-way ANOVA (“fixed” here really means consistency)
  • Average_fixed_raters, ICC(3, k): ICC for interrater consistency for average rating for two-way ANOVA (“fixed” here really means consistency)

For this example, ICC(1, 1) and ICC(1, k) are not relevant, as it is a crossed (not nested) design.
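
Note that the hand computations from the lmer() variance components above line up with this output: 0.58 for ICC2, 0.67 for ICC3, 0.97 for ICC2k, and 0.98 for ICC3k.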

Derivation

Each score consists of the grand mean plus a ratee effect plus error. For raters \(i = 1, \ldots, k\) and ratees \(j = 1, \ldots, N\),

\[Y_{ij} = \beta_j + e_{ij}\]

with \(\beta_j \sim N(\mu, \sigma^2_r)\) and \(e_{ij} \sim N(0, \sigma^2_e)\), where \(\sigma^2_r\) is the ratee (true-score) variance and \(\sigma^2_e\) is the error variance.

The mean rating for ratee \(j\), averaged over the \(k\) raters, is

\[\bar Y_{.j} = \frac{\sum_i Y_{ij}}{k} = \beta_j + \frac{\sum_i e_{ij}}{k}\]

The grand mean is

\[\bar Y_{..} = \frac{\sum_j \sum_i Y_{ij}}{kN} = \frac{\sum_j \beta_j}{N} + \frac{\sum_j \sum_i e_{ij}}{kN}\]

The sum of squares for the ratee effect is

\[ \begin{aligned} \mathit{SS}_B & = \sum_j k (\bar Y_{.j} - \bar Y_{..})^2 \\ & = \sum_j k \left(\beta_j + \frac{\sum_i e_{ij}}{k} - \frac{\sum_j \beta_j}{N} - \frac{\sum_j \sum_i e_{ij}}{kN}\right)^2 \\ & = \sum_j k [(\beta_j - \bar \beta) + (\bar e_{.j} - \bar e_{..})]^2 \end{aligned} \]

Taking the expected value, we have

\[ \begin{aligned} E\left[\sum_j (\beta_j - \bar \beta)^2\right] & = E\left[\sum_j (\beta_j - \mu + \mu - \bar \beta)^2\right] \\ & = E\left[\sum_j (\beta_j - \mu)^2\right] - 2 E\left[(\bar \beta - \mu) \sum_j (\beta_j - \mu)\right] + E\left[\sum_j (\bar \beta - \mu)^2\right] \\ & = N \sigma^2_r - 2 E[N (\bar \beta - \mu)^2] + E[N (\bar \beta - \mu)^2] \\ & = N \sigma^2_r - N \sigma^2_r / N \\ & = (N - 1) \sigma^2_r \end{aligned} \]

Similarly,

\[ E\left[\sum_j (\bar e_{.j} - \bar e_{..})^2\right] = (N - 1) \sigma^2_e / k \]

and given \(\beta_j\) and \(e_{ij}\) are independent, we have

\[ E(\mathit{SS}_B) = k (N - 1) \sigma^2_r + (N - 1) \sigma^2_e. \]

The mean square is the sum of squares divided by its degrees of freedom, which is \(N - 1\) for the ratee effect. Therefore, the expected mean square is

\[ E(\mathit{MS}_B) = k \sigma^2_r + \sigma^2_e. \]

For the error sum of squares, we have

\[ \begin{aligned} \mathit{SS}_E & = \sum_j \sum_i (Y_{ij} - \bar Y_{.j})^2 \\ & = \sum_j \sum_i \left(\beta_j + e_{ij} - \beta_j - \frac{\sum_i e_{ij}}{k}\right)^2 \\ & = \sum_j \sum_i (e_{ij} - \bar e_{.j})^2 \end{aligned} \]

And the expected sum of squares (and mean squares) error can be shown as

\[ \begin{aligned} E(\mathit{SS}_E) & = N (k - 1) \sigma^2_e \\ E(\mathit{MS}_E) & = \sigma^2_e \end{aligned} \]

Therefore, using the method of moments, the estimators for \(\sigma^2_e\) and \(\sigma^2_r\) are

\[ \begin{aligned} \tilde \sigma^2_e & = \mathit{MS}_E \\ \tilde \sigma^2_r & = (\mathit{MS}_B - \mathit{MS}_E) / k \end{aligned} \]

Using the above, you can obtain the ICC formulas for “one-way random” in Table 9.5 of your text.
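
As a minimal sketch of the method of moments in action (assuming the slp_dat object loaded above, and pretending the design is balanced, which it is only approximately), we can fit the one-way ANOVA with aov(), extract the mean squares, and back out the variance components:

aov1 <- aov(slp_VAS ~ factor(Speaker), data = slp_dat)
ms <- summary(aov1)[[1]][["Mean Sq"]]  # MS_B (between ratees), MS_E (error)
k <- 21  # raters per ratee (approximate; a few ratings are missing)
c(sigma2_e = ms[2], sigma2_r = (ms[1] - ms[2]) / k)

With missing ratings, these moment estimates will be close to, but not identical to, the REML estimates from lmer() used in “Demonstrating Analytic CIs in R” below.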

Confidence intervals

From standard ANOVA results (Cochran’s theorem), with a balanced design, one gets

\[ \begin{aligned} \mathit{SS}_E & \sim \sigma^2_e \chi^2_{N (k - 1)} \\ \mathit{SS}_B & \sim (k\sigma^2_r + \sigma^2_e) \chi^2_{N - 1} \end{aligned} \]

The two sums of squares are independently distributed, so

\[ \frac{\sigma^2_e}{k\sigma^2_r + \sigma^2_e}\frac{\mathit{MS}_B}{\mathit{MS}_E} \sim F_{N - 1, N (k - 1)} \]

Note that the distribution of \(\frac{\sigma^2_e}{k\sigma^2_r + \sigma^2_e}\frac{\mathit{MS}_B}{\mathit{MS}_E}\) does not depend on the parameters \(\sigma^2_r\) and \(\sigma^2_e\). Therefore, we can invert a two-tailed \(F\) test, collecting the values that are “non-significant,” to obtain a confidence interval. Let \(F_\text{lo}\) and \(F_\text{hi}\) be the \(\alpha / 2\) and \(1 - \alpha / 2\) quantiles of a central \(F\) distribution with df1 = \(N - 1\) and df2 = \(N (k - 1)\), and let \(F_0 = \mathit{MS}_B / \mathit{MS}_E\). Then, because ICC(1, \(k\)) for the average rating is

\[ \begin{aligned} \frac{\sigma^2_r}{\sigma^2_r + \sigma^2_e / k} & = \frac{k \sigma^2_r}{k \sigma^2_r + \sigma^2_e} \\ & = 1 - \frac{\sigma^2_e}{k \sigma^2_r + \sigma^2_e}, \end{aligned} \]

we have

\[ \frac{\sigma^2_e}{k\sigma^2_r + \sigma^2_e}\frac{\mathit{MS}_B}{\mathit{MS}_E} = [1 - \rho(1, k)]F_0 \sim F_{N - 1, N (k - 1)} \]

Define \(F = [1 - \rho(1, k)]F_0\), which follows an \(F\) distribution. Because \(\rho(F) = 1 - F / F_0\) is monotonically decreasing with \(F_0\) known, we can obtain a 95% CI by transforming \(F_\text{lo}\) and \(F_\text{hi}\), so an analytic 95% CI for ICC(1, \(k\)) is

\[ \left\{1 - \frac{F_\text{hi}}{F_0} \leq \rho(1, k) \leq 1 - \frac{F_\text{lo}}{F_0}\right\} \]

Similarly, for ICC(1, 1) = \(\sigma^2_r / (\sigma^2_r + \sigma^2_e)\), as \(F_0 / F - 1\) = \(k \sigma^2_r / \sigma^2_e\) = \(k \rho(1, 1) / [1 - \rho(1, 1)]\), we have

\[ \rho(F) = \frac{F_0 / F - 1}{F_0 / F - 1 + k}, \]

which is monotonically decreasing in \(F\). Therefore, we can again obtain a 95% CI by transforming \(F_\text{lo}\) and \(F_\text{hi}\):

\[ \left\{\frac{F_0 / F_\text{hi} - 1}{F_0 / F_\text{hi} - 1 + k} \leq \rho(1, 1) \leq \frac{F_0 / F_\text{lo} - 1}{F_0 / F_\text{lo} - 1 + k}\right\} \]

Tip

For 95% CIs with two-way crossed designs, see the following papers:

Fleiss, J. L., & Shrout, P. E. (1978). Approximate interval estimation for a certain intraclass correlation coefficient. Psychometrika, 43, 259–262. https://doi.org/10.1007/BF02293867

McGraw, K. O., & Wong, S. P. (1996). Forming inferences about some intraclass correlation coefficients. Psychological Methods, 1(1), 30–46. https://doi.org/10.1037/1082-989X.1.1.30

Shrout, P. E., & Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2), 420–428. https://doi.org/10.1037/0033-2909.86.2.420

See also this paper for some updated guidelines and recent development on ICC:

ten Hove, D., Jorgensen, T. D., & van der Ark, L. A. (2022). Updated guidelines on selecting an intraclass correlation coefficient for interrater reliability, with applications to incomplete observational designs. Psychological Methods. Advance online publication. https://doi.org/10.1037/met0000516

Demonstrating Analytic CIs in R

Let’s look at the results when we treat the raters as different for each ratee (incorrect for these data, but it illustrates the CI calculations):

m1_full <- lmer(slp_VAS ~ 1 + (1 | Speaker), data = slp_dat)
vc_m1f <- as.data.frame(VarCorr(m1_full))
# Obtain mean squares from the REML variance components
k <- 21
msr <- k * vc_m1f$vcov[1] + vc_m1f$vcov[2]  # MS_B (between ratees)
mse <- vc_m1f$vcov[2]                       # MS_E (error)
f0 <- msr / mse
# 95% CI (df1 = 19, df2 = 20 * 20 = 400)
# For ICC(1, k)
c("LL" = 1 - 1 / f0 * qf(.975, 19, 400),
  "UL" = 1 - 1 / f0 * qf(.025, 19, 400))
       LL        UL 
0.9410646 0.9844892 
# For ICC(1, 1)
c("LL" = (f0 / qf(.975, 19, 400) - 1) / (f0 / qf(.975, 19, 400) - 1 + k),
  "UL" = (f0 / qf(.025, 19, 400) - 1) / (f0 / qf(.025, 19, 400) - 1 + k))
       LL        UL 
0.4319368 0.7513950 
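
These analytic limits match (after rounding) the lower and upper bounds reported for ICC1 and ICC1k in the psych::ICC() output above.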

Bootstrap Confidence Intervals

Because the models we fitted are random-effect or mixed-effect models (i.e., multilevel models), we need a multilevel bootstrap. We can use the bootMer() function in lme4 for a parametric bootstrap; the bootmlm package provides alternative bootstrap methods that are useful when model assumptions are violated.

From my observation, the bootstrap CIs tend to be wider than the analytic CIs for ICCs, at least in this example; this pattern is not unique to this example.

One-Way Random

# One-way random-effect
icc_11 <- function(x) {
    # Function for ICC(1, 1)
    vc_x <- as.data.frame(VarCorr(x))
    vc_x$vcov[1] / sum(vc_x$vcov)
}
boo <- bootMer(m1_full, icc_11, nsim = 1999,
               # change to .progress = "txt" for progress bar
               .progress = "none")
boot::boot.ci(boo, index = 1, type = c("norm", "basic", "perc"))
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1999 bootstrap replicates

CALL : 
boot::boot.ci(boot.out = boo, type = c("norm", "basic", "perc"), 
    index = 1)

Intervals : 
Level      Normal              Basic              Percentile     
95%   ( 0.4310,  0.7592 )   ( 0.4486,  0.7729 )   ( 0.3857,  0.7101 )  
Calculations and Intervals on Original Scale

Two-Way Random

# Two-way random-effect
icc_2k <- function(x) {
    # Function for ICC(2, k), which is for agreement (rater variance counts as error)
    vc_x <- as.data.frame(VarCorr(x))
    vc_x$vcov[2] / (vc_x$vcov[2] + (vc_x$vcov[1] + vc_x$vcov[3]) / 21)
}
boo2 <- bootMer(m2, icc_2k, nsim = 1999,
                # change to .progress = "txt" for progress bar
                .progress = "none")
boot::boot.ci(boo2, index = 1, type = c("norm", "basic", "perc"))
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1999 bootstrap replicates

CALL : 
boot::boot.ci(boot.out = boo2, type = c("norm", "basic", "perc"), 
    index = 1)

Intervals : 
Level      Normal              Basic              Percentile     
95%   ( 0.9427,  0.9989 )   ( 0.9518,  1.0054 )   ( 0.9279,  0.9816 )  
Calculations and Intervals on Original Scale