Item Analysis

library(tidyverse)
library(modelsummary)  # for summarizing data
library(TAM)
library(sjPlot)  # for item difficulty and discrimination
library(psych)  # for factor analysis
library(performance)  # for item-total correlation
library(boot)  # for bootstrapping

Cognitive Items

We will use a sample data set from the TAM package, which contains responses from 143 students on 30 items. There are actually two versions of the data set: `data.mc$raw`, which contains the raw responses (i.e., the option each student selected), and `data.mc$scored`, which contains the scored responses (0 = incorrect, 1 = correct).

data(data.mc)
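
To see the two components described above, you can take a quick look at the structure of the list (output omitted here):

# data.mc is a list containing the raw choices and the scored (0/1) responses
str(data.mc, max.level = 1)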

Descriptives

mc_raw <- data.mc$raw  # get the raw choices portion
# Frequency table for the first two items
datasummary_skim(mc_raw[, 1:2], type = "categorical")
                 N       %
I01   9          2     1.4
      A        130    90.9
      B          4     2.8
      C          3     2.1
      D          3     2.1
      NA         1     0.7
I02   8          1     0.7
      9          1     0.7
      A          5     3.5
      B        126    88.1
      C          4     2.8
      D          5     3.5
      NA         1     0.7
Exercise 1

Which options do you think are likely to be the correct options for I10 and I20?

Here is the scored version:

mc_scored <- data.mc$scored  # get the scored responses
# Frequency table for the first two items
datasummary_skim(mc_scored, fmt = "%.2f")
Unique (#) Missing (%) Mean SD Min Median Max
I01 3 1 0.92 0.28 0.00 1.00 1.00
I02 3 1 0.89 0.32 0.00 1.00 1.00
I03 3 1 0.78 0.41 0.00 1.00 1.00
I04 3 1 0.91 0.28 0.00 1.00 1.00
I05 3 1 0.77 0.42 0.00 1.00 1.00
I06 3 1 0.88 0.33 0.00 1.00 1.00
I07 3 2 0.67 0.47 0.00 1.00 1.00
I08 3 2 0.63 0.48 0.00 1.00 1.00
I09 3 2 0.42 0.50 0.00 0.00 1.00
I10 3 2 0.22 0.42 0.00 0.00 1.00
I11 3 2 0.41 0.49 0.00 0.00 1.00
I12 3 3 0.45 0.50 0.00 0.00 1.00
I13 3 3 0.78 0.41 0.00 1.00 1.00
I14 3 3 0.41 0.49 0.00 0.00 1.00
I15 3 3 0.27 0.44 0.00 0.00 1.00
I16 3 3 0.60 0.49 0.00 1.00 1.00
I17 3 3 0.65 0.48 0.00 1.00 1.00
I18 3 5 0.61 0.49 0.00 1.00 1.00
I19 3 5 0.59 0.49 0.00 1.00 1.00
I20 3 6 0.67 0.47 0.00 1.00 1.00
I21 3 6 0.70 0.46 0.00 1.00 1.00
I22 3 6 0.30 0.46 0.00 0.00 1.00
I23 3 6 0.63 0.49 0.00 1.00 1.00
I24 3 8 0.51 0.50 0.00 1.00 1.00
I25 3 8 0.55 0.50 0.00 1.00 1.00
I26 3 8 0.36 0.48 0.00 0.00 1.00
I27 3 10 0.22 0.41 0.00 0.00 1.00
I28 3 10 0.38 0.49 0.00 0.00 1.00
I29 3 10 0.27 0.44 0.00 0.00 1.00
I30 3 11 0.61 0.49 0.00 1.00 1.00

Item Difficulty

This is the number of “1”s (correct responses) divided by the total number of responses. Note that “difficulty” is somewhat of a misnomer, as an item with a higher “difficulty” value is actually easier.

colSums(mc_scored, na.rm = TRUE) / nrow(mc_scored)
      I01       I02       I03       I04       I05       I06       I07       I08 
0.9090909 0.8811189 0.7762238 0.9020979 0.7622378 0.8671329 0.6573427 0.6153846 
      I09       I10       I11       I12       I13       I14       I15       I16 
0.4125874 0.2167832 0.3986014 0.4405594 0.7622378 0.3916084 0.2587413 0.5804196 
      I17       I18       I19       I20       I21       I22       I23       I24 
0.6293706 0.5804196 0.5594406 0.6293706 0.6573427 0.2867133 0.5874126 0.4685315 
      I25       I26       I27       I28       I29       I30 
0.5034965 0.3286713 0.1958042 0.3426573 0.2377622 0.5454545 
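
Note that the denominator above is the total number of students, so a missing response counts as incorrect. If you instead want the proportion correct among those who actually responded, one small variation is:

# Proportion correct among non-missing responses only
colMeans(mc_scored, na.rm = TRUE)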
Exercise 2

Which item is the most difficult? Which is the least difficult?

Answer:

Exercise 3

Observe the item means and the standard deviations. What is the relation between the mean and the standard deviation?

Answer:

Item Discrimination

The degree to which an item can distinguish between respondents with different levels of knowledge or skill, such that those with a higher level of knowledge or skill are more likely to answer the item correctly.

Internal criterion: Performance in the current test

# Create mean score
mc_scored$person_mean <- rowMeans(mc_scored[1:30], na.rm = TRUE)
# Top and bottom 27%
nrow(mc_scored) * .27  # approximately 39 people
[1] 38.61
# Corresponding scores
bounds <- sort(mc_scored$person_mean)[c(39, 143 - 39)]
mc_scored <- mc_scored |>
    mutate(group = case_when(
        person_mean > bounds[2] ~ "upper",
        person_mean <= bounds[1] ~ "lower"
    ))
# Plot the distributions
ggplot(mc_scored, aes(x = person_mean)) +
    geom_histogram(bins = 15) +
    geom_vline(
        xintercept = bounds,
        col = "red"
    )
Warning: Removed 1 rows containing non-finite values (`stat_bin()`).

We can see the choices by the two groups, for example Item 5:

table(mc_scored$group, mc_raw[ , "I05"])
       
         A  B  C  D
  lower  9  4 11 14
  upper  0  0  0 35

which shows that everyone in the upper group selected “D”. Assuming “D” is the correct option (and it is), the item has reasonable discrimination. On the other hand, consider Item 10:

table(mc_scored$group, mc_raw[ , "I10"])
       
         9  A  B  C  D
  lower  2 11 10  7  8
  upper  0 23  0  2 10

which seems more mixed.

We can compute the discrimination for item \(i\) as

\[D_i = P_{ui} - P_{li},\]

where \(P_{ui}\) is the proportion correct in the upper group and \(P_{li}\) is the proportion correct in the lower group. We can do this for all items:

mc_p <- mc_scored |>
    filter(!is.na(group)) |>
    group_by(group) |>
    summarize(across(I01:I30, .fns = list(P = ~ mean(.x, na.rm = TRUE))))
mc_p

And compute the discrimination as the difference:

mc_p[2, 2:31] - mc_p[1, 2:31]

This shows that Item 10 has a lower discrimination than Item 5, as we expect.

Exercise 4

What are the maximum and minimum values of \(D\)? And when are those values achieved?

Answer:

External criterion: Get upper and lower groups based on another test. This is usually preferred when feasible.
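
As a sketch, suppose scores from another test were available in a (hypothetical) vector `ext_score`, with one value per student; the upper and lower groups would then be formed from that score rather than from `person_mean`:

# Hypothetical external criterion: one score per student from another test
# ext_score <- ...
ext_bounds <- quantile(ext_score, probs = c(.27, .73), na.rm = TRUE)
mc_scored <- mc_scored |>
    mutate(group_ext = case_when(
        ext_score >= ext_bounds[2] ~ "upper",
        ext_score <= ext_bounds[1] ~ "lower"
    ))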

Using correlations

Point-biserial correlation: Pearson correlation between a dichotomous and a continuous variable.

For example, we can compute the item-total correlation for Item 1:

cor(mc_scored$I01, mc_scored$person_mean, use = "complete")
[1] 0.3354805

Because the item itself is included in the calculation of the total/mean score, the correlation above is somewhat inflated. A common alternative, the corrected item-total correlation, is to correlate the item with the total/mean of all the other items, excluding the one of interest:

# Exclude Item 1
person_mean_no1 <- rowMeans(mc_scored[2:30], na.rm = TRUE)
# Correlation
cor(mc_scored$I01, person_mean_no1, use = "complete")
[1] 0.2725089

The sjPlot::tab_itemscale() function is handy for getting item difficulty and discrimination, but make sure you know how it calculates them.

tab_itemscale(mc_scored[1:30])
Component 1
Row Missings Mean SD Skew Item Difficulty Item Discrimination α if deleted
I01 0.70 % 0.92 0.28 -3.02 0.92 0.27 0.82
I02 0.70 % 0.89 0.32 -2.48 0.89 0.26 0.82
I03 0.70 % 0.78 0.41 -1.38 0.78 0.45 0.82
I04 1.40 % 0.91 0.28 -3.01 0.91 0.27 0.82
I05 1.40 % 0.77 0.42 -1.32 0.77 0.47 0.82
I06 1.40 % 0.88 0.33 -2.36 0.88 0.25 0.82
I07 2.10 % 0.67 0.47 -0.74 0.67 0.45 0.82
I08 2.10 % 0.63 0.48 -0.54 0.63 0.44 0.82
I09 2.10 % 0.42 0.5 0.32 0.42 0.36 0.82
I10 2.10 % 0.22 0.42 1.36 0.22 -0.01 0.83
I11 2.10 % 0.41 0.49 0.38 0.41 0.35 0.82
I12 2.80 % 0.45 0.5 0.19 0.45 0.38 0.82
I13 2.80 % 0.78 0.41 -1.4 0.78 0.31 0.82
I14 3.50 % 0.41 0.49 0.39 0.41 0.12 0.83
I15 3.50 % 0.27 0.44 1.06 0.27 0.47 0.82
I16 3.50 % 0.6 0.49 -0.42 0.60 0.57 0.81
I17 3.50 % 0.65 0.48 -0.65 0.65 0.41 0.82
I18 4.90 % 0.61 0.49 -0.46 0.61 0.52 0.81
I19 4.90 % 0.59 0.49 -0.36 0.59 0.47 0.82
I20 5.59 % 0.67 0.47 -0.72 0.67 0.33 0.82
I21 5.59 % 0.7 0.46 -0.86 0.70 0.40 0.82
I22 5.59 % 0.3 0.46 0.86 0.30 0.12 0.83
I23 6.29 % 0.63 0.49 -0.53 0.63 0.29 0.82
I24 8.39 % 0.51 0.5 -0.05 0.51 0.52 0.81
I25 8.39 % 0.55 0.5 -0.2 0.55 0.45 0.82
I26 8.39 % 0.36 0.48 0.6 0.36 0.13 0.83
I27 9.79 % 0.22 0.41 1.39 0.22 0.14 0.83
I28 9.79 % 0.38 0.49 0.5 0.38 0.48 0.82
I29 10.49 % 0.27 0.44 1.07 0.27 -0.02 0.83
I30 11.19 % 0.61 0.49 -0.47 0.61 0.38 0.82
Mean inter-item-correlation=0.135 · Cronbach's α=0.826

Evaluating the Distractors

The discrimination for the distractors should be negative. Let’s consider Item 17:

table(mc_scored$group, mc_raw[ , "I17"])
       
         9  A  B  C  D
  lower  2  3 11 10 11
  upper  1  1  4 29  0

Here “C” is the correct option, and “A”, “B”, and “D” are distractors. We can see that the distractors are endorsed more often by the lower group. Also, “A” is relatively ineffective, as it was rarely chosen by either group.
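
We can also quantify this by computing, for each option of Item 17, the difference in the proportion choosing that option between the upper and lower groups; the keyed option should come out positive and the distractors negative. A quick sketch based on the table above:

# Proportion choosing each option of I17, within each group
p_i17 <- prop.table(table(mc_scored$group, mc_raw[, "I17"]), margin = 1)
# Upper minus lower: positive for the keyed option ("C"), negative for distractors
p_i17["upper", ] - p_i17["lower", ]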

Noncognitive Items

We will use the bfi data set again.

data(bfi, package = "psych")

Note: Generally Likert-type items are not normal, but using a normal distribution is probably most defensible when there were at least five scale points and when the item distributions are relatively symmetric and unimodal (Rhemtulla et al., 2012, https://doi.org/10.1037/a0029315).

Frequency and Descriptives

While one can compute the usual descriptives such as the mean, median, and SD, for scale items with only a few response options I generally recommend looking at bar plots.

You can see that the Agreeableness items are relatively skewed:

library(tidyr)
bfi %>%
    select(A1:A5) %>%
    pivot_longer(cols = A1:A5, names_to = "item", values_to = "score") %>%
    ggplot(aes(x = score)) +
    geom_bar() +
    facet_wrap(~ item)
Warning: Removed 104 rows containing non-finite values (`stat_count()`).

Exercise 5

Why do you think the Agreeableness items are skewed?

Answer:

When items show non-normal distributions, consider the reason for it and the purpose of the scale. For example, some skewed items are needed to assess extreme levels of a construct.

Recoding

An important step is to recode the negatively worded (reverse-keyed) items.

# Recode A1
bfi$A1r <- 7 - bfi$A1
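
To recode several reverse-keyed items in one step, `across()` can be used. A sketch, assuming the standard bfi keying in which A1, C4, C5, E1, E2, O2, and O5 are the negatively worded items (check the keying for your own instrument):

# Recode the (assumed) reverse-keyed items, appending "r" to the new column names
bfi <- bfi |>
    mutate(across(c(A1, C4, C5, E1, E2, O2, O5),
                  ~ 7 - .x,
                  .names = "{.col}r"))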

Interitem Correlations

Exercise 6

Use the modelsummary::datasummary_correlation() to compute the interitem correlations among the Agreeableness items.

# Your code

Corrected Item-Total Correlations

bfi |>
    select(A1r, A2:A5) |>
    tab_itemscale()
Component 1
Row Missings Mean SD Skew Item Difficulty Item Discrimination α if deleted
A1r 0.57 % 4.59 1.41 -0.83 0.76 0.31 0.72
A2 0.96 % 4.8 1.17 -1.13 0.80 0.56 0.62
A3 0.93 % 4.6 1.3 -1 0.77 0.59 0.60
A4 0.68 % 4.7 1.48 -1.03 0.78 0.40 0.69
A5 0.57 % 4.56 1.26 -0.85 0.76 0.49 0.65
Mean inter-item-correlation=0.332 · Cronbach's α=0.704

The above function is based on the performance::item_reliability() function:

bfi |>
    select(A1r, A2:A5) |>
    item_reliability()

Factor Analysis

Factor analysis: sorting items based on the relative magnitudes of their intercorrelations.

Factors: based on shared variance among items.

Factor loading: the degree to which an item shares variance with the common factor.

fa(bfi[1:25], nfactors = 5)
Loading required namespace: GPArotation
Factor Analysis using method =  minres
Call: fa(r = bfi[1:25], nfactors = 5)
Standardized loadings (pattern matrix) based upon correlation matrix
     MR2   MR1   MR3   MR5   MR4   h2   u2 com
A1  0.21  0.17  0.07 -0.41 -0.06 0.19 0.81 2.0
A2 -0.02  0.00  0.08  0.64  0.03 0.45 0.55 1.0
A3 -0.03  0.12  0.02  0.66  0.03 0.52 0.48 1.1
A4 -0.06  0.06  0.19  0.43 -0.15 0.28 0.72 1.7
A5 -0.11  0.23  0.01  0.53  0.04 0.46 0.54 1.5
C1  0.07 -0.03  0.55 -0.02  0.15 0.33 0.67 1.2
C2  0.15 -0.09  0.67  0.08  0.04 0.45 0.55 1.2
C3  0.03 -0.06  0.57  0.09 -0.07 0.32 0.68 1.1
C4  0.17  0.00 -0.61  0.04 -0.05 0.45 0.55 1.2
C5  0.19 -0.14 -0.55  0.02  0.09 0.43 0.57 1.4
E1 -0.06 -0.56  0.11 -0.08 -0.10 0.35 0.65 1.2
E2  0.10 -0.68 -0.02 -0.05 -0.06 0.54 0.46 1.1
E3  0.08  0.42  0.00  0.25  0.28 0.44 0.56 2.6
E4  0.01  0.59  0.02  0.29 -0.08 0.53 0.47 1.5
E5  0.15  0.42  0.27  0.05  0.21 0.40 0.60 2.6
N1  0.81  0.10  0.00 -0.11 -0.05 0.65 0.35 1.1
N2  0.78  0.04  0.01 -0.09  0.01 0.60 0.40 1.0
N3  0.71 -0.10 -0.04  0.08  0.02 0.55 0.45 1.1
N4  0.47 -0.39 -0.14  0.09  0.08 0.49 0.51 2.3
N5  0.49 -0.20  0.00  0.21 -0.15 0.35 0.65 2.0
O1  0.02  0.10  0.07  0.02  0.51 0.31 0.69 1.1
O2  0.19  0.06 -0.08  0.16 -0.46 0.26 0.74 1.7
O3  0.03  0.15  0.02  0.08  0.61 0.46 0.54 1.2
O4  0.13 -0.32 -0.02  0.17  0.37 0.25 0.75 2.7
O5  0.13  0.10 -0.03  0.04 -0.54 0.30 0.70 1.2

                       MR2  MR1  MR3  MR5  MR4
SS loadings           2.57 2.20 2.03 1.99 1.59
Proportion Var        0.10 0.09 0.08 0.08 0.06
Cumulative Var        0.10 0.19 0.27 0.35 0.41
Proportion Explained  0.25 0.21 0.20 0.19 0.15
Cumulative Proportion 0.25 0.46 0.66 0.85 1.00

 With factor correlations of 
      MR2   MR1   MR3   MR5   MR4
MR2  1.00 -0.21 -0.19 -0.04 -0.01
MR1 -0.21  1.00  0.23  0.33  0.17
MR3 -0.19  0.23  1.00  0.20  0.19
MR5 -0.04  0.33  0.20  1.00  0.19
MR4 -0.01  0.17  0.19  0.19  1.00

Mean item complexity =  1.5
Test of the hypothesis that 5 factors are sufficient.

df null model =  300  with the objective function =  7.23 with Chi Square =  20163.79
df of  the model are 185  and the objective function was  0.65 

The root mean square of the residuals (RMSR) is  0.03 
The df corrected root mean square of the residuals is  0.04 

The harmonic n.obs is  2762 with the empirical chi square  1392.16  with prob <  5.6e-184 
The total n.obs was  2800  with Likelihood Chi Square =  1808.94  with prob <  4.3e-264 

Tucker Lewis Index of factoring reliability =  0.867
RMSEA index =  0.056  and the 90 % confidence intervals are  0.054 0.058
BIC =  340.53
Fit based upon off diagonal values = 0.98
Measures of factor score adequacy             
                                                   MR2  MR1  MR3  MR5  MR4
Correlation of (regression) scores with factors   0.92 0.89 0.88 0.88 0.84
Multiple R square of scores with factors          0.85 0.79 0.77 0.77 0.71
Minimum correlation of possible factor scores     0.70 0.59 0.54 0.54 0.42

There are two caveats to choosing items solely based on their predictive and diagnostic utility: (a) it may result in nonsensical scales, and (b) it may result in low internal consistency reliability.

Considering Uncertainty in Statistics

  • Statistical information only provides one source of information when selecting or deleting items.
  • An item showing suboptimal performance may nevertheless be important conceptually.
  • An item showing poor performance in one’s sample may perform adequately in previous and subsequent samples.

To the last point, it is important to consider the uncertainty in the statistics we use for making decisions, especially when we have a small and/or non-representative sample. While software packages seldom report standard errors or confidence intervals for statistics like item-total correlation, it is relatively straightforward to use the bootstrap to quantify uncertainty for a lot of the statistics. Below is an example, using a small subset of the bfi data.

# Obtain a subset
bfi_sub <- bfi[1:200, ]
# Bootstrap CI
# 1. Define a function for item-total correlations. The second argument needs to be an index for resampling rows of the data.
ff <- function(d, inds) {
    performance::item_reliability(d[inds, ])$item_discrimination
}
# 2. Bootstrap
boo <- boot(bfi_sub[, c("A1r", "A2", "A3", "A4", "A5")],
     statistic = ff,
     R = 1999)
# 3. Summarize bootstrap distributions
plot(boo, index = 1)  # for A1r

plot(boo, index = 4)  # for A4

# Proportion of bootstrap samples where item-total correlations < .20
# (A1r, A2:A5)
colMeans(boo$t < .20)
[1] 0.06003002 0.00000000 0.00000000 0.06453227 0.00000000
# 4. Bootstrap confidence intervals for the first statistic
# (i.e., item-total correlation for A1r)
boot.ci(boo, index = 1)
Warning in boot.ci(boo, index = 1): bootstrap variances needed for studentized
intervals
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1999 bootstrap replicates

CALL : 
boot.ci(boot.out = boo, index = 1)

Intervals : 
Level      Normal              Basic         
95%   ( 0.1708,  0.4622 )   ( 0.1740,  0.4660 )  

Level     Percentile            BCa          
95%   ( 0.164,  0.456 )   ( 0.164,  0.456 )  
Calculations and Intervals on Original Scale
Exercise 7

From the bootstrap samples above, what is the proportion of replications where Item A1r has the lowest item-total correlation? Show your code, and discuss what this implies.

# Your code

Answer: