library(tidyverse)
library(modelsummary) # for summarizing data
library(TAM)
library(sjPlot) # for item difficulty and discrimination
library(psych) # for factor analysis
library(performance) # for item-total correlation
library(boot) # for bootstrapping
Item Analysis
Cognitive Items
We will use a sample data set from the TAM
package, which contains 143 students on 30 items. There were actually two versions of the data set, which `data.mc\(raw\) showing the raw responses (i.e., the response selected), and the scored responses (i.e., 0 = incorrect, 1 = correct).
data(data.mc)
Descriptives
<- data.mc$raw # get the raw choices portion
mc_raw # Frequency table for the first two items
datasummary_skim(mc_raw[, 1:2], type = "categorical")
N | % | ||
---|---|---|---|
I01 | 9 | 2 | 1.4 |
A | 130 | 90.9 | |
B | 4 | 2.8 | |
C | 3 | 2.1 | |
D | 3 | 2.1 | |
NA | 1 | 0.7 | |
I02 | 8 | 1 | 0.7 |
9 | 1 | 0.7 | |
A | 5 | 3.5 | |
B | 126 | 88.1 | |
C | 4 | 2.8 | |
D | 5 | 3.5 | |
NA | 1 | 0.7 |
Which options do you think are likely to be the correct option for I10 and I20?
Here are the scored version
<- data.mc$scored # get the scored responses
mc_scored # Frequency table for the first two items
datasummary_skim(mc_scored, fmt = "%.2f")
Unique (#) | Missing (%) | Mean | SD | Min | Median | Max | ||
---|---|---|---|---|---|---|---|---|
I01 | 3 | 1 | 0.92 | 0.28 | 0.00 | 1.00 | 1.00 | |
I02 | 3 | 1 | 0.89 | 0.32 | 0.00 | 1.00 | 1.00 | |
I03 | 3 | 1 | 0.78 | 0.41 | 0.00 | 1.00 | 1.00 | |
I04 | 3 | 1 | 0.91 | 0.28 | 0.00 | 1.00 | 1.00 | |
I05 | 3 | 1 | 0.77 | 0.42 | 0.00 | 1.00 | 1.00 | |
I06 | 3 | 1 | 0.88 | 0.33 | 0.00 | 1.00 | 1.00 | |
I07 | 3 | 2 | 0.67 | 0.47 | 0.00 | 1.00 | 1.00 | |
I08 | 3 | 2 | 0.63 | 0.48 | 0.00 | 1.00 | 1.00 | |
I09 | 3 | 2 | 0.42 | 0.50 | 0.00 | 0.00 | 1.00 | |
I10 | 3 | 2 | 0.22 | 0.42 | 0.00 | 0.00 | 1.00 | |
I11 | 3 | 2 | 0.41 | 0.49 | 0.00 | 0.00 | 1.00 | |
I12 | 3 | 3 | 0.45 | 0.50 | 0.00 | 0.00 | 1.00 | |
I13 | 3 | 3 | 0.78 | 0.41 | 0.00 | 1.00 | 1.00 | |
I14 | 3 | 3 | 0.41 | 0.49 | 0.00 | 0.00 | 1.00 | |
I15 | 3 | 3 | 0.27 | 0.44 | 0.00 | 0.00 | 1.00 | |
I16 | 3 | 3 | 0.60 | 0.49 | 0.00 | 1.00 | 1.00 | |
I17 | 3 | 3 | 0.65 | 0.48 | 0.00 | 1.00 | 1.00 | |
I18 | 3 | 5 | 0.61 | 0.49 | 0.00 | 1.00 | 1.00 | |
I19 | 3 | 5 | 0.59 | 0.49 | 0.00 | 1.00 | 1.00 | |
I20 | 3 | 6 | 0.67 | 0.47 | 0.00 | 1.00 | 1.00 | |
I21 | 3 | 6 | 0.70 | 0.46 | 0.00 | 1.00 | 1.00 | |
I22 | 3 | 6 | 0.30 | 0.46 | 0.00 | 0.00 | 1.00 | |
I23 | 3 | 6 | 0.63 | 0.49 | 0.00 | 1.00 | 1.00 | |
I24 | 3 | 8 | 0.51 | 0.50 | 0.00 | 1.00 | 1.00 | |
I25 | 3 | 8 | 0.55 | 0.50 | 0.00 | 1.00 | 1.00 | |
I26 | 3 | 8 | 0.36 | 0.48 | 0.00 | 0.00 | 1.00 | |
I27 | 3 | 10 | 0.22 | 0.41 | 0.00 | 0.00 | 1.00 | |
I28 | 3 | 10 | 0.38 | 0.49 | 0.00 | 0.00 | 1.00 | |
I29 | 3 | 10 | 0.27 | 0.44 | 0.00 | 0.00 | 1.00 | |
I30 | 3 | 11 | 0.61 | 0.49 | 0.00 | 1.00 | 1.00 |
Item Difficulty
This is the proportion of “1”s divided by the total number of responses. Note this is somewhat a misnomer, as an item with a higher “difficulty” level is easier.
colSums(mc_scored, na.rm = TRUE) / nrow(mc_scored)
I01 I02 I03 I04 I05 I06 I07 I08
0.9090909 0.8811189 0.7762238 0.9020979 0.7622378 0.8671329 0.6573427 0.6153846
I09 I10 I11 I12 I13 I14 I15 I16
0.4125874 0.2167832 0.3986014 0.4405594 0.7622378 0.3916084 0.2587413 0.5804196
I17 I18 I19 I20 I21 I22 I23 I24
0.6293706 0.5804196 0.5594406 0.6293706 0.6573427 0.2867133 0.5874126 0.4685315
I25 I26 I27 I28 I29 I30
0.5034965 0.3286713 0.1958042 0.3426573 0.2377622 0.5454545
Which item is the most difficult? Which is the least difficult?
Answer:
Observe the item means and the standard deviations. What is the relation between the mean and the standard deviation?
Answer:
Item Discrimination
The degree to which an item can distinguish respondents of different level of knowledge or skill, such that those with a higher level of knowledge or skill is more likely to answer an item correctly.
Internal criterion: Performance in the current test
# Create mean score
$person_mean <- rowMeans(mc_scored[1:30], na.rm = TRUE)
mc_scored# Top and bottom 27%
nrow(mc_scored) * .27 # 39 people
[1] 38.61
# Corresponding scores
<- sort(mc_scored$person_mean)[c(39, 143 - 39)]
bounds <- mc_scored |>
mc_scored mutate(group = case_when(
> bounds[2] ~ "upper",
person_mean <= bounds[1] ~ "lower"
person_mean
))# Plot the distributions
ggplot(mc_scored, aes(x = person_mean)) +
geom_histogram(bins = 15) +
geom_vline(
xintercept = bounds,
col = "red"
)
Warning: Removed 1 rows containing non-finite values (`stat_bin()`).
We can see the choices by the two groups, for example Item 5:
table(mc_scored$group, mc_raw[ , "I05"])
A B C D
lower 9 4 11 14
upper 0 0 0 35
which shows that the upper group all selects “D”. Assuming D is the right choice (and it is), the item has a reasonable discrimination. On the other hand, consider Item 10:
table(mc_scored$group, mc_raw[ , "I10"])
9 A B C D
lower 2 11 10 7 8
upper 0 23 0 2 10
which seems more mixed.
We can compute the discrimination for item \(i\) as
\[D_i = P_{ui} - P_{li},\]
where \(P_u\) is the proportion correct in the upper group and \(P_l\) is that in the lower group. We can do this for all items:
<- mc_scored |>
mc_p filter(!is.na(group)) |>
group_by(group) |>
summarize(across(I01:I30, .fns = list(P = ~ mean(.x, na.rm = TRUE))))
mc_p
And get the discrimination using the difference
2, 2:31] - mc_p[1, 2:31] mc_p[
This shows that Item 10 has a lower discrimination than Item 5, as we expect.
What are the maximum and minimum values of \(D\)? And when are those values achieved?
Answer:
External criterion: Get upper and lower groups based on another test. This is usually preferred when feasible.
Using correlations
Point-biserial correlation: Pearson correlation between a dichotomous and a continuous variable.
For example, we can compute the item-total correlation for Item 1:
cor(mc_scored$I01, mc_scored$person_mean, use = "complete")
[1] 0.3354805
Because the item is included in the calculation of the total/mean score, another common approach is to compute the total/mean of all other items, excluding the one of interest:
# Exclude Item 1
<- rowMeans(mc_scored[2:30], na.rm = TRUE)
person_mean_no1 # Correlation
cor(mc_scored$I01, person_mean_no1, use = "complete")
[1] 0.2725089
The sjPlot::tab_itemscale()
function is handy for getting item difficulty and discrimination, but make sure to know how they calculate them.
tab_itemscale(mc_scored[1:30])
Row | Missings | Mean | SD | Skew | Item Difficulty | Item Discrimination | α if deleted | |
---|---|---|---|---|---|---|---|---|
I01 | 0.70 % | 0.92 | 0.28 | -3.02 | 0.92 | 0.27 | 0.82 | |
I02 | 0.70 % | 0.89 | 0.32 | -2.48 | 0.89 | 0.26 | 0.82 | |
I03 | 0.70 % | 0.78 | 0.41 | -1.38 | 0.78 | 0.45 | 0.82 | |
I04 | 1.40 % | 0.91 | 0.28 | -3.01 | 0.91 | 0.27 | 0.82 | |
I05 | 1.40 % | 0.77 | 0.42 | -1.32 | 0.77 | 0.47 | 0.82 | |
I06 | 1.40 % | 0.88 | 0.33 | -2.36 | 0.88 | 0.25 | 0.82 | |
I07 | 2.10 % | 0.67 | 0.47 | -0.74 | 0.67 | 0.45 | 0.82 | |
I08 | 2.10 % | 0.63 | 0.48 | -0.54 | 0.63 | 0.44 | 0.82 | |
I09 | 2.10 % | 0.42 | 0.5 | 0.32 | 0.42 | 0.36 | 0.82 | |
I10 | 2.10 % | 0.22 | 0.42 | 1.36 | 0.22 | -0.01 | 0.83 | |
I11 | 2.10 % | 0.41 | 0.49 | 0.38 | 0.41 | 0.35 | 0.82 | |
I12 | 2.80 % | 0.45 | 0.5 | 0.19 | 0.45 | 0.38 | 0.82 | |
I13 | 2.80 % | 0.78 | 0.41 | -1.4 | 0.78 | 0.31 | 0.82 | |
I14 | 3.50 % | 0.41 | 0.49 | 0.39 | 0.41 | 0.12 | 0.83 | |
I15 | 3.50 % | 0.27 | 0.44 | 1.06 | 0.27 | 0.47 | 0.82 | |
I16 | 3.50 % | 0.6 | 0.49 | -0.42 | 0.60 | 0.57 | 0.81 | |
I17 | 3.50 % | 0.65 | 0.48 | -0.65 | 0.65 | 0.41 | 0.82 | |
I18 | 4.90 % | 0.61 | 0.49 | -0.46 | 0.61 | 0.52 | 0.81 | |
I19 | 4.90 % | 0.59 | 0.49 | -0.36 | 0.59 | 0.47 | 0.82 | |
I20 | 5.59 % | 0.67 | 0.47 | -0.72 | 0.67 | 0.33 | 0.82 | |
I21 | 5.59 % | 0.7 | 0.46 | -0.86 | 0.70 | 0.40 | 0.82 | |
I22 | 5.59 % | 0.3 | 0.46 | 0.86 | 0.30 | 0.12 | 0.83 | |
I23 | 6.29 % | 0.63 | 0.49 | -0.53 | 0.63 | 0.29 | 0.82 | |
I24 | 8.39 % | 0.51 | 0.5 | -0.05 | 0.51 | 0.52 | 0.81 | |
I25 | 8.39 % | 0.55 | 0.5 | -0.2 | 0.55 | 0.45 | 0.82 | |
I26 | 8.39 % | 0.36 | 0.48 | 0.6 | 0.36 | 0.13 | 0.83 | |
I27 | 9.79 % | 0.22 | 0.41 | 1.39 | 0.22 | 0.14 | 0.83 | |
I28 | 9.79 % | 0.38 | 0.49 | 0.5 | 0.38 | 0.48 | 0.82 | |
I29 | 10.49 % | 0.27 | 0.44 | 1.07 | 0.27 | -0.02 | 0.83 | |
I30 | 11.19 % | 0.61 | 0.49 | -0.47 | 0.61 | 0.38 | 0.82 | |
Mean inter-item-correlation=0.135 · Cronbach's α=0.826 |
Evaluating the Distractors
The discrimination for the distractors should be negative. Let’s consider Item 17:
table(mc_scored$group, mc_raw[ , "I17"])
9 A B C D
lower 2 3 11 10 11
upper 1 1 4 29 0
Here “C” is the correct option, and “A”, “B”, and “D” are distractors. We can see that those options are endorsed more often by the lower group. Also, “A” is relatively ineffective, as it was barely chosen by either groups.
Noncognitive Items
We will use the bfi
data set again.
data(bfi, package = "psych")
Note: Generally Likert-type items are not normal, but using a normal distribution is probably most defensible when there were at least five scale points and when the item distributions are relatively symmetric and unimodal (Rhemtulla et al., 2012, https://doi.org/10.1037/a0029315).
Frequency and Descriptives
While one can compute the usual descriptives like mean, median, and SD, for scale items that have only a few options, I generally recommend using actual bar plots.
You can see that the Agreeableness items are relatively skewed:
library(tidyr)
%>%
bfi select(A1:A5) %>%
pivot_longer(cols = A1:A5, names_to = "item", values_to = "score") %>%
ggplot(aes(x = score)) +
geom_bar() +
facet_wrap(~ item)
Warning: Removed 104 rows containing non-finite values (`stat_count()`).
Why do you think the Agreeableness items are skewed?
Answer:
When items show non-normal distributions, consider the reason for it and the purpose of the scale. For example, some skewed items are needed to assess extreme levels of a construct.
Recoding
An important step is to recode the negatively oriented variables.
# Recode A1
$A1r <- 7 - bfi$A1 bfi
Interitem Correlations
Use the modelsummary::datasummary_correlation()
to compute the interitem correlations among the Agreeableness items.
# Your code
Corrected Item-Total Correlations
|>
bfi select(A1r, A2:A5) |>
tab_itemscale()
Row | Missings | Mean | SD | Skew | Item Difficulty | Item Discrimination | α if deleted | |
---|---|---|---|---|---|---|---|---|
A1r | 0.57 % | 4.59 | 1.41 | -0.83 | 0.76 | 0.31 | 0.72 | |
A2 | 0.96 % | 4.8 | 1.17 | -1.13 | 0.80 | 0.56 | 0.62 | |
A3 | 0.93 % | 4.6 | 1.3 | -1 | 0.77 | 0.59 | 0.60 | |
A4 | 0.68 % | 4.7 | 1.48 | -1.03 | 0.78 | 0.40 | 0.69 | |
A5 | 0.57 % | 4.56 | 1.26 | -0.85 | 0.76 | 0.49 | 0.65 | |
Mean inter-item-correlation=0.332 · Cronbach's α=0.704 |
The above function is based on the performance::item_reliability()
function:
|>
bfi select(A1r, A2:A5) |>
item_reliability()
Factor Analysis
Sorting out the items based on the relative magnitude of their intercorrelations.
Factors: based on shared variance among items.
Factor loading: the degree to which an item share variance with the common factor.
fa(bfi[1:25], nfactors = 5)
Loading required namespace: GPArotation
Factor Analysis using method = minres
Call: fa(r = bfi[1:25], nfactors = 5)
Standardized loadings (pattern matrix) based upon correlation matrix
MR2 MR1 MR3 MR5 MR4 h2 u2 com
A1 0.21 0.17 0.07 -0.41 -0.06 0.19 0.81 2.0
A2 -0.02 0.00 0.08 0.64 0.03 0.45 0.55 1.0
A3 -0.03 0.12 0.02 0.66 0.03 0.52 0.48 1.1
A4 -0.06 0.06 0.19 0.43 -0.15 0.28 0.72 1.7
A5 -0.11 0.23 0.01 0.53 0.04 0.46 0.54 1.5
C1 0.07 -0.03 0.55 -0.02 0.15 0.33 0.67 1.2
C2 0.15 -0.09 0.67 0.08 0.04 0.45 0.55 1.2
C3 0.03 -0.06 0.57 0.09 -0.07 0.32 0.68 1.1
C4 0.17 0.00 -0.61 0.04 -0.05 0.45 0.55 1.2
C5 0.19 -0.14 -0.55 0.02 0.09 0.43 0.57 1.4
E1 -0.06 -0.56 0.11 -0.08 -0.10 0.35 0.65 1.2
E2 0.10 -0.68 -0.02 -0.05 -0.06 0.54 0.46 1.1
E3 0.08 0.42 0.00 0.25 0.28 0.44 0.56 2.6
E4 0.01 0.59 0.02 0.29 -0.08 0.53 0.47 1.5
E5 0.15 0.42 0.27 0.05 0.21 0.40 0.60 2.6
N1 0.81 0.10 0.00 -0.11 -0.05 0.65 0.35 1.1
N2 0.78 0.04 0.01 -0.09 0.01 0.60 0.40 1.0
N3 0.71 -0.10 -0.04 0.08 0.02 0.55 0.45 1.1
N4 0.47 -0.39 -0.14 0.09 0.08 0.49 0.51 2.3
N5 0.49 -0.20 0.00 0.21 -0.15 0.35 0.65 2.0
O1 0.02 0.10 0.07 0.02 0.51 0.31 0.69 1.1
O2 0.19 0.06 -0.08 0.16 -0.46 0.26 0.74 1.7
O3 0.03 0.15 0.02 0.08 0.61 0.46 0.54 1.2
O4 0.13 -0.32 -0.02 0.17 0.37 0.25 0.75 2.7
O5 0.13 0.10 -0.03 0.04 -0.54 0.30 0.70 1.2
MR2 MR1 MR3 MR5 MR4
SS loadings 2.57 2.20 2.03 1.99 1.59
Proportion Var 0.10 0.09 0.08 0.08 0.06
Cumulative Var 0.10 0.19 0.27 0.35 0.41
Proportion Explained 0.25 0.21 0.20 0.19 0.15
Cumulative Proportion 0.25 0.46 0.66 0.85 1.00
With factor correlations of
MR2 MR1 MR3 MR5 MR4
MR2 1.00 -0.21 -0.19 -0.04 -0.01
MR1 -0.21 1.00 0.23 0.33 0.17
MR3 -0.19 0.23 1.00 0.20 0.19
MR5 -0.04 0.33 0.20 1.00 0.19
MR4 -0.01 0.17 0.19 0.19 1.00
Mean item complexity = 1.5
Test of the hypothesis that 5 factors are sufficient.
df null model = 300 with the objective function = 7.23 with Chi Square = 20163.79
df of the model are 185 and the objective function was 0.65
The root mean square of the residuals (RMSR) is 0.03
The df corrected root mean square of the residuals is 0.04
The harmonic n.obs is 2762 with the empirical chi square 1392.16 with prob < 5.6e-184
The total n.obs was 2800 with Likelihood Chi Square = 1808.94 with prob < 4.3e-264
Tucker Lewis Index of factoring reliability = 0.867
RMSEA index = 0.056 and the 90 % confidence intervals are 0.054 0.058
BIC = 340.53
Fit based upon off diagonal values = 0.98
Measures of factor score adequacy
MR2 MR1 MR3 MR5 MR4
Correlation of (regression) scores with factors 0.92 0.89 0.88 0.88 0.84
Multiple R square of scores with factors 0.85 0.79 0.77 0.77 0.71
Minimum correlation of possible factor scores 0.70 0.59 0.54 0.54 0.42
There are two caveats if choosing items solely based on their predictive and diagnostic utility: (a) it may result in nonsensical scales, and (b) it may result in low internal consistency reliability.
Considering Uncertainty in Statistics
- Statistical information only provides one source of information when selecting or deleting items.
- An item showing suboptimal performance may nevertheless be important conceptually.
- An item showing a poor performance in one’s sample may perform adequately in previous and subsequent samples.
To the last point, it is important to consider the uncertainty in the statistics we use for making decisions, especially when we have a small and/or non-representative sample. While software packages seldom report standard errors or confidence intervals for statistics like item-total correlation, it is relatively straightforward to use the bootstrap to quantify uncertainty for a lot of the statistics. Below is an example, using a small subset of the bfi
data.
# Obtain a subset
<- bfi[1:200, ]
bfi_sub # Bootstrap CI
# 1. Define function for item-total correlations. The second argument needs to an index for resampling rows of the data.
<- function(d, inds) {
ff ::item_reliability(d[inds, ])$item_discrimination
performance
}# 2. Bootstrap
<- boot(bfi_sub[, c("A1r", "A2", "A3", "A4", "A5")],
boo statistic = ff,
R = 1999)
# 3. Summarize bootstrap distributions
plot(boo, index = 1) # for A1r
plot(boo, index = 4) # for A4
# Number of bootstrap samples where item-total correlations < .20
# (A1r, A2:A5)
colMeans(boo$t < .20)
[1] 0.06003002 0.00000000 0.00000000 0.06453227 0.00000000
# 4. Bootstrap confidence intervals for the first statistic
# (i.e., item-total correlation for A1r)
boot.ci(boo, index = 1)
Warning in boot.ci(boo, index = 1): bootstrap variances needed for studentized
intervals
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1999 bootstrap replicates
CALL :
boot.ci(boot.out = boo, index = 1)
Intervals :
Level Normal Basic
95% ( 0.1708, 0.4622 ) ( 0.1740, 0.4660 )
Level Percentile BCa
95% ( 0.164, 0.456 ) ( 0.164, 0.456 )
Calculations and Intervals on Original Scale
From the bootstrap samples above, what is the proportion of replications where Item A1r has the lowest item-total correlation? Show your code, and discuss what this implies.
# Your code
Answer: