11  Model Comparison

11.1 Overfitting and Underfitting

In statistical modeling, a more complex model almost always fits the data better. Roughly speaking, a more complex model is one with more parameters, although, as you will see later, counting parameters in Bayesian analyses is not straightforward. At the extreme, with 10 observations, a model with 10 parameters can reproduce every single data point (by devoting one parameter to each observation). There are, however, two problems with an overly complex model. First, it becomes increasingly hard to extract useful information from the data: instead of describing the relationship between two variables, such as Marriage and Divorce, with a straight line, one ends up with a convoluted model that is difficult to interpret. Second, as you will also see, the more complex a model, the greater the risk that it overfits the current data and fails to predict future observations.

For example, let’s randomly sample 10 states in the waffle_divorce data set and build some models.

waffle_divorce <- read_delim(  # read delimited files
    "data/WaffleDivorce.csv",
    delim = ";"
)
Rows: 50 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ";"
chr  (2): Location, Loc
dbl (11): Population, MedianAgeMarriage, Marriage, Marriage SE, Divorce, Div...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Rescale Marriage and Divorce by dividing by 10
waffle_divorce$Marriage <- waffle_divorce$Marriage / 10
waffle_divorce$Divorce <- waffle_divorce$Divorce / 10
waffle_divorce$MedianAgeMarriage <- waffle_divorce$MedianAgeMarriage / 10
# Recode `South` to a factor variable
waffle_divorce$South <- factor(waffle_divorce$South,
    levels = c(0, 1),
    labels = c("non-south", "south")
)
set.seed(1547)  # set the seed for reproducibility
# Sample 10 observations
train <- sample.int(nrow(waffle_divorce), 10L)
wd_sub <- waffle_divorce[train, ]
base <- ggplot(aes(x = Marriage, y = Divorce),
               data = wd_sub) +
    geom_point() +
    coord_cartesian(ylim = c(0.6, 1.4)) +
    xlim(range(waffle_divorce$Marriage))
ggplot(waffle_divorce,
       aes(x = Marriage, y = Divorce)) + 
    geom_point(col = "lightblue") +
    geom_point(size = 1.5, data = wd_sub, col = "red") +
    coord_cartesian(ylim = c(0.6, 1.4)) +
    xlim(range(waffle_divorce$Marriage))

When using Marriage to predict Divorce, we can go beyond a straight regression line by using higher-order polynomials. For example, a second-order polynomial represents a quadratic effect (one turning point); third and fourth orders give cubic and quartic curves, and so on. The figure below shows the fit from a linear effect of Marriage, a quadratic effect, and increasingly complex models up to a sixth-degree polynomial. As you can see, as the model gets more complex, the fitted line chases the 10 points more and more closely, and \(R^2\) increases. However, the error band around the fitted line also becomes wider and more erratic, meaning more uncertainty in the model parameters.

Code
r2 <- function(object, newresp, newdata) {
    # Function for computing R^2
    ypred <- predict(object, newdata = newdata)
    cor(ypred, newresp)^2
}
rmse <- function(object, newresp, newdata) {
    # Function for RMSE
    ypred <- predict(object, newdata = newdata)
    sqrt(mean((ypred - newresp)^2))
}
# Create six plots through a loop
p_list <- map(1:6, function(i) {
    # Use frequentist analyses for speed
    mod <- lm(Divorce ~ poly(Marriage, degree = i), data = wd_sub)
    base +
        geom_smooth(method = "lm", formula = y ~ poly(x, i), level = .80,
                    fullrange = TRUE) +
        annotate("text", x = 1.7, y = 1.4,
                 label = paste0("italic(R)^2 == ",
                                round(r2(mod, wd_sub$Divorce), 2)),
                 parse = TRUE) +
        annotate("text", x = 1.7, y = 1.2,
                 label = paste0("RMSE == ",
                                round(rmse(mod, wd_sub$Divorce), 2)),
                 parse = TRUE)
})
do.call(grid.arrange, c(p_list, nrow = 2))
Figure 11.1: Fit of models on the 10 random cases. Top panel: linear, quadratic, and cubic; bottom panel: 4th, 5th, and 6th degree polynomials

Another way to look at model accuracy is the Root Mean Squared Error (RMSE), defined as the square root of the average squared prediction error. RMSE is a measure of prediction error. The smaller the RMSE, the better the prediction is. As you can see in the above figure, more complex models always reduce the RMSE in the data we use to fit the model (also called training data).

However, if I take the regression line/curve estimated from the subsample of 10 observations and use it to predict the remaining cases in the data set, things look different. As you can see in the figure below, whereas the prediction error is comparable for the linear and quadratic models, the higher-degree polynomials predict the hold-out data very badly. An overly complex model tailors its coefficients to the sampling error and noise in the data at hand, so it does not generalize to new observations. Therefore, our goal in model comparison is to choose a model complex enough to capture the essence of the data-generating process (and thus avoid underfitting), but not so complex that it overfits.

Code
base2 <- ggplot(aes(x = Marriage, y = Divorce),
               data = waffle_divorce[-train, ]) +
    geom_point() +
    coord_cartesian(ylim = c(0.6, 1.4)) +
    xlim(range(waffle_divorce$Marriage))
# Create six plots through a loop
p_list2 <- map(1:6, function(i) {
    # Use frequentist analyses for speed
    mod <- lm(Divorce ~ poly(Marriage, degree = i), data = wd_sub)
    # New data and response
    test_dat <- waffle_divorce[-train, ]
    ynew <- test_dat$Divorce
    base2 +
        geom_smooth(data = wd_sub, method = "lm", formula = y ~ poly(x, i),
                    level = .80, fullrange = TRUE) +
        annotate("text", x = 1.7, y = 1.4,
                 label = paste0("italic(R)^2 == ",
                                round(r2(mod, ynew, test_dat), 2)),
                 parse = TRUE) +
        annotate("text", x = 1.7, y = 1.2,
                 label = paste0("RMSE == ",
                                round(rmse(mod, ynew, test_dat), 2)),
                 parse = TRUE)
})
do.call(grid.arrange, c(p_list2, nrow = 2))
Figure 11.2: Using the regression lines based on 10 random cases to predict the remaining 40 cases. Top panel: linear, quadratic, and cubic; bottom panel: 4th, 5th, and 6th degree polynomials

The goal of statistical modeling is to find an optimal balance between overfitting and underfitting. In machine learning, this is also commonly referred to as the bias-variance trade-off: a model that is too simple tends to produce biased predictions because it does not capture the essence of the data-generating process, whereas an overly complex model is unbiased but produces highly variable predictions because too many unnecessary components affect them, as indicated by the wide confidence bands around the 6th-degree polynomial line.

Polynomials of varying degrees are merely one example of comparing simple to complex models. You can think about:

  • models with and without interactions,
  • models with a few predictors versus hundreds of predictors,
  • regression analyses versus multilevel models, etc.

Whereas one can always avoid underfitting by fitting a more and more complex model, we need tools to keep us from overfitting. This lecture is about finding a model that avoids both overfitting and underfitting. You will learn to perform model comparison with information criteria to find a model that strikes a better balance between the two.

11.2 Kullback-Leibler Divergence

When comparing models (e.g., linear vs. quadratic), we prefer models closer to the “true” data-generating process. To do so, we need some ways to quantify the degree of “closeness” to the true model. In this context, models comprise both the distributional family and the parameter values. For example, the model \(y_i \sim N(5, 2)\) is a different model than \(y_i \sim N(3, 2)\), which is a different model than \(y_i \sim \mathrm{Gamma}(2, 2)\). The first two have the same family but different parameter values (different means, same \(\mathit{SD}\)). In contrast, the last two have different distributional families (Normal vs. Gamma).

To measure the degree of “closeness” between two models, \(M_0\) and \(M_1\), by far the most popular metric in statistics is the Kullback-Leibler divergence (or Kullback-Leibler discrepancy; \(D_\textrm{KL}\)). By definition,

\[ \begin{aligned} D_\textrm{KL}(M_0 \mid M_1) & = \int_{-\infty}^\infty p_{M_0} (\boldsymbol{\mathbf{y}}) \log \frac{p_{M_0}(\boldsymbol{\mathbf{y}})}{p_{M_1}(\boldsymbol{\mathbf{y}})} \; \mathrm{d}\boldsymbol{\mathbf{y}} \\ & = \int_{-\infty}^\infty p_{M_0} (\boldsymbol{\mathbf{y}}) \log p_{M_0}(\boldsymbol{\mathbf{y}}) \; \mathrm{d}\boldsymbol{\mathbf{y}} - \int_{-\infty}^\infty p_{M_0} (\boldsymbol{\mathbf{y}}) \log p_{M_1}(\boldsymbol{\mathbf{y}}) \; \mathrm{d}\boldsymbol{\mathbf{y}}. \end{aligned} \]

Note that strictly speaking, \(D_\textrm{KL}\) cannot be called a “distance” between two models because in general, \(D_\textrm{KL}(M_0 \mid M_1) \neq D_\textrm{KL}(M_1 \mid M_0)\). As an example, assume that the data are generated by a true model \(M_0\), and we have two candidate models \(M_1\) and \(M_2\), where

  • \(M_0: y \sim N(3, 2)\)
  • \(M_1: y \sim N(3.5, 2.5)\)
  • \(M_2: y \sim \mathrm{Cauchy}(3, 2)\)
Code
ggplot(data.frame(x = c(-3, 9)), aes(x = x)) +
    stat_function(fun = dnorm, args = list(mean = 3, sd = 2),
                  aes(col = "M0"), linewidth = 1) +
    stat_function(fun = dnorm, args = list(mean = 3.5, sd = 2.5),
                  aes(col = "M1"), linewidth = 2) +
    stat_function(fun = dcauchy, args = list(location = 3, scale = 2),
                  aes(col = "M2"), linewidth = 2) +
    scale_color_manual(values = c("black", "red", "blue"),
                       labels = c("M0", "M1", "M2")) +
    labs(x = "y", y = "density", col = NULL)
Figure 11.3: Density for \(M_0\), \(M_1\), and \(M_2\)
# Integrand of D_KL(M0 | M1): p_M0(y) * [log p_M0(y) - log p_M1(y)]
f1 <- function(x) {
    dnorm(x, 3, 2) * (dnorm(x, 3, 2, log = TRUE) -
        dnorm(x, 3.5, 2.5, log = TRUE))
}
# Integrand of D_KL(M0 | M2): p_M0(y) * [log p_M0(y) - log p_M2(y)]
f2 <- function(x) {
    dnorm(x, 3, 2) * (dnorm(x, 3, 2, log = TRUE) -
        dcauchy(x, 3, 2, log = TRUE))
}
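
The divergence values reported below can then be obtained by numerically integrating these functions over the real line, for example with `integrate()`:

integrate(f1, lower = -Inf, upper = Inf)  # D_KL(M0 | M1)
integrate(f2, lower = -Inf, upper = Inf)  # D_KL(M0 | M2)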

One can compute that \(D_\textrm{KL}(M_0 \mid M_1) = 0.0631436\) and \(D_\textrm{KL}(M_0 \mid M_2) = 0.2592445\), so \(M_1\) is closer to the true model and thus a better model than \(M_2\).

Note that in the expression of \(D_\textrm{KL}\), when the target model \(M_0\) is held fixed, the first term is constant across candidate models and depends only on the “true” model, \(M_0\). Therefore, it is sufficient to compare models on the second term, \(\int_{-\infty}^\infty p_{M_0} (\boldsymbol{\mathbf{y}}) \log p_{M_1}(\boldsymbol{\mathbf{y}}) \; \mathrm{d}\boldsymbol{\mathbf{y}}\), which can also be written as \(\mathrm{E}[\log p_{M_1} (\boldsymbol{\mathbf{y}})]\), with the expectation taken over \(p_{M_0}\); this is the expected log predictive density (elpd). In other words, a model with a larger elpd is preferred over a model with a smaller elpd.

However, in real data analysis we don’t know what \(M_0\) is. If we knew, we would simply choose \(M_0\) as our model, and there would be no need for model comparison. In addition, even if we knew that the true model is, say, a normal model (which never happens in real data analysis), we would still need to estimate the parameter values, and the estimates will not be exactly equal to the true parameter values. The difficulty is that the elpd is defined as an expectation over the true predictive distribution, \(p_{M_0}(y)\), which cannot be computed without knowing \(M_0\).

So instead, we need to estimate the elpd. A naive way is to use the data distribution in place of the true model, but that leads to an overly optimistic estimate because the sample data are noisy; computing the elpd this way will always favor a more complex model. The ideal way is to collect data on a new, independent sample that shares the same data-generating process as the current sample, and estimate the elpd on the new sample. This is called out-of-sample validation. The problem, of course, is that we usually do not have the resources to collect a new sample.

Therefore, statisticians have worked hard to find ways to estimate elpd from the current sample, and there are two broad approaches:

  • Information criteria: AIC, DIC, and WAIC, which estimate the elpd in the current sample, minus a correction factor
  • Cross-validation, which splits the current sample into \(K\) parts, estimates the parameters in \(K - 1\) parts, and estimates the elpd in the remaining part. A special case is \(K = N\): each time, one uses \(N - 1\) data points to estimate the model parameters and estimates the elpd for the single observation that was left out. This is called leave-one-out cross-validation (LOO-CV); a minimal sketch follows this list.
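
As a minimal sketch of the LOO-CV idea (using a frequentist linear model for speed, with `loo_elpd` as an illustrative name; this is not the PSIS approximation used later in this chapter):

# Leave out each observation in turn, refit, and score the held-out point
loo_elpd <- vapply(seq_len(nrow(waffle_divorce)), function(i) {
    fit_i <- lm(Divorce ~ Marriage, data = waffle_divorce[-i, ])
    mu_i <- predict(fit_i, newdata = waffle_divorce[i, ])
    # log predictive density of the held-out observation
    dnorm(waffle_divorce$Divorce[i], mean = mu_i, sd = sigma(fit_i), log = TRUE)
}, FUN.VALUE = numeric(1))
sum(loo_elpd)  # LOO estimate of the elpd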

11.3 Deviance

Without going too deep into the underlying math, it can be shown that a good estimate of elpd is

\[ \sum_{i = 1}^n \log p_{M_1}(y_i) - p, \]

where \(p\) is some measure of the number of parameters in \(M_1\). The first term is the log-likelihood of the model in the current sample. The second term is an adjustment so that the quantity above approximates the expected log predictive density in a new sample. It is more common to work with the deviance, obtained by multiplying the log-likelihood by \(-2\), i.e.,

\[ D = -2 \sum_{i = 1}^n \log p_{M_1}(y_i). \]
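
For a Gaussian linear model, this deviance is just \(-2\) times the maximized log-likelihood. A quick sketch on the 10-case subsample (`fit_lin` is an illustrative name; note that `deviance()` for `lm` objects returns the residual sum of squares, not this quantity):

fit_lin <- lm(Divorce ~ Marriage, data = wd_sub)
-2 * as.numeric(logLik(fit_lin))  # deviance as defined above, at the MLEs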

11.3.1 Experiment on Deviance

Now, let’s check the in-sample deviance and out-of-sample deviance of our waffle_divorce data with different polynomial functions. Here is a simple function for computing the deviance (with frequentist estimation, just for speed) for polynomials of different degrees:

Code
# Function for computing deviance with different polynomial
deviance_divorce <- function(degree = 1,
                             train = 10,
                             y = waffle_divorce$Divorce,
                             x = waffle_divorce$Marriage) {
    N <- length(y)
    # get training sample
    if (length(train) == 1) {
        train <- sample.int(N, train)
    }
    ntrain <- length(train)
    # Obtain design matrix
    X <- cbind(1, poly(x, degree, simple = TRUE))
    # Get elpd for training sample
    Xtrain <- X[train, ]
    ytrain <- y[train]
    betahat <- qr.solve(Xtrain, ytrain)  # estimated betas
    res_train <- ytrain - Xtrain %*% betahat
    sigmahat <- sqrt(sum(res_train^2) /
        (ntrain - 1 - degree)) # estimated sigma
    deviance_train <- -2 * sum(dnorm(res_train, sd = sigmahat, log = TRUE))
    res_test <- y[-train] - X[-train, ] %*% betahat
    deviance_test <- -2 * sum(dnorm(res_test, sd = sigmahat, log = TRUE))
    data.frame(
        degree = degree,
        sample = c("in-sample", "out-of-sample"),
        deviance = c(deviance_train / ntrain,
                     deviance_test / (N - ntrain))
    )
}

Below are the in-sample and out-of-sample deviance for the linear model:

deviance_divorce(degree = 1, train = train)

And for quadratic:

deviance_divorce(degree = 2, train = train)

In general, as you can see, the deviance is smaller for the current data than for the hold-out data. Note that because the training and testing data sets have different sizes, I divided the deviance by the sample size so that they can be compared.

Now let’s run an experiment to check the deviance for polynomials of different degrees, with a training sample size of 25:

set.seed(1733)
# Use `lapply()` to run polynomials of degrees 1 to 4, and `replicate()` to
# repeat the computation 1,000 times per degree. Each replication randomly
# samples 25 training observations, fits the polynomial, and computes the
# in-sample and out-of-sample deviance.
dev_list <- lapply(1:4, FUN = function(p) {
    results <- replicate(1000, deviance_divorce(degree = p, train = 25L), simplify = FALSE)
    do.call(rbind, results)
})
dev_df <- do.call(rbind, dev_list)
# Plot the results
dev_df |>
    ggplot(aes(x = degree, y = deviance, col = sample)) +
    stat_summary() +
    stat_summary(geom = "line") +
    labs(col = NULL)
No summary function supplied, defaulting to `mean_se()`
No summary function supplied, defaulting to `mean_se()`

As you can see, the in-sample deviance (red line) keeps decreasing, indicating that a more complex model fits the training data better, as is always the case. So if one were to use the in-sample deviance to determine which model is optimal, one would always choose the most complex model, just as with \(R^2\) (indeed, for linear models, the in-sample deviance conveys essentially the same information as \(R^2\)).

Now, look at the blue line, which represents the deviance computed using the coefficients obtained from the training set but applied to the remaining data. As you can see, the deviance reaches its minimum around the linear and quadratic models and then starts to increase, meaning that the more complex models do not fit the hold-out data well.

A statistical model is used to learn something from a data set that generalizes to other observations. Therefore, we should care about the blue line rather than the red one. The indices you will see in the remainder of this note are all attempts to approximate the blue line.

More complex models always fit the current data better, but may not generalize to other data. In other words, models that are too complex are not generalizable.

11.4 Information Criteria

We will illustrate the computation of information criteria with Marriage predicting Divorce:

m1 <- brm(Divorce ~ Marriage, data = waffle_divorce,
          prior = c(prior(student_t(4, 0, 5), class = "Intercept"),
                    prior(normal(0, 2), class = "b"),
                    prior(student_t(4, 0, 1), class = "sigma")),
          iter = 4000,
          seed = 2302,
          file = "07_m1"
)
Start sampling

11.4.1 Akaike Information Criteria (AIC)

Multiplying the elpd estimate from the previous section (the in-sample log-likelihood minus \(p\)) by \(-2\) gives deviance \(+\ 2p\). Evaluating the deviance at the maximum likelihood estimates (MLEs) of the parameters gives the formula for the AIC:

\[ \textrm{AIC} = D(\hat \theta) + 2p, \]

and \(p\) in AIC is just the number of parameters. As we have multiplied by a negative number, maximizing the estimate of elpd is equivalent to minimizing the AIC, so one would prefer a model with the smallest AIC.

The AIC is not Bayesian because it only uses point estimates (MLEs) of parameters rather than their posterior distributions. Also, it does not take into account any prior information.

# Frequentist model
m1_freq <- lm(m1$formula, data = m1$data)
AIC(m1_freq)
[1] -30.96869
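
As a check, the same value can be computed directly from the definition; a small sketch (here \(p = 3\): two regression coefficients plus \(\sigma\)):

# Deviance at the MLEs plus twice the number of parameters
-2 * as.numeric(logLik(m1_freq)) + 2 * 3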

11.4.2 Deviance Information Criteria (DIC)

The definition of the AIC assumes that the parameters are estimated by maximum likelihood. The DIC instead draws on the posterior distribution of the parameters. The general formula for the DIC is

\[ \textrm{DIC} = D(\bar{\boldsymbol{\mathbf{\theta}}}) + 2 p_D, \]

where \(\bar{\boldsymbol{\mathbf{\theta}}}\) is the posterior mean of the parameters and \(p_D = \mathrm{E}(D \mid \boldsymbol{\mathbf{y}}) - D(\bar{\boldsymbol{\mathbf{\theta}}})\) is the effective number of parameters estimated from the Markov chain. Although the DIC does take the prior distributions into account, it does not use the full posterior distributions of the parameters.

# Function to compute the DIC for a Gaussian brms model
dic_brmsfit <- function(object) {
    # Posterior mean deviance, E(D | y)
    Dbar <- -2 * mean(rowSums(log_lik(object)))
    # Deviance at the posterior means of the parameters
    res <- residuals(object)[ , "Estimate"]
    sigma <- posterior_summary(object, variable = "sigma")[ , "Estimate"]
    Dhat <- -2 * sum(dnorm(res, sd = sigma, log = TRUE))
    p <- Dbar - Dhat  # effective number of parameters, p_D
    elpd <- Dhat / -2 - p
    data.frame(elpd_dic = elpd, p_dic = p, dic = Dhat + 2 * p,
               row.names = "Estimate")
}
dic_brmsfit(m1)

11.4.3 Watanabe-Akaike Information Criteria (WAIC)

A further modification is to use the log pointwise posterior predictive density, with the effective number of parameters computed from the posterior variance of the log-likelihood of each observation:

\[ \textrm{WAIC} = -2 \sum_{i = 1}^n \log \mathrm{E}[p(y_i \mid \boldsymbol{\mathbf{\theta}}, \boldsymbol{\mathbf{y}})] + 2 p_\textrm{WAIC}, \]

where \(\mathrm{E}[p(y_i \mid \boldsymbol{\mathbf{\theta}}, \boldsymbol{\mathbf{y}})]\) is the posterior mean of the likelihood of the \(i\)th observation. The WAIC incorporates prior information, and the use of pointwise likelihood makes it more robust when the posterior distributions deviate from normality. In general, WAIC is a better estimate of the out-of-sample deviance than AIC and DIC.
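
To see where these numbers come from, here is a sketch of computing the WAIC by hand from the pointwise log-likelihood matrix (the built-in `waic()` below does the same with additional numerical safeguards and diagnostics):

ll <- log_lik(m1)                    # draws-by-observations log-likelihood matrix
lppd <- sum(log(colMeans(exp(ll))))  # log pointwise predictive density
p_waic <- sum(apply(ll, 2, var))     # effective number of parameters
-2 * (lppd - p_waic)                 # WAIC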

waic(m1)  # built-in function in brms
Warning: 
1 (2.0%) p_waic estimates greater than 0.4. We recommend trying loo instead.

Computed from 8000 by 50 log-likelihood matrix.

          Estimate  SE
elpd_waic     15.2 4.9
p_waic         3.2 0.9
waic         -30.3 9.9


11.4.4 Leave-One-Out Cross-Validation

The idea of cross-validation is to split the sample so that it imitates the scenario of estimating the parameters on part of the data and predicting the remaining part. The part used for estimation is called the training set, and the part used for prediction is called the validation set. Leave-one-out information criterion (LOO-IC) means that one uses \(N - 1\) observations as the training set and 1 observation as the validation sample, and repeats the process \(N\) times so that a different observation is predicted each time. Adding up the prediction results gives an estimate of the elpd that closely approximates what would be obtained by collecting new data and doing the validation. To make this concrete, we can go back to the waffle_divorce data with Marriage predicting Divorce. We can do this for case #1 (Alabama), as an example:

# Estimate the model without case #1
m1_no1 <- update(m1, newdata = waffle_divorce[-1, ])
# The log predictive density for case #1
mean(log_lik(m1_no1, newdata = waffle_divorce[1, ]))
[1] -0.8111212

Because LOO-IC requires fitting the model \(N\) times, it is generally very computationally intensive. There are, however, shortcuts for some models to make the computation faster. WAIC can also be treated as a fast approximation of LOO-IC, although LOO-IC is more robust and will be a better estimate of out-of-sample deviance. The loo package uses the so-called Pareto smoothed importance sampling (PSIS) to approximate LOO-IC without repeating the process \(N\) times.

Here is the LOO-IC for the model:

loo(m1)

Computed from 8000 by 50 log-likelihood matrix.

         Estimate  SE
elpd_loo     15.1 5.0
p_loo         3.3 1.0
looic       -30.2 9.9
------
MCSE of elpd_loo is 0.0.
MCSE and ESS estimates assume MCMC draws (r_eff in [0.7, 1.0]).

All Pareto k estimates are good (k < 0.7).
See help('pareto-k-diagnostic') for details.

You can save the WAIC and the LOO-IC information to the fitted result:

m1 <- add_criterion(m1, criterion = c("loo", "waic"))
Warning: 
1 (2.0%) p_waic estimates greater than 0.4. We recommend trying loo instead.
Automatically saving the model object in '07_m1.rds'

See Vehtari et al. (2017) for more discussions on WAIC and LOO-IC.


11.4.5 Example

Consider four potential models in predicting Divorce:

\[ \texttt{Divorce}_i \sim N(\mu_i, \sigma) \]

  • M1: Marriage
  • M2: Marriage, South, Marriage \(\times\) South
  • M3: South, smoothing spline of Marriage by South
  • M4: Marriage, South, MedianAgeMarriage, Marriage \(\times\) South, Marriage \(\times\) MedianAgeMarriage, South \(\times\) MedianAgeMarriage, Marriage \(\times\) South \(\times\) MedianAgeMarriage
# Note, m1 has been fit before; the `update()` function
# can be used to simply change the formula, and brms will
# determine whether it needs re-compiling.
# M2: Add South and interaction
m2 <- update(m1, formula = Divorce ~ Marriage * South,
             newdata = waffle_divorce)
m2 <- add_criterion(m2, c("loo", "waic"))
Warning: 
1 (2.0%) p_waic estimates greater than 0.4. We recommend trying loo instead.
# M3: Spline function for Marriage
m3 <- update(m1, formula = Divorce ~ South + s(Marriage, by = South),
             newdata = waffle_divorce,
             control = list(adapt_delta = .999))
m3 <- add_criterion(m3, c("loo", "waic"))
Warning: Found 2 observations with a pareto_k > 0.7 in model 'm3'. We recommend
to set 'moment_match = TRUE' in order to perform moment matching for
problematic observations.
Warning: 
4 (8.0%) p_waic estimates greater than 0.4. We recommend trying loo instead.
# M4: Three-way interactions
m4 <- update(m1, formula = Divorce ~ Marriage * MedianAgeMarriage * South,
             newdata = waffle_divorce,
             control = list(max_treedepth = 12))  # increase due to warning
m4 <- add_criterion(m4, c("loo", "waic"))
Warning: Found 1 observations with a pareto_k > 0.7 in model 'm4'. We recommend
to set 'moment_match = TRUE' in order to perform moment matching for
problematic observations.
Warning: 
3 (6.0%) p_waic estimates greater than 0.4. We recommend trying loo instead.

The first model has only Marriage as a predictor, which means that the coefficients for South and MedianAgeMarriage are assumed to be zero. The second model adds South and its interaction with Marriage. The third model includes a smoothing spline term (a flexible nonlinear function, still within the class of linear models), whereas the fourth model also includes MedianAgeMarriage and all two-way and three-way interactions. Now, we can compare the four models:

loo_compare(m1, m2, m3, m4)
   elpd_diff se_diff
m4  0.0       0.0   
m2 -5.3       4.1   
m3 -6.1       4.1   
m1 -8.5       4.3   
# m4 is the best
msummary(list(M1 = m1, M2 = m2, M3 = m3, M4 = m4),
         estimate = "{estimate} [{conf.low}, {conf.high}]",
         statistic = NULL, fmt = 2)
                                              | M1                   | M2                   | M3                   | M4
b_Intercept                                   | 0.61 [0.35, 0.87]    | 0.67 [0.41, 0.92]    | 0.94 [0.89, 0.99]    | 5.50 [1.86, 9.13]
b_Marriage                                    | 0.18 [0.05, 0.31]    | 0.13 [0.01, 0.26]    |                      | −1.20 [−2.88, 0.50]
sigma                                         | 0.17 [0.14, 0.21]    | 0.16 [0.13, 0.20]    | 0.15 [0.12, 0.19]    | 0.14 [0.11, 0.18]
b_Southsouth                                  |                      | −0.62 [−1.44, 0.19]  | 0.10 [−0.03, 0.22]   | 0.34 [−3.00, 3.88]
b_Marriage × Southsouth                       |                      | 0.36 [−0.03, 0.76]   |                      | 0.49 [−1.62, 2.81]
bs_sMarriage × SouthnonMsouth_1               |                      |                      | −0.46 [−3.34, 1.51]  |
bs_sMarriage × Southsouth_1                   |                      |                      | 1.27 [−2.02, 3.54]   |
sds_sMarriageSouthnonMsouth_1                 |                      |                      | 0.87 [0.05, 2.60]    |
sds_sMarriageSouthsouth_1                     |                      |                      | 0.48 [0.02, 2.48]    |
b_MedianAgeMarriage                           |                      |                      |                      | −1.72 [−3.10, −0.32]
b_Marriage × MedianAgeMarriage                |                      |                      |                      | 0.45 [−0.22, 1.10]
b_MedianAgeMarriage × Southsouth              |                      |                      |                      | −0.37 [−1.72, 0.96]
b_Marriage × MedianAgeMarriage × Southsouth   |                      |                      |                      | −0.07 [−1.08, 0.88]
Num.Obs.                                      | 50                   | 50                   | 50                   | 50
R2                                            | 0.139                | 0.305                | 0.388                | 0.490
R2 Adj.                                       | 0.072                | 0.209                | 0.164                | 0.367
ELPD                                          | 15.1                 | 18.3                 | 17.5                 | 23.7
ELPD s.e.                                     | 5.0                  | 5.5                  | 5.9                  | 6.1
LOOIC                                         | −30.2                | −36.7                | −35.1                | −47.3
LOOIC s.e.                                    | 9.9                  | 11.0                 | 11.9                 | 12.3
WAIC                                          | −30.3                | −36.8                | −36.7                | −48.1
RMSE                                          | 0.17                 | 0.15                 | 0.14                 | 0.13

Model 4 has the lowest LOO-IC, so one may conclude that Model 4 is the best model among the four, for prediction purposes.