Model Comparison

PSYC 573

Guiding Questions

  • What is overfitting and why is it problematic?
  • How can we measure the closeness of a model to the true model?
    • What do information criteria do?

In-Sample and Out-Of-Sample Prediction

  • Example: randomly sample 10 of the 50 states; a flexible model can fit those data closely yet predict the remaining states poorly

Underfitting and Overfitting

Complex models require more data

  • Too little data for a complex model: overfitting
  • A model too simple for the data: underfitting

Prediction of Future Observations

  • The more a model captures the noise in the original data, the less well it predicts future observations

Terminology

Index          What it is (roughly)                                           Better when index is
Entropy        Amount of uncertainty/information in the outcome variable(s)   —
KL divergence  “Discrepancy” from the “true” model                            smaller
Deviance       \(-2\) \(\times\) in-sample elpd                               smaller
elpd           “Fit” to the sample data                                       larger
AIC/WAIC       Out-of-sample prediction error                                 smaller
LOOIC          Out-of-sample prediction error                                 smaller

What Is A Good Model?

  • Closeness of the proposed model (\(M_1\)) to a “true” model (\(M_0\))
    • Kullback-Leibler divergence:
      \(D_\textrm{KL}(M_0 \parallel M_1) = E_{M_0}[\log P_{M_0}(\tilde {\mathbf{y}})] - E_{M_0}[\log P_{M_1}(\tilde {\mathbf{y}})]\)
      = negative entropy of \(M_0\) \(-\) elpd of \(M_1\)
    • elpd: expected log predictive density: \(E_{M_0}[\log P_{M_1}(\tilde {\mathbf{y}})]\)
  • Choose the model with the smallest \(D_\textrm{KL}\)
    • When \(M_0 = M_1\), \(D_\textrm{KL} = 0\)
    • Because the negative entropy of \(M_0\) is the same for every proposed model, this is equivalent to choosing the model with the largest elpd

Example

  • True model of data: \(M_0\): \(y \sim N(3, 2)\)
  • \(M_1\): \(y \sim N(3.5, 2.5)\)
  • \(M_2\): \(y \sim \mathrm{Cauchy}(3, 2)\)

Negative entropy of \(M_0\): \(E_{M_0}[\log P_{M_0}(\tilde y)] = -2.112\)

          elpd     \(D_\textrm{KL}(M_0 \parallel M_\cdot)\)
\(M_1\)   -2.175   0.063
\(M_2\)   -2.371   0.259
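
These values can be reproduced by numerical integration. Below is a minimal R sketch; it assumes \(N(3, 2)\) denotes a mean of 3 and a standard deviation of 2:

# Log densities of the true model (M0) and the two candidate models
log_p0 <- function(y) dnorm(y, mean = 3, sd = 2, log = TRUE)
log_p1 <- function(y) dnorm(y, mean = 3.5, sd = 2.5, log = TRUE)
log_p2 <- function(y) dcauchy(y, location = 3, scale = 2, log = TRUE)
# E_{M0}[log P_M(y)]: expected log density of a model under the true model
e_log <- function(log_p) {
    integrate(function(y) dnorm(y, mean = 3, sd = 2) * log_p(y),
              lower = -Inf, upper = Inf)$value
}
e_log(log_p0)                  # -2.112 (negative entropy of M0)
e_log(log_p1)                  # -2.175 (elpd of M1)
e_log(log_p2)                  # -2.371 (elpd of M2)
e_log(log_p0) - e_log(log_p1)  # 0.063 (D_KL for M1)
e_log(log_p0) - e_log(log_p2)  # 0.259 (D_KL for M2)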

Expected log pointwise predictive density

\[ \sum_{i = 1}^n \log P_{M_1} (y_i) \]

Note: elpd is a function of sample size

  • Problem: elpd depends on \(M_0\), which is unknown
    • Estimating elpd from the same sample used to fit the model underestimates the discrepancy
    • Need to estimate elpd using an independent sample

Overfitting

Training set: 25 states; Test set: 25 remaining states

  • The more complex the model, the larger the discrepancy between the in-sample and out-of-sample elpd, as the sketch below illustrates
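
A minimal R sketch of this comparison, assuming a data frame waffle with one row per state and columns Divorce and Marriage (e.g., the WaffleDivorce data from the rethinking package):

# Split the 50 states into a training set and a test set
set.seed(573)
train_idx <- sample(nrow(waffle), 25)
train <- waffle[train_idx, ]
test <- waffle[-train_idx, ]
# Fit a simple linear model on the training set
fit <- lm(Divorce ~ Marriage, data = train)
# Log predictive density, summed over the observations in `data`
lpd <- function(data) {
    sum(dnorm(data$Divorce,
              mean = predict(fit, newdata = data),
              sd = summary(fit)$sigma, log = TRUE))
}
lpd(train)  # in-sample elpd: overly optimistic
lpd(test)   # out-of-sample elpd: typically lower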

Information Criteria (IC)

Approximate discrepancy between in-sample and out-of-sample elpd

  • IC = \(-2\) \(\times\) (in-sample elpd \(-\) \(p\))

  • \(p\) = penalty for model complexity

    • function of number of parameters

Choose a model with smaller IC

Bayesian ICs: DIC, WAIC, etc.
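
For example, the AIC penalizes by the number of estimated parameters \(k\):

\[ \mathrm{AIC} = -2 \sum_{i = 1}^n \log P(y_i \mid \hat{\theta}_\mathrm{MLE}) + 2k \]

The WAIC instead uses a data-based penalty, \(p_\mathrm{WAIC} = \sum_{i = 1}^n \mathrm{Var}_\text{post}[\log P(y_i \mid \theta)]\), which applies even when the effective number of parameters is hard to count (e.g., in hierarchical models).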

Cross-Validation

  • Split the sample into K parts

  • Fit the model to K - 1 parts, and obtain the elpd for the held-out part, as in the sketch below
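
A sketch of K-fold cross-validation, reusing the hypothetical waffle data frame from the earlier sketch:

# 5-fold cross-validated log predictive density
K <- 5
folds <- sample(rep(1:K, length.out = nrow(waffle)))
cv_lpd <- sapply(1:K, function(k) {
    fit_k <- lm(Divorce ~ Marriage, data = waffle[folds != k, ])
    held <- waffle[folds == k, ]
    sum(dnorm(held$Divorce,
              mean = predict(fit_k, newdata = held),
              sd = summary(fit_k)$sigma, log = TRUE))
})
sum(cv_lpd)  # estimate of out-of-sample elpd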

Leave-one-out: K = N

  • Very computationally intensive

  • loo package: approximates LOO using Pareto-smoothed importance sampling (PSIS), without refitting the model N times
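
For reference, a sketch of how a model like m1 below might have been fit with brms before calling loo(); the formula and data frame are assumptions based on the model comparison later in this section (priors omitted):

library(brms)
# 4 chains x 2,000 post-warmup draws = 8,000 draws; 50 states
m1 <- brm(Divorce ~ Marriage, data = waffle, iter = 4000, seed = 573)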

loo(m1)

Computed from 8000 by 50 log-likelihood matrix.

         Estimate  SE
elpd_loo     15.0 5.0
p_loo         3.4 1.0
looic       -30.1 9.9
------
MCSE of elpd_loo is 0.0.
MCSE and ESS estimates assume MCMC draws (r_eff in [0.7, 1.0]).

All Pareto k estimates are good (k < 0.7).
See help('pareto-k-diagnostic') for details.

Comparing Models

\[ \texttt{Divorce}_i \sim N(\mu_i, \sigma) \]

  • M1: Marriage
  • M2: Marriage, South, Marriage \(\times\) South
  • M3: South, smoothing spline of Marriage by South
  • M4: Marriage, South, MedianAgeMarriage, Marriage \(\times\) South, Marriage \(\times\) MedianAgeMarriage, South \(\times\) MedianAgeMarriage, Marriage \(\times\) South \(\times\) MedianAgeMarriage

                                               M1      M2      M3      M4
b_Intercept                                  0.61    0.67    0.94    5.52
b_Marriage                                   0.18    0.13           −1.20
b_Southsouth                                        −0.63    0.10    0.31
b_Marriage × Southsouth                              0.37            0.54
bs_sMarriage × Southnonsouth_1                              −0.47
bs_sMarriage × Southsouth_1                                  1.21
sds_sMarriageSouthnonsouth_1                                 0.87
sds_sMarriageSouthsouth_1                                    0.50
b_MedianAgeMarriage                                                 −1.72
b_Marriage × MedianAgeMarriage                                       0.45
b_MedianAgeMarriage × Southsouth                                    −0.34
b_Marriage × MedianAgeMarriage × Southsouth                         −0.09
ELPD                                        15.0    18.2    17.7    23.5
ELPD s.e.                                    5.0     5.5     5.9     6.2
LOOIC                                      −30.1   −36.5   −35.3   −47.1
LOOIC s.e.                                   9.9    11.0    11.7    12.3
RMSE                                         0.17    0.15    0.14    0.13
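
A sketch of how such a comparison might be run with brms and loo, continuing from the hypothetical waffle data frame and the m1 fit above (formulas for M2–M4 are reconstructed from the model list; South is assumed to be a factor, and priors are omitted):

m2 <- brm(Divorce ~ Marriage * South, data = waffle,
          iter = 4000, seed = 573)
m3 <- brm(Divorce ~ South + s(Marriage, by = South), data = waffle,
          iter = 4000, seed = 573)
m4 <- brm(Divorce ~ Marriage * South * MedianAgeMarriage, data = waffle,
          iter = 4000, seed = 573)
# Rank the models by out-of-sample elpd (larger elpd = smaller LOOIC = better)
loo_compare(loo(m1), loo(m2), loo(m3), loo(m4))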

Notes for Using ICs

  • The models must have the same outcome variable, with the same transformation
  • The models must be fit to the same sample
    • The sample size can change when adding a predictor that has missing values
  • Cannot compare discrete and continuous models
    • E.g., Poisson vs. normal

Other Techniques

See notes on stacking and regularization

  • Stacking: average predictions from multiple models
  • Regularization: using sparsity-inducing priors to identify major predictors
  • Variable selection: using projection-based methods