-
Notifications
You must be signed in to change notification settings - Fork 0
/
030-methods.Rmd
978 lines (679 loc) · 46.9 KB
/
030-methods.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
```{r ch030-setup, include=FALSE}
library(tidyverse)
library(kableExtra)
library(rstan)
library(patchwork)
library(bayesplot)
library(rethinking)
library(loo)
```
# Background {#methods}
## Fitting the psychometric function using GLMs {#glms}
Psychometric functions are commonly fit using generalized linear models which allows for the linear model to be related to the response variable via a link function, which for psychometric functions comes from the family of S-shaped curves called a sigmoid.
Commonly GLMs are fit using maximum likelihood estimation. The outcome of a single experiment can be represented as the result of a Bernoulli trial. The psychometric function, $F(x; \theta)$, determines the probability that the outcome is 1:
\setstretch{1.0}
\begin{align*}
Y &\sim \textrm{Bernoulli}(\pi) \\
\pi &= P(Y=1 \vert x; \theta) = F(x; \theta)
\end{align*}
\setstretch{2.0}
If $P(Y=1 | x; \theta) = F(x;\theta)$, then $P(Y = 0 | x; \theta) = 1 - F(x;\theta)$, and hence the probability of an outcome is:
\begin{equation}
P(Y=y | x; \theta) = F(x;\theta)^y(1-F(x;\theta))^{1-y}
(\#eq:bernproby)
\end{equation}
The likelihood $\mathcal{L}$ of observing a set of independent and identically distributed data given the parameterization $\theta$ is determined by taking the product of the probabilities for each datum:
\begin{equation}
\begin{split}
\mathcal{L} &= \prod_{i}^{N} P(y_i | x_i; \theta) \\
&= \prod_{i}^{N}F(x_i;\theta)^{y_i}(1-F(x_i;\theta))^{1-y_i}
\end{split}
(\#eq:bernlik)
\end{equation}
For some $\theta$, $\mathcal{L}$ achieves a maximum value, so maximum likelihood estimation determines the parameters that maximizes the likelihood of the observed data. Equation \@ref(eq:bernlik) is commonly expressed in terms of its logarithm as a function of $\theta$, where due to monotonicity of the logarithm, is an equivalent optimization problem:
\begin{equation}
\ln \mathcal{L}(\theta | y, x) = \sum_{i}^{N} y_i \ln\left(F(x_i;\theta)\right) + (1-y_i) \ln\left(1 - F(x_i;\theta))\right)
(\#eq:bernloglik)
\end{equation}
The classical approach is to differentiate \@ref(eq:bernloglik) with respect to $\theta$, set the equation equal to $0$, and solve for $\theta$:
\begin{equation}
\frac{d}{d\theta} \ln \mathcal{L}(\theta) = 0
(\#eq:ddthetaloglik)
\end{equation}
However, no closed form expression exists for the solution to \@ref(eq:ddthetaloglik), and so numerical root finding methods such as gradient descent are used to iteratively find the maximum likelihood solution. The likelihood function, $\mathcal{L}(\theta | y)$, also has a connection to Bayes' Theorem:
\begin{equation}
\mathcal{L}(\theta | y) = \frac{P(y | \theta) P(\theta)}{P(y)}
(\#eq:bayeslik)
\end{equation}
In \@ref(eq:bayeslik), $P(y | \theta)$ is the likelihood of the data given $\theta$, $P(\theta)$ is the prior distribution for the parameter $\theta$, and $P(y)$ is the probability of the data averaged over the parameter space. When the prior distribution is uniform over the parameter space, then the Bayesian _maximum a posteriori_ (MAP) estimate coincides with the maximum likelihood estimate.
There are common situations where MLE fails such as complete separation in the data. This is when the positive class can be separated from the negative class by a set of linear predictors (shown in figure \@ref(fig:ch030-comp-sep)).
```{r ch030-comp-sep, fig.cap="Example of complete separation in the data. All of the \`0\`-responses can be separated from the \`1\`-responses by some value of \`x\` between \`-1\` and \`0\`."}
y <- c(0, 0, 1, 1, 1)
x <- c(-2, -1, 0, 1, 2)
sep_data <- data.frame(y=y, x=x)
ggplot(sep_data, aes(x, y)) +
geom_point(size = 3) +
theme_minimal()
```
For a slope-intercept model, the MLE for the slope is infinity, and the location is undefined. Figure \@ref(fig:ch030-mle-comp-sep) displays a grid of log-likelihoods for a range of scale (inverse slope) and location parameters. The log-likelihood increases to zero as the scale decreases to zero (slope increases to infinity). Numerical root finding methods will converge after a finite number of iterations -- usually by a stopping condition such as the difference in log-likelihoods between steps.
```{r ch030-mle-comp-sep, fig.cap="Grid of log-likelihoods for the completely separable data. The log-likelihood increases to zero as the scale decreases to zero (slope increases to infinity). For smaller slopes, the MLE for the location is \`0.5\` -- the median of the inner-most datum from each class. At larger slopes, the MLE for the location matters little."}
f <- function(m, s, y, x) {
sum( y * plogis(x, m, s, log.p = TRUE) +
(1 - y) * plogis(x, m, s, lower.tail = FALSE, log.p = TRUE) )
}
ms <- seq(-1, 0, length.out = 201)
ss <- 10^seq(1, -5, length.out = 201)
log_lik <- expand_grid(ms = ms, ss = ss)
log_lik$`Log Likelihood` <- apply(log_lik, 1, function(r) f(r[[1]], r[[2]], y, x))
ggplot(log_lik, aes(ms, ss, fill = `Log Likelihood`)) +
geom_tile() +
scale_fill_viridis_c() +
scale_y_log10() +
labs(x = "Location", y = "Scale (inverse slope)") +
theme_minimal()
```
When the separable data from above is fit using `R`'s `glm` function, there is a warning about fitted probabilities being $0$ or $1$, indicating that the slope is very steep.
\setstretch{1.0}
```{r ch030-Rubber Blue, echo=TRUE, warning=TRUE, message=FALSE}
fit <- glm(y ~ x, family = binomial("logit"))
coefficients(fit)
```
\setstretch{2.0}
The coefficients of the linear predictor are in slope-intercept form ($23 + 46 x$). Rearranging the coefficients into location-scale form yields:
\begin{equation}
\theta = \frac{x - (-0.5)}{1/46}
(\#eq:rglmmle)
\end{equation}
However, this is not the true maximum likelihood estimate. Table \@ref(tab:ch030-Quality-Surreal-Street) shows that the log-likelihood is still increasing as the scale decreases to zero. The change between $-2.05\times 10^{-10}$ and $-3.86\times10^{-22}$ is very large on a relative scale, but computers cannot tell the difference on an absolute scale. The precision of most modern computers is around $10^{-15}$, and anything smaller is treated as numerically zero.
```{r ch030-Quality-Surreal-Street}
tibble(
Scale = c("1/10", "1/46*", "1/100", "1/1000"),
`Log-Likelihood` = c(f(-0.5, 1/10, y, x),
f(-0.5, 1/46, y, x),
f(-0.5, 1/100, y, x),
f(-0.5, 1/1000, y, x))
) %>%
mutate(`Log-Likelihood` = formatC(`Log-Likelihood`, format="e", digits=2)) %>%
kable(booktabs = TRUE,
caption = "Log-likelihood estimates for different scale parameters. *Indicates the MLE solution from R.") %>%
kable_styling(latex_options = "hold_position")
```
Using a weakly-informative prior for the slope instead of the non-informative uniform distribution will allow for a proper MAP estimate, but care should be taken to consider what prior is appropriate, or why the data is separable in the first place. If an experiment can be improved to avoid the situation, that should be the first action.
The models in this paper rely on MCMC techniques to sample from the posterior distribution. In certain models, the posterior distribution takes on a familiar form given a specific prior distribution. These conjugate likelihood-priors exist for some exponential family models, like for the binomial-beta, Poisson-gamma, and gamma-gamma. The benefit of conjugate prior models are that they do not require computationally expensive sampling techniques to derive the posterior distribution. @chen2003conjugate discuss conjugate priors for GLMs by specifying a prior for the scale (strength of belief in the prior) and for the observed data, $y_o$, and then relates the prior in $y_o$ to the parameters, $\beta$. This methodology can be convenient, but limits model flexibility.
## Multilevel modeling
In classical regression, a simple single-level slope-intercept model can be specified as:
\begin{equation}
y_i = \alpha + \beta x_i + \epsilon_i
(\#eq:single-level-fixed)
\end{equation}
The slope and intercept is fixed for all observations in the data set. If there is a categorical variable with $J$ levels, then a varying-slope varying-intercept (or simply varying effects) model can be specified as:
\begin{equation}
y_i = \alpha_{j[i]} + \beta_{j[i]} x_i + \epsilon_i
(\#eq:single-level-fixed-varying)
\end{equation}
where $j[i]$ indexes the group for observation $i$. In a multilevel model, the coefficients are modeled by a separate regression:
\setstretch{1.0}
\begin{equation}
\begin{split}
y_i &= \alpha_{j[i]} + \beta_{j[i]} x_i \\
\alpha_j &\sim \mathcal{N}(a_0 + a_1 u_j, \sigma_{\alpha}^2) \\
\beta_j &\sim \mathcal{N}(b_0 + b_1 u_j, \sigma_{\beta}^2) \\
\end{split}
(\#eq:multilevel-varying)
\end{equation}
\setstretch{2.0}
In \@ref(eq:multilevel-varying), $x$ is a population level predictor, and $u$ is a group level predictor. Equations \@ref(eq:single-level-fixed) and \@ref(eq:single-level-fixed-varying) represent the two extremes of excluding a categorical variable from a model (complete pooling) and fitting a regression coefficient for each level in the categorical variable (no pooling). Multilevel modeling provides a compromise between these two extremes, resulting in partial pooling estimates.
For a simple intercept-only model, the partial pooling estimate of the intercept, $\hat{\alpha}_j$ is a weighted average of the mean of the of the observations in the group (no pooling estimate, $\bar{y}_j$) and the mean over all groups (complete pooling estimate, $\bar{y}$):
$$
\hat{\alpha}_j \approx \frac{\frac{n_j}{\sigma_y^2} \bar{y}_j + \frac{1}{\sigma_\alpha^2} \bar{y}}{\frac{n_j}{\sigma_y^2} + \frac{1}{\sigma_\alpha^2}}
$$
where $n_j$ is the number of observations in group level $j$, $\sigma_y^2$ is the within-group variance, and $\sigma_\alpha^2$ is the variance between the group level averages. As the between-group variance goes to infinity (or as $n_j \rightarrow \infty$), the partial pooling estimates approach the no pooling estimates. When there are fewer samples, the partial pooling estimate is closer to the overall average. In this way, partial pooling reflects the relative information contained within each group. For an in-depth introduction to multilevel modeling, see @gelman2006data.
## Hamiltonian Monte Carlo and NUTS
We will be using `Stan` for model fitting throughout this paper. `Stan` allows for MCMC sampling of Bayesian models using a variant of Hamiltonian Monte Carlo called the No-U-Turn sampler (NUTS). HMC can be though of as a physics simulation: a massless "particle" is imparted with a random direction and some amount of kinetic energy in a probability field, and is stopped after a number of steps, $L$, called leapfrog steps. The stopping point is the new proposal sample. The NUTS algorithm removes the need for leapfrog steps by stopping automatically when the particle begins to double back and retrace its steps [@hoffman2014no]. This sampling scheme has a much higher rate of accepted samples, and also comes with many built-in diagnostic tools that let us know when the sampler is having trouble efficiently exploring the posterior.
The NUTS algorithm samples in two phases: a warm-up phase and a sampling phase. During the warm-up phase, the sampler is automatically tuning three internal parameters that can significantly affect the sampling efficiency.
## Non-centered parameterization
Because HMC is a physics simulation, complicated geometry or posteriors with steep slopes can be difficult to traverse if the step size is too course. The solution is to explore a simpler geometry, and then transform the sample into the target distribution. Reparameterization is especially important for hierarchical models. For `Stan`, sampling from a standard normal or uniform distribution is very easy, and so the non-centered parameterization can alleviate divergent transitions. Here we present three reparameterizations that we use in the next chapter. The left-hand side shows the centered parameterization, and the right-hand side shows the non-centered parameterization.
**Gaussian distribution** with mean $\mu$ and standard deviation $\sigma$:
\setstretch{1.0}
\begin{equation}
\begin{split}
X &\sim \mathcal{N}(\mu, \sigma^2)
\end{split}
\quad \Longrightarrow \quad
\begin{split}
Z &\sim \mathcal{N}(0, 1^2) \\
X &= \mu + \sigma \cdot Z
\end{split}
(\#eq:nc-normal)
\end{equation}
\setstretch{2.0}
**Log-Normal distribution** with mean-log $\mu$ and standard deviation-log $\sigma$:
\setstretch{1.0}
\begin{equation}
\begin{split}
X &\sim \mathrm{Lognormal}(\mu, \sigma^2)
\end{split}
\quad \Longrightarrow \quad
\begin{split}
Z &\sim \mathcal{N}(0, 1^2) \\
X &= \exp\left(\mu + \sigma \cdot Z\right)
\end{split}
(\#eq:nc-lognormal)
\end{equation}
\setstretch{2.0}
**Cauchy distribution** with location $\mu$ and scale $\tau$:
\setstretch{1.0}
\begin{equation}
\begin{split}
X &\sim \mathrm{Cauchy}(\mu, \tau)
\end{split}
\quad \Longrightarrow \quad
\begin{split}
U &\sim \mathcal{U}\left(-\frac{\pi}{2}, \frac{\pi}{2}\right) \\
X &= \mu + \tau \cdot \tan(U)
\end{split}
(\#eq:nc-cauchy)
\end{equation}
\setstretch{2.0}
## Methods for model checking {#model-checking}
Below is the 8 Schools data [@gelman2013bayesian] which is a standard textbook example for introducing multilevel modeling. Here we use it to illustrate essential MCMC model checking tools.
```{r ch030-Digital Knife, echo=TRUE}
schools_dat <- list(
J = 8,
y = c(28, 8, -3, 7, -1, 1, 18, 12),
sigma = c(15, 10, 16, 11, 9, 11, 10, 18)
)
```
```{stan ch030-Rusty Limousine, output.var="schools"}
data {
int<lower=0> J; // number of schools
real y[J]; // estimated treatment effects
real<lower=0> sigma[J]; // standard error of effect estimates
}
parameters {
real mu; // population treatment effect
real<lower=0> tau; // standard deviation in treatment effects
vector[J] eta; // unscaled deviation from mu by school
}
transformed parameters {
vector[J] theta = mu + tau * eta; // school treatment effects
}
model {
target += normal_lpdf(eta | 0, 1); // prior log-density
target += normal_lpdf(y | theta, sigma); // log-likelihood
}
generated quantities {
vector[J] log_lik;
for (j in 1:J) {
log_lik[j] = normal_lpdf(y[j] | theta[j], sigma[j]);
}
}
```
```{stan ch030-Mercury Rainbow, output.var="schools_cp"}
data {
int<lower=0> J;
vector[J] y;
vector<lower=0>[J] sigma;
}
parameters {
real mu;
real<lower=0> tau;
vector[J] theta;
}
model {
mu ~ normal(0, 10);
tau ~ cauchy(0, 10);
theta ~ normal(mu, tau);
y ~ normal(theta, sigma);
}
generated quantities {
vector[J] log_lik;
for (j in 1:J) {
log_lik[j] = normal_lpdf(y[j] | theta[j], sigma[j]);
}
}
```
```{stan ch030-Jupiter Wild, output.var="schools_ncp"}
data {
int<lower=0> J;
vector[J] y;
vector<lower=0>[J] sigma;
}
parameters {
real mu;
real<lower=0> tau;
vector[J] eta;
}
transformed parameters {
vector[J] theta;
theta = mu + tau * eta;
}
model {
mu ~ normal(0, 10);
tau ~ cauchy(0, 10);
eta ~ normal(0, 1); // implies theta ~ normal(mu, tau)
y ~ normal(theta, sigma);
}
generated quantities {
vector[J] log_lik;
for (j in 1:J) {
log_lik[j] = normal_lpdf(y[j] | theta[j], sigma[j]);
}
}
```
```{r ch030-Mysterious Python}
m_schools <- sampling(schools, data = schools_dat,
refresh = 0, seed = 1234125)
p_schools <- as.array(m_schools)
```
**Trace Plots.** Trace plots have been used since the conception of MCMC to assess chain sampling efficiency and quality. They are visual aids that let the practitioner asses the qualitative health of the chains, looking for properties such as autocorrelation, heteroskedacity, non-stationarity, and convergence. Healthy chains are well-mixed and stationary. It's often better to run more chains during the model building process so that issues with mixing and convergence can be diagnosed sooner. One unhealthy chain can be indicative of a poorly specified model. The addition of more chains also contributes to the estimation of the split $\hat{R}$ statistic (discussed below). Figure \@ref(fig:ch030-Brave-Moose) shows what a set of healthy chains looks like -- the chains are nearly indistinguishable, fluctuate around the same mean, and do not show any long periods of being in the same location.
```{r ch030-Brave-Moose, fig.cap="An example of healthy chains."}
mcmc_trace(p_schools, pars = c("mu", "tau"))
```
As the number of parameters in a model grows, it becomes exceedingly tedious to check the trace plots of all parameters, and so numerical summaries are helpful to flag potential issues within the model.
**R-hat Statistic.** The most common summary statistic for chain health is the potential scale reduction factor [@gelman1992inference] that measures the ratio of between chain variance and within chain variance. When the two have converged, the ratio is one. We've shared examples of healthy chains which would also have healthy $\hat{R}$ values, but it's valuable to also share an example of a bad model.
The initial starting parameters for this model are intentionally set to vary between $-10$ and $10$ -- in contrast to the default range of $(-2, 2)$ -- and with only a few samples drawn in order to artificially drive up the split $\hat{R}$ statistic. The model is provided as supplementary code in the [appendix](#code).
\setstretch{1.0}
```{r ch030-Eager Galaxy, echo=TRUE}
fit_cp <- sampling(schools_cp, data = schools_dat, refresh = 0,
iter = 40, init_r = 10, seed = 671254821)
```
\setstretch{2.0}
`Stan` warns about many different issues with this model, but the R-hat is the one of interest. The largest is $1.71$ which is incredibly large. Gelmen suggests using a threshold of $1.10$ to flag unhealthy chains.
```{r ch030-Rocky-Test}
p_cp <- as.array(fit_cp)
mcmc_trace(p_cp, pars = c("mu", "tau"))
```
These chains do not look good at all -- they have not converged to a stationary distribution. This is by design, however, as only 40 samples were simulated. The $\hat{R}$ values are listed in table \@ref(tab:ch030-Ninth-Finger).
```{r ch030-Ninth-Finger}
rhat(fit_cp) %>% t %>% t %>%
as_tibble(rownames = "Parameter") %>%
rename("Rhat" = V1) %>%
head(2) %>%
kable(booktabs = TRUE, caption = "Split R-hat values from the 8 Schools example.") %>%
kable_styling(latex_options = "hold_position")
```
To calculate the (non split) $\hat{R}$, first calculate the between-chain variance, and then the average chain variance. For $M$ independent Markov chains, $\{\theta_1, \ldots \theta_M\}$, with $N$ samples each, the between-chain variance is:
\setstretch{1.0}
$$
B = \frac{N}{M-1}\sum_{m=1}^{M}\left(\bar{\theta}_m - \bar{\theta}\right)^2
$$
where
$$
\bar{\theta}_m = \frac{1}{N}\sum_{n=1}^{N}\theta_{m}^{(n)}
$$
and
$$
\bar{\theta} = \frac{1}{M}\sum_{m=1}^{M}\bar{\theta}_m
$$
The within-chain variance, $W$, is the variance averaged over all the chains:
$$
W = \frac{1}{M}\sum_{m=1}^{M} s_{m}^2
$$
where
$$
s_{m}^2 = \frac{1}{N-1}\sum_{n=1}^{N}\left(\theta_{m}^{(n)} - \bar{\theta}_m\right)^2
$$
\setstretch{2.0}
The weighted mixture of the within-chain and cross-chain variation is:
\setstretch{1.0}
$$
\hat{var} = \frac{N-1}{N} W + \frac{1}{N} B
$$
and finally the $\hat{R}$ statistic is:
$$
\hat{R} = \sqrt{\frac{\hat{var}}{W}}
$$
Here is the calculation in `R`:
```{r ch030-Steamy Eyelid, echo=TRUE}
param <- "mu"
theta <- p_cp[,,param]
N <- nrow(theta)
M <- ncol(theta)
theta_bar_m <- colMeans(theta)
theta_bar <- mean(theta_bar_m)
B <- N / (M - 1) * sum((theta_bar_m - theta_bar)^2)
s_sq_m <- apply(theta, 2, var)
W <- mean(s_sq_m)
var_hat <- W * (N - 1) / N + B / N
(mu_Rhat <- sqrt(var_hat / W))
```
\setstretch{2.0}
The $\hat{R}$ statistic is smaller than the split $\hat{R}$ value provided by `Stan`. This is a consequence of steadily increasing or decreasing chains. The split value does what it sounds like, and splits the samples from the chains in half -- effectively doubling the number of chains and halving the number of samples per chain. In this way, the measure is more robust in detecting unhealthy chains. This also highlights the utility in using both visual and statistical tools to evaluate models. Here is the calculation of the split $\hat{R}$:
\setstretch{1.0}
```{r ch030-Steady Brave Windshield, echo=TRUE}
param <- "mu"
theta_tmp <- p_cp[,,param]
theta <- cbind(theta_tmp[1:10,], theta_tmp[11:20,])
N <- nrow(theta)
M <- ncol(theta)
theta_bar_m <- colMeans(theta)
theta_bar <- mean(theta_bar_m)
B <- N / (M - 1) * sum((theta_bar_m - theta_bar)^2)
s_sq_m <- apply(theta, 2, var)
W <- mean(s_sq_m)
var_hat <- W * (N - 1) / N + B / N
(mu_Rhat <- sqrt(var_hat / W))
```
\setstretch{2.0}
We've successfully replicated the calculation of the split $\hat{R}$. @vehtari2020rank propose an improved rank-normalized $\hat{R}$ for assessing the convergence of MCMC chains, and also suggest using a threshold of $1.01$.
**Effective Sample Size.** Samples from Markov Chains are typically autocorrelated, which can increase uncertainty of posterior estimates. The solution is generally to reparameterize the model to avoid steep log-posterior densities. When the HMC algorithm is exploring difficult geometry, it can get stuck in regions of high densities, which means that there is more correlation between successive samples. Equation \@ref(eq:schools-ncp) shows the centered (left) and non-centered (right) parameterization of the 8-Schools model, and the benefit of reparameterization is conveyed by the ratio of effective sample size to actual sample size in figure \@ref(fig:ch030-Timely-Nitrogen).
\setstretch{1.0}
\begin{equation}
\begin{split}
\sigma &\sim \mathcal{U}(0, \infty) \\
\mu &\sim \mathcal{N}(0, 10) \\
\tau &\sim \mathrm{HalfCauchy}(0, 10) \\
\theta &\sim \mathcal{N}(\mu, \tau) \\
y &\sim \mathcal{N}(\theta, \sigma)
\end{split}
\quad \Longrightarrow \quad
\begin{split}
\sigma &\sim \mathcal{U}(0, \infty) \\
\mu &\sim \mathcal{N}(0, 10) \\
\tau &\sim \mathrm{HalfCauchy}(0, 10) \\
\eta &\sim \mathcal{N}(0, 1) \\
\theta &= \mu + \tau \cdot \eta \\
y &\sim \mathcal{N}(\theta, \sigma)
\end{split}
(\#eq:schools-ncp)
\end{equation}
\setstretch{2.0}
```{r ch030-Third Tuna}
fit_cp <- sampling(schools_cp, data = schools_dat, refresh = 0,
seed = 1734652)
fit_ncp <- sampling(schools_ncp, data = schools_dat, refresh = 0,
seed = 783652)
```
```{r ch030-Timely-Nitrogen, fig.cap="Ratio of N\\_eff to actual sample size. Low ratios imply high autocorrelation which can be alleviated by reparameterizing the model or by thinning."}
cp_neff <- neff_ratio(fit_cp, pars = c("theta", "mu", "tau"))
ncp_neff <- neff_ratio(fit_ncp, pars = c("theta", "mu", "tau"))
p_ncp <- mcmc_neff(ncp_neff) +
labs(title = "Non-centered Parameterization")
p_cp <- mcmc_neff(cp_neff) +
labs(title = "Centered Parameterization")
p_cp / p_ncp
```
As the strength of autocorrelation generally decreases at larger lags, a simple prescription to decrease autocorrelation between samples and increase the effective sample size is to use thinning. Thinning means saving every $k^{th}$ sample and throwing the rest away. If one desired to have 2000 posterior draws, it could be done in two of many possible ways
- Generate 2000 draws after warmup and save all of them
- Generate 10,000 draws after warmup and save every $5^{th}$ sample.
Both will produce 2000 samples, but the method using thinning will have less autocorrelation and a higher effective number of samples. Though it should be noted that generating 10,000 draws and saving all of them will have a higher number of effective samples than the second method with thinning, so thinning should only be favored to save memory.
**Divergent Transitions.** Unlike the previous tools for algorithmic faithfulness which can be used for any MCMC sampler, information about divergent transitions is intrinsic to Hamiltonian Monte Carlo. Recall that the HMC and NUTS algorithm can be imagined as a physics simulation of a particle in a potential energy field, and a random momentum is imparted on the particle. The sum of the potential energy and the kinetic energy of the system is called the Hamiltonian, and is conserved along the trajectory of the particle [@stanref]. The path that the particle takes is a discrete approximation to the actual path where the position of the particle is updated in small steps called leapfrog steps (see @leimkuhler2004simulating for a detailed explanation of the leapfrog algorithm). A divergent transition happens when the simulated trajectory is far from the true trajectory as measured by the Hamiltonian.
A few divergent transitions is not indicative of a poorly performing model, and often divergent transitions can be mitigated by reducing the step size and increasing the adapt delta parameter. On the other hand, a bad model may never be improved just by tweaking some parameters. This is the folk theorem of statistical computing - if there is a problem with the sampling, blame the model, not the algorithm.
Divergent transitions are never saved in the posterior samples, but they are saved internally to the `Stan` fit object and can be compared against good samples. Sometimes this can give insight into which parameters and which regions of the posterior the divergent transitions are coming from.
```{r ch030-Hot-Locomotive, fig.cap="Divergent transitions highlighted for some parameters from the centered parameterization of the eight schools example."}
n_cp <- nuts_params(fit_cp)
post_cp <- as.array(fit_cp)
mcmc_pairs(post_cp, np = n_cp, pars = c("mu", "tau"))
```
From figure \@ref(fig:ch030-Hot-Locomotive) we can see that most of the divergent transitions occur when the variance term $\tau$ is close to zero. This is common for multilevel models, and illustrates why non-centered parameterization is so important. We discuss centered and non-centered parameterization in the next chapter.
## Estimating predictive performance
All models are wrong, but some are useful. This quote is from George Box [@box1976science], and it is a popular quote that statisticians like to throw around. All models are wrong because it is nearly impossible to account for the minutiae of every process that contributes to an observed phenomenon, and often trying to results in poorer performing models. Also it is never truly possible to prove that a model is correct. At best the scientific method can falsify certain hypotheses, but it cannot ever determine if a model is universally correct. That does not matter. What does matter is if the model is useful and can make accurate predictions.
Why is predictive performance so important? Consider five points of data (figure \@ref(fig:ch030-Moving-Moose)). They have been simulated from some polynomial equation of degree less than five, but with no more information other than that, how can the best polynomial model be selected?
```{r ch030-Moving-Moose, fig.cap="Five points from a polynomial model."}
set.seed(11)
i <- 2
x0 <- 0:5
y0 <- -0.5*(x0 - 2)^2 + 2 + rnorm(length(x0), 0, 1)
x <- x0[-i]
y <- y0[-i]
p <- ggplot(data.frame(x=x, y=y), aes(x, y)) +
geom_point(size = 4) +
theme_minimal()
p
```
One thing to try is fit a handful of linear models, check the parameter's p-values, the $R^2$ statistic, and perform other goodness of fit tests, but there is a problem. As the degree of the polynomial fit increases, the $R^2$ statistic will always increase. In fact with five data points, a fourth degree polynomial will fit the data perfectly (figure \@ref(fig:ch030-Olive-Screwdriver)).
```{r ch030-Olive-Screwdriver, fig.cap="Data points with various polynomial regression lines."}
p2 <- p +
geom_smooth(formula = y ~ 1 + x,
method = "lm", se = FALSE,
aes(color = "Linear")) +
geom_smooth(formula = y ~ 1 + x + I(x^2),
method = "lm", se = FALSE,
aes(color = "Quadratic")) +
geom_smooth(formula = y ~ 1 + x + I(x^2) + I(x^3),
method = "lm", se = FALSE,
aes(color = "Cubic")) +
geom_smooth(formula = y ~ 1 + x + I(x^2) + I(x^3) + I(x^4),
method = "lm", se = FALSE,
aes(color = "Quartic")) +
geom_point(size = 4) +
scale_color_manual(values = c("#20639B",
"#3CAEA3",
"#F6D55C",
"#ED553B"),
limits = c("Linear",
"Quadratic",
"Cubic",
"Quartic"),
labels = c("Linear",
"Quadratic",
"Cubic",
"Quartic"),
name = "Degree")
p2
```
If a $6^{th}$ point were to be added -- a new observation -- which of the models would be expected to predict best? Can it be estimated which model will predict best before testing with new data? One guess is that the quadratic or cubic model will do well because because the linear model is potentially underfit to the data and the quartic is overfit to the data. Figure \@ref(fig:ch030-Cold-Fish) shows the new data point from the polynomial model. Now the linear and cubic models are trending in the wrong direction. The quadratic and quartic models are both trending down, so perhaps they may be the correct form for the model.
```{r ch030-Cold-Fish, fig.cap="The fitted polynomial models with a new observation."}
p2 +
geom_point(data = data.frame(x = x0[i], y = y0[i]),
shape = 17, size = 4)
```
Figure \@ref(fig:ch030-Strawberry-Swallow) shows the 80% and 95% prediction intervals for a new observation given $x = `r x0[i]`$ as well as the true outcome as a dashed line at $y = `r round(y0[i], 3)`$. The linear model has the smallest prediction interval (PI), but completely misses the target. The remaining three models all include the observed value in their 95% PIs, but the quadratic model has the smallest PI of the three. The actual data generating polynomial is
\setstretch{1.0}
\begin{align*}
y &\sim \mathcal{N}(\mu, 1^2) \\
\mu &= -0.5(x - 2)^2 + 2
\end{align*}
\setstretch{2.0}
```{r ch030-Maximum Panther, include=FALSE}
d <- data.frame(y = y,
x = x,
x2 = x^2,
x3 = x^3,
x4 = x^4)
m1 <- xfun::cache_rds({
ulam(alist(
y ~ dnorm(mu, sigma),
mu <- a + b1*x,
c(a, b1) ~ dnorm(0, 2),
sigma ~ dexp(1)
), data = d, log_lik=TRUE)
})
m2 <- xfun::cache_rds({
ulam(alist(
y ~ dnorm(mu, sigma),
mu <- a + b1*x + b2*x2,
c(a, b1, b2) ~ dnorm(0, 2),
sigma ~ dexp(1)
), data = d, log_lik=TRUE)
})
m3 <- xfun::cache_rds({
ulam(alist(
y ~ dnorm(mu, sigma),
mu <- a + b1*x + b2*x2 + b3*x3,
c(a, b1, b2, b3) ~ dnorm(0, 2),
sigma ~ dexp(1)
), data = d, log_lik=TRUE)
})
m4 <- xfun::cache_rds({
ulam(alist(
y ~ dnorm(mu, sigma),
mu <- a + b1*x + b2*x2 + b3*x3 + b4*x4,
c(a, b1, b2, b3, b4) ~ dnorm(0, 2),
sigma ~ dexp(1)
), data = d, log_lik=TRUE)
})
```
```{r ch030-Strawberry-Swallow, fig.cap="95\\% Prediction intervals for the four polynomial models, as well as the true value (dashed line)."}
new_y <- y0[i]
new_x <- x0[i]
new_dat <- data.frame(x = new_x,
x2 = new_x^2,
x3 = new_x^3,
x4 = new_x^4)
bind_rows(
round(PI(link(m1, data = new_dat), prob = c(0.8, 0.95)), 3),
round(PI(link(m2, data = new_dat), prob = c(0.8, 0.95)), 3),
round(PI(link(m3, data = new_dat), prob = c(0.8, 0.95)), 3),
round(PI(link(m4, data = new_dat), prob = c(0.8, 0.95)), 3)
) %>%
add_column(model = c("Linear", "Quadratic", "Cubic", "Quartic"),
.before = 1) %>%
mutate(model = factor(model,
levels = c("Linear",
"Quadratic",
"Cubic",
"Quartic"))) %>%
ggplot(aes(x = model, xend = model, y = `3%`, yend = `98%`)) +
geom_segment(size = 1, color = "gray30") +
geom_segment(aes(x = model, xend = model, y = `10%`, yend = `90%`),
size = 2, inherit.aes = FALSE) +
geom_hline(yintercept = new_y, linetype = "dashed") +
labs(y = "Prediction Interval", x = "Model",
title = paste("Predicting a new Observation when x =", new_x),
subtitle = "80% and 95% Prediction Intervals") +
theme_minimal()
```
The best fit to the observed data is the quartic model, but it is too variable and doesn't capture the regular features of the data, so it does poorly for the out-of-sample prediction. The linear model suffers as well by being more biased and too inflexible to capture the structure of the data. The quadratic and cubic are in the middle, but the quadratic does well and makes fewer assumptions about the data. The quadratic model is just complex enough to predict well while making fewer assumptions. Information criteria is a way of weighing the prediction quality of a model against its complexity, and is arguably a better system for model selection/comparison than other goodness-of-fit statistics such as $R^2$ or p-values [@burnham2002practical].
A technique to evaluate predictive performance is cross validation, where the data is split into training data and testing data [@friedman2001elements]. The model is fit to the training data, and then predictions are made with the testing data and compared to the observed values. This can often give a good estimate for out-of-sample prediction error. Cross validation can be extended into k-fold cross validation. The idea is to fold the data into $k$ disjoint partitions, and predict partition $i$ using the rest of the data to train on. The prediction error of the $k$-folds can then be averaged over to get an estimate for out-of-sample prediction error.
Taking $k$-fold CV to the limit by letting $k$ equal the number of observations results in leave-one-out cross validation (LOOCV), where for each observation in the data, the model is fit to the remaining data and predicted for the left out observation. $k$-fold cross validation requires fitting the model $k$ times, which can be computationally expensive for complex Bayesian models. Thankfully there is a way to approximate LOOCV without having to refit the model many times.
**Estimating cross validation error via Pareto-Smoothed-Importance Sampling**. LOOCV and many other evaluation tools such as the widely applicable information criterion [@watanabe2013widely] rest on the log-pointwise-predictive-density (lppd), which measures deviance from some "true" probability distribution. Typically we don't have the analytic form of the predictive posterior density, so instead we use $S$ MCMC draws to approximate the lppd [@vehtari2017practical]:
\begin{equation}
\mathrm{lppd}(y, \Theta) = \sum_{i=1}^N \log \frac{1}{S} \sum_{s=1}^S p(y_i | \Theta_s)
(\#eq:lppd)
\end{equation}
To estimate LOOCV, the relative "importance" of each observation must be computed. Certain observations have more influence on the posterior distribution, and so have more impact on the posterior if they are removed. By omitting a sample, the relative importance weight can be measured by the lppd. This omitted calculation is known as the out-of-sample lppd. For each omitted $y_i$,
$$
\mathrm{lppd}_{CV} = \sum_{i=1}^N \frac{1}{S} \sum_{s=1}^S \log p(y_{i} | \theta_{-i,s})
$$
The method of using weights to estimate the cross-validation is called Pareto-Smoothed Importance Sampling Cross-Validation (PSIS). Pareto-smoothing is a technique for making the importance weight more reliable. Each sample $s$ is re-weighted by the inverse of the probability of the omitted observation:
$$
r(\theta_s) = \frac{1}{p(y_i \vert \theta_s)}
$$
Then the importance sampling estimate of the out-of-sample lppd is calculated as:
$$
\mathrm{lppd}_{IS} = \sum_{i=1}^N\log \frac{\sum_{s=1}^S r(\theta_s) p(y_i \vert \theta_s)}{\sum_{s=1}^S r(\theta_s)}
$$
However, the importance weights can have a heavy right tail, and so they can be stabilized by using the Pareto distribution [@vehtari2015pareto]. The distribution of weights theoretically follow a Pareto distribution, so the larger weights can be used to estimate the generalized Pareto distribution
$$
p(r; \mu, \sigma, k) = \frac{1}{\sigma} \left(1 + \frac{k (r - \mu)}{\sigma}\right)^{-(1/k + 1)}
$$
where $\mu$ is the location, $\sigma$ is the scale, and $k$ is the shape. Then the estimated distribution is used to smooth the weights. A side-effect of using PSIS is that the estimated value of $k$ can be used as a diagnostic tool for a particular observation. For $k>0.5$, the Pareto distribution will have infinite variance, and a really heavy tail. If the tail is very heavy, then the smoothed weights are harder to trust. In theory and in practice, however, PSIS works well as long as $k < 0.7$ [@vehtari2015pareto].
There is an `R` package called `loo` that can compute the expected log-pointwise-posterior-density (ELPD) using PSIS-LOO, as well as the estimated number of effective parameters and LOO information criterion [@R-loo]. For the part of the researcher, the log-likelihood of the observations must be computed. This can be calculated in the `generated quantities` block of a `Stan` program, and it is standard practice to name the log-likelihood as `log_lik` in the model. An example of calculating the log-likelihood for the eight schools data in `Stan` is:
\setstretch{1.0}
```
generated quantities {
vector[J] log_lik;
for (j in 1:J) {
log_lik[j] = normal_lpdf(y[j] | theta[j], sigma[j]);
}
}
```
\setstretch{2.0}
Models can be compared simply using `loo::loo_compare`. It estimates the ELPD and its standard error, then calculates the relative differences between all the models. The model with the highest ELPD is predicted to have the best out-of-sample predictions. The comparison of four polynomial models from the earlier example is shown below.
```{r ch030-Lantern Quality}
linear <- loo(m1@stanfit)
quadratic <- loo(m2@stanfit)
cubic <- loo(m3@stanfit)
quartic <- loo(m4@stanfit)
```
```{r ch030-Artificial Bleeding, echo=TRUE}
comp <- loo_compare(linear, quadratic, cubic, quartic)
```
```{r ch030-Galaxy-Itchy}
comp %>%
as_tibble(rownames = "Model") %>%
mutate(Model = factor(Model,
levels = paste0("model", 1:4),
labels = c("Linear",
"Quadratic",
"Cubic",
"Quartic"))) %>%
select(Model, elpd_diff, se_diff, p_loo, looic) %>%
kable(digits = 4, booktabs = TRUE,
caption = "LOO comparison of Polynomial equations.") %>%
kable_styling(latex_options = "hold_position")
```
This comparison is unreliable since there are only five data points to estimate the predictive performance. This assertion is backed by the difference in ELPD and the standard error of the differences -- the standard error of the difference is at the same order of magnitude for the difference in each case.
## A modern principled bayesian modeling workflow
A principled workflow is a method of employing domain expertise and statistical knowledge to iteratively build a statistical model that satisfies the constraints and goals set forth by the researcher. Many other workflow and model checking techniques are given without context for when they are appropriate, and according to @betancourt2020, this leaves "practitioners to piece together their own model building workflows from potentially incomplete or even inconsistent heuristics." For any given problem, there is not, nor should there be, a default set of steps to take to get from data exploration to predictive inferences. Rather, consideration must be given to domain expertise and the questions that one is trying to answer with the statistical model.
Because everyone asks different questions, the value of a model is not in how well it ticks the boxes of goodness-of-fit checks, but in how consistent it is with domain expertise and its ability to answer the unique set of questions. Betancourt suggests answering four questions to evaluate a model by, summarized in table \@ref(tab:ch030-Confidential-Proton).
```{r ch030-Confidential-Proton}
data.frame(
Evaluation = c(
"1. Domain Expertise Consistency",
"2. Computational Faithfulness",
"3. Inferential Adequacy",
"4. Model Adequacy"
),
Question = c(
"Is our model consistent with our domain expertise?",
"Will our computational tools be sufficient to accurately fit our posteriors?",
"Will our inferences provide enough information to answer our questions?",
"Is our model rich enough to capture the relevant structure of the true data generating process?"
)
) %>%
kable(booktabs=TRUE, caption = "Questions for model evaluation.") %>%
kable_styling(latex_options = c("hold_position", "striped")) %>%
column_spec(1, width = "1.75in") %>%
column_spec(2, width = "3.25in")
```
Much work is done before seeing the data or building a model. This includes talking with experts to gain domain knowledge or to elicit priors. A benefit of modeling in a Bayesian framework is that all prior knowledge may be incorporated into the model to be used to estimate the posterior distribution. The same prior knowledge may also be used to check the posterior to ensure that predictions remain within physical or expert-given constraints.
In this section we describe a simulation-based, principled workflow proposed by @betancourt2020 and broadly adopted by many members of the Bayesian community. The workflow broadly consists of specifying the likelihood and priors, performing prior predictive checks, fitting a model, and performing posterior predictive checks. The steps of the workflow are divided into three phases: 1) pre-model, pre-data, 2) post-model, pre-data, and 3) post-model, post-data. Tables \@ref(tab:ch030-Reborn-Space), \@ref(tab:ch030-Freaky-Sledgehammer), and \@ref(tab:ch030-Bleeding-Liquid-Dagger) list the steps of each phase.
```{r ch030-Reborn-Space}
data.frame(Step = c("Conceptual Analysis",
"Define Observational Space",
"Construct Summary Statistics"),
Description = c("Write down the inferential goals and consider how the variables of interest interact with the environment and how those interactions work to generate observations.",
"What are the possible values that the observed data can take on? The observational space can help inform the statistical model such as in count data.",
"What measurements and estimates can be used to help ensure that the inferential goals are met? Prior predictive checks and posterior retrodictive checks are founded on summary statistics that answer the questions of domain expertise consistency and model adequacy.")) %>%
kable(booktabs=TRUE, caption="Pre-Model, Pre-Data steps.") %>%
kable_styling(latex_options = c("hold_position", "striped")) %>%
column_spec(1, width = "1.75in") %>%
column_spec(2, width = "3.25in")
```
```{r ch030-Freaky-Sledgehammer}
data.frame(Step = c("Develop Model",
"Construct Summary Functions",
"Simulate Bayesian Ensemble",
"Prior Checks",
"Configure Algorithm",
"Fit Simulated Ensemble",
"Algorithmic Calibration",
"Inferential Calibration"),
Description = c(
"Build an observational model that is consistent with the conceptual analysis and observational space, and then specify the complementary prior model.",
"Use the developed model to construct explicit summary functions that can be used in prior predictive checks and posterior retrodictive checks.",
"Since the model is a data generating model, it can be used to simulate observations from the prior predictive distribution without yet having seen any data.",
"Check that the prior predictive distribution is consistent with domain expertise using the summary functions developed in the previous step.",
"Having simulated data, the next step is to fit the data generating model to the generated data. There are many different MCMC samplers with their own configurable parameters, so here is where those settings are tweaked.",
"Fit the simulated data to the model using the algorithm configured in the previous step.",
"How well did the algorithm do in fitting the simulated data? This step helps to answer the question regarding computational faithfulness. A model may be well specified, but if the algorithm used is unreliable then the posterior distribution is also unreliable, and this can lead to poor inferences.",
"Are there any pathological behaviors in the model such as overfitting or non-identifiability? This step helps to answer the question of inferential adequacy."
)) %>%
kable(booktabs=TRUE, caption="Post-Model, Pre-Data steps.") %>%
kable_styling(latex_options = c("hold_position", "striped")) %>%
column_spec(1, width = "1.75in") %>%
column_spec(2, width = "3.25in")
```
```{r ch030-Bleeding-Liquid-Dagger}
data.frame(Step = c("Fit Observed Data",
"Diagnose Posterior Fit",
"Posterior Retrodictive Checks",
"Celebrate"),
Description = c(
"After performing the prior predictive checks and being satisfied with the model, the next step is to fit the model to the observed data.",
"Did the model fit well? Can a poorly performing algorithm be fixed by tweaking the algorithmic configuration, or is there a problem with the model itself where it is not rich enough to capture the structure of the observed data? Utilize the diagnostic tools available for the algorithm to check the computational faithfulness.",
"Do the posterior retrodictions match the observed data well, or are there still apparent discrepancies between what is expected and what is predicted by the model? It is important that any changes to the model going forward are motivated by domain expertise so as to mitigate the risk of overfitting.",
"After going through the tedious process of iteratively developing a model, it is okay to celebrate before moving on to answer the research questions."
)) %>%
kable(booktabs=TRUE, caption="Post-Model, Post-Data steps.") %>%
kable_styling(latex_options = c("hold_position", "striped")) %>%
column_spec(1, width = "1.75in") %>%
column_spec(2, width = "3.25in")
```
These steps are not meant to be followed in a strictly linear fashion. If a conceptual misunderstanding is discovered at any step in the process, then it is recommended to go back to an earlier step and start over. The workflow is a process of model expansion, and multiple iterations are required to get to a final model (or collection of models). Similarly if the model fails prior predictive checks, then one may need to return to the model development step. A full diagram of the workflow is displayed in figure \@ref(fig:ch030-workflow-diagram).
```{r ch030-workflow-diagram, fig.cap="Diagram is copywrited material of Michael Betancourt and used under the CC BY-NC 4.0 license. Image created with Lucid app.", out.width="100%"}
knitr::include_graphics("figures/workflow-diagram.png")
```