30th June, 2019

Why should we care about statistics?

  • Uncertainty is integral to both science and life
  • Statistics is the discipline that deals with quantifying uncertainty
  • In contrast to how it is usually taught, statistics is a messy discipline full of opposing viewpoints

  • In the 20th century and beyond, there were 3 big traditions in statistics
    • They continue to influence our thinking profoundly

Historical context

Significance? What significance?

  • Before 1940, significance testing was virtually non-existent
  • By 1955, 80% of empirical articles reported significance tests
  • Today, the figure is close to 100% (see Gigerenzer, 1993)

“[Statisticians] have already overrun every branch of science with a rapidity of conquest rivalled only by Attila, Mohammed, and the Colorado beetle.”

  • Kendall (1942, p.69)

Ronald Fisher

  • Father of modern statistical inference
  • Randomization, design of experiments
  • \(p\)-value, \(\alpha\) level, null hypothesis, analysis of variance
  • Maximum likelihood estimation (Stigler, 2007)
  • The “greatest of Darwin’s successors” (Edwards, 2011)

Neyman-Pearson

  • Jerzy Neyman and Egon Pearson put Fisher’s ideas on a rigorous mathematical basis
  • Fisher didn’t like it, and a personal feud began that lasted until his death

  • Some concepts Neyman & Pearson introduced were
    • The alternative hypothesis
    • Type I (\(\alpha\)) and Type II (\(\beta\)) errors
    • Statistical power
    • Confidence intervals
    • Decision theoretic foundation of hypothesis testing

Current statistical practice: Hybrid NHST

  • Current statistical practice is a hybrid between the approaches of Fisher and Neyman-Pearson
  • Gigerenzer (2004) introduces the term “null ritual” to denote current practices:
  1. Set up a statistical null hypothesis, but do not specify your own hypothesis or any alternative.
  2. Use the 5% significance level for rejecting the null and accepting your hypothesis.
  3. Always perform this procedure.

Bayes-Laplace-Jeffreys

  • Thomas Bayes (1701 - 1761)
    • Presbyterian minister & hobby mathematician
    • “An essay towards solving a Problem in the Doctrine of Chances” — Bayes’ rule

  • Pierre-Simon Laplace (1749 - 1827)
    • “Newton of France”
    • Independently derived Bayes’ rule
    • Proved the central limit theorem

  • Harold Jeffreys (1891 - 1989)
    • Revived the Bayesian view of probability in the 20th century
    • Developed the Bayes factor & default Bayesian hypothesis tests

  • Key Idea: Use probability to quantify uncertainty about parameters, models, and hypotheses

Fundamental difference

  • Probability is a branch of mathematics based on set theory, and as such uncontroversial
  • However, there are different interpretations of probability
    • Frequentist view: Probability as the limit of relative frequency
    • Bayesian view: Probability as degree of belief

  • What is the probability that Trump gets re-elected?
  • What is the probability of a global temperature rise of 2°C in the next 30 years?

Frequentist interpretation

“The rational concept of probability, which is the only basis of probability calculus, applies only to problems in which either the same event repeats itself again and again … [In] order to apply the theory of probability we must have a practically unlimited sequence of observations.”

  • Richard von Mises (quoted in Jackman, 2009, p. 5)

Bayesian interpretation

“[I]n the conception we follow and sustain here, only subjective probabilities exist – i.e., the degree of belief in the occurrence of an event attributed by a given person at a given instant and with a given set of information.”

  • Bruno de Finetti (quoted in Nau, 2001, p. 90)

Why does it matter? Misunderstandings!

  • This difference endows frequentist methods with a somewhat counter-intuitive air

  • This is partly responsible for the many misunderstandings of statistical concepts
  • Oakes (1986); Wulff et al. (1987); Haller & Krauss (2002); Lecoutre et al. (2003); Hoekstra et al. (2014)

Some bedtime readings

  • Accessible popular science books that touch upon the history are
    • The Theory That Would Not Die (McGrayne, 2011)
    • The Seven Pillars Of Statistical Wisdom (Stigler, 2016)
    • Ten Great Ideas About Chance (Diaconis & Skyrms, 2017)
    • The Empire of Chance (Gigerenzer et al., 1989)
    • The History of Statistics: The Measurement of Uncertainty before 1900 (Stigler, 1990)

  • See also Fienberg (1992) for a review of six books about the history of probability

Outline

  • 1. Some historical context
    • Fisher, Neyman-Pearson, Bayes-Laplace-Jeffreys
    • Two ways to interpret probabilities

  • 2. Recap of Probability

  • 3. Classical statistics
    • Maximum likelihood estimation
    • \(p\)-values, confidence intervals

  • 4. Bayesian statistics I
    • Estimation, testing, prediction
    • Bayes factor practicals in JASP

  • 5. Bayesian statistics II
    • Hint at Markov chain Monte Carlo
    • Bayesian workflow and model checking

Two Rules of Probability

Joint, Marginal, and Conditional probabilities

Consider the following table specifying the joint probability over eye and hair colour

##       blond brown  red black
## blue   0.22  0.21 0.00  0.01
## green  0.00  0.14 0.06  0.01
## brown  0.16  0.15 0.00  0.04
  • We have \(P(\text{Eye}, \text{Hair})\)
    • What is the probability distribution over eye — \(P(\text{Eye})\)?
    • What is the probability distribution over hair — \(P(\text{Hair})\)?
    • What is the probability of hair, given eye colour blue — \(P(\text{Hair} \mid \text{Eye = blue})\)?
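  • The following slides index this table as the matrix x; a minimal reconstruction of that object is
x <- matrix(
  c(0.22, 0.21, 0.00, 0.01,
    0.00, 0.14, 0.06, 0.01,
    0.16, 0.15, 0.00, 0.04),
  nrow = 3, byrow = TRUE,
  dimnames = list(c('blue', 'green', 'brown'), c('blond', 'brown', 'red', 'black'))
)
sum(x)  # the joint probabilities sum to 1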

Joint, Marginal, and Conditional probabilities

  • Marginal distribution of Eye: \(P(\text{Eye}) = \sum_{h}^{n} P(\text{Eye}, \text{Hair = h})\)
x[, 1] + x[, 2] + x[, 3] + x[, 4]
##  blue green brown 
##  0.44  0.21  0.35

  • Marginal distribution of Hair: \(P(\text{Hair}) = \sum_{e}^{m} P(\text{Hair}, \text{Eye = e})\)
x[1, ] + x[2, ] + x[3, ]
## blond brown   red black 
##  0.38  0.50  0.06  0.06

Joint, Marginal, and Conditional probabilities

  • Conditional distribution over Hair, given blue eyes: Maybe it’s just this?
x[1, ]
## blond brown   red black 
##  0.22  0.21  0.00  0.01

  • No! We have to renormalize — \(P(\text{Hair} \mid \text{Eye = blue}) = \frac{P(\text{Hair}, \text{Eye = blue})}{P(\text{Eye = blue})}\)
x[1, ] / sum(x[1, ])
##      blond      brown        red      black 
## 0.50000000 0.47727273 0.00000000 0.02272727

Two rules of probability

  • We saw the sum rule on the previous slide

\[ p(y) = \int_{\Theta} p(y, \theta) \, \mathrm{d}\theta \]

  • The product rule states that

\[ p(\theta, y) = p(y \mid \theta) \,p(\theta) = p(\theta \mid y) \, p(y) \]

  • From this, Bayes’ rule immediately follows

\[ \underbrace{p(\theta \mid y)}_{{\text{Posterior}}} = \frac{\overbrace{p(y \mid \theta)}^{\text{Likelihood}}\, \overbrace{p(\theta)}^{\text{Prior}}}{\underbrace{p(y)}_{\text{Marginal likelihood}}} = \frac{p(y \mid \theta) \, p(\theta)}{\int_{\Theta} p(y, \theta) \, \mathrm{d}\theta} \]

Notation

  • \(X\) is a random variable with realization \(x\)
  • \(n\) realizations are \(\mathbf{x} = (x_1, \ldots, x_n)\)
  • \(p(x)\) both for discrete and continuous random variables [\(P(X = x)\) and \(p(X = x)\)]
  • \(p(\mathbf{x} \mid \theta)\) denotes a likelihood function relating parameter (vector) \(\theta\) to data \(\mathbf{x}\)
  • We use \(\mathbb{E}[X]\) to denote the expectation of either a discrete or a continuous random variable, i.e.

\[ \begin{aligned} \mathbb{E}[X] &= \int_{X} x \, p(x) \, \mathrm{d}x \\[.5em] \mathbb{E}[X] &= \sum_{i=1}^n x_i \, p(x_i) \\ \end{aligned} \]
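  • A tiny worked example of the discrete case: the expected value of a fair six-sided die
vals  <- 1:6
probs <- rep(1/6, 6)
sum(vals * probs)  # E[X] = 3.5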

Why most published research is false

|                          | \(\mathcal{H}_0\) true | \(\mathcal{H}_1\) true |
|--------------------------|------------------------|------------------------|
| reject \(\mathcal{H}_0\) | \(\alpha\)             | \(1 - \beta\)          |
| keep \(\mathcal{H}_0\)   | \(1 - \alpha\)         | \(\beta\)              |

Exercise I

  • What is the probability that my hypothesis is true, given that I have observed \(p < \alpha\)?
  • Let’s say you have
    • Statistical power of \(1 - \beta = 0.70\) for some effect size
    • Want to control error rates with \(\alpha = 0.05\)
    • Assume \(P(\mathcal{H}_1) = 0.20\)
    • Compute \(P(\mathcal{H}_1 \mid p < \alpha)\)!

Why most published research is false

  • What is the probability that my hypothesis is true, given that I have observed \(p < \alpha\)?
  • Recall the product rule

\[ \begin{split} P(H_1|p < \alpha)P(p < \alpha) = P(H_1, p < \alpha) &= P(p < \alpha|H_1)P(H_1) \end{split} \]

  • which leads to Bayes’ rule

\[ \begin{split} P(H_1|p < \alpha) &= \frac{P(p < \alpha|H_1)P(H_1)}{P(p < \alpha)} \\[1.5ex] &= \frac{P(p < \alpha|H_1)P(H_1)}{P(p < \alpha|H_1)P(H_1) + P(p < \alpha|H_0)P(H_0)} \end{split} \]

  • Plugging in \(P(p < \alpha|H_1) = 1 - \beta\) and \(P(p < \alpha|H_0) = \alpha\)

\[ P(H_1|p < \alpha) = \frac{(1 - \beta) P(H_1)}{(1 - \beta) P(H_1) + \alpha (1 - P(H_1))} \]
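  • A quick numerical check with the numbers from Exercise I:
power    <- 0.70   # 1 - beta
alpha    <- 0.05
prior_h1 <- 0.20
(power * prior_h1) / (power * prior_h1 + alpha * (1 - prior_h1))  # 0.14 / 0.18, about 0.78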

Why most published research is false

Statistical models

Statistical models

  • Say I have a set of \(n = 100\) observations \(d = \{x_1, x_2, x_3, \ldots, x_n\}\)
  • You ask me about the data; what can I tell you?
d
##   [1] 107.97829  97.69310 105.68723  91.57817 100.68255  99.36588 113.81083
##   [8] 104.03224  87.80916  89.35512 110.42069  85.15186  97.48300 100.40710
##  [15]  99.52343 108.13351  76.99894  94.06699  97.02159  93.92357 121.06758
##  [22] 104.91280  92.48396 108.75160 104.85715 102.44135 107.42267 115.40291
##  [29] 105.29429 111.19024 100.96793  83.76809 101.20973 106.53400 103.20483
##  [36]  96.63230 112.29554  85.09677 119.74183  96.44044 110.73340 100.41942
##  [43] 105.36852 104.88455 106.27562  89.70382 104.66260  94.95458  89.90731
##  [50] 118.17459  90.85233 112.99377 103.12626  96.37302  92.45636  99.70174
##  [57]  97.29787  95.07651 116.58248  87.37510  89.91494  99.70467  97.43632
##  [64] 107.66106  85.73984 110.95736 101.22271 106.39446 104.21625  87.33096
##  [71]  93.35681 102.30501  97.05229  98.09652  96.22863 109.15487  94.86212
##  [78] 102.97085 106.76618 109.77560  81.26005  92.04866 101.85651  99.68925
##  [85] 101.97910  78.06927 114.79401  96.77466  98.15235 103.90206  95.19197
##  [92]  99.48015  98.22895  80.30618 101.59158  90.52129 105.40732  81.46523
##  [99]  97.32792 116.16882

Statistical models

  • Introduce a statistical model to reduce complexity
  • A statistical model describes the relationship of one or more parameters and the observations \(d\)

\[ p(d \mid \mu, \sigma) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp \left(-\frac{1}{2\sigma^2} (x_i - \mu)^2 \right) \]

Statistical models

  • I used a normal distribution for the data
  • Now I can describe the data with just two numbers — the mean \(\mu\) and the standard deviation \(\sigma\)
c(mean(d), sd(d))
## [1] 99.931219  9.413697
  • But there are other possible statistical models I could use!

Statistical models

  • Statistical models need not be appropriate; in fact, they can be horribly misspecified!

\[ p(d \mid a, b) = \prod_{i=1}^n \frac{1}{b - a} \]
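  • A minimal sketch of how the two models compare, using simulated data resembling \(d\) (assumed here to be normal with mean 100 and SD 10):
set.seed(1)
d_sim <- rnorm(100, 100, 10)

loglik_normal  <- sum(dnorm(d_sim, mean(d_sim), sd(d_sim), log = TRUE))
loglik_uniform <- sum(dunif(d_sim, min(d_sim), max(d_sim), log = TRUE))
c(normal = loglik_normal, uniform = loglik_uniform)  # the normal model attains the higher log-likelihood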

Practical example

  • Data example
    • We observe \(n = 20\) coin flips
    • \(y = 15\) show heads
    • Questions
      • Estimation: What are likely values for \(\theta\)?
      • Testing: Is the coin likely to be biased?
      • Prediction: How likely is the next coin toss to show heads?
| Problem    | Classical statistics     | Bayesian statistics                        |
|------------|--------------------------|--------------------------------------------|
| Estimation | Maximum Likelihood, CIs  | Posterior distribution (Bayes’ rule)       |
| Testing    | p-value                  | Bayes factor (Bayes’ rule)                 |
| Prediction | Plug-in Estimate         | Account for uncertainty using Bayes’ rule  |

Binomial likelihood

\[ \begin{split} p(x \mid \theta) &\stackrel{\text{i.i.d.}}{=} p(x_1 \mid \theta) \cdot p(x_2 \mid \theta) \cdot \ldots \cdot p(x_n \mid \theta) \\ &= \prod^n_{i=1} p(x_i; \theta) \\ &= \prod^n_{i=1} \theta^{x_i} (1 - \theta)^{1 - x_i} \\ &= \theta^{\sum^n_{i = 1} x_i} (1 - \theta)^{\sum^n_{i = 1} 1 - x_i} \\[4ex] p(y \mid \theta, n) &= {n \choose y} \theta^{\sum^n_{i = 1} x_i} (1 - \theta)^{\sum^n_{i = 1} 1 - x_i} \hspace{3em} \text{Neglecting order} \\[1ex] &= {n \choose y} \theta^{y} (1 - \theta)^{n - y} \hspace{6.5em} \text{Introducing } Y = \sum^n_{i=1} x_i \\ \end{split} \]
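  • A quick numerical check of the step from the Bernoulli product to the binomial (the numbers are purely illustrative):
theta <- 0.6
x_obs <- c(1, 1, 0, 1, 0)   # five flips, y = 3 heads
y_obs <- sum(x_obs); n_obs <- length(x_obs)

prod(theta^x_obs * (1 - theta)^(1 - x_obs))         # theta^y * (1 - theta)^(n - y)
dbinom(y_obs, n_obs, theta) / choose(n_obs, y_obs)  # identical: the binomial without the ordering term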

Classical approach

Estimation: MLE

  • A classic estimation method is maximum likelihood estimation

\[ \hat{\theta}_{\text{MLE}} = \underset{\theta}{\text{argmax}} \hspace{.5em} p(y \mid \theta) \]

“Objectivity is one of the principal reasons that frequentism dominated 20th century applications; a frequentist method like Wilcoxon’s test, which is completely devoid of prior opinion, has a clear claim to being objective – a crucial fact when scientists communicate with their skeptical colleagues.”

  • Bradley Efron (2005, p. 4-5)

Estimation: MLE

  • A classic estimation method is maximum likelihood estimation

\[ \hat{\theta}_{\text{MLE}} = \underset{\theta}{\text{argmax}} \hspace{.5em} p(y \mid \theta) \]

  • Exercise II:
    • Can you give a Bayesian interpretation of this estimate?
    • That is, can you use Bayes’ rule in such a way that we get the MLE?

Estimation

  • Given the data (a random sample), what is the most likely population value \(\theta\)?
  • We could apply different estimators to answer this question:
    • \(e_1(y, n) = y - n\)
    • \(e_2(y, n) = \frac{1}{2}\)
    • \(e_3(y, n) = \frac{y}{n} + \frac{1}{n}\)
    • \(e_4(y, n) = \frac{y}{n}\)
    • \(e_5(y, n) = \frac{y}{n} + 0.05 \cdot \left(-1\right)^{I\left(\frac{y}{n} > 0.50\right)}\)

  • Compare them w.r.t.:
    • bias, variance,
    • asymptotic bias, asymptotic variance / consistency,
    • asymptotic normality, etc.
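  • A simulation sketch of this comparison for two of the estimators above (values are illustrative; e4 is the MLE, e3 the shifted estimator):
set.seed(1)
theta_true <- 0.7
n_flips <- 20
y_sim <- rbinom(1e5, n_flips, theta_true)

e4 <- y_sim / n_flips               # MLE
e3 <- y_sim / n_flips + 1 / n_flips # shifted estimator

round(c(bias_e4 = mean(e4) - theta_true, bias_e3 = mean(e3) - theta_true), 3)  # e3 is biased by 1/n
round(c(var_e4 = var(e4), var_e3 = var(e3)), 4)                                # essentially equal variance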

Estimation

  • The MLE \(\hat{\theta} = \frac{y}{n}\) has good properties (Stigler, 2007)

Confidence intervals

  • Definition. An X% confidence interval for a parameter \(\theta\) is an interval (L, U) generated by a procedure that in repeated sampling has an X% probability of containing the true value of \(\theta\), for all possible values of \(\theta\) (Neyman, 1937).

“Consider now the case when a sample … is already drawn and the [confidence interval] given … Can we say that in this particular case the probability of the true value of [the parameter] falling between [the limits] is equal to [X%]? The answer is obviously in the negative.”

  • Jerzy Neyman (quoted in Morey et al., 2015, p. 105)
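  • A small simulation illustrating the repeated-sampling definition above (the numbers are arbitrary):
set.seed(1)
mu <- 100
covered <- replicate(10000, {
  dat <- rnorm(30, mu, 15)
  ci <- t.test(dat)$conf.int
  ci[1] < mu && mu < ci[2]
})
mean(covered)  # close to the nominal 0.95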

Confidence intervals

Confidence intervals

“it is not suggested that we can ‘conclude’ that [the interval contains θ], nor that we should ‘believe’ that [the interval contains θ]…[we] decide to behave as if we actually knew that the true value [is in the interval]. This is done as a result of our decision and has nothing to do with ‘reasoning’ or ‘conclusion’. The reasoning ended when the [confidence procedure was derived]. The above process [of using CIs] is also devoid of any ‘belief’ concerning the value […] of [θ].”

  • Jerzy Neyman (quoted in Morey et al., 2015, p. 106)

Exercise III

Can you construct a 50% confidence interval that is completely “meaningless”?

\(p\)-values

\(p\)-values

  • We replicate data under the assumption that \(H_0\) is true, i.e., \(\theta = .5\)
y <- 15  # observed number of heads
n <- 20  # number of coin flips

conduct_experiment_h0 <- function(n) sum(rbinom(n, 1, prob = .5))
yrep <- replicate(100000, conduct_experiment_h0(n))

head(yrep, 10) # peek at some experiment outcomes
##  [1]  8 12 12 11  9 14 10  8  6 14
  • Then we compute the probability of data at least as extreme as what we observed
mean(yrep >= y) * 2
## [1] 0.0401

\(p\)-values

binom.test(y, n)
## 
##  Exact binomial test
## 
## data:  y and n
## number of successes = 15, number of trials = 20, p-value = 0.04139
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
##  0.5089541 0.9134285
## sample estimates:
## probability of success 
##                   0.75

\(p\)-values

res <- sapply(seq(20000), function(x) {
  dat <- rnorm(100, 0, 1)
  t.test(dat)$p.value
})

# under H0 (here: zero mean), the resulting p-values are approximately uniform
hist(res, main = '', xlab = 'p-value')

Some issues with \(p\)-values

  • \(p\)-values are uniformly distributed under \(H_0\)
    • This is by design, so that the rejection rate under \(H_0\) is exactly \(\alpha\)
    • Therefore, however, \(p\)-values do not allow stating evidence in favour of \(H_0\)

  • \(p\)-values are not really a means to compare models, but rather to check models
    • Conditional on the model (e.g., \(H_0\)), are the data surprising?
    • This is actually pretty cool, and we’ll come back to it

  • Inference using \(p\)-values does not distinguish between being underpowered or \(H_1\) being true

Some Literature

  • Gigerenzer, G. (2004). Mindless statistics. The Journal of Socio-Economics, 33(5), 587-606.
  • Greenland, S., Senn, S. J., Rothman, K. J., Carlin, J. B., Poole, C., Goodman, S. N., & Altman, D. G. (2016). Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations. European Journal of Epidemiology, 31(4), 337-350.
  • Morey, R. D., Hoekstra, R., Rouder, J. N., Lee, M. D., & Wagenmakers, E. J. (2016). The fallacy of placing confidence in confidence intervals. Psychonomic bulletin & review, 23(1), 103-123.

Bayesian approach

Outline

  • 1. Bayesian estimation
    • Conjugate priors
    • Credible intervals
  • 2. Bayesian testing
    • Prior predictions and marginal likelihoods
    • Practical exercises in JASP
| Problem    | Classical statistics     | Bayesian statistics                        |
|------------|--------------------------|--------------------------------------------|
| Estimation | Maximum Likelihood, CIs  | Posterior distribution (Bayes’ rule)       |
| Testing    | p-value                  | Bayes factor (Bayes’ rule)                 |
| Prediction | Plug-in Estimate         | Account for uncertainty using Bayes’ rule  |

Bayesian estimation

  • Uses Bayes’ rule

\[ \underbrace{p(\theta \mid y)}_{{\text{Posterior}}} = \frac{\overbrace{p(y \mid \theta)}^{\text{Likelihood}}\, \overbrace{p(\theta)}^{\text{Prior}}}{\underbrace{p(y)}_{\text{Marginal likelihood}}} = \frac{p(y \mid \theta) \, p(\theta)}{\int_{\Theta} p(y, \theta) \, \mathrm{d}\theta} \]

  • We know \(p(y \mid \theta)\) from our statistical setup
  • How do we choose \(p(\theta)\)?

Beta prior distribution

\[ p(\theta \mid \alpha, \beta) = \frac{1}{\text{Beta}(\alpha, \beta)} \theta^{\alpha - 1}(1 - \theta)^{\beta - 1} \]

Bayesian estimation

  • [Computation on whiteboard]
  • With some rearranging, we find that

\[ p(\theta \mid y) = \frac{1}{\text{Beta}(y + \alpha, n - y + \beta)} \theta^{y + \alpha - 1}(1 - \theta)^{n - y + \beta - 1} \]
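  • A sketch of the resulting posterior for the running example (\(y = 15\), \(n = 20\), Beta(1, 1) prior):
y <- 15; n <- 20
a <- 1; b <- 1

theta_grid <- seq(0, 1, length.out = 500)
plot(theta_grid, dbeta(theta_grid, a + y, b + n - y), type = 'l',
     xlab = 'theta', ylab = 'Density')       # posterior is Beta(16, 6)
qbeta(c(0.025, 0.975), a + y, b + n - y)     # central 95% credible interval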

Bayesian updating

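  • The original slide embeds an interactive Shiny demo; a minimal static sketch of the same idea, sequentially updating a Beta(1, 1) prior as flips come in, could look like this:
set.seed(1)
flips <- rbinom(100, 1, prob = 0.75)   # hypothetical coin with theta = 0.75
theta_grid <- seq(0, 1, length.out = 500)

plot(theta_grid, dbeta(theta_grid, 1, 1), type = 'l', ylim = c(0, 12),
     xlab = 'theta', ylab = 'Density')   # flat prior
for (n_obs in c(5, 20, 100)) {
  y_obs <- sum(flips[1:n_obs])
  lines(theta_grid, dbeta(theta_grid, 1 + y_obs, 1 + n_obs - y_obs))  # posterior after n_obs flips
}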

Conjugacy: Old joke

Bayesian priors \(\equiv\) regularization

Bayesian hypothesis testing

  • In classical statistics, we test

\[ \mathcal{H}_0: \theta = 0.5 \\ \mathcal{H}_1: \theta \neq 0.5 \\ \]

  • In Bayesian testing, we need to specify \(p(\theta \mid \mathcal{H}_1)\)
  • It is through the prior that the model (which instantiates the hypothesis) makes predictions

\[ \begin{split} \mathcal{H}_0 &: \theta = 0.5 \\ \mathcal{H}_1 &: \theta \sim \text{Beta}(\alpha, \beta) \end{split} \]

  • Bayesian hypothesis testing is about comparing the prior predictive performance of hypotheses
  • It is logically prior to parameter estimation
    • First ask: is there an effect?
    • Then ask: how big is it?

Prior predictive distribution

  • The prior formalizes our uncertainty about \(\theta\) before seeing the data
  • We get predictions by propagating this uncertainty:
    • First, we sample a value for \(\theta\) from the prior
    • Second, we use this sample to generate data from the likelihood
    • This results in a prior predictive distribution
n <- 10
nr_draws <- 10e3
preds_m0 <- rbinom(nr_draws, n, prob = 0.5)

prior <- rbeta(nr_draws, 1, 1)
preds_m1 <- sapply(prior, function(theta) rbinom(1, n, theta))

Prior predictive distribution

Predictive performance of \(\mathcal{M}_0\)

  • Look at predictions before seeing the data — pre-dictive

\[ \begin{aligned} p(y = 5 \mid \mathcal{M}_0) &= \int p(y = 5 \mid \theta, \mathcal{M}_0) \, p(\theta \mid \mathcal{M}_0) \, \mathrm{d}\theta \\[.5em] &= \int p(y = 5 \mid \theta, \mathcal{M}_0) \, \delta(\theta = 0.5) \, \mathrm{d}\theta \\[.5em] &= p(y = 5 \mid \theta = 0.5, \mathcal{M}_0) \\[.5em] &= {10 \choose 5} 0.5^5 (1 - 0.5)^5 \\[.5em] &\approx 0.25 \end{aligned} \]

Predictive performance of \(\mathcal{M}_1\)

  • For \(\mathcal{M}_1\), things are a little more complicated:

\[ \begin{aligned} p(y \mid \mathcal{M}_1) &= \int p(y \mid \theta, \mathcal{M}_1) \, p(\theta \mid \mathcal{M}_1) \, \mathrm{d}\theta \\[.5em] &= \int {n \choose y} \theta^y (1 - \theta)^{n - y} \frac{1}{\text{B}(a, b)} \theta^{a - 1} (1 - \theta)^{b - 1} \, \mathrm{d}\theta \\[1em] &= {n \choose y} \frac{1}{\text{B}(a, b)} \int \theta^y (1 - \theta)^{n - y} \theta^{a - 1} (1 - \theta)^{b - 1} \, \mathrm{d}\theta \\[1em] &= {n \choose y} \frac{1}{\text{B}(a, b)} \int \theta^{a + y - 1} (1 - \theta)^{b + n - y - 1} \, \mathrm{d}\theta \\[1em] &= {n \choose y} \frac{\text{B}(a + y, b + n - y)}{\text{B}(a, b)} \\[1em] &\rightarrow \frac{1}{11} \end{aligned} \]
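  • Checking the closed form and the integral numerically (\(n = 10\), \(y = 5\), Beta(1, 1) prior):
choose(10, 5) * beta(1 + 5, 1 + 10 - 5) / beta(1, 1)  # exactly 1/11

set.seed(1)
mean(dbinom(5, 10, rbeta(1e5, 1, 1)))                 # Monte Carlo version, roughly 0.09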

Bayesian model comparison

  • How do we update our beliefs across models? \[ \begin{split} \underbrace{\frac{p(M_0 \mid y)}{p(M_1 \mid y)}}_{\text{Posterior odds}} &= \underbrace{\frac{p(y \mid M_0)}{p(y \mid M_1)}}_{\text{Bayes factor}} \, \times \underbrace{\frac{p(M_0)}{p(M_1)}}_{\text{Prior odds}} \\[1ex] \frac{p(M_0 \mid y)}{p(M_1 \mid y)} &= \frac{1 / 4}{1 / 11} \times \frac{1/2}{1/2} \approx 2.75 \end{split} \]

Bayes factor

  • is a great tool for model comparison
    • complex models can predict many data patterns
    • thus spread out their prior probability, \(p(\theta)\)
    • this gets factored into the marginal likelihood, \(\int p(y \mid \theta) \, p(\theta)\mathrm{d}\theta\), and decreases it
    • instantiates an automatic Ockham’s razor

  • Bayes factors take the functional form of the model into account, in contrast to
    • \(AIC = -2 \log p(\textbf{y} \mid \hat \theta) + 2\cdot k\)
    • \(BIC = -2 \log p(\textbf{y} \mid \hat \theta) + \log n \cdot k\)
  • which penalize model complexity solely based on the number of parameters \(k\)

Thermometer of evidence

Bayesian model comparison: Wrap-up

  • Model = Prior + Likelihood makes predictions about future data
  • Model’s predictive performance is given by how likely the observed data are under that model
    • Point priors = just the likelihood
    • Proper priors = integrate the likelihood over the prior
  • Easy in conjugate settings
  • Easy in low-dimensional settings

Monte Carlo

  • Monte Carlo integration:

\[ \begin{aligned} p(y \mid \mathcal{M}_1) &= \int p(y \mid \theta, \mathcal{M}_1) \, p(\theta \mid \mathcal{M}_1) \, \mathrm{d}\theta \\[.5em] &\approx \frac{1}{N} \sum_{i=1}^N p(y \mid \theta_i) \hspace{1em}\text{with} \hspace{1em} \theta_i \sim \text{Beta}(1, 1) \end{aligned} \]

prior <- rbeta(nr_draws, 1, 1)
mean(dbinom(5, 10, prior))
## [1] 0.09102586
  • This suffers from the curse of dimensionality:
    • With increasing number of parameters, the space in which the samples live grows exponentially

Curse of dimensionality

  • See Betancourt (2017)

Bayesian posterior prediction

  • [Computation on whiteboard]
| Problem    | Classical statistics     | Bayesian statistics                        |
|------------|--------------------------|--------------------------------------------|
| Estimation | Maximum Likelihood, CIs  | Posterior distribution (Bayes’ rule)       |
| Testing    | p-value                  | Bayes factor (Bayes’ rule)                 |
| Prediction | Plug-in Estimate         | Account for uncertainty using Bayes’ rule  |

Bayesian posterior prediction

  • Let \(y'\) denote the outcome of the next coin flip, and let \(y\) denote the observed data

\[ \begin{align*} p(y' = 1 \mid y) &= \int_0^1 p(y' = 1, \theta \mid y) \, \mathrm{d}\theta \\ &= \int_0^1 p(y' = 1 \mid \theta) \, p(\theta \mid y) \, \mathrm{d}\theta \\ &= \int_0^1 \theta \, p(\theta \mid y) \, \mathrm{d}\theta \\ &= \int_0^1 \theta \, \text{Beta}(\theta \mid \alpha + y, \beta + n - y) \, \mathrm{d}\theta \\ &= \mathbb{E}[\theta \mid y] = \frac{16}{16 + 6} \approx .73 \end{align*} \]
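  • The same calculation in code, again with \(y = 15\), \(n = 20\), and a Beta(1, 1) prior:
y <- 15; n <- 20
(1 + y) / (1 + 1 + n)               # posterior mean: (a + y) / (a + b + n) = 16/22
mean(rbeta(1e5, 1 + y, 1 + n - y))  # Monte Carlo version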

Bayesian priors \(\equiv\) regularization

Exercise IV

  • (a) Assume \(y = 15\) heads out of \(n = 20\) coin flips
  • Compare the models

\[ \begin{aligned} \mathcal{M}_1&: \theta = 0.5 \\[.5em] \mathcal{M}_2&: \theta \sim \text{Beta}(1, 1) \\[.5em] \mathcal{M}_3&: \theta \sim \text{Beta}(1, 1), \, \theta \in [0.5, 1] \\[.5em] \mathcal{M}_4&: \theta \sim \text{Beta}(1, 1), \, \theta \in [0, 0.5] \\[.5em] \end{aligned} \]

  • Which model predicts the data best?

Exercise IV

  • (b) Using uniform priors on the models, compute the posterior model probabilities
    • Remove the best performing model and recompute the posterior model probabilities
      • Formally, what does this procedure do?
      • Does it make you feel a bit uncomfortable? Why? Why not?
      • What are the differences between posterior model probabilities and Bayes factors?

  • (c) Plot the posterior of \(\theta\) for the best model

\[ p(y' = 1 \mid y, \mathcal{M}) = \int p(y' = 1 \mid y, \theta, \mathcal{M}) \, p(\theta \mid y, \mathcal{M}) \, \mathrm{d}\theta \]

Solutions (a)

y <- 15
n <- 20
nsim <- 10e3
# Monte Carlo estimate of the marginal likelihood, given draws from the prior
compute_pred <- function(prior) mean(sapply(prior, function(theta) dbinom(y, n, theta)))

pred1 <- dbinom(y, n, 0.50)                # M1: theta = 0.5
pred2 <- compute_pred(runif(nsim, 0, 1))   # M2: theta ~ Beta(1, 1)
pred3 <- compute_pred(runif(nsim, .5, 1))  # M3: theta restricted to [0.5, 1]
pred4 <- compute_pred(runif(nsim, 0, .5))  # M4: theta restricted to [0, 0.5]

(preds <- c(pred1, pred2, pred3, pred4))
## [1] 0.014785767 0.047842480 0.094482410 0.001267485

Solutions (b)

model_prior <- 1/4  # uniform prior over the four models

(model_posteriors <- model_prior * preds / (sum(model_prior * preds)))
## [1] 0.093357369 0.302077545 0.596562185 0.008002901
# recompute after removing model 3, the best-performing one
(model_posteriors2 <- model_prior[-3] * preds[-3] / (sum(model_prior[-3] * preds[-3])))
## [1] 0.23140461 0.74875863 0.01983676

Solutions (c)

theta <- seq(0, 1, .001)

# grid approximation of the posterior under M3 (theta restricted to [0.5, 1])
post_theta <- dunif(theta, .5, 1) * dbinom(y, n, theta)
post_theta <- post_theta / sum(post_theta)
plot(theta, post_theta, type = 'l', xlab = 'theta', ylab = 'Density')

Markov chain Monte Carlo

  • Computing \(p(\theta \mid y)\) is generally speaking very hard

\[ \underbrace{p(\theta \mid y)}_{{\text{Posterior}}} = \frac{\overbrace{p(y \mid \theta)}^{\text{Likelihood}}\, \overbrace{p(\theta)}^{\text{Prior}}}{\underbrace{p(y)}_{\text{Marginal likelihood}}} = \frac{p(y \mid \theta) \, p(\theta)}{\int_{\Theta} p(y, \theta) \, \mathrm{d}\theta} \]

  • https://chi-feng.github.io/mcmc-demo/app.html
  • Popularity of Bayesian methods is due to computational methods that sample efficiently from the posterior:
    • Chib, S., & Greenberg, E. (1995). Understanding the Metropolis-Hastings algorithm. The American Statistician, 49(4), 327-335.
    • Casella, G., & George, E. I. (1992). Explaining the Gibbs sampler. The American Statistician, 46(3), 167-174.
    • Betancourt, M. (2017). A conceptual introduction to Hamiltonian Monte Carlo. arXiv preprint arXiv:1701.02434.
    • Gronau, et al. (2017). A tutorial on bridge sampling. Journal of Mathematical Psychology, 81, 80-97.

Bayesian hypothesis testing

Thermometer of evidence

Lindley’s Paradox

k <- 49581
N <- 98451

\[ \mathcal{H}_0: \theta = .5 \\ \mathcal{H}_1: \theta \neq .5 \]

\(p\)-value

## [1] 0.02364686

\[ \begin{split} \mathcal{H}_0 &: \theta = .5 \\ \mathcal{H}_1 &: \theta \sim \text{Beta}(1, 1) \end{split} \]

\(BF_{01}\)

## [1] 19.21139
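  • A sketch of how these two numbers can be reproduced (the Bayes factor assumes the Beta(1, 1) prior under \(\mathcal{H}_1\)):
k <- 49581
N <- 98451

binom.test(k, N, p = 0.5)$p.value      # two-sided p-value, about 0.024

# under a uniform prior the marginal likelihood of H1 is 1 / (N + 1)
(bf01 <- dbinom(k, N, 0.5) * (N + 1))  # Bayes factor in favour of H0, about 19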

What makes a good prior for testing?

  • The Bayes factor is crucially sensitive to the prior over the ‘test-relevant’ parameter
    • The Bayes factor is a ratio of the average prior predictive success of two models
    • This prior sensitivity is completely obvious once one realizes that the prior is part of the model
    • If the prior is bad, the predictions are bad (example on board)

\[ p(y \mid \mathcal{M}_1) = \int_{\Theta} p(y \mid \theta, \mathcal{M}_1) \cdot p(\theta \mid \mathcal{M}_1) \mathrm{d}\theta \]

  • Alright. So what makes a good prior for testing?

What makes a good prior for testing?

  • There are several desiderata that we want a Bayes factor to fulfill (Jeffreys, 1961; Ly et al., 2018)
    • Predictive matching
    • Information consistency
    • Model selection consistency
    • Label-invariance

Correlation in JASP

Correlation in JASP

  • 1 Load Presidents.csv into JASP. Run descriptive statistics and summarize what you find.

  • 2 Conduct a classical correlation test.
    • 2a Interpret the resulting p-value and confidence interval.
    • 2b Are these concepts valid for the current data set? Hint: think about the data generating process.

  • 3 Conduct a Bayesian correlation test.
    • 3a Estimate the posterior distribution and interpret the Bayes factor.
    • 3b Does the default prior in JASP make sense?
    • 3c Run a robustness check. What do you conclude?
    • 3d Run a sequential analysis. What do you conclude?

t-test in JASP

t-test in JASP

  • 1 Load KitchenRolls.csv into JASP. Run descriptive statistics and summarize the data graphically.

  • 2 Conduct a classical t-test.
    • 2a Interpret the resulting p-value and confidence interval.
    • 2b Are these concepts valid for the current data set? Hint: think about the data generating process.

  • 3 Conduct a Bayesian t-test.
    • 3a Estimate the posterior distribution and interpret the Bayes factor.
    • 3b Does the default prior in JASP make sense?
    • 3c Run a robustness check. What do you conclude?
    • 3d Run a sequential analysis. What do you conclude?

  • 4 Conduct a Bayesian one-sided t-test. Is this approach justifiable?

Linear regression

  • Example: UScrime data set (aggregated)
##     M So  Ed Po1 Po2  LF  M.F Pop  NW  U1 U2 GDP Ineq    y
## 1 151  1  91  58  56 510  950  33 301 108 41 394  261  791
## 2 143  0 113 103  95 583 1012  13 102  96 36 557  194 1635
## 3 142  1  89  45  44 533  969  18 219  94 33 318  250  578
## 4 136  0 121 149 141 577  994 157  80 102 39 673  167 1969
## 5 141  0 121 109 101 591  985  18  30  91 20 578  174 1234
## 6 121  0 110 118 115 547  964  25  44  84 29 689  126  682
  • Load it into JASP and run a regression predicting \(y\) using all other variables
  • There’s a bunch of new stuff we haven’t touched on yet. Poke around, then we’ll discuss.
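  • The table resembles the UScrime data from the MASS package with the Prob and Time columns dropped (an assumption); something along these lines would produce a CSV to load into JASP:
library('MASS')
crime <- subset(UScrime, select = -c(Prob, Time))
write.csv(crime, 'UScrime.csv', row.names = FALSE)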

Comparing theoretically motivated hypotheses

Comparing theoretically motivated hypotheses

  • Previous research has used a t-test comparing, say
    • Mean extraversion of Gryffindor with mean extraversion of Ravenclaw, Hufflepuff, Slytherin
    • This is not what we want!

  • Using Bayesian inference, we can compare models with order-constraints on the mean

\[ \begin{split} \mathcal{M}_0 &: \mu_{\text{G}} = \mu_{\text{S}} = \mu_{\text{R}} = \mu_{\text{H}} \\ \mathcal{M}_f &: \mu_{\text{G}}, \, \mu_{\text{S}} , \, \mu_{\text{R}} , \, \mu_{\text{H}} \\ \mathcal{M}_r &: \mu_{\text{G}} > (\mu_{\text{S}} , \, \mu_{\text{R}} , \, \mu_{\text{H}}) \end{split} \]
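  • One way to get the Bayes factor for \(\mathcal{M}_r\) against \(\mathcal{M}_f\) is the encompassing-prior approach (Klugkist & Hoijtink, 2005): the ratio of the posterior to the prior probability that the order constraint holds under \(\mathcal{M}_f\). A sketch with purely hypothetical draws standing in for real MCMC output:
set.seed(1)
n_draws <- 1e5
prior_draws <- replicate(4, rnorm(n_draws, 0, 1))      # columns: mu_G, mu_S, mu_R, mu_H
post_draws  <- cbind(rnorm(n_draws, 0.4, 0.2),         # hypothetical posterior draws
                     replicate(3, rnorm(n_draws, 0, 0.2)))

holds <- function(d) mean(d[, 1] > d[, 2] & d[, 1] > d[, 3] & d[, 1] > d[, 4])
holds(post_draws) / holds(prior_draws)   # BF_rf; values above 1 favour the order constraint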

Some Literature

  • DeGroot, M. H. (1982). Lindley’s paradox: Comment. Journal of the American Statistical Association, 77(378), 336-339.
  • Ly, A., Verhagen, A. J., & Wagenmakers, E.-J. (2016). Harold Jeffreys’s default Bayes factor hypothesis tests: Explanation, extension, and application in psychology. Journal of Mathematical Psychology, 72, 19-32.
  • Ly, A., et al. (2019). The Bayesian Methodology of Sir Harold Jeffreys as a Practical Alternative to the P-value Hypothesis Test. Pre-print.

Bayesian workflow

Workflow

  • Specify the likelihood

  • Choose priors
    • Do they make substantive sense?
    • Can we constrain them logically?
    • Can we base them on prior data / modeling?
    • Does the resulting prior predictive distribution make sense?

A post-dictive perspective

  • Bayes factors consider a model’s prediction before seeing data
  • Leave-one-out (Loo) cross-validation considers predictions after seeing data

  • Let \((x_1, x_2, \ldots, x_n)\) denote \(n\) observations
  • We compute the Loo score of a model by writing

\[ \begin{aligned} \hat{\text{elpd}}_{\text{loo}}^{\mathcal{M}} &= \sum_{i=1}^n \, \text{log} \, p(x_i \mid x_{-i}) \\[.5em] &= \sum_{i=1}^n \, \text{log} \int p(x_i, \theta \mid x_{-i}) \, \mathrm{d}\theta \\[.5em] &= \sum_{i=1}^n \, \text{log} \int p(x_i \mid \theta) \, p(\theta \mid x_{-i}) \, \mathrm{d}\theta \end{aligned} \]

  • For details, see Gelman et al. (2014) & Vehtari et al. (2017)
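  • A minimal sketch of computing \(\hat{\text{elpd}}_{\text{loo}}\) in practice with brms (which calls the loo package); the model and data here are only illustrative:
library('brms')

set.seed(1)
dat <- data.frame(x = rnorm(50, 100, 10))
fit <- brm(x ~ 1, data = dat, refresh = 0)
loo(fit)   # reports elpd_loo and p_loo with standard errors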

Contrast: Bayes factors & Loo

  • Bayes factors are
    • pre-dictive
    • Model selection consistent
    • Highly dependent on the priors
    • What probability theory dictates (and thus have a clear interpretation)
    • \(\mathcal{M}\)-closed?
    • Hard to compute
  • Loo is
    • post-dictive
    • Model selection inconsistent
    • No strong dependence on the priors
    • \(\mathcal{M}\)-open
    • Easy to compute

  • For recent discussions, see Gronau & Wagenmakers (2019) & Vehtari et al. (2019)

Posterior predictive checking

  • p-values are a good tool for model checking
    • Assuming the model is true, are the data surprising?

Posterior predictive checking

library('brms')

# x is the vector of observations to be checked (e.g., the data shown earlier)
m <- brm(
  x ~ 1, prior = prior(normal(0, 5), class = 'Intercept'),
  data = data.frame(x = x), refresh = 0
)
plot(m)
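  • With the model fitted, the posterior predictive check itself can be drawn with brms’ pp_check, which compares replicated data sets to the observed data:
pp_check(m)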

Posterior predictive checking

Conclusion

  • Bayesian inference uses probability to quantify uncertainty for everything
  • Model comparison, parameter estimation, and model prediction are unified by Bayes’ rule

\[ \underbrace{p(\theta \mid y)}_{{\text{Posterior}}} = \frac{\overbrace{p(y \mid \theta)}^{\text{Likelihood}}\, \overbrace{p(\theta)}^{\text{Prior}}}{\underbrace{p(y)}_{\text{Marginal likelihood}}} = \frac{p(y \mid \theta) \, p(\theta)}{\int_{\Theta} p(y, \theta) \, \mathrm{d}\theta} \]

  • Bayesian priors naturally regularize parameter estimates
  • Bayes factors require careful prior specification (cf., Lindley’s paradox)
  • Can test theoretically interesting hypotheses and compare non-nested models
  • Priors are less critical for parameter estimation, where the data tend to dominate
  • Software for Bayesian inference: e.g., JASP and brms, both used in these slides