2. Small Worlds and Large Worlds#
2.1 The garden of forking data#
2.1.1 Counting possibilities#
The author’s footnote refers to Cox’s theorem; other justifications are given in Bayesian probability § Justification.
As discussed in Cox’s theorem § Interpretation and further discussion, there’s plenty of reason to doubt this justification.
An arguably better justification is the Dutch book theorems; see Dutch Book Arguments (Stanford Encyclopedia of Philosophy) and Notes on the Dutch Book Argument (by David A. Freedman) for more rigorous mathematics going back to de Finetti, who originally made the argument. This justification remains completely finite, which seems desirable not only if you have prior commitments to finitism but also in light of the following result (quoting from David A. Freedman):
In particular, the 1965 paper with the innocent title “On the asymptotic behaviour of Bayes estimates in the discrete case II” finds the rather disappointing answer that when sampling from a countably infinite population the Bayesian procedure fails almost everywhere, i.e., one does not obtain the true distribution asymptotically. This situation is quite different from the finite case when the (discrete) random variable takes only finite many values and the Bayesian method is consistent in agreement with earlier findings of Doob (1948).
From Bayesian inference § Alternatives to Bayesian updating:
Ian Hacking noted that traditional “Dutch book” arguments did not specify Bayesian updating: they left open the possibility that non-Bayesian updating rules could avoid Dutch books. Hacking wrote: “And neither the Dutch book argument nor any other in the personalist arsenal of proofs of the probability axioms entails the dynamic assumption. Not one entails Bayesianism. So the personalist requires the dynamic assumption to be Bayesian. It is true that in consistency a personalist could abandon the Bayesian model of learning from experience. Salt could lose its savour.”
Indeed, there are non-Bayesian updating rules that also avoid Dutch books (as discussed in the literature on “probability kinematics”) following the publication of Richard C. Jeffrey’s rule, which applies Bayes’ rule to the case where the evidence itself is assigned a probability. The additional hypotheses needed to uniquely require Bayesian updating have been deemed to be substantial, complicated, and unsatisfactory.
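For reference (a gloss that is not in the book): for a two-cell partition \(\{E, \lnot E\}\), Jeffrey's rule sets \(P_{\text{new}}(H) = P(H \mid E)\,P_{\text{new}}(E) + P(H \mid \lnot E)\,P_{\text{new}}(\lnot E)\), where \(P_{\text{new}}(E)\) is the probability now assigned to the evidence itself; when \(P_{\text{new}}(E) = 1\) this reduces to ordinary Bayesian conditioning on \(E\).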
2.1.3 From counts to probability#
The author uses the term “plausibility” as a synonym for probability; it’s not clear why the word is being introduced.
ways <- c(0, 3, 8, 9, 0)  # number of ways each conjecture could produce the observed draws
ways / sum(ways)          # normalize the counts to plausibilities
- 0
- 0.15
- 0.4
- 0.45
- 0
The following sentence seems to accidentally italicize the word they (📌):
These plausibilities are also probabilities—*they* are …
2.2. Building a model#
The three steps the author introduces in this section almost surely come directly from his reading of the BDA3 section The three steps of Bayesian data analysis.
2.3. Components of the model#
2.3.2. Definitions#
By far the most confusing definition given here is the one for the likelihood. In this book, a “likelihood” refers to the distribution function assigned to an observed variable; in this section, for example, the “likelihood” is the binomial distribution. According to the author, this is also how the word is used in “conventional” statistics. The author usually says just “likelihood” rather than “likelihood function,” but it is of course a function, since any probability distribution is a function.
In non-Bayesian statistics, and in particular on Wikipedia, the likelihood function is defined differently: it is the probability of the observed data viewed as a function of the parameters, with the data held fixed, and is denoted \(\mathcal{L}\). See the author’s footnote and Likelihood function.
dbinom(6, size = 9, prob = 0.5)  # probability of 6 waters in 9 tosses when p = 0.5; about 0.164
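As a small illustration of the conventional usage (a sketch, not from the book; the name `L_fun` is mine), the same `dbinom` expression becomes the likelihood function \(\mathcal{L}(p)\) once the data are held fixed at 6 waters in 9 tosses and `prob` is allowed to vary:

# conventional likelihood: data fixed (6 waters in 9 tosses), parameter p varies
L_fun <- function(p) dbinom(6, size = 9, prob = p)
L_fun(0.5)  # same number as above, now read as L(0.5)
curve(L_fun(x), from = 0, to = 1, xlab = "p", ylab = "likelihood")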
2.4. Making the model go#
2.4.3 Grid Approximation#
p_grid <- seq(from=0, to=1, length.out=20)  # grid of 20 candidate values for p
p_grid
- 0
- 0.0526315789473684
- 0.105263157894737
- 0.157894736842105
- 0.210526315789474
- 0.263157894736842
- 0.315789473684211
- 0.368421052631579
- 0.421052631578947
- 0.473684210526316
- 0.526315789473684
- 0.578947368421053
- 0.631578947368421
- 0.684210526315789
- 0.736842105263158
- 0.789473684210526
- 0.842105263157895
- 0.894736842105263
- 0.947368421052632
- 1
prior <- rep(1, 20)  # flat prior: equal prior weight at every grid point
prior
- 1
- 1
- 1
- 1
- 1
- 1
- 1
- 1
- 1
- 1
- 1
- 1
- 1
- 1
- 1
- 1
- 1
- 1
- 1
- 1
Compute the likelihood at each value in a grid:
likelihood <- dbinom(6, size=9, prob=p_grid)
plot(p_grid, likelihood, type="b", xlab="probability of water", ylab="likelihood")
Compute the product of the likelihood and prior:
unstd.posterior <- likelihood * prior
plot(p_grid, unstd.posterior, type="b", xlab="probability of water", ylab="unstandardized posterior")
Standardize the posterior, so it sums to 1:
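posterior <- unstd.posterior / sum(unstd.posterior)  # divide by the total so the values sum to 1
plot(p_grid, posterior, type="b", xlab="probability of water", ylab="posterior probability")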
2.4.4 Quadratic Approximation#
suppressPackageStartupMessages(library(rethinking))
globe.qa <- quap(
    alist(
        W ~ dbinom(W+L, p),  # binomial likelihood
        p ~ dunif(0, 1)      # uniform prior
    ),
    data = list(W = 6, L = 3)
)
precis(globe.qa)
|   | mean      | sd        | 5.5%      | 94.5%     |
|---|-----------|-----------|-----------|-----------|
| p | 0.6666664 | 0.1571338 | 0.4155361 | 0.9177966 |
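To see how good the quadratic approximation is, it can be compared with the exact posterior (a sketch in the spirit of the book’s check; the 0.67 and 0.16 below are just the precis mean and sd rounded): with the flat prior the analytical posterior is \(\mathrm{Beta}(W+1, L+1)\).

# exact posterior under the flat prior: Beta(W+1, L+1) with W = 6, L = 3
W <- 6; L <- 3
curve(dbeta(x, W+1, L+1), from = 0, to = 1,
      xlab = "probability of water", ylab = "density")
# quadratic approximation: normal with the mean and sd reported by precis
curve(dnorm(x, 0.67, 0.16), lty = 2, add = TRUE)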
2.4.5 Markov chain Monte Carlo#
n_samples <- 1000
p <- rep(NA, n_samples)  # vector to hold the samples
p[1] <- 0.5              # start the chain at p = 0.5
W <- 6
L <- 3
for (i in 2:n_samples) {
    # propose a new value near the current one, reflecting at the boundaries
    p_new <- rnorm(1, p[i-1], 0.1)
    if (p_new < 0) p_new <- abs(p_new)
    if (p_new > 1) p_new <- 2 - p_new
    # likelihood of the data at the current and proposed values
    q0 <- dbinom(W, W+L, p[i-1])
    q1 <- dbinom(W, W+L, p_new)
    # accept the proposal with probability q1/q0 (capped at 1), else keep the current value
    p[i] <- ifelse(runif(1) < q1/q0, p_new, p[i-1])
}
dens(p, xlim=c(0,1))                        # density of the MCMC samples
curve(dbeta(x, W+1, L+1), lty=2, add=TRUE)  # exact posterior for comparison