2. Small Worlds and Large Worlds#

2.1 The garden of forking data#

2.1.1 Counting possibilities#

The author’s footnote refers to Cox’s theorem; other justifications are given in Bayesian probability § Justification.

As discussed in Cox’s theorem § Interpretation and further discussion, there’s plenty of reason to doubt this justification.

An arguably better justification is the Dutch book theorems; see Dutch Book Arguments (Stanford Encyclopedia of Philosophy) and Notes on the Dutch Book Argument (by David A. Freedman) for more rigorous mathematics going back to de Finetti, who originally made the argument. This justification remains completely finite, which seems desirable not only if you have prior commitments to finitism but also in light of the following research (quoting from David A. Freedman):

In particular, the 1965 paper with the innocent title “On the asymptotic behaviour of Bayes estimates in the discrete case II” finds the rather disappointing answer that when sampling from a countably infinite population the Bayesian procedure fails almost everywhere, i.e., one does not obtain the true distribution asymptotically. This situation is quite different from the finite case when the (discrete) random variable takes only finite many values and the Bayesian method is consistent in agreement with earlier findings of Doob (1948).

From Bayesian inference § Alternatives to Bayesian updating:

Ian Hacking noted that traditional “Dutch book” arguments did not specify Bayesian updating: they left open the possibility that non-Bayesian updating rules could avoid Dutch books. Hacking wrote: “And neither the Dutch book argument nor any other in the personalist arsenal of proofs of the probability axioms entails the dynamic assumption. Not one entails Bayesianism. So the personalist requires the dynamic assumption to be Bayesian. It is true that in consistency a personalist could abandon the Bayesian model of learning from experience. Salt could lose its savour.”

Indeed, there are non-Bayesian updating rules that also avoid Dutch books (as discussed in the literature on “probability kinematics”) following the publication of Richard C. Jeffrey’s rule, which applies Bayes’ rule to the case where the evidence itself is assigned a probability. The additional hypotheses needed to uniquely require Bayesian updating have been deemed to be substantial, complicated, and unsatisfactory.
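Jeffrey's rule can be made concrete with a small sketch. If new evidence shifts the probability of a proposition E to some value q, rather than fixing it at 1 as in ordinary conditioning, the updated probability of a hypothesis H is P(H | E)·q + P(H | ¬E)·(1 − q). All numbers below are invented purely for illustration:

```r
# Jeffrey's rule (probability kinematics): updating on uncertain evidence.
# All probabilities below are invented for illustration.
p_H_given_E    <- 0.9   # P(H | E)
p_H_given_notE <- 0.2   # P(H | not E)

# New evidence raises the probability of E to q (without making it certain).
jeffrey_update <- function(q) p_H_given_E * q + p_H_given_notE * (1 - q)

jeffrey_update(0.8)   # 0.9 * 0.8 + 0.2 * 0.2 = 0.76
jeffrey_update(1)     # q = 1 recovers ordinary conditioning: P(H | E) = 0.9
```

Setting q = 1 recovers ordinary Bayesian conditioning, which is why Jeffrey's rule is described as applying Bayes' rule to the case where the evidence itself is assigned a probability.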

2.1.2 Combining other information#

2.1.3 From counts to probability#

The author uses the term “plausibility” as a synonym for probability; it’s not clear why the word is being introduced.

ways <- c(0, 3, 8, 9, 0)
ways / sum(ways)
[1] 0.00 0.15 0.40 0.45 0.00

The following sentence appears to accidentally italicize the word “they”:

These plausibilities are also probabilities—they are …


2.2. Building a model#

The three steps the author introduces in this section almost surely come directly from his reading of BDA3’s section “The three steps of Bayesian data analysis.”

2.2.1. A data story#

2.2.2. Bayesian updating#

2.2.3. Evaluate#

2.3. Components of the model#

2.3.1. Variables#

2.3.2. Definitions#

By far the most confusing definition given here is that of the likelihood. In this book, a ‘likelihood’ refers to the distribution function assigned to an observed variable; in this section, for example, the ‘likelihood’ is the binomial distribution. According to the author, this is also how the term is used in ‘conventional’ statistics. Note that any such ‘likelihood’ is a function, since every probability distribution is a function.

In non-Bayesian statistics, and in particular on Wikipedia, the likelihood function is defined quite differently and is denoted \(\mathcal{L}\). See the author’s footnote and Likelihood function.

dbinom(6, size = 9, prob = 0.5)
[1] 0.1640625
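The two readings of dbinom make the contrast concrete: read as a probability distribution, the parameter is fixed and the data vary; read as the likelihood function \(\mathcal{L}(p)\), the observed data are fixed and the parameter varies. A sketch (this comparison is mine, not the author's):

```r
# Reading 1: a probability distribution. Fix prob, vary the data.
# The probabilities of all possible counts 0..9 sum to 1.
dist <- dbinom(0:9, size = 9, prob = 0.5)
sum(dist)   # 1

# Reading 2: a likelihood function L(p). Fix the observed data (6 waters
# in 9 tosses), vary the parameter. These values need not sum to 1.
L <- function(p) dbinom(6, size = 9, prob = p)
L(0.5)   # 0.1640625, the value computed above
L(6/9)   # larger: the maximum-likelihood value of p makes the data most probable
```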

2.4. Making the model go#

2.4.3 Grid Approximation#

p_grid <- seq(from=0, to=1, length.out=20)
p_grid
 [1] 0.00000000 0.05263158 0.10526316 0.15789474 0.21052632 0.26315789
 [7] 0.31578947 0.36842105 0.42105263 0.47368421 0.52631579 0.57894737
[13] 0.63157895 0.68421053 0.73684211 0.78947368 0.84210526 0.89473684
[19] 0.94736842 1.00000000
prior <- rep(1, 20)
prior
 [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
# compute likelihood at each value in grid
likelihood <- dbinom(6, size=9, prob=p_grid)
plot(p_grid, likelihood, type="b", xlab="probability of water", ylab="likelihood")
[Figure: likelihood vs. probability of water]
# compute product of likelihood and prior
unstd.posterior <- likelihood * prior
plot(p_grid, unstd.posterior, type="b", xlab="probability of water", ylab="unstandardized posterior")
[Figure: unstandardized posterior vs. probability of water]
# standardize the posterior, so it sums to 1
posterior <- unstd.posterior / sum(unstd.posterior)
plot(p_grid, posterior, type="b", xlab="probability of water", ylab="posterior probability")
mtext( "20 points" )
[Figure: posterior probability vs. probability of water, 20 points]
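With a flat prior, the exact posterior here is Beta(W + 1, L + 1) = Beta(7, 4), so the grid result can be checked directly against dbeta. This comparison is my own addition, not from the book:

```r
# Rebuild the grid posterior and compare it to the exact Beta(7, 4)
# posterior implied by the flat prior (a check not shown in the book).
p_grid <- seq(from = 0, to = 1, length.out = 20)
likelihood <- dbinom(6, size = 9, prob = p_grid)
posterior <- likelihood / sum(likelihood)   # the flat prior cancels

exact <- dbeta(p_grid, 6 + 1, 3 + 1)
exact <- exact / sum(exact)   # normalize over the same grid

max(abs(posterior - exact))   # numerically zero: same shape at the grid points
```

The two agree to floating-point precision because both are proportional to \(p^6 (1-p)^3\); the grid only introduces error through its coarseness, not through the shape at the grid points.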

2.4.4 Quadratic Approximation#

library(rethinking)
Loading required package: rstan
rstan version 2.32.6 (Stan version 2.32.2)
rethinking (Version 2.13)
globe.qa <- quap(
    alist(
        W ~ dbinom(W+L, p), # binomial likelihood
        p ~ dunif(0, 1)     # uniform prior
    ),
    data=list(W=6,L=3) )
precis(globe.qa)
       mean        sd      5.5%     94.5%
p 0.6666669 0.1571337 0.4155368 0.9177969
# analytical calculation
W <- 6
L <- 3
curve( dbeta( x , W+1 , L+1 ) , from=0 , to=1 )
# quadratic approximation
curve( dnorm( x , 0.67 , 0.16 ) , lty=2 , add=TRUE )
[Figure: analytical Beta(7, 4) posterior (solid) and quadratic approximation (dashed)]
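The numbers precis reports can also be reproduced by hand: the quadratic approximation centers a Gaussian at the posterior mode and takes its variance from the curvature (negative inverse second derivative) of the log posterior there. A minimal sketch that does not need the rethinking package; the numerical step size h is my own choice:

```r
# Quadratic approximation by hand: posterior mode plus curvature there.
W <- 6; L <- 3
log_post <- function(p) dbinom(W, size = W + L, prob = p, log = TRUE)

# Posterior mode (flat prior, so this is the MLE W/(W+L) = 2/3).
mode <- optimize(log_post, interval = c(0, 1), maximum = TRUE)$maximum

# Numerical second derivative of the log posterior at the mode.
h <- 1e-5
d2 <- (log_post(mode + h) - 2 * log_post(mode) + log_post(mode - h)) / h^2
sd_approx <- sqrt(-1 / d2)

c(mode = mode, sd = sd_approx)   # approx 0.667 and 0.157, matching precis
```

Analytically, the second derivative of \(6 \log p + 3 \log(1-p)\) at \(p = 2/3\) is \(-6/p^2 - 3/(1-p)^2 = -40.5\), giving \(\sqrt{1/40.5} \approx 0.157\), which is the sd precis reports.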

2.4.5 Markov chain Monte Carlo#

n_samples <- 1000
p <- rep(NA, n_samples)
p[1] <- 0.5
W <- 6
L <- 3
for (i in 2:n_samples) {
    p_new <- rnorm( 1 , p[i-1] , 0.1 )
    if ( p_new < 0 ) p_new <- abs( p_new )
    if ( p_new > 1 ) p_new <- 2 - p_new
    q0 <- dbinom( W , W+L , p[i-1] )
    q1 <- dbinom( W , W+L , p_new )
    p[i] <- ifelse( runif(1) < q1/q0 , p_new , p[i-1] )
}
dens( p , xlim=c(0,1) )
curve( dbeta( x , W+1 , L+1 ) , lty=2 , add=TRUE )
[Figure: density of the MCMC samples with the analytical posterior (dashed)]
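The chain can be sanity-checked against the exact posterior Beta(W + 1, L + 1), whose mean is 7/11 ≈ 0.636. This check, along with the seed and the larger sample count, is my own addition; the book sets neither:

```r
# Re-run the Metropolis chain and compare its mean to the exact
# Beta(W+1, L+1) posterior mean (a check not shown in the book).
set.seed(42)        # for reproducibility; the book does not set a seed
n_samples <- 1e4    # more samples than the book's 1000, for a tighter check
p <- rep(NA, n_samples)
p[1] <- 0.5
W <- 6; L <- 3
for (i in 2:n_samples) {
    p_new <- rnorm(1, p[i - 1], 0.1)
    if (p_new < 0) p_new <- abs(p_new)      # reflect proposals off 0
    if (p_new > 1) p_new <- 2 - p_new       # ... and off 1
    q0 <- dbinom(W, W + L, p[i - 1])
    q1 <- dbinom(W, W + L, p_new)
    p[i] <- ifelse(runif(1) < q1 / q0, p_new, p[i - 1])
}
c(chain_mean = mean(p), exact_mean = (W + 1) / (W + L + 2))  # both near 0.636
```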