Practice: Chp. 12#

source("iplot.R")
suppressPackageStartupMessages(library(rethinking))


12E1. What is the difference between an ordered categorical variable and an unordered one? Define and then give an example of each.

Answer. An ordered categorical variable is one in which the categories are ranked, but the distance between the categories is not necessarily constant. An example of an ordered categorical variable is the Likert scale; an example of an unordered categorical variable is the Blood type of a person.

12E2. What kind of link function does an ordered logistic regression employ? How does it differ from an ordinary logit link?

Answer. It employs a cumulative link, which maps the probability of a category or any category of lower rank (the cumulative probability) to the log-cumulative-odds scale. The major difference from the ordinary logit link is the type of probabilties being mapped; the cumulative link maps cumulative probabilities to log-cumulative-odds and the ordinary logit link maps absolute probabilities to log-odds.

12E3. When count data are zero-inflated, using a model that ignores zero-inflation will tend to induce which kind of inferential error?

Answer. A ‘zero’ means that nothing happened; nothing can happen either because the rate of events is low because the process that generates events failed to get started. We may incorrectly infer the rate of events is lower than it is in reality, when in fact it never got started.

12E4. Over-dispersion is common in count data. Give an example of a natural process that might produce over-dispersed counts. Can you also give an example of a process that might produce under-dispersed counts?

Answer. The sex ratios of families actually skew toward either all boys or girls; see Overdispersion.

In contexts where the waiting time for a service is bounded (e.g. a store is only open a short time) but the model still uses a Poisson distribution you may see underdispersed counts because large values are simply not possible. If one queue feeds into another, and the first queue has a limited number of workers, the waiting times for the second queue may be underdispersed because of the homogeneity enforced by the bottleneck in the first queue. “)

display_markdown(” 12M1. At a certain university, employees are annually rated from 1 to 4 on their productivity, with 1 being least productive and 4 most productive. In a certain department at this certain university in a certain year, the numbers of employees receiving each rating were (from 1 to 4): 12, 36, 7, 41. Compute the log cumulative odds of each rating.

freq_prod <- c(12, 36, 7, 41)
pr_k <- freq_prod / sum(freq_prod)
cum_pr_k <- cumsum(pr_k)
lco = list(log_cum_odds = logit(cum_pr_k))
display(lco)

log_cum_odds = 1. -1.94591014905531 2. 0 3. 0.293761118528163 4. Inf 12M2. Make a version of Figure 12.5 for the employee ratings data given just above. iplot(function() { off <- 0.04 plot( 1:4, cum_pr_k, type="b", main="Version of Figure 12.5 for Employee Ratings Data", xlab="response", ylab="cumulative proportion", ylim=c(0,1) ) # See simplehist for (j in 1:4) lines(c(j,j), c(0,cum_pr_k[j]), lwd=3, col="gray") for (j in 1:4) lines(c(j,j)+off, c(if (j==1) 0 else cum_pr_k[j-1],cum_pr_k[j]), lwd=3, col=rangi2) })  12M3. Can you modify the derivation of the zero-inflated Poisson distribution (ZIPoisson) from the chapter to construct a zero-inflated binomial distribution? Answer. Call $$p_z$$ the additional probability of a zero, and $$p_b$$ the probability of success associated with the binomial distribution. The likelihood of observing a zero is: \begin{split} \begin{align} Pr(0 | p_z,n,p_b) & = p_z + (1-p_z)Pr(0|n,p_b) \\ & = p_z + (1-p_z)\binom{n}{0}p_b^0(1-p_b)^{n-0} \\ & = p_z + (1-p_z)(1-p_b)^n \end{align} \end{split} The likelihood of observing a non-zero value $$k$$ is:\begin{align} Pr(k | k>0,p_z,n,p_b) & = (1-p_z)Pr(k|n,p_b) \\ & = (1-p_z)\binom{n}{k}p_b^k(1-p_b)^{n-k} \end{align}$To construct a GLM, use a logit link with $$p_z$$ and $$p_b$$ (in the typical case where $$n$$ is known). 12H1. In 2014, a paper was published that was entitled “Female hurricanes are deadlier than male hurricanes.” As the title suggests, the paper claimed that hurricanes with female names have caused greater loss of life, and the explanation given is that people unconsciously rate female hurricanes as less dangerous and so are less likely to evacuate. Statisticians severely criticized the paper after publication. Here, you’ll explore the complete data used in the paper and consider the hypothesis that hurricanes with female names are deadlier. Load the data with: R code 12.38 library(rethinking) data(Hurricanes)  Acquaint yourself with the columns by inspecting the help ?Hurricanes. In this problem, you’ll focus on predicting deaths using femininity of each hurricane’s name. Fit and interpret the simplest possible model, a Poisson model of deaths using femininity as a predictor. You can use map or ulam. Compare the model to an intercept-only Poisson model of deaths. How strong is the association between femininity of name and deaths? Which storms does the model fit (retrodict) well? Which storms does it fit poorly? Answer. The Hurricanes data.frame with an index and a standardized femininity: data(Hurricanes) hur <- Hurricanes hur$index = 1:92
hur$StdFem = standardize(hur$femininity)
display(hur)

hur_dat = list(
Deaths = hur$deaths, Femininity = hur$StdFem
)

m_hur_pois <- map(
alist(
Deaths ~ dpois(lambda),
log(lambda) <- a + bFemininity * Femininity,
a ~ dnorm(3, 0.5),
bFemininity ~ dnorm(0, 1.0)
),
data = hur_dat
)

m_hur_intercept <- map(
alist(
Deaths ~ dpois(lambda),
log(lambda) <- a,
a ~ dnorm(3, 0.5)
),
data = hur_dat
)

A data.frame: 92 × 10
nameyeardeathscategorymin_pressuredamage_normfemalefemininityindexStdFem
<fct><int><int><int><int><int><int><dbl><int><dbl>
Easy 1950 23960 159016.77778 1-0.0009347453
King 1950 43955 535001.38889 2-1.6707580424
Able 1952 31985 15003.83333 3-0.9133139565
Barbara 1953 11987 5819.83333 4 0.9458703621
Florence1953 01985 1518.33333 5 0.4810742824
Carol 1954 6039601932118.11111 6 0.4122162926
Edna 1954 203954 323018.55556 7 0.5499353710
Hazel 1954 2049382426019.44444 8 0.8253673305
Connie 1955 03962 203018.50000 9 0.5327193242
Diane 195520019871473019.8888910 0.9630864089
Ione 1955 73960 620005.9444411-0.2591568553
Flossy 1956 152975 154017.0000012 0.0679232445
Helene 1958 13946 54019.8888913 0.9630864089
Debra 1959 01984 43019.8888914 0.9630864089
Gracie 1959 223950 51019.7777815 0.9286574139
Donna 1960 5049305327019.2777816 0.7737253874
Ethel 1960 01981 3518.7222217 0.6015773141
Carla 1961 4649311585019.5000018 0.8425833773
Cindy 1963 31996 30019.9444419 0.9802993570
Cleo 1964 32968 645017.9444420 0.3605712508
Dora 1964 529661626019.3333321 0.7909383355
Hilda 1964 373950 277018.8333322 0.6360063090
Isbell 1964 32974 80019.4444423 0.8253673305
Betsy 1965 7539482000018.3333324 0.4810742824
Alma 1966 62982 73018.7777825 0.6187933608
Inez 1966 31983 9918.2777826 0.4638613343
Beulah 1967 153950 506017.2777827 0.1539972812
Camille 196925659092304019.0555629 0.7048673975
Celia 1970 223945 687019.4444430 0.8253673305
Bertha 1996 82974 7001 8.5000063 0.5327193
Fran 1996 263954 82601 7.1666764 0.1195683
Danny 1997 101984 2000 2.2222265-1.4125390
Bonnie 1998 32964 16501 9.3888966 0.8081544
Earl 1998 31987 1600 1.8888967-1.5158260
Georges 1998 12964 38700 2.2777868-1.3953230
Bret 1999 03951 940 2.3333369-1.3781100
Floyd 1999 562956 81300 1.8333370-1.5330421
Irene 1999 81964 14301 9.2777871 0.7737254
Lili 2002 21963 1260110.3333372 1.1008024
Claudette2003 31979 2501 9.1666773 0.7392964
Isabel 2003 512957 49801 9.3888974 0.8081544
Alex 2004 11972 50 4.1666775-0.8100239
Charley 2004 104941205100 2.8888976-1.2059620
Frances 2004 72960126201 6.0000077-0.2419408
Gaston 2004 81985 1700 2.6666778-1.2748200
Ivan 2004 253946185900 1.0555679-1.7740450
Jeanne 2004 53950102101 8.5000080 0.5327193
Cindy 2005 11991 3501 9.9444481 0.9802994
Dennis 2005 153946 26500 2.4444482-1.3436810
Ophelia 2005 11982 911 9.1666783 0.7392964
Rita 2005 623937106901 9.5000084 0.8425834
Wilma 2005 53950259601 8.6111185 0.5671483
Humberto 2007 11985 510 2.3888986-1.3608940
Dolly 2008 11967 11101 9.8333387 0.9458704
Gustav 2008 522954 43600 1.7222288-1.5674711
Ike 2008 842950203700 1.8888989-1.5158260
Irene 2011 411952 71101 9.2777890 0.7737254
Isaac 2012 51966240000 1.9444491-1.4986131
Sandy 20121592942750001 9.0000092 0.6876514

The model with the femininity predictor appears to have inferred a minor but significant association of deaths with femininity:

iplot(function() {
plot(precis(m_hur_pois), main="m_hur_pois")
}, ar=4.5)


The model with the femininity predictor performs only marginally better on WAIC than the intercept-only model:

iplot(function() {
plot(compare(m_hur_pois, m_hur_intercept))
}, ar=4.5)


The model is not accurate when it attempts to retrodict the death toll on storms with many deaths. The predictions it makes are less dispersed than the data, that is, the data are overdispersed relative to the predictions (Overdispersion).

iplot(function() {
result <- postcheck(m_hur_pois, window=92)
display_markdown("The raw data:")
display(result)
})


The raw data:

$mean 1. 20.1251659857895 2. 13.5272690213414 3. 16.1946676460679 4. 25.2301287082935 5. 22.5782266740179 6. 22.2100755222089 7. 22.9525690372696 8. 24.513735856352 9. 22.8583946338462 10. 25.3341934023582 11. 18.9237585345459 12. 20.4583741902435 13. 25.3341934023582 14. 25.3341934023582 15. 25.1265150113655 16. 24.2130552851921 17. 23.2374194592045 18. 24.6148119744012 19. 25.43867345702 20. 21.9379438722294 21. 24.3128593461609 22. 23.4293096819226 23. 24.513735856352 24. 22.5782266740179 25. 23.3331736632992 26. 22.4856224225171 27. 20.8827526901423 28. 23.6228029801984 29. 23.8179305320726 30. 24.513735856352 31. 22.8583946338462 32. 21.0549844953731 33. 25.5436083208327 34. 23.142079863412 35. 23.2374194592045 36. 23.6228029801984 37. 26.3989288665572 38. 20.2910782055103 39. 13.8065131719318 40. 13.8630519306067 41. 14.6797282101263 42. 14.8609723008321 43. 25.2301287082935 44. 25.43867345702 45. 13.8065131719318 46. 14.3826130109495 47. 25.0233131904701 48. 24.6148119744012 49. 14.0915799092339 50. 24.920558586997 51. 24.4130976457744 52. 15.1061505280087 53. 13.9768434241141 54. 22.5782266740179 55. 23.8179305320726 56. 15.1061505280087 57. 14.5007199645343 58. 13.8065131719318 59. 14.3826130109495 60. 25.2301287082935 61. 20.7971642453911 62. 22.8583946338462 63. 22.8583946338462 64. 20.7119459426219 65. 14.3826130109495 66. 24.4130976457744 67. 14.0340982113646 68. 14.4415496593931 69. 14.5007199645343 70. 13.9768434241141 71. 24.2130552851921 72. 26.1824197235577 73. 24.0146708789476 74. 24.4130976457744 75. 16.5975275516719 76. 15.1061505280087 77. 19.0015518427568 78. 14.8609723008321 79. 13.1997116297174 80. 22.8583946338462 81. 25.43867345702 82. 14.6198081648425 83. 24.0146708789476 84. 24.6148119744012 85. 23.0471188819599 86. 14.5601462541308 87. 25.2301287082935 88. 13.8630519306067 89. 14.0340982113646 90. 24.2130552851921 91. 14.0915799092339 92. 23.7201637909248$PI
 5% 94% 19.3809 12.3561 15.2026 24.0737 21.6986 21.3723 22.0392 23.4624 21.9572 24.1586 ⋯ 23.001 23.5472 22.1312 13.458 24.0737 12.713 12.8953 23.1787 12.9566 22.7214 20.9036 14.7386 17.173 26.4229 23.4528 23.0573 23.8752 25.6005 23.7635 26.5406 ⋯ 25.0423 25.7144 23.9808 15.6671 26.4229 15.0391 15.1975 25.2586 15.249 24.7197
$outPI  5% 94% 13 8 10 18 15 15 16 17 16 17 ⋯ 16 17 15.945 9 17.945 8 8 17 8 16 28 20 23 34 31 30 31 32 31 33.055 ⋯ 32 33 30 21 33 20 20 32 21 32 12H2. Counts are nearly always over-dispersed relative to Poisson. So fit a gamma-Poisson (aka negative-binomial) model to predict deaths using femininity. Show that the over-dispersed model no longer shows as precise a positive association between femininity and deaths, with an 89% interval that overlaps zero. Can you explain why the association diminished in strength? data(Hurricanes) hur <- Hurricanes hur$index = 1:92
hur$StdFem = standardize(hur$femininity)

hur_dat = list(
Deaths = hur$deaths, Femininity = hur$StdFem
)

m_hur_gamma_pois <- map(
alist(
Deaths ~ dgampois(lambda, phi),
log(lambda) <- a + bFemininity * Femininity,
a ~ dnorm(3, 0.5),
bFemininity ~ dnorm(0, 1.0),
phi ~ dexp(1)
),
data = hur_dat
)


Answer. The model no longer infers an association of deaths with femininity:

iplot(function() {
plot(precis(m_hur_gamma_pois), main="m_hur_gamma_pois")
}, ar=4.5)


The association has diminished in strength because the influence of outliers are diminished. Notice the storms with the greatest death toll are female in the original data: Diane, Camille, and Sandy.

12H3. In the data, there are two measures of a hurricane’s potential to cause death: damage_norm and min_pressure. Consult ?Hurricanes for their meanings. It makes some sense to imagine that femininity of a name matters more when the hurricane is itself deadly. This implies an interaction between femininity and either or both of damage_norm and min_pressure. Fit a series of models evaluating these interactions. Interpret and compare the models. In interpreting the estimates, it may help to generate counterfactual predictions contrasting hurricanes with masculine and feminine names. Are the effect sizes plausible?

data(Hurricanes)
hur <- Hurricanes
hur$index = 1:92 hur$StdFem = standardize(hur$femininity) hur$StdDam = standardize(hur$damage_norm) hur$StdPres = standardize(hur$min_pressure)  Answer. To get some unreasonable answers, lets go back to the Poisson distribution rather than the Gamma-Poisson. It’s likely this is what the authors of the original paper were using. hur_dat = list( Deaths = hur$deaths,
StdFem = hur$StdFem, StdDam = hur$StdDam
)

m_fem_dam_inter <- quap(
alist(
Deaths ~ dpois(lambda),
log(lambda) <- a + bFem * StdFem + bDam * StdDam + bFemDamInter * StdFem * StdDam,
a ~ dnorm(3, 0.5),
bFem ~ dnorm(0, 1.0),
bDam ~ dnorm(0, 1.0),
bFemDamInter ~ dnorm(0, 0.4)
),
data = hur_dat
)


In the first model, lets assume an interaction between femininity and damage_norm. Damage is unsurprisingly associated with more deaths. The model also believes there is a minor interaction between damage and femininity, where together they increase deaths more than either alone.

iplot(function() {
plot(precis(m_fem_dam_inter), main="m_fem_dam_inter")
}, ar=4.5)

hur_dat = list(
Deaths = hur$deaths, StdFem = hur$StdFem,
StdPres = hur$StdPres ) m_fem_pres_inter <- quap( alist( Deaths ~ dpois(lambda), log(lambda) <- a + bFem * StdFem + bPres * StdPres + bFemPresInter * StdFem * StdPres, a ~ dnorm(3, 0.5), bFem ~ dnorm(0, 1.0), bPres ~ dnorm(0, 1.0), bFemPresInter ~ dnorm(0, 0.4) ), data = hur_dat )  In the second model, lets assume an interaction between femininity and min_pressure. Pressure is unsurprisingly negatively associated with more deaths; lower pressure indicates a stronger storm. There also appears to be a minor interaction between pressure and femininity, though now the association is in the opposite direction. That is, when both pressure and femininity are positive, which means the storm is weak and feminine, more deaths occur. iplot(function() { plot(precis(m_fem_pres_inter), main="m_fem_pres_inter") }, ar=4.5) hur_dat = list( Deaths = hur$deaths,
StdFem = hur$StdFem, StdPres = hur$StdPres,
StdDam = hur$StdDam ) m_fem_dam_pres_inter <- quap( alist( Deaths ~ dpois(lambda), log(lambda) <- a + bFem * StdFem + bPres * StdPres + bDam * StdDam + bFemPresInter * StdFem * StdPres + bFemDamInter * StdFem * StdDam + bDamPresInter * StdDam * StdPres + bFDPInter * StdPres * StdDam * StdPres, a ~ dnorm(3, 0.5), bFem ~ dnorm(0, 1.0), bDam ~ dnorm(0, 1.0), bPres ~ dnorm(0, 1.0), bFemPresInter ~ dnorm(0, 0.4), bFemDamInter ~ dnorm(0, 0.4), bDamPresInter ~ dnorm(0, 0.4), bFDPInter ~ dnorm(0, 0.4) ), data = hur_dat )  In the third model, lets assume interactions between all three predictors. The influence of femininity has dropped significantly. The three-way interaction term is hard to interpret on the parameter scale. Lets try to interpret it on the outcome scale. iplot(function() { plot(precis(m_fem_dam_pres_inter), main="m_fem_dam_pres_inter") }, ar=3.5)  To some extent this is a counterfactual plot rather than just a posterior prediction plot because we’re considering extremes of at least some predictors. Still, many of the predictors vary here only within the range of the data. In the following Enneaptych, we hold damage and pressure at typical values and vary femininity. We observe the effect on deaths. Many of the effect sizes are inplausible. In the bottom-right plot, notice an implication of the non-negative three-way interaction parameter: when pressure is high and damage is low femininity has a dramatic effect on deaths. iplot(function() { par(mfrow=c(3,3)) for (d in -1: 1) { for (p in -1:1) { sim_dat = data.frame(StdPres=p, StdDam=d, StdFem=seq(from=-2, to=2, length.out=30)) s <- sim(m_fem_dam_pres_inter, data=sim_dat) plot(sim_dat$StdFem, colMeans(s),
main=paste("StdPres = ", p, ",", "StdDam = ", d),
type="l", ylab="Deaths", xlab="StdFem")
}
}
})


Ignoring the known outliers, PSIS believes the full interaction model performs best:

iplot(function() {
plot(compare(m_fem_dam_pres_inter, m_fem_pres_inter, m_fem_dam_inter, func=PSIS))
}, ar=4.5)

Some Pareto k values are very high (>1). Set pointwise=TRUE to inspect individual points.

Some Pareto k values are very high (>1). Set pointwise=TRUE to inspect individual points.

Some Pareto k values are very high (>1). Set pointwise=TRUE to inspect individual points.


12H4. In the original hurricanes paper, storm damage (damage_norm) was used directly. This assumption implies that mortality increases exponentially with a linear increase in storm strength, because a Poisson regression uses a log link. So it’s worth exploring an alternative hypothesis: that the logarithm of storm strength is what matters. Explore this by using the logarithm of damage_norm as a predictor. Using the best model structure from the previous problem, compare a model that uses log(damage_norm) to a model that uses damage_norm directly. Compare their PSIS/WAIC values as well as their implied predictions. What do you conclude?

Answer We’ll use the full interaction model as our baseline, though it probably doesn’t matter which model we start from.

data(Hurricanes)
hur <- Hurricanes
hur$index = 1:92 hur$StdFem = standardize(hur$femininity) hur$StdDam = standardize(hur$damage_norm) hur$StdLogDam = standardize(log(hur$damage_norm)) hur$StdPres = standardize(hur$min_pressure) hur_dat = list( Deaths = hur$deaths,
StdFem = hur$StdFem, StdPres = hur$StdPres,
StdDam = hur$StdDam ) m_fem_dam_pres_inter <- quap( alist( Deaths ~ dpois(lambda), log(lambda) <- a + bFem * StdFem + bPres * StdPres + bDam * StdDam + bFemPresInter * StdFem * StdPres + bFemDamInter * StdFem * StdDam + bDamPresInter * StdDam * StdPres + bFDPInter * StdPres * StdDam * StdPres, a ~ dnorm(3, 0.5), bFem ~ dnorm(0, 1.0), bDam ~ dnorm(0, 1.0), bPres ~ dnorm(0, 1.0), bFemPresInter ~ dnorm(0, 0.4), bFemDamInter ~ dnorm(0, 0.4), bDamPresInter ~ dnorm(0, 0.4), bFDPInter ~ dnorm(0, 0.4) ), data = hur_dat )  First we’ll run postcheck on the baseline model. Notice there is more dispersion than in the plain old Poisson model in 12H1: iplot(function() { result <- postcheck(m_fem_dam_pres_inter, window=92) display_markdown("The raw data:") display(result) }) hur_dat = list( Deaths = hur$deaths,
StdFem = hur$StdFem, StdPres = hur$StdPres,
StdLogDam = hur$StdLogDam ) m_fem_logdam_pres_inter <- quap( alist( Deaths ~ dpois(lambda), log(lambda) <- a + bFem * StdFem + bPres * StdPres + bDam * StdLogDam + bFemPresInter * StdFem * StdPres + bFemDamInter * StdFem * StdLogDam + bDamPresInter * StdLogDam * StdPres + bFDPInter * StdPres * StdLogDam * StdPres, a ~ dnorm(3, 0.5), bFem ~ dnorm(0, 1.0), bDam ~ dnorm(0, 1.0), bPres ~ dnorm(0, 1.0), bFemPresInter ~ dnorm(0, 0.4), bFemDamInter ~ dnorm(0, 0.4), bDamPresInter ~ dnorm(0, 0.4), bFDPInter ~ dnorm(0, 0.4) ), data = hur_dat )  The raw data:$mean
1. 12.4289629811358
2. 18.5555907721241
3. 2.1249070586138
4. 1.77374475898114
5. 2.06701979381536
6. 39.5683369755651
7. 15.9939793092382
8. 37.4206334005963
9. 11.3393368856691
10. 96.6869545199769
11. 16.2813093468006
12. 5.70521738475244
13. 16.91337602597
14. 2.48568771306356
15. 15.2596820401918
16. 114.838235470021
17. 2.86766572841986
18. 34.5128187022824
19. 0.856829259425056
20. 14.1797118534941
21. 42.1928980075304
22. 17.3290292624905
23. 5.31717940816366
24. 30.8172826315106
25. 3.08573684193247
26. 2.48545947274957
27. 19.4141586210783
28. 4.41744064025897
29. 203.106284084467
30. 21.5776644407842
31. 3.78105321641471
32. 3.67383539337091
33. 0.908907885314257
34. 184.984153821049
35. 15.4361111412495
36. 18.0008369177879
37. 3.43397244991166
38. 0.824230228928947
39. 1.90433426037738
40. 8.89528698983109
41. 22.0565207809061
42. 25.0936386773702
43. 22.8871246559745
44. 15.5198373640191
45. 0.319483073147672
46. 1.76904239289811
47. 14.4654254906871
48. 20.1824722641769
49. 10.122581589849
50. 8.30801811862062
51. 1.34098254166381
52. 1.29640697342803
53. 0.933284011415338
54. 2.24111365545684
55. 2.04205676669216
56. 26.1365654250733
57. 2.58118066040024
58. 13.7981829597164
59. 66.615328996312
60. 9.78792501462256
61. 6.50495949367915
62. 23.2557773959279
63. 5.28556240933658
64. 20.0449904479539
65. 2.34938006223169
66. 9.92009766804856
67. 1.76765633021552
68. 12.7572493066998
69. 20.724355786377
70. 18.5368837752248
71. 9.7380836695525
72. 9.7864089263809
73. 3.4710114383841
74. 16.0821194776559
75. 5.87258812291837
76. 21.9002218721581
77. 23.0819441177566
78. 2.13353996480914
79. 19.1758534832144
80. 22.7315653259822
81. 1.37215259835675
82. 24.4221358194378
83. 2.6812724963672
84. 26.6361393883323
85. 39.9601777458033
86. 2.08160460381955
87. 8.09843833341641
88. 18.9855117772041
89. 19.8208111691232
90. 19.7288010624798
91. 39.4585510163461
92. 125.255203562995
$PI  5% 94% 11.6868 16.6544 1.76294 1.51475 1.79781 37.5705 15.1303 35.1476 10.6284 87.1836 ⋯ 2.34791 24.5644 37.481 1.65812 7.40845 17.0353 17.9102 18.5794 32.8471 111.126 13.1858 20.6513 2.54588 2.06995 2.35854 41.7431 16.9033 39.7824 12.0532 106.807 ⋯ 3.04196 28.8211 42.4813 2.576 8.80773 21.1394 21.8907 20.8945 46.57 141.628$outPI
 5% 94% 7 12 0 0 0 29 10 27 6 79 ⋯ 1 19 30 0 4 12 13 13 27 102 19 26 5 4 5 50 23 48 17 115 ⋯ 6 35.055 51 5 13 26 27 27 52 149

Ignoring the known outliers, comparing the models using PSIS, it appears the log(damage) model improves performance:

iplot(function() {
plot(compare(m_fem_dam_pres_inter, m_fem_logdam_pres_inter, func=PSIS))
}, ar=4.5)

Some Pareto k values are very high (>1). Set pointwise=TRUE to inspect individual points.

Some Pareto k values are very high (>1). Set pointwise=TRUE to inspect individual points.


Run postcheck to visualize how much the model has changed on the outcome scale:

iplot(function() {
result <- postcheck(m_fem_logdam_pres_inter, window=92)
display_markdown("The raw data:")
display(result)
})


The raw data:

$mean 1. 10.9140498819251 2. 16.3313710022569 3. 2.6709357607139 4. 0.322526496273671 5. 0.136160362124281 6. 40.9683395319938 7. 14.2580819168725 8. 49.8113315443322 9. 11.9564068702583 10. 154.596179681551 11. 20.5892701945882 12. 12.2153330272158 13. 4.49122488660067 14. 3.18140578723594 15. 4.53948056380432 16. 113.659356226795 17. 0.398583050207092 18. 45.2021587791377 19. 1.3848939343451 20. 26.8510567912674 21. 49.2758181095555 22. 12.6829339090324 23. 6.49005192517415 24. 36.1190315021328 25. 6.09839407651378 26. 0.985349989263909 27. 17.2100121657768 28. 6.68745222404144 29. 226.247559918251 30. 21.1517893910113 31. 3.06888928147871 32. 5.14531160281342 33. 0.824883825194171 34. 116.523475702912 35. 8.0641871709459 36. 20.7742255615973 37. 4.26613646512088 38. 0.336503528493622 39. 2.15921940466006 40. 17.6042019668685 41. 19.471387174271 42. 12.4919902588978 43. 33.5125543173247 44. 3.88133868051414 45. 0.980499511312163 46. 3.27154926980215 47. 17.603607823164 48. 12.858401141245 49. 21.9762686587169 50. 9.03992355758047 51. 0.0191863016958919 52. 1.05680367616152 53. 0.028132573873713 54. 0.0431267785317184 55. 2.19936585828156 56. 27.3093114353624 57. 4.90419915633675 58. 16.6633629041742 59. 82.4226128407474 60. 1.60139950190683 61. 12.4178519108521 62. 21.783740514807 63. 6.30210027489317 64. 22.3899535257731 65. 4.3845208266936 66. 10.4265057085354 67. 3.44908259779698 68. 17.5834546186034 69. 6.55064690808527 70. 18.2385297667872 71. 9.5489541419671 72. 8.2831244154469 73. 2.34924285804395 74. 18.9706169543237 75. 0.800492465445044 76. 23.6536111949234 77. 28.2980353485363 78. 3.51696407225748 79. 18.9235904552031 80. 25.6757596927057 81. 2.0637764086095 82. 13.2976653987726 83. 0.821268377189356 84. 29.7819961636452 85. 42.9099807127748 86. 1.61784913313187 87. 7.9762091532931 88. 15.6121558301295 89. 20.7781808862812 90. 22.0385062348842 91. 31.5885609198836 92. 86.8549157202277$PI
 5% 94% 10.1322 14.8567 2.08952 0.228094 0.0932196 38.6507 13.3054 46.4444 11.145 139.168 ⋯ 0.634816 27.6634 40.2527 1.14407 7.15347 14.1017 18.7572 20.6296 26.1905 78.7553 11.7063 17.8731 3.35915 0.44012 0.190644 43.3699 15.2802 53.015 12.8176 170.152 ⋯ 1.04677 31.8539 45.7829 2.20172 8.84143 17.1442 22.9406 23.4714 37.8703 95.5726
$outPI  5% 94% 6 10 0 0 0 31 8 38 7 128 ⋯ 0 21 33 0 4 9 13 15 21 71 16 23 6 1 1 52 21 61 18 180 ⋯ 2 39 54 4 13 22 28 30 44 105 12H5. One hypothesis from developmental psychology, usually attributed to Carol Gilligan, proposes that women and men have different average tendencies in moral reasoning. Like most hypotheses in social psychology, it is merely descriptive, not causal. The notion is that women are more concerned with care (avoiding harm), while men are more concerned with justice and rights. Evaluate this hypothesis, using the Trolley data, supposing that contact provides a proxy for physical harm. Are women more or less bothered by contact than are men, in these data? Figure out the model(s) that is needed to address this question. Answer. The head of the Trolley data.frame: data(Trolley) d <- Trolley display(head(Trolley))  A data.frame: 6 × 12 caseresponseorderidagemaleeduactionintentioncontactstoryaction2 <fct><int><int><fct><int><int><fct><int><int><int><fct><int> 1cfaqu4 296;434140Middle School001aqu1 2cfbur33196;434140Middle School001bur1 3cfrub41696;434140Middle School001rub1 4cibox33296;434140Middle School011box1 5cibur3 496;434140Middle School011bur1 6cispe3 996;434140Middle School011spe1 We’ll start from model m12.5 in the text, converting bC and bIC to index variables based on gender. Before doing the conversion, lets reproduce the precis summary for m12.5 in the text: dat <- list( R = d$response,
A = d$action, I = d$intention,
C = d$contact ) m12.5 <- ulam( alist( R ~ dordlogit(phi, cutpoints), phi <- bA * A + bC * C + BI * I, BI <- bI + bIA * A + bIC * C, c(bA, bI, bC, bIA, bIC) ~ dnorm(0, 0.5), cutpoints ~ dnorm(0, 1.5) ), data = dat, chains = 4, cores = min(detectCores(), 4), log_lik=TRUE ) iplot(function() { plot(precis(m12.5), main="m12.5") }, ar=4) dat <- list( R = d$response,
A = d$action, I = d$intention,
C = d$contact, gid = ifelse(d$male == 1, 2, 1)
)
m_trolley_care <- ulam(
alist(
R ~ dordlogit(phi, cutpoints),
phi <- bA * A + bC[gid] * C + BI * I,
BI <- bI + bIA * A + bIC[gid] * C,
c(bA, bI, bIA) ~ dnorm(0, 0.5),
bC[gid] ~ dnorm(0, 0.5),
bIC[gid] ~ dnorm(0, 0.5),
cutpoints ~ dnorm(0, 1.5)
),
data = dat, chains = 4, cores = min(detectCores(), 4), log_lik=TRUE
)
flush.console()

6 vector or matrix parameters hidden. Use depth=2 to show them.


The following plot shows the same precis summary after converting bC and bIC to index variables based on gender. It appears women disapprove of contact more than men from this data and model, lending some support to the theory:

iplot(function() {
plot(precis(m_trolley_care, depth=2), main="m_trolley_care")
}, ar=2.5)


Comparing the models using PSIS, to check for overfitting or outliers:

iplot(function() {
plot(compare(m12.5, m_trolley_care, func=PSIS))
}, ar=4.5)


12H6. The data in data(Fish) are records of visits to a national park. See ?Fish for details. The question of interest is how many fish an average visitor takes per hour, when fishing. The problem is that not everyone tried to fish, so the fish_caught numbers are zero-inflated. As with the monks example in the chapter, there is a process that determines who is fishing (working) and another process that determines fish per hour (manuscripts per day), conditional on fishing (working). We want to model both. Otherwise we’ll end up with an underestimate of rate of fish extraction from the park.

You will model these data using zero-inflated Poisson GLMs. Predict fish_caught as a function of any of the other variables you think are relevant. One thing you must do, however, is use a proper Poisson offset/exposure in the Poisson portion of the zero-inflated model. Then use the hours variable to construct the offset. This will adjust the model for the differing amount of time individuals spent in the park.

Answer. The short documentation on Fish:

display(help(Fish))

data(Fish)

 Fish {rethinking} R Documentation

Fishing data

Description

Fishing data from visitors to a national park.

Usage

data(Fish)


Format

1. fish_caught : Number of fish caught during visit

2. livebait : Whether or not group used livebait to fish

3. camper : Whether or not group had a camper

4. persons : Number of adults in group

5. child : Number of children in group

6. hours : Number of hours group spent in park

[Package rethinking version 2.13 ]

The head of Fish:

display(head(Fish))

A data.frame: 6 × 6
fish_caughtlivebaitcamperpersonschildhours
<int><int><int><int><int><dbl>
10001021.124
201110 5.732
301010 1.323
401121 0.548
511010 1.695
601142 0.493

Notice several outliers in the full dataframe (not represented in head):

display(summary(Fish))

  fish_caught         livebait         camper         persons
Min.   :  0.000   Min.   :0.000   Min.   :0.000   Min.   :1.000
1st Qu.:  0.000   1st Qu.:1.000   1st Qu.:0.000   1st Qu.:2.000
Median :  0.000   Median :1.000   Median :1.000   Median :2.000
Mean   :  3.296   Mean   :0.864   Mean   :0.588   Mean   :2.528
3rd Qu.:  2.000   3rd Qu.:1.000   3rd Qu.:1.000   3rd Qu.:4.000
Max.   :149.000   Max.   :1.000   Max.   :1.000   Max.   :4.000
child           hours
Min.   :0.000   Min.   : 0.0040
1st Qu.:0.000   1st Qu.: 0.2865
Median :0.000   Median : 1.8315
Mean   :0.684   Mean   : 5.5260
3rd Qu.:1.000   3rd Qu.: 7.3427
Max.   :3.000   Max.   :71.0360


One complication introduced by the question and data is how to infer fish per ‘average visitor’ when every observation includes several visitors, both adults and children. This solution disregards children, assuming any fishing they do is offset by the time they take away from fishing. Addressing persons can be done the same way we address hours, by treating it as an exposure. That is: \begin{align} log \left(\frac{fish}{person \cdot hour}\right) & = log(fish) - log(person \cdot hour) \\ & = log(fish) - log(person) - log(hour) \end{align}

fish_dat <- list(
FishCaught = Fish$fish_caught, LogPersonsInGroup = log(Fish$persons),
LogHours = log(Fish$hours) ) m_fish <- ulam( alist( FishCaught ~ dzipois(p, fish_per_person_hour), log(fish_per_person_hour) <- LogPersonsInGroup + LogHours + al, logit(p) <- ap, ap ~ dnorm(-0.5, 1), al ~ dnorm(1, 2) ), data = fish_dat, chains = 4 ) flush.console()  SAMPLING FOR MODEL '1a5d1436509eaa3c57b3600d89299fd9' NOW (CHAIN 1). Chain 1: Chain 1: Gradient evaluation took 0.000111 seconds Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 1.11 seconds. Chain 1: Adjust your expectations accordingly! Chain 1: Chain 1: Chain 1: Iteration: 1 / 1000 [ 0%] (Warmup) Chain 1: Iteration: 100 / 1000 [ 10%] (Warmup) Chain 1: Iteration: 200 / 1000 [ 20%] (Warmup) Chain 1: Iteration: 300 / 1000 [ 30%] (Warmup) Chain 1: Iteration: 400 / 1000 [ 40%] (Warmup) Chain 1: Iteration: 500 / 1000 [ 50%] (Warmup) Chain 1: Iteration: 501 / 1000 [ 50%] (Sampling) Chain 1: Iteration: 600 / 1000 [ 60%] (Sampling) Chain 1: Iteration: 700 / 1000 [ 70%] (Sampling) Chain 1: Iteration: 800 / 1000 [ 80%] (Sampling) Chain 1: Iteration: 900 / 1000 [ 90%] (Sampling) Chain 1: Iteration: 1000 / 1000 [100%] (Sampling) Chain 1: Chain 1: Elapsed Time: 0.360352 seconds (Warm-up) Chain 1: 0.28316 seconds (Sampling) Chain 1: 0.643512 seconds (Total) Chain 1: SAMPLING FOR MODEL '1a5d1436509eaa3c57b3600d89299fd9' NOW (CHAIN 2). Chain 2: Chain 2: Gradient evaluation took 8.9e-05 seconds Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0.89 seconds. Chain 2: Adjust your expectations accordingly! Chain 2: Chain 2: Chain 2: Iteration: 1 / 1000 [ 0%] (Warmup) Chain 2: Iteration: 100 / 1000 [ 10%] (Warmup) Chain 2: Iteration: 200 / 1000 [ 20%] (Warmup) Chain 2: Iteration: 300 / 1000 [ 30%] (Warmup) Chain 2: Iteration: 400 / 1000 [ 40%] (Warmup) Chain 2: Iteration: 500 / 1000 [ 50%] (Warmup) Chain 2: Iteration: 501 / 1000 [ 50%] (Sampling) Chain 2: Iteration: 600 / 1000 [ 60%] (Sampling) Chain 2: Iteration: 700 / 1000 [ 70%] (Sampling) Chain 2: Iteration: 800 / 1000 [ 80%] (Sampling) Chain 2: Iteration: 900 / 1000 [ 90%] (Sampling) Chain 2: Iteration: 1000 / 1000 [100%] (Sampling) Chain 2: Chain 2: Elapsed Time: 0.388114 seconds (Warm-up) Chain 2: 0.290798 seconds (Sampling) Chain 2: 0.678912 seconds (Total) Chain 2: SAMPLING FOR MODEL '1a5d1436509eaa3c57b3600d89299fd9' NOW (CHAIN 3). Chain 3: Chain 3: Gradient evaluation took 0.00012 seconds Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 1.2 seconds. Chain 3: Adjust your expectations accordingly! Chain 3: Chain 3: Chain 3: Iteration: 1 / 1000 [ 0%] (Warmup) Chain 3: Iteration: 100 / 1000 [ 10%] (Warmup) Chain 3: Iteration: 200 / 1000 [ 20%] (Warmup) Chain 3: Iteration: 300 / 1000 [ 30%] (Warmup) Chain 3: Iteration: 400 / 1000 [ 40%] (Warmup) Chain 3: Iteration: 500 / 1000 [ 50%] (Warmup) Chain 3: Iteration: 501 / 1000 [ 50%] (Sampling) Chain 3: Iteration: 600 / 1000 [ 60%] (Sampling) Chain 3: Iteration: 700 / 1000 [ 70%] (Sampling) Chain 3: Iteration: 800 / 1000 [ 80%] (Sampling) Chain 3: Iteration: 900 / 1000 [ 90%] (Sampling) Chain 3: Iteration: 1000 / 1000 [100%] (Sampling) Chain 3: Chain 3: Elapsed Time: 0.353938 seconds (Warm-up) Chain 3: 0.310407 seconds (Sampling) Chain 3: 0.664345 seconds (Total) Chain 3: SAMPLING FOR MODEL '1a5d1436509eaa3c57b3600d89299fd9' NOW (CHAIN 4). Chain 4: Chain 4: Gradient evaluation took 8.8e-05 seconds Chain 4: 1000 transitions using 10 leapfrog steps per transition would take 0.88 seconds. Chain 4: Adjust your expectations accordingly! Chain 4: Chain 4: Chain 4: Iteration: 1 / 1000 [ 0%] (Warmup) Chain 4: Iteration: 100 / 1000 [ 10%] (Warmup) Chain 4: Iteration: 200 / 1000 [ 20%] (Warmup) Chain 4: Iteration: 300 / 1000 [ 30%] (Warmup) Chain 4: Iteration: 400 / 1000 [ 40%] (Warmup) Chain 4: Iteration: 500 / 1000 [ 50%] (Warmup) Chain 4: Iteration: 501 / 1000 [ 50%] (Sampling) Chain 4: Iteration: 600 / 1000 [ 60%] (Sampling) Chain 4: Iteration: 700 / 1000 [ 70%] (Sampling) Chain 4: Iteration: 800 / 1000 [ 80%] (Sampling) Chain 4: Iteration: 900 / 1000 [ 90%] (Sampling) Chain 4: Iteration: 1000 / 1000 [100%] (Sampling) Chain 4: Chain 4: Elapsed Time: 0.422965 seconds (Warm-up) Chain 4: 0.313815 seconds (Sampling) Chain 4: 0.73678 seconds (Total) Chain 4:  The precision of the parameters, which is rather hard to interpret. We can take from ap that most visitors who were a part of this survey chose to fish. That is, the probability of not fishing looks solidly below 0.5, since ap is solidly negative and any score less than zero implies a probability of less than 0.5: iplot(function() { plot(precis(m_fish)) }, ar=4.5)  Finally, a posterior check. Despite the large uncertainty in ap the model is still surprised by outliers. An obvious alternative based on this chapter is a Gamma-Poisson, but this solution doesn’t pursue that. iplot(function() { result <- postcheck(m_fish, window=250) display_markdown("The raw data:") display(result) })  The raw data:$mean
1. 6.76595490376015
2. 1.83594269590765
3. 0.423752998375056
4. 0.351045567814861
5. 0.542903501319516
6. 0.631625784426009
7. 0.0317094080416709
8. 0.0051247528148155
9. 0.208513380152805
10. 1.29111741228258
11. 0.185772289537062
12. 0.080714856833344
13. 4.71797556013951
14. 2.20236252216696
15. 5.37682659389423
16. 2.77281156986611
17. 0.0589346573703781
18. 0.136446543694462
19. 0.531693104537107
20. 0.853271343666779
21. 1.28375058011128
22. 8.50452729618631
23. 0.477242605879693
24. 4.42138049098206
25. 4.10204433120887
26. 10.7197017003903
27. 6.714707375612
28. 0.221004965138918
29. 0.34720200320375
30. 0.010249505629631
31. 0.805226786027884
32. 0.494858943680621
33. 42.7839988744871
34. 1.52845752701872
35. 0.693122818203795
36. 3.35799428190785
37. 0.336952497574119
38. 4.44700425505614
39. 0.0989717887361242
40. 0.293392098648187
41. 6.56640984103327
42. 13.5101296080573
43. 0.0442009930277836
44. 18.6131022234099
45. 0.0896831742592711
46. 2.7036274068661
47. 0.868005008009374
48. 1.51372386267613
49. 1.19919215866683
50. 0.0819960450370479
51. 0.851349561361224
52. 0.315812892213005
53. 5.83196870326003
54. 5.12539340891734
55. 13.3217949421129
56. 20.9282093075028
57. 2.67544126638461
58. 1.23730750772702
59. 4.52259435907467
60. 1.63319466267151
61. 1.89295557097247
62. 5.20610826575069
63. 0.595111920620449
64. 16.3748664315392
65. 0.0192178230555581
66. 1.58771248144002
67. 0.973703034814943
68. 0.060536142625008
69. 2.39422045567161
70. 0.177444566212986
71. 9.28893477390401
72. 0.00896831742592712
73. 0.0858396096481595
74. 0.0384356461111162
75. 0.798500547958439
76. 0.416386166203759
77. 0.0413183195694499
78. 8.40587580450111
79. 0.504788152259326
80. 5.16446964913031
81. 0.850708967259372
82. 0.465071317944506
83. 13.6561850632796
84. 0.409659928134313
85. 1.78661695006505
86. 0.555395086305629
87. 0.0999326798889021
88. 3.07837495644948
89. 45.6000505462282
90. 0.060536142625008
91. 0.650523310430642
92. 0.700169353324167
93. 2.539635316792
94. 0.655968360296383
95. 0.226770312055585
96. 0.214278727069473
97. 0.0166554466481503
98. 0.622657467000082
99. 12.3993394354461
100. 3.02872891355596
101. 2.51465214681978
102. 6.17724892415822
103. 0.161429713666688
104. 0.859997581736225
105. 0.456103000518579
106. 0.0884019860555672
107. 0.654687172092679
108. 0.140290108305574
109. 0.185772289537062
110. 4.99791518264881
111. 7.68744951927416
112. 5.36113203839886
113. 0.050927231097229
114. 2.66615265190776
115. 0.108580700263903
116. 2.33624668945401
117. 0.497741617138955
118. 0.594471326518597
119. 0.411261413388943
120. 0.373786658430605
121. 0.126837632166683
122. 3.95919184649589
123. 0.0768712922222324
124. 0.573972315259335
125. 0.0666217865926014
126. 0.609845584963044
127. 3.08542149156985
128. 30.2286747752401
129. 2.34393381867624
130. 91.0104852383084
131. 20.4650597718638
132. 2.47781798596329
133. 15.8726406556873
134. 0.176803972111134
135. 2.28051500259289
136. 2.92239029264853
137. 1.3221862262224
138. 1.93811745515303
139. 0.545145580675998
140. 6.6826776705194
141. 0.0140930702407426
142. 0.0967297093796424
143. 1.34909117850018
144. 0.207552489000027
145. 1.18221641496775
146. 0.334390121166711
147. 45.3707178577652
148. 14.177948959238
149. 0.122033176402794
150. 0.737964405333431
151. 2.54796304011608
152. 0.397168343148201
153. 0.373146064328753
154. 0.393324778537089
155. 1.40290108305574
156. 1.62454664229651
157. 4.80701814029693
158. 0.876332731333449
159. 3.12289624652819
160. 5.00303993546362
161. 4.21350770493111
162. 0.38531735226394
163. 0.0374747549583383
164. 0.26136239355559
165. 0.0550910927592665
166. 3.02104178433373
167. 1.18509908842608
168. 11.6613750301127
169. 0.0970500064305684
170. 0.0243425758703736
171. 13.0191142289878
172. 2.98004376181521
173. 7.96258468601957
174. 7.05165987318612
175. 0.070465351203713
176. 15.3941168616039
177. 3.72633589047271
178. 1.03936393025477
179. 7.56669753107507
180. 0.453220327060245
181. 9.66079965002906
182. 0.27801784020374
183. 12.1347740713812
184. 0.427916860037094
185. 0.199865359777804
186. 3.31315269477822
187. 0.029787625736115
188. 0.171999516347245
189. 41.2619472884869
190. 1.16588126537052
191. 12.66358450246
192. 1.63479614792614
193. 2.22478331573178
194. 0.370583687921345
195. 12.5261770676128
196. 2.32151302511142
197. 0.872168869671411
198. 0.347842597305602
199. 0.117869314740756
200. 2.34329322457438
201. 0.0800742627314921
202. 6.3828796308527
203. 2.90669573715316
204. 3.08189822400967
205. 0.538099045555627
206. 0.30268071312504
207. 4.20870324916722
208. 0.0128118820370387
209. 3.76893539824587
210. 0.414144086847277
211. 5.99211722872301
212. 11.3346720381682
213. 1.30777285893073
214. 5.90371524266745
215. 0.0115306938333349
216. 0.207552489000027
217. 0.0525287163518588
218. 0.0643797072361196
219. 2.27410906157437
220. 0.144774267018538
221. 0.352326756018565
222. 0.965695608541794
223. 2.78594374895407
224. 0.834694114713073
225. 0.329905962453747
226. 11.2059126236959
227. 20.0787815284471
228. 0.590307464856559
229. 0.091284659513901
230. 2.53354967282441
231. 2.72508730927814
232. 0.0819960450370479
233. 0.161429713666688
234. 1.74497833344468
235. 0.411902007490795
236. 8.18935499807516
237. 1.65465456508355
238. 0.685435688981572
239. 8.32420005651499
240. 1.62006248358355
241. 22.7884945792808
242. 0.343358438592638
243. 1.84491101333358
244. 0.461227753333394
245. 3.58124132640325
246. 6.54623112682494
247. 0.0051247528148155
248. 1.26132978654646
249. 0.0720668364583429
250. 0.149258425731501
$PI  5% 94% 6.38144 1.73161 0.399671 0.331096 0.51205 0.59573 0.0299074 0.00483351 0.196663 1.21774 ⋯ 21.4934 0.323845 1.74006 0.435016 3.37772 6.17421 0.00483351 1.18965 0.0679713 0.140776 7.14992 1.94013 0.447801 0.370967 0.573713 0.66747 0.0335089 0.00541558 0.220346 1.36439 ⋯ 24.0817 0.362844 1.94961 0.487402 3.78447 6.91772 0.00541558 1.33291 0.0761566 0.157729$outPI
 5% 94% 0 0 0 0 0 0 0 0 0 0 ⋯ 0 0 0 0 0 0 0 0 0 0 11 4 1 1 2 2 0 0 1 3 ⋯ 30 1 4 1 7 10 0 3 0 1

12H7. In the trolley data — data(Trolley) — we saw how education level (modeled as an ordered category) is associated with responses. But is this association causal? One plausible confound is that education is also associated with age, through a causal process: People are older when they finish school than when they begin it. Reconsider the Trolley data in this light. Draw a DAG that represents hypothetical causal relationships among response, education, and age. Which statical model or models do you need to evaluate the causal influence of education on responses? Fit these models to the trolley data. What do you conclude about the causal relationships among these three variables?

ERROR:

Which statical [sic] model or models do you need

Answer. Consider the following DAGs. Implicit in this list of DAGs is that the response would never influence age or education. Because we already know that education is associated with the response, we also only include DAGs where Education and Response are d-connected.

A more difficult question is what arrows to allow between education and age. Getting older does not necessarily lead to more education, but education always leads to greater age. That is, someone could stop schooling at any point in their lives. This perspective implies Education should cause Age, because the choice to get a particular education is also a choice to get older before continuing with ‘real’ work.

The issue with allowing this arrow is it contradicts the following definition of Causal dependence from David Lewis:

An event E causally depends on C if, and only if, (i) if C had occurred, then E would have occurred, and (ii) if C had not occurred, then E would not have occurred.

Said another way, causality is almost always associated with the advancement of time. This is inherent in the definition of causality above, and in lists (see #4) such as the Bradford-Hill criteria. See also the definition of Causal inference. The Age covariate is confusingly a measurement of time. It is perhaps best to think of Age as a measurement of the potential to get education; greater Age indicates an individual is ‘richer’ in terms of time opportunity to get an education.

Said yet another way, age is not controllable. We will get older regardless of whether we get an education. Education on the other hand is clearly within our control. The issue seems to be that we require two things to get a particular educational degree: a personal commitment to do so and an investment of our time. We talk about investing our time when in the end we are always forced to invest it in something; we can’t choose to stay in a frozen present.

Said yet another way, we should not confuse prediction and causal inference. Someone’s education may help us predict their age, but this does not imply the relationship is causal.

Described as a rule for DAGs, indirect measures of time (like Age) can only have arrows pointing out of them. A convenient side effect of this rule is that to get the total causal effect of a variable like Age there should never be any back door paths to close.

Do these rules contradict general relativity? If there was a variable for how fast or far someone had traveled in their life (e.g. if they were an astronaut) couldn’t that cause a change in their Age? Perhaps in this extra special case you’d need to define a standardized time that all other time measurements were made relative to.

This perspective leads to other questions. Age does not necessarily indicate the level of education we will eventually achieve. Are people who had a healthy upbringing, e.g. with more material resources, more or less likely to provide disapproving responses, and also more or less likely to get an education? There are plenty of possible confounds that could influence both eventual education level and response, such as temperament. Would a 14 year old who will eventually get a PhD answer the same way to these questions now and after they finish their PhD?

Practially speaking, based on the wording in the question and the last paragraph of section 12.4, it seems the author wanted readers to only consider that Age influences Education (not vice versa).

Conditional independencies, if any, are listed under each DAG.

In this first model, education is a mediator between age and response:

library(dagitty)

age_dag <- dagitty('
dag {
bb="0,0,1,1"
Age [pos="0.4,0.2"]
Education [pos="0.6,0.2"]
Response [pos="0.5,0.3"]
Age -> Education
Education -> Response
}')
iplot(function() plot(age_dag), scale=10)
display(impliedConditionalIndependencies(age_dag))

Age _||_ Rspn | Edct


In this model, all variables are associated regardless of conditioning:

mediator_dag <- dagitty('
dag {
bb="0,0,1,1"
Age [pos="0.4,0.2"]
Education [pos="0.6,0.2"]
Response [pos="0.5,0.3"]
Age -> Education
Age -> Response
Education -> Response
}')
iplot(function() plot(mediator_dag), scale=10)
display(impliedConditionalIndependencies(mediator_dag))




In this model, education has no direct causal effect on the response:

independent_dag <- dagitty('
dag {
bb="0,0,1,1"
Age [pos="0.4,0.2"]
Education [pos="0.6,0.2"]
Response [pos="0.5,0.3"]
Age -> Education
Age -> Response
}')
iplot(function() plot(independent_dag), scale=10)
display(impliedConditionalIndependencies(independent_dag))

Edct _||_ Rspn | Age


This answer will not consider this model because, as stated above, it seems more reasonable to assume that age leads to education than vice versa. Notice it has the same conditional independencies as the last model, and we could construct similar models where Education -> Age and with the same conditional independencies as the first and second models. Although that approach would work for this question, see comments in the next question.

independent_dag <- dagitty('
dag {
bb="0,0,1,1"
Age [pos="0.4,0.2"]
Education [pos="0.6,0.2"]
Response [pos="0.5,0.3"]
Education -> Age
Age -> Response
}')
iplot(function() plot(independent_dag), scale=10)
display(impliedConditionalIndependencies(independent_dag))

Edct _||_ Rspn | Age


To test which model is correct, we only need one more model beyond what we learned from m12.6 in the text: a model that includes age as a predictor. If we find that the education coefficient bE is near zero in this model, the third causal model is correct. If we find that the age coefficient is near zero, the first causal model is correct. If both coefficients are non-zero, we can infer the second causal model is correct. Of course, all causal inferences are limited by the variables we are considering.

data(Trolley)
d <- Trolley

edu_levels <- c(6, 1, 8, 4, 7, 2, 5, 3)
d$edu_new <- edu_levels[d$edu]

dat <- list(
R = d$response, action = d$action,
intention = d$intention, contact = d$contact,
E = as.integer(d$edu_new), StdAge = standardize(d$age),
alpha = rep(2, 7)
)

m_trolley_age <- ulam(
alist(
R ~ ordered_logistic(phi, kappa),
phi <- bE * sum(delta_j[1:E]) + bStdAge * StdAge + bA * action + bI * intention + bC * contact,
kappa ~ normal(0, 1.5),
c(bA, bI, bC, bE, bStdAge) ~ normal(0, 1),
vector[8]:delta_j <<- append_row(0, delta),
simplex[7]:delta ~ dirichlet(alpha)
),
data = dat, chains = 4, cores = min(detectCores(), 4), iter=2000
)

iplot(function() {
plot(precis(m_trolley_age, depth=2), main="m_trolley_age")
}, ar=1.8)
display_markdown("Raw data (preceding plot):")
display(precis(m_trolley_age, depth = 2), mimetypes="text/plain")


Raw data (preceding plot):

         mean        sd         5.5%        94.5%       n_eff     Rhat4
kappa[1] -2.68361325 0.08865314 -2.81809233 -2.55714508  934.8242 1.0045896
kappa[2] -1.99898884 0.08602333 -2.13072845 -1.87498384  956.9940 1.0043763
kappa[3] -1.41489799 0.08459355 -1.54145036 -1.29335548  930.5575 1.0041952
kappa[4] -0.39022666 0.08408004 -0.51812506 -0.27072486  912.5518 1.0048259
kappa[5]  0.27999016 0.08492709  0.14992044  0.40188572  911.4397 1.0043455
kappa[6]  1.18651290 0.08672497  1.05659667  1.31209534  934.4431 1.0038228
bStdAge  -0.09906705 0.02135479 -0.13219621 -0.06503196 2383.0371 1.0002423
bE        0.23368574 0.10691089  0.07107577  0.37026784  725.3296 1.0056800
bC       -0.95951533 0.04991494 -1.03958491 -0.88054049 4455.0481 1.0006841
bI       -0.71919289 0.03580042 -0.77702762 -0.66303366 3907.5633 1.0004625
bA       -0.70773068 0.03971195 -0.77110260 -0.64431515 3836.2686 1.0004352
delta[1]  0.10779876 0.07067768  0.02181789  0.23894013 4253.9584 1.0008411
delta[2]  0.12025954 0.07794762  0.02225073  0.26652180 4951.3060 0.9996026
delta[3]  0.08666906 0.06125356  0.01645026  0.19803202 2473.1828 1.0024484
delta[4]  0.06414256 0.04969173  0.01182441  0.15042389 1764.9911 1.0015228
delta[5]  0.44279365 0.13735187  0.18244217  0.63211812  915.1823 1.0039116
delta[6]  0.08272248 0.05728302  0.01653252  0.19350915 3970.9883 1.0004653
delta[7]  0.09561396 0.06337485  0.01884149  0.21152727 4067.4154 1.0001127


As implied in the text, it is not correct to causally infer the associated of education bE is negative based on m12.6. In this model, the association of education with response has reversed, becoming positive. Since the error bars on neither bStdAge or bE cross zero, we’ll infer for now that the second causal model above is correct.

12H8. Consider one more variable in the trolley data: Gender. Suppose that gender might influence education as well as response directly. Draw the DAG now that includes response, education, age, and gender. Using only the DAG, is it possible that the inferences from 12H7 above are confounded by gender? If so, define any additional models you need to infer the causal influence of education on response. What do you conclude?

Answer. First, lets consider a DAG that doesn’t work. In the following DAG we assume Education -> Age rather than that Age -> Gender:

incorrect_dag <- dagitty('
dag {
bb="0,0,1,1"
Education [pos="0.4,0.2"]
Age [pos="0.6,0.2"]
Response [pos="0.5,0.3"]
Gender [pos="0.3,0.3"]
Education -> Age
Education -> Response
Age -> Response
Gender -> Education
Gender -> Response
}')
iplot(function() plot(incorrect_dag), scale=10)
display(impliedConditionalIndependencies(incorrect_dag))

Age _||_ Gndr | Edct


With the additions implied by this question, this DAG implies that Gender will be able to predict Age without conditioning, which is not correct.

Instead, we’ll add this question’s covariate to the second model from the last question:

gender_dag <- dagitty('
dag {
bb="0,0,1,1"
Age [pos="0.4,0.2"]
Education [pos="0.6,0.2"]
Response [pos="0.5,0.3"]
Gender [pos="0.7,0.3"]
Age -> Education
Education -> Response
Age -> Response
Gender -> Education
Gender -> Response
}')
iplot(function() plot(gender_dag), scale=10)
display(impliedConditionalIndependencies(gender_dag))

Age _||_ Gndr


This DAG implies Age will always be independent of Gender, which seems reasonable in this context, ignoring that females live slightly longer than males. To infer either the direct or total causal effect of Education on Response the required adjustment is to condition on both Age and Gender:

display(adjustmentSets(gender_dag, exposure="Education", outcome="Response", effect="total"))

{ Age, Gender }


We can do so by fitting a model that includes all three variables as predictors of Response:

data(Trolley)
d <- Trolley

edu_levels <- c(6, 1, 8, 4, 7, 2, 5, 3)
d$edu_new <- edu_levels[d$edu]

dat <- list(
R = d$response, action = d$action,
intention = d$intention, contact = d$contact,
E = as.integer(d$edu_new), StdAge = standardize(d$age),
gid = ifelse(d\$male == 1, 2, 1),
alpha = rep(2, 7)
)

# This fit takes more than three hours with 4 chains/cores.
m_trolley_gender <- ulam(
alist(
R ~ ordered_logistic(phi, kappa),
phi <- bE * sum(delta_j[1:E]) + bStdAge * StdAge + bGender[gid]
+ bA * action + bI * intention + bC * contact,
kappa ~ normal(0, 1.5),
bGender[gid] ~ normal(0, 1),
c(bA, bI, bC, bE, bStdAge) ~ normal(0, 1),
vector[8]:delta_j <<- append_row(0, delta),
simplex[7]:delta ~ dirichlet(alpha)
),
data = dat, chains = 2, cores = min(detectCores(), 2), iter=1600
)
flush.console()

iplot(function() {
plot(precis(m_trolley_gender, depth=2), main="m_trolley_gender")
}, ar=1.8)
display_markdown("Raw data (preceding plot):")
display(precis(m_trolley_gender, depth = 2), mimetypes="text/plain")


Raw data (preceding plot):

           mean         sd         5.5%        94.5%      n_eff     Rhat4
kappa[1]   -2.266657325 0.46288890 -3.00485360 -1.5165712  573.0954 1.0016901
kappa[2]   -1.582421818 0.46177762 -2.31693874 -0.8349806  572.3607 1.0018609
kappa[3]   -0.993131342 0.46200481 -1.72707767 -0.2474544  570.7043 1.0017850
kappa[4]    0.051192897 0.46220897 -0.67312307  0.7982066  574.1828 1.0017592
kappa[5]    0.738588857 0.46212487  0.01278630  1.4784193  575.5787 1.0016047
kappa[6]    1.664283873 0.46311785  0.93955882  2.4109957  577.9330 1.0016990
bGender[1]  0.301659579 0.45912846 -0.41939969  1.0395143  561.7080 1.0012553
bGender[2]  0.862541251 0.46088869  0.13650775  1.6024925  565.7668 1.0011667
bStdAge    -0.066221873 0.02193411 -0.10169805 -0.0318095  690.4008 0.9990740
bE          0.002776721 0.17079243 -0.27589601  0.2518510  334.8590 1.0001277
bC         -0.973445079 0.04982326 -1.05475852 -0.8931221 1228.7393 1.0019420
bI         -0.725770562 0.03634345 -0.78454521 -0.6683968 1536.4701 0.9997787
bA         -0.713728611 0.03985093 -0.77665578 -0.6505416 1458.8686 1.0037167
delta[1]    0.150196008 0.09681367  0.03337167  0.3302251 1387.4576 0.9993277
delta[2]    0.146686397 0.09427857  0.03088713  0.3291819 1614.6417 0.9995624
delta[3]    0.138256497 0.09033356  0.02936921  0.2970326 1220.4606 1.0010909
delta[4]    0.136899029 0.09815857  0.02294144  0.3208046  904.8349 0.9990669
delta[5]    0.186898355 0.15213432  0.01796063  0.4656690  494.4601 1.0018564
delta[6]    0.114728654 0.07435170  0.02224116  0.2496879 1319.6433 1.0002505
delta[7]    0.126335060 0.08040863  0.02693776  0.2792283 1802.2701 0.9989585


At this point, education appears to have no causal effect on response. That is, the mean and HPDI of bE are near zero.