Section 4: Introduction
Parameter Simulation, Optimisation, & Inference
(or “Applying statistics in modern scientific analyses”)
We apply our understanding of Bayesian statistics to the common problems of parameter simulation, optimisation, and inference. Students will learn the fundamentals of hypothesis testing, quantifying goodness-of-fit, and parameter inference. We discuss common errors in parameter inference, including standard physical and astrophysical biases that corrupt statistical analyses.
Bayesian Modelling
The previous examples of how to calculate model preferences are all well and good, but this is where the magic happens.
Because there is some uncertainty in expressing/specifying any single model: \[ f(\theta,x)=f(x|\theta)f(\theta) \] we can instead construct a single model that we define as the union of all alternative models that we might wish to entertain. We then provide a prior over the suite of encompassed models.
Take an example where we have two models that we think might be appropriate for our dataset, both of which fall within the general “Gamma” family of distributions.
Recall that the Gamma family of distributions all take the format: \[ f(x|\alpha,\beta,\gamma)=\frac{\gamma\beta^\alpha}{\Gamma(\alpha)}x^{\alpha\gamma-1}\exp\left(-\beta x^\gamma\right) \] Our two hypothesised models are a Weibull distribution: \[ f_1(x|\beta,\gamma)=\gamma\beta x^{\gamma-1}\exp\left(-\beta x^\gamma\right) \] and a two-parameter Gamma distribution: \[ f_2(x|\alpha,\beta)=\frac{\beta^\alpha}{\Gamma(\alpha)}x^{\alpha-1}\exp\left(-\beta x\right) \] We can analyse these two models in the same way as previously: require that the two hypotheses be exhaustive (\(P(m_1)=1-P(m_2)\)), and formulate priors on the values of \(\alpha,\beta,\gamma\).
However, we could alternatively specify a single encompassing model that is just the generalised gamma distribution. This distribution contains both of the previous two models, and many more besides. Nominally it is no more or less sensible to formulate our model comparison using priors on \(\alpha,\beta,\gamma\) instead of on \(m_1,m_2\), and we can construct a prior that recovers the behaviour of having only the two models in any case, by placing mass only on the two relevant slices of the parameter space: \[ f(\alpha,\beta,\gamma)= \begin{cases} f(\beta,\gamma) & \textrm{if}\;\; \alpha=1 \;\;\textrm{(Weibull)} \\ f(\alpha,\beta) & \textrm{if}\;\; \gamma=1 \;\;\textrm{(Gamma)} \\ 0 & \textrm{otherwise} \\ \end{cases} \]
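As a quick sanity check, we can verify numerically that the generalised gamma density reduces to R's built-in Weibull and Gamma densities at \(\alpha=1\) and \(\gamma=1\) respectively. The dgengamma() helper and the parameter values below are purely illustrative, and are not part of any formal analysis.
dgengamma <- function(x, a, b, g) {
  # f(x | alpha, beta, gamma) as defined above
  g * b^a / gamma(a) * x^(a * g - 1) * exp(-b * x^g)
}
x <- seq(0.1, 5, by = 0.1)
# alpha = 1 recovers the Weibull (R parameterisation: shape = gamma, scale = beta^(-1/gamma))
max(abs(dgengamma(x, a = 1, b = 1.5, g = 0.8) -
        dweibull(x, shape = 0.8, scale = 1.5^(-1/0.8))))   # zero to machine precision
# gamma = 1 recovers the two-parameter Gamma (shape = alpha, rate = beta)
max(abs(dgengamma(x, a = 2, b = 1.5, g = 1) -
        dgamma(x, shape = 2, rate = 1.5)))                 # zero to machine precision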
What did we just do?
We just demonstrated that we can perform model comparison within the Bayesian framework by specifying a generic model and providing priors on the parameters that govern it. In this way, the likelihood we specified was general: rather than picking particular values for the models in the likelihood, we specified a distribution of possible likelihoods and gave (possibly broad) priors on the variables that govern the distribution of possible models.
This leads us to an interesting class of models known (appropriately) as Bayesian hierarchical models (BHMs).
Start small
As a demonstration of the power of BHMs, we’re going to spend the remainder of this lecture showing how we can take an initially simple model and, with a few logical steps, construct a much more complex model that has exceptional explanatory power.
The dataset that we’re going to explore today to demonstrate this process comes from the US, where \(8\) high schools reported on the possible benefits of giving students additional coaching prior to the SAT-V (“scholastic aptitude test - verbal”) exams.
Under randomisation, students at each school were given either extra coaching or not. SATs are designed to be resilient to short-term efforts; however, all schools think that their program is effective nonetheless.
There’s no prior reason to expect that any program is more or less effective than the others.
The data
We have \(J=8\) independent experiments, with coaching effects \(\theta_j\) being judged by \(n_j\) i.i.d. normally distributed observations \(y_{ij}\), each with (assumed) known error variance \(\sigma^2\).
That is: \[ y_{ij}|\theta_j \sim N(\theta_j,\sigma^2),\;\; i=1,\dots,n_j;\;\; j=1,\dots,J \] The estimated effect of the coaching at each school is given by the sample mean \(\bar{y}_j\), with standard error \(\sigma_j\): \[ \bar{y}_j = \frac{1}{n_j}\sum_{i=1}^{n_j} y_{ij} \\ \sigma_j = \sqrt{\frac{\sigma^2}{n_j}} \]
The likelihood for each \(\theta_j\) can be expressed in terms of the sufficient statistic: \[ \bar{y}_{j}|\theta_j\sim N(\theta_j,\sigma_j^2) \]
# Summary data from the J=8 coaching experiments: estimated effect and standard error per school
dat<-data.frame(Estimated_Effect=c(28,8,-3,7,-1,1,18,12),
                Standard_Error=c(15,10,16,11,9,11,10,18))
rownames(dat)<-LETTERS[1:8]   # label the schools A-H
print(dat)
## Estimated_Effect Standard_Error
## A 28 15
## B 8 10
## C -3 16
## D 7 11
## E -1 9
## F 1 11
## G 18 10
## H 12 18
Methods of analysis
Each to their own
There are multiple ways that we could approach the modelling of this dataset. The first option is to treat each experiment independently, as the data have been provided. We will call this the separate analysis.
At first glance, there is a mixed bag of results. Some schools show reasonably large effects (\(\theta_j\geq 18\)), some show small effects (\(0\leq \theta_j\leq 12\)), and some show small negative effects.
However, each estimate also has a large standard error, which makes it difficult to distinguish between the different results: the \(95\%\) posterior credible intervals for these results all overlap substantially.
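To see this overlap concretely, here is a quick sketch (not part of the original analysis): under a flat prior each separate posterior is \(N(\bar{y}_j,\sigma_j^2)\), so the \(95\%\) credible intervals are simply \(\bar{y}_j\pm 1.96\,\sigma_j\).
# Flat-prior 95% credible intervals for each school's separate estimate
ci <- with(dat, data.frame(
  lower = Estimated_Effect - qnorm(0.975) * Standard_Error,
  upper = Estimated_Effect + qnorm(0.975) * Standard_Error
))
rownames(ci) <- rownames(dat)
print(round(ci, 1))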
All together now
The large overlap between the individual credible intervals might suggest that all of the experiments are trying to measure the same underlying quantity. So we might prefer to assume that \(\theta_j=\theta_0\;\forall j\in\{1,\dots,J\}\). That is, that all the values of \(\theta_j\) are the same. Given this hypothesis, we can estimate the value of each \(\theta_j\) using the pooled average \(\bar{y}\).
Said differently: if we assume that all experiments measure the same effect (and produce random, independent estimates), then we can treat the \(8\) experiments as i.i.d. observations of the underlying truth, with known variances.
We can estimate this quantity simply: \[ \bar{y} = \frac{\sum_{j=1}^J w_j\bar{y}_j}{\sum_{j=1}^J w_j} \] where \(w_j=1/\sigma^2_j\).
# Inverse-variance weighted mean of the estimated effects
ybar<-with(dat,{
  weight<-1/Standard_Error^2
  return(sum(weight*Estimated_Effect)/sum(weight))
})
print(ybar)
## [1] 7.685617
The variance of this estimate is the inverse of the sum of the weights: \[ \textrm{var}(\bar{y}) = \frac{1}{\sum_{j=1}^J w_j} \]
# Variance of the pooled estimate: inverse of the summed weights
var_ybar<-with(dat,{
  weight<-1/Standard_Error^2
  return(1/sum(weight))
})
print(var_ybar)
## [1] 16.58053
So under the pooled analysis we have the estimate \(\theta_j\sim N(7.69,16.58)\) for every school, i.e. a standard error of \(\sqrt{16.58}\approx 4.07\).
Does this seem reasonable?
Take the experiment at school A as a test case:
- In the independent analysis: \(\hat{\theta}_1=28\) and \(\hat{\sigma}_1=15\).
- In the pooled analysis: \(\hat{\theta}_1=7.69\) and \(\hat{\sigma}_1=4.07\).
Putting these results into words: the first estimate tells us that the probability of the true \(\theta_1\) being greater than \(28\) is \(\frac{1}{2}\). However, this is hard to reconcile with our third experiment, which has \(\hat{\theta}_3=-3\). Conversely, the pooled estimate tells us that the probability of the true \(\theta_1\) being less than \(7.69\) is also \(\frac{1}{2}\), which is similarly difficult to justify given the observed estimate of \(28\) at school A.
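A quick sketch of these two statements in code, using the normal tail probabilities implied by each analysis for school A (re-using the ybar and var_ybar values computed above):
# Separate analysis: theta_1 ~ N(28, 15^2)
pnorm(28, mean = 28, sd = 15, lower.tail = FALSE)                # P(theta_1 > 28) = 0.5
# Pooled analysis: theta_1 ~ N(7.69, 4.07^2)
pnorm(7.69, mean = ybar, sd = sqrt(var_ybar))                    # P(theta_1 < 7.69) ~ 0.5
pnorm(28, mean = ybar, sd = sqrt(var_ybar), lower.tail = FALSE)  # P(theta_1 > 28) is vanishingly small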
A Hierarchical Model
We can display our two previous models as directed acyclic graphs:
These show how the variable we observe (\(\bar{y}_j\)) is related to the parameter of interest \(\theta\). In the first instance (i.e. the separate estimates), we assumed that each school observed a totally independent \(\theta_j\). In the second case (i.e. the pooled estimate), we assumed that \(\theta\) was a single constant shared by all schools.
Let’s now instead assume that the values of \(\theta_j\) are drawn from a normal distribution, whose properties are determined by two hyper-parameters \((\mu, \tau)\).
Mathematically, we define the joint probability of all our \(\theta_j\), conditional on the hyper-parameters, as the product of the probabilities of each \(\theta_j\) being drawn from the parent population \(N(\mu,\tau^2)\); marginalising over the hyper-parameter prior \(f(\mu,\tau)\) then gives the joint prior on the \(\theta_j\): \[ \begin{align} f(\theta_1,\dots,\theta_J|\mu,\tau) &= \prod_{j=1}^{J} N(\theta_j|\mu,\tau^2) \\ f(\theta_1,\dots,\theta_J) &=\int \prod_{j=1}^{J}\left[N\left(\theta_j|\mu,\tau^2\right)\right]f(\mu,\tau)\,\textrm{d}(\mu,\tau) \end{align} \]
This is a hierarchical model, in which we interpret the \(\theta_j\)’s as being randomly drawn from some shared parent distribution. Why is this useful?
There are physical reasons why you might expect this to be the case, but for now let’s ignore those. We initially had the problem of deciding between the separate and the pooled estimates. However, in our hierarchical model:
- As \(\tau\rightarrow 0\): the \(\theta_j\) values are drawn from a narrower and narrower range around \(\mu\). In the limit, \(\theta_j=\mu\;\forall j\in\{1,\dots,J\}\), and so we have the pooled estimate.
- As \(\tau\rightarrow \infty\): the \(\theta_j\) values become independent of each other. That is, if we know \(\theta_1\) with absolute certainty, this gives us no information about \(\theta_2\). This is therefore the independent/separate estimate.
For finite, non-zero values of \(\tau\), our result will therefore be some mixture of the pooled and separate analyses.
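As a minimal sketch of this behaviour: for fixed hyper-parameters \((\mu,\tau)\), the conditional posterior mean of each \(\theta_j\) is the precision-weighted average of \(\bar{y}_j\) and \(\mu\). The helper below illustratively fixes \(\mu\) at the pooled mean \(\bar{y}\); in the full hierarchical analysis, \(\mu\) and \(\tau\) would of course receive their own priors and posteriors.
shrunk_estimates <- function(tau, mu = ybar) {
  with(dat, {
    w_data  <- 1 / Standard_Error^2   # precision of each experiment
    w_prior <- 1 / tau^2              # precision of the parent distribution
    (w_data * Estimated_Effect + w_prior * mu) / (w_data + w_prior)
  })
}
round(shrunk_estimates(tau = 0.01), 2)  # tau -> 0: collapses to the pooled estimate
round(shrunk_estimates(tau = 1000), 2)  # tau -> infinity: recovers the separate estimates
round(shrunk_estimates(tau = 10), 2)    # intermediate tau: partial pooling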
Important implications
The smaller \(\tau\), the more closely related the individual values of \(\theta_j\) are. This means that each experiment contributes more to the estimates of \(\theta_j\) for the other experiments: the experiments “borrow strength” from one another.
Building a model
As our last exercise in Bayesian Hierarchical Models, we’re going to build a complex hierarchical model from scratch, using nothing but logic.
Consider the following scenario
You have two passions in life: statistics, and watching videos of dogs on the internet. These passions are largely separate, until one day you discover the unbridled joy of televised dog shows…
The Dog-show dataset