Author: John Hallman

Resources:

- An Introduction to MCMC for Machine Learning - primary reading for this presentation
- A Short History of Markov Chain Monte Carlo: Subjective Recollections from Incomplete Data - additional reading about the history of MCMC

In statistics and probability theory, there are times when we want to sample from or reason about probability distributions that is very hard to model explicitly. The introductory quote from the paper we are discussing today summarizes this intuition:

<aside>
ðŸ’¡ *While convalescing from an illness in 1946, Stan Ulam was playing solitaire. It, then, occurred to him to try to compute the chances that a particular solitaire laid out with 52 cards would come out successfully. After attempting exhaustive combinatorial calculations, he decided to go for the more practical approach of laying out several solitaires at random and then observing and counting the number of successful plays. This idea of selecting a statistical sample to approximate a hard combinatorial problem by a much simpler problem is at the heart of modern Monte Carlo simulation.*

</aside>

The above quote describes *Monte Carlo simulation* - using random sampling to estimate some unknown quantity.

*Markov Chain* Monte Carlo (MCMC) simulation is then a category of algorithms that utilize random sampling in order to estimate probability distributions by constructing a Markov Chain that has the desired probability distribution as its stationary distribution.

*Why do we care about sampling from distributions?*

There are tons of situations where probability distributions are not accessible in closed or convenient form. One example that is particularly relevant to Sisu is in Bayesian Structural Time Series (BSTS) models, see for example the following papers:

- Predicting the Present with Bayesian Structural Time Series (Scott & Varian, 2013)
- Inferring Causal Impact using Bayesian Structural Time-Series Models (Brodersen et al., 2013)

In both of these, the popular *spike-and-slab* prior (Mitchell & Beauchamp, 1988) is used to incorporate a sparse prior over the parameter space. The spike-and-slab is a degenerate prior with a point mass probability at 0 (the spike) which encodes the prior expectation that most parameters should be 0, and a simple prior $P(x)$ for all $x \neq 0$ such as a Gaussian or constant.