While My MCMC Gently Samples

Bayesian modeling, Data Science, and Python

Why hierarchical models are awesome, tricky, and Bayesian

(c) 2017 by Thomas Wiecki

Hierarchical models are underappreciated. Hierarchies exist in many data sets and modeling them appropriately adds a boat load of statistical power (the common metric of statistical power). I provided an introduction to hierarchical models in a previous blog post: Best Of Both Worlds: Hierarchical Linear Regression in PyMC3", written with Danne Elbers. See also my interview with FastForwardLabs where I touch on these points.

Here I want to focus on a common but subtle problem when trying to estimate these models and how to solve it with a simple trick. Although I was somewhat aware of this trick for quite some time it just recently clicked for me. We will use the same hierarchical linear regression model on the Radon data set from the previous blog post, so if you are not familiar, I recommend to start there.

I will then use the intuitions we've built up to highlight a subtle point about expectations vs modes (i.e. the MAP). Several talks by Michael Betancourt have really expanded my thinking here.

In [31]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pymc3 as pm 
import pandas as pd
import theano
import seaborn as sns


data = pd.read_csv('../data/radon.csv')
data['log_radon'] = data['log_radon'].astype(theano.config.floatX)
county_names = data.county.unique()
county_idx = data.county_code.values

n_counties = len(data.county.unique())

The intuitive specification

Usually, hierachical models are specified in a centered way. In a regression model, individual slopes would be centered around a group mean with a certain group variance, which controls the shrinkage:

In [2]:
with pm.Model() as hierarchical_model_centered:
    # Hyperpriors for group nodes
    mu_a = pm.Normal('mu_a', mu=0., sd=100**2)
    sigma_a = pm.HalfCauchy('sigma_a', 5)
    mu_b = pm.Normal('mu_b', mu=0., sd=100**2)
    sigma_b = pm.HalfCauchy('sigma_b', 5)

    # Intercept for each county, distributed around group mean mu_a
    # Above we just set mu and sd to a fixed value while here we
    # plug in a common group distribution for all a and b (which are
    # vectors of length n_counties).
    a = pm.Normal('a', mu=mu_a, sd=sigma_a, shape=n_counties)

    # Intercept for each county, distributed around group mean mu_a
    b = pm.Normal('b', mu=mu_b, sd=sigma_b, shape=n_counties)

    # Model error
    eps = pm.HalfCauchy('eps', 5)
    # Linear regression
    radon_est = a[county_idx] + b[county_idx] * data.floor.values
    # Data likelihood
    radon_like = pm.Normal('radon_like', mu=radon_est, sd=eps, observed=data.log_radon)
In [3]:
# Inference button (TM)!
with hierarchical_model_centered:
    hierarchical_centered_trace = pm.sample(draws=5000, tune=1000, njobs=4)[1000:]
Auto-assigning NUTS sampler...
Initializing NUTS using advi...
Average ELBO = -1,090.4: 100%|██████████| 200000/200000 [00:37<00:00, 5322.37it/s]
Finished [100%]: Average ELBO = -1,090.4
100%|██████████| 5000/5000 [01:15<00:00, 65.96it/s] 
In [32]: