Probabilistic Programming in Python


Thomas Wiecki

@twiecki

Quantopian Inc.

In [65]:
from IPython.display import YouTubeVideo
YouTubeVideo("WESld11iNcQ")
Out[65]:
In [19]:
from IPython.display import Image
import prettyplotlib as ppl
from prettyplotlib import plt
import numpy as np
import scipy as sp
import seaborn as sns
sns.set_context('talk')
from matplotlib import rc
rc('font',**{'family':'sans-serif','sans-serif':['Helvetica'], 'size': 22})
rc('xtick', labelsize=12) 
rc('ytick', labelsize=12)
## for Palatino and other serif fonts use:
#rc('font',**{'family':'serif','serif':['Palatino']})
rc('text', usetex=True)
%matplotlib inline

from IPython.html import widgets # Widget definitions
from IPython.display import display # Used to display widgets in the notebook
from IPython.html.widgets import interact, interactive
from IPython.display import clear_output, display, HTML

def gen_plot(success=(0, 100), failure=(0, 100)):
    alpha = 5 + success
    beta = 5 + failure
    fig = plt.figure(figsize=(8, 6))
    x = np.linspace(0, 1, 100)
    ax = fig.add_subplot(111, xlabel='Chance of success (beating the market)', 
                         ylabel='Probability of hypothesis', 
                         title=r'Posterior probability distribution of $\theta$')
    ax.plot(x, sp.stats.beta(alpha, beta).pdf(x), linewidth=3.)
    ax.set_xticklabels(['0\%', '20\%', '40\%', '60\%', '80\%', '100\%']);
In [5]:
from IPython.display import Image
import prettyplotlib as ppl
from prettyplotlib import plt
import numpy as np
import scipy as sp
import seaborn as sns
sns.set_context('talk')
from matplotlib import rc
rc('font',**{'family':'sans-serif','sans-serif':['Helvetica'], 'size': 22})
rc('xtick', labelsize=12) 
rc('ytick', labelsize=12)
## for Palatino and other serif fonts use:
#rc('font',**{'family':'serif','serif':['Palatino']})
rc('text', usetex=True)
%matplotlib inline

from IPython.html import widgets # Widget definitions
from IPython.display import display # Used to display widgets in the notebook
from IPython.html.widgets import interact, interactive
from IPython.display import clear_output, display, HTML

About me

  • PhD candidate at Brown studying decision making using Bayesian modeling.
  • Quantitative Researcher at Quantopian Inc: Building the world's first algorithmic trading platform in the web browser.

Why should you care about Probabilistic Programming?

Motivating Example -- A/B testing

  • Split users into two groups that experience different versions of your website.
  • Measure conversion rate (e.g. click-through rate).
  • Simple question: Which one does better?

Same problem: two trading strategies -- which one has a higher chance of beating the market?

Data might look like this

In [38]:
from IPython.display import Image
import prettyplotlib as ppl
from prettyplotlib import plt
import numpy as np
import scipy as sp
import seaborn as sns
sns.set_context('talk')
from matplotlib import rc
rc('font',**{'family':'sans-serif','sans-serif':['Helvetica'], 'size': 22})
rc('xtick', labelsize=12) 
rc('ytick', labelsize=12)
## for Palatino and other serif fonts use:
#rc('font',**{'family':'serif','serif':['Palatino']})
rc('text', usetex=True)
%matplotlib inline

from IPython.html import widgets # Widget definitions
from IPython.display import display # Used to display widgets in the notebook
from IPython.html.widgets import interact, interactive
from IPython.display import clear_output, display, HTML
In [62]:
np.random.seed(9)
algo_a = sp.stats.bernoulli(.5).rvs(10) # 10 samples from 50% chance of beating the market
algo_b = sp.stats.bernoulli(.6).rvs(10) # 10 samples from 60% chance of beating the market

print 'Beat the market algo A: ', algo_a
print 'Beat the market algo B: ', algo_b
Beat the market algo A:  [0 1 0 0 0 0 0 0 0 0]
Beat the market algo B:  [1 0 0 1 0 1 0 0 1 0]

The simplest answer

  • Take the mean.
  • This is equivalent to the maximum likelihood estimate, so statistically not a bad idea.
In [212]:
print 'Probability algo A beating the market = %.0f%%' 
% (np.mean(algo_a) * 100)
print 'Probability algo B beating the market = %.0f%%' 
% (np.mean(algo_b) * 100)
Probability algo A beating the market = 10%
Probability algo B beating the market = 40%

How to decide which one is better?

  • Run a frequentist statistical test!
  • Returns a p-value that indicates probability of observing this data by chance.

What if we get more data?

  • Run another statistical test!
In [63]:
def run_exp():
    # Generate 50 random binary variables with probability = 0.5
    test_a = sp.stats.bernoulli(.5).rvs(50)
    test_b = sp.stats.bernoulli(.5).rvs(50)

    for i in range(2, 50):
        # Run statistical t-test
        _, p = sp.stats.ttest_ind(test_a[:i], test_b[:i])
        if p < 0.05:
            return True
        
    return False

p_sign_result = np.mean([run_exp() for i in range(1000)])

print 'Probability of getting significant result even though no\n difference exists = %.2f%%' % (p_sign_result * 100)
Probability of getting significant result even though no
 difference exists = 36.60%

The problem with frequentist statistics:

Dependent on your intentions of collecting the data!

What is Bayesian statistics?

  • At the core: formula to update our beliefs after having observed data (Bayes formula)
  • Implies that we have a prior belief about the world.
  • Updated beliefs after observing data is called posterior.
  • Beliefs are represented using random variables.

Let's define a prior for our random variable $\theta$

In [183]:
def gen_plot(successes=(0, 100, 1), failures=(0, 100, 1)):
    alpha_prior=5
    beta_prior=5
    if successes == 0 and failures == 0:
        title = r'Prior probability distribution on $\theta$'
    else:
        title = r'Posterior probability distribution of $\theta$ after having seen data'
        
    alpha = alpha_prior + successes
    beta = beta_prior + failures
    x = np.linspace(0, 1, 201)
    fig = plt.figure(figsize=(8, 6))
    ax = fig.add_subplot(111, xlabel=r'Chance of success (profit or conversion)', 
                         ylabel=r'Probability',
                         title=title)
    ax.plot(x, sp.stats.beta(alpha, beta).pdf(x), linewidth=3.)
    ax.set_xticklabels([r'0\%', r'20\%', r'40\%', r'60\%', r'80\%', r'100\%']);
    
    return
In [15]:
def gen_plot(success=(0, 100), failure=(0, 100)):
    alpha = 5 + success
    beta = 5 + success
    fig = plt.figure(figsize=(8, 6))
    x = np.linspace(0, 1, 100)
    ax = fig.add_subplot(111, xlabel='Chance of success (beating the market)', 
                         ylabel='Probability of hypothesis', 
                         title=r'Posterior probability distribution of $\theta$')
    ax.plot(x, sp.stats.beta(alpha, beta).pdf(x), linewidth=3.)
    ax.set_xticklabels(['0\%', '20\%', '40\%', '60\%', '80\%', '100\%']);
In [16]:
gen_plot(0, 0)
In [64]:
interactive(gen_plot)

Where's the catch?

In most interesting cases, we can't solve what is inside the Bayesian box.

Probabilistic Programming

  • If we can't solve something, approximate it.
  • Markov-Chain Monte Carlo (MCMC) instead draws samples from the posterior.
  • Fortunately, this algorithm can be applied to almost any model.
In [143]:
def plot_approx():
    from scipy import stats
    fig = ppl.plt.figure(figsize=(14, 6))
    x_plot = np.linspace(0, 1, 200)
    ax1 = fig.add_subplot(121, title='What we want', ylim=(0, 2.5))# , xlabel=r'$\theta$', ylabel=r'$P(\theta)$')
    ppl.plot(ax1, x_plot, stats.beta(4, 4).pdf(x_plot), linewidth=4.)
    ax2 = fig.add_subplot(122, title='What we get', xlim=(0, 1), ylim=(0, 2300))#, xlabel=r'\theta', ylabel='\# of samples')
    ax2.hist(stats.beta(4, 4).rvs(20000), bins=20);

Approximating the posterior with MCMC sampling

In [144]:
plot_approx()

PyMC3

  • Probabilistic Programming framework written in Python.
  • Allows for construction of probabilistic models using intuitive syntax.
  • Features advanced MCMC samplers.
  • Just-in-time compiled by Theano.
  • Extensible: easily incorporates custom MCMC algorithms and unusual probability distributions.
  • Written by: John Salvatier, Chris Fonnesbeck, Thomas Wiecki
  • Currently still in alpha (few bugs but mainly missing docs).

Some necessary formalism:

$\theta_a \sim Beta(5, 5)$

$\theta_b \sim Beta(5, 5)$

$data_a \sim Bernoulli(\theta_a)$

$data_b \sim Bernoulli(\theta_b)$

  • $\theta$ represents the unkown cause we want to infer, but can't observe directly.
  • We write a generative story of how this unknown cause relates to observable data.
  • Use Bayes to invert this generative story and infer hidden causes from observations -- updating our beliefs about $\theta$.
In [206]:
np.random.seed(9)
algo_a = sp.stats.bernoulli(.5).rvs(300) # 50% profitable days
algo_b = sp.stats.bernoulli(.6).rvs(300) # 60% profitable days
Chance of beating the market algo A:  0.493333333333
Chance of beating the market algo B:  0.6

In [207]:
import pymc as pm
model = pm.Model()
with model: # model specifications in PyMC3 are wrapped in a with-statement
    # Define random variables
    theta_a = pm.Beta('theta_a', alpha=5, beta=5) # prior
    theta_b = pm.Beta('theta_b', alpha=5, beta=5) # prior
    
    # Define how data relates to unknown causes
    data_a = pm.Bernoulli('observed A',
                          p=theta_a, 
                          observed=algo_a)
    
    data_b = pm.Bernoulli('observed B', 
                          p=theta_b, 
                          observed=algo_b)
    
    # Inference!
    start = pm.find_MAP() # Find good starting point
    step = pm.Slice() # Instantiate MCMC sampling algorithm
    trace = pm.sample(10000, step, start=start, progressbar=False) # draw posterior samples using slice sampling 
In [209]:
sns.distplot(trace['theta_a'], label=r'$\theta_a$')
sns.distplot(trace['theta_b'], label=r'$\theta_b$')
plt.title('Posteriors')
plt.legend()
Out[209]:
<matplotlib.legend.Legend at 0xf6ef250>

Hypothesis testing is trivial

In [208]:
p_b_better_than_a = np.mean(trace['theta_a'] < trace['theta_b'])
print 'Probability that algo B is better than A = %.2f%%' % (p_b_better_than_a * 100)
Probability that algo B is better than A = 99.11%

Advanced example -- hierarchical models

  • Consider that instead of 2 algorithms, we have 20.
  • Before we considered that $\theta_a$ and $\theta_b$ are completely independent.
  • In reality, while they will probably be different, they will also have similarities.
  • We can use hierarchical modeling to infer individual algorithms, but also their group distribution.

Unpooled

unpooled

Pooled

pooled

Partially pooled -- hierarchical

hierarchical

In [247]:
np.random.seed(9)
algos = []
algos_idx = []
samples = 300
for i, p_algo in enumerate(sp.stats.beta(10, 10).rvs(20)):
    algos.append(sp.stats.bernoulli(p_algo).rvs(samples))
    algos_idx.append(np.ones(samples) * i)
    
algos = np.asarray(algos)
algos_idx = np.asarray(algos_idx, dtype=np.int)
print algos.shape
print algos
print algos_idx
(20, 300)
[[0 1 1 ..., 1 0 0]
 [0 1 1 ..., 0 1 0]
 [1 1 1 ..., 1 1 1]
 ..., 
 [1 0 0 ..., 1 0 1]
 [0 1 1 ..., 0 0 0]
 [1 1 0 ..., 0 1 0]]
[[ 0  0  0 ...,  0  0  0]
 [ 1  1  1 ...,  1  1  1]
 [ 2  2  2 ...,  2  2  2]
 ..., 
 [17 17 17 ..., 17 17 17]
 [18 18 18 ..., 18 18 18]
 [19 19 19 ..., 19 19 19]]

In [248]:
with pm.Model() as model: # model specifications in PyMC3 are wrapped in a with-statement
    # Define random variables
    grp_mean = pm.Beta('grp mean', alpha=2, beta=2) # prior
    grp_scale = pm.Gamma('grp scale', alpha=1, beta=10./10**2) # prior
    
    # Transform
    alpha = grp_mean * grp_scale
    beta = (1 - grp_mean) * grp_scale
    
    # Individual random variables, vector of lenght 20
    theta = pm.Beta('theta', alpha=alpha, beta=beta, shape=20)
    
    # Define how data relates to unknown causes
    data = pm.Bernoulli('observed',
                        p=theta[algos_idx],
                        observed=algos)
    
    # Inference!
    start = pm.find_MAP() # Find good starting point
    step = pm.NUTS (scaling=start) # Instantiate MCMC sampling algorithm
    trace = pm.sample(10000, step, start=start, progressbar=False)[:2000] # draw posterior samples using slice sampling
In [249]:
pm.traceplot(trace);

Conclusions

  • Probabilistic Programming allows you to tell a genarative story.
  • Blackbox inference algorithms allow estimation of complex models.
  • PyMC3 puts advanced samplers at your fingertips.

Outstanding Issues

Scalability

  • Variational Inference
  • see also Max Welling's work for scaling MCMC

Usability

  • still too difficult to use
  • wanted: library on top of PyMC3 with common models

Further reading