In this exercise we will look at different adaptive Metropolis algorithms. The aim is to get an intuition for what their “adaptations” are doing, by examining their behaviour with our simple monod model. You are encouraged to play with each of the adaptation parameters of each algorithm and to check how it influences the resulting sample chain.

Define posterior distribution

You can use your own code from Exercise 2 or the functions defined below. Of course, you can also do the exercises with the growth model.

R

## load the monod model
source("../../models/models.r")

## read data
data.monod <- read.table("../../data/model_monod_stoch.csv", header=T)

## Logprior for model "monod": lognormal distribution
prior.monod.mean <- 0.5 * c(r_max=5, K=3, sigma=0.5)
prior.monod.sd   <- 0.5 * prior.monod.mean

logprior.monod <- function(par, mean, sd){
    sdlog <- sqrt(log(1+sd*sd/(mean*mean)))
    meanlog <- log(mean) - sdlog*sdlog/2
    return(sum(dlnorm(par, meanlog=meanlog, sdlog=sdlog, log=TRUE)))
}

## Log-likelihood for model "monod"
loglikelihood.monod <- function(y, C, par){

    ## deterministic part:
    y.det <- model.monod(par, C) # defined in `models.r`

    ## Calculate loglikelihood assuming independence:
    return( sum(dnorm(y, mean=y.det, sd=par['sigma'], log=TRUE )) )
}

## Log-posterior for model "monod"
logposterior.monod <- function(par) {
    lp <- logprior.monod(par, prior.monod.mean, prior.monod.sd)
    if(is.finite(lp)){
        return( lp + loglikelihood.monod(data.monod$r, data.monod$C, par) )
    } else {
        return(-Inf)
    }
}
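As a quick, optional sanity check you can evaluate the log posterior once, for example at the prior mean; the value is not important, but it should be finite:

logposterior.monod(prior.monod.mean)   # should be finite, not -Inf or NaN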

Julia

using DataFrames
import CSV
using Distributions
using ComponentArrays

## load monod model
include("../../models/models.jl");

## read data
monod_data = CSV.read("../../data/model_monod_stoch.csv", DataFrame)

# set parameters
prior_monod_mean = ComponentVector(r_max = 2.5, K=1.4, sigma=0.25);
prior_monod_sd = 0.5 .* prior_monod_mean;

## Use a lognormal distribution for all model parameters
function logprior_monod(par, m, sd)
    μ = @. log(m/sqrt(1+sd^2/m^2))
    σ = @. sqrt(log(1+sd^2/m^2))
    return sum(logpdf.(LogNormal.(μ, σ), par)) # sum because we are in the log-space
end

## Log-likelihood for model "monod"
function loglikelihood_monod(par::ComponentVector, data::DataFrame)
    y_det = model_monod(data.C, par)
    return sum(logpdf.(Normal.(y_det, par.sigma), data.r))
end

## Log-posterior for model "monod"
function logposterior_monod(par::ComponentVector)
    lp = logprior_monod(par, prior_monod_mean, prior_monod_sd)
    if !isinf(lp)
        lp += loglikelihood_monod(par, monod_data)
    end
    lp
end

1. Adaptive Metropolis with delayed rejection ★

We try the adaptive Metropolis with delayed rejection as described by Haario, Saksman and Tamminen (2001).

  • What could you use as initial values?

  • What is a meaningful initial covariance matrix for the jump distribution?

  • Plot the chains and see how quickly they converge.

  • Look at 2D marginal plots. What happens in these plots if you do not cut off a burn-in, or if you only use the beginning of the chain?

Hints

R

The algorithm is implemented in the function modMCMC of the package FME. Note that this function expects the negative log density, which is sometimes called the energy. See the modMCMC documentation for details.

neg.log.post <- function(par) -logposterior.monod(par)

Now call the modMCMC function and investigate the effects of the updatecov parameter (try values of 10, 100 and 1000), which determines how often the covariance of the jump distribution is updated, and of the ntrydr parameter (try values of 1, 3 and 10), which determines the maximum number of jump attempts (delayed rejection) before a step is finally rejected. In particular, examine the first part of the chain to see how the adaptation works.
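The call below assumes starting values par.init and an initial covariance jump.cov of the jump distribution, which you have to choose yourself (see the questions above). One possible, purely illustrative choice is to start at the prior mean with a small diagonal covariance based on the prior standard deviations:

par.init <- prior.monod.mean                # e.g. start at the prior mean
jump.cov <- diag((0.1 * prior.monod.sd)^2)  # small diagonal initial covariance; it will be adapted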

AMDR <- modMCMC(
    f         = neg.log.post,
    p         = par.init,
    jump      = jump.cov,
    niter     = 10000,
    updatecov = 10,
    ntrydr    = 3
)
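FME provides plot and pairs methods for the object returned by modMCMC; they are a convenient way to look at the traces and the 2D marginals mentioned above:

plot(AMDR)    # trace plots; inspect the first part of the chain to see the adaptation
pairs(AMDR)   # 2D marginal (pairs) plots of the sampled parameters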

Julia

The package AdaptiveMCMC.jl provides multiple MCMC algorithms, including the adaptive Metropolis.

using ComponentArrays
using AdaptiveMCMC
using MCMCChains
using Plots
using StatsPlots

θinit = ComponentVector(r_max=..., K=..., sigma=...)

res = adaptive_rwm(θinit, logposterior_monod,
                   10_000;
                   algorithm = :am,
                   b=1,   # length of burn-in. Set to 1 for no burn-in.
                   );

# convert to MCMCChains for summary and plotting
chn = Chains(res.X', labels(θinit))

plot(chn)
corner(chn)

2. Robust adaptive Metropolis ★

The robust adaptive Metropolis algorithm proposed by Vihola (2012) is often a good choice. It adapts the scale and rotation of the proposal covariance matrix until a predefined acceptance rate is reached.

  • Did the algorithm reach the desired acceptance rate?

  • How is the covariance matrix after the adaptation different from the initial covariance that you provided?

Hints

R

The package adaptMCMC provides the function MCMC. If the argument adapt is set to TRUE, it implements the adaptation proposed by Vihola. Again, examine the effect of the adaptation settings on the chains, in particular the first part of the chain, to see how the adaptation works. Here the target acceptance rate is set with the argument acc.rate; try values between 0.1 and 1.

What is the influence on the burn-in? What happens if you use a very bad initial value?
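As before, the call below assumes a starting value par.start and an initial scale matrix jump.cov of the jump distribution; one possible, illustrative choice is:

par.start <- prior.monod.mean               # e.g. start at the prior mean (also try a very bad value)
jump.cov  <- diag((0.1 * prior.monod.sd)^2) # initial scale of the jump distribution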

RAM <- MCMC(
    p        = logposterior.monod,
    n        = 10000,
    init     = par.start,
    scale    = jump.cov,
    adapt    = TRUE,
    acc.rate = 0.5
)

str(RAM)
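To answer the questions above you can inspect the elements of the returned list; the element names used below are what adaptMCMC typically returns, so check the output of str(RAM) if they differ:

RAM$acceptance.rate   # realized acceptance rate, compare with acc.rate
RAM$cov.jump          # covariance of the jump distribution after adaptation
cov(RAM$samples)      # empirical covariance of the sample, for comparison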

Julia

The package AdaptiveMCMC.jl provides multiple MCMC algorithms, including the robust adaptive Metropolis.

using ComponentArrays
using AdaptiveMCMC
using MCMCChains
using Plots
using StatsPlots

θinit = ComponentVector(r_max=..., K=..., sigma=...)

res = adaptive_rwm(θinit, logposterior_monod,
                   10_000;
                   algorithm = :ram,
                   b=1,   # length of burn-in. Set to 1 for no burn-in.
                   );

# convert to MCMCChains for summary and plotting
chn = Chains(res.X', labels(θinit))

plot(chn)
corner(chn)

3. Population based sampler ★

Population based samplers do not have an explicit jump distribution. Instead, they run multiple chains (often called particles or walkers in this context) in parallel, and the proposals are generated based on the positions of the other particles.

A popular algorithm of this class is the affine-invariant MCMC population sampler proposed by Goodman and Weare (2010). The algorithm is often called EMCEE, after the Python package of the same name.

Hints

R

Population based samplers are implemented in the R package mcmcensemble, which provides the function MCMCEnsemble. It supports two methods: the stretch move (method = "stretch") and differential evolution (method = "differential.evolution").
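The call below needs a lower and an upper bound for the initial values of the walkers, as well as the number of walkers; one possible, illustrative choice based on the prior is:

n.walkers       <- 10                        # number of walkers (parallel chains)
par.start.lower <- 0.5 * prior.monod.mean    # lower bounds for the initial values
par.start.upper <- 1.5 * prior.monod.mean    # upper bounds for the initial values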

EMCEE <- MCMCEnsemble(
    f           = logposterior.monod,
    lower.inits = par.start.lower,
    upper.inits = par.start.upper,
    max.iter    = 10000,
    n.walkers   = n.walkers,
    method      = "stretch",
    coda        = FALSE
)

  • How do you choose par.start.lower and par.start.upper? Is a wide or a narrow range better?

  • What is the influence of the number of walkers?

  • Which method works better in this case?

Julia

We use the package KissMCMC, which provides the function emcee:

using ComponentArrays
using KissMCMC: emcee
using MCMCChains
using Plots
using StatsPlots

# number of walkers (parallel chains)
n_walkers = 10

## We need a vector of initial values, one for each walker.
## Make sure that they do not all start from the same point.
θinits = [θinit .* rand(3) for _ in 1:n_walkers]   # randomly scaled copies of θinit

# Run sampler
samples, acceptance_rate, lp = emcee(logposterior_monod,
                                     θinits;
                                     niter = 10_000, # total number of density evaluations
                                     nburnin = 0);

# This looks a bit ugly. It just converts the result into
# a `MCMCChains.Chains` object for plotting.
X = permutedims(
    cat((hcat(samples[i]...) for i in 1:n_walkers)..., dims=3),
    [2, 1, 3]);
chn = Chains(X, labels(θinit))

# plotting
plot(chn)
corner(chn)

  • How do you define the initial values? Should they be very similar or very different?

  • What is the influence of the number of walkers?

Python

Population based samplers are implemented in the package emcee.

import emcee

## note: moves must be passed as a keyword argument
sampler = emcee.EnsembleSampler(nwalkers, ndim, logposterior_monod, moves=moves)

  • What is the influence of the number of walkers?

  • Which method works better in this case?

4. Posterior Predictions ★

In many cases we want to make predictions for new model inputs. Let’s say we have observed the data \((y, x)\) and used it to infer the posterior \(p(\theta | y, x)\). We would now like to make predictions given a new input \(x^*\) using the learned posterior distribution:

\[ p(Y^* | x^*, y, x) = \int p(Y^* | x^*, \theta) \, p(\theta | y, x) \,\text{d}\theta \]

  • For the monod model, what is \(p(Y^* | x^*, \theta)\)?

  • Produce predictions for the monod model for \(C = \{10, 12, 14, 16\}\) using a posterior sample from one of the previous exercises. We only need samples from the predictive distribution. How can you do this without solving the integral analytically? (See the sketch after this list.)

  • Plot the result with 90% prediction interval.

  • How much do you trust that the interval is correct? What assumptions did you make?
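A minimal sketch in R of the sampling-based approach: for every posterior draw, evaluate the deterministic model at the new inputs and add observation noise, then summarize the resulting predictive sample. It assumes a posterior sample stored as a matrix post.samples with columns r_max, K and sigma, here taken, as an example, from the modMCMC result of exercise 1 (its pars element contains the sampled parameters):

post.samples <- AMDR$pars          # posterior sample (remember to discard the burn-in)
C.new <- c(10, 12, 14, 16)

## one predictive draw per posterior draw
pred <- sapply(seq_len(nrow(post.samples)), function(i){
    par   <- post.samples[i, ]
    y.det <- model.monod(par, C.new)                    # deterministic part
    rnorm(length(C.new), mean=y.det, sd=par["sigma"])   # add observation noise
})

## 90% prediction interval (and median) for each new concentration
apply(pred, 1, quantile, probs=c(0.05, 0.5, 0.95))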

5. Gradient based samplers ♛

If we are able to compute the gradient of the log density, \(\nabla \log p\), we can use much more efficient sampling methods. For a small number of parameters this may not be relevant; for larger problems (more than 20 dimensions) the differences can be huge.

Julia is particularly well suited for these applications because many libraries for Automatic Differentiation (AD) are available, i.e. methods that compute the gradient of (almost) any Julia function by analyzing its code.

Because there is no equally powerful AD available in R, we can do this exercise only with Julia.

Parameter transformation

Most gradient based samplers assume that all parameters are in \(\mathbb{R}\). If we have a parameter that is only defined on an interval, such as a standard deviation that is never negative, we need to transform the model parameter before sampling. For this we need three ingredients:

  • a function that maps every vector in \(\mathbb{R}^n\) to our “normal” model parameter space,

  • the inverse of this function,

  • the determinant of the Jacobian of this function.

The package TransformVariables helps us with these transformations. What we need in the end is a function that evaluates the posterior for any vector in \(\mathbb{R}^n\).

using TransformVariables

# define the 'legal' parameter space: none of the parameters can be negative
# due to the lognormal priors
trans = as((r_max = asℝ₊, K = asℝ₊, sigma = asℝ₊))

We can now sample in \(\mathbb{R}^n\) and later transform the samples to the model parameter space with:

TransformVariables.transform(trans, [-1,-1,-1]) # -> (r_max=0.367, K=0.367, sigma=0.367)

Automatic Differentation

The last ingredient we need is a function that computes the gradient of the log posterior with respect to the transformed parameters in \(\mathbb{R}^n\). To do so we need to differentiate through our model, the likelihood, the prior, the parameter transformation, and the determinant of the Jacobian. Needless to say, even for our very simple model this would be very tedious to do manually!

Instead we use Automatic Differentiation (AD). Julia has multiple packages for AD that make different trade-offs. We use ForwardDiff, which is well suited for problems with a small number of parameters.

AD requires all your model code to be implemented in pure Julia! Otherwise there are few restrictions. For example, you can compute the gradient of the growth model even though it uses an advanced adaptive ODE solver internally.

using TransformedLogDensities: TransformedLogDensity
using LogDensityProblemsAD: ADgradient

# Define an object that is compatible with the LogDensityProblems.jl interface.
# It requires a parameter transformation and a function that computes the
# log probability density.
# This interface enables compatibility with different samplers.
lp = TransformedLogDensity(trans, θ -> logposterior_monod(ComponentVector(θ)))

lp = ADgradient(:ForwardDiff, lp) # define the AD framework to use

Hamiltonian Monte Carlo

Hamiltonian Monte Carlo (HMC) is one of the most powerful methods to sample in high dimensions. The “No-U-Turn Sampler” (NUTS) is a popular variant; it is used, for example, in Stan.

The package DynamicHMC.jl provides a robust implementation.

HMC samplers need many density and gradient evaluations to produce a single proposal. However, the acceptance rate of an HMC sampler should be close to one. We can run the warm-up and sampling with mcmc_with_warmup like this:

import Random
using DynamicHMC: mcmc_with_warmup, ProgressMeterReport
using DynamicHMC.Diagnostics: summarize_tree_statistics

# run for 100 samples
par_init = [-1.0, -1.0, -1.0];   # in ℝⁿ

results = mcmc_with_warmup(Random.GLOBAL_RNG, lp, 100;
                           initialization = (q = par_init, ),
                           reporter = ProgressMeterReport());

The samples we get are in \(\mathbb{R}^n\). Before we have a look, let’s transform them to the “normal” parameter space:

# back-transform samples
_samples = [TransformVariables.transform(trans, s)
            for s in eachcol(results.posterior_matrix)];

samples = vcat((hcat(i...) for i in _samples)...);
chn = MCMCChains.Chains(samples,  [:r_max, :K, :sigma])

plot(chn)

BarkerMCMC

Livingstone and Zanella (2021) proposed a comparatively simple MCMC algorithm that uses the gradient of the log density to adapt the jump distribution. It is a promising alternative to HMC in cases where the number of parameters is not very high, or where the gradient is expected to be noisy (which is often the case if a model uses adaptive ODE solvers).

BarkerMCMC.jl implements it, including an adaptation that targets a given acceptance rate.

using BarkerMCMC: barker_mcmc

par_init = [-1.0, -1.0, -1.0]   # note, this is in ℝⁿ

# see `?barker_mcmc` for all options
res = barker_mcmc(lp,
                  par_init;
                  n_iter = 1000,
                  target_acceptance_rate=0.4);

res.samples                     # these are the samples in ℝⁿ!
res.log_p                       # the corresponding log density values

The samples we get are in \(\mathbb{R}^n\). Before we have a look, let’s transform them to the “normal” parameter space:

using MCMCChains
using StatsPlots

# Transform the samples to the "normal" space and convert to `Chains`
_samples = [TransformVariables.transform(trans, s)
            for s in eachrow(res.samples)];
samples = vcat((hcat(i...) for i in _samples)...);
chn = Chains(samples,  [:r_max, :K, :sigma])

plot(chn)
corner(chn)