I want to implement a basic model of capture and recapture in PyMC3 (you capture 100 animals and mark them, then you liberate them and recapture 100 of them after they have mixed and annotate how many are marked). This is my code:
import numpy as np
import pymc3 as pm
import arviz as az
# Datos:
K = 100 #marked animals in first round
n = 100 #captured animals in second round
obs = 10 #observed marked in second round
with pm.Model() as my_model:
N = pm.DiscreteUniform("N", lower=K, upper=10000)
likelihood = pm.HyperGeometric('likelihood', N=N, k=K, n=n, observed=obs)
trace = pm.sample(10000)
print(pm.summary(trace))
print(trace['N'])
ppc = pm.sample_posterior_predictive(trace, 100, var_names=["N"])
data_ppc = az.from_pymc3(trace=trace, posterior_predictive=ppc) #create inference data
az.plot_ppc(data_ppc, figsize=(12, 6))
But I obtain the error in plot_ppc
that 'var names: "[\'likelihood\'] are not present" in dataset'
. Also the warning posterior predictive variable N's shape not compatible with number of chains and draws. This can mean that some draws or even whole chains are not represented.
What is happening and what can I do to obtain a posterior predictive check plot?
The root of all the problems is in this line
ppc = pm.sample_posterior_predictive(trace, 100, var_names=["N"])
.By using
var_names=["N"]
you are indicating PyMC to "sample" only the variableN
which is actually a latent variable that was sampled while sampling the posterior in thepm.sample
call. Doing this in thepm.sample_posterior_predictive
call is indicating PyMC to not sample the observed variable (likelihood
in this case) and to just copy the samples forN
to the posterior predictive too. You will see thatdata_ppc
is anInferenceData
object with multiple groups,N
is already in the posterior group (like it is in thetrace
object).By using
100
(akasamples=100
as a positional argument) you are indicating PyMC to draw posterior predictive samples only for the first 100 draws of the first chain. This is a bad idea, so ArviZ prints a warning when converting toInferenceData
. You should generate one posterior predictive sample per posterior sample, only generating samples for a subset of the posterior if posterior predictive sampling were very slow.My recommendation, which also applies as a general rule, is to trust PyMC defaults unless you have a reason not to or want things to give the same result with multiple versions. We update the defaults from time to time to try and keep them coherent with best practices which are updated and improve over the time. You should therefore do:
ppc = pm.sample_posterior_predictive(trace)
. PyMC will default to sampling thelikelihood
variable only and to generate one sample per posterior draw.