Pull plot from toys in zFit


First of all I would like to say that I do have conceptual difficulties with this topic, so please correct me if my intention does not make sense.

I'm trying to validate the model I use to extract the signal yield from a distribution. For simplicity, assume there is only a signal distribution and no background. The model is a plain Gaussian, and I created an extended PDF from it.

My idea is to generate toy samples from that PDF, changing only the number of events, i.e. only the scaling. Then I would fit each toy sample and compare the generated number of events to the fitted signal yield by computing

pull = (N_generated - N_fit) / sigma_Nfit

What I don't understand is how I set and get the generated number. In the generation, I want to set the number of events randomly (I guess Poisson distributed?) and keep all other model parameters fixed.
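
If the generated yield is supposed to fluctuate the way an extended likelihood assumes, a Poisson draw around the nominal yield is the usual choice. A minimal sketch with numpy (the names `n_sig_nominal` and `rng` are mine, and the nominal value is a placeholder):

```python
import numpy as np

rng = np.random.default_rng(42)
n_sig_nominal = 10_000  # placeholder nominal signal yield

# number of events actually generated for one toy:
n_generated = rng.poisson(n_sig_nominal)
```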

In the end I have signal+background distributions and want to:

  1. vary number of signal + keep number of background fixed
  2. fix signal + vary background
  3. vary fraction in n_total = n_signal + fraction * n_background
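
The three scenarios only decide which yield is drawn per toy and which stays at its nominal value. A rough numpy sketch of the per-toy inputs (all names and nominal values are mine; the spread for the fraction in scenario 3 is purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
ntoys = 100
n_sig_nom, n_bkg_nom = 10_000, 5_000  # placeholder nominal yields

# 1. vary the signal yield, keep the background fixed
n_sig_1 = rng.poisson(n_sig_nom, size=ntoys)
n_bkg_1 = np.full(ntoys, n_bkg_nom)

# 2. fix the signal, vary the background
n_sig_2 = np.full(ntoys, n_sig_nom)
n_bkg_2 = rng.poisson(n_bkg_nom, size=ntoys)

# 3. vary the fraction in n_total = n_signal + fraction * n_background
fraction = rng.uniform(0.3, 0.7, size=ntoys)  # illustrative range only
n_total_3 = n_sig_nom + fraction * n_bkg_nom
```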

The following code is taken from the zfit-tutorials and updated to have an extended model (also found here: https://github.com/holyschmoly/zfit_toymc):

import zfit

obs = zfit.Space('x', limits=(-10, 10))  # observable space; limits chosen arbitrarily here
mu = zfit.Parameter('mu', 0, -1, 1)
sigma = zfit.Parameter('sigma', 1, 0.5, 1.5)
model = zfit.pdf.Gauss(obs=obs, mu=mu, sigma=sigma)

# do i need this range here?
nsig_toy = 1e4
nsig_range = 0.1
n_sig = zfit.Parameter('n_sig', nsig_toy, nsig_toy*(1-nsig_range), nsig_toy*(1+nsig_range))
model = model.create_extended(n_sig)

# vary only n_sig
sampler = model.create_sampler(fixed_params=[mu, sigma])
nll = zfit.loss.ExtendedUnbinnedNLL(model, sampler)

minimizer = zfit.minimize.Minuit(
    strategy=zfit.minimize.DefaultToyStrategy(),
    verbosity=0,
    tolerance=1e-3,
    use_minuit_grad=True)
params = nll.get_params()

fit_results = []
ntoys = 10

while len(fit_results) < ntoys:

    # What does really happen here?
    sampler.resample()

    # I think this is probably not what I want but I don't understand why
    # this is needed in the first place.
    # I want something like: vary n_sig randomly, keep mu and sigma fixed
    for param in params:
        param.randomize()

    result = minimizer.minimize(nll)

    if result.converged:
        result.hesse()
        fit_results.append(result)
    
# Now I want something like this:
# pull =  (N_generated - N_fit ) / sigma_fit
# for each fit_result.
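
The pull computation itself needs nothing from zfit once the per-toy numbers are collected; from each converged result one would read off the fitted yield and its Hesse error (in zfit these live in `result.params`). As a self-contained illustration, here is the computation with placeholder arrays standing in for the toy results (all names and values below are mine, not from the fit loop):

```python
import numpy as np

rng = np.random.default_rng(7)
ntoys = 1000

# placeholder toy results; in the real study these come from fit_results
n_generated = rng.poisson(10_000, size=ntoys).astype(float)
sigma_fit = np.sqrt(n_generated)            # roughly sqrt(N) for a pure yield fit
n_fit = rng.normal(n_generated, sigma_fit)  # fake, unbiased "fit results"

pulls = (n_generated - n_fit) / sigma_fit

# for an unbiased fit with correct errors: mean close to 0, width close to 1
print(pulls.mean(), pulls.std())
```

If the fit is unbiased and the errors are estimated correctly, the pull distribution should be a standard Gaussian; a shifted mean indicates a bias, a width away from 1 indicates over- or underestimated errors.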
Answer:

That looks like a reasonable thing to do. The question is whether you really want to vary only the signal or only the background yield, but you are of course free to do so.

What you need for that, though, is something more sophisticated than the sampler.

First things first: the resample method does sample again from the PDF, but always from the whole PDF, not from its individual components.

What you need instead is a for-loop in which you use the sample method of each component PDF to generate the individual parts. Then you stitch them together, build a loss from the combined data, and hand that loss to the minimizer. Let me sketch it:

for i in range(ntoys):
    # calculate how many background and signal samples you want, e.g. with numpy
    sample_bkg = bkg_pdf.sample(n=...)
    sample_sig = sig_pdf.sample(n=...)
    sample = tf.concat([sample_sig.value(), sample_bkg.value()], axis=0)
    data = zfit.Data.from_tensor(tensor=sample, obs=obs)
    nll = zfit.loss.ExtendedUnbinnedNLL(model, data)
    result = minimizer.minimize(nll)
    # use the result to get the fitted parameter values
    nsig = result.params[sig_yield]['value']  # assuming sig_yield is the zfit.Parameter for the signal yield
    # calculate the pull, add it to a list of pulls

You may want to add a zfit.run.clear_graph_cache() at the beginning of the loop, as otherwise the memory keeps growing. Alternatively, run everything in eager mode by putting zfit.run.set_graph_mode(False) at the very beginning of the script. Which option is faster depends on the complexity of the fit: for simple, fast fits, eager mode (graph mode set to False) is expected to be faster; for more complicated fits, the first method, clearing the cache, should perform better.