Why is inference in Markov Random Fields hard?

I'm studying Markov Random Fields, and, apparently, inference in MRF is hard / computationally expensive. Specifically, Kevin Murphy's book Machine Learning: A Probabilistic Perspective says the following:

"In the first term, we fix y to its observed values; this is sometimes called the clamped term. In the second term, y is free; this is sometimes called the unclamped term or contrastive term. Note that computing the unclamped term requires inference in the model, and this must be done once per gradient step. This makes training undirected graphical models harder than training directed graphical models."

Why are we performing inference here? I understand that we're summing over all y's, which seems expensive, but I don't see where we're actually estimating any parameters. Wikipedia also talks about inference, but only in terms of calculating the conditional distribution and needing to sum over all non-specified nodes. That's not what we're doing here, is it?

Alternatively, does anyone have good intuition on why inference in MRFs is difficult?

Sources: Chapter 19 of ML:PP: https://www.cs.ubc.ca/~murphyk/MLbook/pml-print3-ch19.pdf

[Image: screenshot of the specific section from Chapter 19]


There is 1 best solution below


When training your MRF, you want to estimate its parameters, \theta.

In order to do this, you can differentiate your loss function (Equation 19.38) with respect to \theta, set it to 0, and solve for \theta.

If you do this, though, the resulting equation cannot be solved analytically for \theta. You can, however, minimise Equation 19.38 by gradient descent. Since the loss function is convex, gradient descent is guaranteed to reach the globally optimal solution when it converges.
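To see why no closed form exists, it can help to write the objective out. The following is only a sketch, assuming the usual log-linear form of an MRF with clique features \phi_c and N training examples y_i. The notation is an assumption and may differ slightly from the book's, and the loss is taken to be the negative of the scaled log-likelihood \ell.

```latex
\begin{align}
p(y \mid \theta) &= \frac{1}{Z(\theta)} \exp\Big(\sum_c \theta_c^\top \phi_c(y)\Big),
\qquad
Z(\theta) = \sum_{y} \exp\Big(\sum_c \theta_c^\top \phi_c(y)\Big) \\
\ell(\theta) &= \frac{1}{N} \sum_{i=1}^{N} \sum_c \theta_c^\top \phi_c(y_i) - \log Z(\theta) \\
\frac{\partial \ell}{\partial \theta_c} &= \frac{1}{N} \sum_{i=1}^{N} \phi_c(y_i)
  - \mathbb{E}_{p(y \mid \theta)}\big[\phi_c(y)\big]
\end{align}
```

Setting the gradient to zero asks for the model's expected features to match the empirical ones. But \theta enters that expectation through Z(\theta), a sum over every configuration of y, so the condition generally cannot be rearranged into a closed-form expression for \theta.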

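Equation 19.41 is the actual gradient which you need to compute in order to do gradient descent. The first term is easy (and computationally cheap) to compute because y is fixed to its observed values, so you only evaluate the potential functions on the training data. The second term, however, requires inference. Here y is not fixed: the term is the expected value of the potential functions under the model's current distribution p(y | \theta). To compute it you need the model's marginals over y, which in general means summing over every possible configuration of y. The number of configurations grows exponentially with the number of nodes, and because the distribution changes whenever \theta changes, the computation has to be redone at every gradient step. That is the "inference" Murphy refers to: computing a distribution over the free variables y, not estimating parameters.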
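As a rough illustration of the two terms of the gradient, here is a minimal brute-force sketch on a toy binary chain. Everything in it is an assumption made for the example rather than something taken from the book: the graph, the "neighbours agree" feature, the data, the step size, and the function names. The second term is written as an expectation of the features under the model, which is the standard form for log-linear models.

```python
import itertools
import numpy as np

def features(y, edges):
    # Hypothetical feature vector phi(y): one indicator per edge, 1 if the endpoints agree.
    return np.array([1.0 if y[i] == y[j] else 0.0 for i, j in edges])

def clamped_term(data, edges):
    # y is fixed to the observed values: just average phi over the training set.
    return np.mean([features(y, edges) for y in data], axis=0)

def unclamped_term(theta, n_nodes, edges):
    # y is free: E_{p(y|theta)}[phi(y)], computed by enumerating all 2**n_nodes configurations.
    configs = list(itertools.product([0, 1], repeat=n_nodes))
    phis = np.array([features(y, edges) for y in configs])
    scores = phis @ theta                  # unnormalised log-probabilities
    weights = np.exp(scores - scores.max())
    probs = weights / weights.sum()        # normalising is where Z(theta) enters
    return probs @ phis

def gradient(theta, data, n_nodes, edges):
    # Gradient of the scaled log-likelihood: clamped term minus unclamped term.
    return clamped_term(data, edges) - unclamped_term(theta, n_nodes, edges)

# Toy problem: a 4-node chain and four made-up observations.
n_nodes = 4
edges = [(0, 1), (1, 2), (2, 3)]
data = [(0, 0, 0, 0), (1, 1, 0, 0), (0, 1, 1, 0), (1, 1, 1, 1)]

theta = np.zeros(len(edges))
for _ in range(200):
    # Gradient ascent on the log-likelihood; inference is rerun at every step.
    theta += 1.0 * gradient(theta, data, n_nodes, edges)

print("theta:", theta)
print("configurations enumerated per step:", 2 ** n_nodes)
```

With 4 binary nodes the enumeration is trivial, but the same loop over configurations has 2**n terms for n nodes, which is why exact inference becomes the bottleneck on larger or densely connected graphs. On chains and trees, belief propagation can compute the same expectations efficiently; in general graphs one usually falls back on approximations.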