Q: In Section 3.2, Eq. 15, why fit the conditional distribution p(x|x_pi) directly, rather than fitting a mixture of Gaussians to the joint p(x,x_pi) and then conditioning it? We could fit the joint using a nicer algorithm like EM.

A: There are two answers, a technically correct one and an intuitively satisfying one.

Technically correct: Maximum likelihood learning dictates maximizing the sum of log probabilities. The math just demands that we maximize log p(x|x_pi). Maximizing log p(x,x_pi) would not be equivalent to maximum likelihood.

Intuitively satisfying: The reason to do this is that errors are inevitable. If it were possible to fit the TRUE joint distribution p(x,x_pi), conditioning it would also give the true conditional distribution p(x|x_pi), so the question is reasonable. However, we assume that the true distribution is only imperfectly modeled by a mixture of Gaussians. This means that neither p(x,x_pi) nor p(x|x_pi) can exactly represent the true distribution. The joint log likelihood splits as log p(x,x_pi) = log p(x|x_pi) + log p(x_pi), so fitting the joint spends part of the modeling effort on p(x_pi). By maximizing the conditional p(x|x_pi) instead, we can ignore errors in p(x_pi), meaning all modeling effort is spent on what we care about.

One confusing aspect of the intuitively satisfying answer is that our representation of the conditional distribution p(x|x_pi) (Eq. 14) is obtained by taking a joint Gaussian and then conditioning it. At first this may seem to conflict with the explanation above, but it does not. The fact that the equation contains a term p(x_pi) does not mean that the fitting criterion tries to make that term accurate. Presumably, if the log likelihood of the term p(x_pi) were calculated on test data, the score would be very low.
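
To make the relationship between the joint, the conditional, and the term p(x_pi) concrete, here is a minimal numerical sketch. It is not the model from the paper: it assumes a single bivariate Gaussian over scalar x and x_pi (one component rather than a mixture), and the function names, parameter values, and data point are all hypothetical, chosen only for illustration. It conditions the joint in closed form and checks that the joint log density is the sum of the conditional and marginal log densities.

```python
import math

def log_normal(x, mean, var):
    """Log density of a scalar Gaussian N(mean, var) at x."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def log_joint(x, x_pi, mu_x, mu_pi, s_xx, s_pp, s_xp):
    """Log density of a bivariate Gaussian over (x, x_pi)."""
    det = s_xx * s_pp - s_xp ** 2          # determinant of the 2x2 covariance
    dx, dp = x - mu_x, x_pi - mu_pi
    quad = (s_pp * dx ** 2 - 2.0 * s_xp * dx * dp + s_xx * dp ** 2) / det
    return -math.log(2.0 * math.pi) - 0.5 * math.log(det) - 0.5 * quad

def log_conditional(x, x_pi, mu_x, mu_pi, s_xx, s_pp, s_xp):
    """Log density of x given x_pi, obtained by conditioning the joint."""
    cond_mean = mu_x + (s_xp / s_pp) * (x_pi - mu_pi)
    cond_var = s_xx - s_xp ** 2 / s_pp
    return log_normal(x, cond_mean, cond_var)

# Hypothetical parameters and data point, for illustration only.
params = dict(mu_x=0.5, mu_pi=-1.0, s_xx=2.0, s_pp=1.5, s_xp=0.8)
x, x_pi = 1.2, -0.3

joint = log_joint(x, x_pi, **params)
conditional = log_conditional(x, x_pi, **params)
marginal = log_normal(x_pi, params["mu_pi"], params["s_pp"])

# Chain rule: log p(x, x_pi) = log p(x | x_pi) + log p(x_pi).
assert math.isclose(joint, conditional + marginal)
```

A criterion built on the conditional uses only the first term on the right-hand side, so nothing in it rewards the marginal term for scoring well, even though the conditional is computed from the same joint parameters.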