Gaussian Mixture Model and EM Algorithm

Huarui Zhou

$ \newcommand{\ud}{\mathrm{d} } \newcommand{\calX}{\mathcal{X}} \newcommand{\ve}{\varepsilon} \newcommand{\N}{\mathcal{N}} \newcommand{\bX}{\mathbf{X}} \newcommand{\bA}{\mathbf{A}} \newcommand{\bb}{\mathbf{b}} \newcommand{\bY}{\mathbf{Y}} \newcommand{\by}{\mathbf{y}} \newcommand{\bx}{\mathbf{x}} \newcommand{\bI}{\mathbf{I}} \newcommand{\bbeta}{\pmb{\beta}} \newcommand{\bve}{\pmb{\varepsilon}} \newcommand{\bzero}{\pmb{0}} \newcommand{\ds}{\displaystyle} \newcommand{\tL}{\tilde{L}} \newcommand{\pt}{\partial} \newcommand{\E}{\mathbb{E}} \newcommand{\P}{\mathbb{P}} \newcommand{\R}{\mathbb{R}} \newcommand{\Z}{\mathbb{Z}} \newcommand{\Cov}{\mathrm{Cov}} \newcommand{\Var}{\mathrm{Var}} \newcommand{\raw}{\rightarrow} \newcommand{\gt}{>} \newcommand{\lt}{<} \newcommand{\sm}{\setminus} \newcommand{\ud}{\mathrm{d}} \newcommand{\chr}{\mathbb{1}} $

1. GMM for complete data

The Gaussian mixture model (GMM) hypothesizes that the datapoints follow a mixture of Gaussian distributions. Suppose there are $n$ datapoints and each datapoint follows a mixture of $K$ Gaussians. Specifically, there are two sequences of independent random variables, $X_1,X_2,\cdots,X_n$ and $Y_1,Y_2,\cdots,Y_n$, where each $Y_i$ has the distribution \[\P(Y_i = k) = w_{k},\quad k = 1,2,\cdots,K,\] \[\sum_{k=1}^K w_k = 1,\] and given $Y_i=k$, each $X_i$ has a Gaussian distribution with mean $\mu_k$ and variance $\sigma_k^2$, i.e. \[X_i\,|\, Y_i=k\quad \sim \quad \N(\mu_k,\sigma_k^2).\] Immediately, the density of $X_i$ is \[f_{X_i}(x) = \sum^K_{k=1}f_k(x)\P(Y_i=k) = \sum^K_{k=1}w_kf_k(x),\] where $f_k$ is the density of $\N(\mu_k,\sigma_k^2)$; in other words, $X_i$ has a Gaussian mixture distribution. The joint density of $(X_i,Y_i)$ is \[\begin{split}f_{X_iY_i}(x,y) &= \frac{\pt \P(X_i\leq x,Y_i=y)}{\pt x} \\ &= \frac{\pt \P(X_i\leq x|Y_i=y)\P(Y_i=y)}{\pt x} \\ &= w_{y}f_y(x) \\ &= \prod^K_{k=1} (w_{k}f_k(x))^{\chr(y=k)}.\end{split}\]
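
To make the generative story concrete, here is a minimal NumPy sketch that draws the latent labels $Y_i$ first and then the observations $X_i$, exactly as in the two-stage description above. The parameter values are made up for illustration, and the component labels are 0-indexed in code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up parameters for a K = 3 component mixture (illustration only).
w = np.array([0.5, 0.3, 0.2])        # mixing weights w_k, sum to 1
mu = np.array([-2.0, 0.0, 3.0])      # component means mu_k
sigma = np.array([0.5, 1.0, 1.5])    # component standard deviations sigma_k

n = 1000
# Step 1: draw the latent labels Y_i ~ Categorical(w)  (labels 0, ..., K-1 here).
y = rng.choice(len(w), size=n, p=w)
# Step 2: draw X_i | Y_i = k ~ N(mu_k, sigma_k^2).
x = rng.normal(loc=mu[y], scale=sigma[y])

# The marginal density of each X_i is the mixture sum_k w_k f_k(t).
def mixture_pdf(t, w, mu, sigma):
    comps = np.exp(-(t[:, None] - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))
    return comps @ w
```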

If we know the complete data, i.e. the values of both $X_i$ and $Y_i$, then it is straightforward to obtain the maximum likelihood estimator (MLE) of $\theta = (\mu_k,\sigma_k^2,w_{k})_{k=1}^K$.

Suppose $\bx=\{x_1,\cdots,x_n\}$ and $\by=\{y_1,\cdots,y_n\}$ are the complete data. Then the MLE for $\theta =(\mu_k,\sigma_k^2,w_k)$ is \[\hat{\mu}_k = \frac{\sum^n_{i=1}x_i\chr(y_i=k)}{n_k},\quad \hat{\sigma}^2_k = \frac{\sum^n_{i=1}(x_i-\hat{\mu}_k)^2\chr(y_i=k)}{n_k},\quad \hat{w}_k = \frac{n_k}{n},\] where $n_k = \#\{i:y_i=k\}$.
The log likelihood function for the complete data is \[\begin{split}l(\theta\,|\,\bx,\by)&= \log \prod_{i=1}^n f_{X_iY_i}(x_i,y_i)\\ &=\log \prod_{i=1}^n\prod^K_{k=1} (w_{k}f_k(x_i))^{\chr(y_i=k)}\\ &=\sum_{i=1}^n\sum_{k=1}^K \chr(y_i=k)\log (w_{k}f_k(x_i))\\ &=\sum_{i=1}^n\sum_{k=1}^K \chr(y_i=k)\log w_{k}+\sum_{i=1}^n\sum_{k=1}^K \chr(y_i=k)\log f_k(x_i) \end{split}\] Then, subject to the constraint $\sum^K_{k=1}w_k=1$, we can use the method of Lagrange multipliers to find the estimator that maximizes $l$ (details omitted). \[\tag*{$\blacksquare$}\]
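
As a sanity check, the complete-data MLE above is a single pass over the data. A minimal sketch, assuming NumPy arrays `x` and `y` like those produced in the sampling sketch above, and that every component appears at least once in `y`:

```python
import numpy as np

def complete_data_mle(x, y, K):
    """MLE of (mu_k, sigma_k^2, w_k) when the labels y_i are observed."""
    n = len(x)
    mu_hat, sigma2_hat, w_hat = np.empty(K), np.empty(K), np.empty(K)
    for k in range(K):                      # components are 0-indexed in code
        x_k = x[y == k]                     # datapoints with y_i = k
        n_k = len(x_k)                      # n_k = #{i : y_i = k}
        mu_hat[k] = x_k.mean()
        sigma2_hat[k] = ((x_k - mu_hat[k]) ** 2).mean()
        w_hat[k] = n_k / n
    return mu_hat, sigma2_hat, w_hat
```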

2. GMM for incomplete data and EM algorithm

However, sometimes we only have the incomplete data $x_1,\cdots,x_n$. The log likelihood function then becomes \[l'(\theta\,|\,\bx) = \log \prod^n_{i=1}f_{X_i}(x_i) =\sum^n_{i=1}\log\left(\sum^K_{k=1}w_kf_k(x_i)\right),\] which is hard to maximize directly since the sum sits inside the $\log$ function. The problem can still be solved by the expectation–maximization (EM) algorithm.
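
Although $l'(\theta\,|\,\bx)$ is hard to maximize, it is easy to evaluate. A small sketch, assuming NumPy and SciPy, that computes it stably in log space with `logsumexp`:

```python
import numpy as np
from scipy.special import logsumexp

def incomplete_loglik(x, w, mu, sigma2):
    """l'(theta | x) = sum_i log sum_k w_k f_k(x_i), evaluated in log space."""
    # log(w_k * f_k(x_i)) for every pair (i, k); shape (n, K)
    log_terms = (np.log(w)
                 - 0.5 * np.log(2 * np.pi * sigma2)
                 - (x[:, None] - mu) ** 2 / (2 * sigma2))
    return logsumexp(log_terms, axis=1).sum()
```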


Expectation.

First, we start from an initial estimate of the parameters, $\theta^{(0)}=(\mu_k^{(0)},\sigma_k^{2(0)},w_k^{(0)})$, and compute the conditional expectation of $l(\theta\,|\,\bX,\bY)$ given $\theta^{(0)}$ and the incomplete data $\bx$, \[\begin{split} Q(\theta\,|\,\theta^{(0)},\bx)&=\E(l(\theta\,|\,\bX,\bY)\,|\,\theta^{(0)},\bX=\bx) \\ &= \E(l(\theta\,|\,\bx,\bY)\,|\,\theta^{(0)},\bx)\\ & = \E\left[\sum_{i=1}^n\sum_{k=1}^K \chr(Y_i=k)\log w_{k}+\sum_{i=1}^n\sum_{k=1}^K \chr(Y_i=k)\log f_k(x_i) \,\bigg|\,\theta^{(0)},\bx\right]\\ &=\sum_{i=1}^n\sum_{k=1}^K \E(\chr(Y_i=k)\,|\,\theta^{(0)},\bx)\log w_{k}+\sum_{i=1}^n\sum_{k=1}^K \E(\chr(Y_i=k)\,|\,\theta^{(0)},\bx)\log f_k(x_i)\\ & = \sum_{i=1}^n\sum_{k=1}^K \P(Y_i=k\,|\,\theta^{(0)},\bx)\log w_{k}+\sum_{i=1}^n\sum_{k=1}^K \P(Y_i=k\,|\,\theta^{(0)},\bx)\log f_k(x_i)\\ & = \sum_{i=1}^n\sum_{k=1}^K r_{ik}\log w_{k}+\sum_{i=1}^n\sum_{k=1}^K r_{ik}\log f_k(x_i)\\ \end{split} \] where \[\begin{split} r_{ik} &= \P(Y_i=k\,|\,\theta^{(0)},\bx) \\ &= \P(Y_i=k\,|\,\theta^{(0)},x_i)\\ &=\frac{f(x_i\,|\,Y_i=k,\theta^{(0)})\P(Y_i=k\,|\,\theta^{(0)})}{f(x_i\,|\,\theta^{(0)})} \quad(\text{by Bayes' formula})\\ &=\frac{f(x_i\,|\,Y_i=k,\theta^{(0)})\P(Y_i=k\,|\,\theta^{(0)})}{\sum^K_{j=1}f(x_i\,|\,Y_i=j,\theta^{(0)})\P(Y_i=j\,|\,\theta^{(0)})}\\ & = \frac{f_k(x_i\,|\,\theta^{(0)})w_k^{(0)}}{\sum^K_{j=1}f_j(x_i\,|\,\theta^{(0)})w_j^{(0)}} \end{split}\] and $f_k(\cdot\,|\,\theta^{(0)})$ is the density function of $\N(\mu^{(0)}_k,\sigma_k^{2(0)})$.
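
In code, the expectation step reduces to computing the responsibility matrix $r_{ik}$. A minimal sketch, assuming NumPy/SciPy and one-dimensional data as in the sketches above:

```python
import numpy as np
from scipy.stats import norm

def e_step(x, w, mu, sigma2):
    """Responsibilities r[i, k] = P(Y_i = k | theta, x_i) under the current theta."""
    # Unnormalized responsibilities: w_k * f_k(x_i | theta), shape (n, K)
    dens = w * norm.pdf(x[:, None], loc=mu, scale=np.sqrt(sigma2))
    # Normalize each row so that sum_k r[i, k] = 1 (the denominator in Bayes' formula).
    return dens / dens.sum(axis=1, keepdims=True)
```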


Maximization.

Then we compute the maximizer $\theta^{(1)}=(\mu_k^{(1)},\sigma^{2(1)}_k,w_k^{(1)})$ of $Q(\theta\,|\,\theta^{(0)},\bx)$, which is \[\mu_k^{(1)} = \frac{\sum^{n}_{i=1}r_{ik}x_i}{\sum^n_{i=1}r_{ik}},\quad \sigma^{2(1)}_k = \frac{\sum^{n}_{i=1}r_{ik}(x_i-\mu_k^{(1)})^2}{\sum^n_{i=1}r_{ik}},\quad w^{(1)}_k = \frac{\sum^n_{i=1}r_{ik}}{n}.\] Next, repeat the expectation step to calculate $Q(\theta\,|\,\theta^{(1)},\bx)$ and maximize it again to obtain $\theta^{(2)}$, and so on. Under mild regularity conditions, $\theta^{(r)}$ converges as $r\raw\infty$ to a stationary point of $l'(\theta\,|\,\bx)$, typically a local maximizer (not necessarily the global MLE).
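
The maximization step and the resulting EM loop can be sketched as follows, reusing the `e_step` function above. This is a simplified illustration that runs a fixed number of iterations rather than testing convergence:

```python
import numpy as np

def m_step(x, r):
    """Maximize Q: weighted MLE updates using the responsibilities r (shape (n, K))."""
    n_k = r.sum(axis=0)                                    # effective counts sum_i r_ik
    mu = (r * x[:, None]).sum(axis=0) / n_k
    sigma2 = (r * (x[:, None] - mu) ** 2).sum(axis=0) / n_k
    w = n_k / len(x)
    return mu, sigma2, w

def em(x, mu, sigma2, w, n_iter=100):
    """Alternate E and M steps starting from the initial guess theta^(0)."""
    for _ in range(n_iter):
        r = e_step(x, w, mu, sigma2)        # expectation step (defined above)
        mu, sigma2, w = m_step(x, r)        # maximization step
    return mu, sigma2, w
```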

3. Why EM algorithm works

Why the EM algorithm works can be explained by the following theorem.

The sequence of parameters $\theta^{(0)},\theta^{(1)},\theta^{(2)},\cdots$ produced by the EM algorithm satisfies \[l'(\theta^{(r+1)}\,|\,\bx)\geq l'(\theta^{(r)}\,|\,\bx),\] where $l'(\theta\,|\,\bx)$ is the log likelihood function for the incomplete data $\bx$.
Let $L(\theta\,|\,\bx,\by)$ be the likelihood function for the complete data and $L'(\theta\,|\,\bx)$ be the likelihood function for the incomplete data. First we have \[l'(\theta\,|\,\bx) = \log L'(\theta\,|\,\bx) = \log L(\theta\,|\,\bx,\by) - \log \frac{L(\theta\,|\,\bx,\by)}{L'(\theta\,|\,\bx)}.\tag{1}\] Then \[k(\by\,|\, \theta,\bx) =\frac{L(\theta\,|\,\bx,\by)}{L'(\theta\,|\,\bx)}=\prod^n_{i=1}\frac{f_{X_iY_i}(x_i,y_i)}{f_{X_i}(x_i)}\] is the conditional density function of $\bY$ given $\bX=\bx$ and $\theta$. Take the conditional expectation on both sides of $(1)$ w.r.t. $\bY$ given $\bx$ and $\theta^{(r)}$; then we have \[\begin{split}l'(\theta\,|\,\bx) &= \E[l'(\theta\,|\,\bx)\,|\,\theta^{(r)},\bx]\quad(\text{since $\bY$ does not appear in $l'$})\\ &=\E[\log L(\theta\,|\,\bx,\bY)\,|\,\theta^{(r)},\bx] - \E[\log k(\bY\,|\,\theta,\bx)\,|\,\theta^{(r)},\bx]\\ &=Q(\theta\,|\,\theta^{(r)},\bx) - \E[\log k(\bY\,|\,\theta,\bx)\,|\,\theta^{(r)},\bx]. \end{split}\] Since $\theta^{(r+1)}$ maximizes $Q(\theta\,|\,\theta^{(r)},\bx)$, we have \[Q(\theta^{(r+1)}\,|\,\theta^{(r)},\bx)\geq Q(\theta^{(r)}\,|\,\theta^{(r)},\bx).\tag{2}\] And by Jensen's inequality, since $\log$ is a concave function, for any density function $f(\by)$, we have \[\int\left[ \log\left(\frac{f(\by)}{k(\by)}\right) k(\by)\right]\ud \by\leq \log \int\left[ \frac{f(\by)}{k(\by)}\cdot k(\by)\right]\ud\by =\log 1 = 0,\] so \[\int(\log f)k\,\ud\by\leq \int(\log k)k\,\ud\by,\] or in terms of expectation w.r.t. the distribution $k(\by)$, we have \[\E(\log f(\bY))\leq \E(\log k(\bY)).\] Applying this with $f(\by)=k(\by\,|\,\theta^{(r+1)},\bx)$ and $k(\by)=k(\by\,|\,\theta^{(r)},\bx)$ gives \[\E[\log k(\bY\,|\,\theta^{(r+1)},\bx)\,|\,\theta^{(r)},\bx]\leq \E[\log k(\bY\,|\,\theta^{(r)},\bx)\,|\,\theta^{(r)},\bx].\tag{3}\] Combining $(1)$ (evaluated at $\theta^{(r+1)}$ and $\theta^{(r)}$) with $(2)$ and $(3)$, we can conclude \[l'(\theta^{(r+1)}\,|\,\bx)\geq l'(\theta^{(r)}\,|\,\bx).\tag*{$\blacksquare$}\]
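
The theorem can also be checked numerically: running the sketches above from any starting point, the incomplete-data log likelihood should never decrease between iterations. For example, using `x`, `incomplete_loglik`, `e_step`, and `m_step` as defined above and an arbitrary initial guess:

```python
import numpy as np

# Arbitrary starting point theta^(0) for K = 3 components.
mu_r, sigma2_r, w_r = np.array([-1.0, 0.5, 2.0]), np.ones(3), np.ones(3) / 3

lls = []
for _ in range(50):
    lls.append(incomplete_loglik(x, w_r, mu_r, sigma2_r))
    r = e_step(x, w_r, mu_r, sigma2_r)
    mu_r, sigma2_r, w_r = m_step(x, r)

# l'(theta^(r+1) | x) >= l'(theta^(r) | x) for every r (up to floating-point noise).
assert all(b >= a - 1e-8 for a, b in zip(lls, lls[1:]))
```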