$
\newcommand{\ud}{\mathrm{d} }
\newcommand{\calX}{\mathcal{X}}
\newcommand{\ve}{\varepsilon}
\newcommand{\N}{\mathcal{N}}
\newcommand{\bX}{\mathbf{X}}
\newcommand{\bA}{\mathbf{A}}
\newcommand{\bb}{\mathbf{b}}
\newcommand{\bY}{\mathbf{Y}}
\newcommand{\by}{\mathbf{y}}
\newcommand{\bx}{\mathbf{x}}
\newcommand{\bI}{\mathbf{I}}
\newcommand{\bbeta}{\pmb{\beta}}
\newcommand{\bve}{\pmb{\varepsilon}}
\newcommand{\bzero}{\pmb{0}}
\newcommand{\ds}{\displaystyle}
\newcommand{\tL}{\tilde{L}}
\newcommand{\pt}{\partial}
\newcommand{\E}{\mathbb{E}}
\newcommand{\P}{\mathbb{P}}
\newcommand{\R}{\mathbb{R}}
\newcommand{\Z}{\mathbb{Z}}
\newcommand{\Cov}{\mathrm{Cov}}
\newcommand{\Var}{\mathrm{Var}}
\newcommand{\raw}{\rightarrow}
\newcommand{\gt}{>}
\newcommand{\lt}{<}
\newcommand{\sm}{\setminus}
\newcommand{\chr}{\mathbb{1}}
$
1. GMM for complete data
The Gaussian mixture model (GMM) assumes that the datapoints follow a mixture of Gaussian distributions.
Suppose there are $n$ datapoints, each distributed as a mixture of $K$ Gaussians. Specifically,
there are two series of independent random variables: $X_1,X_2,\cdots,X_n$ and $Y_1,Y_2,\cdots,Y_n$,
where each $Y_i$ has the distribution
\[\P(Y_i = k) = w_{k},\quad k = 1,2,\cdots,K,\]
\[\sum_{k=1}^K w_k = 1,\]
and given $Y_i=k$, each $X_i$ has a Gaussian distribution with mean $\mu_k$ and variance $\sigma_k^2$, i.e.
\[X_i\,|\, Y_i=k\quad \sim \quad \N(\mu_k,\sigma_k^2).\]
Immediately, the density of $X_i$ is
\[f_{X_i}(x) = \sum^K_{k=1}f_k(x)\P(Y_i=k) = \sum^K_{k=1}w_kf_k(x),\]
where $f_k$ is the density of $\N(\mu_k,\sigma_k^2)$;
in other words, $X_i$ has a Gaussian mixture distribution. The joint density of $(X_i,Y_i)$ is
\[\begin{split}f_{X_iY_i}(x,y) &= \frac{\pt \P(X_i\leq x,Y_i=y)}{\pt x} \\
&= \frac{\pt \P(X_i\leq x|Y_i=y)\P(Y_i=y)}{\pt x} \\
&= w_{y}f_y(x) \\
&= \prod^K_{k=1} (w_{k}f_k(x))^{\chr(y=k)}.\end{split}\]
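As a concrete illustration, here is a minimal sketch (in Python with NumPy, using made-up parameter values) of how complete data $(x_i,y_i)$ could be simulated from this model: first draw the label $Y_i$ with probabilities $w_k$, then draw $X_i$ from the corresponding Gaussian.
```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters of a K = 3 component mixture (illustrative values only).
w = np.array([0.5, 0.3, 0.2])        # mixing weights w_k, summing to 1
mu = np.array([-2.0, 0.0, 3.0])      # component means mu_k
sigma = np.array([0.5, 1.0, 1.5])    # component standard deviations sigma_k

n = 1000
# Draw the latent labels Y_i, then X_i | Y_i = k ~ N(mu_k, sigma_k^2).
y = rng.choice(len(w), size=n, p=w)
x = rng.normal(loc=mu[y], scale=sigma[y])
```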
If we know the complete data, i.e. the values of both $X_i$ and $Y_i$,
then it is straightforward to obtain the maximum likelihood estimator (MLE) of $\theta = (\mu_k,\sigma_k^2,w_{k})$.
Suppose $\bx=\{x_1,\cdots,x_n\}$ and $\by=\{y_1,\cdots,y_n\}$ are the complete data; then the MLE for $\theta =(\mu_k,\sigma_k^2,w_k)$ is
\[\hat{\mu}_k = \frac{\sum^n_{i=1}x_i\chr(y_i=k)}{n_k},\quad \hat{\sigma}^2_k = \frac{\sum^n_{i=1}(x_i-\hat{\mu}_k)^2\chr(y_i=k)}{n_k},\quad \hat{w}_k = \frac{n_k}{n},\]
where $n_k = \#\{i:y_i=k\}$.
The log likelihood function for the complete data is
\[\begin{split}l(\theta\,|\,\bx,\by)&= \log \prod_{i=1}^n f_{X_iY_i}(x_i,y_i)\\
&=\log \prod_{i=1}^n\prod^K_{k=1} (w_{k}f_k(x_i))^{\chr(y_i=k)}\\
&=\sum_{i=1}^n\sum_{k=1}^K \chr(y_i=k)\log w_{k}f_k(x_i)\\
&=\sum_{i=1}^n\sum_{k=1}^K \chr(y_i=k)\log w_{k}+\sum_{i=1}^n\sum_{k=1}^K \chr(y_i=k)\log f_k(x_i)
\end{split}\]
then, given the constraint $\sum^K_{k=1}w_k=1$, we can use the method of Lagrange multipliers to find the estimator that maximizes $l$ (details omitted).
\[\tag*{$\blacksquare$}\]
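For reference, the closed-form complete-data MLE above could be computed as in the following sketch, assuming the labels are coded as $0,\dots,K-1$ and that every component appears at least once in $\by$ (the function name is just for illustration).
```python
import numpy as np

def complete_data_mle(x, y, K):
    """Closed-form MLE (mu_k, sigma_k^2, w_k) from complete data (x_i, y_i)."""
    n = len(x)
    mu_hat, var_hat, w_hat = np.empty(K), np.empty(K), np.empty(K)
    for k in range(K):
        mask = (y == k)
        n_k = mask.sum()                                   # n_k = #{i : y_i = k}
        mu_hat[k] = x[mask].mean()                         # sum_i x_i 1(y_i = k) / n_k
        var_hat[k] = ((x[mask] - mu_hat[k]) ** 2).mean()   # within-component variance
        w_hat[k] = n_k / n                                 # hat w_k = n_k / n
    return mu_hat, var_hat, w_hat
```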
2. GMM for incomplete data and EM algorithm
However, sometimes we only observe the incomplete data $x_1,\cdots,x_n$. The log likelihood function then becomes
\[l'(\theta\,|\,\bx) = \log \prod^n_{i=1}f_{X_i}(x_i) =\sum^n_{i=1}\log\left(\sum^K_{k=1}w_kf_k(x_i)\right),\]
which is hard to maximize directly since the sum sits inside the $\log$ function.
The problem can still be solved by the expectation-maximization (EM) algorithm.
Expectation.
First, we start from an initial estimation of the parameters,
$\theta^{(0)}=(\mu_k^{(0)},\sigma_k^{2(0)},w_k^{(0)})$,
and compute the conditional expectation of $l(\theta\,|\,\bX,\bY)$
given $\theta^{(0)}$ and incomplete data $\bx$,
\[\begin{split} Q(\theta\,|\,\theta^{(0)},\bx)&=\E(l(\theta\,|\,\bX,\bY)\,|\,\theta^{(0)},\bX=\bx) \\
&= \E(l(\theta\,|\,\bx,\bY)\,|\,\theta^{(0)},\bx)\\
& = \E\left[\sum_{i=1}^n\sum_{k=1}^K \chr(Y_i=k)\log w_{k}+\sum_{i=1}^n\sum_{k=1}^K \chr(Y_i=k)\log f_k(x_i) \,\bigg|\,\theta^{(0)},\bx\right]\\
&=\sum_{i=1}^n\sum_{k=1}^K \E(\chr(Y_i=k)\,|\,\theta^{(0)},\bx)\log w_{k}+\sum_{i=1}^n\sum_{k=1}^K \E(\chr(Y_i=k)\,|\,\theta^{(0)},\bx)\log f_k(x_i)\\
& = \sum_{i=1}^n\sum_{k=1}^K \P(Y_i=k\,|\,\theta^{(0)},\bx)\log w_{k}+\sum_{i=1}^n\sum_{k=1}^K \P(Y_i=k\,|\,\theta^{(0)},\bx)\log f_k(x_i)\\
& = \sum_{i=1}^n\sum_{k=1}^K r_{ik}\log w_{k}+\sum_{i=1}^n\sum_{k=1}^K r_{ik}\log f_k(x_i)\\
\end{split} \]
where
\[\begin{split}
r_{ik} &= \P(Y_i=k\,|\,\theta^{(0)},\bx) \\
&= \P(Y_i=k\,|\,\theta^{(0)},x_i)\\
&=\frac{f(x_i\,|\,Y_i=k,\theta^{(0)})\P(Y_i=k\,|\,\theta^{(0)})}{f(x_i\,|\,\theta^{(0)})} \quad(\text{by Bayes' formula})\\
&=\frac{f(x_i\,|\,Y_i=k,\theta^{(0)})\P(Y_i=k\,|\,\theta^{(0)})}{\sum^K_{j=1}f(x_i\,|\,Y_i=j,\theta^{(0)})\P(Y_i=j\,|\,\theta^{(0)})}\\
& = \frac{f_k(x_i\,|\,\theta^{(0)})w_k^{(0)}}{\sum^K_{j=1}f_j(x_i\,|\,\theta^{(0)})w_j^{(0)}}
\end{split}\]
and $f_k(\cdot\,|\,\theta^{(0)})$ is the density function of $\N(\mu^{(0)}_k,\sigma_k^{2(0)})$.
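As a sketch of this expectation step, the responsibilities $r_{ik}$ could be computed as follows (using SciPy's normal density; the helper name `e_step` is only for illustration).
```python
import numpy as np
from scipy.stats import norm

def e_step(x, mu, var, w):
    """Responsibilities r_{ik} = P(Y_i = k | theta^(0), x_i) under the current parameters."""
    # dens[i, k] = w_k * f_k(x_i | theta^(0))
    dens = w * norm.pdf(x[:, None], loc=mu, scale=np.sqrt(var))
    return dens / dens.sum(axis=1, keepdims=True)   # normalize over k for each i
```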
Maximization.
Then we compute $\theta^{(1)}=(\mu_k^{(1)},\sigma^{2(1)}_k,w_k^{(1)})$, the maximizer of $Q(\theta\,|\,\theta^{(0)},\bx)$, which is
\[\mu_k^{(1)} = \frac{\sum^{n}_{i=1}r_{ik}x_i}{\sum^n_{i=1}r_{ik}},\quad \sigma^{2(1)}_k = \frac{\sum^{n}_{i=1}r_{ik}(x_i-\mu_k^{(1)})^2}{\sum^n_{i=1}r_{ik}},\quad w^{(1)}_k = \frac{\sum^n_{i=1}r_{ik}}{n}.\]
Next, repeat the expectation step to calculate $Q(\theta\,|\,\theta^{(1)},\bx)$ and maximize it again to obtain $\theta^{(2)}$, and so on.
As $r\raw\infty$, $\theta^{(r)}$ converges (under mild regularity conditions) to a local maximizer of $l'(\theta\,|\,\bx)$, which is taken as the MLE for $\theta$.
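Putting the two steps together, one possible (untuned) implementation of the whole iteration might look like the sketch below; the initialization and the fixed number of iterations are arbitrary choices made here for illustration, not part of the algorithm itself.
```python
import numpy as np
from scipy.stats import norm

def em_gmm(x, K, n_iter=100, seed=0):
    """A sketch of EM for a univariate K-component Gaussian mixture."""
    rng = np.random.default_rng(seed)
    n = len(x)
    # Crude initialization theta^(0): random data points as means, pooled variance, uniform weights.
    mu = rng.choice(x, size=K, replace=False)
    var = np.full(K, x.var())
    w = np.full(K, 1.0 / K)
    loglik = []
    for _ in range(n_iter):
        # E-step: responsibilities r_{ik} under the current theta^(r).
        dens = w * norm.pdf(x[:, None], loc=mu, scale=np.sqrt(var))   # dens[i, k] = w_k f_k(x_i)
        loglik.append(np.log(dens.sum(axis=1)).sum())                 # l'(theta^(r) | x)
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: closed-form maximizer of Q(theta | theta^(r), x).
        n_k = r.sum(axis=0)
        w = n_k / n
        mu = (r * x[:, None]).sum(axis=0) / n_k
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / n_k
    return mu, var, w, np.array(loglik)
```
The array `loglik` records $l'(\theta^{(r)}\,|\,\bx)$ along the way, which connects to the monotonicity result discussed next.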
3. Why EM algorithm works
The reason why the EM algorithm works can be explained by the following theorem.
The parameter sequence $\theta^{(0)},\theta^{(1)},\theta^{(2)},\cdots$ produced by the EM algorithm satisfies
\[l'(\theta^{(r+1)}\,|\,\bx)\geq l'(\theta^{(r)}\,|\,\bx),\]
where $l'(\theta\,|\,\bx)$ is the log likelihood function for incomplete data $\bx$.
Proof. Let $L(\theta\,|\,\bx,\by)$ be the likelihood function for the complete data
and $L'(\theta\,|\,\bx)$ be the likelihood function for the incomplete data. First we have
\[l'(\theta\,|\,\bx) = \log L'(\theta\,|\,\bx) = \log L(\theta\,|\,\bx,\by)
- \log \frac{L(\theta\,|\,\bx,\by)}{L'(\theta\,|\,\bx)}.\tag{1}\]
Then \[k(\by\,|\, \theta,\bx) =\frac{L(\theta\,|\,\bx,\by)}{L'(\theta\,|\,\bx)}=\prod^n_{i=1}\frac{f_{X_iY_i}(x_i,y_i)}{f_{X_i}(x_i)}\]
is the conditional density function of $\bY$ given $\bX=\bx$ and $\theta$.
Take the conditional expectation on both sides of $(1)$ with respect to $\bY$, given $\bx$ and $\theta^{(r)}$; then we have
\[\begin{split}l'(\theta\,|\,\bx) &= \E[l'(\theta\,|\,\bx)\,|\,\theta^{(r)},\bx]\quad(\text{since $\bY$ does not appear in $l'$})\\
&=\E[\log L(\theta\,|\,\bx,\bY)\,|\,\theta^{(r)},\bx]
- \E[\log k(\bY\,|\,\theta,\bx)\,|\,\theta^{(r)},\bx]\\
&=Q(\theta\,|\,\theta^{(r)},\bx) - \E[\log k(\bY\,|\,\theta,\bx)\,|\,\theta^{(r)},\bx].
\end{split}\]
Since $\theta^{(r+1)}$ maximizes $Q(\theta\,|\,\theta^{(r)},\bx)$, we have
\[Q(\theta^{(r+1)}\,|\,\theta^{(r)},\bx)\geq Q(\theta^{(r)}\,|\,\theta^{(r)},\bx).\tag{2}\]
And by Jensen's inequality, since $\log$ is a concave function, for any density function $f(\by)$ we have
\[\int\left[ \log(\frac{f(\by)}{k(\by)}) k(\by)\right]\ud \by\leq \log \int\left[ \frac{f(\by)}{k(\by)}\cdot k(\by)\right]\ud\by =\log 1 = 0,\]
so\[\int(\log f)k\,\ud\by\leq \int(\log k)k\,\ud\by,\]
or in terms of expectation w.r.t. the distribution $k(\by)$, we have
\[\E(\log f(\bY))\leq \E(\log k(\bY)),\]
therefore, taking $f(\by)=k(\by\,|\,\theta^{(r+1)},\bx)$ and $k(\by)=k(\by\,|\,\theta^{(r)},\bx)$,
\[\E[\log k(\bY\,|\,\theta^{(r+1)},\bx)\,|\,\theta^{(r)},\bx]\leq \E[\log k(\bY\,|\,\theta^{(r)},\bx)\,|\,\theta^{(r)},\bx].\tag{3}\]
Combining $(2)$ and $(3)$, we can conclude
\[l'(\theta^{(r+1)}\,|\,\bx)\geq l'(\theta^{(r)}\,|\,\bx).\tag*{$\blacksquare$}\]
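As an empirical sanity check of the theorem, one can run the `em_gmm` sketch above on the simulated data `x` from the first sketch and verify that the recorded incomplete-data log likelihoods never decrease.
```python
import numpy as np

# Assumes x (simulated data) and em_gmm (EM sketch) from the earlier code blocks.
mu_hat, var_hat, w_hat, loglik = em_gmm(x, K=3)
assert np.all(np.diff(loglik) >= -1e-8)   # nondecreasing up to floating-point error
```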