Huarui Zhou
Suppose the random variable $Y$ is a linear function of $p$ predictor variables $X_1,X_2,\cdots,X_p$ (assumed fixed) plus an error term $\ve$, i.e. \[Y=\beta_0+\beta_1X_1+\cdots+\beta_pX_p+\ve.\tag{1}\] When we randomly take $n$ samples from the population, eq (1) can be written as \[Y_i = \beta_0+\beta_1X_{i1}+\cdots+\beta_pX_{ip}+\ve_i =\sum^p_{j=0} \beta_jX_{ij} +\ve_i, \quad i = 1,2,\cdots,n\tag{2}\] where we denote $X_{i0}\equiv 1$ and assume $\ve_i \;\;i.i.d.\sim \N(0,\sigma^2)$, thus \[Y_i\sim \N(\sum^p_{j=0} \beta_jX_{ij},\sigma^2).\] In matrix form, we have \[\bY = \bX\bbeta+\bve,\] where $\bY = (Y_i)_{n\times 1}$, $\bX = (X_{ij})_{n\times (p+1)}$, $\bbeta = (\beta_j)_{(p+1)\times 1}$, $\bve = (\ve_i)_{n\times 1}$. Let the observation of $\bY$ be denoted by $\by$. The primary objective in regression is to estimate the parameters $\bbeta$ and $\sigma^2$ based on $\bX$ and $\by$.
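As a purely illustrative numerical sketch (not part of the derivation), the following Python snippet simulates data from model (2); the sample size, number of predictors, and the values of $\bbeta$ and $\sigma$ below are hypothetical choices.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

n, p = 100, 3                                  # hypothetical sample size and number of predictors
beta_true = np.array([1.0, 2.0, -0.5, 0.3])    # hypothetical beta_0, ..., beta_p
sigma_true = 0.7                               # hypothetical error standard deviation

X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, p))])  # first column gives X_{i0} = 1
eps = rng.normal(0.0, sigma_true, size=n)                  # i.i.d. N(0, sigma^2) errors
y = X @ beta_true + eps                                    # Y = X beta + epsilon
\end{verbatim}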
One of the most common methods for estimating parameters is maximum likelihood estimation (MLE). We now employ MLE to estimate the parameters $\bbeta$ and $\sigma^2$. The likelihood function for the $n$ samples is \[L(\bbeta,\sigma|\bY) = \prod^n_{i=1}f_{Y_i}(y_i) = \prod^n_{i=1} \frac{1}{\sqrt{2\pi}\sigma} \exp\left(\ds-\frac{(y_i-\sum^p_{j=0} \beta_jX_{ij})^2}{2\sigma^2}\right). \tag{3}\] It is much simpler to find the critical points of $L(\bbeta,\sigma|\bY)$ by working with its logarithm, the log-likelihood function, so we define \[\begin{split}\tL(\bbeta,\sigma|\bY) &= \log L(\bbeta,\sigma|\bY) \\ &= \sum^n_{i=1}\left(\log \frac{1}{\sqrt{2\pi}\sigma} + \left(\ds-\frac{(y_i-\sum^p_{j=0} \beta_jX_{ij})^2}{2\sigma^2}\right)\right)\\ &=-n\log \sqrt{2\pi}\sigma -\frac{1}{2\sigma^2} \sum^n_{i=1}(y_i-\sum^p_{j=0} \beta_jX_{ij})^2.\end{split} \] We first seek the $\bbeta$ that maximizes $\tL(\bbeta,\sigma|\bY)$ for fixed $\sigma$.
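For illustration, the log-likelihood can be evaluated directly; a minimal sketch, continuing the simulated data above:
\begin{verbatim}
def log_likelihood(beta, sigma, X, y):
    # tilde-L(beta, sigma | y) = -n log(sqrt(2 pi) sigma) - RSS / (2 sigma^2)
    n = len(y)
    resid = y - X @ beta
    return -n * np.log(np.sqrt(2.0 * np.pi) * sigma) - (resid @ resid) / (2.0 * sigma**2)
\end{verbatim}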
We will compute the partial derivatives of $\tL(\bbeta,\sigma|\bY)$ with respect to each $\beta_k$ and $\sigma$. For $k = 0,1,\cdots,p$, \[\begin{split}\frac{\pt{\tL(\bbeta,\sigma|\bY)}}{\pt \beta_k} &= -\frac{1}{2\sigma^2} \sum^n_{i=1}2(-X_{ik})(y_i-\sum^p_{j=0} \beta_jX_{ij})\\ &=\frac{1}{\sigma^2} \sum^n_{i=1}X_{ik}(y_i-\sum^p_{j=0} \beta_jX_{ij}). \end{split} \tag{4}\] Setting $\ds \frac{\pt{\tL(\bbeta,\sigma|\bY)}}{\pt \beta_k} = 0$ gives, for $k = 0,1,\cdots,p$, \[\sum^n_{i=1}X_{ik}(y_i-\sum^p_{j=0} \beta_jX_{ij}) = 0,\] i.e. \[\sum^n_{i=1}\left(X_{ik}\sum^p_{j=0} \beta_jX_{ij}\right) = \sum^n_{i=1}X_{ik}y_i.\] Let $\bX_{i;}$ be the $i$th row vector of $\bX$ and $\bX_{;k}$ the $k$th column vector of $\bX$; then \[\sum^n_{i=1}X_{ik}\bX_{i;}\bbeta = \bX_{;k}^T\by,\] \[\bX_{;k}^T\bX\bbeta = \bX_{;k}^T\by.\] Combining the equations for all $k=0,1,\cdots,p$, we obtain the normal equations \[\bX^T\bX\bbeta = \bX^T\by.\] Assuming $\det(\bX^T\bX)\neq 0$, we get the unique critical point of $\tL(\bbeta,\sigma|\bY)$ in $\bbeta$, \[\bbeta = (\bX^T\bX)^{-1}\bX^T\by.\] One can further verify that this critical point is indeed the maximum point of $\tL(\bbeta,\sigma|\bY)$ (details omitted). Replacing the observation $\by$ by the random variable $\bY$, we obtain the maximum likelihood estimator of the parameter $\bbeta$, \[\hat{\bbeta} = (\bX^T\bX)^{-1}\bX^T\bY. \tag{5}\]
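Numerically, one would compute (5) by solving the linear system $\bX^T\bX\bbeta = \bX^T\by$ rather than forming the inverse explicitly; a minimal sketch on the simulated data above:
\begin{verbatim}
# Solve the normal equations (X^T X) beta = X^T y for the estimate in formula (5).
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
\end{verbatim}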
It is worth noting that, since the only term of $\tL(\bbeta,\sigma|\bY)$ depending on $\bbeta$ is $-\frac{1}{2\sigma^2}$ times the residual sum of squares (RSS), \[RSS = \sum^n_{i=1}(y_i-\sum^p_{j=0} \beta_jX_{ij})^2,\] maximizing $\tL(\bbeta,\sigma|\bY)$ over $\bbeta$ is equivalent to minimizing the RSS. Hence $\hat{\bbeta}$ serves as both the maximum likelihood estimator and the least squares estimator.
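This equivalence is easy to check numerically: a generic least-squares routine applied to the same simulated data should return, up to floating-point error, the same estimate as the normal equations.
\begin{verbatim}
beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)  # minimizer of ||y - X beta||^2
print(np.allclose(beta_hat, beta_ls))            # expected output: True
\end{verbatim}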
We now continue to find the MLE of $\sigma^2$. Setting the partial derivative with respect to $\sigma$ to zero, \[0=\frac{\pt{\tL(\bbeta,\sigma|\bY)}}{\pt \sigma} = -\frac{n}{\sigma}+\frac{1}{\sigma^3}\sum^n_{i=1}(y_i-\sum^p_{j=0} \beta_jX_{ij})^2, \] we have \[\sigma^2 = \frac{1}{n}\sum^n_{i=1}(y_i-\sum^p_{j=0} \beta_jX_{ij})^2.\tag{6}\] Substituting $\hat{\bbeta}$ into (6) and replacing $\by$ by $\bY$, we obtain the MLE of $\sigma^2$, \[\hat{\sigma}^2 = \frac{1}{n}\|\bY - \bX\hat{\bbeta}\|^2.\tag{7}\]
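Continuing the sketch, formula (7) is simply the average squared residual at $\hat{\bbeta}$; note that it divides by $n$, not by $n-p-1$ as the familiar unbiased variance estimator does.
\begin{verbatim}
resid = y - X @ beta_hat
sigma2_hat = (resid @ resid) / n   # (1/n) * ||y - X beta_hat||^2, formula (7)
\end{verbatim}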
We will prove the following theorem.