MLE and MAP

Why MLE and MAP

MLE and MAP are fundamental tools in the machine learning framework. They provide a simple way to estimate the parameters of a model.

  • MLE: \(P(x|\lambda) = \prod_{i=1}^n P(x_i|\lambda)\)

  • MAP: \(P(\lambda|x) = \frac{P(\lambda)P(x|\lambda)}{P(x)} \propto P(\lambda)P(x|\lambda)\)

Maximum Likelihood Estimator (MLE)

In statistics, maximum likelihood estimation (MLE) is a method of estimating the parameters of an assumed probability distribution, given some observed data. This is achieved by maximizing a likelihood function so that, under the assumed statistical model, the observed data is most probable. The point in the parameter space that maximizes the likelihood function is called the maximum likelihood estimate.[1]

  • maximize the conditional probability \(P(x|\lambda)\)
  • maximizing a function and maximizing its logarithm give the same argmax, since the logarithm is monotonically increasing
  • the data are assumed i.i.d. (independent and identically distributed), so the joint probability factorizes into a product
  • the log of the product is the sum of the logs of each term (illustrated in the sketch below)
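
To make the last two points concrete, here is a minimal numpy sketch; the sample, the value of \(\lambda\), and the use of the model \(P(x|\lambda)=\lambda e^{-\lambda x^2}\) from the example below are illustrative assumptions. Taking the log turns the i.i.d. product into a sum, which also avoids floating-point underflow:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)   # toy i.i.d. sample (illustrative only)
lam = 0.7                   # arbitrary parameter value

# product of many small terms: underflows to 0.0 for large n
likelihood = np.prod(lam * np.exp(-lam * x**2))

# sum of logs: numerically stable, same argmax
log_likelihood = np.sum(np.log(lam) - lam * x**2)

print(likelihood)       # likely 0.0 due to underflow
print(log_likelihood)   # finite and well-behaved
```

Maximizing either expression yields the same \(\hat{\lambda}\), but only the log form is usable in practice for large \(n\).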

The MLE estimates the most likely parameters for a given set of data:

\[L(\text{parameter} \mid \text{data}) = P(\text{data} \mid \text{parameter})\]
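
As a quick numerical illustration (a sketch with an arbitrary made-up sample, using the model \(P(x|\lambda)=\lambda e^{-\lambda x^2}\) introduced just below), the likelihood treats the data as fixed and varies the parameter:

```python
import numpy as np

# fixed data; the likelihood is a function of lambda
x = np.array([0.5, -1.2, 0.3, 2.0, -0.7])

for lam in (0.2, 0.5, 1.0, 2.0):
    L = np.prod(lam * np.exp(-lam * x**2))
    print(f"L(lambda={lam} | data) = {L:.2e}")
```

The candidate with the highest likelihood is the one under which the observed data are most probable; MLE finds that maximizer analytically.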

As an example, let us assume we want to find the MLE estimator for the following distribution: \(P(x|\lambda)=\lambda e^{-\lambda x^2}\)

\[\begin{align} \operatorname{argmax}_{\lambda} P(x|\lambda) & = \operatorname{argmax}_{\lambda} \log[P(x|\lambda)] \\ &= \operatorname{argmax}_{\lambda} \log\left[\prod_{i=1}^n P(x_i|\lambda)\right] \\ &= \operatorname{argmax}_{\lambda} \log\left[\prod_{i=1}^n \lambda e^{-\lambda x_i^2}\right] \\ & = \operatorname{argmax}_{\lambda} \sum_{i=1}^n \log\left(\lambda e^{-\lambda x_i^2}\right) \\ & = \operatorname{argmax}_{\lambda} \sum_{i=1}^n\left[\log(\lambda)-\lambda x_i^2\right] \\ & = \operatorname{argmax}_{\lambda} \left[n\log(\lambda)- \lambda \sum_{i=1}^n x_i^2\right] \end{align}\]

To find the MLE, we differentiate the function with respect to \(\lambda\) and set the derivative equal to zero:

\[\hat{\lambda}_{MLE} = \operatorname{argmax}_{\lambda} \left[n\log(\lambda)- \lambda \sum_{i=1}^n x_i^2\right]\] \[\frac{n}{\lambda} - \sum_{i=1}^n x_i^2 = 0\] \[\Rightarrow \hat{\lambda}_{MLE} = \left(\frac{\sum_{i=1}^n x_i^2}{n}\right)^{-1}\]
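
The closed form can be sanity-checked numerically. Below is a minimal sketch (assuming numpy and scipy are available; the sample is an arbitrary illustration) that compares the analytical estimate with a direct minimization of the negative log-likelihood:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(42)
x = rng.normal(size=500)   # illustrative sample

# closed form derived above: lambda_hat = n / sum(x_i^2)
lam_closed = len(x) / np.sum(x**2)

def neg_log_lik(lam):
    # negative of n*log(lambda) - lambda * sum(x_i^2)
    return -(len(x) * np.log(lam) - lam * np.sum(x**2))

res = minimize_scalar(neg_log_lik, bounds=(1e-6, 100.0), method="bounded")

print(lam_closed, res.x)   # the two estimates should agree closely
```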

Maximum a Posteriori Probability (MAP)

The MAP can be used to obtain a point estimate of an unobserved quantity on the basis of empirical data. It is closely related to the method of maximum likelihood (ML) estimation, but employs an augmented optimization objective which incorporates a prior distribution (that quantifies the additional information available through prior knowledge of a related event) over the quantity one wants to estimate. MAP estimation can therefore be seen as a regularization of maximum likelihood estimation.[2]

Posterior probability: \(P(\lambda| x) = \frac{P(\lambda)P(x|\lambda)}{P(x)} \propto P(\lambda)P(x|\lambda)\), where the evidence \(P(x)\) does not depend on \(\lambda\) and can therefore be dropped when maximizing.

Assuming, for example, an exponential prior \(P(\lambda) = e^{-\lambda}\) (the choice used in the derivation below), we can follow this procedure:

\[\begin{align} \operatorname{argmax}_{\lambda} P(\lambda| x) & = \operatorname{argmax}_{\lambda} \frac{P(\lambda)P(x|\lambda)}{P(x)}\\ & = \operatorname{argmax}_{\lambda} P(\lambda)P(x|\lambda)\\ & = \operatorname{argmax}_{\lambda} \log[P(\lambda)P(x|\lambda)]\\ & = \operatorname{argmax}_\lambda\left[\log(P(\lambda)) + \log(P(x|\lambda))\right]\\ & = \operatorname{argmax}_\lambda\left[\log(P(\lambda)) + \log\left(\prod_{i=1}^n P(x_i|\lambda)\right)\right]\\ & = \operatorname{argmax}_\lambda\left[\log(e^{-\lambda}) + \log\left(\prod_{i=1}^n\lambda e^{-\lambda x_i^2}\right)\right]\\ & = \operatorname{argmax}_\lambda\left[\log(e^{-\lambda}) + \sum_{i=1}^n \log\left(\lambda e^{-\lambda x_i^2}\right)\right]\\ & = \operatorname{argmax}_\lambda \left[-\lambda + \sum_{i=1}^n\left[\log(\lambda)-\lambda x_i^2\right]\right]\\ & = \operatorname{argmax}_\lambda \left[-\lambda + n\log(\lambda)-\lambda \sum_{i=1}^n x_i^2\right] \end{align}\]

Differentiate the expression we just obtained with respect to \(\lambda\) and set it equal to zero:

\[-1 + \frac{n}{\lambda}-\sum_{i=1}^n x_i^2 = 0\] \[\Rightarrow \hat{\lambda}_{MAP} = \left( \frac{1+\sum_{i=1}^n x_i^2}{n}\right)^{-1}\]
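
To see the regularizing effect of the prior, here is a minimal sketch comparing the two closed-form estimators derived above (illustrative data; \(n\) kept small so the prior's contribution is visible):

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(size=20)   # small sample: the prior has a visible effect
s = np.sum(x**2)
n = len(x)

lam_mle = n / s         # MLE: n / sum(x_i^2)
lam_map = n / (1 + s)   # MAP: n / (1 + sum(x_i^2)), exponential prior e^{-lambda}

print(lam_mle, lam_map)  # MAP is slightly smaller: the prior shrinks the estimate
```

As \(n\) grows, the \(+1\) contributed by the prior becomes negligible and \(\hat{\lambda}_{MAP}\) approaches \(\hat{\lambda}_{MLE}\), consistent with the view of MAP as a regularized MLE.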

References:

  1. Rossi, Richard J. (2018). Mathematical Statistics: An Introduction to Likelihood Based Inference. New York: John Wiley & Sons. p. 227. ISBN 978-1-118-77104-4.

  2. Wikipedia, "Maximum a posteriori estimation", https://en.wikipedia.org/wiki/Maximum_a_posteriori_estimation, accessed March 2022.