Optimisation

Optimisation in deep learning deals with highly non-linear and non-convex functions, as opposed to classical machine learning, which deals mainly with convex optimisation problems.

Optimisation methods can be divided into two main categories: first-order methods, which include the gradient descent family, and high-order methods, which rely on a Taylor-series expansion of the objective [1]. First-order methods are widely used because of their simplicity of implementation. Compared with first-order methods, high-order methods converge faster, since the curvature information makes the search direction more effective. The difficulty with high-order methods lies in computing and storing the inverse of the Hessian matrix.

As the simplest example let us consider Supervised Learning: the goal is to find an optimal mapping function \(f(x)\) that minimises the loss function over the training samples

\begin{equation} \min_{(w,\beta)} \frac{1}{2}\sum_{i}(y_i - f(x_i))^2 \end{equation}

where

\begin{equation} f(x_i)=\beta_0 + \sum_{k=1}^K \beta_{k}g(w_{k0}+\sum_{j=1}^p w_{kj}x_{ij}) \end{equation}

Here \(g\) is a non-linear activation function, and because of the nested arrangement of the parameters \textbf{minimisation is not straightforward}. The problem is non-convex and therefore has multiple solutions. To handle this complexity and avoid overfitting, we use methods such as gradient descent, regularisation, and high-order optimisation methods.
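As a concrete illustration, here is a minimal NumPy sketch of this mapping and loss, assuming \(g\) is the sigmoid activation and using arbitrary illustrative parameter values (neither choice is fixed by the text):

```python
import numpy as np

def g(z):
    # Assumed activation: the sigmoid; the text leaves g unspecified.
    return 1.0 / (1.0 + np.exp(-z))

def f(X, beta0, beta, W, w0):
    """Single-hidden-layer mapping f(x_i) = beta_0 + sum_k beta_k g(w_k0 + sum_j w_kj x_ij).
    X: (N, p) inputs, beta0: output bias, beta: (K,) output weights,
    W: (K, p) hidden weights w_kj, w0: (K,) hidden biases w_k0."""
    hidden = g(X @ W.T + w0)        # (N, K) activations of the K hidden units
    return beta0 + hidden @ beta    # (N,) predictions

def loss(y, y_hat):
    # 1/2 * sum_i (y_i - f(x_i))^2, the objective to be minimised
    return 0.5 * np.sum((y - y_hat) ** 2)

# Toy usage with random data and parameters (illustrative values only)
rng = np.random.default_rng(0)
N, p, K = 100, 3, 5
X, y = rng.normal(size=(N, p)), rng.normal(size=N)
beta0, beta = 0.0, rng.normal(size=K)
W, w0 = rng.normal(size=(K, p)), rng.normal(size=K)
print(loss(y, f(X, beta0, beta, W, w0)))
```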

Gradient descent

The gradient descent method is the earliest and most common optimisation method. Its idea is that the variables are updated iteratively in the direction opposite to the gradient of the objective function, so that the iterates gradually converge to the optimal value. The learning rate $\eta$ determines the step size at each iteration, and thus influences the number of iterations needed to reach the optimal value. Let us give the formal expression of the gradient descent method.

Consider an objective function of the following form:

\begin{equation} h(\theta)=\frac{1}{m}\sum_{i=1}^ml(y_i,f_{\theta}(x_i))
\end{equation}

where \(l(y,z)\) is the loss function and \(f_{\theta}(x)\) is a neural network function. We want to compute the gradient $\nabla h(\theta_k)$ of the objective at the current point \(\theta_k\):

\begin{equation} \nabla h(\theta_k) = \begin{bmatrix} \frac{\partial h(\theta)}{\partial \theta_1} \\ \frac{\partial h(\theta)}{\partial \theta_2} \\ \vdots \\ \frac{\partial h(\theta)}{\partial \theta_n} \end{bmatrix} \end{equation}

We then compute the update \(\theta_{k+1}\) as:

\begin{equation} \theta_{k+1}=\theta_k-\eta \nabla h(\theta_k) \end{equation}

where \(\eta\) is the learning rate parameter; a typical choice is \(\eta \leq 1/L\), where \(L\) is the maximum curvature of the objective.

The gradient descent method is simple to implement, and the solution it finds is globally optimal when the objective function is convex. However, convergence often slows down as the variables get closer to the optimal solution, and the iterations must be performed more carefully.
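To make the update rule concrete, here is a minimal sketch of gradient descent on a simple convex quadratic, with the step size set to \(1/L\); the matrix, vector, and iteration count are illustrative choices, not taken from the text:

```python
import numpy as np

# Gradient descent theta_{k+1} = theta_k - eta * grad_h(theta_k)
# on the convex quadratic h(theta) = 1/2 theta^T A theta - b^T theta.
A = np.array([[3.0, 0.2],
              [0.2, 1.0]])
b = np.array([1.0, -2.0])

def grad_h(theta):
    return A @ theta - b

L = np.max(np.linalg.eigvalsh(A))   # maximum curvature -> step size eta <= 1/L
eta = 1.0 / L

theta = np.zeros(2)
for k in range(200):
    theta = theta - eta * grad_h(theta)

print(theta, np.linalg.solve(A, b))  # iterate vs the exact minimiser
```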

Stochastic gradient descent

The idea of stochastic gradient descent is to use one randomly chosen sample per iteration to estimate the gradient, instead of directly calculating its exact value. The stochastic gradient is an unbiased estimate of the real gradient.

A typical training objective consists of the sum of the individual losses for each training case, such that:

\begin{equation} h(\theta)=\frac{1}{m}\sum_{i=1}^m h_i(\theta) \end{equation}

If \textbf{m is very large}, the computation of GD becomes excessively expensive, since the gradient is itself an average of the individual gradients:

\begin{equation} \nabla h(\theta) = \frac{1}{m} \sum_{i=1}^m \nabla h_i(\theta) \end{equation}

The general idea is to randomly sub-sample a “mini-batch” of training cases \(S \subset \{1,2,...,m\}\) of size \(b \ll m\) and estimate the gradient as:

\begin{equation} \tilde{\nabla} h(\theta)= \frac{1}{b}\sum_{i \in S} \nabla h_i(\theta) \end{equation}

Stochastic Gradient Descent (SGD) then replaces the exact gradient in the classic GD update with its mini-batch estimate:

\begin{equation} \theta_{k+1}=\theta_k-\eta \tilde{\nabla} h(\theta_k) \end{equation}

SGD increases the overall optimisation efficiency at the expense of more iterations, but the increase in the number of iterations is insignificant compared with the high per-iteration cost caused by a large number of samples.
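A minimal sketch of the mini-batch estimate and the SGD update, using a least-squares objective; the data, batch size \(b\), and learning rate are illustrative assumptions:

```python
import numpy as np

# Mini-batch SGD for h(theta) = 1/m sum_i h_i(theta),
# with squared-error terms h_i(theta) = 1/2 (y_i - x_i^T theta)^2.
rng = np.random.default_rng(0)
m, p = 10_000, 5
X = rng.normal(size=(m, p))
theta_true = rng.normal(size=p)
y = X @ theta_true + 0.1 * rng.normal(size=m)

def minibatch_grad(theta, batch):
    # (1/b) * sum_{i in S} grad h_i(theta): an unbiased estimate of the full gradient
    Xb, yb = X[batch], y[batch]
    return Xb.T @ (Xb @ theta - yb) / len(batch)

b, eta = 32, 0.05
theta = np.zeros(p)
for k in range(2000):
    S = rng.choice(m, size=b, replace=False)   # random mini-batch S drawn from {1,...,m}
    theta = theta - eta * minibatch_grad(theta, S)

print(np.linalg.norm(theta - theta_true))      # should be close to zero
```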

2nd-order methods

Second-order methods can be used to address problems where the objective function is highly non-linear and ill-conditioned. They work effectively by introducing curvature information. The problem with first-order methods is that the number of steps to convergence grows with the “condition number”:

\begin{equation} K=\frac{L}{\mu} \end{equation}

where $L$ is the maximum curvature and $\mu$ is the minimum curvature. Second-order methods can improve this dependency. The basic idea is to approximate the objective function around the current $\theta$ with a second-order Taylor series:

\begin{equation} h(\theta+d)\approx h(\theta)+ \nabla h(\theta)^Td+ \frac{1}{2}d^TH(\theta)d \end{equation}

where \(H(\theta)\) is the Hessian matrix. Minimising the Taylor approximation with respect to $d$ gives:

\begin{equation} d=-H(\theta)^{-1} \nabla h(\theta) \end{equation}

The update of $\theta$ will be given by the following equation:

\begin{equation}\theta_{k+1}=\theta_k - \eta\, H(\theta_k)^{-1} \nabla h(\theta_k) \end{equation}
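A minimal sketch of this second-order update on a small convex toy objective (the function, starting point, and iteration count are illustrative assumptions); the Newton direction is obtained by solving a linear system rather than forming \(H(\theta_k)^{-1}\) explicitly, which is exactly the expensive operation mentioned earlier:

```python
import numpy as np

# Second-order update theta_{k+1} = theta_k - eta * H(theta_k)^{-1} grad h(theta_k)
# on the strictly convex toy objective
# h(x, y) = x^4 + y^4 + (x - 1)^2 + (y + 2)^2 + x*y.
def grad_h(theta):
    x, y = theta
    return np.array([4 * x**3 + 2 * (x - 1.0) + y,
                     4 * y**3 + 2 * (y + 2.0) + x])

def hess_h(theta):
    x, y = theta
    return np.array([[12 * x**2 + 2.0, 1.0],
                     [1.0, 12 * y**2 + 2.0]])

eta = 1.0
theta = np.array([3.0, 3.0])
for k in range(20):
    d = np.linalg.solve(hess_h(theta), grad_h(theta))  # avoids storing H^{-1}
    theta = theta - eta * d

print(theta, grad_h(theta))   # the gradient should be close to zero
```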

Backpropagation

The concept of backpropagation is most easily introduced by first analysing a single fully connected layer, to grasp how the gradient of the loss function is obtained. We can then extend this concept, by means of the chain rule, to a multi-layer network, which makes clear what backpropagation actually does. A detailed analysis of backpropagation follows.

When analysing our code, we find this concept in all the tensors created with ‘requires_grad = TRUE’: for these, the gradients of the model parameters are stored, enabling the model to learn during training.
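As a minimal sketch (assuming PyTorch, where the corresponding flag is written requires_grad=True), the snippet below builds a small two-layer network, lets backward() apply the chain rule to store the gradients, and then takes one gradient-descent step with them:

```python
import torch

# Two-layer network; backward() applies the chain rule layer by layer
# and stores the resulting gradients in each parameter's .grad field.
torch.manual_seed(0)
X = torch.randn(32, 3)                      # mini-batch of inputs
y = torch.randn(32, 1)                      # targets

W1 = torch.randn(3, 5, requires_grad=True)  # hidden-layer weights
b1 = torch.zeros(5, requires_grad=True)
W2 = torch.randn(5, 1, requires_grad=True)  # output-layer weights
b2 = torch.zeros(1, requires_grad=True)

hidden = torch.sigmoid(X @ W1 + b1)         # forward pass
y_hat = hidden @ W2 + b2
loss = 0.5 * torch.sum((y - y_hat) ** 2)

loss.backward()                             # backpropagation: chain rule end to end
print(W1.grad.shape, W2.grad.shape)         # gradients are now available

# One gradient-descent step using the stored gradients
eta = 0.01
with torch.no_grad():
    for p in (W1, b1, W2, b2):
        p -= eta * p.grad
        p.grad.zero_()
```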

References:

  1. Shiliang Sun, Zehui Cao, Han Zhu, and Jing Zhao (2019). A Survey of Optimization Methods from a Machine Learning Perspective, https://arxiv.org/abs/1906.06821