Understanding momentum

Momentum is often described as a way of speeding up gradient descent. The intuition usually given is that of a physical system: a ball with mass rolling down a hill. Does the intuition go further? Indeed, in his seminal paper, Some methods of speeding up the convergence of iteration methods, Polyak, B.T., 1964, Polyak himself alludes to the fact that the momentum equation is "the same equation as the motion of a body in a potential field", hence the name "heavy ball method".

This intuition hints at the very properties momentum is known for. Namely, when the error surface has a long and narrow valley, as in the notorious Rosenbrock function (a common test for optimization algorithms, shown to the right), the gradient is almost orthogonal to the long axis of the valley. In this setting, gradient descent oscillates back and forth along the short axis and fails to make progress towards the global minimum down the valley, which results in slow convergence. Now imagine a ball rolling down the Rosenbrock valley. The ball accumulates momentum and thus averages out the oscillations along the short axis while sustaining progress along the long axis. This, at least, is the common informal interpretation. Consult Learning Internal Representations by Error Propagation, Rumelhart, D.E., Hinton, G.E., Williams, R.J., 1986. While this post does not claim novelty in the ideas presented, it attempts to give the reader a more rigorous understanding of why momentum speeds up gradient descent and to shed light on some of the heuristics surrounding the use of this method.

Getting Started: Gradient Descent

To begin, let us look more closely at gradient descent. The formulation is simple enough, $$w^{k+1} = w^k - \alpha \nabla f(w^k)$$ for some parameters $w \in \mathbb{R}^n$ and an error function $f: \mathbb{R}^n \rightarrow \mathbb{R}$. Consider minimizing the convex quadratic function $$f(w) = \frac{1}{2} w^T A w - b^T w$$ where we assume $A$ is a symmetric, positive definite (and hence invertible) matrix. The gradient is easy enough, $$\nabla f(w) = Aw - b$$ and thus the update rule becomes $$ \begin{equation} w^{k+1} = w^k - \alpha (Aw^k - b) \end{equation} $$ Setting the gradient to zero, we can also infer the optimal solution $w^* = A^{-1} b$.
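To make this concrete, here is a minimal sketch of this update rule in Python (using NumPy); the matrix $A$, the vector $b$, the step-size and the iteration count are arbitrary illustrative choices.

```python
import numpy as np

# Illustrative symmetric positive definite A and right-hand side b
# (arbitrary choices, not taken from the article).
A = np.array([[2.0, 0.5],
              [0.5, 1.0]])
b = np.array([1.0, -1.0])

def gradient_descent(A, b, w0, alpha, steps):
    """Iterate w^{k+1} = w^k - alpha * (A w^k - b)."""
    w = w0.copy()
    for _ in range(steps):
        w = w - alpha * (A @ w - b)
    return w

w_star = np.linalg.solve(A, b)                       # optimal solution w* = A^{-1} b
w_gd = gradient_descent(A, b, np.zeros(2), alpha=0.1, steps=500)
print(np.allclose(w_gd, w_star))                     # True: the iterates approach w*
```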

Since $A$ is symmetric, it is diagonalizable as $$A = Q \Lambda Q^T$$ where $Q$ is the orthogonal matrix of eigenvectors $q_1, q_2, \dots, q_n$ and $$ \Lambda = \begin{bmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_n \end{bmatrix} $$ is the diagonal matrix of eigenvalues. This is nice, because we can now decompose the iterates of gradient descent into individual eigen components. The following computation largely follows the awesome article Why Momentum Really Works, Goh, G., 2017.

In particular, we can view the deviation $w^k - w^*$ under the eigenbasis. Subtract $w^*$ from both sides of eq. (1), and change basis by multiplying with $Q^T$ on the left. We get $$ \begin{equation} x^{k+1} = x^k - \alpha (Q^T A w^k - Q^T b), \quad x^k = Q^T (w^k - w^*) \end{equation} $$ Ideally, we would like to get rid of the $b$. Recall that $w^* = A^{-1} b$, so we can solve the definition of $x^k$ for $b$, $$ b = A w^k - A Q x^k $$ Substitute this back in eq. (2). The $w^k$ terms cancel nicely, and, using the diagonalizability of $A$, we get just what we wanted, $$ \begin{equation*} \begin{split} x^{k+1} &= x^k - \alpha \Lambda x^k \\ &\quad = (I - \alpha \Lambda) x^k \\ &\quad = (I - \alpha \Lambda)^{k+1} x^0 \end{split} \end{equation*} $$ which we can rewrite for the individual components as $$ x^{k+1}_i = (1 - \alpha \lambda_i)^{k+1} x^0_i $$ Having decomposed the updates, we can now move back to the original basis, $$ w^k - w^* = Q x^k = \Sigma_{i=1}^n (1-\alpha \lambda_i)^k x^0_i q_i $$ where $q_i$ is the $i$-th eigenvector of $A$. There we have it, gradient descent in closed form! Observe that if we want gradient descent to converge, the condition is $\lvert 1 - \alpha \lambda_i \rvert < 1$; or, $$ \begin{equation} 0 < \alpha \lambda_i < 2 \end{equation} $$
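To convince ourselves that nothing went wrong, here is a small numerical check of this closed form; the random problem, step-size and iteration count below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random symmetric positive definite A and arbitrary b (illustrative choices).
M = rng.normal(size=(4, 4))
A = M @ M.T + 4 * np.eye(4)
b = rng.normal(size=4)

w_star = np.linalg.solve(A, b)
lam, Q = np.linalg.eigh(A)              # A = Q diag(lam) Q^T

alpha, k = 0.05, 40
w0 = rng.normal(size=4)
x0 = Q.T @ (w0 - w_star)                # initial deviation in the eigenbasis

# k steps of gradient descent ...
w = w0.copy()
for _ in range(k):
    w = w - alpha * (A @ w - b)

# ... match the closed form  w^k - w^* = sum_i (1 - alpha*lam_i)^k * x0_i * q_i
closed_form = Q @ ((1 - alpha * lam) ** k * x0)
print(np.allclose(w - w_star, closed_form))          # True
```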

Decomposing the error

This simple formulation of gradient descent allows for a simple interpretation. Namely, we start with some initial deviation from the optimal parameters as measured in the eigenspace by the components $x^0_i$, and each "eigen-component" follows its own path of exponential decrease with rate $1-\alpha \lambda_i$.

This exponential decrease explains why, for the usual step sizes, the first few iterations make large progress in minimizing the error, but with time the progress plateaus. Furthermore, based on the equation, we can confidently predict that larger eigenvalues correspond to faster error convergence (at least while $\alpha \lambda_i \leq 1$). Indeed we can test this hypothesis by looking at the eigen components of the error, $$ \begin{equation*} \begin{split} f(w^k) - f(w^*) &= \frac{1}{2} \left[w^k\right]^T A w^k - b^T w^k - \frac{1}{2} \left[w^*\right]^T A w^* + b^T w^* \\ &\quad = \frac{1}{2} \left[(w^k - w^*)^T A (w^k - w^*) \right. \\ &\hspace{6em} \left. + \left[w^k\right]^T A w^* + \left[w^*\right]^T A w^k - 2 \left[w^*\right]^T A w^* \right. \\ &\hspace{6em} \left. - 2 b^T w^k + 2 b^T w^* \right] \\ &\quad = \frac{1}{2} (w^k-w^*)^T A (w^k-w^*) \\ &\quad = \frac{1}{2} \left(Q^T(w^k-w^*)\right)^T \Lambda \left(Q^T(w^k-w^*)\right) \\ &\quad = \frac{1}{2} \left[x^k\right]^T \Lambda x^k \\ &\quad = \frac{1}{2} \Sigma_i \lambda_i \left[x_i^k\right]^2 \\ &\quad = \frac{1}{2} \Sigma_i \lambda_i (1 - \alpha \lambda_i)^{2 k} \left[x_i^0\right]^2 \end{split} \end{equation*} $$ where the remaining terms cancel because $b = A w^*$. We perform a simple iteration in $\mathbb{R}^3$ for the eigenvalues $\lambda_1 = 0.4$, $\lambda_2 = 0.6$, and $\lambda_3 = 1.0$,

[Interactive figure: evolution of the eigen components of the error for the eigenvalues above, with an adjustable step size $\alpha$ (default $\alpha = 1$).]
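For reference, a short sketch that reproduces this experiment numerically; the initial components $x^0_i$ are an arbitrary choice.

```python
import numpy as np

lam = np.array([0.4, 0.6, 1.0])    # the eigenvalues from the text
x0 = np.ones(3)                    # initial deviations x^0_i (arbitrary choice)
alpha = 1.0

for k in range(0, 25, 5):
    # per-component error: 1/2 * lambda_i * (1 - alpha*lambda_i)^(2k) * (x^0_i)^2
    components = 0.5 * lam * (1 - alpha * lam) ** (2 * k) * x0**2
    print(k, components, components.sum())
# the component with the smallest eigenvalue (0.4) dominates the error over time
```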

Clearly, some choices of $\alpha$ are better than others. The natural question is: which is best? As has become evident, the issue is the "slowest" component, i.e. the component with the smallest eigenvalue or, equivalently, the slowest rate of convergence. This component puts an upper bound on how fast we can converge. Since a component converges faster the smaller its contraction factor $\lvert 1 - \alpha \lambda_i \rvert$ is, an obvious approach is to minimize the largest contraction factor, $$ \begin{equation*} \begin{split} &\underset{\alpha}{\mathrm{arg\,min}} \enspace \underset{i}{\mathrm{max}} \; \lvert 1 - \alpha \lambda_i \rvert \\ &\quad = \underset{\alpha}{\mathrm{arg\,min}} \enspace \mathrm{max} \; \left\{ \lvert 1 - \alpha \lambda_1 \rvert, \, \lvert 1 - \alpha \lambda_n \rvert \right\} \end{split} \end{equation*} $$ where we've ordered the eigenvalues from smallest to largest ($\lambda_1 \leq \lambda_2 \leq \dots \leq \lambda_n$). The optimum is obtained for the $\alpha$ at which the two rates are equal, i.e. $1 - \alpha \lambda_1 = -(1 - \alpha \lambda_n)$. This yields $$ \alpha = \frac{2}{\lambda_1 + \lambda_n} $$ Plugging this back into the convergence rate yields an optimal rate of $$ \frac{\kappa - 1}{\kappa + 1}, \quad \kappa = \frac{\lambda_n}{\lambda_1} $$ The ratio $\lambda_n/\lambda_1$ appears often enough that it has a name - the condition number. Its precise meaning varies with the setting, but roughly speaking it is a direct measure of how "hard" the problem is; i.e. a measure of pathological curvature.
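As a quick worked example, the optimal step-size and the resulting rate for the three eigenvalues used above can be computed directly; this is just the formulas above evaluated numerically.

```python
import numpy as np

lam = np.array([0.4, 0.6, 1.0])
lam1, lamn = lam.min(), lam.max()

alpha_opt = 2 / (lam1 + lamn)             # optimal step-size
kappa = lamn / lam1                       # condition number
rate_opt = (kappa - 1) / (kappa + 1)      # optimal convergence rate

# At the optimum, the slowest and fastest components contract at the same rate:
print(abs(1 - alpha_opt * lam1), abs(1 - alpha_opt * lamn), rate_opt)
# -> all three equal roughly 0.4286
```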

A Physical Analogy

Take a look at the equation for gradient descent (reformulated a bit) $$ \begin{equation} w^{k+1} - w^k = -\alpha \nabla f(w^k) \end{equation} $$ How about we consider the continuous version, $$ \frac{d}{d t} w(t) = -\alpha \nabla f(w(t)) $$ where $w$ now takes the form of a continuous function of time rather than being indexed by discrete time steps. Here is where the physical analogy comes in. Let's assign to $f(w)$ the interpretation of potential energy, which we want minimized. Naturally, $w$ describes a position in parameter-space, so $\frac{d}{d t} w(t)$ is our velocity term. Newton's second law ($F = ma$) suggests we also include an acceleration term, so let us do that, $$ m \frac{d^2}{d t^2} w + \frac{d}{d t} w = -\alpha \nabla f(w) $$ Ideally, after some time, we want our "ball", whose position is described by $w$, to settle down. In physics, a common way to bring a system to a halt is to add friction, which was already hinted at by the velocity term, thus $$ \begin{equation} m \frac{d^2}{d t^2} w + \mu \frac{d}{d t} w = -\alpha \nabla f(w) \end{equation} $$ where $\mu$ is the friction coefficient.

That was the inspiration from physics! Since the end goal is to run the algorithm on a computer, we necessarily have to discretize eq. (5) again, $$ m \frac{w_{t+\Delta t} + w_{t-\Delta t} - 2 w_t}{\Delta t^2} + \mu \frac{w_{t+\Delta t}-w_t}{\Delta t} = -\alpha \nabla f(w_t) $$ Rearranging terms we get $$ w_{t+\Delta t} - w_t = -\frac{\alpha \Delta t^2}{m + \mu \Delta t} \nabla f(w_t) + \frac{m}{m+\mu\Delta t} (w_t - w_{t-\Delta t}) $$ Comparing this with eq. (4), we can write $$ \begin{equation} w^{k+1} - w^{k} = -\epsilon \nabla f(w^k) + \beta (w^k - w^{k-1}) \end{equation} $$ where $$ \epsilon = \frac{\alpha \Delta t^2}{m + \mu \Delta t}, \quad \beta = \frac{m}{m+\mu\Delta t} $$ It is clear that $\epsilon$ serves as the new step-size. The $\beta$ term "tracks" the magnitude of the previous update, hence the name "momentum". Note that $\beta=0$ implies $m=0$ and vice-versa. Hence the common analogy: gradient descent is a "massless" particle, whereas momentum is a "heavy" ball rolling down a hill.
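Eq. (6) translates almost verbatim into code. Here is a minimal sketch of the heavy-ball update, run on the same illustrative quadratic as before; the values of $\epsilon$ and $\beta$ are again arbitrary choices.

```python
import numpy as np

def momentum(grad, w0, eps, beta, steps):
    """Heavy-ball update: w^{k+1} = w^k - eps * grad(w^k) + beta * (w^k - w^{k-1})."""
    w_prev = w0.copy()
    w = w0.copy()
    for _ in range(steps):
        w, w_prev = w - eps * grad(w) + beta * (w - w_prev), w
    return w

# The same illustrative quadratic f(w) = 1/2 w^T A w - b^T w as before.
A = np.array([[2.0, 0.5],
              [0.5, 1.0]])
b = np.array([1.0, -1.0])
w_hb = momentum(lambda w: A @ w - b, np.zeros(2), eps=0.1, beta=0.9, steps=500)
print(np.allclose(w_hb, np.linalg.solve(A, b)))      # True
```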

Analyzing Momentum

How does momentum speed up convergence of the system to a local minimum? To address this question, we analyze eq. (6) by expanding the potential energy $f(w)$ to second order around a minimum $w_0$, $$ \begin{equation*} \begin{split} & f(w) \approx f(w_0) + \nabla f(w_0)^T (w - w_0) + \frac{1}{2} (w - w_0)^T H (w - w_0) \\ &\quad \implies \nabla f(w) \approx H (w - w_0) \end{split} \end{equation*} $$ where the gradient $\nabla f(w_0)$ vanishes, since we are at a minimum, and $H$ is the Hessian defined as $$ H_{i,j} = \frac{\partial^2 f(w)}{\partial w_i \partial w_j} \Big|_{w_0} $$ Use this approximation for $\nabla f(w^k)$ in eq. (6) and, without loss of generality, take $w_0 = 0$. After simple rearrangements we obtain the update rule, $$ w^{k+1} = \left[(1 + \beta) I - \epsilon H\right] w^k - \beta w^{k-1} $$ What was the benefit of all this? Well, the Hessian is symmetric (this is not always the case; here we assume $\frac{\partial^2 f(w)}{\partial w_i \partial w_j} = \frac{\partial^2 f(w)}{\partial w_j \partial w_i}$), thus it is diagonalizable and we can apply the same trick we used to analyze gradient descent! In particular, $H = Q \Lambda Q^T$ where, as usual, $Q$ is the matrix of eigenvectors and $\Lambda$ is the diagonal matrix of eigenvalues $\lambda_1, \dots, \lambda_n$. Performing a change of basis $x^k = Q^T w^k$, we get the component-wise update rule, $$ x^{k+1}_i = \left[1 + \beta - \epsilon \lambda_i\right]x^k_i - \beta x^{k-1}_i $$ We can now use a clever trick (see On the momentum term in gradient descent learning algorithms, Qian, N., 1999, for the source of the trick) that begins with rewriting the update rule in matrix-vector form $$ \begin{pmatrix} x^k_i \\ x^{k+1}_i \end{pmatrix} = A \begin{pmatrix} x^{k-1}_i \\ x^k_i \end{pmatrix}, \quad A = \begin{pmatrix} 0 & 1 \\ -\beta & 1 + \beta - \epsilon \lambda_i \end{pmatrix} $$ In closed form, $$ \begin{pmatrix} x^k_i \\ x^{k+1}_i \end{pmatrix} = A^k \begin{pmatrix} x^0_i \\ x^1_i \end{pmatrix} $$ This becomes extremely simple, and it is self-evident that if we want momentum to converge, we want $A^k$ to converge as $k \to \infty$. There is a nice formula for the powers of a $2 \times 2$ matrix in terms of its eigenvalues (The $n$-th Power of a $2 \times 2$ Matrix, Williams, K. S.), $$ A^k = \begin{cases} \sigma_1^k \frac{A - \sigma_2 I}{\sigma_1 - \sigma_2} + \sigma_2^k \frac{A - \sigma_1 I}{\sigma_2 - \sigma_1}, \quad &\text{if}\; \sigma_1 \neq \sigma_2, \\ \sigma_1^{k-1} \left(k A - (k - 1) \sigma_1 I\right), \quad &\text{if}\; \sigma_1 = \sigma_2 \end{cases} $$ where $\sigma_1, \sigma_2$ are the eigenvalues of $A$. For our particular matrix, the eigenvalues take the form $$ \sigma_{\left\{1,2\right\}, i} = \frac{1 + \beta - \epsilon \lambda_i \pm \sqrt{(1 + \beta - \epsilon \lambda_i)^2 - 4 \beta}} {2} $$ The conclusion is that if we want momentum to converge, we require $\lvert \sigma_{1, i} \rvert < 1$ and $\lvert \sigma_{2, i} \rvert < 1$. This is reminiscent of the condition of convergence for gradient descent; however, instead of a single sequence, here we have two coupled ones. More succinctly, convergence is ensured when $$ \max \left\{ \lvert \sigma_{1, i} \rvert, \lvert \sigma_{2, i} \rvert \right\} < 1 $$ This quantity also determines the convergence rate of momentum, which is governed by the slower of $\sigma_{1, i}$ and $\sigma_{2, i}$. It is therefore instructive to plot the quantity as a function of the parameters $\beta$ (momentum) and $\epsilon \lambda_i$ (step-size).
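Here is a sketch of that computation: evaluate $\max \left\{ \lvert \sigma_{1} \rvert, \lvert \sigma_{2} \rvert \right\}$ on a grid of $\beta$ and $\epsilon \lambda$ values (the grid ranges are arbitrary choices); plotting the resulting array reproduces the regimes discussed next.

```python
import numpy as np

def max_abs_eigenvalue(beta, eps_lam):
    """max{|sigma_1|, |sigma_2|} for A = [[0, 1], [-beta, 1 + beta - eps*lambda]]."""
    t = 1 + beta - eps_lam
    disc = complex(t * t - 4 * beta)        # may be negative -> complex eigenvalues
    s1 = (t + np.sqrt(disc)) / 2
    s2 = (t - np.sqrt(disc)) / 2
    return max(abs(s1), abs(s2))

betas = np.linspace(0, 1, 101)              # momentum parameter
eps_lams = np.linspace(0, 4, 101)           # effective step-size eps * lambda
rates = np.array([[max_abs_eigenvalue(b, el) for el in eps_lams] for b in betas])
# rates < 1 marks the region of convergence; plotting the grid (e.g. with
# matplotlib's imshow) reproduces the regimes described below.
```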

Immediately we can observe different regimes for momentum. Most of the parameter space gives rise to oscillatory behaviour of the error (typical of momentum!). There are, however, regimes characterized by monotonic decrease (lower left corner) and ones characterized by divergence (lower right corner). The bluish region is essentially the region of divergence where $ \max \left\{ \lvert \sigma_{1, i} \rvert, \lvert \sigma_{2, i} \rvert \right\} > 1 $.

It isn't hard to observe that the range of step-sizes under which momentum converges depends on the value of the momentum parameter. Formally, the system converges if and only if (consult Appendix A of On the momentum term in gradient descent learning algorithms for a formal proof of this result) $$ -1 < \beta < 1 \quad \textit{and} \quad 0 < \epsilon \lambda_i < 2 + 2 \beta $$ Compare this with eq. (3). In essence, momentum has increased the permissible step-size!
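As a quick sanity check of this condition, we can sample random $(\beta, \epsilon \lambda)$ pairs and compare the stated condition with the spectral radius from the eigenvalue formula above; the sampling ranges are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)

def spectral_radius(beta, eps_lam):
    """max{|sigma_1|, |sigma_2|} of the 2x2 momentum iteration matrix."""
    t = 1 + beta - eps_lam
    disc = complex(t * t - 4 * beta)
    return max(abs((t + np.sqrt(disc)) / 2), abs((t - np.sqrt(disc)) / 2))

agree = 0
for _ in range(10_000):
    beta = rng.uniform(-1.5, 1.5)
    eps_lam = rng.uniform(-0.5, 4.5)
    stated = (-1 < beta < 1) and (0 < eps_lam < 2 + 2 * beta)
    agree += stated == (spectral_radius(beta, eps_lam) < 1)
print(agree)    # 10000: the two criteria coincide (away from the boundary)
```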

Optimal parameters

We now consider the question of optimal parameters for the momentum algorithm. Unfortunately, this requires a rather involved procedure, so the details are omitted. The complication stems from the requirement to optimize $\max \left\{ \lvert \sigma_{1,i} \rvert, \lvert \sigma_{2,i} \rvert \right\}$ over both $\beta$ and $\epsilon$, for all components $i$ simultaneously, i.e. $$ \underset{\beta,\, \epsilon}{\mathrm{arg\,min}} \enspace \underset{i}{\mathrm{max}} \; \left\{ \max \left\{ \lvert \sigma_{1,i} \rvert, \lvert \sigma_{2,i} \rvert \right\} \right\} $$ Once carried out, however, we obtain $$ \beta = \left( \frac{\sqrt{\lambda_n} - \sqrt{\lambda_1}}{\sqrt{\lambda_n} + \sqrt{\lambda_1}} \right)^2, \quad \epsilon = \left( \frac{2}{\sqrt{\lambda_1} + \sqrt{\lambda_n}} \right)^2 $$ which also gives us a convergence rate of $$ \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1} $$ compared to gradient descent's $$ \frac{\kappa - 1}{\kappa + 1} $$ By using momentum we have reduced the effect of the condition number! Of course, the setting considered here is a toy one, and one has to be careful when extrapolating to real-world scenarios. Nevertheless, the analysis so far has given us some insights. For instance, if the problem we are dealing with is ill-conditioned (i.e. $\kappa \gg 1$), the optimal step-size is approximately twice that of gradient descent, while the optimal momentum parameter $\beta$ is close to $1$.
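To see the difference in practice, here is a small sketch comparing optimally-tuned gradient descent with optimally-tuned momentum on an ill-conditioned quadratic; the eigenvalues (chosen to give $\kappa = 100$) and the number of iterations are illustrative choices.

```python
import numpy as np

# Ill-conditioned diagonal quadratic f(w) = 1/2 w^T A w with minimum at w* = 0.
# The eigenvalues are an illustrative choice giving kappa = 100.
lam = np.array([0.01, 0.3, 1.0])
A = np.diag(lam)
lam1, lamn = lam.min(), lam.max()

alpha = 2 / (lam1 + lamn)                              # optimal GD step-size
eps = (2 / (np.sqrt(lam1) + np.sqrt(lamn))) ** 2       # optimal momentum step-size
beta = ((np.sqrt(lamn) - np.sqrt(lam1)) /
        (np.sqrt(lamn) + np.sqrt(lam1))) ** 2          # optimal momentum parameter

w_gd = np.ones(3)
w, w_prev = np.ones(3), np.ones(3)
for _ in range(200):
    w_gd = w_gd - alpha * (A @ w_gd)
    w, w_prev = w - eps * (A @ w) + beta * (w - w_prev), w

print(np.linalg.norm(w_gd), np.linalg.norm(w))
# momentum ends up several orders of magnitude closer to w* = 0
```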

Concluding remarks

Momentum is a workhorse algorithm in modern optimization settings. Its wide adoption has also been accompanied by heuristics about why momentum works and what the best parameters are. This article has demonstrated that many of these heuristics have a firm grounding in mathematical analysis - like the "heavy-ball" analogy and the practice of choosing $\beta$ very close to $1$. Hopefully, this has provided the reader with a more solid understanding of momentum. Nevertheless, one needs to stay aware of the fact that not everything is a nail once you have a hammer. Comparing momentum, a first order algorithm, to higher order techniques could shed light on the limitations of momentum, which were not discussed in this article. This is left for the future!