Understanding momentum

Momentum is often described as speeding up gradient descent. The intuition given is of a physical system - a ball with a mass rolling down a hill. Does the intuition go further? Indeed, in his seminal paper (Some methods of speeding up the convergence of iteration methods, Polyak, B.T., 1964), Polyak himself notes that the momentum equation is "the same equation as the motion of a body in a potential field", hence the name "heavy ball method".

This intuition hints at the very properties momentum is known for. Namely, when the error surface has a long and narrow valley (as in the notorious Rosenbrock function to the right, a common test for optimization algorithms), the gradient is almost orthogonal to the long axis of the valley. In this setting, gradient descent oscillates back and forth along the short axis and makes little progress towards the global minimum down the valley, resulting in slow convergence. Now imagine a ball rolling down the Rosenbrock valley. The ball accumulates momentum and thus averages out the oscillations along the short axis while sustaining progress along the long axis. This, at least, is the common informal interpretation (consult Learning Internal Representations by Error Propagation, Rumelhart, D.E., Hinton, G.E., Williams, R.J., 1986). While this post does not claim novelty in the ideas presented, it attempts to give the reader a more rigorous understanding of why momentum speeds up gradient descent and to shed light on some of the heuristics surrounding the use of this method.

Getting Started: Gradient Descent

To begin, let us look more closely at gradient descent. The formulation is simple enough, $w^{k+1} = w^k - \alpha \nabla f(w^k)$ for some parameters $w \in \mathbb{R}^n$ and an error function $f: \mathbb{R}^n \rightarrow \mathbb{R}$. Consider minimizing the convex quadratic function $f(w) = \frac{1}{2} w^T A w - b^T w$ where we assume $A$ is a symmetric, invertible matrix. The gradient is easy enough, $\nabla f(w) = Aw - b$, and thus the update rule becomes \begin{equation} w^{k+1} = w^k - \alpha (Aw^k - b) \end{equation} Setting the gradient to zero, we can also infer the optimal solution $w^* = A^{-1} b$.
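To make the setup concrete, here is a minimal sketch of update (1) in Python with NumPy; the particular $A$ and $b$ below are arbitrary illustrative choices, not taken from the text.

```python
import numpy as np

# A small, symmetric positive definite quadratic chosen purely for illustration.
A = np.array([[1.0, 0.2],
              [0.2, 0.6]])
b = np.array([1.0, -0.5])

def grad_f(w):
    # Gradient of f(w) = 1/2 w^T A w - b^T w
    return A @ w - b

def gradient_descent(w0, alpha=0.5, steps=100):
    w = w0.copy()
    for _ in range(steps):
        w = w - alpha * grad_f(w)   # eq. (1)
    return w

w_star = np.linalg.solve(A, b)          # optimal solution w* = A^{-1} b
w_hat = gradient_descent(np.zeros(2))
print(w_hat, w_star)                    # the two should (nearly) agree
```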

Since $A$ is symmetric, it is diagonalizable as $A = Q \Lambda Q^T$, where $Q$ is the orthogonal matrix of eigenvectors $q_1, q_2, \dots, q_n$ and \begin{equation*} \Lambda = \begin{bmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_n \end{bmatrix} \end{equation*} is the diagonal matrix of eigenvalues. This is nice, because we can now decompose the iteration of gradient descent into individual eigen components. The following computation largely follows the awesome article Why Momentum Really Works, Goh, G., 2017.

In particular, we can view the deviation $w^k - w^*$ under the eigenbasis. Subtract $w^*$ from both sides of eq. (1), and change basis by multiplying with $Q^T$ on the left. We get \begin{equation} x^{k+1} = x^k - \alpha (Q^T A w^k - Q^T b), \quad x^k = Q^T (w^k - w^*) \end{equation} Ideally, we would like to get rid of the $b$. Recall that $w^* = A^{-1} b$, so we can solve the expression for $x^k$ for $b$, \begin{equation*} b = A w^k - A Q x^k \end{equation*} Substitute this back into eq. (2). The $w^k$ terms cancel nicely, and, using the diagonalizability of $A$, we get just what we wanted, \begin{equation*} \begin{split} x^{k+1} &= x^k - \alpha \Lambda x^k \\ &= (I - \alpha \Lambda) x^k \\ &= (I - \alpha \Lambda)^{k+1} x^0 \end{split} \end{equation*} which we can rewrite for the individual components as \begin{equation*} x^{k+1}_i = (1 - \alpha \lambda_i)^{k+1} x^0_i \end{equation*} Having decomposed the updates, we can now move back to the original basis, \begin{equation*} w^k - w^* = Q x^k = \sum_{i=1}^n (1-\alpha \lambda_i)^k x^0_i q_i \end{equation*} where $q_i$ is the $i$-th eigenvector of $A$. There we have it, gradient descent in closed form! Observe that if we want gradient descent to converge, the condition is $\lvert 1 - \alpha \lambda_i \rvert < 1$; or, \begin{equation} 0 < \alpha \lambda_i < 2 \end{equation}
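As a sanity check of the closed form, the sketch below (again with an arbitrary symmetric positive definite $A$, not one from the text) projects the gradient descent iterates onto the eigenbasis and verifies that each component decays exactly as $(1 - \alpha \lambda_i)^k x^0_i$.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((3, 3))
A = M @ M.T + np.eye(3)          # arbitrary symmetric positive definite matrix
b = rng.standard_normal(3)
w_star = np.linalg.solve(A, b)

lam, Q = np.linalg.eigh(A)       # A = Q diag(lam) Q^T
alpha = 1.0 / lam.max()          # satisfies 0 < alpha * lam_i < 2 for all i

w = np.zeros(3)
x0 = Q.T @ (w - w_star)          # initial deviation in the eigenbasis
for k in range(1, 21):
    w = w - alpha * (A @ w - b)              # gradient descent, eq. (1)
    x_iter = Q.T @ (w - w_star)              # deviation in the eigenbasis
    x_closed = (1 - alpha * lam) ** k * x0   # closed-form prediction
    assert np.allclose(x_iter, x_closed)
```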

Decomposing the error

This formulation of gradient descent allows for a simple interpretation. Namely, we start with some initial deviation from the optimal parameters as measured in the eigenspace by the components $x^0_i$, and each "eigen-component" follows its own path of exponential decrease with rate $1-\alpha \lambda_i$.

This exponential decrease explains why, for the usual step sizes, the first few iterations make large progress in minimizing the error, but with time the progress plateaus. Furthermore, based on the equation, we can confidently predict that larger eigenvalues correspond to faster error convergence (as long as $\alpha \lambda_i \leq 1$). Indeed we can test this hypothesis by looking at the eigen components of the error, \begin{equation*} \begin{split} f(w^k) - f(w^*) &= \frac{1}{2} \left[w^k\right]^T A w^k - b^T w^k - \frac{1}{2} \left[w^*\right]^T A w^* + b^T w^* \\ &= \frac{1}{2} \left[(w^k - w^*)^T A (w^k - w^*) \right. \\ &\hspace{6em} \left. + \left[w^k\right]^T A w^* + \left[w^*\right]^T A w^k - 2 \left[w^*\right]^T A w^* \right. \\ &\hspace{6em} \left. - 2 b^T w^k + 2 b^T w^* \right] \\ &= \frac{1}{2} (w^k-w^*)^T A (w^k-w^*) \\ &= \frac{1}{2} \left(Q^T(w^k-w^*)\right)^T \Lambda \left(Q^T(w^k-w^*)\right) \\ &= \frac{1}{2} \left[x^k\right]^T \Lambda x^k \\ &= \frac{1}{2} \sum_i \lambda_i \left[x_i^k\right]^2 \\ &= \frac{1}{2} \sum_i \lambda_i (1 - \alpha \lambda_i)^{2 k} \left[x_i^0\right]^2 \end{split} \end{equation*} where the remaining terms cancel because $b = A w^*$ and $A$ is symmetric. We perform a simple iteration in $\mathbb{R}^3$ for the eigenvalues $\lambda_1 = 0.4$, $\lambda_2 = 0.6$, and $\lambda_3 = 1.0$,

[Figure: error per iteration for each of the three eigen components, with an adjustable step size $\alpha$ (shown here for $\alpha = 1$). Axes: iteration vs. error.]
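The curves above are cheap to reproduce. A minimal sketch, assuming unit initial deviations $x^0_i = 1$ (the initial values behind the figure are not stated here), which evaluates the per-component errors $\frac{1}{2}\lambda_i (1-\alpha\lambda_i)^{2k}\left[x^0_i\right]^2$:

```python
import numpy as np

lambdas = np.array([0.4, 0.6, 1.0])   # eigenvalues from the example above
alpha = 1.0                           # step size
x0 = np.ones(3)                       # assumed initial deviations (not given in the text)

ks = np.arange(50)
# Per-component error: 1/2 * lambda_i * (1 - alpha*lambda_i)^(2k) * (x_i^0)^2
per_component = 0.5 * lambdas * (1 - alpha * lambdas) ** (2 * ks[:, None]) * x0**2
total_error = per_component.sum(axis=1)

for k in (0, 1, 5, 20):
    print(k, per_component[k], total_error[k])
```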

Clearly, some choices of $\alpha$ are better than others. The natural question is, which is the best? As has become evident, the issue is the "slowest" component, i.e. the one with the largest decay factor $\lvert 1 - \alpha \lambda_i \rvert$ (for the usual small step sizes, the component with the smallest eigenvalue), or, equivalently, the smallest rate of convergence. This component puts an upper bound on how fast we can converge; thus, an obvious approach is to minimize the worst-case decay factor, \begin{equation*} \begin{split} &\underset{\alpha}{\mathrm{arg\,min}} \enspace \underset{i}{\mathrm{max}} \; \lvert 1 - \alpha \lambda_i \rvert \\ &\quad = \underset{\alpha}{\mathrm{arg\,min}} \enspace \mathrm{max} \; \left\{ \lvert 1 - \alpha \lambda_1 \rvert, \, \lvert 1 - \alpha \lambda_n \rvert \right\} \end{split} \end{equation*} where we've ordered the eigenvalues from smallest to largest ($\lambda_1 \leq \lambda_2 \leq \dots \leq \lambda_n$). Such a minimum is obtained for the $\alpha$ at which the two decay factors are equal. This yields \begin{equation*} \alpha = \frac{2}{\lambda_1 + \lambda_n} \end{equation*} Plugging this back into the decay factor yields an optimal rate of \begin{equation*} \left\lvert \frac{\kappa - 1}{\kappa + 1} \right\rvert, \quad \kappa = \frac{\lambda_n}{\lambda_1} \end{equation*} The ratio $\lambda_n/\lambda_1$ appears often enough that it has a name - the condition number. In various settings it means various things, but roughly speaking it is a direct measure of how "hard" the problem is; i.e. a measure of pathological curvature.
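For the extreme eigenvalues of the example above ($\lambda_1 = 0.4$, $\lambda_n = 1.0$) this is easy to evaluate; a quick sketch:

```python
lam_min, lam_max = 0.4, 1.0            # eigenvalues from the running example

alpha_opt = 2 / (lam_min + lam_max)    # optimal step size
kappa = lam_max / lam_min              # condition number
rate = abs((kappa - 1) / (kappa + 1))  # optimal worst-case decay factor

print(alpha_opt)  # ~1.4286
print(kappa)      # 2.5
print(rate)       # ~0.4286, attained by both |1 - alpha*lam_min| and |1 - alpha*lam_max|
```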

A Physical Analogy

Take a look at the equation for gradient descent (reformulated a bit) \begin{equation} w^{k+1} - w^k = -\alpha \nabla f(w^k) \end{equation} How about we consider the continuous version \begin{equation*} \frac{d}{d t} w(t) = -\alpha \nabla f(w(t)) \end{equation*} where $w$ now takes the form of a continuous function of time rather than being indexed by discrete time steps. Here is where the physical analogy comes in. Let's assign to $f(w)$ the interpretation of potential energy, which we want minimized. Naturally, $w$ describes a position in parameter space, so $\frac{d}{d t} w(t)$ is our velocity term. Newton's second law ($F = ma$) suggests we also include an acceleration term, so let us do that, \begin{equation*} m \frac{d^2}{d t^2} w + \frac{d}{d t} w = -\alpha \nabla f(w) \end{equation*} Ideally, after some time, we want our "ball", whose position is described by $w$, to settle down. In physics, a common way to bring a system to a halt is to add friction, which was already hinted at by the velocity term, thus \begin{equation} m \frac{d^2}{d t^2} w + \mu \frac{d}{d t} w = -\alpha \nabla f(w) \end{equation} where $\mu$ is the friction coefficient.

That was the inspiration from physics! Since the end goal is to run the algorithm on a computer, we necessarily have to discretize eq. (5) again, \begin{equation*} m \frac{w_{t+\Delta t} + w_{t-\Delta t} - 2 w_t}{\Delta t^2} + \mu \frac{w_{t+\Delta t}-w_t}{\Delta t} = -\alpha \nabla f(w_t) \end{equation*} Rearranging terms we get \begin{equation*} w_{t+\Delta t} - w_t = -\frac{\alpha \Delta t^2}{m + \mu \Delta t} \nabla f(w_t) + \frac{m}{m+\mu\Delta t} (w_t - w_{t-\Delta t}) \end{equation*} Compare this with eq. (4). We can write \begin{equation} w^{k+1} - w^{k} = -\epsilon \nabla f(w^k) + \beta (w^k - w^{k-1}) \end{equation} where \begin{equation*} \epsilon = \frac{\alpha \Delta t^2}{m + \mu \Delta t}, \quad \beta = \frac{m}{m+\mu\Delta t} \end{equation*} It is clear that $\epsilon$ serves as the new step size. The $\beta$ term "tracks" the magnitude of the previous update; therefore, the name "momentum" is used to describe it. Note that $\beta=0$ implies $m=0$ and vice versa. Thus the common analogy of gradient descent being "massless", whereas momentum is a "heavy" ball rolling down a hill.
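Eq. (6) translates directly into code. A minimal sketch of the heavy-ball update on an illustrative quadratic (the matrix, vector, and the values of $\epsilon$ and $\beta$ are arbitrary choices, not from the text):

```python
import numpy as np

# Illustrative quadratic: f(w) = 1/2 w^T A w - b^T w
A = np.array([[1.0, 0.2],
              [0.2, 0.6]])
b = np.array([1.0, -0.5])

def grad_f(w):
    return A @ w - b

def momentum(w0, eps=0.1, beta=0.9, steps=200):
    """Heavy-ball update, eq. (6): w^{k+1} = w^k - eps*grad + beta*(w^k - w^{k-1})."""
    w_prev, w = w0.copy(), w0.copy()
    for _ in range(steps):
        w_next = w - eps * grad_f(w) + beta * (w - w_prev)
        w_prev, w = w, w_next
    return w

print(momentum(np.zeros(2)))     # should approach w* = A^{-1} b
print(np.linalg.solve(A, b))
```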

Analyzing Momentum

How does momentum speed up convergence to a local minimum? To address this question, we analyze eq. (6) by expanding the potential energy $f(w)$ around a minimum at $w_0$, \begin{equation*} \begin{split} & f(w) \approx f(w_0) + \nabla f(w_0)^T (w - w_0) + \frac{1}{2} (w - w_0)^T H (w - w_0) \\ &\quad \implies \nabla f(w) \approx H (w - w_0) \end{split} \end{equation*} The gradient $\nabla f(w_0)$ vanishes, since we are at a minimum, and $H$ is the Hessian defined as \begin{equation*} H_{i,j} = \frac{\partial^2 f(w)}{\partial w_i \partial w_j} \Big|_{w_0} \end{equation*} Use this approximation for $\nabla f(w^k)$ in eq. (6) and, without loss of generality, take $w_0 = 0$. After simple rearrangements we obtain the update rule, \begin{equation*} w^{k+1} = \left[(1 + \beta) I - \epsilon H\right] w^k - \beta w^{k-1} \end{equation*} What was the benefit of all this? Well, the Hessian is symmetric (this is not always the case; here we assume $\frac{\partial^2 f(w)}{\partial w_i \partial w_j} = \frac{\partial^2 f(w)}{\partial w_j \partial w_i}$), thus it is diagonalizable and we can apply the same trick we used to analyze gradient descent! In particular, $H = Q \Lambda Q^T$ where, as usual, $Q$ is the matrix of eigenvectors and $\Lambda$ is the diagonal matrix of eigenvalues $\lambda_1, \dots, \lambda_n$. Perform a change of basis $x^k = Q^T w^k$ and we get the component-wise update rule, \begin{equation*} x^{k+1}_i = \left[1 + \beta - \epsilon \lambda_i\right]x^k_i - \beta x^{k-1}_i \end{equation*} We can now use a clever trick (see On the momentum term in gradient descent learning algorithms, Qian, N., 1999, for the source of the trick) that begins with rewriting the update rule in matrix-vector form \begin{equation*} \begin{pmatrix} x^k_i \\ x^{k+1}_i \end{pmatrix} = A \begin{pmatrix} x^{k-1}_i \\ x^k_i \end{pmatrix}, \quad A = \begin{pmatrix} 0 & 1 \\ -\beta & 1 + \beta - \epsilon \lambda_i \end{pmatrix} \end{equation*} and then in closed form \begin{equation*} \begin{pmatrix} x^k_i \\ x^{k+1}_i \end{pmatrix} = A^k \begin{pmatrix} x^0_i \\ x^1_i \end{pmatrix} \end{equation*} In this form it is self-evident that if we want momentum to converge, we need $A^k$ to converge as $k \to \infty$. There is a nice formula for the powers of a $2 \times 2$ matrix in terms of its eigenvalues (The $n$-th Power of a $2 \times 2$ Matrix, Williams, K. S.), \begin{equation*} A^k = \begin{cases} \sigma_1^k \frac{A - \sigma_2 I}{\sigma_1 - \sigma_2} + \sigma_2^k \frac{A - \sigma_1 I}{\sigma_2 - \sigma_1}, \quad &\text{if}\; \sigma_1 \neq \sigma_2, \\ \sigma_1^{k-1} \left(k A - (k - 1) \sigma_1 I\right), \quad &\text{if}\; \sigma_1 = \sigma_2 \end{cases} \end{equation*} where $\sigma_1, \sigma_2$ are the eigenvalues of $A$. For our particular matrix, the eigenvalues take the form \begin{equation*} \sigma_{\left\{1,2\right\}, i} = \frac{1 + \beta - \epsilon \lambda_i \pm \sqrt{(1 + \beta - \epsilon \lambda_i)^2 - 4 \beta}} {2} \end{equation*} The conclusion is that if we want momentum to converge, we require $\lvert \sigma_{1, i} \rvert < 1$ and $\lvert \sigma_{2, i} \rvert < 1$. This is reminiscent of the condition of convergence for gradient descent; however, instead of a single sequence, here we have two coupled ones.
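The eigenvalue formula is easy to check numerically. A small sketch (the values of $\beta$ and $\epsilon\lambda_i$ are arbitrary illustrative choices) that compares the closed-form $\sigma_{1,2}$ with the eigenvalues NumPy reports for $A$, and verifies that $A^k$ shrinks towards zero when both magnitudes are below one:

```python
import numpy as np

beta, eps_lam = 0.9, 1.5        # arbitrary choices of beta and eps*lambda_i

A = np.array([[0.0, 1.0],
              [-beta, 1.0 + beta - eps_lam]])

# Closed-form eigenvalues sigma_{1,2}
disc = (1 + beta - eps_lam) ** 2 - 4 * beta + 0j
sigma = ((1 + beta - eps_lam) + np.array([1, -1]) * np.sqrt(disc)) / 2

print(np.sort_complex(sigma))
print(np.sort_complex(np.linalg.eigvals(A)))   # should match the closed form

# If max |sigma| < 1, the powers of A (and hence momentum) converge.
print(np.abs(sigma).max())                     # ~0.9487 here
print(np.linalg.matrix_power(A, 200))          # entries close to zero
```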
More succinctly, convergence is ensured when \begin{equation*} \max \left\{ \lvert \sigma_{1, i} \rvert, \lvert \sigma_{2, i} \rvert \right\} < 1 \end{equation*} This quantity also determines the convergence rate of momentum, which is set by the slower of the two, i.e. the larger of $\lvert \sigma_{1, i} \rvert$ and $\lvert \sigma_{2, i} \rvert$. It is therefore instructive to plot the quantity as a function of the parameters $\beta$ (momentum) and $\epsilon \lambda_i$ (step size).

[Figure: heatmap of the convergence rate $\max \left\{ \lvert \sigma_{1, i} \rvert, \lvert \sigma_{2, i} \rvert \right\}$ as a function of the step size $\epsilon\lambda_i$ (0 to 4) and the momentum $\beta$ (0 to 1), with the colour scale running from 0.0 to 3.0, alongside the corresponding error per iteration.]

Immediately we can observe different regimes for momentum. Most of the parameter space gives rise to oscillatory behaviour of the error (typical of momentum!). There are, however, regimes characterized by monotonic decrease (lower left corner) and ones characterized by divergence (lower right corner). The bluish region is essentially the region of divergence where $\max \left\{ \lvert \sigma_{1, i} \rvert, \lvert \sigma_{2, i} \rvert \right\} > 1$.

It isn't hard to observe that the range of step sizes under which momentum converges depends on the value of the momentum parameter. Formally, the system converges if and only if (consult Appendix A of On the momentum term in gradient descent learning algorithms for a formal proof of this result) \begin{equation*} -1 < \beta < 1 \quad \text{and} \quad 0 < \epsilon \lambda_i < 2 + 2 \beta \end{equation*} Compare this with eq. (3). In essence, momentum has increased the permissible step size!
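A quick numerical check of this boundary (a sketch working directly with the product $\epsilon\lambda_i$; the sampled values of $\beta$ are arbitrary): just inside $\epsilon\lambda_i = 2 + 2\beta$ the spectral radius of $A$ dips below one, just outside it exceeds one.

```python
import numpy as np

def spectral_radius(beta, eps_lam):
    A = np.array([[0.0, 1.0],
                  [-beta, 1.0 + beta - eps_lam]])
    return np.abs(np.linalg.eigvals(A)).max()

for beta in (0.0, 0.5, 0.9):
    boundary = 2 + 2 * beta
    inside = spectral_radius(beta, boundary - 1e-3)   # just inside the region
    outside = spectral_radius(beta, boundary + 1e-3)  # just outside the region
    print(beta, inside < 1, outside > 1)              # expect True, True
```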

Optimal parameters

We now consider the question of optimal parameters for the momentum algorithm. Unfortunately, this requires a rather involved procedure, so the details are omitted. The complication stems from the requirement to optimize over $\sigma_{\left\{1,2\right\}, i}$ for $\beta$ and $\epsilon$ for all $i$, i.e. \begin{equation*} \underset{\beta,\, \epsilon}{\mathrm{arg\,min}} \enspace \underset{i}{\mathrm{max}} \; \left\{ \max \left\{ \lvert \sigma_{1,i} \rvert, \lvert \sigma_{2,i} \rvert \right\} \right\} \end{equation*} Once carried out, however, we obtain \begin{equation*} \beta = \left( \frac{\sqrt{\lambda_n} - \sqrt{\lambda_1}}{\sqrt{\lambda_n} + \sqrt{\lambda_1}} \right)^2, \quad \epsilon = \left( \frac{2}{\sqrt{\lambda_1} + \sqrt{\lambda_n}} \right)^2 \end{equation*} which also gives us a convergence rate of \begin{equation*} \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1} \end{equation*} compared to gradient descent's \begin{equation*} \frac{\kappa - 1}{\kappa + 1} \end{equation*} By using momentum we have reduced the effect of the condition number! Of course, the setting considered here is a toy one, and one has to be careful when extrapolating to real world scenarios. Nevertheless, the analysis so far has given us some insights. For instance, if the problem we are dealing with is ill-conditioned (i.e. $\kappa \gg 1$), the optimal step size is approximately twice that of gradient descent, while the optimal momentum parameter $\beta$ is close to $1$.
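To get a feel for the difference, a short sketch comparing the optimal parameters and rates for an arbitrary ill-conditioned spectrum with $\lambda_1 = 0.01$ and $\lambda_n = 1$ (values chosen purely for illustration):

```python
import numpy as np

lam1, lamn = 0.01, 1.0                   # arbitrary ill-conditioned spectrum
kappa = lamn / lam1                      # condition number = 100

alpha_gd = 2 / (lam1 + lamn)                              # optimal gradient descent step size
eps_mom = (2 / (np.sqrt(lam1) + np.sqrt(lamn))) ** 2      # optimal momentum step size
beta_mom = ((np.sqrt(lamn) - np.sqrt(lam1)) /
            (np.sqrt(lamn) + np.sqrt(lam1))) ** 2         # optimal momentum parameter

rate_gd = (kappa - 1) / (kappa + 1)                       # ~0.980
rate_mom = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)    # ~0.818

print(eps_mom / alpha_gd)   # ~1.67 here; tends to 2 as kappa grows
print(beta_mom)             # ~0.67 here; tends to 1 as kappa grows
print(rate_gd, rate_mom)
```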

Concluding remarks

Momentum is a workhorse algorithm in modern optimization settings. Its wide adoption has also been accompanied by heuristics about why momentum works and what the best parameters are. This article has demonstrated that many of these heuristics have a firm grounding in mathematical analysis - such as the "heavy ball" analogy and the choice of $\beta$ very close to $1$. Hopefully, this has provided the reader with a more solid understanding of momentum. Nevertheless, one needs to stay aware of the fact that not everything is a nail once you have a hammer. Comparing momentum, a first order algorithm, to higher order techniques could shed light on the limitations of momentum, which were not discussed in this article. This is left for the future!