Sketchy Polytopes

Accelerating first-order methods

The lower bound on the oracle complexity of continuously differentiable, $\beta$-smooth convex function is $O(\frac{1}{\sqrt{\epsilon}})$ [Theorem 2.1.6, Nesterov04; Theorem 3.8, Bubeck14; Nesterov08]. General first-order gradient descent does not achieve this - e.g. L1-Prox, or ISTA, achieves $O(\frac{1}{\epsilon})$. Nesterov, in a series of papers [Nesterov83, Nesterov88, Nesterov07], proposed techniques to improve the convergence rate for smooth functions to $O(\frac{1}{t^2})$. In this post, we discuss Nesterov acceleration.

Accelerated gradient descent: $\alpha$-strongly convex, $\beta$-smooth

Algorithm: Given a starting point x1 = y1, then for t ≥ 1

Theorem: For α-strongly convex, β-smooth functions with condition number Q = β/α, Nesterov’s accelerated gradient descent satisfies [Theorem 3.11, Bubeck14], which comes close to the lower bound [Theorem 2.1.13, Nesterov04]:

Proof sketch:

  1. Following [Bubeck14], one can define an α-strongly convex quadratic approximation to f(x) that improves with each iteration:

Then, using the definition of α-strong convexity, one can show that

  1. Now, assuming the following inequality is true,
  1. Then one can argue that

Accelerated gradient descent: $\beta$-smooth

Algorithm: Given a starting point x1 = y1, then for t ≥ 1

Theorem: For non-strictly convex functions, where α = 0, gradient descent satisfies [Theorem 3.12, Bubeck14], which is within a constant factor of the lower bound [Theorem 2.1.7 in Nesterov04]:

Proof sketch:

  1. In a gradient descent scheme for β-smooth functions [Lemma 3.4, Bubeck14]
  1. Let δs = f(ys) - f(x^{\ast}), then scaling the first inequality by (λs - 1) and adding the result to second
  1. Scaling by λs and after some manipulation
  1. Summing these inequalities from k = 1 to k = t - 1, yields

Interpretation: Chebychev polynomials

Moritz Hardt elegantly describes an interpretation of accelerated gradient descent based on function interpolation [Hardt13]. Gradient descent can be seen as approximating a function using a degree-k polynomial. For strongly convex functions, the derivative can be expressed as a linear combination of previous steps i.e.

Assuming that ∑λs = 1,

Thus given that

$P(\mu)$ needs to be chosen over all eigenvalues of A so that (1) $P_{k}(\mu)$ = 1, for $\mu$ = 0, and (2) its maximum norm is minimized. This is hard since it requires knowing all eigenvalues but can be achieved, under certain conditions, if the domain is assumed to be continuous [Kelner09]. In short, a scaled and shifted Chebychev polynomial is the unique polynomial that minimizes this approximation error. For Chebychev polynomial, the maximum error is

Moreover, since Chebychev polynomials can be expressed recursively, required only the two previously calculated polynomials, the gradient descent update only depends on last two values.

This method seems to be well known in Numerical Analysis, where it is used to speed up iterative linear solvers [chapter 4, Kincaid02; section 7.4, Saad11]. It also seems to predate Nesterov’s work [Manteuffel77].

Extension: Bregman Divergence

Paul Tseng uses Bregman Divergences to give a unified framework for Nesterov’s methods across different classes of problems [Tseng08].

See also [Taboulle10, Candes11, Hao11, Gordon12] for a good overview of first-order acceleration methods. Next post discusses techniques that utilize accelerated gradient descent to solve regularization problems [Nesterov07, Beck09, Becker09].

References

  • Beck09 A. Beck, M. Teboulle. A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems. SIAM Journal of Imaging Sciences, 2009
  • Becker09 S. Becker, J. Bobin, E. Candes. NESTA: A Fast and Accurate First-Order Method for Sparse Recovery. Technical Report, Caltech, 2009
  • Bubeck14 S. Bubeck, Theory of Convex Optimization for Machine Learning
  • Candes11 E. Candes. Math 301, Lectures Notes.
  • Chen11 H. Chen, X. Ming. Accelerating Nesterov’s Method for Strongly Convex Functions. Course presentation, 2011.
  • Hardt13 M. Hardt. The Zen of Gradient Descent. Blog, 2013.
  • Karniadakis00 G. E. Karniadakis. R. M. Kirby. Parallel Scientific Computing in C++ and MPI. 2000
  • Kelner09 J. Kelner. Lecture Notes. MIT 18.409, Lecture 22, 2009
  • Kincaid02 D. R. Kincaid, E. W. Cheney. Numerical Analysis: Mathematics of Scientific Computing. AMS, 2002
  • [Nesterov83] Y. Nesterov. A method for solving a convex programming problem with convergence rate O(1/k2). Dokaldy AN SSR, 1983
  • [Nesterov88] Y. Nesterov. On an approach to the construction of optimal methods of minimization of smooth convex functions. Ekonom. i. Mat. Metody, 1988
  • [Nesterov04] Y. Nesterov. Introductory Lectures On Convex Programming: A Basic Course. Kluwer Academic Publishers, 2004
  • Nesterov07 Y. Nesterov. Gradient methods for minimizing composite objective function. Report, CORE, 2007
  • Nesterov08 Y. Nesterov. How to advance in Structural Convex Optimization. Optima, 2008
  • Manteuffel77 T. A. Manteuffel. The Tchebyshev iteration for nonsymmetric linear systems. Numer. Math, 1977.
  • Saad11 Y. Saad. Numerical Methods for Large Eigenvalue Problems. SIAM, 2011
  • Taboulle10. M. Taboulle. First-Order Methods for Optimization. IPAM Optimization Tutorials, 2010
  • Gordon12 G. Gordon, R. Tibshirani. Course Notes, 10-725 Optimization, Fall 2012
  • Tseng08 P. Tseng. On Accelerated Proximal Gradient Algorithms for Convex-Concave Optimization. SIAM Journal of Optimization, 2008

Credit

Main image from Wolfram’s Chebyshev Polynomial of the First Kind: “A beautiful plot can be obtained by plotting $T_{n}(x)$ radially, increasing the radius for each value of $n$, and filling in the areas between the curves (Trott 1999, pp. 10 and 84).”