2021/03/28  Views: 63  Theme: default

AdaHessian paper reading notes: Introduction


  1. AdaHessian: a second-order stochastic optimization algorithm that dynamically incorporates the curvature of the loss function via adaptive estimates of the Hessian.
  2. We show that AdaHessian achieves new state-of-the-art results by a large margin compared with other adaptive optimization methods, including variants of Adam.


  1. First-order methods such as stochastic gradient descent (SGD) are not necessarily the best way to train a neural network.
  2. Many ad-hoc rules also affect the results:
    • Choice of first-order optimizer
    • Hyperparameters
  3. As a result, one has to babysit the optimizer to make sure that training converges to an acceptable training loss, without any guarantee that a given number of iterations is enough to reach a local minimum.

However, these problems may not arise in some popular learning tasks, such as ResNet50 training on ImageNet, because years of hyperparameter tuning have already found good hyperparameters.


Cause of the problems above: first-order methods use only gradient information and do not consider the curvature properties of the loss landscape.

Advantages of second-order methods

  • specifically designed to capture and exploit the curvature of the loss landscape
  • incorporate both gradient and hessian information
  • also have many favorable properties
    • resiliency to ill-conditioned loss landscapes
    • invariance to parameter scaling
    • robustness to hyperparameters

Main idea of second-order methods:


  1. For a general problem, different parameter dimensions exhibit different curvature properties.

    • e.g. the loss could be very flat in one dimension and very sharp in another
      • As a result, the step size taken by the optimizer should be different for these dimensions
        • larger steps for flat dimensions
        • smaller steps for sharp dimensions
  2. Second-order methods capture this curvature difference by normalizing the dimensions through rotation and scaling of the gradient vector before the weight update.

  3. Disadvantage of second-order methods:

    • High computational cost
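
The flat-versus-sharp argument above can be made concrete with a small sketch. It is only an illustration under stated assumptions: the loss is a hypothetical two-dimensional quadratic whose Hessian is exactly diagonal, so the "rotation" part of a full second-order method is not needed and only the scaling is shown. The names `CURVATURE`, `plain_gradient_descent` and `curvature_scaled_descent` are made up for this example and do not come from the paper.

```python
# Toy loss: f(p) = 0.5 * sum(c_i * p_i**2), so the Hessian is diag(c).
# CURVATURE and all function names are illustrative assumptions.
CURVATURE = [1.0, 100.0]  # flat first dimension, sharp second dimension


def gradient(params):
    """Gradient of the toy quadratic: df/dp_i = c_i * p_i."""
    return [c * p for c, p in zip(CURVATURE, params)]


def plain_gradient_descent(params, lr, steps):
    """First-order update: the same step size for every dimension."""
    for _ in range(steps):
        grads = gradient(params)
        params = [p - lr * g for p, g in zip(params, grads)]
    return params


def curvature_scaled_descent(params, lr, steps):
    """Divide each gradient entry by its curvature before the update."""
    for _ in range(steps):
        grads = gradient(params)
        params = [p - lr * g / c for p, g, c in zip(params, grads, CURVATURE)]
    return params


start = [1.0, 1.0]
# The shared step size must stay below 2 / 100, or the sharp dimension diverges.
plain = plain_gradient_descent(start, lr=0.015, steps=50)
scaled = curvature_scaled_descent(start, lr=0.5, steps=50)
print("plain :", plain)
print("scaled:", scaled)
```

In this toy setting the sharp dimension caps the step size that plain gradient descent can use, so the flat coordinate is still far from zero after 50 steps. Scaling by the curvature effectively gives the flat dimension a larger step and the sharp dimension a smaller one, and both coordinates shrink at the same rate.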

What this paper contributes:

The paper shows that it is possible to approximately compute an exponential moving average of the Hessian and use it to precondition the gradient adaptively. More details:

  1. To reduce the overhead of second-order methods, we approximate the Hessian as a diagonal operator.
    • This is achieved by applying Hutchinson's method to approximate the diagonal of the Hessian.
      • Importantly, this approximation allows us to efficiently apply a root-mean-square exponential moving average to smooth out "rugged" loss surfaces.
    • The advantage of this approach: it has O(d) memory complexity (d being the number of parameters), since only the diagonal is stored
  2. We incorporate block diagonal averaging to reduce the variance of the Hessian diagonal elements.
    • This adds no computational overhead to Hutchinson's method, but it favourably affects the performance of the optimizer
  3. We show that AdaHessian is robust to hyperparameters such as
    • learning rate
    • block diagonal averaging size
    • delayed Hessian computation
  4. We extensively test AdaHessian on a wide range of learning tasks and show that, in all tests, AdaHessian significantly outperforms other adaptive optimization methods
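
The ingredients listed above (Hutchinson's diagonal estimate, block averaging, and a root-mean-square moving average) can be sketched in a few lines of plain Python. This is a minimal sketch under assumptions, not the paper's implementation: the Hessian is a small hypothetical matrix `H`, the Hessian-vector product `toy_hvp` is an explicit matrix-vector product rather than automatic differentiation, and the blocks are simply runs of consecutive entries. How blocks are laid out in a real network is not covered by these notes. All function names and the values of `num_samples`, `block_size` and `beta2` are illustrative.

```python
import math
import random


def hutchinson_diagonal(hvp, dim, num_samples, rng):
    """Estimate diag(H) as the average of z * (H z) over Rademacher vectors z."""
    estimate = [0.0] * dim
    for _ in range(num_samples):
        z = [rng.choice((-1.0, 1.0)) for _ in range(dim)]
        hz = hvp(z)  # only a Hessian-vector product is needed, never H itself
        for i in range(dim):
            estimate[i] += z[i] * hz[i] / num_samples
    return estimate


def block_average(diag, block_size):
    """Replace each entry by the mean of its block of consecutive entries."""
    averaged = []
    for start in range(0, len(diag), block_size):
        block = diag[start:start + block_size]
        averaged.extend([sum(block) / len(block)] * len(block))
    return averaged


def update_second_moment(moment, diag, beta2):
    """Exponential moving average of the squared diagonal estimate."""
    return [beta2 * m + (1.0 - beta2) * d * d for m, d in zip(moment, diag)]


def rms_denominator(moment, beta2, step):
    """Bias-corrected square root of the moving average."""
    correction = 1.0 - beta2 ** step
    return [math.sqrt(m / correction) for m in moment]


# Toy symmetric Hessian with known diagonal [4, 3, 2, 1] (hypothetical data).
H = [
    [4.0, 0.5, 0.0, 0.0],
    [0.5, 3.0, 0.5, 0.0],
    [0.0, 0.5, 2.0, 0.5],
    [0.0, 0.0, 0.5, 1.0],
]


def toy_hvp(v):
    """Explicit matrix-vector product; a real model would use autodiff here."""
    return [sum(h * x for h, x in zip(row, v)) for row in H]


rng = random.Random(0)
diag_estimate = hutchinson_diagonal(toy_hvp, dim=4, num_samples=2000, rng=rng)
smoothed = block_average(diag_estimate, block_size=2)
moment = update_second_moment([0.0] * 4, smoothed, beta2=0.999)
denominator = rms_denominator(moment, beta2=0.999, step=1)
print(diag_estimate, smoothed, denominator)
```

The estimator stores only vectors of length `dim`, which is where the linear memory cost comes from. Block averaging reuses the values Hutchinson's method already produced, so it needs no extra Hessian-vector products. In an optimizer, the gradient (or its momentum) would then be divided elementwise by `denominator`, in the same place where Adam divides by the root of its squared-gradient average.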

