
希仔

2021/03/28

AdaHessian Paper Reading Notes: Introduction

Abstract

  1. AdaHessian: a second order stochastic optimizer algorithm which dynamically incorporates the curvature of the loss via the Hessian.
  2. We show that AdaHessian achieves new state-of-the-art results by a large margin as compared to other adaptive optimization methods, including variants of Adam.

Introduction

P1
  1. First order methods like stochastic gradient descent (SGD) are not necessarily the best way to train a neural network.
  2. At the same time, many ad-hoc rules affect the results:
    • choice of first order optimizer
    • hyperparameters
  3. As a result, one has to babysit the optimizer to make sure that training converges to an acceptable training loss, without any guarantee that a given number of iterations is enough to reach a local minimum.
P2

However, these problems may not exist for some popular learning tasks, such as ResNet50 training on ImageNet, because after years of hyperparameter tuning, good hyperparameters have already been found.

P3

The cause of the above problems: first order methods only use gradient information and do not consider the curvature properties of the loss landscape.

Advantages of second order methods:

  • specifically designed to capture and exploit the curvature of the loss landscape
  • incorporate both gradient and Hessian information
  • also have many favorable properties:
    • resiliency to ill-conditioned loss landscapes
    • invariance to parameter scaling
    • robustness to hyperparameters

Main idea of second order methods:

Precondition the gradient vector before using it for the weight update.
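In symbols, such a preconditioned update takes the standard Newton-type form (generic notation, not a formula quoted from the paper):

$$
\theta_{t+1} = \theta_t - \eta\, H_t^{-1} g_t
$$

where $g_t$ is the gradient, $H_t$ is (an approximation of) the Hessian, and $\eta$ is the learning rate; with $H_t = I$ this reduces to plain gradient descent.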

  1. For a general problem, different parameter dimensions exhibit different curvature properties.

    • e.g. the loss could be very flat in one dimension and very sharp in another
      • As a result, the step size taken by the optimizer should be different for these dimensions
        • larger steps for flat dimensions
        • smaller steps for sharp dimensions
  2. Second order methods capture this curvature difference by normalizing the different dimensions, through rotation and scaling of the gradient vector before the weight update (see the sketch after this list).

  3. Disadvantage of second order methods:

    • High computational cost
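As a concrete illustration of the flat-vs-sharp point above, here is a minimal sketch (a hypothetical toy example, not code from the paper) of a diagonally preconditioned update on a 2D quadratic loss. Dividing each gradient component by its curvature effectively takes large steps in the flat dimension and small steps in the sharp one, so both coordinates converge at the same rate.

```python
import numpy as np

# Toy loss L(w) = 0.5 * (a*w0^2 + b*w1^2): dimension 0 is flat (a small),
# dimension 1 is sharp (b large). Hypothetical example, not from the paper.
a, b = 0.01, 100.0

def grad(w):
    return np.array([a * w[0], b * w[1]])

def hessian_diag(w):
    # For this quadratic the Hessian is exactly diag(a, b).
    return np.array([a, b])

w = np.array([1.0, 1.0])
lr = 0.5
for _ in range(10):
    g = grad(w)
    h = hessian_diag(w)
    # Diagonal preconditioning: divide each gradient component by its curvature.
    w = w - lr * g / h

print(w)  # both coordinates have shrunk by the same factor, 0.5**10
```

For comparison, plain gradient descent on this loss needs a learning rate below roughly 2/b = 0.02 to keep the sharp dimension stable, and at that rate the flat dimension (curvature 0.01) barely moves in 10 steps.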

What this paper contributes:

Showed that it is possible to approximately compute an exponential moving average of the Hessian and use it to adaptively precondition the gradient. More details:

  1. To reduce the overhead of second order methods, we approximate the Hessian as a diagonal operator.
    • This is achieved by applying Hutchinson's method to approximate the diagonal of the Hessian (see the sketch after this list).
      • Importantly, this approximation allows us to efficiently apply a root-mean-square exponential moving average to smooth out "rugged" loss surfaces.
    • The advantage of this approach: it has O(d) memory complexity, the same as first order methods.
  2. We incorporate block diagonal averaging to reduce the variance of the Hessian diagonal elements.
    • This adds no computational overhead to Hutchinson's method, but it favorably affects the performance of the optimizer.
  3. Showed that AdaHessian is robust to hyperparameters such as
    • learning rate
    • block diagonal averaging size
    • delayed Hessian computation
  4. Extensively tested AdaHessian on a wide range of learning tasks and showed that, in all tests, AdaHessian significantly outperforms other adaptive optimization methods.
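As a companion to contribution 1, here is a minimal PyTorch sketch of Hutchinson's diagonal estimator; the function name, the number of probe samples, and the tiny usage model are my own illustrative choices, not the paper's reference implementation.

```python
import torch

def hutchinson_hessian_diag(loss, params, n_samples=4):
    """Estimate diag(H) of `loss` w.r.t. `params` with Hutchinson's method:
    diag(H) ~= E_z[z * (H z)] for Rademacher-distributed z (entries +-1)."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    estimates = [torch.zeros_like(p) for p in params]
    for _ in range(n_samples):
        # One Rademacher probe vector per parameter tensor.
        zs = [torch.randint_like(p, high=2) * 2.0 - 1.0 for p in params]
        # Hessian-vector product H z, obtained by differentiating g . z.
        hvps = torch.autograd.grad(grads, params, grad_outputs=zs, retain_graph=True)
        for est, z, hv in zip(estimates, zs, hvps):
            est += z * hv / n_samples
    return estimates

# Illustrative usage on a tiny model (not from the paper):
model = torch.nn.Linear(4, 1)
x, y = torch.randn(8, 4), torch.randn(8, 1)
loss = torch.nn.functional.mse_loss(model(x), y)
diag_h = hutchinson_hessian_diag(loss, list(model.parameters()))
```

Contributions 1–2 then smooth such per-element estimates with a root-mean-square exponential moving average over iterations and with block-wise averaging across parameters; those steps are omitted from this sketch.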
