希仔
2021/03/28
AdaHessian Paper Reading Notes: Introduction
Abstract

AdaHessian: a second-order stochastic optimization algorithm that dynamically incorporates the curvature of the loss function via adaptive estimates of the Hessian.
We show that AdaHessian achieves new state-of-the-art results by a large margin compared to other adaptive optimization methods, including variants of Adam.
Introduction
P1

First-order methods such as stochastic gradient descent (SGD) are not necessarily the best way to train neural networks,
and many ad hoc choices affect the result:

- choice of first-order optimizer
- hyperparameters


As a result, one has to babysit the optimizer to make sure that training converges to an acceptable training loss, without any guarantee that a given number of iterations is enough to reach a local minimum.
P2
However, the above problems may not arise in some popular learning tasks, such as ResNet50 training on ImageNet, because after years of hyperparameter tuning, good hyperparameter settings have already been found.
P3
Cause of the above problems: first-order methods use only gradient information and do not consider the curvature properties of the loss landscape.
Advantages of second-order methods:

- specifically designed to capture and exploit the curvature of the loss landscape
- incorporate both gradient and Hessian information
- have many favorable properties:
  - resiliency to ill-conditioned loss landscapes
  - invariance to parameter scaling
  - robustness to hyperparameters

Main idea of second-order methods:
precondition the gradient vector before using it for the weight update.

For a general problem, different parameter dimensions exhibit different curvature properties.

- e.g., the loss could be very flat in one dimension and very sharp in another
- as a result, the step size taken by the optimizer should differ across these dimensions:
  - larger steps for flat dimensions
  - smaller steps for sharp dimensions




Second-order methods capture the curvature difference by normalizing different dimensions through rotation and scaling of the gradient vector before the weight update.
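As a toy illustration of the preconditioning idea (a minimal sketch — the quadratic loss, curvature values, and learning rate below are hypothetical, not from the paper), dividing each gradient coordinate by its curvature makes the update take comparably sized steps in sharp and flat dimensions:

```python
import numpy as np

# Hypothetical quadratic loss L(w) = 0.5 * sum(h_i * w_i^2) with a very
# sharp first dimension and a very flat second dimension.
h = np.array([100.0, 0.01])   # per-dimension curvature (Hessian diagonal)
w = np.array([1.0, 1.0])
g = h * w                     # gradient of the quadratic loss
lr = 0.01

# Plain first-order step: huge in the sharp dimension, tiny in the flat one.
sgd_step = -lr * g

# Curvature-preconditioned (Newton-like) step: dividing by the curvature
# yields the same relative progress in every dimension.
precond_step = -lr * g / h

print(sgd_step)      # [-1.e+00 -1.e-04]
print(precond_step)  # [-0.01 -0.01]
```

Note how the preconditioned step is identical in both dimensions regardless of how sharp each one is — this is the parameter-scaling invariance mentioned above.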

Disadvantages of second-order methods:

High computational cost

What this paper contributes:
It shows that one can approximately compute an exponential moving average of the Hessian and use it to precondition the gradient adaptively. More details:

- To reduce the overhead of second-order methods, the Hessian is approximated as a diagonal operator.
- This is achieved by applying Hutchinson's method to approximate the diagonal of the Hessian.
- Importantly, this approximation makes it possible to efficiently apply a root-mean-square exponential moving average to smooth out "rugged" loss surfaces.
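A minimal sketch of Hutchinson's diagonal estimator (the dense 2x2 matrix `H` below is a stand-in for demonstration; in practice only Hessian-vector products, obtained by backpropagation, are needed, and `beta2 = 0.999` is an illustrative value, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)

def hutchinson_diag(hvp, dim, n_samples=2000):
    """Estimate diag(H) as the mean of z * hvp(z) over Rademacher vectors z.

    E[z * (H @ z)] = diag(H), since E[z_i * z_j] = 1 if i == j and 0 otherwise.
    """
    est = np.zeros(dim)
    for _ in range(n_samples):
        z = rng.choice([-1.0, 1.0], size=dim)  # Rademacher sample
        est += z * hvp(z)
    return est / n_samples

# Toy Hessian; the estimator only ever multiplies by H (a Hessian-vector
# product), so the full matrix is never materialized in a real implementation.
H = np.array([[4.0, 1.0],
              [1.0, 2.0]])
d_est = hutchinson_diag(lambda z: H @ z, dim=2)
print(d_est)  # close to the true diagonal [4., 2.]

# Root-mean-square exponential moving average of the squared diagonal,
# analogous to Adam's second-moment estimate:
v = np.zeros(2)
beta2 = 0.999
v = beta2 * v + (1 - beta2) * d_est ** 2
```

The randomized estimator is noisy for a single sample; averaging over samples (and, as described next, over blocks of parameters) reduces that variance.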


The advantage of this approach: its memory complexity is only O(d), where d is the number of model parameters, i.e., the same order as first-order methods.


- We incorporate block diagonal averaging to reduce the variance of the Hessian diagonal elements.
- This adds no computational overhead to Hutchinson's method, but it favorably affects the performance of the optimizer.
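A sketch of the block averaging step (the block size `b` and the numbers are illustrative): each diagonal element is replaced by the mean over its block, which smooths out noise in the Hutchinson estimate:

```python
import numpy as np

def block_average(d, b):
    """Replace each Hessian-diagonal element by the mean of its block of size b."""
    assert len(d) % b == 0, "parameter count must be divisible by the block size"
    return np.repeat(d.reshape(-1, b).mean(axis=1), b)

d = np.array([1.0, 3.0, 10.0, 20.0])  # noisy per-parameter diagonal estimate
smoothed = block_average(d, b=2)
print(smoothed)  # [ 2.  2. 15. 15.]
```

Averaging within a block is a single reshape-and-mean, hence the "no additional computational overhead" claim.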


Showed that AdaHessian is robust to hyperparameters such as:

- learning rate
- block diagonal averaging size
- delayed Hessian computation


Extensively tested AdaHessian on a wide range of learning tasks, showing that in all tests AdaHessian significantly outperforms other adaptive optimization methods.