AdaGrad is a stochastic optimization method that adapts the learning rate to the parameters. It performs smaller updates for parameters associated with frequently occurring features, and larger updates for parameters associated with infrequently occurring features. In its update rule, AdaGrad modifies the general learning rate $\eta$ at each time step $t$ for every parameter $\theta_{i}$ based on the past gradients for $\theta_{i}$:

$$ \theta_{t+1, i} = \theta_{t, i} - \frac{\eta}{\sqrt{G_{t, ii} + \epsilon}}\, g_{t, i} $$

Here $g_{t,i}$ is the gradient of the objective with respect to $\theta_{i}$ at step $t$, $G_{t}$ is a diagonal matrix whose entry $G_{t,ii}$ is the sum of the squares of the past gradients with respect to $\theta_{i}$, and $\epsilon$ is a small smoothing term that avoids division by zero.

The benefit of AdaGrad is that it eliminates the need to manually tune the learning rate; most implementations leave it at a default value of $0.01$. Its main weakness is the accumulation of the squared gradients in the denominator: since every added term is positive, the accumulated sum keeps growing during training, causing the effective learning rate to shrink until it becomes infinitesimally small and the algorithm can no longer make meaningful updates.
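The update rule above can be sketched in a few lines of NumPy. This is a minimal illustration of the per-parameter accumulation, not a reference implementation; the function name `adagrad_step` and the toy quadratic objective are chosen here for demonstration.

```python
import numpy as np

def adagrad_step(theta, grad, G, eta=0.01, eps=1e-8):
    """One AdaGrad update.

    G holds the running sum of squared gradients per parameter,
    i.e. the diagonal entries G_{t,ii} from the update rule.
    """
    G = G + grad ** 2                               # accumulate squared gradients
    theta = theta - eta / np.sqrt(G + eps) * grad   # per-parameter scaled step
    return theta, G

# Minimize f(theta) = theta^2 (gradient 2*theta) starting from theta = 5.
theta = np.array([5.0])
G = np.zeros_like(theta)
for _ in range(500):
    grad = 2.0 * theta
    theta, G = adagrad_step(theta, grad, G)
```

Because `G` only ever grows, the effective step size `eta / sqrt(G + eps)` shrinks monotonically over the run, which is exactly the weakness described above.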
Related methods: RMSProp, AdaDelta, AMSGrad, AdaMax, NADAM, SGD, AdaBound, Adam, AdaHessian, AMSBound








