AdaGrad
AdaGrad is a stochastic optimization method that adapts the learning rate to the parameters. It performs smaller updates for parameters associated with frequently occurring features, and larger updates for parameters associated with infrequently occurring features. In its update rule, AdaGrad modifies the general learning rate $\eta$ at each time step $t$ for every parameter $\theta_{i}$ based on the past gradients computed for $\theta_{i}$:

$$ \theta_{t+1, i} = \theta_{t, i} - \frac{\eta}{\sqrt{G_{t, ii} + \epsilon}} \, g_{t, i} $$

Here $g_{t, i}$ is the gradient of the objective with respect to $\theta_{i}$ at time step $t$, $G_{t, ii}$ is the sum of the squares of the gradients with respect to $\theta_{i}$ up to time step $t$, and $\epsilon$ is a small smoothing term that avoids division by zero.

The benefit of AdaGrad is that it eliminates the need to manually tune the learning rate; most implementations leave it at a default value of $0.01$. Its main weakness is the accumulation of the squared gradients in the denominator: since every added term is positive, the accumulated sum keeps growing during training, causing the learning rate to shrink until it eventually becomes infinitesimally small.

Image credit: Alec Radford
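The update rule above translates directly into a few lines of code. The following is a minimal NumPy sketch, not a reference implementation: the function name adagrad_update, the toy quadratic objective, and the learning rate of 0.1 are illustrative choices that do not appear in the original text.

```python
import numpy as np

def adagrad_update(theta, grad, grad_sq_sum, lr=0.01, eps=1e-8):
    # Accumulate the squared gradient (the diagonal entries G_{t,ii} in the formula above).
    grad_sq_sum = grad_sq_sum + grad ** 2
    # Per-parameter step: the effective learning rate shrinks as the accumulated sum grows.
    theta = theta - lr * grad / np.sqrt(grad_sq_sum + eps)
    return theta, grad_sq_sum

# Toy usage: minimize f(theta) = 0.5 * ||theta||^2, whose gradient is simply theta.
theta = np.array([1.0, -2.0])
acc = np.zeros_like(theta)
for _ in range(100):
    grad = theta                                   # gradient of the toy objective at the current point
    theta, acc = adagrad_update(theta, grad, acc, lr=0.1)
print(theta)                                       # both entries have moved toward zero
```

Running the loop also illustrates the weakness described above: because grad_sq_sum only ever grows, later steps become progressively smaller.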
Related topics: RMSProp, AdaDelta, AMSGrad, AdaMax, NADAM, SGD, AdaBound, Adam, AdaHessian, AMSBound
Notable researchers
Geoffrey E. Hinton: 345738 citations, 408 papers
Andrew Zisserman: 195560 citations, 885 papers
Léon Bottou: 98650 citations, 174 papers
Ruslan Salakhutdinov: 89393 citations, 413 papers
Jimmy Ba: 83691 citations, 93 papers
Richard Socher: 81897 citations, 249 papers
Francisco Herrera: 79311 citations, 1072 papers
Jason Weston: 61864 citations, 284 papers
Georgios B. Giannakis: 59112 citations, 1336 papers
Samy Bengio: 39495 citations, 373 papers