Adam is an adaptive learning-rate optimization algorithm that uses both momentum and per-parameter scaling, combining the benefits of RMSProp and SGD with momentum. The optimizer is designed for non-stationary objectives and problems with very noisy and/or sparse gradients. At each step $t$, the first- and second-moment estimates are updated from the gradient $g_{t}$:

$$m_{t} = \beta_{1}m_{t-1} + (1-\beta_{1})g_{t}$$

$$v_{t} = \beta_{2}v_{t-1} + (1-\beta_{2})g_{t}^{2}$$

then bias-corrected:

$$\hat{m}_{t} = \frac{m_{t}}{1-\beta_{1}^{t}} \qquad \hat{v}_{t} = \frac{v_{t}}{1-\beta_{2}^{t}}$$

and the weight update is performed as:

$$w_{t} = w_{t-1} - \eta\frac{\hat{m}_{t}}{\sqrt{\hat{v}_{t}} + \epsilon}$$

$\eta$ is the step size (learning rate), around 1e-3 in the original paper. $\epsilon$ is a small constant, typically 1e-8 or 1e-10, that prevents division by zero. $\beta_{1}$ and $\beta_{2}$ are forgetting (exponential decay) rates for the moment estimates, with typical values 0.9 and 0.999, respectively.
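As a sketch, the update rule above can be written directly in NumPy. The function name `adam_step` and the quadratic toy objective are illustrative choices, not from the source:

```python
import numpy as np

def adam_step(w, g, m, v, t, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for parameters w given gradient g at step t (t >= 1)."""
    m = beta1 * m + (1 - beta1) * g         # first-moment (momentum) estimate
    v = beta2 * v + (1 - beta2) * g**2      # second-moment (scaling) estimate
    m_hat = m / (1 - beta1**t)              # bias correction: m, v start at zero
    v_hat = v / (1 - beta2**t)
    w = w - eta * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Toy usage: minimize f(w) = w^2, whose gradient is 2w
w = np.array([1.0])
m, v = np.zeros_like(w), np.zeros_like(w)
for t in range(1, 1001):
    g = 2 * w
    w, m, v = adam_step(w, g, m, v, t)
```

Note that because $\hat{m}_{t}/\sqrt{\hat{v}_{t}}$ is roughly the sign of the gradient early on, the effective per-parameter step size is bounded near $\eta$, which is part of why Adam is robust to gradient scale.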
