Dynamics of SGD with Stochastic Polyak Stepsizes: Truly Adaptive Variants and Convergence to Exact Solution

Antonio Orvieto, Simon Lacoste-Julien, Nicolas Loizou
May 2022
Recently, Loizou et al. (2021) proposed and analyzed stochastic gradient descent (SGD) with stochastic Polyak stepsize (SPS). The proposed SPS comes with strong convergence guarantees and competitive performance; however, it has two main drawbacks when it is used in non-over-parameterized regimes: (i) it requires a priori knowledge of the optimal mini-batch losses, which are not available when the interpolation condition is not satisfied (e.g., regularized losses), and (ii) it guarantees convergence only to a neighborhood of the solution. In this work, we study the dynamics and the convergence properties of SGD equipped with new variants of the stochastic Polyak stepsize and provide solutions to both drawbacks of the original SPS. We first show that a simple modification of the original SPS that uses lower bounds instead of the optimal function values can directly solve issue (i). On the other hand, solving issue (ii) turns out to be more challenging and leads us to valuable insights into the method's behavior. We show that if interpolation is not satisfied, the correlation between SPS and stochastic gradients introduces a bias. This bias effectively distorts the expectation of the gradient signal near minimizers, leading to non-convergence - even if the stepsize is scaled down during training. This phenomenon is in direct contrast to the behavior of SGD, where classical results guarantee convergence under simple stepsize annealing. To fix this issue, we propose DecSPS, a novel modification of SPS, which guarantees convergence to the exact minimizer - without a priori knowledge of the problem parameters. We show that the new variant of SPS works well both in smooth and non-smooth settings.