| | |
| | |
Stat |
Members: 3645 Articles: 2'504'928 Articles rated: 2609
26 April 2024 |
|
| | | |
|
Article overview
| |
|
On the Heavy-Tailed Theory of Stochastic Gradient Descent for Deep Neural Networks | Umut Şimşekli
; Mert Gürbüzbalaban
; Thanh Huy Nguyen
; Gaël Richard
; Levent Sagun
; | Date: |
Fri, 29 Nov 2019 16:56:02 GMT (1822kb,D) | Abstract: | The gradient noise (GN) in the stochastic gradient descent (SGD) algorithm is
often considered to be Gaussian in the large data regime by assuming that the
emph{classical} central limit theorem (CLT) kicks in. This assumption is often
made for mathematical convenience, since it enables SGD to be analyzed as a
stochastic differential equation (SDE) driven by a Brownian motion. We argue
that the Gaussianity assumption might fail to hold in deep learning settings
and hence render the Brownian motion-based analyses inappropriate. Inspired by
non-Gaussian natural phenomena, we consider the GN in a more general context
and invoke the emph{generalized} CLT, which suggests that the GN converges to
a emph{heavy-tailed} $alpha$-stable random vector, where emph{tail-index}
$alpha$ determines the heavy-tailedness of the distribution. Accordingly, we
propose to analyze SGD as a discretization of an SDE driven by a L’{e}vy
motion. Such SDEs can incur ’jumps’, which force the SDE and its discretization
emph{transition} from narrow minima to wider minima, as proven by existing
metastability theory and the extensions that we proved recently. In this study,
under the $alpha$-stable GN assumption, we further establish an explicit
connection between the convergence rate of SGD to a local minimum and the
tail-index $alpha$. To validate the $alpha$-stable assumption, we conduct
experiments on common deep learning scenarios and show that in all settings,
the GN is highly non-Gaussian and admits heavy-tails. We investigate the tail
behavior in varying network architectures and sizes, loss functions, and
datasets. Our results open up a different perspective and shed more light on
the belief that SGD prefers wide minima. | Source: | arXiv, 1912.0018 | Services: | Forum | Review | PDF | Favorites |
|
|
No review found.
Did you like this article?
Note: answers to reviews or questions about the article must be posted in the forum section.
Authors are not allowed to review their own article. They can use the forum section.
browser Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com)
|
| |
|
|
|
| News, job offers and information for researchers and scientists:
| |