Check nearby libraries
Buy this book
In this dissertation, we are concerned with the advancement of optimization algorithms for training deep learning models, and in particular about practical second-order methods that take into account the structure of deep neural networks (DNNs). Although first-order methods such as stochastic gradient descent have long been the predominant optimization algorithm used in deep learning, second-order methods are of interest because of their ability to use curvature information to accelerate the optimization process. After the presentation of some background information in Chapter 1, Chapters 2 and 3 focus on the development of practical quasi-Newton methods for training DNNs. We analyze the Kronecker-factored structure of the Hessian matrix of multi-layer perceptrons and convolutional neural networks and consequently propose block-diagonal Kronecker-factored quasi-Newton methods named K-BFGS and K-BFGS(L). To handle the non-convexity nature of DNNs, we also establish new double damping techniques for our proposed methods.
Our K-BFGS and K-BFGS(L) methods have memory requirements comparable to first-order methods and experience only mild overhead in terms of per-iteration time complexity. In Chapter 4, we develop a new approximate natural gradient method named Tensor Normal Training (TNT), in which the Fisher matrix is viewed as the covariance matrix of a tensor normal distribution (a generalized form of the normal distribution). The tractable Kronecker-factored approximation to the Fisher information matrix that results from this approximation enables TNT to enjoy memory requirements and per-iteration computational costs that are only slightly higher than those for first-order methods. Notably, unlike KFAC and K-BFGS/K-BFGS(L), TNT only requires the knowledge of the shape of the trainable parameters of a model and does not depend on the specific model architecture. In Chapter 5, we consider the subsampled versions of Gauss-Newton and natural gradient methods applied to DNNs. Because of the low-rank nature of the subsampled matrices, we make use of the Sherman-Morrison-Woodbury formula along with backpropagation to efficiently compute their inverse.
We also show that, under rather mild conditions, the algorithm converges to a stationary point if Levenberg-Marquardt damping is used. The results of a substantial number of numerical experiments are reported in Chapters 2, 3, 4 and 5, in which we compare the performance of our methods to state-of-the-art methods used to train DNNs, that demonstrate the efficiency and effectiveness of our proposed new second-order methods.
Check nearby libraries
Buy this book
Edition | Availability |
---|---|
1
Using Second-Order Information in Training Deep Neural Networks
2022, [publisher not identified]
in English
|
aaaa
|
Book Details
Edition Notes
Department: Industrial Engineering and Operations Research.
Thesis advisor: Donald Goldfarb.
Thesis (Ph.D.)--Columbia University, 2022.
The Physical Object
ID Numbers
Community Reviews (0)
Feedback?December 9, 2022 | Created by MARC Bot | import new book |