Record ID | marc_columbia/Columbia-extract-20221130-034.mrc:12792540:3924
Source | marc_columbia |
Download Link | /show-records/marc_columbia/Columbia-extract-20221130-034.mrc:12792540:3924?format=raw |
LEADER: 03924cam a22003973i 4500
001 16624190
005 20220701080433.0
006 m o d
007 cr |n||||a||||
008 220615s2022 nyu|||| om 00| ||eng d
035 $a(OCoLC)1333964740
035 $a(OCoLC)on1333964740
035 $a(NNC)ACfeed:legacy_id:ac:7sqv9s4n2j
035 $a(NNC)ACfeed:doi:10.7916/hstp-s234
035 $a(NNC)16624190
040 $aNNC$beng$erda$cNNC
100 1 $aRen, Yi.
245 10 $aUsing Second-Order Information in Training Deep Neural Networks /$cYi Ren.
264 1 $a[New York, N.Y.?] :$b[publisher not identified],$c2022.
336 $atext$btxt$2rdacontent
337 $acomputer$bc$2rdamedia
338 $aonline resource$bcr$2rdacarrier
300 $a1 online resource.
502 $aThesis (Ph.D.)--Columbia University, 2022.
500 $aDepartment: Industrial Engineering and Operations Research.
500 $aThesis advisor: Donald Goldfarb.
520 $aIn this dissertation, we are concerned with the advancement of optimization algorithms for training deep learning models, and in particular with practical second-order methods that take into account the structure of deep neural networks (DNNs). Although first-order methods such as stochastic gradient descent have long been the predominant optimization algorithms used in deep learning, second-order methods are of interest because of their ability to use curvature information to accelerate the optimization process. After the presentation of some background information in Chapter 1, Chapters 2 and 3 focus on the development of practical quasi-Newton methods for training DNNs. We analyze the Kronecker-factored structure of the Hessian matrices of multi-layer perceptrons and convolutional neural networks and consequently propose block-diagonal Kronecker-factored quasi-Newton methods named K-BFGS and K-BFGS(L). To handle the non-convexity of DNNs, we also establish new double damping techniques for our proposed methods.
520 $aOur K-BFGS and K-BFGS(L) methods have memory requirements comparable to those of first-order methods and incur only mild overhead in per-iteration time complexity. In Chapter 4, we develop a new approximate natural gradient method named Tensor Normal Training (TNT), in which the Fisher matrix is viewed as the covariance matrix of a tensor normal distribution (a generalized form of the normal distribution). The tractable Kronecker-factored approximation to the Fisher information matrix that results from this view enables TNT to enjoy memory requirements and per-iteration computational costs that are only slightly higher than those of first-order methods. Notably, unlike KFAC and K-BFGS/K-BFGS(L), TNT requires only knowledge of the shapes of a model's trainable parameters and does not depend on the specific model architecture. In Chapter 5, we consider subsampled versions of Gauss-Newton and natural gradient methods applied to DNNs. Because the subsampled matrices are low-rank, we use the Sherman-Morrison-Woodbury formula (recalled after this abstract), together with backpropagation, to compute their inverses efficiently.
520 $aWe also show that, under rather mild conditions, the algorithm converges to a stationary point when Levenberg-Marquardt damping is used. The results of a substantial number of numerical experiments, in which we compare the performance of our methods with that of state-of-the-art methods used to train DNNs, are reported in Chapters 2, 3, 4 and 5; these results demonstrate the efficiency and effectiveness of our proposed new second-order methods.
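For reference, two standard linear-algebra identities underlie the efficiency claims in the abstract above; they are textbook results, not material reproduced from the dissertation itself. The Kronecker-product inverse is what makes block-diagonal Kronecker-factored curvature approximations cheap to invert, and the Sherman-Morrison-Woodbury formula gives the inverse of a low-rank update of an invertible matrix, as exploited for the subsampled matrices in Chapter 5:
\[
(A \otimes B)^{-1} = A^{-1} \otimes B^{-1},
\qquad
(A + U C V)^{-1} = A^{-1} - A^{-1} U \left( C^{-1} + V A^{-1} U \right)^{-1} V A^{-1},
\]
where $A$ and $C$ are invertible and $U$, $V$ are conformable low-rank factors.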
653 0 $aOperations research
653 0 $aDeep learning (Machine learning)
653 0 $aStochastic processes
653 0 $aNeural networks (Computer science)
653 0 $aLeast squares
856 40 $uhttps://doi.org/10.7916/hstp-s234$zClick for full text
852 8 $blweb$hDISSERTATIONS