author | Evan Shelhamer <shelhamer@imaginarynumber.net> | 2015-02-04 10:15:21 -0800 |
---|---|---|
committer | Evan Shelhamer <shelhamer@imaginarynumber.net> | 2015-02-04 10:15:21 -0800 |
commit | 648aed72acf1c506009ddb33d8cace40b75e176e (patch) | |
tree | 88d68d9a3f81e873ebdcf61f7ee0b09cdb545c80 /docs/tutorial | |
parent | b6f9dc8f864ebb9cd63398da4de83493e81b5b54 (diff) | |
download | caffeonacl-648aed72acf1c506009ddb33d8cace40b75e176e.tar.gz caffeonacl-648aed72acf1c506009ddb33d8cace40b75e176e.tar.bz2 caffeonacl-648aed72acf1c506009ddb33d8cace40b75e176e.zip |
fix Nesterov typo found by @bamos
Diffstat (limited to 'docs/tutorial')
-rw-r--r-- | docs/tutorial/solver.md | 4 |
1 file changed, 2 insertions, 2 deletions
```diff
diff --git a/docs/tutorial/solver.md b/docs/tutorial/solver.md
index 8884ea0e..17f793ef 100644
--- a/docs/tutorial/solver.md
+++ b/docs/tutorial/solver.md
@@ -6,7 +6,7 @@ title: Solver / Model Optimization
 
 The solver orchestrates model optimization by coordinating the network's forward inference and backward gradients to form parameter updates that attempt to improve the loss.
 The responsibilities of learning are divided between the Solver for overseeing the optimization and generating parameter updates and the Net for yielding loss and gradients.
 
-The Caffe solvers are Stochastic Gradient Descent (SGD), Adaptive Gradient (ADAGRAD), and Nesterov's Accelerated Gradient (NAG).
+The Caffe solvers are Stochastic Gradient Descent (SGD), Adaptive Gradient (ADAGRAD), and Nesterov's Accelerated Gradient (NESTEROV).
 
 The solver
@@ -126,7 +126,7 @@ Note that in practice, for weights $$ W \in \mathcal{R}^d $$, AdaGrad implementa
 
 ### NAG
 
-**Nesterov's accelerated gradient** (`solver_type: NAG`) was proposed by Nesterov [1] as an "optimal" method of convex optimization, achieving a convergence rate of $$ \mathcal{O}(1/t^2) $$ rather than the $$ \mathcal{O}(1/t) $$.
+**Nesterov's accelerated gradient** (`solver_type: NESTEROV`) was proposed by Nesterov [1] as an "optimal" method of convex optimization, achieving a convergence rate of $$ \mathcal{O}(1/t^2) $$ rather than the $$ \mathcal{O}(1/t) $$.
 Though the required assumptions to achieve the $$ \mathcal{O}(1/t^2) $$ convergence typically will not hold for deep networks trained with Caffe (e.g., due to non-smoothness and non-convexity), in practice NAG can be a very effective method for optimizing certain types of deep learning architectures, as demonstrated for deep MNIST autoencoders by Sutskever et al. [2].
 
 The weight update formulas look very similar to the SGD updates given above:
```
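The Nesterov update the diffed section refers to can be sketched in plain Python. This is a minimal illustration of the NAG rule (evaluate the gradient at the looked-ahead point, update the velocity, then the weight), not Caffe's implementation; the toy quadratic loss and the learning-rate and momentum values are assumptions chosen for the example.

```python
# Minimal sketch of Nesterov's accelerated gradient (NAG) on a toy
# 1-D quadratic loss L(w) = 0.5 * w**2, whose gradient is simply w.
# Illustrative only; this is not Caffe's solver code.

def grad(w):
    return w  # gradient of 0.5 * w**2

lr, momentum = 0.1, 0.9   # assumed hyperparameters
w, v = 5.0, 0.0           # initial weight and velocity

for _ in range(200):
    # NAG's distinguishing step: take the gradient at the
    # "looked-ahead" position w + momentum * v, not at w itself.
    g = grad(w + momentum * v)
    v = momentum * v - lr * g  # velocity update
    w = w + v                  # weight update

# The iterate approaches the minimum of the quadratic at w = 0.
print(abs(w) < 1e-3)
```

The lookahead gradient is the only difference from plain momentum SGD, which evaluates `grad(w)` directly.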