author     Evan Shelhamer <shelhamer@imaginarynumber.net>  2015-02-04 10:15:21 -0800
committer  Evan Shelhamer <shelhamer@imaginarynumber.net>  2015-02-04 10:15:21 -0800
commit     648aed72acf1c506009ddb33d8cace40b75e176e (patch)
tree       88d68d9a3f81e873ebdcf61f7ee0b09cdb545c80 /docs/tutorial
parent     b6f9dc8f864ebb9cd63398da4de83493e81b5b54 (diff)
download   caffeonacl-648aed72acf1c506009ddb33d8cace40b75e176e.tar.gz
           caffeonacl-648aed72acf1c506009ddb33d8cace40b75e176e.tar.bz2
           caffeonacl-648aed72acf1c506009ddb33d8cace40b75e176e.zip
fix Nesterov typo found by @bamos
Diffstat (limited to 'docs/tutorial')
-rw-r--r--  docs/tutorial/solver.md  4
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/docs/tutorial/solver.md b/docs/tutorial/solver.md
index 8884ea0e..17f793ef 100644
--- a/docs/tutorial/solver.md
+++ b/docs/tutorial/solver.md
@@ -6,7 +6,7 @@ title: Solver / Model Optimization
The solver orchestrates model optimization by coordinating the network's forward inference and backward gradients to form parameter updates that attempt to improve the loss.
The responsibilities of learning are divided between the Solver for overseeing the optimization and generating parameter updates and the Net for yielding loss and gradients.
-The Caffe solvers are Stochastic Gradient Descent (SGD), Adaptive Gradient (ADAGRAD), and Nesterov's Accelerated Gradient (NAG).
+The Caffe solvers are Stochastic Gradient Descent (SGD), Adaptive Gradient (ADAGRAD), and Nesterov's Accelerated Gradient (NESTEROV).
The solver
@@ -126,7 +126,7 @@ Note that in practice, for weights $$ W \in \mathcal{R}^d $$, AdaGrad implementa
### NAG
-**Nesterov's accelerated gradient** (`solver_type: NAG`) was proposed by Nesterov [1] as an "optimal" method of convex optimization, achieving a convergence rate of $$ \mathcal{O}(1/t^2) $$ rather than the $$ \mathcal{O}(1/t) $$.
+**Nesterov's accelerated gradient** (`solver_type: NESTEROV`) was proposed by Nesterov [1] as an "optimal" method of convex optimization, achieving a convergence rate of $$ \mathcal{O}(1/t^2) $$ rather than the $$ \mathcal{O}(1/t) $$.
Though the required assumptions to achieve the $$ \mathcal{O}(1/t^2) $$ convergence typically will not hold for deep networks trained with Caffe (e.g., due to non-smoothness and non-convexity), in practice NAG can be a very effective method for optimizing certain types of deep learning architectures, as demonstrated for deep MNIST autoencoders by Sutskever et al. [2].
The weight update formulas look very similar to the SGD updates given above:
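For reference, the NAG update that the tutorial goes on to state after the line above follows the standard Nesterov momentum form; the equations below are a sketch of that standard formulation (with $$ V_t $$ the update value, $$ \mu $$ the momentum, and $$ \alpha $$ the learning rate, matching the SGD notation the text refers to), not text quoted from this commit:

$$ V_{t+1} = \mu V_t - \alpha \nabla L(W_t + \mu V_t) $$

$$ W_{t+1} = W_t + V_{t+1} $$

The only difference from the SGD momentum update is that the gradient is evaluated at the look-ahead point $$ W_t + \mu V_t $$ rather than at the current weights $$ W_t $$.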
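To illustrate the renamed value in use, a minimal solver definition selecting Nesterov's accelerated gradient could look like the sketch below. The net path and all hyperparameter values are assumptions for illustration only and are not taken from this commit:

    # Sketch of a solver prototxt using Nesterov's accelerated gradient.
    # The net path and hyperparameter values are illustrative assumptions.
    net: "examples/mnist/lenet_train_test.prototxt"  # hypothetical net definition
    solver_type: NESTEROV    # selects NAG, using the enum value named in this fix
    base_lr: 0.01            # base learning rate (alpha)
    momentum: 0.9            # momentum (mu) in the NAG update
    weight_decay: 0.0005
    lr_policy: "step"        # drop the learning rate by gamma every stepsize iterations
    gamma: 0.1
    stepsize: 10000
    max_iter: 50000
    snapshot: 10000
    snapshot_prefix: "examples/mnist/lenet_nesterov"
    solver_mode: GPU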