When do "Ada" optimizers (e.g. Adagrad, Adam, etc...) "adapt" their parameters? Is it at the end of each mini-batch or epoch?

1 Answer

They update their parameters after each mini-batch. (I use this term to avoid confusion with “batch gradient descent”; most neural-network libraries say “batch size” when they mean “mini-batch size”.)

A helpful way to remember this: the optimizer has no notion of an ‘epoch’. For instance, with stochastic gradient descent you could sample a mini-batch at random from the dataset at every time step (rather than using the common shuffle-and-iterate strategy) and it would still work. Defining the training curriculum is your job, not the optimizer's.

In that setting there is no clearly defined ‘epoch’ at all; everything is expressed in terms of which mini-batch step the optimizer is currently processing.
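
To make the per-step nature concrete, here is a minimal sketch in plain NumPy (the toy linear-regression problem and all hyperparameter values are invented purely for illustration): the Adam-style moment estimates and the parameters are both updated once per mini-batch, and mini-batches are drawn at random, so the word “epoch” never appears in the loop.

```python
# Minimal sketch: Adam-style updates happen once per mini-batch step.
import numpy as np

rng = np.random.default_rng(0)

# Toy linear-regression data (assumed for illustration only).
X = rng.normal(size=(1000, 5))
true_w = rng.normal(size=5)
y = X @ true_w + 0.1 * rng.normal(size=1000)

w = np.zeros(5)                       # parameters
m = np.zeros(5)                       # Adam first-moment estimate
v = np.zeros(5)                       # Adam second-moment estimate
lr, beta1, beta2, eps = 1e-2, 0.9, 0.999, 1e-8

batch_size = 32
for t in range(1, 2001):              # t counts mini-batch steps, not epochs
    # Sample a mini-batch at random: no shuffle-and-iterate, so there is
    # no well-defined epoch; the optimizer only ever sees one step at a time.
    idx = rng.integers(0, len(X), size=batch_size)
    Xb, yb = X[idx], y[idx]

    grad = 2 * Xb.T @ (Xb @ w - yb) / batch_size   # gradient of the MSE loss

    # Optimizer state and parameters both adapt here, once per mini-batch.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    w -= lr * m_hat / (np.sqrt(v_hat) + eps)
```

If you prefer the usual shuffle-and-iterate scheme, only the sampling line changes; the update itself is identical and still runs once per mini-batch.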

  • Yes, I meant mini-batch; I'm editing the question now. That makes sense: documentation sometimes says "after each gradient update", and that happens after processing a mini-batch. – Commented Jun 3, 2021 at 17:43
