As far as I understand, in machine learning there are two stages of optimization. Before training the model there is the optimization of the hyperparameters, to find the best configuration of the model (please correct me if I am wrong). The second stage is the optimization of the parameters. Is the optimization of the parameters only possible when we have an active learning model or an online machine learning model? And does the optimization of the parameters adjust the same coefficients as the optimization of the hyperparameters?
2 Answers
Hyperparameter optimization also depends on the model's learning procedure: there is no principled way to know the optimal hyperparameter values before learning, and they can vary across different hypothesis spaces. Hyperparameter optimization is therefore usually done iteratively, searching for the values that give the desired results. Hyperparameters act more like an added bias on the model and are mostly chosen based on heuristics; a sketch of one common way to run this search in practice follows.
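For instance, a minimal sketch of such an iterative search using scikit-learn's `GridSearchCV` (the toy data and the candidate values in the parameter grid are illustrative assumptions, not prescriptions):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Toy regression data (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

# Try each candidate value of the hyperparameter alpha with
# cross-validation and keep the one with the best validation score.
search = GridSearchCV(Ridge(), param_grid={"alpha": [0.01, 0.1, 1.0, 10.0]}, cv=5)
search.fit(X, y)
print(search.best_params_)   # e.g. {'alpha': 0.01}
```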
In a sense, yes, there are several optimisation problems that one aims at solving. Consider a very generic case where we want to learn a function $y^*:\mathbb{R} \to \mathbb{R}$ from a dataset $\{(x_i,y_i)\}_{i=1}^m$. The most important optimisation problem is finding the right hypothesis $$\underset{h \in H}{\text{min}} \sum_i l(h(x_i),y_i) + \lambda \cdot \text{regularisation term}$$ in some hypothesis space $H$. This is the learning problem (what you called optimisation of the parameters).

But how do we choose $H$? Consider for example polynomial fitting: $H$ is then the space of polynomials up to degree $n$. Hence, $n$ is a free "hyperparameter" that we can tune to find the best hypothesis, and here is the second optimisation problem! In other words, you can consider hyperparameter tuning as restricting or enlarging the size (loosely speaking) of $H$. The same intuition holds for neural networks.
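To make the two levels concrete, here is a minimal sketch in NumPy (the toy data and the values of $n$ and $\lambda$ are illustrative assumptions): `fit_polynomial` solves the inner learning problem, i.e. finds the parameters $w$ of the best hypothesis in a fixed space $H$ of degree-$n$ polynomials, while $n$ and $\lambda$ are hyperparameters set from outside.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=30)          # toy dataset {(x_i, y_i)}
y = np.sin(x) + 0.1 * rng.normal(size=30)

def fit_polynomial(x, y, n, lam):
    """Learning problem for H = polynomials of degree <= n, with squared
    loss and L2 regularisation:
        min_w  sum_i (h_w(x_i) - y_i)^2 + lam * ||w||^2
    solved in closed form (ridge regression)."""
    X = np.vander(x, n + 1)              # features [x^n, ..., x, 1]
    w = np.linalg.solve(X.T @ X + lam * np.eye(n + 1), X.T @ y)
    return w                             # the parameters of the hypothesis

# n and lam are hyperparameters: changing n changes the space H itself.
w = fit_polynomial(x, y, n=3, lam=0.1)
```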
All of this has nothing to do with active learning (AL). In fact, AL introduces yet another optimisation problem: finding the most informative data points, whose labels (not yet known) would improve performance the most.
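As an aside, a minimal sketch of one common AL strategy, uncertainty sampling; the classifier is assumed to expose scikit-learn's `predict_proba`, and all names here are illustrative:

```python
import numpy as np

def select_queries(model, X_pool, k=5):
    """From an unlabelled pool, pick the k points whose predicted class
    probability is closest to 0.5 (binary case): the points the current
    model is least certain about, hence whose labels should help most."""
    proba = model.predict_proba(X_pool)[:, 1]   # P(class 1) per pool point
    uncertainty = -np.abs(proba - 0.5)          # larger = less certain
    return np.argsort(uncertainty)[-k:]         # indices to send for labelling
```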
But how can one solve these two optimisation problems (leaving AL aside for now)? In practice this is done through an iterative procedure:
- Select a set of hyperparameters. This corresponds to fixing the "size" of $H$.
- Run validation tests to diagnose over- and underfitting. If your model suffers from underfitting, consider enlarging $H$ (use bigger NNs or higher-order polynomials). If your model suffers from overfitting, consider restricting $H$ (which can be done through regularisation).
- Try a new set of hyperparameters based on your diagnosis, as in the sketch below.
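A minimal sketch of this loop for the polynomial example (the candidate degrees, the train/validation split, and the toy data are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=60)
y = np.sin(x) + 0.1 * rng.normal(size=60)

# Hold out a validation set to diagnose over- and underfitting.
x_tr, y_tr, x_va, y_va = x[:40], y[:40], x[40:], y[40:]

for n in [1, 3, 9, 15]:                        # candidate hyperparameter values
    w = np.polyfit(x_tr, y_tr, deg=n)          # learning problem for this fixed H
    mse_tr = np.mean((np.polyval(w, x_tr) - y_tr) ** 2)
    mse_va = np.mean((np.polyval(w, x_va) - y_va) ** 2)
    # High train AND validation error  -> underfitting: enlarge H (raise n).
    # Low train, high validation error -> overfitting: restrict H (lower n
    #                                     or add regularisation).
    print(f"degree {n:2d}: train MSE {mse_tr:.3f}, val MSE {mse_va:.3f}")
```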