The monograph Cross Validation contains a section on nested cross-validation for hyper-parameter optimisation (page 6). The author refers to this paper for the reason why it is better to decouple the hyper-parameter search from model selection, but I did not find an intuitive and understandable answer there. In short, my question is:
Why use nested validation for hp-search and model selection if an ML algorithm with two different hyper-parameter values can be seen as two different ML algorithms (and it seems to be fine to use the flat validation approach for selecting the best ML algorithm)?
To make the question precise, below I describe in detail what nested and flat validation are. For simplicity, I will omit the "cross" part and not divide the data into folds -- the core of the question remains the same, and so I believe the reasons should stay the same as well.
HP-tuning and model search using flat validation
This procedure divides the data set into two parts: a training part A and a validation part V.

best_model_family = none
best_hp = none
best_model = none
best_score = none
for each model_family:
    for each value of hp of model_family:
        model = model_family.train(hp, A)
        score = evaluate(model, V)
        if score > best_score:  // larger is better
            best_model_family = model_family
            best_hp = hp
            best_model = model
            best_score = score
After completing the procedure, best_model_family and best_hp contain the model family and the hyper-parameter value that yield the highest-scoring model. The value best_score is taken as the performance estimate of the production model, where the production model is trained on the whole data set: model = best_model_family.train(best_hp, A \cup V).
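
To make the flat procedure concrete, here is a minimal sketch in Python with scikit-learn (my own illustration, not from the monograph); the toy data, the two model families, and the hyper-parameter grids are assumptions made only for the example.

    # Minimal sketch of flat validation: one split A/V and a single search loop
    # over model families and their hyper-parameters (illustrative assumptions:
    # toy data, two families, one hyper-parameter each).
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=600, random_state=0)
    X_A, X_V, y_A, y_V = train_test_split(X, y, test_size=0.3, random_state=0)  # A vs V

    search_space = {
        "knn":  (KNeighborsClassifier, "n_neighbors", [1, 3, 5, 7, 9]),
        "tree": (DecisionTreeClassifier, "max_depth", [2, 4, 6, 8]),
    }

    best = {"family": None, "hp": None, "score": -float("inf")}
    for family, (Estimator, hp_name, hp_values) in search_space.items():
        for hp in hp_values:
            model = Estimator(**{hp_name: hp}).fit(X_A, y_A)  # train on A
            score = model.score(X_V, y_V)                     # evaluate on V
            if score > best["score"]:                         # larger is better
                best = {"family": family, "hp": hp, "score": score}

    # Production model: retrain the winner on A \cup V with the selected hyper-parameter.
    Estimator, hp_name, _ = search_space[best["family"]]
    production_model = Estimator(**{hp_name: best["hp"]}).fit(X, y)
    print(best["family"], best["hp"], best["score"])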
The author says that this approach is prone to over-fitting, because the best model and the best hyper-parameters are picked using the same data. I do not understand why the mere fact of using the same data set for the two searches leads to over-fitting. To me, using different hyper-parameter values is akin to using different model families. Consider for instance the Nearest Neighbours ML algorithm, and let its hyper-parameter be the number of neighbours. To me, NN(3) describes a family of models different from NN(4). It is considered OK to pick the best model family using flat validation; however, once we view the number of neighbours as a hyper-parameter, it is no longer OK to use flat validation. What am I missing here? For reference, I now describe the nested-validation approach for hp-tuning and model selection.
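
Before that, here is my NN(3)-vs-NN(4) point expressed as a short code sketch (my own construction, not from the monograph; the toy data is again an assumption): each value of k is treated as its own "model family" with no hyper-parameters, and the selection loop is then mechanically identical to flat hyper-parameter tuning on the same split.

    # Each value of k treated as a separate "model family" with no hyper-parameters;
    # selecting among them on V is the same loop as flat hp-tuning on V.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X, y = make_classification(n_samples=600, random_state=0)
    X_A, X_V, y_A, y_V = train_test_split(X, y, test_size=0.3, random_state=0)

    families = {f"NN({k})": KNeighborsClassifier(n_neighbors=k) for k in (3, 4, 5)}

    best_family, best_score = None, -float("inf")
    for name, model in families.items():
        score = model.fit(X_A, y_A).score(X_V, y_V)  # train on A, evaluate on V
        if score > best_score:                       # larger is better
            best_family, best_score = name, score
    print(best_family, best_score)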
HP-tuning in a nested loop (nested validation)
The nested-validation approach divides the data set into three parts: in addition to the validation set V, the original set A is itself split into two parts, A = A' \cup B. Hyper-parameter tuning is performed by training on A' and evaluating on B (which is why the approach is called "nested"), whereas model selection is performed as before, by training on A = A' \cup B and evaluating on V, using the previously found best hyper-parameters.

for each model_family:
    best_score = none
    for each value of hp of model_family:
        model = model_family.train(hp, A')
        score = evaluate(model, B)
        if score > best_score:  // larger is better
            best_hp[model_family] = hp
            best_score = score

best_model_family = none
best_score = none
for each model_family:
    model = model_family.train(best_hp[model_family], A' \cup B)
    score = evaluate(model, V)
    if score > best_score:  // larger is better
        best_model_family = model_family
        best_score = score
After executing the procedure, best_model_family is the ML algorithm that we want to use in production, and we train it on the whole data set A \cup V using the hyper-parameter value best_hp[best_model_family].
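
For comparison, here is the nested procedure as a minimal Python/scikit-learn sketch, under the same illustrative assumptions as the flat sketch above (toy data, two model families, one hyper-parameter each; none of it comes from the monograph).

    # Minimal sketch of nested validation: A is split into A' and B for
    # hyper-parameter tuning; model families are then compared on V after
    # retraining on A = A' \cup B with their tuned hyper-parameters.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=600, random_state=0)
    X_A, X_V, y_A, y_V = train_test_split(X, y, test_size=0.3, random_state=0)        # A vs V
    X_Ap, X_B, y_Ap, y_B = train_test_split(X_A, y_A, test_size=0.3, random_state=1)  # A' vs B

    search_space = {
        "knn":  (KNeighborsClassifier, "n_neighbors", [1, 3, 5, 7, 9]),
        "tree": (DecisionTreeClassifier, "max_depth", [2, 4, 6, 8]),
    }

    # Inner loop: tune each family's hyper-parameter on the A'/B split.
    best_hp = {}
    for family, (Estimator, hp_name, hp_values) in search_space.items():
        best_score = -float("inf")
        for hp in hp_values:
            score = Estimator(**{hp_name: hp}).fit(X_Ap, y_Ap).score(X_B, y_B)
            if score > best_score:  # larger is better
                best_hp[family], best_score = hp, score

    # Outer loop: compare families on V, each retrained on A = A' \cup B
    # with its previously selected hyper-parameter.
    best_family, best_score = None, -float("inf")
    for family, (Estimator, hp_name, _) in search_space.items():
        model = Estimator(**{hp_name: best_hp[family]}).fit(X_A, y_A)
        score = model.score(X_V, y_V)
        if score > best_score:
            best_family, best_score = family, score

    # Production model: the winning family trained on the whole data set.
    Estimator, hp_name, _ = search_space[best_family]
    production_model = Estimator(**{hp_name: best_hp[best_family]}).fit(X, y)
    print(best_family, best_hp[best_family], best_score)

The only structural difference from the flat sketch is that each family's hyper-parameter is chosen on the A'/B split, so V is touched exactly once per model family rather than once per (family, hyper-parameter) combination.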

