I'll try to be brief. I have two questions about what exactly happens when I train a gradient-boosted ensemble of trees using, say, XGBoost in order to perform a Gamma regression. I apologize in advance for any (very likely) misunderstanding on my part, but I do appreciate any further clarification very much!

Suppose we have $n-1$ learners which, as an ensemble, estimate the target $y(x)$ with $\hat{y}^{(n-1)}(x)$. The goal is to add another learner, say $f^{(n)}$, so that the updated estimate $\hat{y}^{(n)}(x)$ of $y(x)$ is given by $$\hat{y}^{(n)}(x) = \hat{y}^{(n-1)}(x) + f^{(n)}(x).$$ We then choose the structure of the learner $f^{(n)}$ in such a way that an objective function $$\sum_x L\left(y(x), \hat{y}^{(n)}(x)\right) + \omega(f^{(n)})$$ is minimized. Let's ignore $\omega$ for the time being. I am assuming that if I set the parameter `objective` to `reg:gamma`, then I'm minimizing the negative log-likelihood of a Gamma distribution.
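For reference, this is roughly the setup I have in mind: a minimal sketch with made-up toy data and hyperparameters, where only the `objective` and `eval_metric` names are taken from XGBoost's documentation.

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)

# Toy data: a strictly positive target with a Gamma-like shape (entirely made up).
n = 5000
X = rng.normal(size=(n, 3))
mu = np.exp(0.5 * X[:, 0] - 0.25 * X[:, 1] + 1.0)  # true conditional mean
y = rng.gamma(shape=2.0, scale=mu / 2.0)           # Gamma with shape 2 and mean mu

dtrain = xgb.DMatrix(X, label=y)
params = {
    "objective": "reg:gamma",        # negative Gamma log-likelihood as the loss
    "eval_metric": "gamma-nloglik",  # report that same quantity during training
    "max_depth": 3,
    "eta": 0.1,
}
booster = xgb.train(params, dtrain, num_boost_round=200,
                    evals=[(dtrain, "train")], verbose_eval=50)
y_hat = booster.predict(dtrain)      # predictions of the conditional mean
```

With that setup in mind, the two questions are quite basic: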
- How are the scale and shape parameters taken into account? The expression for the likelihood of a Gamma distribution is written in terms of these parameters, but there is no (explicit) way to make XGBoost aware of them? (I write out the expression I have in mind below, after the questions.)
- Related to the first question: are we implicitly assuming that the residuals are also Gamma distributed? What I have in mind is the following: consider a regression task where, upon visual inspection, the target variable clearly follows something resembling a Gamma distribution. However, the moment I add more than one learner, I am no longer dealing with the original target but rather with a residual. Are we still considering it to be Gamma distributed? (See the sketch after this list for what I mean.)
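To make the first question concrete, the expression I am referring to is the Gamma log-likelihood written in the shape-mean parametrization (my notation: $k$ is the shape and $\mu$ the mean, so the scale is $\mu/k$): $$\log f(y \mid \mu, k) = k\log k - k\log\mu + (k-1)\log y - \frac{k\,y}{\mu} - \log\Gamma(k).$$ The only part that depends on the prediction $\mu$ is $-k\left(\log\mu + \frac{y}{\mu}\right)$, so I don't see where the shape (or the scale) would explicitly enter the training.

To make the second question concrete, here is a small continuation of the sketch above (using `iteration_range`, which I believe is available in recent XGBoost versions) that isolates what is "left over" after the first boosting round:

```python
# Prediction using only the first tree, then the raw leftover on the original scale.
y_hat_1 = booster.predict(dtrain, iteration_range=(0, 1))
raw_residual = y - y_hat_1

# The target y is positive and Gamma-like, but these raw residuals can be negative,
# so they clearly cannot themselves be Gamma distributed; that is the source of my
# confusion about what the later learners are assumed to be modelling.
print(raw_residual.min(), raw_residual.max())
```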