In the Box-Cox transformation, the parameter $\lambda$ is chosen by maximising a likelihood function. But I cannot understand what exactly is maximised in this case. What is the purpose of maximum likelihood here?
2 Answers
This family of transformations combines power and log transformations and is parametrised by $\lambda$:
$$ Y^{(\lambda)} = \begin{cases} \dfrac{Y^\lambda - 1}{\lambda}, & \lambda \neq 0,\\[4pt] \log Y, & \lambda = 0. \end{cases} $$
Note that the family is continuous in $\lambda$: the $\lambda = 0$ case is the limit of the power case as $\lambda \to 0$. The aim is to use likelihood methods to find the “best” $\lambda$.
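For concreteness, here is a minimal sketch of the transformation itself in Python (the function name `boxcox_transform` is mine, not from any library):

```python
import numpy as np

def boxcox_transform(y, lam):
    """Box-Cox transform: (y^lam - 1)/lam for lam != 0, log(y) for lam == 0.

    Requires y > 0. The lam == 0 branch is the limit of the power form
    as lam -> 0, which is why the family is continuous in lam.
    """
    y = np.asarray(y, dtype=float)
    if np.isclose(lam, 0.0):
        return np.log(y)
    return (y**lam - 1.0) / lam
```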
Maybe it is best to provide an example, so let's assume that, for some $\lambda$, we have $E(Y^{(\lambda)}) = X\beta$ together with the normality assumption. Then, given data $Y_1, \dots, Y_n$ (i.e. the untransformed data), the likelihood is
$$ (2\pi \sigma^2)^{-n/2}\exp\left(-\frac1{2\sigma^2}(Y^{(\lambda)}-X\beta)^T(Y^{(\lambda)}-X\beta)\right)\prod_{i=1}^nY_i^{\lambda -1}$$
where the product at the end is the relevant Jacobian of the transformation. It clearly differs in size for different values of $\lambda$, and including it is what makes the likelihoods for different $\lambda$ comparable, so that the optimal one is consistent with our data. For each $\lambda$, fitting the linear model gives $\hat{\beta}(\lambda) = (X^TX)^{-1}X^TY^{(\lambda)}$, $RSS(\lambda) = (Y^{(\lambda)})^T(I - X(X^TX)^{-1}X^T)Y^{(\lambda)}$, and $\hat{\sigma}^2(\lambda) = RSS(\lambda)/n$ (the maximum likelihood estimate).
The profile log-likelihood for $\lambda$, obtained by maximising the log-likelihood over $\beta$ and $\sigma^2$, is therefore
$$ L_{max}(\lambda)= c - \frac{n}{2}\log(RSS(\lambda)/n)+ (\lambda-1)\sum_{i=1}^n \log(Y_i)$$
And so we treat this as we usually treat log-likelihood functions: values of $\lambda$ close to the maximising value $\hat{\lambda}$ are consistent with the data.
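A short numerical sketch of this profile-likelihood maximisation over a grid, reusing `boxcox_transform` from above (the helper name `profile_loglik` and the toy data are my own illustration, assuming `X` contains an intercept column):

```python
def profile_loglik(lam, y, X):
    """L_max(lambda) up to the constant c: fit beta by least squares,
    plug in sigma^2-hat = RSS/n, and add the Jacobian term."""
    z = boxcox_transform(y, lam)                       # Y^(lambda)
    beta_hat, *_ = np.linalg.lstsq(X, z, rcond=None)   # beta-hat(lambda)
    rss = np.sum((z - X @ beta_hat) ** 2)              # RSS(lambda)
    n = len(y)
    return -0.5 * n * np.log(rss / n) + (lam - 1.0) * np.sum(np.log(y))

# Toy data generated on the log scale, so lambda-hat should land near 0.
rng = np.random.default_rng(0)
x = rng.uniform(1.0, 5.0, size=100)
y = np.exp(1.0 + 0.5 * x + rng.normal(scale=0.2, size=100))
X = np.column_stack([np.ones_like(x), x])

grid = np.linspace(-2, 2, 401)
lls = np.array([profile_loglik(lam, y, X) for lam in grid])
lam_hat = grid[np.argmax(lls)]
```

In the usual likelihood-ratio fashion, an approximate 95% interval for $\lambda$ is the set of grid values with $L_{max}(\lambda) \ge L_{max}(\hat{\lambda}) - \tfrac{1}{2}\chi^2_{1,\,0.95}$, which makes “values of $\lambda$ close to $\hat{\lambda}$ are consistent with the data” precise.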
- What is X? What is beta? – railgun, Aug 18, 2023 at 22:16
- @railgun X is the input/predictor variables, and beta is the regression coefficients. – Closed Limelike Curves, Oct 30, 2023 at 2:46
This is a good question. One can argue that the model used to estimate the Box-Cox transformation is something like $$ y_i^{(\lambda)} = \beta_0 + x_i^T \beta + \epsilon_i, \quad i = 1, \dotsc, n $$ with the error terms $\epsilon_i$ independent and identically distributed normal with zero mean and some variance. This is problematic as a statistical model (Peter McCullagh wrote a paper about that: https://projecteuclid.org/euclid.aos/1035844977), and I will come back and try to write about that, but I have no time now.
For one thing, the $\beta$ parameters and the variance will depend on the transformation parameter $\lambda$, but more importantly, the meaning of the model changes with $\lambda$. Still, "estimating" $\lambda$ can be a meaningful thing to do, as an aid in modeling. It may not be estimation in a scientific sense, since the $\lambda$ parameter does not reflect or represent anything in the reality we are modeling; it just indexes a family of models.
But the most obvious thing that happens when varying $\lambda$ is that the scale of $y^{(\lambda)}$ changes. That must be accounted for, and the Jacobian is introduced for that reason. A post with the details is "How do I get the Box-Cox log likelihood using the Jacobian?"
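As a sanity check (my addition, not part of the original answer): scipy exposes this Jacobian-corrected log-likelihood for the no-regressor case (just an overall mean) as `scipy.stats.boxcox_llf`, and `scipy.stats.boxcox` maximises it:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
y = rng.lognormal(mean=0.0, sigma=0.5, size=200)  # positive data

# boxcox_llf(lmb, data) includes the (lambda - 1) * sum(log y) Jacobian term,
# so values across lambda are directly comparable.
grid = np.linspace(-2, 2, 401)
lls = [stats.boxcox_llf(lam, y) for lam in grid]

# With lmbda=None, boxcox returns the transformed data and the maximising lambda.
_, lam_hat = stats.boxcox(y)
```

Dropping the Jacobian term would amount to comparing likelihoods on different scales, which is exactly the problem described above.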
(When I have time (after Easter or later) I will come back and (try to) explain my maybe somewhat cryptic comments above.)