$\begingroup$

What are some examples of less common transformations or basis expansions that have been applied to a set of input variables to build a more complex linear regression model?

For example, a "common" transformation might be to include a set of interaction predictors; a "less common" transformation might be to cluster the data using k-means, and then include the cluster ID as an additional predictor. The set of possible transformations is infinite, but I'm interested to know if any specific approaches have been found useful in practice. If prediction is the main interest, interpretation is not necessarily a concern.
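To make the k-means example concrete, here is a minimal numpy-only sketch: a hand-rolled Lloyd's loop stands in for a library clusterer, and the resulting cluster ID is appended to the design matrix as a one-hot predictor before an ordinary least-squares fit. The data, coefficients, and function name are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: two well-separated blobs; y depends on x2 plus a blob offset.
X = np.vstack([rng.normal(0.0, 0.5, (100, 2)),
               rng.normal(4.0, 0.5, (100, 2))])
blob = np.repeat([0, 1], 100)
y = 1.5 * X[:, 1] + 3.0 * blob + rng.normal(scale=0.1, size=200)

def kmeans_ids(X, k, iters=25, seed=1):
    """Plain Lloyd's algorithm; returns a cluster ID per row."""
    r = np.random.default_rng(seed)
    centers = X[r.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        ids = dists.argmin(axis=1)
        for j in range(k):            # keep the old center if a cluster
            if np.any(ids == j):      # happens to go empty
                centers[j] = X[ids == j].mean(axis=0)
    return ids

ids = kmeans_ids(X, k=2)

# Append the cluster ID as an extra (one-hot) predictor; one level is
# dropped so the column is not collinear with the intercept.
design = np.column_stack([np.ones(len(X)), X, (ids == 1).astype(float)])
beta, *_ = np.linalg.lstsq(design, y, rcond=None)
```

Note the cluster labels are arbitrary (0/1 may be swapped between runs), so only the fitted values, not the sign of the cluster coefficient, are meaningful.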

$\endgroup$
  • $\begingroup$ Are you looking for a list of all possible transformations of the predictors? That might be too much to ask... $\endgroup$ Commented May 18, 2012 at 17:58
  • $\begingroup$ Agreed; I clarified the question to ask for specific examples used in practice. $\endgroup$ Commented May 18, 2012 at 19:34
  • $\begingroup$ Some of what you are calling transformations are not examples of what I think of as transformations. Including interaction terms just changes the form of the model; it does not do anything to the data. I would think of something like taking logarithms or square roots of the response variable as a transformation. Another might be a more general power transformation like Box-Cox: you transform the response and then fit a linear model to the transformed response. $\endgroup$ Commented May 18, 2012 at 19:47
  • $\begingroup$ The terminology may differ between fields; for example, interaction terms could be thought of as 2nd order terms in a Volterra series expansion, which is a nonlinear transformation of the inputs. The response is modeled as a linear combination of the output of the nonlinear basis functions. $\endgroup$ Commented May 18, 2012 at 22:06
  • $\begingroup$ Okay. How is clustering a transformation? Is it that after forming the cluster each member of the cluster is mapped into the cluster mean? $\endgroup$ Commented May 19, 2012 at 3:08
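The Volterra-series point in the comments can be made concrete: a model that is linear in its coefficients can be fit on all monomials of the inputs up to some degree, and the mixed second-order monomials are exactly the interaction terms. A short numpy sketch (the function name is ours):

```python
import numpy as np
from itertools import combinations_with_replacement

def poly_expand(X, degree=2):
    """All monomials of the columns of X up to the given degree,
    plus an intercept column; the mixed degree-2 monomials are the
    usual interaction terms."""
    n, p = X.shape
    cols = [np.ones(n)]
    for d in range(1, degree + 1):
        for idx in combinations_with_replacement(range(p), d):
            cols.append(np.prod(X[:, idx], axis=1))
    return np.column_stack(cols)

X = np.array([[1.0, 2.0], [3.0, 4.0]])
Z = poly_expand(X)  # columns: 1, x1, x2, x1^2, x1*x2, x2^2
```

A linear regression on `Z` is then a linear combination of nonlinear basis functions of the original inputs, which is the sense in which interaction terms are a (second-order) basis expansion.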

1 Answer

$\begingroup$

Interaction terms do provide more refined adjustment in multiple regression models, though they can produce parameters that are very difficult to interpret.

I can't say I would recommend using k-means clustering in general unless it was part of a prespecified analysis plan. It's an unsupervised learning procedure, so applying it haphazardly would considerably change the scope and objective of an analysis. For something like a mediation analysis where you're interested in latent classes and their relationship with some continuous outcome, it could provide an alternative to multiple adjustment, though I think structural equation modeling would be a superior approach to account for exogenous factors when a "latent state" is an exposure of interest.

For refined adjustment in univariate models or models without interactions, splines are worth being knowledgeable about; many books cover their use in regression modeling. The number of knots and the polynomial degree let you control how finely a factor is stratified. I use them almost exclusively in the analysis of ordinal exposure data: at one extreme, a rich spline provides adjustment equivalent to fully categorical adjustment, and at the other, the simplest spline reduces to linear regression on the grouped exposure.
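Once the spline basis is built, the fit is ordinary least squares on the expanded design matrix. A minimal numpy sketch using the truncated power basis (the knot locations and the sine test signal are made up for illustration; library implementations typically use the better-conditioned B-spline basis instead):

```python
import numpy as np

def spline_design(x, knots, degree=3):
    """Truncated power basis for a degree-d regression spline:
    1, x, ..., x^d, plus (x - k)_+^d for each interior knot."""
    cols = [np.ones_like(x)]
    cols += [x ** d for d in range(1, degree + 1)]
    cols += [np.maximum(x - k, 0.0) ** degree for k in knots]
    return np.column_stack(cols)

x = np.linspace(0.0, 2.0 * np.pi, 200)
y = np.sin(x)                                  # a smooth nonlinear signal
D = spline_design(x, knots=[np.pi / 2, np.pi, 3 * np.pi / 2])
beta, *_ = np.linalg.lstsq(D, y, rcond=None)   # ordinary least squares
```

Adding knots moves the fit toward fully categorical (stratified) adjustment; dropping all knots and setting `degree=1` reduces it to a plain linear term.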

$\endgroup$
