$\begingroup$

I am new to applying machine learning models. I have to find a correlation between 1 continuous dependent variable and 27 continuous independent variables.

In the beginning, I was confused about applying linear or non-linear regression models. To understand the normality of the data, first I visualized the relation between them using a scatter plot. Second, I applied a linear regression and used the QQ plot of the residuals to find out the distribution of the error.

I found that most of the variables produce the following plots. Could you help me understand whether my data is linear or not? I tried to understand the QQ plot and read about heavy tails and light tails, but I couldn't understand what these mean.

If it is non-linear, is it okay to apply linear regression models to it after an exponential transformation, or should I apply non-linear models?

[Images: scatter plot and QQ plot of the residuals]

$\endgroup$
  • $\begingroup$ Please clarify your specific problem or provide additional details to highlight exactly what you need. As it's currently written, it's hard to tell exactly what you're asking. $\endgroup$ Commented Aug 3, 2024 at 13:54
  • $\begingroup$ Please explain what you mean by "a correlation between 1 continuous dependent variable and 27 continuous independent variables". The usual (Pearson) correlation is a single number that describes (or more precisely: its square describes) how well a linear transform of one variable predicts the other variable. Its generalization to more than one variable is R-squared in linear regression. Or are you interested in the correlation matrix? Or in measures for a monotonic relationship like Spearman's correlation, or even in measures for arbitrary functional relationships like Chatterjee's correlation? $\endgroup$ Commented Aug 3, 2024 at 21:06

1 Answer

$\begingroup$

You show only one scatter plot, though you mention 20+ predictors. I would say that the scatter plot suggests neither a linear relation nor a nonlinear one: as far as I can tell from the data, there is essentially no association at all. If the rest of your plots look like this, then the same can be said of them. Running a statistical model on this kind of data would be pointless unless you were trying to test some theoretical model, and it doesn't appear that you are doing this in your case.

As for the QQ plot, there is clearly some strong tailing going on. Whether that is important or not is debatable (see my answer here for comments about that). The most important mathematical feature of the data, linearity, is typically of greater significance here, but my above comments seem to indicate this as a non-issue in your case.
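To make "tailing" concrete, here is a minimal sketch of what a QQ plot of residuals computes. It is not the asker's data: the simulated predictor, the noise mixture, and the helper names `fit_line` and `qq_points` are all assumptions made for illustration. Each residual is standardized, sorted, and paired with the standard normal quantile it would match if the residuals were normal. Heavy tails show up as extreme sample quantiles that are larger in magnitude than the theoretical ones.

```python
import random
from statistics import NormalDist, mean, stdev


def fit_line(x, y):
    """Ordinary least squares with one predictor; returns (slope, intercept)."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def qq_points(residuals):
    """Return (theoretical normal quantile, standardized sample quantile) pairs."""
    n = len(residuals)
    m, s = mean(residuals), stdev(residuals)
    ordered = sorted((r - m) / s for r in residuals)
    std_normal = NormalDist()
    # Plotting positions (i + 0.5) / n keep the probabilities strictly inside (0, 1).
    theoretical = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, ordered))


# Simulated data (an assumption): a linear signal plus heavy-tailed noise,
# where about 10% of the errors come from a much wider normal distribution.
random.seed(0)
x = [random.uniform(0, 10) for _ in range(500)]
noise = [random.gauss(0, 10 if random.random() < 0.1 else 1) for _ in range(500)]
y = [2.0 + 0.5 * xi + e for xi, e in zip(x, noise)]

slope, intercept = fit_line(x, y)
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
points = qq_points(residuals)

# Compare the extremes: with heavy tails, the sample quantiles at both ends
# lie further from zero than the theoretical normal quantiles.
print("lowest :", points[0])
print("highest:", points[-1])
```

Plotting `points` gives the usual QQ plot. Points that bend away from the diagonal at both ends correspond to the heavy tails mentioned in the question, while points that bend toward it indicate light tails.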

As a final comment about this:

I have to find a correlation between 1 continuous dependent variable and 27 continuous independent variables.

I'm not sure what you mean given correlations are typically between pairs of variables, but regardless the above commentary still stands.
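If the goal really is pairwise association between the dependent variable and each predictor, one way to read it is sketched below. The predictor names `x1` and `x2`, the simulated data, and the helper functions are hypothetical and only illustrate the idea. Pearson's correlation measures linear association, while Spearman's correlation (Pearson on ranks) measures monotonic association, so comparing the two can hint at a monotonic but nonlinear relation.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant variable")
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Simulated data (an assumption): y depends on x1 monotonically but not
# linearly, and x2 is unrelated noise.
random.seed(1)
n = 200
predictors = {
    "x1": [random.uniform(0, 5) for _ in range(n)],
    "x2": [random.gauss(0, 1) for _ in range(n)],
}
y = [math.exp(v) for v in predictors["x1"]]

for name, values in predictors.items():
    print(name, round(pearson(values, y), 3), round(spearman(values, y), 3))
```

With 27 predictors, the same loop yields 27 pairs of coefficients, one pair per predictor. That is different from a single multiple regression, which summarizes all predictors jointly through R-squared.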

$\endgroup$
  • $\begingroup$ Thank you for the detailed explanation. So, when my aim is to estimate the relationship between variables (i.e., the regression coefficients), the normality of the residuals is less crucial, correct? However, if my goal is to predict the variable, then the normality of the residuals becomes important. The primary objective is to identify the correlation between pairs of variables (the dependent variable and each of the 27 independent variables). $\endgroup$ Commented Aug 3, 2024 at 15:15
  • $\begingroup$ I'm not sure what you mean. If your goal is prediction, then the regression coefficients still matter. Depending on the nature of the problem, non-normality can matter if it gives you clues about the data generating process (DGP), which you can adjust for in your model for better predictions. The example in the answer I linked shows data that simulates natural decay. This is a case where understanding the DGP helps determine what modeling we should do and how it can yield better predictions. The main point is that the normality of the residuals isn't the end goal. $\endgroup$ Commented Aug 3, 2024 at 20:03
