Interpretation of scatter and qqplot to apply regression [closed]

Question

Closed. This question needs details or clarity. It is not currently accepting answers.

Closed last year.

I am new to applying machine learning models. I have to find a correlation between 1 continuous dependent variable and 27 continuous independent variables.

At first, I was unsure whether to apply linear or non-linear regression models. To assess linearity and normality, I first visualized the relationship between each predictor and the dependent variable using a scatter plot. Second, I fitted a linear regression and used a QQ plot of the residuals to examine the distribution of the errors.

I found that most of the variables produce the following plots. Could you help me understand whether my data is linear or not? I tried to read up on the QQ plot and came across the terms heavy tails and light tails, but I couldn't understand what they mean.

If it is non-linear, is it okay to apply linear regression models after an exponential transformation, or should I apply non-linear models?

Please explain what you mean by "a correlation between 1 continuous dependent variable and 27 continuous independent variables". The usual (Pearson) correlation is a single number whose square describes how well a linear transform of one variable predicts the other variable. Its generalization to more than one predictor is R-squared in linear regression. Or are you interested in the correlation matrix? Or in measures of a monotonic relationship such as Spearman's correlation, or even in measures of arbitrary functional relationships such as Chatterjee's correlation?
– cdalitz, Commented Aug 3, 2024 at 21:06
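To make the distinction between these measures concrete, here is a minimal sketch on simulated data (not the asker's data). The three relationships, the sample size, and the helper `chatterjee_xi` are assumptions chosen for illustration; the helper uses the simple no-ties form of Chatterjee's coefficient.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 500  # hypothetical sample size
x = rng.uniform(-3, 3, size=n)

# Three simulated relationships (assumptions for illustration only).
y_linear = 2.0 * x + rng.normal(scale=1.0, size=n)        # linear
y_monotonic = np.exp(x) + rng.normal(scale=0.1, size=n)   # monotonic but nonlinear
y_parabola = x**2 + rng.normal(scale=0.1, size=n)         # functional, not monotonic


def chatterjee_xi(x, y):
    """Chatterjee's xi, no-ties form: 1 - 3 * sum|r[i+1] - r[i]| / (n^2 - 1)."""
    order = np.argsort(x)              # sort the pairs by x
    ranks = stats.rankdata(y[order])   # ranks of y in that order
    m = len(x)
    return 1.0 - 3.0 * np.abs(np.diff(ranks)).sum() / (m**2 - 1)


measures = {}
for name, y in [("linear", y_linear),
                ("monotonic", y_monotonic),
                ("parabola", y_parabola)]:
    r, _ = stats.pearsonr(x, y)
    rho, _ = stats.spearmanr(x, y)
    measures[name] = {
        "pearson": float(r),
        "r_squared": float(r**2),   # R-squared of the simple linear regression
        "spearman": float(rho),
        "xi": float(chatterjee_xi(x, y)),
    }
    print(name, {k: round(v, 3) for k, v in measures[name].items()})
```

In this simulation the parabola has a Pearson correlation near zero while xi stays high, which is the kind of difference the three measures are meant to pick up.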

Shawn Hemelstrand · Accepted Answer · answered 2024-08-03 14:01Z, edited by Nick Cox 2024-08-03 14:09:52Z

1

You show only one scatter plot, though you mention 20+ predictors. I would say that the scatter plot indicates neither a linear relation nor a nonlinear one: as far as I can tell from the plot, there is essentially no association at all. If the rest of your plots look like this, then the same can be said of them. Fitting a statistical model to this kind of data would be pointless unless you were trying to test some theoretical model, and it doesn't appear that you are doing this.
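As a rough illustration of what "no association" looks like numerically, the sketch below fits ordinary least squares to simulated data in which the response is generated independently of all 27 predictors. The sample size of 200 and the normal distributions are assumptions, not properties of the asker's data.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 27                 # n is hypothetical; p = 27 as in the question
X = rng.normal(size=(n, p))    # simulated predictors
y = rng.normal(size=n)         # response generated independently of X

# Ordinary least squares with an intercept column.
A = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
resid = y - A @ beta

ss_res = float(resid @ resid)
ss_tot = float(((y - y.mean()) ** 2).sum())
r2 = 1.0 - ss_res / ss_tot
# Adjusted R-squared penalizes the number of fitted predictors.
adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

print(f"R^2 = {r2:.3f}, adjusted R^2 = {adj_r2:.3f}")
```

Note that plain R-squared is not zero even here, because 27 fitted coefficients always absorb some noise; the adjusted value is the one that stays near zero when there is nothing to find.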

As for the QQ plot, there is clearly some strong tailing going on. Whether that is important or not is debatable (see my answer here for comments about that). Linearity of the relationship is typically the more important consideration, but my comments above suggest that it is moot in your case.
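Since the question asks what heavy and light tails mean, here is a small sketch using simulated residuals. The three distributions (normal, Student t with 3 degrees of freedom, uniform) and the helper `tail_summary` are assumptions for illustration. Heavy tails mean the extreme values are further out than a normal distribution would give, so the ends of the QQ plot bend away from the reference line (below it on the left, above it on the right); light tails bend the opposite way.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 2000  # hypothetical number of residuals

samples = {
    "normal": rng.normal(size=n),             # reference case
    "heavy": rng.standard_t(df=3, size=n),    # heavy tails
    "light": rng.uniform(-1, 1, size=n),      # light tails
}


def tail_summary(resid):
    """Summarize how the tails of `resid` compare with a normal distribution."""
    # probplot returns the QQ-plot coordinates and a least-squares line through them.
    (theoretical, ordered), (slope, intercept, r) = stats.probplot(resid, dist="norm")
    # Robust standardization, so that the tails do not inflate the scale estimate.
    center = np.median(resid)
    scale = stats.iqr(resid) / 1.349   # the IQR of a standard normal is about 1.349
    z99 = np.quantile((resid - center) / scale, 0.99)
    return {
        "excess_kurtosis": float(stats.kurtosis(resid)),  # about 0 for normal data
        "z99": float(z99),     # about 2.33 for normal data
        "qq_r": float(r),      # straightness of the QQ plot (1 = perfectly straight)
    }


summary = {name: tail_summary(values) for name, values in samples.items()}
for name, s in summary.items():
    print(name, {k: round(v, 3) for k, v in s.items()})
```

To draw the actual plot, the `theoretical` and `ordered` arrays can be passed to any plotting library as x and y coordinates.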

As a final comment about this:

I have to find a correlation between 1 continuous dependent variable and 27 continuous independent variables.

I'm not sure what you mean, given that correlations are typically computed between pairs of variables, but regardless, the commentary above still stands.

answered Aug 3, 2024 at 14:01 by Shawn Hemelstrand
edited Aug 3, 2024 at 14:09 by Nick Cox

Thank you for the detailed explanation. So, when my aim is to estimate the relationship between variables (i.e., the regression coefficients), the normality of the residuals is less crucial, correct? However, if my goal is to predict the variable, then the normality of the residuals becomes important. The primary objective is to identify the correlation between pairs of variables (the dependent variable and each of the 27 independent variables).
– Manar, Commented Aug 3, 2024 at 15:15
I'm not sure what you mean. If your goal is prediction, then the regression coefficients still matter. Depending on the nature of the problem, non-normality can matter if it gives you clues about the data generating process (DGP), which you can adjust for in your model for better predictions. The example in the answer I linked shows data that simulates natural decay. This is a case where understanding the DGP helps answer what modeling we should do and how it can yield better predictions. The main point is that the normality of the residuals isn't the end goal.
– Shawn Hemelstrand, Commented Aug 3, 2024 at 20:03
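The point about the DGP can be sketched with simulated data. This is not the example from the linked answer: the decay rate, starting value, and multiplicative noise below are assumptions made up for illustration. When the process is exponential decay, a straight line fitted on the log scale matches the data much better than a straight line fitted on the raw scale.

```python
import numpy as np

rng = np.random.default_rng(3)
t = np.linspace(0, 10, 200)

# Assumed DGP: y = 5 * exp(-0.4 * t) with multiplicative (lognormal) noise.
y = 5.0 * np.exp(-0.4 * t) * np.exp(rng.normal(scale=0.1, size=t.size))


def r_squared(observed, fitted):
    """Share of the variance of `observed` explained by `fitted`."""
    ss_res = ((observed - fitted) ** 2).sum()
    ss_tot = ((observed - observed.mean()) ** 2).sum()
    return float(1.0 - ss_res / ss_tot)


# Model 1: straight line on the raw scale (ignores the DGP).
b1, b0 = np.polyfit(t, y, 1)        # polyfit returns slope first, then intercept
fit_raw = b0 + b1 * t

# Model 2: straight line on the log scale, then back-transformed.
c1, c0 = np.polyfit(t, np.log(y), 1)
fit_log = np.exp(c0 + c1 * t)

r2_raw = r_squared(y, fit_raw)
r2_log = r_squared(y, fit_log)
print(f"raw-scale line:  R^2 = {r2_raw:.3f}")
print(f"log-scale line:  R^2 = {r2_log:.3f}, rate = {c1:.3f}, start = {np.exp(c0):.3f}")
```

Here the log-scale fit also recovers the assumed rate and starting value, which a raw-scale line cannot do. Whether a transformation like this is appropriate for real data depends on what is known about how the data were generated.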
