$\begingroup$

I have a dataset to which I am trying to fit a linear regression model. It has 4 independent variables, and I am trying to predict my dependent variable from these four columns. However, in 2 of these 4 columns, 40% to 55% of the values are 0, so when I plot each of those columns I see a zero-inflated, right-skewed distribution. I tried transforming the data with log(x+1), but it had no significant effect. My model's r2_score is 0.44 and is not improving. I have a few questions.

1- Is my assumption correct that these columns are undermining my choice of a linear regression model?

2- What is the best choice of model if linear regression is not the right one?

3- How do you deal with this kind of data?

I am using Python to model this data.

$\endgroup$
  • $\begingroup$ What kind of variable is the dependent variable? Do you have excess zeros only in the independent variables? $\endgroup$ Commented Jul 17, 2020 at 7:10
  • $\begingroup$ @RobertLong The dependent variable is a count (number of views). It has far fewer zeros than the independent variables. $\endgroup$ Commented Jul 17, 2020 at 8:10

1 Answer

$\begingroup$

It's not the distribution of the independent variables that is important. It's the distribution of the outcome. If the outcome is a count variable and is not zero-inflated, then you should consider fitting a Poisson generalised linear model (GLM), or a negative binomial GLM if the counts are overdispersed (variance larger than the mean).

$\endgroup$
  • $\begingroup$ Does this answer your question? If so, please consider marking it as the accepted answer. If not, please let us know why so that it can be improved. $\endgroup$ Commented Aug 7, 2020 at 5:27
