6
$\begingroup$

This question must have been asked many times before but I can't find an answer.

I'm getting very confused about when to use a zero-inflated negative binomial regression vs standard negative binomial regression.

I'm comparing the number of times subjects in two groups get sick. The zeros are true zeros because we've designed the experiment such that we have all the relevant data. However, about two thirds didn't get sick during the measured period, which means there are "excessive zeros".

I have now read lots of sources that say zero-inflated negative binomial regression is used for "excessive zero counts" but ALSO that it is for models where the data generation process means they are not true zeros (or the zeros could not have been anything but zero). This is not the case in my experiment. Therefore I'm not sure if I can use a regular negative binomial regression or not. Could anyone advise?

$\endgroup$

2 Answers

9
$\begingroup$

I don't think that a distinction between "true" and "untrue" ("false"?) zeros is very helpful.

Zero inflated distributions arise naturally if your data generating process (DGP) actually consists of two processes: one that only generates zeros, and another that generates negbin (or whatever other distribution) distributed observations. So any one observed zero could come from either one of these two distributions. There are no "true" or "false" zeros in this formulation, only zeros coming from one or the other distribution.
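
To make the two-process picture concrete, here is a minimal sketch of a zero-inflated negative binomial probability mass function. The parameter names (`pi` for the probability of the zero-only process, `mu` and `size` for the negbin mean and dispersion) and the numeric values are illustrative assumptions, not taken from the question.

```python
import math


def negbin_pmf(y, mu, size):
    """Negative binomial pmf parameterised by mean `mu` and dispersion `size`."""
    log_p = (
        math.lgamma(y + size) - math.lgamma(size) - math.lgamma(y + 1)
        + size * math.log(size / (size + mu))
        + y * math.log(mu / (size + mu))
    )
    return math.exp(log_p)


def zinb_pmf(y, pi, mu, size):
    """Mixture: with probability `pi` the zero-only process, otherwise a negbin draw."""
    from_count_process = (1.0 - pi) * negbin_pmf(y, mu, size)
    return pi + from_count_process if y == 0 else from_count_process


# Hypothetical parameter values, chosen only for illustration.
pi, mu, size = 0.3, 1.5, 2.0

# An observed zero can come from either process.
p0_zero_process = pi
p0_count_process = (1.0 - pi) * negbin_pmf(0, mu, size)
p0_total = zinb_pmf(0, pi, mu, size)

print(f"P(0) from zero-only process: {p0_zero_process:.3f}")
print(f"P(0) from count process:     {p0_count_process:.3f}")
print(f"P(0) overall:                {p0_total:.3f}")
```

The observed data only ever inform the overall probability of a zero; which process produced a given zero is not observable.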

In a medical context, you might be counting cases of some condition based on people presenting at a walk-in clinic. Some days, nobody appears at all... that is a zero from the first distribution. Some other days, people appear, but present with a different condition than the one you are counting... that would be a zero from the second distribution.

In your case, your description sounds as if this mixture of distributions (zero inflation is a mixture distribution in the sense above) does not really describe your DGP well.

However, it may still be that a plain negbin is not a good description of your DGP, and that a zero-inflated negbin fits it better. After all, there is more to the world than the simple dichotomy "either a negbin or a zero-negbin mixture".

In this latter case, you could use methods to decide between competing distributional descriptions of your data. We have a number of threads along these lines here on CV. Note, however, that this kind of model fitting to existing data will of course mean that any p values you calculate for the "winner" model will be biased, so you should take them with a large grain of salt.
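
As one possible illustration of such a comparison, the sketch below fits a plain Poisson and a zero-inflated Poisson by maximum likelihood and compares them by AIC. Several things here are assumptions made for brevity: Poisson is used instead of negbin, there are no covariates, AIC is just one of several possible criteria, the EM fit is a bare-bones one, and the counts are made up.

```python
import math


def poisson_logpmf(y, lam):
    return -lam + y * math.log(lam) - math.lgamma(y + 1)


def fit_poisson(counts):
    """Poisson MLE (the sample mean) and the maximised log-likelihood."""
    lam = sum(counts) / len(counts)
    return lam, sum(poisson_logpmf(y, lam) for y in counts)


def fit_zip(counts, n_iter=500):
    """Zero-inflated Poisson fitted by a simple EM algorithm (no covariates)."""
    n = len(counts)
    n_zero = sum(1 for y in counts if y == 0)
    total = sum(counts)
    pi, lam = 0.5 * n_zero / n, total / n  # crude starting values
    for _ in range(n_iter):
        # E-step: probability that an observed zero came from the zero-only process
        w = pi / (pi + (1.0 - pi) * math.exp(-lam))
        # M-step: update the mixing weight and the Poisson mean
        pi = n_zero * w / n
        lam = total / (n - n_zero * w)
    loglik = 0.0
    for y in counts:
        if y == 0:
            loglik += math.log(pi + (1.0 - pi) * math.exp(-lam))
        else:
            loglik += math.log(1.0 - pi) + poisson_logpmf(y, lam)
    return pi, lam, loglik


def aic(loglik, n_params):
    return 2 * n_params - 2 * loglik


# Made-up counts for illustration only: 70 zeros out of 100 subjects.
counts = [0] * 70 + [1] * 10 + [2] * 10 + [3] * 6 + [4] * 4

lam_pois, ll_pois = fit_poisson(counts)
pi_zip, lam_zip, ll_zip = fit_zip(counts)

aic_pois = aic(ll_pois, 1)
aic_zip = aic(ll_zip, 2)
print(f"Poisson: lambda={lam_pois:.2f}, AIC={aic_pois:.1f}")
print(f"ZIP:     pi={pi_zip:.2f}, lambda={lam_zip:.2f}, AIC={aic_zip:.1f}")
```

A lower AIC favours a model on these data, but the caveat above still applies: p values computed from a model chosen this way should be treated with caution.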

$\endgroup$
4
  • 1
    $\begingroup$ (+1) Nice answer, I didn't realize one had already been posted. $\endgroup$ Commented Jul 16, 2024 at 14:44
  • 1
    $\begingroup$ I wonder if the OP's data also can be interpreted to be generated from a mixture distribution? Most months I'm not sick, but if I get sick I'm usually out for more than just one day. This feels to me like a mixture between a Bernoulli process saying whether I'm sick or not, and a second process describing the severity of that sickness. $\endgroup$ Commented Jul 17, 2024 at 7:10
  • 3
    $\begingroup$ @AkselA Yes: If you assume that even when sick, you can be out for zero days, this would be a Bernoulli/count mixture (i.e. zero-inflated). If you assume that sickness leads to at least one day out, this would be an example of a hurdle. $\endgroup$ Commented Jul 17, 2024 at 7:14
  • $\begingroup$ Thanks for the answer and comments. We're specifically looking at bouts of disease, so counting discrete incidences (i.e. with time in-between) rather than number of days. We know that we aren't missing data because all subjects were closely monitored for sickness. $\endgroup$ Commented Jul 19, 2024 at 16:33
8
$\begingroup$

To choose between a regular and zero-inflated count model, consider that a large proportion of zeros can happen for several reasons:

  1. Average counts are low. For example, with an average rate of $\lambda = \frac{1}{2}$, you would expect ~61% zeros in a Poisson distribution.
  2. The true process is a mixture of a Bernoulli process and a Poisson/negative binomial process.
    (The typical argument for a zero-inflated model.)
  3. There is some event that must happen in order for a non-zero count to occur in the first place.
    (The typical argument for a hurdle model.)
  4. There are both 'true zeros' and artificially introduced zeros (e.g., can be considered missing).
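
The arithmetic behind reason 1 can be checked directly. The sketch below computes the zero probability of a Poisson and of a negative binomial with no zero inflation at all; the negative binomial is parameterised by mean `mu` and dispersion `size`, and the values used are illustrative assumptions.

```python
import math


def poisson_p0(lam):
    """P(Y = 0) for a Poisson with mean `lam`."""
    return math.exp(-lam)


def negbin_p0(mu, size):
    """P(Y = 0) for a negative binomial with mean `mu` and dispersion `size`."""
    return (size / (size + mu)) ** size


print(round(poisson_p0(0.5), 4))  # 0.6065

# Hypothetical values: a plain negbin with mean 0.5 and size 1
# already puts two thirds of its mass on zero.
p0_nb = negbin_p0(0.5, 1.0)
print(round(p0_nb, 4))  # 0.6667
```

So a share of zeros around two thirds is not, on its own, evidence of zero inflation when the mean count is low.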

As far as modelling zeros goes, your model doesn't care about the distinction between reasons 2, 3, or even 4. A zero-inflated (or hurdle) model will by design capture the observed proportion of zeros in a process, irrespective of their nature.

One way to assess zero-inflation is by using a rootogram. This allows you to distinguish between the overall shape of the probability distribution vs a specific excess of zeros.
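
For readers who want to see what a rootogram computes, here is a minimal sketch of the numbers behind a hanging rootogram, using only the Python standard library. It is not the implementation of any particular package. The helper name `rootogram_table`, the made-up counts, and the use of a Poisson fit in place of a negbin are all assumptions for illustration.

```python
import math
from collections import Counter


def poisson_pmf(y, lam):
    return math.exp(-lam + y * math.log(lam) - math.lgamma(y + 1))


def rootogram_table(counts, pmf, max_count=None):
    """Rows of (count, sqrt(observed), sqrt(expected), bar_bottom).

    In a hanging rootogram each bar has height sqrt(observed) and hangs from
    sqrt(expected), so its bottom sits at sqrt(expected) - sqrt(observed).
    """
    n = len(counts)
    observed = Counter(counts)
    if max_count is None:
        max_count = max(counts)
    rows = []
    for y in range(max_count + 1):
        sqrt_obs = math.sqrt(observed.get(y, 0))
        sqrt_exp = math.sqrt(n * pmf(y))
        rows.append((y, sqrt_obs, sqrt_exp, sqrt_exp - sqrt_obs))
    return rows


# Made-up counts for illustration only.
counts = [0] * 70 + [1] * 10 + [2] * 10 + [3] * 6 + [4] * 4
lam_hat = sum(counts) / len(counts)  # Poisson MLE is the sample mean

rows = rootogram_table(counts, lambda y: poisson_pmf(y, lam_hat))
for y, sqrt_obs, sqrt_exp, bottom in rows:
    print(f"{y}: sqrt(obs)={sqrt_obs:.2f} sqrt(exp)={sqrt_exp:.2f} bottom={bottom:+.2f}")
```

A bar bottom below zero means the model predicts fewer observations at that count than were seen, and a bottom above zero means it predicts more. With these made-up data the Poisson fit underpredicts zeros and overpredicts ones, which is the kind of pattern a rootogram is meant to expose.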

$\endgroup$
1
  • 1
    $\begingroup$ This is really helpful, thank you. I would expect counts to be low (I don't expect our subjects to get sick often). I haven't been able to get the rootogram code to work so far though, unfortunately (not sure what it needs for the "fitted" argument) $\endgroup$ Commented Jul 19, 2024 at 16:47
