Propagating uncertainty while calculating cumulative timeseries data (negative binomial error distribution)

Question

I am trying to appropriately propagate uncertainty from my predicted datasets following a GAM model. Essentially, I have weekly counts of data (i.e., egg counts), but want to calculate cumulative counts by week, and am using the following GAM (mgcv package in R):

weekly_egg_gam = gam(weekly_eggs ~ s(week) + s(treatment, pc=0) + 
  s(organism_id, bs="re"),
  method="REML", data=data_egg_gam, family=nb(link="log",theta=NULL))

However, it is easier for people to view plots of cumulative count data, as opposed to weekly count data. My lab group usually overlays empirical means and error with the predicted model results +/- confidence intervals.

Thus, my problem is: how do I cumulatively propagate the calculated model uncertainty appropriately, and does it matter that it has a negative binomial error distribution?

What I have done so far: I use the predict() function to generate model predictions, calculate the fit and variance (for a negative binomial distribution) on the response scale, then cumulatively sum the fit and variance over time, then calculate the standard error from the cumulative variance. The data from week to week is assumed to be independent. The code is as follows:


theta = weekly_egg_gam$family$getTheta(TRUE)
gam_weekly_egg_predict = predict(weekly_egg_gam, newdata=gam_egg_input_df,
  type="link", se.fit= TRUE, exclude=c("s(organism_id)"))

# Pull fit estimate, put on response scale
gam_weekly_egg_predict$fv = exp(gam_weekly_egg_predict$fit) 
# Calculate the variance (for negative binomial) on response scale
gam_weekly_egg_predict$var = (gam_weekly_egg_predict$fv + 
  ((gam_weekly_egg_predict$fv^2) / theta)) 

gam_weekly_egg_predict_df = data.frame(
  gam_weekly_egg_predict,
  gam_egg_input_df
) %>% # Combine predictions with input data
  arrange(organism_id, week) %>%
  group_by(organism_id) %>%
  mutate(
    fv_cum = cumsum(fv),
    var_cum = cumsum(var),
    se_cum = sqrt(var_cum),
    ul_cum = (fv_cum + (1.96 * se_cum)),
    ll_cum = (fv_cum - (1.96 * se_cum))
  ) %>%
  ungroup()

Alternatively, I could also model the actual cumulative counts with a GAM, but that sounded less statistically sound than cumulatively propagating weekly model uncertainty: (1), I presume I might need to use a GAMM to account for the autocorrelation at that point and (2) not sure if the negative binomial error distribution would be best for cumulative count data (as opposed to quasipoisson etc).

Would appreciate feedback on this, though I promise I have tried endlessly googling how to propagate error appropriately!

EDIT: I organize the cumulative totals by organism ID, because the IDs selected for predict() are already representative of the unique treatments.

en.wikipedia.org/wiki/Variance#Linear_combinations . Keep in mind that with time series the covariance terms are often substantial — Glen_b
– Glen_b, Commented Jul 17 at 0:31
@Glen_b, so it looks like (if I assume the error from week to week is uncorrelated, which feels reasonable in this case) I do in fact just sum the variances? And in this case, the variance of the negative binomial? — lmbradley
– lmbradley, Commented Jul 17 at 1:52

Gavin Simpson · Accepted Answer · 2025-07-18 09:36:46Z

This seems like a perfect opportunity to use posterior simulation.

The general idea would be to:

create a data set of covariate values you want to predict at, including each week, for all combinations of the organisms and treatment levels you want to consider
simulate a bunch of new data from the model, say 10000 draws for each row of the data set created in 1.
for each draw
- for each each organism
  - for each treatment level
    1. form the weekly cumulative sum of the model predicted data
summarise the posterior distributions you created in 3
- For each combination of organism and treatment level, compute say the median and 0.25th and 97.5th probability quantiles of the posterior distribution to give an estimated cumulative sum and it's uncertainty.

This might sound complicated, but you can use, for each step above

gratia::data_slice() to generate the new data to predict at (1. above)
gratia::predicted_samples() or gratia::posterior_samples() to do the posterior simulation in 2. The difference is whether you want to only consider the uncertainty in the model parameters — i.e., only the uncertainty in the expected response, $\mu_i$ — (gratia::predicted_samples()) or both the uncertainty in the model parameters and the sampling uncertainty in the data (gratia::posterior_samples())
your dplyr fu will handle this step of computing the cumulative weekly summed counts group_by(.draw, organism_id, treatment) |> summarise(cum_count = cumsum(.fitted)) for example
the ggdist package has several functions for computing the posterior distributions

I have more details on the different things we might sample from in this article that accompanies my gratia R package: https://gavinsimpson.github.io/gratia/articles/posterior-simulation.html

We used this approach in a paper on muscid fly phenology (Gerlich et al 2025) that was published just a couple of weeks ago (at the time of writing), to do something very similar to what you want to do here.

You don't need to use a GAMM (gamm()) to model the autocorrelation if the $f(\mathtt{week}_i)$ term in the model (s(week)) sufficiently captures the temporal dependencies. I can almost guarantee that it won't in your model because you're assuming that all organisms follow the same time trajectory. But, there's nothing stopping you including in your model a smooth term for week for each organism, for example:

gam(weekly_eggs ~ s(week) +         # common smooth of week
  s(week, organism_id, bs = "fs") + # organism-specific smooth of week
  s(treatment, pc = 0),             # independent treatment effects
  method = "REML",
  data = data_egg_gam,
  family = nb()
)

You could also allow the treatment effect to vary with week if needed. This model is an example of modelling the temporal dependence through the mean structure of the model, which is an alternative to the approach of modelling the dependencies through the covariance matrix. See Hefley et al (2017) for more on these ideas.

Sørine (Gerlich) and I tried to document the mgcv code for the various GAMs we estimated in the paper I mentioned — I think this is all in the papers Supplementary materials. You can fit the organism-specific effects in different ways, for example, most of which I covered in an earlier paper (Pedersen et al, 2019) on what we called a HGAM.

Regarding your sub-questions:

(1), I presume I would need to use a GAMM to account for the obvious autocorrelation

Modelling the dependencies through the model's covariance matrix does get trickier with non-Gaussian models; gamm() uses the trick of glmmPQL() to essentially view the GAMM as a GLMM, which in turn estimates the model as a LMM with lme(), iterating back-and-forth between these different representations. For low mean count Poisson data, this PQL method for GLMMs is known to not work so well (e.g. Bolker et al 2009). If your data are regularly-spaced in time, you can fit an AR(1) in the covariance matrix of the model using bam(), which, when combined with a non-Gaussian fmaily, treat the AR(1) correlation matrix as a working correlation matrix in the same way as one would in a generalized estimating equation (GEE) model.

(2) not sure if the negative binomial error distribution would work for cumulative count data...

This is the wrong way to think about it; your actual data is not the cumulative count and you should try to stick as close to the original data as possible when modelling (in this case). Better, in this case at least, is to do the cumulative summation after you've fitted a model to the observed count data, and evaluate the uncertainty in those cumulative sums, either via math (which is hard, at least I find it hard) or via posterior sampling.

References

Bolker, B.M., Brooks, M.E., Clark, C.J., Geange, S.W., Poulsen, J.R., Stevens, M.H.H., White, J.-S.S., 2009. Generalized linear mixed models: a practical guide for ecology and evolution. Trends Ecol. Evol. 24, 127–135. https://doi.org/10.1016/j.tree.2008.10.008

Gerlich, H.S., Loboda, S., Simpson, G.L., Savage, J., Schmidt, N.M., Holmstrup, M., Høye, T.T., 2025. Species’ traits modulate rapid changes in flight time in high-Arctic muscid flies under climate change. Proc. Biol. Sci. 292, 20250970. https://doi.org/10.1098/rspb.2025.0970

Hefley, T.J., Broms, K.M., Brost, B.M., Buderman, F.E., Kay, S.L., Scharf, H.R., Tipton, J.R., Williams, P.J., Hooten, M.B., 2017. The basis function approach for modeling autocorrelation in ecological data. Ecology 98, 632–646. https://doi.org/10.1002/ecy.1674

Pedersen, E.J., Miller, D.L., Simpson, G.L., Ross, N., 2019. Hierarchical generalized additive models in ecology: an introduction with mgcv. PeerJ 7, e6876. https://doi.org/10.7717/peerj.6876

Gavin! Thank you so much. This level of detail and references is really useful as a student. Really glad to have gotten your feedback on it. — lmbradley
– lmbradley, Commented Jul 22 at 15:10

Stack Exchange Network

Propagating uncertainty while calculating cumulative timeseries data (negative binomial error distribution)

1 Answer 1

References

Your Answer

Hot Network Questions

Propagating uncertainty while calculating cumulative timeseries data (negative binomial error distribution)

1 Answer 1

References

Your Answer

Sign up or log in

Post as a guest

Related

Hot Network Questions