3
$\begingroup$

I am trying to appropriately propagate uncertainty from my predicted datasets following a GAM model. Essentially, I have weekly counts of data (i.e., egg counts), but want to calculate cumulative counts by week, and am using the following GAM (mgcv package in R):

weekly_egg_gam = gam(weekly_eggs ~ s(week) + s(treatment, pc=0) + 
  s(organism_id, bs="re"),
  method="REML", data=data_egg_gam, family=nb(link="log",theta=NULL))

However, it is easier for people to view plots of cumulative count data, as opposed to weekly count data. My lab group usually overlays empirical means and error with the predicted model results +/- confidence intervals.

Thus, my problem is: how do I cumulatively propagate the calculated model uncertainty appropriately, and does it matter that it has a negative binomial error distribution?

What I have done so far: I use the predict() function to generate model predictions, calculate the fit and variance (for a negative binomial distribution) on the response scale, then cumulatively sum the fit and variance over time, then calculate the standard error from the cumulative variance. The data from week to week is assumed to be independent. The code is as follows:


theta = weekly_egg_gam$family$getTheta(TRUE)
gam_weekly_egg_predict = predict(weekly_egg_gam, newdata=gam_egg_input_df,
  type="link", se.fit= TRUE, exclude=c("s(organism_id)"))

# Pull fit estimate, put on response scale
gam_weekly_egg_predict$fv = exp(gam_weekly_egg_predict$fit) 
# Calculate the variance (for negative binomial) on response scale
gam_weekly_egg_predict$var = (gam_weekly_egg_predict$fv + 
  ((gam_weekly_egg_predict$fv^2) / theta)) 

gam_weekly_egg_predict_df = data.frame(
  gam_weekly_egg_predict,
  gam_egg_input_df
) %>% # Combine predictions with input data
  arrange(organism_id, week) %>%
  group_by(organism_id) %>%
  mutate(
    fv_cum = cumsum(fv),
    var_cum = cumsum(var),
    se_cum = sqrt(var_cum),
    ul_cum = (fv_cum + (1.96 * se_cum)),
    ll_cum = (fv_cum - (1.96 * se_cum))
  ) %>%
  ungroup()

Alternatively, I could also model the actual cumulative counts with a GAM, but that sounded less statistically sound than cumulatively propagating weekly model uncertainty: (1), I presume I might need to use a GAMM to account for the autocorrelation at that point and (2) not sure if the negative binomial error distribution would be best for cumulative count data (as opposed to quasipoisson etc).

Would appreciate feedback on this, though I promise I have tried endlessly googling how to propagate error appropriately!

EDIT: I organize the cumulative totals by organism ID, because the IDs selected for predict() are already representative of the unique treatments.

$\endgroup$
2
  • $\begingroup$ en.wikipedia.org/wiki/Variance#Linear_combinations . Keep in mind that with time series the covariance terms are often substantial $\endgroup$ Commented Jul 17 at 0:31
  • $\begingroup$ @Glen_b, so it looks like (if I assume the error from week to week is uncorrelated, which feels reasonable in this case) I do in fact just sum the variances? And in this case, the variance of the negative binomial? $\endgroup$ Commented Jul 17 at 1:52

1 Answer 1

3
$\begingroup$

This seems like a perfect opportunity to use posterior simulation.

The general idea would be to:

  1. create a data set of covariate values you want to predict at, including each week, for all combinations of the organisms and treatment levels you want to consider
  2. simulate a bunch of new data from the model, say 10000 draws for each row of the data set created in 1.
  3. for each draw
    • for each each organism
      • for each treatment level
        1. form the weekly cumulative sum of the model predicted data
  4. summarise the posterior distributions you created in 3
    • For each combination of organism and treatment level, compute say the median and 0.25th and 97.5th probability quantiles of the posterior distribution to give an estimated cumulative sum and it's uncertainty.

This might sound complicated, but you can use, for each step above

  1. gratia::data_slice() to generate the new data to predict at (1. above)
  2. gratia::predicted_samples() or gratia::posterior_samples() to do the posterior simulation in 2. The difference is whether you want to only consider the uncertainty in the model parameters — i.e., only the uncertainty in the expected response, $\mu_i$ — (gratia::predicted_samples()) or both the uncertainty in the model parameters and the sampling uncertainty in the data (gratia::posterior_samples())
  3. your dplyr fu will handle this step of computing the cumulative weekly summed counts group_by(.draw, organism_id, treatment) |> summarise(cum_count = cumsum(.fitted)) for example
  4. the ggdist package has several functions for computing the posterior distributions

I have more details on the different things we might sample from in this article that accompanies my gratia R package: https://gavinsimpson.github.io/gratia/articles/posterior-simulation.html

We used this approach in a paper on muscid fly phenology (Gerlich et al 2025) that was published just a couple of weeks ago (at the time of writing), to do something very similar to what you want to do here.

You don't need to use a GAMM (gamm()) to model the autocorrelation if the $f(\mathtt{week}_i)$ term in the model (s(week)) sufficiently captures the temporal dependencies. I can almost guarantee that it won't in your model because you're assuming that all organisms follow the same time trajectory. But, there's nothing stopping you including in your model a smooth term for week for each organism, for example:

gam(weekly_eggs ~ s(week) +         # common smooth of week
  s(week, organism_id, bs = "fs") + # organism-specific smooth of week
  s(treatment, pc = 0),             # independent treatment effects
  method = "REML",
  data = data_egg_gam,
  family = nb()
)

You could also allow the treatment effect to vary with week if needed. This model is an example of modelling the temporal dependence through the mean structure of the model, which is an alternative to the approach of modelling the dependencies through the covariance matrix. See Hefley et al (2017) for more on these ideas.

Sørine (Gerlich) and I tried to document the mgcv code for the various GAMs we estimated in the paper I mentioned — I think this is all in the papers Supplementary materials. You can fit the organism-specific effects in different ways, for example, most of which I covered in an earlier paper (Pedersen et al, 2019) on what we called a HGAM.

Regarding your sub-questions:

(1), I presume I would need to use a GAMM to account for the obvious autocorrelation

Modelling the dependencies through the model's covariance matrix does get trickier with non-Gaussian models; gamm() uses the trick of glmmPQL() to essentially view the GAMM as a GLMM, which in turn estimates the model as a LMM with lme(), iterating back-and-forth between these different representations. For low mean count Poisson data, this PQL method for GLMMs is known to not work so well (e.g. Bolker et al 2009). If your data are regularly-spaced in time, you can fit an AR(1) in the covariance matrix of the model using bam(), which, when combined with a non-Gaussian fmaily, treat the AR(1) correlation matrix as a working correlation matrix in the same way as one would in a generalized estimating equation (GEE) model.

(2) not sure if the negative binomial error distribution would work for cumulative count data...

This is the wrong way to think about it; your actual data is not the cumulative count and you should try to stick as close to the original data as possible when modelling (in this case). Better, in this case at least, is to do the cumulative summation after you've fitted a model to the observed count data, and evaluate the uncertainty in those cumulative sums, either via math (which is hard, at least I find it hard) or via posterior sampling.

References

Bolker, B.M., Brooks, M.E., Clark, C.J., Geange, S.W., Poulsen, J.R., Stevens, M.H.H., White, J.-S.S., 2009. Generalized linear mixed models: a practical guide for ecology and evolution. Trends Ecol. Evol. 24, 127–135. https://doi.org/10.1016/j.tree.2008.10.008

Gerlich, H.S., Loboda, S., Simpson, G.L., Savage, J., Schmidt, N.M., Holmstrup, M., Høye, T.T., 2025. Species’ traits modulate rapid changes in flight time in high-Arctic muscid flies under climate change. Proc. Biol. Sci. 292, 20250970. https://doi.org/10.1098/rspb.2025.0970

Hefley, T.J., Broms, K.M., Brost, B.M., Buderman, F.E., Kay, S.L., Scharf, H.R., Tipton, J.R., Williams, P.J., Hooten, M.B., 2017. The basis function approach for modeling autocorrelation in ecological data. Ecology 98, 632–646. https://doi.org/10.1002/ecy.1674

Pedersen, E.J., Miller, D.L., Simpson, G.L., Ross, N., 2019. Hierarchical generalized additive models in ecology: an introduction with mgcv. PeerJ 7, e6876. https://doi.org/10.7717/peerj.6876

$\endgroup$
1
  • 1
    $\begingroup$ Gavin! Thank you so much. This level of detail and references is really useful as a student. Really glad to have gotten your feedback on it. $\endgroup$ Commented Jul 22 at 15:10

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.