Using GAMs and Checking for Autocorrelation in Time Series Data

Question

I’ve been running Generalized Additive Models (GAMs) to explore temporal trends in my soil phosphorus data. I have 20 years of data at each site. I'm considering either fitting an individual GAM for each site or a hierarchical GAM with a global smooth and site-specific deviations from the global smooth. My workflow compares a “plain” GAM with a GAMM that includes an autocorrelation structure. Workflow:

Fit a GAM with gam() and check assumptions with gam.check().
Fit a GAMM with autocorrelation using gamm(..., correlation = corCAR1(...)).
Check the $\phi$ parameter (the CAR(1) correlation estimate) and its confidence interval.
If $\phi$ indicates significant residual autocorrelation, I keep and report the autocorrelation model. If not, I stick with the simpler GAM.
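
A hedged sketch of step 4: because corCAR1() constrains its parameter to lie between 0 and 1, an interval for it will not include zero, so one alternative is a likelihood-ratio comparison of gamm() fits with and without the CAR(1) term. The data here are simulated stand-ins, and `sim`, `fit0`, `fit1`, `cmp` and `phi_hat` are hypothetical names. The `| Plot` grouping is dropped only because the simulated data form a single series.

```r
library(mgcv)  # gamm()
library(nlme)  # corCAR1(), anova() method for lme fits

set.seed(1)
# Simulated stand-in for one site: 60 time points with AR(1) errors (not the soil data)
n   <- 60
sim <- data.frame(Year = seq_len(n))
sim$y <- 10 + sin(sim$Year / 8) +
  as.numeric(arima.sim(list(ar = 0.6), n = n, sd = 0.4))

# Same smooth, fitted without and with a CAR(1) process in the residuals
fit0 <- gamm(y ~ s(Year, k = 6), data = sim, method = "REML")
fit1 <- gamm(y ~ s(Year, k = 6), data = sim,
             correlation = corCAR1(form = ~ Year), method = "REML")

# Likelihood-ratio test of the nested lme fits (same fixed effects, both REML)
cmp <- anova(fit0$lme, fit1$lme)
print(cmp)

# Estimated CAR(1) parameter on its natural (0, 1) scale
phi_hat <- coef(fit1$lme$modelStruct$corStruct, unconstrained = FALSE)
print(phi_hat)
```

A small p-value in `cmp` would favour keeping the CAR(1) term; the fitted smooth itself is still read from `fit1$gam`.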

I want to ensure I am modeling my data—and replicates—correctly.

Code:

Simple GAM

library(mgcv)   # gam(), gamm(), gam.check()
library(dplyr)  # filter()

m1 <- gam(Total_P ~ s(Year, k = 3),
          data = filter(soil_df, Site == "S1"),
          method = "REML")
gam.check(m1)

GAMM with CAR(1) autocorrelation

mod1 <- gamm(Total_P ~ s(Year, k = 3),
             data = filter(soil_df, Site == "S1"),
             correlation = corCAR1(form = ~ Year|Plot),
             method = "REML")


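# Compare models
summary(mod1$gam)
# gamm() returns a list, so AIC() needs its $lme component.
# The gam() and gamm() likelihoods are not computed identically; treat this comparison with caution.
AIC(m1, mod1$lme)

# Estimate of the autocorrelation parameter and its CI
smallPhi <- intervals(mod1$lme, which = "var-cov")$corStruct
smallPhi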

Hierarchical GAMs:

m_tp_GI <- gam(
  Total_P ~                             
    s(Year, k = 3) +       # global smooth
    s(Year, Site, k = 10, bs = "sz"),
  data = soil_df,
  method = "REML"
)

m2_ac <- gamm(
  Total_P ~ 
    s(Year, k = 6, m=2) + 
    s(Year, Site, bs = "fs", k = 6, m=2),
  data = soil_df,
  correlation = corCAR1(form = ~ Year | Plot),
  method = "REML"
)

Questions

In my dataset, I have three replicate soil collections per year. These were randomly sampled, not permanently marked plots. That means they are not repeated measures through time.

To make the correlation structure work in gamm(), I added a Plot column to uniquely identify each replicate. Since corCAR1(form = ~ Year|Plot) expects an ID for the grouping factor, I believe this setup treats residuals as correlated over time within each replicate.

Should I instead collapse replicates to yearly means and model autocorrelation across years? Or can I keep replicates modeled this way?

Do hierarchical models with penalization give the same answer as site-by-site GAMs?

The gam.check() k selection and diagnostic plots of the hierarchical GAM don't look great right now; is my model structure missing anything?

UPDATE: I fit an HGAM with a Tweedie distribution, which seemed to fit the data best.


m_TP_SRS <- gam(
  TP ~ s(Year, k = 8) +             # global trend (shrinkage)
    s(Year, Site, bs = "sz", k = 8),        # site-specific deviations
  data   = srs,
  family = Gamma(link = "log"),
  method = "REML")

m_TP_SRS_tw <- gam(
  TP ~ s(Year, k = 8) +                  # global smooth
    s(Year, Site, bs = "sz", k = 8),  # constrained factor smooth (includes site mean shifts)
  data   = srs,
  family = tw(link = "log"),
  method = "REML"
)

AIC(m_TP_SRS_tw,m_TP_SRS)

Diagnostic plots

Gavin Simpson · Accepted Answer · 2025-09-22 10:13:16Z

Should I instead collapse replicates to yearly means and model autocorrelation across years? Or can I keep replicates modeled this way?

If you wanted to learn something about the variation among sites, then aggregating the data in the way you suggest would mean that would no longer be possible. I don't see anything immediately wrong with what you have done here.

Do hierarchical models with penalization give the same answer as site-by-site GAMs?

Yes and no; the HGAM can draw power from all the sites to inform model estimates, which may make a big difference for any sites that are not well-sampled for example. Another advantage is that you have a basis for comparing among sites with the HGAM, while separate GAMs per site would not permit this statistically.
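
To make the contrast concrete, here is a minimal sketch on simulated data (not the soil data) that fits both an HGAM and separate per-site GAMs and lines up their fitted trends for one site. The names `dat`, `hgam`, `by_site` and `pred` are hypothetical, the Gamma family is just one plausible choice for a positive, skewed response, and the "fs" basis is used here in place of "sz" purely for illustration.

```r
library(mgcv)

set.seed(2)
# Simulated stand-in (hypothetical): 4 sites, 20 years, 3 replicates per year
dat <- expand.grid(Year = 1:20, Rep = 1:3, Site = factor(paste0("S", 1:4)))
site_shift <- c(S1 = 0, S2 = 0.3, S3 = -0.2, S4 = 0.1)
mu <- exp(1 + 0.03 * dat$Year + site_shift[as.character(dat$Site)])
dat$TP <- rgamma(nrow(dat), shape = 10, rate = 10 / mu)  # positive, right-skewed

# HGAM: global smooth plus site-level smooths that share one smoothness penalty
hgam <- gam(TP ~ s(Year, k = 5, m = 2) + s(Year, Site, bs = "fs", k = 5, m = 2),
            data = dat, family = Gamma(link = "log"), method = "REML")

# Separate GAM for each site: no information is shared among sites
by_site <- lapply(split(dat, dat$Site), function(d)
  gam(TP ~ s(Year, k = 5), data = d,
      family = Gamma(link = "log"), method = "REML"))

# Fitted trends for site S1 from both approaches, on the response scale
nd <- data.frame(Year = 1:20, Site = factor("S1", levels = levels(dat$Site)))
pred <- data.frame(
  Year     = nd$Year,
  hgam     = as.numeric(predict(hgam, nd, type = "response")),
  separate = as.numeric(predict(by_site$S1, nd, type = "response"))
)
print(head(pred))
```

With well-sampled sites the two columns of `pred` will usually be close; the differences tend to grow for sites with few observations, where the HGAM borrows strength from the others.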

All else equal, I don't think you should see anything worse by moving to the HGAM.

The gam.check() k selection and diagnostic plots of the hierarchical GAM don't look great right now; is my model structure missing anything?

I doubt you can model total phosphorus as being conditionally distributed Gaussian as this variable cannot take negative values and is often highly skewed. The response may also be censored if the TP concentrations are low enough to be approaching the limits of detection. Switching to a distribution that is more appropriate would likely help. But note that if you do this in gamm() (which you probably should) you will then be fitting via MASS::glmmPQL(), which will push the methods hard; do read ?gamm closely before you do this.

As you don't show the diagnostic plots it is impossible to comment further.

The k check will almost always fail for the trend components in a model like this as it doesn't actually know anything about the CAR(1) process in the residuals. As such you should ignore this test in this case.

I would plot the normalised residuals against time and model them with another smoother to see if there is unmodelled trend. I would also look at a variogram of the normalised residuals to assess any remaining autocorrelation.
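
A minimal sketch of those two checks, assuming a single simulated series rather than the soil data (`sim`, `fit`, `res_fit`, `vg` and `semi` are hypothetical names). The semi-variance is computed with nlme's default Variogram() method on the residual vector and then averaged by lag.

```r
library(mgcv)
library(nlme)

set.seed(3)
# Simulated stand-in for one site (hypothetical): trend plus AR(1) noise
n   <- 60
sim <- data.frame(Year = seq_len(n))
sim$y <- 10 + sin(sim$Year / 8) +
  as.numeric(arima.sim(list(ar = 0.5), n = n, sd = 0.4))

fit <- gamm(y ~ s(Year, k = 6), data = sim,
            correlation = corCAR1(form = ~ Year), method = "REML")

# Normalised residuals take the fitted CAR(1) structure into account
sim$res_norm <- as.numeric(resid(fit$lme, type = "normalized"))

# Smooth the residuals against time to look for unmodelled trend
res_fit <- gam(res_norm ~ s(Year, k = 6), data = sim, method = "REML")
print(summary(res_fit)$s.table)
plot(res_fit, residuals = TRUE, pch = 16)

# Semi-variance for every pair of years, then averaged by time lag
vg   <- Variogram(sim$res_norm, dist(sim$Year))
semi <- tapply(vg$variog, vg$dist, mean)
plot(as.numeric(names(semi)), semi, type = "b",
     xlab = "Lag (years)", ylab = "Mean semi-variance")
```

If the CAR(1) term is adequate, the residual smooth should be close to flat and the mean semi-variance should show no trend with lag. The longest lags are averaged over very few pairs, so they will be noisy.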

It might also be better to fit the model using method = "NCV" as that can leave out data that is autocorrelated while estimating the smoothing parameters. Setting up the structures needed to use this neighbourhood cross-validation is not trivial, but then you could fit the model using gam() and you wouldn't need the CAR(1) nor MASS::glmmPQL() in that case. The idea would be to exclude observations that shouldn't be used to estimate a given data point; so this might involve excluding the data that come after a given data point. But read the help for NCV in mgcv for more detail.

Thank you! I updated my post with diagnostics from the HGAM using a Tweedie family. You were right that it's right-skewed and non-negative. I first fit Gamma(log), but the DHARMa dispersion test was significant, so I switched to Tweedie; the diagnostics look better, though not perfect. I also plotted normalized residuals over time and a residual variogram by site, which showed no clear autocorrelation, so I’m not adding CAR(1). gam.check() still flags the k-index, but I’m not chasing larger k given your note, edf < k′, and insensitivity to higher k. If you have a moment to skim the plots, I’d appreciate any red flags you see.
– camila, Commented Sep 22 at 18:56
The point about using the variogram on the normalized residuals was to check that the autocorrelation had been modelled adequately; there should be no trend in the semi-variance. This should not be taken as justification that the CAR(1) is not needed. You could use it on the deviance residuals from the model without the CAR(1) to see if you had unmodelled autocorrelation, but what you have done now is wrong.
– Gavin Simpson, Commented Sep 24 at 7:08
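
The check described in that last comment might look roughly like the following, assuming a gam() fit with no CAR(1) term and simulated replicate data standing in for one site (`dat`, `m`, `vg` and `semi` are hypothetical names, and the Gamma family is only a placeholder for whichever family is used).

```r
library(mgcv)
library(nlme)

set.seed(4)
# Simulated stand-in for one site (hypothetical): 30 years, 3 replicates per year
dat <- expand.grid(Rep = 1:3, Year = 1:30)
mu  <- exp(1 + 0.02 * dat$Year)
dat$TP <- rgamma(nrow(dat), shape = 8, rate = 8 / mu)

# Model fitted with gam(), i.e. with no CAR(1) term
m <- gam(TP ~ s(Year, k = 6), data = dat,
         family = Gamma(link = "log"), method = "REML")

# Deviance residuals from the fit that ignores autocorrelation
dat$res_dev <- as.numeric(residuals(m, type = "deviance"))

# Semi-variance for every pair of observations, averaged by time lag
# (lag 0 corresponds to replicates collected in the same year)
vg   <- Variogram(dat$res_dev, dist(dat$Year))
semi <- tapply(vg$variog, vg$dist, mean)
print(round(head(semi), 3))
plot(as.numeric(names(semi)), semi, type = "b",
     xlab = "Lag (years)", ylab = "Mean semi-variance")
```

Mean semi-variance that rises over the first few lags before levelling off would suggest unmodelled autocorrelation; a roughly flat profile would not. With several sites, the pairs would need to be restricted to observations within the same site.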
