3
$\begingroup$

I am working on a project analyzing olive plantation data, where I aim to simulate the relationship between investment costs (Costs), revenues (Revenues), and temperature (Temp) over time, accounting for the specific temporal dynamics of the data. The goal is to generate realistic scenarios for future tree plantations. My idea is to employ copulas.

The data I have consists of annual records for 10 years, where:

  • Costs represent the investments required for the olive plantation.
  • Revenues represent the returns from the sale of olives.
  • Temp is the annual average temperature.
  • TempCng is the annual temperature change.

Since the data is inherently temporal (i.e., Costs and Revenues are not independent and identically distributed over time), I aim to capture the time structure, particularly the significant initial investments (Costs), followed by revenues (Revenues) that only materialize after several years as the trees need time to grow. To address this, I include a time trend variable in my analysis.

Here’s my approach so far:

# Packages
library(VineCopula)
library(copula)

# Synthetic data for convenience
Costs <- c(100, 0, 150, 50, 0, 0, 0, 0, 0, 0)
Revenues <- c(0, 0, 0, 50, 0, 225, 100, 0, 150, 5)
Temp <- c(20.00, 21.60, 16.05, 15.68, 17.40, 19.51, 19.87, 19.02, 18.21, 18.18)
TempCng <- c(0.001464764, diff(Temp) / head(Temp, -1))
Years <- seq(2008,2017)

# Create data frame
OliveTrees <- data.frame(Costs, Revenues, Temp, TempCng, row.names = Years)

# Compute mean and standard deviation
mu_C <- mean(Costs)
mu_R <- mean(Revenues)
mu_T <- mean(TempCng)

sigma_C <- sd(Costs)
sigma_R <- sd(Revenues)
sigma_T <- sd(TempCng)

# Normalize the data
OliveTrees$CNorm <- (OliveTrees$Costs - mu_C) / sigma_C
OliveTrees$RNorm <- (OliveTrees$Revenues - mu_R) / sigma_R
OliveTrees$TNorm <- (OliveTrees$TempCng - mu_T) / sigma_T

# Apply empirical distribution
C_dist <- pobs(OliveTrees$CNorm)
R_dist <- pobs(OliveTrees$RNorm)
T_dist <- pobs(OliveTrees$TNorm)

# Time trend (sequence of years)
S_dist <- pobs(1:nrow(OliveTrees))

# Combine the distributions
U <- cbind(C_dist, R_dist, T_dist, S_dist)

# Fit a Gaussian copula
CopulaModel <- normalCopula(dim = 4, dispstr = 'un')
FittedCopula <- fitCopula(CopulaModel, U, method = 'ml')
CopulaModel@parameters <- coef(FittedCopula)

# Simulate from the copula
set.seed(321)
U <- rCopula(n = nrow(OliveTrees), CopulaModel)

# Sort the simulated values to account for the time trend
U <- U[order(U[, 4]), ]

# Apply the inverse CDF to get the simulated values
C_sim <- quantile(OliveTrees$CNorm, U[, 1])
R_sim <- quantile(OliveTrees$RNorm, U[, 2])
T_sim <- quantile(OliveTrees$TNorm, U[, 3])

# Denormalize the simulated values
C_sim <- round(C_sim * sigma_C + mu_C, 2)
R_sim <- round(R_sim * sigma_R + mu_R, 2)
T_sim <- T_sim * sigma_T + mu_T

# Create a data frame for the simulation results
OliveTrees_sim <- data.frame(C_sim, R_sim, T_sim, row.names = Years)
OliveTrees_sim$Temp <- round(OliveTrees$Temp[1] * c(1, cumprod(1 + OliveTrees_sim$T_sim[2:length(OliveTrees_sim$T_sim)])), 2)

My Questions:

  1. Is this copula approach valid for accounting for the temporal dynamics of olive plantation data? Specifically, temporal dynamics refer to the fact that there are large initial costs followed by growing revenues, and that both are not IID due to the time structure.

  2. Is including a time trend (in the form of a sequence of years) a suitable solution for modeling the temporal dependencies?

  3. Is there any literature or research that supports this approach, or are there better ways to model the temporal dependency in the data?

  4. Are there any better modeling approaches or improvements that could better capture the temporal dynamics between Costs, Revenues, and Temperature?

EDIT: Motivation of my analysis: calculate the internal rate of return of my investment and understand how sensitive it is to temperature changes. I will for sure include other variables (rain, storms,...), the above version of the data variables is a simplification.

Thank you for your help!

$\endgroup$
10
  • $\begingroup$ The idea of using copulas to simulate dependent data seems good to me. But before that, can you tell us what your intended analysis model is going to be ? One approach is to use a Markov Longitudinal Ordinal Model, another is a mixed effects where you would specify a variance-covariance matrix for the residual error - for example AR(1), Toepliz, and various other including fully unstructured. See my answer here. $\endgroup$ Commented Dec 8, 2024 at 11:34
  • $\begingroup$ @RobertLong Thank you for your suggestion! I am not intending to use a formal modeling approach at this stage. My goal is to create possible scenarios and perform data augmentation, with the aim of better understanding the relationship between the variables and optimizing the management of the olive plantation. Since the payback period is long and both the investments required and the revenues, as well as the temperature, are unpredictable, I thought it would be logical to generate joint scenarios and evaluate various possibilities for managing the farm effectively. $\endgroup$ Commented Dec 8, 2024 at 11:46
  • $\begingroup$ Yes, you can, and the best start is using D-vine copula. There are many publications regarding this topic. $\endgroup$ Commented Dec 16, 2024 at 7:07
  • $\begingroup$ @Dr.Statistics could you please provide some lit references? Thank for your comment $\endgroup$ Commented Dec 16, 2024 at 17:36
  • 1
    $\begingroup$ For the simplicity of the code, you can skip the normalisation step. It doesn't change the copula fit and simulation. $\endgroup$ Commented Dec 21, 2024 at 8:16

1 Answer 1

3
+150
$\begingroup$

Here are some one dimensional simulations that try to explore the approach and demonstrate what could go wrong.

Blue points are simulated data from the true model, black/gray dots are simulations from the fitted model.


Example 1

$$y|t \sim Binom\left(100,\frac{(t-0.3)^2}{0.8^2}\right)$$

If the trend is non-monotonic, then this is not well modeled with any copula which assumes a monotonic relationship.

example 1


Example 2

$$y|t \sim Poisson(100,100*t)$$

Heteroscedasticity is not well modeled as for a Gaussian distribution the variance conditional on one of the other parameters is constant. This is also true for the variance Matrix. So in your multidimensional scheme you won't be able to model changes in the correlations between variables as function of time.

example 2


Example 3

$$y|t \sim Normal(100,100*t,100)$$

Even for a plain OLS type model the fit is not the same. This is because the normal copula relates to a linear model between $y$ and $t$ when $t$ is expressed by the z-values.

example 3


Other points

  • The simulations will not simulate an uniform time variable. But you could do this manually by simulating from the Gaussian distribution underlying the copula, conditional on the time variable, and transform everything manually.
  • The copula assumes a continuous distribution. If there is any autocorrelation or temporal variations (like alternating good and bad years for the harvest), then the method with the copula will average these out.

Code

library(copula)

n = 200
set.seed(1)

time = c(1:n)/n
timeu = pobs(time)
y = rbinom(n,100,(time-0.3)^2/0.8^2) ## example 1
#y = rpois(n,time*100) ## example 2
#y = rnorm(n,time*100,10) ## example 3
yu = pobs(y)

plot(time,y)

# Fit a Gaussian copula
normal.cop = normalCopula(c(0.8), dim = 2, dispstr = 'un')
mod = fitCopula(normal.cop, cbind(timeu,yu), method = "ml")
fit = normalCopula(coef(mod), dim = 2, dispstr = 'un')
x = rCopula(1000,fit)

points(quantile(time,x[,1]),
       quantile(y,   x[,2]), pch = 20, cex = 0.5, col = rgb(0,0,0,0.5))
points(time,y, pch = 20, col = 4, cex = 1.5)
$\endgroup$

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.