I have a gridded dataset indexed by time and space, represented as an $m \times n$ array. I'm following Eq. 10 in this paper to partition the variance in this data over space and time. Specifically, they partition the total variance $\sigma^2_g$ into the average temporal variance over all regions ($\bar{\sigma^2_t}$) and the average spatial variance over all time points ($\bar{\sigma^2_s}$):
$$ \sigma^2_g=\frac{n(m-1)}{(m \times n) -1}\bar{\sigma^2_t}+\frac{m(n-1)}{(m \times n) - 1}\bar{\sigma^2_s} $$
My hangup is that I have a considerable amount of missing data, so I know I have to account for different sample sizes. $m = 194540$ and $n = 25$ for my data, so the coefficients in each term are near one. This implies that the sum of the spatial and temporal variances I calculate should be close to the global variance.
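To make "near one" concrete, here is a quick check of the two coefficients in the equation above for my values of $m$ and $n$. The variable names `coef_t` and `coef_s` are only labels I made up for this snippet; they come out at roughly 1.00 and 0.96.

```python
m, n = 194540, 25  # values from my data

# Coefficients from the partitioning equation above
coef_t = n * (m - 1) / (m * n - 1)  # weight on the average temporal variance
coef_s = m * (n - 1) / (m * n - 1)  # weight on the average spatial variance

print(f"temporal coefficient: {coef_t:.6f}")
print(f"spatial coefficient:  {coef_s:.6f}")
```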
My current approach is to calculate the variance for each slice in time/space, and then calculate their weighted average based on the number of valid observations. As implemented in numpy:
import numpy as np
# data.shape == (25, 194540)
# first axis is time, second axis is spatial position
# note: np.nanvar ignores NaNs and defaults to ddof=0 (population variance)
total_var = np.nanvar(data)
# variance over time for each pixel, weighted by valid samples per pixel
var_over_time = np.nanvar(data, axis=0)
samples_per_px = np.sum(~np.isnan(data), axis=0)
sigma_t = np.average(var_over_time, weights=samples_per_px)
# variance over space for each time step, weighted by valid samples per time step
var_over_space = np.nanvar(data, axis=1)
samples_per_t = np.sum(~np.isnan(data), axis=1)
sigma_s = np.average(var_over_space, weights=samples_per_t)
print(f"Total variance: {total_var:.2f}")
print(f"Temporal variance: {sigma_t:.2f}")
print(f"Spatial variance: {sigma_s:.2f}")
This gives me
Total variance: 4.55
Temporal variance: 3.93
Spatial variance: 4.50
But this is inconsistent with my expectation that the sum of the spatial and temporal variances should be close to the total variance: 3.93 + 4.50 = 8.43, nearly twice the total of 4.55. Next, I tried repeating the calculation with all missing values replaced by zero. This still gives a sum larger than the total variance, which makes me think I could be approaching this question incorrectly.
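To separate the effect of missing values from the partition itself, here is a minimal sketch on synthetic complete data. The array, seed, and sizes are made up for illustration and are not my real data. It compares the sum of the two averaged variances with the total. As a reference point, with the population variance (`ddof=0`, NumPy's default) and no missing values, the total should equal the average spatial variance plus the variance of the per-time-step spatial means.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, synthetic data only
n_time, n_space = 25, 1000      # small stand-in for my (25, 194540) array

# Synthetic complete field: a temporal signal, a spatial signal, and noise
time_effect = rng.normal(size=(n_time, 1))
space_effect = rng.normal(size=(1, n_space))
noise = rng.normal(size=(n_time, n_space))
synthetic = time_effect + space_effect + noise

total_var = np.var(synthetic)
# variance over time per pixel, averaged over pixels
mean_temporal_var = np.var(synthetic, axis=0).mean()
# variance over space per time step, averaged over time steps
mean_spatial_var = np.var(synthetic, axis=1).mean()
# variance over time of the per-time-step spatial means
var_of_spatial_means = np.var(synthetic.mean(axis=1))

print(f"total variance:                  {total_var:.4f}")
print(f"temporal + spatial:              {mean_temporal_var + mean_spatial_var:.4f}")
print(f"spatial + var of spatial means:  {mean_spatial_var + var_of_spatial_means:.4f}")
```

In this sketch the last line matches the total, while the sum of the two averaged variances comes out larger even though nothing is missing.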
So, two questions:
- Does it make sense to use this partitioning procedure for data with missing values?
- Can variances "overlap" in the sense that we cannot attribute variance to time or space alone?
Thanks for your help!