0
$\begingroup$
Question

Are the following trends in Type I error (false positive / incorrectly reject $H_0$) expected when testing the Markov memoryless assumption? Specifically, Type I error seems to increase as:

  • the number of time points (nt) decreases (right to left)
  • the number of individuals chains (ni) increases (top to bottom).

I suspect I may be defining the degrees of freedom wrong: ($S*(S-1)*(T-1)$ for $S$ states and $T$ periods. Edit: I just noticed this d.f. does not consider ni ! ... How / should I consider ni in d.f. ?

updated results

Background & Code

Similar questions have been asked before (e.g. 1, 2, 3) but few of these answers provide code. Based on this paper), I've implemented R code to generate random Markov samples & run the Pearson & Likelihood Ratio $\chi^2$ tests. Note: we don't allow zero-probability transitions; this would make the test statistics a bit more complicated to define.

library('ggplot2')
rbind.lappy = function(...){ # convenience
  do.call(rbind,parallel::mclapply(...,mc.cores=7))
}
px.random = function(d=2){ # generate random transition matrix
  px = matrix(runif(d^2,.1,.9),d,d) # random numbers
  px = sweep(px,1,rowSums(px),'/')  # normalize
}
markov.limit = function(px){ # get limiting distribution
  e1 = eigen(t(px))$vectors[,1]
  p0 = e1 / sum(e1)
}
markov.sim = function(p0,px,nt,ni){ # simulate samples
  ns = length(p0)
  state = list(sample(1:ns,ni,p=p0,rep=TRUE)) # initial time
  for (i in 2:nt){ # each subsequent time
    state[[i]] = sapply(state[[i-1]],function(si){ sample(1:ns,1,p=px[si,]) })
  }
  names(state) = paste0('s',1:nt)
  ms = data.frame(i=1:ni,state)
}
markov.test = function(ms){
  ns = length(unique(ms[,2]))  # num states
  nt = ncol(ms)-1              # num time points
  n_hij = array(1,c(ns,ns,ns)) # 2nd order counts*
  n_ij  = array(1,c(ns,ns))    # 1st order counts*
  # *initialize with 1 to avoid divide-by-zero
  for (m in seq(nt-2)){ n_hij = n_hij + table(ms[paste0('s',m+0:2)]) }
  for (m in seq(nt-1)){ n_ij  = n_ij  + table(ms[paste0('s',m+0:1)]) }
  n_hi  = rowSums(n_hij,dim=2)
  n_i   = rowSums(n_ij, dim=1)
  p_hij = sweep(n_hij,1:2,n_hi,'/')
  p_ij  = sweep(n_ij, 1,  n_i, '/')
  Q.p   = sum(sweep(sweep(sweep(p_hij,2:3,p_ij,'-')^2,2:3,p_ij,'/'),1:2,n_hi,'*'))
  Q.lr  = 2*sum(n_hij*log(sweep(p_hij,2:3,p_ij,'/')))
  Q = c(pearson=Q.p, likel.ratio=Q.lr)
  p = pchisq(Q,df=ns*(ns-1)*(nt-1),lower=FALSE)
}
markov.test.pvs = function(nt,ni,ns=2,nk=100,lim=1){
  pvs = rbind.lappy(1:nk,function(k){
    set.seed(k)
    px = px.random(ns)
    if (lim){ p0 = markov.limit(px) }
    else    { p0 = rep(1,ns) / ns }
    ms = markov.sim(p0,px,nt=nt,ni=ni)
    pv = c(nt=nt,ni=ni,ns=ns,nk=nk,lim=lim,markov.test(ms))
  })
}
# main
G = expand.grid( # params to sweep
  ni=c(30,100,300,1000), # num individuals (chains)
  nt=3:7,  # num time points
  ns=2,    # num states
  nk=1000, # num replicates
  lim=0:1) # init at steady-state
pvs = rbind.lappy(split(G,1:nrow(G)),do.call,what=markov.test.pvs)
# reshape & plot
pvs.m = reshape2::melt(as.data.frame(pvs),id=colnames(pvs)[1:5])
pvs.m$init = factor(pvs.m$lim,0:1,c('uniform','limit'))
pos = position_dodge(width=1)
g = ggplot(pvs.m,aes(y=value,x='',color=variable,lty=init)) +
  facet_grid('ni~nt',labeller=label_both) +
  geom_violin(scale='width',position=pos,fill=NA) +
  geom_hline(yintercept=.05,lty='11') +
  stat_summary(geom='text',position=pos,show.legend=FALSE,
    fun.data=function(x){ data.frame(y=-.1,label=mean(x < .05),size=2) }) +
  labs(y='p value',x='',color='statistic') + ylim(-.1,1)
ggsave('Rplots.pdf',w=8,h=6)
$\endgroup$
5
  • $\begingroup$ I can see just one statistical question here, so I'm focusing on it: what do you mean by "these trends"?? $\endgroup$ Commented Nov 22, 2024 at 13:33
  • $\begingroup$ Thanks. There seems to be an increase in Type I error as: the number of time points (nt) decreases (right to left) and the number of individuals chains (ni) increases (top to bottom). I wonder if I am calculating degrees of freedom (df) in pchisq wrong. $\endgroup$ Commented Nov 22, 2024 at 16:44
  • $\begingroup$ That's a good statistical question, but please don't bury it in the code! State in English what you are hoping the code is doing in that regard and ask your question explicitly. $\endgroup$ Commented Nov 22, 2024 at 17:51
  • $\begingroup$ Thanks again @whuber. I have edited the question to hopefully be more clear. $\endgroup$ Commented Nov 23, 2024 at 11:33
  • 1
    $\begingroup$ With so many parameters I think your model is too unstructured. See this for a structured first order Markov model where you can focus the ‘memoryless’ part of the model on things like the interaction between absolute time and previous state, and also assess the Markov assumption by adding random effects into the model if you have a lot of independent experimental units. $\endgroup$ Commented Nov 23, 2024 at 12:22

0

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.