
The example is from Chapter 12 of https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/.

In causal inference, it is common to estimate inverse probability weights and then fit a weighted regression model. After the weights are computed, here is the code (two ways to fit the weighted regression):

gee.obj <- geeglm(wt82_71 ~ qsmk, data = nhefs0, weights = w, id = seqn, corstr = "independence")

glm.obj <- glm(wt82_71 ~ qsmk + cluster(seqn), data = nhefs0, weights = w)

I am wondering:

What do these two commands mean? The R documentation is really confusing... The book indicates that the above is a "sandwich estimator". I know that the sandwich estimator is a robust procedure against misspecification, but the code looks like a longitudinal procedure, and the data do not have that structure at all (e.g., seqn is unique, so there is only one element in each cluster)...

Also, if you could comment on how the robust procedure compares to a simple lm(wt82_71 ~ qsmk, weights = w), I would deeply appreciate it.

All data are downloadable from the website if you want to try.


3 Answers


The use of IPW requires a robust variance estimator, even without the sort of structure you're thinking of.

Basically, both of those commands use a sandwich estimator (http://thestatsgeek.com/2013/10/12/the-robust-sandwich-variance-estimator-for-linear-regression/ and http://thestatsgeek.com/2014/02/14/the-robust-sandwich-variance-estimator-for-linear-regression-using-r/ give decent coverage of the topic) to allow for some structure in the data and thus a more appropriate estimate of the variance. Because this is very often used for clustered data, many of the functions that do it in R (and in other packages) use some sort of "cluster"-type nomenclature.
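For reference, the name comes from the form of the estimator. For a weighted linear model with design matrix $X$, weights $w_i$ (collected in the diagonal matrix $W$), and residuals $\hat u_i$, a basic (HC0-type) version is

$$\widehat{\operatorname{Var}}(\hat\beta) = (X^\top W X)^{-1}\Big(\textstyle\sum_i w_i^2\,\hat u_i^2\, x_i x_i^\top\Big)(X^\top W X)^{-1},$$

where the outer "bread" matrices sandwich the inner "meat" term. Unlike the model-based variance $\hat\sigma^2 (X^\top W X)^{-1}$, the meat uses the observed residuals, so it remains valid when the weights do not actually reflect the residual variances.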

When it was introduced to me, there weren't compelling analytical results for why IPW needs a robust variance estimator, but it had been shown by simulation. One explanation I heard is that the weights aren't independent: if you know N-1 of the weights, you know the Nth weight.
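A small simulation illustrates the point (this is hypothetical, made-up data, not the NHEFS example; it assumes the sandwich package is installed):

```r
# Hypothetical simulation comparing the naive model-based SE from a
# weighted lm() with the robust sandwich SE under IP weighting.
library(sandwich)

set.seed(1)
n <- 1000
z <- rnorm(n)                              # confounder
p <- plogis(z)                             # true propensity score
a <- rbinom(n, 1, p)                       # treatment
y <- 2 * a + z + rnorm(n)                  # outcome
w <- ifelse(a == 1, 1 / p, 1 / (1 - p))    # inverse probability weights

fit <- lm(y ~ a, weights = w)
se.model  <- summary(fit)$coefficients["a", "Std. Error"]  # model-based SE
se.robust <- sqrt(vcovHC(fit, type = "HC0")["a", "a"])     # sandwich SE
c(model = se.model, robust = se.robust)    # the two can differ noticeably
```

The point estimate is the same either way; only the standard error changes, and it is the sandwich one that has (approximately) correct coverage under IPW.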

  • I might be missing a point: you can do lm() without clustering and still use the sandwich estimator... What is the difference? Also, if you can tell me what lm(y ~ cluster(x) + z) actually means, I would appreciate it (or just give me a link). Many thanks! Commented Aug 22, 2018 at 2:40
  • @failedstatistician There are many ways to do robust variance estimates in R; using the cluster argument like that lets you stay in a fairly familiar syntax for R users. lm(y ~ z + cluster(x)), which is the way your example above is formatted, just means to compute cluster-robust standard errors where x indicates the cluster ID. If each individual is its own cluster, you get back the standard sandwich estimate. Commented Aug 22, 2018 at 2:46

Besides the good answers already provided by @Fomite and @Noah, I would like to point out the difference between the weights argument in lm() and a function like svyglm(). This addresses your question of how the robust procedure compares to a simple lm(wt82_71 ~ qsmk, weights = w).

When using the lm() function, the weights argument is treated as precision weights (the inverse of the residual variances), not as sampling weights. Sampling weights are what IPTW actually produces, and they are what the svydesign() and svyglm() functions expect.

So, with weighted least squares estimation using lm(), the point estimates will be correct but the standard errors will be biased.
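To see this concretely, here is a sketch on hypothetical simulated data (the point does not depend on NHEFS): lm() and svyglm() return the same point estimates, but svyglm()'s standard error is the design-based (robust) one.

```r
# lm() treats w as precision weights, svyglm() as sampling weights:
# identical coefficients, different standard errors.
library(survey)

set.seed(2)
n <- 500
z <- rnorm(n)                              # confounder
p <- plogis(z)                             # true propensity score
a <- rbinom(n, 1, p)                       # treatment
y <- a + z + rnorm(n)                      # outcome
w <- ifelse(a == 1, 1 / p, 1 / (1 - p))    # inverse probability weights
dat <- data.frame(y, a, w)

fit.lm  <- lm(y ~ a, data = dat, weights = w)
fit.svy <- svyglm(y ~ a, design = svydesign(ids = ~1, weights = ~w, data = dat))

all.equal(unname(coef(fit.lm)), unname(coef(fit.svy)))  # same point estimates
summary(fit.lm)$coefficients["a", "Std. Error"]         # precision-weight SE
summary(fit.svy)$coefficients["a", "Std. Error"]        # design-based SE
```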


These are just two of many ways to compute robust standard errors for IPW. In both, the analyst has "hacked" tools meant for cluster-robust standard errors to work when there are no clusters. There are more straightforward ways to get these standard errors in R; here are two:

fit <- survey::svyglm(wt82_71 ~ qsmk, design = survey::svydesign(ids = ~1, data = nhefs0, weights = ~w))
summary(fit)

fit <- lm(wt82_71~qsmk, data = nhefs0, weights = w)
jtools::summ(fit, robust = TRUE)

There are many other ways, such as using the sandwich package directly, etc. All of these should provide the same or similar answers (there are a variety of ways to compute robust standard errors, e.g., HC0, HC1, etc., and the defaults differ by package). In SAS, you can get these with proc surveyreg, and in Stata, you can get these by setting [pweights=w] (and it will produce robust standard errors automatically).
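For instance, the sandwich route mentioned above (assuming nhefs0 and the weights w are set up as in the question, and that the sandwich and lmtest packages are installed) is just:

```r
# Robust (HC0) standard errors via the sandwich and lmtest packages;
# assumes nhefs0 and the IP weights w from the question are in scope.
fit <- lm(wt82_71 ~ qsmk, data = nhefs0, weights = w)
lmtest::coeftest(fit, vcov. = sandwich::vcovHC(fit, type = "HC0"))
```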

Note that none of these are the standard errors developed for IPW that are described in Lunceford & Davidian (2004), which require you to specify a system of generalized estimating equations for the propensity scores and the causal effects. These are (conservative) approximations as recommended by Robins, Hernan, & Brumback (2000), among others.

  • Thank you! I strictly prefer the svyglm one because this is at least something I know... Commented Aug 23, 2018 at 3:08
  • BTW... do you know the difference between glm(y~z) and glm(y~z+cluster(x)), if we apply the robust estimator in both cases? I see both practices (the latter in the book above, the former here... coursera.org/learn/crash-course-in-causality/lecture/Ie48W/…). Sometimes the coefficient is very different between these two. Commented Aug 23, 2018 at 3:13
  • I don't know, sorry. I've actually never been able to get the cluster syntax to work. I would avoid it. Commented Aug 23, 2018 at 4:04
  • Well. Thanks anyway. You have been very helpful. BTW, update your survival package to see if cluster(x) works. Commented Aug 23, 2018 at 4:41
