I am running a regression with several independent variables with 32 observations (from 1975 to 2006 and they are yearly data). The issue here is that there does not exist any observation for one of the variables prior to 1980. Consequently, that variable has 5 missing observations (from 1975 to 1979). Is there any method in R to provide an estimation for these missing values? By the way, the explanatory variable here is "total labor force" and it has a very pronounced trend. Therefore, I know very well that it is statistically very possible to estimate the past values.
- The "predict" function may be what you're after? – Matt Albrecht, Apr 12, 2012 at 4:38
- Could you please use it in a line of code for me? Thank you @MattAlbrecht – SavedByJESUS, Apr 12, 2012 at 5:57
- Labour markets and economies around the world were undergoing a period of great change at this time. It may not be sensible to extrapolate backwards from your data. en.wikipedia.org/wiki/Early_1980s_recession – James, Apr 13, 2012 at 11:01
- @James Thank you very much for pointing this out, but the country under consideration is Côte d'Ivoire (West Africa) and as a small open economy, the impact of the major changes in the 1980 may have been very minimal. – SavedByJESUS, Apr 14, 2012 at 4:15
- You might be interested in stats.stackexchange.com/questions/13984/… ... it's a related problem, and has some worked examples – naught101, Apr 17, 2012 at 12:58
3 Answers
x <- 1:30; y <- c(rnorm(25) + 1:25, rep(NA, 5)) # generate data whose last 5 values are NA
df1 <- data.frame(x, y) # combine into a data frame
lmx <- lm(y ~ x, data = df1) # fit the model on the observed rows (lm drops NA rows by default)
ndf <- data.frame(x = 1:30) # data to predict for, including the rows with missing y
df1$fit <- predict(lmx, newdata = ndf) # fitted values for all 30 rows
df1$y2 <- with(df1, ifelse(is.na(y), fit, y)) # keep observed y, fill NAs with fitted values
The last line creates a new variable, y2, in the data frame. It keeps the observed values of y where they exist and uses the fitted values from the regression where y is missing.
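Applied to the situation in the question, the same idea could look like the sketch below. It is only an illustration: the data are simulated, the names (year, labor, labor_filled) are made up, and a log-linear trend is an assumption that would need checking against the real series. Note that back-transforming with exp() gives roughly the median rather than the mean on the original scale.

```r
set.seed(1)

# Hypothetical yearly series, 1975-2006, with 1975-1979 unobserved
year  <- 1975:2006
labor <- 2e6 * exp(0.03 * (year - 1975) + rnorm(length(year), sd = 0.01))
labor[year < 1980] <- NA
dat <- data.frame(year, labor)

# Fit a log-linear trend on the observed years (lm drops NA rows by default)
trend <- lm(log(labor) ~ year, data = dat)

# Predict on the log scale for every year, with 95% prediction intervals
pred <- predict(trend, newdata = dat, interval = "prediction")

# Fill only the missing years; observed values are left untouched
dat$labor_filled <- ifelse(is.na(dat$labor), exp(pred[, "fit"]), dat$labor)
dat$lower <- exp(pred[, "lwr"])
dat$upper <- exp(pred[, "upr"])

head(dat, 6)
```

The lower and upper columns give a sense of how uncertain the backcast values are, which is worth keeping in mind before treating them as if they were observed data in the main regression.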
- Of course lm may or may not be the model you are looking for... – nico, Apr 12, 2012 at 6:57
- @MattAlbrecht This is exactly what I was looking for. Thank you so much! – SavedByJESUS, Apr 14, 2012 at 19:01
- No problem. Just make sure that what you're doing is defensible. – Matt Albrecht, Apr 15, 2012 at 8:43
It is often a good idea to consider the possible reasons for data being missing, i.e. missing completely at random, missing at random, or missing not at random. Depending on which of these applies, methods that estimate the missing data may give biased results.
A sophisticated way to deal with data missing at random is multiple imputation, which acknowledges that there is uncertainty about the values of the missing quantities. This can be done in R using the mice package. Here is a reproducible example using the nhanes data that comes with the package:
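library(mice)
imp <- mice(nhanes) # multiple imputation (5 imputed data sets by default)
fit <- with(imp, lm(bmi ~ chl + hyp)) # fit the model on each imputed data set
fit
summary(pool(fit)) # pool the estimates across the imputed data sets
complete(imp) # returns the data with the first set of imputed values; complete(imp, 2) returns the second set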
Another approach would be to use a simulation-based method such as Gibbs sampling, drawing the missing values from a model fitted to the observed data.
I believe there is support for that in R: http://darrenjw.wordpress.com/2011/07/31/faster-gibbs-sampling-mcmc-from-within-r/
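To make the idea concrete, here is a minimal base-R sketch of a Gibbs sampler that treats the missing values as unknowns and draws them together with the regression parameters. It is not the code from the linked post. It assumes a linear trend with normal errors and flat priors (prior on the variance proportional to 1/sigma2), and the data and all names (time_idx, y, miss, draws) are invented for illustration.

```r
set.seed(42)

# Hypothetical data: linear trend, first 5 values missing
time_idx <- 1:32
y <- 10 + 0.5 * time_idx + rnorm(32, sd = 1)
miss <- time_idx <= 5
y[miss] <- NA

X <- cbind(1, time_idx)
XtX_inv <- solve(crossprod(X))
n <- length(y)
p <- ncol(X)

# Starting values from the complete cases
start  <- lm.fit(X[!miss, ], y[!miss])
beta   <- start$coefficients
sigma2 <- sum(start$residuals^2) / (sum(!miss) - p)

n_iter <- 2000
burn   <- 500
draws  <- matrix(NA_real_, n_iter, sum(miss))
y_cur  <- y

for (i in seq_len(n_iter)) {
  # 1. Draw the missing y given the current beta and sigma2
  mu_miss <- X[miss, , drop = FALSE] %*% beta
  y_cur[miss] <- mu_miss + rnorm(sum(miss), sd = sqrt(sigma2))

  # 2. Draw beta given the completed y and sigma2: N(beta_hat, sigma2 * (X'X)^-1)
  beta_hat <- XtX_inv %*% crossprod(X, y_cur)
  beta <- beta_hat + t(chol(sigma2 * XtX_inv)) %*% rnorm(p)

  # 3. Draw sigma2 given the completed y and beta: inverse-gamma(n/2, RSS/2)
  rss <- sum((y_cur - X %*% beta)^2)
  sigma2 <- 1 / rgamma(1, shape = n / 2, rate = rss / 2)

  draws[i, ] <- y_cur[miss]
}

post <- draws[-seq_len(burn), ]
colMeans(post)                               # posterior mean of each missing value
apply(post, 2, quantile, c(0.025, 0.975))    # 95% credible intervals
```

Each row of post is one plausible set of values for the five missing observations, so the spread across rows reflects the uncertainty in the imputation rather than a single point estimate.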