Dear All,

A situation that surely comes up very often: suppose you have the following

set.seed(1235)
x1 <- seq(30)
x2 <- c(rep(NA, 9), rnorm(19)+9, c(NA, NA))
x3 <- c(rnorm(17)-2, rep(NA, 13))

y <- exp(seq(1,5, length=30))

mm<-lm(y~x1+x2+x3)

i.e. you try a linear regression with multiple regressors, some of which have missing values. This is what happens to me when working with time series that I use as regressors and whose missing values are padded with NAs.

By default, lm disregards incomplete observations and therefore drops quite a lot of data. Is there any way to circumvent this? I mean, is there a way to come up with a piecewise linear regression where all 3 regressors are used whenever possible, but we switch to 1 or 2 when data are missing?

I ask because it is totally infeasible to figure out the values of the missing data in my regressors, but at the same time I cannot restrict my model to the intersection of the non-NA values in the 3 regressors. If this makes sense, do I have to code it myself, or is there a package that already implements this?

Any suggestion is appreciated.
Cheers

Lorenzo
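To make the question concrete, here is a minimal sketch in base R of the two things described above: how many rows the default lm call actually keeps, and a hand-coded "switch regressors by availability" fit. The names dat, pattern and fits are illustrative and not from the original post. Fitting one separate regression per missingness pattern is only one possible reading of "piecewise"; nothing here is a statistically vetted method, and the coefficients are not shared across the pieces.

```r
set.seed(1235)
x1 <- seq(30)
x2 <- c(rep(NA, 9), rnorm(19) + 9, c(NA, NA))
x3 <- c(rnorm(17) - 2, rep(NA, 13))
y  <- exp(seq(1, 5, length = 30))
dat <- data.frame(y, x1, x2, x3)

# lm() defaults to na.action = na.omit, so only complete rows are used
mm <- lm(y ~ x1 + x2 + x3, data = dat)
nobs(mm)                   # rows used in the fit
sum(complete.cases(dat))   # same count, computed directly

# Label each row by the set of regressors observed in it
observed <- !is.na(dat[, c("x1", "x2", "x3")])
pattern  <- apply(observed, 1, function(ok) paste(names(ok)[ok], collapse = " + "))
table(pattern)

# Assumption: one independent lm() per missingness pattern,
# using only the regressors available in that block of rows
fits <- lapply(split(dat, pattern), function(d) {
  vars <- names(d)[-1][colSums(!is.na(d[-1])) > 0]
  lm(reformulate(vars, response = "y"), data = d)
})
sapply(fits, nobs)
```

With this layout every row ends up in exactly one of the fits, but the smallest block has very few observations, so its estimates carry little information.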
IMHO this is not a question about R... it is a question about statistics whether R is involved or not. As such, a forum like stats.stackexchange.com would be better suited to address this. FWIW I happen to think that expecting R to solve this for you is unreasonable.
-- 
Sent from my phone. Please excuse my brevity.

On March 15, 2016 8:14:42 AM PDT, Lorenzo Isella <lorenzo.isella at gmail.com> wrote:
>Dear All,
>A situation that for sure happens very often: [...]

______________________________________________
R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
One technique for dealing with this is called 'multiple imputation'. Google for 'multiple imputation in R' to find R packages that implement it (e.g., the 'mi' package).

Bill Dunlap
TIBCO Software
wdunlap tibco.com

On Tue, Mar 15, 2016 at 8:14 AM, Lorenzo Isella <lorenzo.isella at gmail.com> wrote:
> Dear All,
> A situation that for sure happens very often: [...]
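As a rough illustration of what multiple imputation looks like in practice, the sketch below uses the 'mice' package rather than 'mi'. That is a substitution made here for illustration, not the package named above. The choice of m = 5 imputations, the seed, and the names dat, imp, fits and pooled are all assumptions. The code only runs if mice is installed.

```r
if (requireNamespace("mice", quietly = TRUE)) {
  set.seed(1235)
  dat <- data.frame(
    y  = exp(seq(1, 5, length = 30)),
    x1 = seq(30),
    x2 = c(rep(NA, 9), rnorm(19) + 9, c(NA, NA)),
    x3 = c(rnorm(17) - 2, rep(NA, 13))
  )

  # Create m = 5 completed data sets (mice's default method for numeric columns)
  imp <- mice::mice(dat, m = 5, seed = 1, printFlag = FALSE)

  # Fit the same regression on each completed data set
  fits <- with(imp, lm(y ~ x1 + x2 + x3))

  # Combine the m sets of estimates with Rubin's rules
  pooled <- mice::pool(fits)
  print(summary(pooled))
}
```

Whether imputation is appropriate depends on why the values are missing, which is the statistical question raised earlier in the thread. With about half of x2 and x3 missing, the pooled standard errors can be expected to be wide.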