I need predict to ignore rows that contain levels not in the model.

Consider a data frame, "const", that has columns for the number of days required to construct a site and the city and state the site was constructed in.

    g<-lm(days~city,data=const)

Some of the sites in const have not yet been completed, and therefore they have days==NA. I want to predict how many days these sites will take to complete. (I've simplified the above discussion to remove many of the other factors involved.)

    nconst<-subset(const,is.na(const$days))
    x<-predict(g,nconst)
    Error in model.frame.default(object, data, xlev = xlev) :
            factor city has new level(s) ALBANY

This is because we haven't yet completed a site in Albany. If I had just one to worry about I could easily fix it (choose a nearby market with similar characteristics), but I am dealing with several hundred cities. Instead, for the cities not modeled by g I'd simply like to use the state, even though I don't expect it to be as good:

    g<-lm(days~state,data=const)
    x<-predict(g,nconst)

I'm not sure how to identify the cities in nconst that are not modeled by g (my actual model has many more predictors in the formula). Is there a way to instruct predict to only predict the rows for which it has enough information and not complain about the others? Something like:

    g<-lm(days~city,data=const)
    x<-predict(g,nconst) ## the rows of x with city=ALBANY will be NA
    g<-lm(days~state,data=const)
    y<-predict(g,nconst)
    x[is.na(x)]<-y[is.na(x)]

thanks,
pete
On Tue, Sep 16, 2003 at 11:44:02AM -0500, Peter Whiting wrote:
> I'm not sure how to identify the cities in nconst that are not
> modeled by g (my actual model has many more predictors in the
> formula)

I guess I could use some form of

    subset(const,const$city%in%g$xlevels$city)

over and over again for each factor... as usual, there has to be a better way.

pete
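The "over and over again for each factor" idea can be wrapped in a loop over the levels that the fitted model records. The following is a minimal sketch rather than a tested solution for the real data: the toy values in `const` are made up, `known_levels` is a hypothetical helper name, and it assumes every factor enters the formula under its plain column name (so that the names in `g$xlevels` match columns of the data frame).

```r
# Toy stand-in for the poster's `const` data frame (values are made up).
const <- data.frame(
  days  = c(30, 34, 40, 41, 45, 52, 50, NA, NA, NA),
  city  = factor(c("BUFFALO", "BUFFALO", "ROCHESTER", "AUSTIN", "AUSTIN",
                   "DALLAS", "DALLAS", "BUFFALO", "ALBANY", "DALLAS")),
  state = factor(c("NY", "NY", "NY", "TX", "TX", "TX", "TX",
                   "NY", "NY", "TX"))
)

# Hypothetical helper: TRUE for rows of `newdata` whose factor values all
# appear among the levels recorded in the fitted model's `xlevels`.
known_levels <- function(model, newdata) {
  ok <- rep(TRUE, nrow(newdata))
  for (v in names(model$xlevels)) {
    ok <- ok & as.character(newdata[[v]]) %in% model$xlevels[[v]]
  }
  ok
}

nconst  <- subset(const, is.na(days))
g_city  <- lm(days ~ city,  data = const)
g_state <- lm(days ~ state, data = const)

ok <- known_levels(g_city, nconst)

# City model where the city is known, state model for the rest.
# droplevels() removes the unused ALBANY level from the subset.
x <- rep(NA_real_, nrow(nconst))
x[ok]  <- predict(g_city,  droplevels(nconst[ok, , drop = FALSE]))
x[!ok] <- predict(g_state, droplevels(nconst[!ok, , drop = FALSE]))
```

With a longer formula the same helper should apply unchanged, since the loop runs over whatever factors the model recorded; rows with a state that was also never observed would still need separate handling.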
Thomas W Blackwell
2003-Sep-16 20:17 UTC
[R] can predict ignore rows with insufficient info
Peter -

Your subsequent email seems just right. You have to determine ahead of time which rows can be estimated. Here's a strategy, and possibly some code to implement it.

Let supported(i,y,d) be a user-written function which returns a logical vector indicating rows which should be omitted from the prediction on account of a non-covered covariate in column i of data frame d with outcome variable y. Apply this function to all columns in your data frame using lapply(). Then do the "or" of all the logical vectors by calculating the row sums of the numeric (0 or 1) equivalents. Last, convert back to logical, and subscript your data frame with this in the call to predict().

Here's some rough code:

    supported <- function(i,y,d)  {
      result <- rep(F, dim(d)[1])      #  default return value when
      if (is.factor(d[[i]]))           #  d[[i]] is not a factor.
        # TRUE where the level never occurs with an observed outcome
        result <- !(d[[i]] %in% unique(d[[i]][ !is.na(d[[y]]) ]))
      result  }

    tmp.1 <- lapply(seq(along=const), supported, "days", const)
    tmp.2 <- matrix(unlist(tmp.1[ names(const) != "days" ]),
                    nrow=dim(const)[1])
    tmp.3 <- as.logical(as.vector(tmp.2 %*% rep(1, dim(tmp.2)[2])))

    x <- predict(g, const[ is.na(const$days) & !tmp.3, ])

This code uses a few arcane maneuvers. Look at the help pages for the relevant functions to dope out what it is doing, particularly for lapply(), seq(), rep(), unlist(), unique(), "%*%", "%in%". (The last two must be quoted in order to see the help.) However, the code might work for you right out of the box!

- tom blackwell - u michigan medical school - ann arbor -
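The "or" step of the strategy in this reply can also be written without the matrix product. The sketch below is one possible variant, not the reply's own code: the data are made up, the flagging function is renamed `unsupported` (a hypothetical name) so that TRUE plainly means "omit this row", and `Reduce()` combines the per-column flags.

```r
# Made-up data in the shape of the poster's `const`.
const <- data.frame(
  days  = c(30, 34, 41, 45, NA, NA, NA),
  city  = factor(c("BUFFALO", "BUFFALO", "AUSTIN", "AUSTIN",
                   "BUFFALO", "ALBANY", "AUSTIN")),
  state = factor(c("NY", "NY", "TX", "TX", "NY", "NY", "TX"))
)

# TRUE where column `col` is a factor holding a level that never occurs
# alongside a non-missing outcome `y`; FALSE for non-factor columns.
unsupported <- function(col, y, d) {
  if (!is.factor(d[[col]])) return(rep(FALSE, nrow(d)))
  seen <- unique(d[[col]][!is.na(d[[y]])])
  !(d[[col]] %in% seen)
}

predictors <- setdiff(names(const), "days")
flags <- lapply(predictors, unsupported, y = "days", d = const)

# Element-wise "or" across all columns, replacing the row-sum step.
omit <- Reduce(`|`, flags)

g <- lm(days ~ city, data = const)
rows <- is.na(const$days) & !omit
x <- predict(g, droplevels(const[rows, , drop = FALSE]))
```

Here only the ALBANY row is flagged, so predictions are returned for the two unfinished sites whose cities the model has seen.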