Fredrik Karlsson
2016-Jun-20 07:48 UTC
[R] L1 penalized regression fails to predict from model
Dear list,

Sorry for this cross-post from StackOverflow, but I see that SO was maybe the wrong forum for this question; it is probably too package-specific.

What I am trying to do is to predict from an L1 penalized regression. This fails due to a data set dimension problem that I cannot figure out.

The procedure I'm using is the following:

require(penalized)
require(dplyr)  # provides %>% and sample_frac()
# neg contains negative data
# pos contains positive data

Now, the procedure below aims to construct comparable (balanced in terms of positive and negative cases) training and validation data sets.

# 50% negative training set
negSamp <- neg %>% sample_frac(0.5) %>% as.data.frame()
# Negative validation set
negCompl <- neg[setdiff(row.names(neg), row.names(negSamp)), ]
# 50% positive training set
posSamp <- pos %>% sample_frac(0.5) %>% as.data.frame()
# Positive validation set
posCompl <- pos[setdiff(row.names(pos), row.names(posSamp)), ]
# Combine sets
validat <- rbind(negSamp, posSamp)
training <- rbind(negCompl, posCompl)

So here we now have two comparable sets.

[1] FALSE TRUE
> dim(training)
[1] 1061  381
> dim(validat)
[1] 1060  381
> identical(names(training), names(validat))
[1] TRUE

I fit the model to the training set without a problem (and I've tried a range of lambda1 values here). But predicting from the fitted model for the validation data set fails, with a rather odd error message.

> fit <- penalized(VoiceTremor, training[-1], data=training, lambda1=40, standardize=TRUE)
# nonzero coefficients: 13
> fit2 <- predict(fit, penalized=validat[-1], data=validat)
Error in .local(object, ...) :
  row counts of "penalized", "unpenalized" and/or "data" do not match

Just to make sure that this is not due to some NAs in the data set:

> identical(validat, na.omit(validat))
[1] TRUE

Oddly enough, I can generate some new data that is comparable to the proper data set:

> data.frame(VoiceTremor="NVT", matrix(rnorm(380000), nrow=1000, ncol=380)) -> neg
> data.frame(VoiceTremor="VT", matrix(rnorm(380000), nrow=1000, ncol=380)) -> pos
> dim(pos)
[1] 1000  381
> dim(neg)
[1] 1000  381

and run the procedure above, and then the prediction step works!

How come? What could be wrong with my second (not training) data set?

Fredrik

-- 
"Life is like a trumpet - if you don't put anything into it, you don't get anything out of it."
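
Since the simulated data is purely numeric and works, one thing that might be worth comparing between the two real sets is the column classes and, for any factor columns, the levels actually present in each half. The following is only a sketch of such checks in base R. The toy data frames stand in for the real training and validat, the helper names col_class and present_levels are made up for illustration, and whether any of this explains the error is an assumption.

```r
# Hypothetical stand-ins for the real data frames (NOT the actual data)
set.seed(1)
training <- data.frame(VoiceTremor = factor(rep(c("NVT", "VT"), each = 5)),
                       X1 = rnorm(10),
                       X2 = factor(rep(c("a", "b"), 5)))
validat <- data.frame(VoiceTremor = factor(rep(c("NVT", "VT"), each = 4)),
                      X1 = rnorm(8),
                      X2 = factor(rep("a", 8)))

# First class of every column, to spot columns whose type differs between sets
col_class <- function(df) vapply(df, function(x) class(x)[1], character(1))
class_mismatch <- names(training)[col_class(training) != col_class(validat)]

# Predictors that are not plain numeric (factors, characters);
# that these matter for the prediction step is an assumption
non_numeric <- names(training)[-1][!vapply(training[-1], is.numeric, logical(1))]

# Values actually present in a column, ignoring unused factor levels
present_levels <- function(x) sort(unique(as.character(x)))

# Non-numeric predictors whose observed values differ between the two sets
level_diff <- Filter(function(nm) {
  !identical(present_levels(training[[nm]]), present_levels(validat[[nm]]))
}, non_numeric)

# Row counts of the two pieces that would be handed to predict()
n_rows <- c(penalized = nrow(validat[-1]), data = nrow(validat))

print(class_mismatch)
print(non_numeric)
print(level_diff)
print(n_rows)
```

If non_numeric or level_diff comes back non-empty on the real data, that would at least be one difference from the simulated all-numeric case.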