Jason Roberts
2011-Oct-14 16:06 UTC
[R] Party package: varimp(..., conditional=TRUE) error: term 1 would require 9e+12 columns
I would like to build a forest of regression trees to see how well some covariates predict a response variable and to examine the importance of the covariates. I have a small number of covariates (8) and large number of records (27368). The response and all of the covariates are continuous variables.

A cursory examination of the covariates does not suggest they are correlated in a simple fashion (e.g. the variance inflation factors are all fairly low) but common sense suggests there should be some relationship: one of them is the day of the year and some of the others are environmental parameters such as water temperature. For this reason I would like to follow the advice of Strobl et al. (2008) and try the authors' conditional variable importance measure. This is implemented in the party package by calling varimp(..., conditional=TRUE). Unfortunately, when I call that on my forest I receive the error:

> varimp(myforest, conditional=TRUE)
Error in model.matrix.default(as.formula(f), data = blocks) :
  term 1 would require 9e+12 columns

Does anyone know what is wrong?

I noticed a post in June 2011 where a user reported this message and the ultimate problem was that the importance measure was being conditioned on too many variables (47). I have only a small number of variables here so I guessed that was not the problem.

Another suggestion was that there could be a factor with too many levels. In my case, all of the variables are continuous. Term 1 (x1 below) is the day of the year, which does happen to be integers 1 ... 366. But the variable is class numeric, not integer, so I don't believe cforest would treat it as a factor, although I do not know how to tell whether cforest is treating something as continuous or as a factor.

Thank you for any help you can provide. I am running R 2.13.1 with party 0.9-99994. You can download the data from http://www.duke.edu/~jjr8/data.rdata (512 KB).
Here is the complete code:

> load("\\Temp\\data.rdata")
> nrow(df)
[1] 27368
> summary(df)
       y                 x1              x2               x3               x4
 Min.   :  0.000   Min.   :  1.0   Min.   :0.0000   Min.   :  1.00   Min.   :  52
 1st Qu.:  0.000   1st Qu.:105.0   1st Qu.:0.0000   1st Qu.: 30.00   1st Qu.:1290
 Median :  1.282   Median :169.0   Median :0.2353   Median : 38.00   Median :1857
 Mean   :  5.651   Mean   :178.7   Mean   :0.2555   Mean   : 55.03   Mean   :1907
 3rd Qu.:  5.353   3rd Qu.:262.0   3rd Qu.:0.4315   3rd Qu.: 47.00   3rd Qu.:2594
 Max.   :195.238   Max.   :366.0   Max.   :1.0000   Max.   :400.00   Max.   :3832
       x5                  x6              x7                  x8
 Min.   : 0.008184   Min.   :16.71   Min.   :0.0000000   Min.   : 0.02727
 1st Qu.: 6.747035   1st Qu.:23.92   1st Qu.:0.0000000   1st Qu.: 0.11850
 Median :11.310277   Median :26.35   Median :0.0001569   Median : 0.14625
 Mean   :12.889021   Mean   :26.31   Mean   :0.0162043   Mean   : 0.20684
 3rd Qu.:18.427410   3rd Qu.:28.95   3rd Qu.:0.0144660   3rd Qu.: 0.20095
 Max.   :29.492380   Max.   :31.73   Max.   :0.3157486   Max.   :11.76877
> library(HH)
<output deleted>
> vif(y ~ ., data=df)
      x1       x2       x3       x4       x5       x6       x7       x8
1.374583 1.252250 1.021672 1.218801 1.015124 1.439868 1.075546 1.060580
> library(party)
<output deleted>
> mycontrols <- cforest_unbiased(ntree=50, mtry=3)  # Small forest but requires a few minutes
> myforest <- cforest(y ~ ., data=df, controls=mycontrols)
> varimp(myforest)
        x1         x2         x3         x4         x5         x6         x7         x8
 11.924498 103.180195  16.228864  30.658946   5.053500  12.820551   2.113394   6.911377
> varimp(myforest, conditional=TRUE)
Error in model.matrix.default(as.formula(f), data = blocks) :
  term 1 would require 9e+12 columns
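On the question of how to tell whether a column is a factor: one quick check, sketched here on a toy data frame rather than the posted data, is to look at the class of each column before fitting. The second half of the sketch shows why model.matrix can ask for a very large number of columns: an interaction of factors needs one column per combination of levels, so the level counts multiply. Whether varimp builds such an interaction internally is only an assumption suggested by the `data = blocks` in the error message; it is not confirmed anywhere in this thread. The names toy, grid, a and b are hypothetical.

```r
# Hypothetical toy data; the column names only mimic the poster's df
toy <- data.frame(
  y  = c(0.5, 1.2, 3.4, 0.0),
  x1 = c(1, 105, 169, 366),            # day of year stored as numeric
  x2 = factor(c("a", "b", "a", "c"))   # a genuine factor, for contrast
)

# Class of each column, and whether R regards it as a factor
col_classes <- sapply(toy, class)
is_fac      <- sapply(toy, is.factor)
print(data.frame(class = col_classes, is_factor = is_fac))

# Column count of an interaction term: one column per level combination
grid <- expand.grid(a = factor(1:3), b = factor(1:4))
mm   <- model.matrix(~ -1 + a:b, data = grid)
print(ncol(mm))   # 3 levels x 4 levels
```

Running the same two sapply calls on the real df would show at a glance whether x1 is stored as numeric or as a factor in the data frame itself. It does not show what cforest does with the column afterwards.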
Ken Hutchison
2011-Oct-14 19:21 UTC
[R] Party package: varimp(..., conditional=TRUE) error: term 1 would require 9e+12 columns
Hi,

That's a tough one; I'll do my best and hope a more knowledgeable person will correct me.

Since you can measure conditional importance by permuting predictors and re-evaluating importance, perhaps try the randomForest package and examine how your results change based on permutation of each predictor. I understand permutation would take a prohibitively large amount of time for certain applications. Try these (clumsy) shortcuts. Some pseudocode:

First option:

  library(randomForest)
  myforest <- randomForest(y ~ ., data=df)
  imp <- myforest$importance   # Just for your info, importance is here.
  # Permute an x (x1 is used here only as an example)
  df_perm <- df
  df_perm$x1 <- sample(df$x1, length(df$x1), replace=FALSE)
  # Make new forest
  newforest <- randomForest(y ~ ., data=df_perm)
  # Predict old x with newforest
  # If somewhat accurate, problems afoot.

Second option: Predict your held-out variable 1000 times with the new forest (pretty quick to do) and examine the quantile of the predicted value relative to the old (non-permuted) distribution of the variable. It should be uniformly distributed between 0 and 1 if truly random inside the forest (and random outside, since we know it has been permuted). You could measure this with a chi-square statistic.

Third option: Permute the x's and plot importance for each variable when the others are held out (inferential only).

Weak, I know, but I hope it helps!

Ken Hutchison
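The idea of permuting a predictor and re-evaluating can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the method from the reply: it uses lm as a lightweight stand-in for a forest so that it runs without extra packages, it works on simulated data, and it scores the already-fitted model on permuted data instead of refitting. It also computes plain (marginal) permutation importance, not the conditional measure of Strobl et al. The names toy, mse and perm_importance are hypothetical.

```r
set.seed(1)

# Simulated data: y depends on x1 only, x2 is pure noise
n   <- 500
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- 2 * toy$x1 + rnorm(n, sd = 0.5)

# Stand-in model (a forest would be fitted here instead)
fit <- lm(y ~ x1 + x2, data = toy)

# Mean squared error of a model's predictions on a data frame
mse <- function(model, data) {
  mean((data$y - predict(model, newdata = data))^2)
}
base_mse <- mse(fit, toy)

# Importance of one variable: increase in MSE after shuffling that column
perm_importance <- function(model, data, var) {
  permuted <- data
  permuted[[var]] <- sample(permuted[[var]])
  mse(model, permuted) - base_mse
}

imp <- sapply(c("x1", "x2"), function(v) perm_importance(fit, toy, v))
print(imp)
```

A predictor that matters should show a clear increase in error when shuffled, while a noise predictor should stay near zero. With correlated predictors this marginal version can overstate importance, which is the motivation for the conditional measure discussed in the original post.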