Jason Roberts
2011-Oct-14  16:06 UTC
[R] Party package: varimp(..., conditional=TRUE) error: term 1 would require 9e+12 columns
I would like to build a forest of regression trees to see how well some covariates predict a response variable and to examine the importance of the covariates. I have a small number of covariates (8) and large number of records (27368). The response and all of the covariates are continuous variables.

A cursory examination of the covariates does not suggest they are correlated in a simple fashion (e.g. the variance inflation factors are all fairly low) but common sense suggests there should be some relationship: one of them is the day of the year and some of the others are environmental parameters such as water temperature. For this reason I would like to follow the advice of Strobl et al. (2008) and try the authors' conditional variable importance measure. This is implemented in the party package by calling varimp(..., conditional=TRUE). Unfortunately, when I call that on my forest I receive the error:

> varimp(myforest, conditional=TRUE)
Error in model.matrix.default(as.formula(f), data = blocks) :
  term 1 would require 9e+12 columns

Does anyone know what is wrong?

I noticed a post in June 2011 where a user reported this message and the ultimate problem was that the importance measure was being conditioned on too many variables (47). I have only a small number of variables here so I guessed that was not the problem.

Another suggestion was that there could be a factor with too many levels. In my case, all of the variables are continuous. Term 1 (x1 below) is the day of the year, which does happen to be integers 1 ... 366. But the variable is class numeric, not integer, so I don't believe cforest would treat it as a factor, although I do not know how to tell whether cforest is treating something as continuous or as a factor.

Thank you for any help you can provide. I am running R 2.13.1 with party 0.9-99994. You can download the data from http://www.duke.edu/~jjr8/data.rdata (512 KB).
Here is the complete code:

> load("\\Temp\\data.rdata")
> nrow(df)
[1] 27368
> summary(df)
       y                 x1              x2               x3               x4
 Min.   :  0.000   Min.   :  1.0   Min.   :0.0000   Min.   :  1.00   Min.   :  52
 1st Qu.:  0.000   1st Qu.:105.0   1st Qu.:0.0000   1st Qu.: 30.00   1st Qu.:1290
 Median :  1.282   Median :169.0   Median :0.2353   Median : 38.00   Median :1857
 Mean   :  5.651   Mean   :178.7   Mean   :0.2555   Mean   : 55.03   Mean   :1907
 3rd Qu.:  5.353   3rd Qu.:262.0   3rd Qu.:0.4315   3rd Qu.: 47.00   3rd Qu.:2594
 Max.   :195.238   Max.   :366.0   Max.   :1.0000   Max.   :400.00   Max.   :3832
       x5                  x6              x7                  x8
 Min.   : 0.008184   Min.   :16.71   Min.   :0.0000000   Min.   : 0.02727
 1st Qu.: 6.747035   1st Qu.:23.92   1st Qu.:0.0000000   1st Qu.: 0.11850
 Median :11.310277   Median :26.35   Median :0.0001569   Median : 0.14625
 Mean   :12.889021   Mean   :26.31   Mean   :0.0162043   Mean   : 0.20684
 3rd Qu.:18.427410   3rd Qu.:28.95   3rd Qu.:0.0144660   3rd Qu.: 0.20095
 Max.   :29.492380   Max.   :31.73   Max.   :0.3157486   Max.   :11.76877
> library(HH)
<output deleted>
> vif(y ~ ., data=df)
      x1       x2       x3       x4       x5       x6       x7       x8
1.374583 1.252250 1.021672 1.218801 1.015124 1.439868 1.075546 1.060580
> library(party)
<output deleted>
> mycontrols <- cforest_unbiased(ntree=50, mtry=3)  # Small forest but requires a few minutes
> myforest <- cforest(y ~ ., data=df, controls=mycontrols)
> varimp(myforest)
        x1         x2         x3         x4         x5         x6         x7         x8
 11.924498 103.180195  16.228864  30.658946   5.053500  12.820551   2.113394   6.911377
> varimp(myforest, conditional=TRUE)
Error in model.matrix.default(as.formula(f), data = blocks) :
  term 1 would require 9e+12 columns
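A rough sketch of two checks that bear on the question above, using base R only. The first part shows one way to see how R stores each column. The second shows how an interaction of factors multiplies the number of design-matrix columns. Everything here is made-up data: `toy`, `blocks` and the factors `a`, `b`, `c` are hypothetical names, and the idea that varimp() builds such an interaction from binned conditioning variables is only an assumption inferred from the error message (the formula `f` and `data = blocks`), not something confirmed from the party source.

```r
# Hypothetical stand-in for the poster's data frame: numeric columns only.
toy <- data.frame(y  = c(0.5, 1.2, 3.4),
                  x1 = c(1, 169, 366),
                  x2 = c(0, 0.2353, 1))

# class() shows how R stores each column; is.factor() flags factor columns.
col_classes <- sapply(toy, class)
any_factor  <- any(sapply(toy, is.factor))
print(col_classes)
print(any_factor)

# Three hypothetical factors with 3, 4 and 5 levels, every combination present.
blocks <- expand.grid(a = factor(1:3), b = factor(1:4), c = factor(1:5))

# A pure interaction term needs one indicator column per combination of levels.
des    <- model.matrix(~ a:b:c - 1, data = blocks)
n_cols <- ncol(des)
print(n_cols)                          # 3 * 4 * 5
print(prod(sapply(blocks, nlevels)))   # the same product
```

By the same arithmetic, seven factors of about 70 levels each would give roughly 70^7, about 8e+12 combinations, which is the order of magnitude in the error message. Whether that is what happens inside varimp() here remains an assumption.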
Ken Hutchison
2011-Oct-14  19:21 UTC
[R] Party package: varimp(..., conditional=TRUE) error: term 1 would require 9e+12 columns
Hi,
That's a tough one. I'll do my best and hope a more knowledgeable person
will correct me.
Since you can measure conditional importance by permuting predictors and
re-evaluating importance, perhaps try the randomForest package and examine
how your results change when each predictor is permuted. I understand that
permutation would take a prohibitively long time for certain
applications, so try these (clumsy) shortcuts:
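Some pseudocode:
First Option:
library(randomForest)
myforest <- randomForest(y ~ ., data = df)
imp <- myforest$importance  # Just for your info, importance is here.
# Permute an x (x1 is used here only as an example)
newx <- sample(df$x1, length(df$x1), replace = FALSE)
# Make a new forest with the permuted x in place of the original
dfnew <- df
dfnew$x1 <- newx
newforest <- randomForest(y ~ ., data = dfnew)
# Predict oldx with newforest.
# If somewhat accurate, problems afoot.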
Second Option:
Predict your held-out variable 1000 times with the new forest (pretty quick to
do) and examine the quantile of each predicted value relative to the old
(non-permuted) distribution of the variable. The quantiles should be uniformly
distributed between 0 and 1 if the variable is truly random inside the forest
(and it is random outside, since we know it has been permuted). You could
measure this with a Chi-square statistic.
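The uniformity check in the second option could look roughly like this in base R. It is only an illustration on simulated numbers, not on forest output: `old_x` and `pred_x` are hypothetical stand-ins for the original variable and for the 1000 predicted values, and the choice of 10 bins is arbitrary.

```r
set.seed(42)
old_x  <- rnorm(5000, mean = 26, sd = 3)   # stand-in for the non-permuted variable
pred_x <- rnorm(1000, mean = 26, sd = 3)   # stand-in for 1000 predicted values

# Position of each predicted value within the old distribution (between 0 and 1).
q <- ecdf(old_x)(pred_x)

# Count the positions in 10 equal-width bins covering [0, 1].
bins   <- cut(q, breaks = seq(0, 1, length.out = 11), include.lowest = TRUE)
counts <- table(bins)

# Chi-square goodness-of-fit test against equal expected counts in every bin.
res <- chisq.test(counts)
print(counts)
print(res$p.value)
```

A small p-value would suggest that the positions are not uniform, i.e. that the predictions are not behaving like random draws from the old distribution.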
Third Option:
Permute the x's and plot the importance of each variable when the others are
held out (inferential only).
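A minimal sketch of plain permutation importance, to make the idea concrete. Note the swaps: the model is an ordinary lm() used as a stand-in rather than a forest, the data are simulated, and this is the marginal (unconditional) permutation scheme, not the conditional measure of Strobl et al. (2008). The names `toy`, `mse` and `perm_importance` are hypothetical.

```r
set.seed(1)
n  <- 500
x2 <- rnorm(n)
x1 <- x2 + rnorm(n, sd = 0.5)   # x1 is correlated with x2
y  <- 2 * x2 + rnorm(n)         # only x2 actually drives y
toy <- data.frame(y, x1, x2)

# Stand-in model (not a forest): linear regression on both predictors.
fit <- lm(y ~ x1 + x2, data = toy)

# Mean squared prediction error of the fitted model on a data frame d.
mse <- function(d) mean((d$y - predict(fit, newdata = d))^2)

# Importance of one predictor: increase in error after permuting that column.
perm_importance <- function(var) {
  d <- toy
  d[[var]] <- sample(d[[var]])
  mse(d) - mse(toy)
}

imp <- sapply(c("x1", "x2"), perm_importance)
print(imp)
```

The resulting vector can be passed to barplot(imp) to plot the importances. With a real forest the same loop would call the forest's own predict method instead of the lm() fit.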
Weak, I know, but I hope it helps!
Ken Hutchison