(Apparently, I posted this to the wrong place. I am hopefully posting this
is the correct place now. If not, please advise.)
A few weeks back I posted that the subset feature of rpart was not working
when predicting a categorical variable. I was able to figure out a simple
solution to the problem that I hope can be included in future editions of
rpart. I also include a fix for another related problem.
The basic problem is that when predicting a categorical using a subset, the
subset may not have all the categories present. For instance, using a
subset of car.test.frame to predict the Type, you might have a subset that
does not include Type=Small.
The frequency is counted in the gini.c with zero for Small and then the
priors are calculated. The priors will then have entries that are NaN's as
the frequency is used in a divide. Naturally, this causes lots of problems
and the rpart will fail.
The simple solution which I am hoping can be validated here is to put a very
small number there instead of zero. Here is the change to gini.c:
#if 1 // Bob's simple fix
// This fixes a problem when predicting a categorical using a
subset.
// The subset may be missing one or more of the categories. Here
// we make any zeros a very small number so that the rpart run will
not fail.
for (i=0; i<numclass; i++) {
if (freq[i] == 0) freq[i] = 0.00000000001;
}
#endif
for (i=0; i<numclass; i++) {
prior[i] /= freq[i];
aprior[i] /= (temp * freq[i]); /* pi_i / n_i */
}
}
The calculation of the priors (existing code) above shows where I put the
fix in giniinit (in gini.c). However, at this point the rpart run will
complete but will have many NaN's in the rpart$frame for yval's. This
is
because of an oversight in rpart.s when calculating the probabilities for
each node. The change to the function rpart in rpart.s is:
if (method.int ==3 ) {
numclass <- init$numresp -1
temp <- rp$dnode[,-(1:4)] %*% diag(init$parms$prior*
sum(init$counts)/init$counts)
# Bob's clumsy attempt to remove the NaN's that result when
# calculating the probabilities
tempx <- matrix(0, nrow=nrow(temp), ncol=ncol(temp))
tempx[which(is.finite(temp))] <-
temp[which(is.finite(temp))]
temp <- tempx
# end of Bob's changes
yprob <- temp /apply(temp,1,sum) #necessary with altered priors
yval2 <- matrix(rp$dnode[, -(1:3)], ncol=numclass+1)
frame$yval2 <- cbind(yval2, yprob)
}
The frequencies are calculated for each node and the resulting zeros produce
NaN's when calculating the probabilities. My clumsy R code simply removes
the NaN's from the matrix of probabilities.
During my testing for this solution, I found that any priors provided as
parameters to the rpart call were not being honored for categorical
variables. My first attempted solution had bee to provide priors that would
not have zeros. This override failed to provide an alternative solution to
the original problem as I had hoped. I was surprised so I investigated
further and found that priors in the parms were not being honored by the
function giniinit in gini.c.
Here is the solution to that problem:
#if 0
for (i=0; i<numclass; i++) freq[i] =0;
temp =0;
for (i=0; i<n; i++) {
j = *y[i] -1;
freq[j] += wt[i];
temp += wt[i]; /*sum total of weights */
}
for (i=0; i<numclass; i++) freq[i] /=temp; /*relative frequency
*/
#else
// The frequency was already calculated from the data OR overidden
with "parms=list(prior=c(....))"
for (i=0; i<numclass; i++) freq[i] = parm[i];
#endif
The code that is now removed was always calculating the frequencies from the
data but the frequencies were already provided in the parms so the
calculation was unnecessary. By using the parms as the source for the
frequencies, any priors in the parameter list would be honored as well.
Less is more.
The combination of honoring any priors in the parameters with the assumption
that the frequency is some really small value seems to me to be a flexible
solution. If you don't like the very small number, you must provide another
one in the parameter list.
I am hoping that Prof. Ripley or anyone else working on rpart will vet these
recommendations and incorporate them into the tree.
Bob Davies
Intel Corp.
PS: here is some R code from my original message that can be used to test
the subset option and validate the above changes:
On Wed, 2 Jan 2002, Davies, Bob wrote:
Any rpart user:
I am trying to construct an rpart tree using a subset of the data and it
will occasionally fail when predicting a categorical response variable.
The reason that rpart fails is that the subset does not contain each of the
categories present in the original data. For example, in the car.test.frame
example, a subset that has all the categories except "Small" will not
produce an rpart tree.
I attempted to use "parms=list(prior=...) and it did not correct the
problem.
Here is a demonstration of the problem using the car.test.frame:
library(rpart)
data(car.test.frame)
t1 <- rpart(Type ~ ., car.test.frame)
t1
sub <- row.names(car.test.frame[car.test.frame[,"Weight"] >
2567.5,]) #
create a subset
rpart (Type ~ ., car.test.frame, subset=sub ) # this statement will fail
# so now attempt to indicate what the priors should look like:
rpart (Type ~ ., car.test.frame, subset=sub,
parms=list(prior=t1$parms$prior)) # this statement will fail
# now add 1 "Small" car to this subset of car.test.frame
sub <- row.names(car.test.frame[car.test.frame[,"Weight"] >
2559,]) # lower
the weight just a little to get a "Small" car.
rpart(Type ~ ., car.test.frame, subset=sub) # This statement will work !
Any suggestions?
I am using R 1.4.0 on Windows 2000.
Bob Davies
-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !) To: r-help-request at
stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._