Christoph Lehmann
2004-Mar-30 07:50 UTC
[R] classification with nnet: handling unequal class sizes
I hope this question is adequate for this list I use the nnet code from V&R p. 348: The very nice and general function CVnn2() to choose the number of hidden units and the amount of weight decay by an inner cross-validation- with a slight modification to use it for classification (see below). My data has 2 classes with unequal size: 45 observations for classI and 116 obs. for classII With CVnn2 I get the following confusion matrix (%) (average of 10 runs): predicted true 53 47 16 84 I had a similar biased confusion matrix with randomForest until I used the sampsize argument (the same holds for svm until I used the class.weights argument). How can I handle this problem of unequal class sizes with nnet, in order to get a less biased confusion matrix? (with randomForest I finally got 78 22 16 84 ) many thanks for a hint Christoph ---------------------------------------------------------------------------- #--- neural networks #classification network is constructed; this has one output and entropy fit if the number of levels is two, and a number of outputs equal to the number of classes and a softmax output stage for more levels. -> therefore two lines of Prof. Ripley's wrapper function are changed below (original commented out) and an additional function has been introduced (resmatrix) con <- function(...) { print(tab <- table(...)) diag(tab) <- 0 cat("error rate = ", round(100*sum(tab)/length(list(...)[[1]]), 2), "%\n") invisible() } CVnn2 <- function(formula, data, size = c(0,4,4,10,10), lambda = c(0, rep(c(0.001, 0.01),2)), nreps = 1, nifold = 5, verbose = 99, ...) { resmatrix <- function(predict.matrix,learn, data, ri, i) { rae.matrix <- predict.matrix rae.matrix[,] <- 0 rae.vector <- as.numeric(as.factor((predict(learn, data[ri =i,], type = "class")))) for (k in 1:dim(rae.matrix)[1]) { if (rae.vector[k] == 1) rae.matrix[k,1] <- rae.matrix[k,1] + 1 else rae.matrix[k,2] <- rae.matrix[k,2] + 1 } rae.matrix } CVnn1 <- function(formula, data, nreps=1, ri, verbose, ...) { totalerror <- 0 truth <- data[,deparse(formula[[2]])] res <- matrix(0, nrow(data), length(levels(truth))) if(verbose > 20) cat(" inner fold") for (i in sort(unique(ri))) { if(verbose > 20) cat(" ", i, sep="") for(rep in 1:nreps) { learn <- nnet(formula, data[ri !=i,], trace = F, ...) #res[ri == i,] <- res[ri == i,] + predict(learn, data[ri == i,]) res[ri == i,] <- res[ri == i,] + resmatrix(res[ri =i,],learn,data, ri, i) } } if(verbose > 20) cat("\n") sum(as.numeric(truth) != max.col(res/nreps)) } truth <- data[,deparse(formula[[2]])] res <- matrix(0, nrow(data), length(levels(truth))) choice <- numeric(length(lambda)) for (i in sort(unique(rand))) { if(verbose > 0) cat("fold ", i,"\n", sep="") ri <- sample(nifold, sum(rand!=i), replace=T) for(j in seq(along=lambda)) { if(verbose > 10) cat(" size =", size[j], "decay =", lambda[j], "\n") choice[j] <- CVnn1(formula, data[rand != i,], nreps=nreps, ri=ri, size=size[j], decay=lambda[j], verbose=verbose, ...) } decay <- lambda[which.is.max(-choice)] csize <- size[which.is.max(-choice)] if(verbose > 5) cat(" #errors:", choice, " ") # if(verbose > 1) cat("chosen size = ", csize, " decay = ", decay, "\n", sep="") for(rep in 1:nreps) { learn <- nnet(formula, data[rand != i,], trace=F, size=csize, decay=decay, ...) #res[rand == i,] <- res[rand == i,] + predict(learn, data[rand == i,]) res[rand == i,] <- res[rand == i,] + resmatrix(res[rand =i,],learn,data, rand, i) } } factor(levels(truth)[max.col(res/nreps)], levels = levels(truth)) } -- Christoph Lehmann <christoph.lehmann at gmx.ch>
Christoph Lehmann
2004-Mar-31 19:43 UTC
[R] classification with nnet: handling unequal class sizes
Dear Prof. Ripley Since you are the creator of the MASS library I dare to ask you a short question, for which I didn't get an answer from the R mailing-list. If you feel disturbed by my question, please forgive me and just ignore my mail. I use the nnet code from your book, V&R p. 348: The very nice and general function CVnn2() to choose the number of hidden units and the amount of weight decay by an inner cross-validation- with a slight modification to use it for classification (see below). My data has 2 classes with unequal size: 45 observations for classI and 116 obs. for classII (number of variables: 39) With CVnn2 I get the following confusion matrix (%) (average of 10 runs): predicted true 53 47 16 84 I had a similar biased confusion matrix with randomForest until I used the sampsize argument (the same holds for svm until I used the class.weights argument). How can I handle this problem of unequal class sizes with nnet, in order to get a less biased confusion matrix? (with randomForest I finally got 78 22 16 84 ) many thanks for a hint. By the way, I just want to say 'thank you' for your great MASS book. Since your first recommendation, I consult it quite frequently. Christoph ---------------------------------------------------------------------------- #--- neural networks #classification network is constructed; this has one output and entropy fit if the number of levels is two, and a number of outputs equal to the number of classes and a softmax output stage for more levels. -> therefore two lines of Prof. Ripley's wrapper function are changed below (original commented out) and an additional function has been introduced (resmatrix) con <- function(...) { print(tab <- table(...)) diag(tab) <- 0 cat("error rate = ", round(100*sum(tab)/length(list(...)[[1]]), 2), "%\n") invisible() } CVnn2 <- function(formula, data, size = c(0,4,4,10,10), lambda = c(0, rep(c(0.001, 0.01),2)), nreps = 1, nifold = 5, verbose = 99, ...) { resmatrix <- function(predict.matrix,learn, data, ri, i) { rae.matrix <- predict.matrix rae.matrix[,] <- 0 rae.vector <- as.numeric(as.factor((predict(learn, data[ri =i,], type = "class")))) for (k in 1:dim(rae.matrix)[1]) { if (rae.vector[k] == 1) rae.matrix[k,1] <- rae.matrix[k,1] + 1 else rae.matrix[k,2] <- rae.matrix[k,2] + 1 } rae.matrix } CVnn1 <- function(formula, data, nreps=1, ri, verbose, ...) { totalerror <- 0 truth <- data[,deparse(formula[[2]])] res <- matrix(0, nrow(data), length(levels(truth))) if(verbose > 20) cat(" inner fold") for (i in sort(unique(ri))) { if(verbose > 20) cat(" ", i, sep="") for(rep in 1:nreps) { learn <- nnet(formula, data[ri !=i,], trace = F, ...) #res[ri == i,] <- res[ri == i,] + predict(learn, data[ri == i,]) res[ri == i,] <- res[ri == i,] + resmatrix(res[ri =i,],learn,data, ri, i) } } if(verbose > 20) cat("\n") sum(as.numeric(truth) != max.col(res/nreps)) } truth <- data[,deparse(formula[[2]])] res <- matrix(0, nrow(data), length(levels(truth))) choice <- numeric(length(lambda)) for (i in sort(unique(rand))) { if(verbose > 0) cat("fold ", i,"\n", sep="") ri <- sample(nifold, sum(rand!=i), replace=T) for(j in seq(along=lambda)) { if(verbose > 10) cat(" size =", size[j], "decay =", lambda[j], "\n") choice[j] <- CVnn1(formula, data[rand != i,], nreps=nreps, ri=ri, size=size[j], decay=lambda[j], verbose=verbose, ...) } decay <- lambda[which.is.max(-choice)] csize <- size[which.is.max(-choice)] if(verbose > 5) cat(" #errors:", choice, " ") # if(verbose > 1) cat("chosen size = ", csize, " decay = ", decay, "\n", sep="") for(rep in 1:nreps) { learn <- nnet(formula, data[rand != i,], trace=F, size=csize, decay=decay, ...) #res[rand == i,] <- res[rand == i,] + predict(learn, data[rand == i,]) res[rand == i,] <- res[rand == i,] + resmatrix(res[rand =i,],learn,data, rand, i) } } factor(levels(truth)[max.col(res/nreps)], levels = levels(truth)) } -- Christoph Lehmann <christoph.lehmann at gmx.ch> -- Christoph Lehmann Phone: ++41 31 930 93 83 Department of Psychiatric Neurophysiology Mobile: ++41 76 570 28 00 University Hospital of Clinical Psychiatry Fax: ++41 31 930 99 61 Waldau lehmann at puk.unibe.ch CH-3000 Bern 60 http://www.puk.unibe.ch/cl/pn_ni_cv_cl_03.html