Hennig, Christian
2012-Aug-14 11:53 UTC
[R] Problems with lda-CV, and collinear variables in lda
Dear R-help list,

two issues regarding lda.

1) I'm puzzled by the fact that lda's built-in cross-validation gives results different from the manual cross-validation routine that I run (of course mine may be wrong, but I don't think so). See here:

library(MASS)
set.seed(12345)
n <- 50
p <- 10 # or p <- 200
# n x p matrix of independent standard normal noise
testdata <- matrix(ncol=p,nrow=n)
for (i in 1:p)
  testdata[,i] <- rnorm(n)
# two classes of 25 observations each, unrelated to the data
class <- as.factor(c(rep(1,25),rep(2,25)))

# built-in leave-one-out cross-validation
lda1 <- lda(x=testdata,grouping=class,CV=TRUE)
table1 <- table(lda1$class,class)

# manual leave-one-out cross-validation
y.lda <- rep(NA, n)
for(i in 1:n){
  testset <- testdata[i,,drop=FALSE]
  trainset <- testdata[-i,]
  model.lda <- lda(x=trainset,grouping=class[-i])
  # stores the integer code of the predicted factor level (1 or 2)
  y.lda[i] <- predict(model.lda, testset)$class
}
table2 <- table(y.lda, class)

With p=10:

> table1
   class
     1  2
  1 10 11
  2 15 14
> table2
     class
y.lda  1  2
    1 10 12
    2 15 13

Why are these not the same?

Getting closer to my second issue, it gets worse when p>n, e.g., p=200:

> table1
   class
     1  2
  1 14 16
  2 11  9
> table2
     class
y.lda  1  2
    1 15 10
    2 10 15

2) I can't find a proper explanation on the help page of how lda is computed for p>n. Its standard definition involves inverting the within-class covariance matrix, which is singular for p>n. lda does give a warning when p>n, but occasionally the cross-validated results are quite good. I have a guess how it's done but would be happy about clarification.

Best regards,
Christian

*** --- ***
Christian Hennig
University College London, Department of Statistical Science
Gower St., London WC1E 6BT, phone +44 207 679 1698
chrish at stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche
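
A possible check for issue 1), offered as a sketch rather than an explanation: when prior is not supplied, lda takes the class proportions of the data it is fitted to, so each leave-one-out fit in the manual loop uses 24/49 and 25/49 rather than 1/2 and 1/2. Whether the built-in CV treats the prior differently is an assumption to test, not something established here. The code below fixes prior to c(0.5, 0.5) in both routes so that this one source of difference is removed before comparing. The names fixed_prior, cv_builtin and cv_manual are illustrative only.

```r
library(MASS)

set.seed(12345)
n <- 50
p <- 10
# same construction as in the post: n x p standard normal noise
testdata <- matrix(ncol = p, nrow = n)
for (i in 1:p)
  testdata[, i] <- rnorm(n)
class <- as.factor(c(rep(1, 25), rep(2, 25)))

# assumption under test: identical priors in both routes
fixed_prior <- c(0.5, 0.5)

# built-in leave-one-out CV with the prior fixed explicitly
cv_builtin <- lda(x = testdata, grouping = class,
                  prior = fixed_prior, CV = TRUE)$class

# manual leave-one-out CV with the same fixed prior
cv_manual <- character(n)
for (i in seq_len(n)) {
  fit <- lda(x = testdata[-i, , drop = FALSE], grouping = class[-i],
             prior = fixed_prior)
  # keep the predicted level as a label rather than an integer code
  cv_manual[i] <- as.character(predict(fit, testdata[i, , drop = FALSE])$class)
}
cv_manual <- factor(cv_manual, levels = levels(class))

# cross-tabulate the two sets of predictions and count disagreements
print(table(builtin = cv_builtin, manual = cv_manual))
n_disagree <- sum(cv_builtin != cv_manual)
print(n_disagree)
```

If n_disagree drops to zero, the prior was the cause; if not, the remaining difference has to come from elsewhere in the built-in CV computation.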