no228@cam.ac.uk
2006-Feb-02 15:28 UTC
[Rd] crossvalidation in svm regression in e1071 gives incorrect results (PR#8554)
Full_Name: Noel O'Boyle
Version: 2.1.0
OS: Debian GNU/Linux Sarge
Submission from: (NULL) (131.111.8.96)


(1) Description of error

The 10-fold CV option for the svm function in e1071 appears to give incorrect
results for the rmse.

The example code in (3) uses the example regression data in the svm
documentation. The rmse for internal prediction is 0.24. It is expected the
10-fold CV rmse should be bigger, but the result obtained using the "cross=10"
option is 0.07. When the 10-fold CV is conducted either 'by hand' (not shown
below) or using the errorest function in ipred (shown below) the answer is
closer to 0.27, a more reasonable value.

(2) Description of system

I'm using the Debian Sarge version of R:

R : Copyright 2005, The R Foundation for Statistical Computing
Version 2.1.0 (2005-04-18), ISBN 3-900051-07-0

svm is in the e1071 package from CRAN:

Version: 1.5-11
Date: 2005-09-19

(3) Example code illustrating the problem

library(e1071)
set.seed(42)

# create data
x <- seq(0.1, 5, by = 0.05)
y <- log(x) + rnorm(x, sd = 0.2)
data <- as.data.frame(cbind(y,x))

# estimate model and predict input values
mysvm <- svm(y ~ x,data)
result <- predict(mysvm, data)
(rmse <- sqrt(mean((result-data[,1])**2)))
# 0.2390489

# built-in 10-fold CV estimate of prediction error
spread <- rep(0,20)
for (i in 1:20) {
  mysvm <- svm(y ~ x,data,cross=10)
  spread[i] <- mean(mysvm$MSE)
}
summary(spread)
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
# 0.06789 0.07089 0.07236 0.07310 0.07411 0.08434
# (or something similar)

# 10-fold CV using errorest
library(ipred)
mysvm <- function(formula,data) {
  model <- svm(formula,data)
  function(newdata) predict(model,newdata)
}
spread <- rep(0,20)
for (i in 1:20) {
  spread[i] <- errorest(y ~ x, data, model=mysvm)$error
}
summary(spread)
#   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
# 0.2601  0.2649  0.2673  0.2696  0.2741  0.2927

Regards,
Noel O'Boyle.
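
For reference, a 'by hand' 10-fold CV of the kind mentioned in (1) might look roughly like the sketch below. This is an illustration only, not the exact code behind the 0.27 figure quoted above; manual_cv_rmse is a hypothetical helper name and is not part of e1071 or ipred. The last line is a scale check that rests on an assumption: the component is named MSE, so if it holds per-fold mean squared errors, its square root is the quantity comparable to an rmse (the square root of the reported mean 0.07310 is roughly 0.27).

```r
library(e1071)

set.seed(42)

# same example data as in (3)
x <- seq(0.1, 5, by = 0.05)
y <- log(x) + rnorm(x, sd = 0.2)
data <- as.data.frame(cbind(y, x))

# hypothetical helper: k-fold CV estimate of the rmse for svm(y ~ x)
manual_cv_rmse <- function(data, k = 10) {
  # randomly assign every row to one of k folds of near-equal size
  fold <- sample(rep(seq_len(k), length.out = nrow(data)))
  sq_err <- numeric(nrow(data))
  for (j in seq_len(k)) {
    test <- fold == j
    # fit on the other k-1 folds, predict the held-out fold
    model <- svm(y ~ x, data = data[!test, ])
    pred <- predict(model, data[test, ])
    sq_err[test] <- (pred - data$y[test])^2
  }
  # pool the squared errors over all held-out rows, then take the root
  sqrt(mean(sq_err))
}

cv_rmse <- manual_cv_rmse(data)
cv_rmse

# assumption: MSE holds mean squared errors per fold, so take the root
# before comparing with an rmse
builtin_root <- sqrt(mean(svm(y ~ x, data, cross = 10)$MSE))
builtin_root
```

Pooling the squared errors over all held-out rows before taking the root is one of several reasonable conventions; averaging per-fold rmse values instead would give a slightly different number.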