Hi,

I'm studying SVMs and noticed that running an SVM in R, Weka, and Python gives different results. To eliminate possible pitfalls, I used the standard iris dataset and wrote an implementation of the same SVM/kernel in each of R, Weka, and Python. I think the choice of kernel does not matter as long as it is consistent across implementations. I excluded cross-validation (the Python version does not use it) and tried to keep the set of input parameters consistent across all implementations (I went through them all and the defaults seem consistent).

Weka and Python produce identical confusion matrices, but the R results stand apart (I tried both e1071 and kernlab; they are consistent with each other but differ from Weka/Python). That's why I'm posting to the R community: to ask for help identifying the "problem" (if any), or for a reasonable explanation of why the R results can differ. Note that all implementations use libsvm underneath (at least that's what I gathered from reading), so I would expect the results to be the same. I understand that seeds may differ, but I used the entire dataset without any sampling — maybe there is some internal normalization?

I'm posting the code for all implementations along with the confusion matrix outputs. Feel free to reproduce and comment.

Thanks, Valentin.

Weka:
--------------------------------------------------
#!/usr/bin/env bash
# set path to Weka
export CLASSPATH=/Applications/weka-3-6-9.app/Contents/Resources/Java/weka.jar

data=./iris.arff
kernel="weka.classifiers.functions.supportVector.RBFKernel -C 250007 -G 0.01"
c=1.0
t=0.001

# -V The number of folds for the internal cross-validation. (default -1, use training data)
# -N Whether to 0=normalize/1=standardize/2=neither. (default 0=normalize)
# -W The random number seed. (default 1)
#opts="-C $c -L $t -N 2 -V -1 -W 1"
opts="-C $c -L $t -N 2"

cmd="java weka.classifiers.functions.SMO"
if [ "$1" == "help" ]; then
    $cmd
    exit 0
fi
$cmd $opts -K "$kernel" -t $data
--------------------------------------------------

  a  b  c   <-- classified as
 50  0  0 |  a = Iris-setosa
  0 47  3 |  b = Iris-versicolor
  0  5 45 |  c = Iris-virginica

Python:
--------------------------------------------------
from sklearn import svm, datasets
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

def report(clf, x_test, y_test):
    y_pred = clf.predict(x_test)
    print(clf)
    print(classification_report(y_test, y_pred))
    print(confusion_matrix(y_test, y_pred))

def classifier():
    # import some data to play with
    iris = datasets.load_iris()
    x_train = iris.data
    y_train = iris.target
    regC = 1.0  # SVM regularization parameter
    clf = svm.SVC(kernel='rbf', gamma=0.01, C=regC).fit(x_train, y_train)
    report(clf, x_train, y_train)

if __name__ == '__main__':
    classifier()
--------------------------------------------------

[[50  0  0]
 [ 0 47  3]
 [ 0  5 45]]

R:
--------------------------------------------------
library(kernlab)
library(e1071)

# load data
data(iris)

# run svm algorithm (e1071 library) for given data and kernel
model <- svm(Species~., data=iris, kernel="radial", gamma=0.01)
print(model)

# the last column of this dataset is what we'll predict, so we'll exclude it
prediction <- predict(model, iris[,-ncol(iris)])

# the last column is what we'll check predictions against
tab <- table(pred = prediction, true = iris[,ncol(iris)])
print(tab)

cls <- classAgreement(tab)
msg <- sprintf("Correctly classified: %f, kappa %f", cls$diag, cls$kappa)
print(msg)
--------------------------------------------------

            true
pred         setosa versicolor virginica
  setosa         50          0         0
  versicolor      0         46        11
  virginica      0          4        39
[1] "Correctly classified: 0.900000, kappa 0.850000"
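One experiment worth trying for the normalization question (a sketch, not a verified diagnosis — it assumes e1071's `svm()` standardizes features by default via its `scale` argument, while the Weka run with `-N 2` and the sklearn run train on raw features): standardize the iris features on the Python side and see whether the confusion matrix moves toward the R one.

```python
# Sketch: does default feature scaling on the R side explain the gap?
# Assumption: e1071's svm() scales inputs to zero mean / unit variance
# by default, so standardizing here should mimic the R run.
from sklearn import svm, datasets
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
x, y = iris.data, iris.target

# raw features, as in the Python run above
clf_raw = svm.SVC(kernel='rbf', gamma=0.01, C=1.0).fit(x, y)
cm_raw = confusion_matrix(y, clf_raw.predict(x))

# standardized features, mimicking e1071's default scaling
x_std = StandardScaler().fit_transform(x)
clf_std = svm.SVC(kernel='rbf', gamma=0.01, C=1.0).fit(x_std, y)
cm_std = confusion_matrix(y, clf_std.predict(x_std))

print("raw features:\n", cm_raw)
print("standardized features:\n", cm_std)
```

The converse experiment in R would be `svm(Species~., data=iris, kernel="radial", gamma=0.01, scale=FALSE)`; if that reproduces the Weka/Python confusion matrix, internal scaling is the culprit.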