Hi,
I'm studying SVMs and found that if I run an SVM in R, Weka, and Python, the
results differ. So, to eliminate possible pitfalls, I decided to use the
standard iris dataset and wrote implementations in R, Weka, and Python for the
same SVM/kernel. I think the choice of kernel does not matter; it only needs to
be consistent among implementations. I excluded cross-validation, since Python
does not have it, and tried to keep a consistent set of input parameters across
all implementations (I went through them all and the defaults seem consistent).
Weka and Python both produced identical confusion matrices, but the R results
stand apart (I tried both e1071 and kernlab; they are consistent with each
other, but differ from Weka/Python). That's why I decided to post my message to
the R community and ask for help identifying the "problem" (if any), or for a
reasonable explanation of why the R results can differ. Please note that all
implementations use libsvm underneath (at least that's what I got from
reading), so I would expect the results to be the same. I understand that seeds
may differ, but I used the entire dataset without any sampling; maybe there is
internal normalization?
I'm posting the code for all implementations along with the confusion-matrix
outputs. Feel free to reproduce and comment.
Thanks,
Valentin.
Weka:
--------------------------------------------------
#!/usr/bin/env bash
# set path to Weka
export CLASSPATH=/Applications/weka-3-6-9.app/Contents/Resources/Java/weka.jar
data=./iris.arff
kernel="weka.classifiers.functions.supportVector.RBFKernel -C 250007 -G 0.01"
c=1.0
t=0.001
# -V The number of folds for the internal cross-validation. (default -1, use training data)
# -N Whether to 0=normalize/1=standardize/2=neither. (default 0=normalize)
# -W The random number seed. (default 1)
#opts="-C $c -L $t -N 2 -V -1 -W 1"
opts="-C $c -L $t -N 2"
cmd="java weka.classifiers.functions.SMO"
if [ "$1" == "help" ]; then
$cmd
exit 0
fi
$cmd $opts -K "$kernel" -t $data
--------------------------------------------------
a b c <-- classified as
50 0 0 | a = Iris-setosa
0 47 3 | b = Iris-versicolor
0 5 45 | c = Iris-virginica
Python:
--------------------------------------------------
from sklearn import svm, datasets
from sklearn.metrics import classification_report, confusion_matrix

def report(clf, x_test, y_test):
    y_pred = clf.predict(x_test)
    print(clf)
    print(classification_report(y_test, y_pred))
    print(confusion_matrix(y_test, y_pred))

def classifier():
    # load the iris dataset and train on the full data (no sampling)
    iris = datasets.load_iris()
    x_train = iris.data
    y_train = iris.target
    regC = 1.0  # SVM regularization parameter
    clf = svm.SVC(kernel='rbf', gamma=0.01, C=regC).fit(x_train, y_train)
    report(clf, x_train, y_train)

if __name__ == '__main__':
    classifier()
--------------------------------------------------
[[50 0 0]
[ 0 47 3]
[ 0 5 45]]
R:
--------------------------------------------------
library(kernlab)
library(e1071)
# load data
data(iris)
# run the svm algorithm (e1071 library) on the data with the given kernel
model <- svm(Species ~ ., data = iris, kernel = "radial", gamma = 0.01)
print(model)
# the last column of this dataset is what we'll predict, so we exclude it
prediction <- predict(model, iris[, -ncol(iris)])
# the last column is what we check the predictions against
tab <- table(pred = prediction, true = iris[, ncol(iris)])
print(tab)
cls <- classAgreement(tab)
msg <- sprintf("Correctly classified: %f, kappa %f", cls$diag, cls$kappa)
print(msg)
--------------------------------------------------
true
pred setosa versicolor virginica
setosa 50 0 0
versicolor 0 46 11
virginica 0 4 39
[1] "Correctly classified: 0.900000, kappa 0.850000"
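P.S. One way to probe the internal-normalization guess from the Python side: if
I read the docs correctly, e1071's svm() has a `scale` argument that defaults to
TRUE (features are standardized before training), whereas sklearn's SVC does no
scaling. A minimal sketch that mimics that preprocessing before fitting,
assuming the scaling is indeed the source of the discrepancy (I have not
verified the two scalings are numerically identical):

```python
from sklearn import svm, datasets
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()

# standardize each feature to zero mean / unit variance, mimicking
# e1071's default scale=TRUE preprocessing (note: e1071 uses the
# sample sd with n-1, StandardScaler the population sd, so the
# scaled values can differ very slightly)
x_scaled = StandardScaler().fit_transform(iris.data)

clf = svm.SVC(kernel='rbf', gamma=0.01, C=1.0).fit(x_scaled, iris.target)
print(confusion_matrix(iris.target, clf.predict(x_scaled)))
```

If this confusion matrix moves toward the R one, the scaling explains the gap;
conversely, svm(..., scale = FALSE) on the R side should then move toward the
Weka/Python matrices.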