I've been trying to get some linear classifiers (LiblineaR, kernlab,
e1071) to work with a sparse matrix of feature data. In the case of
LiblineaR and kernlab, it seems I have to coerce my data into a dense
matrix in order to train a model. I've done a number of searches,
read through the manuals and vignettes, but I can't seem to see how to
use either of these packages with sparse matrices. I've tried using
both matrix.csr from SparseM and sparseMatrix from the Matrix package. You
can see a simple example recreating my results below.

Does anybody know if there's a trick to get this to work without
coercing the data into a dense matrix?

I'm currently playing with the KDDCUP 2010 datasets. I've written a
simple script to create hash kernel feature vectors for each of the
rows of training data. Right now I haven't added many features into
the hash vectors. For simplicity, I'm just creating a string token
for each feature, then hashing it and taking that hash mod 10007 and
10009 (so two buckets for each feature with a low likelihood of two
features colliding on both buckets). 10009 columns may seem like
overkill, but I figured if it was a sparse matrix the number of
columns really wouldn't matter that much. Right now I'm also only
playing with 99999 rows of input. Whenever I make the mistake of
doing something which unintentionally coerces the sparse matrix into a
dense one, I end up eating up all my RAM, going to swap, and spending
the next 5 minutes trying to kill my session... So I'm looking for
something that scales relatively well without taking up too large a
memory footprint to run.
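
The bucketing described above can be sketched roughly as follows. This
is only an illustration under stated assumptions, not the actual script:
hash_token is a hypothetical stand-in for whatever string hash is used,
and the example tokens are made up. Each token is hashed once, and the
hash is reduced mod 10007 and mod 10009 to give two column indices in a
10009-column sparse matrix.

```r
library(Matrix)

# Hypothetical polynomial string hash (an assumption, not the real one).
# Intermediate values stay far below 2^53, so double arithmetic is exact.
hash_token <- function(token) {
  h <- 0
  for (code in utf8ToInt(token)) {
    h <- (h * 31 + code) %% 2147483647
  }
  h
}

# Hypothetical input: one character vector of feature tokens per row.
rows <- list(
  c("student=s1", "problem=p7"),
  c("student=s2", "problem=p7"),
  c("student=s1", "problem=p9")
)

n_cols <- 10009

# Each token contributes two entries, so each row index repeats 2 * k times.
row_idx <- rep(seq_along(rows), times = lengths(rows) * 2)

col_idx <- unlist(lapply(rows, function(tokens) {
  h <- vapply(tokens, hash_token, numeric(1), USE.NAMES = FALSE)
  # Two buckets per token, shifted by 1 because columns are 1-based.
  c(h %% 10007, h %% 10009) + 1
}))

# sparseMatrix() adds up entries that land on the same (row, column) pair,
# so colliding buckets end up as counts greater than 1.
X <- sparseMatrix(i = row_idx, j = col_idx, x = 1,
                  dims = c(length(rows), n_cols))
```

Only the non-zero entries are stored, so the footprint grows with the
number of tokens rather than with rows times columns.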

Thanks!
Jeff

See below for an example that recreates my basic attempts at using
sparse matrices (with Matrix, SparseM, LiblineaR and kernlab loaded):

> L1=rep(0:1,5)
> SM1=sparseMatrix(i=c(1:5*2,1:5*2),j=c(rep(1,5),rep(10,5)),x=1)
> DM=as.matrix(SM1)
> SM2=as.matrix.csr(DM)
> as.matrix(SM2)
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,]    0    0    0    0    0    0    0    0    0     0
 [2,]    1    0    0    0    0    0    0    0    0     1
 [3,]    0    0    0    0    0    0    0    0    0     0
 [4,]    1    0    0    0    0    0    0    0    0     1
 [5,]    0    0    0    0    0    0    0    0    0     0
 [6,]    1    0    0    0    0    0    0    0    0     1
 [7,]    0    0    0    0    0    0    0    0    0     0
 [8,]    1    0    0    0    0    0    0    0    0     1
 [9,]    0    0    0    0    0    0    0    0    0     0
[10,]    1    0    0    0    0    0    0    0    0     1
> L1
[1] 0 1 0 1 0 1 0 1 0 1
> model = LiblineaR(DM,L1)
> predict(model,DM)
$predictions
[1] 0 1 0 1 0 1 0 1 0 1
> model = LiblineaR(SM1,L1)
Error in t.default(data) : argument is not a matrix
> model = ksvm(DM,L1,scale=FALSE,kernel="vanilladot")
Setting default kernel parameters
> predict(model,DM)
      [,1]
 [1,]  0.1
 [2,]  0.9
 [3,]  0.1
 [4,]  0.9
 [5,]  0.1
 [6,]  0.9
 [7,]  0.1
 [8,]  0.9
 [9,]  0.1
[10,]  0.9
> model = ksvm(SM1,L1,scale=FALSE,kernel="vanilladot")
Error in function (classes, fdef, mtable) :
  unable to find an inherited method for function "ksvm", for signature "dgCMatrix"
> model = ksvm(SM2,L1,scale=FALSE,kernel="vanilladot")
Error in function (classes, fdef, mtable) :
  unable to find an inherited method for function "ksvm", for signature "matrix.csr"
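
For comparison, e1071 is the one package of the three whose
documentation says svm() takes a SparseM matrix.csr directly. The sketch
below uses the same toy data but builds the csr matrix from triplets
rather than from the dense DM. It assumes that as.matrix.csr() accepts a
matrix.coo object, and it uses a factor response so that svm() does
classification; neither of those choices comes from the transcript
above.

```r
library(SparseM)
library(e1071)

L1 <- rep(0:1, 5)

# Same non-zero pattern as SM1: rows 2, 4, ..., 10 have a 1 in columns 1 and 10.
i <- as.integer(c(1:5 * 2, 1:5 * 2))
j <- as.integer(c(rep(1, 5), rep(10, 5)))

# Coordinate (triplet) form first, then compressed sparse row form.
# Assumption: as.matrix.csr() converts a matrix.coo without going dense.
coo <- new("matrix.coo", ra = rep(1, 10), ia = i, ja = j,
           dimension = c(10L, 10L))
SM2 <- as.matrix.csr(coo)

# Linear-kernel SVM trained on the sparse matrix; scaling is switched off.
model <- svm(SM2, factor(L1), kernel = "linear", scale = FALSE)
pred <- predict(model, SM2)
table(pred, L1)
```

This does not help with LiblineaR or kernlab, which still reject both
sparse classes as shown above.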