Hi Lishu,
I ran into a similar large-scale problem recently. I used a parallel
SGD k-means, as described in this paper, for my problem:
http://www.eecs.tufts.edu/~dsculley/papers/fastkmeans.pdf
Let n be the number of samples, k the number of clusters, and m the
number of nodes.
1. First, each node reads n / m of the samples and randomly generates
enough 'mini-batches' (the mini-batch size and the number of SGD
iterations must be fixed beforehand).
2. Each node samples k / m centers from its own samples.
3. Update the centers using the mini-batches generated in the first
step. Note that at this stage the nodes do not need to hold the sample
data, since the updates use only the mini-batches.
4. Once the centers are optimized by SGD, compute the distance matrix
between the samples and the centers. I used spherical k-means, so this
step can be split into a series of block matrix multiplications to save
memory.
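
The center update in step 3 can be sketched as follows. This is a minimal
single-node illustration in plain Python of the mini-batch update from the
paper linked above (per-center counts with a learning rate of 1/count). It
is not the pbdMPI code mentioned in this message, it leaves out the
distribution across nodes, and all function names are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(centers, x):
    """Index of the center closest to x."""
    return min(range(len(centers)), key=lambda j: sq_dist(centers[j], x))


def minibatch_update(centers, counts, batch):
    """Apply one mini-batch step in place.

    Assignments are cached against the centers as they were before the
    step, then each sample pulls its center with rate 1 / (per-center count).
    """
    assigned = [nearest(centers, x) for x in batch]
    for x, j in zip(batch, assigned):
        counts[j] += 1
        eta = 1.0 / counts[j]
        centers[j] = [(1 - eta) * c + eta * xi for c, xi in zip(centers[j], x)]


def minibatch_kmeans(samples, k, batch_size, n_iter, seed=0):
    """Run n_iter mini-batch steps starting from k randomly sampled centers."""
    rng = random.Random(seed)
    centers = [list(x) for x in rng.sample(samples, k)]
    counts = [0] * k
    for _ in range(n_iter):
        batch = [rng.choice(samples) for _ in range(batch_size)]
        minibatch_update(centers, counts, batch)
    return centers, counts
```

Because the mini-batches are drawn up front, the update loop only ever
touches batch_size samples at a time, which is why the full data does not
have to stay in memory during this stage.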
Note that each node only needs to hold part of the sample data and part
of the centers, so this method works in a 'regular' MPI environment and
does not need a shared-memory architecture.
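
The block-wise computation in step 4 can be sketched as follows. For
unit-length vectors, ||x - c||^2 = 2 - 2 * (x . c), so the nearest center
is the one with the largest dot product, and the sample-by-center
similarity matrix can be built one block of rows at a time. This is a
minimal single-node illustration in plain Python, not the pbdMPI code
mentioned in this message; the function names and the block_size parameter
are hypothetical.

```python
import math


def normalize(v):
    """Scale v to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def block_similarity(sample_block, unit_centers):
    """Cosine similarities for one block: a (block rows) x k matrix product."""
    return [[sum(a * b for a, b in zip(x, c)) for c in unit_centers]
            for x in sample_block]


def assign_in_blocks(samples, centers, block_size):
    """Label each sample with its most similar center, one block at a time.

    Only block_size x k similarities are held in memory at once, never the
    full n x k matrix.
    """
    unit_centers = [normalize(c) for c in centers]
    labels = []
    for start in range(0, len(samples), block_size):
        block = [normalize(x) for x in samples[start:start + block_size]]
        sims = block_similarity(block, unit_centers)
        labels.extend(max(range(len(row)), key=row.__getitem__) for row in sims)
    return labels
```

The result does not depend on block_size; a smaller block only lowers the
peak memory use at the cost of more, smaller multiplications.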
I used pbdMPI to parallelize the algorithm.
Hope this helps.
Wuming
On Wed, Jan 18, 2012 at 3:37 PM, Lishu Liu <lishuliu@gmail.com> wrote:
> Hi,
>
> I have a 60k*600k matrix, which exceeds the vector length limit of 2^31-1.
> But it is rather sparse: only 0.02% of the entries are nonzero. So I saved
> it as a Matrix Market (mm) file, which is about 300M in size. I use readMM
> from the Matrix package to read it in. The data type then becomes
> dgTMatrix from the 'Matrix' package instead of the common matrix type.
>
> The problem is: if I run k-means on only part of the data, so that the
> vector length does not exceed 2^31-1, there is no problem at all, meaning
> that kmeans in R can handle this type of matrix.
> If I run it on the entire matrix, R says "too many elements specified."
>
> I have considered the 'bigmemory' and 'biganalytics' packages. But saving
> the sparse matrix as a common CSV file would take approx 70G, with 99% of
> it being 0. I just don't think it's necessary or efficient to treat it as
> a dense matrix.
>
> Is there any way to deal with the vector length limit? Can I split the
> whole matrix into smaller ones and then do k-means?
>
>
>
> Thanks,
> Lishu
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>