Hello Manyu,
I am guessing you are referring to the Netflix dataset.
Try looking at ways to represent large data sets, that is, the "Large memory and out-of-memory data" list from the High-Performance Computing task view:
http://cran.r-project.org/web/views/HighPerformanceComputing.html
Here it is:
*Large memory and out-of-memory data*
- The biglm <http://cran.r-project.org/web/packages/biglm/index.html> package
by Lumley uses incremental computations to offer lm() and glm() functionality
for data sets stored outside of R's main memory (a rough sketch follows the list).
- The ff <http://cran.r-project.org/web/packages/ff/index.html> package
by Adler et al. offers file-based access to data sets that are too large to
be loaded into memory, along with a number of higher-level functions (sketch below).
- The bigmemory <http://cran.r-project.org/web/packages/bigmemory/index.html> package
by Kane and Emerson permits storing large objects such as matrices in memory
and uses external pointer objects to refer to them. This permits transparent
access from R without bumping against R's internal memory limits. Several R
processes on the same computer can also share big memory objects (sketch below).
- A large number of database packages, and database-alike packages (such
as sqldf <http://cran.r-project.org/web/packages/sqldf/index.html> by
Grothendieck and data.table <http://cran.r-project.org/web/packages/data.table/index.html>
by Dowle) are also of potential interest but not reviewed here.
- The HadoopStreaming <http://cran.r-project.org/web/packages/HadoopStreaming/index.html> package
provides a framework for writing map/reduce scripts for use in Hadoop
Streaming; it also facilitates operating on data in a streaming fashion
which does not require Hadoop.
- The speedglm <http://cran.r-project.org/web/packages/speedglm/index.html> package
makes it possible to fit (generalised) linear models to large data. For in-memory
data sets, speedlm() or speedglm() can be used along with update.speedlm(), which
can update fitted models with new data. For out-of-memory data sets, shglm()
is available; it works in the presence of factors and can check for singular
matrices (sketch below).
- The biglars <http://cran.r-project.org/web/packages/biglars/index.html> package
by Seligman et al. can use the ff <http://cran.r-project.org/web/packages/ff/index.html>
package to support larger-than-memory datasets for least-angle regression, lasso and
stepwise regression.
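
Since your question is about a concrete data set, here are a few rough, untested sketches of how some of these packages are typically used. All file names, variable names and parameters below (chunk1.csv, y, x1, x2, the number of clusters, and so on) are made up for illustration, so adapt them to your data and check the help pages.

First, biglm: fitting a linear model incrementally, one chunk of rows at a time, so the full data set never has to sit in memory.

## biglm sketch (untested); chunk1.csv / chunk2.csv and y, x1, x2 are made up
library(biglm)
chunk1 <- read.csv("chunk1.csv")
fit <- biglm(y ~ x1 + x2, data = chunk1)   # fit on the first chunk of rows
chunk2 <- read.csv("chunk2.csv")
fit <- update(fit, chunk2)                 # update.biglm(): add the next chunk
summary(fit)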
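
Second, ff: a file-backed matrix of roughly the size you mention. Note that this only solves the storage problem; base kmeans() would still try to pull the whole thing into RAM, so you would have to process it chunk-wise yourself.

## ff sketch (untested); the 25000 x 18000 dimensions are taken from your post
library(ff)
x <- ff(vmode = "double", dim = c(25000, 18000))  # lives in a file, not in RAM
x[1, 1:5] <- c(1, 0, 0, 2, 0)                     # indexed like an ordinary matrix
x[1, 1:5]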
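
Third, bigmemory, which is probably closest to what you asked: if I remember correctly, its companion package biganalytics (not in the list above, so please double-check it on CRAN) has a bigkmeans() function that runs k-means directly on a big.matrix.

## bigmemory / biganalytics sketch (untested); biganalytics and bigkmeans()
## are my assumption here -- please verify them before relying on this
library(bigmemory)
library(biganalytics)
x <- filebacked.big.matrix(nrow = 25000, ncol = 18000, type = "double",
                           backingfile = "netflix.bin",
                           descriptorfile = "netflix.desc")
## ... fill x chunk by chunk from your raw rating files ...
cl <- bigkmeans(x, centers = 10)   # 10 clusters is an arbitrary example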
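
Finally, speedglm: speedlm()/speedglm() on an in-memory chunk, with update.speedlm() adding further chunks. I have not used the update step myself, so treat the commented line as a pointer rather than working code.

## speedglm sketch (untested); y, x1, x2 and the chunk files are made up
library(speedglm)
chunk1 <- read.csv("chunk1.csv")
fit <- speedlm(y ~ x1 + x2, data = chunk1)
summary(fit)
## fit <- update(fit, data = chunk2)   # see ?update.speedlm for the exact arguments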
----------------Contact Details:-------------------------------------------------------
Contact me: Tal.Galili@gmail.com | 972-52-7275845
Read me: www.talgalili.com (Hebrew) | www.biostatistics.co.il (Hebrew) | www.r-statistics.com (English)
----------------------------------------------------------------------------------------------
On Thu, Feb 25, 2010 at 12:00 AM, manyu_aditya
<abhimanyu.aditya@gmail.com> wrote:
>
> hi,
>
> I have a dataset (the netflix dataset) which is basically ~18k columns and
> a variable number of rows, but let's assume 25 thousand for now. The
> dataset is very sparse. I was wondering how to do kmeans/nearest neighbors
> or kernel density estimation on it.
>
> I tried using the spMatrix function in the "Matrix" package. I think I'm
> able to create the matrix, but as soon as I pass it to the kmeans function
> in package "stats" it says it cannot allocate 3.3 Gb, which is basically
> 18k * 25K * 8.
>
> There is a sparse kmeans solver by Tibshirani but that expects a regular
> dense format matrix, so again the issue is the same.
>
> A simple "no, this is not possible" answer shall suffice, as long as you
> are right!!!
>
> Thanks much.