Hello Manyu,
I am guessing you are referring to the Netflix dataset.
Try looking at ways to represent large data sets, that is, the "Large memory and out-of-memory data" list from the High-Performance Computing task view:
http://cran.r-project.org/web/views/HighPerformanceComputing.html
Here it is:
*Large memory and out-of-memory data*
- The biglm <http://cran.r-project.org/web/packages/biglm/index.html> package
by Lumley uses incremental computations to offer lm() and glm() functionality
for data sets stored outside of R's main memory (a rough sketch follows the list).
- The ff <http://cran.r-project.org/web/packages/ff/index.html> package
by Adler et al. offers file-based access to data sets that are too large to
be loaded into memory, along with a number of higher-level functions (sketch below).
- The bigmemory <http://cran.r-project.org/web/packages/bigmemory/index.html> package
by Kane and Emerson permits storing large objects such as matrices in memory
and uses external pointer objects to refer to them. This permits transparent
access from R without bumping against R's internal memory limits. Several R
processes on the same computer can also share big memory objects (sketch below).
- A large number of database packages, and database-alike packages (such
as sqldf <http://cran.r-project.org/web/packages/sqldf/index.html> by
Grothendieck and data.table <http://cran.r-project.org/web/packages/data.table/index.html>
by Dowle) are also of potential interest but not reviewed here.
- The HadoopStreaming <http://cran.r-project.org/web/packages/HadoopStreaming/index.html> package
provides a framework for writing map/reduce scripts for use in Hadoop
Streaming; it also facilitates operating on data in a streaming fashion
which does not require Hadoop.
- The speedglm <http://cran.r-project.org/web/packages/speedglm/index.html> package
makes it possible to fit (generalised) linear models to large data. For in-memory
data sets, speedlm() or speedglm() can be used along with update.speedlm(), which
can update fitted models with new data. For out-of-memory data sets, shglm()
is available; it works in the presence of factors and can check for singular
matrices (sketch below).
- The biglars <http://cran.r-project.org/web/packages/biglars/index.html> package
by Seligman et al. can use the ff <http://cran.r-project.org/web/packages/ff/index.html>
package to support larger-than-memory datasets for least-angle regression, lasso and
stepwise regression.
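
Since your question is about a concrete data set, here are a few rough, untested sketches of how some of these packages are typically used. All file names, variable names and parameters below (chunk1.csv, y, x1, x2, the number of clusters, and so on) are made up for illustration, so adapt them to your data and check the help pages.

First, biglm: fitting a linear model incrementally, one chunk of rows at a time, so the full data set never has to sit in memory.

## biglm sketch (untested); chunk1.csv / chunk2.csv and y, x1, x2 are made up
library(biglm)
chunk1 <- read.csv("chunk1.csv")
fit <- biglm(y ~ x1 + x2, data = chunk1)   # fit on the first chunk of rows
chunk2 <- read.csv("chunk2.csv")
fit <- update(fit, chunk2)                 # update.biglm(): add the next chunk
summary(fit)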
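
Second, ff: a file-backed matrix of roughly the size you mention. Note that this only solves the storage problem; base kmeans() would still try to pull the whole thing into RAM, so you would have to process it chunk-wise yourself.

## ff sketch (untested); the 25000 x 18000 dimensions are taken from your post
library(ff)
x <- ff(vmode = "double", dim = c(25000, 18000))  # lives in a file, not in RAM
x[1, 1:5] <- c(1, 0, 0, 2, 0)                     # indexed like an ordinary matrix
x[1, 1:5]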
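
Third, bigmemory, which is probably closest to what you asked: if I remember correctly, its companion package biganalytics (not in the list above, so please double-check it on CRAN) has a bigkmeans() function that runs k-means directly on a big.matrix.

## bigmemory / biganalytics sketch (untested); biganalytics and bigkmeans()
## are my assumption here -- please verify them before relying on this
library(bigmemory)
library(biganalytics)
x <- filebacked.big.matrix(nrow = 25000, ncol = 18000, type = "double",
                           backingfile = "netflix.bin",
                           descriptorfile = "netflix.desc")
## ... fill x chunk by chunk from your raw rating files ...
cl <- bigkmeans(x, centers = 10)   # 10 clusters is an arbitrary example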
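
Finally, speedglm: speedlm()/speedglm() on an in-memory chunk, with update.speedlm() adding further chunks. I have not used the update step myself, so treat the commented line as a pointer rather than working code.

## speedglm sketch (untested); y, x1, x2 and the chunk files are made up
library(speedglm)
chunk1 <- read.csv("chunk1.csv")
fit <- speedlm(y ~ x1 + x2, data = chunk1)
summary(fit)
## fit <- update(fit, data = chunk2)   # see ?update.speedlm for the exact arguments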
----------------Contact Details:-------------------------------------------------------
Contact me: Tal.Galili@gmail.com | 972-52-7275845
Read me: www.talgalili.com (Hebrew) | www.biostatistics.co.il (Hebrew) | www.r-statistics.com (English)
----------------------------------------------------------------------------------------------
On Thu, Feb 25, 2010 at 12:00 AM, manyu_aditya
<abhimanyu.aditya@gmail.com> wrote:
>
> hi,
>
> I have a dataset (the netflix dataset) which is basically ~18k columns and
> a variable number of rows, but let's assume 25 thousand for now. The
> dataset is very sparse. I was wondering how to do kmeans/nearest neighbors
> or kernel density estimation on it.
>
> I tried using the spMatrix function in the "Matrix" package. I think I'm
> able to create the matrix, but as soon as I pass it to the kmeans function
> in package "stats" it says it cannot allocate 3.3 Gb, which is basically
> 18k * 25K * 8.
>
> There is a sparse kmeans solver by Tibshirani but that expects a regular
> dense format matrix, so again the issue is the same.
>
> A simple "no, this is not possible" answer shall suffice, as long as you
> are right!!!
>
> Thanks much.