utkarshsinghal
2009-Jun-02 15:25 UTC
[R] bigmemory - extracting submatrix from big.matrix object
I am using library(bigmemory) to handle large datasets, say 1 GB, and am facing the following problems. Any hints would be helpful.

_Problem-1:_
I am using the "read.big.matrix" function to create a filebacked big matrix of my data and get the following warning:

> x = read.big.matrix("/home/utkarsh.s/data.csv",header=T,type="double",shared=T,backingfile = "backup", backingpath = "/home/utkarsh.s")
Warning message:
In filebacked.big.matrix(nrow = numRows, ncol = numCols, type = type, :
  A descriptor file has not been specified. A descriptor named backup.desc will be created.

However, there is no such argument in "read.big.matrix". There is an argument "descriptorfile" in the function "as.big.matrix", but if I try to use it in "read.big.matrix", I get an error showing it as an unused argument (as expected).

_Problem-2:_
I want to get a filebacked *sub*matrix of "x", say only selected columns: x[, 1:100]. Is there any way of doing that without actually loading the data into R memory?

_Problem-3:_
There are functions available like summary, colmean, colsd, ... for standard summary statistics. But is there any way to calculate other summaries, say the number of missing values or the skewness of each variable, without loading the whole data into R memory?

Regards
Utkarsh
Jay Emerson
2009-Jun-02 18:08 UTC
[R] bigmemory - extracting submatrix from big.matrix object
Thanks for trying this out.

Problem 1. We'll check this. Options should certainly be available. Thanks!

Problem 2. Fascinating. We just (yesterday) implemented a sub.big.matrix() function doing exactly this, creating something that is a big matrix but which just references a contiguous subset of the original matrix. This will be available in an upcoming version (hopefully in the next week). A more specialized function would create an entirely new big.matrix from a subset of a first big.matrix, making an actual copy, but this is something else altogether. You could do this entirely within R without much work, by the way, and only 2* memory overhead.

Problem 3. You can count missing values using mwhich(). For other exploration (e.g. skewness) at the moment you should just extract a single column (variable) at a time into R, study it, then get the next column, etc. We will not be implementing all of R's functions directly with big.matrix objects. We will be creating a new package "bigmemoryAnalytics" and would welcome contributions to the package.

Feel free to email us directly with bugs, questions, etc.

Cheers,

Jay

--
John W. Emerson (Jay)
Assistant Professor of Statistics
Department of Statistics
Yale University
http://www.stat.yale.edu/~jay
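The column-at-a-time approach suggested for Problem 3 might look like the sketch below. It relies only on ncol() and x[, j] indexing, which big.matrix objects support, so a small ordinary matrix stands in for the filebacked matrix here. The helper names vec_skewness, col_skewness and col_na_count are hypothetical and not part of bigmemory, and the skewness is the simple moment-based estimate m3 / m2^1.5, which is an assumption about which definition is wanted.

```r
# Hypothetical helpers, not part of bigmemory. Each one pulls a single
# column at a time into R, so only one column is held in memory at once.

# Moment-based sample skewness of one numeric vector (NAs are dropped).
vec_skewness <- function(v) {
  v <- v[!is.na(v)]
  d <- v - mean(v)
  mean(d^3) / mean(d^2)^1.5
}

# Skewness of every column, extracted one column at a time.
col_skewness <- function(x) {
  vapply(seq_len(ncol(x)), function(j) vec_skewness(x[, j]), numeric(1))
}

# Number of missing values in every column, one column at a time.
col_na_count <- function(x) {
  vapply(seq_len(ncol(x)), function(j) sum(is.na(x[, j])), integer(1))
}

# A small ordinary matrix stands in for the big.matrix in this example.
m <- cbind(a = c(1, 2, 3, NA), b = c(1, 1, 1, 10))
na_counts <- col_na_count(m)
skews <- col_skewness(m)
```

With a real big.matrix the same calls should apply unchanged, at the cost of one column extraction per variable.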
utkarshsinghal
2009-Jun-03 07:16 UTC
[R] bigmemory - extracting submatrix from big.matrix object
Thanks for the really valuable inputs, and for developing the package and updating it regularly. I will be glad if I can contribute in any way.

In problem three, however, I am interested in knowing a generic way to apply any function on the columns of a big.matrix object (obviously without loading the data into R). Maybe the source code of the function "colmean" can help, if that is not too much to ask for. Or perhaps we can develop a function similar to "apply" in base R.

Regards
Utkarsh

Jay Emerson wrote:
> We also have ColCountNA(), which is not currently exposed to the user
> but will be in the next version.
>
> Jay
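A generic column-wise apply of the kind asked about above could be sketched as follows. col_apply is a hypothetical helper, not a bigmemory function. It reads a block of columns at a time through ordinary x[, cols] indexing, so memory use is bounded by the block size rather than by the full matrix. An ordinary matrix stands in for the big.matrix in the example.

```r
# Hypothetical helper, not part of bigmemory: apply FUN to every column of a
# matrix-like object, reading at most `block` columns into R at a time.
col_apply <- function(x, FUN, ..., block = 10L) {
  FUN <- match.fun(FUN)
  p <- ncol(x)
  if (p == 0L) return(list())
  out <- vector("list", p)
  for (start in seq(1L, p, by = block)) {
    cols <- start:min(start + block - 1L, p)
    chunk <- x[, cols]
    # Indexing may drop dimensions (single column or single row);
    # restore a matrix so each column can be addressed uniformly.
    if (is.null(dim(chunk))) chunk <- matrix(chunk, ncol = length(cols))
    for (k in seq_along(cols)) {
      out[[cols[k]]] <- FUN(chunk[, k], ...)
    }
  }
  out
}

# Example with an ordinary 3 x 3 matrix, processed two columns at a time.
m <- matrix(c(1, 2, NA, 4, 5, 6, 7, 8, 9), nrow = 3)
medians <- unlist(col_apply(m, median, na.rm = TRUE, block = 2L))
n_missing <- unlist(col_apply(m, function(v) sum(is.na(v)), block = 2L))
```

Returning a list keeps the helper usable with functions whose results are not single numbers; unlist() collapses it when every result is a scalar. A larger block means fewer extractions but more memory per step.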