Shraddha Pai
2013-Mar-20 18:22 UTC
[R] bigmemory: Using backing file as alternate to write.big.matrix
Hi,

Does the backing file of a big.matrix store the contents of the entire matrix, or only the portion that is not currently held in RAM? In other words, can the backing file be treated as a complete, persistent copy of the matrix's data?

I have been writing my big.matrix objects to disk with write.big.matrix(), and other programs that need the matrix read it back in with read.big.matrix(). A colleague pointed out that the explicit write is unnecessary, because the matrix's contents are already in the backing file: if the backing file is kept in the appropriate directory, future reads can proceed directly from the matrix's descriptor file via attach.big.matrix(). That seems to make sense; there is no need to write the data out again in text format.

I just want to confirm with people more familiar with the internals of bigmemory that the backing file is indeed a safe way to persistently store big matrices. The documentation gave me pause because the backing file is described as a "cache".

Thanks in advance,
Shraddha
-----
Shraddha Pai
Post-doctoral fellow
Krembil Family Epigenetic Research Laboratory (Lab head: Dr. Art Petronis)
Centre for Addiction and Mental Health, Toronto
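For readers following along, the workflow being described can be sketched roughly as below. This is a minimal illustration, not code from the thread; the file names ("x.bin", "x.desc") and dimensions are made up, and it assumes the bigmemory package is installed.

    library(bigmemory)

    # Create a file-backed big.matrix: "x.bin" holds the full data on
    # disk and "x.desc" records how to re-attach it later.
    x <- filebacked.big.matrix(nrow = 1000, ncol = 10, type = "double",
                               backingfile = "x.bin",
                               descriptorfile = "x.desc")
    x[, ] <- rnorm(1000 * 10)

    # In a later R session (or another program), re-attach directly from
    # the descriptor file -- no write.big.matrix()/read.big.matrix()
    # round trip through a text file is needed.
    y <- attach.big.matrix("x.desc")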
Shraddha Pai
2013-Mar-21 13:54 UTC
[R] bigmemory: Using backing file as alternate to write.big.matrix
OK, I ran a test doing both: I wrote a ~6,000,000 x 58 double matrix to a .txt file with write.big.matrix(), and also left the backing file and descriptor file in place (rather than deleting them as I usually do). I then opened a different R session and compared the first 100 rows from both sources; they are identical.

Size-wise, the .bin backing file is over twice the size of the .txt file (here the .bin was 2,641 MB and the .txt was 1,184 MB).

So my conclusion is this: if the matrix will be read often by downstream programs, keep the .bin backing file. Code that reads the matrix can simply attach it, which is very fast (0.002 s elapsed; in contrast, reading the .txt version with read.big.matrix() took 76 s on my machine). If disk space is a constraint and the matrix is not expected to be read very often, save it as a text file and read it with read.big.matrix().
-----
library(bigmemory)

# Attach from the descriptor file -- very fast
m <- attach.big.matrix("rawXpr.desc")

# Same matrix saved as .txt; read 100 rows for the test
n <- read.table("rawXpr.txt", sep = "\t", header = FALSE,
                as.is = TRUE, nrow = 100)
n <- as.matrix(n)  # was a data.frame

sapply(1:nrow(n), function(x) all.equal(n[x, ], m[x, ]))
 [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[16] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[31] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[46] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[61] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[76] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[91] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
-----
--
View this message in context: http://r.789695.n4.nabble.com/bigmemory-Using-backing-file-as-alternate-to-write-big-matrix-tp4661958p4662055.html
Sent from the R help mailing list archive at Nabble.com.
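A follow-up note on the size gap: the backing file stores every element as a raw 8-byte double, regardless of its printed length, so its size is fixed by the matrix dimensions. A back-of-envelope check (assuming ~6 million rows, as in the test above) lines up well with the observed 2,641 MB:

    rows <- 6e6
    cols <- 58
    bytes_per_double <- 8

    # Expected backing-file size in MB (1 MB = 2^20 bytes)
    rows * cols * bytes_per_double / 2^20
    # approximately 2655 MB, close to the observed 2,641 MB

The .txt file is smaller here simply because the values happen to print in fewer than 8 characters on average; a text file can just as easily be larger than the binary one when values need many digits.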