Hi all,

SETUP: I have pairwise data on 22 chromosomes. The data matrix X for a given chromosome looks like this:

 1  13  58  1.12
 6 142  56  1.11
18 307  64  3.13
22 320  58  0.72

where column 1 is the ID of person 1, column 2 is the ID of person 2, column 3 can be ignored, and column 4 is how much chromosomal sharing those two individuals have in some small portion of the chromosome. There are 9000 individual people, and therefore ~(9000^2)/2 pairwise matches at each small location on the chromosome, so across an entire chromosome these matrices are VERY large (e.g., 3 billion rows, which exceeds R's 2^31 - 1 vector length limit). I have access to a server with 64-bit R, 1 TB of RAM, and 80 processors.

PROBLEM: A pair of individuals (e.g., persons 1 and 13 from the first row above) will show up multiple times in a given file. I want to sum column 4 across each pair of individuals. If I could bring the matrix into R, I could use tapply() to accomplish this by indexing on paste(X[,1], X[,2]), but the matrix doesn't fit into R. I have been trying to use the bigmemory and bigtabulate packages, but when I try to use the bigsplit function, R never completes the operation (after a day, I killed the process). In particular, I did this:

X <- read.big.matrix("file.loc.X", sep=" ", type="double")
hap.indices <- bigsplit(X, 1:2)  # this runs for too long to be useful on these matrices
# I was then going to use a foreach loop to sum across the splits
# identified by bigsplit

SO - does anyone have ideas on how to deal with this problem - i.e., how to use a tapply()-like function on an enormous matrix? This isn't necessarily a bigtabulate question (although if I screwed up using bigsplit, let me know). If another package (e.g., an SQL package) can do something like this efficiently, I'd like to hear about it and your experiences using it.

Thank you in advance,

Matt

--
Matthew C Keller
Asst. Professor of Psychology
University of Colorado at Boulder
www.matthewckeller.com
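For concreteness, a toy in-memory version of the pair-summing described above (made-up rows; at 3 billion rows this is exactly the step that fails):

# Toy data: columns are ID1, ID2, (ignored), sharing
X <- matrix(c( 1,  13, 58, 1.12,
               6, 142, 56, 1.11,
               1,  13, 60, 0.50,   # pair (1, 13) occurs twice
              22, 320, 58, 0.72),
            ncol = 4, byrow = TRUE)

# Sum column 4 within each pair of individuals
tapply(X[, 4], paste(X[, 1], X[, 2]), sum)
#  1 13 22 320  6 142
#  1.62   0.72   1.11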
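On the SQL idea raised at the end of the message: a hedged sketch of how that could look with RSQLite, streaming the file into an on-disk table in chunks and letting SQLite's GROUP BY do the summing out of core (the table name, column names, and chunk size are illustrative assumptions, not tested at this scale):

library(RSQLite)

db  <- dbConnect(SQLite(), dbname = "pairs.sqlite")
con <- file("file.loc.X", open = "r")

# Stream the flat file into an on-disk SQLite table, 1e6 rows at a time
repeat {
  x <- tryCatch(
    read.table(con, nrows = 1e6,
               col.names = c("id1", "id2", "pos", "share")),
    error = function(e) NULL)  # read.table errors once the file is exhausted
  if (is.null(x)) break
  if (dbExistsTable(db, "pairs")) {
    dbWriteTable(db, "pairs", x, append = TRUE)
  } else {
    dbWriteTable(db, "pairs", x)
  }
}
close(con)

# GROUP BY plays the role of tapply(); the full 3e9 rows never sit in R's memory
res <- dbGetQuery(db,
  "SELECT id1, id2, SUM(share) AS total FROM pairs GROUP BY id1, id2")
dbDisconnect(db)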
On Tue, Feb 21, 2012 at 4:04 PM, Matthew Keller <mckellercran at gmail.com> wrote:
> X <- read.big.matrix("file.loc.X", sep=" ", type="double")
> hap.indices <- bigsplit(X, 1:2)  # this runs for too long to be useful on these matrices
> # I was then going to use a foreach loop to sum across the splits
> # identified by bigsplit

How about just using foreach earlier in the process? E.g., split file.loc.X into (80) sub-files, then run read.big.matrix/bigsplit/sum on each inside %dopar%.

If splitting X beforehand is a problem, you could also use ?scan to read in different chunks of the file, something like (untested, obviously):

# for X a matrix of 800 rows x 4 columns, read in 8 chunks of 100 lines each
library(foreach)
library(doParallel)
registerDoParallel(cores = 8)  # a backend must be registered for %dopar%

lineind <- seq(0, 700, by = 100)  # lines to skip before each chunk

ReducedX <- foreach(i = 1:8) %dopar% {
  x <- scan('file.loc.X',
            what = list(double(0), double(0), double(0), double(0)),
            skip = lineind[i], nlines = 100)
  # do your thing on x (aggregate/tapply etc.), e.g.:
  tapply(x[[4]], paste(x[[1]], x[[2]]), sum)
}

Hope this helped,
Elai.
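One detail to keep in mind with the chunked approach: the same pair of IDs can land in different chunks, so the per-chunk sums still need a final combine. A minimal sketch, assuming each element of ReducedX is the named vector returned by tapply() as above:

# Names of each per-chunk vector are the "id1 id2" keys from paste();
# foreach returns an unnamed list, so unlist() keeps those keys intact.
combined   <- unlist(ReducedX)
# A pair split across chunks has several entries; sum them once more
final.sums <- tapply(combined, names(combined), sum)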