Peter Waltman
2007-Aug-16 15:57 UTC
[R] Possible memory leak with large matrices in R v.2.5.0
I'm working with a very large matrix (22k rows x 2k cols) of RNA expression data with R v.2.5.0 on a Red Hat Enterprise machine, x86_64 architecture. The relevant code is below; I call a function on a cluster of this data (a list structure whose $rows element lists the rows (genes) in the cluster by ID, but does not contain the actual data itself). The function creates two copies of the matrix, one containing the rows in the cluster and one containing the rest of the rows in the matrix. After some statistical massaging, the function returns a statistical score for each row/gene in the matrix, producing a vector of 22k elements.

When I run 'top', the memory footprint of R after loading the matrix is ~750M. However, after calling this function on 10 clusters, this jumps to > 3.7 gig (at least by top's measurement), and it is not reduced by any subsequent calls to gc(). Output from gc() is:

> gc()
            used  (Mb) gc trigger   (Mb) max used  (Mb)
Ncells    377925  20.2    6819934  364.3   604878  32.4
Vcells  88857341 678.0  240204174 1832.7 90689707 692.0

Output from top is:

  PID USER     PR NI  VIRT  RES  SHR S %CPU %MEM   TIME+  COMMAND
 1199 waltman  17  0 3844m 3.7g 3216 S  0.0 23.6 29:58.74 R

Note, the relevant call that invoked my function is:

test <- sapply( c(1:10), function(x) get.vars.for.cluster( clusterStack[[x]], opt="rows" ) )

Finally, for fun, I rm()'d all variables with the rm( list=ls() ) command and then called gc(). The memory footprint of this "empty" instance of R is still 3.4 gig, i.e.:

> rm( list=ls() )
> ls()
character(0)
> gc()
            used  (Mb) gc trigger   (Mb) max used  (Mb)
Ncells    363023  19.4    5455947  291.4   604878  32.4
Vcells  44434871 339.1  192163339 1466.1 90689707 692.0

Subsequent output from top is:

  PID USER     PR NI  VIRT  RES  SHR S %CPU %MEM   TIME+  COMMAND
 1199 waltman  16  0 3507m 3.4g 3216 S  0.0 21.5 29:58.92 R

Thanks for any help or suggestions,

Peter Waltman

p.s. code snippet follows.
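(For scale, a sketch of the arithmetic: a 22,000 x 2,000 numeric matrix holds 44 million doubles, i.e. roughly 340 MB, so the ~88M Vcells reported by gc() above correspond to about two matrix-sized copies in use. A quick check with a throwaway matrix of the same shape, not the real data:)

```r
## 22000 * 2000 doubles at 8 bytes each ~ 336 MiB for one copy of the matrix
as.numeric( object.size( matrix( 0.0, nrow=22000, ncol=2000 ) ) ) / 2^20
```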
Note that I've added extra rm() and gc() calls within the function to try to reduce the memory footprint, to no avail.

get.vars.for.cluster = function( cluster,
                                 genes=get.global( "gene.ids" ),
                                 opt=c("rows","cols"),
                                 ratios=get.global("ratios"),
                                 var.norm=T,
                                 r.sig=get.global( "r.sig" ),
                                 allow.anticor=get.global( "allow.anticor" ) ) {
  cluster <<- cluster
  opt <- match.arg( opt )
  rows <- cluster$rows
  cols <- cluster$cols
  if ( opt == "rows" ) {
    r <- ratios[ rows, cols ]
    r.all <- ratios[ genes, cols ]
    avg.rows <- apply( r, 2, mean, na.rm=T ) ## median
    rm( r )
    gc()
    devs <- apply( r.all, 1, "-", avg.rows )
    if ( !allow.anticor ) {
      rm( r.all, avg.rows )
      gc()
    }
    vars <- apply( devs, 2, var, na.rm=T )
    rm( devs )
    gc()
    vars <- log10( vars )
    gc()
    if ( allow.anticor ) {
      ## Get variance against the inverse of the mean profile
      devs.2 <- apply( r.all, 1, "-", -avg.rows )
      gc()
      vars.2 <- apply( devs.2, 2, var, na.rm=T )
      gc()
      rm( devs.2 )
      gc()
      vars.2 <- log10( vars.2 )
      gc()
      ## For each gene take the min of variance or anti-cor variance
      vars <- cbind( vars, vars.2 )
      rm( vars.2 )
      gc()
      vars <- apply( vars, 1, min )
      gc()
    }
    ## Normalize the values by the variance over the rows in the cluster
    if ( var.norm ) {
      vars <- vars - mean( vars[ rows ], na.rm=T )
      tmp.sd <- sd( vars[ rows ], na.rm=T )
      if ( ! is.na( tmp.sd ) && tmp.sd != 0 ) vars <- vars / ( tmp.sd + r.sig )
    }
    gc()
    return( vars )
  } else {
    r.all <- ratios[ rows, ]
    ## Mean-normalized variance
    vars <- log10( apply( r.all, 2, var, na.rm=T ) /
                   abs( apply( r.all, 2, mean, na.rm=T ) ) )
    names( vars ) <- colnames( ratios )
    ## Normalize the values by the variance over the rows in the cluster
    if ( var.norm ) {
      vars <- vars - mean( vars[ cluster$cols ], na.rm=T )
      tmp.sd <- sd( vars[ cluster$cols ], na.rm=T )
      if ( ! is.na( tmp.sd ) && tmp.sd != 0 ) vars <- vars / ( tmp.sd + r.sig )
    }
    return( vars )
  }
}
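(One aside on the apply() calls: apply( r.all, 1, "-", avg.rows ) materializes a full transposed copy of the matrix as its result. The same deviations can be computed with sweep(), which keeps the genes-by-conditions orientation and skips that transposed intermediate. A sketch with stand-in data rather than the real ratios matrix; not claiming this alone explains the memory growth:)

```r
set.seed( 1 )
r.all    <- matrix( rnorm( 20 ), nrow=5 )  # stand-in for the ratios submatrix
avg.rows <- colMeans( r.all )              # stand-in for the cluster mean profile

## Same deviations as apply( r.all, 1, "-", avg.rows ), but untransposed:
devs <- sweep( r.all, 2, avg.rows )
## Per-gene variances are then taken over rows (margin 1) instead of columns:
vars <- log10( apply( devs, 1, var, na.rm=T ) )
```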