Hi Martin (and others, hopefully) -
Upon your suggestion, I tried your code and did not see a leak, and this
was reflected by the 'top' command.
However, I modified your code snippet below to make it a little closer
to what I'm actually doing, and can now reproducibly cause the leak that
I'm seeing (I've put my code snippet at the bottom of this message).
Note: I'm using R v.2.5.1 on a MacOS box with 4 gigs of memory, with the
following output from sessionInfo():
> sessionInfo()
R version 2.5.1 (2007-06-27)
powerpc-apple-darwin8.9.1
locale:
C
attached base packages:
[1] "stats" "graphics" "grDevices"
"utils" "datasets"
"methods"
[7] "base"
>
As you'll see, to make it closer to what I'm doing, I changed the 'val'
matrix to a 20k x 2k matrix containing random values. You'll also see
that I added dimnames to the matrix, though they're commented out in the
code I sent you: when using the dimnames, R will start throwing malloc
errors, hence the reason I commented them out.
Using the f.inds() function with sapply, i.e. res <- sapply(1:10,
function(i) f.inds() ), I will get a memory leak even if I've detached both
'foo' and 'bar' and have removed everything ( rm( list=ls() ) ).
When I garbage collect, R reports:
> gc()
          used (Mb) gc trigger  (Mb)  max used   (Mb)
Ncells  237802  6.4     531268  14.2     531268   14.2
Vcells  105865  0.9  104088594 794.2  161014887 1228.5
>
However, top still reports that R is using 284 Megs of memory. I know
you don't think that 'top' is the best way of gauging memory usage, but
could you try out my code on your machine and let me know what you're
seeing? If you don't recommend 'top', then what other method should
I try?
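(In case it helps to compare notes, the only in-R checks I know of are along
these lines -- just a sketch; "mem.out" is an arbitrary file name, and
Rprofmem() is only available if R was compiled with --enable-memory-profiling,
so it may not apply to every build:)

gc()                                        # cells actually in use vs. the gc trigger
object.size( get( "val", "bar" ) )          # size of the big matrix itself, in bytes
Rprofmem( "mem.out" )                       # log allocations to a file (name is arbitrary)...
res <- sapply( 1:10, function(i) f.inds() )
Rprofmem( NULL )                            # ...then stop logging and inspect mem.out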
Thanks,
Peter
foo <- list(
    getbar = function() get( "val", "bar" ),
    f.names = function() {
        my.cols <- sample( colnames( val ), 750 )
        my.r <- val[ sample( rownames( val ), 15 ), my.cols ]
        avg.rows <- apply( my.r, 2, mean, na.rm=TRUE )
        rm( my.r )
        gc()
        my.r.all <- val[ , my.cols ]
        devs <- apply( my.r.all, 1, "-", avg.rows )
        rm( my.r.all )
        gc()
        apply( devs, 2, var, na.rm=TRUE )
    },
    f.inds = function() {
        my.cols <- sample( ncol( val ), 750 )
        my.r <- val[ sample( nrow( val ), 15 ), my.cols ]
        avg.rows <- apply( my.r, 2, mean, na.rm=TRUE )
        rm( my.r )
        gc()
        my.r.all <- val[ , my.cols ]
        devs <- apply( my.r.all, 1, "-", avg.rows )
        rm( my.r.all )
        gc()
        apply( devs, 2, var, na.rm=TRUE )
    }
)
attach(foo)
bar <- list( val=matrix( rnorm( 20000*2000 ), 20000, 2000 ) )
## dimnames version (commented out -- with these, R starts throwing malloc errors):
## bar <- list( val=matrix( rnorm( 20000*2000 ), 20000, 2000,
##                          dimnames=list( paste( "AT2G", 1:20000, sep="" ),
##                                         paste( "AT2Gcol", 1:2000, sep="" ) ) ) )
attach(bar)
#res <- sapply(1:10, function(i) f.inds())
#gc()
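For completeness, the full sequence I run is roughly the two commented lines
above (uncommented) plus the cleanup below; after this, gc() gives the numbers
quoted above, but 'top' still reports ~284 Megs:

res <- sapply( 1:10, function(i) f.inds() )
rm( res )
detach( "foo" )
detach( "bar" )
rm( list=ls() )
gc()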
Martin Morgan wrote:
> Hi Peter --
>
> I guess I think that the people who need to know are already reading
> the list, but they're not jumping in because you haven't stated the
> problem in a way that makes it clear it really is a bug in R, as
> opposed to a bug in your code.
>
> The two things 'they' need are:
>
> (1) an easy way to reproduce the problem. Here's my best guess at
> where you're at now, pasted into a new R session:
>
>
>> gc()
>           used (Mb) gc trigger (Mb) max used (Mb)
> Ncells  140809  7.6     350000 18.7   350000 18.7
> Vcells  126374  1.0     786432  6.0   576498  4.4
>
>> foo=list(getbar=function() get("val", "bar"),
> +          f=function() apply(getbar(), 2, var))
>
>> attach(foo)
>> bar <- list(val=matrix(0, 1000, 1000))
>> attach(bar)
>> res <- sapply(1:10, function(i) f())
>> gc()
>           used (Mb) gc trigger (Mb) max used (Mb)
> Ncells   141150  7.6    350000 18.7   350000 18.7
> Vcells  2136495 16.4   4944736 37.8  4944700 37.8
>
>> rm(list=ls(all=TRUE))
>> detach("foo")
>> detach("bar")
>> gc()
>           used (Mb) gc trigger (Mb) max used (Mb)
> Ncells   141533  7.6    350000 18.7   350000 18.7
> Vcells   126472  1.0   3955788 30.2  4944700 37.8
>
> Except of course that it tidies up after itself nicely! I'm pretty
> dubious about 'top' as a source of reliable memory usage info, but
> don't have enough experience to know of an easy better way (just
> harder better ways!).
>
> (2) 'They' will also want to see the output of sessionInfo()
>
>
>> sessionInfo()
> R version 2.6.0 Under development (unstable) (2007-08-14 r42505)
> x86_64-unknown-linux-gnu
>
> locale:
> LC_CTYPE=en_US;LC_NUMERIC=C;LC_TIME=en_US;LC_COLLATE=en_US;LC_MONETARY=en_US;LC_MESSAGES=en_US;LC_PAPER=en_US;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US;LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> Probably 'they' would respond to R 2.5.1, but not 2.4.x; you run the
> risk of being told to 'update R, this bug was fixed a long time ago',
> which of course you can just shrug off.
>
> Also, sapply returns a vector of results, and so has to be able to
> store this vector. This means that it will necessarily retain all the
> data (your for loop discards all but the last result, I guess, which
> doesn't do you much good!).
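(A toy illustration of this point, as I understand it -- 'f.dummy' here is
just a stand-in for my real function:)

f.dummy <- function(i) rnorm(1e6)        # dummy worker, ~8 Mb of doubles per call
res <- sapply( 1:10, f.dummy )           # all 10 results kept: a 1e6 x 10 matrix (~80 Mb)
for ( i in 1:10 ) last <- f.dummy( i )   # only the most recent result is retained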
>
> If you can produce a really simple example, I'd suggest forwarding it
> to the R-devel mailing list.
>
> Martin
>
> Peter Waltman <waltman at cs.nyu.edu> writes:
>
>
>> Hi Martin -
>>
>> No worries. I know that this is a bit of nasty problem and I totally
>> appreciate you taking a look at it.
>>
>> Who could you suggest I contact, since I'm now fairly convinced it's a
>> memory leak in the apply function (and related functions)? I commented
>> out all the lines that use the (s)apply function and replaced them with
>> for loops, and found that when performing the same test, R's memory
>> stamp never exceeded 1.5 gig... and finished with 746 Meg (after
>> gc()'ing everything).
>>
>> Not surprisingly, given its low memory stamp, it also completed much,
>> much faster than the version which used the (s)apply functions.
>>
>> Thanks,
>>
>> Peter
>>
>>
>> Martin Morgan wrote:
>>
>>> Hi Peter --
>>>
>>> I can't really help further, other than to suggest that you create a
>>> simple (5-10 lines of cut-and-paste code) example reproducible (by
>>> others), and that investigating whether your issue occurs in a *more
>>> recent* version of R (so as to discover current rather than fixed
>>> bugs) may be more productive. Also, I don't know in detail the ins and
>>> outs of memory management in R, but I would imagine that (a) a pool of
>>> memory is retained even when not used; and (b) top might well not
>>> accurately measure memory allocation. The 'Writing R Extensions'
>>> manual has a section on using valgrind, which (especially when
>>> compared to a 'normal' R session) would be a more reliable way to
>>> document a memory leak.
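(For my own reference, the 'Writing R Extensions' recipe Martin mentions boils
down to starting R under valgrind, roughly as below; 'leak-test.R' is just a
placeholder name for a script holding the code in question, and I haven't
tried this on the Mac build:)

R -d valgrind --vanilla < leak-test.R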
>>>
>>> Martin
>>>
>>>
>>> Peter Waltman <waltman at cs.nyu.edu> writes:
>>>
>>>
>>>
>>>> Hi Martin -
>>>>
>>>> Thanks for the feedback. Right after I sent the email to the list
>>>> last night, I realized that I'd forgotten to clear all the vars we
>>>> attach() to the environment before rm()'ing everything. Whoops.
>>>>
>>>> However, I found that while doing this *does* reduce the memory stamp
>>>> by about a gig (down to 2.6 from 3.4), subsequent calls to gc() still
>>>> can't reduce the memory stamp any further.
>>>>
>>>> Also, I probably should have added some add'l explanation of our code.
>>>> What I'm working on is legacy code that doesn't implement packages per
>>>> se (or per the definition that R uses, at least). Instead, they are
>>>> lists() of functions, i.e.
>>>>
>>>> try( detach( "gain.funcs" ), silent=T )
>>>> gain.funcs <- list(
>>>>     func1 = function() {
>>>>         ...
>>>>     },
>>>>     func2 = function() {
>>>>         ...
>>>>     }
>>>> )
>>>> attach( gain.funcs )
>>>>
>>>> In this framework, we can source a given .R file and have access to
>>>> these functions without cluttering up the global namespace.
>>>>
>>>> Additionally, the 'ratios' var contains the expression matrix, and is
>>>> actually an elt of another variable called 'global.data' that is also
>>>> attach()'d. Similar to what you suggested, I pulled out the
>>>> get.global( 'ratios' ) parameter from the function (since it's already
>>>> available globally), and found that that had no effect on reducing the
>>>> memory stamp. Frustrating.
>>>>
>>>> Likewise, after checking, I discovered that the
>>>>
>>>> cluster <<- cluster
>>>>
>>>> command was just for debugging purposes, so I've since commented it
>>>> out in both your version, as well as mine.
>>>>
>>>> Additionally, there is no need to pass the cluster in as an argument,
>>>> since we keep it in a stack (implemented as a list) that's available
>>>> in the global namespace, so I re-wrote your version (and mine) to take
>>>> the index of the cluster, rather than pass the cluster itself.
>>>>
>>>> Running your stripped-down version of the get.vars.for.cluster() fn on
>>>> 5 clusters caused R's memory stamp to jump up to as high as 5.7 gig,
>>>> ending with a final stamp of 4.9 gig. detach()'ing all the vars we add
>>>> and then gc()'ing allowed it to drop to 3.4 gig. rm()'ing the var that
>>>> stored the results of these 5 get.vars.for.cluster() calls and then
>>>> gc()'ing did not further reduce the memory stamp. rm()'ing everything
>>>> from the global namespace and gc()'ing dropped this further to 3.1 gig,
>>>> but no further.
>>>>
>>>> Adding lines to remove the r and r.all vars and gc() w/in the
>>>> get.vars.for.cluster() function reduced the running memory footprint to
>>>> a range of 3.5-4.4 gig, and detach()'ing, rm()'ing and gc()'ing dropped
>>>> it down to 2.8 gig.
>>>>
>>>> The odd thing is that calling gc() at this point reports that R is
>>>> using far less memory than 'top' reports, for example, from one of my
>>>> tests:
>>>>
>>>> > gc()
>>>> used (Mb) gc trigger (Mb) max used (Mb)
>>>> Ncells 380537 20.4 6366476 340.1 380701 20.4
>>>> Vcells 88715615 676.9 277437132 2116.7 88715745 676.9
>>>>
>>>> versus top:
>>>>
>>>>   PID USER     PR NI  VIRT  RES  SHR S %CPU %MEM    TIME+ COMMAND
>>>>  1199 waltman  17  0 3844m 3.7g 3216 S  0.0 23.6 29:58.74 R
>>>>
>>>> Given this behavior, I can only assume that there's a memory leak in
>>>> R, at least for v.2.5.0. I'll try to get a version of v.2.4.x
>>>> installed somewhere to see if I see a similar behavior with it.
>>>>
>>>> Peter
>>>>
>>>> Martin Morgan wrote:
>>>>
>>>>
>>>>> Hi Peter --
>>>>>
>>>>> Here's my guess.
>>>>>
>>>>> Ironically, adding things to broken code reduces the signal to noise
>>>>> ratio. I ended up with
>>>>>
>>>>> get.vars.for.cluster = function( cluster,
>>>>>                                  genes=get.global( "gene.ids" ),
>>>>>                                  ratios=get.global( "ratios" ) )
>>>>> {
>>>>>     cluster <<- cluster
>>>>>     rows <- cluster$rows
>>>>>     cols <- cluster$cols
>>>>>
>>>>>     r <- ratios[ rows, cols ]
>>>>>     avg.rows <- apply( r, 2, mean, na.rm=TRUE )
>>>>>     r.all <- ratios[ genes, cols ]
>>>>>     devs <- apply( r.all, 1, "-", avg.rows )
>>>>>
>>>>>     apply( devs, 2, var, na.rm=TRUE )
>>>>> }
>>>>>
>>>>> at what might reproduce your problem (though can't be sure!). The
>>>>> unusual bit is
>>>>>
>>>>> cluster <<- cluster
>>>>>
>>>>> At first I thought this would be a no-op (assigning cluster to
>>>>> itself), but apparently at this point in the code cluster does not
>>>>> exist in the environment of the function (just in the call), so
>>>>> cluster gets assigned outside the function.
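(To make sure I follow the point about <<-, here's a tiny sketch with made-up
values; the argument does indeed end up stored in whatever environment already
has a 'cluster' binding outside the function:)

cluster <- "some existing copy"                  # a binding outside the function
f <- function( cluster ) { cluster <<- cluster }
f( list( rows=1:15, cols=1:750 ) )               # hypothetical cluster argument
cluster                                          # now the list: <<- stored it outside f()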
>>>>>
>>>>> So then my guess is that get.vars.for.cluster is part of a package,
>>>>> and the package has a variable called cluster. get.vars.for.cluster
>>>>> then assigns its first argument to the package variable cluster
>>>>> (which is the first variable called cluster that <<- encounters).
>>>>> rm(list=ls(all=TRUE)) removes everything from the global environment,
>>>>> but (fortunately!) not from the package environment.
>>>>>
>>>>> You might end up storing more than 'just' cluster, depending on what
>>>>> it is.
>>>>>
>>>>> So I think the solution is to rethink the use of <<- (and also the
>>>>> get.global(), which are either for convenience (in which case it
>>>>> would probably be better to specify a default for the function
>>>>> argument) or out of a sense that copying is bad (but this is probably
>>>>> mistaken, since R's semantics are 'copy on change', so passing a
>>>>> 'big' object into a function does not usually trigger a copy)).
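(And a quick check of my understanding of the 'copy on change' point -- a
sketch: reading a big argument is cheap; only modifying it forces a copy:)

big <- matrix( 0, 1000, 1000 )              # ~8 Mb
f <- function(x) sum(x)                     # read-only use: no copy of 'big' is made
g <- function(x) { x[1] <- 1; sum(x) }      # modifying x triggers a local copy
f( big ); g( big )                          # 'big' itself is unchanged either way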
>>>>>
>>>>> You could also try 'detach'ing the package that get.vars.for.cluster
>>>>> is defined in.
>>>>>
>>>>> Hope that points in the right direction,
>>>>>
>>>>> Martin
>>>>>
>>>>> Peter Waltman <waltman at cs.nyu.edu> writes:
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> I'm working with a very large matrix ( 22k rows x 2k cols) of RNA
>>>>>> expression data with R v.2.5.0 on a RedHat Enterprise machine,
>>>>>> x86_64 architecture.
>>>>>> The relevant code is below, but I call a function that takes a
>>>>>> cluster of this data ( a list structure that contains a $rows elt
>>>>>> which lists the rows (genes) in the cluster by ID, but not the
>>>>>> actual data itself ).
>>>>>> The function creates two copies of the matrix, one containing the
>>>>>> rows in the cluster, and one with the rest of the rows in the matrix.
>>>>>> After doing some statistical massaging, the function returns a
>>>>>> statistical score for each row/gene in the matrix, producing a
>>>>>> vector of 22k elt's.
>>>>>> When I run 'top', I see that the memory stamp of R after loading the
>>>>>> matrix is ~750M. However, after calling this function on 10 clusters,
>>>>>> this jumps to > 3.7 gig (at least by 'top's measurement), and this
>>>>>> will not be reduced by any subsequent calls to gc().
>>>>>> Output from gc() is:
>>>>>>
>>>>>> > gc()
>>>>>>             used  (Mb) gc trigger   (Mb) max used  (Mb)
>>>>>> Ncells    377925  20.2    6819934  364.3   604878  32.4
>>>>>> Vcells  88857341 678.0  240204174 1832.7 90689707 692.0
>>>>>> >
>>>>>>
>>>>>> output from top is:
>>>>>>
>>>>>>   PID USER     PR NI  VIRT  RES  SHR S %CPU %MEM    TIME+ COMMAND
>>>>>>  1199 waltman  17  0 3844m 3.7g 3216 S  0.0 23.6 29:58.74 R
>>>>>>
>>>>>> Note, the relevant call that invoked my function is:
>>>>>>
>>>>>> test <- sapply( c(1:10), function(x)
>>>>>>     get.vars.for.cluster( clusterStack[[x]], opt="rows" ) )
>>>>>>
>>>>>> Finally, for fun, I rm()'d all variables with the rm( list=ls() )
>>>>>> command, and then called gc(). The memory of this "empty" instance
>>>>>> of R is still 3.4 gig, i.e.
>>>>>> R console:
>>>>>>
>>>>>> > rm( list=ls() )
>>>>>> > ls()
>>>>>> character(0)
>>>>>> > gc()
>>>>>>             used  (Mb) gc trigger   (Mb) max used  (Mb)
>>>>>> Ncells    363023  19.4    5455947  291.4   604878  32.4
>>>>>> Vcells  44434871 339.1  192163339 1466.1 90689707 692.0
>>>>>> >
>>>>>>
>>>>>> Subsequent output from top is:
>>>>>>
>>>>>>   PID USER     PR NI  VIRT  RES  SHR S %CPU %MEM    TIME+ COMMAND
>>>>>>  1199 waltman  16  0 3507m 3.4g 3216 S  0.0 21.5 29:58.92 R
>>>>>>
>>>>>> Thanks for any help or suggestions,
>>>>>> Peter Waltman
>>>>>> p.s. code snippet follows. Note that I've added extra rm() and gc()
>>>>>> calls w/in the function to try to reduce the memory stamp, to no avail.
>>>>>>
>>>>>> get.vars.for.cluster = function( cluster,
>>>>>>                                  genes=get.global( "gene.ids" ),
>>>>>>                                  opt=c("rows","cols"),
>>>>>>                                  ratios=get.global( "ratios" ), var.norm=T,
>>>>>>                                  r.sig=get.global( "r.sig" ),
>>>>>>                                  allow.anticor=get.global( "allow.anticor" )
>>>>>>                                  ) {
>>>>>>     cat( "phw dbg msg\n")
>>>>>>     cluster <<- cluster
>>>>>>     opt <- match.arg( opt )
>>>>>>     rows <- cluster$rows
>>>>>>     cols <- cluster$cols
>>>>>>     if ( opt == "rows" ) {
>>>>>>         cat( "phw dbg msg: if opt == rows\n" )
>>>>>>         r <- ratios[ rows, cols ]
>>>>>>         r.all <- ratios[ genes, cols ]
>>>>>>         avg.rows <- apply( r, 2, mean, na.rm=T ) ##median )
>>>>>>         rm( r ) # phw added 8/9/07
>>>>>>         gc( reset=TRUE ) # phw added 8/9/07
>>>>>>         devs <- apply( r.all, 1, "-", avg.rows )
>>>>>>         if ( !allow.anticor ) rm( r.all, avg.rows ) # phw added 8/9/07
>>>>>>         gc( reset=TRUE ) # phw added 8/9/07
>>>>>>         cat( "phw dbg msg: finished calc'ing avg.rows & devs\n" )
>>>>>>         ## This is what we'd use from the deHoon paper (bioinformatics/bth927)
>>>>>>         ##sd.rows <- apply( r, 2, sd )
>>>>>>         ##devs <- devs * devs
>>>>>>         ##sd.rows <- sd.rows * sd.rows
>>>>>>         ##sds <- apply( devs, 2, "/", sd.rows )
>>>>>>         ##sds <- apply( sds, 2, sum )
>>>>>>         ##return( log10( sds ) )
>>>>>>         ## This is faster and nearly equivalent
>>>>>>         vars <- apply( devs, 2, var, na.rm=T )
>>>>>>         rm( devs )
>>>>>>         gc( reset=TRUE ) # phw added 8/9/07
>>>>>>         test <- log10( vars ) # phw added 8/9/07
>>>>>>         rm( vars ) # phw added 8/9/07
>>>>>>         gc( reset=TRUE ) # phw added 8/9/07
>>>>>>         vars <- log10( test ) # phw added 8/9/07
>>>>>>         rm( test ) # phw added 8/9/07
>>>>>>         gc( reset=TRUE ) # phw added 8/9/07
>>>>>>         # vars <- log10( vars )
>>>>>>         cat( "phw dbg msg: finished calc'ing vars\n" )
>>>>>>         ## HOW TO ALLOW FOR ANTICOR??? Here's how:
>>>>>>         if ( allow.anticor ) {
>>>>>>             cat( "phw dbg msg: allow.anticor==T\n" )
>>>>>>             ## Get variance against the inverse of the mean profile
>>>>>>             devs.2 <- apply( r.all, 1, "-", -avg.rows )
>>>>>>             gc( reset=TRUE ) # phw added 8/9/07
>>>>>>             vars.2 <- apply( devs.2, 2, var, na.rm=T )
>>>>>>             rm( devs.2 )
>>>>>>             gc( reset=TRUE ) # phw added 8/9/07
>>>>>>             vars.2 <- log10( vars.2 )
>>>>>>             gc( reset=TRUE ) # phw added 8/9/07
>>>>>>             ## For each gene take the min of variance or anti-cor variance
>>>>>>             vars <- cbind( vars, vars.2 )
>>>>>>             rm( vars.2 )
>>>>>>             gc( reset=TRUE ) # phw added 8/9/07
>>>>>>             vars <- apply( vars, 1, min )
>>>>>>             gc( reset=TRUE ) # phw added 8/9/07
>>>>>>         }
>>>>>>         ## Normalize the values by the variance over the rows in the cluster
>>>>>>         if ( var.norm ) {
>>>>>>             cat( "phw dbg msg: var.norm == T \n")
>>>>>>             vars <- vars - mean( vars[ rows ], na.rm=T )
>>>>>>             tmp.sd <- sd( vars[ rows ], na.rm=T )
>>>>>>             if ( ! is.na( tmp.sd ) && tmp.sd != 0 ) vars <- vars / ( tmp.sd + r.sig )
>>>>>>         }
>>>>>>         gc( reset=TRUE ) # phw added 8/9/07
>>>>>>         return( vars )
>>>>>>     } else {
>>>>>>         cat( "phw dbg msg: else\n" )
>>>>>>         r.all <- ratios[ rows, ]
>>>>>>         ## Mean-normalized variance
>>>>>>         vars <- log10( apply( r.all, 2, var, na.rm=T ) /
>>>>>>                        abs( apply( r.all, 2, mean, na.rm=T ) ) )
>>>>>>         names( vars ) <- colnames( ratios )
>>>>>>         ## Normalize the values by the variance over the rows in the cluster
>>>>>>         if ( var.norm ) {
>>>>>>             vars <- vars - mean( vars[ cluster$cols ], na.rm=T )
>>>>>>             tmp.sd <- sd( vars[ cluster$cols ], na.rm=T )
>>>>>>             if ( ! is.na( tmp.sd ) && tmp.sd != 0 ) vars <- vars / ( tmp.sd + r.sig )
>>>>>>         }
>>>>>>         return( vars )
>>>>>>     }
>>>>>> },
>>>>>> ______________________________________________
>>>>>> R-help at stat.math.ethz.ch mailing list
>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
>>>>>> and provide commented, minimal, self-contained,
reproducible code.
>>>>>>
>>>>>>
>>>>>>
>>>
>
>