thr3ads.net - R help - [R] Memory leak with character arrays? [Jan 2007]

Peter Waltman

2007-Jan-17 21:54 UTC

[R] Memory leak with character arrays?

Hi -

When I try to read a text file into a labeled character array, the 
memory footprint of R grows to 4 gigs or more.  I've seen this behavior 
on Mac OS X and on Linux for AMD_64 and X86_64, with R versions 2.4, 2.4 
and 2.2, respectively.  So, it would seem that this is platform and R 
version independent.

The file that I'm reading contains the upstream regions of the yeast 
genome, with each upstream region labeled using a FASTA header, i.e.:

    FASTA header for gene 1
    upstream region.....
    .....
    ....
    FASTA header for gene 2
    upstream....
    ....

The script I use (code below) opens the file, looks for a FASTA header, 
and parses the header for the gene name.  It then reads the following 
lines, which contain the upstream region, and adds that text as an item 
in the character array, using the gene name as the name of the item.  It 
then continues on to the following genes.

Each upstream region (the text to be added) is 550 bases (characters) 
long.  With ~6000 genes in the file I'm reading, this would be 550 * 
6000 = 3.3 million characters, or about 3.3 Megs at one byte per ascii 
char (roughly 26 megabits).

I realize that the character arrays/vectors will have a higher memory 
stamp b/c they are a named array and most likely aren't storing the text 
as ascii, but 4 gigs and up seems a bit excessive.  Or is it?
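
As a rough check on the estimate above, the size of the finished object can be measured directly. The sketch below builds a named character vector of the same shape as the one described (6000 random 550-character strings) and reports its size with object.size(). The gene names and the random sequences are made up for illustration; only the counts come from the post.

```r
n.genes    <- 6000   # approximate number of genes, from the post
region.len <- 550    # characters per upstream region, from the post

# Raw character payload, at one byte per ASCII character.
raw.bytes <- n.genes * region.len

# Build a stand-in for the real data: random DNA strings (hypothetical).
set.seed(1)
make.seq <- function(n) {
  paste(sample(c("A", "C", "G", "T"), n, replace = TRUE), collapse = "")
}
upstream <- vapply(seq_len(n.genes), function(i) make.seq(region.len),
                   character(1))
names(upstream) <- sprintf("GENE%04d", seq_len(n.genes))  # made-up names

# object.size() estimates the memory used by the vector, its strings,
# and its names attribute.
measured.bytes <- as.numeric(object.size(upstream))

cat("raw characters:", raw.bytes / 1e6, "MB\n")
cat("object.size   :", measured.bytes / 1e6, "MB\n")
```

The measured figure includes per-string overhead, so it should come out larger than the raw character count, but it says nothing about temporary copies made while the vector is being built.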

For example, this is the output of top at the point at which R has 
processed around 5000 genes:

      PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND 
     4969 waltman   18   0 *6746m 3.4g*  920 D  2.7 88.2  19:09.19 R    

Is this expected behavior?  Can anyone recommend a less memory intensive 
way to store this data?  The relevant code that reads in the file follows:

     ....code....
         lines <- readLines( gzfile( seqs.fname ) )

         n.seqs <- 0

         upstream <- gene.names <- character()
         syn <- character( 0 )
         gene.start <- gene.end <- integer()
         gene <- seq <- ""

         for ( i in 1:length( lines ) ) {
           line <- lines[ i ]
           if ( line == "" ) next
           if ( substr( line, 1, 1 ) == ">" ) {

             if ( seq != "" && gene != "" ) upstream[ gene ] <- toupper( seq )
             splitted <- strsplit( line, "\t" )[[ 1 ]]
             splitted <- strsplit( splitted[ 1 ], ";\\ " )[[ 1 ]]
             gene <- toupper( substr( splitted[ 1 ], 2, nchar( splitted[ 1 ] ) ) )
             syn <- splitted[ 2 ]
             if ( ! is.null( syn ) &&
                  length( grep( valid.gene.regexp, gene, perl=T ) ) == 0 &&
                  length( grep( valid.gene.regexp, syn, perl=T ) ) == 1 ) gene <- syn
             else if ( length( grep( valid.gene.regexp, gene, perl=T, ignore.case=T ) ) == 0 &&
                       length( grep( valid.gene.regexp, syn, perl=T, ignore.case=T ) ) == 0 ) next
             gene.start[ gene ] <- as.integer( splitted[ 9 ] )
             gene.end[ gene ] <- as.integer( splitted[ 10 ] )
             if ( n.seqs %% 100 == 0 ) cat.new( n.seqs, gene, "|", syn,
                                                "| length=", nchar( seq ),
                                                gene.end[gene]-gene.start[gene]+1, "\n" )
             if ( ! is.na( syn ) && syn != "" ) gene.names[ gene ] <- syn
             else gene.names[ gene ] <- toupper( gene )
             n.seqs <- n.seqs + 1
             seq <- ""
           } else {
             seq <- paste( seq, line, sep="" )
           }
         }
         if ( seq != "" && gene != "" ) upstream[ gene ] <- toupper( seq )

     ....code....

Thanks,

Peter Waltman

jim holtman

2007-Jan-18 01:52 UTC

What does the FASTA header look like?  You are using 'gene' to access
things in the array, and if (for example) 'gene' is a character string of
length 10, then for every element of the vectors that you are using (I
count about 4-5 that use this index) you are going to have at least
550 * 6000 * 5 * 10 more bytes (165MB) used just to store the names of the
elements.

You are also dynamically increasing the size of the vectors which means a
lot of copying of the objects and therefore using a lot of memory that is
probably fragmenting your memory.
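
To make the point about growing vectors concrete, here is a small sketch of the two assignment patterns: extending an empty vector one name at a time (as the posted loop does) versus allocating the full length once and filling by integer position. The names and values are made up, and the sketch only shows that both patterns give the same result; it does not establish that growth alone accounts for the 4 GB footprint.

```r
n    <- 2000
keys <- sprintf("GENE%04d", seq_len(n))  # hypothetical gene names
vals <- rep("ACGT", n)                   # hypothetical sequence data

# Pattern from the posted loop: start empty and extend by name.
# Each new name makes the vector one element longer, so R has to
# build a longer vector (and names attribute) again and again.
grown <- character()
for (k in seq_len(n)) grown[keys[k]] <- vals[k]

# Alternative: allocate the full length once, set the names once,
# then fill by integer position.
prealloc <- character(n)
names(prealloc) <- keys
for (k in seq_len(n)) prealloc[k] <- vals[k]

# Both patterns end with the same named character vector.
same.result <- identical(grown, prealloc)
```

Wrapping each loop in system.time() is one way to compare the cost of the two patterns on a given machine.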

So if you look at all these vectors, how many of them will contain data?
What you might want to do is to preprocess the data (pass 1) to find out how
many 'gene's there are and then create a factor from this.  You can then
statically allocate the vectors and use the numeric value of the factor to
index into the vector.


So you might have fragmentation (that seems to be what your 'top' output
is showing).  It looks like a two-pass process: 1) determine how many genes
you have and statically allocate, 2) go through the data and use the
'factor' to index into the vectors.
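
The two-pass idea above might look something like the sketch below. It is not the original poster's code: the sample lines, the header layout (name, then "; ", then a synonym) and the regular expression used to pull out the gene name are all assumptions made for illustration, and the validity checks from the posted script are left out.

```r
# Hypothetical input, standing in for readLines( gzfile( seqs.fname ) ).
lines <- c(">YAL001C; TFC3", "acgt", "ttga", "",
           ">YAL002W; VPS8", "ggcc")
lines <- lines[lines != ""]

# Pass 1: find the headers, so the number of genes is known up front.
is.header  <- substr(lines, 1, 1) == ">"
gene.names <- toupper(sub("^>([^;\t]+).*$", "\\1", lines[is.header]))
n.genes    <- length(gene.names)

# Record number of every line: 1 for lines under the first header, etc.
record <- cumsum(is.header)

# Pass 2: group the sequence lines by record, using a factor whose
# levels cover every gene (so genes with no sequence lines stay in),
# then paste each group into a single string.
groups   <- factor(record[!is.header], levels = seq_len(n.genes))
parts    <- split(lines[!is.header], groups)
upstream <- toupper(vapply(parts, paste, character(1), collapse = ""))
names(upstream) <- gene.names

print(upstream)
```

Because the result vector is created once at its final length, nothing is grown inside a loop, and the names are attached in a single assignment.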


-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem you are trying to solve?


Peter Waltman

2007-Jan-18 03:23 UTC

[This email is either empty or too large to be displayed at this time]

Peter Waltman

2007-Jan-18 07:36 UTC

[This email is either empty or too large to be displayed at this time]

Jean lobry

2007-Jan-18 13:00 UTC

Dear Peter,
>The file that I'm reading contains the upstream regions of the yeast
>genome, with each upstream region labeled using a FASTA header, i.e.:
>
>     FASTA header for gene 1
>     upstream region.....
>     .....
>     ....
>     FASTA header for gene 2
>     upstream....
>     ....
You may want to have a look at the read.fasta() function in the
seqinr package. There is an example on page 16 of this document:
http://pbil.univ-lyon1.fr/software/SeqinR/seqinr_1_0-6.pdf
about importing the contents of a FASTA file with 21,161 sequences
from Arabidopsis thaliana into an object that takes about 15 Mb of RAM.
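
A minimal sketch of that suggestion, assuming the seqinr package is installed, could look like this. The small FASTA file is written to a temporary location purely for illustration, and the gene names in it are made up; with the real data, the path to the (uncompressed) upstream-region file would be used instead.

```r
# Write a tiny, made-up FASTA file so the example is self-contained.
fasta.file <- tempfile(fileext = ".fasta")
writeLines(c(">YAL001C", "acgt", "ttga",
             ">YAL002W", "ggcc"), fasta.file)

# Only run the seqinr part if the package is available.
if (requireNamespace("seqinr", quietly = TRUE)) {
  # as.string = TRUE returns each sequence as one string rather than
  # as a vector of single characters.
  seqs <- seqinr::read.fasta(fasta.file, seqtype = "DNA",
                             as.string = TRUE)

  # Collapse the list into a named character vector of upper-case
  # sequences, similar in shape to 'upstream' in the posted script.
  upstream <- unlist(lapply(seqs,
                            function(s) toupper(paste(s, collapse = ""))))
  print(upstream)
}
```

The list returned by read.fasta() is named by the first word of each FASTA header, so the resulting vector can be indexed by gene name.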

HTH,

-- 
Jean R. Lobry            (lobry at biomserv.univ-lyon1.fr)
Laboratoire BBE-CNRS-UMR-5558, Univ. C. Bernard - LYON I,
43 Bd 11/11/1918, F-69622 VILLEURBANNE CEDEX, FRANCE
allo  : +33 472 43 27 56     fax    : +33 472 43 13 88
http://pbil.univ-lyon1.fr/members/lobry/
