What does the FASTA header look like? You are using 'gene' to access things
in the array, and if (for example) 'gene' is a string of 10 characters, then
for every element of the vectors that you are using (I count about 4-5 that
use this index) you are going to have at least 550 * 6000 * 5 * 10 more
bytes (165MB) used just to store the names of the elements.
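Rather than relying on an estimate, you can measure what the names cost with object.size(). This is only a rough sketch: the 10-character gene names and the random 550-base sequences below are made up for illustration and are not your data.

```r
# Hypothetical data: 6000 sequences of 550 bases, with 10-character names
set.seed(1)
n.genes <- 6000
seqs <- replicate(
  n.genes,
  paste(sample(c("A", "C", "G", "T"), 550, replace = TRUE), collapse = "")
)
ids <- sprintf("GENE%06d", seq_len(n.genes))   # made-up names, 10 characters each

unnamed <- seqs
named <- seqs
names(named) <- ids

# Compare the footprint of the same data with and without names
print(object.size(unnamed), units = "Mb")
print(object.size(named), units = "Mb")
```

Running the same comparison on one of your real vectors would tell you how much of the footprint is the names and how much is something else.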
You are also dynamically increasing the size of the vectors which means a
lot of copying of the objects and therefore using a lot of memory that is
probably fragmenting your memory.
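To see the effect of growing a vector one element at a time, here is a small comparison. It is a sketch under assumptions: the gene count of 6000 comes from your message, but the names and the "ACGT" placeholder value are invented, and the timings will vary by machine and R version.

```r
n <- 6000   # assumed number of genes

# Growing by name: every new name extends the vector by one element
grow <- function(n) {
  v <- character()
  for (i in seq_len(n)) v[sprintf("GENE%06d", i)] <- "ACGT"
  v
}

# Preallocated: the vector is created once and filled by position
prealloc <- function(n) {
  v <- character(n)
  for (i in seq_len(n)) v[i] <- "ACGT"
  names(v) <- sprintf("GENE%06d", seq_len(n))   # names attached once
  v
}

system.time(a <- grow(n))
system.time(b <- prealloc(n))
identical(a, b)   # both approaches build the same named vector
```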
So if you look at all these vectors, how many of them will contain data?
What you might want to do is to preprocess the data (pass 1) to find out how
many 'gene's there are and then create a factor from this. You can then
statically allocate the vectors and use the numeric value of the factor to
index into the vector.
So you might have fragmentation (that seems to be what your 'top' output is
showing). So it looks like a two-pass process: 1) determine how many genes
you have and statically allocate, 2) go through the data and use the
'factor' to index into the vectors.
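A minimal sketch of the two passes might look like the following. Since I have not seen your real headers, the toy input and the rule for extracting the gene name (everything after ">" up to the first semicolon, tab or space) are assumptions, and pass 2 is written with split() instead of an explicit loop.

```r
# Toy input standing in for readLines(gzfile(seqs.fname)); header format is assumed
lines <- c(">YAL001C; TFC3", "acgt", "ttga", "",
           ">YAL002W; VPS8", "ggcc", "aatt")

# Pass 1: find the headers, derive the gene names, build the factor
is.header <- substr(lines, 1, 1) == ">"
genes <- toupper(sub("^>([^;\t ]+).*$", "\\1", lines[is.header]))
gene.f <- factor(genes, levels = unique(genes))

# Statically allocate the result, one slot per distinct gene
upstream <- character(nlevels(gene.f))

# Pass 2: group the sequence lines by record and fill by integer index
record <- cumsum(is.header)                 # record number of every line
keep <- !is.header & lines != "" & record > 0
chunks <- split(lines[keep], record[keep])
idx <- as.integer(gene.f)[as.integer(names(chunks))]
upstream[idx] <- toupper(vapply(chunks, paste, character(1), collapse = ""))
names(upstream) <- levels(gene.f)           # names attached once, at the end

upstream
```

The same integer index could be used to fill gene.start, gene.end and gene.names, so none of the vectors has to grow inside the loop.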
On 1/17/07, Peter Waltman <waltman@cs.nyu.edu> wrote:
>
> Hi -
>
> When I'm trying to read in a text file into a labeled character array,
> the memory stamp/footprint of R will exceed 4 gigs or more. I've seen
> this behavior on Mac OS X, Linux for AMD_64 and X86_64, and the R
> versions are 2.4, 2.4 and 2.2, respectively. So, it would seem that
> this is platform and R version independent.
>
> The file that I'm reading contains the upstream regions of the yeast
> genome, with each upstream region labeled using a FASTA header, i.e.:
>
> FASTA header for gene 1
> upstream region.....
> .....
> ....
> FASTA header for gene 2
> upstream....
> ....
>
> The script I use - code below - opens the file, parses for a FASTA
> header, and then parses the header for the gene name. Once this is
> done, it reads the following lines which contain the upstream region,
> and then adds it as an item to the character array, using the gene name
> as the name of the item it adds. And then continues on to the following
> genes.
>
> Each upstream region (the text to be added) is 550 bases (characters)
> long. With ~6000 genes in the file I'm reading in, this would be 550 *
> 6000 * 8 ~= 25 Megs (if we're using ascii chars).
>
> I realize that the character arrays/vectors will have a higher memory
> stamp b/c they are a named array and most likely aren't storing the
> text as ascii, but 4 gigs and up seems a bit excessive. Or is it?
>
> For an example, this is the output of top, at the point which R has
> processed around 5000 genes:
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 4969 waltman 18 0 *6746m 3.4g* 920 D 2.7 88.2 19:09.19 R
>
> Is this expected behavior? Can anyone recommend a less memory intensive
> way to store this data? The relevant code that reads in the file follows:
>
> ....code....
> lines <- readLines( gzfile( seqs.fname ) )
>
> n.seqs <- 0
>
> upstream <- gene.names <- character()
> syn <- character( 0 )
> gene.start <- gene.end <- integer()
> gene <- seq <- ""
>
>
> for ( i in 1:length( lines ) ) {
>   line <- lines[ i ]
>   if ( line == "" ) next
>   if ( substr( line, 1, 1 ) == ">" ) {
>
>     if ( seq != "" && gene != "" ) upstream[ gene ] <- toupper( seq )
>     splitted <- strsplit( line, "\t" )[[ 1 ]]
>     splitted <- strsplit( splitted[ 1 ], ";\\ " )[[ 1 ]]
>     gene <- toupper( substr( splitted[ 1 ], 2, nchar( splitted[ 1 ] ) ) )
>     syn <- splitted[ 2 ]
>     if ( ! is.null( syn ) &&
>          length( grep( valid.gene.regexp, gene, perl=T ) ) == 0 &&
>          length( grep( valid.gene.regexp, syn, perl=T ) ) == 1
>        ) gene <- syn
>     else if ( length( grep( valid.gene.regexp, gene, perl=T,
>                             ignore.case=T ) ) == 0 &&
>               length( grep( valid.gene.regexp, syn, perl=T,
>                             ignore.case=T ) ) == 0 ) next
>     gene.start[ gene ] <- as.integer( splitted[ 9 ] )
>     gene.end[ gene ] <- as.integer( splitted[ 10 ] )
>     if ( n.seqs %% 100 == 0 ) cat.new( n.seqs, gene, "|", syn,
>                                        "| length=", nchar( seq ),
>                                        gene.end[gene]-gene.start[gene]+1,"\n" )
>     if ( ! is.na( syn ) && syn != "" ) gene.names[ gene ] <- syn
>     else gene.names[ gene ] <- toupper( gene )
>     n.seqs <- n.seqs + 1
>     seq <- ""
>   } else {
>     seq <- paste( seq, line, sep="" )
>   }
> }
> if ( seq != "" && gene != "" ) upstream[ gene ] <- toupper( seq )
>
> ....code....
>
> Thanks,
>
> Peter Waltman
>
> ______________________________________________
> R-help@stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
--
Jim Holtman
Cincinnati, OH
+1 513 646 9390
What is the problem you are trying to solve?