On 01/16/2015 10:21 PM, Mike Miller wrote:

> First, a very easy question: What is the difference between using
> what="character" and what=character() in scan()? What is the reason for
> the character() syntax?
>
> I am working with some character vectors that are up to about 27.5 million
> elements long. The elements are always unique. Specifically, these are
> names of genetic markers. This is how much memory those names take up:
>
>> snps <- scan("SNPs.txt", what=character())
> Read 27446736 items
>> object.size(snps)
> 1756363648 bytes
>> object.size(snps)/length(snps)
> 63.9917128215173 bytes
>
> As you can see, that's about 1.76 GB of memory for the vector at an
> average of 64 bytes per element. The longest string is only 14 bytes,
> though. The file takes up 313 MB.
>
> Using 64 bytes per element instead of 14 bytes per element is costing me
> a total of 1,372,336,800 bytes. In a different example where the longest
> string is 4 characters, the elements each use 8 bytes. So it looks like
> I'm stuck with either 8 bytes or 64 bytes. Is that true? There is no way
> to modify that?

Hi Mike --
R represents atomic vector types as so-called S-expressions, which in
addition to the actual data carry bookkeeping information, e.g., whether
the object has been referenced by one or more symbols. You can get a sense
of this with
> x <- 1:5
> .Internal(inspect(x))
@4c732940 13 INTSXP g0c3 [NAM(1)] (len=5, tl=0) 1,2,3,4,5
where the number after @ is the memory location, INTSXP indicates that the type
of data is an integer, etc. So a vector requires memory for the S-expression,
and for the actual data.
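A rough way to see this fixed header cost, using only object.size() (a
sketch; exact byte counts vary with platform and R version, so the numbers
in the comments are typical 64-bit values, not guarantees):

```r
## Even a zero-length vector pays the fixed S-expression header cost;
## object.size() reports header + data + allocator padding.
overhead <- as.numeric(object.size(integer(0)))        # header only, typically ~48 bytes
total5   <- as.numeric(object.size(c(1L, 2L, 3L, 4L, 5L)))  # header + 5 * 4 data bytes
payload  <- total5 - overhead                          # at least the 20 data bytes
```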
A character vector is represented by an S-expression for the vector
itself, an S-expression for each element of the vector, and of course the
data itself:
> y <- c("a", "a", "b")
> .Internal(inspect(y))
@4ce72090 16 STRSXP g0c3 [NAM(1)] (len=3, tl=0)
  @137ccd8 09 CHARSXP g0c1 [gp=0x61] [ASCII] [cached] "a"
  @137ccd8 09 CHARSXP g0c1 [gp=0x61] [ASCII] [cached] "a"
  @15a6698 09 CHARSXP g0c1 [gp=0x61] [ASCII] [cached] "b"
Note that the two "a" elements share a single cached CHARSXP (same address):
R keeps a global cache of character strings.
The per-element S-expression overhead is amortized by long (in the
nchar() sense) or re-used strings, but neither applies to your data.
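The effect of the cache is easy to demonstrate (a sketch; exact byte
counts are platform-dependent):

```r
## 100,000 copies of one string share a single cached CHARSXP, so the
## vector costs little more than 100,000 pointers. 100,000 unique strings
## each pay the full per-element CHARSXP price -- roughly the ~64 bytes
## per element you observed.
same <- rep("rs12345", 1e5)
uniq <- paste0("rs", seq_len(1e5))
size_same <- as.numeric(object.size(same))  # ~8 bytes per element + one string
size_uniq <- as.numeric(object.size(uniq))  # much larger, despite similar nchar()
```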
There is no way around this in base R. There are general-purpose solutions
like the data.table package, or keeping your large data in a database
(like SQLite) that you interface with from within R using, e.g., sqldf or
dplyr, doing as much data reduction as possible in the database (and out
of R). In your particular case, BStringSet() from the Bioconductor
Biostrings package might be relevant:
http://bioconductor.org/packages/release/bioc/html/Biostrings.html
This will consume memory more along the lines of 1 byte per character + 1 byte
per string, and is of particular relevance because you are likely doing other
genetic operations for which the Bioconductor project has relevant packages (see
especially the GenomicRanges package).
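A minimal sketch of the BStringSet approach (this assumes Biostrings is
installed from Bioconductor, and uses made-up toy identifiers rather than
your file):

```r
## A BStringSet stores all characters in one shared block rather than one
## CHARSXP per element, so per-string S-expression overhead disappears.
if (requireNamespace("Biostrings", quietly = TRUE)) {
  ids  <- paste0("rs", sample.int(1e6, 1e4))  # toy stand-in for SNP names
  bset <- Biostrings::BStringSet(ids)
  ## round-trips back to an ordinary character vector when needed
  stopifnot(identical(as.character(bset), ids))
}
```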
If your work is not particularly domain-specific, data.table would be a good bet
(it also has an implementation for working with overlapping ranges, which is a
very common task with SNPs). A lot of SNP data management is really relational,
for which the SQL representation (and dplyr, for me) is the obvious choice.
Bioconductor would be the choice if there is to be extensive domain-specific
work. I am involved in the Bioconductor project, so not exactly impartial.
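The integer re-encoding you describe below can be sketched in base R
(encode_rs/decode_rs are hypothetical helper names, not existing
functions):

```r
## Markers of the form "rs<digits>" with at most 9 digits fit in a 4-byte
## integer, versus ~64 bytes per element as unique strings.
encode_rs <- function(x) {
  stopifnot(all(grepl("^rs[0-9]{1,9}$", x)))  # only plain rs-numbers qualify
  as.integer(sub("^rs", "", x))
}
decode_rs <- function(i) paste0("rs", i)

ids <- c("rs12345", "rs7", "rs999999999")
enc <- encode_rs(ids)
stopifnot(identical(decode_rs(enc), ids))     # lossless round trip
```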
Martin
>
> By the way...
>
> It turns out that 99.72% of those character strings are of the form
> paste("rs", Int) where Int is an integer of no more than 9 digits. So if
> I use only those markers, drop the "rs" off, and load them as integers,
> I see a huge improvement:
>
>> snps <- scan("SNPs_rs.txt", what=integer())
> Read 27369706 items
>> object.size(snps)
> 109478864 bytes
>> object.size(snps)/length(snps)
> 4.00000146146985 bytes
>
> That saves 93.8% of the memory by dropping 0.28% of the markers and
> encoding as integers instead of strings. I might end up doing this by
> encoding the other characters as negative integers.
>
> Mike
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793