On 01/16/2015 10:21 PM, Mike Miller wrote:

> First, a very easy question: What is the difference between using
> what="character" and what=character() in scan()? What is the reason for
> the character() syntax?
>
> I am working with some character vectors that are up to about 27.5 million
> elements long. The elements are always unique. Specifically, these are
> names of genetic markers. This is how much memory those names take up:
>
>> snps <- scan("SNPs.txt", what=character())
> Read 27446736 items
>> object.size(snps)
> 1756363648 bytes
>> object.size(snps)/length(snps)
> 63.9917128215173 bytes
>
> As you can see, that's about 1.76 GB of memory for the vector at an
> average of 64 bytes per element. The longest string is only 14 bytes,
> though. The file takes up 313 MB.
>
> Using 64 bytes per element instead of 14 bytes per element is costing me
> a total of 1,372,336,800 bytes. In a different example where the longest
> string is 4 characters, the elements each use 8 bytes. So it looks like
> I'm stuck with either 8 bytes or 64 bytes. Is that true? There is no way
> to modify that?

Hi Mike --
R represents atomic vector types as so-called S-expressions, which in
addition to the actual data carry bookkeeping information, e.g., whether
the object has been referenced by one or more symbols. You can get a sense
of this with
> x <- 1:5
> .Internal(inspect(x))
@4c732940 13 INTSXP g0c3 [NAM(1)] (len=5, tl=0) 1,2,3,4,5
where the number after @ is the memory location, INTSXP indicates that the type
of data is an integer, etc. So a vector requires memory for the S-expression,
and for the actual data.
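A rough way to see this fixed header cost, using only object.size() (a
sketch; exact byte counts vary with platform and R version, so the numbers
in the comments are typical 64-bit values, not guarantees):

```r
## Even a zero-length vector pays the fixed S-expression header cost;
## object.size() reports header + data + allocator padding.
overhead <- as.numeric(object.size(integer(0)))        # header only, typically ~48 bytes
total5   <- as.numeric(object.size(c(1L, 2L, 3L, 4L, 5L)))  # header + 5 * 4 data bytes
payload  <- total5 - overhead                          # at least the 20 data bytes
```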
A character vector is represented by an S-expression for the vector
itself, an S-expression for each element of the vector, and of course the
data itself:
> y <- c("a", "a", "b")
> .Internal(inspect(y))
@4ce72090 16 STRSXP g0c3 [NAM(1)] (len=3, tl=0)
  @137ccd8 09 CHARSXP g0c1 [gp=0x61] [ASCII] [cached] "a"
  @137ccd8 09 CHARSXP g0c1 [gp=0x61] [ASCII] [cached] "a"
  @15a6698 09 CHARSXP g0c1 [gp=0x61] [ASCII] [cached] "b"
Note that the two "a" elements share a single cached CHARSXP (same address):
R keeps a global cache of character strings.
The per-element S-expression overhead is amortized by long (in the
nchar() sense) or re-used strings, but neither applies to your data.
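The effect of the cache is easy to demonstrate (a sketch; exact byte
counts are platform-dependent):

```r
## 100,000 copies of one string share a single cached CHARSXP, so the
## vector costs little more than 100,000 pointers. 100,000 unique strings
## each pay the full per-element CHARSXP price -- roughly the ~64 bytes
## per element you observed.
same <- rep("rs12345", 1e5)
uniq <- paste0("rs", seq_len(1e5))
size_same <- as.numeric(object.size(same))  # ~8 bytes per element + one string
size_uniq <- as.numeric(object.size(uniq))  # much larger, despite similar nchar()
```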
There is no way around this in base R. There are general-purpose solutions
like the data.table package, or keeping your large data in a database
(like SQLite) that you interface with from within R using, e.g., sqldf or
dplyr, doing as much data reduction as possible in the database (and out
of R). In your particular case, BStringSet() from the Bioconductor
Biostrings package might be relevant:
http://bioconductor.org/packages/release/bioc/html/Biostrings.html
This will consume memory more along the lines of 1 byte per character + 1 byte
per string, and is of particular relevance because you are likely doing other
genetic operations for which the Bioconductor project has relevant packages (see
especially the GenomicRanges package).
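A minimal sketch of the BStringSet approach (this assumes Biostrings is
installed from Bioconductor, and uses made-up toy identifiers rather than
your file):

```r
## A BStringSet stores all characters in one shared block rather than one
## CHARSXP per element, so per-string S-expression overhead disappears.
if (requireNamespace("Biostrings", quietly = TRUE)) {
  ids  <- paste0("rs", sample.int(1e6, 1e4))  # toy stand-in for SNP names
  bset <- Biostrings::BStringSet(ids)
  ## round-trips back to an ordinary character vector when needed
  stopifnot(identical(as.character(bset), ids))
}
```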
If your work is not particularly domain-specific, data.table would be a good bet
(it also has an implementation for working with overlapping ranges, which is a
very common task with SNPs). A lot of SNP data management is really relational,
for which the SQL representation (and dplyr, for me) is the obvious choice.
Bioconductor would be the choice if there is to be extensive domain-specific
work. I am involved in the Bioconductor project, so not exactly impartial.
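The integer re-encoding you describe below can be sketched in base R
(encode_rs/decode_rs are hypothetical helper names, not existing
functions):

```r
## Markers of the form "rs<digits>" with at most 9 digits fit in a 4-byte
## integer, versus ~64 bytes per element as unique strings.
encode_rs <- function(x) {
  stopifnot(all(grepl("^rs[0-9]{1,9}$", x)))  # only plain rs-numbers qualify
  as.integer(sub("^rs", "", x))
}
decode_rs <- function(i) paste0("rs", i)

ids <- c("rs12345", "rs7", "rs999999999")
enc <- encode_rs(ids)
stopifnot(identical(decode_rs(enc), ids))     # lossless round trip
```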
Martin
>
> By the way...
>
> It turns out that 99.72% of those character strings are of the form
> paste("rs", Int) where Int is an integer of no more than 9 digits. So if
> I use only those markers, drop the "rs" off, and load them as integers,
> I see a huge improvement:
>
>> snps <- scan("SNPs_rs.txt", what=integer())
> Read 27369706 items
>> object.size(snps)
> 109478864 bytes
>> object.size(snps)/length(snps)
> 4.00000146146985 bytes
>
> That saves 93.8% of the memory by dropping 0.28% of the markers and
> encoding as integers instead of strings. I might end up doing this by
> encoding the other characters as negative integers.
>
> Mike
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793