Hi all,
In short:
I'm running ddply on an admittedly (somewhat) large data.frame (not
that large). It runs fine until it gets to the "collating" part, where
all the subsets of my data.frame have been summarized and are being
reassembled into the final summary data.frame (sorry, I don't know the
correct plyr terminology). During collation, my R workspace RAM usage
goes from about 1.5 GB up to 20 GB, until I kill it.
Running a similar piece of code that iterates manually without ddply,
using a combo of lapply and do.call(rbind, ...), uses considerably
less RAM (it tops out at about 8 GB).
How can I use ddply more efficiently?
Longer:
Here's more info:
* The data.frame itself is ~15.8 MB when loaded.
* ~400,000 rows, 8 columns
It looks like so:
   exon.start exon.width exon.width.unique exon.anno counts symbol transcript  chr
1        4225        468                 0       utr      0 WASH5P     WASH5P chr1
2        4833         69                 0       utr      1 WASH5P     WASH5P chr1
3        5659        152                38       utr      1 WASH5P     WASH5P chr1
4        6470        159                 0       utr      0 WASH5P     WASH5P chr1
5        6721        198                 0       utr      0 WASH5P     WASH5P chr1
6        7096        136                 0       utr      0 WASH5P     WASH5P chr1
7        7469        137                 0       utr      0 WASH5P     WASH5P chr1
8        7778        147                 0       utr      0 WASH5P     WASH5P chr1
9        8131         99                 0       utr      0 WASH5P     WASH5P chr1
10      14601        154                 0       utr      0 WASH5P     WASH5P chr1
11      19184         50                 0       utr      0 WASH5P     WASH5P chr1
12       4693        140                36    intron      2 WASH5P     WASH5P chr1
13       4902        757                36    intron      1 WASH5P     WASH5P chr1
14       5811        659               144    intron     47 WASH5P     WASH5P chr1
15       6629         92                21    intron      1 WASH5P     WASH5P chr1
16       6919        177                 0    intron      0 WASH5P     WASH5P chr1
17       7232        237                35    intron      2 WASH5P     WASH5P chr1
18       7606        172                 0    intron      0 WASH5P     WASH5P chr1
19       7925        206                 0    intron      0 WASH5P     WASH5P chr1
20       8230       6371               109    intron     67 WASH5P     WASH5P chr1
21      14755       4429                55    intron     12 WASH5P     WASH5P chr1
...
I'm "ply"-ing over the "transcript" column and the
function transforms
each such subset of the data.frame into a new data.frame that is just
1 row / transcript that basically has the sum of the "counts" for each
transcript.
The code would look something like this (`summaries` is the data.frame
I'm referring to):
rpkm <- ddply(summaries, .(transcript), function(df) {
  data.frame(symbol=df$symbol[1], counts=sum(df$counts))
})
(It actually calculates 2 more columns that are returned in the
data.frame, but I'm not sure that's really important here).
To test some things out, I've written another function to manually
iterate/create subsets of my data.frame to summarize.
I'm using sqldf to dump the data.frame into a db, then I lapply over
subsets of the db (`where transcript=x`) to summarize each subset of my
data into a list of single-row data.frames (like ddply is doing), and
finish with a `do.call(rbind, the.dfs)` on this list.
This returns the exact same result ddply would return, and by the time
`do.call` finishes, my RAM usage hits about 8 GB.
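For concreteness, the manual version looks roughly like this -- a
simplified sketch (it re-runs the sqldf query for each transcript
rather than dumping the table into the db just once, but the shape of
the computation is the same):

library(sqldf)

## Sketch only -- `summaries` is the data.frame described above.
## Build one single-row data.frame per transcript, then rbind them all.
transcripts <- unique(as.character(summaries$transcript))
the.dfs <- lapply(transcripts, function(x) {
  df <- sqldf(sprintf("select * from summaries where transcript = '%s'", x))
  data.frame(transcript=x, symbol=df$symbol[1], counts=sum(df$counts))
})
rpkm.manual <- do.call(rbind, the.dfs)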
So, what am I doing wrong with ddply that makes the RAM usage in its
last step (the "collation" -- the equivalent of my final
`do.call(rbind, my.dfs)`) more than 12 GB higher?
Thanks,
-steve
--
Steve Lianoglou
Graduate Student: Computational Systems Biology
| Memorial Sloan-Kettering Cancer Center
| Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact
I don't know about that, but try this:
install.packages("data.table",
repos="http://R-Forge.R-project.org")
require(data.table)
summaries = data.table(summaries)
summaries[,sum(counts),by=symbol]
Please let us know if that returns the correct result, and if its
memory/speed is ok?
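If you need the exact shape of your ddply result (one row per
transcript, carrying along the first symbol), I'd expect something
like this to work too, though I haven't tested it on your data:

summaries[, list(symbol=symbol[1], counts=sum(counts)), by=transcript]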
Matthew
"Steve Lianoglou" <mailinglist.honeypot at gmail.com> wrote in
message
news:w2kbbdc7ed01004290606lc425e47cs95b36f6bf0ab3d at
mail.gmail.com...> Hi all,
>
> In short:
>
> I'm running ddply on an admittedly (somehow) large data.frame (not
> that large). It runs fine until it finishes and gets to the
> "collating" part where all subsets of my data.frame have been
> summarized and they are being reassembled into the final summary
> data.frame (sorry, don't know the correct plyr terminology). During
> collation, my R workspace RAM usage goes from about 1.5 GB upto 20GB
> until I kill it.
>
> Running a similar piece of code that iterates manually w/o ddply by
> using a combo of lapply and a do.call(rbind, ...) uses considerably
> less ram (tops out at about 8GB).
>
> How can I use ddply more efficiently?
>
> Longer:
>
> Here's more info:
>
> * The data.frame itself ~ 15.8 MB when loaded.
> * ~ 400,000 rows, 8 columns
>
> It looks like so:
>
> exon.start exon.width exon.width.unique exon.anno counts
> symbol transcript chr
> 1 4225 468 0 utr 0
> WASH5P WASH5P chr1
> 2 4833 69 0 utr 1
> WASH5P WASH5P chr1
> 3 5659 152 38 utr 1
> WASH5P WASH5P chr1
> 4 6470 159 0 utr 0
> WASH5P WASH5P chr1
> 5 6721 198 0 utr 0
> WASH5P WASH5P chr1
> 6 7096 136 0 utr 0
> WASH5P WASH5P chr1
> 7 7469 137 0 utr 0
> WASH5P WASH5P chr1
> 8 7778 147 0 utr 0
> WASH5P WASH5P chr1
> 9 8131 99 0 utr 0
> WASH5P WASH5P chr1
> 10 14601 154 0 utr 0
> WASH5P WASH5P chr1
> 11 19184 50 0 utr 0
> WASH5P WASH5P chr1
> 12 4693 140 36 intron 2
> WASH5P WASH5P chr1
> 13 4902 757 36 intron 1
> WASH5P WASH5P chr1
> 14 5811 659 144 intron 47
> WASH5P WASH5P chr1
> 15 6629 92 21 intron 1
> WASH5P WASH5P chr1
> 16 6919 177 0 intron 0
> WASH5P WASH5P chr1
> 17 7232 237 35 intron 2
> WASH5P WASH5P chr1
> 18 7606 172 0 intron 0
> WASH5P WASH5P chr1
> 19 7925 206 0 intron 0
> WASH5P WASH5P chr1
> 20 8230 6371 109 intron 67
> WASH5P WASH5P chr1
> 21 14755 4429 55 intron 12
> WASH5P WASH5P chr1
> ...
>
> I'm "ply"-ing over the "transcript" column and the
function transforms
> each such subset of the data.frame into a new data.frame that is just
> 1 row / transcript that basically has the sum of the "counts" for
each
> transcript.
>
> The code would look something like this (`summaries` is the data.frame
> I'm referring to):
>
> rpkm <- ddply(summaries, .(transcript), function(df) {
> data.frame(symbol=df$symbol[1], counts=sum(df$counts))
> }
>
> (It actually calculates 2 more columns that are returned in the
> data.frame, but I'm not sure that's really important here).
>
> To test some things out, I've written another function to manually
> iterate/create subsets of my data.frame to summarize.
>
> I'm using sqldf to dump the data.frame into a db, then I lapply over
> subsets of the db `where transcript=x` to summarize each subset of my
> data into a list of single-row data.frames (like ddply is doing), and
> finish with a `do.call(rbind, the.dfs)` o nthis list.
>
> This returns the same exact result ddply would return, and by the time
> `do.call` finishes, my RAM usage hits about 8gb.
>
> So, what am I doing wrong with ddply that makes the difference ram
> usage in the last step ("collation" -- the equivalent of my final
> `do.call(rbind, my.dfs)` be more than 12GB?
>
> Thanks,
> -steve
>
> --
> Steve Lianoglou
> Graduate Student: Computational Systems Biology
> | Memorial Sloan-Kettering Cancer Center
> | Weill Medical College of Cornell University
> Contact Info: http://cbio.mskcc.org/~lianos/contact
>