Hi all,
In short:
I'm running ddply on an admittedly (somewhat) large data.frame (not
that large). It runs fine until it gets to the "collating" part, where
all the subsets of my data.frame have been summarized and are being
reassembled into the final summary data.frame (sorry, I don't know the
correct plyr terminology). During collation, my R workspace RAM usage
goes from about 1.5 GB up to 20 GB, until I kill it.
Running a similar piece of code that iterates manually without ddply,
using a combo of lapply and do.call(rbind, ...), uses considerably
less RAM (it tops out at about 8 GB).
How can I use ddply more efficiently?
Longer:
Here's more info:
* The data.frame itself is ~15.8 MB when loaded.
* ~400,000 rows, 8 columns
It looks like so:
   exon.start exon.width exon.width.unique exon.anno counts symbol transcript  chr
1        4225        468                 0       utr      0 WASH5P     WASH5P chr1
2        4833         69                 0       utr      1 WASH5P     WASH5P chr1
3        5659        152                38       utr      1 WASH5P     WASH5P chr1
4        6470        159                 0       utr      0 WASH5P     WASH5P chr1
5        6721        198                 0       utr      0 WASH5P     WASH5P chr1
6        7096        136                 0       utr      0 WASH5P     WASH5P chr1
7        7469        137                 0       utr      0 WASH5P     WASH5P chr1
8        7778        147                 0       utr      0 WASH5P     WASH5P chr1
9        8131         99                 0       utr      0 WASH5P     WASH5P chr1
10      14601        154                 0       utr      0 WASH5P     WASH5P chr1
11      19184         50                 0       utr      0 WASH5P     WASH5P chr1
12       4693        140                36    intron      2 WASH5P     WASH5P chr1
13       4902        757                36    intron      1 WASH5P     WASH5P chr1
14       5811        659               144    intron     47 WASH5P     WASH5P chr1
15       6629         92                21    intron      1 WASH5P     WASH5P chr1
16       6919        177                 0    intron      0 WASH5P     WASH5P chr1
17       7232        237                35    intron      2 WASH5P     WASH5P chr1
18       7606        172                 0    intron      0 WASH5P     WASH5P chr1
19       7925        206                 0    intron      0 WASH5P     WASH5P chr1
20       8230       6371               109    intron     67 WASH5P     WASH5P chr1
21      14755       4429                55    intron     12 WASH5P     WASH5P chr1
...
I'm "ply"-ing over the "transcript" column and the
function transforms
each such subset of the data.frame into a new data.frame that is just
1 row / transcript that basically has the sum of the "counts" for each
transcript.
The code would look something like this (`summaries` is the data.frame
I'm referring to):
rpkm <- ddply(summaries, .(transcript), function(df) {
  data.frame(symbol=df$symbol[1], counts=sum(df$counts))
})
(It actually calculates 2 more columns that are returned in the
data.frame, but I'm not sure that's really important here).
To test some things out, I've written another function to manually
iterate/create subsets of my data.frame to summarize.
I'm using sqldf to dump the data.frame into a db, then I lapply over
subsets of the db (`where transcript=x`) to summarize each subset of my
data into a list of single-row data.frames (like ddply is doing), and
finish with a `do.call(rbind, the.dfs)` on this list.
This returns the exact same result ddply would return, and by the time
`do.call` finishes, my RAM usage hits about 8 GB.
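For concreteness, the manual version looks roughly like this -- a
simplified sketch (it re-runs the sqldf query for each transcript
rather than dumping the table into the db just once, but the shape of
the computation is the same):

library(sqldf)

## Sketch only -- `summaries` is the data.frame described above.
## Build one single-row data.frame per transcript, then rbind them all.
transcripts <- unique(as.character(summaries$transcript))
the.dfs <- lapply(transcripts, function(x) {
  df <- sqldf(sprintf("select * from summaries where transcript = '%s'", x))
  data.frame(transcript=x, symbol=df$symbol[1], counts=sum(df$counts))
})
rpkm.manual <- do.call(rbind, the.dfs)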
So, what am I doing wrong with ddply that makes the RAM usage in its
last step (the "collation" -- the equivalent of my final
`do.call(rbind, my.dfs)`) more than 12 GB higher?
Thanks,
-steve
--
Steve Lianoglou
Graduate Student: Computational Systems Biology
| Memorial Sloan-Kettering Cancer Center
| Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact
I don't know about that, but try this:
install.packages("data.table",
repos="http://R-Forge.R-project.org")
require(data.table)
summaries = data.table(summaries)
summaries[,sum(counts),by=symbol]
Please let us know if that returns the correct result, and if its
memory/speed is ok?
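If you need the exact shape of your ddply result (one row per
transcript, carrying along the first symbol), I'd expect something
like this to work too, though I haven't tested it on your data:

summaries[, list(symbol=symbol[1], counts=sum(counts)), by=transcript]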
Matthew
"Steve Lianoglou" <mailinglist.honeypot at gmail.com> wrote in
message
news:w2kbbdc7ed01004290606lc425e47cs95b36f6bf0ab3d at
mail.gmail.com...> Hi all,
>
> In short:
>
> I'm running ddply on an admittedly (somehow) large data.frame (not
> that large). It runs fine until it finishes and gets to the
> "collating" part where all subsets of my data.frame have been
> summarized and they are being reassembled into the final summary
> data.frame (sorry, don't know the correct plyr terminology). During
> collation, my R workspace RAM usage goes from about 1.5 GB upto 20GB
> until I kill it.
>
> Running a similar piece of code that iterates manually w/o ddply by
> using a combo of lapply and a do.call(rbind, ...) uses considerably
> less ram (tops out at about 8GB).
>
> How can I use ddply more efficiently?
>
> Longer:
>
> Here's more info:
>
> * The data.frame itself ~ 15.8 MB when loaded.
> * ~ 400,000 rows, 8 columns
>
> It looks like so:
>
> exon.start exon.width exon.width.unique exon.anno counts
> symbol transcript chr
> 1 4225 468 0 utr 0
> WASH5P WASH5P chr1
> 2 4833 69 0 utr 1
> WASH5P WASH5P chr1
> 3 5659 152 38 utr 1
> WASH5P WASH5P chr1
> 4 6470 159 0 utr 0
> WASH5P WASH5P chr1
> 5 6721 198 0 utr 0
> WASH5P WASH5P chr1
> 6 7096 136 0 utr 0
> WASH5P WASH5P chr1
> 7 7469 137 0 utr 0
> WASH5P WASH5P chr1
> 8 7778 147 0 utr 0
> WASH5P WASH5P chr1
> 9 8131 99 0 utr 0
> WASH5P WASH5P chr1
> 10 14601 154 0 utr 0
> WASH5P WASH5P chr1
> 11 19184 50 0 utr 0
> WASH5P WASH5P chr1
> 12 4693 140 36 intron 2
> WASH5P WASH5P chr1
> 13 4902 757 36 intron 1
> WASH5P WASH5P chr1
> 14 5811 659 144 intron 47
> WASH5P WASH5P chr1
> 15 6629 92 21 intron 1
> WASH5P WASH5P chr1
> 16 6919 177 0 intron 0
> WASH5P WASH5P chr1
> 17 7232 237 35 intron 2
> WASH5P WASH5P chr1
> 18 7606 172 0 intron 0
> WASH5P WASH5P chr1
> 19 7925 206 0 intron 0
> WASH5P WASH5P chr1
> 20 8230 6371 109 intron 67
> WASH5P WASH5P chr1
> 21 14755 4429 55 intron 12
> WASH5P WASH5P chr1
> ...
>
> I'm "ply"-ing over the "transcript" column and the
function transforms
> each such subset of the data.frame into a new data.frame that is just
> 1 row / transcript that basically has the sum of the "counts" for
each
> transcript.
>
> The code would look something like this (`summaries` is the data.frame
> I'm referring to):
>
> rpkm <- ddply(summaries, .(transcript), function(df) {
> data.frame(symbol=df$symbol[1], counts=sum(df$counts))
> }
>
> (It actually calculates 2 more columns that are returned in the
> data.frame, but I'm not sure that's really important here).
>
> To test some things out, I've written another function to manually
> iterate/create subsets of my data.frame to summarize.
>
> I'm using sqldf to dump the data.frame into a db, then I lapply over
> subsets of the db `where transcript=x` to summarize each subset of my
> data into a list of single-row data.frames (like ddply is doing), and
> finish with a `do.call(rbind, the.dfs)` o nthis list.
>
> This returns the same exact result ddply would return, and by the time
> `do.call` finishes, my RAM usage hits about 8gb.
>
> So, what am I doing wrong with ddply that makes the difference ram
> usage in the last step ("collation" -- the equivalent of my final
> `do.call(rbind, my.dfs)` be more than 12GB?
>
> Thanks,
> -steve
>
> --
> Steve Lianoglou
> Graduate Student: Computational Systems Biology
> | Memorial Sloan-Kettering Cancer Center
> | Weill Medical College of Cornell University
> Contact Info: http://cbio.mskcc.org/~lianos/contact
>