I have a large data.frame Z (2,424,185,944 bytes, 10,256,441 rows, 17 columns).
I want to get the result of

    table(aggregate(Z$V1, FUN = length, by = list(id=Z$V2))$x)

Alas, aggregate has been running for ~30 minutes, RSS is 14G, VIRT is 24.3G, and there is no end in sight.
Both V1 and V2 are character vectors (not factors).
Is there anything I could do to speed this up?
Thanks.

-- 
Sam Steingold (http://sds.podval.org/) on Ubuntu 12.04 (precise) X 11.0.11103000
http://www.childpsy.net/ http://www.PetitionOnline.com/tap12009/
http://dhimmi.com http://think-israel.org http://iris.org.il
WinWord 6.0 UNinstall: Not enough disk space to uninstall WinWord
Hi,

On Fri, Sep 14, 2012 at 3:26 PM, Sam Steingold <sds at gnu.org> wrote:
> I have a large data.frame Z (2,424,185,944 bytes, 10,256,441 rows, 17 columns).
> I want to get the result of
> table(aggregate(Z$V1, FUN = length, by = list(id=Z$V2))$x)
> alas, aggregate has been running for ~30 minute, RSS is 14G, VIRT is
> 24.3G, and no end in sight.
> both V1 and V2 are characters (not factors).
> Is there anything I could do to speed this up?
> Thanks.

You might find you'll get a lot of mileage out of data.table when working with such large data.frames ...

To get something close to what you're after, you can try:

R> library(data.table)
R> Z <- as.data.table(Z)
R> setkeyv(Z, 'V2')
R> agg <- Z[, list(count=.N), by='V2']

From here you might do:

R> tab1 <- table(agg$count)

I think that'll get you where you want to be ... I'm ashamed to say that I haven't really done much with aggregate, since I have mostly used plyr and data.table, so I might be missing your end goal. A reproducible example with a small data.frame from you would help here (for me at least).

HTH,
-steve

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact
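A small reproducible example of the kind asked for above might look like the following sketch. The toy data.frame `Z_small` and its values are made up for illustration (only the column names V1 and V2 and their character type come from the original post); it simply checks that the data.table route produces the same table of group sizes as the aggregate() call.

```r
library(data.table)

# Hypothetical toy stand-in for Z: two character columns, as in the original post.
Z_small <- data.frame(
  V1 = c("a", "b", "c", "d", "e", "f"),
  V2 = c("k1", "k1", "k2", "k3", "k3", "k3"),
  stringsAsFactors = FALSE
)

# Original approach: count rows per V2 value, then tabulate those counts.
tab_aggregate <- table(
  aggregate(Z_small$V1, FUN = length, by = list(id = Z_small$V2))$x
)

# data.table approach from the reply above.
DT <- as.data.table(Z_small)
setkeyv(DT, "V2")
agg <- DT[, list(count = .N), by = "V2"]
tab_datatable <- table(agg$count)

# Both tables should hold the same group sizes and frequencies.
stopifnot(
  identical(names(tab_aggregate), names(tab_datatable)),
  identical(as.vector(tab_aggregate), as.vector(tab_datatable))
)
```

On the real Z the same comparison would only be practical on a subset of rows, given how long aggregate() takes on the full data.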
Hi:

This should give you some idea of what Steve is talking about:

library(data.table)
dt <- data.table(x = sample(100000, 10000000, replace = TRUE),
                 y = rnorm(10000000), key = "x")
dt[, .N, by = x]
system.time(dt[, .N, by = x])

...on my system, dual core 8Gb RAM running Win7 64-bit,

> system.time(dt[, .N, by = x])
   user  system elapsed
   0.12    0.02    0.14

.N is an optimized function to find the number of rows of each data subset. Much faster than aggregate(). It might take a little longer because you have more columns that suck up space, but you get the idea. It's also about 5-6 times faster if you set a key variable in the data table than if you don't.

Dennis
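To see the keyed versus unkeyed difference on your own machine, something like the sketch below could be used. It assumes a smaller table (1e6 rows rather than the 1e7 above) so that it runs quickly, and the names `dt_unkeyed` and `dt_keyed` are illustrative. The timings, and the size of any speed-up, will depend on hardware and on the data.table version.

```r
library(data.table)

set.seed(1)
n <- 1e6  # assumption: smaller than the 1e7 rows above, to keep the sketch fast

# Same data twice: once without a key, once keyed on x.
dt_unkeyed <- data.table(x = sample(100000, n, replace = TRUE), y = rnorm(n))
dt_keyed <- copy(dt_unkeyed)
setkey(dt_keyed, x)

# Time the same grouped row count on both tables.
t_unkeyed <- system.time(res_unkeyed <- dt_unkeyed[, .N, by = x])
t_keyed   <- system.time(res_keyed   <- dt_keyed[, .N, by = x])

# Show user, system and elapsed time side by side.
print(rbind(unkeyed = t_unkeyed, keyed = t_keyed)[, 1:3])

# The counts agree either way once both results are sorted by x.
setkey(res_unkeyed, x)
setkey(res_keyed, x)
stopifnot(isTRUE(all.equal(res_unkeyed$N, res_keyed$N)))
```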
Hi,

On Fri, Sep 14, 2012 at 4:26 PM, Dennis Murphy <djmuser at gmail.com> wrote:
> Hi:
>
> This should give you some idea of what Steve is talking about:
>
> library(data.table)
> dt <- data.table(x = sample(100000, 10000000, replace = TRUE),
>                  y = rnorm(10000000), key = "x")
> dt[, .N, by = x]
> system.time(dt[, .N, by = x])
>
> ...on my system, dual core 8Gb RAM running Win7 64-bit,
>> system.time(dt[, .N, by = x])
>    user  system elapsed
>    0.12    0.02    0.14
>
> .N is an optimized function to find the number of rows of each data subset.
> Much faster than aggregate(). It might take a little longer because you
> have more columns that suck up space, but you get the idea. It's also about
> 5-6 times faster if you set a key variable in the data table than if you
> don't.

Well done, sir!

(A slight critique: .N isn't a function, it's just a variable that is reset within each by-subset/group.)

Also, don't forget to use the .SDcols parameter in [.data.table if you plan on using only a subset of the columns inside your "by" stuff.

There's lots of documentation in the package (`?data.table`) and the vignettes/FAQ to help you tweak your usage, if you decide to take the data.table route.

HTH,
-steve
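The two points above (.N being a per-group variable, and .SDcols restricting which columns reach .SD) can be illustrated with a minimal sketch. The table `DT` and its column names are hypothetical and chosen only for this example.

```r
library(data.table)

# Hypothetical toy table; the column names are illustrative only.
DT <- data.table(
  id   = c("a", "a", "b", "b", "b"),
  v1   = c(1, 2, 3, 4, 5),
  v2   = c(10, 20, 30, 40, 50),
  note = c("p", "q", "r", "s", "t")
)

# .N is used as a plain variable (no parentheses): the row count of the current group.
counts <- DT[, list(count = .N), by = id]

# .SDcols restricts .SD to the named columns, so 'note' is left out of the per-group work.
means <- DT[, lapply(.SD, mean), by = id, .SDcols = c("v1", "v2")]

print(counts)
print(means)
```

Without .SDcols, .SD would contain every non-grouping column, and the mean() call here would also be applied to the character column `note`.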