I regularly crunch through this amount of data with tidyverse. You can also
try the data.table package. They are optimized for speed, as long as you
have the memory.

Dominik

On Wed, Sep 28, 2016 at 10:09 AM, Doran, Harold <HDoran at air.org> wrote:
> I have an extremely large data frame (~13 million rows) that resembles the
> structure of the object tmp below in the reproducible code. In my real
> data, the variable 'id' may or may not be ordered, but I think that is
> irrelevant.
>
> I have a process that requires subsetting the data by id and then running
> each smaller data frame through a set of functions. One example below uses
> indexing and the other uses an explicit call to subset(); both return the
> same result, but indexing is faster.
>
> The problem is that in my real data, indexing must parse through millions
> of rows to evaluate the condition, and this is expensive and a bottleneck
> in my code. I'm curious whether anyone can recommend an improvement that
> would somehow be less expensive and faster?
>
> Thank you
> Harold
>
> tmp <- data.frame(id = rep(1:200, each = 10), foo = rnorm(2000))
> idList <- unique(tmp$id)
>
> ### Fast, but not fast enough
> system.time(replicate(500, tmp[which(tmp$id == idList[1]), ]))
>
> ### Not fast at all, a big bottleneck
> system.time(replicate(500, subset(tmp, id == idList[1])))
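A minimal sketch of the keyed lookup that data.table is built for (the scaled-up sizes here are illustrative, not from the thread): setting a key on id sorts the table once, after which each per-id subset is a binary search rather than a scan of the whole id column. Note that the data.table timings quoted further down compute row positions with which() on the original data frame, so the key never comes into play there.

library(data.table)

## Illustrative sizes, closer in spirit to the real 13-million-row problem
tmp <- data.frame(id = rep(1:20000, each = 100), foo = rnorm(2000000))
idList <- unique(tmp$id)

DT <- as.data.table(tmp)
setkey(DT, id)          # sort once; later subsets use binary search

## Vector-scan subset on the data.frame (the original approach)
system.time(replicate(500, tmp[which(tmp$id == idList[1]), ]))

## Keyed subset on the data.table
system.time(replicate(500, DT[.(idList[1])]))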
Don't do it this way. You are reinventing wheels.

1. Look at package dplyr, which has optimized functions to do exactly this
(break into subframes, calculate on subframes, reassemble). Note also that
dplyr is part of tidyverse. I use base R functionality for this because I
know it and it does what I need, but dplyr may be better for your needs.

2. In base R, this would be done by by(). For your example:

by(tmp, tmp$id, FUN, ...)

where FUN is a function that does whatever you want on each sub-data frame,
e.g. if you wanted to just take the mean of foo for each subframe:

by(tmp, tmp$id, function(x) mean(x$foo))
## (but there are better ways of doing such a simple function in base R or dplyr)

Cheers,
Bert

Bert Gunter
"The trouble with having an open mind is that people keep coming along and
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip)

On Wed, Sep 28, 2016 at 9:28 AM, Doran, Harold <HDoran at air.org> wrote:
> Thank you very much. I don't know tidyverse, I'll look at that now. I did
> some tests with the data.table package, but it was much slower on my
> machine; see examples below.
>
> tmp <- data.frame(id = rep(1:200, each = 10), foo = rnorm(2000))
> idList <- unique(tmp$id)
>
> system.time(replicate(500, tmp[which(tmp$id == idList[1]), ]))
> system.time(replicate(500, subset(tmp, id == idList[1])))
>
> library(data.table)
> tmp2 <- as.data.table(tmp)   # data.table
>
> system.time(replicate(500, tmp2[which(tmp$id == idList[1]), ]))
> system.time(replicate(500, subset(tmp2, id == idList[1])))
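A minimal base R sketch of the split-apply pattern that by() implements (this code is an illustration, not taken from the thread): split() partitions the data frame by id in a single pass, after which each piece can be pushed through whatever functions are needed without re-scanning the id column.

tmp <- data.frame(id = rep(1:200, each = 10), foo = rnorm(2000))

pieces <- split(tmp, tmp$id)                       # one pass over id
res    <- vapply(pieces, function(x) mean(x$foo),  # any per-group work here
                 numeric(1))
head(res)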
I just modified the reproducible example a bit, so it's a bit more
realistic. The function mean() could easily be replaced by your analysis.

Here are some possible solutions:

tmp <- data.frame(id = rep(1:2000, each = 100), foo = rnorm(200000))
tmp <- tmp[sample(dim(tmp)[1]), ]   # shuffle the rows so id is not ordered

## with specialized packages
require(plyr)
system.time({
  res1 <- ddply(tmp, .(id), summarize, mean = mean(foo))
})

require(dplyr)
system.time({
  res2 <- tmp %>%
    group_by(id) %>%
    summarise(mean = mean(foo))
})

library(data.table)
system.time({
  res3 <- data.table(tmp)[, list(mean = mean(foo)), by = id]
})

## built-in R methods
system.time({
  res4 <- aggregate(tmp$foo, by = list(id = tmp$id), FUN = mean)
})

system.time({
  res5 <- sapply(unique(tmp$id), simplify = TRUE,
                 FUN = function(x) {
                   c(id = x, mean = mean(tmp[which(tmp$id == x), "foo"]))
                 })
})
res5 <- t(res5)

system.time({
  res5 <- sapply(unique(tmp$id), simplify = TRUE,
                 FUN = function(x) {
                   sub.tmp <- subset(tmp, tmp$id == x)
                   c(id = x, mean = mean(sub.tmp[, "foo"]))
                 })
})
res5 <- t(res5)

Yours
Constantin

--
Weiser, Dr. Constantin (weiserc at hhu.de)
Chair of Statistics and Econometrics
Heinrich Heine-University of Düsseldorf
Universitätsstraße 1, 40225 Düsseldorf, Germany
Oeconomicum (Building 24.31), Room 01.22
Tel: 0049 211 81-15307
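system.time() around replicate() folds the cost of replicate itself into every measurement; a sketch using the microbenchmark package (assuming it is installed -- it is not part of the original posts) compares the grouped-summary approaches directly and reports per-call timing distributions:

library(microbenchmark)
library(data.table)

tmp <- data.frame(id = rep(1:2000, each = 100), foo = rnorm(200000))
DT  <- as.data.table(tmp)

microbenchmark(
  aggregate  = aggregate(tmp$foo, by = list(id = tmp$id), FUN = mean),
  by         = by(tmp, tmp$id, function(x) mean(x$foo)),
  tapply     = tapply(tmp$foo, tmp$id, mean),
  data.table = DT[, .(mean = mean(foo)), by = id],
  times = 10
)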
Many thanks. I also tried the filter function in dplyr, and it was also
much slower than simply indexing in the way the original code did:

system.time(replicate(500, filter(tmp, id == idList[1])))

I tried this on the toy example as well as on the real data, finding the
same (slower) result each time compared with the indexing method. Perhaps
I'm using it incorrectly?
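The slow part in all of these per-id calls is re-evaluating id == idList[i] over every row, once per id. A minimal sketch of precomputing the row indices instead (an illustration, not from the thread): split the row numbers by id once, then reuse those indices for each repeated subset.

tmp <- data.frame(id = rep(1:200, each = 10), foo = rnorm(2000))
idList <- unique(tmp$id)

idx <- split(seq_len(nrow(tmp)), tmp$id)   # one pass over the id column

## Repeated subsetting now skips the per-call scan of id
system.time(replicate(500, tmp[idx[[as.character(idList[1])]], ]))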
Note that for base R, by() is considerably faster than aggregate() (both of
which are *much* faster than the sapply() approach); tapply() is what is
more appropriate here. For Constantin's example:

> system.time({
+   res4 <- aggregate(tmp$foo, by = list(id = tmp$id), FUN = mean)
+ })
   user  system elapsed
  0.289   0.006   0.295

> system.time(res6 <- by(tmp, tmp$id, function(x) mean(x$foo)))
   user  system elapsed
  0.124   0.005   0.130

Note, however, that the by() result may need to be manipulated to get it
into the format you want, which takes additional time.

Cheers,
Bert

Bert Gunter
"The trouble with having an open mind is that people keep coming along and
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip)
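A short sketch of the tapply() route Bert mentions, together with one way to flatten a by() result into a data frame (the flattening code is my own illustration, not from the thread):

tmp <- data.frame(id = rep(1:2000, each = 100), foo = rnorm(200000))

## tapply() returns a named vector of group means directly
res_t <- tapply(tmp$foo, tmp$id, mean)
head(res_t)

## by() returns a list-like "by" object; unlist() flattens it
res_b <- by(tmp, tmp$id, function(x) mean(x$foo))
v     <- unlist(res_b)
res_b_df <- data.frame(id = as.numeric(names(v)), mean = as.numeric(v))
head(res_b_df)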