I have an extremely large data frame (~13 million rows) that resembles the structure of the object tmp below in the reproducible code. In my real data, the variable 'id' may or may not be ordered, but I think that is irrelevant.

I have a process that requires subsetting the data by id and then running each smaller data frame through a set of functions. One example below uses indexing and the other uses an explicit call to subset(); both return the same result, but indexing is faster.

The problem is that in my real data the indexing must scan through millions of rows to evaluate the condition, which is expensive and the bottleneck in my code. Can anyone recommend an approach that would be less expensive and faster?

Thank you
Harold

tmp <- data.frame(id = rep(1:200, each = 10), foo = rnorm(2000))

idList <- unique(tmp$id)

### Fast, but not fast enough
system.time(replicate(500, tmp[which(tmp$id == idList[1]),]))

### Not fast at all, a big bottleneck
system.time(replicate(500, subset(tmp, id == idList[1])))
I regularly crunch through this amount of data with the tidyverse. You can also try the data.table package. Both are optimized for speed, as long as you have the memory.

Dominik

On Wed, Sep 28, 2016 at 10:09 AM, Doran, Harold <HDoran at air.org> wrote:
> I have an extremely large data frame (~13 million rows) that resembles the
> structure of the object tmp below in the reproducible code. [...]
> I'm curious if anyone can recommend an improvement that would somehow be
> less expensive and faster?
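A minimal sketch of what the tidyverse route mentioned above might look like for the example in the original post, assuming the same tmp and idList objects; the summarised mean is only a placeholder for whatever per-id computation is actually needed:

library(dplyr)

## subset a single id, analogous to tmp[which(tmp$id == idList[1]), ]
one_id <- tmp %>% filter(id == idList[1])

## or avoid repeated subsetting and compute per-group results in one pass
per_id <- tmp %>%
  group_by(id) %>%
  summarise(result = mean(foo))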
Thank you very much. I don't know the tidyverse; I'll look at that now. I did some tests with the data.table package, but it was much slower on my machine; see the examples below.

tmp <- data.frame(id = rep(1:200, each = 10), foo = rnorm(2000))
idList <- unique(tmp$id)

system.time(replicate(500, tmp[which(tmp$id == idList[1]),]))
system.time(replicate(500, subset(tmp, id == idList[1])))

library(data.table)
tmp2 <- as.data.table(tmp)

# data.table
system.time(replicate(500, tmp2[which(tmp$id == idList[1]),]))
system.time(replicate(500, subset(tmp2, id == idList[1])))
Hello,

If you work with a matrix instead of a data.frame, it usually runs faster, but your column vectors must all be numeric.

> ### Fast, but not fast enough
> system.time(replicate(500, tmp[which(tmp$id == idList[1]),]))
   user  system elapsed
   0.05    0.00    0.04

> ### Not fast at all, a big bottleneck
> system.time(replicate(500, subset(tmp, id == idList[1])))
   user  system elapsed
   0.07    0.00    0.08

> # Make it a matrix and use the matrix
> mattmp <- as.matrix(tmp)
> system.time(replicate(500, mattmp[which(mattmp[,"id"] == idList[1]),]))
   user  system elapsed
   0.01    0.00    0.01

Hope this helps,

Rui Barradas

Quoting Doran, Harold <HDoran at air.org>:
> I have an extremely large data frame (~13 million rows) that resembles
> the structure of the object tmp below in the reproducible code. [...]
On Wed, 28 Sep 2016, "Doran, Harold" <HDoran at air.org> writes:

> ### Fast, but not fast enough
> system.time(replicate(500, tmp[which(tmp$id == idList[1]),]))
>
> ### Not fast at all, a big bottleneck
> system.time(replicate(500, subset(tmp, id == idList[1])))

If you really need only one column, it will be faster to extract that column and then take a subset of it:

system.time(replicate(500, tmp[[2L]][tmp$id == idList[1L]]))

(A data.frame is a list of atomic vectors, and it is typically faster to first extract the component of interest, i.e. the specific column, and then to subset this vector. The result will, of course, be a vector, not a data.frame.)

--
Enrico Schumann
Lucerne, Switzerland
http://enricoschumann.net
Hi,

I'm surprised nobody suggested split(). Splitting the data.frame upfront is faster than repeatedly subsetting it:

tmp <- data.frame(id = rep(1:20000, each = 10), foo = rnorm(200000))
idList <- unique(tmp$id)

system.time(for (i in idList) tmp[which(tmp$id == i),])
#    user  system elapsed
#  16.286   0.000  16.305

system.time(split(tmp, tmp$id))
#    user  system elapsed
#   5.637   0.004   5.647

Cheers,
H.

On 09/28/2016 09:09 AM, Doran, Harold wrote:
> I have an extremely large data frame (~13 million rows) that resembles
> the structure of the object tmp below in the reproducible code. [...]

--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fredhutch.org
Phone: (206) 667-5791
Fax: (206) 667-1319
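Since the original poster needs to push each per-id data frame through a set of functions, here is a minimal sketch of how split() combines with lapply() for that; my_fun() is hypothetical and stands in for the real per-id processing:

## split once, then apply the per-id function to each small data.frame
pieces  <- split(tmp, tmp$id)        # named list, one data.frame per id
results <- lapply(pieces, my_fun)    # my_fun() is a placeholder, not from the thread

## if my_fun() returns a one-row data.frame, the results can be recombined with
## do.call(rbind, results)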
"I'm surprised nobody suggested split(). " I did. by() is a data frame oriented version of tapply(), which uses split(). Cheers, Bert Bert Gunter "The trouble with having an open mind is that people keep coming along and sticking things into it." -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip ) On Wed, Sep 28, 2016 at 11:53 AM, Herv? Pag?s <hpages at fredhutch.org> wrote:> Hi, > > I'm surprised nobody suggested split(). Splitting the data.frame > upfront is faster than repeatedly subsetting it: > > tmp <- data.frame(id = rep(1:20000, each = 10), foo = rnorm(200000)) > idList <- unique(tmp$id) > > system.time(for (i in idList) tmp[which(tmp$id == i),]) > # user system elapsed > # 16.286 0.000 16.305 > > system.time(split(tmp, tmp$id)) > # user system elapsed > # 5.637 0.004 5.647 > > Cheers, > H. > > On 09/28/2016 09:09 AM, Doran, Harold wrote: >> >> I have an extremely large data frame (~13 million rows) that resembles the >> structure of the object tmp below in the reproducible code. In my real data, >> the variable, 'id' may or may not be ordered, but I think that is >> irrelevant. >> >> I have a process that requires subsetting the data by id and then running >> each smaller data frame through a set of functions. One example below uses >> indexing and the other uses an explicit call to subset(), both return the >> same result, but indexing is faster. >> >> Problem is in my real data, indexing must parse through millions of rows >> to evaluate the condition and this is expensive and a bottleneck in my code. >> I'm curious if anyone can recommend an improvement that would somehow be >> less expensive and faster? >> >> Thank you >> Harold >> >> >> tmp <- data.frame(id = rep(1:200, each = 10), foo = rnorm(2000)) >> >> idList <- unique(tmp$id) >> >> ### Fast, but not fast enough >> system.time(replicate(500, tmp[which(tmp$id == idList[1]),])) >> >> ### Not fast at all, a big bottleneck >> system.time(replicate(500, subset(tmp, id == idList[1]))) >> >> ______________________________________________ >> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see >> https://stat.ethz.ch/mailman/listinfo/r-help >> PLEASE do read the posting guide >> http://www.R-project.org/posting-guide.html >> and provide commented, minimal, self-contained, reproducible code. >> > > -- > Herv? Pag?s > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fredhutch.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 > > > ______________________________________________ > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code.
On 09/28/2016 02:53 PM, Hervé Pagès wrote:
> I'm surprised nobody suggested split(). Splitting the data.frame
> upfront is faster than repeatedly subsetting it:
>
> tmp <- data.frame(id = rep(1:20000, each = 10), foo = rnorm(200000))
> idList <- unique(tmp$id)
>
> system.time(for (i in idList) tmp[which(tmp$id == i),])
> #    user  system elapsed
> #  16.286   0.000  16.305
>
> system.time(split(tmp, tmp$id))
> #    user  system elapsed
> #   5.637   0.004   5.647

An odd speed-up is to provide (non-sequential) row names, e.g.,

> system.time(split(tmp, tmp$id))
   user  system elapsed
  4.472   0.648   5.122
> row.names(tmp) = rev(seq_len(nrow(tmp)))
> system.time(split(tmp, tmp$id))
   user  system elapsed
  0.588   0.000   0.587

for reasons explained here:

http://stackoverflow.com/questions/39545400/why-is-split-inefficient-on-large-data-frames-with-many-groups/39548316#39548316

Martin
Hi Harold,

Generally: you cannot beat data.table, unless you can represent your data in a matrix (or array or vector). For some specific cases, Hervé's suggestion might also be competitive.

Your problem is that you did not put any effort into reading at least part of the very extensive documentation of the data.table package. You should start here:

https://github.com/Rdatatable/data.table/wiki/Getting-started

To put it in a nutshell: use a key, which allows binary search instead of the much, much slower vector scan. (With the automatic auto-indexing feature of the data.table package, you may even skip this step.) The point is that creating the key must be done only once, and all subsequent subsetting operations which use the key become incredibly fast. You missed this point because, in one of your examples, you replicated the creation of the key as well, not only the subsetting.

Here is a version of Hervé's example (OK, it is a bit biased because data.table has a highly optimized internal version of mean() for calculating the group means):

## create a keyed data.table
tmp_dt <- data.table(id = rep(1:20000, each = 10),
                     foo = rnorm(200000),
                     key = "id")

system.time(tmp_dt[, .(result = mean(foo)), by = id])
#   user  system elapsed
#  0.004   0.000   0.005

## subset a keyed data.table
all_ids <- tmp_dt[, unique(id)]
select_id <- sample(all_ids, 1)

system.time(tmp_dt[.(select_id)])
#   user  system elapsed
#  0.000   0.000   0.001

## or equivalently
system.time(tmp_dt[id == select_id])
#   user  system elapsed
#  0.000   0.000   0.001

Note: the CRAN version of the data.table package is already very fast, but you should try the development version ( devtools::install_github("Rdatatable/data.table") ) for multi-threaded subsetting.

Cheers,
Denes

On 09/28/2016 08:53 PM, Hervé Pagès wrote:
> I'm surprised nobody suggested split(). Splitting the data.frame
> upfront is faster than repeatedly subsetting it: [...]
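Applied to the objects from the original post, a keyed version of that workflow might look like the following sketch; as above, my_fun() is a hypothetical stand-in for the real per-id functions and is assumed to return something data.table can stack (e.g. a list or a one-row data.frame):

library(data.table)

tmp2 <- as.data.table(tmp)   # or setDT(tmp) to convert by reference
setkey(tmp2, id)             # sort/index once; later lookups use binary search

## fast keyed lookup of a single id
tmp2[.(idList[1])]

## run the per-id function on every group in a single pass
res <- tmp2[, my_fun(.SD), by = id]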