r user
2006-Jan-03 16:59 UTC
[R] For loop gets exponentially slower as dataset gets larger...
I am running R 2.1.1 in a Microsoft Windows XP environment.

I have a matrix with three vectors ("columns") and ~2 million "rows". The three vectors are date_, id, and price. The data is ordered (sorted) by code and date_.

(The matrix contains daily prices for several thousand stocks, and has ~2 million "rows". If a stock did not trade on a particular date, its price is set to "NA".)

I wish to add a fourth vector, "next_price". ("Next price" is the current price as long as the current price is not "NA". If the current price is NA, the "next_price" is the next price at which the security with this same ID trades. If the stock does not trade again, "next_price" is set to NA.)

I wrote the following loop to calculate next_price. It works as intended, but I have one problem. When I have only 10,000 rows of data, the calculations are very fast. However, when I run the loop on the full 2 million rows, it seems to take ~1 second per row.

Why is this happening? What can I do to speed up the calculations when running the loop on the full 2 million rows?
(I am not running low on memory, but I am maxing out my CPU at 100%.)

Here is my code and some sample data:

data <- data[order(data$code, data$date_),]
l <- dim(data)[1]
w <- 3
data[l, w+1] <- NA

for (i in (l-1):1) {
  data[i, w+1] <- ifelse(!is.na(data[i, w]), data[i, w],
                         ifelse(data[i, 2] == data[i+1, 2], data[i+1, w+1], NA))
}

date        id    price     next_price
6/24/2005   1635  444.7838  444.7838
6/27/2005   1635  448.4756  448.4756
6/28/2005   1635  455.4161  455.4161
6/29/2005   1635  454.6658  454.6658
6/30/2005   1635  453.9155  453.9155
7/1/2005    1635  453.3153  453.3153
7/4/2005    1635  NA        453.9155
7/5/2005    1635  453.9155  453.9155
7/6/2005    1635  453.0152  453.0152
7/7/2005    1635  452.8651  452.8651
7/8/2005    1635  456.0163  456.0163
12/19/2005  1635  442.6982  442.6982
12/20/2005  1635  446.5159  446.5159
12/21/2005  1635  452.4714  452.4714
12/22/2005  1635  451.074   451.074
12/23/2005  1635  454.6453  454.6453
12/27/2005  1635  NA        NA
12/28/2005  1635  NA        NA
12/1/2003   1881  66.1562   66.1562
12/2/2003   1881  64.9192   64.9192
12/3/2003   1881  66.0078   66.0078
12/4/2003   1881  65.8098   65.8098
12/5/2003   1881  64.1275   64.1275
12/8/2003   1881  64.8697   64.8697
12/9/2003   1881  63.5337   63.5337
12/10/2003  1881  62.9399   62.9399
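[Editor's note: the repeated data[i, w+1] <- ... assignments are the likely culprit here; indexed assignment into a data frame can copy the object, so cost grows with total size. The intended rule itself can be computed on a plain vector for one stock. A sketch for reference (the helper name next_price_one is ours, not from the post):

```r
# next_price on one stock's price vector: price[i] if observed,
# otherwise the next non-NA price, or NA if there is none.
next_price_one <- function(p) {
  idx <- which(!is.na(p))                          # positions of observed prices
  j <- findInterval(seq_along(p) - 1L, idx) + 1L   # first observed position >= i
  out <- rep(NA_real_, length(p))
  ok <- j <= length(idx)                           # FALSE where no later price exists
  out[ok] <- p[idx[j[ok]]]
  out
}

next_price_one(c(453.3153, NA, 453.9155, NA, NA))
# the NA at position 2 picks up 453.9155; the trailing NAs stay NA
```

This avoids any per-row data-frame assignment, so it runs in roughly linear time per stock.]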
bogdan romocea
2006-Jan-03 18:49 UTC
[R] For loop gets exponentially slower as dataset gets larger...
Your 2-million loop is overkill, because apparently in the (vast) majority of cases you don't need to loop at all. You could try something like this:

1. Split the price by id, e.g.
   price.list <- split(price, id)

For each id:

2a. When price is not NA, assign it to next price _without_ using a for loop, e.g.
   next.price[!is.na(price)] <- price[!is.na(price)]

2b. Use a for loop only when price is NA, but even then work with vectors as much as you can, for example (untested):

for (i in setdiff(which(is.na(price)), length(price))) {
  remaining.prices <- price[(i+1):length(price)]
  of.interest <- head(remaining.prices[!is.na(remaining.prices)], 1)
  if (length(of.interest) == 0) next.price[i] <- NA
  else next.price[i] <- of.interest
}

To run (2a) and (2b) you could use lapply(); to paste the bits together try do.call("rbind", price.list). You might also want to take a look at ?Rprof and check the archives for efficiency suggestions.
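[Editor's note: the split/apply/combine recipe above can be sketched end to end on toy data. Column names id and price follow the original post; the helper name next_price_block is ours:

```r
# Toy data in the original's layout: prices per id, NA = no trade that day
data <- data.frame(id    = c(1635, 1635, 1635, 1881, 1881),
                   price = c(453.3153, NA, 453.9155, 66.1562, NA))

# Fill next_price for one id's block: non-NA prices pass through (step 2a);
# the loop runs only over NA positions (step 2b).
next_price_block <- function(df) {
  p <- df$price
  np <- rep(NA_real_, length(p))
  np[!is.na(p)] <- p[!is.na(p)]
  for (i in which(is.na(p))) {
    later <- p[seq_along(p) > i]
    later <- later[!is.na(later)]
    if (length(later) > 0) np[i] <- later[1]
  }
  df$next_price <- np
  df
}

# Split on id, fill each block, paste the pieces back together
data <- do.call("rbind", lapply(split(data, data$id), next_price_block))
```

The loop now touches only the NA rows of each block, not all 2 million rows, and it assigns into a plain vector rather than a data frame.]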
Gabor Grothendieck
2006-Jan-03 19:30 UTC
[R] For loop gets exponentially slower as dataset gets larger...
Accepting this stacked representation for the moment, try this. When reordering the dates, do it in reverse order. Then loop over all codes, applying the zoo function na.locf to the prices for that code. locf stands for "last observation carried forward"; since our dates are reversed, it will bring the next one backwards. Finally, sort back into ascending order.

library(zoo)  # needed for na.locf, which also works for non-zoo objects

data <- data[order(data$code, -as.numeric(data$date_)),]
attach(data)
next_price <- price
for (i in unique(code)) next_price[code == i] <- na.locf(price[code == i], na.rm = F)
data$next_price <- next_price
data <- data[order(data$code, data$date_),]
detach()

Here it is again, but this time we represent the data as a list of zoo objects with one component per code. In the code below we split the data on code and apply f to each piece. Note that na.locf replaces NAs with the last observation carried forward, so by reversing the data, using na.locf and reversing the data again we get the desired effect.

library(zoo)

f <- function(x) {
  z <- zoo(x$price, x$date_)
  next_price <- rev(na.locf(rev(coredata(z)), na.rm = FALSE))
  merge(z, next_price)
}

z <- lapply(split(data, data$code), f)
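[Editor's note: the reverse/fill/reverse trick can be tried without installing zoo; here is a minimal base-R stand-in for the forward fill (a sketch, not zoo's actual implementation):

```r
# Minimal forward fill (last observation carried forward), base R only
na_locf <- function(x) {
  idx <- cumsum(!is.na(x))        # how many non-NA values seen so far
  c(NA, x[!is.na(x)])[idx + 1]    # 0 seen -> leading NA stays NA
}

# Reversing first turns forward fill into "next observation carried
# backward", which is exactly the next_price rule for one stock:
p <- c(453.3153, NA, 453.9155, NA, NA)
rev(na_locf(rev(p)))
# -> 453.3153 453.9155 453.9155 NA NA
```

Like the zoo version, this is vectorized, so each stock is processed in one pass rather than one data-frame assignment per row.]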