Dear R community,

I have a 2 million by 2 matrix that looks like this:

x <- sample(1:15, 2000000, replace=T)
y <- sample(1:10*1000, 2000000, replace=T)

      x     y
[1,] 10  4000
[2,]  3  1000
[3,]  3  4000
[4,]  8  6000
[5,]  2  9000
[6,]  3  8000
[7,]  2 10000
(...)

The first column is a population expansion factor for the number in the
second column (household income). I want to expand the second column by
the first, so that I end up with a vector beginning with 10 observations
of 4000, then 3 observations of 1000, and so on. In my mind the natural
approach would be to create a NULL vector and append the expansions:

myvar <- NULL
myvar <- append(myvar, replicate(x[1], y[1]), 1)

for (i in 2:length(x)) {
  myvar <- append(myvar, replicate(x[i], y[i]), sum(x[1:i]) + 1)
}

to end up with a vector of length sum(x), which in my real database
corresponds to 22 million observations.

This works fine -- if I only run it for the first, say, 1000
observations. If I try to perform it on all 2 million observations it
takes far too long to be useful (I left it running for 11 hours
yesterday to no avail).

I know R performs well with operations on relatively large vectors. Why
is this so inefficient? And what would be the smart way to do this?

Thanks in advance.
Alex
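To make the goal concrete, here is a small sketch (added for clarity, not part of the original post) of what the expanded vector should look like for the first three rows shown above:

# Rows (10, 4000), (3, 1000) and (3, 4000) should expand to:
wanted <- c(rep(4000, 10), rep(1000, 3), rep(4000, 3))
wanted
#  [1] 4000 4000 4000 4000 4000 4000 4000 4000 4000 4000 1000 1000 1000 4000 4000 4000
length(wanted)   # 16, i.e. 10 + 3 + 3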
> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org]
> On Behalf Of Alex Ruiz Euler
> Sent: Wednesday, August 17, 2011 3:54 PM
> To: r-help at r-project.org
> Subject: [R] More efficient option to append()?
>
> I know R performs well with operations on relatively large vectors. Why
> is this so inefficient? And what would be the smart way to do this?

Alex,

does the following do what you want?

myvar <- rep(y, x)

Hope this is helpful,

Dan

Daniel Nordlund
Bothell, WA USA
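For scale, a quick sketch of this one-liner on the full two-million-row example (added for illustration; timings vary by machine, but rep() typically finishes in well under a second here, versus the hours the append() loop was taking):

x <- sample(1:15, 2000000, replace = TRUE)
y <- sample(1:10 * 1000, 2000000, replace = TRUE)
system.time(myvar <- rep(y, times = x))  # rep() repeats each y[i] exactly x[i] times
length(myvar) == sum(x)                  # TRUE -- the expanded vector has the expected length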
Daniel, it works, thanks for your time with this simple matter.

Best,
Alex

On Wed, 17 Aug 2011 16:35:48 -0700 "Daniel Nordlund"
<djnordlund at frontier.com> wrote:

> Alex,
>
> does the following do what you want?
>
> myvar <- rep(y, x)
>
> Hope this is helpful,
>
> Dan
This takes a few seconds to do 1 million lines, and keeps the explicit
for-loop form:

numberofSalaryBands = 1000000 # 2000000
x        = sample(1:15, numberofSalaryBands, replace=T)
y        = sample((1:10)*1000, numberofSalaryBands, replace=T)
df       = data.frame(x, y)
finalN   = sum(df$x)
myVar    = rep(NA, finalN)   # pre-allocate the full output vector up front
outIndex = 1
for (i in 1:numberofSalaryBands) {
  kount = df$x[i]
  myVar[outIndex:(outIndex + kount - 1)] = rep(df$y[i], kount)  # make x[i] copies of value y[i]
  outIndex = outIndex + kount
}
head(myVar)
plyr::count(myVar)

On Aug 18, 2011, at 12:17 AM, Alex Ruiz Euler wrote:

> I know R performs well with operations on relatively large vectors. Why
> is this so inefficient? And what would be the smart way to do this?
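A small check worth noting here (an editorial sketch, not part of Timothy's message, assuming the df and myVar objects from the code above): the pre-allocated loop and the vectorised rep() call should produce the same result, so either can be validated against the other on a small run.

vecVar <- rep(df$y, times = df$x)  # vectorised equivalent of the loop above
identical(myVar, vecVar)           # should be TRUE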
Hi:

This seems to take a bit less code, avoids explicit loops (by using
mapply() instead, where the loops are internal) and takes about 10
seconds on my system:

m <- cbind(x = sample(1:15, 2000000, replace=T),
           y = sample(1:10*1000, 2000000, replace=T))
sum(m[, 1])
# [1] 16005804

ff <- function(x, y) rep(y, x)
system.time(w <- do.call(c, mapply(ff, m[, 1], m[, 2])))
#   user  system elapsed
#   9.75    0.00    9.75

length(w)
# [1] 16005804

count(w)   # count() from the plyr package
#        x    freq
# 1   1000 1603184
# 2   2000 1590599
# 3   3000 1596661
# 4   4000 1607112
# 5   5000 1598571
# 6   6000 1599195
# 7   7000 1600475
# 8   8000 1601718
# 9   9000 1598896
# 10 10000 1609393

HTH,
Dennis

PS: It would have been a good idea to keep the OP in the loop of this
thread.

On Thu, Aug 18, 2011 at 12:46 AM, Timothy Bates
<timothy.c.bates at gmail.com> wrote:

> This takes a few seconds to do 1 million lines, and keeps the explicit
> for-loop form:
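A follow-up sketch (not part of Dennis's message, assuming the matrix m and vector w defined above): rep() accepts a vector for its times argument, so the mapply()/do.call() step can be dropped entirely, and the one-liner should return the same expanded vector in a fraction of the time.

system.time(w2 <- rep(m[, 2], times = m[, 1]))  # expand each m[i, 2] by m[i, 1]
identical(w, w2)                                # should be TRUE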
On 08/17/2011 10:53 PM, Alex Ruiz Euler wrote:
> I know R performs well with operations on relatively large vectors. Why
> is this so inefficient? And what would be the smart way to do this?

Hi Alex,

The other replies already gave you the R way of doing this while avoiding
the for loop. However, there is a more general reason why your for loop
is terribly inefficient. A small set of examples:

largeVector = runif(10e4)   # 100,000 random values
outputVector = NULL
system.time(for(i in 1:length(largeVector)) {
  outputVector = append(outputVector, largeVector[i] + 1)
})
#  user  system elapsed
# 6.591   0.168   6.786

The problem in this code is that outputVector keeps on growing. Every
time it grows, R has to allocate a new, larger block of memory and copy
the existing contents over, and that is really slow. Several (much)
faster alternatives exist:

# Pre-allocating the outputVector
outputVector = rep(0, length(largeVector))
system.time(for(i in 1:length(largeVector)) {
  outputVector[i] = largeVector[i] + 1
})
#  user  system elapsed
# 0.178   0.000   0.178
# A speed-up of 37 times; the gap only widens for larger
# lengths of largeVector.

# Using apply functions
system.time(outputVector <- sapply(largeVector, function(x) return(x + 1)))
#  user  system elapsed
# 0.124   0.000   0.125
# Even a bit faster.

# Using vectorisation
system.time(outputVector <- largeVector + 1)
#  user  system elapsed
# 0.000   0.000   0.001
# Practically instant, 6780 times faster than the first example.

It is not always clear which method is most suitable and which performs
best, but they all perform much, much better than the naive option of
letting outputVector grow.

cheers,
Paul

--
Paul Hiemstra, Ph.D.
Global Climate Division
Royal Netherlands Meteorological Institute (KNMI)
Wilhelminalaan 10 | 3732 GK | De Bilt | Kamer B 3.39
P.O. Box 201 | 3730 AE | De Bilt
tel: +31 30 2206 494
http://intamap.geo.uu.nl/~paul
http://nl.linkedin.com/pub/paul-hiemstra/20/30b/770
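To see the scaling behaviour described above, here is a small benchmark sketch (added for illustration, with made-up helper names; not part of the original message). Doubling the input size should roughly quadruple the run time of the growing version, while the pre-allocated version scales roughly linearly.

# Hypothetical helper functions, named here only for the comparison.
grow_append <- function(n) {
  out <- NULL
  for (i in seq_len(n)) out <- append(out, i + 1)  # reallocates and copies at every step
  out
}
prealloc <- function(n) {
  out <- numeric(n)                                # allocate the full vector once
  for (i in seq_len(n)) out[i] <- i + 1
  out
}
sizes <- c(10000, 20000, 40000)
sapply(sizes, function(n) system.time(grow_append(n))["elapsed"])
sapply(sizes, function(n) system.time(prealloc(n))["elapsed"])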
On 08/18/2011 07:46 AM, Timothy Bates wrote:
> finalN   = sum(df$x)
> myVar    = rep(NA, finalN)   # pre-allocate the full output vector up front

For posterity, the problem in the OP's code was that myvar kept growing,
forcing R to repeatedly allocate new, larger space for it and copy the
contents across -- a very slow process. In this example the space needed
for myVar is pre-allocated by creating an object of the appropriate
length before the for loop. So, in my opinion, for loops that grow a
vector with append() should be avoided like the plague!

my 2cts :)
Paul
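For completeness, a short sketch of common pre-allocation idioms in base R (an editorial addition, not part of the message above):

n <- 100000
out_num  <- numeric(n)         # doubles, initialised to 0
out_int  <- integer(n)         # integers, initialised to 0
out_chr  <- character(n)       # empty strings
out_list <- vector("list", n)  # list of NULLs, handy when each iteration returns something complex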
On 19.08.2011 15:50, Paul Hiemstra wrote:
> largeVector = runif(10e4)
> outputVector = NULL
> system.time(for(i in 1:length(largeVector)) {

Please do teach people to use seq_along(largeVector) rather than
1:length(largeVector) (the latter is not safe in the case of length-0
objects).

Uwe Ligges
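A short sketch of the problem Uwe points out (added for illustration, not part of his message): when the vector happens to have length zero, 1:length(x) counts down from 1 to 0 instead of producing an empty index, so the loop body runs for two indices that do not exist; seq_along() returns an empty sequence and the loop is skipped.

emptyVector <- numeric(0)
1:length(emptyVector)    # 1 0  -- two bogus indices
seq_along(emptyVector)   # integer(0)

for (i in 1:length(emptyVector)) print(i)    # prints 1 and 0 -- two unwanted iterations
for (i in seq_along(emptyVector)) print(i)   # prints nothing, as intended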