Dear R community,

I have a 2 million by 2 matrix that looks like this:

x <- sample(1:15, 2000000, replace=T)
y <- sample(1:10*1000, 2000000, replace=T)

      x     y
[1,] 10  4000
[2,]  3  1000
[3,]  3  4000
[4,]  8  6000
[5,]  2  9000
[6,]  3  8000
[7,]  2 10000
(...)

The first column is a population expansion factor for the number in the
second column (household income). I want to expand the second column by
the first, so that I end up with a vector beginning with 10 observations
of 4000, then 3 observations of 1000, and so on. In my mind the natural
approach would be to create a NULL vector and append the expansions:

myvar <- NULL
myvar <- append(myvar, replicate(x[1], y[1]), 1)

for (i in 2:length(x)) {
  myvar <- append(myvar, replicate(x[i], y[i]), sum(x[1:i]) + 1)
}

to end up with a vector of length sum(x), which in my real database
corresponds to 22 million observations.

This works fine -- if I only run it for the first, say, 1000
observations. If I try to perform it on all 2 million observations it
takes far too long to be useful (I left it running for 11 hours
yesterday to no avail).

I know R performs well with operations on relatively large vectors. Why
is this so inefficient? And what would be the smart way to do this?

Thanks in advance.
Alex
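To make the goal concrete, here is a small sketch (added for clarity, not part of the original post) of what the expanded vector should look like for the first three rows shown above:

# Rows (10, 4000), (3, 1000) and (3, 4000) should expand to:
wanted <- c(rep(4000, 10), rep(1000, 3), rep(4000, 3))
wanted
#  [1] 4000 4000 4000 4000 4000 4000 4000 4000 4000 4000 1000 1000 1000 4000 4000 4000
length(wanted)   # 16, i.e. 10 + 3 + 3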
> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org]
> On Behalf Of Alex Ruiz Euler
> Sent: Wednesday, August 17, 2011 3:54 PM
> To: r-help at r-project.org
> Subject: [R] More efficient option to append()?
>
> I know R performs well with operations on relatively large vectors. Why
> is this so inefficient? And what would be the smart way to do this?

Alex,

does the following do what you want?

myvar <- rep(y, x)

Hope this is helpful,

Dan

Daniel Nordlund
Bothell, WA USA
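For scale, a quick sketch of this one-liner on the full two-million-row example (added for illustration; timings vary by machine, but rep() typically finishes in well under a second here, versus the hours the append() loop was taking):

x <- sample(1:15, 2000000, replace = TRUE)
y <- sample(1:10 * 1000, 2000000, replace = TRUE)
system.time(myvar <- rep(y, times = x))  # rep() repeats each y[i] exactly x[i] times
length(myvar) == sum(x)                  # TRUE -- the expanded vector has the expected length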
Daniel, it works, thanks for your time with this simple matter.

Best,
Alex

On Wed, 17 Aug 2011 16:35:48 -0700 "Daniel Nordlund"
<djnordlund at frontier.com> wrote:

> Alex,
>
> does the following do what you want?
>
> myvar <- rep(y, x)
>
> Hope this is helpful,
>
> Dan
This takes a few seconds to do 1 million lines, and keeps the explicit
for-loop form:

numberofSalaryBands = 1000000 # 2000000
x        = sample(1:15, numberofSalaryBands, replace=T)
y        = sample((1:10)*1000, numberofSalaryBands, replace=T)
df       = data.frame(x, y)
finalN   = sum(df$x)
myVar    = rep(NA, finalN)   # pre-allocate the full output vector up front
outIndex = 1
for (i in 1:numberofSalaryBands) {
  kount = df$x[i]
  myVar[outIndex:(outIndex + kount - 1)] = rep(df$y[i], kount)  # make x[i] copies of value y[i]
  outIndex = outIndex + kount
}
head(myVar)
plyr::count(myVar)

On Aug 18, 2011, at 12:17 AM, Alex Ruiz Euler wrote:

> I know R performs well with operations on relatively large vectors. Why
> is this so inefficient? And what would be the smart way to do this?
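A small check worth noting here (an editorial sketch, not part of Timothy's message, assuming the df and myVar objects from the code above): the pre-allocated loop and the vectorised rep() call should produce the same result, so either can be validated against the other on a small run.

vecVar <- rep(df$y, times = df$x)  # vectorised equivalent of the loop above
identical(myVar, vecVar)           # should be TRUE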
Hi:

This seems to take a bit less code, avoids explicit loops (by using
mapply() instead, where the loops are internal) and takes about 10
seconds on my system:

m <- cbind(x = sample(1:15, 2000000, replace=T),
           y = sample(1:10*1000, 2000000, replace=T))
sum(m[, 1])
# [1] 16005804

ff <- function(x, y) rep(y, x)
system.time(w <- do.call(c, mapply(ff, m[, 1], m[, 2])))
#   user  system elapsed
#   9.75    0.00    9.75

length(w)
# [1] 16005804

count(w)   # count() from the plyr package
#        x    freq
# 1   1000 1603184
# 2   2000 1590599
# 3   3000 1596661
# 4   4000 1607112
# 5   5000 1598571
# 6   6000 1599195
# 7   7000 1600475
# 8   8000 1601718
# 9   9000 1598896
# 10 10000 1609393

HTH,
Dennis

PS: It would have been a good idea to keep the OP in the loop of this
thread.

On Thu, Aug 18, 2011 at 12:46 AM, Timothy Bates
<timothy.c.bates at gmail.com> wrote:

> This takes a few seconds to do 1 million lines, and keeps the explicit
> for-loop form:
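A follow-up sketch (not part of Dennis's message, assuming the matrix m and vector w defined above): rep() accepts a vector for its times argument, so the mapply()/do.call() step can be dropped entirely, and the one-liner should return the same expanded vector in a fraction of the time.

system.time(w2 <- rep(m[, 2], times = m[, 1]))  # expand each m[i, 2] by m[i, 1]
identical(w, w2)                                # should be TRUE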
On 08/17/2011 10:53 PM, Alex Ruiz Euler wrote:
> I know R performs well with operations on relatively large vectors. Why
> is this so inefficient? And what would be the smart way to do this?

Hi Alex,

The other replies already gave you the R way of doing this while avoiding
the for loop. However, there is a more general reason why your for loop
is terribly inefficient. A small set of examples:

largeVector = runif(10e4)   # 100,000 random values
outputVector = NULL
system.time(for(i in 1:length(largeVector)) {
  outputVector = append(outputVector, largeVector[i] + 1)
})
#  user  system elapsed
# 6.591   0.168   6.786

The problem in this code is that outputVector keeps on growing. Every
time it grows, R has to allocate a new, larger block of memory and copy
the existing contents over, and that is really slow. Several (much)
faster alternatives exist:

# Pre-allocating the outputVector
outputVector = rep(0, length(largeVector))
system.time(for(i in 1:length(largeVector)) {
  outputVector[i] = largeVector[i] + 1
})
#  user  system elapsed
# 0.178   0.000   0.178
# A speed-up of 37 times; the gap only widens for larger
# lengths of largeVector.

# Using apply functions
system.time(outputVector <- sapply(largeVector, function(x) return(x + 1)))
#  user  system elapsed
# 0.124   0.000   0.125
# Even a bit faster.

# Using vectorisation
system.time(outputVector <- largeVector + 1)
#  user  system elapsed
# 0.000   0.000   0.001
# Practically instant, 6780 times faster than the first example.

It is not always clear which method is most suitable and which performs
best, but they all perform much, much better than the naive option of
letting outputVector grow.

cheers,
Paul

--
Paul Hiemstra, Ph.D.
Global Climate Division
Royal Netherlands Meteorological Institute (KNMI)
Wilhelminalaan 10 | 3732 GK | De Bilt | Kamer B 3.39
P.O. Box 201 | 3730 AE | De Bilt
tel: +31 30 2206 494
http://intamap.geo.uu.nl/~paul
http://nl.linkedin.com/pub/paul-hiemstra/20/30b/770
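To see the scaling behaviour described above, here is a small benchmark sketch (added for illustration, with made-up helper names; not part of the original message). Doubling the input size should roughly quadruple the run time of the growing version, while the pre-allocated version scales roughly linearly.

# Hypothetical helper functions, named here only for the comparison.
grow_append <- function(n) {
  out <- NULL
  for (i in seq_len(n)) out <- append(out, i + 1)  # reallocates and copies at every step
  out
}
prealloc <- function(n) {
  out <- numeric(n)                                # allocate the full vector once
  for (i in seq_len(n)) out[i] <- i + 1
  out
}
sizes <- c(10000, 20000, 40000)
sapply(sizes, function(n) system.time(grow_append(n))["elapsed"])
sapply(sizes, function(n) system.time(prealloc(n))["elapsed"])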
On 08/18/2011 07:46 AM, Timothy Bates wrote:
> finalN   = sum(df$x)
> myVar    = rep(NA, finalN)   # pre-allocate the full output vector up front

For posterity, the problem in the OP's code was that myvar kept growing,
forcing R to repeatedly allocate new, larger space for it and copy the
contents across -- a very slow process. In this example the space needed
for myVar is pre-allocated by creating an object of the appropriate
length before the for loop. So, in my opinion, for loops that grow a
vector with append() should be avoided like the plague!

my 2cts :)
Paul
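For completeness, a short sketch of common pre-allocation idioms in base R (an editorial addition, not part of the message above):

n <- 100000
out_num  <- numeric(n)         # doubles, initialised to 0
out_int  <- integer(n)         # integers, initialised to 0
out_chr  <- character(n)       # empty strings
out_list <- vector("list", n)  # list of NULLs, handy when each iteration returns something complex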
On 19.08.2011 15:50, Paul Hiemstra wrote:
> largeVector = runif(10e4)
> outputVector = NULL
> system.time(for(i in 1:length(largeVector)) {

Please do teach people to use seq_along(largeVector) rather than
1:length(largeVector) (the latter is not safe in the case of length-0
objects).

Uwe Ligges
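A short sketch of the problem Uwe points out (added for illustration, not part of his message): when the vector happens to have length zero, 1:length(x) counts down from 1 to 0 instead of producing an empty index, so the loop body runs for two indices that do not exist; seq_along() returns an empty sequence and the loop is skipped.

emptyVector <- numeric(0)
1:length(emptyVector)    # 1 0  -- two bogus indices
seq_along(emptyVector)   # integer(0)

for (i in 1:length(emptyVector)) print(i)    # prints 1 and 0 -- two unwanted iterations
for (i in seq_along(emptyVector)) print(i)   # prints nothing, as intended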