thr3ads.net - R help - [R] practical to loop over 2million rows? [Oct 2012]

Jay Rice

2012-Oct-10 20:31 UTC

[R] practical to loop over 2million rows?

I am new to R and having issues with loops. I am aware that I should use
vectorization whenever possible and use the apply functions; however,
sometimes a loop seems necessary.

I have a data set of 2 million rows and have tried running a couple of loops of
varying complexity to test efficiency. If I do a very simple loop, such as
adding every item in a column, I get an answer quickly.

If I use a nested ifelse statement in a loop it takes me 13 minutes to get
an answer on just 50,000 rows. I am aware of a few methods to speed up
loops: preallocating memory and computing as much as possible outside the
loop (or creating functions and just looping over the function). But
it seems that even with these speed-ups I might have too much data to run
loops. Here is the loop I ran that took 13 minutes. I realize I can
accomplish the same goal using vectorization (and in fact did so).

y<-numeric(length(x))
for(i in 1:length(x))
ifelse(!is.na(x[i]), y[i]<-x[i],
ifelse(strataID[i+1]==strataID[i], y<-x[i+1], y<-x[i-1]))

Presumably, complicated loops would be more intensive than the nested if
statement above. If I write more efficient loops time will come down but I
wonder if I will ever be able to write efficient enough code to perform a
complicated loop over 2 million rows in a reasonable time.

Is it useless for me to try to do any complicated loops on 2 million rows,
or if I get much better at programming in R will it be manageable even for
complicated situations?


Jay

Joshua Wiley

2012-Oct-10 21:06 UTC

Hi Jay,

A few comments.

1) As you know, vectorize when possible.  Even if you must have a
loop, perhaps you can avoid nested loops or at least speed each
iteration.
2) Write your loop in a function and then byte compile it using the
cmpfun() function from the compiler package.  This can help
dramatically (though still not to the extent of vectorization).
3) If you really need to speed up some aspect and are stuck with a
loop, check out the R + Rcpp + inline + C++ tool chain, which allows
you to write inline C++ code, compile it fairly easily, and move data
to and from it.
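
As a rough sketch of point 2, byte compiling a loop function with cmpfun() might look like the following. The function name fill_loop and the toy data are hypothetical, made up for illustration and not taken from the original post. Note that since R 3.4.0 the JIT byte-compiles functions by default, so on a current R the gain from calling cmpfun() by hand is likely to be small.

```r
library(compiler)  # ships with base R

# Hypothetical example function: copy x into y, replacing NA with 0,
# using an explicit loop.
fill_loop <- function(x) {
  y <- numeric(length(x))          # preallocate the result
  for (i in seq_along(x)) {
    y[i] <- if (is.na(x[i])) 0 else x[i]
  }
  y
}

fill_loop_c <- cmpfun(fill_loop)   # byte-compiled version of the same function

x <- c(1.5, NA, 3, NA, 5)
res_plain    <- fill_loop(x)
res_compiled <- fill_loop_c(x)     # same result, only the execution path differs
```

Wrapping both calls in system.time() on a long vector is a simple way to see whether compilation helps on a given R version.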

Here is an example of a question I answered on SO where the OP had an
algorithm to implement in R, and I ran through the R implementation,
the compiled R implementation, and one using Rcpp, and compared timings.
It should give you a bit of a sense of what you are dealing with at
least.

You are correct that some things can help speed in R loops, such as
preallocation, and also depending what you are doing, some classes are
faster than others.  If you are working with a vector of integers,
don't store them as doubles in a data frame (that is a silly extreme,
but hopefully you get the point).
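
To make the storage point concrete, here is a small sketch comparing the memory used by an integer vector and a double vector of the same length. The variable names are illustrative only, and memory footprint is just one of the ways the choice of class can matter.

```r
n <- 1e6

as_int <- integer(n)   # integers: 4 bytes per element
as_dbl <- numeric(n)   # doubles:  8 bytes per element

size_int <- as.numeric(utils::object.size(as_int))
size_dbl <- as.numeric(utils::object.size(as_dbl))

# The double vector takes roughly twice the memory of the integer vector.
print(c(integer = size_int, double = size_dbl))
```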

Good luck,

Josh

-- 
Joshua Wiley
Ph.D. Student, Health Psychology
Programmer Analyst II, Statistical Consulting Group
University of California, Los Angeles
https://joshuawiley.com/

David Winsemius

2012-Oct-10 21:16 UTC

On Oct 10, 2012, at 1:31 PM, Jay Rice wrote:
> New to R and having issues with loops. I am aware that I should use
> vectorization whenever possible and use the apply functions, however,
> sometimes a loop seems necessary.
> 
> I have a data set of 2 million rows and have tried run a couple of loops of
> varying complexity to test efficiency. If I do a very simple loop such as
> add every item in a column I get an answer quickly.
> 
> If I use a nested ifelse statement in a loop it takes me 13 minutes to get
> an answer on just 50,000 rows. I am aware of a few methods to speed up
> loops. Preallocating memory space and compute as much outside of the loop
> as possible (or use create functions and just loop over the function) but
> it seems that even with these speed ups I might have too much data to run
> loops.  Here is the loop I ran that took 13 minutes. I realize I can
> accomplish the same goal using vectorization (and in fact did so).
You should describe what you want to do, and you should learn to use the
vectorized capabilities of R and leave for-loops for the processes that
really need them.

> 
> y<-numeric(length(x))
> 
> for(i in 1:length(x))
> 
> ifelse(!is.na(x[i]), y[i]<-x[i],
Instead:

y[!is.na(x)] <- x[!is.na(x)]  # No loop.
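
A tiny worked example of that assignment, using made-up values for x, shows what the logical index does. Storing the mask in a variable (here called keep, a name chosen for illustration) avoids computing !is.na(x) twice.

```r
x <- c(1, NA, 3, NA, 5)   # toy data with two missing values
y <- numeric(length(x))   # preallocated, all zeros to start with

keep <- !is.na(x)         # logical mask, computed once
y[keep] <- x[keep]        # copy every non-missing value in a single step

print(y)                  # positions where x was NA keep their initial 0
```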

> 
> ifelse(strataID[i+1]==strataID[i], y<-x[i+1], y<-x[i-1]))
When you index outside the range of the length of x you get NA as a result.
Furthermore you are setting y to be only a single element. So I think
'y' will be a single NA at the end of all this.
> strataID <- sample(1:2, 10, repl=TRUE)
> strataID
 [1] 1 1 2 2 1 2 2 2 2 1
> for(i in 1:length(x)) {ifelse(strataID[i+1]==strataID[i], y<-x[i+1], y<-x[i-1])}
> y
[1] NA

 There is no implicit indexing of the LHS of an assignment operation. How long
is strataID? And why not do this inside a dataframe?
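
The two indexing points can be checked on a toy vector. This is only an illustrative sketch with made-up values: indexing past the end of a vector returns NA, and assigning to y without an index replaces the whole of y rather than one element.

```r
x <- c(10, 20, 30)

beyond <- x[length(x) + 1]   # indexing past the end gives NA, not an error
before <- x[0]               # index 0 gives a zero-length vector

y <- numeric(length(x))
y <- x[2]                    # no index on the left: y is now a single value
len_whole <- length(y)

y <- numeric(length(x))
y[2] <- x[2]                 # index on the left: only element 2 changes
len_indexed <- length(y)
```

This is why the loop in the original post needs y[i] on the left-hand side and a loop range that keeps i-1 and i+1 inside the vector.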
> 
> Presumably, complicated loops would be more intensive than the nested if
> statement above. If I write more efficient loops time will come down but I
> wonder if I will ever be able to write efficient enough code to perform a
> complicated loop over 2 million rows in a reasonable time.
> 
> Is it useless for me to try to do any complicated loops on 2 million rows,
> or if I get much better at programming in R will it be manageable even for
> complicated situations?
> 
You will gain efficiency when you learn vectorization. And when you learn to
test your code for correct behavior.
David Winsemius, MD
Alameda, CA, USA

S Ellison

2012-Oct-11 13:28 UTC

> If I use a nested ifelse statement in a loop it takes me 13 
> minutes to get an answer on just 50,000 rows. 
> ...
> ifelse(strataID[i+1]==strataID[i], y<-x[i+1], y<-x[i-1]))
Maybe take a closer look at the ifelse help page and the examples?

First, ifelse is intended to be vectorized. If you nest it in a loop, you're
effectively nesting a loop inside a loop. And by putting ifelse inside ifelse,
you've done that twice. And then you've run the loops on vectors of
length one, so 'twas all in vain...
Second, the two things after the condition in ifelse are not instructions, they
are arguments to the function. Putting y<-something in as an argument means
'(promise to) store something in a variable called y, and then pass y to the
function'. You probably didn't mean that.
Third, ifelse returns a vector of the results; you're not using the return
value for anything.
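
A short sketch of the difference, on made-up data: ifelse() takes whole vectors and returns a vector that you then assign, whereas 'if' and 'else' test a single condition.

```r
x <- c(1, NA, 3)

# Vectorized: one call handles every element, and the result is assigned.
v <- ifelse(is.na(x), 0, x)

# Scalar: 'if' ... 'else' tests one condition and returns one value.
i <- 2
s <- if (is.na(x[i])) 0 else x[i]

print(v)
print(s)
```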

For a single 'if' that takes some action, you want 'if' and
'else' _separately_, not 'ifelse'. So if you must do this with a loop, it
would look more like

y <- numeric(length(x))  # preallocate y at its full length
for(i in 1:(length(x)-1)) {  # stop one short, because x[i+1] won't be there for the last i otherwise
	if (!is.na(x[i])) y[i] <- x[i]
	# I changed the second x index because I can't see why it differed from the strataID index
	if (strataID[i+1]==strataID[i]) y[i] <- x[i+1] else y[i] <- x[i]
	# or, using the fact that 'if' also returns something:
	# y[i] <- if (strataID[i+1]==strataID[i]) x[i+1] else x[i]
}

Finally, if you don't preallocate y at the length you want, R will have to
move the whole of y to a new memory location with one more space every time you
append something to it. There's a section on that in The R Inferno. Growing
a vector that way is a really good way of slowing R down.
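
As an illustrative sketch of the preallocation point, the two hypothetical functions below (names invented here) build the same vector, one by growing it with c() and one by filling a preallocated vector. Timings depend on the machine and R version, so none are quoted.

```r
# Grows the result one element at a time: y is copied on every iteration.
grow <- function(n) {
  y <- c()
  for (i in seq_len(n)) y <- c(y, i^2)
  y
}

# Allocates the full length once, then fills in place.
prealloc <- function(n) {
  y <- numeric(n)
  for (i in seq_len(n)) y[i] <- i^2
  y
}

n <- 20000
t_grow     <- system.time(res_grow <- grow(n))
t_prealloc <- system.time(res_prealloc <- prealloc(n))

print(rbind(grow = t_grow[1:3], prealloc = t_prealloc[1:3]))
```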

So let's try something else.
strataID <- sample(letters[1:3], 2000000, replace=TRUE)  # a nice long strata identifier with some matches likely
x <- rnorm(2000000)  # some random numbers
x <- ifelse(x < -2, NA, x)  # a few NAs now in x, though it does take a few seconds for the 2 million observations

i <- 1:(length(x)-1)  # a long indexing vector with space for the last x[i+1]
y <- x  # that puts all the NAs in the right place in y, allocates y, and happens to put all the current values of x into y too
system.time( y[i] <- ifelse(strataID[i+1]==strataID[i], x[i+1], x[i]) )
# does the whole loop and stores it in the 'right' places in y,
# though it will foul up those NAs because of your x indexing.
# Incidentally, it doesn't change the last y either.
# On my allegedly 2GHz machine the system time result was 2.87 seconds for the 2 million 'rows'.


# Incidentally, a look at what we ended up with:
data.frame(s=strataID, y=y)[1:30,]
# says you probably aren't getting anything useful from the exercise other than a
# feel for what can go wrong with loops.
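
For comparison, here is one possible vectorized reading of what the original loop seems to be aiming for. This is an assumption about the intent, not something the poster confirmed: keep x[i] when it is present, otherwise take x[i+1] if the next row is in the same stratum, otherwise take x[i-1]. The function name fill_from_neighbour and the toy data are hypothetical.

```r
fill_from_neighbour <- function(x, strataID) {
  n <- length(x)
  nxt <- c(x[-1], NA)    # x[i+1], with NA for the last row
  prv <- c(NA, x[-n])    # x[i-1], with NA for the first row
  # TRUE where row i+1 is in the same stratum as row i; FALSE for the last row
  same_next <- c(strataID[-1] == strataID[-n], FALSE)
  ifelse(!is.na(x), x, ifelse(same_next, nxt, prv))
}

x        <- c(1, NA, 3, NA, 5, NA)
strataID <- c("a", "a", "a", "b", "c", "c")
filled   <- fill_from_neighbour(x, strataID)
print(filled)
```

Unlike the loop, this never indexes outside the vector; a missing value whose chosen neighbour is also missing simply stays NA.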