Ishwor
2009-Aug-13 12:07 UTC
[R] need technique for speeding up R dataframe individual element insertion (no deletion though)
Hi fellas,

I am working on a dataframe cam and it involves comparing consecutive rows on 2 columns, end_date and vol (x1/x2 and t1/t2 in the code below), on about 20K rows and 14 columns.

###
cap = cam; # this doesn't take long. ~1 secs.

for( i in 1:length(cam$end_date))
{
  x1=strptime(cam$end_date[i], "%d/%m/%Y");
  x2=strptime(cam$end_date[i+1], "%d/%m/%Y");

  t1= cam$vol[i];
  t2= cam$vol[i+1];

  if(!is.na(x2) && !is.na(x1) && !is.na(t1) && !is.na(t2))
  {
    if( (x2>=x1) && (t1==t2) ) # date and vol
    {
      cap$levels[i]=1; #make change to specific dataframe cell
      cap$levels[i+1]=1;
    }
  }
}
###

Having coded that, I ran a timing profile on this section, and every 1000 row comparisons take ~1.1 minutes on a 2.8 GHz dual-core box (which is a test box we use). This obviously computes to ~21 minutes for 20k, which is definitely not where we want it headed.

I believe optimisation (or even a different way to address indexing inside the dataframe) can be had inside the innermost `if' and specifically in `cap$levels[i]=1;', but I am a bit at a loss, having scoured the documentation and failed to find anything of value.

So, my question remains: are there any general/specific changes I can make to speed up the code execution dramatically?

Thanks folks.

--
Regards,
Ishwor Gurung
jim holtman
2009-Aug-13 12:25 UTC
[R] need technique for speeding up R dataframe individual element insertion (no deletion though)
First of all, do the strptime conversions one time outside the loop. I would guess that if you ran Rprof on the code, most of the time is in that routine -- did you run Rprof?

Also, you are going through the loop one too many times: your ending value is 'length(cam$end_date)', and then you are indexing one greater than that in the loop with 'x2=strptime(cam$end_date[i+1], "%d/%m/%Y");'.

FYI -- you don't need the semicolons at the end of the statements.

--
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?
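The two points in this reply (parse the dates once, and stop the loop one row early) can be sketched roughly as follows. This is only an illustration, not code from the thread: the toy data frame is made up, as.Date() is swapped in for strptime() on the assumption that the column holds plain dates with no time of day, and collecting the result in a plain vector before writing it back is an extra step not mentioned in the reply.

```r
# Toy stand-in for the poster's data frame `cam` (values are made up).
cam <- data.frame(
  end_date = c("01/01/2009", "02/01/2009", "02/01/2009", "01/01/2009", NA),
  vol      = c(10, 10, 10, 10, 10),
  levels   = 0,
  stringsAsFactors = FALSE
)

# Parse every date once, before the loop.
# Assumption: as.Date() replaces strptime() because there is no time of day.
dates <- as.Date(cam$end_date, format = "%d/%m/%Y")
vol   <- cam$vol
lev   <- cam$levels          # work on a plain vector, not on the data frame

n <- nrow(cam)
for (i in seq_len(n - 1)) {  # stop at n - 1 so that i + 1 stays in range
  x1 <- dates[i]
  x2 <- dates[i + 1]
  t1 <- vol[i]
  t2 <- vol[i + 1]
  if (!is.na(x1) && !is.na(x2) && !is.na(t1) && !is.na(t2)) {
    if (x2 >= x1 && t1 == t2) {
      lev[i]     <- 1
      lev[i + 1] <- 1
    }
  }
}

cap <- cam
cap$levels <- lev            # one assignment back into the data frame
```

Wrapping the loop in system.time(), or running it under Rprof() as the reply suggests, should show whether the date parsing was indeed the dominant cost on the real data.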
Bill.Venables at csiro.au
2009-Aug-13 12:44 UTC
[R] need technique for speeding up R dataframe individual element insertion (no deletion though)
Why do you need an explicit loop at all? (Also, your loop goes over i in 1:length(cam$end_date) but your code refers to cam$end_date[i+1] -->||<--!!)

Here is a suggestion. You want to identify places where the date increases but the volume does not change. OK, where?

ind <- with(cam, {
  dx <- as.numeric(diff(strptime(end_date, "%d/%m/%Y")))
  dt <- diff(vol)
  which(dx > 0 & dt == 0)
})

Now adjust the new data frame

cap <- within(cam, {
  levels[ind] <- 1
  levels[ind+1] <- 1
})

Of course this is untested code, so caveat emptor!

Bill Venables.
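Since the suggestion above is described as untested, here is a small self-contained sketch of the same vectorised idea on made-up data. The toy data frame, the use of as.Date() in place of strptime(), and direct indexing in place of within() are all assumptions made for illustration. It also shows one difference worth checking against the real requirement: the original loop tests x2>=x1, while the reply tests dx > 0, so rows with equal dates are treated differently.

```r
# Toy stand-in for the poster's data frame `cam` (values are made up).
cam <- data.frame(
  end_date = c("01/01/2009", "02/01/2009", "02/01/2009", "01/01/2009", NA),
  vol      = c(10, 10, 10, 10, 10),
  levels   = 0,
  stringsAsFactors = FALSE
)

# Assumption: plain dates, so as.Date() is used instead of strptime().
dates <- as.Date(cam$end_date, format = "%d/%m/%Y")

# diff() compares each row with the next one, giving nrow(cam) - 1 values.
dx <- as.numeric(diff(dates))   # days from row i to row i + 1
dt <- diff(cam$vol)             # change in volume from row i to row i + 1

# which() drops NA comparisons, which takes the place of the is.na() checks.
ind_strict <- which(dx >  0 & dt == 0)  # date strictly increases, as in the reply
ind_loop   <- which(dx >= 0 & dt == 0)  # matches x2 >= x1 in the original loop

# Mark both rows of every matching pair in a copy of the data frame.
cap <- cam
cap$levels[c(ind_loop, ind_loop + 1)] <- 1
```

On this toy data the two index vectors differ only for the pair of rows with equal dates, so which comparison to use depends on whether repeated dates should count as a match.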