Greg Tarpinian
2006-Oct-05 05:38 UTC
[R] searching for data.frame rows / processing of rows
R 2.3.1, WinXP: I have a puzzling problem that I suspect may be solved using grep or a regular expression but am at a loss how to actually do it... My data.frame looks like Location Time X Y -------- ---- --- --- 1 0 1.6 9.3 1 3 4.2 10.4 1 6 2.7 16.3 2 0 0.5 2.1 2 3 NA 3.6 2 3 5.0 0.06 2 6 3.4 14.0 and so forth. I would like to search for duplicate Time values within a Location and take the numerical average (where possible) of the elements in X and Y. These numerical averages should then be used to create a single row where multiple rows once existed. So, I would like to obtain 2 3 5.0 1.83 for the two Time = 3 rows for Location = 2 and use it to replace these two rows. Ideally, avoiding for(i in 1:blah) loops would be nice because the data.frame has about 10,000 rows that need to be searched and processed. My intent is to do some comparing of SAS to R -- the DATA step processing in SAS is quite fast and using the RETAIN statement along with the LAG( ) function allows this sort of thing to be done rapidly. Thanks in advance, Greg
Gabor Grothendieck
2006-Oct-05 06:02 UTC
[R] searching for data.frame rows / processing of rows
Grouping the data frame by the first two columns, apply colMeans and then rbind the resulting by-structure together: do.call(rbind, by(DF, DF[2:1], colMeans, na.rm = TRUE)) On 10/5/06, Greg Tarpinian <sasprog474 at yahoo.com> wrote:> R 2.3.1, WinXP: > > I have a puzzling problem that I suspect may be solved using > grep or a regular expression but am at a loss how to actually do it... > My data.frame looks like > > Location Time X Y > -------- ---- --- --- > 1 0 1.6 9.3 > 1 3 4.2 10.4 > 1 6 2.7 16.3 > 2 0 0.5 2.1 > 2 3 NA 3.6 > 2 3 5.0 0.06 > 2 6 3.4 14.0 > > and so forth. I would like to search for duplicate Time values > within a Location and take the numerical average (where possible) > of the elements in X and Y. These numerical averages should > then be used to create a single row where multiple rows once > existed. So, I would like to obtain > > 2 3 5.0 1.83 > > for the two Time = 3 rows for Location = 2 and use it to replace > these two rows. Ideally, avoiding for(i in 1:blah) loops would be > nice because the data.frame has about 10,000 rows that need to > be searched and processed. My intent is to do some comparing of > SAS to R -- the DATA step processing in SAS is quite fast and > using the RETAIN statement along with the LAG( ) function allows > this sort of thing to be done rapidly. > > > Thanks in advance, > > Greg > > ______________________________________________ > R-help at stat.math.ethz.ch mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >
I tought that aggregate was the way to go, but only for large dataframes it is faster.> df <- read.table(stdin(),header=TRUE)0: Location Time X Y 1: 1 0 1.6 9.3 2: 1 3 4.2 10.4 3: 1 6 2.7 16.3 4: 2 0 0.5 2.1 5: 2 3 NA 3.6 6: 2 3 5.0 0.06 7: 2 6 3.4 14.0 8:> aggregate(df[,3:4],df[,1:2],FUN=mean,na.rm=TRUE)Location Time X Y 1 1 0 1.6 9.30 2 2 0 0.5 2.10 3 1 3 4.2 10.40 4 2 3 5.0 1.83 5 1 6 2.7 16.30 6 2 6 3.4 14.00> system.time( aggregate(df[,3:4],df[,1:2],FUN=mean,na.rm=TRUE) )[1] 0.008 0.000 0.008 0.000 0.000> > system.time( do.call(rbind, by(df, df[2:1], colMeans, na.rm = TRUE)))[1] 0.005 0.000 0.005 0.000 0.000> > > df <- data.frame(Location=rep(1:50,50),+ Time=sample(rep(1:50,each=10),2500,replace=TRUE), + X=runif(2500),Y=runif(2500))> > system.time( aggregate(df[,3:4],df[,1:2],FUN=mean,na.rm=TRUE) )[1] 0.162 0.000 0.163 0.000 0.000> > system.time( do.call(rbind, by(df, df[2:1], colMeans, na.rm = TRUE)))[1] 2.179 0.006 2.216 0.000 0.000 Kees Gabor Grothendieck wrote:> Grouping the data frame by the first two columns, apply colMeans > and then rbind the resulting by-structure together: > > do.call(rbind, by(DF, DF[2:1], colMeans, na.rm = TRUE)) > > > On 10/5/06, Greg Tarpinian <sasprog474 at yahoo.com> wrote: >> R 2.3.1, WinXP: >> >> I have a puzzling problem that I suspect may be solved using >> grep or a regular expression but am at a loss how to actually do it... >> My data.frame looks like >> >> Location Time X Y >> -------- ---- --- --- >> 1 0 1.6 9.3 >> 1 3 4.2 10.4 >> 1 6 2.7 16.3 >> 2 0 0.5 2.1 >> 2 3 NA 3.6 >> 2 3 5.0 0.06 >> 2 6 3.4 14.0 >> >> and so forth. I would like to search for duplicate Time values >> within a Location and take the numerical average (where possible) >> of the elements in X and Y. These numerical averages should >> then be used to create a single row where multiple rows once >> existed. So, I would like to obtain >> >> 2 3 5.0 1.83 >> >> for the two Time = 3 rows for Location = 2 and use it to replace >> these two rows. Ideally, avoiding for(i in 1:blah) loops would be >> nice because the data.frame has about 10,000 rows that need to >> be searched and processed. My intent is to do some comparing of >> SAS to R -- the DATA step processing in SAS is quite fast and >> using the RETAIN statement along with the LAG( ) function allows >> this sort of thing to be done rapidly. >> >> >> Thanks in advance, >> >> Greg >> >> ______________________________________________ >> R-help at stat.math.ethz.ch mailing list >> https://stat.ethz.ch/mailman/listinfo/r-help >> PLEASE do read the posting guidehttp://www.R-project.org/posting-guide.html>> and provide commented, minimal, self-contained, reproducible code. >> > > ______________________________________________ > R-help at stat.math.ethz.ch mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >