Seeliger.Curt at epamail.epa.gov
2009-Feb-10 22:48 UTC
[R] aggregate taking way too long to count.
Folks,
I'm checking the structure of a dataframe for duplicate parameters at a
site station (i.e., depth should be measured once, not twice), using
aggregate() to count each parameter within a site station. The fake data
below has only 26,000 rows, yet the call takes roughly 14 seconds. My
real data has 750,000 rows, and I had to stop execution after about an
hour. The by() function is faster, but I do not understand how to
accurately associate its results with my test data.
How can I get this to work faster? I can't shake the feeling that it's
something simple. Thanks for any pointers.
df <- data.frame(expand.grid('parameter'=LETTERS
,'station'=letters[1:10]
,'site'=1:100
)
)
df$parameter = as.character(df$parameter)
df$station = as.character(df$station)
df1 <- rbind(df, df[runif(nrow(df)) > 0.99, ])  # add some duplicate parameters
tt <- df1
system.time(tt <- aggregate(I(df1$parameter)
,list('site'=df1$site
,'station'=df1$station
,'parameter'=df1$parameter
)
,function(x) { length(na.omit(x)) }
)
)
system.time(tt2 <- by(I(df1$parameter)
,list('site'=df1$site
,'station'=df1$station
,'parameter'=df1$parameter
)
,function(x) { length(na.omit(x)) }
,simplify=TRUE
)
)
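
One possible way to get counts already paired with their keys (a sketch
along the same lines, using table() instead of by(); the names below
follow the fake data above) is to flatten a table back into a data
frame:

    ## Count every parameter/station/site combination, then flatten the
    ## resulting array so each count sits in a row beside its keys.
    counts <- as.data.frame(table(parameter = df1$parameter,
                                  station   = df1$station,
                                  site      = df1$site))
    counts[counts$Freq > 1, ]  # keys measured more than once

Note that as.data.frame() on a table keeps zero-count combinations, so
filter on Freq rather than assuming every row is an observed key.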
cur
--
Curt Seeliger, Data Ranger
Raytheon Information Services - Contractor to ORD
seeliger.curt@epa.gov
541/754-4638
To quickly see if any duplicates exist, you could use table() and check
for entries that are greater than 1. Use na.omit() on the entire
data.frame before passing it to table(). E.g.,

    tmp <- with(na.omit(df1), table(parameter, station, site))
    sum(tmp > 1)  # number of parameter/station/site keys with >1 entry

That took 0.13 seconds on my machine, where your aggregate call took
18.42 seconds.

To keep only the first entry for a given key, try something like

    df1.nodups <- df1[with(df1,
        !duplicated(paste(sep="\1", parameter, station, site))), ]

That is also very quick (0.06 seconds here).

Bill Dunlap
TIBCO Software Inc - Spotfire Division
wdunlap tibco.com
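
If you also want to see which keys are duplicated, not just how many,
here is a sketch building on the tmp table above (which() with
arr.ind = TRUE recovers the array indices of the over-counted cells):

    ## List the duplicated parameter/station/site keys from tmp.
    bad <- which(tmp > 1, arr.ind = TRUE)
    data.frame(parameter = dimnames(tmp)$parameter[bad[, "parameter"]],
               station   = dimnames(tmp)$station[bad[, "station"]],
               site      = dimnames(tmp)$site[bad[, "site"]])

Note also that duplicated() accepts a data frame directly, so
df1[!duplicated(df1[, c("parameter", "station", "site")]), ] should
select the same rows as the paste() version without building a pasted
key.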