Kevin B. Hendricks
2006-Jul-27 13:31 UTC
[Rd] Any interest in "merge" and "by" implementations specifically for sorted data?
Hi Developers,

I am looking for another new project to help me get more up to speed on R and to learn something outside of R internals. One recent R issue I have run into is finding a fast implementation of the equivalent of the following SAS code:

    /* MDPC is an integer sort key made from two integer columns */
    MDPC = (MD * 100000) + PCO;

    /* sort the dataset by the key */
    PROC SORT;
    BY MDPC;

    /* print out count and sum for each unique sort key (subgroup) */
    /* use of BY MDPC requires sorting the data set by MDPC first in SAS */
    PROC UNIVARIATE NOPRINT;
    VAR MVE;
    BY MDPC;
    OUTPUT OUT=TMP0 N=XN SUM=XSUM;

This is easy to do in R, but the data set it runs on has 1,742,201 lines and takes 196,868,713 bytes to store as character data. The sort key easily has over 200,000 unique values (if not twice that).

My first R attempt was simple:

    # sort the data.frame gd and the sort key
    sorder <- order(MDPC)
    gd <- gd[sorder,]
    MDPC <- MDPC[sorder]
    attach(gd)

    # find the length and sum for each unique sort key
    XN <- by(MVE, MDPC, length)
    XSUM <- by(MVE, MDPC, sum)
    GRPS <- levels(as.factor(MDPC))

The ordering and sorting were reasonably fast, but the first "by" statement was still running 4 hours later on my machine (a dual 2.6 GHz Opteron with 4 GB of main memory). The same snippet of code in SAS, running on a slower machine, takes about 5 minutes of system time.

I tried various simple R implementations of a "by_sorted" that I thought might help:

    # walk the sorted array once, keep track of the beginning
    # and ending points of each unique sort key value in p,
    # run the function fcn on that subsequence of vector v,
    # and store the results in a vector
    by_sorted <- function(v, p, fcn) {
        key <- p[[1]]
        bp <- 1
        r <- NULL
        for (i in 2:length(p)) {
            if (key != p[[i]]) {
                r <- c(r, fcn(v[bp:(i-1)]))
                bp <- i
                key <- p[[i]]
            }
        }
        c(r, fcn(v[bp:i]))
    }

but they also took "forever" to run (read: I killed those attempts after 15 minutes of CPU time). I had literally the same issue when trying "tapply".

So unless it already exists someplace, I need a really fast implementation of "by" for very large sorted data sets (probably written in Fortran or C) that does the equivalent of what SAS does with its "proc univariate by" approach, with close to the same performance. The code should only have to walk the array once (i.e. be linear in time in the number of rows of the vector). I have similar issues with "merge" as well, since merging data frames already sorted by the same sort key should be fast, and it does not appear to be.

Before I jump in and create my own "sorted large data set" versions of "by" and "merge", I wanted to be sure they would be of interest to others. If they work well and are well implemented (a big if, since I am really still just learning this - the whole point of the project!), would something like this be of any interest for internal use in R? Or is this too specialized? Is there an R function implemented in C or Fortran that would make a good "model" to follow for implementing something like this? Would/should they be extensions of the current implementations of "merge" and "by" with the addition of a sorted=TRUE (defaulting to FALSE) parameter?

Or am I simply barking up the wrong tree here?

Thanks,

Kevin
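For the grouped count and sum specifically, the single pass over an already-sorted key can also be expressed with vectorized primitives. A minimal sketch, assuming the key is already sorted; the function and variable names below are illustrative only, and the cumsum-based sums can differ from per-group sums in the last few floating-point digits:

    # vectorized grouped count/sum on an already-sorted key (sketch)
    grouped_stats_sorted <- function(v, key) {
        stopifnot(length(v) == length(key))
        n <- length(key)
        # positions where a new key value starts and where each run ends
        starts <- c(1L, which(key[-1L] != key[-n]) + 1L)
        ends   <- c(starts[-1L] - 1L, n)
        cs     <- cumsum(v)
        data.frame(key  = key[starts],
                   XN   = ends - starts + 1L,                        # group counts
                   XSUM = cs[ends] - c(0, cs[ends[-length(ends)]]))  # group sums
    }

For example, grouped_stats_sorted(MVE, MDPC) after the order() step above would return one row per unique key.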
Seth Falcon
2006-Jul-27 14:20 UTC
[Rd] Any interest in "merge" and "by" implementations specifically for sorted data?
"Kevin B. Hendricks" <kevin.hendricks at sympatico.ca> writes:> My first R attempt was a simple > > # sort the data.frame gd and the sort key > sorder <- order(MDPC) > gd <- gd[sorder,] > MDPC <- MDPC[sorder] > attach(gd) > > # find the length and sum for each unique sort key > XN <- by(MVE, MDPC, length) > XSUM <- by(MVE, MDPC, sum) > GRPS <- levels(as.factor(MDPC)) > > Well the ordering and sorting was reasonably fast but the first "by" > statement was still running 4 hours later on my machine (a dual 2.6 > gig Opteron with 4 gig of main memory). This same snippet of code in > SAS running on a slower machine takes about 5 minutes of system > time.I wonder if split() would be of use here. Once you have sorted the data frame gd and the sort keys MDPC, you could do: gdList <- split(gd$MVE, MDPC) xn <- sapply(gdList, length) xsum <- sapply(gdList, sum) + seth
Brian D Ripley
2006-Jul-28 07:06 UTC
[Rd] Any interest in "merge" and "by" implementations specifically for sorted data?
Which version of R are you looking at?  R-devel has

    o  merge() works more efficiently when there are relatively few
       matches between the data frames (for example, for 1-1 matching).
       The order of the result is changed for 'sort = FALSE'.

On Thu, 27 Jul 2006, Kevin B. Hendricks wrote:

> Hi Developers,
>
> I am looking for another new project to help me get more up to speed
> on R and to learn something outside of R internals. One recent R
> issue I have run into is finding a fast implementation of the
> equivalent of the following SAS code:
>
> [...]
>
> Would/should they be extensions of the current implementations of
> "merge" and "by" with the addition of a sorted=TRUE (defaulting to
> FALSE) parameter?
>
> Or am I simply barking up the wrong tree here?
>
> Thanks,
>
> Kevin

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
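For reference, the merge() call pattern under discussion, on toy data frames sharing the integer key (the column names are illustrative):

    d1 <- data.frame(MDPC = 1:5, MVE = rnorm(5))
    d2 <- data.frame(MDPC = c(2, 4, 5), PCO = c(10L, 20L, 30L))

    merge(d1, d2, by = "MDPC")                # inner join: one row per matching key
    merge(d1, d2, by = "MDPC", all.x = TRUE)  # keep unmatched rows of d1 (PCO becomes NA)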
Kevin B. Hendricks
2006-Jul-29 20:05 UTC
[Rd] Any interest in "merge" and "by" implementations specifically for sorted data?
Hi Bill,

So you wrote one routine that can calculate any single one of a variety of stats and allows weights, is that right? Can it return a data frame of any subset of requested stats as well? (That is what I was thinking of doing anyway.) I think someone can easily calculate all of those things in one pass through the array and then allow the user to select which of the new columns of stats should be added to the data.frame that is returned to the user.

To test all of this, I simply wrote my own igroupSums and integrated it into r-devel based on the code in split.c. I can easily modify it to handle calculating a variety of stats (even all at the same time if desired). I do not deal with "weights" at all and ignored that for now.

Here is what your test case now shows on my machine with the latest R build and my igroupSums routine added internally to R:

    > x <- rnorm(2e6)
    > i <- rep(1:1e6,2)
    > unix.time(Asums <- unlist(lapply(split(x,i),sum)))
    [1] 8.940 0.112 9.053 0.000 0.000
    > names(Asums) <- NULL

My version of igroupSums does not keep the names, so I remove them to make the results comparable. Here is my own internal function igroupSums:

    > unix.time(Bsums <- igroupSums(x,i))
    [1] 0.932 0.024 0.958 0.000 0.000
    >
    > all.equal(Asums, Bsums)
    [1] TRUE

So the speed up is quite significant (9.053 seconds vs 0.958 seconds).

I will next modify my code to handle any single one of maxs, mins, sums, counts, anys, alls, means, prods, and ranges by user choice, although I will leave the use of weights unimplemented for now (I always get mixed up thinking about weights and basic stats, and I never use them, so ...).

In case others want to play around with this too, here is the R wrapper in igroupSums.R to put in src/library/base/R/:

    igroupSums <- function(x, f, drop = FALSE, ...) UseMethod("igroupSums")

    igroupSums.default <- function(x, f, drop = FALSE, ...)
    {
        if (length(list(...))) .NotYetUsed(deparse(...), error = FALSE)
        if (is.list(f)) f <- interaction(f, drop = drop)
        else if (drop || !is.factor(f)) # drop extraneous levels
            f <- factor(f)
        storage.mode(f) <- "integer"  # some factors have double
        if (is.null(attr(x, "class")))
            return(.Internal(igroupSums(x, f)))
        ## else fall back to by()
        r <- by(x, f, sum)
        r
    }

    igroupSums.data.frame <- function(x, f, drop = FALSE, ...)
        lapply(split(seq(length = nrow(x)), f, drop = drop, ...),
               function(ind) x[ind, , drop = FALSE])

And here is a very simple igroupSums.c to put in src/main/. It still needs a lot of work, since it does not handle NAs in the vector x yet and still needs to be modified into a general routine that handles any single function of counts, sums, maxs, mins, means, prods, anys, alls, and ranges:

    #ifdef HAVE_CONFIG_H
    #include <config.h>
    #endif

    #include "Defn.h"

    SEXP attribute_hidden do_igroupSums(SEXP call, SEXP op, SEXP args, SEXP env)
    {
        SEXP x, f, sums;
        int i, j, nobs, nlevs, nfac;

        checkArity(op, args);
        x = CAR(args);
        f = CADR(args);
        if (!isVector(x))
            errorcall(call, _("first argument must be a vector"));
        if (!isFactor(f))
            errorcall(call, _("second argument must be a factor"));

        nlevs = nlevels(f);
        nfac = LENGTH(CADR(args));
        nobs = LENGTH(CAR(args));
        if (nobs <= 0) return R_NilValue;
        if (nfac <= 0)
            errorcall(call, _("Group length is 0 but data length > 0"));
        if (nobs % nfac != 0)
            warningcall(call, _("data length is not a multiple of split variable"));

        /* one accumulator per factor level, initialized to zero */
        PROTECT(sums = allocVector(TYPEOF(x), nlevs));
        switch (TYPEOF(x)) {
        case INTSXP:
            for (i = 0; i < nlevs; i++) INTEGER(sums)[i] = 0;
            break;
        case REALSXP:
            for (i = 0; i < nlevs; i++) REAL(sums)[i] = 0.0;
            break;
        default:
            UNIMPLEMENTED_TYPE("igroupSums", x);
        }

        /* single pass: add each x[i] into the accumulator for its group */
        for (i = 0; i < nobs; i++) {
            j = INTEGER(f)[i % nfac];
            if (j != NA_INTEGER) {
                j--;
                switch (TYPEOF(x)) {
                case INTSXP:
                    INTEGER(sums)[j] = INTEGER(sums)[j] + INTEGER(x)[i];
                    break;
                case REALSXP:
                    REAL(sums)[j] = REAL(sums)[j] + REAL(x)[i];
                    break;
                default:
                    UNIMPLEMENTED_TYPE("igroupSums", x);
                }
            }
        }
        UNPROTECT(1);
        return sums;
    }

If anyone is playing with this themselves, don't forget to update Internal.h and names.c to reflect the added routine before you make clean and rebuild.

Once I finish, I will post my patches here, and if someone would like to modify them to implement "weights", please let me know. Even if these never get added to R, I can keep them in my own tree and use them for my own work.

Thanks again for all of your hints and guidance. This alone will speed up my R code greatly!

Kevin

> That is roughly what I did in C code for the Splus version.
> E.g., here is the integer x case for sums and means. It accumulates
> the weighted group sum and the group weight so that if we
> are doing the mean it has the right divisor. It would
> go a bit faster if I didn't bother with the group weight
> in the sum case.
>
> (I was mostly interested in seeing if this function's interface
> was sufficiently general for your uses. Computing the integer
> group codes can sometimes be a pain and there are cases where you
> might want to combine that with computing the grouped summaries.
> I am guessing that in most cases you want those two functions
> to be separate.)
>
>    case S_SUM: /*FALLTHROUGH*/
>    case S_MEAN:
>        for(i=0;i<length;i++) {
>            bin = binp ? *binp++ - 1 : 0 ; /* bin is now 0-based */
>            weight = weighted ? *weightp++ : 1.0 ;
>            x = *xp++ ;
>            if (is_na_INT(&bin) || bin<0 || bin>=maxbin || weight==0.0)
>                continue ;
>            if (is_na_DOUBLE(&groupWeightp[bin]))
>                continue ;
>            if (is_na_DOUBLE(&x) || is_na_DOUBLE(&weight)) {
>                if (!na_rm) {
>                    na_set3(&valuep[bin], value->mode, To_NA);
>                    na_set3(&groupWeightp[bin], groupWeight->mode, To_NA);
>                }
>                continue ;
>            }
>            if (!is_na_DOUBLE(&valuep[bin])) {
>                valuep[bin] += x * weight ;
>                groupWeightp[bin] += weight ; /* groupWeightp and
>                                                 valuep should both have same NA-ness */
>            }
>        }
>        if (which==S_MEAN) {
>            for(ibin=0;ibin<maxbin;ibin++) {
>                if (is_na_DOUBLE(&valuep[ibin])) {
>                    /* leave valuep[ibin] as NA */
>                } else if (groupWeightp[ibin]==0.0) {
>                    /* valuep[ibin] must be 0.0 if groupWeightp[ibin] is */
>                    na_set3(&valuep[ibin], value->mode, To_NaN) ;
>                } else {
>                    valuep[ibin] = valuep[ibin] / groupWeightp[ibin] ;
>                }
>            }
>        }
>        break;
Kevin B. Hendricks
2006-Jul-30 14:11 UTC
[Rd] Any interest in "merge" and "by" implementations specifically for sorted data?
Hi Bill, After playing with this some more and adding an implementation to handle NAs in the data vector, I have run into the problem of what to return when the only data values for a particular bin (or level) in the data vector were NAs and the user selected na.rm=T 1. Should it return 0 for counts of that particular bin and NA for that bin for all of the other functions? If so, wouldn't that be strange to return a NA just since there is no valid data for that bin because the user asked for na.rm=T? 2. Or do I have to literally rebuild the final result vector, removing all "unused" bins before returning the results? And wouldn't that cause problems in not all of the levels from 1:ngroups will be returned for some variables and not for others. I personally like the approach of 1. better since if I give an igroup function my groups and tell it to na.rm=T from my data vector, I would really want all group levels returned and not just the ones that had valid data in them and if a particular group had no data, I would want the count to be 0 for that bin and all of the other funs to return NA for that particular bin? Is that what you are returning in that case? Also, do you always return Sums, Maxs, and Mins as "numeric" or do you sometimes return "integer" values if an "integer" data vector is passed in? Are "Counts" always returned as "integer" or do you always set them to "numeric" or does that vary with the type of the data vector passed in? Do you handle "complex" data vectors in a similar fashion (ie. using the length of the complex vector as its value for Maxs, Mins, etc?)? Thanks, Kevin
Thomas Lumley
2006-Jul-31 14:19 UTC
[Rd] Any interest in "merge" and "by" implementations specifically for sorted data?
On Sat, 29 Jul 2006, Kevin B. Hendricks wrote:

> Hi Bill,
>
>>>> sum : igroupSums
>
> Okay, after thinking about this ...
>
> # assumes i is the small integer factor with n levels
> # v is some long vector
> # no sorting required
>
> igroupSums <- function(v,i) {
>     sums <- rep(0, max(i))
>     for (j in 1:length(v)) {
>         sums[[i[[j]]]] <- sums[[i[[j]]]] + v[[j]]
>     }
>     sums
> }
>
> if written in fortran or c might be faster than using split. It is
> at least just linear in time with the length of vector v.

For sums you should look at rowsum(). It uses a hash table in C and last
time I looked was faster than using split(). It returns a vector of the
same length as the input, but that would easily be fixed.

The same approach would work for min, max, range, count, mean, but not
for arbitrary functions.

        -thomas

Thomas Lumley                   Assoc. Professor, Biostatistics
tlumley at u.washington.edu     University of Washington, Seattle
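A quick sketch of the rowsum() pattern on toy data. As used later in this thread, rowsum() comes back as a one-column matrix with one row per group, so the result is dropped back to a plain vector here:

    x <- rnorm(10)
    i <- rep(1:5, 2)

    rs   <- rowsum(x, i)   # one-column matrix; rownames are the group values
    sums <- rs[, 1]        # drop to a named numeric vector

    all.equal(sums, sapply(split(x, i), sum))   # should be TRUE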
Kevin B. Hendricks
2006-Jul-31 21:19 UTC
[Rd] Any interest in "merge" and "by" implementations specifically for sorted data?
Hi Thomas,

Here is a comparison of performance times for my own igroupSums versus using split and rowsum:

    > x <- rnorm(2e6)
    > i <- rep(1:1e6,2)
    >
    > unix.time(suma <- unlist(lapply(split(x,i),sum)))
    [1] 8.188 0.076 8.263 0.000 0.000
    >
    > names(suma) <- NULL
    >
    > unix.time(sumb <- igroupSums(x,i))
    [1] 0.036 0.000 0.035 0.000 0.000
    >
    > all.equal(suma, sumb)
    [1] TRUE
    >
    > unix.time(sumc <- rowsum(x,i))
    [1] 0.744 0.000 0.742 0.000 0.000
    >
    > sumc <- sumc[,1]
    > names(sumc) <- NULL
    > all.equal(suma, sumc)
    [1] TRUE

So my implementation of igroupSums is faster and already handles NAs. I have also implemented igroupMins, igroupMaxs, igroupAnys, igroupAlls, igroupCounts, igroupMeans, and igroupRanges. The igroup functions I implemented do not handle weights yet, but they do handle NAs properly.

Assuming I clean them up, is anyone in the R developer group interested? Or would you rather I instead extend the rowsum approach to create rowcount, rowmax, rowmin, etc. using a hash function approach? All of these approaches simply use different ways to map group codes to integers and then compute the functions the same way.

Thanks,

Kevin