Kevin B. Hendricks
2006-Jul-27 13:31 UTC
[Rd] Any interest in "merge" and "by" implementations specifically for sorted data?
Hi Developers,

I am looking for another new project to help me get more up to speed on R and to learn something outside of R internals. One recent R issue I have run into is finding a fast implementation of the equivalent of the following SAS code:

    /* MDPC is an integer sort key made from two integer columns */
    MDPC = (MD * 100000) + PCO;

    /* sort the dataset by the key */
    PROC SORT;
    BY MDPC;

    /* print out count and sum for each unique sort key (subgroup) */
    /* use of BY MDPC requires sorting the data set by MDPC first in SAS */
    PROC UNIVARIATE NOPRINT;
    VAR MVE;
    BY MDPC;
    OUTPUT OUT=TMP0 N=XN SUM=XSUM;

This is easy to do in R, but the data set it runs on has 1,742,201 lines and takes 196,868,713 bytes to store as character data. The sort key easily has over 200,000 unique values (if not twice that).

My first R attempt was simple:

    # sort the data.frame gd and the sort key
    sorder <- order(MDPC)
    gd <- gd[sorder,]
    MDPC <- MDPC[sorder]
    attach(gd)

    # find the length and sum for each unique sort key
    XN <- by(MVE, MDPC, length)
    XSUM <- by(MVE, MDPC, sum)
    GRPS <- levels(as.factor(MDPC))

The ordering and sorting were reasonably fast, but the first "by" statement was still running 4 hours later on my machine (a dual 2.6 GHz Opteron with 4 GB of main memory). The same snippet of code in SAS, running on a slower machine, takes about 5 minutes of system time.

I tried various simple R implementations of a "by_sorted" that I thought might help:

    # walk the sorted array once, keep track of the beginning
    # and ending points of each unique sort key value in p,
    # run the function fcn on that subsequence of vector v,
    # and store the results in a vector
    by_sorted <- function(v, p, fcn) {
        key <- p[[1]]
        bp <- 1
        r <- NULL
        for (i in 2:length(p)) {
            if (key != p[[i]]) {
                r <- c(r, fcn(v[bp:(i-1)]))
                bp <- i
                key <- p[[i]]
            }
        }
        c(r, fcn(v[bp:i]))
    }

but they also took "forever" to run (read: I killed those attempts after 15 minutes of CPU time). I had literally the same issue when trying "tapply".

So unless it already exists someplace, I need a really fast implementation of "by" for very large sorted data sets (probably written in Fortran or C) that does the equivalent of what SAS does with its "proc univariate by" approach, with close to the same performance. The code should only have to walk the array once (i.e. be linear in time in the number of rows of the vector). I have similar issues with "merge" as well, since merging data frames already sorted by the same sort key should be fast, and it does not appear to be.

Before I jump in and create my own "sorted large data set" versions of "by" and "merge", I wanted to be sure they would be of interest to others. If they work well and are well implemented (a big if, since I am really still just learning this - the whole point of the project!), would something like this be of any interest for internal use in R? Or is this too specialized? Is there an R function implemented in C or Fortran that would make a good "model" to follow for implementing something like this? Would/should they be extensions of the current implementations of "merge" and "by" with the addition of a sorted=TRUE (defaulting to FALSE) parameter?

Or am I simply barking up the wrong tree here?

Thanks,

Kevin
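For the grouped count and sum specifically, the single pass over an already-sorted key can also be expressed with vectorized primitives. A minimal sketch, assuming the key is already sorted; the function and variable names below are illustrative only, and the cumsum-based sums can differ from per-group sums in the last few floating-point digits:

    # vectorized grouped count/sum on an already-sorted key (sketch)
    grouped_stats_sorted <- function(v, key) {
        stopifnot(length(v) == length(key))
        n <- length(key)
        # positions where a new key value starts and where each run ends
        starts <- c(1L, which(key[-1L] != key[-n]) + 1L)
        ends   <- c(starts[-1L] - 1L, n)
        cs     <- cumsum(v)
        data.frame(key  = key[starts],
                   XN   = ends - starts + 1L,                        # group counts
                   XSUM = cs[ends] - c(0, cs[ends[-length(ends)]]))  # group sums
    }

For example, grouped_stats_sorted(MVE, MDPC) after the order() step above would return one row per unique key.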
Seth Falcon
2006-Jul-27 14:20 UTC
[Rd] Any interest in "merge" and "by" implementations specifically for sorted data?
"Kevin B. Hendricks" <kevin.hendricks at sympatico.ca> writes:> My first R attempt was a simple > > # sort the data.frame gd and the sort key > sorder <- order(MDPC) > gd <- gd[sorder,] > MDPC <- MDPC[sorder] > attach(gd) > > # find the length and sum for each unique sort key > XN <- by(MVE, MDPC, length) > XSUM <- by(MVE, MDPC, sum) > GRPS <- levels(as.factor(MDPC)) > > Well the ordering and sorting was reasonably fast but the first "by" > statement was still running 4 hours later on my machine (a dual 2.6 > gig Opteron with 4 gig of main memory). This same snippet of code in > SAS running on a slower machine takes about 5 minutes of system > time.I wonder if split() would be of use here. Once you have sorted the data frame gd and the sort keys MDPC, you could do: gdList <- split(gd$MVE, MDPC) xn <- sapply(gdList, length) xsum <- sapply(gdList, sum) + seth
Brian D Ripley
2006-Jul-28 07:06 UTC
[Rd] Any interest in "merge" and "by" implementations specifically for sorted data?
Which version of R are you looking at?  R-devel has

    o  merge() works more efficiently when there are relatively few
       matches between the data frames (for example, for 1-1 matching).
       The order of the result is changed for 'sort = FALSE'.

On Thu, 27 Jul 2006, Kevin B. Hendricks wrote:

> Hi Developers,
>
> I am looking for another new project to help me get more up to speed
> on R and to learn something outside of R internals. One recent R
> issue I have run into is finding a fast implementation of the
> equivalent of the following SAS code:
>
> [...]
>
> Would/should they be extensions of the current implementations of
> "merge" and "by" with the addition of a sorted=TRUE (defaulting to
> FALSE) parameter?
>
> Or am I simply barking up the wrong tree here?
>
> Thanks,
>
> Kevin

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
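For reference, the merge() call pattern under discussion, on toy data frames sharing the integer key (the column names are illustrative):

    d1 <- data.frame(MDPC = 1:5, MVE = rnorm(5))
    d2 <- data.frame(MDPC = c(2, 4, 5), PCO = c(10L, 20L, 30L))

    merge(d1, d2, by = "MDPC")                # inner join: one row per matching key
    merge(d1, d2, by = "MDPC", all.x = TRUE)  # keep unmatched rows of d1 (PCO becomes NA)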
Kevin B. Hendricks
2006-Jul-29 20:05 UTC
[Rd] Any interest in "merge" and "by" implementations specifically for sorted data?
Hi Bill,

So you wrote one routine that can calculate any single one of a variety of stats and allows weights, is that right? Can it return a data frame of any subset of requested stats as well? (That is what I was thinking of doing anyway.) I think someone can easily calculate all of those things in one pass through the array and then allow the user to select which of the new columns of stats should be added to the data.frame that is returned to the user.

To test all of this, I simply wrote my own igroupSums and integrated it into r-devel based on the code in split.c. I can easily modify it to handle calculating a variety of stats (even all at the same time if desired). I do not deal with "weights" at all and ignored that for now.

Here is what your test case now shows on my machine with the latest R build and my igroupSums routine added internally to R:

    > x <- rnorm(2e6)
    > i <- rep(1:1e6,2)
    > unix.time(Asums <- unlist(lapply(split(x,i),sum)))
    [1] 8.940 0.112 9.053 0.000 0.000
    > names(Asums) <- NULL

My version of igroupSums does not keep the names, so I remove them to make the results comparable. Here is my own internal function igroupSums:

    > unix.time(Bsums <- igroupSums(x,i))
    [1] 0.932 0.024 0.958 0.000 0.000
    >
    > all.equal(Asums, Bsums)
    [1] TRUE

So the speed up is quite significant (9.053 seconds vs 0.958 seconds).

I will next modify my code to handle any single one of maxs, mins, sums, counts, anys, alls, means, prods, and ranges by user choice, although I will leave the use of weights unimplemented for now (I always get mixed up thinking about weights and basic stats, and I never use them, so ...).

In case others want to play around with this too, here is the R wrapper in igroupSums.R to put in src/library/base/R/:

    igroupSums <- function(x, f, drop = FALSE, ...) UseMethod("igroupSums")

    igroupSums.default <- function(x, f, drop = FALSE, ...)
    {
        if (length(list(...))) .NotYetUsed(deparse(...), error = FALSE)
        if (is.list(f)) f <- interaction(f, drop = drop)
        else if (drop || !is.factor(f)) # drop extraneous levels
            f <- factor(f)
        storage.mode(f) <- "integer"  # some factors have double
        if (is.null(attr(x, "class")))
            return(.Internal(igroupSums(x, f)))
        ## else fall back to by()
        r <- by(x, f, sum)
        r
    }

    igroupSums.data.frame <- function(x, f, drop = FALSE, ...)
        lapply(split(seq(length = nrow(x)), f, drop = drop, ...),
               function(ind) x[ind, , drop = FALSE])

And here is a very simple igroupSums.c to put in src/main/. It still needs a lot of work, since it does not handle NAs in the vector x yet and still needs to be modified into a general routine that handles any single function of counts, sums, maxs, mins, means, prods, anys, alls, and ranges:

    #ifdef HAVE_CONFIG_H
    #include <config.h>
    #endif

    #include "Defn.h"

    SEXP attribute_hidden do_igroupSums(SEXP call, SEXP op, SEXP args, SEXP env)
    {
        SEXP x, f, sums;
        int i, j, nobs, nlevs, nfac;

        checkArity(op, args);
        x = CAR(args);
        f = CADR(args);
        if (!isVector(x))
            errorcall(call, _("first argument must be a vector"));
        if (!isFactor(f))
            errorcall(call, _("second argument must be a factor"));

        nlevs = nlevels(f);
        nfac = LENGTH(CADR(args));
        nobs = LENGTH(CAR(args));
        if (nobs <= 0) return R_NilValue;
        if (nfac <= 0)
            errorcall(call, _("Group length is 0 but data length > 0"));
        if (nobs % nfac != 0)
            warningcall(call, _("data length is not a multiple of split variable"));

        /* one accumulator per factor level, initialized to zero */
        PROTECT(sums = allocVector(TYPEOF(x), nlevs));
        switch (TYPEOF(x)) {
        case INTSXP:
            for (i = 0; i < nlevs; i++) INTEGER(sums)[i] = 0;
            break;
        case REALSXP:
            for (i = 0; i < nlevs; i++) REAL(sums)[i] = 0.0;
            break;
        default:
            UNIMPLEMENTED_TYPE("igroupSums", x);
        }

        /* single pass: add each x[i] into the accumulator for its group */
        for (i = 0; i < nobs; i++) {
            j = INTEGER(f)[i % nfac];
            if (j != NA_INTEGER) {
                j--;
                switch (TYPEOF(x)) {
                case INTSXP:
                    INTEGER(sums)[j] = INTEGER(sums)[j] + INTEGER(x)[i];
                    break;
                case REALSXP:
                    REAL(sums)[j] = REAL(sums)[j] + REAL(x)[i];
                    break;
                default:
                    UNIMPLEMENTED_TYPE("igroupSums", x);
                }
            }
        }
        UNPROTECT(1);
        return sums;
    }

If anyone is playing with this themselves, don't forget to update Internal.h and names.c to reflect the added routine before you make clean and rebuild.

Once I finish, I will post my patches here, and if someone would like to modify them to implement "weights", please let me know. Even if these never get added to R, I can keep them in my own tree and use them for my own work.

Thanks again for all of your hints and guidance. This alone will speed up my R code greatly!

Kevin

> That is roughly what I did in C code for the Splus version.
> E.g., here is the integer x case for sums and means. It accumulates
> the weighted group sum and the group weight so that if we
> are doing the mean it has the right divisor. It would
> go a bit faster if I didn't bother with the group weight
> in the sum case.
>
> (I was mostly interested in seeing if this function's interface
> was sufficiently general for your uses. Computing the integer
> group codes can sometimes be a pain and there are cases where you
> might want to combine that with computing the grouped summaries.
> I am guessing that in most cases you want those two functions
> to be separate.)
>
>    case S_SUM: /*FALLTHROUGH*/
>    case S_MEAN:
>        for(i=0;i<length;i++) {
>            bin = binp ? *binp++ - 1 : 0 ; /* bin is now 0-based */
>            weight = weighted ? *weightp++ : 1.0 ;
>            x = *xp++ ;
>            if (is_na_INT(&bin) || bin<0 || bin>=maxbin || weight==0.0)
>                continue ;
>            if (is_na_DOUBLE(&groupWeightp[bin]))
>                continue ;
>            if (is_na_DOUBLE(&x) || is_na_DOUBLE(&weight)) {
>                if (!na_rm) {
>                    na_set3(&valuep[bin], value->mode, To_NA);
>                    na_set3(&groupWeightp[bin], groupWeight->mode, To_NA);
>                }
>                continue ;
>            }
>            if (!is_na_DOUBLE(&valuep[bin])) {
>                valuep[bin] += x * weight ;
>                groupWeightp[bin] += weight ; /* groupWeightp and
>                                                 valuep should both have same NA-ness */
>            }
>        }
>        if (which==S_MEAN) {
>            for(ibin=0;ibin<maxbin;ibin++) {
>                if (is_na_DOUBLE(&valuep[ibin])) {
>                    /* leave valuep[ibin] as NA */
>                } else if (groupWeightp[ibin]==0.0) {
>                    /* valuep[ibin] must be 0.0 if groupWeightp[ibin] is */
>                    na_set3(&valuep[ibin], value->mode, To_NaN) ;
>                } else {
>                    valuep[ibin] = valuep[ibin] / groupWeightp[ibin] ;
>                }
>            }
>        }
>        break;
Kevin B. Hendricks
2006-Jul-30 14:11 UTC
[Rd] Any interest in "merge" and "by" implementations specifically for sorted data?
Hi Bill, After playing with this some more and adding an implementation to handle NAs in the data vector, I have run into the problem of what to return when the only data values for a particular bin (or level) in the data vector were NAs and the user selected na.rm=T 1. Should it return 0 for counts of that particular bin and NA for that bin for all of the other functions? If so, wouldn't that be strange to return a NA just since there is no valid data for that bin because the user asked for na.rm=T? 2. Or do I have to literally rebuild the final result vector, removing all "unused" bins before returning the results? And wouldn't that cause problems in not all of the levels from 1:ngroups will be returned for some variables and not for others. I personally like the approach of 1. better since if I give an igroup function my groups and tell it to na.rm=T from my data vector, I would really want all group levels returned and not just the ones that had valid data in them and if a particular group had no data, I would want the count to be 0 for that bin and all of the other funs to return NA for that particular bin? Is that what you are returning in that case? Also, do you always return Sums, Maxs, and Mins as "numeric" or do you sometimes return "integer" values if an "integer" data vector is passed in? Are "Counts" always returned as "integer" or do you always set them to "numeric" or does that vary with the type of the data vector passed in? Do you handle "complex" data vectors in a similar fashion (ie. using the length of the complex vector as its value for Maxs, Mins, etc?)? Thanks, Kevin
Thomas Lumley
2006-Jul-31 14:19 UTC
[Rd] Any interest in "merge" and "by" implementations specifically for sorted data?
On Sat, 29 Jul 2006, Kevin B. Hendricks wrote:

> Hi Bill,
>
>>>> sum : igroupSums
>
> Okay, after thinking about this ...
>
> # assumes i is the small integer factor with n levels
> # v is some long vector
> # no sorting required
>
> igroupSums <- function(v,i) {
>     sums <- rep(0, max(i))
>     for (j in 1:length(v)) {
>         sums[[i[[j]]]] <- sums[[i[[j]]]] + v[[j]]
>     }
>     sums
> }
>
> if written in fortran or c might be faster than using split. It is
> at least just linear in time with the length of vector v.

For sums you should look at rowsum(). It uses a hash table in C and last
time I looked was faster than using split(). It returns a vector of the
same length as the input, but that would easily be fixed.

The same approach would work for min, max, range, count, mean, but not
for arbitrary functions.

        -thomas

Thomas Lumley                   Assoc. Professor, Biostatistics
tlumley at u.washington.edu     University of Washington, Seattle
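A quick sketch of the rowsum() pattern on toy data. As used later in this thread, rowsum() comes back as a one-column matrix with one row per group, so the result is dropped back to a plain vector here:

    x <- rnorm(10)
    i <- rep(1:5, 2)

    rs   <- rowsum(x, i)   # one-column matrix; rownames are the group values
    sums <- rs[, 1]        # drop to a named numeric vector

    all.equal(sums, sapply(split(x, i), sum))   # should be TRUE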
Kevin B. Hendricks
2006-Jul-31 21:19 UTC
[Rd] Any interest in "merge" and "by" implementations specifically for sorted data?
Hi Thomas,

Here is a comparison of performance times for my own igroupSums versus using split and rowsum:

    > x <- rnorm(2e6)
    > i <- rep(1:1e6,2)
    >
    > unix.time(suma <- unlist(lapply(split(x,i),sum)))
    [1] 8.188 0.076 8.263 0.000 0.000
    >
    > names(suma) <- NULL
    >
    > unix.time(sumb <- igroupSums(x,i))
    [1] 0.036 0.000 0.035 0.000 0.000
    >
    > all.equal(suma, sumb)
    [1] TRUE
    >
    > unix.time(sumc <- rowsum(x,i))
    [1] 0.744 0.000 0.742 0.000 0.000
    >
    > sumc <- sumc[,1]
    > names(sumc) <- NULL
    > all.equal(suma, sumc)
    [1] TRUE

So my implementation of igroupSums is faster and already handles NAs. I have also implemented igroupMins, igroupMaxs, igroupAnys, igroupAlls, igroupCounts, igroupMeans, and igroupRanges. The igroup functions I implemented do not handle weights yet, but they do handle NAs properly.

Assuming I clean them up, is anyone in the R developer group interested? Or would you rather I instead extend the rowsum approach to create rowcount, rowmax, rowmin, etc. using a hash function approach? All of these approaches simply use different ways to map group codes to integers and then compute the functions the same way.

Thanks,

Kevin