I would like to generate a vector having the same length as the number of rows in a matrix. The vector should contain an integer indicating the "group" of the row, where identical matrix rows are in a group, and a unique row has a unique integer. Thus, for

a <- c(1,2)
b <- c(1,3)
c <- c(1,2)
d <- c(1,2)
e <- c(1,3)
f <- c(2,1)
mat <- rbind(a,b,c,d,e,f)

I would like to get the vector c(1,2,1,1,2,3). I know dist() gives part of the answer, but I can't figure out how to use it for this purpose without doing a lot of looping. I need to apply this to matrices up to ~100000 rows.

Thanks,
Scott Waichler
Pacific Northwest National Laboratory
scott.waichler_at_pnl.gov
Here's one possibility:

> index <- as.numeric(factor(apply(mat, 1, paste, collapse=":")))
> index
[1] 1 2 1 1 2 3

There's probably some better way though...

Andy

______________________________________________
R-help at stat.math.ethz.ch mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide!
http://www.R-project.org/posting-guide.html
I believe you want to start with unique which, when applied to a matrix, provides the unique rows.

> unique(mat)
  [,1] [,2]
a    1    2
b    1    3
f    2    1

I'm sure others will be able to provide clever ways of doing the matching against the unique rows.
The shortest expression I can think of is:

as.numeric(interaction(mat[,1],mat[,2],drop=T))

Completing the thought of those who suggested dist or unique:

N <- nrow(mat)
dd <- as.matrix(dist(rbind(mat,unique(mat))))[-seq(N),seq(N)]
apply(dd,2,function(x)match(0,x))
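As a quick check on the six-row example from the original question, the two expressions above can be run side by side. This is only a sketch: the names by_dist, by_inter and relabel are made up for illustration and do not come from the thread. Note that as.matrix(dist(...)) builds a full square distance matrix, so it is probably not practical at ~100000 rows, and that interaction() numbers the groups by factor-level order rather than by order of first appearance.

```r
# Example matrix from the original question
mat <- rbind(a = c(1, 2), b = c(1, 3), c = c(1, 2),
             d = c(1, 2), e = c(1, 3), f = c(2, 1))

# dist-based grouping: rows of dd are the unique rows, columns are the
# original rows; a zero distance marks an identical row
N  <- nrow(mat)
dd <- as.matrix(dist(rbind(mat, unique(mat))))[-seq(N), seq(N), drop = FALSE]
by_dist <- unname(apply(dd, 2, function(x) match(0, x)))

# interaction-based grouping: codes follow the order of the factor levels
by_inter <- as.numeric(interaction(mat[, 1], mat[, 2], drop = TRUE))

# Hypothetical helper: renumber any grouping by order of first appearance
relabel <- function(g) match(g, unique(g))

print(by_dist)            # 1 2 1 1 2 3
print(by_inter)           # 2 3 2 2 3 1
print(relabel(by_inter))  # 1 2 1 1 2 3
```

If the numbering has to follow first appearance, as in the requested c(1,2,1,1,2,3), a relabelling step like the one above seems to be needed after interaction().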
Thanks to all of you who responded to my help request. Here is the very efficient upshot of your advice:

> mat2 <- apply(mat, 1, paste, collapse=":")
> vec <- match(mat2, unique(mat2))
> vec
[1] 1 2 1 1 2 3

P.S. I found that Andy Liaw's method didn't preserve the index order that I wanted; it yields

2 3 2 2 3 1

To get the order of integers I was looking for required an invocation of unique:

as.numeric(factor(apply(mat, 1, paste, collapse=":"),
                  levels=unique(apply(mat, 1, paste, collapse=":"))))

But the first method above is obviously cleaner and is twice as fast, only 9 seconds for a 100000 row matrix on an ordinary PC.

Regards,
Scott Waichler
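The ordering difference between the factor() and match() versions can be seen on a small matrix whose rows do not first appear in sorted order. The matrix below is invented for illustration and is not from the thread. By default factor() sorts its levels, so the plain factor() codes follow the sorted keys, while match() against unique() and factor() with explicit levels both number the groups by first appearance.

```r
# Illustrative matrix (an assumption, not data from the thread):
# the first row is not the smallest key
mat <- rbind(c(2, 1), c(1, 3), c(2, 1), c(1, 2))

# One string key per row
key <- apply(mat, 1, paste, collapse = ":")

by_factor <- as.numeric(factor(key))                        # sorted-level order
by_match  <- match(key, unique(key))                        # first-appearance order
by_levels <- as.numeric(factor(key, levels = unique(key)))  # first-appearance order

print(by_factor)  # 3 2 3 1
print(by_match)   # 1 2 1 3
print(by_levels)  # 1 2 1 3
```

On the original six-row example the sorted keys happen to coincide with the order of first appearance, so the difference would only show up on other data.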
The problem with interaction() is that it doesn't scale with increasing number of columns:

> set.seed(1)
> mat2 <- matrix(sample(20,5e4,rep=T), 1e4)
> invisible(gc()); system.time(z0 <- f0(mat2))
[1] 1.58 0.01 1.85   NA   NA
> invisible(gc()); system.time(z1 <- f1(mat2))
[1] 1.57 0.00 1.66   NA   NA
> invisible(gc()); system.time(z2 <- f2g(mat2))
[1] 34.14  0.60 57.45    NA    NA

[f2g is the slightly modified version of f2 to allow for any number of columns:

f2g <- function(mat) as.numeric(interaction(as.data.frame(mat), drop=T))]

With 10 columns in the matrix, f0 and f1 ran fine in under 10 seconds, but f2g started thrashing, and ran out of memory after a while. If you look at how interaction() is written you'll quickly see why...

Andy

> From: Gabor Grothendieck
>
> The interaction solution gives an identical result, is shorter and
> is one or two orders of magnitude faster.  Here is a
> comparison of the three:
>
> R> set.seed(1)
> R> mat <- matrix(sample(20,100000,rep=T),50000)
> R>
> R> f0 <- function(mat) {
> +    mat2 <- apply(mat, 1, paste, collapse=":");
> +    match(mat2, unique(mat2))
> + }
> R>
> R> f1 <- function(mat) { z <- apply(mat, 1, paste, collapse=":")
> +    as.numeric(factor(z,levels=unique(z)))
> + }
> R>
> R> f2 <- function(mat) as.numeric(interaction(mat[,1],mat[,2],drop=T))
> R>
> R> dummy <- gc(); system.time(z0 <- f0(mat))
> [1] 5.24 0.02 5.52   NA   NA
> R> dummy <- gc(); system.time(z1 <- f1(mat))
> [1] 5.18 0.00 5.52   NA   NA
> R> dummy <- gc(); system.time(z2 <- f2(mat))
> [1] 0.1 0.0 0.1  NA  NA
> R> all.equal(z0,z1)
> [1] TRUE
> R> all.equal(z0,z2)
> [1] TRUE
> R> all.equal(z2,z1)
> [1] TRUE
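For matrices with many columns, one further option is to keep the paste-and-match idea of f0 but build the row keys with a single vectorised paste() call instead of apply() over rows. This technique is not proposed anywhere in the thread, and the function name row_groups is hypothetical. It is likely to be faster than the apply() version, though no timings are claimed here.

```r
# Hypothetical helper (not from the thread): group identical rows of a
# matrix, numbering the groups by order of first appearance
row_groups <- function(mat, sep = ":") {
  # paste() is vectorised across the columns, so all row keys are built
  # in one call rather than one apply() iteration per row
  key <- do.call(paste, c(as.data.frame(mat), sep = sep))
  match(key, unique(key))
}

# Compare against the apply()-based keys on a 10-column matrix
set.seed(1)
mat10 <- matrix(sample(20, 1e5, replace = TRUE), ncol = 10)
g   <- row_groups(mat10)
k   <- apply(mat10, 1, paste, collapse = ":")
ref <- match(k, unique(k))
stopifnot(identical(g, ref))
```

Unlike interaction(), this does not enumerate every possible combination of column levels, so memory use should grow with the number of rows rather than with the product of the level counts.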