thr3ads.net - R help - [R] Tagging identical rows of a matrix [May 2004]


Scott Waichler

2004-May-14 18:40 UTC

[R] Tagging identical rows of a matrix

I would like to generate a vector having the same length
as the number of rows in a matrix.  The vector should contain
an integer indicating the "group" of the row, where identical
matrix rows are in a group, and a unique row has a unique integer.
Thus, for

a <- c(1,2)
b <- c(1,3)
c <- c(1,2)
d <- c(1,2)
e <- c(1,3)
f <- c(2,1)
mat <- rbind(a,b,c,d,e,f)

I would like to get the vector c(1,2,1,1,2,3).  I know dist() gives
part of the answer, but I can't figure out how to use it for
this purpose without doing a lot of looping.  I need to apply this
to matrices up to ~100000 rows.

Thanks,
Scott Waichler
Pacific Northwest National Laboratory
scott.waichler_at_pnl.gov

Liaw, Andy

2004-May-14 18:50 UTC

Here's one possibility:
> index <- as.numeric(factor(apply(mat, 1, paste, collapse=":")))
> index
[1] 1 2 1 1 2 3


There's probably some better way though...

Andy

Douglas Bates

2004-May-14 18:52 UTC

I believe you want to start with unique which, when applied to a
matrix, provides the unique rows.
> unique(mat)
  [,1] [,2]
a    1    2
b    1    3
f    2    1

I'm sure others will be able to provide clever ways of doing the
matching against the unique rows.
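
One possible way to finish this idea is sketched below. It assumes that rows can be compared through a pasted string key; the helper name row_key is illustrative and does not come from the thread.

```r
# Example matrix from the original question
mat <- rbind(a = c(1, 2), b = c(1, 3), c = c(1, 2),
             d = c(1, 2), e = c(1, 3), f = c(2, 1))

# Hypothetical helper: collapse each row into a single string key
row_key <- function(m) apply(m, 1, paste, collapse = ":")

u   <- unique(mat)                        # distinct rows, in order of first appearance
grp <- match(row_key(mat), row_key(u))    # position of each row among the distinct rows
grp
# [1] 1 2 1 1 2 3
```

Because unique() keeps rows in order of first appearance, the group numbers follow that order as well.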

Gabor Grothendieck

2004-May-14 19:47 UTC

The shortest expression I can think of is:

as.numeric(interaction(mat[,1],mat[,2],drop=T))

Completing the thought of those who suggested dist or
unique:

N <- nrow(mat)
dd <- as.matrix(dist(rbind(mat,unique(mat))))[-seq(N),seq(N)]
apply(dd,2,function(x)match(0,x))
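
As a worked check of the dist()-based idea on the example matrix, here is a small sketch. The helper dist_gb is hypothetical and only estimates the storage needed for all pairwise distances, assuming 8 bytes per value; it is not a measurement.

```r
# Example matrix from the original question
mat <- rbind(a = c(1, 2), b = c(1, 3), c = c(1, 2),
             d = c(1, 2), e = c(1, 3), f = c(2, 1))

N <- nrow(mat)
u <- unique(mat)

# Rows of dd: the distinct rows; columns of dd: the original rows
dd <- as.matrix(dist(rbind(mat, u)))[-seq(N), seq(N), drop = FALSE]

# For each original row, find the first distinct row at distance zero
grp_dist <- apply(dd, 2, function(x) match(0, x))
unname(grp_dist)
# [1] 1 2 1 1 2 3

# Hypothetical estimate: GB needed to hold n * (n - 1) / 2 pairwise distances
dist_gb <- function(n) n * (n - 1) / 2 * 8 / 1e9
dist_gb(1e5)   # roughly 40
```

The estimate suggests why this route is unlikely to suit matrices with ~100000 rows, even though it works on small inputs.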



Waichler, Scott R

2004-May-14 20:12 UTC

Thanks to all of you who responded to my help request.
Here is the very efficient upshot of your advice:
> mat2 <- apply(mat, 1, paste, collapse=":")
> vec <- match(mat2, unique(mat2))
> vec
[1] 1 2 1 1 2 3


P.S.  I found that Andy Liaw's method didn't preserve the
index order that I wanted; it yields

2 3 2 2 3 1

To get the order of integers I was looking for required an
invocation of unique:

as.numeric(factor(apply(mat, 1, paste, collapse=":"),
                  levels=unique(apply(mat, 1, paste, collapse=":"))))

But the first method above is obviously cleaner and is twice
as fast, only 9 seconds for a 100000 row matrix on an ordinary PC.  
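
The ordering difference can be illustrated with a minimal sketch on a made-up three-row matrix (not the data from this thread). factor() sorts its levels by default, while match(key, unique(key)) numbers groups in order of first appearance.

```r
# Illustrative matrix whose first row does not sort first
m   <- rbind(c(2, 1), c(1, 2), c(2, 1))
key <- apply(m, 1, paste, collapse = ":")   # "2:1" "1:2" "2:1"

# Default factor levels are sorted, so "1:2" becomes group 1
by_sorted_level <- as.numeric(factor(key))
by_sorted_level
# [1] 2 1 2

# Groups numbered in order of first appearance
by_first_seen <- match(key, unique(key))
by_first_seen
# [1] 1 2 1

# Supplying the levels explicitly reproduces the first-appearance numbering
same_as_match <- as.numeric(factor(key, levels = unique(key)))
same_as_match
# [1] 1 2 1
```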

Regards,
Scott Waichler

Liaw, Andy

2004-May-15 01:20 UTC

The problem with interaction() is that it doesn't scale with increasing
number of columns:
> set.seed(1)
> mat2 <- matrix(sample(20,5e4,rep=T), 1e4)
> invisible(gc()); system.time(z0 <- f0(mat2))
[1] 1.58 0.01 1.85   NA   NA
> invisible(gc()); system.time(z1 <- f1(mat2))
[1] 1.57 0.00 1.66   NA   NA
> invisible(gc()); system.time(z2 <- f2g(mat2))
[1] 34.14  0.60 57.45    NA    NA

[f2g is the slightly modified version of f2 to allow for any number of
columns:
f2g <- function(mat) as.numeric(interaction(as.data.frame(mat), drop=T))]

With 10 columns in the matrix, f0 and f1 ran fine in under 10 seconds, but
f2g started thrashing, and ran out of memory after a while.  If you look at
how interaction() is written you'll quickly see why...
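
A rough sketch of the scaling concern, on a small made-up matrix: it only counts how many level combinations are possible versus how many rows actually occur. It does not inspect interaction() itself, and whether a given R version materializes all combinations is an assumption here.

```r
# Illustrative matrix: 20 rows, columns with 4, 5 and 2 distinct values
m <- cbind(rep(1:4, 5), rep(1:5, 4), rep(1:2, 10))

levels_per_col <- apply(m, 2, function(col) length(unique(col)))
levels_per_col
# [1] 4 5 2

potential <- prod(levels_per_col)   # combinations a full cross-product would have
observed  <- nrow(unique(m))        # combinations that actually occur
c(potential = potential, observed = observed)
# potential  observed
#        40        20

# Same count for 10 columns that each take 20 distinct values
20^10
# [1] 1.024e+13
```

The number of observed groups can never exceed the number of rows, whereas the potential count grows as the product of the per-column level counts.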

Andy
> From: Gabor Grothendieck
> 
> The interaction solution gives an identical result, is shorter and
> is one or two orders of magnitude faster.  Here is a 
> comparison of the three:
> 
> R> set.seed(1)
> R> mat <- matrix(sample(20,100000,rep=T),50000)
> R> 
> R> f0 <- function(mat) {
> + mat2 <- apply(mat, 1, paste, collapse=":");
> + match(mat2, unique(mat2))
> + }
> R> 
> R> 
> R> f1 <- function(mat) { z <- apply(mat, 1, paste, collapse=":")
> + as.numeric(factor(z,levels=unique(z)))
> + }
> R> 
> R> f2 <- function(mat) as.numeric(interaction(mat[,1],mat[,2],drop=T))
> R> 
> R> dummy <- gc(); system.time(z0 <- f0(mat))
> [1] 5.24 0.02 5.52   NA   NA
> R> dummy <- gc(); system.time(z1 <- f1(mat))
> [1] 5.18 0.00 5.52   NA   NA
> R> dummy <- gc(); system.time(z2 <- f2(mat))
> [1] 0.1 0.0 0.1  NA  NA
> R> all.equal(z0,z1)
> [1] TRUE
> R> all.equal(z0,z2)
> [1] TRUE
> R> all.equal(z2,z1)
> [1] TRUE
> 