Dear useRs,
I am having a hard time coming up with a nice and efficient solution to
a problem on entire matrices or data.frames. In spirit, this is similar to
what setdiff() and setequal() do, but I need it in more dimensions.
Here's a brief description.
* given a set of factors or sequences, expand.grid() gives me the set
of combinations in a data.frame;
in my case all arguments are numeric so I could convert the data frame to
a matrix
let's call this one Candidates
* I have a second matrix (or data frame) to compare to; this second
set may be a subset of the first, or a superset, but it is guaranteed to
contain the same columns
let's call this one Comparison
* I want to know which rows in Candidates are not yet in Comparison.
A toy example:
> Comparison <- matrix(1:30, ncol=5)
> Candidates <- Comparison[c(2,4), ]
> checkRow <- function(r, M) { any( (r[1] == M[,1]) & (r[2] == M[,2])
& (r[3] == M[,3]) & (r[4] == M[,4]) ) }
> checkRow( Candidates[1,], Comparison)
[1] TRUE
> falseRow <- Candidates[1,]
> falseRow[2] <- 42
> checkRow( falseRow, Comparison)
[1] FALSE
The checkRow function works but is a) klunky, b) hardcodes the dimension and
c) works only on one row at a time.
There must be better ways, at least for a) and b). What am I missing?
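For a) and b), one possible direction is sketched below. This is an illustration only and not part of the original post: the name checkRowAll is hypothetical, and unlike checkRow above (which looks at the first four columns only) it compares every column. It assumes r has exactly ncol(M) elements and that there are no NA values.

```r
# Hypothetical generalisation of checkRow(): compares all columns,
# so the number of columns is no longer hardcoded.
checkRowAll <- function(r, M) {
  # t(M) has one column per row of M; r is recycled down each of those
  # columns, so t(M) == r compares r with every row of M element by element.
  matchesPerRow <- colSums(t(M) == r)
  # A row of M equals r when all ncol(M) elements match.
  any(matchesPerRow == ncol(M))
}

Comparison <- matrix(1:30, ncol = 5)
Candidates <- Comparison[c(2, 4), ]

checkRowAll(Candidates[1, ], Comparison)   # row taken from Comparison

falseRow <- Candidates[1, ]
falseRow[2] <- 42
checkRowAll(falseRow, Comparison)          # altered row
```

This still handles one row at a time, so point c) remains open; apply(Candidates, 1, checkRowAll, M = Comparison) would loop over the rows.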
Feel free to reply off-list and I'd gladly summarize back to the list. If you
don't want your reply (or email) summarized back, please indicate.
Thanks, Dirk
--
Hell, there are no rules here - we're trying to accomplish something.
-- Thomas A. Edison
Gabor Grothendieck
2006-Sep-28 16:05 UTC
[R] Comparing entire row sets at once efficiently
If Comparison and Candidates each have no duplicated rows (which is the
situation in the example) then try this:

  tail(!duplicated(rbind(Comparison, Candidates)), nrow(Candidates))
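To see what the one-liner above returns, here is a small worked sketch on the toy data from the question. The third row of Candidates is made up for illustration so that at least one candidate is not in Comparison; the names isNew and newRows are likewise hypothetical. As in the suggestion, it assumes Candidates has no duplicated rows of its own.

```r
Comparison <- matrix(1:30, ncol = 5)
Candidates <- rbind(Comparison[c(2, 4), ],
                    c(2, 42, 14, 20, 26))   # made-up row, not in Comparison

# Stack Comparison on top of Candidates. duplicated() flags each row that
# repeats an earlier row, so a Candidates row that already occurs in
# Comparison is flagged; negating gives TRUE for the rows that are new.
# tail() keeps only the entries belonging to Candidates.
isNew <- tail(!duplicated(rbind(Comparison, Candidates)), nrow(Candidates))
isNew   # FALSE FALSE TRUE

# Rows of Candidates that are not yet in Comparison
newRows <- Candidates[isNew, , drop = FALSE]
newRows
```

The result is one logical value per row of Candidates, so it covers all rows at once and does not depend on the number of columns.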