stacey thompson
2007-Mar-08 15:14 UTC
[R] Removing duplicated rows within a matrix, with missing data as wildcards
I'd like to remove duplicated rows within a matrix, with missing data being treated as wildcards.

For example

> x <- matrix((1:3), 5, 3)
> x[4,2] = NA
> x[3,3] = NA
> x
     [,1] [,2] [,3]
[1,]    1    3    2
[2,]    2    1    3
[3,]    3    2   NA
[4,]    1   NA    2
[5,]    2    1    3

I would like to obtain

     [,1] [,2] [,3]
[1,]    1    3    2
[2,]    2    1    3
[3,]    3    2   NA

From the R-help archives, I learned about unique(x) and duplicated(x). However, unique(x) returns

> unique(x)
     [,1] [,2] [,3]
[1,]    1    3    2
[2,]    2    1    3
[3,]    3    2   NA
[4,]    1   NA    2

and duplicated(x) gives

> duplicated(x)
[1] FALSE FALSE FALSE FALSE  TRUE

I have tried various na.action's, but with unique(x) I get errors at best, e.g.

> unique(x, na.omit(x))
Error: argument 'incomparables != FALSE' is not used (yet)

How might I tackle this?

Thanks,
-stacey

--
-stacey lee thompson-
Stagiaire post-doctorale
Institut de recherche en biologie végétale
Université de Montréal
4101 Sherbrooke Est
Montréal, Québec H1X 2B2 Canada
stacey.thompson at umontreal.ca
Petr Pikal
2007-Mar-09 07:03 UTC
[R] Removing duplicated rows within a matrix, with missing data as wildcards
Hi,

it's a bit tricky, but

dup <- apply(x, 2, duplicated)   # which values are duplicated within each column
isna <- apply(x, 2, is.na)       # which values are NA
check <- dup | isna              # which values are duplicated or NA

and here is your result

x[rowSums(check) != 3, ]         # 3 = ncol(x): drop rows flagged in every column
     [,1] [,2] [,3]
[1,]    1    3    2
[2,]    2    1    3
[3,]    3    2   NA

Regards
Petr

Petr Pikal
petr.pikal at precheza.cz

______________________________________________
R-help at stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
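A note on generality: the column-wise test above flags a value when it has appeared earlier in its column, in any row, so on other data it can give a different answer from a row-by-row comparison. A minimal row-wise sketch follows, assuming that a row should be dropped whenever it agrees with an already-kept row in every position where both are non-missing, and that rows with fewer NAs should be preferred. The names `rows_match` and `drop_wildcard_dups` are hypothetical and do not come from the thread.

```r
# Hypothetical helper: TRUE when rows a and b agree wherever both are non-NA.
rows_match <- function(a, b) all(is.na(a) | is.na(b) | a == b)

# Hypothetical sketch: greedy removal of wildcard duplicates.
# Rows with fewer NAs are visited first, so a complete row is kept
# in preference to a partially missing row that matches it.
drop_wildcard_dups <- function(x) {
  visit <- order(rowSums(is.na(x)))   # ties stay in original row order
  kept <- integer(0)
  for (i in visit) {
    is_dup <- any(vapply(kept,
                         function(k) rows_match(x[i, ], x[k, ]),
                         logical(1)))
    if (!is_dup) kept <- c(kept, i)
  }
  x[sort(kept), , drop = FALSE]       # restore original row order
}

# Example matrix from the original post
x <- matrix(1:3, 5, 3)
x[4, 2] <- NA
x[3, 3] <- NA
res <- drop_wildcard_dups(x)
print(res)
```

On the example matrix from the original post this keeps rows 1 to 3. Because the procedure is greedy, two incomplete rows that match only each other are resolved by keeping whichever is visited first.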
Dimitris Rizopoulos
2007-Mar-09 15:14 UTC
[R] Removing duplicated rows within a matrix, with missing data as wildcards
You could also try something like the following:

x <- matrix(c(1, NA, 3, NA, 2, 3, 1, 3, 2, 2, 1, 3, 1, NA, 2, 2, 1, 3), ncol=3, byrow=TRUE)

wildcardVals <- 1:3 # possible wildcard values
ind <- complete.cases(x)   # rows without NA
nc <- ncol(x)
nr <- nrow(x[ind, ])
nwld <- length(wildcardVals)
# expand each incomplete row into candidate completions, one per wildcard value
posb <- apply(x[!ind, , drop = FALSE], 1, function(y){
    out <- matrix(y, nwld, nc, byrow = TRUE)
    out[, is.na(y)] <- wildcardVals
    t(out)
})
posb <- matrix(c(posb), ncol = nc, byrow = TRUE)
# TRUE marks rows that are duplicated in the stacked matrix
keep.ind <- duplicated(rbind(x[ind, ], posb))
# if any completion of an incomplete row is a duplicate, flag all of its completions
keep.ind[-(1:nr)] <- apply(matrix(keep.ind[-(1:nr)], ncol = nwld, byrow = TRUE), 1,
    function(x) if(any(x)) rep(TRUE, length(x)) else x)
out <- rbind(x[ind, ], matrix(rep(x[!ind, ], each = nwld), ncol = nc))
unique(out[!keep.ind, ])

I hope it works ok.

Best,
Dimitris

----
Dimitris Rizopoulos
Ph.D. Student
Biostatistical Centre
School of Public Health
Catholic University of Leuven

Address: Kapucijnenvoer 35, Leuven, Belgium
Tel: +32/(0)16/336899
Fax: +32/(0)16/337015
Web: http://med.kuleuven.be/biostat/
     http://www.student.kuleuven.be/~m0390867/dimitris.htm

----- Original Message -----
From: "stacey thompson" <stacey.lee.thompson at gmail.com>
To: <hpages at fhcrc.org>; <r-help at stat.math.ethz.ch>
Cc: <petr.pikal at precheza.cz>
Sent: Friday, March 09, 2007 3:09 PM
Subject: Re: [R] Removing duplicated rows within a matrix, with missing data as wildcards

> Hi H.,
>
> Your response has improved the clarity of my thinking. Kind thanks.
> Also, your use of seq_len() prompted me to update from R version 2.3.1
> on this machine.
>
> For your matrix
>
> > x <- matrix(c(1, NA, 3, NA, 2, 3), ncol=3, byrow=TRUE)
> > x
>      [,1] [,2] [,3]
> [1,]    1   NA    3
> [2,]   NA    2    3
>
> I would want to delete either x[1,] or x[2,] but not both.
>
> Practically, your "removeLooseDupRows(x)"
>
> removeLooseDupRows <- function(x)
> {
>     if (nrow(x) <= 1)
>         return(x)
>     ii <- do.call("order",
>                   args=lapply(seq_len(ncol(x)),
>                               function(col) x[ , col]))
>     dup_index <- logical(nrow(x))
>     i0 <- -1
>     for (k in 1:length(ii)) {
>         i <- ii[k]
>         if (any(is.na(x[i, ]))) {
>             if (i0 == -1)
>                 next
>             if (any(x[i, ] != x[i0, ], na.rm=TRUE))
>                 next
>             dup_index[i] <- TRUE
>         } else {
>             i0 <- i
>         }
>     }
>     x[!dup_index, ]
> }
>
> should leave no such ambiguous cases for my data, as the nrow(x) are
> very high with few NA in each x. For example, a row of (1, 2, 3) is
> very likely to exist in my data.
>
> However, to find the row numbers of any remaining ambiguous matches,
> should they exist, using example:
>
> > x <- matrix(c(1, NA, 3, NA, 2, 3, 1, 3, 2, 2, 1, 3, 1, NA, 2, 2, 1, 3), ncol=3, byrow=TRUE)
> > x
>      [,1] [,2] [,3]
> [1,]    1   NA    3
> [2,]   NA    2    3
> [3,]    1    3    2
> [4,]    2    1    3
> [5,]    1   NA    2
> [6,]    2    1    3
>
> after your suggested
>
> > removeLooseDupRows(x)
>      [,1] [,2] [,3]
> [1,]    1   NA    3
> [2,]   NA    2    3
> [3,]    1    3    2
> [4,]    2    1    3
> [5,]    2    1    3
>
> > q <- removeLooseDupRows(unique(x))
> > q
>      [,1] [,2] [,3]
> [1,]    1   NA    3
> [2,]   NA    2    3
> [3,]    1    3    2
> [4,]    2    1    3
>
> I could
>
> > # ambiguous matches in matrix form
> > apply(q, 1, function(row1) apply(q, 1, function(row2)
> +     all(is.na(row1) | is.na(row2) | row1==row2)))
>       [,1]  [,2]  [,3]  [,4]
> [1,]  TRUE  TRUE FALSE FALSE
> [2,]  TRUE  TRUE FALSE FALSE
> [3,] FALSE FALSE  TRUE FALSE
> [4,] FALSE FALSE FALSE  TRUE
>
> > # indices of ambiguous matches
> > m <- which(apply(q, 1, function(row1) apply(q, 1, function(row2)
> +     all(is.na(row1) | is.na(row2) | row1==row2))), arr=T)
> > m
>      row col
> [1,]   1   1
> [2,]   2   1
> [3,]   1   2
> [4,]   2   2
> [5,]   3   3
> [6,]   4   4
>
> > # put in order and omit duplicates
> > m2 <- unique(t(apply(m, 1, sort)))
> > m2
>      [,1] [,2]
> [1,]    1    1
> [2,]    1    2
> [3,]    2    2
> [4,]    3    3
> [5,]    4    4
>
> > # show the ambiguous matches
> > m2[m2[,1]!=m2[,2], drop=F]
> [1] 1 2
>
> ...and proceed from there.
>
> This solution came from another helpful "R-help" respondent to my
> poorly-defined problem.
>
> Appreciative thanks to everyone for your instructive help.
>
> Cheers,
> stacey
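The pairwise check in the quoted message can be condensed by looking only at the upper triangle of the match matrix, which drops self-matches and mirrored pairs in one step. A minimal sketch, assuming the same matching rule (two rows match when they agree wherever both are non-NA); the function name `ambiguous_pairs` is hypothetical and not from the thread.

```r
# Hypothetical helper, not from the thread: list pairs of distinct rows
# that agree wherever both are non-NA.
ambiguous_pairs <- function(m) {
  n <- nrow(m)
  match_mat <- matrix(FALSE, n, n)
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      match_mat[i, j] <- all(is.na(m[i, ]) | is.na(m[j, ]) | m[i, ] == m[j, ])
    }
  }
  # upper.tri() keeps only cells with i < j, so each pair is reported once
  which(match_mat & upper.tri(match_mat), arr.ind = TRUE)
}

# The matrix q from the quoted message
q <- matrix(c( 1, NA, 3,
              NA,  2, 3,
               1,  3, 2,
               2,  1, 3), ncol = 3, byrow = TRUE)
amb <- ambiguous_pairs(q)
print(amb)
```

For the matrix q above, the only pair reported is rows 1 and 2, which agrees with the result of the m2 subsetting step in the quoted message.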