stacey thompson
2007-Mar-08 15:14 UTC
[R] Removing duplicated rows within a matrix, with missing data as wildcards
I'd like to remove duplicated rows within a matrix, with missing data being treated as wildcards.

For example

> x <- matrix((1:3), 5, 3)
> x[4,2] = NA
> x[3,3] = NA
> x

     [,1] [,2] [,3]
[1,]    1    3    2
[2,]    2    1    3
[3,]    3    2   NA
[4,]    1   NA    2
[5,]    2    1    3

I would like to obtain

     [,1] [,2] [,3]
[1,]    1    3    2
[2,]    2    1    3
[3,]    3    2   NA

From the R-help archives, I learned about unique(x) and duplicated(x).
However, unique(x) returns

> unique(x)

     [,1] [,2] [,3]
[1,]    1    3    2
[2,]    2    1    3
[3,]    3    2   NA
[4,]    1   NA    2

and duplicated(x) gives

> duplicated(x)

[1] FALSE FALSE FALSE FALSE  TRUE

I have tried various na.action's, but with unique(x) I get errors at best.

e.g.
> unique(x, na.omit(x))

Error: argument 'incomparables != FALSE' is not used (yet)

How might I tackle this?

Thanks,

-stacey

--
-stacey lee thompson-
Stagiaire post-doctorale
Institut de recherche en biologie végétale
Université de Montréal
4101 Sherbrooke Est
Montréal, Québec H1X 2B2 Canada
stacey.thompson at umontreal.ca
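One way to pin down "missing data as wildcards" is a pairwise test: two rows match if they agree in every column where both are observed. The sketch below uses the same condition that appears later in the thread; the helper name loose_equal is made up for illustration and is not part of base R.

```r
# Hypothetical helper (not a base R function): TRUE when rows a and b
# agree in every column where neither value is NA.
loose_equal <- function(a, b) all(is.na(a) | is.na(b) | a == b)

# The example matrix from the question
x <- matrix(1:3, 5, 3)
x[4, 2] <- NA
x[3, 3] <- NA

# Row 4 (1, NA, 2) against row 1 (1, 3, 2): the NA acts as a wildcard
match_4_1 <- loose_equal(x[4, ], x[1, ])

# Row 3 (3, 2, NA) against row 1 (1, 3, 2): observed columns disagree
match_3_1 <- loose_equal(x[3, ], x[1, ])
```

Under this definition row 4 counts as a duplicate of row 1, while row 3 matches no other row, which agrees with the desired output above.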
Petr Pikal
2007-Mar-09 07:03 UTC
[R] Removing duplicated rows within a matrix, with missing data as wildcards
Hi,

It's a bit tricky, but

dup <- apply(x, 2, duplicated)  # cells that repeat an earlier value in the same column
isna <- apply(x, 2, is.na)      # cells that are NA
check <- dup | isna             # cells that are either

and here is your result

x[rowSums(check) != ncol(x), ]
     [,1] [,2] [,3]
[1,]    1    3    2
[2,]    2    1    3
[3,]    3    2   NA
Regards
Petr
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html and provide commented,
> minimal, self-contained, reproducible code.
Petr Pikal
petr.pikal at precheza.cz
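A caveat on the column-wise check in the reply above: duplicated() is applied to each column separately, so a row can be flagged even when its values repeat values from different earlier rows. The sketch below shows this on a made-up matrix y, followed by a row-wise alternative. The function name drop_wildcard_dups and the rule it applies (drop an incomplete row only if it matches some complete row) are assumptions for illustration, not something the poster proposed.

```r
# Made-up counterexample: no NAs, and row 3 is not a duplicate of any row.
y <- matrix(c(1, 2,
              3, 4,
              1, 4), ncol = 2, byrow = TRUE)

# Column-wise check: row 3 is dropped because each of its cells
# repeats an earlier value in its own column.
dup <- apply(y, 2, duplicated)
isna <- apply(y, 2, is.na)
colwise <- y[rowSums(dup | isna) != ncol(y), , drop = FALSE]

# Hypothetical row-wise alternative: remove exact duplicates, then drop
# each incomplete row that agrees with a complete row on all observed columns.
drop_wildcard_dups <- function(m) {
  m <- unique(m)
  is_complete <- complete.cases(m)
  complete <- m[is_complete, , drop = FALSE]
  matches_complete <- function(i) {
    r <- m[i, ]
    any(vapply(seq_len(nrow(complete)),
               function(j) all(is.na(r) | r == complete[j, ]),
               logical(1)))
  }
  drop <- !is_complete & vapply(seq_len(nrow(m)), matches_complete, logical(1))
  m[!drop, , drop = FALSE]
}

rowwise <- drop_wildcard_dups(y)   # keeps all three rows of y

# The matrix from the original question
x <- matrix(1:3, 5, 3)
x[4, 2] <- NA
x[3, 3] <- NA
res <- drop_wildcard_dups(x)
```

This compares each incomplete row against every complete row, so it is quadratic in the number of rows; for very large matrices the sort-based function quoted later in the thread is likely the better starting point.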
Dimitris Rizopoulos
2007-Mar-09 15:14 UTC
[R] Removing duplicated rows within a matrix, with missing data as wildcards
You could also try something like the following:

x <- matrix(c(1, NA, 3, NA, 2, 3, 1, 3, 2, 2, 1, 3, 1, NA, 2, 2, 1, 3),
            ncol = 3, byrow = TRUE)

wildcardVals <- 1:3 # possible wildcard values
ind <- complete.cases(x)
nc <- ncol(x)
nr <- nrow(x[ind, , drop = FALSE])
nwld <- length(wildcardVals)
# expand each incomplete row into one candidate row per wildcard value
posb <- apply(x[!ind, , drop = FALSE], 1, function(y) {
    out <- matrix(y, nwld, nc, byrow = TRUE)
    out[, is.na(y)] <- wildcardVals
    t(out)
})
posb <- matrix(c(posb), ncol = nc, byrow = TRUE)
# despite its name, keep.ind flags the rows to drop
keep.ind <- duplicated(rbind(x[ind, ], posb))
keep.ind[-(1:nr)] <- apply(matrix(keep.ind[-(1:nr)], ncol = nwld, byrow = TRUE),
    1, function(x) if (any(x)) rep(TRUE, length(x)) else x)
out <- rbind(x[ind, ], matrix(rep(x[!ind, ], each = nwld), ncol = nc))
unique(out[!keep.ind, ])
I hope it works ok.
Best,
Dimitris
----
Dimitris Rizopoulos
Ph.D. Student
Biostatistical Centre
School of Public Health
Catholic University of Leuven
Address: Kapucijnenvoer 35, Leuven, Belgium
Tel: +32/(0)16/336899
Fax: +32/(0)16/337015
Web: http://med.kuleuven.be/biostat/
http://www.student.kuleuven.be/~m0390867/dimitris.htm
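A note on the expansion step in the code above: the assignment out[, is.na(y)] <- wildcardVals fills every NA column of a row with the same recycled values, so a row with two NAs is expanded to (1, 1), (2, 2), (3, 3) and not to all nine combinations. If rows with several NAs can occur, one option is to enumerate the fill-ins with expand.grid(). The helper expand_row below is a hypothetical sketch, not part of the posted solution.

```r
# Hypothetical helper: all completions of a single row y, where each NA
# may take any value in vals.
expand_row <- function(y, vals) {
  na_pos <- which(is.na(y))
  if (length(na_pos) == 0) {
    return(matrix(y, nrow = 1))
  }
  # one row per combination of fill-in values for the NA positions
  fills <- as.matrix(expand.grid(rep(list(vals), length(na_pos))))
  out <- matrix(y, nrow(fills), length(y), byrow = TRUE)
  out[, na_pos] <- fills
  out
}

# A row with two wildcards and three possible values
cand <- expand_row(c(NA, NA, 3), 1:3)
```

The number of candidates grows as length(vals) to the power of the number of NAs in the row, so this is only practical when rows have few missing values.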
----- Original Message -----
From: "stacey thompson" <stacey.lee.thompson at gmail.com>
To: <hpages at fhcrc.org>; <r-help at stat.math.ethz.ch>
Cc: <petr.pikal at precheza.cz>
Sent: Friday, March 09, 2007 3:09 PM
Subject: Re: [R] Removing duplicated rows within a matrix,with missing
data as wildcards
> Hi H.,
>
> Your response has improved the clarity of my thinking. Kind thanks.
> Also, your use of seq_len() prompted me to update from R version 2.3.1
> on this machine.
>
> For your matrix
>
> > x <- matrix(c(1, NA, 3, NA, 2, 3), ncol=3, byrow=TRUE)
> > x
> [,1] [,2] [,3]
> [1,] 1 NA 3
> [2,] NA 2 3
>
> I would want to delete either x[1,] or x[2,] but not both.
> Practically, your "removeLooseDupRows(x)"
>
> removeLooseDupRows <- function(x)
> {
>     if (nrow(x) <= 1)
>         return(x)
>     # sort row indices by every column in turn (NAs sort last)
>     ii <- do.call("order",
>                   args=lapply(seq_len(ncol(x)),
>                               function(col) x[ , col]))
>     dup_index <- logical(nrow(x))
>     i0 <- -1    # index of the most recent complete row
>     for (k in seq_along(ii)) {
>         i <- ii[k]
>         if (any(is.na(x[i, ]))) {
>             if (i0 == -1)
>                 next
>             # skip if row i differs from row i0 in a non-missing column
>             if (any(x[i, ] != x[i0, ], na.rm=TRUE))
>                 next
>             dup_index[i] <- TRUE
>         } else {
>             i0 <- i
>         }
>     }
>     x[!dup_index, ]
> }
>
> should leave no such ambiguous cases for my data, because nrow(x) is
> very high and each x contains few NAs. For example, a row of (1, 2, 3) is
> very likely to exist in my data.
>
> However, to find the row numbers of any remaining ambiguous matches,
> should they exist, using the example:
>
>> x <- matrix(c(1, NA, 3, NA, 2, 3, 1, 3, 2, 2, 1, 3, 1, NA, 2, 2, 1, 3), ncol=3, byrow=TRUE)
>> x
> [,1] [,2] [,3]
> [1,] 1 NA 3
> [2,] NA 2 3
> [3,] 1 3 2
> [4,] 2 1 3
> [5,] 1 NA 2
> [6,] 2 1 3
>
> after your suggested
>
>> removeLooseDupRows(x)
> [,1] [,2] [,3]
> [1,] 1 NA 3
> [2,] NA 2 3
> [3,] 1 3 2
> [4,] 2 1 3
> [5,] 2 1 3
>
>> q <- removeLooseDupRows(unique(x))
>> q
> [,1] [,2] [,3]
> [1,] 1 NA 3
> [2,] NA 2 3
> [3,] 1 3 2
> [4,] 2 1 3
>
> I could
>
>> # ambiguous matches in matrix form
>> apply(q, 1, function(row1) apply(q, 1, function(row2) all(is.na(row1) | is.na(row2) | row1==row2)))
>
> [,1] [,2] [,3] [,4]
> [1,] TRUE TRUE FALSE FALSE
> [2,] TRUE TRUE FALSE FALSE
> [3,] FALSE FALSE TRUE FALSE
> [4,] FALSE FALSE FALSE TRUE
>
>> # indices of ambiguous matches
>> m <- which(apply(q, 1, function(row1) apply(q, 1, function(row2) all(is.na(row1) | is.na(row2) | row1==row2))), arr.ind=TRUE)
>> m
> row col
> [1,] 1 1
> [2,] 2 1
> [3,] 1 2
> [4,] 2 2
> [5,] 3 3
> [6,] 4 4
>
>> #put in order and omit duplicates
>> m2 <- unique(t(apply(m, 1, sort)))
>> m2
> [,1] [,2]
> [1,] 1 1
> [2,] 1 2
> [3,] 2 2
> [4,] 3 3
> [5,] 4 4
>
>> # show the ambiguous matches
>> m2[m2[,1]!=m2[,2], , drop=FALSE]
>      [,1] [,2]
> [1,]    1    2
>
> ...and proceed from there.
>
> This solution came from another helpful "R-help" respondent to my
> poorly-defined problem.
>
> Appreciative thanks to everyone for your instructive help.
>
> Cheers,
> stacey
>
> --
> -stacey lee thompson-
> Stagiaire post-doctorale
> Institut de recherche en biologie végétale
> Université de Montréal
> 4101 Sherbrooke Est
> Montréal, Québec H1X 2B2 Canada
> stacey.thompson at umontreal.ca
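The search for ambiguous pairs in the quoted message can be shortened by building the match matrix once and keeping only its upper triangle, which lists each unordered pair once and skips self-matches. This is a sketch under the same matching rule as the quoted code; the names loose_equal, match_mat and pairs are made up for illustration.

```r
# The matrix q from the quoted message, after exact and loose duplicates
# against complete rows have been removed
q <- matrix(c( 1, NA, 3,
              NA,  2, 3,
               1,  3, 2,
               2,  1, 3), ncol = 3, byrow = TRUE)

# Hypothetical helper: rows agree wherever both are observed
loose_equal <- function(a, b) all(is.na(a) | is.na(b) | a == b)

n <- nrow(q)
# n x n logical matrix: does row i loosely match row j?
match_mat <- outer(seq_len(n), seq_len(n),
                   Vectorize(function(i, j) loose_equal(q[i, ], q[j, ])))

# Keep each unordered pair once and drop the diagonal
pairs <- which(match_mat & upper.tri(match_mat), arr.ind = TRUE)
pairs
```

For this q the only ambiguous pair is rows 1 and 2, the same pair the quoted steps arrive at.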
>