I need a function which is similar to duplicated(), but instead of
returning TRUE/FALSE, returns indices of which element was duplicated.
That is,

> x <- c(9,7,9,3,7)
> duplicated(x)
[1] FALSE FALSE  TRUE FALSE  TRUE

> duplicates(x)
[1] NA NA  1 NA  2

(so that I know that element 3 is a duplicate of element 1, and element
5 is a duplicate of element 2, whereas the others were not duplicated
according to our definition.)

Is there a simple way to write this function? I have an ugly
implementation in R that loops over all the values; it would make more
sense to redo it in C, if there isn't a simple implementation I missed.

Duncan Murdoch
How about:

y <- rep(NA,length(x))
y[duplicated(x)] <- match(x[duplicated(x)] ,x)

--
Joshua Ulrich  |  FOSS Trading: www.fosstrading.com

On Fri, Apr 8, 2011 at 9:59 AM, Duncan Murdoch <murdoch.duncan at gmail.com> wrote:
> I need a function which is similar to duplicated(), but instead of returning
> TRUE/FALSE, returns indices of which element was duplicated.  That is,
>
>> x <- c(9,7,9,3,7)
>> duplicated(x)
> [1] FALSE FALSE  TRUE FALSE TRUE
>
>> duplicates(x)
> [1] NA NA  1 NA  2
>
> (so that I know that element 3 is a duplicate of element 1, and element 5 is
> a duplicate of element 2, whereas the others were not duplicated according
> to our definition.)
>
> Is there a simple way to write this function?  I have an ugly
> implementation in R that loops over all the values; it would make more sense
> to redo it in C, if there isn't a simple implementation I missed.
>
> Duncan Murdoch
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
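The two lines above can be wrapped into a function. What follows is only a minimal sketch, assuming an atomic vector as input; the name duplicates() is taken from the original question and is hypothetical (base R has no such function).

```r
# Hypothetical helper named after the function requested in the question;
# base R does not provide duplicates().
duplicates <- function(x) {
  dup <- duplicated(x)                # TRUE for second and later occurrences
  out <- rep(NA_integer_, length(x))  # default: not a duplicate
  out[dup] <- match(x[dup], x)        # match() returns the first matching position
  out
}

x <- c(9, 7, 9, 3, 7)
duplicates(x)
# [1] NA NA  1 NA  2
```

Computing duplicated(x) once and reusing it avoids the second pass over x that the two-line version makes.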
On Fri, Apr 8, 2011 at 9:59 AM, Duncan Murdoch <murdoch.duncan at gmail.com> wrote:
> Is there a simple way to write this function?  I have an ugly
> implementation in R that loops over all the values; it would make more sense
> to redo it in C, if there isn't a simple implementation I missed.

I'd think of making it a lookup table. The basic idea is

split(seq_along(x), x)

but there are probably much faster ways of doing it, depending on what
you need. But for efficiency, you probably need a hashtable somewhere.

Hadley

--
Assistant Professor / Dobelman Family Junior Chair
Department of Statistics / Rice University
http://had.co.nz/
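One way the lookup-table idea might be carried through to the requested output is sketched below. This is an illustration only, not necessarily what Hadley had in mind; the variable names are made up for the example, and it assumes x is an atomic vector without NA values.

```r
x <- c(9, 7, 9, 3, 7)

# Lookup table: one entry per distinct value, holding all positions of that value.
f <- factor(x)
positions <- split(seq_along(x), f)

# First position of each distinct value, in the order of the factor levels.
first <- vapply(positions, function(i) i[1L], integer(1))

# Map the first positions back to every element via the integer level codes.
first.occ <- unname(first[as.integer(f)])

# An element that is its own first occurrence is not a duplicate.
first.occ[first.occ == seq_along(x)] <- NA
first.occ
# [1] NA NA  1 NA  2
```

The list in positions keeps all occurrences of each value, which is more than the question asks for but can be useful if later duplicates need to be related to each other as well.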
which(duplicated(x)=="TRUE")

--
View this message in context: http://r.789695.n4.nabble.com/duplicates-function-tp3436584p3437614.html
Sent from the R devel mailing list archive at Nabble.com.
On Fri, Apr 08, 2011 at 10:59:10AM -0400, Duncan Murdoch wrote:
> Is there a simple way to write this function?

A possible strategy is to use sorting. In a sorted matrix or data frame,
the elements which are duplicates of the same element form consecutive
blocks. These blocks may be identified using !duplicated(), which
determines the first elements of these blocks. Since sorting is stable,
when we map these blocks back to the original order, the first element of
each block is mapped to the first occurrence of the given row in the
original order.

An implementation may be done as follows.

  duplicates <- function(dat)
  {
      # stable ordering of the rows by all columns
      s <- do.call("order", as.data.frame(dat))
      # drop = FALSE keeps the matrix shape when dat has a single row or column
      non.dup <- !duplicated(dat[s, , drop = FALSE])
      orig.ind <- s[non.dup]
      first.occ <- orig.ind[cumsum(non.dup)]
      first.occ[non.dup] <- NA
      first.occ[order(s)]
  }

  x <- cbind(1, c(9,7,9,3,7) )
  duplicates(x)
  [1] NA NA  1 NA  2

The line

  orig.ind <- s[non.dup]

creates a vector whose length is the number of non-duplicated rows in the
sorted "dat". Its components are the indices of the corresponding first
occurrences of these rows in the original order. For this, the stability
of the order is needed.

The lines

  first.occ <- orig.ind[cumsum(non.dup)]
  first.occ[non.dup] <- NA

expand orig.ind to a vector which satisfies: if the i-th row of the sorted
"dat" is duplicated, then first.occ[i] is the index of the first row in
the original "dat" which is equal to this row. So, the values in first.occ
are those which are required for the output of duplicates(), but they are
in the order of the sorted "dat".

The last line

  first.occ[order(s)]

reorders the vector to the original order of the rows.

Petr Savicky.
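To make the steps concrete, here is the same computation unrolled for the example matrix, with the intermediate values noted in comments. The statements are those of the function body; the only additions, made for this illustration, are drop = FALSE in the subsetting and the name result.

```r
dat <- cbind(1, c(9, 7, 9, 3, 7))

s <- do.call("order", as.data.frame(dat))       # 4 2 5 1 3  (rows sorted: 3 7 7 9 9)
non.dup <- !duplicated(dat[s, , drop = FALSE])  # TRUE TRUE FALSE TRUE FALSE
orig.ind <- s[non.dup]                          # 4 2 1      (first occurrences, original indices)
first.occ <- orig.ind[cumsum(non.dup)]          # 4 2 2 1 1  (one value per sorted row)
first.occ[non.dup] <- NA                        # NA NA 2 NA 1
result <- first.occ[order(s)]                   # back to the original row order
result
# [1] NA NA  1 NA  2
```

Here order(s) is the inverse of the permutation s, which is why indexing by it undoes the sort.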
Thanks for your answer, but:

1. Can you please cite the question? It is hard for mailing list readers
to follow now.

2. I think

which(duplicated(x))

would be simpler, faster and less confusing, if your code were the
solution, which it is not.

3. Please read the original question carefully: your code and my
optimization of it above both give a different answer from the one that
was asked for.

Best,
Uwe Ligges

On 09.04.2011 01:05, B77S wrote:
> which(duplicated(x)=="TRUE")
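The difference can be seen on the example from the original question. This is a small illustration using only base R functions:

```r
x <- c(9, 7, 9, 3, 7)

# Positions of the elements that are duplicates.
which(duplicated(x))
# [1] 3 5

# What the question asks for is built from the position of the first
# occurrence of each value, which match() provides.
match(x, x)
# [1] 1 2 1 4 2
```

The first result says which elements are duplicates; it does not say which earlier element each one duplicates, and that is the information requested.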