Liviu Andronic
2012-Jul-23 12:49 UTC
[Rd] duplicated() variation that goes both ways to capture all duplicates
Dear all

The trouble with the current duplicated() function is that it can
report duplicates while searching fromFirst _or_ fromLast, but not
both ways. Often users will want to identify and extract all the
copies of the item that has duplicates, not only the duplicates
themselves.

To take the example from the man page:

> data(iris)
> iris[duplicated(iris), ]  ##duplicates while searching "fromFirst"
    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
143          5.8         2.7          5.1         1.9 virginica
> iris[duplicated(iris, fromLast=T), ]  ##duplicates while searching "fromLast"
    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
102          5.8         2.7          5.1         1.9 virginica

To extract all the copies of the concerned items ("original" and
duplicates) one would need to do something like this:

> iris[(duplicated(iris) | duplicated(iris, fromLast=T)), ]  ##duplicates while searching "bothWays"
    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
102          5.8         2.7          5.1         1.9 virginica
143          5.8         2.7          5.1         1.9 virginica

Unfortunately this is unnecessarily long and convoluted. Short of a
'bothWays' argument in duplicated(), I came up with a small wrapper
that simplifies the above:

duplicated2 <-
    function(x, bothWays=TRUE, ...)
    {
        if(!bothWays) {
            return(duplicated(x, ...))
        } else if(bothWays) {
            return((duplicated(x, ...) | duplicated(x, fromLast=TRUE, ...)))
        }
    }

Now the above can be achieved simply via:

> iris[duplicated2(iris), ]  ##duplicates while searching "bothWays"
    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
102          5.8         2.7          5.1         1.9 virginica
143          5.8         2.7          5.1         1.9 virginica

So here's my inquiry: Would the R Core consider adding such
functionality in 'base' R? Either the---suitably cleaned
up---duplicated2() function above, or a "bothWays" argument in
duplicated() itself? Either of the two would improve user convenience
and reduce confusion. (In my case it took some time before I
understood the correct approach to this problem.)

Regards
Liviu

--
Do you know how to read?
http://www.alienetworks.com/srtest.cfm
http://goodies.xfce.org/projects/applications/xfce4-dict#speed-reader
Do you know how to write?
http://garbl.home.comcast.net/~garbl/stylemanual/e.htm#e-mail
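
For readers following the thread, the difference between the two search directions may be easier to see on a short vector than on iris. The following is only an illustrative sketch: the vector x and the variable names are made up for this example and do not come from the original post.

```r
# Toy vector (hypothetical data): "a" occurs three times, "b" twice, "c" once.
x <- c("a", "b", "a", "c", "b", "a")

# Searching from the first element flags every repeat except the first copy.
from_first <- duplicated(x)

# Searching from the last element flags every repeat except the last copy.
from_last <- duplicated(x, fromLast = TRUE)

# Combining both masks flags every element whose value occurs more than once.
both_ways <- from_first | from_last

print(x[from_first])  # later copies only
print(x[from_last])   # earlier copies only
print(x[both_ways])   # all copies of every duplicated value
```

Only "c", which occurs once, is left out of the combined mask.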
Duncan Murdoch
2012-Jul-23 13:08 UTC
[Rd] duplicated() variation that goes both ways to capture all duplicates
On 23/07/2012 8:49 AM, Liviu Andronic wrote:
> Dear all
> The trouble with the current duplicated() function is that it can
> report duplicates while searching fromFirst _or_ fromLast, but not
> both ways. Often users will want to identify and extract all the
> copies of the item that has duplicates, not only the duplicates
> themselves.
>
> To take the example from the man page:
> > data(iris)
> > iris[duplicated(iris), ]  ##duplicates while searching "fromFirst"
>     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
> 143          5.8         2.7          5.1         1.9 virginica
> > iris[duplicated(iris, fromLast=T), ]  ##duplicates while searching "fromLast"
>     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
> 102          5.8         2.7          5.1         1.9 virginica
>
>
> To extract all the copies of the concerned items ("original" and
> duplicates) one would need to do something like this:
> > iris[(duplicated(iris) | duplicated(iris, fromLast=T)), ]  ##duplicates while searching "bothWays"
>     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
> 102          5.8         2.7          5.1         1.9 virginica
> 143          5.8         2.7          5.1         1.9 virginica
>
>
> Unfortunately this is unnecessarily long and convoluted. Short of a
> 'bothWays' argument in duplicated(), I came up with a small wrapper
> that simplifies the above:
> duplicated2 <-
>     function(x, bothWays=TRUE, ...)
>     {
>         if(!bothWays) {
>             return(duplicated(x, ...))
>         } else if(bothWays) {
>             return((duplicated(x, ...) | duplicated(x, fromLast=TRUE, ...)))
>         }
>     }
>
>
> Now the above can be achieved simply via:
> > iris[duplicated2(iris), ]  ##duplicates while searching "bothWays"
>     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
> 102          5.8         2.7          5.1         1.9 virginica
> 143          5.8         2.7          5.1         1.9 virginica
>
>
> So here's my inquiry: Would the R Core consider adding such
> functionality in 'base' R? Either the---suitably cleaned
> up---duplicated2() function above, or a "bothWays" argument in
> duplicated() itself? Either of the two would improve user convenience
> and reduce confusion. (In my case it took some time before I
> understood the correct approach to this problem.)

I can't speak for all of R core, but I don't see the need for this in
base R -- your solution looks fine to me.

Duncan Murdoch
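
Since the suggestion is to keep this in user code, one possible tidied form of the wrapper is sketched below. The name duplicated_all is hypothetical (it is not part of base R and was not proposed in the thread), and the sketch drops the bothWays switch on the assumption that plain duplicated() is used directly when only one direction is wanted.

```r
# Hypothetical user-level helper, not a base R function.
# Returns TRUE for every element (or row) whose value occurs more than once,
# including the first and last copies.
# Note: fromLast must not be passed through '...', because the second call
# already sets it and R would report a duplicated formal argument.
duplicated_all <- function(x, ...) {
  duplicated(x, ...) | duplicated(x, fromLast = TRUE, ...)
}

# Usage on the data set from the thread (iris ships with R's datasets package).
print(iris[duplicated_all(iris), ])
```

On iris this should select the same two rows, 102 and 143, that the thread's "bothWays" expression returns.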