Hervé Pagès
2013-Jul-29 18:52 UTC
[Rd] duplicated.data.frame() is broken on data frames containing \r
Hi,

The trick used by duplicated.data.frame() is to transform the supplied data.frame into a character vector by pasting together the columns using "\r" as separator. But no precautions are taken to deal with "\r" in the supplied data.frame. As a consequence it's easy to imagine situations where duplicated.data.frame() returns an incorrect answer:

> df <- data.frame(a=c("AA", "AA\r"), b=c("\rBBB", "BBB"))
> df
     a     b
1   AA \rBBB
2 AA\r   BBB
> duplicated(df)
[1] FALSE  TRUE

Cheers,
H.

> sessionInfo()
R version 3.0.1 (2013-05-16)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=C                 LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319
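To see why the two rows above are reported as duplicates, it can help to build the pasted row keys by hand. The snippet below is only a sketch that mimics the pasting described in the message (columns joined with "\r"); it is not claimed to be the exact source of duplicated.data.frame(), and the variable name `keys` is made up for illustration.

```r
# Two rows that differ only in where the "\r" sits.
df <- data.frame(a = c("AA", "AA\r"), b = c("\rBBB", "BBB"))

# Mimic the described approach: paste the columns row-wise with "\r"
# as separator. unname() keeps column names from being interpreted as
# arguments of paste().
keys <- do.call(paste, c(unname(as.list(df)), sep = "\r"))

# Both rows collapse to the same string, "AA\r\rBBB".
identical(keys[1], keys[2])
# [1] TRUE

duplicated(keys)
# [1] FALSE  TRUE
```

Row 1 contributes "AA" + "\r" + "\rBBB" and row 2 contributes "AA\r" + "\r" + "BBB", so the keys are character-for-character identical even though the rows are not.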
Hervé Pagès
2013-Jul-29 19:06 UTC
[Rd] duplicated.data.frame() is broken on data frames containing \r
OK, it's actually documented:

  The data frame method works by pasting together a character
  representation of the rows separated by ‘\r’, so may be imperfect
  if the data frame has characters with embedded carriage returns or
  columns which do not reliably map to characters.

But what about fixing it? One possible fix is to use "\r\r" as separator and to substitute user-supplied "\r" with, say, "#\r#". Just an example.

Thanks,
H.
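A rough sketch of the fix suggested above, written as standalone helpers rather than as a patch to base R. The function names `row_keys` and `duplicated_rows` are hypothetical. The sketch assumes every column maps reliably to character, so it inherits the other caveat from the documentation (for instance, NA and the string "NA" still produce the same key).

```r
# Hypothetical helper (not part of base R): one key per row, using the
# escaping scheme proposed in the message.
row_keys <- function(x) {
  stopifnot(is.data.frame(x))
  # Convert each column to character and replace every embedded "\r"
  # with "#\r#", so that no escaped field can contain "\r\r".
  cols <- lapply(x, function(col) {
    gsub("\r", "#\r#", as.character(col), fixed = TRUE)
  })
  # Paste the escaped columns row-wise with "\r\r" as separator.
  # unname() keeps column names from being read as paste() arguments.
  do.call(paste, c(unname(cols), sep = "\r\r"))
}

# Hypothetical wrapper: duplicated() applied to the escaped keys.
duplicated_rows <- function(x) duplicated(row_keys(x))

df <- data.frame(a = c("AA", "AA\r"), b = c("\rBBB", "BBB"))
duplicated_rows(df)
# [1] FALSE FALSE

# Genuinely identical rows are still detected.
duplicated_rows(rbind(df, df[1, ]))
# [1] FALSE FALSE  TRUE
```

After escaping, every "\r" inside a field has a "#" on both sides, so two adjacent carriage returns can only come from the separator. That is what keeps the two rows from the original report apart: their keys become "AA\r\r#\r#BBB" and "AA#\r#\r\rBBB".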