G. Jay Kerns
2009-May-30 04:35 UTC
[Rd] setdiff bizarre (was: odd behavior out of setdiff)
Dear R-devel,

Please see the recent thread on R-help, "Odd Behavior Out of setdiff(...) - addition of duplicate entries is not identified" posted by Jason Rupert. I gave an answer, then read David Winsemius' answer, and then did some follow-up investigation.

I would like to change my answer.

My current version of setdiff() is acting in a way that I do not understand, and a way that I suspect has changed. Consider the following, derived from Jason's OP:

The base package setdiff(), atomic vectors:

    x <- 1:100
    y <- c(x,x)

    setdiff(x, y)  # integer(0)
    setdiff(y, x)  # integer(0)

    z <- 1:25

    setdiff(x,z)   # 26:100
    setdiff(z,x)   # integer(0)

Everything is fine.

Now look at base package setdiff(), data frames???

    A <- data.frame(x = 1:100)
    B <- rbind(A, A)

    setdiff(A, B)  # df 1:100?
    setdiff(B, A)  # df 1:100?

    C <- data.frame(x = 1:25)

    setdiff(A, C)  # df 1:100?
    setdiff(C, A)  # df 1:25?

I have read ?setdiff 37 times now, and I cannot divine any interpretation that matches the above output. From the source, it appears that

    match(x, y, 0L) == 0L

is evaluating to TRUE, of length equal to the columns of x, and then

    x[match(x, y, 0L) == 0L]

is returning the entire data frame.

Compare with the output from package "prob", which uses a setdiff that operates row-wise:

    library(prob)
    A <- data.frame(x = 1:100)
    B <- rbind(A, A)

    setdiff(A, B)  # integer(0)
    setdiff(B, A)  # integer(0)

    C <- data.frame(x = 1:25)

    setdiff(A, C)  # 26:100
    setdiff(C, A)  # integer(0)

IMHO, the entire notion of "set" and "element" is problematic in the df case, so I am not advocating the adoption of the prob:::setdiff approach; rather, setdiff is behaving in a way that I cannot believe with my own eyes, and I would like to alert those who can speak as to why this may be happening.

Thanks to Jason for bringing this up, and to David for catching the discrepancy.

Session info is below. I use the binaries prepared by the Debian group so I do not have the latest patched-revision-4440986745343b. This must have been related to something which has been fixed since April 17, and in that case, please disregard my message.

Yours truly,
Jay

    > sessionInfo()
    R version 2.9.0 (2009-04-17)
    x86_64-pc-linux-gnu

    locale:
    LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C

    attached base packages:
    [1] stats graphics grDevices utils datasets methods base

    other attached packages:
    [1] prob_0.9-1

--
***************************************************
G. Jay Kerns, Ph.D.
Associate Professor
Department of Mathematics & Statistics
Youngstown State University
Youngstown, OH 44555-0002 USA
Office: 1035 Cushwa Hall
Phone: (330) 941-3310 Office (voice mail)
-3302 Department
-3170 FAX
E-mail: gkerns at ysu.edu
http://www.cc.ysu.edu/~gjkerns/
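For readers following along, here is a minimal base-R sketch of the two points raised in the message: a single logical index on a data frame selects columns rather than rows, and a set difference can instead be taken row by row. The helpers `row_key()` and `row_setdiff()` are hypothetical names introduced only for this illustration; they are not the implementation used by base R or by the "prob" package.

```r
# Toy data from the message above.
A <- data.frame(x = 1:100)
B <- rbind(A, A)
C <- data.frame(x = 1:25)

# A data frame is a list of columns, so a single logical index picks columns.
length(A)      # counts columns, not rows
dim(A[TRUE])   # a lone TRUE keeps every column, i.e. the whole data frame

# Hypothetical helper: build one string key per row by pasting the columns.
row_key <- function(d) do.call(paste, c(unname(as.list(d)), sep = "\r"))

# Hypothetical row-wise set difference: rows of x that never occur in y,
# with repeated rows of x reported once.
row_setdiff <- function(x, y) {
  kx <- row_key(x)
  keep <- !(kx %in% row_key(y)) & !duplicated(kx)
  x[keep, , drop = FALSE]
}

nrow(row_setdiff(A, B))   # no rows of A are missing from B
nrow(row_setdiff(B, A))   # no rows of B are missing from A
row_setdiff(A, C)$x       # the rows of A that C lacks
nrow(row_setdiff(C, A))   # no rows of C are missing from A
```

Pasting columns into a key is only a sketch: it assumes no value contains the separator character and that numeric columns format identically in both data frames.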
Stavros Macrakis
2009-May-30 12:50 UTC
[Rd] setdiff bizarre (was: odd behavior out of setdiff)
It seems to me that, abstractly, a dataframe is just as straightforwardly a sequence of tuples/observations as a vector is a sequence of scalars. R's convention is that a 1-vector represents a scalar, and similarly, a 1-dataframe can represent a tuple (though it can also be represented as a list). Of course, a dataframe can *also* be interpreted as a list of vectors.

Just as a sequence of scalars can be interpreted as a set of scalars by the order- and repetition-ignoring homomorphism, so can a sequence of tuples. It seems to me natural that set operations should follow that interpretation.

-s
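The two readings of a data frame described in this reply can be made concrete with a small sketch. The data frame `df` and the variable `tuples` below are made-up names for illustration only.

```r
# A made-up data frame with one repeated observation.
df <- data.frame(id = c(1L, 2L, 2L),
                 grp = c("a", "b", "b"),
                 stringsAsFactors = FALSE)

# Reading 1: a list of column vectors.
length(df)          # number of columns
names(as.list(df))  # one list element per column

# Reading 2: a sequence of tuples, each a one-row data frame.
tuples <- lapply(seq_len(nrow(df)), function(i) df[i, , drop = FALSE])
length(tuples)      # number of observations

# Ignoring repetition among the tuples gives the set-like view;
# unique() on a data frame already works row by row.
nrow(unique(df))
```

Under the second reading, row-oriented functions such as `unique()` and `duplicated()` already treat each row as one element.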
Jason Rupert
2009-May-30 19:30 UTC
[Rd] setdiff bizarre (was: odd behavior out of setdiff)
Jay,

I really appreciate all your help.

I posted to Nabble an R file and input CSV files that more accurately demonstrate what I am seeing and the output I want when I difference two data frames.

http://n2.nabble.com/Support-SetDiff-Discussion-Items...-td2999739.html

It may be that "setdiff", whether in base R or in "prob", was never intended to provide the type of result I desire. If that is the case, then I will need to ask the "Ninjas" for help to produce the outcome I seek.

That is, when I difference the data within RSetDiffEntry.csv and RSetDuplicatesRemoved.csv, I desire to get the result shown in RDesired.csv.

Note that it would not be enough to just remove duplicate "CostPerSquareFoot" values, since that variable is tied to "EntryDate" and "HouseNumber".

Any further help and insights are much appreciated.

Thanks again,
Jason
Jay,

Thanks again for all your help. I have ended up with something similar that appears to work and does provide the difference of two data frames, including all the duplicate rows that were removed by filtering.

This will be very helpful to me going forward, as the data I receive often has duplicate rows that I filter out, and I want to double-check what was filtered out.

    Entry_DF <- read.csv("RSetDiffEntry.csv", header = TRUE)

    # Filter: drop duplicate rows, then keep 0 < CostPerSquareFoot < 300
    EntryFiltered_DF <- subset(Entry_DF, !duplicated(Entry_DF))
    EntryFiltered_DF <- subset(EntryFiltered_DF, !(EntryFiltered_DF$CostPerSquareFoot == 0))
    EntryFiltered_DF <- subset(EntryFiltered_DF, EntryFiltered_DF$CostPerSquareFoot > 0)
    EntryFiltered_DF <- subset(EntryFiltered_DF, EntryFiltered_DF$CostPerSquareFoot < 300)

    # Row-wise setdiff from "prob": rows that the filter removed
    library("prob")
    setDiff_DF <- setdiff(Entry_DF, EntryFiltered_DF)

    # Add back the repeated rows that were dropped as duplicates
    DuplicateRows_DF <- subset(Entry_DF, duplicated(Entry_DF))
    DesiredDFDiff_DF <- rbind(DuplicateRows_DF, setDiff_DF)
    DesiredDFDiff_DF

--- On Sat, 5/30/09, G. Jay Kerns <gkerns at ysu.edu> wrote:

> From: G. Jay Kerns <gkerns at ysu.edu>
> Subject: Re: setdiff bizarre (was: odd behavior out of setdiff)
> To: "Jason Rupert" <jasonkrupert at yahoo.com>
> Cc: "David Winsemius" <dwinsemius at comcast.net>, "r-help at r-project.org" <r-help at r-project.org>
> Date: Saturday, May 30, 2009, 5:19 PM
>
> Jason,
>
> (moved back to R-help)
>
> From your description, something like the following should work:
>
> Let A = your RSetDiffEntry
> Let B = your RSetDuplicatesRemoved...
>
> library(prob)
> C <- setdiff(A,B)
> D <- rbind(A,C)
> E <- D[duplicated(D),]
>
> The E should = your RDesired.
>
> Hope this helps,
> Jay
>
> P.S. I notice your row number 7 in "RSetDuplicatesRemoved" is
> duplicated by the following row. That's a typo, yes? If so, then E
> should have one more row than your "RDesired."
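The same bookkeeping can be sketched in base R without the "prob" package. Everything below is illustrative: the `entry` data are invented (only the column names `HouseNumber` and `CostPerSquareFoot` come from the thread), and `row_key()` is a hypothetical helper, not a function from any package.

```r
# Invented stand-in for RSetDiffEntry.csv: one repeated row, two bad costs.
entry <- data.frame(
  HouseNumber       = c(1L, 2L, 2L, 3L, 4L),
  CostPerSquareFoot = c(120, 95, 95, 0, 450)
)

# Same filter as in the message: drop duplicates, keep 0 < cost < 300.
filtered <- entry[!duplicated(entry), , drop = FALSE]
ok <- filtered$CostPerSquareFoot > 0 & filtered$CostPerSquareFoot < 300
filtered <- filtered[ok, , drop = FALSE]

# Hypothetical helper: one string key per row.
row_key <- function(d) do.call(paste, c(unname(as.list(d)), sep = "\r"))

# Rows dropped because they repeat an earlier row.
dup_rows <- entry[duplicated(entry), , drop = FALSE]

# Distinct rows of the input that do not survive the filter.
gone <- entry[!(row_key(entry) %in% row_key(filtered)) & !duplicated(entry),
              , drop = FALSE]

# Everything the filter removed, repeated rows included.
removed <- rbind(dup_rows, gone)
removed

# Sanity check: kept rows plus removed rows account for every input row.
nrow(filtered) + nrow(removed) == nrow(entry)
```

The final comparison is a cheap way to confirm that no input row was lost or counted twice, which is the double-check the message asks for.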