Jason Rupert
2009-May-29  18:48 UTC
[R] Odd Behavior Out of setdiff(...) - addition of duplicate entries is not identified
I think I am using the improved version of setdiff(...) that handles data.frames, so I expected some odd behavior, but this one is escaping me.

It appears that the addition of duplicate entries is not caught by setdiff(...). Is this expected behavior? If so, is there another method or approach that should be used to identify duplicate row entries between two different data frames?

Thanks in advance for any feedback.

Test1_DF<-data.frame(HouseSize=c(1:100))
Test2_DF<-rbind(Test1_DF, Test1_DF)
setdiff(Test1_DF, Test2_DF)
integer(0)
setdiff(Test2_DF, Test1_DF)
integer(0)

However,
Test3_DF<-data.frame(HouseSize=c(1:25))
setdiff(Test1_DF, Test3_DF)
 [1]  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41
[17]  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57
[33]  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72  73
[49]  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89
[65]  90  91  92  93  94  95  96  97  98  99 100

setdiff(Test3_DF, Test1_DF)
integer(0)
G. Jay Kerns
2009-May-29  20:21 UTC
[R] Odd Behavior Out of setdiff(...) - addition of duplicate entries is not identified
Dear Jason,

On Fri, May 29, 2009 at 2:48 PM, Jason Rupert <jasonkrupert at yahoo.com> wrote:
>
> I think I am using the improved version of setdiff(...) that handles data.frames, so I think some odd behavior was expected but this one is escaping me.
>
> It appears that the the addition of duplicate entries is not caught by the setdiff(...). Is this expected behavior?

[snip]

> Thanks in advance for any feedback.
>
> Test1_DF<-data.frame(HouseSize=c(1:100))
> Test2_DF<-rbind(Test1_DF, Test1_DF)
> setdiff(Test1_DF, Test2_DF)
> integer(0)
> setdiff(Test2_DF, Test1_DF)
> integer(0)
>
> However,
> Test3_DF<-data.frame(HouseSize=c(1:25))
> setdiff(Test1_DF, Test3_DF)
>  [1]  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41
> [17]  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57
> [33]  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72  73
> [49]  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89
> [65]  90  91  92  93  94  95  96  97  98  99 100
>
> setdiff(Test3_DF, Test1_DF)
> integer(0)

You didn't explicitly say which "improved version" of setdiff() that you are using, so I can only presume that you are using the setdiff.data.frame in the prob package.

The behaviour you are observing is expected and matches the base:::setdiff behaviour in the case of vectors; cf.

x1 <- c(1:100)
x2 <- c(x1,x1)

setdiff(x1, x2)  # integer(0)
setdiff(x2, x1)  # integer(0)

x3 <- c(1:25)
setdiff(x1, x3)  # 26:100
setdiff(x3, x1)  # integer(0)

> If so, is there another method or approach that should be used to identify duplicate row entries between two different data frames?

The R-help archives are chock full of every possible variant of questions (and answers) about this, and you haven't said _exactly_ what you are looking for. In the absence of an already posted solution, please specify exactly what you want and I'll wager an R Ninja could dispatch it in moments.

Regards,
Jay

***************************************************
G. Jay Kerns, Ph.D.
Associate Professor
Department of Mathematics & Statistics
Youngstown State University
Youngstown, OH 44555-0002 USA
Office: 1035 Cushwa Hall
Phone: (330) 941-3310 Office (voice mail)
-3302 Department
-3170 FAX
E-mail: gkerns at ysu.edu
http://www.cc.ysu.edu/~gjkerns/
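
The vector behaviour described in the reply above follows from setdiff() treating its arguments as sets: repeated values collapse before the comparison, so a doubled vector looks identical to the original. The sketch below is only an illustration of that idea, not code from the thread; manual_setdiff is a hypothetical name, and it approximates the base definition rather than reproducing it exactly. It also shows duplicated() and table(), two base functions that do look at repeats.

```r
x1 <- 1:100
x2 <- c(x1, x1)

# Rough equivalent of base setdiff() for vectors (hypothetical helper):
# keep the elements of x that are not found in y, then drop repeats.
manual_setdiff <- function(x, y) unique(x[match(x, y, 0L) == 0L])

# Both directions come back empty, because every value of x2 occurs in x1.
same_result <- identical(manual_setdiff(x2, x1), setdiff(x2, x1))

# duplicated() flags the second and later occurrence of each value,
# so this recovers the values that were added a second time.
dup_values <- x2[duplicated(x2)]

# table() counts how often each value occurs.
counts <- table(x2)
```

Here dup_values holds the values that appear more than once, which is the information setdiff() discards by design.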
Jason Rupert
2009-May-29  21:58 UTC
[R] Odd Behavior Out of setdiff(...) - addition of duplicate entries is not identified
Jay,

Thanks much for the reply. I think you are right about the prob package.

Unfortunately, I was not able to find the old emails I had discussing the use of the more powerful setdiff, which extends the base R setdiff so that it works with data.frames instead of just simple vectors of values. Love this functionality.

However, for the following example,

Test1_DF<-data.frame(HouseSize=c(1:100), LandLocation=c("Here"), Price = c("Low"))
Test2_DF<-rbind(Test1_DF, Test1_DF)

> setdiff(Test1_DF, Test2_DF)
[1] HouseSize    LandLocation Price       
<0 rows> (or 0-length row.names)
> setdiff(Test2_DF, Test1_DF)
[1] HouseSize    LandLocation Price       
<0 rows> (or 0-length row.names)

I was hoping that in this example one of the setdiff calls would have returned essentially Test1_DF, since it is duplicated and that is what is different between the two data frames.

So, I guess I am trying to figure out a way to truly diff the data frames, i.e. determine when two data.frames differ from one another, including in how many times a row appears, and then receive the differing rows as output.

Does this capability exist as a function in a current R package, or is there a commonly used pattern for building it?

Thanks again for any feedback you can provide.
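
One way to get a diff that counts repeated rows is sketched below. It is an illustration only, not a function from base R or the prob package: multiset_rowdiff is a hypothetical helper. It assumes both data frames have the same column names, that no value contains the "\r" separator, and that pasting a row into a string is enough to tell rows apart (which may not hold for high-precision numeric columns).

```r
# Hypothetical helper: return the rows of `a` that have no matching
# counterpart in `b`, treating repeated rows as separate items.
multiset_rowdiff <- function(a, b) {
  stopifnot(identical(names(a), names(b)))
  # Collapse each row into a single string key.
  key_a <- do.call(paste, c(unname(as.list(a)), sep = "\r"))
  key_b <- do.call(paste, c(unname(as.list(b)), sep = "\r"))
  # Number repeated keys 1, 2, 3, ... in order of appearance.
  occ_a <- ave(seq_along(key_a), key_a, FUN = seq_along)
  occ_b <- ave(seq_along(key_b), key_b, FUN = seq_along)
  # The n-th copy of a row in `a` is matched only by an n-th copy in `b`.
  keep <- !(paste(key_a, occ_a, sep = "\r") %in% paste(key_b, occ_b, sep = "\r"))
  a[keep, , drop = FALSE]
}

Test1_DF <- data.frame(HouseSize = 1:100, LandLocation = "Here", Price = "Low")
Test2_DF <- rbind(Test1_DF, Test1_DF)

extra   <- multiset_rowdiff(Test2_DF, Test1_DF)  # the second copy of each row
missing <- multiset_rowdiff(Test1_DF, Test2_DF)  # nothing is missing
```

With the example data, extra contains the 100 rows that were added the second time, and missing has zero rows.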
 
Also, I tried to determine my Session Info and the packages I have loaded, but I received the following:

> sessionInfo()
Error in x$Priority : $ operator is invalid for atomic vectors
In addition: There were 12 warnings (use warnings() to see them)
> warnings()
Warning messages:
1: In FUN(c("prob", "ggplot2", "reshape", "RColorBrewer",  ... :
  DESCRIPTION file of package 'prob' is missing or broken
2: In FUN(c("prob", "ggplot2", "reshape", "RColorBrewer",  ... :
  DESCRIPTION file of package 'ggplot2' is missing or broken
3: In FUN(c("prob", "ggplot2", "reshape", "RColorBrewer",  ... :
  DESCRIPTION file of package 'reshape' is missing or broken
4: In FUN(c("prob", "ggplot2", "reshape", "RColorBrewer",  ... :
  DESCRIPTION file of package 'RColorBrewer' is missing or broken
5: In FUN(c("prob", "ggplot2", "reshape", "RColorBrewer",  ... :
  DESCRIPTION file of package 'proto' is missing or broken
6: In FUN(c("prob", "ggplot2", "reshape", "RColorBrewer",  ... :
  DESCRIPTION file of package 'plyr' is missing or broken
7: In FUN(c("prob", "ggplot2", "reshape", "RColorBrewer",  ... :
  DESCRIPTION file of package 'nortest' is missing or broken
8: In FUN(c("prob", "ggplot2", "reshape", "RColorBrewer",  ... :
  DESCRIPTION file of package 'fBasics' is missing or broken
9: In FUN(c("prob", "ggplot2", "reshape", "RColorBrewer",  ... :
  DESCRIPTION file of package 'timeSeries' is missing or broken
10: In FUN(c("prob", "ggplot2", "reshape", "RColorBrewer",  ... :
  DESCRIPTION file of package 'timeDate' is missing or broken
11: In FUN(c("prob", "ggplot2", "reshape", "RColorBrewer",  ... :
  DESCRIPTION file of package 'vcd' is missing or broken
12: In FUN(c("prob", "ggplot2", "reshape", "RColorBrewer",  ... :
  DESCRIPTION file of package 'colorspace' is missing or broken

However, I typically load the following ones:
library(colorspace, lib.loc=RLibraryPathLocation)
library(vcd, lib.loc=RLibraryPathLocation)
library(timeDate, lib.loc=RLibraryPathLocation)
library(timeSeries, lib.loc=RLibraryPathLocation)
library(fBasics, lib.loc=RLibraryPathLocation)
library(nortest, lib.loc=RLibraryPathLocation)
library(plyr, lib.loc=RLibraryPathLocation)
library(proto, lib.loc=RLibraryPathLocation)
library(RColorBrewer, lib.loc=RLibraryPathLocation)
library(reshape, lib.loc=RLibraryPathLocation)
library(ggplot2, lib.loc=RLibraryPathLocation)
library(prob, lib.loc=RLibraryPathLocation)
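
The warnings above say that R could not read a DESCRIPTION file for each loaded package. The thread does not establish why, so the sketch below is only a diagnostic idea, not a fix: it checks whether a DESCRIPTION file can be located for each package, either on the default library paths or in a specific library directory. has_description is a hypothetical helper; system.file() and .libPaths() are base R.

```r
# Hypothetical helper: TRUE for each package whose DESCRIPTION file
# can be located, searching `lib.loc` (NULL means the default .libPaths()).
has_description <- function(pkgs, lib.loc = NULL) {
  vapply(pkgs, function(p) {
    # system.file() returns "" when the file cannot be found.
    nzchar(system.file("DESCRIPTION", package = p, lib.loc = lib.loc))
  }, logical(1))
}

pkgs <- c("prob", "ggplot2", "reshape", "RColorBrewer", "proto", "plyr",
          "nortest", "fBasics", "timeSeries", "timeDate", "vcd", "colorspace")

# Which packages are visible on the default library paths?
found_default <- has_description(pkgs)

# The directories R searches by default, for comparison with the
# lib.loc value passed to library().
default_paths <- .libPaths()
```

Passing the same directory used in the library() calls as lib.loc (RLibraryPathLocation in the message above) would show whether the files are present there but not on the default paths.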