thr3ads.net - R devel - [Rd] setdiff bizarre (was: odd behavior out of setdiff) [May 2009]

G. Jay Kerns

2009-May-30 04:35 UTC

[Rd] setdiff bizarre (was: odd behavior out of setdiff)

Dear R-devel,

Please see the recent thread on R-help, "Odd Behavior Out of
setdiff(...) - addition of duplicate entries is not identified" posted
by Jason Rupert.  I gave an answer, then read David Winsemius' answer,
and then did some follow-up investigation.

I would like to change my answer.

My current version of setdiff() is acting in a way that I do not
understand, and a way that I suspect  has changed.  Consider the
following, derived from Jason's OP:

The base package setdiff(), atomic vectors:

x <- 1:100
y <- c(x,x)

setdiff(x, y)  # integer(0)
setdiff(y, x)  # integer(0)

z <- 1:25

setdiff(x,z)   # 26:100
setdiff(z,x)   # integer(0)


Everything is fine.

Now look at base package setdiff(), data frames???

################################
A <- data.frame(x = 1:100)
B <- rbind(A, A)

setdiff(A, B)               # df 1:100?
setdiff(B, A)               # df 1:100?

C <- data.frame(x = 1:25)

setdiff(A, C)               # df 1:100?
setdiff(C, A)               # df 1:25?

############################


I have read ?setdiff 37 times now, and I cannot divine any
interpretation that matches the above output.  From the source, it
appears that

match(x, y, 0L) == 0L

is evaluating to TRUE, of length equal to the columns of x, and then

x[match(x, y, 0L) == 0L]

is returning the entire data frame.
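
The column-wise behaviour described above can be sketched as follows. This is only an illustration, assuming that match() sees a data frame as a list of its columns; as.list() is used here to make that view explicit and is not part of the setdiff() source. Newer R versions may treat data frames differently.

```r
A <- data.frame(x = 1:100)
B <- rbind(A, A)

# View each data frame as a plain list of columns (illustrative assumption).
cols_A <- as.list(A)
cols_B <- as.list(B)

# One comparison per column: the single column of A (length 100) is not
# the same object as the single column of B (length 200), so nothing matches.
no_match <- match(cols_A, cols_B, 0L) == 0L
length(no_match)

# With a single index, a data frame is subset by column, so a logical
# vector of TRUEs keeps every column, i.e. the whole data frame.
kept <- A[no_match]
dim(kept)
```

If this reading is right, the result depends only on whether whole columns coincide, never on individual rows.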

Compare with the output from package "prob", which uses a setdiff that
operates row-wise:


###########################
library(prob)
A <- data.frame(x = 1:100)
B <- rbind(A, A)

setdiff(A, B)               # integer(0)
setdiff(B, A)               # integer(0)

C <- data.frame(x = 1:25)

setdiff(A, C)               # 26:100
setdiff(C, A)               # integer(0)
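
For comparison, a row-wise set difference can be sketched in base R alone. This is not the prob implementation: row_setdiff() and the "\r" key separator are hypothetical choices made for this example, and the sketch assumes both data frames have the same columns in the same order and that no value contains "\r".

```r
row_setdiff <- function(x, y) {
  stopifnot(identical(names(x), names(y)))
  # Build one string key per row; unname() keeps a column called "sep"
  # or "collapse" from being taken as an argument of paste().
  key <- function(df) do.call(paste, c(unname(as.list(df)), sep = "\r"))
  key_x <- key(x)
  # Keep rows of x whose key is absent from y, dropping repeats within x.
  keep <- !(key_x %in% key(y)) & !duplicated(key_x)
  x[keep, , drop = FALSE]
}

A <- data.frame(x = 1:100)
B <- rbind(A, A)
C <- data.frame(x = 1:25)

nrow(row_setdiff(A, B))   # 0
nrow(row_setdiff(B, A))   # 0
row_setdiff(A, C)$x       # 26:100
nrow(row_setdiff(C, A))   # 0
```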



IMHO, the entire notion of "set" and "element" is problematic in the
df case, so I am not advocating the adoption of the prob:::setdiff
approach; rather, setdiff is behaving in a way that I cannot believe
with my own eyes, and I would like to alert those who can speak as to
why this may be happening.

Thanks to Jason for bringing this up, and to David for catching the discrepancy.

Session info is below.  I use the binaries prepared by the Debian
group, so I do not have the latest patched-revision-4440986745343b.
This may be related to something that has been fixed since
April 17; if so, please disregard my message.

Yours truly,
Jay





> sessionInfo()
R version 2.9.0 (2009-04-17)
x86_64-pc-linux-gnu

locale:
LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] prob_0.9-1

-- 

***************************************************
G. Jay Kerns, Ph.D.
Associate Professor
Department of Mathematics & Statistics
Youngstown State University
Youngstown, OH 44555-0002 USA
Office: 1035 Cushwa Hall
Phone: (330) 941-3310 Office (voice mail)
-3302 Department
-3170 FAX
E-mail: gkerns at ysu.edu
http://www.cc.ysu.edu/~gjkerns/

Stavros Macrakis

2009-May-30 12:50 UTC

[Rd] setdiff bizarre (was: odd behavior out of setdiff)

It seems to me that, abstractly, a dataframe is just as
straightforwardly a sequence of tuples/observations as a vector is a
sequence of scalars. R's convention is that a 1-vector represents a
scalar, and similarly, a 1-dataframe can represent a tuple (though it
can also be represented as a list). Of course, a dataframe can *also*
be interpreted as a list of vectors.

Just as a sequence of scalars can be interpreted as a set of scalars
by the order- and repetition-ignoring homomorphism, so can a sequence
of tuples. It seems to me natural that set operations should follow
that interpretation.
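
As a small illustration of the tuple view, base R's unique() and duplicated() already treat each row of a data frame as one element. The data below are made up for the example.

```r
obs <- data.frame(id  = c(1L, 2L, 2L, 3L),
                  grp = c("a", "b", "b", "b"),
                  stringsAsFactors = FALSE)

# Row 3 repeats row 2 in every column, so it is the only duplicate;
# row 4 shares grp with row 2 but differs in id, so it is distinct.
dup <- duplicated(obs)
dup                   # FALSE FALSE  TRUE FALSE

# unique() drops the repeated tuple and keeps three rows.
nrow(unique(obs))     # 3
```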

          -s

Jason Rupert

2009-May-30 19:30 UTC

[Rd] setdiff bizarre (was: odd behavior out of setdiff)

Jay, 


I really appreciate all your help.

I posted to Nabble an R file and input CSV files more accurately demonstrating
what I am seeing and the output I desire to achieve when I difference two
dataframes.
http://n2.nabble.com/Support-SetDiff-Discussion-Items...-td2999739.html


It may be that setdiff(), whether in base R or in "prob", was never
intended to provide the type of result I desire.  If that is the case,
then I will need to ask the "Ninjas" for help to produce the outcome I
seek.

That is, when I difference the data within RSetDiffEntry.csv and
RSetDuplicatesRemoved.csv, I desire to get the result shown in RDesired.csv.

Note that it would not be enough to just remove duplicate
"CostPerSquareFoot" values, since that variable is tied to
"EntryDate" and "HouseNumber".
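
The point about whole rows versus a single column can be sketched with duplicated(). The column names come from the description above, but the values are invented for the example and are not the contents of the CSV files.

```r
# Hypothetical entries: rows 1 and 2 are identical; row 3 has the same
# cost but a different date and house number.
houses <- data.frame(
  EntryDate         = c("2009-05-01", "2009-05-01", "2009-05-02"),
  HouseNumber       = c(10L, 10L, 11L),
  CostPerSquareFoot = c(120, 120, 120),
  stringsAsFactors  = FALSE
)

# Looking at the cost column alone flags row 3 as a duplicate too.
dup_by_cost <- duplicated(houses$CostPerSquareFoot)   # FALSE TRUE TRUE

# Looking at whole rows flags only the true repeat.
dup_by_row <- duplicated(houses)                      # FALSE TRUE FALSE
```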

Any further help and insights are much appreciated. 

Thanks again, 
Jason 





Jason Rupert

2009-May-31 02:21 UTC

[R] setdiff bizarre (was: odd behavior out of setdiff)

Jay, 

Thanks again for all your help.  

I have ended up with something similar that appears to work: it gives
the difference of two data frames, including all the duplicate rows
removed by filtering.

Thanks again, as this will be very helpful to me going forward: the data I
receive often has duplicate rows that I filter out, and I want to double-check
that they are in fact filtered out.


Entry_DF <- read.csv("RSetDiffEntry.csv", header = TRUE)

# Drop exact duplicate rows, then keep only rows with 0 < cost < 300.
EntryFiltered_DF <- subset(Entry_DF, !duplicated(Entry_DF))
EntryFiltered_DF <- subset(EntryFiltered_DF,
                           CostPerSquareFoot > 0 & CostPerSquareFoot < 300)

# Rows of Entry_DF that do not appear in the filtered data
# (setdiff() from "prob" compares data frames row-wise).
library("prob")
setDiff_DF <- setdiff(Entry_DF, EntryFiltered_DF)

# Second and later copies of any repeated row.
DuplicateRows_DF <- subset(Entry_DF, duplicated(Entry_DF))

# Duplicates plus the rows dropped by the cost filter.
DesiredDFDiff_DF <- rbind(DuplicateRows_DF, setDiff_DF)

DesiredDFDiff_DF




--- On Sat, 5/30/09, G. Jay Kerns <gkerns at ysu.edu> wrote:
> From: G. Jay Kerns <gkerns at ysu.edu>
> Subject: Re: setdiff bizarre (was: odd behavior out of setdiff)
> To: "Jason Rupert" <jasonkrupert at yahoo.com>
> Cc: "David Winsemius" <dwinsemius at comcast.net>, "r-help at r-project.org" <r-help at r-project.org>
> Date: Saturday, May 30, 2009, 5:19 PM
> Jason,
> 
> (moved back to R-help)
> 
> From your description, something like the following should work:
> 
> Let A = your RSetDiffEntry
> Let B = your RSetDuplicatesRemoved...
> 
> library(prob)
> C <- setdiff(A,B)
> D <- rbind(A,C)
> E <- D[duplicated(D),]
> 
> The E should = your RDesired.
> 
> Hope this helps,
> Jay
> 
> P.S.  I notice your row number 7 in "RSetDuplicatesRemoved" is
> duplicated by the following row. That's a typo, yes?  If so, then E
> should have one more row than your "RDesired."
>
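
The A/B/C/D/E recipe quoted above can be traced on a small example. This is a sketch only: the data are invented stand-ins for the CSV files, and row_key() is a hypothetical base-R helper that plays the role of prob's row-wise setdiff(), not a copy of it.

```r
# Invented stand-in for RSetDiffEntry: row 2 repeats row 1, row 3 has cost 0.
A <- data.frame(HouseNumber       = c(1L, 1L, 2L, 3L),
                CostPerSquareFoot = c(100, 100, 0, 150))
# Invented stand-in for the filtered data: duplicate and zero-cost rows removed.
B <- A[c(1, 4), ]

# One string key per row, used to compare rows of two data frames.
row_key <- function(df) do.call(paste, c(unname(as.list(df)), sep = "\r"))

# C: rows of A that are absent from B (here, the zero-cost row).
C <- A[!(row_key(A) %in% row_key(B)), , drop = FALSE]
C <- C[!duplicated(C), , drop = FALSE]

# D stacks A and C, so every row of C now occurs at least twice in D.
D <- rbind(A, C)

# E: repeated rows of D, i.e. the duplicates within A plus the rows of C.
E <- D[duplicated(D), , drop = FALSE]
E$HouseNumber   # 1 2
```

Appending C to A is what makes the filtered-out rows show up as duplicates, so a single duplicated() call recovers both kinds of removed row.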
