Hi everybody

I have found something (for me at least) strange with duplicated(). I will first provide a reproducible example of a certain kind of behaviour that I find odd and then give a sample of unexpected results from my own data. I hope someone can help me understand this.

Consider the following:

# this works as expected
ex=sample(1:20, replace=TRUE)
ex
duplicated(ex)
ex=sort(ex)
ex
duplicated(ex)

# but why does duplicated() not work after order()?
ex=sample(1:20, replace=TRUE)
ex
duplicated(ex)
ex=order(ex)
duplicated(ex)

Why does duplicated() not work after order() has been applied, while it works fine after sort()? Is this an error, or is there something I don't understand?

I have been getting very strange results from duplicated() and unique() in a dataset I am analysing. Here is a little sample of my real-life problem:

> str(Masechaba$PROPDESC)
 Factor w/ 24545 levels " 06"," 71Hemilton str",..: 14527 8043 16113 16054 13875 15780 12522 7771 14824 12314 ...
> # Create an indicator of whether the PROPDESC is unique. Default false
> Masechaba$unique=FALSE
> Masechaba$unique[which(is.na(unique(Masechaba$PROPDESC))==FALSE)]=TRUE
> # Check if something happened
> length(which(Masechaba$unique==TRUE))
[1] 2174
> length(which(Masechaba$unique==FALSE))
[1] 476
> Masechaba$duplicate=FALSE
> Masechaba$duplicate[which(duplicated(Masechaba$PROPDESC)==TRUE)]=TRUE
> length(which(Masechaba$duplicate==TRUE))
[1] 476
> length(which(Masechaba$duplicate==FALSE))
[1] 2174
> # Looks OK so far
> # Test on a known duplicate. I expect one to be true and one to be false
> Masechaba[which(Masechaba$PROPDESC==2363),10:12]
      PROPDESC unique duplicate
24874     2363   TRUE     FALSE
31280     2363   TRUE      TRUE

# This is strange. I expected that unique() and duplicated() would give the same results. The variable PROPDESC is clearly not unique in both cases.
# The totals are the same but not the individual results
> table(Masechaba$unique,Masechaba$duplicate)

        FALSE TRUE
  FALSE   342  134
  TRUE   1832  342

I don't understand this. Is there something I am missing?

Best regards
Christaan

P.S.
> sessionInfo()
R version 2.11.1 (2010-05-31)
x86_64-apple-darwin9.8.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] splines   stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] plyr_0.1.9      maptools_0.7-34 lattice_0.18-8  foreign_0.8-40  Hmisc_3.8-0     survival_2.35-8 rgdal_0.6-26
[8] sp_0.9-64

loaded via a namespace (and not attached):
[1] cluster_1.12.3 grid_2.11.1    tools_2.11.1

Hi

r-help-bounces at r-project.org wrote on 08.06.2010 08:44:39:

> Hi everybody
>
> I have found something (for me at least) strange with duplicated(). I will
> first provide a reproducible example of a certain kind of behaviour that I
> find odd and then give a sample of unexpected results from my own data. I
> hope someone can help me understand this.
>
> Consider the following
>
> # this works as expected
> ex=sample(1:20, replace=TRUE)
> ex
> duplicated(ex)
> ex=sort(ex)

This is OK, as sort() returns your data in sorted order.

> ex
> duplicated(ex)
>
> # but why does duplicated() not work after order()?
> ex=sample(1:20, replace=TRUE)
> ex
> duplicated(ex)
> ex=order(ex)

This is not OK, as order() gives you positions, not your data:

> ex=sample(letters[1:5],20, replace=TRUE)
> ex
 [1] "b" "b" "b" "e" "d" "c" "e" "a" "a" "d" "d" "d" "a" "e" "b" "c" "e" "d" "a"
[20] "a"
> ex<-order(ex)
> ex
 [1]  8  9 13 19 20  1  2  3 15  6 16  5 10 11 12 18  4  7 14 17

ex=ex[order(ex)] should give you the same values as sort(ex). Ties do not change the values; only NAs are treated differently (sort() drops them by default, order() puts them last).

> duplicated(ex)
>
> Why does duplicated() not work after order() has been applied, while it works
> fine after sort()? Is this an error, or is there something I don't
> understand?
>
> I have been getting very strange results from duplicated() and unique() in a
> dataset I am analysing. Here is a little sample of my real-life problem
>
> > str(Masechaba$PROPDESC)
>  Factor w/ 24545 levels " 06"," 71Hemilton str",..: 14527 8043 16113
> 16054 13875 15780 12522 7771 14824 12314 ...
> > # Create an indicator of whether the PROPDESC is unique. Default false
> > Masechaba$unique=FALSE
> > Masechaba$unique[which(is.na(unique(Masechaba$PROPDESC))==FALSE)]=TRUE
> > # Check if something happened
> > length(which(Masechaba$unique==TRUE))
> [1] 2174
> > length(which(Masechaba$unique==FALSE))
> [1] 476
> > Masechaba$duplicate=FALSE
> > Masechaba$duplicate[which(duplicated(Masechaba$PROPDESC)==TRUE)]=TRUE
> > length(which(Masechaba$duplicate==TRUE))
> [1] 476
> > length(which(Masechaba$duplicate==FALSE))
> [1] 2174
> > # Looks OK so far
> > # Test on a known duplicate. I expect one to be true and one to be false
> > Masechaba[which(Masechaba$PROPDESC==2363),10:12]
>       PROPDESC unique duplicate
> 24874     2363   TRUE     FALSE
> 31280     2363   TRUE      TRUE
>
> # This is strange. I expected that unique() and duplicated() would give the
> same results. The variable PROPDESC is clearly not unique in both cases.

No.

ex=sample(letters[1:5],10, replace=TRUE)
ex
 [1] "b" "d" "d" "b" "a" "c" "b" "c" "d" "d"
unique(ex)
[1] "b" "d" "a" "c"
duplicated(ex)
 [1] FALSE FALSE  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE

The functions give you different answers about your data because you are asking different questions: unique() returns the distinct values (a shorter vector), whereas duplicated() returns one logical value per element of the input.

> > Masechaba$unique[which(is.na(unique(Masechaba$PROPDESC))==FALSE)]=TRUE
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This seems strange. At first sight I am puzzled about what result to expect from such a construction.

Regards
Petr

> # The totals are the same but not the individual results
> > table(Masechaba$unique,Masechaba$duplicate)
>
>         FALSE TRUE
>   FALSE   342  134
>   TRUE   1832  342
>
> I don't understand this. Is there something I am missing?
>
> Best regards
> Christaan
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

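To make the sort()/order() distinction from the reply above concrete, here is a minimal sketch. The vector `x` and the variable names are made up for illustration and do not come from the thread; it only shows that sort() returns the data while order() returns a permutation of positions, which by construction contains no duplicates.

```r
# Hypothetical vector (not from the thread), chosen so that it contains duplicates
x <- c(3, 1, 3, 2, 1)

sorted_values <- sort(x)    # the data itself, rearranged
positions     <- order(x)   # a permutation of seq_along(x)

# Duplicates are still detected in the sorted data
dup_in_sorted <- duplicated(sorted_values)

# Every position occurs exactly once, so nothing is ever flagged
dup_in_positions <- duplicated(positions)

# Indexing the original vector with the permutation recovers the sorted data
same_as_sort <- identical(x[positions], sorted_values)
```

So if the goal is sorted data, either call sort(x) or index with the permutation, x[order(x)], rather than assigning the result of order() back to the variable.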
On Tuesday, June 8, 2010, christiaan pauw <cjpauw at gmail.com> wrote:
> Hi everybody
>
> I have found something (for me at least) strange with duplicated(). I will
> first provide a reproducible example of a certain kind of behaviour that I
> find odd and then give a sample of unexpected results from my own data. I
> hope someone can help me understand this.
>
> Consider the following
>
> # this works as expected
> ex=sample(1:20, replace=TRUE)
> ex
> duplicated(ex)
> ex=sort(ex)
> ex
> duplicated(ex)
>
> # but why does duplicated() not work after order()?
> ex=sample(1:20, replace=TRUE)
> ex
> duplicated(ex)
> ex=order(ex)
> duplicated(ex)
>
> Why does duplicated() not work after order() has been applied, while it works
> fine after sort()? Is this an error, or is there something I don't
> understand?

The latter: order() returns the indices into your vector, i.e. a permutation that selects the values in sorted order. Each element of a permutation is unique by definition.

> I have been getting very strange results from duplicated() and unique() in a
> dataset I am analysing. Here is a little sample of my real-life problem

Presumably this is a data.frame...

> > str(Masechaba$PROPDESC)
>  Factor w/ 24545 levels "     06","   71Hemilton str",..: 14527 8043 16113
> 16054 13875 15780 12522 7771 14824 12314 ...
> > # Create an indicator of whether the PROPDESC is unique. Default false
> > Masechaba$unique=FALSE
> > Masechaba$unique[which(is.na(unique(Masechaba$PROPDESC))==FALSE)]=TRUE

The statement above is in error. You are referring to elements of unique(Masechaba$PROPDESC), which do not correspond to the rows of Masechaba: the two have different lengths. Use duplicated() instead.

> > # Check if something happened
> > length(which(Masechaba$unique==TRUE))
> [1] 2174
> > length(which(Masechaba$unique==FALSE))
> [1] 476
> > Masechaba$duplicate=FALSE
> > Masechaba$duplicate[which(duplicated(Masechaba$PROPDESC)==TRUE)]=TRUE

This is equivalent to

  Masechaba$duplicate <- duplicated(Masechaba$PROPDESC)

> > length(which(Masechaba$duplicate==TRUE))
> [1] 476
> > length(which(Masechaba$duplicate==FALSE))
> [1] 2174
> > # Looks OK so far
> > # Test on a known duplicate. I expect one to be true and one to be false
> > Masechaba[which(Masechaba$PROPDESC==2363),10:12]
>       PROPDESC unique duplicate
> 24874     2363   TRUE     FALSE
> 31280     2363   TRUE      TRUE
>
> # This is strange. I expected that unique() and duplicated() would give the
> same results. The variable PROPDESC is clearly not unique in both cases.
> # The totals are the same but not the individual results
> > table(Masechaba$unique,Masechaba$duplicate)
>
>         FALSE TRUE
>   FALSE   342  134
>   TRUE   1832  342
>
> I don't understand this. Is there something I am missing?

-- 
Felix Andrews
Integrated Catchment Assessment and Management (iCAM) Centre
Fenner School of Environment and Society [Bldg 48a]
The Australian National University
Canberra ACT 0200 Australia
M: +61 410 400 963
T: +61 2 6125 4670
E: felix.andrews at anu.edu.au
CRICOS Provider No. 00120C
-- 
http://www.neurofractal.org/felix/
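
As a follow-up to the advice above, here is a minimal sketch of per-row flags built with duplicated(). The data frame `df` is a made-up stand-in for Masechaba, and the column names `repeated` and `occurs_once` are hypothetical. The sketch also reproduces the unique()-based assignment from the original post: when there are no NAs, which(!is.na(unique(x))) is just 1:length(unique(x)), so that construction flags the first rows of the data regardless of their content, which would explain why the totals matched while the individual rows did not.

```r
# Hypothetical stand-in for Masechaba (not the poster's data)
df <- data.frame(PROPDESC = factor(c("a", "b", "a", "c", "b", "a")))

# unique() returns one entry per distinct value, so it is shorter than the data
n_distinct <- length(unique(df$PROPDESC))

# The construction from the original post: it flags rows 1..n_distinct only
flag_by_unique <- rep(FALSE, nrow(df))
flag_by_unique[which(!is.na(unique(df$PROPDESC)))] <- TRUE

# duplicated() returns one logical per row: TRUE for second and later occurrences
df$duplicate <- duplicated(df$PROPDESC)

# Rows whose value occurs more than once anywhere, first occurrence included
df$repeated <- df$duplicate | duplicated(df$PROPDESC, fromLast = TRUE)

# Rows whose value occurs exactly once in the column
df$occurs_once <- !df$repeated
```

Which flag is wanted depends on the question being asked: `!df$duplicate` keeps the first occurrence of every value, while `df$occurs_once` keeps only values that never repeat.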