Martin Møller Skarbiniks Pedersen
2021-Jan-31 20:57 UTC
[R] union of two sets are smaller than one set?
This is really puzzling me, and when I try to make a small example everything works as expected.

The problem:

I have these two large vectors of strings.

> str(s1)
 chr [1:766608] "0.dk" ...
> str(s2)
 chr [1:59387] "043.dk" "0606.dk" "0618.dk" "0888.dk" "0iq.dk" "0it.dk" ...

And I need to create the union-set of s1 and s2.
I expect the size of the union-set to be between 766608 and 766608+59387.
However, it is 681193, which is less than the number of elements in s1!

> length(base::union(s1, s2))
[1] 681193

Any hints?

Regards
Martin
On 31/01/2021 3:57 p.m., Martin Møller Skarbiniks Pedersen wrote:
> I expect the size of the union-set to be between 766608 and 766608+59387.
> However, it is 681193, which is less than the number of elements in s1!
>
>> length(base::union(s1, s2))
> [1] 681193
>
> Any hints?

I imagine unique(s1) is shorter than s1. The union function is the same as unique(c(s1, s2)) for your data. (The only difference is if s1 or s2 is named: the names are dropped.)

Duncan Murdoch
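To illustrate the point above, here is a minimal sketch on toy data. The vectors below are made up for illustration and are not the poster's actual s1 and s2; the only assumption is that s1 contains repeated entries.

```r
# Toy data (hypothetical): s1 contains duplicates, as a plain vector may.
s1 <- c("0.dk", "0.dk", "043.dk", "0606.dk", "0606.dk")
s2 <- c("043.dk", "0iq.dk")

# union() removes duplicates, including those already present within s1.
u <- union(s1, s2)

length(s1)          # number of elements in s1, duplicates included
length(unique(s1))  # number of distinct values in s1
length(u)           # can be smaller than length(s1)

# For plain, unnamed character vectors the two expressions agree.
same <- identical(u, unique(c(s1, s2)))
```

So the bounds that actually hold are length(unique(s1)) <= length(union(s1, s2)) <= length(unique(s1)) + length(unique(s2)), not the ones based on length(s1) and length(s2).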
On Sun, 31 Jan 2021, Martin Møller Skarbiniks Pedersen writes:
> I expect the size of the union-set to be between 766608 and 766608+59387.
> However, it is 681193, which is less than the number of elements in s1!
>
>> length(base::union(s1, s2))
> [1] 681193
>
> Any hints?

Duplicates?

kind regards
Enrico

-- 
Enrico Schumann
Lucerne, Switzerland
http://enricoschumann.net
Martin,

You did not say your two starting objects were already sets. You said they were vectors of strings. It may well be that your strings included duplicates. For example, if I read in lots of text with a blank line between paragraphs, I would have lots of seemingly empty and identical parts. Just converting that into a set would shrink it.

You have not said how you created or processed your initial two vectors. It is also possible parts were sort of DELETED, as in removing the string pointed to by some entry but leaving a null pointer of sorts, which would leave the length of the vector longer than the useful contents.

Your strings seem to be what may be filenames. Are they unique, especially if they are files in different folders/directories?

There are many ways to check, but using your method, try this:

length(base::union(s1, s1))
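Beyond length(base::union(s1, s1)), a few other base R checks can show whether duplicates are the cause and which values repeat. This is only a sketch: the s1 below is a small hypothetical stand-in for the real 766608-element vector. With the numbers reported in the thread, a union of 681193 elements would imply that s1 holds at least 766608 - 681193 = 85415 repeated elements.

```r
# Hypothetical stand-in for the poster's s1.
s1 <- c("0.dk", "0.dk", "043.dk", "0606.dk", "0606.dk", "0606.dk")

n_total  <- length(s1)           # all elements, duplicates included
n_unique <- length(unique(s1))   # distinct values only
n_dup    <- sum(duplicated(s1))  # elements that repeat an earlier element

# Index of the first duplicated element, or 0 if every element is unique.
first_dup <- anyDuplicated(s1)

# Values that occur more than once, with their counts.
counts   <- table(s1)
repeated <- counts[counts > 1]
```

If n_unique is smaller than n_total, the vector is not a set, and inspecting repeated (for instance with head(sort(repeated, decreasing = TRUE))) may hint at how the duplicates were introduced.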