thr3ads.net - R help - [R] counting duplicate items that occur in multiple groups [Nov 2020]

If this information is useful, please help other people find it:
Share via:

Tom Woolman

2020-Nov-17 23:29 UTC

[R] counting duplicate items that occur in multiple groups

Hi Bill. Sorry to be so obtuse with the example data, I was trying  
(too hard) not to share any actual values so I just created randomized  
values for my example; of course I should have specified that the  
random values would not provide the expected problem pattern. I should  
have just used simple dummy codes as Bill Dunlap did.

So per Bill's example data for Data1, the expected (hoped for) output  
should be:

  Vendor Account Num_Vendors_Sharing_Bank_Acct
1     V1      A1      0
2     V2      A2      3
3     V3      A2      3
4     V4      A2      3


Where the new calculated variable is Num_Vendors_Sharing_Bank_Acct.  
The value is 3 for V2, V3 and V4 because they each share bank account  
A2.


Likewise, in the Data2 frame, the same logic applies:

  Vendor Account Num_Vendors_Sharing_Bank_Acct
1     V1      A1     0
2     V2      A2     3
3     V3      A2     3
4     V1      A2     3
5     V4      A3     0
6     V2      A4     0






Thanks!


Quoting Bill Dunlap <williamwdunlap at gmail.com>:
> What should the result be for
>   Data1 <-
data.frame(Vendor=c("V1","V2","V3","V4"),
> Account=c("A1","A2","A2","A2"))
> ?
>
> Must each vendor have only one account?  If not, what should the result be
> for
>    Data2 <-
data.frame(Vendor=c("V1","V2","V3","V1","V4","V2"),
>
Account=c("A1","A2","A2","A2","A3","A4"))
> ?
>
> -Bill
>
> On Tue, Nov 17, 2020 at 1:20 PM Tom Woolman <twoolman at
ontargettek.com>
> wrote:
>
>> Hi everyone.  I have a dataframe that is a collection of Vendor IDs
>> plus a bank account number for each vendor. I'm trying to find a
way
>> to count the number of duplicate bank accounts that occur in more than
>> one unique Vendor_ID, and then assign the count value for each row in
>> the dataframe in a new variable.
>>
>> I can do a count of bank accounts that occur within the same vendor
>> using dplyr and group_by and count, but I can't figure out a way to
>> count duplicates among multiple Vendor_IDs.
>>
>>
>> Dataframe example code:
>>
>>
>> #Create a sample data frame:
>>
>> set.seed(1)
>>
>> Data <- data.frame(Vendor_ID = sample(1:10000), Bank_Account_ID
>> sample(1:10000))
>>
>>
>>
>>
>> Thanks in advance for any help.
>>
>> ______________________________________________
>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>

Bert Gunter

2020-Nov-17 23:33 UTC

head link

[R] counting duplicate items that occur in multiple groups

Why 0's in the data frame? Shouldn't that be 1 (vendor with that
account)?

Bert
Bert Gunter

"The trouble with having an open mind is that people keep coming along and
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )


On Tue, Nov 17, 2020 at 3:29 PM Tom Woolman <twoolman at ontargettek.com>
wrote:
> Hi Bill. Sorry to be so obtuse with the example data, I was trying
> (too hard) not to share any actual values so I just created randomized
> values for my example; of course I should have specified that the
> random values would not provide the expected problem pattern. I should
> have just used simple dummy codes as Bill Dunlap did.
>
> So per Bill's example data for Data1, the expected (hoped for) output
> should be:
>
>   Vendor Account Num_Vendors_Sharing_Bank_Acct
> 1     V1      A1      0
> 2     V2      A2      3
> 3     V3      A2      3
> 4     V4      A2      3
>
>
> Where the new calculated variable is Num_Vendors_Sharing_Bank_Acct.
> The value is 3 for V2, V3 and V4 because they each share bank account
> A2.
>
>
> Likewise, in the Data2 frame, the same logic applies:
>
>   Vendor Account Num_Vendors_Sharing_Bank_Acct
> 1     V1      A1     0
> 2     V2      A2     3
> 3     V3      A2     3
> 4     V1      A2     3
> 5     V4      A3     0
> 6     V2      A4     0
>
>
>
>
>
>
> Thanks!
>
>
> Quoting Bill Dunlap <williamwdunlap at gmail.com>:
>
> > What should the result be for
> >   Data1 <-
data.frame(Vendor=c("V1","V2","V3","V4"),
> >
Account=c("A1","A2","A2","A2"))
> > ?
> >
> > Must each vendor have only one account?  If not, what should the
result
> be
> > for
> >    Data2 <-
data.frame(Vendor=c("V1","V2","V3","V1","V4","V2"),
> >
Account=c("A1","A2","A2","A2","A3","A4"))
> > ?
> >
> > -Bill
> >
> > On Tue, Nov 17, 2020 at 1:20 PM Tom Woolman <twoolman at
ontargettek.com>
> > wrote:
> >
> >> Hi everyone.  I have a dataframe that is a collection of Vendor
IDs
> >> plus a bank account number for each vendor. I'm trying to find
a way
> >> to count the number of duplicate bank accounts that occur in more
than
> >> one unique Vendor_ID, and then assign the count value for each row
in
> >> the dataframe in a new variable.
> >>
> >> I can do a count of bank accounts that occur within the same
vendor
> >> using dplyr and group_by and count, but I can't figure out a
way to
> >> count duplicates among multiple Vendor_IDs.
> >>
> >>
> >> Dataframe example code:
> >>
> >>
> >> #Create a sample data frame:
> >>
> >> set.seed(1)
> >>
> >> Data <- data.frame(Vendor_ID = sample(1:10000), Bank_Account_ID
> >> sample(1:10000))
> >>
> >>
> >>
> >>
> >> Thanks in advance for any help.
> >>
> >> ______________________________________________
> >> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more,
see
> >> https://stat.ethz.ch/mailman/listinfo/r-help
> >> PLEASE do read the posting guide
> >> http://www.R-project.org/posting-guide.html
> >> and provide commented, minimal, self-contained, reproducible code.
> >>
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
	[[alternative HTML version deleted]]

Tom Woolman

2020-Nov-17 23:34 UTC

head link

[R] counting duplicate items that occur in multiple groups

Yes, good catch. Thanks


Quoting Bert Gunter <bgunter.4567 at gmail.com>:
> Why 0's in the data frame? Shouldn't that be 1 (vendor with that
account)?
>
> Bert
> Bert Gunter
>
> "The trouble with having an open mind is that people keep coming along
and
> sticking things into it."
> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip
)
>
>
> On Tue, Nov 17, 2020 at 3:29 PM Tom Woolman <twoolman at
ontargettek.com>
> wrote:
>
>> Hi Bill. Sorry to be so obtuse with the example data, I was trying
>> (too hard) not to share any actual values so I just created randomized
>> values for my example; of course I should have specified that the
>> random values would not provide the expected problem pattern. I should
>> have just used simple dummy codes as Bill Dunlap did.
>>
>> So per Bill's example data for Data1, the expected (hoped for)
output
>> should be:
>>
>>   Vendor Account Num_Vendors_Sharing_Bank_Acct
>> 1     V1      A1      0
>> 2     V2      A2      3
>> 3     V3      A2      3
>> 4     V4      A2      3
>>
>>
>> Where the new calculated variable is Num_Vendors_Sharing_Bank_Acct.
>> The value is 3 for V2, V3 and V4 because they each share bank account
>> A2.
>>
>>
>> Likewise, in the Data2 frame, the same logic applies:
>>
>>   Vendor Account Num_Vendors_Sharing_Bank_Acct
>> 1     V1      A1     0
>> 2     V2      A2     3
>> 3     V3      A2     3
>> 4     V1      A2     3
>> 5     V4      A3     0
>> 6     V2      A4     0
>>
>>
>>
>>
>>
>>
>> Thanks!
>>
>>
>> Quoting Bill Dunlap <williamwdunlap at gmail.com>:
>>
>> > What should the result be for
>> >   Data1 <-
data.frame(Vendor=c("V1","V2","V3","V4"),
>> >
Account=c("A1","A2","A2","A2"))
>> > ?
>> >
>> > Must each vendor have only one account?  If not, what should the
result
>> be
>> > for
>> >    Data2 <-
data.frame(Vendor=c("V1","V2","V3","V1","V4","V2"),
>> >
Account=c("A1","A2","A2","A2","A3","A4"))
>> > ?
>> >
>> > -Bill
>> >
>> > On Tue, Nov 17, 2020 at 1:20 PM Tom Woolman <twoolman at
ontargettek.com>
>> > wrote:
>> >
>> >> Hi everyone.  I have a dataframe that is a collection of
Vendor IDs
>> >> plus a bank account number for each vendor. I'm trying to
find a way
>> >> to count the number of duplicate bank accounts that occur in
more than
>> >> one unique Vendor_ID, and then assign the count value for each
row in
>> >> the dataframe in a new variable.
>> >>
>> >> I can do a count of bank accounts that occur within the same
vendor
>> >> using dplyr and group_by and count, but I can't figure out
a way to
>> >> count duplicates among multiple Vendor_IDs.
>> >>
>> >>
>> >> Dataframe example code:
>> >>
>> >>
>> >> #Create a sample data frame:
>> >>
>> >> set.seed(1)
>> >>
>> >> Data <- data.frame(Vendor_ID = sample(1:10000),
Bank_Account_ID >> >> sample(1:10000))
>> >>
>> >>
>> >>
>> >>
>> >> Thanks in advance for any help.
>> >>
>> >> ______________________________________________
>> >> R-help at r-project.org mailing list -- To UNSUBSCRIBE and
more, see
>> >> https://stat.ethz.ch/mailman/listinfo/r-help
>> >> PLEASE do read the posting guide
>> >> http://www.R-project.org/posting-guide.html
>> >> and provide commented, minimal, self-contained, reproducible
code.
>> >>
>>
>> ______________________________________________
>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>

Avi Gross

2020-Nov-17 23:56 UTC

head link

[R] counting duplicate items that occur in multiple groups

Many problems can often be solved with some thought by using the right tools,
such as the ones from the tidyverse.

Without giving a specific answer, you might want to think about using the
group_by() functionality in a pipeline that would lump together all rows
matching say having the same value in several columns. Then in something like a
mutate() or summarize() you can use special functions like n() that return how
many rows exist within each grouping. There are many more such verbs and
features that let you build up something, often by removing the grouping along
the way and perhaps adding some other form of grouping including the new
rowwise() that then lets you do things across columns on a row at a time and so
on.

I think the point is to think of steps that lead to a result that can be used in
the next step and so on.

And, for some problems, you can  think outside the pipelines and create multiple
intermediate data.frames with parts of what you will need and then combine them
with joins or whatever it takes to efficiently get a result, or by brute force.
Sometimes (as when making graphs) you might want to convert data between forms
that are often called long versus wide.

Yes, plenty can be done in base R or using other packages. But a good set of
tools might be part of what you need to investigate.

Of course, others can chime in suggesting that there are negatives to dplyr and
other aspects of the tidyverse and they would be right too.

-----Original Message-----
From: R-help <r-help-bounces at r-project.org> On Behalf Of Tom Woolman
Sent: Tuesday, November 17, 2020 6:30 PM
To: Bill Dunlap <williamwdunlap at gmail.com>
Cc: r-help at r-project.org
Subject: Re: [R] counting duplicate items that occur in multiple groups

Hi Bill. Sorry to be so obtuse with the example data, I was trying (too hard)
not to share any actual values so I just created randomized values for my
example; of course I should have specified that the random values would not
provide the expected problem pattern. I should have just used simple dummy codes
as Bill Dunlap did.

So per Bill's example data for Data1, the expected (hoped for) output should
be:

  Vendor Account Num_Vendors_Sharing_Bank_Acct
1     V1      A1      0
2     V2      A2      3
3     V3      A2      3
4     V4      A2      3

Where the new calculated variable is Num_Vendors_Sharing_Bank_Acct.  
The value is 3 for V2, V3 and V4 because they each share bank account A2.

Likewise, in the Data2 frame, the same logic applies:

  Vendor Account Num_Vendors_Sharing_Bank_Acct
1     V1      A1     0
2     V2      A2     3
3     V3      A2     3
4     V1      A2     3
5     V4      A3     0
6     V2      A4     0

Thanks!

Quoting Bill Dunlap <williamwdunlap at gmail.com>:
> What should the result be for
>   Data1 <-
data.frame(Vendor=c("V1","V2","V3","V4"),
> Account=c("A1","A2","A2","A2"))
> ?
>
> Must each vendor have only one account?  If not, what should the 
> result be for
>    Data2 <-
data.frame(Vendor=c("V1","V2","V3","V1","V4","V2"),
>
Account=c("A1","A2","A2","A2","A3","A4"))
> ?
>
> -Bill
>
> On Tue, Nov 17, 2020 at 1:20 PM Tom Woolman <twoolman at
ontargettek.com>
> wrote:
>
>> Hi everyone.  I have a dataframe that is a collection of Vendor IDs 
>> plus a bank account number for each vendor. I'm trying to find a
way
>> to count the number of duplicate bank accounts that occur in more 
>> than one unique Vendor_ID, and then assign the count value for each 
>> row in the dataframe in a new variable.
>>
>> I can do a count of bank accounts that occur within the same vendor 
>> using dplyr and group_by and count, but I can't figure out a way to
>> count duplicates among multiple Vendor_IDs.
>>
>>
>> Dataframe example code:
>>
>>
>> #Create a sample data frame:
>>
>> set.seed(1)
>>
>> Data <- data.frame(Vendor_ID = sample(1:10000), Bank_Account_ID
>> sample(1:10000))
>>
>>
>>
>>
>> Thanks in advance for any help.
>>
>> ______________________________________________
>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see 
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
______________________________________________
R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

R help - Nov 2020 - counting duplicate items that occur in multiple groups

[R] counting duplicate items that occur in multiple groups

[R] counting duplicate items that occur in multiple groups

[R] counting duplicate items that occur in multiple groups

[R] counting duplicate items that occur in multiple groups