Tom Woolman
2020-Nov-17 23:29 UTC
[R] counting duplicate items that occur in multiple groups
Hi Bill. Sorry to be so obtuse with the example data, I was trying (too hard) not to share any actual values so I just created randomized values for my example; of course I should have specified that the random values would not provide the expected problem pattern. I should have just used simple dummy codes as Bill Dunlap did. So per Bill's example data for Data1, the expected (hoped for) output should be: Vendor Account Num_Vendors_Sharing_Bank_Acct 1 V1 A1 0 2 V2 A2 3 3 V3 A2 3 4 V4 A2 3 Where the new calculated variable is Num_Vendors_Sharing_Bank_Acct. The value is 3 for V2, V3 and V4 because they each share bank account A2. Likewise, in the Data2 frame, the same logic applies: Vendor Account Num_Vendors_Sharing_Bank_Acct 1 V1 A1 0 2 V2 A2 3 3 V3 A2 3 4 V1 A2 3 5 V4 A3 0 6 V2 A4 0 Thanks! Quoting Bill Dunlap <williamwdunlap at gmail.com>:> What should the result be for > Data1 <- data.frame(Vendor=c("V1","V2","V3","V4"), > Account=c("A1","A2","A2","A2")) > ? > > Must each vendor have only one account? If not, what should the result be > for > Data2 <- data.frame(Vendor=c("V1","V2","V3","V1","V4","V2"), > Account=c("A1","A2","A2","A2","A3","A4")) > ? > > -Bill > > On Tue, Nov 17, 2020 at 1:20 PM Tom Woolman <twoolman at ontargettek.com> > wrote: > >> Hi everyone. I have a dataframe that is a collection of Vendor IDs >> plus a bank account number for each vendor. I'm trying to find a way >> to count the number of duplicate bank accounts that occur in more than >> one unique Vendor_ID, and then assign the count value for each row in >> the dataframe in a new variable. >> >> I can do a count of bank accounts that occur within the same vendor >> using dplyr and group_by and count, but I can't figure out a way to >> count duplicates among multiple Vendor_IDs. >> >> >> Dataframe example code: >> >> >> #Create a sample data frame: >> >> set.seed(1) >> >> Data <- data.frame(Vendor_ID = sample(1:10000), Bank_Account_ID >> sample(1:10000)) >> >> >> >> >> Thanks in advance for any help. >> >> ______________________________________________ >> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see >> https://stat.ethz.ch/mailman/listinfo/r-help >> PLEASE do read the posting guide >> http://www.R-project.org/posting-guide.html >> and provide commented, minimal, self-contained, reproducible code. >>
Bert Gunter
2020-Nov-17 23:33 UTC
[R] counting duplicate items that occur in multiple groups
Why 0's in the data frame? Shouldn't that be 1 (vendor with that account)? Bert Bert Gunter "The trouble with having an open mind is that people keep coming along and sticking things into it." -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip ) On Tue, Nov 17, 2020 at 3:29 PM Tom Woolman <twoolman at ontargettek.com> wrote:> Hi Bill. Sorry to be so obtuse with the example data, I was trying > (too hard) not to share any actual values so I just created randomized > values for my example; of course I should have specified that the > random values would not provide the expected problem pattern. I should > have just used simple dummy codes as Bill Dunlap did. > > So per Bill's example data for Data1, the expected (hoped for) output > should be: > > Vendor Account Num_Vendors_Sharing_Bank_Acct > 1 V1 A1 0 > 2 V2 A2 3 > 3 V3 A2 3 > 4 V4 A2 3 > > > Where the new calculated variable is Num_Vendors_Sharing_Bank_Acct. > The value is 3 for V2, V3 and V4 because they each share bank account > A2. > > > Likewise, in the Data2 frame, the same logic applies: > > Vendor Account Num_Vendors_Sharing_Bank_Acct > 1 V1 A1 0 > 2 V2 A2 3 > 3 V3 A2 3 > 4 V1 A2 3 > 5 V4 A3 0 > 6 V2 A4 0 > > > > > > > Thanks! > > > Quoting Bill Dunlap <williamwdunlap at gmail.com>: > > > What should the result be for > > Data1 <- data.frame(Vendor=c("V1","V2","V3","V4"), > > Account=c("A1","A2","A2","A2")) > > ? > > > > Must each vendor have only one account? If not, what should the result > be > > for > > Data2 <- data.frame(Vendor=c("V1","V2","V3","V1","V4","V2"), > > Account=c("A1","A2","A2","A2","A3","A4")) > > ? > > > > -Bill > > > > On Tue, Nov 17, 2020 at 1:20 PM Tom Woolman <twoolman at ontargettek.com> > > wrote: > > > >> Hi everyone. I have a dataframe that is a collection of Vendor IDs > >> plus a bank account number for each vendor. I'm trying to find a way > >> to count the number of duplicate bank accounts that occur in more than > >> one unique Vendor_ID, and then assign the count value for each row in > >> the dataframe in a new variable. > >> > >> I can do a count of bank accounts that occur within the same vendor > >> using dplyr and group_by and count, but I can't figure out a way to > >> count duplicates among multiple Vendor_IDs. > >> > >> > >> Dataframe example code: > >> > >> > >> #Create a sample data frame: > >> > >> set.seed(1) > >> > >> Data <- data.frame(Vendor_ID = sample(1:10000), Bank_Account_ID > >> sample(1:10000)) > >> > >> > >> > >> > >> Thanks in advance for any help. > >> > >> ______________________________________________ > >> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see > >> https://stat.ethz.ch/mailman/listinfo/r-help > >> PLEASE do read the posting guide > >> http://www.R-project.org/posting-guide.html > >> and provide commented, minimal, self-contained, reproducible code. > >> > > ______________________________________________ > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide > http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >[[alternative HTML version deleted]]
Tom Woolman
2020-Nov-17 23:34 UTC
[R] counting duplicate items that occur in multiple groups
Yes, good catch. Thanks Quoting Bert Gunter <bgunter.4567 at gmail.com>:> Why 0's in the data frame? Shouldn't that be 1 (vendor with that account)? > > Bert > Bert Gunter > > "The trouble with having an open mind is that people keep coming along and > sticking things into it." > -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip ) > > > On Tue, Nov 17, 2020 at 3:29 PM Tom Woolman <twoolman at ontargettek.com> > wrote: > >> Hi Bill. Sorry to be so obtuse with the example data, I was trying >> (too hard) not to share any actual values so I just created randomized >> values for my example; of course I should have specified that the >> random values would not provide the expected problem pattern. I should >> have just used simple dummy codes as Bill Dunlap did. >> >> So per Bill's example data for Data1, the expected (hoped for) output >> should be: >> >> Vendor Account Num_Vendors_Sharing_Bank_Acct >> 1 V1 A1 0 >> 2 V2 A2 3 >> 3 V3 A2 3 >> 4 V4 A2 3 >> >> >> Where the new calculated variable is Num_Vendors_Sharing_Bank_Acct. >> The value is 3 for V2, V3 and V4 because they each share bank account >> A2. >> >> >> Likewise, in the Data2 frame, the same logic applies: >> >> Vendor Account Num_Vendors_Sharing_Bank_Acct >> 1 V1 A1 0 >> 2 V2 A2 3 >> 3 V3 A2 3 >> 4 V1 A2 3 >> 5 V4 A3 0 >> 6 V2 A4 0 >> >> >> >> >> >> >> Thanks! >> >> >> Quoting Bill Dunlap <williamwdunlap at gmail.com>: >> >> > What should the result be for >> > Data1 <- data.frame(Vendor=c("V1","V2","V3","V4"), >> > Account=c("A1","A2","A2","A2")) >> > ? >> > >> > Must each vendor have only one account? If not, what should the result >> be >> > for >> > Data2 <- data.frame(Vendor=c("V1","V2","V3","V1","V4","V2"), >> > Account=c("A1","A2","A2","A2","A3","A4")) >> > ? >> > >> > -Bill >> > >> > On Tue, Nov 17, 2020 at 1:20 PM Tom Woolman <twoolman at ontargettek.com> >> > wrote: >> > >> >> Hi everyone. I have a dataframe that is a collection of Vendor IDs >> >> plus a bank account number for each vendor. I'm trying to find a way >> >> to count the number of duplicate bank accounts that occur in more than >> >> one unique Vendor_ID, and then assign the count value for each row in >> >> the dataframe in a new variable. >> >> >> >> I can do a count of bank accounts that occur within the same vendor >> >> using dplyr and group_by and count, but I can't figure out a way to >> >> count duplicates among multiple Vendor_IDs. >> >> >> >> >> >> Dataframe example code: >> >> >> >> >> >> #Create a sample data frame: >> >> >> >> set.seed(1) >> >> >> >> Data <- data.frame(Vendor_ID = sample(1:10000), Bank_Account_ID >> >> sample(1:10000)) >> >> >> >> >> >> >> >> >> >> Thanks in advance for any help. >> >> >> >> ______________________________________________ >> >> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see >> >> https://stat.ethz.ch/mailman/listinfo/r-help >> >> PLEASE do read the posting guide >> >> http://www.R-project.org/posting-guide.html >> >> and provide commented, minimal, self-contained, reproducible code. >> >> >> >> ______________________________________________ >> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see >> https://stat.ethz.ch/mailman/listinfo/r-help >> PLEASE do read the posting guide >> http://www.R-project.org/posting-guide.html >> and provide commented, minimal, self-contained, reproducible code. >>
Avi Gross
2020-Nov-17 23:56 UTC
[R] counting duplicate items that occur in multiple groups
Many problems can often be solved with some thought by using the right tools, such as the ones from the tidyverse. Without giving a specific answer, you might want to think about using the group_by() functionality in a pipeline that would lump together all rows matching say having the same value in several columns. Then in something like a mutate() or summarize() you can use special functions like n() that return how many rows exist within each grouping. There are many more such verbs and features that let you build up something, often by removing the grouping along the way and perhaps adding some other form of grouping including the new rowwise() that then lets you do things across columns on a row at a time and so on. I think the point is to think of steps that lead to a result that can be used in the next step and so on. And, for some problems, you can think outside the pipelines and create multiple intermediate data.frames with parts of what you will need and then combine them with joins or whatever it takes to efficiently get a result, or by brute force. Sometimes (as when making graphs) you might want to convert data between forms that are often called long versus wide. Yes, plenty can be done in base R or using other packages. But a good set of tools might be part of what you need to investigate. Of course, others can chime in suggesting that there are negatives to dplyr and other aspects of the tidyverse and they would be right too. -----Original Message----- From: R-help <r-help-bounces at r-project.org> On Behalf Of Tom Woolman Sent: Tuesday, November 17, 2020 6:30 PM To: Bill Dunlap <williamwdunlap at gmail.com> Cc: r-help at r-project.org Subject: Re: [R] counting duplicate items that occur in multiple groups Hi Bill. Sorry to be so obtuse with the example data, I was trying (too hard) not to share any actual values so I just created randomized values for my example; of course I should have specified that the random values would not provide the expected problem pattern. I should have just used simple dummy codes as Bill Dunlap did. So per Bill's example data for Data1, the expected (hoped for) output should be: Vendor Account Num_Vendors_Sharing_Bank_Acct 1 V1 A1 0 2 V2 A2 3 3 V3 A2 3 4 V4 A2 3 Where the new calculated variable is Num_Vendors_Sharing_Bank_Acct. The value is 3 for V2, V3 and V4 because they each share bank account A2. Likewise, in the Data2 frame, the same logic applies: Vendor Account Num_Vendors_Sharing_Bank_Acct 1 V1 A1 0 2 V2 A2 3 3 V3 A2 3 4 V1 A2 3 5 V4 A3 0 6 V2 A4 0 Thanks! Quoting Bill Dunlap <williamwdunlap at gmail.com>:> What should the result be for > Data1 <- data.frame(Vendor=c("V1","V2","V3","V4"), > Account=c("A1","A2","A2","A2")) > ? > > Must each vendor have only one account? If not, what should the > result be for > Data2 <- data.frame(Vendor=c("V1","V2","V3","V1","V4","V2"), > Account=c("A1","A2","A2","A2","A3","A4")) > ? > > -Bill > > On Tue, Nov 17, 2020 at 1:20 PM Tom Woolman <twoolman at ontargettek.com> > wrote: > >> Hi everyone. I have a dataframe that is a collection of Vendor IDs >> plus a bank account number for each vendor. I'm trying to find a way >> to count the number of duplicate bank accounts that occur in more >> than one unique Vendor_ID, and then assign the count value for each >> row in the dataframe in a new variable. >> >> I can do a count of bank accounts that occur within the same vendor >> using dplyr and group_by and count, but I can't figure out a way to >> count duplicates among multiple Vendor_IDs. >> >> >> Dataframe example code: >> >> >> #Create a sample data frame: >> >> set.seed(1) >> >> Data <- data.frame(Vendor_ID = sample(1:10000), Bank_Account_ID >> sample(1:10000)) >> >> >> >> >> Thanks in advance for any help. >> >> ______________________________________________ >> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see >> https://stat.ethz.ch/mailman/listinfo/r-help >> PLEASE do read the posting guide >> http://www.R-project.org/posting-guide.html >> and provide commented, minimal, self-contained, reproducible code. >>______________________________________________ R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.