Tom Woolman
2020-Nov-17 21:10 UTC
[R] counting duplicate items that occur in multiple groups
Hi everyone. I have a dataframe that is a collection of Vendor IDs plus a bank account number for each vendor. I'm trying to find a way to count the number of duplicate bank accounts that occur in more than one unique Vendor_ID, and then assign the count value for each row in the dataframe in a new variable. I can do a count of bank accounts that occur within the same vendor using dplyr and group_by and count, but I can't figure out a way to count duplicates among multiple Vendor_IDs. Dataframe example code: #Create a sample data frame: set.seed(1) Data <- data.frame(Vendor_ID = sample(1:10000), Bank_Account_ID = sample(1:10000)) Thanks in advance for any help.
Bert Gunter
2020-Nov-17 22:22 UTC
[R] counting duplicate items that occur in multiple groups
Inline. Bert Gunter "The trouble with having an open mind is that people keep coming along and sticking things into it." -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip ) On Tue, Nov 17, 2020 at 1:20 PM Tom Woolman <twoolman at ontargettek.com> wrote:> Hi everyone. I have a dataframe that is a collection of Vendor IDs > plus a bank account number for each vendor.I interpret this as: "all vendors are unique and each vendor has a single bank account." Is that correct?> I'm trying to find a way > to count the number of duplicate bank accounts that occur in more than > one unique Vendor_ID,The following makes no sense to me, as each row is a unique vendor and has only one bank account.> and then assign the count value for each row in > the dataframe in a new variable. > > I can do a count of bank accounts that occur within the same vendor >using dplyr and group_by and count, but I can't figure out a way to> count duplicates among multiple Vendor_IDs. >I interpret this to mean that you want to count vendor ID's by account . With only one account per vendor this is trivial; e.g. set.seed(22) d1 <- data.frame(id = sample(1:30), account = sample(1:20,30, replace = TRUE)) table(d1$account) ## gives 1 2 3 6 7 8 9 10 11 13 15 16 17 18 19 20 3 1 2 1 1 1 1 1 4 3 1 2 1 3 2 3 Note that AFAICS your example is useless, as it gives the same number of different account numbers as ID's, so no duplication can occur. As my interpretations are likely incorrect and this is not what you mean nor want, either clarify your meaning and provide a useful **minimal** example; or wait for a reply from someone with a better understanding than I. Cheers, Bert> > Dataframe example code: > > > #Create a sample data frame: > > set.seed(1) > > Data <- data.frame(Vendor_ID = sample(1:10000), Bank_Account_ID > sample(1:10000)) > > > > > Thanks in advance for any help. > > ______________________________________________ > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide > http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >[[alternative HTML version deleted]]
Bill Dunlap
2020-Nov-17 23:01 UTC
[R] counting duplicate items that occur in multiple groups
What should the result be for Data1 <- data.frame(Vendor=c("V1","V2","V3","V4"), Account=c("A1","A2","A2","A2")) ? Must each vendor have only one account? If not, what should the result be for Data2 <- data.frame(Vendor=c("V1","V2","V3","V1","V4","V2"), Account=c("A1","A2","A2","A2","A3","A4")) ? -Bill On Tue, Nov 17, 2020 at 1:20 PM Tom Woolman <twoolman at ontargettek.com> wrote:> Hi everyone. I have a dataframe that is a collection of Vendor IDs > plus a bank account number for each vendor. I'm trying to find a way > to count the number of duplicate bank accounts that occur in more than > one unique Vendor_ID, and then assign the count value for each row in > the dataframe in a new variable. > > I can do a count of bank accounts that occur within the same vendor > using dplyr and group_by and count, but I can't figure out a way to > count duplicates among multiple Vendor_IDs. > > > Dataframe example code: > > > #Create a sample data frame: > > set.seed(1) > > Data <- data.frame(Vendor_ID = sample(1:10000), Bank_Account_ID > sample(1:10000)) > > > > > Thanks in advance for any help. > > ______________________________________________ > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide > http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >[[alternative HTML version deleted]]
Tom Woolman
2020-Nov-17 23:29 UTC
[R] counting duplicate items that occur in multiple groups
Hi Bill. Sorry to be so obtuse with the example data, I was trying (too hard) not to share any actual values so I just created randomized values for my example; of course I should have specified that the random values would not provide the expected problem pattern. I should have just used simple dummy codes as Bill Dunlap did. So per Bill's example data for Data1, the expected (hoped for) output should be: Vendor Account Num_Vendors_Sharing_Bank_Acct 1 V1 A1 0 2 V2 A2 3 3 V3 A2 3 4 V4 A2 3 Where the new calculated variable is Num_Vendors_Sharing_Bank_Acct. The value is 3 for V2, V3 and V4 because they each share bank account A2. Likewise, in the Data2 frame, the same logic applies: Vendor Account Num_Vendors_Sharing_Bank_Acct 1 V1 A1 0 2 V2 A2 3 3 V3 A2 3 4 V1 A2 3 5 V4 A3 0 6 V2 A4 0 Thanks! Quoting Bill Dunlap <williamwdunlap at gmail.com>:> What should the result be for > Data1 <- data.frame(Vendor=c("V1","V2","V3","V4"), > Account=c("A1","A2","A2","A2")) > ? > > Must each vendor have only one account? If not, what should the result be > for > Data2 <- data.frame(Vendor=c("V1","V2","V3","V1","V4","V2"), > Account=c("A1","A2","A2","A2","A3","A4")) > ? > > -Bill > > On Tue, Nov 17, 2020 at 1:20 PM Tom Woolman <twoolman at ontargettek.com> > wrote: > >> Hi everyone. I have a dataframe that is a collection of Vendor IDs >> plus a bank account number for each vendor. I'm trying to find a way >> to count the number of duplicate bank accounts that occur in more than >> one unique Vendor_ID, and then assign the count value for each row in >> the dataframe in a new variable. >> >> I can do a count of bank accounts that occur within the same vendor >> using dplyr and group_by and count, but I can't figure out a way to >> count duplicates among multiple Vendor_IDs. >> >> >> Dataframe example code: >> >> >> #Create a sample data frame: >> >> set.seed(1) >> >> Data <- data.frame(Vendor_ID = sample(1:10000), Bank_Account_ID >> sample(1:10000)) >> >> >> >> >> Thanks in advance for any help. >> >> ______________________________________________ >> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see >> https://stat.ethz.ch/mailman/listinfo/r-help >> PLEASE do read the posting guide >> http://www.R-project.org/posting-guide.html >> and provide commented, minimal, self-contained, reproducible code. >>
Jim Lemon
2020-Nov-18 10:22 UTC
[R] counting duplicate items that occur in multiple groups
Oops, I sent this to Tom earlier today and forgot to copy to the list: VendorID=rep(paste0("V",1:10),each=5) AcctID=paste0("A",sample(1:5,50,TRUE)) Data<-data.frame(VendorID,AcctID) table(Data) # get multiple vendors for each account dupAcctID<-colSums(table(Data)>0) Data$dupAcct<-NA # fill in the new column for(i in 1:length(dupAcctID)) Data$dupAcct[Data$AcctID == names(dupAcctID[i])]<-dupAcctID[i] Jim On Wed, Nov 18, 2020 at 8:20 AM Tom Woolman <twoolman at ontargettek.com> wrote:> Hi everyone. I have a dataframe that is a collection of Vendor IDs > plus a bank account number for each vendor. I'm trying to find a way > to count the number of duplicate bank accounts that occur in more than > one unique Vendor_ID, and then assign the count value for each row in > the dataframe in a new variable. > > I can do a count of bank accounts that occur within the same vendor > using dplyr and group_by and count, but I can't figure out a way to > count duplicates among multiple Vendor_IDs. > > > Dataframe example code: > > > #Create a sample data frame: > > set.seed(1) > > Data <- data.frame(Vendor_ID = sample(1:10000), Bank_Account_ID > sample(1:10000)) > > > > > Thanks in advance for any help. > > ______________________________________________ > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide > http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >[[alternative HTML version deleted]]
Tom Woolman
2020-Nov-18 10:25 UTC
[R] counting duplicate items that occur in multiple groups
Thanks, everyone! Quoting Jim Lemon <drjimlemon at gmail.com>:> Oops, I sent this to Tom earlier today and forgot to copy to the list: > > VendorID=rep(paste0("V",1:10),each=5) > AcctID=paste0("A",sample(1:5,50,TRUE)) > Data<-data.frame(VendorID,AcctID) > table(Data) > # get multiple vendors for each account > dupAcctID<-colSums(table(Data)>0) > Data$dupAcct<-NA > # fill in the new column > for(i in 1:length(dupAcctID)) > Data$dupAcct[Data$AcctID == names(dupAcctID[i])]<-dupAcctID[i] > > Jim > > On Wed, Nov 18, 2020 at 8:20 AM Tom Woolman <twoolman at ontargettek.com> > wrote: > >> Hi everyone. I have a dataframe that is a collection of Vendor IDs >> plus a bank account number for each vendor. I'm trying to find a way >> to count the number of duplicate bank accounts that occur in more than >> one unique Vendor_ID, and then assign the count value for each row in >> the dataframe in a new variable. >> >> I can do a count of bank accounts that occur within the same vendor >> using dplyr and group_by and count, but I can't figure out a way to >> count duplicates among multiple Vendor_IDs. >> >> >> Dataframe example code: >> >> >> #Create a sample data frame: >> >> set.seed(1) >> >> Data <- data.frame(Vendor_ID = sample(1:10000), Bank_Account_ID >> sample(1:10000)) >> >> >> >> >> Thanks in advance for any help. >> >> ______________________________________________ >> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see >> https://stat.ethz.ch/mailman/listinfo/r-help >> PLEASE do read the posting guide >> http://www.R-project.org/posting-guide.html >> and provide commented, minimal, self-contained, reproducible code. >> > > [[alternative HTML version deleted]] > > ______________________________________________ > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code.