Hi, If I have two datasets like this: df=data.frame("id"=rep(1:10,10, each=10), "item1"=sample(1:20, 100, replace=T) df2=data.frame("a"=c(8, 8,10,9, 5, 1,2,1), "b"=c(16,18,11, 19,18, 11,17,12)) How do I find out which ids in the df dataset that has a match for both the numbers occuring in the same row in the df2 dataframe? In the output I would like to get the matching id and the rownumber from the df2. Output something like this Id Rownr 2 1 5 1 7 4 My actual problem is more complex with even more columns to be matched and the datasets are large, hence the solution needs to be efficient. Kind regards, N?r du skickar e-post till Karolinska Institutet (KI) inneb?r detta att KI kommer att behandla dina personuppgifter. H?r finns information om hur KI behandlar personuppgifter<ki.se/medarbetare/integritetsskyddspolicy>. Sending email to Karolinska Institutet (KI) will result in KI processing your personal data. You can read more about KI's processing of personal data here<ki.se/en/staff/data-protection-policy>. [[alternative HTML version deleted]]
Hallo Marine Could you please make your example more reproducible by using set.seed (and maybe smaller)? If I understand correctly, you want to know if let say row 1 items from df2 (8,16) are both in item column of specific id? If I am correct in guessing, I cannot find another solution than split your df according to id x <- split(df, df$id)[[1]] and for each row of df2 test if within the specified id you can find both numbers. sum(is.element(df2[1,], x$item))==2 [1] FALSE So basically 2 cycles, one for df ids and the other for df2 rows. But maybe somebody will give you more ingenious answer. Cheers Petr> -----Original Message----- > From: R-help <r-help-bounces at r-project.org> On Behalf Of Marine Andersson > Sent: Friday, October 7, 2022 1:58 PM > To: r-help at r-project.org > Subject: [R] Ids with matching number combinations? > > Hi, > > If I have two datasets like this: > df=data.frame("id"=rep(1:10,10, each=10), "item1"=sample(1:20, 100, > replace=T) > df2=data.frame("a"=c(8, 8,10,9, 5, 1,2,1), "b"=c(16,18,11, 19,18,11,17,12))> > How do I find out which ids in the df dataset that has a match for boththe> numbers occuring in the same row in the df2 dataframe? In the output Iwould> like to get the matching id and the rownumber from the df2. > > Output something like this > Id Rownr > 2 1 > 5 1 > 7 4 > > My actual problem is more complex with even more columns to be matched and > the datasets are large, hence the solution needs to be efficient. > > Kind regards, > > > > > > N?r du skickar e-post till Karolinska Institutet (KI) inneb?r detta att KIkommer> att behandla dina personuppgifter. H?r finns information om hur KIbehandlar> personuppgifter<ki.se/medarbetare/integritetsskyddspolicy>. > > > Sending email to Karolinska Institutet (KI) will result in KI processingyour> personal data. You can read more about KI's processing of personal data > here<ki.se/en/staff/data-protection-policy>. > > [[alternative HTML version deleted]] > > ______________________________________________ > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see > stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guideR-project.org/posting-guide.html> and provide commented, minimal, self-contained, reproducible code.
Well, for a start, you might give us a reproducible example that actually runs -- yours doesn't. Did you check? You seem to b.e missing a final ")"(Also, you do not need to quote the column names in data.frame(), though it works fine also if you do). Also note that in df, your id column has length 1000 and item1 column has length 100 and will be replicated to match id. This seems to me likely to be misspecified. Or is this what you meant? Finally, note that that none of the rows in df2 have identical numbers, so there is no id in df that can match both members of the row. So did you mean that the 'id' and 'item1' value in a row of df must match the corresponding 'a' and 'b' values of some row in df2 ? Under the above interpretation and with the following reprex(note the use of set.seed() to make it reproducible)... set.seed(1234) ## for reproducibility df=data.frame(id=c(10,rep(1:10, each=10), item1=sample(1:20, 100, replace=T)) df2=data.frame(a=c(8, 8,10,9, 5, 1,2,1), b=c(16,18,11, 19,18, 11,17,12)) ... you can paste0() the column vectors together in each data frame(as character values) and then just match on the single character vectors, like this: both <- match(do.call(paste0,df), do.call(paste0, df2)) ## subscript the data frames if you have more columns that are not used for matching> both[1] NA NA 8 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA [21] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA [41] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA [61] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA [81] NA NA NA NA NA NA NA NA NA NA 3 NA NA NA NA NA NA NA NA 3 Then> both[!is.na(both)] ## the rows of df2 that matched[1] 8 3 3> which(!is.na(both)) ## the rows of df that were matched[1] 3 91 100 This should be very efficient, as hashing is used for matching. I think there is a slicker way to do this that combines the paste and match functions together (other than using merge() or the like, as has already been suggested). But I have forgotten the details or I may just be thinking of merge(), which may indeed be a better option. All presuming this is what you meant, of course, which it may not be. :-( Cheers, Bert On Fri, Oct 7, 2022 at 5:57 AM Marine Andersson <marine.andersson at ki.se> wrote:> Hi, > > If I have two datasets like this: > df=data.frame("id"=rep(1:10,10, each=10), "item1"=sample(1:20, 100, > replace=T) > df2=data.frame("a"=c(8, 8,10,9, 5, 1,2,1), "b"=c(16,18,11, 19,18, > 11,17,12)) > > How do I find out which ids in the df dataset that has a match for both > the numbers occuring in the same row in the df2 dataframe? In the output I > would like to get the matching id and the rownumber from the df2. > > Output something like this > Id Rownr > 2 1 > 5 1 > 7 4 > > My actual problem is more complex with even more columns to be matched and > the datasets are large, hence the solution needs to be efficient. > > Kind regards, > > > > > > N?r du skickar e-post till Karolinska Institutet (KI) inneb?r detta att KI > kommer att behandla dina personuppgifter. H?r finns information om hur KI > behandlar personuppgifter< > ki.se/medarbetare/integritetsskyddspolicy>. > > > Sending email to Karolinska Institutet (KI) will result in KI processing > your personal data. You can read more about KI's processing of personal > data here<ki.se/en/staff/data-protection-policy>. > > [[alternative HTML version deleted]] > > ______________________________________________ > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see > stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide > R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >[[alternative HTML version deleted]]
Hi Marine, I'm not sure that I have the right idea, but this worked for me and may help. # I had to fix this line df<-data.frame(id=rep(1:10,each=10),item1=sample(1:20, 100, replace=TRUE)) df2<-data.frame(a=c(8, 8,10,9, 5, 1,2,1),b=c(16,18,11, 19,18, 11,17,12)) result<-data.frame(id=df[,1],rown=1:dim(df)[1],matched=rep(FALSE,dim(dfw)[1])) # function that checks that all entries in vector a exist in vector x match_all_a<-function(a,x,index) { inx<-TRUE for(i in 1:length(a)) inx <- inx && a[i] %in% x[-1] return(list(id=x[1],rown=index,matched=inx)) } # step through the data frame for(i in df[,1]) result[i,]<-match_all_a(df2[i,],df[i,],i) # display the result id, row in df, whether matched result # select the rows of result for which matched is TRUE Jim On Fri, Oct 7, 2022 at 11:57 PM Marine Andersson <marine.andersson at ki.se> wrote:> > Hi, > > If I have two datasets like this: > df=data.frame("id"=rep(1:10,10, each=10), "item1"=sample(1:20, 100, replace=T) > df2=data.frame("a"=c(8, 8,10,9, 5, 1,2,1), "b"=c(16,18,11, 19,18, 11,17,12)) > > How do I find out which ids in the df dataset that has a match for both the numbers occuring in the same row in the df2 dataframe? In the output I would like to get the matching id and the rownumber from the df2. > > Output something like this > Id Rownr > 2 1 > 5 1 > 7 4 > > My actual problem is more complex with even more columns to be matched and the datasets are large, hence the solution needs to be efficient. > > Kind regards, > > > > > > N?r du skickar e-post till Karolinska Institutet (KI) inneb?r detta att KI kommer att behandla dina personuppgifter. H?r finns information om hur KI behandlar personuppgifter<ki.se/medarbetare/integritetsskyddspolicy>. > > > Sending email to Karolinska Institutet (KI) will result in KI processing your personal data. You can read more about KI's processing of personal data here<ki.se/en/staff/data-protection-policy>. > > [[alternative HTML version deleted]] > > ______________________________________________ > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see > stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code.