Greetings,

I've been struggling for some time with a problem concerning a big database that I have to deal with. I'll try to illustrate my problem with a small example, since the database is really big. Suppose I have the following data:

AA = c(4,4,4,2,2,6,8,9)
A1 = c(3,3,5,5,5,7,11,12)
A2 = c(3,3,5,5,5,7,11,12)
A = cbind(AA, A1, A2)

BB = c(2,2,4,6,6)
B1 = c(5,11,7,13,NA)
B2 = c(3,12,11,NA,NA)
B3 = c(12,13,NA,NA,NA)
B = cbind(BB, B1, B2, B3)

I have to do the following:

1. Create a dummy (binary) variable in a new column of A that indicates whether, for each row:
a) the value from column AA can be found in BB
b) within the rows of B that correspond to the value of AA, I can find both A1 and A2 in B1, B2 or B3.

In this example I would have
[0,0,1,1,1,0,0,0]

I have been able to do it with some loops; the problem is that in the original data A has 2,936,044 rows and B has 14,965, so it's taking forever to finish (probably because I am doing it the wrong way).

I would really appreciate any help or advice on how to deal with this. Thanks!

--
View this message in context: http://r.789695.n4.nabble.com/Big-data-and-column-correspondence-problem-tp3694912p3694912.html
Sent from the R help mailing list archive at Nabble.com.

For question (a), do:

which(AA%in%BB)

Question (b) is very ambiguous to me. It makes little sense for your example because all values of BB are in AA. Therefore I am wondering whether you meant in question (a) that you want to find all values in BB that are in AA. That's not the same thing.

I am also not sure what exactly you mean by "within the lines of B that correspond to the value of AA". If you mean all the lines of B for which AA is in BB, then you get that by:

B[which(AA%in%BB) , ]

However, this gives an error because which(AA%in%BB) returns positions in AA, and more elements of AA are found in BB than B has rows. This leads me to believe that you might want

which(BB%in%AA)

for question (a). In this case you would get the lines of B by

B[which(BB%in%AA) , ]

which in this example are all rows of B.

Again, part (b) is very opaque to me. It would help if you described step by step what it should do and what the outcome of every step along the way should be. Just from the final result that it should produce and your description, I cannot make sense of it. But maybe another helper can.

Daniel
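
To make the distinction between positions in AA and rows of B concrete, here is a minimal sketch using the AA and BB vectors from the example. `AA %in% BB` returns one logical value per element of AA, so `which()` applied to it yields positions in AA, not row numbers of B; `match(AA, BB)` is one way to look up a row of B for each element of AA. The names `in_BB`, `dummy_a`, `idx_in_AA` and `first_row_in_B` are illustrative and do not come from the thread.

```r
AA <- c(4, 4, 4, 2, 2, 6, 8, 9)
BB <- c(2, 2, 4, 6, 6)

# One TRUE/FALSE per element of AA: is this value present anywhere in BB?
in_BB <- AA %in% BB

# 0/1 dummy for condition (a), same length as AA
dummy_a <- as.integer(in_BB)

# Positions within AA (not rows of B) where the value was found
idx_in_AA <- which(in_BB)

# For each element of AA, the FIRST row of B with the same key, NA if none
first_row_in_B <- match(AA, BB)
```

Note that `match` reports only the first matching row, so duplicated keys in BB (for instance the two rows with BB == 2) would need separate handling.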

This is much clearer. So here is what I think you want to do, in theory and in practice.

Theory: Check whether AA[i] is in BB. If it is, take the row where BB[j] == AA[i] and check whether A1 and A2 are in B1 to B3. Is that right? You want the indicator to be 1 only if both are.

Here is how you do this:

# keep only the rows of A whose AA value also occurs in BB
newdata <- merge(A, B, by.x='AA', by.y='BB', all.x=FALSE, all.y=FALSE)
# A1.check: is A1 found in B1, B2 or B3 of the matched row?
A1.check <- with(newdata, A1==B1 | A1==B2 | A1==B3)
# B1.check: the same test for A2
B1.check <- with(newdata, A2==B1 | A2==B2 | A2==B3)
# comparisons against NA yield NA; treat those as "not found"
A1.check <- replace(A1.check, which(is.na(A1.check)), 0)
B1.check <- replace(B1.check, which(is.na(B1.check)), 0)
newdata <- data.frame(newdata, A1.check, B1.check)
newdata$index <- with(newdata, ifelse(A1.check+B1.check==2, 1, 0))

HTH,
Daniel

murilofm wrote:
>
>>> I cannot see A1[1]=20 in your example data.
>
> Sorry about the typos.... A1[1]=3.
>
>>> Why B[3,]?
>
> Because AA[1]=BB[3]=4.
>
> I will reformulate the example with the code I'm running:
>
> AA = c(4,4,4,2,2,6,8,9)
> A1 = c(3,3,11,5,5,7,11,12)
> A2 = c(3,3,7,3,5,7,11,12)
> A = cbind(AA, A1, A2)
>
> BB = c(2,2,4,6,6)
> B1 = c(5,11,7,13,NA)
> B2 = c(4,12,11,NA,NA)
> B3 = c(12,13,NA,NA,NA)
>
> A = cbind(AA, A1, A2, 0)
> B = cbind(BB, B1, B2, B3)
>
> for(i in 1:dim(A)[1]){
>   if (!is.na(sum(match(A[i,2:3],B[B[,1]==A[i,1],2:dim(B)[2]])))) A[i,4]<-1
> }
>
> Thanks
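
For data of the size mentioned in the thread, one possible way to avoid the row-by-row loop quoted above is to turn B into a set of (BB, value) pairs and test membership with `%in%`. This is only a sketch, not something either poster proposed. It assumes that AA, A1 and A2 contain no missing values and that all values are integer-like codes that `paste()` renders unambiguously. The names `b_vals`, `b_keys`, `b_pairs`, `has_A1`, `has_A2` and `dummy` are made up for illustration.

```r
# Reformulated example data from the thread
AA <- c(4, 4, 4, 2, 2, 6, 8, 9)
A1 <- c(3, 3, 11, 5, 5, 7, 11, 12)
A2 <- c(3, 3, 7, 3, 5, 7, 11, 12)
A  <- cbind(AA, A1, A2)

BB <- c(2, 2, 4, 6, 6)
B1 <- c(5, 11, 7, 13, NA)
B2 <- c(4, 12, 11, NA, NA)
B3 <- c(12, 13, NA, NA, NA)
B  <- cbind(BB, B1, B2, B3)

# Flatten B1..B3 column by column and repeat the BB key once per column,
# giving one (key, value) pair per cell of B1..B3.
b_vals <- as.vector(B[, -1])
b_keys <- rep(B[, "BB"], times = ncol(B) - 1)

# Drop the NA cells and encode each remaining pair as a string such as "4_11".
keep    <- !is.na(b_vals)
b_pairs <- unique(paste(b_keys[keep], b_vals[keep], sep = "_"))

# A row of A qualifies when both (AA, A1) and (AA, A2) occur among the pairs.
# A value of AA that is absent from BB can never produce a matching pair.
has_A1 <- paste(A[, "AA"], A[, "A1"], sep = "_") %in% b_pairs
has_A2 <- paste(A[, "AA"], A[, "A2"], sep = "_") %in% b_pairs
dummy  <- as.integer(has_A1 & has_A2)

A <- cbind(A, dummy)
```

Like the loop, this pools all rows of B that share the same BB value, whereas a plain `merge` produces one output row per matching pair of rows, so duplicated BB values would still need to be aggregated back to one result per row of A.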