Ana Marija
2020-Jun-01 21:50 UTC
[R] is there is a way to extract lines in between 3 files that are in common based on one column?
Hello, I have 3 data frames which have about 3.4 mill lines (but they don't have exactly the same number of lines)...they look like this:> neu1=neu[order(neu$Marker),] > head(neu1)Chr BP Marker MAF A1 A2 Direction pValue N 209565 1 100000012 1:100000012:G:T 0.229925 T G + 0.650403 1594 209566 1 100000827 1:100000827:C:T 0.287014 T C + 0.955449 1594 209567 1 100002713 1:100002713:C:T 0.097867 T C - 0.290455 1594 209568 1 100002882 1:100002882:T:G 0.287014 G T + 0.955449 1594 209569 1 100002991 1:100002991:G:A 0.097867 A G - 0.290455 1594 209570 1 100004726 1:100004726:G:A 0.132058 A G + 0.115005 1594> nep1=nep[order(nep$Marker),] > head(nep1)Chr BP Marker MAF A1 A2 Direction pValue N 209642 1 100000012 1:100000012:G:T 0.2300430 T G - 0.1420030 1641 209643 1 100000827 1:100000827:C:T 0.2867150 T C - 0.2045580 1641 209644 1 100002713 1:100002713:C:T 0.0975015 T C - 0.0555507 1641 209645 1 100002882 1:100002882:T:G 0.2867150 G T - 0.2045580 1641 209646 1 100002991 1:100002991:G:A 0.0975015 A G - 0.0555507 1641 209647 1 100004726 1:100004726:G:A 0.1325410 A G - 0.8725660 1641> ret1=ret[order(ret$Marker),] > head(ret1)Chr BP Marker MAF A1 A2 Direction pValue N 865453 1 100000012 1:100000012:G:T 0.2322760 T G - 0.230383 1608 451596 1 100000827 1:100000827:C:T 0.2882460 T C - 0.120356 1608 1026046 1 100002713 1:100002713:C:T 0.0982587 T C - 0.272936 1608 451597 1 100002882 1:100002882:T:G 0.2882460 G T - 0.120356 1608 1026047 1 100002991 1:100002991:G:A 0.0982587 A G - 0.272936 1608 2234642 1 100004726 1:100004726:G:A 0.1340170 A G - 0.594538 1608 Is there is a way to create another 3 data frames, say neu2, nep2, ret2 which would only contain lines that have the same entries in Marker column for all 3 data frames? Thanks Ana [[alternative HTML version deleted]]
Jim Lemon
2020-Jun-02 00:31 UTC
[R] is there is a way to extract lines in between 3 files that are in common based on one column?
Hi Ana, Not too hard, but your example has all the "marker" fields in common. So using a sample that will show the expected result: neu1<-read.table(text="Chr BP Marker MAF A1 A2 Direction pValue N 1 100000012 1:100000012:G:T 0.229925 T G + 0.650403 1594 1 100000827 1:100000827:C:T 0.287014 T C + 0.955449 1594 1 100002713 1:100002713:C:T 0.097867 T C - 0.290455 1594 1 100002882 1:100002882:T:G 0.287014 G T + 0.955449 1594 1 100002991 1:100002991:G:A 0.097867 A G - 0.290455 1594 1 100004726 1:100004726:G:A 0.132058 A G + 0.115005 1594", header=TRUE,stringsAsFactors=FALSE) nep1<-read.table(text="Chr BP Marker MAF A1 A2 Direction pValue N 1 100000012 1:100000012:G:T 0.2300430 T G - 0.1420030 1641 1 100000827 1:100000827:C:T 0.2867150 T C - 0.2045580 1641 1 100002713 1:100002713:C:T 0.0975015 T C - 0.0555507 1641 1 100002882 1:100002882:T:G 0.2867150 G T - 0.2045580 1641 1 100002991 1:100002991:G:A 0.0975015 A G - 0.0555507 1641 1 100004726 1:100004727:G:A 0.1325410 A G - 0.8725660 1641", header=TRUE,stringsAsFactors=FALSE) ret1<-read.table(text="Chr BP Marker MAF A1 A2 Direction pValue N 1 100000012 1:100000012:G:T 0.2322760 T G - 0.230383 1608 1 100000827 1:100000827:C:T 0.2882460 T C - 0.120356 1608 1 100002713 1:100002713:C:T 0.0982587 T C - 0.272936 1608 1 100002882 1:100002882:T:G 0.2882460 G T - 0.120356 1608 1 100002991 1:100002992:G:A 0.0982587 A G - 0.272936 1608 1 100004726 1:100004727:G:A 0.1340170 A G - 0.594538 1608", header=TRUE,stringsAsFactors=FALSE) # merge the three data frames on "Marker" nn1<-merge(neu1,nep1,by="Marker") nn2<-merge(nn1,ret1,by="Marker") # get the common "Marker" strings Marker3<-nn2$Marker # subset all three data frames on Marker3 neu2<-neu1[neu1$Marker %in% Marker3,] nep2<-nep1[nep1$Marker %in% Marker3,] ret2<-ret1[ret1$Marker %in% Marker3,] Jim On Tue, Jun 2, 2020 at 7:50 AM Ana Marija <sokovic.anamarija at gmail.com> wrote:> > Hello, > > I have 3 data frames which have about 3.4 mill lines (but they don't have > exactly the same number of lines)...they look like this: > ... > Is there is a way to create another 3 data frames, say neu2, nep2, ret2 > which would only contain lines that have the same entries in Marker column > for all 3 data frames? > > Thanks > Ana > > [[alternative HTML version deleted]] > > ______________________________________________ > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code.
Ana Marija
2020-Jun-02 00:40 UTC
[R] is there is a way to extract lines in between 3 files that are in common based on one column?
Hi Jim, thank you so much for getting back to me. I tried your code and this is what I get:> dim(neu2)[1] 3740988 9> dim(nep2)[1] 3740988 9> dim(ret2)[1] 3740001 9 I think I would need to have the same number of lines in all 3 data frames. Can you please advise. Cheers Ana On Mon, Jun 1, 2020 at 7:31 PM Jim Lemon <drjimlemon at gmail.com> wrote:> Hi Ana, > Not too hard, but your example has all the "marker" fields in common. > So using a sample that will show the expected result: > > neu1<-read.table(text="Chr BP Marker MAF A1 A2 Direction pValue N > 1 100000012 1:100000012:G:T 0.229925 T G + 0.650403 1594 > 1 100000827 1:100000827:C:T 0.287014 T C + 0.955449 1594 > 1 100002713 1:100002713:C:T 0.097867 T C - 0.290455 1594 > 1 100002882 1:100002882:T:G 0.287014 G T + 0.955449 1594 > 1 100002991 1:100002991:G:A 0.097867 A G - 0.290455 1594 > 1 100004726 1:100004726:G:A 0.132058 A G + 0.115005 1594", > header=TRUE,stringsAsFactors=FALSE) > > nep1<-read.table(text="Chr BP Marker MAF A1 A2 Direction pValue N > 1 100000012 1:100000012:G:T 0.2300430 T G - 0.1420030 1641 > 1 100000827 1:100000827:C:T 0.2867150 T C - 0.2045580 1641 > 1 100002713 1:100002713:C:T 0.0975015 T C - 0.0555507 1641 > 1 100002882 1:100002882:T:G 0.2867150 G T - 0.2045580 1641 > 1 100002991 1:100002991:G:A 0.0975015 A G - 0.0555507 1641 > 1 100004726 1:100004727:G:A 0.1325410 A G - 0.8725660 1641", > header=TRUE,stringsAsFactors=FALSE) > > ret1<-read.table(text="Chr BP Marker MAF A1 A2 Direction pValue N > 1 100000012 1:100000012:G:T 0.2322760 T G - 0.230383 1608 > 1 100000827 1:100000827:C:T 0.2882460 T C - 0.120356 1608 > 1 100002713 1:100002713:C:T 0.0982587 T C - 0.272936 1608 > 1 100002882 1:100002882:T:G 0.2882460 G T - 0.120356 1608 > 1 100002991 1:100002992:G:A 0.0982587 A G - 0.272936 1608 > 1 100004726 1:100004727:G:A 0.1340170 A G - 0.594538 1608", > header=TRUE,stringsAsFactors=FALSE) > > # merge the three data frames on "Marker" > nn1<-merge(neu1,nep1,by="Marker") > nn2<-merge(nn1,ret1,by="Marker") > # get the common "Marker" strings > Marker3<-nn2$Marker > # subset all three data frames on Marker3 > neu2<-neu1[neu1$Marker %in% Marker3,] > nep2<-nep1[nep1$Marker %in% Marker3,] > ret2<-ret1[ret1$Marker %in% Marker3,] > > Jim > > On Tue, Jun 2, 2020 at 7:50 AM Ana Marija <sokovic.anamarija at gmail.com> > wrote: > > > > Hello, > > > > I have 3 data frames which have about 3.4 mill lines (but they don't have > > exactly the same number of lines)...they look like this: > > ... > > Is there is a way to create another 3 data frames, say neu2, nep2, ret2 > > which would only contain lines that have the same entries in Marker > column > > for all 3 data frames? > > > > Thanks > > Ana > > > > [[alternative HTML version deleted]] > > > > ______________________________________________ > > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see > > https://stat.ethz.ch/mailman/listinfo/r-help > > PLEASE do read the posting guide > http://www.R-project.org/posting-guide.html > > and provide commented, minimal, self-contained, reproducible code. >[[alternative HTML version deleted]]