Ana Marija
2020-Jun-02 00:40 UTC
[R] is there is a way to extract lines in between 3 files that are in common based on one column?
Hi Jim, thank you so much for getting back to me. I tried your code and this is what I get:> dim(neu2)[1] 3740988 9> dim(nep2)[1] 3740988 9> dim(ret2)[1] 3740001 9 I think I would need to have the same number of lines in all 3 data frames. Can you please advise. Cheers Ana On Mon, Jun 1, 2020 at 7:31 PM Jim Lemon <drjimlemon at gmail.com> wrote:> Hi Ana, > Not too hard, but your example has all the "marker" fields in common. > So using a sample that will show the expected result: > > neu1<-read.table(text="Chr BP Marker MAF A1 A2 Direction pValue N > 1 100000012 1:100000012:G:T 0.229925 T G + 0.650403 1594 > 1 100000827 1:100000827:C:T 0.287014 T C + 0.955449 1594 > 1 100002713 1:100002713:C:T 0.097867 T C - 0.290455 1594 > 1 100002882 1:100002882:T:G 0.287014 G T + 0.955449 1594 > 1 100002991 1:100002991:G:A 0.097867 A G - 0.290455 1594 > 1 100004726 1:100004726:G:A 0.132058 A G + 0.115005 1594", > header=TRUE,stringsAsFactors=FALSE) > > nep1<-read.table(text="Chr BP Marker MAF A1 A2 Direction pValue N > 1 100000012 1:100000012:G:T 0.2300430 T G - 0.1420030 1641 > 1 100000827 1:100000827:C:T 0.2867150 T C - 0.2045580 1641 > 1 100002713 1:100002713:C:T 0.0975015 T C - 0.0555507 1641 > 1 100002882 1:100002882:T:G 0.2867150 G T - 0.2045580 1641 > 1 100002991 1:100002991:G:A 0.0975015 A G - 0.0555507 1641 > 1 100004726 1:100004727:G:A 0.1325410 A G - 0.8725660 1641", > header=TRUE,stringsAsFactors=FALSE) > > ret1<-read.table(text="Chr BP Marker MAF A1 A2 Direction pValue N > 1 100000012 1:100000012:G:T 0.2322760 T G - 0.230383 1608 > 1 100000827 1:100000827:C:T 0.2882460 T C - 0.120356 1608 > 1 100002713 1:100002713:C:T 0.0982587 T C - 0.272936 1608 > 1 100002882 1:100002882:T:G 0.2882460 G T - 0.120356 1608 > 1 100002991 1:100002992:G:A 0.0982587 A G - 0.272936 1608 > 1 100004726 1:100004727:G:A 0.1340170 A G - 0.594538 1608", > header=TRUE,stringsAsFactors=FALSE) > > # merge the three data frames on "Marker" > nn1<-merge(neu1,nep1,by="Marker") > nn2<-merge(nn1,ret1,by="Marker") > # get the common "Marker" strings > Marker3<-nn2$Marker > # subset all three data frames on Marker3 > neu2<-neu1[neu1$Marker %in% Marker3,] > nep2<-nep1[nep1$Marker %in% Marker3,] > ret2<-ret1[ret1$Marker %in% Marker3,] > > Jim > > On Tue, Jun 2, 2020 at 7:50 AM Ana Marija <sokovic.anamarija at gmail.com> > wrote: > > > > Hello, > > > > I have 3 data frames which have about 3.4 mill lines (but they don't have > > exactly the same number of lines)...they look like this: > > ... > > Is there is a way to create another 3 data frames, say neu2, nep2, ret2 > > which would only contain lines that have the same entries in Marker > column > > for all 3 data frames? > > > > Thanks > > Ana > > > > [[alternative HTML version deleted]] > > > > ______________________________________________ > > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see > > https://stat.ethz.ch/mailman/listinfo/r-help > > PLEASE do read the posting guide > http://www.R-project.org/posting-guide.html > > and provide commented, minimal, self-contained, reproducible code. >[[alternative HTML version deleted]]
Rasmus Liland
2020-Jun-02 01:34 UTC
[R] is there is a way to extract lines in between 3 files that are in common based on one column?
Dear Ana and Jim, On 2020-06-01 19:40 -0500, Ana Marija wrote:> > dim(neu2) > [1] 3740988 9 > > dim(nep2) > [1] 3740988 9 > > dim(ret2) > [1] 3740001 9Jim's code works out of the box directly from the email ... I get: [1] 6 9 [1] 4 9 [1] 4 9 On 2020-06-01 19:40 -0500, Ana Marija wrote:> On Tue, Jun 2, 2020 at 7:50 AM Ana Marija wrote: > > > > but they don't have exactly the same > > number of lines > > I think I would need to have the same > number of lines in all 3 data frames.This does not make sense :?-\ Best, Rasmus -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 833 bytes Desc: not available URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20200602/b154d574/attachment.sig>
David Winsemius
2020-Jun-02 01:41 UTC
[R] is there is a way to extract lines in between 3 files that are in common based on one column?
On 6/1/20 5:40 PM, Ana Marija wrote:> Hi Jim, > > thank you so much for getting back to me. I tried your code and this is > what I get: >> dim(neu2) > [1] 3740988 9 >> dim(nep2) > [1] 3740988 9 >> dim(ret2) > [1] 3740001 9 > > I think I would need to have the same number of lines in all 3 data frames. > > Can you please advise.You should check for duplicated Marker values. -- David> > Cheers > Ana > > On Mon, Jun 1, 2020 at 7:31 PM Jim Lemon <drjimlemon at gmail.com> wrote: > >> Hi Ana, >> Not too hard, but your example has all the "marker" fields in common. >> So using a sample that will show the expected result: >> >> neu1<-read.table(text="Chr BP Marker MAF A1 A2 Direction pValue N >> 1 100000012 1:100000012:G:T 0.229925 T G + 0.650403 1594 >> 1 100000827 1:100000827:C:T 0.287014 T C + 0.955449 1594 >> 1 100002713 1:100002713:C:T 0.097867 T C - 0.290455 1594 >> 1 100002882 1:100002882:T:G 0.287014 G T + 0.955449 1594 >> 1 100002991 1:100002991:G:A 0.097867 A G - 0.290455 1594 >> 1 100004726 1:100004726:G:A 0.132058 A G + 0.115005 1594", >> header=TRUE,stringsAsFactors=FALSE) >> >> nep1<-read.table(text="Chr BP Marker MAF A1 A2 Direction pValue N >> 1 100000012 1:100000012:G:T 0.2300430 T G - 0.1420030 1641 >> 1 100000827 1:100000827:C:T 0.2867150 T C - 0.2045580 1641 >> 1 100002713 1:100002713:C:T 0.0975015 T C - 0.0555507 1641 >> 1 100002882 1:100002882:T:G 0.2867150 G T - 0.2045580 1641 >> 1 100002991 1:100002991:G:A 0.0975015 A G - 0.0555507 1641 >> 1 100004726 1:100004727:G:A 0.1325410 A G - 0.8725660 1641", >> header=TRUE,stringsAsFactors=FALSE) >> >> ret1<-read.table(text="Chr BP Marker MAF A1 A2 Direction pValue N >> 1 100000012 1:100000012:G:T 0.2322760 T G - 0.230383 1608 >> 1 100000827 1:100000827:C:T 0.2882460 T C - 0.120356 1608 >> 1 100002713 1:100002713:C:T 0.0982587 T C - 0.272936 1608 >> 1 100002882 1:100002882:T:G 0.2882460 G T - 0.120356 1608 >> 1 100002991 1:100002992:G:A 0.0982587 A G - 0.272936 1608 >> 1 100004726 1:100004727:G:A 0.1340170 A G - 0.594538 1608", >> header=TRUE,stringsAsFactors=FALSE) >> >> # merge the three data frames on "Marker" >> nn1<-merge(neu1,nep1,by="Marker") >> nn2<-merge(nn1,ret1,by="Marker") >> # get the common "Marker" strings >> Marker3<-nn2$Marker >> # subset all three data frames on Marker3 >> neu2<-neu1[neu1$Marker %in% Marker3,] >> nep2<-nep1[nep1$Marker %in% Marker3,] >> ret2<-ret1[ret1$Marker %in% Marker3,] >> >> Jim >> >> On Tue, Jun 2, 2020 at 7:50 AM Ana Marija <sokovic.anamarija at gmail.com> >> wrote: >>> Hello, >>> >>> I have 3 data frames which have about 3.4 mill lines (but they don't have >>> exactly the same number of lines)...they look like this: >>> ... >>> Is there is a way to create another 3 data frames, say neu2, nep2, ret2 >>> which would only contain lines that have the same entries in Marker >> column >>> for all 3 data frames? >>> >>> Thanks >>> Ana >>> >>> [[alternative HTML version deleted]] >>> >>> ______________________________________________ >>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see >>> https://stat.ethz.ch/mailman/listinfo/r-help >>> PLEASE do read the posting guide >> http://www.R-project.org/posting-guide.html >>> and provide commented, minimal, self-contained, reproducible code. > [[alternative HTML version deleted]] > > ______________________________________________ > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code.
Jim Lemon
2020-Jun-02 01:48 UTC
[R] is there is a way to extract lines in between 3 files that are in common based on one column?
Hi Ana, If I add another 6 rows to neu1, 2 to nep1 and one to ret1 and modify the "Marker" field so that there is one more match, I get the result I expect. I think that the program logic is correct. I can't say why ret1 has fewer lines. If there aren't too many mismatches, maybe checking the mismatches will help: neu3<-neu1[!(neu1$Marker %in% Marker3),] nep3<-nep1[!(nep1$Marker %in% Marker3),] ret3<-ret1[!(ret1$Marker %in% Marker3),] neu3 nep3 ret3 Jim On Tue, Jun 2, 2020 at 10:40 AM Ana Marija <sokovic.anamarija at gmail.com> wrote:> > Hi Jim, > > thank you so much for getting back to me. I tried your code and this is what I get: > > dim(neu2) > [1] 3740988 9 > > dim(nep2) > [1] 3740988 9 > > dim(ret2) > [1] 3740001 9 > > I think I would need to have the same number of lines in all 3 data frames. > > Can you please advise. > > Cheers > Ana > > On Mon, Jun 1, 2020 at 7:31 PM Jim Lemon <drjimlemon at gmail.com> wrote: >> >> Hi Ana, >> Not too hard, but your example has all the "marker" fields in common. >> So using a sample that will show the expected result: >> >> neu1<-read.table(text="Chr BP Marker MAF A1 A2 Direction pValue N >> 1 100000012 1:100000012:G:T 0.229925 T G + 0.650403 1594 >> 1 100000827 1:100000827:C:T 0.287014 T C + 0.955449 1594 >> 1 100002713 1:100002713:C:T 0.097867 T C - 0.290455 1594 >> 1 100002882 1:100002882:T:G 0.287014 G T + 0.955449 1594 >> 1 100002991 1:100002991:G:A 0.097867 A G - 0.290455 1594 >> 1 100004726 1:100004726:G:A 0.132058 A G + 0.115005 1594", >> header=TRUE,stringsAsFactors=FALSE) >> >> nep1<-read.table(text="Chr BP Marker MAF A1 A2 Direction pValue N >> 1 100000012 1:100000012:G:T 0.2300430 T G - 0.1420030 1641 >> 1 100000827 1:100000827:C:T 0.2867150 T C - 0.2045580 1641 >> 1 100002713 1:100002713:C:T 0.0975015 T C - 0.0555507 1641 >> 1 100002882 1:100002882:T:G 0.2867150 G T - 0.2045580 1641 >> 1 100002991 1:100002991:G:A 0.0975015 A G - 0.0555507 1641 >> 1 100004726 1:100004727:G:A 0.1325410 A G - 0.8725660 1641", >> header=TRUE,stringsAsFactors=FALSE) >> >> ret1<-read.table(text="Chr BP Marker MAF A1 A2 Direction pValue N >> 1 100000012 1:100000012:G:T 0.2322760 T G - 0.230383 1608 >> 1 100000827 1:100000827:C:T 0.2882460 T C - 0.120356 1608 >> 1 100002713 1:100002713:C:T 0.0982587 T C - 0.272936 1608 >> 1 100002882 1:100002882:T:G 0.2882460 G T - 0.120356 1608 >> 1 100002991 1:100002992:G:A 0.0982587 A G - 0.272936 1608 >> 1 100004726 1:100004727:G:A 0.1340170 A G - 0.594538 1608", >> header=TRUE,stringsAsFactors=FALSE) >> >> # merge the three data frames on "Marker" >> nn1<-merge(neu1,nep1,by="Marker") >> nn2<-merge(nn1,ret1,by="Marker") >> # get the common "Marker" strings >> Marker3<-nn2$Marker >> # subset all three data frames on Marker3 >> neu2<-neu1[neu1$Marker %in% Marker3,] >> nep2<-nep1[nep1$Marker %in% Marker3,] >> ret2<-ret1[ret1$Marker %in% Marker3,] >> >> Jim >> >> On Tue, Jun 2, 2020 at 7:50 AM Ana Marija <sokovic.anamarija at gmail.com> wrote: >> > >> > Hello, >> > >> > I have 3 data frames which have about 3.4 mill lines (but they don't have >> > exactly the same number of lines)...they look like this: >> > ... >> > Is there is a way to create another 3 data frames, say neu2, nep2, ret2 >> > which would only contain lines that have the same entries in Marker column >> > for all 3 data frames? >> > >> > Thanks >> > Ana >> > >> > [[alternative HTML version deleted]] >> > >> > ______________________________________________ >> > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see >> > https://stat.ethz.ch/mailman/listinfo/r-help >> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html >> > and provide commented, minimal, self-contained, reproducible code.
Ana Marija
2020-Jun-02 01:50 UTC
[R] is there is a way to extract lines in between 3 files that are in common based on one column?
Hi David, that is a great point! Yes indeed some are non unique:> dim(neu1)[1] 3742845 9> length(unique(neu1$Marker))[1] 3741858> length(unique(nep1$Marker))[1] 3745560> dim(nep1)[1] 3746550 9> length(unique(ret1$Marker))[1] 3743494> dim(ret1)[1] 3743494 9 How would I rewrite this code so that is merging by Chr and Marker column? It seems that a Marker can be under a few Chr. On Mon, Jun 1, 2020 at 8:41 PM David Winsemius <dwinsemius at comcast.net> wrote:> > > On 6/1/20 5:40 PM, Ana Marija wrote: > > Hi Jim, > > > > thank you so much for getting back to me. I tried your code and this is > > what I get: > >> dim(neu2) > > [1] 3740988 9 > >> dim(nep2) > > [1] 3740988 9 > >> dim(ret2) > > [1] 3740001 9 > > > > I think I would need to have the same number of lines in all 3 data frames. > > > > Can you please advise. > > > You should check for duplicated Marker values. > > > -- > > David > > > > > Cheers > > Ana > > > > On Mon, Jun 1, 2020 at 7:31 PM Jim Lemon <drjimlemon at gmail.com> wrote: > > > >> Hi Ana, > >> Not too hard, but your example has all the "marker" fields in common. > >> So using a sample that will show the expected result: > >> > >> neu1<-read.table(text="Chr BP Marker MAF A1 A2 Direction pValue N > >> 1 100000012 1:100000012:G:T 0.229925 T G + 0.650403 1594 > >> 1 100000827 1:100000827:C:T 0.287014 T C + 0.955449 1594 > >> 1 100002713 1:100002713:C:T 0.097867 T C - 0.290455 1594 > >> 1 100002882 1:100002882:T:G 0.287014 G T + 0.955449 1594 > >> 1 100002991 1:100002991:G:A 0.097867 A G - 0.290455 1594 > >> 1 100004726 1:100004726:G:A 0.132058 A G + 0.115005 1594", > >> header=TRUE,stringsAsFactors=FALSE) > >> > >> nep1<-read.table(text="Chr BP Marker MAF A1 A2 Direction pValue N > >> 1 100000012 1:100000012:G:T 0.2300430 T G - 0.1420030 1641 > >> 1 100000827 1:100000827:C:T 0.2867150 T C - 0.2045580 1641 > >> 1 100002713 1:100002713:C:T 0.0975015 T C - 0.0555507 1641 > >> 1 100002882 1:100002882:T:G 0.2867150 G T - 0.2045580 1641 > >> 1 100002991 1:100002991:G:A 0.0975015 A G - 0.0555507 1641 > >> 1 100004726 1:100004727:G:A 0.1325410 A G - 0.8725660 1641", > >> header=TRUE,stringsAsFactors=FALSE) > >> > >> ret1<-read.table(text="Chr BP Marker MAF A1 A2 Direction pValue N > >> 1 100000012 1:100000012:G:T 0.2322760 T G - 0.230383 1608 > >> 1 100000827 1:100000827:C:T 0.2882460 T C - 0.120356 1608 > >> 1 100002713 1:100002713:C:T 0.0982587 T C - 0.272936 1608 > >> 1 100002882 1:100002882:T:G 0.2882460 G T - 0.120356 1608 > >> 1 100002991 1:100002992:G:A 0.0982587 A G - 0.272936 1608 > >> 1 100004726 1:100004727:G:A 0.1340170 A G - 0.594538 1608", > >> header=TRUE,stringsAsFactors=FALSE) > >> > >> # merge the three data frames on "Marker" > >> nn1<-merge(neu1,nep1,by="Marker") > >> nn2<-merge(nn1,ret1,by="Marker") > >> # get the common "Marker" strings > >> Marker3<-nn2$Marker > >> # subset all three data frames on Marker3 > >> neu2<-neu1[neu1$Marker %in% Marker3,] > >> nep2<-nep1[nep1$Marker %in% Marker3,] > >> ret2<-ret1[ret1$Marker %in% Marker3,] > >> > >> Jim > >> > >> On Tue, Jun 2, 2020 at 7:50 AM Ana Marija <sokovic.anamarija at gmail.com> > >> wrote: > >>> Hello, > >>> > >>> I have 3 data frames which have about 3.4 mill lines (but they don't have > >>> exactly the same number of lines)...they look like this: > >>> ... > >>> Is there is a way to create another 3 data frames, say neu2, nep2, ret2 > >>> which would only contain lines that have the same entries in Marker > >> column > >>> for all 3 data frames? > >>> > >>> Thanks > >>> Ana > >>> > >>> [[alternative HTML version deleted]] > >>> > >>> ______________________________________________ > >>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see > >>> https://stat.ethz.ch/mailman/listinfo/r-help > >>> PLEASE do read the posting guide > >> http://www.R-project.org/posting-guide.html > >>> and provide commented, minimal, self-contained, reproducible code. > > [[alternative HTML version deleted]] > > > > ______________________________________________ > > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see > > https://stat.ethz.ch/mailman/listinfo/r-help > > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > > and provide commented, minimal, self-contained, reproducible code.