Ana Marija
2020-Jun-02 01:50 UTC
[R] is there is a way to extract lines in between 3 files that are in common based on one column?
Hi David,

that is a great point!
Yes indeed some are non unique:

> dim(neu1)
[1] 3742845 9
> length(unique(neu1$Marker))
[1] 3741858
> length(unique(nep1$Marker))
[1] 3745560
> dim(nep1)
[1] 3746550 9
> length(unique(ret1$Marker))
[1] 3743494
> dim(ret1)
[1] 3743494 9

How would I rewrite this code so that it merges by the Chr and Marker
columns? It seems that a Marker can be under a few Chr.

On Mon, Jun 1, 2020 at 8:41 PM David Winsemius <dwinsemius at comcast.net> wrote:
>
> On 6/1/20 5:40 PM, Ana Marija wrote:
> > Hi Jim,
> >
> > thank you so much for getting back to me. I tried your code and this is
> > what I get:
> >> dim(neu2)
> > [1] 3740988 9
> >> dim(nep2)
> > [1] 3740988 9
> >> dim(ret2)
> > [1] 3740001 9
> >
> > I think I would need to have the same number of lines in all 3 data frames.
> >
> > Can you please advise.
>
> You should check for duplicated Marker values.
>
> --
> David
>
> > Cheers
> > Ana
> >
> > On Mon, Jun 1, 2020 at 7:31 PM Jim Lemon <drjimlemon at gmail.com> wrote:
> >
> >> Hi Ana,
> >> Not too hard, but your example has all the "marker" fields in common.
> >> So using a sample that will show the expected result:
> >>
> >> neu1<-read.table(text="Chr BP Marker MAF A1 A2 Direction pValue N
> >> 1 100000012 1:100000012:G:T 0.229925 T G + 0.650403 1594
> >> 1 100000827 1:100000827:C:T 0.287014 T C + 0.955449 1594
> >> 1 100002713 1:100002713:C:T 0.097867 T C - 0.290455 1594
> >> 1 100002882 1:100002882:T:G 0.287014 G T + 0.955449 1594
> >> 1 100002991 1:100002991:G:A 0.097867 A G - 0.290455 1594
> >> 1 100004726 1:100004726:G:A 0.132058 A G + 0.115005 1594",
> >> header=TRUE,stringsAsFactors=FALSE)
> >>
> >> nep1<-read.table(text="Chr BP Marker MAF A1 A2 Direction pValue N
> >> 1 100000012 1:100000012:G:T 0.2300430 T G - 0.1420030 1641
> >> 1 100000827 1:100000827:C:T 0.2867150 T C - 0.2045580 1641
> >> 1 100002713 1:100002713:C:T 0.0975015 T C - 0.0555507 1641
> >> 1 100002882 1:100002882:T:G 0.2867150 G T - 0.2045580 1641
> >> 1 100002991 1:100002991:G:A 0.0975015 A G - 0.0555507 1641
> >> 1 100004726 1:100004727:G:A 0.1325410 A G - 0.8725660 1641",
> >> header=TRUE,stringsAsFactors=FALSE)
> >>
> >> ret1<-read.table(text="Chr BP Marker MAF A1 A2 Direction pValue N
> >> 1 100000012 1:100000012:G:T 0.2322760 T G - 0.230383 1608
> >> 1 100000827 1:100000827:C:T 0.2882460 T C - 0.120356 1608
> >> 1 100002713 1:100002713:C:T 0.0982587 T C - 0.272936 1608
> >> 1 100002882 1:100002882:T:G 0.2882460 G T - 0.120356 1608
> >> 1 100002991 1:100002992:G:A 0.0982587 A G - 0.272936 1608
> >> 1 100004726 1:100004727:G:A 0.1340170 A G - 0.594538 1608",
> >> header=TRUE,stringsAsFactors=FALSE)
> >>
> >> # merge the three data frames on "Marker"
> >> nn1<-merge(neu1,nep1,by="Marker")
> >> nn2<-merge(nn1,ret1,by="Marker")
> >> # get the common "Marker" strings
> >> Marker3<-nn2$Marker
> >> # subset all three data frames on Marker3
> >> neu2<-neu1[neu1$Marker %in% Marker3,]
> >> nep2<-nep1[nep1$Marker %in% Marker3,]
> >> ret2<-ret1[ret1$Marker %in% Marker3,]
> >>
> >> Jim
> >>
> >> On Tue, Jun 2, 2020 at 7:50 AM Ana Marija <sokovic.anamarija at gmail.com>
> >> wrote:
> >>> Hello,
> >>>
> >>> I have 3 data frames which have about 3.4 mill lines (but they don't have
> >>> exactly the same number of lines)...they look like this:
> >>> ...
> >>> Is there a way to create another 3 data frames, say neu2, nep2, ret2
> >>> which would only contain lines that have the same entries in Marker
> >>> column for all 3 data frames?
> >>>
> >>> Thanks
> >>> Ana
> >>>
> >>> [[alternative HTML version deleted]]
> >>>
> >>> ______________________________________________
> >>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> >>> https://stat.ethz.ch/mailman/listinfo/r-help
> >>> PLEASE do read the posting guide
> >>> http://www.R-project.org/posting-guide.html
> >>> and provide commented, minimal, self-contained, reproducible code.
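The duplicate check suggested above can be sketched roughly as follows. This is only an illustration on a made-up data frame (the name `toy` and the marker labels `m1`, `m2` are invented, not taken from the real files); it counts repeated Marker values, lists Markers that appear under more than one Chr, and counts rows that repeat a whole (Chr, Marker) pair.

```r
# Toy data frame; the Chr and Marker values are invented for illustration.
toy <- data.frame(
  Chr    = c(1, 1, 2, 1),
  Marker = c("m1", "m2", "m2", "m1"),
  stringsAsFactors = FALSE
)

# Rows whose Marker has already appeared earlier in the data frame.
n_dup_marker <- sum(duplicated(toy$Marker))

# Distinct (Chr, Marker) pairs, then the Markers seen under more than one Chr.
pairs <- unique(toy[, c("Chr", "Marker")])
multi_chr <- unique(pairs$Marker[duplicated(pairs$Marker)])

# Rows that repeat an entire (Chr, Marker) pair.
n_dup_pair <- sum(duplicated(toy[, c("Chr", "Marker")]))

print(n_dup_marker)
print(multi_chr)
print(n_dup_pair)
```

Running the same three checks on neu1, nep1 and ret1 should show whether the extra rows come from a Marker shared across chromosomes or from fully repeated rows, which in turn decides whether Chr needs to be part of the matching key.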
Ana Marija
2020-Jun-02 01:54 UTC
[R] is there is a way to extract lines in between 3 files that are in common based on one column?
Hi Jim

> neu3<-neu1[!(neu1$Marker %in% Marker3),]
> dim(neu3)
[1] 1857 9
> nep3<-nep1[!(nep1$Marker %in% Marker3),]
> dim(nep3)
[1] 5562 9
> ret3<-ret1[!(ret1$Marker %in% Marker3),]
> dim(ret3)
[1] 3493 9

If I do:

nn1<-merge(neu1,nep1,by=c("Marker","Chr"))
nn2<-merge(nn1,ret1,by=c("Marker","Chr"))
> Marker3<-nn2$Marker
> length(Marker3)
[1] 3742962
> Marker4<-nn1$Marker
> length(Marker4)
[1] 3744443
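One reason the merged row counts can exceed the subset counts is that merge() returns one row for every matching pair, so a key that is duplicated in either input is multiplied in the result. A possible way around this, sketched here on invented toy data, is to skip merge() altogether: build a combined Chr/Marker key, intersect the keys across the three data frames, and drop repeated keys before subsetting. The helper names `key_of` and `keep_common`, the `_toy` data frames and the "|" separator are assumptions made for this example, and keeping only the first row of each duplicated key is an arbitrary choice that may not suit the real data.

```r
# Toy stand-ins for neu1, nep1 and ret1; all values are invented.
neu_toy <- data.frame(Chr = c(1, 1, 2, 3), Marker = c("a", "b", "b", "c"),
                      x = 1:4, stringsAsFactors = FALSE)
nep_toy <- data.frame(Chr = c(1, 2, 2, 3), Marker = c("a", "b", "b", "d"),
                      x = 5:8, stringsAsFactors = FALSE)
ret_toy <- data.frame(Chr = c(1, 2, 1), Marker = c("a", "b", "b"),
                      x = 9:11, stringsAsFactors = FALSE)

# Combined key; assumes "|" never occurs inside Chr or Marker.
key_of <- function(d) paste(d$Chr, d$Marker, sep = "|")

# Keys present in all three data frames (intersect() returns unique values).
common <- Reduce(intersect, list(key_of(neu_toy), key_of(nep_toy), key_of(ret_toy)))

# Keep the first row for each key, then only the rows whose key is common.
keep_common <- function(d, common) {
  d <- d[!duplicated(key_of(d)), ]
  d[key_of(d) %in% common, ]
}

neu2_toy <- keep_common(neu_toy, common)
nep2_toy <- keep_common(nep_toy, common)
ret2_toy <- keep_common(ret_toy, common)

print(common)
print(c(nrow(neu2_toy), nrow(nep2_toy), nrow(ret2_toy)))
```

Because every key survives at most once per data frame, the three results end up with the same number of rows, one per common key.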
Jim Lemon
2020-Jun-02 03:04 UTC
[R] is there is a way to extract lines in between 3 files that are in common based on one column?
So recombination sticks out its foot before us. Do you want to account
for gene linkage?

Jim