Ana Marija
2020-Mar-31 02:43 UTC
[R] How to match strings in two files and replace strings?
HI Jim,

thank you so much for getting back to me, I think the issue is with
reading that csv file

> marker_info<-read.csv("marker-info",header=F,stringsAsFactors=FALSE)
> head(marker_info)
  V1
1 #Column Description:
2 #Column is separated by '
3 #Chr: Chromosome on NCBI reference genome.
4 #Pos: chromosome position when snp has unique hit on reference genome. Otherwise this field is NULL.
5 #Submitter_snp_name: The string identifier of snp on the platform. This is the dbSNP local_snp_id.
6 #Ss#: dbSNP submitted snp Id. Each snp sequence on the platform gets a unique ss#.
  V2
1
2 '.
3
4
5
6

the file starts with 24 commented lines...

I did run your workflow and this is what I got:

> newout<-merge(output11.frq,marker_info[,c("V5","match_col")],by="match_col")
Error in `[.data.frame`(marker_info, , c("V5", "match_col")) :
  undefined columns selected

this is how marker-info looks like:

#Column Description:
#Column is separated by ','.
#Chr: Chromosome on NCBI reference genome.
#Pos: chromosome position when snp has unique hit on reference genome. Otherwise this field is NULL.
#Submitter_snp_name: The string identifier of snp on the platform. This is the dbSNP local_snp_id.
#Ss#: dbSNP submitted snp Id. Each snp sequence on the platform gets a unique ss#.
#Rs#: refSNP cluster accession. Rs# for the dbSNP refSNP cluster that the sequence for this ss# maps to.
#Genome_build_id: Genome build used to map the SNP (a string)
#ALLELE1_genome_orient: genome orientation allele1, same as which genotypes are reported.
#ALLELE2_genome_orient: genome orientation allele2, same as which genotypes are reported.
#ALLELE1_orig_assay_orient: original reported orientation for the SNP assay, will correspond to CEL files and the ss_id.
#ALLELE2_orig_assay_orient: original reported orientation for the SNP assay, will correspond to CEL files and the ss_id.
#QC_TYPE: A-autosomal and P-pseudo-autosomal; X: X-linked; Y-Y-linked;NA-disable QC for this snp.
#SNP_flank_sequence: snp sequence on the reference genome orientation. 40bp on each side of variation.
#SOURCE: Platform specific string identifying assay (e.g. HBA_CHIP)
#Ss2rs_orientation: ss to rs orientation. +: same; -: opposite strand.
#Rs2genome_orienation: Orientation of rs flanking sequence to reference genome. +: same orientation, -: opposite.
#Orien_flipped_assay_to_genome: y/n: this column would be the value of the exclusive OR from ss2rs_orientation XOR rs2genome_orientation.
#Probe_id: NCBI probe_id.
#neighbor_snp_list: List of neighbor snp and position within 40kb up/downstream.
#dbSNP_build_id: dbSNP build id.
#study_id: unique id with prefix: phs.
#
# Chr,Pos,Submitter_snp_name,Ss#,Rs#,Genome_build_id,ALLELE1_genome_orient,ALLELE2_genome_orient,ALLELE1_orig_assay_orient,ALLELE2_orig_assay_orient,QC_TYPE,SNP_flank_sequence,SOURCE,Ss2rs_orientation,Rs2genome_orienation,Orien_flipped_assay_to_genome,Probe_id,neighbor_snp_list,dbSNP_build_id,study_id
1,742429,SNP_A-1909444,ss66079302,rs3094315,36.2,G,A,C,T,A,GCACAGCAAGAGAAAC[A/G]TTTGACAGAGAATACA,Sty,+,-,y,,,127,phs000018
1,769185,SNP_A-4303947,ss66273559,rs4040617,36.2,A,G,A,G,A,GCTGTGAGAGAGAACA[A/G]TGTCCCAATTTTGCCC,Sty,+,+,n,,,127,phs000018
1,775852,SNP_A-1886933,ss66317030,rs2980300,36.2,T,C,A,G,A,GAATGACTGTGTCTCT[C/T]TGAGTTAGTGAAGTCA,Nsp,-,+,y,,,127,phs000018
...

Please advise,
Ana

On Mon, Mar 30, 2020 at 9:24 PM Jim Lemon <drjimlemon at gmail.com> wrote:
>
> Hi Ana,
> This seems to work. It shouldn't be too hard to do the renaming and
> reordering of columns.
>
> output11.frq<-read.table(text="CHR SNP A1 A2 MAF NCHROBS
> 1 1:775852:T:C T C 0.1707 3444
> 1 1:1120590:A:C C A 0.08753 3496
> 1 1:1145994:T:C C T 0.1765 3496
> 1 1:1148494:A:G A G 0.1059 3464
> 1 1:1201155:C:T T C 0.07923 3496",
>  header=TRUE,stringsAsFactors=FALSE)
>
> marker_info<-read.csv(text="1,742429,SNP_A-1909444,ss66079302,rs3094315,36.2,G,A,C,T,A,GCACAGCAAGAGAAAC[A/G]TTTGACAGAGAATACA,Sty,+,-,y,,,127,phs000018
> 1,769185,SNP_A-4303947,ss66273559,rs4040617,36.2,A,G,A,G,A,GCTGTGAGAGAGAACA[A/G]TGTCCCAATTTTGCCC,Sty,+,+,n,,,127,phs000018
> 1,775852,SNP_A-1886933,ss66317030,rs2980300,36.2,T,C,A,G,A,GAATGACTGTGTCTCT[C/T]TGAGTTAGTGAAGTCA,Nsp,-,+,y,,,127,phs000018
> 1,782343,SNP_A-2236359,ss66185183,rs2905036,36.2,C,T,C,T,A,CTCGATTTGTGTTCAA[C/T]ATATTTCATTTGTACC,Sty,-,-,n,,,127,phs000018
> 1,1201155,SNP_A-2205441,ss66174584,rs4245756,36.2,C,T,C,T,A,CCAGTGCTTTCAACCA[C/T]ACTCACTTTTCACTGT,Sty,+,+,n,,,127,phs000018",
>  header=FALSE,stringsAsFactors=FALSE)
> # create new columns for the merge
> output11.frq$match_col<-unlist(lapply(lapply(strsplit(output11.frq$SNP,":"),"[",
>  1:2), paste,collapse=":"))
> marker_info$match_col<-apply(t(marker_info[,1:2]),2,paste,collapse=":")
> # merge to get the result
> newout<-merge(output11.frq,marker_info[,c("V5","match_col")],by="match_col")
>
> Jim
>
> On Tue, Mar 31, 2020 at 11:09 AM Ana Marija <sokovic.anamarija at gmail.com> wrote:
> >
> > I have a file like this: (has 308545 lines)
> >
> > head output11.frq
> > CHR SNP A1 A2 MAF NCHROBS
> > 1 1:775852:T:C T C 0.1707 3444
> > 1 1:1120590:A:C C A 0.08753 3496
> > 1 1:1145994:T:C C T 0.1765 3496
> > 1 1:1148494:A:G A G 0.1059 3464
> > 1 1:1201155:C:T T C 0.07923 3496
> > ...
> >
> > And another file (marker-info) which has the first 24 commented lines
> > and is comma separated that looks like this (has total of 500593
> > lines):
> >
> > 1,742429,SNP_A-1909444,ss66079302,rs3094315,36.2,G,A,C,T,A,GCACAGCAAGAGAAAC[A/G]TTTGACAGAGAATACA,Sty,+,-,y,,,127,phs000018
> > 1,769185,SNP_A-4303947,ss66273559,rs4040617,36.2,A,G,A,G,A,GCTGTGAGAGAGAACA[A/G]TGTCCCAATTTTGCCC,Sty,+,+,n,,,127,phs000018
> > 1,775852,SNP_A-1886933,ss66317030,rs2980300,36.2,T,C,A,G,A,GAATGACTGTGTCTCT[C/T]TGAGTTAGTGAAGTCA,Nsp,-,+,y,,,127,phs000018
> > 1,782343,SNP_A-2236359,ss66185183,rs2905036,36.2,C,T,C,T,A,CTCGATTTGTGTTCAA[C/T]ATATTTCATTTGTACC,Sty,-,-,n,,,127,phs000018
> > 1,1201155,SNP_A-2205441,ss66174584,rs4245756,36.2,C,T,C,T,A,CCAGTGCTTTCAACCA[C/T]ACTCACTTTTCACTGT,Sty,+,+,n,,,127,phs000018
> > ...
> >
> > I want to replace in output11.frq second column with the 5th column in
> > marker-info that has the matching value in 1st and 2nd column so for
> > this example the result of the output11.frq would look like this:
> >
> > 1 rs2980300 T C 0.1707 3444
> > 1 rs4245756 T C 0.07923 3496
> >
> > I tried doing this in bash but I got empty file:
> >
> > vi tst.awk
> > NR==FNR { map[$1,$2]=$5; next }
> > ($1,$4) in map { $2=map[$1,$4]; print }
> > awk -f tst.awk FS=',' marker-info FS='\t' output11.frq > output11X.frq
> >
> > Can this be done in R?
> >
> > Thanks
> > Ana
> >
> > ______________________________________________
> > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
Rasmus Liland
2020-Mar-31 03:42 UTC
[R] How to match strings in two files and replace strings?
On 2020-03-30 21:43 -0500, Ana Marija wrote:
> I did run your workflow and this is what I got:
>
> > newout<-merge(output11.frq,marker_info[,c("V5","match_col")],by="match_col")
> Error in `[.data.frame`(marker_info, , c("V5", "match_col")) :
>   undefined columns selected
>
> this is how marker-info looks like:

Hi Ana,

perhaps adding comment.char="#" as an argument to read.csv might
help?

Making the output11.frq$match_col column might perhaps be easier
using gsub, have a look:

marker_info <- "#Column Description:
#Column is separated by ','.
#Chr: Chromosome on NCBI reference genome.
#Pos: chromosome position when snp has unique hit on reference genome. Otherwise this field is NULL.
#Submitter_snp_name: The string identifier of snp on the platform. This is the dbSNP local_snp_id.
#Ss#: dbSNP submitted snp Id. Each snp sequence on the platform gets a unique ss#.
#Rs#: refSNP cluster accession. Rs# for the dbSNP refSNP cluster that the sequence for this ss# maps to.
#Genome_build_id: Genome build used to map the SNP (a string)
#ALLELE1_genome_orient: genome orientation allele1, same as which genotypes are reported.
#ALLELE2_genome_orient: genome orientation allele2, same as which genotypes are reported.
#ALLELE1_orig_assay_orient: original reported orientation for the SNP assay, will correspond to CEL files and the ss_id.
#ALLELE2_orig_assay_orient: original reported orientation for the SNP assay, will correspond to CEL files and the ss_id.
#QC_TYPE: A-autosomal and P-pseudo-autosomal; X: X-linked; Y-Y-linked;NA-disable QC for this snp.
#SNP_flank_sequence: snp sequence on the reference genome orientation. 40bp on each side of variation.
#SOURCE: Platform specific string identifying assay (e.g. HBA_CHIP)
#Ss2rs_orientation: ss to rs orientation. +: same; -: opposite strand.
#Rs2genome_orienation: Orientation of rs flanking sequence to reference genome. +: same orientation, -: opposite.
#Orien_flipped_assay_to_genome: y/n: this column would be the value of the exclusive OR from ss2rs_orientation XOR rs2genome_orientation.
#Probe_id: NCBI probe_id.
#neighbor_snp_list: List of neighbor snp and position within 40kb up/downstream.
#dbSNP_build_id: dbSNP build id.
#study_id: unique id with prefix: phs.
#
# Chr,Pos,Submitter_snp_name,Ss#,Rs#,Genome_build_id,ALLELE1_genome_orient,ALLELE2_genome_orient,ALLELE1_orig_assay_orient,ALLELE2_orig_assay_orient,QC_TYPE,SNP_flank_sequence,SOURCE,Ss2rs_orientation,Rs2genome_orienation,Orien_flipped_assay_to_genome,Probe_id,neighbor_snp_list,dbSNP_build_id,study_id
1,742429,SNP_A-1909444,ss66079302,rs3094315,36.2,G,A,C,T,A,GCACAGCAAGAGAAAC[A/G]TTTGACAGAGAATACA,Sty,+,-,y,,,127,phs000018
1,769185,SNP_A-4303947,ss66273559,rs4040617,36.2,A,G,A,G,A,GCTGTGAGAGAGAACA[A/G]TGTCCCAATTTTGCCC,Sty,+,+,n,,,127,phs000018
1,775852,SNP_A-1886933,ss66317030,rs2980300,36.2,T,C,A,G,A,GAATGACTGTGTCTCT[C/T]TGAGTTAGTGAAGTCA,Nsp,-,+,y,,,127,phs000018
"
marker_info <-
  read.csv(text=marker_info,
           header=FALSE,
           stringsAsFactors=FALSE,
           comment.char="#")

output11.frq <-
"CHR SNP A1 A2 MAF NCHROBS
1 1:775852:T:C T C 0.1707 3444
1 1:1120590:A:C C A 0.08753 3496
1 1:1145994:T:C C T 0.1765 3496
1 1:1148494:A:G A G 0.1059 3464
1 1:1201155:C:T T C 0.07923 3496"
output11.frq <-
  read.table(text=output11.frq, header=TRUE,
             stringsAsFactors=FALSE)

output11.frq$match_col <-
  gsub("^([0-9]+):([0-9]+).*", "\\1:\\2",
       output11.frq$SNP)

marker_info$match_col <-
  apply(marker_info[,1:2], 1, paste,
        collapse=":")

merge(x=output11.frq,
      y=marker_info[,c("V5", "match_col")],
      by="match_col")

Regards,
Rasmus
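The merge above returns match_col and V5 as extra columns, so the SNP column still has to be overwritten to get the output Ana asked for. A minimal sketch of that last step follows, using match() instead of merge(). The object names frq, markers, frq_key, marker_key, idx and result are made up for this illustration, and the marker rows are cut down to their first five fields to keep the example short.

```r
# Small inline stand-ins for the two files (rows copied from the thread;
# marker rows are shortened to the first five fields for illustration).
frq <- read.table(text = "CHR SNP A1 A2 MAF NCHROBS
1 1:775852:T:C T C 0.1707 3444
1 1:1120590:A:C C A 0.08753 3496
1 1:1201155:C:T T C 0.07923 3496",
                  header = TRUE, stringsAsFactors = FALSE)

markers <- read.csv(text = "1,775852,SNP_A-1886933,ss66317030,rs2980300
1,782343,SNP_A-2236359,ss66185183,rs2905036
1,1201155,SNP_A-2205441,ss66174584,rs4245756",
                    header = FALSE, stringsAsFactors = FALSE)

# Build "chr:pos" keys on both sides.
frq_key    <- sub("^([^:]+):([^:]+).*", "\\1:\\2", frq$SNP)
marker_key <- paste(markers$V1, markers$V2, sep = ":")

# Position of each frq row in the marker table; NA means no match.
idx <- match(frq_key, marker_key)

# Keep only the matched rows and overwrite SNP with the rs id (column V5).
result <- frq[!is.na(idx), ]
result$SNP <- markers$V5[idx[!is.na(idx)]]

print(result)
```

On these sample rows the result holds the two rows listed in the original question (rs2980300 and rs4245756). Unlike merge(), match() keeps the row order of output11.frq and uses only the first marker row for a given key. Calling paste() on the two columns directly may also be safer than apply() on a full file: apply() converts the data frame to a character matrix first, and when one column is character and the other numeric that conversion can pad the numbers with spaces.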
Jim Lemon
2020-Mar-31 06:07 UTC
[R] How to match strings in two files and replace strings?
Ah, my mistake. Should be:
marker_info<-read.csv("marker-info",header=FALSE,stringsAsFactors=FALSE,skip=24)
Jim
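A minimal sketch comparing skip= with comment.char="#" on a shortened stand-in for marker-info. It assumes, as the quoted excerpt suggests, that the column header line is itself commented ("# Chr,Pos,...") and is the last of the 24 comment lines. The sample below has only 4 comment lines, and the names marker_text, n_comment, by_skip, by_comment and named are made up for this illustration.

```r
# Shortened stand-in for marker-info: comment lines, a commented header
# line, then data rows cut down to five fields.
marker_text <- "#Column Description:
#Chr: Chromosome on NCBI reference genome.
#
# Chr,Pos,Submitter_snp_name,Ss#,Rs#
1,742429,SNP_A-1909444,ss66079302,rs3094315
1,775852,SNP_A-1886933,ss66317030,rs2980300"

n_comment <- 4  # would be 24 for the file described in the thread

# Option 1: skip a known number of leading lines.
by_skip <- read.csv(text = marker_text, header = FALSE,
                    skip = n_comment, stringsAsFactors = FALSE)

# Option 2: drop everything after a '#', wherever it occurs.
by_comment <- read.csv(text = marker_text, header = FALSE,
                       comment.char = "#", stringsAsFactors = FALSE)

# Both approaches should give the same data frame on this sample.
same <- isTRUE(all.equal(by_skip, by_comment))

# Optional: recover column names from the commented header line.
header_line <- strsplit(marker_text, "\n")[[1]][n_comment]
col_names   <- strsplit(sub("^# *", "", header_line), ",")[[1]]
named <- by_skip
names(named) <- make.names(col_names)  # "Ss#" becomes "Ss.", "Rs#" becomes "Rs."

print(same)
print(names(named))
```

skip= depends on the exact number of leading lines, while comment.char="#" does not, but it would also truncate any data field that happens to contain a "#". Neither of the two rows shown here does.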
Jim Lemon
2020-Mar-31 22:45 UTC
[R] How to match strings in two files and replace strings?
Nice improvement.

Jim