thr3ads.net - R help - [R] How to match strings in two files and replace strings? [Mar 2020]


Ana Marija

2020-Mar-31 02:43 UTC

[R] How to match strings in two files and replace strings?

Hi Jim,

Thank you so much for getting back to me. I think the issue is with
reading that csv file:
> marker_info<-read.csv("marker-info",header=F,stringsAsFactors=FALSE)
> head(marker_info)
  V1
1 #Column Description:
2 #Column is separated by '
3 #Chr:   Chromosome on NCBI reference genome.
4 #Pos:   chromosome position when snp has unique hit on reference genome. Otherwise this field is NULL.
5 #Submitter_snp_name:    The string identifier of snp on the platform.  This is the dbSNP local_snp_id.
6 #Ss#:   dbSNP submitted snp Id. Each snp sequence on the platform gets a unique ss#.
  V2
1
2 '.
3
4
5
6

the file starts with 24 commented lines...

I did run your workflow and this is what I got:
> newout<-merge(output11.frq,marker_info[,c("V5","match_col")],by="match_col")
Error in `[.data.frame`(marker_info, , c("V5", "match_col")) :
  undefined columns selected

This is what marker-info looks like:

#Column Description:
#Column is separated by ','.
#Chr:   Chromosome on NCBI reference genome.
#Pos:   chromosome position when snp has unique hit on reference genome. Otherwise this field is NULL.
#Submitter_snp_name:    The string identifier of snp on the platform. This is the dbSNP local_snp_id.
#Ss#:   dbSNP submitted snp Id. Each snp sequence on the platform gets a unique ss#.
#Rs#:   refSNP cluster accession. Rs# for the dbSNP refSNP cluster that the sequence for this ss# maps to.
#Genome_build_id:       Genome build used to map the SNP (a string)
#ALLELE1_genome_orient: genome orientation allele1, same as which genotypes are reported.
#ALLELE2_genome_orient: genome orientation allele2, same as which genotypes are reported.
#ALLELE1_orig_assay_orient:     original reported orientation for the SNP assay, will correspond to CEL files and the ss_id.
#ALLELE2_orig_assay_orient:     original reported orientation for the SNP assay, will correspond to CEL files and the ss_id.
#QC_TYPE:       A-autosomal and P-pseudo-autosomal; X: X-linked; Y-Y-linked;NA-disable QC for this snp.
#SNP_flank_sequence:    snp sequence on the reference genome orientation. 40bp on each side of variation.
#SOURCE:         Platform specific string identifying assay (e.g. HBA_CHIP)
#Ss2rs_orientation:     ss to rs orientation. +: same; -: opposite strand.
#Rs2genome_orienation:  Orientation of rs flanking sequence to reference genome. +: same orientation, -: opposite.
#Orien_flipped_assay_to_genome: y/n: this column would be the value of the exclusive OR from ss2rs_orientation  XOR rs2genome_orientation.
#Probe_id:       NCBI probe_id.
#neighbor_snp_list:     List of neighbor snp and position within 40kb up/downstream.
#dbSNP_build_id:        dbSNP build id.
#study_id:      unique id with prefix: phs.
#
#
Chr,Pos,Submitter_snp_name,Ss#,Rs#,Genome_build_id,ALLELE1_genome_orient,ALLELE2_genome_orient,ALLELE1_orig_assay_orient,ALLELE2_orig_assay_orient,QC_TYPE,SNP_flank_sequence,SOURCE,Ss2rs_orientation,Rs2genome_orienation,Orien_flipped_assay_to_genome,Probe_id,neighbor_snp_list,dbSNP_build_id,study_id
1,742429,SNP_A-1909444,ss66079302,rs3094315,36.2,G,A,C,T,A,GCACAGCAAGAGAAAC[A/G]TTTGACAGAGAATACA,Sty,+,-,y,,,127,phs000018
1,769185,SNP_A-4303947,ss66273559,rs4040617,36.2,A,G,A,G,A,GCTGTGAGAGAGAACA[A/G]TGTCCCAATTTTGCCC,Sty,+,+,n,,,127,phs000018
1,775852,SNP_A-1886933,ss66317030,rs2980300,36.2,T,C,A,G,A,GAATGACTGTGTCTCT[C/T]TGAGTTAGTGAAGTCA,Nsp,-,+,y,,,127,phs000018
...

Please advise,
Ana

On Mon, Mar 30, 2020 at 9:24 PM Jim Lemon <drjimlemon at gmail.com> wrote:
>
> Hi Ana,
> This seems to work. It shouldn't be too hard to do the renaming and
> reordering of columns.
>
> output11.frq<-read.table(text="CHR  SNP A1 A2  MAF  NCHROBS
> 1      1:775852:T:C    T    C       0.1707     3444
> 1     1:1120590:A:C    C    A      0.08753     3496
> 1     1:1145994:T:C    C    T       0.1765     3496
> 1     1:1148494:A:G    A    G       0.1059     3464
> 1     1:1201155:C:T    T    C      0.07923     3496",
> header=TRUE,stringsAsFactors=FALSE)
>
>
> marker_info<-read.csv(text="1,742429,SNP_A-1909444,ss66079302,rs3094315,36.2,G,A,C,T,A,GCACAGCAAGAGAAAC[A/G]TTTGACAGAGAATACA,Sty,+,-,y,,,127,phs000018
> 1,769185,SNP_A-4303947,ss66273559,rs4040617,36.2,A,G,A,G,A,GCTGTGAGAGAGAACA[A/G]TGTCCCAATTTTGCCC,Sty,+,+,n,,,127,phs000018
> 1,775852,SNP_A-1886933,ss66317030,rs2980300,36.2,T,C,A,G,A,GAATGACTGTGTCTCT[C/T]TGAGTTAGTGAAGTCA,Nsp,-,+,y,,,127,phs000018
> 1,782343,SNP_A-2236359,ss66185183,rs2905036,36.2,C,T,C,T,A,CTCGATTTGTGTTCAA[C/T]ATATTTCATTTGTACC,Sty,-,-,n,,,127,phs000018
> 1,1201155,SNP_A-2205441,ss66174584,rs4245756,36.2,C,T,C,T,A,CCAGTGCTTTCAACCA[C/T]ACTCACTTTTCACTGT,Sty,+,+,n,,,127,phs000018",
> header=FALSE,stringsAsFactors=FALSE)
> # create new columns for the merge
> output11.frq$match_col<-unlist(lapply(lapply(strsplit(output11.frq$SNP,":"),"[",
>  1:2), paste,collapse=":"))
> marker_info$match_col<-apply(t(marker_info[,1:2]),2,paste,collapse=":")
> # merge to get the result
> newout<-merge(output11.frq,marker_info[,c("V5","match_col")],by="match_col")
>
> Jim
>
> On Tue, Mar 31, 2020 at 11:09 AM Ana Marija <sokovic.anamarija at gmail.com> wrote:
> >
> > I have a file like this: (has 308545 lines)
> >
> >     head output11.frq
> >      CHR               SNP   A1   A2          MAF  NCHROBS
> >        1      1:775852:T:C    T    C       0.1707     3444
> >        1     1:1120590:A:C    C    A      0.08753     3496
> >        1     1:1145994:T:C    C    T       0.1765     3496
> >        1     1:1148494:A:G    A    G       0.1059     3464
> >        1     1:1201155:C:T    T    C      0.07923     3496
> >     ...
> >
> > And another file (marker-info) which has the first 24 commented lines
> > and is comma separated that looks like this (has total of 500593
> > lines):
> >
> >     1,742429,SNP_A-1909444,ss66079302,rs3094315,36.2,G,A,C,T,A,GCACAGCAAGAGAAAC[A/G]TTTGACAGAGAATACA,Sty,+,-,y,,,127,phs000018
> >     1,769185,SNP_A-4303947,ss66273559,rs4040617,36.2,A,G,A,G,A,GCTGTGAGAGAGAACA[A/G]TGTCCCAATTTTGCCC,Sty,+,+,n,,,127,phs000018
> >     1,775852,SNP_A-1886933,ss66317030,rs2980300,36.2,T,C,A,G,A,GAATGACTGTGTCTCT[C/T]TGAGTTAGTGAAGTCA,Nsp,-,+,y,,,127,phs000018
> >     1,782343,SNP_A-2236359,ss66185183,rs2905036,36.2,C,T,C,T,A,CTCGATTTGTGTTCAA[C/T]ATATTTCATTTGTACC,Sty,-,-,n,,,127,phs000018
> >     1,1201155,SNP_A-2205441,ss66174584,rs4245756,36.2,C,T,C,T,A,CCAGTGCTTTCAACCA[C/T]ACTCACTTTTCACTGT,Sty,+,+,n,,,127,phs000018
> >     ...
> >
> > I want to replace the second column of output11.frq with the 5th
> > column of marker-info wherever the 1st and 2nd columns of marker-info
> > match, so for this example the resulting output11.frq would look like this:
> >
> >     1      rs2980300    T    C       0.1707     3444
> >     1      rs4245756    T    C      0.07923     3496
> >
> > I tried doing this in bash with awk, but I got an empty file:
> >
> >     vi tst.awk
> >     NR==FNR { map[$1,$2]=$5; next }
> >     ($1,$4) in map { $2=map[$1,$4]; print }
> >     awk -f tst.awk FS=',' marker-info FS='\t' output11.frq  > output11X.frq
> >
> > Can this be done in R?
> >
> > Thanks
> > Ana
> >

Rasmus Liland

2020-Mar-31 03:42 UTC


On 2020-03-30 21:43 -0500, Ana Marija wrote:
> I did run your workflow and this is what I got:
>
> > newout<-merge(output11.frq,marker_info[,c("V5","match_col")],by="match_col")
> Error in `[.data.frame`(marker_info, , c("V5", "match_col")) :
>   undefined columns selected
>
> this is how marker-info looks like:

Hi Ana,

Perhaps adding comment.char="#" as an argument to read.csv would
help?

Building the output11.frq$match_col column might also be easier
with gsub; have a look:

marker_info <- "#Column Description:
#Column is separated by ','.
#Chr:   Chromosome on NCBI reference genome.
#Pos:   chromosome position when snp has unique hit on reference genome. Otherwise this field is NULL.
#Submitter_snp_name:    The string identifier of snp on the platform.  This is the dbSNP local_snp_id.
#Ss#:   dbSNP submitted snp Id. Each snp sequence on the platform gets a unique ss#.
#Rs#:   refSNP cluster accession. Rs# for the dbSNP refSNP cluster that the sequence for this ss# maps to.
#Genome_build_id:       Genome build used to map the SNP (a string)
#ALLELE1_genome_orient: genome orientation allele1, same as which genotypes are reported.
#ALLELE2_genome_orient: genome orientation allele2, same as which genotypes are reported.
#ALLELE1_orig_assay_orient:     original reported orientation for the SNP assay, will correspond to CEL files and the ss_id.
#ALLELE2_orig_assay_orient:     original reported orientation for the SNP assay, will correspond to CEL files and the ss_id.
#QC_TYPE:       A-autosomal and P-pseudo-autosomal; X: X-linked; Y-Y-linked;NA-disable QC for this snp.
#SNP_flank_sequence:    snp sequence on the reference genome orientation. 40bp on each side of variation.
#SOURCE:         Platform specific string identifying assay (e.g. HBA_CHIP)
#Ss2rs_orientation:     ss to rs orientation. +: same; -: opposite strand.
#Rs2genome_orienation:  Orientation of rs flanking sequence to reference genome. +: same orientation, -: opposite.
#Orien_flipped_assay_to_genome: y/n: this column would be the value of the exclusive OR from ss2rs_orientation  XOR rs2genome_orientation.
#Probe_id:       NCBI probe_id.
#neighbor_snp_list:     List of neighbor snp and position within 40kb up/downstream.
#dbSNP_build_id:        dbSNP build id.
#study_id:      unique id with prefix: phs.
#
#
Chr,Pos,Submitter_snp_name,Ss#,Rs#,Genome_build_id,ALLELE1_genome_orient,ALLELE2_genome_orient,ALLELE1_orig_assay_orient,ALLELE2_orig_assay_orient,QC_TYPE,SNP_flank_sequence,SOURCE,Ss2rs_orientation,Rs2genome_orienation,Orien_flipped_assay_to_genome,Probe_id,neighbor_snp_list,dbSNP_build_id,study_id
1,742429,SNP_A-1909444,ss66079302,rs3094315,36.2,G,A,C,T,A,GCACAGCAAGAGAAAC[A/G]TTTGACAGAGAATACA,Sty,+,-,y,,,127,phs000018
1,769185,SNP_A-4303947,ss66273559,rs4040617,36.2,A,G,A,G,A,GCTGTGAGAGAGAACA[A/G]TGTCCCAATTTTGCCC,Sty,+,+,n,,,127,phs000018
1,775852,SNP_A-1886933,ss66317030,rs2980300,36.2,T,C,A,G,A,GAATGACTGTGTCTCT[C/T]TGAGTTAGTGAAGTCA,Nsp,-,+,y,,,127,phs000018
"
marker_info <-
  read.csv(text=marker_info,
    header=FALSE,
    stringsAsFactors=FALSE,
    comment.char="#")

output11.frq <-
"CHR  SNP A1 A2  MAF  NCHROBS
1      1:775852:T:C    T    C       0.1707     3444
1     1:1120590:A:C    C    A      0.08753     3496
1     1:1145994:T:C    C    T       0.1765     3496
1     1:1148494:A:G    A    G       0.1059     3464
1     1:1201155:C:T    T    C      0.07923     3496"
output11.frq <- 
  read.table(text=output11.frq, header=TRUE,
    stringsAsFactors=FALSE)

output11.frq$match_col <-
  gsub("^([0-9]+):([0-9]+).*", "\\1:\\2",
       output11.frq$SNP)

marker_info$match_col <-
  apply(marker_info[,1:2], 1, paste,
        collapse=":")

merge(x=output11.frq,
      y=marker_info[,c("V5", "match_col")],
      by="match_col")


Regards,
Rasmus
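
To finish the task from the original question (rs IDs in the SNP column, original column order kept), one option might be to look the keys up with match() rather than merge(). The sketch below is not from the thread: the names frq, marker, frq_key, marker_key and idx are made up for illustration, and the marker rows are cut down to their first five fields to keep it short.

```r
# Toy versions of the two inputs (assumption: marker rows truncated to 5 fields).
frq <- read.table(text = "CHR SNP A1 A2 MAF NCHROBS
1 1:775852:T:C T C 0.1707 3444
1 1:1120590:A:C C A 0.08753 3496
1 1:1201155:C:T T C 0.07923 3496",
  header = TRUE, stringsAsFactors = FALSE)

marker <- read.csv(text = "1,742429,SNP_A-1909444,ss66079302,rs3094315
1,775852,SNP_A-1886933,ss66317030,rs2980300
1,1201155,SNP_A-2205441,ss66174584,rs4245756",
  header = FALSE, stringsAsFactors = FALSE)

# Build "chr:pos" keys on both sides.
frq_key    <- sub("^([^:]+):([^:]+).*", "\\1:\\2", frq$SNP)
marker_key <- paste(marker$V1, marker$V2, sep = ":")

# For each frq row, the index of the first marker row with the same key (NA if none).
idx <- match(frq_key, marker_key)

# Keep only matched rows and overwrite SNP in place, so the column order is unchanged.
result <- frq[!is.na(idx), ]
result$SNP <- marker$V5[idx[!is.na(idx)]]
print(result)
```

On this toy data only the rows for positions 775852 and 1201155 survive, with SNP set to rs2980300 and rs4245756. Using paste() on the two columns directly, instead of apply() over the data frame, avoids going through as.matrix(), which can pad formatted numbers with spaces when the columns have mixed types.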

Jim Lemon

2020-Mar-31 06:07 UTC


Ah, my mistake. Should be:

marker_info<-read.csv("marker-info",header=FALSE,stringsAsFactors=FALSE,skip=24)

Jim
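
If the number of comment lines is not known in advance, it could be counted rather than hard-coded. The sketch below is an illustration under assumptions, not something posted in the thread: it writes a small stand-in for marker-info to a temporary file, and the names path, first_lines and n_skip are hypothetical. It also reads the header line (header=TRUE), so the rs column ends up named "Rs#" instead of V5.

```r
# A small stand-in for marker-info: '#' comment lines, then a header
# whose column names themselves contain '#'.
path <- tempfile()
writeLines(c(
  "#Column Description:",
  "#Column is separated by ','.",
  "#",
  "Chr,Pos,Submitter_snp_name,Ss#,Rs#",
  "1,742429,SNP_A-1909444,ss66079302,rs3094315",
  "1,775852,SNP_A-1886933,ss66317030,rs2980300"
), path)

# Count the leading lines that start with '#'.
first_lines <- readLines(path, n = 100)
n_skip <- which(!startsWith(first_lines, "#"))[1] - 1

# read.csv() defaults to comment.char = "", so the '#' in "Ss#" and "Rs#"
# is treated as ordinary text; check.names = FALSE keeps the names as written.
marker_info <- read.csv(path, skip = n_skip, header = TRUE,
                        stringsAsFactors = FALSE, check.names = FALSE)
print(names(marker_info))
```

Skipping by line count sidesteps one side effect of comment.char="#": with that setting the header line is cut off at the first '#', that is, right after "Ss".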


Jim Lemon

2020-Mar-31 22:45 UTC


Nice improvement.

Jim

On Wed, Apr 1, 2020 at 3:18 AM Rasmus Liland <jensrli at student.ikos.uio.no> wrote:
>
> On 2020-03-30 21:43 -0500, Ana Marija wrote:
> > I did run your workflow and this is what I got:
> >
> > > newout<-merge(output11.frq,marker_info[,c("V5","match_col")],by="match_col")
> > Error in `[.data.frame`(marker_info, , c("V5", "match_col")) :
> >   undefined columns selected
> >
> > this is how marker-info looks like:
>
> Hi Ana,
>
> perhaps adding comment.char="#" as an argument to read.csv might
> help?
>
> Making the output11.frq$match_col column might perhaps be easier
> using gsub, have a look:
>
