thr3ads.net - R help - [R] duplicate rows with rbind in a loop [Oct 2015]

If this information is useful, please help other people find it:
Share via:

Cara Fiore

2015-Oct-08 03:19 UTC

[R] duplicate rows with rbind in a loop

Dear R users,

I wrote a simple script to change the header lines in a fasta file that
contains DNA sequences in a format:
>header1
sequence1>header2sequence2

I am basically trying to replace the "header" in this file with a line
from
another file (taxonomy file). In order to do that I have to find the
matching header in the taxonomy file.

The output should be in fasta format and it is, but the rows repeat so the
output file is huge and it looks like:
>header1
sequence1>header1
sequence1>header2sequence2

The code I have is:

tax=read.table("taxonomy_file.txt", header=F, quote="",
sep="\t")
tax2=data.frame(tax)

library("Biostrings")
seqs=readDNAStringSet("File.fasta")
header=names(seqs)
seqs2=paste(seqs)

new.final=NULL
i=1

#Go through tax file and match the header in tax file to header in seqs file
for(i in 1:length(tax[,1])){
  sampleID=NULL
  match=NULL
  sampleID=as.character(tax2[i,1])  #sample ID in taxonomy header
  match=which(sampleID==header) #index for match in header file
  if(match>0){
    newH1=NULL
    newH2=NULL
    seqline=NULL
    new.header=NULL
    newH1=as.character(tax2[i,1])
    newH2=as.character(tax2[i,2])
    seqline=seqs2[match]
    new.header=paste(">",newH1,"|",newH2,
sep="")
    new.final=rbind(new.final, new.header, seqline)
  }
  print(paste("percent complete =", round((i/length(tax2[,1]))*100,3),
"%",sep=" "))
  write.table(new.final, file="Test_output.txt", quote=FALSE,
sep="\n",
col.names=FALSE, row.names=FALSE, append=TRUE)
  i=i+1
}


Something about rbind is repeating all of the rows every time it writes to
the output file. I have not been able to find anything about this online or
in the r help for rbind, although perhaps I am missing something obvious
about this.

I greatly appreciate any help with this!

	[[alternative HTML version deleted]]

Morgan, Martin

2015-Oct-08 09:46 UTC

head link

[R] duplicate rows with rbind in a loop

> -----Original Message-----
> From: R-help [mailto:r-help-bounces at r-project.org] On Behalf Of Cara
Fiore
> Sent: Wednesday, October 07, 2015 11:20 PM
> To: r-help at r-project.org
> Subject: [R] duplicate rows with rbind in a loop
> 
> Dear R users,
> 
> I wrote a simple script to change the header lines in a fasta file that
contains
> DNA sequences in a format:
> 
> >header1
> sequence1
> >header2
> sequence2
> 
> I am basically trying to replace the "header" in this file with a
line from
> another file (taxonomy file). In order to do that I have to find the
matching
> header in the taxonomy file.
> 
> The output should be in fasta format and it is, but the rows repeat so the
> output file is huge and it looks like:
> 
> >header1
> sequence1
> >header1
> sequence1
> >header2
> sequence2
> 
> The code I have is:
> 
> tax=read.table("taxonomy_file.txt", header=F, quote="",
sep="\t")
> tax2=data.frame(tax)
> 
> library("Biostrings")
> seqs=readDNAStringSet("File.fasta")
> header=names(seqs)
> seqs2=paste(seqs)
> 
> new.final=NULL
> i=1
> 
> #Go through tax file and match the header in tax file to header in seqs
file
> for(i in 1:length(tax[,1])){
Hi Cara - it is usually better to ask questions about Bioconductor packages on
the Bioconductor support site, https://support.bioconductor .org.

For your case, something like

  names(seqs) = tax[match(names(seqs), tax[,1]), 2] 

maps, in a 'vectorized' way, the names of the sequences to the names of
the taxon ids, and updates the names in the DNAStringSet. Then

  writeXStringSet(seqs, "output.fasta")

outputs the results as a fasta file. Probably you want read.table(...,
stringsAsFactors=FALSE) earlier in your code.

Martin
>   sampleID=NULL
>   match=NULL
>   sampleID=as.character(tax2[i,1])  #sample ID in taxonomy header
>   match=which(sampleID==header) #index for match in header file
>   if(match>0){
>     newH1=NULL
>     newH2=NULL
>     seqline=NULL
>     new.header=NULL
>     newH1=as.character(tax2[i,1])
>     newH2=as.character(tax2[i,2])
>     seqline=seqs2[match]
>     new.header=paste(">",newH1,"|",newH2,
sep="")
>     new.final=rbind(new.final, new.header, seqline)
>   }
>   print(paste("percent complete =",
round((i/length(tax2[,1]))*100,3),
> "%",sep=" "))
>   write.table(new.final, file="Test_output.txt", quote=FALSE,
sep="\n",
> col.names=FALSE, row.names=FALSE, append=TRUE)
>   i=i+1
> }
> 
> 
> Something about rbind is repeating all of the rows every time it writes to
the
> output file. I have not been able to find anything about this online or in
the r
> help for rbind, although perhaps I am missing something obvious about this.
> 
> I greatly appreciate any help with this!
> 
> 	[[alternative HTML version deleted]]
> 
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-
> guide.html
> and provide commented, minimal, self-contained, reproducible code.

This email message may contain legally privileged and/or confidential
information.  If you are not the intended recipient(s), or the employee or agent
responsible for the delivery of this message to the intended recipient(s), you
are hereby notified that any disclosure, copying, distribution, or use of this
email message is prohibited.  If you have received this message in error, please
notify the sender immediately by e-mail and delete this email message from your
computer. Thank you.

R help - Oct 2015 - duplicate rows with rbind in a loop

[R] duplicate rows with rbind in a loop

[R] duplicate rows with rbind in a loop