thr3ads.net - R help - [R] Help needed in feature extraction from two input files [Jun 2013]

arun

2013-Jun-11 18:52 UTC

[R] Help needed in feature extraction from two input files

Hi,
Try this:
lines1<- readLines(textConnection("gene1 or1|1234 or3|56 or4|793
gene4 or2|347
gene5 or3|23 or7|123456789"))

lines2<- readLines(textConnection(">or1|1234
ATCGGATTCAGG
>or2|347
GAACCTATCGGGGGGGGAATTTATATATTTTA
>or3|56
ATCGGAGATATAACCAATC
>or3|23
AAAATTAACAAGAGAATAGACAAAAAAA
>or4|793
ATCTCTCTCCTCTCTCTCTAAAAA
>or7|123456789
ACGTGTGTACCCCC"))

# Paste each header line to the sequence line that follows it
lines2New<- unlist(lapply(split(lines2,(seq_along(lines2)-1)%/%2+1),function(x)
paste(x,collapse="\n")),use.names=FALSE)

# For each line of lines1, write the matching records to "<first field>.txt"
res<- lapply(lines1,function(x) {
  x1<- strsplit(x," ")[[1]]
  x1New<- x1[-1]
  x2<- gsub(">(.*)\\n.*","\\1",lines2New)
  lines3<- lines2New[match(x1New,x2)]
  write.table(lines3,paste0(x1[1],".txt"),row.names=FALSE,quote=FALSE)
})


Attached is one of the files generated by the code.
A.K.
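
For readers unfamiliar with the grouping idiom used above, here is a small sketch of what `(seq_along(x) - 1) %/% 2 + 1` does. The four-line input is made up for illustration and is not the poster's data; the point is only that consecutive pairs of lines land in the same group, so each header can be pasted to the line that follows it.

```r
# Toy input: two header/sequence pairs (illustrative values only)
x <- c(">id1", "AAAA", ">id2", "CCCC")

# Integer-divide the zero-based position by 2:
# lines 1-2 fall in group 1, lines 3-4 in group 2
grp <- (seq_along(x) - 1) %/% 2 + 1

# Paste each pair into a single "header\nsequence" record
records <- unlist(lapply(split(x, grp), paste, collapse = "\n"), use.names = FALSE)

# Recover the identifier: drop the leading ">" and everything from the newline on
ids <- sub("^>([^\n]*)\n.*$", "\\1", records)
```

This only works when every record occupies exactly two lines, which is why a sequence wrapped over several lines has to be joined first.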


Hi all, 

I have two input files. The first file (file1.txt) contains entries in the following
tab-delimited format:

gene1	or1|1234	or3|56	or4|793 
gene4	or2|347 
gene5	or3|23	or7|123456789 

....... 
.. 


The second file (file2.txt) contains some additional features, each under a
header line that matches an entry in the first file, such as:
>or1|1234
ATCGGATTCAGG
>or2|347
GAACCTATCGGGGGGGGAATTTA
TATATTTTA
>or3|56
ATCGGAGATATAACCAATC
>or3|23
AAAATTAACAAGAGAATAGACAAAAAAA
>or4|793
ATCTCTCTCCTCTCTCTCTAAAAA
>or7|123456789
ACGTGTGTACCCCC

.... 
.. 

From these two files, I want to extract entries by matching the headers row by
row, and name each output file after the first column in file1.
For example, in the above case, 3 output files will be generated.

The first output file would be named "gene1.txt" and contain:
>or1|1234
ATCGGATTCAGG
>or3|56
ATCGGAGATATAACCAATC
>or4|793
ATCTCTCTCCTCTCTCTCTAAAAA

The second output file would be named "gene4.txt" and contain:
>or2|347
GAACCTATCGGGGGGGGAATTTATATATTTTA

The third output file would be named "gene5.txt" and contain:
>or3|23
AAAATTAACAAGAGAATAGACAAAAAAA
>or7|123456789
ACGTGTGTACCCCC

Any help in solving the problem is highly appreciated. Thanks in advance. 
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: gene1.txt
URL:
<https://stat.ethz.ch/pipermail/r-help/attachments/20130611/5e8cabf1/attachment.txt>

arun

2013-Jun-11 21:54 UTC

[R] Help needed in feature extraction from two input files

Hi,
Try this:
lines1<- readLines("file1.txt")
lines1<- lines1[lines1!=""]

# In "file2.txt" as attached, one sequence is split over two lines:
# >or1|1234
# ATCGGATTCAGG
# >or2|347
# GAACCTATCGGGGGGGGAATTTA
# TATATTTTA   ### this should be a single line
# >or3|56
# ATCGGAGATATAACCAATC
# >or3|23
# AAAATTAACAAGAGAATAGACAAAAAAA
# >or4|793
# ATCTCTCTCCTCTCTCTCTAAAAA
# >or7|123456789
# ACGTGTGTACCCCC
#
# So I modified the file manually so that it looks like:
# >or1|1234
# ATCGGATTCAGG
# >or2|347
# GAACCTATCGGGGGGGGAATTTATATATTTTA
# >or3|56
# ATCGGAGATATAACCAATC
# >or3|23
# AAAATTAACAAGAGAATAGACAAAAAAA
# >or4|793
# ATCTCTCTCCTCTCTCTCTAAAAA
# >or7|123456789
# ACGTGTGTACCCCC
#
# and saved it. If you have many lines showing the above-mentioned anomaly, then
# let me know.
#
# I also created a new line after the last line of the file (by using the
# `Enter` key) to suppress the warnings(), which I removed below.
lines2<- readLines("file2.txt")
lines2<- lines2[lines2!=""]

# Paste each header line to the sequence line that follows it
lines2New<- unlist(lapply(split(lines2,(seq_along(lines2)-1)%/%2+1),function(x)
paste(x,collapse="\n")),use.names=FALSE)

## Changed here because the file is tab-delimited.
res<- lapply(lines1,function(x) {
  x1<- strsplit(x,"\t")[[1]]
  x1New<- x1[-1]
  x2<- gsub(">(.*)\\n.*","\\1",lines2New)
  lines3<- lines2New[match(x1New,x2)]
  write.table(lines3,paste0(x1[1],".txt"),row.names=FALSE,quote=FALSE)
})


I didn't have any problems with the output.
It looks like this:
gene1.txt

x
>or1|1234
ATCGGATTCAGG
>or3|56
ATCGGAGATATAACCAATC
>or4|793
ATCTCTCTCCTCTCTCTCTAAAAA

A.K.
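
If many sequences in the file are wrapped over several lines, editing them by hand does not scale. The sketch below is one possible alternative and not the approach used in this thread: it groups lines by counting header lines with `cumsum()`, so a record may span any number of lines. The helper name `read_records` is hypothetical, and the input is a shortened copy of the example data.

```r
# Hypothetical helper (not from the thread): collect FASTA-like records whose
# sequence may be wrapped over several lines
read_records <- function(lines) {
  lines <- trimws(lines)
  lines <- lines[nzchar(lines)]            # drop empty lines
  grp <- cumsum(startsWith(lines, ">"))    # record number for every line
  keep <- grp > 0                          # ignore stray lines before the first header
  parts <- split(lines[keep], grp[keep])
  ids <- vapply(parts, function(p) sub("^>", "", p[1]), character(1))
  seqs <- vapply(parts, function(p) paste(p[-1], collapse = ""), character(1))
  names(seqs) <- unname(ids)
  seqs                                     # named vector: identifier -> joined sequence
}

lines2 <- c(">or1|1234", "ATCGGATTCAGG",
            ">or2|347", "GAACCTATCGGGGGGGGAATTTA", "TATATTTTA")
recs <- read_records(lines2)
```

With the records keyed by identifier, each line of file1.txt can then be looked up by name instead of by position.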


Hi,

Thanks Arun,

Three output files are generated, but they show x and NA; maybe I have to
check the input...

Could you please modify the script so that it takes its input directly from files?
I have attached the two input files.
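
The "x" is the column header that `write.table()` writes by default for an unnamed vector, and an NA line appears wherever `match()` finds no record for an identifier. One possible cause is trailing whitespace after an identifier in file1.txt, but that is a guess and is not confirmed in the thread. A minimal sketch of one way to avoid both, using a hypothetical helper `write_gene_file()` and a small in-memory example rather than the attached files:

```r
# Hypothetical helper (not from the thread): write the records for one line of file1
write_gene_file <- function(line, records, ids, dir = ".") {
  fields <- trimws(strsplit(line, "[ \t]+")[[1]])  # split on tabs or spaces
  fields <- fields[nzchar(fields)]
  idx <- match(fields[-1], ids)
  idx <- idx[!is.na(idx)]            # skip identifiers with no record instead of writing NA
  out <- file.path(dir, paste0(fields[1], ".txt"))
  writeLines(records[idx], out)      # writeLines() adds no "x" header line
  invisible(out)
}

records <- c(">or1|1234\nATCGGATTCAGG", ">or3|56\nATCGGAGATATAACCAATC")
ids <- c("or1|1234", "or3|56")

# "or9|0" has no record and the line ends in a stray space; both are tolerated
out <- write_gene_file("gene1\tor1|1234\tor3|56\tor9|0 ", records, ids, dir = tempdir())
written <- readLines(out)
```

Silently skipping unmatched identifiers is a design choice; a warning listing them may be preferable when checking the input files.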



----- Original Message -----
From: arun <smartpink111 at yahoo.com>
To: Utpal Bakshi <utpalmtbi at gmail.com>
Cc: R help <r-help at r-project.org>
Sent: Tuesday, June 11, 2013 2:52 PM
Subject: Re: Help needed in feature extraction from two input files

