J Chen
2009-Sep-15 20:59 UTC
[R] how to load only lines that start with a particular symbol
Dear all,

I have DNA sequence data which are fasta-formatted as

    >gene A;.....
    AAAAACCCC
    TTTTTGGGG
    CCCTTTTTT
    >gene B;....
    CCCCCAAAA
    GGGGGTTTT

I want to load only the lines that start with ">", where the annotation information for the gene is contained. In principle, I can remove the sequences before loading or after loading all the lines. I just wonder if there's a way to load only lines with a particular pattern. The skip argument in read.table() doesn't work for my purpose.

Thanks in advance,
Jimmy

--
View this message in context: http://www.nabble.com/how-to-load-only-lines-that-start-with-a-particular-symbol-tp25461693p25461693.html
Sent from the R help mailing list archive at Nabble.com.
jim holtman
2009-Sep-15 21:04 UTC
[R] how to load only lines that start with a particular symbol
Read in the data with 'readLines' and then use 'grep':

    > x
    [1] ">gene A;....." "AAAAACCCC" "TTTTTGGGG" "CCCTTTTTT" ">gene B;...." "CCCCCAAAA" "GGGGGTTTT"
    > x <- x[grep("^>", x)]
    > x
    [1] ">gene A;....." ">gene B;...."

--
Jim Holtman
Cincinnati, OH
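For reference, the suggestion above can be written out as a self-contained sketch. The temporary file and the variable names (fasta_file, all_lines, headers) are illustrative assumptions, not part of the original reply; only readLines() and the "^>" pattern come from the thread.

```r
# Write the sample FASTA data from the question to a temporary file
# (the file name is an assumption made for this illustration).
fasta_file <- tempfile(fileext = ".fasta")
writeLines(c(">gene A;.....", "AAAAACCCC", "TTTTTGGGG", "CCCTTTTTT",
             ">gene B;....", "CCCCCAAAA", "GGGGGTTTT"), fasta_file)

# Read every line, then keep only those that begin with ">".
all_lines <- readLines(fasta_file)
headers <- all_lines[grepl("^>", all_lines)]
print(headers)
```

Note that this still reads the whole file into memory before filtering; grep("^>", all_lines, value = TRUE) would give the same result in one call.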
William Dunlap
2009-Sep-15 21:44 UTC
[R] how to load only lines that start with a particular symbol
You could use pipe() to call an external program like grep or perl to filter the lines of interest from the file, so R's input routine only has to allocate space for those. E.g., the following makes a sample file, and the readLines(pipe(...)) call reads only the lines starting with ">> " from it. (It assumes you don't have grep in PATH and gives where it is installed on my Windows machine.)

    > tfile <- tempfile()
    > cat(file=tfile, sep="\n", c(">> Date", ">> Author", "columnA columnB", "1 2", "3 4"))
    > readLines(tfile)
    [1] ">> Date"         ">> Author"       "columnA columnB" "1 2"
    [5] "3 4"
    > readLines(pipe(paste("e:/cygwin/bin/grep \"^>> \" ", tfile)))
    [1] ">> Date"   ">> Author"

perl can do more complicated processing and filtering than grep.

Bill Dunlap
TIBCO Software Inc - Spotfire Division
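Where no external grep or perl is available, a similar memory saving can probably be approximated in pure R by reading the file in chunks through a connection and keeping only the matching lines from each chunk. This is a different technique from the pipe() approach above, and read_matching_lines() is a hypothetical helper written for this illustration, not an existing R function.

```r
# Hypothetical helper (not from the thread): read a file chunk by chunk and
# keep only the lines matching `pattern`, so that at most `chunk_size`
# unfiltered lines are held in memory at any time.
read_matching_lines <- function(path, pattern, chunk_size = 10000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  kept <- character(0)
  repeat {
    chunk <- readLines(con, n = chunk_size)
    if (length(chunk) == 0L) break  # end of file reached
    kept <- c(kept, grep(pattern, chunk, value = TRUE))
  }
  kept
}

# Same sample data as in the message above, written with a final newline.
tfile <- tempfile()
writeLines(c(">> Date", ">> Author", "columnA columnB", "1 2", "3 4"), tfile)

# A tiny chunk size is used here only to exercise the loop.
result <- read_matching_lines(tfile, "^>> ", chunk_size = 2L)
print(result)
```

The chunk size trades memory for the number of read calls; the default of 10000 lines is an arbitrary choice for the sketch.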
Gabor Grothendieck
2009-Sep-16 00:50 UTC
[R] how to load only lines that start with a particular symbol
In the Windows cmd shell, ^ escapes the next character, so try this (assuming the data you posted is in genetest.dat in the current directory):

    > readLines(pipe("findstr/b ^> genetest.dat"))
    [1] ">gene A;....." ">gene B;...."

On UNIX, replace the command string passed to pipe() with the corresponding grep command, making sure you escape the > appropriately for the shell you use.
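On a UNIX-like system the grep variant might look like the sketch below. It assumes a POSIX shell with grep on the PATH, and it writes genetest.dat to a temporary directory purely for illustration. shQuote() is used so that the shell receives the pattern "^>" literally and does not interpret ">" as output redirection.

```r
# Recreate the posted data as genetest.dat (location is an assumption).
gene_file <- file.path(tempdir(), "genetest.dat")
writeLines(c(">gene A;.....", "AAAAACCCC", "TTTTTGGGG", "CCCTTTTTT",
             ">gene B;....", "CCCCCAAAA", "GGGGGTTTT"), gene_file)

# Build the shell command; on UNIX-like systems shQuote() wraps each
# argument in single quotes, protecting "^", ">" and any spaces in the path.
cmd <- paste("grep", shQuote("^>"), shQuote(gene_file))

# Only the lines that grep emits are ever read into R.
con <- pipe(cmd)
headers <- readLines(con)
close(con)
print(headers)
```

Keeping the connection in a variable and closing it explicitly avoids the "closing unused connection" warning that readLines(pipe(...)) can trigger later.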