Michael A. Gilchrist
2008-Sep-12 16:34 UTC
[R] reading in results from system(). There must be an easier way...
Hello,

I am currently using R to run an external program and then read the results the external program sends to stdout, which are TSV data.

When R reads the results in, it converts them to a list of strings, which I then have to manipulate with a whole slew of commands (figuring out how to do that was a real challenge for a newbie like myself); see below.

Here's the code I'm using. COMMAND runs the external program.

    rawInput = system(COMMAND, intern=TRUE);    ## read in tsv values
    rawInput = strsplit(rawInput, split="\t");  ## split elements w/in the list
                                                ## of character strings by "\t"
    rawInput = unlist(rawInput);                ## unlist, making it one long vector
    mode(rawInput) = "double";                  ## convert from strings to double
    finalInput = data.frame(t(matrix(rawInput, nrow=6)));  ## convert

Because I will be doing this 100,000 times as part of an optimization problem, I am interested in learning a more efficient way of doing this conversion.

Any suggestions would be appreciated.

Thanks in advance.

Mike

-----------------------------------------------------
Department of Ecology & Evolutionary Biology
569 Dabney Hall
University of Tennessee
Knoxville, TN 37996-1610

phone: (865) 974-6453
fax: (865) 974-6042

web: http://eeb.bio.utk.edu/gilchrist.asp
Marc Schwartz
2008-Sep-12 16:58 UTC
[R] reading in results from system(). There must be an easier way...
Based upon the presumption that your incoming data are simple tab-delimited values in lines, with no header record, and where each line is to end up as a single row in a data frame, you could use something like the following:

    finalDF <- read.table(textConnection(system(COMMAND, intern = TRUE)),
                          sep = "\t", header = FALSE)

Alternatively, you could use scan() directly, then convert to a matrix:

    finalMAT <- matrix(scan(textConnection(system(COMMAND, intern = TRUE)),
                            sep = "\t"), nrow = 6)

These are untested of course, but should get you close, if not there.

See ?textConnection for the basic process of taking incoming data from a text stream.

HTH,

Marc Schwartz
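As a quick check of the read.table() idea above, here is a minimal sketch that compares it with the multi-step conversion from the question. The vector rawLines is a made-up stand-in for what system(COMMAND, intern = TRUE) would return (one string per output line, six tab-separated numbers each); the real program's output may differ.

```r
# Stand-in for system(COMMAND, intern = TRUE): one string per output line,
# six tab-separated numeric fields each (an assumption for illustration).
rawLines <- c("1\t2\t3\t4\t5\t6", "7\t8\t9\t10\t11\t12")

# Multi-step conversion, as in the original question.
old <- unlist(strsplit(rawLines, split = "\t", fixed = TRUE))
old <- data.frame(t(matrix(as.double(old), nrow = 6)))

# One-step alternative: let read.table() parse the lines directly.
con <- textConnection(rawLines)
new <- read.table(con, sep = "\t", header = FALSE, colClasses = "numeric")
close(con)

# Same numbers in the same layout; only the column names differ (X1.. vs V1..).
sameValues <- isTRUE(all.equal(unname(as.matrix(old)), unname(as.matrix(new))))
```

Passing colClasses = "numeric" is an addition here; it spares read.table() from having to guess the column types.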
Gabor Grothendieck
2008-Sep-12 17:15 UTC
[R] reading in results from system(). There must be an easier way...
The conversion to a data frame and the transpose might be time consuming. In addition to other comments you have received, try this:

    matrix(rawInput, ncol = 6, byrow = TRUE)

and if you don't really need a data frame, eliminate the conversion from matrix to data.frame. You might time this against the code you posted to see if it makes a difference.
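To make the byrow suggestion concrete, the following small sketch (with made-up values standing in for the parsed numbers) checks that filling by row gives the same matrix as building it with nrow = 6 and transposing:

```r
# Twelve made-up values standing in for the parsed output (two records of six).
vals <- as.double(1:12)

# Layout from the original post: fill column-wise, then transpose.
viaTranspose <- t(matrix(vals, nrow = 6))

# Suggested layout: fill row-wise directly, with no transpose step.
viaByrow <- matrix(vals, ncol = 6, byrow = TRUE)

sameLayout <- identical(viaTranspose, viaByrow)
```

Both forms put each group of six consecutive values into one row; the byrow form simply skips the intermediate 6-row matrix.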
Henrik Bengtsson
2008-Sep-12 17:40 UTC
[R] reading in results from system(). There must be an easier way...
Hi, a few comments below.

On Fri, Sep 12, 2008 at 9:34 AM, Michael A. Gilchrist <mikeg at utk.edu> wrote:
> Here's the code I'm using. COMMAND runs the external program.
>
> rawInput= system(COMMAND,intern=TRUE);##read in tsv values

For debugging purposes etc., it is good to read the data into a buffer like this, instead of wrapping up everything in one big nested expression. The overhead for doing this should be minimal.

> rawInput = strsplit(rawInput, split="\t");##split elements w/in the list

FYI, strsplit(x, split="\t", fixed=TRUE) is *heaps* faster (than fixed=FALSE), e.g.

    > x <- paste(1:3e4, collapse="\t")
    > t <- system.time(y <- strsplit(x, split="\t"))
    > t
       user  system elapsed
       2.89    0.00    2.89
    > t <- system.time(y <- strsplit(x, split="\t", fixed=TRUE))
    > t
       user  system elapsed
          0       0       0

> ##of character strings by "\t"
> rawInput = unlist(rawInput); ##unlist, making it one long vector

FYI, unlist(x, use.names=FALSE) is faster, especially when 'x' is long/large.

> mode(rawInput)="double"; ##convert from strings to double
> finalInput = data.frame(t(matrix(rawInput, nrow=6))); ##convert

Taking the transpose t() takes time; it requires a copy in memory. Do you really need the data transposed? Converting a matrix to a data frame takes time. Do you really need the data as a data frame?

> Because I will be doing this 100,000 times as part of an optimization
> problem, I am interested in learning a more efficient way of doing this
> conversion.

Do you need the data in each iteration? If not, collect the data as strings and then do the coercing to doubles and turning it into a matrix all together. That is likely to be faster because there is a bit of overhead in each iteration.

As suggested, using scan() and providing R with as many hints as possible (explicit arguments to scan() when you know something about the input, so that R doesn't have to guess) will also speed things up.

    parseA <- function(x, ...) {
      y <- strsplit(x, split="\t", fixed=FALSE);
      y <- unlist(y);
      y <- as.double(y);
    }

    parseB <- function(x, ...) {
      y <- strsplit(x, split="\t", fixed=TRUE);
      y <- unlist(y, use.names=FALSE);
      y <- as.double(y);
    }

    parseC <- function(x, ...) {
      con <- textConnection(x);
      on.exit(close(con));
      y <- scan(file=con, what=double(0), sep="\t", quiet=TRUE);
      y;
    }

    parseD <- function(x, ...) {
      con <- textConnection(x);
      on.exit(close(con));
      y <- scan(file=con, what=double(0), sep="\t", quote=NULL,
                na.strings=NULL, strip.white=FALSE, comment.char="",
                allowEscapes=FALSE, quiet=TRUE);
      y;
    }

    > x <- paste(1:3e4, collapse="\t");
    > tA <- system.time(yA <- parseA(x));
    > tA;
       user  system elapsed
       2.91    0.00    2.91
    > tB <- system.time(yB <- parseB(x));
    > tB;
       user  system elapsed
       0.03    0.00    0.04
    > tC <- system.time(yC <- parseC(x));
    > tC;
       user  system elapsed
       0.03    0.00    0.03
    > tD <- system.time(yD <- parseD(x));
    > tD;
       user  system elapsed
       0.03    0.00    0.03

    > x <- paste(1:1e6, collapse="\t");
    # parseA() painfully slow
    > tB <- system.time(yB <- parseB(x));
    > tB
       user  system elapsed
       2.30    0.00    2.31
    > tC <- system.time(yC <- parseC(x));
    > tC
       user  system elapsed
       1.14    0.00    1.16
    > tD <- system.time(yD <- parseD(x));
    > tD
       user  system elapsed
       1.16    0.01    1.17

Ok, so parseD() doesn't seem to be much faster than parseC(), but depending on your output format it may be.

Take home message: read the help pages and try to help R as much as possible so it does not have to guess. You can always make your code twice as fast!

/HB
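The "collect the data as strings, then convert everything at once" idea could look roughly like the sketch below. It only applies if the parsed values are not needed inside each iteration. The function runOnce() is a hypothetical stand-in for system(COMMAND, intern = TRUE) and returns one made-up line of six tab-separated numbers per call.

```r
# Hypothetical stand-in for system(COMMAND, intern = TRUE) in iteration i:
# returns one line holding six tab-separated numbers.
runOnce <- function(i) paste(i + 0:5, collapse = "\t")

nIter <- 3
collected <- vector("list", nIter)   # preallocate storage for the raw strings
for (i in seq_len(nIter)) {
  collected[[i]] <- runOnce(i)       # keep the raw text, do not parse yet
}

# Parse all collected lines in a single scan() call.
allLines <- unlist(collected, use.names = FALSE)
con <- textConnection(allLines)
vals <- scan(con, what = double(0), sep = "\t", quiet = TRUE)
close(con)

# One row per output line, six columns, with no transpose or data frame step.
allMat <- matrix(vals, ncol = 6, byrow = TRUE)
```

With this layout, the per-call overhead of opening a connection and calling scan() is paid once rather than once per iteration.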
Prof Brian Ripley
2008-Sep-12 18:10 UTC
[R] reading in results from system(). There must be an easier way...
Why not use

    con <- pipe(COMMAND)
    foo <- read.delim(con, colClasses="numeric")
    close(con)

? See the 'R Data Import/Export' manual.

--
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595
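A self-contained way to try the pipe() approach is sketched below. Since the real COMMAND is not known, the command here is a hypothetical stand-in: a child Rscript process that prints two tab-separated lines. The argument header = FALSE is also an assumption (that the program prints no header line); drop it if the first output line holds column names.

```r
# Hypothetical stand-in for COMMAND: a child R process that writes two
# tab-separated lines to stdout. Replace with the real external program.
rscript <- file.path(R.home("bin"), "Rscript")
cmd <- paste(shQuote(rscript), "--vanilla", "-e",
             shQuote("cat('1\\t2\\t3\\n4\\t5\\t6\\n')"))

# read.delim() opens the not-yet-open pipe, reads everything, and closes it.
# header = FALSE (an assumption) treats the first output line as data.
foo <- read.delim(pipe(cmd), header = FALSE, colClasses = "numeric")
```

Here the connection is passed to read.delim() unopened, so it is opened and closed inside that call and no separate close() is issued.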