james.holtman@convergys.com
2001-Dec-29 20:29 UTC
[Rd] Slow 'read.table' in R 1.4.0 (PR#1232)
The 'read.table' function appears to be up to 10X slower in R 1.4.0 than in
R 1.3.1 for some of the data sets I read in. Comparing the source code of
the two versions, I see that it was rewritten in R 1.4.0.

I think I have found part of the problem. A statement appears to be missing
from the R 1.4.0 code at the beginning of read.table. The loop starting with
'while (nlines < 5)' will read in the entire file, because 'nlines' is never
incremented inside the loop. I traced through the code and this is what was
happening. It then does a 'pushBack' of the entire file, and tracing shows
that this is where the time appears to go. With the change noted below, the
speed was similar to R 1.3.1 and the results were the same.

Here is the current code with what I think is the additional statement
needed:

=================part of read.table=======

    nlines <- 0
    lines <- NULL
    while (nlines < 5) {
        line <- readLines(file, 1, ok = TRUE)
        if (length(line) == 0)
            break
        if (blank.lines.skip && length(grep("^[ \\t]*$", line)))
            next
        if (length(comment.char) && nchar(comment.char)) {
            pattern <- paste("^[ \\t]*", substring(comment.char,
                1, 1), sep = "")
            if (length(grep(pattern, line)))
                next
        }
        lines <- c(lines, line)
        #
        # additional line required
        #
        nlines <- nlines+1
    }
    nlines <- length(lines)
    if (!nlines) {
        if (missing(col.names))
            stop("no lines available in input")
        else {
            tmp <- vector("list", length(col.names))
            names(tmp) <- col.names
            class(tmp) <- "data.frame"
            return(tmp)
        }
    }
    if (all(nchar(lines) == 0))
        stop("empty beginning of file")
    pushBack(c(lines, lines), file)

-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-devel mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-devel-request@stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
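The effect described in the report above can be reproduced in isolation. The following is only a minimal sketch, not the actual read.table source: the helper names `read_head` and `run`, the `increment` switch, and the sample input are all made up for illustration. It reads the head of a text connection with and without the increment, pushes the lines back twice as the quoted code does, and compares how much was read and pushed back.

```r
# Hypothetical helper (not part of R): collects up to `n` non-blank,
# non-comment lines from an open connection, following the quoted loop.
# `increment = FALSE` reproduces the behaviour described for R 1.4.0.
read_head <- function(con, n = 5, comment.char = "#", increment = TRUE) {
  nlines <- 0
  lines <- NULL
  while (nlines < n) {
    line <- readLines(con, 1, ok = TRUE)
    if (length(line) == 0) break                    # end of input
    if (length(grep("^[ \t]*$", line))) next        # skip blank lines
    if (nchar(comment.char)) {
      pattern <- paste("^[ \t]*", substring(comment.char, 1, 1), sep = "")
      if (length(grep(pattern, line))) next         # skip comment lines
    }
    lines <- c(lines, line)
    if (increment) nlines <- nlines + 1             # the proposed extra statement
  }
  lines
}

# Made-up input: one comment line, one blank line, twenty data lines.
input <- c("# header comment", "", paste("row", 1:20))

# Read the head of the input, push it back twice as the quoted code does,
# and report how many lines were read and how many were pushed back.
run <- function(increment) {
  con <- textConnection(input)
  on.exit(close(con))
  lines <- read_head(con, increment = increment)
  pushBack(c(lines, lines), con)
  c(read = length(lines), pushed_back = pushBackLength(con))
}

fixed <- run(TRUE)    # read = 5,  pushed_back = 10
buggy <- run(FALSE)   # read = 20, pushed_back = 40
print(rbind(fixed, buggy))
```

Without the increment, the loop only stops at end of input, so the number of lines read and pushed back grows with the size of the file rather than staying fixed at the first five usable lines.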
As I told you privately several days ago, *this has already been fixed in
R-patched*.

On Sat, 29 Dec 2001 james.holtman@convergys.com wrote:

> The 'read.table' function appears to be up to 10X slower in R 1.4.0 than R
> 1.3.1 for some of the data sets I read in. I was comparing the source code
> for the 2 versions and see that it was rewritten in R 1.4.0.
>
> I think I found out what part of the problem might be. I was comparing
> R1.3.1 and R1.4.0 code and it appears that a statement is missing in some
> of the code for R 1.4. This is the section of code at the beginning of
> read.table. The loop starting with 'while (nlines < 5)' will read in the
> entire file, because there is no increment of 'nlines' in the loop. I
> traced through the code and this is what was happening. It then does a
> 'pushBack' of the entire file. In tracing through the code, this is where
> it appears to be taking the time. With the change noted below, the speed
> was similar to R 1.3.1 and the results were the same.

--
Brian D. Ripley,                  ripley@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272860 (secr)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595