thr3ads.net - R devel - [Rd] Slow 'read.table' in R 1.4.0 (PR#1232) [Dec 2001]

james.holtman@convergys.com

2001-Dec-29 20:29 UTC

[Rd] Slow 'read.table' in R 1.4.0 (PR#1232)

The 'read.table' function appears to be up to 10 times slower in R 1.4.0
than in R 1.3.1 for some of the data sets I read in.  Comparing the source
code for the two versions, I see that it was rewritten in R 1.4.0.

I think I have found part of the problem.  Comparing the R 1.3.1 and
R 1.4.0 code, it appears that a statement is missing from the section at
the beginning of read.table in R 1.4.0.  The loop starting with
'while (nlines < 5)' reads in the entire file, because 'nlines' is never
incremented inside the loop.  I traced through the code and this is what
was happening.  It then does a 'pushBack' of the entire file, and the
trace suggests that this is where the time is going.  With the change
noted below, the speed was similar to R 1.3.1 and the results were the
same.

Here is the current code with what I think is the additional statement
needed:

=================part of read.table=======
    nlines <- 0
    lines <- NULL
    while (nlines < 5) {
        line <- readLines(file, 1, ok = TRUE)
        if (length(line) == 0)
            break
        if (blank.lines.skip && length(grep("^[ \\t]*$", line)))
            next
        if (length(comment.char) && nchar(comment.char)) {
            pattern <- paste("^[ \\t]*", substring(comment.char,
                1, 1), sep = "")
            if (length(grep(pattern, line)))
                next
        }
        lines <- c(lines, line)
        #
        #  additional line required
        #
        nlines <- nlines + 1
    }
    nlines <- length(lines)
    if (!nlines) {
        if (missing(col.names))
            stop("no lines available in input")
        else {
            tmp <- vector("list", length(col.names))
            names(tmp) <- col.names
            class(tmp) <- "data.frame"
            return(tmp)
        }
    }
    if (all(nchar(lines) == 0))
        stop("empty beginning of file")
    pushBack(c(lines, lines), file)
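
The effect of the missing increment can be reproduced outside read.table. The following is only a minimal sketch, not the R source: 'peek_lines' and its 'increment' argument are hypothetical names made up for this illustration, and the input is synthetic. It counts how many lines are pulled from a text connection with and without the counter update.

```r
# Hypothetical helper (not part of R): peek at up to `n` non-blank lines of
# an open connection, then push the kept lines back onto it.
peek_lines <- function(con, n = 5, increment = TRUE) {
    nlines <- 0
    nread <- 0                       # lines actually pulled from `con`
    lines <- NULL
    while (nlines < n) {
        line <- readLines(con, 1, ok = TRUE)
        if (length(line) == 0)
            break                    # end of input
        nread <- nread + 1
        if (length(grep("^[[:space:]]*$", line)))
            next                     # blank lines are skipped, not counted
        lines <- c(lines, line)
        if (increment)
            nlines <- nlines + 1     # the statement reported as missing
    }
    pushBack(lines, con)             # kept lines become readable again
    list(kept = length(lines), read = nread)
}

# Synthetic input: a header, one blank line, then 1000 data lines.
input <- c("a b", "", paste(1:1000, 1:1000))

con <- textConnection(input)
fixed <- peek_lines(con, increment = TRUE)
first_again <- readLines(con, 1)     # served from the pushed-back lines
close(con)

con <- textConnection(input)
buggy <- peek_lines(con, increment = FALSE)
close(con)
```

With this 1002-line input, the counted loop pulls 6 lines from the connection (5 kept plus one skipped blank), while the uncounted loop pulls all 1002 and pushes 1001 of them back.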

--

NOTICE:  The information contained in this electronic mail transmission is
intended by Convergys Corporation for the use of the named individual or
entity to which it is directed and may contain information that is
privileged or otherwise confidential.  If you have received this electronic
mail transmission in error, please delete it from your system without
copying or forwarding it, and notify the sender of the error by reply email
or by telephone (collect), so that the sender's address records can be
corrected.



-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-devel mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To:
r-devel-request@stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._

ripley@stats.ox.ac.uk

2001-Dec-29 21:25 UTC

[Rd] Slow 'read.table' in R 1.4.0 (PR#1232)

As I told you privately several days ago,

*This has already been fixed in R-patched*.


On Sat, 29 Dec 2001 james.holtman@convergys.com wrote:
> The 'read.table' function appears to be up to 10 times slower in R 1.4.0
> than in R 1.3.1 for some of the data sets I read in.  Comparing the source
> code for the two versions, I see that it was rewritten in R 1.4.0.
>
> I think I have found part of the problem.  Comparing the R 1.3.1 and
> R 1.4.0 code, it appears that a statement is missing from the section at
> the beginning of read.table in R 1.4.0.  The loop starting with
> 'while (nlines < 5)' reads in the entire file, because 'nlines' is never
> incremented inside the loop.  I traced through the code and this is what
> was happening.  It then does a 'pushBack' of the entire file, and the
> trace suggests that this is where the time is going.  With the change
> noted below, the speed was similar to R 1.3.1 and the results were the
> same.
>
> Here is the current code with what I think is the additional statement
> needed:
>
> =================part of read.table=======
>     nlines <- 0
>     lines <- NULL
>     while (nlines < 5) {
>         line <- readLines(file, 1, ok = TRUE)
>         if (length(line) == 0)
>             break
>         if (blank.lines.skip && length(grep("^[ \\t]*$", line)))
>             next
>         if (length(comment.char) && nchar(comment.char)) {
>             pattern <- paste("^[ \\t]*", substring(comment.char,
>                 1, 1), sep = "")
>             if (length(grep(pattern, line)))
>                 next
>         }
>         lines <- c(lines, line)
>         #
>         #  additional line required
>         #
>         nlines <- nlines + 1
>     }
>     nlines <- length(lines)
>     if (!nlines) {
>         if (missing(col.names))
>             stop("no lines available in input")
>         else {
>             tmp <- vector("list", length(col.names))
>             names(tmp) <- col.names
>             class(tmp) <- "data.frame"
>             return(tmp)
>         }
>     }
>     if (all(nchar(lines) == 0))
>         stop("empty beginning of file")
>     pushBack(c(lines, lines), file)
>
-- 
Brian D. Ripley,                  ripley@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272860 (secr)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595


