jtk@cmp.uea.ac.uk
2005-Mar-11 14:49 UTC
[Rd] read.table messes up stdin upon small, erroneous input (PR#7722)
Full_Name: Jan T. Kim
Version: 2.0.1, devel-2005-02-24
OS: Linux 2.6.x
Submission from: (NULL) (139.222.3.229)

Run read.table(stdin()) and type in the broken table

1 2
1

terminating the input by pressing Ctrl-D at the 3rd line of input. An error message by scan, complaining that "line 2 did not have 2 elements" appears, as expected.

However: After this, there are three empty lines buffered in stdin:

> readLines(stdin())
[1] "" "" ""

Repeated attempts to read.table the broken input from stdin lead to even more strange results:

> read.table(stdin())
0: 1 2
1: 1
2:
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
        line 2 did not have 2 elements
> read.table(stdin())
3: 1 2
4: 1
[1] V1 V2
<0 rows> (or 0-length row.names)
>

Analysis: These effects are due to a combination of (1) the fact that there appear to be various routes of accessing the standard input, depending on context, and (2) the use of pushback in the process of automatically figuring out the table format:

* read.table uses .Internal(readTableHead(...)) to get the first nlines lines of the table (nlines = 5).
* .Internal(readTableHead(...)) always returns nlines lines, adding empty lines if EOF comes before nlines lines are read.
* These lines, including any empty ones not originating from the file in the first place, are then pushed back twice.
* The first set of lines is always consumed by the subsequent code to figure out the number of columns.
* The second set is intended to be consumed by the regular operation of scan.
* However, if scan chokes before it can consume these lines, including the blank ones, these will be left in the pushback buffer.
* R's interactive fetch-parse-evaluate loop does not use the connection provided by stdin(), and therefore, the buffered stuff is not noticed until the next attempt to read from the stdin connection.
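The leftover effect described in the list above can be reproduced non-interactively. The following is only a sketch: a textConnection stands in for stdin() (an assumption, so that the example runs in a script), the padding is done by hand to mimic what readTableHead is described as doing, and readLines with n = 2 plays the role of a scan call that gives up at the second line.

```r
# Illustration only: a textConnection stands in for stdin().
con <- textConnection(c("1 2", "1"))

# Only two lines are available, although five are requested.
head_lines <- readLines(con, n = 5)

# Pad to five lines by hand, mimicking the behaviour the report
# attributes to readTableHead.
padded <- c(head_lines, rep("", 5 - length(head_lines)))

# Push the padded head back so that a later reader sees it again.
pushBack(padded, con)

# A reader that stops after the second line (as scan does on error)
# consumes only part of the pushback.
consumed <- readLines(con, n = 2)

# Whatever was not consumed is still waiting on the connection.
leftover <- readLines(con)
print(leftover)   # three empty strings, as in the report

close(con)
```

The three empty strings are the padding lines that never existed in the input, which matches the readLines(stdin()) output shown above.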
The strange effects reported above could probably be fixed by modifying the internal readTableHead function so that it does not pad its result with empty lines just to return the number of lines "requested" by the nlines parameter.

A more fundamental approach would be to avoid pushing back lines altogether. The repeated scanning of the first few lines could be done by using a textConnection instead. Some additional work will probably be necessary to combine the first few lines with the remaining lines, acquired by regular operation of scan, into the complete table.
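A rough user-level sketch of the textConnection idea follows. It is not a patch to read.table: the helper name read_table_no_pushback is hypothetical, and it simplifies the proposal by draining the source connection completely before parsing instead of combining a separately scanned head with the rest. Any pushback that read.table performs then happens on a throwaway textConnection, so a parse error cannot leave lines behind on the source.

```r
# Hypothetical helper, not part of R: parse a table without ever
# pushing lines back onto the source connection.
read_table_no_pushback <- function(con, ...) {
  lines <- readLines(con)      # drain the source up to EOF
  tc <- textConnection(lines)  # private copy of the input
  on.exit(close(tc))           # discarded even if parsing fails
  read.table(tc, ...)
}

# Broken input: the error is still raised, but nothing is left behind.
src <- textConnection(c("1 2", "1"))
res <- try(read_table_no_pushback(src), silent = TRUE)
left_on_source <- readLines(src)
close(src)

# Well-formed input is parsed as usual.
src <- textConnection(c("1 2", "3 4"))
ok <- read_table_no_pushback(src)
close(src)
print(ok)
```

The obvious cost of this simplification is that the whole input is held in memory as a character vector before it is parsed.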