thr3ads.net - R devel - [Rd] read.csv reads more rows than indicated by wc -l [Dec 2012]

G See

2012-Dec-19 13:37 UTC

[Rd] read.csv reads more rows than indicated by wc -l

When I have a csv file that is more than 6 lines long, not including
the header, and one of the fields is blank for the last few lines, and
there is an extra comma on one of the lines with the blank field,
read.csv() creates an extra row.

I attached an example file; I'll also paste the contents here:

A,apple
A,orange
A,orange
A,orange
A,orange
A,,,
A,,

-----
wc -l reports that this file has 7 lines

R> system("wc -l test.csv")
7 test.csv

But read.csv reads 8 rows.

R> read.csv("test.csv", header=FALSE, stringsAsFactors=FALSE)
  V1     V2
1  A  apple
2  A orange
3  A orange
4  A orange
5  A orange
6  A
7
8  A
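
The same comparison can be made from inside R without shelling out to wc. This is only a sketch: the temporary file path is a stand-in for test.csv, and the row count that read.csv reports may depend on the R version.

```r
# Write the seven-line example to a temporary file (stand-in for test.csv).
path <- tempfile(fileext = ".csv")
writeLines(c("A,apple", "A,orange", "A,orange", "A,orange", "A,orange",
             "A,,,", "A,,"), path)

# Physical lines in the file, comparable to what wc -l counts.
n_lines <- length(readLines(path))

# Rows as parsed by read.csv with the same arguments used above.
dat <- read.csv(path, header = FALSE, stringsAsFactors = FALSE)

cat("lines in file:", n_lines, "\n")
cat("rows read:    ", nrow(dat), "\n")

unlink(path)
```

If the two numbers differ, some line did not fit the column layout that read.csv inferred.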

If I increase the number of commas at the end of that line, the
number of rows increases as well.

This R command to read a 7 line csv:

read.csv(header=FALSE, text="A,apple
A,orange
A,orange
A,orange
A,orange
A,,,,,
A,,")

will produce this:

  V1     V2
1  A  apple
2  A orange
3  A orange
4  A orange
5  A orange
6  A
7
8
9  A


But if the file has fewer than 7 lines, the extra commas add columns
instead of rows.

This R command to read a 6 line csv:
read.csv(header=FALSE, text="A,apple
A,orange
A,orange
A,orange
A,,,,,
A,,")

will produce this:

  V1     V2 V3 V4 V5 V6
1  A  apple NA NA NA NA
2  A orange NA NA NA NA
3  A orange NA NA NA NA
4  A orange NA NA NA NA
5  A        NA NA NA NA
6  A        NA NA NA NA



Is this intended behavior?

Thanks,
Garrett See

R> version
               _
platform       x86_64-pc-linux-gnu
arch           x86_64
os             linux-gnu
system         x86_64, linux-gnu
status
major          2
minor          15.2
year           2012
month          10
day            26
svn rev        61015
language       R
version.string R version 2.15.2 (2012-10-26)
nickname       Trick or Treat

R> sessionInfo()
R version 2.15.2 (2012-10-26)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=C                 LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

Ben Bolker

2012-Dec-19 16:43 UTC


G See <gsee000 <at> gmail.com> writes:
 [snip]

I don't know if it's exactly *intended* or not, but I think it's
more or less as [IMPLICITLY] documented.  From ?read.table,

     The number of data columns is determined by looking at the first
     five lines of input (or the whole file if it has less than five
     lines), or from the length of ‘col.names’ if it is specified and
     is longer.  This could conceivably be wrong if ‘fill’ or
     ‘blank.lines.skip’ are true, so specify ‘col.names’ if necessary
     (as in the ‘Examples’).

txt <- "A,apple
A,orange
A,orange
A,orange
A,orange
A,,,,,
A,,"
read.csv(header=FALSE, text=txt)

What is happening here is that
(1) read.table is determining from the first five lines that
there are two columns;
(2) when it gets to line six, it reads each set of two fields as a
separate row
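
To make the two steps concrete, here is a minimal sketch that counts the fields on each line and applies the documented "first five lines" rule by hand. It approximates what the help page describes and is not read.table's actual internals; the names n_fields, n_cols and too_wide are made up for illustration.

```r
txt <- paste(c("A,apple", "A,orange", "A,orange", "A,orange", "A,orange",
               "A,,,,,", "A,,"), collapse = "\n")

con <- textConnection(txt)
n_fields <- count.fields(con, sep = ",")   # number of fields on each line
close(con)

# Documented rule: the column count comes from the first five lines only
# (or from the whole input if it is shorter than that).
n_cols <- max(n_fields[seq_len(min(5, length(n_fields)))])

# Lines with more fields than that do not fit the inferred layout.
too_wide <- which(n_fields > n_cols)

print(n_fields)
print(n_cols)
print(too_wide)
```

Any line listed in too_wide is a candidate for being wrapped into additional rows when fill=TRUE.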

If you try

read.csv(header=FALSE, text=txt, fill=FALSE,blank.lines.skip=FALSE)

you at least get an error.

But it gets worse:
    
txt2 <- "A,apple
A,orange
A,orange
A,orange
A,orange
A,b,c,d,e,f
A,g"

read.csv(header=FALSE, text=txt2, fill=FALSE,blank.lines.skip=FALSE)

produces bad results even though fill=FALSE and blank.lines.skip=FALSE ...

Even specifying col.names explicitly doesn't help:

read.csv(header=FALSE, text=txt2, col.names=paste0("V",1:2))

At least count.fields() does detect a problem ...

count.fields(textConnection(txt2),sep=",")
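
Building on that, one possible guard is to run count.fields() first and refuse to read input whose lines disagree on the number of fields. The wrapper below is a hypothetical helper (read_csv_checked is not part of R), sketched under the assumption that the input has no header and fits in a single character string.

```r
# Hypothetical helper: check field counts before handing the text to read.csv.
read_csv_checked <- function(txt, sep = ",") {
  con <- textConnection(txt)
  on.exit(close(con))
  n <- count.fields(con, sep = sep, quote = "\"")
  n <- n[!is.na(n)]   # NA marks lines that sit inside a multi-line quoted field
  if (length(unique(n)) > 1L) {
    stop("inconsistent field counts per line: ", paste(n, collapse = " "))
  }
  read.csv(text = txt, header = FALSE, sep = sep, stringsAsFactors = FALSE)
}

txt2 <- paste(c("A,apple", "A,orange", "A,orange", "A,orange", "A,orange",
                "A,b,c,d,e,f", "A,g"), collapse = "\n")

ok  <- read_csv_checked("A,apple\nA,orange")      # consistent: read normally
bad <- try(read_csv_checked(txt2), silent = TRUE) # ragged: stops with an error
```

This costs an extra pass over the input, which seems a reasonable price for not getting silently reshaped data.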

Somewhere on my wish/TO DO list is for someone to rewrite read.table for
better robustness *and* efficiency ...

Matthew Dowle

2012-Dec-20 23:46 UTC


Ben,
> Somewhere on my wish/TO DO list is for someone to rewrite read.table
> for better robustness *and* efficiency ...
Wish granted. New in data.table 1.8.7 :

==== New function fread(), a fast and friendly file reader.
*  header, skip, nrows, sep and colClasses are all auto detected.
*  integers>2^31 are detected and read natively as bit64::integer64.
*  accepts filenames, URLs and "A,B\n1,2\n3,4" directly
*  new implementation entirely in C
*  with a 50MB .csv, 1 million rows x 6 columns :
      read.csv("test.csv")                                        # 30-60 sec
      read.table("test.csv",<all known tricks and known nrows>)   # 10 sec
      fread("test.csv")                                           # 3 sec
*  airline data: 658MB csv (7 million rows x 29 columns)
      read.table("2008.csv",<all known tricks and known nrows>)   # 360 sec
      fread("2008.csv")                                           # 50 sec
See ?fread. Many thanks to Chris Neff and Garrett See for ideas, discussions
and beta testing.
====
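
For readers wondering what "all known tricks and known nrows" might look like in base R, here is a small sketch. The post does not spell the tricks out, so the choice of colClasses, nrows and comment.char = "" is an assumption taken from the advice in ?read.table; the generated file and its column names are made up for illustration, and no timings are claimed.

```r
set.seed(1)
n <- 1000L
path <- tempfile(fileext = ".csv")   # hypothetical stand-in for test.csv
write.csv(data.frame(id = seq_len(n), x = rnorm(n),
                     g = sample(letters, n, replace = TRUE)),
          path, row.names = FALSE)

# Plain call: column types and row count are worked out while reading.
t_plain <- system.time(
  plain <- read.csv(path, stringsAsFactors = FALSE)
)

# Tuned call (assumed tricks): declare the column classes, give the number
# of data rows up front, and switch off comment-character scanning.
t_tuned <- system.time(
  tuned <- read.table(path, header = TRUE, sep = ",", quote = "\"",
                      colClasses = c("integer", "numeric", "character"),
                      nrows = n, comment.char = "",
                      stringsAsFactors = FALSE)
)

print(rbind(plain = t_plain, tuned = t_tuned))
unlink(path)
```

On a file this small the difference is negligible; the hints only start to matter at sizes like the ones quoted above.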
The help page ?fread is fairly well developed :
https://r-forge.r-project.org/scm/viewvc.php/pkg/man/fread.Rd?view=markup&root=datatable

Comments, feedback and bug reports very welcome.

Matthew

http://datatable.r-forge.r-project.org/
