On Sun, Jun 14, 2009 at 02:56:01PM -0400, Gabor Grothendieck wrote:
> If read.csv's colClasses= argument is NOT used then read.csv accepts
> double quoted numerics:
>
> > read.csv(stdin())
> 0: A,B
> 1: "1",1
> 2: "2",2
> 3:
> A B
> 1 1 1
> 2 2 2
>
> However, if colClasses is used then it seems that it does not:
>
> > read.csv(stdin(), colClasses = "numeric")
> 0: A,B
> 1: "1",1
> 2: "2",2
> 3:
> Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
>   scan() expected 'a real', got '"1"'
>
> Is this really intended? I would have expected that a csv file in which
> each field is surrounded with double quotes is acceptable in both
> cases. This may be documented as is, yet it seems undesirable from
> both a consistency viewpoint and the viewpoint that it should be
> possible to double-quote fields in a csv file.
The problem is not specific to read.csv(). The same difference appears
for read.table().
read.table(stdin())
"1" 1
2 "2"
# V1 V2
# 1 1 1
# 2 2 2
but
read.table(stdin(), colClasses = "numeric")
"1" 1
2 "2"
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
scan() expected 'a real', got '"1"'
The error occurs in the call of scan() at line 152 of
src/library/utils/R/readtable.R, which is

    data <- scan(file = file, what = what, sep = sep, quote = quote, ...

(This is the third call of scan() in the source code of read.table().)
In this call, scan() receives the column types in the "what" argument. If a
type is specified, scan() performs the conversion itself and fails when a
numeric field is quoted. If the type is not specified, the output of scan()
is of type character, but with any quotes removed. Columns of unknown type
are then converted using type.convert(), which therefore receives the data
already without quotes.
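The difference can be seen directly in scan() itself. The following is a
minimal sketch of my own (not code from readtable.R), using a textConnection
in place of stdin():

```r
## A character 'what' strips the quotes, so type.convert() later sees
## clean values; a numeric 'what' chokes on the quoted field.
txt <- '"1" 1\n2 "2"\n'

scan(textConnection(txt), what = character())
# "1" "1" "2" "2"   -- quotes have been removed

type.convert(scan(textConnection(txt), what = character()), as.is = TRUE)
# 1 1 2 2           -- conversion succeeds on the unquoted values

try(scan(textConnection(txt), what = numeric()))
# Error: scan() expected 'a real', got '"1"'
```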
The call of type.convert() is contained in a loop

    for (i in (1L:cols)[do]) {
        data[[i]] <-
            if (is.na(colClasses[i]))
                type.convert(data[[i]], as.is = as.is[i], dec = dec,
                             na.strings = character(0L))
            ## as na.strings have already been converted to <NA>
            else if (colClasses[i] == "factor") as.factor(data[[i]])
            else if (colClasses[i] == "Date") as.Date(data[[i]])
            else if (colClasses[i] == "POSIXct") as.POSIXct(data[[i]])
            else methods::as(data[[i]], colClasses[i])
    }
This loop also contains branches that could perform the conversion for
columns with a specified type, but these branches are never reached, since
the vector "do" is defined as

    do <- keep & !known

where "known" marks the columns whose type is already known.
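To make this concrete, here is a small illustration of my own (with
hypothetical values, mirroring the logic in readtable.R) of why a column
with a basic type in colClasses never enters the conversion loop:

```r
## NA means the type of that column was left unspecified.
colClasses <- c("numeric", NA, "Date")
keep  <- rep(TRUE, 3)          # no column dropped via colClasses = "NULL"
known <- colClasses %in%
    c("logical", "integer", "numeric", "complex", "character")
do <- keep & !known
do
# FALSE TRUE TRUE -- the "numeric" column is excluded from the loop,
# so its quoted fields are handled (and rejected) by scan() instead
```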
It is possible to modify the code so that scan() is called with all types
unspecified, leaving the conversion to the as.factor(), as.Date(),
as.POSIXct() and methods::as() branches shown above. Since this path is
already present in the code, the patch is very simple:
--- R-devel/src/library/utils/R/readtable.R	2009-05-18 17:53:08.000000000 +0200
+++ R-devel-readtable/src/library/utils/R/readtable.R	2009-06-25 10:20:06.000000000 +0200
@@ -143,9 +143,6 @@
     names(what) <- col.names
     colClasses[colClasses %in% c("real", "double")] <- "numeric"
-    known <- colClasses %in%
-        c("logical", "integer", "numeric", "complex", "character")
-    what[known] <- sapply(colClasses[known], do.call, list(0))
     what[colClasses %in% "NULL"] <- list(NULL)
     keep <- !sapply(what, is.null)
@@ -189,7 +186,7 @@
         stop(gettextf("'as.is' has the wrong length %d != cols = %d",
                       length(as.is), cols), domain = NA)
-    do <- keep & !known # & !as.is
+    do <- keep & !as.is
     if(rlabp) do[1L] <- FALSE # don't convert "row.names"
     for (i in (1L:cols)[do]) {
         data[[i]] <-
(Also in attachment)
I did a test as follows:
d1 <- read.table(stdin())
"1" TRUE 3.5
2 NA "0.1"
NA FALSE 0.1
3 "TRUE" NA
sapply(d1, typeof)
# V1 V2 V3
# "integer" "logical" "double"
is.na(d1)
# V1 V2 V3
# [1,] FALSE FALSE FALSE
# [2,] FALSE TRUE FALSE
# [3,] TRUE FALSE FALSE
# [4,] FALSE FALSE TRUE
d2 <- read.table(stdin(), colClasses=c("integer", "logical", "double"))
"1" TRUE 3.5
2 NA "0.1"
NA FALSE 0.1
3 "TRUE" NA
sapply(d2, typeof)
# V1 V2 V3
# "integer" "logical" "double"
is.na(d2)
# V1 V2 V3
# [1,] FALSE FALSE FALSE
# [2,] FALSE TRUE FALSE
# [3,] TRUE FALSE FALSE
# [4,] FALSE FALSE TRUE
I think there was a reason to let scan() perform the type conversion; for
example, it may be more efficient. So, if correct, the above patch is one
possible solution, but another may be more appropriate. In particular, the
function scan() could be modified to remove quotes also from fields
specified as numeric.
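In the meantime, a user-side workaround is possible without patching R. This
is my own sketch, not part of the patch: read the affected columns as
"character", which makes scan() strip the quotes, and convert afterwards.

```r
## Read quoted numeric columns as character so scan() removes the quotes,
## then perform the numeric conversion explicitly.
txt <- '"1" 1\n2 "2"\n'
d <- read.table(textConnection(txt), colClasses = "character")
d[] <- lapply(d, as.numeric)
sapply(d, typeof)
# both columns are now "double", with no error raised
```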
Petr.