thr3ads.net - R devel - [Rd] read.table / type.convert with NA values [Jun 2010]

Erik Iverson

2010-Jun-29 20:41 UTC

[Rd] read.table / type.convert with NA values

Hello,

While assisting a fellow R-helper off list, I narrowed down an issue he 
was having to the following behavior of type.convert, called through 
read.table.  This is using R 2.10.1; apologies if newer versions don't 
exhibit this behavior.

# generates numeric vector
 > type.convert(c("123.42", "NA"))
[1] 123.42     NA

# generates a numeric vector, notice the spaces around 123.42
 > type.convert(c(" 123.42 ", "NA"))
[1] 123.42     NA

# generates factor, notice the space before NA
# note that the 2nd element is actually " NA", not a true NA value
 > type.convert(c("123.42", " NA"))
[1] 123.42  NA
Levels: 123.42  NA
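
One way to sidestep the third case is to trim the strings before converting them. The following is only a sketch: it assumes trimws(), which was added to base R well after 2.10.1, and it passes as.is = TRUE so that unconverted input stays character rather than becoming a factor.

```r
# Character data as read.table might hand it over; note the leading space in " NA".
x <- c("123.42", " NA")

# Untrimmed: " NA" does not match the default na.strings = "NA",
# so the vector is not converted to numeric and stays character.
raw <- type.convert(x, as.is = TRUE)

# Trimmed first: "NA" is now recognised as missing and the result is numeric.
trimmed <- type.convert(trimws(x), as.is = TRUE)

print(class(raw))
print(class(trimmed))
print(is.na(trimmed[2]))
```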


How can this affect read.table/read.csv use 'in the wild'?

This gentleman had a data file that

1) was delimited by something other than white space, CSV in his case
2) contained missing values, designated by NA in his case
3) contained white space between delimiters and data values, e.g.,

NA,     NA,    4.5,    NA

as opposed to

NA,NA,4.5,NA


With these 3 conditions met, read.table gives type.convert a character 
vector like my third example above, and ultimately he got a data.frame 
consisting of only factors when we were expecting numeric columns.  This 
was easily fixed either by modifying the read.csv function call to 
specify colClasses directly, or in his case, strip.white = TRUE did the 
job just fine.
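
The two fixes can be compared side by side on a small made-up file. This is a sketch under assumptions: the column names and values are invented for illustration, and the text = argument of read.csv is used for convenience even though it may not exist in R 2.10.1 (a textConnection would serve the same purpose there).

```r
# Hypothetical CSV content with padding after each comma, as in the case above.
csv <- paste(
  "a,b,c,d",
  "NA,     NA,    4.5,    NA",
  "1.5,    2,     NA,     3",
  sep = "\n"
)

# Default call: padded "NA" fields are kept as text, so b, c and d are not numeric.
default <- read.csv(text = csv)

# Fix 1: strip white space from fields before type conversion.
stripped <- read.csv(text = csv, strip.white = TRUE)

# Fix 2: declare the column types up front.
typed <- read.csv(text = csv, colClasses = "numeric")

print(sapply(default, is.numeric))
print(sapply(stripped, is.numeric))
print(sapply(typed, is.numeric))
```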

I believe the confusion stems from the fact that with no NA values in 
our data file, this would work as we would expect.  The introduction of 
what we thought were NA values changed the behavior.  In reality, these 
were not being treated as NA values by read.table/type.convert.  The 
question is, should they be in this case?

This behavior of read.table/type.convert may very well be what is 
expected/needed.  If so, this note could still be of use to someone in 
the future if they stumble upon similar behavior.  The fact I wasn't 
able to uncover anyone who asked about it on list before probably means 
the situation is rare.

Best Regards,
Erik Iverson

Matt Shotwell

2010-Jun-29 23:14 UTC

The document RFC 4180 (which appears to be the CSV standard used by R,
see ?read.table) considers spaces to be part of the fielded value. Some
have taken this to mean that all white space characters should be
considered part of the fielded value, though the RFC is not explicit
here. Hence, this behavior is in compliance with the "standard" for CSV
files. It seems that R treats '\t' (and perhaps all?) separated value
files the same way by default.
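
A small illustration of the default treatment of spaces in character fields, with made-up data. The column names and values are hypothetical, and text = and stringsAsFactors = FALSE are used only to keep the sketch short and the result independent of the R version's defaults.

```r
# Hypothetical CSV where the second field carries leading spaces.
csv <- paste(
  "name,city",
  "Ann, Oslo",
  "Bob,  Rome",
  sep = "\n"
)

# Default: leading spaces are kept as part of the field value.
kept <- read.csv(text = csv, stringsAsFactors = FALSE)

# strip.white = TRUE removes leading and trailing white space from unquoted fields.
stripped <- read.csv(text = csv, strip.white = TRUE, stringsAsFactors = FALSE)

print(kept$city)
print(stripped$city)
```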

The RFC is very short and easy to read if you're interested.
http://tools.ietf.org/html/rfc4180

-Matt

Matthew S. Shotwell
Graduate Student
Division of Biostatistics and Epidemiology
Medical University of South Carolina
http://biostatmatt.com
