thr3ads.net - R help - [R] retaining characters in a csv file [Sep 2015]

If this information is useful, please help other people find it:
Share via:

Therneau, Terry M., Ph.D.

2015-Sep-22 22:00 UTC

[R] retaining characters in a csv file

I have a csv file from an automatic process (so this will happen thousands of
times), for
which the first row is a vector of variable names and the second row often
starts
something like this:

5724550,"000202075214",2005.02.17,2005.02.17,"F", .....

Notice the second variable which is
       a character string (note the quotation marks)
       a sequence of numeric digits
       leading zeros are significant

The read.csv function insists on turning this into a numeric.  Is there any
simple set of
options that
will turn this behavior off?  I'm looking for a way to tell it to "obey
the bloody quotes"
-- I still want the first, third, etc columns to become numeric.  There can be
more than
one variable like this, and not always in the second position.

This happens deep inside the httr library; there is an easy way for me to add
more options
to the read.csv call but it is not so easy to replace it with something else.

Terry T

Rolf Turner

2015-Sep-22 22:33 UTC

head link

[R] retaining characters in a csv file

On 23/09/15 10:00, Therneau, Terry M., Ph.D. wrote:> I have a csv file from an automatic process (so this will happen
> thousands of times), for which the first row is a vector of variable
> names and the second row often starts something like this:
>
> 5724550,"000202075214",2005.02.17,2005.02.17,"F", .....
>
> Notice the second variable which is
>        a character string (note the quotation marks)
>        a sequence of numeric digits
>        leading zeros are significant
>
> The read.csv function insists on turning this into a numeric.  Is there
> any simple set of options that
> will turn this behavior off?  I'm looking for a way to tell it to
"obey
> the bloody quotes" -- I still want the first, third, etc columns to
> become numeric.  There can be more than one variable like this, and not
> always in the second position.
>
> This happens deep inside the httr library; there is an easy way for me
> to add more options to the read.csv call but it is not so easy to
> replace it with something else.
IMHO this is a bug in read.csv().

A possible workaround:

ccc <- c("integer","character",rep(NA,k))
X   <- read.csv("melvin.csv",colClasses=ccc)

where "melvin.csv" is the file from which you are attempting to read
and
where k+2 = the number of columns in that file.

Kludgey, but it might work.

Another workaround is to specify quote="", but this has the side
effect
of making the 5th column character rather than logical.

cheers,

Rolf

-- 
Technical Editor ANZJS
Department of Statistics
University of Auckland
Phone: +64-9-373-7599 ext. 88276

David Winsemius

2015-Sep-22 22:58 UTC

head link

[R] retaining characters in a csv file

On Sep 22, 2015, at 3:00 PM, Therneau, Terry M., Ph.D. wrote:
> I have a csv file from an automatic process (so this will happen thousands
of times), for which the first row is a vector of variable names and the second
row often starts something like this:
> 
> 5724550,"000202075214",2005.02.17,2005.02.17,"F", .....
> 
> Notice the second variable which is
>      a character string (note the quotation marks)
>      a sequence of numeric digits
>      leading zeros are significant
> 
> The read.csv function insists on turning this into a numeric.  Is there any
simple set of options that
> will turn this behavior off?  I'm looking for a way to tell it to
"obey the bloody quotes" -- I still want the first, third, etc columns
to become numeric.  There can be more than one variable like this, and not
always in the second position.
The last part about not knowing which col might be an issue might require
inputting everything with character class, but if there is a way to pass in a
colClasses argument this might help:
>
read.csv(text='5724550,"000202075214",2005.02.17,2005.02.17,"F"',
stringsAsFactors=FALSE, header=FALSE, colClasses=c("numeric",
rep("character", 4)))       V1           V2         V3         V4 V5
1 5724550 000202075214 2005.02.17 2005.02.17  F

Or you can create a class with an As method:
> setClass('myChar')
> setAs('character', 'myChar', def=function(from, to ) to
<- I(from))
>
read.csv(text='5724550,"000202075214",2005.02.17,2005.02.17,"F"',
stringsAsFactors=FALSE, header=FALSE, colClasses=c("numeric",
rep('myChar',4)) )       V1           V2         V3         V4 V5
1 5724550 000202075214 2005.02.17 2005.02.17  F

(Neither of the third or fourth columns makes sense as a numeric, so now
illustrating coercion to Date.)
> setClass('dotDate')
> setAs('character', 'dotDate', def=function(from, to ) to
<- as.Date(from, "%Y.%m.%d")  )
>
read.csv(text='5724550,"000202075214",2005.02.17,2005.02.17,"F"',
stringsAsFactors=FALSE, header=FALSE, colClasses=c("numeric",
"character", rep('dotDate',2), "character") )       V1           V2         V3         V4 V5
1 5724550 000202075214 2005-02-17 2005-02-17  F

> 
> This happens deep inside the httr library; there is an easy way for me to
add more options to the read.csv call but it is not so easy to replace it with
something else.
> 
> Terry T
> 
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
David Winsemius
Alameda, CA, USA

Arunkumar Srinivasan

2015-Sep-22 23:11 UTC

head link

[R] retaining characters in a csv file

data.table's fread reads this as expected. Quoted strings aren't
coerced.

sapply(fread('5724550,"000202075214",2005.02.17,2005.02.17,"F"\n'),
class)
#          V1          V2          V3          V4          V5
#   "integer" "character" "character"
"character" "character"

Best,
Arun.

On Wed, Sep 23, 2015 at 12:00 AM, Therneau, Terry M., Ph.D.
<therneau at mayo.edu> wrote:> I have a csv file from an automatic process (so this will happen thousands
> of times), for which the first row is a vector of variable names and the
> second row often starts something like this:
>
> 5724550,"000202075214",2005.02.17,2005.02.17,"F", .....
>
> Notice the second variable which is
>       a character string (note the quotation marks)
>       a sequence of numeric digits
>       leading zeros are significant
>
> The read.csv function insists on turning this into a numeric.  Is there any
> simple set of options that
> will turn this behavior off?  I'm looking for a way to tell it to
"obey the
> bloody quotes" -- I still want the first, third, etc columns to become
> numeric.  There can be more than one variable like this, and not always in
> the second position.
>
> This happens deep inside the httr library; there is an easy way for me to
> add more options to the read.csv call but it is not so easy to replace it
> with something else.
>
> Terry T
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

peter dalgaard

2015-Sep-22 23:19 UTC

head link

[R] retaining characters in a csv file

> On 23 Sep 2015, at 00:33 , Rolf Turner <r.turner at auckland.ac.nz>
wrote:
> 
[read.csv() doesn't distinguish "123.4" from 123.4]
> IMHO this is a bug in read.csv().
> 
Dunno about that:

pd$ cat ~/tmp/junk.csv 
"1";1
2;"2"
pd$ open !$
open ~/tmp/junk.csv

And lo and behold, Excel opens with 

1 1
2 2

and all cells numeric.

I don't think the CSV standard (if there is one...) specifies that quoted
strings are necessarily text.

I think we have been here before, and found that even if we decide that it is a
bug (or misfeature), it would be hard to change, because the modus operandi of
read.* is to first read everything as character and _then_ see (in
type.convert()) which entries can be converted to numeric, logical, etc.

-pd

-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com

Duncan Murdoch

2015-Sep-23 00:36 UTC

head link

[R] retaining characters in a csv file

On 22/09/2015 6:00 PM, Therneau, Terry M., Ph.D. wrote:> I have a csv file from an automatic process (so this will happen thousands
of times), for
> which the first row is a vector of variable names and the second row often
starts
> something like this:
> 
> 5724550,"000202075214",2005.02.17,2005.02.17,"F", .....
> 
> Notice the second variable which is
>        a character string (note the quotation marks)
>        a sequence of numeric digits
>        leading zeros are significant
> 
> The read.csv function insists on turning this into a numeric. 
No it doesn't.  All you need to do is specify colClasses and it will
follow your instructions.


 Is there any simple set of> options that
> will turn this behavior off?  I'm looking for a way to tell it to
"obey the bloody quotes"
> -- I still want the first, third, etc columns to become numeric.  There can
be more than
> one variable like this, and not always in the second position.
No, because the bloody quotes are part of the "csv standard".  They
aren't meaningful.

If you don't know what the data is, that's your fault.  You
shouldn't be
analyzing data when you are so ignorant.

Duncan Murdoch
> This happens deep inside the httr library; there is an easy way for me to
add more options
> to the read.csv call but it is not so easy to replace it with something
else.
> 
> Terry T
> 
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

Duncan Murdoch

2015-Sep-23 00:48 UTC

head link

[R] retaining characters in a csv file

On 22/09/2015 6:33 PM, Rolf Turner wrote:> On 23/09/15 10:00, Therneau, Terry M., Ph.D. wrote:
>> I have a csv file from an automatic process (so this will happen
>> thousands of times), for which the first row is a vector of variable
>> names and the second row often starts something like this:
>>
>> 5724550,"000202075214",2005.02.17,2005.02.17,"F",
.....
>>
>> Notice the second variable which is
>>        a character string (note the quotation marks)
>>        a sequence of numeric digits
>>        leading zeros are significant
>>
>> The read.csv function insists on turning this into a numeric.  Is there
>> any simple set of options that
>> will turn this behavior off?  I'm looking for a way to tell it to
"obey
>> the bloody quotes" -- I still want the first, third, etc columns
to
>> become numeric.  There can be more than one variable like this, and not
>> always in the second position.
>>
>> This happens deep inside the httr library; there is an easy way for me
>> to add more options to the read.csv call but it is not so easy to
>> replace it with something else.
> 
> IMHO this is a bug in read.csv().
No, it's a bug in "Rolf Turner", who believes in fairies at the
end of
his garden, rather than in documentation for file formats.

Duncan Murdoch
> 
> A possible workaround:
> 
> ccc <- c("integer","character",rep(NA,k))
> X   <- read.csv("melvin.csv",colClasses=ccc)
> 
> where "melvin.csv" is the file from which you are attempting to
read and
> where k+2 = the number of columns in that file.
> 
> Kludgey, but it might work.
> 
> Another workaround is to specify quote="", but this has the side
effect
> of making the 5th column character rather than logical.
> 
> cheers,
> 
> Rolf
>

Therneau, Terry M., Ph.D.

2015-Sep-23 12:57 UTC

head link

[R] retaining characters in a csv file

Thanks for all for the comments, I hadn't intended to start a war.

My summary:
   1. Most important: I wasn't missing something obvious.  This is always my
first
suspicion when I submit something to R-help, and it's true more often than
not.

   2. Obviously (at least it is now), the CSV standard does not specify that
quotes should
force a character result.  R is not "wrong".  Wrt to using what Excel
does as litmus test,
I consider that to be totally uninformative about standards: neither pro (like
Duncan) or
anti (like Rolf), but simply irrelevant.  (Like many MS choices.)

   3. I'll have to code in my own solution, either pre-scan the first few
lines to create
a colClasses, or use read_csv from the readr library (if there are leading zeros
it keeps
the string as character, which may suffice for my needs), or something else.

   4. The source of the data is a "text/csv" field coming from an http
POST request.  This
is an internal service on an internal Mayo server and coded by our own IT
department; this
will not be the first case where I have found that their definition of
"csv" is not quite
standard.

Terry T.

> On 23/09/15 10:00, Therneau, Terry M., Ph.D. wrote:
>> I have a csv file from an automatic process (so this will happen
>> thousands of times), for which the first row is a vector of variable
>> names and the second row often starts something like this:
>>
>> 5724550,"000202075214",2005.02.17,2005.02.17,"F",
.....
>>
>> Notice the second variable which is
>>        a character string (note the quotation marks)
>>        a sequence of numeric digits
>>        leading zeros are significant
>>
>> The read.csv function insists on turning this into a numeric.  Is there
>> any simple set of options that
>> will turn this behavior off?  I'm looking for a way to tell it to
"obey
>> the bloody quotes" -- I still want the first, third, etc columns
to
>> become numeric.  There can be more than one variable like this, and not
>> always in the second position.
>>
>> This happens deep inside the httr library; there is an easy way for me
>> to add more options to the read.csv call but it is not so easy to
>> replace it with something else.

John Kane

2015-Sep-23 18:22 UTC

head link

[R] retaining characters in a csv file

> -----Original Message-----
> From: r.turner at auckland.ac.nz
> Sent: Wed, 23 Sep 2015 13:26:58 +1200
> To: pdalgd at gmail.com
   ..........> I would say that this phenomenon ("Excel does it") is
*overwhelming*
> evidence that it is bad practice!!! :-)
Fortune?

____________________________________________________________
Can't remember your password? Do you need a strong and secure password?
Use Password manager! It stores your passwords & protects your account.

R help - Sep 2015 - retaining characters in a csv file

[R] retaining characters in a csv file

[R] retaining characters in a csv file

[R] retaining characters in a csv file

[R] retaining characters in a csv file

[R] retaining characters in a csv file

[R] retaining characters in a csv file

[R] retaining characters in a csv file

[R] retaining characters in a csv file

[R] retaining characters in a csv file