I have a csv file from an automatic process (so this will happen thousands of times), for which the first row is a vector of variable names and the second row often starts something like this: 5724550,"000202075214",2005.02.17,2005.02.17,"F", ..... Notice the second variable which is a character string (note the quotation marks) a sequence of numeric digits leading zeros are significant The read.csv function insists on turning this into a numeric. Is there any simple set of options that will turn this behavior off? I'm looking for a way to tell it to "obey the bloody quotes" -- I still want the first, third, etc columns to become numeric. There can be more than one variable like this, and not always in the second position. This happens deep inside the httr library; there is an easy way for me to add more options to the read.csv call but it is not so easy to replace it with something else. Terry T
On 23/09/15 10:00, Therneau, Terry M., Ph.D. wrote:> I have a csv file from an automatic process (so this will happen > thousands of times), for which the first row is a vector of variable > names and the second row often starts something like this: > > 5724550,"000202075214",2005.02.17,2005.02.17,"F", ..... > > Notice the second variable which is > a character string (note the quotation marks) > a sequence of numeric digits > leading zeros are significant > > The read.csv function insists on turning this into a numeric. Is there > any simple set of options that > will turn this behavior off? I'm looking for a way to tell it to "obey > the bloody quotes" -- I still want the first, third, etc columns to > become numeric. There can be more than one variable like this, and not > always in the second position. > > This happens deep inside the httr library; there is an easy way for me > to add more options to the read.csv call but it is not so easy to > replace it with something else.IMHO this is a bug in read.csv(). A possible workaround: ccc <- c("integer","character",rep(NA,k)) X <- read.csv("melvin.csv",colClasses=ccc) where "melvin.csv" is the file from which you are attempting to read and where k+2 = the number of columns in that file. Kludgey, but it might work. Another workaround is to specify quote="", but this has the side effect of making the 5th column character rather than logical. cheers, Rolf -- Technical Editor ANZJS Department of Statistics University of Auckland Phone: +64-9-373-7599 ext. 88276
On Sep 22, 2015, at 3:00 PM, Therneau, Terry M., Ph.D. wrote:> I have a csv file from an automatic process (so this will happen thousands of times), for which the first row is a vector of variable names and the second row often starts something like this: > > 5724550,"000202075214",2005.02.17,2005.02.17,"F", ..... > > Notice the second variable which is > a character string (note the quotation marks) > a sequence of numeric digits > leading zeros are significant > > The read.csv function insists on turning this into a numeric. Is there any simple set of options that > will turn this behavior off? I'm looking for a way to tell it to "obey the bloody quotes" -- I still want the first, third, etc columns to become numeric. There can be more than one variable like this, and not always in the second position.The last part about not knowing which col might be an issue might require inputting everything with character class, but if there is a way to pass in a colClasses argument this might help:> read.csv(text='5724550,"000202075214",2005.02.17,2005.02.17,"F"', stringsAsFactors=FALSE, header=FALSE, colClasses=c("numeric", rep("character", 4)))V1 V2 V3 V4 V5 1 5724550 000202075214 2005.02.17 2005.02.17 F Or you can create a class with an As method:> setClass('myChar') > setAs('character', 'myChar', def=function(from, to ) to <- I(from)) > read.csv(text='5724550,"000202075214",2005.02.17,2005.02.17,"F"', stringsAsFactors=FALSE, header=FALSE, colClasses=c("numeric", rep('myChar',4)) )V1 V2 V3 V4 V5 1 5724550 000202075214 2005.02.17 2005.02.17 F (Neither of the third or fourth columns makes sense as a numeric, so now illustrating coercion to Date.)> setClass('dotDate') > setAs('character', 'dotDate', def=function(from, to ) to <- as.Date(from, "%Y.%m.%d") )> read.csv(text='5724550,"000202075214",2005.02.17,2005.02.17,"F"', stringsAsFactors=FALSE, header=FALSE, colClasses=c("numeric", "character", rep('dotDate',2), "character") )V1 V2 V3 V4 V5 1 5724550 000202075214 2005-02-17 2005-02-17 F> > This happens deep inside the httr library; there is an easy way for me to add more options to the read.csv call but it is not so easy to replace it with something else. > > Terry T > > ______________________________________________ > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code.David Winsemius Alameda, CA, USA
data.table's fread reads this as expected. Quoted strings aren't coerced. sapply(fread('5724550,"000202075214",2005.02.17,2005.02.17,"F"\n'), class) # V1 V2 V3 V4 V5 # "integer" "character" "character" "character" "character" Best, Arun. On Wed, Sep 23, 2015 at 12:00 AM, Therneau, Terry M., Ph.D. <therneau at mayo.edu> wrote:> I have a csv file from an automatic process (so this will happen thousands > of times), for which the first row is a vector of variable names and the > second row often starts something like this: > > 5724550,"000202075214",2005.02.17,2005.02.17,"F", ..... > > Notice the second variable which is > a character string (note the quotation marks) > a sequence of numeric digits > leading zeros are significant > > The read.csv function insists on turning this into a numeric. Is there any > simple set of options that > will turn this behavior off? I'm looking for a way to tell it to "obey the > bloody quotes" -- I still want the first, third, etc columns to become > numeric. There can be more than one variable like this, and not always in > the second position. > > This happens deep inside the httr library; there is an easy way for me to > add more options to the read.csv call but it is not so easy to replace it > with something else. > > Terry T > > ______________________________________________ > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code.
> On 23 Sep 2015, at 00:33 , Rolf Turner <r.turner at auckland.ac.nz> wrote: >[read.csv() doesn't distinguish "123.4" from 123.4]> IMHO this is a bug in read.csv(). >Dunno about that: pd$ cat ~/tmp/junk.csv "1";1 2;"2" pd$ open !$ open ~/tmp/junk.csv And lo and behold, Excel opens with 1 1 2 2 and all cells numeric. I don't think the CSV standard (if there is one...) specifies that quoted strings are necessarily text. I think we have been here before, and found that even if we decide that it is a bug (or misfeature), it would be hard to change, because the modus operandi of read.* is to first read everything as character and _then_ see (in type.convert()) which entries can be converted to numeric, logical, etc. -pd -- Peter Dalgaard, Professor, Center for Statistics, Copenhagen Business School Solbjerg Plads 3, 2000 Frederiksberg, Denmark Phone: (+45)38153501 Email: pd.mes at cbs.dk Priv: PDalgd at gmail.com
On 22/09/2015 6:00 PM, Therneau, Terry M., Ph.D. wrote:> I have a csv file from an automatic process (so this will happen thousands of times), for > which the first row is a vector of variable names and the second row often starts > something like this: > > 5724550,"000202075214",2005.02.17,2005.02.17,"F", ..... > > Notice the second variable which is > a character string (note the quotation marks) > a sequence of numeric digits > leading zeros are significant > > The read.csv function insists on turning this into a numeric.No it doesn't. All you need to do is specify colClasses and it will follow your instructions. Is there any simple set of> options that > will turn this behavior off? I'm looking for a way to tell it to "obey the bloody quotes" > -- I still want the first, third, etc columns to become numeric. There can be more than > one variable like this, and not always in the second position.No, because the bloody quotes are part of the "csv standard". They aren't meaningful. If you don't know what the data is, that's your fault. You shouldn't be analyzing data when you are so ignorant. Duncan Murdoch> This happens deep inside the httr library; there is an easy way for me to add more options > to the read.csv call but it is not so easy to replace it with something else. > > Terry T > > ______________________________________________ > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >
On 22/09/2015 6:33 PM, Rolf Turner wrote:> On 23/09/15 10:00, Therneau, Terry M., Ph.D. wrote: >> I have a csv file from an automatic process (so this will happen >> thousands of times), for which the first row is a vector of variable >> names and the second row often starts something like this: >> >> 5724550,"000202075214",2005.02.17,2005.02.17,"F", ..... >> >> Notice the second variable which is >> a character string (note the quotation marks) >> a sequence of numeric digits >> leading zeros are significant >> >> The read.csv function insists on turning this into a numeric. Is there >> any simple set of options that >> will turn this behavior off? I'm looking for a way to tell it to "obey >> the bloody quotes" -- I still want the first, third, etc columns to >> become numeric. There can be more than one variable like this, and not >> always in the second position. >> >> This happens deep inside the httr library; there is an easy way for me >> to add more options to the read.csv call but it is not so easy to >> replace it with something else. > > IMHO this is a bug in read.csv().No, it's a bug in "Rolf Turner", who believes in fairies at the end of his garden, rather than in documentation for file formats. Duncan Murdoch> > A possible workaround: > > ccc <- c("integer","character",rep(NA,k)) > X <- read.csv("melvin.csv",colClasses=ccc) > > where "melvin.csv" is the file from which you are attempting to read and > where k+2 = the number of columns in that file. > > Kludgey, but it might work. > > Another workaround is to specify quote="", but this has the side effect > of making the 5th column character rather than logical. > > cheers, > > Rolf >
Thanks for all for the comments, I hadn't intended to start a war. My summary: 1. Most important: I wasn't missing something obvious. This is always my first suspicion when I submit something to R-help, and it's true more often than not. 2. Obviously (at least it is now), the CSV standard does not specify that quotes should force a character result. R is not "wrong". Wrt to using what Excel does as litmus test, I consider that to be totally uninformative about standards: neither pro (like Duncan) or anti (like Rolf), but simply irrelevant. (Like many MS choices.) 3. I'll have to code in my own solution, either pre-scan the first few lines to create a colClasses, or use read_csv from the readr library (if there are leading zeros it keeps the string as character, which may suffice for my needs), or something else. 4. The source of the data is a "text/csv" field coming from an http POST request. This is an internal service on an internal Mayo server and coded by our own IT department; this will not be the first case where I have found that their definition of "csv" is not quite standard. Terry T.> On 23/09/15 10:00, Therneau, Terry M., Ph.D. wrote: >> I have a csv file from an automatic process (so this will happen >> thousands of times), for which the first row is a vector of variable >> names and the second row often starts something like this: >> >> 5724550,"000202075214",2005.02.17,2005.02.17,"F", ..... >> >> Notice the second variable which is >> a character string (note the quotation marks) >> a sequence of numeric digits >> leading zeros are significant >> >> The read.csv function insists on turning this into a numeric. Is there >> any simple set of options that >> will turn this behavior off? I'm looking for a way to tell it to "obey >> the bloody quotes" -- I still want the first, third, etc columns to >> become numeric. There can be more than one variable like this, and not >> always in the second position. >> >> This happens deep inside the httr library; there is an easy way for me >> to add more options to the read.csv call but it is not so easy to >> replace it with something else.
> -----Original Message----- > From: r.turner at auckland.ac.nz > Sent: Wed, 23 Sep 2015 13:26:58 +1200 > To: pdalgd at gmail.com..........> I would say that this phenomenon ("Excel does it") is *overwhelming* > evidence that it is bad practice!!! :-)Fortune? ____________________________________________________________ Can't remember your password? Do you need a strong and secure password? Use Password manager! It stores your passwords & protects your account.