I had occasion recently to read in a one-line *.csv file that
looked like:

"CandidateName","NSN","Ethnicity","dob","gender"
"Smith, Mary Jane",111222333,"E","2/25/1989","F"

That "F" (for female) in the last field got transformed to
FALSE. Apparently read.csv (and hence read.table) are inferring
that if the entries of a file are all F's and T's then the
field is interpreted as logical.

If I change the file to

"CandidateName","NSN","Ethnicity","dob","gender"
"Smith, Mary Jane",111222333,"E","2/25/1989","F"
"Mingdinkler, Melvin Queue",999888777,"01/04/1942","M"

then the read functions correctly interpret the last field
as being character.

The translation of "F" into FALSE resulted in some mysterious
contretemps in further analysis, which it took me a while to
track down.

I solved the problem by putting in a colClasses argument in my
call to read.csv(). But I really think that the read functions
are being too clever by half here. If field entries are surrounded
by quotes, shouldn't they be left as character? Even if they are
all F's and T's?

Furthermore using F's and T's to represent TRUE's and FALSE's is
bad practice anyway. Since FALSE and TRUE are reserved words it
would make sense for the read function to assume that a field is
logical if it consists entirely of these words. But T's and F's
.... I don't think so.

I would argue that this behaviour should be changed. I can see no
downside to such a change.

cheers,

Rolf Turner
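For anyone wanting to reproduce the behaviour described above, here is a minimal sketch. It assumes a reasonably recent R in which read.csv() accepts a `text` argument and a named `colClasses` vector; the variable names (`csv_text`, `default_read`, `fixed_read`) are illustrative and not from the original post.

```r
# The one-row file from the post, supplied inline instead of from disk.
csv_text <- '"CandidateName","NSN","Ethnicity","dob","gender"
"Smith, Mary Jane",111222333,"E","2/25/1989","F"'

# Default behaviour: the lone "F" is converted to logical FALSE.
default_read <- read.csv(text = csv_text)
class(default_read$gender)   # "logical"

# Workaround: declare the class of the gender column explicitly.
# Columns not named in colClasses are still guessed as usual.
fixed_read <- read.csv(text = csv_text,
                       colClasses = c(gender = "character"))
class(fixed_read$gender)     # "character"
```

Naming only the affected column keeps the usual type guessing for the other fields, which may or may not be what you want.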
Rolf Turner wrote:
>
> I solved the problem by putting in a colClasses argument in my
> call to read.csv(). But I really think that the read functions
> are being too clever by half here. If field entries are surrounded
> by quotes, shouldn't they be left as character? Even if they are
> all F's and T's?
>

It has been my experience that fields surrounded by quotes are interpreted as factors unless the stringsAsFactors switch has been set to FALSE. So it seems the default behavior of read.table is to be clever.

Annoying as these behaviors are, changing them would probably break existing code that expects the function to execute the way it does.

-Charlie

--
View this message in context: http://n4.nabble.com/A-slight-trap-in-read-table-read-csv-tp1573007p1573018.html
Sent from the R help mailing list archive at Nabble.com.
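The interaction between quoting and stringsAsFactors can be checked with a small sketch. The three-column data below is made up for illustration, and stringsAsFactors is set explicitly in both calls because its default differs between R versions (it became FALSE in R 4.0.0).

```r
# Every field is quoted, including the numbers.
two_rows <- '"name","gender","score"
"Smith","F","10"
"Mingdinkler","M","12"'

as_factor <- read.csv(text = two_rows, stringsAsFactors = TRUE)
as_char   <- read.csv(text = two_rows, stringsAsFactors = FALSE)

class(as_factor$gender)  # "factor"
class(as_char$gender)    # "character"

# Quotes do not protect a field from conversion: the quoted
# numbers are still turned into integers.
class(as_char$score)     # "integer"
```

So the quotes are stripped before any type guessing happens, which is why quoting "F" does not keep it as character.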
On Feb 28, 2010, at 4:55 PM, Rolf Turner wrote:
> ...
> Furthermore using F's and T's to represent TRUE's and FALSE's is
> bad practice anyway. Since FALSE and TRUE are reserved words it
> would make sense for the read function to assume that a field is
> logical if it consists entirely of these words. But T's and F's
> .... I don't think so.

It is documented that conversion to logical will be attempted, so it does make sense that T/F would become TRUE and FALSE, since that is typical behavior elsewhere. But at the very least this sentence in the type.convert help page:

"Given a character vector, it attempts to convert it to logical, integer, numeric or complex, and failing that converts it to factor unless as.is = TRUE."

... ought to be clarified. It is not at all clear that the conversion to logical will still be attempted even if as.is=TRUE, i.e. the only conversion not attempted would be to factor.

> I would argue that this behaviour should be changed. I can see no
> downside to such a change.
>
> cheers,
>
> Rolf Turner
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
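The point about as.is can be seen by calling type.convert() directly. This is a small sketch with made-up vectors; as.is is always given explicitly, since recent versions of R warn when it is omitted.

```r
# All entries are T or F: converted to logical even with as.is = TRUE.
tf_only <- type.convert(c("T", "F", "F"), as.is = TRUE)
class(tf_only)   # "logical"

# A single non-logical entry stops the logical conversion.
mixed_chr <- type.convert(c("F", "M"), as.is = TRUE)
class(mixed_chr) # "character"

# as.is only decides what happens once every conversion has failed:
# the vector stays character (TRUE) or becomes a factor (FALSE).
mixed_fct <- type.convert(c("F", "M"), as.is = FALSE)
class(mixed_fct) # "factor"
```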
It is strange. Even in R itself T and F are not guaranteed to be TRUE and FALSE.

> T <- 1:3
> T
[1] 1 2 3

On Sun, Feb 28, 2010 at 4:55 PM, Rolf Turner <r.turner at auckland.ac.nz> wrote:
> ...
> Furthermore using F's and T's to represent TRUE's and FALSE's is
> bad practice anyway. Since FALSE and TRUE are reserved words it
> would make sense for the read function to assume that a field is
> logical if it consists entirely of these words. But T's and F's
> .... I don't think so.
>
> I would argue that this behaviour should be changed. I can see no
> downside to such a change.
>
> cheers,
>
> Rolf Turner
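The contrast between T/F and the reserved words TRUE/FALSE can be made explicit with a short sketch. The variable names are illustrative only.

```r
# T and F are ordinary variables in base R, so they can be masked.
T <- 0
masked_value <- isTRUE(T)     # FALSE while the masking T exists

rm(T)                         # drop the mask; base R's T is visible again
restored_value <- isTRUE(T)   # TRUE

# TRUE is a reserved word: trying to assign to it is an error.
assign_true <- tryCatch({
  eval(parse(text = "TRUE <- 0"))
  "assigned"
}, error = function(e) "error")
assign_true                   # "error"
```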
On 03/01/2010 08:55 AM, Rolf Turner wrote:
>...
> Furthermore using F's and T's to represent TRUE's and FALSE's is
> bad practice anyway. Since FALSE and TRUE are reserved words it
> would make sense for the read function to assume that a field is
> logical if it consists entirely of these words. But T's and F's
> .... I don't think so.
>
> I would argue that this behaviour should be changed. I can see no
> downside to such a change.
>

Hi Rolf,
I think that the answer is buried in the history of Truth and Falsity, in that T and F once were valid ways to abbreviate these fundamental but widely disputed concepts. The number of messages on the help list that contain this usage indicates that an awful lot of people would be chucking a tanty if automatic conversion were dropped.

Jim
On 2010-02-28 14:55, Rolf Turner wrote:
> ...
> Furthermore using F's and T's to represent TRUE's and FALSE's is
> bad practice anyway. Since FALSE and TRUE are reserved words it
> would make sense for the read function to assume that a field is
> logical if it consists entirely of these words. But T's and F's
> .... I don't think so.
>
> I would argue that this behaviour should be changed. I can see no
> downside to such a change.
>

I agree with Rolf. Indeed, I'm not fond of the use of T/F for TRUE/FALSE at all.

--
Peter Ehlers
University of Calgary
Rolf Turner <r.turner at auckland.ac.nz> wrote:
>
> I solved the problem by putting in a colClasses argument in my
> call to read.csv(). But I really think that the read functions
> are being too clever by half here. If field entries are surrounded
> by quotes, shouldn't they be left as character? Even if they are
> all F's and T's?
>
> Furthermore using F's and T's to represent TRUE's and FALSE's is
> bad practice anyway. Since FALSE and TRUE are reserved words it
> would make sense for the read function to assume that a field is
> logical if it consists entirely of these words. But T's and F's
> .... I don't think so.
>
> I would argue that this behaviour should be changed. I can see no
> downside to such a change.
>

I agree with you, Rolf, that this is horrid behavior. It is such automatic devices that have made people hate (e.g.) Microsoft Word with a passion. Yet, in R this is a designed-in bug (i.e., a feature) that probably can't be changed without making some legacy code not work.

But at least, T and F could be removed soon as synonyms for TRUE and FALSE. We have seen that "_" was removed as an assignment operator, and the world did not crumble. The use of T and F is no less error-prone, and possibly more.

The only immediate solution to this accretion of overly clever behavior would be for someone to write new functions (say, Read.csv) that didn't do all those conversions behind the scenes. I'm not about to do that. Are you?

Best of luck!

--
Mike Prager, NOAA, Beaufort, NC
* Opinions expressed are personal and not represented otherwise.
* Any use of tradenames does not constitute a NOAA endorsement.
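A wrapper of the kind suggested above could be sketched in a few lines. Read.csv here is hypothetical (only the name comes from the message), and the sketch assumes that reading every column as character is acceptable, leaving all conversion to the caller.

```r
# Hypothetical wrapper: forwards everything to read.csv() but forces
# every column to be read as character, so no type guessing happens.
Read.csv <- function(...) {
  read.csv(..., colClasses = "character")
}

csv_text <- '"CandidateName","NSN","Ethnicity","dob","gender"
"Smith, Mary Jane",111222333,"E","2/25/1989","F"'

raw <- Read.csv(text = csv_text)
sapply(raw, class)   # every column is "character"
raw$gender           # "F"
```

One consequence of this design is that numeric fields such as NSN also arrive as strings and need an explicit as.integer() or as.numeric() afterwards. Passing colClasses to the wrapper would fail with a duplicate-argument error, which a fuller version would need to handle.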