On Dec 31, 2011, at 16:05, Dennis Fisher wrote:
> R version: 2.13.1
> OS X
>
> Colleagues,
>
> I am working with a CSV file; for testing purposes, I created an XLS
> version of the file.
> When I read these files using read.xls (gdata) or read.csv, I encounter an
> error:
> Error in type.convert(data[[i]], as.is = as.is[i], dec = dec, na.strings = character(0L)) :
>   invalid multibyte string at '<b0>C'
> The error occurs whether or not I invoke the "as.is" option of
> read.csv.
>
> The trigger for this error is a "degree C" string (\xb0). The
> offending line is:
> [1] "\"DD4A14\",\"VITALS\",\"SITE038\",\"038-501\",\"SCREENING\",\"\",\"Temperature\",\"37.8\",\"\xb0C\",\"1005_TS\",\"e2\",\"1005/cla\",\"\",5/25/2011,-1,2,0,0,0,0,0,0,1,7/20/2011 16:48:25,240,1"
I think this means that you are working in a UTF-8 locale and trying to read
a file that is encoded in Latin-1 (where the byte 0xb0 is the degree sign).
Try the fileEncoding or encoding arguments of read.csv; my first try would be
fileEncoding="latin1".
-pd
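
A minimal sketch of that suggestion, assuming a UTF-8 session as in the original report. The temporary file and its three-field contents are made up for illustration (they are not the poster's data); only read.csv and its fileEncoding argument come from the advice above.

```r
# Build a tiny Latin-1 encoded CSV that contains the raw byte 0xb0
# (the degree sign in Latin-1). The file is hypothetical test data.
tmp <- tempfile(fileext = ".csv")
con <- file(tmp, open = "wb")
writeBin(c(charToRaw('"Temperature","37.8","'),
           as.raw(0xb0),
           charToRaw('C"\n')),
         con)
close(con)

# Declaring the file's encoding lets R re-encode the input to the
# session's native encoding while reading, so type.convert never
# sees an invalid multibyte sequence.
dat <- read.csv(tmp, header = FALSE, as.is = TRUE, fileEncoding = "latin1")
print(dat)
```

If fileEncoding does not help, the encoding argument is the other knob mentioned above; it marks the strings as Latin-1 rather than re-encoding them on input.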
>
> I can get around the error by reading the file with readLines, then editing
> out that character:
> PATH <- textConnection(sub("\xb0", "degrees", readLines(PATH)))
> read.csv(PATH, header=T, as.is=T)
> This alternate approach is successful. This leads to two questions:
>
> 1. Why can readLines handle that character string whereas read.csv cannot?
>
> 2. Reading the text connection is slow - it takes ~ 11 seconds to read a
> file with 11K rows. I edited the file to replace the offending character with
> "degree". read.csv reads the 11K rows of the new file in a fraction
> of a second. Can someone explain why reading the text connection is so much
> slower than reading a file?
>
> Dennis
>
> Dennis Fisher MD
> P < (The "P Less Than" Company)
> Phone: 1-866-PLessThan (1-866-753-7784)
> Fax: 1-866-PLessThan (1-866-753-7784)
> www.PLessThan.com
>
--
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd.mes at cbs.dk Priv: PDalgd at gmail.com