Antti Arppe
2006-Jun-08 13:10 UTC
[R] Reading in a table with ISO-latin1 encoding in MacOS-X (Intel)
Dear colleagues in R,

I have earlier been working with R in Linux, where reading in a table containing Scandinavian letters ("ä", "ö", and "å") in the header as part of variable names has not caused any problem whatsoever.

However, doing the same in R running on a new MacOS X machine (with an Intel processor), with the same original text table, does not seem to work whichever way I try. Following the recommendations on the R site and using the 'file' function to set the encoding breaks down at the first encounter with a Scandinavian character:

    THINK <- read.table(file("R_data/hs+sfnet.T.060505.tbl4", encoding="latin1"), header=TRUE)
    Warning messages:
    1: invalid input found on input connection 'R_data/hs+sfnet.T.060505.tbl4'
    2: incomplete final line found by readTableHeader on 'R_data/hs+sfnet.T.060505.tbl4'

A sample exemplifying such characters as variable labels is below (for which the behavior of R on the Mac is the same as for the larger file referred to above):

       ajatella miettiä pohtia
    1     FALSE   FALSE   TRUE
    2     FALSE   FALSE  FALSE
    3     FALSE    TRUE  FALSE
    4     FALSE    TRUE  FALSE
    5      TRUE   FALSE  FALSE
    6      TRUE   FALSE  FALSE
    7     FALSE   FALSE  FALSE
    8     FALSE    TRUE  FALSE
    9     FALSE    TRUE  FALSE
    10    FALSE   FALSE  FALSE

Converting the file from ISO-latin-1 to UTF-8 (with the Mac's TextEdit application) allows the file to be read in completely, but the Scandinavian character in the heading is still coerced to a period '.', or two, in fact (i.e. 'miettiä' -> 'miett..').

Have I possibly misunderstood how the 'file' function should be used in conjunction with 'read.table', or might the problem with latin1-to-utf conversion be somewhere else?

Appreciating any help on this matter,

--
======================================================================
Antti Arppe - Master of Science (Engineering)
Researcher & doctoral student (Linguistics)
E-mail: antti.arppe at helsinki.fi
WWW: http://www.ling.helsinki.fi/~aarppe
Charles Plessy
2006-Jun-08 13:31 UTC
[R] Reading in a table with ISO-latin1 encoding in MacOS-X (Intel)
On Thu, Jun 08, 2006 at 04:10:08PM +0300, Antti Arppe wrote:
> Converting the file from ISO-latin-1 to UTF8 (with Mac's TextEdit
> application) allows the file to be read in in its entirety, but still
> the Scandinavian character in the heading is coerced to a period '.',
> or two, in fact (i.e. 'miettiä' -> 'miett..')

Dear Antti,

I may be wrong, but the Unicode accented Latin letters are not encoded the same way on Linux and Macintosh. On Linux, ä is ä, but on Macintosh it is "+a (please read the quote mark as if it were an umlaut).

Did you try simply retyping the headers with a Macintosh text editor?

Good luck!

--
Charles Plessy
Wako, Saitama, Japan
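The distinction described above, one precomposed character versus a base letter plus a combining umlaut, can be inspected in base R. This is only a sketch with illustrative strings; whether the file in question actually contains the decomposed form is not established in the thread.

```r
# Two Unicode spellings of "ä" (illustrative strings, not taken from the file).
precomposed <- "\u00e4"    # single code point U+00E4
decomposed  <- "a\u0308"   # "a" followed by the combining diaeresis U+0308

utf8ToInt(precomposed)     # 228
utf8ToInt(decomposed)      # 97 776

# The two look alike when printed but are different strings.
identical(precomposed, decomposed)   # FALSE
```

Applying utf8ToInt() to a header name read from the file would show which of the two forms it uses.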
Peter Dalgaard
2006-Jun-08 14:06 UTC
[R] Reading in a table with ISO-latin1 encoding in MacOS-X (Intel)
Antti Arppe <aarppe at ling.helsinki.fi> writes:

> Converting the file from ISO-latin-1 to UTF8 (with Mac's TextEdit
> application) allows the file to be read in in its entirety, but still
> the Scandinavian character in the heading is coerced to a period '.',
> or two, in fact (i.e. 'miettiä' -> 'miett..')

I think you probably need check.names=FALSE. (Presumably, you cannot have Finnish characters in variable names either on the Mac?)

> Have I possibly misunderstood how the 'file' function should be used
> in conjunction with 'read.table', or might the problem with
> latin1-to-utf conversion be somewhere else?

Not really, text encodings are just a pain. The blame for this fact can be shifted in various directions, but it doesn't really help... (My personal angle is that ISO-8859 was terribly shortsighted, and stuck in a "Western Europe" mindset. As soon as the iron curtain disappeared and we started to deal with people from Slavic countries, the weakness was revealed.)

The basic structure looks OK, and works for me on Linux:

    > read.table(file("xx.data",encoding="latin1"),header=TRUE)
      ?h b?h
    1  1   2

so one can only guess that you have a local or Mac-specific setup issue.

--
   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark          Ph:  (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)                  FAX: (+45) 35327907
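A minimal sketch of the check.names=FALSE suggestion, combined with the file(..., encoding="latin1") idiom from the thread. The temporary file and its two data rows are hypothetical stand-ins for the real table, and the sketch assumes the session runs in a locale that can represent "ä" (for example a UTF-8 locale); in other locales the re-encoding step may fail as in the original report.

```r
# Hypothetical sample: a two-row table whose header contains "miettiä",
# written to a temporary file as Latin-1 bytes.
path <- tempfile(fileext = ".tbl")
txt  <- "ajatella mietti\u00e4 pohtia\n1 FALSE FALSE TRUE\n2 FALSE TRUE FALSE\n"
writeBin(iconv(txt, from = "UTF-8", to = "latin1", toRaw = TRUE)[[1]], path)

# Re-encode from Latin-1 while reading, and keep the header names untouched.
# Assumes the session locale can represent "ä" (e.g. a UTF-8 locale).
think <- read.table(file(path, encoding = "latin1"),
                    header = TRUE, check.names = FALSE)
names(think)

# With the default check.names = TRUE, names pass through make.names(),
# whose treatment of non-ASCII letters depends on the locale.
make.names("mietti\u00e4")
```

With check.names = FALSE, the names are taken from the header as read, so any remaining damage would point at the encoding conversion rather than at name checking.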
Prof Brian Ripley
2006-Jun-08 14:17 UTC
[R] Reading in a table with ISO-latin1 encoding in MacOS-X (Intel)
You are using this as intended, although your email message came in latin9 not latin1, which does not affect your examples. Have you actually checked (e.g. via a hex dump) that the file is in latin1?

I assume that if you converted the file to UTF-8 you then used

    read.table("R_data/hs+sfnet.T.060505.tbl4", header=TRUE)

If so, you need to investigate the locale in use, as which letters are valid depends on the locale: on Linux UTF-8 locales all letters in all languages are valid in R names, but that is not necessarily the MacOS interpretation. (Invalid characters in names will be converted to '.', and if the locale is wrong, so may be the interpretation of bytes as characters.)

You might find more informed help on the r-sig-mac list.

--
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
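The two checks suggested above, a hex dump of the file and a look at the locale in use, can both be done from within R. The following is a sketch only: the temporary file and its bytes are hypothetical sample data, not the poster's file.

```r
# Hypothetical sample bytes: "mietti" + 0xE4 (Latin-1 "ä") + newline.
path <- tempfile()
writeBin(as.raw(c(0x6d, 0x69, 0x65, 0x74, 0x74, 0x69, 0xe4, 0x0a)), path)

# Hex dump of the start of the file.
bytes <- readBin(path, what = "raw", n = 64)
print(bytes)

# A lone 0xE4 byte is not valid UTF-8, which suggests a single-byte encoding
# such as Latin-1; in UTF-8 the same letter would be the pair 0xC3 0xA4.
is_utf8 <- validUTF8(rawToChar(bytes))
has_e4  <- any(bytes == as.raw(0xe4))

# Locale the session is using, which decides which letters are valid in names.
Sys.getlocale("LC_CTYPE")
l10n_info()
```

For a real file, replacing path with the file's location and reading the first few dozen bytes is usually enough to see how the header's accented letters are stored.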