On May 18, 2012, at 20:19, Patrick Callier wrote:
> Hi all,
>
> I am running 64-bit R 2.15.0 on Windows 7. I am trying to use read.delim
> to read from a file that has 2-byte Unicode (CJK) characters.
>
> Here is an example of the data (it is tab-delimited if that gets messed up):
> HITId HITTypeId Title
> 2Q69Z6KW4ZMAGKKFRT6Q4ONO6MJF68 2LVJ1LY58B72OP36GNBHH16YF7RS7Z 看看句子，写写想法 请看以下的句子，再回答问
>
> So read.delim (code below) doesn't read in correctly. It reads up until it
> hits the CJK characters and then terminates with a warning:
> Warning messages:
> 1: In read.table(file = file, header = header, sep = sep, quote = quote, :
> invalid input found on input connection 'minimal.txt'
> 2: In read.table(file = file, header = header, sep = sep, quote = quote, :
> incomplete final line found by readTableHeader on 'minimal.txt'
>
> The "Title" field gets filled with an NA. I played around with scan() a
> little bit and it can read the file correctly if I send it an open file
> with the correct encoding given for the "encoding" parameter. It barfs with
> the same warnings if I just send it the filename and set the fileEncoding
> parameter.
>
> Here is some test code with the above text in a file "minimal.txt":
> # works
> scan(file("minimal.txt",encoding="UTF-16LE"),what=character(),nlines=2)
> # don't work
> scan("minimal.txt",what=character(),nlines=2) # output is in wrong encoding
> scan("minimal.txt",what=character(),nlines=2,fileEncoding="UTF-16LE")
> #"invalid input found on input connection"
> read.delim(file("minimal.txt",encoding="UTF-16LE"), sep = "\t",
> header=TRUE) # ditto
>
> Is this a bug? Or am I just doing something wrong? Thanks for any help you
> can provide.

This stuff is highly locale-dependent (and locales are OS-dependent). As I
understand things, the encoding= argument to scan() or read.table() says that
the file is in a foreign encoding and that you want to keep the strings in
that encoding, whereas fileEncoding= means that you want the input converted
to your current encoding and then work with the converted data. In the first
case you need to get the encoding right; in the second you also need to be in
a locale that allows the conversion.
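
To make the distinction concrete, here is a minimal sketch. It assumes a small UTF-8 test file rather than the original minimal.txt; the names path, title, marked and converted are made up for illustration. The result of the second call depends on the locale the session runs in.

```r
# Write a tiny tab-delimited UTF-8 file from raw bytes, so the example does not
# depend on how this script itself is encoded.
path  <- tempfile(fileext = ".txt")
title <- "\u770b\u770b\u53e5\u5b50"   # first four Title characters from this thread
writeBin(charToRaw(paste0("HITId\tTitle\nA1\t", title, "\n")), path)

# encoding=: the bytes are left alone; the strings are only declared to be UTF-8.
marked <- read.delim(path, encoding = "UTF-8", stringsAsFactors = FALSE)
Encoding(marked$Title)
identical(charToRaw(marked$Title), charToRaw(title))   # compares the raw bytes

# fileEncoding=: the input is converted from UTF-8 to the session's native
# encoding while it is read. In a UTF-8 locale this should round-trip; in a
# locale that cannot represent these characters (Latin-1, say), expect the
# "invalid input found on input connection" warning and an NA instead.
converted <- read.delim(path, fileEncoding = "UTF-8", stringsAsFactors = FALSE)
converted$Title
```

Whether the marked strings then print as characters or as <U+xxxx> escapes is again up to the locale.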

For file(), requesting an encoding means asking for conversion, so if that
doesn't work, you are out of luck (and you're just confusing the issue
anyway). Here are a couple of examples run in a Latin-1 locale; notice that
if you can't convert Chinese characters to your current locale, then the
<U+1234> style output is the best you can hope for.

Peter-Dalgaards-MacBook-Air:minimal pd$ LC_ALL="da_DK.ISO8859-1" R --vanilla < minimal2.R
R version 2.14.2 (2012-02-29)
....
> read.delim(file("minimal.txt",encoding="UTF-8"), sep = "\t", header=TRUE,encoding="UTF-8")
                           HITId                      HITTypeId Title Question
1 2Q69Z6KW4ZMAGKKFRT6Q4ONO6MJF68 2LVJ1LY58B72OP36GNBHH16YF7RS7Z    NA       NA
Warning messages:
1: In read.table(file = file, header = header, sep = sep, quote = quote, :
  invalid input found on input connection 'minimal.txt'
2: In read.table(file = file, header = header, sep = sep, quote = quote, :
  incomplete final line found by readTableHeader on 'minimal.txt'
> read.delim(file="minimal.txt", encoding="UTF-8")
                           HITId                      HITTypeId
1 2Q69Z6KW4ZMAGKKFRT6Q4ONO6MJF68 2LVJ1LY58B72OP36GNBHH16YF7RS7Z
  Title
1 <U+770B><U+770B><U+53E5><U+5B50><U+FF0C><U+5199><U+5199><U+60F3><U+6CD5>
  Question
1 <U+8BF7><U+770B><U+4EE5><U+4E0B><U+7684><U+53E5><U+5B50><U+FF0C><U+518D><U+56DE><U+7B54><U+95EE>
> read.delim(file="minimal.txt")
                           HITId                      HITTypeId
1 2Q69Z6KW4ZMAGKKFRT6Q4ONO6MJF68 2LVJ1LY58B72OP36GNBHH16YF7RS7Z
  Title
1 ?\234\213?\234\213?\217??\220?\214?\206\231?\206\231?\203??\225
  Question
1 ??\234\213??\213?\232\204?\217??\220?\214?\206\215?\233\236?\224?\227?
> read.delim(file="minimal.txt", fileEncoding="UTF-8")
                           HITId                      HITTypeId Title Question
1 2Q69Z6KW4ZMAGKKFRT6Q4ONO6MJF68 2LVJ1LY58B72OP36GNBHH16YF7RS7Z    NA       NA
Warning messages:
1: In read.table(file = file, header = header, sep = sep, quote = quote, :
  invalid input found on input connection 'minimal.txt'
2: In read.table(file = file, header = header, sep = sep, quote = quote, :
  incomplete final line found by readTableHeader on 'minimal.txt'
>
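
The fileEncoding= failures in this session come down to whether the target encoding has a slot for the characters at all. A rough way to check that without reading any file is iconv(). This is only a sketch: cjk, danish and can_convert are illustrative names, and it assumes your iconv knows the "latin1" encoding name.

```r
cjk    <- "\u770b\u770b\u53e5\u5b50"   # CJK text as in the Title column
danish <- "\u00e6\u00f8\u00e5"         # Danish letters, all present in Latin-1

# Latin-1 can represent the Danish letters, so this conversion succeeds.
iconv(danish, from = "UTF-8", to = "latin1")

# Latin-1 has no CJK characters, so iconv() gives NA for this string.
iconv(cjk, from = "UTF-8", to = "latin1")

# The same check against the session's own encoding (to = "" means native).
can_convert <- !is.na(iconv(cjk, from = "UTF-8", to = ""))
can_convert
```

If can_convert is FALSE in your session, fileEncoding= (or file(..., encoding=)) cannot be expected to deliver these characters, and keeping the strings in UTF-8 via encoding= is probably the better route.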
>
> --Pat
>
> --
> Patrick Callier
> Georgetown University
> www12.georgetown.edu/students/prc23
>
--
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd.mes at cbs.dk Priv: PDalgd at gmail.com