Jean-Claude Arbaut
2016-Mar-15 19:05 UTC
[R] Truncated file upon reading a text file with 0xff characters
Hello R users,

I am having problems reading a CSV file that contains names with the character ÿ. In case it doesn't print correctly, it's Unicode character 00FF, LATIN SMALL LETTER Y WITH DIAERESIS. My computer runs Windows 7 and R 3.2.4.

Initially, I configured my computer to run options(encoding="UTF-8") in my .Rprofile, since I prefer this encoding for portability. A good, modern standard, I thought.

Rather than sending a large file, here is how to reproduce my problem:

options(encoding="UTF-8")

f <- file("test.txt", "wb")
writeBin(as.integer(c(65, 13, 10, 66, 255, 67, 13, 10, 68, 13, 10)), f, size=1)
close(f)
read.table("test.txt", encoding="latin1")
f <- file("test.txt", "rt")
readLines(f, encoding="latin1")
close(f)

I write a file with three lines, in binary to avoid any translation:

A
B\xffC
D

Upon reading, I get only:

> read.table("test.txt", encoding="latin1")
  V1
1  A
2  B
Warning messages:
1: In read.table("test.txt", encoding = "latin1") :
  invalid input found on input connection 'test.txt'
2: In read.table("test.txt", encoding = "latin1") :
  incomplete final line found by readTableHeader on 'test.txt'
> readLines(f, encoding="latin1")
[1] "A" "B"
Warning messages:
1: In readLines(f, encoding = "latin1") :
  invalid input found on input connection 'test.txt'
2: In readLines(f, encoding = "latin1") :
  incomplete final line found on 'test.txt'

Hence the file is truncated. However, \xff is a valid latin1 character, as one can check for instance at https://en.wikipedia.org/wiki/ISO/IEC_8859-1

I tried with a UTF-8 version of this file:

f <- file("test.txt", "wb")
writeBin(as.integer(c(65, 13, 10, 66, 195, 191, 67, 13, 10, 68, 13, 10)), f, size=1)
close(f)
read.table("test.txt", encoding="UTF-8")
f <- file("test.txt", "rt")
readLines(f, encoding="UTF-8")
close(f)

Since the character ÿ is encoded as the two bytes 195, 191 in UTF-8, I would expect to get my complete file. But I don't.
Instead, I get:

> read.table("test.txt", encoding="UTF-8")
  V1
1  A
2  B
3  C
4  D
Warning message:
In read.table("test.txt", encoding = "UTF-8") :
  incomplete final line found by readTableHeader on 'test.txt'
> readLines(f, encoding="UTF-8")
[1] "A" "B"
Warning message:
In readLines(f, encoding = "UTF-8") :
  incomplete final line found on 'test.txt'

I tried all of the preceding with options(encoding="latin1") at the beginning. For the first attempt, with byte 255, I get:

> read.table("test.txt", encoding="latin1")
  V1
1  A
2  B
3  C
4  D
Warning message:
In read.table("test.txt", encoding = "latin1") :
  incomplete final line found by readTableHeader on 'test.txt'
>
> f <- file("test.txt", "rt")
> readLines(f, encoding="latin1")

For the other attempt, with bytes 195, 191:

> read.table("test.txt", encoding="UTF-8")
   V1
1   A
2 BÿC
3   D
>
> f <- file("test.txt", "rt")
> readLines(f, encoding="UTF-8")
[1] "A"   "BÿC" "D"
> close(f)

Thus the second one does seem to work. Just a check:

> a <- read.table("test.txt", encoding="UTF-8")
> Encoding(a$V1)
[1] "unknown" "UTF-8"   "unknown"

At last, I figured out that with the default encoding in R, both attempts work, with or without even giving the encoding as a parameter of read.table or readLines. However, I don't understand what happens:

f <- file("test.txt", "wb")
writeBin(as.integer(c(65, 13, 10, 66, 255, 67, 13, 10, 68, 13, 10)), f, size=1)
close(f)
a <- read.table("test.txt", encoding="latin1")$V1
Encoding(a)
iconv(a[2], toRaw=T)
a
a <- read.table("test.txt")$V1
Encoding(a)
iconv(a[2], toRaw=T)
a

This will yield:

> a <- read.table("test.txt", encoding="latin1")$V1
> Encoding(a)
[1] "unknown" "latin1"  "unknown"
> iconv(a[2], toRaw=T)
[[1]]
[1] 42 ff 43
> a
[1] "A"   "BÿC" "D"
>
> a <- read.table("test.txt")$V1
> Encoding(a)
[1] "unknown" "unknown" "unknown"
> iconv(a[2], toRaw=T)
[[1]]
[1] 42 ff 43
> a
[1] "A"   "BÿC" "D"

The second line is correctly encoded; the encoding is just not "marked" in one case.
With the UTF-8 bytes:

f <- file("test.txt", "wb")
writeBin(as.integer(c(65, 13, 10, 66, 195, 191, 67, 13, 10, 68, 13, 10)), f, size=1)
close(f)
a <- read.table("test.txt", encoding="UTF-8")$V1
Encoding(a)
iconv(a[2], toRaw=T)
a
a <- read.table("test.txt")$V1
Encoding(a)
iconv(a[2], toRaw=T)
a

This will yield:

> a <- read.table("test.txt", encoding="UTF-8")$V1
> Encoding(a)
[1] "unknown" "UTF-8"   "unknown"
> iconv(a[2], toRaw=T)
[[1]]
[1] 42 c3 bf 43
> a
[1] "A"   "BÿC" "D"
> a <- read.table("test.txt")$V1
> Encoding(a)
[1] "unknown" "unknown" "unknown"
> iconv(a[2], toRaw=T)
[[1]]
[1] 42 c3 bf 43
> a
[1] "A"    "BÃ¿C" "D"

Both are correctly read (the raw bytes are right), but the second one doesn't print correctly because the encoding is not "marked".

My thoughts:
* With options(encoding="native.enc"), the characters read are not translated: they are read as raw bytes, which can be given an encoding mark to print correctly (otherwise they print as native, that is, mostly latin1).
* With options(encoding="latin1") and the UTF-8 file, I guess it's much like the preceding: the characters are read as raw bytes and marked as UTF-8, which works.
* With options(encoding="latin1") and the latin1 file (with the 0xFF byte), I don't understand what happens. The file gets truncated almost as if 0xFF were an EOF character, which is perplexing, since in C the value 0xFF is sometimes (wrongly) confused with EOF.
* With options(encoding="UTF-8"), I am not sure what happens.

Questions:
* What's wrong with options(encoding="latin1")?
* Is it unsafe to use an options(encoding) other than the default native.enc on Windows?
* Is it safe to assume that with native.enc, R reads raw bytes and marks an encoding only afterwards, when requested? (That is, I get "unknown" by default, which is printed as latin1 on Windows, and if I enforce another encoding, it will be used whatever the bytes really are.)
* What really happens with another options(encoding), especially UTF-8?
* If I save a character variable to an Rdata file, is the file usable on another OS, or on the same OS with another default encoding (changed via options())? Does it depend on whether the character string has an "unknown" encoding or an explicit one?
* Is there a way (preferably an options() setting) to tell R to read text files as UTF-8 by default? Would it work with any of read.table(), readLines(), or even source()? I thought options(encoding="UTF-8") would do, but it fails on the examples above.

Best regards,

Jean-Claude Arbaut
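[As an aside to the last question: one possible workaround, a sketch under the assumption that the file is known to be latin1, is to leave options(encoding) at its default so the bytes pass through untranslated, and then convert explicitly with iconv():]

```r
## Sketch of a workaround, assuming the file is known to be latin1:
## read with the default native encoding (bytes pass through untouched),
## then convert explicitly with iconv() instead of relying on
## options(encoding=) or the encoding= argument of readLines().
path <- tempfile(fileext = ".txt")
f <- file(path, "wb")
writeBin(as.integer(c(65, 13, 10, 66, 255, 67, 13, 10, 68, 13, 10)), f, size = 1)
close(f)

x <- readLines(path)                          # raw bytes, encoding "unknown"
x <- iconv(x, from = "latin1", to = "UTF-8")  # explicit, portable conversion
Encoding(x)  # the converted non-ASCII element is marked "UTF-8"
```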
Duncan Murdoch
2016-Mar-15 20:24 UTC
[R] Truncated file upon reading a text file with 0xff characters
I think you've identified a bug (or more than one) here, but your message is so long that I haven't had time to go through it all. I'd suggest that you write up a shorter version for the bug list. The shorter version would:

1. Write the latin1 file using writeBin.
2. Set options(encoding = "") and read it without error.
3. Set options(encoding = "UTF-8") and get an error even if you explicitly set the encoding when reading.
4. Set options(encoding = "latin1") and also get an error, with or without explicitly setting the encoding.

I would limit the tests to readLines; read.table is much more complicated and isn't necessary to illustrate the problem. Bringing it into the discussion just confuses things. You should also avoid bringing text-mode connections into the discussion unless they are necessary.

Duncan Murdoch

On 15/03/2016 3:05 PM, Jean-Claude Arbaut wrote:
> [quoted text trimmed]
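[For reference, the shorter version Duncan outlines might be sketched as follows. The truncation in steps 3 and 4 is as reported in this thread for Windows 7 / R 3.2.4; other platforms or later versions of R may behave differently, so only step 2's result is relied on here:]

```r
## Sketch of the shorter reproduction (results as reported in this thread
## for Windows 7 / R 3.2.4; not necessarily reproducible elsewhere).
path <- tempfile(fileext = ".txt")

## 1. Write the latin1 file using writeBin ("A", "B\xffC", "D", CRLF endings).
f <- file(path, "wb")
writeBin(as.integer(c(65, 13, 10, 66, 255, 67, 13, 10, 68, 13, 10)), f, size = 1)
close(f)

## 2. With options(encoding = "") all three lines read without error.
options(encoding = "")
r2 <- readLines(path)

## 3. With options(encoding = "UTF-8"): reported to truncate after "B",
##    even when the encoding is given explicitly on the read.
options(encoding = "UTF-8")
r3 <- suppressWarnings(readLines(path, encoding = "latin1"))

## 4. With options(encoding = "latin1"): likewise, with or without encoding=.
options(encoding = "latin1")
r4 <- suppressWarnings(readLines(path))

options(encoding = "")  # restore the default
r2
```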
Jean-Claude Arbaut
2016-Mar-15 21:00 UTC
[R] Truncated file upon reading a text file with 0xff characters
Thank you for the answer. I was about to ask why I should avoid text connections, but I just noticed that with a binary connection for the read, the problem disappears (that is, I replace "rt" with "rb" when opening the file). R is even clever enough that, when fed the latin1 file after options(encoding="UTF-8") and with no encoding given to readLines, it correctly returns a string with encoding "unknown" and byte 0xff in the raw representation. (I would have expected at least a warning, but it seems to silently read invalid UTF-8 bytes as plain raw bytes.)

Thus the text connection does something more that causes a problem. Maybe it tries to translate characters twice? And the problem remains with read.table. Not surprising: inspecting its source, I see it uses open(file, "rt").

Jean-Claude Arbaut

2016-03-15 21:24 GMT+01:00 Duncan Murdoch <murdoch.duncan at gmail.com>:
> [quoted text trimmed]
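[A minimal sketch of the binary-connection workaround described above, replacing "rt" with "rb" when opening the file so that no translation happens on input:]

```r
## Sketch of the workaround: open a *binary* connection ("rb") instead of a
## text connection ("rt"), so the 0xff byte is not translated away on input.
path <- tempfile(fileext = ".txt")
f <- file(path, "wb")
writeBin(as.integer(c(65, 13, 10, 66, 255, 67, 13, 10, 68, 13, 10)), f, size = 1)
close(f)

f <- file(path, "rb")                   # "rb", not "rt"
x <- readLines(f, encoding = "latin1")  # all three lines; 0xff intact
close(f)
Encoding(x)  # the non-ASCII element is marked "latin1"
```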