On Thu, 26 Oct 2006, Henrik Bengtsson wrote:
> I'm observing the following on different platforms:
>
>> parse(text='"\\x7F"')
> expression("\177")
>> parse(text='"\\x80"')
> Error: invalid multibyte string
Yes. It's an invalid multibyte string. In UTF-8 a single byte is a valid
character string only if it is below 0x80, so \x7F is fine but \x80 is not.
In fact 0x80 is not the leading byte of any valid UTF-8 character: bytes
0x80 to 0xBF can appear only as continuation bytes.
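
As a rough illustration of the rule (not part of the original exchange), the
sketch below builds strings from raw bytes and asks whether each byte
sequence is well-formed UTF-8. It assumes a much newer R than the 2.4.0
discussed here, since validUTF8() only appeared in R 3.3.0; the variable
names are made up for the example.

```r
# Build strings directly from bytes, bypassing the parser, then test
# whether each byte sequence is well-formed UTF-8.
# ASSUMPTION: R >= 3.3.0, where validUTF8() is available.
b7f   <- rawToChar(as.raw(0x7f))           # single byte below 0x80
b80   <- rawToChar(as.raw(0x80))           # lone continuation byte
c2_80 <- rawToChar(as.raw(c(0xc2, 0x80)))  # U+0080 encoded as two bytes

validUTF8(c(b7f, b80, c2_80))
# TRUE FALSE TRUE
```

The third case shows that 0x80 is acceptable as a trailing byte after a
suitable lead byte, just not on its own.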
You have to work out what the Unicode code point is for whatever character
you were expecting to be x80 and convert that to UTF-8.
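
A minimal sketch of that conversion, under a loudly stated assumption: the
thread does not say which character, if any, was intended. Purely for
illustration, suppose 0x80 was meant as the Windows-1252 euro sign, whose
code point is U+20AC. The name `euro` is hypothetical.

```r
# ASSUMPTION (illustrative only): the byte 0x80 was meant as the
# Windows-1252 euro sign, whose Unicode code point is U+20AC.
euro <- intToUtf8(0x20AC)   # code point -> UTF-8 encoded string
charToRaw(euro)             # the three UTF-8 bytes: e2 82 ac
utf8ToInt(euro) == 0x20AC   # round trip back to the code point: TRUE
```

For a different intended character, only the code point passed to
intToUtf8() would change.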
I'm surprised that one of your UTF-8 machines worked -- I don't think it
should.
-thomas
> ...
>> parse(text='"\\xFF"')
> Error: invalid multibyte string
>
> However,
>
> cat("\x7F\n\x80\n...\xFF\n")
>
> works. Using R --vanilla.
> SYSTEMS GIVING THE ERROR:
>> sessionInfo()
> R version 2.4.0 (2006-10-03)
> x86_64-unknown-linux-gnu
> locale:
> LC_CTYPE=en_AU.UTF-8;LC_NUMERIC=C;LC_TIME=en_AU.UTF-8;LC_COLLATE=en_AU.UTF-8;LC_MONETARY=en_AU.UTF-8;LC_MESSAGES=en_AU.UTF-8;LC_PAPER=en_AU.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_AU.UTF-8;LC_IDENTIFICATION=C
>
> R version 2.4.0 Patched (2006-10-03 r39576)
> i686-pc-linux-gnu
> locale:
> LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=en_US.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C
>
>
> SYSTEMS OK:
> R version 2.4.0 Under development (unstable) (2006-07-23 r38687)
> x86_64-unknown-linux-gnu
> locale:
> LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=en_US.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C
>
> R version 2.4.0 (2006-10-03)
> i386-pc-mingw32
> locale:
> LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252
>
> R version 2.4.0 Patched (2006-10-10 r39600)
> i386-pc-mingw32
> locale:
> LC_COLLATE=English_Australia.1252;LC_CTYPE=English_Australia.1252;LC_MONETARY=English_Australia.1252;LC_NUMERIC=C;LC_TIME=English_Australia.1252
>
> Version 2.3.0 (2006-04-24)
> x86_64-unknown-linux-gnu
> locale: <not reported>
>
>
> All of the above have the following packages attached:
> [1] "methods"   "stats"     "graphics"  "grDevices" "utils"     "datasets"
> [7] "base"
>
> We identified this problem because R CMD check complained:
>
>> * checking package dependencies ... WARNING
>> Error in deparse(e[[2]]) : invalid multibyte string
>> Execution halted
>
> because we use "\xFF" (or "\377") in the source code to be used as a
> terminator in a vector buffer; "\0" can't be used for other reasons.
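
One possible workaround, which nobody in the thread proposed and which may
not suit the actual buffer code, is to hold the buffer as a raw vector so
that the terminator byte is never interpreted as a character in the current
locale. The names `terminator` and `buf` below are hypothetical.

```r
# Sketch: keep the buffer as raw bytes so that 0xFF is never parsed
# or deparsed as part of a character string.
# 'terminator' and 'buf' are hypothetical names.
terminator <- as.raw(0xff)
buf <- c(charToRaw("abc"), terminator, charToRaw("de"), terminator)

# Positions of the terminators, independent of the locale's encoding.
which(buf == terminator)
# 4 7
```

Because as.raw(0xff) is built from a number rather than a string literal,
the source file contains no "\xFF" escape for the parser to reject.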
>
> Is the above a bug in R or one in my head?
>
> /H
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>
Thomas Lumley                   Assoc. Professor, Biostatistics
tlumley at u.washington.edu     University of Washington, Seattle