thr3ads.net - R devel - [Rd] Possible Bug In Validation of UTF-8 Sequences [Apr 2020]


brodie gaslam

2020-Apr-04 18:43 UTC

[Rd] Possible Bug In Validation of UTF-8 Sequences

As per `?intToUtf8`, and in the comments to `valid_utf8`[1], R 
intends to prevent illegal UTF-8 such as UTF-8 encoded
UTF-16 surrogate pairs. `R_nchar`, invoked via `base::nchar`,
explicitly validates UTF-8 strings[2], but allows the surrogate:

    > Encoding('\ud800')
    [1] "UTF-8"
    > nchar('\ud800')  # should be an error
    [1] 1

The problem manifests on systems where `char` is signed. The logic
used to test for the forbidden sequences implicitly assumes that
`char` is unsigned[3]:

    if (c == 0xe0 && (d & 0x20) == 0) return 1;
    if (c == 0xed && d >= 0xa0) return 1;

Notice the `d >= 0xa0`. On a system where `char` is signed this can
only ever be true if a byte has more than 8 bits, as otherwise the
maximum value of `d` is 0x7f because `d` is retrieved from a plain
`char` pointer[4] (d is `int`):

    if (((d = *(++p)) & 0xc0) != 0x80) return 1;

Where `p` is defined as[5]:

    const char *p;

In contrast `c` above is correctly cast to `unsigned char` prior to
use[8]:

    c = (unsigned char)*p;
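
The difference between the two reads can be reproduced in isolation. What follows is only a minimal sketch, not the R source: the function names are hypothetical, `signed char` is used explicitly so the result does not depend on the platform's choice for plain `char`, and 8-bit two's-complement bytes are assumed.

```c
#include <string.h>

/* Hypothetical: copy raw bytes into a signed char buffer without any
 * value conversion, as happens when a file is read into a char array. */
void load_bytes(signed char *dst, const unsigned char *src, size_t n)
{
    memcpy(dst, src, n);
}

/* Hypothetical: surrogate test with the second byte read the way the
 * unpatched code reads it on a signed-char platform. */
int surrogate_check_signed(const signed char *p)
{
    int c = (unsigned char)p[0];
    int d = p[1];                    /* 0xA0 sign-extends to -96 */
    return c == 0xed && d >= 0xa0;   /* never true: d is at most 0x7f */
}

/* Hypothetical: same test with the second byte cast before widening. */
int surrogate_check_unsigned(const signed char *p)
{
    int c = (unsigned char)p[0];
    int d = (unsigned char)p[1];     /* 0xA0 stays 160 */
    return c == 0xed && d >= 0xa0;
}
```

Fed the bytes ED A0 80 (the encoding of U+D800), the first check lets the sequence through and the second one flags it.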

I attach a simple patch to address this.

I also include a patch to remove the handling of surrogates from
`R_nchar` as that should no longer be necessary, and additionally the
current handling appears incorrect. AFAICT, the current handling
attempts to decode a surrogate pair by combining the high surrogate
with the same high surrogate, instead of the high surrogate with the
subsequent character that hopefully is the low surrogate[7].
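
For reference, the standard UTF-16 rule for combining a pair is sketched below. This is the textbook formula rather than anything taken from `R_nchar`, and `combine_surrogates` is a hypothetical name. Returning 0 for invalid input is just a convention chosen for this sketch.

```c
#include <stdint.h>

/* Hypothetical helper: combine a UTF-16 high/low surrogate pair into a
 * code point. Returns 0 when the inputs are not a high + low pair. */
uint32_t combine_surrogates(uint32_t hi, uint32_t lo)
{
    if (hi < 0xD800 || hi > 0xDBFF) return 0;   /* not a high surrogate */
    if (lo < 0xDC00 || lo > 0xDFFF) return 0;   /* not a low surrogate */
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
}
```

With this rule `combine_surrogates(0xD840, 0xDC00)` gives U+20000, whereas passing the high surrogate in both positions is rejected because the second argument is outside the low-surrogate range.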

Here is some code that could be added to regression tests:

    surr_good <- '\ud840\udc00'            # auto-converts to normal
    surr_bad <- paste0('\ud840', '\udc00') # surrogates remain
    good <- c('hello', 'world', surr_good, '\ud7ff', '\ue000', '\U0010ffff')
    bad <- c(surr_bad, '\ud800', '\udfff', '\U00110000')

On R 3.6.3:

    nchar(good, allowNA=TRUE)
    [1] 5 5 1 1 1 1
    nchar(bad, allowNA=TRUE)
    [1] 2 1 1 1

On R-devel (2020-03-31 r78116) w/ patch:

    nchar(good, allowNA=TRUE)
    [1] 5 5 1 1 1 1
    nchar(bad, allowNA=TRUE)
    [1] NA NA NA NA
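
The boundary values in these vectors sit on either side of the forbidden ranges at the byte level: U+D7FF is ED 9F BF, U+D800 is ED A0 80, U+DFFF is ED BF BF, U+E000 is EE 80 80, U+10FFFF is F4 8F BF BF, and U+110000 would be F4 90 80 80. A rough sketch of a check on the first two bytes, assuming the caller has already confirmed that the sequence is otherwise well formed, might look like this (`lead_pair_forbidden` is a hypothetical name, not a function in R):

```c
/* Hypothetical check on the first two bytes of a multi-byte UTF-8
 * sequence. Returns 1 when the pair can only begin an encoded
 * surrogate (U+D800..U+DFFF) or a value above U+10FFFF. */
int lead_pair_forbidden(const unsigned char *p)
{
    unsigned int c = p[0], d = p[1];
    if (c == 0xed && d >= 0xa0) return 1;   /* ED A0..BF: surrogates */
    if (c == 0xf4 && d >= 0x90) return 1;   /* F4 90..: above U+10FFFF */
    if (c > 0xf4) return 1;                 /* F5..FF are never valid leads */
    return 0;
}
```

Because the pointer is `const unsigned char *`, the comparisons behave the same regardless of whether plain `char` is signed.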

I ran `make check-devel` successfully, although I did have to suppress
one PCRE test[9] that segfaults on my particular set-up, though that
segfaulted prior to my patch as well.

The patch does not prevent the creation of illegal UTF-8 strings,
although I imagine it would be straightforward to add checks to the
entry points into CHARSXPs if that were desired.

Finally, this signed char business hints at a potential broader issue.
If I understand correctly, importing byte sequences with values greater
than 0x7f overflows the `char` buffers on systems with signed chars
and octet (or lesser) bytes, e.g. as in `do_readLines`[6] where an
integer procured via `fgetc` is subsequently stored in a `char`
buffer. Obviously this hasn't mattered much to date, presumably
because the implementations R runs on define the `unsigned char` to
`signed char` conversion in such a way that the `signed char` to
`unsigned char` conversion restores the original value. I don't know
if this is something explicitly checked for like the `int` == 32 bits
assumption.
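
The round trip in question can be sketched as follows. `roundtrip_through_signed_char` is a hypothetical name, and the first conversion is the one whose result the C standard (before C23) leaves implementation-defined for values above 127. The second conversion, back to `unsigned char`, is fully defined.

```c
/* Hypothetical helper: store a byte value as returned by fgetc (0..255)
 * in a signed char, then read it back through unsigned char. */
int roundtrip_through_signed_char(int byte)
{
    /* Implementation-defined for byte > 127; common compilers wrap
     * modulo 256, so 0xED becomes -19. */
    signed char stored = (signed char)byte;

    /* Well defined: -19 converts to 256 - 19 = 0xED. */
    return (unsigned char)stored;
}
```

On implementations that wrap modulo 256 this returns its argument for every value from 0 to 255, which is presumably why the issue has not surfaced in practice.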

[1]: https://github.com/wch/r-source/blob/tags/R-3-6-3/src/main/valid_utf8.h#L61
[2]: https://github.com/wch/r-source/blob/tags/R-3-6-3/src/main/character.c#L148
[3]: https://github.com/wch/r-source/blob/tags/R-3-6-3/src/main/valid_utf8.h#L106
[4]: https://github.com/wch/r-source/blob/tags/R-3-6-3/src/main/valid_utf8.h#L84
[5]: https://github.com/wch/r-source/blob/tags/R-3-6-3/src/main/valid_utf8.h#L69
[6]: https://github.com/wch/r-source/blob/tags/R-3-6-3/src/main/connections.c#L3935
[7]: https://github.com/wch/r-source/blob/tags/R-3-6-3/src/main/character.c#L184
[8]: https://github.com/wch/r-source/blob/tags/R-3-6-3/src/main/valid_utf8.h#L73
[9]: https://github.com/wch/r-source/blob/tags/R-3-6-3/tests/PCRE.R#L16


-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: patch_val_utf8.txt
URL:
<https://stat.ethz.ch/pipermail/r-devel/attachments/20200404/2bef12f7/attachment.txt>

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: patch_nchar.txt
URL:
<https://stat.ethz.ch/pipermail/r-devel/attachments/20200404/2bef12f7/attachment-0001.txt>
