thr3ads.net - R devel - [Rd] Possible `substr` bug in UTF-8 Corner Case [Mar 2018]

brodie gaslam

2018-Mar-29 01:53 UTC

[Rd] Possible `substr` bug in UTF-8 Corner Case

I think there is a memory bug in `substr` that is triggered by a UTF-8 corner
case: an incomplete UTF-8 byte sequence at the end of a string. With a valgrind
level 2 instrumented build of R-devel I get:

> string <- "abc\xEE"    # \xEE indicates the start of a 3-byte UTF-8 sequence
> Encoding(string) <- "UTF-8"
> substr(string, 1, 10)
==15375== Invalid read of size 1
==15375==    at 0x45B3F0: substr (character.c:286)
==15375==    by 0x45B3F0: do_substr (character.c:342)
==15375==    by 0x4CFCB9: bcEval (eval.c:6775)
==15375==    by 0x4D95AF: Rf_eval (eval.c:624)
==15375==    by 0x4DAD12: R_execClosure (eval.c:1764)
==15375==    by 0x4D9561: Rf_eval (eval.c:747)
==15375==    by 0x507008: Rf_ReplIteration (main.c:258)
==15375==    by 0x5073E7: R_ReplConsole (main.c:308)
==15375==    by 0x507494: run_Rmainloop (main.c:1082)
==15375==    by 0x41A8E6: main (Rmain.c:29)
==15375==  Address 0xb9e518d is 3,869 bytes inside a block of size 7,960 alloc'd
==15375==    at 0x4C2DB8F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==15375==    by 0x51033E: GetNewPage (memory.c:888)
==15375==    by 0x511FC0: Rf_allocVector3 (memory.c:2691)
==15375==    by 0x4657AC: Rf_allocVector (Rinlinedfuns.h:577)
==15375==    by 0x4657AC: Rf_ScalarString (Rinlinedfuns.h:1007)
==15375==    by 0x4657AC: coerceToVectorList (coerce.c:892)
==15375==    by 0x4657AC: Rf_coerceVector (coerce.c:1293)
==15375==    by 0x4660EB: ascommon (coerce.c:1369)
==15375==    by 0x4667C0: do_asvector (coerce.c:1544)
==15375==    by 0x4CFCB9: bcEval (eval.c:6775)
==15375==    by 0x4D95AF: Rf_eval (eval.c:624)
==15375==    by 0x4DAD12: R_execClosure (eval.c:1764)
==15375==    by 0x515EF7: dispatchMethod (objects.c:408)
==15375==    by 0x516379: Rf_usemethod (objects.c:458)
==15375==    by 0x516694: do_usemethod (objects.c:543)
==15375== 
[1] "abc<ee>"

Here is a patch for the native version of `substr` that highlights the problem
and a possible fix. Basically, `substr` computes the byte width of a UTF-8
character from the leading byte ("\xEE" here, which implies 3 bytes) and
reads/writes that entire byte width, irrespective of whether the string
actually ends before the theoretical end of the UTF-8 "character".

Index: src/main/character.c
===================================================================
--- src/main/character.c    (revision 74482)
+++ src/main/character.c    (working copy)
@@ -283,7 +283,7 @@
     for (i = 0; i < so && str < end; i++) {
         int used = utf8clen(*str);
         if (i < sa - 1) { str += used; continue; }
-        for (j = 0; j < used; j++) *buf++ = *str++;
+        for (j = 0; j < used && str < end; j++) *buf++ = *str++;
     }
     } else if (ienc == CE_LATIN1 || ienc == CE_BYTES) {
     for (str += (sa - 1), i = sa; i <= so; i++) *buf++ = *str++;
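
For readers without the R sources at hand, here is a minimal standalone sketch of the same idea: derive the character width from the lead byte, but never copy past the end of the buffer. The names `lead_len` and `substr_bounded` are hypothetical and are not R's `utf8clen` or `substr`. The clamp in the skip branch is an extra precaution of this sketch and is not part of the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Byte length implied by a UTF-8 lead byte (hypothetical helper, not R's
   utf8clen). ASCII, stray continuation bytes, and bytes that can never
   start a sequence are all treated as width 1. */
static size_t lead_len(unsigned char c)
{
    if (c < 0xC0) return 1;   /* ASCII or stray continuation byte */
    if (c < 0xE0) return 2;
    if (c < 0xF0) return 3;
    if (c < 0xF8) return 4;
    return 1;                 /* 0xF8..0xFF never start a sequence */
}

/* Copy characters sa..so (1-based, inclusive) from src[0..n) into dst.
   dst must have room for at least n bytes. Returns the bytes written. */
size_t substr_bounded(const char *src, size_t n,
                      size_t sa, size_t so, char *dst)
{
    assert(src != NULL && dst != NULL && sa >= 1);
    const char *str = src, *end = src + n;
    char *buf = dst;

    for (size_t i = 0; i < so && str < end; i++) {
        size_t used = lead_len((unsigned char)*str);
        if (i < sa - 1) {
            /* Skipping a character: clamp so str never moves past end. */
            size_t left = (size_t)(end - str);
            str += used < left ? used : left;
            continue;
        }
        /* The bound from the patch: stop copying when the string ends,
           even if the lead byte promised more bytes. */
        for (size_t j = 0; j < used && str < end; j++) *buf++ = *str++;
    }
    return (size_t)(buf - dst);
}
```

With the four bytes of "abc\xEE" as input and the range 1..10, this copies exactly four bytes and stops, instead of reading the two continuation bytes that "\xEE" announces but the string does not contain.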

The patch removed the valgrind error for me. I rebuilt R with the change and
ran "make check", which seemed to work fine. I also ran some simple checks on
UTF-8 strings, and things seem to work okay.

I have very limited experience making changes to R (this is my first attempt at
a patch) so please take all of the above with extreme skepticism.

Apologies in advance if this turns out to be a false alarm caused by an error on
my part.

Best,

Brodie.

PS: apologies also if the formatting of this e-mail is bad. I have not figured
out how to get plaintext working properly with Yahoo.

Tomas Kalibera

2018-Mar-29 13:11 UTC

Thanks, fixed in R-devel (by checking validity of UTF-8 strings for 
substr/substring).
Tomas
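
To illustrate what checking the validity of a UTF-8 string can look like, here is a standalone sketch that follows the well-formedness rules of RFC 3629. The function name `utf8_is_valid` is hypothetical, and this is not the check that R-devel actually uses; it only shows the kind of test that rejects a truncated sequence such as the one in the report.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Returns true if s[0..n) is well-formed UTF-8 per RFC 3629.
   Hypothetical helper for illustration; not R's internal check. */
bool utf8_is_valid(const unsigned char *s, size_t n)
{
    assert(s != NULL || n == 0);
    size_t i = 0;
    while (i < n) {
        unsigned char c = s[i];
        size_t need;                        /* continuation bytes expected */
        unsigned char lo = 0x80, hi = 0xBF; /* range of first continuation */

        if (c <= 0x7F) { i++; continue; }   /* ASCII */
        else if (c >= 0xC2 && c <= 0xDF) need = 1;
        else if (c >= 0xE0 && c <= 0xEF) {
            need = 2;
            if (c == 0xE0) lo = 0xA0;       /* reject overlong forms */
            if (c == 0xED) hi = 0x9F;       /* reject UTF-16 surrogates */
        } else if (c >= 0xF0 && c <= 0xF4) {
            need = 3;
            if (c == 0xF0) lo = 0x90;       /* reject overlong forms */
            if (c == 0xF4) hi = 0x8F;       /* reject values above U+10FFFF */
        } else {
            return false;  /* 0x80..0xC1 and 0xF5..0xFF cannot start a sequence */
        }

        if (n - i - 1 < need) return false; /* sequence truncated by end of string */
        if (s[i + 1] < lo || s[i + 1] > hi) return false;
        for (size_t k = 2; k <= need; k++)
            if ((s[i + k] & 0xC0) != 0x80) return false;
        i += need + 1;
    }
    return true;
}
```

For the bytes of "abc\xEE" this returns false at the truncation test, because the lead byte 0xEE needs two continuation bytes and none remain.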

brodie gaslam

2018-Mar-29 13:56 UTC

Thank you for the quick response and for the quick fix (and for the rchk vagrant
image I used to build and test the patch!).

One thing I'll note about the fix is that it may start breaking things that
used to "work". I think it is fair to say that character count is not well
defined with illegal UTF-8 sequences (and it is noteworthy that `nchar` does
actually stop when it encounters them), but there may be some code out there
that relied on `substr` completing successfully (albeit while potentially
corrupting memory) and that will now produce errors. It may be worth
highlighting this in the release notes.
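
As a small illustration of why the character count is ambiguous for illegal sequences, the sketch below counts characters in two ways that agree on valid UTF-8 but can disagree on invalid input. Both function names are hypothetical, and neither is how `nchar` is implemented.

```c
#include <assert.h>
#include <stddef.h>

/* Count characters by trusting the length declared by each lead byte.
   The step is clamped so the index never moves past the end of the buffer. */
size_t count_by_lead_byte(const unsigned char *s, size_t n)
{
    assert(s != NULL || n == 0);
    size_t count = 0, i = 0;
    while (i < n) {
        unsigned char c = s[i];
        size_t len = c < 0xC0 ? 1 : c < 0xE0 ? 2 : c < 0xF0 ? 3
                   : c < 0xF8 ? 4 : 1;
        i += len < n - i ? len : n - i;
        count++;
    }
    return count;
}

/* Count every byte that is not a continuation byte (bit pattern 10xxxxxx). */
size_t count_non_continuation(const unsigned char *s, size_t n)
{
    assert(s != NULL || n == 0);
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        if ((s[i] & 0xC0) != 0x80) count++;
    return count;
}
```

For the three bytes 0xEE 0x80 0x61, the first function reports one character (the lead byte claims all three bytes) while the second reports two (0xEE and the ASCII byte 0x61). Neither answer is more correct than the other, which is one argument for raising an error on such input.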

Best,
Brodie.

