I think there is a memory bug in `substr` that is triggered by a UTF-8 corner case: an incomplete UTF-8 byte sequence at the end of a string.? With a valgrind level 2 instrumented build of R-devel I get:> string <- "abc\xEE"??? # \xEE indicates the start of a 3 byte UTF-8 sequence > Encoding(string) <- "UTF-8" > substr(string, 1, 10)==15375== Invalid read of size 1 ==15375==??? at 0x45B3F0: substr (character.c:286) ==15375==??? by 0x45B3F0: do_substr (character.c:342) ==15375==??? by 0x4CFCB9: bcEval (eval.c:6775) ==15375==??? by 0x4D95AF: Rf_eval (eval.c:624) ==15375==??? by 0x4DAD12: R_execClosure (eval.c:1764) ==15375==??? by 0x4D9561: Rf_eval (eval.c:747) ==15375==??? by 0x507008: Rf_ReplIteration (main.c:258) ==15375==??? by 0x5073E7: R_ReplConsole (main.c:308) ==15375==??? by 0x507494: run_Rmainloop (main.c:1082) ==15375==??? by 0x41A8E6: main (Rmain.c:29) ==15375==? Address 0xb9e518d is 3,869 bytes inside a block of size 7,960 alloc'd ==15375==??? at 0x4C2DB8F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so) ==15375==??? by 0x51033E: GetNewPage (memory.c:888) ==15375==??? by 0x511FC0: Rf_allocVector3 (memory.c:2691) ==15375==??? by 0x4657AC: Rf_allocVector (Rinlinedfuns.h:577) ==15375==??? by 0x4657AC: Rf_ScalarString (Rinlinedfuns.h:1007) ==15375==??? by 0x4657AC: coerceToVectorList (coerce.c:892) ==15375==??? by 0x4657AC: Rf_coerceVector (coerce.c:1293) ==15375==??? by 0x4660EB: ascommon (coerce.c:1369) ==15375==??? by 0x4667C0: do_asvector (coerce.c:1544) ==15375==??? by 0x4CFCB9: bcEval (eval.c:6775) ==15375==??? by 0x4D95AF: Rf_eval (eval.c:624) ==15375==??? by 0x4DAD12: R_execClosure (eval.c:1764) ==15375==??? by 0x515EF7: dispatchMethod (objects.c:408) ==15375==??? by 0x516379: Rf_usemethod (objects.c:458) ==15375==??? by 0x516694: do_usemethod (objects.c:543) ==15375== [1] "abc<ee>" Here is a patch for the native version of `substr` that highlights the problem and a possible fix.? Basically `substr` computes the byte width of a UTF-8 character based on the leading byte ("\xEE" here, which implies 3 bytes) and reads/writes that entire byte width irrespective of whether the string actually ends before the theoretical end of the UTF-8 "character". Index: src/main/character.c ==================================================================--- src/main/character.c????(revision 74482) +++ src/main/character.c????(working copy) @@ -283,7 +283,7 @@ ????for (i = 0; i < so && str < end; i++) { ????????int used = utf8clen(*str); ????????if (i < sa - 1) { str += used; continue; } -????????for (j = 0; j < used; j++) *buf++ = *str++; +????????for (j = 0; j < used && str < end; j++) *buf++ = *str++; ????} ???? } else if (ienc == CE_LATIN1 || ienc == CE_BYTES) { ????for (str += (sa - 1), i = sa; i <= so; i++) *buf++ = *str++; The change above removed the valgrind error for me.? I re-built R with the change and ran "make check" which seemed to work fine. I also ran some simple checks on UTF-8 strings and things seem to work okay. I have very limited experience making changes to R (this is my first attempt at a patch) so please take all of the above with extreme skepticism. Apologies in advance if this turns out to be a false alarm caused by an error on my part. Best, Brodie. PS: apologies also if the formatting of this e-mail is bad.? I have not figured out how to get plaintext working properly with yahoo.
Thanks, fixed in R-devel (by checking validity of UTF-8 strings for substr/substring). Tomas On 03/29/2018 03:53 AM, brodie gaslam via R-devel wrote:> I think there is a memory bug in `substr` that is triggered by a UTF-8 corner case: an incomplete UTF-8 byte sequence at the end of a string.? With a valgrind level 2 instrumented build of R-devel I get: > >> string <- "abc\xEE"??? # \xEE indicates the start of a 3 byte UTF-8 sequence >> Encoding(string) <- "UTF-8" >> substr(string, 1, 10) > ==15375== Invalid read of size 1 > ==15375==??? at 0x45B3F0: substr (character.c:286) > ==15375==??? by 0x45B3F0: do_substr (character.c:342) > ==15375==??? by 0x4CFCB9: bcEval (eval.c:6775) > ==15375==??? by 0x4D95AF: Rf_eval (eval.c:624) > ==15375==??? by 0x4DAD12: R_execClosure (eval.c:1764) > ==15375==??? by 0x4D9561: Rf_eval (eval.c:747) > ==15375==??? by 0x507008: Rf_ReplIteration (main.c:258) > ==15375==??? by 0x5073E7: R_ReplConsole (main.c:308) > ==15375==??? by 0x507494: run_Rmainloop (main.c:1082) > ==15375==??? by 0x41A8E6: main (Rmain.c:29) > ==15375==? Address 0xb9e518d is 3,869 bytes inside a block of size 7,960 alloc'd > ==15375==??? at 0x4C2DB8F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so) > ==15375==??? by 0x51033E: GetNewPage (memory.c:888) > ==15375==??? by 0x511FC0: Rf_allocVector3 (memory.c:2691) > ==15375==??? by 0x4657AC: Rf_allocVector (Rinlinedfuns.h:577) > ==15375==??? by 0x4657AC: Rf_ScalarString (Rinlinedfuns.h:1007) > ==15375==??? by 0x4657AC: coerceToVectorList (coerce.c:892) > ==15375==??? by 0x4657AC: Rf_coerceVector (coerce.c:1293) > ==15375==??? by 0x4660EB: ascommon (coerce.c:1369) > ==15375==??? by 0x4667C0: do_asvector (coerce.c:1544) > ==15375==??? by 0x4CFCB9: bcEval (eval.c:6775) > ==15375==??? by 0x4D95AF: Rf_eval (eval.c:624) > ==15375==??? by 0x4DAD12: R_execClosure (eval.c:1764) > ==15375==??? by 0x515EF7: dispatchMethod (objects.c:408) > ==15375==??? by 0x516379: Rf_usemethod (objects.c:458) > ==15375==??? by 0x516694: do_usemethod (objects.c:543) > ==15375=> [1] "abc<ee>" > > Here is a patch for the native version of `substr` that highlights the problem and a possible fix.? Basically `substr` computes the byte width of a UTF-8 character based on the leading byte ("\xEE" here, which implies 3 bytes) and reads/writes that entire byte width irrespective of whether the string actually ends before the theoretical end of the UTF-8 "character". > > Index: src/main/character.c > ==================================================================> --- src/main/character.c????(revision 74482) > +++ src/main/character.c????(working copy) > @@ -283,7 +283,7 @@ > ????for (i = 0; i < so && str < end; i++) { > ????????int used = utf8clen(*str); > ????????if (i < sa - 1) { str += used; continue; } > -????????for (j = 0; j < used; j++) *buf++ = *str++; > +????????for (j = 0; j < used && str < end; j++) *buf++ = *str++; > ????} > ???? } else if (ienc == CE_LATIN1 || ienc == CE_BYTES) { > ????for (str += (sa - 1), i = sa; i <= so; i++) *buf++ = *str++; > > The change above removed the valgrind error for me.? I re-built R with the change and ran "make check" which seemed to work fine. I also ran some simple checks on UTF-8 strings and things seem to work okay. > > I have very limited experience making changes to R (this is my first attempt at a patch) so please take all of the above with extreme skepticism. > > Apologies in advance if this turns out to be a false alarm caused by an error on my part. > > Best, > > Brodie. > > PS: apologies also if the formatting of this e-mail is bad.? I have not figured out how to get plaintext working properly with yahoo. > > ______________________________________________ > R-devel at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel
Thank you for the quick response and for the quick fix (and for the rchk vagrant image I used to build and test the below!). One thing I'll note about the fix is that it may start breaking things that used to "work".? I think it is fair to say that character count is not well defined with illegal UTF-8 sequences (and noteworthy that `nchar` does actually stop when it encounters them), but there may be a bit of code out there that relied on being able to successfully complete (albeit while potentially corrupting memory) that will now produce errors.? It may be worth highlighting this in the release notes. Best, Brodie. On Thursday, March 29, 2018, 9:11:15 AM EDT, Tomas Kalibera <tomas.kalibera at gmail.com> wrote: Thanks, fixed in R-devel (by checking validity of UTF-8 strings for substr/substring). Tomas On 03/29/2018 03:53 AM, brodie gaslam via R-devel wrote:> I think there is a memory bug in `substr` that is triggered by a UTF-8 corner case: an incomplete UTF-8 byte sequence at the end of a string.? With a valgrind level 2 instrumented build of R-devel I get: > >> string <- "abc\xEE"??? # \xEE indicates the start of a 3 byte UTF-8 sequence >> Encoding(string) <- "UTF-8" >> substr(string, 1, 10) > ==15375== Invalid read of size 1 > ==15375==??? at 0x45B3F0: substr (character.c:286) > ==15375==??? by 0x45B3F0: do_substr (character.c:342) > ==15375==??? by 0x4CFCB9: bcEval (eval.c:6775) > ==15375==??? by 0x4D95AF: Rf_eval (eval.c:624) > ==15375==??? by 0x4DAD12: R_execClosure (eval.c:1764) > ==15375==??? by 0x4D9561: Rf_eval (eval.c:747) > ==15375==??? by 0x507008: Rf_ReplIteration (main.c:258) > ==15375==??? by 0x5073E7: R_ReplConsole (main.c:308) > ==15375==??? by 0x507494: run_Rmainloop (main.c:1082) > ==15375==??? by 0x41A8E6: main (Rmain.c:29) > ==15375==? Address 0xb9e518d is 3,869 bytes inside a block of size 7,960 alloc'd > ==15375==??? at 0x4C2DB8F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so) > ==15375==??? by 0x51033E: GetNewPage (memory.c:888) > ==15375==??? by 0x511FC0: Rf_allocVector3 (memory.c:2691) > ==15375==??? by 0x4657AC: Rf_allocVector (Rinlinedfuns.h:577) > ==15375==??? by 0x4657AC: Rf_ScalarString (Rinlinedfuns.h:1007) > ==15375==??? by 0x4657AC: coerceToVectorList (coerce.c:892) > ==15375==??? by 0x4657AC: Rf_coerceVector (coerce.c:1293) > ==15375==??? by 0x4660EB: ascommon (coerce.c:1369) > ==15375==??? by 0x4667C0: do_asvector (coerce.c:1544) > ==15375==??? by 0x4CFCB9: bcEval (eval.c:6775) > ==15375==??? by 0x4D95AF: Rf_eval (eval.c:624) > ==15375==??? by 0x4DAD12: R_execClosure (eval.c:1764) > ==15375==??? by 0x515EF7: dispatchMethod (objects.c:408) > ==15375==??? by 0x516379: Rf_usemethod (objects.c:458) > ==15375==??? by 0x516694: do_usemethod (objects.c:543) > ==15375=> [1] "abc<ee>" > > Here is a patch for the native version of `substr` that highlights the problem and a possible fix.? Basically `substr` computes the byte width of a UTF-8 character based on the leading byte ("\xEE" here, which implies 3 bytes) and reads/writes that entire byte width irrespective of whether the string actually ends before the theoretical end of the UTF-8 "character". > > Index: src/main/character.c > ==================================================================> --- src/main/character.c????(revision 74482) > +++ src/main/character.c????(working copy) > @@ -283,7 +283,7 @@ >? ????for (i = 0; i < so && str < end; i++) { >? ????????int used = utf8clen(*str); >? ????????if (i < sa - 1) { str += used; continue; } > -????????for (j = 0; j < used; j++) *buf++ = *str++; > +????????for (j = 0; j < used && str < end; j++) *buf++ = *str++; >? ????} >? ???? } else if (ienc == CE_LATIN1 || ienc == CE_BYTES) { >? ????for (str += (sa - 1), i = sa; i <= so; i++) *buf++ = *str++; > > The change above removed the valgrind error for me.? I re-built R with the change and ran "make check" which seemed to work fine. I also ran some simple checks on UTF-8 strings and things seem to work okay. > > I have very limited experience making changes to R (this is my first attempt at a patch) so please take all of the above with extreme skepticism. > > Apologies in advance if this turns out to be a false alarm caused by an error on my part. > > Best, > > Brodie. > > PS: apologies also if the formatting of this e-mail is bad.? I have not figured out how to get plaintext working properly with yahoo. > > ______________________________________________ > R-devel at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel[[alternative HTML version deleted]]