Barry Rowlingson
2024-May-30 07:29 UTC
[Rd] [External] Re: Segfault when parsing UTF-8 text with srcrefs
I get an R error and no segfault:

> parse(textConnection(text), srcfile = srcfile)
Error in parse(textConnection(text), srcfile = srcfile) :
  test.r:1:1: unexpected $end
1: ?
   ^

This is R 4.3.0, so maybe the bug has been introduced since then...

Version and system info:

> version
               _
platform       x86_64-pc-linux-gnu
arch           x86_64
os             linux-gnu
system         x86_64, linux-gnu
status
major          4
minor          3.0
year           2023
month          04
day            21
svn rev        84292
language       R
version.string R version 4.3.0 (2023-04-21)
nickname       Already Tomorrow

> sessionInfo()
R version 4.3.0 (2023-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.4 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C

time zone: Europe/London
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
[1] compiler_4.3.0

On Tue, May 28, 2024 at 7:42 PM Tomas Kalibera <tomas.kalibera at gmail.com> wrote:

> On 5/28/24 19:35, Hadley Wickham wrote:
> > Hi all,
> >
> > When I run the following code, R segfaults:
> >
> > text <- "?"
> > srcfile <- srcfilecopy("test.r", text)
> > parse(textConnection(text), srcfile = srcfile)
> >
> > It doesn't segfault if text is ASCII, or it's not wrapped in
> > textConnection, or srcfile isn't set.
>
> Thanks, this is because R parser doesn't support non-ASCII UTF-8 outside
> string literals and comments, plus a missing bounds check. The "correct"
> result should be an R error, which I get in a debug build.
>
> The tokenizer ends up with a negative token and then when the parse data
> are being finalized, creating a table of token names, there is an out of
> bounds access (yytname array). Probably the check should go right away
> into the tokenizer.
>
> Tomas
>
> > Hadley
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
Tomas Kalibera
2024-May-30 08:00 UTC
[Rd] [External] Re: Segfault when parsing UTF-8 text with srcrefs
On 5/30/24 09:29, Barry Rowlingson wrote:
> I get an R error and no segfault:
>
> > parse(textConnection(text), srcfile = srcfile)
> Error in parse(textConnection(text), srcfile = srcfile) :
>   test.r:1:1: unexpected $end
> 1: ?
>    ^
>
> This is R 4.3.0, so maybe the bug has been introduced since then...

Thanks, am looking into it and have found the cause, now testing a patch.
The bug has been in the code for a long time, but whether it causes a
crash or not is non-deterministic, depending on memory layout and content
(out of bounds access).

Tomas
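
The kind of bounds check discussed in this thread can be illustrated with a small C sketch. This is not the actual patch and not code from R's parser: the table `token_names` is a hypothetical stand-in for bison's `yytname`, and `TOKEN_INVALID`, `token_is_valid`, `token_name_checked` and `describe_token` are names invented for illustration. It shows one check that could sit in the tokenizer and one that guards the table lookup, so that a negative token yields an error rather than an out-of-bounds read.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for the parser's token-name table (yytname in the
 * thread). The real table is generated by bison and is much larger. */
static const char *const token_names[] = {
    "$end", "error", "$undefined", "SYMBOL", "STR_CONST"
};
enum { N_TOKEN_NAMES = sizeof token_names / sizeof token_names[0] };

/* Hypothetical code for input the tokenizer cannot classify, for example
 * an unsupported non-ASCII byte sequence outside strings and comments. */
enum { TOKEN_INVALID = -1 };

/* Tokenizer-side check: returns 1 if the token is usable as a table index. */
int token_is_valid(int token)
{
    return token >= 0 && token < N_TOKEN_NAMES;
}

/* Table-side check: returns the token name, or NULL when the token is out
 * of range, instead of reading outside the array. */
const char *token_name_checked(int token)
{
    if (!token_is_valid(token))
        return NULL;
    return token_names[token];
}

/* Writes a diagnostic into buf. Returns 0 for a known token and -1 for an
 * invalid one, so the caller can raise an ordinary error. */
int describe_token(int token, char *buf, size_t buflen)
{
    const char *name = token_name_checked(token);

    assert(buf != NULL && buflen > 0);
    if (name == NULL) {
        snprintf(buf, buflen, "invalid token %d", token);
        return -1;
    }
    snprintf(buf, buflen, "token %d is %s", token, name);
    return 0;
}
```

With a guard like this, a caller that receives `-1` from `describe_token` can report a parse error deterministically, whereas an unchecked `token_names[token]` with a negative index is undefined behaviour and may or may not crash depending on what happens to lie in memory.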