thr3ads.net - R devel - [Rd] [External] Re: Segfault when parsing UTF-8 text with srcrefs [May 2024]

Barry Rowlingson

2024-May-30 07:29 UTC

[Rd] [External] Re: Segfault when parsing UTF-8 text with srcrefs

I get an R error and no segfault:
> parse(textConnection(text), srcfile = srcfile)
Error in parse(textConnection(text), srcfile = srcfile) :
  test.r:1:1: unexpected $end
1: ?
    ^

This is R 4.3.0, so maybe the bug has been introduced since then...

Version and system info:
> version
               _
platform       x86_64-pc-linux-gnu
arch           x86_64
os             linux-gnu
system         x86_64, linux-gnu
status
major          4
minor          3.0
year           2023
month          04
day            21
svn rev        84292
language       R
version.string R version 4.3.0 (2023-04-21)
nickname       Already Tomorrow
> sessionInfo()
R version 4.3.0 (2023-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.4 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;
 LAPACK version 3.10.0

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C

time zone: Europe/London
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
[1] compiler_4.3.0

On Tue, May 28, 2024 at 7:42 PM Tomas Kalibera <tomas.kalibera at gmail.com> wrote:
> This email originated outside the University. Check before clicking links
> or attachments.
>
> On 5/28/24 19:35, Hadley Wickham wrote:
> > Hi all,
> >
> > When I run the following code, R segfaults:
> >
> > text <- "?"
> > srcfile <- srcfilecopy("test.r", text)
> > parse(textConnection(text), srcfile = srcfile)
> >
> > It doesn't segfault if text is ASCII, or it's not wrapped in
> > textConnection, or srcfile isn't set.
>
> Thanks, this is because R parser doesn't support non-ASCII UTF-8 outside
> string literals and comments, plus a missing bounds check. The "correct"
> result should be an R error, which I get in a debug build.
>
> The tokenizer ends up with a negative token and then when the parse data
> are being finalized, creating a table of token names, there is an out of
> bounds access (yytname array). Probably the check should go right away
> into the tokenizer.
>
> Tomas
>
> >
> > Hadley
> >
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>

Tomas Kalibera

2024-May-30 08:00 UTC

[Rd] [External] Re: Segfault when parsing UTF-8 text with srcrefs

On 5/30/24 09:29, Barry Rowlingson wrote:
> I get an R error and no segfault:
>
> > parse(textConnection(text), srcfile = srcfile)
> Error in parse(textConnection(text), srcfile = srcfile) :
>   test.r:1:1: unexpected $end
> 1: ?
>     ^
>
> This is R 4.3.0, so maybe the bug has been introduced since then...
Thanks, I am looking into it and have found the cause; I am now testing a
patch. The bug has been in the code for a long time, but whether it
causes a crash is non-deterministic: it is an out-of-bounds access, so the
outcome depends on memory layout and content.

Tomas
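
The failure mode described in this thread (a negative token value later used to index a table of token names) can be illustrated with a minimal C sketch. This is not the code from R's parser or the patch under test: `token_names`, `N_TOKENS` and `token_name_checked` are hypothetical names, and the table is a tiny stand-in for a generated array such as yytname. It only shows the kind of bounds check that turns an out-of-bounds read into a well-defined result.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a generated token-name table such as yytname. */
static const char *const token_names[] = {
    "$end", "error", "$undefined", "SYMBOL", "STR_CONST"
};

#define N_TOKENS (sizeof token_names / sizeof token_names[0])

/* Return the name for a token code. A negative code, or one past the end
   of the table, yields a placeholder instead of an out-of-bounds read. */
static const char *token_name_checked(int token)
{
    if (token < 0 || (size_t)token >= N_TOKENS)
        return "<invalid token>";
    return token_names[token];
}

/* Small self-check of the intended behaviour. */
static void check_token_lookup(void)
{
    assert(strcmp(token_name_checked(1), "error") == 0);
    assert(strcmp(token_name_checked(-7), "<invalid token>") == 0);
}
```

Without the range test, `token_names[-7]` would read whatever happens to sit before the table in memory, which is consistent with the observation above that a crash may or may not occur. Rejecting the bad value earlier, in the tokenizer itself, would be the alternative placement mentioned in the quoted message.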
