thr3ads.net - R devel - [Rd] R-4.3 version list.files function could not work correctly in chinese [Aug 2023]

If this information is useful, please help other people find it:
Share via:

Ivan Krylov

2023-Aug-15 07:04 UTC

[Rd] R-4.3 version list.files function could not work correctly in chinese

? Tue, 15 Aug 2023 08:38:11 +0200
Tomas Kalibera <tomas.kalibera at gmail.com> ?????:
> As this was reported to be regression in 4.3, it is entirely possible 
> this change came with a regression (though a bit surprising we didn't 
> catch it earlier by testing), so it would be a great help if I could 
> have the example and debug it.
Sorry, let me try to be more clear.

The Windows filename length limit is 255(?) wide characters. The
WIN32_FIND_DATAA structure contains a 260-byte buffer for the filename
to be returned by FindFirstFileA()/FindNextFileA(). If a wide character
takes more than one byte to be represented in UTF-8, it may overflow
the 260 byte limit in the WIN32_FIND_DATAA structure despite being
below the 260 wide character limit. When such an overflow happens,
FindNextFile() returns FALSE with GetLastError() == ERROR_MORE_DATA,
which results in R_readdir() returning NULL and makes list_files() stop
before listing the rest of the directory.

This is easier to make happen by accident with Chinese characters,
because they take three UTF-8 bytes per character.

Take the ? (\uf8) letter. It takes two bytes to represent in UTF-8.
Create a file with a name consisting of this symbol repeated 140 times.
When you run list.files() on the resulting directory on Windows with a
UTF-8 locale, Windows tries to fit (0xc3 0xb8) times 140 into a
260-byte buffer, which doesn't work. I'm afraid the only way to avoid
such a failure is to rewrite R_readdir using the wide character API and
convert the file names on the fly. (Just like mingw readdir() did in
the past?)

stopifnot(.Platform$OS.type == 'windows', l10n_info()$`UTF-8`)
# any character for which nchar(enc2utf8(.), 'bytes') > 1 will do
# any number >260/2 should do
file.create(strrep('\uf8', 140))
list.files()

Does this work? I don't have access to a UTF-8 Windows machine right
now.

-- 
Best regards,
Ivan

Tomas Kalibera

2023-Aug-15 14:00 UTC

head link

[Rd] R-4.3 version list.files function could not work correctly in chinese

On 8/15/23 09:04, Ivan Krylov wrote:> ? Tue, 15 Aug 2023 08:38:11 +0200
> Tomas Kalibera <tomas.kalibera at gmail.com> ?????:
>
>> As this was reported to be regression in 4.3, it is entirely possible
>> this change came with a regression (though a bit surprising we
didn't
>> catch it earlier by testing), so it would be a great help if I could
>> have the example and debug it.
> Sorry, let me try to be more clear.
>
> The Windows filename length limit is 255(?) wide characters. The
> WIN32_FIND_DATAA structure contains a 260-byte buffer for the filename
> to be returned by FindFirstFileA()/FindNextFileA(). If a wide character
> takes more than one byte to be represented in UTF-8, it may overflow
> the 260 byte limit in the WIN32_FIND_DATAA structure despite being
> below the 260 wide character limit. When such an overflow happens,
> FindNextFile() returns FALSE with GetLastError() == ERROR_MORE_DATA,
> which results in R_readdir() returning NULL and makes list_files() stop
> before listing the rest of the directory.
>
> This is easier to make happen by accident with Chinese characters,
> because they take three UTF-8 bytes per character.
>
> Take the ? (\uf8) letter. It takes two bytes to represent in UTF-8.
> Create a file with a name consisting of this symbol repeated 140 times.
> When you run list.files() on the resulting directory on Windows with a
> UTF-8 locale, Windows tries to fit (0xc3 0xb8) times 140 into a
> 260-byte buffer, which doesn't work. I'm afraid the only way to
avoid
> such a failure is to rewrite R_readdir using the wide character API and
> convert the file names on the fly. (Just like mingw readdir() did in
> the past?)
>
> stopifnot(.Platform$OS.type == 'windows', l10n_info()$`UTF-8`)
> # any character for which nchar(enc2utf8(.), 'bytes') > 1 will
do
> # any number >260/2 should do
> file.create(strrep('\uf8', 140))
> list.files()
>
> Does this work? I don't have access to a UTF-8 Windows machine right
> now.
Thanks, yes, I can reproduce the problem. Some Windows functions impose 
260 wide characters limit, but other 260 bytes limit, so one can create 
a file with a name too long to be found by FindNextFileA.

In R 4.2, we used readdir() from mingw-w64, which itself used findnext, 
which however had the same problem, it used a buffer of size 260 bytes 
and from the code of mingw-w64 and the Windows documentation, it should 
have behaved the same, it should have stopped the search on such a long 
file name. However, in my use case, R 4.2.3 crashed inside findnext due 
to stack overrun, R 4.1.3 worked, but clearly it would require a 
different use case to overrun this buffer as it didn't use UTF-8. This 
suggests that findnext didn't have a check for this and hence caused 
memory corruption, which can lead to a crash or work by coincidence. 
Which could have been the case for the user reporting this as a 
regression compared to R 4.2. But it is not a regression, the problem 
existed for long.

So, yes, we'd probably have to use wide variants of FindNext/FindFirst. 
I'll fix.

Thanks for debugging this,
Tomas

R devel - Aug 2023 - R-4.3 version list.files function could not work correctly in chinese

[Rd] R-4.3 version list.files function could not work correctly in chinese

[Rd] R-4.3 version list.files function could not work correctly in chinese