thr3ads.net - R devel - [Rd] R-4.3 version list.files function could not work correctly in chinese [Aug 2023]

If this information is useful, please help other people find it:
Share via:

Tomas Kalibera

2023-Aug-15 14:00 UTC

[Rd] R-4.3 version list.files function could not work correctly in chinese

On 8/15/23 09:04, Ivan Krylov wrote:> ? Tue, 15 Aug 2023 08:38:11 +0200
> Tomas Kalibera <tomas.kalibera at gmail.com> ?????:
>
>> As this was reported to be regression in 4.3, it is entirely possible
>> this change came with a regression (though a bit surprising we
didn't
>> catch it earlier by testing), so it would be a great help if I could
>> have the example and debug it.
> Sorry, let me try to be more clear.
>
> The Windows filename length limit is 255(?) wide characters. The
> WIN32_FIND_DATAA structure contains a 260-byte buffer for the filename
> to be returned by FindFirstFileA()/FindNextFileA(). If a wide character
> takes more than one byte to be represented in UTF-8, it may overflow
> the 260 byte limit in the WIN32_FIND_DATAA structure despite being
> below the 260 wide character limit. When such an overflow happens,
> FindNextFile() returns FALSE with GetLastError() == ERROR_MORE_DATA,
> which results in R_readdir() returning NULL and makes list_files() stop
> before listing the rest of the directory.
>
> This is easier to make happen by accident with Chinese characters,
> because they take three UTF-8 bytes per character.
>
> Take the ? (\uf8) letter. It takes two bytes to represent in UTF-8.
> Create a file with a name consisting of this symbol repeated 140 times.
> When you run list.files() on the resulting directory on Windows with a
> UTF-8 locale, Windows tries to fit (0xc3 0xb8) times 140 into a
> 260-byte buffer, which doesn't work. I'm afraid the only way to
avoid
> such a failure is to rewrite R_readdir using the wide character API and
> convert the file names on the fly. (Just like mingw readdir() did in
> the past?)
>
> stopifnot(.Platform$OS.type == 'windows', l10n_info()$`UTF-8`)
> # any character for which nchar(enc2utf8(.), 'bytes') > 1 will
do
> # any number >260/2 should do
> file.create(strrep('\uf8', 140))
> list.files()
>
> Does this work? I don't have access to a UTF-8 Windows machine right
> now.
Thanks, yes, I can reproduce the problem. Some Windows functions impose 
260 wide characters limit, but other 260 bytes limit, so one can create 
a file with a name too long to be found by FindNextFileA.

In R 4.2, we used readdir() from mingw-w64, which itself used findnext, 
which however had the same problem, it used a buffer of size 260 bytes 
and from the code of mingw-w64 and the Windows documentation, it should 
have behaved the same, it should have stopped the search on such a long 
file name. However, in my use case, R 4.2.3 crashed inside findnext due 
to stack overrun, R 4.1.3 worked, but clearly it would require a 
different use case to overrun this buffer as it didn't use UTF-8. This 
suggests that findnext didn't have a check for this and hence caused 
memory corruption, which can lead to a crash or work by coincidence. 
Which could have been the case for the user reporting this as a 
regression compared to R 4.2. But it is not a regression, the problem 
existed for long.

So, yes, we'd probably have to use wide variants of FindNext/FindFirst. 
I'll fix.

Thanks for debugging this,
Tomas

Tomas Kalibera

2023-Aug-16 07:42 UTC

head link

[Rd] R-4.3 version list.files function could not work correctly in chinese

On 8/15/23 16:00, Tomas Kalibera wrote:>
> On 8/15/23 09:04, Ivan Krylov wrote:
>> ? Tue, 15 Aug 2023 08:38:11 +0200
>> Tomas Kalibera <tomas.kalibera at gmail.com> ?????:
>>
>>> As this was reported to be regression in 4.3, it is entirely
possible
>>> this change came with a regression (though a bit surprising we
didn't
>>> catch it earlier by testing), so it would be a great help if I
could
>>> have the example and debug it.
>> Sorry, let me try to be more clear.
>>
>> The Windows filename length limit is 255(?) wide characters. The
>> WIN32_FIND_DATAA structure contains a 260-byte buffer for the filename
>> to be returned by FindFirstFileA()/FindNextFileA(). If a wide character
>> takes more than one byte to be represented in UTF-8, it may overflow
>> the 260 byte limit in the WIN32_FIND_DATAA structure despite being
>> below the 260 wide character limit. When such an overflow happens,
>> FindNextFile() returns FALSE with GetLastError() == ERROR_MORE_DATA,
>> which results in R_readdir() returning NULL and makes list_files() stop
>> before listing the rest of the directory.
>>
>> This is easier to make happen by accident with Chinese characters,
>> because they take three UTF-8 bytes per character.
>>
>> Take the ? (\uf8) letter. It takes two bytes to represent in UTF-8.
>> Create a file with a name consisting of this symbol repeated 140 times.
>> When you run list.files() on the resulting directory on Windows with a
>> UTF-8 locale, Windows tries to fit (0xc3 0xb8) times 140 into a
>> 260-byte buffer, which doesn't work. I'm afraid the only way to
avoid
>> such a failure is to rewrite R_readdir using the wide character API and
>> convert the file names on the fly. (Just like mingw readdir() did in
>> the past?)
>>
>> stopifnot(.Platform$OS.type == 'windows', l10n_info()$`UTF-8`)
>> # any character for which nchar(enc2utf8(.), 'bytes') > 1
will do
>> # any number >260/2 should do
>> file.create(strrep('\uf8', 140))
>> list.files()
>>
>> Does this work? I don't have access to a UTF-8 Windows machine
right
>> now.
>
> Thanks, yes, I can reproduce the problem. Some Windows functions 
> impose 260 wide characters limit, but other 260 bytes limit, so one 
> can create a file with a name too long to be found by FindNextFileA.
>
> In R 4.2, we used readdir() from mingw-w64, which itself used 
> findnext, which however had the same problem, it used a buffer of size 
> 260 bytes and from the code of mingw-w64 and the Windows 
> documentation, it should have behaved the same, it should have stopped 
> the search on such a long file name. However, in my use case, R 4.2.3 
> crashed inside findnext due to stack overrun, R 4.1.3 worked, but 
> clearly it would require a different use case to overrun this buffer 
> as it didn't use UTF-8. This suggests that findnext didn't have a 
> check for this and hence caused memory corruption, which can lead to a 
> crash or work by coincidence. Which could have been the case for the 
> user reporting this as a regression compared to R 4.2. But it is not a 
> regression, the problem existed for long.
>
> So, yes, we'd probably have to use wide variants of 
> FindNext/FindFirst. I'll fix.
Fixed in R-devel (84960). Please let me know if you see any problem with 
the fix.

Thanks,
Tomas
>
> Thanks for debugging this,
> Tomas
>
>
>

R devel - Aug 2023 - R-4.3 version list.files function could not work correctly in chinese

[Rd] R-4.3 version list.files function could not work correctly in chinese

[Rd] R-4.3 version list.files function could not work correctly in chinese