Tomas Kalibera
2023-Aug-15 14:00 UTC
[Rd] R-4.3 version list.files function could not work correctly in chinese
On 8/15/23 09:04, Ivan Krylov wrote:> ? Tue, 15 Aug 2023 08:38:11 +0200 > Tomas Kalibera <tomas.kalibera at gmail.com> ?????: > >> As this was reported to be regression in 4.3, it is entirely possible >> this change came with a regression (though a bit surprising we didn't >> catch it earlier by testing), so it would be a great help if I could >> have the example and debug it. > Sorry, let me try to be more clear. > > The Windows filename length limit is 255(?) wide characters. The > WIN32_FIND_DATAA structure contains a 260-byte buffer for the filename > to be returned by FindFirstFileA()/FindNextFileA(). If a wide character > takes more than one byte to be represented in UTF-8, it may overflow > the 260 byte limit in the WIN32_FIND_DATAA structure despite being > below the 260 wide character limit. When such an overflow happens, > FindNextFile() returns FALSE with GetLastError() == ERROR_MORE_DATA, > which results in R_readdir() returning NULL and makes list_files() stop > before listing the rest of the directory. > > This is easier to make happen by accident with Chinese characters, > because they take three UTF-8 bytes per character. > > Take the ? (\uf8) letter. It takes two bytes to represent in UTF-8. > Create a file with a name consisting of this symbol repeated 140 times. > When you run list.files() on the resulting directory on Windows with a > UTF-8 locale, Windows tries to fit (0xc3 0xb8) times 140 into a > 260-byte buffer, which doesn't work. I'm afraid the only way to avoid > such a failure is to rewrite R_readdir using the wide character API and > convert the file names on the fly. (Just like mingw readdir() did in > the past?) > > stopifnot(.Platform$OS.type == 'windows', l10n_info()$`UTF-8`) > # any character for which nchar(enc2utf8(.), 'bytes') > 1 will do > # any number >260/2 should do > file.create(strrep('\uf8', 140)) > list.files() > > Does this work? I don't have access to a UTF-8 Windows machine right > now.Thanks, yes, I can reproduce the problem. Some Windows functions impose 260 wide characters limit, but other 260 bytes limit, so one can create a file with a name too long to be found by FindNextFileA. In R 4.2, we used readdir() from mingw-w64, which itself used findnext, which however had the same problem, it used a buffer of size 260 bytes and from the code of mingw-w64 and the Windows documentation, it should have behaved the same, it should have stopped the search on such a long file name. However, in my use case, R 4.2.3 crashed inside findnext due to stack overrun, R 4.1.3 worked, but clearly it would require a different use case to overrun this buffer as it didn't use UTF-8. This suggests that findnext didn't have a check for this and hence caused memory corruption, which can lead to a crash or work by coincidence. Which could have been the case for the user reporting this as a regression compared to R 4.2. But it is not a regression, the problem existed for long. So, yes, we'd probably have to use wide variants of FindNext/FindFirst. I'll fix. Thanks for debugging this, Tomas
Tomas Kalibera
2023-Aug-16 07:42 UTC
[Rd] R-4.3 version list.files function could not work correctly in chinese
On 8/15/23 16:00, Tomas Kalibera wrote:> > On 8/15/23 09:04, Ivan Krylov wrote: >> ? Tue, 15 Aug 2023 08:38:11 +0200 >> Tomas Kalibera <tomas.kalibera at gmail.com> ?????: >> >>> As this was reported to be regression in 4.3, it is entirely possible >>> this change came with a regression (though a bit surprising we didn't >>> catch it earlier by testing), so it would be a great help if I could >>> have the example and debug it. >> Sorry, let me try to be more clear. >> >> The Windows filename length limit is 255(?) wide characters. The >> WIN32_FIND_DATAA structure contains a 260-byte buffer for the filename >> to be returned by FindFirstFileA()/FindNextFileA(). If a wide character >> takes more than one byte to be represented in UTF-8, it may overflow >> the 260 byte limit in the WIN32_FIND_DATAA structure despite being >> below the 260 wide character limit. When such an overflow happens, >> FindNextFile() returns FALSE with GetLastError() == ERROR_MORE_DATA, >> which results in R_readdir() returning NULL and makes list_files() stop >> before listing the rest of the directory. >> >> This is easier to make happen by accident with Chinese characters, >> because they take three UTF-8 bytes per character. >> >> Take the ? (\uf8) letter. It takes two bytes to represent in UTF-8. >> Create a file with a name consisting of this symbol repeated 140 times. >> When you run list.files() on the resulting directory on Windows with a >> UTF-8 locale, Windows tries to fit (0xc3 0xb8) times 140 into a >> 260-byte buffer, which doesn't work. I'm afraid the only way to avoid >> such a failure is to rewrite R_readdir using the wide character API and >> convert the file names on the fly. (Just like mingw readdir() did in >> the past?) >> >> stopifnot(.Platform$OS.type == 'windows', l10n_info()$`UTF-8`) >> # any character for which nchar(enc2utf8(.), 'bytes') > 1 will do >> # any number >260/2 should do >> file.create(strrep('\uf8', 140)) >> list.files() >> >> Does this work? I don't have access to a UTF-8 Windows machine right >> now. > > Thanks, yes, I can reproduce the problem. Some Windows functions > impose 260 wide characters limit, but other 260 bytes limit, so one > can create a file with a name too long to be found by FindNextFileA. > > In R 4.2, we used readdir() from mingw-w64, which itself used > findnext, which however had the same problem, it used a buffer of size > 260 bytes and from the code of mingw-w64 and the Windows > documentation, it should have behaved the same, it should have stopped > the search on such a long file name. However, in my use case, R 4.2.3 > crashed inside findnext due to stack overrun, R 4.1.3 worked, but > clearly it would require a different use case to overrun this buffer > as it didn't use UTF-8. This suggests that findnext didn't have a > check for this and hence caused memory corruption, which can lead to a > crash or work by coincidence. Which could have been the case for the > user reporting this as a regression compared to R 4.2. But it is not a > regression, the problem existed for long. > > So, yes, we'd probably have to use wide variants of > FindNext/FindFirst. I'll fix.Fixed in R-devel (84960). Please let me know if you see any problem with the fix. Thanks, Tomas> > Thanks for debugging this, > Tomas > > >