Tomas Kalibera
2023-Aug-15 06:38 UTC
[Rd] R-4.3 version list.files function could not work correctly in chinese
On 8/13/23 13:16, Ivan Krylov wrote:
> Found it! Looks like a buffer length problem. This isn't limited to
> Chinese, just more likely to happen when a character takes three bytes
> to represent in UTF-8. (Any filename containing characters which take
> more than one byte to represent in UTF-8 may fail.)
>
> If a directory contains a file with a sufficiently long name,
> FindNextFile() fails with ERROR_MORE_DATA (0xEA, 234), making
> R_readdir() return NULL, stopping list_files() prematurely:
>
> # everything seems to work fine...
> > list.files("????")
> # [1] "????-non-utf8-?????
> ????????????????????????????????????????????????????.txt"
> # [2] "????-non-utf8-?????.txt"
> # [3] "????-utf-8.txt"
>
> # now create a file with an even longer name
> > list.files("????")
> # [1] "????-non-utf8-?????
> ????????????????????????????????????????????????????.txt"
>
> # the files are still there, but not visible to list.files():

Thanks, Ivan, could you please turn this into a complete minimal reproducible example, ideally with only ASCII characters (if that is enough to trigger it)? Or any reproducible example would do. I would have a look later today.

> > system("cmd /c dir /s *.txt")
> # Volume in drive C has no label.
> # Volume Serial Number is A85A-AA74
> #
> # Directory of C:\R\R-4.3.1\bin\x64\????
> #
> # 08/12/2023 07:57 AM 22 ????-non-utf8-?????
> ????????????????????????????????????????????????????.txt
> # 08/12/2023 07:57 AM 22 ????-non-utf8-?????
> ????????????????????????????????????????????????????????????????????????????????????????????????????????.txt
> # 08/12/2023 07:57 AM 22 ????-non-utf8-?????.txt
> # 08/12/2023 07:56 AM 18 ????-utf-8.txt
> # 4 File(s) 84 bytes
> #
> # Total Files Listed:
> # 4 File(s) 84 bytes
> # 0 Dir(s) 29,281,538,048 bytes free
> # [1] 0
>
> Increasing the path length limits [*] doesn't help, since it's the
> filename length limit that we're bumping against. While both
> WIN32_FIND_DATAA and WIN32_FIND_DATAW contain fixed-size buffers, a
> valid filename may take more than MAX_PATH bytes to represent in UTF-8
> while still being under the limit of MAX_PATH wide characters. This may
> mean having to rewrite list_files in terms of R_wopendir()/R_wreaddir()
> for Windows. As a workaround, we may use the short filename (which
> sometimes may not exist, alas) when FindNextFile() fails with
> ERROR_MORE_DATA.

I admit I didn't get your analysis. However, I've rewritten this code for R 4.3 to support long paths (when enabled in the system), more in https://blog.r-project.org/2023/03/07/path-length-limit-on-windows/index.html. As this was reported to be a regression in 4.3, it is entirely possible this change came with a regression (though it is a bit surprising we didn't catch it earlier by testing), so it would be a great help if I could have the example and debug it.

Thanks,
Tomas
Ivan Krylov
2023-Aug-15 07:04 UTC
[Rd] R-4.3 version list.files function could not work correctly in chinese
On Tue, 15 Aug 2023 08:38:11 +0200, Tomas Kalibera <tomas.kalibera at gmail.com> wrote:

> As this was reported to be regression in 4.3, it is entirely possible
> this change came with a regression (though a bit surprising we didn't
> catch it earlier by testing), so it would be a great help if I could
> have the example and debug it.

Sorry, let me try to be more clear.

The Windows filename length limit is 255(?) wide characters. The WIN32_FIND_DATAA structure contains a 260-byte buffer for the filename to be returned by FindFirstFileA()/FindNextFileA(). If a wide character takes more than one byte to be represented in UTF-8, it may overflow the 260 byte limit in the WIN32_FIND_DATAA structure despite being below the 260 wide character limit.

When such an overflow happens, FindNextFile() returns FALSE with GetLastError() == ERROR_MORE_DATA, which results in R_readdir() returning NULL and makes list_files() stop before listing the rest of the directory. This is easier to make happen by accident with Chinese characters, because they take three UTF-8 bytes per character.

Take the ø (\uf8) letter. It takes two bytes to represent in UTF-8. Create a file with a name consisting of this symbol repeated 140 times. When you run list.files() on the resulting directory on Windows with a UTF-8 locale, Windows tries to fit (0xc3 0xb8) times 140 into a 260-byte buffer, which doesn't work.

I'm afraid the only way to avoid such a failure is to rewrite R_readdir using the wide character API and convert the file names on the fly. (Just like mingw readdir() did in the past?)

    stopifnot(.Platform$OS.type == 'windows', l10n_info()$`UTF-8`)
    # any character for which nchar(enc2utf8(.), 'bytes') > 1 will do
    # any number >260/2 should do
    file.create(strrep('\uf8', 140))
    list.files()

Does this work? I don't have access to a UTF-8 Windows machine right now.

-- 
Best regards,
Ivan
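
The byte arithmetic behind the example above can be checked on any platform without creating files. The following is only a sketch of that arithmetic, not a reproduction of the bug. The variable names are made up for illustration, 260 is the buffer size quoted in the message, and reserving one byte for a terminating NUL is an assumption of this sketch.

```r
# Illustrative arithmetic only: touches no files and calls no Windows API.
name <- strrep("\u00f8", 140)   # same name as in the example above
buf_size <- 260L                # buffer size quoted in the message (MAX_PATH)

n_chars <- nchar(name, type = "chars")             # characters in the name
n_bytes <- nchar(enc2utf8(name), type = "bytes")   # bytes once encoded as UTF-8

# Assumption of this sketch: one slot is reserved for the terminating NUL.
fits_as_wide_chars <- n_chars + 1L <= buf_size
fits_as_utf8_bytes <- n_bytes + 1L <= buf_size

cat("characters:", n_chars, " UTF-8 bytes:", n_bytes, "\n")
cat("fits in 260 wide characters:", fits_as_wide_chars, "\n")
cat("fits in 260 bytes:", fits_as_utf8_bytes, "\n")
```

Under these assumptions the name fits comfortably when counted in wide characters but not when counted in UTF-8 bytes, which is the mismatch described in the message.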