thr3ads.net - R devel - [Rd] R-4.3 version list.files function could not work correctly in chinese [Aug 2023]


Tomas Kalibera

2023-Aug-15 06:38 UTC

[Rd] R-4.3 version list.files function could not work correctly in chinese

On 8/13/23 13:16, Ivan Krylov wrote:
> Found it! Looks like a buffer length problem. This isn't limited to
> Chinese, just more likely to happen when a character takes three bytes
> to represent in UTF-8. (Any filename containing characters which take
> more than one byte to represent in UTF-8 may fail.)
>
> If a directory contains a file with a sufficiently long name,
> FindNextFile() fails with ERROR_MORE_DATA (0xEA, 234), making
> R_readdir() return NULL, stopping list_files() prematurely:
>
> # everything seems to work fine...
>
> list.files("????")
> # [1] "????-non-utf8-?????
> ????????????????????????????????????????????????????.txt"
> # [2] "????-non-utf8-?????.txt"
> # [3] "????-utf-8.txt"
>
> # now create a file with an even longer name
>
> list.files("????")
> # [1] "????-non-utf8-?????
> ????????????????????????????????????????????????????.txt"
>
> # the files are still there, but not visible to list.files():

Thanks, Ivan, could you please turn this into a complete minimal
reproducible example, ideally with only ASCII characters (if enough to
trigger)? Or any reproducible example would do. I would have a look
later today.

>
> system("cmd /c dir /s *.txt")
> #  Volume in drive C has no label.
> #  Volume Serial Number is A85A-AA74
> #
> #  Directory of C:\R\R-4.3.1\bin\x64\????
> #
> # 08/12/2023  07:57 AM                22 ????-non-utf8-?????
> ????????????????????????????????????????????????????.txt
> # 08/12/2023  07:57 AM                22 ????-non-utf8-?????
> ????????????????????????????????????????????????????????????????????????????????????????????????????????.txt
> # 08/12/2023  07:57 AM                22 ????-non-utf8-?????.txt
> # 08/12/2023  07:56 AM                18 ????-utf-8.txt
> # 4 File(s)             84 bytes
> #
> #       Total Files Listed:
> #                4 File(s)             84 bytes
> #                0 Dir(s)  29,281,538,048 bytes free
> # [1] 0
>
> Increasing the path length limits [*] doesn't help, since it's the
> filename length limit that we're bumping against. While both
> WIN32_FIND_DATAA and WIN32_FIND_DATAW contain fixed-size buffers, a
> valid filename may take more than MAX_PATH bytes to represent in UTF-8
> while still being under the limit of MAX_PATH wide characters. This may
> mean having to rewrite list_files in terms of R_wopendir()/R_wreaddir()
> for Windows. As a workaround, we may use the short filename (which
> sometimes may not exist, alas) when FindNextFile() fails with
> ERROR_MORE_DATA.
I admit I didn't follow your analysis. However, I've rewritten this code
for R 4.3 to support long paths (when enabled in the system); more in
https://blog.r-project.org/2023/03/07/path-length-limit-on-windows/index.html.
As this was reported to be a regression in 4.3, it is entirely possible
this change came with a regression (though it is a bit surprising we didn't
catch it earlier by testing), so it would be a great help if I could
have the example and debug it.

Thanks,
Tomas

Ivan Krylov

2023-Aug-15 07:04 UTC


[Rd] R-4.3 version list.files function could not work correctly in chinese

On Tue, 15 Aug 2023 08:38:11 +0200,
Tomas Kalibera <tomas.kalibera at gmail.com> wrote:
> As this was reported to be a regression in 4.3, it is entirely possible
> this change came with a regression (though a bit surprising we didn't 
> catch it earlier by testing), so it would be a great help if I could 
> have the example and debug it.
Sorry, let me try to be clearer.

The Windows filename length limit is 255(?) wide characters. The
WIN32_FIND_DATAA structure contains a 260-byte buffer for the filename
returned by FindFirstFileA()/FindNextFileA(). If a wide character
takes more than one byte to represent in UTF-8, the converted filename
may overflow that 260-byte buffer even though the name itself is
below the 260 wide character limit. When such an overflow happens,
FindNextFile() returns FALSE with GetLastError() == ERROR_MORE_DATA,
which makes R_readdir() return NULL and list_files() stop
before listing the rest of the directory.

This is easier to make happen by accident with Chinese characters,
because they take three UTF-8 bytes per character.
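
The arithmetic behind this can be checked on any platform, without touching the file system. The following is only a sketch: fits_in_buffer() is a hypothetical helper (not part of R), and it assumes the 260-byte buffer also has to hold the terminating NUL, so that at most 259 bytes are available for the name itself.

```r
# Hypothetical helper, not part of R: would the UTF-8 encoding of `name`
# fit into a buffer of `buf_size` bytes, terminating NUL included?
# (Assumption: buf_size = 260 mirrors the cFileName buffer discussed above.)
fits_in_buffer <- function(name, buf_size = 260L) {
  nchar(enc2utf8(name), type = "bytes") + 1L <= buf_size
}

latin  <- strrep("a", 200)        # 1 byte per character in UTF-8
oslash <- strrep("\u00f8", 140)   # 2 bytes per character in UTF-8
cjk    <- strrep("\u4e2d", 90)    # 3 bytes per character in UTF-8

nchar(oslash, type = "chars")     # 140 characters
nchar(oslash, type = "bytes")     # 280 bytes

fits_in_buffer(latin)             # TRUE
fits_in_buffer(oslash)            # FALSE
fits_in_buffer(cjk)               # FALSE
```

All three names are well under 255 characters, yet only the ASCII one stays within the byte budget; with three-byte characters, 87 of them are already enough to exceed it.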

Take the ø (\uf8) letter. It takes two bytes to represent in UTF-8.
Create a file with a name consisting of this symbol repeated 140 times.
When you run list.files() on the resulting directory on Windows with a
UTF-8 locale, Windows tries to fit (0xc3 0xb8) times 140 into a
260-byte buffer, which doesn't work. I'm afraid the only way to avoid
such a failure is to rewrite R_readdir using the wide character API and
convert the file names on the fly. (Just like mingw readdir() did in
the past?)

stopifnot(.Platform$OS.type == 'windows', l10n_info()$`UTF-8`)
# any character for which nchar(enc2utf8(.), 'bytes') > 1 will do
# any number >260/2 should do
file.create(strrep('\uf8', 140))
list.files()

Does this work? I don't have access to a UTF-8 Windows machine right
now.
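
A variant of the same test that stays out of the current working directory might look like the sketch below. missing_from_listing() is a hypothetical helper introduced here for illustration; it creates the files in a fresh temporary directory and returns the names that list.files() does not report. This is untested on a UTF-8 Windows machine; on an affected build one would expect a non-empty result, and an empty character vector otherwise.

```r
# Hypothetical helper: create empty files with the given names in a fresh
# temporary directory and return those that list.files() fails to report.
missing_from_listing <- function(names) {
  d <- tempfile("listfiles-")
  dir.create(d)
  on.exit(unlink(d, recursive = TRUE))   # clean up on exit
  # keep only the names that could actually be created
  created <- names[file.create(file.path(d, names))]
  setdiff(created, list.files(d))
}

# Only attempt the long name where the problem was reported:
# Windows running in a UTF-8 locale.
if (.Platform$OS.type == "windows" && isTRUE(l10n_info()$`UTF-8`)) {
  # one short name plus a 140-character name needing 280 bytes in UTF-8
  print(missing_from_listing(c("short.txt", strrep("\u00f8", 140))))
}
```

Including a short name alongside the long one is deliberate: since the listing is reported to stop at the offending entry, other files in the same directory may go missing too, depending on enumeration order.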

-- 
Best regards,
Ivan
