Ivan Krylov
2023-Aug-12 15:33 UTC
[Rd] R-4.3 version list.files function could not work correctly in chinese
Dear Yihui,

Thanks a lot for your help!

Unfortunately, I was not able to reproduce this. I've tried creating files with Chinese characters in their names and populating them with valid UTF-8 and valid non-UTF-8 text, but R seems to be able to list them all in my case.

I'm running a US English evaluation ISO image of a slightly newer build of Windows 10, and I also compiled R-4.3.1 from source, anticipating having to single-step through the list.files() implementation:

sessionInfo()
# R version 4.3.1 (2023-06-16 ucrt)
# Platform: x86_64-w64-mingw32/x64 (64-bit)
# Running under: Windows 10 x64 (build 19045)
#
# Matrix products: default
#
#
# locale:
# [1] LC_COLLATE=English_United States.utf8  LC_CTYPE=English_United States.utf8
# [3] LC_MONETARY=English_United States.utf8 LC_NUMERIC=C
# [5] LC_TIME=English_United States.utf8
#
# time zone: America/Los_Angeles
# tzcode source: internal
#
# attached base packages:
# [1] stats graphics grDevices utils datasets methods base
#
# loaded via a namespace (and not attached):
# [1] compiler_4.3.1

dir("????")
# [1] "????-non-utf8-?????.txt" "????-utf-8.txt"

system('cmd /c dir /s *.txt')
#  Volume in drive C has no label.
#  Volume Serial Number is A85A-AA74
#
#  Directory of C:\R\R-4.3.1\bin\x64\????
#
# 08/12/2023  07:57 AM    22 ????-non-utf8-?????.txt
# 08/12/2023  07:56 AM    18 ????-utf-8.txt
#                2 File(s)             40 bytes
#
#      Total Files Listed:
#                2 File(s)             40 bytes
#                0 Dir(s)  29,538,418,688 bytes free
# [1] 0

(The OEM codepage cannot represent the characters I used in the file names, but all the files are present in both lists.)

In order to find out what's wrong, someone will need to download the R source code and compile it [*], install gdb using pacman (part of Rtools), then set a breakpoint on the list_files function from src/main/platform.c and step through it [**], paying attention to the R_readdir calls. Do the missing file names not even come out from FindNextFile()? Are they somehow skipped around the time of regex match? (I could help with the details of this, maybe off-list, if there's interest.)

Unless Tomas Kalibera is able to deduce the root cause from the observed symptoms, someone who can reproduce the problem will have to investigate further.

-- 
Best regards,
Ivan

[*] https://cran.r-project.org/bin/windows/base/howto-R-devel.html
[**] https://beej.us/guide/bggdb/
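Before reaching for the debugger, the two questions above (do the names come back from the directory scan at all, and are they lost at the regex step?) can be narrowed down from R itself. The following is only a minimal sketch under stated assumptions: the file names are hypothetical examples, not the ones from the original report, and check_listing() is a helper made up for this illustration, not part of R.

```r
# Hypothetical file names (NOT the ones from the report), written with \u
# escapes so the script stays pure ASCII and survives copy and paste.
example_names <- c(
  "\u6d4b\u8bd5-utf-8.txt",
  "\u4e2d\u6587-\u6587\u4ef6.txt"
)

# check_listing() is an illustrative helper, not an R function.
check_listing <- function(file_names, dir = tempfile("listfiles-check-")) {
  dir.create(dir)
  # Try to create each file; a name that cannot be created in the current
  # locale counts as "not created" instead of stopping the script.
  created <- vapply(file.path(dir, file_names), function(path) {
    tryCatch(suppressWarnings(file.create(path)), error = function(e) FALSE)
  }, logical(1), USE.NAMES = FALSE)

  listed_all <- list.files(dir)                       # no pattern given
  listed_txt <- list.files(dir, pattern = "\\.txt$")  # goes through the regex filter

  list(
    created       = created,
    listed_all    = listed_all,
    lost_in_scan  = sum(created) - length(listed_all),
    lost_in_regex = setdiff(listed_all, listed_txt)
  )
}

res <- check_listing(example_names)
str(res)
```

A nonzero lost_in_scan would suggest that names disappear before any pattern matching happens, while a non-empty lost_in_regex would point at the regex step. Either result only narrows the search; it does not replace stepping through list_files in gdb.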
叶月光
2023-Aug-13 05:39 UTC
[Rd] Reply: R-4.3 version list.files function could not work correctly in chinese
Is the list.files function not correct?

-------------- next part --------------
A non-text attachment was scrubbed...
Name: r-sessioninfo.png
Type: image/png
Size: 127929 bytes
Desc: r-sessioninfo.png
URL: <https://stat.ethz.ch/pipermail/r-devel/attachments/20230813/ab7820ad/attachment.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: list.files_test.png
Type: image/png
Size: 38952 bytes
Desc: list.files_test.png
URL: <https://stat.ethz.ch/pipermail/r-devel/attachments/20230813/ab7820ad/attachment-0001.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: path-files.png
Type: image/png
Size: 29532 bytes
Desc: path-files.png
URL: <https://stat.ethz.ch/pipermail/r-devel/attachments/20230813/ab7820ad/attachment-0002.png>
Ivan Krylov
2023-Aug-13 07:58 UTC
[Rd] R-4.3 version list.files function could not work correctly in chinese
Dear 叶月光,

I believe you that there's a problem with list.files() and file names in Chinese. There is no need for additional proof. Unfortunately, it's impossible to fix the problem unless its source is found: https://www.chiark.greenend.org.uk/~sgtatham/bugs-cn.html

Can you give me more examples of file names, _as text_, that I could _copy and paste_ into my computer in order to (hopefully) reproduce the problem here?

Alternatively, can you use a debugger for programs written in C? Do you know someone who can?

-- 
Best regards,
Ivan
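Since non-ASCII file names tend to arrive as "????" once they pass through e-mail, one way to send them "as text" is to spell out their bytes and Unicode code points in plain ASCII. The sketch below is an assumption about what might be convenient here, not something requested in the thread; describe_names() is a hypothetical helper, and the commented-out path in the usage line is a placeholder.

```r
# describe_names() is an illustrative helper, not an R function.
# For each file name it reports the raw bytes as R currently holds them and
# the Unicode code points after conversion to UTF-8, both as ASCII text.
describe_names <- function(file_names) {
  bytes <- vapply(file_names, function(nm) {
    paste(as.character(charToRaw(nm)), collapse = " ")
  }, character(1), USE.NAMES = FALSE)

  codepoints <- vapply(enc2utf8(file_names), function(nm) {
    # utf8ToInt() yields NA if the name is not valid UTF-8 after conversion
    paste(sprintf("U+%04X", utf8ToInt(nm)), collapse = " ")
  }, character(1), USE.NAMES = FALSE)

  data.frame(bytes = bytes, codepoints = codepoints, stringsAsFactors = FALSE)
}

# Example with a hypothetical name; for a real report one would run
# describe_names(list.files("path/to/the/affected/directory")) instead.
demo <- describe_names("\u6d4b.txt")
print(demo)
```

The output can be pasted into a message unchanged, and the recipient can rebuild the exact names from the code points with intToUtf8().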