tlaronde at kergis.com
2024-Feb-05 12:18 UTC
[Samba] Languages and encoding: file system and file contents
I'm rather unclear about the way CIFS/Samba deal with languages and encoding, and in particular about the encoding of pathnames vs. the encoding of files that MS Windows considers to be text files, and which therefore perhaps have a language and an encoding.

So I will try to formulate questions about the elementary points, and I will be grateful to anyone who can shed light on them:

"dos charset" and "unix charset" are global parameters. As far as I understand the description:
a) they fix the encoding of the _contents_ of files;
b) the parameters are global, so it is not possible to have different values for different shares and/or different users.

Context: I imagine a Unix filesystem served to Windows clients.

So the questions:

1) These parameters don't seem to have anything to do with pathnames, but when "unix charset" is set to ASCII, non-ASCII pathnames are not displayed on the clients. What is the relationship between charsets and pathnames?

2) What is the relationship between the language settings (LC_* and LANG) of the user and the "charsets" defined in smb.conf? Is the locale encoding of a Unix user mapped to a Samba user used in any way to inform a Windows client about the language and encoding of the contents?

3) If a MS Windows client connects to a share as another user (say a Unix one), and the encoding on MS Windows differs from the one defined for the connecting user, is there a problem? (Ex.: Windows is configured to use latin9 or equivalent; the user used to connect is declared as using UTF-8. What encoding will a Windows program use, latin9 or UTF-8? I'm not talking about what will be stored, but about what the Windows program, on Windows, is using: the Windows user's encoding or the encoding of the user making the share connection?)

4) Same question about the pathnames?
5) If a MS Windows program creates temporary filenames that use, hopefully, only ASCII characters, and the Unix encoding is not ASCII compatible, does this lead to problems? Or are pathnames considered, as on a Unix filesystem, simply a NUL-terminated string of bytes without encoding, so that a string that is not valid UTF-8 is no problem for a pathname?

6) The parameters are global. Does this mean that serving different shares to different users de facto imposes UTF-8 on the Unix side, in order to be able to store whatever the clients send, even if Samba does the reverse conversion from UTF-8 to whatever the client uses when a file is retrieved?

TIA for any information about these points.
-- 
Thierry Laronde <tlaronde +AT+ kergis +dot+ com>
http://www.kergis.com/
http://kertex.kergis.com/
Key fingerprint = 0FF7 E906 FBAF FE95 FD89 250D 52B1 AE95 6006 F40C
Michael Tokarev
2024-Feb-05 13:16 UTC
[Samba] Languages and encoding: file system and file contents
05.02.2024 15:18, Thierry LARONDE via samba:
> I'm rather unclear about the way CIFS/Samba deal with languages,
> encoding, and the, perhaps, encoding of pathnames vs the encoding of
> files considered by MS Windows to be text file and hence with, perhaps
> a language and an encoding.
>
> So I will try to formulate questions about the elementary points, and
> I will be grateful to the ones who can share lights about these:
>
> "dos charset" and "unix charset" are global parameters. As far as I
> understand the description, this:
> a) fixes the _contents_ of files;

Absolutely not. Samba does not do anything with the contents of files. It treats all files as binary objects and does not change their contents in any way.

> b) the parameters are global so it is not possible to have
> different values for different shares and/or different users.
>
> Context: I imagine a Unix filesystem served to Windows clients.
>
> So the questions:
>
> 1) These parameters don't seem to have anything to do with pathnames,
> but setting "unix charset" to ASCII, not ascii pathnames are not
> displayed on the clients. What is the relationship between charsets
> and the pathnames?

These parameters have meaning for file *names* (pathnames) *only*. They have nothing to do with the contents of the files.

> 2) What is the relationship between the language parameters (LC_* and
> LANG) settings for the user, and the "charsets" defined in smb.conf? Is
> the localisation encoding, for a Unix user mapped to a Samba user, used
> in anyway to inform a Windows client about the language and encoding
> of the contents?

There's no relation whatsoever.

> 3) If a MS Windows client connects to a share via another user (say a
> Unix one), if the encodings on the MS Windows is different from what
> is defined for the user connecting, is there a problem? (Ex.: Windows
> is configured to use latin9 or equivalent; user used to connect is
> declared as using UTF-8; what encoding will be used by a Windows
> program? latin9 or UTF-8---I'm not talking about what will be stored,
> I'm talking about what the Windows program, on Windows, is using:
> Windows user encoding or encoding of the user making the share
> connection?

When two entities connect, they exchange information about the charsets they store filenames in. Next, it is the client's job to convert between the server charset and whatever the local charset happens to be.

> 4) Same question about the pathnames?

It is about pathnames *only*, nothing to do with file contents.

/mjt
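
The distinction drawn in this thread (charsets apply to file names, never to file contents) can be illustrated with a small Python sketch. This is not Samba code and does not use any Samba API: the helper to_server_bytes() and the sample names are hypothetical, and returning None for a name that cannot be encoded is only a stand-in for the "non-ASCII pathnames are not displayed" symptom described in question 1.

```python
from typing import Optional


def to_server_bytes(name: str, unix_charset: str) -> Optional[bytes]:
    """Hypothetical model: encode a file name in the server-side charset.

    Returns None when the name cannot be represented in that charset.
    """
    try:
        return name.encode(unix_charset)
    except UnicodeEncodeError:
        return None


# The same name yields different bytes on disk depending on the charset.
name = "café.txt"
for charset in ("utf-8", "iso-8859-15", "ascii"):
    print(charset, to_server_bytes(name, charset))

# A pure-ASCII temporary name is representable in all three charsets.
print(to_server_bytes("tmp123.tmp", "ascii"))

# File contents are opaque bytes: nothing in this model re-encodes them.
contents = "prix: 10 €\n".encode("iso-8859-15")
served = contents  # passed through unchanged, whatever the charset setting
print(served == contents)
```

In this model, only the ASCII setting fails for the accented name, while the ASCII-only temporary name works everywhere; the file contents are never touched.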