Are path-names text or raw data in ZFS? I.e., is it possible to know what the name of a file/dir/whatever is, or do I have to make more or less wild guesses about what encoding is used where?

- Marcus
Hi Marcus,

Marcus Sundman wrote:
> Are path-names text or raw data in ZFS? I.e., is it possible to know
> what the name of a file/dir/whatever is, or do I have to make more or
> less wild guesses about what encoding is used where?
>
> - Marcus

I'm not sure what you are asking here. When a ZFS file system is mounted, it looks like a normal Unix file system, i.e., a tree of files where intermediate nodes are directories and leaf nodes may be directories or regular files. In other words, ls gives you the same kind of output you would expect on any Unix file system. As to whether a file/directory name is text or binary, that depends on the name used when creating the file/directory. As for the metadata used to maintain the file system tree, most of it is compressed. But your question makes me wonder if you have tried ZFS. If so, then I really am not sure what you are asking. If not, maybe you should try it out...

max
"max at bruningsystems.com" <max at bruningsystems.com> wrote:> Marcus Sundman wrote: > > Are path-names text or raw data in zfs? I.e., is it possible to know > > what the name of a file/dir/whatever is, or do I have to make more > > or less wild guesses what encoding is used where? > > I''m not sure what you are asking here. When a zfs file system is > mounted, it looks like a normal unix file system, i.e., a tree of > files where intermediate nodes are directories and leaf nodes may be > directories or regular files. In other words, ls gives you the same > kind of output you would expect on any unix file system. As to > whether a file/directory name is text or binary, that depends > on the name used when creating the file/directory. As far as the > meta-data used to maintain the file system tree, most of this is > compressed. But your question makes me wonder if you have tried > zfs. If so, then I really am not sure what you are asking. If not, > maybe you should try it out...I am running it (in nexenta). Anyway, my question was whether path-names (files, dirs, links, sockets, etc) are text or raw data. Fundamentals: "raw data" is "a list of bits, usually in groups of 8 (i.e., bytes)", and "text" is "raw data + some way of knowing how to convert that data into characters, forming strings". Example: When you go to a web-page the webserver sends the bytes of the page along with a http-header named "Content-Type", which tells your browser how to interpret those bytes. Example: Some versioning systems, such as svn, are hardcoded to encode pathnames as UTF-8. So, although the encoding-metadata isn''t available along with the data it is in the specification. So, once more, is it possible to know the pathnames (as text) on zfs, or are pathnames just raw data and I (or my programs) have to make more or less wild guesses about what encoding the user who created the file/dir/etc. used for its name? At least on linux it''s the latter. 
IMO it really sucks not to be able to know the names of files/dirs/etc., because it always leads to problems. E.g., most (but not all) programs assume filenames are encoded according to the current locale (let's say UTF-8), so when a filename with another encoding (let's say ISO-8859-15) is encountered various Evil(tm) things happen, such as not displaying the file(s) at all (e.g., an image viewer I've used), or replacing filenames with "?", or replacing parts of filenames with "?" and decoding the rest of the filename with an obviously incorrect encoding (e.g., ls). I've even seen programs crash when they can't decode a filename.

- Marcus
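The failure modes Marcus lists fall out of one decision: what a program does when a strict decode of the name fails. A hedged sketch of both behaviors:

```python
# A sketch of the failure modes described above: a program that assumes
# the locale encoding is UTF-8 meets a filename written in ISO-8859-15.
name_bytes = b"h\xe4st"  # "häst" as an ISO-8859-15 user wrote it

# Strict decoding fails outright -- the "program crashes" case, if the
# exception isn't handled:
try:
    name_bytes.decode("utf-8")
except UnicodeDecodeError:
    print("undecodable filename")

# Lenient decoding silently mangles the name -- the '?' / U+FFFD case,
# where the rest of the name is decoded with the wrong encoding anyway:
print(name_bytes.decode("utf-8", errors="replace"))  # h<U+FFFD>st
```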
See the description of the normalization and utf8only properties in the zfs(1) man page. I think this might help you.

     normalization = none | formD | formKCf

         Indicates whether the file system should perform a
         unicode normalization of file names whenever two file
         names are compared, and which normalization algorithm
         should be used. File names are always stored unmodified,
         names are normalized as part of any comparison process.
         If this property is set to a legal value other than
         "none," and the "utf8only" property was left
         unspecified, the "utf8only" property is automatically
         set to "on." The default value of the "normalization"
         property is "none." This property cannot be changed
         after the file system is created.

     utf8only = on | off

         Indicates whether the file system should reject file
         names that include characters that are not present in
         the UTF-8 character code set. If this property is
         explicitly set to "off," the normalization property
         must either not be explicitly set or be set to "none."
         The default value for the "utf8only" property is "off."
         This property cannot be changed after the file system
         is created.

-- 
Darren J Moffat
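What "normalization ... whenever two file names are compared" buys you can be seen with Unicode's own normalization forms (formD corresponds to Unicode Normalization Form D); this Python sketch only illustrates the comparison idea, not ZFS's kernel implementation:

```python
import unicodedata

# Two byte-wise different spellings of "é": a single precomposed code
# point vs. "e" plus a combining acute accent.
composed = "\u00e9"
decomposed = "e\u0301"

print(composed == decomposed)                    # False: raw comparison
print(unicodedata.normalize("NFD", composed) ==
      unicodedata.normalize("NFD", decomposed))  # True: formD-style compare
```

This is why the man page stresses that names are stored unmodified: only the comparison is normalized, so a lookup for either spelling can find the file.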
Darren J Moffat <darrenm at opensolaris.org> wrote:
> See the description of the normalization and utf8only properties in
> the zfs(1) man page.
>
> I think this might help you.
>
> normalization = none | formD | formKCf

That's apparently only for comparisons, so I don't see how it's relevant.

> utf8only = on | off
>
>     Indicates whether the file system should reject file
>     names that include characters that are not present in
>     the UTF-8 character code set. [...]

I'm unable to find more info about this. E.g., what does "reject file names" mean in practice? E.g., if a program tries to create a file using an utf8-incompatible filename, what happens? Does the fopen() fail? Would this normally be a problem? E.g., do tar and similar programs convert utf8-incompatible filenames to utf8 upon extraction if my locale (or wherever the fs encoding is taken from) is set to use utf-8? If they don't, then what happens with archives containing utf8-incompatible filenames?

- Marcus
So, I set utf8only=on and try to create a file with a filename that is a byte array that can't be decoded to text using UTF-8. What's supposed to happen? Should fopen(), or whatever syscall 'touch' uses, fail? Should the syscall somehow escape utf8-incompatible bytes, or maybe replace them with ?s or somesuch? Or should it automatically convert the filename from the active locale's fs-encoding (LC_CTYPE?) to UTF-8?

- Marcus
Marcus Sundman wrote:
> I'm unable to find more info about this. E.g., what does "reject file
> names" mean in practice? E.g., if a program tries to create a file
> using an utf8-incompatible filename, what happens? Does the fopen()
> fail? [...]

Note that the normal ZFS behavior is exactly what you'd expect: you get the filenames you wanted; the same ones back that you put in.

The trick is that in order to support such things as casesensitivity=false for CIFS, the OS needs to know what characters are uppercase vs lowercase, which means it needs to know about encodings, and reject codepoints which cannot be classified as uppercase vs lowercase. If you're not running a CIFS server, the defaults will allow you to create files w/ utf8 names very happily.

: barts at cyber[147]; cat test
?? ?????? ??? ?????? ????????
: barts at cyber[148]; cat > "`cat test`"
this is a test w/ a utf8 filename
: barts at cyber[149]; ls -l
total 10
-rw-r--r--   1 barts    staff         37 Oct 22 15:45 Makefile
-rw-r--r--   1 barts    staff          0 Oct 22 15:46 bar
-rw-r--r--   1 barts    staff          0 Oct 22 15:46 foo
-rw-r--r--   1 barts    staff         55 Feb 27 19:45 test
-rw-r--r--   1 barts    staff        301 Feb 27 19:44 test~
-rw-r--r--   1 barts    staff         34 Feb 27 19:46 ?? ?????? ??? ?????? ????????
: barts at cyber[150]; df -h .
Filesystem             size   used  avail capacity  Mounted on
zfs/home               228G   136G    48G    74%    /export/home/cyber
: barts at cyber[151];

- Bart

-- 
Bart Smaalders			Solaris Kernel Performance
barts at cyber.eng.sun.com		http://blogs.sun.com/barts
"You will contribute more with mercurial than with thunderbird."
Bart Smaalders wrote:
> Marcus Sundman wrote:
> > I'm unable to find more info about this. E.g., what does "reject
> > file names" mean in practice? [...]
>
> Note that the normal ZFS behavior is exactly what you'd expect: you
> get the filenames you wanted; the same ones back you put in.

Does ZFS convert the strings to UTF-8 in this case or will it just store the multibyte sequence unmodified ?

----

Bye,
Roland

-- 
  __ .  . __
 (o.\ \/ /.o) roland.mainz at nrubsig.org
  \__\/\/__/  MPEG specialist, C&&JAVA&&Sun&&Unix programmer
  /O /==\ O\  TEL +49 641 7950090
 (;O/ \/ \O;)
Roland Mainz wrote:
> Bart Smaalders wrote:
>> Note that the normal ZFS behavior is exactly what you'd expect: you
>> get the filenames you wanted; the same ones back you put in.
>
> Does ZFS convert the strings to UTF-8 in this case or will it just store
> the multibyte sequence unmodified ?

ZFS doesn't muck with names it is sent when storing them on-disk. The on-disk name is exactly the sequence of bytes provided to open(), creat(), etc. If normalization options are chosen, it may do some manipulation of the byte strings *when comparing* names, but the on-disk name should be untouched from what the user requested.

-tim
Tim Haley wrote:
> ZFS doesn't muck with names it is sent when storing them on-disk. The
> on-disk name is exactly the sequence of bytes provided to open(),
> creat(), etc. If normalization options are chosen, it may do some
> manipulation of the byte strings *when comparing* names, but the
> on-disk name should be untouched from what the user requested.

Ok... that was the part which I was _praying_ for... :-)

... just some background (for those who may be puzzled by the statement above): The conversion to Unicode is not always "lossless" (Unicode is sometimes marketed as "convert-any-encoding-to-unicode-without-loosing-any-information") ... for example, if you have a mixed-language ISO-2022 character sequence, the conversion to Unicode will use the language information itself, and converting it back to an ISO-2022 sequence will result in a different multibyte sequence than the original input (the issue could be worked around by inserting the "language tag" characters to preserve this information, but almost every converter doesn't do that (and since these "tags" are outside the BMP you have to pray that everything in the toolchain works with Unicode characters beyond 65535)) ... ;-(

----

Bye,
Roland

-- 
  __ .  . __
 (o.\ \/ /.o) roland.mainz at nrubsig.org
  \__\/\/__/  MPEG specialist, C&&JAVA&&Sun&&Unix programmer
  /O /==\ O\  TEL +49 641 7950090
 (;O/ \/ \O;)
Roland Mainz wrote:
> ... just some background (for those who may be puzzled by the statement
> above): The conversion to Unicode is not always "lossless" [...]
> for example if you have a mixed-language ISO-2022 character sequence the
> conversion to Unicode will use the language information itself

s/use/lose/ ... sorry...

----

Bye,
Roland
On Thu, Feb 28, 2008 at 05:57:21AM +0100, Roland Mainz wrote:
> ... just some background (for those who may be puzzled by the statement
> above): The conversion to Unicode is not always "lossless" [...] and
> converting it back to an ISO-2022 sequence will result in a different
> multibyte sequence than the original input [...]

Keep in mind that NFSv4 requires use of UTF-8 on the wire. Most implementations just-use-8, including Solaris, but IIRC ZFS has an option to require/allow only valid UTF-8 byte sequences, and it has support for normalization-insensitive/preserving behaviour on lookup/create, so the Solaris server is approaching compliance with the NFSv4 spec, and the client can be compliant if you use only UTF-8 locales :)

I.e., we (the industry) are converging on Unicode as the standard codeset for filesystem object naming. The upshot of this is that if you really care about lossless conversions then you'll just have to avoid using problematic sequences in filesystem object names.

It is important, for reasons like what you described, that other things -- particularly document formats -- support codesets other than Unicode. But I just don't see the NFS community adopting a multiplicity of codesets for NFS (who knows, I might be wrong, and you could bring this up on the IETF NFSv4 WG).

Nico
--
> So, I set utf8only=on and try to create a file with a filename that is
> a byte array that can't be decoded to text using UTF-8. What's supposed
> to happen? Should fopen(), or whatever syscall 'touch' uses, fail?
> Should the syscall somehow escape utf8-incompatible bytes, or maybe
> replace them with ?s or somesuch? Or should it automatically convert the
> filename from the active locale's fs-encoding (LC_CTYPE?) to UTF-8?

First, utf8only can AFAIK only be set when a filesystem is created.

Second, "use the source, Luke:"

http://src.opensolaris.org/source/search?q=&defs=&refs=z_utf8&path=%2Fonnv%2Fonnv-gate%2Fusr%2Fsrc%2Futs%2Fcommon%2Ffs%2Fzfs%2Fzfs_vnops.c&hist=&project=%2Fonnv

Looks to me like lookups, file create, directory create, creating symlinks, and creating hard links will all fail with error EILSEQ ("Illegal byte sequence") if utf8only is enabled and they are presented with a name that is not valid UTF-8. Thus, on a filesystem where it has been enabled since creation, no such names can be created or would ever be there to be found anyway. So in that case, the system is refusing non-UTF-8-compatible byte strings and there's no need to escape anything.

Further, your last sentence suggests that you might hold the incorrect idea that the kernel knows or cares what locale an application is running in: it does not. Nor indeed does the kernel know about environment variables at all, except as the third argument passed to execve(2); it doesn't interpret them, or even validate that they are of the usual name=value form. They're typically handled pretty much the same as the command-line args, and the only illusion of magic is that the more widely used variants of exec that don't explicitly pass the environment internally call execve(2) with the external variable environ as the last arg, thus passing the environment automatically.

There have been Unix-like OSs that make the environment available to additional system calls (give or take what's a true system call in the example I'm thinking of, namely variant links (symlinks with embedded environment-variable references) in the now-defunct Apollo Domain/OS), but AFAIK that's not the case in those that are part of the historical Unix source lineage. (I have no idea off the top of my head whether Linux, or oddballs like OSF/1, might make environment variables implicitly available to syscalls other than execve(2).)

This message posted from opensolaris.org
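The EILSEQ behaviour described above amounts to a pure validity check on the byte string before the name is ever stored. A hedged sketch of the equivalent predicate (in Python, purely for illustration; the function name is made up, and ZFS does this check inside the kernel):

```python
import errno
import os

def utf8only_check(name_bytes: bytes) -> None:
    """Reject a proposed filename the way a utf8only=on filesystem
    would: a name that is not valid UTF-8 yields EILSEQ, and nothing
    is escaped, replaced, or converted."""
    try:
        name_bytes.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        raise OSError(errno.EILSEQ, os.strerror(errno.EILSEQ), name_bytes)

utf8only_check(b"h\xc3\xa4st")      # valid UTF-8: accepted silently
try:
    utf8only_check(b"h\xe4st")      # ISO-8859-1 bytes: rejected
except OSError as e:
    print(e.errno == errno.EILSEQ)  # True
```

Note this answers Marcus's "escape, replace, or convert?" question: none of the three; the operation simply fails.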
Bart Smaalders <bart.smaalders at Sun.COM> wrote:
> > I'm unable to find more info about this. E.g., what does "reject
> > file names" mean in practice? [...]
>
> Note that the normal ZFS behavior is exactly what you'd expect: you
> get the filenames you wanted; the same ones back you put in.

OK, thanks. I still haven't got an answer to my original question, though. I.e., is there some way to know what text the filename is, or do I have to make a more or less wild guess about what encoding the program that created the file used?

OK, if I use utf8only then I know that all filenames can be interpreted as UTF-8. However, that's completely unacceptable for me, since I'd much rather have an important file with an incomprehensible filename than not have that important file at all. Also, what about non-UTF-8 encodings? E.g., is it possible to know whether 0xe4 is "ä" (as in ISO-8859-1) or "ф" (as in ISO-8859-5)?

> The trick is that in order to support such things as
> casesensitivity=false for CIFS, the OS needs to know what characters
> are uppercase vs lowercase, which means it needs to know about
> encodings, and reject codepoints which cannot be classified as
> uppercase vs lowercase.

I don't see why the OS would care about that. Isn't that the job of the CIFS daemon? As a matter of fact I don't see why the OS would need to know how to decode any filename-bytes to text. However, I firmly believe that user applications should have that opportunity. If the encoding of filenames is not known (explicitly or implicitly) then applications don't have that opportunity.

- Marcus
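One answer applications have since adopted for exactly this dilemma (keep the important file even when its name is incomprehensible) is a lossless escape scheme rather than a guess. This is Python's approach (PEP 383), shown here as a hedged illustration; it is not a ZFS feature:

```python
# Decode with surrogateescape so undecodable bytes survive a round
# trip: the name is preserved exactly, even though the bad bytes still
# aren't meaningful text.
raw = b"h\xe4st"                 # not valid UTF-8

text = raw.decode("utf-8", "surrogateescape")
print(repr(text))                # 'h\udce4st' -- lossless placeholder
print(text.encode("utf-8", "surrogateescape") == raw)   # True

# os.fsdecode()/os.fsencode() apply the same scheme using the locale's
# filesystem encoding, which is how Python programs avoid both data
# loss and crashes on undecodable filenames.
```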
Marcus Sundman wrote:
> OK, thanks. I still haven't got an answer to my original question,
> though. I.e., is there some way to know what text the filename is, or
> do I have to make a more or less wild guess about what encoding the
> program that created the file used?

How do you expect the filesystem to know this? open(2) takes 3 args; none of them have anything to do with the encoding.

> OK, if I use utf8only then I know that all filenames can be interpreted
> as UTF-8. [...] Also, what about non-UTF-8 encodings? E.g., is it
> possible to know whether 0xe4 is "ä" (as in ISO-8859-1) or "ф" (as in
> ISO-8859-5)?

There are two characters not allowed in filenames: NULL and '/'. Everything else is meaning imparted by the user, just like the contents of text documents.

> I don't see why the OS would care about that. Isn't that the job of the
> CIFS daemon?

If my program attempts to open file "fred" in a case-insensitive filesystem and "FRED" exists, I would expect to get a handle to "FRED". In order for the filesystem to do this, the OS must be able to perform this comparison. CIFS is in the kernel; case insensitivity is a property of the filesystem, not a layer added on by a daemon. If not, I could create "fred" and "FRED" locally, and then which one would I get were I to open "FrEd" via CIFS?

> As a matter of fact I don't see why the OS would need to
> know how to decode any filename-bytes to text. However, I firmly
> believe that user applications should have that opportunity. If the
> encoding of filenames is not known (explicitly or implicitly) then
> applications don't have that opportunity.

The OS doesn't care; the user does. If a user creates a file named ?????? in his home directory, but my encoding doesn't contain these characters, what should ls -l display? You also assume that knowing the encoding will transfer meaning... but a directory containing files named ??????, ????? and ?????? may as well be line noise for most of us.

The OS doesn't care one whit about language or encodings (save the optional upper/lower-case accommodation for CIFS). The OS simply stores files under names that don't contain either '/' or NULL.

UTF8 is the answer here. If you care about anything more than simple ascii and you work in more than a single locale/encoding, use UTF8. You may not understand the meaning of a filename, but at least you'll see the same characters as the person who wrote it.

- Bart

-- 
Bart Smaalders			Solaris Kernel Performance
barts at cyber.eng.sun.com		http://blogs.sun.com/barts
"You will contribute more with mercurial than with thunderbird."
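Bart's "fred"/"FrEd" point is that case-insensitive lookup is defined on characters, not bytes, so the comparer must know the encoding. A hedged sketch of the comparison itself (Python's casefold stands in for whatever case-folding table the kernel uses):

```python
# Case-insensitive name matching requires character-level knowledge:
# only once bytes are decoded do "uppercase" and "lowercase" exist.
def ci_equal(a: str, b: str) -> bool:
    return a.casefold() == b.casefold()

print(ci_equal("fred", "FrEd"))   # True
print(ci_equal("häst", "HÄST"))   # True -- needs Unicode case rules,
                                  # not just ASCII arithmetic

# On raw bytes no such comparison is possible: is 0xe4 a lowercase
# letter? Only the encoding can say.
```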
> OK, thanks. I still haven't got any answer to my original question,
> though. I.e., is there some way to know what text the
> filename is, or do I have to make a more or less wild guess what
> encoding the program that created the file used?

You have to guess. As far as I know, Apple's HFS (and HFS+) is the only file system which stores the encoding along with the filename. NFS doesn't provide a mechanism to send the encoding with the filename; I don't believe that CIFS does, either.

If you're writing the application, you could store the encoding as an extended attribute of the file. This would be useful, for instance, for an AFP server.

> > The trick is that in order to support such things as
> > casesensitivity=false for CIFS, the OS needs to know what characters
> > are uppercase vs lowercase [...]
>
> I don't see why the OS would care about that. Isn't that the job of the
> CIFS daemon?

The CIFS daemon can do it, but it would require that the daemon cache the whole directory in memory (at least, to get reasonable efficiency). This doesn't work so well for large directories. If you leave it up to the CIFS daemon, you also wind up with problems if you have a single sharepoint shared between local users, NFS & CIFS -- the NFS client can create two files named "a" and "A", but the CIFS client can only see one of those.

> As a matter of fact I don't see why the OS would need to
> know how to decode any filename-bytes to text.
> However, I firmly believe that user applications should have that
> opportunity. If the encoding of filenames is not known (explicitly or
> implicitly) then applications don't have that opportunity.

Yes -- that's why Apple includes an encoding byte in both HFS and HFS+. (In HFS+, filenames are normalized to 16-bit Unicode, but the encoding is still useful in choosing how to recompose the characters, and in providing hints for applications which prefer the names in some 8-bit encoding.)

-- Anton
Bart Smaalders <bart.smaalders at Sun.COM> wrote:> Marcus Sundman wrote: > > Bart Smaalders <bart.smaalders at Sun.COM> wrote: > >>> I''m unable to find more info about this. E.g., what does "reject > >>> file names" mean in practice? E.g., if a program tries to create a > >>> file using an utf8-incompatible filename, what happens? Does the > >>> fopen() fail? Would this normally be a problem? E.g., do tar and > >>> similar programs convert utf8-incompatible filenames to utf8 upon > >>> extraction if my locale (or wherever the fs encoding is taken > >>> from) is set to use utf-8? If they don''t, then what happens with > >>> archives containing utf8-incompatible filenames? > >> > >> Note that the normal ZFS behavior is exactly what you''d expect: you > >> get the filenames you wanted; the same ones back you put in. > > > > OK, thanks. I still haven''t got any answer to my original question, > > though. I.e., is there some way to know what text the filename is, > > or do I have to make a more or less wild guess what encoding the > > program that created the file used? > > How do you expect the filesystem to know this? Open(2) takes 3 args; > none of them have anything to do with the encoding.I don''t expect the filesystem to know "this" (whatever you mean by "this"). I don''t expect the filesystem not to either. I just don''t know, and therefore I ask.> > OK, if I use utf8only then I know that all filenames can be > > interpreted as UTF-8. However, that''s completely unacceptable for > > me, since I''d much rather have an important file with an > > incomprehensible filename than not have that important file at all. > > Also, what about non-UTF-8 encodings? E.g., is it possible to know > > whether 0xe4 is "?" (as in iso-8859-1) or "?" (as in iso-8859-5)? > > > > There are two characters not allowed in filenames: NULL and ''/''. > Everything else is meaning imparted by the user, just like the > contents of text documents.You are confusing "characters" and "bytes". 
The former are encoded when transformed to the latter. ''/'' is a character, 0x2f is a byte. (Well, representations of a character and of a byte, respectively, if we''re nitpicking.)> >> The trick is that in order to support such things as > >> casesensitivity=false for CIFS, the OS needs to know what > >> characters are uppercase vs lowercase, which means it needs to > >> know about encodings, and reject codepoints which cannot be > >> classified as uppercase vs lowercase. > > > > I don''t see why the OS would care about that. Isn''t that the job of > > the CIFS daemon? > > If my program attempts to open file "fred" in a case-insensitive > filesystem and "FRED" exists, I would expect to get a handle to > "FRED". In order for the filesystem to do this, the OS must be able > to perform this comparison.Well, yes, if the case-insensitivity is in the filesystem (and if the fs is in the kernel), but my point was that it wouldn''t _have_to_ be in the filesystem. It''s probably faster if it is, though.> CIFS is in the kernel; case insensitivity is a property of the > filesystem, not a layer added on by a daemon.You probably mean "CIFS is in (Open)Solaris" and "case insensitivity is a property of ZFS".> If not, I could create "fred" and "FRED" locally, and then which one > would I get were I to open "FrEd" via CIFS?I guess that would be up to the implementation (unless CIFS includes it in its specification).> > As a matter of fact I don''t see why the OS would need to > > know how to decode any filename-bytes to text. However, I firmly > > believe that user applications should have that opportunity. If the > > encoding of filenames is not known (explicitly or implicitly) then > > applications don''t have that opportunity. > > The OS doesn''t care; the user does. If a user creates a file named > ?????? in his home directory, but my encoding doesn''t contain these > characters, what should ls -l display?I assume we''re assuming encodings to be known here. 
(If the encodings are unknown/unspecified the user can''t create a file named any particular character string, only raw data (bits/bytes).) What a particular program displays is up to the implementation, I guess. I''ve seen programs use escapes (e.g., \uc3\ua5), or ''?'', or empty squares, or small squares with hex-numbers in them. (I''ve also seen programs not display the text at all (sometimes not displaying any text after the offending part), or even crash.) However, we have the same problem always when programs should display text, whether we know the encoding or not. Command line programs might propagate the problem to the terminal (as ls in OpenSolaris currently seems to be doing), graphical programs have to deal with it themselves. So, while the OS might not care, the programs certainly do, especially the graphical ones, since they can''t let someone else deal with the problem. (And yes, I know programs don''t like to be anthropomorphized.)> You also assume that knowing the encoding will transfer meaning... > but a directory containing files named ??????, ????? and ?????? may > as well be line noise for most of us.I assume no such thing. However, I firmly believe that knowing the encoding of a bit sequence is the _only_possibility_ to be able to _know_ what text that bit sequence represents.> The OS doesn''t care one whit about language or encodings (save > the optional upper/lower case accommodation for CIFS). The OS simply > stores files under names that don''t contain either ''/'' or NULL.I think you mean "[...]names that don''t contain either 0x2F and 0x0", which includes characters such as ''A'' in UTF-16.> UTF8 is the answer here. If you care about anything more than simple > ascii and you work in more than a single locale/encoding, use UTF8. > You may not understand the meaning of a filename, but at least > you''ll see the same characters as the person who wrote it.I think you are a bit confused. 
A) If you meant that _I_ should use UTF-8 then that alone won't help.
Let's say the person who created the file used ISO-8859-1 and named it
'häst', i.e., 0x68e47374. If I then use UTF-8 when displaying the
filename my program will be faced with the problem of what to do with
the second byte, 0xe4, which can't be decoded using UTF-8. ("häst" is
0x68c3a47374 in UTF-8, in case someone wonders.)

B) If you meant that _everybody_ should use UTF-8 then why would UTF-8
be "the answer"? Certainly it's enough that everybody uses the same
encoding.

Regards,
Marcus
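[Editor's note: example A above can be checked mechanically, and the display
fallbacks mentioned earlier in the thread (escapes, '?', replacement glyphs)
correspond to standard decode-error strategies. A minimal Python sketch,
illustrative only; the byte values come straight from the mail.]

```python
raw = bytes([0x68, 0xE4, 0x73, 0x74])          # 'häst' as ISO-8859-1
assert raw.decode("iso-8859-1") == "häst"

# The same name is different bytes under UTF-8, as the mail says:
assert "häst".encode("utf-8") == bytes([0x68, 0xC3, 0xA4, 0x73, 0x74])

# Decoding the ISO-8859-1 bytes as UTF-8 fails on the second byte, 0xe4:
try:
    raw.decode("utf-8")
except UnicodeDecodeError as err:
    print(hex(raw[err.start]))                  # 0xe4

# Two of the display fallbacks discussed, expressed as error handlers:
print(raw.decode("utf-8", errors="backslashreplace"))  # h\xe4st
print(raw.decode("utf-8", errors="replace"))           # h?st with U+FFFD
```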
"Anton B. Rang" <rang at acm.org> wrote:> Yes -- that''s why Apple includes an encoding byte in both HFS and HFS+. (In HFS+, filenames are normalized to 16-bit Unicode, but the encoding is still useful in choosing how to recompose the characters, and in providing hints for applications which prefer the names in some 8-bit encoding.)If you like to do something like this, it would be better to use the UDF aproach. In UDF directories, the first byte of a filename may either be 8 (''\010'') and then the filename is ISO-8859-1 (the low 8 bits of UNOICODE) or 16 (''\020'') and then the file name is usinf UCS-2 (16 bit chars) from UNICODE. This allows to keep the full path name length for the popular ISO-8859-1 coding and still needs less space than UTF-8 if you e.g. use japanese chars as Japanese chars need 3 octects in UTF-8. J?rg -- EMail:joerg at schily.isdn.cs.tu-berlin.de (home) J?rg Schilling D-13353 Berlin js at cs.tu-berlin.de (uni) schilling at fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/ URL: http://cdrecord.berlios.de/old/private/ ftp://ftp.berlios.de/pub/schily
"Anton B. Rang" <rang at acm.org> wrote:> > OK, thanks. I still haven''t got any answer to my original question, > > though. I.e., is there some way to know what text the > > filename is, or do I have to make a more or less wild guess what > > encoding the program that created the file used? > > You have to guess.Ouch! Guessing sucks. (By the way, that''s why I switched to ZFS with its internal checksums, so that I wouldn''t have to guess if my data was OK.) Thanks for the answer, though. Do you happen to know where programs in (Open)Solaris look when they want to know how to encode text to be used in a filename? Is it LC_CTYPE?> NFS doesn''t provide a mechanism to send the encoding with the > filename; I don''t believe that CIFS does, either.Really?!? That''s insane! How do programs know how to encode filenames to be sent over NFS or CIFS?> If you''re writing the application, you could store the encoding as an > extended attribute of the file. This would be useful, for instance, > for an AFP server.OK. But then I''d have to hack a similar change into all other programs that I use, too.> > > The trick is that in order to support such things as > > > casesensitivity=false for CIFS, the OS needs to know what > > > characters are uppercase vs lowercase, which means it needs to > > > know about encodings, and reject codepoints which cannot be > > > classified as uppercase vs lowercase. > > > > I don''t see why the OS would care about that. Isn''t that the job of > > the CIFS daemon? 
> The CIFS daemon can do it, but it would require that the daemon cache
> the whole directory in memory (at least, to get reasonable
> efficiency).

I guess that depends on what file access functions there are for the
file system.

> If you leave it up to the CIFS daemon, you also wind up with problems
> if you have a single sharepoint shared between local users, NFS &
> CIFS -- the NFS client can create two files named "a" and "A", but
> the CIFS client can only see one of those.

Not necessarily. There could be some (nonstandard) way of accessing
such duplicates (e.g., by having the CIFS daemon append "[dup-N]" or
somesuch to the name). And even if that problem did exist it might
still be OK for CIFS access to have that limitation.

Regards,
Marcus
Joerg.Schilling at fokus.fraunhofer.de (Joerg Schilling) wrote:

> [...] ISO-8859-1 (the low 8 bits of UNICODE) [...]

Unicode is not an encoding, but you probably mean "the low 8 bits of
UCS-2" or "the first 256 codepoints in Unicode" or somesuch.

Regards,
Marcus
Marcus Sundman <sundman at iki.fi> wrote:

> Joerg.Schilling at fokus.fraunhofer.de (Joerg Schilling) wrote:
> > [...] ISO-8859-1 (the low 8 bits of UNICODE) [...]
>
> Unicode is not an encoding, but you probably mean "the low 8 bits of
> UCS-2" or "the first 256 codepoints in Unicode" or somesuch.

Unicode _is_ an encoding that uses 21 (IIRC) bits. UCS-2 is a way to
_represent_ the low 16 bits of UNICODE in a way that allows using some
tricks to go beyond 16 bits. Microsoft e.g. does not go beyond 16
bits. ISO-8859-1 is a representation of the low 8 bits of UNICODE
(well, ISO-8859-1 is older than UNICODE ;-). ISO-8859-1 does not allow
coding more than the 8 least significant bits from Unicode.

Jörg

-- 
EMail: joerg at schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin
       js at cs.tu-berlin.de (uni)
       schilling at fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/
URL:   http://cdrecord.berlios.de/old/private/ ftp://ftp.berlios.de/pub/schily
Bart Smaalders <bart.smaalders at Sun.COM> wrote:

> > OK, thanks. I still haven't got any answer to my original question,
> > though. I.e., is there some way to know what text the filename is,
> > or do I have to make a more or less wild guess what encoding the
> > program that created the file used?
>
> How do you expect the filesystem to know this? Open(2) takes 3 args;
> none of them have anything to do with the encoding.

A while ago, when discussing things with some filesystem guys, I made
the proposal to introduce a new syscall to inform the kernel about the
locale coding used by a process. If the kernel (or filesystem) then
likes to store file names in a kernel-specific way and if there is an
in-kernel libiconv, the kernel could convert from/to the userland view.
A problem that remains is a userland coding that probably cannot
represent all "characters" used inside the kernel view.

> There are two characters not allowed in filenames: NULL and '/'.
> Everything else is meaning imparted by the user, just like the
> contents of text documents.

Platforms that insist on UTF-8 coding for filenames often disallow
octet codings that are not valid inside a UTF-8 character sequence.

> The OS doesn't care; the user does. If a user creates a file named
> ?????????????????? in his home directory, but my encoding doesn't
> contain these characters, what should ls -l display? You also assume
> that knowing the encoding will transfer meaning... but a directory
> containing files named ??????????????????, ??????????????? and
> ?????????????????? may as well be line noise for most of us.
>
> The OS doesn't care one whit about language or encodings (save
> the optional upper/lower case accommodation for CIFS). The OS simply
> stores files under names that don't contain either '/' or NULL.
>
> UTF8 is the answer here. If you care about anything more than simple
> ascii and you work in more than a single locale/encoding, use UTF8.
> You may not understand the meaning of a filename, but at least
> you'll see the same characters as the person who wrote it.

UTF-8 may be the answer for many but definitely not all problems.
UTF-8 may cause fewer problems in 5 years (if more people use it by
then) than the problems known with UTF-8 today.

Jörg

-- 
EMail: joerg at schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin
       js at cs.tu-berlin.de (uni)
       schilling at fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/
URL:   http://cdrecord.berlios.de/old/private/ ftp://ftp.berlios.de/pub/schily
Joerg.Schilling at fokus.fraunhofer.de (Joerg Schilling) wrote:

> Marcus Sundman <sundman at iki.fi> wrote:
> > Joerg.Schilling at fokus.fraunhofer.de (Joerg Schilling) wrote:
> > > [...] ISO-8859-1 (the low 8 bits of UNICODE) [...]
> >
> > Unicode is not an encoding, but you probably mean "the low 8 bits
> > of UCS-2" or "the first 256 codepoints in Unicode" or somesuch.
>
> Unicode _is_ an encoding that uses 21 (IIRC) bits.

AFAIK you are incorrect. Unicode is a standard that, among other
things, defines a _number_ for each character. A number does not equal
21 bits, even if it so happens that the highest codepoint number in the
current version is no more than 21 bits long. Unicode defines (at
least) 3 encodings to represent those characters: UTF-8, UTF-16 and
UTF-32.

Well, it doesn't very much matter exactly how the terms are defined, as
long as everybody knows what's what. So, I'm sorry for nitpicking.

- Marcus
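[Editor's note: the distinction being argued here can be made concrete. A code
point is a number; an encoding maps that number to bytes, and different
encodings produce different bytes for the same number. A small illustrative
Python sketch:]

```python
ch = "ä"
assert ord(ch) == 0xE4                         # the Unicode code point: a number
assert ch.encode("latin-1")   == b"\xe4"       # ISO-8859-1: same value, one byte
assert ch.encode("utf-8")     == b"\xc3\xa4"   # UTF-8: two different bytes
assert ch.encode("utf-16-be") == b"\x00\xe4"   # UTF-16: two bytes, another way

# Code points above U+FFFF exist (e.g. U+1D11E, musical G clef), so
# "16 bits" is not enough in general:
assert ord("\U0001D11E") == 0x1D11E
```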
Marcus Sundman wrote:

> Bart Smaalders <bart.smaalders at Sun.COM> wrote:
>> UTF8 is the answer here. If you care about anything more than simple
>> ascii and you work in more than a single locale/encoding, use UTF8.
>> You may not understand the meaning of a filename, but at least
>> you'll see the same characters as the person who wrote it.
>
> I think you are a bit confused.
>
> A) If you meant that _I_ should use UTF-8 then that alone won't help.
> Let's say the person who created the file used ISO-8859-1 and named
> it 'häst', i.e., 0x68e47374. If I then use UTF-8 when displaying the
> filename my program will be faced with the problem of what to do with
> the second byte, 0xe4, which can't be decoded using UTF-8. ("häst" is
> 0x68c3a47374 in UTF-8, in case someone wonders.)

What I mean is very simple:

The OS has no way of merging your various encodings. If I create a
directory, and have people from around the world create a file in that
directory named after themselves in their own character sets, what
should I see when I invoke:

% ls -l | less

in that directory?

If you wish to share filenames across locales, I suggest you and
everyone else writing to that directory use an encoding that will work
across all those locales. The encoding that works well for this on
Unix systems is UTF8, since it leaves '/' and NULL alone.

- Bart

-- 
Bart Smaalders                 Solaris Kernel Performance
barts at cyber.eng.sun.com       http://blogs.sun.com/barts
"You will contribute more with mercurial than with thunderbird."
Bart Smaalders <bart.smaalders at Sun.COM> wrote:

> Marcus Sundman wrote:
> > Bart Smaalders <bart.smaalders at Sun.COM> wrote:
> >> UTF8 is the answer here. If you care about anything more than
> >> simple ascii and you work in more than a single locale/encoding,
> >> use UTF8. You may not understand the meaning of a filename, but at
> >> least you'll see the same characters as the person who wrote it.
> >
> > I think you are a bit confused.
> >
> > A) If you meant that _I_ should use UTF-8 then that alone won't
> > help. Let's say the person who created the file used ISO-8859-1 and
> > named it 'häst', i.e., 0x68e47374. If I then use UTF-8 when
> > displaying the filename my program will be faced with the problem
> > of what to do with the second byte, 0xe4, which can't be decoded
> > using UTF-8. ("häst" is 0x68c3a47374 in UTF-8, in case someone
> > wonders.)
>
> What I mean is very simple:
>
> The OS has no way of merging your various encodings. If I create a
> directory, and have people from around the world create a file
> in that directory named after themselves in their own character sets,
> what should I see when I invoke:
>
> % ls -l | less
>
> in that directory?

Either (1) programs can find out what the encoding is, or (2) programs
must assume the encoding is what some environment variable (or
somesuch) is set to. (1) The OS doesn't have to "merge" anything, just
let the programs handle any conversions as they see fit. (2) The OS
must transcode the filenames. If a filename is incompatible with the
target encoding then the offending characters must be escaped.

> If you wish to share filenames across locales, I suggest you and
> everyone else writing to that directory use an encoding that will
> work across all those locales. The encoding that works well for this
> on Unix systems is UTF8, since it leaves '/' and NULL alone.

Again, that won't work. First of all there is no way to enforce
programs to use UTF-8.
I can't even force my own programs to do that. (E.g., unrar or unzip or
tar or 7z (can't remember which one(s)) just dump the filename data to
the fs in whatever encoding they were inside the archive, and I have at
least one collaboration program that also does it similarly.)

Now, if I force the fs to only accept filenames compatible with UTF-8
(i.e., utf8only) then I risk losing files. I'd rather have files with
incomprehensible filenames than not have them at all. OTOH, if I allow
filenames incompatible with UTF-8 then my programs can't necessarily
access them if I use UTF-8. I could use some 8-bits/char encoding
(e.g., ISO-8859-15), but I'd rather not, since the world is going the
way of UTF-8 and so I'd just be dragging behind. And then I would also
have problems with garbage filenames when they use UTF-8 or some other
encoding. Also, I'm quite sure I do have files with names with
characters not in ISO-8859-15.

So, you see, there is no way for me to use filenames intelligibly
unless their encodings are knowable. (In fact I'm quite surprised that
zfs doesn't (and even can't) know the encoding(s) of filenames. Usually
Sun seems to make relatively sane design decisions. This, however, is
more what I'd expect from linux with their overpragmatic "who cares if
it's sane, as long as it kinda works"-attitudes.)

Regards,
Marcus
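[Editor's note: this dilemma later got a pragmatic userland answer: keep
undecodable filename bytes instead of rejecting files. The sketch below uses
Python's "surrogateescape" error handler (PEP 383) purely as an illustration
of the idea; it is not something ZFS or Solaris of this era provided.]

```python
# Sketch: a filename dumped as raw ISO-8859-1 bytes by e.g. an unzip,
# handled on a system whose nominal encoding is UTF-8.
raw = b"h\xe4st"

# "surrogateescape" smuggles the undecodable byte through as a lone
# surrogate instead of raising, so the file is not "lost":
name = raw.decode("utf-8", errors="surrogateescape")
assert name == "h\udce4st"

# ...and the original bytes are recoverable exactly for syscalls:
assert name.encode("utf-8", errors="surrogateescape") == raw

# For display, the offending part can be escaped, as discussed earlier:
print(raw.decode("utf-8", errors="backslashreplace"))  # h\xe4st
```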
Marcus Sundman <sundman at iki.fi> wrote:

> Joerg.Schilling at fokus.fraunhofer.de (Joerg Schilling) wrote:
> > Marcus Sundman <sundman at iki.fi> wrote:
> > > Joerg.Schilling at fokus.fraunhofer.de (Joerg Schilling) wrote:
> > > > [...] ISO-8859-1 (the low 8 bits of UNICODE) [...]
> > >
> > > Unicode is not an encoding, but you probably mean "the low 8 bits
> > > of UCS-2" or "the first 256 codepoints in Unicode" or somesuch.
> >
> > Unicode _is_ an encoding that uses 21 (IIRC) bits.
>
> AFAIK you are incorrect. Unicode is a standard that, among other
> things, defines a _number_ for each character. A number does not equal

And I tend to call the relation character <-> number an "encoding". As
the "number" may be outside the range of "classical characters" that on
most systems live inside octets, there is a need to use another
encoding on top of the Unicode encoding. This second encoding is
typically UTF-8 on UNIX.

Jörg

-- 
EMail: joerg at schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin
       js at cs.tu-berlin.de (uni)
       schilling at fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/
URL:   http://cdrecord.berlios.de/old/private/ ftp://ftp.berlios.de/pub/schily
Bart Smaalders <bart.smaalders at Sun.COM> wrote:

> The OS has no way of merging your various encodings. If I create a
> directory, and have people from around the world create a file
> in that directory named after themselves in their own character sets,
> what should I see when I invoke:
>
> % ls -l | less
>
> in that directory?
>
> If you wish to share filenames across locales, I suggest you and
> everyone else writing to that directory use an encoding that will
> work across all those locales. The encoding that works well for this
> on Unix systems is UTF8, since it leaves '/' and NULL alone.

The problem with this approach is that all users need to change their
locale encoding. Some of them may not be able to do so because they
need to log in to older systems that do not support UTF-8.

We would have fewer problems if Unicode had been introduced 10 years
earlier. Because of missing encoding support for their countries,
people in Russia, China, ... did create their own encoding schemes in
the 1980s that are still in use.

Jörg

-- 
EMail: joerg at schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin
       js at cs.tu-berlin.de (uni)
       schilling at fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/
URL:   http://cdrecord.berlios.de/old/private/ ftp://ftp.berlios.de/pub/schily
> Do you happen to know where programs in (Open)Solaris look when they
> want to know how to encode text to be used in a filename? Is it
> LC_CTYPE?

In general, they don't. Command-line utilities just use the sequence of
bytes entered by the user. GUI-based software does as well, but the
encoding used for user input can sometimes be selected....

> > NFS doesn't provide a mechanism to send the encoding with the
> > filename; I don't believe that CIFS does, either.
>
> Really?!? That's insane! How do programs know how to
> encode filenames to be sent over NFS or CIFS?

For NFSv3, you guess. :-) It's just stream-of-bytes.

For NFSv4, the encoding used to transmit data is supposed to be UTF-8,
but this isn't enforced by most clients. What's more, since the
encoding isn't stored, the reverse translation (UTF-8 to local
encoding) would have to be done by the NFS client based on ...
something. Usually this is "just return the raw bytes and let the
application deal with the mess."

For CIFS, you can send either "ASCII" (which I believe really means
uninterpreted bytes) or UTF-16. If you're working in UTF-16, and you're
on Windows, there are two sets of APIs. The Unicode APIs will return
the proper Unicode names. The non-Unicode (legacy) APIs will encode the
names according to your system's current "code page" setting.

-- Anton

This message posted from opensolaris.org
"Anton B. Rang" <rang at acm.org> wrote:> > Do you happen to know where programs in (Open)Solaris look when they > > want to know how to encode text to be used in a filename? Is it > > LC_CTYPE? > > In general, they don''t. Command-line utilities just use the sequence > of bytes entered by the user.Obviously that depends on the application. A command-line utility that interprets an normal xml file containing filenames know the characters but not the bytes. The same goes for command-line utilities that receive the filenames as text (e.g., some file transfer utility or daemon).> GUI-based software does as well, but the encoding used for user input > can sometimes be selected....Hmm.. I''m usually programming at quite high a level, so I''m not very familiar with how stuff works under the hood... If I run xev on my linux box (I don''t have X on any (Open)Solaris) and press the ?-key on my keyboard it says "keycode 48" and "keysym 0xe4", and then "XLookupString gives 2 bytes: (c3 a4) "?"". Thus at least XLookupString seems to know that I''m using UTF-8. Where did it (or whoever converted 0xe4 to 0xc3a4) get the needed info? - Marcus
Marcus Sundman <sundman at iki.fi> writes:

> So, you see, there is no way for me to use filenames intelligibly
> unless their encodings are knowable. (In fact I'm quite surprised
> that zfs doesn't (and even can't) know the encoding(s) of filenames.
> Usually Sun seems to make relatively sane design decisions. This,
> however, is more what I'd expect from linux with their overpragmatic
> "who cares if it's sane, as long as it kinda works"-attitudes.)

To be fair, ZFS is constrained by compatibility requirements with
existing systems. For the longest time the only interpretation that
Unix kernels put on the filenames passed by applications was to treat
"/" and "\000" specially. The interfaces provided to applications
assume this is the entire extent of the process. Changing this
incompatibly is not an option, and adding new interfaces to support
this is meaningless unless there is a critical mass of applications
that use them. It's not reasonable to talk about "ZFS" doing this,
since it's just a part of the wider ecosystem.

To solve this problem at the moment takes one of two approaches:

1. A userland convention is adopted to decide on what meaning the byte
   strings that the kernel provides have.

2. Some new interfaces are created to pass this information into the
   kernel and get it back.

Leaving aside the merits of either approach, both of them require
significant agreement from applications to use a certain approach
before they reap any benefits. There's not much ZFS itself can do
there.

Boyd
> > In general, they don't. Command-line utilities just use the
> > sequence of bytes entered by the user.
>
> Obviously that depends on the application. A command-line utility
> that interprets a normal XML file containing filenames knows the
> characters but not the bytes. The same goes for command-line
> utilities that receive the filenames as text (e.g., some file
> transfer utility or daemon).

It's true that they know the characters, and not necessarily the bytes
-- but all of the tools I'm aware of ignore the characters and simply
treat these as bytes when it comes to making calls into the file
system.

> If I run xev on my linux box (I don't have X on any (Open)Solaris)
> and press the ä-key on my keyboard it says "keycode 48" and "keysym
> 0xe4", and then "XLookupString gives 2 bytes: (c3 a4) "ä"". Thus at
> least XLookupString seems to know that I'm using UTF-8. Where did it
> (or whoever converted 0xe4 to 0xc3a4) get the needed info?

Depending on what version of xev you've got, there's a good chance it
made a call to XmbLookupString (the "multibyte" version of
XLookupString). This uses the current locale for the encoding; the
locale is stored in an environment variable which can be queried by the
application. (But this has wandered afield of file systems -- though
it's true that the file system could potentially look at environment
variables to make encoding choices!)
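[Editor's note: the environment lookup described above is the standard C
locale mechanism (LC_ALL/LC_CTYPE/LANG), and an application can query the
resulting codeset itself. A sketch using Python's binding to the same libc
facility; nl_langinfo is POSIX and not available on Windows.]

```python
import locale

# Adopt whatever the environment's LC_ALL/LC_CTYPE/LANG imply...
locale.setlocale(locale.LC_CTYPE, "")

# ...and ask which character encoding that locale uses. This is the
# same information Xlib consults when XmbLookupString produces bytes.
codeset = locale.nl_langinfo(locale.CODESET)
print(codeset)   # e.g. "UTF-8" under an en_US.UTF-8 locale
```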