Are path-names text or raw data in ZFS? I.e., is it possible to know what the name of a file/dir/whatever is, or do I have to make more or less wild guesses about what encoding is used where?

- Marcus
Hi Marcus,

Marcus Sundman wrote:
> Are path-names text or raw data in ZFS? I.e., is it possible to know
> what the name of a file/dir/whatever is, or do I have to make more or
> less wild guesses about what encoding is used where?
>
> - Marcus

I'm not sure what you are asking here. When a ZFS file system is mounted, it looks like a normal Unix file system, i.e., a tree of files where intermediate nodes are directories and leaf nodes may be directories or regular files. In other words, ls gives you the same kind of output you would expect on any Unix file system. As to whether a file/directory name is text or binary, that depends on the name used when creating the file/directory. As for the metadata used to maintain the file system tree, most of it is compressed. But your question makes me wonder if you have tried ZFS. If so, then I really am not sure what you are asking. If not, maybe you should try it out...

max
"max at bruningsystems.com" <max at bruningsystems.com> wrote:> Marcus Sundman wrote: > > Are path-names text or raw data in zfs? I.e., is it possible to know > > what the name of a file/dir/whatever is, or do I have to make more > > or less wild guesses what encoding is used where? > > I''m not sure what you are asking here. When a zfs file system is > mounted, it looks like a normal unix file system, i.e., a tree of > files where intermediate nodes are directories and leaf nodes may be > directories or regular files. In other words, ls gives you the same > kind of output you would expect on any unix file system. As to > whether a file/directory name is text or binary, that depends > on the name used when creating the file/directory. As far as the > meta-data used to maintain the file system tree, most of this is > compressed. But your question makes me wonder if you have tried > zfs. If so, then I really am not sure what you are asking. If not, > maybe you should try it out...I am running it (in nexenta). Anyway, my question was whether path-names (files, dirs, links, sockets, etc) are text or raw data. Fundamentals: "raw data" is "a list of bits, usually in groups of 8 (i.e., bytes)", and "text" is "raw data + some way of knowing how to convert that data into characters, forming strings". Example: When you go to a web-page the webserver sends the bytes of the page along with a http-header named "Content-Type", which tells your browser how to interpret those bytes. Example: Some versioning systems, such as svn, are hardcoded to encode pathnames as UTF-8. So, although the encoding-metadata isn''t available along with the data it is in the specification. So, once more, is it possible to know the pathnames (as text) on zfs, or are pathnames just raw data and I (or my programs) have to make more or less wild guesses about what encoding the user who created the file/dir/etc. used for its name? At least on linux it''s the latter. 
IMO it really sucks not to be able to know the names of files/dirs/etc., because it always leads to problems. E.g., most (but not all) programs assume filenames are encoded according to the current locale (let's say UTF-8), so when a filename with another encoding (let's say ISO-8859-15) is encountered various Evil(tm) things happen, such as not displaying the file(s) at all (e.g., an image viewer I've used), or replacing filenames with "?", or replacing parts of filenames with "?" and decoding the rest of the filename with an obviously incorrect encoding (e.g., ls). I've even seen programs crash when they can't decode a filename.

- Marcus
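The failure modes Marcus lists fall out of one decision: what a program does when a strict decode of the name fails. A hedged sketch of both behaviors:

```python
# A sketch of the failure modes described above: a program that assumes
# the locale encoding is UTF-8 meets a filename written in ISO-8859-15.
name_bytes = b"h\xe4st"  # "häst" as an ISO-8859-15 user wrote it

# Strict decoding fails outright -- the "program crashes" case, if the
# exception isn't handled:
try:
    name_bytes.decode("utf-8")
except UnicodeDecodeError:
    print("undecodable filename")

# Lenient decoding silently mangles the name -- the '?' / U+FFFD case,
# where the rest of the name is decoded with the wrong encoding anyway:
print(name_bytes.decode("utf-8", errors="replace"))  # h<U+FFFD>st
```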
See the description of the normalization and utf8only properties in the zfs(1) man page. I think this might help you.

     normalization = none | formD | formKCf

         Indicates whether the file system should perform a
         unicode normalization of file names whenever two file
         names are compared, and which normalization algorithm
         should be used. File names are always stored unmodified,
         names are normalized as part of any comparison process.
         If this property is set to a legal value other than
         "none," and the "utf8only" property was left
         unspecified, the "utf8only" property is automatically
         set to "on." The default value of the "normalization"
         property is "none." This property cannot be changed
         after the file system is created.

     utf8only = on | off

         Indicates whether the file system should reject file
         names that include characters that are not present in
         the UTF-8 character code set. If this property is
         explicitly set to "off," the normalization property
         must either not be explicitly set or be set to "none."
         The default value for the "utf8only" property is "off."
         This property cannot be changed after the file system
         is created.

-- 
Darren J Moffat
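What "normalization ... whenever two file names are compared" buys you can be seen with Unicode's own normalization forms (formD corresponds to Unicode Normalization Form D); this Python sketch only illustrates the comparison idea, not ZFS's kernel implementation:

```python
import unicodedata

# Two byte-wise different spellings of "é": a single precomposed code
# point vs. "e" plus a combining acute accent.
composed = "\u00e9"
decomposed = "e\u0301"

print(composed == decomposed)                    # False: raw comparison
print(unicodedata.normalize("NFD", composed) ==
      unicodedata.normalize("NFD", decomposed))  # True: formD-style compare
```

This is why the man page stresses that names are stored unmodified: only the comparison is normalized, so a lookup for either spelling can find the file.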
Darren J Moffat <darrenm at opensolaris.org> wrote:
> See the description of the normalization and utf8only properties in
> the zfs(1) man page.
>
> I think this might help you.
>
> normalization = none | formD | formKCf

That's apparently only for comparisons, so I don't see how it's relevant.

> utf8only = on | off
>
>     Indicates whether the file system should reject file
>     names that include characters that are not present in
>     the UTF-8 character code set. [...]

I'm unable to find more info about this. E.g., what does "reject file names" mean in practice? E.g., if a program tries to create a file using an utf8-incompatible filename, what happens? Does the fopen() fail? Would this normally be a problem? E.g., do tar and similar programs convert utf8-incompatible filenames to utf8 upon extraction if my locale (or wherever the fs encoding is taken from) is set to use utf-8? If they don't, then what happens with archives containing utf8-incompatible filenames?

- Marcus
So, I set utf8only=on and try to create a file with a filename that is a byte array that can't be decoded to text using UTF-8. What's supposed to happen? Should fopen(), or whatever syscall 'touch' uses, fail? Should the syscall somehow escape utf8-incompatible bytes, or maybe replace them with ?s or somesuch? Or should it automatically convert the filename from the active locale's fs-encoding (LC_CTYPE?) to UTF-8?

- Marcus
Marcus Sundman wrote:
> I'm unable to find more info about this. E.g., what does "reject file
> names" mean in practice? E.g., if a program tries to create a file
> using an utf8-incompatible filename, what happens? Does the fopen()
> fail? [...]

Note that the normal ZFS behavior is exactly what you'd expect: you get the filenames you wanted; the same ones back that you put in.

The trick is that in order to support such things as casesensitivity=false for CIFS, the OS needs to know what characters are uppercase vs lowercase, which means it needs to know about encodings, and reject codepoints which cannot be classified as uppercase vs lowercase. If you're not running a CIFS server, the defaults will allow you to create files w/ utf8 names very happily.

: barts at cyber[147]; cat test
?? ?????? ??? ?????? ????????
: barts at cyber[148]; cat > "`cat test`"
this is a test w/ a utf8 filename
: barts at cyber[149]; ls -l
total 10
-rw-r--r--   1 barts    staff         37 Oct 22 15:45 Makefile
-rw-r--r--   1 barts    staff          0 Oct 22 15:46 bar
-rw-r--r--   1 barts    staff          0 Oct 22 15:46 foo
-rw-r--r--   1 barts    staff         55 Feb 27 19:45 test
-rw-r--r--   1 barts    staff        301 Feb 27 19:44 test~
-rw-r--r--   1 barts    staff         34 Feb 27 19:46 ?? ?????? ??? ?????? ????????
: barts at cyber[150]; df -h .
Filesystem             size   used  avail capacity  Mounted on
zfs/home               228G   136G    48G    74%    /export/home/cyber
: barts at cyber[151];

- Bart

-- 
Bart Smaalders			Solaris Kernel Performance
barts at cyber.eng.sun.com		http://blogs.sun.com/barts
"You will contribute more with mercurial than with thunderbird."
Bart Smaalders wrote:
> Marcus Sundman wrote:
> > I'm unable to find more info about this. E.g., what does "reject
> > file names" mean in practice? [...]
>
> Note that the normal ZFS behavior is exactly what you'd expect: you
> get the filenames you wanted; the same ones back you put in.

Does ZFS convert the strings to UTF-8 in this case or will it just store the multibyte sequence unmodified ?

----

Bye,
Roland

-- 
  __ .  . __
 (o.\ \/ /.o) roland.mainz at nrubsig.org
  \__\/\/__/  MPEG specialist, C&&JAVA&&Sun&&Unix programmer
  /O /==\ O\  TEL +49 641 7950090
 (;O/ \/ \O;)
Roland Mainz wrote:
> Bart Smaalders wrote:
>> Note that the normal ZFS behavior is exactly what you'd expect: you
>> get the filenames you wanted; the same ones back you put in.
>
> Does ZFS convert the strings to UTF-8 in this case or will it just store
> the multibyte sequence unmodified ?

ZFS doesn't muck with names it is sent when storing them on-disk. The on-disk name is exactly the sequence of bytes provided to open(), creat(), etc. If normalization options are chosen, it may do some manipulation of the byte strings *when comparing* names, but the on-disk name should be untouched from what the user requested.

-tim
Tim Haley wrote:
> ZFS doesn't muck with names it is sent when storing them on-disk. The
> on-disk name is exactly the sequence of bytes provided to open(),
> creat(), etc. If normalization options are chosen, it may do some
> manipulation of the byte strings *when comparing* names, but the
> on-disk name should be untouched from what the user requested.

Ok... that was the part which I was _praying_ for... :-)

... just some background (for those who may be puzzled by the statement above): The conversion to Unicode is not always "lossless" (Unicode is sometimes marketed as "convert-any-encoding-to-unicode-without-loosing-any-information") ... for example, if you have a mixed-language ISO-2022 character sequence, the conversion to Unicode will use the language information itself, and converting it back to an ISO-2022 sequence will result in a different multibyte sequence than the original input (the issue could be worked around by inserting the "language tag" characters to preserve this information, but almost every converter doesn't do that (and since these "tags" are outside the BMP you have to pray that everything in the toolchain works with Unicode characters beyond 65535)) ... ;-(

----

Bye,
Roland

-- 
  __ .  . __
 (o.\ \/ /.o) roland.mainz at nrubsig.org
  \__\/\/__/  MPEG specialist, C&&JAVA&&Sun&&Unix programmer
  /O /==\ O\  TEL +49 641 7950090
 (;O/ \/ \O;)
Roland Mainz wrote:
> ... just some background (for those who may be puzzled by the statement
> above): The conversion to Unicode is not always "lossless" [...]
> for example if you have a mixed-language ISO-2022 character sequence the
> conversion to Unicode will use the language information itself

s/use/lose/ ... sorry...

----

Bye,
Roland
On Thu, Feb 28, 2008 at 05:57:21AM +0100, Roland Mainz wrote:
> ... just some background (for those who may be puzzled by the statement
> above): The conversion to Unicode is not always "lossless" [...] and
> converting it back to an ISO-2022 sequence will result in a different
> multibyte sequence than the original input [...]

Keep in mind that NFSv4 requires use of UTF-8 on the wire. Most implementations just-use-8, including Solaris, but IIRC ZFS has an option to require/allow only valid UTF-8 byte sequences, and it has support for normalization-insensitive/preserving behaviour on lookup/create, so the Solaris server is approaching compliance with the NFSv4 spec, and the client can be compliant if you use only UTF-8 locales :)

I.e., we (the industry) are converging on Unicode as the standard codeset for filesystem object naming. The upshot of this is that if you really care about lossless conversions then you'll just have to avoid using problematic sequences in filesystem object names.

It is important, for reasons like what you described, that other things -- particularly document formats -- support codesets other than Unicode. But I just don't see the NFS community adopting a multiplicity of codesets for NFS (who knows, I might be wrong, and you could bring this up on the IETF NFSv4 WG).

Nico
--
> So, I set utf8only=on and try to create a file with a filename that is
> a byte array that can't be decoded to text using UTF-8. What's supposed
> to happen? Should fopen(), or whatever syscall 'touch' uses, fail?
> Should the syscall somehow escape utf8-incompatible bytes, or maybe
> replace them with ?s or somesuch? Or should it automatically convert the
> filename from the active locale's fs-encoding (LC_CTYPE?) to UTF-8?

First, utf8only can AFAIK only be set when a filesystem is created.

Second, "use the source, Luke:"

http://src.opensolaris.org/source/search?q=&defs=&refs=z_utf8&path=%2Fonnv%2Fonnv-gate%2Fusr%2Fsrc%2Futs%2Fcommon%2Ffs%2Fzfs%2Fzfs_vnops.c&hist=&project=%2Fonnv

Looks to me like lookups, file create, directory create, creating symlinks, and creating hard links will all fail with error EILSEQ ("Illegal byte sequence") if utf8only is enabled and they are presented with a name that is not valid UTF-8. Thus, on a filesystem where it has been enabled since creation, no such names can be created or would ever be there to be found anyway. So in that case, the system is refusing non-UTF-8-compatible byte strings and there's no need to escape anything.

Further, your last sentence suggests that you might hold the incorrect idea that the kernel knows or cares what locale an application is running in: it does not. Nor indeed does the kernel know about environment variables at all, except as the third argument passed to execve(2); it doesn't interpret them, or even validate that they are of the usual name=value form. They're typically handled pretty much the same as the command-line args, and the only illusion of magic is that the more widely used variants of exec that don't explicitly pass the environment internally call execve(2) with the external variable environ as the last arg, thus passing the environment automatically.

There have been Unix-like OSs that make the environment available to additional system calls (give or take what's a true system call in the example I'm thinking of, namely variant links (symlinks with embedded environment-variable references) in the now-defunct Apollo Domain/OS), but AFAIK that's not the case in those that are part of the historical Unix source lineage. (I have no idea off the top of my head whether Linux, or oddballs like OSF/1, might make environment variables implicitly available to syscalls other than execve(2).)

This message posted from opensolaris.org
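The EILSEQ behaviour described above amounts to a pure validity check on the byte string before the name is ever stored. A hedged sketch of the equivalent predicate (in Python, purely for illustration; the function name is made up, and ZFS does this check inside the kernel):

```python
import errno
import os

def utf8only_check(name_bytes: bytes) -> None:
    """Reject a proposed filename the way a utf8only=on filesystem
    would: a name that is not valid UTF-8 yields EILSEQ, and nothing
    is escaped, replaced, or converted."""
    try:
        name_bytes.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        raise OSError(errno.EILSEQ, os.strerror(errno.EILSEQ), name_bytes)

utf8only_check(b"h\xc3\xa4st")      # valid UTF-8: accepted silently
try:
    utf8only_check(b"h\xe4st")      # ISO-8859-1 bytes: rejected
except OSError as e:
    print(e.errno == errno.EILSEQ)  # True
```

Note this answers Marcus's "escape, replace, or convert?" question: none of the three; the operation simply fails.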
Bart Smaalders <bart.smaalders at Sun.COM> wrote:
> > I'm unable to find more info about this. E.g., what does "reject
> > file names" mean in practice? [...]
>
> Note that the normal ZFS behavior is exactly what you'd expect: you
> get the filenames you wanted; the same ones back you put in.

OK, thanks. I still haven't got an answer to my original question, though. I.e., is there some way to know what text the filename is, or do I have to make a more or less wild guess about what encoding the program that created the file used?

OK, if I use utf8only then I know that all filenames can be interpreted as UTF-8. However, that's completely unacceptable for me, since I'd much rather have an important file with an incomprehensible filename than not have that important file at all. Also, what about non-UTF-8 encodings? E.g., is it possible to know whether 0xe4 is "ä" (as in ISO-8859-1) or "ф" (as in ISO-8859-5)?

> The trick is that in order to support such things as
> casesensitivity=false for CIFS, the OS needs to know what characters
> are uppercase vs lowercase, which means it needs to know about
> encodings, and reject codepoints which cannot be classified as
> uppercase vs lowercase.

I don't see why the OS would care about that. Isn't that the job of the CIFS daemon? As a matter of fact I don't see why the OS would need to know how to decode any filename-bytes to text. However, I firmly believe that user applications should have that opportunity. If the encoding of filenames is not known (explicitly or implicitly) then applications don't have that opportunity.

- Marcus
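One answer applications have since adopted for exactly this dilemma (keep the important file even when its name is incomprehensible) is a lossless escape scheme rather than a guess. This is Python's approach (PEP 383), shown here as a hedged illustration; it is not a ZFS feature:

```python
# Decode with surrogateescape so undecodable bytes survive a round
# trip: the name is preserved exactly, even though the bad bytes still
# aren't meaningful text.
raw = b"h\xe4st"                 # not valid UTF-8

text = raw.decode("utf-8", "surrogateescape")
print(repr(text))                # 'h\udce4st' -- lossless placeholder
print(text.encode("utf-8", "surrogateescape") == raw)   # True

# os.fsdecode()/os.fsencode() apply the same scheme using the locale's
# filesystem encoding, which is how Python programs avoid both data
# loss and crashes on undecodable filenames.
```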
Marcus Sundman wrote:
> OK, thanks. I still haven't got an answer to my original question,
> though. I.e., is there some way to know what text the filename is, or
> do I have to make a more or less wild guess about what encoding the
> program that created the file used?

How do you expect the filesystem to know this? open(2) takes 3 args; none of them have anything to do with the encoding.

> OK, if I use utf8only then I know that all filenames can be interpreted
> as UTF-8. [...] Also, what about non-UTF-8 encodings? E.g., is it
> possible to know whether 0xe4 is "ä" (as in ISO-8859-1) or "ф" (as in
> ISO-8859-5)?

There are two characters not allowed in filenames: NULL and '/'. Everything else is meaning imparted by the user, just like the contents of text documents.

> I don't see why the OS would care about that. Isn't that the job of the
> CIFS daemon?

If my program attempts to open file "fred" in a case-insensitive filesystem and "FRED" exists, I would expect to get a handle to "FRED". In order for the filesystem to do this, the OS must be able to perform this comparison. CIFS is in the kernel; case insensitivity is a property of the filesystem, not a layer added on by a daemon. If not, I could create "fred" and "FRED" locally, and then which one would I get were I to open "FrEd" via CIFS?

> As a matter of fact I don't see why the OS would need to
> know how to decode any filename-bytes to text. However, I firmly
> believe that user applications should have that opportunity. If the
> encoding of filenames is not known (explicitly or implicitly) then
> applications don't have that opportunity.

The OS doesn't care; the user does. If a user creates a file named ?????? in his home directory, but my encoding doesn't contain these characters, what should ls -l display? You also assume that knowing the encoding will transfer meaning... but a directory containing files named ??????, ????? and ?????? may as well be line noise for most of us.

The OS doesn't care one whit about language or encodings (save the optional upper/lower-case accommodation for CIFS). The OS simply stores files under names that don't contain either '/' or NULL.

UTF8 is the answer here. If you care about anything more than simple ascii and you work in more than a single locale/encoding, use UTF8. You may not understand the meaning of a filename, but at least you'll see the same characters as the person who wrote it.

- Bart

-- 
Bart Smaalders			Solaris Kernel Performance
barts at cyber.eng.sun.com		http://blogs.sun.com/barts
"You will contribute more with mercurial than with thunderbird."
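Bart's "fred"/"FrEd" point is that case-insensitive lookup is defined on characters, not bytes, so the comparer must know the encoding. A hedged sketch of the comparison itself (Python's casefold stands in for whatever case-folding table the kernel uses):

```python
# Case-insensitive name matching requires character-level knowledge:
# only once bytes are decoded do "uppercase" and "lowercase" exist.
def ci_equal(a: str, b: str) -> bool:
    return a.casefold() == b.casefold()

print(ci_equal("fred", "FrEd"))   # True
print(ci_equal("häst", "HÄST"))   # True -- needs Unicode case rules,
                                  # not just ASCII arithmetic

# On raw bytes no such comparison is possible: is 0xe4 a lowercase
# letter? Only the encoding can say.
```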
> OK, thanks. I still haven't got any answer to my original question,
> though. I.e., is there some way to know what text the
> filename is, or do I have to make a more or less wild guess what
> encoding the program that created the file used?

You have to guess. As far as I know, Apple's HFS (and HFS+) is the only file system which stores the encoding along with the filename. NFS doesn't provide a mechanism to send the encoding with the filename; I don't believe that CIFS does, either.

If you're writing the application, you could store the encoding as an extended attribute of the file. This would be useful, for instance, for an AFP server.

> > The trick is that in order to support such things as
> > casesensitivity=false for CIFS, the OS needs to know what characters
> > are uppercase vs lowercase [...]
>
> I don't see why the OS would care about that. Isn't that the job of the
> CIFS daemon?

The CIFS daemon can do it, but it would require that the daemon cache the whole directory in memory (at least, to get reasonable efficiency). This doesn't work so well for large directories. If you leave it up to the CIFS daemon, you also wind up with problems if you have a single sharepoint shared between local users, NFS & CIFS -- the NFS client can create two files named "a" and "A", but the CIFS client can only see one of those.

> As a matter of fact I don't see why the OS would need to
> know how to decode any filename-bytes to text.
> However, I firmly believe that user applications should have that
> opportunity. If the encoding of filenames is not known (explicitly or
> implicitly) then applications don't have that opportunity.

Yes -- that's why Apple includes an encoding byte in both HFS and HFS+. (In HFS+, filenames are normalized to 16-bit Unicode, but the encoding is still useful in choosing how to recompose the characters, and in providing hints for applications which prefer the names in some 8-bit encoding.)

-- Anton
Bart Smaalders <bart.smaalders at Sun.COM> wrote:> Marcus Sundman wrote: > > Bart Smaalders <bart.smaalders at Sun.COM> wrote: > >>> I''m unable to find more info about this. E.g., what does "reject > >>> file names" mean in practice? E.g., if a program tries to create a > >>> file using an utf8-incompatible filename, what happens? Does the > >>> fopen() fail? Would this normally be a problem? E.g., do tar and > >>> similar programs convert utf8-incompatible filenames to utf8 upon > >>> extraction if my locale (or wherever the fs encoding is taken > >>> from) is set to use utf-8? If they don''t, then what happens with > >>> archives containing utf8-incompatible filenames? > >> > >> Note that the normal ZFS behavior is exactly what you''d expect: you > >> get the filenames you wanted; the same ones back you put in. > > > > OK, thanks. I still haven''t got any answer to my original question, > > though. I.e., is there some way to know what text the filename is, > > or do I have to make a more or less wild guess what encoding the > > program that created the file used? > > How do you expect the filesystem to know this? Open(2) takes 3 args; > none of them have anything to do with the encoding.I don''t expect the filesystem to know "this" (whatever you mean by "this"). I don''t expect the filesystem not to either. I just don''t know, and therefore I ask.> > OK, if I use utf8only then I know that all filenames can be > > interpreted as UTF-8. However, that''s completely unacceptable for > > me, since I''d much rather have an important file with an > > incomprehensible filename than not have that important file at all. > > Also, what about non-UTF-8 encodings? E.g., is it possible to know > > whether 0xe4 is "?" (as in iso-8859-1) or "?" (as in iso-8859-5)? > > > > There are two characters not allowed in filenames: NULL and ''/''. > Everything else is meaning imparted by the user, just like the > contents of text documents.You are confusing "characters" and "bytes". 
The former are encoded when transformed to the latter. ''/'' is a character, 0x2f is a byte. (Well, representations of a character and of a byte, respectively, if we''re nitpicking.)> >> The trick is that in order to support such things as > >> casesensitivity=false for CIFS, the OS needs to know what > >> characters are uppercase vs lowercase, which means it needs to > >> know about encodings, and reject codepoints which cannot be > >> classified as uppercase vs lowercase. > > > > I don''t see why the OS would care about that. Isn''t that the job of > > the CIFS daemon? > > If my program attempts to open file "fred" in a case-insensitive > filesystem and "FRED" exists, I would expect to get a handle to > "FRED". In order for the filesystem to do this, the OS must be able > to perform this comparison.Well, yes, if the case-insensitivity is in the filesystem (and if the fs is in the kernel), but my point was that it wouldn''t _have_to_ be in the filesystem. It''s probably faster if it is, though.> CIFS is in the kernel; case insensitivity is a property of the > filesystem, not a layer added on by a daemon.You probably mean "CIFS is in (Open)Solaris" and "case insensitivity is a property of ZFS".> If not, I could create "fred" and "FRED" locally, and then which one > would I get were I to open "FrEd" via CIFS?I guess that would be up to the implementation (unless CIFS includes it in its specification).> > As a matter of fact I don''t see why the OS would need to > > know how to decode any filename-bytes to text. However, I firmly > > believe that user applications should have that opportunity. If the > > encoding of filenames is not known (explicitly or implicitly) then > > applications don''t have that opportunity. > > The OS doesn''t care; the user does. If a user creates a file named > ?????? in his home directory, but my encoding doesn''t contain these > characters, what should ls -l display?I assume we''re assuming encodings to be known here. 
(If the encodings are unknown/unspecified the user can''t create a file named any particular character string, only raw data (bits/bytes).) What a particular program displays is up to the implementation, I guess. I''ve seen programs use escapes (e.g., \uc3\ua5), or ''?'', or empty squares, or small squares with hex-numbers in them. (I''ve also seen programs not display the text at all (sometimes not displaying any text after the offending part), or even crash.) However, we have the same problem always when programs should display text, whether we know the encoding or not. Command line programs might propagate the problem to the terminal (as ls in OpenSolaris currently seems to be doing), graphical programs have to deal with it themselves. So, while the OS might not care, the programs certainly do, especially the graphical ones, since they can''t let someone else deal with the problem. (And yes, I know programs don''t like to be anthropomorphized.)> You also assume that knowing the encoding will transfer meaning... > but a directory containing files named ??????, ????? and ?????? may > as well be line noise for most of us.I assume no such thing. However, I firmly believe that knowing the encoding of a bit sequence is the _only_possibility_ to be able to _know_ what text that bit sequence represents.> The OS doesn''t care one whit about language or encodings (save > the optional upper/lower case accommodation for CIFS). The OS simply > stores files under names that don''t contain either ''/'' or NULL.I think you mean "[...]names that don''t contain either 0x2F and 0x0", which includes characters such as ''A'' in UTF-16.> UTF8 is the answer here. If you care about anything more than simple > ascii and you work in more than a single locale/encoding, use UTF8. > You may not understand the meaning of a filename, but at least > you''ll see the same characters as the person who wrote it.I think you are a bit confused. 
A) If you meant that _I_ should use UTF-8 then that alone won't help.
Let's say the person who created the file used ISO-8859-1 and named it
'häst', i.e., 0x68e47374. If I then use UTF-8 when displaying the
filename my program will be faced with the problem of what to do with
the second byte, 0xe4, which can't be decoded using UTF-8. ("häst" is
0x68c3a47374 in UTF-8, in case someone wonders.)

B) If you meant that _everybody_ should use UTF-8 then why would UTF-8
be "the answer"? Certainly it's enough that everybody uses the same
encoding.

Regards,
Marcus
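[Editor's note: example A above can be checked mechanically, and the display
fallbacks mentioned earlier in the thread (escapes, '?', replacement glyphs)
correspond to standard decode-error strategies. A minimal Python sketch,
illustrative only; the byte values come straight from the mail.]

```python
raw = bytes([0x68, 0xE4, 0x73, 0x74])          # 'häst' as ISO-8859-1
assert raw.decode("iso-8859-1") == "häst"

# The same name is different bytes under UTF-8, as the mail says:
assert "häst".encode("utf-8") == bytes([0x68, 0xC3, 0xA4, 0x73, 0x74])

# Decoding the ISO-8859-1 bytes as UTF-8 fails on the second byte, 0xe4:
try:
    raw.decode("utf-8")
except UnicodeDecodeError as err:
    print(hex(raw[err.start]))                  # 0xe4

# Two of the display fallbacks discussed, expressed as error handlers:
print(raw.decode("utf-8", errors="backslashreplace"))  # h\xe4st
print(raw.decode("utf-8", errors="replace"))           # h?st with U+FFFD
```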
"Anton B. Rang" <rang at acm.org> wrote:> Yes -- that''s why Apple includes an encoding byte in both HFS and HFS+. (In HFS+, filenames are normalized to 16-bit Unicode, but the encoding is still useful in choosing how to recompose the characters, and in providing hints for applications which prefer the names in some 8-bit encoding.)If you like to do something like this, it would be better to use the UDF aproach. In UDF directories, the first byte of a filename may either be 8 (''\010'') and then the filename is ISO-8859-1 (the low 8 bits of UNOICODE) or 16 (''\020'') and then the file name is usinf UCS-2 (16 bit chars) from UNICODE. This allows to keep the full path name length for the popular ISO-8859-1 coding and still needs less space than UTF-8 if you e.g. use japanese chars as Japanese chars need 3 octects in UTF-8. J?rg -- EMail:joerg at schily.isdn.cs.tu-berlin.de (home) J?rg Schilling D-13353 Berlin js at cs.tu-berlin.de (uni) schilling at fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/ URL: http://cdrecord.berlios.de/old/private/ ftp://ftp.berlios.de/pub/schily
"Anton B. Rang" <rang at acm.org> wrote:> > OK, thanks. I still haven''t got any answer to my original question, > > though. I.e., is there some way to know what text the > > filename is, or do I have to make a more or less wild guess what > > encoding the program that created the file used? > > You have to guess.Ouch! Guessing sucks. (By the way, that''s why I switched to ZFS with its internal checksums, so that I wouldn''t have to guess if my data was OK.) Thanks for the answer, though. Do you happen to know where programs in (Open)Solaris look when they want to know how to encode text to be used in a filename? Is it LC_CTYPE?> NFS doesn''t provide a mechanism to send the encoding with the > filename; I don''t believe that CIFS does, either.Really?!? That''s insane! How do programs know how to encode filenames to be sent over NFS or CIFS?> If you''re writing the application, you could store the encoding as an > extended attribute of the file. This would be useful, for instance, > for an AFP server.OK. But then I''d have to hack a similar change into all other programs that I use, too.> > > The trick is that in order to support such things as > > > casesensitivity=false for CIFS, the OS needs to know what > > > characters are uppercase vs lowercase, which means it needs to > > > know about encodings, and reject codepoints which cannot be > > > classified as uppercase vs lowercase. > > > > I don''t see why the OS would care about that. Isn''t that the job of > > the CIFS daemon? 
> The CIFS daemon can do it, but it would require that the daemon cache
> the whole directory in memory (at least, to get reasonable
> efficiency).

I guess that depends on what file access functions there are for the
file system.

> If you leave it up to the CIFS daemon, you also wind up with problems
> if you have a single sharepoint shared between local users, NFS &
> CIFS -- the NFS client can create two files named "a" and "A", but
> the CIFS client can only see one of those.

Not necessarily. There could be some (nonstandard) way of accessing
such duplicates (e.g., by having the CIFS daemon append "[dup-N]" or
somesuch to the name). And even if that problem did exist it might
still be OK for CIFS access to have that limitation.

Regards,
Marcus
Joerg.Schilling at fokus.fraunhofer.de (Joerg Schilling) wrote:

> [...] ISO-8859-1 (the low 8 bits of UNICODE) [...]

Unicode is not an encoding, but you probably mean "the low 8 bits of
UCS-2" or "the first 256 codepoints in Unicode" or somesuch.

Regards,
Marcus
Marcus Sundman <sundman at iki.fi> wrote:

> Joerg.Schilling at fokus.fraunhofer.de (Joerg Schilling) wrote:
> > [...] ISO-8859-1 (the low 8 bits of UNICODE) [...]
>
> Unicode is not an encoding, but you probably mean "the low 8 bits of
> UCS-2" or "the first 256 codepoints in Unicode" or somesuch.

Unicode _is_ an encoding that uses 21 (IIRC) bits. UCS-2 is a way to
_represent_ the low 16 bits of UNICODE in a way that allows using some
tricks to go beyond 16 bits. Microsoft e.g. does not go beyond 16
bits. ISO-8859-1 is a representation of the low 8 bits of UNICODE
(well, ISO-8859-1 is older than UNICODE ;-). ISO-8859-1 does not allow
coding more than the 8 least significant bits from Unicode.

Jörg

-- 
EMail: joerg at schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin
       js at cs.tu-berlin.de (uni)
       schilling at fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/
URL:   http://cdrecord.berlios.de/old/private/ ftp://ftp.berlios.de/pub/schily
Bart Smaalders <bart.smaalders at Sun.COM> wrote:

> > OK, thanks. I still haven't got any answer to my original question,
> > though. I.e., is there some way to know what text the filename is,
> > or do I have to make a more or less wild guess what encoding the
> > program that created the file used?
>
> How do you expect the filesystem to know this? Open(2) takes 3 args;
> none of them have anything to do with the encoding.

A while ago, when discussing things with some filesystem guys, I made
the proposal to introduce a new syscall to inform the kernel about the
locale coding used by a process. If the kernel (or filesystem) then
likes to store file names in a kernel-specific way and if there is an
in-kernel libiconv, the kernel could convert from/to the userland view.
A problem that remains is a userland coding that probably cannot
represent all "characters" used inside the kernel view.

> There are two characters not allowed in filenames: NULL and '/'.
> Everything else is meaning imparted by the user, just like the
> contents of text documents.

Platforms that insist on UTF-8 coding for filenames often disallow
octet codings that are not valid inside a UTF-8 character sequence.

> The OS doesn't care; the user does. If a user creates a file named
> ?????????????????? in his home directory, but my encoding doesn't
> contain these characters, what should ls -l display? You also assume
> that knowing the encoding will transfer meaning... but a directory
> containing files named ??????????????????, ??????????????? and
> ?????????????????? may as well be line noise for most of us.
>
> The OS doesn't care one whit about language or encodings (save
> the optional upper/lower case accommodation for CIFS). The OS simply
> stores files under names that don't contain either '/' or NULL.
>
> UTF8 is the answer here. If you care about anything more than simple
> ascii and you work in more than a single locale/encoding, use UTF8.
> You may not understand the meaning of a filename, but at least
> you'll see the same characters as the person who wrote it.

UTF-8 may be the answer for many but definitely not all problems.
UTF-8 may cause fewer problems in 5 years (if more people use it by
then) than the problems known with UTF-8 today.

Jörg

-- 
EMail: joerg at schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin
       js at cs.tu-berlin.de (uni)
       schilling at fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/
URL:   http://cdrecord.berlios.de/old/private/ ftp://ftp.berlios.de/pub/schily
Joerg.Schilling at fokus.fraunhofer.de (Joerg Schilling) wrote:

> Marcus Sundman <sundman at iki.fi> wrote:
> > Joerg.Schilling at fokus.fraunhofer.de (Joerg Schilling) wrote:
> > > [...] ISO-8859-1 (the low 8 bits of UNICODE) [...]
> >
> > Unicode is not an encoding, but you probably mean "the low 8 bits
> > of UCS-2" or "the first 256 codepoints in Unicode" or somesuch.
>
> Unicode _is_ an encoding that uses 21 (IIRC) bits.

AFAIK you are incorrect. Unicode is a standard that, among other
things, defines a _number_ for each character. A number does not equal
21 bits, even if it so happens that the highest codepoint number in the
current version is no more than 21 bits long. Unicode defines (at
least) 3 encodings to represent those characters: UTF-8, UTF-16 and
UTF-32.

Well, it doesn't very much matter exactly how the terms are defined, as
long as everybody knows what's what. So, I'm sorry for nitpicking.

- Marcus
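[Editor's note: the distinction being argued here can be made concrete. A code
point is a number; an encoding maps that number to bytes, and different
encodings produce different bytes for the same number. A small illustrative
Python sketch:]

```python
ch = "ä"
assert ord(ch) == 0xE4                         # the Unicode code point: a number
assert ch.encode("latin-1")   == b"\xe4"       # ISO-8859-1: same value, one byte
assert ch.encode("utf-8")     == b"\xc3\xa4"   # UTF-8: two different bytes
assert ch.encode("utf-16-be") == b"\x00\xe4"   # UTF-16: two bytes, another way

# Code points above U+FFFF exist (e.g. U+1D11E, musical G clef), so
# "16 bits" is not enough in general:
assert ord("\U0001D11E") == 0x1D11E
```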
Marcus Sundman wrote:

> Bart Smaalders <bart.smaalders at Sun.COM> wrote:
>> UTF8 is the answer here. If you care about anything more than simple
>> ascii and you work in more than a single locale/encoding, use UTF8.
>> You may not understand the meaning of a filename, but at least
>> you'll see the same characters as the person who wrote it.
>
> I think you are a bit confused.
>
> A) If you meant that _I_ should use UTF-8 then that alone won't help.
> Let's say the person who created the file used ISO-8859-1 and named
> it 'häst', i.e., 0x68e47374. If I then use UTF-8 when displaying the
> filename my program will be faced with the problem of what to do with
> the second byte, 0xe4, which can't be decoded using UTF-8. ("häst" is
> 0x68c3a47374 in UTF-8, in case someone wonders.)

What I mean is very simple:

The OS has no way of merging your various encodings. If I create a
directory, and have people from around the world create a file in that
directory named after themselves in their own character sets, what
should I see when I invoke:

% ls -l | less

in that directory?

If you wish to share filenames across locales, I suggest you and
everyone else writing to that directory use an encoding that will work
across all those locales. The encoding that works well for this on
Unix systems is UTF8, since it leaves '/' and NULL alone.

- Bart

-- 
Bart Smaalders                 Solaris Kernel Performance
barts at cyber.eng.sun.com       http://blogs.sun.com/barts
"You will contribute more with mercurial than with thunderbird."
Bart Smaalders <bart.smaalders at Sun.COM> wrote:

> Marcus Sundman wrote:
> > Bart Smaalders <bart.smaalders at Sun.COM> wrote:
> >> UTF8 is the answer here. If you care about anything more than
> >> simple ascii and you work in more than a single locale/encoding,
> >> use UTF8. You may not understand the meaning of a filename, but at
> >> least you'll see the same characters as the person who wrote it.
> >
> > I think you are a bit confused.
> >
> > A) If you meant that _I_ should use UTF-8 then that alone won't
> > help. Let's say the person who created the file used ISO-8859-1 and
> > named it 'häst', i.e., 0x68e47374. If I then use UTF-8 when
> > displaying the filename my program will be faced with the problem
> > of what to do with the second byte, 0xe4, which can't be decoded
> > using UTF-8. ("häst" is 0x68c3a47374 in UTF-8, in case someone
> > wonders.)
>
> What I mean is very simple:
>
> The OS has no way of merging your various encodings. If I create a
> directory, and have people from around the world create a file
> in that directory named after themselves in their own character sets,
> what should I see when I invoke:
>
> % ls -l | less
>
> in that directory?

Either (1) programs can find out what the encoding is, or (2) programs
must assume the encoding is what some environment variable (or
somesuch) is set to. (1) The OS doesn't have to "merge" anything, just
let the programs handle any conversions as they see fit. (2) The OS
must transcode the filenames. If a filename is incompatible with the
target encoding then the offending characters must be escaped.

> If you wish to share filenames across locales, I suggest you and
> everyone else writing to that directory use an encoding that will
> work across all those locales. The encoding that works well for this
> on Unix systems is UTF8, since it leaves '/' and NULL alone.

Again, that won't work. First of all there is no way to enforce
programs to use UTF-8.
I can't even force my own programs to do that. (E.g., unrar or unzip or
tar or 7z (can't remember which one(s)) just dump the filename data to
the fs in whatever encoding they were inside the archive, and I have at
least one collaboration program that also does it similarly.)

Now, if I force the fs to only accept filenames compatible with UTF-8
(i.e., utf8only) then I risk losing files. I'd rather have files with
incomprehensible filenames than not have them at all. OTOH, if I allow
filenames incompatible with UTF-8 then my programs can't necessarily
access them if I use UTF-8. I could use some 8-bits/char encoding
(e.g., ISO-8859-15), but I'd rather not, since the world is going the
way of UTF-8 and so I'd just be dragging behind. And then I would also
have problems with garbage filenames when they use UTF-8 or some other
encoding. Also, I'm quite sure I do have files with names with
characters not in ISO-8859-15.

So, you see, there is no way for me to use filenames intelligibly
unless their encodings are knowable. (In fact I'm quite surprised that
zfs doesn't (and even can't) know the encoding(s) of filenames. Usually
Sun seems to make relatively sane design decisions. This, however, is
more what I'd expect from linux with their overpragmatic "who cares if
it's sane, as long as it kinda works"-attitudes.)

Regards,
Marcus
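[Editor's note: this dilemma later got a pragmatic userland answer: keep
undecodable filename bytes instead of rejecting files. The sketch below uses
Python's "surrogateescape" error handler (PEP 383) purely as an illustration
of the idea; it is not something ZFS or Solaris of this era provided.]

```python
# Sketch: a filename dumped as raw ISO-8859-1 bytes by e.g. an unzip,
# handled on a system whose nominal encoding is UTF-8.
raw = b"h\xe4st"

# "surrogateescape" smuggles the undecodable byte through as a lone
# surrogate instead of raising, so the file is not "lost":
name = raw.decode("utf-8", errors="surrogateescape")
assert name == "h\udce4st"

# ...and the original bytes are recoverable exactly for syscalls:
assert name.encode("utf-8", errors="surrogateescape") == raw

# For display, the offending part can be escaped, as discussed earlier:
print(raw.decode("utf-8", errors="backslashreplace"))  # h\xe4st
```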
Marcus Sundman <sundman at iki.fi> wrote:

> Joerg.Schilling at fokus.fraunhofer.de (Joerg Schilling) wrote:
> > Marcus Sundman <sundman at iki.fi> wrote:
> > > Joerg.Schilling at fokus.fraunhofer.de (Joerg Schilling) wrote:
> > > > [...] ISO-8859-1 (the low 8 bits of UNICODE) [...]
> > >
> > > Unicode is not an encoding, but you probably mean "the low 8 bits
> > > of UCS-2" or "the first 256 codepoints in Unicode" or somesuch.
> >
> > Unicode _is_ an encoding that uses 21 (IIRC) bits.
>
> AFAIK you are incorrect. Unicode is a standard that, among other
> things, defines a _number_ for each character. A number does not equal

And I tend to call the relation character <-> number an "encoding". As
the "number" may be outside the range of "classical characters" that on
most systems live inside octets, there is a need to use another
encoding on top of the Unicode encoding. This second encoding is
typically UTF-8 on UNIX.

Jörg

-- 
EMail: joerg at schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin
       js at cs.tu-berlin.de (uni)
       schilling at fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/
URL:   http://cdrecord.berlios.de/old/private/ ftp://ftp.berlios.de/pub/schily
Bart Smaalders <bart.smaalders at Sun.COM> wrote:

> The OS has no way of merging your various encodings. If I create a
> directory, and have people from around the world create a file
> in that directory named after themselves in their own character sets,
> what should I see when I invoke:
>
> % ls -l | less
>
> in that directory?
>
> If you wish to share filenames across locales, I suggest you and
> everyone else writing to that directory use an encoding that will
> work across all those locales. The encoding that works well for this
> on Unix systems is UTF8, since it leaves '/' and NULL alone.

The problem with this approach is that all users need to change their
locale encoding. Some of them may not be able to do so because they
need to log in to older systems that do not support UTF-8.

We would have fewer problems if Unicode had been introduced 10 years
earlier. Because of missing encoding support for their countries,
people in Russia, China, ... did create their own encoding schemes in
the 1980s that are still in use.

Jörg

-- 
EMail: joerg at schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin
       js at cs.tu-berlin.de (uni)
       schilling at fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/
URL:   http://cdrecord.berlios.de/old/private/ ftp://ftp.berlios.de/pub/schily
> Do you happen to know where programs in (Open)Solaris look when they
> want to know how to encode text to be used in a filename? Is it
> LC_CTYPE?

In general, they don't. Command-line utilities just use the sequence of
bytes entered by the user. GUI-based software does as well, but the
encoding used for user input can sometimes be selected....

> > NFS doesn't provide a mechanism to send the encoding with the
> > filename; I don't believe that CIFS does, either.
>
> Really?!? That's insane! How do programs know how to
> encode filenames to be sent over NFS or CIFS?

For NFSv3, you guess. :-) It's just stream-of-bytes.

For NFSv4, the encoding used to transmit data is supposed to be UTF-8,
but this isn't enforced by most clients. What's more, since the
encoding isn't stored, the reverse translation (UTF-8 to local
encoding) would have to be done by the NFS client based on ...
something. Usually this is "just return the raw bytes and let the
application deal with the mess."

For CIFS, you can send either "ASCII" (which I believe really means
uninterpreted bytes) or UTF-16. If you're working in UTF-16, and you're
on Windows, there are two sets of APIs. The Unicode APIs will return
the proper Unicode names. The non-Unicode (legacy) APIs will encode the
names according to your system's current "code page" setting.

-- Anton

This message posted from opensolaris.org
"Anton B. Rang" <rang at acm.org> wrote:> > Do you happen to know where programs in (Open)Solaris look when they > > want to know how to encode text to be used in a filename? Is it > > LC_CTYPE? > > In general, they don''t. Command-line utilities just use the sequence > of bytes entered by the user.Obviously that depends on the application. A command-line utility that interprets an normal xml file containing filenames know the characters but not the bytes. The same goes for command-line utilities that receive the filenames as text (e.g., some file transfer utility or daemon).> GUI-based software does as well, but the encoding used for user input > can sometimes be selected....Hmm.. I''m usually programming at quite high a level, so I''m not very familiar with how stuff works under the hood... If I run xev on my linux box (I don''t have X on any (Open)Solaris) and press the ?-key on my keyboard it says "keycode 48" and "keysym 0xe4", and then "XLookupString gives 2 bytes: (c3 a4) "?"". Thus at least XLookupString seems to know that I''m using UTF-8. Where did it (or whoever converted 0xe4 to 0xc3a4) get the needed info? - Marcus
Marcus Sundman <sundman at iki.fi> writes:

> So, you see, there is no way for me to use filenames intelligibly
> unless their encodings are knowable. (In fact I'm quite surprised
> that zfs doesn't (and even can't) know the encoding(s) of filenames.
> Usually Sun seems to make relatively sane design decisions. This,
> however, is more what I'd expect from linux with their overpragmatic
> "who cares if it's sane, as long as it kinda works"-attitudes.)

To be fair, ZFS is constrained by compatibility requirements with
existing systems. For the longest time the only interpretation that
Unix kernels put on the filenames passed by applications was to treat
"/" and "\000" specially. The interfaces provided to applications
assume this is the entire extent of the process. Changing this
incompatibly is not an option, and adding new interfaces to support
this is meaningless unless there is a critical mass of applications
that use them. It's not reasonable to talk about "ZFS" doing this,
since it's just a part of the wider ecosystem.

To solve this problem at the moment takes one of two approaches:

1. A userland convention is adopted to decide on what meaning the byte
   strings that the kernel provides have.

2. Some new interfaces are created to pass this information into the
   kernel and get it back.

Leaving aside the merits of either approach, both of them require
significant agreement from applications to use a certain approach
before they reap any benefits. There's not much ZFS itself can do
there.

Boyd
> > In general, they don't. Command-line utilities just use the
> > sequence of bytes entered by the user.
>
> Obviously that depends on the application. A command-line utility
> that interprets a normal XML file containing filenames knows the
> characters but not the bytes. The same goes for command-line
> utilities that receive the filenames as text (e.g., some file
> transfer utility or daemon).

It's true that they know the characters, and not necessarily the bytes
-- but all of the tools I'm aware of ignore the characters and simply
treat these as bytes when it comes to making calls into the file
system.

> If I run xev on my linux box (I don't have X on any (Open)Solaris)
> and press the ä-key on my keyboard it says "keycode 48" and "keysym
> 0xe4", and then "XLookupString gives 2 bytes: (c3 a4) "ä"". Thus at
> least XLookupString seems to know that I'm using UTF-8. Where did it
> (or whoever converted 0xe4 to 0xc3a4) get the needed info?

Depending on what version of xev you've got, there's a good chance it
made a call to XmbLookupString (the "multibyte" version of
XLookupString). This uses the current locale for the encoding; the
locale is stored in an environment variable which can be queried by the
application. (But this has wandered afield of file systems -- though
it's true that the file system could potentially look at environment
variables to make encoding choices!)
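[Editor's note: the environment lookup described above is the standard C
locale mechanism (LC_ALL/LC_CTYPE/LANG), and an application can query the
resulting codeset itself. A sketch using Python's binding to the same libc
facility; nl_langinfo is POSIX and not available on Windows.]

```python
import locale

# Adopt whatever the environment's LC_ALL/LC_CTYPE/LANG imply...
locale.setlocale(locale.LC_CTYPE, "")

# ...and ask which character encoding that locale uses. This is the
# same information Xlib consults when XmbLookupString produces bytes.
codeset = locale.nl_langinfo(locale.CODESET)
print(codeset)   # e.g. "UTF-8" under an en_US.UTF-8 locale
```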