Hello,

I'm wondering what are some use cases for ZFS's utf8only and normalization properties. They are off/none by default, and can only be set when the filesystem is created. When should they specifically be enabled and/or disabled? (i.e. Where is using them a really good idea? Where is using them a really bad idea?)

Looking forward, starting with Windows XP and OS X 10.5 clients, is there any reason to change the defaults in order to minimize problems?

From the documentation at http://dlc.sun.com/osol/docs/content/ZFSADMIN/gazss.html :

utf8only    Boolean    Off
This property indicates whether a file system should reject file names that include characters that are not present in the UTF-8 character code set. If this property is explicitly set to off, the normalization property must either not be explicitly set or be set to none. The default value for the utf8only property is off. This property cannot be changed after the file system is created.

normalization    String    None
This property indicates whether a file system should perform a unicode normalization of file names whenever two file names are compared, and which normalization algorithm should be used. File names are always stored unmodified, names are normalized as part of any comparison process. If this property is set to a legal value other than none, and the utf8only property was left unspecified, the utf8only property is automatically set to on. The default value of the normalization property is none. This property cannot be changed after the file system is created.

Background: I've built a test system running OpenSolaris 2009.06 (b111) with a ZFS RAIDZ1, with CIFS in workgroup mode. I'm testing with Windows XP and Mac OS X 10.5 clients connecting via CIFS (no NFS or AFP).

I've set these properties during zfs create or immediately afterwards:

casesensitivity=mixed
compression=on
snapdir=visible

and ran this to set up nonrestrictive ACLs as suggested by Alan Wright at the thread "[cifs-discuss] CIFS and permission mapping" at http://opensolaris.org/jive/message.jspa?messageID=365620#365947

chmod A=everyone@:full_set:fd:allow /tank/home

Thanks!
-hk
Nicolas Williams
2009-Aug-12 23:48 UTC
[zfs-discuss] utf8only and normalization properties
On Wed, Aug 12, 2009 at 06:17:44PM -0500, Haudy Kazemi wrote:
> I'm wondering what are some use cases for ZFS's utf8only and
> normalization properties. They are off/none by default, and can only be
> set when the filesystem is created. When should they specifically be
> enabled and/or disabled? (i.e. Where is using them a really good idea?
> Where is using them a really bad idea?)

These are for interoperability.

The world is converging on Unicode for filesystem object naming. If you want to exclude non-Unicode strings then you should set utf8only (some non-Unicode strings in some codesets can look like valid UTF-8 though).

But Unicode has multiple canonical and non-canonical ways of representing certain characters (e.g., ´). Solaris and Windows input methods tend to conform to NFKC, so they will interop even if you don't enable the normalization feature. But MacOS X normalizes to NFD.

Therefore, if you need to interoperate with MacOS X then you should enable the normalization feature.

> Looking forward, starting with Windows XP and OS X 10.5 clients, is
> there any reason to change the defaults in order to minimize problems?

You should definitely enable normalization (see above).

It doesn't matter what normalization form you use, but "nfd" runs faster than "nfc".

The normalization feature doesn't cost much if you use all US-ASCII file names. And it doesn't cost much if your file names are mostly US-ASCII.

Nico
--
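To make the "multiple ways of representing certain characters" point concrete, here is a minimal Python sketch. It assumes Python's standard unicodedata module as a stand-in for a filesystem's normalization code (it is not what ZFS runs), and the variable names are illustrative only. It shows "a" with an acute accent in its precomposed and decomposed spellings, which differ byte for byte in UTF-8 but compare equal once both are normalized to the same form.

```python
import unicodedata

# "a" with an acute accent as one precomposed code point (the NFC spelling)
composed = "\u00e1"
# The same character as "a" plus a combining acute accent (the NFD spelling)
decomposed = "a\u0301"

composed_bytes = composed.encode("utf-8")
decomposed_bytes = decomposed.encode("utf-8")

print(composed_bytes)    # b'\xc3\xa1'
print(decomposed_bytes)  # b'a\xcc\x81'

# A byte-wise comparison treats these as two different file names
names_differ = composed_bytes != decomposed_bytes

# Normalizing both sides to one form before comparing makes them match
same_after_nfd = (
    unicodedata.normalize("NFD", composed)
    == unicodedata.normalize("NFD", decomposed)
)

print(names_differ, same_after_nfd)
```

A client that sends the decomposed spelling and a client that sends the precomposed one would therefore see each other's files as different names unless the comparison normalizes first.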
Nicolas Williams wrote:
> On Wed, Aug 12, 2009 at 06:17:44PM -0500, Haudy Kazemi wrote:
>
>> I'm wondering what are some use cases for ZFS's utf8only and
>> normalization properties. They are off/none by default, and can only be
>> set when the filesystem is created. When should they specifically be
>> enabled and/or disabled? (i.e. Where is using them a really good idea?
>> Where is using them a really bad idea?)
>>
>
> These are for interoperability.
>
> The world is converging on Unicode for filesystem object naming. If you
> want to exclude non-Unicode strings then you should set utf8only (some
> non-Unicode strings in some codesets can look like valid UTF-8 though).
>
> But Unicode has multiple canonical and non-canonical ways of
> representing certain characters (e.g., ´). Solaris and Windows
> input methods tend to conform to NFKC, so they will interop even if you
> don't enable the normalization feature. But MacOS X normalizes to NFD.
>
> Therefore, if you need to interoperate with MacOS X then you should
> enable the normalization feature.
>

Thank you for the reply. My goal is to configure the filesystem for the lowest common denominator without knowing up front which clients will be used. OS X and Win XP are listed because they are commonly used as desktop OSes. Ubuntu Linux is a third potential desktop OS.

The normalization property documentation says "this property indicates whether a file system should perform a unicode normalization of file names whenever two file names are compared. File names are always stored unmodified, names are normalized as part of any comparison process." Where does the file system use filename comparisons and what does it use them for? Filename collision checking? Sorting?

Is it used for any other operation, say when returning a filename to an application? Would applications reading/writing files to a ZFS filesystem ever notice the difference in normalization settings as long as they produce filenames that do not conflict with existing names or create invalid UTF8? The documentation says filenames are stored unmodified, which sounds like things should be transparent to applications.

(In regard to filename collision checking, if non-normalized unmodified filenames are always stored on disk, and they don't conflict in non-normalized form, what would the point be of normalizing the filenames for a comparison? To verify there isn't conflict in normalized forms, and if there is no conflict with an existing file to allow the filename to be written unmodified?)

>> Looking forward, starting with Windows XP and OS X 10.5 clients, is
>> there any reason to change the defaults in order to minimize problems?
>>
>
> You should definetely enable normalization (see above).
>
> It doesn't matter what normalization form you use, but "nfd" runs faster
> than "nfc".
>
> The normalization feature doesn't cost much if you use all US-ASCII file
> names. And it doesn't cost much if your file names are mostly US-ASCII.
>
> Nico
>

The ZFS documentation doesn't list the valid values for the normalization property other than 'none'. From your reply and from the official Unicode docs at http://unicode.org/reports/tr15/ and http://unicode.org/faq/normalization.html would it be correct to conclude that none, NFD, NFC, NFKC, and NFKD are the only valid values for the ZFS normalization property? If so, I suggest they be added to the documentation at http://dlc.sun.com/osol/docs/content/ZFSADMIN/gazss.html

Thanks,
-hk
Nicolas Williams
2009-Aug-13 23:33 UTC
[zfs-discuss] utf8only and normalization properties
On Thu, Aug 13, 2009 at 05:57:57PM -0500, Haudy Kazemi wrote:
> >Therefore, if you need to interoperate with MacOS X then you should
> >enable the normalization feature.
> >
> Thank you for the reply. My goal is to configure the filesystem for the
> lowest common denominator without knowing up front which clients will be
> used. OS X and Win XP are listed because they are commonly used as
> desktop OSes. Ubuntu Linux is a third potential desktop OS.

Right, so set normalization=formD .

> The normalization property documentation says "this property indicates
> whether a file system should perform a unicode normalization of file
> names whenever two file names are compared. File names are always
> stored unmodified, names are normalized as part of any comparison
> process." Where does the file system use filename comparisons and what
> does it use them for? Filename collision checking? Sorting?

The system does filename comparisons when doing lookups (open("/foo/bar/baz", ...) does at least three such lookups, for example), and on create (since that involves a lookup).

Yes, this is about collisions. Consider a file named "á" (that's "a" with an acute accent). There are _two_ possible encodings for that name in UTF-8. That means that you could have two files in the same directory and with the same name, though they'd have different names if you looked at the bytes that make up the names. That would be confusing, at the very least.

To avoid such collisions you can enable normalization.

You can find more here:

http://blogs.sun.com/nico/entry/filesystem_i18n

> Is it used for any other operation, say when returning a filename to an
> application? Would applications reading/writing files to a ZFS

No, directory listings always return the filename used when the file name was created, without any normalization.

> filesystem ever notice the difference in normalization settings as long
> as they produce filenames that do not conflict with existing names or
> create invalid UTF8? The documentation says filenames are stored
> unmodified, which sounds like things should be transparent to applications.

Applications shouldn't notice normalization being enabled. The only reasons to disable normalization are: a) you don't want to force the use of UTF-8, or b) you consistently use a single normalization form and you don't want to pay a penalty for normalizing on lookup.

(b) is probably not a problem -- the normalization code is fast if you use all US-ASCII strings, and it's linear with the number of non-ASCII, Unicode codepoints in file names. But I don't have performance numbers to share. I think that normalization should be enabled by default if you enable utf8only, and utf8only should probably be enabled by default in Solaris, but that's just my personal opinion.

> (In regard to filename collision checking, if non-normalized unmodified
> filenames are always stored on disk, and they don't conflict in
> non-normalized form, what would the point be of normalizing the
> filenames for a comparison? To verify there isn't conflict in
> normalized forms, and if there is no conflict with an existing file to
> allow the filename to be written unmodified?)

Yes.

> The ZFS documentation doesn't list the valid values for the
> normalization property other than 'none'. From your reply and from the

The zfs(1M) manpage lists them:

    normalization = none | formD | formKCf

That's not all existing Unicode normalization forms, no. The reason for this is that we only normalize on lookup (the file names returned by readdir are not normalized), and for that the forms C and D are semantically equivalent, but K and non-K forms are not semantically equivalent, so we need one K form and one non-K form. NFD is faster than NFC, but the K forms require a trip through form C, so NFKC is faster than NFKD (at least if I remember correctly). Which means that NFD and NFKC were sufficient, and there's no reason to ever want NFC or NFKD.

> suggest they be added to the documentation at
> http://dlc.sun.com/osol/docs/content/ZFSADMIN/gazss.html

Yes, that's a good point.

PS: ZFS directories are hashed. When normalization is enabled, the hash keys are normalized on create, but the hash contents are not, so filenames remain unnormalized.
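The lookup behavior described above (normalized hash keys, names stored as created, and the difference between K and non-K forms) can be sketched as a toy model in Python. This is only an illustration under stated assumptions: ToyDirectory and its methods are hypothetical names, a plain dict stands in for the directory hash, and Python's unicodedata module stands in for the filesystem's normalization code. It is not how ZFS is implemented.

```python
import unicodedata


class ToyDirectory:
    """Hypothetical model of a directory that normalizes names only for comparison."""

    def __init__(self, form):
        self.form = form      # a Unicode form name such as "NFD" or "NFKC"
        self.entries = {}     # normalized key -> name exactly as created

    def _key(self, name):
        return unicodedata.normalize(self.form, name)

    def create(self, name):
        key = self._key(name)
        if key in self.entries:
            raise FileExistsError(name)
        self.entries[key] = name  # the stored name is left unmodified

    def lookup(self, name):
        # Raises KeyError if no entry matches after normalization
        return self.entries[self._key(name)]

    def listdir(self):
        return list(self.entries.values())


composed = "caf\u00e9"       # "e" with acute accent as one code point
decomposed = "cafe\u0301"    # "e" followed by a combining acute accent

d = ToyDirectory("NFD")
d.create(composed)
found = d.lookup(decomposed)   # a different byte sequence finds the same entry

try:
    d.create(decomposed)
    collided = False
except FileExistsError:
    collided = True            # the second spelling is rejected as a duplicate

listing = d.listdir()          # still the name as originally created

# Non-K forms keep compatibility characters distinct; K forms fold them.
ligature = "\ufb01le"          # starts with the single "fi" ligature character
nfd_dir = ToyDirectory("NFD")
nfd_dir.create("file")
nfd_dir.create(ligature)       # allowed: NFD leaves the ligature alone

nfkc_dir = ToyDirectory("NFKC")
nfkc_dir.create("file")
try:
    nfkc_dir.create(ligature)
    k_collided = False
except FileExistsError:
    k_collided = True          # NFKC maps the ligature to "fi"
```

In this model, swapping "NFD" for "NFC" changes nothing an application can observe, because only the keys are normalized and both forms put the same pairs of names into the same bucket. Swapping in "NFKC" does change which names collide, which matches the point that one K form and one non-K form are enough.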