Hi,

With reference to my previous mail ([zfs-code] Nvpair for storing EAs in dnode), which expressed our intent to use libnvpair for EAs in the dnode, I am now sending a high-level design document for the same.

Any suggestions/comments are most welcome.

Thanks & Regards,
Girish

-------------- next part --------------
A non-text attachment was scrubbed...
Name: EAs-in-Dnode-HLD.pdf
Type: application/pdf
Size: 41285 bytes
URL: <http://mail.opensolaris.org/pipermail/zfs-code/attachments/20080207/0b329848/attachment.pdf>
Girish Shilamkar wrote:
> Hi,
> With reference to my previous mail ([zfs-code] Nvpair for storing EAs
> in dnode) which had expressed our intent to use libnvpair for EAs in
> dnode. I am now sending a high level design document for the same.
>
> Any suggestions/comments are most welcome.
>
> Thanks & Regards,
> Girish

General comment:

I'm not sure we should be using nvlists to store these attributes. While it will be quick to implement, I'm concerned about the constant packing/unpacking and searching linked lists in the nvlist code.

As part of this effort I would like to refactor the znode_phys to only include what I believe to be the core attributes that must be in the znode_phys. Then all of the secondary attributes that aren't needed all of the time can use the new EA mechanism that Lustre and others desire. We will need to rev the ZPL version to 4. By refactoring the znode we will recover a lot of the bonus space that is in short supply today.

Section 2.2.4

Not sure what you mean by storing all xattrs in a single object. Are you saying all of the Lustre attributes would be stored in a single Solaris extended attribute file?

Section 2.2.5

This section is not needed. Solaris already has interfaces and frameworks in place to expose additional attributes to applications. Any attribute that Lustre wants to retrieve/set can be accessed with VOP_GETATTR()/VOP_SETATTR() via the xvattr_t. From an application those attributes can be manipulated with getattrat()/setattrat().

-Mark
On Thu, 2008-02-07 at 09:44 -0700, Mark Shellenbaum wrote:
> I'm not sure we should be using nvlists to store these attributes.
> While it will be quick to implement, I'm concerned about the constant
> packing/unpacking and searching linked lists in the nvlist code.

That is also one of the things I was concerned about.

What do you think about providing an alternate interface in the nvlist code that would do an nvlist_lookup() without having to unpack the nvlist (i.e., by searching for the name directly in the encoded format)? The number of attributes on a file should be very limited, so I don't think a linear search would be a problem.

We could also provide a similar interface to add/update an attribute that would manipulate the encoded format directly (the XDR format looks fairly simple, I think).

> As part of this effort I would like to refactor the znode_phys to only
> include what I believe to be the core attributes that must be in the
> znode_phys. Then all of the secondary attributes that aren't needed all
> of the time can use the new EA mechanism that Lustre and others desire.
> We will need to rev the ZPL version to 4. By refactoring the znode we
> will recover a lot of the bonus space that is in short supply today.

That sounds reasonable.

> Section 2.2.4
>
> Not sure what you mean by storing all xattrs in a single object. Are you
> saying all of the Lustre attributes would be stored in a single Solaris
> extended attribute file?

Yes. The benefit of this approach is that it would greatly reduce the number of seeks (which are always in short supply) in a Lustre metadata server in the case where the xattrs won't fit in the dnode.

> Section 2.2.5
>
> This section is not needed. Solaris already has interfaces and
> frameworks in place to expose additional attributes to applications.
> Any attribute that Lustre wants to retrieve/set can be accessed with
> VOP_GETATTR()/VOP_SETATTR() via the xvattr_t. From an application those
> attributes can be manipulated with getattrat()/setattrat().

Are you referring to section 2.2.6 as opposed to 2.2.5?

In Lustre, we are not using (nor do we have any desire to use) any Solaris VFS code, so we can't use VOP_*() calls, vnode_t and/or znode_t structures (although we do manipulate znode_phys_t buffers), ACL code, etc.

Although we don't compile the majority of the ZPL code due to VFS dependencies, we do compile zfs_znode.c in userspace (without the _KERNEL definition).

As you may understand, it would be very convenient for us if the core of this functionality were free of VFS dependencies, so that we could just implement a thin wrapper in our Lustre code. However, since this probably belongs more in the ZPL than in the DMU, it could be implemented in zfs_znode.c.

Thanks!

--
Ricardo Manuel Correia
Lustre Engineering
Sun Microsystems, Inc., Portugal
On Thu, 2008-02-07 at 17:14 +0000, Ricardo M. Correia wrote:
> > Not sure what you mean by storing all xattrs in a single object. Are you
> > saying all of the Lustre attributes would be stored in a single Solaris
> > extended attribute file?
>
> Yes. The benefit of this approach is that it would greatly reduce the
> number of seeks (which are always in short supply) in a Lustre metadata
> server in the case where the xattrs won't fit in the dnode.

Just to clarify: the intention of this approach is to use a single object to store all attributes, so that we wouldn't have the overhead of a ZAP lookup (in the xattr directory ZAP), nor would we have the overhead of opening several objects to read several attributes.

Thanks,
Ricardo

--
Ricardo Manuel Correia
Lustre Engineering
Sun Microsystems, Inc., Portugal
Ricardo M. Correia wrote:
> What do you think about providing an alternate interface in the nvlist
> code that would do an nvlist_lookup() without having to unpack the
> nvlist (i.e., by searching for the name directly in the encoded format)?
> The number of attributes on a file should be very limited, so I don't
> think a linear search would be a problem.

Ok, but we don't know how many attributes future consumers may want to store. I hope it will be small, but I wouldn't want to paint ourselves into a corner and then wish we had done it a different way.

> We could also provide a similar interface to add/update an attribute
> that would manipulate the encoded format directly (the XDR format looks
> fairly simple, I think).

I think that would be complicated. You could potentially have a packed buffer that needs to be expanded or compressed in size depending on how the attribute is changed/added.

> Are you referring to section 2.2.6 as opposed to 2.2.5?

The document that was posted by Girish didn't have a section 2.2.6.

> In Lustre, we are not using (nor do we have any desire to use) any Solaris
> VFS code, so we can't use VOP_*() calls, vnode_t and/or znode_t
> structures (although we do manipulate znode_phys_t buffers), ACL code, etc.

I thought you were going to be running on Solaris? Things would be much simpler if you were in the kernel. ;-)

> Although we don't compile the majority of the ZPL code due to VFS
> dependencies, we do compile zfs_znode.c in userspace (without the
> _KERNEL definition).

Are you referring to the linux/fuse implementation? If so, then I would probably suggest we add some ioctl interfaces to retrieve the attributes. There is no need for Solaris to add new VOP interfaces for these.
On Thu, 2008-02-07 at 10:29 -0700, Mark Shellenbaum wrote:
> Ok, but we don't know how many attributes future consumers may want to
> store. I hope it will be small, but I wouldn't want to paint ourselves
> into a corner and then wish we had done it a different way.

That would be a matter of falling back to the current mechanism if we reach a sufficiently large number of attributes (>100?). We are simply interested in optimizing for the large majority of cases.

> > We could also provide a similar interface to add/update an attribute
> > that would manipulate the encoded format directly (the XDR format looks
> > fairly simple, I think).
>
> I think that would be complicated. You could potentially have a
> packed buffer that needs to be expanded or compressed in size depending
> on how the attribute is changed/added.

Yes. So we have 4 cases:

1) Adding an attribute: we would just add it at the end (very simple).
2) Changing an attribute without changing its size: it could be changed in place.
3) Changing an attribute and changing its size: we could use memmove() or similar to move the attributes after the one we're changing to the correct place.
4) Deleting an attribute: we would also use memmove() to move the following attributes to the correct place.

We'd have to do a size check before doing an operation to see if the result would still fit in the provided buffer (which would be the bonus buffer).

Whatever format we use, I think the important thing is to have as little space overhead as possible in order to fit in the dnode. So I think a packed format is probably desirable, even if it means it would not be 100% efficient CPU-wise in some cases (cases 3 and 4).

> > Although we don't compile the majority of the ZPL code due to VFS
> > dependencies, we do compile zfs_znode.c in userspace (without the
> > _KERNEL definition).
>
> Are you referring to the linux/fuse implementation? If so, then I would
> probably suggest we add some ioctl interfaces to retrieve the
> attributes.

I'm referring to the Lustre implementation (which has many things in common with Linux/FUSE due to being implemented in userspace). We cannot use ioctl()s to retrieve attributes because the DMU will be running completely in userspace.

In Solaris (well, in all Lustre-supported OSes), we export the pool from the native (kernel) implementation and import it into our userspace process, which runs libzpool in a similar way to ztest. This means that the native kernel implementation *cannot* have access to the ZFS pool while Lustre is running.

> there is no need for Solaris to add new VOP interfaces for
> these.

Great, we don't need them either :)

However, I think we do not have enough experience to fix the existing Solaris VFS interface to use the new mechanism. Well, we could, but it would take us much, much longer than if it were you guys :)

Thanks,
Ricardo

--
Ricardo Manuel Correia
Lustre Engineering
Sun Microsystems, Inc., Portugal
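A rough sketch of what cases 2-4 above could look like on a generic packed buffer. This is deliberately not tied to the XDR nvpair encoding; the offset and length of the entry being touched are assumed to come from whatever lookup walks the encoded list, and the function name and arguments are illustrative only:

    #include <string.h>
    #include <sys/types.h>

    /*
     * Resize or delete one entry that lives at 'off' with length 'oldlen'
     * inside a packed buffer, of which 'used' bytes are valid and 'bufsize'
     * is the bonus buffer capacity.  newlen == 0 deletes the entry (case 4),
     * newlen == oldlen overwrites it in place (case 2), anything else shifts
     * the tail with memmove() (case 3).  Returns the new 'used' size, or -1
     * if the result would no longer fit (caller falls back to the current
     * xattr mechanism).
     */
    static ssize_t
    packed_entry_resize(char *buf, size_t bufsize, size_t used,
        size_t off, size_t oldlen, const void *newval, size_t newlen)
    {
            size_t tail = used - (off + oldlen);    /* bytes after the entry */

            if (used - oldlen + newlen > bufsize)
                    return (-1);

            if (newlen != oldlen)
                    memmove(buf + off + newlen, buf + off + oldlen, tail);
            if (newlen != 0)
                    memcpy(buf + off, newval, newlen);

            return (used - oldlen + newlen);
    }

Case 1 (appending a new attribute) is the same call with off == used and oldlen == 0.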
Mark Shellenbaum wrote:
> General comment:
>
> I'm not sure we should be using nvlists to store these attributes.
> While it will be quick to implement, I'm concerned about the constant
> packing/unpacking and searching linked lists in the nvlist code.

Agreed. Here is one proposal for how this could be done efficiently and scalably (both in space and runtime):

First, note that the problem is quite constrained:

There are a fixed number of attribute names (for any given software version) since these are interpreted system attributes. Therefore the attribute names can be a small number rather than a string -- ie, we have an enum that defines the attributes. We (ie, OpenSolaris) can control the allocation of new enum values in this namespace.

Many attributes will always have a fixed size (eg, scanstamp, finder info).

Removing attributes or changing their sizes will be extremely rare, so we can make this code path slow. For example, it would not be prohibitive to repack & rewrite all attrs.

Here is an idea for an on-disk format which takes advantage of these constraints to achieve good performance:

(sa == System Attribute)

    enum sa_type {
            ST_ATIME,
            ST_ACL,
            ST_FINDERINFO,
            ST_SCANSTAMP,
            ST_LUSTRE_LAYOUT,
            ...
            ST_NUMTYPES
    };

    const struct sa_info {
            uint8_t sai_intlen;
    } sainfo[ST_NUMTYPES] = {
            8, 2, 8, 8, 1, ...
    };

on disk:

    /* header */
    struct sa_phys {
            uint16_t sa_numattrs;
            struct {
                    uint16_t sa_type;       /* enum sa_type */
                    uint16_t sa_length;     /* in sainfo[sa_type] chunks */
            } sa_attr[];
    };

followed by the data in the order specified by the header. 8-byte alignment will be enforced for all attribute starting offsets.

Note that the software (not the on-disk layout) defines the set of possible attributes and their byte sizes (ie, whether the value is a string vs. a series of 8-byte integers, etc). Alternatively, if we are confident that we won't need anything but 8-byte integers (or arrays thereof), we can simplify the design further.

When we first access a SA block, we read in the header and construct an array indexed by enum sa_type that tells us the position and length of each attribute. That way, access and modification will be blazing fast[*]. If the size of an attribute is changed, we rewrite & repack everything, so there is never any unused space (aside from the padding necessary to enforce 8-byte alignment of each SA; it is assumed that attribute size changes will be "rare enough" that this performance is acceptable).

Thus the space overhead would be only 4 bytes per attribute (compared to a static struct layout), and runtime overhead would be minimal. So I think we could justify using it for much of the ZPL metadata. In fact, we could make this SA stuff be a DMU service, and remove support for bonus buffers entirely. The set of ZPL metadata that is always needed could be in one SA, and other pieces elsewhere. The DMU would use what was the bonus buffer, plus an additional BP only if necessary, to implement the SA. Ie, if the space needed for the SAs was <= 320 bytes (the current bonus size limit), then it would all be in the dnode. Otherwise, the first 192 (320 - sizeof(blkptr_t)) bytes would be in the dnode, and the rest in a block pointed to by a BP stored in the dnode. In addition, we could expand the dnode to 1k so that more SAs could be stored there.

--matt

[*] For example:

    struct sa_layout {
            struct {
                    int sal_offset;         /* in bytes */
                    int sal_length;         /* in sainfo[sa_type] chunks */
            } sal_ary[ST_NUMTYPES];
    };

sa_init would construct an sa_layout given the on-disk sa_phys.

    void
    sa_read(struct sa_layout *sal, void *sa_phys, enum sa_type sat,
        void *buf, int buflen)
    {
            int len;

            ASSERT(sat < ST_NUMTYPES);
            len = sal->sal_ary[sat].sal_length * sainfo[sat].sai_intlen;
            ASSERT(buflen >= len);
            /* Assumes that it was already byteswapped when read off disk */
            bcopy((char *)sa_phys + sal->sal_ary[sat].sal_offset, buf, len);
    }

sa_write would be similar, unless the sal_length was different from what we are trying to write, in which case we would rewrite the whole thing to insert the new (or resized) attribute. The software would define a priority order for the SAs, which defines the order in which to lay them out on disk (and which ones spill out of the dnode into the accessory block first).
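To make the [*] example concrete, here is a hedged sketch of the sa_write() fast path described above, reusing the sa_layout, sainfo and enum sa_type declarations from the example; sa_repack() is a hypothetical name for the rewrite-and-repack slow path:

    /*
     * Fast path: if the length matches the existing layout, overwrite the
     * value in place; otherwise fall back to a full repack (not shown).
     */
    void
    sa_write(struct sa_layout *sal, void *sa_phys, enum sa_type sat,
        const void *buf, int buflen)
    {
            int len;

            ASSERT(sat < ST_NUMTYPES);
            len = sal->sal_ary[sat].sal_length * sainfo[sat].sai_intlen;

            if (len == buflen) {
                    /* same size: overwrite the value in place */
                    bcopy(buf, (char *)sa_phys + sal->sal_ary[sat].sal_offset, len);
            } else {
                    /* size changed (rare): rewrite and repack all attributes */
                    /* sa_repack(sal, sa_phys, sat, buf, buflen);  -- hypothetical */
            }
    }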
Matthew Ahrens wrote:
> Agreed. Here is one proposal for how this could be done efficiently and
> scalably (both in space and runtime):
> [...]
> on disk:
>
>     /* header */
>     struct sa_phys {
>             uint16_t sa_numattrs;
>             struct {
>                     uint16_t sa_type;       /* enum sa_type */
>                     uint16_t sa_length;     /* in sainfo[sa_type] chunks */
>             } sa_attr[];
>     };

I like this approach much better. I was thinking about some sort of indexed table, with attributes having a number to identify them rather than a name.

This will also allow us to kick a number of the ZPL's attributes out into SA space, and then we can choose which additional attributes need to be exposed with the Solaris sysattr framework.

-Mark
On Thu, 2008-02-07 at 12:48 -0800, Matthew Ahrens wrote:
> First, note that the problem is quite constrained:
>
> There are a fixed number of attribute names (for any given software version)
> since these are interpreted system attributes. Therefore the attribute names
> can be a small number rather than a string -- ie, we have an enum that
> defines the attributes. We (ie, OpenSolaris) can control the allocation of
> new enum values in this namespace.

Since we are on the subject of redesigning attributes, have you considered providing a lighter-weight extended attribute mechanism to userspace applications such as Beagle (http://beagle-project.org/Enabling_Extended_Attributes), Meta Tracker (http://www.gnome.org/projects/tracker/), Apache (mime types), etc.?

It seems to me that the usefulness of extended attributes is to efficiently store small amounts of information associated with a file, and I think these would make more sense to be in the dnode (possibly making it 1K by default), or in a single external object if they don't fit into the dnode.

There's a de-facto namespace standard for EAs that specifies some existing and proposed attributes that could give you some ideas about this: http://freedesktop.org/wiki/CommonExtendedAttributes

Of course, this would still require storing the EA names in the dnode, but I think the added flexibility of being able to efficiently store arbitrary EAs would be worth it. And the large dnode feature that we are working on makes it rather easy to create filesystems with bigger dnodes.

Thanks,
Ricardo

--
Ricardo Manuel Correia
Lustre Engineering
Sun Microsystems, Inc., Portugal
Ricardo M. Correia wrote:
> Since we are on the subject of redesigning attributes, have you
> considered providing a lighter-weight extended attribute mechanism to
> userspace applications

The ZAP would probably be sufficient to store lighter-weight user extended attributes (directly in the ZAP, not pointing to other files like the current Solaris extended attributes do). The method I proposed for system attributes is not applicable, since user attributes would almost certainly be string -> string mappings.

--matt
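A rough, kernel-context illustration of the string -> string idea above, assuming a hypothetical per-file ZAP object (xattr_obj, e.g. created with zap_create() and recorded in the znode) and using the existing zap_update()/zap_lookup() interfaces; error handling and transaction setup are omitted:

    #include <sys/dmu.h>
    #include <sys/zap.h>
    #include <sys/systm.h>

    /* store the value as an array of 1-byte integers (the string plus NUL) */
    static int
    user_ea_set(objset_t *os, uint64_t xattr_obj, const char *name,
        const char *value, dmu_tx_t *tx)
    {
            return (zap_update(os, xattr_obj, name, 1, strlen(value) + 1,
                value, tx));
    }

    /* fetch the value back into a caller-supplied buffer */
    static int
    user_ea_get(objset_t *os, uint64_t xattr_obj, const char *name,
        char *buf, uint64_t buflen)
    {
            return (zap_lookup(os, xattr_obj, name, 1, buflen, buf));
    }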
Ricardo M. Correia wrote:
> It seems to me that the usefulness of extended attributes is to
> efficiently store small amounts of information associated with a file,
> and I think these would make more sense to be in the dnode (possibly
> making it 1K by default), or in a single external object if they don't
> fit into the dnode.

I believe when a prototype using 1K dnodes was tested it showed an unacceptable (30%?) hit on some benchmarks. So if we can possibly avoid increasing the dnode size (by default) then we should do so.
On Feb 07, 2008 22:51 -0700, Neil Perrin wrote:
> > It seems to me that the usefulness of extended attributes is to
> > efficiently store small amounts of information associated with a file,
> > and I think these would make more sense to be in the dnode (possibly
> > making it 1K by default), or in a single external object if they don't
> > fit into the dnode.
>
> I believe when a prototype using 1K dnodes was tested it showed an
> unacceptable (30%?) hit on some benchmarks. So if we can possibly
> avoid increasing the dnode size (by default) then we should do so.

That is definitely true, but in other uses this can be a major performance improvement for applications like Samba that use a bunch of attributes:

http://lwn.net/Articles/112571/
http://samba.org/~tridge/xattr_results/ext3-tuning.png

The "default" case in the ext3-tuning.png graph is similar to what is in ZFS today (though actually still more lightweight, because the EAs are only 1 hop from the inode instead of 5 like ZFS today).

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
Andreas Dilger wrote:
> That is definitely true, but in other uses this can be a major performance
> improvement for applications like Samba that use a bunch of attributes:
>
> http://lwn.net/Articles/112571/
> http://samba.org/~tridge/xattr_results/ext3-tuning.png
>
> The "default" case in the ext3-tuning.png graph is similar to what is
> in ZFS today (though actually still more lightweight, because the EAs
> are only 1 hop from the inode instead of 5 like ZFS today).

Interesting data. So I hope we can come up with a solution that works well with both a few and a lot of attributes. Maybe Matt's proposal would get us there?

Neil.
On Thu, 2008-02-07 at 22:51 -0700, Neil Perrin wrote:
> I believe when a prototype using 1K dnodes was tested it showed an
> unacceptable (30%?) hit on some benchmarks. So if we can possibly
> avoid increasing the dnode size (by default) then we should do so.

Hmm, interesting. Do you know the reason for such a performance hit?

Even with 1K dnode sizes, if the dnodes didn't have any extended attributes, and since metadata compression is enabled, the on-disk size of metadnode blocks should remain approximately the same, right?

Could it be because the metadnode object became twice the (logical) size and therefore required another level of indirect blocks, which as a consequence required an additional disk seek for each metadnode block read?

It would be interesting to run some benchmarks with Kalpak's large dnode patch.

Cheers,
Ricardo

--
Ricardo Manuel Correia
Lustre Engineering
Sun Microsystems, Inc., Portugal
Ricardo M. Correia wrote:
> Do you know the reason for such a performance hit?
>
> Even with 1K dnode sizes, if the dnodes didn't have any extended
> attributes, and since metadata compression is enabled, the on-disk size
> of metadnode blocks should remain approximately the same, right?
>
> Could it be because the metadnode object became twice the (logical) size
> and therefore required another level of indirect blocks, which as a
> consequence required an additional disk seek for each metadnode block
> read?

Actually, we tried both the same (16k) dnode block size and double (32k) dnode block size, to see if the "sidefetching" caused by reading in a whole block of dnodes was influencing things. Both performed worse than 512-byte dnodes (in 16k blocks).

However, the additional space would not necessarily be zero-filled; it would be used by additional block pointers (if the file size is in certain ranges). So the additional i/o caused by larger blocks could be an issue. Note that the dnode file is always stored with the max levels of indirection, so that could not be the issue.

Certainly more investigation is needed to determine the impact of 1k dnodes on various workloads. But this is a bit of a diversion from the topic at hand -- EAs (or SAs as I call them; "extended attribute" already has a different meaning on Solaris). I haven't heard anyone from Lustre comment on my proposed design...

--matt
On Feb 07, 2008 17:14 +0000, Ricardo Correia wrote:
> On Thu, 2008-02-07 at 09:44 -0700, Mark Shellenbaum wrote:
> > Section 2.2.4
> >
> > Not sure what you mean by storing all xattrs in a single object. Are you
> > saying all of the Lustre attributes would be stored in a single Solaris
> > extended attribute file?
>
> Yes. The benefit of this approach is that it would greatly reduce the
> number of seeks (which are always in short supply) in a Lustre metadata
> server in the case where the xattrs won't fit in the dnode.

To clarify, this is a desirable approach for the "spill over" case, where the small attributes don't fit into the dnode bonus buffer. The current situation is that every xattr needs a ZAP entry + dnode + block, no matter what size it is. The "EA in dnode" change moves these attributes into the dnode, but the space there is limited.

The proposal previously suggested was to have a block pointer to hold the spilled-over EAs, but a block pointer is quite large and would consume a noticeable fraction of the bonus buffer, and it either limits the EAs to some larger size (1 block) or requires having a tree at that level.

The compromise that we propose is to store the spilled-over EAs in a single dnode, which has the benefit of being essentially unlimited in size, being efficiently addressable (by objid), and amortizing the lookup over multiple attributes.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
Andreas Dilger wrote:
> To clarify, this is a desirable approach for the "spill over" case, where
> the small attributes don't fit into the dnode bonus buffer. The current
> situation is that every xattr needs a ZAP entry + dnode + block, no matter
> what size it is. The "EA in dnode" change moves these attributes into
> the dnode

Not as I read it. Solaris extended attributes will continue to be stored as they are today -- as separate files in an xattr directory hung off the main file.

The proposal is for some new system-interpreted attributes to be stored in the dnode. This is a different namespace with different requirements and different semantics. You don't need to set an ACL on your Lustre metadata, do you?

--matt
On Feb 08, 2008 18:09 +0000, Ricardo Correia wrote:
> On Thu, 2008-02-07 at 22:51 -0700, Neil Perrin wrote:
> > I believe when a prototype using 1K dnodes was tested it showed an
> > unacceptable (30%?) hit on some benchmarks. So if we can possibly
> > avoid increasing the dnode size (by default) then we should do so.
>
> Do you know the reason for such a performance hit?
>
> Even with 1K dnode sizes, if the dnodes didn't have any extended
> attributes, and since metadata compression is enabled, the on-disk size
> of metadnode blocks should remain approximately the same, right?

One of the default behaviours of the DMU is that with a larger dnode, more blkptr_t items are added to consume the space left over after the bonus buffer. Unless the bonus buffer size was also increased in this test, or the number of blkptrs in the dnode was limited (as we have done with the large dnode patch), then any file data will cause the block pointers to be stored directly in the dnode instead of in an indirect block. For some workloads (small file read/write access) this would be an improvement, but for workloads which look a lot at the [dz]node attributes (e.g. find, ls -l) it would be worse.

> Could it be because the metadnode object became twice the (logical) size
> and therefore required another level of indirect blocks, which as a
> consequence required an additional disk seek for each metadnode block
> read?
>
> It would be interesting to run some benchmarks with Kalpak's large dnode
> patch.

Definitely.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
Hi Matthew,

On Thu, 2008-02-07 at 12:48 -0800, Matthew Ahrens wrote:
> on disk:
>
>     /* header */
>     struct sa_phys {
>             uint16_t sa_numattrs;
>             struct {
>                     uint16_t sa_type;       /* enum sa_type */
>                     uint16_t sa_length;     /* in sainfo[sa_type] chunks */
>             } sa_attr[];
>     };
>
> followed by the data in the order specified by the header. 8-byte alignment
> will be enforced for all attribute starting offsets.

This would require repacking when adding an attribute, right? Anyway, maybe that wouldn't be a big problem.

There's also the question of the attribute size limit: from conversations I had with Andreas, I got the feeling that the Lustre layout information could be quite big when a file is striped across all OSTs, and I would imagine this would become an even bigger concern in the future if we intend to continue scaling horizontally.

Anyway, your proposal is interesting, but there's also one thing I would like to add:

Could we have a special integer value that would essentially mean "this is an unknown, name-value type of attribute", which would be used to store additional, perhaps user-specified attributes? In the space of the attribute value, we could store the name of the attribute and the value itself (perhaps with 1 or 2 additional bytes for the name length of the attribute).

These attributes could have lower priority than the other system attributes (they would be stored at the end), and they would be ignored (but not removed) by any implementation that doesn't understand them.

I also think that instead of having an additional block pointer (which is huge) in the dnode for "spillage", we should have something like a "uint64_t dn_spillobj" which would be the object id of a "spillage" object. An object id is much more space-efficient and, like Andreas mentioned, allows for an unlimited number of attributes/attribute sizes.

I also find *extremely* appealing your idea of making this whole thing a DMU service. I think that would make our lives (as in Lustre developers' lives) much, much easier in the long run... the VFS dependencies in the ZPL code have been kind of a pain, to say the least. :)

Thanks,
Ricardo

--
Ricardo Manuel Correia
Lustre Engineering
Sun Microsystems, Inc., Portugal
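A strawman encoding for the "unknown, name-value" attribute idea above: a one-byte name length, followed by the name, followed by the raw value, all stored inside the SA value. ST_NAMEVALUE would be a hypothetical enum sa_type entry, and the helper below is purely illustrative:

    #include <stdint.h>
    #include <string.h>

    /*
     * Pack a name-value pair into the value space of a hypothetical
     * ST_NAMEVALUE system attribute.  Returns the number of bytes written;
     * the caller is responsible for sizing the destination buffer.
     */
    static size_t
    sa_namevalue_pack(char *dst, const char *name, const void *val, size_t vallen)
    {
            uint8_t namelen = (uint8_t)strlen(name);

            dst[0] = (char)namelen;                   /* 1-byte name length */
            memcpy(dst + 1, name, namelen);           /* attribute name */
            memcpy(dst + 1 + namelen, val, vallen);   /* raw value follows */
            return (1 + namelen + vallen);
    }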
Ricardo M. Correia wrote:
> This would require repacking when adding an attribute, right? Anyway,
> maybe that wouldn't be a big problem.
>
> There's also the question of the attribute size limit: from conversations
> I had with Andreas, I got the feeling that the Lustre layout information
> could be quite big when a file is striped across all OSTs, and I would
> imagine this would become an even bigger concern in the future if we
> intend to continue scaling horizontally.

By their very nature, these attributes will have a size limitation. I don't really see the point in allowing a size larger than the supported block size; 64K seems reasonable. I think that very large values could possibly be supported using some form of indirection: storing a block pointer for the value, or storing an object ID for values that span multiple blocks.

> Anyway, your proposal is interesting, but there's also one thing I would
> like to add:
>
> Could we have a special integer value that would essentially mean "this
> is an unknown, name-value type of attribute", which would be used to
> store additional, perhaps user-specified attributes?
> In the space of the attribute value, we could store the name of the
> attribute and the value itself (perhaps with 1 or 2 additional bytes for
> the name length of the attribute).
>
> These attributes could have lower priority than the other system
> attributes (they would be stored at the end), and they would be ignored
> (but not removed) by any implementation that doesn't understand them.

I'm not sure how storing them at the end makes them lower priority. I'm also not sure exactly how these would be used... does Lustre need these? For any file system layer, we would need to add infrastructure to manage and manipulate these objects, as well as new interfaces to be able to access these attributes from outside the kernel. All that aside, there is nothing that would prohibit the definition of this type of "special system attribute" in the current model.

> I also think that instead of having an additional block pointer (which
> is huge) in the dnode for "spillage", we should have something like a
> "uint64_t dn_spillobj" which would be the object id of a "spillage" object.
> An object id is much more space-efficient and, like Andreas mentioned,
> allows for an unlimited number of attributes/attribute sizes.

I don't quite understand this. The whole point of these attributes is to make them fast to access... using an object ID is going to be far more expensive than a block pointer to access. Matt's model can also support unlimited numbers of attributes if we allow the blocks to be chained.
To put the document in better perspective: the extended attributes (EAs), as we (Lustre) see them, are the lighter-weight extended attributes which are user defined, and "user" can be Beagle, Metatracker, Lustre, etc.

Lustre has been using the extended attributes described at http://acl.bestbits.at/about.html, where ldiskfs (ext3 modified for Lustre) is used as the backend. The posted design specifically tries to devise a similar mechanism for ZFS, which even pNFS or MAC OS/X could possibly exploit.

It also implies, as Matt has already mentioned, that the proposal made by him for system attributes is not applicable. As we do not intend to extend the current optional attributes, i.e. struct xoptattr, the current APIs VOP_GETATTR/VOP_SETATTR won't serve the purpose; hence the proposal of new APIs in section 2.2.5 of the design doc. Use of the existing VOP_GETATTR/VOP_SETATTR APIs would be possible if we somehow integrated the (user-defined) extended attributes with the existing xoptattr_t.

On Fri, 2008-02-08 at 11:48 -0800, Matthew Ahrens wrote:
> Not as I read it. Solaris extended attributes will continue to be stored as
> they are today -- as separate files in an xattr directory hung off the main file.

We have been discussing moving only uint8_t xoa_av_scanstamp[AV_SCANSTAMP_SZ] (at least as of now) into the nvlist which holds EAs in the dnode, as it currently uses the first 32 bytes just after the znode_phys.

P.S. Couldn't reply earlier as I was down with a high temperature and had been coughing my lungs out for the past few days.

-Girish
On Feb 09, 2008 08:10 -0700, Mark Maybee wrote:
>> There's also the question of the attribute size limit: from conversations
>> I had with Andreas, I got the feeling that the Lustre layout information
>> could be quite big when a file is striped across all OSTs, and I would
>> imagine this would become an even bigger concern in the future if we
>> intend to continue scaling horizontally.
>
> By their very nature, these attributes will have a size limitation. I
> don't really see the point in allowing a size larger than the supported
> block size; 64K seems reasonable. I think that very large values could
> possibly be supported using some form of indirection: storing a block
> pointer for the value, or storing an object ID for values that span
> multiple blocks.

For the most common case in Lustre, the striping attribute will be relatively small (in the range of 80 - 128 bytes). In other cases (less common, but still present) the current ext3 size limit of 4096 bytes is already a limiting factor on the striping of a file - directly affecting the total bandwidth that can be allocated to a single file. It is reasonable to have Lustre striping attributes up to the 16kB - 24kB range, at which point we will have a different (more efficient) mechanism for storing large stripes, but that needs some upcoming infrastructure first. There is very little use in the middle range.

It seems possible that we may need to have two separate mechanisms for storing the small attributes and storing the large ones. The small Lustre SAs will be stored in the dnode, and the large ones in the existing xattr mechanism. Given that some applications (ZFS/OSX/pNFS) need to be able to fall back to looking in an xattr for the data they need for compatibility, this isn't any extra overhead.

>> Anyway, your proposal is interesting, but there's also one thing I would
>> like to add:
>>
>> Could we have a special integer value that would essentially mean "this is
>> an unknown, name-value type of attribute", which would be used to store
>> additional, perhaps user-specified attributes?
>> In the space of the attribute value, we could store the name of the
>> attribute and the value itself (perhaps with 1 or 2 additional bytes for
>> the name length of the attribute).

To be clear - as yet Lustre has a fairly limited set of "system attributes" that are needed for high-performance operation. There is the ability to store "user attributes" on a file, and while good performance is desirable, this is not a widely-used feature and falls into the "nice to have" category. In ext3 there is no separation of system attributes and user attributes, so they all benefit from the [di]node-local storage optimization.

>> I also think that instead of having an additional block pointer (which is
>> huge) in the dnode for "spillage", we should have something like a
>> "uint64_t dn_spillobj" which would be the object id of a "spillage" object.
>> An object id is much more space-efficient and, like Andreas mentioned,
>> allows for an unlimited number of attributes/attribute sizes.
>
> I don't quite understand this. The whole point of these attributes is
> to make them fast to access... using an object ID is going to be far
> more expensive than a block pointer to access. Matt's model can also
> support unlimited numbers of attributes if we allow the blocks to be
> chained.

While I agree with your point, I think part of the issue is that as soon as we store a blkptr_t in the dnode, it will consume a significant chunk of the SA space and push attributes out of the dnode. The other tradeoff is one of complexity. You know the code better than I, but it seems cleaner to have a dnode reference as a container for a bunch of SA blocks, rather than having another block tree attached to the same dnode.

That said, if you don't think there is a lot of added complexity to having chained blocks for the SAs, I'll defer to your experience.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
On Feb 11, 2008, at 8:42 PM, Andreas Dilger wrote:
> For the most common case in Lustre, the striping attribute will be
> relatively small (in the range of 80 - 128 bytes). In other cases
> (less common, but still present) the current ext3 size limit of 4096
> bytes is already a limiting factor on the striping of a file - directly
> affecting the total bandwidth that can be allocated to a single file.
> It is reasonable to have Lustre striping attributes up to the 16kB - 24kB
> range, at which point we will have a different (more efficient) mechanism
> for storing large stripes, but that needs some upcoming infrastructure first.
> There is very little use in the middle range.

Another data point: I would expect the pNFS striping information to generally be in the 128 byte range, with some growth to 256 bytes in the common case.

Spencer
On Thu, 2008-02-07 at 12:48 -0800, Matthew Ahrens wrote:
> Agreed. Here is one proposal for how this could be done efficiently and
> scalably (both in space and runtime):
>
> First, note that the problem is quite constrained:
>
> There are a fixed number of attribute names (for any given software version)
> since these are interpreted system attributes. Therefore the attribute names
> can be a small number rather than a string -- ie, we have an enum that
> defines the attributes. We (ie, OpenSolaris) can control the allocation of
> new enum values in this namespace.

We can have a 32-bit type and a 32-bit subtype, so that the ZFS team will only need to assign the 32-bit type and the subtype can be used by Lustre or OS/X without requiring ZFS team approval. For Lustre the subtype will be a 3-4 char EA name, but it can be used differently also. The kernel would not care about the SA names, but the name can be expanded to com.apple.finder or com.lustre.lov, etc. if it needs to be printed.

> Many attributes will always have a fixed size (eg, scanstamp, finder info).

I think we can let sai_intlen == 0 indicate that sa_length is the real length of the SA value (and not in sainfo[sa_type] chunks) to allow arbitrary-sized values.

So here is a slightly modified on-disk format from what you proposed:

    struct sa_phys {
            struct sa_header {
                    uint32_t magic;
                    uint16_t sa_numattrs;
                    uint16_t pad;
            } sa_hdr;
            struct {
                    uint32_t sa_type;
                    uint32_t sa_subtype;
                    uint32_t sa_value_offset;
                    uint32_t sa_value_size;
            } sa_entry[];
    };

The sa_value_offset will point to the address of the value in the bonus buffer of the dnode or in the external block. 8-byte alignment will require padding of entries and values.

sa_type + sa_subtype == 0 would terminate the list, and as an additional check we could store 0x00000000 to indicate the end of the sa_entry's. So during modification or removal of entries we only need to repack the ones at the end.

Thanks,
Kalpak.
That way, access and modification will be blazing > fast[*]. If the size of an attribute is changed, we rewrite & repack > everything, so there is never any unused space. (Aside from the padding > necessary to enforce 8-byte alignment of each SA. It is assumed that > attribute size changing will be "rare enough" that this performance is > acceptable.) > > Thus the space overhead would be only 4 bytes per attribute (compared to > a static struct layout), and runtime overhead would be minimal. So I think > we could justify using it for much of the ZPL metadata. In fact, we could > make this SA stuff be a DMU service, and remove support for bonus > buffers entirely. The set of ZPL metadata that is always needed could > be in one sa, and other pieces elsewhere. The DMU would use what was > the bonus buffer, plus an additional BP only if necessary, to implement > the SA. Ie, if the space needed for the SA''s was <= 320 bytes (the > current bonus size limit), then it would all be in the dnode. > Otherwise, the first 192 (320-sizeof(blkptr_t)) would be in the dnode, > and the rest in a block pointed to by a BP stored in the dnode. In addition, > we could expand the dnode to 1k so that more SA''s could be stored there. > > --matt > > [*] For example: > > struct sa_layout { > struct { > int sal_offset; /* in bytes */ > int sal_length; /* in sainfo[sa_type] chunks */ > } sal_ary[ST_NUMTYPES]; > }; > > sa_init would construct an sa_layout given the on-disk sa_phys. > > void sa_read(struct sa_layout *sal, void *sa_phys, enum sa_type sat, > void *buf, int buflen) > { > int len; > > ASSERT(sat < ST_NUMTYPES); > len = sal->sal_ary[sat].sal_length * sainfo[sat].sai_intlen > ASSERT(buflen >= len); > /* Assumes that it was already byteswapped when read off disk */ > bcopy((char*)sa_phys + sal->sal_ary[sat].sa_offset, buf, len); > } > > sa_write would be similar, unless the sal_length was different from what we > are trying to write, in which case we would rewrite the whole thing to insert > the new (or resized) attribute. The software would define a priority order > for the SA''s, which defines what order to lay them out on disk (and which > spill out of the dnode into the accessory block first). > _______________________________________________ > zfs-code mailing list > zfs-code at opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-code
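As a concreteness check on the entry-table format proposed above, here is a
rough sketch of how a lookup could walk Kalpak's sa_entry[] array.  The
function name sa_find_value() and the exact terminator handling are
illustrative assumptions, not an agreed interface.

#include <stdint.h>
#include <stddef.h>

/* Header and entry shapes as proposed above (sketch only). */
typedef struct sa_header {
        uint32_t magic;
        uint16_t sa_numattrs;
        uint16_t pad;
} sa_header_t;

typedef struct sa_entry {
        uint32_t sa_type;
        uint32_t sa_subtype;
        uint32_t sa_value_offset;       /* from start of bonus/external buf */
        uint32_t sa_value_size;         /* in bytes */
} sa_entry_t;

/*
 * Scan the entry table for (type, subtype); return a pointer to the value
 * inside the buffer and its size, or NULL if not found.  The scan stops
 * after sa_numattrs entries or at an all-zero terminator entry.
 */
static const void *
sa_find_value(const void *buf, uint32_t type, uint32_t subtype,
    uint32_t *sizep)
{
        const sa_header_t *hdr = buf;
        const sa_entry_t *ent = (const sa_entry_t *)(hdr + 1);
        uint16_t i;

        for (i = 0; i < hdr->sa_numattrs; i++, ent++) {
                if (ent->sa_type == 0 && ent->sa_subtype == 0)
                        break;          /* end-of-list marker */
                if (ent->sa_type == type && ent->sa_subtype == subtype) {
                        *sizep = ent->sa_value_size;
                        return ((const char *)buf + ent->sa_value_offset);
                }
        }
        return (NULL);
}

An update path could reuse the same scan and, as described above, repack only
the trailing entries when an attribute is removed or resized.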
On Feb 24, 2008 17:20 +0530, Kalpak Shah wrote:
> On Thu, 2008-02-07 at 12:48 -0800, Matthew Ahrens wrote:
> > [...]
> > There are a fixed number of attribute names (for any given software version)
> > since these are interpreted system attributes.  Therefore the attribute names
> > can be a small number, rather than a string -- ie, we have an enum that
> > defines the attributes.  We (ie, OpenSolaris) can control the allocation of
> > new enum values in this namespace.
>
> We can have a 32-bit type and a 32-bit subtype so that the ZFS team will
> only need to assign the 32-bit type and the subtype can be used by
> Lustre or OS/X without requiring ZFS team approval.  For Lustre the
> subtype will be a 3-4 char EA name but it can be used differently also.

What else I was thinking would be very useful is to allow setting a
rank/priority for each SA.  This would give the DMU code the ability to
prioritize which SAs are stored at the beginning of the list (both to be
quickly accessible without much scanning, and to ensure they remain in
the dnode in case of overflow).  In particular, to avoid serious
performance degradation, any znode data that is moved to an SA, and any
SA needed to point to an external block, should be kept in the dnode
itself.

I'm undecided whether the ranking should be the "sa_type" (which might
put some assigned types at a disadvantage of having "high priority" SAs,
but ensures that critical SAs can never be pushed out of the dnode) or
the "sa_subtype".

> [...]
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
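As a rough illustration of the rank/priority idea above, the packer could sort
the pending attributes by a per-type rank before laying them out, so the
highest-priority SAs (and the entry pointing at any external block) are placed
first and therefore stay in the dnode if the total overflows.  The type codes,
rank table, and function names below are hypothetical, not a proposed
assignment.

#include <stdint.h>
#include <stdlib.h>

/* Hypothetical type codes, for illustration only. */
enum {
        SAT_ZNODE_CORE = 0,     /* former znode_phys data: times, sizes, ... */
        SAT_ACL = 1,
        SAT_SPILL_PTR = 2       /* entry pointing at the external SA block */
};

typedef struct sa_pending {
        uint32_t sa_type;
        uint32_t sa_subtype;
        uint32_t sa_size;
        const void *sa_value;
} sa_pending_t;

/* Lower rank == higher priority == laid out earlier (stays in the dnode). */
static int
sa_rank(uint32_t type)
{
        switch (type) {
        case SAT_ZNODE_CORE:
        case SAT_SPILL_PTR:
                return (0);     /* must never be pushed out of the dnode */
        case SAT_ACL:
                return (1);
        default:
                return (10);    /* may spill to the external block */
        }
}

static int
sa_cmp_rank(const void *a, const void *b)
{
        const sa_pending_t *sa = a;
        const sa_pending_t *sb = b;

        return (sa_rank(sa->sa_type) - sa_rank(sb->sa_type));
}

/*
 * Order the attributes before packing; a packer (not shown) would then fill
 * the bonus buffer front to back and divert whatever remains to the spill
 * block, so only low-priority SAs overflow.
 */
static void
sa_order_for_packing(sa_pending_t *attrs, size_t count)
{
        qsort(attrs, count, sizeof (sa_pending_t), sa_cmp_rank);
}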