Hi,

With reference to my previous mail ([zfs-code] Nvpair for storing EAs in dnode), which expressed our intent to use libnvpair for EAs in the dnode, I am now sending a high-level design document for the same.

Any suggestions/comments are most welcome.

Thanks & Regards,
Girish

-------------- next part --------------
A non-text attachment was scrubbed...
Name: EAs-in-Dnode-HLD.pdf
Type: application/pdf
Size: 41285 bytes
URL: <http://mail.opensolaris.org/pipermail/zfs-code/attachments/20080207/0b329848/attachment.pdf>
Girish Shilamkar wrote:
> Hi,
> With reference to my previous mail ([zfs-code] Nvpair for storing EAs
> in dnode) which had expressed our intent to use libnvpair for EAs in
> dnode. I am now sending a high level design document for the same.
>
> Any suggestions/comments are most welcome.
>
> Thanks & Regards,
> Girish

General comment:

I'm not sure we should be using nvlists to store these attributes. While it will be quick to implement, I'm concerned about the constant packing/unpacking and searching linked lists in the nvlist code.

As part of this effort I would like to refactor the znode_phys to only include what I believe to be the core attributes that must be in the znode_phys. Then all of the secondary attributes that aren't needed all of the time can use the new EA mechanism that Lustre and others desire. We will need to rev the ZPL version to 4. By refactoring the znode we will recover a lot of the bonus space that is in short supply today.

Section 2.2.4

Not sure what you mean by storing all xattrs in a single object. Are you saying all of the Lustre attributes would be stored in a single Solaris extended attribute file?

Section 2.2.5

This section is not needed. Solaris already has interfaces and frameworks in place to expose additional attributes to applications. Any attribute that Lustre wants to retrieve/set can be accessed with VOP_GETATTR()/VOP_SETATTR() via the xvattr_t. From an application those attributes can be manipulated with getattrat()/setattrat().

-Mark
On Thu, 2008-02-07 at 09:44 -0700, Mark Shellenbaum wrote:
> I'm not sure we should be using nvlists to store these attributes.
> While it will be quick to implement, I'm concerned about the constant
> packing/unpacking and searching linked lists in the nvlist code.

That is also one of the things I was concerned about.

What do you think about providing an alternate interface in the nvlist code that would do an nvlist_lookup() without having to unpack the nvlist (i.e., by searching for the name directly in the encoded format)? The number of attributes on a file should be very limited, so I don't think a linear search would be a problem.

We could also provide a similar interface to add/update an attribute that would manipulate the encoded format directly (the XDR format looks fairly simple, I think).

> As part of this effort I would like to refactor the znode_phys to only
> include what I believe to be the core attributes that must be in the
> znode_phys. Then all of the secondary attributes that aren't needed all
> of the time can use the new EA mechanism that Lustre and others desire.
> We will need to rev the ZPL version to 4. By refactoring the znode we
> will recover a lot of the bonus space that is in short supply today.

That sounds reasonable.

> Section 2.2.4
>
> Not sure what you mean by storing all xattrs in a single object. Are you
> saying all of the Lustre attributes would be stored in a single Solaris
> extended attribute file?

Yes. The benefit of this approach is that it would greatly reduce the number of seeks (which are always in short supply) in a Lustre metadata server in the case where the xattrs won't fit in the dnode.

> Section 2.2.5
>
> This section is not needed. Solaris already has interfaces and
> frameworks in place to expose additional attributes to applications.
> Any attribute that Lustre wants to retrieve/set can be accessed with
> VOP_GETATTR()/VOP_SETATTR() via the xvattr_t. From an application those
> attributes can be manipulated with getattrat()/setattrat().

Are you referring to section 2.2.6 as opposed to 2.2.5?

In Lustre, we are not using (nor do we have any desire to use) any Solaris VFS code, so we can't use VOP_*() calls, vnode_t and/or znode_t structures (although we do manipulate znode_phys_t buffers), ACL code, etc.

Although we don't compile the majority of the ZPL code due to VFS dependencies, we do compile zfs_znode.c in userspace (without the _KERNEL definition).

As you may understand, it would be very convenient for us if the core of this functionality were free of VFS dependencies, so that we could just implement a thin wrapper in our Lustre code. However, since this probably belongs more in the ZPL than in the DMU, it could be implemented in zfs_znode.c.

Thanks!

--
Ricardo Manuel Correia
Lustre Engineering
Sun Microsystems, Inc., Portugal
On Thu, 2008-02-07 at 17:14 +0000, Ricardo M. Correia wrote:
> > Not sure what you mean by storing all xattrs in a single object. Are you
> > saying all of the Lustre attributes would be stored in a single Solaris
> > extended attribute file?
>
> Yes. The benefit of this approach is that it would greatly reduce the
> number of seeks (which are always in short supply) in a Lustre metadata
> server in the case where the xattrs won't fit in the dnode.

Just to clarify: the intention of this approach is to use a single object to store all attributes, so that we wouldn't have the overhead of a ZAP lookup (in the xattr directory ZAP), nor would we have the overhead of opening several objects to read several attributes.

Thanks,
Ricardo

--
Ricardo Manuel Correia
Lustre Engineering
Sun Microsystems, Inc., Portugal
Ricardo M. Correia wrote:
> What do you think about providing an alternate interface in the nvlist
> code that would do an nvlist_lookup() without having to unpack the
> nvlist (i.e., by searching for the name directly in the encoded format)?
> The number of attributes on a file should be very limited, so I don't
> think a linear search would be a problem.

Ok, but we don't know how many attributes future consumers may want to store. I hope it will be small, but I wouldn't want to paint ourselves into a corner and then wish we had done it a different way.

> We could also provide a similar interface to add/update an attribute
> that would manipulate the encoded format directly (the XDR format looks
> fairly simple, I think).

I think that would be complicated. You could potentially have a packed buffer that needs to be expanded or compressed in size depending on how the attribute is changed/added.

> Are you referring to section 2.2.6 as opposed to 2.2.5?

The document that was posted by Girish didn't have a section 2.2.6.

> In Lustre, we are not using (nor do we have any desire to use) any Solaris
> VFS code, so we can't use VOP_*() calls, vnode_t and/or znode_t
> structures (although we do manipulate znode_phys_t buffers), ACL code, etc.

I thought you were going to be running on Solaris? Things would be much simpler if you were in the kernel. ;-)

> Although we don't compile the majority of the ZPL code due to VFS
> dependencies, we do compile zfs_znode.c in userspace (without the
> _KERNEL definition).

Are you referring to the linux/fuse implementation? If so, then I would probably suggest we add some ioctl interfaces to retrieve the attributes. There is no need for Solaris to add new VOP interfaces for these.
On Thu, 2008-02-07 at 10:29 -0700, Mark Shellenbaum wrote:
> Ok, but we don't know how many attributes future consumers may want to
> store. I hope it will be small, but I wouldn't want to paint ourselves
> into a corner and then wish we had done it a different way.

That would be a matter of falling back to the current mechanism if we reach a sufficiently large number of attributes (>100?). We are simply interested in optimizing for the large majority of cases.

> > We could also provide a similar interface to add/update an attribute
> > that would manipulate the encoded format directly (the XDR format looks
> > fairly simple, I think).
>
> I think that would be complicated. You could potentially have a
> packed buffer that needs to be expanded or compressed in size depending
> on how the attribute is changed/added.

Yes. So we have 4 cases:

1) Adding an attribute: we would just add it at the end (very simple).
2) Changing an attribute without changing its size: it could be changed in place.
3) Changing an attribute and changing its size: we could use memmove() or similar to move the attributes after the one we're changing to the correct place.
4) Deleting an attribute: we would also use memmove() to move the following attributes to the correct place.

We'd have to do a size check before doing an operation to see if the result would still fit in the provided buffer (which would be the bonus buffer).

Whatever format we use, I think the important thing is to have as little space overhead as possible in order to fit in the dnode. So I think a packed format is probably desirable, even if it means it would not be 100% efficient CPU-wise in some cases (cases 3 and 4).

> > Although we don't compile the majority of the ZPL code due to VFS
> > dependencies, we do compile zfs_znode.c in userspace (without the
> > _KERNEL definition).
>
> Are you referring to the linux/fuse implementation? If so, then I would
> probably suggest we add some ioctl interfaces to retrieve the
> attributes.

I'm referring to the Lustre implementation (which has many things in common with Linux/FUSE due to being implemented in userspace). We cannot use ioctl()s to retrieve attributes because the DMU will be running completely in userspace.

In Solaris (well, in all Lustre-supported OSes), we export the pool from the native (kernel) implementation and import it into our userspace process, which runs libzpool in a similar way to ztest. This means that the native kernel implementation *cannot* have access to the ZFS pool while Lustre is running.

> there is no need for Solaris to add new VOP interfaces for
> these.

Great, we don't need them either :)

However, I think we do not have enough experience to fix the existing Solaris VFS interface to use the new mechanism. Well, we could, but it would take us much, much longer than if it were you guys :)

Thanks,
Ricardo

--
Ricardo Manuel Correia
Lustre Engineering
Sun Microsystems, Inc., Portugal
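A rough sketch of what cases 2-4 above could look like on a generic packed buffer. This is deliberately not tied to the XDR nvpair encoding; the offset and length of the entry being touched are assumed to come from whatever lookup walks the encoded list, and the function name and arguments are illustrative only:

    #include <string.h>
    #include <sys/types.h>

    /*
     * Resize or delete one entry that lives at 'off' with length 'oldlen'
     * inside a packed buffer, of which 'used' bytes are valid and 'bufsize'
     * is the bonus buffer capacity.  newlen == 0 deletes the entry (case 4),
     * newlen == oldlen overwrites it in place (case 2), anything else shifts
     * the tail with memmove() (case 3).  Returns the new 'used' size, or -1
     * if the result would no longer fit (caller falls back to the current
     * xattr mechanism).
     */
    static ssize_t
    packed_entry_resize(char *buf, size_t bufsize, size_t used,
        size_t off, size_t oldlen, const void *newval, size_t newlen)
    {
            size_t tail = used - (off + oldlen);    /* bytes after the entry */

            if (used - oldlen + newlen > bufsize)
                    return (-1);

            if (newlen != oldlen)
                    memmove(buf + off + newlen, buf + off + oldlen, tail);
            if (newlen != 0)
                    memcpy(buf + off, newval, newlen);

            return (used - oldlen + newlen);
    }

Case 1 (appending a new attribute) is the same call with off == used and oldlen == 0.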
Mark Shellenbaum wrote:
> General comment:
>
> I'm not sure we should be using nvlists to store these attributes.
> While it will be quick to implement, I'm concerned about the constant
> packing/unpacking and searching linked lists in the nvlist code.

Agreed. Here is one proposal for how this could be done efficiently and scalably (both in space and runtime):

First, note that the problem is quite constrained:

There are a fixed number of attribute names (for any given software version) since these are interpreted system attributes. Therefore the attribute names can be a small number rather than a string -- ie, we have an enum that defines the attributes. We (ie, OpenSolaris) can control the allocation of new enum values in this namespace.

Many attributes will always have a fixed size (eg, scanstamp, finder info).

Removing attributes or changing their sizes will be extremely rare, so we can make this code path slow. For example, it would not be prohibitive to repack & rewrite all attrs.

Here is an idea for an on-disk format which takes advantage of these constraints to achieve good performance:

(sa == System Attribute)

    enum sa_type {
            ST_ATIME,
            ST_ACL,
            ST_FINDERINFO,
            ST_SCANSTAMP,
            ST_LUSTRE_LAYOUT,
            ...
            ST_NUMTYPES
    };

    const struct sa_info {
            uint8_t sai_intlen;
    } sainfo[ST_NUMTYPES] = {
            8, 2, 8, 8, 1, ...
    };

on disk:

    /* header */
    struct sa_phys {
            uint16_t sa_numattrs;
            struct {
                    uint16_t sa_type;       /* enum sa_type */
                    uint16_t sa_length;     /* in sainfo[sa_type] chunks */
            } sa_attr[];
    };

followed by the data in the order specified by the header. 8-byte alignment will be enforced for all attribute starting offsets.

Note that the software (not the on-disk layout) defines the set of possible attributes and their byte sizes (ie, whether the value is a string vs. a series of 8-byte integers, etc). Alternatively, if we are confident that we won't need anything but 8-byte integers (or arrays thereof), we can simplify the design further.

When we first access a SA block, we read in the header and construct an array indexed by enum sa_type that tells us the position and length of each attribute. That way, access and modification will be blazing fast[*]. If the size of an attribute is changed, we rewrite & repack everything, so there is never any unused space (aside from the padding necessary to enforce 8-byte alignment of each SA; it is assumed that attribute size changes will be "rare enough" that this performance is acceptable).

Thus the space overhead would be only 4 bytes per attribute (compared to a static struct layout), and runtime overhead would be minimal. So I think we could justify using it for much of the ZPL metadata. In fact, we could make this SA stuff be a DMU service, and remove support for bonus buffers entirely. The set of ZPL metadata that is always needed could be in one SA, and other pieces elsewhere. The DMU would use what was the bonus buffer, plus an additional BP only if necessary, to implement the SA. Ie, if the space needed for the SAs was <= 320 bytes (the current bonus size limit), then it would all be in the dnode. Otherwise, the first 192 (320 - sizeof(blkptr_t)) bytes would be in the dnode, and the rest in a block pointed to by a BP stored in the dnode. In addition, we could expand the dnode to 1k so that more SAs could be stored there.

--matt

[*] For example:

    struct sa_layout {
            struct {
                    int sal_offset;         /* in bytes */
                    int sal_length;         /* in sainfo[sa_type] chunks */
            } sal_ary[ST_NUMTYPES];
    };

sa_init would construct an sa_layout given the on-disk sa_phys.

    void
    sa_read(struct sa_layout *sal, void *sa_phys, enum sa_type sat,
        void *buf, int buflen)
    {
            int len;

            ASSERT(sat < ST_NUMTYPES);
            len = sal->sal_ary[sat].sal_length * sainfo[sat].sai_intlen;
            ASSERT(buflen >= len);
            /* Assumes that it was already byteswapped when read off disk */
            bcopy((char *)sa_phys + sal->sal_ary[sat].sal_offset, buf, len);
    }

sa_write would be similar, unless the sal_length was different from what we are trying to write, in which case we would rewrite the whole thing to insert the new (or resized) attribute. The software would define a priority order for the SAs, which defines the order in which to lay them out on disk (and which ones spill out of the dnode into the accessory block first).
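To make the [*] example concrete, here is a hedged sketch of the sa_write() fast path described above, reusing the sa_layout, sainfo and enum sa_type declarations from the example; sa_repack() is a hypothetical name for the rewrite-and-repack slow path:

    /*
     * Fast path: if the length matches the existing layout, overwrite the
     * value in place; otherwise fall back to a full repack (not shown).
     */
    void
    sa_write(struct sa_layout *sal, void *sa_phys, enum sa_type sat,
        const void *buf, int buflen)
    {
            int len;

            ASSERT(sat < ST_NUMTYPES);
            len = sal->sal_ary[sat].sal_length * sainfo[sat].sai_intlen;

            if (len == buflen) {
                    /* same size: overwrite the value in place */
                    bcopy(buf, (char *)sa_phys + sal->sal_ary[sat].sal_offset, len);
            } else {
                    /* size changed (rare): rewrite and repack all attributes */
                    /* sa_repack(sal, sa_phys, sat, buf, buflen);  -- hypothetical */
            }
    }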
Matthew Ahrens wrote:
> Agreed. Here is one proposal for how this could be done efficiently and
> scalably (both in space and runtime):
> [...]
> on disk:
>
>     /* header */
>     struct sa_phys {
>             uint16_t sa_numattrs;
>             struct {
>                     uint16_t sa_type;       /* enum sa_type */
>                     uint16_t sa_length;     /* in sainfo[sa_type] chunks */
>             } sa_attr[];
>     };

I like this approach much better. I was thinking about some sort of indexed table, with attributes having a number to identify them rather than a name.

This will also allow us to kick a number of the ZPL's attributes out into SA space, and then we can choose which additional attributes need to be exposed with the Solaris sysattr framework.

-Mark
On Thu, 2008-02-07 at 12:48 -0800, Matthew Ahrens wrote:
> First, note that the problem is quite constrained:
>
> There are a fixed number of attribute names (for any given software version)
> since these are interpreted system attributes. Therefore the attribute names
> can be a small number rather than a string -- ie, we have an enum that
> defines the attributes. We (ie, OpenSolaris) can control the allocation of
> new enum values in this namespace.

Since we are on the subject of redesigning attributes, have you considered providing a lighter-weight extended attribute mechanism to userspace applications such as Beagle (http://beagle-project.org/Enabling_Extended_Attributes), Meta Tracker (http://www.gnome.org/projects/tracker/), Apache (mime types), etc.?

It seems to me that the usefulness of extended attributes is to efficiently store small amounts of information associated with a file, and I think these would make more sense to be in the dnode (possibly making it 1K by default), or in a single external object if they don't fit into the dnode.

There's a de-facto namespace standard for EAs that specifies some existing and proposed attributes that could give you some ideas about this: http://freedesktop.org/wiki/CommonExtendedAttributes

Of course, this would still require storing the EA names in the dnode, but I think the added flexibility of being able to efficiently store arbitrary EAs would be worth it. And the large dnode feature that we are working on makes it rather easy to create filesystems with bigger dnodes.

Thanks,
Ricardo

--
Ricardo Manuel Correia
Lustre Engineering
Sun Microsystems, Inc., Portugal
Ricardo M. Correia wrote:
> Since we are on the subject of redesigning attributes, have you
> considered providing a lighter-weight extended attribute mechanism to
> userspace applications

The ZAP would probably be sufficient to store lighter-weight user extended attributes (directly in the ZAP, not pointing to other files like the current Solaris extended attributes do). The method I proposed for system attributes is not applicable, since user attributes would almost certainly be string -> string mappings.

--matt
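A rough, kernel-context illustration of the string -> string idea above, assuming a hypothetical per-file ZAP object (xattr_obj, e.g. created with zap_create() and recorded in the znode) and using the existing zap_update()/zap_lookup() interfaces; error handling and transaction setup are omitted:

    #include <sys/dmu.h>
    #include <sys/zap.h>
    #include <sys/systm.h>

    /* store the value as an array of 1-byte integers (the string plus NUL) */
    static int
    user_ea_set(objset_t *os, uint64_t xattr_obj, const char *name,
        const char *value, dmu_tx_t *tx)
    {
            return (zap_update(os, xattr_obj, name, 1, strlen(value) + 1,
                value, tx));
    }

    /* fetch the value back into a caller-supplied buffer */
    static int
    user_ea_get(objset_t *os, uint64_t xattr_obj, const char *name,
        char *buf, uint64_t buflen)
    {
            return (zap_lookup(os, xattr_obj, name, 1, buflen, buf));
    }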
Ricardo M. Correia wrote:
> It seems to me that the usefulness of extended attributes is to
> efficiently store small amounts of information associated with a file,
> and I think these would make more sense to be in the dnode (possibly
> making it 1K by default), or in a single external object if they don't
> fit into the dnode.

I believe when a prototype using 1K dnodes was tested it showed an unacceptable (30%?) hit on some benchmarks. So if we can possibly avoid increasing the dnode size (by default) then we should do so.
On Feb 07, 2008 22:51 -0700, Neil Perrin wrote:
> > It seems to me that the usefulness of extended attributes is to
> > efficiently store small amounts of information associated with a file,
> > and I think these would make more sense to be in the dnode (possibly
> > making it 1K by default), or in a single external object if they don't
> > fit into the dnode.
>
> I believe when a prototype using 1K dnodes was tested it showed an
> unacceptable (30%?) hit on some benchmarks. So if we can possibly
> avoid increasing the dnode size (by default) then we should do so.

That is definitely true, but in other uses this can be a major performance improvement for applications like Samba that use a bunch of attributes:

http://lwn.net/Articles/112571/
http://samba.org/~tridge/xattr_results/ext3-tuning.png

The "default" case in the ext3-tuning.png graph is similar to what is in ZFS today (though actually still more lightweight, because the EAs are only 1 hop from the inode instead of 5 like ZFS today).

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
Andreas Dilger wrote:
> That is definitely true, but in other uses this can be a major performance
> improvement for applications like Samba that use a bunch of attributes:
>
> http://lwn.net/Articles/112571/
> http://samba.org/~tridge/xattr_results/ext3-tuning.png
>
> The "default" case in the ext3-tuning.png graph is similar to what is
> in ZFS today (though actually still more lightweight, because the EAs
> are only 1 hop from the inode instead of 5 like ZFS today).

Interesting data. So I hope we can come up with a solution that works well with both a few and a lot of attributes. Maybe Matt's proposal would get us there?

Neil.
On Thu, 2008-02-07 at 22:51 -0700, Neil Perrin wrote:
> I believe when a prototype using 1K dnodes was tested it showed an
> unacceptable (30%?) hit on some benchmarks. So if we can possibly
> avoid increasing the dnode size (by default) then we should do so.

Hmm, interesting. Do you know the reason for such a performance hit?

Even with 1K dnode sizes, if the dnodes didn't have any extended attributes, and since metadata compression is enabled, the on-disk size of metadnode blocks should remain approximately the same, right?

Could it be because the metadnode object became twice the (logical) size and therefore required another level of indirect blocks, which as a consequence required an additional disk seek for each metadnode block read?

It would be interesting to run some benchmarks with Kalpak's large dnode patch.

Cheers,
Ricardo

--
Ricardo Manuel Correia
Lustre Engineering
Sun Microsystems, Inc., Portugal
Ricardo M. Correia wrote:
> Do you know the reason for such a performance hit?
>
> Even with 1K dnode sizes, if the dnodes didn't have any extended
> attributes, and since metadata compression is enabled, the on-disk size
> of metadnode blocks should remain approximately the same, right?
>
> Could it be because the metadnode object became twice the (logical) size
> and therefore required another level of indirect blocks, which as a
> consequence required an additional disk seek for each metadnode block
> read?

Actually, we tried both the same (16k) dnode block size and double (32k) dnode block size, to see if the "sidefetching" caused by reading in a whole block of dnodes was influencing things. Both performed worse than 512-byte dnodes (in 16k blocks).

However, the additional space would not necessarily be zero-filled; it would be used by additional block pointers (if the file size is in certain ranges). So the additional i/o caused by larger blocks could be an issue. Note that the dnode file is always stored with the max levels of indirection, so that could not be the issue.

Certainly more investigation is needed to determine the impact of 1k dnodes on various workloads. But this is a bit of a diversion from the topic at hand -- EAs (or SAs as I call them; "extended attribute" already has a different meaning on Solaris). I haven't heard anyone from Lustre comment on my proposed design...

--matt
On Feb 07, 2008 17:14 +0000, Ricardo Correia wrote:
> On Thu, 2008-02-07 at 09:44 -0700, Mark Shellenbaum wrote:
> > Section 2.2.4
> >
> > Not sure what you mean by storing all xattrs in a single object. Are you
> > saying all of the Lustre attributes would be stored in a single Solaris
> > extended attribute file?
>
> Yes. The benefit of this approach is that it would greatly reduce the
> number of seeks (which are always in short supply) in a Lustre metadata
> server in the case where the xattrs won't fit in the dnode.

To clarify, this is a desirable approach for the "spill over" case, where the small attributes don't fit into the dnode bonus buffer. The current situation is that every xattr needs a ZAP entry + dnode + block, no matter what size it is. The "EA in dnode" change moves these attributes into the dnode, but the space there is limited.

The proposal previously suggested was to have a block pointer to hold the spilled-over EAs, but a block pointer is quite large and would consume a noticeable fraction of the bonus buffer, and it either limits the EAs to some larger size (1 block) or requires having a tree at that level.

The compromise that we propose is to store the spilled-over EAs in a single dnode, which has the benefit of being essentially unlimited in size, being efficiently addressable (by objid), and amortizing the lookup over multiple attributes.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
Andreas Dilger wrote:
> To clarify, this is a desirable approach for the "spill over" case, where
> the small attributes don't fit into the dnode bonus buffer. The current
> situation is that every xattr needs a ZAP entry + dnode + block, no matter
> what size it is. The "EA in dnode" change moves these attributes into
> the dnode

Not as I read it. Solaris extended attributes will continue to be stored as they are today -- as separate files in an xattr directory hung off the main file.

The proposal is for some new system-interpreted attributes to be stored in the dnode. This is a different namespace with different requirements and different semantics. You don't need to set an ACL on your Lustre metadata, do you?

--matt
On Feb 08, 2008 18:09 +0000, Ricardo Correia wrote:
> On Thu, 2008-02-07 at 22:51 -0700, Neil Perrin wrote:
> > I believe when a prototype using 1K dnodes was tested it showed an
> > unacceptable (30%?) hit on some benchmarks. So if we can possibly
> > avoid increasing the dnode size (by default) then we should do so.
>
> Do you know the reason for such a performance hit?
>
> Even with 1K dnode sizes, if the dnodes didn't have any extended
> attributes, and since metadata compression is enabled, the on-disk size
> of metadnode blocks should remain approximately the same, right?

One of the default behaviours of the DMU is that with a larger dnode, more blkptr_t items are added to consume the space left over after the bonus buffer. Unless the bonus buffer size was also increased in this test, or the number of blkptrs in the dnode was limited (as we have done with the large dnode patch), then any file data will cause the block pointers to be stored directly in the dnode instead of in an indirect block. For some workloads (small file read/write access) this would be an improvement, but for workloads which look a lot at the [dz]node attributes (e.g. find, ls -l) it would be worse.

> Could it be because the metadnode object became twice the (logical) size
> and therefore required another level of indirect blocks, which as a
> consequence required an additional disk seek for each metadnode block
> read?
>
> It would be interesting to run some benchmarks with Kalpak's large dnode
> patch.

Definitely.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
Hi Matthew,

On Thu, 2008-02-07 at 12:48 -0800, Matthew Ahrens wrote:
> on disk:
>
>     /* header */
>     struct sa_phys {
>             uint16_t sa_numattrs;
>             struct {
>                     uint16_t sa_type;       /* enum sa_type */
>                     uint16_t sa_length;     /* in sainfo[sa_type] chunks */
>             } sa_attr[];
>     };
>
> followed by the data in the order specified by the header. 8-byte alignment
> will be enforced for all attribute starting offsets.

This would require repacking when adding an attribute, right? Anyway, maybe that wouldn't be a big problem.

There's also the question of the attribute size limit: from conversations I had with Andreas, I got the feeling that the Lustre layout information could be quite big when a file is striped across all OSTs, and I would imagine this would become an even bigger concern in the future if we intend to continue scaling horizontally.

Anyway, your proposal is interesting, but there's also one thing I would like to add:

Could we have a special integer value that would essentially mean "this is an unknown, name-value type of attribute", which would be used to store additional, perhaps user-specified attributes? In the space of the attribute value, we could store the name of the attribute and the value itself (perhaps with 1 or 2 additional bytes for the name length of the attribute).

These attributes could have lower priority than the other system attributes (they would be stored at the end), and they would be ignored (but not removed) by any implementation that doesn't understand them.

I also think that instead of having an additional block pointer (which is huge) in the dnode for "spillage", we should have something like a "uint64_t dn_spillobj" which would be the object id of a "spillage" object. An object id is much more space-efficient and, like Andreas mentioned, allows for an unlimited number of attributes/attribute sizes.

I also find *extremely* appealing your idea of making this whole thing a DMU service. I think that would make our lives (as in Lustre developers' lives) much, much easier in the long run... the VFS dependencies in the ZPL code have been kind of a pain, to say the least. :)

Thanks,
Ricardo

--
Ricardo Manuel Correia
Lustre Engineering
Sun Microsystems, Inc., Portugal
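A strawman encoding for the "unknown, name-value" attribute idea above: a one-byte name length, followed by the name, followed by the raw value, all stored inside the SA value. ST_NAMEVALUE would be a hypothetical enum sa_type entry, and the helper below is purely illustrative:

    #include <stdint.h>
    #include <string.h>

    /*
     * Pack a name-value pair into the value space of a hypothetical
     * ST_NAMEVALUE system attribute.  Returns the number of bytes written;
     * the caller is responsible for sizing the destination buffer.
     */
    static size_t
    sa_namevalue_pack(char *dst, const char *name, const void *val, size_t vallen)
    {
            uint8_t namelen = (uint8_t)strlen(name);

            dst[0] = (char)namelen;                   /* 1-byte name length */
            memcpy(dst + 1, name, namelen);           /* attribute name */
            memcpy(dst + 1 + namelen, val, vallen);   /* raw value follows */
            return (1 + namelen + vallen);
    }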
Ricardo M. Correia wrote:
> This would require repacking when adding an attribute, right? Anyway,
> maybe that wouldn't be a big problem.
>
> There's also the question of the attribute size limit: from conversations
> I had with Andreas, I got the feeling that the Lustre layout information
> could be quite big when a file is striped across all OSTs, and I would
> imagine this would become an even bigger concern in the future if we
> intend to continue scaling horizontally.

By their very nature, these attributes will have a size limitation. I don't really see the point in allowing a size larger than the supported block size; 64K seems reasonable. I think that very large values could possibly be supported using some form of indirection: storing a block pointer for the value, or storing an object ID for values that span multiple blocks.

> Anyway, your proposal is interesting, but there's also one thing I would
> like to add:
>
> Could we have a special integer value that would essentially mean "this
> is an unknown, name-value type of attribute", which would be used to
> store additional, perhaps user-specified attributes?
> In the space of the attribute value, we could store the name of the
> attribute and the value itself (perhaps with 1 or 2 additional bytes for
> the name length of the attribute).
>
> These attributes could have lower priority than the other system
> attributes (they would be stored at the end), and they would be ignored
> (but not removed) by any implementation that doesn't understand them.

I'm not sure how storing them at the end makes them lower priority. I'm also not sure exactly how these would be used... does Lustre need these? For any file system layer, we would need to add infrastructure to manage and manipulate these objects, as well as new interfaces to be able to access these attributes from outside the kernel. All that aside, there is nothing that would prohibit the definition of this type of "special system attribute" in the current model.

> I also think that instead of having an additional block pointer (which
> is huge) in the dnode for "spillage", we should have something like a
> "uint64_t dn_spillobj" which would be the object id of a "spillage" object.
> An object id is much more space-efficient and, like Andreas mentioned,
> allows for an unlimited number of attributes/attribute sizes.

I don't quite understand this. The whole point of these attributes is to make them fast to access... using an object ID is going to be far more expensive than a block pointer to access. Matt's model can also support unlimited numbers of attributes if we allow the blocks to be chained.
To put the document in better perspective: the extended attributes (EAs), as we (Lustre) see them, are the lighter-weight extended attributes which are user defined, and "user" can be Beagle, Metatracker, Lustre, etc.

Lustre has been using the extended attributes described at http://acl.bestbits.at/about.html, where ldiskfs (ext3 modified for Lustre) is used as the backend. The posted design specifically tries to devise a similar mechanism for ZFS, which even pNFS or MAC OS/X could possibly exploit.

It also implies, as Matt has already mentioned, that the proposal made by him for system attributes is not applicable. As we do not intend to extend the current optional attributes, i.e. struct xoptattr, the current APIs VOP_GETATTR/VOP_SETATTR won't serve the purpose; hence the proposal of new APIs in section 2.2.5 of the design doc. Use of the existing VOP_GETATTR/VOP_SETATTR APIs would be possible if we somehow integrated the (user-defined) extended attributes with the existing xoptattr_t.

On Fri, 2008-02-08 at 11:48 -0800, Matthew Ahrens wrote:
> Not as I read it. Solaris extended attributes will continue to be stored as
> they are today -- as separate files in an xattr directory hung off the main file.

We have been discussing moving only uint8_t xoa_av_scanstamp[AV_SCANSTAMP_SZ] (at least as of now) into the nvlist which holds EAs in the dnode, as it currently uses the first 32 bytes just after the znode_phys.

P.S. Couldn't reply earlier as I was down with a high temperature and had been coughing my lungs out for the past few days.

-Girish
On Feb 09, 2008 08:10 -0700, Mark Maybee wrote:
>> There's also the question of the attribute size limit: from conversations
>> I had with Andreas, I got the feeling that the Lustre layout information
>> could be quite big when a file is striped across all OSTs, and I would
>> imagine this would become an even bigger concern in the future if we
>> intend to continue scaling horizontally.
>
> By their very nature, these attributes will have a size limitation. I
> don't really see the point in allowing a size larger than the supported
> block size; 64K seems reasonable. I think that very large values could
> possibly be supported using some form of indirection: storing a block
> pointer for the value, or storing an object ID for values that span
> multiple blocks.

For the most common case in Lustre, the striping attribute will be relatively small (in the range of 80 - 128 bytes). In other cases (less common, but still present) the current ext3 size limit of 4096 bytes is already a limiting factor on the striping of a file - directly affecting the total bandwidth that can be allocated to a single file. It is reasonable to have Lustre striping attributes up to the 16kB - 24kB range, at which point we will have a different (more efficient) mechanism for storing large stripes, but that needs some upcoming infrastructure first. There is very little use in the middle range.

It seems possible that we may need to have two separate mechanisms for storing the small attributes and storing the large ones. The small Lustre SAs will be stored in the dnode, and the large ones in the existing xattr mechanism. Given that some applications (ZFS/OSX/pNFS) need to be able to fall back to looking in an xattr for the data they need for compatibility, this isn't any extra overhead.

>> Anyway, your proposal is interesting, but there's also one thing I would
>> like to add:
>>
>> Could we have a special integer value that would essentially mean "this is
>> an unknown, name-value type of attribute", which would be used to store
>> additional, perhaps user-specified attributes?
>> In the space of the attribute value, we could store the name of the
>> attribute and the value itself (perhaps with 1 or 2 additional bytes for
>> the name length of the attribute).

To be clear - as yet Lustre has a fairly limited set of "system attributes" that are needed for high-performance operation. There is the ability to store "user attributes" on a file, and while good performance is desirable, this is not a widely-used feature and falls into the "nice to have" category. In ext3 there is no separation of system attributes and user attributes, so they all benefit from the [di]node-local storage optimization.

>> I also think that instead of having an additional block pointer (which is
>> huge) in the dnode for "spillage", we should have something like a
>> "uint64_t dn_spillobj" which would be the object id of a "spillage" object.
>> An object id is much more space-efficient and, like Andreas mentioned,
>> allows for an unlimited number of attributes/attribute sizes.
>
> I don't quite understand this. The whole point of these attributes is
> to make them fast to access... using an object ID is going to be far
> more expensive than a block pointer to access. Matt's model can also
> support unlimited numbers of attributes if we allow the blocks to be
> chained.

While I agree with your point, I think part of the issue is that as soon as we store a blkptr_t in the dnode, it will consume a significant chunk of the SA space and push attributes out of the dnode. The other tradeoff is one of complexity. You know the code better than I, but it seems cleaner to have a dnode reference as a container for a bunch of SA blocks, rather than having another block tree attached to the same dnode.

That said, if you don't think there is a lot of added complexity to having chained blocks for the SAs, I'll defer to your experience.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
On Feb 11, 2008, at 8:42 PM, Andreas Dilger wrote:
> For the most common case in Lustre, the striping attribute will be
> relatively small (in the range of 80 - 128 bytes). In other cases
> (less common, but still present) the current ext3 size limit of 4096
> bytes is already a limiting factor on the striping of a file - directly
> affecting the total bandwidth that can be allocated to a single file.
> It is reasonable to have Lustre striping attributes up to the 16kB - 24kB
> range, at which point we will have a different (more efficient) mechanism
> for storing large stripes, but that needs some upcoming infrastructure first.
> There is very little use in the middle range.

Another data point: I would expect the pNFS striping information to generally be in the 128 byte range, with some growth to 256 bytes in the common case.

Spencer
On Thu, 2008-02-07 at 12:48 -0800, Matthew Ahrens wrote:
> Agreed. Here is one proposal for how this could be done efficiently and
> scalably (both in space and runtime):
>
> First, note that the problem is quite constrained:
>
> There are a fixed number of attribute names (for any given software version)
> since these are interpreted system attributes. Therefore the attribute names
> can be a small number rather than a string -- ie, we have an enum that
> defines the attributes. We (ie, OpenSolaris) can control the allocation of
> new enum values in this namespace.

We can have a 32-bit type and a 32-bit subtype, so that the ZFS team will only need to assign the 32-bit type and the subtype can be used by Lustre or OS/X without requiring ZFS team approval. For Lustre the subtype will be a 3-4 char EA name, but it can be used differently also. The kernel would not care about the SA names, but the name can be expanded to com.apple.finder or com.lustre.lov, etc. if it needs to be printed.

> Many attributes will always have a fixed size (eg, scanstamp, finder info).

I think we can let sai_intlen == 0 indicate that sa_length is the real length of the SA value (and not in sainfo[sa_type] chunks) to allow arbitrary-sized values.

So here is a slightly modified on-disk format from what you proposed:

    struct sa_phys {
            struct sa_header {
                    uint32_t magic;
                    uint16_t sa_numattrs;
                    uint16_t pad;
            } sa_hdr;
            struct {
                    uint32_t sa_type;
                    uint32_t sa_subtype;
                    uint32_t sa_value_offset;
                    uint32_t sa_value_size;
            } sa_entry[];
    };

The sa_value_offset will point to the address of the value in the bonus buffer of the dnode or in the external block. 8-byte alignment will require padding of entries and values.

sa_type + sa_subtype == 0 would terminate the list, and as an additional check we could store 0x00000000 to indicate the end of the sa_entry's. So during modification or removal of entries we only need to repack the ones at the end.

Thanks,
Kalpak.
That way, access and modification will be blazing > fast[*]. If the size of an attribute is changed, we rewrite & repack > everything, so there is never any unused space. (Aside from the padding > necessary to enforce 8-byte alignment of each SA. It is assumed that > attribute size changing will be "rare enough" that this performance is > acceptable.) > > Thus the space overhead would be only 4 bytes per attribute (compared to > a static struct layout), and runtime overhead would be minimal. So I think > we could justify using it for much of the ZPL metadata. In fact, we could > make this SA stuff be a DMU service, and remove support for bonus > buffers entirely. The set of ZPL metadata that is always needed could > be in one sa, and other pieces elsewhere. The DMU would use what was > the bonus buffer, plus an additional BP only if necessary, to implement > the SA. Ie, if the space needed for the SA''s was <= 320 bytes (the > current bonus size limit), then it would all be in the dnode. > Otherwise, the first 192 (320-sizeof(blkptr_t)) would be in the dnode, > and the rest in a block pointed to by a BP stored in the dnode. In addition, > we could expand the dnode to 1k so that more SA''s could be stored there. > > --matt > > [*] For example: > > struct sa_layout { > struct { > int sal_offset; /* in bytes */ > int sal_length; /* in sainfo[sa_type] chunks */ > } sal_ary[ST_NUMTYPES]; > }; > > sa_init would construct an sa_layout given the on-disk sa_phys. > > void sa_read(struct sa_layout *sal, void *sa_phys, enum sa_type sat, > void *buf, int buflen) > { > int len; > > ASSERT(sat < ST_NUMTYPES); > len = sal->sal_ary[sat].sal_length * sainfo[sat].sai_intlen > ASSERT(buflen >= len); > /* Assumes that it was already byteswapped when read off disk */ > bcopy((char*)sa_phys + sal->sal_ary[sat].sa_offset, buf, len); > } > > sa_write would be similar, unless the sal_length was different from what we > are trying to write, in which case we would rewrite the whole thing to insert > the new (or resized) attribute. The software would define a priority order > for the SA''s, which defines what order to lay them out on disk (and which > spill out of the dnode into the accessory block first). > _______________________________________________ > zfs-code mailing list > zfs-code at opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-code
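As a concreteness check on the entry-table format proposed above, here is a
rough sketch of how a lookup could walk Kalpak's sa_entry[] array.  The
function name sa_find_value() and the exact terminator handling are
illustrative assumptions, not an agreed interface.

#include <stdint.h>
#include <stddef.h>

/* Header and entry shapes as proposed above (sketch only). */
typedef struct sa_header {
        uint32_t magic;
        uint16_t sa_numattrs;
        uint16_t pad;
} sa_header_t;

typedef struct sa_entry {
        uint32_t sa_type;
        uint32_t sa_subtype;
        uint32_t sa_value_offset;       /* from start of bonus/external buf */
        uint32_t sa_value_size;         /* in bytes */
} sa_entry_t;

/*
 * Scan the entry table for (type, subtype); return a pointer to the value
 * inside the buffer and its size, or NULL if not found.  The scan stops
 * after sa_numattrs entries or at an all-zero terminator entry.
 */
static const void *
sa_find_value(const void *buf, uint32_t type, uint32_t subtype,
    uint32_t *sizep)
{
        const sa_header_t *hdr = buf;
        const sa_entry_t *ent = (const sa_entry_t *)(hdr + 1);
        uint16_t i;

        for (i = 0; i < hdr->sa_numattrs; i++, ent++) {
                if (ent->sa_type == 0 && ent->sa_subtype == 0)
                        break;          /* end-of-list marker */
                if (ent->sa_type == type && ent->sa_subtype == subtype) {
                        *sizep = ent->sa_value_size;
                        return ((const char *)buf + ent->sa_value_offset);
                }
        }
        return (NULL);
}

An update path could reuse the same scan and, as described above, repack only
the trailing entries when an attribute is removed or resized.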
On Feb 24, 2008 17:20 +0530, Kalpak Shah wrote:
> On Thu, 2008-02-07 at 12:48 -0800, Matthew Ahrens wrote:
> > [...]
> > There are a fixed number of attribute names (for any given software version)
> > since these are interpreted system attributes.  Therefore the attribute names
> > can be a small number, rather than a string -- ie, we have an enum that
> > defines the attributes.  We (ie, OpenSolaris) can control the allocation of
> > new enum values in this namespace.
>
> We can have a 32-bit type and a 32-bit subtype so that the ZFS team will
> only need to assign the 32-bit type and the subtype can be used by
> Lustre or OS/X without requiring ZFS team approval.  For Lustre the
> subtype will be a 3-4 char EA name but it can be used differently also.

What else I was thinking would be very useful is to allow setting a
rank/priority for each SA.  This would give the DMU code the ability to
prioritize which SAs are stored at the beginning of the list (both to be
quickly accessible without much scanning, and to ensure they remain in
the dnode in case of overflow).  In particular, to avoid serious
performance degradation, any znode data that is moved to an SA, and any
SA needed to point to an external block, should be kept in the dnode
itself.

I'm undecided whether the ranking should be the "sa_type" (which might
put some assigned types at a disadvantage of having "high priority" SAs,
but ensures that critical SAs can never be pushed out of the dnode) or
the "sa_subtype".

> [...]
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
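As a rough illustration of the rank/priority idea above, the packer could sort
the pending attributes by a per-type rank before laying them out, so the
highest-priority SAs (and the entry pointing at any external block) are placed
first and therefore stay in the dnode if the total overflows.  The type codes,
rank table, and function names below are hypothetical, not a proposed
assignment.

#include <stdint.h>
#include <stdlib.h>

/* Hypothetical type codes, for illustration only. */
enum {
        SAT_ZNODE_CORE = 0,     /* former znode_phys data: times, sizes, ... */
        SAT_ACL = 1,
        SAT_SPILL_PTR = 2       /* entry pointing at the external SA block */
};

typedef struct sa_pending {
        uint32_t sa_type;
        uint32_t sa_subtype;
        uint32_t sa_size;
        const void *sa_value;
} sa_pending_t;

/* Lower rank == higher priority == laid out earlier (stays in the dnode). */
static int
sa_rank(uint32_t type)
{
        switch (type) {
        case SAT_ZNODE_CORE:
        case SAT_SPILL_PTR:
                return (0);     /* must never be pushed out of the dnode */
        case SAT_ACL:
                return (1);
        default:
                return (10);    /* may spill to the external block */
        }
}

static int
sa_cmp_rank(const void *a, const void *b)
{
        const sa_pending_t *sa = a;
        const sa_pending_t *sb = b;

        return (sa_rank(sa->sa_type) - sa_rank(sb->sa_type));
}

/*
 * Order the attributes before packing; a packer (not shown) would then fill
 * the bonus buffer front to back and divert whatever remains to the spill
 * block, so only low-priority SAs overflow.
 */
static void
sa_order_for_packing(sa_pending_t *attrs, size_t count)
{
        qsort(attrs, count, sizeof (sa_pending_t), sa_cmp_rank);
}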