Hello, as a brief introduction, I'm one of the developers of Lustre (www.lustre.org) at CFS and we are porting Lustre over to use ZFS (well, technically just the DMU) for back-end storage. We currently use a modified ext3/4 filesystem for the back-end storage (both data and metadata) fairly successfully (single filesystems of up to 2PB with up to 500 back-end ext3 file stores, getting 50GB/s aggregate throughput in some installations).

Lustre is a fairly heavy user of extended attributes on the metadata target (MDT) to record virtual file->object mappings, and we'll also begin using EAs more heavily on the object store (OST) in the near future (reverse object->file mappings, for example).

One of the performance improvements we developed early on with ext3 is moving the EA into the inode to avoid seeking and full block writes for small amounts of EA data. The same could also be done to improve small-file performance (though we didn't implement that). For ext3 this meant increasing the inode size from 128 bytes to a format-time constant size of 256 - 4096 bytes (chosen based on the default Lustre EA size for that fs).

My understanding from brief conversations with some of the ZFS developers is that there are already some plans to enlarge the dnode because the dnode bonus buffer is getting close to being full for ZFS. Are there any details of this plan that I could read, or has it been discussed before? Due to the generality of the terms I wasn't able to find anything by search. I wanted to get the ball rolling on the large dnode discussion (which you may have already had internally, I don't know), and start a fast EA discussion in a separate thread.

One of the important design decisions made with the ext3 "large inode" space (beyond the end of the regular inode) was that there is a marker in each inode which records how much of that space was used for "fixed" fields (e.g. nanosecond timestamps, creation time, inode version) at the time the inode was last written. The space beyond "i_extra_isize" is used for extended attribute storage. If an inode is modified and the kernel code wants to store additional "fixed" fields in the inode, it will push the EAs out to external blocks to make room if there isn't enough in-inode space.

By having i_extra_isize stored in each inode (actually the first 16-bit field in large inodes) we are at liberty to add new fields to the inode itself without having to do a scan/update operation on existing inodes (definitely desirable for ZFS also), and we don't have to waste a lot of "reserved" space for potential future expansion or for fields at the end that are not being used (e.g. inode version is only useful for NFSv4 and Lustre). None of the "extra" fields are critical to correct operation, by definition, since the code has existed until now without them... Conversely, we don't force EAs to start at a fixed offset and then use inefficient EA wrapping for small 32- or 64-bit fields.

We also _discussed_ storing ext3 small-file data in an EA on an opportunistic basis, along with more extent data (a la XFS). Are there plans to allow the dn_blkptr[] array to grow on a per-dnode basis to avoid spilling out to an external block for files that are smaller and/or have little/no EA data? Alternately, it would be interesting to store file data in the (enlarged) dn_blkptr[] array for small files to avoid fragmenting the free space within the dnode.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
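(For anyone unfamiliar with the ext3 "large inode" layout described above, a minimal sketch of the space accounting. The 128-byte base inode size is real; the helper function and its arguments are illustrative only, and the small in-inode EA header that ext3 also keeps in that space is ignored here.)

    #include <stdint.h>
    #include <stddef.h>

    /* ext3/4 large-inode layout, roughly:
     *   [ 128-byte base inode | i_extra_isize bytes of fixed fields | in-inode EA space ]
     * i_extra_isize is the first 16-bit field past the base inode, so the
     * in-inode EA area is whatever remains up to the formatted inode size. */
    #define EXT3_GOOD_OLD_INODE_SIZE 128

    static size_t in_inode_ea_space(size_t inode_size, uint16_t i_extra_isize)
    {
            if (inode_size <= EXT3_GOOD_OLD_INODE_SIZE)
                    return 0;       /* old 128-byte inodes have no extra space */
            return inode_size - EXT3_GOOD_OLD_INODE_SIZE - i_extra_isize;
    }

For example, a 256-byte inode with i_extra_isize = 4 leaves about 124 bytes for in-inode EAs (less the small EA header), which is why the format-time inode size was chosen based on the expected Lustre EA size.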
Andreas,

We have explored the idea of increasing the dnode size in the past and discovered that a larger dnode size has a significant negative performance impact on the ZPL (at least with our current caching and read-ahead policies). So we don't have any plans to increase its size generically anytime soon.

However, given that the ZPL isn't the only consumer of datasets, and that Lustre may benefit from a larger dnode size, it may be worth investigating the possibility of supporting multiple dnode sizes within a single pool (this is currently not supported).

Also, note that dnodes already have the notion of "fixed" DMU-specific data and "variable" application-used data (the bonus area). So even in the current code, Lustre has the ability to use 320 bytes of bonus space however it wants.

-Mark
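(A back-of-the-envelope sketch of where Mark's 320-byte figure comes from. The constants below are stated as assumptions about the current on-disk format, a 512-byte dnode with a 64-byte fixed header and 128-byte block pointers, not pulled from the ZFS headers; with a maximally sized bonus area only one embedded block pointer remains.)

    #include <stdio.h>

    /* Bonus space = dnode size minus the fixed dnode header minus the space
     * still reserved for dn_blkptr[].  With today's assumed defaults that
     * leaves the 320 bytes available for application (ZPL/Lustre) use. */
    enum { DNODE_SIZE = 512, DNODE_HEADER = 64, BLKPTR_SIZE = 128, NBLKPTR = 1 };

    int main(void)
    {
            int bonus = DNODE_SIZE - DNODE_HEADER - NBLKPTR * BLKPTR_SIZE;
            printf("bonus area: %d bytes\n", bonus);        /* prints 320 */
            return 0;
    }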
On Sep 13, 2007 15:27 -0600, Mark Maybee wrote:
> We have explored the idea of increasing the dnode size in the past
> and discovered that a larger dnode size has a significant negative
> performance impact on the ZPL (at least with our current caching
> and read-ahead policies). So we don't have any plans to increase
> its size generically anytime soon.

I'm sure it depends a lot on the workload. I don't know the details of how the ZFS allocators work, so it seems possible they always allocate the modified dnode and the corresponding EAs in a contiguous chunk initially, but I suspect that keeping this true over the life of the dnode would put an added burden on the allocator (to know this) or the ZPL (to always mark them dirty to force colocation even if not modified).

I'd also heard that the 48 (or so) bytes that remain in the bonus buffer for ZFS are potentially going to be used up soon, so there would be a desire to have a generic solution to this issue.

One of the reasons the large inode patch made it into the Linux kernel quickly was because it made a big difference for Samba (in addition to Lustre):

http://lwn.net/Articles/112571/

> However, given that the ZPL isn't the only consumer of datasets,
> and that Lustre may benefit from a larger dnode size, it may be
> worth investigating the possibility of supporting multiple dnode
> sizes within a single pool (this is currently not supported).

Without knowing the details, it would seem at first glance that having a variable dnode size would be fairly complex. Aren't the dnodes just stored in a single sparse object and accessed by dnode_size * objid? This does seem desirable from the POV that if you have an existing fs with the current dnode size you don't want to need a reformat in order to use the larger size.

> Also, note that dnodes already have the notion of "fixed" DMU-
> specific data and "variable" application-used data (the bonus
> area). So even in the current code, Lustre has the ability to
> use 320 bytes of bonus space however it wants.

That is true, and we discussed this internally, but one of the internal requirements we have for DMU usage is that it create an on-disk layout that matches ZFS so that it is possible to mount a Lustre filesystem via ZFS or ZFS-FUSE (and potentially the reverse in the future). This will allow us to do problem diagnosis and also leverage any ZFS scanning/verification tools that may be developed.
Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
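(To make the "dnode_size * objid" point concrete, a small sketch of how a dnode would be located in the meta-dnode object. The 16K metadnode block size is an assumption for illustration, and the function is not DMU code.)

    #include <stdint.h>
    #include <stdio.h>

    /* Dnodes live packed end-to-end in a single sparse object (the metadnode),
     * so finding dnode N is just arithmetic.  A larger per-dataset dnode size
     * only changes the constant: fewer dnodes fit per metadnode block. */
    #define METADNODE_BLKSIZE (16 * 1024)   /* assumed metadnode block size */

    static void locate_dnode(uint64_t objid, uint32_t dnode_size)
    {
            uint64_t offset = objid * dnode_size;
            uint64_t blkid  = offset / METADNODE_BLKSIZE;
            uint32_t within = (uint32_t)(offset % METADNODE_BLKSIZE);

            printf("object %llu: metadnode block %llu, offset %u (%u dnodes/block)\n",
                   (unsigned long long)objid, (unsigned long long)blkid,
                   within, METADNODE_BLKSIZE / dnode_size);
    }

    int main(void)
    {
            locate_dnode(12345, 512);       /* today's dnode size */
            locate_dnode(12345, 1024);      /* a hypothetical larger dnode */
            return 0;
    }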
The performance benchmarks that Mark refers to are valid for our current ZPL implementation. That is, the bonus buffer only contains the znode and symlink contents. If, however, we had an application that always had an extended attribute, and that extended attribute was frequently accessed, then I think there would be (as Andreas points out) a significant performance advantage to having the XATTR in the dnode somewhere.

I think there are a couple of issues here. The first one is to allow each dataset to have its own dnode size. While conceptually not all that hard, it would take some re-jiggering of the code to make most of the #defines turn into per-dataset variables. But it should be pretty straightforward, and probably not a bad idea in general.

The other issue is a little more sticky. My understanding is that Lustre-on-DMU plans to use the same data structures as the ZPL. That way, you can mount the Lustre metadata or object stores as a regular filesystem. Given this, the question is what changes, if any, should be made to the ZPL to accommodate. Allowing the ZPL to deal with non-512-byte dnodes is probably not that bad. The question is whether or not the ZPL should be made to understand the extended attributes (or whatever) that are stored in the rest of the bonus buffer.

While the Lustre guys may be the first to venture into this area, it will come up anyway with pNFS or the CIFS server, so we should probably spend some brain cycles thinking about the best way to have extra data (of various sorts) in larger-than-normal dnodes that the ZPL can deal with.

A simple plan may be that the first extended attribute is stored in the bonus buffer (if it fits). I don't know if this would require the same logic we used to have that placed small file contents in the bonus buffer. Unfortunately, that code was *way* complicated and was ripped out some time ago.

If the bonus buffer containing an extended attribute won't work, the question becomes how to put the Lustre LOV data into the dnode/znode so we get the performance benefits, but using an implementation that we can live with. Of course one option would be to give up on the Lustre/ZPL compatibility, but I don't think that's such a good plan. Like I mentioned earlier, I think that pNFS and CIFS will wind up running into similar issues, so we'll have to deal with such a thing sooner or later.

Ideas?


--Bill
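(A hypothetical sketch of the "store the first EA in the bonus buffer if it fits" test Bill describes. The layout, the tiny header, and the sizes below are made up for illustration; they are not an existing ZFS on-disk format.)

    #include <stdint.h>
    #include <string.h>

    /* Hypothetical layout: [ znode | 4-byte header | name\0 | value ] packed
     * into the dnode bonus area, spilling to the usual EA directory object
     * when it does not fit.  All sizes here are assumptions. */
    #define BONUS_SPACE   320     /* bonus area available to the ZPL */
    #define ZNODE_SIZE    264     /* assumed sizeof(znode_phys_t) */
    #define EA_HDR_SIZE   4       /* made-up length/type header */

    static int ea_fits_in_bonus(const char *name, uint32_t value_len)
    {
            uint32_t need = EA_HDR_SIZE + (uint32_t)strlen(name) + 1 + value_len;
            return ZNODE_SIZE + need <= BONUS_SPACE;
    }

With today's 320-byte bonus and a znode in the mid-200-byte range there are only a few dozen bytes of slack, which is roughly why the discussion keeps coming back to larger dnodes.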
Andreas Dilger wrote:
> On Sep 13, 2007 15:27 -0600, Mark Maybee wrote:
>> We have explored the idea of increasing the dnode size in the past
>> and discovered that a larger dnode size has a significant negative
>> performance impact on the ZPL (at least with our current caching
>> and read-ahead policies). So we don't have any plans to increase
>> its size generically anytime soon.
>
> I'm sure it depends a lot on the workload. I don't know the details
> of how the ZFS allocators work, so it seems possible they always
> allocate the modified dnode and the corresponding EAs in a contiguous
> chunk initially, but I suspect that keeping this true over the life
> of the dnode would put an added burden on the allocator (to know this)
> or the ZPL (to always mark them dirty to force colocation even if not
> modified).
>
> I'd also heard that the 48 (or so) bytes that remain in the bonus buffer
> for ZFS are potentially going to be used up soon, so there would be a
> desire to have a generic solution to this issue.

You seem to have a line on a lot of internal development details :-).

> One of the reasons the large inode patch made it into the Linux
> kernel quickly was because it made a big difference for Samba
> (in addition to Lustre):
>
> http://lwn.net/Articles/112571/
>
>> However, given that the ZPL isn't the only consumer of datasets,
>> and that Lustre may benefit from a larger dnode size, it may be
>> worth investigating the possibility of supporting multiple dnode
>> sizes within a single pool (this is currently not supported).
>
> Without knowing the details, it would seem at first glance that
> having a variable dnode size would be fairly complex. Aren't the
> dnodes just stored in a single sparse object and accessed by
> dnode_size * objid? This does seem desirable from the POV that
> if you have an existing fs with the current dnode size you don't
> want to need a reformat in order to use the larger size.

I was referring here to supporting multiple dnode sizes within a *pool*, but the size would still remain fixed for a given dataset (see Bill's mail). This is a much simpler concept to implement.

>> Also, note that dnodes already have the notion of "fixed" DMU-
>> specific data and "variable" application-used data (the bonus
>> area). So even in the current code, Lustre has the ability to
>> use 320 bytes of bonus space however it wants.
>
> That is true, and we discussed this internally, but one of the internal
> requirements we have for DMU usage is that it create an on-disk layout
> that matches ZFS so that it is possible to mount a Lustre filesystem
> via ZFS or ZFS-FUSE (and potentially the reverse in the future).
> This will allow us to do problem diagnosis and also leverage any ZFS
> scanning/verification tools that may be developed.

Ah, interesting, I was not aware of this requirement. It would not be difficult to allow the ZPL to work with a larger dnode size (in fact it's pretty much a noop as long as the ZPL is not trying to use any of the extra space in the dnode).
On Sep 13, 2007, at 5:48 PM, Bill Moore wrote:
> While the Lustre guys may be the first to venture into this area, it
> will come up anyway with pNFS or the CIFS server, so we should probably
> spend some brain cycles thinking about the best way to have extra data
> (of various sorts) in larger-than-normal dnodes that the ZPL can deal
> with.

Yeah, the pNFS metadata server is going to use EAs for the layout information. The pNFS data server is bypassing the ZPL and going directly to the DMU.

For the pNFS people, do you have any feeling how big of an EA you will need for the layout information? Are you planning on using just one EA? I'm wondering if the bonus buffer of 320 bytes would suffice.

eric
On Sep 14, 2007 08:52 -0600, Mark Maybee wrote:
>> Without knowing the details, it would seem at first glance that
>> having a variable dnode size would be fairly complex. Aren't the
>> dnodes just stored in a single sparse object and accessed by
>> dnode_size * objid? This does seem desirable from the POV that
>> if you have an existing fs with the current dnode size you don't
>> want to need a reformat in order to use the larger size.
>
> I was referring here to supporting multiple dnode sizes within a
> *pool*, but the size would still remain fixed for a given dataset
> (see Bill's mail). This is a much simpler concept to implement.

Ah, sure. That would be a lot easier to implement.

>> That is true, and we discussed this internally, but one of the internal
>> requirements we have for DMU usage is that it create an on-disk layout
>> that matches ZFS so that it is possible to mount a Lustre filesystem
>> via ZFS or ZFS-FUSE (and potentially the reverse in the future).
>> This will allow us to do problem diagnosis and also leverage any ZFS
>> scanning/verification tools that may be developed.
>
> Ah, interesting, I was not aware of this requirement. It would not be
> difficult to allow the ZPL to work with a larger dnode size (in fact
> it's pretty much a noop as long as the ZPL is not trying to use any of
> the extra space in the dnode).

I agree, but I suspect large dnodes could also be of use to ZFS at some point, either for fast EAs and/or small files, so we wanted to get some buy-in from the ZFS developers on an approach that would be suitable for ZFS also. In particular, being able to use the larger dnode space for a variety of reasons (more elements in dn_blkptr[], small file data, fast EA space) is much more desirable than a Lustre-only implementation.

Also, being able to access the EAs via the ZPL when mounted as ZFS would be important for debugging/backup/restore/etc.

I suspect the Lustre development approach would be the same with ZFS as it is with ext3, which has been quite successful to this point. Namely, we're happy to develop new functionality in ZFS/DMU as needed so long as we get buy-in from the ZFS team on the design and most importantly the on-disk format. We don't want to create a permanent fork in the code or on-disk format that separates Lustre-ZFS from Solaris-ZFS, which is the whole point of starting this discussion long before we're going to start implementing anything.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
On Sep 14, 2007, at 11:09 AM, Andreas Dilger wrote:
> I suspect the Lustre development approach would be the same with ZFS
> as it is with ext3, which has been quite successful to this point.
> Namely, we're happy to develop new functionality in ZFS/DMU as needed
> so long as we get buy-in from the ZFS team on the design and most
> importantly the on-disk format. We don't want to create a permanent
> fork in the code or on-disk format that separates Lustre-ZFS from
> Solaris-ZFS, which is the whole point of starting this discussion long
> before we're going to start implementing anything.

Absolutely, let's make sure we all agree on the on-disk changes. This has been a major focus for us when working with the OSX and FreeBSD people. So far we've been quite successful (and I don't see any reason why we won't be in the future). It's great to hear you want the same thing. Another nice thing is that ZFS was designed to support on-disk changes (see zpool upgrade).

eric
On Sep 13, 2007 17:48 -0700, Bill Moore wrote:
> I think there are a couple of issues here. The first one is to allow
> each dataset to have its own dnode size. While conceptually not all
> that hard, it would take some re-jiggering of the code to make most of
> the #defines turn into per-dataset variables. But it should be pretty
> straightforward, and probably not a bad idea in general.

Agreed.

> The other issue is a little more sticky. My understanding is that
> Lustre-on-DMU plans to use the same data structures as the ZPL. That
> way, you can mount the Lustre metadata or object stores as a regular
> filesystem. Given this, the question is what changes, if any, should be
> made to the ZPL to accommodate. Allowing the ZPL to deal with
> non-512-byte dnodes is probably not that bad. The question is whether
> or not the ZPL should be made to understand the extended attributes (or
> whatever) that is stored in the rest of the bonus buffer.

There are a couple of approaches I can propose, but since I'm only at the level of a ZFS code newbie I can't weigh how easy/hard it would be to implement them. This is really just at the brainstorming stage for many of them, and we may want to split details into separate threads.

    typedef struct dnode_phys {
            uint8_t         dn_type;
            uint8_t         dn_indblkshift;
            uint8_t         dn_nlevels = 3
            uint8_t         dn_nblkptr = 3
            uint8_t         dn_bonustype;
            uint8_t         dn_checksum;
            uint8_t         dn_compress;
            uint8_t         dn_pad[1];
            uint16_t        dn_datablkszsec;
            uint16_t        dn_bonuslen;
            uint8_t         dn_pad2[4];
            uint64_t        dn_maxblkid;
            uint64_t        dn_secphys;
            uint64_t        dn_pad3[4];
            blkptr_t        dn_blkptr[dn_nblkptr];
            uint8_t         dn_bonus[BONUSLEN];
    } dnode_phys_t;

    typedef struct znode_phys {
            uint64_t        zp_atime[2];
            uint64_t        zp_mtime[2];
            uint64_t        zp_ctime[2];
            uint64_t        zp_crtime[2];
            uint64_t        zp_gen;
            uint64_t        zp_mode;
            uint64_t        zp_size;
            uint64_t        zp_parent;
            uint64_t        zp_links;
            uint64_t        zp_xattr;
            uint64_t        zp_rdev;
            uint64_t        zp_flags;
            uint64_t        zp_uid;
            uint64_t        zp_gid;
            uint64_t        zp_pad[4];
            zfs_znode_acl_t zp_acl;
    } znode_phys_t;

There are several issues that I think should be addressed with a single design, since they are closely related:

0) versioning of the filesystem
1) variable dnode_phys_t size (per dataset, to start with at least)
2) fast small files (per dnode)
3) variable znode_phys_t size (per dnode)
4) fast extended attributes (per dnode)

Lustre doesn't really care about (3) per se, and not very much about (2) right now, but we may as well address it at the same time as the others.

Versioning of the filesystem
============================

0.a If we are changing the on-disk layout we have to pay attention to on-disk compatibility and ensure older ZFS code does not fail badly. I don't think it is possible to make all of the changes being proposed here in a way that is compatible with existing code, so we need to version the changes in some manner.

0.b The ext2/3/4 format has a very clever (IMHO) versioning mechanism that is superior to just incrementing a version number and forcing all implementations to support every previous version's features. See http://www.mjmwired.net/kernel/Documentation/filesystems/ext2.txt#224 for a detailed description of how the features work. The gist is that instead of the "version" being an incrementing digit it is instead a bitmask of features.

0.c It would be possible to modify ZFS to use ext2-like feature flags. We would have to special-case the bits 0x00000001 and 0x00000002 that represent the different features of ZFS_VERSION_3 currently.
All new features would still increment the "version number" (which would become the "INCOMPAT" version field) so old code would still refuse to mount it, but instead of being sequential versions we now get power-of-two jumps in the version number. It is no longer required that ZFS support a strict superset of all changes that the Lustre ZFS code implements immediately, and it is possible to develop and support these changes in parallel, and land them in a safe, piecewise manner (or never, as sometimes happens with features that die off). (A rough sketch of such a feature-flag check is at the end of this message.)

Variable dnode_phys_t size
==========================

1.a) I think everyone agrees that for a per-dataset fixed value this is "just" a matter of changing all the code in a mechanical fashion. I'll just ignore the issue of being able to increase this in an existing dataset for now.

1.b) My understanding is that dn_bonuslen covers ALL of the ZPL-accessible data (i.e. it is a layering violation to try and access anything beyond dn_bonuslen, and in fact the buffer may not even contain any valid data or conceivably even segfault). That means any data used by the ZPL (and by extension Lustre, which wants to maintain format compatibility) needs to live inside dn_bonuslen.

1.c) With a larger dnode, it is possible to have more elements in dn_blkptr[] on a per-dnode basis. I have no feeling for the relative performance gains of storing 5 or 12 blocks in the dnode but it can't hurt, I think. Avoiding a seek for files < 10*128kB is still good. It seems dnode_allocate() already takes this into account, based on bonuslen at the time of dnode creation.

1.d) It currently doesn't seem possible to change dn_bonuslen on an existing object (dnode_reallocate() will truncate all the file data in that case?), so we'd need some mechanism to push data blocks into an external blkptr in this case (hopefully not impossible given that the pointer to the bonus buffer might change?).

1.e) For a Lustre metadata server (which never stores file data) it may even be useful to allow dn_nblkptr = 0 to reclaim the 128-byte blkptr for EAs. That is a relatively minor improvement, and it seems the DMU would currently not be very happy with that.

Fast small files
================

2.a) This means storing small files within the dnode itself. Since (AFAICS) the ZPL code is correctly layered atop the DMU, it has no idea how or where the data for a file is actually stored. This leaves the possibility of storing small file data within the dn_blkptr[] array, which at 128 bytes/blkptr is fairly significant (larger than the shrinking symlink space), especially if we have a larger dnode which may have a bunch of free space in it. For a 1024-byte dnode+znode we would have 760 bytes of contiguous space, and that covers 1/3 of the files in my /etc, /bin, /lib, /usr/bin, /usr/lib, and /var.

2.b) The DMU of course assumes the dn_blkptr contents are valid (after verifying the checksums), so we'd need a mechanism (dn_flag, dn_type, dn_compress, dn_datablkszsec?) that indicates whether this is "packed inline" data or blkptr_t data. At first glance I like "dn_compress" the best, but there would still have to be some special casing to avoid handling the "blkptr" in the normal way.

Variable znode_phys_t size
==========================

3.a) I initially thought that we don't have to store any extra information to have a variable znode_phys_t size, because dn_bonuslen holds this information. However, for symlinks ZFS checks essentially "zp_size + sizeof(znode_phys_t) < dn_bonuslen" to see if it is a fast or slow symlink.
That implies if sizeof(znode_phys_t) changes, old symlinks on disk will be accessed incorrectly if we don't have some extra information about the size of znode_phys_t in each dnode.

3.b) We can call this "zp_extra_znsize". If we declare the current znode_phys_t as znode_phys_v0_t, then zp_extra_znsize is the amount of extra space beyond sizeof(znode_phys_v0_t), so 0 for current filesystems.

3.c) zp_extra_znsize would need to be stored in znode_phys_t somewhere. There is lots of unused space in some of the 64-bit fields, but I don't know how you feel about hacks for this. Possibilities include some bits in zp_flags, zp_pad, high bits in zp_*time nanoseconds, etc. It probably only needs to be 8 bits or so (it seems unlikely you will more than double the number of fixed fields in struct znode_phys_t).

3.d) We might consider some symlink-specific mechanism to indicate fast/slow symlinks (e.g. a flag) instead of depending on sizes, which I always found fragile in ext3 also, and which was the source of several bugs.

3.e) We may instead consider (2.a) for symlinks at that point, since there is no reason to fear writing 60-byte files anymore (same performance, different (larger!) location for symlink data).

3.f) When ZFS code is accessing new fields declared in znode_phys_t it has to verify whether they are beyond dn_bonuslen and zp_extra_znsize to know if those fields are actually valid on disk.

Finally,

Fast extended attributes
========================

4.a) Unfortunately, due to (1.b), I don't think we can just store the EA in the dnode after the bonus buffer.

4.b) Putting the EA in the bonus buffer requires (3.a, 3.b) to be addressed. At that point (symlinks possibly excepted, depending on whether 3.e is used) the EA space would be:

    (dn_bonuslen - sizeof(znode_phys_v0_t) - zp_extra_znsize)

For existing symlinks we'd have to also reduce this by zp_size. (A small sketch of this arithmetic also follows at the end of this message.)

4.c) It would be best to have some kind of ZAP to store the fast EA data. Ideally it is a very simple kind of ZAP (single buffer), but the microzap format is too restrictive with only a 64-bit value. One of the other Lustre desires is to store additional information in each directory entry (in addition to the object number), like file type and a remote server identifier, and having a single ZAP type that is useful for small entries would be good. Is it possible to go straight to a zap_leaf_phys_t without having a corresponding zap_phys_t first? If yes, then this would be quite useful; otherwise a fat ZAP is too fat to be useful for storing fast EA data and the extended directory info.

Apologies for the long email, but I think all of these issues are related and best addressed with a single design, even if they are implemented in a piecemeal fashion. None of these features are blockers for Lustre implementation atop ZFS/DMU, but nobody wants the performance to be bad.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
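(As referenced in 0.c above, a minimal sketch of an ext2-style feature-flag check, written generically; the struct, field, and flag names here are hypothetical, not existing ZFS on-disk fields.)

    #include <stdint.h>

    /* ext2-style feature masks, hypothetical names for illustration only:
     * compat:    unknown bits are harmless, mount read-write anyway
     * ro_compat: unknown bits are safe to read but not to modify
     * incompat:  unknown bits mean the code must refuse to mount at all */
    struct feature_flags {
            uint32_t compat;
            uint32_t ro_compat;
            uint32_t incompat;
    };

    #define SUPP_COMPAT     0x00000001      /* e.g. fast EAs in the bonus buffer */
    #define SUPP_RO_COMPAT  0x00000000
    #define SUPP_INCOMPAT   0x00000003      /* e.g. the two ZFS_VERSION_3 bits */

    /* returns: 0 = mount read-write, 1 = mount read-only, -1 = refuse */
    static int check_features(const struct feature_flags *f)
    {
            if (f->incompat & ~SUPP_INCOMPAT)
                    return -1;
            if (f->ro_compat & ~SUPP_RO_COMPAT)
                    return 1;
            return 0;       /* unknown compat bits are ignored by design */
    }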
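(And the 4.b arithmetic spelled out, using the names proposed above; the znode_phys_v0_t size is left as a parameter since it is whatever sizeof() says on a given build, and the function is purely illustrative.)

    #include <stdint.h>

    /* Fast-EA space per 4.b: whatever bonus space remains after the fixed
     * znode fields (old size + zp_extra_znsize), minus the in-bonus symlink
     * target for existing fast symlinks. */
    static uint32_t fast_ea_space(uint32_t dn_bonuslen,
                                  uint32_t znode_v0_size,   /* sizeof(znode_phys_v0_t) */
                                  uint32_t zp_extra_znsize,
                                  uint32_t symlink_len)     /* zp_size for fast symlinks, else 0 */
    {
            uint32_t used = znode_v0_size + zp_extra_znsize + symlink_len;
            return dn_bonuslen > used ? dn_bonuslen - used : 0;
    }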
The "Fast extended attributes" is of great interest to us in the Mac OS X camp. Historically most files have 32 bytes of "Finder Info" which we are currently storing as an EA. Fast access to this info would be a great gain for us. We also are seeing more and more EAs used in Mac OS X 10.5 (many with small data) so we would be interested in some sort of generic fast EAs (ie embedded) or at least fast access to their names. -Don On Sep 15, 2007, at 4:19 PM, Andreas Dilger wrote:> On Sep 13, 2007 17:48 -0700, Bill Moore wrote: >> I think there are a couple of issues here. The first one is to allow >> each dataset to have its own dnode size. While conceptually not all >> that hard, it would take some re-jiggering of the code to make most >> of >> the #defines turn into per-dataset variables. But it should be >> pretty >> straightforward, and probably not a bad idea in general. > > Agreed. > >> The other issue is a little more sticky. My understanding is that >> Lustre-on-DMU plans to use the same data structures as the ZPL. That >> way, you can mount the Lustre metadata or object stores as a regular >> filesystem. Given this, the question is what changes, if any, >> should be >> made to the ZPL to accommodate. Allowing the ZPL to deal with >> non-512-byte dnodes is probably not that bad. The question is >> whether >> or not the ZPL should be made to understand the extended attributes >> (or >> whatever) that is stored in the rest of the bonus buffer. > > There are a couple of approaches I can propose, but since I''m only at > the level of ZFS code newbie I can''t weigh weigh how easy/hard it > would > be to implement them. This is really just at the brainstorming stage > for many of them, and we may want to split details into separate > threads. > > typedef struct dnode_phys { > uint8_t dn_type; > uint8_t dn_indblkshift; > uint8_t dn_nlevels = 3 > uint8_t dn_nblkptr = 3 > uint8_t dn_bonustype; > uint8_t dn_checksum; > uint8_t dn_compress; > uint8_t dn_pad[1]; > uint16_t dn_datablkszsec; > uint16_t dn_bonuslen; > uint8_t dn_pad2[4]; > uint64_t dn_maxblkid; > uint64_t dn_secphys; > uint64_t dn_pad3[4]; > blkptr_t dn_blkptr[dn_nblkptr]; > uint8_t dn_bonus[BONUSLEN] > } dnode_phys_t; > > typedef struct znode_phys { > uint64_t zp_atime[2]; > uint64_t zp_mtime[2]; > uint64_t zp_ctime[2]; > uint64_t zp_crtime[2]; > uint64_t zp_gen; > uint64_t zp_mode; > uint64_t zp_size; > uint64_t zp_parent; > uint64_t zp_links; > uint64_t zp_xattr; > uint64_t zp_rdev; > uint64_t zp_flags; > uint64_t zp_uid; > uint64_t zp_gid; > uint64_t zp_pad[4]; > zfs_znode_acl_t zp_acl; > } znode_phys_t > > There are several issues that I think should be addressed with a > single > design, since they are closely related: > 0) versioning of the filesystem > 1) variable dnode_phys_t size (per dataset, to start with at least) > 2) fast small files (per dnode) > 3) variable znode_phys_t size (per dnode) > 4) fast extended attributes (per dnode) > > Lustre doesn''t really care about (3) per-se, and not very much about > (2) > right now but we may as well address it at the same time as the > others. > > Versioning of the filesystem > ===========================> 0.a If we are changing the on-disk layout we have to pay attention to > on-disk compatibility and ensure older ZFS code does not fail badly. > I don''t think it is possible to make all of the changes being > proposed here in a way that is compatible with existing code so we > need to version the changes in some manner. 
For a 1024-byte dnode+znode > we would have 760 bytes of contiguous space, and that covers 1/3 > of the files in my /etc, /bin, /lib, /usr/bin, /usr/lib, and /var. > > 2.b The DMU of course assumes the dn_blkptr contents are valid (after > verifying the checksums) so we''d need a mechanism (dn_flag, dn_type, > dn_compress, dn_datablkszsec?) that indicated whether this was > "packed inline" data or blkptr_t data. At first glance I like > "dn_compress" the best, but there would still have to be some > special > casing to avoid handling the "blkptr" in the normal way. > > Variable znode_phys_t size > =========================> 3.a) I initially thought that we don''t have to store any extra > information to have a variable znode_phys_t size, because > dn_bonuslen > holds this information. However, for symlinks ZFS checks > essentially > "zp_size + sizeof(znode_phys_t) < dn_bonuslen" to see if it is a > fast or slow symlink. That implies if sizeof(znode_phys_t) changes > old symlinks on disk will be accessed incorrectly if we don''t have > some extra information about the size of znode_phys_t in each dnode. > > 3.b) We can call this "zp_extra_znsize". If we declare the current > znode_phys_t as znode_phys_v0_t then zp_extra_znsize is the amount > of > extra space beyond sizeof(znode_phys_v0_t), so 0 for current > filesystems. > > 3.c) zp_extra_znsize would need to be stored in znode_phys_t > somewhere. > There is lots of unused space in some of the 64-bit fields, but I > don''t know how you feel about hacks for this. Possibilities include > some bits in zp_flags, zp_pad, high bits in zp_*time nanoseconds, > etc. > It probably only needs to be 8 bytes or so (seems unlikely you will > more than double the number of fixed fields in struct znode_phys_t). > > 3.d) We might consider some symlink-specific mechanism to incidate > fast/slow symlinks (e.g. a flag) instead of depending on sizes, > which I always found fragile in ext3 also, and was the source of > several bugs. > > 3.e) We may instead consider (2.a) for symlinks a that point, since > there > is no reason to fear writing 60-byte files anymore (same > performance, > different (larger!) location for symlink data). > > 3.f) When ZFS code is accessing new fields declared in znode_phys_t > it has > to verify whether they are beyond dn_bonuslen and zp_extra_znsize to > know if those fields are actually valid on disk. > > Finally, > > Fast extended attributes > =======================> 4.a) Unfortunately, due to (1.b), I don''t think we can just store the > EA in the dnode after the bonus buffer. > > 4.b) Putting the EA in the bonus buffer requires (3.a, 3.b) to be > addressed. > At that point (symlinks possibly excepted, depending on whether 3.e > is used) the EA space would be: > > (dn_bonuslen - sizeof(znode_phys_v0_t) - zp_extra_znsize) > > For existing symlinks we''d have to also reduce this by zp_size. > > 4.c) It would be best to have some kind of ZAP to store the fast EA > data. > Ideally it is a very simple kind of ZAP (single buffer), but the > microzap format is too restrictive with only a 64-bit value. > One of the other Lustre desires is to store additional information > in > each directory entry (in addition to the object number) like file > type > and a remote server identifier, and having a single ZAP type that is > useful for small entries would be good. Is it possible to go > straight > to a zap_leaf_phys_t without having a corresponding zap_phys_t > first? 
> If yes, then this would be quite useful, otherwise a fat ZAP is too fat
> to be useful for storing fast EA data and the extended directory info.
>
> Apologies for the long email, but I think all of these issues are related
> and best addressed with a single design even if they are implemented in
> a piecemeal fashion. None of these features are blockers for Lustre
> implementation atop ZFS/DMU but nobody wants the performance to be bad.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Principal Software Engineer
> Cluster File Systems, Inc.
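As a rough illustration of the ext2-style scheme in 0.b/0.c, a feature-mask compatibility check might look like the sketch below. The ZFS_FEATURE_* names, the three-mask structure, and the function are all invented for this example; ZFS today has only the single incrementing version number being discussed.

    /*
     * Sketch only: ext2-style feature masks applied to a ZFS-like
     * superblock.  None of these names exist in ZFS today.
     */
    #include <stdint.h>

    typedef struct zfs_sb_features {
        uint64_t compat;     /* older code may mount and write safely */
        uint64_t ro_compat;  /* older code may mount read-only */
        uint64_t incompat;   /* older code must refuse to mount */
    } zfs_sb_features_t;

    /* Hypothetical feature bits; one power-of-two bit per feature. */
    #define ZFS_FEATURE_INCOMPAT_LARGE_DNODE  0x0000000000000001ULL
    #define ZFS_FEATURE_INCOMPAT_FAST_XATTR   0x0000000000000002ULL

    /* Feature bits this particular implementation understands. */
    #define ZFS_FEATURE_INCOMPAT_SUPPORTED \
            (ZFS_FEATURE_INCOMPAT_LARGE_DNODE)
    #define ZFS_FEATURE_RO_COMPAT_SUPPORTED   0ULL

    /* Return 0 if mountable, 1 if only read-only is safe, -1 if not mountable. */
    static int
    zfs_check_features(const zfs_sb_features_t *f)
    {
        if (f->incompat & ~ZFS_FEATURE_INCOMPAT_SUPPORTED)
            return (-1);            /* unknown incompatible feature */
        if (f->ro_compat & ~ZFS_FEATURE_RO_COMPAT_SUPPORTED)
            return (1);             /* mount read-only at most */
        return (0);                 /* unknown COMPAT bits are harmless */
    }

The point of the three masks is that a feature developed out of tree (say, large dnodes for Lustre) only has to claim one INCOMPAT bit; it does not force every implementation to support a strict superset of every lower "version number".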
Andreas Dilger wrote:
> I agree, but I suspect large dnodes could also be of use to ZFS at
> some point, either for fast EAs and/or small files, so we wanted to
> get some buy-in from the ZFS developers on an approach that would
> be suitable for ZFS also. In particular, being able to use the larger
> dnode space for a variety of reasons (more elements in dn_blkptr[],
> small file data, fast EA space) is much more desirable than a Lustre-only
> implementation.

Let me give an alternate view here. This could make ZFS Crypto more complex, because data would now sometimes be stored inside the dnode. I need to think about this a bit more; in general it makes me uneasy, though it may turn out not to be an issue. See:

http://opensolaris.org/os/project/zfs-crypto/phase1/dmu_ot/

for my current plan of what DMU object types get encrypted.

--
Darren J Moffat
> There are several issues that I think should be addressed with a single > design, since they are closely related: > 0) versioning of the filesystem > 1) variable dnode_phys_t size (per dataset, to start with at least) > 2) fast small files (per dnode) > 3) variable znode_phys_t size (per dnode) > 4) fast extended attributes (per dnode) > > Lustre doesn''t really care about (3) per-se, and not very much about (2) > right now but we may as well address it at the same time as the others. > > Versioning of the filesystem > ===========================> 0.a If we are changing the on-disk layout we have to pay attention to > on-disk compatibility and ensure older ZFS code does not fail badly. > I don''t think it is possible to make all of the changes being > proposed here in a way that is compatible with existing code so we > need to version the changes in some manner. > > 0.b The ext2/3/4 format has a very clever IMHO versioning mechanism that > is superior to just incrementing a version number and forcing all > implementations to support every previous version''s features. See > http://www.mjmwired.net/kernel/Documentation/filesystems/ext2.txt#224 > for a detailed description of how the features work. The gist is > that instead of the "version" being an incrementing digit it is > instead a bitmask of features. > > 0.c It would be possible to modify ZFS to use ext2-like feature flags. > We would have to special-case the bits 0x00000001 and 0x00000002 > that represent the different features of ZFS_VERSION_3 currently. > All new features would still increment the "version number" (which > would become the "INCOMPAT" version field) so old code would still > refuse to mount it, but instead of being sequential versions we now > get power-of-two jumps in the version number. It is no longer required > that ZFS support a strict superset of all changes that the Lustre ZFS > code implements immediately, and it is possible to develop and support > these changes in parallel, and land them in a safe, piecewise manner > (or never, as sometimes happens with features that die off) >While not entirely the same thing we will soon have a VFS feature registration mechanism in Nevada. Basically, a file system registers what features it supports. Initially this will be things such as "case insensitivity", "acl on create", "extended vattr_t".> Variable znode_phys_t size > =========================> 3.a) I initially thought that we don''t have to store any extra > information to have a variable znode_phys_t size, because dn_bonuslen > holds this information. However, for symlinks ZFS checks essentially > "zp_size + sizeof(znode_phys_t) < dn_bonuslen" to see if it is a > fast or slow symlink. That implies if sizeof(znode_phys_t) changes > old symlinks on disk will be accessed incorrectly if we don''t have > some extra information about the size of znode_phys_t in each dnode. >There is an existing bug to create symlinks with their own object type.> 3.b) We can call this "zp_extra_znsize". If we declare the current > znode_phys_t as znode_phys_v0_t then zp_extra_znsize is the amount of > extra space beyond sizeof(znode_phys_v0_t), so 0 for current filesystems.This would also require creation a new DMU_OT_ZNODE2 or something similarly named.> > 3.c) zp_extra_znsize would need to be stored in znode_phys_t somewhere. > There is lots of unused space in some of the 64-bit fields, but I > don''t know how you feel about hacks for this. 
Possibilities include > some bits in zp_flags, zp_pad, high bits in zp_*time nanoseconds, etc. > It probably only needs to be 8 bytes or so (seems unlikely you will > more than double the number of fixed fields in struct znode_phys_t). >The zp_flags field is off limits. It is going to be used for storing additional file attributes such as immutable, nounlink,... I don''t want to see us overload other fields. We already have several pad fields within the znode that could be used.> 3.d) We might consider some symlink-specific mechanism to incidate > fast/slow symlinks (e.g. a flag) instead of depending on sizes, > which I always found fragile in ext3 also, and was the source of > several bugs. > > 3.e) We may instead consider (2.a) for symlinks a that point, since there > is no reason to fear writing 60-byte files anymore (same performance, > different (larger!) location for symlink data). > > 3.f) When ZFS code is accessing new fields declared in znode_phys_t it has > to verify whether they are beyond dn_bonuslen and zp_extra_znsize to > know if those fields are actually valid on disk. > > Finally, > > Fast extended attributes > =======================> 4.a) Unfortunately, due to (1.b), I don''t think we can just store the > EA in the dnode after the bonus buffer. > > 4.b) Putting the EA in the bonus buffer requires (3.a, 3.b) to be addressed. > At that point (symlinks possibly excepted, depending on whether 3.e > is used) the EA space would be: > > (dn_bonuslen - sizeof(znode_phys_v0_t) - zp_extra_znsize) > > For existing symlinks we''d have to also reduce this by zp_size. > > 4.c) It would be best to have some kind of ZAP to store the fast EA data. > Ideally it is a very simple kind of ZAP (single buffer), but the > microzap format is too restrictive with only a 64-bit value. > One of the other Lustre desires is to store additional information in > each directory entry (in addition to the object number) like file type > and a remote server identifier, and having a single ZAP type that is > useful for small entries would be good. Is it possible to go straight > to a zap_leaf_phys_t without having a corresponding zap_phys_t first? > If yes, then this would be quite useful, otherwise a fat ZAP is too fat > to be useful for storing fast EA data and the extended directory info. >Can you provide a list of what attributes you want to store in the znode and what their sizes are? Do you expect ZFS to do anything special with these attributes? Should these attributes be exposed to applications? Usually, we only embed attributes in the znode if the file system has some sort of semantics associated with them. One of the original plans, from several years ago was to create a zp_zap field in the znode that would be used for storing additional file attributes. We never actually did that and the field was turned into one of the pad fields in the znode. If the attribute will be needed for every file then it should probably be in the znode, but if it is an optional attribute or too big then maybe it should be in some sort of overflow object. -Mark> > Apologies for the long email, but I think all of these issues are related > and best addressed with a single design even if they are implemented in > a piecemeal fashion. None of these features are blockers for Lustre > implementation atop ZFS/DMU but nobody wants the performance to be bad. > > Cheers, Andreas > -- > Andreas Dilger > Principal Software Engineer > Cluster File Systems, Inc. 
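To pin down what the zp_extra_znsize bookkeeping from 3.b/3.f amounts to, here is a minimal sketch. It assumes zp_extra_znsize would live in one of the existing znode pad fields (per Mark's comment above); the helper names and the idea of passing the sizes in explicitly are purely illustrative, not proposed interfaces.

    /*
     * Sketch only: the arithmetic behind a variable-size znode.  "v0"
     * means the znode_phys_t layout as it exists today; zp_extra_znsize
     * is the count of additional fixed-field bytes recorded per file.
     */
    #include <stddef.h>

    /* Bonus-buffer bytes left over for fast EAs (4.b in the earlier list). */
    static size_t
    fast_ea_space(size_t dn_bonuslen, size_t znode_v0_size,
        size_t zp_extra_znsize)
    {
        size_t fixed = znode_v0_size + zp_extra_znsize;

        return (dn_bonuslen > fixed ? dn_bonuslen - fixed : 0);
    }

    /*
     * 3.f: a fixed field added after v0, living at byte offset field_off
     * from the start of the znode, is only valid on disk if it falls
     * inside both the recorded fixed area and the bonus buffer.
     */
    static int
    znode_field_valid(size_t field_off, size_t field_len,
        size_t dn_bonuslen, size_t znode_v0_size, size_t zp_extra_znsize)
    {
        size_t end = field_off + field_len;

        return (end <= znode_v0_size + zp_extra_znsize && end <= dn_bonuslen);
    }

A reader written this way never needs to know the "current" znode size: old files simply carry zp_extra_znsize == 0, so every post-v0 field reads as absent, which is the same property i_extra_isize gives ext3.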
On Sep 17, 2007 08:31 -0600, Mark Shellenbaum wrote:> While not entirely the same thing we will soon have a VFS feature > registration mechanism in Nevada. Basically, a file system registers > what features it supports. Initially this will be things such as "case > insensitivity", "acl on create", "extended vattr_t".It''s hard for me to comment on this without more information. I just suggested the ext3 mechanism because what I see so far (many features being tied to ZFS_VERSION_3, and checking for version >= ZFS_VERSION_3) mean that it is really hard to do parallel development of features and ensure that the code is actually safe to access the filesystem. For example, if we start developing large dnode + fast EA code we might want to ship that out sooner than it can go into a Solaris release. We want to make sure that no Solaris code tries to mount such a filesystem or it will assert (I think), so we would have to version the fs as v4. However, maybe Solaris needs some other changes that would require a v4 that does not include large dnode + fast EA support (for whatever reason) so now we have 2 incompatible codebases that support "v4"... Do you have a pointer to the upcoming versioning mechanism?> >3.a) I initially thought that we don''t have to store any extra > > information to have a variable znode_phys_t size, because dn_bonuslen > > holds this information. However, for symlinks ZFS checks essentially > > "zp_size + sizeof(znode_phys_t) < dn_bonuslen" to see if it is a > > fast or slow symlink. That implies if sizeof(znode_phys_t) changes > > old symlinks on disk will be accessed incorrectly if we don''t have > > some extra information about the size of znode_phys_t in each dnode. > > > > There is an existing bug to create symlinks with their own object type.I don''t think that will help unless there is an extra mechanism to detect whether the symlink is fast or slow, instead of just using the dn_bonuslen. Is it possible to store XATTR data on symlinks in Solaris?> >3.b) We can call this "zp_extra_znsize". If we declare the current > > znode_phys_t as znode_phys_v0_t then zp_extra_znsize is the amount of > > extra space beyond sizeof(znode_phys_v0_t), so 0 for current > > filesystems. > > This would also require creation a new DMU_OT_ZNODE2 or something > similarly named.Sure. Is it possible to change the DMU_OT type on an existing object?> >3.c) zp_extra_znsize would need to be stored in znode_phys_t somewhere. > > There is lots of unused space in some of the 64-bit fields, but I > > don''t know how you feel about hacks for this. Possibilities include > > some bits in zp_flags, zp_pad, high bits in zp_*time nanoseconds, etc. > > It probably only needs to be 8 bytes or so (seems unlikely you will > > more than double the number of fixed fields in struct znode_phys_t). > > > > The zp_flags field is off limits. It is going to be used for storing > additional file attributes such as immutable, nounlink,...Ah, OK. I was wondering about that also, but it isn''t in the top 10 priorities yet.> I don''t want to see us overload other fields. We already have several > pad fields within the znode that could be used.OK, I wasn''t sure about what is spoken for already. Is it ZFS policy to always have 64-bit member fields? Some of the fields (e.g. nanoseconds) don''t really make sense as 64-bit values, and it would probably be a waste to have a 64-bit value for zp_extra_znsize.> >4.c) It would be best to have some kind of ZAP to store the fast EA data. 
> > Ideally it is a very simple kind of ZAP (single buffer), but the > > microzap format is too restrictive with only a 64-bit value. > > One of the other Lustre desires is to store additional information in > > each directory entry (in addition to the object number) like file type > > and a remote server identifier, and having a single ZAP type that is > > useful for small entries would be good. Is it possible to go straight > > to a zap_leaf_phys_t without having a corresponding zap_phys_t first? > > If yes, then this would be quite useful, otherwise a fat ZAP is too fat > > to be useful for storing fast EA data and the extended directory info. > > Can you provide a list of what attributes you want to store in the znode > and what their sizes are? Do you expect ZFS to do anything special with > these attributes? Should these attributes be exposed to applications?The main one is the Lustre logical object volume (LOV) extended attribute data. This ranges from (commonly) 64 bytes, to as much as 4096 bytes (or possibly larger once on ZFS). This HAS to be accessed to do anything with the znode, even stat currently, since the size of a file is distributed over potentially many servers, so avoiding overhead here is critical. In addition to that, there will be similar smallish attributes stored with each znode like back-pointers from the storage znodes to the metadata znode. These are on the order of 64 bytes as well.> Usually, we only embed attributes in the znode if the file system has > some sort of semantics associated with them.The issue I think is that this data is only useful for Lustre, so reserving dedicated space for it in a znode is no good. Also, the LOV XATTR might be very large, so any dedicated space would be wasted. Having a generic and fast XATTR storage in the znode would help a variety of applications.> One of the original plans, from several years ago was to create a zp_zap > field in the znode that would be used for storing additional file > attributes. We never actually did that and the field was turned into > one of the pad fields in the znode.Maybe "file attributes" is the wrong term. These are really XATTRs in the ZFS sense, so I''ll refer to them as such in the future.> If the attribute will be needed for every file then it should probably > be in the znode, but if it is an optional attribute or too big then > maybe it should be in some sort of overflow object.This is what I''m proposing. For small XATTRs they would live in the znode, and large ones would be stored using the normal ZFS XATTR mechanism (which is infinitely flexible). Since the Lustre LOV XATTR data is created when the znode is first allocated, it will always get first crack at using the fast XATTR space, which is fine since it is right up with the znode data in importance. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc.
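What Andreas describes in the last paragraph -- small XATTRs in the znode, large ones through the existing XATTR directory -- comes down to a simple placement decision at set time. The sketch below only illustrates that policy; fast_ea_free_space(), fast_ea_set_impl() and dir_xattr_set_impl() are invented placeholders rather than ZFS or Lustre functions, and details such as per-entry headers and replacing an existing fast EA with a larger value are glossed over.

    /*
     * Sketch of a "fast first, spill if necessary" xattr store.  All of
     * the extern functions are placeholders, not real interfaces.
     */
    #include <stddef.h>

    extern size_t fast_ea_free_space(const void *znode);   /* bonus bytes left */
    extern int    fast_ea_set_impl(void *znode, const char *name,
                                   const void *val, size_t len);
    extern int    dir_xattr_set_impl(void *znode, const char *name,
                                     const void *val, size_t len);

    static int
    xattr_set(void *znode, const char *name, const void *val, size_t len)
    {
        /*
         * Small values (e.g. the common 64-byte Lustre LOV EA) go into
         * the bonus buffer if they fit; anything larger, or anything
         * that no longer fits, uses the existing directory-based XATTRs.
         */
        if (len <= fast_ea_free_space(znode))
            return (fast_ea_set_impl(znode, name, val, len));

        return (dir_xattr_set_impl(znode, name, val, len));
    }

Because the LOV EA is written when the object is first created, it would naturally get the first claim on the fast space under this policy, which is the behaviour described above.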
Andreas Dilger wrote:> On Sep 17, 2007 08:31 -0600, Mark Shellenbaum wrote: >> While not entirely the same thing we will soon have a VFS feature >> registration mechanism in Nevada. Basically, a file system registers >> what features it supports. Initially this will be things such as "case >> insensitivity", "acl on create", "extended vattr_t". > > It''s hard for me to comment on this without more information. I just > suggested the ext3 mechanism because what I see so far (many features > being tied to ZFS_VERSION_3, and checking for version >= ZFS_VERSION_3) > mean that it is really hard to do parallel development of features and > ensure that the code is actually safe to access the filesystem. >ZFS actually has 3 different version numbers. Anything with ZFS_ is actually the spa version. The ZPL also has a version associated with it and will have ZPL_ as its prefix. Within each file is a unique ACL version. Most of the version changing has happened at the spa level, but soon the ZPL version will be changing to support some additional attributes and other things for SMB.> For example, if we start developing large dnode + fast EA code we might > want to ship that out sooner than it can go into a Solaris release. We > want to make sure that no Solaris code tries to mount such a filesystem > or it will assert (I think), so we would have to version the fs as v4. > > However, maybe Solaris needs some other changes that would require a v4 > that does not include large dnode + fast EA support (for whatever reason) > so now we have 2 incompatible codebases that support "v4"... > > Do you have a pointer to the upcoming versioning mechanism? >Sure, take a look at: http://www.opensolaris.org/os/community/arc/caselog/2007/315/ http://www.opensolaris.org/os/community/arc/caselog/2007/444/ These describe more than just the feature registration though.>>> 3.a) I initially thought that we don''t have to store any extra >>> information to have a variable znode_phys_t size, because dn_bonuslen >>> holds this information. However, for symlinks ZFS checks essentially >>> "zp_size + sizeof(znode_phys_t) < dn_bonuslen" to see if it is a >>> fast or slow symlink. That implies if sizeof(znode_phys_t) changes >>> old symlinks on disk will be accessed incorrectly if we don''t have >>> some extra information about the size of znode_phys_t in each dnode. >>> >> There is an existing bug to create symlinks with their own object type. > > I don''t think that will help unless there is an extra mechanism to detect > whether the symlink is fast or slow, instead of just using the dn_bonuslen. > Is it possible to store XATTR data on symlinks in Solaris? > >>> 3.b) We can call this "zp_extra_znsize". If we declare the current >>> znode_phys_t as znode_phys_v0_t then zp_extra_znsize is the amount of >>> extra space beyond sizeof(znode_phys_v0_t), so 0 for current >>> filesystems. >> This would also require creation a new DMU_OT_ZNODE2 or something >> similarly named. > > Sure. Is it possible to change the DMU_OT type on an existing object? >Not that I know of. You would just allocate new files with the new type.>>> 3.c) zp_extra_znsize would need to be stored in znode_phys_t somewhere. >>> There is lots of unused space in some of the 64-bit fields, but I >>> don''t know how you feel about hacks for this. Possibilities include >>> some bits in zp_flags, zp_pad, high bits in zp_*time nanoseconds, etc. 
>>> It probably only needs to be 8 bytes or so (seems unlikely you will >>> more than double the number of fixed fields in struct znode_phys_t). >>> >> The zp_flags field is off limits. It is going to be used for storing >> additional file attributes such as immutable, nounlink,... > > Ah, OK. I was wondering about that also, but it isn''t in the top 10 > priorities yet. > >> I don''t want to see us overload other fields. We already have several >> pad fields within the znode that could be used. > > OK, I wasn''t sure about what is spoken for already. Is it ZFS policy to > always have 64-bit member fields? Some of the fields (e.g. nanoseconds) > don''t really make sense as 64-bit values, and it would probably be a > waste to have a 64-bit value for zp_extra_znsize.Not an official policy, but we do typically use 64-bit values.> >>> 4.c) It would be best to have some kind of ZAP to store the fast EA data. >>> Ideally it is a very simple kind of ZAP (single buffer), but the >>> microzap format is too restrictive with only a 64-bit value. >>> One of the other Lustre desires is to store additional information in >>> each directory entry (in addition to the object number) like file type >>> and a remote server identifier, and having a single ZAP type that is >>> useful for small entries would be good. Is it possible to go straight >>> to a zap_leaf_phys_t without having a corresponding zap_phys_t first? >>> If yes, then this would be quite useful, otherwise a fat ZAP is too fat >>> to be useful for storing fast EA data and the extended directory info. >> Can you provide a list of what attributes you want to store in the znode >> and what their sizes are? Do you expect ZFS to do anything special with >> these attributes? Should these attributes be exposed to applications? > > The main one is the Lustre logical object volume (LOV) extended attribute > data. This ranges from (commonly) 64 bytes, to as much as 4096 bytes (or > possibly larger once on ZFS). This HAS to be accessed to do anything with > the znode, even stat currently, since the size of a file is distributed > over potentially many servers, so avoiding overhead here is critical. > > In addition to that, there will be similar smallish attributes stored with > each znode like back-pointers from the storage znodes to the metadata znode. > These are on the order of 64 bytes as well. > >> Usually, we only embed attributes in the znode if the file system has >> some sort of semantics associated with them. > > The issue I think is that this data is only useful for Lustre, so reserving > dedicated space for it in a znode is no good. Also, the LOV XATTR might be > very large, so any dedicated space would be wasted. Having a generic and > fast XATTR storage in the znode would help a variety of applications. >How does lustre retrieve the data? Do you expect the data to be preserved via backup utilities?>> One of the original plans, from several years ago was to create a zp_zap >> field in the znode that would be used for storing additional file >> attributes. We never actually did that and the field was turned into >> one of the pad fields in the znode. > > Maybe "file attributes" is the wrong term. These are really XATTRs in the > ZFS sense, so I''ll refer to them as such in the future. 
>Yep, when you say EAs I was assuming small named/value pairs, not the Solaris based XATTR model.>> If the attribute will be needed for every file then it should probably >> be in the znode, but if it is an optional attribute or too big then >> maybe it should be in some sort of overflow object. > > This is what I''m proposing. For small XATTRs they would live in the znode, > and large ones would be stored using the normal ZFS XATTR mechanism (which > is infinitely flexible). Since the Lustre LOV XATTR data is created when > the znode is first allocated, it will always get first crack at using the > fast XATTR space, which is fine since it is right up with the znode data in > importance. >How will you be setting the attributes when the object is created? Do you have a kernel module that would be calling VOP_CREATE()? The reason I ask is that with the ARC cases I listed earlier, you will be able to set additional attributes atomically at the time the file is created. -Mark
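On the 4.c question of what a "very simple ZAP" for the in-bonus EA area could look like: since the microzap only stores 64-bit values, one option is to skip ZAP entirely and use a flat, length-prefixed name/value packing that is walked linearly. The format below is invented purely for illustration -- it is not an existing ZAP or ZFS structure -- and it assumes the buffer starts 8-byte aligned.

    /*
     * Sketch: trivially packed name/value entries for a fast EA area.
     * Entries are laid end to end, padded to 8 bytes; a zero name length
     * terminates the list.  Not an existing on-disk format.
     */
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    #define FAST_EA_ALIGN(x)    (((x) + 7) & ~(size_t)7)

    typedef struct fast_ea_ent {
        uint8_t  fe_namelen;    /* 0 terminates the list */
        uint8_t  fe_pad;
        uint16_t fe_vallen;
        uint32_t fe_pad2;
        /* fe_namelen name bytes, then fe_vallen value bytes, then padding */
    } fast_ea_ent_t;

    /* Find "name"; return a pointer to its value and set *vallen, or NULL. */
    static const void *
    fast_ea_lookup(const uint8_t *buf, size_t buflen,
        const char *name, uint16_t *vallen)
    {
        size_t namelen = strlen(name);
        size_t off = 0;

        while (off + sizeof (fast_ea_ent_t) <= buflen) {
            const fast_ea_ent_t *e = (const void *)(buf + off);
            const uint8_t *ename = buf + off + sizeof (*e);
            size_t entlen = FAST_EA_ALIGN(sizeof (*e) +
                e->fe_namelen + e->fe_vallen);

            if (e->fe_namelen == 0 || off + entlen > buflen)
                break;                      /* end of list or truncated */
            if (e->fe_namelen == namelen &&
                memcmp(ename, name, namelen) == 0) {
                *vallen = e->fe_vallen;
                return (ename + e->fe_namelen);
            }
            off += entlen;
        }
        return (NULL);
    }

A linear scan is fine here because the whole area is at most a few hundred bytes and is already in memory alongside the dnode; the fat-ZAP machinery only pays off at much larger scales.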
Mark Shellenbaum wrote:> Andreas Dilger wrote: >> On Sep 17, 2007 08:31 -0600, Mark Shellenbaum wrote: >>> While not entirely the same thing we will soon have a VFS feature >>> registration mechanism in Nevada. Basically, a file system registers >>> what features it supports. Initially this will be things such as "case >>> insensitivity", "acl on create", "extended vattr_t". >> It''s hard for me to comment on this without more information. I just >> suggested the ext3 mechanism because what I see so far (many features >> being tied to ZFS_VERSION_3, and checking for version >= ZFS_VERSION_3) >> mean that it is really hard to do parallel development of features and >> ensure that the code is actually safe to access the filesystem. >> > > ZFS actually has 3 different version numbers. Anything with ZFS_ is > actually the spa version. The ZPL also has a version associated with it > and will have ZPL_ as its prefix. Within each file is a unique ACL > version. Most of the version changing has happened at the spa level, > but soon the ZPL version will be changing to support some additional > attributes and other things for SMB. > >> For example, if we start developing large dnode + fast EA code we might >> want to ship that out sooner than it can go into a Solaris release. We >> want to make sure that no Solaris code tries to mount such a filesystem >> or it will assert (I think), so we would have to version the fs as v4. >> >> However, maybe Solaris needs some other changes that would require a v4 >> that does not include large dnode + fast EA support (for whatever reason) >> so now we have 2 incompatible codebases that support "v4"... >> >> Do you have a pointer to the upcoming versioning mechanism? >> > > Sure, take a look at: > > http://www.opensolaris.org/os/community/arc/caselog/2007/315/ > http://www.opensolaris.org/os/community/arc/caselog/2007/444/ >Forgot to list the feature registration one. http://www.opensolaris.org/os/community/arc/caselog/2007/227/mail> These describe more than just the feature registration though. > >>>> 3.a) I initially thought that we don''t have to store any extra >>>> information to have a variable znode_phys_t size, because dn_bonuslen >>>> holds this information. However, for symlinks ZFS checks essentially >>>> "zp_size + sizeof(znode_phys_t) < dn_bonuslen" to see if it is a >>>> fast or slow symlink. That implies if sizeof(znode_phys_t) changes >>>> old symlinks on disk will be accessed incorrectly if we don''t have >>>> some extra information about the size of znode_phys_t in each dnode. >>>> >>> There is an existing bug to create symlinks with their own object type. >> I don''t think that will help unless there is an extra mechanism to detect >> whether the symlink is fast or slow, instead of just using the dn_bonuslen. >> Is it possible to store XATTR data on symlinks in Solaris? >> >>>> 3.b) We can call this "zp_extra_znsize". If we declare the current >>>> znode_phys_t as znode_phys_v0_t then zp_extra_znsize is the amount of >>>> extra space beyond sizeof(znode_phys_v0_t), so 0 for current >>>> filesystems. >>> This would also require creation a new DMU_OT_ZNODE2 or something >>> similarly named. >> Sure. Is it possible to change the DMU_OT type on an existing object? >> > > Not that I know of. You would just allocate new files with the new type. > >>>> 3.c) zp_extra_znsize would need to be stored in znode_phys_t somewhere. >>>> There is lots of unused space in some of the 64-bit fields, but I >>>> don''t know how you feel about hacks for this. 
Possibilities include >>>> some bits in zp_flags, zp_pad, high bits in zp_*time nanoseconds, etc. >>>> It probably only needs to be 8 bytes or so (seems unlikely you will >>>> more than double the number of fixed fields in struct znode_phys_t). >>>> >>> The zp_flags field is off limits. It is going to be used for storing >>> additional file attributes such as immutable, nounlink,... >> Ah, OK. I was wondering about that also, but it isn''t in the top 10 >> priorities yet. >> >>> I don''t want to see us overload other fields. We already have several >>> pad fields within the znode that could be used. >> OK, I wasn''t sure about what is spoken for already. Is it ZFS policy to >> always have 64-bit member fields? Some of the fields (e.g. nanoseconds) >> don''t really make sense as 64-bit values, and it would probably be a >> waste to have a 64-bit value for zp_extra_znsize. > > Not an official policy, but we do typically use 64-bit values. > >>>> 4.c) It would be best to have some kind of ZAP to store the fast EA data. >>>> Ideally it is a very simple kind of ZAP (single buffer), but the >>>> microzap format is too restrictive with only a 64-bit value. >>>> One of the other Lustre desires is to store additional information in >>>> each directory entry (in addition to the object number) like file type >>>> and a remote server identifier, and having a single ZAP type that is >>>> useful for small entries would be good. Is it possible to go straight >>>> to a zap_leaf_phys_t without having a corresponding zap_phys_t first? >>>> If yes, then this would be quite useful, otherwise a fat ZAP is too fat >>>> to be useful for storing fast EA data and the extended directory info. >>> Can you provide a list of what attributes you want to store in the znode >>> and what their sizes are? Do you expect ZFS to do anything special with >>> these attributes? Should these attributes be exposed to applications? >> The main one is the Lustre logical object volume (LOV) extended attribute >> data. This ranges from (commonly) 64 bytes, to as much as 4096 bytes (or >> possibly larger once on ZFS). This HAS to be accessed to do anything with >> the znode, even stat currently, since the size of a file is distributed >> over potentially many servers, so avoiding overhead here is critical. >> >> In addition to that, there will be similar smallish attributes stored with >> each znode like back-pointers from the storage znodes to the metadata znode. >> These are on the order of 64 bytes as well. >> >>> Usually, we only embed attributes in the znode if the file system has >>> some sort of semantics associated with them. >> The issue I think is that this data is only useful for Lustre, so reserving >> dedicated space for it in a znode is no good. Also, the LOV XATTR might be >> very large, so any dedicated space would be wasted. Having a generic and >> fast XATTR storage in the znode would help a variety of applications. >> > > How does lustre retrieve the data? Do you expect the data to be > preserved via backup utilities? > >>> One of the original plans, from several years ago was to create a zp_zap >>> field in the znode that would be used for storing additional file >>> attributes. We never actually did that and the field was turned into >>> one of the pad fields in the znode. >> Maybe "file attributes" is the wrong term. These are really XATTRs in the >> ZFS sense, so I''ll refer to them as such in the future. >> > > Yep, when you say EAs I was assuming small named/value pairs, not the > Solaris based XATTR model. 
> >>> If the attribute will be needed for every file then it should probably >>> be in the znode, but if it is an optional attribute or too big then >>> maybe it should be in some sort of overflow object. >> This is what I''m proposing. For small XATTRs they would live in the znode, >> and large ones would be stored using the normal ZFS XATTR mechanism (which >> is infinitely flexible). Since the Lustre LOV XATTR data is created when >> the znode is first allocated, it will always get first crack at using the >> fast XATTR space, which is fine since it is right up with the znode data in >> importance. >> > > How will you be setting the attributes when the object is created? Do > you have a kernel module that would be calling VOP_CREATE()? The reason > I ask is that with the ARC cases I listed earlier, you will be able to > set additional attributes atomically at the time the file is created. > > > -Mark > _______________________________________________ > zfs-code mailing list > zfs-code at opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-code
I suggest that we get together soon for a "dnode summit", if you will, in which we put our various plans on the whiteboard and attempt to do the global optimization. I suspect that Lustre and pNFS, for example, have very similar needs -- it would be great to make them identical. The dnode is a truly core data structure -- we should do everything we can to keep it free of #ifdefs and conditional logic. Andreas, where are you based? When''s your next trip to CA? Jeff On Mon, Sep 17, 2007 at 02:16:17PM -0600, Andreas Dilger wrote:> On Sep 17, 2007 08:31 -0600, Mark Shellenbaum wrote: > > While not entirely the same thing we will soon have a VFS feature > > registration mechanism in Nevada. Basically, a file system registers > > what features it supports. Initially this will be things such as "case > > insensitivity", "acl on create", "extended vattr_t". > > It''s hard for me to comment on this without more information. I just > suggested the ext3 mechanism because what I see so far (many features > being tied to ZFS_VERSION_3, and checking for version >= ZFS_VERSION_3) > mean that it is really hard to do parallel development of features and > ensure that the code is actually safe to access the filesystem. > > For example, if we start developing large dnode + fast EA code we might > want to ship that out sooner than it can go into a Solaris release. We > want to make sure that no Solaris code tries to mount such a filesystem > or it will assert (I think), so we would have to version the fs as v4. > > However, maybe Solaris needs some other changes that would require a v4 > that does not include large dnode + fast EA support (for whatever reason) > so now we have 2 incompatible codebases that support "v4"... > > Do you have a pointer to the upcoming versioning mechanism? > > > >3.a) I initially thought that we don''t have to store any extra > > > information to have a variable znode_phys_t size, because dn_bonuslen > > > holds this information. However, for symlinks ZFS checks essentially > > > "zp_size + sizeof(znode_phys_t) < dn_bonuslen" to see if it is a > > > fast or slow symlink. That implies if sizeof(znode_phys_t) changes > > > old symlinks on disk will be accessed incorrectly if we don''t have > > > some extra information about the size of znode_phys_t in each dnode. > > > > > > > There is an existing bug to create symlinks with their own object type. > > I don''t think that will help unless there is an extra mechanism to detect > whether the symlink is fast or slow, instead of just using the dn_bonuslen. > Is it possible to store XATTR data on symlinks in Solaris? > > > >3.b) We can call this "zp_extra_znsize". If we declare the current > > > znode_phys_t as znode_phys_v0_t then zp_extra_znsize is the amount of > > > extra space beyond sizeof(znode_phys_v0_t), so 0 for current > > > filesystems. > > > > This would also require creation a new DMU_OT_ZNODE2 or something > > similarly named. > > Sure. Is it possible to change the DMU_OT type on an existing object? > > > >3.c) zp_extra_znsize would need to be stored in znode_phys_t somewhere. > > > There is lots of unused space in some of the 64-bit fields, but I > > > don''t know how you feel about hacks for this. Possibilities include > > > some bits in zp_flags, zp_pad, high bits in zp_*time nanoseconds, etc. > > > It probably only needs to be 8 bytes or so (seems unlikely you will > > > more than double the number of fixed fields in struct znode_phys_t). > > > > > > > The zp_flags field is off limits. 
It is going to be used for storing > > additional file attributes such as immutable, nounlink,... > > Ah, OK. I was wondering about that also, but it isn''t in the top 10 > priorities yet. > > > I don''t want to see us overload other fields. We already have several > > pad fields within the znode that could be used. > > OK, I wasn''t sure about what is spoken for already. Is it ZFS policy to > always have 64-bit member fields? Some of the fields (e.g. nanoseconds) > don''t really make sense as 64-bit values, and it would probably be a > waste to have a 64-bit value for zp_extra_znsize. > > > >4.c) It would be best to have some kind of ZAP to store the fast EA data. > > > Ideally it is a very simple kind of ZAP (single buffer), but the > > > microzap format is too restrictive with only a 64-bit value. > > > One of the other Lustre desires is to store additional information in > > > each directory entry (in addition to the object number) like file type > > > and a remote server identifier, and having a single ZAP type that is > > > useful for small entries would be good. Is it possible to go straight > > > to a zap_leaf_phys_t without having a corresponding zap_phys_t first? > > > If yes, then this would be quite useful, otherwise a fat ZAP is too fat > > > to be useful for storing fast EA data and the extended directory info. > > > > Can you provide a list of what attributes you want to store in the znode > > and what their sizes are? Do you expect ZFS to do anything special with > > these attributes? Should these attributes be exposed to applications? > > The main one is the Lustre logical object volume (LOV) extended attribute > data. This ranges from (commonly) 64 bytes, to as much as 4096 bytes (or > possibly larger once on ZFS). This HAS to be accessed to do anything with > the znode, even stat currently, since the size of a file is distributed > over potentially many servers, so avoiding overhead here is critical. > > In addition to that, there will be similar smallish attributes stored with > each znode like back-pointers from the storage znodes to the metadata znode. > These are on the order of 64 bytes as well. > > > Usually, we only embed attributes in the znode if the file system has > > some sort of semantics associated with them. > > The issue I think is that this data is only useful for Lustre, so reserving > dedicated space for it in a znode is no good. Also, the LOV XATTR might be > very large, so any dedicated space would be wasted. Having a generic and > fast XATTR storage in the znode would help a variety of applications. > > > One of the original plans, from several years ago was to create a zp_zap > > field in the znode that would be used for storing additional file > > attributes. We never actually did that and the field was turned into > > one of the pad fields in the znode. > > Maybe "file attributes" is the wrong term. These are really XATTRs in the > ZFS sense, so I''ll refer to them as such in the future. > > > If the attribute will be needed for every file then it should probably > > be in the znode, but if it is an optional attribute or too big then > > maybe it should be in some sort of overflow object. > > This is what I''m proposing. For small XATTRs they would live in the znode, > and large ones would be stored using the normal ZFS XATTR mechanism (which > is infinitely flexible). 
> Since the Lustre LOV XATTR data is created when
> the znode is first allocated, it will always get first crack at using the
> fast XATTR space, which is fine since it is right up with the znode data in
> importance.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Principal Software Engineer
> Cluster File Systems, Inc.
On Sep 17, 2007 15:26 -0700, Jeff Bonwick wrote:
> I suggest that we get together soon for a "dnode summit", if you will,
> in which we put our various plans on the whiteboard and attempt to do
> the global optimization. I suspect that Lustre and pNFS, for example,
> have very similar needs -- it would be great to make them identical.
>
> The dnode is a truly core data structure -- we should do everything
> we can to keep it free of #ifdefs and conditional logic.
>
> Andreas, where are you based? When's your next trip to CA?

I'm in Calgary, Canada. There's some chance I'll be down there on Oct. 1 but it isn't yet certain.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
On Sep 17, 2007 14:43 -0600, Mark Shellenbaum wrote:
> >Andreas Dilger wrote:
> >>0.b The ext2/3/4 format has a very clever IMHO versioning mechanism that
> >>    is superior to just incrementing a version number and forcing all
> >>    implementations to support every previous version's features. See
> >>    http://www.mjmwired.net/kernel/Documentation/filesystems/ext2.txt#224
> >>    for a detailed description of how the features work.
> >>
> >>It's hard for me to comment on this without more information. I just
> >>suggested the ext3 mechanism because what I see so far (many features
> >>being tied to ZFS_VERSION_3, and checking for version >= ZFS_VERSION_3)
> >>means that it is really hard to do parallel development of features and
> >>ensure that the code is actually safe to access the filesystem.
> >
> >ZFS actually has 3 different version numbers. Anything with ZFS_ is
> >actually the spa version. The ZPL also has a version associated with it
> >and will have ZPL_ as its prefix. Within each file is a unique ACL
> >version. Most of the version changing has happened at the spa level,
> >but soon the ZPL version will be changing to support some additional
> >attributes and other things for SMB.
> >
> >Sure, take a look at:
> >
> >http://www.opensolaris.org/os/community/arc/caselog/2007/315/
> >http://www.opensolaris.org/os/community/arc/caselog/2007/444/
>
> Forgot to list the feature registration one.
>
> http://www.opensolaris.org/os/community/arc/caselog/2007/227/mail

So, after finally having had a chance to read these threads, I don't think they relate at all to what I was initially proposing. The feature registration APIs are more related to the users of the filesystem, in terms of what functionality the fs provides. What I was proposing was a mechanism for ZFS internally to be more flexible in terms of forward and backward compatibility between the ZFS code and the on-disk format.

The system attributes discussion may be somewhat related to the fast XATTR proposal. The document doesn't actually specify how these system attributes will be stored internally to ZFS. If they are stored in the znode, that is the best performance-wise. If the system attributes are stored in an XATTR object hung off the znode->zp_xattr then this will probably be a noticeable performance hit for each access.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
Darren J Moffat wrote:
> Andreas Dilger wrote:
>
>> I agree, but I suspect large dnodes could also be of use to ZFS at
>> some point, either for fast EAs and/or small files, so we wanted to
>> get some buy-in from the ZFS developers on an approach that would
>> be suitable for ZFS also. In particular, being able to use the larger
>> dnode space for a variety of reasons (more elements in dn_blkptr[],
>> small file data, fast EA space) is much more desirable than a Lustre-only
>> implementation.
>
> Let me give an alternate view here. This could make ZFS Crypto more
> complex because now data would sometimes be stored inside the dnode. I
> need to think about this a bit more but in general it makes me uneasy,
> it may turn out not to be an issue though.

I thought we were just talking about increasing the potential bonus buffer size. So it is no different than the problem you have today: you need to encrypt the bonus buffer part of the dnode_phys_t, but not the rest of it.

--matt
Matthew Ahrens wrote:
> Darren J Moffat wrote:
>> Andreas Dilger wrote:
>>
>>> I agree, but I suspect large dnodes could also be of use to ZFS at
>>> some point, either for fast EAs and/or small files, so we wanted to
>>> get some buy-in from the ZFS developers on an approach that would
>>> be suitable for ZFS also. In particular, being able to use the larger
>>> dnode space for a variety of reasons (more elements in dn_blkptr[],
>>> small file data, fast EA space) is much more desirable than a
>>> Lustre-only implementation.
>>
>> Let me give an alternate view here. This could make ZFS Crypto more
>> complex because now data would sometimes be stored inside the dnode.
>> I need to think about this a bit more but in general it makes me
>> uneasy, it may turn out not to be an issue though.
>
> I thought we were just talking about increasing the potential bonus
> buffer size. So it is no different than the problem you have today: you
> need to encrypt the bonus buffer part of the dnode_phys_t, but not the
> rest of it.

Ah, okay. Maybe I read too much into it and assumed the dnode_phys_t would have more than the bonus buffer to worry about. If all of this stays in the bonus buffer then yes, it is the same existing problem of ensuring that it gets encrypted.

--
Darren J Moffat
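For the crypto question, the practical consequence of keeping fast EAs inside the bonus buffer is that the "sensitive" byte range of a dnode does not change shape: it is still just the bonus area. A trivial sketch, with all three helpers as invented placeholders for however ZFS Crypto locates and transforms that region:

    /*
     * Sketch only: the ZPL/Lustre-visible payload of a dnode is the bonus
     * area, whether it holds a bare znode or a znode plus fast EAs.  The
     * extern functions are placeholders, not existing interfaces.
     */
    #include <stddef.h>
    #include <stdint.h>

    extern void     *dnode_bonus_ptr(void *dnode_phys);        /* placeholder */
    extern uint16_t  dnode_bonuslen(const void *dnode_phys);   /* placeholder */
    extern void      encrypt_in_place(void *buf, size_t len);  /* placeholder */

    static void
    encrypt_dnode_payload(void *dnode_phys)
    {
        /* Only the bonus bytes carry file data; the rest is DMU metadata. */
        encrypt_in_place(dnode_bonus_ptr(dnode_phys),
            dnode_bonuslen(dnode_phys));
    }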
On Sep 20, 2007 06:09 -0700, Matthew Ahrens wrote:
> Andreas Dilger wrote:
> >>I agree, but I suspect large dnodes could also be of use to ZFS at
> >>some point, either for fast EAs and/or small files, so we wanted to
> >>get some buy-in from the ZFS developers on an approach that would
> >>be suitable for ZFS also. In particular, being able to use the larger
> >>dnode space for a variety of reasons (more elements in dn_blkptr[],
> >>small file data, fast EA space) is much more desirable than a Lustre-only
> >>implementation.
>
> I thought we were just talking about increasing the potential bonus buffer
> size. So it is no different than the problem you have today: you need to
> encrypt the bonus buffer part of the dnode_phys_t, but not the rest of it.

Well, CFS is only immediately interested in fast XATTR storage using a larger dnode bonus buffer. However, I thought it might be worthwhile to discuss using the dn_blkptr[] array itself for fast small file/symlink storage at the same time. It might be too hard to change that at this time, and I won't shed a tear if that is the case, but it is worth some thought. Bill mentioned you had already tried something similar, so we might discount that idea pretty quickly.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
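To make the dn_blkptr[] reuse idea from 2.a/2.b slightly more concrete, a read of a small "embedded" object might look like the sketch below. Everything here is hypothetical -- the flag, the stand-in struct, and the 384-byte figure (three 128-byte block pointers) are for illustration only, and as noted above something along these lines may already have been tried and set aside.

    /*
     * Sketch: small-file reads served from the block-pointer area of an
     * enlarged dnode.  DN_FLAG_EMBEDDED_DATA and the struct are invented;
     * real dnodes have no such flag today.  Invariant assumed:
     * dn_size <= sizeof (dn_embedded) whenever the flag is set.
     */
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    #define DN_FLAG_EMBEDDED_DATA   0x01        /* hypothetical dn_flags bit */

    typedef struct small_dnode {                /* stand-in, not dnode_phys_t */
        uint8_t  dn_flags;
        uint64_t dn_size;                       /* object length in bytes */
        uint8_t  dn_embedded[384];              /* space of ~3 blkptr_t */
    } small_dnode_t;

    /* Returns bytes copied, or -1 if the data is not embedded. */
    static long
    read_embedded(const small_dnode_t *dn, void *buf, size_t len, uint64_t off)
    {
        if (!(dn->dn_flags & DN_FLAG_EMBEDDED_DATA))
            return (-1);            /* fall back to the normal blkptr path */
        if (off >= dn->dn_size)
            return (0);
        if (len > dn->dn_size - off)
            len = (size_t)(dn->dn_size - off);
        memcpy(buf, dn->dn_embedded + off, len);
        return ((long)len);
    }

A write path would additionally have to notice when the data outgrows the embedded area, clear the flag, and push the contents out through real block pointers in the same transaction.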