Hello, as a brief introduction, I'm one of the developers of Lustre (www.lustre.org) at CFS and we are porting Lustre over to use ZFS (well, technically just the DMU) for back-end storage. We currently use a modified ext3/4 filesystem for the back-end storage (both data and metadata) fairly successfully (single filesystems of up to 2PB with up to 500 back-end ext3 file stores, getting 50GB/s aggregate throughput in some installations).

Lustre is a fairly heavy user of extended attributes on the metadata target (MDT) to record virtual file->object mappings, and we'll also begin using EAs more heavily on the object store (OST) in the near future (reverse object->file mappings, for example).

One of the performance improvements we developed early on with ext3 is moving the EA into the inode to avoid seeking and full block writes for small amounts of EA data. The same could also be done to improve small-file performance (though we didn't implement that). For ext3 this meant increasing the inode size from 128 bytes to a format-time constant size of 256 - 4096 bytes (chosen based on the default Lustre EA size for that fs).

My understanding from brief conversations with some of the ZFS developers is that there are already some plans to enlarge the dnode because the dnode bonus buffer is getting close to being full for ZFS. Are there any details of this plan that I could read, or has it been discussed before? Due to the generality of the terms I wasn't able to find anything by search. I wanted to get the ball rolling on the large dnode discussion (which you may have already had internally, I don't know), and start a fast EA discussion in a separate thread.

One of the important design decisions made with the ext3 "large inode" space (beyond the end of the regular inode) was that there is a marker in each inode which records how much of that space was used for "fixed" fields (e.g. nanosecond timestamps, creation time, inode version) at the time the inode was last written. The space beyond "i_extra_isize" is used for extended attribute storage. If an inode is modified and the kernel code wants to store additional "fixed" fields in the inode, it will push the EAs out to external blocks to make room if there isn't enough in-inode space.

By having i_extra_isize stored in each inode (actually the first 16-bit field in large inodes) we are at liberty to add new fields to the inode itself without having to do a scan/update operation on existing inodes (definitely desirable for ZFS also), and we don't have to waste a lot of "reserved" space for potential future expansion or for fields at the end that are not being used (e.g. inode version is only useful for NFSv4 and Lustre). None of the "extra" fields are critical to correct operation, by definition, since the code has existed until now without them... Conversely, we don't force EAs to start at a fixed offset and then use inefficient EA wrapping for small 32- or 64-bit fields.

We also _discussed_ storing ext3 small-file data in an EA on an opportunistic basis, along with more extent data (a la XFS). Are there plans to allow the dn_blkptr[] array to grow on a per-dnode basis to avoid spilling out to an external block for files that are smaller and/or have little/no EA data? Alternately, it would be interesting to store file data in the (enlarged) dn_blkptr[] array for small files to avoid fragmenting the free space within the dnode.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
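(For anyone unfamiliar with the ext3 "large inode" layout described above, a minimal sketch of the space accounting. The 128-byte base inode size is real; the helper function and its arguments are illustrative only, and the small in-inode EA header that ext3 also keeps in that space is ignored here.)

    #include <stdint.h>
    #include <stddef.h>

    /* ext3/4 large-inode layout, roughly:
     *   [ 128-byte base inode | i_extra_isize bytes of fixed fields | in-inode EA space ]
     * i_extra_isize is the first 16-bit field past the base inode, so the
     * in-inode EA area is whatever remains up to the formatted inode size. */
    #define EXT3_GOOD_OLD_INODE_SIZE 128

    static size_t in_inode_ea_space(size_t inode_size, uint16_t i_extra_isize)
    {
            if (inode_size <= EXT3_GOOD_OLD_INODE_SIZE)
                    return 0;       /* old 128-byte inodes have no extra space */
            return inode_size - EXT3_GOOD_OLD_INODE_SIZE - i_extra_isize;
    }

For example, a 256-byte inode with i_extra_isize = 4 leaves about 124 bytes for in-inode EAs (less the small EA header), which is why the format-time inode size was chosen based on the expected Lustre EA size.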
Andreas,

We have explored the idea of increasing the dnode size in the past and discovered that a larger dnode size has a significant negative performance impact on the ZPL (at least with our current caching and read-ahead policies). So we don't have any plans to increase its size generically anytime soon.

However, given that the ZPL isn't the only consumer of datasets, and that Lustre may benefit from a larger dnode size, it may be worth investigating the possibility of supporting multiple dnode sizes within a single pool (this is currently not supported).

Also, note that dnodes already have the notion of "fixed" DMU-specific data and "variable" application-used data (the bonus area). So even in the current code, Lustre has the ability to use 320 bytes of bonus space however it wants.

-Mark
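(A back-of-the-envelope sketch of where Mark's 320-byte figure comes from. The constants below are stated as assumptions about the current on-disk format, a 512-byte dnode with a 64-byte fixed header and 128-byte block pointers, not pulled from the ZFS headers; with a maximally sized bonus area only one embedded block pointer remains.)

    #include <stdio.h>

    /* Bonus space = dnode size minus the fixed dnode header minus the space
     * still reserved for dn_blkptr[].  With today's assumed defaults that
     * leaves the 320 bytes available for application (ZPL/Lustre) use. */
    enum { DNODE_SIZE = 512, DNODE_HEADER = 64, BLKPTR_SIZE = 128, NBLKPTR = 1 };

    int main(void)
    {
            int bonus = DNODE_SIZE - DNODE_HEADER - NBLKPTR * BLKPTR_SIZE;
            printf("bonus area: %d bytes\n", bonus);        /* prints 320 */
            return 0;
    }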
On Sep 13, 2007 15:27 -0600, Mark Maybee wrote:
> We have explored the idea of increasing the dnode size in the past
> and discovered that a larger dnode size has a significant negative
> performance impact on the ZPL (at least with our current caching
> and read-ahead policies). So we don't have any plans to increase
> its size generically anytime soon.

I'm sure it depends a lot on the workload. I don't know the details of how the ZFS allocators work, so it seems possible they always allocate the modified dnode and the corresponding EAs in a contiguous chunk initially, but I suspect that keeping this true over the life of the dnode would put an added burden on the allocator (to know this) or the ZPL (to always mark them dirty to force colocation even if not modified).

I'd also heard that the 48 (or so) bytes that remain in the bonus buffer for ZFS are potentially going to be used up soon, so there would be a desire to have a generic solution to this issue.

One of the reasons the large inode patch made it into the Linux kernel quickly was because it made a big difference for Samba (in addition to Lustre):

http://lwn.net/Articles/112571/

> However, given that the ZPL isn't the only consumer of datasets,
> and that Lustre may benefit from a larger dnode size, it may be
> worth investigating the possibility of supporting multiple dnode
> sizes within a single pool (this is currently not supported).

Without knowing the details, it would seem at first glance that having a variable dnode size would be fairly complex. Aren't the dnodes just stored in a single sparse object and accessed by dnode_size * objid? This does seem desirable from the POV that if you have an existing fs with the current dnode size you don't want to need a reformat in order to use the larger size.

> Also, note that dnodes already have the notion of "fixed" DMU-
> specific data and "variable" application-used data (the bonus
> area). So even in the current code, Lustre has the ability to
> use 320 bytes of bonus space however it wants.

That is true, and we discussed this internally, but one of the internal requirements we have for DMU usage is that it create an on-disk layout that matches ZFS so that it is possible to mount a Lustre filesystem via ZFS or ZFS-FUSE (and potentially the reverse in the future). This will allow us to do problem diagnosis and also leverage any ZFS scanning/verification tools that may be developed.
Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
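(To make the "dnode_size * objid" point concrete, a small sketch of how a dnode would be located in the meta-dnode object. The 16K metadnode block size is an assumption for illustration, and the function is not DMU code.)

    #include <stdint.h>
    #include <stdio.h>

    /* Dnodes live packed end-to-end in a single sparse object (the metadnode),
     * so finding dnode N is just arithmetic.  A larger per-dataset dnode size
     * only changes the constant: fewer dnodes fit per metadnode block. */
    #define METADNODE_BLKSIZE (16 * 1024)   /* assumed metadnode block size */

    static void locate_dnode(uint64_t objid, uint32_t dnode_size)
    {
            uint64_t offset = objid * dnode_size;
            uint64_t blkid  = offset / METADNODE_BLKSIZE;
            uint32_t within = (uint32_t)(offset % METADNODE_BLKSIZE);

            printf("object %llu: metadnode block %llu, offset %u (%u dnodes/block)\n",
                   (unsigned long long)objid, (unsigned long long)blkid,
                   within, METADNODE_BLKSIZE / dnode_size);
    }

    int main(void)
    {
            locate_dnode(12345, 512);       /* today's dnode size */
            locate_dnode(12345, 1024);      /* a hypothetical larger dnode */
            return 0;
    }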
The performance benchmarks that Mark refers to are valid for our current ZPL implementation. That is, the bonus buffer only contains the znode and symlink contents. If, however, we had an application that always had an extended attribute, and that extended attribute was frequently accessed, then I think there would be (as Andreas points out) a significant performance advantage to having the XATTR in the dnode somewhere.

I think there are a couple of issues here. The first one is to allow each dataset to have its own dnode size. While conceptually not all that hard, it would take some re-jiggering of the code to make most of the #defines turn into per-dataset variables. But it should be pretty straightforward, and probably not a bad idea in general.

The other issue is a little more sticky. My understanding is that Lustre-on-DMU plans to use the same data structures as the ZPL. That way, you can mount the Lustre metadata or object stores as a regular filesystem. Given this, the question is what changes, if any, should be made to the ZPL to accommodate. Allowing the ZPL to deal with non-512-byte dnodes is probably not that bad. The question is whether or not the ZPL should be made to understand the extended attributes (or whatever) that are stored in the rest of the bonus buffer.

While the Lustre guys may be the first to venture into this area, it will come up anyway with pNFS or the CIFS server, so we should probably spend some brain cycles thinking about the best way to have extra data (of various sorts) in larger-than-normal dnodes that the ZPL can deal with.

A simple plan may be that the first extended attribute is stored in the bonus buffer (if it fits). I don't know if this would require the same logic we used to have that placed small file contents in the bonus buffer. Unfortunately, that code was *way* complicated and was ripped out some time ago.

If the bonus buffer containing an extended attribute won't work, the question becomes how to put the Lustre LOV data into the dnode/znode so we get the performance benefits, but using an implementation that we can live with. Of course one option would be to give up on the Lustre/ZPL compatibility, but I don't think that's such a good plan. Like I mentioned earlier, I think that pNFS and CIFS will wind up running into similar issues, so we'll have to deal with such a thing sooner or later.

Ideas?


--Bill
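(A hypothetical sketch of the "store the first EA in the bonus buffer if it fits" test Bill describes. The layout, the tiny header, and the sizes below are made up for illustration; they are not an existing ZFS on-disk format.)

    #include <stdint.h>
    #include <string.h>

    /* Hypothetical layout: [ znode | 4-byte header | name\0 | value ] packed
     * into the dnode bonus area, spilling to the usual EA directory object
     * when it does not fit.  All sizes here are assumptions. */
    #define BONUS_SPACE   320     /* bonus area available to the ZPL */
    #define ZNODE_SIZE    264     /* assumed sizeof(znode_phys_t) */
    #define EA_HDR_SIZE   4       /* made-up length/type header */

    static int ea_fits_in_bonus(const char *name, uint32_t value_len)
    {
            uint32_t need = EA_HDR_SIZE + (uint32_t)strlen(name) + 1 + value_len;
            return ZNODE_SIZE + need <= BONUS_SPACE;
    }

With today's 320-byte bonus and a znode in the mid-200-byte range there are only a few dozen bytes of slack, which is roughly why the discussion keeps coming back to larger dnodes.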
Andreas Dilger wrote:
> On Sep 13, 2007 15:27 -0600, Mark Maybee wrote:
>> We have explored the idea of increasing the dnode size in the past
>> and discovered that a larger dnode size has a significant negative
>> performance impact on the ZPL (at least with our current caching
>> and read-ahead policies). So we don't have any plans to increase
>> its size generically anytime soon.
>
> I'm sure it depends a lot on the workload. I don't know the details
> of how the ZFS allocators work, so it seems possible they always
> allocate the modified dnode and the corresponding EAs in a contiguous
> chunk initially, but I suspect that keeping this true over the life
> of the dnode would put an added burden on the allocator (to know this)
> or the ZPL (to always mark them dirty to force colocation even if not
> modified).
>
> I'd also heard that the 48 (or so) bytes that remain in the bonus buffer
> for ZFS are potentially going to be used up soon, so there would be a
> desire to have a generic solution to this issue.

You seem to have a line on a lot of internal development details :-).

> One of the reasons the large inode patch made it into the Linux
> kernel quickly was because it made a big difference for Samba
> (in addition to Lustre):
>
> http://lwn.net/Articles/112571/
>
>> However, given that the ZPL isn't the only consumer of datasets,
>> and that Lustre may benefit from a larger dnode size, it may be
>> worth investigating the possibility of supporting multiple dnode
>> sizes within a single pool (this is currently not supported).
>
> Without knowing the details, it would seem at first glance that
> having a variable dnode size would be fairly complex. Aren't the
> dnodes just stored in a single sparse object and accessed by
> dnode_size * objid? This does seem desirable from the POV that
> if you have an existing fs with the current dnode size you don't
> want to need a reformat in order to use the larger size.

I was referring here to supporting multiple dnode sizes within a *pool*, but the size would still remain fixed for a given dataset (see Bill's mail). This is a much simpler concept to implement.

>> Also, note that dnodes already have the notion of "fixed" DMU-
>> specific data and "variable" application-used data (the bonus
>> area). So even in the current code, Lustre has the ability to
>> use 320 bytes of bonus space however it wants.
>
> That is true, and we discussed this internally, but one of the internal
> requirements we have for DMU usage is that it create an on-disk layout
> that matches ZFS so that it is possible to mount a Lustre filesystem
> via ZFS or ZFS-FUSE (and potentially the reverse in the future).
> This will allow us to do problem diagnosis and also leverage any ZFS
> scanning/verification tools that may be developed.

Ah, interesting, I was not aware of this requirement. It would not be difficult to allow the ZPL to work with a larger dnode size (in fact it's pretty much a noop as long as the ZPL is not trying to use any of the extra space in the dnode).
On Sep 13, 2007, at 5:48 PM, Bill Moore wrote:
> While the Lustre guys may be the first to venture into this area, it
> will come up anyway with pNFS or the CIFS server, so we should probably
> spend some brain cycles thinking about the best way to have extra data
> (of various sorts) in larger-than-normal dnodes that the ZPL can deal
> with.

Yeah, the pNFS metadata server is going to use EAs for the layout information. The pNFS data server is bypassing the ZPL and going directly to the DMU.

For the pNFS people, do you have any feeling how big of an EA you will need for the layout information? Are you planning on using just one EA? I'm wondering if the bonus buffer of 320 bytes would suffice.

eric
On Sep 14, 2007 08:52 -0600, Mark Maybee wrote:
>> Without knowing the details, it would seem at first glance that
>> having a variable dnode size would be fairly complex. Aren't the
>> dnodes just stored in a single sparse object and accessed by
>> dnode_size * objid? This does seem desirable from the POV that
>> if you have an existing fs with the current dnode size you don't
>> want to need a reformat in order to use the larger size.
>
> I was referring here to supporting multiple dnode sizes within a
> *pool*, but the size would still remain fixed for a given dataset
> (see Bill's mail). This is a much simpler concept to implement.

Ah, sure. That would be a lot easier to implement.

>> That is true, and we discussed this internally, but one of the internal
>> requirements we have for DMU usage is that it create an on-disk layout
>> that matches ZFS so that it is possible to mount a Lustre filesystem
>> via ZFS or ZFS-FUSE (and potentially the reverse in the future).
>> This will allow us to do problem diagnosis and also leverage any ZFS
>> scanning/verification tools that may be developed.
>
> Ah, interesting, I was not aware of this requirement. It would not be
> difficult to allow the ZPL to work with a larger dnode size (in fact
> it's pretty much a noop as long as the ZPL is not trying to use any of
> the extra space in the dnode).

I agree, but I suspect large dnodes could also be of use to ZFS at some point, either for fast EAs and/or small files, so we wanted to get some buy-in from the ZFS developers on an approach that would be suitable for ZFS also. In particular, being able to use the larger dnode space for a variety of reasons (more elements in dn_blkptr[], small file data, fast EA space) is much more desirable than a Lustre-only implementation.

Also, being able to access the EAs via the ZPL when mounted as ZFS would be important for debugging/backup/restore/etc.

I suspect the Lustre development approach would be the same with ZFS as it is with ext3, which has been quite successful to this point. Namely, we're happy to develop new functionality in ZFS/DMU as needed so long as we get buy-in from the ZFS team on the design and most importantly the on-disk format. We don't want to create a permanent fork in the code or on-disk format that separates Lustre-ZFS from Solaris-ZFS, which is the whole point of starting this discussion long before we're going to start implementing anything.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
On Sep 14, 2007, at 11:09 AM, Andreas Dilger wrote:
> I suspect the Lustre development approach would be the same with ZFS
> as it is with ext3, which has been quite successful to this point.
> Namely, we're happy to develop new functionality in ZFS/DMU as needed
> so long as we get buy-in from the ZFS team on the design and most
> importantly the on-disk format. We don't want to create a permanent
> fork in the code or on-disk format that separates Lustre-ZFS from
> Solaris-ZFS, which is the whole point of starting this discussion long
> before we're going to start implementing anything.

Absolutely, let's make sure we all agree on the on-disk changes. This has been a major focus for us when working with the OSX and FreeBSD people. So far we've been quite successful (and I don't see any reason why we won't be in the future). It's great to hear you want the same thing. Another nice thing is that ZFS was designed to support on-disk changes (see zpool upgrade).

eric
On Sep 13, 2007 17:48 -0700, Bill Moore wrote:
> I think there are a couple of issues here. The first one is to allow
> each dataset to have its own dnode size. While conceptually not all
> that hard, it would take some re-jiggering of the code to make most of
> the #defines turn into per-dataset variables. But it should be pretty
> straightforward, and probably not a bad idea in general.

Agreed.

> The other issue is a little more sticky. My understanding is that
> Lustre-on-DMU plans to use the same data structures as the ZPL. That
> way, you can mount the Lustre metadata or object stores as a regular
> filesystem. Given this, the question is what changes, if any, should be
> made to the ZPL to accommodate. Allowing the ZPL to deal with
> non-512-byte dnodes is probably not that bad. The question is whether
> or not the ZPL should be made to understand the extended attributes (or
> whatever) that is stored in the rest of the bonus buffer.

There are a couple of approaches I can propose, but since I'm only at the level of a ZFS code newbie I can't weigh how easy/hard it would be to implement them. This is really just at the brainstorming stage for many of them, and we may want to split details into separate threads.

    typedef struct dnode_phys {
            uint8_t         dn_type;
            uint8_t         dn_indblkshift;
            uint8_t         dn_nlevels = 3
            uint8_t         dn_nblkptr = 3
            uint8_t         dn_bonustype;
            uint8_t         dn_checksum;
            uint8_t         dn_compress;
            uint8_t         dn_pad[1];
            uint16_t        dn_datablkszsec;
            uint16_t        dn_bonuslen;
            uint8_t         dn_pad2[4];
            uint64_t        dn_maxblkid;
            uint64_t        dn_secphys;
            uint64_t        dn_pad3[4];
            blkptr_t        dn_blkptr[dn_nblkptr];
            uint8_t         dn_bonus[BONUSLEN];
    } dnode_phys_t;

    typedef struct znode_phys {
            uint64_t        zp_atime[2];
            uint64_t        zp_mtime[2];
            uint64_t        zp_ctime[2];
            uint64_t        zp_crtime[2];
            uint64_t        zp_gen;
            uint64_t        zp_mode;
            uint64_t        zp_size;
            uint64_t        zp_parent;
            uint64_t        zp_links;
            uint64_t        zp_xattr;
            uint64_t        zp_rdev;
            uint64_t        zp_flags;
            uint64_t        zp_uid;
            uint64_t        zp_gid;
            uint64_t        zp_pad[4];
            zfs_znode_acl_t zp_acl;
    } znode_phys_t;

There are several issues that I think should be addressed with a single design, since they are closely related:

0) versioning of the filesystem
1) variable dnode_phys_t size (per dataset, to start with at least)
2) fast small files (per dnode)
3) variable znode_phys_t size (per dnode)
4) fast extended attributes (per dnode)

Lustre doesn't really care about (3) per se, and not very much about (2) right now, but we may as well address it at the same time as the others.

Versioning of the filesystem
============================

0.a If we are changing the on-disk layout we have to pay attention to on-disk compatibility and ensure older ZFS code does not fail badly. I don't think it is possible to make all of the changes being proposed here in a way that is compatible with existing code, so we need to version the changes in some manner.

0.b The ext2/3/4 format has a very clever (IMHO) versioning mechanism that is superior to just incrementing a version number and forcing all implementations to support every previous version's features. See http://www.mjmwired.net/kernel/Documentation/filesystems/ext2.txt#224 for a detailed description of how the features work. The gist is that instead of the "version" being an incrementing digit it is instead a bitmask of features.

0.c It would be possible to modify ZFS to use ext2-like feature flags. We would have to special-case the bits 0x00000001 and 0x00000002 that represent the different features of ZFS_VERSION_3 currently.
All new features would still increment the "version number" (which would become the "INCOMPAT" version field) so old code would still refuse to mount it, but instead of being sequential versions we now get power-of-two jumps in the version number. It is no longer required that ZFS support a strict superset of all changes that the Lustre ZFS code implements immediately, and it is possible to develop and support these changes in parallel, and land them in a safe, piecewise manner (or never, as sometimes happens with features that die off). (A rough sketch of such a feature-flag check is at the end of this message.)

Variable dnode_phys_t size
==========================

1.a) I think everyone agrees that for a per-dataset fixed value this is "just" a matter of changing all the code in a mechanical fashion. I'll just ignore the issue of being able to increase this in an existing dataset for now.

1.b) My understanding is that dn_bonuslen covers ALL of the ZPL-accessible data (i.e. it is a layering violation to try and access anything beyond dn_bonuslen, and in fact the buffer may not even contain any valid data or conceivably even segfault). That means any data used by the ZPL (and by extension Lustre, which wants to maintain format compatibility) needs to live inside dn_bonuslen.

1.c) With a larger dnode, it is possible to have more elements in dn_blkptr[] on a per-dnode basis. I have no feeling for the relative performance gains of storing 5 or 12 blocks in the dnode but it can't hurt, I think. Avoiding a seek for files < 10*128kB is still good. It seems dnode_allocate() already takes this into account, based on bonuslen at the time of dnode creation.

1.d) It currently doesn't seem possible to change dn_bonuslen on an existing object (dnode_reallocate() will truncate all the file data in that case?), so we'd need some mechanism to push data blocks into an external blkptr in this case (hopefully not impossible given that the pointer to the bonus buffer might change?).

1.e) For a Lustre metadata server (which never stores file data) it may even be useful to allow dn_nblkptr = 0 to reclaim the 128-byte blkptr for EAs. That is a relatively minor improvement, and it seems the DMU would currently not be very happy with that.

Fast small files
================

2.a) This means storing small files within the dnode itself. Since (AFAICS) the ZPL code is correctly layered atop the DMU, it has no idea how or where the data for a file is actually stored. This leaves the possibility of storing small file data within the dn_blkptr[] array, which at 128 bytes/blkptr is fairly significant (larger than the shrinking symlink space), especially if we have a larger dnode which may have a bunch of free space in it. For a 1024-byte dnode+znode we would have 760 bytes of contiguous space, and that covers 1/3 of the files in my /etc, /bin, /lib, /usr/bin, /usr/lib, and /var.

2.b) The DMU of course assumes the dn_blkptr contents are valid (after verifying the checksums), so we'd need a mechanism (dn_flag, dn_type, dn_compress, dn_datablkszsec?) that indicates whether this is "packed inline" data or blkptr_t data. At first glance I like "dn_compress" the best, but there would still have to be some special casing to avoid handling the "blkptr" in the normal way.

Variable znode_phys_t size
==========================

3.a) I initially thought that we don't have to store any extra information to have a variable znode_phys_t size, because dn_bonuslen holds this information. However, for symlinks ZFS checks essentially "zp_size + sizeof(znode_phys_t) < dn_bonuslen" to see if it is a fast or slow symlink.
That implies if sizeof(znode_phys_t) changes, old symlinks on disk will be accessed incorrectly if we don't have some extra information about the size of znode_phys_t in each dnode.

3.b) We can call this "zp_extra_znsize". If we declare the current znode_phys_t as znode_phys_v0_t, then zp_extra_znsize is the amount of extra space beyond sizeof(znode_phys_v0_t), so 0 for current filesystems.

3.c) zp_extra_znsize would need to be stored in znode_phys_t somewhere. There is lots of unused space in some of the 64-bit fields, but I don't know how you feel about hacks for this. Possibilities include some bits in zp_flags, zp_pad, high bits in zp_*time nanoseconds, etc. It probably only needs to be 8 bits or so (it seems unlikely you will more than double the number of fixed fields in struct znode_phys_t).

3.d) We might consider some symlink-specific mechanism to indicate fast/slow symlinks (e.g. a flag) instead of depending on sizes, which I always found fragile in ext3 also, and which was the source of several bugs.

3.e) We may instead consider (2.a) for symlinks at that point, since there is no reason to fear writing 60-byte files anymore (same performance, different (larger!) location for symlink data).

3.f) When ZFS code is accessing new fields declared in znode_phys_t it has to verify whether they are beyond dn_bonuslen and zp_extra_znsize to know if those fields are actually valid on disk.

Finally,

Fast extended attributes
========================

4.a) Unfortunately, due to (1.b), I don't think we can just store the EA in the dnode after the bonus buffer.

4.b) Putting the EA in the bonus buffer requires (3.a, 3.b) to be addressed. At that point (symlinks possibly excepted, depending on whether 3.e is used) the EA space would be:

    (dn_bonuslen - sizeof(znode_phys_v0_t) - zp_extra_znsize)

For existing symlinks we'd have to also reduce this by zp_size. (A small sketch of this arithmetic also follows at the end of this message.)

4.c) It would be best to have some kind of ZAP to store the fast EA data. Ideally it is a very simple kind of ZAP (single buffer), but the microzap format is too restrictive with only a 64-bit value. One of the other Lustre desires is to store additional information in each directory entry (in addition to the object number), like file type and a remote server identifier, and having a single ZAP type that is useful for small entries would be good. Is it possible to go straight to a zap_leaf_phys_t without having a corresponding zap_phys_t first? If yes, then this would be quite useful; otherwise a fat ZAP is too fat to be useful for storing fast EA data and the extended directory info.

Apologies for the long email, but I think all of these issues are related and best addressed with a single design, even if they are implemented in a piecemeal fashion. None of these features are blockers for Lustre implementation atop ZFS/DMU, but nobody wants the performance to be bad.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
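(As referenced in 0.c above, a minimal sketch of an ext2-style feature-flag check, written generically; the struct, field, and flag names here are hypothetical, not existing ZFS on-disk fields.)

    #include <stdint.h>

    /* ext2-style feature masks, hypothetical names for illustration only:
     * compat:    unknown bits are harmless, mount read-write anyway
     * ro_compat: unknown bits are safe to read but not to modify
     * incompat:  unknown bits mean the code must refuse to mount at all */
    struct feature_flags {
            uint32_t compat;
            uint32_t ro_compat;
            uint32_t incompat;
    };

    #define SUPP_COMPAT     0x00000001      /* e.g. fast EAs in the bonus buffer */
    #define SUPP_RO_COMPAT  0x00000000
    #define SUPP_INCOMPAT   0x00000003      /* e.g. the two ZFS_VERSION_3 bits */

    /* returns: 0 = mount read-write, 1 = mount read-only, -1 = refuse */
    static int check_features(const struct feature_flags *f)
    {
            if (f->incompat & ~SUPP_INCOMPAT)
                    return -1;
            if (f->ro_compat & ~SUPP_RO_COMPAT)
                    return 1;
            return 0;       /* unknown compat bits are ignored by design */
    }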
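(And the 4.b arithmetic spelled out, using the names proposed above; the znode_phys_v0_t size is left as a parameter since it is whatever sizeof() says on a given build, and the function is purely illustrative.)

    #include <stdint.h>

    /* Fast-EA space per 4.b: whatever bonus space remains after the fixed
     * znode fields (old size + zp_extra_znsize), minus the in-bonus symlink
     * target for existing fast symlinks. */
    static uint32_t fast_ea_space(uint32_t dn_bonuslen,
                                  uint32_t znode_v0_size,   /* sizeof(znode_phys_v0_t) */
                                  uint32_t zp_extra_znsize,
                                  uint32_t symlink_len)     /* zp_size for fast symlinks, else 0 */
    {
            uint32_t used = znode_v0_size + zp_extra_znsize + symlink_len;
            return dn_bonuslen > used ? dn_bonuslen - used : 0;
    }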
The "Fast extended attributes" is of great interest to us in the Mac OS X camp. Historically most files have 32 bytes of "Finder Info" which we are currently storing as an EA. Fast access to this info would be a great gain for us. We also are seeing more and more EAs used in Mac OS X 10.5 (many with small data) so we would be interested in some sort of generic fast EAs (ie embedded) or at least fast access to their names. -Don On Sep 15, 2007, at 4:19 PM, Andreas Dilger wrote:> On Sep 13, 2007 17:48 -0700, Bill Moore wrote: >> I think there are a couple of issues here. The first one is to allow >> each dataset to have its own dnode size. While conceptually not all >> that hard, it would take some re-jiggering of the code to make most >> of >> the #defines turn into per-dataset variables. But it should be >> pretty >> straightforward, and probably not a bad idea in general. > > Agreed. > >> The other issue is a little more sticky. My understanding is that >> Lustre-on-DMU plans to use the same data structures as the ZPL. That >> way, you can mount the Lustre metadata or object stores as a regular >> filesystem. Given this, the question is what changes, if any, >> should be >> made to the ZPL to accommodate. Allowing the ZPL to deal with >> non-512-byte dnodes is probably not that bad. The question is >> whether >> or not the ZPL should be made to understand the extended attributes >> (or >> whatever) that is stored in the rest of the bonus buffer. > > There are a couple of approaches I can propose, but since I''m only at > the level of ZFS code newbie I can''t weigh weigh how easy/hard it > would > be to implement them. This is really just at the brainstorming stage > for many of them, and we may want to split details into separate > threads. > > typedef struct dnode_phys { > uint8_t dn_type; > uint8_t dn_indblkshift; > uint8_t dn_nlevels = 3 > uint8_t dn_nblkptr = 3 > uint8_t dn_bonustype; > uint8_t dn_checksum; > uint8_t dn_compress; > uint8_t dn_pad[1]; > uint16_t dn_datablkszsec; > uint16_t dn_bonuslen; > uint8_t dn_pad2[4]; > uint64_t dn_maxblkid; > uint64_t dn_secphys; > uint64_t dn_pad3[4]; > blkptr_t dn_blkptr[dn_nblkptr]; > uint8_t dn_bonus[BONUSLEN] > } dnode_phys_t; > > typedef struct znode_phys { > uint64_t zp_atime[2]; > uint64_t zp_mtime[2]; > uint64_t zp_ctime[2]; > uint64_t zp_crtime[2]; > uint64_t zp_gen; > uint64_t zp_mode; > uint64_t zp_size; > uint64_t zp_parent; > uint64_t zp_links; > uint64_t zp_xattr; > uint64_t zp_rdev; > uint64_t zp_flags; > uint64_t zp_uid; > uint64_t zp_gid; > uint64_t zp_pad[4]; > zfs_znode_acl_t zp_acl; > } znode_phys_t > > There are several issues that I think should be addressed with a > single > design, since they are closely related: > 0) versioning of the filesystem > 1) variable dnode_phys_t size (per dataset, to start with at least) > 2) fast small files (per dnode) > 3) variable znode_phys_t size (per dnode) > 4) fast extended attributes (per dnode) > > Lustre doesn''t really care about (3) per-se, and not very much about > (2) > right now but we may as well address it at the same time as the > others. > > Versioning of the filesystem > ===========================> 0.a If we are changing the on-disk layout we have to pay attention to > on-disk compatibility and ensure older ZFS code does not fail badly. > I don''t think it is possible to make all of the changes being > proposed here in a way that is compatible with existing code so we > need to version the changes in some manner. 
For a 1024-byte dnode+znode > we would have 760 bytes of contiguous space, and that covers 1/3 > of the files in my /etc, /bin, /lib, /usr/bin, /usr/lib, and /var. > > 2.b The DMU of course assumes the dn_blkptr contents are valid (after > verifying the checksums) so we''d need a mechanism (dn_flag, dn_type, > dn_compress, dn_datablkszsec?) that indicated whether this was > "packed inline" data or blkptr_t data. At first glance I like > "dn_compress" the best, but there would still have to be some > special > casing to avoid handling the "blkptr" in the normal way. > > Variable znode_phys_t size > =========================> 3.a) I initially thought that we don''t have to store any extra > information to have a variable znode_phys_t size, because > dn_bonuslen > holds this information. However, for symlinks ZFS checks > essentially > "zp_size + sizeof(znode_phys_t) < dn_bonuslen" to see if it is a > fast or slow symlink. That implies if sizeof(znode_phys_t) changes > old symlinks on disk will be accessed incorrectly if we don''t have > some extra information about the size of znode_phys_t in each dnode. > > 3.b) We can call this "zp_extra_znsize". If we declare the current > znode_phys_t as znode_phys_v0_t then zp_extra_znsize is the amount > of > extra space beyond sizeof(znode_phys_v0_t), so 0 for current > filesystems. > > 3.c) zp_extra_znsize would need to be stored in znode_phys_t > somewhere. > There is lots of unused space in some of the 64-bit fields, but I > don''t know how you feel about hacks for this. Possibilities include > some bits in zp_flags, zp_pad, high bits in zp_*time nanoseconds, > etc. > It probably only needs to be 8 bytes or so (seems unlikely you will > more than double the number of fixed fields in struct znode_phys_t). > > 3.d) We might consider some symlink-specific mechanism to incidate > fast/slow symlinks (e.g. a flag) instead of depending on sizes, > which I always found fragile in ext3 also, and was the source of > several bugs. > > 3.e) We may instead consider (2.a) for symlinks a that point, since > there > is no reason to fear writing 60-byte files anymore (same > performance, > different (larger!) location for symlink data). > > 3.f) When ZFS code is accessing new fields declared in znode_phys_t > it has > to verify whether they are beyond dn_bonuslen and zp_extra_znsize to > know if those fields are actually valid on disk. > > Finally, > > Fast extended attributes > =======================> 4.a) Unfortunately, due to (1.b), I don''t think we can just store the > EA in the dnode after the bonus buffer. > > 4.b) Putting the EA in the bonus buffer requires (3.a, 3.b) to be > addressed. > At that point (symlinks possibly excepted, depending on whether 3.e > is used) the EA space would be: > > (dn_bonuslen - sizeof(znode_phys_v0_t) - zp_extra_znsize) > > For existing symlinks we''d have to also reduce this by zp_size. > > 4.c) It would be best to have some kind of ZAP to store the fast EA > data. > Ideally it is a very simple kind of ZAP (single buffer), but the > microzap format is too restrictive with only a 64-bit value. > One of the other Lustre desires is to store additional information > in > each directory entry (in addition to the object number) like file > type > and a remote server identifier, and having a single ZAP type that is > useful for small entries would be good. Is it possible to go > straight > to a zap_leaf_phys_t without having a corresponding zap_phys_t > first? 
> If yes, then this would be quite useful, otherwise a fat ZAP is too fat
> to be useful for storing fast EA data and the extended directory info.
>
> Apologies for the long email, but I think all of these issues are related
> and best addressed with a single design even if they are implemented in
> a piecemeal fashion. None of these features are blockers for Lustre
> implementation atop ZFS/DMU but nobody wants the performance to be bad.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Principal Software Engineer
> Cluster File Systems, Inc.
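As a rough illustration of the ext2-style scheme in 0.b/0.c, a feature-mask compatibility check might look like the sketch below. The ZFS_FEATURE_* names, the three-mask structure, and the function are all invented for this example; ZFS today has only the single incrementing version number being discussed.

    /*
     * Sketch only: ext2-style feature masks applied to a ZFS-like
     * superblock.  None of these names exist in ZFS today.
     */
    #include <stdint.h>

    typedef struct zfs_sb_features {
        uint64_t compat;     /* older code may mount and write safely */
        uint64_t ro_compat;  /* older code may mount read-only */
        uint64_t incompat;   /* older code must refuse to mount */
    } zfs_sb_features_t;

    /* Hypothetical feature bits; one power-of-two bit per feature. */
    #define ZFS_FEATURE_INCOMPAT_LARGE_DNODE  0x0000000000000001ULL
    #define ZFS_FEATURE_INCOMPAT_FAST_XATTR   0x0000000000000002ULL

    /* Feature bits this particular implementation understands. */
    #define ZFS_FEATURE_INCOMPAT_SUPPORTED \
            (ZFS_FEATURE_INCOMPAT_LARGE_DNODE)
    #define ZFS_FEATURE_RO_COMPAT_SUPPORTED   0ULL

    /* Return 0 if mountable, 1 if only read-only is safe, -1 if not mountable. */
    static int
    zfs_check_features(const zfs_sb_features_t *f)
    {
        if (f->incompat & ~ZFS_FEATURE_INCOMPAT_SUPPORTED)
            return (-1);            /* unknown incompatible feature */
        if (f->ro_compat & ~ZFS_FEATURE_RO_COMPAT_SUPPORTED)
            return (1);             /* mount read-only at most */
        return (0);                 /* unknown COMPAT bits are harmless */
    }

The point of the three masks is that a feature developed out of tree (say, large dnodes for Lustre) only has to claim one INCOMPAT bit; it does not force every implementation to support a strict superset of every lower "version number".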
Andreas Dilger wrote:
> I agree, but I suspect large dnodes could also be of use to ZFS at
> some point, either for fast EAs and/or small files, so we wanted to
> get some buy-in from the ZFS developers on an approach that would
> be suitable for ZFS also. In particular, being able to use the larger
> dnode space for a variety of reasons (more elements in dn_blkptr[],
> small file data, fast EA space) is much more desirable than a Lustre-only
> implementation.

Let me give an alternate view here. This could make ZFS Crypto more complex, because data would now sometimes be stored inside the dnode. I need to think about this a bit more; in general it makes me uneasy, though it may turn out not to be an issue. See:

http://opensolaris.org/os/project/zfs-crypto/phase1/dmu_ot/

for my current plan of what DMU object types get encrypted.

--
Darren J Moffat
> There are several issues that I think should be addressed with a single > design, since they are closely related: > 0) versioning of the filesystem > 1) variable dnode_phys_t size (per dataset, to start with at least) > 2) fast small files (per dnode) > 3) variable znode_phys_t size (per dnode) > 4) fast extended attributes (per dnode) > > Lustre doesn''t really care about (3) per-se, and not very much about (2) > right now but we may as well address it at the same time as the others. > > Versioning of the filesystem > ===========================> 0.a If we are changing the on-disk layout we have to pay attention to > on-disk compatibility and ensure older ZFS code does not fail badly. > I don''t think it is possible to make all of the changes being > proposed here in a way that is compatible with existing code so we > need to version the changes in some manner. > > 0.b The ext2/3/4 format has a very clever IMHO versioning mechanism that > is superior to just incrementing a version number and forcing all > implementations to support every previous version''s features. See > http://www.mjmwired.net/kernel/Documentation/filesystems/ext2.txt#224 > for a detailed description of how the features work. The gist is > that instead of the "version" being an incrementing digit it is > instead a bitmask of features. > > 0.c It would be possible to modify ZFS to use ext2-like feature flags. > We would have to special-case the bits 0x00000001 and 0x00000002 > that represent the different features of ZFS_VERSION_3 currently. > All new features would still increment the "version number" (which > would become the "INCOMPAT" version field) so old code would still > refuse to mount it, but instead of being sequential versions we now > get power-of-two jumps in the version number. It is no longer required > that ZFS support a strict superset of all changes that the Lustre ZFS > code implements immediately, and it is possible to develop and support > these changes in parallel, and land them in a safe, piecewise manner > (or never, as sometimes happens with features that die off) >While not entirely the same thing we will soon have a VFS feature registration mechanism in Nevada. Basically, a file system registers what features it supports. Initially this will be things such as "case insensitivity", "acl on create", "extended vattr_t".> Variable znode_phys_t size > =========================> 3.a) I initially thought that we don''t have to store any extra > information to have a variable znode_phys_t size, because dn_bonuslen > holds this information. However, for symlinks ZFS checks essentially > "zp_size + sizeof(znode_phys_t) < dn_bonuslen" to see if it is a > fast or slow symlink. That implies if sizeof(znode_phys_t) changes > old symlinks on disk will be accessed incorrectly if we don''t have > some extra information about the size of znode_phys_t in each dnode. >There is an existing bug to create symlinks with their own object type.> 3.b) We can call this "zp_extra_znsize". If we declare the current > znode_phys_t as znode_phys_v0_t then zp_extra_znsize is the amount of > extra space beyond sizeof(znode_phys_v0_t), so 0 for current filesystems.This would also require creation a new DMU_OT_ZNODE2 or something similarly named.> > 3.c) zp_extra_znsize would need to be stored in znode_phys_t somewhere. > There is lots of unused space in some of the 64-bit fields, but I > don''t know how you feel about hacks for this. 
Possibilities include > some bits in zp_flags, zp_pad, high bits in zp_*time nanoseconds, etc. > It probably only needs to be 8 bytes or so (seems unlikely you will > more than double the number of fixed fields in struct znode_phys_t). >The zp_flags field is off limits. It is going to be used for storing additional file attributes such as immutable, nounlink,... I don''t want to see us overload other fields. We already have several pad fields within the znode that could be used.> 3.d) We might consider some symlink-specific mechanism to incidate > fast/slow symlinks (e.g. a flag) instead of depending on sizes, > which I always found fragile in ext3 also, and was the source of > several bugs. > > 3.e) We may instead consider (2.a) for symlinks a that point, since there > is no reason to fear writing 60-byte files anymore (same performance, > different (larger!) location for symlink data). > > 3.f) When ZFS code is accessing new fields declared in znode_phys_t it has > to verify whether they are beyond dn_bonuslen and zp_extra_znsize to > know if those fields are actually valid on disk. > > Finally, > > Fast extended attributes > =======================> 4.a) Unfortunately, due to (1.b), I don''t think we can just store the > EA in the dnode after the bonus buffer. > > 4.b) Putting the EA in the bonus buffer requires (3.a, 3.b) to be addressed. > At that point (symlinks possibly excepted, depending on whether 3.e > is used) the EA space would be: > > (dn_bonuslen - sizeof(znode_phys_v0_t) - zp_extra_znsize) > > For existing symlinks we''d have to also reduce this by zp_size. > > 4.c) It would be best to have some kind of ZAP to store the fast EA data. > Ideally it is a very simple kind of ZAP (single buffer), but the > microzap format is too restrictive with only a 64-bit value. > One of the other Lustre desires is to store additional information in > each directory entry (in addition to the object number) like file type > and a remote server identifier, and having a single ZAP type that is > useful for small entries would be good. Is it possible to go straight > to a zap_leaf_phys_t without having a corresponding zap_phys_t first? > If yes, then this would be quite useful, otherwise a fat ZAP is too fat > to be useful for storing fast EA data and the extended directory info. >Can you provide a list of what attributes you want to store in the znode and what their sizes are? Do you expect ZFS to do anything special with these attributes? Should these attributes be exposed to applications? Usually, we only embed attributes in the znode if the file system has some sort of semantics associated with them. One of the original plans, from several years ago was to create a zp_zap field in the znode that would be used for storing additional file attributes. We never actually did that and the field was turned into one of the pad fields in the znode. If the attribute will be needed for every file then it should probably be in the znode, but if it is an optional attribute or too big then maybe it should be in some sort of overflow object. -Mark> > Apologies for the long email, but I think all of these issues are related > and best addressed with a single design even if they are implemented in > a piecemeal fashion. None of these features are blockers for Lustre > implementation atop ZFS/DMU but nobody wants the performance to be bad. > > Cheers, Andreas > -- > Andreas Dilger > Principal Software Engineer > Cluster File Systems, Inc. 
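To pin down what the zp_extra_znsize bookkeeping from 3.b/3.f amounts to, here is a minimal sketch. It assumes zp_extra_znsize would live in one of the existing znode pad fields (per Mark's comment above); the helper names and the idea of passing the sizes in explicitly are purely illustrative, not proposed interfaces.

    /*
     * Sketch only: the arithmetic behind a variable-size znode.  "v0"
     * means the znode_phys_t layout as it exists today; zp_extra_znsize
     * is the count of additional fixed-field bytes recorded per file.
     */
    #include <stddef.h>

    /* Bonus-buffer bytes left over for fast EAs (4.b in the earlier list). */
    static size_t
    fast_ea_space(size_t dn_bonuslen, size_t znode_v0_size,
        size_t zp_extra_znsize)
    {
        size_t fixed = znode_v0_size + zp_extra_znsize;

        return (dn_bonuslen > fixed ? dn_bonuslen - fixed : 0);
    }

    /*
     * 3.f: a fixed field added after v0, living at byte offset field_off
     * from the start of the znode, is only valid on disk if it falls
     * inside both the recorded fixed area and the bonus buffer.
     */
    static int
    znode_field_valid(size_t field_off, size_t field_len,
        size_t dn_bonuslen, size_t znode_v0_size, size_t zp_extra_znsize)
    {
        size_t end = field_off + field_len;

        return (end <= znode_v0_size + zp_extra_znsize && end <= dn_bonuslen);
    }

A reader written this way never needs to know the "current" znode size: old files simply carry zp_extra_znsize == 0, so every post-v0 field reads as absent, which is the same property i_extra_isize gives ext3.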
On Sep 17, 2007 08:31 -0600, Mark Shellenbaum wrote:> While not entirely the same thing we will soon have a VFS feature > registration mechanism in Nevada. Basically, a file system registers > what features it supports. Initially this will be things such as "case > insensitivity", "acl on create", "extended vattr_t".It''s hard for me to comment on this without more information. I just suggested the ext3 mechanism because what I see so far (many features being tied to ZFS_VERSION_3, and checking for version >= ZFS_VERSION_3) mean that it is really hard to do parallel development of features and ensure that the code is actually safe to access the filesystem. For example, if we start developing large dnode + fast EA code we might want to ship that out sooner than it can go into a Solaris release. We want to make sure that no Solaris code tries to mount such a filesystem or it will assert (I think), so we would have to version the fs as v4. However, maybe Solaris needs some other changes that would require a v4 that does not include large dnode + fast EA support (for whatever reason) so now we have 2 incompatible codebases that support "v4"... Do you have a pointer to the upcoming versioning mechanism?> >3.a) I initially thought that we don''t have to store any extra > > information to have a variable znode_phys_t size, because dn_bonuslen > > holds this information. However, for symlinks ZFS checks essentially > > "zp_size + sizeof(znode_phys_t) < dn_bonuslen" to see if it is a > > fast or slow symlink. That implies if sizeof(znode_phys_t) changes > > old symlinks on disk will be accessed incorrectly if we don''t have > > some extra information about the size of znode_phys_t in each dnode. > > > > There is an existing bug to create symlinks with their own object type.I don''t think that will help unless there is an extra mechanism to detect whether the symlink is fast or slow, instead of just using the dn_bonuslen. Is it possible to store XATTR data on symlinks in Solaris?> >3.b) We can call this "zp_extra_znsize". If we declare the current > > znode_phys_t as znode_phys_v0_t then zp_extra_znsize is the amount of > > extra space beyond sizeof(znode_phys_v0_t), so 0 for current > > filesystems. > > This would also require creation a new DMU_OT_ZNODE2 or something > similarly named.Sure. Is it possible to change the DMU_OT type on an existing object?> >3.c) zp_extra_znsize would need to be stored in znode_phys_t somewhere. > > There is lots of unused space in some of the 64-bit fields, but I > > don''t know how you feel about hacks for this. Possibilities include > > some bits in zp_flags, zp_pad, high bits in zp_*time nanoseconds, etc. > > It probably only needs to be 8 bytes or so (seems unlikely you will > > more than double the number of fixed fields in struct znode_phys_t). > > > > The zp_flags field is off limits. It is going to be used for storing > additional file attributes such as immutable, nounlink,...Ah, OK. I was wondering about that also, but it isn''t in the top 10 priorities yet.> I don''t want to see us overload other fields. We already have several > pad fields within the znode that could be used.OK, I wasn''t sure about what is spoken for already. Is it ZFS policy to always have 64-bit member fields? Some of the fields (e.g. nanoseconds) don''t really make sense as 64-bit values, and it would probably be a waste to have a 64-bit value for zp_extra_znsize.> >4.c) It would be best to have some kind of ZAP to store the fast EA data. 
> > Ideally it is a very simple kind of ZAP (single buffer), but the > > microzap format is too restrictive with only a 64-bit value. > > One of the other Lustre desires is to store additional information in > > each directory entry (in addition to the object number) like file type > > and a remote server identifier, and having a single ZAP type that is > > useful for small entries would be good. Is it possible to go straight > > to a zap_leaf_phys_t without having a corresponding zap_phys_t first? > > If yes, then this would be quite useful, otherwise a fat ZAP is too fat > > to be useful for storing fast EA data and the extended directory info. > > Can you provide a list of what attributes you want to store in the znode > and what their sizes are? Do you expect ZFS to do anything special with > these attributes? Should these attributes be exposed to applications?The main one is the Lustre logical object volume (LOV) extended attribute data. This ranges from (commonly) 64 bytes, to as much as 4096 bytes (or possibly larger once on ZFS). This HAS to be accessed to do anything with the znode, even stat currently, since the size of a file is distributed over potentially many servers, so avoiding overhead here is critical. In addition to that, there will be similar smallish attributes stored with each znode like back-pointers from the storage znodes to the metadata znode. These are on the order of 64 bytes as well.> Usually, we only embed attributes in the znode if the file system has > some sort of semantics associated with them.The issue I think is that this data is only useful for Lustre, so reserving dedicated space for it in a znode is no good. Also, the LOV XATTR might be very large, so any dedicated space would be wasted. Having a generic and fast XATTR storage in the znode would help a variety of applications.> One of the original plans, from several years ago was to create a zp_zap > field in the znode that would be used for storing additional file > attributes. We never actually did that and the field was turned into > one of the pad fields in the znode.Maybe "file attributes" is the wrong term. These are really XATTRs in the ZFS sense, so I''ll refer to them as such in the future.> If the attribute will be needed for every file then it should probably > be in the znode, but if it is an optional attribute or too big then > maybe it should be in some sort of overflow object.This is what I''m proposing. For small XATTRs they would live in the znode, and large ones would be stored using the normal ZFS XATTR mechanism (which is infinitely flexible). Since the Lustre LOV XATTR data is created when the znode is first allocated, it will always get first crack at using the fast XATTR space, which is fine since it is right up with the znode data in importance. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc.
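What Andreas describes in the last paragraph -- small XATTRs in the znode, large ones through the existing XATTR directory -- comes down to a simple placement decision at set time. The sketch below only illustrates that policy; fast_ea_free_space(), fast_ea_set_impl() and dir_xattr_set_impl() are invented placeholders rather than ZFS or Lustre functions, and details such as per-entry headers and replacing an existing fast EA with a larger value are glossed over.

    /*
     * Sketch of a "fast first, spill if necessary" xattr store.  All of
     * the extern functions are placeholders, not real interfaces.
     */
    #include <stddef.h>

    extern size_t fast_ea_free_space(const void *znode);   /* bonus bytes left */
    extern int    fast_ea_set_impl(void *znode, const char *name,
                                   const void *val, size_t len);
    extern int    dir_xattr_set_impl(void *znode, const char *name,
                                     const void *val, size_t len);

    static int
    xattr_set(void *znode, const char *name, const void *val, size_t len)
    {
        /*
         * Small values (e.g. the common 64-byte Lustre LOV EA) go into
         * the bonus buffer if they fit; anything larger, or anything
         * that no longer fits, uses the existing directory-based XATTRs.
         */
        if (len <= fast_ea_free_space(znode))
            return (fast_ea_set_impl(znode, name, val, len));

        return (dir_xattr_set_impl(znode, name, val, len));
    }

Because the LOV EA is written when the object is first created, it would naturally get the first claim on the fast space under this policy, which is the behaviour described above.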
Andreas Dilger wrote:> On Sep 17, 2007 08:31 -0600, Mark Shellenbaum wrote: >> While not entirely the same thing we will soon have a VFS feature >> registration mechanism in Nevada. Basically, a file system registers >> what features it supports. Initially this will be things such as "case >> insensitivity", "acl on create", "extended vattr_t". > > It''s hard for me to comment on this without more information. I just > suggested the ext3 mechanism because what I see so far (many features > being tied to ZFS_VERSION_3, and checking for version >= ZFS_VERSION_3) > mean that it is really hard to do parallel development of features and > ensure that the code is actually safe to access the filesystem. >ZFS actually has 3 different version numbers. Anything with ZFS_ is actually the spa version. The ZPL also has a version associated with it and will have ZPL_ as its prefix. Within each file is a unique ACL version. Most of the version changing has happened at the spa level, but soon the ZPL version will be changing to support some additional attributes and other things for SMB.> For example, if we start developing large dnode + fast EA code we might > want to ship that out sooner than it can go into a Solaris release. We > want to make sure that no Solaris code tries to mount such a filesystem > or it will assert (I think), so we would have to version the fs as v4. > > However, maybe Solaris needs some other changes that would require a v4 > that does not include large dnode + fast EA support (for whatever reason) > so now we have 2 incompatible codebases that support "v4"... > > Do you have a pointer to the upcoming versioning mechanism? >Sure, take a look at: http://www.opensolaris.org/os/community/arc/caselog/2007/315/ http://www.opensolaris.org/os/community/arc/caselog/2007/444/ These describe more than just the feature registration though.>>> 3.a) I initially thought that we don''t have to store any extra >>> information to have a variable znode_phys_t size, because dn_bonuslen >>> holds this information. However, for symlinks ZFS checks essentially >>> "zp_size + sizeof(znode_phys_t) < dn_bonuslen" to see if it is a >>> fast or slow symlink. That implies if sizeof(znode_phys_t) changes >>> old symlinks on disk will be accessed incorrectly if we don''t have >>> some extra information about the size of znode_phys_t in each dnode. >>> >> There is an existing bug to create symlinks with their own object type. > > I don''t think that will help unless there is an extra mechanism to detect > whether the symlink is fast or slow, instead of just using the dn_bonuslen. > Is it possible to store XATTR data on symlinks in Solaris? > >>> 3.b) We can call this "zp_extra_znsize". If we declare the current >>> znode_phys_t as znode_phys_v0_t then zp_extra_znsize is the amount of >>> extra space beyond sizeof(znode_phys_v0_t), so 0 for current >>> filesystems. >> This would also require creation a new DMU_OT_ZNODE2 or something >> similarly named. > > Sure. Is it possible to change the DMU_OT type on an existing object? >Not that I know of. You would just allocate new files with the new type.>>> 3.c) zp_extra_znsize would need to be stored in znode_phys_t somewhere. >>> There is lots of unused space in some of the 64-bit fields, but I >>> don''t know how you feel about hacks for this. Possibilities include >>> some bits in zp_flags, zp_pad, high bits in zp_*time nanoseconds, etc. 
>>> It probably only needs to be 8 bytes or so (seems unlikely you will >>> more than double the number of fixed fields in struct znode_phys_t). >>> >> The zp_flags field is off limits. It is going to be used for storing >> additional file attributes such as immutable, nounlink,... > > Ah, OK. I was wondering about that also, but it isn''t in the top 10 > priorities yet. > >> I don''t want to see us overload other fields. We already have several >> pad fields within the znode that could be used. > > OK, I wasn''t sure about what is spoken for already. Is it ZFS policy to > always have 64-bit member fields? Some of the fields (e.g. nanoseconds) > don''t really make sense as 64-bit values, and it would probably be a > waste to have a 64-bit value for zp_extra_znsize.Not an official policy, but we do typically use 64-bit values.> >>> 4.c) It would be best to have some kind of ZAP to store the fast EA data. >>> Ideally it is a very simple kind of ZAP (single buffer), but the >>> microzap format is too restrictive with only a 64-bit value. >>> One of the other Lustre desires is to store additional information in >>> each directory entry (in addition to the object number) like file type >>> and a remote server identifier, and having a single ZAP type that is >>> useful for small entries would be good. Is it possible to go straight >>> to a zap_leaf_phys_t without having a corresponding zap_phys_t first? >>> If yes, then this would be quite useful, otherwise a fat ZAP is too fat >>> to be useful for storing fast EA data and the extended directory info. >> Can you provide a list of what attributes you want to store in the znode >> and what their sizes are? Do you expect ZFS to do anything special with >> these attributes? Should these attributes be exposed to applications? > > The main one is the Lustre logical object volume (LOV) extended attribute > data. This ranges from (commonly) 64 bytes, to as much as 4096 bytes (or > possibly larger once on ZFS). This HAS to be accessed to do anything with > the znode, even stat currently, since the size of a file is distributed > over potentially many servers, so avoiding overhead here is critical. > > In addition to that, there will be similar smallish attributes stored with > each znode like back-pointers from the storage znodes to the metadata znode. > These are on the order of 64 bytes as well. > >> Usually, we only embed attributes in the znode if the file system has >> some sort of semantics associated with them. > > The issue I think is that this data is only useful for Lustre, so reserving > dedicated space for it in a znode is no good. Also, the LOV XATTR might be > very large, so any dedicated space would be wasted. Having a generic and > fast XATTR storage in the znode would help a variety of applications. >How does lustre retrieve the data? Do you expect the data to be preserved via backup utilities?>> One of the original plans, from several years ago was to create a zp_zap >> field in the znode that would be used for storing additional file >> attributes. We never actually did that and the field was turned into >> one of the pad fields in the znode. > > Maybe "file attributes" is the wrong term. These are really XATTRs in the > ZFS sense, so I''ll refer to them as such in the future. 
>Yep, when you say EAs I was assuming small named/value pairs, not the Solaris based XATTR model.>> If the attribute will be needed for every file then it should probably >> be in the znode, but if it is an optional attribute or too big then >> maybe it should be in some sort of overflow object. > > This is what I''m proposing. For small XATTRs they would live in the znode, > and large ones would be stored using the normal ZFS XATTR mechanism (which > is infinitely flexible). Since the Lustre LOV XATTR data is created when > the znode is first allocated, it will always get first crack at using the > fast XATTR space, which is fine since it is right up with the znode data in > importance. >How will you be setting the attributes when the object is created? Do you have a kernel module that would be calling VOP_CREATE()? The reason I ask is that with the ARC cases I listed earlier, you will be able to set additional attributes atomically at the time the file is created. -Mark
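On the 4.c question of what a "very simple ZAP" for the in-bonus EA area could look like: since the microzap only stores 64-bit values, one option is to skip ZAP entirely and use a flat, length-prefixed name/value packing that is walked linearly. The format below is invented purely for illustration -- it is not an existing ZAP or ZFS structure -- and it assumes the buffer starts 8-byte aligned.

    /*
     * Sketch: trivially packed name/value entries for a fast EA area.
     * Entries are laid end to end, padded to 8 bytes; a zero name length
     * terminates the list.  Not an existing on-disk format.
     */
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    #define FAST_EA_ALIGN(x)    (((x) + 7) & ~(size_t)7)

    typedef struct fast_ea_ent {
        uint8_t  fe_namelen;    /* 0 terminates the list */
        uint8_t  fe_pad;
        uint16_t fe_vallen;
        uint32_t fe_pad2;
        /* fe_namelen name bytes, then fe_vallen value bytes, then padding */
    } fast_ea_ent_t;

    /* Find "name"; return a pointer to its value and set *vallen, or NULL. */
    static const void *
    fast_ea_lookup(const uint8_t *buf, size_t buflen,
        const char *name, uint16_t *vallen)
    {
        size_t namelen = strlen(name);
        size_t off = 0;

        while (off + sizeof (fast_ea_ent_t) <= buflen) {
            const fast_ea_ent_t *e = (const void *)(buf + off);
            const uint8_t *ename = buf + off + sizeof (*e);
            size_t entlen = FAST_EA_ALIGN(sizeof (*e) +
                e->fe_namelen + e->fe_vallen);

            if (e->fe_namelen == 0 || off + entlen > buflen)
                break;                      /* end of list or truncated */
            if (e->fe_namelen == namelen &&
                memcmp(ename, name, namelen) == 0) {
                *vallen = e->fe_vallen;
                return (ename + e->fe_namelen);
            }
            off += entlen;
        }
        return (NULL);
    }

A linear scan is fine here because the whole area is at most a few hundred bytes and is already in memory alongside the dnode; the fat-ZAP machinery only pays off at much larger scales.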
Mark Shellenbaum wrote:> Andreas Dilger wrote: >> On Sep 17, 2007 08:31 -0600, Mark Shellenbaum wrote: >>> While not entirely the same thing we will soon have a VFS feature >>> registration mechanism in Nevada. Basically, a file system registers >>> what features it supports. Initially this will be things such as "case >>> insensitivity", "acl on create", "extended vattr_t". >> It''s hard for me to comment on this without more information. I just >> suggested the ext3 mechanism because what I see so far (many features >> being tied to ZFS_VERSION_3, and checking for version >= ZFS_VERSION_3) >> mean that it is really hard to do parallel development of features and >> ensure that the code is actually safe to access the filesystem. >> > > ZFS actually has 3 different version numbers. Anything with ZFS_ is > actually the spa version. The ZPL also has a version associated with it > and will have ZPL_ as its prefix. Within each file is a unique ACL > version. Most of the version changing has happened at the spa level, > but soon the ZPL version will be changing to support some additional > attributes and other things for SMB. > >> For example, if we start developing large dnode + fast EA code we might >> want to ship that out sooner than it can go into a Solaris release. We >> want to make sure that no Solaris code tries to mount such a filesystem >> or it will assert (I think), so we would have to version the fs as v4. >> >> However, maybe Solaris needs some other changes that would require a v4 >> that does not include large dnode + fast EA support (for whatever reason) >> so now we have 2 incompatible codebases that support "v4"... >> >> Do you have a pointer to the upcoming versioning mechanism? >> > > Sure, take a look at: > > http://www.opensolaris.org/os/community/arc/caselog/2007/315/ > http://www.opensolaris.org/os/community/arc/caselog/2007/444/ >Forgot to list the feature registration one. http://www.opensolaris.org/os/community/arc/caselog/2007/227/mail> These describe more than just the feature registration though. > >>>> 3.a) I initially thought that we don''t have to store any extra >>>> information to have a variable znode_phys_t size, because dn_bonuslen >>>> holds this information. However, for symlinks ZFS checks essentially >>>> "zp_size + sizeof(znode_phys_t) < dn_bonuslen" to see if it is a >>>> fast or slow symlink. That implies if sizeof(znode_phys_t) changes >>>> old symlinks on disk will be accessed incorrectly if we don''t have >>>> some extra information about the size of znode_phys_t in each dnode. >>>> >>> There is an existing bug to create symlinks with their own object type. >> I don''t think that will help unless there is an extra mechanism to detect >> whether the symlink is fast or slow, instead of just using the dn_bonuslen. >> Is it possible to store XATTR data on symlinks in Solaris? >> >>>> 3.b) We can call this "zp_extra_znsize". If we declare the current >>>> znode_phys_t as znode_phys_v0_t then zp_extra_znsize is the amount of >>>> extra space beyond sizeof(znode_phys_v0_t), so 0 for current >>>> filesystems. >>> This would also require creation a new DMU_OT_ZNODE2 or something >>> similarly named. >> Sure. Is it possible to change the DMU_OT type on an existing object? >> > > Not that I know of. You would just allocate new files with the new type. > >>>> 3.c) zp_extra_znsize would need to be stored in znode_phys_t somewhere. >>>> There is lots of unused space in some of the 64-bit fields, but I >>>> don''t know how you feel about hacks for this. 
Possibilities include >>>> some bits in zp_flags, zp_pad, high bits in zp_*time nanoseconds, etc. >>>> It probably only needs to be 8 bytes or so (seems unlikely you will >>>> more than double the number of fixed fields in struct znode_phys_t). >>>> >>> The zp_flags field is off limits. It is going to be used for storing >>> additional file attributes such as immutable, nounlink,... >> Ah, OK. I was wondering about that also, but it isn''t in the top 10 >> priorities yet. >> >>> I don''t want to see us overload other fields. We already have several >>> pad fields within the znode that could be used. >> OK, I wasn''t sure about what is spoken for already. Is it ZFS policy to >> always have 64-bit member fields? Some of the fields (e.g. nanoseconds) >> don''t really make sense as 64-bit values, and it would probably be a >> waste to have a 64-bit value for zp_extra_znsize. > > Not an official policy, but we do typically use 64-bit values. > >>>> 4.c) It would be best to have some kind of ZAP to store the fast EA data. >>>> Ideally it is a very simple kind of ZAP (single buffer), but the >>>> microzap format is too restrictive with only a 64-bit value. >>>> One of the other Lustre desires is to store additional information in >>>> each directory entry (in addition to the object number) like file type >>>> and a remote server identifier, and having a single ZAP type that is >>>> useful for small entries would be good. Is it possible to go straight >>>> to a zap_leaf_phys_t without having a corresponding zap_phys_t first? >>>> If yes, then this would be quite useful, otherwise a fat ZAP is too fat >>>> to be useful for storing fast EA data and the extended directory info. >>> Can you provide a list of what attributes you want to store in the znode >>> and what their sizes are? Do you expect ZFS to do anything special with >>> these attributes? Should these attributes be exposed to applications? >> The main one is the Lustre logical object volume (LOV) extended attribute >> data. This ranges from (commonly) 64 bytes, to as much as 4096 bytes (or >> possibly larger once on ZFS). This HAS to be accessed to do anything with >> the znode, even stat currently, since the size of a file is distributed >> over potentially many servers, so avoiding overhead here is critical. >> >> In addition to that, there will be similar smallish attributes stored with >> each znode like back-pointers from the storage znodes to the metadata znode. >> These are on the order of 64 bytes as well. >> >>> Usually, we only embed attributes in the znode if the file system has >>> some sort of semantics associated with them. >> The issue I think is that this data is only useful for Lustre, so reserving >> dedicated space for it in a znode is no good. Also, the LOV XATTR might be >> very large, so any dedicated space would be wasted. Having a generic and >> fast XATTR storage in the znode would help a variety of applications. >> > > How does lustre retrieve the data? Do you expect the data to be > preserved via backup utilities? > >>> One of the original plans, from several years ago was to create a zp_zap >>> field in the znode that would be used for storing additional file >>> attributes. We never actually did that and the field was turned into >>> one of the pad fields in the znode. >> Maybe "file attributes" is the wrong term. These are really XATTRs in the >> ZFS sense, so I''ll refer to them as such in the future. >> > > Yep, when you say EAs I was assuming small named/value pairs, not the > Solaris based XATTR model. 
> >>> If the attribute will be needed for every file then it should probably >>> be in the znode, but if it is an optional attribute or too big then >>> maybe it should be in some sort of overflow object. >> This is what I''m proposing. For small XATTRs they would live in the znode, >> and large ones would be stored using the normal ZFS XATTR mechanism (which >> is infinitely flexible). Since the Lustre LOV XATTR data is created when >> the znode is first allocated, it will always get first crack at using the >> fast XATTR space, which is fine since it is right up with the znode data in >> importance. >> > > How will you be setting the attributes when the object is created? Do > you have a kernel module that would be calling VOP_CREATE()? The reason > I ask is that with the ARC cases I listed earlier, you will be able to > set additional attributes atomically at the time the file is created. > > > -Mark > _______________________________________________ > zfs-code mailing list > zfs-code at opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-code
I suggest that we get together soon for a "dnode summit", if you will, in which we put our various plans on the whiteboard and attempt to do the global optimization. I suspect that Lustre and pNFS, for example, have very similar needs -- it would be great to make them identical. The dnode is a truly core data structure -- we should do everything we can to keep it free of #ifdefs and conditional logic. Andreas, where are you based? When''s your next trip to CA? Jeff On Mon, Sep 17, 2007 at 02:16:17PM -0600, Andreas Dilger wrote:> On Sep 17, 2007 08:31 -0600, Mark Shellenbaum wrote: > > While not entirely the same thing we will soon have a VFS feature > > registration mechanism in Nevada. Basically, a file system registers > > what features it supports. Initially this will be things such as "case > > insensitivity", "acl on create", "extended vattr_t". > > It''s hard for me to comment on this without more information. I just > suggested the ext3 mechanism because what I see so far (many features > being tied to ZFS_VERSION_3, and checking for version >= ZFS_VERSION_3) > mean that it is really hard to do parallel development of features and > ensure that the code is actually safe to access the filesystem. > > For example, if we start developing large dnode + fast EA code we might > want to ship that out sooner than it can go into a Solaris release. We > want to make sure that no Solaris code tries to mount such a filesystem > or it will assert (I think), so we would have to version the fs as v4. > > However, maybe Solaris needs some other changes that would require a v4 > that does not include large dnode + fast EA support (for whatever reason) > so now we have 2 incompatible codebases that support "v4"... > > Do you have a pointer to the upcoming versioning mechanism? > > > >3.a) I initially thought that we don''t have to store any extra > > > information to have a variable znode_phys_t size, because dn_bonuslen > > > holds this information. However, for symlinks ZFS checks essentially > > > "zp_size + sizeof(znode_phys_t) < dn_bonuslen" to see if it is a > > > fast or slow symlink. That implies if sizeof(znode_phys_t) changes > > > old symlinks on disk will be accessed incorrectly if we don''t have > > > some extra information about the size of znode_phys_t in each dnode. > > > > > > > There is an existing bug to create symlinks with their own object type. > > I don''t think that will help unless there is an extra mechanism to detect > whether the symlink is fast or slow, instead of just using the dn_bonuslen. > Is it possible to store XATTR data on symlinks in Solaris? > > > >3.b) We can call this "zp_extra_znsize". If we declare the current > > > znode_phys_t as znode_phys_v0_t then zp_extra_znsize is the amount of > > > extra space beyond sizeof(znode_phys_v0_t), so 0 for current > > > filesystems. > > > > This would also require creation a new DMU_OT_ZNODE2 or something > > similarly named. > > Sure. Is it possible to change the DMU_OT type on an existing object? > > > >3.c) zp_extra_znsize would need to be stored in znode_phys_t somewhere. > > > There is lots of unused space in some of the 64-bit fields, but I > > > don''t know how you feel about hacks for this. Possibilities include > > > some bits in zp_flags, zp_pad, high bits in zp_*time nanoseconds, etc. > > > It probably only needs to be 8 bytes or so (seems unlikely you will > > > more than double the number of fixed fields in struct znode_phys_t). > > > > > > > The zp_flags field is off limits. 
It is going to be used for storing > > additional file attributes such as immutable, nounlink,... > > Ah, OK. I was wondering about that also, but it isn''t in the top 10 > priorities yet. > > > I don''t want to see us overload other fields. We already have several > > pad fields within the znode that could be used. > > OK, I wasn''t sure about what is spoken for already. Is it ZFS policy to > always have 64-bit member fields? Some of the fields (e.g. nanoseconds) > don''t really make sense as 64-bit values, and it would probably be a > waste to have a 64-bit value for zp_extra_znsize. > > > >4.c) It would be best to have some kind of ZAP to store the fast EA data. > > > Ideally it is a very simple kind of ZAP (single buffer), but the > > > microzap format is too restrictive with only a 64-bit value. > > > One of the other Lustre desires is to store additional information in > > > each directory entry (in addition to the object number) like file type > > > and a remote server identifier, and having a single ZAP type that is > > > useful for small entries would be good. Is it possible to go straight > > > to a zap_leaf_phys_t without having a corresponding zap_phys_t first? > > > If yes, then this would be quite useful, otherwise a fat ZAP is too fat > > > to be useful for storing fast EA data and the extended directory info. > > > > Can you provide a list of what attributes you want to store in the znode > > and what their sizes are? Do you expect ZFS to do anything special with > > these attributes? Should these attributes be exposed to applications? > > The main one is the Lustre logical object volume (LOV) extended attribute > data. This ranges from (commonly) 64 bytes, to as much as 4096 bytes (or > possibly larger once on ZFS). This HAS to be accessed to do anything with > the znode, even stat currently, since the size of a file is distributed > over potentially many servers, so avoiding overhead here is critical. > > In addition to that, there will be similar smallish attributes stored with > each znode like back-pointers from the storage znodes to the metadata znode. > These are on the order of 64 bytes as well. > > > Usually, we only embed attributes in the znode if the file system has > > some sort of semantics associated with them. > > The issue I think is that this data is only useful for Lustre, so reserving > dedicated space for it in a znode is no good. Also, the LOV XATTR might be > very large, so any dedicated space would be wasted. Having a generic and > fast XATTR storage in the znode would help a variety of applications. > > > One of the original plans, from several years ago was to create a zp_zap > > field in the znode that would be used for storing additional file > > attributes. We never actually did that and the field was turned into > > one of the pad fields in the znode. > > Maybe "file attributes" is the wrong term. These are really XATTRs in the > ZFS sense, so I''ll refer to them as such in the future. > > > If the attribute will be needed for every file then it should probably > > be in the znode, but if it is an optional attribute or too big then > > maybe it should be in some sort of overflow object. > > This is what I''m proposing. For small XATTRs they would live in the znode, > and large ones would be stored using the normal ZFS XATTR mechanism (which > is infinitely flexible). 
> Since the Lustre LOV XATTR data is created when
> the znode is first allocated, it will always get first crack at using the
> fast XATTR space, which is fine since it is right up with the znode data in
> importance.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Principal Software Engineer
> Cluster File Systems, Inc.
On Sep 17, 2007 15:26 -0700, Jeff Bonwick wrote:
> I suggest that we get together soon for a "dnode summit", if you will,
> in which we put our various plans on the whiteboard and attempt to do
> the global optimization. I suspect that Lustre and pNFS, for example,
> have very similar needs -- it would be great to make them identical.
>
> The dnode is a truly core data structure -- we should do everything
> we can to keep it free of #ifdefs and conditional logic.
>
> Andreas, where are you based? When's your next trip to CA?

I'm in Calgary, Canada. There's some chance I'll be down there on Oct. 1 but it isn't yet certain.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
On Sep 17, 2007 14:43 -0600, Mark Shellenbaum wrote:
> >Andreas Dilger wrote:
> >>0.b The ext2/3/4 format has a very clever IMHO versioning mechanism that
> >>    is superior to just incrementing a version number and forcing all
> >>    implementations to support every previous version's features. See
> >>    http://www.mjmwired.net/kernel/Documentation/filesystems/ext2.txt#224
> >>    for a detailed description of how the features work.
> >>
> >>It's hard for me to comment on this without more information. I just
> >>suggested the ext3 mechanism because what I see so far (many features
> >>being tied to ZFS_VERSION_3, and checking for version >= ZFS_VERSION_3)
> >>means that it is really hard to do parallel development of features and
> >>ensure that the code is actually safe to access the filesystem.
> >
> >ZFS actually has 3 different version numbers. Anything with ZFS_ is
> >actually the spa version. The ZPL also has a version associated with it
> >and will have ZPL_ as its prefix. Within each file is a unique ACL
> >version. Most of the version changing has happened at the spa level,
> >but soon the ZPL version will be changing to support some additional
> >attributes and other things for SMB.
> >
> >Sure, take a look at:
> >
> >http://www.opensolaris.org/os/community/arc/caselog/2007/315/
> >http://www.opensolaris.org/os/community/arc/caselog/2007/444/
>
> Forgot to list the feature registration one.
>
> http://www.opensolaris.org/os/community/arc/caselog/2007/227/mail

So, after finally having had a chance to read these threads, I don't think they relate at all to what I was initially proposing. The feature registration APIs are more related to the users of the filesystem, in terms of what functionality the fs provides. What I was proposing was a mechanism for ZFS internally to be more flexible in terms of forward and backward compatibility between the ZFS code and the on-disk format.

The system attributes discussion may be somewhat related to the fast XATTR proposal. The document doesn't actually specify how these system attributes will be stored internally to ZFS. If they are stored in the znode, that is the best performance-wise. If the system attributes are stored in an XATTR object hung off the znode->zp_xattr then this will probably be a noticeable performance hit for each access.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
Darren J Moffat wrote:
> Andreas Dilger wrote:
>
>> I agree, but I suspect large dnodes could also be of use to ZFS at
>> some point, either for fast EAs and/or small files, so we wanted to
>> get some buy-in from the ZFS developers on an approach that would
>> be suitable for ZFS also. In particular, being able to use the larger
>> dnode space for a variety of reasons (more elements in dn_blkptr[],
>> small file data, fast EA space) is much more desirable than a Lustre-only
>> implementation.
>
> Let me give an alternate view here. This could make ZFS Crypto more
> complex because now data would sometimes be stored inside the dnode. I
> need to think about this a bit more but in general it makes me uneasy,
> it may turn out not to be an issue though.

I thought we were just talking about increasing the potential bonus buffer size. So it is no different than the problem you have today: you need to encrypt the bonus buffer part of the dnode_phys_t, but not the rest of it.

--matt
Matthew Ahrens wrote:
> Darren J Moffat wrote:
>> Andreas Dilger wrote:
>>
>>> I agree, but I suspect large dnodes could also be of use to ZFS at
>>> some point, either for fast EAs and/or small files, so we wanted to
>>> get some buy-in from the ZFS developers on an approach that would
>>> be suitable for ZFS also. In particular, being able to use the larger
>>> dnode space for a variety of reasons (more elements in dn_blkptr[],
>>> small file data, fast EA space) is much more desirable than a
>>> Lustre-only implementation.
>>
>> Let me give an alternate view here. This could make ZFS Crypto more
>> complex because now data would sometimes be stored inside the dnode.
>> I need to think about this a bit more but in general it makes me
>> uneasy, it may turn out not to be an issue though.
>
> I thought we were just talking about increasing the potential bonus
> buffer size. So it is no different than the problem you have today: you
> need to encrypt the bonus buffer part of the dnode_phys_t, but not the
> rest of it.

Ah, okay. Maybe I read too much into it and assumed the dnode_phys_t would have more than the bonus buffer to worry about. If all of this stays in the bonus buffer then yes, it is the same existing problem of ensuring that it gets encrypted.

--
Darren J Moffat
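For the crypto question, the practical consequence of keeping fast EAs inside the bonus buffer is that the "sensitive" byte range of a dnode does not change shape: it is still just the bonus area. A trivial sketch, with all three helpers as invented placeholders for however ZFS Crypto locates and transforms that region:

    /*
     * Sketch only: the ZPL/Lustre-visible payload of a dnode is the bonus
     * area, whether it holds a bare znode or a znode plus fast EAs.  The
     * extern functions are placeholders, not existing interfaces.
     */
    #include <stddef.h>
    #include <stdint.h>

    extern void     *dnode_bonus_ptr(void *dnode_phys);        /* placeholder */
    extern uint16_t  dnode_bonuslen(const void *dnode_phys);   /* placeholder */
    extern void      encrypt_in_place(void *buf, size_t len);  /* placeholder */

    static void
    encrypt_dnode_payload(void *dnode_phys)
    {
        /* Only the bonus bytes carry file data; the rest is DMU metadata. */
        encrypt_in_place(dnode_bonus_ptr(dnode_phys),
            dnode_bonuslen(dnode_phys));
    }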
On Sep 20, 2007 06:09 -0700, Matthew Ahrens wrote:
> Andreas Dilger wrote:
> >>I agree, but I suspect large dnodes could also be of use to ZFS at
> >>some point, either for fast EAs and/or small files, so we wanted to
> >>get some buy-in from the ZFS developers on an approach that would
> >>be suitable for ZFS also. In particular, being able to use the larger
> >>dnode space for a variety of reasons (more elements in dn_blkptr[],
> >>small file data, fast EA space) is much more desirable than a Lustre-only
> >>implementation.
>
> I thought we were just talking about increasing the potential bonus buffer
> size. So it is no different than the problem you have today: you need to
> encrypt the bonus buffer part of the dnode_phys_t, but not the rest of it.

Well, CFS is only immediately interested in fast XATTR storage using a larger dnode bonus buffer. However, I thought it might be worthwhile to discuss using the dn_blkptr[] array itself for fast small file/symlink storage at the same time. It might be too hard to change that at this time, and I won't shed a tear if that is the case, but it is worth some thought. Bill mentioned you had already tried something similar, so we might discount that idea pretty quickly.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
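To make the dn_blkptr[] reuse idea from 2.a/2.b slightly more concrete, a read of a small "embedded" object might look like the sketch below. Everything here is hypothetical -- the flag, the stand-in struct, and the 384-byte figure (three 128-byte block pointers) are for illustration only, and as noted above something along these lines may already have been tried and set aside.

    /*
     * Sketch: small-file reads served from the block-pointer area of an
     * enlarged dnode.  DN_FLAG_EMBEDDED_DATA and the struct are invented;
     * real dnodes have no such flag today.  Invariant assumed:
     * dn_size <= sizeof (dn_embedded) whenever the flag is set.
     */
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    #define DN_FLAG_EMBEDDED_DATA   0x01        /* hypothetical dn_flags bit */

    typedef struct small_dnode {                /* stand-in, not dnode_phys_t */
        uint8_t  dn_flags;
        uint64_t dn_size;                       /* object length in bytes */
        uint8_t  dn_embedded[384];              /* space of ~3 blkptr_t */
    } small_dnode_t;

    /* Returns bytes copied, or -1 if the data is not embedded. */
    static long
    read_embedded(const small_dnode_t *dn, void *buf, size_t len, uint64_t off)
    {
        if (!(dn->dn_flags & DN_FLAG_EMBEDDED_DATA))
            return (-1);            /* fall back to the normal blkptr path */
        if (off >= dn->dn_size)
            return (0);
        if (len > dn->dn_size - off)
            len = (size_t)(dn->dn_size - off);
        memcpy(buf, dn->dn_embedded + off, len);
        return ((long)len);
    }

A write path would additionally have to notice when the data outgrows the embedded area, clear the flag, and push the contents out through real block pointers in the same transaction.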