Hello,

Is there a mechanism within Lustre for querying the populated extents
in a sparse Lustre file?  Perhaps some kind of bmap support or an IOCTL
for populating an extent map?

I believe ZFS has support for SEEK_HOLE whences, but I didn't know if
Lustre has any mechanism to accomplish similar goals.

Cheers,
-- 
Brad Settlemyer
Research Associate
Oak Ridge National Laboratory

On 2010-06-10, at 08:07, Bradley W. Settlemyer wrote:
> Is there a mechanism within Lustre for querying the populated extents
> in a sparse lustre file?  Perhaps some kind of bmap support or an IOCTL
> for populating an extent map?
>
> I believe ZFS has support for SEEK_HOLE whences, but I didn't know if
> Lustre has any mechanism to accomplish similar goals.

On Linux, the equivalent (better?) interface is the FIEMAP ioctl, which returns a readdir-like list of extents into a user-supplied buffer. We developed this for Lustre because FIBMAP is wholly inefficient and inadequate for returning millions of allocated blocks, and it has no way to express blocks being stored on different devices. Also, the FIEMAP ioctl does not need root permission, unlike the FIBMAP ioctl, so it is useful for regular users/tools.

Subsequently the FIEMAP ioctl was adopted into the upstream kernel (with a huge amount of effort), and is now available for ext2/3/4/xfs/reiserfs/btrfs for dumping extent maps to userspace.

For displaying the FIEMAP data, the filefrag(8) tool was enhanced to use FIEMAP in preference to FIBMAP, if the underlying filesystem supports it. In the lustre-patched e2fsprogs it correctly handles the presence of stripes on multiple backing devices. Note that the output format shown below is an improved version that is not in any released e2fsprogs yet (it's in CVS though), but it will be in our next e2fsprogs release and has also been accepted upstream.

The FIEMAP ioctl is available in 1.8, and in some later versions of 1.6, but due to petty infighting when it was accepted upstream the data format was changed from our original version. The older 1.6 releases still carry that original format, and their FIEMAP should not be used.

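For anyone who wants to call the ioctl directly rather than go through filefrag, here is a minimal Python sketch of the upstream (local-filesystem) FIEMAP interface as described above. The struct layouts and the FS_IOC_FIEMAP number are taken from the mainline <linux/fiemap.h> and <linux/fs.h>; the function names (parse_fiemap, query_extents) are hypothetical, and reading the first reserved 32-bit field as a device index on Lustre is an assumption that should be checked against the Lustre headers in use. It makes only one call, so files with more than max_extents extents would need a loop that restarts from the last returned extent.

```python
import fcntl
import struct

FS_IOC_FIEMAP = 0xC020660B    # _IOWR('f', 11, struct fiemap) in <linux/fs.h>
FIEMAP_FLAG_SYNC = 0x0001     # ask the kernel to flush dirty data first
FIEMAP_EXTENT_LAST = 0x0001   # set on the final extent of the file

# struct fiemap header: fm_start, fm_length, fm_flags,
# fm_mapped_extents, fm_extent_count, fm_reserved (32 bytes)
HDR = struct.Struct("=QQLLLL")
# struct fiemap_extent: fe_logical, fe_physical, fe_length,
# fe_reserved64[2], fe_flags, fe_reserved[3] (56 bytes)
EXT = struct.Struct("=QQQ2QL3L")


def parse_fiemap(buf):
    """Decode the extents the kernel wrote into a FIEMAP buffer."""
    mapped = HDR.unpack_from(buf, 0)[3]          # fm_mapped_extents
    extents = []
    for i in range(mapped):
        (logical, physical, length, _r0, _r1,
         flags, res0, _r2, _r3) = EXT.unpack_from(buf, HDR.size + i * EXT.size)
        extents.append({
            "logical": logical,      # byte offset within the file (or object)
            "physical": physical,    # byte offset on the backing device
            "length": length,        # extent length in bytes
            "flags": flags,
            "device": res0,          # ASSUMPTION: device index on Lustre
            "last": bool(flags & FIEMAP_EXTENT_LAST),
        })
    return extents


def query_extents(path, max_extents=64):
    """Issue one FIEMAP call covering the whole file; needs no root access."""
    buf = bytearray(HDR.size + max_extents * EXT.size)
    # Map from offset 0 to the maximum offset, syncing dirty data first.
    HDR.pack_into(buf, 0, 0, 0xFFFFFFFFFFFFFFFF, FIEMAP_FLAG_SYNC,
                  0, max_extents, 0)
    with open(path, "rb") as f:
        fcntl.ioctl(f.fileno(), FS_IOC_FIEMAP, buf)   # kernel fills buf in place
    return parse_fiemap(buf)
```

Calling query_extents("/some/file") on a filesystem without FIEMAP support raises OSError, so callers would normally want to catch that and fall back to reading the file.
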
Note one major caveat when using FIEMAP on Lustre: it currently implements a slightly different output format than the local-disk filesystems, because for fragmentation visualization (which is what it was originally intended for) it makes sense to display the layout in per-object order. If the extents were presented in file-offset order there would appear to be fragmentation every 1MB in the file, even though they are allocated contiguously on disk. For 1-stripe files this is irrelevant and the output is the same.

On the client with Lustre:

[adilger@twoshoes]$ filefrag -v /myth/images/Main\ Library/Library6.iPhoto
Filesystem type is: bd00bd0
File size of /myth/images/Main Library/Library6.iPhoto is 30240622 (29532 blocks of 1024 bytes)
 ext:  device_logical:        physical_offset:  length:  dev:  flags:
   0:      0..  28671:  637502464.. 637531135:   28672:  0003: network
   1:  28672..  29531:  637669376.. 637670235:     860:  0003: network,eof
/myth/images/Main Library/Library6.iPhoto: 2 extents found

[adilger@twoshoes]$ lfs getstripe /myth/images/Main\ Library/Library6.iPhoto
/myth/images/Main Library/Library6.iPhoto
lmm_stripe_count:   1
lmm_stripe_size:    1048576
lmm_stripe_offset:  3
     obdidx      objid      objid      group
          3     341351    0x53567          0

On the server with a local ldiskfs mount (for comparison; note the '-k' argument to use 1024-byte blocks for output, otherwise it defaults to 4096-byte blocks to match the local filesystem blocksize):

[root@mookie]# mount -t ldiskfs /dev/vgmyth/lvmythost3 /mnt/tmp
[root@mookie]# filefrag -k -v /mnt/tmp/O/0/d$((341351 % 32))/341351
Filesystem type is: ef53
File size of /mnt/tmp/O/0/d7/341351 is 30240622 (29532 blocks of 1024 bytes)
 ext:  logical_offset:        physical_offset:  length:  flags:
   0:      0..  28671:  637502464.. 637531135:   28672:
   1:  28672..  29531:  637669376.. 637670235:     860:  eof
/mnt/tmp/O/0/d7/341351: 2 extents found

If there are multiple stripes in a file, the extents are shown with object offsets instead of file offsets:

[adilger@twoshoes]$ filefrag -v "/myth/tmp/4stripe"
Filesystem type is: bd00bd0
File size of /myth/tmp/4stripe is 104857600 (102400 blocks of 1024 bytes)
 ext:  device_logical:        physical_offset:  length:  dev:  flags:
   0:      0..  14335:  179423232.. 179437567:   14336:  0003: network
   1:  14336..  28671:  179445760.. 179460095:   14336:  0003: network
   2:      0..   1023:   18482176..  18483199:    1024:  0000: network
   3:   1024..  24575:   18485248..  18508799:   23552:  0000: network
   4:      0..  24575:  331166720.. 331191295:   24576:  0004: network
   5:      0..   8191:  156459008.. 156467199:    8192:  0001: network
   6:   8192..  14335:  156502016.. 156508159:    6144:  0001: network
   7:  14336..  18431:  156622848.. 156626943:    4096:  0001: network
   8:  18432..  24575:  156516352.. 156522495:    6144:  0001: network
/myth/tmp/4stripe: 6 extents found

[adilger@twoshoes]$ lfs getstripe -v "/myth/tmp/4stripe"
/myth/tmp/4stripe
lmm_magic:          0x0BD10BD0
lmm_object_gr:      0
lmm_object_id:      0x24dab9
lmm_stripe_count:   4
lmm_stripe_size:    4194304
lmm_stripe_pattern: 1
lmm_stripe_offset:  3
     obdidx      objid      objid      group
          3     340942    0x533ce          0
          0     744427    0xb5beb          0
          4      64720     0xfcd0          0
          1     602677    0x93235          0

If you are using this for e.g. skipping sparse parts of the file, you would need to do some extra work to convert the object offsets into file offsets. Bug 13192 contains old patches for userspace helper functions that should do most of that work, if you are interested in taking a look at them. It would also be possible to change Lustre to return extents in file-offset order, but this would need a Lustre patch to implement (which is currently not a priority task).

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.

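The object-offset to file-offset conversion mentioned above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the helper code from Bug 13192: it assumes plain round-robin (RAID0-style) striping, that the stripe index of an extent is the position of its device in the obdidx list printed by lfs getstripe, and that every unmapped range is a hole. The names object_extent_to_file_ranges, find_holes and OST_ORDER are hypothetical.

```python
def object_extent_to_file_ranges(obj_offset, length, stripe_index,
                                 stripe_size, stripe_count):
    """Split one per-object extent into (file_offset, length) pieces.

    Assumes round-robin striping: stripe N of the file lives in object
    N % stripe_count, at object offset (N // stripe_count) * stripe_size.
    """
    ranges = []
    while length > 0:
        stripe_no = obj_offset // stripe_size      # which stripe within this object
        within = obj_offset % stripe_size          # offset inside that stripe
        chunk = min(length, stripe_size - within)  # stop at the stripe boundary
        file_offset = (stripe_no * stripe_count + stripe_index) * stripe_size + within
        ranges.append((file_offset, chunk))
        obj_offset += chunk
        length -= chunk
    return ranges


def find_holes(ranges, file_size):
    """Return (offset, length) gaps not covered by any range, up to file_size."""
    holes, pos = [], 0
    for start, length in sorted(ranges):
        if start > pos:
            holes.append((pos, start - pos))
        pos = max(pos, start + length)
    if pos < file_size:
        holes.append((pos, file_size - pos))
    return holes


# Values copied from the 4stripe example above.
STRIPE_SIZE = 4194304
OST_ORDER = [3, 0, 4, 1]        # obdidx column, in layout order
FILE_SIZE = 104857600
BLOCK = 1024                    # filefrag reported 1024-byte blocks

# (dev, first object block, length in blocks) for each filefrag row
ROWS = [(3, 0, 14336), (3, 14336, 14336), (0, 0, 1024), (0, 1024, 23552),
        (4, 0, 24576), (1, 0, 8192), (1, 8192, 6144), (1, 14336, 4096),
        (1, 18432, 6144)]

file_ranges = []
for dev, start_blk, len_blk in ROWS:
    file_ranges += object_extent_to_file_ranges(
        start_blk * BLOCK, len_blk * BLOCK,
        OST_ORDER.index(dev), STRIPE_SIZE, len(OST_ORDER))

holes = find_holes(file_ranges, FILE_SIZE)
```

For the 4stripe file every stripe is allocated, so holes comes back empty; on a genuinely sparse file the returned gaps are the ranges a copy tool could skip.
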
Great, it appears to do exactly what I need.  One more question: what
interactions with the MDS and OSTs does this IOCTL cause?  That is, does
this IOCTL even require an MDS access, or does it interact with the OSTs
only (obviously I've already opened the file, causing an initial MDS
access)?

Cheers,
Brad

-- 
Brad Settlemyer
Research Associate
Oak Ridge National Laboratory

On 2010-06-11, at 8:10, "Bradley W. Settlemyer" <settlemyerbw at ornl.gov> wrote:
> Great, it appears to do exactly what I need.  One more question: What
> interactions with the MDS and OSTs does this IOCTL cause.  That is, does
> this IOCTL even require an MDS access, or does it interact with the OSTs
> only (obviously I've already opened the file causing an initial MDS
> access)?

After the initial open, the FIEMAP ioctl generates RPCs only to the OSTs. Normally this is a single RPC per stripe, since the protocol can pack hundreds of extents into a single page (assuming the caller has a 1-page buffer to receive the extents).