Mark Fasheh
2007-Oct-29 13:57 UTC
[Ocfs2-devel] Re: [RFC] add FIEMAP ioctl to efficiently map file allocation
Hi Andreas,

Thanks for posting this. I believe that an interface such as FIEMAP would be
very useful to Ocfs2 as well. (I added ocfs2-devel to the e-mail)

My comments below are generally geared towards understanding the ioctl
interface.

On Mon, Oct 29, 2007 at 01:45:07PM -0600, Andreas Dilger wrote:
> 2 Functional specification
>
> The FIEMAP ioctl (FIle Extent MAP) is similar to the existing FIBMAP
> ioctl block device ioctl used for mapping an individual logical block
> address in a file to a physical block address in the block device. The
> FIEMAP ioctl will return the logical to physical mapping for the extent
> that contains the specified logical byte address.
>
> struct fiemap_extent {
>         __u64 fe_offset; /* offset in bytes for the start of the extent */

I'm a little bit confused by fe_offset. Is it a physical offset, or a
logical offset? The reason I ask is that your description above says "FIEMAP
ioctl will return the logical to physical mapping for the extent that
contains the specified logical byte address." Which seems to imply physical,
but your math to get to the next logical start in a very fragmented file,
implies that fe_offset is a logical offset:

        fm_start = fm_extents[fm_extent_count - 1].fe_offset +
                   fm_extents[fm_extent_count - 1].fe_length + 1;

> The logic for the filefrag would be similar to above. The size of the
> extent array will be extrapolated from the filesize and multiple ioctls
> of increasing extent count may be called for very large files. filefrag
> can easily call the FIEMAP ioctls repeatedly using the end of the last
> extent as the start offset for the next ioctl:
>
>         fm_start = fm_extents[fm_extent_count - 1].fe_offset +
>                    fm_extents[fm_extent_count - 1].fe_length + 1;
>
> We do this until we find an extent with FIEMAP_EXTENT_LAST flag set. We
> will also need to re-initialise the fiemap flags, fm_extent_count, fm_end.

I think you meant 'fm_length' instead of 'fm_end' there.

> The FIEMAP_FLAG_* values are specified below.
> If FIEMAP_FLAG_NO_EXTENTS is
> given then the fm_extents array is not filled, and only fm_extent_count is
> returned with the total number of extents in the file. Any new flags that
> introduce and/or require an incompatible behaviour in an application or
> in the kernel need to be in the range specified by FIEMAP_FLAG_INCOMPAT
> (e.g. FIEMAP_FLAG_SYNC and FIEMAP_FLAG_NO_EXTENTS would fall into that
> range if they were not part of the original specification). This is
> currently only for future use. If it turns out that FIEMAP_FLAG_INCOMPAT
> is not large enough then it is possible to use the last INCOMPAT flag
> 0x01000000 to indicate that more of the flag range contains incompatible
> flags.
>
> #define FIEMAP_FLAG_SYNC        0x00000001 /* sync file data before map */
> #define FIEMAP_FLAG_HSM_READ    0x00000002 /* get data from HSM before map */
> #define FIEMAP_FLAG_NUM_EXTENTS 0x00000004 /* return only number of extents */
> #define FIEMAP_FLAG_INCOMPAT    0xff000000 /* error for unknown flags in here */
>
> The returned data from the FIEMAP ioctl is an array of fiemap_extent
> elements, one per extent in the file. The first extent will contain the
> byte specified by fm_start and the last extent will contain the byte
> specified by fm_start + fm_len, unless there are more than the passed-in
> fm_extent_count extents in the file, or this is beyond the EOF in which
> case the last extent will be marked with FIEMAP_EXTENT_LAST. Each extent
> returned has a set of flags associated with it that provide additional
> information about the extent. Not all filesystems will support all flags.
>
> FIEMAP_FLAG_NUM_EXTENTS will return only the number of extents used by
> the file. It will be used by default for filefrag since the specific
> extent information is not required in many cases.
>
> #define FIEMAP_EXTENT_HOLE      0x00000001 /* has no data or space allocation */

Btw, I really like that holes are explicitly marked.

> #define FIEMAP_EXTENT_UNWRITTEN 0x00000002 /* space allocated, but no data */
> #define FIEMAP_EXTENT_UNMAPPED  0x00000004 /* has data but no space allocated */
> #define FIEMAP_EXTENT_ERROR     0x00000008 /* map error, errno in fe_offset. */
> #define FIEMAP_EXTENT_NO_DIRECT 0x00000010 /* cannot access data directly */
> #define FIEMAP_EXTENT_LAST      0x00000020 /* last extent in the file */
> #define FIEMAP_EXTENT_DELALLOC  0x00000040 /* has data but not yet written */
> #define FIEMAP_EXTENT_SECONDARY 0x00000080 /* data in secondary storage */
> #define FIEMAP_EXTENT_EOF       0x00000100 /* fm_start + fm_len beyond EOF */

Is "EOF" here considering "beyond i_size" or "beyond allocation"?

> #define FIEMAP_EXTENT_UNKNOWN   0x00000200 /* in use but location is unknown */
>
>
> FIEMAP_EXTENT_NO_DIRECT means data cannot be directly accessed (maybe
> encrypted, compressed, etc.)

Would it be valid to use FIEMAP_EXTENT_NO_DIRECT for marking in-inode data?
Btrfs, Ocfs2, and Gfs2 pack small amounts of user data directly in inode
blocks.

Thanks,
        --Mark

--
Mark Fasheh
Senior Software Developer, Oracle
mark.fasheh@oracle.com
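[Editor's sketch] The continuation arithmetic questioned above only works if fe_offset is a logical offset. The following is a minimal sketch under that assumption, not the RFC's implementation: the struct layout is hypothetical (the thread shows only fe_offset and mentions fe_length and flags), and mock_fiemap() and mock_file are stand-ins invented here in place of the real ioctl. It also drops the "+ 1" from the quoted formula, on the assumption that fe_length is a byte count, so fe_offset + fe_length is already the first byte past the extent.

```c
#include <assert.h>
#include <stdint.h>

#define FIEMAP_EXTENT_LAST 0x00000020 /* value quoted from the RFC */

/* Hypothetical layout: only the field names come from the thread. */
struct fiemap_extent {
    uint64_t fe_offset; /* ASSUMED to be a logical byte offset here */
    uint64_t fe_length; /* extent length in bytes */
    uint32_t fe_flags;  /* FIEMAP_EXTENT_* */
};

/* Invented sample file with three extents; stands in for kernel data. */
static const struct fiemap_extent mock_file[] = {
    { 0,     4096, 0 },
    { 4096,  8192, 0 },
    { 12288, 4096, FIEMAP_EXTENT_LAST },
};

/*
 * Stand-in for one FIEMAP call: copy up to max extents, beginning with
 * the extent that contains byte fm_start. Returns the number copied.
 */
static unsigned mock_fiemap(uint64_t fm_start, struct fiemap_extent *out,
                            unsigned max)
{
    unsigned n = 0;

    assert(max > 0);
    for (unsigned i = 0; i < sizeof mock_file / sizeof mock_file[0]; i++) {
        if (n == max)
            break;
        /* Skip extents that end at or before fm_start. */
        if (mock_file[i].fe_offset + mock_file[i].fe_length > fm_start)
            out[n++] = mock_file[i];
    }
    return n;
}

/* Walk the file two extents per call, filefrag-style; returns the total. */
static unsigned count_extents(void)
{
    struct fiemap_extent buf[2];
    uint64_t fm_start = 0;
    unsigned total = 0;

    for (;;) {
        unsigned n = mock_fiemap(fm_start, buf, 2);

        if (n == 0)
            break;
        total += n;
        if (buf[n - 1].fe_flags & FIEMAP_EXTENT_LAST)
            break;
        /* Resume at the first byte after the last extent returned. */
        fm_start = buf[n - 1].fe_offset + buf[n - 1].fe_length;
    }
    return total;
}
```

With the sample data, count_extents() needs two calls: the first returns the extents at 0 and 4096, the second resumes at byte 12288 and stops on FIEMAP_EXTENT_LAST. If fe_offset were physical instead, the resume step would need a separate logical offset that this struct does not carry.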
Mark Fasheh
2007-Oct-29 17:11 UTC
[Ocfs2-devel] Re: [RFC] add FIEMAP ioctl to efficiently map file allocation
On Mon, Oct 29, 2007 at 04:13:02PM -0600, Andreas Dilger wrote:
> On Oct 29, 2007 13:57 -0700, Mark Fasheh wrote:
> > Thanks for posting this. I believe that an interface such as FIEMAP
> > would be very useful to Ocfs2 as well. (I added ocfs2-devel to the e-mail)
>
> I tried to make it as Lustre-agnostic as possible...

IMHO, your description succeeded at that. I'm hoping that the final patch
can have mostly generic code, like FIBMAP does today.

> > > #define FIEMAP_EXTENT_LAST 0x00000020 /* last extent in the file */
> > > #define FIEMAP_EXTENT_EOF  0x00000100 /* fm_start + fm_len beyond EOF*/
> >
> > Is "EOF" here considering "beyond i_size" or "beyond allocation"?
>
> _EOF == beyond i_size.
> _LAST == last extent in the file.
>
> In most cases FIEMAP_EXTENT_EOF will be set at the same time as
> FIEMAP_EXTENT_LAST, but in case of e.g. prealloc beyond i_size the
> EOF flag may be set on one or more earlier extents.

Oh, ok great - I was primarily looking for a way to say "there's allocation
past i_size" and it looks like we have it.

> > > FIEMAP_EXTENT_NO_DIRECT means data cannot be directly accessed (maybe
> > > encrypted, compressed, etc.)
> >
> > Would it be valid to use FIEMAP_EXTENT_NO_DIRECT for marking in-inode data?
> > Btrfs, Ocfs2, and Gfs2 pack small amounts of user data directly in inode
> > blocks.
>
> Hmm, but part of the issue would be how to request the extra data, and
> what offset it would be given? One could, for example, use negative
> offsets to represent metadata or something, or add a FIEMAP_EXTENT_META
> or similar, I hadn't given that much thought.

Well, fe_offset and fe_length are already expressed in bytes, so we could
just put the byte offset to where the inline data starts in there. fe_length
is just used as the length allocated for inline-data. If fe_offset is
required to be block aligned, then we could add a field to express an offset
within the block where data would be found - say 'fe_data_start_offset'.
In the non-inline case, we could guarantee that fe_data_start_offset is
zero. That way software which doesn't want to care whether something is
inline-data (for example, a backup program) or not could just blindly add it
to fe_offset before looking at the data.

Regardless, I think we also want to explicitly flag this:

#define FIEMAP_EXTENT_DATA_IN_INODE 0x00000400 /* extent data is stored in inode block */

I'm going to pretend that I completely understand reiserfs tail-packing and
say that my approaches above look like they could work for that case too.
We'd want to add a separate flag for tail packed data though.

> The other issue is that I'd like to get the basics of the API in place
> before it gets too complex. We can always add functionality with more
> FIEMAP_FLAG_* (whether in the INCOMPAT range or not, depending on what is
> being done).

Sure, but I think whatever goes upstream should be able to handle this case -
there are file systems in use _today_ which put data in inode blocks and pack
file tails.

Thanks,
        --Mark

--
Mark Fasheh
Senior Software Developer, Oracle
mark.fasheh@oracle.com
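[Editor's sketch] The block-aligned variant proposed above can be illustrated as follows. This is a sketch under stated assumptions rather than a proposed kernel definition: the struct name fiemap_extent_inline, the 32-bit width of fe_data_start_offset, and the helper names are all invented here; only fe_offset, fe_length, fe_data_start_offset and the FIEMAP_EXTENT_DATA_IN_INODE value come from the message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FIEMAP_EXTENT_DATA_IN_INODE 0x00000400 /* value proposed in the message above */

/* Hypothetical extent layout with the proposed extra field. */
struct fiemap_extent_inline {
    uint64_t fe_offset;            /* physical byte offset, assumed block aligned */
    uint64_t fe_length;            /* bytes allocated for the data */
    uint32_t fe_flags;             /* FIEMAP_EXTENT_* */
    uint32_t fe_data_start_offset; /* offset of data within the block; width is
                                      an assumption, zero when not inline */
};

/*
 * Byte position where the data begins. A caller that does not care about
 * inline data can use this unconditionally, because the proposal
 * guarantees fe_data_start_offset == 0 for ordinary extents.
 */
static uint64_t extent_data_start(const struct fiemap_extent_inline *fe)
{
    assert(fe != NULL);
    return fe->fe_offset + fe->fe_data_start_offset;
}

/* Non-zero if the extent is explicitly flagged as living in the inode block. */
static int extent_is_inline(const struct fiemap_extent_inline *fe)
{
    assert(fe != NULL);
    return (fe->fe_flags & FIEMAP_EXTENT_DATA_IN_INODE) != 0;
}
```

A backup program would call extent_data_start() on every extent and never branch on the flag, while a tool that reports layout (filefrag, say) could use extent_is_inline() to label the extent.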
Andreas Dilger
2007-Nov-05 17:44 UTC
[Ocfs2-devel] Re: [RFC] add FIEMAP ioctl to efficiently map file allocation
On Oct 29, 2007 13:57 -0700, Mark Fasheh wrote:
> Thanks for posting this. I believe that an interface such as FIEMAP
> would be very useful to Ocfs2 as well. (I added ocfs2-devel to the e-mail)

I tried to make it as Lustre-agnostic as possible...

> On Mon, Oct 29, 2007 at 01:45:07PM -0600, Andreas Dilger wrote:
> > The FIEMAP ioctl (FIle Extent MAP) is similar to the existing FIBMAP
> > ioctl block device ioctl used for mapping an individual logical block
> > address in a file to a physical block address in the block device. The
> > FIEMAP ioctl will return the logical to physical mapping for the extent
> > that contains the specified logical byte address.
> >
> > struct fiemap_extent {
> >         __u64 fe_offset; /* offset in bytes for the start of the extent */
>
> I'm a little bit confused by fe_offset. Is it a physical offset, or a
> logical offset? The reason I ask is that your description above says "FIEMAP
> ioctl will return the logical to physical mapping for the extent that
> contains the specified logical byte address." Which seems to imply physical,
> but your math to get to the next logical start in a very fragmented file,
> implies that fe_offset is a logical offset:
>
>         fm_start = fm_extents[fm_extent_count - 1].fe_offset +
>                    fm_extents[fm_extent_count - 1].fe_length + 1;

Note the distinction between "fe_offset" (which is a physical offset for
a single extent) and "fm_offset" (which is a logical offset for that file).

> > We do this until we find an extent with FIEMAP_EXTENT_LAST flag set. We
> > will also need to re-initialise the fiemap flags, fm_extent_count, fm_end.
>
> I think you meant 'fm_length' instead of 'fm_end' there.

You're right, thanks.

> > #define FIEMAP_EXTENT_LAST 0x00000020 /* last extent in the file */
> > #define FIEMAP_EXTENT_EOF  0x00000100 /* fm_start + fm_len beyond EOF*/
>
> Is "EOF" here considering "beyond i_size" or "beyond allocation"?

_EOF == beyond i_size.
_LAST == last extent in the file.
In most cases FIEMAP_EXTENT_EOF will be set at the same time as
FIEMAP_EXTENT_LAST, but in case of e.g. prealloc beyond i_size the
EOF flag may be set on one or more earlier extents.

> > FIEMAP_EXTENT_NO_DIRECT means data cannot be directly accessed (maybe
> > encrypted, compressed, etc.)
>
> Would it be valid to use FIEMAP_EXTENT_NO_DIRECT for marking in-inode data?
> Btrfs, Ocfs2, and Gfs2 pack small amounts of user data directly in inode
> blocks.

Hmm, but part of the issue would be how to request the extra data, and
what offset it would be given? One could, for example, use negative
offsets to represent metadata or something, or add a FIEMAP_EXTENT_META
or similar, I hadn't given that much thought.

The other issue is that I'd like to get the basics of the API in place
before it gets too complex. We can always add functionality with more
FIEMAP_FLAG_* (whether in the INCOMPAT range or not, depending on what is
being done).

Cheers, Andreas
--
Andreas Dilger
Sr. Software Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
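[Editor's sketch] The _EOF versus _LAST distinction above suggests a simple test for "allocation past i_size": an extent that carries FIEMAP_EXTENT_EOF without FIEMAP_EXTENT_LAST must be followed by at least one more extent. The sketch below assumes exactly that reading of the two flags; the helper name has_alloc_past_isize() is invented here, and it takes a bare array of per-extent flag words because the thread does not show the full struct.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Flag values as quoted from the RFC. */
#define FIEMAP_EXTENT_LAST 0x00000020 /* last extent in the file */
#define FIEMAP_EXTENT_EOF  0x00000100 /* fm_start + fm_len beyond EOF */

/*
 * Returns 1 if any extent is flagged _EOF but is not the _LAST extent,
 * which under the reading above means further extents (for example
 * preallocated space) lie beyond i_size. Returns 0 otherwise.
 */
static int has_alloc_past_isize(const uint32_t *fe_flags, size_t n)
{
    assert(fe_flags != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if ((fe_flags[i] & FIEMAP_EXTENT_EOF) &&
            !(fe_flags[i] & FIEMAP_EXTENT_LAST))
            return 1;
    }
    return 0;
}
```

In the common case the final extent has both flags set and the function returns 0. With preallocation past i_size, an earlier extent has _EOF alone and the function returns 1. It cannot detect a single last extent that itself extends past i_size, since that case is not distinguishable from the flags described in this thread.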