On Feb 27, 2008 14:50 +0300, Nikita Danilov wrote:
> Andreas Dilger writes:
> > are you aware of any desirable LOV EA changes that would be good to
> > include with the changes for the v3 "pool" EA (attached)? Are there
> > any changes that are desirable for e.g. FIDs or similar?
>
> I think that if we are introducing an incompatible LOV EA format, we may
> as well go forward with the changes hinted at in
>
> http://arch.lustre.org/index.php?title=MDS_striping_format#Future_developments

Nikita, sorry to take a long time to get back to this issue, but I think
it is quite valuable to pursue if we are already going to change the
on-disk format.

Since we already need to change the LOV_MAGIC value, we may as well
do as you suggest and have 0x0BD3ssss instead of 0x0BD30BD0, where
ssss = size of EA in bytes.

That would limit us to a 65536-byte striping EA, but it is still larger than
what is supported today, and the plans for wide striping also do not call
for larger EAs. Even supporting 64kB EAs would be an issue with the
current infrastructure because the client always has to preallocate a
receive buffer large enough for the largest EA, since it does not know
the EA size in advance.

The question is whether you think we should also add a magic + size to the
lov_ost_data_v1 structure, which is currently the same for all EA types.
Adding a per-stripe magic + size would reduce the number of stripes we
can allocate per file, and the 160-stripe limit is already a problem for
some systems with more than 160 OSTs.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
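To make the proposed 0x0BD3ssss encoding concrete, here is a minimal sketch of packing the EA size into the low 16 bits of the magic. Only the bit layout comes from the message above; the macro and function names are made up for illustration, and uint32_t stands in for the kernel's __u32 so that the fragment compiles on its own. One detail worth noting: the largest value that fits in ssss is 0xFFFF, so strictly the limit is 65535 bytes.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names; only the 0x0BD3ssss bit layout is from the thread. */
#define LOV_MAGIC_V3_BASE   0x0BD30000u  /* layout identifier, high 16 bits */
#define LOV_MAGIC_TYPE_MASK 0xFFFF0000u
#define LOV_MAGIC_SIZE_MASK 0x0000FFFFu  /* EA size in bytes, low 16 bits */

/* Build a magic word that carries the EA size in its low half. */
static inline uint32_t lov_magic_pack(uint32_t ea_size)
{
    assert(ea_size <= LOV_MAGIC_SIZE_MASK);  /* must fit in 16 bits */
    return LOV_MAGIC_V3_BASE | ea_size;
}

/* Check the layout identifier while ignoring the embedded size. */
static inline int lov_magic_is_v3(uint32_t magic)
{
    return (magic & LOV_MAGIC_TYPE_MASK) == LOV_MAGIC_V3_BASE;
}

/* Recover the EA size from the magic word. */
static inline uint32_t lov_magic_size(uint32_t magic)
{
    return magic & LOV_MAGIC_SIZE_MASK;
}
```

With this split, a reader has to mask the magic before comparing it, instead of testing it for equality against a single constant.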
Andreas Dilger writes:
> On Feb 27, 2008 14:50 +0300, Nikita Danilov wrote:
> > Andreas Dilger writes:
> > > are you aware of any desirable LOV EA changes that would be good to
> > > include with the changes for the v3 "pool" EA (attached)? Are there
> > > any changes that are desirable for e.g. FIDs or similar?
> >
> > I think that if we are introducing an incompatible LOV EA format, we may
> > as well go forward with the changes hinted at in
> >
> > http://arch.lustre.org/index.php?title=MDS_striping_format#Future_developments
>
> Nikita, sorry to take a long time to get back to this issue, but I think
> it is quite valuable to pursue if we are already going to change the
> on-disk format.
>
> Since we already need to change the LOV_MAGIC value, we may as well
> do as you suggest and have 0x0BD3ssss instead of 0x0BD30BD0, where
> ssss = size of EA in bytes.
>
> That would limit us to a 65536-byte striping EA, but it is still larger than

If that happens to be a limiting factor, we can interpret ssss as a
number of __u32's or __u64's in the EA.

> what is supported today, and the plans for wide striping also do not call
> for larger EAs. Even supporting 64kB EAs would be an issue with the
> current infrastructure because the client always has to preallocate a
> receive buffer large enough for the largest EA, since it does not know
> the EA size in advance.
>
> The question is whether you think we should also add a magic + size to the
> lov_ost_data_v1 structure, which is currently the same for all EA types.
> Adding a per-stripe magic + size would reduce the number of stripes we
> can allocate per file, and the 160-stripe limit is already a problem for
> some systems with more than 160 OSTs.

I think we need the ability to have fully general file layouts, so that,
for example, a stripe can in turn be a striped file. This can be
something like

    struct lov_mds_md_v3 {
            __u32 lmm_magic;        /* 0x0BD3ssss */
            __u32 lmm_pattern;      /* LOV_PATTERN_RAID0, LOV_PATTERN_RAID1 */
            __u64 lmm_object_id;    /* LOV object ID */
            __u64 lmm_object_gr;    /* LOV object group */
            __u32 lmm_stripe_size;  /* size of stripe in bytes */
            __u32 lmm_stripe_count; /* num stripes in use for this object */
    };

followed by a sequence of stripe layout descriptors, each starting with

    __u32 magic; /* 0xLLLLSSSS, where LLLL is an identifier of a
                  * layout type (e.g., 0bd3 is raid0 or raid1), and
                  * SSSS is a size. */

But for the common case of a striped file where all stripes have the same
layout, we implement a short-cut:

    struct lov_mds_md_v4 {
            __u32 lmm_magic;        /* 0x0BD4ssss */
            __u32 lmm_pattern;      /* LOV_PATTERN_RAID0, LOV_PATTERN_RAID1 */
            __u64 lmm_object_id;    /* LOV object ID */
            __u64 lmm_object_gr;    /* LOV object group */
            __u32 lmm_stripe_size;  /* size of stripe in bytes */
            __u32 lmm_stripe_count; /* num stripes in use for this object */
            __u32 lmm_stripe_magic; /* 0xLLLLSSSS for all stripes */
    };

followed by an array of stripe layout descriptors, stripped of their
magics.

Or we can go one step further and assume a particular value of
lmm_stripe_magic for a particular lmm_magic. In this case, the
->lmm_stripe_magic field can be removed.

> Cheers, Andreas

Nikita.
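The "sequence of stripe layout descriptors" idea can be sketched as a walk over a buffer in which every descriptor begins with a 0xLLLLSSSS word. The thread does not say what SSSS counts; the sketch below assumes it is the length of the whole descriptor in bytes, magic word included. It also reads the word in host byte order, ignoring on-disk endianness. The function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Count the descriptors packed back to back in buf[0..len).
 * Assumption (not stated in the thread): SSSS is the size in bytes of
 * the whole descriptor, including its leading 0xLLLLSSSS word.
 * Returns the number of descriptors, or -1 if the buffer is malformed.
 */
static inline int lov_count_descriptors(const unsigned char *buf, size_t len)
{
    size_t off = 0;
    int count = 0;

    assert(buf != NULL || len == 0);

    while (off < len) {
        uint32_t magic;
        uint32_t size;

        if (len - off < sizeof(magic))
            return -1;                  /* truncated magic word */
        memcpy(&magic, buf + off, sizeof(magic));

        size = magic & 0xFFFFu;         /* SSSS half of 0xLLLLSSSS */
        if (size < sizeof(magic) || size > len - off)
            return -1;                  /* size is too small or overruns */

        off += size;                    /* skip to the next descriptor */
        count++;
    }
    return count;
}
```

Because each descriptor carries its own size, a reader can step over layout types it does not understand. That is the property a nested layout (a stripe that is itself a striped file) would rely on.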
Comment in-line below...

> -----Original Message-----
> From: lustre-devel-bounces at lists.lustre.org
> [mailto:lustre-devel-bounces at lists.lustre.org] On Behalf Of
> Andreas Dilger
> Sent: 19 April 2008 1:22 AM
> To: Lustre Development Mailing List
> Cc: Mike Pershin; Nikita Danilov; Yuriy Umanets
> Subject: Re: [Lustre-devel] New Pools DLD
>
> On Feb 27, 2008 14:50 +0300, Nikita Danilov wrote:
> > Andreas Dilger writes:
> > > are you aware of any desirable LOV EA changes that would be good
> > > to include with the changes for the v3 "pool" EA (attached)?
> > > Are there any changes that are desirable for e.g. FIDs or
> > > similar?
> >
> > I think that if we are introducing an incompatible LOV EA format, we
> > may as well go forward with the changes hinted at in
> >
> > http://arch.lustre.org/index.php?title=MDS_striping_format#Future_developments
>
> Nikita, sorry to take a long time to get back to this issue, but I
> think it is quite valuable to pursue if we are already going to
> change the on-disk format.
>
> Since we already need to change the LOV_MAGIC value, we may as
> well do as you suggest and have 0x0BD3ssss instead of 0x0BD30BD0,
> where ssss = size of EA in bytes.

That seems a hack - why overload magic like this rather than have a
separate field?

> That would limit us to a 65536-byte striping EA, but it is still larger
> than what is supported today, and the plans for wide striping also
> do not call for larger EAs. Even supporting 64kB EAs would be an
> issue with the current infrastructure because the client always has
> to preallocate a receive buffer large enough for the largest EA,
> since it does not know the EA size in advance.
>
> The question is whether you think we should also add a magic + size
> to the lov_ost_data_v1 structure, which is currently the same for
> all EA types. Adding a per-stripe magic + size would reduce the
> number of stripes we can allocate per file, and the 160-stripe limit
> is already a problem for some systems with more than 160 OSTs.
>
> Cheers, Andreas
Eric Barton writes:
> Comment in-line below...
>
> > -----Original Message-----
> > From: lustre-devel-bounces at lists.lustre.org
> > [mailto:lustre-devel-bounces at lists.lustre.org] On Behalf Of
> > Andreas Dilger

[...]

> > Since we already need to change the LOV_MAGIC value, we may as
> > well do as you suggest and have 0x0BD3ssss instead of 0x0BD30BD0,
> > where ssss = size of EA in bytes.
>
> That seems a hack - why overload magic like this rather than have a
> separate field?

Historically, space in EAs was a very scarce resource. Also, even a
slight increase in EA size might result in significant performance
changes, e.g., when EAs stop fitting into the inode and an additional
seek+read is added to every inode load.

So I tend to agree with Andreas that it makes sense to replace

    __u32 magic;

with

    __u16 layout_type_id;
    __u16 layout_descriptor_size;

Nikita.
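A small sketch of that split, to show that the two __u16 fields occupy exactly the four bytes of the old magic and so add nothing to the EA size. The struct name and the helper functions are hypothetical, and uint16_t/uint32_t stand in for __u16/__u32. The helpers convert by arithmetic rather than by casting, because the order of the two halves in memory depends on the byte order chosen for the on-disk format, which the thread leaves open.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical struct name; the two field names are from the message above. */
struct lov_layout_hdr {
    uint16_t layout_type_id;          /* LLLL, e.g. 0x0BD3 */
    uint16_t layout_descriptor_size;  /* SSSS; assumed here to be in bytes */
};

/* The split header must not grow the EA: same 4 bytes as the __u32 magic. */
static_assert(sizeof(struct lov_layout_hdr) == sizeof(uint32_t),
              "layout header must stay 4 bytes");

/* Split a 0xLLLLSSSS word into its two halves. */
static inline struct lov_layout_hdr lov_hdr_from_magic(uint32_t magic)
{
    struct lov_layout_hdr hdr = {
        .layout_type_id = (uint16_t)(magic >> 16),
        .layout_descriptor_size = (uint16_t)(magic & 0xFFFFu),
    };
    return hdr;
}

/* Recombine the two halves into a 0xLLLLSSSS word. */
static inline uint32_t lov_hdr_to_magic(struct lov_layout_hdr hdr)
{
    return ((uint32_t)hdr.layout_type_id << 16) | hdr.layout_descriptor_size;
}
```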
On Apr 21, 2008 15:08 +0400, Nikita Danilov wrote:
> Andreas Dilger writes:
> > Since we already need to change the LOV_MAGIC value, we may as well
> > do as you suggest and have 0x0BD3ssss instead of 0x0BD30BD0, where
> > ssss = size of EA in bytes.
> >
> > That would limit us to a 65536-byte striping EA, but it is still larger than
>
> If that happens to be a limiting factor, we can interpret ssss as a
> number of __u32's or __u64's in the EA.

The EA size being a multiple of at least __u32 is certainly true, and the
current EAs are also a multiple of __u64, so I don't think this is a bad
idea at all. Counting in __u64 units gives us 8x larger EAs (512kB).

> I think we need the ability to have fully general file layouts, so that,
> for example, a stripe can in turn be a striped file. This can be
> something like
>
>     struct lov_mds_md_v3 {
>             __u32 lmm_magic;        /* 0x0BD3ssss */
>             __u32 lmm_pattern;      /* LOV_PATTERN_RAID0, LOV_PATTERN_RAID1 */
>             __u64 lmm_object_id;    /* LOV object ID */
>             __u64 lmm_object_gr;    /* LOV object group */
>             __u32 lmm_stripe_size;  /* size of stripe in bytes */
>             __u32 lmm_stripe_count; /* num stripes in use for this object */
>     };
>
> followed by a sequence of stripe layout descriptors, each starting with
>
>     __u32 magic; /* 0xLLLLSSSS, where LLLL is an identifier of a
>                   * layout type (e.g., 0bd3 is raid0 or raid1), and
>                   * SSSS is a size. */
>
> But for the common case of a striped file where all stripes have the same
> layout, we implement a short-cut:
>
>     struct lov_mds_md_v4 {
>             __u32 lmm_magic;        /* 0x0BD4ssss */
>             __u32 lmm_pattern;      /* LOV_PATTERN_RAID0, LOV_PATTERN_RAID1 */
>             __u64 lmm_object_id;    /* LOV object ID */
>             __u64 lmm_object_gr;    /* LOV object group */
>             __u32 lmm_stripe_size;  /* size of stripe in bytes */
>             __u32 lmm_stripe_count; /* num stripes in use for this object */
>             __u32 lmm_stripe_magic; /* 0xLLLLSSSS for all stripes */
>     };
>
> followed by an array of stripe layout descriptors, stripped of their
> magics.
>
> Or we can go one step further and assume a particular value of
> lmm_stripe_magic for a particular lmm_magic. In this case, the
> ->lmm_stripe_magic field can be removed.

This is essentially the mechanism we use today, which is OK for the very
common case of a single record format. I do like the idea of being able
to have a hierarchical LOV EA format, and have been thinking about that
for a long time.

The other thing that I noticed in the pNFS layout is the ability to have
"joined" files in a very simple manner. The "header" part of the layout
(equivalent to our lov_mds_md) also contains the length in bytes of that
part of the file, which allows multiple EAs to be concatenated
together.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
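Two points from this message lend themselves to a short sketch. The first is the arithmetic behind the 8x figure: if ssss counts __u64 words instead of bytes, the largest EA is 0xFFFF * 8 = 524280 bytes, just under 512kB. The second is the "joined" file idea, where each concatenated layout records how many bytes of the file it covers, so that a file offset can be mapped to its layout with a linear scan. All names below are hypothetical, and the component struct is reduced to the single length field the message mentions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* EA size in bytes when ssss counts 'unit'-byte words (1, 4 or 8). */
static inline uint32_t lov_magic_size_scaled(uint32_t magic, uint32_t unit)
{
    assert(unit == 1 || unit == 4 || unit == 8);
    return (magic & 0xFFFFu) * unit;
}

/* Hypothetical header of one layout in a "joined" file. */
struct lov_component {
    uint64_t lc_length;  /* bytes of the file covered by this layout */
};

/*
 * Find which of n concatenated layouts covers a file offset.
 * Returns the component index, or -1 if the offset lies past the end.
 */
static inline int lov_component_for_offset(const struct lov_component *c,
                                           size_t n, uint64_t offset)
{
    uint64_t start = 0;  /* file offset where component i begins */

    for (size_t i = 0; i < n; i++) {
        /* start <= offset always holds here, so this cannot wrap */
        if (offset - start < c[i].lc_length)
            return (int)i;
        start += c[i].lc_length;
    }
    return -1;
}
```

For example, with two components of 1MB and 4kB, offset 0 falls in the first and offset 1048576 in the second.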