thr3ads.net - Lustre devel - [Lustre-devel] New Pools DLD [Apr 2008]

Andreas Dilger

2008-Apr-19 00:22 UTC

[Lustre-devel] New Pools DLD

On Feb 27, 2008  14:50 +0300, Nikita Danilov wrote:
> Andreas Dilger writes:
> > are you aware of any desirable LOV EA changes that would be good to
> > include with the changes for the v3 "pool" EA (attached)?  Are there
> > any changes that are desirable for e.g. FIDs or similar?
> 
> I think that if we are introducing incompatible LOV EA format, we can as
> well go forward with changes hinted to at
> 
> http://arch.lustre.org/index.php?title=MDS_striping_format#Future_developments

Nikita, sorry to take a long time to get back to this issue, but I think
it is quite valuable to pursue if we are already going to change the
on-disk format.

Since we already need to change the LOV_MAGIC value, we may as well
do as you suggest and have 0x0BD3ssss instead of 0x0BD30BD0, where
ssss = size of EA in bytes.

That would limit us to a 65535-byte striping EA, but it is still larger than
what is supported today, and the plans for wide striping also do not call
for larger EAs.  Even supporting 64kB EAs would be an issue with the
current infrastructure because the client always has to preallocate a
receive buffer large enough for the largest EA, since it does not know
the EA size in advance.
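
As a minimal sketch of the 0x0BD3ssss encoding described above: the upper 16 bits carry the layout type and the lower 16 bits carry the EA size in bytes. The helper names below are hypothetical and not taken from the Lustre source; only the bit layout comes from the message.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names; only the 0x0BD3ssss layout comes from the thread. */
#define LOV_MAGIC_V3_TYPE 0x0BD3u   /* upper 16 bits: layout type */

/* Build a 0x0BD3ssss-style magic from a type and an EA size in bytes. */
static uint32_t lov_magic_pack(uint16_t type, uint32_t ea_size)
{
        assert(ea_size <= 0xFFFFu);  /* ssss is only 16 bits wide */
        return ((uint32_t)type << 16) | ea_size;
}

/* Extract the layout type (upper 16 bits). */
static uint16_t lov_magic_type(uint32_t magic)
{
        return (uint16_t)(magic >> 16);
}

/* Extract the EA size in bytes (lower 16 bits). */
static uint16_t lov_magic_size(uint32_t magic)
{
        return (uint16_t)(magic & 0xFFFFu);
}
```

Because ssss is 16 bits wide, the largest size it can express is 0xFFFF bytes, which is where the just-under-64kB ceiling comes from.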

The question is whether you think we should also add a magic + size to the
lov_ost_data_v1 structure, which is currently the same for all EA types.
Adding a per-stripe magic + size would reduce the number of stripes we
can allocate per file, and the 160-stripe limit is already a problem for
some systems with more than 160 OSTs.
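
The trade-off can be put in numbers with a small helper. This is only a sketch: the function name is hypothetical, and every size passed to it is an assumption supplied by the caller rather than a figure stated in this thread.

```c
#include <stddef.h>

/* Number of per-stripe records that fit in an EA of ea_budget bytes
 * after a fixed header of hdr_size bytes. All sizes are caller-supplied. */
static size_t max_stripes(size_t ea_budget, size_t hdr_size, size_t per_stripe)
{
        if (per_stripe == 0 || ea_budget < hdr_size)
                return 0;
        return (ea_budget - hdr_size) / per_stripe;
}
```

For illustration only, assume a 4096-byte EA budget, a 32-byte header, and 24 bytes per stripe record: max_stripes(4096, 32, 24) gives 169. Growing each record by a 4-byte magic + size word, max_stripes(4096, 32, 28) gives 145, so the per-stripe field would cost roughly one stripe in seven under these assumed sizes.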

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

Nikita Danilov

2008-Apr-21 11:08 UTC

[Lustre-devel] New Pools DLD

Andreas Dilger writes:
 > On Feb 27, 2008  14:50 +0300, Nikita Danilov wrote:
 > > Andreas Dilger writes:
 > > > are you aware of any desirable LOV EA changes that would be good to
 > > > include with the changes for the v3 "pool" EA (attached)?  Are there
 > > > any changes that are desirable for e.g. FIDs or similar?
 > > 
 > > I think that if we are introducing incompatible LOV EA format, we can as
 > > well go forward with changes hinted to at
 > > 
 > > http://arch.lustre.org/index.php?title=MDS_striping_format#Future_developments
 > 
 > Nikita, sorry to take a long time to get back to this issue, but I think
 > it is quite valuable to pursue if we are already going to change the
 > on-disk format.
 > 
 > Since we already need to change the LOV_MAGIC value, we may as well
 > do as you suggest and have 0x0BD3ssss instead of 0x0BD30BD0, where
 > ssss = size of EA in bytes.
 > 
 > That would limit us to a 65535-byte striping EA, but it is still larger than

If that happens to be a limiting factor, we can interpret ssss as a
number of __u32's or __u64's in the EA.

 > what is supported today, and the plans for wide striping also do not call
 > for larger EAs.  Even supporting 64kB EAs would be an issue with the
 > current infrastructure because the client always has to preallocate a
 > receive buffer large enough for the largest EA, since it does not know
 > the EA size in advance.
 > 
 > The question is whether you think we should also add a magic + size to the
 > lov_ost_data_v1 structure, which is currently the same for all EA types.
 > Adding a per-stripe magic + size would reduce the number of stripes we
 > can allocate per file, and the 160-stripe limit is already a problem for
 > some systems with more than 160 OSTs.

I think we need the ability to have a fully general file layout, so that,
for example, a stripe can in turn be a striped file. This can be
something like

struct lov_mds_md_v3 {
        __u32 lmm_magic;          /* 0x0BD3ssss */
        __u32 lmm_pattern;        /* LOV_PATTERN_RAID0, LOV_PATTERN_RAID1 */
        __u64 lmm_object_id;      /* LOV object ID */
        __u64 lmm_object_gr;      /* LOV object group */
        __u32 lmm_stripe_size;    /* size of stripe in bytes */
        __u32 lmm_stripe_count;   /* num stripes in use for this object */
};

followed by a sequence of stripe layout descriptors each starting with

        __u32 magic; /* 0xLLLLSSSS. where LLLL is an identifier of a
                      * layout type (e.g., 0bd3 is raid0 or raid1), and
                      * SSSS is a size. */
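
A parser for such a sequence could look like the sketch below. It rests on two assumptions that the proposal does not spell out: that SSSS is the descriptor size in bytes including the 4-byte magic itself, and that the magic is stored little-endian. The function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode a little-endian 32-bit value (assumed on-disk byte order). */
static uint32_t get_le32(const unsigned char *p)
{
        return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
               ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Count the descriptors in buf[0..len). Each starts with a 0xLLLLSSSS
 * magic; SSSS is assumed to be the descriptor size in bytes, magic
 * included. Returns -1 if a descriptor is malformed or overruns buf. */
static int count_descriptors(const unsigned char *buf, size_t len)
{
        size_t off = 0;
        int n = 0;

        while (off < len) {
                uint32_t magic, size;

                if (len - off < 4)
                        return -1;      /* truncated magic */
                magic = get_le32(buf + off);
                size = magic & 0xFFFFu;
                if (size < 4 || size > len - off)
                        return -1;      /* too small, or overruns buffer */
                off += size;            /* skip to the next descriptor */
                n++;
        }
        return n;
}
```

Since every descriptor carries its own size, a reader can skip layout types it does not understand (the LLLL part) and still find the next descriptor, which is what makes nested or mixed layouts practical.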

But for a common case of a striped file where all stripes have the same
layout, we implement a short-cut:

struct lov_mds_md_v4 {
        __u32 lmm_magic;          /* 0x0BD4ssss */
        __u32 lmm_pattern;        /* LOV_PATTERN_RAID0, LOV_PATTERN_RAID1 */
        __u64 lmm_object_id;      /* LOV object ID */
        __u64 lmm_object_gr;      /* LOV object group */
        __u32 lmm_stripe_size;    /* size of stripe in bytes */
        __u32 lmm_stripe_count;   /* num stripes in use for this object */
        __u32 lmm_stripe_magic;   /* 0xLLLLSSSS for all stripes */
};

followed by an array of stripe layout descriptors, stripped of their
magics.

Or we can go one step further and assume a particular value of
lmm_stripe_magic for a particular lmm_magic. In this case, the
->lmm_stripe_magic field can be removed.

 > 
 > Cheers, Andreas
 > --

Nikita.

Eric Barton

2008-Apr-22 13:06 UTC

[Lustre-devel] New Pools DLD

Comment in-line below...
> -----Original Message-----
> From: lustre-devel-bounces at lists.lustre.org 
> [mailto:lustre-devel-bounces at lists.lustre.org] On Behalf Of 
> Andreas Dilger
> Sent: 19 April 2008 1:22 AM
> To: Lustre Development Mailing List
> Cc: Mike Pershin; Nikita Danilov; Yuriy Umanets
> Subject: Re: [Lustre-devel] New Pools DLD
> 
> On Feb 27, 2008  14:50 +0300, Nikita Danilov wrote:
> > Andreas Dilger writes:
> > > are you aware of any desirable LOV EA changes that would be good
> > > to include with the changes for the v3 "pool" EA (attached)?
> > > Are there any changes that are desirable for e.g. FIDs or
> > > similar?
> > 
> > I think that if we are introducing incompatible LOV EA format, we
> > can as well go forward with changes hinted to at
> > 
> > http://arch.lustre.org/index.php?title=MDS_striping_format#Future_developments
> 
> Nikita, sorry to take a long time to get back to this issue, but I
> think it is quite valuable to pursue if we are already going to
> change the on-disk format.
> 
> Since we already need to change the LOV_MAGIC value, we may as
> well do as you suggest and have 0x0BD3ssss instead of 0x0BD30BD0,
> where ssss = size of EA in bytes.

That seems a hack - why overload magic like this rather than have a
separate field?

> That would limit us to a 65535-byte striping EA, but it is still larger
> than what is supported today, and the plans for wide striping also
> do not call for larger EAs.  Even supporting 64kB EAs would be an
> issue with the current infrastructure because the client always has
> to preallocate a receive buffer large enough for the largest EA,
> since it does not know the EA size in advance.
> 
> The question is whether you think we should also add a magic + size
> to the lov_ost_data_v1 structure, which is currently the same for
> all EA types.  Adding a per-stripe magic + size would reduce the
> number of stripes we can allocate per file, and the 160-stripe limit
> is already a problem for some systems with more than 160 OSTs.
> 
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.

Nikita Danilov

2008-Apr-22 17:03 UTC

[Lustre-devel] New Pools DLD

Eric Barton writes:
 > Comment in-line below...
 > 
 > > -----Original Message-----
 > > From: lustre-devel-bounces at lists.lustre.org 
 > > [mailto:lustre-devel-bounces at lists.lustre.org] On Behalf Of 
 > > Andreas Dilger

[...]

 > > Since we already need to change the LOV_MAGIC value, we may as
 > > well do as you suggest and have 0x0BD3ssss instead of 0x0BD30BD0,
 > > where ssss = size of EA in bytes.
 > 
 > That seems a hack - why overload magic like this rather than have a
 > separate field?

Historically, space in EAs was a very scarce resource. Also, even a slight
increase in EA size might result in significant performance changes,
e.g., when EAs stop fitting into the inode, and an additional seek+read is
added to every inode load. So I tend to agree with Andreas that it makes
sense to replace the __u32 magic with

        __u16 layout_type_id;
        __u16 layout_descriptor_size;
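
The point of the split is that it costs no extra space. A minimal sketch, assuming a hypothetical struct name and using fixed-width C types in place of __u16/__u32:

```c
#include <stdint.h>

/* Hypothetical struct name; the two fields are the ones proposed above. */
struct lov_layout_hdr {
        uint16_t layout_type_id;          /* e.g. 0x0BD3 */
        uint16_t layout_descriptor_size;  /* units (bytes, __u32s, ...) are open */
};

/* The split header occupies exactly the space of the old __u32 magic. */
_Static_assert(sizeof(struct lov_layout_hdr) == sizeof(uint32_t),
               "header must stay 4 bytes");

/* Split an existing 0xLLLLSSSS magic value into the two fields. */
static struct lov_layout_hdr hdr_from_magic(uint32_t magic)
{
        struct lov_layout_hdr h = {
                .layout_type_id = (uint16_t)(magic >> 16),
                .layout_descriptor_size = (uint16_t)(magic & 0xFFFFu),
        };
        return h;
}
```

Note that the order of the two fields in memory only matches a stored 0xLLLLSSSS word for one byte order, so a real on-disk definition would have to fix the field order with the on-disk endianness in mind.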

Nikita.

Andreas Dilger

2008-Apr-27 05:13 UTC

[Lustre-devel] New Pools DLD

On Apr 21, 2008  15:08 +0400, Nikita Danilov wrote:
> Andreas Dilger writes:
>  > Since we already need to change the LOV_MAGIC value, we may as well
>  > do as you suggest and have 0x0BD3ssss instead of 0x0BD30BD0, where
>  > ssss = size of EA in bytes.
>  > 
>  > That would limit us to a 65535-byte striping EA, but it is still larger than
> 
> If that happens to be a limiting factor, we can interpret ssss as a
> number of __u32's or __u64's in EA.

The EA size being a multiple of at least __u32 is certainly true, and the
current EAs are also a multiple of __u64, so I don't think this is a bad idea
at all.  Counting in __u64 units would give us 8x larger EAs (512kB).
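
The arithmetic behind the 8x figure can be checked with a short sketch (the function names are hypothetical): a 16-bit count of fixed-size units covers at most 0xFFFF units.

```c
#include <stdint.h>

/* Largest EA size in bytes that a 16-bit count of unit_bytes-sized
 * units can describe. */
static uint32_t max_ea_bytes(uint32_t unit_bytes)
{
        return 0xFFFFu * unit_bytes;
}

/* EA size in bytes encoded in a 0xLLLLssss magic when ssss counts
 * units of unit_bytes bytes each. */
static uint32_t ea_bytes_from_magic(uint32_t magic, uint32_t unit_bytes)
{
        return (magic & 0xFFFFu) * unit_bytes;
}
```

With byte units the ceiling is 65535 bytes; with __u32 units it is 262140 bytes; with __u64 units it is 524280 bytes, i.e. just under 512kB.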
> I think we need an ability to have fully general files layout, so that,
> for example, a stripe can in turn be a striped file. This can be
> something like
> 
> struct lov_mds_md_v3 {
>         __u32 lmm_magic;          /* 0x0BD3ssss */
>         __u32 lmm_pattern;        /* LOV_PATTERN_RAID0, LOV_PATTERN_RAID1 */
>         __u64 lmm_object_id;      /* LOV object ID */
>         __u64 lmm_object_gr;      /* LOV object group */
>         __u32 lmm_stripe_size;    /* size of stripe in bytes */
>         __u32 lmm_stripe_count;   /* num stripes in use for this object */
> };
> 
> followed by a sequence of stripe layout descriptors each starting with
> 
>         __u32 magic; /* 0xLLLLSSSS. where LLLL is an identifier of a
>                       * layout type (e.g., 0bd3 is raid0 or raid1), and
>                       * SSSS is a size. */
> 
> But for a common case of a striped file where all stripes have the same
> layout, we implement a short-cut:
> 
> struct lov_mds_md_v4 {
>         __u32 lmm_magic;          /* 0x0BD4ssss */
>         __u32 lmm_pattern;        /* LOV_PATTERN_RAID0, LOV_PATTERN_RAID1 */
>         __u64 lmm_object_id;      /* LOV object ID */
>         __u64 lmm_object_gr;      /* LOV object group */
>         __u32 lmm_stripe_size;    /* size of stripe in bytes */
>         __u32 lmm_stripe_count;   /* num stripes in use for this object */
>         __u32 lmm_stripe_magic;   /* 0xLLLLSSSS for all stripes */
> };
> 
> followed by an array of stripe layout descriptor, stripped of their
> magics.
> 
> Or we can go one step further and assume particular value of
> lmm_stripe_magic for a particular lmm_magic. In this case,
> ->lmm_stripe_magic field can be removed.

This is essentially the mechanism we use today, which is OK for
the very common case of a single record format.  I do like the
idea of being able to have a hierarchical LOV EA format, and
have been thinking about that for a long time.

The other thing that I noticed in the pNFS layout is the ability
to have "joined" files in a very simple manner.  The "header" part
of the layout (equivalent to our lov_mds_md) also contains the
length in bytes of that part of the file, and then allows multiple
EAs to be concatenated together.
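
To illustrate how a per-component length makes joined files simple, here is a sketch of mapping a file offset to the component that covers it. The struct and function are hypothetical; they model only the idea of a byte length in each concatenated header, not the pNFS or Lustre formats themselves.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-component header: only the idea of a byte length
 * per concatenated layout comes from the message above. */
struct joined_component {
        uint64_t jc_length;   /* bytes of the file covered by this component */
};

/* Return the index of the component covering file offset off, and store
 * the offset relative to the start of that component in *rel.
 * Returns -1 if off lies beyond the end of the last component. */
static int find_component(const struct joined_component *c, size_t n,
                          uint64_t off, uint64_t *rel)
{
        for (size_t i = 0; i < n; i++) {
                if (off < c[i].jc_length) {
                        *rel = off;
                        return (int)i;
                }
                off -= c[i].jc_length;  /* skip past this component */
        }
        return -1;
}
```

Each component could then be interpreted with its own striping parameters, using the relative offset as if it were an offset into a standalone file.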

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
