thr3ads.net - Lustre discuss - [Lustre-discuss] "obdidx" ordering in "lfs getstripe" [Feb 2012]

Jack David

2012-Feb-09 13:20 UTC

[Lustre-discuss] "obdidx" ordering in "lfs getstripe"

Hi All,

In the output of "lfs getstripe <filename> | <dirname>", the obdidx
denotes the OST index (I assume).

Consider the following output:

lmm_stripe_count:   2
lmm_stripe_size:    1048576
lmm_stripe_offset:  1
	obdidx		 objid		objid		 group
	     1	             2	          0x2	             0
	     0	             3	          0x3	             0

where I have a setup consisting of two OSTs. If I have more than two
OSTs, is it possible that I get the obdidx values out of order? Or will
the obdidx values always be linear?

For example, in the above output, the values are linear (like 1, 0 - and
this pattern will be repeated while storing the data, I assume). If I
have 4 OSTs, can the values be non-linear? Something like 2,0,1,3 or
2,1,3,0 (or any pattern for that matter)?

My assumption on how the data is stored on OSTs:
Based upon the values of obdidx, each OST will store a stripe_size
worth of data into the objid (a file under the ldiskfs volume of that OST)
in rotation. So if I get the obdidx like 2,1,3,0 and stripe_size is 1MB,
then the data will be stored in the following order:

1st MB: 2nd OST
2nd MB: 1st OST
3rd MB: 3rd OST
4th MB: 0th OST
5th MB: 2nd OST (Again - repeating the pattern)
6th MB: 1st OST

Is this understanding correct?? I hope I am clear on my question.


Thanks,
J
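
To make the question concrete, here is a minimal Python sketch that pulls the obdidx order out of the output quoted above. It assumes the single-file output format shown in this message (three "lmm_*" header lines followed by an obdidx table); the function name parse_getstripe and the SAMPLE constant are made up for this illustration and are not part of any Lustre tool.

```python
# The lfs getstripe output from the message above, copied as a string.
SAMPLE = """\
lmm_stripe_count:   2
lmm_stripe_size:    1048576
lmm_stripe_offset:  1
\tobdidx\t\t objid\t\tobjid\t\t group
\t     1\t             2\t          0x2\t             0
\t     0\t             3\t          0x3\t             0
"""


def parse_getstripe(text):
    """Return (header dict, list of obdidx values in stripe order).

    Hypothetical helper: only handles the single-file format shown above.
    """
    header = {}
    obdidx_order = []
    in_table = False
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "obdidx":
            in_table = True                      # column header; object rows follow
        elif in_table:
            obdidx_order.append(int(fields[0]))  # first column is the OST index
        elif fields[0].endswith(":"):
            header[fields[0].rstrip(":")] = int(fields[1])
    return header, obdidx_order


header, order = parse_getstripe(SAMPLE)
print(header["lmm_stripe_count"], header["lmm_stripe_size"], order)
# prints: 2 1048576 [1, 0]
```

The list [1, 0] is the order in which the file's objects are used, which is what the rest of the thread discusses.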

Andreas Dilger

2012-Feb-09 14:48 UTC

On 2012-02-09, at 6:20 AM, Jack David wrote:
> In the output of "lfs getstripe <filename> | <dirname>", the obdidx
> denotes the OST index (I assume).
> 
> Consider the following output:
> 
> lmm_stripe_count:   2
> lmm_stripe_size:    1048576
> lmm_stripe_offset:  1
> 	obdidx		 objid		objid		 group
> 	     1	             2	          0x2	             0
> 	     0	             3	          0x3	             0
> 
> where I have a setup consisting of two OSTs. If I have more than two
> OSTs, is it possible that I get the obdidx values out of order? Or the
> obdidx values will always be linear?
> 
> For example, in above output, the values are linear (like 1, 0 - and
> this pattern will be repeated while storing the data I assume). If I
> have 4 OSTs, can the values be non-linear? Something like 2,0,1,3 or
> 2,1,3,0 (or any pattern for that matter)??

Typically the ordering will be linear, but this depends on a number of
different factors:
- what order the OSTs were created in:  without --index=N the OST order
  depends on the order in which they were first mounted, so using --index
  is always recommended, and will be mandatory in the future
- the distribution of OSTs among OSS nodes:  the MDS object allocator
  will normally select one OST from each OSS before allocating another
  object from a different OST on the same OSS
- the space available on each OST:  when OST free space is imbalanced
  the OSTs will be selected in part based on how full they are

> My assumption on how the data is stored on OSTs:
> Based upon the values of obdidx, each OST will store a stripe_size
> worth data into the objid (a file under ldiskfs volume of that OST) in
> rotation. So if I get the obdidx like 2,1,3,0 and stripe_size is 1MB,
> then the data will be stored in following order:
> 
> 1st MB: 2nd OST
> 2nd MB: 1st OST
> 3rd MB: 3rd OST
> 4th MB: 0th OST
> 5th MB: 2nd OST (Again - repeating the pattern)
> 6th MB: 1st OST
> 
> Is this understanding correct?? I hope I am clear on my question.

Correct.  The data is strictly round-robin on the objects once they
are allocated to a file.

Cheers, Andreas
--
Andreas Dilger                       Whamcloud, Inc.
Principal Engineer                   http://www.whamcloud.com/
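
The round-robin rule confirmed in the reply above can be sketched in a few lines of Python. This is only an illustration of the placement order from the question, assuming the hypothetical obdidx order 2,1,3,0; the function name ost_for_chunk is invented here and does not correspond to any Lustre API.

```python
def ost_for_chunk(chunk_number, obdidx_order):
    """OST index holding the chunk_number-th stripe_size chunk (0-based)."""
    return obdidx_order[chunk_number % len(obdidx_order)]


obdidx_order = [2, 1, 3, 0]   # hypothetical 4-OST layout from the question
placement = [ost_for_chunk(n, obdidx_order) for n in range(6)]
print(placement)   # [2, 1, 3, 0, 2, 1]
```

The first six 1MB chunks land on OSTs 2, 1, 3, 0, 2, 1, which matches the list in the original question.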

Jack David

2012-Feb-14 07:13 UTC

On Thu, Feb 9, 2012 at 8:18 PM, Andreas Dilger <adilger at whamcloud.com> wrote:
> Typically the ordering will be linear, but this depends on a number of
> different factors:
> - what order the OSTs were created in:  without --index=N the OST order
>   depends on the order in which they were first mounted, so using --index
>   is always recommended, and will be mandatory in the future
> - the distribution of OSTs among OSS nodes:  the MDS object allocator
>   will normally select one OST from each OSS before allocating another
>   object from a different OST on the same OSS

Thanks for this information.

> - the space available on each OST:  when OST free space is imbalanced
>   the OSTs will be selected in part based on how full they are

I have a doubt here. Let's say I have 4 OSTs, but the Lustre client is
issuing a write request that can be accommodated by any single OST
(e.g. the write request is 512 bytes and stripe_size is 1MB). In this
case, how will the data be stored? Will the MDS maintain the index of
the next OST which should serve the request?

> Correct.  The data is strictly round-robin on the objects once they
> are allocated to a file.

Thanks again,
J

Kevin Van Maren

2012-Feb-14 13:27 UTC

On Feb 14, 2012, at 12:13 AM, Jack David wrote:
> On Thu, Feb 9, 2012 at 8:18 PM, Andreas Dilger <adilger at whamcloud.com> wrote:
>> - the space available on each OST:  when OST free space is imbalanced
>>  the OSTs will be selected in part based on how full they are
> 
> I have a doubt here. Let's say I have 4 OSTs, but the Lustre client is
> issuing a write request that can be accommodated by any single OST
> (e.g. the write request is 512 bytes and stripe_size is 1MB). In this
> case, how will the data be stored? Will the MDS maintain the index of
> the next OST which should serve the request?

I think you are still confused about how it works.  The OSTs are selected
_when the file is created_.  The striping is a static map of offset to OST.
For example, if the stripe count = 2, and the stripe size = 1MB, then
0-1MB goes to the first OST, 1-2MB goes to the second, 2-3MB goes to the
first, etc.

The free space impacts _which_ OSTs are selected when a file is created;
it does NOT impact where data is written once a file is created.  So if an
OST fills up, every file that resides on that OST will be unable to grow if
the growth is to an offset that maps to that OST.

Kevin
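
The "static map of offset to OST" described above can be written down as plain arithmetic. The following Python sketch assumes simple round-robin striping with the stripe size and obdidx order fixed at file creation; map_offset is a made-up name for this illustration, not a Lustre function.

```python
MiB = 1024 * 1024


def map_offset(offset, stripe_size, obdidx_order):
    """Map a file offset to (OST index, offset inside that OST's object)."""
    stripe_count = len(obdidx_order)
    chunk = offset // stripe_size               # which stripe_size chunk of the file
    ost = obdidx_order[chunk % stripe_count]    # layout is fixed at file creation
    object_offset = (chunk // stripe_count) * stripe_size + offset % stripe_size
    return ost, object_offset


layout = [1, 0]   # obdidx order from the lfs getstripe output in this thread
results = [map_offset(off, 1 * MiB, layout)
           for off in (0, 1 * MiB, 2 * MiB, 3 * MiB + 512)]
print(results)
# [(1, 0), (0, 0), (1, 1048576), (0, 1049088)]
```

Nothing in the function looks at free space, which is the point made above: if the OST that an offset maps to is full, a write to that offset cannot be redirected to another OST.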

Jack David

2012-Feb-14 13:51 UTC

On Tue, Feb 14, 2012 at 6:57 PM, Kevin Van Maren <KVanMaren at fusionio.com> wrote:
> I think you are still confused about how it works.  The OSTs are selected
> _when the file is created_.  The striping is a static map of offset to OST.
> For example, if the stripe count = 2, and the stripe size = 1MB, then
> 0-1MB goes to the first OST, 1-2MB goes to the second, 2-3MB goes to the
> first, etc.

I understand that, but I am curious: does the Lustre client keep track of
which is the _next_ OST that an I/O request should go to? I am not sure
who decides the stripe_size at the time of file creation (the default is
1MB, according to the lfs setstripe man page), so I assume the client is
not bothered about that. But if the client generates write requests that
are not a multiple of stripe_size, multiple write requests can be stored
on one OST (e.g. if the stripe size is 1MB, then 20 requests of 512 bytes
can be stored on OST1, the next 20 requests on OST2, and so on).

Actually, I am trying to understand how I can leverage the pNFS file
layout semantics (where the client communicates with the Data Servers
directly once the layout is supplied by the Meta Data Server) with the
Lustre filesystem, and that is the source of such questions.

> The free space impacts _which_ OSTs are selected when a file is created,
> it does NOT impact where data is written once a file is created.  So if an
> OST fills up, every file that resides on that OST will be unable to grow if
> the growth is to an offset that maps to that OST.

Good to know that.

-- 
J

Kevin Van Maren

2012-Feb-14 14:07 UTC

On Feb 14, 2012, at 6:51 AM, Jack David wrote:
> I understand that, but just got curious that does lustre client keeps
> track of which is the _next_ OST where the IO request should go to? I

No, it does not track the "next", as that depends on the file offset.
For example, with the 2-OST stripe example in my previous email, if the
client writes 0-1MB, 2-3MB, and 4-5MB, all the data will be written to a
single OST.
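
The example above is easy to check with a small Python sketch. It assumes the same 2-stripe, 1MB layout; the constants and the helper stripe_index are illustrative names only. It also counts how many back-to-back 512-byte writes fit in the first object before the offset crosses into the second one.

```python
MiB = 1024 * 1024
STRIPE_SIZE = 1 * MiB
STRIPE_COUNT = 2


def stripe_index(offset):
    """Position (0-based) in the file's layout of the object holding this offset."""
    return (offset // STRIPE_SIZE) % STRIPE_COUNT


# Writes covering 0-1MB, 2-3MB and 4-5MB: every one maps to object 0.
sparse = {stripe_index(start) for start in (0 * MiB, 2 * MiB, 4 * MiB)}
print(sparse)   # {0}

# Sequential 512-byte writes: count how many land on the first object
# before the file offset reaches the second one.
writes_before_switch = 0
while stripe_index(writes_before_switch * 512) == 0:
    writes_before_switch += 1
print(writes_before_switch)   # 2048
```

With these numbers the switch to the second object happens after 2048 sequential 512-byte writes (1MB / 512 bytes). The switch follows from the file offset alone, not from any per-request counter.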

> am unaware that who decides the stripe_size at the time of file
> creation (by default is 1MB - from lfs setstripe man page), so I
> assume client is not bothered about that. But if the client is
> generating the write request which is not in multiple of stripe_size,
> multiple write requests can be and stored into one OST (e.g. if stripe
> size is 1MB, then 20 req of 512bytes can be stored in OST1, next 20
> reqs on OST2 and likewise).

1MB is the built-in default, but the actual default can vary from system
to system.

The file stripe is determined when the file is created.  "lfs setstripe"
can be used to create a file with a specified striping.

"lfs setstripe" can also be used to change the striping for a directory,
which is quite useful as that determines the default stripe for any files
created in that directory (including subdirectories!)

When the client opens a file, the MDT returns the stripe information to the
client so that the client knows how to map file offsets to OST objects (and
the offset in that object).  It is the client's job (inside Lustre, so it
is automatic) to figure out how to map a read/write to the
server/OST/object/offset.

Kevin
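
As a rough illustration of the client-side mapping described above, the Python sketch below splits one read/write extent into per-object pieces. It assumes plain round-robin striping and does not reproduce the actual Lustre client code; split_extent and its parameters are hypothetical names.

```python
def split_extent(offset, length, stripe_size, obdidx_order):
    """Split a file extent into (ost, object_offset, nbytes) pieces, in file order."""
    stripe_count = len(obdidx_order)
    pieces = []
    end = offset + length
    while offset < end:
        chunk = offset // stripe_size
        within = offset % stripe_size
        # Never cross a stripe_size boundary within a single piece.
        nbytes = min(stripe_size - within, end - offset)
        ost = obdidx_order[chunk % stripe_count]
        object_offset = (chunk // stripe_count) * stripe_size + within
        pieces.append((ost, object_offset, nbytes))
        offset += nbytes
    return pieces


MiB = 1024 * 1024
# A 1MB write starting 512KB into the file, on the 2-stripe layout [1, 0].
pieces = split_extent(512 * 1024, MiB, MiB, [1, 0])
print(pieces)
# [(1, 524288, 524288), (0, 0, 524288)]
```

Here the write straddles a stripe boundary, so half goes to the object on OST 1 and half to the object on OST 0. A 512-byte write that stays inside one stripe would produce a single piece.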

