Hi All,

In the output of "lfs getstripe <filename> | <dirname>", the obdidx denotes the OST index (I assume).

Consider the following output:

lmm_stripe_count:   2
lmm_stripe_size:    1048576
lmm_stripe_offset:  1
     obdidx      objid      objid      group
          1          2        0x2          0
          0          3        0x3          0

where I have a setup consisting of two OSTs. If I have more than two OSTs, is it possible that I get the obdidx values out of order? Or will the obdidx values always be linear?

For example, in the above output, the values are linear (like 1, 0, and I assume this pattern will be repeated while storing the data). If I have 4 OSTs, can the values be non-linear? Something like 2,0,1,3 or 2,1,3,0 (or any pattern for that matter)?

My assumption on how the data is stored on OSTs:
Based upon the values of obdidx, each OST will store a stripe_size worth of data into the objid (a file under the ldiskfs volume of that OST) in rotation. So if I get the obdidx like 2,1,3,0 and stripe_size is 1MB, then the data will be stored in the following order:

1st MB: 2nd OST
2nd MB: 1st OST
3rd MB: 3rd OST
4th MB: 0th OST
5th MB: 2nd OST (again, repeating the pattern)
6th MB: 1st OST

Is this understanding correct? I hope I am clear on my question.

Thanks,
J
Andreas Dilger
2012-Feb-09 14:48 UTC
[Lustre-discuss] "obdidx" ordering in "lfs getstripe"
On 2012-02-09, at 6:20 AM, Jack David wrote:
> In the output of "lfs getstripe <filename> | <dirname>", the obdidx
> denotes the OST index (I assume).
>
> Consider the following output:
>
> lmm_stripe_count:   2
> lmm_stripe_size:    1048576
> lmm_stripe_offset:  1
>      obdidx      objid      objid      group
>           1          2        0x2          0
>           0          3        0x3          0
>
> where I have a setup consisting of two OSTs. If I have more than two
> OSTs, is it possible that I get the obdidx values out of order? Or the
> obdidx values will always be linear?
>
> For example, in above output, the values are linear (like 1, 0 - and
> this pattern will be repeated while storing the data I assume). If I
> have 4 OSTs, can the values be non-linear? Something like 2,0,1,3 or
> 2,1,3,0 (or any pattern for that matter)??

Typically the ordering will be linear, but this depends on a number of different factors:
- what order the OSTs were created in: without --index=N the OST order depends on the order in which they were first mounted, so using --index is always recommended, and will be mandatory in the future
- the distribution of OSTs among OSS nodes: the MDS object allocator will normally select one OST from each OSS before allocating another object from a different OST on the same OSS
- the space available on each OST: when OST free space is imbalanced the OSTs will be selected in part based on how full they are

> My assumption on how the data is stored on OSTs:
> Based upon the values of obdidx, each OST will store a stripe_size
> worth data into the objid (a file under ldiskfs volume of that OST) in
> rotation. So if I get the obdidx like 2,1,3,0 and stripe_size is 1MB,
> then the data will be stored in following order:
>
> 1st MB: 2nd OST
> 2nd MB: 1st OST
> 3rd MB: 3rd OST
> 4th MB: 0th OST
> 5th MB: 2nd OST (Again - repeating the pattern)
> 6th MB: 1st OST
>
> Is this understanding correct?? I hope I am clear on my question.

Correct. The data is strictly round-robin on the objects once they are allocated to a file.
Cheers, Andreas
--
Andreas Dilger                       Whamcloud, Inc.
Principal Engineer                   http://www.whamcloud.com/
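The round-robin placement confirmed in the reply above can be illustrated with a minimal sketch. It assumes the obdidx column of "lfs getstripe" is read top to bottom as the stripe order; the function name ost_for_chunk and the 2,1,3,0 layout are hypothetical, taken from the question for illustration only, and are not part of Lustre.

```python
def ost_for_chunk(obdidx_order, chunk_number):
    """Return the OST index (obdidx) that holds a given chunk of the file.

    obdidx_order: obdidx values in the order "lfs getstripe" lists them.
    chunk_number: which stripe_size-sized piece of the file, counted from 0.
    """
    return obdidx_order[chunk_number % len(obdidx_order)]


# Hypothetical 4-OST layout from the question: objects on OSTs 2, 1, 3, 0.
layout = [2, 1, 3, 0]

# The first six 1MB chunks of the file, in file order.
placement = [ost_for_chunk(layout, n) for n in range(6)]
print(placement)  # [2, 1, 3, 0, 2, 1]
```

The allocator only decides the contents and order of the list; once the list exists, the placement of every chunk follows from it.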
On Thu, Feb 9, 2012 at 8:18 PM, Andreas Dilger <adilger at whamcloud.com> wrote:
> Typically the ordering will be linear, but this depends on a number of
> different factors:
> - what order the OSTs were created in: without --index=N the OST order
>   depends on the order in which they were first mounted, so using --index
>   is always recommended, and will be mandatory in the future
> - the distribution of OSTs among OSS nodes: the MDS object allocator
>   will normally select one OST from each OSS before allocating another
>   object from a different OST on the same OSS

Thanks for this information.

> - the space available on each OST: when OST free space is imbalanced
>   the OSTs will be selected in part based on how full they are

I have a question here. Let's say I have 4 OSTs, but the Lustre client issues a write request that can be accommodated by any single OST (e.g. the write request is 512 bytes and stripe_size is 1MB). In this case, how will the data be stored? Will the MDS maintain the index of the next OST that should serve the request?

> Correct. The data is strictly round-robin on the objects once they
> are allocated to a file.

Thanks again,
J
Kevin Van Maren
2012-Feb-14 13:27 UTC
[Lustre-discuss] "obdidx" ordering in "lfs getstripe"
On Feb 14, 2012, at 12:13 AM, Jack David wrote:
> On Thu, Feb 9, 2012 at 8:18 PM, Andreas Dilger <adilger at whamcloud.com> wrote:
>> - the space available on each OST: when OST free space is imbalanced
>>   the OSTs will be selected in part based on how full they are
>
> I have a doubt here. Lets say I have 4 OSTs, but the lustre client is
> issuing the write request having which can be accommodated by any
> single OST (e.g. write request is of size 512bytes and stripe_size is
> 1MB). In this case, how will the data be stored? Will the MDS maintain
> the index of next OST which should serve the request?

I think you are still confused about how it works. The OSTs are selected _when the file is created_. The striping is a static map of offset to OST. For example, if the stripe count = 2, and the stripe size = 1MB, then 0-1MB goes to the first OST, 1-2MB goes to the second, 2-3MB goes to the first, etc.

The free space impacts _which_ OSTs are selected when a file is created; it does NOT impact where data is written once a file is created. So if an OST fills up, every file that resides on that OST will be unable to grow if the growth is to an offset that maps to that OST.

Kevin

Confidentiality Notice: This e-mail message, its contents and any attachments to it are confidential to the intended recipient, and may contain information that is privileged and/or exempt from disclosure under applicable law. If you are not the intended recipient, please immediately notify the sender and destroy the original e-mail message and any attachments (and any copies that may have been made) from your system or otherwise. Any unauthorized use, copying, disclosure or distribution of this information is strictly prohibited. Email addresses that end with a "-c" identify the sender as a Fusion-io contractor.
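The point about a full OST can be sketched with the static offset map described above. This is only a model of the mapping, not of how Lustre actually reports an out-of-space condition; the names stripe_index and growth_blocked, and the assumption that OST 0 is the full one, are hypothetical.

```python
MIB = 1024 * 1024


def stripe_index(offset, stripe_size, stripe_count):
    """Position in the file's object list that a byte offset maps to."""
    return (offset // stripe_size) % stripe_count


def growth_blocked(offset, obdidx_order, stripe_size, full_osts):
    """True if a write at this offset would land on an OST marked as full."""
    idx = stripe_index(offset, stripe_size, len(obdidx_order))
    return obdidx_order[idx] in full_osts


# Layout from the getstripe output in this thread: objects on OST 1, then OST 0.
layout = [1, 0]

# Assume (for illustration) that OST 0 has filled up.
blocked = [growth_blocked(n * MIB, layout, MIB, {0}) for n in range(4)]
print(blocked)  # [False, True, False, True]
```

In this model every second 1MB region of the file is affected, because those offsets always map to the object on the full OST, while the regions that map to OST 1 are not.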
On Tue, Feb 14, 2012 at 6:57 PM, Kevin Van Maren <KVanMaren at fusionio.com> wrote:
> I think you are still confused about how it works. The OSTs are selected
> _when the file is created_. The striping is a static map of offset to OST.
> For example, if the stripe count = 2, and the stripe size = 1MB, then
> 0-1MB goes to the first OST, 1-2MB goes to the second, 2-3 goes to the first, etc.

I understand that, but I am curious: does the Lustre client keep track of which is the _next_ OST where the I/O request should go? I do not know who decides the stripe_size at the time of file creation (the default is 1MB, according to the lfs setstripe man page), so I assume the client is not concerned with that. But if the client generates write requests that are not a multiple of stripe_size, multiple write requests can be stored in one OST (e.g. if the stripe size is 1MB, then 20 requests of 512 bytes can be stored in OST1, the next 20 requests in OST2, and likewise).

Actually, I am trying to understand how I can leverage the pNFS file layout semantics (where the client communicates with the Data Servers directly once the layout is supplied by the Meta Data Server) with the Lustre filesystem, and that is the source of these questions.

> The free space impacts _which_ OSTs are selected when a file is created,
> it does NOT impact where data is written once a file a created. So if an OST
> fills up, every file that resides on that OST will be unable to grow if the growth is
> to an offset that maps to that OST.

Good to know that.

--
J
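As a rough check on the small-write case raised above, the sketch below counts how many sequential 512-byte writes fall on each object. It assumes the writes start at offset 0, are contiguous, and that the file has stripe count 2 and stripe size 1MB as in the earlier example; the function name writes_per_object is hypothetical.

```python
from collections import Counter

MIB = 1024 * 1024


def writes_per_object(write_size, num_writes, stripe_size, stripe_count):
    """Count sequential fixed-size writes by the stripe position they map to."""
    counts = Counter()
    for i in range(num_writes):
        offset = i * write_size
        counts[(offset // stripe_size) % stripe_count] += 1
    return counts


# 4096 contiguous writes of 512 bytes cover exactly 2MB of file data.
counts = writes_per_object(512, 4096, MIB, 2)
print(dict(counts))  # {0: 2048, 1: 2048}

# Index of the first write that maps to the second object.
first_on_second = next(i for i in range(4096) if (i * 512 // MIB) % 2 == 1)
print(first_on_second)  # 2048
```

Under these assumptions it takes 2048 such writes, not 20, to fill one 1MB stripe before the offset moves on to the next object. The switch follows from the file offset alone, with no per-client counter involved.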
Kevin Van Maren
2012-Feb-14 14:07 UTC
[Lustre-discuss] "obdidx" ordering in "lfs getstripe"
On Feb 14, 2012, at 6:51 AM, Jack David wrote:
> I understand that, but just got curious that does lustre client keeps
> track of which is the _next_ OST where the IO request should go to?

No, it does not track the "next", as that depends on the file offset. For example, with the 2-OST stripe example in my previous email, if the client writes 0-1MB, 2-3MB, and 4-5MB, all the data will be written to a single OST.

> I am unaware that who decides the stripe_size at the time of file
> creation (by default is 1MB - from lfs setstripe man page), so I
> assume client is not bothered about that. But if the client is
> generating the write request which is not in multiple of stripe_size,
> multiple write requests can be and stored into one OST (e.g. if stripe
> size is 1MB, then 20 req of 512bytes can be stored in OST1, next 20
> reqs on OST2 and likewise).

1MB is the built-in default, but the actual default can vary from system to system.

The file stripe is determined when the file is created. "lfs setstripe" can be used to create a file with a specified striping. "lfs setstripe" can also be used to change the striping for a directory, which is quite useful as that determines the default stripe for any files created in that directory (including directories!).

When the client opens a file, the MDT returns the stripe information to the client so that the client knows how to map file offsets to OST objects (and the offset in that object). It is the client's job (inside Lustre, so it is automatic) to figure out how to map a read/write to the server/OST/object/offset.

Kevin
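The offset mapping described in this reply can be sketched as follows. This is a simplified model of plain round-robin striping, not the Lustre client code; the names map_offset and split_extent are hypothetical, and the layout [1, 0] is taken from the getstripe output at the top of the thread.

```python
MIB = 1024 * 1024


def map_offset(offset, stripe_size, obdidx_order):
    """Map a file offset to (OST index, offset inside that OST's object)."""
    stripe_number = offset // stripe_size
    count = len(obdidx_order)
    ost = obdidx_order[stripe_number % count]
    # Each full pass over all objects adds one stripe_size to every object.
    object_offset = (stripe_number // count) * stripe_size + offset % stripe_size
    return ost, object_offset


def split_extent(offset, length, stripe_size, obdidx_order):
    """Split a read/write into (OST, object offset, length) pieces."""
    pieces = []
    end = offset + length
    while offset < end:
        # Stop each piece at the next stripe boundary.
        chunk_end = min(end, (offset // stripe_size + 1) * stripe_size)
        ost, object_offset = map_offset(offset, stripe_size, obdidx_order)
        pieces.append((ost, object_offset, chunk_end - offset))
        offset = chunk_end
    return pieces


layout = [1, 0]  # stripe count 2, objects on OST 1 then OST 0

# The three writes from the reply: 0-1MB, 2-3MB and 4-5MB.
writes = [(0, MIB), (2 * MIB, MIB), (4 * MIB, MIB)]
targets = [split_extent(off, length, MIB, layout) for off, length in writes]
print(sorted({piece[0] for pieces in targets for piece in pieces}))  # [1]

# A 512-byte write that straddles the first stripe boundary is split in two.
print(split_extent(MIB - 256, 512, MIB, layout))
# [(1, 1048320, 256), (0, 0, 256)]
```

All three 1MB writes land on the object on OST 1, at object offsets 0, 1MB and 2MB, which matches the "single OST" example in the reply.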