Ms. Megan Larko
2012-Feb-10 01:49 UTC
[Lustre-discuss] WRT "obdidx ordering in "lfs getstript"
Greetings!

I was reading Mr. David's query about the ordering of data on a striped Lustre file system. I too am under the impression that data stripes of size lfs-stripesize will rotate in order from the starting point. Following Mr. David's example, a large data set would be written to the 2nd OST, with the next piece on the 3rd, then the 0th and finally the 1st before circling back around to the 2nd (assuming OSTs 0 to 3 from the example).

In his response, Mr. Dilger stated: "when OST free space is imbalanced the OSTs will be selected in part based on how full they are". Does that refer to a starting point for the data writes before the orderly progression? Does that somehow imply a "skipping over" of a "full" OST?

The latter would be revolutionary to me in my personal understanding of Lustre and cluster file systems in general. I thought that a single OST having insufficient space available for writing a data piece of "stripe size" (or all of the data, if the default Lustre stripe count of one is used) would cause a file system full error. This error can confuse users and novice administrators who see a file system full message when a typical disk usage command on the client will show an (often reasonable) percentage available on the file system as a whole.

Have I misunderstood something here, or is this skipping over a full OST something in the newer versions of the Lustre cluster file system?

Cheers!
megan
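The rotation described above can be sketched in a few lines of Python. This is only an illustration of round-robin (RAID-0 style) striping, not Lustre code: the function name `ost_for_offset`, the 1 MiB stripe size, and the list `OST_ORDER` are assumptions made for the example, with the OST order 2, 3, 0, 1 taken from the example in the thread.

```python
def ost_for_offset(offset, stripe_size, ost_order):
    """Return the OST index that holds the byte at `offset`.

    Stripes of `stripe_size` bytes are laid out round-robin over
    `ost_order`, the list of OST indices assigned to the file.
    """
    stripe_number = offset // stripe_size
    return ost_order[stripe_number % len(ost_order)]


STRIPE_SIZE = 1 << 20      # 1 MiB, assumed for the example
OST_ORDER = [2, 3, 0, 1]   # starting OST is 2, then 3, 0, 1

# OST holding each of the first six stripes of the file.
placement = [ost_for_offset(i * STRIPE_SIZE, STRIPE_SIZE, OST_ORDER)
             for i in range(6)]
print(placement)  # [2, 3, 0, 1, 2, 3]
```

The fifth stripe lands back on OST 2, which is the "circling back around" described in the message.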
Andreas Dilger
2012-Feb-10 05:31 UTC
[Lustre-discuss] WRT "obdidx ordering in "lfs getstript"
On 2012-02-09, at 6:49 PM, Ms. Megan Larko wrote:
> I was reading Mr. David's query about the ordering of data on a
> striped Lustre file system. I too am under the impression that the
> data stripe of size lfs-stripesize will rotate in order from the
> starting point. Following Mr. David's example, a large data set
> would be written to the 2nd OST, with the next piece on the 3rd, then
> 0th and finally 1st before circling back around to the 2nd (assuming
> OSTs 0 to 3 from the example).

Correct.

> In his response, Mr. Dilger stated:
> "when OST free space is imbalanced the OSTs will be selected in part
> based on how full they are". Does that refer to a starting point for
> the data writes before the orderly progression? Does that somehow
> imply a "skipping over" of a "full" OST?

Correct. When free space becomes too imbalanced between OSTs, the MDS object allocator changes to a mode where it allocates objects partly based on how much space is free on each OST. This is not ideal, and could be improved (see https://bugzilla.lustre.org/show_bug.cgi?id=18547 for details), but is reasonable for some workloads.

> The latter would be
> revolutionary to me in my personal understanding of Lustre and cluster
> file systems in general. I thought that a single OST having
> insufficient space available for writing of the data piece of "stripe
> size" (or all of the data if the default Lustre stripe count of one is
> used) would cause a file system full error. This error can confuse
> users and novice administrators who see a file system full message
> when a typical disk usage command on the client will show an (often
> reasonable) percentage available on the file system as a whole.

This is still true after a file has had OSTs allocated. If any OST becomes full, writes to files on that OST will return ENOSPC even if there is free space on another OST.

Cheers, Andreas
--
Andreas Dilger
Whamcloud, Inc.
Principal Engineer
http://www.whamcloud.com/
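The two behaviours in this reply can be illustrated with a toy model. It is a sketch under stated assumptions, not the MDS allocator: `pick_osts` and `write_chunk` are hypothetical names, and choosing OSTs with probability proportional to free space is a simplification of "partly based on how much space is free". The point is only that allocation happens once, when the file is created, while ENOSPC is decided per OST at write time.

```python
import errno
import random


def pick_osts(free_space, stripe_count, rng):
    """Toy allocator: pick distinct OSTs, weighted by free space.

    `free_space` maps OST index -> free bytes. An OST with no free
    space has weight zero and is never picked while others have room.
    """
    candidates = dict(free_space)
    chosen = []
    for _ in range(stripe_count):
        indices = list(candidates)
        weights = [candidates[i] for i in indices]
        pick = rng.choices(indices, weights=weights, k=1)[0]
        chosen.append(pick)
        del candidates[pick]
    return chosen


def write_chunk(free_space, ost, nbytes):
    """Write `nbytes` to one OST, or fail with ENOSPC if it lacks room.

    Free space on other OSTs does not help: the file's objects were
    fixed at allocation time.
    """
    if free_space[ost] < nbytes:
        raise OSError(errno.ENOSPC, f"OST {ost} is full")
    free_space[ost] -= nbytes


# New file: the full OST 1 is skipped at allocation time.
free = {0: 100, 1: 0, 2: 100, 3: 100}
layout = pick_osts(free, stripe_count=3, rng=random.Random(0))
print(sorted(layout))  # [0, 2, 3]
```

An existing file whose object lives on a nearly full OST still fails, for example `write_chunk({0: 10, 1: 1000}, 0, 20)` raises `OSError` with `errno.ENOSPC` even though 1010 bytes are free in total.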