Hi,

I've made some histograms that visualise the F/S stats people have
so kindly sent in. There are a couple of surprises which seem to
do with how the filesystems were set up (e.g. huge #s of inodes
reserved but not used on OSTs) and used (different ratios of ops).

The attached PDF has a summary for each filesystem containing...

 * Filesystem name

 * Ratio of MDS 1K blocks to MDS inodes
   Ratio of OST 1K blocks to MDS inodes
   Ratio of OST inodes to MDS inodes

 * Histogram of operation counts

 * Histogram of blocks total (blue) and used (red) on the MDS
   and all OSTs

 * Histogram of inodes total (blue) and used (red) on the MDS
   and all OSTs

...hope you find it useful.

Cheers,
Eric

[Attachment: Lustre stats.pdf, application/pdf, 619235 bytes --
http://lists.lustre.org/pipermail/lustre-devel/attachments/20110429/e5fbfcf5/attachment-0001.pdf]
On 29 Apr 2011, at 00:04, Eric Barton wrote:

> * Histogram of inodes total (blue) and used (red) on the MDS
>   and all OSTs

What do you think of the blue lines showing total number of inodes?
It appears to me as though df -i and lfs df -i are reporting different
numbers for this value, which I'm not able to easily explain.

Ashley.
That's interesting - why do you think opens >> closes? Are we not
tracking closes correctly?

On Apr 28, 2011, at 4:04 PM, Eric Barton wrote:

> I've made some histograms that visualise the F/S stats people have
> so kindly sent in. [...]
>
> * Histogram of operation counts [...]

Nathan
On Apr 29, 2011, at 11:35, Nathan Rutman wrote:

> That's interesting - why do you think opens >> closes?
> Are we not tracking closes correctly?

I was wondering the same, and/or whether we are double-counting
opens? Clients won't send close RPCs if they are evicted (the MDS
does the close locally on eviction), but I'd hope that clients aren't
evicted _that_ often.

Cheers, Andreas
--
Andreas Dilger
Principal Engineer
Whamcloud, Inc.
On Fri, Apr 29, 2011 at 11:41:05AM -0600, Andreas Dilger wrote:

> I was wondering the same, and/or whether we are double-counting
> opens? Clients won't send close RPCs if they are evicted (the MDS
> does the close locally on eviction), but I'd hope that clients
> aren't evicted _that_ often.

It seems that we increment the open stat counter on resend (for reply
reconstruction), but not for close. Maybe that's the reason? It might
also just be that the open failed, so no close is needed.

Johann
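A toy model of these two hypotheses (resent opens being re-counted,
and failed opens never producing a close) shows how either one skews
the opens/closes ratio; the numbers below are made up for
illustration, and this is not Lustre code.

#include <stdio.h>

/* Toy model, not Lustre code: N files are opened and closed once,
 * but some open RPCs are resent (and counted again), and some opens
 * fail (so no close ever follows). */
int main(void)
{
	int files        = 1000000;
	int resent_opens = 50000;   /* hypothetical: resends after recovery */
	int failed_opens = 20000;   /* hypothetical: opens with no close */

	int open_count  = files + resent_opens + failed_opens;
	int close_count = files;

	printf("opens: %d, closes: %d, ratio: %.3f\n",
	       open_count, close_count,
	       (double)open_count / close_count);
	return 0;
}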
Hi Ashley,

On Fri, Apr 29, 2011 at 12:06:06AM -0700, Ashley Pittman wrote:

> What do you think of the blue lines showing total number of inodes?
> It appears to me as though df -i and lfs df -i are reporting
> different numbers for this value, which I'm not able to easily
> explain.

The total number of inodes can indeed be different between df
(statfs(2)) and lfs df (Lustre's ioctl). File creation can also fail
with ENOSPC if you run out of objects on the OSTs, even though the
MDT might still have free inodes. Therefore the number of free
inodes, as well as the total number of inodes returned through
statfs(2), is adjusted if the MDT has more free inodes than the OSTs.
You can check ll_statfs_internal() to see how this is computed.

HTH
Johann

--
Johann Lombardi
Whamcloud, Inc.
www.whamcloud.com
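As a rough sketch of the adjustment described above (illustrative
only; the real logic lives in ll_statfs_internal() and uses Lustre's
own types), clamping the MDT's inode counts by the objects the OSTs
can still provide might look like this:

#include <stdint.h>
#include <stdio.h>

/* Illustrative sketch only -- not the actual ll_statfs_internal(). */
struct fs_counts {
	uint64_t files;  /* total inodes */
	uint64_t ffree;  /* free inodes */
};

/* If the MDT reports more free inodes than the OSTs have free
 * objects, the surplus can never be used for file creation, so it is
 * trimmed from both the free and the total counts that statfs(2)
 * reports to clients. */
static void clamp_by_ost_objects(struct fs_counts *mdt,
				 uint64_t ost_free_objects)
{
	if (mdt->ffree > ost_free_objects) {
		uint64_t surplus = mdt->ffree - ost_free_objects;

		mdt->ffree = ost_free_objects;
		mdt->files -= surplus;
	}
}

int main(void)
{
	struct fs_counts mdt = { .files = 100000000, .ffree = 90000000 };

	clamp_by_ost_objects(&mdt, 40000000);  /* OSTs: 40M objects left */
	printf("total inodes shown: %llu, free shown: %llu\n",
	       (unsigned long long)mdt.files,
	       (unsigned long long)mdt.ffree);
	return 0;
}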
>>>>> "Johann" == Johann Lombardi <johann at whamcloud.com> writes:Hi Johann, Ashley, Johann> Hi Ashley, On Fri, Apr 29, 2011 at 12:06:06AM -0700, Ashley Johann> Pittman wrote: >> What do you think of the blue lines showing total number of >> inodes? It appears to me as though df -i and lfs df -i are >> reporting different numbers for this value which I''m not able to >> easily explain. Johann> The total number of inodes can indeed be different between Johann> df (statfs(2)) and lfs df (lustre''s ioctl). File creation Johann> can also fail with ENOSPC if you run out of objects on the Johann> OSTs, although the MDT might still have free Johann> inodes. Therefore the number of free inodes as well as the Johann> total number of inodes returned through statfs(2) are Johann> adjusted if the MDT has more free inodes than the OSTs. You Johann> can check ll_statfs_internal() to see how this is computed. It seems lfs df -i is indeed buggy showing totally bogus numbers (see https://bugzilla.lustre.org/show_bug.cgi?id=24489) Roland
On Mon, May 02, 2011 at 05:51:36PM +0200, rf at q-leap.de wrote:

> It seems lfs df -i is indeed buggy, showing totally bogus numbers
> (see https://bugzilla.lustre.org/show_bug.cgi?id=24489).

In this case, you run df on the server directly, so you are comparing
statfs information as returned by ext4/ldiskfs with what you get
through Lustre (i.e. lfs df, or df on a Lustre client). Lustre takes
for granted that 1 EA block is needed for each inode (a conservative
approach) and adjusts the total number of inodes accordingly (see
fsfilt_ext3_statfs()). That being said, with large inode support and
mkfs.lustre adapting the inode size based on the default stripe
count, I am not sure this "adjustment" makes sense any more. We could
instead print a warning at mkfs time when the default stripe count
cannot fit in the inode core and #inodes > #blocks.

Johann

--
Johann Lombardi
Whamcloud, Inc.
www.whamcloud.com
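A worked example with hypothetical numbers of the conservative view
Johann describes: every file is assumed to need an inode plus one EA
block, so the advertised inode total is capped by the block count.
This is an illustration, not the fsfilt_ext3_statfs() code itself.

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	/* Hypothetical MDT formatted with more inodes than blocks. */
	uint64_t blocks = 128ULL << 20;  /* 128M 4KB blocks = 512 GB */
	uint64_t inodes = 256ULL << 20;  /* 256M inodes */

	/* Conservative view: each file needs an inode AND an EA
	 * block, so only min(inodes, blocks) files can be created. */
	uint64_t usable = inodes < blocks ? inodes : blocks;

	printf("formatted inodes:                   %llu\n",
	       (unsigned long long)inodes);
	printf("inodes shown by lfs df -i (approx): %llu\n",
	       (unsigned long long)usable);
	return 0;
}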
In fact, in the patch I recently posted to LU-255, mkfs.lustre is a
bit smarter about how many inodes are allocated on the MDT, based on
the default striping count given by --stripe-count-hint. It is more
aggressive about allocating inodes on the MDS, because we waste a lot
of space otherwise. That is not in itself harmful with HDDs, because
there is too much space anyway and you need more drives to have
enough IOPS, but space on SSDs is much more precious.

Cheers, Andreas

On 2011-05-02, at 10:32 AM, Johann Lombardi <johann at whamcloud.com> wrote:

> Lustre takes for granted that 1 EA block is needed for each inode
> (a conservative approach) and adjusts the total number of inodes
> accordingly (see fsfilt_ext3_statfs()). [...]
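For a feel of what is at stake, a back-of-the-envelope computation
(hypothetical numbers, not the LU-255 patch itself) of how the
bytes-per-inode ratio chosen at format time translates into MDT
capacity in files:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint64_t mdt_bytes = 1ULL << 40;      /* hypothetical 1 TB MDT */
	int ratios[] = { 4096, 2048, 1024 };  /* bytes per inode */

	/* A smaller ratio packs more inodes into the same device.
	 * Idle space is tolerable on HDDs (IOPS force you to buy
	 * spindles anyway) but wasteful on SSDs, hence the smarter
	 * sizing driven by --stripe-count-hint. */
	for (int i = 0; i < 3; i++) {
		uint64_t inodes = mdt_bytes / ratios[i];
		printf("%4d bytes/inode -> %4lluM inodes\n",
		       ratios[i], (unsigned long long)(inodes >> 20));
	}
	return 0;
}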
>>>>> "Johann" == Johann Lombardi <johann at whamcloud.com> writes:Johann> On Mon, May 02, 2011 at 05:51:36PM +0200, rf at q-leap.de Johann> wrote: >> It seems lfs df -i is indeed buggy showing totally bogus numbers >> (see https://bugzilla.lustre.org/show_bug.cgi?id=24489) Johann> In this case, you run df on the server directly, so you are Johann> comparing statfs information as returned by ext4/ldiskfs Johann> with what you get through lustre (i.e. lfs df or df on a Johann> lustre client). Lustre takes for granted that 1 EA block is Johann> needed for each inode (conservative approach) and adjusts Johann> the total number of inodes accordingly (see Johann> fsfilt_ext3_statfs()). That being said, with large inode Johann> support and mkfs.lustre adapting the inode size based on the Johann> default stripe count, i am not sure this "adjustment" makes Johann> sense any more. We could instead print a warning at mkfs Johann> time when the default stripe count cannot fit in the inode Johann> core and #inodes > #blocks. Sorry, but I''m not sure, what I''m supposed to understand from this. It''s a fact that ''lfs df -i '' numbers are bogus. I can fill up the whole filesystem with as many inodes as df on the server shows (tested this with the installation mentioned in bug 24489), so the latter inode number is correct. In my opinion this is a clear bug, and deserves fixing. Roland
On Tue, May 17, 2011 at 01:05:01PM +0200, rf at q-leap.de wrote:

> Sorry, but I'm not sure what I'm supposed to understand from this.

Lustre intentionally reduces the number of inodes returned via lfs df
-i when #inodes > #blocks.

> It's a fact that 'lfs df -i' numbers are bogus. I can fill up the
> whole filesystem with as many inodes as df on the server shows
> (tested this with the installation mentioned in bug 24489), so the
> latter inode number is correct.

That's because your default striping does not require an additional
EA block.

Johann
>>>>> "Johann" == Johann Lombardi <johann at whamcloud.com> writes:Johann> On Tue, May 17, 2011 at 01:05:01PM +0200, rf at q-leap.de Johann> wrote: >> Sorry, but I''m not sure, what I''m supposed to understand from >> this. Johann> Lustre intentionally reduces the number of inodes returned Johann> via lfs df -i when #blocks > #inodes. Which #blocks? The one of the MDT or the one of the whole FS? >> It''s a fact that ''lfs df -i '' numbers are bogus. I can fill up >> the whole filesystem with as many inodes as df on the server >> shows (tested this with the installation mentioned in bug 24489), >> so the latter inode number is correct. Johann> That''s because your default striping does not require an Johann> additional EA block. Does that imply I''m doing something wrong? What is an EA block anyway? We haven''t activated any striping. Roland
On Tue, May 17, 2011 at 02:13:37PM +0200, rf at q-leap.de wrote:

> Which #blocks? Those of the MDT or of the whole FS?

The #blocks on the MDT.

> Does that imply I'm doing something wrong?

Lustre is just *very* conservative. If you format your MDT with a
#blocks/#inodes ratio of 1, you won't get this problem.

> What is an EA block anyway?

The file striping configuration is stored in an extended attribute.
Depending on the number of file stripes (which can be changed
dynamically with lfs setstripe), this extended attribute is stored
either in the inode core or in an additional data block. In the
latter case, you need to allocate one data block for each file
creation.

> We haven't activated any striping.

Then you use the default stripe count, which is 1, and the extended
attribute fits in the inode. As I said in my first email, we could
probably do better and handle this at mkfs time. Andreas' patch
(http://review.whamcloud.com/#change,480) introducing
--stripe-count-hint should help in this regard.

Johann
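A hypothetical illustration of the in-core vs. external-block
decision Johann describes; the header and per-stripe sizes below are
assumptions for illustration, not the exact on-disk layout:

#include <stdbool.h>
#include <stdio.h>

/* Assumed sizes: ~32-byte striping-layout header plus ~24 bytes per
 * stripe. A striping EA that fits in the inode's spare space costs
 * nothing extra; one that does not costs a full data block for every
 * file created. */
static bool ea_fits_in_inode(int stripe_count, int inode_spare_bytes)
{
	int ea_bytes = 32 + 24 * stripe_count;

	return ea_bytes <= inode_spare_bytes;
}

int main(void)
{
	/* e.g. a 512-byte inode with ~256 bytes spare for EAs */
	for (int stripes = 1; stripes <= 16; stripes *= 2)
		printf("stripe count %2d: %s\n", stripes,
		       ea_fits_in_inode(stripes, 256) ?
		       "in inode core" : "needs external EA block");
	return 0;
}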
>>>>> "Johann" == Johann Lombardi <johann at whamcloud.com> writes:Johann> On Tue, May 17, 2011 at 02:13:37PM +0200, rf at q-leap.de Johann> wrote: >> Which #blocks? The one of the MDT or the one of the whole FS? Johann> The #blocks on the MDT. Ah, OK, that''s definitely the case here. >> Does that imply I''m doing something wrong? Johann> Lustre is just *very* conservative. If you format your MDT Johann> with a #blocks/#inodes ratio of 1, you won''t get this Johann> problem. I see. Unfortunately, we need that many inodes ... >> What is an EA block anyway? Johann> The file striping configuration is stored in an extended Johann> attribute. Depending on the number of file stripes (can be Johann> changed dynamically with lfs setstripe), this extended Johann> attribute is stored either in the inode core or in an Johann> additional data block. In the latter, you need to alloate Johann> one data block for each file creation. Got it. >> We haven''t activated any striping. Johann> Then you use the default stripe count which is 1 and the Johann> extended attribute fits in the inode. As i said in my first Johann> email, we could probably do better and handle this at mkfs Johann> time. Andreas'' patch Johann> (http://review.whamcloud.com/#change,480) introducing Johann> --stripe-count-hint should help in this regard. Great. Thanks a lot for this hint. Do you know whether this will eneter the 1.8 branch as well? Roland
On Tue, May 17, 2011 at 03:43:14PM +0200, rf at q-leap.de wrote:

> I see. Unfortunately, we need that many inodes ...

Well, in your case, you could just remove the culprit code from
fsfilt_ext3_statfs(), as done here:
http://review.whamcloud.com/#patch,sidebyside,480,8,lustre/lvfs/fsfilt_ext3.c

> Great. Thanks a lot for this hint. Do you know whether this will
> enter the 1.8 branch as well?

I would ask this question in the bugzilla ticket to get an answer
from someone from Oracle.

Johann

--
Johann Lombardi
Whamcloud, Inc.
www.whamcloud.com
>>>>> "Johann" == Johann Lombardi <johann at whamcloud.com> writes:Johann> On Tue, May 17, 2011 at 03:43:14PM +0200, rf at q-leap.de Johann> wrote: >> I see. Unfortunately, we need that many inodes ... Johann> Well, in your case, you could just remove the culprit code Johann> from fsfilt_ext3_statfs(), as done here: Johann> http://review.whamcloud.com/#patch,sidebyside,480,8,lustre/lvfs/fsfilt_ext3.c Ok, that''s easy. Thanks a lot for the pointer. >> Great. Thanks a lot for this hint. Do you know whether this will >> eneter the 1.8 branch as well? Johann> I would ask this question in the bugzilla ticket to have an Johann> answer from someone from Oracle. I will. Roland