On 11/09/2010 19:27, Robin Humble wrote:
> Hey Dr Stu,
>
> On Sat, Sep 11, 2010 at 04:27:43PM +0800, Stuart Midgley wrote:
>> We are getting jobs that fail due to no space left on device.
>> BUT none of our Lustre servers are full (as reported by lfs df -h on a
>> client and by df -h on the OSSs).
>> They are all close to being full, but are not actually full (still have
>> ~300 GB of space left).
> sounds like a grant problem.
>
>> I've tried playing around with tune2fs -m {0,1,2,3} and
>> tune2fs -r 1024 etc., and nothing appears to help.
>> Anyone have a similar problem? We are running 1.8.3
> there are a couple of grant leaks that are fixed in 1.8.4, e.g.
> https://bugzilla.lustre.org/show_bug.cgi?id=22755
> or see the 1.8.4 release notes.
>
> however the overall grant revoking problem is still unresolved AFAICT
> https://bugzilla.lustre.org/show_bug.cgi?id=12069
> and you'll hit that issue more frequently with many clients and small
> OSTs, or when any OST starts getting full.
>
> in your case ~300 GB per OST should be enough headroom unless you have
> ~4k clients now (assuming 32-64 MB of grant per client), so it's
> probably grant leaks. there's a recipe for adding up client grants and
> comparing them to server grants to see if they've gone wrong in bz 22755.
>
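For reference, the grant accounting Robin mentions can be checked through
the proc interface. A rough sketch, assuming the Lustre 1.8 parameter names
osc.*.cur_grant_bytes (client side) and obdfilter.*.tot_granted (OSS side);
verify the names against your release:

  # on every client: sum the grant each OSC currently holds (bytes)
  lctl get_param -n osc.*.cur_grant_bytes | awk '{sum += $1} END {print sum}'

  # on every OSS: how much grant the server believes is outstanding, per OST
  lctl get_param obdfilter.*.tot_granted

If the per-OST client totals, summed across all clients, drift well away
from the matching tot_granted on the server, grant has leaked; bz 22755
walks through this accounting in more detail.
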
Per BZ 22755, comment #96
(https://bugzilla.lustre.org/show_bug.cgi?id=22755#c96), you can arrest
the grant leak by setting the "grant shrink interval" to a large value
(if you also want to reset the server-side grant reservation, you will
have to remount the OSTs); a sketch follows below. We have applied this
workaround to our system with good results: we have been monitoring our
file systems with Nagios and have not seen a repeat of this problem.
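On each client the change amounts to something like the following (the
parameter name follows bz 22755; the exact name and units may vary between
releases, so treat this as a sketch and check your version):

  # push the grant shrink interval (seconds) so high that shrinking
  # effectively never runs
  lctl set_param osc.*.grant_shrink_interval=2147483647

followed by an unmount/remount of the OSTs if the servers' existing grant
reservations also need to be cleared.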
Malcolm.