Hello List,

we are trying to debug some issues - or possibly different manifestations
of the same issue - on our file system.

Causing most grief at the moment is that we sometimes see delays writing
files. From the writing client's end, it simply looks as if I/O stops for a
while (we've seen 'pauses' of anything up to 10 seconds). This appears to
be independent of which client does the writing and of the software doing
the writing. We investigated this a bit using strace and dd; the 'slow'
calls appear to always be either open, write, or close calls. Usually,
these take well below 0.001s; in around 0.5% to 1% of cases, they take up
to multiple seconds. It does not seem to be associated with any specific
OST, OSS, client or anything; there is nothing in any log files, nor any
exceptional load on the MDS, the OSSes or any of the clients.

The other issue is that we frequently see delays when trying to read a
file. It sometimes takes more than 60s for a file to be visible on a
machine after the initial write on a different machine has completed (both
machines being Lustre clients). Again, there is nothing in the logs, nor
exceptional load on any of the machines.

Any ideas what this could be? How can we debug this?

Clients and servers are using Lustre 1.6.7.2.ddn3.5.

Regards,
Tina

--
Tina Friedrich, Computer Systems Administrator, Diamond Light Source Ltd
Diamond House, Harwell Science and Innovation Campus - 01235 77 8442
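For illustration, a measurement of the kind described above can be
reproduced roughly like this; the mount point, file name, transfer sizes
and the one-second threshold are placeholders, not the values actually
used:

    # trace only the calls of interest, with per-call wall time appended (-T)
    strace -f -T -tt -e trace=open,write,close -o /tmp/dd.strace \
        dd if=/dev/zero of=/mnt/lustre/scratch/testfile bs=1M count=1000

    # strace -T appends each call's duration as "<seconds>"; list the calls
    # that took one second or longer
    grep -E '<[1-9][0-9]*\.' /tmp/dd.strace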
On 2010-09-02, at 06:43, Tina Friedrich wrote:
> Causing most grief at the moment is that we sometimes see delays writing
> files. From the writing client's end, it simply looks as if I/O stops for
> a while (we've seen 'pauses' of anything up to 10 seconds). This appears
> to be independent of which client does the writing and of the software
> doing the writing. We investigated this a bit using strace and dd; the
> 'slow' calls appear to always be either open, write, or close calls.
> Usually, these take well below 0.001s; in around 0.5% to 1% of cases,
> they take up to multiple seconds. It does not seem to be associated with
> any specific OST, OSS, client or anything; there is nothing in any log
> files, nor any exceptional load on the MDS, the OSSes or any of the
> clients.

This is most likely associated with delays in committing the journal on the
MDT or OST, which can happen if the journal fills completely. Having larger
journals can help, if you have enough RAM to keep them all in memory and
not overflow. Alternately, if you make the journals small it will limit the
latency, at the cost of reducing overall performance. A third alternative
might be to use SSDs for the journal devices.

> The other issue is that we frequently see delays when trying to read a
> file. It sometimes takes more than 60s for a file to be visible on a
> machine after the initial write on a different machine has completed
> (both machines being Lustre clients). Again, there is nothing in the
> logs, nor exceptional load on any of the machines.

This is probably just a manifestation of the first problem. The issue
likely isn't in the read, but a delay in flushing the data from the cache
of the writing client. There were fixes made in 1.8 to increase the IO
priority for clients writing data under a lock that other clients are
waiting on.

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.
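For reference, on an ldiskfs-backed MDT or OST the journal size can be
changed offline with tune2fs. This is only a sketch: /dev/sdX and the
1024 MB size are placeholders, the target must be stopped (unmounted) while
the journal is recreated, and the Lustre-patched e2fsprogs is normally what
should be used against ldiskfs devices:

    # the target must not be mounted while its journal is recreated
    e2fsck -fp /dev/sdX                # replay and check the existing journal
    tune2fs -O ^has_journal /dev/sdX   # remove the old internal journal
    tune2fs -j -J size=1024 /dev/sdX   # recreate it with the new size in MB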
On Thursday, September 02, 2010, Andreas Dilger wrote:
> On 2010-09-02, at 06:43, Tina Friedrich wrote:
> > Causing most grief at the moment is that we sometimes see delays
> > writing files. From the writing client's end, it simply looks as if
> > I/O stops for a while (we've seen 'pauses' of anything up to 10
> > seconds). This appears to be independent of which client does the
> > writing and of the software doing the writing. We investigated this a
> > bit using strace and dd; the 'slow' calls appear to always be either
> > open, write, or close calls. Usually, these take well below 0.001s; in
> > around 0.5% to 1% of cases, they take up to multiple seconds. It does
> > not seem to be associated with any specific OST, OSS, client or
> > anything; there is nothing in any log files, nor any exceptional load
> > on the MDS, the OSSes or any of the clients.
>
> This is most likely associated with delays in committing the journal on
> the MDT or OST, which can happen if the journal fills completely. Having
> larger journals can help, if you have enough RAM to keep them all in
> memory and not overflow. Alternately, if you make the journals small it
> will limit the latency, at the cost of reducing overall performance. A
> third alternative might be to use SSDs for the journal devices.

As Diamond uses DDN hardware, it should help in general with performance to
update to 1.8 and to enable the async journal feature. I guess it also
might help to reduce those delays, as writes are more optimized.

A question, though. Tina, do you use our DDN udev rules, which tune the
devices for optimized performance? If not, please send a mail to
support at ddn.com and ask for a recent udev rpm (available for RHEL5 only
so far; it also *might* work on SLES11, but udev syntax changes too often,
IMHO). Please put [lustre] into the subject line, as the Lustre team
maintains the rules.

Cheers,
Bernd
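Purely as an illustration of the kind of tuning such udev rules automate
(these are not the DDN rules or their recommended values - support will
supply the real ones), block-device parameters such as the I/O scheduler
and request sizes can also be set by hand through sysfs:

    # placeholder device name and values only
    echo deadline > /sys/block/sdX/queue/scheduler        # I/O scheduler
    echo 512      > /sys/block/sdX/queue/max_sectors_kb   # max request size
    echo 256      > /sys/block/sdX/queue/nr_requests      # request queue depth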
Hi Andreas,

thanks for your answer.

>> Causing most grief at the moment is that we sometimes see delays writing
>> files. From the writing client's end, it simply looks as if I/O stops
>> for a while (we've seen 'pauses' of anything up to 10 seconds). This
>> appears to be independent of which client does the writing and of the
>> software doing the writing. We investigated this a bit using strace and
>> dd; the 'slow' calls appear to always be either open, write, or close
>> calls. Usually, these take well below 0.001s; in around 0.5% to 1% of
>> cases, they take up to multiple seconds. It does not seem to be
>> associated with any specific OST, OSS, client or anything; there is
>> nothing in any log files, nor any exceptional load on the MDS, the
>> OSSes or any of the clients.
>
> This is most likely associated with delays in committing the journal on
> the MDT or OST, which can happen if the journal fills completely. Having
> larger journals can help, if you have enough RAM to keep them all in
> memory and not overflow. Alternately, if you make the journals small it
> will limit the latency, at the cost of reducing overall performance. A
> third alternative might be to use SSDs for the journal devices.

Just to double check - that would be the file system journal, I assume?

That makes a lot of sense; is there a way to verify that this is the issue
we're having?

Journal size appears to be 400M - if we were to try increasing it, how
would we determine what to best set it to?

>> The other issue is that we frequently see delays when trying to read a
>> file. It sometimes takes more than 60s for a file to be visible on a
>> machine after the initial write on a different machine has completed
>> (both machines being Lustre clients). Again, there is nothing in the
>> logs, nor exceptional load on any of the machines.
>
> This is probably just a manifestation of the first problem. The issue
> likely isn't in the read, but a delay in flushing the data from the cache
> of the writing client. There were fixes made in 1.8 to increase the IO
> priority for clients writing data under a lock that other clients are
> waiting on.

We kind of suspected them to be related, yes.

Tina

--
Tina Friedrich, Computer Systems Administrator, Diamond Light Source Ltd
Diamond House, Harwell Science and Innovation Campus - 01235 77 8442
Hi Bernd,

On 02/09/10 17:21, Bernd Schubert wrote:
> On Thursday, September 02, 2010, Andreas Dilger wrote:
>> On 2010-09-02, at 06:43, Tina Friedrich wrote:
>>> Causing most grief at the moment is that we sometimes see delays
>>> writing files. From the writing client's end, it simply looks as if
>>> I/O stops for a while (we've seen 'pauses' of anything up to 10
>>> seconds). This appears to be independent of which client does the
>>> writing and of the software doing the writing. We investigated this a
>>> bit using strace and dd; the 'slow' calls appear to always be either
>>> open, write, or close calls. Usually, these take well below 0.001s; in
>>> around 0.5% to 1% of cases, they take up to multiple seconds. It does
>>> not seem to be associated with any specific OST, OSS, client or
>>> anything; there is nothing in any log files, nor any exceptional load
>>> on the MDS, the OSSes or any of the clients.
>>
>> This is most likely associated with delays in committing the journal on
>> the MDT or OST, which can happen if the journal fills completely.
>> Having larger journals can help, if you have enough RAM to keep them
>> all in memory and not overflow. Alternately, if you make the journals
>> small it will limit the latency, at the cost of reducing overall
>> performance. A third alternative might be to use SSDs for the journal
>> devices.
>
> As Diamond uses DDN hardware, it should help in general with performance
> to update to 1.8 and to enable the async journal feature. I guess it
> also might help to reduce those delays, as writes are more optimized.

We have been considering an update; however, due to this being a production
file system (and an important one), that's not something that we can do
easily.

> A question, though. Tina, do you use our DDN udev rules, which tune the
> devices for optimized performance? If not, please send a mail to
> support at ddn.com and ask for a recent udev rpm (available for RHEL5
> only so far; it also *might* work on SLES11, but udev syntax changes too
> often, IMHO). Please put [lustre] into the subject line, as the Lustre
> team maintains the rules.

Well; I don't think so; not 100% sure. There does not appear to be anything
DDN specific in our udev rules (which makes me think that's a 'no'). I have
sent an email requesting them, and shall look into that as well.

Tina

--
Tina Friedrich, Computer Systems Administrator, Diamond Light Source Ltd
Diamond House, Harwell Science and Innovation Campus - 01235 77 8442
Hello,

On 02/09/10 17:28, Tina Friedrich wrote:
> Hi Andreas,
>
> thanks for your answer.
>
>>> Causing most grief at the moment is that we sometimes see delays
>>> writing files. From the writing client's end, it simply looks as if
>>> I/O stops for a while (we've seen 'pauses' of anything up to 10
>>> seconds). This appears to be independent of which client does the
>>> writing and of the software doing the writing. We investigated this a
>>> bit using strace and dd; the 'slow' calls appear to always be either
>>> open, write, or close calls. Usually, these take well below 0.001s; in
>>> around 0.5% to 1% of cases, they take up to multiple seconds. It does
>>> not seem to be associated with any specific OST, OSS, client or
>>> anything; there is nothing in any log files, nor any exceptional load
>>> on the MDS, the OSSes or any of the clients.
>>
>> This is most likely associated with delays in committing the journal on
>> the MDT or OST, which can happen if the journal fills completely.
>> Having larger journals can help, if you have enough RAM to keep them
>> all in memory and not overflow. Alternately, if you make the journals
>> small it will limit the latency, at the cost of reducing overall
>> performance. A third alternative might be to use SSDs for the journal
>> devices.
>
> Just to double check - that would be the file system journal, I assume?
>
> That makes a lot of sense; is there a way to verify that this is the
> issue we're having?
>
> Journal size appears to be 400M - if we were to try increasing it, how
> would we determine what to best set it to?

That was meant to be 'if we were to try increasing or decreasing it' - it
sounds to us as if decreasing might be the better option (as in, if this is
the journal flushing, having less journal to flush would probably be better
- or is that the wrong idea?).

>>> The other issue is that we frequently see delays when trying to read a
>>> file. It sometimes takes more than 60s for a file to be visible on a
>>> machine after the initial write on a different machine has completed
>>> (both machines being Lustre clients). Again, there is nothing in the
>>> logs, nor exceptional load on any of the machines.
>>
>> This is probably just a manifestation of the first problem. The issue
>> likely isn't in the read, but a delay in flushing the data from the
>> cache of the writing client. There were fixes made in 1.8 to increase
>> the IO priority for clients writing data under a lock that other
>> clients are waiting on.
>
> We kind of suspected them to be related, yes.
>
> Tina

--
Tina Friedrich, Computer Systems Administrator, Diamond Light Source Ltd
Diamond House, Harwell Science and Innovation Campus - 01235 77 8442
And another quick question - would this be more likely to be the journal on
the MDS, or the OSS servers?

On 02/09/10 17:38, Tina Friedrich wrote:
> Hello,
>
> On 02/09/10 17:28, Tina Friedrich wrote:
>> Hi Andreas,
>>
>> thanks for your answer.
>>
>>>> Causing most grief at the moment is that we sometimes see delays
>>>> writing files. From the writing client's end, it simply looks as if
>>>> I/O stops for a while (we've seen 'pauses' of anything up to 10
>>>> seconds). This appears to be independent of which client does the
>>>> writing and of the software doing the writing. We investigated this a
>>>> bit using strace and dd; the 'slow' calls appear to always be either
>>>> open, write, or close calls. Usually, these take well below 0.001s;
>>>> in around 0.5% to 1% of cases, they take up to multiple seconds. It
>>>> does not seem to be associated with any specific OST, OSS, client or
>>>> anything; there is nothing in any log files, nor any exceptional load
>>>> on the MDS, the OSSes or any of the clients.
>>>
>>> This is most likely associated with delays in committing the journal
>>> on the MDT or OST, which can happen if the journal fills completely.
>>> Having larger journals can help, if you have enough RAM to keep them
>>> all in memory and not overflow. Alternately, if you make the journals
>>> small it will limit the latency, at the cost of reducing overall
>>> performance. A third alternative might be to use SSDs for the journal
>>> devices.
>>
>> Just to double check - that would be the file system journal, I assume?
>>
>> That makes a lot of sense; is there a way to verify that this is the
>> issue we're having?
>>
>> Journal size appears to be 400M - if we were to try increasing it, how
>> would we determine what to best set it to?
>
> That was meant to be 'if we were to try increasing or decreasing it' - it
> sounds to us as if decreasing might be the better option (as in, if this
> is the journal flushing, having less journal to flush would probably be
> better - or is that the wrong idea?).
>
>>>> The other issue is that we frequently see delays when trying to read
>>>> a file. It sometimes takes more than 60s for a file to be visible on
>>>> a machine after the initial write on a different machine has
>>>> completed (both machines being Lustre clients). Again, there is
>>>> nothing in the logs, nor exceptional load on any of the machines.
>>>
>>> This is probably just a manifestation of the first problem. The issue
>>> likely isn't in the read, but a delay in flushing the data from the
>>> cache of the writing client. There were fixes made in 1.8 to increase
>>> the IO priority for clients writing data under a lock that other
>>> clients are waiting on.
>>
>> We kind of suspected them to be related, yes.
>>
>> Tina

--
Tina Friedrich, Computer Systems Administrator, Diamond Light Source Ltd
Diamond House, Harwell Science and Innovation Campus - 01235 77 8442
It could be both. We had good success with performance improvements up to
1GB journals, and AFAIK some customers have reduced their journal size to
128 MB without significant performance impact, to reduce RAM consumption on
their OSSes. We haven't really done any testing with reduced journal sizes
to reduce latency, so I don't know what performance you will see. The
minimum possible size is about 6MB, which would unfortunately single-thread
all of your IO and probably give lousy performance unless the journal was
on an SSD.

Note that the amount of RAM used by the kernel journalling code can equal
the size of the journal, so don't make it too large for your system.

That said, my speculation that this stall is caused by the journal is still
only speculation. You should check the /proc/fs/jbd/{dev}/history file to
see the commit timings of recent transactions.

Cheers, Andreas

On 2010-09-03, at 4:04, Tina Friedrich <Tina.Friedrich at diamond.ac.uk>
wrote:
> And another quick question - would this be more likely to be the journal
> on the MDS, or the OSS servers?
>
> On 02/09/10 17:38, Tina Friedrich wrote:
>> Hello,
>>
>> On 02/09/10 17:28, Tina Friedrich wrote:
>>> Hi Andreas,
>>>
>>> thanks for your answer.
>>>
>>>>> Causing most grief at the moment is that we sometimes see delays
>>>>> writing files. From the writing client's end, it simply looks as if
>>>>> I/O stops for a while (we've seen 'pauses' of anything up to 10
>>>>> seconds). This appears to be independent of which client does the
>>>>> writing and of the software doing the writing. We investigated this
>>>>> a bit using strace and dd; the 'slow' calls appear to always be
>>>>> either open, write, or close calls. Usually, these take well below
>>>>> 0.001s; in around 0.5% to 1% of cases, they take up to multiple
>>>>> seconds. It does not seem to be associated with any specific OST,
>>>>> OSS, client or anything; there is nothing in any log files, nor any
>>>>> exceptional load on the MDS, the OSSes or any of the clients.
>>>>
>>>> This is most likely associated with delays in committing the journal
>>>> on the MDT or OST, which can happen if the journal fills completely.
>>>> Having larger journals can help, if you have enough RAM to keep them
>>>> all in memory and not overflow. Alternately, if you make the journals
>>>> small it will limit the latency, at the cost of reducing overall
>>>> performance. A third alternative might be to use SSDs for the journal
>>>> devices.
>>>
>>> Just to double check - that would be the file system journal, I
>>> assume?
>>>
>>> That makes a lot of sense; is there a way to verify that this is the
>>> issue we're having?
>>>
>>> Journal size appears to be 400M - if we were to try increasing it, how
>>> would we determine what to best set it to?
>>
>> That was meant to be 'if we were to try increasing or decreasing it' -
>> it sounds to us as if decreasing might be the better option (as in, if
>> this is the journal flushing, having less journal to flush would
>> probably be better - or is that the wrong idea?).
>>
>>>>> The other issue is that we frequently see delays when trying to read
>>>>> a file. It sometimes takes more than 60s for a file to be visible on
>>>>> a machine after the initial write on a different machine has
>>>>> completed (both machines being Lustre clients). Again, there is
>>>>> nothing in the logs, nor exceptional load on any of the machines.
>>>>
>>>> This is probably just a manifestation of the first problem. The issue
>>>> likely isn't in the read, but a delay in flushing the data from the
>>>> cache of the writing client. There were fixes made in 1.8 to increase
>>>> the IO priority for clients writing data under a lock that other
>>>> clients are waiting on.
>>>
>>> We kind of suspected them to be related, yes.
>>>
>>> Tina
>
> --
> Tina Friedrich, Computer Systems Administrator, Diamond Light Source Ltd
> Diamond House, Harwell Science and Innovation Campus - 01235 77 8442
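As a starting point for that check, something along these lines can be run
on the MDS and each OSS; 'sdb1' is only a placeholder for whatever ldiskfs
device backs the MDT or OST, and the exact layout under /proc/fs/jbd
depends on the kernel and ldiskfs version in use:

    # see which journalled devices the kernel exposes
    ls /proc/fs/jbd/

    # dump the recent transaction history for one of them; transactions
    # whose long commit/flush times line up with the observed multi-second
    # stalls would point at the journal
    cat /proc/fs/jbd/sdb1/history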