Hello List,

we are trying to debug some issues - or possibly different manifestations
of the same issue - on our file system.

Causing most grief at the moment is that we sometimes see delays writing
files. From the writing client's end, it simply looks as if I/O stops for a
while (we've seen 'pauses' of anything up to 10 seconds). This appears to
be independent of which client does the writing and of the software doing
the writing. We investigated this a bit using strace and dd; the 'slow'
calls appear to always be either open, write, or close calls. Usually,
these take well below 0.001s; in around 0.5% to 1% of cases, they take up
to multiple seconds. It does not seem to be associated with any specific
OST, OSS, client or anything; there is nothing in any log files, nor any
exceptional load on the MDS, the OSSes or any of the clients.

The other issue is that we frequently see delays when trying to read a
file. It sometimes takes more than 60s for a file to be visible on a
machine after the initial write on a different machine has completed (both
machines being Lustre clients). Again, there is nothing in the logs, nor
exceptional load on any of the machines.

Any ideas what this could be? How can we debug this?

Clients and servers are using Lustre 1.6.7.2.ddn3.5.

Regards,
Tina

--
Tina Friedrich, Computer Systems Administrator, Diamond Light Source Ltd
Diamond House, Harwell Science and Innovation Campus - 01235 77 8442
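For illustration, a measurement of the kind described above can be
reproduced roughly like this; the mount point, file name, transfer sizes
and the one-second threshold are placeholders, not the values actually
used:

    # trace only the calls of interest, with per-call wall time appended (-T)
    strace -f -T -tt -e trace=open,write,close -o /tmp/dd.strace \
        dd if=/dev/zero of=/mnt/lustre/scratch/testfile bs=1M count=1000

    # strace -T appends each call's duration as "<seconds>"; list the calls
    # that took one second or longer
    grep -E '<[1-9][0-9]*\.' /tmp/dd.strace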
On 2010-09-02, at 06:43, Tina Friedrich wrote:
> Causing most grief at the moment is that we sometimes see delays writing
> files. From the writing client's end, it simply looks as if I/O stops for
> a while (we've seen 'pauses' of anything up to 10 seconds). This appears
> to be independent of which client does the writing and of the software
> doing the writing. We investigated this a bit using strace and dd; the
> 'slow' calls appear to always be either open, write, or close calls.
> Usually, these take well below 0.001s; in around 0.5% to 1% of cases,
> they take up to multiple seconds. It does not seem to be associated with
> any specific OST, OSS, client or anything; there is nothing in any log
> files, nor any exceptional load on the MDS, the OSSes or any of the
> clients.

This is most likely associated with delays in committing the journal on the
MDT or OST, which can happen if the journal fills completely. Having larger
journals can help, if you have enough RAM to keep them all in memory and
not overflow. Alternately, if you make the journals small it will limit the
latency, at the cost of reducing overall performance. A third alternative
might be to use SSDs for the journal devices.

> The other issue is that we frequently see delays when trying to read a
> file. It sometimes takes more than 60s for a file to be visible on a
> machine after the initial write on a different machine has completed
> (both machines being Lustre clients). Again, there is nothing in the
> logs, nor exceptional load on any of the machines.

This is probably just a manifestation of the first problem. The issue
likely isn't in the read, but a delay in flushing the data from the cache
of the writing client. There were fixes made in 1.8 to increase the IO
priority for clients writing data under a lock that other clients are
waiting on.

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.
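For reference, on an ldiskfs-backed MDT or OST the journal size can be
changed offline with tune2fs. This is only a sketch: /dev/sdX and the
1024 MB size are placeholders, the target must be stopped (unmounted) while
the journal is recreated, and the Lustre-patched e2fsprogs is normally what
should be used against ldiskfs devices:

    # the target must not be mounted while its journal is recreated
    e2fsck -fp /dev/sdX                # replay and check the existing journal
    tune2fs -O ^has_journal /dev/sdX   # remove the old internal journal
    tune2fs -j -J size=1024 /dev/sdX   # recreate it with the new size in MB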
On Thursday, September 02, 2010, Andreas Dilger wrote:
> On 2010-09-02, at 06:43, Tina Friedrich wrote:
> > Causing most grief at the moment is that we sometimes see delays
> > writing files. From the writing client's end, it simply looks as if
> > I/O stops for a while (we've seen 'pauses' of anything up to 10
> > seconds). This appears to be independent of which client does the
> > writing and of the software doing the writing. We investigated this a
> > bit using strace and dd; the 'slow' calls appear to always be either
> > open, write, or close calls. Usually, these take well below 0.001s; in
> > around 0.5% to 1% of cases, they take up to multiple seconds. It does
> > not seem to be associated with any specific OST, OSS, client or
> > anything; there is nothing in any log files, nor any exceptional load
> > on the MDS, the OSSes or any of the clients.
>
> This is most likely associated with delays in committing the journal on
> the MDT or OST, which can happen if the journal fills completely. Having
> larger journals can help, if you have enough RAM to keep them all in
> memory and not overflow. Alternately, if you make the journals small it
> will limit the latency, at the cost of reducing overall performance. A
> third alternative might be to use SSDs for the journal devices.

As Diamond uses DDN hardware, it should help in general with performance to
update to 1.8 and to enable the async journal feature. I guess it also
might help to reduce those delays, as writes are more optimized.

A question, though. Tina, do you use our DDN udev rules, which tune the
devices for optimized performance? If not, please send a mail to
support at ddn.com and ask for a recent udev rpm (available for RHEL5 only
so far; it also *might* work on SLES11, but udev syntax changes too often,
IMHO). Please put [lustre] into the subject line, as the Lustre team
maintains the rules.

Cheers,
Bernd
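Purely as an illustration of the kind of tuning such udev rules automate
(these are not the DDN rules or their recommended values - support will
supply the real ones), block-device parameters such as the I/O scheduler
and request sizes can also be set by hand through sysfs:

    # placeholder device name and values only
    echo deadline > /sys/block/sdX/queue/scheduler        # I/O scheduler
    echo 512      > /sys/block/sdX/queue/max_sectors_kb   # max request size
    echo 256      > /sys/block/sdX/queue/nr_requests      # request queue depth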
Hi Andreas,

thanks for your answer.

>> Causing most grief at the moment is that we sometimes see delays writing
>> files. From the writing client's end, it simply looks as if I/O stops
>> for a while (we've seen 'pauses' of anything up to 10 seconds). This
>> appears to be independent of which client does the writing and of the
>> software doing the writing. We investigated this a bit using strace and
>> dd; the 'slow' calls appear to always be either open, write, or close
>> calls. Usually, these take well below 0.001s; in around 0.5% to 1% of
>> cases, they take up to multiple seconds. It does not seem to be
>> associated with any specific OST, OSS, client or anything; there is
>> nothing in any log files, nor any exceptional load on the MDS, the
>> OSSes or any of the clients.
>
> This is most likely associated with delays in committing the journal on
> the MDT or OST, which can happen if the journal fills completely. Having
> larger journals can help, if you have enough RAM to keep them all in
> memory and not overflow. Alternately, if you make the journals small it
> will limit the latency, at the cost of reducing overall performance. A
> third alternative might be to use SSDs for the journal devices.

Just to double check - that would be the file system journal, I assume?

That makes a lot of sense; is there a way to verify that this is the issue
we're having?

Journal size appears to be 400M - if we were to try increasing it, how
would we determine what to best set it to?

>> The other issue is that we frequently see delays when trying to read a
>> file. It sometimes takes more than 60s for a file to be visible on a
>> machine after the initial write on a different machine has completed
>> (both machines being Lustre clients). Again, there is nothing in the
>> logs, nor exceptional load on any of the machines.
>
> This is probably just a manifestation of the first problem. The issue
> likely isn't in the read, but a delay in flushing the data from the cache
> of the writing client. There were fixes made in 1.8 to increase the IO
> priority for clients writing data under a lock that other clients are
> waiting on.

We kind of suspected them to be related, yes.

Tina

--
Tina Friedrich, Computer Systems Administrator, Diamond Light Source Ltd
Diamond House, Harwell Science and Innovation Campus - 01235 77 8442
Hi Bernd,

On 02/09/10 17:21, Bernd Schubert wrote:
> On Thursday, September 02, 2010, Andreas Dilger wrote:
>> On 2010-09-02, at 06:43, Tina Friedrich wrote:
>>> Causing most grief at the moment is that we sometimes see delays
>>> writing files. From the writing client's end, it simply looks as if
>>> I/O stops for a while (we've seen 'pauses' of anything up to 10
>>> seconds). This appears to be independent of which client does the
>>> writing and of the software doing the writing. We investigated this a
>>> bit using strace and dd; the 'slow' calls appear to always be either
>>> open, write, or close calls. Usually, these take well below 0.001s; in
>>> around 0.5% to 1% of cases, they take up to multiple seconds. It does
>>> not seem to be associated with any specific OST, OSS, client or
>>> anything; there is nothing in any log files, nor any exceptional load
>>> on the MDS, the OSSes or any of the clients.
>>
>> This is most likely associated with delays in committing the journal on
>> the MDT or OST, which can happen if the journal fills completely.
>> Having larger journals can help, if you have enough RAM to keep them
>> all in memory and not overflow. Alternately, if you make the journals
>> small it will limit the latency, at the cost of reducing overall
>> performance. A third alternative might be to use SSDs for the journal
>> devices.
>
> As Diamond uses DDN hardware, it should help in general with performance
> to update to 1.8 and to enable the async journal feature. I guess it
> also might help to reduce those delays, as writes are more optimized.

We have been considering an update; however, due to this being a production
file system (and an important one), that's not something that we can do
easily.

> A question, though. Tina, do you use our DDN udev rules, which tune the
> devices for optimized performance? If not, please send a mail to
> support at ddn.com and ask for a recent udev rpm (available for RHEL5
> only so far; it also *might* work on SLES11, but udev syntax changes too
> often, IMHO). Please put [lustre] into the subject line, as the Lustre
> team maintains the rules.

Well; I don't think so; not 100% sure. There does not appear to be anything
DDN specific in our udev rules (which makes me think that's a 'no'). I have
sent an email requesting them, and shall look into that as well.

Tina

--
Tina Friedrich, Computer Systems Administrator, Diamond Light Source Ltd
Diamond House, Harwell Science and Innovation Campus - 01235 77 8442
Hello,

On 02/09/10 17:28, Tina Friedrich wrote:
> Hi Andreas,
>
> thanks for your answer.
>
>>> Causing most grief at the moment is that we sometimes see delays
>>> writing files. From the writing client's end, it simply looks as if
>>> I/O stops for a while (we've seen 'pauses' of anything up to 10
>>> seconds). This appears to be independent of which client does the
>>> writing and of the software doing the writing. We investigated this a
>>> bit using strace and dd; the 'slow' calls appear to always be either
>>> open, write, or close calls. Usually, these take well below 0.001s; in
>>> around 0.5% to 1% of cases, they take up to multiple seconds. It does
>>> not seem to be associated with any specific OST, OSS, client or
>>> anything; there is nothing in any log files, nor any exceptional load
>>> on the MDS, the OSSes or any of the clients.
>>
>> This is most likely associated with delays in committing the journal on
>> the MDT or OST, which can happen if the journal fills completely.
>> Having larger journals can help, if you have enough RAM to keep them
>> all in memory and not overflow. Alternately, if you make the journals
>> small it will limit the latency, at the cost of reducing overall
>> performance. A third alternative might be to use SSDs for the journal
>> devices.
>
> Just to double check - that would be the file system journal, I assume?
>
> That makes a lot of sense; is there a way to verify that this is the
> issue we're having?
>
> Journal size appears to be 400M - if we were to try increasing it, how
> would we determine what to best set it to?

That was meant to be 'if we were to try increasing or decreasing it' - it
sounds to us as if decreasing might be the better option (as in, if this is
the journal flushing, having less journal to flush would probably be better
- or is that the wrong idea?).

>>> The other issue is that we frequently see delays when trying to read a
>>> file. It sometimes takes more than 60s for a file to be visible on a
>>> machine after the initial write on a different machine has completed
>>> (both machines being Lustre clients). Again, there is nothing in the
>>> logs, nor exceptional load on any of the machines.
>>
>> This is probably just a manifestation of the first problem. The issue
>> likely isn't in the read, but a delay in flushing the data from the
>> cache of the writing client. There were fixes made in 1.8 to increase
>> the IO priority for clients writing data under a lock that other
>> clients are waiting on.
>
> We kind of suspected them to be related, yes.
>
> Tina

--
Tina Friedrich, Computer Systems Administrator, Diamond Light Source Ltd
Diamond House, Harwell Science and Innovation Campus - 01235 77 8442
And another quick question - would this be more likely to be the journal on
the MDS, or the OSS servers?

On 02/09/10 17:38, Tina Friedrich wrote:
> Hello,
>
> On 02/09/10 17:28, Tina Friedrich wrote:
>> Hi Andreas,
>>
>> thanks for your answer.
>>
>>>> Causing most grief at the moment is that we sometimes see delays
>>>> writing files. From the writing client's end, it simply looks as if
>>>> I/O stops for a while (we've seen 'pauses' of anything up to 10
>>>> seconds). This appears to be independent of which client does the
>>>> writing and of the software doing the writing. We investigated this a
>>>> bit using strace and dd; the 'slow' calls appear to always be either
>>>> open, write, or close calls. Usually, these take well below 0.001s;
>>>> in around 0.5% to 1% of cases, they take up to multiple seconds. It
>>>> does not seem to be associated with any specific OST, OSS, client or
>>>> anything; there is nothing in any log files, nor any exceptional load
>>>> on the MDS, the OSSes or any of the clients.
>>>
>>> This is most likely associated with delays in committing the journal
>>> on the MDT or OST, which can happen if the journal fills completely.
>>> Having larger journals can help, if you have enough RAM to keep them
>>> all in memory and not overflow. Alternately, if you make the journals
>>> small it will limit the latency, at the cost of reducing overall
>>> performance. A third alternative might be to use SSDs for the journal
>>> devices.
>>
>> Just to double check - that would be the file system journal, I assume?
>>
>> That makes a lot of sense; is there a way to verify that this is the
>> issue we're having?
>>
>> Journal size appears to be 400M - if we were to try increasing it, how
>> would we determine what to best set it to?
>
> That was meant to be 'if we were to try increasing or decreasing it' - it
> sounds to us as if decreasing might be the better option (as in, if this
> is the journal flushing, having less journal to flush would probably be
> better - or is that the wrong idea?).
>
>>>> The other issue is that we frequently see delays when trying to read
>>>> a file. It sometimes takes more than 60s for a file to be visible on
>>>> a machine after the initial write on a different machine has
>>>> completed (both machines being Lustre clients). Again, there is
>>>> nothing in the logs, nor exceptional load on any of the machines.
>>>
>>> This is probably just a manifestation of the first problem. The issue
>>> likely isn't in the read, but a delay in flushing the data from the
>>> cache of the writing client. There were fixes made in 1.8 to increase
>>> the IO priority for clients writing data under a lock that other
>>> clients are waiting on.
>>
>> We kind of suspected them to be related, yes.
>>
>> Tina

--
Tina Friedrich, Computer Systems Administrator, Diamond Light Source Ltd
Diamond House, Harwell Science and Innovation Campus - 01235 77 8442
It could be both. We had good success with performance improvements up to
1GB journals, and AFAIK some customers have reduced their journal size to
128 MB without significant performance impact, to reduce RAM consumption on
their OSSes. We haven't really done any testing with reduced journal sizes
to reduce latency, so I don't know what performance you will see. The
minimum possible size is about 6MB, which would unfortunately single-thread
all of your IO and probably give lousy performance unless the journal was
on an SSD.

Note that the amount of RAM used by the kernel journalling code can equal
the size of the journal, so don't make it too large for your system.

That said, my speculation that this stall is caused by the journal is still
only speculation. You should check the /proc/fs/jbd/{dev}/history file to
see the commit timings of recent transactions.

Cheers, Andreas

On 2010-09-03, at 4:04, Tina Friedrich <Tina.Friedrich at diamond.ac.uk>
wrote:
> And another quick question - would this be more likely to be the journal
> on the MDS, or the OSS servers?
>
> On 02/09/10 17:38, Tina Friedrich wrote:
>> Hello,
>>
>> On 02/09/10 17:28, Tina Friedrich wrote:
>>> Hi Andreas,
>>>
>>> thanks for your answer.
>>>
>>>>> Causing most grief at the moment is that we sometimes see delays
>>>>> writing files. From the writing client's end, it simply looks as if
>>>>> I/O stops for a while (we've seen 'pauses' of anything up to 10
>>>>> seconds). This appears to be independent of which client does the
>>>>> writing and of the software doing the writing. We investigated this
>>>>> a bit using strace and dd; the 'slow' calls appear to always be
>>>>> either open, write, or close calls. Usually, these take well below
>>>>> 0.001s; in around 0.5% to 1% of cases, they take up to multiple
>>>>> seconds. It does not seem to be associated with any specific OST,
>>>>> OSS, client or anything; there is nothing in any log files, nor any
>>>>> exceptional load on the MDS, the OSSes or any of the clients.
>>>>
>>>> This is most likely associated with delays in committing the journal
>>>> on the MDT or OST, which can happen if the journal fills completely.
>>>> Having larger journals can help, if you have enough RAM to keep them
>>>> all in memory and not overflow. Alternately, if you make the journals
>>>> small it will limit the latency, at the cost of reducing overall
>>>> performance. A third alternative might be to use SSDs for the journal
>>>> devices.
>>>
>>> Just to double check - that would be the file system journal, I
>>> assume?
>>>
>>> That makes a lot of sense; is there a way to verify that this is the
>>> issue we're having?
>>>
>>> Journal size appears to be 400M - if we were to try increasing it, how
>>> would we determine what to best set it to?
>>
>> That was meant to be 'if we were to try increasing or decreasing it' -
>> it sounds to us as if decreasing might be the better option (as in, if
>> this is the journal flushing, having less journal to flush would
>> probably be better - or is that the wrong idea?).
>>
>>>>> The other issue is that we frequently see delays when trying to read
>>>>> a file. It sometimes takes more than 60s for a file to be visible on
>>>>> a machine after the initial write on a different machine has
>>>>> completed (both machines being Lustre clients). Again, there is
>>>>> nothing in the logs, nor exceptional load on any of the machines.
>>>>
>>>> This is probably just a manifestation of the first problem. The issue
>>>> likely isn't in the read, but a delay in flushing the data from the
>>>> cache of the writing client. There were fixes made in 1.8 to increase
>>>> the IO priority for clients writing data under a lock that other
>>>> clients are waiting on.
>>>
>>> We kind of suspected them to be related, yes.
>>>
>>> Tina
>
> --
> Tina Friedrich, Computer Systems Administrator, Diamond Light Source Ltd
> Diamond House, Harwell Science and Innovation Campus - 01235 77 8442
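As a starting point for that check, something along these lines can be run
on the MDS and each OSS; 'sdb1' is only a placeholder for whatever ldiskfs
device backs the MDT or OST, and the exact layout under /proc/fs/jbd
depends on the kernel and ldiskfs version in use:

    # see which journalled devices the kernel exposes
    ls /proc/fs/jbd/

    # dump the recent transaction history for one of them; transactions
    # whose long commit/flush times line up with the observed multi-second
    # stalls would point at the journal
    cat /proc/fs/jbd/sdb1/history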