thr3ads.net - Lustre discuss - [Lustre-discuss] OST went down running Lustre 1.6.6 [Feb 2009]


Brian Stone

2009-Feb-09 18:39 UTC

[Lustre-discuss] OST went down running Lustre 1.6.6

Hi,

We are running Lustre 1.6.6 on SLES 10 SP2, and an OST crashed due to a SCSI
failure. Recovery completed, but not all of the clients reconnected. For example:

status: COMPLETE
recovery_start: 1233712363
recovery_duration: 320
completed_clients: 2765/5892
replayed_requests: 0
last_transno: 45914950

Some clients are still evicted, but not all. Why would recovery
complete if not all clients reconnected?

Thanks,
Brian Stone
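
As a reading aid for the status dump in the message above, here is a minimal sketch that parses the `key: value` lines and reports how many expected clients took part in recovery. The field names are taken from the output as posted; the function names `parse_recovery_status` and `reconnect_summary` are hypothetical helpers, not part of Lustre.

```python
def parse_recovery_status(text):
    """Parse 'key: value' lines from a recovery status dump into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines that have no colon
            fields[key.strip()] = value.strip()
    return fields


def reconnect_summary(fields):
    """Return (reconnected, expected, missing) from completed_clients."""
    done, expected = (int(n) for n in fields["completed_clients"].split("/"))
    return done, expected, expected - done


# The dump exactly as posted in the message.
SAMPLE = """\
status: COMPLETE
recovery_start: 1233712363
recovery_duration: 320
completed_clients: 2765/5892
replayed_requests: 0
last_transno: 45914950
"""

fields = parse_recovery_status(SAMPLE)
done, expected, missing = reconnect_summary(fields)
print(f"status={fields['status']}: {done}/{expected} clients reconnected, "
      f"{missing} did not ({100.0 * done / expected:.1f}% participation)")
```

A status of COMPLETE together with a `completed_clients` ratio below 1 is the situation the question is about: recovery ended, but without every client.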

Brian J. Murrell

2009-Feb-10 14:56 UTC


On Mon, 2009-02-09 at 13:39 -0500, Brian Stone wrote:
> Some clients are still evicted, but not all. Why would the recovery
> complete if all clients did not reconnect?

Recovery can't wait indefinitely for all clients to connect.  You could
wind up with a recovery that never completes if it did.  After a
timeout, if not all clients have connected, recovery is aborted and the
target proceeds to the completion state so that it can respond to new
requests.

b.
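
The behaviour described in the reply above can be illustrated with a toy model. This is only a sketch under stated assumptions, not Lustre's implementation: `run_recovery`, its arguments, and the 320-second window used in the example are hypothetical, and real recovery involves far more state than a set of reconnect times.

```python
def run_recovery(expected, reconnect_at, window):
    """Toy model of a bounded recovery window.

    expected     -- set of client names the target expects to reconnect
    reconnect_at -- dict: client -> seconds after recovery start at which
                    it reconnected (clients that never came back are absent)
    window       -- seconds the target waits before giving up (hypothetical)
    """
    never = float("inf")
    on_time = {c for c in expected if reconnect_at.get(c, never) <= window}
    missing = expected - on_time
    if missing:
        # The window expired: stop waiting and start serving new requests.
        return {
            "aborted": True,
            "duration": window,
            "completed_clients": f"{len(on_time)}/{len(expected)}",
            "missing": sorted(missing),
        }
    # Everyone reconnected: recovery ends when the last client arrives.
    return {
        "aborted": False,
        "duration": max((reconnect_at[c] for c in expected), default=0),
        "completed_clients": f"{len(on_time)}/{len(expected)}",
        "missing": [],
    }


clients = {"c1", "c2", "c3"}
partial = run_recovery(clients, {"c1": 5, "c2": 40}, window=320)
full = run_recovery(clients, {"c1": 5, "c2": 40, "c3": 100}, window=320)
print(partial)
print(full)
```

In the first call one client never returns, so the model gives up when the window expires; in the second, all three return and recovery ends as soon as the last one arrives.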
 

Brian Stone

2009-Feb-11 15:23 UTC


My customer believes that this scenario is leading to corrupted files. 
If so, is there any way to avoid the file corruption when an OSS goes down?

Probably related is the following:

A filesystem hang was reported to me on Friday 2/6. The problem appeared 
to be localized to (OSS for /nobackupp2). Examination showed that all 
512 ll_ost_io threads were stuck waiting for lock callbacks to complete.
I think the stack trace for the ost_io threads indicates that they're
waiting for a blocking AST to complete, which was requested for an
ost_punch (truncate) operation. For example:

 ll_ost_io_329 S ffff8103fae15888     0   634      1           635   633 (L-TLB)
 ffff8103fae15838 0000000000000046 ffff8100ace9ad40 000000000000000a
        ffff8103fae0da48 ffff8103fae0d7f0 ffff810009035800 000578ee6a132423
        0000000000001971 00000000000061a8
 Call Trace: <ffffffff885903fd>{:ptlrpc:ldlm_expired_completion_wait+173}
        <ffffffff88590350>{:ptlrpc:ldlm_expired_completion_wait+0}
        <ffffffff88591dfd>{:ptlrpc:ldlm_completion_ast+845}
        <ffffffff885780bb>{:ptlrpc:ldlm_lock_enqueue+2171} <ffffffff8012c8a9>{default_wake_function+0}
        <ffffffff88573dca>{:ptlrpc:ldlm_lock_addref_internal_nolock+58}
        <ffffffff88590a9b>{:ptlrpc:ldlm_cli_enqueue_local+1275}
        <ffffffff885b56f2>{:ptlrpc:lustre_swab_buf+66} <ffffffff887dae38>{:ost:ost_punch+1320}
        <ffffffff8858e270>{:ptlrpc:ldlm_blocking_ast+0} <ffffffff88591ab0>{:ptlrpc:ldlm_completion_ast+0}
        <ffffffff8858afc0>{:ptlrpc:ldlm_glimpse_ast+0} <ffffffff885b2ff5>{:ptlrpc:lustre_msg_get_opc+53}
        <ffffffff887ded5d>{:ost:ost_msg_check_version+317} <ffffffff887e4c7e>{:ost:ost_handle+13454}
        <ffffffff801823a9>{cache_alloc_refill+109} <ffffffff88515995>{:obdclass:class_handle2object+213}
        <ffffffff885b2765>{:ptlrpc:lustre_msg_get_conn_cnt+53}
        <ffffffff8012bac9>{find_busiest_group+360} <ffffffff885bc60a>{:ptlrpc:ptlrpc_check_req+26}
        <ffffffff885be867>{:ptlrpc:ptlrpc_server_handle_request+2503}
        <ffffffff8010f239>{do_gettimeofday+92} <ffffffff8847c3d6>{:libcfs:lcw_update_time+38}
        <ffffffff8012ac78>{__wake_up_common+64} <ffffffff885c19d1>{:ptlrpc:ptlrpc_main+3745}
        <ffffffff8012c8a9>{default_wake_function+0} <ffffffff8010bfc2>{child_rip+8}
        <ffffffff885c0b30>{:ptlrpc:ptlrpc_main+0} <ffffffff8010bfba>{child_rip+0}

Rebooting cleared up the problem.

This appears to be Lustre bug 16129. Can you confirm? Also, if this is
the issue, it appears that a patch is available: 16129 says the
patch is attached to 17748, but it wasn't clear to me which attachment
there is the patch.

Thanks,
Brian Stone
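
When all 512 threads have to be compared, it can help to reduce each call trace like the one in the message above to a list of (module, symbol, offset) frames. The following is a minimal sketch that assumes every frame has the `<address>{:module:symbol+offset}` shape seen in that trace (core kernel symbols carry no module prefix); `FRAME`, `parse_frames` and `calls` are hypothetical names, and the sample is a subset of the posted frames.

```python
import re

# One frame looks like <ffffffff887dae38>{:ost:ost_punch+1320}
# or, for core kernel symbols, <ffffffff8012c8a9>{default_wake_function+0}.
FRAME = re.compile(r"<([0-9a-f]{16})>\{(?::(\w+):)?(\w+)\+(\d+)\}")


def parse_frames(trace):
    """Return [(module, symbol, offset), ...] in the order printed."""
    return [
        (module or "kernel", symbol, int(offset))
        for _addr, module, symbol, offset in FRAME.findall(trace)
    ]


def calls(frames, symbol):
    """True if the given symbol appears anywhere in the parsed frames."""
    return any(name == symbol for _module, name, _offset in frames)


# A subset of the frames from the posted trace.
SAMPLE = """
Call Trace: <ffffffff885903fd>{:ptlrpc:ldlm_expired_completion_wait+173}
        <ffffffff88591dfd>{:ptlrpc:ldlm_completion_ast+845}
        <ffffffff88590a9b>{:ptlrpc:ldlm_cli_enqueue_local+1275}
        <ffffffff887dae38>{:ost:ost_punch+1320}
        <ffffffff8012c8a9>{default_wake_function+0}
"""

frames = parse_frames(SAMPLE)
for module, symbol, offset in frames:
    print(f"{module:8s} {symbol}+{offset}")
print("lock wait under ost_punch:",
      calls(frames, "ldlm_completion_ast") and calls(frames, "ost_punch"))
```

Because the pattern matches individual frames rather than whole lines, it works whether or not the mail client wrapped the trace.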


Brian J. Murrell

2009-Feb-11 15:37 UTC


On Wed, 2009-02-11 at 10:23 -0500, Brian Stone wrote:
> My customer believes that this scenario is leading to corrupted files.

What scenario?  Recovery does not cause any corruption.  At worst,
transactions from (some) clients will not get replayed and will be lost.
That's not generally what we consider "corruption" though.

> If so, is there any way to avoid the file corruption when an OSS goes down?

Can you be more specific about what "corruption" you are seeing?

Corruption by our definition is on-disk or on-the-wire data
being altered by something other than a legitimate write from a client.
Corruption is not, for example, data simply not making it from the
client to disk.

b.


Brian Stone

2009-Feb-11 15:46 UTC


Yes, I was using "corruption" to mean incomplete files. So, let me
rephrase: is there a way to avoid "incomplete files" after an OSS crash,
reboot, etc.? The Lustre devices were not deactivated on the clients
and MDS; would deactivating them possibly avoid the purge of data?

Thanks,
Brian Stone


Brian J. Murrell

2009-Feb-11 16:00 UTC


On Wed, 2009-02-11 at 10:46 -0500, Brian Stone wrote:
> Yes, I was using corruption to mean incomplete files.

Ahhh.  OK.  That can be an artifact of a failure to complete recovery.

> So, let me
> rephrase, is there a way to avoid "incomplete files" after an OSS crash,

Yes, make sure that all clients are available to reconnect and
participate in recovery.

> The lustre devices were not deactivated from the clients
> and MDS, would that possibly avoid the purge of data?

No.  You are free to reboot an OSS any time you wish.  All clients will
wait for it to come back up and recovery will proceed and complete, if
all clients are still present to participate.

Recovery is basically a process by which clients have an opportunity to
retry any recent (i.e. in-progress) transactions that may have gotten
"lost" when an OSS crashes or reboots.  Until the OSS confirms to the
client that it has written a given transaction to stable storage,
the client holds on to it in case it needs to be replayed.

In the event of a crash, all clients reconnect and offer their recent
transactions; however, this must (currently) be done serially, in the
same order they were done originally.  If a given client fails to
connect, the server cannot know that it did not have any transactions to
commit and must therefore discard all further transactions and abort
recovery, which can lead to this "data loss".

b.
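
To make the ordering constraint in the reply above concrete, here is a toy model of serial replay. It is a sketch under simplifying assumptions and not the Lustre code: transaction numbers are taken to be consecutive integers, `replay_in_order` is a hypothetical function, and the client names and numbers are made up for illustration.

```python
def replay_in_order(last_committed, offered):
    """Toy model of serial, in-order replay.

    last_committed -- highest transaction number already on stable storage
    offered        -- dict: client -> transaction numbers it offers to replay
                      (only clients that reconnected appear here)
    Returns (replayed, discarded) lists of transaction numbers.
    """
    owner = {t: c for c, transnos in offered.items() for t in transnos}
    replayed = []
    next_transno = last_committed + 1
    # Replay strictly in the original order, one transaction at a time.
    while next_transno in owner:
        replayed.append(next_transno)
        next_transno += 1
    # The first gap belongs to a client that never came back. Everything
    # after it cannot be applied safely, so it is thrown away.
    discarded = sorted(t for t in owner if t >= next_transno)
    return replayed, discarded


# Originally: A did 101 and 103, B did 102, C did 104 (all uncommitted).
everyone = {"A": [101, 103], "B": [102], "C": [104]}
without_b = {"A": [101, 103], "C": [104]}

print(replay_in_order(100, everyone))
print(replay_in_order(100, without_b))
```

With client B absent, the server replays 101, finds nobody offering 102, and has to drop 103 and 104 even though clients A and C did reconnect. That is how one missing client can leave other clients' files incomplete.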


Brian Stone

2009-Feb-11 16:53 UTC


So, in this case, it appears that not all clients completed recovery
and recovery timed out. I assume you have a short amount of time to
figure out who did not participate in recovery and get them to
reconnect. What's the best way to get clients that are not
participating in recovery to reconnect? What's the best way to identify
clients that are not participating in recovery?

Thanks,
Brian Stone


Brian J. Murrell

2009-Feb-11 17:07 UTC


On Wed, 2009-02-11 at 11:53 -0500, Brian Stone wrote:
> So, in this case, it appears that all clients did not complete recovery
> and recovery timed out.

During recovery, a progress message is printed to the log
periodically stating how many of the expected clients have reconnected.  I
believe when recovery is aborted it also shows how many clients managed
to reconnect.

> I assume you have a short amount of time to
> figure out who did not participate in recovery and get them to
> reconnect.

Yes.  This is calculated from obd_timeout.

> What's the best way to get clients to reconnect that are not
> participating in recovery?

Well, given a robust enough network (i.e. just reliable) it should just
happen.  There is no way to prod a client into reconnecting sooner or more
frequently than it would normally.  However, with a reliable network,
this should not be a problem.

> What's the best way to identify clients that
> are not participating in recovery?

Hrm.  I'm not sure that it's realistic to try to figure out who is
connected and who isn't with a goal to troubleshoot and fix problems
during recovery.  It just doesn't last long enough.

Really, barring any bugs, all you need to do to have successful
recoveries is to just have a network that allows the communication to
happen reliably enough.

b.

