Hendelman, Rob
2009-Apr-08 18:16 UTC
[Lustre-discuss] slow journal/commitrw on OSTs lead to crash
Hello,

I recently upgraded our 1.6.4.3 CentOS servers to 1.6.7. Clients are still at 1.6.4.3; we were going to upgrade them to 1.6.7 as well, but decided to wait until 1.6.7.1 comes out with the fix for the bug mentioned on the list late last week.

Our OSTs are almost completely full (85-90% on each) and our MDT is almost entirely empty. Our setup is as follows:

Server nodes
===========
Node 1 = MGS + MDT (separate mount points), on a RAID 10 of 146G 15k SAS disks, 1 gigabit Ethernet.
Node 4 = OSS with 10 OSTs connected to the disks via multipath FC. Each OST is a RAID 6 (done on the SAN side) of (I believe) 12 146G 15k disks.
Each server node has 20G of RAM.

Our clients run a 2.6.22 patchless kernel with Lustre 1.6.4.3 and have 32G of RAM each.

Each filesystem (MGS, MDT, OST) also has a separate device, in either RAID 1 or RAID 10, for the journal.

Last night around 6pm I started getting alerts that our Lustre OSS was unresponsive. Looking in the logs, I found the following:

=========================================================
Apr 7 18:07:53 maglustre04 last message repeated 2 times
Apr 7 18:08:48 maglustre04 kernel: Lustre: 5228:0:(lustre_fsfilt.h:320:fsfilt_commit_wait()) fs01-OST0005: slow journal start 31s
Apr 7 18:08:48 maglustre04 kernel: Lustre: 5228:0:(filter_io_26.c:792:filter_commitrw_write()) fs01-OST0005: slow commitrw commit 31s
Apr 7 18:08:48 maglustre04 kernel: Lustre: 5227:0:(lustre_fsfilt.h:320:fsfilt_commit_wait()) fs01-OST0005: slow journal start 30s
Apr 7 18:08:48 maglustre04 kernel: Lustre: 5218:0:(filter_io_26.c:792:filter_commitrw_write()) fs01-OST0005: slow commitrw commit 30s
Apr 7 18:08:48 maglustre04 kernel: Lustre: 5227:0:(lustre_fsfilt.h:320:fsfilt_commit_wait()) Skipped 1 previous similar message
Apr 7 18:10:15 maglustre04 kernel: Lustre: 6588:0:(ldlm_lib.c:541:target_handle_reconnect()) fs01-OST0006: dc9faf61-04e0-3f57-0cba-405fcac557f7 reconnecting
Apr 7 18:10:15 maglustre04 kernel: Lustre: 6588:0:(ldlm_lib.c:780:target_handle_connect()) fs01-OST0006: refuse reconnection from dc9faf61-04e0-3f57-0cba-405fcac557f7 at 10.5.200.13 at tcp to 0xffff810156556000; still busy with 2 active RPCs
Apr 7 18:10:15 maglustre04 kernel: LustreError: 6588:0:(ldlm_lib.c:1643:target_send_reply_msg()) @@@ processing error (-16) req at ffff81026ddfd800 x21081511/t0 o8->dc9faf61-04e0-3f57-0cba-405fcac557f7@:0/0 lens 304/200 e 0 to 0 dl 1239145915 ref 1 fl Interpret:/0/0 rc -16/0
Apr 7 18:10:15 maglustre04 kernel: LustreError: 6588:0:(ldlm_lib.c:1643:target_send_reply_msg()) Skipped 2 previous similar messages
Apr 7 18:10:23 maglustre04 kernel: LustreError: 5269:0:(filter_io_26.c:779:filter_commitrw_write()) fs01-OST0006: slow direct_io 107s
Apr 7 18:10:23 maglustre04 kernel: Lustre: 5269:0:(service.c:1317:ptlrpc_server_handle_request()) @@@ Request x21080837 took longer than estimated (100+8s); client may timeout. req at ffff810529d42400 x21080837/t4450358 o4->dc9faf61-04e0-3f57-0cba-405fcac557f7@:0/0 lens 384/352 e 0 to 0 dl 1239145815 ref 1 fl Complete:/0/0 rc 0/0
Apr 7 18:10:25 maglustre04 kernel: LustreError: 5270:0:(lustre_fsfilt.h:320:fsfilt_commit_wait()) fs01-OST0008: slow journal start 109s
Apr 7 18:10:25 maglustre04 kernel: LustreError: 5270:0:(filter_io_26.c:792:filter_commitrw_write()) fs01-OST0008: slow commitrw commit 109s
Apr 7 18:10:25 maglustre04 kernel: Lustre: 5270:0:(service.c:1317:ptlrpc_server_handle_request()) @@@ Request x99233655 took longer than estimated (100+10s); client may timeout. req at ffff8104dd112800 x99233655/t4401190 o4->19615549-595b-2174-a224-57b10ed10a9c@:0/0 lens 384/352 e 0 to 0 dl 1239145815 ref 1 fl Complete:/0/0 rc 0/0
Apr 7 18:10:26 maglustre04 kernel: LustreError: 5210:0:(filter_io_26.c:714:filter_commitrw_write()) fs01-OST0009: slow i_mutex 122s
Apr 7 18:10:26 maglustre04 kernel: LustreError: 5184:0:(filter_io_26.c:714:filter_commitrw_write()) fs01-OST0009: slow i_mutex 122s
Apr 7 18:10:26 maglustre04 kernel: Lustre: 5184:0:(service.c:1317:ptlrpc_server_handle_request()) @@@ Request x99233618 took longer than estimated (100+22s); client may timeout. req at ffff8104a65f5c50 x99233618/t4374309 o4->19615549-595b-2174-a224-57b10ed10a9c@:0/0 lens 384/352 e 0 to 0 dl 1239145804 ref 1 fl Complete:/0/0 rc 0/0
Apr 7 18:10:26 maglustre04 kernel: Lustre: 5184:0:(service.c:1317:ptlrpc_server_handle_request()) Skipped 1 previous similar message
Apr 7 18:10:26 maglustre04 kernel: LustreError: 5241:0:(filter_io_26.c:779:filter_commitrw_write()) fs01-OST0009: slow direct_io 122s
Apr 7 18:10:27 maglustre04 kernel: Lustre: 6341:0:(ldlm_lib.c:541:target_handle_reconnect()) fs01-OST0001: e293e1d4-51a4-a8f4-a7c0-c1aaab662a94 reconnecting
Apr 7 18:10:27 maglustre04 kernel: Lustre: 5139:0:(ldlm_lib.c:541:target_handle_reconnect()) fs01-OST0000: e293e1d4-51a4-a8f4-a7c0-c1aaab662a94 reconnecting
Apr 7 18:10:27 maglustre04 kernel: Lustre: 6398:0:(ldlm_lib.c:780:target_handle_connect()) fs01-OST0006: refuse reconnection from e293e1d4-51a4-a8f4-a7c0-c1aaab662a94 at 10.5.200.12 at tcp to 0xffff810525b80000; still busy with 9 active RPCs
Apr 7 18:10:27 maglustre04 kernel: Lustre: 6341:0:(ldlm_lib.c:541:target_handle_reconnect()) Skipped 6 previous similar messages
Apr 7 18:10:32 maglustre04 kernel: Lustre: 6439:0:(ldlm_lib.c:541:target_handle_reconnect()) fs01-OST0009: 5824cc81-63ce-7c87-cdd6-42422d0c4dcd reconnecting
Apr 7 18:10:32 maglustre04 kernel: Lustre: 6439:0:(ldlm_lib.c:541:target_handle_reconnect()) Skipped 6 previous similar messages
Apr 7 18:10:34 maglustre04 kernel: Lustre: 6477:0:(ldlm_lib.c:541:target_handle_reconnect()) fs01-OST0009: 19615549-595b-2174-a224-57b10ed10a9c reconnecting
Apr 7 18:10:34 maglustre04 kernel: Lustre: 6477:0:(ldlm_lib.c:541:target_handle_reconnect()) Skipped 1 previous similar message
Apr 7 18:10:37 maglustre04 kernel: Lustre: 6430:0:(ldlm_lib.c:541:target_handle_reconnect()) fs01-OST000a: 5824cc81-63ce-7c87-cdd6-42422d0c4dcd reconnecting
Apr 7 18:10:37 maglustre04 kernel: Lustre: 6430:0:(ldlm_lib.c:541:target_handle_reconnect()) Skipped 10 previous similar messages
Apr 7 18:10:41 maglustre04 kernel: Lustre: 6574:0:(ldlm_lib.c:541:target_handle_reconnect()) fs01-OST0002: c454c19b-6df7-3d89-33b6-e421b8135c62 reconnecting
Apr 7 18:10:41 maglustre04 kernel: Lustre: 6413:0:(ldlm_lib.c:541:target_handle_reconnect()) fs01-OST0001: c454c19b-6df7-3d89-33b6-e421b8135c62 reconnecting
Apr 7 18:10:41 maglustre04 kernel: Lustre: 6413:0:(ldlm_lib.c:541:target_handle_reconnect()) Skipped 2 previous similar messages
Apr 7 18:10:41 maglustre04 kernel: Lustre: 6574:0:(ldlm_lib.c:541:target_handle_reconnect()) Skipped 6 previous similar messages
Apr 7 18:10:51 maglustre04 kernel: Lustre: 6598:0:(ldlm_lib.c:541:target_handle_reconnect()) fs01-OST0005: e293e1d4-51a4-a8f4-a7c0-c1aaab662a94 reconnecting
Apr 7 18:10:51 maglustre04 kernel: Lustre: 6598:0:(ldlm_lib.c:541:target_handle_reconnect()) Skipped 2 previous similar messages
Apr 7 18:11:07 maglustre04 kernel: Lustre: 5027:0:(ldlm_lib.c:780:target_handle_connect()) fs01-OST0005: refuse reconnection from 19615549-595b-2174-a224-57b10ed10a9c at 10.5.200.15 at tcp to 0xffff8102a7270000; still busy with 2 active RPCs
Apr 7 18:11:16 maglustre04 kernel: Lustre: 5076:0:(ldlm_lib.c:541:target_handle_reconnect()) fs01-OST0006: e293e1d4-51a4-a8f4-a7c0-c1aaab662a94 reconnecting
Apr 7 18:11:16 maglustre04 kernel: Lustre: 5076:0:(ldlm_lib.c:541:target_handle_reconnect()) Skipped 17 previous similar messages
Apr 7 18:12:11 maglustre04 kernel: Lustre: 5180:0:(filter_io.c:551:filter_preprw_write()) fs01-OST0008: slow preprw_write setup 37s
Apr 7 18:12:45 maglustre04 kernel: Lustre: 5220:0:(filter_io_26.c:779:filter_commitrw_write()) fs01-OST0003: slow direct_io 33s
Apr 7 18:12:48 maglustre04 kernel: Lustre: 11010:0:(socklnd_cb.c:2728:ksocknal_check_peer_timeouts()) Stale ZC_REQs for peer 10.5.200.12 at tcp detected: 1; the oldest (ffff81051f1de000) timed out 9 secs ago
Apr 7 18:12:56 maglustre04 kernel: Lustre: 5063:0:(ldlm_lib.c:541:target_handle_reconnect()) fs01-OST0006: e293e1d4-51a4-a8f4-a7c0-c1aaab662a94 reconnecting
Apr 7 18:12:56 maglustre04 kernel: Lustre: 5063:0:(ldlm_lib.c:541:target_handle_reconnect()) Skipped 1 previous similar message
Apr 7 18:12:56 maglustre04 kernel: Lustre: 5063:0:(ldlm_lib.c:780:target_handle_connect()) fs01-OST0006: refuse reconnection from e293e1d4-51a4-a8f4-a7c0-c1aaab662a94 at 10.5.200.12 at tcp to 0xffff810525b80000; still busy with 2 active RPCs
Apr 7 18:12:56 maglustre04 kernel: LustreError: 5063:0:(ldlm_lib.c:1643:target_send_reply_msg()) @@@ processing error (-16) req at ffff8103dc39bc00 x243110586/t0 o8->e293e1d4-51a4-a8f4-a7c0-c1aaab662a94@:0/0 lens 304/200 e 0 to 0 dl 1239146076 ref 1 fl Interpret:/0/0 rc -16/0
Apr 7 18:12:56 maglustre04 kernel: LustreError: 5063:0:(ldlm_lib.c:1643:target_send_reply_msg()) Skipped 2 previous similar messages
Apr 7 18:12:56 maglustre04 kernel: LustreError: 5209:0:(ost_handler.c:899:ost_brw_read()) @@@ timeout on bulk PUT after 68+0s req at ffff8102e7357800 x243110512/t0 o3->e293e1d4-51a4-a8f4-a7c0-c1aaab662a94@:0/0 lens 384/336 e 0 to 0 dl 1239145976 ref 1 fl Interpret:/0/0 rc 0/0
Apr 7 18:12:57 maglustre04 kernel: Lustre: 5215:0:(filter_io_26.c:779:filter_commitrw_write()) fs01-OST0007: slow direct_io 39s
Apr 7 18:12:57 maglustre04 kernel: Lustre: 5209:0:(ost_handler.c:957:ost_brw_read()) fs01-OST0006: ignoring bulk IO comm error with e293e1d4-51a4-a8f4-a7c0-c1aaab662a94@ id 12345-10.5.200.12 at tcp - client will retry
Apr 7 18:12:57 maglustre04 kernel: Lustre: 5209:0:(service.c:1317:ptlrpc_server_handle_request()) @@@ Request x243110512 took longer than estimated (100+1s); client may timeout. req at ffff8102e7357800 x243110512/t0 o3->e293e1d4-51a4-a8f4-a7c0-c1aaab662a94@:0/0 lens 384/336 e 0 to 0 dl 1239145976 ref 1 fl Complete:/0/0 rc 0/0
Apr 7 18:12:57 maglustre04 kernel: Lustre: 5209:0:(service.c:1317:ptlrpc_server_handle_request()) Skipped 6 previous similar messages
=========================================================

Soon the OSS started refusing connections:

=========================================================
Apr 7 18:18:30 maglustre04 kernel: Lustre: 5206:0:(filter_io.c:551:filter_preprw_write()) fs01-OST0002: slow preprw_write setup 42s
Apr 7 18:18:34 maglustre04 kernel: Lustre: 5079:0:(ldlm_lib.c:780:target_handle_connect()) fs01-OST0008: refuse reconnection from e293e1d4-51a4-a8f4-a7c0-c1aaab662a94 at 10.5.200.12 at tcp to 0xffff810337f8a000; still busy with 2 active RPCs
Apr 7 18:18:34 maglustre04 kernel: LustreError: 5079:0:(ldlm_lib.c:1643:target_send_reply_msg()) @@@ processing error (-16) req at ffff81052b724400 x243111082/t0 o8->e293e1d4-51a4-a8f4-a7c0-c1aaab662a94@:0/0 lens 304/200 e 0 to 0 dl 1239146414 ref 1 fl Interpret:/0/0 rc -16/0
Apr 7 18:18:34 maglustre04 kernel: LustreError: 5079:0:(ldlm_lib.c:1643:target_send_reply_msg()) Skipped 1 previous similar message
Apr 7 18:18:38 maglustre04 kernel: LustreError: 11000:0:(events.c:66:request_out_callback()) @@@ type 4, status -5 req at ffff8104e3853400 x208570/t0 o400->MGS at MGC10.5.10.11@tcp_0:26/25 lens 128/256 e 0 to 1 dl 1239146349 ref 2 fl Rpc:N/0/0 rc 0/0
Apr 7 18:18:38 maglustre04 kernel: LustreError: 11000:0:(events.c:66:request_out_callback()) @@@ type 4, status -5 req at ffff81051e07fc00 x208571/t0 o400->MGS at MGC10.5.10.11@tcp_0:26/25 lens 128/256 e 0 to 1 dl 1239146374 ref 2 fl Rpc:N/0/0 rc 0/0
Apr 7 18:18:38 maglustre04 kernel: Lustre: Request x208570 sent from MGC10.5.10.11 at tcp to NID 10.5.10.11 at tcp 69s ago has timed out (limit 100s).
Apr 7 18:18:38 maglustre04 kernel: LustreError: 166-1: MGC10.5.10.11 at tcp: Connection to service MGS via nid 10.5.10.11 at tcp was lost; in progress operations using this service will fail.
Apr 7 18:18:38 maglustre04 kernel: Lustre: MGC10.5.10.11 at tcp: Reactivating import
Apr 7 18:18:38 maglustre04 kernel: Lustre: MGC10.5.10.11 at tcp: Connection restored to service MGS using nid 10.5.10.11 at tcp.
Apr 7 18:18:45 maglustre04 kernel: Lustre: fs01-OST0000: received MDS connection from 10.5.10.11 at tcp
Apr 7 18:18:45 maglustre04 kernel: Lustre: Skipped 10 previous similar messages
Apr 7 18:18:45 maglustre04 kernel: Lustre: 6448:0:(filter.c:2849:filter_destroy_precreated()) fs01-OST0000: deleting orphan objects from 184495 to 184526
Apr 7 18:18:45 maglustre04 kernel: Lustre: 6448:0:(filter.c:2849:filter_destroy_precreated()) Skipped 8 previous similar messages
Apr 7 18:18:53 maglustre04 kernel: LustreError: 11004:0:(events.c:367:server_bulk_callback()) event type 4, status -5, desc ffff8104f2272000
Apr 7 18:18:53 maglustre04 kernel: LustreError: 11004:0:(events.c:367:server_bulk_callback()) event type 4, status -5, desc ffff81051f681000
Apr 7 18:18:56 maglustre04 kernel: LustreError: 5226:0:(ost_handler.c:1131:ost_brw_write()) @@@ timeout on bulk GET after 46+0s req at ffff8104e516d800 x99305448/t0 o4->19615549-595b-2174-a224-57b10ed10a9c@:0/0 lens 384/352 e 0 to 0 dl 1239146336 ref 1 fl Interpret:/0/0 rc 0/0
Apr 7 18:18:56 maglustre04 kernel: Lustre: 5226:0:(ost_handler.c:1277:ost_brw_write()) fs01-OST0005: ignoring bulk IO comm error with 19615549-595b-2174-a224-57b10ed10a9c@ id 12345-10.5.200.15 at tcp - client will retry
Apr 7 18:19:01 maglustre04 kernel: Lustre: 6350:0:(ldlm_lib.c:780:target_handle_connect()) fs01-OST0008: refuse reconnection from e293e1d4-51a4-a8f4-a7c0-c1aaab662a94 at 10.5.200.12 at tcp to 0xffff810337f8a000; still busy with 2 active RPCs
Apr 7 18:19:13 maglustre04 last message repeated 2 times
===========================================================

Considering we have a _very_ small installation compared to almost everyone else on the list (2 servers and 5 client nodes), I am not sure exactly where to start looking. I did find a thread from December 2008 with similar error messages, but that one seemed confined to a single OST; this hit all of our OSTs one after the other. We are using the default striping, so each file lives on a single OST. The storage group reported no errors on the FC switch or the SAN disks, and the networking group reported no errors on the gigabit Ethernet switch, so I don't think this is a networking problem. This is the only issue we've had in approximately 3 weeks of regular use since upgrading the servers to 1.6.7. Any suggestions are appreciated.

Also: if anyone over there at Sun has a contact in sales I can call about a commercial support contract, that is something we are interested in looking at as well.

Thanks,

Robert Hendelman Jr
Magnetar Capital LLC
Rob.Hendelman at magnetar.com
1-847-905-4557
Hendelman, Rob
2009-Apr-08 18:24 UTC
[Lustre-discuss] slow journal/commitrw on OSTs lead to crash
I forgot to mention that about an hour after this started, we got the following:

===============
Apr 7 18:49:06 maglustre04 kernel: Lustre: 6483:0:(ldlm_lib.c:541:target_handle_reconnect()) fs01-OST0000: fs01-mdtlov_UUID reconnecting
Apr 7 18:49:06 maglustre04 kernel: Lustre: 6483:0:(ldlm_lib.c:541:target_handle_reconnect()) Skipped 111 previous similar messages
Apr 7 18:49:26 maglustre04 kernel: Lustre: 5075:0:(service.c:1317:ptlrpc_server_handle_request()) @@@ Request x1039597 took longer than estimated (50+75s); client may timeout. req at ffff8104bb35e400 x1039597/t0 o5->fs01-mdtlov_UUID@:0/0 lens 336/336 e 0 to 0 dl 1239148091 ref 1 fl Complete:/0/0 rc 0/0
Apr 7 18:51:58 maglustre04 kernel: Lustre: 5075:0:(service.c:1317:ptlrpc_server_handle_request()) Skipped 12 previous similar messages
Apr 7 18:51:58 maglustre04 kernel: Lustre: fs01-OST0006: received MDS connection from 10.5.10.11 at tcp
Apr 7 18:51:58 maglustre04 kernel: Lustre: Skipped 10 previous similar messages
Apr 7 18:51:58 maglustre04 kernel: LustreError: 5146:0:(lustre_fsfilt.h:229:fsfilt_start_log()) fs01-OST0008: slow journal start 135s
Apr 7 18:51:58 maglustre04 kernel: Lustre: 462:0:(watchdog.c:148:lcw_cb()) Watchdog triggered for pid 5024: it was inactive for 200s
Apr 7 18:51:58 maglustre04 kernel: Lustre: 462:0:(linux-debug.c:222:libcfs_debug_dumpstack()) showing stack for process 5024
Apr 7 18:51:58 maglustre04 kernel: ll_ost_02 D ffff8102687a20c0 0 5024 1 5025 5023 (L-TLB)
Apr 7 18:51:58 maglustre04 kernel: ffff81027a69f5a8 0000000000000046 0000000000000000 00000000ffffffff
Apr 7 18:51:58 maglustre04 kernel: 0000000000000000 0000000000000001 ffff8102687a20c0 ffff8105272907a0
Apr 7 18:51:58 maglustre04 kernel: 0004a4ba546cd448 0000000000009669 ffff8102687a22a8 000000008882434d
Apr 7 18:51:58 maglustre04 kernel: Call Trace:
Apr 7 18:51:58 maglustre04 kernel: [<ffffffff80046dac>] sprintf+0x51/0x59
Apr 7 18:51:58 maglustre04 kernel: [<ffffffff80063be0>] __mutex_lock_slowpath+0x60/0x9b
Apr 7 18:51:58 maglustre04 kernel: [<ffffffff80063c20>] .text.lock.mutex+0x5/0x14
Apr 7 18:51:58 maglustre04 kernel: [<ffffffff887fe756>] :obdfilter:filter_parent_lock+0x36/0x220
Apr 7 18:51:58 maglustre04 kernel: [<ffffffff8009ddc3>] autoremove_wake_function+0x9/0x2e
Apr 7 18:51:58 maglustre04 kernel: [<ffffffff800891f6>] __wake_up_common+0x3e/0x68
Apr 7 18:51:58 maglustre04 kernel: [<ffffffff888026cf>] :obdfilter:filter_fid2dentry+0x2cf/0x730
Apr 7 18:51:58 maglustre04 kernel: [<ffffffff88652b72>] :lquota:filter_quota_adjust+0x172/0x2a0
Apr 7 18:51:58 maglustre04 kernel: [<ffffffff8000d046>] dput+0x23/0x10a
Apr 7 18:51:58 maglustre04 kernel: [<ffffffff88809fe8>] :obdfilter:filter_destroy+0x148/0x1dd0
Apr 7 18:51:58 maglustre04 kernel: [<ffffffff88546520>] :ptlrpc:ldlm_blocking_ast+0x0/0x2a0
Apr 7 18:51:58 maglustre04 kernel: [<ffffffff88549db0>] :ptlrpc:ldlm_completion_ast+0x0/0x7c0
Apr 7 18:51:58 maglustre04 kernel: [<ffffffff800646b6>] __down_failed_trylock+0x35/0x3a
Apr 7 18:51:58 maglustre04 kernel: [<ffffffff8880c7ce>] :obdfilter:filter_create+0xb5e/0x1530
Apr 7 18:51:58 maglustre04 kernel: [<ffffffff8856f004>] :ptlrpc:lustre_msg_set_timeout+0x34/0x110
Apr 7 18:51:58 maglustre04 kernel: [<ffffffff88572f09>] :ptlrpc:lustre_pack_reply+0x29/0xb0
Apr 7 18:51:58 maglustre04 kernel: [<ffffffff887cd01f>] :ost:ost_handle+0x136f/0x5cd0
Apr 7 18:51:58 maglustre04 kernel: [<ffffffff80143809>] __next_cpu+0x19/0x28
Apr 7 18:51:58 maglustre04 kernel: [<ffffffff80143809>] __next_cpu+0x19/0x28
Apr 7 18:51:58 maglustre04 kernel: [<ffffffff800898e3>] find_busiest_group+0x20d/0x621
Apr 7 18:51:58 maglustre04 kernel: [<ffffffff8856d5a5>] :ptlrpc:lustre_msg_get_conn_cnt+0x35/0xf0
Apr 7 18:51:58 maglustre04 kernel: [<ffffffff88575cfa>] :ptlrpc:ptlrpc_server_request_get+0x6a/0x150
Apr 7 18:51:58 maglustre04 kernel: [<ffffffff88577b7d>] :ptlrpc:ptlrpc_check_req+0x1d/0x110
Apr 7 18:51:58 maglustre04 kernel: [<ffffffff8857a103>] :ptlrpc:ptlrpc_server_handle_request+0xa93/0x1150
Apr 7 18:51:58 maglustre04 kernel: [<ffffffff80062f4b>] thread_return+0x0/0xdf
Apr 7 18:51:58 maglustre04 kernel: [<ffffffff8006d8a2>] do_gettimeofday+0x40/0x8f
Apr 7 18:51:58 maglustre04 kernel: [<ffffffff8841d7c6>] :libcfs:lcw_update_time+0x16/0x100
Apr 7 18:51:58 maglustre04 kernel: [<ffffffff800891f6>] __wake_up_common+0x3e/0x68
Apr 7 18:51:58 maglustre04 kernel: [<ffffffff8857d5f8>] :ptlrpc:ptlrpc_main+0x1218/0x13e0
Apr 7 18:51:58 maglustre04 kernel: [<ffffffff8008abb9>] default_wake_function+0x0/0xe
Apr 7 18:51:58 maglustre04 kernel: [<ffffffff800b4382>] audit_syscall_exit+0x31b/0x336
Apr 7 18:51:58 maglustre04 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
Apr 7 18:51:58 maglustre04 kernel: [<ffffffff8857c3e0>] :ptlrpc:ptlrpc_main+0x0/0x13e0
Apr 7 18:51:58 maglustre04 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11
Apr 7 18:51:58 maglustre04 kernel:
=================

Are these due to watchdog timeouts? Eventually (just after 8pm or so), while we were still looking at the logs, the box OOMed and started killing processes. At that point we rebooted, ran e2fsck, and remounted, and things have been OK since. The clients went through their recovery and came back functional.

Robert Hendelman Jr
Magnetar Capital LLC
Rob.Hendelman at magnetar.com
1-847-905-4557
Brian J. Murrell
2009-Apr-08 18:49 UTC
[Lustre-discuss] slow journal/commitrw on OSTs lead to crash
On Wed, 2009-04-08 at 13:16 -0500, Hendelman, Rob wrote:
> Hello,

Hi,

> I recently upgraded our 1.6.4.3 centos servers to 1.6.7. Clients are
> still at 1.6.4.3. We were going to upgrade to 1.6.7 but decided to wait
> until 1.6.7.1 comes out with the fix for the bug mentioned on the list
> late last week.

The corruption bug for which 1.6.7.1 will be released affects the MDT, not the clients, so you still have some exposure there even without upgrading the clients.

> Apr 7 18:08:48 maglustre04 kernel: Lustre:
> 5228:0:(lustre_fsfilt.h:320:fsfilt_commit_wait()) fs01-OST0005: slow
> journal start 31s

This basically means the storage devices are taking exceptionally long to process requests. One common cause is over-subscription of OST threads. Did you run obdfilter-survey on your disk subsystem before installing Lustre on it? If so, it should tell you at what thread count you saturate your storage, and you want to adjust your OST thread count accordingly. If you didn't run obdfilter-survey, you will have to do some trial and error (i.e. a binary search) on OST thread counts to find a happy place where you are no longer over-subscribed but are also not giving up performance.

Of course, the right OST thread count is relative to storage backend performance. If that performance suffers, a thread count that was once sufficient could now be over-subscribed.

> Also: If anyone over there @ Sun has a contact of someone in sales I can
> call for a commercial support contract, this is something we are
> interested in looking at as well.

http://www.sun.com/software/products/lustre/support.xml

Let me know if that doesn't help you out in that endeavour.

b.
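[Editor's sketch of how such a cap is typically applied on a 1.6-era OSS, assuming your release's ost module honours the oss_num_threads parameter (older releases used ost_num_threads instead); the value 128 is purely illustrative and should come from obdfilter-survey or trial and error:]

  # in /etc/modprobe.conf on the OSS
  options ost oss_num_threads=128

The new count only takes effect after the Lustre modules are reloaded (or the node rebooted) and the OSTs remounted. The running service threads appear as ll_ost* kernel threads, so something like "ps ax | grep ll_ost | wc -l" gives a rough check of how many were actually started.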
Daniel Leaberry
2009-Apr-08 19:51 UTC
[Lustre-discuss] slow journal/commitrw on OSTs lead to crash
Hendelman, Rob wrote:
> Hello,

<snip>

> Our OSTs are almost completely full (85-90% on each) and our MDT is
> almost entirely empty.

I ran into a similar scenario with Lustre when I hit 80% full as well: the exact same problem with journal commits and disks seemingly unusable. iostat on the disks (a DDN 9500 array) showed huge numbers of small reads, almost as if the disks were being scanned. I eventually decided it must be something in the filesystem code that changes how it allocates free blocks when the disk is ~80% full. I have no confirmation of this other than having been through exactly what you're seeing; we now consider our 90TB of Lustre full when we hit 77% usage.

Daniel Leaberry
Senior Systems Administrator
Hendelman, Rob
2009-Apr-08 20:14 UTC
[Lustre-discuss] slow journal/commitrw on OSTs lead to crash
I did not run obdfilter-survey on the disk subsystem, but I am trying it now on a new filesystem we are building (to relieve space pressure on the previous/crashing system).

Following the directions in the manual for running obdfilter-survey against local disks, I have set up Lustre with an MGS/MDS/OST. There is no data on this new filesystem yet. Running the command from page 18-6 of the manual (section 18.2.2.1):

  nobjhi=2 thrhi=2 size=1024 case=disk sh obdfilter-survey

gives me the following:

  Wed Apr 8 15:11:29 CDT 2009 Obdfilter-survey for case=disk from maglustre03
  ost 1 sz 1048576K rsz 1024 obj 1 thr 1 ERROR: 1 != 0
  create: 1 objects
  error: create: #1 - No such device
  created object #s on localhost:fs02-OST0000_ecc not contiguous

The MGS/MDT are on another box. Do they have to be on the same box as the OST?

Robert Hendelman Jr
Magnetar Capital LLC
Rob.Hendelman at magnetar.com
1-847-905-4557
Brian J. Murrell
2009-Apr-09 14:09 UTC
[Lustre-discuss] slow journal/commitrw on OSTs lead to crash
On Wed, 2009-04-08 at 13:51 -0600, Daniel Leaberry wrote:
> I ran into a similar scenario with lustre when I hit 80% full as well.
> Exact same problem with journal commits and disks seemingly unusable.
> iostat on the disks (DDN 9500 array) shows huge numbers of small reads.
> Almost like the disk is being scanned.

Interesting. It's probably worth mentioning the historical (or at least anecdotal) warning that ext2/3 filesystems become less efficient as they fill up. That is not surprising: a filesystem falls squarely into the category of resources that become less efficient as they fill, due to the overhead of finding suitable allocations. I don't know enough about the internals of ext2/3/4 to comment directly on the design decisions that led to this, or on how much of the problem still exists in ext4. Perhaps one of our ext3/4 experts can comment. It would certainly make for an interesting benchmarking test.

b.
Brian J. Murrell
2009-Apr-09 14:14 UTC
[Lustre-discuss] slow journal/commitrw on OSTs lead to crash
On Wed, 2009-04-08 at 15:14 -0500, Hendelman, Rob wrote:
> Wed Apr 8 15:11:29 CDT 2009 Obdfilter-survey for case=disk from
> maglustre03
> ost 1 sz 1048576K rsz 1024 obj 1 thr 1 ERROR: 1 != 0
> create: 1 objects
> error: create: #1 - No such device
> created object #s on localhost:fs02-OST0000_ecc not contiguous

If you are using 1.6.7, this looks like bug 18478. :-(

b.
Peter Grandi
2009-Apr-12 15:41 UTC
[Lustre-discuss] slow journal/commitrw on OSTs lead to crash
[ ... ]

>> I ran into a similar scenario with lustre when I hit 80% full as
>> well. Exact same problem with journal commits and disks seemingly
>> unusable. iostat on the disks (DDN 9500 array) shows huge numbers of
>> small reads. Almost like the disk is being scanned.

This reminds me of a different issue:

http://lists.lustre.org/pipermail/lustre-discuss/2008-November/009124.html

"We have experienced all these errors when we have a big job that is writing many small chunks. When the writes are ... say 80 bytes and the block size is 4k bytes, the back end storage can slow down with read block, modify block, write block, to such an extent as to cause the slow commitrw and slow journal messages very similar to yours."

Those reads can be part of RMW cycles for sub-stripe writes, which are fairly common on journals, for example.

http://manual.lustre.org/manual/LustreManual16_HTML/RAID.html#50548852_pgfId-1288359

"Do not put an ext3 journal onto RAID5. As the journal is written linearly and synchronously, in most cases writes do not fill whole stripes and RAID5 has to read parities."

"Ideally, the RAID configuration should allow the 1 MB Lustre RPCs to evenly fit only one RAID stripe without requiring an expensive read-modify-write cycle."

As usual, I would be wary of using RAID5 or RAID6 as an OST, as RAID10 is nearly always so much nicer. On the other hand, a RAID5 or RAID6 with a stripe size of 4KiB might be vaguely tolerable (as Linux on x86 variants uses 4KiB as the block size anyhow).

> [ ... ] as ext2/3 filesystems get full, they become less efficient.
> That's not surprising as a filesystem certainly can fall into the
> category of resources that become less efficient as they become more
> full due to the overhead of finding suitable allocations. [ ... ]

This seems to indicate that finding space becomes slower. That is a minor issue; there are two bigger ones:

* A fundamental issue is that nearby or contiguous stretches of blocks become scarce, as free blocks tend to be widely scattered. This is independent of filesystem design (except for designs with a compacting collector).

* A more incidental one is that since most filesystems tend to allocate free blocks sequentially starting from the beginning, the last blocks to remain free tend to be on the inner cylinders. That is, the odds are that in a 50% full filesystem the allocated half sits mostly on the outer cylinders and the free blocks sit mostly on the inner cylinders, which can be much slower.

There are other potential issues: ext3 and ext4, for example, use relatively small allocation groups and try to keep some space free in every group, which works well when there is lots of free space; also, when there is little free space, it becomes much harder to find parity-RAID-aligned free space, which means that parity-RAID-based OSTs end up doing many more RMW cycles when writing.

Overall, people who like *sustainable* performance have a big problem: performance numbers drive sales, and the numbers that matter in practice are the "peak" numbers obtained from a fresh install and a quick benchmark run. So in the storage development community there is a lot of focus on performance as in a freshly loaded, nearly empty filesystem on the outer tracks of a storage system doing IO in large aligned blocks.
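[Editor's note: to make the "evenly fit one RAID stripe" arithmetic concrete, compare two hypothetical layouts; the disk counts and chunk sizes below are illustrative only, not the original poster's SAN geometry.]

  8+2 RAID6 with a 128 KiB chunk: stripe width = 8 x 128 KiB = 1 MiB.
  A 1 MiB Lustre RPC covers exactly one full stripe, so the array can
  compute parity from the new data alone and no read-modify-write is needed.

  10+2 RAID6 with a 64 KiB chunk: stripe width = 10 x 64 KiB = 640 KiB.
  A 1 MiB RPC covers one full stripe plus 384 KiB of the next, and that
  partial stripe forces the array to read the old data and parity before
  it can write (RMW).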
Andreas Dilger
2009-Apr-14 04:33 UTC
[Lustre-discuss] slow journal/commitrw on OSTs lead to crash
On Apr 12, 2009 16:41 +0100, Peter Grandi wrote:
> "Ideally, the RAID configuration should allow the 1 MB Lustre RPCs
> to evenly fit only one RAID stripe without requiring an expensive
> read-modify-write cycle."
>
> As usual, I would be wary of using RAID5 or RAID6 as an OST, as RAID10
> is nearly always so much nicer. On the other hand a RAID5 or RAID6 with
> a stripe size of 4KiB might be vaguely tolerable (as Linux on x86
> variants uses 4KiB as block size anyhow).

Actually, the ldiskfs allocator (mballoc, also used in ext4) is optimized for RAID configurations. It allocates RAID-stripe-sized and -aligned chunks where possible, specifically to avoid read-modify-write and also to avoid fragmenting free space. In the absence of other information, mballoc assumes a 1MB RAID stripe width, but modern Lustre + e2fsprogs allows storing the actual RAID stripe width in the superblock via tune2fs. It is also possible to have the clients match their RPC size and alignment to the RAID stripe size/alignment, though it isn't yet automatic to pass the optimal RPC size from the underlying RAID all the way to the client.

> This seems to indicate that finding space becomes slower. That is a
> minor issue; there are two bigger ones:
>
> * A fundamental issue is that nearby or contiguous stretches of blocks
> become scarce, as free blocks tend to be widely scattered. This is
> independent of filesystem design (except for designs with a compacting
> collector).

This is true of almost all filesystems. It means the best possible avoidance is to allocate blocks contiguously in the first place, so that when a file is freed it also leaves behind contiguous free blocks.

> There are other potential issues: ext3 and ext4, for example, use
> relatively small allocation groups and try to keep some space free in
> every group, which works well when there is lots of free space; also,
> when there is little free space, it becomes much harder to find
> parity-RAID-aligned free space, which means that parity-RAID-based
> OSTs end up doing many more RMW cycles when writing.

While it is true that ext3 and ext4 have relatively small allocation groups (128MB), the allocators are totally different, and as a result the performance of ext4 (which is largely based on Lustre's ldiskfs) is much better than ext3's.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
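[Editor's sketch of recording the RAID geometry in the superblock, for a hypothetical 8-data-disk RAID6 with 128 KiB chunks and 4 KiB filesystem blocks; the stride/stripe-width extended options shown are those of recent e2fsprogs (note the underscore vs. hyphen spelling differs between tune2fs and mke2fs), and /dev/mapper/ost0 is a placeholder device, so check your own man pages before running anything like this:]

  # stride       = chunk size / block size  = 128 KiB / 4 KiB = 32 blocks
  # stripe width = data disks x stride      = 8 x 32          = 256 blocks
  tune2fs -E stride=32,stripe_width=256 /dev/mapper/ost0

  # equivalently at format time (e.g. passed via mkfs.lustre --mkfsoptions):
  mke2fs -E stride=32,stripe-width=256 /dev/mapper/ost0

With the stripe width stored, mballoc can align its allocations to the real RAID geometry instead of falling back to the 1MB default mentioned above.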