Hendelman, Rob
2009-Apr-08 18:16 UTC
[Lustre-discuss] slow journal/commitrw on OSTs lead to crash
Hello,

I recently upgraded our 1.6.4.3 CentOS servers to 1.6.7. Clients are still at 1.6.4.3; we were going to upgrade them to 1.6.7 as well, but decided to wait until 1.6.7.1 comes out with the fix for the bug mentioned on the list late last week.

Our OSTs are almost completely full (85-90% on each) and our MDT is almost entirely empty. Our setup is as follows:

Server nodes
===========
Node 1 = MGS + MDT (separate mount points), on a RAID 10 of 146G 15k SAS disks, 1 gigabit Ethernet.
Node 4 = OSS with 10 OSTs connected to the disks via multipath FC. Each OST is a RAID 6 (done on the SAN side) of (I believe) 12 146G 15k disks.
Each server node has 20G of RAM.

Our clients run a 2.6.22 patchless kernel with Lustre 1.6.4.3 and have 32G of RAM each.

Each filesystem (MGS, MDT, OST) also has a separate device, in either RAID 1 or RAID 10, for the journal.

Last night around 6pm I started getting alerts that our Lustre OSS was unresponsive. Looking in the logs, I found the following:

=========================================================
Apr 7 18:07:53 maglustre04 last message repeated 2 times
Apr 7 18:08:48 maglustre04 kernel: Lustre: 5228:0:(lustre_fsfilt.h:320:fsfilt_commit_wait()) fs01-OST0005: slow journal start 31s
Apr 7 18:08:48 maglustre04 kernel: Lustre: 5228:0:(filter_io_26.c:792:filter_commitrw_write()) fs01-OST0005: slow commitrw commit 31s
Apr 7 18:08:48 maglustre04 kernel: Lustre: 5227:0:(lustre_fsfilt.h:320:fsfilt_commit_wait()) fs01-OST0005: slow journal start 30s
Apr 7 18:08:48 maglustre04 kernel: Lustre: 5218:0:(filter_io_26.c:792:filter_commitrw_write()) fs01-OST0005: slow commitrw commit 30s
Apr 7 18:08:48 maglustre04 kernel: Lustre: 5227:0:(lustre_fsfilt.h:320:fsfilt_commit_wait()) Skipped 1 previous similar message
Apr 7 18:10:15 maglustre04 kernel: Lustre: 6588:0:(ldlm_lib.c:541:target_handle_reconnect()) fs01-OST0006: dc9faf61-04e0-3f57-0cba-405fcac557f7 reconnecting
Apr 7 18:10:15 maglustre04 kernel: Lustre: 6588:0:(ldlm_lib.c:780:target_handle_connect()) fs01-OST0006: refuse reconnection from dc9faf61-04e0-3f57-0cba-405fcac557f7 at 10.5.200.13 at tcp to 0xffff810156556000; still busy with 2 active RPCs
Apr 7 18:10:15 maglustre04 kernel: LustreError: 6588:0:(ldlm_lib.c:1643:target_send_reply_msg()) @@@ processing error (-16) req at ffff81026ddfd800 x21081511/t0 o8->dc9faf61-04e0-3f57-0cba-405fcac557f7@:0/0 lens 304/200 e 0 to 0 dl 1239145915 ref 1 fl Interpret:/0/0 rc -16/0
Apr 7 18:10:15 maglustre04 kernel: LustreError: 6588:0:(ldlm_lib.c:1643:target_send_reply_msg()) Skipped 2 previous similar messages
Apr 7 18:10:23 maglustre04 kernel: LustreError: 5269:0:(filter_io_26.c:779:filter_commitrw_write()) fs01-OST0006: slow direct_io 107s
Apr 7 18:10:23 maglustre04 kernel: Lustre: 5269:0:(service.c:1317:ptlrpc_server_handle_request()) @@@ Request x21080837 took longer than estimated (100+8s); client may timeout. req at ffff810529d42400 x21080837/t4450358 o4->dc9faf61-04e0-3f57-0cba-405fcac557f7@:0/0 lens 384/352 e 0 to 0 dl 1239145815 ref 1 fl Complete:/0/0 rc 0/0
Apr 7 18:10:25 maglustre04 kernel: LustreError: 5270:0:(lustre_fsfilt.h:320:fsfilt_commit_wait()) fs01-OST0008: slow journal start 109s
Apr 7 18:10:25 maglustre04 kernel: LustreError: 5270:0:(filter_io_26.c:792:filter_commitrw_write()) fs01-OST0008: slow commitrw commit 109s
Apr 7 18:10:25 maglustre04 kernel: Lustre: 5270:0:(service.c:1317:ptlrpc_server_handle_request()) @@@ Request x99233655 took longer than estimated (100+10s); client may timeout. req at ffff8104dd112800 x99233655/t4401190 o4->19615549-595b-2174-a224-57b10ed10a9c@:0/0 lens 384/352 e 0 to 0 dl 1239145815 ref 1 fl Complete:/0/0 rc 0/0
Apr 7 18:10:26 maglustre04 kernel: LustreError: 5210:0:(filter_io_26.c:714:filter_commitrw_write()) fs01-OST0009: slow i_mutex 122s
Apr 7 18:10:26 maglustre04 kernel: LustreError: 5184:0:(filter_io_26.c:714:filter_commitrw_write()) fs01-OST0009: slow i_mutex 122s
Apr 7 18:10:26 maglustre04 kernel: Lustre: 5184:0:(service.c:1317:ptlrpc_server_handle_request()) @@@ Request x99233618 took longer than estimated (100+22s); client may timeout. req at ffff8104a65f5c50 x99233618/t4374309 o4->19615549-595b-2174-a224-57b10ed10a9c@:0/0 lens 384/352 e 0 to 0 dl 1239145804 ref 1 fl Complete:/0/0 rc 0/0
Apr 7 18:10:26 maglustre04 kernel: Lustre: 5184:0:(service.c:1317:ptlrpc_server_handle_request()) Skipped 1 previous similar message
Apr 7 18:10:26 maglustre04 kernel: LustreError: 5241:0:(filter_io_26.c:779:filter_commitrw_write()) fs01-OST0009: slow direct_io 122s
Apr 7 18:10:27 maglustre04 kernel: Lustre: 6341:0:(ldlm_lib.c:541:target_handle_reconnect()) fs01-OST0001: e293e1d4-51a4-a8f4-a7c0-c1aaab662a94 reconnecting
Apr 7 18:10:27 maglustre04 kernel: Lustre: 5139:0:(ldlm_lib.c:541:target_handle_reconnect()) fs01-OST0000: e293e1d4-51a4-a8f4-a7c0-c1aaab662a94 reconnecting
Apr 7 18:10:27 maglustre04 kernel: Lustre: 6398:0:(ldlm_lib.c:780:target_handle_connect()) fs01-OST0006: refuse reconnection from e293e1d4-51a4-a8f4-a7c0-c1aaab662a94 at 10.5.200.12 at tcp to 0xffff810525b80000; still busy with 9 active RPCs
Apr 7 18:10:27 maglustre04 kernel: Lustre: 6341:0:(ldlm_lib.c:541:target_handle_reconnect()) Skipped 6 previous similar messages
Apr 7 18:10:32 maglustre04 kernel: Lustre: 6439:0:(ldlm_lib.c:541:target_handle_reconnect()) fs01-OST0009: 5824cc81-63ce-7c87-cdd6-42422d0c4dcd reconnecting
Apr 7 18:10:32 maglustre04 kernel: Lustre: 6439:0:(ldlm_lib.c:541:target_handle_reconnect()) Skipped 6 previous similar messages
Apr 7 18:10:34 maglustre04 kernel: Lustre: 6477:0:(ldlm_lib.c:541:target_handle_reconnect()) fs01-OST0009: 19615549-595b-2174-a224-57b10ed10a9c reconnecting
Apr 7 18:10:34 maglustre04 kernel: Lustre: 6477:0:(ldlm_lib.c:541:target_handle_reconnect()) Skipped 1 previous similar message
Apr 7 18:10:37 maglustre04 kernel: Lustre: 6430:0:(ldlm_lib.c:541:target_handle_reconnect()) fs01-OST000a: 5824cc81-63ce-7c87-cdd6-42422d0c4dcd reconnecting
Apr 7 18:10:37 maglustre04 kernel: Lustre: 6430:0:(ldlm_lib.c:541:target_handle_reconnect()) Skipped 10 previous similar messages
Apr 7 18:10:41 maglustre04 kernel: Lustre: 6574:0:(ldlm_lib.c:541:target_handle_reconnect()) fs01-OST0002: c454c19b-6df7-3d89-33b6-e421b8135c62 reconnecting
Apr 7 18:10:41 maglustre04 kernel: Lustre: 6413:0:(ldlm_lib.c:541:target_handle_reconnect()) fs01-OST0001: c454c19b-6df7-3d89-33b6-e421b8135c62 reconnecting
Apr 7 18:10:41 maglustre04 kernel: Lustre: 6413:0:(ldlm_lib.c:541:target_handle_reconnect()) Skipped 2 previous similar messages
Apr 7 18:10:41 maglustre04 kernel: Lustre: 6574:0:(ldlm_lib.c:541:target_handle_reconnect()) Skipped 6 previous similar messages
Apr 7 18:10:51 maglustre04 kernel: Lustre: 6598:0:(ldlm_lib.c:541:target_handle_reconnect()) fs01-OST0005: e293e1d4-51a4-a8f4-a7c0-c1aaab662a94 reconnecting
Apr 7 18:10:51 maglustre04 kernel: Lustre: 6598:0:(ldlm_lib.c:541:target_handle_reconnect()) Skipped 2 previous similar messages
Apr 7 18:11:07 maglustre04 kernel: Lustre: 5027:0:(ldlm_lib.c:780:target_handle_connect()) fs01-OST0005: refuse reconnection from 19615549-595b-2174-a224-57b10ed10a9c at 10.5.200.15 at tcp to 0xffff8102a7270000; still busy with 2 active RPCs
Apr 7 18:11:16 maglustre04 kernel: Lustre: 5076:0:(ldlm_lib.c:541:target_handle_reconnect()) fs01-OST0006: e293e1d4-51a4-a8f4-a7c0-c1aaab662a94 reconnecting
Apr 7 18:11:16 maglustre04 kernel: Lustre: 5076:0:(ldlm_lib.c:541:target_handle_reconnect()) Skipped 17 previous similar messages
Apr 7 18:12:11 maglustre04 kernel: Lustre: 5180:0:(filter_io.c:551:filter_preprw_write()) fs01-OST0008: slow preprw_write setup 37s
Apr 7 18:12:45 maglustre04 kernel: Lustre: 5220:0:(filter_io_26.c:779:filter_commitrw_write()) fs01-OST0003: slow direct_io 33s
Apr 7 18:12:48 maglustre04 kernel: Lustre: 11010:0:(socklnd_cb.c:2728:ksocknal_check_peer_timeouts()) Stale ZC_REQs for peer 10.5.200.12 at tcp detected: 1; the oldest (ffff81051f1de000) timed out 9 secs ago
Apr 7 18:12:56 maglustre04 kernel: Lustre: 5063:0:(ldlm_lib.c:541:target_handle_reconnect()) fs01-OST0006: e293e1d4-51a4-a8f4-a7c0-c1aaab662a94 reconnecting
Apr 7 18:12:56 maglustre04 kernel: Lustre: 5063:0:(ldlm_lib.c:541:target_handle_reconnect()) Skipped 1 previous similar message
Apr 7 18:12:56 maglustre04 kernel: Lustre: 5063:0:(ldlm_lib.c:780:target_handle_connect()) fs01-OST0006: refuse reconnection from e293e1d4-51a4-a8f4-a7c0-c1aaab662a94 at 10.5.200.12 at tcp to 0xffff810525b80000; still busy with 2 active RPCs
Apr 7 18:12:56 maglustre04 kernel: LustreError: 5063:0:(ldlm_lib.c:1643:target_send_reply_msg()) @@@ processing error (-16) req at ffff8103dc39bc00 x243110586/t0 o8->e293e1d4-51a4-a8f4-a7c0-c1aaab662a94@:0/0 lens 304/200 e 0 to 0 dl 1239146076 ref 1 fl Interpret:/0/0 rc -16/0
Apr 7 18:12:56 maglustre04 kernel: LustreError: 5063:0:(ldlm_lib.c:1643:target_send_reply_msg()) Skipped 2 previous similar messages
Apr 7 18:12:56 maglustre04 kernel: LustreError: 5209:0:(ost_handler.c:899:ost_brw_read()) @@@ timeout on bulk PUT after 68+0s req at ffff8102e7357800 x243110512/t0 o3->e293e1d4-51a4-a8f4-a7c0-c1aaab662a94@:0/0 lens 384/336 e 0 to 0 dl 1239145976 ref 1 fl Interpret:/0/0 rc 0/0
Apr 7 18:12:57 maglustre04 kernel: Lustre: 5215:0:(filter_io_26.c:779:filter_commitrw_write()) fs01-OST0007: slow direct_io 39s
Apr 7 18:12:57 maglustre04 kernel: Lustre: 5209:0:(ost_handler.c:957:ost_brw_read()) fs01-OST0006: ignoring bulk IO comm error with e293e1d4-51a4-a8f4-a7c0-c1aaab662a94@ id 12345-10.5.200.12 at tcp - client will retry
Apr 7 18:12:57 maglustre04 kernel: Lustre: 5209:0:(service.c:1317:ptlrpc_server_handle_request()) @@@ Request x243110512 took longer than estimated (100+1s); client may timeout. req at ffff8102e7357800 x243110512/t0 o3->e293e1d4-51a4-a8f4-a7c0-c1aaab662a94@:0/0 lens 384/336 e 0 to 0 dl 1239145976 ref 1 fl Complete:/0/0 rc 0/0
Apr 7 18:12:57 maglustre04 kernel: Lustre: 5209:0:(service.c:1317:ptlrpc_server_handle_request()) Skipped 6 previous similar messages
=========================================================

Soon the OSS started refusing connections:

=========================================================
Apr 7 18:18:30 maglustre04 kernel: Lustre: 5206:0:(filter_io.c:551:filter_preprw_write()) fs01-OST0002: slow preprw_write setup 42s
Apr 7 18:18:34 maglustre04 kernel: Lustre: 5079:0:(ldlm_lib.c:780:target_handle_connect()) fs01-OST0008: refuse reconnection from e293e1d4-51a4-a8f4-a7c0-c1aaab662a94 at 10.5.200.12 at tcp to 0xffff810337f8a000; still busy with 2 active RPCs
Apr 7 18:18:34 maglustre04 kernel: LustreError: 5079:0:(ldlm_lib.c:1643:target_send_reply_msg()) @@@ processing error (-16) req at ffff81052b724400 x243111082/t0 o8->e293e1d4-51a4-a8f4-a7c0-c1aaab662a94@:0/0 lens 304/200 e 0 to 0 dl 1239146414 ref 1 fl Interpret:/0/0 rc -16/0
Apr 7 18:18:34 maglustre04 kernel: LustreError: 5079:0:(ldlm_lib.c:1643:target_send_reply_msg()) Skipped 1 previous similar message
Apr 7 18:18:38 maglustre04 kernel: LustreError: 11000:0:(events.c:66:request_out_callback()) @@@ type 4, status -5 req at ffff8104e3853400 x208570/t0 o400->MGS at MGC10.5.10.11@tcp_0:26/25 lens 128/256 e 0 to 1 dl 1239146349 ref 2 fl Rpc:N/0/0 rc 0/0
Apr 7 18:18:38 maglustre04 kernel: LustreError: 11000:0:(events.c:66:request_out_callback()) @@@ type 4, status -5 req at ffff81051e07fc00 x208571/t0 o400->MGS at MGC10.5.10.11@tcp_0:26/25 lens 128/256 e 0 to 1 dl 1239146374 ref 2 fl Rpc:N/0/0 rc 0/0
Apr 7 18:18:38 maglustre04 kernel: Lustre: Request x208570 sent from MGC10.5.10.11 at tcp to NID 10.5.10.11 at tcp 69s ago has timed out (limit 100s).
Apr 7 18:18:38 maglustre04 kernel: LustreError: 166-1: MGC10.5.10.11 at tcp: Connection to service MGS via nid 10.5.10.11 at tcp was lost; in progress operations using this service will fail.
Apr 7 18:18:38 maglustre04 kernel: Lustre: MGC10.5.10.11 at tcp: Reactivating import
Apr 7 18:18:38 maglustre04 kernel: Lustre: MGC10.5.10.11 at tcp: Connection restored to service MGS using nid 10.5.10.11 at tcp.
Apr 7 18:18:45 maglustre04 kernel: Lustre: fs01-OST0000: received MDS connection from 10.5.10.11 at tcp
Apr 7 18:18:45 maglustre04 kernel: Lustre: Skipped 10 previous similar messages
Apr 7 18:18:45 maglustre04 kernel: Lustre: 6448:0:(filter.c:2849:filter_destroy_precreated()) fs01-OST0000: deleting orphan objects from 184495 to 184526
Apr 7 18:18:45 maglustre04 kernel: Lustre: 6448:0:(filter.c:2849:filter_destroy_precreated()) Skipped 8 previous similar messages
Apr 7 18:18:53 maglustre04 kernel: LustreError: 11004:0:(events.c:367:server_bulk_callback()) event type 4, status -5, desc ffff8104f2272000
Apr 7 18:18:53 maglustre04 kernel: LustreError: 11004:0:(events.c:367:server_bulk_callback()) event type 4, status -5, desc ffff81051f681000
Apr 7 18:18:56 maglustre04 kernel: LustreError: 5226:0:(ost_handler.c:1131:ost_brw_write()) @@@ timeout on bulk GET after 46+0s req at ffff8104e516d800 x99305448/t0 o4->19615549-595b-2174-a224-57b10ed10a9c@:0/0 lens 384/352 e 0 to 0 dl 1239146336 ref 1 fl Interpret:/0/0 rc 0/0
Apr 7 18:18:56 maglustre04 kernel: Lustre: 5226:0:(ost_handler.c:1277:ost_brw_write()) fs01-OST0005: ignoring bulk IO comm error with 19615549-595b-2174-a224-57b10ed10a9c@ id 12345-10.5.200.15 at tcp - client will retry
Apr 7 18:19:01 maglustre04 kernel: Lustre: 6350:0:(ldlm_lib.c:780:target_handle_connect()) fs01-OST0008: refuse reconnection from e293e1d4-51a4-a8f4-a7c0-c1aaab662a94 at 10.5.200.12 at tcp to 0xffff810337f8a000; still busy with 2 active RPCs
Apr 7 18:19:13 maglustre04 last message repeated 2 times
===========================================================

Considering we have a _very_ small installation compared to almost everyone else on the list (2 servers and 5 client nodes), I am not sure exactly where to start looking. I did find a thread from December 2008 with similar error messages, but that one seemed confined to a single OST; this hit all of our OSTs one after the other. We are using the default striping, so each file lives on a single OST. The storage group reported no errors on the FC switch or the SAN disks, and the networking group reported no errors on the gigabit Ethernet switch, so I don't think this is a networking problem. This is the only issue we've had in approximately 3 weeks of regular use since upgrading the servers to 1.6.7. Any suggestions are appreciated.

Also: if anyone over there at Sun has a contact in sales I can call about a commercial support contract, that is something we are interested in looking at as well.

Thanks,

Robert Hendelman Jr
Magnetar Capital LLC
Rob.Hendelman at magnetar.com
1-847-905-4557
Hendelman, Rob
2009-Apr-08 18:24 UTC
[Lustre-discuss] slow journal/commitrw on OSTs lead to crash
I forgot to mention that about an hour after this started, we got the following:

===============
Apr 7 18:49:06 maglustre04 kernel: Lustre: 6483:0:(ldlm_lib.c:541:target_handle_reconnect()) fs01-OST0000: fs01-mdtlov_UUID reconnecting
Apr 7 18:49:06 maglustre04 kernel: Lustre: 6483:0:(ldlm_lib.c:541:target_handle_reconnect()) Skipped 111 previous similar messages
Apr 7 18:49:26 maglustre04 kernel: Lustre: 5075:0:(service.c:1317:ptlrpc_server_handle_request()) @@@ Request x1039597 took longer than estimated (50+75s); client may timeout. req at ffff8104bb35e400 x1039597/t0 o5->fs01-mdtlov_UUID@:0/0 lens 336/336 e 0 to 0 dl 1239148091 ref 1 fl Complete:/0/0 rc 0/0
Apr 7 18:51:58 maglustre04 kernel: Lustre: 5075:0:(service.c:1317:ptlrpc_server_handle_request()) Skipped 12 previous similar messages
Apr 7 18:51:58 maglustre04 kernel: Lustre: fs01-OST0006: received MDS connection from 10.5.10.11 at tcp
Apr 7 18:51:58 maglustre04 kernel: Lustre: Skipped 10 previous similar messages
Apr 7 18:51:58 maglustre04 kernel: LustreError: 5146:0:(lustre_fsfilt.h:229:fsfilt_start_log()) fs01-OST0008: slow journal start 135s
Apr 7 18:51:58 maglustre04 kernel: Lustre: 462:0:(watchdog.c:148:lcw_cb()) Watchdog triggered for pid 5024: it was inactive for 200s
Apr 7 18:51:58 maglustre04 kernel: Lustre: 462:0:(linux-debug.c:222:libcfs_debug_dumpstack()) showing stack for process 5024
Apr 7 18:51:58 maglustre04 kernel: ll_ost_02 D ffff8102687a20c0 0 5024 1 5025 5023 (L-TLB)
Apr 7 18:51:58 maglustre04 kernel: ffff81027a69f5a8 0000000000000046 0000000000000000 00000000ffffffff
Apr 7 18:51:58 maglustre04 kernel: 0000000000000000 0000000000000001 ffff8102687a20c0 ffff8105272907a0
Apr 7 18:51:58 maglustre04 kernel: 0004a4ba546cd448 0000000000009669 ffff8102687a22a8 000000008882434d
Apr 7 18:51:58 maglustre04 kernel: Call Trace:
Apr 7 18:51:58 maglustre04 kernel: [<ffffffff80046dac>] sprintf+0x51/0x59
Apr 7 18:51:58 maglustre04 kernel: [<ffffffff80063be0>] __mutex_lock_slowpath+0x60/0x9b
Apr 7 18:51:58 maglustre04 kernel: [<ffffffff80063c20>] .text.lock.mutex+0x5/0x14
Apr 7 18:51:58 maglustre04 kernel: [<ffffffff887fe756>] :obdfilter:filter_parent_lock+0x36/0x220
Apr 7 18:51:58 maglustre04 kernel: [<ffffffff8009ddc3>] autoremove_wake_function+0x9/0x2e
Apr 7 18:51:58 maglustre04 kernel: [<ffffffff800891f6>] __wake_up_common+0x3e/0x68
Apr 7 18:51:58 maglustre04 kernel: [<ffffffff888026cf>] :obdfilter:filter_fid2dentry+0x2cf/0x730
Apr 7 18:51:58 maglustre04 kernel: [<ffffffff88652b72>] :lquota:filter_quota_adjust+0x172/0x2a0
Apr 7 18:51:58 maglustre04 kernel: [<ffffffff8000d046>] dput+0x23/0x10a
Apr 7 18:51:58 maglustre04 kernel: [<ffffffff88809fe8>] :obdfilter:filter_destroy+0x148/0x1dd0
Apr 7 18:51:58 maglustre04 kernel: [<ffffffff88546520>] :ptlrpc:ldlm_blocking_ast+0x0/0x2a0
Apr 7 18:51:58 maglustre04 kernel: [<ffffffff88549db0>] :ptlrpc:ldlm_completion_ast+0x0/0x7c0
Apr 7 18:51:58 maglustre04 kernel: [<ffffffff800646b6>] __down_failed_trylock+0x35/0x3a
Apr 7 18:51:58 maglustre04 kernel: [<ffffffff8880c7ce>] :obdfilter:filter_create+0xb5e/0x1530
Apr 7 18:51:58 maglustre04 kernel: [<ffffffff8856f004>] :ptlrpc:lustre_msg_set_timeout+0x34/0x110
Apr 7 18:51:58 maglustre04 kernel: [<ffffffff88572f09>] :ptlrpc:lustre_pack_reply+0x29/0xb0
Apr 7 18:51:58 maglustre04 kernel: [<ffffffff887cd01f>] :ost:ost_handle+0x136f/0x5cd0
Apr 7 18:51:58 maglustre04 kernel: [<ffffffff80143809>] __next_cpu+0x19/0x28
Apr 7 18:51:58 maglustre04 kernel: [<ffffffff80143809>] __next_cpu+0x19/0x28
Apr 7 18:51:58 maglustre04 kernel: [<ffffffff800898e3>] find_busiest_group+0x20d/0x621
Apr 7 18:51:58 maglustre04 kernel: [<ffffffff8856d5a5>] :ptlrpc:lustre_msg_get_conn_cnt+0x35/0xf0
Apr 7 18:51:58 maglustre04 kernel: [<ffffffff88575cfa>] :ptlrpc:ptlrpc_server_request_get+0x6a/0x150
Apr 7 18:51:58 maglustre04 kernel: [<ffffffff88577b7d>] :ptlrpc:ptlrpc_check_req+0x1d/0x110
Apr 7 18:51:58 maglustre04 kernel: [<ffffffff8857a103>] :ptlrpc:ptlrpc_server_handle_request+0xa93/0x1150
Apr 7 18:51:58 maglustre04 kernel: [<ffffffff80062f4b>] thread_return+0x0/0xdf
Apr 7 18:51:58 maglustre04 kernel: [<ffffffff8006d8a2>] do_gettimeofday+0x40/0x8f
Apr 7 18:51:58 maglustre04 kernel: [<ffffffff8841d7c6>] :libcfs:lcw_update_time+0x16/0x100
Apr 7 18:51:58 maglustre04 kernel: [<ffffffff800891f6>] __wake_up_common+0x3e/0x68
Apr 7 18:51:58 maglustre04 kernel: [<ffffffff8857d5f8>] :ptlrpc:ptlrpc_main+0x1218/0x13e0
Apr 7 18:51:58 maglustre04 kernel: [<ffffffff8008abb9>] default_wake_function+0x0/0xe
Apr 7 18:51:58 maglustre04 kernel: [<ffffffff800b4382>] audit_syscall_exit+0x31b/0x336
Apr 7 18:51:58 maglustre04 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
Apr 7 18:51:58 maglustre04 kernel: [<ffffffff8857c3e0>] :ptlrpc:ptlrpc_main+0x0/0x13e0
Apr 7 18:51:58 maglustre04 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11
Apr 7 18:51:58 maglustre04 kernel:
=================

Are these due to watchdog timeouts? Eventually (just after 8pm or so), while we were still looking at the logs, the box OOMed and started killing processes. At that point we rebooted, ran e2fsck, and remounted, and things have been OK since. The clients went through their recovery and came back functional.

Robert Hendelman Jr
Magnetar Capital LLC
Rob.Hendelman at magnetar.com
1-847-905-4557
Brian J. Murrell
2009-Apr-08 18:49 UTC
[Lustre-discuss] slow journal/commitrw on OSTs lead to crash
On Wed, 2009-04-08 at 13:16 -0500, Hendelman, Rob wrote:
> Hello,

Hi,

> I recently upgraded our 1.6.4.3 centos servers to 1.6.7. Clients are
> still at 1.6.4.3. We were going to upgrade to 1.6.7 but decided to wait
> until 1.6.7.1 comes out with the fix for the bug mentioned on the list
> late last week.

The corruption bug for which 1.6.7.1 will be released affects the MDT, not the clients, so you still have some exposure there even without upgrading the clients.

> Apr 7 18:08:48 maglustre04 kernel: Lustre:
> 5228:0:(lustre_fsfilt.h:320:fsfilt_commit_wait()) fs01-OST0005: slow
> journal start 31s

This basically means the storage devices are taking exceptionally long to process requests. One common cause is over-subscription of OST threads. Did you run obdfilter-survey on your disk subsystem before installing Lustre on it? If so, it should tell you at what thread count you saturate your storage, and you want to adjust your OST thread count accordingly. If you didn't run obdfilter-survey, you will have to do some trial and error (i.e. a binary search) on OST thread counts to find a happy place where you are no longer over-subscribed but are also not giving up performance.

Of course, the right OST thread count is relative to storage backend performance. If that performance suffers, a thread count that was once sufficient could now be over-subscribed.

> Also: If anyone over there @ Sun has a contact of someone in sales I can
> call for a commercial support contract, this is something we are
> interested in looking at as well.

http://www.sun.com/software/products/lustre/support.xml

Let me know if that doesn't help you out in that endeavour.

b.
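[Editor's sketch of how such a cap is typically applied on a 1.6-era OSS, assuming your release's ost module honours the oss_num_threads parameter (older releases used ost_num_threads instead); the value 128 is purely illustrative and should come from obdfilter-survey or trial and error:]

  # in /etc/modprobe.conf on the OSS
  options ost oss_num_threads=128

The new count only takes effect after the Lustre modules are reloaded (or the node rebooted) and the OSTs remounted. The running service threads appear as ll_ost* kernel threads, so something like "ps ax | grep ll_ost | wc -l" gives a rough check of how many were actually started.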
Daniel Leaberry
2009-Apr-08 19:51 UTC
[Lustre-discuss] slow journal/commitrw on OSTs lead to crash
Hendelman, Rob wrote:
> Hello,

<snip>

> Our OSTs are almost completely full (85-90% on each) and our MDT is
> almost entirely empty.

I ran into a similar scenario with Lustre when I hit 80% full as well: the exact same problem with journal commits and disks seemingly unusable. iostat on the disks (a DDN 9500 array) showed huge numbers of small reads, almost as if the disks were being scanned. I eventually decided it must be something in the filesystem code that changes how it allocates free blocks when the disk is ~80% full. I have no confirmation of this other than having been through exactly what you're seeing; we now consider our 90TB of Lustre full when we hit 77% usage.

Daniel Leaberry
Senior Systems Administrator
Hendelman, Rob
2009-Apr-08 20:14 UTC
[Lustre-discuss] slow journal/commitrw on OSTs lead to crash
I did not run obdfilter-survey on the disk subsystem, but I am trying it now on a new filesystem we are building (to relieve space pressure on the previous/crashing system).

Following the directions in the manual for running obdfilter-survey against local disks, I have set up Lustre with an MGS/MDS/OST. There is no data on this new filesystem yet. Running the command from page 18-6 of the manual (section 18.2.2.1):

  nobjhi=2 thrhi=2 size=1024 case=disk sh obdfilter-survey

gives me the following:

  Wed Apr 8 15:11:29 CDT 2009 Obdfilter-survey for case=disk from maglustre03
  ost 1 sz 1048576K rsz 1024 obj 1 thr 1 ERROR: 1 != 0
  create: 1 objects
  error: create: #1 - No such device
  created object #s on localhost:fs02-OST0000_ecc not contiguous

The MGS/MDT are on another box. Do they have to be on the same box as the OST?

Robert Hendelman Jr
Magnetar Capital LLC
Rob.Hendelman at magnetar.com
1-847-905-4557
Brian J. Murrell
2009-Apr-09 14:09 UTC
[Lustre-discuss] slow journal/commitrw on OSTs lead to crash
On Wed, 2009-04-08 at 13:51 -0600, Daniel Leaberry wrote:
> I ran into a similar scenario with lustre when I hit 80% full as well.
> Exact same problem with journal commits and disks seemingly unusable.
> iostat on the disks (DDN 9500 array) shows huge numbers of small reads.
> Almost like the disk is being scanned.

Interesting. It's probably worth mentioning the historical (or at least anecdotal) warning that ext2/3 filesystems become less efficient as they fill up. That is not surprising: a filesystem falls squarely into the category of resources that become less efficient as they fill, due to the overhead of finding suitable allocations. I don't know enough about the internals of ext2/3/4 to comment directly on the design decisions that led to this, or on how much of the problem still exists in ext4. Perhaps one of our ext3/4 experts can comment. It would certainly make for an interesting benchmarking test.

b.
Brian J. Murrell
2009-Apr-09 14:14 UTC
[Lustre-discuss] slow journal/commitrw on OSTs lead to crash
On Wed, 2009-04-08 at 15:14 -0500, Hendelman, Rob wrote:
> Wed Apr 8 15:11:29 CDT 2009 Obdfilter-survey for case=disk from
> maglustre03
> ost 1 sz 1048576K rsz 1024 obj 1 thr 1 ERROR: 1 != 0
> create: 1 objects
> error: create: #1 - No such device
> created object #s on localhost:fs02-OST0000_ecc not contiguous

If you are using 1.6.7, this looks like bug 18478. :-(

b.
Peter Grandi
2009-Apr-12 15:41 UTC
[Lustre-discuss] slow journal/commitrw on OSTs lead to crash
[ ... ]

>> I ran into a similar scenario with lustre when I hit 80% full as
>> well. Exact same problem with journal commits and disks seemingly
>> unusable. iostat on the disks (DDN 9500 array) shows huge numbers of
>> small reads. Almost like the disk is being scanned.

This reminds me of a different issue:

http://lists.lustre.org/pipermail/lustre-discuss/2008-November/009124.html

"We have experienced all these errors when we have a big job that is writing many small chunks. When the writes are ... say 80 bytes and the block size is 4k bytes, the back end storage can slow down with read block, modify block, write block, to such an extent as to cause the slow commitrw and slow journal messages very similar to yours."

Those reads can be part of RMW cycles for sub-stripe writes, which are fairly common on journals, for example.

http://manual.lustre.org/manual/LustreManual16_HTML/RAID.html#50548852_pgfId-1288359

"Do not put an ext3 journal onto RAID5. As the journal is written linearly and synchronously, in most cases writes do not fill whole stripes and RAID5 has to read parities."

"Ideally, the RAID configuration should allow the 1 MB Lustre RPCs to evenly fit only one RAID stripe without requiring an expensive read-modify-write cycle."

As usual, I would be wary of using RAID5 or RAID6 as an OST, as RAID10 is nearly always so much nicer. On the other hand, a RAID5 or RAID6 with a stripe size of 4KiB might be vaguely tolerable (as Linux on x86 variants uses 4KiB as the block size anyhow).

> [ ... ] as ext2/3 filesystems get full, they become less efficient.
> That's not surprising as a filesystem certainly can fall into the
> category of resources that become less efficient as they become more
> full due to the overhead of finding suitable allocations. [ ... ]

This seems to indicate that finding space becomes slower. That is a minor issue; there are two bigger ones:

* A fundamental issue is that nearby or contiguous stretches of blocks become scarce, as free blocks tend to be widely scattered. This is independent of filesystem design (except for designs with a compacting collector).

* A more incidental one is that since most filesystems tend to allocate free blocks sequentially starting from the beginning, the last blocks to remain free tend to be on the inner cylinders. That is, the odds are that in a 50% full filesystem the allocated half sits mostly on the outer cylinders and the free blocks sit mostly on the inner cylinders, which can be much slower.

There are other potential issues: ext3 and ext4, for example, use relatively small allocation groups and try to keep some space free in every group, which works well when there is lots of free space; also, when there is little free space, it becomes much harder to find parity-RAID-aligned free space, which means that parity-RAID-based OSTs end up doing many more RMW cycles when writing.

Overall, people who like *sustainable* performance have a big problem: performance numbers drive sales, and the numbers that matter in practice are the "peak" numbers obtained from a fresh install and a quick benchmark run. So in the storage development community there is a lot of focus on performance as in a freshly loaded, nearly empty filesystem on the outer tracks of a storage system doing IO in large aligned blocks.
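[Editor's note: to make the "evenly fit one RAID stripe" arithmetic concrete, compare two hypothetical layouts; the disk counts and chunk sizes below are illustrative only, not the original poster's SAN geometry.]

  8+2 RAID6 with a 128 KiB chunk: stripe width = 8 x 128 KiB = 1 MiB.
  A 1 MiB Lustre RPC covers exactly one full stripe, so the array can
  compute parity from the new data alone and no read-modify-write is needed.

  10+2 RAID6 with a 64 KiB chunk: stripe width = 10 x 64 KiB = 640 KiB.
  A 1 MiB RPC covers one full stripe plus 384 KiB of the next, and that
  partial stripe forces the array to read the old data and parity before
  it can write (RMW).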
Andreas Dilger
2009-Apr-14 04:33 UTC
[Lustre-discuss] slow journal/commitrw on OSTs lead to crash
On Apr 12, 2009 16:41 +0100, Peter Grandi wrote:
> "Ideally, the RAID configuration should allow the 1 MB Lustre RPCs
> to evenly fit only one RAID stripe without requiring an expensive
> read-modify-write cycle."
>
> As usual, I would be wary of using RAID5 or RAID6 as an OST, as RAID10
> is nearly always so much nicer. On the other hand a RAID5 or RAID6 with
> a stripe size of 4KiB might be vaguely tolerable (as Linux on x86
> variants uses 4KiB as block size anyhow).

Actually, the ldiskfs allocator (mballoc, also used in ext4) is optimized for RAID configurations. It allocates RAID-stripe-sized and -aligned chunks where possible, specifically to avoid read-modify-write and also to avoid fragmenting free space. In the absence of other information, mballoc assumes a 1MB RAID stripe width, but modern Lustre + e2fsprogs allows storing the actual RAID stripe width in the superblock via tune2fs. It is also possible to have the clients match their RPC size and alignment to the RAID stripe size/alignment, though it isn't yet automatic to pass the optimal RPC size from the underlying RAID all the way to the client.

> This seems to indicate that finding space becomes slower. That is a
> minor issue; there are two bigger ones:
>
> * A fundamental issue is that nearby or contiguous stretches of blocks
> become scarce, as free blocks tend to be widely scattered. This is
> independent of filesystem design (except for designs with a compacting
> collector).

This is true of almost all filesystems. It means the best possible avoidance is to allocate blocks contiguously in the first place, so that when a file is freed it also leaves behind contiguous free blocks.

> There are other potential issues: ext3 and ext4, for example, use
> relatively small allocation groups and try to keep some space free in
> every group, which works well when there is lots of free space; also,
> when there is little free space, it becomes much harder to find
> parity-RAID-aligned free space, which means that parity-RAID-based
> OSTs end up doing many more RMW cycles when writing.

While it is true that ext3 and ext4 have relatively small allocation groups (128MB), the allocators are totally different, and as a result the performance of ext4 (which is largely based on Lustre's ldiskfs) is much better than ext3's.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
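[Editor's sketch of recording the RAID geometry in the superblock, for a hypothetical 8-data-disk RAID6 with 128 KiB chunks and 4 KiB filesystem blocks; the stride/stripe-width extended options shown are those of recent e2fsprogs (note the underscore vs. hyphen spelling differs between tune2fs and mke2fs), and /dev/mapper/ost0 is a placeholder device, so check your own man pages before running anything like this:]

  # stride       = chunk size / block size  = 128 KiB / 4 KiB = 32 blocks
  # stripe width = data disks x stride      = 8 x 32          = 256 blocks
  tune2fs -E stride=32,stripe_width=256 /dev/mapper/ost0

  # equivalently at format time (e.g. passed via mkfs.lustre --mkfsoptions):
  mke2fs -E stride=32,stripe-width=256 /dev/mapper/ost0

With the stripe width stored, mballoc can align its allocations to the real RAID geometry instead of falling back to the 1MB default mentioned above.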