th@llnl.gov
2006-Dec-21 15:28 UTC
[Lustre-devel] [Bug 11472] New: hung alc client node; possible elan3-related: (bug, or tuning required)
Please don''t reply to lustre-devel. Instead, comment in Bugzilla by using the following link: https://bugzilla.lustre.org/show_bug.cgi?id=11472 see also: bug#9423 Heavier loads on our alc/ltest Lnet router nodes are producing many messages of the type: 2006-12-20 20:36:22 LustreError: 24509:0:(qswlnd_cb.c:1023:kqswnal_send()) Can''t get txd for msg type 1 for 317@elan 2006-12-20 20:36:22 LustreError: 24509:0:(qswlnd_cb.c:1023:kqswnal_send()) Skipped 53 previous similar messages 2006-12-20 20:36:49 LustreError: 24521:0:(qswlnd_cb.c:890:kqswnal_rdma()) Can''t get txd for RDMA with 279@elan 2006-12-20 20:36:49 LustreError: 24521:0:(qswlnd_cb.c:890:kqswnal_rdma()) Skipped 2 previous similar messages 2006-12-20 20:42:38 LustreError: 24509:0:(qswlnd_cb.c:1023:kqswnal_send()) Can''t get txd for msg type 2 for 488@elan 2006-12-20 20:42:38 LustreError: 24509:0:(qswlnd_cb.c:1023:kqswnal_send()) Skipped 145 previous similar messages 2006-12-20 20:42:38 LustreError: 956:0:(qswlnd_cb.c:890:kqswnal_rdma()) Can''t get txd for RDMA with 424@elan 2006-12-20 20:42:38 LustreError: 956:0:(qswlnd_cb.c:890:kqswnal_rdma()) Skipped 4 previous similar messages Eric Barton commented: The "Can''t get txd for..." qswlnd messages mean that you need to increase "ntxmsgs". But before that, it would help to see all the LNET/LND module parameters everywhere. "ntxmsgs" is a finite parameter for the qswlnd since it can consume all of elan VM (each tx reserves 1 Mbyte of elan VM in case it needs to map a data buffer), but pressure can be ameliorated by playing with "credits" and "peer_credits". Maybe it''s worth opening a bugzilla bug until this is understood. sample Lnet router: # alc14 /root > cat /etc/modprobe.conf ############################################################################### # $URL: svn+ssh://tyche.llnl.gov/var/svn/dist/kernel/modprobe.conf.alc.io.ltest $ # $Author: marc $ # $Date: 2006-12-15 15:11:15 -0800 (Fri, 15 Dec 2006) $ # $Rev: 4324 $ ############################################################################### # alias scsi_hostadapter0 mptscsih alias scsi_hostadapter1 mptscsih alias eth0 e1000 alias eth1 e1000 alias eth2 e1000 alias eth3 e1000 options elan3 DownLinks=0x44644 alias eip0 eip # Turn off IPv6 alias net-pf-10 off alias ipv6 off # Lustre modules for clients options lnet networks="tcp0(eth1,eth0),elan0" \ forwarding="enabled" \ tiny_router_buffers=2048 \ small_router_buffers=2048 \ large_router_buffers=512 \ hosts_allow="172.16.[70-71].[147-163]; 172.16.[2,10].[101-123,231]" options e1000 RxDescriptors=4096,4096,256,256 options ksocklnd keepalive_intvl=5 keepalive_count=100 keepalive_idle=30 sample ALC client: # alc200 /root > cat /etc/modprobe.conf ############################################################################### # $URL: svn+ssh://tyche.llnl.gov/var/svn/dist/kernel/modprobe.conf.alc.compute.ltest $ # $Author: wartens2 $ # $Date: 2006-10-18 15:32:37 -0700 (Wed, 18 Oct 2006) $ # $Rev: 3652 $ ############################################################################### # alias scsi_hostadapter mptscsih alias eth0 tg3 alias eth1 tg3 options elan3 DownLinks=0x44644 alias eip0 eip # Turn off IPv6 alias net-pf-10 off alias ipv6 off # Lustre modules for clients alias lustre llite options lnet networks="elan0" \ routes="tcp0 [4-19]@elan0; tcp1 [4-19]@elan0; tcp2 [4-19]@elan0" \ live_router_check_interval=60 \ dead_router_check_interval=60 \ check_routers_before_use=1 #install kptlrouter /sbin/modprobe portals; /sbin/modprobe --ignore-install kptlrouter #install kqswnal /sbin/modprobe kptlrouter; /sbin/modprobe --ignore-install kqswnal #install ksocknal /sbin/modprobe kqswnal; /sbin/modprobe --ignore-install ksocknal #install ptlrpc /sbin/modprobe ksocknal; /sbin/modprobe --ignore-install ptlrpc #install llite /sbin/modprobe lov osc; /sbin/modprobe --ignore-install llite Here''s a console log snippet from the client that hung: 2006-12-20 21:43:25 LustreError: 3888:0:(client.c:576:ptlrpc_check_status()) @@@ type == PTL_RPC_MSG_ERR, err == -107 req@db01c000 x10094820/t0 o4->OST_igs4_UUID@igs4_UUID:28 lens 328/288 ref 2 fl Rpc:R/0/0 r c 0/-107 2006-12-20 21:43:25 LustreError: OSC_igs-mds1_OST_igs4_MNT_crater_client-cc685200: Connection to service OST_igs4 via nid 172.16.2.104@tcp was lost; in progress operations using this service will wait for rec overy to complete. 2006-12-20 21:43:25 LustreError: Skipped 4 previous similar messages 2006-12-20 21:43:25 LustreError: This client was evicted by OST_igs4; in progress operations using this service will fail. 2006-12-20 21:43:25 LustreError: 26451:0:(file.c:632:ll_remove_page_from_cache()) writepage of page c1ad3340 failed: -5 2006-12-20 21:43:25 LustreError: 26451:0:(file.c:632:ll_remove_page_from_cache()) writepage of page c1ad3360 failed: -5 2006-12-20 21:43:25 LustreError: 26451:0:(file.c:632:ll_remove_page_from_cache()) writepage of page c1ad3380 failed: -5 2006-12-20 21:43:25 LustreError: 26451:0:(file.c:632:ll_remove_page_from_cache()) writepage of page c1ad33a0 failed: -5 2006-12-20 21:43:25 LustreError: 26451:0:(file.c:632:ll_remove_page_from_cache()) writepage of page c1ad33c0 failed: -5 2006-12-20 21:43:25 LustreError: 26451:0:(file.c:632:ll_remove_page_from_cache()) writepage of page c1ad33e0 failed: -5 2006-12-20 21:43:25 LustreError: 26451:0:(file.c:632:ll_remove_page_from_cache()) writepage of page c181b800 failed: -5 2006-12-20 21:43:25 Lustre: OSC_igs-mds1_OST_igs4_MNT_crater_client-cc685200: Connection restored to service OST_igs4 using nid 172.16.2.104@tcp. 2006-12-20 21:43:25 Lustre: Skipped 4 previous similar messages 2006-12-20 21:47:28 LustreError: 3888:0:(client.c:940:ptlrpc_expire_one_request()) @@@ timeout (sent at 1166679748, 300s ago) req@d5b55400 x10094752/t0 o4->OST_igs14_3_UUID@igs14_UUID:28 lens 328/288 ref 2 fl Rpc:/0/0 rc 0/0 2006-12-20 21:47:28 LustreError: 3888:0:(client.c:940:ptlrpc_expire_one_request()) Skipped 4 previous similar messages 2006-12-20 21:47:28 LustreError: OSC_igs-mds1_OST_igs14_3_MNT_crater_client-cc685200: Connection to service OST_igs14_3 via nid 172.16.2.114@tcp was lost; in progress operations using this service will wait f or recovery to complete. 2006-12-20 21:47:28 Lustre: OSC_igs-mds1_OST_igs14_3_MNT_crater_client-cc685200: Connection restored to service OST_igs14_3 using nid 172.16.2.114@tcp. 2006-12-20 21:47:29 LustreError: 3888:0:(client.c:940:ptlrpc_expire_one_request()) @@@ timeout (sent at 1166679749, 300s ago) req@da1da800 x10094758/t0 o4->OST_igs19_UUID@igs19_UUID:28 lens 328/288 ref 2 fl R pc:/0/0 rc 0/0 2006-12-20 21:47:30 LustreError: 3888:0:(client.c:940:ptlrpc_expire_one_request()) Skipped 1 previous similar message 2006-12-20 21:47:30 LustreError: OSC_igs-mds1_OST_igs19_MNT_crater_client-cc685200: Connection to service OST_igs19 via nid 172.16.2.119@tcp was lost; in progress operations using this service will wait for r ecovery to complete. 2006-12-20 21:47:30 LustreError: Skipped 1 previous similar message 2006-12-20 21:47:30 Lustre: OSC_igs-mds1_OST_igs19_MNT_crater_client-cc685200: Connection restored to service OST_igs19 using nid 172.16.2.119@tcp. 2006-12-20 21:47:30 Lustre: Skipped 1 previous similar message 2006-12-20 21:48:35 LustreError: 26964:0:(qswlnd_cb.c:575:kqswnal_txhandler()) 14@elan RPC status ffffffb9 2006-12-20 21:48:35 LustreError: 3489:0:(events.c:129:client_bulk_callback()) event type 0, status -71, desc daefb000 2006-12-20 21:49:49 LustreError: 0:0:(qswlnd_cb.c:575:kqswnal_txhandler()) 12@elan RPC status ffffffb9 2006-12-20 21:49:49 LustreError: 3489:0:(events.c:129:client_bulk_callback()) event type 0, status -71, desc f2b71000 2006-12-20 21:53:34 LustreError: 3888:0:(client.c:940:ptlrpc_expire_one_request()) @@@ timeout (sent at 1166680113, 301s ago) req@e7db3a00 x10095074/t0 o4->OST_igs1_3_UUID@igs1_UUID:28 lens 328/288 ref 2 fl R pc:/0/0 rc 0/0 2006-12-20 21:53:34 LustreError: 3888:0:(client.c:940:ptlrpc_expire_one_request()) Skipped 3 previous similar messages 2006-12-20 21:53:34 LustreError: OSC_igs-mds1_OST_igs1_3_MNT_crater_client-cc685200: Connection to service OST_igs1_3 via nid 172.16.2.101@tcp was lost; in progress operations using this service will wait for recovery to complete. 2006-12-20 21:53:34 LustreError: Skipped 3 previous similar messages 2006-12-20 21:54:46 LustreError: 3888:0:(client.c:940:ptlrpc_expire_one_request()) @@@ timeout (sent at 1166680186, 300s ago) req@c8841200 x10095680/t0 o4->OST_igs1_3_UUID@igs1_UUID:28 lens 328/288 ref 2 fl R pc:/0/0 rc 0/0 2006-12-20 21:54:46 LustreError: 3888:0:(client.c:940:ptlrpc_expire_one_request()) Skipped 10 previous similar messages 2006-12-20 21:54:52 LustreError: 3888:0:(client.c:940:ptlrpc_expire_one_request()) @@@ timeout (sent at 1166680192, 300s ago) req@e978da00 x10095716/t0 o4->OST_igs12_3_UUID@igs12_UUID:28 lens 328/288 ref 2 fl Rpc:/0/0 rc 0/0 2006-12-20 21:54:52 LustreError: 3888:0:(client.c:940:ptlrpc_expire_one_request()) Skipped 4 previous similar messages 2006-12-20 21:55:18 LustreError: 3888:0:(client.c:940:ptlrpc_expire_one_request()) @@@ timeout (sent at 1166680218, 300s ago) req@e7db3c00 x10095910/t0 o4->OST_igs1_3_UUID@igs1_UUID:28 lens 328/288 ref 2 fl R pc:/0/0 rc 0/0 2006-12-20 21:56:05 LustreError: 3888:0:(client.c:940:ptlrpc_expire_one_request()) @@@ timeout (sent at 1166680264, 301s ago) req@defe3200 x10096157/t0 o4->OST_igs2_3_UUID@igs2_UUID:28 lens 328/288 ref 2 fl R pc:/0/0 rc 0/0 2006-12-20 21:58:33 LustreError: 3888:0:(client.c:940:ptlrpc_expire_one_request()) @@@ timeout (sent at 1166680413, 300s ago) req@c8841a00 x10097165/t0 o4->OST_igs1_3_UUID@igs1_UUID:28 lens 328/288 ref 2 fl R pc:/0/0 rc 0/0 [hung] If useful, I can attach the &Bt backtraces from this node, in its hung state.