th@llnl.gov
2006-Dec-21  15:28 UTC
[Lustre-devel] [Bug 11472] New: hung alc client node; possible elan3-related: (bug, or tuning required)
Please don''t reply to lustre-devel. Instead, comment in Bugzilla by
using the following link:
https://bugzilla.lustre.org/show_bug.cgi?id=11472
see also:  bug#9423
Heavier loads on our alc/ltest Lnet router nodes are producing
many messages of the type:
2006-12-20 20:36:22 LustreError: 24509:0:(qswlnd_cb.c:1023:kqswnal_send())
Can''t
get txd for msg type 1 for 317@elan
2006-12-20 20:36:22 LustreError: 24509:0:(qswlnd_cb.c:1023:kqswnal_send())
Skipped 53 previous similar messages
2006-12-20 20:36:49 LustreError: 24521:0:(qswlnd_cb.c:890:kqswnal_rdma())
Can''t
get txd for RDMA with 279@elan
2006-12-20 20:36:49 LustreError: 24521:0:(qswlnd_cb.c:890:kqswnal_rdma())
Skipped 2 previous similar messages
2006-12-20 20:42:38 LustreError: 24509:0:(qswlnd_cb.c:1023:kqswnal_send())
Can''t
get txd for msg type 2 for 488@elan
2006-12-20 20:42:38 LustreError: 24509:0:(qswlnd_cb.c:1023:kqswnal_send())
Skipped 145 previous similar messages
2006-12-20 20:42:38 LustreError: 956:0:(qswlnd_cb.c:890:kqswnal_rdma())
Can''t
get txd for RDMA with 424@elan
2006-12-20 20:42:38 LustreError: 956:0:(qswlnd_cb.c:890:kqswnal_rdma()) Skipped
4 previous similar messages
Eric Barton commented:
The "Can''t get txd for..." qswlnd messages mean that you need
to increase
"ntxmsgs".  But before that, it would help to see all the LNET/LND
module
parameters everywhere.  "ntxmsgs" is a finite parameter for the qswlnd
since
it can consume all of elan VM (each tx reserves 1 Mbyte of elan VM in case
it needs to map a data buffer), but pressure can be ameliorated by playing
with "credits" and "peer_credits".
Maybe it''s worth opening a bugzilla bug until this is understood.
sample Lnet router:
# alc14 /root > cat /etc/modprobe.conf
###############################################################################
# $URL: svn+ssh://tyche.llnl.gov/var/svn/dist/kernel/modprobe.conf.alc.io.ltest
$
# $Author: marc $
# $Date: 2006-12-15 15:11:15 -0800 (Fri, 15 Dec 2006) $
# $Rev: 4324 $
###############################################################################
#
alias scsi_hostadapter0 mptscsih
alias scsi_hostadapter1 mptscsih
alias eth0 e1000
alias eth1 e1000
alias eth2 e1000
alias eth3 e1000
options elan3 DownLinks=0x44644
alias eip0 eip
# Turn off IPv6
alias net-pf-10 off
alias ipv6 off
# Lustre modules for clients
options lnet networks="tcp0(eth1,eth0),elan0" \
             forwarding="enabled" \
             tiny_router_buffers=2048 \
             small_router_buffers=2048 \
             large_router_buffers=512 \
             hosts_allow="172.16.[70-71].[147-163];
172.16.[2,10].[101-123,231]"
options e1000 RxDescriptors=4096,4096,256,256
options ksocklnd keepalive_intvl=5 keepalive_count=100 keepalive_idle=30
sample ALC client:
# alc200 /root > cat /etc/modprobe.conf
###############################################################################
# $URL:
svn+ssh://tyche.llnl.gov/var/svn/dist/kernel/modprobe.conf.alc.compute.ltest $
# $Author: wartens2 $
# $Date: 2006-10-18 15:32:37 -0700 (Wed, 18 Oct 2006) $
# $Rev: 3652 $
###############################################################################
#
alias scsi_hostadapter mptscsih
alias eth0 tg3
alias eth1 tg3
options elan3 DownLinks=0x44644
alias eip0 eip
# Turn off IPv6
alias net-pf-10 off
alias ipv6 off
# Lustre modules for clients
alias lustre llite
options lnet networks="elan0" \
             routes="tcp0 [4-19]@elan0; tcp1 [4-19]@elan0; tcp2
[4-19]@elan0" \
             live_router_check_interval=60 \
             dead_router_check_interval=60 \
             check_routers_before_use=1
#install kptlrouter /sbin/modprobe portals; /sbin/modprobe --ignore-install
kptlrouter
#install kqswnal /sbin/modprobe kptlrouter; /sbin/modprobe --ignore-install
kqswnal
#install ksocknal /sbin/modprobe kqswnal; /sbin/modprobe --ignore-install
ksocknal
#install ptlrpc /sbin/modprobe ksocknal; /sbin/modprobe --ignore-install ptlrpc
#install llite /sbin/modprobe lov osc; /sbin/modprobe --ignore-install llite
Here''s a console log snippet from the client that hung:
2006-12-20 21:43:25 LustreError: 3888:0:(client.c:576:ptlrpc_check_status()) @@@
type == PTL_RPC_MSG_ERR, err == -107 req@db01c000 x10094820/t0
o4->OST_igs4_UUID@igs4_UUID:28 lens 328/288 ref 2 fl Rpc:R/0/0 r
c 0/-107
2006-12-20 21:43:25 LustreError:
OSC_igs-mds1_OST_igs4_MNT_crater_client-cc685200: Connection to service OST_igs4
via nid 172.16.2.104@tcp was lost; in progress operations using this service
will wait for rec
overy to complete.
2006-12-20 21:43:25 LustreError: Skipped 4 previous similar messages
2006-12-20 21:43:25 LustreError: This client was evicted by OST_igs4; in
progress operations using this service will fail.
2006-12-20 21:43:25 LustreError:
26451:0:(file.c:632:ll_remove_page_from_cache()) writepage of page c1ad3340
failed: -5
2006-12-20 21:43:25 LustreError:
26451:0:(file.c:632:ll_remove_page_from_cache()) writepage of page c1ad3360
failed: -5
2006-12-20 21:43:25 LustreError:
26451:0:(file.c:632:ll_remove_page_from_cache()) writepage of page c1ad3380
failed: -5
2006-12-20 21:43:25 LustreError:
26451:0:(file.c:632:ll_remove_page_from_cache()) writepage of page c1ad33a0
failed: -5
2006-12-20 21:43:25 LustreError:
26451:0:(file.c:632:ll_remove_page_from_cache()) writepage of page c1ad33c0
failed: -5
2006-12-20 21:43:25 LustreError:
26451:0:(file.c:632:ll_remove_page_from_cache()) writepage of page c1ad33e0
failed: -5
2006-12-20 21:43:25 LustreError:
26451:0:(file.c:632:ll_remove_page_from_cache()) writepage of page c181b800
failed: -5
2006-12-20 21:43:25 Lustre: OSC_igs-mds1_OST_igs4_MNT_crater_client-cc685200:
Connection restored to service OST_igs4 using nid 172.16.2.104@tcp.
2006-12-20 21:43:25 Lustre: Skipped 4 previous similar messages
2006-12-20 21:47:28 LustreError:
3888:0:(client.c:940:ptlrpc_expire_one_request()) @@@ timeout (sent at
1166679748, 300s ago) req@d5b55400 x10094752/t0
o4->OST_igs14_3_UUID@igs14_UUID:28 lens 328/288 ref 2 fl
 Rpc:/0/0 rc 0/0
2006-12-20 21:47:28 LustreError:
3888:0:(client.c:940:ptlrpc_expire_one_request()) Skipped 4 previous similar
messages
2006-12-20 21:47:28 LustreError:
OSC_igs-mds1_OST_igs14_3_MNT_crater_client-cc685200: Connection to service
OST_igs14_3 via nid 172.16.2.114@tcp was lost; in progress operations using this
service will wait f
or recovery to complete.
2006-12-20 21:47:28 Lustre: OSC_igs-mds1_OST_igs14_3_MNT_crater_client-cc685200:
Connection restored to service OST_igs14_3 using nid 172.16.2.114@tcp.
2006-12-20 21:47:29 LustreError:
3888:0:(client.c:940:ptlrpc_expire_one_request()) @@@ timeout (sent at
1166679749, 300s ago) req@da1da800 x10094758/t0
o4->OST_igs19_UUID@igs19_UUID:28
lens 328/288 ref 2 fl R
pc:/0/0 rc 0/0
2006-12-20 21:47:30 LustreError:
3888:0:(client.c:940:ptlrpc_expire_one_request()) Skipped 1 previous similar
message
2006-12-20 21:47:30 LustreError:
OSC_igs-mds1_OST_igs19_MNT_crater_client-cc685200: Connection to service
OST_igs19 via nid 172.16.2.119@tcp was lost; in progress operations using this
service will wait for r
ecovery to complete.
2006-12-20 21:47:30 LustreError: Skipped 1 previous similar message
2006-12-20 21:47:30 Lustre: OSC_igs-mds1_OST_igs19_MNT_crater_client-cc685200:
Connection restored to service OST_igs19 using nid 172.16.2.119@tcp.
2006-12-20 21:47:30 Lustre: Skipped 1 previous similar message
2006-12-20 21:48:35 LustreError: 26964:0:(qswlnd_cb.c:575:kqswnal_txhandler())
14@elan RPC status ffffffb9
2006-12-20 21:48:35 LustreError: 3489:0:(events.c:129:client_bulk_callback())
event type 0, status -71, desc daefb000
2006-12-20 21:49:49 LustreError: 0:0:(qswlnd_cb.c:575:kqswnal_txhandler())
12@elan RPC status ffffffb9
2006-12-20 21:49:49 LustreError: 3489:0:(events.c:129:client_bulk_callback())
event type 0, status -71, desc f2b71000
2006-12-20 21:53:34 LustreError:
3888:0:(client.c:940:ptlrpc_expire_one_request()) @@@ timeout (sent at
1166680113, 301s ago) req@e7db3a00 x10095074/t0
o4->OST_igs1_3_UUID@igs1_UUID:28
lens 328/288 ref 2 fl R
pc:/0/0 rc 0/0
2006-12-20 21:53:34 LustreError:
3888:0:(client.c:940:ptlrpc_expire_one_request()) Skipped 3 previous similar
messages
2006-12-20 21:53:34 LustreError:
OSC_igs-mds1_OST_igs1_3_MNT_crater_client-cc685200: Connection to service
OST_igs1_3 via nid 172.16.2.101@tcp was lost; in progress operations using this
service will wait for
 recovery to complete.
2006-12-20 21:53:34 LustreError: Skipped 3 previous similar messages
2006-12-20 21:54:46 LustreError:
3888:0:(client.c:940:ptlrpc_expire_one_request()) @@@ timeout (sent at
1166680186, 300s ago) req@c8841200 x10095680/t0
o4->OST_igs1_3_UUID@igs1_UUID:28
lens 328/288 ref 2 fl R
pc:/0/0 rc 0/0
2006-12-20 21:54:46 LustreError:
3888:0:(client.c:940:ptlrpc_expire_one_request()) Skipped 10 previous similar
messages
2006-12-20 21:54:52 LustreError:
3888:0:(client.c:940:ptlrpc_expire_one_request()) @@@ timeout (sent at
1166680192, 300s ago) req@e978da00 x10095716/t0
o4->OST_igs12_3_UUID@igs12_UUID:28 lens 328/288 ref 2 fl
 Rpc:/0/0 rc 0/0
2006-12-20 21:54:52 LustreError:
3888:0:(client.c:940:ptlrpc_expire_one_request()) Skipped 4 previous similar
messages
2006-12-20 21:55:18 LustreError:
3888:0:(client.c:940:ptlrpc_expire_one_request()) @@@ timeout (sent at
1166680218, 300s ago) req@e7db3c00 x10095910/t0
o4->OST_igs1_3_UUID@igs1_UUID:28
lens 328/288 ref 2 fl R
pc:/0/0 rc 0/0
2006-12-20 21:56:05 LustreError:
3888:0:(client.c:940:ptlrpc_expire_one_request()) @@@ timeout (sent at
1166680264, 301s ago) req@defe3200 x10096157/t0
o4->OST_igs2_3_UUID@igs2_UUID:28
lens 328/288 ref 2 fl R
pc:/0/0 rc 0/0
2006-12-20 21:58:33 LustreError:
3888:0:(client.c:940:ptlrpc_expire_one_request()) @@@ timeout (sent at
1166680413, 300s ago) req@c8841a00 x10097165/t0
o4->OST_igs1_3_UUID@igs1_UUID:28
lens 328/288 ref 2 fl R
pc:/0/0 rc 0/0
[hung]
If useful, I can attach the &Bt backtraces from this node, in its
hung state.