Hi,
we are experiencing some weird behaviour on one of our Lustre clients.
First some information about our environment:
- Lustre 1.8.1.1 CentOS52 (Kernel 2.6.18-92.1.13.el5)
- self-compiled patchless clients with quotas enabled (quotas not used at the
moment)
- Infiniband interconnect, OFED 1.3.3
- OSS: 2x Sun Fire X4540 with 48TB, with official sun kernel
2.6.18-128.7.1.el5_lustre.1.8.1.1, 4
OSTs on
each of them, OFED 1.3.3
- MDS: 2x Sun Fire X4100 M2 + 1x StorEdge 3320 with heartbeat failover, official
sun kernel
2.6.18-128.7.1.el5_lustre.1.8.1.1, OFED 1.4.2
- Lustre mounted on /hpcscr
- OSS1 is 192.168.60.238 at o2ib, hostname is hpc9oss1
- OSS2 is 192.168.60.237 at o2ib, hostname is hpc9oss2
- MDS1 is 192.168.60.240 at o2ib, hostname is hpc9mds1 (active)
- MDS2 is 192.168.60.239 at o2ib, hostname is hpc9mds2 (standby, was active for
a short time while
mds1 was lifted from 1.6.7.1 to 1.8.1.1)
- problematic client is 192.168.60.226 at o2ib, hostname hpc9master02
- no problems on the Infiniband
Problem:
Users report a slow Lustre filesystem on this particular machine (hpc9master02).
Running "find
/hpcscr -ls" gets stuck after some time and most of the time it continues
after some seconds, but
sometimes it takes several minutes and sometimes I get errors (one "I/O
error", then several "Cannot
send after transport endpoint shutdown") and the find terminates. The IB
error counters do not change
during this test.
This is the client syslog during a find run that aborted with the I/O error (see
below for MDS and
OSS syslogs):
The output is shortened and commented, I can provide the complete logs if
necessary.
--------------------------------------------
Feb 2 15:59:02 hpc9master02 kernel: Lustre:
5947:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from
192.168.60.240 at o2ib failed: 5
Feb 2 15:59:02 hpc9master02 last message repeated 11 times
Feb 2 15:59:02 hpc9master02 kernel: Lustre:
5946:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from
192.168.60.240 at o2ib failed: 5
Feb 2 15:59:02 hpc9master02 kernel: Lustre:
5947:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from
192.168.60.240 at o2ib failed: 5
Feb 2 15:59:02 hpc9master02 kernel: Lustre:
5947:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from
192.168.60.240 at o2ib failed: 5
Feb 2 15:59:02 hpc9master02 kernel: Lustre:
5946:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from
192.168.60.240 at o2ib failed: 5
Feb 2 15:59:02 hpc9master02 kernel: Lustre:
5947:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from
192.168.60.240 at o2ib failed: 5
Feb 2 15:59:02 hpc9master02 kernel: Lustre:
5946:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from
192.168.60.240 at o2ib failed: 5
Feb 2 15:59:02 hpc9master02 kernel: Lustre:
5157:0:(o2iblnd_cb.c:2384:kiblnd_reconnect())
192.168.60.240 at o2ib: retrying (stale), 12, 12, queue_dep: 8, max_frag: 256,
msg_size: 4096
Then for 192.168.60.238 at o2ib:
Feb 2 15:59:02 hpc9master02 kernel: Lustre:
5947:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from
192.168.60.238 at o2ib failed: 5
Feb 2 15:59:03 hpc9master02 kernel: Lustre:
5157:0:(o2iblnd_cb.c:2384:kiblnd_reconnect())
192.168.60.238 at o2ib: retrying (stale), 12, 12, queue_dep: 8, max_frag: 256,
msg_size: 4096
Then for 192.168.60.237 at o2ib:
Feb 2 15:59:03 hpc9master02 kernel: Lustre:
5947:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from
192.168.60.237 at o2ib failed: 5
Feb 2 15:59:03 hpc9master02 kernel: Lustre:
5942:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from
192.168.60.237 at o2ib failed: 5
Feb 2 15:59:03 hpc9master02 kernel: Lustre:
5942:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from
192.168.60.237 at o2ib failed: 5
Feb 2 15:59:03 hpc9master02 kernel: Lustre:
5157:0:(o2iblnd_cb.c:2384:kiblnd_reconnect())
192.168.60.237 at o2ib: retrying (stale), 12, 12, queue_dep: 8, max_frag: 256,
msg_size: 4096
Again for 192.168.60.240 at o2ib:
Feb 2 15:59:03 hpc9master02 kernel: Lustre:
5947:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from
192.168.60.240 at o2ib failed: 5
Feb 2 15:59:03 hpc9master02 kernel: Lustre:
5157:0:(o2iblnd_cb.c:2384:kiblnd_reconnect())
192.168.60.240 at o2ib: retrying (stale), 12, 12, queue_dep: 8, max_frag: 256,
msg_size: 4096
And then the connection gets closed:
Feb 2 16:00:17 hpc9master02 kernel: Lustre:
5151:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked())
Closing conn to 192.168.60.240 at o2ib: error 0(waiting)
Feb 2 16:00:17 hpc9master02 kernel: LustreError:
5945:0:(events.c:66:request_out_callback()) @@@
type 4, status -5 req at ffff811f83619000 x1326214018219145/t0
o101->cfd1-MDT0000_UUID at 192.168.60.240@o2ib:12/10 lens 512/1232 e 0 t
o 1 dl 1265122824 ref 2 fl Rpc:P/0/0 rc 0/0
Afterwards it all starts over again until these messages appear:
Feb 2 16:02:54 hpc9master02 kernel: Lustre:
29726:0:(client.c:1383:ptlrpc_expire_one_request()) @@@
Request x1326214018348828 sent from cfd1-MDT0000-mdc-ffff811026fc1400 to NID
192.168.60.240 at o2ib 7s
ago has timed out (limit 7s).
Feb 2 16:02:54 hpc9master02 kernel: req at ffff811ec8cb2c00
x1326214018348828/t0
o101->cfd1-MDT0000_UUID at 192.168.60.240@o2ib:12/10 lens 568/1232 e 0 to 1
dl 1265122974 ref 1 fl
Rpc:/0/0 rc 0/0
Feb 2 16:02:54 hpc9master02 kernel: Lustre:
29726:0:(client.c:1383:ptlrpc_expire_one_request())
Skipped 5 previous similar messages
Feb 2 16:02:54 hpc9master02 kernel: Lustre: cfd1-MDT0000-mdc-ffff811026fc1400:
Connection to
service cfd1-MDT0000 via nid 192.168.60.240 at o2ib was lost; in progress
operations using this service
will wait for recovery to complete.
Followed again by lots of RX failed messages...
--------------------------------------------
This is what the mds reports:
--------------------------------------------
Feb 2 15:58:12 hpc9mds1 kernel: Lustre:
5008:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked())
Closing conn to 192.168.60.226 at o2ib: error -116
Feb 2 15:59:02 hpc9mds1 last message repeated 4 times
Feb 2 15:59:27 hpc9mds1 last message repeated 3 times
Feb 2 15:59:27 hpc9mds1 kernel: Lustre:
5008:0:(o2iblnd_cb.c:2232:kiblnd_passive_connect()) Conn
race 192.168.60.226 at o2ib
Feb 2 15:59:27 hpc9mds1 kernel: Lustre:
5008:0:(o2iblnd_cb.c:2232:kiblnd_passive_connect()) Skipped
1 previous similar message
Feb 2 15:59:27 hpc9mds1 kernel: Lustre:
5008:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked())
Closing conn to 192.168.60.226 at o2ib: error -116
Feb 2 16:00:17 hpc9mds1 last message repeated 3 times
Feb 2 16:00:42 hpc9mds1 last message repeated 3 times
Feb 2 16:01:07 hpc9mds1 kernel: Lustre:
5007:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked())
Closing conn to 192.168.60.226 at o2ib: error -116
Feb 2 16:01:07 hpc9mds1 kernel: Lustre:
5008:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked())
Closing conn to 192.168.60.226 at o2ib: error -116
Feb 2 16:01:32 hpc9mds1 last message repeated 3 times
Feb 2 16:01:39 hpc9mds1 kernel: Lustre:
8664:0:(ldlm_lib.c:541:target_handle_reconnect())
cfd1-MDT0000: b80f9710-f288-9c2a-ee70-16c25f8ca97d reconnecting
Feb 2 16:01:39 hpc9mds1 kernel: Lustre:
8664:0:(ldlm_lib.c:541:target_handle_reconnect()) Skipped 2
previous similar messages
Feb 2 16:01:57 hpc9mds1 kernel: Lustre:
5008:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked())
Closing conn to 192.168.60.226 at o2ib: error -116
Feb 2 16:02:47 hpc9mds1 last message repeated 4 times
Feb 2 16:02:47 hpc9mds1 kernel: Lustre:
5008:0:(o2iblnd_cb.c:2232:kiblnd_passive_connect()) Conn
race 192.168.60.226 at o2ib
Feb 2 16:02:47 hpc9mds1 kernel: Lustre:
5008:0:(o2iblnd_cb.c:2232:kiblnd_passive_connect()) Skipped
1 previous similar message
Feb 2 16:02:54 hpc9mds1 kernel: Lustre:
5008:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked())
Closing conn to 192.168.60.226 at o2ib: error -116
Feb 2 16:03:37 hpc9mds1 last message repeated 3 times
Feb 2 16:04:52 hpc9mds1 last message repeated 6 times
Feb 2 16:06:07 hpc9mds1 last message repeated 8 times
Feb 2 16:07:22 hpc9mds1 last message repeated 7 times
Feb 2 16:07:22 hpc9mds1 kernel: Lustre:
5008:0:(o2iblnd_cb.c:2220:kiblnd_passive_connect()) Conn
stale 192.168.60.226 at o2ib [old ver: 12, new ver: 12]
Feb 2 16:07:22 hpc9mds1 kernel: Lustre:
5008:0:(o2iblnd_cb.c:2220:kiblnd_passive_connect()) Skipped
62 previous similar messages
Feb 2 16:07:29 hpc9mds1 kernel: Lustre:
5008:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked())
Closing conn to 192.168.60.226 at o2ib: error -116
Feb 2 16:07:47 hpc9mds1 kernel: Lustre:
5008:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked())
Closing conn to 192.168.60.226 at o2ib: error -116(waiting)
Feb 2 16:07:47 hpc9mds1 kernel: LustreError:
5781:0:(events.c:367:server_bulk_callback()) event
type 4, status -103, desc ffff81010df46d80
Feb 2 16:07:53 hpc9mds1 kernel: LustreError:
25108:0:(handler.c:161:mds_sendpage()) @@@ bulk
failed: timeout 0(4096), evicting b80f9710-f288-9c2a-ee70-16c25f8ca97d at
NET_0x50000c0a83ce2_UUID
Feb 2 16:07:53 hpc9mds1 kernel: req at ffff8100a7e2f000 x1326214018567382/t0
o37->b80f9710-f288-9c2a-ee70-16c25f8ca97d at NET_0x50000c0a83ce2_UUID:0/0
lens 408/360 e 1 to 0 dl
1265123298 ref 1 fl Interpret:/0/0 rc 0/0
Feb 2 16:07:54 hpc9mds1 kernel: Lustre:
5007:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked())
Closing conn to 192.168.60.226 at o2ib: error -116
Feb 2 16:08:03 hpc9mds1 kernel: BUG: soft lockup - CPU#1 stuck for 10s!
[ll_mdt_rdpg_12:25108]
Feb 2 16:08:03 hpc9mds1 kernel: CPU 1:
Feb 2 16:08:03 hpc9mds1 kernel: Modules linked in: mds(U) fsfilt_ldiskfs(U)
mgs(U) mgc(U)
ldiskfs(U) crc16(U) nfsd(U) exportfs(U) lockd(U) nfs_acl(U) auth_rpcgss(U)
ipmi_devintf(U)
ipmi_si(U) ipmi_msghandler(U) hidp(U) rfcomm(U
) l2cap(U) bluetooth(U) sunrpc(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U)
ksocklnd(U) ko2iblnd(U)
ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) rdma_ucm(U) qlgc_vnic(U)
ib_sdp(U) rdma_cm(U)
iw_cm(U) ib_addr(U) ib_ipoib(U) ipo
ib_helper(U) ib_cm(U) ib_sa(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) ib_uverbs(U)
ib_umad(U)
iw_cxgb3(U) cxgb3(U) ib_ipath(U) ib_mthca(U) mptctl(U) dm_multipath(U)
scsi_dh(U) video(U)
backlight(U) sbs(U) i2c_ec(U) button(U) batter
y(U) asus_acpi(U) acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U)
mlx4_ib(U) ib_mad(U)
ib_core(U) joydev(U) mlx4_core(U) i2c_nforce2(U) sr_mod(U) pcspkr(U) ide_cd(U)
serio_raw(U)
i2c_core(U) shpchp(U) e1000(U) k8temp(U)
forcedeth(U) hwmon(U) cdrom(U) sg(U) edac_mc(U) dm_snapshot(U) dm_zero(U)
dm_mirror(U) dm_log(U)
dm_mod(U) usb_stora
Feb 2 16:08:03 hpc9mds1 kernel: e(U) mptfc(U) scsi_transport_fc(U) mptspi(U)
scsi_transport_spi(U)
mptsas(U) mptscsih(U) mptbase(U) scsi_transport_sas(U) sd_mod(U) scsi_mod(U)
ext3(U) jbd(U)
uhci_hcd(U) ohci_hcd(U) ehci_hcd(U)
Feb 2 16:08:03 hpc9mds1 kernel: Pid: 25108, comm: ll_mdt_rdpg_12 Tainted: G
2.6.18-128.7.1.el5_lustre.1.8.1.1 #1
Feb 2 16:08:03 hpc9mds1 kernel: RIP: 0010:[<ffffffff8006de08>]
[<ffffffff8006de08>]
do_gettimeoffset_hpet+0x7/0x28
Feb 2 16:08:03 hpc9mds1 kernel: RSP: 0018:ffff8102050555b8 EFLAGS: 00000206
Feb 2 16:08:03 hpc9mds1 kernel: RAX: 0000000051f252e8 RBX: ffff8102050555f0
RCX: 0000000000000383
Feb 2 16:08:03 hpc9mds1 kernel: RDX: 0000000000000040 RSI: 0000000000010000
RDI: ffff8102050555f0
Feb 2 16:08:03 hpc9mds1 kernel: RBP: 6666666638313031 R08: 0000000000000000
R09: 0000000000000383
Feb 2 16:08:03 hpc9mds1 kernel: R10: 0000000000000000 R11: 0000000000000000
R12: 6463313434363031
Feb 2 16:08:03 hpc9mds1 kernel: R13: 0000000000000000 R14: 00000000ffffffff
R15: ffff81010f780080
Feb 2 16:08:03 hpc9mds1 kernel: FS: 00002b3759c5c220(0000)
GS:ffff810103f99440(0000)
knlGS:00000000f7f9d6c0
Feb 2 16:08:03 hpc9mds1 kernel: CS: 0010 DS: 0000 ES: 0000 CR0:
000000008005003b
Feb 2 16:08:03 hpc9mds1 kernel: CR2: 00002b3ef0e64240 CR3: 0000000000201000
CR4: 00000000000006e0
Feb 2 16:08:03 hpc9mds1 kernel:
Feb 2 16:08:03 hpc9mds1 kernel: Call Trace:
Feb 2 16:08:03 hpc9mds1 kernel: [<ffffffff8006dc92>]
do_gettimeofday+0x40/0x8f
Feb 2 16:08:03 hpc9mds1 kernel: [<ffffffff88636164>]
:libcfs:set_ptldebug_header+0x34/0xa0
Feb 2 16:08:03 hpc9mds1 kernel: [<ffffffff8863e300>]
:libcfs:libcfs_debug_vmsg2+0x70/0x990
Feb 2 16:08:03 hpc9mds1 kernel: [<ffffffff8878a7ef>]
:ptlrpc:ptlrpc_set_destroy+0x37f/0x390
Feb 2 16:08:03 hpc9mds1 kernel: [<ffffffff887555e0>]
:ptlrpc:ldlm_cancel_locks_for_export_cb+0x0/0xc0
Feb 2 16:08:03 hpc9mds1 kernel: [<ffffffff887555e0>]
:ptlrpc:ldlm_cancel_locks_for_export_cb+0x0/0xc0
Feb 2 16:08:03 hpc9mds1 kernel: [<ffffffff887588d5>]
:ptlrpc:ldlm_resource_putref+0x395/0x3a0
Feb 2 16:08:03 hpc9mds1 kernel: [<ffffffff887555e0>]
:ptlrpc:ldlm_cancel_locks_for_export_cb+0x0/0xc0
Feb 2 16:08:03 hpc9mds1 kernel: [<ffffffff887578f9>]
:ptlrpc:ldlm_resource_getref+0xc9/0xe0
Feb 2 16:08:03 hpc9mds1 kernel: [<ffffffff88755601>]
:ptlrpc:ldlm_cancel_locks_for_export_cb+0x21/0xc0
Feb 2 16:08:03 hpc9mds1 kernel: [<ffffffff886d4b3b>]
:obdclass:lustre_hash_for_each_empty+0x1cb/0x2b0
Feb 2 16:08:03 hpc9mds1 kernel: [<ffffffff88bf05e1>]
:mds:mds_disconnect+0x121/0xe40
Feb 2 16:08:03 hpc9mds1 kernel: [<ffffffff8003d382>]
lock_timer_base+0x1b/0x3c
Feb 2 16:08:03 hpc9mds1 kernel: [<ffffffff8004aa63>]
try_to_del_timer_sync+0x51/0x5a
Feb 2 16:08:03 hpc9mds1 kernel: [<ffffffff886d7994>]
:obdclass:class_fail_export+0x384/0x4c0
Feb 2 16:08:03 hpc9mds1 kernel: [<ffffffff88bea311>]
:mds:mds_readpage+0x1591/0x18c0
Feb 2 16:08:03 hpc9mds1 kernel: [<ffffffff8001a1b8>]
vsnprintf+0x559/0x59e
Feb 2 16:08:03 hpc9mds1 kernel: [<ffffffff8008a3f3>]
default_wake_function+0x0/0xe
Feb 2 16:08:03 hpc9mds1 kernel: [<ffffffff88bed732>]
:mds:mds_handle+0x2172/0x4cd0
Feb 2 16:08:03 hpc9mds1 kernel: [<ffffffff8863b5d8>]
:libcfs:libcfs_ip_addr2str+0x38/0x40
Feb 2 16:08:03 hpc9mds1 kernel: [<ffffffff8863b9ee>]
:libcfs:libcfs_nid2str+0xbe/0x110
Feb 2 16:08:04 hpc9mds1 kernel: [<ffffffff887a2357>]
:ptlrpc:ptlrpc_server_handle_request+0xa97/0x1170
Feb 2 16:08:04 hpc9mds1 kernel: [<ffffffff8008881d>]
__wake_up_common+0x3e/0x68
Feb 2 16:08:04 hpc9mds1 kernel: [<ffffffff887a5e08>]
:ptlrpc:ptlrpc_main+0x1218/0x13e0
Feb 2 16:08:04 hpc9mds1 kernel: [<ffffffff8008a3f3>]
default_wake_function+0x0/0xe
Feb 2 16:08:04 hpc9mds1 kernel: [<ffffffff800b48f2>]
audit_syscall_exit+0x33c/0x357
Feb 2 16:08:04 hpc9mds1 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
Feb 2 16:08:04 hpc9mds1 kernel: [<ffffffff887a4bf0>]
:ptlrpc:ptlrpc_main+0x0/0x13e0
Feb 2 16:08:04 hpc9mds1 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11
Feb 2 16:08:04 hpc9mds1 kernel:
Feb 2 16:08:12 hpc9mds1 kernel: Lustre:
5007:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked())
Closing conn to 192.168.60.226 at o2ib: error -116
Feb 2 16:08:13 hpc9mds1 kernel: BUG: soft lockup - CPU#1 stuck for 10s!
[ll_mdt_rdpg_12:25108]
Feb 2 16:08:13 hpc9mds1 kernel: CPU 1:
Feb 2 16:08:13 hpc9mds1 kernel: Modules linked in: mds(U) fsfilt_ldiskfs(U)
mgs(U) mgc(U)
ldiskfs(U) crc16(U) nfsd(U) exportfs(U) lockd(U) nfs_acl(U) auth_rpcgss(U)
ipmi_devintf(U)
ipmi_si(U) ipmi_msghandler(U) hidp(U) rfcomm(U
) l2cap(U) bluetooth(U) sunrpc(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U)
ksocklnd(U) ko2iblnd(U)
ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) rdma_ucm(U) qlgc_vnic(U)
ib_sdp(U) rdma_cm(U)
iw_cm(U) ib_addr(U) ib_ipoib(U) ipo
ib_helper(U) ib_cm(U) ib_sa(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) ib_uverbs(U)
ib_umad(U)
iw_cxgb3(U) cxgb3(U) ib_ipath(U) ib_mthca(U) mptctl(U) dm_multipath(U)
scsi_dh(U) video(U)
backlight(U) sbs(U) i2c_ec(U) button(U) batter
y(U) asus_acpi(U) acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U)
mlx4_ib(U) ib_mad(U)
ib_core(U) joydev(U) mlx4_core(U) i2c_nforce2(U) sr_mod(U) pcspkr(U) ide_cd(U)
serio_raw(U)
i2c_core(U) shpchp(U) e1000(U) k8temp(U)
forcedeth(U) hwmon(U) cdrom(U) sg(U) edac_mc(U) dm_snapshot(U) dm_zero(U)
dm_mirror(U) dm_log(U)
dm_mod(U) usb_stora
Feb 2 16:08:13 hpc9mds1 kernel: e(U) mptfc(U) scsi_transport_fc(U) mptspi(U)
scsi_transport_spi(U)
mptsas(U) mptscsih(U) mptbase(U) scsi_transport_sas(U) sd_mod(U) scsi_mod(U)
ext3(U) jbd(U)
uhci_hcd(U) ohci_hcd(U) ehci_hcd(U)
Feb 2 16:08:13 hpc9mds1 kernel: Pid: 25108, comm: ll_mdt_rdpg_12 Tainted: G
2.6.18-128.7.1.el5_lustre.1.8.1.1 #1
Feb 2 16:08:13 hpc9mds1 kernel: RIP: 0010:[<ffffffff80064ba8>]
[<ffffffff80064ba8>]
_spin_unlock_irqrestore+0x8/0x9
Feb 2 16:08:13 hpc9mds1 kernel: RSP: 0018:ffff810205055578 EFLAGS: 00000293
Feb 2 16:08:13 hpc9mds1 kernel: RAX: 000000000000037e RBX: ffff81021f8382a0
RCX: 0000000000000001
Feb 2 16:08:13 hpc9mds1 kernel: RDX: 000000000000032e RSI: 0000000000000293
RDI: ffff81010f780080
Feb 2 16:08:13 hpc9mds1 kernel: RBP: ffff810205055610 R08: 00000000ffffffff
R09: 0000000000000020
Feb 2 16:08:13 hpc9mds1 kernel: R10: 0000000000000000 R11: 0000000000000000
R12: 0000000000000000
Feb 2 16:08:13 hpc9mds1 kernel: R13: ffffffff8001a1b8 R14: ffff8100d9680000
R15: 0000000000000000
Feb 2 16:08:13 hpc9mds1 kernel: FS: 00002b3759c5c220(0000)
GS:ffff810103f99440(0000)
knlGS:00000000f7f9d6c0
Feb 2 16:08:13 hpc9mds1 kernel: CS: 0010 DS: 0000 ES: 0000 CR0:
000000008005003b
Feb 2 16:08:13 hpc9mds1 kernel: CR2: 00002b3ef0e64240 CR3: 0000000000201000
CR4: 00000000000006e0
Feb 2 16:08:13 hpc9mds1 kernel:
Feb 2 16:08:13 hpc9mds1 kernel: Call Trace:
Feb 2 16:08:13 hpc9mds1 kernel: [<ffffffff8863e978>]
:libcfs:libcfs_debug_vmsg2+0x6e8/0x990
Feb 2 16:08:13 hpc9mds1 kernel: [<ffffffff887555e0>]
:ptlrpc:ldlm_cancel_locks_for_export_cb+0x0/0xc0
Feb 2 16:08:13 hpc9mds1 kernel: [<ffffffff88753935>]
:ptlrpc:ldlm_lock_put+0x3c5/0x3d0
Feb 2 16:08:13 hpc9mds1 kernel: [<ffffffff887555e0>]
:ptlrpc:ldlm_cancel_locks_for_export_cb+0x0/0xc0
Feb 2 16:08:13 hpc9mds1 kernel: [<ffffffff88771f31>]
:ptlrpc:ldlm_export_lock_put+0xd1/0xe0
Feb 2 16:08:13 hpc9mds1 kernel: [<ffffffff887555e0>]
:ptlrpc:ldlm_cancel_locks_for_export_cb+0x0/0xc0
Feb 2 16:08:13 hpc9mds1 kernel: [<ffffffff886d523c>]
:obdclass:lustre_hash_del+0x33c/0x370
Feb 2 16:08:13 hpc9mds1 kernel: [<ffffffff8875311a>]
:ptlrpc:ldlm_lock_destroy_internal+0x1ca/0x250
Feb 2 16:08:13 hpc9mds1 kernel: [<ffffffff88755408>]
:ptlrpc:ldlm_lock_destroy_nolock+0x68/0xd0
Feb 2 16:08:13 hpc9mds1 kernel: [<ffffffff88755548>]
:ptlrpc:ldlm_lock_cancel+0xd8/0x170
Feb 2 16:08:13 hpc9mds1 kernel: [<ffffffff8875566e>]
:ptlrpc:ldlm_cancel_locks_for_export_cb+0x8e/0xc0
Feb 2 16:08:13 hpc9mds1 kernel: [<ffffffff886d4b3b>]
:obdclass:lustre_hash_for_each_empty+0x1cb/0x2b0
Feb 2 16:08:13 hpc9mds1 kernel: [<ffffffff88bf05e1>]
:mds:mds_disconnect+0x121/0xe40
Feb 2 16:08:13 hpc9mds1 kernel: [<ffffffff8003d382>]
lock_timer_base+0x1b/0x3c
Feb 2 16:08:13 hpc9mds1 kernel: [<ffffffff8004aa63>]
try_to_del_timer_sync+0x51/0x5a
Feb 2 16:08:13 hpc9mds1 kernel: [<ffffffff886d7994>]
:obdclass:class_fail_export+0x384/0x4c0
Feb 2 16:08:13 hpc9mds1 kernel: [<ffffffff88bea311>]
:mds:mds_readpage+0x1591/0x18c0
Feb 2 16:08:13 hpc9mds1 kernel: [<ffffffff8001a1b8>]
vsnprintf+0x559/0x59e
Feb 2 16:08:14 hpc9mds1 kernel: [<ffffffff8008a3f3>]
default_wake_function+0x0/0xe
Feb 2 16:08:14 hpc9mds1 kernel: [<ffffffff88bed732>]
:mds:mds_handle+0x2172/0x4cd0
Feb 2 16:08:14 hpc9mds1 kernel: [<ffffffff8863b5d8>]
:libcfs:libcfs_ip_addr2str+0x38/0x40
Feb 2 16:08:14 hpc9mds1 kernel: [<ffffffff8863b9ee>]
:libcfs:libcfs_nid2str+0xbe/0x110
Feb 2 16:08:14 hpc9mds1 kernel: [<ffffffff887a2357>]
:ptlrpc:ptlrpc_server_handle_request+0xa97/0x1170
Feb 2 16:08:14 hpc9mds1 kernel: [<ffffffff8008881d>]
__wake_up_common+0x3e/0x68
Feb 2 16:08:14 hpc9mds1 kernel: [<ffffffff887a5e08>]
:ptlrpc:ptlrpc_main+0x1218/0x13e0
Feb 2 16:08:14 hpc9mds1 kernel: [<ffffffff8008a3f3>]
default_wake_function+0x0/0xe
Feb 2 16:08:14 hpc9mds1 kernel: [<ffffffff800b48f2>]
audit_syscall_exit+0x33c/0x357
Feb 2 16:08:14 hpc9mds1 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
Feb 2 16:08:14 hpc9mds1 kernel: [<ffffffff887a4bf0>]
:ptlrpc:ptlrpc_main+0x0/0x13e0
Feb 2 16:08:14 hpc9mds1 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11
Feb 2 16:08:14 hpc9mds1 kernel:
Feb 2 16:08:19 hpc9mds1 kernel: Lustre:
5007:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked())
Closing conn to 192.168.60.226 at o2ib: error -116
Feb 2 16:08:23 hpc9mds1 kernel: BUG: soft lockup - CPU#1 stuck for 10s!
[ll_mdt_rdpg_12:25108]
Feb 2 16:08:23 hpc9mds1 kernel: CPU 1:
Feb 2 16:08:23 hpc9mds1 kernel: Modules linked in: mds(U) fsfilt_ldiskfs(U)
mgs(U) mgc(U)
ldiskfs(U) crc16(U) nfsd(U) exportfs(U) lockd(U) nfs_acl(U) auth_rpcgss(U)
ipmi_devintf(U)
ipmi_si(U) ipmi_msghandler(U) hidp(U) rfcomm(U
) l2cap(U) bluetooth(U) sunrpc(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U)
ksocklnd(U) ko2iblnd(U)
ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) rdma_ucm(U) qlgc_vnic(U)
ib_sdp(U) rdma_cm(U)
iw_cm(U) ib_addr(U) ib_ipoib(U) ipo
ib_helper(U) ib_cm(U) ib_sa(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) ib_uverbs(U)
ib_umad(U)
iw_cxgb3(U) cxgb3(U) ib_ipath(U) ib_mthca(U) mptctl(U) dm_multipath(U)
scsi_dh(U) video(U)
backlight(U) sbs(U) i2c_ec(U) button(U) batter
y(U) asus_acpi(U) acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U)
mlx4_ib(U) ib_mad(U)
ib_core(U) joydev(U) mlx4_core(U) i2c_nforce2(U) sr_mod(U) pcspkr(U) ide_cd(U)
serio_raw(U)
i2c_core(U) shpchp(U) e1000(U) k8temp(U)
forcedeth(U) hwmon(U) cdrom(U) sg(U) edac_mc(U) dm_snapshot(U) dm_zero(U)
dm_mirror(U) dm_log(U)
dm_mod(U) usb_stora
Feb 2 16:08:23 hpc9mds1 kernel: e(U) mptfc(U) scsi_transport_fc(U) mptspi(U)
scsi_transport_spi(U)
mptsas(U) mptscsih(U) mptbase(U) scsi_transport_sas(U) sd_mod(U) scsi_mod(U)
ext3(U) jbd(U)
uhci_hcd(U) ohci_hcd(U) ehci_hcd(U)
Feb 2 16:08:23 hpc9mds1 kernel: Pid: 25108, comm: ll_mdt_rdpg_12 Tainted: G
2.6.18-128.7.1.el5_lustre.1.8.1.1 #1
Feb 2 16:08:23 hpc9mds1 kernel: RIP: 0010:[<ffffffff886d4b88>]
[<ffffffff886d4b88>]
:obdclass:lustre_hash_for_each_empty+0x218/0x2b0
Feb 2 16:08:23 hpc9mds1 kernel: RSP: 0018:ffff810205055830 EFLAGS: 00000206
Feb 2 16:08:23 hpc9mds1 kernel: RAX: 000000000000ffff RBX: 0000000000004ce1
RCX: 0000000000000001
Feb 2 16:08:23 hpc9mds1 kernel: RDX: 0000000000000c20 RSI: 0000000000000293
RDI: ffffc20005c3be1c
Feb 2 16:08:23 hpc9mds1 kernel: RBP: ffffffff88753935 R08: 00000000ffffffff
R09: 0000000000000020
Feb 2 16:08:23 hpc9mds1 kernel: R10: 0000000000000000 R11: 0000000000000000
R12: ffffffff887555e0
Feb 2 16:08:23 hpc9mds1 kernel: R13: ffff8101f16b9df8 R14: ffff8101e17ed888
R15: ffff8101f16b9dc0
Feb 2 16:08:23 hpc9mds1 kernel: FS: 00002b3759c5c220(0000)
GS:ffff810103f99440(0000)
knlGS:00000000f7f9d6c0
Feb 2 16:08:23 hpc9mds1 kernel: CS: 0010 DS: 0000 ES: 0000 CR0:
000000008005003b
Feb 2 16:08:23 hpc9mds1 kernel: CR2: 00002b3ef0e64240 CR3: 0000000000201000
CR4: 00000000000006e0
Feb 2 16:08:23 hpc9mds1 kernel:
Feb 2 16:08:23 hpc9mds1 kernel: Call Trace:
Feb 2 16:08:23 hpc9mds1 kernel: [<ffffffff88bf05e1>]
:mds:mds_disconnect+0x121/0xe40
Feb 2 16:08:23 hpc9mds1 kernel: [<ffffffff8003d382>]
lock_timer_base+0x1b/0x3c
Feb 2 16:08:23 hpc9mds1 kernel: [<ffffffff8004aa63>]
try_to_del_timer_sync+0x51/0x5a
Feb 2 16:08:23 hpc9mds1 kernel: [<ffffffff886d7994>]
:obdclass:class_fail_export+0x384/0x4c0
Feb 2 16:08:24 hpc9mds1 kernel: [<ffffffff88bea311>]
:mds:mds_readpage+0x1591/0x18c0
Feb 2 16:08:24 hpc9mds1 kernel: [<ffffffff8001a1b8>]
vsnprintf+0x559/0x59e
Feb 2 16:08:24 hpc9mds1 kernel: [<ffffffff8008a3f3>]
default_wake_function+0x0/0xe
Feb 2 16:08:24 hpc9mds1 kernel: [<ffffffff88bed732>]
:mds:mds_handle+0x2172/0x4cd0
Feb 2 16:08:24 hpc9mds1 kernel: [<ffffffff8863b5d8>]
:libcfs:libcfs_ip_addr2str+0x38/0x40
Feb 2 16:08:24 hpc9mds1 kernel: [<ffffffff8863b9ee>]
:libcfs:libcfs_nid2str+0xbe/0x110
Feb 2 16:08:24 hpc9mds1 kernel: [<ffffffff887a2357>]
:ptlrpc:ptlrpc_server_handle_request+0xa97/0x1170
Feb 2 16:08:24 hpc9mds1 kernel: [<ffffffff8008881d>]
__wake_up_common+0x3e/0x68
Feb 2 16:08:24 hpc9mds1 kernel: [<ffffffff887a5e08>]
:ptlrpc:ptlrpc_main+0x1218/0x13e0
Feb 2 16:08:24 hpc9mds1 kernel: [<ffffffff8008a3f3>]
default_wake_function+0x0/0xe
Feb 2 16:08:24 hpc9mds1 kernel: [<ffffffff800b48f2>]
audit_syscall_exit+0x33c/0x357
Feb 2 16:08:24 hpc9mds1 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
Feb 2 16:08:24 hpc9mds1 kernel: [<ffffffff887a4bf0>]
:ptlrpc:ptlrpc_main+0x0/0x13e0
Feb 2 16:08:24 hpc9mds1 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11
Feb 2 16:08:24 hpc9mds1 kernel:
Feb 2 16:08:37 hpc9mds1 kernel: Lustre:
5007:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked())
Closing conn to 192.168.60.226 at o2ib: error -116
Feb 2 16:08:44 hpc9mds1 kernel: Lustre:
5007:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked())
Closing conn to 192.168.60.226 at o2ib: error -116
Feb 2 16:08:47 hpc9mds1 kernel: BUG: soft lockup - CPU#1 stuck for 10s!
[ll_mdt_rdpg_12:25108]
Feb 2 16:08:47 hpc9mds1 kernel: CPU 1:
Feb 2 16:08:47 hpc9mds1 kernel: Modules linked in: mds(U) fsfilt_ldiskfs(U)
mgs(U) mgc(U)
ldiskfs(U) crc16(U) nfsd(U) exportfs(U) lockd(U) nfs_acl(U) auth_rpcgss(U)
ipmi_devintf(U)
ipmi_si(U) ipmi_msghandler(U) hidp(U) rfcomm(U
) l2cap(U) bluetooth(U) sunrpc(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U)
ksocklnd(U) ko2iblnd(U)
ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) rdma_ucm(U) qlgc_vnic(U)
ib_sdp(U) rdma_cm(U)
iw_cm(U) ib_addr(U) ib_ipoib(U) ipo
ib_helper(U) ib_cm(U) ib_sa(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) ib_uverbs(U)
ib_umad(U)
iw_cxgb3(U) cxgb3(U) ib_ipath(U) ib_mthca(U) mptctl(U) dm_multipath(U)
scsi_dh(U) video(U)
backlight(U) sbs(U) i2c_ec(U) button(U) batter
y(U) asus_acpi(U) acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U)
mlx4_ib(U) ib_mad(U)
ib_core(U) joydev(U) mlx4_core(U) i2c_nforce2(U) sr_mod(U) pcspkr(U) ide_cd(U)
serio_raw(U)
i2c_core(U) shpchp(U) e1000(U) k8temp(U)
forcedeth(U) hwmon(U) cdrom(U) sg(U) edac_mc(U) dm_snapshot(U) dm_zero(U)
dm_mirror(U) dm_log(U)
dm_mod(U) usb_stora
Feb 2 16:08:47 hpc9mds1 kernel: e(U) mptfc(U) scsi_transport_fc(U) mptspi(U)
scsi_transport_spi(U)
mptsas(U) mptscsih(U) mptbase(U) scsi_transport_sas(U) sd_mod(U) scsi_mod(U)
ext3(U) jbd(U)
uhci_hcd(U) ohci_hcd(U) ehci_hcd(U)
Feb 2 16:08:47 hpc9mds1 kernel: Pid: 25108, comm: ll_mdt_rdpg_12 Tainted: G
2.6.18-128.7.1.el5_lustre.1.8.1.1 #1
Feb 2 16:08:47 hpc9mds1 kernel: RIP: 0010:[<ffffffff886d4b88>]
[<ffffffff886d4b88>]
:obdclass:lustre_hash_for_each_empty+0x218/0x2b0
Feb 2 16:08:47 hpc9mds1 kernel: RSP: 0018:ffff810205055830 EFLAGS: 00000206
Feb 2 16:08:47 hpc9mds1 kernel: RAX: 000000000000ffff RBX: 0000000000001d7e
RCX: 0000000000000001
Feb 2 16:08:47 hpc9mds1 kernel: RDX: 000000000000029a RSI: 0000000000000293
RDI: ffffc20005c0c7ec
Feb 2 16:08:47 hpc9mds1 kernel: RBP: ffffffff88753935 R08: 00000000ffffffff
R09: 0000000000000020
Feb 2 16:08:47 hpc9mds1 kernel: R10: 0000000000000000 R11: 0000000000000000
R12: ffffffff887555e0
Feb 2 16:08:47 hpc9mds1 kernel: R13: ffff8101f16b9df8 R14: ffff810092b5ae88
R15: ffff8101f16b9dc0
Feb 2 16:08:47 hpc9mds1 kernel: FS: 00002b3759c5c220(0000)
GS:ffff810103f99440(0000)
knlGS:00000000f7f9d6c0
Feb 2 16:08:47 hpc9mds1 kernel: CS: 0010 DS: 0000 ES: 0000 CR0:
000000008005003b
Feb 2 16:08:47 hpc9mds1 kernel: CR2: 00002b3ef0e64240 CR3: 0000000000201000
CR4: 00000000000006e0
Feb 2 16:08:47 hpc9mds1 kernel:
Feb 2 16:08:47 hpc9mds1 kernel: Call Trace:
Feb 2 16:08:47 hpc9mds1 kernel: [<ffffffff88bf05e1>]
:mds:mds_disconnect+0x121/0xe40
Feb 2 16:08:47 hpc9mds1 kernel: [<ffffffff8003d382>]
lock_timer_base+0x1b/0x3c
Feb 2 16:08:47 hpc9mds1 kernel: [<ffffffff8004aa63>]
try_to_del_timer_sync+0x51/0x5a
Feb 2 16:08:48 hpc9mds1 kernel: [<ffffffff886d7994>]
:obdclass:class_fail_export+0x384/0x4c0
Feb 2 16:08:48 hpc9mds1 kernel: [<ffffffff88bea311>]
:mds:mds_readpage+0x1591/0x18c0
Feb 2 16:08:48 hpc9mds1 kernel: [<ffffffff8001a1b8>]
vsnprintf+0x559/0x59e
Feb 2 16:08:48 hpc9mds1 kernel: [<ffffffff8008a3f3>]
default_wake_function+0x0/0xe
Feb 2 16:08:48 hpc9mds1 kernel: [<ffffffff88bed732>]
:mds:mds_handle+0x2172/0x4cd0
Feb 2 16:08:48 hpc9mds1 kernel: [<ffffffff8863b5d8>]
:libcfs:libcfs_ip_addr2str+0x38/0x40
Feb 2 16:08:48 hpc9mds1 kernel: [<ffffffff8863b9ee>]
:libcfs:libcfs_nid2str+0xbe/0x110
Feb 2 16:08:48 hpc9mds1 kernel: [<ffffffff887a2357>]
:ptlrpc:ptlrpc_server_handle_request+0xa97/0x1170
Feb 2 16:08:48 hpc9mds1 kernel: [<ffffffff8008881d>]
__wake_up_common+0x3e/0x68
Feb 2 16:08:48 hpc9mds1 kernel: [<ffffffff887a5e08>]
:ptlrpc:ptlrpc_main+0x1218/0x13e0
Feb 2 16:08:48 hpc9mds1 kernel: [<ffffffff8008a3f3>]
default_wake_function+0x0/0xe
Feb 2 16:08:48 hpc9mds1 kernel: [<ffffffff800b48f2>]
audit_syscall_exit+0x33c/0x357
Feb 2 16:08:48 hpc9mds1 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
Feb 2 16:08:48 hpc9mds1 kernel: [<ffffffff887a4bf0>]
:ptlrpc:ptlrpc_main+0x0/0x13e0
Feb 2 16:08:48 hpc9mds1 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11
Feb 2 16:08:48 hpc9mds1 kernel:
Feb 2 16:09:01 hpc9mds1 kernel: BUG: soft lockup - CPU#1 stuck for 10s!
[ll_mdt_rdpg_12:25108]
Feb 2 16:09:01 hpc9mds1 kernel: CPU 1:
Feb 2 16:09:01 hpc9mds1 kernel: Modules linked in: mds(U) fsfilt_ldiskfs(U)
mgs(U) mgc(U)
ldiskfs(U) crc16(U) nfsd(U) exportfs(U) lockd(U) nfs_acl(U) auth_rpcgss(U)
ipmi_devintf(U)
ipmi_si(U) ipmi_msghandler(U) hidp(U) rfcomm(U
) l2cap(U) bluetooth(U) sunrpc(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U)
ksocklnd(U) ko2iblnd(U)
ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) rdma_ucm(U) qlgc_vnic(U)
ib_sdp(U) rdma_cm(U)
iw_cm(U) ib_addr(U) ib_ipoib(U) ipo
ib_helper(U) ib_cm(U) ib_sa(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) ib_uverbs(U)
ib_umad(U)
iw_cxgb3(U) cxgb3(U) ib_ipath(U) ib_mthca(U) mptctl(U) dm_multipath(U)
scsi_dh(U) video(U)
backlight(U) sbs(U) i2c_ec(U) button(U) batter
y(U) asus_acpi(U) acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U)
mlx4_ib(U) ib_mad(U)
ib_core(U) joydev(U) mlx4_core(U) i2c_nforce2(U) sr_mod(U) pcspkr(U) ide_cd(U)
serio_raw(U)
i2c_core(U) shpchp(U) e1000(U) k8temp(U)
forcedeth(U) hwmon(U) cdrom(U) sg(U) edac_mc(U) dm_snapshot(U) dm_zero(U)
dm_mirror(U) dm_log(U)
dm_mod(U) usb_stora
Feb 2 16:09:01 hpc9mds1 kernel: e(U) mptfc(U) scsi_transport_fc(U) mptspi(U)
scsi_transport_spi(U)
mptsas(U) mptscsih(U) mptbase(U) scsi_transport_sas(U) sd_mod(U) scsi_mod(U)
ext3(U) jbd(U)
uhci_hcd(U) ohci_hcd(U) ehci_hcd(U)
Feb 2 16:09:01 hpc9mds1 kernel: Pid: 25108, comm: ll_mdt_rdpg_12 Tainted: G
2.6.18-128.7.1.el5_lustre.1.8.1.1 #1
Feb 2 16:09:01 hpc9mds1 kernel: RIP: 0010:[<ffffffff8006de08>]
[<ffffffff8006de08>]
do_gettimeoffset_hpet+0x7/0x28
Feb 2 16:09:01 hpc9mds1 kernel: RSP: 0018:ffff810205055548 EFLAGS: 00000202
Feb 2 16:09:01 hpc9mds1 kernel: RAX: 00000000a85c775f RBX: ffff810205055580
RCX: 00000000000003c3
Feb 2 16:09:01 hpc9mds1 kernel: RDX: 0000000000000001 RSI: 0000000000010000
RDI: ffff810205055580
Feb 2 16:09:01 hpc9mds1 kernel: RBP: ffffffff8006dc92 R08: 0000000000000000
R09: 00000000000003c3
Feb 2 16:09:01 hpc9mds1 kernel: R10: 0000000000000000 R11: 0000000000000000
R12: 0000000000000010
Feb 2 16:09:01 hpc9mds1 kernel: R13: ffff81010f780080 R14: ffff8100a9034330
R15: 0000009a00000000
Feb 2 16:09:01 hpc9mds1 kernel: FS: 00002b3759c5c220(0000)
GS:ffff810103f99440(0000)
knlGS:00000000f7f9d6c0
Feb 2 16:09:01 hpc9mds1 kernel: CS: 0010 DS: 0000 ES: 0000 CR0:
000000008005003b
Feb 2 16:09:01 hpc9mds1 kernel: CR2: 00002b3ef0e64240 CR3: 0000000000201000
CR4: 00000000000006e0
Feb 2 16:09:01 hpc9mds1 kernel:
Feb 2 16:09:01 hpc9mds1 kernel: Call Trace:
Feb 2 16:09:01 hpc9mds1 kernel: [<ffffffff8006dc92>]
do_gettimeofday+0x40/0x8f
Feb 2 16:09:01 hpc9mds1 kernel: [<ffffffff88636164>]
:libcfs:set_ptldebug_header+0x34/0xa0
Feb 2 16:09:01 hpc9mds1 kernel: [<ffffffff8863e300>]
:libcfs:libcfs_debug_vmsg2+0x70/0x990
Feb 2 16:09:01 hpc9mds1 kernel: [<ffffffff8875086c>]
:ptlrpc:_ldlm_lock_debug+0x57c/0x6e0
Feb 2 16:09:01 hpc9mds1 kernel: [<ffffffff8878a7ef>]
:ptlrpc:ptlrpc_set_destroy+0x37f/0x390
Feb 2 16:09:01 hpc9mds1 kernel: [<ffffffff887555e0>]
:ptlrpc:ldlm_cancel_locks_for_export_cb+0x0/0xc0
Feb 2 16:09:01 hpc9mds1 kernel: [<ffffffff887555e0>]
:ptlrpc:ldlm_cancel_locks_for_export_cb+0x0/0xc0
Feb 2 16:09:01 hpc9mds1 kernel: [<ffffffff887588d5>]
:ptlrpc:ldlm_resource_putref+0x395/0x3a0
Feb 2 16:09:01 hpc9mds1 kernel: [<ffffffff88753703>]
:ptlrpc:ldlm_lock_put+0x193/0x3d0
Feb 2 16:09:01 hpc9mds1 kernel: [<ffffffff88771ecf>]
:ptlrpc:ldlm_export_lock_put+0x6f/0xe0
Feb 2 16:09:01 hpc9mds1 kernel: [<ffffffff8875567e>]
:ptlrpc:ldlm_cancel_locks_for_export_cb+0x9e/0xc0
Feb 2 16:09:01 hpc9mds1 kernel: [<ffffffff886d4b79>]
:obdclass:lustre_hash_for_each_empty+0x209/0x2b0
Feb 2 16:09:01 hpc9mds1 kernel: [<ffffffff88bf05e1>]
:mds:mds_disconnect+0x121/0xe40
Feb 2 16:09:01 hpc9mds1 kernel: [<ffffffff8003d382>]
lock_timer_base+0x1b/0x3c
Feb 2 16:09:01 hpc9mds1 kernel: [<ffffffff8004aa63>]
try_to_del_timer_sync+0x51/0x5a
Feb 2 16:09:01 hpc9mds1 kernel: [<ffffffff886d7994>]
:obdclass:class_fail_export+0x384/0x4c0
Feb 2 16:09:01 hpc9mds1 kernel: [<ffffffff88bea311>]
:mds:mds_readpage+0x1591/0x18c0
Feb 2 16:09:01 hpc9mds1 kernel: [<ffffffff8001a1b8>]
vsnprintf+0x559/0x59e
Feb 2 16:09:02 hpc9mds1 kernel: [<ffffffff8008a3f3>]
default_wake_function+0x0/0xe
Feb 2 16:09:02 hpc9mds1 kernel: [<ffffffff88bed732>]
:mds:mds_handle+0x2172/0x4cd0
Feb 2 16:09:02 hpc9mds1 kernel: [<ffffffff8863b5d8>]
:libcfs:libcfs_ip_addr2str+0x38/0x40
Feb 2 16:09:02 hpc9mds1 kernel: [<ffffffff8863b9ee>]
:libcfs:libcfs_nid2str+0xbe/0x110
Feb 2 16:09:02 hpc9mds1 kernel: [<ffffffff887a2357>]
:ptlrpc:ptlrpc_server_handle_request+0xa97/0x1170
Feb 2 16:09:02 hpc9mds1 kernel: [<ffffffff8008881d>]
__wake_up_common+0x3e/0x68
Feb 2 16:09:02 hpc9mds1 kernel: [<ffffffff887a5e08>]
:ptlrpc:ptlrpc_main+0x1218/0x13e0
Feb 2 16:09:02 hpc9mds1 kernel: [<ffffffff8008a3f3>]
default_wake_function+0x0/0xe
Feb 2 16:09:02 hpc9mds1 kernel: [<ffffffff800b48f2>]
audit_syscall_exit+0x33c/0x357
Feb 2 16:09:02 hpc9mds1 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
Feb 2 16:09:02 hpc9mds1 kernel: [<ffffffff887a4bf0>]
:ptlrpc:ptlrpc_main+0x0/0x13e0
Feb 2 16:09:02 hpc9mds1 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11
Feb 2 16:09:02 hpc9mds1 kernel:
Feb 2 16:09:02 hpc9mds1 kernel: Lustre:
5007:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked())
Closing conn to 192.168.60.226 at o2ib: error -116
Feb 2 16:09:09 hpc9mds1 kernel: Lustre:
5007:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked())
Closing conn to 192.168.60.226 at o2ib: error -116
Feb 2 16:09:11 hpc9mds1 kernel: BUG: soft lockup - CPU#1 stuck for 10s!
[ll_mdt_rdpg_12:25108]
Feb 2 16:09:11 hpc9mds1 kernel: CPU 1:
Feb 2 16:09:11 hpc9mds1 kernel: Modules linked in: mds(U) fsfilt_ldiskfs(U)
mgs(U) mgc(U)
ldiskfs(U) crc16(U) nfsd(U) exportfs(U) lockd(U) nfs_acl(U) auth_rpcgss(U)
ipmi_devintf(U)
ipmi_si(U) ipmi_msghandler(U) hidp(U) rfcomm(U
) l2cap(U) bluetooth(U) sunrpc(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U)
ksocklnd(U) ko2iblnd(U)
ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) rdma_ucm(U) qlgc_vnic(U)
ib_sdp(U) rdma_cm(U)
iw_cm(U) ib_addr(U) ib_ipoib(U) ipo
ib_helper(U) ib_cm(U) ib_sa(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) ib_uverbs(U)
ib_umad(U)
iw_cxgb3(U) cxgb3(U) ib_ipath(U) ib_mthca(U) mptctl(U) dm_multipath(U)
scsi_dh(U) video(U)
backlight(U) sbs(U) i2c_ec(U) button(U) batter
y(U) asus_acpi(U) acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U)
mlx4_ib(U) ib_mad(U)
ib_core(U) joydev(U) mlx4_core(U) i2c_nforce2(U) sr_mod(U) pcspkr(U) ide_cd(U)
serio_raw(U)
i2c_core(U) shpchp(U) e1000(U) k8temp(U)
forcedeth(U) hwmon(U) cdrom(U) sg(U) edac_mc(U) dm_snapshot(U) dm_zero(U)
dm_mirror(U) dm_log(U)
dm_mod(U) usb_stora
Feb 2 16:09:11 hpc9mds1 kernel: e(U) mptfc(U) scsi_transport_fc(U) mptspi(U)
scsi_transport_spi(U)
mptsas(U) mptscsih(U) mptbase(U) scsi_transport_sas(U) sd_mod(U) scsi_mod(U)
ext3(U) jbd(U)
uhci_hcd(U) ohci_hcd(U) ehci_hcd(U)
Feb 2 16:09:11 hpc9mds1 kernel: Pid: 25108, comm: ll_mdt_rdpg_12 Tainted: G
2.6.18-128.7.1.el5_lustre.1.8.1.1 #1
Feb 2 16:09:11 hpc9mds1 kernel: RIP: 0010:[<ffffffff80064aee>]
[<ffffffff80064aee>]
_write_lock+0x7/0xf
Feb 2 16:09:11 hpc9mds1 kernel: RSP: 0018:ffff810205055828 EFLAGS: 00000246
Feb 2 16:09:11 hpc9mds1 kernel: RAX: 000000000000ffff RBX: 0000000000000c49
RCX: 0000000000000001
Feb 2 16:09:11 hpc9mds1 kernel: RDX: 0000000000000ec0 RSI: 0000000000000293
RDI: ffffc20005bfb49c
Feb 2 16:09:11 hpc9mds1 kernel: RBP: ffff8101f16b9df8 R08: 00000000ffffffff
R09: 0000000000000020
Feb 2 16:09:11 hpc9mds1 kernel: R10: 0000000000000000 R11: 0000000000000000
R12: ffff8100c6087e88
Feb 2 16:09:11 hpc9mds1 kernel: R13: ffff8101f16b9dc0 R14: ffff8100c6086380
R15: ffff8100c6087e00
Feb 2 16:09:11 hpc9mds1 kernel: FS: 00002b3759c5c220(0000)
GS:ffff810103f99440(0000)
knlGS:00000000f7f9d6c0
Feb 2 16:09:11 hpc9mds1 kernel: CS: 0010 DS: 0000 ES: 0000 CR0:
000000008005003b
Feb 2 16:09:11 hpc9mds1 kernel: CR2: 00002b3ef0e64240 CR3: 0000000000201000
CR4: 00000000000006e0
Feb 2 16:09:11 hpc9mds1 kernel:
Feb 2 16:09:11 hpc9mds1 kernel: Call Trace:
Feb 2 16:09:11 hpc9mds1 kernel: [<ffffffff886d4ba7>]
:obdclass:lustre_hash_for_each_empty+0x237/0x2b0
Feb 2 16:09:11 hpc9mds1 kernel: [<ffffffff88bf05e1>]
:mds:mds_disconnect+0x121/0xe40
Feb 2 16:09:11 hpc9mds1 kernel: [<ffffffff8003d382>]
lock_timer_base+0x1b/0x3c
Feb 2 16:09:11 hpc9mds1 kernel: [<ffffffff8004aa63>]
try_to_del_timer_sync+0x51/0x5a
Feb 2 16:09:12 hpc9mds1 kernel: [<ffffffff886d7994>]
:obdclass:class_fail_export+0x384/0x4c0
Feb 2 16:09:12 hpc9mds1 kernel: [<ffffffff88bea311>]
:mds:mds_readpage+0x1591/0x18c0
Feb 2 16:09:12 hpc9mds1 kernel: [<ffffffff8001a1b8>]
vsnprintf+0x559/0x59e
Feb 2 16:09:12 hpc9mds1 kernel: [<ffffffff8008a3f3>]
default_wake_function+0x0/0xe
Feb 2 16:09:12 hpc9mds1 kernel: [<ffffffff88bed732>]
:mds:mds_handle+0x2172/0x4cd0
Feb 2 16:09:12 hpc9mds1 kernel: [<ffffffff8863b5d8>]
:libcfs:libcfs_ip_addr2str+0x38/0x40
Feb 2 16:09:12 hpc9mds1 kernel: [<ffffffff8863b9ee>]
:libcfs:libcfs_nid2str+0xbe/0x110
Feb 2 16:09:12 hpc9mds1 kernel: [<ffffffff887a2357>]
:ptlrpc:ptlrpc_server_handle_request+0xa97/0x1170
Feb 2 16:09:12 hpc9mds1 kernel: [<ffffffff8008881d>]
__wake_up_common+0x3e/0x68
Feb 2 16:09:12 hpc9mds1 kernel: [<ffffffff887a5e08>]
:ptlrpc:ptlrpc_main+0x1218/0x13e0
Feb 2 16:09:12 hpc9mds1 kernel: [<ffffffff8008a3f3>]
default_wake_function+0x0/0xe
Feb 2 16:09:12 hpc9mds1 kernel: [<ffffffff800b48f2>]
audit_syscall_exit+0x33c/0x357
Feb 2 16:09:12 hpc9mds1 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
Feb 2 16:09:12 hpc9mds1 kernel: [<ffffffff887a4bf0>]
:ptlrpc:ptlrpc_main+0x0/0x13e0
Feb 2 16:09:12 hpc9mds1 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11
Feb 2 16:09:12 hpc9mds1 kernel:
Feb 2 16:09:27 hpc9mds1 kernel: Lustre:
5007:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked())
Closing conn to 192.168.60.226 at o2ib: error -116
Feb 2 16:09:34 hpc9mds1 kernel: Lustre:
5007:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked())
Closing conn to 192.168.60.226 at o2ib: error -116
--------------------------------------------
The OSS1 shows this:
--------------------------------------------
Feb 2 15:58:37 hpc9oss1 kernel: Lustre:
7938:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked())
Closing conn to 192.168.60.226 at o2ib: error -116
Feb 2 15:59:02 hpc9oss1 last message repeated 3 times
Feb 2 15:59:27 hpc9oss1 kernel: Lustre:
7936:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked())
Closing conn to 192.168.60.226 at o2ib: error -116
Feb 2 15:59:59 hpc9oss1 last message repeated 3 times
Feb 2 16:01:07 hpc9oss1 last message repeated 5 times
Feb 2 16:01:07 hpc9oss1 kernel: Lustre:
7936:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked())
Closing conn to 192.168.60.226 at o2ib: error -116
Feb 2 16:01:32 hpc9oss1 kernel: Lustre:
7941:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked())
Closing conn to 192.168.60.226 at o2ib: error -116
Feb 2 16:01:39 hpc9oss1 kernel: Lustre:
7938:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked())
Closing conn to 192.168.60.226 at o2ib: error -116
Feb 2 16:02:22 hpc9oss1 last message repeated 3 times
Feb 2 16:02:22 hpc9oss1 kernel: Lustre:
7938:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked())
Closing conn to 192.168.60.226 at o2ib: error -116
Feb 2 16:02:47 hpc9oss1 kernel: Lustre:
7941:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked())
Closing conn to 192.168.60.226 at o2ib: error -116
Feb 2 16:02:48 hpc9oss1 kernel: Lustre:
7941:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked())
Closing conn to 192.168.60.226 at o2ib: error -116
Feb 2 16:03:12 hpc9oss1 kernel: Lustre:
7938:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked())
Closing conn to 192.168.60.226 at o2ib: error -116
Feb 2 16:03:38 hpc9oss1 last message repeated 3 times
Feb 2 16:04:02 hpc9oss1 kernel: Lustre:
7937:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked())
Closing conn to 192.168.60.226 at o2ib: error -116
Feb 2 16:04:04 hpc9oss1 kernel: Lustre:
7937:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked())
Closing conn to 192.168.60.226 at o2ib: error -116
Feb 2 16:04:21 hpc9oss1 kernel: Lustre:
10054:0:(ldlm_lib.c:541:target_handle_reconnect())
cfd1-OST0002: b80f9710-f288-9c2a-ee70-16c25f8ca97d reconnecting
Feb 2 16:04:21 hpc9oss1 kernel: Lustre:
10054:0:(ldlm_lib.c:541:target_handle_reconnect()) Skipped
1 previous similar message
Feb 2 16:04:27 hpc9oss1 kernel: Lustre:
7937:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked())
Closing conn to 192.168.60.226 at o2ib: error -116
Feb 2 16:05:17 hpc9oss1 last message repeated 4 times
Feb 2 16:06:07 hpc9oss1 last message repeated 5 times
Feb 2 16:06:32 hpc9oss1 kernel: Lustre:
7938:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked())
Closing conn to 192.168.60.226 at o2ib: error -116
Feb 2 16:06:57 hpc9oss1 last message repeated 2 times
Feb 2 16:06:57 hpc9oss1 kernel: Lustre:
7938:0:(o2iblnd_cb.c:2232:kiblnd_passive_connect()) Conn
race 192.168.60.226 at o2ib
Feb 2 16:06:57 hpc9oss1 kernel: Lustre:
7938:0:(o2iblnd_cb.c:2232:kiblnd_passive_connect()) Skipped
1 previous similar message
Feb 2 16:07:06 hpc9oss1 kernel: Lustre:
7938:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked())
Closing conn to 192.168.60.226 at o2ib: error -116
Feb 2 16:07:22 hpc9oss1 kernel: Lustre:
7942:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked())
Closing conn to 192.168.60.226 at o2ib: error -116
Feb 2 16:07:29 hpc9oss1 kernel: Lustre:
7937:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked())
Closing conn to 192.168.60.226 at o2ib: error -116
Feb 2 16:08:19 hpc9oss1 last message repeated 2 times
Feb 2 16:08:19 hpc9oss1 kernel: Lustre:
7937:0:(o2iblnd_cb.c:2220:kiblnd_passive_connect()) Conn
stale 192.168.60.226 at o2ib [old ver: 12, new ver: 12]
Feb 2 16:08:19 hpc9oss1 kernel: Lustre:
7937:0:(o2iblnd_cb.c:2220:kiblnd_passive_connect()) Skipped
46 previous similar messages
Feb 2 16:08:37 hpc9oss1 kernel: Lustre:
7937:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked())
Closing conn to 192.168.60.226 at o2ib: error -116
Feb 2 16:09:09 hpc9oss1 last message repeated 3 times
Feb 2 16:09:34 hpc9oss1 last message repeated 2 times
Feb 2 16:09:52 hpc9oss1 kernel: Lustre:
7941:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked())
Closing conn to 192.168.60.226 at o2ib: error -116
Feb 2 16:09:59 hpc9oss1 kernel: Lustre:
7937:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked())
Closing conn to 192.168.60.226 at o2ib: error -116
Feb 2 16:10:42 hpc9oss1 last message repeated 3 times
--------------------------------------------
The OSS2 shows this:
--------------------------------------------
Feb 2 15:58:37 hpc9oss2 kernel: Lustre:
7947:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked())
Closing conn to 192.168.60.226 at o2ib: error -116
Feb 2 15:58:37 hpc9oss2 kernel: Lustre:
7947:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked())
Closing conn to 192.168.60.226 at o2ib: error -116
Feb 2 15:59:02 hpc9oss2 kernel: Lustre:
7942:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked())
Closing conn to 192.168.60.226 at o2ib: error -116
Feb 2 15:59:52 hpc9oss2 last message repeated 4 times
Feb 2 16:00:42 hpc9oss2 last message repeated 5 times
Feb 2 16:01:07 hpc9oss2 kernel: Lustre:
7948:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked())
Closing conn to 192.168.60.226 at o2ib: error -116
Feb 2 16:01:07 hpc9oss2 kernel: Lustre:
7948:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked())
Closing conn to 192.168.60.226 at o2ib: error -116
Feb 2 16:01:32 hpc9oss2 kernel: Lustre:
7944:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked())
Closing conn to 192.168.60.226 at o2ib: error -116
Feb 2 16:02:04 hpc9oss2 last message repeated 3 times
Feb 2 16:02:04 hpc9oss2 kernel: Lustre:
10237:0:(ldlm_lib.c:541:target_handle_reconnect())
cfd1-OST0007: b80f9710-f288-9c2a-ee70-16c25f8ca97d reconnecting
Feb 2 16:02:22 hpc9oss2 kernel: Lustre:
7944:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked())
Closing conn to 192.168.60.226 at o2ib: error -116
Feb 2 16:03:12 hpc9oss2 last message repeated 4 times
Feb 2 16:04:27 hpc9oss2 last message repeated 6 times
Feb 2 16:05:42 hpc9oss2 last message repeated 6 times
Feb 2 16:06:32 hpc9oss2 last message repeated 5 times
Feb 2 16:06:57 hpc9oss2 kernel: Lustre:
7947:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked())
Closing conn to 192.168.60.226 at o2ib: error -116
Feb 2 16:07:04 hpc9oss2 last message repeated 2 times
Feb 2 16:07:04 hpc9oss2 kernel: Lustre:
10027:0:(ldlm_lib.c:541:target_handle_reconnect())
cfd1-OST0004: b80f9710-f288-9c2a-ee70-16c25f8ca97d reconnecting
Feb 2 16:07:22 hpc9oss2 kernel: Lustre:
7944:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked())
Closing conn to 192.168.60.226 at o2ib: error -116
Feb 2 16:08:07 hpc9oss2 last message repeated 3 times
Feb 2 16:08:07 hpc9oss2 kernel: Lustre:
7944:0:(o2iblnd_cb.c:2220:kiblnd_passive_connect()) Conn
stale 192.168.60.226 at o2ib [old ver: 12, new ver: 12]
Feb 2 16:08:07 hpc9oss2 kernel: Lustre:
7944:0:(o2iblnd_cb.c:2220:kiblnd_passive_connect()) Skipped
47 previous similar messages
Feb 2 16:08:12 hpc9oss2 kernel: Lustre:
7944:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked())
Closing conn to 192.168.60.226 at o2ib: error -116
Feb 2 16:09:02 hpc9oss2 last message repeated 4 times
Feb 2 16:09:09 hpc9oss2 kernel: Lustre:
7944:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked())
Closing conn to 192.168.60.226 at o2ib: error -116
Feb 2 16:09:27 hpc9oss2 kernel: Lustre:
7943:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked())
Closing conn to 192.168.60.226 at o2ib: error -116
Feb 2 16:09:59 hpc9oss2 last message repeated 3 times
--------------------------------------------
What''s the cause of those "Rx from 192.168.60.240 at o2ib failed:
5"? Any ideas what''s going on here?
Thanks,
Ulrich Sibiller
--
Vorstand/Board of Management:
Dr. Bernd Finkbeiner, Dr. Roland Niemeier,
Dr. Arno Steitz, Dr. Ingrid Zech
Vorsitzender des Aufsichtsrats/
Chairman of the Supervisory Board:
Michel Lepert
Sitz/Registered Office: Tuebingen
Registergericht/Registration Court: Stuttgart
Registernummer/Commercial Register No.: HRB 382196