Dear everyone,

We are stuck with a problem where the OSS keeps trying to connect to a dead client, or to a client whose IP address has changed, until we reboot that client. From the OSS log we get messages like the following:

Jul 7 14:45:07 com01 kernel: Lustre: 12180:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-202.122.37.79@tcp
Jul 7 14:45:07 com01 kernel: Lustre: 12180:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-202.122.37.79@tcp
Jul 7 14:45:07 com01 last message repeated 35 times
Jul 7 14:45:07 com01 kernel: Lustre: 12180:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-202.122.37.79@tcp
Jul 7 14:45:07 com01 kernel: Lustre: 12180:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-202.122.37.79@tcp
Jul 7 14:45:07 com01 last message repeated 2 times
Jul 7 14:45:07 com01 kernel: Lustre: 12180:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-202.122.37.79@tcp
Jul 7 14:45:07 com01 kernel: Lustre: 12180:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-202.122.37.79@tcp
Jul 7 14:45:11 com01 last message repeated 188807 times
Jul 7 14:45:11 com01 kernel: BUG: soft lockup - CPU#15 stuck for 10s! [ll_ost_118:12180]
Jul 7 14:45:11 com01 kernel: CPU 15:
Jul 7 14:45:11 com01 kernel: Modules linked in: obdfilter(U) fsfilt_ldiskfs(U) ost(U) mgc(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) ldiskfs(U) crc16(U) autofs4(U) hidp(U) rfcomm(U) l2cap(U) bluetooth(U) sunrpc(U) dm_multipath(U) scsi_dh(U) video(U) hwmon(U) backlight(U) sbs(U) i2c_ec(U) i2c_core(U) button(U) battery(U) asus_acpi(U) acpi_memhotplug(U) ac(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) parport_pc(U) lp(U) parport(U) ixgbe(U) pcspkr(U) shpchp(U) serio_raw(U) hpilo(U) sg(U) bnx2(U) dm_raid45(U) dm_message(U) dm_region_hash(U) dm_mem_cache(U) dm_snapshot(U) dm_zero(U) dm_mirror(U) dm_log(U) dm_mod(U) usb_storage(U) lpfc(U) scsi_transport_fc(U) cciss(U) sd_mod(U) scsi_mod(U) ext3(U) jbd(U) uhci_hcd(U) ohci_hcd(U) ehci_hcd(U)
Jul 7 14:45:11 com01 kernel: Pid: 12180, comm: ll_ost_118 Tainted: G M 2.6.18-128.7.1.el5_lustre.1.8.1.1 #1
Jul 7 14:45:11 com01 kernel: RIP: 0010:[<ffffffff8006dce9>]  [<ffffffff8006dce9>] do_gettimeoffset_tsc+0x8/0x39
Jul 7 14:45:11 com01 kernel: RSP: 0018:ffff8102797b92c0  EFLAGS: 00000202
Jul 7 14:45:11 com01 kernel: RAX: 00000000000106a5 RBX: ffff8102797b9300 RCX: 00000000009ce3bd
Jul 7 14:45:11 com01 kernel: RDX: 00000000bfebfbff RSI: 0000000000000100 RDI: ffff8102797b9300
Jul 7 14:45:11 com01 kernel: RBP: 0000000000000733 R08: 0000000000000000 R09: 0000000000000800
Jul 7 14:45:11 com01 kernel: R10: ffffffff8867dc7b R11: ffffffff8867dc4e R12: ffffffff88677379
Jul 7 14:45:11 com01 kernel: R13: ffff8104f46af953 R14: 00000000000006ad R15: 00000000ffffffff
Jul 7 14:45:11 com01 kernel: FS:  00002ab88e27f220(0000) GS:ffff81061fcf78c0(0000) knlGS:0000000000000000
......

We also found that one CPU is stuck:

[root@com01 ~]# grep CPU#5 /var/log/messages
Jul 7 04:28:59 com01 kernel: BUG: soft lockup - CPU#5 stuck for 10s! [ll_ost_118:12180]
Jul 7 04:29:43 com01 kernel: BUG: soft lockup - CPU#5 stuck for 10s! [ll_ost_118:12180]
Jul 7 04:30:03 com01 kernel: BUG: soft lockup - CPU#5 stuck for 10s! [ll_ost_118:12180]
Jul 7 04:30:23 com01 kernel: BUG: soft lockup - CPU#5 stuck for 10s! [ll_ost_118:12180]
Jul 7 04:30:45 com01 kernel: BUG: soft lockup - CPU#5 stuck for 10s! [ll_ost_118:12180]
Jul 7 04:31:25 com01 kernel: BUG: soft lockup - CPU#5 stuck for 10s! [ll_ost_118:12180]
Jul 7 04:32:52 com01 kernel: BUG: soft lockup - CPU#5 stuck for 10s! [ll_ost_118:12180]
Jul 7 04:33:12 com01 kernel: BUG: soft lockup - CPU#5 stuck for 10s! [ll_ost_118:12180]
Jul 7 04:33:55 com01 kernel: BUG: soft lockup - CPU#5 stuck for 10s! [ll_ost_118:12180]

I think the process "ll_ost_118" (PID 12180) is in charge of the connection between the client 12345-202.122.37.79@tcp and the OSS. We moved that client to another IP address, but the OSS does not recognize the change and keeps connecting to the original IP. Because of this, our monitoring raises alarms on and off: the monitor cannot reach the OSS (hostname com01) with the command "lctl ping com01", although after several seconds or minutes it works again. We also see a serrated (saw-tooth) pattern in the cpu_idle and cpu_wio graphs from ganglia monitoring; please have a look at the attached cpu_idle.bmp and cpu_wio.bmp.

So far our workaround has been to reboot the dead client, after which the OSS works well again, but that does not seem reasonable for a Lustre file system, especially in a case like this where the OSS keeps trying to connect to a non-existent IP. Is there a command or some other method to evict the dead client or unknown NID from the OSS manually? (A possible candidate is sketched in the P.S. below, but we are not sure it is the right interface.)

We really appreciate any help!

Best Regards
QiuLan Huang
2010-07-07

===================================================================
Computing Center, the Institute of High Energy Physics, China
Huang, Qiulan               Tel: (+86) 10 8823 6012-604
P.O. Box 918-7              Fax: (+86) 10 8823 6839
Beijing 100049 P.R. China   Email: huangql@ihep.ac.cn
===================================================================
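P.S. To make the question concrete, the following is only a sketch of what we have in mind, and we are not sure it is the correct or supported interface on Lustre 1.8; the OST name "testfs-OST0000" is just an example, and the NID is the dead client's old address from the log above. The idea would be to evict the stale export by NID through the obdfilter evict_client entry on the OSS:

# Evict the dead client's old NID from one OST on this OSS
# (repeat for each OST, or use a wildcard OST name with lctl set_param)
[root@com01 ~]# lctl set_param obdfilter.testfs-OST0000.evict_client=nid:202.122.37.79@tcp

# Or, via the /proc interface:
[root@com01 ~]# echo "nid:202.122.37.79@tcp" > /proc/fs/lustre/obdfilter/testfs-OST0000/evict_client

Would something like this be the recommended way to drop the connection to a non-existent IP, or is there a better method?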