Dear everyone,

We are stuck with a problem where the OSS keeps trying to connect to a dead client, or to a client whose IP address has changed, until we reboot that client. From the OSS log we get messages like the following:

Jul 7 14:45:07 com01 kernel: Lustre: 12180:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-202.122.37.79@tcp
Jul 7 14:45:07 com01 kernel: Lustre: 12180:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-202.122.37.79@tcp
Jul 7 14:45:07 com01 last message repeated 35 times
Jul 7 14:45:07 com01 kernel: Lustre: 12180:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-202.122.37.79@tcp
Jul 7 14:45:07 com01 kernel: Lustre: 12180:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-202.122.37.79@tcp
Jul 7 14:45:07 com01 last message repeated 2 times
Jul 7 14:45:07 com01 kernel: Lustre: 12180:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-202.122.37.79@tcp
Jul 7 14:45:07 com01 kernel: Lustre: 12180:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-202.122.37.79@tcp
Jul 7 14:45:11 com01 last message repeated 188807 times
Jul 7 14:45:11 com01 kernel: BUG: soft lockup - CPU#15 stuck for 10s! [ll_ost_118:12180]
Jul 7 14:45:11 com01 kernel: CPU 15:
Jul 7 14:45:11 com01 kernel: Modules linked in: obdfilter(U) fsfilt_ldiskfs(U) ost(U) mgc(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) ldiskfs(U) crc16(U) autofs4(U) hidp(U) rfcomm(U) l2cap(U) bluetooth(U) sunrpc(U) dm_multipath(U) scsi_dh(U) video(U) hwmon(U) backlight(U) sbs(U) i2c_ec(U) i2c_core(U) button(U) battery(U) asus_acpi(U) acpi_memhotplug(U) ac(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) parport_pc(U) lp(U) parport(U) ixgbe(U) pcspkr(U) shpchp(U) serio_raw(U) hpilo(U) sg(U) bnx2(U) dm_raid45(U) dm_message(U) dm_region_hash(U) dm_mem_cache(U) dm_snapshot(U) dm_zero(U) dm_mirror(U) dm_log(U) dm_mod(U) usb_storage(U) lpfc(U) scsi_transport_fc(U) cciss(U) sd_mod(U) scsi_mod(U) ext3(U) jbd(U) uhci_hcd(U) ohci_hcd(U) ehci_hcd(U)
Jul 7 14:45:11 com01 kernel: Pid: 12180, comm: ll_ost_118 Tainted: G M 2.6.18-128.7.1.el5_lustre.1.8.1.1 #1
Jul 7 14:45:11 com01 kernel: RIP: 0010:[<ffffffff8006dce9>]  [<ffffffff8006dce9>] do_gettimeoffset_tsc+0x8/0x39
Jul 7 14:45:11 com01 kernel: RSP: 0018:ffff8102797b92c0  EFLAGS: 00000202
Jul 7 14:45:11 com01 kernel: RAX: 00000000000106a5 RBX: ffff8102797b9300 RCX: 00000000009ce3bd
Jul 7 14:45:11 com01 kernel: RDX: 00000000bfebfbff RSI: 0000000000000100 RDI: ffff8102797b9300
Jul 7 14:45:11 com01 kernel: RBP: 0000000000000733 R08: 0000000000000000 R09: 0000000000000800
Jul 7 14:45:11 com01 kernel: R10: ffffffff8867dc7b R11: ffffffff8867dc4e R12: ffffffff88677379
Jul 7 14:45:11 com01 kernel: R13: ffff8104f46af953 R14: 00000000000006ad R15: 00000000ffffffff
Jul 7 14:45:11 com01 kernel: FS:  00002ab88e27f220(0000) GS:ffff81061fcf78c0(0000) knlGS:0000000000000000
......

We also found that one CPU is stuck:

[root@com01 ~]# grep CPU#5 /var/log/messages
Jul 7 04:28:59 com01 kernel: BUG: soft lockup - CPU#5 stuck for 10s! [ll_ost_118:12180]
Jul 7 04:29:43 com01 kernel: BUG: soft lockup - CPU#5 stuck for 10s! [ll_ost_118:12180]
Jul 7 04:30:03 com01 kernel: BUG: soft lockup - CPU#5 stuck for 10s! [ll_ost_118:12180]
Jul 7 04:30:23 com01 kernel: BUG: soft lockup - CPU#5 stuck for 10s! [ll_ost_118:12180]
Jul 7 04:30:45 com01 kernel: BUG: soft lockup - CPU#5 stuck for 10s! [ll_ost_118:12180]
Jul 7 04:31:25 com01 kernel: BUG: soft lockup - CPU#5 stuck for 10s! [ll_ost_118:12180]
Jul 7 04:32:52 com01 kernel: BUG: soft lockup - CPU#5 stuck for 10s! [ll_ost_118:12180]
Jul 7 04:33:12 com01 kernel: BUG: soft lockup - CPU#5 stuck for 10s! [ll_ost_118:12180]
Jul 7 04:33:55 com01 kernel: BUG: soft lockup - CPU#5 stuck for 10s! [ll_ost_118:12180]

I think the process "ll_ost_118" (PID 12180) is in charge of the connection between the client 12345-202.122.37.79@tcp and the OSS. We moved that client to another IP address, but the OSS does not recognize the change and keeps connecting to the original IP. Because of this, our monitoring raises alarms on and off: the monitor cannot reach the OSS (hostname com01) with the command "lctl ping com01", although after several seconds or minutes it works again. We also see a serrated (saw-tooth) pattern in the cpu_idle and cpu_wio graphs from ganglia monitoring; please have a look at the attached cpu_idle.bmp and cpu_wio.bmp.

So far our workaround has been to reboot the dead client, after which the OSS works well again, but that does not seem reasonable for a Lustre file system, especially in a case like this where the OSS keeps trying to connect to a non-existent IP. Is there a command or some other method to evict the dead client or unknown NID from the OSS manually? (A possible candidate is sketched in the P.S. below, but we are not sure it is the right interface.)

We really appreciate any help!

Best Regards
QiuLan Huang
2010-07-07

===================================================================
Computing Center, the Institute of High Energy Physics, China
Huang, Qiulan               Tel: (+86) 10 8823 6012-604
P.O. Box 918-7              Fax: (+86) 10 8823 6839
Beijing 100049 P.R. China   Email: huangql@ihep.ac.cn
===================================================================
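P.S. To make the question concrete, the following is only a sketch of what we have in mind, and we are not sure it is the correct or supported interface on Lustre 1.8; the OST name "testfs-OST0000" is just an example, and the NID is the dead client's old address from the log above. The idea would be to evict the stale export by NID through the obdfilter evict_client entry on the OSS:

# Evict the dead client's old NID from one OST on this OSS
# (repeat for each OST, or use a wildcard OST name with lctl set_param)
[root@com01 ~]# lctl set_param obdfilter.testfs-OST0000.evict_client=nid:202.122.37.79@tcp

# Or, via the /proc interface:
[root@com01 ~]# echo "nid:202.122.37.79@tcp" > /proc/fs/lustre/obdfilter/testfs-OST0000/evict_client

Would something like this be the recommended way to drop the connection to a non-existent IP, or is there a better method?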