hi All,

First of all, I'm sorry I cannot write a better bug report, as I'm far away from the host and right now there is no remote access. My colleague sent this in by email:

Nov 9 19:05:55 node1 kernel: [<ffffffff88591357>] :ptlrpc:ptlrpc_server_handle_request+0xa97/0x1170
Nov 9 19:05:55 node1 kernel: [<ffffffff8003d382>] lock_timer_base+0x1b/0x3c
Nov 9 19:05:55 node1 kernel: [<ffffffff8008881d>] __wake_up_common+0x3e/0x68
Nov 9 19:05:55 node1 kernel: [<ffffffff88594e08>] :ptlrpc:ptlrpc_main+0x1218/0x13e0
Nov 9 19:05:55 node1 kernel: [<ffffffff8008a3f3>] default_wake_function+0x0/0xe
Nov 9 19:05:55 node1 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
Nov 9 19:05:55 node1 kernel: [<ffffffff88593bf0>] :ptlrpc:ptlrpc_main+0x0/0x13e0
Nov 9 19:05:55 node1 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11
Nov 9 19:05:55 node1 kernel:
Nov 9 19:05:55 node1 kernel: Lustre: 4381:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-192.168.3.149 at tcp
Nov 9 19:06:05 node1 last message repeated 409322 times
Nov 9 19:06:05 node1 kernel: BUG: soft lockup - CPU#1 stuck for 10s! [ll_ost_82:4381]
Nov 9 19:06:05 node1 kernel: CPU 1:
Nov 9 19:06:05 node1 kernel: Modules linked in: obdfilter(U) fsfilt_ldiskfs(U) ost(U) mgc(U) ldiskfs(U) crc16(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) mptctl(U) mptbase(U) ipmi_devintf(U) ipmi_si(U) ipmi_msghandler(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) autofs4(U) lockd(U) sunrpc(U) cpufreq_ondemand(U) acpi_cpufreq(U) freq_table(U) dm_mirror(U) dm_multipath(U) scsi_dh(U) video(U) hwmon(U) backlight(U) sbs(U) i2c_ec(U) i2c_core(U) button(U) battery(U) asus_acpi(U) acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U) sg(U) igb(U) shpchp(U) pcspkr(U) dm_raid45(U) dm_message(U) dm_region_hash(U) dm_log(U) dm_mod(U) dm_mem_cache(U) usb_storage(U) cciss(U) ata_piix(U) libata(U) sd_mod(U) scsi_mod(U) ext3(U) jbd(U) uhci_hcd(U) ohci_hcd(U) ehci_hcd(U)
Nov 9 19:06:05 node1 kernel: Pid: 4381, comm: ll_ost_82 Tainted: G 2.6.18-128.7.1.el5_lustre.1.8.1.1 #1
Nov 9 19:06:05 node1 kernel: RIP: 0010:[<ffffffff80064ae0>] [<ffffffff80064ae0>] _spin_lock+0x3/0xa
Nov 9 19:06:05 node1 kernel: RSP: 0018:ffff8102217b7758 EFLAGS: 00000246
Nov 9 19:06:05 node1 kernel: RAX: 0000000000000008 RBX: ffff81022c80b400 RCX: 0000000000000000
Nov 9 19:06:05 node1 kernel: RDX: ffff81023df660a0 RSI: 0000000000000000 RDI: ffff81023df66250
Nov 9 19:06:05 node1 kernel: RBP: ffff81022c80b400 R08: ffff81022c80b530 R09: 0000000000000000
Nov 9 19:06:05 node1 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000013
Nov 9 19:06:05 node1 kernel: R13: ffffffff8857327c R14: 0000000500000000 R15: 0000000000000007
Nov 9 19:06:05 node1 kernel: FS: 00002b4cfcfab230(0000) GS:ffff810107ed96c0(0000) knlGS:0000000000000000
Nov 9 19:06:05 node1 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Nov 9 19:06:05 node1 kernel: CR2: 00002abf9b119158 CR3: 0000000000201000 CR4: 00000000000006e0
Nov 9 19:06:05 node1 kernel:
Nov 9 19:06:05 node1 kernel: Call Trace:
Nov 9 19:06:05 node1 kernel: [<ffffffff8857cafc>] :ptlrpc:ptlrpc_queue_wait+0x103c/0x1690
Nov 9 19:06:05 node1 kernel: [<ffffffff8858a515>] :ptlrpc:lustre_msg_set_opc+0x45/0x120
Nov 9 19:06:05 node1 kernel: [<ffffffff88574085>] :ptlrpc:ptlrpc_at_set_req_timeout+0x85/0xd0
Nov 9 19:06:05 node1 kernel: [<ffffffff885748a9>] :ptlrpc:ptlrpc_prep_req_pool+0x619/0x6b0
Nov 9 19:06:05 node1 kernel: [<ffffffff8008a3f3>] default_wake_function+0x0/0xe
Nov 9 19:06:05 node1 kernel: [<ffffffff88564196>] :ptlrpc:ldlm_server_glimpse_ast+0x266/0x3b0
Nov 9 19:06:05 node1 kernel: [<ffffffff88570f03>] :ptlrpc:interval_iterate_reverse+0x73/0x240
Nov 9 19:06:05 node1 kernel: [<ffffffff88558f20>] :ptlrpc:ldlm_process_extent_lock+0x0/0xad0

History: The "cluster" has been up for approx. 10 days. It has only one MDS and 2 OSS computers. On the second day node1 locked up with no usable messages on the screen, and similar messages in the log as above. After I restarted it, the cluster ran for a week without bigger error messages. On Sunday we moved the servers, so we had to restart them. This morning node2 locked up, and after some hours node1 also started to give up. The systems are based on CentOS 5.4 with the official packages from Sun.

Is there any known bug about this?

Thank you,

tamas
Papp Tamás wrote, On 2009. 11. 09. 19:34:
>
> Is there any known bug about this?
>

Is this bug 21221 <https://bugzilla.lustre.org/show_bug.cgi?id=21221> ?

10x

tompos
This looks exactly like what we are running into; my bugzilla (https://bugzilla.lustre.org/show_bug.cgi?id=21256) was duped to this bug:

https://bugzilla.lustre.org/show_bug.cgi?id=19557

But I was not sure until you reported it whether you are also seeing your network stack being shut down? We see our NIC fall down when this happens, running RHEL5.3 with 1.8.1.0.

It's unclear from the bugzilla entry what the ETA on a fix is going to be.

On Nov 9, 2009, at 3:25 PM, Papp Tamás wrote:
> Papp Tamás wrote, On 2009. 11. 09. 19:34:
>>
>> Is there any known bug about this?
>>
>
> Is this bug 21221 <https://bugzilla.lustre.org/show_bug.cgi?
> id=21221> ?
>
> 10x
>
> tompos
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>

Derek Yarnell
UNIX Systems Administrator
University of Maryland Institute for Advanced Computer Studies
Derek Yarnell wrote, On 2009. 11. 10. 7:19:
> This looks exactly like what we are running into, my bugzilla
> (https://bugzilla.lustre.org/show_bug.cgi?id=21256) was duped to this
> bug,
>
> https://bugzilla.lustre.org/show_bug.cgi?id=19557
>
> But I am not sure until you reported it that you are also seeing that
> your network stack is being shut down? We see our NIC fall down when
> this happens, running RHEL5.3 1.8.1.0.
>
> Unclear on what a ETA on a fix is going to be from the bugzilla entry.

I'm not sure it's the same problem. I'll check it more closely tomorrow.

Has nobody else experienced this issue? Could it be HW related? Our cluster nodes are HP DL180 G6 machines.

tamas
Derek Yarnell wrote, On 2009. 11. 10. 7:19:
> This looks exactly like what we are running into, my bugzilla
> (https://bugzilla.lustre.org/show_bug.cgi?id=21256) was duped to this
> bug,
>
> https://bugzilla.lustre.org/show_bug.cgi?id=19557
>
> But I am not sure until you reported it that you are also seeing that
> your network stack is being shut down? We see our NIC fall down when
> this happens, running RHEL5.3 1.8.1.0.
>
> Unclear on what a ETA on a fix is going to be from the bugzilla entry.

Well, my problem has not gone away, but occurs more frequently. There were days when the logs were clean enough, and other days when there were call traces, but the cluster was functional. Until this afternoon.

I wrote earlier that we have one MDS (meta1) and two OSS (node1 and node2). The servers meta1 and node2 are working fine. But node1 has a load of 8, and top says:

4820 root 15 0 0 0 0 R 99.8 0.0 75:42.89 ll_ost_06
4894 root 15 0 0 0 0 R 99.8 0.0 75:41.79 ll_ost_80

The logs are full of this:

Nov 19 20:03:32 node1 kernel: BUG: soft lockup - CPU#3 stuck for 10s! [ll_ost_80:4894]
Nov 19 20:03:32 node1 kernel: CPU 3:
Nov 19 20:03:32 node1 kernel: Modules linked in: obdfilter(U) fsfilt_ldiskfs(U) ost(U) mgc(U) ldiskfs(U) crc16(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) mptctl(U) mptbase(U) ipmi_devintf(U) ipmi_si(U) ipmi_msghandler(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) autofs4(U) lockd(U) sunrpc(U) cpufreq_ondemand(U) acpi_cpufreq(U) freq_table(U) dm_mirror(U) dm_multipath(U) scsi_dh(U) video(U) hwmon(U) backlight(U) sbs(U) i2c_ec(U) i2c_core(U) button(U) battery(U) asus_acpi(U) acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U) sg(U) shpchp(U) igb(U) pcspkr(U) dm_raid45(U) dm_message(U) dm_region_hash(U) dm_log(U) dm_mod(U) dm_mem_cache(U) usb_storage(U) cciss(U) ata_piix(U) libata(U) sd_mod(U) scsi_mod(U) ext3(U) jbd(U) uhci_hcd(U) ohci_hcd(U) ehci_hcd(U)
Nov 19 20:03:32 node1 kernel: Pid: 4894, comm: ll_ost_80 Tainted: G 2.6.18-128.1.6.el5_lustre.1.8.0.1smp #1
Nov 19 20:03:32 node1 kernel: RIP: 0010:[<ffffffff885814f8>] [<ffffffff885814f8>] :ptlrpc:ptlrpc_queue_wait+0xe58/0x1690
Nov 19 20:03:32 node1 kernel: RSP: 0018:ffff810220143760 EFLAGS: 00000202
Nov 19 20:03:32 node1 kernel: RAX: ffff8101f7b5a400 RBX: ffff81020d360000 RCX: 0000000000000000
Nov 19 20:03:32 node1 kernel: RDX: ffff8100a739a408 RSI: 0000000000000000 RDI: ffff8101f7b5a548
Nov 19 20:03:32 node1 kernel: RBP: ffffffffffffffff R08: ffff81020d360128 R09: 0000000000000000
Nov 19 20:03:32 node1 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff81020d36005c
Nov 19 20:03:32 node1 kernel: R13: ffff81020d360000 R14: 0000000000000013 R15: ffffffff88577e6c
Nov 19 20:03:32 node1 kernel: FS: 00002b9c353dc230(0000) GS:ffff810107f61540(0000) knlGS:0000000000000000
Nov 19 20:03:32 node1 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Nov 19 20:03:32 node1 kernel: CR2: 0000003862603080 CR3: 0000000232d91000 CR4: 00000000000006e0
Nov 19 20:03:32 node1 kernel:
Nov 19 20:03:32 node1 kernel: Call Trace:
Nov 19 20:03:32 node1 kernel: [<ffffffff8858f185>] :ptlrpc:lustre_msg_set_opc+0x45/0x120
Nov 19 20:03:32 node1 kernel: [<ffffffff88578ca2>] :ptlrpc:ptlrpc_at_set_req_timeout+0x82/0xd0
Nov 19 20:03:32 node1 kernel: [<ffffffff885794c3>] :ptlrpc:ptlrpc_prep_req_pool+0x613/0x6b0
Nov 19 20:03:32 node1 kernel: [<ffffffff8008a3f6>] default_wake_function+0x0/0xe
Nov 19 20:03:32 node1 kernel: [<ffffffff88568d47>] :ptlrpc:ldlm_server_glimpse_ast+0x257/0x3a0
Nov 19 20:03:32 node1 kernel: [<ffffffff88575af3>] :ptlrpc:interval_iterate_reverse+0x73/0x240
Nov 19 20:03:32 node1 kernel: [<ffffffff8855d850>] :ptlrpc:ldlm_process_extent_lock+0x0/0xad0
Nov 19 20:03:32 node1 kernel: [<ffffffff8884205c>] :obdfilter:filter_intent_policy+0x66c/0x780
Nov 19 20:03:32 node1 kernel: [<ffffffff884825b0>] :lnet:LNetPut+0x700/0x800
Nov 19 20:03:32 node1 kernel: [<ffffffff88549d56>] :ptlrpc:ldlm_lock_enqueue+0x186/0xb30
Nov 19 20:03:32 node1 kernel: [<ffffffff88565b5f>] :ptlrpc:ldlm_export_lock_get+0x6f/0xe0
Nov 19 20:03:32 node1 kernel: [<ffffffff884ca748>] :obdclass:lustre_hash_add+0x208/0x2d0
Nov 19 20:03:32 node1 kernel: [<ffffffff887f6f20>] :ost:ost_blocking_ast+0x0/0xaa0
Nov 19 20:03:32 node1 kernel: [<ffffffff8856c779>] :ptlrpc:ldlm_handle_enqueue+0xc09/0x1200
Nov 19 20:03:32 node1 kernel: [<ffffffff88589208>] :ptlrpc:lustre_msg_check_version_v2+0x8/0x20
Nov 19 20:03:32 node1 kernel: [<ffffffff887ff110>] :ost:ost_handle+0x54a0/0x5a60
Nov 19 20:03:32 node1 kernel: [<ffffffff88482de5>] :lnet:lnet_match_blocked_msg+0x375/0x390
Nov 19 20:03:32 node1 kernel: [<ffffffff800d7301>] free_block+0x11c/0x13b
Nov 19 20:03:32 node1 kernel: [<ffffffff80148fe6>] __next_cpu+0x19/0x28
Nov 19 20:03:32 node1 kernel: [<ffffffff80148fe6>] __next_cpu+0x19/0x28
Nov 19 20:03:32 node1 kernel: [<ffffffff80088f4a>] find_busiest_group+0x20d/0x621
Nov 19 20:03:32 node1 kernel: [<ffffffff8858eb75>] :ptlrpc:lustre_msg_get_conn_cnt+0x35/0xf0
Nov 19 20:03:32 node1 kernel: [<ffffffff80089d90>] enqueue_task+0x41/0x56
Nov 19 20:03:32 node1 kernel: [<ffffffff8859382d>] :ptlrpc:ptlrpc_check_req+0x1d/0x110
Nov 19 20:03:32 node1 kernel: [<ffffffff88595db3>] :ptlrpc:ptlrpc_server_handle_request+0xa93/0x1160
Nov 19 20:03:32 node1 kernel: [<ffffffff80063097>] thread_return+0x62/0xfe
Nov 19 20:03:32 node1 kernel: [<ffffffff8008882b>] __wake_up_common+0x3e/0x68
Nov 19 20:03:32 node1 kernel: [<ffffffff885992e8>] :ptlrpc:ptlrpc_main+0x1218/0x13e0
Nov 19 20:03:32 node1 kernel: [<ffffffff8008a3f6>] default_wake_function+0x0/0xe
Nov 19 20:03:32 node1 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
Nov 19 20:03:32 node1 kernel: [<ffffffff885980d0>] :ptlrpc:ptlrpc_main+0x0/0x13e0
Nov 19 20:03:32 node1 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11
Nov 19 20:03:32 node1 kernel:
Nov 19 20:03:34 node1 kernel: Lustre: Request x1319785196423799 sent from dafs-OST0001 to NID 192.168.3.115 at tcp 0s ago has timed out (limit 7s).
Nov 19 20:03:34 node1 kernel: Lustre: Skipped 40339060 previous similar messages 0; still busy with 3 active RPCs
Nov 19 20:03:46 node1 kernel: Lustre: 4924:0:(ldlm_lib.c:815:target_handle_connect()) Skipped 85 previous similar messages

So the client nodes are not able to use the storage continuously.

Actually, last weekend I downgraded the Lustre version to 1.8.0.1, so this is not with 1.8.1.1, but I'm afraid it doesn't matter. A reboot doesn't help at all any more; the problem comes back in 30 minutes.

Any idea or suggestion? I'm a bit desperate about why this happens and what I could do about it.

Thank you,

tamas

-------------- next part --------------
A non-text attachment was scrubbed...
Name: lustre.log.zip
Type: application/x-zip-compressed
Size: 184829 bytes
Desc: not available
Url : http://lists.lustre.org/pipermail/lustre-discuss/attachments/20091119/19a15494/attachment-0001.bin
Papp Tamás wrote:
> The logs are full with this:
>
> Nov 19 20:03:32 node1 kernel: BUG: soft lockup - CPU#3 stuck for 10s!
> [ll_ost_80:4894]
> Nov 19 20:03:32 node1 kernel: CPU 3:
<snip>
> Nov 19 20:03:34 node1 kernel: Lustre: Skipped 40339060 previous similar
> messages 0; still busy with 3 active RPCs

We had the same problem with 1.8.x.x.

We set lnet.printk=0 on our OSS nodes and it has helped us dramatically - we have not seen the 'soft lockup' problem since.

sysctl -w lnet.printk=0

This will turn off all but 'emerg' messages from lnet.

It would be interesting to know if this avoided the lockups for you, too.

Cheers,
Craig
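[Editor's sketch, not from Craig's mail: the `sysctl -w` above is lost on reboot. One common way to make it stick is a line in /etc/sysctl.conf, assuming the lnet module is already loaded when sysctl runs at boot. The snippet below only demonstrates the config line against a temporary file, not the live system:]

```shell
# Append the LNet console-quieting setting to a sysctl config file.
# A temp file stands in for /etc/sysctl.conf in this demo.
conf=$(mktemp)
echo 'lnet.printk = 0' >> "$conf"
# On a live OSS you would then apply it with:  sysctl -p /etc/sysctl.conf
grep '^lnet.printk' "$conf"
rm -f "$conf"
```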
Craig Prescott wrote, On 2009. 11. 19. 20:42:
> Papp Tamás wrote:
>> The logs are full with this:
>>
>> Nov 19 20:03:32 node1 kernel: BUG: soft lockup - CPU#3 stuck for 10s!
>> [ll_ost_80:4894]
>> Nov 19 20:03:32 node1 kernel: CPU 3:
> <snip>
>> Nov 19 20:03:34 node1 kernel: Lustre: Skipped 40339060 previous
>> similar messages 0; still busy with 3 active RPCs
>
> We had the same problem with 1.8.x.x.
>
> We set lnet.printk=0 on our OSS nodes and it has helped us
> dramatically - we have not seen the 'soft lockup' problem since.
>
> sysctl -w lnet.printk=0
>
> This will turn off all but 'emerg' messages from lnet.
>
> It would be interesting to know if this avoided the lockups for you, too.

I set it up. We'll see.

Thank you very much!

tamas
Craig Prescott wrote, On 2009. 11. 19. 20:42:
> We had the same problem with 1.8.x.x.
>
> We set lnet.printk=0 on our OSS nodes and it has helped us
> dramatically - we have not seen the 'soft lockup' problem since.
>
> sysctl -w lnet.printk=0
>
> This will turn off all but 'emerg' messages from lnet.
>
> It would be interesting to know if this avoided the lockups for you, too.

Well, this definitely helped, but didn't resolve the root of the problem.

A few minutes ago we were not able to reach our cluster from the clients.

On the MDS I see a lot of this:

Nov 27 10:52:17 meta1 kernel: BUG: soft lockup - CPU#3 stuck for 10s! [ll_evictor:6123]
Nov 27 10:52:17 meta1 kernel: CPU 3:
Nov 27 10:52:17 meta1 kernel: Modules linked in: mds(U) fsfilt_ldiskfs(U) mgs(U) mgc(U) ldiskfs(U) crc16(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) mptctl(U) mptbase(U) ipmi_devintf(U) ipmi_si(U) ipmi_msghandler(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) autofs4(U) lockd(U) sunrpc(U) cpufreq_ondemand(U) acpi_cpufreq(U) freq_table(U) dm_mirror(U) dm_multipath(U) scsi_dh(U) video(U) hwmon(U) backlight(U) sbs(U) i2c_ec(U) i2c_core(U) button(U) battery(U) asus_acpi(U) acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U) shpchp(U) pcspkr(U) sg(U) igb(U) dm_raid45(U) dm_message(U) dm_region_hash(U) dm_log(U) dm_mod(U) dm_mem_cache(U) usb_storage(U) ata_piix(U) ahci(U) libata(U) sd_mod(U) scsi_mod(U) raid1(U) ext3(U) jbd(U) uhci_hcd(U) ohci_hcd(U) ehci_hcd(U)
Nov 27 10:52:17 meta1 kernel: Pid: 6123, comm: ll_evictor Tainted: G 2.6.18-128.7.1.el5_lustre.1.8.1.1 #1
Nov 27 10:52:17 meta1 kernel: RIP: 0010:[<ffffffff80064af5>] [<ffffffff80064af5>] _write_lock+0xe/0xf
Nov 27 10:52:17 meta1 kernel: RSP: 0018:ffff8102154d9c78 EFLAGS: 00000246
Nov 27 10:52:17 meta1 kernel: RAX: 000000000000ffff RBX: 00000000000050c3 RCX: 0000000000042620
Nov 27 10:52:17 meta1 kernel: RDX: 0000000000000282 RSI: ffffffff802fae80 RDI: ffffc200108e4c3c
Nov 27 10:52:17 meta1 kernel: RBP: 0000000000000286 R08: ffff81000101f3e0 R09: 0000000000000000
Nov 27 10:52:17 meta1 kernel: R10: ffff8101960319c0 R11: 0000000000000150 R12: 0000000000000286
Nov 27 10:52:17 meta1 kernel: R13: ffff8102154d9c20 R14: ffff810196031a18 R15: ffff8101960319c0
Nov 27 10:52:17 meta1 kernel: FS: 00002abb113e4230(0000) GS:ffff810107f61540(0000) knlGS:0000000000000000
Nov 27 10:52:17 meta1 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Nov 27 10:52:18 meta1 kernel: CR2: 00000031d3803080 CR3: 0000000000201000 CR4: 00000000000006e0
Nov 27 10:52:18 meta1 kernel:
Nov 27 10:52:18 meta1 kernel: Call Trace:
Nov 27 10:52:18 meta1 kernel: [<ffffffff884caba7>] :obdclass:lustre_hash_for_each_empty+0x237/0x2b0
Nov 27 10:52:18 meta1 kernel: [<ffffffff884d1ae8>] :obdclass:class_disconnect+0x398/0x420
Nov 27 10:52:18 meta1 kernel: [<ffffffff8886d5e1>] :mds:mds_disconnect+0x121/0xe40
Nov 27 10:52:18 meta1 kernel: [<ffffffff8014b87a>] snprintf+0x44/0x4c
Nov 27 10:52:18 meta1 kernel: [<ffffffff884cd994>] :obdclass:class_fail_export+0x384/0x4c0
Nov 27 10:52:18 meta1 kernel: [<ffffffff8859e238>] :ptlrpc:ping_evictor_main+0x4f8/0x7e0
Nov 27 10:52:18 meta1 kernel: [<ffffffff8008a3f3>] default_wake_function+0x0/0xe
Nov 27 10:52:18 meta1 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
Nov 27 10:52:18 meta1 kernel: [<ffffffff8859dd40>] :ptlrpc:ping_evictor_main+0x0/0x7e0
Nov 27 10:52:18 meta1 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11
Nov 27 10:52:18 meta1 kernel:

After them:

Nov 27 10:53:05 meta1 kernel: LustreError: 6009:0:(events.c:367:server_bulk_callback()) event type 4, status -5, desc ffff81022ed54280
Nov 27 10:53:05 meta1 kernel: Lustre: dafs-MDT0000: haven't heard from client 4a05e5fb-590d-ec56-c1b3-ff0234152564 (at 192.168.3.141 at tcp) in 258 seconds. I think it's dead, and I am evicting it.
Nov 27 10:53:05 meta1 kernel: LustreError: 6009:0:(events.c:367:server_bulk_callback()) event type 4, status -5, desc ffff8102189b0cc0
Nov 27 10:53:05 meta1 kernel: LustreError: 6145:0:(handler.c:1563:mds_handle()) operation 400 on unconnected MDS from 12345-192.168.3.141 at tcp
Nov 27 10:53:05 meta1 kernel: LustreError: 6145:0:(ldlm_lib.c:1863:target_send_reply_msg()) @@@ processing error (-107) req at ffff8101bec0cc00 x1320487774846997/t0 o400-><?>@<?>:0/0 lens 192/0 e 0 to 0 dl 1259315591 ref 1 fl Interpret:H/0/0 rc -107/0
Nov 27 10:53:05 meta1 kernel: Lustre: 6149:0:(ldlm_lib.c:541:target_handle_reconnect()) dafs-MDT0000: b4c8f591-df5c-b6f2-93c1-57e5f64312d0 reconnecting 800; still busy with 1 active RPCs a02b-9917-c60e-d5b6c1c1d3c8@:0/0 lens 408/0 e 0 to 0 dl 0 ref 1 fl New:/0/0 rc 0/0
Nov 27 10:53:05 meta1 kernel: Lustre: 6282:0:(ldlm_lib.c:766:target_handle_connect()) MGS: exp ffff81023e83c000 already connecting l busy with 1 active RPCs
Nov 27 10:53:05 meta1 kernel: Lustre: 6282:0:(ldlm_lib.c:766:target_handle_connect()) Skipped 8 previous similar messages
Nov 27 10:53:05 meta1 kernel: Lustre: 6121:0:(ldlm_lib.c:835:target_handle_connect()) Skipped 10 previous similar messages
Nov 27 10:53:05 meta1 kernel: LustreError: 6282:0:(mgs_handler.c:661:mgs_handle()) MGS handle cmd=250 rc=-114
Nov 27 10:53:05 meta1 kernel: LustreError: 6282:0:(mgs_handler.c:661:mgs_handle()) Skipped 9 previous similar messages 3400 x1320390320856467/t0 o37->56d39829-d432-2c01-0261-eee2779a4fe3 at NET_0x20000c0a80390_UUID:0/0 lens 408/360 e 1 to 0 dl 1259315514 ref 1 fl Complete:/0/0 rc -110/-110
Nov 27 10:53:05 meta1 kernel: Lustre: 6841:0:(watchdog.c:363:lcw_update_time()) Expired watchdog for pid 6841 disabled after 208.42s
Nov 27 10:53:05 meta1 kernel: LustreError: 6145:0:(ldlm_lib.c:1863:target_send_reply_msg()) Skipped 17 previous similar messages
Nov 27 10:53:06 meta1 kernel: Lustre: 6161:0:(ldlm_lib.c:541:target_handle_reconnect()) dafs-MDT0000: d3728ce6-b580-dbb3-b262-123d1516703b reconnecting
Nov 27 10:53:06 meta1 kernel: Lustre: 6161:0:(ldlm_lib.c:541:target_handle_reconnect()) Skipped 185 previous similar messages 800; still busy with 1 active RPCs
Nov 27 10:53:06 meta1 kernel: Lustre: 6143:0:(ldlm_lib.c:835:target_handle_connect()) Skipped 1725 previous similar messages
Nov 27 10:53:06 meta1 kernel: Lustre: 6161:0:(ldlm_lib.c:541:target_handle_reconnect()) dafs-MDT0000: d3728ce6-b580-dbb3-b262-123d1516703b reconnecting
Nov 27 10:53:06 meta1 kernel: Lustre: 6161:0:(ldlm_lib.c:541:target_handle_reconnect()) Skipped 1763 previous similar messages 6703b at NET_0x20000c0a80376_UUID:0/0 lens 368/264 e 0 to 0 dl 1259315686 ref 1 fl Interpret:/0/0 rc -16/0
Nov 27 10:53:06 meta1 kernel: LustreError: 6152:0:(ldlm_lib.c:1863:target_send_reply_msg()) Skipped 4173 previous similar messages 800; still busy with 1 active RPCs
Nov 27 10:53:07 meta1 kernel: Lustre: 6148:0:(ldlm_lib.c:835:target_handle_connect()) Skipped 4913 previous similar messages
Nov 27 10:53:07 meta1 kernel: Lustre: 6136:0:(ldlm_lib.c:541:target_handle_reconnect()) dafs-MDT0000: d3728ce6-b580-dbb3-b262-123d1516703b reconnecting
Nov 27 10:53:07 meta1 kernel: Lustre: 6136:0:(ldlm_lib.c:541:target_handle_reconnect()) Skipped 4916 previous similar messages 6703b at NET_0x20000c0a80376_UUID:0/0 lens 368/264 e 0 to 0 dl 1259315688 ref 1 fl Interpret:/0/0 rc -16/0
Nov 27 10:53:08 meta1 kernel: LustreError: 6160:0:(ldlm_lib.c:1863:target_send_reply_msg()) Skipped 9782 previous similar messages
Nov 27 10:53:08 meta1 kernel: Lustre: MGS: haven't heard from client f6fe0d62-238f-d5b3-5137-d1d897bc1195 (at 192.168.3.141 at tcp) in 467 seconds. I think it's dead, and I am evicting it.
Nov 27 10:53:08 meta1 kernel: Lustre: Skipped 1 previous similar message 800; still busy with 1 active RPCs
Nov 27 10:53:09 meta1 kernel: Lustre: 6156:0:(ldlm_lib.c:835:target_handle_connect()) Skipped 9712 previous similar messages
Nov 27 10:53:09 meta1 kernel: Lustre: 6156:0:(ldlm_lib.c:541:target_handle_reconnect()) dafs-MDT0000: d3728ce6-b580-dbb3-b262-123d1516703b reconnecting
Nov 27 10:53:09 meta1 kernel: Lustre: 6156:0:(ldlm_lib.c:541:target_handle_reconnect()) Skipped 9718 previous similar messages
Nov 27 10:53:11 meta1 kernel: LustreError: 27480:0:(handler.c:161:mds_sendpage()) @@@ bulk failed: timeout 0(4096), evicting d3728ce6-b580-dbb3-b262-123d1516703b at NET_0x20000c0a80376_UUID t:/0/0 rc 0/0
Nov 27 10:54:23 meta1 hpasmpld[2804]: Sensor 9b is missing. Reading Flags = 40 d3728ce6-b580-dbb3-b262-123d1516703b at NET_0x20000c0a80376_UUID:0/0 lens 408/528 e 0 to 0 dl 1259315709 ref 1 fl Interpret:/0/0 rc 0/0
Nov 27 10:55:03 meta1 kernel: LustreError: 27480:0:(mds_open.c:1679:mds_close()) Skipped 3 previous similar messages 516703b at NET_0x20000c0a80376_UUID:0/0 lens 408/528 e 0 to 0 dl 1259315709 ref 1 fl Interpret:/0/0 rc -116/0
Nov 27 10:55:03 meta1 kernel: LustreError: 27480:0:(ldlm_lib.c:1863:target_send_reply_msg()) Skipped 14495 previous similar messages
Nov 27 10:56:23 meta1 hpasmpld[2804]: Sensor 9b is missing. Reading Flags = 40

On node2 a single one of this:

Nov 27 10:52:43 node2 kernel: ll_ost_io_68 S ffff81023c101080 0 6222 1 6223 6221 (L-TLB)
Nov 27 10:52:43 node2 kernel: ffff810209f5b960 0000000000000046 0000000000000001 ffffffff884776b0
Nov 27 10:52:43 node2 kernel: ffff810107f24000 000000000000000a ffff81022408a7e0 ffff81023c101080
Nov 27 10:52:43 node2 kernel: 000254bcf81e00e9 00000000000003b2 ffff81022408a9c8 0000000100000000
Nov 27 10:52:43 node2 kernel: Call Trace:
Nov 27 10:52:43 node2 kernel: [<ffffffff884776b0>] :lnet:LNetMDBind+0x2c0/0x420
Nov 27 10:52:43 node2 kernel: [<ffffffff8003d382>] lock_timer_base+0x1b/0x3c
Nov 27 10:52:43 node2 kernel: [<ffffffff8001c6fa>] __mod_timer+0xb0/0xbe
Nov 27 10:52:43 node2 kernel: [<ffffffff80063947>] schedule_timeout+0x8a/0xad
Nov 27 10:52:43 node2 kernel: [<ffffffff80094f4f>] process_timeout+0x0/0x5
Nov 27 10:52:43 node2 kernel: [<ffffffff888060dc>] :ost:ost_brw_read+0x127c/0x1a30
Nov 27 10:52:43 node2 kernel: [<ffffffff8008a3f3>] default_wake_function+0x0/0xe
Nov 27 10:52:43 node2 kernel: [<ffffffff885865d8>] :ptlrpc:lustre_msg_check_version_v2+0x8/0x20
Nov 27 10:52:43 node2 kernel: [<ffffffff8880e429>] :ost:ost_handle+0x3129/0x5a70
Nov 27 10:52:43 node2 kernel: [<ffffffff88478923>] :lnet:lnet_ni_send+0x93/0xd0
Nov 27 10:52:43 node2 kernel: [<ffffffff8847ad23>] :lnet:lnet_send+0x973/0x9a0
Nov 27 10:52:43 node2 kernel: [<ffffffff8001c6fa>] __mod_timer+0xb0/0xbe
Nov 27 10:52:43 node2 kernel: [<ffffffff8847f305>] :lnet:lnet_match_blocked_msg+0x375/0x390
Nov 27 10:52:43 node2 kernel: [<ffffffff8855be8a>] :ptlrpc:ldlm_resource_foreach+0x11a/0x390
Nov 27 10:52:43 node2 kernel: [<ffffffff884dadf1>] :obdclass:class_handle2object+0xd1/0x160
Nov 27 10:52:43 node2 kernel: [<ffffffff80148e8c>] __next_cpu+0x19/0x28
Nov 27 10:52:43 node2 kernel: [<ffffffff80088f36>] find_busiest_group+0x20d/0x621
Nov 27 10:52:43 node2 kernel: [<ffffffff8854119a>] :ptlrpc:lock_res_and_lock+0xba/0xd0
Nov 27 10:52:43 node2 kernel: [<ffffffff8858bf05>] :ptlrpc:lustre_msg_get_conn_cnt+0x35/0xf0
Nov 27 10:52:43 node2 kernel: [<ffffffff80089d8d>] enqueue_task+0x41/0x56
Nov 27 10:52:43 node2 kernel: [<ffffffff88590c1d>] :ptlrpc:ptlrpc_check_req+0x1d/0x110
Nov 27 10:52:43 node2 kernel: [<ffffffff88593357>] :ptlrpc:ptlrpc_server_handle_request+0xa97/0x1170
Nov 27 10:52:43 node2 kernel: [<ffffffff8008881d>] __wake_up_common+0x3e/0x68
Nov 27 10:52:44 node2 kernel: [<ffffffff88596e08>] :ptlrpc:ptlrpc_main+0x1218/0x13e0
Nov 27 10:52:44 node2 kernel: [<ffffffff8008a3f3>] default_wake_function+0x0/0xe
Nov 27 10:52:44 node2 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
Nov 27 10:52:44 node2 kernel: [<ffffffff88595bf0>] :ptlrpc:ptlrpc_main+0x0/0x13e0
Nov 27 10:52:44 node2 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11
Nov 27 10:52:44 node2 kernel:
Nov 27 10:52:44 node2 kernel: LustreError: dumping log to /tmp/lustre-log.1259315563.6222

I attach the lustre-log. I googled for LNetMDBind, but found no useful pages. I see this in Understanding_Lustre_Filesystem_Internals.pdf, but it's not really helpful to me:

int LNetMDBind(lnet_md_t umd, lnet_unlink_t unlink, lnet_handle_md_t *handle)

This function creates a standalone memory descriptor, i.e., an MD that is not attached to an ME.

By the way, it's a bit strange that nobody else has met this bug yet.

tamas

-------------- next part --------------
A non-text attachment was scrubbed...
Name: lustre-log.1259315563.6222.bz2
Type: application/octet-stream
Size: 9859 bytes
Desc: not available
Url : http://lists.lustre.org/pipermail/lustre-discuss/attachments/20091127/d315eee5/attachment.obj
On 2009-11-27, at 03:13, Papp Tamás wrote:
> Craig Prescott wrote, On 2009. 11. 19. 20:42:
>> We had the same problem with 1.8.x.x.
>>
>> We set lnet.printk=0 on our OSS nodes and it has helped us
>> dramatically - we have not seen the 'soft lockup' problem since.
>>
>> sysctl -w lnet.printk=0
>>
>> This will turn off all but 'emerg' messages from lnet.
>>
>> It would be interesting to know if this avoided the lockups for
>> you, too.
>
> Well, this definitely helped, but didn't resolve the root of the
> problem.
>
> A few minutes ago we were not able to reach our cluster from clients.
>
> On MDS I see a lot of this:
>
> Nov 27 10:52:17 meta1 kernel: BUG: soft lockup - CPU#3 stuck for
> 10s! [ll_evictor:6123]
> Nov 27 10:52:18 meta1 kernel: Call Trace:
> :obdclass:lustre_hash_for_each_empty+0x237/0x2b0
> :obdclass:class_disconnect+0x398/0x420
> :mds:mds_disconnect+0x121/0xe40
> :obdclass:class_fail_export+0x384/0x4c0
> :ptlrpc:ping_evictor_main+0x4f8/0x7e0
> default_wake_function+0x0/0xe
> :ptlrpc:ping_evictor_main+0x0/0x7e0

This looks like the server evicting a client that has a lot of locks. One thing to try is in lustre_hash_for_each_empty() add a call to cond_resched(), since it seems this function could run a long time if func() doesn't ever cause a reschedule:

lustre_hash_for_each_empty(lustre_hash_t *lh, lh_for_each_cb func,
{
        read_unlock(&lh->lh_rwlock);
        func(obj, data);
        (void)lh_put(lh, hnode);
+       cond_resched();
        goto restart;

I'm not sure this is the root cause, but you could check the DLM lock stats in /proc/fs/lustre/ldlm/namespaces/*/lock_count on some clients, to see how many locks they are holding, or the same on the MDS, which will be the total number of locks currently granted to all clients.

> After them:
>
> 6009:0:(events.c:367:server_bulk_callback()) event type 4, status
> -5, desc ffff81022ed54280

This is just fallout from the MDS being too busy to handle requests.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
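[Editor's note: Andreas's /proc check is easy to script. A minimal sketch that sums the granted-lock counts over all LDLM namespaces; it assumes the Lustre modules are loaded so the /proc path exists, and harmlessly prints 0 on a machine without Lustre:]

```shell
# Sum currently granted DLM locks across all LDLM namespaces.
# On a client this is that client's total; on the MDS it is the total
# granted to all clients. Prints 0 if /proc/fs/lustre is absent.
total=0
for f in /proc/fs/lustre/ldlm/namespaces/*/lock_count; do
    if [ -r "$f" ]; then
        total=$((total + $(cat "$f")))
    fi
done
echo "$total"
```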
Andreas Dilger wrote, On 2009. 11. 28. 0:22:> On 2009-11-27, at 03:13, Papp Tam?s wrote: >> Craig Prescott wrote, On 2009. 11. 19. 20:42: >>> We had the same problem with 1.8.x.x. >>> >>> We set lnet.printk=0 on our OSS nodes and it has helped us >>> dramatically - we have not seen the ''soft lockup'' problem since. >>> >>> sysctl -w lnet.printk=0 >>> >>> This will turn off all but ''emerg'' messages from lnet. >>> >>> It would be interesting to know if this avoided the lockups for you, >>> too. >> >> Well, this definetely helped, but didn''t resolve the root of the >> problem. >> >> A few minutes ago we were not able to reach our cluster from clients. >> >> On MDS I see a lot of this: >> >> >> Nov 27 10:52:17 meta1 kernel: BUG: soft lockup - CPU#3 stuck for 10s! >> [ll_evictor:6123] >> Nov 27 10:52:18 meta1 kernel: Call Trace: >> :obdclass:lustre_hash_for_each_empty+0x237/0x2b0 >> :obdclass:class_disconnect+0x398/0x420 >> :mds:mds_disconnect+0x121/0xe40 >> :obdclass:class_fail_export+0x384/0x4c0 >> :ptlrpc:ping_evictor_main+0x4f8/0x7e0 >> default_wake_function+0x0/0xe >> :ptlrpc:ping_evictor_main+0x0/0x7e0 > > This looks like the server evicting a client that has a lot of locks. > One thing to try is in lustre_hash_for_each_empty() add a call to > cond_resched(), since it seems this function could run a long time if > func() doesn''t ever cause a reschedule: > > lustre_hash_for_each_empty(lustre_hash_t *lh, lh_for_each_cb func, > { > read_unlock(&lh->lh_rwlock); > func(obj, data); > (void)lh_put(lh, hnode); > + cond_resched(); > goto restart; > > I''m not sure this is the root cause, but you could check the DLM lock > stats in /proc/fs/lustre/ldlm/namespaces/*/lock_count on some clients, > to see how many locks they are holding, or the same on the MDS, which > will be the total number of locks currently granted to all clients.At the weekend we upgraded the firmware on the OSS''s raid controllers (SmartArray P410) and the messages has gone. 
Of course I can't be sure it was because of that, but we are hopeful and
still watching it. On the MDS the messages are rarer; in two days there
were only two of them:

Dec 1 09:19:18 meta1 kernel: Modules linked in: mds(U) fsfilt_ldiskfs(U) mgs(U) mgc(U) ldiskfs(U) crc16(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) mptctl(U) mptbase(U) ipmi_devintf(U) ipmi_si(U) ipmi_msghandler(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) autofs4(U) lockd(U) sunrpc(U) cpufreq_ondemand(U) acpi_cpufreq(U) freq_table(U) dm_mirror(U) dm_multipath(U) scsi_dh(U) video(U) hwmon(U) backlight(U) sbs(U) i2c_ec(U) i2c_core(U) button(U) battery(U) asus_acpi(U) acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U) sg(U) pcspkr(U) shpchp(U) igb(U) dm_raid45(U) dm_message(U) dm_region_hash(U) dm_log(U) dm_mod(U) dm_mem_cache(U) usb_storage(U) ata_piix(U) ahci(U) libata(U) sd_mod(U) scsi_mod(U) raid1(U) ext3(U) jbd(U) uhci_hcd(U) ohci_hcd(U) ehci_hcd(U)
Dec 1 09:19:18 meta1 kernel: Pid: 4470, comm: ll_evictor Tainted: G 2.6.18-128.7.1.el5_lustre.1.8.1.1 #1
Dec 1 09:19:18 meta1 kernel: RIP: 0010:[<ffffffff80064aee>] [<ffffffff80064aee>] _write_lock+0x7/0xf
Dec 1 09:19:18 meta1 kernel: RSP: 0018:ffff81022e503c78 EFLAGS: 00000246
Dec 1 09:19:18 meta1 kernel: RAX: 0000000000007fff RBX: 00000000000013d6 RCX: 0000000000006f89
Dec 1 09:19:18 meta1 kernel: RDX: 0000000000000216 RSI: ffffffff802fae80 RDI: ffffc2001085cd6c
Dec 1 09:19:18 meta1 kernel: RBP: 0000000000000286 R08: ffff81000101f3e0 R09: 5a5a5a5a5a5a5a5a
Dec 1 09:19:18 meta1 kernel: R10: 5a5a5a5a5a5a5a5a R11: 5a5a5a5a5a5a5a5a R12: 0000000000000286
Dec 1 09:19:18 meta1 kernel: R13: ffff81022e503c20 R14: ffff8101cea7f418 R15: ffff8101cea7f3c0
Dec 1 09:19:18 meta1 kernel: FS: 00002ae6ba0b1230(0000) GS:ffff810107f61540(0000) knlGS:0000000000000000
Dec 1 09:19:18 meta1 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Dec 1 09:19:18 meta1 kernel: CR2: 00002aaaae52d000 CR3: 00000002350d0000 CR4: 00000000000006e0
Dec 1 09:19:18 meta1 kernel:
Dec 1 09:19:18 meta1 kernel: Call Trace:
Dec 1 09:19:18 meta1 kernel: [<ffffffff884ccba7>] :obdclass:lustre_hash_for_each_empty+0x237/0x2b0
Dec 1 09:19:19 meta1 kernel: [<ffffffff884d3ae8>] :obdclass:class_disconnect+0x398/0x420
Dec 1 09:19:19 meta1 kernel: [<ffffffff8886f5e1>] :mds:mds_disconnect+0x121/0xe40
Dec 1 09:19:19 meta1 kernel: [<ffffffff8014b87a>] snprintf+0x44/0x4c
Dec 1 09:19:19 meta1 kernel: [<ffffffff884cf994>] :obdclass:class_fail_export+0x384/0x4c0
Dec 1 09:19:19 meta1 kernel: [<ffffffff885a0238>] :ptlrpc:ping_evictor_main+0x4f8/0x7e0
Dec 1 09:19:19 meta1 kernel: [<ffffffff8008a3f3>] default_wake_function+0x0/0xe
Dec 1 09:19:19 meta1 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
Dec 1 09:19:19 meta1 kernel: [<ffffffff8859fd40>] :ptlrpc:ping_evictor_main+0x0/0x7e0
Dec 1 09:19:19 meta1 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11
Dec 1 10:08:38 meta1 kernel: BUG: soft lockup - CPU#1 stuck for 10s! [ldlm_elt:4466]
Dec 1 10:08:38 meta1 kernel: CPU 1:
Dec 1 10:08:38 meta1 kernel: Modules linked in: mds(U) fsfilt_ldiskfs(U) mgs(U) mgc(U) ldiskfs(U) crc16(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) mptctl(U) mptbase(U) ipmi_devintf(U) ipmi_si(U) ipmi_msghandler(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) autofs4(U) lockd(U) sunrpc(U) cpufreq_ondemand(U) acpi_cpufreq(U) freq_table(U) dm_mirror(U) dm_multipath(U) scsi_dh(U) video(U) hwmon(U) backlight(U) sbs(U) i2c_ec(U) i2c_core(U) button(U) battery(U) asus_acpi(U) acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U) sg(U) pcspkr(U) shpchp(U) igb(U) dm_raid45(U) dm_message(U) dm_region_hash(U) dm_log(U) dm_mod(U) dm_mem_cache(U) usb_storage(U) ata_piix(U) ahci(U) libata(U) sd_mod(U) scsi_mod(U) raid1(U) ext3(U) jbd(U) uhci_hcd(U) ohci_hcd(U) ehci_hcd(U)
Dec 1 10:08:38 meta1 kernel: Pid: 4466, comm: ldlm_elt Tainted: G 2.6.18-128.7.1.el5_lustre.1.8.1.1 #1
Dec 1 10:08:38 meta1 kernel: RIP: 0010:[<ffffffff80064aee>] [<ffffffff80064aee>] _write_lock+0x7/0xf
Dec 1 10:08:38 meta1 kernel: RSP: 0018:ffff81022efa3ca8 EFLAGS: 00000246
Dec 1 10:08:38 meta1 kernel: RAX: 0000000000007fff RBX: 0000000000004cc7 RCX: 0000000000005e96
Dec 1 10:08:38 meta1 kernel: RDX: ffffffffffffa6d8 RSI: ffffffff802fae80 RDI: ffffc20010611c7c
Dec 1 10:08:38 meta1 kernel: RBP: 0000000000000282 R08: ffff81000100e8e0 R09: 5a5a5a5a5a5a5a5a
Dec 1 10:08:38 meta1 kernel: R10: 5a5a5a5a5a5a5a5a R11: 5a5a5a5a5a5a5a5a R12: 0000000000000282
Dec 1 10:08:38 meta1 kernel: R13: ffff81022efa3c50 R14: ffff8101b9d2eb98 R15: ffff8101b9d2eb40
Dec 1 10:08:38 meta1 kernel: FS: 00002ae6ba0b1230(0000) GS:ffff810107ed96c0(0000) knlGS:0000000000000000
Dec 1 10:08:38 meta1 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Dec 1 10:08:38 meta1 kernel: CR2: 00000031d329a3a0 CR3: 0000000237251000 CR4: 00000000000006e0
Dec 1 10:08:38 meta1 kernel:
Dec 1 10:08:38 meta1 kernel: Call Trace:
Dec 1 10:08:38 meta1 kernel: [<ffffffff884ccba7>] :obdclass:lustre_hash_for_each_empty+0x237/0x2b0
Dec 1 10:08:38 meta1 kernel: [<ffffffff884d3ae8>] :obdclass:class_disconnect+0x398/0x420
Dec 1 10:08:38 meta1 kernel: [<ffffffff8886f5e1>] :mds:mds_disconnect+0x121/0xe40
Dec 1 10:08:38 meta1 kernel: [<ffffffff80089d8d>] enqueue_task+0x41/0x56
Dec 1 10:08:38 meta1 kernel: [<ffffffff80089bdd>] dequeue_task+0x18/0x37
Dec 1 10:08:38 meta1 kernel: [<ffffffff80063098>] thread_return+0x62/0xfe
Dec 1 10:08:38 meta1 kernel: [<ffffffff884cf994>] :obdclass:class_fail_export+0x384/0x4c0
Dec 1 10:08:39 meta1 kernel: [<ffffffff8003b59a>] remove_wait_queue+0x1c/0x2c
Dec 1 10:08:39 meta1 kernel: [<ffffffff8857224f>] :ptlrpc:expired_lock_main+0x68f/0x8b0
Dec 1 10:08:39 meta1 kernel: [<ffffffff8008a3f3>] default_wake_function+0x0/0xe
Dec 1 10:08:39 meta1 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
Dec 1 10:08:39 meta1 kernel: [<ffffffff88571bc0>] :ptlrpc:expired_lock_main+0x0/0x8b0
Dec 1 10:08:39 meta1 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11

I'm afraid it's not the same issue. The MDT is not on a Smart Array; it
is on an md RAID1 array, though I don't know if that matters. I think
these happened around the time users switched on their workstations.
Other than that, no problems with the cluster were reported. I'm sorry
this has gotten a bit long...

Do you still think I should recompile the kernel with the change above?

Thank you,
tamas
On Dec 1, 2009, at 10:51 PM, Papp Tamás wrote:
> I'm afraid it's not the same issue.

The stacks still look similar, AFAICS.

> Do you still think I should recompile the kernel with the change above?

Actually, this is bug 19557, and a patch is pending review.
I would be delighted if you could give the patch a try.

Cheers, Johann
Johann Lombardi wrote, On 2009. 12. 02. 0:18:
> Actually, this is bug 19557, and a patch is pending review.
> I would be delighted if you could give the patch a try.

How can I rebuild only the Lustre modules?

I checked out the b1_8 source tree and applied the patches from bugzilla.

Then I tried this:

./configure --with-linux=/usr/src/kernels/2.6.18-128.7.1.el5_lustre.1.8.1.1-x86_64/

make says:

mkdir -p linux-stage/fs/ext3 linux-stage/include/linux
cp /usr/src/kernels/2.6.18-128.7.1.el5_lustre.1.8.1.1-x86_64//fs/ext3/acl.c
   /usr/src/kernels/2.6.18-128.7.1.el5_lustre.1.8.1.1-x86_64//fs/ext3/balloc.c
   /usr/src/kernels/2.6.18-128.7.1.el5_lustre.1.8.1.1-x86_64//fs/ext3/bitmap.c
   /usr/src/kernels/2.6.18-128.7.1.el5_lustre.1.8.1.1-x86_64//fs/ext3/dir.c
   /usr/src/kernels/2.6.18-128.7.1.el5_lustre.1.8.1.1-x86_64//fs/ext3/file.c
   /usr/src/kernels/2.6.18-128.7.1.el5_lustre.1.8.1.1-x86_64//fs/ext3/fsync.c
   /usr/src/kernels/2.6.18-128.7.1.el5_lustre.1.8.1.1-x86_64//fs/ext3/hash.c
   /usr/src/kernels/2.6.18-128.7.1.el5_lustre.1.8.1.1-x86_64//fs/ext3/ialloc.c
   /usr/src/kernels/2.6.18-128.7.1.el5_lustre.1.8.1.1-x86_64//fs/ext3/inode.c
   /usr/src/kernels/2.6.18-128.7.1.el5_lustre.1.8.1.1-x86_64//fs/ext3/ioctl.c
   /usr/src/kernels/2.6.18-128.7.1.el5_lustre.1.8.1.1-x86_64//fs/ext3/namei.c
   /usr/src/kernels/2.6.18-128.7.1.el5_lustre.1.8.1.1-x86_64//fs/ext3/resize.c
   /usr/src/kernels/2.6.18-128.7.1.el5_lustre.1.8.1.1-x86_64//fs/ext3/super.c
   /usr/src/kernels/2.6.18-128.7.1.el5_lustre.1.8.1.1-x86_64//fs/ext3/symlink.c
   /usr/src/kernels/2.6.18-128.7.1.el5_lustre.1.8.1.1-x86_64//fs/ext3/xattr.c
   /usr/src/kernels/2.6.18-128.7.1.el5_lustre.1.8.1.1-x86_64//fs/ext3/xattr_security.c
   /usr/src/kernels/2.6.18-128.7.1.el5_lustre.1.8.1.1-x86_64//fs/ext3/xattr_trusted.c
   /usr/src/kernels/2.6.18-128.7.1.el5_lustre.1.8.1.1-x86_64//fs/ext3/xattr_user.c
   /usr/src/kernels/2.6.18-128.7.1.el5_lustre.1.8.1.1-x86_64//fs/ext3/acl.h
   /usr/src/kernels/2.6.18-128.7.1.el5_lustre.1.8.1.1-x86_64//fs/ext3/namei.h
   /usr/src/kernels/2.6.18-128.7.1.el5_lustre.1.8.1.1-x86_64//fs/ext3/xattr.h
   /usr/src/kernels/2.6.18-128.7.1.el5_lustre.1.8.1.1-x86_64//fs/ext3/Makefile
   linux-stage/fs/ext3
if test -n "/usr/src/kernels/2.6.18-128.7.1.el5_lustre.1.8.1.1-x86_64//include/linux/ext3_fs.h
   /usr/src/kernels/2.6.18-128.7.1.el5_lustre.1.8.1.1-x86_64//include/linux/ext3_fs_i.h
   /usr/src/kernels/2.6.18-128.7.1.el5_lustre.1.8.1.1-x86_64//include/linux/ext3_fs_sb.h
   /usr/src/kernels/2.6.18-128.7.1.el5_lustre.1.8.1.1-x86_64//include/linux/ext3_jbd.h" ; then \
cp /usr/src/kernels/2.6.18-128.7.1.el5_lustre.1.8.1.1-x86_64//include/linux/ext3_fs.h
   /usr/src/kernels/2.6.18-128.7.1.el5_lustre.1.8.1.1-x86_64//include/linux/ext3_fs_i.h
   /usr/src/kernels/2.6.18-128.7.1.el5_lustre.1.8.1.1-x86_64//include/linux/ext3_fs_sb.h
   /usr/src/kernels/2.6.18-128.7.1.el5_lustre.1.8.1.1-x86_64//include/linux/ext3_jbd.h
   linux-stage/include/linux; \
fi
Applying ext3 patches:
ext3-version-2.6-rhel5.patch
ext3-wantedi-2.6-rhel4.patch
iopen-2.6.18-rhel5.patch
ext3-map_inode_page-2.6.18.patch
export-ext3-2.6-rhel4.patch
ext3-include-fixes-2.6-rhel4.patch
ext3-extents-2.6.18-vanilla.patch
1 out of 5 hunks FAILED -- saving rejects to file include/linux/ext3_fs.h.rej
make[5]: *** [sources] Error 1
make[5]: Leaving directory `/usr/src/cvs/lustre/ldiskfs/ldiskfs'
make[4]: *** [all-sources] Error 2
make[4]: Leaving directory `/usr/src/cvs/lustre/ldiskfs'
make[3]: *** [all-recursive] Error 1
make[3]: Leaving directory `/usr/src/cvs/lustre/ldiskfs'
make[2]: *** [all] Error 2
make[2]: Leaving directory `/usr/src/cvs/lustre/ldiskfs'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/usr/src/cvs/lustre'
make: *** [all] Error 2

Why does it want to patch the kernel again?

Thank you,
tamas
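One way to see which hunk of the ext3-extents patch rejects, before letting the build system or quilt touch anything, is a dry-run apply. A minimal sketch; `try_patch` is my own helper name, and the kernel_patches location in the comment is an assumption about the b1_8 tree layout:

```shell
# try_patch: hypothetical helper (my naming, not part of the Lustre
# build).  Dry-runs a patch against a tree so failing hunks are
# reported without modifying any files.  Pass the patch file as an
# absolute path, since we cd into the tree first.
try_patch() {
    tree=$1
    patchfile=$2
    # --dry-run: report only; --force: never prompt interactively
    ( cd "$tree" && patch -p1 --dry-run --force < "$patchfile" )
}

# Against the staged ext3 sources it would look roughly like this
# (the kernel_patches path is an assumption; adjust to your checkout):
#   try_patch /usr/src/cvs/lustre/ldiskfs/ldiskfs/linux-stage \
#       /usr/src/cvs/lustre/ldiskfs/kernel_patches/patches/ext3-extents-2.6.18-vanilla.patch
```

The exit status is nonzero when any hunk fails, which makes it easy to tell a clean apply from a context mismatch.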
On 2009-12-05, at 10:28, Papp Tamás wrote:
> Johann Lombardi wrote, On 2009. 12. 02. 0:18:
>> Actually, this is bug 19557, and a patch is pending review.
>> I would be delighted if you could give the patch a try.
>
> How can I rebuild only the Lustre modules?
>
> I checked out the b1_8 source tree and applied the patches from
> bugzilla.
>
> Then I tried this:
>
> ./configure --with-linux=/usr/src/kernels/2.6.18-128.7.1.el5_lustre.1.8.1.1-x86_64/

This should be the right kernel for b1_8, according to lustre/ChangeLog.

> make says:
>
> mkdir -p linux-stage/fs/ext3 linux-stage/include/linux
> cp /usr/src/kernels/2.6.18-128.7.1.el5_lustre.1.8.1.1-x86_64//fs/ext3/acl.c
> [same file-copy output as in the previous message, trimmed]
> Applying ext3 patches:
> ext3-version-2.6-rhel5.patch
> ext3-wantedi-2.6-rhel4.patch
> iopen-2.6.18-rhel5.patch
> ext3-map_inode_page-2.6.18.patch
> export-ext3-2.6-rhel4.patch
> ext3-include-fixes-2.6-rhel4.patch
> ext3-extents-2.6.18-vanilla.patch
> 1 out of 5 hunks FAILED -- saving rejects to include/linux/ext3_fs.h.rej
>
> Why does it want to patch the kernel again?

This is trying to build the ldiskfs module from the ext3 sources. It
_should_ work, given that you have the right kernel sources, but clearly
either the patch was changed, or something is different between your
ext3 and what the patch expects. This is normally just a simple context
error.
To fix this:

    cd ldiskfs/ldiskfs/linux-stage
    quilt push -f
    {apply include/linux/ext3_fs.h.rej to ext3_fs.h by hand}
    quilt refresh
    cd -

then "make" should work again.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
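After `quilt push -f`, every hunk that failed to apply is left in a `*.rej` file next to its target, and each one has to be merged by hand before `quilt refresh`. A tiny sketch for finding them all; `list_rejects` is my own helper name, not a quilt command:

```shell
# list_rejects: hypothetical helper (my naming).  Lists leftover
# reject files under a directory so none are missed before refreshing
# the quilt patch.
list_rejects() {
    find "${1:-.}" -name '*.rej' -print
}

# From the staging directory of the failed push:
#   cd ldiskfs/ldiskfs/linux-stage && list_rejects .
```

An empty listing after the hand-merge is a reasonable check that `quilt refresh` will capture a complete patch.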
On 2009. 12. 06. 1:47, Andreas Dilger wrote:
>> ./configure --with-linux=/usr/src/kernels/2.6.18-128.7.1.el5_lustre.1.8.1.1-x86_64/
>
> This should be the right kernel for b1_8, according to lustre/ChangeLog.
>
> This is trying to build the ldiskfs module from the ext3 sources. It
> _should_ work, given that you have the right kernel sources, but
> clearly either the patch was changed, or something is different
> between your ext3 and what the patch expects. This is normally just a
> simple context error.
>
> To fix this:
>
>     cd ldiskfs/ldiskfs/linux-stage
>     quilt push -f
>     {apply include/linux/ext3_fs.h.rej to ext3_fs.h by hand}
>     quilt refresh
>     cd -
>
> then "make" should work again.

Well, it was not working, and my guess is that it should not work. I
didn't mention it before, but
/usr/src/kernels/2.6.18-128.7.1.el5_lustre.1.8.1.1-x86_64/ belongs to
the official kernel-lustre-devel-2.6.18-128.7.1.el5_lustre.1.8.1.1
package.

Anyway, I built my own kernel and Lustre b1_8 with the patches from bug
19557:

Linux meta1 2.6.18-prep #1 SMP Sun Dec 6 14:40:15 CET 2009 x86_64 x86_64 x86_64 GNU/Linux

I installed it on the MDS and the two OSSs.
Unfortunately, no luck:

Dec 9 18:41:36 node1 kernel: ll_ost_io_05 S ffff81023f353860 0 4421 1 4422 4420 (L-TLB)
Dec 9 18:41:36 node1 kernel: ffff810229cf9990 0000000000000046 00000000000000fd ffffffff88491b91
Dec 9 18:41:36 node1 kernel: ffff8101ff1d65a0 000000000000000a ffff810229c080c0 ffff81023f353860
Dec 9 18:41:36 node1 kernel: 0000f4410ae68bb7 0000000000000825 ffff810229c082a8 0000000300000000
Dec 9 18:41:36 node1 kernel: Call Trace:
Dec 9 18:41:36 node1 kernel: [<ffffffff88491b91>] :lnet:LNetMDBind+0x301/0x450
Dec 9 18:41:36 node1 kernel: [<ffffffff8003dacd>] lock_timer_base+0x1b/0x3c
Dec 9 18:41:36 node1 kernel: [<ffffffff8001caa7>] __mod_timer+0xb0/0xbe
Dec 9 18:41:36 node1 kernel: [<ffffffff8006387b>] schedule_timeout+0x8a/0xad
Dec 9 18:41:36 node1 kernel: [<ffffffff80096ff3>] process_timeout+0x0/0x5
Dec 9 18:41:36 node1 kernel: [<ffffffff888234ca>] :ost:ost_brw_write+0x137a/0x23a0
Dec 9 18:41:36 node1 kernel: [<ffffffff8859f998>] :ptlrpc:ptlrpc_send_reply+0x5c8/0x5e0
Dec 9 18:41:36 node1 kernel: [<ffffffff8856aab0>] :ptlrpc:target_committed_to_req+0x40/0x120
Dec 9 18:41:36 node1 kernel: [<ffffffff8881f81d>] :ost:ost_brw_read+0x189d/0x1a70
Dec 9 18:41:36 node1 kernel: [<ffffffff885a3e25>] :ptlrpc:lustre_msg_get_version+0x35/0xf0
Dec 9 18:41:36 node1 kernel: [<ffffffff8008c3ba>] default_wake_function+0x0/0xe
Dec 9 18:41:36 node1 kernel: [<ffffffff885a3ee8>] :ptlrpc:lustre_msg_check_version_v2+0x8/0x20
Dec 9 18:41:36 node1 kernel: [<ffffffff8882709e>] :ost:ost_handle+0x2bae/0x53e0
Dec 9 18:41:36 node1 kernel: [<ffffffff8858233e>] :ptlrpc:ldlm_refresh_waiting_lock+0xbe/0x110
Dec 9 18:41:36 node1 kernel: [<ffffffff88499485>] :lnet:lnet_match_blocked_msg+0x375/0x390
Dec 9 18:41:36 node1 kernel: [<ffffffff8855e1d4>] :ptlrpc:ldlm_lock_get+0x4/0x10
Dec 9 18:41:36 node1 kernel: [<ffffffff885ae45d>] :ptlrpc:ptlrpc_check_req+0x1d/0x110
Dec 9 18:41:36 node1 kernel: [<ffffffff885b10b9>] :ptlrpc:ptlrpc_server_handle_request+0xa79/0x1150
Dec 9 18:41:36 node1 kernel: [<ffffffff8008a7e4>] __wake_up_common+0x3e/0x68
Dec 9 18:41:36 node1 kernel: [<ffffffff885b4ae8>] :ptlrpc:ptlrpc_main+0x1258/0x1420
Dec 9 18:41:36 node1 kernel: [<ffffffff8008c3ba>] default_wake_function+0x0/0xe
Dec 9 18:41:36 node1 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
Dec 9 18:41:36 node1 kernel: [<ffffffff885b3890>] :ptlrpc:ptlrpc_main+0x0/0x1420
Dec 9 18:41:36 node1 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11
Dec 9 18:41:36 node1 kernel:
Dec 9 18:41:36 node1 kernel: LustreError: dumping log to /tmp/lustre-log.1260380496.4421

Should I install the patched Lustre on the clients too? Or is the
problem something else?

I'm sorry for the messy mail.

Thank you,
tamas
On 2009-12-10, at 16:37, Papp Tamás wrote:
> Well, it was not working, and my guess is that it should not work. I
> didn't mention it before, but
> /usr/src/kernels/2.6.18-128.7.1.el5_lustre.1.8.1.1-x86_64/ belongs to
> the official kernel-lustre-devel-2.6.18-128.7.1.el5_lustre.1.8.1.1
> package.
>
> Anyway, I built my own kernel and Lustre b1_8 with the patches from
> bug 19557:
>
> Linux meta1 2.6.18-prep #1 SMP Sun Dec 6 14:40:15 CET 2009 x86_64 x86_64 x86_64 GNU/Linux
>
> I installed it on the MDS and the two OSSs. Unfortunately, no luck:
>
> Dec 9 18:41:36 node1 kernel: Call Trace:
> Dec 9 18:41:36 node1 kernel: [<ffffffff88491b91>] :lnet:LNetMDBind+0x301/0x450
> Dec 9 18:41:36 node1 kernel: [<ffffffff8003dacd>] lock_timer_base+0x1b/0x3c
> Dec 9 18:41:36 node1 kernel: [<ffffffff8001caa7>] __mod_timer+0xb0/0xbe
> Dec 9 18:41:36 node1 kernel: [<ffffffff8006387b>] schedule_timeout+0x8a/0xad
> Dec 9 18:41:36 node1 kernel: [<ffffffff80096ff3>] process_timeout+0x0/0x5
> Dec 9 18:41:36 node1 kernel: [<ffffffff888234ca>] :ost:ost_brw_write+0x137a/0x23a0
> Dec 9 18:41:36 node1 kernel: [<ffffffff8859f998>] :ptlrpc:ptlrpc_send_reply+0x5c8/0x5e0

This is a different issue. The previous stack was busy in
lustre_hash_for_each_empty().

> Should I install the patched Lustre on the clients too? Or is the
> problem something else?

This is strictly a server-side patch. This looks like a similar issue,
and may be fixed by one of the other patches on bug 19557.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.