Hello Folks, I have few client nodes that are getting soft lockup errors. These are patchless clients running Lustre 1.6.5.1 with kernel 2.6.18-53.1.6.el5-PAPI. More or less stock RHEL 5.1 with PAPI patch added on it. The MDS and OSS are running Lustre 1.6.5.1 with the supplied Lustre kernels and OFED 1.3.1. I remember there was an issue with __d_lookup in the past but I thought it was fixed with the newest release of Lustre. So I dont know if this is related in anyway at all. I dont see any other real lustre error messages on the client or the MDS/OSS at the time of the softlock up. Also wasnt there a softirq issue? I dont think this is related to that... Thanks, -Alex Here is the syslog output: Aug 24 04:02:03 papi-0476 syslogd 1.4.1: restart. Aug 25 00:52:24 papi-0476 kernel: Losing some ticks... checking if CPU frequency changed. Aug 25 15:17:05 papi-0476 kernel: cpsMPI.x[10493]: segfault at 00002aaab23ce000 rip 00000000004a8fce rs p 00007fff584d6cf0 error 4 Aug 27 04:15:00 papi-0476 kernel: BUG: soft lockup detected on CPU#4! Aug 27 04:15:00 papi-0476 kernel: Aug 27 04:15:00 papi-0476 kernel: Call Trace: Aug 27 04:15:00 papi-0476 kernel: <IRQ> [<ffffffff800b619e>] softlockup_tick+0xd5/0xe7 Aug 27 04:15:00 papi-0476 kernel: [<ffffffff800910ab>] tasklet_action+0x62/0xac Aug 27 04:15:00 papi-0476 kernel: [<ffffffff800941bb>] update_process_times+0x54/0x7a Aug 27 04:15:00 papi-0476 kernel: [<ffffffff80075727>] smp_local_timer_interrupt+0x2c/0x61 Aug 27 04:15:00 papi-0476 kernel: [<ffffffff8005d368>] call_softirq+0x1c/0x28 Aug 27 04:15:00 papi-0476 kernel: [<ffffffff80075def>] smp_apic_timer_interrupt+0x41/0x47 Aug 27 04:15:00 papi-0476 kernel: [<ffffffff8005cc8e>] apic_timer_interrupt+0x66/0x6c Aug 27 04:15:00 papi-0476 kernel: <EOI> [<ffffffff8002c285>] dummy_inode_permission+0x0/0x3 Aug 27 04:15:00 papi-0476 kernel: [<ffffffff800094bb>] __d_lookup+0xd2/0xff Aug 27 04:15:00 papi-0476 kernel: [<ffffffff80009499>] __d_lookup+0xb0/0xff Aug 27 04:15:00 papi-0476 kernel: [<ffffffff8000c990>] do_lookup+0x2c/0x1d4 Aug 27 04:15:00 papi-0476 kernel: [<ffffffff80009ee9>] __link_path_walk+0xa01/0xf42 Aug 27 04:15:00 papi-0476 kernel: [<ffffffff8000e5ca>] link_path_walk+0x5c/0xe5 Aug 27 04:15:00 papi-0476 kernel: [<ffffffff8000c7e8>] do_path_lookup+0x270/0x2e8 Aug 27 04:15:01 papi-0476 kernel: [<ffffffff800234de>] __user_walk_fd+0x37/0x4c Aug 27 04:15:01 papi-0476 kernel: [<ffffffff8003e6cf>] vfs_lstat_fd+0x18/0x47 Aug 27 04:15:01 papi-0476 kernel: [<ffffffff800dc1b0>] sys_newfstatat+0x22/0x43 Aug 27 04:15:01 papi-0476 kernel: [<ffffffff8005c116>] system_call+0x7e/0x83 Aug 27 04:15:01 papi-0476 kernel: Aug 28 04:03:14 papi-0476 kernel: BUG: soft lockup detected on CPU#0! Aug 28 04:03:14 papi-0476 kernel: Aug 28 04:03:14 papi-0476 kernel: Call Trace: Aug 28 04:03:14 papi-0476 kernel: <IRQ> [<ffffffff800b619e>] softlockup_tick+0xd5/0xe7 Aug 28 04:03:14 papi-0476 kernel: [<ffffffff800941bb>] update_process_times+0x54/0x7a Aug 28 04:03:14 papi-0476 kernel: [<ffffffff80075727>] smp_local_timer_interrupt+0x2c/0x61 Aug 28 04:03:14 papi-0476 kernel: [<ffffffff80075def>] smp_apic_timer_interrupt+0x41/0x47 Aug 28 04:03:14 papi-0476 kernel: [<ffffffff8005cc8e>] apic_timer_interrupt+0x66/0x6c Aug 28 04:03:14 papi-0476 kernel: <EOI> [<ffffffff8002c285>] dummy_inode_permission+0x0/0x3 Aug 28 04:03:14 papi-0476 kernel: [<ffffffff800094cb>] __d_lookup+0xe2/0xff Aug 28 04:03:14 papi-0476 kernel: [<ffffffff80009499>] __d_lookup+0xb0/0xff Aug 28 04:03:14 papi-0476 kernel: [<ffffffff8000c990>] do_lookup+0x2c/0x1d4 Aug 28 04:03:14 papi-0476 kernel: [<ffffffff80009ee9>] __link_path_walk+0xa01/0xf42 Aug 28 04:03:14 papi-0476 kernel: [<ffffffff8000e5ca>] link_path_walk+0x5c/0xe5 Aug 28 04:03:14 papi-0476 kernel: [<ffffffff8000c7e8>] do_path_lookup+0x270/0x2e8 Aug 28 04:03:14 papi-0476 kernel: [<ffffffff800234de>] __user_walk_fd+0x37/0x4c Aug 28 04:03:14 papi-0476 kernel: [<ffffffff8003e6cf>] vfs_lstat_fd+0x18/0x47 Aug 28 04:03:14 papi-0476 kernel: [<ffffffff800dc1b0>] sys_newfstatat+0x22/0x43 Aug 28 04:03:14 papi-0476 kernel: [<ffffffff8005c116>] system_call+0x7e/0x83 Aug 28 04:03:14 papi-0476 kernel: Aug 28 04:12:12 papi-0476 kernel: BUG: soft lockup detected on CPU#3! Aug 28 04:12:17 papi-0476 kernel: Aug 28 04:12:18 papi-0476 kernel: Call Trace: Aug 28 04:12:18 papi-0476 kernel: <IRQ> [<ffffffff800b619e>] softlockup_tick+0xd5/0xe7 Aug 28 04:12:18 papi-0476 kernel: [<ffffffff800941bb>] update_process_times+0x54/0x7a Aug 28 04:12:18 papi-0476 kernel: [<ffffffff80075727>] smp_local_timer_interrupt+0x2c/0x61 Aug 28 04:12:18 papi-0476 kernel: [<ffffffff80075def>] smp_apic_timer_interrupt+0x41/0x47 Aug 28 04:12:18 papi-0476 kernel: [<ffffffff8005cc8e>] apic_timer_interrupt+0x66/0x6c Aug 28 04:12:18 papi-0476 kernel: [<ffffffff80007660>] kmem_cache_free+0x1c0/0x1cb Aug 28 04:12:18 papi-0476 kernel: [<ffffffff8009a5dc>] __rcu_process_callbacks+0x122/0x1a8 Aug 28 04:12:18 papi-0476 kernel: [<ffffffff8009a685>] rcu_process_callbacks+0x23/0x43 Aug 28 04:12:18 papi-0476 kernel: [<ffffffff800910ab>] tasklet_action+0x62/0xac Aug 28 04:12:18 papi-0476 kernel: [<ffffffff80011cc5>] __do_softirq+0x5e/0xd5 Aug 28 04:12:18 papi-0476 kernel: [<ffffffff8005d368>] call_softirq+0x1c/0x28 Aug 28 04:12:18 papi-0476 kernel: [<ffffffff8006b565>] do_softirq+0x2c/0x85 Aug 28 04:12:18 papi-0476 kernel: [<ffffffff8005cc8e>] apic_timer_interrupt+0x66/0x6c Aug 28 04:03:14 papi-0476 kernel: [<ffffffff800dc1b0>] sys_newfstatat+0x22/0x43 Aug 28 04:03:14 papi-0476 kernel: [<ffffffff8005c116>] system_call+0x7e/0x83 Aug 28 04:03:14 papi-0476 kernel: Aug 28 04:12:12 papi-0476 kernel: BUG: soft lockup detected on CPU#3! Aug 28 04:12:17 papi-0476 kernel: Aug 28 04:12:18 papi-0476 kernel: Call Trace: Aug 28 04:12:18 papi-0476 kernel: <IRQ> [<ffffffff800b619e>] softlockup_tick+0xd5/0xe7 Aug 28 04:12:18 papi-0476 kernel: [<ffffffff800941bb>] update_process_times+0x54/0x7a Aug 28 04:12:18 papi-0476 kernel: [<ffffffff80075727>] smp_local_timer_interrupt+0x2c/0x61 Aug 28 04:12:18 papi-0476 kernel: [<ffffffff80075def>] smp_apic_timer_interrupt+0x41/0x47 Aug 28 04:12:18 papi-0476 kernel: [<ffffffff8005cc8e>] apic_timer_interrupt+0x66/0x6c Aug 28 04:12:18 papi-0476 kernel: [<ffffffff80007660>] kmem_cache_free+0x1c0/0x1cb Aug 28 04:12:18 papi-0476 kernel: [<ffffffff8009a5dc>] __rcu_process_callbacks+0x122/0x1a8 Aug 28 04:12:18 papi-0476 kernel: [<ffffffff8009a685>] rcu_process_callbacks+0x23/0x43 Aug 28 04:12:18 papi-0476 kernel: [<ffffffff800910ab>] tasklet_action+0x62/0xac Aug 28 04:12:18 papi-0476 kernel: [<ffffffff80011cc5>] __do_softirq+0x5e/0xd5 Aug 28 04:12:18 papi-0476 kernel: [<ffffffff8005d368>] call_softirq+0x1c/0x28 Aug 28 04:12:18 papi-0476 kernel: [<ffffffff8006b565>] do_softirq+0x2c/0x85 Aug 28 04:12:18 papi-0476 kernel: [<ffffffff8005cc8e>] apic_timer_interrupt+0x66/0x6c Aug 28 04:12:18 papi-0476 kernel: <EOI> [<ffffffff800cb2c6>] __put_swap_token+0xb/0x52 Aug 28 04:12:18 papi-0476 kernel: [<ffffffff8003b7dc>] mmput+0x68/0x83 Aug 28 04:12:18 papi-0476 kernel: [<ffffffff80014f94>] do_exit+0x28b/0x89d Aug 28 04:12:18 papi-0476 kernel: [<ffffffff80046eac>] cpuset_exit+0x0/0x6c Aug 28 04:12:18 papi-0476 kernel: [<ffffffff8002ad92>] get_signal_to_deliver+0x42c/0x45a Aug 28 04:12:18 papi-0476 kernel: [<ffffffff800590f9>] do_notify_resume+0x9c/0x7a9 Aug 28 04:12:18 papi-0476 kernel: [<ffffffff80021d2e>] __up_read+0x19/0x7f Aug 28 04:12:19 papi-0476 kernel: [<ffffffff80065ac9>] do_page_fault+0x4eb/0x81d Aug 28 04:12:19 papi-0476 kernel: [<ffffffff800270db>] do_filp_open+0x2a/0x38 Aug 28 04:12:19 papi-0476 kernel: [<ffffffff8003fc68>] do_ioctl+0x5c/0x6b Aug 28 04:12:19 papi-0476 kernel: [<ffffffff8005c6dc>] retint_signal+0x3d/0x79 Aug 28 04:12:19 papi-0476 kernel:
Hello Alex, On Thursday 28 August 2008 05:20:47 Alex Lee wrote:> Hello Folks, > > I have few client nodes that are getting soft lockup errors. These are > patchless clients running Lustre 1.6.5.1 with kernel > 2.6.18-53.1.6.el5-PAPI. More or less stock RHEL 5.1 with PAPI patch added > on it. The MDS and OSS are running Lustre 1.6.5.1 with the supplied Lustre > kernels and OFED 1.3.1. > > I remember there was an issue with __d_lookup in the past but I thought it > was fixed with the newest release of Lustre. So I dont know if this is > related in anyway at all. I dont see any other real lustre error messages > on the client or the MDS/OSS at the time of the softlock up. Also wasnt > there a softirq issue? I dont think this is related to that...according to the traces it somehow looks like there is a double locking. Unfortunately this is hard to debug due to lustre bug#12752. In your traces it also looks like it might have locked up at "rcu_read_lock();" While you have compiled yourself anyway, could you recompile it with debugging symbols and then resolve __d_lookup+0xd2 using gdb? Cheers, Bernd -- Bernd Schubert Q-Leap Networks GmbH
Bernd Schubert wrote:> Hello Alex, > > On Thursday 28 August 2008 05:20:47 Alex Lee wrote: > >> Hello Folks, >> >> I have few client nodes that are getting soft lockup errors. These are >> patchless clients running Lustre 1.6.5.1 with kernel >> 2.6.18-53.1.6.el5-PAPI. More or less stock RHEL 5.1 with PAPI patch added >> on it. The MDS and OSS are running Lustre 1.6.5.1 with the supplied Lustre >> kernels and OFED 1.3.1. >> >> I remember there was an issue with __d_lookup in the past but I thought it >> was fixed with the newest release of Lustre. So I dont know if this is >> related in anyway at all. I dont see any other real lustre error messages >> on the client or the MDS/OSS at the time of the softlock up. Also wasnt >> there a softirq issue? I dont think this is related to that... >> > > according to the traces it somehow looks like there is a double locking. > Unfortunately this is hard to debug due to lustre bug#12752. > In your traces it also looks like it might have locked up > at "rcu_read_lock();" > > While you have compiled yourself anyway, could you recompile it with debugging > symbols and then resolve __d_lookup+0xd2 using gdb? > > > Cheers, > Bernd > > -- > Bernd Schubert > Q-Leap Networks GmbH >Someone found this bug for me that looks very similar. https://bugzilla.lustre.org/show_bug.cgi?id=15975 Does this look anything close? I''m pretty clueless about debugging kernel traces. Thanks, -Alex
Hi Alex, On Thursday 28 August 2008 14:52:22 Alex Lee wrote:> Someone found this bug for me that looks very similar. > > https://bugzilla.lustre.org/show_bug.cgi?id=15975 > > Does this look anything close? I''m pretty clueless about debugging > kernel traces.yeah, looks like this is your issue. Also explains why we didn''t run into this ourselves - we are using the same server and client kernel for all of our customers and so we are already patching d_rehash() for server site support. Cheers, Bernd -- Bernd Schubert Q-Leap Networks GmbH
Bernd Schubert wrote:> Hi Alex, > > On Thursday 28 August 2008 14:52:22 Alex Lee wrote: > >> Someone found this bug for me that looks very similar. >> >> https://bugzilla.lustre.org/show_bug.cgi?id=15975 >> >> Does this look anything close? I''m pretty clueless about debugging >> kernel traces. >> > > yeah, looks like this is your issue. Also explains why we didn''t run into this > ourselves - we are using the same server and client kernel for all of our > customers and so we are already patching d_rehash() for server site support. > > > Cheers, > Bernd > > > -- > Bernd Schubert > Q-Leap Networks GmbH >Hi Bernd, Your patching for d_rehash() for your servers? I thought the race conditions which causes the softlockup only happened on the patchless clients (which were running). The solution I saw was run lustre supplied kernel or run the iopen-misc patch on the kernel. Or did you mean you are already patched? -Alex
Hello Alex, sorry for the late reply, somehow missed your answer. Alex Lee wrote:> Hi Bernd, > > Your patching for d_rehash() for your servers? I thought the race > conditions which causes the softlockup only happened on the patchless > clients (which were running). > The solution I saw was run lustre supplied kernel or run the iopen-misc > patch on the kernel. > > Or did you mean you are already patched?Well, we are using the same kernel for both, clients and servers, so we are using a patched kernel anyway. Most of these patches don''t have any affect on llite (lustre-client), only the patches in vfs_races-* patching d_rehash() still have. The last patch submitted yesterday in bug#15975 by Alexey also removes this dependency. Unfortunately, it has a side effect, since it makes my patch in bug #16417 not to work anymore and customers are not happy when their filesystem fills up with already deleted files :( Cheers, Bernd -- Bernd Schubert Q-Leap Networks GmbH