Greetings,

I am looking for a good way to diagnose random crashes that are occurring with one of our OCFS clusters. It is a simple 2-node cluster. debugfs does not seem to indicate any issues.

(Also, I would be happy to find a consultant/freelancer to work through this.)

Cheers,
Matthew

---
Matthew E. Porter
Contegix
Beyond Managed Hosting(r) for Your Enterprise
A bugzilla with the oops stack trace will help.

> _______________________________________________
> Ocfs2-users mailing list
> Ocfs2-users@oss.oracle.com
> http://oss.oracle.com/mailman/listinfo/ocfs2-users
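When putting such a report together, it can help to pull the stack frames out of syslog and see which ones belong to a kernel module. What follows is only a rough Python sketch, assuming frames in the `[<address>] symbol+0xoffset/0xsize [module]` form that 2.6-era kernels write to the log. The names `parse_frames` and `frames_in_module` are made up for illustration and are not part of any OCFS2 tool.

```python
import re

# One stack frame as it appears in syslog, for example:
#   [<f8cb12cf>] ocfs2_prepare_write+0x150/0x19d [ocfs2]
# Whitespace around '+' and '/' is tolerated because mail clients
# sometimes wrap or pad these lines.
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]+)>\]\s*"
    r"(?P<symbol>[A-Za-z_][\w.]*)\s*"
    r"\+\s*0x(?P<offset>[0-9a-f]+)\s*/\s*0x(?P<size>[0-9a-f]+)"
    r"(?:\s*\[(?P<module>\w+)\])?"
)


def parse_frames(log_text):
    """Return a list of dicts, one per stack frame found in log_text."""
    frames = []
    for match in FRAME_RE.finditer(log_text):
        frames.append({
            "addr": match.group("addr"),
            "symbol": match.group("symbol"),
            "offset": int(match.group("offset"), 16),
            "size": int(match.group("size"), 16),
            # None when the symbol lives in the core kernel image
            "module": match.group("module"),
        })
    return frames


def frames_in_module(frames, module):
    """Return the symbols of the frames that belong to the given module."""
    return [frame["symbol"] for frame in frames if frame["module"] == module]
```

Feeding it the text of a log excerpt, for instance `frames_in_module(parse_frames(text), "ocfs2")`, gives the ocfs2 functions on the stack in the order they were logged.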
Sunil:

We have seen some similar errors in bugzilla. Specifically, what we are seeing is:

Sep 10 09:15:34 sulu kernel: BUG: soft lockup detected on CPU#2!
Sep 10 09:15:34 sulu kernel:  [<c0447f3f>] softlockup_tick+0x98/0xa6
Sep 10 09:15:34 sulu kernel:  [<c042d138>] update_process_times+0x39/0x5c
Sep 10 09:15:34 sulu kernel:  [<c04176f0>] smp_apic_timer_interrupt+0x5c/0x64
Sep 10 09:15:34 sulu kernel:  [<c04049bf>] apic_timer_interrupt+0x1f/0x24
Sep 10 09:15:34 sulu kernel:  [<c041c774>] kmap_atomic+0xb5/0xbb
Sep 10 09:15:34 sulu kernel:  [<c046cd92>] cont_prepare_write+0xd4/0x21d
Sep 10 09:15:34 sulu kernel:  [<f8cb12cf>] ocfs2_prepare_write+0x150/0x19d [ocfs2]
Sep 10 09:15:34 sulu kernel:  [<f8cb06da>] ocfs2_get_block+0x0/0xaa5 [ocfs2]
Sep 10 09:15:34 sulu kernel:  [<c044ecae>] generic_file_buffered_write+0x23f/0x5f1
Sep 10 09:15:34 sulu kernel:  [<f8856ac0>] do_get_write_access+0x43a/0x467 [jbd]
Sep 10 09:15:34 sulu kernel:  [<c0427f65>] current_fs_time+0x4a/0x55
Sep 10 09:15:34 sulu kernel:  [<c044f506>] __generic_file_aio_write_nolock+0x4a6/0x52a
Sep 10 09:15:34 sulu kernel:  [<f8cbebb2>] ocfs2_extend_file+0xf0d/0xf95 [ocfs2]
Sep 10 09:15:34 sulu kernel:  [<c044f831>] generic_file_aio_write_nolock+0x39/0x83
Sep 10 09:15:34 sulu kernel:  [<c044fbb4>] generic_file_write_nolock+0x86/0x9a
Sep 10 09:15:34 sulu kernel:  [<f8ccd226>] ocfs2_write_lock_maybe_extend+0xd39/0xe03 [ocfs2]
Sep 10 09:15:34 sulu kernel:  [<c04352dd>] autoremove_wake_function+0x0/0x2d
Sep 10 09:15:34 sulu kernel:  [<f8cbf190>] ocfs2_file_write+0x189/0x22c [ocfs2]
Sep 10 09:15:34 sulu kernel:  [<f8cbf007>] ocfs2_file_write+0x0/0x22c [ocfs2]
Sep 10 09:15:34 sulu kernel:  [<c0469af3>] vfs_write+0xa1/0x143
Sep 10 09:15:34 sulu kernel:  [<c046a0e5>] sys_write+0x3c/0x63
Sep 10 09:15:34 sulu kernel:  [<c0403eff>] syscall_call+0x7/0xb

This happens on all nodes. The CPU# and timestamp change, but the problem persists. The systems do not restart or panic. The system merely puts every process accessing the OCFS volume in a D state.

Would you still like me to log another bugzilla issue? I am happy to do so if you wish.

Cheers,
Matthew

---
Matthew E. Porter
Contegix
Beyond Managed Hosting(r) for Your Enterprise

On Sep 7, 2007, at 12:49 PM, Sunil Mushran wrote:
> A bugzilla with the oops stack trace will help.
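The D state mentioned above is the kernel's uninterruptible sleep, reported as `D` in the state field of `/proc/<pid>/stat`. A minimal Python sketch for listing such processes on a node might look like the following. The function names are illustrative only, and the script does not try to work out which of the blocked processes are actually touching the OCFS volume.

```python
import os


def parse_stat_state(stat_text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat.

    The command name is wrapped in parentheses and may itself contain
    spaces or parentheses, so split on the last ')' rather than on spaces.
    """
    left, _, right = stat_text.rpartition(")")
    pid_str, _, comm = left.partition("(")
    state = right.split()[0]
    return int(pid_str), comm, state


def uninterruptible_processes(proc_root="/proc"):
    """List (pid, comm) for every process currently in state D."""
    blocked = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                pid, comm, state = parse_stat_state(handle.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited meanwhile, or the entry is unreadable
        if state == "D":
            blocked.append((pid, comm))
    return blocked
```

Calling `uninterruptible_processes()` on each node while the problem is occurring, and attaching the result to the bug report, shows which processes are stuck at that moment.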
Please log a bugzilla with this output along with all the version numbers (kernel/ocfs2/distro).
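The three version numbers asked for here can be gathered in one step. The Python sketch below is only one possible approach and rests on some assumptions: that `modinfo -F version ocfs2` reports a version for the installed module, and that the distribution identifies itself in a file matching `/etc/*-release`. Neither is guaranteed on every system, and any value that cannot be determined is reported as "unknown". The function names are hypothetical.

```python
import glob
import platform
import subprocess


def run(cmd):
    """Run cmd and return its stripped stdout, or None if it fails."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=10)
    except (OSError, subprocess.SubprocessError):
        return None  # command missing, not executable, or timed out
    if result.returncode != 0:
        return None
    return result.stdout.strip() or None


def read_first_line(path):
    """Return the first line of a text file, or None if it cannot be read."""
    try:
        with open(path) as handle:
            return handle.readline().strip() or None
    except (OSError, ValueError):
        return None


def collect_versions():
    """Gather kernel, ocfs2 module, and distribution identifiers."""
    # Assumption: distro release files live under /etc/*-release.
    release_lines = [read_first_line(p) for p in sorted(glob.glob("/etc/*-release"))]
    distro = "; ".join(line for line in release_lines if line)
    return {
        "kernel": platform.release(),  # same value as `uname -r`
        # Assumption: the module exposes a version field to modinfo.
        "ocfs2_module": run(["modinfo", "-F", "version", "ocfs2"]),
        "distro": distro or None,
    }


def format_report(versions):
    """Render the collected values as plain text for pasting into a bug."""
    return "\n".join(f"{name}: {value or 'unknown'}" for name, value in versions.items())
```

Running `print(format_report(collect_versions()))` on each node yields a short block of text that can be pasted into the bug report next to the trace.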
Submitted as bug 918. Thank you for your assistance. (Posting information here in case anyone else has seen the issue.)

Cheers,
Matthew

---
Matthew E. Porter
Contegix
Beyond Managed Hosting(r) for Your Enterprise

On Sep 11, 2007, at 10:49 AM, Sunil Mushran wrote:
> Please log a bugzilla with this output along with all the version
> numbers. Kernel/ocfs2/distro