Nirmal Seenu
2009-Aug-31 19:19 UTC
[Lustre-discuss] SCSI driver problem on 2.6.22.19 + lustre 1.8.1?
I have been having a lot of troubles with my SCSI devices on 2.6.22.19 + lustre 1.8.1. I am not able to boot my servers if the fibre is connected and I have to physically remove the fibre every time that I need to reboot the server. I am using SATABeast for OSTs and it is connected to the servers through fibre channel using QLE2460 cards. I was running 1.8.0.1 without any problem on the same kernel(2.6.22.19) and all these problems started when I upgraded to 1.8.1 using the same kernel. The following Error messages and Call traces are generated when the machine is booted with the fibre connected and the machine pretty much becomes unresponsive at this point: TIA Nirmal qla2xxx 0000:03:00.0: LIP reset occured (f7ef). qla2xxx 0000:03:00.0: LIP occured (f7ef). qla2xxx 0000:03:00.0: LOOP UP detected (4 Gbps). scsi 4:0:0:0: Direct-Access NEXSAN SATABeast Gl66 PQ: 1 ANSI: 5 scsi 4:0:0:101: Direct-Access NEXSAN SATABeast Gl66 PQ: 0 ANSI: 5 kobject_add failed for 4:0:0:101 with -EEXIST, don''t try to register things with the same name in the same directory. Call Trace: [<ffffffff80311c3c>] kobject_put+0x19/0x1b [<ffffffff80311fb8>] kobject_shadow_add+0x177/0x1ac [<ffffffff80311ff8>] kobject_add+0xb/0xd [<ffffffff80388c22>] device_add+0xb5/0x5ea [<ffffffff80250102>] trace_hardirqs_on+0x11c/0x147 [<ffffffff880680ca>] :scsi_mod:scsi_sysfs_add_sdev+0x39/0x22d [<ffffffff88065e97>] :scsi_mod:scsi_probe_and_add_lun+0x9de/0xb0e [<ffffffff88066917>] :scsi_mod:__scsi_scan_target+0x44a/0x638 [<ffffffff80468a19>] __mutex_lock_slowpath+0x26c/0x279 [<ffffffff88067101>] :scsi_mod:scsi_scan_target+0x9a/0xaf [<ffffffff880bf86e>] :scsi_transport_fc:fc_scsi_scan_rport+0x0/0x89 [<ffffffff880bf8d1>] :scsi_transport_fc:fc_scsi_scan_rport+0x63/0x89 [<ffffffff80244a92>] run_workqueue+0x97/0x16a [<ffffffff80245465>] worker_thread+0x0/0xea [<ffffffff80245544>] worker_thread+0xdf/0xea [<ffffffff80248603>] autoremove_wake_function+0x0/0x38 [<ffffffff802484bb>] kthread+0x49/0x76 [<ffffffff8020abb8>] child_rip+0xa/0x12 [<ffffffff8020a2cc>] restore_args+0x0/0x30 [<ffffffff80293fe5>] ____cache_alloc_node+0xff/0x144 [<ffffffff80248472>] kthread+0x0/0x76 [<ffffffff8020abae>] child_rip+0x0/0x12 error 1 scsi 4:0:0:0: Unexpected response from lun 101 while scanning, scan aborted list_add corruption. prev->next should be next (ffffffff805bc9e8), but was ffff81041f5d2410. (prev=ffff81041f5d2410). ------------[ cut here ]------------ kernel BUG at lib/list_debug.c:33! invalid opcode: 0000 [1] SMP CPU 4 Modules linked in: usb_storage qla2xxx scsi_transport_fc sata_nv libata sd_mod scsi_mod raid1 ext3 jbd uhci_hcd ohci_hcd ehci_hcd Pid: 632, comm: scsi_scan_5 Not tainted 2.6.22.19_lustre.1.8.1 #1 RIP: 0010:[<ffffffff80318cca>] [<ffffffff80318cca>] __list_add+0x47/0x5b RSP: 0000:ffff81021f2afc40 EFLAGS: 00010286 RAX: 0000000000000079 RBX: ffff81041f5c5930 RCX: 0000000000000079 RDX: ffff81021a1b8080 RSI: 0000000000000001 RDI: 0000000000000002 RBP: ffff81021f2afc40 R08: 0000000000000002 R09: ffffffff8023516f R10: ffff81021f2afa30 R11: 0000000000000000 R12: 0000000000000000 R13: ffff81041f5c57a0 R14: ffff81041f5c57a0 R15: 00000000fffffffeqla2xxx 0000:03:00.0: LIP reset occured (f7ef). qla2xxx 0000:03:00.0: LIP occured (f7ef). qla2xxx 0000:03:00.0: LOOP UP detected (4 Gbps). scsi 4:0:0:0: Direct-Access NEXSAN SATABeast Gl66 PQ: 1 ANSI: 5 scsi 4:0:0:101: Direct-Access NEXSAN SATABeast Gl66 PQ: 0 ANSI: 5 kobject_add failed for 4:0:0:101 with -EEXIST, don''t try to register things with the same name in the same directory. Call Trace: [<ffffffff80311c3c>] kobject_put+0x19/0x1b [<ffffffff80311fb8>] kobject_shadow_add+0x177/0x1ac [<ffffffff80311ff8>] kobject_add+0xb/0xd [<ffffffff80388c22>] device_add+0xb5/0x5ea [<ffffffff80250102>] trace_hardirqs_on+0x11c/0x147 [<ffffffff880680ca>] :scsi_mod:scsi_sysfs_add_sdev+0x39/0x22d [<ffffffff88065e97>] :scsi_mod:scsi_probe_and_add_lun+0x9de/0xb0e [<ffffffff88066917>] :scsi_mod:__scsi_scan_target+0x44a/0x638 [<ffffffff80468a19>] __mutex_lock_slowpath+0x26c/0x279 [<ffffffff88067101>] :scsi_mod:scsi_scan_target+0x9a/0xaf [<ffffffff880bf86e>] :scsi_transport_fc:fc_scsi_scan_rport+0x0/0x89 [<ffffffff880bf8d1>] :scsi_transport_fc:fc_scsi_scan_rport+0x63/0x89 [<ffffffff80244a92>] run_workqueue+0x97/0x16a [<ffffffff80245465>] worker_thread+0x0/0xea [<ffffffff80245544>] worker_thread+0xdf/0xea [<ffffffff80248603>] autoremove_wake_function+0x0/0x38 [<ffffffff802484bb>] kthread+0x49/0x76 [<ffffffff8020abb8>] child_rip+0xa/0x12 [<ffffffff8020a2cc>] restore_args+0x0/0x30 [<ffffffff80293fe5>] ____cache_alloc_node+0xff/0x144 [<ffffffff80248472>] kthread+0x0/0x76 [<ffffffff8020abae>] child_rip+0x0/0x12 error 1 scsi 4:0:0:0: Unexpected response from lun 101 while scanning, scan aborted list_add corruption. prev->next should be next (ffffffff805bc9e8), but was ffff81041f5d2410. (prev=ffff81041f5d2410). ------------[ cut here ]------------ kernel BUG at lib/list_debug.c:33! invalid opcode: 0000 [1] SMP CPU 4 Modules linked in: usb_storage qla2xxx scsi_transport_fc sata_nv libata sd_mod scsi_mod raid1 ext3 jbd uhci_hcd ohci_hcd ehci_hcd Pid: 632, comm: scsi_scan_5 Not tainted 2.6.22.19_lustre.1.8.1 #1 RIP: 0010:[<ffffffff80318cca>] [<ffffffff80318cca>] __list_add+0x47/0x5b RSP: 0000:ffff81021f2afc40 EFLAGS: 00010286 RAX: 0000000000000079 RBX: ffff81041f5c5930 RCX: 0000000000000079 RDX: ffff81021a1b8080 RSI: 0000000000000001 RDI: 0000000000000002 RBP: ffff81021f2afc40 R08: 0000000000000002 R09: ffffffff8023516f R10: ffff81021f2afa30 R11: 0000000000000000 R12: 0000000000000000 R13: ffff81041f5c57a0 R14: ffff81041f5c57a0 R15: 00000000fffffffe FS: 00000000008578f0(0000) GS:ffff81041f4e90a0(0000) knlGS:0000000000000000 CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b CR2: 000000000056d080 CR3: 000000021a58a000 CR4: 00000000000006e0 Process scsi_scan_5 (pid: 632, threadinfo ffff81021f2ae000, task ffff81021a1b8080) Stack: ffff81021f2afc80 ffffffff80311f06 ffff81041a8443e0 ffff81041a844008 ffff81041f5c5780 ffff81041f5c57a0 ffff81041f5c57a0 0000000000000000 ffff81021f2afc90 ffffffff80311ff8 ffff81021f2afd10 ffffffff80388c22 Call Trace: [<ffffffff80311f06>] kobject_shadow_add+0xc5/0x1ac [<ffffffff80311ff8>] kobject_add+0xb/0xd [<ffffffff80388c22>] device_add+0xb5/0x5ea [<ffffffff8038e480>] transport_setup_classdev+0x0/0x1a [<ffffffff880663bf>] :scsi_mod:scsi_alloc_target+0x2e4/0x357 [<ffffffff8806653a>] :scsi_mod:__scsi_scan_target+0x6d/0x638 [<ffffffff8024ff18>] mark_held_locks+0x4a/0x6a [<ffffffff80468a0d>] __mutex_lock_slowpath+0x260/0x279 [<ffffffff8024dba2>] debug_mutex_free_waiter+0x5b/0x5f [<ffffffff88066b55>] :scsi_mod:scsi_scan_channel+0x50/0x79 [<ffffffff88066c5d>] :scsi_mod:scsi_scan_host_selected+0xdf/0x11f [<ffffffff88066e70>] :scsi_mod:do_scan_async+0x0/0x114 [<ffffffff88066d08>] :scsi_mod:do_scsi_scan_host+0x6b/0x70 [<ffffffff88066e70>] :scsi_mod:do_scan_async+0x0/0x114 [<ffffffff88066e87>] :scsi_mod:do_scan_async+0x17/0x114 [<ffffffff88066e70>] :scsi_mod:do_scan_async+0x0/0x114 [<ffffffff802484bb>] kthread+0x49/0x76 [<ffffffff8020abb8>] child_rip+0xa/0x12 [<ffffffff8020a2cc>] restore_args+0x0/0x30 [<ffffffff80248472>] kthread+0x0/0x76 [<ffffffff8020abae>] child_rip+0x0/0x12 Code: 0f 0b eb fe 48 89 7e 08 48 89 37 48 89 57 08 48 89 3a c9 c3 RIP [<ffffffff80318cca>] __list_add+0x47/0x5b RSP <ffff81021f2afc40> FS: 00000000008578f0(0000) GS:ffff81041f4e90a0(0000) knlGS:0000000000000000 CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b CR2: 000000000056d080 CR3: 000000021a58a000 CR4: 00000000000006e0 Process scsi_scan_5 (pid: 632, threadinfo ffff81021f2ae000, task ffff81021a1b8080) Stack: ffff81021f2afc80 ffffffff80311f06 ffff81041a8443e0 ffff81041a844008 ffff81041f5c5780 ffff81041f5c57a0 ffff81041f5c57a0 0000000000000000 ffff81021f2afc90 ffffffff80311ff8 ffff81021f2afd10 ffffffff80388c22 Call Trace: [<ffffffff80311f06>] kobject_shadow_add+0xc5/0x1ac [<ffffffff80311ff8>] kobject_add+0xb/0xd [<ffffffff80388c22>] device_add+0xb5/0x5ea [<ffffffff8038e480>] transport_setup_classdev+0x0/0x1a [<ffffffff880663bf>] :scsi_mod:scsi_alloc_target+0x2e4/0x357 [<ffffffff8806653a>] :scsi_mod:__scsi_scan_target+0x6d/0x638 [<ffffffff8024ff18>] mark_held_locks+0x4a/0x6a [<ffffffff80468a0d>] __mutex_lock_slowpath+0x260/0x279 [<ffffffff8024dba2>] debug_mutex_free_waiter+0x5b/0x5f [<ffffffff88066b55>] :scsi_mod:scsi_scan_channel+0x50/0x79 [<ffffffff88066c5d>] :scsi_mod:scsi_scan_host_selected+0xdf/0x11f [<ffffffff88066e70>] :scsi_mod:do_scan_async+0x0/0x114 [<ffffffff88066d08>] :scsi_mod:do_scsi_scan_host+0x6b/0x70 [<ffffffff88066e70>] :scsi_mod:do_scan_async+0x0/0x114 [<ffffffff88066e87>] :scsi_mod:do_scan_async+0x17/0x114 [<ffffffff88066e70>] :scsi_mod:do_scan_async+0x0/0x114 [<ffffffff802484bb>] kthread+0x49/0x76 [<ffffffff8020abb8>] child_rip+0xa/0x12 [<ffffffff8020a2cc>] restore_args+0x0/0x30 [<ffffffff80248472>] kthread+0x0/0x76 [<ffffffff8020abae>] child_rip+0x0/0x12 Code: 0f 0b eb fe 48 89 7e 08 48 89 37 48 89 57 08 48 89 3a c9 c3 RIP [<ffffffff80318cca>] __list_add+0x47/0x5b RSP <ffff81021f2afc40>
Brian J. Murrell
2009-Aug-31 19:47 UTC
[Lustre-discuss] SCSI driver problem on 2.6.22.19 + lustre 1.8.1?
On Mon, 2009-08-31 at 14:19 -0500, Nirmal Seenu wrote:> > I was running 1.8.0.1 without any problem on the same kernel(2.6.22.19) > and all these problems started when I upgraded to 1.8.1 using the same > kernel.Just based on logic, if all that was changed was really only the Lustre upgrade, that would certainly point to Lustre, however...> The following Error messages and Call traces are generated when the > machine is booted with the fibre connected and the machine pretty much > becomes unresponsive at this point:I don''t see Lustre in any of those call traces. Does all of this happen even before the Lustre modules are loaded or only after? That would be a very interesting data point. b. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 197 bytes Desc: This is a digitally signed message part Url : http://lists.lustre.org/pipermail/lustre-discuss/attachments/20090831/f724fc73/attachment-0001.bin
Nirmal Seenu
2009-Aug-31 20:57 UTC
[Lustre-discuss] SCSI driver problem on 2.6.22.19 + lustre 1.8.1?
All these call traces are generated when the machine boots up and tries to load the kernel and before the lustre modules are loaded(I don''t load them automatically during boot up). There is no kernel panic generated but the kernel is unresponsive. I could quickly try reverting one of the servers back to 1.8.0.1 to be absolutely sure. I had these same servers in production use on 1.8.0.1 for a number of weeks before I upgraded it 1.8.1. Nirmal
Brian J. Murrell
2009-Aug-31 23:17 UTC
[Lustre-discuss] SCSI driver problem on 2.6.22.19 + lustre 1.8.1?
On Mon, 2009-08-31 at 15:57 -0500, Nirmal Seenu wrote:> All these call traces are generated when the machine boots up and tries > to load the kernel and before the lustre modules are loadedAhhh. So the only possible culprit WRT Lustre is possibly a kernel patch problem, do you agree? IOW, this cannot be in a Lustre module itself.> I could quickly try reverting one of the servers back to 1.8.0.1 to be > absolutely sure.Why not.> I had these same servers in production use on 1.8.0.1 > for a number of weeks before I upgraded it 1.8.1.Given you are using a 2.6.22-something-or-other kernel, you must have built your own kernel, yes? Which patch series and kernel_configs file did you use? Did you alter anything from it''s default in the Lustre tree? b. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 197 bytes Desc: This is a digitally signed message part Url : http://lists.lustre.org/pipermail/lustre-discuss/attachments/20090831/638cbb61/attachment.bin
Brian J. Murrell
2009-Aug-31 23:40 UTC
[Lustre-discuss] SCSI driver problem on 2.6.22.19 + lustre 1.8.1?
On Mon, 2009-08-31 at 19:17 -0400, Brian J. Murrell wrote:> Ahhh. So the only possible culprit WRT Lustre is possibly a kernel > patch problem, do you agree?Assuming you are using the 2.6.22-vanilla.series the only patches I see that were touched between those two were: * export-truncate-2.6.18-vanilla.patch in bug 18668 * remove_from_page_cache(page) -> ClearPageMappedToDisk(page); * export_symbols-2.6.22-vanilla.patch in also in bug 18668 * this change is a benign context update * i_filter_data.patch in bug 19846 * benign context update also * jbd-commit-timer-no-jiffies-rounding.diff in bug in bug 19321 * removes rounding of transaction->t_expires when assigned to journal->j_commit_timer.expires * md-mmp-unplug-dev.patch in bug 17895 * raid5 fix * quota-large-limits-rhel5.patch in bug 19846 * benign context update Changes in the actual series were to remove the 8kstack-2.6.12.patch and add the above mentioned jbd-commit-timer-no-jiffies-rounding.diff. Given all of this info, I am sure you can figure out how to narrow down exactly which patch caused the problem. Please file a bug once you find it. Cheers, b. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 197 bytes Desc: This is a digitally signed message part Url : http://lists.lustre.org/pipermail/lustre-discuss/attachments/20090831/592fec88/attachment.bin
Nirmal Seenu
2009-Sep-01 15:05 UTC
[Lustre-discuss] SCSI driver problem on 2.6.22.19 + lustre 1.8.1?
Hi Brain, Thanks for the detailed response. I build my own kernel with vanilla 2.6.22.19 kernel source + 2.6.22-vanilla patch series and kernel_config files provided by Lustre with no additional patches to lustre or the kernel. I have stopped using the Redhat kernels on our servers as the 1GHz clock used by the Redhat kernels causes a lot of timeouts/errors under heavy I/O. 2.6.22.19 kernel with 250MHz clock is pretty much rock solid on our servers. I did a fresh kernel build with 2.6.22.19 + Lustre 1.8.0.1 patches yesterday on our test machines and it worked fine. When I reverted back to the kernel build with 2.6.22.19 + Lustre 1.8.1 patches it worked as well without any problem. I am confused on why I am not able to reproduce these errors on the test machine while it is consistently reproducible on our production servers. I could possibly try it again on my production server sometime in the next few days. I will also take a close look at the patches and try adding them incrementally. If I understand it correctly, I could apply the patches from Lustre-1.8.0.1 and build the 2.6.22.19 kernel and then build/install Lustre-1.8.1 against that kernel. Thanks Nirmal
Brian J. Murrell
2009-Sep-01 15:10 UTC
[Lustre-discuss] SCSI driver problem on 2.6.22.19 + lustre 1.8.1?
On Tue, 2009-09-01 at 10:05 -0500, Nirmal Seenu wrote:> > If I understand it correctly, I could apply the patches from > Lustre-1.8.0.1 and build the 2.6.22.19 kernel and then build/install > Lustre-1.8.1 against that kernel.That is possible, but unsupported. There were definitely kernel patch updates between those two releases and one or more may be required for the 1.8.1 lustre modules to function properly. Cheers, b. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 197 bytes Desc: This is a digitally signed message part Url : http://lists.lustre.org/pipermail/lustre-discuss/attachments/20090901/1e44b12a/attachment.bin
Michael Mayer
2009-Sep-01 15:33 UTC
[Lustre-discuss] SCSI driver problem on 2.6.22.19 + lustre 1.8.1?
Nirmal Seenu schrieb: <snip/>> > I have stopped using the Redhat kernels on our servers as the 1GHz clock > used by the Redhat kernels causes a lot of timeouts/errors under heavy > I/O. 2.6.22.19 kernel with 250MHz clock is pretty much rock solid on our > servers. >FYI: If you boot your machine (using a Red Hat kernel) with "divider=4" then your clock speed will be reduced to 1/4 of the 1000 Hz, i.e. 250 Hz. This applies for Red Hat 5.1 (and later) and Red Hat 4.7 (and later). So you could use the default Red Hat kernels... (cf. http://kbase.redhat.com/faq/docs/DOC-3115). <snip/> Michael.
Nirmal Seenu
2009-Sep-01 18:47 UTC
[Lustre-discuss] SCSI driver problem on 2.6.22.19 + lustre 1.8.1?
I uncompressed linux-2.6.22.19.tar,gz into two different directories and applied 1.8.0.1 patches to one directory and then 1.8.1 into the other. I compared each file that was listed as files patched by the quilt command in both the kernel source tree directory and found that the following line to be different in: fs/jbd/transaction.c In lustre 1.8.0.1 it used to be: journal->j_commit_timer.expires = round_jiffies(transaction->t_expires); and it is the following in Lustre 1.8.1: journal->j_commit_timer.expires = transaction->t_expires; I guess this patch was applied based on the these bug and it seems to fix the problem with the Redhat and SLES kernels and break vanilla kernels: http://kerneltrap.org/Linux/High_Idle_Load_Average https://bugzilla.lustre.org/show_bug.cgi?id=19321 The only other differnece is from the 8kstack-2.6.12.patch which used to modify the following line in include/asm-i386/thread_info.h from #define THREAD_SIZE (4096) to #define THREAD_SIZE (8192) It seems like, it would be safe for me to rollback the changes to fs/jbd/transaction.c and continue using 2.6.22.19 kernel. Thanks Nirmal
Brian J. Murrell
2009-Sep-01 19:06 UTC
[Lustre-discuss] SCSI driver problem on 2.6.22.19 + lustre 1.8.1?
On Tue, 2009-09-01 at 13:47 -0500, Nirmal Seenu wrote:> > In lustre 1.8.0.1 it used to be: > > journal->j_commit_timer.expires = round_jiffies(transaction->t_expires); > > and it is the following in Lustre 1.8.1: > > journal->j_commit_timer.expires = transaction->t_expires;Yes, that''s one of the differences I listed.> I guess this patch was applied based on the these bug and it seems to > fix the problem with the Redhat and SLES kernels and break vanilla kernels:Or perhaps rather only works with kernels of a certain vintage. I don''t really know.> The only other differnece is from the 8kstack-2.6.12.patch which used to > modify the following line in include/asm-i386/thread_info.h > > from > #define THREAD_SIZE (4096) > to > #define THREAD_SIZE (8192) > > It seems like, it would be safe for me to rollback the changes to > fs/jbd/transaction.c and continue using 2.6.22.19 kernel.You certainly could give that a try. Please do file a bug in our bugzilla about your findings. b. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 197 bytes Desc: This is a digitally signed message part Url : http://lists.lustre.org/pipermail/lustre-discuss/attachments/20090901/4f4e76b1/attachment.bin
Andreas Dilger
2009-Sep-02 06:50 UTC
[Lustre-discuss] SCSI driver problem on 2.6.22.19 + lustre 1.8.1?
On Sep 01, 2009 15:06 -0400, Brian J. Murrell wrote:> On Tue, 2009-09-01 at 13:47 -0500, Nirmal Seenu wrote: > > In lustre 1.8.0.1 it used to be: > > > > journal->j_commit_timer.expires = round_jiffies(transaction->t_expires); > > > > and it is the following in Lustre 1.8.1: > > > > journal->j_commit_timer.expires = transaction->t_expires; > > Yes, that''s one of the differences I listed.We''ve since modified this change to use "round_jiffies_up()", which fixes the upstream problem, while avoiding the hangs it caused in Lustre.> > The only other differnece is from the 8kstack-2.6.12.patch which used to > > modify the following line in include/asm-i386/thread_info.h > > > > from > > #define THREAD_SIZE (4096) > > to > > #define THREAD_SIZE (8192) > > > > It seems like, it would be safe for me to rollback the changes to > > fs/jbd/transaction.c and continue using 2.6.22.19 kernel.I think we just ensure that the CONFIG_4K_STACK is not enabled on servers now. Cheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc.