thr3ads.net - Lustre discuss - [Lustre-discuss] SCSI driver problem on 2.6.22.19 + lustre 1.8.1? [Aug 2009]

If this information is useful, please help other people find it:
Share via:

Nirmal Seenu

2009-Aug-31 19:19 UTC

[Lustre-discuss] SCSI driver problem on 2.6.22.19 + lustre 1.8.1?

I have been having a lot of troubles with my SCSI devices on 2.6.22.19 + 
lustre 1.8.1. I am not able to boot my servers if the fibre is connected 
and I have to physically remove the fibre every time that I need to 
reboot the server. I am using SATABeast for OSTs and it is connected to 
the servers through fibre channel using QLE2460 cards.

I was running 1.8.0.1 without any problem on the same kernel(2.6.22.19) 
and all these problems started when I upgraded to 1.8.1 using the same 
kernel.

The following Error messages and Call traces are generated when the 
machine is booted with the fibre connected and the machine pretty much 
becomes unresponsive at this point:

TIA
Nirmal

qla2xxx 0000:03:00.0: LIP reset occured (f7ef).
qla2xxx 0000:03:00.0: LIP occured (f7ef).
qla2xxx 0000:03:00.0: LOOP UP detected (4 Gbps).
scsi 4:0:0:0: Direct-Access     NEXSAN   SATABeast        Gl66 PQ: 1 ANSI: 5
scsi 4:0:0:101: Direct-Access     NEXSAN   SATABeast        Gl66 PQ: 0 
ANSI: 5
kobject_add failed for 4:0:0:101 with -EEXIST, don''t try to register 
things with the same name in the same directory.

Call Trace:
  [<ffffffff80311c3c>] kobject_put+0x19/0x1b
  [<ffffffff80311fb8>] kobject_shadow_add+0x177/0x1ac
  [<ffffffff80311ff8>] kobject_add+0xb/0xd
  [<ffffffff80388c22>] device_add+0xb5/0x5ea
  [<ffffffff80250102>] trace_hardirqs_on+0x11c/0x147
  [<ffffffff880680ca>] :scsi_mod:scsi_sysfs_add_sdev+0x39/0x22d
  [<ffffffff88065e97>] :scsi_mod:scsi_probe_and_add_lun+0x9de/0xb0e
  [<ffffffff88066917>] :scsi_mod:__scsi_scan_target+0x44a/0x638
  [<ffffffff80468a19>] __mutex_lock_slowpath+0x26c/0x279
  [<ffffffff88067101>] :scsi_mod:scsi_scan_target+0x9a/0xaf
  [<ffffffff880bf86e>] :scsi_transport_fc:fc_scsi_scan_rport+0x0/0x89
  [<ffffffff880bf8d1>] :scsi_transport_fc:fc_scsi_scan_rport+0x63/0x89
  [<ffffffff80244a92>] run_workqueue+0x97/0x16a
  [<ffffffff80245465>] worker_thread+0x0/0xea
  [<ffffffff80245544>] worker_thread+0xdf/0xea
  [<ffffffff80248603>] autoremove_wake_function+0x0/0x38
  [<ffffffff802484bb>] kthread+0x49/0x76
  [<ffffffff8020abb8>] child_rip+0xa/0x12
  [<ffffffff8020a2cc>] restore_args+0x0/0x30
  [<ffffffff80293fe5>] ____cache_alloc_node+0xff/0x144
  [<ffffffff80248472>] kthread+0x0/0x76
  [<ffffffff8020abae>] child_rip+0x0/0x12

error 1
scsi 4:0:0:0: Unexpected response from lun 101 while scanning, scan aborted
list_add corruption. prev->next should be next (ffffffff805bc9e8), but 
was ffff81041f5d2410. (prev=ffff81041f5d2410).
------------[ cut here ]------------
kernel BUG at lib/list_debug.c:33!
invalid opcode: 0000 [1] SMP
CPU 4
Modules linked in: usb_storage qla2xxx scsi_transport_fc sata_nv libata 
sd_mod scsi_mod raid1 ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 632, comm: scsi_scan_5 Not tainted 2.6.22.19_lustre.1.8.1 #1
RIP: 0010:[<ffffffff80318cca>]  [<ffffffff80318cca>]
__list_add+0x47/0x5b
RSP: 0000:ffff81021f2afc40  EFLAGS: 00010286
RAX: 0000000000000079 RBX: ffff81041f5c5930 RCX: 0000000000000079
RDX: ffff81021a1b8080 RSI: 0000000000000001 RDI: 0000000000000002
RBP: ffff81021f2afc40 R08: 0000000000000002 R09: ffffffff8023516f
R10: ffff81021f2afa30 R11: 0000000000000000 R12: 0000000000000000
R13: ffff81041f5c57a0 R14: ffff81041f5c57a0 R15: 00000000fffffffeqla2xxx 
0000:03:00.0: LIP reset occured (f7ef).
qla2xxx 0000:03:00.0: LIP occured (f7ef).
qla2xxx 0000:03:00.0: LOOP UP detected (4 Gbps).
scsi 4:0:0:0: Direct-Access     NEXSAN   SATABeast        Gl66 PQ: 1 ANSI: 5
scsi 4:0:0:101: Direct-Access     NEXSAN   SATABeast        Gl66 PQ: 0 
ANSI: 5
kobject_add failed for 4:0:0:101 with -EEXIST, don''t try to register 
things with the same name in the same directory.

Call Trace:
  [<ffffffff80311c3c>] kobject_put+0x19/0x1b
  [<ffffffff80311fb8>] kobject_shadow_add+0x177/0x1ac
  [<ffffffff80311ff8>] kobject_add+0xb/0xd
  [<ffffffff80388c22>] device_add+0xb5/0x5ea
  [<ffffffff80250102>] trace_hardirqs_on+0x11c/0x147
  [<ffffffff880680ca>] :scsi_mod:scsi_sysfs_add_sdev+0x39/0x22d
  [<ffffffff88065e97>] :scsi_mod:scsi_probe_and_add_lun+0x9de/0xb0e
  [<ffffffff88066917>] :scsi_mod:__scsi_scan_target+0x44a/0x638
  [<ffffffff80468a19>] __mutex_lock_slowpath+0x26c/0x279
  [<ffffffff88067101>] :scsi_mod:scsi_scan_target+0x9a/0xaf
  [<ffffffff880bf86e>] :scsi_transport_fc:fc_scsi_scan_rport+0x0/0x89
  [<ffffffff880bf8d1>] :scsi_transport_fc:fc_scsi_scan_rport+0x63/0x89
  [<ffffffff80244a92>] run_workqueue+0x97/0x16a
  [<ffffffff80245465>] worker_thread+0x0/0xea
  [<ffffffff80245544>] worker_thread+0xdf/0xea
  [<ffffffff80248603>] autoremove_wake_function+0x0/0x38
  [<ffffffff802484bb>] kthread+0x49/0x76
  [<ffffffff8020abb8>] child_rip+0xa/0x12
  [<ffffffff8020a2cc>] restore_args+0x0/0x30
  [<ffffffff80293fe5>] ____cache_alloc_node+0xff/0x144
  [<ffffffff80248472>] kthread+0x0/0x76
  [<ffffffff8020abae>] child_rip+0x0/0x12

error 1
scsi 4:0:0:0: Unexpected response from lun 101 while scanning, scan aborted
list_add corruption. prev->next should be next (ffffffff805bc9e8), but 
was ffff81041f5d2410. (prev=ffff81041f5d2410).
------------[ cut here ]------------
kernel BUG at lib/list_debug.c:33!
invalid opcode: 0000 [1] SMP
CPU 4
Modules linked in: usb_storage qla2xxx scsi_transport_fc sata_nv libata 
sd_mod scsi_mod raid1 ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 632, comm: scsi_scan_5 Not tainted 2.6.22.19_lustre.1.8.1 #1
RIP: 0010:[<ffffffff80318cca>]  [<ffffffff80318cca>]
__list_add+0x47/0x5b
RSP: 0000:ffff81021f2afc40  EFLAGS: 00010286
RAX: 0000000000000079 RBX: ffff81041f5c5930 RCX: 0000000000000079
RDX: ffff81021a1b8080 RSI: 0000000000000001 RDI: 0000000000000002
RBP: ffff81021f2afc40 R08: 0000000000000002 R09: ffffffff8023516f
R10: ffff81021f2afa30 R11: 0000000000000000 R12: 0000000000000000
R13: ffff81041f5c57a0 R14: ffff81041f5c57a0 R15: 00000000fffffffe
FS:  00000000008578f0(0000) GS:ffff81041f4e90a0(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 000000000056d080 CR3: 000000021a58a000 CR4: 00000000000006e0
Process scsi_scan_5 (pid: 632, threadinfo ffff81021f2ae000, task 
ffff81021a1b8080)
Stack:  ffff81021f2afc80 ffffffff80311f06 ffff81041a8443e0 ffff81041a844008
  ffff81041f5c5780 ffff81041f5c57a0 ffff81041f5c57a0 0000000000000000
  ffff81021f2afc90 ffffffff80311ff8 ffff81021f2afd10 ffffffff80388c22
Call Trace:
  [<ffffffff80311f06>] kobject_shadow_add+0xc5/0x1ac
  [<ffffffff80311ff8>] kobject_add+0xb/0xd
  [<ffffffff80388c22>] device_add+0xb5/0x5ea
  [<ffffffff8038e480>] transport_setup_classdev+0x0/0x1a
  [<ffffffff880663bf>] :scsi_mod:scsi_alloc_target+0x2e4/0x357
  [<ffffffff8806653a>] :scsi_mod:__scsi_scan_target+0x6d/0x638
  [<ffffffff8024ff18>] mark_held_locks+0x4a/0x6a
  [<ffffffff80468a0d>] __mutex_lock_slowpath+0x260/0x279
  [<ffffffff8024dba2>] debug_mutex_free_waiter+0x5b/0x5f
  [<ffffffff88066b55>] :scsi_mod:scsi_scan_channel+0x50/0x79
  [<ffffffff88066c5d>] :scsi_mod:scsi_scan_host_selected+0xdf/0x11f
  [<ffffffff88066e70>] :scsi_mod:do_scan_async+0x0/0x114
  [<ffffffff88066d08>] :scsi_mod:do_scsi_scan_host+0x6b/0x70
  [<ffffffff88066e70>] :scsi_mod:do_scan_async+0x0/0x114
  [<ffffffff88066e87>] :scsi_mod:do_scan_async+0x17/0x114
  [<ffffffff88066e70>] :scsi_mod:do_scan_async+0x0/0x114
  [<ffffffff802484bb>] kthread+0x49/0x76
  [<ffffffff8020abb8>] child_rip+0xa/0x12
  [<ffffffff8020a2cc>] restore_args+0x0/0x30
  [<ffffffff80248472>] kthread+0x0/0x76
  [<ffffffff8020abae>] child_rip+0x0/0x12


Code: 0f 0b eb fe 48 89 7e 08 48 89 37 48 89 57 08 48 89 3a c9 c3
RIP  [<ffffffff80318cca>] __list_add+0x47/0x5b
  RSP <ffff81021f2afc40>

FS:  00000000008578f0(0000) GS:ffff81041f4e90a0(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 000000000056d080 CR3: 000000021a58a000 CR4: 00000000000006e0
Process scsi_scan_5 (pid: 632, threadinfo ffff81021f2ae000, task 
ffff81021a1b8080)
Stack:  ffff81021f2afc80 ffffffff80311f06 ffff81041a8443e0 ffff81041a844008
  ffff81041f5c5780 ffff81041f5c57a0 ffff81041f5c57a0 0000000000000000
  ffff81021f2afc90 ffffffff80311ff8 ffff81021f2afd10 ffffffff80388c22
Call Trace:
  [<ffffffff80311f06>] kobject_shadow_add+0xc5/0x1ac
  [<ffffffff80311ff8>] kobject_add+0xb/0xd
  [<ffffffff80388c22>] device_add+0xb5/0x5ea
  [<ffffffff8038e480>] transport_setup_classdev+0x0/0x1a
  [<ffffffff880663bf>] :scsi_mod:scsi_alloc_target+0x2e4/0x357
  [<ffffffff8806653a>] :scsi_mod:__scsi_scan_target+0x6d/0x638
  [<ffffffff8024ff18>] mark_held_locks+0x4a/0x6a
  [<ffffffff80468a0d>] __mutex_lock_slowpath+0x260/0x279
  [<ffffffff8024dba2>] debug_mutex_free_waiter+0x5b/0x5f
  [<ffffffff88066b55>] :scsi_mod:scsi_scan_channel+0x50/0x79
  [<ffffffff88066c5d>] :scsi_mod:scsi_scan_host_selected+0xdf/0x11f
  [<ffffffff88066e70>] :scsi_mod:do_scan_async+0x0/0x114
  [<ffffffff88066d08>] :scsi_mod:do_scsi_scan_host+0x6b/0x70
  [<ffffffff88066e70>] :scsi_mod:do_scan_async+0x0/0x114
  [<ffffffff88066e87>] :scsi_mod:do_scan_async+0x17/0x114
  [<ffffffff88066e70>] :scsi_mod:do_scan_async+0x0/0x114
  [<ffffffff802484bb>] kthread+0x49/0x76
  [<ffffffff8020abb8>] child_rip+0xa/0x12
  [<ffffffff8020a2cc>] restore_args+0x0/0x30
  [<ffffffff80248472>] kthread+0x0/0x76
  [<ffffffff8020abae>] child_rip+0x0/0x12


Code: 0f 0b eb fe 48 89 7e 08 48 89 37 48 89 57 08 48 89 3a c9 c3
RIP  [<ffffffff80318cca>] __list_add+0x47/0x5b
  RSP <ffff81021f2afc40>

Brian J. Murrell

2009-Aug-31 19:47 UTC

head link

[Lustre-discuss] SCSI driver problem on 2.6.22.19 + lustre 1.8.1?

On Mon, 2009-08-31 at 14:19 -0500, Nirmal Seenu wrote:> 
> I was running 1.8.0.1 without any problem on the same kernel(2.6.22.19) 
> and all these problems started when I upgraded to 1.8.1 using the same 
> kernel.
Just based on logic, if all that was changed was really only the Lustre
upgrade, that would certainly point to Lustre, however...
> The following Error messages and Call traces are generated when the 
> machine is booted with the fibre connected and the machine pretty much 
> becomes unresponsive at this point:
I don''t see Lustre in any of those call traces.  Does all of this
happen
even before the Lustre modules are loaded or only after?  That would be
a very interesting data point.

b.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 197 bytes
Desc: This is a digitally signed message part
Url :
http://lists.lustre.org/pipermail/lustre-discuss/attachments/20090831/f724fc73/attachment-0001.bin

Nirmal Seenu

2009-Aug-31 20:57 UTC

head link

[Lustre-discuss] SCSI driver problem on 2.6.22.19 + lustre 1.8.1?

All these call traces are generated when the machine boots up and tries 
to load the kernel and before the lustre modules are loaded(I don''t
load
them automatically during boot up). There is no kernel panic generated 
but the kernel is unresponsive.

I could quickly try reverting one of the servers back to 1.8.0.1 to be 
absolutely sure. I had these same servers in production use on 1.8.0.1 
for a number of weeks before I upgraded it 1.8.1.

Nirmal

Brian J. Murrell

2009-Aug-31 23:17 UTC

head link

[Lustre-discuss] SCSI driver problem on 2.6.22.19 + lustre 1.8.1?

On Mon, 2009-08-31 at 15:57 -0500, Nirmal Seenu wrote:> All these call traces are generated when the machine boots up and tries 
> to load the kernel and before the lustre modules are loaded
Ahhh.  So the only possible culprit WRT Lustre is possibly a kernel
patch problem, do you agree?  IOW, this cannot be in a Lustre module
itself.
> I could quickly try reverting one of the servers back to 1.8.0.1 to be 
> absolutely sure.
Why not.
> I had these same servers in production use on 1.8.0.1 
> for a number of weeks before I upgraded it 1.8.1.
Given you are using a 2.6.22-something-or-other kernel, you must have
built your own kernel, yes?  Which patch series and kernel_configs file
did you use?  Did you alter anything from it''s default in the Lustre
tree?

b.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 197 bytes
Desc: This is a digitally signed message part
Url :
http://lists.lustre.org/pipermail/lustre-discuss/attachments/20090831/638cbb61/attachment.bin

Brian J. Murrell

2009-Aug-31 23:40 UTC

head link

[Lustre-discuss] SCSI driver problem on 2.6.22.19 + lustre 1.8.1?

On Mon, 2009-08-31 at 19:17 -0400, Brian J. Murrell
wrote:> Ahhh.  So the only possible culprit WRT Lustre is possibly a kernel
> patch problem, do you agree?
Assuming you are using the 2.6.22-vanilla.series the only patches I see
that were touched between those two were:
      * export-truncate-2.6.18-vanilla.patch in bug 18668
              * remove_from_page_cache(page) ->
                ClearPageMappedToDisk(page);
      * export_symbols-2.6.22-vanilla.patch in also in bug 18668
              * this change is a benign context update
      * i_filter_data.patch in bug 19846
              * benign context update also
      * jbd-commit-timer-no-jiffies-rounding.diff in bug in bug 19321
              * removes rounding of transaction->t_expires when assigned
                to journal->j_commit_timer.expires
      * md-mmp-unplug-dev.patch in bug 17895
              * raid5 fix
      * quota-large-limits-rhel5.patch in bug 19846
              * benign context update

Changes in the actual series were to remove the 8kstack-2.6.12.patch and
add the above mentioned jbd-commit-timer-no-jiffies-rounding.diff.

Given all of this info, I am sure you can figure out how to narrow down
exactly which patch caused the problem.  Please file a bug once you find
it.

Cheers,
b.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 197 bytes
Desc: This is a digitally signed message part
Url :
http://lists.lustre.org/pipermail/lustre-discuss/attachments/20090831/592fec88/attachment.bin

Nirmal Seenu

2009-Sep-01 15:05 UTC

head link

[Lustre-discuss] SCSI driver problem on 2.6.22.19 + lustre 1.8.1?

Hi Brain,

Thanks for the detailed response.

I build my own kernel with vanilla 2.6.22.19 kernel source + 
2.6.22-vanilla patch series and kernel_config files provided by Lustre 
with no additional patches to lustre or the kernel.

I have stopped using the Redhat kernels on our servers as the 1GHz clock 
used by the Redhat kernels causes a lot of timeouts/errors under heavy 
I/O. 2.6.22.19 kernel with 250MHz clock is pretty much rock solid on our 
servers.

I did a fresh kernel build with 2.6.22.19 + Lustre 1.8.0.1 patches 
yesterday on our test machines and it worked fine. When I reverted back 
to the kernel build with 2.6.22.19 + Lustre 1.8.1 patches it worked as 
well without any problem.

I am confused on why I am not able to reproduce these errors on the test 
machine while it is consistently reproducible on our production servers. 
I could possibly try it again on my production server sometime in the 
next few days. I will also take a close look at the patches and try 
adding them incrementally.

If I understand it correctly, I could apply the patches from 
Lustre-1.8.0.1 and build the 2.6.22.19 kernel and then build/install 
Lustre-1.8.1 against that kernel.

Thanks
Nirmal

Brian J. Murrell

2009-Sep-01 15:10 UTC

head link

[Lustre-discuss] SCSI driver problem on 2.6.22.19 + lustre 1.8.1?

On Tue, 2009-09-01 at 10:05 -0500, Nirmal Seenu wrote:> 
> If I understand it correctly, I could apply the patches from 
> Lustre-1.8.0.1 and build the 2.6.22.19 kernel and then build/install 
> Lustre-1.8.1 against that kernel.
That is possible, but unsupported.  There were definitely kernel patch
updates between those two releases and one or more may be required for
the 1.8.1 lustre modules to function properly.

Cheers,
b.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 197 bytes
Desc: This is a digitally signed message part
Url :
http://lists.lustre.org/pipermail/lustre-discuss/attachments/20090901/1e44b12a/attachment.bin

Michael Mayer

2009-Sep-01 15:33 UTC

head link

[Lustre-discuss] SCSI driver problem on 2.6.22.19 + lustre 1.8.1?

Nirmal Seenu schrieb:

<snip/>>
> I have stopped using the Redhat kernels on our servers as the 1GHz clock 
> used by the Redhat kernels causes a lot of timeouts/errors under heavy 
> I/O. 2.6.22.19 kernel with 250MHz clock is pretty much rock solid on our 
> servers.
>   FYI: If you boot your machine (using a Red Hat kernel) with
"divider=4" then your clock speed will be reduced to 1/4 of the 1000
Hz, i.e. 250 Hz. This applies for Red Hat 5.1 (and later) and Red Hat 4.7 (and
later). So you could use the default Red Hat kernels... (cf.
http://kbase.redhat.com/faq/docs/DOC-3115).

<snip/>

Michael.

Nirmal Seenu

2009-Sep-01 18:47 UTC

head link

[Lustre-discuss] SCSI driver problem on 2.6.22.19 + lustre 1.8.1?

I uncompressed linux-2.6.22.19.tar,gz into two different directories and 
  applied 1.8.0.1 patches to one directory and then 1.8.1 into the other.

I compared each file that was listed as files patched by the quilt 
command in both the kernel source tree directory and found that the 
following line to be different in: fs/jbd/transaction.c

In lustre 1.8.0.1 it used to be:

journal->j_commit_timer.expires = round_jiffies(transaction->t_expires);

and it is the following in Lustre 1.8.1:

   journal->j_commit_timer.expires = transaction->t_expires;

I guess this patch was applied based on the these bug and it seems to 
fix the problem with the Redhat and SLES kernels and break vanilla kernels:

http://kerneltrap.org/Linux/High_Idle_Load_Average

https://bugzilla.lustre.org/show_bug.cgi?id=19321

The only other differnece is from the 8kstack-2.6.12.patch which used to 
modify the following line in include/asm-i386/thread_info.h

from
  #define THREAD_SIZE            (4096)
to
  #define THREAD_SIZE            (8192)

It seems like, it would be safe for me to rollback the changes to 
fs/jbd/transaction.c and continue using 2.6.22.19 kernel.

Thanks
Nirmal

Brian J. Murrell

2009-Sep-01 19:06 UTC

head link

[Lustre-discuss] SCSI driver problem on 2.6.22.19 + lustre 1.8.1?

On Tue, 2009-09-01 at 13:47 -0500, Nirmal Seenu wrote:> 
> In lustre 1.8.0.1 it used to be:
> 
> journal->j_commit_timer.expires =
round_jiffies(transaction->t_expires);
> 
> and it is the following in Lustre 1.8.1:
> 
>    journal->j_commit_timer.expires = transaction->t_expires;
Yes, that''s one of the differences I listed.
> I guess this patch was applied based on the these bug and it seems to 
> fix the problem with the Redhat and SLES kernels and break vanilla kernels:
Or perhaps rather only works with kernels of a certain vintage.  I
don''t
really know.
> The only other differnece is from the 8kstack-2.6.12.patch which used to 
> modify the following line in include/asm-i386/thread_info.h
> 
> from
>   #define THREAD_SIZE            (4096)
> to
>   #define THREAD_SIZE            (8192)
> 
> It seems like, it would be safe for me to rollback the changes to 
> fs/jbd/transaction.c and continue using 2.6.22.19 kernel.
You certainly could give that a try.  Please do file a bug in our
bugzilla about your findings.

b.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 197 bytes
Desc: This is a digitally signed message part
Url :
http://lists.lustre.org/pipermail/lustre-discuss/attachments/20090901/4f4e76b1/attachment.bin

Andreas Dilger

2009-Sep-02 06:50 UTC

head link

[Lustre-discuss] SCSI driver problem on 2.6.22.19 + lustre 1.8.1?

On Sep 01, 2009  15:06 -0400, Brian J. Murrell wrote:> On Tue, 2009-09-01 at 13:47 -0500, Nirmal Seenu wrote:
> > In lustre 1.8.0.1 it used to be:
> > 
> > journal->j_commit_timer.expires =
round_jiffies(transaction->t_expires);
> > 
> > and it is the following in Lustre 1.8.1:
> > 
> >    journal->j_commit_timer.expires = transaction->t_expires;
> 
> Yes, that''s one of the differences I listed.
We''ve since modified this change to use "round_jiffies_up()",
which
fixes the upstream problem, while avoiding the hangs it caused in
Lustre.
> > The only other differnece is from the 8kstack-2.6.12.patch which used
to
> > modify the following line in include/asm-i386/thread_info.h
> > 
> > from
> >   #define THREAD_SIZE            (4096)
> > to
> >   #define THREAD_SIZE            (8192)
> > 
> > It seems like, it would be safe for me to rollback the changes to 
> > fs/jbd/transaction.c and continue using 2.6.22.19 kernel.
I think we just ensure that the CONFIG_4K_STACK is not enabled on
servers now.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

Lustre discuss - Aug 2009 - SCSI driver problem on 2.6.22.19 + lustre 1.8.1?

[Lustre-discuss] SCSI driver problem on 2.6.22.19 + lustre 1.8.1?

[Lustre-discuss] SCSI driver problem on 2.6.22.19 + lustre 1.8.1?

[Lustre-discuss] SCSI driver problem on 2.6.22.19 + lustre 1.8.1?

[Lustre-discuss] SCSI driver problem on 2.6.22.19 + lustre 1.8.1?

[Lustre-discuss] SCSI driver problem on 2.6.22.19 + lustre 1.8.1?

[Lustre-discuss] SCSI driver problem on 2.6.22.19 + lustre 1.8.1?

[Lustre-discuss] SCSI driver problem on 2.6.22.19 + lustre 1.8.1?

[Lustre-discuss] SCSI driver problem on 2.6.22.19 + lustre 1.8.1?

[Lustre-discuss] SCSI driver problem on 2.6.22.19 + lustre 1.8.1?

[Lustre-discuss] SCSI driver problem on 2.6.22.19 + lustre 1.8.1?

[Lustre-discuss] SCSI driver problem on 2.6.22.19 + lustre 1.8.1?