hi All,

First of all, I'm sorry I cannot write a better bug report, as I'm far away from the host and right now there is no remote access. My colleague sent this in by email:

Nov 9 19:05:55 node1 kernel: [<ffffffff88591357>] :ptlrpc:ptlrpc_server_handle_request+0xa97/0x1170
Nov 9 19:05:55 node1 kernel: [<ffffffff8003d382>] lock_timer_base+0x1b/0x3c
Nov 9 19:05:55 node1 kernel: [<ffffffff8008881d>] __wake_up_common+0x3e/0x68
Nov 9 19:05:55 node1 kernel: [<ffffffff88594e08>] :ptlrpc:ptlrpc_main+0x1218/0x13e0
Nov 9 19:05:55 node1 kernel: [<ffffffff8008a3f3>] default_wake_function+0x0/0xe
Nov 9 19:05:55 node1 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
Nov 9 19:05:55 node1 kernel: [<ffffffff88593bf0>] :ptlrpc:ptlrpc_main+0x0/0x13e0
Nov 9 19:05:55 node1 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11
Nov 9 19:05:55 node1 kernel:
Nov 9 19:05:55 node1 kernel: Lustre: 4381:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-192.168.3.149 at tcp
Nov 9 19:06:05 node1 last message repeated 409322 times
Nov 9 19:06:05 node1 kernel: BUG: soft lockup - CPU#1 stuck for 10s! [ll_ost_82:4381]
Nov 9 19:06:05 node1 kernel: CPU 1:
Nov 9 19:06:05 node1 kernel: Modules linked in: obdfilter(U) fsfilt_ldiskfs(U) ost(U) mgc(U) ldiskfs(U) crc16(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) mptctl(U) mptbase(U) ipmi_devintf(U) ipmi_si(U) ipmi_msghandler(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) autofs4(U) lockd(U) sunrpc(U) cpufreq_ondemand(U) acpi_cpufreq(U) freq_table(U) dm_mirror(U) dm_multipath(U) scsi_dh(U) video(U) hwmon(U) backlight(U) sbs(U) i2c_ec(U) i2c_core(U) button(U) battery(U) asus_acpi(U) acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U) sg(U) igb(U) shpchp(U) pcspkr(U) dm_raid45(U) dm_message(U) dm_region_hash(U) dm_log(U) dm_mod(U) dm_mem_cache(U) usb_storage(U) cciss(U) ata_piix(U) libata(U) sd_mod(U) scsi_mod(U) ext3(U) jbd(U) uhci_hcd(U) ohci_hcd(U) ehci_hcd(U)
Nov 9 19:06:05 node1 kernel: Pid: 4381, comm: ll_ost_82 Tainted: G 2.6.18-128.7.1.el5_lustre.1.8.1.1 #1
Nov 9 19:06:05 node1 kernel: RIP: 0010:[<ffffffff80064ae0>] [<ffffffff80064ae0>] _spin_lock+0x3/0xa
Nov 9 19:06:05 node1 kernel: RSP: 0018:ffff8102217b7758 EFLAGS: 00000246
Nov 9 19:06:05 node1 kernel: RAX: 0000000000000008 RBX: ffff81022c80b400 RCX: 0000000000000000
Nov 9 19:06:05 node1 kernel: RDX: ffff81023df660a0 RSI: 0000000000000000 RDI: ffff81023df66250
Nov 9 19:06:05 node1 kernel: RBP: ffff81022c80b400 R08: ffff81022c80b530 R09: 0000000000000000
Nov 9 19:06:05 node1 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000013
Nov 9 19:06:05 node1 kernel: R13: ffffffff8857327c R14: 0000000500000000 R15: 0000000000000007
Nov 9 19:06:05 node1 kernel: FS: 00002b4cfcfab230(0000) GS:ffff810107ed96c0(0000) knlGS:0000000000000000
Nov 9 19:06:05 node1 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Nov 9 19:06:05 node1 kernel: CR2: 00002abf9b119158 CR3: 0000000000201000 CR4: 00000000000006e0
Nov 9 19:06:05 node1 kernel:
Nov 9 19:06:05 node1 kernel: Call Trace:
Nov 9 19:06:05 node1 kernel: [<ffffffff8857cafc>] :ptlrpc:ptlrpc_queue_wait+0x103c/0x1690
Nov 9 19:06:05 node1 kernel: [<ffffffff8858a515>] :ptlrpc:lustre_msg_set_opc+0x45/0x120
Nov 9 19:06:05 node1 kernel: [<ffffffff88574085>] :ptlrpc:ptlrpc_at_set_req_timeout+0x85/0xd0
Nov 9 19:06:05 node1 kernel: [<ffffffff885748a9>] :ptlrpc:ptlrpc_prep_req_pool+0x619/0x6b0
Nov 9 19:06:05 node1 kernel: [<ffffffff8008a3f3>] default_wake_function+0x0/0xe
Nov 9 19:06:05 node1 kernel: [<ffffffff88564196>] :ptlrpc:ldlm_server_glimpse_ast+0x266/0x3b0
Nov 9 19:06:05 node1 kernel: [<ffffffff88570f03>] :ptlrpc:interval_iterate_reverse+0x73/0x240
Nov 9 19:06:05 node1 kernel: [<ffffffff88558f20>] :ptlrpc:ldlm_process_extent_lock+0x0/0xad0

History: The "cluster" has been up for approx. 10 days. It has only one MDS and 2 OSS computers. On the second day node1 locked up with no usable messages on the screen, and similar messages in the log as above. After I restarted it, the cluster ran for a week without bigger error messages. On Sunday we moved the servers, so we had to restart them. This morning node2 locked up, and after some hours node1 also started to give up. The systems are based on CentOS 5.4 with the official packages from Sun.

Is there any known bug about this?

Thank you,

tamas
Papp Tamás wrote, On 2009. 11. 09. 19:34:
>
> Is there any known bug about this?
>

Is this bug 21221 <https://bugzilla.lustre.org/show_bug.cgi?id=21221> ?

10x

tompos
This looks exactly like what we are running into; my bugzilla (https://bugzilla.lustre.org/show_bug.cgi?id=21256) was duped to this bug:

https://bugzilla.lustre.org/show_bug.cgi?id=19557

But I was not sure until you reported it whether you are also seeing your network stack being shut down? We see our NIC fall down when this happens, running RHEL5.3 with 1.8.1.0.

It's unclear from the bugzilla entry what the ETA on a fix is going to be.

On Nov 9, 2009, at 3:25 PM, Papp Tamás wrote:
> Papp Tamás wrote, On 2009. 11. 09. 19:34:
>>
>> Is there any known bug about this?
>>
>
> Is this bug 21221 <https://bugzilla.lustre.org/show_bug.cgi?
> id=21221> ?
>
> 10x
>
> tompos
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>

Derek Yarnell
UNIX Systems Administrator
University of Maryland Institute for Advanced Computer Studies
Derek Yarnell wrote, On 2009. 11. 10. 7:19:
> This looks exactly like what we are running into, my bugzilla
> (https://bugzilla.lustre.org/show_bug.cgi?id=21256) was duped to this
> bug,
>
> https://bugzilla.lustre.org/show_bug.cgi?id=19557
>
> But I am not sure until you reported it that you are also seeing that
> your network stack is being shut down? We see our NIC fall down when
> this happens, running RHEL5.3 1.8.1.0.
>
> Unclear on what a ETA on a fix is going to be from the bugzilla entry.

I'm not sure it's the same problem. I'll check it more closely tomorrow.

Has nobody else experienced this issue? Could it be HW related? Our cluster nodes are HP DL180 G6 machines.

tamas
Derek Yarnell wrote, On 2009. 11. 10. 7:19:
> This looks exactly like what we are running into, my bugzilla
> (https://bugzilla.lustre.org/show_bug.cgi?id=21256) was duped to this
> bug,
>
> https://bugzilla.lustre.org/show_bug.cgi?id=19557
>
> But I am not sure until you reported it that you are also seeing that
> your network stack is being shut down? We see our NIC fall down when
> this happens, running RHEL5.3 1.8.1.0.
>
> Unclear on what a ETA on a fix is going to be from the bugzilla entry.

Well, my problem has not gone away, but occurs more frequently. There were days when the logs were clean enough, and other days when there were call traces, but the cluster was functional. Until this afternoon.

I wrote earlier that we have one MDS (meta1) and two OSS (node1 and node2). The servers meta1 and node2 are working fine. But node1 has a load of 8, and top says:

4820 root 15 0 0 0 0 R 99.8 0.0 75:42.89 ll_ost_06
4894 root 15 0 0 0 0 R 99.8 0.0 75:41.79 ll_ost_80

The logs are full of this:

Nov 19 20:03:32 node1 kernel: BUG: soft lockup - CPU#3 stuck for 10s! [ll_ost_80:4894]
Nov 19 20:03:32 node1 kernel: CPU 3:
Nov 19 20:03:32 node1 kernel: Modules linked in: obdfilter(U) fsfilt_ldiskfs(U) ost(U) mgc(U) ldiskfs(U) crc16(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) mptctl(U) mptbase(U) ipmi_devintf(U) ipmi_si(U) ipmi_msghandler(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) autofs4(U) lockd(U) sunrpc(U) cpufreq_ondemand(U) acpi_cpufreq(U) freq_table(U) dm_mirror(U) dm_multipath(U) scsi_dh(U) video(U) hwmon(U) backlight(U) sbs(U) i2c_ec(U) i2c_core(U) button(U) battery(U) asus_acpi(U) acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U) sg(U) shpchp(U) igb(U) pcspkr(U) dm_raid45(U) dm_message(U) dm_region_hash(U) dm_log(U) dm_mod(U) dm_mem_cache(U) usb_storage(U) cciss(U) ata_piix(U) libata(U) sd_mod(U) scsi_mod(U) ext3(U) jbd(U) uhci_hcd(U) ohci_hcd(U) ehci_hcd(U)
Nov 19 20:03:32 node1 kernel: Pid: 4894, comm: ll_ost_80 Tainted: G 2.6.18-128.1.6.el5_lustre.1.8.0.1smp #1
Nov 19 20:03:32 node1 kernel: RIP: 0010:[<ffffffff885814f8>] [<ffffffff885814f8>] :ptlrpc:ptlrpc_queue_wait+0xe58/0x1690
Nov 19 20:03:32 node1 kernel: RSP: 0018:ffff810220143760 EFLAGS: 00000202
Nov 19 20:03:32 node1 kernel: RAX: ffff8101f7b5a400 RBX: ffff81020d360000 RCX: 0000000000000000
Nov 19 20:03:32 node1 kernel: RDX: ffff8100a739a408 RSI: 0000000000000000 RDI: ffff8101f7b5a548
Nov 19 20:03:32 node1 kernel: RBP: ffffffffffffffff R08: ffff81020d360128 R09: 0000000000000000
Nov 19 20:03:32 node1 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff81020d36005c
Nov 19 20:03:32 node1 kernel: R13: ffff81020d360000 R14: 0000000000000013 R15: ffffffff88577e6c
Nov 19 20:03:32 node1 kernel: FS: 00002b9c353dc230(0000) GS:ffff810107f61540(0000) knlGS:0000000000000000
Nov 19 20:03:32 node1 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Nov 19 20:03:32 node1 kernel: CR2: 0000003862603080 CR3: 0000000232d91000 CR4: 00000000000006e0
Nov 19 20:03:32 node1 kernel:
Nov 19 20:03:32 node1 kernel: Call Trace:
Nov 19 20:03:32 node1 kernel: [<ffffffff8858f185>] :ptlrpc:lustre_msg_set_opc+0x45/0x120
Nov 19 20:03:32 node1 kernel: [<ffffffff88578ca2>] :ptlrpc:ptlrpc_at_set_req_timeout+0x82/0xd0
Nov 19 20:03:32 node1 kernel: [<ffffffff885794c3>] :ptlrpc:ptlrpc_prep_req_pool+0x613/0x6b0
Nov 19 20:03:32 node1 kernel: [<ffffffff8008a3f6>] default_wake_function+0x0/0xe
Nov 19 20:03:32 node1 kernel: [<ffffffff88568d47>] :ptlrpc:ldlm_server_glimpse_ast+0x257/0x3a0
Nov 19 20:03:32 node1 kernel: [<ffffffff88575af3>] :ptlrpc:interval_iterate_reverse+0x73/0x240
Nov 19 20:03:32 node1 kernel: [<ffffffff8855d850>] :ptlrpc:ldlm_process_extent_lock+0x0/0xad0
Nov 19 20:03:32 node1 kernel: [<ffffffff8884205c>] :obdfilter:filter_intent_policy+0x66c/0x780
Nov 19 20:03:32 node1 kernel: [<ffffffff884825b0>] :lnet:LNetPut+0x700/0x800
Nov 19 20:03:32 node1 kernel: [<ffffffff88549d56>] :ptlrpc:ldlm_lock_enqueue+0x186/0xb30
Nov 19 20:03:32 node1 kernel: [<ffffffff88565b5f>] :ptlrpc:ldlm_export_lock_get+0x6f/0xe0
Nov 19 20:03:32 node1 kernel: [<ffffffff884ca748>] :obdclass:lustre_hash_add+0x208/0x2d0
Nov 19 20:03:32 node1 kernel: [<ffffffff887f6f20>] :ost:ost_blocking_ast+0x0/0xaa0
Nov 19 20:03:32 node1 kernel: [<ffffffff8856c779>] :ptlrpc:ldlm_handle_enqueue+0xc09/0x1200
Nov 19 20:03:32 node1 kernel: [<ffffffff88589208>] :ptlrpc:lustre_msg_check_version_v2+0x8/0x20
Nov 19 20:03:32 node1 kernel: [<ffffffff887ff110>] :ost:ost_handle+0x54a0/0x5a60
Nov 19 20:03:32 node1 kernel: [<ffffffff88482de5>] :lnet:lnet_match_blocked_msg+0x375/0x390
Nov 19 20:03:32 node1 kernel: [<ffffffff800d7301>] free_block+0x11c/0x13b
Nov 19 20:03:32 node1 kernel: [<ffffffff80148fe6>] __next_cpu+0x19/0x28
Nov 19 20:03:32 node1 kernel: [<ffffffff80148fe6>] __next_cpu+0x19/0x28
Nov 19 20:03:32 node1 kernel: [<ffffffff80088f4a>] find_busiest_group+0x20d/0x621
Nov 19 20:03:32 node1 kernel: [<ffffffff8858eb75>] :ptlrpc:lustre_msg_get_conn_cnt+0x35/0xf0
Nov 19 20:03:32 node1 kernel: [<ffffffff80089d90>] enqueue_task+0x41/0x56
Nov 19 20:03:32 node1 kernel: [<ffffffff8859382d>] :ptlrpc:ptlrpc_check_req+0x1d/0x110
Nov 19 20:03:32 node1 kernel: [<ffffffff88595db3>] :ptlrpc:ptlrpc_server_handle_request+0xa93/0x1160
Nov 19 20:03:32 node1 kernel: [<ffffffff80063097>] thread_return+0x62/0xfe
Nov 19 20:03:32 node1 kernel: [<ffffffff8008882b>] __wake_up_common+0x3e/0x68
Nov 19 20:03:32 node1 kernel: [<ffffffff885992e8>] :ptlrpc:ptlrpc_main+0x1218/0x13e0
Nov 19 20:03:32 node1 kernel: [<ffffffff8008a3f6>] default_wake_function+0x0/0xe
Nov 19 20:03:32 node1 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
Nov 19 20:03:32 node1 kernel: [<ffffffff885980d0>] :ptlrpc:ptlrpc_main+0x0/0x13e0
Nov 19 20:03:32 node1 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11
Nov 19 20:03:32 node1 kernel:
Nov 19 20:03:34 node1 kernel: Lustre: Request x1319785196423799 sent from dafs-OST0001 to NID 192.168.3.115 at tcp 0s ago has timed out (limit 7s).
Nov 19 20:03:34 node1 kernel: Lustre: Skipped 40339060 previous similar messages 0; still busy with 3 active RPCs
Nov 19 20:03:46 node1 kernel: Lustre: 4924:0:(ldlm_lib.c:815:target_handle_connect()) Skipped 85 previous similar messages

So the client nodes are not able to use the storage continuously.

Actually, last weekend I downgraded the Lustre version to 1.8.0.1, so this is not with 1.8.1.1, but I'm afraid it doesn't matter. A reboot doesn't help at all any more; the problem comes back in 30 minutes.

Any idea or suggestion? I'm a bit desperate about why this happens and what I could do about it.

Thank you,

tamas

-------------- next part --------------
A non-text attachment was scrubbed...
Name: lustre.log.zip
Type: application/x-zip-compressed
Size: 184829 bytes
Desc: not available
Url : http://lists.lustre.org/pipermail/lustre-discuss/attachments/20091119/19a15494/attachment-0001.bin
Papp Tamás wrote:
> The logs are full with this:
>
> Nov 19 20:03:32 node1 kernel: BUG: soft lockup - CPU#3 stuck for 10s!
> [ll_ost_80:4894]
> Nov 19 20:03:32 node1 kernel: CPU 3:
<snip>
> Nov 19 20:03:34 node1 kernel: Lustre: Skipped 40339060 previous similar
> messages 0; still busy with 3 active RPCs

We had the same problem with 1.8.x.x.

We set lnet.printk=0 on our OSS nodes and it has helped us dramatically - we have not seen the 'soft lockup' problem since.

sysctl -w lnet.printk=0

This will turn off all but 'emerg' messages from lnet.

It would be interesting to know if this avoided the lockups for you, too.

Cheers,
Craig
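[Editor's sketch, not from Craig's mail: the `sysctl -w` above is lost on reboot. One common way to make it stick is a line in /etc/sysctl.conf, assuming the lnet module is already loaded when sysctl runs at boot. The snippet below only demonstrates the config line against a temporary file, not the live system:]

```shell
# Append the LNet console-quieting setting to a sysctl config file.
# A temp file stands in for /etc/sysctl.conf in this demo.
conf=$(mktemp)
echo 'lnet.printk = 0' >> "$conf"
# On a live OSS you would then apply it with:  sysctl -p /etc/sysctl.conf
grep '^lnet.printk' "$conf"
rm -f "$conf"
```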
Craig Prescott wrote, On 2009. 11. 19. 20:42:
> Papp Tamás wrote:
>> The logs are full with this:
>>
>> Nov 19 20:03:32 node1 kernel: BUG: soft lockup - CPU#3 stuck for 10s!
>> [ll_ost_80:4894]
>> Nov 19 20:03:32 node1 kernel: CPU 3:
> <snip>
>> Nov 19 20:03:34 node1 kernel: Lustre: Skipped 40339060 previous
>> similar messages 0; still busy with 3 active RPCs
>
> We had the same problem with 1.8.x.x.
>
> We set lnet.printk=0 on our OSS nodes and it has helped us
> dramatically - we have not seen the 'soft lockup' problem since.
>
> sysctl -w lnet.printk=0
>
> This will turn off all but 'emerg' messages from lnet.
>
> It would be interesting to know if this avoided the lockups for you, too.

I set it up. We'll see.

Thank you very much!

tamas
Craig Prescott wrote, On 2009. 11. 19. 20:42:
> We had the same problem with 1.8.x.x.
>
> We set lnet.printk=0 on our OSS nodes and it has helped us
> dramatically - we have not seen the 'soft lockup' problem since.
>
> sysctl -w lnet.printk=0
>
> This will turn off all but 'emerg' messages from lnet.
>
> It would be interesting to know if this avoided the lockups for you, too.

Well, this definitely helped, but didn't resolve the root of the problem.

A few minutes ago we were not able to reach our cluster from the clients.

On the MDS I see a lot of this:

Nov 27 10:52:17 meta1 kernel: BUG: soft lockup - CPU#3 stuck for 10s! [ll_evictor:6123]
Nov 27 10:52:17 meta1 kernel: CPU 3:
Nov 27 10:52:17 meta1 kernel: Modules linked in: mds(U) fsfilt_ldiskfs(U) mgs(U) mgc(U) ldiskfs(U) crc16(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) mptctl(U) mptbase(U) ipmi_devintf(U) ipmi_si(U) ipmi_msghandler(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) autofs4(U) lockd(U) sunrpc(U) cpufreq_ondemand(U) acpi_cpufreq(U) freq_table(U) dm_mirror(U) dm_multipath(U) scsi_dh(U) video(U) hwmon(U) backlight(U) sbs(U) i2c_ec(U) i2c_core(U) button(U) battery(U) asus_acpi(U) acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U) shpchp(U) pcspkr(U) sg(U) igb(U) dm_raid45(U) dm_message(U) dm_region_hash(U) dm_log(U) dm_mod(U) dm_mem_cache(U) usb_storage(U) ata_piix(U) ahci(U) libata(U) sd_mod(U) scsi_mod(U) raid1(U) ext3(U) jbd(U) uhci_hcd(U) ohci_hcd(U) ehci_hcd(U)
Nov 27 10:52:17 meta1 kernel: Pid: 6123, comm: ll_evictor Tainted: G 2.6.18-128.7.1.el5_lustre.1.8.1.1 #1
Nov 27 10:52:17 meta1 kernel: RIP: 0010:[<ffffffff80064af5>] [<ffffffff80064af5>] _write_lock+0xe/0xf
Nov 27 10:52:17 meta1 kernel: RSP: 0018:ffff8102154d9c78 EFLAGS: 00000246
Nov 27 10:52:17 meta1 kernel: RAX: 000000000000ffff RBX: 00000000000050c3 RCX: 0000000000042620
Nov 27 10:52:17 meta1 kernel: RDX: 0000000000000282 RSI: ffffffff802fae80 RDI: ffffc200108e4c3c
Nov 27 10:52:17 meta1 kernel: RBP: 0000000000000286 R08: ffff81000101f3e0 R09: 0000000000000000
Nov 27 10:52:17 meta1 kernel: R10: ffff8101960319c0 R11: 0000000000000150 R12: 0000000000000286
Nov 27 10:52:17 meta1 kernel: R13: ffff8102154d9c20 R14: ffff810196031a18 R15: ffff8101960319c0
Nov 27 10:52:17 meta1 kernel: FS: 00002abb113e4230(0000) GS:ffff810107f61540(0000) knlGS:0000000000000000
Nov 27 10:52:17 meta1 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Nov 27 10:52:18 meta1 kernel: CR2: 00000031d3803080 CR3: 0000000000201000 CR4: 00000000000006e0
Nov 27 10:52:18 meta1 kernel:
Nov 27 10:52:18 meta1 kernel: Call Trace:
Nov 27 10:52:18 meta1 kernel: [<ffffffff884caba7>] :obdclass:lustre_hash_for_each_empty+0x237/0x2b0
Nov 27 10:52:18 meta1 kernel: [<ffffffff884d1ae8>] :obdclass:class_disconnect+0x398/0x420
Nov 27 10:52:18 meta1 kernel: [<ffffffff8886d5e1>] :mds:mds_disconnect+0x121/0xe40
Nov 27 10:52:18 meta1 kernel: [<ffffffff8014b87a>] snprintf+0x44/0x4c
Nov 27 10:52:18 meta1 kernel: [<ffffffff884cd994>] :obdclass:class_fail_export+0x384/0x4c0
Nov 27 10:52:18 meta1 kernel: [<ffffffff8859e238>] :ptlrpc:ping_evictor_main+0x4f8/0x7e0
Nov 27 10:52:18 meta1 kernel: [<ffffffff8008a3f3>] default_wake_function+0x0/0xe
Nov 27 10:52:18 meta1 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
Nov 27 10:52:18 meta1 kernel: [<ffffffff8859dd40>] :ptlrpc:ping_evictor_main+0x0/0x7e0
Nov 27 10:52:18 meta1 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11
Nov 27 10:52:18 meta1 kernel:

After them:

Nov 27 10:53:05 meta1 kernel: LustreError: 6009:0:(events.c:367:server_bulk_callback()) event type 4, status -5, desc ffff81022ed54280
Nov 27 10:53:05 meta1 kernel: Lustre: dafs-MDT0000: haven't heard from client 4a05e5fb-590d-ec56-c1b3-ff0234152564 (at 192.168.3.141 at tcp) in 258 seconds. I think it's dead, and I am evicting it.
Nov 27 10:53:05 meta1 kernel: LustreError: 6009:0:(events.c:367:server_bulk_callback()) event type 4, status -5, desc ffff8102189b0cc0
Nov 27 10:53:05 meta1 kernel: LustreError: 6145:0:(handler.c:1563:mds_handle()) operation 400 on unconnected MDS from 12345-192.168.3.141 at tcp
Nov 27 10:53:05 meta1 kernel: LustreError: 6145:0:(ldlm_lib.c:1863:target_send_reply_msg()) @@@ processing error (-107) req at ffff8101bec0cc00 x1320487774846997/t0 o400-><?>@<?>:0/0 lens 192/0 e 0 to 0 dl 1259315591 ref 1 fl Interpret:H/0/0 rc -107/0
Nov 27 10:53:05 meta1 kernel: Lustre: 6149:0:(ldlm_lib.c:541:target_handle_reconnect()) dafs-MDT0000: b4c8f591-df5c-b6f2-93c1-57e5f64312d0 reconnecting 800; still busy with 1 active RPCs a02b-9917-c60e-d5b6c1c1d3c8@:0/0 lens 408/0 e 0 to 0 dl 0 ref 1 fl New:/0/0 rc 0/0
Nov 27 10:53:05 meta1 kernel: Lustre: 6282:0:(ldlm_lib.c:766:target_handle_connect()) MGS: exp ffff81023e83c000 already connecting l busy with 1 active RPCs
Nov 27 10:53:05 meta1 kernel: Lustre: 6282:0:(ldlm_lib.c:766:target_handle_connect()) Skipped 8 previous similar messages
Nov 27 10:53:05 meta1 kernel: Lustre: 6121:0:(ldlm_lib.c:835:target_handle_connect()) Skipped 10 previous similar messages
Nov 27 10:53:05 meta1 kernel: LustreError: 6282:0:(mgs_handler.c:661:mgs_handle()) MGS handle cmd=250 rc=-114
Nov 27 10:53:05 meta1 kernel: LustreError: 6282:0:(mgs_handler.c:661:mgs_handle()) Skipped 9 previous similar messages 3400 x1320390320856467/t0 o37->56d39829-d432-2c01-0261-eee2779a4fe3 at NET_0x20000c0a80390_UUID:0/0 lens 408/360 e 1 to 0 dl 1259315514 ref 1 fl Complete:/0/0 rc -110/-110
Nov 27 10:53:05 meta1 kernel: Lustre: 6841:0:(watchdog.c:363:lcw_update_time()) Expired watchdog for pid 6841 disabled after 208.42s
Nov 27 10:53:05 meta1 kernel: LustreError: 6145:0:(ldlm_lib.c:1863:target_send_reply_msg()) Skipped 17 previous similar messages
Nov 27 10:53:06 meta1 kernel: Lustre: 6161:0:(ldlm_lib.c:541:target_handle_reconnect()) dafs-MDT0000: d3728ce6-b580-dbb3-b262-123d1516703b reconnecting
Nov 27 10:53:06 meta1 kernel: Lustre: 6161:0:(ldlm_lib.c:541:target_handle_reconnect()) Skipped 185 previous similar messages 800; still busy with 1 active RPCs
Nov 27 10:53:06 meta1 kernel: Lustre: 6143:0:(ldlm_lib.c:835:target_handle_connect()) Skipped 1725 previous similar messages
Nov 27 10:53:06 meta1 kernel: Lustre: 6161:0:(ldlm_lib.c:541:target_handle_reconnect()) dafs-MDT0000: d3728ce6-b580-dbb3-b262-123d1516703b reconnecting
Nov 27 10:53:06 meta1 kernel: Lustre: 6161:0:(ldlm_lib.c:541:target_handle_reconnect()) Skipped 1763 previous similar messages 6703b at NET_0x20000c0a80376_UUID:0/0 lens 368/264 e 0 to 0 dl 1259315686 ref 1 fl Interpret:/0/0 rc -16/0
Nov 27 10:53:06 meta1 kernel: LustreError: 6152:0:(ldlm_lib.c:1863:target_send_reply_msg()) Skipped 4173 previous similar messages 800; still busy with 1 active RPCs
Nov 27 10:53:07 meta1 kernel: Lustre: 6148:0:(ldlm_lib.c:835:target_handle_connect()) Skipped 4913 previous similar messages
Nov 27 10:53:07 meta1 kernel: Lustre: 6136:0:(ldlm_lib.c:541:target_handle_reconnect()) dafs-MDT0000: d3728ce6-b580-dbb3-b262-123d1516703b reconnecting
Nov 27 10:53:07 meta1 kernel: Lustre: 6136:0:(ldlm_lib.c:541:target_handle_reconnect()) Skipped 4916 previous similar messages 6703b at NET_0x20000c0a80376_UUID:0/0 lens 368/264 e 0 to 0 dl 1259315688 ref 1 fl Interpret:/0/0 rc -16/0
Nov 27 10:53:08 meta1 kernel: LustreError: 6160:0:(ldlm_lib.c:1863:target_send_reply_msg()) Skipped 9782 previous similar messages
Nov 27 10:53:08 meta1 kernel: Lustre: MGS: haven't heard from client f6fe0d62-238f-d5b3-5137-d1d897bc1195 (at 192.168.3.141 at tcp) in 467 seconds. I think it's dead, and I am evicting it.
Nov 27 10:53:08 meta1 kernel: Lustre: Skipped 1 previous similar message 800; still busy with 1 active RPCs
Nov 27 10:53:09 meta1 kernel: Lustre: 6156:0:(ldlm_lib.c:835:target_handle_connect()) Skipped 9712 previous similar messages
Nov 27 10:53:09 meta1 kernel: Lustre: 6156:0:(ldlm_lib.c:541:target_handle_reconnect()) dafs-MDT0000: d3728ce6-b580-dbb3-b262-123d1516703b reconnecting
Nov 27 10:53:09 meta1 kernel: Lustre: 6156:0:(ldlm_lib.c:541:target_handle_reconnect()) Skipped 9718 previous similar messages
Nov 27 10:53:11 meta1 kernel: LustreError: 27480:0:(handler.c:161:mds_sendpage()) @@@ bulk failed: timeout 0(4096), evicting d3728ce6-b580-dbb3-b262-123d1516703b at NET_0x20000c0a80376_UUID t:/0/0 rc 0/0
Nov 27 10:54:23 meta1 hpasmpld[2804]: Sensor 9b is missing. Reading Flags = 40 d3728ce6-b580-dbb3-b262-123d1516703b at NET_0x20000c0a80376_UUID:0/0 lens 408/528 e 0 to 0 dl 1259315709 ref 1 fl Interpret:/0/0 rc 0/0
Nov 27 10:55:03 meta1 kernel: LustreError: 27480:0:(mds_open.c:1679:mds_close()) Skipped 3 previous similar messages 516703b at NET_0x20000c0a80376_UUID:0/0 lens 408/528 e 0 to 0 dl 1259315709 ref 1 fl Interpret:/0/0 rc -116/0
Nov 27 10:55:03 meta1 kernel: LustreError: 27480:0:(ldlm_lib.c:1863:target_send_reply_msg()) Skipped 14495 previous similar messages
Nov 27 10:56:23 meta1 hpasmpld[2804]: Sensor 9b is missing. Reading Flags = 40

On node2 a single one of this:

Nov 27 10:52:43 node2 kernel: ll_ost_io_68 S ffff81023c101080 0 6222 1 6223 6221 (L-TLB)
Nov 27 10:52:43 node2 kernel: ffff810209f5b960 0000000000000046 0000000000000001 ffffffff884776b0
Nov 27 10:52:43 node2 kernel: ffff810107f24000 000000000000000a ffff81022408a7e0 ffff81023c101080
Nov 27 10:52:43 node2 kernel: 000254bcf81e00e9 00000000000003b2 ffff81022408a9c8 0000000100000000
Nov 27 10:52:43 node2 kernel: Call Trace:
Nov 27 10:52:43 node2 kernel: [<ffffffff884776b0>] :lnet:LNetMDBind+0x2c0/0x420
Nov 27 10:52:43 node2 kernel: [<ffffffff8003d382>] lock_timer_base+0x1b/0x3c
Nov 27 10:52:43 node2 kernel: [<ffffffff8001c6fa>] __mod_timer+0xb0/0xbe
Nov 27 10:52:43 node2 kernel: [<ffffffff80063947>] schedule_timeout+0x8a/0xad
Nov 27 10:52:43 node2 kernel: [<ffffffff80094f4f>] process_timeout+0x0/0x5
Nov 27 10:52:43 node2 kernel: [<ffffffff888060dc>] :ost:ost_brw_read+0x127c/0x1a30
Nov 27 10:52:43 node2 kernel: [<ffffffff8008a3f3>] default_wake_function+0x0/0xe
Nov 27 10:52:43 node2 kernel: [<ffffffff885865d8>] :ptlrpc:lustre_msg_check_version_v2+0x8/0x20
Nov 27 10:52:43 node2 kernel: [<ffffffff8880e429>] :ost:ost_handle+0x3129/0x5a70
Nov 27 10:52:43 node2 kernel: [<ffffffff88478923>] :lnet:lnet_ni_send+0x93/0xd0
Nov 27 10:52:43 node2 kernel: [<ffffffff8847ad23>] :lnet:lnet_send+0x973/0x9a0
Nov 27 10:52:43 node2 kernel: [<ffffffff8001c6fa>] __mod_timer+0xb0/0xbe
Nov 27 10:52:43 node2 kernel: [<ffffffff8847f305>] :lnet:lnet_match_blocked_msg+0x375/0x390
Nov 27 10:52:43 node2 kernel: [<ffffffff8855be8a>] :ptlrpc:ldlm_resource_foreach+0x11a/0x390
Nov 27 10:52:43 node2 kernel: [<ffffffff884dadf1>] :obdclass:class_handle2object+0xd1/0x160
Nov 27 10:52:43 node2 kernel: [<ffffffff80148e8c>] __next_cpu+0x19/0x28
Nov 27 10:52:43 node2 kernel: [<ffffffff80088f36>] find_busiest_group+0x20d/0x621
Nov 27 10:52:43 node2 kernel: [<ffffffff8854119a>] :ptlrpc:lock_res_and_lock+0xba/0xd0
Nov 27 10:52:43 node2 kernel: [<ffffffff8858bf05>] :ptlrpc:lustre_msg_get_conn_cnt+0x35/0xf0
Nov 27 10:52:43 node2 kernel: [<ffffffff80089d8d>] enqueue_task+0x41/0x56
Nov 27 10:52:43 node2 kernel: [<ffffffff88590c1d>] :ptlrpc:ptlrpc_check_req+0x1d/0x110
Nov 27 10:52:43 node2 kernel: [<ffffffff88593357>] :ptlrpc:ptlrpc_server_handle_request+0xa97/0x1170
Nov 27 10:52:43 node2 kernel: [<ffffffff8008881d>] __wake_up_common+0x3e/0x68
Nov 27 10:52:44 node2 kernel: [<ffffffff88596e08>] :ptlrpc:ptlrpc_main+0x1218/0x13e0
Nov 27 10:52:44 node2 kernel: [<ffffffff8008a3f3>] default_wake_function+0x0/0xe
Nov 27 10:52:44 node2 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
Nov 27 10:52:44 node2 kernel: [<ffffffff88595bf0>] :ptlrpc:ptlrpc_main+0x0/0x13e0
Nov 27 10:52:44 node2 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11
Nov 27 10:52:44 node2 kernel:
Nov 27 10:52:44 node2 kernel: LustreError: dumping log to /tmp/lustre-log.1259315563.6222

I attach the lustre-log. I googled for LNetMDBind, but found no useful pages. I see this in Understanding_Lustre_Filesystem_Internals.pdf, but it's not really helpful to me:

int LNetMDBind(lnet_md_t umd, lnet_unlink_t unlink, lnet_handle_md_t *handle)

This function creates a standalone memory descriptor, i.e., an MD that is not attached to an ME.

By the way, it's a bit strange that nobody else has met this bug yet.

tamas

-------------- next part --------------
A non-text attachment was scrubbed...
Name: lustre-log.1259315563.6222.bz2
Type: application/octet-stream
Size: 9859 bytes
Desc: not available
Url : http://lists.lustre.org/pipermail/lustre-discuss/attachments/20091127/d315eee5/attachment.obj
On 2009-11-27, at 03:13, Papp Tamás wrote:
> Craig Prescott wrote, On 2009. 11. 19. 20:42:
>> We had the same problem with 1.8.x.x.
>>
>> We set lnet.printk=0 on our OSS nodes and it has helped us
>> dramatically - we have not seen the 'soft lockup' problem since.
>>
>> sysctl -w lnet.printk=0
>>
>> This will turn off all but 'emerg' messages from lnet.
>>
>> It would be interesting to know if this avoided the lockups for
>> you, too.
>
> Well, this definitely helped, but didn't resolve the root of the
> problem.
>
> A few minutes ago we were not able to reach our cluster from clients.
>
> On MDS I see a lot of this:
>
> Nov 27 10:52:17 meta1 kernel: BUG: soft lockup - CPU#3 stuck for
> 10s! [ll_evictor:6123]
> Nov 27 10:52:18 meta1 kernel: Call Trace:
> :obdclass:lustre_hash_for_each_empty+0x237/0x2b0
> :obdclass:class_disconnect+0x398/0x420
> :mds:mds_disconnect+0x121/0xe40
> :obdclass:class_fail_export+0x384/0x4c0
> :ptlrpc:ping_evictor_main+0x4f8/0x7e0
> default_wake_function+0x0/0xe
> :ptlrpc:ping_evictor_main+0x0/0x7e0

This looks like the server evicting a client that has a lot of locks. One thing to try is in lustre_hash_for_each_empty() add a call to cond_resched(), since it seems this function could run a long time if func() doesn't ever cause a reschedule:

lustre_hash_for_each_empty(lustre_hash_t *lh, lh_for_each_cb func,
{
        read_unlock(&lh->lh_rwlock);
        func(obj, data);
        (void)lh_put(lh, hnode);
+       cond_resched();
        goto restart;

I'm not sure this is the root cause, but you could check the DLM lock stats in /proc/fs/lustre/ldlm/namespaces/*/lock_count on some clients, to see how many locks they are holding, or the same on the MDS, which will be the total number of locks currently granted to all clients.

> After them:
>
> 6009:0:(events.c:367:server_bulk_callback()) event type 4, status
> -5, desc ffff81022ed54280

This is just fallout from the MDS being too busy to handle requests.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
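[Editor's note: Andreas's /proc check is easy to script. A minimal sketch that sums the granted-lock counts over all LDLM namespaces; it assumes the Lustre modules are loaded so the /proc path exists, and harmlessly prints 0 on a machine without Lustre:]

```shell
# Sum currently granted DLM locks across all LDLM namespaces.
# On a client this is that client's total; on the MDS it is the total
# granted to all clients. Prints 0 if /proc/fs/lustre is absent.
total=0
for f in /proc/fs/lustre/ldlm/namespaces/*/lock_count; do
    if [ -r "$f" ]; then
        total=$((total + $(cat "$f")))
    fi
done
echo "$total"
```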
Andreas Dilger wrote, On 2009. 11. 28. 0:22:> On 2009-11-27, at 03:13, Papp Tam?s wrote: >> Craig Prescott wrote, On 2009. 11. 19. 20:42: >>> We had the same problem with 1.8.x.x. >>> >>> We set lnet.printk=0 on our OSS nodes and it has helped us >>> dramatically - we have not seen the ''soft lockup'' problem since. >>> >>> sysctl -w lnet.printk=0 >>> >>> This will turn off all but ''emerg'' messages from lnet. >>> >>> It would be interesting to know if this avoided the lockups for you, >>> too. >> >> Well, this definetely helped, but didn''t resolve the root of the >> problem. >> >> A few minutes ago we were not able to reach our cluster from clients. >> >> On MDS I see a lot of this: >> >> >> Nov 27 10:52:17 meta1 kernel: BUG: soft lockup - CPU#3 stuck for 10s! >> [ll_evictor:6123] >> Nov 27 10:52:18 meta1 kernel: Call Trace: >> :obdclass:lustre_hash_for_each_empty+0x237/0x2b0 >> :obdclass:class_disconnect+0x398/0x420 >> :mds:mds_disconnect+0x121/0xe40 >> :obdclass:class_fail_export+0x384/0x4c0 >> :ptlrpc:ping_evictor_main+0x4f8/0x7e0 >> default_wake_function+0x0/0xe >> :ptlrpc:ping_evictor_main+0x0/0x7e0 > > This looks like the server evicting a client that has a lot of locks. > One thing to try is in lustre_hash_for_each_empty() add a call to > cond_resched(), since it seems this function could run a long time if > func() doesn''t ever cause a reschedule: > > lustre_hash_for_each_empty(lustre_hash_t *lh, lh_for_each_cb func, > { > read_unlock(&lh->lh_rwlock); > func(obj, data); > (void)lh_put(lh, hnode); > + cond_resched(); > goto restart; > > I''m not sure this is the root cause, but you could check the DLM lock > stats in /proc/fs/lustre/ldlm/namespaces/*/lock_count on some clients, > to see how many locks they are holding, or the same on the MDS, which > will be the total number of locks currently granted to all clients.At the weekend we upgraded the firmware on the OSS''s raid controllers (SmartArray P410) and the messages has gone. 
Of course I can't be sure it was because of that, but we are hopeful and
still watching it. On the MDS the messages are rarer; in two days there
were only two of them:

Dec 1 09:19:18 meta1 kernel: Modules linked in: mds(U) fsfilt_ldiskfs(U) mgs(U) mgc(U) ldiskfs(U) crc16(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) mptctl(U) mptbase(U) ipmi_devintf(U) ipmi_si(U) ipmi_msghandler(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) autofs4(U) lockd(U) sunrpc(U) cpufreq_ondemand(U) acpi_cpufreq(U) freq_table(U) dm_mirror(U) dm_multipath(U) scsi_dh(U) video(U) hwmon(U) backlight(U) sbs(U) i2c_ec(U) i2c_core(U) button(U) battery(U) asus_acpi(U) acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U) sg(U) pcspkr(U) shpchp(U) igb(U) dm_raid45(U) dm_message(U) dm_region_hash(U) dm_log(U) dm_mod(U) dm_mem_cache(U) usb_storage(U) ata_piix(U) ahci(U) libata(U) sd_mod(U) scsi_mod(U) raid1(U) ext3(U) jbd(U) uhci_hcd(U) ohci_hcd(U) ehci_hcd(U)
Dec 1 09:19:18 meta1 kernel: Pid: 4470, comm: ll_evictor Tainted: G 2.6.18-128.7.1.el5_lustre.1.8.1.1 #1
Dec 1 09:19:18 meta1 kernel: RIP: 0010:[<ffffffff80064aee>] [<ffffffff80064aee>] _write_lock+0x7/0xf
Dec 1 09:19:18 meta1 kernel: RSP: 0018:ffff81022e503c78 EFLAGS: 00000246
Dec 1 09:19:18 meta1 kernel: RAX: 0000000000007fff RBX: 00000000000013d6 RCX: 0000000000006f89
Dec 1 09:19:18 meta1 kernel: RDX: 0000000000000216 RSI: ffffffff802fae80 RDI: ffffc2001085cd6c
Dec 1 09:19:18 meta1 kernel: RBP: 0000000000000286 R08: ffff81000101f3e0 R09: 5a5a5a5a5a5a5a5a
Dec 1 09:19:18 meta1 kernel: R10: 5a5a5a5a5a5a5a5a R11: 5a5a5a5a5a5a5a5a R12: 0000000000000286
Dec 1 09:19:18 meta1 kernel: R13: ffff81022e503c20 R14: ffff8101cea7f418 R15: ffff8101cea7f3c0
Dec 1 09:19:18 meta1 kernel: FS: 00002ae6ba0b1230(0000) GS:ffff810107f61540(0000) knlGS:0000000000000000
Dec 1 09:19:18 meta1 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Dec 1 09:19:18 meta1 kernel: CR2: 00002aaaae52d000 CR3: 00000002350d0000 CR4: 00000000000006e0
Dec 1 09:19:18 meta1 kernel:
Dec 1 09:19:18 meta1 kernel: Call Trace:
Dec 1 09:19:18 meta1 kernel: [<ffffffff884ccba7>] :obdclass:lustre_hash_for_each_empty+0x237/0x2b0
Dec 1 09:19:19 meta1 kernel: [<ffffffff884d3ae8>] :obdclass:class_disconnect+0x398/0x420
Dec 1 09:19:19 meta1 kernel: [<ffffffff8886f5e1>] :mds:mds_disconnect+0x121/0xe40
Dec 1 09:19:19 meta1 kernel: [<ffffffff8014b87a>] snprintf+0x44/0x4c
Dec 1 09:19:19 meta1 kernel: [<ffffffff884cf994>] :obdclass:class_fail_export+0x384/0x4c0
Dec 1 09:19:19 meta1 kernel: [<ffffffff885a0238>] :ptlrpc:ping_evictor_main+0x4f8/0x7e0
Dec 1 09:19:19 meta1 kernel: [<ffffffff8008a3f3>] default_wake_function+0x0/0xe
Dec 1 09:19:19 meta1 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
Dec 1 09:19:19 meta1 kernel: [<ffffffff8859fd40>] :ptlrpc:ping_evictor_main+0x0/0x7e0
Dec 1 09:19:19 meta1 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11
Dec 1 10:08:38 meta1 kernel: BUG: soft lockup - CPU#1 stuck for 10s! [ldlm_elt:4466]
Dec 1 10:08:38 meta1 kernel: CPU 1:
Dec 1 10:08:38 meta1 kernel: Modules linked in: mds(U) fsfilt_ldiskfs(U) mgs(U) mgc(U) ldiskfs(U) crc16(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) mptctl(U) mptbase(U) ipmi_devintf(U) ipmi_si(U) ipmi_msghandler(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) autofs4(U) lockd(U) sunrpc(U) cpufreq_ondemand(U) acpi_cpufreq(U) freq_table(U) dm_mirror(U) dm_multipath(U) scsi_dh(U) video(U) hwmon(U) backlight(U) sbs(U) i2c_ec(U) i2c_core(U) button(U) battery(U) asus_acpi(U) acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U) sg(U) pcspkr(U) shpchp(U) igb(U) dm_raid45(U) dm_message(U) dm_region_hash(U) dm_log(U) dm_mod(U) dm_mem_cache(U) usb_storage(U) ata_piix(U) ahci(U) libata(U) sd_mod(U) scsi_mod(U) raid1(U) ext3(U) jbd(U) uhci_hcd(U) ohci_hcd(U) ehci_hcd(U)
Dec 1 10:08:38 meta1 kernel: Pid: 4466, comm: ldlm_elt Tainted: G 2.6.18-128.7.1.el5_lustre.1.8.1.1 #1
Dec 1 10:08:38 meta1 kernel: RIP: 0010:[<ffffffff80064aee>] [<ffffffff80064aee>] _write_lock+0x7/0xf
Dec 1 10:08:38 meta1 kernel: RSP: 0018:ffff81022efa3ca8 EFLAGS: 00000246
Dec 1 10:08:38 meta1 kernel: RAX: 0000000000007fff RBX: 0000000000004cc7 RCX: 0000000000005e96
Dec 1 10:08:38 meta1 kernel: RDX: ffffffffffffa6d8 RSI: ffffffff802fae80 RDI: ffffc20010611c7c
Dec 1 10:08:38 meta1 kernel: RBP: 0000000000000282 R08: ffff81000100e8e0 R09: 5a5a5a5a5a5a5a5a
Dec 1 10:08:38 meta1 kernel: R10: 5a5a5a5a5a5a5a5a R11: 5a5a5a5a5a5a5a5a R12: 0000000000000282
Dec 1 10:08:38 meta1 kernel: R13: ffff81022efa3c50 R14: ffff8101b9d2eb98 R15: ffff8101b9d2eb40
Dec 1 10:08:38 meta1 kernel: FS: 00002ae6ba0b1230(0000) GS:ffff810107ed96c0(0000) knlGS:0000000000000000
Dec 1 10:08:38 meta1 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Dec 1 10:08:38 meta1 kernel: CR2: 00000031d329a3a0 CR3: 0000000237251000 CR4: 00000000000006e0
Dec 1 10:08:38 meta1 kernel:
Dec 1 10:08:38 meta1 kernel: Call Trace:
Dec 1 10:08:38 meta1 kernel: [<ffffffff884ccba7>] :obdclass:lustre_hash_for_each_empty+0x237/0x2b0
Dec 1 10:08:38 meta1 kernel: [<ffffffff884d3ae8>] :obdclass:class_disconnect+0x398/0x420
Dec 1 10:08:38 meta1 kernel: [<ffffffff8886f5e1>] :mds:mds_disconnect+0x121/0xe40
Dec 1 10:08:38 meta1 kernel: [<ffffffff80089d8d>] enqueue_task+0x41/0x56
Dec 1 10:08:38 meta1 kernel: [<ffffffff80089bdd>] dequeue_task+0x18/0x37
Dec 1 10:08:38 meta1 kernel: [<ffffffff80063098>] thread_return+0x62/0xfe
Dec 1 10:08:38 meta1 kernel: [<ffffffff884cf994>] :obdclass:class_fail_export+0x384/0x4c0
Dec 1 10:08:39 meta1 kernel: [<ffffffff8003b59a>] remove_wait_queue+0x1c/0x2c
Dec 1 10:08:39 meta1 kernel: [<ffffffff8857224f>] :ptlrpc:expired_lock_main+0x68f/0x8b0
Dec 1 10:08:39 meta1 kernel: [<ffffffff8008a3f3>] default_wake_function+0x0/0xe
Dec 1 10:08:39 meta1 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
Dec 1 10:08:39 meta1 kernel: [<ffffffff88571bc0>] :ptlrpc:expired_lock_main+0x0/0x8b0
Dec 1 10:08:39 meta1 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11

I'm afraid it's not the same issue. The MDT is not on a Smart Array; it
is on an md RAID1 array, though I don't know if that matters. I think
these happened around the time users switched on their workstations.
Other than that, no problems with the cluster were reported. I'm sorry
this has gotten a bit long...

Do you still think I should recompile the kernel with the change above?

Thank you,
tamas
On Dec 1, 2009, at 10:51 PM, Papp Tamás wrote:
> I'm afraid it's not the same issue.

The stacks still look similar, AFAICS.

> Do you still think I should recompile the kernel with the change above?

Actually, this is bug 19557, and a patch is pending review.
I would be delighted if you could give the patch a try.

Cheers, Johann
Johann Lombardi wrote, On 2009. 12. 02. 0:18:
> Actually, this is bug 19557, and a patch is pending review.
> I would be delighted if you could give the patch a try.

How can I rebuild only the Lustre modules?

I checked out the b1_8 source tree and applied the patches from bugzilla.

Then I tried this:

./configure --with-linux=/usr/src/kernels/2.6.18-128.7.1.el5_lustre.1.8.1.1-x86_64/

make says:

mkdir -p linux-stage/fs/ext3 linux-stage/include/linux
cp /usr/src/kernels/2.6.18-128.7.1.el5_lustre.1.8.1.1-x86_64//fs/ext3/acl.c
   /usr/src/kernels/2.6.18-128.7.1.el5_lustre.1.8.1.1-x86_64//fs/ext3/balloc.c
   /usr/src/kernels/2.6.18-128.7.1.el5_lustre.1.8.1.1-x86_64//fs/ext3/bitmap.c
   /usr/src/kernels/2.6.18-128.7.1.el5_lustre.1.8.1.1-x86_64//fs/ext3/dir.c
   /usr/src/kernels/2.6.18-128.7.1.el5_lustre.1.8.1.1-x86_64//fs/ext3/file.c
   /usr/src/kernels/2.6.18-128.7.1.el5_lustre.1.8.1.1-x86_64//fs/ext3/fsync.c
   /usr/src/kernels/2.6.18-128.7.1.el5_lustre.1.8.1.1-x86_64//fs/ext3/hash.c
   /usr/src/kernels/2.6.18-128.7.1.el5_lustre.1.8.1.1-x86_64//fs/ext3/ialloc.c
   /usr/src/kernels/2.6.18-128.7.1.el5_lustre.1.8.1.1-x86_64//fs/ext3/inode.c
   /usr/src/kernels/2.6.18-128.7.1.el5_lustre.1.8.1.1-x86_64//fs/ext3/ioctl.c
   /usr/src/kernels/2.6.18-128.7.1.el5_lustre.1.8.1.1-x86_64//fs/ext3/namei.c
   /usr/src/kernels/2.6.18-128.7.1.el5_lustre.1.8.1.1-x86_64//fs/ext3/resize.c
   /usr/src/kernels/2.6.18-128.7.1.el5_lustre.1.8.1.1-x86_64//fs/ext3/super.c
   /usr/src/kernels/2.6.18-128.7.1.el5_lustre.1.8.1.1-x86_64//fs/ext3/symlink.c
   /usr/src/kernels/2.6.18-128.7.1.el5_lustre.1.8.1.1-x86_64//fs/ext3/xattr.c
   /usr/src/kernels/2.6.18-128.7.1.el5_lustre.1.8.1.1-x86_64//fs/ext3/xattr_security.c
   /usr/src/kernels/2.6.18-128.7.1.el5_lustre.1.8.1.1-x86_64//fs/ext3/xattr_trusted.c
   /usr/src/kernels/2.6.18-128.7.1.el5_lustre.1.8.1.1-x86_64//fs/ext3/xattr_user.c
   /usr/src/kernels/2.6.18-128.7.1.el5_lustre.1.8.1.1-x86_64//fs/ext3/acl.h
   /usr/src/kernels/2.6.18-128.7.1.el5_lustre.1.8.1.1-x86_64//fs/ext3/namei.h
   /usr/src/kernels/2.6.18-128.7.1.el5_lustre.1.8.1.1-x86_64//fs/ext3/xattr.h
   /usr/src/kernels/2.6.18-128.7.1.el5_lustre.1.8.1.1-x86_64//fs/ext3/Makefile
   linux-stage/fs/ext3
if test -n "/usr/src/kernels/2.6.18-128.7.1.el5_lustre.1.8.1.1-x86_64//include/linux/ext3_fs.h
   /usr/src/kernels/2.6.18-128.7.1.el5_lustre.1.8.1.1-x86_64//include/linux/ext3_fs_i.h
   /usr/src/kernels/2.6.18-128.7.1.el5_lustre.1.8.1.1-x86_64//include/linux/ext3_fs_sb.h
   /usr/src/kernels/2.6.18-128.7.1.el5_lustre.1.8.1.1-x86_64//include/linux/ext3_jbd.h" ; then \
cp /usr/src/kernels/2.6.18-128.7.1.el5_lustre.1.8.1.1-x86_64//include/linux/ext3_fs.h
   /usr/src/kernels/2.6.18-128.7.1.el5_lustre.1.8.1.1-x86_64//include/linux/ext3_fs_i.h
   /usr/src/kernels/2.6.18-128.7.1.el5_lustre.1.8.1.1-x86_64//include/linux/ext3_fs_sb.h
   /usr/src/kernels/2.6.18-128.7.1.el5_lustre.1.8.1.1-x86_64//include/linux/ext3_jbd.h
   linux-stage/include/linux; \
fi
Applying ext3 patches:
ext3-version-2.6-rhel5.patch
ext3-wantedi-2.6-rhel4.patch
iopen-2.6.18-rhel5.patch
ext3-map_inode_page-2.6.18.patch
export-ext3-2.6-rhel4.patch
ext3-include-fixes-2.6-rhel4.patch
ext3-extents-2.6.18-vanilla.patch
1 out of 5 hunks FAILED -- saving rejects to file include/linux/ext3_fs.h.rej
make[5]: *** [sources] Error 1
make[5]: Leaving directory `/usr/src/cvs/lustre/ldiskfs/ldiskfs'
make[4]: *** [all-sources] Error 2
make[4]: Leaving directory `/usr/src/cvs/lustre/ldiskfs'
make[3]: *** [all-recursive] Error 1
make[3]: Leaving directory `/usr/src/cvs/lustre/ldiskfs'
make[2]: *** [all] Error 2
make[2]: Leaving directory `/usr/src/cvs/lustre/ldiskfs'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/usr/src/cvs/lustre'
make: *** [all] Error 2

Why does it want to patch the kernel again?

Thank you,
tamas
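One way to see which hunk of the ext3-extents patch rejects, before letting the build system or quilt touch anything, is a dry-run apply. A minimal sketch; `try_patch` is my own helper name, and the kernel_patches location in the comment is an assumption about the b1_8 tree layout:

```shell
# try_patch: hypothetical helper (my naming, not part of the Lustre
# build).  Dry-runs a patch against a tree so failing hunks are
# reported without modifying any files.  Pass the patch file as an
# absolute path, since we cd into the tree first.
try_patch() {
    tree=$1
    patchfile=$2
    # --dry-run: report only; --force: never prompt interactively
    ( cd "$tree" && patch -p1 --dry-run --force < "$patchfile" )
}

# Against the staged ext3 sources it would look roughly like this
# (the kernel_patches path is an assumption; adjust to your checkout):
#   try_patch /usr/src/cvs/lustre/ldiskfs/ldiskfs/linux-stage \
#       /usr/src/cvs/lustre/ldiskfs/kernel_patches/patches/ext3-extents-2.6.18-vanilla.patch
```

The exit status is nonzero when any hunk fails, which makes it easy to tell a clean apply from a context mismatch.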
On 2009-12-05, at 10:28, Papp Tamás wrote:
> Johann Lombardi wrote, On 2009. 12. 02. 0:18:
>> Actually, this is bug 19557, and a patch is pending review.
>> I would be delighted if you could give the patch a try.
>
> How can I rebuild only the Lustre modules?
>
> I checked out the b1_8 source tree and applied the patches from
> bugzilla.
>
> Then I tried this:
>
> ./configure --with-linux=/usr/src/kernels/2.6.18-128.7.1.el5_lustre.1.8.1.1-x86_64/

This should be the right kernel for b1_8, according to lustre/ChangeLog.

> make says:
>
> mkdir -p linux-stage/fs/ext3 linux-stage/include/linux
> cp /usr/src/kernels/2.6.18-128.7.1.el5_lustre.1.8.1.1-x86_64//fs/ext3/acl.c
> [same file-copy output as in the previous message, trimmed]
> Applying ext3 patches:
> ext3-version-2.6-rhel5.patch
> ext3-wantedi-2.6-rhel4.patch
> iopen-2.6.18-rhel5.patch
> ext3-map_inode_page-2.6.18.patch
> export-ext3-2.6-rhel4.patch
> ext3-include-fixes-2.6-rhel4.patch
> ext3-extents-2.6.18-vanilla.patch
> 1 out of 5 hunks FAILED -- saving rejects to include/linux/ext3_fs.h.rej
>
> Why does it want to patch the kernel again?

This is trying to build the ldiskfs module from the ext3 sources. It
_should_ work, given that you have the right kernel sources, but clearly
either the patch was changed, or something is different between your
ext3 and what the patch expects. This is normally just a simple context
error.
To fix this:

    cd ldiskfs/ldiskfs/linux-stage
    quilt push -f
    {apply include/linux/ext3_fs.h.rej to ext3_fs.h by hand}
    quilt refresh
    cd -

then "make" should work again.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
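After `quilt push -f`, every hunk that failed to apply is left in a `*.rej` file next to its target, and each one has to be merged by hand before `quilt refresh`. A tiny sketch for finding them all; `list_rejects` is my own helper name, not a quilt command:

```shell
# list_rejects: hypothetical helper (my naming).  Lists leftover
# reject files under a directory so none are missed before refreshing
# the quilt patch.
list_rejects() {
    find "${1:-.}" -name '*.rej' -print
}

# From the staging directory of the failed push:
#   cd ldiskfs/ldiskfs/linux-stage && list_rejects .
```

An empty listing after the hand-merge is a reasonable check that `quilt refresh` will capture a complete patch.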
On 2009. 12. 06. 1:47, Andreas Dilger wrote:
>> ./configure --with-linux=/usr/src/kernels/2.6.18-128.7.1.el5_lustre.1.8.1.1-x86_64/
>
> This should be the right kernel for b1_8, according to lustre/ChangeLog.
>
> This is trying to build the ldiskfs module from the ext3 sources. It
> _should_ work, given that you have the right kernel sources, but
> clearly either the patch was changed, or something is different
> between your ext3 and what the patch expects. This is normally just a
> simple context error.
>
> To fix this:
>
>     cd ldiskfs/ldiskfs/linux-stage
>     quilt push -f
>     {apply include/linux/ext3_fs.h.rej to ext3_fs.h by hand}
>     quilt refresh
>     cd -
>
> then "make" should work again.

Well, it was not working, and my guess is that it should not work. I
didn't mention it before, but
/usr/src/kernels/2.6.18-128.7.1.el5_lustre.1.8.1.1-x86_64/ belongs to
the official kernel-lustre-devel-2.6.18-128.7.1.el5_lustre.1.8.1.1
package.

Anyway, I built my own kernel and Lustre b1_8 with the patches from bug
19557:

Linux meta1 2.6.18-prep #1 SMP Sun Dec 6 14:40:15 CET 2009 x86_64 x86_64 x86_64 GNU/Linux

I installed it on the MDS and the two OSSs.
Unfortunately, no luck:

Dec 9 18:41:36 node1 kernel: ll_ost_io_05 S ffff81023f353860 0 4421 1 4422 4420 (L-TLB)
Dec 9 18:41:36 node1 kernel: ffff810229cf9990 0000000000000046 00000000000000fd ffffffff88491b91
Dec 9 18:41:36 node1 kernel: ffff8101ff1d65a0 000000000000000a ffff810229c080c0 ffff81023f353860
Dec 9 18:41:36 node1 kernel: 0000f4410ae68bb7 0000000000000825 ffff810229c082a8 0000000300000000
Dec 9 18:41:36 node1 kernel: Call Trace:
Dec 9 18:41:36 node1 kernel: [<ffffffff88491b91>] :lnet:LNetMDBind+0x301/0x450
Dec 9 18:41:36 node1 kernel: [<ffffffff8003dacd>] lock_timer_base+0x1b/0x3c
Dec 9 18:41:36 node1 kernel: [<ffffffff8001caa7>] __mod_timer+0xb0/0xbe
Dec 9 18:41:36 node1 kernel: [<ffffffff8006387b>] schedule_timeout+0x8a/0xad
Dec 9 18:41:36 node1 kernel: [<ffffffff80096ff3>] process_timeout+0x0/0x5
Dec 9 18:41:36 node1 kernel: [<ffffffff888234ca>] :ost:ost_brw_write+0x137a/0x23a0
Dec 9 18:41:36 node1 kernel: [<ffffffff8859f998>] :ptlrpc:ptlrpc_send_reply+0x5c8/0x5e0
Dec 9 18:41:36 node1 kernel: [<ffffffff8856aab0>] :ptlrpc:target_committed_to_req+0x40/0x120
Dec 9 18:41:36 node1 kernel: [<ffffffff8881f81d>] :ost:ost_brw_read+0x189d/0x1a70
Dec 9 18:41:36 node1 kernel: [<ffffffff885a3e25>] :ptlrpc:lustre_msg_get_version+0x35/0xf0
Dec 9 18:41:36 node1 kernel: [<ffffffff8008c3ba>] default_wake_function+0x0/0xe
Dec 9 18:41:36 node1 kernel: [<ffffffff885a3ee8>] :ptlrpc:lustre_msg_check_version_v2+0x8/0x20
Dec 9 18:41:36 node1 kernel: [<ffffffff8882709e>] :ost:ost_handle+0x2bae/0x53e0
Dec 9 18:41:36 node1 kernel: [<ffffffff8858233e>] :ptlrpc:ldlm_refresh_waiting_lock+0xbe/0x110
Dec 9 18:41:36 node1 kernel: [<ffffffff88499485>] :lnet:lnet_match_blocked_msg+0x375/0x390
Dec 9 18:41:36 node1 kernel: [<ffffffff8855e1d4>] :ptlrpc:ldlm_lock_get+0x4/0x10
Dec 9 18:41:36 node1 kernel: [<ffffffff885ae45d>] :ptlrpc:ptlrpc_check_req+0x1d/0x110
Dec 9 18:41:36 node1 kernel: [<ffffffff885b10b9>] :ptlrpc:ptlrpc_server_handle_request+0xa79/0x1150
Dec 9 18:41:36 node1 kernel: [<ffffffff8008a7e4>] __wake_up_common+0x3e/0x68
Dec 9 18:41:36 node1 kernel: [<ffffffff885b4ae8>] :ptlrpc:ptlrpc_main+0x1258/0x1420
Dec 9 18:41:36 node1 kernel: [<ffffffff8008c3ba>] default_wake_function+0x0/0xe
Dec 9 18:41:36 node1 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
Dec 9 18:41:36 node1 kernel: [<ffffffff885b3890>] :ptlrpc:ptlrpc_main+0x0/0x1420
Dec 9 18:41:36 node1 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11
Dec 9 18:41:36 node1 kernel:
Dec 9 18:41:36 node1 kernel: LustreError: dumping log to /tmp/lustre-log.1260380496.4421

Should I install the patched Lustre on the clients too? Or is the
problem something else?

I'm sorry for the messy mail.

Thank you,
tamas
On 2009-12-10, at 16:37, Papp Tamás wrote:
> Well, it was not working, and my guess is that it should not work. I
> didn't mention it before, but
> /usr/src/kernels/2.6.18-128.7.1.el5_lustre.1.8.1.1-x86_64/ belongs to
> the official kernel-lustre-devel-2.6.18-128.7.1.el5_lustre.1.8.1.1
> package.
>
> Anyway, I built my own kernel and Lustre b1_8 with the patches from
> bug 19557:
>
> Linux meta1 2.6.18-prep #1 SMP Sun Dec 6 14:40:15 CET 2009 x86_64 x86_64 x86_64 GNU/Linux
>
> I installed it on the MDS and the two OSSs. Unfortunately, no luck:
>
> Dec 9 18:41:36 node1 kernel: Call Trace:
> Dec 9 18:41:36 node1 kernel: [<ffffffff88491b91>] :lnet:LNetMDBind+0x301/0x450
> Dec 9 18:41:36 node1 kernel: [<ffffffff8003dacd>] lock_timer_base+0x1b/0x3c
> Dec 9 18:41:36 node1 kernel: [<ffffffff8001caa7>] __mod_timer+0xb0/0xbe
> Dec 9 18:41:36 node1 kernel: [<ffffffff8006387b>] schedule_timeout+0x8a/0xad
> Dec 9 18:41:36 node1 kernel: [<ffffffff80096ff3>] process_timeout+0x0/0x5
> Dec 9 18:41:36 node1 kernel: [<ffffffff888234ca>] :ost:ost_brw_write+0x137a/0x23a0
> Dec 9 18:41:36 node1 kernel: [<ffffffff8859f998>] :ptlrpc:ptlrpc_send_reply+0x5c8/0x5e0

This is a different issue. The previous stack was busy in
lustre_hash_for_each_empty().

> Should I install the patched Lustre on the clients too? Or is the
> problem something else?

This is strictly a server-side patch. This looks like a similar issue,
and may be fixed by one of the other patches on bug 19557.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.