Hi list-
I'm in the process of configuring my first Lustre cluster for testing.
I've got access to 10 nodes right now that are all made up of identical
hardware. I decided to start small and configure 3 nodes; once those
were up and working, I'd move on to setting up the remaining OSSs the
same way. I've successfully gotten 3 nodes up and running in a very
simple Lustre setup with the MDT/MGS located on one node and the other
two set up as OSSs. The OSTs on the OSSs are local 40GB drives (hdb:
Maxtor 6E040L0, ATA DISK drive). Everything mounts fine, connects to
the MGS, and mounts on client nodes just fine (I've sketched the
commands I used below). Decided to move on to nodes 4 and 5.
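For reference, the first three nodes were formatted and mounted roughly
like this (the fsname, NIDs, and the MDT device are illustrative
placeholders, not my exact values):

  # node 1: combined MGS/MDT (MDT device shown as /dev/hdb1 for illustration)
  mkfs.lustre --fsname=testfs --mgs --mdt /dev/hdb1
  mkdir -p /mnt/mdt
  mount -t lustre /dev/hdb1 /mnt/mdt

  # nodes 2 and 3: OSSs, each with /dev/hdb1 as the single OST
  mkfs.lustre --fsname=testfs --ost --mgsnode=192.168.0.1@tcp0 /dev/hdb1
  mkdir -p /mnt/ost
  mount -t lustre /dev/hdb1 /mnt/ost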
Here's where the kernel panics began. I set these up as OSSs following
the same steps as on nodes 2 and 3, using the same stock Lustre kernel
and RPMs. I had no trouble creating the Lustre filesystem on /dev/hdb1;
however, as soon as I issued mount /dev/hdb1 /mnt/ost the kernel
panicked. The same happened on both nodes. Wondering if it might be a
networking issue, I isolated one of the faulty nodes (node 4) with a
known working OSS node (node 2). I set up a new Lustre configuration
with node 4 acting as a combined MGS/MDT server and node 2 as the OSS.
I was able to create the FS on node 4 and even mount the device at
/mnt/mdt; however, as soon as I tried mounting the OST on node 2, node 4
went into a kernel panic again.
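The isolation test used essentially the same commands with the roles
swapped, roughly like this (again, fsname, NIDs, and the MDT device are
placeholders):

  # node 4: combined MGS/MDT for the test filesystem
  mkfs.lustre --reformat --fsname=testfs2 --mgs --mdt /dev/hdb1
  mount -t lustre /dev/hdb1 /mnt/mdt

  # node 2: OSS - node 4 panics as soon as this mount is issued
  mkfs.lustre --reformat --fsname=testfs2 --ost --mgsnode=192.168.0.4@tcp0 /dev/hdb1
  mount -t lustre /dev/hdb1 /mnt/ost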
Frustrated, I powered off nodes 4 and 5 and moved on to nodes 6 and 7 to
see how they fare. Node 6 works great; node 7 runs into the same
problems. Again, all of these boxes contain identical hardware.
I've started from scratch twice, switching from CentOS 5.2 to RHEL 5 on
all nodes and downgrading from Lustre 1.6.7 to 1.6.6 to see if that made
a difference.
I'm lost as to where to begin resolving this. I've attached the relevant
info; any tips anyone may provide will be greatly appreciated.
Thanks.
Network setup: node 1 contains both the MDT/MGS. All other nodes act as
OSSs, mounting /dev/hdb1 as their only OST.
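Clients mount the filesystem the usual way, e.g. (MGS NID and fsname are
placeholders):

  mount -t lustre 192.168.0.1@tcp0:/testfs /mnt/lustre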
Hardware on all nodes:
Intel(R) Xeon(TM) CPU 1.70GHz
hda: WDC WD102BA, ATA DISK drive
hdb: Maxtor 6E040L0, ATA DISK drive
On all nodes: Linux 2.6.18-92.1.10.el5_lustre.1.6.6smp #1 SMP Tue Aug 26
12:05:09 EDT 2008 i686 i686 i386 GNU/Linux
BUG: soft lockup - CPU#0 stuck for 10s [socknal_cd00:2785]
esi: c048bd14 edi: d303403c ebp: d3034000 esp: d1279df8
ds: 007b es: 007b ss:0068
Process socknal_cd00 (pid:2776, ti=d1278000 task=df195550 task.ti=d1278000)
Stack: c063858a d5b74d44 83010183 d3034000 d5b74d40 d3947180 e0d8fc20 d1255400
       00000000 00000000 00000000 00000005 00000005 0000000e 00000001 00000000
       d1269240 00000000 d1269240 e0d86854 00000000 d1279f00 d1279f00 00001388
Call Trace:
[<e0d8fc20>] ksocknal_read_callback+0x100/0x270 [ksocklnd]
[<e0d86854>] ksocknal_create_conn+0x19a4/0x1f90 [ksocklnd]
[<e0b66b31>] libcfs_sock_write+0xb1/0x3a0 [libcfs]
[<c0483471>] __posix_lock_file_conf+0x431/0x48e
[<e0e2c641>] lnet_connect+0xb1/0x150 [lnet]
[<e0d89ee4>] ksocknal_connect+0x124/0x540 [ksocklnd]
[<e0d8f792>] ksocknal_connd+0x2a2/0x400 [ksocklnd]
[<c044b863>] audit_syscall_exit+0x2cc/0x2e2
[<c043631d>] autoremove_wake_function+0x0/0x2d
[<e0d8f4f0>] ksocknal_connd+0x0/0x400 [ksocklnd]
[<c0405c0f>] kernel_thread_helper+0x7/0x10
=======================
Code: 74 17 50 52 68 4d 85 63 c0 e8 f6 eb f3 ff 0f 0b 1a 00 ff 84 63 c0 82 c4 0c 8b 06 39 d8 74 17 50 53 68 8a 85 63 c0 e8 d9 eb f3 ff <0f> 0b 1f 00 ff 84 63 c0 83 c4 0c 89 7b 04 89 1f 89 77 04 89 3
EIP: [<c04e7d9d>] __list_add+0x39/0x52 SS:ESP 0068:d1279df8
<0>Kernel panic - not syncing: Fatal exception in interrupt