Hi list-
I'm in the process of configuring my first Lustre cluster for testing.
I've got access to 10 nodes right now that are all made up of identical
hardware. I decided to start small and configure 3 nodes; once those
were up and working, I'd move on to setting up the remaining OSSs the
same way. I've successfully gotten 3 nodes up and running in a very
simple Lustre setup with the MDT/MGS located on one node and the other
two set up as OSSs. The OSTs on the OSSs are local 40GB drives (hdb:
Maxtor 6E040L0, ATA DISK drive). Everything mounts fine, connects to
the MGS, and mounts on client nodes just fine (I've sketched the
commands I used below). Decided to move on to nodes 4 and 5.
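For reference, the first three nodes were formatted and mounted roughly
like this (the fsname, NIDs, and the MDT device are illustrative
placeholders, not my exact values):

  # node 1: combined MGS/MDT (MDT device shown as /dev/hdb1 for illustration)
  mkfs.lustre --fsname=testfs --mgs --mdt /dev/hdb1
  mkdir -p /mnt/mdt
  mount -t lustre /dev/hdb1 /mnt/mdt

  # nodes 2 and 3: OSSs, each with /dev/hdb1 as the single OST
  mkfs.lustre --fsname=testfs --ost --mgsnode=192.168.0.1@tcp0 /dev/hdb1
  mkdir -p /mnt/ost
  mount -t lustre /dev/hdb1 /mnt/ost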
Here's where the kernel panics began. I set these up as OSSs following
the same steps as on nodes 2 and 3, using the same stock Lustre kernel
and RPMs. I had no trouble creating the Lustre filesystem on /dev/hdb1;
however, as soon as I issued mount /dev/hdb1 /mnt/ost the kernel
panicked. The same happened on both nodes. Wondering if it might be a
networking issue, I isolated one of the faulty nodes (node 4) with a
known working OSS node (node 2). I set up a new Lustre configuration
with node 4 acting as a combined MGS/MDT server and node 2 as the OSS.
I was able to create the FS on node 4 and even mount the device at
/mnt/mdt; however, as soon as I tried mounting the OST on node 2, node 4
went into a kernel panic again.
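The isolation test used essentially the same commands with the roles
swapped, roughly like this (again, fsname, NIDs, and the MDT device are
placeholders):

  # node 4: combined MGS/MDT for the test filesystem
  mkfs.lustre --reformat --fsname=testfs2 --mgs --mdt /dev/hdb1
  mount -t lustre /dev/hdb1 /mnt/mdt

  # node 2: OSS - node 4 panics as soon as this mount is issued
  mkfs.lustre --reformat --fsname=testfs2 --ost --mgsnode=192.168.0.4@tcp0 /dev/hdb1
  mount -t lustre /dev/hdb1 /mnt/ost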
Frustrated, I powered off nodes 4 and 5 and moved on to nodes 6 and 7 to
see how they fare. Node 6 works great; node 7 runs into the same
problems. Again, all of these boxes contain identical hardware.
I've started from scratch twice, switching from CentOS 5.2 to RHEL 5 on
all nodes and downgrading from Lustre 1.6.7 to 1.6.6 to see if that made
a difference.
I'm lost as to where to begin resolving this. I've attached the relevant
info; any tips anyone may provide will be greatly appreciated.
Thanks.
Network setup: node 1 contains both the MDT/MGS. All other nodes act as
OSSs, mounting /dev/hdb1 as their only OST.
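Clients mount the filesystem the usual way, e.g. (MGS NID and fsname are
placeholders):

  mount -t lustre 192.168.0.1@tcp0:/testfs /mnt/lustre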
Hardware on all nodes:
Intel(R) Xeon(TM) CPU 1.70GHz
hda: WDC WD102BA, ATA DISK drive
hdb: Maxtor 6E040L0, ATA DISK drive
On all nodes: Linux 2.6.18-92.1.10.el5_lustre.1.6.6smp #1 SMP Tue Aug 26
12:05:09 EDT 2008 i686 i686 i386 GNU/Linux
BUG: soft lockup - CPU#0 stuck for 10s [socknal_cd00:2785]
esi: c048bd14 edi: d303403c ebp: d3034000 esp: d1279df8
ds: 007b es: 007b ss:0068
Process socknal_cd00 (pid:2776, ti=d1278000 task=df195550 task.ti=d1278000)
Stack: c063858a d5b74d44 83010183 d3034000 d5b74d40 d3947180 e0d8fc20 d1255400
       00000000 00000000 00000000 00000005 00000005 0000000e 00000001 00000000
       d1269240 00000000 d1269240 e0d86854 00000000 d1279f00 d1279f00 00001388
Call Trace:
[<e0d8fc20>] ksocknal_read_callback+0x100/0x270 [ksocklnd]
[<e0d86854>] ksocknal_create_conn+0x19a4/0x1f90 [ksocklnd]
[<e0b66b31>] libcfs_sock_write+0xb1/0x3a0 [libcfs]
[<c0483471>] __posix_lock_file_conf+0x431/0x48e
[<e0e2c641>] lnet_connect+0xb1/0x150 [lnet]
[<e0d89ee4>] ksocknal_connect+0x124/0x540 [ksocklnd]
[<e0d8f792>] ksocknal_connd+0x2a2/0x400 [ksocklnd]
[<c044b863>] audit_syscall_exit+0x2cc/0x2e2
[<c043631d>] autoremove_wake_function+0x0/0x2d
[<e0d8f4f0>] ksocknal_connd+0x0/0x400 [ksocklnd]
[<c0405c0f>] kernel_thread_helper+0x7/0x10
=======================
Code: 74 17 50 52 68 4d 85 63 c0 e8 f6 eb f3 ff 0f 0b 1a 00 ff 84 63 c0 82 c4 0c 8b 06 39 d8 74 17 50 53 68 8a 85 63 c0 e8 d9 eb f3 ff <0f> 0b 1f 00 ff 84 63 c0 83 c4 0c 89 7b 04 89 1f 89 77 04 89 3
EIP: [<c04e7d9d>] __list_add+0x39/0x52 SS:ESP 0068:d1279df8
<0>Kernel panic - not syncing: Fatal exception in interrupt