Got a weird issue here that I hope someone can help me with. I have six boxes connected to an iSCSI SAN, five via 100Mb and one via 1Gb, all on the same switch. The five on 100Mb are my MX servers; the one on gigabit is my load balancer. The five MX boxes are PXE-booted and mount their root filesystem read-only from the iSCSI SAN, which is an OCFS2 volume. The load balancer mounts the same volume read-write and runs periodic virus/spam updates on the volume throughout the day. (I've put a rough sketch of how each class of node mounts the volume at the bottom of this message.)

My problem came this morning at around 4 AM, which is when the system runs nightly maintenance on itself and on the iSCSI volume. It looks as though the system froze shortly after the maintenance ran, but I can't tell why or how. To make matters more confusing, all of the MX servers lost access to the iSCSI volume at the same time; it was as if everything got fenced at once and nothing had access. Another odd thing is the logs on the MX servers: they all seem to be different, with different events happening on each node. I hope somebody out there can help shed some light on this for me.

Load Balancer:

Jan 27 04:06:12 mx kernel: (26880,2):dlm_create_lock_handler:441 ERROR: dlm status = DLM_IVLOCKID
Jan 27 04:06:12 mx last message repeated 2 times

MX Servers:

** mx1 **

Jan 27 04:06:11 mx1 kernel: (9582,1):dlm_do_assert_master:1505 ERROR: during assert master of D000000000000000005320be20df6c3 to 2, got -22.
Jan 27 04:06:11 mx1 kernel: (9582,1):dlm_dump_lock_resources:297 struct dlm_ctxt: B125E362AD5B43109B40EA0D3DBB7D33, node=1, key=73257850
Jan 27 04:06:11 mx1 kernel: (9582,1):dlm_print_one_lock_resource:223 lockres: M000000000000000009e5212792ca0a, owner=1, state=0
Jan 27 04:06:11 mx1 kernel: (9582,1):__dlm_print_one_lock_resource:238 lockres: M000000000000000009e5212792ca0a, owner=1, state=0
Jan 27 04:06:11 mx1 kernel: (9582,1):__dlm_print_one_lock_resource:240 last used: 0, on purge list: no
Jan 27 04:06:11 mx1 kernel: (9582,1):__dlm_print_one_lock_resource:241 granted queue:
Jan 27 04:06:11 mx1 kernel: (9582,1):__dlm_print_one_lock_resource:251 type=3, conv=-1, node=0, cookie=3822148708753997824, ast=(empty=y,pend=n), bast=(empty=y,pend=n)
Jan 27 04:06:11 mx1 kernel: (9582,1):__dlm_print_one_lock_resource:254 converting queue:
Jan 27 04:06:11 mx1 kernel: (9582,1):__dlm_print_one_lock_resource:267 blocked queue:
Jan 27 04:06:11 mx1 kernel: (9582,1):dlm_print_one_lock_resource:223 lockres: M000000000000000009e5252792ca0e, owner=0, state=0
Jan 27 04:06:11 mx1 kernel: (9582,1):__dlm_print_one_lock_resource:238 lockres: M000000000000000009e5252792ca0e, owner=0, state=0
Jan 27 04:06:11 mx1 kernel: (9582,1):__dlm_print_one_lock_resource:240 last used: 0, on purge list: no
Jan 27 04:06:11 mx1 kernel: (9582,1):__dlm_print_one_lock_resource:241 granted queue:
Jan 27 04:06:11 mx1 kernel: (9582,1):__dlm_print_one_lock_resource:251 type=3, conv=-1, node=1, cookie=726205439913492481, ast=(empty=y,pend=n), bast=(empty=y,pend=n)
Jan 27 04:06:11 mx1 kernel: (9582,1):__dlm_print_one_lock_resource:254 converting queue:
Jan 27 04:06:11 mx1 kernel: (9582,1):__dlm_print_one_lock_resource:267 blocked queue:
Jan 27 04:06:11 mx1 kernel: (9582,1):dlm_print_one_lock_resource:223 lockres: M00000000000000000a724f2792f1c0, owner=1, state=0
Jan 27 04:06:11 mx1 kernel: (9582,1):__dlm_print_one_lock_resource:238 lockres: M00000000000000000a724f2792f1c0, owner=1, state=0
Jan 27 04:06:11 mx1 kernel: (9582,1):__dlm_print_one_lock_resource:240 last used: 0, on purge list: no
Jan 27 04:06:11 mx1 kernel: (9582,1):__dlm_print_one_lock_resource:241 granted queue:
Jan 27 04:06:11 mx1 kernel: (9582,1):__dlm_print_one_lock_resource:251 type=3, conv=-1, node=1, cookie=1446781380292771841, ast=(empty=y,pend=n), bast=(empty=y,pend=n)
Jan 27 04:06:11 mx1 kernel: (9582,1):__dlm_print_one_lock_resource:251 type=3, conv=-1, node=5, cookie=1590896568368627717, ast=(empty=y,pend=n), bast=(empty=y,pend=n)
Jan 27 04:06:11 mx1 kernel: (9582,1):__dlm_print_one_lock_resource:251 type=3, conv=-1, node=2, cookie=1807069350482411522, ast=(empty=y,pend=n), bast=(empty=y,pend=n)
Jan 27 04:06:11 mx1 kernel: (9582,1):__dlm_print_one_lock_resource:251 type=3, conv=-1, node=3, cookie=7139331309289078787, ast=(empty=y,pend=n), bast=(empty=y,pend=n)
Jan 27 04:06:11 mx1 kernel: (9582,1):__dlm_print_one_lock_resource:251 type=3, conv=-1, node=4, cookie=2384093052739256324, ast=(empty=y,pend=n), bast=(empty=y,pend=n)
Jan 27 04:06:11 mx1 kernel: (9582,1):__dlm_print_one_lock_resource:254 converting queue:
Jan 27 04:06:11 mx1 kernel: (9582,1):__dlm_print_one_lock_resource:267 blocked queue:
Jan 27 04:06:11 mx1 kernel: (9582,1):dlm_print_one_lock_resource:223 lockres: D000000000000000009e52f2792ca18, owner=0, state=0

*** REPEATS TILL ***

Jan 27 04:06:26 mx1 kernel: ------------[ cut here ]------------
Jan 27 04:06:26 mx1 kernel: kernel BUG at /usr/src/redhat/BUILD/ocfs2-1.1.8/fs/ocfs2/dlm/dlmmaster.c:1507!
Jan 27 04:06:26 mx1 kernel: invalid operand: 0000 [#3]
Jan 27 04:06:26 mx1 kernel: SMP
Jan 27 04:06:26 mx1 kernel: Modules linked in: ipv6 autofs4 microcode uhci_hcd ehci_hcd video button battery ac ocfs2 debugfs ocfs2_dlmfs ocfs2_dlm ocfs2_nodemanager configfs crc32c libcrc32c md5 iscsi_sfnet scsi_transport_iscsi 8021q ext3 jbd aic79xx aic7xxx ata_piix libata sd_mod scsi_mod tg3 e1000
Jan 27 04:06:26 mx1 kernel: CPU: 1
Jan 27 04:06:26 mx1 kernel: EIP: 0060:[<f89abef0>] Not tainted VLI
Jan 27 04:06:26 mx1 kernel: EFLAGS: 00010296 (2.6.10-1.771_FC2tek-mx)
Jan 27 04:06:26 mx1 kernel: EIP is at dlm_do_assert_master+0x2d1/0x324 [ocfs2_dlm]
Jan 27 04:06:26 mx1 kernel: eax: f65ed030 ebx: 00000000 ecx: d788cbe0 edx: 00000400
Jan 27 04:06:26 mx1 kernel: esi: e9012f7f edi: d788ccb7 ebp: 0000001f esp: d788cc4c
Jan 27 04:06:26 mx1 kernel: ds: 007b es: 007b ss: 0068
Jan 27 04:06:26 mx1 kernel: Process md5sum (pid: 9600, threadinfo=d788c000 task=cdd82020)
Jan 27 04:06:26 mx1 kernel: Stack: 00000002 00000000 e9012f60 f65ed000 ffffffea 0000003d 00000000 00000000
Jan 27 04:06:26 mx1 kernel: 00000000 00000000 00000000 00000000 00000000 00000002 00058893 43d9f094
Jan 27 04:06:26 mx1 kernel: 00058898 00001f01 00000000 3030304d 30303030 30303030 30303030 33353030
Jan 27 04:06:26 mx1 kernel: Call Trace:
Jan 27 04:06:26 mx1 kernel: [<f89aaaae>] dlm_wait_for_lock_mastery+0x465/0x514 [ocfs2_dlm]
Jan 27 04:06:26 mx1 kernel: [<f89a6534>] __dlm_insert_lockres+0x78/0x8b [ocfs2_dlm]
Jan 27 04:06:26 mx1 kernel: [<f89aa45e>] dlm_get_lock_resource+0x513/0x6fe [ocfs2_dlm]
Jan 27 04:06:26 mx1 kernel: [<f89b26af>] dlmlock+0x7b1/0xaaa [ocfs2_dlm]
Jan 27 04:06:26 mx1 kernel: [<f89d544f>] ocfs2_lock_create+0xfd/0x2d7 [ocfs2]
Jan 27 04:06:26 mx1 kernel: [<f89d43ab>] ocfs2_inode_ast_func+0x0/0x53c [ocfs2]
Jan 27 04:06:26 mx1 kernel: [<f89d4b16>] ocfs2_inode_bast_func+0x0/0x130 [ocfs2]
Jan 27 04:06:26 mx1 kernel: [<f89d5901>] ocfs2_cluster_lock+0x1fa/0x843 [ocfs2]
Jan 27 04:06:26 mx1 kernel: [<f89d569b>] ocfs2_status_completion_cb+0x0/0xa [ocfs2]
Jan 27 04:06:26 mx1 kernel: [<f89d75af>] ocfs2_meta_lock_full+0x1e8/0x563 [ocfs2]
Jan 27 04:06:26 mx1 kernel: [<c01452d3>] handle_mm_fault+0xbe/0x176
Jan 27 04:06:26 mx1 kernel: [<c019d5a2>] selinux_inode_getattr+0x4a/0x52
Jan 27 04:06:26 mx1 kernel: [<c0116f44>] do_page_fault+0x191/0x57a
Jan 27 04:06:26 mx1 kernel: [<f89e3da7>] ocfs2_inode_revalidate+0x11a/0x1e9 [ocfs2]
Jan 27 04:06:26 mx1 kernel: [<f89e0099>] ocfs2_getattr+0x0/0x141 [ocfs2]
Jan 27 04:06:26 mx1 kernel: [<f89e00fc>] ocfs2_getattr+0x63/0x141 [ocfs2]
Jan 27 04:06:26 mx1 kernel: [<f89e0099>] ocfs2_getattr+0x0/0x141 [ocfs2]
Jan 27 04:06:26 mx1 kernel: [<c0158a6e>] vfs_getattr+0x35/0x88
Jan 27 04:06:26 mx1 kernel: [<c0158b54>] vfs_fstat+0x22/0x31
Jan 27 04:06:26 mx1 kernel: [<c0159094>] sys_fstat64+0xf/0x23
Jan 27 04:06:26 mx1 kernel: [<c0179e1f>] dnotify_parent+0x1b/0x6c
Jan 27 04:06:26 mx1 kernel: [<c0116db3>] do_page_fault+0x0/0x57a
Jan 27 04:06:26 mx1 kernel: [<c0103c57>] syscall_call+0x7/0xb
Jan 27 04:06:26 mx1 kernel: Code: 10 55 68 e1 05 00 00 68 bb cc 9b f8 ff 70 10 ff b2 94 00 00 00 68 66 cd 9b f8 e8 aa 21 77 c7 83 c4 24 8b 44 24 0c e8 8d 9c ff ff <0f> 0b e3 05 46 c0 9b f8 8b 4c 24 34 ba ff 00 00 00 8d 44 24 14
Jan 27 04:07:05 mx1 kernel: <3>(782,0):dlm_create_lock_handler:441 ERROR: dlm status = DLM_IVLOCKID
Jan 27 04:24:44 mx1 kernel: (0,0):o2net_idle_timer:1310 connection to node mx.highspd.net (num 0) at 10.20.0.1:7777 has been idle for 10 seconds, shutting it down.
Jan 27 04:24:44 mx1 kernel: (0,0):o2net_idle_timer:1321 here are some times that might help debug the situation: (tmr 1138357474.529323 now 1138357484.527980 dr 1138357474.529310 adv 1138357474.529336:1138357474.529339 func (045dd37a:503) 1138357014.669140:1138357014.669145)
Jan 27 04:24:44 mx1 kernel: (782,0):o2net_set_nn_state:411 no longer connected to node mx.highspd.net (num 0) at 10.20.0.1:7777
Jan 27 04:24:49 mx1 kernel: (782,0):ocfs2_dlm_eviction_cb:118 device (8,17): dlm has evicted node 0
Jan 27 04:24:49 mx1 kernel: (782,0):dlm_begin_reco_handler:2057 ERROR: recovery master 3 sees 0 as dead, but this node has not yet. marking 0 as dead
Jan 27 04:24:54 mx1 kernel: (10814,1):ocfs2_replay_journal:1167 Recovering node 0 from slot 0 on device (8,17)

** mx2 **

Same as mx1, but no kernel oops.

** mx3 **

Jan 27 04:24:42 mx3 kernel: (0,0):o2net_idle_timer:1310 connection to node mx.highspd.net (num 0) at 10.20.0.1:7777 has been idle for 10 seconds, shutting it down.
Jan 27 04:24:42 mx3 kernel: (0,0):o2net_idle_timer:1321 here are some times that might help debug the situation: (tmr 1138357472.621478 now 1138357482.620252 dr 1138357472.621465 adv 1138357472.621490:1138357472.621493 func (045dd37a:502) 1138356977.791621:1138356977.791625)
Jan 27 04:24:42 mx3 kernel: (778,0):o2net_set_nn_state:411 no longer connected to node mx.highspd.net (num 0) at 10.20.0.1:7777
Jan 27 04:24:51 mx3 kernel: (32525,0):dlm_restart_lock_mastery:1064 ERROR: node down! 0

** mx4 **

Same as mx3.

** mx5 **

Same as mx1, including the kernel oops.

--
Thanks,
Ed McLain
Sr. Colocation Engineer
TekLinks / High Speed Solutions
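
P.S. Here is the rough sketch of the mounts I mentioned above. The device path and mount points are illustrative rather than copied from the boxes (the MX nodes actually mount the volume during their PXE/initrd boot), so treat this only as an approximation of the setup:

    # Load balancer: the shared OCFS2 volume, mounted read-write
    mount -t ocfs2 -o rw /dev/sdb1 /data

    # mx1-mx5: the same OCFS2 volume, mounted read-only as the root
    # filesystem (done from the initrd after PXE boot)
    mount -t ocfs2 -o ro /dev/sdb1 /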