kevin at utahsysadmin.com
2010-Apr-05 20:08 UTC
[Ocfs2-users] Kernel Panic, Server not coming back up
I have a relatively new test environment set up that is a little different from your typical scenario. This is my first time using OCFS2, but I believe it should work the way I have it set up.

All of this is set up on VMware virtual hosts. I have two front-end web servers and one backend administrative server. They all share 2 virtual hard drives within VMware (independent, persistent, & thick provisioned).

Everything works great and the way I want, except occasionally one of the nodes will crash with errors like the following:

end_request: I/O error, dev sdc, sector 585159
Aborting journal on device sdc1
end_request: I/O error, dev sdc, sector 528151
Buffer I/O error on device sdc1, logical block 66011
lost page write due to I/O error on sdc1
(2848,1):ocfs2_start_trans:240 ERROR: status = -30
OCFS2: abort (device sdc1): ocfs2_start_trans: Detected aborted journal
Kernel panic - not syncing: OCFS2: (device sdc1): panic forced after error

<0>Rebooting in 30 seconds..BUG: warning at arch/i386/kernel/smp.c:492/smp_send_reschedule() (Tainted: G )

The server never reboots, it just sits there until I reset it. The cluster ran fine without errors (for a week or two), and now that I upgraded to the latest kernel/ocfs2 it's happening almost daily. The disks are fine; they're on a LUN on a SAN with no problems, and I unmounted all the partitions and ran fsck.ocfs2 -f on both drives from all three nodes (one at a time) and it found no errors.

This morning it happened again, and now after a reset the server will not boot up at all; it just sits there on "Starting Oracle Cluster File System (OCFS2)". These servers are all running OEL 5.4 with the latest patches installed.

Here's the setup:

# cat /etc/ocfs2/cluster.conf
cluster:
        node_count = 3
        name = qacluster

node:
        ip_port = 7777
        ip_address = 10.10.220.30
        number = 0
        name = qa-admin
        cluster = qacluster

node:
        ip_port = 7777
        ip_address = 10.10.220.31
        number = 1
        name = qa-web1
        cluster = qacluster

node:
        ip_port = 7777
        ip_address = 10.10.220.32
        number = 2
        name = qa-web2
        cluster = qacluster

# mounted.ocfs2 -d
Device     FS     UUID                                  Label
/dev/sdb1  ocfs2  85b050a0-a381-49d8-8353-c21b1c8b28c4  data
/dev/sdc1  ocfs2  6a03e81a-8186-41a6-8fd8-dc23854e12d3  logs

# uname -a
Linux qa-admin.domain.com 2.6.18-164.15.1.0.1.el5 #1 SMP Wed Mar 17 00:56:05 EDT 2010 i686 i686 i386 GNU/Linux

# rpm -qa | grep ocfs2
ocfs2-2.6.18-164.11.1.0.1.el5-1.4.4-1.el5
ocfs2-tools-1.4.3-1.el5
ocfs2-2.6.18-164.15.1.0.1.el5-1.4.4-1.el5

This is the latest from one of the alive hosts:

# dmesg | tail -50
(2869,0):ocfs2_lock_allocators:677 ERROR: status = -5
(2869,0):__ocfs2_extend_allocation:739 ERROR: status = -5
(2869,0):ocfs2_extend_no_holes:952 ERROR: status = -5
(2869,0):ocfs2_expand_nonsparse_inode:1678 ERROR: status = -5
(2869,0):ocfs2_write_begin_nolock:1722 ERROR: status = -5
(2869,0):ocfs2_write_begin:1860 ERROR: status = -5
(2869,0):ocfs2_file_buffered_write:2039 ERROR: status = -5
(2869,0):__ocfs2_file_aio_write:2194 ERROR: status = -5
OCFS2: ERROR (device sdc1): ocfs2_check_group_descriptor: Group descriptor # 1128960 has bit count 32256 but claims that 34300 are free
(2881,0):ocfs2_search_chain:1244 ERROR: status = -5
(2881,0):ocfs2_claim_suballoc_bits:1433 ERROR: status = -5
(2881,0):__ocfs2_claim_clusters:1715 ERROR: status = -5
(2881,0):ocfs2_local_alloc_new_window:1013 ERROR: status = -5
(2881,0):ocfs2_local_alloc_slide_window:1116 ERROR: status = -5
(2881,0):ocfs2_reserve_local_alloc_bits:537 ERROR: status = -5
(2881,0):__ocfs2_reserve_clusters:725 ERROR: status = -5
(2881,0):ocfs2_lock_allocators:677 ERROR: status = -5
(2881,0):__ocfs2_extend_allocation:739 ERROR: status = -5
(2881,0):ocfs2_extend_no_holes:952 ERROR: status = -5
(2881,0):ocfs2_expand_nonsparse_inode:1678 ERROR: status = -5
(2881,0):ocfs2_write_begin_nolock:1722 ERROR: status = -5
(2881,0):ocfs2_write_begin:1860 ERROR: status = -5
(2881,0):ocfs2_file_buffered_write:2039 ERROR: status = -5
(2881,0):__ocfs2_file_aio_write:2194 ERROR: status = -5
(2045,0):o2net_connect_expired:1664 ERROR: no connection established with node 2 after 30.0 seconds, giving up and returning errors.
OCFS2: ERROR (device sdc1): ocfs2_check_group_descriptor: Group descriptor # 1128960 has bit count 32256 but claims that 34300 are free
(2872,0):ocfs2_search_chain:1244 ERROR: status = -5
(2872,0):ocfs2_claim_suballoc_bits:1433 ERROR: status = -5
(2872,0):__ocfs2_claim_clusters:1715 ERROR: status = -5
(2872,0):ocfs2_local_alloc_new_window:1013 ERROR: status = -5
(2872,0):ocfs2_local_alloc_slide_window:1116 ERROR: status = -5
(2872,0):ocfs2_reserve_local_alloc_bits:537 ERROR: status = -5
(2872,0):__ocfs2_reserve_clusters:725 ERROR: status = -5
(2872,0):ocfs2_lock_allocators:677 ERROR: status = -5
(2872,0):__ocfs2_extend_allocation:739 ERROR: status = -5
(2872,0):ocfs2_extend_no_holes:952 ERROR: status = -5
(2872,0):ocfs2_expand_nonsparse_inode:1678 ERROR: status = -5
(2872,0):ocfs2_write_begin_nolock:1722 ERROR: status = -5
(2872,0):ocfs2_write_begin:1860 ERROR: status = -5
(2872,0):ocfs2_file_buffered_write:2039 ERROR: status = -5
(2872,0):__ocfs2_file_aio_write:2194 ERROR: status = -5
(2065,0):ocfs2_dlm_eviction_cb:98 device (8,33): dlm has evicted node 2
(12701,1):dlm_get_lock_resource:844 6A03E81A818641A68FD8DC23854E12D3:M00000000000000000000243568d3c5: at least one node (2) to recover before lock mastery can begin
(2045,0):ocfs2_dlm_eviction_cb:98 device (8,33): dlm has evicted node 2
(12701,1):dlm_get_lock_resource:898 6A03E81A818641A68FD8DC23854E12D3:M00000000000000000000243568d3c5: at least one node (2) to recover before lock mastery can begin
o2net: accepted connection from node qa-web2 (num 2) at 147.178.220.32:7777
ocfs2_dlm: Node 2 joins domain 6A03E81A818641A68FD8DC23854E12D3
ocfs2_dlm: Nodes in domain ("6A03E81A818641A68FD8DC23854E12D3"): 0 1 2
(12701,1):dlm_restart_lock_mastery:1216 node 2 up while restarting
(12701,1):dlm_wait_for_lock_mastery:1040 ERROR: status = -11

Any suggestions? Is there any more data I can provide?

Thanks for any help.

Kevin
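For reference, the check described above (unmount the OCFS2 partitions, then run a forced fsck against each shared device, one node at a time) would look roughly like the following. This is a minimal sketch; the device names are the ones reported by mounted.ocfs2 -d above, and it assumes the volumes are unmounted on every node while the check runs.

# umount /dev/sdb1 /dev/sdc1
# fsck.ocfs2 -f /dev/sdb1
# fsck.ocfs2 -f /dev/sdc1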
It is having problems doing I/Os to the virtual devices. -5 is EIO; see the note after the quoted message below.

kevin at utahsysadmin.com wrote:
> I have a relatively new test environment set up that is a little different
> from your typical scenario. This is my first time using OCFS2, but I
> believe it should work the way I have it set up.
>
> All of this is set up on VMware virtual hosts. I have two front-end web
> servers and one backend administrative server. They all share 2 virtual
> hard drives within VMware (independent, persistent, & thick provisioned).
>
> Everything works great and the way I want, except occasionally one of the
> nodes will crash with errors like the following:
>
> end_request: I/O error, dev sdc, sector 585159
> Aborting journal on device sdc1
> end_request: I/O error, dev sdc, sector 528151
> Buffer I/O error on device sdc1, logical block 66011
> lost page write due to I/O error on sdc1
> (2848,1):ocfs2_start_trans:240 ERROR: status = -30
> OCFS2: abort (device sdc1): ocfs2_start_trans: Detected aborted journal
> Kernel panic - not syncing: OCFS2: (device sdc1): panic forced after error
>
> <0>Rebooting in 30 seconds..BUG: warning at
> arch/i386/kernel/smp.c:492/smp_send_reschedule() (Tainted: G )
>
> The server never reboots, it just sits there until I reset it. The cluster
> ran fine without errors (for a week or two), and now that I upgraded to the
> latest kernel/ocfs2 it's happening almost daily. The disks are fine;
> they're on a LUN on a SAN with no problems, and I unmounted all the
> partitions and ran fsck.ocfs2 -f on both drives from all three nodes (one
> at a time) and it found no errors.
>
> This morning it happened again, and now after a reset the server will not
> boot up at all; it just sits there on "Starting Oracle Cluster File System
> (OCFS2)". These servers are all running OEL 5.4 with the latest patches
> installed.
>
> Here's the setup:
>
> # cat /etc/ocfs2/cluster.conf
> cluster:
>         node_count = 3
>         name = qacluster
>
> node:
>         ip_port = 7777
>         ip_address = 10.10.220.30
>         number = 0
>         name = qa-admin
>         cluster = qacluster
>
> node:
>         ip_port = 7777
>         ip_address = 10.10.220.31
>         number = 1
>         name = qa-web1
>         cluster = qacluster
>
> node:
>         ip_port = 7777
>         ip_address = 10.10.220.32
>         number = 2
>         name = qa-web2
>         cluster = qacluster
>
> # mounted.ocfs2 -d
> Device     FS     UUID                                  Label
> /dev/sdb1  ocfs2  85b050a0-a381-49d8-8353-c21b1c8b28c4  data
> /dev/sdc1  ocfs2  6a03e81a-8186-41a6-8fd8-dc23854e12d3  logs
>
> # uname -a
> Linux qa-admin.domain.com 2.6.18-164.15.1.0.1.el5 #1 SMP Wed Mar 17
> 00:56:05 EDT 2010 i686 i686 i386 GNU/Linux
>
> # rpm -qa | grep ocfs2
> ocfs2-2.6.18-164.11.1.0.1.el5-1.4.4-1.el5
> ocfs2-tools-1.4.3-1.el5
> ocfs2-2.6.18-164.15.1.0.1.el5-1.4.4-1.el5
>
> This is the latest from one of the alive hosts:
>
> # dmesg | tail -50
> (2869,0):ocfs2_lock_allocators:677 ERROR: status = -5
> (2869,0):__ocfs2_extend_allocation:739 ERROR: status = -5
> (2869,0):ocfs2_extend_no_holes:952 ERROR: status = -5
> (2869,0):ocfs2_expand_nonsparse_inode:1678 ERROR: status = -5
> (2869,0):ocfs2_write_begin_nolock:1722 ERROR: status = -5
> (2869,0):ocfs2_write_begin:1860 ERROR: status = -5
> (2869,0):ocfs2_file_buffered_write:2039 ERROR: status = -5
> (2869,0):__ocfs2_file_aio_write:2194 ERROR: status = -5
> OCFS2: ERROR (device sdc1): ocfs2_check_group_descriptor: Group descriptor
> # 1128960 has bit count 32256 but claims that 34300 are free
> (2881,0):ocfs2_search_chain:1244 ERROR: status = -5
> (2881,0):ocfs2_claim_suballoc_bits:1433 ERROR: status = -5
> (2881,0):__ocfs2_claim_clusters:1715 ERROR: status = -5
> (2881,0):ocfs2_local_alloc_new_window:1013 ERROR: status = -5
> (2881,0):ocfs2_local_alloc_slide_window:1116 ERROR: status = -5
> (2881,0):ocfs2_reserve_local_alloc_bits:537 ERROR: status = -5
> (2881,0):__ocfs2_reserve_clusters:725 ERROR: status = -5
> (2881,0):ocfs2_lock_allocators:677 ERROR: status = -5
> (2881,0):__ocfs2_extend_allocation:739 ERROR: status = -5
> (2881,0):ocfs2_extend_no_holes:952 ERROR: status = -5
> (2881,0):ocfs2_expand_nonsparse_inode:1678 ERROR: status = -5
> (2881,0):ocfs2_write_begin_nolock:1722 ERROR: status = -5
> (2881,0):ocfs2_write_begin:1860 ERROR: status = -5
> (2881,0):ocfs2_file_buffered_write:2039 ERROR: status = -5
> (2881,0):__ocfs2_file_aio_write:2194 ERROR: status = -5
> (2045,0):o2net_connect_expired:1664 ERROR: no connection established with
> node 2 after 30.0 seconds, giving up and returning errors.
> OCFS2: ERROR (device sdc1): ocfs2_check_group_descriptor: Group descriptor
> # 1128960 has bit count 32256 but claims that 34300 are free
> (2872,0):ocfs2_search_chain:1244 ERROR: status = -5
> (2872,0):ocfs2_claim_suballoc_bits:1433 ERROR: status = -5
> (2872,0):__ocfs2_claim_clusters:1715 ERROR: status = -5
> (2872,0):ocfs2_local_alloc_new_window:1013 ERROR: status = -5
> (2872,0):ocfs2_local_alloc_slide_window:1116 ERROR: status = -5
> (2872,0):ocfs2_reserve_local_alloc_bits:537 ERROR: status = -5
> (2872,0):__ocfs2_reserve_clusters:725 ERROR: status = -5
> (2872,0):ocfs2_lock_allocators:677 ERROR: status = -5
> (2872,0):__ocfs2_extend_allocation:739 ERROR: status = -5
> (2872,0):ocfs2_extend_no_holes:952 ERROR: status = -5
> (2872,0):ocfs2_expand_nonsparse_inode:1678 ERROR: status = -5
> (2872,0):ocfs2_write_begin_nolock:1722 ERROR: status = -5
> (2872,0):ocfs2_write_begin:1860 ERROR: status = -5
> (2872,0):ocfs2_file_buffered_write:2039 ERROR: status = -5
> (2872,0):__ocfs2_file_aio_write:2194 ERROR: status = -5
> (2065,0):ocfs2_dlm_eviction_cb:98 device (8,33): dlm has evicted node 2
> (12701,1):dlm_get_lock_resource:844
> 6A03E81A818641A68FD8DC23854E12D3:M00000000000000000000243568d3c5: at least
> one node (2) to recover before lock mastery can begin
> (2045,0):ocfs2_dlm_eviction_cb:98 device (8,33): dlm has evicted node 2
> (12701,1):dlm_get_lock_resource:898
> 6A03E81A818641A68FD8DC23854E12D3:M00000000000000000000243568d3c5: at least
> one node (2) to recover before lock mastery can begin
> o2net: accepted connection from node qa-web2 (num 2) at 147.178.220.32:7777
> ocfs2_dlm: Node 2 joins domain 6A03E81A818641A68FD8DC23854E12D3
> ocfs2_dlm: Nodes in domain ("6A03E81A818641A68FD8DC23854E12D3"): 0 1 2
> (12701,1):dlm_restart_lock_mastery:1216 node 2 up while restarting
> (12701,1):dlm_wait_for_lock_mastery:1040 ERROR: status = -11
>
> Any suggestions? Is there any more data I can provide?
>
> Thanks for any help.
>
> Kevin
>
>
> _______________________________________________
> Ocfs2-users mailing list
> Ocfs2-users at oss.oracle.com
> http://oss.oracle.com/mailman/listinfo/ocfs2-users
>
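The -5 status filling the logs above is the kernel's EIO (I/O error) return code that the reply refers to. A quick way to confirm the mapping on one of the nodes is to look it up in the standard errno header; a minimal check, assuming the usual kernel-headers location (the path may differ by distribution):

# grep -w EIO /usr/include/asm-generic/errno-base.h
#define EIO              5      /* I/O error */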