kevin at utahsysadmin.com
2010-Apr-05 20:08 UTC
[Ocfs2-users] Kernel Panic, Server not coming back up
I have a relatively new test environment set up that is a little different from your typical scenario. This is my first time using OCFS2, but I believe it should work the way I have it set up.

All of this is set up on VMware virtual hosts. I have two front-end web servers and one backend administrative server. They all share 2 virtual hard drives within VMware (independent, persistent, & thick provisioned).

Everything works great and the way I want, except occasionally one of the nodes will crash with errors like the following:

end_request: I/O error, dev sdc, sector 585159
Aborting journal on device sdc1
end_request: I/O error, dev sdc, sector 528151
Buffer I/O error on device sdc1, logical block 66011
lost page write due to I/O error on sdc1
(2848,1):ocfs2_start_trans:240 ERROR: status = -30
OCFS2: abort (device sdc1): ocfs2_start_trans: Detected aborted journal
Kernel panic - not syncing: OCFS2: (device sdc1): panic forced after error

<0>Rebooting in 30 seconds..BUG: warning at arch/i386/kernel/smp.c:492/smp_send_reschedule() (Tainted: G )

The server never reboots, it just sits there until I reset it. The cluster ran fine without errors (for a week or two), and now that I upgraded to the latest kernel/ocfs2 it's happening almost daily. The disks are fine; they're on a LUN on a SAN with no problems, and I unmounted all the partitions and ran fsck.ocfs2 -f on both drives from all three nodes (one at a time) and it found no errors.

This morning it happened again, and now after a reset the server will not boot up at all; it just sits there on "Starting Oracle Cluster File System (OCFS2)". These servers are all running OEL 5.4 with the latest patches installed.

Here's the setup:

# cat /etc/ocfs2/cluster.conf
cluster:
        node_count = 3
        name = qacluster

node:
        ip_port = 7777
        ip_address = 10.10.220.30
        number = 0
        name = qa-admin
        cluster = qacluster

node:
        ip_port = 7777
        ip_address = 10.10.220.31
        number = 1
        name = qa-web1
        cluster = qacluster

node:
        ip_port = 7777
        ip_address = 10.10.220.32
        number = 2
        name = qa-web2
        cluster = qacluster

# mounted.ocfs2 -d
Device     FS     UUID                                  Label
/dev/sdb1  ocfs2  85b050a0-a381-49d8-8353-c21b1c8b28c4  data
/dev/sdc1  ocfs2  6a03e81a-8186-41a6-8fd8-dc23854e12d3  logs

# uname -a
Linux qa-admin.domain.com 2.6.18-164.15.1.0.1.el5 #1 SMP Wed Mar 17 00:56:05 EDT 2010 i686 i686 i386 GNU/Linux

# rpm -qa | grep ocfs2
ocfs2-2.6.18-164.11.1.0.1.el5-1.4.4-1.el5
ocfs2-tools-1.4.3-1.el5
ocfs2-2.6.18-164.15.1.0.1.el5-1.4.4-1.el5

This is the latest from one of the alive hosts:

# dmesg | tail -50
(2869,0):ocfs2_lock_allocators:677 ERROR: status = -5
(2869,0):__ocfs2_extend_allocation:739 ERROR: status = -5
(2869,0):ocfs2_extend_no_holes:952 ERROR: status = -5
(2869,0):ocfs2_expand_nonsparse_inode:1678 ERROR: status = -5
(2869,0):ocfs2_write_begin_nolock:1722 ERROR: status = -5
(2869,0):ocfs2_write_begin:1860 ERROR: status = -5
(2869,0):ocfs2_file_buffered_write:2039 ERROR: status = -5
(2869,0):__ocfs2_file_aio_write:2194 ERROR: status = -5
OCFS2: ERROR (device sdc1): ocfs2_check_group_descriptor: Group descriptor # 1128960 has bit count 32256 but claims that 34300 are free
(2881,0):ocfs2_search_chain:1244 ERROR: status = -5
(2881,0):ocfs2_claim_suballoc_bits:1433 ERROR: status = -5
(2881,0):__ocfs2_claim_clusters:1715 ERROR: status = -5
(2881,0):ocfs2_local_alloc_new_window:1013 ERROR: status = -5
(2881,0):ocfs2_local_alloc_slide_window:1116 ERROR: status = -5
(2881,0):ocfs2_reserve_local_alloc_bits:537 ERROR: status = -5
(2881,0):__ocfs2_reserve_clusters:725 ERROR: status = -5
(2881,0):ocfs2_lock_allocators:677 ERROR: status = -5
(2881,0):__ocfs2_extend_allocation:739 ERROR: status = -5
(2881,0):ocfs2_extend_no_holes:952 ERROR: status = -5
(2881,0):ocfs2_expand_nonsparse_inode:1678 ERROR: status = -5
(2881,0):ocfs2_write_begin_nolock:1722 ERROR: status = -5
(2881,0):ocfs2_write_begin:1860 ERROR: status = -5
(2881,0):ocfs2_file_buffered_write:2039 ERROR: status = -5
(2881,0):__ocfs2_file_aio_write:2194 ERROR: status = -5
(2045,0):o2net_connect_expired:1664 ERROR: no connection established with node 2 after 30.0 seconds, giving up and returning errors.
OCFS2: ERROR (device sdc1): ocfs2_check_group_descriptor: Group descriptor # 1128960 has bit count 32256 but claims that 34300 are free
(2872,0):ocfs2_search_chain:1244 ERROR: status = -5
(2872,0):ocfs2_claim_suballoc_bits:1433 ERROR: status = -5
(2872,0):__ocfs2_claim_clusters:1715 ERROR: status = -5
(2872,0):ocfs2_local_alloc_new_window:1013 ERROR: status = -5
(2872,0):ocfs2_local_alloc_slide_window:1116 ERROR: status = -5
(2872,0):ocfs2_reserve_local_alloc_bits:537 ERROR: status = -5
(2872,0):__ocfs2_reserve_clusters:725 ERROR: status = -5
(2872,0):ocfs2_lock_allocators:677 ERROR: status = -5
(2872,0):__ocfs2_extend_allocation:739 ERROR: status = -5
(2872,0):ocfs2_extend_no_holes:952 ERROR: status = -5
(2872,0):ocfs2_expand_nonsparse_inode:1678 ERROR: status = -5
(2872,0):ocfs2_write_begin_nolock:1722 ERROR: status = -5
(2872,0):ocfs2_write_begin:1860 ERROR: status = -5
(2872,0):ocfs2_file_buffered_write:2039 ERROR: status = -5
(2872,0):__ocfs2_file_aio_write:2194 ERROR: status = -5
(2065,0):ocfs2_dlm_eviction_cb:98 device (8,33): dlm has evicted node 2
(12701,1):dlm_get_lock_resource:844 6A03E81A818641A68FD8DC23854E12D3:M00000000000000000000243568d3c5: at least one node (2) to recover before lock mastery can begin
(2045,0):ocfs2_dlm_eviction_cb:98 device (8,33): dlm has evicted node 2
(12701,1):dlm_get_lock_resource:898 6A03E81A818641A68FD8DC23854E12D3:M00000000000000000000243568d3c5: at least one node (2) to recover before lock mastery can begin
o2net: accepted connection from node qa-web2 (num 2) at 147.178.220.32:7777
ocfs2_dlm: Node 2 joins domain 6A03E81A818641A68FD8DC23854E12D3
ocfs2_dlm: Nodes in domain ("6A03E81A818641A68FD8DC23854E12D3"): 0 1 2
(12701,1):dlm_restart_lock_mastery:1216 node 2 up while restarting
(12701,1):dlm_wait_for_lock_mastery:1040 ERROR: status = -11

Any suggestions? Is there any more data I can provide?

Thanks for any help.

Kevin
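For reference, the check described above (unmount the OCFS2 partitions, then run a forced fsck against each shared device, one node at a time) would look roughly like the following. This is a minimal sketch; the device names are the ones reported by mounted.ocfs2 -d above, and it assumes the volumes are unmounted on every node while the check runs.

# umount /dev/sdb1 /dev/sdc1
# fsck.ocfs2 -f /dev/sdb1
# fsck.ocfs2 -f /dev/sdc1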
It is having problems doing I/Os to the virtual devices. -5 is EIO; see the note after the quoted message below.

kevin at utahsysadmin.com wrote:
> I have a relatively new test environment set up that is a little different
> from your typical scenario. This is my first time using OCFS2, but I
> believe it should work the way I have it set up.
>
> All of this is set up on VMware virtual hosts. I have two front-end web
> servers and one backend administrative server. They all share 2 virtual
> hard drives within VMware (independent, persistent, & thick provisioned).
>
> Everything works great and the way I want, except occasionally one of the
> nodes will crash with errors like the following:
>
> end_request: I/O error, dev sdc, sector 585159
> Aborting journal on device sdc1
> end_request: I/O error, dev sdc, sector 528151
> Buffer I/O error on device sdc1, logical block 66011
> lost page write due to I/O error on sdc1
> (2848,1):ocfs2_start_trans:240 ERROR: status = -30
> OCFS2: abort (device sdc1): ocfs2_start_trans: Detected aborted journal
> Kernel panic - not syncing: OCFS2: (device sdc1): panic forced after error
>
> <0>Rebooting in 30 seconds..BUG: warning at
> arch/i386/kernel/smp.c:492/smp_send_reschedule() (Tainted: G )
>
> The server never reboots, it just sits there until I reset it. The cluster
> ran fine without errors (for a week or two), and now that I upgraded to the
> latest kernel/ocfs2 it's happening almost daily. The disks are fine;
> they're on a LUN on a SAN with no problems, and I unmounted all the
> partitions and ran fsck.ocfs2 -f on both drives from all three nodes (one
> at a time) and it found no errors.
>
> This morning it happened again, and now after a reset the server will not
> boot up at all; it just sits there on "Starting Oracle Cluster File System
> (OCFS2)". These servers are all running OEL 5.4 with the latest patches
> installed.
>
> Here's the setup:
>
> # cat /etc/ocfs2/cluster.conf
> cluster:
>         node_count = 3
>         name = qacluster
>
> node:
>         ip_port = 7777
>         ip_address = 10.10.220.30
>         number = 0
>         name = qa-admin
>         cluster = qacluster
>
> node:
>         ip_port = 7777
>         ip_address = 10.10.220.31
>         number = 1
>         name = qa-web1
>         cluster = qacluster
>
> node:
>         ip_port = 7777
>         ip_address = 10.10.220.32
>         number = 2
>         name = qa-web2
>         cluster = qacluster
>
> # mounted.ocfs2 -d
> Device     FS     UUID                                  Label
> /dev/sdb1  ocfs2  85b050a0-a381-49d8-8353-c21b1c8b28c4  data
> /dev/sdc1  ocfs2  6a03e81a-8186-41a6-8fd8-dc23854e12d3  logs
>
> # uname -a
> Linux qa-admin.domain.com 2.6.18-164.15.1.0.1.el5 #1 SMP Wed Mar 17
> 00:56:05 EDT 2010 i686 i686 i386 GNU/Linux
>
> # rpm -qa | grep ocfs2
> ocfs2-2.6.18-164.11.1.0.1.el5-1.4.4-1.el5
> ocfs2-tools-1.4.3-1.el5
> ocfs2-2.6.18-164.15.1.0.1.el5-1.4.4-1.el5
>
> This is the latest from one of the alive hosts:
>
> # dmesg | tail -50
> (2869,0):ocfs2_lock_allocators:677 ERROR: status = -5
> (2869,0):__ocfs2_extend_allocation:739 ERROR: status = -5
> (2869,0):ocfs2_extend_no_holes:952 ERROR: status = -5
> (2869,0):ocfs2_expand_nonsparse_inode:1678 ERROR: status = -5
> (2869,0):ocfs2_write_begin_nolock:1722 ERROR: status = -5
> (2869,0):ocfs2_write_begin:1860 ERROR: status = -5
> (2869,0):ocfs2_file_buffered_write:2039 ERROR: status = -5
> (2869,0):__ocfs2_file_aio_write:2194 ERROR: status = -5
> OCFS2: ERROR (device sdc1): ocfs2_check_group_descriptor: Group descriptor
> # 1128960 has bit count 32256 but claims that 34300 are free
> (2881,0):ocfs2_search_chain:1244 ERROR: status = -5
> (2881,0):ocfs2_claim_suballoc_bits:1433 ERROR: status = -5
> (2881,0):__ocfs2_claim_clusters:1715 ERROR: status = -5
> (2881,0):ocfs2_local_alloc_new_window:1013 ERROR: status = -5
> (2881,0):ocfs2_local_alloc_slide_window:1116 ERROR: status = -5
> (2881,0):ocfs2_reserve_local_alloc_bits:537 ERROR: status = -5
> (2881,0):__ocfs2_reserve_clusters:725 ERROR: status = -5
> (2881,0):ocfs2_lock_allocators:677 ERROR: status = -5
> (2881,0):__ocfs2_extend_allocation:739 ERROR: status = -5
> (2881,0):ocfs2_extend_no_holes:952 ERROR: status = -5
> (2881,0):ocfs2_expand_nonsparse_inode:1678 ERROR: status = -5
> (2881,0):ocfs2_write_begin_nolock:1722 ERROR: status = -5
> (2881,0):ocfs2_write_begin:1860 ERROR: status = -5
> (2881,0):ocfs2_file_buffered_write:2039 ERROR: status = -5
> (2881,0):__ocfs2_file_aio_write:2194 ERROR: status = -5
> (2045,0):o2net_connect_expired:1664 ERROR: no connection established with
> node 2 after 30.0 seconds, giving up and returning errors.
> OCFS2: ERROR (device sdc1): ocfs2_check_group_descriptor: Group descriptor
> # 1128960 has bit count 32256 but claims that 34300 are free
> (2872,0):ocfs2_search_chain:1244 ERROR: status = -5
> (2872,0):ocfs2_claim_suballoc_bits:1433 ERROR: status = -5
> (2872,0):__ocfs2_claim_clusters:1715 ERROR: status = -5
> (2872,0):ocfs2_local_alloc_new_window:1013 ERROR: status = -5
> (2872,0):ocfs2_local_alloc_slide_window:1116 ERROR: status = -5
> (2872,0):ocfs2_reserve_local_alloc_bits:537 ERROR: status = -5
> (2872,0):__ocfs2_reserve_clusters:725 ERROR: status = -5
> (2872,0):ocfs2_lock_allocators:677 ERROR: status = -5
> (2872,0):__ocfs2_extend_allocation:739 ERROR: status = -5
> (2872,0):ocfs2_extend_no_holes:952 ERROR: status = -5
> (2872,0):ocfs2_expand_nonsparse_inode:1678 ERROR: status = -5
> (2872,0):ocfs2_write_begin_nolock:1722 ERROR: status = -5
> (2872,0):ocfs2_write_begin:1860 ERROR: status = -5
> (2872,0):ocfs2_file_buffered_write:2039 ERROR: status = -5
> (2872,0):__ocfs2_file_aio_write:2194 ERROR: status = -5
> (2065,0):ocfs2_dlm_eviction_cb:98 device (8,33): dlm has evicted node 2
> (12701,1):dlm_get_lock_resource:844
> 6A03E81A818641A68FD8DC23854E12D3:M00000000000000000000243568d3c5: at least
> one node (2) to recover before lock mastery can begin
> (2045,0):ocfs2_dlm_eviction_cb:98 device (8,33): dlm has evicted node 2
> (12701,1):dlm_get_lock_resource:898
> 6A03E81A818641A68FD8DC23854E12D3:M00000000000000000000243568d3c5: at least
> one node (2) to recover before lock mastery can begin
> o2net: accepted connection from node qa-web2 (num 2) at 147.178.220.32:7777
> ocfs2_dlm: Node 2 joins domain 6A03E81A818641A68FD8DC23854E12D3
> ocfs2_dlm: Nodes in domain ("6A03E81A818641A68FD8DC23854E12D3"): 0 1 2
> (12701,1):dlm_restart_lock_mastery:1216 node 2 up while restarting
> (12701,1):dlm_wait_for_lock_mastery:1040 ERROR: status = -11
>
> Any suggestions? Is there any more data I can provide?
>
> Thanks for any help.
>
> Kevin
>
>
> _______________________________________________
> Ocfs2-users mailing list
> Ocfs2-users at oss.oracle.com
> http://oss.oracle.com/mailman/listinfo/ocfs2-users
>
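The -5 status filling the logs above is the kernel's EIO (I/O error) return code that the reply refers to. A quick way to confirm the mapping on one of the nodes is to look it up in the standard errno header; a minimal check, assuming the usual kernel-headers location (the path may differ by distribution):

# grep -w EIO /usr/include/asm-generic/errno-base.h
#define EIO              5      /* I/O error */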