Holger Brueckner
2006-Sep-14 04:47 UTC
[Ocfs2-users] self fencing and system panic problem after forced reboot
hello,

i'm running ocfs2 to provide a shared disk throughout a xen cluster. this setup was working fine until today, when there was a power outage and all xen nodes were forcefully shut down. whenever i try to mount/access the ocfs2 partition, the system panics and reboots:

darks:~# fsck.ocfs2 -y -f /dev/sda4
(617,0):__dlm_print_nodes:377 Nodes in my domain ("5BA3969FC2714FFEAD66033486242B58"):
(617,0):__dlm_print_nodes:381  node 0
Checking OCFS2 filesystem in /dev/sda4:
  label:              <NONE>
  uuid:               5b a3 96 9f c2 71 4f fe ad 66 03 34 86 24 2b 58
  number of blocks:   35983584
  bytes per block:    4096
  number of clusters: 4497948
  bytes per cluster:  32768
  max slots:          4

/dev/sda4 was run with -f, check forced.
Pass 0a: Checking cluster allocation chains
Pass 0b: Checking inode allocation chains
Pass 0c: Checking extent block allocation chains
Pass 1: Checking inodes and blocks.
[CLUSTER_ALLOC_BIT] Cluster 295771 is marked in the global cluster bitmap but it isn't in use. Clear its bit in the bitmap? y
[CLUSTER_ALLOC_BIT] Cluster 2456870 is marked in the global cluster bitmap but it isn't in use. Clear its bit in the bitmap? y
[CLUSTER_ALLOC_BIT] Cluster 2683096 is marked in the global cluster bitmap but it isn't in use. Clear its bit in the bitmap? y
Pass 2: Checking directory entries.
Pass 3: Checking directory connectivity.
Pass 4a: checking for orphaned inodes
Pass 4b: Checking inodes link counts.
All passes succeeded.

darks:~# mount /data
(622,0):ocfs2_initialize_super:1326 max_slots for this device: 4
(622,0):ocfs2_fill_local_node_info:1019 I am node 0
(622,0):__dlm_print_nodes:377 Nodes in my domain ("5BA3969FC2714FFEAD66033486242B58"):
(622,0):__dlm_print_nodes:381  node 0
(622,0):ocfs2_find_slot:261 slot 2 is already allocated to this node!
(622,0):ocfs2_find_slot:267 taking node slot 2
(622,0):ocfs2_check_volume:1586 File system was not unmounted cleanly, recovering volume.
kjournald starting.  Commit interval 5 seconds
ocfs2: Mounting device (8,4) on (node 0, slot 2) with ordered data mode.
(630,0):ocfs2_replay_journal:1181 Recovering node 2 from slot 0 on device (8,4)
darks:~# (4,0):o2hb_write_timeout:164 ERROR: Heartbeat write timeout to device sda4 after 12000 milliseconds
(4,0):o2hb_stop_all_regions:1789 ERROR: stopping heartbeat on all active regions.
Kernel panic - not syncing: ocfs2 is very sorry to be fencing this system by panicing

ocfs2-tools 1.2.1-1
kernel 2.6.16-xen (with the corresponding ocfs2 compiled into the kernel)

i already tried the elevator=deadline scheduler option, with no effect. any further help debugging this issue is greatly appreciated. are there any other possibilities to get access to the data from outside the cluster (obviously while the partition isn't mounted)?

thanks for your help

holger brueckner
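ps: to make the question more concrete, the kind of offline, read-only access i have in mind would be something like the following, via debugfs.ocfs2 from ocfs2-tools (i'm assuming here that the rdump command is already available in 1.2.1 and that debugfs.ocfs2 reads the device directly without joining the cluster; untested on my side, and the paths are just placeholders):

  # list the filesystem root without mounting the volume
  debugfs.ocfs2 -R "ls -l /" /dev/sda4

  # recursively copy a directory out to a local filesystem
  debugfs.ocfs2 -R "rdump /some/dir /local/recovery" /dev/sda4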
Holger Brueckner
2006-Sep-14 05:16 UTC
[Ocfs2-users] self fencing and system panic problem after forced reboot
side note: setting HEARTBEAT_THRESHOLD to 30 did not help either. could it be that the synchronization between the daemons does not work? (e.g. the daemons think the fs is mounted on some nodes and try to synchronize, but the fs actually isn't mounted on any node?) i'm rather clueless now. finding a way to access the data and copy it to the non-shared partitions would help me a lot.

thx

holger brueckner
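ps: for reference, i set the threshold via the o2cb init config, roughly like this (assuming the stock init script; the file is /etc/default/o2cb on debian, /etc/sysconfig/o2cb on red-hat-style systems):

  # raise the threshold, then restart the cluster stack before any mount
  O2CB_HEARTBEAT_THRESHOLD=30
  /etc/init.d/o2cb restart

  # check the value the kernel is actually using (proc path as on my kernel; may differ)
  cat /proc/fs/ocfs2_nodemanager/hb_dead_threshold

as i understand it, this threshold only controls how long before a node is declared dead (roughly (threshold - 1) * 2 seconds); i'm not certain it changes the 12000 ms heartbeat write timeout from the log above on this kernel, which might be why raising it didn't help.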