SRuff@fiberlink.com
2006-Sep-21 14:56 UTC
[Ocfs2-users] ocfs2 fencing on reboot of 2nd node
I'm performing some testing with OCFS2 on 2 nodes running Red Hat AS4 Update 4 (x86_64) with the multipath support included in the 2.6 kernel, and am running into some issues when cleanly rebooting the 2nd node while the 1st node is still up.

So if I do the following on the 2nd node, the 1st node does not fence itself:

  /etc/init.d/ocfs2 stop
  /etc/init.d/o2cb stop
  wait more than 60 seconds
  init 6

I get the following on the 1st node, but everything is fine:

Sep 21 21:44:49 bbflgrid11 kernel: SCSI error : <0 0 0 12> return code = 0x20000
Sep 21 21:44:49 bbflgrid11 kernel: end_request: I/O error, dev sdm, sector 192785
Sep 21 21:44:49 bbflgrid11 kernel: device-mapper: dm-multipath: Failing path 8:192.
Sep 21 21:44:49 bbflgrid11 kernel: SCSI error : <0 0 0 14> return code = 0x20000
Sep 21 21:44:49 bbflgrid11 kernel: end_request: I/O error, dev sdo, sector 193297
Sep 21 21:44:49 bbflgrid11 kernel: device-mapper: dm-multipath: Failing path 8:224.
Sep 21 21:44:49 bbflgrid11 kernel: SCSI error : <0 0 0 13> return code = 0x20000
Sep 21 21:44:49 bbflgrid11 kernel: end_request: I/O error, dev sdn, sector 192785
Sep 21 21:44:49 bbflgrid11 kernel: device-mapper: dm-multipath: Failing path 8:208.
Sep 21 21:44:49 bbflgrid11 multipathd: 8:192: mark as failed
Sep 21 21:44:49 bbflgrid11 multipathd: mpath1: remaining active paths: 1
Sep 21 21:44:49 bbflgrid11 multipathd: 8:224: mark as failed
Sep 21 21:44:49 bbflgrid11 multipathd: mpath3: remaining active paths: 1
Sep 21 21:44:49 bbflgrid11 multipathd: 8:208: mark as failed
Sep 21 21:44:49 bbflgrid11 multipathd: mpath2: remaining active paths: 1
Sep 21 21:44:58 bbflgrid11 multipathd: 8:192: readsector0 checker reports path is up
Sep 21 21:44:58 bbflgrid11 multipathd: 8:192: reinstated
Sep 21 21:44:58 bbflgrid11 multipathd: mpath1: remaining active paths: 2
Sep 21 21:44:58 bbflgrid11 multipathd: 8:208: readsector0 checker reports path is up
Sep 21 21:44:58 bbflgrid11 multipathd: 8:208: reinstated
Sep 21 21:44:58 bbflgrid11 multipathd: mpath2: remaining active paths: 2
Sep 21 21:44:58 bbflgrid11 multipathd: 8:224: readsector0 checker reports path is up
Sep 21 21:44:58 bbflgrid11 multipathd: 8:224: reinstated
Sep 21 21:44:58 bbflgrid11 multipathd: mpath3: remaining active paths: 2
Sep 21 21:46:06 bbflgrid11 kernel: SCSI error : <1 0 0 11> return code = 0x20000
Sep 21 21:46:06 bbflgrid11 kernel: end_request: I/O error, dev sdaa, sector 1920
Sep 21 21:46:06 bbflgrid11 kernel: device-mapper: dm-multipath: Failing path 65:160.
Sep 21 21:46:06 bbflgrid11 multipathd: 65:160: mark as failed
Sep 21 21:46:06 bbflgrid11 multipathd: mpath0: remaining active paths: 1
Sep 21 21:46:06 bbflgrid11 multipathd: 65:160: readsector0 checker reports path is up
Sep 21 21:46:06 bbflgrid11 multipathd: 65:160: reinstated
Sep 21 21:46:06 bbflgrid11 multipathd: mpath0: remaining active paths: 2

Now if I do the following on the 2nd node, the 1st node fences itself (same as above, except don't wait 60 seconds after the o2cb stop):

  /etc/init.d/ocfs2 stop
  /etc/init.d/o2cb stop
  init 6

Node 1 logs the following and fences itself. I have to power cycle the server to get it back; it doesn't reboot or shut down, it just hangs:

Sep 21 21:28:00 bbflgrid11 kernel: SCSI error : <0 0 0 13> return code = 0x20000
Sep 21 21:28:00 bbflgrid11 kernel: end_request: I/O error, dev sdn, sector 192785
Sep 21 21:28:00 bbflgrid11 kernel: device-mapper: dm-multipath: Failing path 8:208.
Sep 21 21:28:00 bbflgrid11 multipathd: 8:208: mark as failed
Sep 21 21:28:00 bbflgrid11 multipathd: mpath2: remaining active paths: 1
Sep 21 21:28:00 bbflgrid11 kernel: SCSI error : <1 0 0 12> return code = 0x20000
Sep 21 21:28:00 bbflgrid11 kernel: end_request: I/O error, dev sdab, sector 192784
Sep 21 21:28:00 bbflgrid11 kernel: end_request: I/O error, dev sdab, sector 192786
Sep 21 21:28:00 bbflgrid11 kernel: device-mapper: dm-multipath: Failing path 65:176.
Sep 21 21:28:00 bbflgrid11 kernel: SCSI error : <1 0 0 13> return code = 0x20000
Sep 21 21:28:00 bbflgrid11 kernel: end_request: I/O error, dev sdac, sector 192785
Sep 21 21:28:00 bbflgrid11 kernel: device-mapper: dm-multipath: Failing path 65:192.
Sep 21 21:28:00 bbflgrid11 multipathd: 65:176: mark as failed
Sep 21 21:28:00 bbflgrid11 multipathd: mpath1: remaining active paths: 1
Sep 21 21:28:01 bbflgrid11 multipathd: 65:192: mark as failed
Sep 21 21:28:01 bbflgrid11 multipathd: mpath2: remaining active paths: 0
Sep 21 21:28:01 bbflgrid11 kernel: (4912,1):o2hb_bio_end_io:331 ERROR: IO Error -5
Sep 21 21:28:01 bbflgrid11 kernel: (4912,1):o2hb_do_disk_heartbeat:973 ERROR: status = -5
Sep 21 21:28:01 bbflgrid11 kernel: (4912,1):o2hb_bio_end_io:331 ERROR: IO Error -5
Sep 21 21:28:01 bbflgrid11 kernel: (4912,1):o2hb_do_disk_heartbeat:973 ERROR: status = -5
Sep 21 21:28:01 bbflgrid11 multipathd: 65:176: readsector0 checker reports path is up
Sep 21 21:28:01 bbflgrid11 multipathd: 65:176: reinstated
Sep 21 21:28:01 bbflgrid11 multipathd: mpath1: remaining active paths: 2
Sep 21 21:28:03 bbflgrid11 kernel: (4912,1):o2hb_bio_end_io:331 ERROR: IO Error -5
Sep 21 21:28:03 bbflgrid11 kernel: (4912,1):o2hb_do_disk_heartbeat:973 ERROR: status = -5
Sep 21 21:28:03 bbflgrid11 kernel: (4912,1):o2hb_bio_end_io:331 ERROR: IO Error -5
Sep 21 21:28:03 bbflgrid11 kernel: (4912,1):o2hb_do_disk_heartbeat:973 ERROR: status = -5
Sep 21 21:28:05 bbflgrid11 kernel: (4912,1):o2hb_bio_end_io:331 ERROR: IO Error -5
Sep 21 21:28:05 bbflgrid11 kernel: (4912,1):o2hb_do_disk_heartbeat:973 ERROR: status = -5
Sep 21 21:28:05 bbflgrid11 kernel: (4912,1):o2hb_bio_end_io:331 ERROR: IO Error -5
Sep 21 21:28:05 bbflgrid11 kernel: (4912,1):o2hb_do_disk_heartbeat:973 ERROR: status = -5
Sep 21 21:28:07 bbflgrid11 kernel: (4912,1):o2hb_bio_end_io:331 ERROR: IO Error -5
Sep 21 21:28:07 bbflgrid11 kernel: (4912,1):o2hb_do_disk_heartbeat:973 ERROR: status = -5
Sep 21 21:28:07 bbflgrid11 kernel: (4912,1):o2hb_bio_end_io:331 ERROR: IO Error -5
Sep 21 21:28:07 bbflgrid11 kernel: (4912,1):o2hb_do_disk_heartbeat:973 ERROR: status = -5
Sep 21 21:28:09 bbflgrid11 kernel: (4912,1):o2hb_bio_end_io:331 ERROR: IO Error -5
Sep 21 21:28:09 bbflgrid11 kernel: (4912,1):o2hb_do_disk_heartbeat:973 ERROR: status = -5
Sep 21 21:28:09 bbflgrid11 kernel: (4912,1):o2hb_bio_end_io:331 ERROR: IO Error -5
Sep 21 21:28:09 bbflgrid11 kernel: (4912,1):o2hb_do_disk_heartbeat:973 ERROR: status = -5
Sep 21 21:28:09 bbflgrid11 multipathd: 8:208: readsector0 checker reports path is up
Sep 21 21:28:09 bbflgrid11 multipathd: 8:208: reinstated
Sep 21 21:28:09 bbflgrid11 multipathd: mpath2: remaining active paths: 1
Sep 21 21:28:10 bbflgrid11 multipathd: 65:192: readsector0 checker reports path is up
Sep 21 21:28:10 bbflgrid11 multipathd: 65:192: reinstated
Sep 21 21:28:10 bbflgrid11 multipathd: mpath2: remaining active paths: 2
...
Index 14: took 0 ms to do submit_bio for read
Index 15: took 0 ms to do waiting for read completion
(11,1):o2hb_stop_all_regions:1908 ERROR: stopping heartbeat on all active regions
Kernel panic - not syncing: ocfs2 is very sorry to be fencing this system by panicing

Seems like if I wait for node 1 to heartbeat to node 2, with o2cb down, before rebooting, it's fine; but if I reboot before node 1 has had a chance to heartbeat to node 2 with o2cb down, it panics.

Shawn E. Ruff
Senior Oracle DBA
Fiberlink Communications

The information transmitted is intended only for the person or entity to which it is addressed and may contain confidential and/or privileged material. Any review, retransmission, dissemination or other use of, or taking of any action in reliance upon, this information by persons or entities other than the intended recipient is prohibited. If you received this in error, please contact the sender and delete the material from any computer.
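The only difference between the two sequences above is the pause between stopping o2cb and rebooting. A minimal sketch of the ordering that did not trigger a fence, assuming (as the timing suggests, but not confirmed here) that the wait exists purely to give node 1's heartbeat thread time to notice node 2 has left the region before the reboot disturbs the shared storage paths; the 90-second figure is an assumed margin, not a measured value:

  #!/bin/sh
  # Sketch of a shutdown order for the departing node (node 2).
  /etc/init.d/ocfs2 stop   # unmount all OCFS2 volumes on this node
  /etc/init.d/o2cb stop    # stop the o2cb cluster stack (heartbeat, dlm)
  # Give the surviving node time to see this node leave the heartbeat
  # region before the FC paths bounce; derive the real value from node 1's
  # O2CB_HEARTBEAT_THRESHOLD (see the reply below).
  sleep 90
  init 6                   # now reboot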
What is your O2CB_HEARTBEAT_THRESHOLD set to?

For more, refer:
http://oss.oracle.com/projects/ocfs2/dist/documentation/ocfs2_faq.html#HEARTBEAT

SRuff@fiberlink.com wrote:
> [...]
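For context, that threshold is what decides whether a transient multipath failover turns into a fence. A rough sketch of the knob and the arithmetic, assuming the 2-second disk-heartbeat interval and the (threshold - 1) * 2 timeout described in the linked FAQ; the values below are illustrative, not a recommendation:

  # /etc/sysconfig/o2cb (RHEL location; use the same value on every node)
  # Assumption: a node is declared dead after (O2CB_HEARTBEAT_THRESHOLD - 1) * 2
  # seconds without a heartbeat write, so:
  #   7  (default) -> ~12 seconds of tolerated I/O outage
  #   31           -> ~60 seconds, often suggested for multipath setups
  O2CB_HEARTBEAT_THRESHOLD=31

  # Restart the stack for the new value to take effect:
  #   /etc/init.d/o2cb stop && /etc/init.d/o2cb start
  # Depending on version, the running value may be visible in
  #   /proc/fs/ocfs2_nodemanager/hb_dead_threshold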
IT IS NOT NORMAL. Something is wrong with your storage, FC switch, or cards. Why, when you shut down one node, does the second node experience I/O errors?

----- Original Message -----
From: SRuff@fiberlink.com
To: ocfs2-users@oss.oracle.com
Sent: Thursday, September 21, 2006 2:56 PM
Subject: [Ocfs2-users] ocfs2 fencing on reboot of 2nd node

[...]
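To narrow that down, a few checks one might run on the surviving node while the other node reboots; these are generic device-mapper-multipath and syslog commands, nothing OCFS2-specific, and the exact output varies by version:

  multipath -ll                                    # path topology and current state
  grep 'SCSI error' /var/log/messages | tail -20   # recent low-level SCSI failures
  grep multipathd /var/log/messages | tail -20     # path fail/reinstate history

If every path to a LUN stays failed for longer than the heartbeat timeout, as in the "mpath2: remaining active paths: 0" lines in the logs above, o2hb cannot write its heartbeat block and the node will self-fence regardless of how the other node was shut down.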