Thanks for the information.

At first glance I would say this is not an OCFS2 issue. It appears
that all of your iSCSI targets are going offline at the same time,
causing multipath to fail the device - which OCFS2 should not be
expected to handle gracefully (it is most likely to fence to ensure
data integrity).
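To confirm that, it would help to capture the path and session state
on an affected node while the problem is happening - something like
the following (standard device-mapper-multipath and
iscsi-initiator-utils commands, nothing EqualLogic-specific):

    # path state as multipath sees it (active/failed per path)
    multipath -ll
    # state of each iSCSI session and which interface it uses
    iscsiadm -m session -P 1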
That being said, we are happy to help you with this issue.
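One thing I notice in the multipath.conf you posted below: with
no_path_retry 5 set and queue_if_no_path commented out, multipathd
only queues I/O for five retry intervals after the last path drops,
then fails everything back to the filesystem - which lines up with the
o2hb "status = -5" errors in your log. If you would rather have
multipath queue until paths return (a trade-off, not a tested
recommendation - the node can then hang on queued I/O instead of
fencing), the sketch would be:

    multipath {
            wwid            36090a058604c6a2d790b250bee4exxxx
            alias           asvolume
            # queue indefinitely instead of failing after 5 retries
            no_path_retry   queue
    }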
Do all the servers have this problem at the same time? If so, this is
more likely a problem with the iSCSI target, or with the network
leading to it (can you provide a topology diagram?).
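A quick way to rule the network in or out during an incident would be
to check, from an affected server, whether the group portal still
answers (I am assuming 192.168.5.100 from your iscsid log is the group
IP, and eth2 is one of your iSCSI interfaces - adjust to your setup):

    # is the portal reachable from a specific iSCSI interface?
    ping -c 3 -I eth2 192.168.5.100
    # do the targets still answer a discovery request?
    iscsiadm -m discovery -t sendtargets -p 192.168.5.100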
Once all the paths disappear they do not appear to be recovering. How
many multipathd messages are there before Apr 22 15:53:09?
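Something like this would give a rough picture, assuming the log you
pasted is /var/log/messages:

    # when do the path failures actually start?
    grep 'multipathd' /var/log/messages | head -50
    # how many individual path failures in total?
    grep -c 'mark as failed' /var/log/messages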
Marty
On 24 April 2014 17:06, <Andrew.MORLEY at sungard.com> wrote:
> Hi,
>
>
>
> I have an issue with ocfs2 and I am not quite sure where the problem
> is. I would be grateful for any feedback. It looks like a multipath
> issue; however, I have redundant links, so I am not sure why ocfs2
> would barf and bring the server down.
>
>
>
> I have a set of production servers that have started showing the
> same error.
>
> I am not aware of any changes within the infrastructure.
>
>
>
> The setup is:
>
>
>
> 4 x EqualLogic PS6100X arrays.
>
> Lots of Dell R610 servers, all with multiple iSCSI interfaces.
>
>
> This has happened on 3 different servers in the last week, causing the
> servers to hang.
>
>
> I have checked all switches and logs and can see no flapping
> interfaces. I can see the iSCSI initiator make logout and login
> requests during this time period.
>
>
>
> I see in the logs:
>
>
>
> Apr 22 15:53:09 servername multipathd:
> eql-0-8a0906-2d6a4c605-13244eee0b250b79_a: Entering recovery mode:
> max_retries=5
>
> Apr 22 15:53:09 servername multipathd: 8:176: mark as failed
>
> Apr 22 15:53:09 servername multipathd: 8:16: mark as failed
>
> Apr 22 15:53:09 servername multipathd: 8:48: mark as failed
>
> Apr 22 15:53:09 servername multipathd: 8:64: mark as failed
>
> Apr 22 15:53:09 servername multipathd: 8:128: mark as failed
>
> Apr 22 15:53:09 servername multipathd: 8:160: mark as failed
>
> Apr 22 15:53:09 servername multipathd:
> eql-0-8a0906-2d6a4c605-13244eee0b250b79_a: Entering recovery mode:
> max_retries=5
>
> Apr 22 15:53:09 servername multipathd: 8:176: mark as failed
>
> Apr 22 15:53:09 servername multipathd: 8:16: mark as failed
>
> Apr 22 15:53:09 servername multipathd: 8:48: mark as failed
>
> Apr 22 15:53:09 servername multipathd: 8:64: mark as failed
>
> Apr 22 15:53:09 servername multipathd: 8:128: mark as failed
>
> Apr 22 15:53:09 servername multipathd: 8:160: mark as failed
>
> Apr 22 15:53:11 servername kernel: (kmpathd/6,2888,6):o2hb_bio_end_io:241
> ERROR: IO Error -5
>
> Apr 22 15:53:11 servername kernel: Buffer I/O error on device dm-7, logical
> block 480
>
> Apr 22 15:53:11 servername kernel: lost page write due to I/O error on dm-7
>
> Apr 22 15:53:11 servername kernel: scsi 114:0:0:0: rejecting I/O to dead
> device
>
> Apr 22 15:53:11 servername kernel: device-mapper: multipath: Failing path
> 8:176.
>
> Apr 22 15:53:11 servername kernel:
> (o2hb-1B3B9BEE63,4754,7):o2hb_do_disk_heartbeat:772 ERROR: status = -5
>
> Apr 22 15:53:11 servername multipathd: dm-4: add map (uevent)
>
> Apr 22 15:53:11 servername kernel: scsi 115:0:0:0: rejecting I/O to dead
> device
>
> Apr 22 15:53:11 servername kernel: device-mapper: multipath: Failing path
> 8:16.
>
> Apr 22 15:53:11 servername multipathd: dm-4: devmap already registered
>
> Apr 22 15:53:11 servername multipathd: dm-4: add map (uevent)
>
> Apr 22 15:53:11 servername multipathd: dm-4: devmap already registered
>
> Apr 22 15:53:11 servername multipathd: dm-3: add map (uevent)
>
> Apr 22 15:53:11 servername kernel: scsi 110:0:0:0: rejecting I/O to dead
> device
>
> Apr 22 15:53:11 servername kernel: device-mapper: multipath: Failing path
> 8:48.
>
>
>
> Apr 22 15:53:17 servername multipathd: asvolume: load table [0
> 629145600 multipath 0 0 1 1 round-robin 0 6 1 8:32 10 8:80 10 8:96 10
> 8:112 10 8:144 10 8:16 10]
>
> Apr 22 15:53:17 servername multipathd: dm-2: add map (uevent)
>
> Apr 22 15:53:17 servername multipathd: dm-2: devmap already registered
>
> Apr 22 15:53:17 servername multipathd: dm-8: add map (uevent)
>
> Apr 22 15:53:17 servername iscsid: Connection117:0 to [target:
> iqn.2001-05.com.equallogic:0-8a0906-2d6a4c605-13244eee0b250b79-as14volumeocfs2,
> portal: 192.168.5.100,3260] through [iface: eql.eth2_2] is operational
> now
>
> Apr 22 15:53:22 servername multipathd: dm-3: add map (uevent)
>
> Apr 22 15:53:22 servername multipathd: dm-3: devmap already registered
>
> Apr 22 15:53:22 servername multipathd: dm-4: add map (uevent)
>
> Apr 22 15:53:22 servername multipathd: dm-4: devmap already registered
>
> Apr 22 15:53:22 servername multipathd: dm-5: add map (uevent)
>
> Apr 22 15:53:22 servername multipathd: dm-5: devmap already registered
>
> Apr 22 15:53:22 servername multipathd: dm-9: add map (uevent)
>
> Apr 22 15:53:22 servername multipathd: dm-9: devmap already registered
>
> Apr 22 15:53:22 servername kernel: get_page_tbl ctx=0xffff810623d041c0
> (253:6): bits=2, mask=0x3, num=20480, max=20480
>
>
> Then ocfs2 has an issue:
>
>
> Apr 22 15:53:23 servername kernel: (ocfs2cmt,4773,6):ocfs2_commit_cache:191
> ERROR: status = -5
>
> Apr 22 15:53:23 servername kernel:
> (ocfs2cmt,4773,6):ocfs2_commit_thread:1799 ERROR: status = -5
>
> Apr 22 15:53:23 servername kernel: (ocfs2cmt,4773,6):ocfs2_commit_cache:191
> ERROR: status = -5
>
>
> Then:
>
>
>
> Apr 22 15:53:23 servername kernel:
> s2cmt,4773,6):ocfs2<3>(ocfs2c<3>(ocfs2cmt,4773,6):ocfs2_commit_cache:191
> ERROR: status = -5
>
> Apr 22 15:53:23 servername kernel:
> (ocfs2cmt,4773,6):ocfs2_commi<3>(ocfs2cm<<3>(<3>(ocfs2cmt,4773,6):ocfs2_commit_cache:191
> ERROR: status = -5
>
> [snip: many more kernel lines of the same
> "(ocfs2cmt,4773,6):ocfs2_commit_cache:191 ERROR: status = -5" message,
> interleaved and truncated by concurrent printk output]
>
>
>
> This repeated thousands of times, bringing the server to a halt.
>
>
> cat /etc/multipath.conf
>
>
>
> blacklist {
>
> devnode "^sd[a]$"
>
> }
>
>
>
> ## Use user friendly names, instead of using WWIDs as names.
>
> defaults {
>
> user_friendly_names yes
>
> }
>
> multipaths {
>
> multipath {
>
> wwid 36090a058604c6a2d790b250bee4exxxx
>
> alias asvolume
>
> path_grouping_policy multibus
>
> #path_checker readsector0
>
> path_selector "round-robin 0"
>
> failback immediate
>
> rr_weight priorities
>
> rr_min_io 10
>
> no_path_retry 5
>
> }
>
> }
>
> devices {
>
> device {
>
> vendor "EQLOGIC"
>
> product "100E-00"
>
> path_grouping_policy multibus
>
>                 getuid_callout          "/sbin/scsi_id -g -u -s /block/%n"
>
> #features "1 queue_if_no_path"
>
> path_checker readsector0
>
> path_selector "round-robin 0"
>
> failback immediate
>
> rr_min_io 10
>
> rr_weight priorities
>
> }
>
> }
>
>
>
> cat /etc/ocfs2/cluster.conf
>
>
> node:
>
> ip_port = 8888
>
> ip_address = x.x.x.x
>
> number = 9
>
> name = servername
>
> cluster = ocfs
>
> node:
>
> ip_port = 8888
>
> ip_address = x.x.x.x
>
> number = 109
>
> name = servername1
>
> cluster = ocfs
>
>
>
> more nodes in here
>
>
>
> cluster:
>
> node_count = 22
>
> name = ocfs
>
>
>
> Cluster consists of 14 nodes.
>
>
>
> /etc/init.d/o2cb status
>
>
> Driver for "configfs": Loaded
>
> Filesystem "configfs": Mounted
>
> Driver for "ocfs2_dlmfs": Loaded
>
> Filesystem "ocfs2_dlmfs": Mounted
>
> Checking O2CB cluster ocfs: Online
>
> Heartbeat dead threshold = 61
>
> Network idle timeout: 30000
>
> Network keepalive delay: 2000
>
> Network reconnect delay: 2000
>
> Checking O2CB heartbeat: Active
>
>
> Server and package information.
>
>
>
> cat /etc/redhat-release
>
> Red Hat Enterprise Linux Server release 5.10 (Tikanga)
>
>
>
> rpm -qa | grep multipath
>
> device-mapper-multipath-0.4.7-59.el5
>
>
>
> rpm -qa | grep ocfs2
>
>
>
> ocfs2-2.6.18-371.3.1.el5-1.4.10-1.el5
>
> ocfs2-tools-1.4.4-1.el5
>
> ocfs2console-1.4.4-1.el5
>
>
>
> rpm -qa | grep kernel
>
>
>
> kernel-2.6.18-371.3.1.el5
>
>
>
> modinfo ocfs2
>
>
>
> filename: /lib/modules/2.6.18-371.3.1.el5/kernel/fs/ocfs2/ocfs2.ko
>
> license: GPL
>
> author: Oracle
>
> version: 1.4.10
>
> description: OCFS2 1.4.10 Thu Dec 5 16:38:36 PST 2013 (build
> b703e5e0906b370c876b657dabe8d4c8)
>
> srcversion: 41115DB9EFDAA5735C18810
>
> depends: ocfs2_dlm,jbd,ocfs2_nodemanager
>
> vermagic: 2.6.18-371.3.1.el5 SMP mod_unload gcc-4.1
>
>
> _______________________________________________
> Ocfs2-users mailing list
> Ocfs2-users at oss.oracle.com
> https://oss.oracle.com/mailman/listinfo/ocfs2-users