Mike Reid
2011-Apr-01 18:44 UTC
[Ocfs2-users] Node Recovery locks I/O in two-node OCFS2 cluster (DRBD 8.3.8 / Ubuntu 10.10)
I am running a two-node web cluster on OCFS2 via DRBD Primary/Primary (v8.3.8) and Pacemaker. Everything seems to be working great, except during testing of hard-boot scenarios.

Whenever I hard-boot one of the nodes, the other node is successfully fenced and marked "Outdated":

  <resource minor="0" cs="WFConnection" ro1="Primary" ro2="Unknown" ds1="UpToDate" ds2="Outdated" />

However, this locks up I/O on the still-active node and prevents any operations within the cluster. I have even forced DRBD into StandAlone mode while in this state, but that does not resolve the I/O lock either:

  <resource minor="0" cs="StandAlone" ro1="Primary" ro2="Unknown" ds1="UpToDate" ds2="Outdated" />

The only way I've been able to successfully regain I/O within the cluster is to bring the other node back up. While monitoring the logs, it appears that it is OCFS2 that is establishing the lock/unlock, and not DRBD at all:

Apr 1 12:07:19 ubu10a kernel: [ 1352.739777] (ocfs2rec,3643,0):ocfs2_replay_journal:1605 Recovering node 1124116672 from slot 1 on device (147,0)
Apr 1 12:07:19 ubu10a kernel: [ 1352.900874] (ocfs2rec,3643,0):ocfs2_begin_quota_recovery:407 Beginning quota recovery in slot 1
Apr 1 12:07:19 ubu10a kernel: [ 1352.902509] (ocfs2_wq,1213,0):ocfs2_finish_quota_recovery:598 Finishing quota recovery in slot 1
Apr 1 12:07:20 ubu10a kernel: [ 1354.423915] block drbd0: Handshake successful: Agreed network protocol version 94
Apr 1 12:07:20 ubu10a kernel: [ 1354.433074] block drbd0: Peer authenticated using 20 bytes of 'sha1' HMAC
Apr 1 12:07:20 ubu10a kernel: [ 1354.433083] block drbd0: conn( WFConnection -> WFReportParams )
Apr 1 12:07:20 ubu10a kernel: [ 1354.433097] block drbd0: Starting asender thread (from drbd0_receiver [2145])
Apr 1 12:07:20 ubu10a kernel: [ 1354.433562] block drbd0: data-integrity-alg: <not-used>
Apr 1 12:07:20 ubu10a kernel: [ 1354.434090] block drbd0: drbd_sync_handshake:
Apr 1 12:07:20 ubu10a kernel: [ 1354.434094] block drbd0: self FBA98A2F89E05B83:EE17466F4DEC2F8B:6A4CD8FDD0562FA1:EC7831379B78B997 bits:4 flags:0
Apr 1 12:07:20 ubu10a kernel: [ 1354.434097] block drbd0: peer EE17466F4DEC2F8A:0000000000000000:6A4CD8FDD0562FA0:EC7831379B78B997 bits:2048 flags:2
Apr 1 12:07:20 ubu10a kernel: [ 1354.434099] block drbd0: uuid_compare()=1 by rule 70
Apr 1 12:07:20 ubu10a kernel: [ 1354.434104] block drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS )
Apr 1 12:07:21 ubu10a kernel: [ 1354.601353] block drbd0: conn( WFBitMapS -> SyncSource ) pdsk( Outdated -> Inconsistent )
Apr 1 12:07:21 ubu10a kernel: [ 1354.601367] block drbd0: Began resync as SyncSource (will sync 8192 KB [2048 bits set]).
Apr 1 12:07:21 ubu10a kernel: [ 1355.401912] block drbd0: Resync done (total 1 sec; paused 0 sec; 8192 K/sec)
Apr 1 12:07:21 ubu10a kernel: [ 1355.401923] block drbd0: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )
Apr 1 12:07:22 ubu10a kernel: [ 1355.612601] block drbd0: peer( Secondary -> Primary )

Therefore, my question is: is there an option in OCFS2 to remove or prevent this lock, especially since it's inside a DRBD configuration? I'm still new to OCFS2, so I am definitely open to any criticism of my setup/approach, or any recommendations for keeping the cluster active when the other node is shut down during testing.
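[Editor's sketch, not part of the original thread.] For the DRBD side of a setup like this, the 8.3-era approach is to wire DRBD's fencing into the cluster manager so a lost peer is outdated through the CRM instead of left blocking. A minimal, hedged sketch of the relevant drbd.conf fragment (the resource name is illustrative, and the handler paths are the usual Linbit-shipped defaults, which may differ on Ubuntu 10.10):

```
resource r0 {                # "r0" is a placeholder resource name
  disk {
    fencing resource-only;   # invoke the fence-peer handler when the peer is lost
  }
  handlers {
    # Both scripts ship with DRBD's Pacemaker integration; verify the paths
    # on your distribution before relying on them.
    fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
  }
}
```

This only addresses the DRBD/Pacemaker interaction; the OCFS2 recovery hang described above still depends on the cluster stack declaring the dead node fenced.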
Sunil Mushran
2011-Apr-01 19:01 UTC
[Ocfs2-users] Node Recovery locks I/O in two-node OCFS2 cluster (DRBD 8.3.8 / Ubuntu 10.10)
I believe this is a Pacemaker issue. There was a time when it required a qdisk to continue working as a single node in a two-node cluster after one node died. If the Pacemaker people don't jump in, you may want to try your luck on the linux-cluster mailing list.

On 04/01/2011 11:44 AM, Mike Reid wrote:
> I am running a two-node web cluster on OCFS2 via DRBD Primary/Primary (v8.3.8) and Pacemaker. Everything seems to be working great, except during testing of hard-boot scenarios.
> [...]
> Therefore, my question is if there is an option in OCFS2 to remove / prevent this lock, especially since it's inside a DRBD configuration?

_______________________________________________
Ocfs2-users mailing list
Ocfs2-users at oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users
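[Editor's sketch, not part of the original thread.] The Pacemaker-side workaround commonly suggested for two-node clusters of that era was to relax the quorum policy while keeping STONITH enabled, since two-node clusters cannot retain majority quorum when one node dies. A hedged sketch in crm-shell syntax (property names are from Pacemaker 1.0/1.1; confirm against your version):

```
# Sketch only: a two-node cluster loses quorum when one node dies,
# so tell Pacemaker to keep running resources without quorum...
crm configure property no-quorum-policy=ignore
# ...which is only safe because STONITH fences the lost node first;
# ignoring quorum without working fencing risks split-brain.
crm configure property stonith-enabled=true
```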