Mike Reid
2011-Apr-01 18:44 UTC
[Ocfs2-users] Node Recovery locks I/O in two-node OCFS2 cluster (DRBD 8.3.8 / Ubuntu 10.10)
I am running a two-node web cluster on OCFS2 via DRBD Primary/Primary (v8.3.8) and Pacemaker. Everything seems to be working great, except during testing of hard-boot scenarios.

Whenever I hard-boot one of the nodes, the other node is successfully fenced and marked "Outdated":

  <resource minor="0" cs="WFConnection" ro1="Primary" ro2="Unknown" ds1="UpToDate" ds2="Outdated" />

However, this locks up I/O on the still-active node and prevents any operations within the cluster. I have even forced DRBD into StandAlone mode while in this state, but that does not resolve the I/O lock either:

  <resource minor="0" cs="StandAlone" ro1="Primary" ro2="Unknown" ds1="UpToDate" ds2="Outdated" />

The only way I've been able to successfully regain I/O within the cluster is to bring the other node back up. While monitoring the logs, it appears that it is OCFS2 that is establishing the lock/unlock, and not DRBD at all:

Apr 1 12:07:19 ubu10a kernel: [ 1352.739777] (ocfs2rec,3643,0):ocfs2_replay_journal:1605 Recovering node 1124116672 from slot 1 on device (147,0)
Apr 1 12:07:19 ubu10a kernel: [ 1352.900874] (ocfs2rec,3643,0):ocfs2_begin_quota_recovery:407 Beginning quota recovery in slot 1
Apr 1 12:07:19 ubu10a kernel: [ 1352.902509] (ocfs2_wq,1213,0):ocfs2_finish_quota_recovery:598 Finishing quota recovery in slot 1
Apr 1 12:07:20 ubu10a kernel: [ 1354.423915] block drbd0: Handshake successful: Agreed network protocol version 94
Apr 1 12:07:20 ubu10a kernel: [ 1354.433074] block drbd0: Peer authenticated using 20 bytes of 'sha1' HMAC
Apr 1 12:07:20 ubu10a kernel: [ 1354.433083] block drbd0: conn( WFConnection -> WFReportParams )
Apr 1 12:07:20 ubu10a kernel: [ 1354.433097] block drbd0: Starting asender thread (from drbd0_receiver [2145])
Apr 1 12:07:20 ubu10a kernel: [ 1354.433562] block drbd0: data-integrity-alg: <not-used>
Apr 1 12:07:20 ubu10a kernel: [ 1354.434090] block drbd0: drbd_sync_handshake:
Apr 1 12:07:20 ubu10a kernel: [ 1354.434094] block drbd0: self FBA98A2F89E05B83:EE17466F4DEC2F8B:6A4CD8FDD0562FA1:EC7831379B78B997 bits:4 flags:0
Apr 1 12:07:20 ubu10a kernel: [ 1354.434097] block drbd0: peer EE17466F4DEC2F8A:0000000000000000:6A4CD8FDD0562FA0:EC7831379B78B997 bits:2048 flags:2
Apr 1 12:07:20 ubu10a kernel: [ 1354.434099] block drbd0: uuid_compare()=1 by rule 70
Apr 1 12:07:20 ubu10a kernel: [ 1354.434104] block drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS )
Apr 1 12:07:21 ubu10a kernel: [ 1354.601353] block drbd0: conn( WFBitMapS -> SyncSource ) pdsk( Outdated -> Inconsistent )
Apr 1 12:07:21 ubu10a kernel: [ 1354.601367] block drbd0: Began resync as SyncSource (will sync 8192 KB [2048 bits set]).
Apr 1 12:07:21 ubu10a kernel: [ 1355.401912] block drbd0: Resync done (total 1 sec; paused 0 sec; 8192 K/sec)
Apr 1 12:07:21 ubu10a kernel: [ 1355.401923] block drbd0: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )
Apr 1 12:07:22 ubu10a kernel: [ 1355.612601] block drbd0: peer( Secondary -> Primary )

Therefore, my question is: is there an option in OCFS2 to remove or prevent this lock, especially since it's inside a DRBD configuration? I'm still new to OCFS2, so I am definitely open to any criticism of my setup/approach, or any recommendations for keeping the cluster active when the other node is shut down during testing.
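[Editor's sketch, not part of the original thread.] For the DRBD side of a setup like this, the 8.3-era approach is to wire DRBD's fencing into the cluster manager so a lost peer is outdated through the CRM instead of left blocking. A minimal, hedged sketch of the relevant drbd.conf fragment (the resource name is illustrative, and the handler paths are the usual Linbit-shipped defaults, which may differ on Ubuntu 10.10):

```
resource r0 {                # "r0" is a placeholder resource name
  disk {
    fencing resource-only;   # invoke the fence-peer handler when the peer is lost
  }
  handlers {
    # Both scripts ship with DRBD's Pacemaker integration; verify the paths
    # on your distribution before relying on them.
    fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
  }
}
```

This only addresses the DRBD/Pacemaker interaction; the OCFS2 recovery hang described above still depends on the cluster stack declaring the dead node fenced.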
Sunil Mushran
2011-Apr-01 19:01 UTC
[Ocfs2-users] Node Recovery locks I/O in two-node OCFS2 cluster (DRBD 8.3.8 / Ubuntu 10.10)
I believe this is a Pacemaker issue. There was a time when it required a qdisk to continue working as a single node in a two-node cluster after one node died. If the Pacemaker people don't jump in, you may want to try your luck on the linux-cluster mailing list.

On 04/01/2011 11:44 AM, Mike Reid wrote:
> I am running a two-node web cluster on OCFS2 via DRBD Primary/Primary (v8.3.8) and Pacemaker. Everything seems to be working great, except during testing of hard-boot scenarios.
> [...]
> Therefore, my question is if there is an option in OCFS2 to remove / prevent this lock, especially since it's inside a DRBD configuration?

_______________________________________________
Ocfs2-users mailing list
Ocfs2-users at oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users
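[Editor's sketch, not part of the original thread.] The Pacemaker-side workaround commonly suggested for two-node clusters of that era was to relax the quorum policy while keeping STONITH enabled, since two-node clusters cannot retain majority quorum when one node dies. A hedged sketch in crm-shell syntax (property names are from Pacemaker 1.0/1.1; confirm against your version):

```
# Sketch only: a two-node cluster loses quorum when one node dies,
# so tell Pacemaker to keep running resources without quorum...
crm configure property no-quorum-policy=ignore
# ...which is only safe because STONITH fences the lost node first;
# ignoring quorum without working fencing risks split-brain.
crm configure property stonith-enabled=true
```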