We're trying to set up a dual-primary DRBD environment with a shared disk running either OCFS2 or GFS. The environment is CentOS 5.3 with DRBD82 (we also tried DRBD83 from testing).

Setting up a single-primary disk and running bonnie++ on it works. Setting up a dual-primary disk, mounting it on only one node (ext3), and running bonnie++ also works.

When we set up ocfs2 on the /dev/drbd0 device and mount it on both nodes, basic functionality seems in place, but usually less than 5-10 minutes after I start bonnie++ as a test on one of the nodes, both nodes power-cycle with no errors in the log files, just a crash.

When I'm at the console at the time of the crash, it looks like disk I/O blocks (you can type, but no actions happen), and then a reboot follows; no panics, no oopses, nothing. (The sysctl panic values are set to time out, etc.) Setting up a dual-primary disk with ocfs2 but mounting it on only one node and starting bonnie++ causes only that node to crash.

At the DRBD level I get the following errors when that node disappears:

drbd0: PingAck did not arrive in time.
drbd0: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
drbd0: asender terminated
drbd0: Terminating asender thread

That, however, is an expected error caused by the reboot.

At first I assumed OCFS2 was the root of this problem, so I went ahead and set up an iSCSI target on a third node and used that device with the same OCFS2 setup. There, no crashes occurred and bonnie++ flawlessly completed its test run. So my attention went back to the combination of DRBD and OCFS2.

I tried both DRBD 8.2 (drbd82-8.2.6-1.el5.centos, kmod-drbd82-8.2.6-2) and the 8.3 variant from CentOS Testing. At first I was trying the ocfs2 1.4.1-1.el5.i386.rpm version, but upgrading to 1.4.2-1.el5.i386.rpm didn't change the behaviour.

Does anyone have an idea about this? How can we get more debug info from OCFS2, apart from heartbeat tracing, which hasn't taught me anything yet, in order to potentially file a valuable bug report?

Thanks in advance,
Kris
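For reference, a dual-primary DRBD 8.2/8.3 resource needs `allow-two-primaries` in the `net` section (plus split-brain policies and, optionally, `become-primary-on both`). A minimal sketch of what such a resource might look like — hostnames, backing devices, and addresses below are placeholders, not taken from this setup:

```
resource r0 {
  protocol C;                      # synchronous replication; required for dual-primary
  startup {
    become-primary-on both;        # promote both nodes on startup (DRBD 8.3+)
  }
  net {
    allow-two-primaries;           # permit Primary/Primary
    after-sb-0pri discard-zero-changes;
    after-sb-1pri discard-secondary;
    after-sb-2pri disconnect;
  }
  on node-a {                      # placeholder hostname
    device    /dev/drbd0;
    disk      /dev/sda3;           # placeholder backing device
    address   192.168.1.1:7788;    # placeholder replication IP
    meta-disk internal;
  }
  on node-b {
    device    /dev/drbd0;
    disk      /dev/sda3;
    address   192.168.1.2:7788;
    meta-disk internal;
  }
}
```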
Kris Buytaert wrote:
> We're trying to set up a dual-primary DRBD environment with a shared
> disk running either OCFS2 or GFS. The environment is CentOS 5.3 with
> DRBD82 (but also tried with DRBD83 from testing).

Both OCFS2 and GFS are meant to be used on SANs with shared storage (the same LUNs being accessed by multiple servers). I just re-confirmed that DRBD is not a shared-storage mechanism but just a simple block-mirroring technology between a couple of nodes (as I originally thought). I think you are mixing incompatible technologies. Even if you can get it working, it just seems like a really bad idea.

Perhaps what you could do is set up an iSCSI initiator on your DRBD cluster and export a LUN to another cluster running OCFS2 or GFS. (Last I checked, GFS required at least three nodes; with fewer than that, the cluster goes into read-only mode. I didn't see any minimum requirements for OCFS2.)

Though the whole concept of DRBD just screams crap performance to me compared to a real shared-storage system; I wouldn't touch it with a 50-foot pole myself.

nate
Sunil Mushran
2009-Jun-24 19:02 UTC
[Ocfs2-users] Unexplained reboots in DRBD82 + OCFS2 setup
Do you have a separate network path for drbd traffic? If you do not, then you are probably overloading the network. In this case, I believe drbd is unable to replicate the I/Os fast enough and is thus blocking the o2cb disk heartbeat.

One workaround is to increase the O2CB_HEARTBEAT_THRESHOLD to more than the default of 60 secs. Refer to the ocfs2 FAQ or the ocfs2 1.4 user's guide for more on this.

And if you want to capture the logs, set up netconsole.

Kris Buytaert wrote:
> We're trying to setup a dual-primary DRBD environment, with a shared
> disk with either OCFS2 or GFS. The environment is a Centos 5.3 with
> DRBD82 (but also tried with DRBD83 from testing).
> [...]
> Anyone has an idea on this?
> How can we get more debug info from OCFS2, apart from heartbeat
> tracing which doesn't learn me nothing yet .. in order to potentially
> file a valuable bug report.
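The threshold Sunil mentions is set via O2CB_HEARTBEAT_THRESHOLD in /etc/sysconfig/o2cb (or `service o2cb configure`). Per the ocfs2 FAQ, the effective timeout is (threshold - 1) * 2 seconds, so the default threshold of 31 gives the 60 seconds mentioned above. A small sketch of the conversion (the 120-second target is just an example value):

```shell
# Convert a desired o2cb heartbeat timeout (in seconds) to the
# O2CB_HEARTBEAT_THRESHOLD value for /etc/sysconfig/o2cb.
# Relationship per the ocfs2 FAQ: timeout = (threshold - 1) * 2 seconds.
TIMEOUT=120                        # example: desired timeout in seconds
THRESHOLD=$(( TIMEOUT / 2 + 1 ))   # inverse of the formula above
echo "O2CB_HEARTBEAT_THRESHOLD=${THRESHOLD}"
```

Running this prints the line to put in /etc/sysconfig/o2cb; the cluster must be restarted on all nodes for the new threshold to take effect.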
Kris Buytaert
2009-Jun-30 08:11 UTC
[CentOS] [DRBD-user] Unexplained reboots in DRBD82 + OCFS2 setup
On Thu, 2009-06-25 at 11:42 +0200, Kris Buytaert wrote:
> > Use a serial console, attach that to some "monitoring" host
> > (you can use USB-to-serial adapters; they are cheap and work), and log
> > on that one. You'll get the last messages from there.
>
> I had indeed hoped to see some output on the serial console when the
> reboots happened, but the best I got so far was a partial timestamp
> with no further explanation before the reboot output started again.
>
> Any other ideas?

Update: the problem is indeed ocfs2 fencing off the systems. The logging, however, does not show up on a serial console; it DOES show up when using netconsole:

[base-root@CCMT-A ~]# nc -l -u -p 6666
(8,0):o2hb_write_timeout:166 ERROR: Heartbeat write timeout to device drbd0 after 478000 milliseconds
(8,0):o2hb_stop_all_regions:1873 ERROR: stopping heartbeat on all active regions.
ocfs2 is very sorry to be fencing this system by restarting

One would think it would output over the serial console before it logs over the network :) It doesn't.

Next step is that I'll start fiddling some more with the timeout values :)
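For anyone wanting to reproduce the netconsole capture above: the netconsole module takes a `netconsole=` option of the form `[src-port]@[src-ip]/[dev],[tgt-port]@tgt-ip/[tgt-mac]`. A sketch of the setup — all IPs, the interface name, and the MAC address below are placeholders, not values from this thread (and the commands need root on real hardware, so this is a config fragment, not something to paste blindly):

```
# On the node that crashes: send kernel messages over UDP to a log host.
modprobe netconsole netconsole=6665@192.168.1.10/eth0,6666@192.168.1.20/00:11:22:33:44:55

# On the log host: capture the messages with netcat, as in the output above.
nc -l -u -p 6666
```

Because netconsole writes straight from the kernel over UDP, it can get the last messages out even when disk I/O is already blocked, which is why it caught the o2hb fencing message that the serial console missed.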