thr3ads.net - Ocfs2 users - [Ocfs2-users] heartbeat write timeout [Mar 2006]

Stephan A. Rickauer

2006-Mar-29 14:17 UTC

[Ocfs2-users] heartbeat write timeout

Dear list,

I am evaluating ocfs2 in a test environment that currently runs a
"cluster" in single-node mode (AMD Opteron, 2GB RAM, RH AS4 (CentOS 4.3),
2.6.9-34.EL) connected to an iSCSI storage device. While running load
tests with 'bonnie++' to measure the performance of the storage device
together with the file system, I experience regular kernel panics related
to ocfs2 (1.2.0 RPMs).

Here is the message I get (I did not want to file a bug yet, maybe it's
just me missing something). sdb1 is the iSCSI device:

---snip---
(3,0):o2hb_write_timeout: 164 ERROR: Heartbeat write timeout to device
sdb1 after 12000 milliseconds
(3,0):o2hb_stop_all_regions: 1727 ERROR: stopping heartbeat on all
active regions
Kernel panic - not syncing: ocfs2 is very sorry to be fencing this
system by panicing
---snip---

I am tempted to rule out problems with the iSCSI storage device, since
tests with GFS and ext3 did not reveal comparable problems, but I am not
100% sure.

On the bug page I spotted ID 565, which seems to fit my scenario, but the
status of the bug is unclear to me (references to version 0.99 are
given): http://oss.oracle.com/bugzilla/show_bug.cgi?id=565

Any help / comments etc. are appreciated.
Thanks.

-- 

 Stephan A. Rickauer

 -----------------------------------------------------------
 Institut für Neuroinformatik          Tel: +41 44 635 30 50
 Universität / ETH Zürich              Sek: +41 44 635 30 52
 Winterthurerstrasse 190               Fax: +41 44 635 30 53
 CH-8057 Zürich                        Web:  www.ini.ethz.ch

 RSA public key: https://www.ini.ethz.ch/~stephan/pubkey.asc
 -----------------------------------------------------------

Brian Long

2006-Mar-29 14:36 UTC

Stephan A. Rickauer wrote:
> While doing load tests with 'bonnie++' to test the performance of the
> storage device together with the file system I experience regular
> kernel panics related to ocfs2 (1.2.0 RPMs).

Are you using the default "cfq" scheduler? Oracle's OCFS2 web site
states that there is a scheduler bug in RHEL 4 and that you should use the
"deadline" scheduler until Red Hat fixes the cfq bug. The fix is not
part of Update 3.

/Brian/

Silviu Marin-Caea

2006-Mar-29 14:37 UTC

On Wednesday 29 March 2006 17:17, Stephan A. Rickauer wrote:
> I am evaluating ocfs2 in a test environment, that currently runs a
> "cluster" in a one node mode (AMD Opteron, 2GB RAM, RH AS4 (CentOS 4.3),
> 2.6.9-34.EL) connected to an iSCSI storage device. While doing load
> tests with 'bonnie++' to test the performance of the storage device
> together with the file system I experience regular kernel panics related
> to ocfs2 (1.2.0 RPMs).
Did you specify the elevator=deadline parameter in /boot/grub/menu.lst?
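
For readers applying this suggestion: the change amounts to putting
elevator=deadline on each kernel line in /boot/grub/menu.lst and rebooting.
The Python sketch below performs that edit on an in-memory string. The
function name set_elevator and the sample boot entry are illustrative
assumptions, not taken from this thread.

```python
def set_elevator(menu_lst_text, scheduler="deadline"):
    """Return menu.lst text with elevator=<scheduler> on every kernel line."""
    out = []
    for line in menu_lst_text.splitlines():
        stripped = line.lstrip()
        # GRUB legacy boot entries put the kernel command line after "kernel".
        if stripped.startswith(("kernel ", "kernel\t")):
            indent = line[: len(line) - len(stripped)]
            # Drop any existing elevator= option, then append the requested one.
            words = [w for w in stripped.split() if not w.startswith("elevator=")]
            line = indent + " ".join(words + ["elevator=" + scheduler])
        out.append(line)
    return "\n".join(out) + "\n"


# Hypothetical boot entry, for illustration only.
SAMPLE_MENU_LST = "\n".join([
    "title CentOS (2.6.9-34.EL)",
    "        root (hd0,0)",
    "        kernel /vmlinuz-2.6.9-34.EL ro root=LABEL=/ elevator=cfq",
    "        initrd /initrd-2.6.9-34.EL.img",
]) + "\n"

print(set_elevator(SAMPLE_MENU_LST), end="")
```

The sketch only rewrites text; writing the result back to the real file
(after taking a backup) is left to the reader.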

SCOTT, Gavin

2006-Mar-31 00:15 UTC

After confirming with Stephan, this problem appears to relate to the
HEARTBEAT_THRESHOLD parameter as set in /etc/sysconfig/o2cb. I have encountered
this myself and confirmed with a couple of other people on the list that
it has caused problems, so it seems that the default threshold of 7 is possibly
too short, even for reasonably fast server-storage solutions such as an HP DL380
Packaged Cluster.

Does the OCFS2 development team also consider this to be too short, or is
altering the parameter just a workaround that shouldn't be used? If the
latter, how should we approach the problem of self-fencing nodes?

Also, can we expect this behaviour on some platforms but not others, or is it
too short for all platforms? If it is a blanket problem, should the default
threshold be raised?

Finally, if altering the threshold is a valid solution, could it please be
added to the FAQs and the user guide so that people know to adjust it as a first
step on encountering the problem, rather than having to post to the list and
wait for replies?

Regards,
Gavin
 

SCOTT, Gavin

2006-Mar-31 04:50 UTC

I implemented the change to the threshold to get around the self-fencing
before the scheduler bug was reported, as suggested by yourself in a
post to someone with a similar problem from November last year
(http://oss.oracle.com/pipermail/ocfs2-users/2005-November/000269.html).
Perhaps it should be made clear to anyone who read that post and changed
the threshold that it should be changed back to default once the
elevator=deadline fix is implemented.

I've already implemented the elevator=deadline fix but haven't changed
the threshold back to default. I'll do that and hopefully won't see the
self-fences; if I do, I'll send back the message dump.

Gavin  

-----Original Message-----
From: Sunil Mushran [mailto:Sunil.Mushran at oracle.com] 
Sent: Friday, 31 March 2006 11:06
To: SCOTT, Gavin
Cc: 
Subject: Re: [Ocfs2-users] heartbeat write timeout

Are you seeing timeouts with elevator=deadline?

We only test with the default value and have not seen any disk hb
timeouts on either 2G fc or gige iscsi. And these are heavy db loads.

When the hb thread panics, it dumps messages indicating the times it
took to perform the tasks. Could you share those messages?

Weller, Michael

2006-Apr-01 19:36 UTC

Hi List,

I am experiencing the very same problem. This time it's not iSCSI but a very
performant two-node HP DL385 (dual dual-core Opteron) cluster redundantly
connected through two single-port HP 2214 2Gbit SAN controllers (relabelled
Qlogic, 2340 I think) to a dual-fabric HP 4/32 SAN connected to an EVA
6000 holding a few terabytes (no, the ocfs2 volume is only about 7 GB). The
system is SAN-booted (no local disks). This is a vendor-certified setup;
we are bound to SLES9SP3 (and EXACTLY that, nothing less, not a patch
more) with an HP-certified qla driver 8.0.2p11.

We have plenty of similar systems in that SAN and all work well, except the
ocfs2 cluster, which easily locks up with the 12000ms timeout message.
In particular, the system cannot survive a SAN failover when a switch or
link fails.
It locks up immediately. Definitely nothing like a 12s timeout expires.
We have had no chance yet to put any sensible load on that system; it
locks up on a SAN failover even when idle.

You mention a FAQ regarding some config option which I didn't come
across up to now, where can I find it?

Which options would you recommend to fix the problem or at least make
locks much less likely?

Thanks in advance.

---

Dr. Michael Weller

ITZ Informationstechnologie GmbH
Consulting/Systemengineering
Bismarckstrasse 57
D-45128 Essen

Phone Office +49 201 24714 28
FAX Office +49 201 24714 33
Phone Mobile +49 172 2178078
E-Mail mailto:michael.weller at itz-essen.de

Weller, Michael

2006-Apr-02 12:17 UTC

Thx for the hints, I'll try that.

With regard to the updates, while I generally agree, I can't update the
kernel here, because we would lose the vendor warranty in that case. I know
this is an odd concept, but that's how it works. We would even lose Oracle
support because the kernel update would void HP SAN support.

I mentioned SAN failover, which for example does not work with the current
kernel and the current (even the not so current HP-checked variant) Qlogic
driver.

Anyway, I'll try your suggestions on Monday and drop the list a note if they
worked.

Thanks,
Michael.

 ---

Dr. Michael Weller

ITZ Informationstechnologie GmbH
Consulting/Systemengineering
Bismarckstrasse 57
D-45128 Essen

Phone Office    +49 201 24714 28
FAX   Office    +49 201 24714 33
Phone Mobile    +49 172 2178078
E-Mail          mailto:michael.weller at itz-essen.de

> -----Original Message-----
> From: ocfs2-users-bounces at oss.oracle.com [mailto:ocfs2-users-
> bounces at oss.oracle.com] On Behalf Of Silviu Marin-Caea
> Sent: Sunday, 2 April 2006 08:26
> To: ocfs2-users at oss.oracle.com
> Subject: Re: [Ocfs2-users] heartbeat write timeout
>
> On Saturday 01 April 2006 22:36, Weller, Michael wrote:
>
> > we are bound to SLES9SP3 (and EXACTLY that, nothing less, not a patch
> > more)
>
> Having the latest updates does not hurt; on the contrary, it helps. For
> example, the latest kernel has OCFS2 1.1.8, while the kernel from SP3 has
> 1.1.7. There are a number of bugfixes.
>
> SLES updates really do have a purpose. Apply them after testing on a
> non-production system.
>
> > It locks up immediately. Definitely nothing like a 12s timeout expires.
>
> It just looks immediate; the 12s actually do expire.
>
> > You mention a FAQ regarding some config option which I didn't come
> > across up to now, where can I find it?
>
> /boot/grub/menu.lst
>
> change elevator=cfq to elevator=deadline
>
> http://oss.oracle.com/projects/ocfs2/
> scroll down, look at the red text
>
> > Which options would you recommend to fix the problem or at least make
> > locks much less likely.
>
> You could also increase the timeout:
>
> /etc/sysconfig/o2cb
>
> # O2CB_HEARTBEAT_THRESHOLD: Iterations before a node is considered dead.
> O2CB_HEARTBEAT_THRESHOLD=16
> 
> 
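
The threshold change quoted above is a one-line edit to /etc/sysconfig/o2cb.
As a minimal sketch, the Python function below replaces an existing
O2CB_HEARTBEAT_THRESHOLD assignment or appends one if the file has none.
The function name and the sample file contents are assumptions made for
illustration; the sketch works on a string and does not touch the real file.

```python
import re

# Matches an active (uncommented) assignment, anywhere in the file.
_ASSIGNMENT = re.compile(r"^[ \t]*O2CB_HEARTBEAT_THRESHOLD=.*$", re.MULTILINE)


def set_heartbeat_threshold(config_text, threshold):
    """Return config text with O2CB_HEARTBEAT_THRESHOLD set to threshold."""
    if threshold < 1:
        raise ValueError("threshold must be a positive integer")
    new_line = "O2CB_HEARTBEAT_THRESHOLD=%d" % threshold
    if _ASSIGNMENT.search(config_text):
        # Replace the existing assignment; comment lines are left alone.
        return _ASSIGNMENT.sub(new_line, config_text)
    # No assignment yet: append one, keeping the file newline-terminated.
    separator = "" if config_text == "" or config_text.endswith("\n") else "\n"
    return config_text + separator + new_line + "\n"


# Hypothetical file contents, for illustration only.
SAMPLE_O2CB = (
    "# O2CB_HEARTBEAT_THRESHOLD: Iterations before a node is considered dead.\n"
    "O2CB_HEARTBEAT_THRESHOLD=7\n"
)

print(set_heartbeat_threshold(SAMPLE_O2CB, 16), end="")
```

Presumably the new value only takes effect once the o2cb service has been
restarted, and it seems prudent to use the same value on every node of the
cluster.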

Zunker, Christian

2006-Apr-18 13:21 UTC

Hi,

I experienced the same problems. The elevator=deadline parameter didn't
help, but increasing the threshold to 60 did. I think a lower threshold
could also work, but I didn't test that. Another posting recommends a
timeout between 60 and 90 seconds, which would mean a threshold between
31 and 46.

I'll test this later.

Best regards,
Christian
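
The figures in this thread (a threshold of 7 giving the 12000 ms in the
panic message, and 60 to 90 seconds mapping to thresholds 31 to 46) are
consistent with timeout = (threshold - 1) * 2 seconds. The Python sketch
below encodes that relation. The 2-second heartbeat interval is inferred
from those numbers and is an assumption, as are the function names.

```python
import math

# Assumed: one heartbeat iteration every 2 seconds (inferred from the thread).
HEARTBEAT_INTERVAL_MS = 2000


def timeout_ms(threshold):
    """Milliseconds before a node fences itself for a given threshold."""
    return (threshold - 1) * HEARTBEAT_INTERVAL_MS


def threshold_for(timeout_seconds):
    """Smallest threshold whose timeout is at least timeout_seconds."""
    return math.ceil(timeout_seconds * 1000 / HEARTBEAT_INTERVAL_MS) + 1


print(timeout_ms(7))                          # 12000, as in the panic message
print(threshold_for(60), threshold_for(90))   # 31 46
```

Under the same assumption, the thresholds of 14 and 16 mentioned elsewhere
in the thread would correspond to 26 and 30 seconds.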


Weller, Michael

2006-Apr-18 13:49 UTC

I don't know if I mentioned this to the list: elevator=deadline and raising
the THRESHOLD to 14 solved my self-fencing issues.

(We'll see what happens under a possibly extreme load).

Michael.

---

Dr. Michael Weller

ITZ Informationstechnologie GmbH
Consulting/Systemengineering
Bismarckstrasse 57
D-45128 Essen

Phone Office  +49 201 24714 28
FAX   Office  +49 201 24714 33
Phone Mobile  +49 172 2178078
E-Mail        mailto:michael.weller at itz-essen.de

Sunil Mushran

2006-Apr-18 17:57 UTC

This looks very similar to what we were seeing with cfq.

What type of load are you running when this happens?
What is the full kernel version?
Are there any other messages during that time?
What type of storage is this?

Stephan A. Rickauer wrote:
> Sunil Mushran wrote:
>> Set up netdump/netconsole. We print more messages after the
>> write_timeout which will provide more clues. As the node is panicking,
>> these messages are caught only by the netdump server.
>
> scheduler = deadline, hb_dead_threshold = 50. Still fencing.
>
> ---snip---
> Heartbeat thread (3) printing last 24 blocking operations (cur=10):
> Heartbeat thread stuck at waiting for write completion stuffing current time
> into that blocker (index 10)
> Index 11: took 0 ms to do submit_bio for write
> Index 12: took 0 ms to do checking slots
> Index 13: took 1673 ms to do waiting for write completion
> Index 14: took 250 ms to do msleep
> Index 15: took 0 ms to do allocating bios for read
> Index 16: took 0 ms to do bio alloc read
> Index 17: took 0 ms to do bio add page read
> Index 18: took 0 ms to do submit_bio for read
> Index 19: took 149 ms to do waiting for read completion
> Index 20: took 0 ms to do bio alloc write
> Index 21: took 0 ms to do bio add page write
> Index 22: took 0 ms to do submit_bio for write
> Index 23: took 0 ms to do checking slots
> Index 0: took 4090 ms to do waiting for write completion
> Index 1: took 0 ms to do allocating bios for read
> Index 2: took 0 ms to do bio alloc read
> Index 3: took 0 ms to do bio add page read
> Index 4: took 0 ms to do submit_bio for read
> Index 5: took 134 ms to do waiting for read completion
> Index 6: took 0 ms to do bio alloc write
> Index 7: took 0 ms to do bio add page write
> Index 8: took 0 ms to do submit_bio for write
> Index 9: took 0 ms to do checking slots
> Index 10: took 120045 ms to do waiting for write completion
> Kernel panic - not syncing: ocfs2 is very sorry to be fencing this system by
> panicing
> ---snip---
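
To see at a glance where the heartbeat thread spent its time, a dump like
the one above can be parsed and sorted by duration. The Python sketch below
is one way to do that; the function names are made up for illustration and
the sample is a shortened excerpt of the dump in this message.

```python
import re

# One dump entry, e.g. "Index 13: took 1673 ms to do waiting for write completion"
ENTRY = re.compile(r"Index (\d+): took (\d+) ms to do (.+)")


def parse_blocking_ops(dump):
    """Return (index, milliseconds, operation) tuples found in the dump text."""
    ops = []
    for line in dump.splitlines():
        match = ENTRY.search(line)  # search() tolerates "> " quote prefixes
        if match:
            ops.append((int(match.group(1)), int(match.group(2)),
                        match.group(3).strip()))
    return ops


def slowest(ops, count=3):
    """Return the count entries with the largest duration, slowest first."""
    return sorted(ops, key=lambda op: op[1], reverse=True)[:count]


# Excerpt of the dump quoted in this message.
SAMPLE_DUMP = """\
> Index 13: took 1673 ms to do waiting for write completion
> Index 14: took 250 ms to do msleep
> Index 19: took 149 ms to do waiting for read completion
> Index 0: took 4090 ms to do waiting for write completion
> Index 5: took 134 ms to do waiting for read completion
> Index 10: took 120045 ms to do waiting for write completion
"""

for index, ms, operation in slowest(parse_blocking_ops(SAMPLE_DUMP)):
    print("index %2d: %6d ms  %s" % (index, ms, operation))
```

On this excerpt the three slowest entries are all "waiting for write
completion", which matches the write timeout reported in the panic.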
