thr3ads.net - Ocfs2 users - [Ocfs2-users] OCFS2 Panic! [Nov 2005]

Peter Sylvester

2005-Nov-10 12:09 UTC

[Ocfs2-users] OCFS2 Panic!

System config:

Dell PE2850 server
(4) 36GB SCSI drives in (onboard) RAID-5

RHEL4-U2
Dell ATI Video Driver update 10/2005

ocfs2-2.6.9-22.ELsmp-1.0.7-1.i686.rpm
ocfs2-tools-1.0.2-1.i386.rpm
ocfs2console-1.0.2-1.i386.rpm

Note that this is a single-node cluster; nothing else is installed or
running except iozone.

I was running some "iozone" tests on the OCFS2 volume for about a day,
and the system locked up completely.
The following messages were transcribed from the console (nothing 
written to /var/log/messages):

usb4-2: device not accepting address 4, error -71
(11,1): o2hb_write_timeout: 164 ERROR: heartbeat write timeout to device 
sda6 after 12000 miliseconds
(11,1): o2hb_stop_all_regions: 1724 ERROR: stopping heartbeat on all 
active regeons
Kernel Panic - not syncing: ocfs2 is very sorry to be fencing the system 
by panicing

Questions:
What does all this mean?
Why is nothing getting written to /var/log/messages?
Is this software really ready for prime time (honestly...)?

thanks,
Peter Sylvester
MITRE Corp.

Hartmut Wöhrle

2005-Nov-10 12:32 UTC

What were you doing at the time?

I have noticed problems when a session (bash) is open and its working
directory (pwd) is on the filesystem. Shutting down the machine without
closing that session makes the machine hang with this error message.

CU
Hartmut Woehrle

-- 
==========================================
    Hartmut Woehrle
    EMail: hartmut.woehrle@mail.pcom.de

Sunil Mushran

2005-Nov-10 12:42 UTC

What this means is that the heartbeat (hb) thread was unable to complete
a disk I/O for 12 secs and was forced to fence the node.

One solution is to increase this threshold time by specifying
it in /etc/sysconfig/o2cb.

O2CB_HEARTBEAT_THRESHOLD=14

The default value of 7 results in a timeout of 12 secs:
(O2CB_HEARTBEAT_THRESHOLD - 1) * 2 secs

Setting it to 14 will make it 26 secs.
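
The arithmetic above can be checked with a few lines of Python. This is only a sketch of the formula quoted in this message; the function names and the `interval_secs` parameter are illustrative assumptions and are not part of the o2cb tools.

```python
import math


def heartbeat_timeout_secs(threshold: int, interval_secs: int = 2) -> int:
    """Timeout implied by the formula (threshold - 1) * 2 secs."""
    return (threshold - 1) * interval_secs


def threshold_for_timeout(timeout_secs: float, interval_secs: int = 2) -> int:
    """Smallest threshold whose timeout is at least timeout_secs (hypothetical helper)."""
    return math.ceil(timeout_secs / interval_secs) + 1


for threshold in (7, 14):
    # 7 is the default mentioned above; 14 is the suggested value.
    print(threshold, heartbeat_timeout_secs(threshold))
```

Inverting the formula, as `threshold_for_timeout` does, may be handy when you start from a tolerable stall time rather than from a threshold value.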


Peter Sylvester

2005-Nov-10 13:00 UTC

Sunil,

Can you expand upon this explanation a bit?
What kind of I/O (disk, network, etc) are we talking about here, and 
under what conditions could it possibly take 12 seconds?
Disk I/O service time should be around 10ms for these (10K RPM SCSI) drives.
Remember that this is a single-node cluster, managing locally attached
disk, so it should only be talking to itself.

thanks,
Peter Sylvester


Sunil Mushran

2005-Nov-10 13:12 UTC

http://oss.oracle.com/projects/ocfs2/dist/documentation/ocfs2_faq.txt

Refer to the sections titled "Heartbeat" and "Quorum and Fencing".

What size I/Os were you performing when running iozone?


Eckenfels. Bernd

2005-Nov-10 13:17 UTC

It looks to me as if you have been playing around with the USB layer of
your box, which somehow locked up interrupt processing for too long.

Ocfs2 kills the machine under these conditions because it needs to be
sure it won't access data another node is looking at. Even when only a
single node is alive, it has to do this fencing, since another node
could show up in the meantime.

I suggest you never use USB block devices on Linux production servers.
As for the messages below, you may want to enable the kernel timestamp
option so you can see whether the messages are really related (i.e.
whether the fencing panic occurs exactly 12s after the USB problem).
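
As a rough sketch of that check, the Python below measures the gap between two kernel messages once timestamps are enabled. It assumes the usual `[seconds.microseconds]` prefix that kernel timestamping produces; the timestamps in `sample` are invented for illustration and do not come from the crashed system, and the helper names are hypothetical.

```python
import re

# Matches a kernel timestamp prefix such as "[ 86400.100000] message".
STAMP = re.compile(r"^\[\s*(\d+\.\d+)\]\s*(.*)$")


def parse_stamp(line):
    """Return (seconds, message), or None if the line has no timestamp."""
    match = STAMP.match(line)
    if match is None:
        return None
    return float(match.group(1)), match.group(2)


def gap_secs(lines, first_marker, second_marker):
    """Seconds between the first line containing first_marker and the
    next later line containing second_marker; None if either is missing."""
    first = None
    for line in lines:
        parsed = parse_stamp(line)
        if parsed is None:
            continue
        stamp, message = parsed
        if first is None and first_marker in message:
            first = stamp
        elif first is not None and second_marker in message:
            return stamp - first
    return None


# Hypothetical console lines: the timestamps are made up for this example.
sample = [
    "[ 86400.100000] usb4-2: device not accepting address 4, error -71",
    "[ 86412.100000] (11,1): o2hb_write_timeout: 164 ERROR: heartbeat "
    "write timeout to device sda6 after 12000 miliseconds",
]

gap = gap_secs(sample, "device not accepting address", "o2hb_write_timeout")
print(f"gap between USB error and heartbeat timeout: {gap:.1f} s")
```

A gap close to the configured heartbeat timeout would suggest the two events are related; a much larger gap would point elsewhere.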

Regards,
Bernd
>> usb4-2: device not accepting address 4, error -71
>> (11,1): o2hb_write_timeout: 164 ERROR: heartbeat write timeout to 
>> device sda6 after 12000 miliseconds

Peter Sylvester

2005-Nov-10 13:43 UTC

Sunil,

I was running the following iozone command (iozone version 3.248):
/iozone -az -e -q 4096 -n 1G -g 18G -b r5_ocfs2_iozone1.xls

Tool is available here:
http://iozone.org

This cycles through various tests, using "record" sizes of 4K through
4MB and file sizes ranging from 1GB to 18GB.
From the log file, it appears that it was about halfway through the
16GB file test, using 32K records.
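
For a sense of how large that sweep is, here is a small Python sketch that enumerates the size combinations. It assumes that iozone's automatic mode steps both file and record sizes by doubling, which is consistent with the 16GB / 32K point mentioned above but is not confirmed in this thread; the helper name and constants are illustrative only.

```python
def doubling_range(start, stop):
    """Values start, 2*start, 4*start, ... that do not exceed stop."""
    values = []
    value = start
    while value <= stop:
        values.append(value)
        value *= 2
    return values


KB_PER_GB = 1024 * 1024

# -n 1G and -g 18G: minimum and maximum file size, expressed here in KB.
file_sizes_kb = doubling_range(1 * KB_PER_GB, 18 * KB_PER_GB)

# 4K through 4MB (-q 4096): record sizes in KB.
record_sizes_kb = doubling_range(4, 4096)

print("file sizes (GB):", [size // KB_PER_GB for size in file_sizes_kb])
print("record sizes (KB):", record_sizes_kb)
print("size combinations:", len(file_sizes_kb) * len(record_sizes_kb))
```

Under that doubling assumption, 16GB is the last file size before the 18GB limit, so the lockup would have happened in the final file-size pass of the run.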

--Peter


