thr3ads.net - Ocfs2 users - [Ocfs2-users] ocfs2 crash with bugs reports (dlmmaster.c) [Feb 2011]

Piotr Teodorowski

2011-Feb-28 09:46 UTC

[Ocfs2-users] ocfs2 crash with bugs reports (dlmmaster.c)

Hi,

After the problem described in
http://oss.oracle.com/pipermail/ocfs2-users/2010-December/004854.html
we've upgraded kernels and ocfs2-tools on every node.

The present versions are:
kernel 2.6.32-bpo.5-amd64 (from debian lenny-backports)
ocfs2-tools 1.4.4-3 (from debian squeeze)

We didn't notice any problems in the logs until last Friday, when the whole
ocfs2 cluster crashed.

We know that it started with some problems on node 7 (esiprap01). It reported
an o2hb_write_timeout error and rebooted automatically.
Could you please explain what happened to the other nodes?
Some of them reported this bug:
kernel BUG at /tmp/buildd/linux-2.6-2.6.32/debian/build/source_amd64_none/fs/ocfs2/dlm/dlmmaster.c:241!
One of them (es1prap03 - node 4) reported:
kernel BUG at /tmp/buildd/linux-2.6-2.6.32/debian/build/source_amd64_none/fs/ocfs2/dlm/dlmmaster.c:3260!

We had trouble starting the cluster again. While one node was starting,
another crashed (it logged a stack trace, see attachments, and rebooted).
The only way to start the cluster was to stop almost all nodes and start them
one by one.

We didn't find what caused the problem with the first node (node 7), and we
don't expect that we will find it out. It probably wasn't a hardware problem.
The storage was responsive, and we don't have any errors in the storage event log.
The question is why the other nodes crashed.

The configuration is the same as it was in December (cluster.conf).

Regards,
Piotr Teodorowski
-------------- next part --------------
node:
	ip_port = 7777
	ip_address = 172.28.4.48
	number = 0
	name = es1prgw01
	cluster = ocfs2

node:
	ip_port = 7777
	ip_address = 172.28.4.56
	number = 1
	name = es4prgw01
	cluster = ocfs2

node:
	ip_port = 7777
	ip_address = 172.28.4.65
	number = 3
	name = es1prap02
	cluster = ocfs2

node:
	ip_port = 7777
	ip_address = 172.28.4.66
	number = 4
	name = es1prap03
	cluster = ocfs2

node:
	ip_port = 7777
	ip_address = 172.28.4.80
	number = 5
	name = es4prap01
	cluster = ocfs2

node:
	ip_port = 7777
	ip_address = 172.28.4.81
	number = 6
	name = es4prap02
	cluster = ocfs2

node:
	ip_port = 7777
	ip_address = 172.28.4.64
	number = 2
	name = es1prap01
	cluster = ocfs2

node:
	ip_port = 7777
	ip_address = 172.28.4.78
	number = 7
	name = esiprap01
	cluster = ocfs2

node:
	ip_port = 7777
	ip_address = 172.28.4.67
	number = 8
	name = es1prap04
	cluster = ocfs2

node:
	ip_port = 7777
	ip_address = 172.28.4.68
	number = 9
	name = es1prap05
	cluster = ocfs2

cluster:
	node_count = 10
	name = ocfs2
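
A configuration like the one above can be sanity-checked before bringing nodes back one by one. The following is only a minimal sketch in Python, not part of ocfs2-tools: the function names parse_cluster_conf and check_cluster_conf are hypothetical, and the checks (node_count matching the number of node stanzas, no duplicate number, name or ip_address) are assumptions about what is worth verifying, based purely on the stanza layout shown here.

```python
from collections import Counter


def parse_cluster_conf(text):
    """Parse stanza-style cluster.conf text into a list of (kind, attrs) pairs.

    A stanza starts with a header line such as "node:" or "cluster:" and is
    followed by indented "key = value" lines, as in the file above.
    """
    stanzas = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.endswith(":") and "=" not in line:
            current = (line[:-1], {})
            stanzas.append(current)
        elif "=" in line and current is not None:
            key, _, value = line.partition("=")
            current[1][key.strip()] = value.strip()
    return stanzas


def check_cluster_conf(text):
    """Return a list of consistency problems (empty if none were found)."""
    stanzas = parse_cluster_conf(text)
    nodes = [attrs for kind, attrs in stanzas if kind == "node"]
    clusters = [attrs for kind, attrs in stanzas if kind == "cluster"]
    problems = []

    # node_count in each cluster stanza should match its node stanzas.
    for cluster in clusters:
        members = [n for n in nodes if n.get("cluster") == cluster.get("name")]
        if str(len(members)) != cluster.get("node_count"):
            problems.append(
                "cluster %s: node_count = %s but %d node stanzas found"
                % (cluster.get("name"), cluster.get("node_count"), len(members))
            )

    # Node numbers, names and addresses should each be unique.
    for field in ("number", "name", "ip_address"):
        counts = Counter(n.get(field) for n in nodes)
        for value, count in counts.items():
            if count > 1:
                problems.append("duplicate %s: %s" % (field, value))
    return problems
```

Calling check_cluster_conf(open("/etc/ocfs2/cluster.conf").read()) on each node and comparing the results would be one way to use it; the path is the usual location but is an assumption here.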

-------------- next part --------------
A non-text attachment was scrubbed...
Name: netconsole.tgz
Type: application/x-compressed-tar
Size: 55465 bytes
Desc: not available
Url : http://oss.oracle.com/pipermail/ocfs2-users/attachments/20110228/2475c133/attachment-0002.bin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: messages.tgz
Type: application/x-compressed-tar
Size: 183445 bytes
Desc: not available
Url : http://oss.oracle.com/pipermail/ocfs2-users/attachments/20110228/2475c133/attachment-0003.bin

Sunil Mushran

2011-Mar-01 01:55 UTC

[Ocfs2-users] ocfs2 crash with bugs reports (dlmmaster.c)

Thanks for the bug report. Could you please file a bz and attach
all the message files? Yes, the problem started with the hb
timeout on esiprap01. It spread to the other nodes possibly
because of a race in migration. A bz will help us track the issue
more easily.


Piotr Teodorowski

2011-Mar-01 12:28 UTC

[Ocfs2-users] ocfs2 crash with bugs reports (dlmmaster.c)

Thanks for the quick response.
The bug is filed at:
http://oss.oracle.com/bugzilla/show_bug.cgi?id=1319

Regards,
Piotr Teodorowski
