Piotr Teodorowski
2011-Feb-28 09:46 UTC
[Ocfs2-users] ocfs2 crash with bugs reports (dlmmaster.c)
Hi,

After the problem described in http://oss.oracle.com/pipermail/ocfs2-users/2010-December/004854.html we upgraded the kernels and ocfs2-tools on every node.

The present versions are:
kernel 2.6.32-bpo.5-amd64 (from debian lenny-backports)
ocfs2-tools 1.4.4-3 (from debian squeeze)

We didn't notice any problems in the logs until last Friday, when the whole ocfs2 cluster crashed.

We know that it started with some problems on node 7 (esiprap01). It reported an o2hb_write_timeout error and rebooted automatically.
Could you please explain what happened to the other nodes?
Some of them reported this bug:
kernel BUG at /tmp/buildd/linux-2.6-2.6.32/debian/build/source_amd64_none/fs/ocfs2/dlm/dlmmaster.c:241!
One of them (es1prap03 - node 4) reported:
kernel BUG at /tmp/buildd/linux-2.6-2.6.32/debian/build/source_amd64_none/fs/ocfs2/dlm/dlmmaster.c:3260!

We had trouble starting the cluster again. While one node was starting, another one crashed (it logged a stack trace, see attachments, and rebooted). The only way to start the cluster was to stop almost all nodes and start them one by one.

We didn't find what caused the problem on the first node (node 7), and we don't expect that we will find it. It probably wasn't a hardware problem. The storage was responsive, and we don't have any errors in the storage event log.
The question is why the other nodes crashed.

The configuration is the same as it was in December (cluster.conf).

Regards,
Piotr Teodorowski

-------------- next part --------------
node:
    ip_port = 7777
    ip_address = 172.28.4.48
    number = 0
    name = es1prgw01
    cluster = ocfs2

node:
    ip_port = 7777
    ip_address = 172.28.4.56
    number = 1
    name = es4prgw01
    cluster = ocfs2

node:
    ip_port = 7777
    ip_address = 172.28.4.65
    number = 3
    name = es1prap02
    cluster = ocfs2

node:
    ip_port = 7777
    ip_address = 172.28.4.66
    number = 4
    name = es1prap03
    cluster = ocfs2

node:
    ip_port = 7777
    ip_address = 172.28.4.80
    number = 5
    name = es4prap01
    cluster = ocfs2

node:
    ip_port = 7777
    ip_address = 172.28.4.81
    number = 6
    name = es4prap02
    cluster = ocfs2

node:
    ip_port = 7777
    ip_address = 172.28.4.64
    number = 2
    name = es1prap01
    cluster = ocfs2

node:
    ip_port = 7777
    ip_address = 172.28.4.78
    number = 7
    name = esiprap01
    cluster = ocfs2

node:
    ip_port = 7777
    ip_address = 172.28.4.67
    number = 8
    name = es1prap04
    cluster = ocfs2

node:
    ip_port = 7777
    ip_address = 172.28.4.68
    number = 9
    name = es1prap05
    cluster = ocfs2

cluster:
    node_count = 10
    name = ocfs2

-------------- next part --------------
A non-text attachment was scrubbed...
Name: netconsole.tgz
Type: application/x-compressed-tar
Size: 55465 bytes
Desc: not available
Url : http://oss.oracle.com/pipermail/ocfs2-users/attachments/20110228/2475c133/attachment-0002.bin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: messages.tgz
Type: application/x-compressed-tar
Size: 183445 bytes
Desc: not available
Url : http://oss.oracle.com/pipermail/ocfs2-users/attachments/20110228/2475c133/attachment-0003.bin
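
[Editorial note] Since the thread stresses that cluster.conf is unchanged, a quick consistency check of the file on each node can rule out configuration drift. The following is only a minimal sketch, not part of ocfs2-tools: the function names are hypothetical, the parser assumes the simple "stanza:" / "key = value" layout shown above, and the sample is a two-node excerpt of the posted file with node_count reduced to 2 so that it is self-consistent.

```python
def parse_cluster_conf(text):
    """Parse 'stanza:' / 'key = value' text into a list of (stanza_type, dict) pairs."""
    stanzas = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.endswith(":") and "=" not in line:
            # A new stanza header such as "node:" or "cluster:".
            current = (line[:-1], {})
            stanzas.append(current)
        elif "=" in line and current is not None:
            key, value = line.split("=", 1)
            current[1][key.strip()] = value.strip()
    return stanzas


def check_cluster_conf(text):
    """Return a list of human-readable problems; an empty list means none were found."""
    stanzas = parse_cluster_conf(text)
    nodes = [fields for kind, fields in stanzas if kind == "node"]
    clusters = [fields for kind, fields in stanzas if kind == "cluster"]
    problems = []

    # Node numbers, names and addresses should each be unique.
    for field in ("number", "name", "ip_address"):
        seen = set()
        for node in nodes:
            value = node.get(field)
            if value is None:
                continue
            if value in seen:
                problems.append(f"duplicate {field}: {value}")
            seen.add(value)

    # The declared node_count should match the number of node stanzas.
    for cluster in clusters:
        declared = int(cluster.get("node_count", "0"))
        actual = sum(1 for n in nodes if n.get("cluster") == cluster.get("name"))
        if declared != actual:
            problems.append(
                f"cluster {cluster.get('name')}: node_count = {declared} "
                f"but {actual} node stanzas found"
            )
    return problems


# Two-node excerpt of the posted file (node_count adjusted for the excerpt).
SAMPLE = """
node:
    ip_port = 7777
    ip_address = 172.28.4.64
    number = 2
    name = es1prap01
    cluster = ocfs2

node:
    ip_port = 7777
    ip_address = 172.28.4.78
    number = 7
    name = esiprap01
    cluster = ocfs2

cluster:
    node_count = 2
    name = ocfs2
"""

if __name__ == "__main__":
    for problem in check_cluster_conf(SAMPLE) or ["no problems found"]:
        print(problem)
```

Note that es1prap01 (node 2) and esiprap01 (node 7) are distinct names in the posted file, so a check like this would not flag them as duplicates.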
Sunil Mushran
2011-Mar-01 01:55 UTC
[Ocfs2-users] ocfs2 crash with bugs reports (dlmmaster.c)
Thanks for the bug report. Could you please file a bz and attach all the message files? Yes, the problem started with the hb timeout on esiprap01. It possibly spread to the other nodes because of a race in migration. A bz will help us track the issue more easily.
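
[Editorial note] When gathering message files for a bug report like the one requested here, it can help to summarise which BUG sites appear in the logs. The sketch below is an assumption-laden helper, not an ocfs2 tool: the function name is hypothetical, and the only log format it relies on is the "kernel BUG at <path>:<line>!" text quoted in the original report.

```python
import re
from collections import Counter

# Matches the "kernel BUG at <path>:<line>!" text quoted in the report.
BUG_RE = re.compile(r"kernel BUG at (?P<path>\S+?):(?P<line>\d+)!")


def bug_sites(log_lines):
    """Count occurrences of each (source file, line number) BUG site."""
    counts = Counter()
    for line in log_lines:
        match = BUG_RE.search(line)
        if match:
            # Keep only the file name; the build prefix differs per package build.
            filename = match.group("path").rsplit("/", 1)[-1]
            counts[(filename, int(match.group("line")))] += 1
    return counts


# The two BUG messages quoted in the report, plus one unrelated line.
EXAMPLE_LINES = [
    "kernel BUG at /tmp/buildd/linux-2.6-2.6.32/debian/build/"
    "source_amd64_none/fs/ocfs2/dlm/dlmmaster.c:241!",
    "kernel BUG at /tmp/buildd/linux-2.6-2.6.32/debian/build/"
    "source_amd64_none/fs/ocfs2/dlm/dlmmaster.c:3260!",
    "an unrelated log line",
]

if __name__ == "__main__":
    for (filename, lineno), count in sorted(bug_sites(EXAMPLE_LINES).items()):
        print(f"{filename}:{lineno} seen {count} time(s)")
```

Running it over each node's messages file separately would show which nodes hit dlmmaster.c:241 and which hit dlmmaster.c:3260.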
Piotr Teodorowski
2011-Mar-01 12:28 UTC
[Ocfs2-users] ocfs2 crash with bugs reports (dlmmaster.c)
Thanks for the quick response. The bug: http://oss.oracle.com/bugzilla/show_bug.cgi?id=1319

Regards,
Piotr Teodorowski