Dear Sir,

We have an HPC setup with four OSS servers (OSS1 to OSS4) and two MDS nodes (MDS1 and MDS2). It ran without any problem until yesterday. This morning I found OSS4 shut down. I checked the OSS3 logs and found that fencing had been triggered. I have since powered OSS4 back on, and it is running now.

In the OSS4 logs I saw some "unreadable" errors, shown below:

Feb 26 04:24:43 oss4 smartd[9306]: Device: /dev/sda, 2 Currently unreadable (pending) sectors
Feb 26 04:54:43 oss4 smartd[9306]: Device: /dev/sda, 2 Currently unreadable (pending) sectors
Feb 26 05:24:43 oss4 smartd[9306]: Device: /dev/sda, 2 Currently unreadable (pending) sectors
Feb 26 05:54:43 oss4 smartd[9306]: Device: /dev/sda, 2 Currently unreadable (pending) sectors
Feb 26 06:24:43 oss4 smartd[9306]: Device: /dev/sda, 2 Currently unreadable (pending) sectors
Feb 26 06:54:43 oss4 smartd[9306]: Device: /dev/sda, 2 Currently unreadable (pending) sectors
Feb 26 07:24:43 oss4 smartd[9306]: Device: /dev/sda, 2 Currently unreadable (pending) sectors

/dev/sda is a local hard disk. Could this error have caused the node fencing? Would running e2fsck resolve the issue?

I have attached the /var/log/messages files from OSS3 and OSS4. Could anybody please analyse the logs and advise me on what to do?

Thanks & Regards,
VIJESH

-------------- next part --------------
A non-text attachment was scrubbed...
Name: new_oss3_messages
Type: application/octet-stream
Size: 22673 bytes
Desc: not available
Url : http://lists.lustre.org/pipermail/lustre-discuss/attachments/20120227/c613c2f0/attachment-0002.obj
-------------- next part --------------
A non-text attachment was scrubbed...
Name: new_oss4_messages
Type: application/octet-stream
Size: 98363 bytes
Desc: not available
Url : http://lists.lustre.org/pipermail/lustre-discuss/attachments/20120227/c613c2f0/attachment-0003.obj
Hi Vijesh,

Most likely oss4 crashed, probably with a kernel panic, because of the faulty local disk, which I guess holds oss4's OS. The crash cut off the heartbeat (openais) communication between oss3 and oss4, which triggered fencing and failover.

Best regards,

Wojciech

On 27 February 2012 06:40, VIJESH EK <ekvijesh at gmail.com> wrote:
> /dev/sda is a local hard disk. Is it possible the Node fencing is due to
> this error ?
> While running the e2fsck will resolve this issue ?
>
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss
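
To judge whether the local disk was degrading before the crash, it can help to pull the smartd entries out of /var/log/messages and see whether the pending-sector count stayed flat or grew. The following is only a minimal sketch, assuming the syslog line format quoted in the original message; the function name `pending_sector_history` and the regular expression are illustrative and not part of smartmontools or Lustre.

```python
import re

# Matches smartd lines in the format quoted in this thread, e.g.
# "Feb 26 04:24:43 oss4 smartd[9306]: Device: /dev/sda, 2 Currently unreadable (pending) sectors"
# Assumption: the log uses the classic syslog timestamp (no year).
PENDING_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) smartd\[\d+\]: "
    r"Device: (?P<device>\S+), (?P<count>\d+) "
    r"Currently unreadable \(pending\) sectors"
)


def pending_sector_history(lines):
    """Return (timestamp, host, device, count) for each matching smartd line."""
    history = []
    for line in lines:
        match = PENDING_RE.match(line.strip())
        if match:
            history.append(
                (
                    match.group("stamp"),
                    match.group("host"),
                    match.group("device"),
                    int(match.group("count")),
                )
            )
    return history


# Sample input: three of the lines quoted in the original message.
sample = [
    "Feb 26 04:24:43 oss4 smartd[9306]: Device: /dev/sda, 2 Currently unreadable (pending) sectors",
    "Feb 26 04:54:43 oss4 smartd[9306]: Device: /dev/sda, 2 Currently unreadable (pending) sectors",
    "Feb 26 07:24:43 oss4 smartd[9306]: Device: /dev/sda, 2 Currently unreadable (pending) sectors",
]

history = pending_sector_history(sample)
counts = [entry[3] for entry in history]
# True if the count increased at any point in the captured window.
growing = any(later > earlier for earlier, later in zip(counts, counts[1:]))

for stamp, host, device, count in history:
    print(stamp, host, device, count)
print("pending count growing:", growing)
```

On a real system the lines would come from the log file itself, for example `pending_sector_history(open("/var/log/messages"))`. A flat count, as in the lines quoted here, still indicates unreadable sectors on the disk; it only shows that no new ones were reported during that window.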
It would also be a good idea to capture the console output, as this would give us more details of what actually happened to oss4.

On 27 February 2012 18:57, Wojciech Turek <wjt27 at cam.ac.uk> wrote:
> Most likely your oss4 crashed probably with kernel panic due to faulty
> local disk which I guess holds oss4's OS. This caused lack of communication
> between (heartbeat) openais nodes oss3-oss4 and triggered fencing and
> failover.