Ms. Megan Larko
2009-Feb-20 19:10 UTC
[Lustre-discuss] OST denied reconnect to MGS; kernel panic led to system crash
Hello,

I had an unusual event on my Lustre OSS computer today. At 6:20 a.m. there seemed to be some sort of communication snafu. One of several OSTs on my OSS would not communicate with the MGS (the other OSTs did not generate any such communication error). The error seemed to lead to a kernel panic and a crash. I am attaching the February 20 sections of the OSS /var/log/messages file: lustre.error.20Feb09.gz

The system tried to communicate with the OST crew8-OST0009 a couple of times. Then some sort of system memory error seems to have occurred. The Lustre error number was -16, which I did not see in http://manual.lustre.org/manual/LustreManual16_HTML/LustreTroubleshootingTips.html.

I could not find -16 in any of the following errno.h files I checked:

    [root@oss4 log]# vi /usr/include/asm-x86_64/errno.h
    [root@oss4 log]# vi /usr/include/errno.h
    [root@oss4 log]# vi /usr/include/asm/errno.h
    [root@oss4 log]# vi /usr/include/linux/errno.h
    [root@oss4 log]# vi /usr/include/sys/errno.h

The drives are contained in 16-bay JBODs connected to a server via an LSI 1078 card. The LSI utility MegaCli64 gave no indication of any errors on either the card or any of the drives.

The OSS in question did reboot and (eventually) recover all the disks on its own.

This is 2.6.18-53.1.13.el5_lustre.1.6.4.3smp on CentOS 5.

Any insights? Thoughts welcome.

megan

-------------- next part --------------
A non-text attachment was scrubbed...
Name: lustre.error.20Feb09.gz
Type: application/x-gzip
Size: 18904 bytes
Desc: not available
Url : http://lists.lustre.org/pipermail/lustre-discuss/attachments/20090220/8143d4d0/attachment-0001.bin
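
Note on looking up the error number: kernel and Lustre log messages usually report errors as negated errno values, so -16 corresponds to errno 16. On Linux the low-numbered codes are typically defined in a header that errno.h only pulls in indirectly (commonly asm-generic/errno-base.h), which may be why the files listed above did not show it. As a minimal sketch, assuming a Python interpreter is available on the node, the code can be translated without hunting through headers. The helper name describe_errno is hypothetical and used only for illustration.

```python
import errno
import os


def describe_errno(code: int) -> str:
    """Translate a kernel-style (possibly negative) error code.

    Returns a string of the form 'NAME: message'. The name comes from
    Python's errno table; the message comes from the C library.
    """
    number = abs(code)  # log messages report -16, errno tables use 16
    name = errno.errorcode.get(number, "UNKNOWN")
    return f"{name}: {os.strerror(number)}"


if __name__ == "__main__":
    # On Linux, 16 maps to EBUSY; the message text depends on the C library.
    print(describe_errno(-16))
```

This only identifies the symbolic name of the code. It does not by itself explain why the OST refused the reconnect or what caused the panic; the attached log is still needed for that.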