Peter Wood
2013-Mar-20 15:34 UTC
System started crashing hard after zpool reconfigure and OI upgrade
I have two identical Supermicro boxes with 32GB RAM. Hardware details are at the end of the message.

They were running OI 151.a.5 for months. The zpool configuration was one storage zpool with 3 vdevs of 8 disks in RAIDZ2.

The OI installation is absolutely clean. Just next-next-next until done. All I do is configure the network after install. I don't install or enable any other services.

Then I added more disks and rebuilt the systems with OI 151.a.7, and this time configured the zpool with 6 vdevs of 5 disks in RAIDZ.

The systems started crashing really badly. They just disappear from the network: black and unresponsive console, no error lights, but no activity indication either. The only way out is to power cycle the system.

There is no pattern to the crashes. It may crash in 2 days, it may crash in 2 hours.

I upgraded the memory on both systems to 128GB, to no avail. This is the max memory they can take.

In summary, all I did was upgrade to OI 151.a.7 and reconfigure the zpool.

Any idea what could be the problem?

Thank you

-- Peter

Supermicro X9DRH-iF
Xeon E5-2620 @ 2.0 GHz 6-Core
LSI SAS9211-8i HBA
32x 3TB Hitachi HUS723030ALS640, SAS, 7.2K

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
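[Editor's note: for reference, the before and after pool layouts described above would correspond to `zpool create` commands roughly like the following. Pool and disk device names here are placeholders, not taken from the message.]

```shell
# Old layout: 3 RAIDZ2 vdevs of 8 disks each (24 disks,
# 2-disk redundancy per vdev)
zpool create tank \
  raidz2 c0t0d0 c0t1d0 c0t2d0 c0t3d0 c0t4d0 c0t5d0 c0t6d0 c0t7d0 \
  raidz2 c0t8d0 c0t9d0 c0t10d0 c0t11d0 c0t12d0 c0t13d0 c0t14d0 c0t15d0 \
  raidz2 c0t16d0 c0t17d0 c0t18d0 c0t19d0 c0t20d0 c0t21d0 c0t22d0 c0t23d0

# New layout: 6 RAIDZ1 vdevs of 5 disks each (30 disks,
# only 1-disk redundancy per vdev)
zpool create vol01 \
  raidz c1t0d0  c1t1d0  c1t2d0  c1t3d0  c1t4d0 \
  raidz c1t5d0  c1t6d0  c1t7d0  c1t8d0  c1t9d0 \
  raidz c1t10d0 c1t11d0 c1t12d0 c1t13d0 c1t14d0 \
  raidz c1t15d0 c1t16d0 c1t17d0 c1t18d0 c1t19d0 \
  raidz c1t20d0 c1t21d0 c1t22d0 c1t23d0 c1t24d0 \
  raidz c1t25d0 c1t26d0 c1t27d0 c1t28d0 c1t29d0
```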
Michael Schuster
2013-Mar-20 15:38 UTC
Re: System started crashing hard after zpool reconfigure and OI upgrade
Peter,

sorry if this is so obvious that you didn't mention it: have you checked /var/adm/messages and other diagnostic tool output?

regards
Michael

On Wed, Mar 20, 2013 at 4:34 PM, Peter Wood <peterwood.sd@gmail.com> wrote:
> [original message quoted in full; trimmed]

--
Michael Schuster
http://recursiveramblings.wordpress.com/
Will Murnane
2013-Mar-20 15:40 UTC
Re: System started crashing hard after zpool reconfigure and OI upgrade
Does the Supermicro IPMI show anything when it crashes? Does anything show up in event logs in the BIOS, or in system logs under OI?

On Wed, Mar 20, 2013 at 11:34 AM, Peter Wood <peterwood.sd@gmail.com> wrote:
> [original message quoted in full; trimmed]
Peter Wood
2013-Mar-20 15:50 UTC
Re: System started crashing hard after zpool reconfigure and OI upgrade
I'm sorry, I should have mentioned that I can't find any errors in the logs. The last entry in /var/adm/messages is that I removed the keyboard after the last reboot, and then it shows the new boot-up messages from when I boot the system after the crash. The BIOS log is empty. I'm not sure how to check the IPMI, but IPMI is not configured and I'm not using it.

Just another observation: the crashes are more intense the more data the system serves (NFS).

I'm looking into firmware upgrades for the LSI now.

On Wed, Mar 20, 2013 at 8:40 AM, Will Murnane <will.murnane@gmail.com> wrote:
> [previous messages quoted in full; trimmed]
Michael Schuster
2013-Mar-20 15:53 UTC
Re: System started crashing hard after zpool reconfigure and OI upgrade
How about crash dumps?

michael

On Wed, Mar 20, 2013 at 4:50 PM, Peter Wood <peterwood.sd@gmail.com> wrote:
> [previous messages quoted in full; trimmed]

--
Michael Schuster
http://recursiveramblings.wordpress.com/
Peter Wood
2013-Mar-20 16:15 UTC
Re: System started crashing hard after zpool reconfigure and OI upgrade
I'm going to need some help with the crash dumps. I'm not very familiar with Solaris.

Do I have to enable something to get the crash dumps? Where should I look for them?

Thanks for the help.

On Wed, Mar 20, 2013 at 8:53 AM, Michael Schuster <michaelsprivate@gmail.com> wrote:
> [previous messages quoted in full; trimmed]
Jim Klimov
2013-Mar-20 18:29 UTC
Re: System started crashing hard after zpool reconfigure and OI upgrade
On 2013-03-20 17:15, Peter Wood wrote:
> Do I have to enable something to get the crash dumps? Where should I
> look for them?

Typically the kernel crash dumps are created as a result of a kernel panic; they may also be forced by administrative actions like an NMI. They require you to configure a dump volume of sufficient size (see dumpadm) and a /var/crash, which may be a dataset on a large enough pool; after the reboot, the dump data will be migrated there.

To "help" with the hangs you can try the BIOS watchdog (which would require a bmc driver; the one known from OpenSolaris is alas not open-sourced and not redistributable), or a software deadman timer:

http://www.cuddletech.com/blog/pivot/entry.php?id=1044
http://wiki.illumos.org/display/illumos/System+Hangs

Also, if you configure "crash dump on NMI" and set up your IPMI card, then you can likely gain remote access to both the server console ("physical" and/or serial) and may be able to trigger the NMI, too.

HTH,
//Jim
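[Editor's note: Jim's dump-volume advice corresponds to a short command sequence with the standard Solaris/illumos dumpadm tool. This is a sketch; the dump zvol name and size are illustrative, not from the thread.]

```shell
# Show the current crash-dump configuration
dumpadm

# Point the dump device at a dedicated zvol of sufficient size
# (name and size are examples), and have savecore write dumps
# into /var/crash after the next boot
zfs create -V 40g rpool/dump2
dumpadm -d /dev/zvol/dsk/rpool/dump2
dumpadm -s /var/crash

# Dump kernel pages only -- usually enough for panic analysis
dumpadm -c kernel
```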
Peter Wood
2013-Mar-20 19:38 UTC
Re: System started crashing hard after zpool reconfigure and OI upgrade
Hi Jim,

Thanks for the pointers. I'll definitely look into this.

-- Peter Blajev
IT Manager, TAAZ Inc.
Office: 858-597-0512 x125

On Wed, Mar 20, 2013 at 11:29 AM, Jim Klimov <jimklimov@cos.ru> wrote:
> [previous messages quoted in full; trimmed]
Peter Wood
2013-Mar-20 19:42 UTC
Re: [BULK] System started crashing hard after zpool reconfigure and OI upgrade
No problem, Trey. Anything will help. Yes, I did a clean install, overwriting the old OS.

> Just to make sure, you actually did an overwrite reinstall with OI151a7
> rather than upgrading the existing OS images? If you did a pkg
> image-update, you should be able to boot back into the oi151a5 image from
> grub. Apologies in advance if I'm stating the obvious.
>
> -- Trey
>
> On Mar 20, 2013, at 11:34 AM, "Peter Wood" <peterwood.sd@gmail.com> wrote:
> > [original message quoted in full; trimmed]
Jens Elkner
2013-Mar-20 20:08 UTC
Re: System started crashing hard after zpool reconfigure and OI upgrade
On Wed, Mar 20, 2013 at 08:50:40AM -0700, Peter Wood wrote:
> I'm not sure how to check the IPMI but IPMI is not configured and
> I'm not using it.

You definitely should! Plug a cable into the dedicated network port and configure it (the easiest way for you is probably to jump into the BIOS and assign the appropriate IP address etc.). Then, for a quick look, point your browser at the given IP, port 80 (default login is ADMIN/ADMIN). You may also now configure some other details (accounts/passwords/roles).

To track the problem, either write a script which polls the parameters in question periodically, or just install the latest ipmiViewer and use it to monitor your sensors ad hoc. See ftp://ftp.supermicro.com/utility/IPMIView/

> Just another observation - the crashes are more intense the more data
> the system serves (NFS).
> I'm looking into firmware upgrades for the LSI now.

Latest LSI FW should be P15; for this MB, type 217 (2.17) and MB-BIOS C28 (1.0b). However, I doubt that your problem has anything to do with the SAS ctrl or OI or ZFS.

My guess is that either your MB is broken (we had an X9DRH-iF which instantly "disappeared" as soon as it got some real load) or you have a heat problem (watch your CPU temp, e.g. via ipmiviewer). With 2GHz that's not very likely, but worth a try (socket placement on this board is not really smart IMHO).

To test quickly:
- disable all additional, unneeded services in OI which may put some load on the machine (like NFS service, http and bla) and perhaps even export unneeded pools (just to be sure)
- fire up your ipmiviewer and look at the sensors (set update to 10s), or refresh manually often
- start 'openssl speed -multi 32' and keep watching your CPU temp sensors (with 2GHz I guess it takes ~ 12 min)

I guess your machine "disappears" before the CPUs get really hot (broken MB). If the CPUs switch off (usually first CPU2 and a little bit later CPU1), you have a cooling problem. If nothing happens, well, then it could be an OI or ZFS problem ;-)

Have fun,
jel.
--
Otto-von-Guericke University    http://www.cs.uni-magdeburg.de/
Department of Computer Science  Geb. 29 R 027, Universitaetsplatz 2
39106 Magdeburg, Germany        Tel: +49 391 67 52768
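[Editor's note: Jens's suggestion of a script that polls the sensors could look like the sketch below, assuming the command-line ipmitool utility rather than the graphical IPMIView (either talks to the same BMC). The IP address and log path are placeholders; ADMIN/ADMIN is the Supermicro default login he mentions. Logging to another host means the readings survive the hang.]

```shell
#!/bin/sh
# Poll the BMC sensor repository (CPU temps, fan speeds, PSU status)
# every 10 seconds and append readings to a log on a remote host.
BMC_IP=10.20.1.99     # placeholder: whatever IP you assign to the BMC
BMC_USER=ADMIN        # Supermicro factory default
BMC_PASS=ADMIN

while true; do
    date
    ipmitool -I lanplus -H "$BMC_IP" -U "$BMC_USER" -P "$BMC_PASS" sdr elist
    sleep 10
done >> /net/otherhost/var/log/bmc-sensors.log 2>&1
```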
Peter Wood
2013-Mar-20 20:45 UTC
Re: System started crashing hard after zpool reconfigure and OI upgrade
Great write-up, Jens.

The chance of two MBs being broken is probably low, but overheating is a very good point. It was on my to-do list to set up IPMI, and it seems that now is the best time to do it.

Thanks

On Wed, Mar 20, 2013 at 1:08 PM, Jens Elkner <jel+zfs@cs.uni-magdeburg.de> wrote:
> [previous messages quoted in full; trimmed]
Peter Wood
2013-Mar-20 21:34 UTC
Re: System started crashing hard after zpool reconfigure and OI upgrade
I can reproduce the problem. I can crash the system.

Here are the steps I took (some steps may not be needed, but I haven't tested that):

- Clean install of OI 151.a.7 on the Supermicro hardware described above (32GB RAM though, not the 128GB)
- Create 1 zpool, 6 raidz vdevs with 5 drives each
- NFS-export a dataset:
  zfs set sharenfs="rw=@10.20.1/24" vol01/htmlspace
- Create a zfs child dataset:
  zfs create vol01/htmlspace/A
  $ zfs get -H sharenfs vol01/htmlspace/A
  vol01/htmlspace/A  sharenfs  rw=@10.20.1/24  inherited from vol01/htmlspace
- Stop NFS sharing for the child dataset:
  zfs set sharenfs=off vol01/htmlspace/A

The crash is instant after the sharenfs=off command. I thought it was a coincidence, so after reboot I tried it on another dataset. Instant crash again. I get my prompt back, but that's it. The system is gone after that.

The NFS-exported file systems are not accessed by any system on the network. They are not in use. That's why I wanted to stop exporting them. And even if they were in use, this shouldn't crash the system, right?

I can't try the other box because it is heavily in production. At least not until later tonight. I thought I'd collect some advice to make each crash as useful as possible. Any pointers are appreciated.

Thanks,

-- Peter

On Wed, Mar 20, 2013 at 8:34 AM, Peter Wood <peterwood.sd@gmail.com> wrote:
> [original message quoted in full; trimmed]

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
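[Editor's note: the reproduction steps above, consolidated into a single script. Pool and dataset names are taken from the message; disk device names and the intermediate `zfs create vol01/htmlspace` are assumptions, since the message does not show them. On the affected systems, the last command triggered the hang, so run with care.]

```shell
#!/bin/sh
# Reproduction sketch: share a parent dataset over NFS, create a
# child that inherits sharenfs, then switch sharing off on the child.
zpool create vol01 \
  raidz c1t0d0  c1t1d0  c1t2d0  c1t3d0  c1t4d0 \
  raidz c1t5d0  c1t6d0  c1t7d0  c1t8d0  c1t9d0 \
  raidz c1t10d0 c1t11d0 c1t12d0 c1t13d0 c1t14d0 \
  raidz c1t15d0 c1t16d0 c1t17d0 c1t18d0 c1t19d0 \
  raidz c1t20d0 c1t21d0 c1t22d0 c1t23d0 c1t24d0 \
  raidz c1t25d0 c1t26d0 c1t27d0 c1t28d0 c1t29d0

zfs create vol01/htmlspace                        # assumed to pre-exist
zfs set sharenfs="rw=@10.20.1/24" vol01/htmlspace # export the parent
zfs create vol01/htmlspace/A                      # child inherits sharenfs
zfs get -H sharenfs vol01/htmlspace/A             # shows the inherited value

# On the affected systems, this command hung the box instantly:
zfs set sharenfs=off vol01/htmlspace/A
```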