Freddie Cash
2012-Jan-09 17:40 UTC
Upgrade from 8.2-STABLE to 9.0-RELEASE wedges on SuperMicro H8DGiF-based system
Good morning,

Just wondering if anyone else has run into a similar issue.

We have a ZFS storage server that was running 8.2-STABLE (from around beginning of Dec 2011) without any issues, that was upgraded to 9.0-RELEASE (to consolidate all the ZFS and networking fixes/updates and bring it up to version parity with our other ZFS storage server running 9.0) last Thursday. The "svn switch" of the source tree, the buildworld, the buildkernel, the installkernel, the reboot with the new kernel, the installworld, the reboot into the new world, the mergemaster processes all completed successfully. About half-way through the "make delete-old" process, the box locked up. No messages on the console, no log entries of any kind, everything just stopped. Had to do a power-cycle. And then everything went to hell. :(

On reboot, the loader complained about not being able to determine which disk it was booting from (even though the new loader had already booted at least once), and gave strange messages about panic/free/something or other (didn't write that error down).

I was able to boot using a 9.0 install CD, drop to a loader prompt, unload the kernel/modules from CD, load the kernel/modules from the harddrive, set currdev to the harddrive, and boot. But no matter what I did (gpart bootcode using pmbr/gptboot from CD or from HD; copy loader from CD, copy /boot from CD), I could not get the loader on the HD to load the kernel; always gave the same error message: can't determine which disk we're booting from.

After trying for 24 hours to make it work, I just re-installed off the 9.0-RELEASE CD.

Now, this box (alphadrive) will freeze after running for between 3 and 10 hours. Even when left completely idle, it will lock up after about 3 hours. :(

I have another system (betadrive) that's almost identical hardware (chassis, backplane, SATA controllers are different, everything else is the same) that went from 8.2-STABLE to 9.0-RC2 to 9.0-RC3 to 9.0-RELEASE without any issues. I've tried copying /boot/loader.conf, /etc/make.conf, /etc/src.conf, /etc/sysctl.conf, /etc/rc.conf from betadrive to alphadrive, without any change in the freezing behaviour.

These are ZFS storage systems, with / (UFS) and swap on SSDs, with 16 or 24 SATA HDs in the pool (3x 5-disk raidz2 + spare and 4x 6-disk raidz2 resp). All of the ZFS settings are identical between the two systems (pool name, pool properties, ZFS filesystems, ZFS properties per filesystem). Dedupe and compression (LZJB) are enabled on both systems.

When alphadrive locks up, there are no entries made in any log files; there are no log entries on the console; there are no entries in the BIOS event log; there are no entries in the IPMI event log; the CPU/case temps are below 40C (emergency shutoff is 75C) as shown via IPMI; RAM usage is under 20 GB (24 GB per box) with the lowest being under 2 GB used (I run top on the console so I can see the stats when it locks up, and the time it locks up). It just ... stops.

The system will even lock up when running in single-user mode, with only / mounted (ZFS not loaded, zpool not imported).

Hardware (alphadrive):
  Chenbro 5U rackmount chassis with 24 hot-swap drive bays
  SuperMicro H8DGi-F motherboard
  AMD Opteron 2218 CPU (8-cores at 2.0 GHz)
  24 GB DDR3-SDRAM
  3x SuperMicro AOC-USAS-L8i SATA controllers (multi-lane break-out cables)
  8x Seagate 7200.12 1.5 TB SATA harddrives
  16x WD RE4 1.0 TB SATA harddrives
  1x Kingston 60 GB SSD (for /, swap, L2ARC)

Hardware (betadrive):
  SuperMicro 4U rackmount chassis with 16 hot-swap drive bays
  SuperMicro H8DGi-F motherboard
  AMD Opteron 2218 CPU (8-cores at 2.0 GHz)
  24 GB DDR3-SDRAM
  2x SuperMicro AOC-USAS2-L8i SATA controllers (multi-lane cables)
  16x WD RE4 2.0 TB SATA harddrives
  1x Kingston 60 GB SSD (for /, swap, L2ARC)

betadrive runs perfectly with FreeBSD 9.0-RELEASE.
alphadrive locks up with FreeBSD 9.0-RELEASE.

We're currently investigating hardware firmware revisions to see if anything else is different between the two systems.

Has anyone experienced anything similar? Does anyone have any ideas on what to look for? Any suggestions on what to try next?

-- 
Freddie Cash
fjwcash@gmail.com
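One way to keep track of where the two boxes differ is to diff the two hardware lists mechanically. What follows is only a minimal Python sketch, not anything used in the thread: the dictionary keys and the helper name `differing_components` are made up for illustration, and the values are transcribed from the lists in the post. The CPU model is left out, since it is the same on both boxes and is corrected later in the thread.

```python
# Inventories transcribed from the post. The keys are illustrative labels
# chosen for this sketch, not names produced by any FreeBSD or IPMI tool.
alphadrive = {
    "chassis": "Chenbro 5U rackmount, 24 hot-swap bays",
    "motherboard": "SuperMicro H8DGi-F",
    "ram": "24 GB DDR3-SDRAM",
    "sata_controllers": "3x SuperMicro AOC-USAS-L8i",
    "pool_disks": "8x Seagate 7200.12 1.5 TB + 16x WD RE4 1.0 TB",
    "ssd": "1x Kingston 60 GB",
}

betadrive = {
    "chassis": "SuperMicro 4U rackmount, 16 hot-swap bays",
    "motherboard": "SuperMicro H8DGi-F",
    "ram": "24 GB DDR3-SDRAM",
    "sata_controllers": "2x SuperMicro AOC-USAS2-L8i",
    "pool_disks": "16x WD RE4 2.0 TB",
    "ssd": "1x Kingston 60 GB",
}


def differing_components(a, b):
    """Return {component: (value_in_a, value_in_b)} for components that differ."""
    components = sorted(set(a) | set(b))
    return {c: (a.get(c), b.get(c)) for c in components if a.get(c) != b.get(c)}


if __name__ == "__main__":
    for component, (alpha, beta) in differing_components(alphadrive, betadrive).items():
        print(f"{component}: alphadrive={alpha!r} betadrive={beta!r}")
```

With the values above this reports the chassis, the pool disks and the SATA controllers, which agrees with the post's own summary of what differs (the backplane is not itemised in the lists). Firmware and BIOS revisions could be added as further keys once they are known.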
Freddie Cash
2012-Jan-09 17:56 UTC
Upgrade from 8.2-STABLE to 9.0-RELEASE wedges on SuperMicro H8DGiF-based system
On Mon, Jan 9, 2012 at 9:50 AM, John Nielsen <lists@jnielsen.net> wrote:
> From what you've said I strongly suspect that you have some kind of
> hardware issue. Dodgy RAM is my first guess, something cooling-related
> is my 2nd, and PSU is my 3rd. It is a little suspicious that you only
> started having problems after your upgrade but it could be coincidence
> or it could be something about the new software tickling the hardware
> differently than the old.

That's what we're leaning toward as well. We're planning on doing a BIOS upgrade (betadrive is running v2.00 and alphadrive is v1.00), then a memtest86+ run, then check firmware on the SATA controllers.

If none of the above helps, we're thinking of swapping the CPUs between the two systems to see if the problems stay with the box or follow the CPU.

Thanks for the reply.

-- 
Freddie Cash
fjwcash@gmail.com
Freddie Cash
2012-Jan-09 18:03 UTC
Upgrade from 8.2-STABLE to 9.0-RELEASE wedges on SuperMicro H8DGiF-based system
Small correction: these are AMD Opteron 6218 CPUs, not 2218.

> Hardware (alphadrive):
>   Chenbro 5U rackmount chassis with 24 hot-swap drive bays
>   SuperMicro H8DGi-F motherboard
>   AMD Opteron 6218 CPU (8-cores at 2.0 GHz)
>   24 GB DDR3-SDRAM
>   3x SuperMicro AOC-USAS-L8i SATA controllers (multi-lane break-out cables)
>   8x Seagate 7200.12 1.5 TB SATA harddrives
>   16x WD RE4 1.0 TB SATA harddrives
>   1x Kingston 60 GB SSD (for /, swap, L2ARC)
>
> Hardware (betadrive):
>   SuperMicro 4U rackmount chassis with 16 hot-swap drive bays
>   SuperMicro H8DGi-F motherboard
>   AMD Opteron 6218 CPU (8-cores at 2.0 GHz)
>   24 GB DDR3-SDRAM
>   2x SuperMicro AOC-USAS2-L8i SATA controllers (multi-lane cables)
>   16x WD RE4 2.0 TB SATA harddrives
>   1x Kingston 60 GB SSD (for /, swap, L2ARC)

-- 
Freddie Cash
fjwcash@gmail.com
John Nielsen
2012-Jan-09 18:16 UTC
Upgrade from 8.2-STABLE to 9.0-RELEASE wedges on SuperMicro H8DGiF-based system
On Jan 9, 2012, at 12:40 PM, Freddie Cash wrote:
> Has anyone experienced anything similar? Does anyone have any ideas on
> what to look for? Any suggestions on what to try next?

From what you've said I strongly suspect that you have some kind of hardware issue. Dodgy RAM is my first guess, something cooling-related is my 2nd, and PSU is my 3rd. It is a little suspicious that you only started having problems after your upgrade but it could be coincidence or it could be something about the new software tickling the hardware differently than the old.

Open it up, make sure you don't have dust buildup and that all the fans are spinning, re-seat the RAM and then boot into memtest for a few hours. If you have spare similar hardware you can also try swapping components until you isolate the fault.

Good luck,

JN
Daniel Kalchev
2012-Jan-09 19:11 UTC
Upgrade from 8.2-STABLE to 9.0-RELEASE wedges on SuperMicro H8DGiF-based system
On Jan 9, 2012, at 8:03 PM, Freddie Cash wrote:
> Small correction: these are AMD Opteron 6218 CPUs, not 2218.
>
>> Hardware (alphadrive):
>>   Chenbro 5U rackmount chassis with 24 hot-swap drive bays
>>   SuperMicro H8DGi-F motherboard
>>   AMD Opteron 6218 CPU (8-cores at 2.0 GHz)

You meant Opteron 6128, perhaps?

This does look like a weird coincidence, and considering the comments so far I too would question ACPI (BIOS revision, settings, etc.) and the possibility of some hardware going bad.

Is it possible that you touched any hardware just before the upgrade? I have had a few cases of an old system "dying" on me after some "minor" cleaning etc. done just before an update.

Daniel