Hey folks,

I've been wanting to use Solaris for a while now, for a ZFS home storage server and simply to get used to Solaris (I like to experiment). However, installing b70 has really not worked out for me at all. The hardware I'm using is pretty simple, but didn't seem to be supported under the latest Nexenta or Belenix build. It seems to work fine in b70 SXCE...with a few catastrophic problems (counterintuitive I know, but hear me out).

I haven't yet managed to get the Webstart hardware analyzer to work on this system (no install; Ubuntu LiveCDs seem to not want to install a JDK for some reason), so I'm really not sure that the hardware is "supported", but as I said everything seemed to work fine in the installer and then when initializing the ZFS pool (I'd just like to say how shockingly simple it was to create my zpool -- I was amazed!).

The hardware is:

3.0GHz P4, socket 775
Intel 965G desktop board ("Widowmaker")
3x 400GB SATA drives (ZFS RaidZ)
1x 100GB IDE drive (UFS boot)

I added an SI 2-port PCI SATA controller, but it seemed to not be recognized so I am not using it.

The problems I'm experiencing are as follows:

ZFS creates the storage pool just fine, sees no errors on the drives, and seems to work great...right up until I attempt to put data on the drives. After only a few moments of transfer, things start to go wrong. The system doesn't power off, it just beeps 4-5 times. The X session dies and the monitor turns off (it doesn't drop back to a console). All network access dies. It seems that the system panics (is it called something else in Solaris-land?). The HD access light stays on (though I can hear no drives doing anything strenuous), and the CD light blinks. This has happened two or three times, every time I've tried to start copying data to the ZFS pool. I've been transferring over the network, via SCP or NFS.

Data transfers to the UFS partition seemed to work fine, and when I rebooted everything seemed to be working again. When I did a zpool scrub on the storage pool, the system crashed as usual, but didn't come back up properly. It went to a disk cleanup root password prompt (which I couldn't enter because I didn't have USB legacy mode enabled, apparently USB isn't supported until the OS is fully booted, and I didn't have a spare PS/2 keyboard to use on that system).

This is really bothersome, since I really was looking forward to the ease of use and administration of ZFS versus Linux software RAID + LVM.

Can anybody shed some light on my situation? Is there any way I can get a little more information about what's causing this crash? I have no problem hooking up a serial console to the system to pull off info if that's possible (provided it has a serial port...I don't really remember). Or maybe there are logs stored when the system takes a dive? Anything I can do to help sort this out I'll be willing to do.

As a side note, this so far is entirely experimental for me...I haven't even gotten the chance to get any large amount of data on the ZFS pool (~650MB so far), so I have no problem reinstalling, changing around hardware, or swapping the board & processor out for something different (I have several systems with some potential to be good storage servers that I don't mind moving around -- I borrowed 2 or 3 drives from work so that I can move data between stable systems to move around other hardware).

Thanks!

This message posted from opensolaris.org
Mattias Pantzare
2007-Aug-30 23:18 UTC
[zfs-discuss] Please help! ZFS crash & burn in SXCE b70!
> The problems I'm experiencing are as follows:
> ZFS creates the storage pool just fine, sees no errors on the drives, and seems to work great...right up until I attempt to put data on the drives. After only a few moments of transfer, things start to go wrong. The system doesn't power off, it just beeps 4-5 times. The X session dies and the monitor turns off (doesn't drop back to a console). All network access dies. It seems that the system panics (is it called something else in solaris-land?). The HD access light stays on (though I can hear no drives doing anything strenuous), and the CD light blinks. This has happened two or three times, every time I've tried to start copying data to the ZFS pool. I've been transferring over the network, via SCP or NFS.

This could be a hardware problem. Bad power supply for the load? Try removing 2 of the large disks.
Nigel Smith
2007-Aug-30 23:31 UTC
[zfs-discuss] Please help! ZFS crash & burn in SXCE b70!
Are you sure your hardware is working without problems?
I would first check the RAM with memtest86+:
http://www.memtest.org/

How many megabytes of RAM do you have on this PC?
Can you get any other operating system, like Ubuntu, to work OK on this hardware?

I think it would be useful to know which chipset, and hence which driver, you are using to connect the SATA drives. I would guess it's the AHCI driver.
See this link to see how I answered this question for my system:
http://mail.opensolaris.org/pipermail/zfs-discuss/2007-May/040562.html

Regards
Nigel Smith
Richard Elling
2007-Aug-30 23:44 UTC
[zfs-discuss] Please help! ZFS crash & burn in SXCE b70!
Nigel Smith wrote:
> Are you sure your hardware is working without problems?
> I would first check the RAM with memtest86+
> http://www.memtest.org/

Also, SunVTS should be in /usr/sunvts and includes memory and disk tests (plus others). This is the test suite we (Sun) use in manufacturing. Take care when using destructive tests :-)

 -- richard
Mario Goebbels
2007-Aug-31 12:40 UTC
[zfs-discuss] Please help! ZFS crash & burn in SXCE b70!
> I added a SI 2 port PCI SATA controller, but it seemed to not be recognized so I am not using it.

Do you by chance mean Silicon Image with that "SI"? Their chipsets aren't exactly known for reliability and data safety. Just pointing that out as a potential source of problems.

-mg
> This could be a hardware problem. Bad power supply for
> the load? Try removing 2 of the large disks.

I should have mentioned in my first post that this is the very first thing I thought of, and that I've already swapped the power supply with one I know can handle the load (an Enermax 400W which has powered a significantly more power-hungry system than this just fine). If all else fails (I'm going to run Memtest for 24h first, as suggested below), I'll remove two of the drives and try again.
> Are you sure your hardware is working without problems?
> I would first check the RAM with memtest86+
> http://www.memtest.org/

I'll give this a shot tonight when I get home. I believe that Ubuntu LiveCDs have a memtest boot option on them; if not, I've got a Memtest disc somewhere. I'll run it for at least 24h and let you know how it goes.

> How many megabytes of RAM do you have on this PC?
> Can you get any other operating system, like Ubuntu
> to work ok on this hardware?

It's got 1GB of RAM, and Solaris is the first OS I've installed on this particular system. I ran an Ubuntu LiveCD and did notice some instability while attempting to install some extra packages (I was trying to get a JDK installed to run the Solaris driver tool) though, so maybe it is the RAM.

> I think it would be useful to know which chipset and
> hence driver you are using to connect the sata drives.
> I would guess it's the AHCI driver.
> See this link to see how I answered this question for
> my system:
> http://mail.opensolaris.org/pipermail/zfs-discuss/2007-May/040562.html

Again, I'll take a look at this when I get home. I strongly suspect that it is indeed the AHCI driver.
> Do you by chance mean Silicon Image with that "SI"?
> Their chipsets aren't exactly known for reliability and
> data safety. Just pointing that out as potential source
> of problems.

Indeed it is; however, I'm not using that controller for anything at the moment. It's simply in the system with nothing hooked up to it. I had read somewhere that ZFS performance is better if you have the disks in a RaidZ spread across controllers, and there are 2 on the motherboard already, so I was hoping to use 3 controllers for 3 disks. However, I noticed a message in the Solaris installer that there was an unrecognized controller, which for some reason I thought was that card, so I didn't use it.
On Fri, 31 Aug 2007, Zeke wrote:

>> Are you sure your hardware is working without
>> problems?
>> I would first check the RAM with memtest86+
>> http://www.memtest.org/
>
> I'll give this a shot tonight when I get home. I believe that Ubuntu liveCDs have a memtest boot option on them, if not I've got a Memtest disc somewhere. I'll run it at least 24h and let you know how it goes.

http://www.ultimatebootcd.com/ has memtest86.

IMHO you have some serious hardware issue(s) with this system. OpenSolaris tends to push the underlying system hardware pretty hard. I've seen systems fail to install (Open)Solaris while they installed and ran other OSes just fine.

In one case, where a system failed to load Solaris, I advised the OP to remove the CPU fan from the CPU cooler and visually inspect whether there was a layer of dust restricting airflow over the heatsink [1]. This turned out to be the issue, and after removing all the crap from the heatsink, he was able to load Solaris just fine.

A similar issue, when the CPU fan is 2+ years old, is that the fan bearings are foobarred and the fan slows down when the heatsink starts to warm up. In this case, when you pop the side cover off, everything appears to be working just fine. Ten minutes later, *after* you've replaced the covers, the fan slows down to almost nothing and your system starts to "mis-behave".

Recommendation: replace the CPU cooler fan assembly if it's 2 years old or older.

PS: For a long time, the AMD factory coolers were completely unreliable. And the very thin spacing between the heatsink fins nicely facilitated the capture and buildup of a layer of dust. I always recommend replacement of older AMD factory coolers with Zalman (www.zalmanusa.com) parts. Email me offlist if you want specific part recommendations. On older systems I recommend the Zalman passive copper heatsink (CNPS6000-Cu) in conjunction with the FB123 fan bracket and one or more 92mm (Zalman) fans.

[1] you *must* remove the fan to do the inspection. You can't see the thin layer of crap with the fan in place.

>> How many megabytes of RAM do you have on this PC?
>> Can you get any other operating system, like Ubuntu
>> to work ok on this hardware?
>
> It's got 1GB of RAM, and Solaris is the first OS I've installed on this particular system. I ran an Ubuntu LiveCD and did notice some instability while attempting to install some extra packages (was trying to get a JDK installed to run the Solaris driver tool) though, so maybe it is the RAM.
>
>> I think it would be useful to know which chipset and
>> hence driver you are using to connect the sata drives.
>> I would guess it's the AHCI driver.
>> See this link to see how I answered this question for
>> my system:
>> http://mail.opensolaris.org/pipermail/zfs-discuss/2007-May/040562.html
>
> Again, I'll take a look at this when I get home. I strongly suspect that it is indeed the AHCI driver.

Regards,

Al Hopper
Logical Approach Inc, Plano, TX.
al at logical-approach.com
Voice: 972.379.2133 Fax: 972.379.2134 Timezone: US CDT
OpenSolaris Governing Board (OGB) Member - Apr 2005 to Mar 2007
http://www.opensolaris.org/os/community/ogb/ogb_2005-2007/
OK, well, I fired up Memtest and had a failure on the first run. I've run it twice more and have yet to get it through a full pass without errors. Memory problem it is. :(

Sorry to bother everyone. Thanks for the help.
Nigel Smith
2007-Aug-31 23:23 UTC
[zfs-discuss] Please help! ZFS crash & burn in SXCE b70!
Yes, I'm not surprised. I thought it would be a RAM problem.
I always recommend a 'memtest' on any new hardware.
Murphy's law predicts that you only have RAM problems on PCs that you don't test!

Regards
Nigel Smith
Nigel Smith
2007-Aug-31 23:48 UTC
[zfs-discuss] Please help! ZFS crash & burn in SXCE b70!
Richard, thanks for the pointer to the tests in '/usr/sunvts', as this is the first I have heard of them. They look quite comprehensive.
I will give them a trial when I have some free time.

Thanks
Nigel Smith

pmemtest     - Physical Memory Test
ramtest      - Memory DIMMs (RAM) Test
vmemtest     - Virtual Memory Test
cddvdtest    - Optical Disk Drive Test
cputest      - CPU Test
disktest     - Disk and Floppy Drives Test
dtlbtest     - Data Translation Look-aside Buffer Test
fputest      - Floating Point Unit Test
l1dcachetest - Level 1 Data Cache Test
l2sramtest   - Level 2 Cache Test
netlbtest    - Net Loop Back Test
nettest      - Network Hardware Test
serialtest   - Serial Port Test
tapetest     - Tape Drive Test
usbtest      - USB Device Test
systest      - System Test
iobustest    - Test for the IO interconnects and the Components on the IObus on high end Machines
> Richard, thanks for the pointer to the tests in
> '/usr/sunvts', as this
> is the first I have heard of them. They look quite
> comprehensive.
> I will give them a trial when I have some free time.
> Thanks
> Nigel Smith
>
> pmemtest - Physical Memory Test
> ramtest - Memory DIMMs (RAM) Test
> vmemtest - Virtual Memory Test
> cddvdtest - Optical Disk Drive Test
> cputest - CPUtest
> disktest - Disk and Floppy Drives Test
> dtlbtest - Data Translation Look-aside Buffer Test
> fputest - Floating Point Unit Test
> l1dcachetest - Level 1 Data Cache Test
> l2sramtest - Level 2 Cache Test
> netlbtest - Net Loop Back Test
> nettest - Network Hardware Test
> serialtest - Serial Port Test
> tapetest - Tape Drive Test
> usbtest - USB Device Test
> systest - System Test
> iobustest - Test for the IO interconnects and
> the Components on the IObus on high end Machines

That is apparently one of those crazy hidden features in OpenSolaris that I think Indiana should expose :)
Frank Leers
2007-Sep-01 02:09 UTC
[zfs-discuss] Please help! ZFS crash & burn in SXCE b70!
MC wrote:
>> Richard, thanks for the pointer to the tests in
>> '/usr/sunvts', as this
>> is the first I have heard of them. They look quite
>> comprehensive.
>> I will give them a trial when I have some free time.
>> Thanks
>> Nigel Smith
>>
>> pmemtest - Physical Memory Test
>> ramtest - Memory DIMMs (RAM) Test
>> vmemtest - Virtual Memory Test
>> cddvdtest - Optical Disk Drive Test
>> cputest - CPUtest
>> disktest - Disk and Floppy Drives Test
>> dtlbtest - Data Translation Look-aside Buffer Test
>> fputest - Floating Point Unit Test
>> l1dcachetest - Level 1 Data Cache Test
>> l2sramtest - Level 2 Cache Test
>> netlbtest - Net Loop Back Test
>> nettest - Network Hardware Test
>> serialtest - Serial Port Test
>> tapetest - Tape Drive Test
>> usbtest - USB Device Test
>> systest - System Test
>> iobustest - Test for the IO interconnects and
>> the Components on the IObus on high end Machines
>
> That is apparently one of those crazy hidden features in OpenSolaris that I think Indiana should expose :)

VTS has been around for many years, although it may have been more widely deployed on SPARC hardware. VTS is Sun Services' tool of choice when 'validating' hardware (V_alidation T_est S_uite). Manufacturing also uses the tool suite extensively to burn in hardware on their floor before shipping.
Mario Goebbels
2007-Sep-01 10:53 UTC
[zfs-discuss] Please help! ZFS crash & burn in SXCE b70!
> Yes, I'm not surprised. I thought it would be a RAM problem.
> I always recommend a 'memtest' on any new hardware.
> Murphy's law predicts that you only have RAM problems
> on PC's that you don't test!

Heh, the last RAM problem I ever had was a broken 1MB memory stick on that wannabe 486 from Cyrix, like over a decade ago. And I never test my machines for broken sticks :)

-mg
Richard Elling
2007-Sep-01 23:22 UTC
[zfs-discuss] Please help! ZFS crash & burn in SXCE b70!
more history (geezing) below...

Frank Leers wrote:
> MC wrote:
>>> Richard, thanks for the pointer to the tests in
>>> '/usr/sunvts', as this
>>> is the first I have heard of them. They look quite
>>> comprehensive.
>>> I will give them a trial when I have some free time.
>>> Thanks
>>> Nigel Smith
>>>
>>> pmemtest - Physical Memory Test
>>> ramtest - Memory DIMMs (RAM) Test
>>> vmemtest - Virtual Memory Test
>>> cddvdtest - Optical Disk Drive Test
>>> cputest - CPUtest
>>> disktest - Disk and Floppy Drives Test
>>> dtlbtest - Data Translation Look-aside Buffer Test
>>> fputest - Floating Point Unit Test
>>> l1dcachetest - Level 1 Data Cache Test
>>> l2sramtest - Level 2 Cache Test
>>> netlbtest - Net Loop Back Test
>>> nettest - Network Hardware Test
>>> serialtest - Serial Port Test
>>> tapetest - Tape Drive Test
>>> usbtest - USB Device Test
>>> systest - System Test
>>> iobustest - Test for the IO interconnects and
>>> the Components on the IObus on high end Machines
>>
>> That is apparently one of those crazy hidden features in OpenSolaris that I think Indiana should expose :)
>
> VTS has been around for many years, although may have been more widely
> deployed on SPARC hardware. VTS is Sun Services' tool of choice when
> 'validating' hardware (V_alidation T_est S_uite). Manufacturing also
> use the tool suite extensively to burn in hardware on their floor before
> shipping.

IIRC, SunVTS came from SunDiag (wow! http://docs.sun.com/app/docs/doc/801-6627 :-) which I first saw delivered on 1/2" tape :-)

SunVTS is still actively developed, and we (actually, my sibling group) are very interested in any bugs or RFEs. Test developers are always looking to improve test coverage. The more we can catch in development or the factory, the less you should see in the field.

 -- richard
Pete Bentley
2007-Sep-04 12:07 UTC
[zfs-discuss] Please help! ZFS crash & burn in SXCE b70!
Mario Goebbels wrote:
> Heh, the last ever RAM problems I had was a broken 1MB memory stick on
> that wannabe 486 from Cyrix like over a decade ago. And I never test my
> machines for broken sticks :)

If you don't test your RAM, how are you sure you have no problems (unless you exclusively use ECC memory)?

For example, a friend recently built a new ZFS home fileserver which appeared to work fine, but a zpool scrub of a large raidz pool after copying lots of files into it would consistently return one or two errors. That turned out to be marginal RAM, shown up by a long memtest86 run. Swapped the RAM and the problem went away.

So RAM problems may not manifest themselves very obviously without some kind of checksumming technology (either a ZFS pool or ECC on the memory itself). I have often wondered how much of Windows' poor reputation for stability is actually due to uncorrected RAM errors on cheapo PCs.

Pete.
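A rough way to picture the checksumming point in Pete's post: if a checksum is recorded when a block is written, a later re-read can reveal a single flipped bit that would otherwise go unnoticed. The sketch below is only an illustration under stated assumptions. The helper names (`write_block`, `flip_bit`, `verify_block`) are hypothetical, SHA-256 stands in for whatever checksum a real pool is configured with, and it only models a flip that happens after the checksum was computed. It is not how ZFS lays out or verifies data on disk.

```python
import hashlib


def write_block(data):
    """Return the block together with a checksum computed at write time."""
    return data, hashlib.sha256(data).digest()


def flip_bit(data, bit_index):
    """Simulate a single-bit memory error by flipping one bit in the block."""
    corrupted = bytearray(data)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


def verify_block(data, checksum):
    """Re-hash the block, as a scrub-like pass would, and compare checksums."""
    return hashlib.sha256(data).digest() == checksum


# Hypothetical 896-byte block; one bit is flipped somewhere in the middle.
block, stored = write_block(b"family photos " * 64)
damaged = flip_bit(block, 1234)

print(verify_block(block, stored))    # True
print(verify_block(damaged, stored))  # False
print(len(block) == len(damaged))     # True: nothing else looks wrong
```

Without the stored checksum, the damaged block has the same length and almost the same contents as the original, which is consistent with the thread's observation that marginal RAM can go unnoticed until something verifies the data.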
Richard Elling
2007-Sep-04 18:44 UTC
[zfs-discuss] Please help! ZFS crash & burn in SXCE b70!
Pete Bentley wrote:
> Mario Goebbels wrote:
>> Heh, the last ever RAM problems I had was a broken 1MB memory stick on
>> that wannabe 486 from Cyrix like over a decade ago. And I never test my
>> machines for broken sticks :)
>
> If you don't test your RAM, how are you sure you have no problems (unless you
> exclusively use ECC memory)?

Even if you use ECC :-) though the probability that ECC will show an error is much better than with simple parity or nothing.

WARNING: PC vendors are very cost sensitive. In most cases, they will not offer ECC. Try going to Fry's and asking for ECC memory; they will laugh at you (that is, if they even know what ECC is).

> For example, a friend recently built a new zfs home fileserver which appeared
> to work fine but a zpool scrub of a large raidz pool after copying lots of
> files into it would consistently return one or two errors. That turned out
> to be marginal RAM, showed up by a long memtest86 run. Swapped the RAM and
> the problem went away.
>
> So RAM problems may not manifest themselves very obviously without some kind
> of checksumming technology (either a zfs pool or ECC on the memory itself). I
> have often wondered how much of Windows' poor reputation for stability is
> actually due to uncorrected RAM errors on cheapo PCs.

A Microsoft paper says that memory-induced failures are now in the top-10 list of common failures. Microsoft is trying to create change, but since ECC DIMMs will always cost more than non-ECC DIMMs, the market has not shown any interest.
http://www.eetimes.com/news/latest/showArticle.jhtml?articleID=199601761

 -- richard
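To make the parity-versus-ECC comparison a little more concrete, here is a toy sketch. It assumes a textbook Hamming(7,4) code as a stand-in for ECC; real ECC DIMMs use wider codes over 64-bit words, and the function names below are hypothetical. A single parity bit can tell that one bit flipped but not which one, while the Hamming syndrome points at the flipped position so it can be corrected.

```python
def parity_bit(bits):
    """Even parity: 1 if the number of set bits is odd, else 0."""
    return sum(bits) % 2


def hamming74_encode(d):
    """Encode 4 data bits as a 7-bit codeword, positions 1..7 = p1 p2 d1 p3 d2 d3 d4."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(codeword):
    """Return (data bits, error position); position 0 means no error was seen."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    pos = s1 + 2 * s2 + 4 * s3  # syndrome read as a 1-based bit position
    if pos:
        c[pos - 1] ^= 1         # correct the single flipped bit
    return [c[2], c[4], c[5], c[6]], pos


data = [1, 0, 1, 1]

# Simple parity: a flipped bit is detected, but its location is unknown.
word = data + [parity_bit(data)]
bad_word = list(word)
bad_word[2] ^= 1
print(parity_bit(word), parity_bit(bad_word))  # 0 1

# Hamming(7,4): the same kind of flip is located and corrected.
code = hamming74_encode(data)
bad_code = list(code)
bad_code[4] ^= 1                               # flip position 5
print(hamming74_decode(bad_code))              # ([1, 0, 1, 1], 5)
```

This toy code miscorrects when two bits flip in the same codeword, which fits the "even if you use ECC" caveat above: ECC greatly improves the odds of catching an error, but it is not a guarantee.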
me at tomservo.cc
2007-Sep-04 21:01 UTC
[zfs-discuss] Please help! ZFS crash & burn in SXCE b70!
> If you don't test your RAM, how are you sure you have no problems (unless
> you exclusively use ECC memory)?

I usually keep an eagle eye on my personal systems. If something appears to be wrong, I usually spend considerable time diagnosing it. It goes as far as me running zpool status every time I think I've heard suspicious disk activity (like screeching, which usually ends up being some neighbor with a flex, and stuff like that :)

Professional systems naturally use ECC.

-mg