Muti Zen
2008-May-28 21:05 UTC
[zfs-discuss] ZFS locking up! Bug in ZFS, device drivers or time for mobo RMA?
Greetings all. I am facing serious problems running ZFS on a storage server assembled from commodity hardware that is supposed to be Solaris compatible. Although I am quite familiar with Linux distros and other unices, I am new to Solaris, so any suggestions are highly appreciated.

First I tried SXDE 1/08, creating the following pool:

-bash-3.2# zpool status -v tank
  pool: tank
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c5t0d0  ONLINE       0     0     0
            c5t1d0  ONLINE       0     0     0
            c6t1d0  ONLINE       0     0     0
            c6t0d0  ONLINE       0     0     0
            c7t0d0  ONLINE       0     0     0

errors: No known data errors

All went well until I tried pulling files from the server to another machine running 64-bit Vista Ultimate SP1 via its built-in NFS client. After copying about 100 of them (split archives, each 100MB in size, i.e. about 10GB of data) I always get a "The semaphore timeout period has expired." error. The machines are currently connected by a 1Gbps switch, but I have tried several other devices as well (some supporting only 100Mbps).

When this happens, Solaris is still responsive, but any zpool command I try locks up. E.g. "zpool status tank" prints just the following

  pool: tank
 state: ONLINE
 scrub: none requested

and then locks up. This gives me the impression that after several minutes of usage the ZFS subsystem on the machine locks up, and anything that tries to touch it locks up as well. The only way I have found to get the server running again is a hardware reset. A software reboot/shutdown locks up as well.

Another, possibly related, problem was that instead of or in addition to this lockup, ZFS degraded my pool, considering one of the discs faulty. It was always the same disc, regardless of the port it was plugged into. The weird thing is that the disc appears to be perfectly functional: running the thorough Samsung ESTOOL diagnostic on it many times found no problems.
Clearing the errors and scrubbing the pool would make it operational again, at least for a while.

I have replaced SXDE with OpenSolaris 2008.05, but it didn't seem to affect these problems at all. I bought more discs, hoping that replacing the faulting one would solve the problems. Unfortunately it did not solve all of them. The array doesn't degrade due to a "faulting disc" anymore, but ZFS still seems to lock up after several minutes of usage.

Thanks in advance for any suggestions on how best to approach these problems.

Server HW:
Mobo: MSI K9N Diamond
CPU: Athlon 64 X2 5200+
Mem: Corsair TWIN2x4096-6400C4DHX
PSU: Corsair HX620W
Case: ThermalTake Armor+
GFX: MSI N9600GT-T2D1G-OC
HDDs: Spinpoint F1 HD103UJ (1TB, 32MB, 7200rpm, SATA2)

All HDDs are the same model. The machine is not overclocked.

This message posted from opensolaris.org