Hi,

I have been seeing data corruption on the ZFS filesystem. Here are some
details. The machine is running s10 on X86 platform with a single 160Gb
SATA disk. (root on s0 and zfs on s7)

...Sanjaya

--------- /etc/release ----------

-bash-3.00# cat /etc/release
                       Solaris 10 6/06 s10x_u2wos_09a X86
           Copyright 2006 Sun Microsystems, Inc.  All Rights Reserved.
                        Use is subject to license terms.
                             Assembled 09 June 2006
-bash-3.00#

----------- Zpool --------------

-bash-3.00# zpool status -v
  pool: home
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: scrub completed with 3 errors on Thu Aug 17 09:27:30 2006
config:

        NAME        STATE     READ WRITE CKSUM
        home        ONLINE       0     0    20
          c0d0s7    ONLINE       0     0    20

errors: The following persistent errors have been detected:

          DATASET  OBJECT  RANGE
          5        a38     lvl=2 blkid=0
          5        174f    lvl=1 blkid=2
          5        dd2     lvl=1 blkid=0
-bash-3.00#

----------------- PRTDIAG INFO --------------------

-bash-3.00# prtdiag
System Configuration: To Be Filled By O.E.M. To Be Filled By O.E.M.
BIOS Configuration: American Megatrends Inc.V2.05 080010 06/14/2005

==== Processor Sockets ===================================

Version                                   Location Tag
----------------------------------------- --------------------------
Dual Core AMD Opteron(tm) Processor 270   CPU 1
Dual Core AMD Opteron(tm) Processor 270   CPU 2

==== Memory Device Sockets ===============================

Type    Status Set Device Locator      Bank Locator
------- ------ --- ------------------- --------------------
SDRAM   in use 0   DIMM0               BANK0
SDRAM   in use 0   DIMM1               BANK1
SDRAM   in use 0   DIMM2               BANK2
SDRAM   in use 0   DIMM3               BANK3
SDRAM   in use 0   DIMM4               BANK4
SDRAM   in use 0   DIMM5               BANK5
SDRAM   in use 0   DIMM6               BANK6
SDRAM   in use 0   DIMM7               BANK7

==== On-Board Devices ====================================
To Be Filled By O.E.M.

==== Upgradeable Slots ===================================

ID  Status    Type             Description
--- --------- ---------------- ----------------------------
0   available AGP 4X           AGP
1   available PCI              PCI1
2   available PCI              PCI2
3   available PCI              PCI3
4   available PCI              PCI4
-bash-3.00#

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: dmesg.txt
URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20060818/e09fda0b/attachment.txt>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: eeprom.txt
URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20060818/e09fda0b/attachment-0001.txt>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: format-e.txt
URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20060818/e09fda0b/attachment-0002.txt>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: prtconf_verbose.txt
URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20060818/e09fda0b/attachment-0003.txt>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: prtconf-V.txt
URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20060818/e09fda0b/attachment-0004.txt>
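
For anyone checking many machines for the same symptom, the CKSUM column
of the zpool status report above is the thing to watch. The following is
only a rough sketch, not an official tool: cksum_errors is a hypothetical
helper name, and it assumes the five-column config table layout shown in
the report above (NAME STATE READ WRITE CKSUM).

```shell
#!/bin/sh
# Sketch: list devices whose CKSUM column is non-zero in a zpool status
# report read from stdin. cksum_errors is a hypothetical helper name.
#
# Intended use on a real system:
#   zpool status -v home | cksum_errors
cksum_errors() {
    awk '
        # The header row of the config table marks where device rows begin.
        $1 == "NAME" && $5 == "CKSUM" { in_config = 1; next }
        # The errors: section follows the table, so stop there.
        /^errors:/ { in_config = 0 }
        # Device rows have five columns: NAME STATE READ WRITE CKSUM.
        in_config && NF == 5 && $5 != "0" { print $1, $5 }
    '
}
```

It prints one "device count" pair per line and prints nothing when every
checksum counter is zero, so it can be used in a loop over hosts.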
Hello Sanjaya,

Friday, August 18, 2006, 7:50:21 PM, you wrote:

> Hi, I have been seeing data corruption on the ZFS filesystem. Here are
> some details. The machine is running s10 on X86 platform with a single
> 160Gb SATA disk. (root on s0 and zfs on s7)

Well, you have a ZFS pool without any protection (except ditto blocks for
metadata). Unless you have overwritten the underlying disk/slice, it's
possible you have a problem with your disk or other hardware.

Try 'fmdump -eV'.

btw: your system produced a crash dump - I understand that the server
actually restarted, right?

Also interesting is:

Aug 15 18:31:14 sfo-dk2-s62 unix: [ID 557827 kern.info] cpu3 initialization complete - online
Aug 15 18:31:14 sfo-dk2-s62 unix: [ID 999285 kern.warning] WARNING: BIOS microcode patch for AMD Athlon(tm) 64/Opteron(tm) processor
Aug 15 18:31:14 sfo-dk2-s62 erratum 131 was not detected; updating your system's BIOS to a version
Aug 15 18:31:14 sfo-dk2-s62 containing this microcode patch is HIGHLY recommended or erroneous system
Aug 15 18:31:14 sfo-dk2-s62 operation may occur.

However, I do not believe it's related to this problem.

--
Best regards,
Robert
mailto:rmilkowski@task.gda.pl
http://milek.blogspot.com

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
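
The diagnostic steps suggested above could be strung together roughly as
follows. This is a sketch under assumptions, not a procedure from the
thread: the helper names run and check_pool are hypothetical, and the
scrub and status steps are added here as the usual way to re-verify a
pool after looking at the FMA error log. By default it only prints the
commands; set DRY_RUN=0 to actually run them on the affected machine.

```shell
#!/bin/sh
# Sketch: collect the FMA error log, then re-verify the pool.
# run and check_pool are hypothetical helper names.

# Print the command by default; execute it only when DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    else
        echo "+ $*"
    fi
}

check_pool() {
    pool=${1:?usage: check_pool POOL}
    # Verbose dump of the fault manager error log (disk and bus errors).
    run fmdump -eV
    # Start a scrub; this returns at once and the scrub runs in the background.
    run zpool scrub "$pool"
    # Show scrub progress, per-device error counters and damaged objects.
    run zpool status -v "$pool"
}
```

For the pool in this thread the call would be "check_pool home". Because
the scrub runs in the background, the status command has to be repeated
until it reports that the scrub has completed.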
Srivastava, Sanjaya wrote:
> I have been seeing data corruption on the ZFS filesystem. Here are
> some details. The machine is running s10 on X86 platform with a single
> 160Gb SATA disk. (root on s0 and zfs on s7)

I'd wager that it is a hardware problem. Personally, I've had less
than satisfactory reliability experiences with 160 GByte disks from
a variety of vendors. Try mirroring.
 -- richard
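
As a rough illustration of the mirroring suggestion: a single-device pool
can be turned into a two-way mirror with zpool attach. The sketch below
only builds and prints the command rather than running it, since attaching
overwrites the new device. The helper name mirror_cmd and the second
device c1d0s7 are assumptions for the example; the thread does not name a
second disk, and the new slice would need to be at least as large as the
existing one.

```shell
#!/bin/sh
# Sketch: print the zpool attach command that would mirror an existing
# single-device pool. mirror_cmd is a hypothetical helper name.
mirror_cmd() {
    pool=$1
    existing=$2
    new=$3
    # Attaching a device to itself makes no sense; refuse early.
    if [ "$existing" = "$new" ]; then
        echo "error: new device must differ from the existing one" >&2
        return 1
    fi
    # zpool attach POOL EXISTING_DEVICE NEW_DEVICE
    echo "zpool attach $pool $existing $new"
}

# Example with a hypothetical second disk:
#   mirror_cmd home c0d0s7 c1d0s7
```

With a mirror in place ZFS has a second copy to read when a checksum
fails, which an unprotected single-slice pool does not.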
Thanks for the reply.

If I put in a SATA RAID card (ARC-1110,
http://www.areca.us/products/html/pcix-sata.htm ), the problem disappears.

...Sanjaya

________________________________
From: Robert Milkowski [mailto:rmilkowski at task.gda.pl]
Sent: Friday, August 18, 2006 11:59 AM
To: Srivastava, Sanjaya
Cc: zfs-discuss at opensolaris.org
Subject: Re: [zfs-discuss] ZFS Filesytem Corrpution
I agree with you, but only 50%. Mirroring will only mask the problem and
delay the filesystem corruption (depending on how ZFS responds to data
corruption: does it go back and recheck the blocks later, or does it just
mark them bad?).

The problem lies somewhere in the hardware, but certainly not in the
disks. I have over 20 machines exhibiting the same behavior. If I put a
RAID card in between, the problem disappears altogether.

...Sanjaya

-----Original Message-----
From: Richard.Elling at Sun.COM [mailto:Richard.Elling at Sun.COM]
Sent: Friday, August 18, 2006 11:59 AM
To: Srivastava, Sanjaya
Cc: zfs-discuss at opensolaris.org
Subject: Re: [zfs-discuss] ZFS Filesytem Corrpution