Ed Saipetch
2007-Oct-30 04:09 UTC
[zfs-discuss] zfs corruption w/ sil3114 sata controllers
Hello,

I'm experiencing major checksum errors when using a syba silicon image 3114 based pci sata controller w/ nonraid firmware. I've tested by copying data via sftp and smb. With everything I've swapped out, I can't fathom this being a hardware problem. There have been quite a few blog posts out there with people having a similar config and not having any problems.

Here's what I've done so far:
1. Changed solaris releases from S10 U3 to NV 75a
2. Switched out motherboards and cpus from AMD sempron to a Celeron D
3. Switched out memory to use completely different dimms
4. Switched out sata drives (2-3 250gb hitachi's and seagates in RAIDZ, 3x400GB seagates RAIDZ and 1x250GB hitachi with no raid)

Here's output of a scrub and the status (ignore the date and time, I haven't reset it on this new motherboard) and please point me in the right direction if I'm barking up the wrong tree.

# zpool scrub tank
# zpool status
  pool: tank
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: scrub completed with 140 errors on Sat Sep 15 02:07:35 2007
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0   293
          c0d1      ONLINE       0     0   293

errors: 140 data errors, use '-v' for a list

This message posted from opensolaris.org
Nathan Kroenert
2007-Oct-30 04:23 UTC
[zfs-discuss] zfs corruption w/ sil3114 sata controllers
You have not mentioned if you have swapped the 3114 based HBA itself...?

Have you tried a different HBA? :)

Nathan.
Neal Pollack
2007-Oct-30 04:24 UTC
[zfs-discuss] zfs corruption w/ sil3114 sata controllers
Ed Saipetch wrote:
> Hello,
>
> I'm experiencing major checksum errors when using a syba silicon image 3114 based pci sata controller w/ nonraid firmware. I've tested by copying data via sftp and smb. With everything I've swapped out, I can't fathom this being a hardware problem.

I can. But I suppose it could also be in some unknown way a driver issue. Even before ZFS, I've had numerous situations where various si3112 and 3114 chips would corrupt data on UFS and PCFS, with very simple copy and checksum test scripts, doing large bulk transfers.

Si chips are best used to clean coffee grinders. Go buy a real SATA controller.

Neal
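For readers who want to try the kind of "copy and checksum" test mentioned above, here is a minimal sketch. It is not Neal's actual script (the thread does not show it); the function name copy_and_verify is made up for illustration, and it assumes only a POSIX shell with cp, basename and cksum.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): copy a file into a directory
# and compare the POSIX cksum of the source and the copy.
# Prints "OK <copy>" or "MISMATCH <copy>"; returns non-zero on failure.
copy_and_verify() {
    src=$1
    dest_dir=$2
    dest="$dest_dir/$(basename "$src")"

    cp "$src" "$dest" || return 2

    # Reading from stdin makes cksum print only "<crc> <size>",
    # so the two results can be compared as plain strings.
    src_sum=$(cksum < "$src")
    dest_sum=$(cksum < "$dest")

    if [ "$src_sum" = "$dest_sum" ]; then
        echo "OK $dest"
    else
        echo "MISMATCH $dest"
        return 1
    fi
}
```

Something like `copy_and_verify /some/big/file /tank` would be the intended use. One caveat: a read straight after a write may be served from memory rather than from the disk, so for a meaningful test the copy should probably be re-read after the cache has been flushed (for example after exporting and re-importing the pool, or a reboot).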
Edward Saipetch
2007-Oct-30 05:00 UTC
[zfs-discuss] zfs corruption w/ sil3114 sata controllers
Neal Pollack wrote:
> I can. But I suppose it could also be in some unknown way a driver
> issue.
> Even before ZFS, I've had numerous situations where various si3112
> and 3114 chips
> would corrupt data on UFS and PCFS, with very simple copy and
> checksum
> test scripts, doing large bulk transfers.
>
> Si chips are best used to clean coffee grinders. Go buy a real SATA
> controller.
>
> Neal

I have no problem ponying up money for a better SATA controller. I saw a bunch of blog posts that people were successful using the card so I thought maybe I had a bad card with corrupt firmware nvram. Is it worth trying to trace down the bug?

If this type of corruption exists, nobody should be using this card. As a side note, what SATA cards are people having luck with?
> Here's what I've done so far:

The obvious thing to test is the drive controller, so maybe you should do that :)
Will Murnane
2007-Oct-30 05:29 UTC
[zfs-discuss] zfs corruption w/ sil3114 sata controllers
On 10/30/07, Edward Saipetch <beamz at twentybelow.com> wrote:
> As a side note, what SATA cards are people having luck with?

Running b74, I'm very happy with the Marvell mv88sx6081-based Supermicro card:
http://www.supermicro.com/products/accessories/addon/AoC-SAT2-MV8.cfm
http://www.newegg.com/Product/Product.aspx?Item=N82E16815121009&Tpk=aoc-sat2
http://www.wiredzone.com/xq/asp/ic.10016527/qx/itemdesc.htm
It hypothetically supports port multipliers, but I haven't tested this myself.

On earlier releases (b69, specifically) I had problems with disks occasionally disappearing. Those appear to have been completely resolved; the box has most recently been up for 16 days with no errors.

Will
James C. McPherson
2007-Oct-30 05:31 UTC
[zfs-discuss] zfs corruption w/ sil3114 sata controllers
Will Murnane wrote:
> Running b74, I'm very happy with the Marvell mv88sx6081-based Supermicro card:
> http://www.supermicro.com/products/accessories/addon/AoC-SAT2-MV8.cfm
> It hypothetically supports port multipliers, but I haven't tested this myself.

We don't currently have support for SATA port multipliers in Solaris or OpenSolaris. I know this because people in my team are working on it (no ETA as yet) and we discussed it last week.

James C. McPherson
--
Senior Kernel Software Engineer, Solaris
Sun Microsystems
Neal Pollack
2007-Oct-30 05:43 UTC
[zfs-discuss] zfs corruption w/ sil3114 sata controllers
Edward Saipetch wrote:
> I have no problem ponying up money for a better SATA controller. I
> saw a bunch of blog posts that people were successful using the card
> so I thought maybe I had a bad card with corrupt firmware nvram. Is
> it worth trying to trace down the bug?

Of course it is. File a bug so someone on the SATA team can study it.

> If this type of corruption exists, nobody should be using this card.
> As a side note, what SATA cards are people having luck with?

A lot of people are happy with the 8 port PCI SATA card made by SuperMicro that has the Marvell chip on it. Don't buy other marvell cards on ebay, because Marvell dumped a ton of cards that ended up with an earlier rev of the silicon that can corrupt data. But all the cards made by SuperMicro and sold by them have the c rev or later silicon and work great.

That said, I wish someone would investigate the Silicon Image issues, but there are only so many engineers, with so little time.
Nigel Smith
2007-Oct-30 10:15 UTC
[zfs-discuss] zfs corruption w/ sil3114 sata controllers
First off, can we just confirm the exact version of the Silicon Image Card and which driver Solaris is using.

Use 'prtconf -pv' and '/usr/X11/bin/scanpci' to get the PCI vendor & device ID information.

Use 'prtconf -D' to confirm which drivers are being used by which devices.

And 'modinfo' will tell you the version of the drivers.

The above commands will give details for all the devices in the PC. You may want to edit down the output before posting it back here, or alternatively put the output into an attached file.

See this link for an example of this sort of information for a different hard disk controller card:
http://mail.opensolaris.org/pipermail/storage-discuss/2007-September/003399.html

Regards
Nigel Smith
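The four commands named above can be gathered into one file for attaching to a reply. This is only a sketch: the function names run_if_present and collect_hw_info are invented for illustration, the commands are run exactly as quoted in the message, and anything not installed is skipped rather than treated as an error.

```shell
#!/bin/sh
# Hypothetical helpers (names are not from the thread).

# Run a command under a header line, or note that it is missing.
run_if_present() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "=== $* ==="
        # Keep going even if the command fails; record its exit status.
        "$@" 2>&1 || echo "(exit status $?)"
    else
        echo "=== $* (not found, skipped) ==="
    fi
}

# Collect the output of the commands named in the message into one file.
collect_hw_info() {
    out=$1
    {
        run_if_present prtconf -pv
        run_if_present /usr/X11/bin/scanpci
        run_if_present prtconf -D
        run_if_present modinfo
    } > "$out"
}
```

For example, `collect_hw_info /tmp/hwinfo.txt` would leave everything in one file, which can then be trimmed down to the controller entries before posting.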
Nigel Smith
2007-Oct-30 10:26 UTC
[zfs-discuss] zfs corruption w/ sil3114 sata controllers
And are you seeing any error messages in '/var/adm/messages' indicating any failure on the disk controller card? If so, please post a sample back here to the forum.
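One possible way to pull likely candidates out of the log is sketched below. The function name controller_warnings is made up, and matching on the "kern.warning" tag is an assumption about which entries are interesting; it will also show warnings that have nothing to do with the controller, so the result still needs reading by eye.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): print every kern.warning
# entry from a syslog-style file together with the line that follows it,
# because the detail of a warning can land on the next line.
controller_warnings() {
    log=${1:-/var/adm/messages}
    awk '/kern\.warning/ { print; want = 1; next }
         want            { print; want = 0 }' "$log"
}
```

Running `controller_warnings` with no argument reads /var/adm/messages; a different file can be passed as the first argument.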
Tomasz Torcz
2007-Oct-30 11:36 UTC
[zfs-discuss] zfs corruption w/ sil3114 sata controllers
On 10/30/07, Neal Pollack <Neal.Pollack at sun.com> wrote:
> > I'm experiencing major checksum errors when using a syba silicon image 3114 based pci sata controller w/ nonraid firmware. I've tested by copying data via sftp and smb. With everything I've swapped out, I can't fathom this being a hardware problem.
> Even before ZFS, I've had numerous situations where various si3112 and
> 3114 chips
> would corrupt data on UFS and PCFS, with very simple copy and checksum
> test scripts, doing large bulk transfers.

Those SIL chips are really broken when used with certain Seagate drives. But I have had data corrupted by them with a WD drive also. Linux can work around this bug by reducing transfer sizes (and thus dramatically impacting speed). Solaris probably doesn't have a workaround.

With this quirk enabled (on Linux), I get at most 20 MB/s from drives, but ZFS does not report any corruption. Before, I had corruptions hourly.

More info about the SIL issue: http://home-tj.org/wiki/index.php/Sil_m15w
I have a Si 3112, but despite SIL claims other chips seem to be affected also.

--
Tomasz Torcz
zdzichu at gmail.com
Stephen Usher
2007-Oct-30 12:11 UTC
[zfs-discuss] zfs corruption w/ sil3114 sata controllers
One thing to check before you blame your controller:

Are the SATA cables close together for an extended length?

Basically, most SATA cables will generate massive levels of cross-talk between them if they're tied together or are run parallel in close proximity for a part of their run-length.

A friend found this sort of problem a couple of months ago and it was cured by separating the cables.

Steve
--
---------------------------------------------------------------------------
Computer Systems Administrator,                E-Mail:-steve at earth.ox.ac.uk
Department of Earth Sciences,                  Tel:- +44 (0)1865 282110
University of Oxford, Parks Road, Oxford, UK.  Fax:- +44 (0)1865 272072
Frank.Hofmann at Sun.COM
2007-Oct-30 12:16 UTC
[zfs-discuss] zfs corruption w/ sil3114 sata controllers
On Tue, 30 Oct 2007, Tomasz Torcz wrote:
> Linux can workaround this bug by reducing transfer sizes (and thus
> dramatically impacting speed). Solaris probably don't have workaround.

Might be slightly off-topic for the whole, but _this_ specific thing (reducing transfer sizes) is possible on Solaris as well.

As documented here:

http://docs.sun.com/app/docs/doc/819-2724/chapter2-29?a=view

You can also read a bit more on the following thread:

http://www.opensolaris.org/jive/thread.jspa?threadID=6866

It's possible to limit this system-wide or per-LUN.

Best regards,
FrankH.

------------------------------------------------------------------------------
No good can come from selling your freedom, not for all the gold in the world,
for the value of this heavenly gift far exceeds that of any fortune on earth.
------------------------------------------------------------------------------
Ed Saipetch
2007-Oct-30 14:07 UTC
[zfs-discuss] zfs corruption w/ sil3114 sata controllers
To answer a number of questions:

Regarding different controllers, I've tried 2 Syba Sil 3114 controllers purchased about 4 months apart. I've tried 5.4.3 firmware with one and 5.4.13 with another. Maybe Syba makes crappy Sil 3114 cards but it's the same one that someone on blogs.sun.com used with success. I had weird problems flashing the first card I got, hence the order of another one. I'm not sure how I could get 2 different controllers 4 months apart and then use them in 2 completely different computers and both controllers be bad.

Regarding cables, they aren't densely packed. I've just got 1 drive attached in this new instance. In the old, I just had 4 cables unbundled (not bound together) attached between the card and the drives.

Here's an error on startup in /var/adm/messages, note however that this error didn't come up on the old mb/cpu combo with the older 3114 hba. These errors happen only during boot and don't happen during file transfers:

Sep 14 23:51:49 eknas genunix: [ID 936769 kern.info] sd0 is /pci@0,0/pci-ide@f,1/ide@1/sd@0,0
Sep 14 23:52:11 eknas scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci-ide@8/ide@0 (ata0):
Sep 14 23:52:11 eknas   timeout: abort request, target=1 lun=0

Here's the scanpci output:

pci bus 0x0000 cardnum 0x08 function 0x00: vendor 0x1095 device 0x3114
 Silicon Image, Inc. SiI 3114 [SATALink/SATARaid] Serial ATA Controller

and prtconf -pv:

subsystem-vendor-id: 00001095
subsystem-id: 00003114
unit-address: '8'
class-code: 00018000
revision-id: 00000002
vendor-id: 00001095
device-id: 00003114

and prtconf -D:

pci-ide, instance #0 (driver name: pci-ide)
    ide, instance #0 (driver name: ata)

and pertinent modinfo:

40 fffffffffbbf1250 1050 224 1 pci-ide (pciide nexus driver for 'PCI-ID)
41 fffffffff783c000 10230 112 1 ata (ATA AT-bus attachment disk cont)
On Mon, 29 Oct 2007, MC wrote:
>> Here's what I've done so far:
>
> The obvious thing to test is the drive controller, so maybe you should do that :)

Also - while you're doing swapTronics - don't forget the Power Supply (PSU). Ensure that your PSU has sufficient capacity on its 12 Volt rails (older PSUs didn't even tell you how much current they can push out on the 12V outputs).

See also: http://blogs.sun.com/elowe/entry/zfs_saves_the_day_ta

Regards,

Al Hopper  Logical Approach Inc, Plano, TX.  al at logical-approach.com
           Voice: 972.379.2133 Fax: 972.379.2134  Timezone: US CDT
OpenSolaris Governing Board (OGB) Member - Apr 2005 to Mar 2007
http://www.opensolaris.org/os/community/ogb/ogb_2005-2007/
Graduate from "sugar-coating school"?  Sorry - I never attended! :)
Ed Saipetch
2007-Oct-30 14:32 UTC
[zfs-discuss] zfs corruption w/ sil3114 sata controllers
Tried that... completely different cases with different power supplies.

On Oct 30, 2007, at 10:28 AM, Al Hopper wrote:
> Also - while you're doing swapTronics - don't forget the Power Supply
> (PSU). Ensure that your PSU has sufficient capacity on its 12Volt
> rails (older PSUs didn't even tell you how much current they can push
> out on the 12V outputs).
Mauro Mozzarelli
2007-Oct-30 15:34 UTC
[zfs-discuss] zfs corruption w/ sil3114 sata controllers
Hi,

I have the same sil3114 based controller, installed in a dual Opteron box. I have installed Solaris x86 and have had no problem with it, however I hardly used that box with Solaris as my installation was only to try out Solaris on my Opteron workstation.

Instead, on that workstation I constantly run Linux, and twice in a few months I came across (while running Fedora Linux) several I/O errors on the SATA disk attached to that controller. I thought at first that the hard drive was gone, but then I swapped that controller with a sil3112 and the I/O errors stopped. I swapped back the sil3114 and have had no errors since.

I reckon that it might have been due to one of the SATA cables (power or data?) not making a perfect contact. SATA connectors are of extremely poor quality and they fail to hold in place as well as the older IDE or SCSI or molex power connectors. I noticed as well that they crack easily if inadvertently pulled or pushed while working inside the computer case.
Nigel Smith
2007-Oct-31 12:54 UTC
[zfs-discuss] zfs corruption w/ sil3114 sata controllers
Ok, this is a strange problem! You seem to have tried & eliminated all the possible issues that the community has suggested!

I was hoping you would see some errors logged in '/var/adm/messages' that would give a clue.

Your original 'zpool status' said 140 errors. Over what time period are these occurring? I'm wondering if the errors are occurring at a constant steady rate or if there are bursts of errors? Maybe you could monitor zpool status while generating activity with "dd" or similar. You could use "zpool iostat <interval>" to monitor bandwidth and see if it is reasonably steady or erratic.

From your "prtconf -D" we see the 3114 card is using the "ata" driver, as expected. I believe the driver can talk to the disk drive in either PIO or DMA mode, so you could try changing that in the "ata.conf" file. See here for details:
http://docs.sun.com/app/docs/doc/819-2254/ata-7d?a=view

I've just had a quick look at the source code for the ata driver, and there does seem to be specific support for the Silicon Image chips in the drivers:
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/intel/io/dktp/controller/ata/sil3xxx.c
and
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/intel/io/dktp/controller/ata/sil3xxx.h

The file "sil3xxx.h" does mention:
"Errata Sil-AN-0109-B2 (Sil3114 Rev 0.3)
To prevent erroneous ERR set for queued DMA transfers
greater then 8k, FIS reception for FIS0cfg needs to be set
to Accept FIS without Interlock"
..which I read as meaning there have been some 'issues' with this chip. And it sounds similar to the issue mentioned on the link that Tomasz supplied:
http://home-tj.org/wiki/index.php/Sil_m15w

If you decide to try a different SATA controller card, possible options are:

1. The si3124 driver, which supports SiI-3132 (PCI-E) and SiI-3124 (PCI-X) devices.

2. The AHCI driver, which supports the Intel ICH6 and later devices, often found on motherboards.

3. The NV_SATA driver which supports Nvidia ck804/mcp55 devices.

Regards
Nigel Smith
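To see whether the checksum errors arrive steadily or in bursts, the CKSUM column of "zpool status" could be sampled at intervals while the dd or sftp load is running. A rough sketch follows; the function names cksum_count and log_cksum are invented for illustration, and the parsing assumes the five-column NAME STATE READ WRITE CKSUM layout shown in the original post.

```shell
#!/bin/sh
# Hypothetical helpers (names are not from the thread).

# Read "zpool status" text on stdin and print the CKSUM column (5th field)
# of the first config row whose NAME matches $1.
cksum_count() {
    name=$1
    awk -v name="$name" '$1 == name && NF >= 5 { print $5; exit }'
}

# Print a timestamped CKSUM reading for a pool every $2 seconds
# (default 60) until interrupted. Needs the real zpool command.
log_cksum() {
    pool=$1
    interval=${2:-60}
    while :; do
        echo "$(date '+%H:%M:%S') $(zpool status "$pool" | cksum_count "$pool")"
        sleep "$interval"
    done
}
```

Running `log_cksum tank 30` in one terminal while copying data in another should show whether the counter climbs evenly or jumps. Since plain "zpool status" showed no errors until a scrub was run in this case, a scrub may need to be in progress for the counter to move at all.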
Edward Saipetch
2007-Oct-31 13:08 UTC
[zfs-discuss] zfs corruption w/ sil3114 sata controllers
Nigel,

Thanks for the response! Basically my last method of testing was to sftp a few 50-100MB files to /tank over a couple of minutes and force a scrub after. The very first time this happened, I was using it as a NAS device dumping data to it for over a week. I went to a customer's site to show him how cool zfs was and upon running zpool status, I saw the data corruption status telling me to restore from a backup. Running zpool status without a scrub shows no errors.

I tried mirrored devices, no raid whatsoever and raidz, all with the same results. All the motherboards I've been using only have PCI since I was hoping I could create a low cost solution as a POC. I'll test changing the transfer mode a bit later.

Other people have had better luck, so what other debugging can be done? I'm willing to even let someone have remote access to the box if they want.
Mario Goebbels
2007-Oct-31 13:14 UTC
[zfs-discuss] zfs corruption w/ sil3114 sata controllers
I haven't seen the beginning of this discussion, but seeing SiI sets the fire alarm off here.

The Silicon Image chipsets are renowned for being crap and causing data corruption. At least the variants that usually go onto mainboards. Based on this, I suggest that you should get a different card.

-mg
Edward Saipetch
2007-Oct-31 13:25 UTC
[zfs-discuss] zfs corruption w/ sil3114 sata controllers
Mario,

I don't have any issues getting a new card. The root of the discussion started because people did indeed post that they had good luck with them. In fact, when I went out there and google'd to find which cards worked well, it seemed to be at the top of the list.

I'm interested to know if it's something I can help resolve so other people don't have this problem, or at least make sure people don't run into the same issue I did.
grant beattie
2007-Nov-04 10:11 UTC
[zfs-discuss] zfs corruption w/ sil3114 sata controllers
Ed Saipetch wrote:
> To answer a number of questions:
>
> Regarding different controllers, I've tried 2 Syba Sil 3114 controllers purchased about 4 months apart. I've tried 5.4.3 firmware with one and 5.4.13 with another. Maybe Syba makes crappy Sil 3114 cards but it's the same one that someone on blogs.sun.com used with success. I had weird problems flashing the first card I got, hence the order of another one. I'm not sure how I could get 2 different controllers 4 months apart and then use them in 2 completely different computers and both controllers be bad.

another data point..

I run two SiI 3114 based cards in my home fileserver running s10u3. I was having ZFS data corruption issues and I suspected the SiI cards - that was until I replaced the motherboard/CPU/memory.

I didn't have the time or patience to try to determine which component was at fault, but I swapped the motherboard/CPU/memory and stressed it for a few hours and the data corruption problem was gone. before that, I was seeing data corruption issues within minutes.

maybe it was just memory, but I'll never know. I junked the old kit after I confirmed I had eliminated the problem.

grant.