I am having problems with silent data corruption on (some) drives
connected to an MCP55 SATA controller.

I have two servers, both running RELENG_7_0/amd64. One has the 570 Ultra
chipset, the other has 570 SLI. Both chipsets have the MCP55 SATA
controller.

The server with the 570 Ultra chipset has a bunch of older 250GB SATA-150
drives hooked up to the MCP55 controller and it is working just fine.
The server with the 570 SLI chipset has a bunch of new SATA-300 drives
hooked up to the MCP55 controller and it is giving me silent data
corruption (easily detectable by running a ZFS scrub; every time I run it,
new checksum errors show up). I know the drives are good because when
they are hooked up to another controller they work just fine.

Unfortunately the drives do not have a jumper for setting SATA-150
speed (they are Samsung 1 TB drives), and trying to force the drives to
SATA-150 speed with the "patch" provided by the manufacturer does not
seem to work (the drives still negotiate SATA-300 speed). I will try to
get my hands on another older SATA-150 drive (or a new one that can be
jumpered) to verify whether the culprit is the MCP55 revision (see below)
or the interface speed.


NOT working (570 SLI)
---------------------
atapci1@pci0:0:5:0: class=0x010185 card=0x72501462 chip=0x037f10de rev=0xa2 hdr=0x00
    vendor     = 'Nvidia Corp'
    device     = 'MCP55 SATA Controller'
    class      = mass storage
    subclass   = ATA

Working (570 Ultra)
-------------------
atapci1@pci0:0:5:0: class=0x010185 card=0xcb8410de chip=0x037f10de rev=0xa3 hdr=0x00
    vendor     = 'Nvidia Corp'
    device     = 'MCP55 SATA Controller'
    class      = mass storage
    subclass   = ATA

This is most likely related to kern/120296
(http://www.freebsd.org/cgi/query-pr.cgi?pr=kern/120296) and kern/121396
(http://www.freebsd.org/cgi/query-pr.cgi?pr=kern/121396).


If someone else is having data corruption problems with drives connected
to an MCP55 controller, it might be worth testing whether limiting the
drives to SATA-150 makes a difference. It will most likely take me a while
before I can verify this.

---
Daniel Eriksson (http://www.toomuchdata.com/)

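The two controllers above differ only in the card ID and the rev field (0xa2 versus 0xa3). For anyone comparing their own hardware, the following is a minimal sketch of pulling that field out of pciconf-style text. The function name mcp55_revs is made up for illustration, and the sketch assumes the selector, chip and rev fields sit on one line, as pciconf prints them.

```shell
#!/bin/sh
# Hypothetical helper: read `pciconf -lv`-style text on stdin and print the
# device selector and revision of every MCP55 SATA controller
# (chip=0x037f10de, the ID shown in the listings above).
mcp55_revs() {
    awk '/chip=0x037f10de/ {
        for (i = 1; i <= NF; i++)
            if ($i ~ /^rev=/) {
                sub(/^rev=/, "", $i)   # keep only the value, e.g. 0xa2
                print $1, $i
            }
    }'
}

# Usage on a live FreeBSD system:
#   pciconf -lv | mcp55_revs
```

Running it on both machines gives a quick way to confirm which silicon revision each one carries before drawing conclusions about the corruption.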
Hi

I'll look into that, provided I can find HW to work on. IIRC I have one in
the ATA collection, but I have to verify when I get to the lab.

-Søren

On 1 Jul 2008, at 11:01, Daniel Eriksson wrote:
> I am having problems with silent data corruption on (some) drives
> connected to an MCP55 SATA controller.

Daniel Eriksson wrote:
> Unfortunately the drives do not have a jumper for setting SATA-150
> speed (they are Samsung 1 TB drives), and trying to force the drives to
> SATA-150 speed with the "patch" provided by the manufacturer does not
> seem to work (the drives still negotiate SATA-300 speed). I will try to
> get my hands on another older SATA-150 drive (or a new one that can be
> jumpered) to verify whether the culprit is the MCP55 revision (see below)
> or the interface speed.

Which patch did you use?

-- 
WBR, Andrey V. Elsukov

Andrey V. Elsukov wrote:
> Which patch did you use?

I used BDM_SpeedSwitch1.zip
(http://www.samsung.com/global/system/business/hdd/faq/2007/10/29/184337 BDM_SpeedSwitch1.zip).

/Daniel

On Tue, Jul 01, 2008 at 11:01:17AM +0200, Daniel Eriksson wrote:
> The server with 570 Ultra chipset has a bunch of older 250GB SATA-150
> drives hooked up to the MCP55 controller and it is working just fine.
> The server with 570 SLI chipset has a bunch of new SATA-300 drives
> hooked up to the MCP55 controller and it is giving me silent data
> corruption (easily detectable by running ZFS scrub, every time I run it
> new checksum errors show up). I know the drives are good because when
> they are hooked up to another controller they work just fine.

With the same cables? Not that I want to use cables as a scapegoat, but
in this case it seems applicable.

> Unfortunately the drives do not have a jumper for setting SATA-150
> speed (they are Samsung 1 TB drives), and trying to force the drives to
> SATA-150 speed with the "patch" provided by the manufacturer does not
> seem to work (the drives still negotiate SATA-300 speed). I will try to
> get my hands on another older SATA-150 drive (or a new one that can be
> jumpered) to verify whether the culprit is the MCP55 revision (see below)
> or the interface speed.

Can you provide "atacontrol cap" output for one of the drives?

I know that in the case of Maxtor drives, there is a bug in one of their
disk firmwares which causes silent data corruption and/or SATA bus
lockups when NCQ is used on nForce 4 chipsets. Maxtor provides a
firmware update which fixes the bug.

-- 
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |

Hi

OK, the only "modern" nVidia board I have is MCP51 based; however, it uses
the same codepath as the MCP55.

Anyhow, there have been fixes for these in -current that are not in any of
the RELENG branches yet. Please try the attached patch, or even better, try
a -current kernel.

-Søren

-------------- next part --------------
A non-text attachment was scrubbed...
Name: ff
Type: application/octet-stream
Size: 1238 bytes
Desc: not available
Url : http://lists.freebsd.org/pipermail/freebsd-stable/attachments/20080701/30d963c2/ff.obj
-------------- next part --------------

On 1 Jul 2008, at 11:01, Daniel Eriksson wrote:
> I am having problems with silent data corruption on (some) drives
> connected to an MCP55 SATA controller.

Jeremy Chadwick wrote:
> With the same cables? Not that I want to use cables as a
> scapegoat, but in this case it seems applicable.

With the same cables, yes.

> Can you provide "atacontrol cap" output for one of the drives?

# atacontrol cap ad4

Protocol              Serial ATA II
device model          SAMSUNG HD103UJ
firmware revision     1AA01112
cylinders             16383
heads                 16
sectors/track         63
lba supported         268435455 sectors
lba48 supported       1953525168 sectors
dma supported
overlap not supported

Feature                      Support  Enable    Value           Vendor
write cache                    yes      yes
read ahead                     yes      yes
Native Command Queuing (NCQ)   yes       -      31/0x1F
Tagged Command Queuing (TCQ)   no       no      31/0x1F
SMART                          yes      yes
microcode download             yes      yes
security                       yes      no
power management               yes      yes
advanced power management      yes      no      0/0x00
automatic acoustic management  yes      no      0/0x00  254/0xFE

> I know in the case of Maxtor drives, there is a bug that exists in one
> of their disk firmwares which causes silent data corruption and/or
> SATA bus lockups when NCQ is used on nForce 4 chipsets. Maxtor
> provides a firmware update which fixes the bug.

Connecting (some of) the drives to a <JMicron JMB363 SATA300 controller>
or a <Promise PDC20318 SATA150 controller> makes them work just fine.

FreeBSD itself does not seem to notice any data corruption. I only
noticed it because "zpool status" reported checksum errors after I had
written almost 3 TB to the array. I then issued a "zpool scrub", and
within a couple of minutes I already had dozens of corrupt files (so I
stopped the scrub, deleted the pool and started fault-finding).

---
Daniel Eriksson (http://www.toomuchdata.com/)

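Since the corruption was only visible through the checksum counters in "zpool status", a small filter over that output can make it easier to see which devices are affected after each scrub. This is only a sketch: the function name zpool_cksum_errors is hypothetical, and it assumes the usual config table layout (a header row "NAME STATE READ WRITE CKSUM" followed by one row per device, ended by a blank line) with plain numeric counters.

```shell
#!/bin/sh
# Hypothetical helper: read `zpool status` text on stdin and print
# "device count" for every config row whose CKSUM column is not 0.
# Assumes the CKSUM counter is the fifth whitespace-separated field.
zpool_cksum_errors() {
    awk '
        $1 == "NAME" && $NF == "CKSUM"    { in_table = 1; next }
        in_table && NF == 0               { in_table = 0 }
        in_table && NF >= 5 && $5 != "0"  { print $1, $5 }
    '
}

# Usage on a live system:
#   zpool status | zpool_cksum_errors
```

If the same devices show up after every scrub while others stay clean, that supports the idea that the fault follows particular ports or drives rather than the whole pool.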
Søren Schmidt wrote:
> Please try the attached patch, or even better try a -current kernel.

The patch made no difference on RELENG_7_0 unfortunately. (And I cannot
try CURRENT on this server.)

___
Daniel Eriksson (http://www.toomuchdata.com/)

> Date: Tue, 1 Jul 2008 11:01:17 +0200
> From: "Daniel Eriksson" <daniel_k_eriksson@telia.com>
>
> I am having problems with silent data corruption on (some) drives
> connected to an MCP55 SATA controller.

I have a 570 SLI too (Asus M2N-SLI Deluxe), I've been looking for an
excuse to put FreeBSD on here :)

I'll start installing it, anything I should do to make this error more
obvious?

My hard drive is a WDC WD2000JS-00SGB0;
http://www.wdc.com/en/library/sata/2879-001146.pdf

Chris

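One way to make this kind of silent corruption visible without ZFS is to write random data, record a checksum of the stream before it reaches the disk, and compare after reading it back. The following is a minimal sketch of that idea, not something anyone in the thread ran. The function names are hypothetical, it uses the POSIX cksum CRC (much weaker than ZFS checksums, but enough to spot flipped data), and it assumes the read-back really comes from the disk, so the total size should exceed RAM or the filesystem should be remounted before verifying.

```shell
#!/bin/sh
# Hypothetical helpers for a write/read-back integrity check.

# write_test_files DIR COUNT SIZE_MB
# Writes COUNT files of SIZE_MB random megabytes into DIR and records the
# CRC of each stream (taken before it hits the disk) in DIR/manifest.txt.
write_test_files() {
    dir=$1; count=$2; size_mb=$3
    manifest="$dir/manifest.txt"
    : > "$manifest"
    i=0
    while [ "$i" -lt "$count" ]; do
        f="$dir/testfile.$i"
        # tee writes the file while cksum sees the same in-memory stream
        sum=$(dd if=/dev/urandom bs=1048576 count="$size_mb" 2>/dev/null |
              tee "$f" | cksum)
        printf '%s %s\n' "$sum" "$f" >> "$manifest"
        i=$((i + 1))
    done
}

# verify_test_files MANIFEST
# Re-reads every file listed in MANIFEST; prints mismatches and returns
# non-zero if any file no longer matches its recorded CRC and size.
verify_test_files() {
    bad=0
    while read -r crc size f; do
        actual=$(cksum < "$f")
        if [ "$actual" != "$crc $size" ]; then
            echo "MISMATCH: $f"
            bad=$((bad + 1))
        fi
    done < "$1"
    [ "$bad" -eq 0 ]
}

# Usage (paths are examples only):
#   write_test_files /mnt/test 100 1024
#   ...remount or reboot...
#   verify_test_files /mnt/test/manifest.txt
```

Repeating the verify pass several times is worthwhile, because the original report describes new errors appearing on every scrub.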
> Date: Wed, 2 Jul 2008 10:55:07 +0200
> "Daniel Eriksson" <daniel_k_eriksson@telia.com> wrote:
> Jeremy Chadwick wrote:
> >> Can the OP get some non-Samsung disks for testing?
>
> I've got a 750 GB Western Digital that I've been planning to use to
> verify if it's a SATA-150 / SATA-300 problem (it can be jumpered to
> SATA-150), but the drive is packed with valuable data that I'd have to
> move elsewhere first.
>
> I'll get to it eventually, but maybe not this week.
>
> ___
> Daniel Eriksson (http://www.toomuchdata.com/)

Looks like I'm the guinea pig for now, I'll post in about half an hour
with the results :)

This is a clean install; it works perfectly with the restriction jumper
on, now it comes off.

Chris

On Tue, Jul 1, 2008 at 5:01 AM, Daniel Eriksson
<daniel_k_eriksson@telia.com> wrote:
> I am having problems with silent data corruption on (some) drives
> connected to an MCP55 SATA controller.

I have an MCP55 controller here running most of my RAID array. When I
originally loaded this machine, I had many problems until I figured out
I could only use every other SATA port with any degree of reliability.

It turned out that this every-other-port rule applied only to the first
two drives I bought. These drives are:

ad4: 238475MB <SAMSUNG SP2504C VT100-33> at ata2-master SATA300

But when I bought new drives, they happily used every channel:

ad10: 715404MB <WDC WD7500AAKS-00RBA0 30.04G30> at ata5-master SATA300

The difference, I'm led to believe, is that the Samsung drive is a PATA
drive with a SATA-to-PATA bridge on it. The newer "true SATA" drives
work fine.

I wrote:
> I am having problems with silent data corruption on (some) drives
> connected to an MCP55 SATA controller.

The original problem showed up when talking to (brand new) Samsung 1TB
drives in SATA-300 mode hooked up to the onboard controller.

I have now tested with a 750GB Seagate drive in both SATA-300 and
SATA-150 mode. Unfortunately the problem was not Samsung-related or
SATA-300 specific.

This points to a driver problem with the chipset/controller combination,
or possibly some sort of strange interaction with other hardware
(interrupts?).

I have no idea how to troubleshoot this any further.

___
Daniel Eriksson (http://www.toomuchdata.com/)

On 7/1/08, Daniel Eriksson <daniel_k_eriksson@telia.com> wrote:
> The server with 570 SLI chipset has a bunch of new SATA-300 drives
> hooked up to the MCP55 controller and it is giving me silent data
> corruption (easily detectable by running ZFS scrub, every time I run it
> new checksum errors show up).

Could be in-memory data corruption. How much RAM is installed on the
system?

-Manjunath

Manjunath Ranganathaiah wrote:
> Could be in-memory data corruption. How much RAM installed on the
> system?

I doubt it. If it was a RAM problem then all drives would be affected.

___
Daniel Eriksson (http://www.toomuchdata.com/)

Hi Daniel,

Could you try the following patch?
You can apply this patch on FreeBSD 7.0 just by copying and pasting it
into your shell.

Before you apply this patch, you can check whether it is relevant to your
environment as follows.

1. Set bootverbose mode.

cat >> /boot/loader.conf << EOF
boot_verbose="YES"
EOF

2. Reboot your machine.

3. Check the dmesg log of your HDDs as follows.

dmesg | grep ata
.
.
ata2-master: pio=PIO4 wdma=WDMA2 udma=UDMA100 cable=40 wire
                                 ^^^^^^^^^^^^
ad4: 115328MB <Super Talent Tech 02.10103> at ata2-master SATA150
ata3-master: pio=PIO4 wdma=WDMA2 udma=UDMA133 cable=40 wire
ad6: 953869MB <Hitachi HDS721010KLA330 GKAOA51D> at ata3-master SATA150

If you have a device like 'ad4' which is detected as 'udma=UDMA100',
this patch should help.

------------------------patch start----------------------------
cd /usr/src/sys/dev/ata
cat > ata-chipset.c.patch <<EOF
--- ata-chipset.c.orig	2008-04-02 00:20:49.000000000 +0900
+++ ata-chipset.c	2008-07-18 19:15:24.000000000 +0900
@@ -377,6 +377,7 @@
 ata_sata_setmode(device_t dev, int mode)
 {
     struct ata_device *atadev = device_get_softc(dev);
+    struct ata_params *atacap = &atadev->param;
 
     /*
      * if we detect that the device isn't a real SATA device we limit
@@ -390,7 +391,7 @@
 
     /* on some drives we need to set the transfer mode */
     ata_controlcmd(dev, ATA_SETFEATURES, ATA_SF_SETXFER, 0,
-                   ata_limit_mode(dev, mode, ATA_UDMA6));
+                   ata_limit_mode(dev, mode, ata_umode(atacap)));
 
     /* query SATA STATUS for the speed */
     if (ch->r_io[ATA_SSTATUS].res &&
EOF
patch -l < ata-chipset.c.patch
------------------------patch end----------------------------

Best regards,
--
Munenori Ohuchi <ohuchi@iij.ad.jp>
Internet Initiative Japan Inc.
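The check in step 3 can be scripted so that affected channels stand out in a long verbose dmesg. This is a rough sketch with a hypothetical function name, low_udma_channels. The message above only names UDMA100; treating every value other than UDMA133 as interesting is an assumption, based on the patch replacing the fixed ATA_UDMA6 limit with the drive's own reported maximum.

```shell
#!/bin/sh
# Hypothetical helper: read verbose dmesg text on stdin and print every
# ATA channel line whose reported udma mode is something other than UDMA133.
low_udma_channels() {
    awk '/^ata[0-9]+-(master|slave):/ {
        for (i = 2; i <= NF; i++)
            if ($i ~ /^udma=/ && $i != "udma=UDMA133") {
                sub(/:$/, "", $1)      # "ata2-master:" -> "ata2-master"
                print $1, $i
            }
    }'
}

# Usage on a system booted with boot_verbose="YES":
#   dmesg | low_udma_channels
```

An empty result would suggest that the patch does not change anything for the attached drives, since the limit it computes would match the old one.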