On Sunday 20 August 2006 13:00, freebsd-stable-request@freebsd.org wrote:
> Do you mean different type of cables, or just another piece? I can't
> change cables by myself, servers are dedicated from provider, but as I
> can saw, they picked whole new machine from their HW storage and put new
> Samsung disk drives in. So these two last machines are brand new with
> new cables. (Probably with a same type of cables - all machines are ASUS
> RS120)

I can confirm the same behaviour with a ULi M1689/Newcastle Athlon64 based system running 6.1-RELEASE-p3 (i386). ad6 just detaches without warning and it takes a reboot to bring it back. atacontrol reinit has no effect. Tried the following to resolve the problems:

- Changed cables (both ad4 and ad6)
- Changed SATA power to legacy
- Moved the NIC and anything else from the shared PCI INT (thought I'd cracked it at this point as it was stable for a month, then it lost ad6 on a nightly dump)
- Remade my gmirror array as an ar. Put it straight back to gmirror again when I found out what a pain it is to rebuild after ad6 disappears.

Until I read this thread, I was convinced there was something flaky in my hardware/BIOS or WD's TLER. Now I'm not so sure.

Hardware:

$ pciconf -lv
agp0@pci0:0:0: class=0x060000 card=0x50001458 chip=0x168910b9 rev=0x00 hdr=0x00
    vendor   = 'Acer Labs Incorporated (ALi)'
    class    = bridge
    subclass = HOST-PCI
pcib1@pci0:1:0: class=0x060400 card=0x00000000 chip=0x524610b9 rev=0x00 hdr=0x01
    vendor   = 'Acer Labs Incorporated (ALi)'
    class    = bridge
    subclass = PCI-PCI
pcib2@pci0:2:0: class=0x060401 card=0x00000000 chip=0x524910b9 rev=0x00 hdr=0x01
    vendor   = 'Acer Labs Incorporated (ALi)'
    device   = 'M5249 HyperTransport to PCI Bridge'
    class    = bridge
    subclass = PCI-PCI
isab0@pci0:3:0: class=0x060100 card=0x50011458 chip=0x156310b9 rev=0x70 hdr=0x00
    vendor   = 'Acer Labs Incorporated (ALi)'
    device   = 'ALI M1563 South Bridge with Hypertransport Support'
    class    = bridge
    subclass = PCI-ISA
none0@pci0:3:1: class=0x068000 card=0x50031458 chip=0x710110b9 rev=0x00 hdr=0x00
    vendor   = 'Acer Labs Incorporated (ALi)'
    device   = 'ALI M7101 Power Management Controller'
    class    = bridge
atapci0@pci0:14:0: class=0x0101fa card=0x50021458 chip=0x522910b9 rev=0xc7 hdr=0x00
    vendor   = 'Acer Labs Incorporated (ALi)'
    device   = 'M1543 Southbridge EIDE Controller'
    class    = mass storage
    subclass = ATA
atapci1@pci0:14:1: class=0x01018f card=0xb0031458 chip=0x528910b9 rev=0x10 hdr=0x00
    vendor   = 'Acer Labs Incorporated (ALi)'
    class    = mass storage
    subclass = ATA
ohci0@pci0:15:0: class=0x0c0310 card=0x50041458 chip=0x523710b9 rev=0x03 hdr=0x00
    vendor   = 'Acer Labs Incorporated (ALi)'
    device   = 'M5237 OpenHCI 1.1 USB Controller'
    class    = serial bus
    subclass = USB
ohci1@pci0:15:1: class=0x0c0310 card=0x50041458 chip=0x523710b9 rev=0x03 hdr=0x00
    vendor   = 'Acer Labs Incorporated (ALi)'
    device   = 'M5237 OpenHCI 1.1 USB Controller'
    class    = serial bus
    subclass = USB
ohci2@pci0:15:2: class=0x0c0310 card=0x50041458 chip=0x523710b9 rev=0x03 hdr=0x00
    vendor   = 'Acer Labs Incorporated (ALi)'
    device   = 'M5237 OpenHCI 1.1 USB Controller'
    class    = serial bus
    subclass = USB
ehci0@pci0:15:3: class=0x0c0320 card=0x50041458 chip=0x523910b9 rev=0x01 hdr=0x00
    vendor   = 'Acer Labs Incorporated (ALi)'
    device   = 'USB 2.0 Enhanced Host Controller'
    class    = serial bus
    subclass = USB
hostb0@pci0:24:0: class=0x060000 card=0x00000000 chip=0x11001022 rev=0x00 hdr=0x00
    vendor   = 'Advanced Micro Devices (AMD)'
    device   = 'Athlon 64 / Opteron HyperTransport Technology Configuration'
    class    = bridge
    subclass = HOST-PCI
hostb1@pci0:24:1: class=0x060000 card=0x00000000 chip=0x11011022 rev=0x00 hdr=0x00
    vendor   = 'Advanced Micro Devices (AMD)'
    device   = 'Athlon 64 / Opteron Address Map'
    class    = bridge
    subclass = HOST-PCI
hostb2@pci0:24:2: class=0x060000 card=0x00000000 chip=0x11021022 rev=0x00 hdr=0x00
    vendor   = 'Advanced Micro Devices (AMD)'
    device   = 'Athlon 64 / Opteron DRAM Controller'
    class    = bridge
    subclass = HOST-PCI
hostb3@pci0:24:3: class=0x060000 card=0x00000000 chip=0x11031022 rev=0x00 hdr=0x00
    vendor   = 'Advanced Micro Devices (AMD)'
    device   = 'Athlon 64 / Opteron Miscellaneous Control'
    class    = bridge
    subclass = HOST-PCI
none1@pci1:0:0: class=0x030000 card=0x02071787 chip=0x51571002 rev=0x00 hdr=0x00
    vendor   = 'ATI Technologies Inc'
    device   = 'Radeon 7500 Series (RV200)'
    class    = display
    subclass = VGA
ahc0@pci2:5:0: class=0x010000 card=0x00000000 chip=0x81789004 rev=0x00 hdr=0x00
    vendor   = 'Adaptec Inc'
    device   = 'AHA-2940U/UW/2940D Ultra/Ultra Wide/Dual SCSI Host Adapter'
    class    = mass storage
    subclass = SCSI
xl0@pci2:6:0: class=0x020000 card=0x00000000 chip=0x905010b7 rev=0x00 hdr=0x00
    vendor   = '3COM Corp, Networking Division'
    device   = '3C905-TX Fast Etherlink XL PCI 10/100'
    class    = network
    subclass = ethernet
fxp0@pci2:8:0: class=0x020000 card=0xb01e0e11 chip=0x12298086 rev=0x05 hdr=0x00
    vendor   = 'Intel Corporation'
    device   = '82550/1/7/8/9 EtherExpress PRO/100(B) Ethernet Adapter'
    class    = network
    subclass = ethernet

$ gmirror list gm0
Geom name: gm0
State: COMPLETE
Components: 2
Balance: round-robin
Slice: 4096
Flags: NONE
GenID: 0
SyncID: 3
ID: 2674318127
Providers:
1. Name: mirror/gm0
   Mediasize: 164696555008 (153G)
   Sectorsize: 512
   Mode: r8w8e10
Consumers:
1. Name: ad4
   Mediasize: 164696555520 (153G)
   Sectorsize: 512
   Mode: r1w1e1
   State: ACTIVE
   Priority: 0
   Flags: DIRTY
   GenID: 0
   SyncID: 3
   ID: 4037385812
2. Name: ad6
   Mediasize: 164696555520 (153G)
   Sectorsize: 512
   Mode: r1w1e1
   State: ACTIVE
   Priority: 0
   Flags: DIRTY
   GenID: 0
   SyncID: 3
   ID: 3245363724

Both drives are WD RE2s with TLER.

-- 
Matt Dawson. matt@chronos.org.uk MTD15-RIPE OpenNIC M_D9 MD51-6BONE

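
A mirror that has silently lost a member, as described in the message above, could be spotted by comparing the `Components:` count with the number of consumers reported as ACTIVE. The following is only a minimal sketch under that assumption: `mirror_health` is a hypothetical helper, and `SAMPLE` is an abridged copy of the `gmirror list gm0` output quoted above, not a complete listing.

```python
import re

def mirror_health(gmirror_list_text):
    """Return (expected_components, active_consumer_names) from `gmirror list` text."""
    expected = int(re.search(r"Components:\s*(\d+)", gmirror_list_text).group(1))
    # Everything after the "Consumers:" heading describes the member disks,
    # so the provider's own "Name:" line is left out of the search.
    consumers_part = gmirror_list_text.split("Consumers:", 1)[1]
    # Each consumer entry has a Name followed, some fields later, by its State.
    pairs = re.findall(r"Name:\s*(\S+).*?State:\s*(\S+)", consumers_part, flags=re.S)
    active = [name for name, state in pairs if state == "ACTIVE"]
    return expected, active

# Abridged from the output in the message above.
SAMPLE = """Geom name: gm0
State: COMPLETE
Components: 2
Providers:
1. Name: mirror/gm0
   Mode: r8w8e10
Consumers:
1. Name: ad4
   Mode: r1w1e1
   State: ACTIVE
2. Name: ad6
   Mode: r1w1e1
   State: ACTIVE
"""

expected, active = mirror_health(SAMPLE)
if len(active) < expected:
    print("mirror degraded, active members:", active)
else:
    print("all", expected, "members active:", active)
```

On a live system the text would come from running `gmirror list gm0`; the sketch deliberately works on a string so it can be tried anywhere.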
On Sun, Aug 20, 2006 at 01:38:55PM +0100, Matt Dawson wrote:
> On Sunday 20 August 2006 13:00, freebsd-stable-request@freebsd.org wrote:
> > Do you mean different type of cables, or just another piece? I can't
> > change cables by myself, servers are dedicated from provider, but as I
> > can saw, they picked whole new machine from their HW storage and put new
> > Samsung disk drives in. So these two last machines are brand new with
> > new cables. (Probably with a same type of cables - all machines are ASUS
> > RS120)
>
> I can confirm the same behaviour with a ULi M1689/Newcastle Athlon64 based
> system running 6.1-RELEASE-p3 (i386). ad6 just detaches without warning and
> it takes a reboot to bring it back. atacontrol reinit has no effect. Tried
> the following to resolve the problems:
>
> - Changed cables (both ad4 and ad6)
> - Changed SATA power to legacy
> - Moved the NIC and anything else from the shared PCI INT (thought I'd cracked
>   it at this point as it was stable for a month, then it lost ad6 on a nightly
>   dump)
> - Remade my gmirror array as an ar. Put it straight back to gmirror again when
>   I found out what a pain it is to rebuild after ad6 disappears.

I am not sure if it is related, but...

I experienced a similar sort of problem, although the details in my case are quite different. What was similar was that I would "lose" two ATA drives from an array, inexplicably. Reconfiguring the same drives and rebuilding would cause them to work perfectly again -- for some number of days, after which the same failure would occur.

What is different is that this was with a 3Ware RAID controller -- which made removing/reconfiguring/rebuilding much easier -- but I was seeing the exact same errors.

This happened four times (with the same errors that have been discussed here), running 6.1 STABLE as of June 22. Before attempting to RMA the drives, I tried an updated kernel, 6.1 STABLE as of July 19. Strangely enough, the problems disappeared.

So, while I have not checked everything that has changed, it _might_ be worth trying 6.1 STABLE...

-- 
greg byshenk - gbyshenk@byshenk.net - Leiden, NL

Konstantin Saurbier
2006-Aug-21 07:25 UTC
New Intel boards (was: Re: ATA problems again ... general problem of ICH7 or ATA?)
Hi!

On 21.08.2006 at 09:10, Patrick M. Hausen wrote:
> Hi, all!
>
> On Mon, Aug 21, 2006 at 04:03:47AM +0200, Konstantin Saurbier wrote:
>
>> This errors are not driver or OS dependent such as they appear on
>> FreeBSD as well on different Linux distros.
>> Since not all controllers suffering of these errors it is maybe
>> depending on the firmware or board/chip revisions.
>
> We have two brand new TYAN B5161G20SH4 systems that feature
> ICH7 controllers and SATA-hotplug-bays. One system is equipped
> with two Seagate ST3160811AS drives, the other one with
> WD1600YS-01SHB0 drives.
> Both are configured with gmirror for slice 1.
>
> No problems at all after several days of "make -j4 buildworld".
>
> OTOH I can confirm that I got random "watchdog timeouts"
> with the em driver. debug.mpsafenet=0 fixed the problem
> for now.

Sorry, my post was way too unspecific. My response was only for Greg Byshenk and his 3ware related problem. 3ware controllers tend to lose drives or mark drives as broken which are not broken at all. So his problems with 3ware are not related to this thread of ATA/ICH bugs.

-- 
Best regards,
Konstantin Saurbier

------------------------------------------------------
Konstantin Saurbier Tel.: 0521 106 3861
Computerlabor Mathematik U5-138
Universitaet Bielefeld
Universitaetsstr.25
33501 Bielefeld
email: saurbier@math.uni-bielefeld.de
------------------------------------------------------

Patrick M. Hausen
2006-Aug-21 10:53 UTC
ATA problems again ... general problem of ICH7 or ATA?
Hi!

On Sun, Aug 20, 2006 at 01:38:55PM +0100, Matt Dawson wrote:
> I can confirm the same behaviour with a ULi M1689/Newcastle Athlon64 based
> system running 6.1-RELEASE-p3 (i386). ad6 just detaches without warning and
> it takes a reboot to bring it back. atacontrol reinit has no effect. Tried
> the following to resolve the problems:

I don't know what is supposed to be the canonical way to reattach a disconnected SATA drive, but while testing our new hardware and hot-pulling a drive while the system was running, atacontrol reinit didn't find the reinserted drive here, either.

atacontrol detach ata3; atacontrol attach ata3

did.

HTH,
Patrick
-- 
punkt.de GmbH Internet - Dienstleistungen - Beratung
Vorholzstr. 25 Tel. 0721 9109 -0 Fax: -100
76137 Karlsruhe http://punkt.de

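
The detach-then-attach sequence from the message above could be wrapped in a small helper so it can be reviewed before anything is run. This is only a sketch: `reattach_commands` and `reattach` are hypothetical names, the channel `ata3` is the one from the message and will differ per system, and nothing here is verified to recover a drive that dropped off on its own.

```python
import subprocess

def reattach_commands(channel):
    """Build the detach/attach sequence from the message above for one ATA channel."""
    return [
        ["atacontrol", "detach", channel],
        ["atacontrol", "attach", channel],
    ]

def reattach(channel, dry_run=True):
    """Print each command; execute it only when dry_run is False (needs root)."""
    for cmd in reattach_commands(channel):
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the sequence if detach fails, so attach is not
            # issued against a channel in an unknown state.
            subprocess.run(cmd, check=True)

# Dry run: only prints the two commands, touches nothing.
reattach("ata3")
```

Keeping the default at `dry_run=True` is a deliberate choice for a production box: the commands are shown first and only run when explicitly requested.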
On Monday 21 August 2006 13:00, freebsd-stable-request@freebsd.org wrote:
> > I can confirm the same behaviour with a ULi M1689/Newcastle Athlon64
> > based system running 6.1-RELEASE-p3 (i386). ad6 just detaches without
> > warning and it takes a reboot to bring it back. atacontrol reinit has no
> > effect. Tried the following to resolve the problems:
>
> I don't know what is supposed to be the canonical way to
> reattach a disconnected SATA drive, but while testing our
> new hardware and hot-pulling a drive while the system
> was running, atacontrol reinit didn't find the reinserted drive
> here, either.
>
> atacontrol detach ata3; atacontrol attach ata3 did.

Yes, that is the method for a controlled remove and reattach, a la hotplug SATA. AIUI, though, if the drive goes AWOL on its own you need to reinit the channel before issuing an atacontrol attach foo. In theory... (man 8 atacontrol)

In practice, the drive disappears, never to be probed again. A warm reboot without power down makes it appear again, so the drive itself isn't confused.

FWIW, the problem takes *far* longer to rear its head when the SATA controller has a PCI INT and IRQ to itself. Put a NIC onto a shared slot (a very Bad Thing [TM] as the BIOS simply maps the INT to a single IRQ and both devices end up sharing it. Now transfer a large file over the network and watch the ensuing hilarity) and it happens at least every couple of days. Now, with the slot shared with the SATA controller empty, I have six days uptime since the last event, which means I'm probably due one any time now.

At least gmirror rebuilds the array after a simple reboot, but I would expect the dd operation to throw a wobbly if it's a timing issue/fight for interrupt between the two drives/channels. It doesn't, which makes me wonder if I'm barking up the wrong tree, but I can't help noticing that SATA channels have one interrupt between them whereas PATA channels have one each and all of these reports are from SATA users...

I wonder what pciconf -lv shows on Miroslav's system? Is the SATA controller sharing an INT/IRQ with something else? Does moving that device to another slot alleviate the problem at all?

Please note that Miroslav and I are using totally different drives, chipsets and processors. He's using, IIRC, an Intel chip with an ICH7 southbridge and Samsung drives. I'm using an AMD Athlon 64 Newcastle (running the i386 port) on a ULi M1689 chipset with WD RE2 drives so, although I'd be more than happy to be the numpty that is wrong and to have ata(4) vindicated by someone else, I suspect it is ata(4) that is the problem. However, finger pointing isn't productive and is certainly not fair given that ata(4) has been progressing so well.

Anything else I can try to nail this irksome beast? Any suggestions for where I've been an idiot (easy, tiger!) and missed something obvious?

BTW, this is a production server (DLT backed up nightly, so the data is safe) so I can't just pull it to bits. I do have an identical (CPU/mobo) box in the workshop as a workstation, however, which I could buy/borrow another drive for and set up gmirror to try things out.

-- 
Matt Dawson. matt@chronos.org.uk MTD15-RIPE OpenNIC M_D9 MD51-6BONE
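
The question of whether the SATA controller shares an IRQ with another device can be checked from interrupt statistics rather than from `pciconf -lv`, whose output earlier in this thread does not list IRQ assignments. The sketch below assumes input shaped like `vmstat -i` output, with one `irqN: device [device ...] total rate` line per interrupt. `shared_irqs` is a hypothetical helper, the sample text is invented for illustration (it is not taken from any machine in this thread), and the exact format may differ between releases.

```python
def shared_irqs(vmstat_text):
    """Map IRQ name -> device list for every IRQ used by more than one device."""
    shared = {}
    for line in vmstat_text.splitlines():
        # Skip the header, totals and any non-IRQ interrupt sources.
        if not line.startswith("irq") or ":" not in line:
            continue
        irq, rest = line.split(":", 1)
        # Device names are the non-numeric fields; the purely numeric
        # fields at the end are the total count and the rate.
        devices = [tok for tok in rest.split() if not tok.isdigit()]
        if len(devices) > 1:
            shared[irq] = devices
    return shared

# Invented sample in the assumed format; device names are illustrative only.
SAMPLE = """interrupt                          total       rate
irq14: ata0                         1234          0
irq17: atapci1 fxp0               987654         12
irq18: xl0                         55555          1
"""

print(shared_irqs(SAMPLE))
```

If the controller name appears alongside a NIC in the result, that would match the shared-slot situation described in the message above; an empty result would suggest looking elsewhere.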