Harald Schmalzbauer
2010-Feb-23 10:41 UTC
ahcich timeouts, only with ahci, not with ataahci
Hello,

I'm frequently getting my machine locked with ahcichX timeouts:

ahcich2: Timeout on slot 0
ahcich2: is 00000000 cs 00000001 ss 00000000 rs 00000001 tfd c0 serr 00000000
ahcich2: Timeout on slot 8
ahcich2: is 00000000 cs 00000100 ss 00000000 rs 00000100 tfd c0 serr 00000000
ahcich2: Timeout on slot 8
ahcich2: is 00000000 cs fffff07f ss ffffff7f rs ffffff7f tfd c0 serr 00000000
...

This happens when a backup over GbE overloads the ZFS/HDD capabilities. I reduced vfs.zfs.txg.timeout to 1 to prevent the machine from locking up almost immediately, but it still happens.

When I use ataahci (the old driver, if I understand things correctly) instead of ahci, I also see the ZFS burst write congestion, but it doesn't lead to controller timeouts and so doesn't block the machine.

Sometimes the machine recovers from the disk lock, but most often I have to reboot.

The kernel is from Feb. 19, so the recent ahci improvements are active. The controller is an ICH9R with 3 Samsung F3 SpinPoints.

Any ideas how to work around the hangs other than using the old ahci driver?

Thanks,

-Harry
I have been having the same issue since my last update on Feb 20.

ahcich1: Timeout on slot 3
ahcich1: is 00000000 cs 00000018 ss 00000000 rs 00000018 tfd d0 serr 00000000
ahcich1: Timeout on slot 5
ahcich1: is 00000000 cs 00000060 ss 00000000 rs 00000060 tfd d0 serr 00000000
ahcich1: Timeout on slot 7
ahcich1: is 00000000 cs 00000180 ss 00000000 rs 00000180 tfd d0 serr 00000000
ahcich1: Timeout on slot 9
ahcich1: is 00000000 cs 00000600 ss 00000000 rs 00000600 tfd d0 serr 00000000

> dmesg | grep ahci
ahci0: <Intel ICH9M AHCI SATA controller> port 0x1c40-0x1c47,0x1834-0x1837,0x1838-0x183f,0x1830-0x1833,0x1c20-0x1c3f mem 0xfc226000-0xfc2267ff irq 16 at device 31.2 on pci0
ahci0: [ITHREAD]
ahci0: AHCI v1.20 with 4 3Gbps ports, Port Multiplier not supported
ahcich0: <AHCI channel> at channel 0 on ahci0
ahcich0: [ITHREAD]
ahcich1: <AHCI channel> at channel 1 on ahci0
ahcich1: [ITHREAD]
ada0 at ahcich0 bus 0 scbus0 target 0 lun 0
cd0 at ahcich1 bus 0 scbus1 target 0 lun 0
Harald Schmalzbauer wrote:
> I'm frequently getting my machine locked with ahcichX timeouts:
> ahcich2: Timeout on slot 0
> ahcich2: is 00000000 cs 00000001 ss 00000000 rs 00000001 tfd c0 serr 00000000
> ahcich2: Timeout on slot 8
> ahcich2: is 00000000 cs 00000100 ss 00000000 rs 00000100 tfd c0 serr 00000000
> ahcich2: Timeout on slot 8
> ahcich2: is 00000000 cs fffff07f ss ffffff7f rs ffffff7f tfd c0 serr 00000000
> ...

Since is (interrupt status) is zero and `rs == cs | ss` (the running-command bitmasks in the driver and in the hardware agree), the controller is not reporting command completion. Given the TFD status 0xc0 with the BUSY bit set, I would suppose that either the disk is stuck in command processing for some reason, or the controller missed the command completion status.

Have you noticed a 30-second pause (the default ATA timeout) before the timeout message is printed? I just want to be sure that the driver waited long enough before giving up.

> This happens when backup over GbE overloads ZFS/HDD capabilities.
> I reduced vfs.zfs.txg.timeout to 1 to prevent the machine from locking
> up almost immediately, but it still happens.
> When I don't use ahci but ataahci (the old driver if I understand things
> correctly) I also see the ZFS burst write congestion, but this doesn't
> lead to controller timeouts, thus blocking the machine.
>
> Sometimes the machine recovers from the disk lock, but most often I have
> to reboot.

What does it look like when it doesn't recover? Can you send me the full log messages?

> Kernel is from Feb. 19, so recent ahci improvements are active.
> Controller is ICH9R with 3 Samsung F3 SpinPoints.
>
> Any ideas how to work around the hangs other than using the old ahci
> driver?

The old ataahci driver wasn't using NCQ. NCQ may trigger bugs in the drive firmware or expose protocol inconsistencies. I would recommend searching for errata for your drive and possibly a firmware update.

--
Alexander Motin
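For readers following along, the register reading described in the reply above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names (parse_dump, slots, summarize) are made up for this sketch and are not part of the ahci driver or of any FreeBSD tool. The field layout is taken from the log lines quoted in this thread, and the status bit values are the standard ATA status register bits.

```python
import re

# Standard ATA status register bits, as seen in the "tfd" field of the dump.
ATA_BSY = 0x80   # device busy
ATA_DRDY = 0x40  # device ready
ATA_DRQ = 0x08   # data request
ATA_ERR = 0x01   # error

# Matches "name hexvalue" pairs such as "cs 00000100" or "tfd c0".
FIELD_RE = re.compile(r"\b(is|cs|ss|rs|tfd|serr) ([0-9a-f]+)")


def parse_dump(line):
    """Return the hex fields of one 'ahcichN: is ... serr ...' line as ints."""
    return {name: int(value, 16) for name, value in FIELD_RE.findall(line)}


def slots(mask):
    """List the command slot numbers whose bits are set in a 32-bit mask."""
    return [bit for bit in range(32) if (mask >> bit) & 1]


def summarize(line):
    """Apply the checks from the reply above to one register dump line."""
    regs = parse_dump(line)
    return {
        # No interrupt status bits set: nothing was signalled to the driver.
        "no_interrupt_pending": regs["is"] == 0,
        # Driver-side running mask equals the hardware-side masks combined.
        "driver_matches_hardware": regs["rs"] == (regs["cs"] | regs["ss"]),
        # BUSY bit in the task file status.
        "busy": bool(regs["tfd"] & ATA_BSY),
        "running_slots": slots(regs["rs"]),
    }


if __name__ == "__main__":
    dump = ("ahcich2: is 00000000 cs 00000100 ss 00000000 "
            "rs 00000100 tfd c0 serr 00000000")
    print(summarize(dump))
```

Run against the second dump from the original report, the sketch reports slot 8 as still running with BUSY set and no interrupt pending, which is the pattern the reply describes.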
George Liaskos wrote:
> I am having the same issue since Feb 20 which was my last update.

Which version were you running before?

> ahcich1: Timeout on slot 3
> ahcich1: is 00000000 cs 00000018 ss 00000000 rs 00000018 tfd d0 serr 00000000
> ahcich1: Timeout on slot 5
> ahcich1: is 00000000 cs 00000060 ss 00000000 rs 00000060 tfd d0 serr 00000000
> ahcich1: Timeout on slot 7
> ahcich1: is 00000000 cs 00000180 ss 00000000 rs 00000180 tfd d0 serr 00000000
> ahcich1: Timeout on slot 9
> ahcich1: is 00000000 cs 00000600 ss 00000000 rs 00000600 tfd d0 serr 00000000

The situation looks similar, except that this is a CD drive and so without NCQ. I still don't see a bug on the driver side here. Do you see delays before the timeout messages? You may try enabling verbose kernel messages to get some more info.

>> dmesg | grep ahci
> ahci0: <Intel ICH9M AHCI SATA controller> port
> 0x1c40-0x1c47,0x1834-0x1837,0x1838-0x183f,0x1830-0x1833,0x1c20-0x1c3f
> mem 0xfc226000-0xfc2267ff irq 16 at device 31.2 on pci0
> ahci0: [ITHREAD]
> ahci0: AHCI v1.20 with 4 3Gbps ports, Port Multiplier not supported
> ahcich0: <AHCI channel> at channel 0 on ahci0
> ahcich0: [ITHREAD]
> ahcich1: <AHCI channel> at channel 1 on ahci0
> ahcich1: [ITHREAD]
> ada0 at ahcich0 bus 0 scbus0 target 0 lun 0
> cd0 at ahcich1 bus 0 scbus1 target 0 lun 0

Please show this with verbose messages as well.

--
Alexander Motin
Harald Schmalzbauer
2010-Mar-03 07:49 UTC
ahcich timeouts, only with ahci, not with ataahci
Alexander Motin wrote on 23.02.2010 16:10 (localtime):
> Harald Schmalzbauer wrote:
>> I'm frequently getting my machine locked with ahcichX timeouts:
>> ahcich2: Timeout on slot 0
>> ahcich2: is 00000000 cs 00000001 ss 00000000 rs 00000001 tfd c0 serr 00000000
>> ahcich2: Timeout on slot 8
>> ahcich2: is 00000000 cs 00000100 ss 00000000 rs 00000100 tfd c0 serr 00000000
>> ahcich2: Timeout on slot 8
>> ahcich2: is 00000000 cs fffff07f ss ffffff7f rs ffffff7f tfd c0 serr 00000000
>> ...
>
> Looking that is (Interrupt status) is zero and `rs == cs | ss` (running
> command bitmasks in driver and hardware), controller doesn't report
> command completion. Looking on TFD status 0xc0 with BUSY bit set, I
> would suppose that either disk stuck in command processing for some
> reason, or controller missed command completion status.
>
> Have you noticed 30 second (default ATA timeout) pause before timeout
> message printed? Just want to be sure that driver waited enough before
> give up.
>
>> This happens when backup over GbE overloads ZFS/HDD capabilities.
>> I reduced vfs.zfs.txg.timeout to 1 to prevent the machine from locking
>> up almost immediately, but it still happens.
>> When I don't use ahci but ataahci (the old driver if I understand things
>> correctly) I also see the ZFS burst write congestion, but this doesn't
>> lead to controller timeouts, thus blocking the machine.
>>
>> Sometimes the machine recovers from the disk lock, but most often I have
>> to reboot.
>
> How it looks when it doesn't? Can you send me full log messages?

Hello,

this morning I had a stall, but the machine recovered after about one minute. Here's what I got from the kernel:

ahcich2: Timeout on slot 29
ahcich2: is 00000000 cs 00000003 ss e0000003 rs e0000003 tfd c0 serr 00000000
em1: watchdog timeout -- resetting
em1: watchdog timeout -- resetting
ahcich2: Timeout on slot 10
ahcich2: is 00000000 cs 00006000 ss 00007c00 rs 00007c00 tfd c0 serr 00000000
ahcich2: Timeout on slot 18
ahcich2: is 00000000 cs 00040000 ss 00000000 rs 00040000 tfd c0 serr 00000000
ahcich2: Timeout on slot 2
ahcich2: is 00000000 cs 00000004 ss 00000000 rs 00000004 tfd c0 serr 00000000
ahcich2: Timeout on slot 2
ahcich2: is 00000000 cs 00000000 ss 0000000c rs 0000000c tfd 40 serr 00000000

Does this tell you something useful?

Thanks,

-Harry