I've seen mention of this kind of issue before, but I never saw a
solution, except that someone reported that a certain version of 6.x
seemed to make it go away - accounts of this problem are a bit vague.  I
am running 7.0-RC1, and I am seeing the errors periodically, and I am
wondering if this is a known issue.  Note that smartctl does not report
errors logged and gives a "PASSED" to the drive.  I am running at
UDMA100 ATA.  Also, if it matters, I am using ZFS.

Attached is a grep of the /var/log/messages file.  Let me know if anyone
has suggestions.

Thanks!
Joe

-------------- next part --------------
Jan 21 23:39:54 crater kernel: ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=54112319
Jan 22 00:06:29 crater kernel: ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=51610951
Jan 22 00:16:40 crater kernel: ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=53031647
Jan 22 00:30:15 crater kernel: ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=54243391
Jan 22 07:05:59 crater kernel: ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=51768047
Jan 22 09:08:16 crater kernel: ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=55890239
Jan 22 09:17:52 crater kernel: ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=55919423
Jan 22 09:23:42 crater kernel: ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=53470111
Jan 23 00:26:03 crater kernel: ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=53588527
Jan 23 00:26:26 crater kernel: ad0: TIMEOUT - WRITE_DMA48 retrying (1 retry left) LBA=764596887
Jan 23 00:26:26 crater kernel: ad0: TIMEOUT - WRITE_DMA48 retrying (0 retries left) LBA=764596887
Jan 23 00:26:26 crater kernel: ad0: FAILURE - WRITE_DMA48 status=51<READY,DSC,ERROR> error=10<NID_NOT_FOUND> LBA=764596887
Jan 23 03:01:06 crater kernel: ad0: TIMEOUT - READ_DMA retrying (1 retry left) LBA=185819705
Jan 23 03:01:37 crater kernel: ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=54837686
Jan 23 03:03:22 crater kernel: ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=53472407
Jan 23 03:03:39 crater kernel: ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=53627991
Jan 23 11:33:27 crater kernel: ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=57479999
Jan 23 12:30:31 crater kernel: ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=55407234
Jan 23 13:20:06 crater kernel: ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=57779519
Jan 23 17:30:18 crater kernel: ad0: TIMEOUT - READ_DMA48 retrying (1 retry left) LBA=453849407
Jan 23 17:30:19 crater kernel: ad0: FAILURE - READ_DMA48 status=51<READY,DSC,ERROR> error=10<NID_NOT_FOUND> LBA=453849407
Jan 23 17:30:29 crater kernel: ad0: TIMEOUT - READ_DMA retrying (1 retry left) LBA=187373078
Jan 23 18:34:50 crater kernel: ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=1017919
Jan 23 18:35:00 crater kernel: ad0: TIMEOUT - READ_DMA retrying (1 retry left) LBA=54547647
Jan 23 18:35:12 crater kernel: ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=56354060
Jan 23 18:35:20 crater kernel: ad0: TIMEOUT - READ_DMA retrying (1 retry left) LBA=53919167
Jan 23 23:59:18 crater kernel: ad0: TIMEOUT - FLUSHCACHE retrying (1 retry left)
Jan 24 00:00:27 crater kernel: ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=237661119
Jan 24 00:00:27 crater kernel: ad0: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=237661119
Jan 24 00:00:27 crater kernel: ad0: FAILURE - WRITE_DMA timed out LBA=237661119
Jan 24 00:00:27 crater kernel: ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=236239553
Jan 24 00:00:27 crater kernel: ad0: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=236239553
Jan 24 00:00:27 crater kernel: ad0: FAILURE - WRITE_DMA timed out LBA=236239553
Jan 24 00:00:27 crater kernel: ad0: TIMEOUT - WRITE_DMA48 retrying (1 retry left) LBA=764595671
Jan 24 00:00:27 crater kernel: ad0: TIMEOUT - WRITE_DMA48 retrying (0 retries left) LBA=764595671
Jan 24 00:00:27 crater kernel: ad0: FAILURE - WRITE_DMA48 timed out LBA=764595671
Jan 24 00:01:13 crater kernel: ad0: TIMEOUT - WRITE_DMA48 retrying (1 retry left) LBA=764595671
Jan 24 00:01:13 crater kernel: ad0: TIMEOUT - WRITE_DMA48 retrying (0 retries left) LBA=764595671
Jan 24 00:01:13 crater kernel: ad0: FAILURE - WRITE_DMA48 timed out LBA=764595671
Jan 24 00:01:13 crater kernel: ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=236180175
Jan 24 00:01:13 crater kernel: ad0: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=236180175
Jan 24 00:01:13 crater kernel: ad0: FAILURE - WRITE_DMA timed out LBA=236180175
Jan 24 00:01:13 crater kernel: ad0: TIMEOUT - FLUSHCACHE retrying (1 retry left)
Jan 24 00:01:13 crater kernel: ad0: TIMEOUT - FLUSHCACHE retrying (0 retries left)
Jan 24 02:31:53 crater kernel: ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=236191551
Jan 24 04:54:57 crater kernel: ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=238068287
Jan 24 04:55:56 crater kernel: ad0: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=238068287
Jan 24 04:55:56 crater kernel: ad0: FAILURE - WRITE_DMA timed out LBA=238068287
Jan 24 04:55:56 crater kernel: ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=236315627
Jan 24 04:55:56 crater kernel: ad0: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=236315627
Jan 24 04:55:56 crater kernel: ad0: FAILURE - WRITE_DMA timed out LBA=236315627
Jan 24 04:55:56 crater kernel: ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=238068415
Jan 24 04:55:56 crater kernel: ad0: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=238068415
Jan 24 04:55:56 crater kernel: ad0: FAILURE - WRITE_DMA timed out LBA=238068415
Jan 24 04:55:56 crater kernel: ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=236315627
Jan 24 04:55:56 crater kernel: ad0: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=236315627
Jan 24 06:38:42 crater kernel: ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=236217031
Jan 24 06:40:54 crater kernel: ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=236354111
Jan 24 09:00:11 crater kernel: ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=787071
Jan 24 23:56:40 crater kernel: ad0: TIMEOUT - WRITE_DMA48 retrying (1 retry left) LBA=764502723
Jan 24 23:56:43 crater kernel: ad0: FAILURE - WRITE_DMA48 status=51<READY,DSC,ERROR> error=10<NID_NOT_FOUND> LBA=764502723
Jan 25 03:01:08 crater kernel: ad0: TIMEOUT - WRITE_DMA48 retrying (1 retry left) LBA=765143487
Jan 25 03:01:58 crater kernel: ad0: FAILURE - WRITE_DMA48 status=51<READY,DSC,ERROR> error=10<NID_NOT_FOUND> LBA=765143487
Jan 25 03:01:58 crater kernel: ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=53815221
Jan 25 03:01:58 crater kernel: ad0: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=53815221
Jan 25 03:01:58 crater kernel: ad0: FAILURE - WRITE_DMA timed out LBA=53815221
Jan 25 03:01:58 crater kernel: ad0: TIMEOUT - WRITE_DMA48 retrying (1 retry left) LBA=764668013
Jan 25 03:01:58 crater kernel: ad0: TIMEOUT - WRITE_DMA48 retrying (0 retries left) LBA=764668013
Jan 25 03:01:58 crater kernel: ad0: FAILURE - WRITE_DMA48 timed out LBA=764668013
Jan 25 03:01:58 crater kernel: ad0: TIMEOUT - WRITE_DMA48 retrying (1 retry left) LBA=765143359
Jan 25 03:01:58 crater kernel: ad0: TIMEOUT - WRITE_DMA48 retrying (0 retries left) LBA=765143359
Jan 25 03:03:13 crater kernel: ad0: FAILURE - WRITE_DMA48 timed out LBA=765143359
Jan 25 03:03:13 crater kernel: ad0: TIMEOUT - READ_DMA retrying (1 retry left) LBA=185665215
Jan 25 03:03:13 crater kernel: ad0: TIMEOUT - READ_DMA retrying (0 retries left) LBA=185665215
Jan 25 03:03:13 crater kernel: ad0: FAILURE - READ_DMA timed out LBA=185665215
Jan 25 03:03:13 crater kernel: ad0: TIMEOUT - WRITE_DMA48 retrying (1 retry left) LBA=764561207
Jan 25 03:03:13 crater kernel: ad0: FAILURE - WRITE_DMA48 status=51<READY,DSC,ERROR> error=10<NID_NOT_FOUND> LBA=764561207
Jan 25 03:03:13 crater kernel: ad0: TIMEOUT - WRITE_DMA48 retrying (1 retry left) LBA=764674815
Jan 25 03:03:13 crater kernel: ad0: TIMEOUT - WRITE_DMA48 retrying (0 retries left) LBA=764674815
Jan 25 03:03:13 crater kernel: ad0: FAILURE - WRITE_DMA48 timed out LBA=764674815
Jan 25 03:03:13 crater kernel: ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=236113195
Jan 25 03:03:13 crater kernel: ad0: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=236113195
Jan 25 03:03:13 crater kernel: ad0: TIMEOUT - WRITE_DMA48 retrying (1 retry left) LBA=764561207
Jan 25 03:03:13 crater kernel: ad0: TIMEOUT - WRITE_DMA48 retrying (0 retries left) LBA=764561207
Jan 25 03:03:13 crater kernel: ad0: FAILURE - WRITE_DMA48 status=51<READY,DSC,ERROR> error=10<NID_NOT_FOUND> LBA=764561207
Jan 25 07:46:58 crater kernel: ad0: TIMEOUT - WRITE_DMA48 retrying (1 retry left) LBA=764587087
Jan 25 07:46:58 crater kernel: ad0: FAILURE - WRITE_DMA48 status=51<READY,DSC,ERROR> error=10<NID_NOT_FOUND> LBA=764587087
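
A log like the one above is easier to reason about once it is tallied by command and outcome. The following is a minimal sketch, not part of the original post, that counts TIMEOUT and FAILURE events per ATA command and collects the LBAs that ended in FAILURE. The function name `summarize_ata_log` and the regular expressions are assumptions made for illustration; they only target the `kernel: adN: TIMEOUT|FAILURE - COMMAND ...` line shape seen in this attachment.

```python
import re
from collections import Counter

# Matches e.g. "kernel: ad0: TIMEOUT - WRITE_DMA48 ..." (line shape taken
# from the attachment above; other ata(4) messages are simply skipped).
EVENT_RE = re.compile(r"kernel: (\w+): (TIMEOUT|FAILURE) - (\w+)")
LBA_RE = re.compile(r"LBA=(\d+)")


def summarize_ata_log(lines):
    """Return (counts, failed_lbas) for an iterable of log lines.

    counts      -- Counter keyed by (device, kind, command)
    failed_lbas -- set of integer LBAs that appear on FAILURE lines
    """
    counts = Counter()
    failed_lbas = set()
    for line in lines:
        event = EVENT_RE.search(line)
        if event is None:
            continue  # not a TIMEOUT/FAILURE line
        device, kind, command = event.groups()
        counts[(device, kind, command)] += 1
        lba = LBA_RE.search(line)
        # FLUSHCACHE lines carry no LBA, so the LBA is optional.
        if kind == "FAILURE" and lba is not None:
            failed_lbas.add(int(lba.group(1)))
    return counts, failed_lbas
```

To use it, pass an open file, for example `summarize_ata_log(open("/var/log/messages"))`, and print the counter. Whether failures cluster in a few LBA ranges or are spread across the disk is one more data point when deciding between a media problem and a controller, cable, or power problem.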
On Fri, Jan 25, 2008 at 08:58:41AM -0700, Joe Peterson wrote:
> I've seen mention of this kind of issue before, but I never saw a
> solution, except that someone reported that a certain version of 6.x
> seemed to make it go away - accounts of this problem are a bit vague.  I
> am running 7.0-RC1, and I am seeing the errors periodically, and I am
> wondering if this is a known issue.  Note that smartctl does not report
> errors logged and gives a "PASSED" to the drive.  I am running at
> UDMA100 ATA.  Also, if it matters, I am using ZFS.

What you've shown is usually the sign of a disk-related problem.  It's
very obvious when it's just one disk reporting DMA errors.  You use ZFS,
so chances are you have more than one disk in a pool/volume -- there's
no indication ad1, ad4, ad6, etc. are failing, so this seems to indicate
something specific to ad0.

Manufacturers pick very passive (non-aggressive) thresholds for error
conditions on disks, so disks which are failing very commonly show
"PASSED" during SMART analysis.  To make matters worse, most users I
know read SMART stats incorrectly (they're easy to misinterpret).

Can you please provide output of the following:

* smartctl -a /dev/ad0
* atacontrol cap ad0
* atacontrol info <ata0, ata1, etc. -- any controller used by ZFS>
* Relevant dmesg output that indicates what kind of ATA controller
  these disks are attached to.  Start with output from 'ad0:' and work
  backwards.  For example, ad0 on this machine is using an Intel ICH6
  controller:

atapci0: <Intel ICH6 SATA150 controller> port 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0xf000-0xf00f at device 31.2 on pci0
ata0: <ATA channel 0> on atapci0
ad0: 238475MB <WDC WD2500KS-00MJB0 02.01C03> at ata0-master SATA150

Other stuff:

SMART stats which are labelled "Offline" are only updated when a short
or long offline test is performed.  Have you tried using "smartctl -t
short /dev/ad0" and "smartctl -t long /dev/ad0" to see if any of the raw
values on the far right column increment?
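
One way to check whether any raw value moved is to save the attribute table (for example from "smartctl -A /dev/ad0") before and after a self-test and compare the last column.  The sketch below is an illustration only, not something posted in this thread.  It assumes the usual ten-column smartmontools attribute table (ID#, ATTRIBUTE_NAME, FLAG, VALUE, WORST, THRESH, TYPE, UPDATED, WHEN_FAILED, RAW_VALUE), and the helper names `parse_attributes` and `changed_raw_values` are made up for this example.

```python
def parse_attributes(text):
    """Map attribute name -> raw value string.

    Assumes the ten-column attribute table printed by smartmontools;
    any line that does not start with a numeric ID is ignored.
    """
    attrs = {}
    for line in text.splitlines():
        # At most 10 fields, so a raw value containing spaces stays whole.
        fields = line.split(None, 9)
        if len(fields) == 10 and fields[0].isdigit():
            attrs[fields[1]] = fields[9].strip()
    return attrs


def changed_raw_values(before_text, after_text):
    """Return {name: (old_raw, new_raw)} for attributes whose raw value differs."""
    before = parse_attributes(before_text)
    after = parse_attributes(after_text)
    return {
        name: (before[name], after[name])
        for name in before
        if name in after and before[name] != after[name]
    }
```

Counters such as power-on hours are expected to change between two snapshots, so the interesting entries are the error-related ones (reallocated, pending, or uncorrectable sectors, and CRC error counts) if the drive reports them.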

Have you tried using "zpool scrub" on the ZFS pool, then "zpool status"
to see if READ/WRITE/CHKSUM counters increment or if the "scrub" line
states there were errors?

Other things which have fixed problems in the past for others:

* BIOS updates
* Change of motherboards (sometimes replacing the board with the same
  model, other times going with a completely different vendor, which
  implies weird implementation issues or BIOS problems)
* Changing SATA cables
* Getting a larger power supply (usually when lots of disks are involved)

-- 
| Jeremy Chadwick                                  jdc at parodius.com |
| Parodius Networking                         http://www.parodius.com/ |
| UNIX Systems Administrator                    Mountain View, CA, USA |
| Making life hard for others since 1977.                PGP: 4BD6C0CB |
Julian H. Stacey
2008-Jan-25 09:41 UTC
"ad0: TIMEOUT - WRITE_DMA" type errors with 7.0-RC1
Jeremy Chadwick wrote:
> > wondering if this is a known issue.  Note that smartctl does not report
> > errors logged and gives a "PASSED" to the drive.  I am running at
> > UDMA100 ATA.  Also, if it matters, I am using ZFS.

> Can you please provide output of the following:
>
> * smartctl -a /dev/ad0

From ports/sysutils/smartmontools I presume ?

( Asking as I also have a DMA prob. to solve, at present
  needing hw.ata.ata_dma="0" in /boot/loader.conf to boot,
  (& interruptions on sound on 7-stable, though no ZFS here)).

smartctl:
  Not installed by /usr/src-7
  No /usr/ports/*/smartctl
  Clues found with locate for ports:
    sysutils/munin-node/files/patch-hddtemp_smartctl.in
    sysutils/sensors-applet/files/smartctl-helper.c
    sysutils/sensors-applet/files/smartctl-sensors-interface.c
    sysutils/sensors-applet/files/smartctl-sensors-interface.h
    sysutils/munin-main  # Not really ?
    ports/sysutils/sensors-applet -> ports/sysutils/smartmontools

-- 
Julian Stacey.  Munich Computer Consultant, BSD Unix C Linux.  http://berklix.com
On Fri, Jan 25, 2008 at 06:42:04PM +0100, Julian H. Stacey wrote:
> Jeremy Chadwick wrote:
> > > wondering if this is a known issue.  Note that smartctl does not report
> > > errors logged and gives a "PASSED" to the drive.  I am running at
> > > UDMA100 ATA.  Also, if it matters, I am using ZFS.
>
> > Can you please provide output of the following:
> >
> > * smartctl -a /dev/ad0
>
> From ports/sysutils/smartmontools I presume ?
> ( Asking as I also have a DMA prob. to solve, at present
>   needing hw.ata.ata_dma="0" in /boot/loader.conf to boot,
>   (& interruptions on sound on 7-stable, though no ZFS here)).

Yep!  smartctl comes with ports/sysutils/smartmontools.

-- 
| Jeremy Chadwick                                  jdc at parodius.com |
| Parodius Networking                         http://www.parodius.com/ |
| UNIX Systems Administrator                    Mountain View, CA, USA |
| Making life hard for others since 1977.                PGP: 4BD6C0CB |
Joe Peterson wrote:
> Joe Peterson wrote:
>> So I have started a "SeaTools" (disk scanner from Seagate) "long test" of the
>> drive.  The short test passed already.  The results should be interesting.  If
>> it finds nothing wrong, I am going to start to wonder if I am experiencing ZFS
>> bugs that just happen to look like drive problems.  I already did a long read,
>> under Linux, of disk contents, and got no messages about anything wrong.
>
> Update: both SHORT and LONG tests passed for this drive in SeaTools.
> Hmph... the mystery remains.

Were both tests done in the same machine (actually, I mean the same PSU)?
On Sat, Jan 26, 2008 at 01:15:31PM -0700, Joe Peterson wrote:
> Joe Peterson wrote:
> > So I have started a "SeaTools" (disk scanner from Seagate) "long test" of the
> > drive.  The short test passed already.  The results should be interesting.  If
> > it finds nothing wrong, I am going to start to wonder if I am experiencing ZFS
> > bugs that just happen to look like drive problems.  I already did a long read,
> > under linux, of disk contents, and got no messages about anything wrong.
>
> Update: both SHORT and LONG tests passed for this drive in SeaTools.
> Hmph... the mystery remains.

As do mine -- I also completed both short and long tests in SeaTools on
my drive (finished early this evening).  Absolutely no errors,
everything passed.

-- 
| Jeremy Chadwick                                  jdc at parodius.com |
| Parodius Networking                         http://www.parodius.com/ |
| UNIX Systems Administrator                    Mountain View, CA, USA |
| Making life hard for others since 1977.                PGP: 4BD6C0CB |