Folks,

I'm seeing something very unusual on one of our FreeBSD 5.4 Stable boxes which I'm having a hard time getting to the bottom of.

You may recall that a few weeks ago I posted regarding a server that was having trouble with WRITE_DMA and READ_DMA timeouts on its SATA disk. We finally decided to migrate to a new disk, so we purchased a brand new Western Digital 250GB SATA drive and transferred the data across before removing the old drive.

We got about two days of trouble-free access to this new disk before it too started throwing READ_DMA problems. This time they were error 40<UNCORRECTABLE>. Running smartctl on the disk showed a number of errors, and there were specific files on the disk that could not be read. We moved the disk to a desktop box to confirm the problem and noted that fsck couldn't fix the errors on the drive.

Assuming a dud drive, we purchased a replacement, and this time we spurned SATA in favour of a PATA drive (Western Digital 200GB). We installed the drive yesterday using a brand new UDMA cable. Imagine my surprise when I came in this morning to find that this new drive was also suffering from UNCORRECTABLE READ_DMA failures, and smartctl confirmed that the drive wasn't happy.

What are the odds of getting two dud disks from two separate batches of drives from a reputable brand?

The server itself is a 1U high rack mount installed in an AC'd machine room. It is powered from a UPS. There is space around the drive and a pair of fans draw air over the drive casing, so the casings are cool to the touch. The motherboard is an Intel S875PWP3 equipped with an Intel ICH5 chipset.

Is there any known problem with using WD SATA / PATA disks with FreeBSD 5.4 Stable with the above mainboard? Is it possible that a FreeBSD bug is causing problems with these drives, including the problems reported by smartctl?

Regards,

Tony.

-- 
Tony Byrne
Hello Tony,

Tuesday, July 19, 2005, 10:37:40 AM, you wrote:

TB> Folks,
TB> I'm seeing something very unusual on one of our FreeBSD 5.4 Stable
TB> boxes which I'm having a hard time getting to the bottom of.

Further information from my server logs:

Jul 19 13:01:48 roo kernel: ad0: FAILURE - READ_DMA status=51<READY,DSC,ERROR> error=40<UNCORRECTABLE> LBA=288810495
Jul 19 13:01:59 roo kernel: ad0: FAILURE - READ_DMA status=51<READY,DSC,ERROR> error=1<ILLEGAL_LENGTH> LBA=288810495
Jul 19 13:02:05 roo kernel: ad0: FAILURE - READ_DMA status=51<READY,DSC,ERROR> error=40<UNCORRECTABLE> LBA=288810495
Jul 19 13:02:16 roo kernel: ad0: FAILURE - READ_DMA status=51<READY,DSC,ERROR> error=40<UNCORRECTABLE> LBA=288810495
Jul 19 13:04:36 roo last message repeated 4 times

With this disk it appears to be the same LBA each time. How can I translate that LBA offset into something indicating the file affected?

I installed the *other* disk into a Windows box and ran the Western Digital Drive Tools SMART test on it. It found some sectors needing reallocation and successfully performed the reallocation. The tests (both short and long) now pass, but the drive's SMART status remains at 'fail'. When I examine the attributes, the Raw Read Error Rate is flagged.

I'm totally confused. I don't know enough about SMART to know whether I'm looking at real failing drives or some bug exposed by the interaction between drive firmware, HD controller and FreeBSD.

Regards,

Tony.

-- 
Tony Byrne
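
On the LBA question, a rough sketch of the arithmetic may help as a first step. It pulls the fields out of a log line like the ones above, checks that status=51 really is READY, DSC and ERROR set together, and converts the absolute LBA into a byte offset and a partition-relative position. Everything not in the logs is an assumption: the 512-byte sector size, the partition start sector, the 2048-byte fragment size, and the helper names parse_failure and locate, which are made up for illustration. The real partition start would have to come from the slice table and disklabel on the affected box, and this only narrows down where on the filesystem the bad sector sits; it does not by itself name the file.

```python
import re

SECTOR_SIZE = 512  # bytes per sector; an assumption, not taken from the logs

# Only the bit names that actually appear in the quoted log lines.
STATUS_NAMES = {0x40: "READY", 0x10: "DSC", 0x01: "ERROR"}
ERROR_NAMES = {0x40: "UNCORRECTABLE", 0x01: "ILLEGAL_LENGTH"}

# Matches e.g. "ad0: FAILURE - READ_DMA status=51<...> error=40<...> LBA=123"
LOG_RE = re.compile(
    r"(?P<dev>\w+): FAILURE - (?P<cmd>\w+) "
    r"status=(?P<status>[0-9a-fA-F]+)<[^>]*> "
    r"error=(?P<error>[0-9a-fA-F]+)<[^>]*> "
    r"LBA=(?P<lba>\d+)"
)


def decode_bits(value, names):
    """List the set bits of an 8-bit register, highest bit first."""
    out = []
    for bit in (0x80, 0x40, 0x20, 0x10, 0x08, 0x04, 0x02, 0x01):
        if value & bit:
            # Bits without a known name are reported by their hex value.
            out.append(names.get(bit, "bit 0x%02x" % bit))
    return out


def parse_failure(line):
    """Extract device, command, decoded registers and LBA from one log line."""
    m = LOG_RE.search(line)
    if m is None:
        return None
    return {
        "device": m.group("dev"),
        "command": m.group("cmd"),
        # The register values in the log are printed in hexadecimal.
        "status": decode_bits(int(m.group("status"), 16), STATUS_NAMES),
        "error": decode_bits(int(m.group("error"), 16), ERROR_NAMES),
        "lba": int(m.group("lba")),
    }


def locate(lba, partition_start, frag_size=2048):
    """Return (byte offset on disk, sector within partition, fragment number).

    partition_start is the absolute sector where the filesystem begins;
    frag_size is the filesystem fragment size in bytes. Both are assumptions
    that must be checked against the real disk layout.
    """
    rel = lba - partition_start
    if rel < 0:
        raise ValueError("LBA lies before the partition start")
    return lba * SECTOR_SIZE, rel, rel * SECTOR_SIZE // frag_size
```

For example, with a purely hypothetical partition start of sector 63, locate(288810495, 63) puts the bad sector at byte offset 147870973440 on the disk and at sector 288810432 within the partition. The partition-relative figure is the one to carry forward when trying to map the block back to an inode.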