Hi,

my system is running oi148 on a Supermicro X8SIL-F board. I have two pools (a 2 disc mirror and a 4 disc RAIDZ) with RAID-class SATA drives (Hitachi HUA72205 and SAMSUNG HE103UJ). The system runs as expected, but every few days (sometimes weeks) it comes to a halt due to these errors:

Dec 3 13:51:20 nasjpk gda: [ID 107833 kern.warning] WARNING: /pci@0,0/pci-ide@1f,2/ide@1/cmdk@0,0 (Disk1):
Dec 3 13:51:20 nasjpk Error for commandX 'read sector' Error Level: Fatal
Dec 3 13:51:20 nasjpk gda: [ID 107833 kern.notice] Requested Block 5503936, Error Block: 5503936
Dec 3 13:51:20 nasjpk gda: [ID 107833 kern.notice] Sense Key: uncorrectable data error
Dec 3 13:51:20 nasjpk gda: [ID 107833 kern.notice] Vendor 'Gen-ATA ' error code: XX7

The errors are not limited to this one disk; they happen on all disks. Sometimes several disks are listed before the system "crashes", sometimes just one. I cannot pin it down to a single defective disk (and I have already replaced the disks). I suspect a problem with the SATA controller or the driver. Can someone give me a hint as to whether that assumption sounds plausible?

I am planning to get a new "cheap" 6-8 port SATA2 or SATA3 controller and move the drives over to it. If the problem is driver or controller related, it should disappear. Can I simply reconnect the drives and expect all to be well, or will I have to reinstall because of different SATA "layouts" on the disks or the like?

Kind regards,
JP
On Thu, Jul 28, 2011 at 01:55:27PM +0200, Koopmann, Jan-Peter wrote:
> Hi,
>
> my system is running oi148 on a super micro X8SIL-F board. I have two pools
> (2 disc mirror, 4 disc RAIDZ) with RAID level SATA drives. (Hitachi HUA72205
> and SAMSUNG HE103UJ). The system runs as expected however every few days
> (sometimes weeks) the system comes to a halt due to these errors:
>
> Dec 3 13:51:20 nasjpk gda: [ID 107833 kern.warning] WARNING:
> /pci@0,0/pci-ide@1f,2/ide@1/cmdk@0,0 (Disk1):
> Dec 3 13:51:20 nasjpk Error for commandX 'read sector' Error Level:
> Fatal
> Dec 3 13:51:20 nasjpk gda: [ID 107833 kern.notice] Requested Block
> 5503936, Error Block: 5503936
> Dec 3 13:51:20 nasjpk gda: [ID 107833 kern.notice] Sense Key:
> uncorrectable data error
> Dec 3 13:51:20 nasjpk gda: [ID 107833 kern.notice] Vendor 'Gen-ATA '
> error code: XX7
>
> It is not related to this one disk. It happens on all disks. Sometimes
> several are listed before the system "crashes", sometimes just one. I cannot
> [...]

I tend to agree that the IDE driver seems to have a problem. On our machines (HP Z400 with a 0B4Ch-D motherboard and an 82801JI (ICH10 Family) controller, using an rpool 2-way mirror of WDC WD5000AAKS HDDs) we also sometimes see one drive get disabled due to "too many errors". zpool clear revives the pool (i.e. the HDD gets resilvered very quickly without any problem) until it occurs again (i.e. after some days, weeks, or months).

Unfortunately we couldn't find a procedure to reproduce the problem (e.g. as was possible for the Marvell controller in the early days).

Regards,
jel.
-- 
Otto-von-Guericke University     http://www.cs.uni-magdeburg.de/
Department of Computer Science   Geb. 29 R 027, Universitaetsplatz 2
39106 Magdeburg, Germany         Tel: +49 391 67 12768
On Jul 28, 2011, at 4:55 AM, Koopmann, Jan-Peter wrote:
> Hi,
>
> my system is running oi148 on a super micro X8SIL-F board. I have two pools (2 disc mirror, 4 disc RAIDZ) with RAID level SATA drives. (Hitachi HUA72205 and SAMSUNG HE103UJ). The system runs as expected however every few days (sometimes weeks) the system comes to a halt due to these errors:
>
> Dec 3 13:51:20 nasjpk gda: [ID 107833 kern.warning] WARNING: /pci@0,0/pci-ide@1f,2/ide@1/cmdk@0,0 (Disk1):
> Dec 3 13:51:20 nasjpk Error for commandX 'read sector' Error Level: Fatal
> Dec 3 13:51:20 nasjpk gda: [ID 107833 kern.notice] Requested Block 5503936, Error Block: 5503936
> Dec 3 13:51:20 nasjpk gda: [ID 107833 kern.notice] Sense Key: uncorrectable data error
> Dec 3 13:51:20 nasjpk gda: [ID 107833 kern.notice] Vendor 'Gen-ATA ' error code: XX7

Several things:

1. You are using SATA in IDE-compatibility mode. Usually this is a BIOS setting, and for most BIOSes IDE-compatibility mode is the default. Changing to AHCI is an improvement that includes better error monitoring.

2. In this case, the disk is returning an unrecoverable read error. This is the most common error for modern HDDs.

3. When #2 happens, consumer-grade disks can get stuck retrying forever. Enterprise-class drives limit their retries. For the retry-forever disks, the OS is responsible for ultimately timing out the I/O attempt. For many Solaris releases, the default retry/timeout cycle lasts 3 to 5 minutes. Because of #1, the disk cannot service more than one outstanding I/O, so all I/O to the disk is blocked while the failing request is retried, which impacts the rest of the pool.

> It is not related to this one disk. It happens on all disks. Sometimes several are listed before the system "crashes", sometimes just one. I cannot pinpoint it to a single defect disk though (and already have replaced the disks). I suspect that this is an error with the SATA controller or the driver. Can someone give me a hint on whether or not that assumption sounds feasible?
> I am planning on getting a new "cheap" 6-8 way SATA2 or SATA3 controller and switch over the drives to that controller. If it is driver/controller related the problem should disappear. Is it possible to simply reconnect the drives and all is going to be well or will I have to reinstall due to different SATA "layouts" on the disks or alike?

The ease of migration depends on your HBA and whether it writes metadata that is not compatible with other HBAs. For simple HBAs, it is quite common for disks to be migrated to other machines and the pool imported.

HTH,
-- richard
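
One way to check whether the errors really are spread across all disks, rather than concentrated on one device, is to tally the warnings per device path. The following is only a minimal Python sketch under stated assumptions: it assumes the log format shown in the original post and the usual `@` form of Solaris device paths. The function name `count_warnings_by_device` is hypothetical, and the `cmdk@1,0 (Disk2)` sample line is made up for illustration.

```python
import re
from collections import Counter

# Matches the device path and disk label in a warning line such as
# "... WARNING: /pci@0,0/pci-ide@1f,2/ide@1/cmdk@0,0 (Disk1):"
WARNING_RE = re.compile(r"WARNING:\s+(?P<path>/\S+)\s+\((?P<label>[^)]+)\)")


def count_warnings_by_device(lines):
    """Return a Counter mapping (device path, label) to the number of warnings."""
    counts = Counter()
    for line in lines:
        match = WARNING_RE.search(line)
        if match:
            counts[(match.group("path"), match.group("label"))] += 1
    return counts


# The Disk1 warning and the Sense Key line follow the post.
# The Disk2 line and the later timestamps are hypothetical.
sample = [
    "Dec  3 13:51:20 nasjpk gda: [ID 107833 kern.warning] WARNING: "
    "/pci@0,0/pci-ide@1f,2/ide@1/cmdk@0,0 (Disk1):",
    "Dec  3 13:51:20 nasjpk gda: [ID 107833 kern.notice] "
    "Sense Key: uncorrectable data error",
    "Dec  3 14:02:11 nasjpk gda: [ID 107833 kern.warning] WARNING: "
    "/pci@0,0/pci-ide@1f,2/ide@1/cmdk@1,0 (Disk2):",
    "Dec  3 14:05:40 nasjpk gda: [ID 107833 kern.warning] WARNING: "
    "/pci@0,0/pci-ide@1f,2/ide@1/cmdk@0,0 (Disk1):",
]

# Print devices from most to fewest warnings.
for (path, label), n in count_warnings_by_device(sample).most_common():
    print(f"{n:3d}  {label}  {path}")
```

In practice the lines would come from the system log file instead of the `sample` list. Counts that are roughly even across devices would be consistent with the controller or driver suspicion, while one dominant device would point back at a single disk or cable.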