I'm removing the In-Reply-To mail headers for this thread, as you've now
hijacked it for a different purpose.  Please don't do this; start a new
thread altogether.  :-)

On Tue, Jan 26, 2010 at 02:57:20PM +0100, Gerrit Kühn wrote:
> I am still busy replacing RE2-disks with updated drives. I came across a
> very strange thing with zfs. Actually I had the following pool layout:
>
> mclane# zpool status
>   pool: tank
>  state: ONLINE
>  scrub: none requested
> config:
>
>         NAME        STATE     READ WRITE CKSUM
>         tank        ONLINE       0     0     0
>           raidz1    ONLINE       0     0     0
>             ad8     ONLINE       0     0     0
>             ad10    ONLINE       0     0     0
>             ad12    ONLINE       0     0     0
>         spares
>           ad14      AVAIL
>
> errors: No known data errors
>
> All disks still have the firmware bug, so I want to replace them with
> disks that I already fixed. I put in an updated drive as ad18 and
> wanted to replace ad12 to get the drive with the broken firmware out:
>
> mclane# zpool replace tank /dev/ad12 /dev/ad18
> mclane# zpool status
>   pool: tank
>  state: ONLINE
> status: One or more devices is currently being resilvered.  The pool will
>         continue to function, possibly in a degraded state.
> action: Wait for the resilver to complete.
>  scrub: resilver in progress for 0h0m, 0.01% done, 52h51m to go
> config:
>
>         NAME           STATE     READ WRITE CKSUM
>         tank           ONLINE       0     0     0
>           raidz1       ONLINE       0     0     0
>             ad8        ONLINE       0     0     0  7.21M resilvered
>             ad10       ONLINE       0     0     0  7.22M resilvered
>             replacing  ONLINE       0     0     0
>               ad12     ONLINE       0     0     0
>               ad18     ONLINE       0     0     0  10.7M resilvered
>         spares
>           ad14         AVAIL
>
> errors: No known data errors
>
> However, something must have gone wrong during the resilvering process and
> it now looks like this:
>
> mclane# zpool status
>   pool: tank
>  state: DEGRADED
> status: One or more devices has experienced an unrecoverable error.  An
>         attempt was made to correct the error.  Applications are
>         unaffected.
> action: Determine if the device needs to be replaced, and clear the
>         errors using 'zpool clear' or replace the device with 'zpool
>         replace'.
>    see: http://www.sun.com/msg/ZFS-8000-9P
>  scrub: resilver completed after 2h39m with 0 errors on Tue Jan 26
>         14:00:00 2010
> config:
>
>         NAME           STATE     READ WRITE CKSUM
>         tank           DEGRADED     0     0     0
>           raidz1       DEGRADED     0     0     0
>             ad8        ONLINE       0     0     0  975M resilvered
>             ad10       ONLINE       0     0   142  974M resilvered
>             replacing  DEGRADED     0 7.25M     0
>               ad12     ONLINE       0     0     0
>               ad18     REMOVED      0     1     0  79.4M resilvered
>         spares
>           ad14         AVAIL
>
> errors: No known data errors
>
>
> What is going on here? ad18 obviously detached during the
> process. /var/log/messages just gives me
>
> Jan 26 11:23:33 mclane kernel: ad18: FAILURE - device detached
>
> Additionally ad10 obviously produced chksum errors. What do I do about the
> degraded replacing process? Can I terminate it somehow and maybe replace
> ad10 first? Any other hints?

I'm not sure how the above is supposed to work (I haven't personally
tried it), but:

1) Why didn't you offline the ad10 disk first?

   zpool offline tank ad10

2) How did you attach ad18?  Did you tell the system about it using
   atacontrol?  If so, what commands did you use?

3) Can you please provide uname -a output, as well as relevant dmesg
   output to show what kind of SATA controller you have, what's
   attached to what, etc.?

-- 
| Jeremy Chadwick                                   jdc@parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |

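For anyone reading the status output quoted in this message, a small filter can help pick out the rows that need attention. The following is only a minimal sketch: the function name zpool_problem_rows is made up for illustration and is not part of ZFS, and it assumes the five-column "NAME STATE READ WRITE CKSUM" layout shown in this thread. It prints every config row whose state is neither ONLINE nor AVAIL, or whose error counters are non-zero.

```shell
#!/bin/sh
# Hypothetical helper (not part of ZFS): print config rows from `zpool status`
# whose state is not ONLINE/AVAIL or whose READ/WRITE/CKSUM counters are non-zero.
# Assumes the "NAME STATE READ WRITE CKSUM" layout quoted in this thread.
zpool_problem_rows() {
    awk '
        $1 == "NAME" && $2 == "STATE" { in_config = 1; next }  # header row starts the table
        $1 == "errors:"               { in_config = 0 }         # trailer line ends the table
        !in_config || NF < 2          { next }                  # skip blanks and the "spares" heading
        {
            bad = ($2 != "ONLINE" && $2 != "AVAIL")
            # Counters may carry a suffix such as 7.25M, so compare them as strings.
            for (i = 3; i <= 5 && i <= NF; i++)
                if ($i != "0") bad = 1
            if (bad) print $1, $2, $3, $4, $5
        }
    '
}

# Usage on a live system:
#   zpool status tank | zpool_problem_rows
```

Run against the DEGRADED output quoted above, this would list tank, raidz1, ad10 (142 checksum errors), the replacing vdev, and ad18, while leaving ad8, ad12 and the spare out.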
On Tue, 26 Jan 2010 06:30:21 -0800 Jeremy Chadwick
<freebsd@jdc.parodius.com> wrote about Re: ZFS "zpool replace" problems:

JC> I'm removing the In-Reply-To mail headers for this thread, as you've
JC> now hijacked it for a different purpose.  Please don't do this; start
JC> a new thread altogether.  :-)

Thanks. You're perfectly right, I should have done that.

JC> I'm not sure how the above is supposed to work (I haven't personally
JC> tried it), but:
JC>
JC> 1) Why didn't you offline the ad10 disk first?
JC>    zpool offline tank ad10

Well, probably because I thought that zfs would simply handle the
situation. I just wanted to replace drive A with drive B, so this was
quite straight-forward for me.

JC> 2) How did you attach ad18?  Did you tell the system about it using
JC>    atacontrol?  If so, what commands did you use?

Yes. The drives did not appear automatically (verified with atacontrol
list). Then I first tried reinit ata9, but that did not work out, so I
did a detach/attach for ata9, then the drive was there (with list and
also the device node appeared).

JC> 3) Can you please provide uname -a output, as well as relevant dmesg
JC>    output to show what kind of SATA controller you have, what's
JC>    attached to what, etc.?

Of course (dmesg is not there anymore, so I use pciconf -vl and atacontrol
instead):

ATA channel 0:
    Master:      no device present
    Slave:  acd0 <Optiarc DVD RW AD-7540A/1.01> ATA/ATAPI revision 0
ATA channel 1:
    Master:      no device present
    Slave:       no device present
ATA channel 2:
    Master:  ad4 <ST380815AS/3.AAC> SATA revision 2.x
    Slave:       no device present
ATA channel 3:
    Master:  ad6 <ST380815AS/3.AAC> SATA revision 2.x
    Slave:       no device present
ATA channel 4:
    Master:  ad8 <WDC WD1000FYPS-01ZKB0/02.01B01> SATA revision 2.x
    Slave:       no device present
ATA channel 5:
    Master: ad10 <WDC WD1000FYPS-01ZKB0/02.01B01> SATA revision 2.x
    Slave:       no device present
ATA channel 6:
    Master: ad12 <WDC WD1000FYPS-01ZKB0/02.01B01> SATA revision 2.x
    Slave:       no device present
ATA channel 7:
    Master: ad14 <WDC WD1000FYPS-01ZKB0/02.01B01> SATA revision 2.x
    Slave:       no device present
ATA channel 8:
    Master:      no device present
    Slave:       no device present
ATA channel 9:
    Master:      no device present
    Slave:       no device present

FreeBSD mclane.rt.aei.uni-hannover.de 7.2-STABLE FreeBSD 7.2-STABLE #0:
Mon Sep 7 11:01:56 CEST 2009
root@mclane.rt.aei.uni-hannover.de:/usr/obj/usr/src/sys/MCLANE.72 amd64

The first six drives (up to ad14) are connected onboard (Supermicro dual
Opteron board with MCP55):

atapci1@pci0:0:5:0: class=0x010485 card=0x161115d9 chip=0x037f10de rev=0xa3 hdr=0x00
    vendor     = 'Nvidia Corp'
    device     = 'MCP55 SATA/RAID Controller (MCP55S)'
    class      = mass storage
    subclass   = RAID
atapci2@pci0:0:5:1: class=0x010485 card=0x161115d9 chip=0x037f10de rev=0xa3 hdr=0x00
    vendor     = 'Nvidia Corp'
    device     = 'MCP55 SATA/RAID Controller (MCP55S)'
    class      = mass storage
    subclass   = RAID
atapci3@pci0:0:5:2: class=0x010485 card=0x161115d9 chip=0x037f10de rev=0xa3 hdr=0x00
    vendor     = 'Nvidia Corp'
    device     = 'MCP55 SATA/RAID Controller (MCP55S)'
    class      = mass storage
    subclass   = RAID

The other two (ad16 and ad18; the chassis has 8 slots and the last two
were only intended to be used in situations like the one I have now) are
connected to an extra PCI card:

atapci4@pci0:3:6:0: class=0x010401 card=0x02409005 chip=0x02401095 rev=0x02 hdr=0x00
    vendor     = 'Silicon Image Inc (Was: CMD Technology Inc)'
    device     = 'SATA/Raid controller(2XSATA150) (SIL3112)'
    class      = mass storage
    subclass   = RAID

Meanwhile I took out the ad18 drive again and tried to use a different
drive. But that was listed as "UNAVAIL" with corrupted data by zfs.
Probably it already branded the disk for resilvering and is looking for
exactly this one now. I also put the disk which caused the problem above
back in. The resilvering process started again, but very soon the
drive got detached again, resulting in the same situation I described
above.

Any help is greatly appreciated.


cu
  Gerrit
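
For reference, the manual re-attach steps described in this message (atacontrol list, then detach/attach on ata9 after reinit did not help) can be written down as a small dry-run script. This is only a sketch under stated assumptions: the DRY_RUN variable and the function names run and reattach_channel are hypothetical, and ata9/ad18 are simply the channel and disk named in this thread. By default it only prints the commands; nothing is executed unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch of the re-attach steps described in this thread, using FreeBSD 7.x atacontrol(8).
# DRY_RUN, run and reattach_channel are hypothetical names chosen for illustration.
DRY_RUN=${DRY_RUN:-1}

run() {
    # Print the command; execute it only when DRY_RUN=0.
    echo "+ $*"
    [ "$DRY_RUN" = "0" ] || return 0
    "$@"
}

reattach_channel() {
    chan=$1    # ATA channel, e.g. ata9
    disk=$2    # expected disk name, e.g. ad18
    run atacontrol list              # check whether the drive is already listed
    run atacontrol detach "$chan"    # drop the channel ...
    run atacontrol attach "$chan"    # ... and probe it again
    run atacontrol list              # the drive should now show up
    # On a real run, stop here if the device node did not appear,
    # so that no zpool replace is started against a missing disk.
    if [ "$DRY_RUN" = "0" ] && [ ! -e "/dev/$disk" ]; then
        echo "/dev/$disk did not appear" >&2
        return 1
    fi
}

# Usage:
#   reattach_channel ata9 ad18             # dry run, prints the commands
#   DRY_RUN=0 reattach_channel ata9 ad18   # really run them (as root)
```

The reinit step is left out because it did not bring the drive up here; whether a detach/attach cycle is reliable on this SIL3112 card is exactly what is in question in this thread.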