thr3ads.net - freebsd stable - ZFS "zpool replace" problems [Jan 2010]

Jeremy Chadwick

2010-Jan-26 14:30 UTC

ZFS "zpool replace" problems

I'm removing the In-Reply-To mail headers for this thread, as you've now
hijacked it for a different purpose.  Please don't do this; start a new
thread altogether.  :-)

On Tue, Jan 26, 2010 at 02:57:20PM +0100, Gerrit Kühn wrote:
> I am still busy replacing RE2-disks with updated drives. I came across a
> very strange thing with zfs. Actually I had the following pool layout:
> 
> mclane# zpool status
>   pool: tank
>  state: ONLINE
>  scrub: none requested
> config:
> 
>         NAME        STATE     READ WRITE CKSUM
>         tank        ONLINE       0     0     0
>           raidz1    ONLINE       0     0     0
>             ad8     ONLINE       0     0     0
>             ad10    ONLINE       0     0     0
>             ad12    ONLINE       0     0     0
>         spares
>           ad14      AVAIL   
> 
> errors: No known data errors
> 
> All disks still have the firmware bug, so I want to replace them with
> disks that I already fixed. I put in an updated drive as ad18 and
> wanted to replace ad12 to get the drive with the broken firmware out:
> 
> mclane# zpool replace tank /dev/ad12 /dev/ad18 
> mclane# zpool status
>   pool: tank
>  state: ONLINE
> status: One or more devices is currently being resilvered.  The pool will
>         continue to function, possibly in a degraded state.
> action: Wait for the resilver to complete.
>  scrub: resilver in progress for 0h0m, 0.01% done, 52h51m to go
> config:
> 
>         NAME           STATE     READ WRITE CKSUM
>         tank           ONLINE       0     0     0
>           raidz1       ONLINE       0     0     0
>             ad8        ONLINE       0     0     0  7.21M resilvered
>             ad10       ONLINE       0     0     0  7.22M resilvered
>             replacing  ONLINE       0     0     0
>               ad12     ONLINE       0     0     0
>               ad18     ONLINE       0     0     0  10.7M resilvered
>         spares
>           ad14         AVAIL   
> 
> errors: No known data errors
> 
> However, something must have gone wrong during the resilvering process and
> it now looks like this:
> 
> mclane# zpool status
>   pool: tank
>  state: DEGRADED
> status: One or more devices has experienced an unrecoverable error.  An
>         attempt was made to correct the error.  Applications are unaffected.
> action: Determine if the device needs to be replaced, and clear the errors
>         using 'zpool clear' or replace the device with 'zpool replace'.
>    see: http://www.sun.com/msg/ZFS-8000-9P
>  scrub: resilver completed after 2h39m with 0 errors on Tue Jan 26 14:00:00 2010
> config:
> 
>         NAME           STATE     READ WRITE CKSUM
>         tank           DEGRADED     0     0     0
>           raidz1       DEGRADED     0     0     0
>             ad8        ONLINE       0     0     0  975M resilvered
>             ad10       ONLINE       0     0   142  974M resilvered
>             replacing  DEGRADED     0 7.25M     0
>               ad12     ONLINE       0     0     0
>               ad18     REMOVED      0     1     0  79.4M resilvered
>         spares
>           ad14         AVAIL   
> 
> errors: No known data errors
> 
> 
> What is going on here? ad18 obviously detached during the
> process. /var/log/messages just gives me
> 
> Jan 26 11:23:33 mclane kernel: ad18: FAILURE - device detached
> 
> Additionally ad10 obviously produced chksum errors. What do I do about the
> degraded replacing process? Can I terminate it somehow and maybe replace
> ad10 first? Any other hints?

I'm not sure how the above is supposed to work (I haven't personally
tried it), but:

1) Why didn't you offline the ad10 disk first?
   zpool offline tank ad10

2) How did you attach ad18?  Did you tell the system about it using
   atacontrol?  If so, what commands did you use?

3) Can you please provide uname -a output, as well as relevant dmesg
   output to show what kind of SATA controller you have, what's
   attached to what, etc.?

-- 
| Jeremy Chadwick                                   jdc@parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |
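
For readers following along, here is a minimal Python sketch (not something posted in the thread) that scans the config table of a `zpool status` report and lists every entry that is not ONLINE or has a non-zero READ, WRITE or CKSUM counter. The sample text is the degraded output quoted above; the names `find_problems` and `ROW` are made up for this illustration, and the parsing assumes the column layout shown in that output.

```python
import re

# Config table copied from the degraded `zpool status` output quoted above.
SAMPLE = """\
        NAME           STATE     READ WRITE CKSUM
        tank           DEGRADED     0     0     0
          raidz1       DEGRADED     0     0     0
            ad8        ONLINE       0     0     0  975M resilvered
            ad10       ONLINE       0     0   142  974M resilvered
            replacing  DEGRADED     0 7.25M     0
              ad12     ONLINE       0     0     0
              ad18     REMOVED      0     1     0  79.4M resilvered
        spares
          ad14         AVAIL
"""

# Assumed row layout: name, state, then the three error counters.
# Counters stay as strings because they may carry a suffix such as "7.25M".
ROW = re.compile(r"^\s*(\S+)\s+([A-Z]+)\s+(\S+)\s+(\S+)\s+(\S+)")


def find_problems(config_text):
    """Return (name, state, read, write, cksum) for every suspicious row."""
    problems = []
    for line in config_text.splitlines():
        match = ROW.match(line)
        if not match or match.group(1) == "NAME":
            continue  # header row, "spares" line, or a spare without counters
        name, state, read, write, cksum = match.groups()
        counters_clean = (read, write, cksum) == ("0", "0", "0")
        if state != "ONLINE" or not counters_clean:
            problems.append((name, state, read, write, cksum))
    return problems


if __name__ == "__main__":
    for row in find_problems(SAMPLE):
        print(row)
```

On the sample this flags the pool, the raidz1 vdev, ad10 (checksum errors), the replacing vdev and ad18 (REMOVED), which matches the two problems raised in the question: the detached ad18 and the checksum errors on ad10.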

Gerrit Kühn

2010-Jan-26 15:03 UTC

ZFS "zpool replace" problems

On Tue, 26 Jan 2010 06:30:21 -0800 Jeremy Chadwick
<freebsd@jdc.parodius.com> wrote about Re: ZFS "zpool replace" problems:

JC> I'm removing the In-Reply-To mail headers for this thread, as you've
JC> now hijacked it for a different purpose.  Please don't do this; start
JC> a new thread altogether.  :-)

Thanks. You're perfectly right, I should have done that.

JC> I'm not sure how the above is supposed to work (I haven't personally
JC> tried it), but:
JC> 
JC> 1) Why didn't you offline the ad10 disk first?
JC>    zpool offline tank ad10

Well, probably because I thought that zfs would simply handle the
situation. I just wanted to replace drive A with drive B, so this was
quite straightforward to me.

JC> 2) How did you attach ad18?  Did you tell the system about it using
JC>    atacontrol?  If so, what commands did you use?

Yes. The drives did not appear automatically (verified with atacontrol
list). I first tried a reinit of ata9, but that did not work, so I did a
detach/attach of ata9. After that the drive was there: it showed up in
atacontrol list and the device node appeared.

JC> 3) Can you please provide uname -a output, as well as relevant dmesg
JC>    output to show what kind of SATA controller you have, what's
JC>    attached to what, etc.?

Of course (the dmesg output is no longer available, so I am using
pciconf -vl and atacontrol list instead):

ATA channel 0:
    Master:      no device present
    Slave:  acd0 <Optiarc DVD RW AD-7540A/1.01> ATA/ATAPI revision 0
ATA channel 1:
    Master:      no device present
    Slave:       no device present
ATA channel 2:
    Master:  ad4 <ST380815AS/3.AAC> SATA revision 2.x
    Slave:       no device present
ATA channel 3:
    Master:  ad6 <ST380815AS/3.AAC> SATA revision 2.x
    Slave:       no device present
ATA channel 4:
    Master:  ad8 <WDC WD1000FYPS-01ZKB0/02.01B01> SATA revision 2.x
    Slave:       no device present
ATA channel 5:
    Master: ad10 <WDC WD1000FYPS-01ZKB0/02.01B01> SATA revision 2.x
    Slave:       no device present
ATA channel 6:
    Master: ad12 <WDC WD1000FYPS-01ZKB0/02.01B01> SATA revision 2.x
    Slave:       no device present
ATA channel 7:
    Master: ad14 <WDC WD1000FYPS-01ZKB0/02.01B01> SATA revision 2.x
    Slave:       no device present
ATA channel 8:
    Master:      no device present
    Slave:       no device present
ATA channel 9:
    Master:      no device present
    Slave:       no device present
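
As an aside for readers, the following is a small Python sketch (not part of the original mail) that turns an `atacontrol list` report like the one above into a mapping from channel number to attached devices, which makes it easy to check whether a drive such as ad18 is currently visible. In the listing above, channels 8 and 9 are both empty. The function name `devices_by_channel` and the two regular expressions are assumptions made for this example, based only on the output format shown here.

```python
import re

# Shortened excerpt of the `atacontrol list` output shown above.
SAMPLE = """\
ATA channel 4:
    Master:  ad8 <WDC WD1000FYPS-01ZKB0/02.01B01> SATA revision 2.x
    Slave:       no device present
ATA channel 9:
    Master:      no device present
    Slave:       no device present
"""

CHANNEL = re.compile(r"^ATA channel (\d+):")
# A populated slot looks like "Master:  ad8 <model/firmware> ...".
# Empty slots ("no device present") contain no <...> part and do not match.
DEVICE = re.compile(r"^\s+(Master|Slave):\s+(\S+)\s+<([^>]*)>")


def devices_by_channel(listing):
    """Map each ATA channel number to a list of (device, model) tuples."""
    channels = {}
    current = None
    for line in listing.splitlines():
        match = CHANNEL.match(line)
        if match:
            current = int(match.group(1))
            channels[current] = []
            continue
        match = DEVICE.match(line)
        if match and current is not None:
            channels[current].append((match.group(2), match.group(3)))
    return channels


if __name__ == "__main__":
    for channel, devices in devices_by_channel(SAMPLE).items():
        print(channel, devices or "empty")
```

A channel that maps to an empty list has no device attached at the time the listing was taken.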


FreeBSD mclane.rt.aei.uni-hannover.de 7.2-STABLE FreeBSD 7.2-STABLE #0:
Mon Sep  7 11:01:56 CEST 2009
root@mclane.rt.aei.uni-hannover.de:/usr/obj/usr/src/sys/MCLANE.72  amd64

The first six drives (up to ad14) are connected onboard (Supermicro dual
Opteron board with MCP55):

atapci1@pci0:0:5:0:     class=0x010485 card=0x161115d9 chip=0x037f10de rev=0xa3 hdr=0x00
    vendor     = 'Nvidia Corp'
    device     = 'MCP55 SATA/RAID Controller (MCP55S)'
    class      = mass storage
    subclass   = RAID
atapci2@pci0:0:5:1:     class=0x010485 card=0x161115d9 chip=0x037f10de rev=0xa3 hdr=0x00
    vendor     = 'Nvidia Corp'
    device     = 'MCP55 SATA/RAID Controller (MCP55S)'
    class      = mass storage
    subclass   = RAID
atapci3@pci0:0:5:2:     class=0x010485 card=0x161115d9 chip=0x037f10de rev=0xa3 hdr=0x00
    vendor     = 'Nvidia Corp'
    device     = 'MCP55 SATA/RAID Controller (MCP55S)'
    class      = mass storage
    subclass   = RAID

The other two (ad16 and ad18; the chassis has 8 slots, and the last two
were only intended for situations like the one I have now) are
connected to an extra PCI card:

atapci4@pci0:3:6:0:     class=0x010401 card=0x02409005 chip=0x02401095 rev=0x02 hdr=0x00
    vendor     = 'Silicon Image Inc (Was: CMD Technology Inc)'
    device     = 'SATA/Raid controller(2XSATA150) (SIL3112)'
    class      = mass storage
    subclass   = RAID

Meanwhile I took the ad18 drive out again and tried a different drive,
but zfs listed that one as "UNAVAIL" with corrupted data. Probably zfs
had already labeled the original disk for resilvering and is now looking
for exactly that one. I also put the disk that caused the problem above
back in. The resilver started again, but very soon the drive was detached
again, resulting in the same situation I described above.

Any help is greatly appreciated.


cu
  Gerrit
