Hmmm,

just upgraded some servers to U7. Unfortunately one server's primary disk
died during the upgrade, so luactivate was not able to activate the
s10u7 BE ("Unable to determine the configuration ..."). Since the rpool
is a 2-way mirror, boot-device=/pci@1f,700000/scsi@2/disk@1,0:a was
simply set to /pci@1f,700000/scsi@2/disk@0,0:a and it was checked whether
the machine still reboots unattended. As expected - no problem.

In the evening the faulty disk was replaced and the mirror resilvered via
'zpool replace rpool c1t1d0s0' (see below). Since there was no error and
everything was reported as healthy, the s10u7 BE was luactivated (no error
message here either) and 'init 6'.

Unfortunately, now the server was gone and no known recipe helped to
revive it (I guess LU damaged the zpool.cache?) :((((

Any hints how to get the rpool back?

Regards,
jel.


What has been tried 'til now:
########################################
{3} ok boot
Boot device: /pci@1f,700000/scsi@2/disk@1,0:a  File and args:
Bad magic number in disk label
Can't open disk label package

Can't open boot device

{3} ok
########################################
{3} ok boot /pci@1f,700000/scsi@2/disk@0,0:a
Boot device: /pci@1f,700000/scsi@2/disk@0,0:a  File and args:
SunOS Release 5.10 Version Generic_138888-08 64-bit
Copyright 1983-2009 Sun Microsystems, Inc.  All rights reserved.
Use is subject to license terms.
NOTICE: spa_import_rootpool: error 22

Cannot mount root on /pci@1f,700000/scsi@2/disk@0,0:a fstype zfs

panic[cpu3]/thread=180e000: vfs_mountroot: cannot mount root

000000000180b950 genunix:vfs_mountroot+358 (800, 200, 0, 1875c00, 189f800, 18ca000)
  %l0-3: 00000000010ba000 00000000010ba208 000000000187bba8 00000000011e8400
  %l4-7: 00000000011e8400 00000000018cc400 0000000000000600 0000000000000200
000000000180ba10 genunix:main+a0 (1815178, 180c000, 18397b0, 18c6800, 181b578, 1815000)
  %l0-3: 0000000001015400 0000000000000001 0000000070002000 0000000000000000
  %l4-7: 000000000183ec00 0000000000000003 000000000180c000 0000000000000000

skipping system dump - no dump device configured
rebooting...

SC Alert: Host System has Reset
########################################
{3} ok boot net -s
Boot device: /pci@1c,600000/network@2  File and args: -s
1000 Mbps FDX Link up
Timeout waiting for ARP/RARP packet
3a000 1000 Mbps FDX Link up
SunOS Release 5.10 Version Generic_137137-09 64-bit
Copyright 1983-2008 Sun Microsystems, Inc.  All rights reserved.
Use is subject to license terms.
Hardware watchdog enabled
Booting to milestone "milestone/single-user:default".
Configuring devices.
Using RPC Bootparams for network configuration information.
Attempting to configure interface ce1...
Skipped interface ce1
Attempting to configure interface ce0...
Configured interface ce0
Requesting System Maintenance Mode
SINGLE USER MODE
# mount -F zfs /dev/dsk/c1t1d0s0 /mnt
cannot open '/dev/dsk/c1t1d0s0': invalid dataset name
# mount -F zfs /dev/dsk/c1t0d0s0 /mnt
cannot open '/dev/dsk/c1t0d0s0': invalid dataset name
# zpool import
  pool: pool1
    id: 5088500955966129017
 state: ONLINE
action: The pool can be imported using its name or numeric identifier.
config:

        pool1       ONLINE
          mirror    ONLINE
            c1t2d0  ONLINE
            c1t3d0  ONLINE

  pool: rpool
    id: 5910200402071733373
 state: UNAVAIL
action: The pool cannot be imported due to damaged devices or data.
config:

        rpool         UNAVAIL  insufficient replicas
          mirror      UNAVAIL  corrupted data
            c1t1d0s0  ONLINE
            c1t0d0s0  ONLINE

# dd if=/dev/rdsk/c1t1d0s0 of=/tmp/bb bs=1b iseek=1 count=15
15+0 records in
15+0 records out
# dd if=/dev/rdsk/c1t1d0s0 of=/tmp/bb bs=1b iseek=1024 oseek=15 count=16
16+0 records in
16+0 records out
# cmp /tmp/bb /usr/platform/`uname -i`/lib/fs/zfs/bootblk
# echo $?
0
# dd if=/dev/rdsk/c1t0d0s0 of=/tmp/ab bs=1b iseek=1 count=15
15+0 records in
15+0 records out
# dd if=/dev/rdsk/c1t0d0s0 of=/tmp/ab bs=1b iseek=1024 oseek=15 count=16
16+0 records in
16+0 records out
# cmp /tmp/ab /usr/platform/`uname -i`/lib/fs/zfs/bootblk
# echo $?
0
#

pre-history:
============
admin.tpol ~ # zpool status -xv
  pool: rpool
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scrub: resilver in progress for 0h28m, 98.10% done, 0h0m to go
config:

        NAME              STATE     READ WRITE CKSUM
        rpool             DEGRADED     0     0     0
          mirror          DEGRADED     0     0     0
            replacing     DEGRADED     0     0     0
              c1t1d0s0/old  FAULTED    0     0     0  corrupted data
              c1t1d0s0    ONLINE       0     0     0
            c1t0d0s0      ONLINE       0     0     0

errors: No known data errors

admin.tpol ~ # zpool status -xv
all pools are healthy
admin.tpol ~ # zpool status
  pool: pool1
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        pool1       ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c1t2d0  ONLINE       0     0     0
            c1t3d0  ONLINE       0     0     0

errors: No known data errors

  pool: rpool
 state: ONLINE
 scrub: resilver completed after 0h28m with 0 errors on Tue Jun 16 19:13:57 2009
config:

        NAME          STATE     READ WRITE CKSUM
        rpool         ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c1t1d0s0  ONLINE       0     0     0
            c1t0d0s0  ONLINE       0     0     0

errors: No known data errors

admin.tpol / # luactivate -l $ERR -o $OUT S10u7
...
In case of a failure while booting ...
  mount -Fzfs /dev/dsk/c1t1d0s0 /mnt
...
Modifying boot archive service
Activation of boot environment <s10u7> successful.
8.32u 16.74s 1:25.15 29.4%
admin.tpol / # init 6

--
Otto-von-Guericke University     http://www.cs.uni-magdeburg.de/
Department of Computer Science   Geb. 29 R 027, Universitaetsplatz 2
39106 Magdeburg, Germany         Tel: +49 391 67 12768
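
For readers following along: the boot-device workaround and the disk
replacement described in the message above amount to roughly the following.
The device path and disk name are taken from this report; the OBP reset step
and the prompts are assumptions, a sketch rather than a verbatim transcript.

# at the OBP prompt: point boot-device at the surviving side of the mirror
{3} ok setenv boot-device /pci@1f,700000/scsi@2/disk@0,0:a
{3} ok reset-all

# after physically swapping the failed disk, resilver the mirror onto it
admin.tpol ~ # zpool replace rpool c1t1d0s0
admin.tpol ~ # zpool status rpool     # wait until the resilver completes
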
On 06/16/09 16:32, Jens Elkner wrote:
> Hmmm,
>
> just upgraded some servers to U7. Unfortunately one server's primary disk
> died during the upgrade, so luactivate was not able to activate the
> s10u7 BE ("Unable to determine the configuration ..."). Since the rpool
> is a 2-way mirror, boot-device=/pci@1f,700000/scsi@2/disk@1,0:a was
> simply set to /pci@1f,700000/scsi@2/disk@0,0:a and it was checked whether
> the machine still reboots unattended. As expected - no problem.
>
> In the evening the faulty disk was replaced and the mirror resilvered via
> 'zpool replace rpool c1t1d0s0' (see below). Since there was no error and
> everything was reported as healthy, the s10u7 BE was luactivated (no error
> message here either) and 'init 6'.
>
> Unfortunately, now the server was gone and no known recipe helped to
> revive it (I guess LU damaged the zpool.cache?) :((((
>

Even if LU somehow damaged the zpool.cache, that wouldn't explain why an
import, while booted off the net, wouldn't work. LU would have had to
damage the pool's label in some way.

I notice that when you booted the system from the net, you booted from an
Update 6 image. It's possible that the luupgrade to U7 upgraded the pool
version to one not understood by Update 6. One thing I suggest is booting
off the net from a U7 image and seeing if that allows you to import the pool.

The other suggestion I have is to remove the
/pci@1f,700000/scsi@2/disk@1,0:a device (the one that didn't appear to be
able to boot at all) and try booting off just the other disk. There are
some known problems with booting from one side of a mirror when the disk
on the other side is present but somehow not quite right. If that boot is
successful, insert the other disk again (while still booted, if your system
supports hot-plugging), and then perform the recovery procedure documented
here:

http://www.solarisinternals.com/wiki/index.php/ZFS_Troubleshooting_Guide#Primary_Mirror_Disk_in_a_ZFS_Root_Pool_is_Unavailable_or_Fails

making sure you perform the "installboot" step to install a boot block.

If you can manage to boot off one side of the mirror, but your system does
not support hot-plugging, try doing a 'zpool detach' to remove the other
disk from the root mirror, then taking the system down, inserting the
replacement disk, rebooting, re-attaching the disk to the mirror with
'zpool attach', and finally reinstalling the boot block, as shown in the
above link.

If none of this works, I'll look into other reasons why a pool might appear
to have "corrupted data" and therefore be non-importable.

Lori

> Any hints how to get the rpool back?
>
> Regards,
> jel.
>
> [...]
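
Roughly, the no-hot-plug recovery path Lori describes would look like the
following once the system is booted from the surviving disk. The disk names
are the ones from this thread, and the sequence is a sketch under those
assumptions, not a tested procedure.

# drop the missing disk from the pool so ZFS no longer expects it
zpool detach rpool c1t1d0s0

# power down, insert the replacement disk, boot again, then re-mirror:
# zpool attach takes the existing device first, then the new one
zpool attach rpool c1t0d0s0 c1t1d0s0
zpool status rpool     # wait for the resilver to finish

# reinstall the SPARC ZFS boot block on the re-attached disk
installboot -F zfs /usr/platform/`uname -i`/lib/fs/zfs/bootblk /dev/rdsk/c1t1d0s0
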
On Tue, Jun 16, 2009 at 05:58:00PM -0600, Lori Alt wrote:

First: Thanx a lot, Lori, for the quick help!!!

> On 06/16/09 16:32, Jens Elkner wrote:
> > In the evening the faulty disk was replaced and the mirror resilvered via
> > 'zpool replace rpool c1t1d0s0' (see below). Since there was no error and
> > everything was reported as healthy, the s10u7 BE was luactivated (no error
> > message here either) and 'init 6'.
> >
> > Unfortunately, now the server was gone and no known recipe helped to
> > revive it (I guess LU damaged the zpool.cache?) :((((
>
> The other suggestion I have is to remove the
> /pci@1f,700000/scsi@2/disk@1,0:a device

Indeed, physically removing the "new" c1t1d0 was the key to solving the
problem (netbooting s10u7 gave the same import error as s10u6).

Just in case somebody is interested in the details:

Since 'cfgadm -c unconfigure c1::dsk/c1t1d0' didn't work (no blue LED), the
machine was 'poweroff'ed, the disk removed and 'poweron'ed. Really strange:
it came back with the s10u6 BE instead of the s10u7 BE and took quite a
while until it gave up trying to get HDD1:

WARNING: /pci@1d,700000/scsi@2,1 (mpt3):
        Disconnected command timeout for Target 1
Hardware watchdog enabled
...

Now cfgadm did not show c1::dsk/c1t1d0. So HDD1 was re-inserted, which was
properly logged ('SC Alert: DISK @ HDD1 has been inserted.'). Unfortunately
cfgadm still didn't show it, and 'cfgadm -c configure c1::dsk/c1t1d0' only
said "cfgadm: Attachment point not found". However, 'format -e':

AVAILABLE DISK SELECTIONS:
       0. c1t0d0 <SUN36G cyl 24620 alt 2 hd 27 sec 107>
          /pci@1f,700000/scsi@2/sd@0,0
       1. c1t1d0 <HITACHI-DK32EJ36NSUN36G-PQ08-33.92GB>
          /pci@1f,700000/scsi@2/sd@1,0
       2. c1t2d0 <SEAGATE-ST3146707LC-0005-136.73GB>
          /pci@1f,700000/scsi@2/sd@2,0
       3. c1t3d0 <SEAGATE-ST3146707LC-0005-136.73GB>
          /pci@1f,700000/scsi@2/sd@3,0
Specify disk (enter its number): 1
selecting c1t1d0
[disk formatted]

and now the machine seemed to be stalled - neither ssh nor login via the
console was possible. So it was 'poweroff'ed again, the disk removed,
'poweron'ed, and this time the first thing done was
'zpool detach rpool c1t1d0' followed by scrubbing the pool (completed after
0h22m with 0 errors).

After that a 'cfgadm -x insert_device c1' got c1::dsk/c1t1d0 back and
'format -e' worked as expected (1p showed an EFI partition table)! So the
rest was trivial: SMI label, repartition, label, re-attach c1t1d0 to rpool
(resilver took about 31m), installboot, verify unattended boot under s10u6,
luactivate s10u7 and finally verify unattended boot again.

Once again, thanx a lot Lori for your quick help!!!

Regards,
jel.
--
Otto-von-Guericke University     http://www.cs.uni-magdeburg.de/
Department of Computer Science   Geb. 29 R 027, Universitaetsplatz 2
39106 Magdeburg, Germany         Tel: +49 391 67 12768
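
For the record, the "SMI label, repartition, label, re-attach" part of the
fix corresponds to roughly the steps below. The prtvtoc|fmthard idiom for
copying the slice layout from the surviving disk is an assumption (one
common way to do the repartitioning, not necessarily what was typed here),
and it presumes both disks have compatible geometry; format's label and
partition steps are interactive.

# relabel the replacement disk: in format -e select c1t1d0, run 'label'
# and choose [0] SMI (ZFS root on SPARC needs an SMI/VTOC label, not EFI)
format -e

# copy the slice table from the good disk onto the relabeled one
prtvtoc /dev/rdsk/c1t0d0s2 | fmthard -s - /dev/rdsk/c1t1d0s2

# then re-attach c1t1d0s0 to the rpool mirror and run installboot on it,
# as in the detach/attach sketch earlier in this thread
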