Hmmm,

just upgraded some servers to U7. Unfortunately one server's primary disk
died during the upgrade, so luactivate was not able to activate the
s10u7 BE ("Unable to determine the configuration ..."). Since the rpool
is a 2-way mirror, boot-device=/pci@1f,700000/scsi@2/disk@1,0:a was
simply set to /pci@1f,700000/scsi@2/disk@0,0:a and it was checked whether
the machine still reboots unattended. As expected - no problem.

In the evening the faulty disk was replaced and the mirror resilvered via
'zpool replace rpool c1t1d0s0' (see below). Since there was no error and
everything was reported as healthy, the s10u7 BE was luactivated (no error
message here either) and 'init 6'.

Unfortunately, now the server was gone and no known recipe helped to
revive it (I guess LU damaged the zpool.cache?) :((((

Any hints how to get the rpool back?

Regards,
jel.


What has been tried 'til now:
########################################
{3} ok boot
Boot device: /pci@1f,700000/scsi@2/disk@1,0:a  File and args:
Bad magic number in disk label
Can't open disk label package

Can't open boot device

{3} ok
########################################
{3} ok boot /pci@1f,700000/scsi@2/disk@0,0:a
Boot device: /pci@1f,700000/scsi@2/disk@0,0:a  File and args:
SunOS Release 5.10 Version Generic_138888-08 64-bit
Copyright 1983-2009 Sun Microsystems, Inc.  All rights reserved.
Use is subject to license terms.
NOTICE: spa_import_rootpool: error 22

Cannot mount root on /pci@1f,700000/scsi@2/disk@0,0:a fstype zfs

panic[cpu3]/thread=180e000: vfs_mountroot: cannot mount root

000000000180b950 genunix:vfs_mountroot+358 (800, 200, 0, 1875c00, 189f800, 18ca000)
  %l0-3: 00000000010ba000 00000000010ba208 000000000187bba8 00000000011e8400
  %l4-7: 00000000011e8400 00000000018cc400 0000000000000600 0000000000000200
000000000180ba10 genunix:main+a0 (1815178, 180c000, 18397b0, 18c6800, 181b578, 1815000)
  %l0-3: 0000000001015400 0000000000000001 0000000070002000 0000000000000000
  %l4-7: 000000000183ec00 0000000000000003 000000000180c000 0000000000000000

skipping system dump - no dump device configured
rebooting...

SC Alert: Host System has Reset
########################################
{3} ok boot net -s
Boot device: /pci@1c,600000/network@2  File and args: -s
1000 Mbps FDX Link up
Timeout waiting for ARP/RARP packet
3a000 1000 Mbps FDX Link up
SunOS Release 5.10 Version Generic_137137-09 64-bit
Copyright 1983-2008 Sun Microsystems, Inc.  All rights reserved.
Use is subject to license terms.
Hardware watchdog enabled
Booting to milestone "milestone/single-user:default".
Configuring devices.
Using RPC Bootparams for network configuration information.
Attempting to configure interface ce1...
Skipped interface ce1
Attempting to configure interface ce0...
Configured interface ce0
Requesting System Maintenance Mode
SINGLE USER MODE
# mount -F zfs /dev/dsk/c1t1d0s0 /mnt
cannot open '/dev/dsk/c1t1d0s0': invalid dataset name
# mount -F zfs /dev/dsk/c1t0d0s0 /mnt
cannot open '/dev/dsk/c1t0d0s0': invalid dataset name
# zpool import
  pool: pool1
    id: 5088500955966129017
 state: ONLINE
action: The pool can be imported using its name or numeric identifier.
config:

        pool1       ONLINE
          mirror    ONLINE
            c1t2d0  ONLINE
            c1t3d0  ONLINE

  pool: rpool
    id: 5910200402071733373
 state: UNAVAIL
action: The pool cannot be imported due to damaged devices or data.
config:

        rpool         UNAVAIL  insufficient replicas
          mirror      UNAVAIL  corrupted data
            c1t1d0s0  ONLINE
            c1t0d0s0  ONLINE

# dd if=/dev/rdsk/c1t1d0s0 of=/tmp/bb bs=1b iseek=1 count=15
15+0 records in
15+0 records out
# dd if=/dev/rdsk/c1t1d0s0 of=/tmp/bb bs=1b iseek=1024 oseek=15 count=16
16+0 records in
16+0 records out
# cmp /tmp/bb /usr/platform/`uname -i`/lib/fs/zfs/bootblk
# echo $?
0
# dd if=/dev/rdsk/c1t0d0s0 of=/tmp/ab bs=1b iseek=1 count=15
15+0 records in
15+0 records out
# dd if=/dev/rdsk/c1t0d0s0 of=/tmp/ab bs=1b iseek=1024 oseek=15 count=16
16+0 records in
16+0 records out
# cmp /tmp/ab /usr/platform/`uname -i`/lib/fs/zfs/bootblk
# echo $?
0
#

pre-history:
============
admin.tpol ~ # zpool status -xv
  pool: rpool
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scrub: resilver in progress for 0h28m, 98.10% done, 0h0m to go
config:

        NAME              STATE     READ WRITE CKSUM
        rpool             DEGRADED     0     0     0
          mirror          DEGRADED     0     0     0
            replacing     DEGRADED     0     0     0
              c1t1d0s0/old  FAULTED    0     0     0  corrupted data
              c1t1d0s0    ONLINE       0     0     0
            c1t0d0s0      ONLINE       0     0     0

errors: No known data errors

admin.tpol ~ # zpool status -xv
all pools are healthy
admin.tpol ~ # zpool status
  pool: pool1
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        pool1       ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c1t2d0  ONLINE       0     0     0
            c1t3d0  ONLINE       0     0     0

errors: No known data errors

  pool: rpool
 state: ONLINE
 scrub: resilver completed after 0h28m with 0 errors on Tue Jun 16 19:13:57 2009
config:

        NAME          STATE     READ WRITE CKSUM
        rpool         ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c1t1d0s0  ONLINE       0     0     0
            c1t0d0s0  ONLINE       0     0     0

errors: No known data errors

admin.tpol / # luactivate -l $ERR -o $OUT S10u7
...
In case of a failure while booting ...
  mount -Fzfs /dev/dsk/c1t1d0s0 /mnt
...
Modifying boot archive service
Activation of boot environment <s10u7> successful.
8.32u 16.74s 1:25.15 29.4%
admin.tpol / # init 6

--
Otto-von-Guericke University     http://www.cs.uni-magdeburg.de/
Department of Computer Science   Geb. 29 R 027, Universitaetsplatz 2
39106 Magdeburg, Germany         Tel: +49 391 67 12768
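
For readers following along: the boot-device workaround and the disk
replacement described in the message above amount to roughly the following.
The device path and disk name are taken from this report; the OBP reset step
and the prompts are assumptions, a sketch rather than a verbatim transcript.

# at the OBP prompt: point boot-device at the surviving side of the mirror
{3} ok setenv boot-device /pci@1f,700000/scsi@2/disk@0,0:a
{3} ok reset-all

# after physically swapping the failed disk, resilver the mirror onto it
admin.tpol ~ # zpool replace rpool c1t1d0s0
admin.tpol ~ # zpool status rpool     # wait until the resilver completes
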
On 06/16/09 16:32, Jens Elkner wrote:
> Hmmm,
>
> just upgraded some servers to U7. Unfortunately one server's primary disk
> died during the upgrade, so luactivate was not able to activate the
> s10u7 BE ("Unable to determine the configuration ..."). Since the rpool
> is a 2-way mirror, boot-device=/pci@1f,700000/scsi@2/disk@1,0:a was
> simply set to /pci@1f,700000/scsi@2/disk@0,0:a and it was checked whether
> the machine still reboots unattended. As expected - no problem.
>
> In the evening the faulty disk was replaced and the mirror resilvered via
> 'zpool replace rpool c1t1d0s0' (see below). Since there was no error and
> everything was reported as healthy, the s10u7 BE was luactivated (no error
> message here either) and 'init 6'.
>
> Unfortunately, now the server was gone and no known recipe helped to
> revive it (I guess LU damaged the zpool.cache?) :((((
>

Even if LU somehow damaged the zpool.cache, that wouldn't explain why an
import, while booted off the net, wouldn't work. LU would have had to
damage the pool's label in some way.

I notice that when you booted the system from the net, you booted from an
Update 6 image. It's possible that the luupgrade to U7 upgraded the pool
version to one not understood by Update 6. One thing I suggest is booting
off the net from a U7 image and seeing if that allows you to import the pool.

The other suggestion I have is to remove the
/pci@1f,700000/scsi@2/disk@1,0:a device (the one that didn't appear to be
able to boot at all) and try booting off just the other disk. There are
some known problems with booting from one side of a mirror when the disk
on the other side is present but somehow not quite right. If that boot is
successful, insert the other disk again (while still booted, if your system
supports hot-plugging), and then perform the recovery procedure documented
here:

http://www.solarisinternals.com/wiki/index.php/ZFS_Troubleshooting_Guide#Primary_Mirror_Disk_in_a_ZFS_Root_Pool_is_Unavailable_or_Fails

making sure you perform the "installboot" step to install a boot block.

If you can manage to boot off one side of the mirror, but your system does
not support hot-plugging, try doing a 'zpool detach' to remove the other
disk from the root mirror, then taking the system down, inserting the
replacement disk, rebooting, re-attaching the disk to the mirror with
'zpool attach', and finally reinstalling the boot block, as shown in the
above link.

If none of this works, I'll look into other reasons why a pool might appear
to have "corrupted data" and therefore be non-importable.

Lori

> Any hints how to get the rpool back?
>
> Regards,
> jel.
>
> [...]
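
Roughly, the no-hot-plug recovery path Lori describes would look like the
following once the system is booted from the surviving disk. The disk names
are the ones from this thread, and the sequence is a sketch under those
assumptions, not a tested procedure.

# drop the missing disk from the pool so ZFS no longer expects it
zpool detach rpool c1t1d0s0

# power down, insert the replacement disk, boot again, then re-mirror:
# zpool attach takes the existing device first, then the new one
zpool attach rpool c1t0d0s0 c1t1d0s0
zpool status rpool     # wait for the resilver to finish

# reinstall the SPARC ZFS boot block on the re-attached disk
installboot -F zfs /usr/platform/`uname -i`/lib/fs/zfs/bootblk /dev/rdsk/c1t1d0s0
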
On Tue, Jun 16, 2009 at 05:58:00PM -0600, Lori Alt wrote:

First: Thanx a lot, Lori, for the quick help!!!

> On 06/16/09 16:32, Jens Elkner wrote:
> > In the evening the faulty disk was replaced and the mirror resilvered via
> > 'zpool replace rpool c1t1d0s0' (see below). Since there was no error and
> > everything was reported as healthy, the s10u7 BE was luactivated (no error
> > message here either) and 'init 6'.
> >
> > Unfortunately, now the server was gone and no known recipe helped to
> > revive it (I guess LU damaged the zpool.cache?) :((((
>
> The other suggestion I have is to remove the
> /pci@1f,700000/scsi@2/disk@1,0:a device

Indeed, physically removing the "new" c1t1d0 was the key to solving the
problem (netbooting s10u7 gave the same import error as s10u6).

Just in case somebody is interested in the details:

Since 'cfgadm -c unconfigure c1::dsk/c1t1d0' didn't work (no blue LED), the
machine was 'poweroff'ed, the disk removed and 'poweron'ed. Really strange:
it came back with the s10u6 BE instead of the s10u7 BE and took quite a
while until it gave up trying to get HDD1:

WARNING: /pci@1d,700000/scsi@2,1 (mpt3):
        Disconnected command timeout for Target 1
Hardware watchdog enabled
...

Now cfgadm did not show c1::dsk/c1t1d0. So HDD1 was re-inserted, which was
properly logged ('SC Alert: DISK @ HDD1 has been inserted.'). Unfortunately
cfgadm still didn't show it, and 'cfgadm -c configure c1::dsk/c1t1d0' only
said "cfgadm: Attachment point not found". However, 'format -e':

AVAILABLE DISK SELECTIONS:
       0. c1t0d0 <SUN36G cyl 24620 alt 2 hd 27 sec 107>
          /pci@1f,700000/scsi@2/sd@0,0
       1. c1t1d0 <HITACHI-DK32EJ36NSUN36G-PQ08-33.92GB>
          /pci@1f,700000/scsi@2/sd@1,0
       2. c1t2d0 <SEAGATE-ST3146707LC-0005-136.73GB>
          /pci@1f,700000/scsi@2/sd@2,0
       3. c1t3d0 <SEAGATE-ST3146707LC-0005-136.73GB>
          /pci@1f,700000/scsi@2/sd@3,0
Specify disk (enter its number): 1
selecting c1t1d0
[disk formatted]

and now the machine seemed to be stalled - neither ssh nor login via the
console was possible. So it was 'poweroff'ed again, the disk removed,
'poweron'ed, and this time the first thing done was
'zpool detach rpool c1t1d0' followed by scrubbing the pool (completed after
0h22m with 0 errors).

After that a 'cfgadm -x insert_device c1' got c1::dsk/c1t1d0 back and
'format -e' worked as expected (1p showed an EFI partition table)! So the
rest was trivial: SMI label, repartition, label, re-attach c1t1d0 to rpool
(resilver took about 31m), installboot, verify unattended boot under s10u6,
luactivate s10u7 and finally verify unattended boot again.

Once again, thanx a lot Lori for your quick help!!!

Regards,
jel.
--
Otto-von-Guericke University     http://www.cs.uni-magdeburg.de/
Department of Computer Science   Geb. 29 R 027, Universitaetsplatz 2
39106 Magdeburg, Germany         Tel: +49 391 67 12768
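
For the record, the "SMI label, repartition, label, re-attach" part of the
fix corresponds to roughly the steps below. The prtvtoc|fmthard idiom for
copying the slice layout from the surviving disk is an assumption (one
common way to do the repartitioning, not necessarily what was typed here),
and it presumes both disks have compatible geometry; format's label and
partition steps are interactive.

# relabel the replacement disk: in format -e select c1t1d0, run 'label'
# and choose [0] SMI (ZFS root on SPARC needs an SMI/VTOC label, not EFI)
format -e

# copy the slice table from the good disk onto the relabeled one
prtvtoc /dev/rdsk/c1t0d0s2 | fmthard -s - /dev/rdsk/c1t1d0s2

# then re-attach c1t1d0s0 to the rpool mirror and run installboot on it,
# as in the detach/attach sketch earlier in this thread
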