Donald Murray, P.Eng.
2009-Nov-02 13:20 UTC
[zfs-discuss] Kernel panic on zfs import (hardware failure)
Hey,

On Sat, Oct 31, 2009 at 5:03 PM, Victor Latushkin <Victor.Latushkin at sun.com> wrote:
> Donald Murray, P.Eng. wrote:
>>
>> Hi,
>>
>> I've got an OpenSolaris 2009.06 box that will reliably panic whenever
>> I try to import one of my pools. What's the best practice for
>> recovering (before I resort to nuking the pool and restoring from
>> backup)?
>
> Could you please post panic stack backtrace?
>
>> There are two pools on the system: rpool and tank. The rpool seems to
>> be fine, since I can boot from a 2009.06 CD and 'zpool import -f
>> rpool'; I can also 'zpool scrub rpool', and it doesn't find any
>> errors. Hooray! Except I don't care about rpool. :-(
>>
>> If I boot from hard disk, the system begins importing zfs pools; once
>> it's imported everything I usually have enough time to log in before
>> it panics. If I boot from CD and 'zpool import -f tank', it panics.
>>
>> I've just started a 'zdb -e tank', which I found on the intertubes
>> here: http://opensolaris.org/jive/thread.jspa?threadID=49020. Zdb
>> seems to be ... doing something. Not sure _what_ it's doing, but it
>> can't be making things worse for me, right?
>
> Yes, zdb only reads, so it cannot make things worse.
>
>> I'm going to try adding the following to /etc/system, as mentioned
>> here: http://opensolaris.org/jive/thread.jspa?threadID=114906
>>
>>   set zfs:zfs_recover=1
>>   set aok=1
>
> Please do not rush with these settings. Let's look at the stack
> backtrace first.
>
> Regards,
> Victor

I think I've found the cause of my problem. I disconnected one side of
each mirror, rebooted, and imported. The system didn't panic! So one of
the disconnected drives (or cables, or controllers...) was the culprit.

I've since narrowed it down to a single 500GB drive. When that drive is
connected, a zpool import panics the system. When that drive is
disconnected, the pool imports fine.
root at weyl:~# zpool status tank
  pool: tank
 state: DEGRADED
status: One or more devices could not be used because the label is
        missing or invalid. Sufficient replicas exist for the pool to
        continue functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-4J
 scrub: resilver completed after 0h8m with 0 errors on Sun Nov  1 22:11:15 2009
config:

        NAME                     STATE     READ WRITE CKSUM
        tank                     DEGRADED     0     0     0
          mirror                 DEGRADED     0     0     0
            7508645614192559694  FAULTED      0     0     0  was /dev/dsk/c7t0d0s0
            c6t1d0               ONLINE       0     0     0
          mirror                 ONLINE       0     0     0
            c5t1d0               ONLINE       0     0     6  21.2G resilvered
            c7t0d0               ONLINE       0     0     0

errors: No known data errors
root at weyl:~#

The first thing that jumps out at me: why does the first mirror think
the missing disk was c7t0d0? I have an old zpool status from before the
problem began, and that disk used to be c6t0d0.

root at weyl:~# zpool status tank
  pool: tank
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c6t0d0  ONLINE       0     0     0
            c6t1d0  ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c5t1d0  ONLINE       0     0     0
            c7t0d0  ONLINE       0     0     0

errors: No known data errors
root at weyl:~#

Victor has been very helpful, living up to his reputation. Thanks,
Victor! If we determine a root cause, I'll update the list.

Things I've learned along the way:
- pools import automatically based on cached information in
  /etc/zfs/zpool.cache; if you move zpool.cache elsewhere, none of the
  pools will import upon rebooting;
- import problematic pools via 'zpool import -f -R /a <poolname>'; this
  doesn't update the cachefile, and mounts the pool on /a;
- adding the following to /etc/system didn't prevent a hardware-induced
  panic:
      set zfs:zfs_recover=1
      set aok=1
- crash dumps are typically saved in /var/crash/$( uname -n );
- beadm is your friend;
- redundancy is your friend (okay, I already knew that);
- if you have a zfs problem, you want Victor Latushkin to be your friend.

Cheers!
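For anyone following along who wants to pull the panic stack backtrace
Victor asked for out of one of those crash dumps, here's a rough sketch
using the stock Solaris tools. The paths assume the default dumpadm(1M)
configuration and that savecore wrote the dump as instance 0; adjust the
".0" suffix to match whatever is actually in your crash directory.

```shell
# Go to where savecore drops kernel crash dumps by default
cd /var/crash/$( uname -n )

# Open the saved kernel image and core with the modular debugger
mdb unix.0 vmcore.0

# Inside mdb, these dcmds summarize the panic:
#   ::status    - panic message and dump summary
#   ::stack     - stack backtrace of the panicking thread
#   ::msgbuf    - kernel message buffer leading up to the panic
#   $q          - quit mdb
```

The `::stack` output is what you'd paste back to the list; it shows which
ZFS function the kernel was in when it panicked, which is usually enough
for someone familiar with the code to point at a root cause.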