I have just recently (physically) moved a system with 16 hard drives (for the array) and 1 OS drive; in doing so, I needed to pull out the 16 drives so that it would be light enough for me to lift. When I plugged the drives back in, it initially went into a panic-reboot loop. After doing some digging, I deleted the file /etc/zfs/zpool.cache. When I try to import the pool using the zpool import command, this is what I get:

# zpool import
  pool: share
    id: 10028139418536329530
 state: ONLINE
action: The pool can be imported using its name or numeric identifier. The
        pool may be active on another system, but can be imported using
        the '-f' flag.
config:

        share        ONLINE
          raidz      ONLINE
            c0t0d0   ONLINE
            ...
            c0t15d0  ONLINE    (total of 16 drives)

When I try to import the pool using the zpool import -f command, I end up getting the same system panic that I got before.

How can I re-initialize the devices, and what would be the best way for me to bring the pool back up and online, given that I do NOT have backups or a means to recover the data on there?
Hello Ewen,

Thursday, October 5, 2006, 11:13:04 AM, you wrote:

Can you at least post the panic info?

--
Best regards,
Robert                          mailto:rmilkowski at task.gda.pl
                                http://milek.blogspot.com
(with help from Robert)

Yes, there are files.

# pwd
/var/crash/FILESERVER
# ls -F
bounds  unix.0  unix.1  vmcore.0  vmcore.1
# mdb 0
Loading modules: [ unix krtld genunix specfs dtrace cpu.AuthenticAMD.15 ufs ip sctp usba fctl nca lofs random zfs nfs sppp ptm cpc fcip ]
> ::status
debugging crash dump vmcore.0 (64-bit) from unknown
operating system: 5.10 Generic_118855-14 (i86pc)
panic message:
assertion failed: 0 == dmu_buf_hold_array(os, object, offset, size, FALSE, FTAG, &numbufs, &dbp), file: ../../common/fs/zfs/dmu.c, line: 366
dump content: kernel pages only
> ::stack
vpanic()
0xfffffffffb9ad0f3()
dmu_write+0x127()
space_map_sync+0x1ee()
metaslab_sync+0xfa()
vdev_sync+0x50()
spa_sync+0x10e()
txg_sync_thread+0x115()
> $<msgbuf
MESSAGE
--------------------------------------------------
pseudo-device: zfs0
zfs0 is /pseudo/zfs at 0
pseudo-device: pm0
pm0 is /pseudo/pm at 0
pseudo-device: power0
power0 is /pseudo/power at 0
pseudo-device: devinfo0
xsvc0 at root
xsvc0 is /xsvc
pseudo-device: vol0
vol0 is /pseudo/vol at 0
pcplusmp: fds (fds) instance 0 vector 0x6 ioapic 0x4 intin 0x6 is bound to cpu 0
ISA-device: fdc0
fd0 at fdc0
fd0 is /isa/fdc at 1,3f0/fd at 0,0
PCI-device: pci1166,104 at d, pci_pci1
pci_pci1 is /pci at 0,0/pci1166,36 at 1/pci1166,104 at d
pseudo-device: pseudo1
pseudo1 is /pseudo/zconsnex at 1
pcplusmp: ide (ata) instance 1 vector 0xe ioapic 0x4 intin 0xe is bound to cpu 0
pcplusmp: ide (ata) instance 1 vector 0xe ioapic 0x4 intin 0xe is bound to cpu 1
ATAPI device at targ 0, lun 0 lastlun 0xf800
model HL-DT-ST DVDRAM GSA-5167B
ATA/ATAPI-5 supported, majver 0x3c minver 0x0
Found card: AAC card
pcplusmp: pci9005,185 (aac) instance 0 vector 0x10 ioapic 0x5 intin 0x0 is bound to cpu 1
Total 15 container(s) found
PCI-device: pci9005,293 at 3, aac0
aac0 is /pci at 0,0/pci1166,36 at 1/pci1166,104 at d/pci9005,293 at 3
sd0 at aac0: target 0 lun 0
sd0 is /pci at 0,0/pci1166,36 at 1/pci1166,104 at d/pci9005,293 at 3/sd at 0,0
pseudo-device: ramdisk1024
ramdisk1024 is /pseudo/ramdisk1024
pseudo-device: lockstat0
lockstat0 is /pseudo/lockstat at 0
pseudo-device: llc10
llc10 is /pseudo/llc1 at 0
pseudo-device: fcsm0
fcsm0 is /pseudo/fcsm0
pseudo-device: lofi0
lofi0 is /pseudo/lofi at 0
pseudo-device: profile0
profile0 is /pseudo/profile at 0
pseudo-device: systrace0
systrace0 is /pseudo/systrace at 0
pseudo-device: fbt0
fbt0 is /pseudo/fbt at 0
pseudo-device: sdt0
sdt0 is /pseudo/sdt at 0
pseudo-device: fasttrap0
fasttrap0 is /pseudo/fasttrap at 0
pseudo-device: fssnap0
fssnap0 is /pseudo/fssnap at 0
pseudo-device: winlock0
winlock0 is /pseudo/winlock0
pseudo-device: rsm0
rsm0 is /pseudo/rsm at 0
pseudo-device: pool0
pool0 is /pseudo/pool at 0
IP Filter: v4.0.3, running
PCI-device: ide at 0, ata1
ata1 is /pci at 0,0/pci-ide at 2,1/ide at 0
ATA DMA off: disabled. Control with "atapi-cd-dma-enabled" property
PIO mode 4 selected
ATA DMA off: disabled. Control with "atapi-cd-dma-enabled" property
PIO mode 4 selected
ATA DMA off: disabled. Control with "atapi-cd-dma-enabled" property
PIO mode 4 selected
ATA DMA off: disabled.
Control with "atapi-cd-dma-enabled" property
PIO mode 4 selected
pseudo-device: vol0
vol0 is /pseudo/vol at 0
NOTICE: e1000g1/0 unregistered
pseudo-device: zfs0
zfs0 is /pseudo/zfs at 0
sd0 at aac0: target 0 lun 0
sd0 is /pci at 0,0/pci1166,36 at 1/pci1166,104 at d/pci9005,293 at 3/sd at 0,0
sd1 at aac0: target 1 lun 0
sd1 is /pci at 0,0/pci1166,36 at 1/pci1166,104 at d/pci9005,293 at 3/sd at 1,0
sd2 at aac0: target 2 lun 0
sd2 is /pci at 0,0/pci1166,36 at 1/pci1166,104 at d/pci9005,293 at 3/sd at 2,0
sd3 at aac0: target 3 lun 0
sd3 is /pci at 0,0/pci1166,36 at 1/pci1166,104 at d/pci9005,293 at 3/sd at 3,0
sd4 at aac0: target 4 lun 0
sd4 is /pci at 0,0/pci1166,36 at 1/pci1166,104 at d/pci9005,293 at 3/sd at 4,0
sd5 at aac0: target 5 lun 0
sd5 is /pci at 0,0/pci1166,36 at 1/pci1166,104 at d/pci9005,293 at 3/sd at 5,0
sd6 at aac0: target 6 lun 0
sd6 is /pci at 0,0/pci1166,36 at 1/pci1166,104 at d/pci9005,293 at 3/sd at 6,0
sd7 at aac0: target 7 lun 0
sd7 is /pci at 0,0/pci1166,36 at 1/pci1166,104 at d/pci9005,293 at 3/sd at 7,0
sd8 at aac0: target 8 lun 0
sd8 is /pci at 0,0/pci1166,36 at 1/pci1166,104 at d/pci9005,293 at 3/sd at 8,0
sd9 at aac0: target 9 lun 0
sd9 is /pci at 0,0/pci1166,36 at 1/pci1166,104 at d/pci9005,293 at 3/sd at 9,0
sd10 at aac0: target a lun 0
sd10 is /pci at 0,0/pci1166,36 at 1/pci1166,104 at d/pci9005,293 at 3/sd at a,0
sd11 at aac0: target b lun 0
sd11 is /pci at 0,0/pci1166,36 at 1/pci1166,104 at d/pci9005,293 at 3/sd at b,0
sd12 at aac0: target c lun 0
sd12 is /pci at 0,0/pci1166,36 at 1/pci1166,104 at d/pci9005,293 at 3/sd at c,0
sd13 at aac0: target d lun 0
sd13 is /pci at 0,0/pci1166,36 at 1/pci1166,104 at d/pci9005,293 at 3/sd at d,0
sd14 at aac0: target e lun 0
sd14 is /pci at 0,0/pci1166,36 at 1/pci1166,104 at d/pci9005,293 at 3/sd at e,0
sd15 at aac0: target f lun 0
sd15 is /pci at 0,0/pci1166,36 at 1/pci1166,104 at d/pci9005,293 at 3/sd at f,0

panic[cpu1]/thread=fffffe8000bf1c80:
assertion failed: 0 == dmu_buf_hold_array(os, object, offset, size, FALSE, FTAG, &numbufs, &dbp), file: ../../common/fs/zfs/dmu.c, line: 366

fffffe8000bf19d0 fffffffffb9ad0f3 (fffffe8000bf1a0c, fffffe800)
fffffe8000bf1a40 zfs:zfsctl_ops_root+2ff7f31f ()
fffffe8000bf1ac0 zfs:space_map_sync+1ee ()
fffffe8000bf1b30 zfs:metaslab_sync+fa ()
fffffe8000bf1b70 zfs:zfsctl_ops_root+2ff9b978 ()
fffffe8000bf1bd0 zfs:spa_sync+10e ()
fffffe8000bf1c60 zfs:txg_sync_thread+115 ()

syncing file systems... done
dumping to /dev/dsk/c1d0s1, offset 1719074816, content: kernel

The other crash dump should look very similar, if not identical, because it looks like it just did the same thing twice: once when I first tried to boot the system up, and again when I tried to import the zpool using the zpool import -f command.

I have no idea what any of that means, but hopefully you will be able to help me bring it back online safely.
Ewen Chan wrote:
>> ::status
> debugging crash dump vmcore.0 (64-bit) from unknown
> operating system: 5.10 Generic_118855-14 (i86pc)
> panic message:
> assertion failed: 0 == dmu_buf_hold_array(os, object, offset, size, FALSE, FTAG, &numbufs, &dbp), file: ../../common/fs/zfs/dmu.c, line: 366
> dump content: kernel pages only
>> ::stack
> vpanic()
> 0xfffffffffb9ad0f3()
> dmu_write+0x127()
> space_map_sync+0x1ee()
> metaslab_sync+0xfa()
> vdev_sync+0x50()
> spa_sync+0x10e()
> txg_sync_thread+0x115()

It would appear that some critical metadata was damaged. Since you appear to be using a single (16-wide!) raid-z group, this could happen if there are failures on two devices (despite the fact that we store multiple copies of all critical metadata). In this situation it may not be possible to recover your data.

--matt
Well... let me give a little bit of background.

I built the system in a 4U, 16-drive rackmount enclosure, without a backplane. Originally, I thought that I wouldn't really need one because I was going to have 16 cables running around anyway. Once everything was in place, and AFTER I had transferred my data to the system, I decided that I was going to move it into my room (and out of the living room where I was doing the build).

However, in order for me to lift the unit, I needed to pull the drives out so that it would actually be moveable, and in doing so, I think that the drive<->cable<->port allocation/assignment has changed.

If there is a way for me to figure out which drive is supposed to go to which port (as reported by ZFS and/or Solaris), then in theory I should be able to figure out what goes where, and it would be as if nothing had changed. The problem is that I don't know what that mapping is.

The other thing is that the documentation doesn't really tell you what steps you should take to recover from something like this. It just says you should re-initialize the drives.

My hope is that someone, SOMEWHERE, would have some suggestions as to what I can do from here, other than crying about the lost data, biting the bullet, and reinstalling the OS from scratch again. (Whether it'd be here, on OpenSolaris, ZFS, Jeff Bonwick (and his team), Bill Moore (and his team) -- I just think that with over 200 million tests, someone would have tried this, been through what I'm going through now, and would know what needs to be done at this point to safely recover the data and bring the pool back online.)

If there's a way to find out the original drive mapping, then I can try and see if I can slowly replicate it to bring the pool back online. (I do remember that when I was doing the OS install, the drives on the aac controller weren't listed in sequential order; it was ...t7, then ...t9, t2, t8, etc. I don't recall what it was exactly, of course, and I don't know why it wouldn't scan them sequentially.)
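For reference, here is what I was thinking of trying -- just a sketch, and I'm assuming the drives report their serial numbers through the aac controller:

# format </dev/null      (lists the disks as Solaris currently sees them, c?t?d?)
# iostat -En             (prints vendor, product, and serial number for each sd instance)

If the serial numbers show up, I could match them against the stickers on the physical drives and at least work out which drive ended up on which target after the move.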
Ewen Chan wrote:
> However, in order for me to lift the unit, I needed to pull the
> drives out so that it would actually be moveable, and in doing so, I
> think that the drive<->cable<->port allocation/assignment has
> changed.

If that is the case, then ZFS would automatically figure out the new mapping. (Of course, there could be an undiscovered bug in that code.)

--matt
Hi,

As Matt said, unless there is a bug in the code, ZFS should automatically figure out the drive mappings. The real problem, as I see it, is using 16 drives in a single raidz... which means that if two drives malfunction, you're out of luck. (raidz2 would survive losing 2 drives... but I still believe 16 drives is too much.)

May I suggest you re-check the cabling, as a drive going bad might be related to that... or even try changing the power supply (I got burnt that way). It might just be an intermittent drive malfunction. You might also surface-scan the drives to rule out bad sectors.

Good luck :)

PS: When you get your data back, do switch to raidz2 or a mirrored config that can survive the loss of more than 1 disk. My experience (which is not much) shows it doesn't take much to knock out more than one disk out of 20 or so... especially when moving them.

- Akhilesh
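PPS: To illustrate the kind of layout I mean -- a sketch only, reusing your current device names, and of course this means recreating the pool and restoring the data -- two 8-wide raidz2 groups, each able to survive two drive failures:

# zpool create share \
    raidz2 c0t0d0 c0t1d0 c0t2d0 c0t3d0 c0t4d0 c0t5d0 c0t6d0 c0t7d0 \
    raidz2 c0t8d0 c0t9d0 c0t10d0 c0t11d0 c0t12d0 c0t13d0 c0t14d0 c0t15d0

You give up two more disks' worth of capacity compared to the single raidz, but the redundancy is per-group rather than one parity disk across all 16.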
In the instructions, it says that the system retains a copy of the zpool cache in /etc/zfs/zpool.cache. It also says that when the system boots up, it looks at that file to try and mount the pool, so to get out of the panic-reboot loop, it said to delete that file. Well, I retained a copy of it before I deleted it.

Would looking at the contents of that file help to determine what's what? Or would it help in trying to fix/resolve the problem that I am experiencing?
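For example -- and I'm only guessing at the invocation here -- if I temporarily put my saved copy back in place (the source path below is just a placeholder for wherever I saved it), I'm assuming something like this would print the cached pool configuration, including the device paths the pool was last imported with:

# cp /path/to/my/saved/zpool.cache /etc/zfs/zpool.cache    (placeholder path)
# zdb -C                                                   (assuming zdb reads /etc/zfs/zpool.cache by default and -C prints the cached config)
# mv /etc/zfs/zpool.cache /etc/zfs/zpool.cache.moved       (move it back out of the way so the boot-time import doesn't retry and panic again)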
Well, the drives technically didn't "malfunction". Like I said, the reason why I had to pull the drives out is that 70 lbs is a little TOO much for me to lift. The drives aren't more than 3 weeks old, with a DOM of Jul 2006.

Is there anything that I can do to find out how the system was scanning the drives? (i.e., as I recall, during the installation, c0t7d0 was listed as the first device. Is there a way to look at the order in which the drives were brought online, so that maybe I could correlate that to the drive/port map on the controller?)

I am banking on it being SOMETHING related to when I had to plug the drives back in after moving the unit, because I didn't tag the individual cables for the drives.
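What I was thinking of doing -- assuming the original attach messages are still in the rotated /var/adm/messages files from before the move -- is pulling out the sd attach lines and comparing the order before and after, something like:

# grep "at aac0: target" /var/adm/messages*

That should print lines like "sd10 at aac0: target a lun 0" with the file name in front, so I can see which targets attached in which order in the old logs versus now.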
P.S. I don't know if it makes any difference, but I did find that the scan order has changed somewhat. For example, right now, it starts the scan for the drives from sd10 (i.e. sd at a,0), whereas before, the drive scan started with sd1 (i.e. sd at 0,0).

Would it make a difference if:

a) say I had interchanged the cables for drive 8 <-> drive 10 -- would ZFS have a fit with that? What if I changed more than one drive, without taking the array offline? (I don't recall the controller card reporting that there were changes in the drive arrangement; although I did find out that if there were any, and the array configuration got updated with that remapping, there isn't really a way to "change it back", because the controller has already picked up those changes and recorded them as such.)

b) the order in which the drives are being picked up by Solaris now?

I was able to find in /var/adm/messages that up until Oct 5 03:42, the system had no issues. Now, not only is the scan order different, but after Oct 5 03:42 is when the system started having those panics.

(I am going nuts trying to figure this thing out.)
The scan order won't make any difference to ZFS, as it identifies the drives by a label written to them, rather than by their controller path.

Perhaps someone in ZFS support could analyze the panic to determine the cause, or look at the disk labels; have you made the core file available to Sun?
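For what it's worth, you can look at a label yourself with zdb -- a sketch only, and the slice name is just an example (with whole disks ZFS normally keeps its data in slice 0 of an EFI label):

# zdb -l /dev/rdsk/c0t0d0s0

That should print the four copies of the label on that disk, including the pool name, the pool and vdev GUIDs, and the device path recorded the last time the pool was open -- which is how ZFS recognizes a drive no matter which port it is cabled to now.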
How do I do that?

I have the system messages (recorded by syslogd), and from there you can tell roughly the time period when things went wrong.

If anybody from anywhere (Sun, ZFS, Solaris, etc.) wants to take a look at the unix.* and vmcore.* data, in addition to any logs or system messages -- just let me know where to send it, and I will gladly publish the data.

At this point, I am willing to do whatever it takes to get help, because I don't want to lose the new data, and I also don't want to have to rebuild the system from scratch or go through the 2-8 million permutations of possible cable arrangements.

Thank you ALL, very much, for helping. I know that I may not sound like it (all of the time), but you have no idea how greatly I appreciate all the help you guys are willing to give me. Thank you.
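Sizes permitting, I could package the first dump up with something like the following (the output file name is just an example I made up):

# cd /var/crash/FILESERVER
# tar cf - unix.0 vmcore.0 | gzip > /var/tmp/FILESERVER.crash0.tar.gz

and then upload the resulting archive wherever someone tells me to put it.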
Ewen Chan wrote:
> (with help from Robert)
>
> Yes, there are files.
>
> # pwd
> /var/crash/FILESERVER
> # ls -F
> bounds  unix.0  unix.1  vmcore.0  vmcore.1
> # mdb 0
> Loading modules: [ unix krtld genunix specfs dtrace cpu.AuthenticAMD.15 ufs ip sctp usba fctl nca lofs random zfs nfs sppp ptm cpc fcip ]
>> ::status
> debugging crash dump vmcore.0 (64-bit) from unknown
> operating system: 5.10 Generic_118855-14 (i86pc)
> panic message:
> assertion failed: 0 == dmu_buf_hold_array(os, object, offset, size, FALSE, FTAG, &numbufs, &dbp), file: ../../common/fs/zfs/dmu.c, line: 366
> dump content: kernel pages only
>> ::stack
> vpanic()
> 0xfffffffffb9ad0f3()
> dmu_write+0x127()
> space_map_sync+0x1ee()
> metaslab_sync+0xfa()
> vdev_sync+0x50()
> spa_sync+0x10e()
> txg_sync_thread+0x115()

As I've mentioned, my best guess is that something went wrong with your hardware. If you'd like, I could perform some more investigation if you can provide me with ssh access to a root account on your machine. Send the username/password to me off-list.

--matt