Ross Smith
2008-Aug-07 14:44 UTC
[zfs-discuss] FW: Supermicro AOC-SAT2-MV8 hang when drive removed
Sending this again as it appears my previous e-mail never made it to the list. Richard, did you receive it?

Ross

From: myxiplx at hotmail.com
To: richard.elling at sun.com; zfs-discuss at opensolaris.org
Subject: RE: [zfs-discuss] Supermicro AOC-SAT2-MV8 hang when drive removed
Date: Tue, 5 Aug 2008 15:04:49 +0100

Ok, I think I've got to the bottom of all this now, but it took some work to figure out everything that was going on. I couldn't think of any way to sensibly write this all up in an e-mail, so I've written up my findings and they're all in the attached PDF.

The initial problem can be summarised as: ZFS can cause silent data loss if you accidentally remove a device from a pool that's in a non-redundant state.

But that breaks down into several individual issues:

- SATA hot plug is poorly supported on the Supermicro AOC-SAT2-MV8 card, which uses a Marvell 88SX6081 controller.
- ZFS is inconsistent in its handling of SATA devices going offline.
- FMA takes too long to diagnose a device removal, and can generate hundreds of MB of errors while doing so.
- ZFS can continue to read and write from a pool for some considerable time after it has gone offline.
- "zpool status" can not only hang, but can lock out other tools.
- BUG 6667199: "zpool clear" hangs on single drives (and probably also hangs for any pool in a non-redundant state). Probably related to BUG 6667208: "zpool status" doesn't report if there has been a problem mounting the pool.

Ross

> Date: Thu, 31 Jul 2008 09:17:46 -0700
> From: Richard.Elling at Sun.COM
> Subject: Re: [zfs-discuss] Supermicro AOC-SAT2-MV8 hang when drive removed
> To: myxiplx at hotmail.com
>
> Ross Smith wrote:
> > Ok, in snv_94, with a USB drive that I pulled half way through the
> > standard copy of 19k files (71MB).
> >
> > This time the copy operation paused after just 5-10MB more, and it's
> > currently sat there. FMdump doesn't have a lot to say; fmdump -e has
> > been scrolling zfs io & data messages down the screen for nearly 10
> > minutes now.
>
> OK, that is what I saw. There is a transaction group which is waiting
> to get out and it has up to 5 seconds of writes in it.
>
> There are a couple of rounds of logic going on here with the diagnosis
> and feedback to ZFS to stop trying. These things can get very complex
> to solve for the general case, but the current state seems to be
> suboptimal.
>
> > # fmdump
> > TIME                 UUID                                 SUNW-MSG-ID
> > Jul 25 11:27:27.2858 08faf2a3-e39f-e435-8229-d409514f8531 ZFS-8000-D3
>
> Interesting... you got 3 -D3 diagnoses and one -HC (which is what I also
> got). The -D3 is similar, but may also lead to a different zpool status -x
> result (which has yet another diagnosis).
>
> > Jul 29 16:27:56.5151 c2537861-80bb-6154-c8d2-cac9fb1674ae ZFS-8000-D3
> > Jul 30 14:11:08.8059 7e33e484-728e-4ffe-cbdc-e9d8a05e33aa ZFS-8000-HC
> > Jul 31 11:45:12.3883 d76fcc2c-acee-6b62-f70f-b770651ea5ad ZFS-8000-D3
> >
> > The fmdump -e lines are all along the lines of:
> > Jul 31 08:21:38.9999 ereport.fs.zfs.io
> > Jul 31 08:21:38.9999 ereport.fs.zfs.data
>
> Yes, these are error reports where ZFS hit an I/O error and that
> will stimulate a data error report, too. The correlation and analysis
> of these errors is done by FMA (actually fmd). I also noticed a
> lot of activity on the /var file system as fmd was busy checkpointing
> the zfs diagnosis. This is probably redundant, redundant also.
>
> > I plugged the USB disk in again, /var/adm/messages says:
> >
> > Jul 31 16:45:06 unknown usba: [ID 691482 kern.warning] WARNING:
> > /pci@0,0/pci15d9,a011@2,1/storage@3 (scsa2usb0): Disconnected device
> > was busy, please reconnect.
> > Jul 31 16:45:07 unknown scsi: [ID 107833 kern.warning] WARNING:
> > /pci@0,0/pci15d9,a011@2,1/storage@3/disk@0,0 (sd17):
> > Jul 31 16:45:07 unknown Command failed to complete...Device is gone
> > Jul 31 16:45:07 unknown scsi: [ID 107833 kern.warning] WARNING:
> > /pci@0,0/pci15d9,a011@2,1/storage@3/disk@0,0 (sd17):
> > Jul 31 16:45:07 unknown Command failed to complete...Device is gone
> > Jul 31 16:45:07 unknown scsi: [ID 107833 kern.warning] WARNING:
> > /pci@0,0/pci15d9,a011@2,1/storage@3/disk@0,0 (sd17):
> > Jul 31 16:45:07 unknown Command failed to complete...Device is gone
> > Jul 31 16:45:07 unknown scsi: [ID 107833 kern.warning] WARNING:
> > /pci@0,0/pci15d9,a011@2,1/storage@3/disk@0,0 (sd17):
> > Jul 31 16:45:07 unknown Command failed to complete...Device is gone
> > Jul 31 16:45:07 unknown scsi: [ID 107833 kern.warning] WARNING:
> > /pci@0,0/pci15d9,a011@2,1/storage@3/disk@0,0 (sd17):
> > Jul 31 16:45:07 unknown Command failed to complete...Device is gone
> > Jul 31 16:45:07 unknown scsi: [ID 107833 kern.warning] WARNING:
> > /pci@0,0/pci15d9,a011@2,1/storage@3/disk@0,0 (sd17):
> > Jul 31 16:45:07 unknown Command failed to complete...Device is gone
> > Jul 31 16:45:07 unknown scsi: [ID 107833 kern.warning] WARNING:
> > /pci@0,0/pci15d9,a011@2,1/storage@3/disk@0,0 (sd17):
> > Jul 31 16:45:07 unknown Command failed to complete...Device is gone
> > Jul 31 16:45:17 unknown smbd[516]: [ID 766186 daemon.error]
> > NbtDatagramDecode[11]: too small packet
> > Jul 31 16:47:17 unknown last message repeated 1 time
> > Jul 31 16:49:06 unknown /sbin/dhcpagent[100]: [ID 732317
> > daemon.warning] accept_v4_acknak: ACK packet on nge0 missing mandatory
> > lease option, ignored
> > Jul 31 16:49:22 unknown last message repeated 5 times
> > Jul 31 16:49:54 unknown usba: [ID 691482 kern.warning] WARNING:
> > /pci@0,0/pci15d9,a011@2,1/storage@3 (scsa2usb0): Reinserted device is
> > accessible again.
> >
> > After a few minutes (at 16:51), fmdump -e changed from the above lines to:
> > fmdump: warning: skipping record: log file corruption detected
> >
> > Checking /var/adm/messages now gives:
> > Jul 31 16:50:17 unknown smbd[516]: [ID 766186 daemon.error]
> > NbtDatagramDecode[11]: too small packet
> > Jul 31 16:50:50 unknown fmd: [ID 441519 daemon.error] SUNW-MSG-ID:
> > ZFS-8000-FD, TYPE: Fault, VER: 1, SEVERITY: Major
> > Jul 31 16:50:50 unknown EVENT-TIME: Thu Jul 31 16:50:49 BST 2008
> > Jul 31 16:50:50 unknown PLATFORM: H8DM3-2, CSN: 1234567890, HOSTNAME: unknown
> > Jul 31 16:50:50 unknown SOURCE: zfs-diagnosis, REV: 1.0
> > Jul 31 16:50:50 unknown EVENT-ID: 3a12b357-2d61-491f-e8ab-9247ebcea342
> > Jul 31 16:50:50 unknown DESC: The number of I/O errors associated with
> > a ZFS device exceeded acceptable levels. Refer to
> > http://sun.com/msg/ZFS-8000-FD for more information.
> > Jul 31 16:50:50 unknown AUTO-RESPONSE: The device has been offlined
> > and marked as faulted. An attempt will be made to activate a hot spare
> > if available.
> > Jul 31 16:50:50 unknown IMPACT: Fault tolerance of the pool may be
> > compromised.
> > Jul 31 16:50:50 unknown REC-ACTION: Run 'zpool status -x' and replace
> > the bad device.
> >
> > Which looks pretty similar to what you saw. zpool status still
> > appears to hang though.
>
> Yes. The hang is due to the failmode property. A process waiting on I/O
> in UNIX will not receive any signals until it wakes from the wait... which
> won't happen because the failmode=wait. I'm going to try another test
> with failmode=continue and see what happens.
>
> FWIW, there is considerable debate about whether failmode=wait or
> continue is the best default. wait works like the default for NFS, which
> works like most PC-like operating systems. For highly available systems,
> we'd actually rather 'get off the pot' than 'sh*t', so we tend to prefer
> panic, with a compromise on continue.
>
> > Running fmdump again, I now have this line at the bottom:
> >
> > TIME                 UUID                                 SUNW-MSG-ID
> > Jul 31 16:50:49.9906 3a12b357-2d61-491f-e8ab-9247ebcea342 ZFS-8000-FD
> >
> > This is the first time I've ever seen that FMD message appear in
> > /var/adm/messages. I wonder if it's the zpool status hanging that's
> > causing the FMD stuff to not work? What happens if you try to
> > reproduce this there and run zpool status as you remove your drive?
>
> Some zpool commands will wait, but I had good luck with
> zpool status -x... but now that seems to be hanging too. I don't
> think zpool status should hang, ever, so this looks like a real
> bug.
> -- richard
>
> > Ross
> >
> > > Date: Thu, 31 Jul 2008 07:42:48 -0700
> > > From: Richard.Elling at Sun.COM
> > > Subject: Re: [zfs-discuss] Supermicro AOC-SAT2-MV8 hang when drive removed
> > > To: myxiplx at hotmail.com
> > >
> > > [off-alias, as the e-mails may get large...]
> > > what does fmdump and fmdump -e say?
> > > -- richard
> > >
> > > Ross Smith wrote:
> > > > I'm not sure you're actually seeing the same problem there Richard.
> > > > It seems that for you I/O is stopping on removal of the device,
> > > > whereas for me I/O continues for some considerable time. You are also
> > > > able to obtain a result from 'zpool status' whereas that completely
> > > > hangs for me.
> > > >
> > > > To illustrate the difference, this is what I saw today in snv_94, with
> > > > a pool created from a single external USB hard drive.
> > > >
> > > > 1. As before, I started a copy of a directory using Solaris' file
> > > > manager. About 1/3 of the way through I pulled the plug on the drive.
> > > >
> > > > 2. File manager continued to copy a further 30MB+ of files across.
> > > > Checking the properties of the copy shows it contains 71.1MB of data
> > > > and 19,160 files, despite me pulling the drive at around 8,000 files.
> > > >
> > > > 3. 8:24am I ran 'zpool status':
> > > >
> > > > # zpool status rc-usb
> > > >   pool: rc-usb
> > > >  state: ONLINE
> > > > status: One or more devices has experienced an error resulting in data
> > > >         corruption. Applications may be affected.
> > > > action: Restore the file in question if possible. Otherwise restore the
> > > >         entire pool from backup.
> > > >    see: http://www.sun.com/msg/ZFS-8000-8A
> > > >  scrub: none requested
> > > >
> > > > That is as far as it gets. It never gives me any further
> > > > information. I left it two hours, and it still had not displayed the
> > > > status of the drive in the pool. I also did a 'zfs list'; that also
> > > > hangs now, although I'm pretty sure that if you run 'zfs list' before
> > > > 'zpool status' it works fine.
> > > >
> > > > As you can see from /var/adm/messages, I am getting nothing at all
> > > > from FMA:
> > > >
> > > > Jul 31 08:16:46 unknown usba: [ID 912658 kern.info] USB 2.0 device
> > > > (usbd49,7350) operating at hi speed (USB 2.x) on USB 2.0 root hub:
> > > > storage@3, scsa2usb0 at bus address 2
> > > > Jul 31 08:16:46 unknown usba: [ID 349649 kern.info] Maxtor OneTouch 2HAP70DZ
> > > > Jul 31 08:16:46 unknown genunix: [ID 936769 kern.info] scsa2usb0 is
> > > > /pci@0,0/pci15d9,a011@2,1/storage@3
> > > > Jul 31 08:16:46 unknown genunix: [ID 408114 kern.info]
> > > > /pci@0,0/pci15d9,a011@2,1/storage@3 (scsa2usb0) online
> > > > Jul 31 08:16:46 unknown scsi: [ID 193665 kern.info] sd17 at scsa2usb0:
> > > > target 0 lun 0
> > > > Jul 31 08:16:46 unknown genunix: [ID 936769 kern.info] sd17 is
> > > > /pci@0,0/pci15d9,a011@2,1/storage@3/disk@0,0
> > > > Jul 31 08:16:46 unknown genunix: [ID 340201 kern.warning] WARNING:
> > > > Page83 data not standards compliant Maxtor OneTouch 0125
> > > > Jul 31 08:16:46 unknown genunix: [ID 408114 kern.info]
> > > > /pci@0,0/pci15d9,a011@2,1/storage@3/disk@0,0 (sd17) online
> > > > Jul 31 08:16:49 unknown pcplusmp: [ID 444295 kern.info] pcplusmp: ide
> > > > (ata) instance #1 vector 0xf ioapic 0x4 intin 0xf is bound to cpu 3
> > > > Jul 31 08:16:49 unknown scsi: [ID 193665 kern.info] sd14 at
> > > > marvell88sx1: target 7 lun 0
> > > > Jul 31 08:16:49 unknown genunix: [ID 936769 kern.info] sd14 is
> > > > /pci@1,0/pci1022,7458@2/pci11ab,11ab@1/disk@7,0
> > > > Jul 31 08:16:49 unknown genunix: [ID 408114 kern.info]
> > > > /pci@1,0/pci1022,7458@2/pci11ab,11ab@1/disk@7,0 (sd14) online
> > > > Jul 31 08:21:35 unknown usba: [ID 691482 kern.warning] WARNING:
> > > > /pci@0,0/pci15d9,a011@2,1/storage@3 (scsa2usb0): Disconnected device
> > > > was busy, please reconnect.
> > > > Jul 31 08:21:38 unknown scsi: [ID 107833 kern.warning] WARNING:
> > > > /pci@0,0/pci15d9,a011@2,1/storage@3/disk@0,0 (sd17):
> > > > Jul 31 08:21:38 unknown Command failed to complete...Device is gone
> > > > Jul 31 08:21:38 unknown scsi: [ID 107833 kern.warning] WARNING:
> > > > /pci@0,0/pci15d9,a011@2,1/storage@3/disk@0,0 (sd17):
> > > > Jul 31 08:21:38 unknown Command failed to complete...Device is gone
> > > > Jul 31 08:21:38 unknown scsi: [ID 107833 kern.warning] WARNING:
> > > > /pci@0,0/pci15d9,a011@2,1/storage@3/disk@0,0 (sd17):
> > > > Jul 31 08:21:38 unknown Command failed to complete...Device is gone
> > > > Jul 31 08:21:38 unknown scsi: [ID 107833 kern.warning] WARNING:
> > > > /pci@0,0/pci15d9,a011@2,1/storage@3/disk@0,0 (sd17):
> > > > Jul 31 08:21:38 unknown Command failed to complete...Device is gone
> > > > Jul 31 08:21:38 unknown scsi: [ID 107833 kern.warning] WARNING:
> > > > /pci@0,0/pci15d9,a011@2,1/storage@3/disk@0,0 (sd17):
> > > > Jul 31 08:21:38 unknown Command failed to complete...Device is gone
> > > > Jul 31 08:21:38 unknown scsi: [ID 107833 kern.warning] WARNING:
> > > > /pci@0,0/pci15d9,a011@2,1/storage@3/disk@0,0 (sd17):
> > > > Jul 31 08:21:38 unknown Command failed to complete...Device is gone
> > > > Jul 31 08:21:38 unknown scsi: [ID 107833 kern.warning] WARNING:
> > > > /pci@0,0/pci15d9,a011@2,1/storage@3/disk@0,0 (sd17):
> > > > Jul 31 08:21:38 unknown Command failed to complete...Device is gone
> > > > Jul 31 08:24:26 unknown scsi: [ID 107833 kern.warning] WARNING:
> > > > /pci@0,0/pci15d9,a011@2,1/storage@3/disk@0,0 (sd17):
> > > > Jul 31 08:24:26 unknown Command failed to complete...Device is gone
> > > > Jul 31 08:24:26 unknown scsi: [ID 107833 kern.warning] WARNING:
> > > > /pci@0,0/pci15d9,a011@2,1/storage@3/disk@0,0 (sd17):
> > > > Jul 31 08:24:26 unknown Command failed to complete...Device is gone
> > > > Jul 31 08:24:26 unknown scsi: [ID 107833 kern.warning] WARNING:
> > > > /pci@0,0/pci15d9,a011@2,1/storage@3/disk@0,0 (sd17):
> > > > Jul 31 08:24:26 unknown drive offline
> > > > Jul 31 08:27:43 unknown smbd[603]: [ID 766186 daemon.error]
> > > > NbtDatagramDecode[11]: too small packet
> > > > Jul 31 08:39:43 unknown smbd[603]: [ID 766186 daemon.error]
> > > > NbtDatagramDecode[11]: too small packet
> > > > Jul 31 08:44:50 unknown /sbin/dhcpagent[95]: [ID 732317
> > > > daemon.warning] accept_v4_acknak: ACK packet on nge0 missing mandatory
> > > > lease option, ignored
> > > > Jul 31 08:44:58 unknown last message repeated 3 times
> > > > Jul 31 08:45:06 unknown /sbin/dhcpagent[95]: [ID 732317
> > > > daemon.warning] accept_v4_acknak: ACK packet on nge0 missing mandatory
> > > > lease option, ignored
> > > > Jul 31 08:45:06 unknown last message repeated 1 time
> > > > Jul 31 08:51:44 unknown smbd[603]: [ID 766186 daemon.error]
> > > > NbtDatagramDecode[11]: too small packet
> > > > Jul 31 09:03:44 unknown smbd[603]: [ID 766186 daemon.error]
> > > > NbtDatagramDecode[11]: too small packet
> > > > Jul 31 09:13:51 unknown /sbin/dhcpagent[95]: [ID 732317
> > > > daemon.warning] accept_v4_acknak: ACK packet on nge0 missing mandatory
> > > > lease option, ignored
> > > > Jul 31 09:14:09 unknown last message repeated 5 times
> > > > Jul 31 09:15:44 unknown smbd[603]: [ID 766186 daemon.error]
> > > > NbtDatagramDecode[11]: too small packet
> > > > Jul 31 09:27:44 unknown smbd[603]: [ID 766186 daemon.error]
> > > > NbtDatagramDecode[11]: too small packet
> > > > Jul 31 09:27:55 unknown pcplusmp: [ID 444295 kern.info] pcplusmp: ide
> > > > (ata) instance #1 vector 0xf ioapic 0x4 intin 0xf is bound to cpu 3
> > > >
> > > > cfgadm reports that the port is empty but still configured:
> > > >
> > > > # cfgadm
> > > > Ap_Id      Type       Receptacle   Occupant     Condition
> > > > usb1/3     unknown    empty        configured   unusable
> > > >
> > > > 4. 9:32am I now tried writing more data to the pool, to see if I can
> > > > trigger the I/O error you are seeing. I tried making a second copy of
> > > > the files on the USB drive in the Solaris File manager, but that
> > > > attempt simply hung the copy dialog. I'm still seeing nothing else
> > > > that appears relevant in /var/adm/messages.
> > > >
> > > > 5. 10:08am While checking free space, I found that although df works,
> > > > 'df -kh' hangs, apparently when it tries to query any zfs pool:
> > > >
> > > > # df
> > > > /                  (/dev/dsk/c1t0d0s0 ):   2504586 blocks    656867 files
> > > > /devices           (/devices          ):         0 blocks         0 files
> > > > /dev               (/dev              ):         0 blocks         0 files
> > > > /system/contract   (ctfs              ):         0 blocks 2147483609 files
> > > > /proc              (proc              ):         0 blocks     29902 files
> > > > /etc/mnttab        (mnttab            ):         0 blocks         0 files
> > > > /etc/svc/volatile  (swap              ):   9850928 blocks   1180374 files
> > > > /system/object     (objfs             ):         0 blocks 2147483409 files
> > > > /etc/dfs/sharetab  (sharefs           ):         0 blocks 2147483646 files
> > > > /lib/libc.so.1     (/usr/lib/libc/libc_hwcap2.so.1):   2504586 blocks    656867 files
> > > > /dev/fd            (fd                ):         0 blocks         0 files
> > > > /tmp               (swap              ):   9850928 blocks   1180374 files
> > > > /var/run           (swap              ):   9850928 blocks   1180374 files
> > > > /export/home       (/dev/dsk/c1t0d0s7 ): 881398942 blocks  53621232 files
> > > > /rc-pool           (rc-pool           ): 4344346098 blocks 4344346098 files
> > > > /rc-pool/admin     (rc-pool/admin     ): 4344346098 blocks 4344346098 files
> > > > /rc-pool/ross-home (rc-pool/ross-home ): 4344346098 blocks 4344346098 files
> > > > /rc-pool/vmware    (rc-pool/vmware    ): 4344346098 blocks 4344346098 files
> > > > /rc-usb            (rc-usb            ):  153725153 blocks  153725153 files
> > > >
> > > > # df -kh
> > > > Filesystem             size   used  avail capacity  Mounted on
> > > > /dev/dsk/c1t0d0s0      7.2G   6.0G   1.1G    85%    /
> > > > /devices                 0K     0K     0K     0%    /devices
> > > > /dev                     0K     0K     0K     0%    /dev
> > > > ctfs                     0K     0K     0K     0%    /system/contract
> > > > proc                     0K     0K     0K     0%    /proc
> > > > mnttab                   0K     0K     0K     0%    /etc/mnttab
> > > > swap                   4.7G   1.1M   4.7G     1%    /etc/svc/volatile
> > > > objfs                    0K     0K     0K     0%    /system/object
> > > > sharefs                  0K     0K     0K     0%    /etc/dfs/sharetab
> > > > /usr/lib/libc/libc_hwcap2.so.1
> > > >                        7.2G   6.0G   1.1G    85%    /lib/libc.so.1
> > > > fd                       0K     0K     0K     0%    /dev/fd
> > > > swap                   4.7G    48K   4.7G     1%    /tmp
> > > > swap                   4.7G    76K   4.7G     1%    /var/run
> > > > /dev/dsk/c1t0d0s7      425G   4.8G   416G     2%    /export/home
> > > >
> > > > 6. 10:35am It's now been two hours; neither 'zpool status' nor 'zfs
> > > > list' have ever finished. The file copy attempt has also been hung
> > > > for over an hour (although that's not unexpected with 'wait' as the
> > > > failmode).
> > > >
> > > > Richard, you say ZFS is not silently failing; well, for me it appears
> > > > that it is. I can't see any warnings from ZFS, I can't get any status
> > > > information. I see no way that I could find out what files are going
> > > > to be lost on this server.
> > > >
> > > > Yes, I'm now aware that the pool has hung since file operations are
> > > > hanging; however, had that been my first indication of a problem, I
> > > > believe I am now left in a position where I cannot find out either the
> > > > cause nor the files affected. I don't believe I have any way to find
> > > > out which operations had completed without error, but are not
> > > > currently committed to disk. I certainly don't get the status message
> > > > you do saying permanent errors have been found in files.
> > > >
> > > > I plugged the USB drive back in now; Solaris detected it ok, but ZFS
> > > > is still hung. The rest of /var/adm/messages is:
> > > >
> > > > Jul 31 09:39:44 unknown smbd[603]: [ID 766186 daemon.error]
> > > > NbtDatagramDecode[11]: too small packet
> > > > Jul 31 09:45:22 unknown /sbin/dhcpagent[95]: [ID 732317
> > > > daemon.warning] accept_v4_acknak: ACK packet on nge0 missing mandatory
> > > > lease option, ignored
> > > > Jul 31 09:45:38 unknown last message repeated 5 times
> > > > Jul 31 09:51:44 unknown smbd[603]: [ID 766186 daemon.error]
> > > > NbtDatagramDecode[11]: too small packet
> > > > Jul 31 10:03:44 unknown last message repeated 2 times
> > > > Jul 31 10:14:27 unknown /sbin/dhcpagent[95]: [ID 732317
> > > > daemon.warning] accept_v4_acknak: ACK packet on nge0 missing mandatory
> > > > lease option, ignored
> > > > Jul 31 10:14:45 unknown last message repeated 5 times
> > > > Jul 31 10:15:44 unknown smbd[603]: [ID 766186 daemon.error]
> > > > NbtDatagramDecode[11]: too small packet
> > > > Jul 31 10:27:45 unknown smbd[603]: [ID 766186 daemon.error]
> > > > NbtDatagramDecode[11]: too small packet
> > > > Jul 31 10:36:25 unknown usba: [ID 691482 kern.warning] WARNING:
> > > > /pci@0,0/pci15d9,a011@2,1/storage@3 (scsa2usb0): Reinserted device is
> > > > accessible again.
> > > > Jul 31 10:39:45 unknown smbd[603]: [ID 766186 daemon.error]
> > > > NbtDatagramDecode[11]: too small packet
> > > > Jul 31 10:45:53 unknown /sbin/dhcpagent[95]: [ID 732317
> > > > daemon.warning] accept_v4_acknak: ACK packet on nge0 missing mandatory
> > > > lease option, ignored
> > > > Jul 31 10:46:09 unknown last message repeated 5 times
> > > > Jul 31 10:51:45 unknown smbd[603]: [ID 766186 daemon.error]
> > > > NbtDatagramDecode[11]: too small packet
> > > >
> > > > 7. 10:55am Gave up on ZFS ever recovering. A shutdown attempt hung
> > > > as expected. I hard-reset the computer.
> > > >
> > > > Ross
> > > >
> > > > > Date: Wed, 30 Jul 2008 11:17:08 -0700
> > > > > From: Richard.Elling at Sun.COM
> > > > > Subject: Re: [zfs-discuss] Supermicro AOC-SAT2-MV8 hang when drive removed
> > > > > To: myxiplx at hotmail.com
> > > > > CC: zfs-discuss at opensolaris.org
> > > > >
> > > > > I was able to reproduce this in b93, but might have a different
> > > > > interpretation of the conditions. More below...
> > > > >
> > > > > Ross Smith wrote:
> > > > > > A little more information today. I had a feeling that ZFS would
> > > > > > continue quite some time before giving an error, and today I've shown
> > > > > > that you can carry on working with the filesystem for at least half an
> > > > > > hour with the disk removed.
> > > > > >
> > > > > > I suspect on a system with little load you could carry on working for
> > > > > > several hours without any indication that there is a problem. It
> > > > > > looks to me like ZFS is caching reads & writes, and that provided
> > > > > > requests can be fulfilled from the cache, it doesn't care whether the
> > > > > > disk is present or not.
> > > > >
> > > > > In my USB-flash-disk-sudden-removal-while-writing-big-file test:
> > > > > 1. I/O to the missing device stopped (as I expected).
> > > > > 2. FMA kicked in, as expected.
> > > > > 3. /var/adm/messages recorded 'Command failed to complete... device gone.'
> > > > > 4. After exactly 9 minutes, 17,951 e-reports had been processed and the
> > > > > diagnosis was complete. FMA logged the following to /var/adm/messages:
> > > > >
> > > > > Jul 30 10:33:44 grond scsi: [ID 107833 kern.warning] WARNING:
> > > > > /pci@0,0/pci1458,5004@b,1/storage@8/disk@0,0 (sd1):
> > > > > Jul 30 10:33:44 grond Command failed to complete...Device is gone
> > > > > Jul 30 10:42:31 grond fmd: [ID 441519 daemon.error] SUNW-MSG-ID:
> > > > > ZFS-8000-FD, TYPE: Fault, VER: 1, SEVERITY: Major
> > > > > Jul 30 10:42:31 grond EVENT-TIME: Wed Jul 30 10:42:30 PDT 2008
> > > > > Jul 30 10:42:31 grond PLATFORM: , CSN: , HOSTNAME: grond
> > > > > Jul 30 10:42:31 grond SOURCE: zfs-diagnosis, REV: 1.0
> > > > > Jul 30 10:42:31 grond EVENT-ID: d99769aa-28e8-cf16-d181-945592130525
> > > > > Jul 30 10:42:31 grond DESC: The number of I/O errors associated with a
> > > > > ZFS device exceeded acceptable levels. Refer to
> > > > > http://sun.com/msg/ZFS-8000-FD for more information.
> > > > > Jul 30 10:42:31 grond AUTO-RESPONSE: The device has been offlined and
> > > > > marked as faulted. An attempt will be made to activate a hot spare if
> > > > > available.
> > > > > Jul 30 10:42:31 grond IMPACT: Fault tolerance of the pool may be
> > > > > compromised.
> > > > > Jul 30 10:42:31 grond REC-ACTION: Run 'zpool status -x' and replace
> > > > > the bad device.
> > > > >
> > > > > The above URL shows what you expect, but more (and better) info
> > > > > is available from zpool status -xv:
> > > > >
> > > > >   pool: rmtestpool
> > > > >  state: UNAVAIL
> > > > > status: One or more devices are faulted in response to IO failures.
> > > > > action: Make sure the affected devices are connected, then run
> > > > >         'zpool clear'.
> > > > >    see: http://www.sun.com/msg/ZFS-8000-HC
> > > > >  scrub: none requested
> > > > > config:
> > > > >
> > > > >         NAME        STATE     READ WRITE CKSUM
> > > > >         rmtestpool  UNAVAIL      0 15.7K     0  insufficient replicas
> > > > >           c2t0d0p0  FAULTED      0 15.7K     0  experienced I/O failures
> > > > >
> > > > > errors: Permanent errors have been detected in the following files:
> > > > >
> > > > >         /rmtestpool/random.data
> > > > >
> > > > > If you surf to http://www.sun.com/msg/ZFS-8000-HC you'll see words
> > > > > to the effect that: the pool has experienced I/O failures. Since the
> > > > > ZFS pool property 'failmode' is set to 'wait', all I/Os (reads and
> > > > > writes) are blocked. See the zpool(1M) manpage for more information
> > > > > on the 'failmode' property. Manual intervention is required for I/Os
> > > > > to be serviced.
> > > > >
> > > > > > I would guess that ZFS is attempting to write to the disk in the
> > > > > > background, and that this is silently failing.
> > > > >
> > > > > It is clearly not silently failing.
> > > > >
> > > > > However, the default failmode property is set to 'wait', which will
> > > > > patiently wait forever. If you would rather have the I/O fail, then
> > > > > you should change the failmode to 'continue'. I would not normally
> > > > > recommend a failmode of 'panic'.
> > > > >
> > > > > Now to figure out how to recover gracefully... zpool clear isn't
> > > > > happy...
> > > > >
> > > > > [sidebar]
> > > > > While performing this experiment, I noticed that fmd was checkpointing
> > > > > the diagnosis engine to disk in the /var/fm/fmd/ckpt/zfs-diagnosis
> > > > > directory. If this had been the boot disk, with failmode=wait, I'm not
> > > > > convinced that we'd get a complete diagnosis... I'll explore that later.
> > > > > [/sidebar]
> > > > >
> > > > > -- richard

[Attachment: "Problems with ZFS + SATA hot plug.pdf" (application/pdf, 84624 bytes) - <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20080807/766ce90a/attachment.pdf>]
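For reference, the fmdump invocations used throughout the exchange above are standard fmdump(1M) usage, roughly:

    # fmdump           # list completed FMA diagnoses (the ZFS-8000-* events above)
    # fmdump -e        # list raw error reports, e.g. ereport.fs.zfs.io
    # fmdump -eV       # dump each error report in full detail

fmdump reads the fault log, while fmdump -e reads the error log that fmd correlates into a diagnosis; the flood of ereport.fs.zfs.io and ereport.fs.zfs.data lines Ross saw is that error log filling up.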
Ross
2008-Aug-11 14:15 UTC
[zfs-discuss] FW: Supermicro AOC-SAT2-MV8 hang when drive removed
Ok, I've now reported most of the problems I found, but have additional information to add to bugs 6667199 and 6667208. Can anybody tell me how I go about reporting that to Sun?

thanks,

Ross
Brian D. Horn
2008-Aug-13 22:46 UTC
[zfs-discuss] FW: Supermicro AOC-SAT2-MV8 hang when drive removed
Looking at what you wrote, you claim that hot plug events on ports 6 and 7 generally work, but that removals on the other ports are not immediately discovered. Since there is no special code for ports 6 & 7, and no one else has reported this sort of behavior, that would make me think you have a hardware issue: possibly poor signaling over the SATA cables, or possibly marginal power. See if things behave differently with fewer disks attached, or when mixing and matching cables and drives.
Ross
2008-Aug-14 10:24 UTC
[zfs-discuss] FW: Supermicro AOC-SAT2-MV8 hang when drive removed
This is the problem when you try to write up a good summary of what you found. I've got pages and pages of notes of all the tests I did here, far more than I could include in that PDF.

What makes me think it's a driver issue is that I've done much of what you suggested. I've replicated the exact same behaviour on two different cards, individually and with both cards attached to the server. It's also consistent across many different brands and types of drive, and occurs even if I have just 4 drives connected out of 8 on a single controller.

I did wonder whether it could be hardware related, so I tested plugging and unplugging drives while the computer was booting. While doing that and hot-plugging drives in the BIOS, at no point did I see any hanging of the system, which tends to confirm my thought that it's driver related.

I was also able to power on the system with all drives connected, wait for the controllers to finish scanning the drives, then remove a few at the GRUB boot screen. From there, when I continued to boot Solaris, the correct state was detected every time for all drives.

Based on that, it appears that it's purely a problem with detection of the insertion/removal event after Solaris has loaded its drivers. Initial detection is fine; it's purely hot swap detection on ports 0-5 that fails. I know it sounds weird, but trust me, I checked this pretty carefully, and experience has taught me never to assume computers won't behave in odd ways.

I do appreciate my diagnosis may be wrong, as I have very limited knowledge of Solaris' internals, but that is my best guess right now.

Ross
Tim
2008-Aug-14 13:26 UTC
[zfs-discuss] FW: Supermicro AOC-SAT2-MV8 hang when drive removed
I don't have any extra cards lying around and can't really take my server down, so my immediate question would be: is there any sort of PCI bridge chip on the card? I know in my experience I've seen all sorts of headaches with less-than-stellar bridge chips, specifically some of the IBM bridge chips.

Food for thought.

--Tim

On Thu, Aug 14, 2008 at 5:24 AM, Ross <myxiplx at hotmail.com> wrote:

> This is the problem when you try to write up a good summary of what you
> found. I've got pages and pages of notes of all the tests I did here, far
> more than I could include in that PDF.
>
> What makes me think it's a driver issue is that I've done much of what you
> suggested. I've replicated the exact same behaviour on two different cards,
> individually and with both cards attached to the server. It's also
> consistent across many different brands and types of drive, and occurs even
> if I have just 4 drives connected out of 8 on a single controller.
>
> I did wonder whether it could be hardware related, so I tested plugging and
> unplugging drives while the computer was booting. While doing that and
> hot-plugging drives in the BIOS, at no point did I see any hanging of the
> system, which tends to confirm my thought that it's driver related.
>
> I was also able to power on the system with all drives connected, wait for
> the controllers to finish scanning the drives, then remove a few at the GRUB
> boot screen. From there, when I continued to boot Solaris, the correct state
> was detected every time for all drives.
>
> Based on that, it appears that it's purely a problem with detection of the
> insertion/removal event after Solaris has loaded its drivers. Initial
> detection is fine; it's purely hot swap detection on ports 0-5 that fails.
> I know it sounds weird, but trust me, I checked this pretty carefully, and
> experience has taught me never to assume computers won't behave in odd ways.
>
> I do appreciate my diagnosis may be wrong, as I have very limited knowledge
> of Solaris' internals, but that is my best guess right now.
>
> Ross
Ross
2008-Aug-15 14:44 UTC
[zfs-discuss] FW: Supermicro AOC-SAT2-MV8 hang when drive removed
Haven't a clue, but I've just gotten around to installing Windows on this box to test, and I can confirm that hot plug works just fine in Windows.

Drives appear and disappear in Device Manager the second I unplug the hardware. Any drive, either controller. So far I've done a couple of dozen removals, pulling individual drives, or as many as half a dozen at once. I've even gone as far as to immediately pull a drive I only just connected. Windows has no problems at all.

Unfortunately for me, Windows doesn't support ZFS... right now it's looking a whole load more stable.

Ross

> I don't have any extra cards lying around and can't really take my server
> down, so my immediate question would be: is there any sort of PCI bridge
> chip on the card? I know in my experience I've seen all sorts of headaches
> with less-than-stellar bridge chips, specifically some of the IBM bridge
> chips.
>
> Food for thought.
>
> --Tim
Tim
2008-Aug-15 15:07 UTC
[zfs-discuss] FW: Supermicro AOC-SAT2-MV8 hang when drive removed
You could always try FreeBSD :)

--Tim

On Fri, Aug 15, 2008 at 9:44 AM, Ross <myxiplx at hotmail.com> wrote:

> Haven't a clue, but I've just gotten around to installing Windows on this
> box to test, and I can confirm that hot plug works just fine in Windows.
>
> Drives appear and disappear in Device Manager the second I unplug the
> hardware. Any drive, either controller. So far I've done a couple of dozen
> removals, pulling individual drives, or as many as half a dozen at once.
> I've even gone as far as to immediately pull a drive I only just connected.
> Windows has no problems at all.
>
> Unfortunately for me, Windows doesn't support ZFS... right now it's
> looking a whole load more stable.
>
> Ross
Ross Smith
2008-Aug-15 15:08 UTC
[zfs-discuss] FW: Supermicro AOC-SAT2-MV8 hang when drive removed
Oh god no, I'm already learning three new operating systems, now is not a good time to add a fourth.

Ross <-- Windows admin now working with Ubuntu, OpenSolaris and ESX

Date: Fri, 15 Aug 2008 10:07:31 -0500
From: tim at tcsac.net
To: myxiplx at hotmail.com
CC: zfs-discuss at opensolaris.org
Subject: Re: [zfs-discuss] FW: Supermicro AOC-SAT2-MV8 hang when drive removed

You could always try FreeBSD :)

--Tim

On Fri, Aug 15, 2008 at 9:44 AM, Ross <myxiplx at hotmail.com> wrote:

> Haven't a clue, but I've just gotten around to installing Windows on this
> box to test, and I can confirm that hot plug works just fine in Windows.
>
> Drives appear and disappear in Device Manager the second I unplug the
> hardware. Any drive, either controller. So far I've done a couple of dozen
> removals, pulling individual drives, or as many as half a dozen at once.
> I've even gone as far as to immediately pull a drive I only just connected.
> Windows has no problems at all.
>
> Unfortunately for me, Windows doesn't support ZFS... right now it's
> looking a whole load more stable.
>
> Ross
Florin Iucha
2008-Aug-15 19:24 UTC
[zfs-discuss] FW: Supermicro AOC-SAT2-MV8 hang when drive removed
On Fri, Aug 15, 2008 at 10:07:31AM -0500, Tim wrote:
> You could always try FreeBSD :)
>
> > Unfortunately for me, Windows doesn't support ZFS... right now it's
> > looking a whole load more stable.

Nope: FreeBSD doesn't have proper power management either.

florin

-- 
Bruce Schneier expects the Spanish Inquisition.
      http://geekz.co.uk/schneierfacts/fact/163
Brian D. Horn
2008-Aug-20 22:27 UTC
[zfs-discuss] FW: Supermicro AOC-SAT2-MV8 hang when drive removed
Well, when you leave out a bunch of relevant information, you also leave people guessing! :-)

Regardless, is it possible that all of your testing was done with ZFS and not just the "raw" disk? If so, it is possible that ZFS isn't noticing the hot unplugging of the disk until it tries to access the drive. I don't know this, but it would be consistent with what you have related to date.
Ross
2008-Aug-20 22:57 UTC
[zfs-discuss] FW: Supermicro AOC-SAT2-MV8 hang when drive removed
lol, I got bored after 13 pages and a whole day of going back through my notes to pick out the relevant information. Besides, I did mention that I was using cfgadm to see what was connected :-p

If you're really interested, most of my troubleshooting notes have been posted to the forum, but unfortunately Sun's software has split it into three or four pieces. Just search for posts talking about the AOC-SAT2-MV8 card to find them.

Without fail, cfgadm changes the status from "disk" to "sata-port" when I unplug a device attached to port 6 or 7, but most of the time unplugging disks 0-5 results in no change in cfgadm until I also attach disk 6 or 7. Often the system hung completely when you pulled one of the disks 0-5, and wouldn't respond again until you re-inserted it.

I'm 99.99% sure this is a driver issue for this controller.
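For anyone trying to reproduce this, the state change Ross describes can be caught (or shown to be missing) by polling cfgadm while pulling drives. A minimal sketch using only standard tools; on a working driver, a pulled disk's Type should flip from "disk" to "sata-port" within a few seconds:

    #!/bin/sh
    # Log any change in the SATA attachment points, checking once per second.
    PREV=/tmp/cfgadm.prev.$$
    CUR=/tmp/cfgadm.cur.$$
    cfgadm -al | grep sata > $PREV
    while true; do
        cfgadm -al | grep sata > $CUR
        if ! cmp -s $PREV $CUR; then
            echo "=== `date`: attachment point state changed ==="
            diff $PREV $CUR
            cp $CUR $PREV
        fi
        sleep 1
    done

On the hardware described in this thread, the expectation is that the loop reports removals on ports 6 and 7 almost immediately, but stays silent for ports 0-5.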
James C. McPherson
2008-Aug-20 23:02 UTC
[zfs-discuss] FW: Supermicro AOC-SAT2-MV8 hang when drive removed
Ross wrote:
> lol, I got bored after 13 pages and a whole day of going back through my
> notes to pick out the relevant information.
>
> Besides, I did mention that I was using cfgadm to see what was connected
> :-p. If you're really interested, most of my troubleshooting notes have
> been posted to the forum, but unfortunately Sun's software has split it
> into three or four pieces. Just search for posts talking about the
> AOC-SAT2-MV8 card to find them.
>
> Without fail, cfgadm changes the status from "disk" to "sata-port" when I
> unplug a device attached to port 6 or 7, but most of the time unplugging
> disks 0-5 results in no change in cfgadm, until I also attach disk 6 or 7.

That does seem inconsistent, or at least, it's not what I'd expect.

> Often the system hung completely when you pulled one of the disks 0-5,
> and wouldn't respond again until you re-inserted it.
>
> I'm 99.99% sure this is a driver issue for this controller.

Have you logged a bug on it yet?

James C. McPherson
-- 
Senior Kernel Software Engineer, Solaris
Sun Microsystems
http://blogs.sun.com/jmcp       http://www.jmcp.homeunix.com/blog
Ross Smith
2008-Aug-20 23:18 UTC
[zfs-discuss] FW: Supermicro AOC-SAT2-MV8 hang when drive removed
> > Without fail, cfgadm changes the status from "disk" to "sata-port" when I
> > unplug a device attached to port 6 or 7, but most of the time unplugging
> > disks 0-5 results in no change in cfgadm, until I also attach disk 6 or 7.
>
> That does seem inconsistent, or at least, it's not what I'd expect.

Yup, was an absolute nightmare to diagnose on top of everything else. It definitely doesn't happen in Windows either. I really want somebody to try snv_94 on a Thumper to see if you get the same behaviour there, or whether it's unique to Supermicro's Marvell card.

> > Often the system hung completely when you pulled one of the disks 0-5,
> > and wouldn't respond again until you re-inserted it.
> >
> > I'm 99.99% sure this is a driver issue for this controller.
>
> Have you logged a bug on it yet?

Yup, 6735931. Added the information about it working in Windows today too.

Ross
James C. McPherson
2008-Aug-20 23:31 UTC
[zfs-discuss] FW: Supermicro AOC-SAT2-MV8 hang when drive removed
Ross Smith wrote:
> > > Without fail, cfgadm changes the status from "disk" to "sata-port" when I
> > > unplug a device attached to port 6 or 7, but most of the time unplugging
> > > disks 0-5 results in no change in cfgadm, until I also attach disk 6 or 7.
> >
> > That does seem inconsistent, or at least, it's not what I'd expect.
>
> Yup, was an absolute nightmare to diagnose on top of everything else.
> It definitely doesn't happen in Windows either. I really want somebody to
> try snv_94 on a Thumper to see if you get the same behaviour there, or
> whether it's unique to Supermicro's Marvell card.

That's a very good question.

> > > Often the system hung completely when you pulled one of the disks 0-5,
> > > and wouldn't respond again until you re-inserted it.
> > >
> > > I'm 99.99% sure this is a driver issue for this controller.
> >
> > Have you logged a bug on it yet?
>
> Yup, 6735931. Added the information about it working in Windows today too.

Heh... I should have recognised that, I moved it from the triage queue to driver/sata :-)

James
-- 
Senior Kernel Software Engineer, Solaris
Sun Microsystems
http://blogs.sun.com/jmcp       http://www.jmcp.homeunix.com/blog
Tim
2008-Aug-20 23:44 UTC
[zfs-discuss] FW: Supermicro AOC-SAT2-MV8 hang when drive removed
I don't think it's just b94; I recall this behavior for as long as I've had the card. I'd also be interested to know if the Sun driver team has ever even tested with this card. I realize it's probably not a top priority, but it sure would be nice to have it working properly.

On 8/20/08, Ross Smith <myxiplx at hotmail.com> wrote:

> > > Without fail, cfgadm changes the status from "disk" to "sata-port" when I
> > > unplug a device attached to port 6 or 7, but most of the time unplugging
> > > disks 0-5 results in no change in cfgadm, until I also attach disk 6 or 7.
> >
> > That does seem inconsistent, or at least, it's not what I'd expect.
>
> Yup, was an absolute nightmare to diagnose on top of everything else.
> It definitely doesn't happen in Windows either. I really want somebody to
> try snv_94 on a Thumper to see if you get the same behaviour there, or
> whether it's unique to Supermicro's Marvell card.
>
> > > Often the system hung completely when you pulled one of the disks 0-5,
> > > and wouldn't respond again until you re-inserted it.
> > >
> > > I'm 99.99% sure this is a driver issue for this controller.
> >
> > Have you logged a bug on it yet?
>
> Yup, 6735931. Added the information about it working in Windows today too.
>
> Ross
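Confirming which driver is bound to the card is straightforward with standard tools; 11ab is Marvell's PCI vendor ID (visible in the device paths quoted earlier in this thread), so something along these lines:

    # prtconf -D | grep 11ab     # shows the device node and the driver bound to it
    # modinfo | grep marv        # shows the loaded marvell88sx module and version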
Peter Schultze
2009-Feb-11 22:46 UTC
[zfs-discuss] FW: Supermicro AOC-SAT2-MV8 hang when drive removed
> Yup, was an absolute nightmare to diagnose on top of everything else.
> It definitely doesn't happen in Windows either. I really want somebody to
> try snv_94 on a Thumper to see if you get the same behaviour there, or
> whether it's unique to Supermicro's Marvell card.

On a Thumper under S10U5 we recently had a hardware failure of one disk. This caused all I/O to the entire 46-disk pool to hang. zpool status commands were also hanging. Reset commands from the service processor timed out unsuccessfully. The system had to be power cycled manually. After that, booting took about 30 minutes. At this point the bad disk could be unconfigured with cfgadm and then hot swapped with a warranty replacement.

So it appears that bug 6735931 also affects the X4500 upon disk hardware failure, in a way that seriously impairs the entire system's fault tolerance. I would be willing to test any T-patch coming out soon...

I found this thread after seeing a total failure of a hot unplug of a 1.5TB disk from a (different) newly assembled system with 3 AOC-SAT2-MV8 cards and 24 disks + one hot spare. After removing one disk the entire system also froze, instead of initiating a resilver process with the hot spare. Clearly the marvell88sx driver cannot handle disk outages in any environment.
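For reference, the cfgadm hot-swap sequence Peter describes looks roughly like this. A sketch only: sata1/3 stands in for whatever attachment point cfgadm reports for the failed disk, and the pool and device names in the last step are equally illustrative:

    # cfgadm -al                       # identify the failed disk's attachment point
    # cfgadm -c unconfigure sata1/3    # detach the device at the driver level
      ...physically swap the drive...
    # cfgadm -c configure sata1/3      # bring the replacement online
    # zpool replace tank c2t3d0        # only if ZFS doesn't start the resilver itself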
Ross
2009-Feb-12 15:25 UTC
[zfs-discuss] FW: Supermicro AOC-SAT2-MV8 hang when drive removed
This sounds like exactly the kind of problem I've been shouting about for 6 months or more. I posted a huge thread on availability on these forums because I had concerns over exactly this kind of hanging.

ZFS doesn't trust hardware or drivers when it comes to your data - everything is checksummed. However, when it comes to seeing whether devices are responding, and checking for faults, it blindly trusts whatever the hardware or driver tells it. Unfortunately, that means ZFS is vulnerable to any unexpected bug or error in the storage chain. I've encountered at least two hang conditions myself (and I'm not exactly a heavy user), and I've seen several others on the forums, including a few on x4500s.

Now, I do accept that errors like this will be few and far between, but they still mean you have the risk that a badly handled error condition can hang your entire server, instead of just one drive. Solaris can handle things like CPUs or memory going faulty, for crying out loud. Its raid storage system had better be able to handle a disk failing.

Sun seem to be taking the approach that these errors should be dealt with in the driver layer. And while that's technically correct, a reliable storage system had damn well better be able to keep the server limping along while we wait for patches to the storage drivers.

ZFS absolutely needs an error handling layer between the volume manager and the devices. It needs to time out items that are not responding, and it needs to drop bad devices if they could cause problems elsewhere.

And yes, I'm repeating myself, but I can't understand why this is not being acted on. Right now the error checking appears to be such that if an unexpected, or badly handled, error condition occurs in the driver stack, the pool or server hangs, whereas the expected behavior would be for just one drive to fail. The absolute worst case scenario should be that an entire controller has to be taken offline (and I would hope that the controllers in an x4500 would be running separate instances of the driver software).

Not one of those conditions should be fatal. Good storage designs cope with them all, and good error handling at the ZFS layer is absolutely vital when you have projects like Comstar introducing more and more types of storage device for ZFS to work with.

Each extra type of storage introduces yet more software into the equation, and increases the risk of finding faults like this. While they will be rare, they should be expected, and ZFS should be designed to handle them.
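Until something like that exists inside ZFS, the closest approximation is a watchdog outside it. A minimal sketch with placeholder pool name and timeout; note Richard's earlier point that a process blocked on pool I/O with failmode=wait may not be killable, so the wrapper can only report the hang, not always clean it up:

    #!/bin/sh
    # Run 'zpool status -x' with a timeout so a hung pool can't hang the monitor.
    POOL=tank
    TIMEOUT=30
    OUT=/tmp/zstatus.$$

    zpool status -x $POOL > $OUT 2>&1 &
    PID=$!
    n=0
    while kill -0 $PID 2>/dev/null; do
        if [ $n -ge $TIMEOUT ]; then
            echo "WARNING: zpool status hung for ${TIMEOUT}s; pool $POOL suspect"
            kill -9 $PID 2>/dev/null    # may fail if blocked in the kernel
            break
        fi
        sleep 1
        n=`expr $n + 1`
    done
    cat $OUT
    rm -f $OUT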
Tim
2009-Feb-12 16:25 UTC
[zfs-discuss] FW: Supermicro AOC-SAT2-MV8 hang when drive removed
On Thu, Feb 12, 2009 at 9:25 AM, Ross <myxiplx at googlemail.com> wrote:

> This sounds like exactly the kind of problem I've been shouting about for
> 6 months or more. I posted a huge thread on availability on these forums
> because I had concerns over exactly this kind of hanging.
>
> ZFS doesn't trust hardware or drivers when it comes to your data -
> everything is checksummed. However, when it comes to seeing whether devices
> are responding, and checking for faults, it blindly trusts whatever the
> hardware or driver tells it. Unfortunately, that means ZFS is vulnerable to
> any unexpected bug or error in the storage chain. I've encountered at least
> two hang conditions myself (and I'm not exactly a heavy user), and I've seen
> several others on the forums, including a few on x4500s.
>
> Now, I do accept that errors like this will be few and far between, but
> they still mean you have the risk that a badly handled error condition can
> hang your entire server, instead of just one drive. Solaris can handle
> things like CPUs or memory going faulty, for crying out loud. Its raid
> storage system had better be able to handle a disk failing.
>
> Sun seem to be taking the approach that these errors should be dealt with
> in the driver layer. And while that's technically correct, a reliable
> storage system had damn well better be able to keep the server limping
> along while we wait for patches to the storage drivers.
>
> ZFS absolutely needs an error handling layer between the volume manager
> and the devices. It needs to time out items that are not responding, and it
> needs to drop bad devices if they could cause problems elsewhere.
>
> And yes, I'm repeating myself, but I can't understand why this is not
> being acted on. Right now the error checking appears to be such that if an
> unexpected, or badly handled, error condition occurs in the driver stack,
> the pool or server hangs, whereas the expected behavior would be for just
> one drive to fail. The absolute worst case scenario should be that an
> entire controller has to be taken offline (and I would hope that the
> controllers in an x4500 would be running separate instances of the driver
> software).
>
> Not one of those conditions should be fatal. Good storage designs cope
> with them all, and good error handling at the ZFS layer is absolutely vital
> when you have projects like Comstar introducing more and more types of
> storage device for ZFS to work with.
>
> Each extra type of storage introduces yet more software into the equation,
> and increases the risk of finding faults like this. While they will be
> rare, they should be expected, and ZFS should be designed to handle them.

I'd imagine for the exact same reason short-stroking/right-sizing isn't a concern.

"We don't have this problem in the 7000 series, perhaps you should buy one of those."

;)

--Tim
Ross Smith
2009-Feb-12 19:19 UTC
[zfs-discuss] FW: Supermicro AOC-SAT2-MV8 hang when drive removed
Heh, yeah, I've thought the same kind of thing in the past. The problem is that the argument doesn't really work for system admins.

As far as I'm concerned, the 7000 series is a new hardware platform, with relatively untested drivers, running a software solution that I know is prone to locking up when hardware faults are handled badly by drivers. Fair enough, that actual solution is out of our price range, but I would still be very dubious about purchasing it. At the very least I'd be waiting a year for other people to work the kinks out of the drivers.

Which is a shame, because ZFS has so many other great features it's easily our first choice for a storage platform. The one and only concern we have is its reliability. We have snv_106 running as a test platform now. If I felt I could trust ZFS 100% I'd roll it out tomorrow.

On Thu, Feb 12, 2009 at 4:25 PM, Tim <tim at tcsac.net> wrote:
> On Thu, Feb 12, 2009 at 9:25 AM, Ross <myxiplx at googlemail.com> wrote:
>> This sounds like exactly the kind of problem I've been shouting about for
>> 6 months or more. I posted a huge thread on availability on these forums
>> because I had concerns over exactly this kind of hanging.
>>
>> ZFS doesn't trust hardware or drivers when it comes to your data -
>> everything is checksummed. However, when it comes to seeing whether devices
>> are responding, and checking for faults, it blindly trusts whatever the
>> hardware or driver tells it. Unfortunately, that means ZFS is vulnerable to
>> any unexpected bug or error in the storage chain. I've encountered at least
>> two hang conditions myself (and I'm not exactly a heavy user), and I've seen
>> several others on the forums, including a few on x4500s.
>>
>> Now, I do accept that errors like this will be few and far between, but
>> they still mean you have the risk that a badly handled error condition can
>> hang your entire server, instead of just one drive. Solaris can handle
>> things like CPUs or memory going faulty, for crying out loud. Its raid
>> storage system had better be able to handle a disk failing.
>>
>> Sun seem to be taking the approach that these errors should be dealt with
>> in the driver layer. And while that's technically correct, a reliable
>> storage system had damn well better be able to keep the server limping
>> along while we wait for patches to the storage drivers.
>>
>> ZFS absolutely needs an error handling layer between the volume manager
>> and the devices. It needs to time out items that are not responding, and it
>> needs to drop bad devices if they could cause problems elsewhere.
>>
>> And yes, I'm repeating myself, but I can't understand why this is not
>> being acted on. Right now the error checking appears to be such that if an
>> unexpected, or badly handled, error condition occurs in the driver stack,
>> the pool or server hangs, whereas the expected behavior would be for just
>> one drive to fail. The absolute worst case scenario should be that an
>> entire controller has to be taken offline (and I would hope that the
>> controllers in an x4500 would be running separate instances of the driver
>> software).
>>
>> Not one of those conditions should be fatal. Good storage designs cope
>> with them all, and good error handling at the ZFS layer is absolutely vital
>> when you have projects like Comstar introducing more and more types of
>> storage device for ZFS to work with.
>>
>> Each extra type of storage introduces yet more software into the equation,
>> and increases the risk of finding faults like this. While they will be
>> rare, they should be expected, and ZFS should be designed to handle them.
>
> I'd imagine for the exact same reason short-stroking/right-sizing isn't a
> concern.
>
> "We don't have this problem in the 7000 series, perhaps you should buy one
> of those."
>
> ;)
>
> --Tim
Bob Friesenhahn
2009-Feb-12 23:16 UTC
[zfs-discuss] FW: Supermicro AOC-SAT2-MV8 hang when drive removed
On Thu, 12 Feb 2009, Ross Smith wrote:
> As far as I'm concerned, the 7000 series is a new hardware platform,

You are joking, right? Have you ever looked at the photos of these "new" systems, or compared them to other Sun systems? They are just re-purposed existing systems with a bit of extra secret sauce added.

Bob

=====================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Tim
2009-Feb-12 23:52 UTC
[zfs-discuss] FW: Supermicro AOC-SAT2-MV8 hang when drive removed
On Thu, Feb 12, 2009 at 5:16 PM, Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:

> On Thu, 12 Feb 2009, Ross Smith wrote:
> > As far as I'm concerned, the 7000 series is a new hardware platform,
>
> You are joking, right? Have you ever looked at the photos of these "new"
> systems, or compared them to other Sun systems? They are just re-purposed
> existing systems with a bit of extra secret sauce added.
>
> Bob

Ya, that *secret sauce* is what makes it a new system. And out of the last 4 x4240s I've ordered, two had to have new motherboards installed within a week, and one had to have a new power supply. The other appears to have a DVD-ROM drive going flaky. So the fact they're based on existing hardware isn't exactly confidence-inspiring either.

Sun's old SPARC gear: rock solid. The newer x64 gear has been leaving a bad taste in my mouth, TBQH. The engineering behind the systems when I open them up is absolutely phenomenal. The failure rate, however, is downright scary.

--Tim
Aaron Brady
2009-Oct-13 13:54 UTC
[zfs-discuss] FW: Supermicro AOC-SAT2-MV8 hang when drive removed
All's gone quiet on this issue, and the bug is closed, but I'm having exactly the same problem: pulling a disk on this card, under OpenSolaris build 111, pauses all I/O (including, weirdly, network I/O), and using the ZFS utilities (zfs list, zpool list, zpool status) causes a hang until I replace the disk.
Tim Cook
2009-Oct-13 14:20 UTC
[zfs-discuss] FW: Supermicro AOC-SAT2-MV8 hang when drive removed
On Tue, Oct 13, 2009 at 8:54 AM, Aaron Brady <bradya at gmail.com> wrote:

> All's gone quiet on this issue, and the bug is closed, but I'm having
> exactly the same problem: pulling a disk on this card, under OpenSolaris
> build 111, pauses all I/O (including, weirdly, network I/O), and using the
> ZFS utilities (zfs list, zpool list, zpool status) causes a hang until I
> replace the disk.

Did you set your failmode to continue?

--Tim
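For anyone following along: failmode is a per-pool property documented in zpool(1M). Checking and changing it looks like this, with an illustrative pool name:

    # zpool get failmode tank
    NAME  PROPERTY  VALUE  SOURCE
    tank  failmode  wait   default
    # zpool set failmode=continue tank

With failmode=continue, blocked I/O to a failed pool returns EIO instead of waiting indefinitely, which is why it is the first thing to check when commands hang after a device removal.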
Ross
2009-Oct-13 14:28 UTC
[zfs-discuss] FW: Supermicro AOC-SAT2-MV8 hang when drive removed
Hi Tim, that doesn't help in this case - it's a complete lockup apparently caused by driver issues.

However, the good news for Insom is that the bug is closed because the problem now appears fixed. I tested it and found that it's no longer occurring in OpenSolaris 2008.11 or 2009.06. If you move to a newer build of OpenSolaris you should be fine.
Aaron Brady
2009-Oct-13 14:42 UTC
[zfs-discuss] FW: Supermicro AOC-SAT2-MV8 hang when drive removed
I did, but as tcook suggests running a later build, I'll try an image-update (though 111 > 2008.11, right?)
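On OpenSolaris of that era, an image-update is the standard way to move to a newer build; a minimal sketch (it creates and activates a new boot environment, so a reboot follows):

    $ pfexec pkg refresh --full    # refresh the package catalogues
    $ pfexec pkg image-update      # download and build the new boot environment
    $ pfexec reboot                # boot into the updated environment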
Tim Cook
2009-Oct-13 14:47 UTC
[zfs-discuss] FW: Supermicro AOC-SAT2-MV8 hang when drive removed
On Tue, Oct 13, 2009 at 9:42 AM, Aaron Brady <bradya at gmail.com> wrote:

> I did, but as tcook suggests running a later build, I'll try an
> image-update (though 111 > 2008.11, right?)

It should be, yes. b111 was released in April of 2009.

--Tim
Aaron Brady
2009-Oct-14 13:56 UTC
[zfs-discuss] FW: Supermicro AOC-SAT2-MV8 hang when drive removed
Well, I upgraded to b124, disabling ACPI because of [1], and I get exactly the same behaviour. I've removed the device from the zpool, and tried dd-ing from the device while I remove it; it still hangs all IO on the system until the disk is re-inserted.

I'm running the kernel with -v (from diagnosing the ACPI issue) and nothing enlightening is printed in dmesg.

1: http://defect.opensolaris.org/bz/show_bug.cgi?id=11739
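The raw dd read is a useful way to take ZFS out of the picture, since it exercises only the sd and marvell88sx layers. Something like the following, where the device name is a placeholder for the disk being pulled; on a healthy driver the dd should die with an I/O error within seconds of the pull, instead of hanging all system I/O:

    # dd if=/dev/rdsk/c2t3d0p0 of=/dev/null bs=1048576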
Ross
2009-Oct-14 16:01 UTC
[zfs-discuss] FW: Supermicro AOC-SAT2-MV8 hang when drive removed
What are you running there, snv or OpenSolaris? Could you try an OpenSolaris 2009.06 live disc and boot directly from that?

Once I was running that build, every single hot plug I tried worked flawlessly. I tried for several hours to replicate the problems that caused me to log that bug report, but the issue appeared completely resolved.