Dan Price
2006-Oct-26 08:30 UTC
[zfs-discuss] experiences with zpool errors and glm flipouts
Tonight I've been moving some of my personal data around on my desktop system and have hit some on-disk corruption. As you may know, I'm cursed, and so this had a high probability of ending badly.

I have two SCSI disks and use live upgrade, and I have a partition, /aux0, where I tend to keep personal stuff. This is on an SB2500 running snv_46. The upshot is that I have a slice 7 on each of two disks; all of my data was on one of these slices. So, I turned the other slice into a zpool, and copied the data from the UFS slice to the zpool. Then I tried to attach the UFS slice to the zpool, in order to form a mirror.

The resilver kicked off but eventually ground my machine to a halt (the last I saw was 77% completed), and I was getting a ton of these errors:

  scsi: WARNING: /pci@1d,700000/scsi@4 (glm0):
          Resetting scsi bus, got incorrect phase from (1,0)
  genunix: NOTICE: glm0: fault detected in device; service still available
  genunix: NOTICE: glm0: Resetting scsi bus, got incorrect phase from (1,0)
  scsi: WARNING: /pci@1d,700000/scsi@4 (glm0):
          got SCSI bus reset
  genunix: NOTICE: glm0: fault detected in device; service still available
  genunix: NOTICE: glm0: got SCSI bus reset
  scsi: WARNING: /pci@1d,700000/scsi@4/sd@1,0 (sd11):
          auto request sense failed (reason=reset)

Eventually I had to drive in to work to reboot the machine, although the system did not tip over. After a reboot to single user mode, the same symptoms recurred (since it seems that the resilver kicked off again... and at a certain stage hit this problem over again). The only recourse was to reboot to single user mode, rapidly log in, and detach the problem-causing side of the mirror. This led me to suggestion #1:

  - It'd be nice if auto-resilvering did not kick off until
    sometime after we leave single user mode.

So I don't know what might be causing glm to flip out.

Next, I did a scrub of the one slice in my pool and got this:

  ...
  errors: The following persistent errors have been detected:

            DATASET   OBJECT  RANGE
            dp_stuff  42073   917504-1048576
            dp_stuff  42073   1048576-1179648

This is awesome. I can pinpoint any corruption, which is great. But... So this may be a stupid question, but it's unclear how to locate the object in question. I did a find -inum 42073, which located some help.jar file in a copy of netbeans I have in the zpool. If that's all I've lost, then hooray!

But I wasn't sure if that was the right thing to do. It'd be great if the documentation was clearer on this point:

  http://docs.sun.com/app/docs/doc/819-5461/6n7ht6qt1?a=view#gbcuz

Just says to try 'rm' on "the file" but does not mention how to locate it.

I'd appreciate any thoughts on how to resolve the glm bus reset issue... Thanks!

        -dp

--
Daniel Price - Solaris Kernel Engineering - dp@eng.sun.com - blogs.sun.com/dp
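For reference, Dan's recovery sequence maps onto roughly the following commands. This is a minimal sketch: the pool name dp_stuff comes from the scrub output above, but the slice name (c1t1d0s7) and the /dp_stuff mountpoint are assumptions, not details from the thread.

  # Detach the misbehaving side of the mirror (slice name assumed):
  zpool detach dp_stuff c1t1d0s7

  # Scrub the surviving slice and report any persistent errors:
  zpool scrub dp_stuff
  zpool status -v dp_stuff

  # Map a reported OBJECT number back to a pathname. For ordinary
  # files the ZFS object number is the file's inode number, which is
  # why 'find -inum' works (mountpoint assumed):
  find /dp_stuff -inum 42073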
Eric Schrock
2006-Oct-26 16:11 UTC
[zfs-discuss] experiences with zpool errors and glm flipouts
On Thu, Oct 26, 2006 at 01:30:46AM -0700, Dan Price wrote:
>
>   scsi: WARNING: /pci@1d,700000/scsi@4 (glm0):
>           Resetting scsi bus, got incorrect phase from (1,0)
>   genunix: NOTICE: glm0: fault detected in device; service still available
>   genunix: NOTICE: glm0: Resetting scsi bus, got incorrect phase from (1,0)
>   scsi: WARNING: /pci@1d,700000/scsi@4 (glm0):
>           got SCSI bus reset
>   genunix: NOTICE: glm0: fault detected in device; service still available
>   genunix: NOTICE: glm0: got SCSI bus reset
>   scsi: WARNING: /pci@1d,700000/scsi@4/sd@1,0 (sd11):
>           auto request sense failed (reason=reset)
>
> Eventually I had to drive in to work to reboot the machine, although
> the system did not tip over. After a reboot to single user mode, the
> same symptoms recurred (since it seems that the resilver kicked off
> again... and at a certain stage hit this problem over again).

This is where the next phase of ZFS/FMA interoperability (which I've been sketching out for a while and am starting to work on now) will come in handy. Currently, ZFS will drive on forever even if a disk is arbitrarily misbehaving. In this case, it caused the scrub to grind to a halt (it was likely making progress, just very slowly). In the future ZFS/FMA world, the number of errors on the device would have exceeded an appropriate threshold (via a SERD engine) and the device would have been placed into the 'FAULTED' state. The scrub would have finished, you would have a nice FMA message on your console, and one of the drives would have been faulted. There are a lot of subtleties here, particularly w.r.t. other I/O FMA work, but we're making some progress.

> The only recourse was to reboot to single user mode, rapidly log in, and
> detach the problem-causing side of the mirror. This led me to
> suggestion #1:
>
>   - It'd be nice if auto-resilvering did not kick off until
>     sometime after we leave single user mode.

This isn't completely straightforward, but obviously doable. It's also unclear if this is just a temporary stopgap in lieu of a complete FMA solution. Please file an RFE anyway so that the problem is recorded somewhere.

> This is awesome. I can pinpoint any corruption, which is great.
> But... So this may be a stupid question, but it's unclear how to
> locate the object in question.

See:

  6410433 'zpool status -v' would be more useful with filenames

> I did a find -inum 42073, which located some help.jar file in a copy
> of netbeans I have in the zpool. If that's all I've lost, then
> hooray!
>
> But I wasn't sure if that was the right thing to do. It'd be great if
> the documentation was clearer on this point:
>
>   http://docs.sun.com/app/docs/doc/819-5461/6n7ht6qt1?a=view#gbcuz
>
> Just says to try 'rm' on "the file" but does not mention how to
> locate it.

Yeah, partly because there is no good way ;-) We want the answer to be 6410433, but even then there are tricky edge conditions (such as directories and dnode corruption) that can't simply be removed because they reference arbitrary amounts of metadata. The documentation can be improved in the meantime (to mention 'find -inum' at the very least), but we really need to sit down again and think about how we want the user experience to be when dealing with corruption.

- Eric

--
Eric Schrock, Solaris Kernel Development       http://blogs.sun.com/eschrock
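Once the ZFS/FMA integration Eric describes is in place, a device faulted via a SERD engine should surface through the standard Solaris fault-management tools rather than only as console noise. A sketch of inspecting such a diagnosis with the FMA utilities that already exist (the ZFS-specific diagnosis itself is the future work being described, not something these commands showed at the time):

  # List resources FMA has diagnosed as faulty:
  fmadm faulty

  # Show the fault events in the fault log, verbosely:
  fmdump -v

  # Show the raw error reports (ereports) feeding the diagnosis:
  fmdump -e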
Daniel B. Price
2006-Oct-27 00:55 UTC
[zfs-discuss] Re: experiences with zpool errors and glm flipouts
(For some reason I never actually got this as an email; maybe because I'm not subscribed to zfs-discuss?)

Thanks, Eric. So do you guys have any suspicions about what is actually failing here? Is it my drives, or the glm chip? or both? I was wondering whether new drives were going to help. I'll give this experiment another try once I've upgraded to B51.

Thanks

        -dp

This message posted from opensolaris.org