thr3ads.net - zfs discuss - [zfs-discuss] chgrp -R hangs all writes to pool [Jul 2007]

Stuart Anderson

2007-Jul-17 04:36 UTC

[zfs-discuss] chgrp -R hangs all writes to pool

Running Solaris 10 Update 3 on an X4500, I have found that it is possible
to reproducibly block all writes to a ZFS pool by running "chgrp -R"
on any large filesystem in that pool.  As can be seen in the zpool
iostat output below, after about 10 seconds of running the chgrp command all
writes to the pool stop, and the pool starts exclusively running a slow
background task of 1kB reads.

At this point the chgrp -R command cannot be killed by root with kill -9,
and in fact even the command "halt -d" does not do anything.

In at least one instance I have seen the chgrp command eventually
respond to the kill command after ~30 minutes, after which the pool was
writable again. However, while waiting for this to happen, the
kernel was generating "No more processes." when simple commands,
e.g., uname or uptime, were run in pre-existing shells.


# zpool iostat test 2
               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
...
test        1.12T  19.2T      1  1.72K  11.2K   220M
test        1.12T  19.2T      0  3.10K      0   380M
test        1.12T  19.2T      0    335      0  41.9M
test        1.12T  19.2T      0  4.49K      0   559M
test        1.12T  19.2T      0      0      0      0
test        1.12T  19.2T      0  1.51K      0   193M
test        1.12T  19.2T      0  3.31K      0   408M
test        1.12T  19.2T      0      0      0      0
test        1.12T  19.2T      0  3.54K      0   453M
test        1.13T  19.2T    428  1.17K  1.82M   129M
*** Started chgrp -R ***
test        1.13T  19.2T  1.74K  2.21K  7.19M   282M
test        1.13T  19.2T    531  2.49K  2.34M   300M
test        1.13T  19.2T    549  1.67K  2.96M   213M
test        1.13T  19.2T    395  3.00K  2.38M   368M
test        1.13T  19.2T    343      0  1.66M      0
test        1.13T  19.2T    113      0   113K      0
test        1.13T  19.2T    132      0   132K      0
test        1.13T  19.2T    136      0   137K      0
test        1.13T  19.2T    132      0   132K      0
test        1.13T  19.2T    148      0   149K      0
test        1.13T  19.2T    137      0   138K      0
test        1.13T  19.2T    163      0   163K      0
test        1.13T  19.2T    152      0   153K      0
...
*** All writes to this pool are hung for some long period of time. ***
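
Editorial sketch: for anyone who wants to find the stall point in a longer capture like the one above, here is a minimal Python example. It assumes the seven-column row layout shown (pool, used, avail, read ops, write ops, read bandwidth, write bandwidth) and treats the K/M/G/T suffixes as binary multipliers, which is an assumption about how zpool iostat scales its figures. The function names and the min_run threshold are hypothetical, not part of any ZFS tool.

```python
import re

# Assumed multipliers for the K/M/G/T suffixes printed by zpool iostat.
_SUFFIX = {"": 1, "K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_value(text):
    """Convert a figure such as '1.72K' or '335' to a float."""
    match = re.fullmatch(r"([0-9.]+)([KMGT]?)", text)
    if match is None:
        raise ValueError(f"unrecognised value: {text!r}")
    return float(match.group(1)) * _SUFFIX[match.group(2)]


def parse_samples(lines, pool):
    """Yield (read_ops, write_ops) for each data row belonging to `pool`.

    Header rows, separators, and annotations are skipped because they
    either have a different field count or do not start with the pool name.
    """
    for line in lines:
        fields = line.split()
        if len(fields) == 7 and fields[0] == pool:
            yield parse_value(fields[3]), parse_value(fields[4])


def find_write_stall(samples, min_run=3):
    """Return the index of the first sample that starts a run of at least
    `min_run` consecutive samples with zero writes but non-zero reads,
    or None if there is no such run."""
    run_start = None
    for i, (reads, writes) in enumerate(samples):
        if writes == 0 and reads > 0:
            if run_start is None:
                run_start = i
            if i - run_start + 1 >= min_run:
                return run_start
        else:
            run_start = None
    return None
```

Requiring non-zero reads keeps the idle intervals before the chgrp (rows where every column is 0) from being reported as a stall. Calling find_write_stall(list(parse_samples(lines, "test"))) on the rows above should point at the first zero-write row after the chgrp was started.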


Here is the pool configuration:

# zpool status
  pool: test
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        test        ONLINE       0     0     0
          raidz2    ONLINE       0     0     0
            c7t1d0  ONLINE       0     0     0
            c1t0d0  ONLINE       0     0     0
            c5t0d0  ONLINE       0     0     0
            c6t1d0  ONLINE       0     0     0
            c7t0d0  ONLINE       0     0     0
            c0t7d0  ONLINE       0     0     0
            c0t3d0  ONLINE       0     0     0
            c1t3d0  ONLINE       0     0     0
            c5t3d0  ONLINE       0     0     0
          raidz2    ONLINE       0     0     0
            c6t3d0  ONLINE       0     0     0
            c7t3d0  ONLINE       0     0     0
            c8t3d0  ONLINE       0     0     0
            c0t2d0  ONLINE       0     0     0
            c1t2d0  ONLINE       0     0     0
            c5t2d0  ONLINE       0     0     0
            c6t2d0  ONLINE       0     0     0
            c7t2d0  ONLINE       0     0     0
            c8t2d0  ONLINE       0     0     0
          raidz2    ONLINE       0     0     0
            c0t5d0  ONLINE       0     0     0
            c1t5d0  ONLINE       0     0     0
            c5t5d0  ONLINE       0     0     0
            c7t5d0  ONLINE       0     0     0
            c8t5d0  ONLINE       0     0     0
            c0t4d0  ONLINE       0     0     0
            c1t4d0  ONLINE       0     0     0
            c5t4d0  ONLINE       0     0     0
            c0t6d0  ONLINE       0     0     0
          raidz2    ONLINE       0     0     0
            c7t4d0  ONLINE       0     0     0
            c8t4d0  ONLINE       0     0     0
            c8t0d0  ONLINE       0     0     0
            c1t7d0  ONLINE       0     0     0
            c5t7d0  ONLINE       0     0     0
            c6t7d0  ONLINE       0     0     0
            c7t7d0  ONLINE       0     0     0
            c8t7d0  ONLINE       0     0     0
            c6t5d0  ONLINE       0     0     0
          raidz2    ONLINE       0     0     0
            c1t6d0  ONLINE       0     0     0
            c5t6d0  ONLINE       0     0     0
            c6t6d0  ONLINE       0     0     0
            c7t6d0  ONLINE       0     0     0
            c8t6d0  ONLINE       0     0     0
            c0t1d0  ONLINE       0     0     0
            c1t1d0  ONLINE       0     0     0
            c5t1d0  ONLINE       0     0     0
            c0t0d0  ONLINE       0     0     0
        spares
          c8t1d0    AVAIL   

errors: No known data errors


There is nothing in the output of dmesg, svcs -xv, or fmdump associated
with this event.

Is this a known issue or should I open a new case with Sun?


Thanks.


-- 
Stuart Anderson  anderson at ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

James C. McPherson

2007-Jul-17 04:49 UTC

Stuart Anderson wrote:
> Running Solaris 10 Update 3 on an X4500 I have found that it is possible
> to reproducibly block all writes to a ZFS pool by running "chgrp -R"
> on any large filesystem in that pool.  As can be seen below in the zpool
> iostat output below, after about 10-sec of running the chgrp command all
> writes to the pool stop, and the pool starts exclusively running a slow
> background task of 1kB reads.
> 
> At this point the chgrp -R command is not killable via root kill -9,
> and in fact even the command "halt -d" does not do anything.
> 
> In at least one instance I have seen the chgrp command eventually
> respond to the kill command after ~30 minutes, and the pool was
> writable again. However, while waiting for this to happen the
> kernel was generating "No more processes." when simple commands
> were attempted to be run in pre-existing shells, e.g., uname or uptime.
> ...
> There is nothing in the output of dmesg, svcs -xv, or fmdump associated
> with this event.
> 
> Is this a known issue or should I open a new case with Sun?

Log a new case with Sun, and make sure you supply
a crash dump so people who know ZFS can analyze
the issue.

You can use <stop-A> sync, <break> sync, or

reboot -dq




cheers,
James C. McPherson
--
Solaris kernel software engineer
Sun Microsystems

Stuart Anderson

2007-Jul-17 04:58 UTC

On Tue, Jul 17, 2007 at 02:49:08PM +1000, James C. McPherson wrote:
> Stuart Anderson wrote:
> >Running Solaris 10 Update 3 on an X4500 I have found that it is possible
> >to reproducibly block all writes to a ZFS pool by running "chgrp -R"
> >on any large filesystem in that pool.  As can be seen below in the zpool
> >iostat output below, after about 10-sec of running the chgrp command all
> >writes to the pool stop, and the pool starts exclusively running a slow
> >background task of 1kB reads.
> >
...
> >
> >Is this a known issue or should I open a new case with Sun?
> 
> Log a new case with Sun, and make sure you supply
> a crash dump so people who know ZFS can analyze
> the issue.
> 
> You can use <stop-A> sync, <break> sync, or
> 
> reboot -dq
> 

In previous attempts, neither "halt -d" nor reboot (with no arguments)
was able to shut down the machine. Is "reboot -dq" really a bigger hammer
than "halt -d"?

Sorry to be pedantic, but what is the exact key sequence on a Sun
USB keyboard one should use to force a kernel dump on Solx86?
Since there is no OBP on an X4500 where do I type the sync command?

Thanks.

-- 
Stuart Anderson  anderson at ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

James C. McPherson

2007-Jul-17 05:08 UTC

Stuart Anderson wrote:
> On Tue, Jul 17, 2007 at 02:49:08PM +1000, James C. McPherson wrote:
>> Stuart Anderson wrote:
>>> Running Solaris 10 Update 3 on an X4500 I have found that it is possible
>>> to reproducibly block all writes to a ZFS pool by running "chgrp -R"
>>> on any large filesystem in that pool.  As can be seen below in the zpool
>>> iostat output below, after about 10-sec of running the chgrp command all
>>> writes to the pool stop, and the pool starts exclusively running a slow
>>> background task of 1kB reads.
>>> Is this a known issue or should I open a new case with Sun?
>> Log a new case with Sun, and make sure you supply
>> a crash dump so people who know ZFS can analyze
>> the issue.
>>
>> You can use <stop-A> sync, <break> sync, or
>>
>> reboot -dq
>>
> 
> In previous attempts, neither "halt -d" nor reboot (with no arguments)
> was able to shut down the machine. Is "reboot -dq" really a bigger hammer
> than "halt -d"?

Kindasorta - the q option tells reboot to do its stuff with
all guns blazing, as it were.

> Sorry to be pedantic, but what is the exact key sequence on a Sun
> USB keyboard one should use to force a kernel dump on Solx86?
> Since there is no OBP on an X4500 where do I type the sync command?

First, either boot with "-k" or, shortly after you get to
multiuser, run "mdb -K" on the console (and hit :c <enter>).

Then you can use <F1>A to drop to kmdb, and then run

::systemdump

or

0>rip
:c
:c

or for 32bit mode

0>eip
:c
:c


cheers,
James C. McPherson
--
Solaris kernel software engineer
Sun Microsystems

Rayson Ho

2007-Jul-17 05:42 UTC

I found a very nice doc. that describes the steps to create a kernel dump:

"The Solaris Operating System on x86 Platforms - Crashdump Analysis
Operating System Internals"

http://opensolaris.org/os/community/documentation/files/book.pdf

-> 7.2.2. Forcing system crashdumps

Rayson



On 7/17/07, James C. McPherson <James.McPherson at sun.com> wrote:
> first, either boot with "-k" or shortly after you get to
> multiuser, run "mdb -K" on the console (and hit :c <enter>).
>
> Then you can use <F1>A to drop to kmdb, and then run
>
> ::systemdump
>
> or
>
> 0>rip
> :c
> :c
>
> or for 32bit mode
>
> 0>eip
> :c
> :c
>
>
> cheers,
> James C. McPherson
> --
> Solaris kernel software engineer
> Sun Microsystems

Joshua.Goodall at editure.com

2007-Jul-17 07:49 UTC

zfs-discuss-bounces at opensolaris.org wrote on 17/07/2007 02:36:06 PM:
> Running Solaris 10 Update 3 on an X4500 I have found that it is possible
> to reproducibly block all writes to a ZFS pool by running "chgrp -R"
> on any large filesystem in that pool.  As can be seen below in the zpool
> iostat output below, after about 10-sec of running the chgrp command all
> writes to the pool stop, and the pool starts exclusively running a slow
> background task of 1kB reads.

Related or not, I can hang all reads on an nv_65 zpool simply by dd'ing a
5GB server image to a zvol.

- JG




Stuart Anderson

2007-Jul-17 19:40 UTC

On Tue, Jul 17, 2007 at 03:08:44PM +1000, James C. McPherson wrote:
> >>Log a new case with Sun, and make sure you supply
> >>a crash dump so people who know ZFS can analyze
> >>the issue.
> >>
> >>You can use <stop-A> sync, <break> sync, or
> >>
> >>reboot -dq
> >>

That does appear to have caused a panic/kernel dump. However, I cannot
find the dump image after rebooting to Solaris even though savecore
appears to be configured:

# reboot -dq
Jul 17 12:27:35 x4500gc reboot: rebooted by root

panic[cpu2]/thread=ffffffff9823c460: forced crash dump initiated at user request

fffffe8000e18d60 genunix:kadmin+4b4 ()
fffffe8000e18ec0 genunix:uadmin+93 ()
fffffe8000e18f10 unix:sys_syscall32+101 ()

syncing file systems... 1 1 done
dumping to /dev/md/dsk/d2, offset 3436511232, content: kernel
100% done: 3268790 pages dumped, compression ratio 12.39, dump succeeded
rebooting...


# dumpadm
      Dump content: kernel pages
       Dump device: /dev/md/dsk/d2 (swap)
Savecore directory: /var/crash/x4500gc
  Savecore enabled: yes

# ls -laR /var/crash/x4500gc/
/var/crash/x4500gc/:
total 2
drwx------  2 root root 512 Jul 12 16:26 .
drwxr-xr-x  3 root root 512 Jul 12 16:26 ..


Thanks.


-- 
Stuart Anderson  anderson at ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

Stuart Anderson

2007-Jul-17 20:04 UTC

It looks like there is a problem dumping a kernel panic on an X4500.
During the self-induced panic, there were additional syslog messages
that indicate a problem writing to the two disks that make up
/dev/md/dsk/d2 in my case.  It is as if the SATA controllers are being
reset during the crash dump.

At any rate I will send this all to Sun support.

Thanks.


Jul 17 12:27:35 x4500gc unix: [ID 836849 kern.notice]
Jul 17 12:27:35 x4500gc ^Mpanic[cpu2]/thread=ffffffff9823c460:
Jul 17 12:27:35 x4500gc genunix: [ID 156897 kern.notice] forced crash dump initiated at user request
Jul 17 12:27:35 x4500gc unix: [ID 100000 kern.notice]
Jul 17 12:27:35 x4500gc genunix: [ID 655072 kern.notice] fffffe8000e18d60 genunix:kadmin+4b4 ()
Jul 17 12:27:35 x4500gc genunix: [ID 655072 kern.notice] fffffe8000e18ec0 genunix:uadmin+93 ()
Jul 17 12:27:35 x4500gc genunix: [ID 655072 kern.notice] fffffe8000e18f10 unix:sys_syscall32+101 ()
Jul 17 12:27:35 x4500gc unix: [ID 100000 kern.notice]
Jul 17 12:27:35 x4500gc genunix: [ID 672855 kern.notice] syncing file systems...
Jul 17 12:27:35 x4500gc genunix: [ID 733762 kern.notice]  1
Jul 17 12:27:37 x4500gc last message repeated 1 time
Jul 17 12:27:38 x4500gc genunix: [ID 904073 kern.notice]  done
Jul 17 12:27:39 x4500gc genunix: [ID 111219 kern.notice] dumping to /dev/md/dsk/d2, offset 3436511232, content: kernel
Jul 17 12:27:39 x4500gc marvell88sx: [ID 812950 kern.warning] WARNING: marvell88sx3: error on port 0:
Jul 17 12:27:39 x4500gc marvell88sx: [ID 517869 kern.info]      device disconnected
Jul 17 12:27:39 x4500gc marvell88sx: [ID 517869 kern.info]      device connected
Jul 17 12:27:39 x4500gc marvell88sx: [ID 517869 kern.info]      SError interrupt
Jul 17 12:27:39 x4500gc marvell88sx: [ID 131198 kern.info]      SErrors:
Jul 17 12:27:39 x4500gc marvell88sx: [ID 517869 kern.info]              Recovered communication error
Jul 17 12:27:39 x4500gc marvell88sx: [ID 517869 kern.info]              PHY ready change
Jul 17 12:27:39 x4500gc marvell88sx: [ID 517869 kern.info]              10-bit to 8-bit decode error
Jul 17 12:27:39 x4500gc marvell88sx: [ID 517869 kern.info]              Disparity error
Jul 17 12:27:39 x4500gc marvell88sx: [ID 812950 kern.warning] WARNING: marvell88sx3: error on port 4:
Jul 17 12:27:39 x4500gc marvell88sx: [ID 517869 kern.info]      device disconnected
Jul 17 12:27:39 x4500gc marvell88sx: [ID 517869 kern.info]      device connected
Jul 17 12:27:39 x4500gc marvell88sx: [ID 517869 kern.info]      SError interrupt
Jul 17 12:27:39 x4500gc marvell88sx: [ID 131198 kern.info]      SErrors:
Jul 17 12:27:39 x4500gc marvell88sx: [ID 517869 kern.info]              Recovered communication error
Jul 17 12:27:39 x4500gc marvell88sx: [ID 517869 kern.info]              PHY ready change
Jul 17 12:27:39 x4500gc marvell88sx: [ID 517869 kern.info]              10-bit to 8-bit decode error
Jul 17 12:27:39 x4500gc marvell88sx: [ID 517869 kern.info]              Disparity error
Jul 17 12:28:39 x4500gc genunix: [ID 409368 kern.notice] ^M100% done: 3268790 pages dumped, compression ratio 12.39,
Jul 17 12:28:39 x4500gc genunix: [ID 851671 kern.notice] dump succeeded
Jul 17 12:30:38 x4500gc genunix: [ID 540533 kern.notice] ^MSunOS Release 5.10 Version Generic_125101-10 64-bit
Jul 17 12:30:38 x4500gc genunix: [ID 943907 kern.notice] Copyright 1983-2007 Sun Microsystems, Inc.  All rights reserved.




-- 
Stuart Anderson  anderson at ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

Robert Milkowski

2007-Jul-18 18:25 UTC

Hello Stuart,

  Looks like the crash dump went OK.
  After the system boots up again, check the logs for a warning that
  there is not enough space in /var/crash/x4500gc to save the crash dump.
  When using ZFS on a file server, crash dumps will usually be almost
  the size of the server's memory...

  Alternatively, just run 'savecore path_to_dir' where path_to_dir is a
  path to a directory with enough free space.
  Of course, this assumes you haven't touched the swap device up to this time.
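
Editorial sketch: to put rough numbers on the space question, the following Python snippet estimates the dump size from the figures in the panic output quoted earlier in the thread (3268790 pages, compression ratio 12.39). It assumes a 4 KiB base page size, which is the usual value on x86 but is not stated anywhere in the thread, and the function name is made up for illustration. Whether savecore needs the uncompressed or the compressed amount of free space is also not stated here, so both figures are computed.

```python
PAGE_SIZE = 4096  # assumed x86 base page size in bytes; not stated in the thread


def dump_size_estimate(pages, compression_ratio, page_size=PAGE_SIZE):
    """Return (uncompressed_bytes, compressed_bytes) for a crash dump.

    `pages` and `compression_ratio` are the two figures printed on the
    "100% done" line of the panic output.
    """
    uncompressed = pages * page_size
    compressed = uncompressed / compression_ratio
    return uncompressed, compressed


if __name__ == "__main__":
    raw, packed = dump_size_estimate(3268790, 12.39)
    gib = 1024 ** 3
    print(f"uncompressed: {raw / gib:.1f} GiB")
    print(f"on dump device: {packed / gib:.1f} GiB")
```

Under those assumptions the dump is roughly 12.5 GiB uncompressed and about 1 GiB as written to the dump device, so it is worth comparing both numbers against the free space in the savecore directory before running savecore by hand.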




-- 
Best regards,
 Robert Milkowski                      mailto:rmilkowski at task.gda.pl
                                       http://milek.blogspot.com

                                       

Stuart Anderson

2007-Oct-04 22:12 UTC

On Mon, Jul 16, 2007 at 09:36:06PM -0700, Stuart Anderson wrote:
> Running Solaris 10 Update 3 on an X4500 I have found that it is possible
> to reproducibly block all writes to a ZFS pool by running "chgrp -R"
> on any large filesystem in that pool.  As can be seen below in the zpool
> iostat output below, after about 10-sec of running the chgrp command all
> writes to the pool stop, and the pool starts exclusively running a slow
> background task of 1kB reads.
> 
> At this point the chgrp -R command is not killable via root kill -9,
> and in fact even the command "halt -d" does not do anything.
> 

For posterity, this appears to have been fixed in S10U4; at least I am
unable to reproduce the problem that was easy to trigger with S10U3.

Thanks.

-- 
Stuart Anderson  anderson at ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson
