hi, a while ago i posted about a problem that occurred, apparently because of ZFS: http://www.opensolaris.org/jive/thread.jspa?threadID=3704

after this one instance, the problem did not recur, so i assumed it was a fluke of some kind. a few days ago, i copied around eight several-MB files from one filesystem to another. upon realising that i had misnamed one, i attempted to "mv" it. however, the mv did not complete, and vmstat showed the same problem as before.

this time, i saved a crash dump (with savecore -L) that i can supply if it's useful, although since it would take rather a while to upload, i'm also happy to run mdb commands on it.

between the two occurrences, i moved the two disks in question to a different system, so i don't think it's a hardware issue, although it could be an oddity of the disks...

k.
James C. McPherson
2005-Dec-28 22:00 UTC
[zfs-discuss] high system cpu use/lock contention (redux)
kate wrote:
...
> this time, i saved a crash dump (with savecore -L) that i can supply if it's
> useful, although since it would take rather a while to upload, i'm also
> happy to run mdb commands on it.

Hi Kate,

for starters, if you could run

::status
::msgbuf
*panic_thread::findstack -v

If you can compress the crashdump and make it available somewhere, then
I/somebody can pull it in and analyse it further.

thanks in advance,
James C. McPherson
--
Solaris Datapath Engineering
Data Management Group
Sun Microsystems
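For reference, a minimal sketch of how those dcmds get run, assuming savecore left unix.9 and vmcore.9 in the current directory (the .9 suffix is just the dump number on this particular system):

$ mdb unix.9 vmcore.9
> ::status
> ::msgbuf
> *panic_thread::findstack -v
> $q

::status reports which dump was loaded and how it was taken, ::msgbuf replays the kernel message buffer, and $q exits mdb.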
hi james :-)

James C. McPherson:
JC> for starters, if you could run
JC> ::status

debugging crash dump ./vmcore.9 (32-bit) from hyacinth.local
operating system: 5.11 brandz_release (i86pc)
panic message:
dump content: kernel pages only

JC> ::msgbuf
MESSAGE
WARNING: gld_start: gldm_send failed -1
WARNING: rtls0: send fail--CHIP ERROR!
<similar messages elided>
pseudo-device: devinfo0
devinfo0 is /pseudo/devinfo@0
pseudo-device: devinfo0
devinfo0 is /pseudo/devinfo@0
dump on /dev/dsk/c2d0s1 size 1589 MB
dumping to /dev/dsk/c2d0s1, offset 65536, content: kernel

JC> *panic_thread::findstack -v
mdb: couldn't read thread at 0

i assume the last is because it's from a running system rather than an
actual crash dump.

JC> If you can compress the crashdump and make it available somewhere, then
JC> I/somebody can pull it in and analyse it further.

(sent privately...)

k.
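That assumption is right: savecore -L captures a dump from the live, running system, so no panic has occurred and the panic_thread pointer is NULL, which is why mdb complains it "couldn't read thread at 0". On a live dump the interesting threads have to be found by hand; one sketch, using the stock ::threadlist dcmd (with -v it prints a stack for every kernel thread, which can then be filtered down to the stuck process):

> ::threadlist -v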
James C. McPherson
2005-Dec-29 07:43 UTC
[zfs-discuss] high system cpu use/lock contention (redux)
kate wrote:
> debugging crash dump ./vmcore.9 (32-bit) from hyacinth.local
> operating system: 5.11 brandz_release (i86pc)
> panic message:
> dump content: kernel pages only
>
> JC> ::msgbuf
> MESSAGE
> WARNING: gld_start: gldm_send failed -1
> WARNING: rtls0: send fail--CHIP ERROR!
> <similar messages elided>
> pseudo-device: devinfo0
> devinfo0 is /pseudo/devinfo@0
> pseudo-device: devinfo0
> devinfo0 is /pseudo/devinfo@0
> dump on /dev/dsk/c2d0s1 size 1589 MB
> dumping to /dev/dsk/c2d0s1, offset 65536, content: kernel
>
> JC> *panic_thread::findstack -v
> mdb: couldn't read thread at 0
>
> i assume the last is because it's from a running system rather than an
> actual crash dump.

doh! I should have read your email more carefully.

[from the crash dump]

SolarisCAT(vmcore.9/11x)> proc -L 232339
addr       PID    PPID   RUID/UID   size       rss      swresv   time   command
---------- ------ ------ ---------- ---------- -------- -------- ------ ---------
0xcd376208 232339 1      0          1105920    0        98304    18216  mv 02 - Melancholy (Holy Martyr).mp3 05_melancholy.mp3

==== user (LWP_SYS) thread: 0xcefafa00 PID: 232339 ====
cmd: mv 02 - Melancholy (Holy Martyr).mp3 05_melancholy.mp3
t_procp: 0xcd376208 p_as: 0xce398da8 hat: 0xce4f8e58
size: 1105920 rss: 0
zone: global
t_stk: 0xca1d7fa4 sp: 0xca1d7d68 t_stkbase: 0xca1d6000
t_pri: 10(IA) pctcpu: 47.431519
t_lwp: 0xcc677200 lwp_regs: 0xca1d7fa4
psrset: 0 last CPU: 0
idle: 105 ticks (1.05 seconds)
start: Fri Dec 23 01:38:34 2005
age: 443 seconds (7 minutes 23 seconds)
stime: 15353531 (1.04 seconds earlier)
syscall: #134 rename(, 0x0) (sysent: address translation failed for sysent: 16 bytes @ 0x860)
tstate: TS_RUN - runnable, but not yet on a processor
tflg:   T_DFLTSTK - stack is default size
tpflg:  TP_TWAIT - wait to be freed by lwp_wait
        TP_MSACCT - collect micro-state accounting information
tsched: TS_LOAD - thread is in memory
        TS_DONT_SWAP - thread/LWP should not be swapped
pflag:  SKILLED - SIGKILL has been posted to the process
        SMSACCT - process is keeping micro-state accounting
        SMSFORK - child inherits micro-state accounting
pc: 0xfe832637 unix:0xfffffffffe832637 unix:_resume_from_idle+0x96 resume_return()
genunix:cv_wait+0x5d()
0xfa0a5943()
0xfa0c8c59()
genunix:fop_rename+0x1e()
genunix:vn_renameat+0x1e2()
genunix:vn_rename+0x18()
genunix:rename+0x13()
unix:sys_sysenter+0x1a3 _sysenter_done()
-- switch to user thread's user stack --

[Since SolarisCAT for x86/x64 isn't quite working right just yet, we'll
jump back and forth between it and mdb]

> 0t232339::pid2proc|::print proc_t p_tlist|::findstack -v
stack pointer for thread cefafa00: ca1d7d68
ca1d7d94 swtch+0x1f6()
ca1d7da0 cv_wait+0x5d(c87bdaee, c87bdaac)
ca1d7ddc txg_wait_open+0xe9(c87bda00, 1b1b05fe, 0)
ca1d7e44 zfs_rename+0x36a()
ca1d7e64 fop_rename+0x1e(c9430180, c86bad80, c9430180, d5992000, cd021cf8)
ca1d7f34 vn_renameat+0x1e2()
ca1d7f5c vn_rename+0x18(80479ef, 8047490, 0)
ca1d7f7c rename+0x13()
ca1d7fa4 sys_sysenter+0x1a3()

Looking through the zfs-related threads which have a significant
percentage of cpu accumulated:

SolarisCAT(vmcore.9/11x)> tlist findmodule zfs|egrep "wchan|pctcpu"
[elided] ...
t_wchan: 0xc8e413d4 sobj: condition var (from zfs:zfs_delete_thread+0x160)
t_pri: 60(SYS) pctcpu: 97.175514 t_lwp: 0x0
t_wchan: 0xc9cccd14 sobj: condition var (from zfs:zfs_delete_thread+0x160)
t_pri: 60(SYS) pctcpu: 97.319267 t_lwp: 0x0
t_wchan: 0xc9ccca14 sobj: condition var (from zfs:zfs_delete_thread+0x160)
t_pri: 60(SYS) pctcpu: 97.333702 t_lwp: 0x0
t_wchan: 0xc9ccc714 sobj: condition var (from zfs:zfs_delete_thread+0x160)
t_pri: 60(SYS) pctcpu: 97.342072 t_lwp: 0x0
t_wchan: 0xc9ccc414 sobj: condition var (from zfs:zfs_delete_thread+0x160)
t_pri: 60(SYS) pctcpu: 97.349762 t_lwp: 0x0
t_wchan: 0xc9ccc114 sobj: condition var (from zfs:zfs_delete_thread+0x160)
t_pri: 60(SYS) pctcpu: 97.353722 t_lwp: 0x0
t_wchan: 0xc9cbad54 sobj: condition var (from zfs:zfs_delete_thread+0x160)
t_pri: 60(SYS) pctcpu: 97.356201 t_lwp: 0x0
t_wchan: 0xcd8ddd14 sobj: condition var (from zfs:zfs_delete_thread+0x160)
t_pri: 60(SYS) pctcpu: 100.000000 t_lwp: 0x0

this last thread in particular is of concern due to its immense cputime:

> 0xca1adde0::findstack -v
stack pointer for thread ca1adde0: ca1add48
ca1add74 swtch+0x1f6()
ca1add84 cv_wait+0x5d(cd8ddd14, cd8ddd0c)
ca1addc8 zfs_delete_thread+0x160(cd8ddcc0, 0)
ca1addd8 thread_start+8()

The memory usage in this system does not seem to be excessive, except for
one particular zfs-related cache:

                             buf     buf     buf    memory    #allocations
cache name                  size   avail   total    in use   succeed  fail
------------------------- ------ ------- ------- --------- ---------- ----
zio_buf_65536              65536       2     324  21233664      47691    0

So what I reckon has happened is that we have (a) run out of the
appropriately sized zio_buf* cache entries, and (b) we have a deadlock
between the relevant zfs_delete_thread worker threads as a result.

Code-wise we are in usr/src/uts/common/fs/zfs/txg.c:

382 void
383 txg_wait_open(dsl_pool_t *dp, uint64_t txg)
384 {
385         tx_state_t *tx = &dp->dp_tx;
386
387         mutex_enter(&tx->tx_sync_lock);
388         ASSERT(tx->tx_threads == 3);
389         if (txg == 0)
390                 txg = tx->tx_open_txg + 1;
391         if (tx->tx_quiesce_txg_waiting < txg)
392                 tx->tx_quiesce_txg_waiting = txg;
393         dprintf("txg=%llu quiesce_txg=%llu sync_txg=%llu\n",
394             txg, tx->tx_quiesce_txg_waiting, tx->tx_sync_txg_waiting);
395         while (tx->tx_open_txg < txg) {
396                 cv_broadcast(&tx->tx_quiesce_more_cv);
397                 cv_wait(&tx->tx_quiesce_done_cv, &tx->tx_sync_lock);
398         }
399         mutex_exit(&tx->tx_sync_lock);
400 }

blocked at line 397.

I'm logging this as a new bug - as far as I can determine the others that
were similar have all been closed as resolved quite some time ago.

best regards,
James C. McPherson
--
Solaris Datapath Engineering
Data Management Group
Sun Microsystems
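For anyone chasing a similar hang without SolarisCAT, the cache numbers above can also be pulled straight from the dump in mdb; a sketch (::kmastat is the stock kernel memory allocator report, and ! is mdb's shell escape, used here only to filter the output down to the zio_buf caches):

> ::kmastat ! grep zio_buf

Its columns correspond closely to the table above: buffer size, buffers in use and total, memory in use, and allocation successes and failures per cache.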