I've had a problem on a FreeBSD 8.1R system for a few weeks. It seems
that ZFS gets into an almost unresponsive state. The last time it happened
(two weeks ago) I couldn't even log in, although the system was up; this
time I could manage a reboot but couldn't stop any applications (they
were likely hanging on I/O).

Here are some details I collected prior to the reboot.

The zpool output, along with iostat and gstat for the disks:
# zpool status
  pool: pool0
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        pool0       ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            ad4s3   ONLINE       0     0     0
            ad6s3   ONLINE       0     0     0

errors: No known data errors
# zpool iostat -v 5
...
               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
pool0        117G  16.7G    248    114   865K   269K
  mirror     117G  16.7G    248    114   865K   269K
    ad4s3       -      -     43     56  2.47M   269K
    ad6s3       -      -     39     56  2.41M   269K
----------  -----  -----  -----  -----  -----  -----
# gstat
...
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
    1     48     48   3042    9.8      0      0    0.0   47.6| ad4
    0     38     38   2406   10.5      0      0    0.0   39.5| ad6
    0      0      0      0    0.0      0      0    0.0    0.0| ad4s1
    0      0      0      0    0.0      0      0    0.0    0.0| ad4s2
    1     48     48   3042    9.8      0      0    0.0   47.6| ad4s3
    0      0      0      0    0.0      0      0    0.0    0.0| ad6s1
    0      0      0      0    0.0      0      0    0.0    0.0| ad6s2
    0     38     38   2406   11.8      0      0    0.0   44.4| ad6s3
I've seen this before when I've had poor ZFS performance: there's more
I/O on the disks than on the pool itself. It's not particularly busy
though.
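
For what it's worth, the two disks together are reading roughly
2.47M + 2.41M, i.e. about 4.9M/s, against only 865K/s at the pool level,
so something like a 5-6x read inflation. My guess (only a guess) is the
vdev read-ahead cache, which here rounds small physical reads up to 64K
(vfs.zfs.vdev.cache.bshift is 16). Its counters are in the attached
output and can be checked directly:

# sysctl kstat.zfs.misc.vdev_cache_stats
kstat.zfs.misc.vdev_cache_stats.delegations: 16993
kstat.zfs.misc.vdev_cache_stats.hits: 27480802
kstat.zfs.misc.vdev_cache_stats.misses: 10351041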
A few items from top, including zfskern:
  PID USERNAME  THR PRI NICE   SIZE    RES STATE   C   TIME   WCPU COMMAND
    5 root        4  -8    -     0K    60K zio->i  0  54:38  3.47% zfskern
91775 70          1  44    0 53040K 31144K tx->tx  1   2:11  0.00% postgres
39661 tdb         1  44    0 55776K 32968K tx->tx  0   0:39  0.00% mutt
14828 root        1  47    0 14636K  1572K tx->tx  1   0:03  0.00% zfs
11188 root        1  51    0 14636K  1572K tx->tx  0   0:03  0.00% zfs
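
Those tx->tx wait channels look like everything is blocked waiting for
the current transaction group to sync. Next time this happens I'll try
to grab kernel stacks for the stuck processes (assuming procstat itself
still responds), along the lines of:

# procstat -kk 5 91775 14828

(the PIDs are just the ones from the top output above).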
At some point during this process my zfs snapshots have been failing to
complete:
root      5  0.8  0.0     0    60  ??  DL    7Aug10  54:43.83 [zfskern]
root   8265  0.0  0.0 14636  1528  ??  D    10:00AM   0:03.12 zfs snapshot -r pool0@2010-08-21_10:00:01--1d
root  11188  0.0  0.1 14636  1572  ??  D    11:00AM   0:02.93 zfs snapshot -r pool0@2010-08-21_11:00:01--1d
root  14828  0.0  0.1 14636  1572  ??  D    12:00PM   0:03.04 zfs snapshot -r pool0@2010-08-21_12:00:00--1d
root  17862  0.0  0.1 14636  1572  ??  D     1:00PM   0:01.96 zfs snapshot -r pool0@2010-08-21_13:00:01--1d
root  20986  0.0  0.1 14636  1572  ??  D     2:00PM   0:02.07 zfs snapshot -r pool0@2010-08-21_14:00:01--1d
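
I haven't confirmed how far behind the snapshots actually are; a quick
check along these lines (just a sketch) should show the most recent ones
that did complete:

# zfs list -t snapshot -r pool0 | tail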
It all seems to point to ZFS becoming almost unresponsive. It's been
exactly two weeks since this last happened, and therefore since the last
reboot, so it'll be interesting to see whether it happens again after
the same period of time.
I noticed this sysctl mentioned in a few other ZFS-related messages:
vfs.worklist_len: 15
I've attached (hopefully) all of the ZFS-related sysctl output.
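
In case a time series is more useful next time, I could log a few of
these periodically with a crude loop (the names are simply ones taken
from the attached output):

# while :; do date; sysctl vfs.worklist_len vfs.zfs.arc_meta_used \
      kstat.zfs.misc.arcstats.memory_throttle_count; sleep 300; done \
      >> /var/log/zfs-stats.log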
Finally, the reboot log:
Aug 21 22:13:06 server kernel: Aug 21 22:13:06 server reboot: rebooted by tdb
Aug 21 22:19:47 server kernel: Waiting (max 60 seconds) for system process `vnlru' to stop...done
Aug 21 22:19:47 server kernel: Waiting (max 60 seconds) for system process `bufdaemon' to stop...
Aug 21 22:19:48 server kernel: done
Aug 21 22:19:48 server kernel: Waiting (max 60 seconds) for system process `syncer' to stop...
Aug 21 22:20:03 server kernel:
Aug 21 22:20:03 server kernel: Syncing disks, vnodes remaining...14
Aug 21 22:20:48 server kernel: timed out
Aug 21 22:21:55 server kernel: Waiting (max 60 seconds) for system process `vnlru' to stop...
Aug 21 22:22:39 server kernel: 1
Aug 21 22:22:55 server kernel: timed out
Aug 21 22:22:55 server kernel: Waiting (max 60 seconds) for system process `bufdaemon' to stop...
I've undoubtedly missed some important information, so please let me
know if there's anything more useful I can collect next time (I'm quite
sure it'll happen again).
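
If it would help, I could also try to force a crash dump the next time
it wedges (assuming I can still get a shell and have a dump device set
up with dumpon(8)), e.g.:

# sysctl debug.kdb.panic=1

so there'd be a kernel dump to pick apart afterwards.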
Thanks,
Tim.
--
Tim Bishop
http://www.bishnet.net/tim/
PGP Key: 0x5AE7D984
-------------- attached ZFS-related sysctl output --------------
vfs.zfs.l2c_only_size: 0
vfs.zfs.mfu_ghost_data_lsize: 40245248
vfs.zfs.mfu_ghost_metadata_lsize: 87331328
vfs.zfs.mfu_ghost_size: 127576576
vfs.zfs.mfu_data_lsize: 99885056
vfs.zfs.mfu_metadata_lsize: 146944
vfs.zfs.mfu_size: 101330432
vfs.zfs.mru_ghost_data_lsize: 181200896
vfs.zfs.mru_ghost_metadata_lsize: 25819648
vfs.zfs.mru_ghost_size: 207020544
vfs.zfs.mru_data_lsize: 351579136
vfs.zfs.mru_metadata_lsize: 0
vfs.zfs.mru_size: 419203072
vfs.zfs.anon_data_lsize: 0
vfs.zfs.anon_metadata_lsize: 0
vfs.zfs.anon_size: 16982016
vfs.zfs.l2arc_norw: 1
vfs.zfs.l2arc_feed_again: 1
vfs.zfs.l2arc_noprefetch: 0
vfs.zfs.l2arc_feed_min_ms: 200
vfs.zfs.l2arc_feed_secs: 1
vfs.zfs.l2arc_headroom: 2
vfs.zfs.l2arc_write_boost: 8388608
vfs.zfs.l2arc_write_max: 8388608
vfs.zfs.arc_meta_limit: 162276480
vfs.zfs.arc_meta_used: 167108904
vfs.zfs.mdcomp_disable: 0
vfs.zfs.arc_min: 81138240
vfs.zfs.arc_max: 649105920
vfs.zfs.zfetch.array_rd_sz: 1048576
vfs.zfs.zfetch.block_cap: 256
vfs.zfs.zfetch.min_sec_reap: 2
vfs.zfs.zfetch.max_streams: 8
vfs.zfs.prefetch_disable: 1
vfs.zfs.check_hostid: 1
vfs.zfs.recover: 0
vfs.zfs.txg.write_limit_override: 0
vfs.zfs.txg.synctime: 5
vfs.zfs.txg.timeout: 30
vfs.zfs.scrub_limit: 10
vfs.zfs.vdev.cache.bshift: 16
vfs.zfs.vdev.cache.size: 10485760
vfs.zfs.vdev.cache.max: 16384
vfs.zfs.vdev.aggregation_limit: 131072
vfs.zfs.vdev.ramp_rate: 2
vfs.zfs.vdev.time_shift: 6
vfs.zfs.vdev.min_pending: 4
vfs.zfs.vdev.max_pending: 35
vfs.zfs.cache_flush_disable: 0
vfs.zfs.zil_disable: 0
vfs.zfs.zio.use_uma: 0
vfs.zfs.version.zpl: 3
vfs.zfs.version.vdev_boot: 1
vfs.zfs.version.spa: 14
vfs.zfs.version.dmu_backup_stream: 1
vfs.zfs.version.dmu_backup_header: 2
vfs.zfs.version.acl: 1
vfs.zfs.debug: 0
vfs.zfs.super_owner: 0
kstat.zfs.misc.zfetchstats.hits: 0
kstat.zfs.misc.zfetchstats.misses: 0
kstat.zfs.misc.zfetchstats.colinear_hits: 0
kstat.zfs.misc.zfetchstats.colinear_misses: 0
kstat.zfs.misc.zfetchstats.stride_hits: 0
kstat.zfs.misc.zfetchstats.stride_misses: 0
kstat.zfs.misc.zfetchstats.reclaim_successes: 0
kstat.zfs.misc.zfetchstats.reclaim_failures: 0
kstat.zfs.misc.zfetchstats.streams_resets: 0
kstat.zfs.misc.zfetchstats.streams_noresets: 0
kstat.zfs.misc.zfetchstats.bogus_streams: 0
kstat.zfs.misc.arcstats.hits: 268465432
kstat.zfs.misc.arcstats.misses: 68288918
kstat.zfs.misc.arcstats.demand_data_hits: 96018646
kstat.zfs.misc.arcstats.demand_data_misses: 29346057
kstat.zfs.misc.arcstats.demand_metadata_hits: 172436107
kstat.zfs.misc.arcstats.demand_metadata_misses: 38917800
kstat.zfs.misc.arcstats.prefetch_data_hits: 0
kstat.zfs.misc.arcstats.prefetch_data_misses: 0
kstat.zfs.misc.arcstats.prefetch_metadata_hits: 10679
kstat.zfs.misc.arcstats.prefetch_metadata_misses: 25061
kstat.zfs.misc.arcstats.mru_hits: 50140760
kstat.zfs.misc.arcstats.mru_ghost_hits: 3740247
kstat.zfs.misc.arcstats.mfu_hits: 218317059
kstat.zfs.misc.arcstats.mfu_ghost_hits: 25731748
kstat.zfs.misc.arcstats.allocated: 74130663
kstat.zfs.misc.arcstats.deleted: 40615278
kstat.zfs.misc.arcstats.stolen: 31781579
kstat.zfs.misc.arcstats.recycle_miss: 26118118
kstat.zfs.misc.arcstats.mutex_miss: 69832
kstat.zfs.misc.arcstats.evict_skip: 315025656
kstat.zfs.misc.arcstats.evict_l2_cached: 0
kstat.zfs.misc.arcstats.evict_l2_eligible: 1987022682112
kstat.zfs.misc.arcstats.evict_l2_ineligible: 63680512
kstat.zfs.misc.arcstats.hash_elements: 44157
kstat.zfs.misc.arcstats.hash_elements_max: 206842
kstat.zfs.misc.arcstats.hash_collisions: 24859197
kstat.zfs.misc.arcstats.hash_chains: 9616
kstat.zfs.misc.arcstats.hash_chain_max: 22
kstat.zfs.misc.arcstats.p: 40774359
kstat.zfs.misc.arcstats.c: 628821360
kstat.zfs.misc.arcstats.c_min: 81138240
kstat.zfs.misc.arcstats.c_max: 649105920
kstat.zfs.misc.arcstats.size: 628128200
kstat.zfs.misc.arcstats.hdr_size: 10035104
kstat.zfs.misc.arcstats.data_size: 537511936
kstat.zfs.misc.arcstats.other_size: 80581160
kstat.zfs.misc.arcstats.l2_hits: 0
kstat.zfs.misc.arcstats.l2_misses: 0
kstat.zfs.misc.arcstats.l2_feeds: 0
kstat.zfs.misc.arcstats.l2_rw_clash: 0
kstat.zfs.misc.arcstats.l2_read_bytes: 0
kstat.zfs.misc.arcstats.l2_write_bytes: 0
kstat.zfs.misc.arcstats.l2_writes_sent: 0
kstat.zfs.misc.arcstats.l2_writes_done: 0
kstat.zfs.misc.arcstats.l2_writes_error: 0
kstat.zfs.misc.arcstats.l2_writes_hdr_miss: 0
kstat.zfs.misc.arcstats.l2_evict_lock_retry: 0
kstat.zfs.misc.arcstats.l2_evict_reading: 0
kstat.zfs.misc.arcstats.l2_free_on_write: 0
kstat.zfs.misc.arcstats.l2_abort_lowmem: 0
kstat.zfs.misc.arcstats.l2_cksum_bad: 0
kstat.zfs.misc.arcstats.l2_io_error: 0
kstat.zfs.misc.arcstats.l2_size: 0
kstat.zfs.misc.arcstats.l2_hdr_size: 0
kstat.zfs.misc.arcstats.memory_throttle_count: 4236474
kstat.zfs.misc.arcstats.l2_write_trylock_fail: 0
kstat.zfs.misc.arcstats.l2_write_passed_headroom: 0
kstat.zfs.misc.arcstats.l2_write_spa_mismatch: 0
kstat.zfs.misc.arcstats.l2_write_in_l2: 0
kstat.zfs.misc.arcstats.l2_write_io_in_progress: 0
kstat.zfs.misc.arcstats.l2_write_not_cacheable: 62188
kstat.zfs.misc.arcstats.l2_write_full: 0
kstat.zfs.misc.arcstats.l2_write_buffer_iter: 0
kstat.zfs.misc.arcstats.l2_write_pios: 0
kstat.zfs.misc.arcstats.l2_write_buffer_bytes_scanned: 0
kstat.zfs.misc.arcstats.l2_write_buffer_list_iter: 0
kstat.zfs.misc.arcstats.l2_write_buffer_list_null_iter: 0
kstat.zfs.misc.vdev_cache_stats.delegations: 16993
kstat.zfs.misc.vdev_cache_stats.hits: 27480802
kstat.zfs.misc.vdev_cache_stats.misses: 10351041