Stuart Anderson
2007-Oct-22 02:09 UTC
[zfs-discuss] Parallel zfs destroy results in No more processes
Running 102 parallel "zfs destroy -r" commands on an X4500 running S10U4 has
resulted in "No more processes" errors in existing login shells for several
minutes of time, but then fork() calls started working again. However, none
of the zfs destroy processes have actually completed yet, which is odd since
some of the filesystems are trivially small.
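(In case it matters, the 102 destroys were launched as independent background
processes from a shell loop; something along these lines, where the parent
dataset name is just a placeholder and the filesystems are assumed to sit
directly under it:

  zfs list -H -o name -r tank/home | tail +2 |
  while read fs; do
          zfs destroy -r "$fs" &        # one background destroy per filesystem
  done
)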
After fork() started working there were hardly any processes other than the
102 "zfs destroy" commands running on the system, i.e.,
# ps -ef | wc -l
154
Here is a snapshot of top that looks reasonable; note especially that
"free swap" is 16GB and that the "last pid" is still in the range of
the ~100 zfs commands being run.
Is this a known issue? Any ideas on what resource lots of zfs commands use
up to prevent fork() from working?
Thanks.
last pid: 11473;  load avg: 0.35, 0.87, 0.68;  up 9+00:21:42        18:56:38
148 processes: 146 sleeping, 1 zombie, 1 on cpu
CPU states: 94.2% idle, 0.0% user, 5.8% kernel, 0.0% iowait, 0.0% swap
Memory: 16G phys mem, 1029M free mem, 16G total swap, 16G free swap
PID USERNAME LWP PRI NICE SIZE RES STATE TIME CPU COMMAND
11333 root 1 59 0 3188K 772K cpu/3 0:01 0.02% top
622 noaccess 28 59 0 172M 4528K sleep 4:28 0.01% java
528 root 1 59 0 20M 5092K sleep 2:44 0.01% Xorg
431 root 11 59 0 5620K 1248K sleep 0:01 0.01% syslogd
565 root 1 59 0 10M 1384K sleep 0:53 0.00% dtgreet
206 root 1 100 -20 2068K 1128K sleep 0:21 0.00% xntpd
10864 root 1 59 0 7416K 1216K sleep 0:00 0.00% sshd
7 root 14 59 0 12M 680K sleep 0:05 0.00% svc.startd
158 root 33 59 0 6864K 1616K sleep 0:15 0.00% nscd
312 root 1 59 0 1112K 660K sleep 0:00 0.00% utmpd
340 root 3 59 0 3932K 1312K sleep 0:00 0.00% inetd
582 root 22 59 0 17M 2028K sleep 5:49 0.00% fmd
11432 root 1 59 0 4556K 1496K sleep 0:30 0.00% zfs
11449 root 1 59 0 4556K 1496K sleep 0:27 0.00% zfs
11360 root 1 59 0 4552K 1492K sleep 0:26 0.00% zfs
--
Stuart Anderson anderson at ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson
David Bustos
2007-Oct-24 17:40 UTC
[zfs-discuss] Parallel zfs destroy results in No more processes
Quoth Stuart Anderson on Sun, Oct 21, 2007 at 07:09:10PM -0700:
> Running 102 parallel "zfs destroy -r" commands on an X4500 running S10U4 has
> resulted in "No more processes" errors in existing login shells for several
> minutes of time, but then fork() calls started working again. However, none
> of the zfs destroy processes have actually completed yet, which is odd since
> some of the filesystems are trivially small.
...
> Is this a known issue? Any ideas on what resource lots of zfs commands use
> up to prevent fork() from working?

ZFS is known to use a lot of memory. I suspect this problem has
diminished in recent Nevada builds. Can you try this on Nevada?

David
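P.S. To see how much memory the kernel (and ZFS in particular) is actually
holding, something along these lines should work on U4; the exact kstats
available vary by build, so treat this as a sketch:

  echo ::memstat | mdb -k    (physical memory broken down into kernel / anon / exec / free)
  kstat -m zfs               (whatever ZFS kstats that build exposes, e.g. ARC sizes if arcstats is present)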
Stuart Anderson
2007-Oct-24 17:58 UTC
[zfs-discuss] Parallel zfs destroy results in No more processes
On Wed, Oct 24, 2007 at 10:40:41AM -0700, David Bustos wrote:
> Quoth Stuart Anderson on Sun, Oct 21, 2007 at 07:09:10PM -0700:
> > Running 102 parallel "zfs destroy -r" commands on an X4500 running S10U4 has
> > resulted in "No more processes" errors in existing login shells for several
> > minutes of time, but then fork() calls started working again. However, none
> > of the zfs destroy processes have actually completed yet, which is odd since
> > some of the filesystems are trivially small.
> ...
> > Is this a known issue? Any ideas on what resource lots of zfs commands use
> > up to prevent fork() from working?
>
> ZFS is known to use a lot of memory. I suspect this problem has
> diminished in recent Nevada builds. Can you try this on Nevada?

I suspect it is more subtle than this, since top was reporting that none
of the available swap space was being used yet, so there was 16GB of
free VM.

Unfortunately, I am not currently in a position to switch this system
over to Nevada.

Thanks.

--
Stuart Anderson anderson at ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson
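P.S. The next time this happens I will also try to capture the process count
against the configured limit, to rule out a full process table; from memory
that should be something like:

  kstat -p unix:0:system_misc:nproc    (current number of processes)
  sysdef | grep v_proc                 (configured maximum, max_nprocs)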
Robert Lawhead
2007-Oct-25 22:54 UTC
[zfs-discuss] Parallel zfs destroy results in No more processes
Do you have SATA Native Command Queuing enabled? I've experienced delays of
just under one minute when NCQ is enabled that do not occur when NCQ is
disabled. If all threads comprising the parallel zfs destroy hang for a
minute, I bet it's the hang that causes "no more processes". I have opened a
trouble ticket on this issue and am waiting for feedback. In the meantime,
I've disabled NCQ by adding the line below to /etc/system (and rebooting).
set sata:sata_func_enable = 0x5
While this probably incurs some performance penalty, it's better than the one
minute hangs.
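If you want to confirm the tunable actually took effect after the reboot, you
should be able to read it back with mdb; syntax from memory, and you may need
the module prefix:

  echo 'sata_func_enable/X' | mdb -k
  (or: echo 'sata`sata_func_enable/X' | mdb -k)

It should report 5 once the /etc/system line is in effect.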
The following is a typical log entry that appears at the conclusion of a one
minute "stall".
Oct 21 07:56:09 host marvell88sx: [ID 670675 kern.info] NOTICE: marvell88sx1: device on port 0 reset: DMA command timeout
Oct 21 07:56:09 host sata: [ID 801593 kern.notice] NOTICE: /pci@0,0/pci1022,7458@2/pci11ab,11ab@1:
Oct 21 07:56:09 host port 0: device reset
Oct 21 07:56:09 host marvell88sx: [ID 670675 kern.info] NOTICE: marvell88sx1: device on port 0 reset: device disconnected or device error
Oct 21 07:56:09 host sata: [ID 801593 kern.notice] NOTICE: /pci@0,0/pci1022,7458@2/pci11ab,11ab@1:
Oct 21 07:56:09 host port 0: device reset
Oct 21 07:56:09 host sata: [ID 801593 kern.notice] NOTICE: /pci@0,0/pci1022,7458@2/pci11ab,11ab@1:
Oct 21 07:56:09 host port 0: link lost
Oct 21 07:56:09 host sata: [ID 801593 kern.notice] NOTICE: /pci@0,0/pci1022,7458@2/pci11ab,11ab@1:
Oct 21 07:56:09 host port 0: link established
Oct 21 07:56:09 host marvell88sx: [ID 812950 kern.warning] WARNING: marvell88sx1: error on port 0:
Oct 21 07:56:09 host marvell88sx: [ID 517869 kern.info] device disconnected
Oct 21 07:56:09 host marvell88sx: [ID 517869 kern.info] device connected
Oct 21 07:56:09 host scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci1022,7458@2/pci11ab,11ab@1/disk@0,0 (sd6):
Oct 21 07:56:09 host Error for Command: write(10) Error Level: Retryable
Oct 21 07:56:09 host scsi: [ID 107833 kern.notice] Requested Block: 376060962 Error Block: 376060962
Oct 21 07:56:09 host scsi: [ID 107833 kern.notice] Vendor: ATA Serial Number:
Oct 21 07:56:09 host scsi: [ID 107833 kern.notice] Sense Key: No Additional Sense
Oct 21 07:56:09 host scsi: [ID 107833 kern.notice] ASC: 0x0 (no additional sense info), ASCQ: 0x0, FRU: 0x0d
Stuart Anderson
2007-Oct-26 19:41 UTC
[zfs-discuss] Parallel zfs destroy results in No more processes
> Do you have SATA Native Command Queuing enabled? I've experienced delays
> of just under one minute when NCQ is enabled that do not occur when NCQ is
> disabled. If all threads comprising the parallel zfs destroy hang for a
> minute, I bet it's the hang that causes "no more processes". I have opened
> a trouble ticket on this issue and am waiting for feedback. In the
> meantime, I've disabled NCQ by adding the line below to /etc/system (and
> rebooting).
>
> set sata:sata_func_enable = 0x5

Not on this system. It is not clear to me how these timeout/disconnect
problems would cause a call to fork() to fail, but I can give that a try
the next time I need to delete that much data.

However, we have disabled NCQ through this mechanism on another system that
was locking up ~1/week with several "device disconnected" messages. That
system has been up for 2 weeks after disabling NCQ and has not displayed any
disconnected messages since then.

Can anyone confirm that 125205-07 has solved these NCQ problems?

Thanks.

--
Stuart Anderson anderson at ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson
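P.S. For anyone else following along, a quick way to check whether a given
box already has that patch installed is simply:

  showrev -p | grep 125205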