Stuart Anderson
2007-Oct-22 02:09 UTC
[zfs-discuss] Parallel zfs destroy results in No more processes
Running 102 parallel "zfs destroy -r" commands on an X4500 running S10U4 has
resulted in "No more processes" errors in existing login shells for several
minutes of time, but then fork() calls started working again. However, none
of the zfs destroy processes have actually completed yet, which is odd since
some of the filesystems are trivially small.

After fork() started working there were hardly any processes other than the
102 "zfs destroy" commands running on the system, i.e.,

# ps -ef | wc -l
     154

Here is a snapshot of top that looks reasonable; note especially that "free
swap" is 16GB and that the "last pid" is still in the range of the ~100 zfs
commands being run.

Is this a known issue? Any ideas on what resource lots of zfs commands use
up to prevent fork() from working?

Thanks.

last pid: 11473;  load avg: 0.35, 0.87, 0.68;  up 9+00:21:42        18:56:38
148 processes: 146 sleeping, 1 zombie, 1 on cpu
CPU states: 94.2% idle, 0.0% user, 5.8% kernel, 0.0% iowait, 0.0% swap
Memory: 16G phys mem, 1029M free mem, 16G total swap, 16G free swap

   PID USERNAME LWP PRI NICE  SIZE   RES STATE    TIME    CPU COMMAND
 11333 root       1  59    0 3188K  772K cpu/3    0:01  0.02% top
   622 noaccess  28  59    0  172M 4528K sleep    4:28  0.01% java
   528 root       1  59    0   20M 5092K sleep    2:44  0.01% Xorg
   431 root      11  59    0 5620K 1248K sleep    0:01  0.01% syslogd
   565 root       1  59    0   10M 1384K sleep    0:53  0.00% dtgreet
   206 root       1 100  -20 2068K 1128K sleep    0:21  0.00% xntpd
 10864 root       1  59    0 7416K 1216K sleep    0:00  0.00% sshd
     7 root      14  59    0   12M  680K sleep    0:05  0.00% svc.startd
   158 root      33  59    0 6864K 1616K sleep    0:15  0.00% nscd
   312 root       1  59    0 1112K  660K sleep    0:00  0.00% utmpd
   340 root       3  59    0 3932K 1312K sleep    0:00  0.00% inetd
   582 root      22  59    0   17M 2028K sleep    5:49  0.00% fmd
 11432 root       1  59    0 4556K 1496K sleep    0:30  0.00% zfs
 11449 root       1  59    0 4556K 1496K sleep    0:27  0.00% zfs
 11360 root       1  59    0 4552K 1492K sleep    0:26  0.00% zfs

--
Stuart Anderson    anderson at ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson
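[Editor's note: for readers trying to reproduce the setup, here is a minimal
sketch of how a batch of parallel destroys like this might be launched. The
pool and filesystem names are hypothetical placeholders, not taken from the
post.]

#!/bin/sh
# Hypothetical layout: scratch filesystems live under tank/scratch.
# List the children (skip the parent itself), then launch one background
# "zfs destroy -r" per filesystem and wait for all of them to exit.
for fs in $(zfs list -H -o name -r tank/scratch | tail +2); do
        zfs destroy -r "$fs" &
done
wait    # returns once every background destroy has exited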
David Bustos
2007-Oct-24 17:40 UTC
[zfs-discuss] Parallel zfs destroy results in No more processes
Quoth Stuart Anderson on Sun, Oct 21, 2007 at 07:09:10PM -0700:
> Running 102 parallel "zfs destroy -r" commands on an X4500 running S10U4 has
> resulted in "No more processes" errors in existing login shells for several
> minutes of time, but then fork() calls started working again. However, none
> of the zfs destroy processes have actually completed yet, which is odd since
> some of the filesystems are trivially small.
...
> Is this a known issue? Any ideas on what resource lots of zfs commands use
> up to prevent fork() from working?

ZFS is known to use a lot of memory. I suspect this problem has
diminished in recent Nevada builds. Can you try this on Nevada?


David
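[Editor's note: a quick way to see where that memory actually sits is sketched
below. The ZFS ARC lives in kernel memory, so it never shows up as used swap in
top; the arcstats kstat name assumes a Solaris 10 build recent enough to export
it.]

# Kernel vs. anon/exec/page-cache memory breakdown:
echo ::memstat | mdb -k

# Current ZFS ARC size in bytes, if this build exposes the arcstats kstat:
kstat -p zfs:0:arcstats:size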
Stuart Anderson
2007-Oct-24 17:58 UTC
[zfs-discuss] Parallel zfs destroy results in No more processes
On Wed, Oct 24, 2007 at 10:40:41AM -0700, David Bustos wrote:
> Quoth Stuart Anderson on Sun, Oct 21, 2007 at 07:09:10PM -0700:
> > Running 102 parallel "zfs destroy -r" commands on an X4500 running S10U4 has
> > resulted in "No more processes" errors in existing login shells for several
> > minutes of time, but then fork() calls started working again. However, none
> > of the zfs destroy processes have actually completed yet, which is odd since
> > some of the filesystems are trivially small.
> ...
> > Is this a known issue? Any ideas on what resource lots of zfs commands use
> > up to prevent fork() from working?
>
> ZFS is known to use a lot of memory. I suspect this problem has
> diminished in recent Nevada builds. Can you try this on Nevada?

I suspect it is more subtle than this, since top was reporting that none of
the available swap space was being used yet, so there was 16GB of free VM.

Unfortunately, I am not currently in a position to switch this system over
to Nevada.

Thanks.

--
Stuart Anderson    anderson at ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson
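[Editor's note: if kernel memory is not the culprit, another limit worth
checking is the process table itself. A sketch of comparing the live process
count against the configured ceiling is below; the kstat and sysdef fields
named here are assumptions about what this S10U4 box exposes, not something
verified in the thread.]

# Live process count as seen by the kernel:
kstat -p unix:0:system_misc:nproc

# Configured ceiling on processes (v.v_proc / max_nprocs):
sysdef | grep v_proc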
Robert Lawhead
2007-Oct-25 22:54 UTC
[zfs-discuss] Parallel zfs destroy results in No more processes
Do you have sata Native Command Queuing enabled? I've experienced delays of
just under one minute when NCQ is enabled that do not occur when NCQ is
disabled. If all threads comprising the parallel zfs destroy hang for a
minute, I bet it's the hang that causes "no more processes". I have opened a
trouble ticket on this issue and am waiting for feedback. In the meantime,
I've disabled NCQ by adding the line below to /etc/system (and rebooting).

set sata:sata_func_enable = 0x5

While this probably incurs some performance penalty, it's better than the
one-minute hangs.

The following is a typical log entry that appears at the conclusion of a
one-minute stall.

Oct 21 07:56:09 host marvell88sx: [ID 670675 kern.info] NOTICE: marvell88sx1: device on port 0 reset: DMA command timeout
Oct 21 07:56:09 host sata: [ID 801593 kern.notice] NOTICE: /pci at 0,0/pci1022,7458 at 2/pci11ab,11ab at 1:
Oct 21 07:56:09 host    port 0: device reset
Oct 21 07:56:09 host marvell88sx: [ID 670675 kern.info] NOTICE: marvell88sx1: device on port 0 reset: device disconnected or device error
Oct 21 07:56:09 host sata: [ID 801593 kern.notice] NOTICE: /pci at 0,0/pci1022,7458 at 2/pci11ab,11ab at 1:
Oct 21 07:56:09 host    port 0: device reset
Oct 21 07:56:09 host sata: [ID 801593 kern.notice] NOTICE: /pci at 0,0/pci1022,7458 at 2/pci11ab,11ab at 1:
Oct 21 07:56:09 host    port 0: link lost
Oct 21 07:56:09 host sata: [ID 801593 kern.notice] NOTICE: /pci at 0,0/pci1022,7458 at 2/pci11ab,11ab at 1:
Oct 21 07:56:09 host    port 0: link established
Oct 21 07:56:09 host marvell88sx: [ID 812950 kern.warning] WARNING: marvell88sx1: error on port 0:
Oct 21 07:56:09 host marvell88sx: [ID 517869 kern.info]    device disconnected
Oct 21 07:56:09 host marvell88sx: [ID 517869 kern.info]    device connected
Oct 21 07:56:09 host scsi: [ID 107833 kern.warning] WARNING: /pci at 0,0/pci1022,7458 at 2/pci11ab,11ab at 1/disk at 0,0 (sd6):
Oct 21 07:56:09 host    Error for Command: write(10)    Error Level: Retryable
Oct 21 07:56:09 host scsi: [ID 107833 kern.notice]    Requested Block: 376060962    Error Block: 376060962
Oct 21 07:56:09 host scsi: [ID 107833 kern.notice]    Vendor: ATA    Serial Number:
Oct 21 07:56:09 host scsi: [ID 107833 kern.notice]    Sense Key: No Additional Sense
Oct 21 07:56:09 host scsi: [ID 107833 kern.notice]    ASC: 0x0 (no additional sense info), ASCQ: 0x0, FRU: 0x0

This message posted from opensolaris.org
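[Editor's note: for completeness, a sketch of applying and then sanity-checking
the workaround. Only the "set" line itself comes from the post above; reading
the variable back with mdb is an added assumption that the sata module exports
sata_func_enable under that name.]

# Append the tunable to /etc/system; it only takes effect after a reboot.
echo 'set sata:sata_func_enable = 0x5' >> /etc/system

# After rebooting, read the live value back (module`symbol syntax in mdb),
# assuming the sata module is loaded and the symbol is visible:
echo 'sata`sata_func_enable/X' | mdb -k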
Stuart Anderson
2007-Oct-26 19:41 UTC
[zfs-discuss] Parallel zfs destroy results in No more processes
> Do you have sata Native Command Queuing enabled? I've experienced delays of
> just under one minute when NCQ is enabled that do not occur when NCQ is
> disabled. If all threads comprising the parallel zfs destroy hang for a
> minute, I bet it's the hang that causes "no more processes". I have opened
> a trouble ticket on this issue and am waiting for feedback. In the meantime,
> I've disabled NCQ by adding the line below to /etc/system (and rebooting).
>
> set sata:sata_func_enable = 0x5

Not on this system. It is not clear to me how these timeout/disconnect
problems would cause a call to fork() to fail, but I can give that a try the
next time I need to delete that much data.

However, we have disabled NCQ through this mechanism on another system that
was locking up ~1/week with several "device disconnected" messages. That
system has been up for 2 weeks since disabling NCQ and has not displayed any
disconnected messages since then.

Can anyone confirm that 125205-07 has solved these NCQ problems?

Thanks.

--
Stuart Anderson    anderson at ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson