Stuart Anderson
2007-Oct-22 02:09 UTC
[zfs-discuss] Parallel zfs destroy results in No more processes
Running 102 parallel "zfs destroy -r" commands on an X4500 running S10U4 has
resulted in "No more processes" errors in existing login shells for several
minutes of time, but then fork() calls started working again. However, none
of the zfs destroy processes have actually completed yet, which is odd since
some of the filesystems are trivially small.

After fork() started working there were hardly any processes other than the
102 "zfs destroy" commands running on the system, i.e.,

# ps -ef | wc -l
     154

Here is a snapshot of top that looks reasonable; note especially that "free
swap" is 16GB and that the "last pid" is still in the range of the ~100 zfs
commands being run.

Is this a known issue? Any ideas on what resource lots of zfs commands use
up to prevent fork() from working?

Thanks.

last pid: 11473;  load avg: 0.35, 0.87, 0.68;  up 9+00:21:42        18:56:38
148 processes: 146 sleeping, 1 zombie, 1 on cpu
CPU states: 94.2% idle, 0.0% user, 5.8% kernel, 0.0% iowait, 0.0% swap
Memory: 16G phys mem, 1029M free mem, 16G total swap, 16G free swap

   PID USERNAME LWP PRI NICE  SIZE   RES STATE    TIME    CPU COMMAND
 11333 root       1  59    0 3188K  772K cpu/3    0:01  0.02% top
   622 noaccess  28  59    0  172M 4528K sleep    4:28  0.01% java
   528 root       1  59    0   20M 5092K sleep    2:44  0.01% Xorg
   431 root      11  59    0 5620K 1248K sleep    0:01  0.01% syslogd
   565 root       1  59    0   10M 1384K sleep    0:53  0.00% dtgreet
   206 root       1 100  -20 2068K 1128K sleep    0:21  0.00% xntpd
 10864 root       1  59    0 7416K 1216K sleep    0:00  0.00% sshd
     7 root      14  59    0   12M  680K sleep    0:05  0.00% svc.startd
   158 root      33  59    0 6864K 1616K sleep    0:15  0.00% nscd
   312 root       1  59    0 1112K  660K sleep    0:00  0.00% utmpd
   340 root       3  59    0 3932K 1312K sleep    0:00  0.00% inetd
   582 root      22  59    0   17M 2028K sleep    5:49  0.00% fmd
 11432 root       1  59    0 4556K 1496K sleep    0:30  0.00% zfs
 11449 root       1  59    0 4556K 1496K sleep    0:27  0.00% zfs
 11360 root       1  59    0 4552K 1492K sleep    0:26  0.00% zfs

--
Stuart Anderson    anderson at ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson
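[Editor's note: for readers trying to reproduce the setup, here is a minimal
sketch of how a batch of parallel destroys like this might be launched. The
pool and filesystem names are hypothetical placeholders, not taken from the
post.]

#!/bin/sh
# Hypothetical layout: scratch filesystems live under tank/scratch.
# List the children (skip the parent itself), then launch one background
# "zfs destroy -r" per filesystem and wait for all of them to exit.
for fs in $(zfs list -H -o name -r tank/scratch | tail +2); do
        zfs destroy -r "$fs" &
done
wait    # returns once every background destroy has exited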
David Bustos
2007-Oct-24 17:40 UTC
[zfs-discuss] Parallel zfs destroy results in No more processes
Quoth Stuart Anderson on Sun, Oct 21, 2007 at 07:09:10PM -0700:
> Running 102 parallel "zfs destroy -r" commands on an X4500 running S10U4 has
> resulted in "No more processes" errors in existing login shells for several
> minutes of time, but then fork() calls started working again. However, none
> of the zfs destroy processes have actually completed yet, which is odd since
> some of the filesystems are trivially small.
...
> Is this a known issue? Any ideas on what resource lots of zfs commands use
> up to prevent fork() from working?

ZFS is known to use a lot of memory. I suspect this problem has
diminished in recent Nevada builds. Can you try this on Nevada?


David
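[Editor's note: a quick way to see where that memory actually sits is sketched
below. The ZFS ARC lives in kernel memory, so it never shows up as used swap in
top; the arcstats kstat name assumes a Solaris 10 build recent enough to export
it.]

# Kernel vs. anon/exec/page-cache memory breakdown:
echo ::memstat | mdb -k

# Current ZFS ARC size in bytes, if this build exposes the arcstats kstat:
kstat -p zfs:0:arcstats:size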
Stuart Anderson
2007-Oct-24 17:58 UTC
[zfs-discuss] Parallel zfs destroy results in No more processes
On Wed, Oct 24, 2007 at 10:40:41AM -0700, David Bustos wrote:
> Quoth Stuart Anderson on Sun, Oct 21, 2007 at 07:09:10PM -0700:
> > Running 102 parallel "zfs destroy -r" commands on an X4500 running S10U4 has
> > resulted in "No more processes" errors in existing login shells for several
> > minutes of time, but then fork() calls started working again. However, none
> > of the zfs destroy processes have actually completed yet, which is odd since
> > some of the filesystems are trivially small.
> ...
> > Is this a known issue? Any ideas on what resource lots of zfs commands use
> > up to prevent fork() from working?
>
> ZFS is known to use a lot of memory. I suspect this problem has
> diminished in recent Nevada builds. Can you try this on Nevada?

I suspect it is more subtle than this, since top was reporting that none of
the available swap space was being used yet, so there was 16GB of free VM.

Unfortunately, I am not currently in a position to switch this system over
to Nevada.

Thanks.

--
Stuart Anderson    anderson at ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson
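[Editor's note: if kernel memory is not the culprit, another limit worth
checking is the process table itself. A sketch of comparing the live process
count against the configured ceiling is below; the kstat and sysdef fields
named here are assumptions about what this S10U4 box exposes, not something
verified in the thread.]

# Live process count as seen by the kernel:
kstat -p unix:0:system_misc:nproc

# Configured ceiling on processes (v.v_proc / max_nprocs):
sysdef | grep v_proc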
Robert Lawhead
2007-Oct-25 22:54 UTC
[zfs-discuss] Parallel zfs destroy results in No more processes
Do you have sata Native Command Queuing enabled? I've experienced delays of
just under one minute when NCQ is enabled that do not occur when NCQ is
disabled. If all threads comprising the parallel zfs destroy hang for a
minute, I bet it's the hang that causes "no more processes". I have opened a
trouble ticket on this issue and am waiting for feedback. In the meantime,
I've disabled NCQ by adding the line below to /etc/system (and rebooting).

set sata:sata_func_enable = 0x5

While this probably incurs some performance penalty, it's better than the
one-minute hangs.

The following is a typical log entry that appears at the conclusion of a
one-minute stall.

Oct 21 07:56:09 host marvell88sx: [ID 670675 kern.info] NOTICE: marvell88sx1: device on port 0 reset: DMA command timeout
Oct 21 07:56:09 host sata: [ID 801593 kern.notice] NOTICE: /pci at 0,0/pci1022,7458 at 2/pci11ab,11ab at 1:
Oct 21 07:56:09 host    port 0: device reset
Oct 21 07:56:09 host marvell88sx: [ID 670675 kern.info] NOTICE: marvell88sx1: device on port 0 reset: device disconnected or device error
Oct 21 07:56:09 host sata: [ID 801593 kern.notice] NOTICE: /pci at 0,0/pci1022,7458 at 2/pci11ab,11ab at 1:
Oct 21 07:56:09 host    port 0: device reset
Oct 21 07:56:09 host sata: [ID 801593 kern.notice] NOTICE: /pci at 0,0/pci1022,7458 at 2/pci11ab,11ab at 1:
Oct 21 07:56:09 host    port 0: link lost
Oct 21 07:56:09 host sata: [ID 801593 kern.notice] NOTICE: /pci at 0,0/pci1022,7458 at 2/pci11ab,11ab at 1:
Oct 21 07:56:09 host    port 0: link established
Oct 21 07:56:09 host marvell88sx: [ID 812950 kern.warning] WARNING: marvell88sx1: error on port 0:
Oct 21 07:56:09 host marvell88sx: [ID 517869 kern.info]    device disconnected
Oct 21 07:56:09 host marvell88sx: [ID 517869 kern.info]    device connected
Oct 21 07:56:09 host scsi: [ID 107833 kern.warning] WARNING: /pci at 0,0/pci1022,7458 at 2/pci11ab,11ab at 1/disk at 0,0 (sd6):
Oct 21 07:56:09 host    Error for Command: write(10)    Error Level: Retryable
Oct 21 07:56:09 host scsi: [ID 107833 kern.notice]    Requested Block: 376060962    Error Block: 376060962
Oct 21 07:56:09 host scsi: [ID 107833 kern.notice]    Vendor: ATA    Serial Number:
Oct 21 07:56:09 host scsi: [ID 107833 kern.notice]    Sense Key: No Additional Sense
Oct 21 07:56:09 host scsi: [ID 107833 kern.notice]    ASC: 0x0 (no additional sense info), ASCQ: 0x0, FRU: 0x0

This message posted from opensolaris.org
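[Editor's note: for completeness, a sketch of applying and then sanity-checking
the workaround. Only the "set" line itself comes from the post above; reading
the variable back with mdb is an added assumption that the sata module exports
sata_func_enable under that name.]

# Append the tunable to /etc/system; it only takes effect after a reboot.
echo 'set sata:sata_func_enable = 0x5' >> /etc/system

# After rebooting, read the live value back (module`symbol syntax in mdb),
# assuming the sata module is loaded and the symbol is visible:
echo 'sata`sata_func_enable/X' | mdb -k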
Stuart Anderson
2007-Oct-26 19:41 UTC
[zfs-discuss] Parallel zfs destroy results in No more processes
> Do you have sata Native Command Queuing enabled? I've experienced delays of
> just under one minute when NCQ is enabled that do not occur when NCQ is
> disabled. If all threads comprising the parallel zfs destroy hang for a
> minute, I bet it's the hang that causes "no more processes". I have opened
> a trouble ticket on this issue and am waiting for feedback. In the meantime,
> I've disabled NCQ by adding the line below to /etc/system (and rebooting).
>
> set sata:sata_func_enable = 0x5

Not on this system. It is not clear to me how these timeout/disconnect
problems would cause a call to fork() to fail, but I can give that a try the
next time I need to delete that much data.

However, we have disabled NCQ through this mechanism on another system that
was locking up ~1/week with several "device disconnected" messages. That
system has been up for 2 weeks since disabling NCQ and has not displayed any
disconnected messages since then.

Can anyone confirm that 125205-07 has solved these NCQ problems?

Thanks.

--
Stuart Anderson    anderson at ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson