Paul Armstrong
2009-Dec-23 03:17 UTC
[zfs-discuss] Recovering ZFS stops after syseventconfd can't fork
I have a machine connected to an HDS with a corrupted pool. While running zpool import -nfFX on the pool, it spawns a large number of zfsdle processes, and eventually the machine hangs for 20-30 seconds and spits out error messages:

zfs: [ID 346414 kern.warning] WARNING: Couldn't create process for zfs pool "hds1"
syseventconfd[6885]: [ID 437529 daemon.error] cannot fork - Resource temporarily unavailable

(Tips on finding out which resource was temporarily unavailable would be appreciated too, as there seem to be plenty of resources around shortly before it dies. I'm used to only seeing that when I've run out of RAM or LWPs, but with 3 GB of RAM free according to 'echo ::memstat | mdb -k' and 2.1G LWPs available according to prctl, I'm stumped.)

The recovery then aborts with:

-bash-4.0# zpool import -fFX hds1
cannot import 'hds1': one or more devices is currently unavailable
        Destroy and re-create the pool from a backup source.

The number of zfsdle processes running while the recovery is in progress is quite high per disk:

 268 /devices/pci@0,0/pci8086,25e2@2/pci8086,3500@0/pci8086,3510@0/pci10df,fe00@0,1/fp@0,0/disk@w50060e8010037135,0:a
 269 /devices/pci@0,0/pci8086,25e2@2/pci8086,3500@0/pci8086,3510@0/pci10df,fe00@0,1/fp@0,0/disk@w50060e8010037135,10:a
 269 /devices/pci@0,0/pci8086,25e2@2/pci8086,3500@0/pci8086,3510@0/pci10df,fe00@0,1/fp@0,0/disk@w50060e8010037135,11:a
 269 /devices/pci@0,0/pci8086,25e2@2/pci8086,3500@0/pci8086,3510@0/pci10df,fe00@0,1/fp@0,0/disk@w50060e8010037135,14:a
 269 /devices/pci@0,0/pci8086,25e2@2/pci8086,3500@0/pci8086,3510@0/pci10df,fe00@0,1/fp@0,0/disk@w50060e8010037135,15:a
 269 /devices/pci@0,0/pci8086,25e2@2/pci8086,3500@0/pci8086,3510@0/pci10df,fe00@0,1/fp@0,0/disk@w50060e8010037135,18:a
 268 /devices/pci@0,0/pci8086,25e2@2/pci8086,3500@0/pci8086,3510@0/pci10df,fe00@0,1/fp@0,0/disk@w50060e8010037135,19:a
 268 /devices/pci@0,0/pci8086,25e2@2/pci8086,3500@0/pci8086,3510@0/pci10df,fe00@0,1/fp@0,0/disk@w50060e8010037135,1:a
 269 /devices/pci@0,0/pci8086,25e2@2/pci8086,3500@0/pci8086,3510@0/pci10df,fe00@0,1/fp@0,0/disk@w50060e8010037135,1c:a
 268 /devices/pci@0,0/pci8086,25e2@2/pci8086,3500@0/pci8086,3510@0/pci10df,fe00@0,1/fp@0,0/disk@w50060e8010037135,1d:a
 269 /devices/pci@0,0/pci8086,25e2@2/pci8086,3500@0/pci8086,3510@0/pci10df,fe00@0,1/fp@0,0/disk@w50060e8010037135,20:a
 269 /devices/pci@0,0/pci8086,25e2@2/pci8086,3500@0/pci8086,3510@0/pci10df,fe00@0,1/fp@0,0/disk@w50060e8010037135,21:a
 269 /devices/pci@0,0/pci8086,25e2@2/pci8086,3500@0/pci8086,3510@0/pci10df,fe00@0,1/fp@0,0/disk@w50060e8010037135,24:a
 268 /devices/pci@0,0/pci8086,25e2@2/pci8086,3500@0/pci8086,3510@0/pci10df,fe00@0,1/fp@0,0/disk@w50060e8010037135,25:a
 269 /devices/pci@0,0/pci8086,25e2@2/pci8086,3500@0/pci8086,3510@0/pci10df,fe00@0,1/fp@0,0/disk@w50060e8010037135,28:a
 269 /devices/pci@0,0/pci8086,25e2@2/pci8086,3500@0/pci8086,3510@0/pci10df,fe00@0,1/fp@0,0/disk@w50060e8010037135,29:a
 269 /devices/pci@0,0/pci8086,25e2@2/pci8086,3500@0/pci8086,3510@0/pci10df,fe00@0,1/fp@0,0/disk@w50060e8010037135,2c:a
 269 /devices/pci@0,0/pci8086,25e2@2/pci8086,3500@0/pci8086,3510@0/pci10df,fe00@0,1/fp@0,0/disk@w50060e8010037135,2d:a
 269 /devices/pci@0,0/pci8086,25e2@2/pci8086,3500@0/pci8086,3510@0/pci10df,fe00@0,1/fp@0,0/disk@w50060e8010037135,30:a
 269 /devices/pci@0,0/pci8086,25e2@2/pci8086,3500@0/pci8086,3510@0/pci10df,fe00@0,1/fp@0,0/disk@w50060e8010037135,31:a
 269 /devices/pci@0,0/pci8086,25e2@2/pci8086,3500@0/pci8086,3510@0/pci10df,fe00@0,1/fp@0,0/disk@w50060e8010037135,34:a
 269 /devices/pci@0,0/pci8086,25e2@2/pci8086,3500@0/pci8086,3510@0/pci10df,fe00@0,1/fp@0,0/disk@w50060e8010037135,35:a
 268 /devices/pci@0,0/pci8086,25e2@2/pci8086,3500@0/pci8086,3510@0/pci10df,fe00@0,1/fp@0,0/disk@w50060e8010037135,38:a
 270 /devices/pci@0,0/pci8086,25e2@2/pci8086,3500@0/pci8086,3510@0/pci10df,fe00@0,1/fp@0,0/disk@w50060e8010037135,39:a
 268 /devices/pci@0,0/pci8086,25e2@2/pci8086,3500@0/pci8086,3510@0/pci10df,fe00@0,1/fp@0,0/disk@w50060e8010037135,4:a
 269 /devices/pci@0,0/pci8086,25e2@2/pci8086,3500@0/pci8086,3510@0/pci10df,fe00@0,1/fp@0,0/disk@w50060e8010037135,5:a
 269 /devices/pci@0,0/pci8086,25e2@2/pci8086,3500@0/pci8086,3510@0/pci10df,fe00@0,1/fp@0,0/disk@w50060e8010037135,8:a
 269 /devices/pci@0,0/pci8086,25e2@2/pci8086,3500@0/pci8086,3510@0/pci10df,fe00@0,1/fp@0,0/disk@w50060e8010037135,9:a
 269 /devices/pci@0,0/pci8086,25e2@2/pci8086,3500@0/pci8086,3510@0/pci10df,fe00@0,1/fp@0,0/disk@w50060e8010037135,c:a
 269 /devices/pci@0,0/pci8086,25e2@2/pci8086,3500@0/pci8086,3510@0/pci10df,fe00@0,1/fp@0,0/disk@w50060e8010037135,d:a

The total here is a little over 8,000 (fairly close to 8K) but not over it, and there's nothing I can see in prctl that's near 8K and related (only things like max-msg-messages and max-port-ids).

The pool was corrupted under S10u6 (the machine crashed and then the filesystem died during a scrub) and I'm trying to recover it with SXCE 129. Ideas?

Here's the layout:

-bash-4.0# zpool import
  pool: hds1
    id: 15655551015334264270
 state: FAULTED
status: The pool metadata is corrupted.
action: The pool cannot be imported due to damaged devices or data.
   see: http://www.sun.com/msg/ZFS-8000-72
config:

        hds1                         FAULTED  corrupted data
          raidz1-0                   ONLINE
            c1t50060E8010037135d0p0  ONLINE
            c1t50060E8010037135d4    ONLINE
            c1t50060E8010037135d8    ONLINE
            c1t50060E8010037135d12   ONLINE
            c1t50060E8010037135d16   ONLINE
          raidz1-1                   ONLINE
            c1t50060E8010037135d40   ONLINE
            c1t50060E8010037135d44   ONLINE
            c1t50060E8010037135d48   ONLINE
            c1t50060E8010037135d52   ONLINE
            c1t50060E8010037135d56   ONLINE
          raidz1-2                   ONLINE
            c1t50060E8010037135d20   ONLINE
            c1t50060E8010037135d24   ONLINE
            c1t50060E8010037135d28   ONLINE
            c1t50060E8010037135d32   ONLINE
            c1t50060E8010037135d36   ONLINE
          raidz1-3                   ONLINE
            c1t50060E8010037135d1    ONLINE
            c1t50060E8010037135d5    ONLINE
            c1t50060E8010037135d9    ONLINE
            c1t50060E8010037135d13   ONLINE
            c1t50060E8010037135d17   ONLINE
          raidz1-4                   ONLINE
            c1t50060E8010037135d21   ONLINE
            c1t50060E8010037135d25   ONLINE
            c1t50060E8010037135d29   ONLINE
            c1t50060E8010037135d33   ONLINE
            c1t50060E8010037135d37   ONLINE
          raidz1-5                   ONLINE
            c1t50060E8010037135d41   ONLINE
            c1t50060E8010037135d45   ONLINE
            c1t50060E8010037135d49   ONLINE
            c1t50060E8010037135d53   ONLINE
            c1t50060E8010037135d57   ONLINE
        <snip several working volumes>

Thanks,
Paul
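For reference, a per-device tally like the one above can be gathered with something along these lines (a minimal sketch; it assumes each zfsdle instance carries the device path as the last word of its argument list, as in the listing above):

# count zfsdle processes per device path, then the overall total
ps -e -o args= | grep '[z]fsdle' | nawk '{print $NF}' | sort | uniq -c | sort -n
ps -e -o args= | grep -c '[z]fsdle'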
Anton B. Rang
2009-Dec-23 03:52 UTC
[zfs-discuss] Recovering ZFS stops after syseventconfd can't fork
Something over 8000 sounds vaguely like the default maximum process count. What does 'ulimit -a' show?

I don't know why you're seeing so many zfsdle processes, though; that sounds like a bug to me.
Paul Armstrong
2009-Dec-23 04:02 UTC
[zfs-discuss] Recovering ZFS stops after syseventconfd can't fork
bash-4.0# ulimit -a
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
file size               (blocks, -f) unlimited
open files                      (-n) 256
pipe size            (512 bytes, -p) 10
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 29995
virtual memory          (kbytes, -v) unlimited
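Besides the per-shell ulimits, the kernel-wide process ceilings may be worth checking, since syseventconfd forks on behalf of the whole system rather than a single shell. A hedged sketch (these are standard Solaris tunable and kstat names, but verify them on SXCE 129):

echo 'max_nprocs/D' | mdb -k           # system-wide maximum number of processes
echo 'maxuprc/D' | mdb -k              # per-user maximum number of processes
kstat -p unix:0:system_misc:nproc      # processes currently in existence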
Paul Armstrong
2009-Dec-23 04:27 UTC
[zfs-discuss] Recovering ZFS stops after syseventconfd can't fork
I'm surprised at the number as well.

Running it again, I'm seeing it jump fairly high just before the fork errors:

bash-4.0# ps -ef | grep zfsdle | wc -l
20930

(the next run of ps failed due to the fork error). So maybe it is running out of processes.

ZFS file data from ::memstat just went down to 29 MiB (from 22 GiB) too, which may or may not be related.

Message was edited by: psa
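To correlate the process explosion with the fork failures, a loop like the one below can log the zfsdle count every few seconds while the import runs (a sketch; it assumes ps itself can still fork, which clearly stops being true right at the end):

while :; do
    echo "`date '+%H:%M:%S'` `ps -e -o args= | grep -c '[z]fsdle'`"
    sleep 5
done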
Carson Gaspar
2009-Dec-23 18:39 UTC
[zfs-discuss] Recovering ZFS stops after syseventconfd can't fork
Paul Armstrong wrote:
> I'm surprised at the number as well.
>
> Running it again, I'm seeing it jump fairly high just before the fork errors:
> bash-4.0# ps -ef | grep zfsdle | wc -l
> 20930
>
> (the next run of ps failed due to the fork error).
> So maybe it is running out of processes.
>
> ZFS file data from ::memstat just went down to 29 MiB (from 22 GiB) too, which may or may not be related.
>
> Message was edited by: psa

Note that I saw the exact same thing when my pool got trashed. My fix was to rename /etc/sysevent/config/SUNW,EC_dev_status,ESC_dev_dle,sysevent.conf

I _suspect_ the problem is that the developers don't expect zfsdle to hang, so they don't bother to use a lock or check whether one is already running. They just spawn more, and more, and more...

It would be lovely if someone who understands what this creature is were to fix this rather catastrophic bug.

-- 
Carson
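For anyone wanting to try Carson's workaround, the steps amount to something like the following (a sketch; the service FMRI and the need to restart it are my assumptions, so confirm with svcs before relying on it):

# move the zfsdle handler registration out of syseventconfd's config directory
mv /etc/sysevent/config/SUNW,EC_dev_status,ESC_dev_dle,sysevent.conf \
   /etc/sysevent/config/SUNW,EC_dev_status,ESC_dev_dle,sysevent.conf.disabled

# have syseventd/syseventconfd re-read their configuration (FMRI assumed)
svcadm restart svc:/system/sysevent:default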
Paul Armstrong
2009-Dec-28 18:17 UTC
[zfs-discuss] Recovering ZFS stops after syseventconfd can't fork
Alas, even moving the file out of the way and rebooting the box (to guarantee state) didn't work:

-bash-4.0# zpool import -nfFX hds1
echo $?
-bash-4.0# echo $?
1

Do you need to be able to read all the labels for each disk in the array in order to recover?

From zdb -l on one of the disks:
--------------------------------------------
LABEL 3
--------------------------------------------
failed to unpack label 3

Thanks,
Paul
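To gauge how widespread the label damage is, each device's four labels can be dumped and checked in a loop along these lines (a sketch; the /dev/rdsk glob and s0 slice suffix are illustrative and should be adjusted to the actual c1t...d* device names in the pool):

for d in /dev/rdsk/c1t50060E8010037135d*s0; do
    bad=`zdb -l "$d" | grep -c 'failed to unpack'`
    echo "$d: $bad of 4 labels unreadable"
done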
Cindy Swearingen
2010-Jan-05 16:46 UTC
[zfs-discuss] Recovering ZFS stops after syseventconfd can't fork
Hi Paul,

I opened 6914208 to cover the sysevent/zfsdle problem.

If the system crashed due to a power failure and the disk labels for this pool were corrupted, then I think you will need to follow the steps to get the disks relabeled correctly. You might review some previous postings by Victor Latushkin that describe these steps.

Thanks,

Cindy

On 12/28/09 11:17, Paul Armstrong wrote:
> Alas, even moving the file out of the way and rebooting the box (to guarantee state) didn't work:
>
> -bash-4.0# zpool import -nfFX hds1
> echo $?
> -bash-4.0# echo $?
> 1
>
> Do you need to be able to read all the labels for each disk in the array in order to recover?
>
> From zdb -l on one of the disks:
> --------------------------------------------
> LABEL 3
> --------------------------------------------
> failed to unpack label 3
>
> Thanks,
> Paul