Hello,

I am having a problem destroying ZFS snapshots. The machine has been almost unresponsive for more than 4 hours since I started the command, and I can't run anything else during that time; I get "(bash): fork: Resource temporarily unavailable" errors. The machine still responds somewhat, but very, very slowly.

It is a P4, 2.4 GHz with 512 MB RAM, 8 x 750 GB disks as raidz, running Solaris 11 06. Creating and renaming snapshots seems to be OK; it takes < 2 seconds. There are two ZFS file systems in the pool, both using ~3.5TB out of 4.7TB. About 5 snapshots were created for each of the file systems.

I am trying to destroy one of the snapshots and the machine just dies - no panic or reboot, but I can't start anything. I still have 'top' running on an ssh terminal, and it shows the kernel using ~20% CPU and ~2MB free memory.

How long should it usually take to destroy a ~2TB snapshot (actually using ~30GB)? Is this delay expected behavior, or am I hitting a bug of some kind?

Any help is appreciated.

Thanks,
Miro

This message posted from opensolaris.org
Miroslav Pendev
2007-Apr-02  14:53 UTC
[zfs-discuss] Re: zfs destroy <snapshot> takes hours
I did some more testing, here is what I found:
- I can destroy older and newer snapshots, just not that particular snapshot
- I added some more memory (1GB total). Now, after I start the destroy command,
~500MB RAM are taken right away, and there is still ~200MB or so left.
o The machine is responsive.
o If I run 'zfs list', it shows the snapshot as destroyed (it
is gone from the list).
o There is some zfs activity for about 20 seconds - I can see the lights of the
HDDs of the pool blinking, then it stops.
o If I try to access any of the file systems of that pool - to read a file for
example - the machine stops responding (or it is very slow) and I can't
run anything.
This is what 'top' shows when the above happens:
last pid:  1168;  load avg:  0.10,  0.17,  0.25;       up 0+00:17:42   10:14:02
76 processes: 73 sleeping, 2 running, 1 on cpu
CPU states: 97.4% idle,  0.0% user,  2.6% kernel,  0.0% iowait,  0.0% swap
Memory: 1024M phys mem, 264M free mem, 2048M swap, 2048M free swap
   PID USERNAME LWP PRI NICE  SIZE   RES STATE    TIME    CPU COMMAND
  1034 root       1  59    0 1768K 1284K cpu      0:01  0.40% top
  1101 root      15  59    0  116M   45M sleep    0:08  0.08% java
  1107 root       1  59    0 1964K 1068K sleep    0:00  0.07% iostat
  1026 miro       1  59    0 7076K 1840K run      0:00  0.04% sshd
  1053 root      12  59    0   86M   13M sleep    0:00  0.03% java
     7 root      12  59    0   10M 8984K sleep    0:02  0.01% svc.startd
   276 root       1  59    0 1064K  548K sleep    0:00  0.00% utmpd
     9 root      23  59    0 9480K 8328K sleep    0:05  0.00% svc.configd
   488 root       4  59    0 2500K 1780K sleep    0:04  0.00% vold
   397 root      11  59    0 8996K 7236K sleep    0:01  0.00% cctransport
   384 root      16  59    0   10M 6952K sleep    0:00  0.00% fmd
No disk activity... It can stay in this state for hours, nothing changes.
If I 'cold reset' it, the snapshot I was trying to destroy is
back in the 'zfs list' output.
Any ideas what else to test or change are welcome.
Is there a way to get rid of that snapshot, or to check its consistency somehow?
Thanks,
Miro
 
 
You are definitely hitting a bug. Not sure which one (hopefully someone else will chime in on that). It should take mere milliseconds to destroy a snapshot regardless of size.

Do you have any disk errors? What would happen if you scrubbed the pool?

Eric
Miroslav Pendev wrote:
> I did some more testing, here is what I found:
>
> - I can destroy older and newer snapshots, just not that particular snapshot
>
> - I added some more memory total 1GB, now after I start the destroy command, ~500MB RAM are taken right away, there is still ~200MB or so left.
>
> o The machine is responsive,
>
> o If I run 'zfs list' it shows the snapshot as destroyed (it is gone from the list).
>
> o There is some zfs activity for about 20 seconds - I can see the lights of the HDDs of the pool blinking, then it stops

Can you take a crash dump when the system is "hung" (i.e. after there is no more disk activity), and make it available to me? Also, can you send the output of 'zdb -vvv <pool> <pool>' (repeat the poolname twice), and the name of the snapshot that can't be deleted?

--matt
Matthew Ahrens wrote:
> Miroslav Pendev wrote:
>> I did some more testing, here is what I found:
>>
>> - I can destroy older and newer snapshots, just not that particular
>> snapshot
>>
>> - I added some more memory total 1GB, now after I start the destroy
>> command, ~500MB RAM are taken right away, there is still ~200MB or so
>> left.
>> o The machine is responsive,
>> o If I run 'zfs list' it shows the snapshot as destroyed (it is gone
>> from the list).
>>
>> o There is some zfs activity for about 20 seconds - I can see the
>> lights of the HDDs of the pool blinking, then it stops
>
> Can you take a crash dump when the system is "hung" (ie. after there is
> no more disk activity), and make it available to me?

Miro supplied the dump, which I examined, and I filed bug 6542681. The root cause is that the machine is out of memory (in this case, kernel virtual address space). As a workaround, you can change kernelbase to allow the kernel to use more virtual address space.

--matt
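For readers wondering what changing kernelbase looks like in practice, here is a minimal sketch. It is not taken from the thread. It assumes a 32-bit x86 Solaris kernel, where kernelbase is a boot property set with `eeprom`. The value 0x80000000 is only an illustrative assumption, and the helper name `kernelbase_hint` is hypothetical. The function prints the suggested command rather than running it.

```shell
# Sketch only: prints the suggested command instead of running it.
# Assumption: 0x80000000 is an example value. A lower kernelbase gives the
# kernel more virtual address space and leaves less for each user process.
kernelbase_hint() {
    bits=$1    # address-space width, e.g. the output of `isainfo -b`
    if [ "$bits" = "32" ]; then
        echo "eeprom kernelbase=0x80000000"
    else
        echo "kernelbase workaround not applicable on a ${bits}-bit kernel"
    fi
}

# On the affected machine: kernelbase_hint "$(isainfo -b)"
# The printed eeprom command would be run as root, followed by a reboot.
```

On a 64-bit kernel the hint reports that the workaround does not apply, since kernel virtual address space is not the constraint there.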
Miroslav Pendev
2007-Apr-05  14:01 UTC
[zfs-discuss] Re: Re: zfs destroy <snapshot> takes hours
After some discussions with Matt, I removed all the snapshots older than the one causing the memory issues. Guess what - it worked. I was able to remove that snapshot after I removed all the previous ones. It took 2 seconds.

I will definitely have to upgrade that machine to 64 bit and more memory one of these days.

Once again, thanks for your help Matt!

ZFS rocks :)
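The approach that worked here (destroying the older snapshots first) can be sketched as a dry-run helper. This is an illustration under assumptions, not a command from the thread: the dataset and snapshot names are hypothetical, the function name `older_snapshot_cmds` is made up, and it only prints `zfs destroy` commands so they can be reviewed before anything is removed.

```shell
# Dry run: read snapshot names (oldest first) on stdin and print a
# 'zfs destroy' command for each snapshot of the same dataset that comes
# before TARGET. Nothing is destroyed by this function.
older_snapshot_cmds() {
    target=$1
    dataset=${target%@*}          # e.g. tank/fs from tank/fs@snap
    while IFS= read -r snap; do
        case $snap in
            "$dataset"@*) ;;      # snapshot of the same dataset: consider it
            *) continue ;;        # snapshot of another dataset: skip it
        esac
        [ "$snap" = "$target" ] && break
        printf 'zfs destroy %s\n' "$snap"
    done
}

# Hypothetical usage (names are placeholders):
#   zfs list -H -t snapshot -o name -s creation | older_snapshot_cmds tank/fs@stuck
```

Sorting with `-s creation` lists the oldest snapshots first, which is what the function relies on. Whether removing older snapshots helps in other cases is not established by this thread; it is simply what worked on this machine.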
I am having a similar problem: the system hung on zfs destroy of a snapshot, at 50% CPU utilization, running for hours. How can I know if I have the same problem? Can you be specific about how to set the kernelbase?
The release notes (http://docs.sun.com/app/docs/doc/817-0552/6mgbi4fgg?a=view) say an alternative to fixing the kernelbase is to upgrade to 64 bit. I'm already running on a 64-bit SPARC. Maybe I have a different problem: my drives have spun down to sleepy mode, but zfs is still burning coal.