Hello, I am having a problem destroying ZFS snapshots. The machine has been almost unresponsive for more than 4 hours since I started the command, and I can't run anything else during that time - I get "(bash): fork: Resource temporarily unavailable" errors. The machine is still responding somewhat, but very, very slowly.

The machine is a P4, 2.4 GHz, with 512 MB RAM and 8 x 750 GB disks as raidz, running Solaris 11/06.

Creating and renaming snapshots seem to be fine - each takes < 2 seconds. There are two ZFS file systems in the pool, both using ~3.5 TB out of 4.7 TB. About 5 snapshots were created for each of the file systems.

I am trying to destroy one of the snapshots and the machine just dies - no panic or reboot, but I can't start anything. I have 'top' still running in an ssh terminal; it shows the kernel using ~20% and ~2 MB of free memory.

How long should it usually take to destroy a ~2 TB snapshot (actually using ~30 GB)? Is this delay expected behavior, or am I hitting a bug of some kind?

Any help is appreciated.

Thanks,
Miro
Miroslav Pendev
2007-Apr-02 14:53 UTC
[zfs-discuss] Re: zfs destroy <snapshot> takes hours
I did some more testing, here is what I found:

- I can destroy older and newer snapshots, just not that particular snapshot.

- I added some more memory, total 1 GB. Now after I start the destroy command, ~500 MB RAM is taken right away; there is still ~200 MB or so left.

  o The machine is responsive.
  o If I run 'zfs list' it shows the snapshot as destroyed (it is gone from the list).
  o There is some ZFS activity for about 20 seconds - I can see the lights of the HDDs of the pool blinking - then it stops.
  o If I try to access any of the file systems of that pool - to read a file, for example - the machine stops responding (or it is very slow) and I can't run anything.

This is what 'top' shows when the above happens:

  last pid:  1168;  load avg:  0.10, 0.17, 0.25;  up 0+00:17:42    10:14:02
  76 processes: 73 sleeping, 2 running, 1 on cpu
  CPU states: 97.4% idle, 0.0% user, 2.6% kernel, 0.0% iowait, 0.0% swap
  Memory: 1024M phys mem, 264M free mem, 2048M swap, 2048M free swap

     PID USERNAME LWP PRI NICE  SIZE   RES STATE    TIME    CPU COMMAND
    1034 root       1  59    0 1768K 1284K cpu      0:01  0.40% top
    1101 root      15  59    0  116M   45M sleep    0:08  0.08% java
    1107 root       1  59    0 1964K 1068K sleep    0:00  0.07% iostat
    1026 miro       1  59    0 7076K 1840K run      0:00  0.04% sshd
    1053 root      12  59    0   86M   13M sleep    0:00  0.03% java
       7 root      12  59    0   10M 8984K sleep    0:02  0.01% svc.startd
     276 root       1  59    0 1064K  548K sleep    0:00  0.00% utmpd
       9 root      23  59    0 9480K 8328K sleep    0:05  0.00% svc.configd
     488 root       4  59    0 2500K 1780K sleep    0:04  0.00% vold
     397 root      11  59    0 8996K 7236K sleep    0:01  0.00% cctransport
     384 root      16  59    0   10M 6952K sleep    0:00  0.00% fmd

No disk activity... It can stay in this state for hours; nothing changes. If I 'cold reset' the machine, the snapshot I was trying to destroy is back in the 'zfs list' output.

Any ideas on what else to test or change are welcome. Is there a way to get rid of that snapshot, or to check its consistency somehow?

Thanks,
Miro
You are definitely hitting a bug. Not sure which one (hopefully someone else will chime in on that). It should take mere milliseconds to destroy a snapshot, regardless of size.

Do you have any disk errors? What would happen if you scrubbed the pool?

Eric
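For reference, a minimal sketch of checking for device errors and kicking off a scrub (the pool name 'tank' is just a placeholder; substitute your own):

  # Show per-device read/write/checksum error counters and any known data errors
  zpool status -v tank

  # Start a scrub; check its progress later with another 'zpool status'
  zpool scrub tank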
Miroslav Pendev wrote:
> I did some more testing, here is what I found:
>
> - I can destroy older and newer snapshots, just not that particular snapshot
>
> - I added some more memory total 1GB, now after I start the destroy command, ~500MB RAM are taken right away, there is still ~200MB or so left.
>
> o The machine is responsive,
>
> o If I run 'zfs list' it shows the snapshot as destroyed (it is gone from the list).
>
> o There is some zfs activity for about 20 seconds - I can see the lights of the HDDs of the pool blinking, then it stops

Can you take a crash dump when the system is "hung" (ie. after there is no more disk activity), and make it available to me?

Also, can you send the output of 'zdb -vvv <pool> <pool>' (repeat the pool name twice), and the name of the snapshot that can't be deleted?

--matt
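A rough sketch of how that information could be gathered on Solaris, assuming a pool named 'tank' and that a dump device and savecore are configured (check 'dumpadm' first):

  # Verify where crash dumps go and that savecore is enabled
  dumpadm

  # Option 1: save a crash dump of the live, running system
  savecore -L

  # Option 2: force a panic dump and reboot (more disruptive)
  reboot -d

  # Dump pool metadata; the pool name is repeated twice as requested above
  zdb -vvv tank tank > /var/tmp/zdb-tank.out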
Matthew Ahrens wrote:
> Miroslav Pendev wrote:
>> I did some more testing, here is what I found:
>>
>> - I can destroy older and newer snapshots, just not that particular
>> snapshot
>>
>> - I added some more memory total 1GB, now after I start the destroy
>> command, ~500MB RAM are taken right away, there is still ~200MB or so
>> left.
>> o The machine is responsive,
>> o If I run 'zfs list' it shows the snapshot as destroyed (it is gone
>> from the list).
>>
>> o There is some zfs activity for about 20 seconds - I can see the
>> lights of the HDDs of the pool blinking, then it stops
>
> Can you take a crash dump when the system is "hung" (ie. after there is
> no more disk activity), and make it available to me?

Miro supplied the dump, which I examined, and filed bug 6542681. The root cause is that the machine is out of memory (in this case, kernel virtual address space).

As a workaround, you can change kernelbase to allow the kernel to use more virtual address space.

--matt
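On 32-bit Solaris x86, changing kernelbase is typically done with the eeprom(1M) command; a sketch under that assumption (the value below is only an example to illustrate the mechanism, not a recommendation from this thread):

  # Show the current kernelbase setting
  eeprom kernelbase

  # Lower kernelbase so the kernel gets more virtual address space
  # (at the cost of less address space for 32-bit user processes)
  eeprom kernelbase=0x80000000

  # Reboot for the change to take effect
  init 6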
Miroslav Pendev
2007-Apr-05 14:01 UTC
[zfs-discuss] Re: Re: zfs destroy <snapshot> takes hours
After some discussions with Matt, I removed all the snapshots older than the one causing the memory issues. Guess what - it worked. I was able to remove that snapshot after I removed all the previous ones; it took 2 seconds.

I will definitely have to upgrade that machine to 64-bit and more memory one of these days.

Once again, thanks for your help, Matt! ZFS rocks :)
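For anyone hitting the same thing, a hypothetical sketch of that workaround - list the snapshots of the affected file system oldest first, destroy the older ones, then retry the problem snapshot. The dataset and snapshot names here are placeholders; double-check the list before destroying anything:

  # List snapshots of one file system in creation order (oldest first)
  zfs list -H -t snapshot -o name -s creation -r tank/fs

  # Destroy the older snapshots one at a time, then retry the problem one
  zfs destroy tank/fs@old-snap-1
  zfs destroy tank/fs@old-snap-2
  zfs destroy tank/fs@problem-snap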
I am having a similar problem - the system is hung on 'zfs destroy <snapshot>', at 50% CPU utilization, and it has been running for hours. How can I know if I have the same problem? Can you be specific about how to set the kernelbase?
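One way to get a rough idea of whether kernel memory is the bottleneck is to look at the kernel's memory usage while the destroy is hung; a sketch using standard Solaris tools (run as root):

  # Summary of physical memory usage by the kernel vs. everything else
  echo ::memstat | mdb -k

  # Kernel memory allocator statistics (large or fast-growing caches stand out)
  echo ::kmastat | mdb -k

  # Free memory and page scan rate over time
  vmstat 5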
The release notes:

http://docs.sun.com/app/docs/doc/817-0552/6mgbi4fgg?a=view

say an alternative to changing kernelbase is to upgrade to a 64-bit kernel - I'm already running on 64-bit SPARC. Maybe I have a different problem: my drives have spun down to sleepy mode, yet ZFS is still burning coal.
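To confirm which case applies, you can check whether the machine is actually running a 64-bit kernel (the kernelbase workaround is only relevant to the 32-bit x86 kernel); a quick check, assuming stock Solaris:

  # Reports the kernel's instruction set and whether it is 64-bit
  isainfo -kv

  # On 64-bit SPARC this typically prints something like:
  #   64-bit sparcv9 kernel modules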