Paul Kraus
2011-Sep-14 16:50 UTC
[zfs-discuss] zfs destroy snapshot runs out of memory bug
I know there was (is ?) a bug where a zfs destroy of a large snapshot would run a system out of kernel memory, but searching the list archives and on defects.opensolaris.org I cannot find it. Could someone here explain the failure mechanism in language a Sys Admin (I am NOT a developer) could understand. I am running Solaris 10 with zpool 22 and I am looking for both understanding of the underlying problem and a way to estimate the amount of kernel memory necessary to destroy a given snapshot (based on information gathered from zfs, zdb, and any other necessary commands). Thanks in advance, and sorry to bring this up again. I am almost certain I saw mention here that this bug is fixed in Solaris 11 Express and Nexenta (Oracle Support is telling me the bug is fixed in zpool 26 which is included with Solaris 10U10, but because of our use of ACLs I don''t think I can go there, and upgrading the zpool won''t help with legacy snapshots). -- {--------1---------2---------3---------4---------5---------6---------7---------} Paul Kraus -> Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ ) -> Sound Designer: Frankenstein, A New Musical (http://www.facebook.com/event.php?eid=123170297765140) -> Sound Coordinator, Schenectady Light Opera Company ( http://www.sloctheater.org/ ) -> Technical Advisor, RPI Players
Richard Elling
2011-Sep-14 18:30 UTC
[zfs-discuss] zfs destroy snapshot runs out of memory bug
On Sep 14, 2011, at 9:50 AM, Paul Kraus wrote:> I know there was (is ?) a bug where a zfs destroy of a large > snapshot would run a system out of kernel memory, but searching the > list archives and on defects.opensolaris.org I cannot find it. Could > someone here explain the failure mechanism in language a Sys Admin (I > am NOT a developer) could understand. I am running Solaris 10 with > zpool 22 and I am looking for both understanding of the underlying > problem and a way to estimate the amount of kernel memory necessary to > destroy a given snapshot (based on information gathered from zfs, zdb, > and any other necessary commands).I don''t recall a bug with that description. However, there are several bugs that relate to how the internals work that were fixed last summer and led to the on-disk format change to version 26 (Improved snapshot deletion performance). Look for details in http://src.illumos.org/source/history/illumos-gate/usr/src/uts/common/fs/zfs/ during the May-July 2010 timeframe. Methinks the most important change was 6948890 snapshot deletion can induce pathologically long spa_sync() times spa_sync() is called when the transaction group is sync''ed to permanent storage. -- richard> > Thanks in advance, and sorry to bring this up again. I am almost > certain I saw mention here that this bug is fixed in Solaris 11 > Express and Nexenta (Oracle Support is telling me the bug is fixed in > zpool 26 which is included with Solaris 10U10, but because of our use > of ACLs I don''t think I can go there, and upgrading the zpool won''t > help with legacy snapshots).Sorry, I haven''t run Solaris 10 in the past 6 years :-) can''t help you there. But I can say that NexentaStor has this bug fix in 3.0.5. For NexentaStor 3.1+ releases, zpool version is 28. -- richard -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20110914/65f77089/attachment.html>
Paul Kraus
2011-Sep-14 19:07 UTC
[zfs-discuss] zfs destroy snapshot runs out of memory bug
On Wed, Sep 14, 2011 at 2:30 PM, Richard Elling <richard.elling at gmail.com> wrote:> I don''t recall a bug with that description. However, there are several bugs that > relate to how the internals work that were fixed last summer and led to the > on-disk format change to version 26 (Improved snapshot deletion performance). > Look for details in?http://src.illumos.org/source/history/illumos-gate/usr/src/uts/common/fs/zfs/ > during the May-July 2010 timeframe. Methinks the most important change was > 6948890?snapshot deletion can induce pathologically long spa_sync() times > spa_sync() is called when the transaction group is sync''ed to permanent storage.I looked through that list, and found the following that looked applicable: 6948911 snapshot deletion can induce unsatisfiable allocations in txg sync 6948890 snapshot deletion can induce pathologically long spa_sync() times But all I get at bugs.opensolaris.org is a Service Temporarily Unavailable message (and have for at least the past few weeks). The MOS lookup of the 6948890 bug yields the title and not much else, no details. I can''t even find the 6948911 bug in MOS. MOS == My Oracle Support Thanks for the pointers, I just wish I could find more data that will lead me to either: A) a mechanism to estimate the RAM needed to destroy a pre-26 snapshot -or- B) indication that there is no way to do A.>From watching the system try to import this pool, it looks like it isstill building a kernel structure in RAM when the system runs out of RAM. It has not committed anything to disk. -- {--------1---------2---------3---------4---------5---------6---------7---------} Paul Kraus -> Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ ) -> Sound Designer: Frankenstein, A New Musical (http://www.facebook.com/event.php?eid=123170297765140) -> Sound Coordinator, Schenectady Light Opera Company ( http://www.sloctheater.org/ ) -> Technical Advisor, RPI Players
Richard Elling
2011-Sep-14 20:31 UTC
[zfs-discuss] zfs destroy snapshot runs out of memory bug
Question below? On Sep 14, 2011, at 12:07 PM, Paul Kraus wrote:> On Wed, Sep 14, 2011 at 2:30 PM, Richard Elling > <richard.elling at gmail.com> wrote: > >> I don''t recall a bug with that description. However, there are several bugs that >> relate to how the internals work that were fixed last summer and led to the >> on-disk format change to version 26 (Improved snapshot deletion performance). >> Look for details in http://src.illumos.org/source/history/illumos-gate/usr/src/uts/common/fs/zfs/ >> during the May-July 2010 timeframe. Methinks the most important change was >> 6948890 snapshot deletion can induce pathologically long spa_sync() times >> spa_sync() is called when the transaction group is sync''ed to permanent storage. > > I looked through that list, and found the following that looked applicable: > 6948911 snapshot deletion can induce unsatisfiable allocations in txg sync > 6948890 snapshot deletion can induce pathologically long spa_sync() times > > But all I get at bugs.opensolaris.org is a Service Temporarily > Unavailable message (and have for at least the past few weeks). The > MOS lookup of the 6948890 bug yields the title and not much else, no > details. I can''t even find the 6948911 bug in MOS. > > MOS == My Oracle Support > > Thanks for the pointers, I just wish I could find more data that will > lead me to either: > A) a mechanism to estimate the RAM needed to destroy a pre-26 snapshot > -or- > B) indication that there is no way to do A. > > From watching the system try to import this pool, it looks like it is > still building a kernel structure in RAM when the system runs out of > RAM. It has not committed anything to disk.Did you experience a severe memory shortfall? (Do you know how to determine that condition?) -- richard
Paul Kraus
2011-Sep-14 21:36 UTC
[zfs-discuss] zfs destroy snapshot runs out of memory bug
On Wed, Sep 14, 2011 at 4:31 PM, Richard Elling <richard.elling at gmail.com> wrote:>> From watching the system try to import this pool, it looks like it is >> still building a kernel structure in RAM when the system runs out of >> RAM. It has not committed anything to disk. > > Did you experience a severe memory shortfall? > (Do you know how to determine that condition?)T2000 with 32 GB RAM zpool that hangs the machine by running it out of kernel memory when trying to import the zpool zpool has an "incomplete" snapshot from a zfs recv that it is trying to destroy on import I *can* import the zpool readonly So the answer is yes to the severe memory shortfall. One of the many things I did to instrument this system was as simple as running vmstat 10 on the console :-) The last instance before the system hung showed a scan rate of 900,000 ! In one case I watched as it hung (it has done this many times as I have troubleshot with Oracle Support) and did not see *any* user level processes that would account for the memory shortfall. I have logs of system freemem showing the memory exhaustion. Oracle Support has confirmed (from a core dump) that it is some combination of the two bugs you mentioned (plus they created a new Bug ID for this specific problem). I have asked multiple times if the incomplete snapshot could be corrupt in a way that would cause this (early on then led us to believe the incomplete snapshot was 7 TB when it should be about 2.5 TB), but have not gotten anything substantive back (just a one line, "The snapshot is not corrupt."). What I am looking for is a way to estimate the kernel memory necessary to destroy a given snapshot so that I can see if any of the snapshots on my production server (M4000 with 16 GB) will run the machine out of memory. -- {--------1---------2---------3---------4---------5---------6---------7---------} Paul Kraus -> Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ ) -> Sound Designer: Frankenstein, A New Musical (http://www.facebook.com/event.php?eid=123170297765140) -> Sound Coordinator, Schenectady Light Opera Company ( http://www.sloctheater.org/ ) -> Technical Advisor, RPI Players
Daniel Carosone
2011-Sep-15 05:20 UTC
[zfs-discuss] zfs destroy snapshot runs out of memory bug
On Wed, Sep 14, 2011 at 05:36:53PM -0400, Paul Kraus wrote:> T2000 with 32 GB RAM > > zpool that hangs the machine by running it out of kernel memory when > trying to import the zpool > > zpool has an "incomplete" snapshot from a zfs recv that it is trying > to destroy on import > > I *can* import the zpool readonlyCan you import it booting from a newer kernel (say liveDVD), and allow that to complete the deletion? Or does this not help until the pool is upgraded past the on-disk format in question, for which it must first be imported writable? If you can import it read-only, would it be faster to just send it somewhere else? Is there a new-enough snapshot near the current data? -- Dan. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 194 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20110915/d2de38dd/attachment.bin>
Paul Kraus
2011-Sep-15 12:17 UTC
[zfs-discuss] zfs destroy snapshot runs out of memory bug
On Thu, Sep 15, 2011 at 1:20 AM, Daniel Carosone <dan at geek.com.au> wrote:> Can you import it booting from a newer kernel (say liveDVD), and allow > that to complete the deletion?I have not tried anything newer than the latest patched 5.10.> Or does this not help until the pool is > upgraded past the on-disk format in question, for which it must first > be imported writable?Support is telling me that no matter what, due to the on disk format, it will take more RAM to destroy the incomplete snapshot... and I can''t do that with the pool imported read-only, and when I try to import it read-write the import operation tries to destroy the incomplete snapshot and runs the machine out of memory.> If you can import it read-only, would it be faster to just send it > somewhere else? ?Is there a new-enough snapshot near the current data?Support has given us that as Option B, which would be viable for the backup server, if we had a spare 20+ TB of storage just sitting around. Copying off is NOT an option for production due to outage window _and_ lack of spare 20+ storage :-( -- {--------1---------2---------3---------4---------5---------6---------7---------} Paul Kraus -> Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ ) -> Sound Designer: Frankenstein, A New Musical (http://www.facebook.com/event.php?eid=123170297765140) -> Sound Coordinator, Schenectady Light Opera Company ( http://www.sloctheater.org/ ) -> Technical Advisor, RPI Players
Jim Klimov
2011-Oct-30 21:13 UTC
[zfs-discuss] zfs destroy snapshot runs out of memory bug
> I know there was (is ?) a bug where a zfs destroy of a large > snapshot would run a system out of kernel memory, but searching the > list archives and on defects.opensolaris.org I cannot find it. Could > someone here explain the failure mechanism in language a Sys Admin (I > am NOT a developer) could understand. I am running Solaris 10 with > zpool 22 and I am looking for both understanding of the underlying > problem and a way to estimate the amount of kernel memory necessary to > destroy a given snapshot (based on information gathered from zfs, zdb, > and any other necessary commands). > > Thanks in advance, and sorry to bring this up again. I am almost > certain I saw mention here that this bug is fixed in Solaris 11 > Express and Nexenta (Oracle Support is telling me the bug is fixed in > zpool 26 which is included with Solaris 10U10, but because of our use > of ACLs I don''t think I can go there, and upgrading the zpool won''t > help with legacy snapshots).Sorry, I am late. Still, as I recently posted, I have had a similar bug with oi_148a installed this spring, and it seems that box is still having it. I am trying to upgrade to oi_151a, but it has hung so far and I''m waiting for someone to get to my home and reset it. Symptoms are like what you''ve described, including the huge scanrate just before the system dies (becomes unresponsive). Also if you try running with "vmstat 1" you can see that in the last few seconds of uptime the system would go from several hundred free MBs (or even over a GB free RAM) down to under 32Mb very quickly - consuming hundreds of MBs per second. Unlike your system, my pool started with ZFSv28 (oi_148a), so any bugfixes and on-disk layout fixes relevant for ZFSv26 patches are in place already. According to my research (flushed out with the Jive Forums, so I''d repeat here) it seems that (MY SPECULATION FOLLOWS): 1) some kernel module (probably related to ZFS) takes hold of more and more RAM; 2) since it is kernel memory, it can not be swapped out; 3) since all RAM is depleted but there are requests for RAM allocation, the kernel scans all allocated memory to find candidates for swapping out (hence the high scanrate). 4) Since all RAM is now consumed by a BADLY DESIGNED kernel module which can not be swapped out, the system dies in a high-scanrate agony, because there is no RAM available to do anything. It can be "pinged" for a while, but not much more. I stress that the module is BADLY DESIGNED as it is in my current running version of the OS (I don''t know yet if it was fixed in oi_151a), because probably it is trying to build the full ZFS tree in its adressable memory - regardles of whether it can fit there. IMHO the module should try to process the pool in smaller chunks, or allow swapping out, if the hardware constraints like insufficient RAM force it to. While debugging my system, I removed the /etc/zfs/zpool.cache file and imported the pool without using a cachefile, so I could at least boot the system and do some postmortems. Further on I made an SMF service importing the pool following a configured timeout, so that I could automate the import-reboot cycles as well as intervene to abort a delayed pool import attempt and run some ZDB diags instead. I found that walking the pool with "zdb" has a similar pattern of RAM consumption (no surprise - the algorithms must have something in common with live ZFS code), however, as a userspace process it could be swapped out to disk. In my case ZDB consumed up to 20-30Gb swap and ran for about 20 hours to analyze my pool - successfully. A "zpool import" attempt halted the 8Gb system in 1 to 3 hours. However, with ZDB analysis I managed to find some counter of free blocks - those which belonged to a killed dataset. Seems that at first they are quickly marked for deletion (i.e. are not referenced by any dataset, but are still in the ZFS block tree), and then during pool''s current uptime or further import attempts, these blocks are actually walked and excluded from the ZFS tree. In my case I saw that between reboots and import attempts this counter went down by some 3 million blocks every uptime, and after a couple of stressful weeks the destroyed dataset was gone and the pool just worked on and on. So if you still have this problem, try running ZDB to see if deferred-free count is decreasing between pool import attempts: # time zdb -bsvL -e <POOL-GUID-NUMBER> ... 976K 114G 113G 172G 180K 1.01 1.56 deferred free ... In order to facilitate the process of rebooting, I made a simple watchdog which forcedly soft-resets the OS (with uadmin call) if fast memory exhaustion is detected. This is based on vmstat code, and includes an SMF service to run. Since RAM usage is only updated once per second in kernel probes, the watchdog program might not catch the problem soon enough to react. http://thumper.cos.ru/~jim/freeram-watchdog-20110610-v0.11.tgz Note that it WILL crash your system in case of RAM depletion, without syncs or service shutdowns. Since the RAM depletion happens quickly, it might not even have enough time to reset your OS. In your case with T2000 you might be better off with a hardware watchdog instead (if it doen''t "ping" the driver for too long, BMC would reset the box). //Jim
Jim Klimov
2011-Oct-30 21:37 UTC
[zfs-discuss] zfs destroy snapshot runs out of memory bug
2011-10-31 1:13, Jim Klimov ?????:> Sorry, I am late.... If my memory and GoogleCache don''t fail me too much, I ended up with the following incantations for pool-import attempts: :; echo zfs_vdev_max_pending/W0t5 | mdb -kw :; echo "aok/W 1" | mdb -kw :; echo "zfs_recover/W 1" | mdb -kw :; echo zfs_resilver_delay/W0t0 | mdb -kw :; echo zfs_resilver_min_time_ms/W0t20000 | mdb -kw :; echo zfs_txg_synctime/W0t1 | mdb -kw ### These intend to boost zfs self-repair priorities and ### allow self-repair somehow. Voodoo magic ;) :; /root/freeram-watchdog.i386 & :; time zpool import -o altroot=/pool -o cachefile=none 1601233584937321596 ### This starts the watchdog (to have some on-screen logs) ### and imports the pool bu GUID without cache file usage. :; df -k :; zfs list :; zpool list :; zpool status ### Just in case the import succeeds, these commands ### are cached by the terminal ;) //Jim
Paul Kraus
2011-Oct-31 12:28 UTC
[zfs-discuss] zfs destroy snapshot runs out of memory bug
On Sun, Oct 30, 2011 at 5:13 PM, Jim Klimov <jimklimov at cos.ru> wrote:>> ? ? I know there was (is ?) a bug where a zfs destroy of a large >> snapshot would run a system out of kernel memory, but searching the> Symptoms are like what you''ve described, including the huge scanrate > just before the system dies (becomes unresponsive). Also if you try running > with "vmstat 1" you can see that in the last few seconds of > uptime the system would go from several hundred free MBs (or even > over a GB free RAM) down to under 32Mb very quickly - consuming > hundreds of MBs per second.That is the traditional symptoms of a Solaris kernel memory bug :-)> Unlike your system, my pool started with ZFSv28 (oi_148a), so any > bugfixes and on-disk layout fixes relevant for ZFSv26 patches are > in place already.Ahhh, but jumping to the end...> In my case I saw that between reboots and import attempts this > counter went down by some 3 million blocks every uptime, and > after a couple of stressful weeks the destroyed dataset was gone > and the pool just worked on and on.So your pool does have the fix. With zpool 22 NO PROGRESS is made at all with each boot-import-habg cycle. I have an mdb command that I got from Oracle support to determine the size of the snapshot that is being destroyed. The bug in 22 is that a snapshot destroy is committed as a single TXG. In 26 this is fixed (I assume there are on disk checkpoints to permit a snapshot to be destroyed in multiple TXG). How big is / was the snapshot and dataset ? I am dealing with a 7 TB dataset and a 2.5 TB snapshot on a system with 32 GB RAM. Oracle has provided a loaner system with 128 GB RAM and it took 75 GB of RAM to destroy the problem snapshot). I had not yet posted a summary as we are still working through the overall problem (we tripped over this on the replica, now we are working on it on the production copy). -- {--------1---------2---------3---------4---------5---------6---------7---------} Paul Kraus -> Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ ) -> Sound Coordinator, Schenectady Light Opera Company ( http://www.sloctheater.org/ ) -> Technical Advisor, RPI Players
Jim Klimov
2011-Oct-31 13:07 UTC
[zfs-discuss] zfs destroy snapshot runs out of memory bug
2011-10-31 16:28, Paul Kraus wrote:> How big is / was the snapshot and dataset ? I am dealing with a 7 > TB dataset and a 2.5 TB snapshot on a system with 32 GB RAM.I had a smaller-scale problem, with datasets and snapshots sized several hundred GB, but on an 8Gb RAM system. So proportionally it seems similar ;) I have deduped data on the system, which adds to the strain of dataset removal. The plan was to save some archive data there, with few to no removals planned. But during testing of different dataset layout hierarchies, things got out of hand ;) I''ve also had an approx. 4Tb dataset to destroy (a volume where I kept another pool), but armed with the knowledge of how things are expected to fail, I did its cleanup in small steps and very few (perhaps no?) hangs while evacuating the data to the toplevel pool (which contained this volume).> Oracle has provided a loaner system with 128 GB RAM and it took 75 GB of RAM > to destroy the problem snapshot). I had not yet posted a summary as we > are still working through the overall problem (we tripped over this on > the replica, now we are working on it on the production copy).Good for you ;) Does Oracle loan such systems free to support their own foul-ups? Or do you have to pay a lease anyway? ;)
Paul Kraus
2011-Oct-31 13:41 UTC
[zfs-discuss] zfs destroy snapshot runs out of memory bug
On Mon, Oct 31, 2011 at 9:07 AM, Jim Klimov <jimklimov at cos.ru> wrote:> 2011-10-31 16:28, Paul Kraus wrote:>> Oracle has provided a loaner system with 128 GB RAM and it took 75 GB of >> RAM >> to destroy the problem snapshot). I had not yet posted a summary as we >> are still working through the overall problem (we tripped over this on >> the replica, now we are working on it on the production copy). > > Good for you ;) > Does Oracle loan such systems free to support their own foul-ups? > Or do you have to pay a lease anyway? ;)If you are paying for a support contract, _demand_ what is needed to fix the problem. If you are not paying for support, well, then you are on your own (as I believe the license says). Maybe I''ve been in this business longer than many of the folks here, but I both expect software to have bugs and I do NOT expect commercial software vendors to provide fixes for free. -- {--------1---------2---------3---------4---------5---------6---------7---------} Paul Kraus -> Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ ) -> Sound Coordinator, Schenectady Light Opera Company ( http://www.sloctheater.org/ ) -> Technical Advisor, RPI Players