As part of the ARC inception review for ZFS crypto we were asked to
follow up on PSARC/2006/370, which indicates that swap & dump will be
done using a means other than a ZVOL.

Currently I have the ZFS crypto project allowing for ephemeral keys to
support using a ZVOL as a swap device.

Since it seems that we won't be swapping on ZVOLs, I need to find out
more about how we will be providing swap and dump space in a root pool.

I suspect that the best answer to encrypted swap is that we do it
independently of which filesystem/device is being used as the swap
device - i.e. do it inside the VM system.

-- 
Darren J Moffat
Darren J Moffat wrote:
> As part of the ARC inception review for ZFS crypto we were asked to
> follow up on PSARC/2006/370 which indicates that swap & dump will be
> done using a means other than a ZVOL.
>
> Currently I have the ZFS crypto project allowing for ephemeral keys to
> support using a ZVOL as a swap device.
>
> Since it seems that we won't be swapping on ZVOLs I need to find out
> more how we will be providing swap and dump space in a root pool.

The current plan is to provide what we're calling (for lack of a
better term; I'm open to suggestions) a "pseudo-zvol". It's
preallocated space within the pool, logically concatenated by
a driver to appear like a disk or a slice. It's meant to be a
low-overhead way to emulate a slice within a pool. So no COW or
related ZFS features are provided, except for the ability to change
its size without having to re-partition a disk. A pseudo-zvol
will support both swap and dump.

It will also be possible to use a slice for swapping, just as is
done now with UFS roots. But we're hoping that the overhead of
a pseudo-zvol will be low enough that administrators will
take advantage of it to simplify installation (it allows a user
to dedicate an entire disk to a root pool, without having to
carve out part of it for swapping).

Eventually, swapping on true zvols might be supported (the
problems with swapping to zvols are considered bugs), but
fixing those bugs is a bigger task than we want to take on
for the zfs-boot project. We decided on pseudo-zvols as
a lower-risk approach for the time being.

> I suspect that the best answer to encrypted swap is that we do it
> independently of which filesystem/device is being used as the swap
> device - ie do it inside the VM system.

Treat a pseudo-zvol like you would a slice.

Lori
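The layout Lori describes - preallocated extents inside the pool, logically concatenated to present one flat device - amounts to a static block-address translation. The sketch below is a toy model of that concept only; none of these names are real Solaris or ZFS interfaces, and the real driver's internals are an assumption here.

```python
# Toy model of the "pseudo-zvol" idea: a few preallocated extents
# inside the pool, logically concatenated so they present a single
# flat block device. No COW, no snapshots -- just a static mapping
# from a logical block address to a (pool block) location.
# All names are illustrative, not actual ZFS interfaces.

class PseudoZvol:
    def __init__(self, extents):
        # extents: list of (start_block_in_pool, length_in_blocks)
        self.extents = extents

    @property
    def size(self):
        return sum(length for _, length in self.extents)

    def translate(self, lba):
        """Map a logical block address to a physical pool block."""
        for start, length in self.extents:
            if lba < length:
                return start + lba   # overwrite in place -- no COW
            lba -= length
        raise ValueError("LBA beyond device size")

# Growing the device is just appending another preallocated extent --
# no re-partitioning of the underlying disk is needed.
```

Note that because the mapping is fixed once the extents are allocated, the dump and swap paths need no allocation at write time, which is the point of the design.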
Thanks for the info. As for name suggestions, here are a few:

  RAW
  RVOL
  RZVOL

-- 
Darren J Moffat
Lori Alt wrote:
> Darren J Moffat wrote:
>> Since it seems that we won't be swapping on ZVOLs I need to find out
>> more how we will be providing swap and dump space in a root pool.
>
> The current plan is to provide what we're calling (for lack of a
> better term. I'm open to suggestions.) a "pseudo-zvol". It's
> preallocated space within the pool, logically concatenated by
> a driver to appear like a disk or a slice. It's meant to be a low
> overhead way to emulate a slice within a pool. So no COW or
> related zfs features are provided, except for the ability to change
> its size without having to re-partition a disk. A pseudo-zvol
> will support both swap and dump.
> [...]
> Treat a pseudo-zvol like you would a slice.

So these new zvol-like things don't support snapshots, etc, right?
I take it they work by allowing overwriting of the data, correct?

Are these a zslice?

<aside>
For those of us who've been swapping to zvols for some time, can
you describe the failure modes?
</aside>

- Bart

-- 
Bart Smaalders			Solaris Kernel Performance
barts at cyber.eng.sun.com	http://blogs.sun.com/barts
>> Treat a pseudo-zvol like you would a slice.
>
> So these new zvol-like things don't support snapshots, etc, right?
> I take it they work by allowing overwriting of the data, correct?

yes, and yes

> Are these a zslice?

I suppose we could call them that. That's better than pseudo-zvol.
"rzvol" has also been suggested.

> <aside>
> For those of us who've been swapping to zvols for some time, can
> you describe the failure modes?
> </aside>

See bug 6528296 (system hang while zvol swap space shorted).

Lori
On Thu, 2007-07-12 at 10:45 -0700, Bart Smaalders wrote:
> <aside>
> For those of us who've been swapping to zvols for some time, can
> you describe the failure modes?
> </aside>

I asked about this during the zfs boot inception review -- the high
level answer is occasional deadlock in low-memory situations (zfs needs
to allocate memory to free memory via pageout/swapout, but the system
doesn't have any to give zfs).

The relevant bug appears to be:

6528296 system hang while zvol swap space shorted
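The allocate-to-free cycle Bill describes can be reduced to a very small model: the pageout path itself needs a memory allocation before it can free anything, so once the free pool is empty it can never make progress. The sketch below is illustrative only; none of these names are real Solaris/ZFS interfaces.

```python
# Toy model of the swap-on-zvol deadlock: to page out (i.e. free
# memory), the ZFS write path must first allocate memory for its own
# buffers. With the free pool already empty, that allocation can never
# succeed, so no memory is ever freed -- a deadlock. A raw slice, by
# contrast, needs no allocation on its pageout path.
# All names here are illustrative, not actual kernel interfaces.

class PagePool:
    def __init__(self, free_pages):
        self.free = free_pages

    def alloc(self):
        """Return True if a page could be allocated."""
        if self.free > 0:
            self.free -= 1
            return True
        return False

    def pageout_via_zvol(self):
        """Freeing a page requires allocating an I/O buffer first."""
        if not self.alloc():          # zfs needs memory...
            return "deadlock"         # ...but there is none to give
        self.free += 2                # buffer returned + page freed
        return "freed"

    def pageout_via_slice(self):
        """A raw slice needs no allocation on the pageout path."""
        self.free += 1
        return "freed"
```

With even one free page left, the zvol path limps along; at exactly zero it wedges, while the slice path always makes forward progress - which matches the "occasional deadlock in low-memory situations" behavior described in 6528296.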
Richard Elling
2007-Jul-12 23:27 UTC
[zfs-discuss] Another zfs dataset [was: Plans for swapping to part of a pool]
Lori Alt wrote:
>>> Treat a pseudo-zvol like you would a slice.
>>
>> So these new zvol-like things don't support snapshots, etc, right?
>> I take it they work by allowing overwriting of the data, correct?
>
> yes, and yes
>
>> Are these a zslice?
>
> I suppose we could call them that. That's better than pseudo-zvol.
> "rzvol" has also been suggested.

I think we should up-level this and extend to the community for comments.
The proposal, as I see it, is to create a simple, contiguous (?) space
which sits in a zpool. As such, it does inherit the behaviour of the
zpool. But it does not inherit the behaviour of a file system or zvol
(no snapshots, copies, etc.)

While the original reason for this was swap, I have a sneaky suspicion
that others may wish for this as well, or perhaps something else.
Thoughts? (database folks, jump in :-)
 -- richard
I really don't want to bring this up but ...

Why do we still tell people to use swap volumes? Would we have the same
sort of issue with the dump device, so we need to fix it anyway?
Bill Sommerfeld
2007-Jul-13 00:11 UTC
[zfs-discuss] Another zfs dataset [was: Plans for swapping to part of a pool]
On Thu, 2007-07-12 at 16:27 -0700, Richard Elling wrote:
> I think we should up-level this and extend to the community for comments.
> The proposal, as I see it, is to create a simple,

yes

> contiguous (?)

as I understand the proposal, not necessarily contiguous.

> space which sits in a zpool. As such, it does inherit the behaviour of the
> zpool. But it does not inherit the behaviour of a file system or zvol
> (no snapshots, copies, etc.)
>
> While the original reason for this was swap, I have a sneaky suspicion
> that others may wish for this as well, or perhaps something else.
> Thoughts? (database folks, jump in :-)

the record will hopefully show that I predicted that the database folks
would want to use this when Lori described the concept during ARC
review...

- Bill
On Thu, 12 Jul 2007, Bill Sommerfeld wrote:
> I asked about this during the zfs boot inception review -- the high
> level answer is occasional deadlock in low-memory situations (zfs needs
> to allocate memory to free memory via pageout/swapout, but the system
> doesn't have any to give zfs)
>
> the relevant bug appears to be:
>
> 6528296 system hang while zvol swap space shorted

Yep - I've seen this on Sol 10 Update 3 with swap on the dedicated UFS
boot disk and a 2-way ZFS mirror that is doing everything else -
including a zvol-based swap. The box will eventually get too slow to be
usable and the only fix is to reboot. Usually this bug will be "tickled"
after running a (weekly) zpool scrub (on ~270Gb of data).

The box is an AMD x4400 with 4Gb of RAM. It drives 22" and 30" monitors
(Nvidia) and runs about 18 Gnome workspaces. It's my "window" into the
world. To date it has averaged about 6 weeks between reboots. I know
that ZFS could be tuned - but Update 4 is not too far away.

Aside from this minor irritation ... this box is a pure pleasure to work
on and ZFS (and snapshots) totally rocks.

Regards,

Al Hopper  Logical Approach Inc, Plano, TX.  al at logical-approach.com
           Voice: 972.379.2133 Fax: 972.379.2134  Timezone: US CDT
OpenSolaris Governing Board (OGB) Member - Apr 2005 to Mar 2007
http://www.opensolaris.org/os/community/ogb/ogb_2005-2007/
Bart Smaalders wrote:
> <aside>
> For those of us who've been swapping to zvols for some time, can
> you describe the failure modes?
> </aside>

I can swap fine. I can't dump. LU gets confused about them and I have
to re-add it. It is slower than swapping directly to a slice. I've
never needed to snapshot my swap :-)

-- 
Darren J Moffat
>> (for lack of a better term. I'm open to suggestions.) a
>> "pseudo-zvol". It's meant to be a low
>> overhead way to emulate a slice within a pool. So
>> no COW or related zfs features
>
> Are these a zslice?

zbart - "Don't have a CoW, man!"

This message posted from opensolaris.org
Daniel Carosone wrote:
>>> (for lack of a better term. I'm open to suggestions.) a
>>> "pseudo-zvol". It's meant to be a low
>>> overhead way to emulate a slice within a pool. So
>>> no COW or related zfs features
>> Are these a zslice?
>
> zbart - "Don't have a CoW, man!"

but we already have /usr/bin/bart :-)

-- 
Darren J Moffat
Mario Goebbels
2007-Jul-13 13:19 UTC
[zfs-discuss] Another zfs dataset [was: Plans for swapping to part of a pool]
> While the original reason for this was swap, I have a sneaky suspicion
> that others may wish for this as well, or perhaps something else.
> Thoughts? (database folks, jump in :-)

Lower-overhead storage for my QEMU volumes. I figure other filesystems
running within a ZVOL may cause a bit of havoc in regards to pool
fragmentation.

-mg
Lori Alt
2007-Jul-13 15:26 UTC
[zfs-discuss] Another zfs dataset [was: Plans for swapping to part of a pool]
Bill Sommerfeld wrote:
> On Thu, 2007-07-12 at 16:27 -0700, Richard Elling wrote:
>> I think we should up-level this and extend to the community for comments.
>> The proposal, as I see it, is to create a simple,
>
> yes
>
>> contiguous (?)
>
> as I understand the proposal, not necessarily contiguous.

Correct, not necessarily contiguous on disk.

>> space which sits in a zpool. As such, it does inherit the behaviour of the
>> zpool. But it does not inherit the behaviour of a file system or zvol
>> (no snapshots, copies, etc.)
>>
>> While the original reason for this was swap, I have a sneaky suspicion
>> that others may wish for this as well, or perhaps something else.
>> Thoughts? (database folks, jump in :-)
>
> the record will hopefully show that I predicted that the database folks
> would want to use this when Lori described the concept during ARC
> review...

Yes, you did. That was my first thought when I saw these mails, that you
had predicted this.

Since there seems to be a broader interest in this, I'm going to ask
Eric Taylor, who's designing and implementing this feature, to send out
mail describing it in more detail, since he knows a lot more about it
than I do.

Lori
Torrey McMahon wrote:
> I really don't want to bring this up but ...
>
> Why do we still tell people to use swap volumes?

Jeff Bonwick has suggested a fix to 6528296 (system hang while zvol swap
space shorted). If we can get that fixed, then it may become safe to use
true zvols for swap. I'll update the bug with his suggested fix (and
probably ping him on it too, to see if he wants to elaborate).

Even if the bug is fixed, however, we might want this new kind of zvol
for performance reasons.

But you're right. We shouldn't leave things as they are. We should
either fix the bug, or stop telling people that it's OK to use zvols
for swap.

> Would we have the same
> sort of issue with the dump device so we need to fix it anyway?

The current implementation of zvols, even if the above-mentioned bug
were fixed, will not support dump. It would be too complicated to write
to them in the panic path. That was the other reason for implementing
pseudo-zvols (or zslices, or whatever).

Lori
Darren J Moffat
2007-Jul-13 15:54 UTC
[zfs-discuss] Another zfs dataset [was: Plans for swapping to part of a pool]
Mario Goebbels wrote:
>> While the original reason for this was swap, I have a sneaky suspicion
>> that others may wish for this as well, or perhaps something else.
>> Thoughts? (database folks, jump in :-)
>
> Lower overhead storage for my QEMU volumes. I figure other filesystems
> running within a ZVOL may cause a little bit havoc in regards to pool
> fragmentation.

Interesting, for QEMU I'd actually want a real ZVOL so I can use ZFS to
snapshot the image.

-- 
Darren J Moffat
Lori Alt wrote:
> Torrey McMahon wrote:
>> I really don't want to bring this up but ...
>>
>> Why do we still tell people to use swap volumes?
>
> Jeff Bonwick has suggested a fix to 6528296 (system
> hang while zvol swap space shorted). If we can get that
> fixed, then it may become safe to use true zvols for swap.
>
> Even if the bug is fixed however, we might want this
> new kind of zvol for performance reasons.

I'd rather not use the term "performance" here because that may lead to
the wrong expectations. The reason we want something new here is for
data management policy. In some cases that policy may lead to better or
worse performance, but it will not be generally true one way or another.
Since trade-offs are involved, it may take some time before best
practices can be identified.

> But you're right. We shouldn't leave things as they are.
> We should either fix the bug, or stop telling people
> that it's OK to use zvols for swap.

We should fix the bug anyway :-)
 -- richard
Eric
2007-Jul-13 18:10 UTC
[zfs-discuss] Another zfs dataset [was: Plans for swapping to part of a pool]
Thanks for suggesting a broader discussion about the needs and possible
uses for specialized storage objects within a pool. In doing so, part of
that discussion should include the effect upon overall complexity and
manageability, as well as conceptual coherence.

In his blog post on ZFS layering
(http://blogs.sun.com/bonwick/entry/rampant_layering_violation), Jeff
Bonwick articulates the benefits of refactoring the storage stack and
eliminating the volume LBA. At the time I bought the argument and I
still want it to be so, so it's worth some effort here as well.

As I understand it, real ZVOLs are block-level access to single DMU
objects with full transactional semantics. With those comes the ability
to snapshot, etc. And with that comes the ability to sit on top of
mirrored, raidz/raidz2, or unprotected vdevs. With them also comes some
code and memory consumption, I/O ordering, ZIL usage, etc. that could be
problematic on the OS dump path and might (or might not) be overkill for
swap.

Before we name this thing, what is it conceptually? Is it a new kind of
DMU object (e.g. a "non-transactional" object)? A new module on the side
of the DMU? A bypass to direct access to the SPA? Direct access to one
or more vdevs? Many of these could be possible, and perhaps the
discussion of possible uses can help choose where it should be done, but
the choice does have long-term consequences.

Dump and swap certainly pose specialized requirements, and historically
Unix systems have made pragmatic solutions (e.g. contiguous lvols) in
this area. IMHO, it is important enough to have a solution that supports
dump and swap in a zpool that a specialized solution (aka kludge) might
be warranted, but I'd like to push hard for at least a short while to
see if there is a better, cleaner way to do this that avoids making ZFS
expose all of the rough edges of past volume managers.

A few comments:

For swap, I think that some form of data protection (mirroring or raidz)
is required.

For dump, it would be desirable (and cleaner when all of my zpool is
protected) to be able to support mirroring or raidz too.

Perhaps there's a problem with why we still configure an OS to have dump
and swap *volumes* -- is that time past?

Finally, I do not have a solution in mind, but I will note that in many
ways this reminds me of the problem of bootstrapping the slab allocator
and of avoiding allocations when freeing memory objects. When people do
cool things in the past, it raises the bar on expectations for the
future :).

Eric
Mario Goebbels
2007-Jul-14 11:05 UTC
[zfs-discuss] Another zfs dataset [was: Plans for swapping to part of a pool]
>>> While the original reason for this was swap, I have a sneaky suspicion
>>> that others may wish for this as well, or perhaps something else.
>>> Thoughts? (database folks, jump in :-)
>>
>> Lower overhead storage for my QEMU volumes. I figure other filesystems
>> running within a ZVOL may cause a little bit havoc in regards to pool
>> fragmentation.
>
> Interesting, for QEMU I'd actually want a real ZVOL so I can use ZFS to
> snapshot the image.

Naturally, that's a valid use. But the VMs I have are throw-away Windows
VMs, with data on the host via SMB. Windows loves swapping. That and ZFS
COW is something I'd rather not want combined, if not necessary.

-mg
Cyril Plisko
2007-Jul-14 18:18 UTC
[zfs-discuss] Another zfs dataset [was: Plans for swapping to part of a pool]
On 7/13/07, Darren J Moffat <Darren.Moffat at sun.com> wrote:
> Mario Goebbels wrote:
>> Lower overhead storage for my QEMU volumes. I figure other filesystems
>> running within a ZVOL may cause a little bit havoc in regards to pool
>> fragmentation.
>
> Interesting, for QEMU I'd actually want a real ZVOL so I can use ZFS to
> snapshot the image.

QEMU (or strictly speaking, a number of image formats it supports) has
its own snapshot facility, which may (or may not) be more suitable for
the task at hand. Since its snapshot facility is application-specific,
there are cases where ZFS just cannot provide anything comparable. For
example, we had a situation where a QEMU image (snapshot) resided on one
kind of storage (a really slow one), while the delta blocks generated by
writes to the volume/image were redirected to another, dedicated (very
fast) storage. Worked extremely well.

-- 
Regards,
	Cyril
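The split Cyril describes - a read-only base image on slow storage with all new writes redirected to a fast delta store - is essentially a qcow2-style backing-file arrangement, and can be sketched in a few lines. This is a conceptual model only; the class and method names below are illustrative, not QEMU APIs.

```python
# Minimal sketch of redirect-on-write over a backing image: reads
# prefer the fast delta store and fall back to the slow read-only
# base; writes never touch the base. Names are illustrative.

class OverlayImage:
    def __init__(self, base_blocks):
        self.base = base_blocks      # slow, read-only backing image
        self.delta = {}              # fast storage: written blocks only

    def write(self, block_no, data):
        # All writes land in the fast delta store.
        self.delta[block_no] = data

    def read(self, block_no):
        # Prefer the fast delta; fall back to the slow base.
        if block_no in self.delta:
            return self.delta[block_no]
        return self.base[block_no]
```

Because the base is never modified, many overlays can share one slow backing image, which is what makes this attractive for throw-away VMs.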
Lori Alt wrote:
>> Since it seems that we won't be swapping on ZVOLs I need to find out
>> more how we will be providing swap and dump space in a root pool.
>
> The current plan is to provide what we're calling (for lack of a
> better term. I'm open to suggestions.) a "pseudo-zvol". It's
> preallocated space within the pool, logically concatenated by
> a driver to appear like a disk or a slice. It's meant to be a low
> overhead way to emulate a slice within a pool. So no COW or
> related zfs features are provided, except for the ability to change
> its size without having to re-partition a disk. A pseudo-zvol
> will support both swap and dump.

Just so it is clear, what is the full list of "related features" that
won't be supported with these non-COW volumes?

For swap (but less so for dump) we MUST be able to keep RAID, and that
probably means checksumming as well (otherwise we can bring down the
system as a result of a disk failure when paging in or out).

I think for encrypted swap it is probably best we do that inside the VM
system as a separate project rather than as part of ZFS.

-- 
Darren J Moffat