Sriram Narayanan
2009-Sep-07 10:49 UTC
[zfs-discuss] Fwd: [ilugb] Does ZFS support Hole Punching/Discard
Folks:

I gave a presentation last weekend on how one could use Zones, ZFS and
Crossbow to recreate deployment scenarios on one's computer (to the
extent possible).

I've received the following question, and would like to ask the ZFS
community for answers.

-- Sriram

---------- Forwarded message ----------
From: Ritesh Raj Sarraf <rrs at researchut.com>
Date: Mon, Sep 7, 2009 at 2:20 PM
Subject: [ilugb] Does ZFS support Hole Punching/Discard
To: ilug-bengaluru at googlegroups.com

Thanks to Sriram for the nice walkthrough on "Beyond localhost".

There was one item I forgot to ask: does ZFS support hole punching? It
was only after pushing off to BP that I remembered this issue. Here's a
link about the issue and its state in Linux:
http://lwn.net/Articles/293658/

Ritesh

--
Ritesh Raj Sarraf
RESEARCHUT - http://www.researchut.com
"Necessity is the mother of invention."
Richard Elling
2009-Sep-07 16:58 UTC
[zfs-discuss] Fwd: [ilugb] Does ZFS support Hole Punching/Discard
On Sep 7, 2009, at 3:49 AM, Sriram Narayanan wrote:

> Folks:
>
> I gave a presentation last weekend on how one could use Zones, ZFS and
> Crossbow to recreate deployment scenarios on one's computer (to the
> extent possible).
>
> I've received the following question, and would like to ask the ZFS
> community for answers.
>
> -- Sriram
>
> ---------- Forwarded message ----------
> From: Ritesh Raj Sarraf <rrs at researchut.com>
> Date: Mon, Sep 7, 2009 at 2:20 PM
> Subject: [ilugb] Does ZFS support Hole Punching/Discard
> To: ilug-bengaluru at googlegroups.com
>
> Thanks to Sriram for the nice walkthrough on "Beyond localhost".
>
> There was one item I forgot to ask: does ZFS support hole punching?

I only know of "hole punching" in the context of networking. ZFS doesn't
do networking, so the pedantic answer is no.

> It was only after pushing off to BP that I remembered this issue.
> Here's a link about the issue and its state in Linux:
> http://lwn.net/Articles/293658/

This is an article about the new TRIM command. It would be important for
file systems which write their metadata to the same physical location or
use an MRU replacement algorithm. But ZFS is copy-on-write, so the
metadata is allocated from free space, and ZFS is transactional, not
directly MRU. This problem is not expected to affect ZFS file systems for
an extended period of time, and you can extend that period further by
using snapshots.

Interesting sidebar: you can measure how many times a block is rewritten
on a Solaris system, but the data collection and analysis is a rather
large task. I don't know of anyone patient enough to do it long enough to
get near the endurance of an SSD.
 -- richard
Bob Friesenhahn
2009-Sep-07 17:20 UTC
[zfs-discuss] Fwd: [ilugb] Does ZFS support Hole Punching/Discard
On Mon, 7 Sep 2009, Richard Elling wrote:
>
> This is an article about the new TRIM command. It would be important for
> file systems which write their metadata to the same physical location or
> use an MRU replacement algorithm. But ZFS is copy-on-write, so the
> metadata is allocated from free space, and ZFS is transactional, not
> directly MRU. It is

The purpose of the TRIM command is to allow the FLASH device to reclaim
and erase storage at its leisure, so that the writer does not need to
wait for erasure once the device becomes full. Otherwise the FLASH device
does not know when an area stops being used.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Richard Elling
2009-Sep-07 17:57 UTC
[zfs-discuss] Fwd: [ilugb] Does ZFS support Hole Punching/Discard
On Sep 7, 2009, at 10:20 AM, Bob Friesenhahn wrote:
> On Mon, 7 Sep 2009, Richard Elling wrote:
>>
>> This is an article about the new TRIM command. It would be important
>> for file systems which write their metadata to the same physical
>> location or use an MRU replacement algorithm. But ZFS is
>> copy-on-write, so the metadata is allocated from free space, and ZFS
>> is transactional, not directly MRU. It is
>
> The purpose of the TRIM command is to allow the FLASH device to reclaim
> and erase storage at its leisure, so that the writer does not need to
> wait for erasure once the device becomes full. Otherwise the FLASH
> device does not know when an area stops being used.

Yep, it is there to try and solve the problem of rewrites in a small
area, smaller than the bulk erase size. While it would be trivial to
traverse the spacemap and TRIM the free blocks, it might not improve
performance for COW file systems. My crystal ball says smarter flash
controllers or a form of managed flash will win and obviate the need for
TRIM entirely.
 -- richard
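To make the "traverse the spacemap and TRIM the free blocks" idea a bit
more concrete, here is a standalone sketch in C. The extent list below is
a made-up stand-in for the free segments ZFS records in its space maps,
and the printf stands in for an actual device discard; none of these
names correspond to real ZFS or driver interfaces.

/*
 * Standalone sketch: walk a list of free extents and "TRIM" each one.
 * The extent list is a stand-in for ZFS space maps; a real
 * implementation would issue a discard command to the device instead
 * of printing.
 */
#include <stdio.h>
#include <stdint.h>

typedef struct free_extent {
    uint64_t fe_offset;   /* byte offset of the free region on the device */
    uint64_t fe_length;   /* length of the free region in bytes */
} free_extent_t;

int
main(void)
{
    /* Made-up free segments, as a pool's space maps might record them. */
    free_extent_t free_space[] = {
        {  1048576,  131072 },
        {  5242880, 1048576 },
        { 16777216,  262144 },
    };
    size_t n = sizeof (free_space) / sizeof (free_space[0]);

    for (size_t i = 0; i < n; i++) {
        /* A real implementation would send a TRIM/UNMAP for this range. */
        printf("TRIM offset=%llu length=%llu\n",
            (unsigned long long)free_space[i].fe_offset,
            (unsigned long long)free_space[i].fe_length);
    }
    return (0);
}

The traversal itself is the easy part; the open question in this thread
is whether telling the device about free space actually helps a
copy-on-write file system.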
Bob Friesenhahn
2009-Sep-07 18:48 UTC
[zfs-discuss] Fwd: [ilugb] Does ZFS support Hole Punching/Discard
On Mon, 7 Sep 2009, Richard Elling wrote:
>
> Yep, it is there to try and solve the problem of rewrites in a small
> area, smaller than the bulk erase size. While it would be trivial to
> traverse the spacemap and TRIM the free blocks, it might not improve
> performance for COW file systems. My crystal ball says smarter flash
> controllers or a form of managed flash will win and obviate the need
> for TRIM entirely.

Without TRIM there is no way for the FLASH device to know that a region
of data is free and can be reclaimed. It is pretty difficult to be
intelligent without that.

Regardless, TRIM only improves the perception of performance under
relatively light loads, where the device is able to erase faster than the
writes arrive. This is important for PCs, where perception is everything.
It does not improve sustained maximum write throughput.

As far as your crystal ball goes, people here might be interested in this
article about an apparent Sun product which had its documentation
released to the Sun web site a bit too early:

http://www.theregister.co.uk/2009/09/03/sun_flash_array/

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Richard Elling
2009-Sep-07 19:23 UTC
[zfs-discuss] Fwd: [ilugb] Does ZFS support Hole Punching/Discard
On Sep 7, 2009, at 11:48 AM, Bob Friesenhahn wrote:
> On Mon, 7 Sep 2009, Richard Elling wrote:
>>
>> Yep, it is there to try and solve the problem of rewrites in a small
>> area, smaller than the bulk erase size. While it would be trivial to
>> traverse the spacemap and TRIM the free blocks, it might not improve
>> performance for COW file systems. My crystal ball says smarter flash
>> controllers or a form of managed flash will win and obviate the need
>> for TRIM entirely.
>
> Without TRIM there is no way for the FLASH device to know that a region
> of data is free and can be reclaimed. It is pretty difficult to be
> intelligent without that.

Yes, it is a trade-off for the page size mismatch. But you could manage
this by reading the page, erasing, and writing... as long as you have
some sort of nonvolatility arrangement -- hence a managed solution. TRIM
just tries to eliminate the need for a nonvolatile buffer by pushing that
decision to the OS. The interesting question is what happens when the
important page is never free? I presume you just get stuck being slow.

> Regardless, TRIM only improves the perception of performance under
> relatively light loads, where the device is able to erase faster than
> the writes arrive. This is important for PCs, where perception is
> everything. It does not improve sustained maximum write throughput.
>
> As far as your crystal ball goes, people here might be interested in
> this article about an apparent Sun product which had its documentation
> released to the Sun web site a bit too early:
>
> http://www.theregister.co.uk/2009/09/03/sun_flash_array/

The Flash Modules are well known and used in several products already.
http://www.sun.com/storage/flash/module.jsp
The rest is just packaging... (another famous last words :-)
 -- richard
Chris Csanady
2009-Sep-07 19:55 UTC
[zfs-discuss] Fwd: [ilugb] Does ZFS support Hole Punching/Discard
2009/9/7 Richard Elling <richard.elling at gmail.com>:
> On Sep 7, 2009, at 10:20 AM, Bob Friesenhahn wrote:
>
>> The purpose of the TRIM command is to allow the FLASH device to
>> reclaim and erase storage at its leisure, so that the writer does not
>> need to wait for erasure once the device becomes full. Otherwise the
>> FLASH device does not know when an area stops being used.
>
> Yep, it is there to try and solve the problem of rewrites in a small
> area, smaller than the bulk erase size. While it would be trivial to
> traverse the spacemap and TRIM the free blocks, it might not improve
> performance for COW file systems. My crystal ball says smarter flash
> controllers or a form of managed flash will win and obviate the need
> for TRIM entirely.
>  -- richard

I agree with this sentiment, although I still look forward to it being
obviated by a better memory technology instead, like PRAM.

In any case, the ATA TRIM command may not be so useful after all, as it
can't be queued:

http://lwn.net/Articles/347511/

As an aside, after a bit of digging, I came across fcntl(F_FREESP). This
will at least allow you to put the sparse back into sparse files if you
so desire. Unfortunately, I don't see any way to do this for a zvol.

Chris
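For anyone who wants to try the fcntl(F_FREESP) call Chris mentions, a
minimal sketch follows. The file path and byte range are made up for
illustration, and the exact range semantics of l_len should be checked
against the fcntl(2) man page for your release.

/*
 * Minimal sketch: deallocate (punch a hole in) a range of an existing
 * file using fcntl(F_FREESP) on Solaris.  The path and range below are
 * made up for illustration.
 */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
    int fd = open("/tank/images/guest.img", O_RDWR);
    if (fd == -1) {
        perror("open");
        return (1);
    }

    struct flock fl;
    fl.l_whence = SEEK_SET;      /* interpret l_start from start of file */
    fl.l_start  = 1024 * 1024;   /* byte offset of the region to free */
    fl.l_len    = 128 * 1024;    /* length; see fcntl(2) for range rules */

    /* Ask the file system to release the backing storage for this range. */
    if (fcntl(fd, F_FREESP, &fl) == -1)
        perror("fcntl(F_FREESP)");

    (void) close(fd);
    return (0);
}

The freed range reads back as zeroes afterwards, which is what lets a
backing file become sparse again.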
Ritesh Raj Sarraf
2009-Sep-08 04:27 UTC
[zfs-discuss] Fwd: [ilugb] Does ZFS support Hole Punching/Discard
The Discard/Trim command is also available as part of the SCSI standard
now.

If you look at it from a SAN perspective, you will need a little of both:
file systems will need to be able to deallocate blocks, and the same
deallocation should then be triggered as a SCSI Trim to the storage
controller. For a virtualized environment, the file system should be able
to punch holes into virt image files.

F_FREESP is only on XFS, to my knowledge.

So how does ZFS tackle the above two problems?
Chris Csanady
2009-Sep-08 07:40 UTC
[zfs-discuss] Fwd: [ilugb] Does ZFS support Hole Punching/Discard
2009/9/7 Ritesh Raj Sarraf <rrs at researchut.com>:
> The Discard/Trim command is also available as part of the SCSI standard
> now.
>
> If you look at it from a SAN perspective, you will need a little of
> both: file systems will need to be able to deallocate blocks, and the
> same deallocation should then be triggered as a SCSI Trim to the
> storage controller. For a virtualized environment, the file system
> should be able to punch holes into virt image files.
>
> F_FREESP is only on XFS, to my knowledge.

I found F_FREESP while looking through the OpenSolaris source, and it is
supported on all filesystems which implement VOP_SPACE.

(I was initially investigating what it would take to transform writes of
zeroed blocks into block frees on ZFS. Although it would not appear to be
too difficult, I'm not sure if it would be worth complicating the code
paths.)

> So how does ZFS tackle the above two problems?

At least for file-backed filesystems, ZFS already does its part. It is
the responsibility of the hypervisor to execute the mentioned fcntl(),
whether it is triggered by a TRIM or anything else.

ZFS does not use TRIM itself, though running ZFS on top of files is not
recommended anyway, nor is there a need for it for virtualization
purposes.

It does appear that the ATA TRIM command should be used with great care,
or avoided altogether. Not only does it need to wait for the entire queue
to empty, it can cause a delay of ~100ms if you execute TRIMs without
enough elapsed time between them. (See the thread linked from the article
I mentioned.)

As far as I can tell, Solaris is missing the equivalent of a DKIOCDISCARD
ioctl(). Something like that should be implemented to allow recovery of
space on zvols and iSCSI backing stores. (Though the latter would require
implementing SCSI TRIM support as well, if I understand correctly.)

Chris
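As a rough illustration of the "writes of zeroed blocks become block
frees" idea Chris describes, here is a small user-level sketch. The
write_or_punch() helper, the fixed 128 KB block size, and the fallback
behaviour are assumptions made for illustration; they are not part of any
ZFS code path.

/*
 * Hypothetical user-level sketch of the idea above: if a block being
 * written is entirely zero, free the underlying storage with
 * fcntl(F_FREESP) instead of writing the zeroes out.  The block size
 * and helper name are made up for illustration.
 */
#include <sys/types.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define BLOCK_SIZE (128 * 1024)

static int
is_all_zero(const char *buf, size_t len)
{
    /* First byte is zero and every byte equals its neighbour. */
    return (buf[0] == 0 && memcmp(buf, buf + 1, len - 1) == 0);
}

static ssize_t
write_or_punch(int fd, const char *buf, off_t offset)
{
    if (is_all_zero(buf, BLOCK_SIZE)) {
        struct flock fl;
        fl.l_whence = SEEK_SET;
        fl.l_start  = offset;
        fl.l_len    = BLOCK_SIZE;   /* range semantics: see fcntl(2) */
        if (fcntl(fd, F_FREESP, &fl) == 0)
            return (BLOCK_SIZE);    /* hole punched; reads back as zeroes */
        /* Fall through and write normally if the punch is not supported. */
    }
    return (pwrite(fd, buf, BLOCK_SIZE, offset));
}

Doing the same detection inside ZFS itself is what Chris suggests might
not be worth the added complexity in the write path.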
George Janczuk
2009-Nov-11 00:51 UTC
[zfs-discuss] Fwd: [ilugb] Does ZFS support Hole Punching/Discard
I've been following the use of SSDs with ZFS and HSPs for some time now,
and I am working (in an architectural capacity) with one of our IT guys
to set up our own ZFS HSP (using a J4200 connected to an X2270).

The best practice seems to be to use an Intel X25-M for the L2ARC
(Readzilla) and an Intel X25-E for the ZIL/SLOG (Logzilla).

However, whilst it is a BIG thing in the Windows 7 world, I have heard
pretty much nothing about Intel's G2 devices and updated firmware when
Intel's SSDs are used in a ZFS HSP. In particular, does ZFS use or
support the TRIM command? Is it even relevant or useful in a hierarchical
(vs. primary) storage context?

Any comment would be appreciated. Some comment from the Fishworks guys in
particular would be great!
Tim Cook
2009-Nov-11 01:21 UTC
[zfs-discuss] Fwd: [ilugb] Does ZFS support Hole Punching/Discard
On Tue, Nov 10, 2009 at 6:51 PM, George Janczuk
<georgej at objectconsulting.com.au> wrote:

> I've been following the use of SSDs with ZFS and HSPs for some time
> now, and I am working (in an architectural capacity) with one of our IT
> guys to set up our own ZFS HSP (using a J4200 connected to an X2270).
>
> The best practice seems to be to use an Intel X25-M for the L2ARC
> (Readzilla) and an Intel X25-E for the ZIL/SLOG (Logzilla).
>
> However, whilst it is a BIG thing in the Windows 7 world, I have heard
> pretty much nothing about Intel's G2 devices and updated firmware when
> Intel's SSDs are used in a ZFS HSP. In particular, does ZFS use or
> support the TRIM command? Is it even relevant or useful in a
> hierarchical (vs. primary) storage context?
>
> Any comment would be appreciated. Some comment from the Fishworks guys
> in particular would be great!

My personal thought is that it doesn't really make sense to even have it,
at least for Readzilla. In theory, you always want the SSD to be full, or
nearly full, as it's a cache. The whole point of TRIM, from my
understanding, is to speed up the drive by zeroing out unused blocks so
that the next time you try to write to them, they don't have to be
cleared first and then written to. When dealing with a cache, there
shouldn't (again, in theory) be any free blocks; a warmed cache should be
full of data.

Logzilla is kind of in the same boat: it should constantly be filling and
emptying as new data comes in. I'd imagine TRIM would just add
unnecessary overhead. It could in theory help there by zeroing out blocks
ahead of time, before a new batch of writes comes in, if you have a
period of little I/O. My thought is it would be far more work than it's
worth, but I'll let the coders decide that one.

--Tim
Bob Friesenhahn
2009-Nov-11 17:51 UTC
[zfs-discuss] Fwd: [ilugb] Does ZFS support Hole Punching/Discard
On Tue, 10 Nov 2009, Tim Cook wrote:
>
> My personal thought is that it doesn't really make sense to even have
> it, at least for Readzilla. In theory, you always want the SSD to be
> full, or nearly full, as it's a cache. The whole point of TRIM, from my
> understanding, is to speed up the drive by zeroing out unused blocks so
> that the next time you try to write to them, they don't have to be
> cleared first and then written to. When dealing with a cache, there
> shouldn't (again, in theory) be any free blocks; a warmed cache should
> be full of data.

This thought is wrong because SSDs actually have many more blocks than
they admit to in their declared size. The "extreme" or "enterprise" units
will have more extra blocks. These extra blocks are necessary in order to
replace failing blocks and to spread the write load over many more
underlying blocks, thereby decreasing the chance of failure. If a FLASH
block is to be overwritten, the device can reassign the old FLASH block
to the spare pool and update its tables so that a different FLASH block
(from the spare pool) is used for the write.

> Logzilla is kind of in the same boat: it should constantly be filling
> and emptying as new data comes in. I'd imagine TRIM would just add
> unnecessary overhead. It could in theory help there by zeroing out
> blocks ahead of time, before a new batch of writes comes in, if you
> have a period of little I/O. My thought is it would be far more work
> than it's worth, but I'll let the coders decide that one.

The "problem" with TRIM is that its goal is to decrease write latency at
low/medium writing loads, or at high load for a short duration. It does
not do anything to increase maximum sustained write performance, since
the maximum write performance then depends on how fast the device can
erase blocks. Some server environments will write to the device at close
to 100% of its capability most of the time, especially with relatively
slow devices like the X25-E.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Tim Cook
2009-Nov-11 18:07 UTC
[zfs-discuss] Fwd: [ilugb] Does ZFS support Hole Punching/Discard
On Wed, Nov 11, 2009 at 11:51 AM, Bob Friesenhahn
<bfriesen at simple.dallas.tx.us> wrote:

> On Tue, 10 Nov 2009, Tim Cook wrote:
>>
>> My personal thought is that it doesn't really make sense to even have
>> it, at least for Readzilla. In theory, you always want the SSD to be
>> full, or nearly full, as it's a cache. The whole point of TRIM, from
>> my understanding, is to speed up the drive by zeroing out unused
>> blocks so that the next time you try to write to them, they don't have
>> to be cleared first and then written to. When dealing with a cache,
>> there shouldn't (again, in theory) be any free blocks; a warmed cache
>> should be full of data.
>
> This thought is wrong because SSDs actually have many more blocks than
> they admit to in their declared size. The "extreme" or "enterprise"
> units will have more extra blocks. These extra blocks are necessary in
> order to replace failing blocks and to spread the write load over many
> more underlying blocks, thereby decreasing the chance of failure. If a
> FLASH block is to be overwritten, the device can reassign the old FLASH
> block to the spare pool and update its tables so that a different FLASH
> block (from the spare pool) is used for the write.

I'm well aware of the fact that SSD manufacturers put extra blocks into
the device to increase both performance and MTBF. I'm not sure how that
invalidates what I've said, though, or even plays a role, and you haven't
done a very good job of explaining why you think I'm wrong. TRIM is
simply letting the device know that a block has been deleted from the OS
perspective. In a caching scenario, you aren't deleting anything; you're
continually over-writing. How exactly do you foresee TRIM being useful
when the command wouldn't even be invoked?

> Logzilla is kind of in the same boat: it should constantly be filling
> and emptying as new data comes in. I'd imagine TRIM would just add
> unnecessary overhead. It could in theory help there by zeroing out
> blocks ahead of time, before a new batch of writes comes in, if you
> have a period of little I/O. My thought is it would be far more work
> than it's worth, but I'll let the coders decide that one.
>
> The "problem" with TRIM is that its goal is to decrease write latency
> at low/medium writing loads, or at high load for a short duration. It
> does not do anything to increase maximum sustained write performance,
> since the maximum write performance then depends on how fast the device
> can erase blocks. Some server environments will write to the device at
> close to 100% of its capability most of the time, especially with
> relatively slow devices like the X25-E.

Right... you just repeated what I said with different wording.

--Tim
Nicolas Williams
2009-Nov-11 18:17 UTC
[zfs-discuss] Fwd: [ilugb] Does ZFS support Hole Punching/Discard
On Mon, Sep 07, 2009 at 09:58:19AM -0700, Richard Elling wrote:
> I only know of "hole punching" in the context of networking. ZFS
> doesn't do networking, so the pedantic answer is no.

But a VDEV may be an iSCSI device, thus there can be networking below
ZFS. For some iSCSI targets (including ZVOL-based ones) a hole-punching
operation can be very useful, since it explicitly tells the backend that
some contiguous block of space can be released for allocation to others.

Nico
--
Bob Friesenhahn
2009-Nov-11 19:01 UTC
[zfs-discuss] Fwd: [ilugb] Does ZFS support Hole Punching/Discard
On Wed, 11 Nov 2009, Tim Cook wrote:
>
> I'm well aware of the fact that SSD manufacturers put extra blocks into
> the device to increase both performance and MTBF. I'm not sure how that
> invalidates what I've said, though, or even plays a role, and you
> haven't done a very good job of explaining why you think I'm wrong.
> TRIM is simply letting the device know that a block has been deleted
> from the OS perspective. In a caching scenario, you aren't deleting
> anything; you're continually over-writing. How exactly do you foresee
> TRIM being useful when the command wouldn't even be invoked?

The act of over-writing requires erasing. If the cache is going to expire
seldom-used data, it could potentially use TRIM to start erasing pages
while the new data is being retrieved from primary storage. Regardless,
it seems that smarter FLASH storage device design eliminates most of the
value offered by TRIM.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/