Hi,

Has anyone heard about any plans for ZFS to support thin devices? I'm talking about the thin device feature of SAN frames (EMC, HDS), which provides more efficient space utilization. The concept is similar to ZFS with its pool and datasets, though the pool in this case lives in the SAN frame itself, so it can be shared among different systems attached to the same frame.

This topic is really complex, but I'm sure supporting it is inevitable for enterprise customers with SAN storage. Basically it brings out the difference between space used and space allocated, which can be huge in a large environment, and that difference matters at the financial level as well.

Veritas has already added support for thin devices: first, support in VxFS to be "thin-aware" (for example, how to handle over-subscribed thin devices); then a feature called SmartMove, a nice way to migrate from fat to thin devices; and the most brilliant feature of all (my personal opinion, of course) is the Veritas Thin Device Reclamation API, which provides an interface for reporting unused space to the SAN frame at the block level.

This API is a major hit, and even though SAN vendors don't support it today, HP and HDS are already working on it, and I assume EMC will have to follow as well. With this API Veritas can keep track of deleted files, for example, and with a simple command once a day (depending on your policy) it can report the unused space back to the frame, so thin devices *remain* thin.

I really believe that ZFS should support thin devices, especially the capability this API brings to the field, as it can make a huge cost difference for enterprise customers.

Regards,
sendai
Making transactional, logging filesystems thin-provisioning aware is probably hard to do, as every new and every changed block is written to a new location. So what applies to ZFS should also apply to btrfs or NILFS or similar filesystems.

I'm not sure there is a good way to make ZFS thin-provisioning aware/friendly, so you should wait for what a ZFS developer has to say about this. Not sure about VxFS, but I think VxFS is very different in its basic design and on-disk structure.
Devzero,

Unfortunately that was my assumption as well. I don't have source-level knowledge of ZFS, but based on what I know it wouldn't be easy to do. I'm not even sure it's only a technical question rather than a design question, which would make it even less feasible.

Apart from the technical possibilities, this feature looks inevitable to me in the long run, especially for enterprise customers with high-end SANs, as cost is always a major factor in a storage design and it's a huge difference whether you pay based on space used or space allocated (for example).

I agree with you, let's wait until some ZFS developer can provide us some insightful thoughts on this topic.

Regards,
sendai
On Wed, Dec 30, 2009 at 19:23, roland <devzero at web.de> wrote:
> Making transactional, logging filesystems thin-provisioning aware is
> probably hard to do, as every new and every changed block is written
> to a new location. So what applies to ZFS should also apply to btrfs
> or NILFS or similar filesystems.

If that were a problem it would be a problem for UFS when you write new files... ZFS knows what blocks are free, and that is all you need to send to the disk system.
> Making transactional, logging filesystems thin-provisioning aware is
> probably hard to do, as every new and every changed block is written
> to a new location. So what applies to ZFS should also apply to btrfs
> or NILFS or similar filesystems.
>
> I'm not sure there is a good way to make ZFS thin-provisioning
> aware/friendly, so you should wait for what a ZFS developer has to
> say about this.

ZFS already supports thin provisioning, and has since pretty much the beginning (the earliest I've used it in is ZFSv6). I may get the terms backwards here, but if the quota property is larger than the reservation, then you have a thin-provisioned volume or filesystem. The quota sets the "disk size" or "available space" that the OS sees, while the reservation sets the space that is guaranteed to be available. As the OS uses space in the volume/fs and approaches the reservation, you just increase that value. The "total size" that the OS sees doesn't change, but the amount of guaranteed space does. This is especially useful for volumes that are exported via iSCSI.
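A minimal sketch of the quota/reservation approach described above; the pool and dataset names are examples, not taken from the thread:

    # Thin-provisioned filesystem: large quota, small reservation
    zfs create tank/data
    zfs set quota=500G tank/data         # size the consumer sees
    zfs set reservation=50G tank/data    # space guaranteed up front

    # Grow the guarantee later, as usage approaches it
    zfs set reservation=100G tank/data

The quota stays fixed from the consumer's point of view; only the guaranteed backing grows over time.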
On Dec 30, 2009, at 10:53 AM, Andras Spitzer wrote:
> Unfortunately that was my assumption as well. I don't have source-level
> knowledge of ZFS, but based on what I know it wouldn't be easy to do.
> I'm not even sure it's only a technical question rather than a design
> question, which would make it even less feasible.

It is not hard, because ZFS knows the current free list, so walking that list and telling the storage about the freed blocks isn't very hard.

What is hard is figuring out if this would actually improve life. The reason I say this is that people like to use snapshots and clones on ZFS. If you keep snapshots, then you aren't freeing blocks, so the free list doesn't grow. This is a very different use case than UFS, as an example.

There are a few minor bumps in the road. The ATA PASSTHROUGH command, which allows TRIM to pass through the SATA drivers, was just integrated into b130. This will be more important to small servers than to SANs, but the point is that all parts of the software stack need to support the effort. As such, it is not clear to me who, if anyone, inside Sun is the champion for the effort -- it crosses multiple organizational boundaries.

> Apart from the technical possibilities, this feature looks inevitable
> to me in the long run, especially for enterprise customers with
> high-end SANs, as cost is always a major factor in a storage design and
> it's a huge difference whether you pay based on space used or space
> allocated (for example).

If the high cost of SAN storage is the problem, then I think there are better ways to solve that :-)
 -- richard
On 12/30/2009 2:40 PM, Richard Elling wrote:
> There are a few minor bumps in the road. The ATA PASSTHROUGH
> command, which allows TRIM to pass through the SATA drivers, was
> just integrated into b130. This will be more important to small servers
> than to SANs, but the point is that all parts of the software stack
> need to support the effort. As such, it is not clear to me who, if
> anyone, inside Sun is the champion for the effort -- it crosses
> multiple organizational boundaries.

I'd think it more important for devices where this is an issue, namely SSDs, than for spinning rust, though use of the TRIM command, or something like it, would fix a lot of the issues I've seen with thin provisioning over the last six years or so. However, I'm not sure it's going to have much of an impact until you can get the entire stack -- application to device -- rewired to work with the concept behind it. One of the biggest issues I've seen with thin provisioning is how the applications work, and you can't fix that in the file system code.
On Wed, Dec 30, 2009 at 1:40 PM, Richard Elling <richard.elling at gmail.com> wrote:
> It is not hard, because ZFS knows the current free list, so walking
> that list and telling the storage about the freed blocks isn't very
> hard.
>
> What is hard is figuring out if this would actually improve life. The
> reason I say this is that people like to use snapshots and clones on
> ZFS. If you keep snapshots, then you aren't freeing blocks, so the free
> list doesn't grow. This is a very different use case than UFS, as an
> example.

It seems as though the oft-mentioned block rewrite capabilities needed for pool shrinking and for changing things like compression, encryption, and deduplication would also show benefit here. That is, blocks would be rewritten in such a way as to minimize the number of chunks of storage that are allocated. The current HDS chunk size is 42 MB.

The most benefit would seem to come from having ZFS make a point of reusing old but freed blocks before doing an allocation that causes the back-end storage to allocate another chunk of disk to the thin-provisioned LUN. While it is important to be able to roll back a few transactions in the event of some widely discussed failure modes, it is probably reasonable to reuse a block freed by a txg that is 3,000 txgs old (about 1 day old at 1 txg per 30 seconds). Such a threshold could be used to determine whether to reuse a block or venture into previously untouched regions of the disk.

This strategy would allow the SAN administrator (who is a different person than the sysadmin) to allocate extra space to servers, while the sysadmin controls the amount of space really used via quotas. In the event that there is an emergency need for more space, the sysadmin can increase the quota and allow more of the allocated SAN space to be used. Assuming the block rewrite feature comes to fruition, this emergency growth could be shrunk back down to the original size once the surge in demand (or errant process) subsides.

> There are a few minor bumps in the road. The ATA PASSTHROUGH
> command, which allows TRIM to pass through the SATA drivers, was
> just integrated into b130. [...]
>
> If the high cost of SAN storage is the problem, then I think there are
> better ways to solve that :-)

The "SAN" could be an OpenSolaris device serving LUNs through COMSTAR. If those LUNs are used to hold a zpool, the zpool could notify the LUN that blocks are no longer used, and the "SAN" could reclaim those blocks.
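A rough sketch of the quota-based split of responsibilities Mike describes, with hypothetical dataset names and sizes; the SAN side presents an over-provisioned thin LUN, and the sysadmin caps real consumption from the host:

    # Day-to-day cap on a pool that sits on a 1 TB thin LUN
    zfs set quota=400G tank/apps

    # Emergency: allow more of the allocated-but-unused SAN space to be consumed
    zfs set quota=600G tank/apps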
This is just a variant of the same problem faced with expensive SAN devices that have thin provisioning allocation units measured in the tens of megabytes instead of hundreds to thousands of kilobytes. -- Mike Gerdts http://mgerdts.blogspot.com/
Richard,

That's an interesting question, whether it's worth it or not. I guess the question is always who the targets for ZFS are (I assume everyone, though in reality priorities have to be set, as developer resources are limited). For a home office, no doubt thin provisioning is not much use; for an enterprise company the numbers might really make a difference if we look at space used vs space allocated.

There are some studies showing that thin provisioning can reduce the physical space used by up to 30%, which is huge. (Even though I understand studies are not real life, and thin provisioning is not viable in every environment.)

Btw, I would like to discuss scenarios where, even though we have an over-subscribed pool in the SAN (meaning the overall space allocated to the systems is more than the physical space in the pool), with proper monitoring and proactive physical drive additions we never let any systems/applications attached to the SAN realize that we have thin devices.

Actually, that's why I believe configuring thin devices without periodically reclaiming space is just a time bomb; if you have the option to periodically reclaim space, you can maintain the pool in the SAN in a really efficient way. That's why I consider Veritas' Thin Reclamation API a milestone in the thin device field.

Anyway, only the future can tell whether thin provisioning will be a major feature in the storage world, but since Veritas already added this feature, I was wondering if ZFS has it at least on its roadmap.

Regards,
sendai
To some extent it already does.

If what you're talking about is filesystems/datasets, then all filesystems within a pool share the same free space, which is functionally very similar to each filesystem within the pool being thin-provisioned. To get a "thick" filesystem, you'd need to set at least the filesystem's reservation, and probably a quota as well. Basically, filesystems within a pool are thin by default, with the added bonus that space freed within a single filesystem is available for use in any other filesystem within the pool.

If you're talking about volumes provisioned from a pool, then volumes can be provisioned as "sparse", which is pretty much the same thing.

And if you happen to be providing iSCSI LUNs from files rather than volumes, then those files can be created sparse as well.

Reclaiming space from sparse volumes and files is not so easy, unfortunately!

If you're talking about the pool itself being thin... that's harder to do, although if you really needed it, I guess you could provision your pool from an array that itself provides thin provisioning.

Regards,
Tristan

On 30/12/2009 9:34 PM, Andras Spitzer wrote:
> Hi,
>
> Has anyone heard about any plans for ZFS to support thin devices? I'm
> talking about the thin device feature of SAN frames (EMC, HDS), which
> provides more efficient space utilization. The concept is similar to
> ZFS with its pool and datasets, though the pool in this case lives in
> the SAN frame itself, so it can be shared among different systems
> attached to the same frame. [...]
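For reference, a minimal sketch of the sparse options Tristan mentions; pool, dataset, and file names are illustrative only:

    # Sparse ("thin") volume: volsize is 100G but no space is reserved up front
    zfs create -s -V 100G tank/iscsi-lun0
    zfs get volsize,refreservation,used tank/iscsi-lun0

    # Sparse file to back a file-based iSCSI LUN; Solaris mkfile -n
    # sets the size without allocating the blocks
    mkfile -n 100g /tank/luns/lun1.img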
Now this is getting interesting :-)...

On Dec 30, 2009, at 12:13 PM, Mike Gerdts wrote:
> It seems as though the oft-mentioned block rewrite capabilities needed
> for pool shrinking and for changing things like compression,
> encryption, and deduplication would also show benefit here. That is,
> blocks would be rewritten in such a way as to minimize the number of
> chunks of storage that are allocated. The current HDS chunk size is
> 42 MB.

Good observation, Mike. ZFS divides a leaf vdev into approximately 200 metaslabs. Space is allocated in a metaslab, and at some point another metaslab will be chosen. The assumption is made that the outer tracks of a disk have higher bandwidth than the inner tracks, so allocations should be biased towards lower-numbered metaslabs. Let's ignore, for the moment, that SSDs, and to some degree RAID arrays, don't exhibit this behavior.

OK, so here's how it works, in a nutshell. Space is allocated in the same metaslab until it fills or becomes "fragmented", and then the next metaslab is used. You can see this in my "Spacemaps from Space" blog,
http://blogs.sun.com/relling/entry/space_maps_from_space
where in the lower-numbered tracks (towards the bottom) you can see occasional, small blank areas.
Note to self: a better picture would be useful :-)
Note: copies are intentionally spread to other, distant metaslabs for diversity.

Inside the metaslab, space is allocated on a first-fit basis until the space is mostly consumed, and then the algorithm changes to best-fit. The algorithm for these two decisions was changed in b129, in an effort to improve performance.

So, the questions that arise are:

Should the allocator be made aware of the chunk size of virtual storage vdevs? [hint: there is evidence of the intention to permit different allocators in the source, but I dunno if there is an intent to expose those through an interface.]

If the allocator can change, what sorts of policies should be implemented? Examples include:
	+ should the allocator stick with best-fit and encourage more
	  gangs when the vdev is virtual?
	+ should the allocator be aware of an SSD's page size? Is
	  said page size available to an OS?
	+ should the metaslab boundaries align with virtual storage
	  or SSD page boundaries?

And, perhaps most important, how can this be done automatically so that system administrators don't have to be rocket scientists to make a good choice?
 -- richard
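For readers who want to look at this on their own pool, zdb can dump the metaslab layout and space maps (read-only; the exact flags and output format vary by build, so treat this as an assumption to check against your zdb man page):

    # Summarize metaslabs per vdev: offsets, sizes, free space
    zdb -m tank

    # Repeat the flag for more detailed space map output
    zdb -mm tank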
Ack... I've just re-read your original post. :-) It's clear you are talking about support for thin devices behind the pool, not features inside the pool itself. Mea culpa.

So I guess we wait for TRIM to be fully supported... :-)

T.

On 31/12/2009 8:09 AM, Tristan Ball wrote:
> To some extent it already does.
>
> If what you're talking about is filesystems/datasets, then all
> filesystems within a pool share the same free space, which is
> functionally very similar to each filesystem within the pool being
> thin-provisioned. To get a "thick" filesystem, you'd need to set at
> least the filesystem's reservation, and probably a quota as well.
> Basically, filesystems within a pool are thin by default, with the
> added bonus that space freed within a single filesystem is available
> for use in any other filesystem within the pool.
>
> If you're talking about volumes provisioned from a pool, then volumes
> can be provisioned as "sparse", which is pretty much the same thing.
>
> Reclaiming space from sparse volumes and files is not so easy,
> unfortunately! [...]
On Dec 30, 2009, at 12:25 PM, Andras Spitzer wrote:
> That's an interesting question, whether it's worth it or not. [...]
>
> Anyway, only the future can tell whether thin provisioning will be a
> major feature in the storage world, but since Veritas already added
> this feature, I was wondering if ZFS has it at least on its roadmap.

Thin provisioning is absolutely, positively a wonderful, good thing! The question is, how does the industry handle the multitude of thin provisioning models, each layered on top of another? For example, here at the ranch I use VMware and Xen, which thinly provision virtual disks. I do this over iSCSI to a server running ZFS, which thinly provisions the iSCSI target. If I had a virtual RAID array, I would probably use that, too. Personally, I think being thinner closer to the application wins over being thinner closer to dumb storage devices (disk drives).

BTW, I do not see an RFE for this on http://bugs.opensolaris.org
Would you be so kind as to file one?
 -- richard
On Wed, Dec 30, 2009 at 3:12 PM, Richard Elling <richard.elling at gmail.com> wrote:
> If the allocator can change, what sorts of policies should be
> implemented? Examples include:
>	+ should the allocator stick with best-fit and encourage more
>	  gangs when the vdev is virtual?
>	+ should the allocator be aware of an SSD's page size? Is
>	  said page size available to an OS?
>	+ should the metaslab boundaries align with virtual storage
>	  or SSD page boundaries?

Wandering off topic a little bit... Should the block size be a tunable, so that it can match the page size of SSDs (typically 4K, right?) and of upcoming hard disks that sport a sector size > 512 bytes?

http://arc.opensolaris.org/caselog/PSARC/2008/769/final_spec.txt

> And, perhaps most important, how can this be done automatically
> so that system administrators don't have to be rocket scientists
> to make a good choice?

Didn't you read the marketing literature? ZFS is easy because you only need to know two commands: zpool and zfs. If you just ignore all the subcommands, the options to those subcommands, the evil tuning that is sometimes needed, and the effects of redundancy choices, then there is no need for any rocket scientists. :)

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
On 30 dec 2009, at 22.45, Richard Elling wrote:
> Thin provisioning is absolutely, positively a wonderful, good thing!
> The question is, how does the industry handle the multitude of thin
> provisioning models, each layered on top of another? For example, here
> at the ranch I use VMware and Xen, which thinly provision virtual
> disks. I do this over iSCSI to a server running ZFS, which thinly
> provisions the iSCSI target. If I had a virtual RAID array, I would
> probably use that, too. Personally, I think being thinner closer to the
> application wins over being thinner closer to dumb storage devices
> (disk drives).

I don't get it - why do we need anything more magic (or complicated) than support for TRIM from the filesystems and the storage systems?

I don't see why TRIM would be hard to implement for ZFS either, except that you may want to keep data from a few txgs back just for safety, which would probably call for some two-stage freeing of data blocks (those free blocks that are to be TRIMmed, and those that already are).

/ragge
On Wed, 30 Dec 2009, Mike Gerdts wrote:
> Should the block size be a tunable, so that it can match the page size
> of SSDs (typically 4K, right?) and of upcoming hard disks that sport a
> sector size > 512 bytes?

Enterprise SSDs are still in their infancy. The actual page size of an SSD could be almost anything. Due to the lack of seek time concerns and the high cost of erasing a page, an SSD could be designed with a level of indirection so that multiple logical writes to disjoint offsets could be combined into a single SSD physical page. Likewise, a large logical block could be subdivided into multiple SSD pages, which are allocated on demand. Logic is cheap and SSDs are full of logic, so it seems reasonable that future SSDs will do this, if they don't already, since similar logic enables wear-leveling.

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
On Dec 30, 2009, at 2:24 PM, Ragnar Sundblad wrote:
> On 30 dec 2009, at 22.45, Richard Elling wrote:
>> Thin provisioning is absolutely, positively a wonderful, good thing!
>> The question is, how does the industry handle the multitude of thin
>> provisioning models, each layered on top of another? [...]
>> Personally, I think being thinner closer to the application wins over
>> being thinner closer to dumb storage devices (disk drives).
>
> I don't get it - why do we need anything more magic (or complicated)
> than support for TRIM from the filesystems and the storage systems?

TRIM is just one part of the problem (or solution, depending on your point of view). The TRIM command is part of the T10 protocols that allows a host to tell a block device that the data in a set of blocks is no longer of any value, and the block device can destroy that data without adverse consequence.

In a world with copy-on-write and without snapshots, it is obvious that there will be a lot of blocks running around that are no longer in use. Snapshots (and their clones) change that use case. So in a world of snapshots, there will be fewer blocks which are not used. Remember, the TRIM command is very important to OSes like Windows or OSX which do not have file systems that are copy-on-write or have decent snapshots. OTOH, ZFS does copy-on-write, and lots of ZFS folks use snapshots.

That said, adding TRIM support is not hard in ZFS. But it depends on lower-level drivers to pass the TRIM commands down the stack.
These ducks are lining up now.

> I don't see why TRIM would be hard to implement for ZFS either,
> except that you may want to keep data from a few txgs back just
> for safety, which would probably call for some two-stage freeing
> of data blocks (those free blocks that are to be TRIMmed, and
> those that already are).

Once a block is freed in ZFS, it is no longer needed. So the "problem" of TRIM in ZFS is not related to the recent txg commit history. The issue is that traversing the free block list has to be protected by locks, so that the file system does not allocate a block while it is also TRIMming the block. Not so difficult, as long as the TRIM occurs relatively quickly.

I think that any TRIM implementation should be an administration command, like scrub. It probably doesn't make sense to have it running all of the time. But on occasion, it might make sense. My concern is that people will have an expectation that they can use snapshots and TRIM -- the former reduces the effectiveness of the latter. As the price of storing bytes continues to decrease, will the cost of not TRIMming be a long-term issue? I think not.
 -- richard
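To make the "administration command, like scrub" idea concrete: such an interface might look like the existing scrub subcommand. Note that no "zpool trim" subcommand existed at the time of this thread; the second command below is purely hypothetical:

    # Existing on-demand maintenance pass
    zpool scrub tank

    # Hypothetical on-demand pass that reports the pool's free space
    # back to the underlying devices
    zpool trim tank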
Let me sum up my thoughts on this topic.

To Richard [relling]: I agree with you that this topic is even more confusing if we are not careful enough to specify exactly what we are talking about. Thin provisioning can be done in multiple layers, and though you said you like it to be closer to the app than closer to the dumb disks (if you were referring to SAN), my opinion is that each and every scenario has its own pros/cons. I learned a long time ago not to declare a technology good/bad; there are technologies which are used properly (usually declared as good tech) and others which are not (usually declared as bad).

Let me clarify my case, and why I mentioned thin devices on SAN specifically. Many people replied about the thin device support of ZFS (which is called sparse volumes, if I'm correct), but what I was talking about is something else. It's thin device "awareness" on the SAN.

In this case you configure your LUN in the SAN as a thin device, a virtual LUN (or LUNs) backed by a pool of physical disks in the SAN. From the OS it's transparent, and so it is from the Volume Manager/Filesystem point of view.

That is the basic definition of my scenario with thin devices on SAN. High-end SAN frames like the HDS USP-V (feature called "Hitachi Dynamic Provisioning") and EMC Symmetrix V-Max (feature called "Virtual Provisioning") support this (and I'm sure many others as well). Once you discover the LUN in the OS, you start to use it: put it under the Volume Manager, create a filesystem, copy files; but the SAN only allocates physical blocks (more precisely groups of blocks called extents) as you write them, which means you use only as much (or a bit more, rounded up to the next extent) on the physical disks as you use in reality.

From this standpoint we can define two terms, thin-friendly and thin-hostile environments. Thin-friendly would be any environment where the OS/VM/FS doesn't write to blocks it doesn't really use (for example, during initialization it doesn't fill up the LUN with a pattern or zeros).

That's why Veritas' SmartMove is a nice feature: when you move from fat to thin devices (from the OS both LUNs look exactly the same), it copies only the blocks which are used by the VxFS files.

That is still just the basics of having thin devices on SAN and hoping for a thin-friendly environment. The next level is the management of the thin devices and of the physical pool the thin devices allocate their extents from.

Even if you get migrated to thin device LUNs, your thin devices will become fat again; if you fill up your filesystem even once, the thin device on the SAN will remain fat, and no space reclamation happens by default. The reason is pretty simple: the SAN storage has no knowledge of the filesystem structure, so it can't decide whether a block is still in use or can be released back to the pool.

Then came Veritas with this brilliant idea of building a bridge between the FS and the SAN frame (this became the Thin Reclamation API), so they can communicate which blocks are indeed not in use.

I really would like you to read this Quick Note from Veritas about this feature; it explains the concept far better than I did:
http://ftp.support.veritas.com/pub/support/products/Foundation_Suite/338546.pdf

Btw, in this concept VxVM can even detect (via an ASL) whether a LUN is thin device/thin reclamation capable or not.

Honestly, I have mixed feelings about ZFS.
I feel that this is obviously the VM/filesystem of the future, but at the same time I realize the roles of the individual parts in the big picture are getting mixed up. Am I the only one with the impression that ZFS will sooner or later evolve into a SAN OS, and the zfs and zpool commands will become just lightweight interfaces to control the SAN frame? :-) (like Solution Enabler for EMC)

If you ask me, the pool concept always works more efficiently if 1) you have more capacity in the pool and 2) you have more systems sharing the pool; that's why I see the thin device pool as more rational in a SAN frame.

Anyway, I'm sorry if you were already aware of what I explained above. I also hope I didn't offend anyone with my views.

Regards,
sendai
On 31 dec 2009, at 06.01, Richard Elling wrote:
> On Dec 30, 2009, at 2:24 PM, Ragnar Sundblad wrote:
>> I don't get it - why do we need anything more magic (or complicated)
>> than support for TRIM from the filesystems and the storage systems?
>
> TRIM is just one part of the problem (or solution, depending on your
> point of view). The TRIM command is part of the T10 protocols that
> allows a host to tell a block device that the data in a set of blocks
> is no longer of any value, and the block device can destroy that data
> without adverse consequence.
>
> In a world with copy-on-write and without snapshots, it is obvious that
> there will be a lot of blocks running around that are no longer in use.
> Snapshots (and their clones) change that use case. So in a world of
> snapshots, there will be fewer blocks which are not used. Remember, the
> TRIM command is very important to OSes like Windows or OSX which do not
> have file systems that are copy-on-write or have decent snapshots.
> OTOH, ZFS does copy-on-write, and lots of ZFS folks use snapshots.

I don't believe that there is such a big difference between those cases. Sure, snapshots may keep more data on disk, but only as much as the user chooses to keep.
There have been other ways to keep old data on disk before (RCS, Solaris patch backout blurbs, logs, caches, what have you), so there is not really a brand new world there. (BTW, once upon a time, real operating systems had (optional) file versioning built into the operating system or file system itself.)

If there were a mechanism that always tended to keep all of the disk full, that would be another case. Snapshots may do that with the autosnapshot and warn-and-clean-when-getting-full features of OpenSolaris, but servers especially will probably not be managed that way; they will probably have a much more controlled snapshot policy. (Especially if you want to save every possible bit of disk space, as those guys with the big fantastic and ridiculously expensive storage systems always want to do - maybe that will change in the future, though.)

> That said, adding TRIM support is not hard in ZFS. But it depends on
> lower-level drivers to pass the TRIM commands down the stack. These
> ducks are lining up now.

Good.

>> I don't see why TRIM would be hard to implement for ZFS either,
>> except that you may want to keep data from a few txgs back just
>> for safety, which would probably call for some two-stage freeing
>> of data blocks (those free blocks that are to be TRIMmed, and
>> those that already are).
>
> Once a block is freed in ZFS, it is no longer needed. So the "problem"
> of TRIM in ZFS is not related to the recent txg commit history.

It may be that you want to save a few txgs back, so that if you get a failure where parts of the last txg get lost, you will still be able to get an old (a few seconds/minutes) version of your data back. This could happen if the sync commands aren't correctly implemented all the way down (as we have seen some stories about on this list). Maybe someone disabled syncing somewhere to improve performance.

It could also happen if a "non-volatile" caching device, such as a storage controller, breaks in some bad way. Or maybe you just had a bad/old battery/supercap in a device that implements NV storage with batteries/supercaps.

> The issue is that traversing the free block list has to be protected by
> locks, so that the file system does not allocate a block while it is
> also TRIMming the block. Not so difficult, as long as the TRIM
> occurs relatively quickly.
>
> I think that any TRIM implementation should be an administration
> command, like scrub. It probably doesn't make sense to have it
> running all of the time. But on occasion, it might make sense.

I am not sure why it shouldn't run at all times, except for the fact that it seems to be badly implemented in some SATA devices, with high latencies, so that it will interrupt any data streaming to/from the disks. On a general-purpose system that may not be an issue, since you may read a lot from cache anyway, and synced writes may wait a little without anyone even noticing. On a special system that needs streaming performance, it might be interesting to trim only on certain occasions, but then you will probably have a service window for it, with a start and stop time, so you need to be able to control the trimming process pretty exactly for this feature to be interesting. It may turn out that such systems are better served by not trimming at all.
On a laptop, on the other hand, you typically don't have a service window and have no idea when it would be a good time to start TRIMing, so continuous TRIMing may be the best option.

> My concern is that people will have an expectation that they can
> use snapshots and TRIM -- the former reduces the effectiveness
> of the latter.

In my experience, disks tend to get full one way or another anyway if you don't manage your data. I don't really see that snapshots change that a whole lot.

> As the price of storing bytes continues to decrease,
> will the cost of not TRIMming be a long-term issue? I think not.
> -- richard

Maybe, maybe not. Storage will always have a cost; not even OpenStorage has really changed that by orders of magnitude (yet, at least).

Also, currently, when SSDs for some very strange reason are constructed from flash chips designed for firmware and slowly changing configuration data, and can only erase in very large chunks, TRIMing is good for the housekeeping in the SSD drive. A typical use case for this would be a laptop.

Happy new year, everybody!

/ragge s
On 31 dec 2009, at 00.31, Bob Friesenhahn wrote:
> Enterprise SSDs are still in their infancy. The actual page size of an
> SSD could be almost anything. Due to the lack of seek time concerns and
> the high cost of erasing a page, an SSD could be designed with a level
> of indirection so that multiple logical writes to disjoint offsets
> could be combined into a single SSD physical page. Likewise, a large
> logical block could be subdivided into multiple SSD pages, which are
> allocated on demand. Logic is cheap and SSDs are full of logic, so it
> seems reasonable that future SSDs will do this, if they don't already,
> since similar logic enables wear-leveling.

I believe that almost all flash devices are already doing this, and only the first-generation SD cards or something like that are not doing it and leave it to the host. But I could be wrong, of course.

/ragge s
On 31 dec 2009, at 10.43, Andras Spitzer wrote:
> Then came Veritas with this brilliant idea of building a bridge between
> the FS and the SAN frame (this became the Thin Reclamation API), so
> they can communicate which blocks are indeed not in use.

This is exactly what TRIM is for, but it could be implemented in a very lightweight, general-purpose way in all operating systems, file systems, and storage devices. Once things implement this, sparsing/thinning out disks will be a non-issue. It will be the same simple mechanism, useful all the way from the laptop to the enterprise virtual server environment. I don't see any need for a big complicated system for this.

> Honestly, I have mixed feelings about ZFS. I feel that this is
> obviously the VM/filesystem of the future, but at the same time I
> realize the roles of the individual parts in the big picture are
> getting mixed up. Am I the only one with the impression that ZFS will
> sooner or later evolve into a SAN OS, and the zfs and zpool commands
> will become just lightweight interfaces to control the SAN frame? :-)
> (like Solution Enabler for EMC)

ZFS et al. can already be a "SAN frame" today; that is what the OpenStorage product family is. (Not with all the bells and whistles of some of the other systems, but a lot cheaper (not ridiculously expensive, only very expensive (list price, which no one pays, of course)).)

Another possible interim solution for sparsing (thinning) out disks, if you have dedup or compression in your storage thingy: write large files of, for example, zeros over the free space on the clients and then remove them again; those blocks will dedup and/or compress nicely.

/ragge s
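A rough sketch of that zero-fill trick on a client filesystem; the path is illustrative, it only helps if the backing storage dedups or compresses zero-filled blocks, and it temporarily fills the filesystem until the filler file is removed:

    # Fill the free space with zeros (dd stops when the filesystem is full),
    # then delete the filler so the zeroed blocks become free again
    dd if=/dev/zero of=/mnt/data/zerofill bs=1M
    sync
    rm /mnt/data/zerofill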
On Thu, 31 Dec 2009, Ragnar Sundblad wrote:
> Also, currently, when SSDs for some very strange reason are constructed
> from flash chips designed for firmware and slowly changing
> configuration data, and can only erase in very large chunks, TRIMing is
> good for the housekeeping in the SSD drive. A typical use case for this
> would be a laptop.

I have heard quite a few times that TRIM is "good" for SSD drives, but I don't see much actual use for it. Every responsible SSD drive maintains a reserve of unused space (20-50%), since it is needed for wear leveling and to repair failing spots. This means that even when an SSD is 100% full, it still has considerable space remaining.

A very simple SSD design solution is that when an SSD block is "overwritten", it is replaced with an already-erased block from the free pool, and the old block is submitted to the free pool for eventual erasure and re-use. This approach avoids adding erase times to the write latency as long as the device can erase as fast as the average data write rate.

There are of course SSDs with hardly any (or no) reserve space, but while we might be willing to sacrifice an image or two to SSD block failure in our digital camera, that is just not acceptable for serious computer use.

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Just an update: I finally found some technical details about this Thin Reclamation API
(http://blogs.hds.com/claus/2009/12/i-love-it-when-a-plan-comes-together.html):

"This week (December 7th), Symantec announced their 'completing the thin provisioning ecosystem' that includes the necessary API calls for the file system to 'notify' the storage array when space is 'deleted'. The interface is a previously disused and now revised/reused/repurposed SCSI command (called Write Same) which was jointly worked out with Symantec, Hitachi, and 3PAR. This command allows the file systems (in this case Veritas VxFS) to notify the storage systems that space is no longer occupied. How cool is that! There is also a subcommittee of INCITS T10 studying the standardization of this, and SNIA is also studying this. It won't be long before most file systems, databases, and storage vendors adopt this technology."

So it's based on the SCSI Write Same/UNMAP command (and if I understand correctly, SATA TRIM is similar to this from the FS point of view), a standard which is not ratified yet.

Also, happy new year to everyone!

Regards,
sendai
On 31 dec 2009, at 17.18, Bob Friesenhahn wrote:
> I have heard quite a few times that TRIM is "good" for SSD drives, but
> I don't see much actual use for it. Every responsible SSD drive
> maintains a reserve of unused space (20-50%), since it is needed for
> wear leveling and to repair failing spots. This means that even when an
> SSD is 100% full, it still has considerable space remaining.

(At least as long as those blocks aren't used up in place of bad/worn-out blocks...)

> A very simple SSD design solution is that when an SSD block is
> "overwritten", it is replaced with an already-erased block from the
> free pool, and the old block is submitted to the free pool for eventual
> erasure and re-use. This approach avoids adding erase times to the
> write latency as long as the device can erase as fast as the average
> data write rate.

This is what they do, as far as I have understood, but more free space to play with makes the job easier and therefore faster, and gives you larger burst headroom before you hit the erase-speed limit of the disk.

> There are of course SSDs with hardly any (or no) reserve space, but
> while we might be willing to sacrifice an image or two to SSD block
> failure in our digital camera, that is just not acceptable for serious
> computer use.

I think the idea is that with TRIM you can also use the file system's unused space for wear leveling and flash block filling. If your disk is completely full, there is of course no gain.

/ragge s
On Dec 31, 2009, at 1:43 AM, Andras Spitzer wrote:
> Let me sum up my thoughts on this topic.
>
> To Richard [relling]: I agree with you that this topic is even more
> confusing if we are not careful enough to specify exactly what we are
> talking about. Thin provisioning can be done in multiple layers, and
> though you said you like it to be closer to the app than closer to the
> dumb disks (if you were referring to SAN), my opinion is that each and
> every scenario has its own pros/cons. I learned a long time ago not to
> declare a technology good/bad; there are technologies which are used
> properly (usually declared as good tech) and others which are not
> (usually declared as bad).

I hear you. But you are trapped thinking about 20th century designs, and ZFS is a 21st century design. More below...

> Let me clarify my case, and why I mentioned thin devices on SAN
> specifically. Many people replied about the thin device support of ZFS
> (which is called sparse volumes, if I'm correct), but what I was
> talking about is something else. It's thin device "awareness" on the
> SAN.
>
> In this case you configure your LUN in the SAN as a thin device, a
> virtual LUN (or LUNs) backed by a pool of physical disks in the SAN.
> From the OS it's transparent, and so it is from the Volume
> Manager/Filesystem point of view.
>
> That is the basic definition of my scenario with thin devices on SAN.
> High-end SAN frames like the HDS USP-V (feature called "Hitachi Dynamic
> Provisioning") and EMC Symmetrix V-Max (feature called "Virtual
> Provisioning") support this (and I'm sure many others as well). Once
> you discover the LUN in the OS, you start to use it: put it under the
> Volume Manager, create a filesystem, copy files; but the SAN only
> allocates physical blocks (more precisely groups of blocks called
> extents) as you write them, which means you use only as much (or a bit
> more, rounded up to the next extent) on the physical disks as you use
> in reality.
>
> From this standpoint we can define two terms, thin-friendly and
> thin-hostile environments. Thin-friendly would be any environment where
> the OS/VM/FS doesn't write to blocks it doesn't really use (for
> example, during initialization it doesn't fill up the LUN with a
> pattern or zeros).
>
> That's why Veritas' SmartMove is a nice feature: when you move from fat
> to thin devices (from the OS both LUNs look exactly the same), it
> copies only the blocks which are used by the VxFS files.

ZFS does this by design. There is no way in ZFS to not do this. I suppose it could be touted as a "feature" :-)
Maybe we should brand ZFS as "THINbyDESIGN(TM)"
Or perhaps we can rebrand SMARTMOVE(TM) as TRYINGTOCATCHUPWITHZFS(TM) :-)

> That is still just the basics of having thin devices on SAN and hoping
> for a thin-friendly environment. The next level is the management of
> the thin devices and of the physical pool the thin devices allocate
> their extents from.
>
> Even if you get migrated to thin device LUNs, your thin devices will
> become fat again; if you fill up your filesystem even once, the thin
> device on the SAN will remain fat, and no space reclamation happens by
> default. The reason is pretty simple: the SAN storage has no knowledge
> of the filesystem structure, so it can't decide whether a block is
> still in use or can be released back to the pool.
Then came Veritas with this brilliant idea of building a > bridge between the FS and the SAN frame (this became the Thin > Reclamation API), so they can communicate which blocks are not in > use indeed. > > I really would like you to read this Quick Note from Veritas about > this feature, it will explain way better the concept as I did : http://ftp.support.veritas.com/pub/support/products/Foundation_Suite/338546.pdf > > Btw, in this concept VxVM can even detect (via ASL) whether a LUN is > thin device/thin device reclamation capable or not.Correct. Since VxVM and VxFS are separate software, they have expanded the interface between them. Consider adding a mirror or replacing a drive. Prior to SMARTMOVE, VxVM had no idea what part of the volume was data and what was unused. So VxVM would silver the mirror by copying all of the blocks from one side to the other. Clearly this is uncool when your SAN storage is virtualized. With SMARTMOVE, VxFS has a method to tell VxVM that portions of the volume are unused. Now when you silver the mirror, VxVM knows that some bits are unused and it won''t bother to copy them. This is a bona fide good thing for virtualized SAN arrays. ZFS was designed with the knowledge that the limited interface between file systems and volume managers was a severe limitation that leads to all sorts of complexity and angst. So a different design is needed. ZFS has fully integrated RAID with the file system, so there is no need, by design, to create a new interface between these layers. In other words, the only way to silver a disk in ZFS is to silver the data. You can''t silver unused space. There are other advantages as well. For example, in ZFS silvers are done in time order, which has benefits for recovery when devices are breaking all around you. Jeff describes this rather nicely in his blog: http://blogs.sun.com/bonwick/entry/smokin_mirrors In short. ZFS doesn''t need SMARTMOVE because it doesn''t have the antiquated view of storage management that last century''s designs had. Also, ZFS users who don''t use snapshots could benefit from TRIM.> Honestly I have mixed feeling about ZFS. I feel that this is > obviously the future''s VM/Filesystem, but then I realize in the same > time the roles of the individual parts in the big picture are > getting mixed up. Am I the only one with the impression that ZFS > sooner or later will evolve to a SAN OS, and the zfs, zpool commands > will only become some lightweight interfaces to control the SAN > frame? :-) (like Solution Enabler for EMC)I don''t see that evolution. But I''ve always contended that storage arrays are just specialized servers which speak a limited set of protocols. After all, there is no such thing as "hardware RAID," all RAID is done in software. So my crystal ball says that such limited server OSes will have a hard life ahead of them.> If you ask me the pool concept always works more efficient if 1# you > have more capacity in the pool 2# if you have more systems to share > the pool, that''s why I see the thin device pool more rational in a > SAN frame. > > Anyway, I''m sorry if you were already aware what I explained above, > I also hope I didn''t offend anyone with my views,I have a much simpler view of VxFS and VxVM. They are neither open source nor free, but they are so last century :-) -- richard
[I TRIMmed the thread a bit ;-)] On Dec 31, 2009, at 1:43 AM, Ragnar Sundblad wrote:> On 31 dec 2009, at 06.01, Richard Elling wrote: >> >> In a world with copy-on-write and without snapshots, it is obvious >> that >> there will be a lot of blocks running around that are no longer in >> use. >> Snapshots (and their clones) changes that use case. So in a world of >> snapshots, there will be fewer blocks which are not used. Remember, >> the TRIM command is very important to OSes like Windows or OSX >> which do not have file systems that are copy-on-write or have decent >> snapshots. OTOH, ZFS does copy-on-write and lots of ZFS folks use >> snapshots. > > I don''t believe that there is such a big difference between those > cases.The reason you want TRIM for SSDs is to recover the write speed. A freshly cleaned page can be written faster than a dirty page. But in COW, you are writing to new pages and not rewriting old pages. This is fundamentally different than FAT, NTFS, or HFS+, but it is those markets which are driving TRIM adoption. [TRIMmed]>> Once a block is freed in ZFS, it no longer needs it. So the "problem" >> of TRIM in ZFS is not related to the recent txg commit history. > > It may be that you want to save a few txgs back, so if you get > a failure where parts of the last txg gets lost, you will still be > able to get an old (few seconds/minutes) version of your data back.This is already implemented. Blocks freed in the past few txgs are not returned to the freelist immediately. This was needed to enable uberblock recovery in b128. So TRIMming from the freelist is safe.> This could happen if the sync commands aren''t correctly implemented > all the way (as we have seen some stories about on this list). > Maybe someone disabled syncing somewhere to improve performance. > > It could also happen if a "non volatile" caching device, such as > a storage controller, breaks in some bad way. Or maybe you just > had a bad/old battery/supercap in a device that implements > NV storage with batteries/supercaps. > >> The >> issue is that traversing the free block list has to be protected by >> locks, so that the file system does not allocate a block when it is >> also TRIMming the block. Not so difficult, as long as the TRIM >> occurs relatively quickly. >> >> I think that any TRIM implementation should be an administration >> command, like scrub. It probably doesn''t make sense to have it >> running all of the time. But on occasion, it might make sense. > > I am not sure why it shouldn''t run at all times, except for the > fact that it seems to be badly implemented in some SATA devices > with high latencies, so that it will interrupt any data streaming > to/from the disks.I don''t see how it would not have negative performance impacts. -- richard
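A minimal sketch of the deferred-free behaviour Richard describes (blocks freed in the last few txgs are held back, and only older frees would be safe TRIM candidates). The data structure and the constant below are assumptions for illustration, not the actual ZFS implementation.

from collections import deque

TXG_DEFER = 3   # hypothetical: hold frees from the last 3 txgs off the freelist

class FreeTracker:
    def __init__(self):
        self.pending = deque([set()])   # one set of freed blocks per recent txg
        self.trim_ready = set()         # old enough to return/TRIM safely

    def free(self, blocks):
        self.pending[-1].update(blocks)

    def commit_txg(self):
        self.pending.append(set())
        while len(self.pending) > TXG_DEFER:
            self.trim_ready |= self.pending.popleft()

t = FreeTracker()
t.free({100, 101}); t.commit_txg()
t.free({200}); t.commit_txg()
t.commit_txg(); t.commit_txg()
print(sorted(t.trim_ready))   # the older frees [100, 101, 200] are now TRIM-safe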
Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:> I have heard quite a few times that TRIM is "good" for SSD drives but > I don''t see much actual use for it. Every responsible SSD drive > maintains a reserve of unused space (20-50%) since it is needed for > wear leveling and to repair failing spots. This means that even when > a SSD is 100% full it still has considerable space remaining. A very > simple SSD design solution is that when a SSD block is "overwritten" > it is replaced with an already-erased block from the free pool and the > old block is submitted to the free pool for eventual erasure and > re-use. This approach avoids adding erase times to the write latency > as long as the device can erase as fast as the average data write > rate.The question in the case of SSDs is: ZFS is COW, but does the SSD know which block is "in use" and which is not? If the SSD did know whether a block is in use, it could erase unused blocks in advance. But what is an "unused block" on a filesystem that supports snapshots? From the perspective of the SSD I see only the following difference between a COW filesystem and a conventional filesystem. A conventional filesystem may write more often to the same block number than a COW filesystem does. But even for the non-COW case, I would expect that the SSD frequently remaps overwritten blocks to previously erased spares. My conclusion is that ZFS on a SSD works fine in the case that the primary used blocks plus all active snapshots use less space than the official size - the spare reserve from the SSD. If you however fill up the medium, I expect a performance degradation. Jörg -- EMail:joerg at schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin js at cs.tu-berlin.de (uni) joerg.schilling at fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/ URL: http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
Richard Elling <richard.elling at gmail.com> wrote:> The reason you want TRIM for SSDs is to recover the write speed. > A freshly cleaned page can be written faster than a dirty page. > But in COW, you are writing to new pages and not rewriting old > pages. This is fundamentally different than FAT, NTFS, or HFS+, > but it is those markets which are driving TRIM adoption.Your mistake is to assume a maiden SSD and not to think about what''s happening after the SSD has been in use for a while. Even for the COW case, blocks are reused after some time and the "disk" has no way to know in advance which blocks are still in use and which blocks are no longer used and may be prepared for being overwritten. Jörg -- EMail:joerg at schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin js at cs.tu-berlin.de (uni) joerg.schilling at fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/ URL: http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
On 31 dec 2009, at 19.26, Richard Elling wrote:> [I TRIMmed the thread a bit ;-)] > > On Dec 31, 2009, at 1:43 AM, Ragnar Sundblad wrote: >> On 31 dec 2009, at 06.01, Richard Elling wrote: >>> >>> In a world with copy-on-write and without snapshots, it is obvious that >>> there will be a lot of blocks running around that are no longer in use. >>> Snapshots (and their clones) changes that use case. So in a world of >>> snapshots, there will be fewer blocks which are not used. Remember, >>> the TRIM command is very important to OSes like Windows or OSX >>> which do not have file systems that are copy-on-write or have decent >>> snapshots. OTOH, ZFS does copy-on-write and lots of ZFS folks use >>> snapshots. >> >> I don''t believe that there is such a big difference between those >> cases. > > The reason you want TRIM for SSDs is to recover the write speed. > A freshly cleaned page can be written faster than a dirty page. > But in COW, you are writing to new pages and not rewriting old > pages. This is fundamentally different than FAT, NTFS, or HFS+, > but it is those markets which are driving TRIM adoption.Flash SSDs actually always remap new writes into a only-append-to-new-pages style, pretty much as ZFS does itself. So for a SSD there is no big difference between ZFS and filesystems as UFS, NTFS, HFS+ et al, on the flash level they all work the same. The reason is that there is no way for it to rewrite single disk blocks, it can only fill up already erased pages of 512K (for example). When the old blocks get mixed with unused blocks (because of block rewrites, TRIM or Write Many/UNMAP), it needs to compact the data by copying all active blocks from those pages into previously erased pages, and there write the active data compacted/continuos. (When this happens, things tend to get really slow.) So TRIM is just as applicable to ZFS as any other file system for flash SSD, there is no real difference.> [TRIMmed] > >>> Once a block is freed in ZFS, it no longer needs it. So the "problem" >>> of TRIM in ZFS is not related to the recent txg commit history. >> >> It may be that you want to save a few txgs back, so if you get >> a failure where parts of the last txg gets lost, you will still be >> able to get an old (few seconds/minutes) version of your data back. > > This is already implemented. Blocks freed in the past few txgs are > not returned to the freelist immediately. This was needed to enable > uberblock recovery in b128. So TRIMming from the freelist is safe.I see, very good!>> This could happen if the sync commands aren''t correctly implemented >> all the way (as we have seen some stories about on this list). >> Maybe someone disabled syncing somewhere to improve performance. >> >> It could also happen if a "non volatile" caching device, such as >> a storage controller, breaks in some bad way. Or maybe you just >> had a bad/old battery/supercap in a device that implements >> NV storage with batteries/supercaps. >> >>> The >>> issue is that traversing the free block list has to be protected by >>> locks, so that the file system does not allocate a block when it is >>> also TRIMming the block. Not so difficult, as long as the TRIM >>> occurs relatively quickly. >>> >>> I think that any TRIM implementation should be an administration >>> command, like scrub. It probably doesn''t make sense to have it >>> running all of the time. But on occasion, it might make sense. 
>> >> I am not sure why it shouldn''t run at all times, except for the >> fact that it seems to be badly implemented in some SATA devices >> with high latencies, so that it will interrupt any data streaming >> to/from the disks. > > I don''t see how it would not have negative performance impacts.It will, I am sure! But *if* the user for one reason or another wants TRIM, it can not be assumed that TRIMing major bunches at certain times is any better than trimming small amounts all the time. Both behaviors may be useful, but I find it hard to see a really good use case where you want batch trimming, but easy to see cases where continuous trimming could be useful and hopefully hardly noticeable thanks to the file system caching. /ragge s
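The compaction step described above (copying still-live blocks out of partially valid erase pages before erasing them) can be sketched roughly as follows; the sizes and the victim-selection policy are hypothetical, and no particular controller works exactly this way.

# Sketch of flash garbage collection: pick the erase page with the fewest
# live blocks, relocate those blocks, then erase the page to reclaim space.

def collect_one_page(pages, free_pages):
    """pages: dict of page_id -> set of live block offsets."""
    victim = min(pages, key=lambda p: len(pages[p]))    # fewest live blocks
    live = pages.pop(victim)
    copied = len(live)                                  # the write-amplification cost
    if live:
        dest = free_pages.pop()
        pages[dest] = set(range(copied))                # live data rewritten, compacted
    free_pages.append(victim)                           # victim erased and reusable
    return copied

pages = {0: set(range(10)), 1: set(range(100)), 2: set(range(60))}
print("copied", collect_one_page(pages, free_pages=[7, 8]), "live blocks to free one page")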
On Dec 31, 2009, at 13:44, Joerg Schilling wrote:> ZFS is COW, but does the SSD know which block is "in use" and which > is not? > > If the SSD did know whether a block is in use, it could erase unused > blocks > in advance. But what is an "unused block" on a filesystem that > supports > snapshots?Personally, I think that at some point in the future there will need to be a command telling SSDs that the file system will take care of handling blocks, as new FS designs will be COW. ZFS is the first "mainstream" one to do it, but Btrfs is there as well, and it looks like Apple will be making its own FS. Just as the first 4096-byte block disks are silently emulating 4096-to-512 blocks, SSDs are currently re-mapping LBAs behind the scenes. Perhaps in the future there will be a setting to say "no really, I''m talking about the /actual/ LBA 123456".
On Thu, Dec 31 at 16:53, David Magda wrote:>Just as the first 4096-byte block disks are silently emulating 4096 - >to-512 blocks, SSDs are currently re-mapping LBAs behind the scenes. >Perhaps in the future there will be a setting to say "no really, I''m >talking about the /actual/ LBA 123456".What, exactly, is the "/actual/ LBA 123456" on a modern SSD? --eric -- Eric D. Mudama edmudama at mail.bounceswoosh.org
On Thu, Dec 31 at 10:18, Bob Friesenhahn wrote:>There are of course SSDs with hardly any (or no) reserve space, but >while we might be willing to sacrifice an image or two to SSD block >failure in our digital camera, that is just not acceptable for >serious computer use.Some people are doing serious computing on devices with 6-7% reserve. Devices with less enforced reserve will be significantly cheaper per exposed gigabyte, independent of all other factors, and always give the user the flexibility to increase their effective reserve by destroking the working area a little or a lot. If someone just needs blazing fast read access and isn''t expecting to put more than a few cycles/day on their devices, small reserve MLC drives may be very cost effective and just as fast as their 20-30% reserve SLC counterparts. -- Eric D. Mudama edmudama at mail.bounceswoosh.org
On 31 dec 2009, at 22.53, David Magda wrote:> On Dec 31, 2009, at 13:44, Joerg Schilling wrote: > >> ZFS is COW, but does the SSD know which block is "in use" and which is not? >> >> If the SSD did know whether a block is in use, it could erase unused blocks >> in advance. But what is an "unused block" on a filesystem that supports >> snapshots?Snapshots make no difference - when you delete the last dataset/snapshot that references a file you also delete the data. Snapshots are a way to keep more files around; they are not really a way to keep the disk entirely full or anything like that. There is obviously no problem distinguishing between used and unused blocks, and zfs (or btrfs or similar) makes no difference.> Personally, I think that at some point in the future there will need to be a command telling SSDs that the file system will take care of handling blocks, as new FS designs will be COW. ZFS is the first "mainstream" one to do it, but Btrfs is there as well, and it looks like Apple will be making its own FS.That could be an idea, but there still will be holes after deleted files that need to be reclaimed. Do you mean it would be a major win to have the file system take care of the space reclaiming instead of the drive?> Just as the first 4096-byte block disks are silently emulating 4096-to-512 blocks, SSDs are currently re-mapping LBAs behind the scenes. Perhaps in the future there will be a setting to say "no really, I''m talking about the /actual/ LBA 123456".A typical flash page size is 512 KB. You probably don''t want to use all the physical pages, since those could be worn out or bad, so those need to be remapped (or otherwise avoided) at some level anyway. These days, typically disks do the remapping without the host computer knowing (both SSDs and rotating rust). I see the possible win that you could always use all the working blocks on the disk, and when blocks go bad your disk will shrink. I am not sure that is really what people expect, though. Apart from that, I am not sure what the gain would be. Could you elaborate on why this would be called for? /ragge
On Jan 1, 2010, at 03:30, Eric D. Mudama wrote:> On Thu, Dec 31 at 16:53, David Magda wrote: >> Just as the first 4096-byte block disks are silently emulating >> 4096-to-512 blocks, SSDs are currently re-mapping LBAs behind the >> scenes. Perhaps in the future there will be a setting to say "no >> really, I''m talking about the /actual/ LBA 123456". > > What, exactly, is the "/actual/ LBA 123456" on a modern SSD?It doesn''t exist currently because of the behind-the-scenes re-mapping that''s being done by the SSD''s firmware. While arbitrary to some extent, an "actual" LBA would presumably be the number of a particular cell in the SSD.
On Jan 1, 2010, at 04:33, Ragnar Sundblad wrote:> I see the possible win that you could always use all the working > blocks on the disk, and when blocks go bad your disk will shrink. > I am not sure that is really what people expect, though. Apart from > that, I am not sure what the gain would be. > > Could you elaborate on why this would be called for?Currently you have SSDs that look like disks, but under certain circumstances the OS / FS know that it isn''t rotating rust--in which case the TRIM command is then used by the OS to help the SSD''s allocation algorithm(s). If the file system is COW, and knows about SSDs via TRIM, why not just skip the middle-man and tell the SSD "I''ll take care of managing blocks". In the ZFS case, I think it''s a logical extension of how RAID is handled: ZFS'' system is much more helpful in most cases than hardware- / firmware-based RAID, so it''s generally best just to expose the underlying hardware to ZFS. In the same way ZFS already does COW, so why bother with the SSD''s firmware doing it when giving extra knowledge to ZFS could be more useful?
On 1 jan 2010, at 14.14, David Magda wrote:> On Jan 1, 2010, at 04:33, Ragnar Sundblad wrote: > >> I see the possible win that you could always use all the working >> blocks on the disk, and when blocks goes bad your disk will shrink. >> I am not sure that is really what people expect, though. Apart from >> that, I am not sure what the gain would be. >> >> Could you elaborate on why this would be called for? > > Currently you have SSDs that look like disks, but under certain circumstances the OS / FS know that it isn''t rotating rust--in which case the TRIM command is then used by the OS to help the SSD''s allocation algorithm(s).(Note that TRIM and equivalents are not only useful on SSDs, but on other storage too, such as when using sparse/thin storage.)> If the file system is COW, and knows about SSDs via TRIM, why not just skip the middle-man and tell the SSD "I''ll take care of managing blocks". > > In the ZFS case, I think it''s a logical extension of how RAID is handling: ZFS'' system is much more helpful in most case that hardware- / firmware-based RAID, so it''s generally best just to expose the underlying hardware to ZFS. In the same way ZFS already does COW, so why bother with the SSD''s firmware doing it when giving extra knowledge to ZFS could be more useful?But that would only move the hardware specific and dependent flash chip handling code into the file system code, wouldn''t it? What is won with that? As long as the flash chips have larger pages than the file system blocks, someone will have to shuffle around blocks to reclaim space, why not let the one thing that knows the hardware and also is very close to the hardware do it? And if this is good for SSDs, why isn''t it as good for rotating rust? /ragge s
On Jan 1, 2010, at 11:04, Ragnar Sundblad wrote:> But that would only move the hardware specific and dependent flash > chip handling code into the file system code, wouldn''t it? What > is won with that? As long as the flash chips have larger pages than > the file system blocks, someone will have to shuffle around blocks > to reclaim space, why not let the one thing that knows the hardware > and also is very close to the hardware do it? > > And if this is good for SSDs, why isn''t it as good for rotating rust?Don''t really see how things are either hardware specific or dependent. COW is COW. Am I missing something? It''s done by code somewhere in the stack, if the FS knows about it, it can lay things out in sequential writes. If we''re talking about 512 KB blocks, ZFS in particular would create four 128 KB txgs--and 128 KB is simply the currently #define''d size, which can be changed in the future. One thing you gain is perhaps not requiring to have as much of a reserve. At most you have some hidden bad block re-mapping, similar to rotating rust nowadays. If you''re shuffling blocks around, you''re doing a read-modify-write, which if done in the file system, you could use as a mechanism to defrag on-the-fly or to group many small files together. Not quite sure what you mean by your last question.
On Dec 31, 2009, at 12:59 PM, Ragnar Sundblad wrote:> Flash SSDs actually always remap new writes into a > only-append-to-new-pages style, pretty much as ZFS does itself. > So for a SSD there is no big difference between ZFS and > filesystems as UFS, NTFS, HFS+ et al, on the flash level they > all work the same.> The reason is that there is no way for it to rewrite single > disk blocks, it can only fill up already erased pages of > 512K (for example). When the old blocks get mixed with unused > blocks (because of block rewrites, TRIM or Write Many/UNMAP), > it needs to compact the data by copying all active blocks from > those pages into previously erased pages, and there write the > active data compacted/continuos. (When this happens, things tend > to get really slow.)However, the quantity of small, overwritten pages is vastly different. I am not convinced that a workload that generates few overwrites will be penalized as much as a workload that generates a large number of overwrites. I think most folks here will welcome good, empirical studies, but thus far the only one I''ve found is from STEC and their disks behave very well after they''ve been filled and subjected to a rewrite workload. You get what you pay for. Additional pointers are always appreciated :-) http://www.stec-inc.com/ssd/videos/ssdvideo1.php -- richard
On Fri, 1 Jan 2010, David Magda wrote:> > It doesn''t exist currently because of the behind-the-scenes re-mapping that''s > being done by the SSD''s firmware. > > While arbitrary to some extent, an "actual" LBA would presumably be the number > of a particular cell in the SSD.There seems to be some severe misunderstanding of what a SSD is. This severe misunderstanding leads one to assume that a SSD has a "native" blocksize. SSDs (as used in computer drives) are comprised of many tens of FLASH memory chips which can be laid out and mapped in whatever fashion the designers choose to do. They could be mapped sequentially, in parallel, a combination of the two, or perhaps even change behavior depending on use. Individual FLASH devices usually have a much smaller page size than 4K. A 4K write would likely be striped across several/many FLASH devices. The construction of any given SSD is typically a closely-held trade secret and the vendor will not reveal how it is designed. You would have to chip away the epoxy yourself and reverse-engineer in order to gain some understanding of how a given SSD operates and even then it would be mostly guesswork. It would be wrong for anyone here, including someone who has participated in the design of an SSD, to claim that they know how a "SSD" will behave unless they have access to the design of that particular SSD. Bob -- Bob Friesenhahn bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
On Fri, Jan 1, 2010 at 11:17 AM, Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:> On Fri, 1 Jan 2010, David Magda wrote: >> >> It doesn''t exist currently because of the behind-the-scenes re-mapping >> that''s being done by the SSD''s firmware. >> >> While arbitrary to some extent, an "actual" LBA would presumably be the >> number of a particular cell in the SSD. > > There seems to be some severe misunderstanding of what a SSD is. This severe > misunderstanding leads one to assume that a SSD has a "native" blocksize. > SSDs (as used in computer drives) are comprised of many tens of FLASH > memory chips which can be laid out and mapped in whatever fashion the > designers choose to do. They could be mapped sequentially, in parallel, a > combination of the two, or perhaps even change behavior depending on use. > Individual FLASH devices usually have a much smaller page size than 4K. A > 4K write would likely be striped across several/many FLASH devices. > > The construction of any given SSD is typically a closely-held trade secret > and the vendor will not reveal how it is designed. You would have to chip > away the epoxy yourself and reverse-engineer in order to gain some > understanding of how a given SSD operates and even then it would be mostly > guesswork. > > It would be wrong for anyone here, including someone who has participated in > the design of an SSD, to claim that they know how a "SSD" will behave unless > they have access to the design of that particular SSD. >The main issue is that most flash devices support 128k byte pages, and the smallest "chunk" (for want of a better word) of flash memory that can be written is a page - or 128kb. So if you have a write to an SSD that only changes 1 byte in one 512 byte "disk" sector, the SSD controller has to either read/re-write the affected page or figure out how to update the flash memory with the minimum effect on flash wear. If one didn''t have to worry about flash wear levelling, one could read/update/write the affected page all day long..... And, to date, flash writes are much slower than flash reads - which is another basic property of the current generation of flash devices. For anyone who is interested in getting more details of the challenges with flash memory, when used to build solid state drives, reading the tech data sheets on the flash memory devices will give you a feel for the basic issues that must be solved. Bob''s point is well made. The specifics of a given SSD implementation will make the performance characteristics of the resulting SSD very difficult to predict or even describe - especially as the device hardware and firmware continue to evolve. And some SSDs change the algorithms they implement on-the-fly - depending on the characteristics of the current workload and of the (inbound) data being written. There are some links to well written articles in the URL I posted earlier this morning: http://www.anandtech.com/storage/showdoc.aspx?i=3702 Regards, -- Al Hopper Logical Approach Inc,Plano,TX al at logical-approach.com Voice: 972.379.2133 Timezone: US CDT OpenSolaris Governing Board (OGB) Member - Apr 2005 to Mar 2007 http://www.opensolaris.org/os/community/ogb/ogb_2005-2007/
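To put a number on the one-byte-update case described above: with an illustrative 128 KB erase unit and a naive read-modify-write, the amplification is enormous. This is a toy calculation, not a claim about any specific drive; real controllers avoid the worst case by remapping and wear levelling, as the quoted articles explain.

# Worst-case write amplification when a tiny logical update forces a whole
# flash erase unit to be rewritten. Sizes are illustrative only.

ERASE_UNIT = 128 * 1024        # bytes rewritten in the naive case
SECTOR = 512                   # size of the logical write the host issued

amplification = ERASE_UNIT / SECTOR
print(f"1 sector changed -> {ERASE_UNIT // 1024} KB rewritten "
      f"(~{amplification:.0f}x write amplification)")

# An append-style controller instead logs the changed sector to a pre-erased
# page and updates its mapping table, deferring the cost to garbage collection.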
On 1 jan 2010, at 17.44, Richard Elling wrote:> On Dec 31, 2009, at 12:59 PM, Ragnar Sundblad wrote: >> Flash SSDs actually always remap new writes into a >> only-append-to-new-pages style, pretty much as ZFS does itself. >> So for a SSD there is no big difference between ZFS and >> filesystems as UFS, NTFS, HFS+ et al, on the flash level they >> all work the same. > >> The reason is that there is no way for it to rewrite single >> disk blocks, it can only fill up already erased pages of >> 512K (for example). When the old blocks get mixed with unused >> blocks (because of block rewrites, TRIM or Write Many/UNMAP), >> it needs to compact the data by copying all active blocks from >> those pages into previously erased pages, and there write the >> active data compacted/continuos. (When this happens, things tend >> to get really slow.) > > However, the quantity of small, overwritten pages is vastly different. > I am not convinced that a workload that generates few overwrites > will be penalized as much as a workload that generates a large > number of overwrites.Zfs is not append only in itself, there will be holes from deleted files after a while, and space will have to be reclaimed sooner or later. I am not convinced that a zfs that has been in use for a while rewrites a lot less than other file systems. But maybe you are right, and if so, I agree that intuitively such a workload may be better matched to a flash based device. If you have a workload that only appends data and never changes or deletes it, zfs is probably a bit better than other file systems of not rewriting blocks. But that is a pretty special use case, and another file system could rewrite almost as little.> I think most folks here will welcome good, empirical studies, > but thus far the only one I''ve found is from STEC and their > disks behave very well after they''ve been filled and subjected > to a rewrite workload. You get what you pay for. Additional > pointers are always appreciated :-) > http://www.stec-inc.com/ssd/videos/ssdvideo1.phpThere certainly are big differences between the flash SSD drives out there, I wouldn''t argue about that for a second! /ragge
On 1 jan 2010, at 17.28, David Magda wrote:> On Jan 1, 2010, at 11:04, Ragnar Sundblad wrote: > >> But that would only move the hardware specific and dependent flash >> chip handling code into the file system code, wouldn''t it? What >> is won with that? As long as the flash chips have larger pages than >> the file system blocks, someone will have to shuffle around blocks >> to reclaim space, why not let the one thing that knows the hardware >> and also is very close to the hardware do it? >> >> And if this is good for SSDs, why isn''t it as good for rotating rust? > > Don''t really see how things are either hardware specific or dependent.The inner workings of a SSD flash drive is pretty hardware (or rather vendor) specific, and it may not be a good idea to move any knowledge about that to the file system layer.> COW is COW. Am I missing something? It''s done by code somewhere in the stack, if the FS knows about it, it can lay things out in sequential writes. If we''re talking about 512 KB blocks, ZFS in particular would create four 128 KB txgs--and 128 KB is simply the currently #define''d size, which can be changed in the future.As I said in another mail, zfs is not append only, especially not if it has been in random read write use for a while. There will be holes in the data and space to be reclaimed, something has to handle that, and I am not sure it is a good idea to move that into the host, since it it dependent of the design of the SSD drive.> One thing you gain is perhaps not requiring to have as much of a reserve. At most you have some hidden bad block re-mapping, similar to rotating rust nowadays. If you''re shuffling blocks around, you''re doing a read-modify-write, which if done in the file system, you could use as a mechanism to defrag on-the-fly or to group many small files together.Yes, defrag on the fly may be interesting. Otherwise I am not sure I think the file system should do any of that, since it may be that it can be done much faster and smarter in the SSD controller.> Not quite sure what you mean by your last question.I meant that if hardware dependent handling of the storage medium is good to move into the host, why isn''t the same true for spinning disks? But we can leave that for now. /ragge
On 1 jan 2010, at 18.17, Bob Friesenhahn wrote:> On Fri, 1 Jan 2010, David Magda wrote: >> >> It doesn''t exist currently because of the behind-the-scenes re-mapping that''s being done by the SSD''s firmware. >> >> While arbitrary to some extent, an "actual" LBA would presumably be the number of a particular cell in the SSD. > > There seems to be some severe misunderstanding of what a SSD is. This severe misunderstanding leads one to assume that a SSD has a "native" blocksize. SSDs (as used in computer drives) are comprised of many tens of FLASH memory chips which can be laid out and mapped in whatever fashion the designers choose to do. They could be mapped sequentially, in parallel, a combination of the two, or perhaps even change behavior depending on use. Individual FLASH devices usually have a much smaller page size than 4K. A 4K write would likely be striped across several/many FLASH devices.Yes, but erases are always much larger, right? (With the flash chips of today, I am not sure why there aren''t any flash chips with smaller erase page sizes yet.)> The construction of any given SSD is typically a closely-held trade secret and the vendor will not reveal how it is designed. You would have to chip away the epoxy yourself and reverse-engineer in order to gain some understanding of how a given SSD operates and even then it would be mostly guesswork. > > It would be wrong for anyone here, including someone who has participated in the design of an SSD, to claim that they know how a "SSD" will behave unless they have access to the design of that particular SSD.I certainly agree, but there still isn''t much they can do about the WORM-like properties of flash chips, where reading is pretty fast, writing is not too bad, but erasing is very slow and must be done in pretty large pages which also means that active data probably have to be copied around before an erase. I believe this is why even fast flash SSD devices can take tens or even hundreds of thousands of writes for a short burst, but then fall back to a few thousand writes/second sustained. /ragge
Mike, As far as I know only Hitachi is using such a huge chunk size: "So each vendor''s implementation of TP uses a different block size. HDS use 42MB on the USP, EMC use 768KB on DMX, IBM allow a variable size from 32KB to 256KB on the SVC and 3Par use blocks of just 16KB. The reasons for this are many and varied and for legacy hardware are a reflection of the underlying hardware architecture." http://gestaltit.com/all/tech/storage/chris/thin-provisioning-holy-grail-utilisation/ Also, here Hu explains the reason why they believe 42M is the most efficient: http://blogs.hds.com/hu/2009/07/chunk-size-matters.html He has some good points in his arguments. Regards, sendai -- This message posted from opensolaris.org
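A quick way to see why the chunk (extent) size matters: the array allocates whole chunks, so small scattered writes can pin far more physical space than the data they contain. The chunk sizes below mirror the figures quoted above; the write pattern is invented for illustration.

# Hypothetical worst case: 1000 writes of 64 KB, each landing in its own chunk.
WRITES, WRITE_KB = 1000, 64
used_gb = WRITES * WRITE_KB / 1024 / 1024

for name, chunk_kb in (("HDS 42MB", 42 * 1024), ("EMC 768KB", 768), ("3Par 16KB", 16)):
    allocated_kb = WRITES * max(chunk_kb, WRITE_KB)   # never less than the data itself
    print(f"{name:10s}: ~{allocated_kb / 1024 / 1024:6.2f} GB allocated "
          f"for {used_gb:.2f} GB of data")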
Ragnar Sundblad <ragge at csc.kth.se> wrote:> On 1 jan 2010, at 17.28, David Magda wrote:> > Don''t really see how things are either hardware specific or dependent. > > The inner workings of a SSD flash drive is pretty hardware (or > rather vendor) specific, and it may not be a good idea to move > any knowledge about that to the file system layer.If ZFS wants to keep SSDs fast even after they have been in use for a while, then even ZFS would need to tell the SSD which sectors are no longer in use. Such a mode may cause a noticeable performance loss as ZFS for this reason may need to traverse freed outdated data trees but it will help the SSD to erase the needed space in advance. Jörg -- EMail:joerg at schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin js at cs.tu-berlin.de (uni) joerg.schilling at fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/ URL: http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
Ragnar Sundblad <ragge at csc.kth.se> wrote:> I certainly agree, but there still isn''t much they can do about > the WORM-like properties of flash chips, where reading is pretty > fast, writing is not too bad, but erasing is very slow and must be > done in pretty large pages which also means that active data > probably have to be copied around before an erase.WORM devices do not allow writing a block a second time. There is a typical 5% reserve that would allow reassigning some blocks and making it appear they have been rewritten, but this is not what ZFS does. Well, you are however right that there is a slight relation as I did invent COW for a WORM filesystem in 1989 ;-) Jörg -- EMail:joerg at schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin js at cs.tu-berlin.de (uni) joerg.schilling at fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/ URL: http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
Joerg Schilling wrote:> Ragnar Sundblad <ragge at csc.kth.se> wrote: > > >> On 1 jan 2010, at 17.28, David Magda wrote: >> > > >>> Don''t really see how things are either hardware specific or dependent. >>> >> The inner workings of a SSD flash drive is pretty hardware (or >> rather vendor) specific, and it may not be a good idea to move >> any knowledge about that to the file system layer. >> > > If ZFS wants to keep SSDs fast even after they have been in use for a while, then > even ZFS would need to tell the SSD which sectors are no longer in use. > > > Such a mode may cause a noticeable performance loss as ZFS for this reason > may need to traverse freed outdated data trees but it will help the SSD > to erase the needed space in advance. > > Jörg the TRIM command is what is intended for an OS to notify the SSD as to which blocks are deleted/erased, so the SSD''s internal free list can be updated (that is, it allows formerly-in-use blocks to be moved to the free list). This is necessary since only the OS has the information to determine which previously-written-to blocks are actually no longer in-use. See the parallel discussion here titled "preview of new SSD based on SandForce controller" for more about "smart" vs "dumb" SSD controllers. From ZFS''s standpoint, the optimal configuration would be for the SSD to inform ZFS as to its PAGE size, and ZFS would use this as the fundamental BLOCK size for that device (i.e. all writes are in integer multiples of the SSD page size). Reads could be in smaller sections, though. Which would be interesting: ZFS would write in Page Size increments, and read in Block Size amounts. -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (GMT-0800)
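If such a query existed, the host side could be as simple as rounding every device write up to a whole number of pages. The page-size query is hypothetical (as the follow-ups note, no such command is defined today); the rounding itself is the whole idea.

def aligned_write_size(nbytes: int, page_size: int) -> int:
    """Round a write up to a whole number of device pages (hypothetical page_size query)."""
    pages = -(-nbytes // page_size)          # ceiling division
    return pages * page_size

# e.g. a 100 KB logical write against a device reporting 32 KB pages
print(aligned_write_size(100 * 1024, 32 * 1024) // 1024, "KB issued")   # 128 KB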
Erik Trimble <Erik.Trimble at sun.com> wrote:> From ZFS''s standpoint, the optimal configuration would be for the SSD > to inform ZFS as to its PAGE size, and ZFS would use this as the > fundamental BLOCK size for that device (i.e. all writes are in integer It seems that a command to retrieve this information does not yet exist, or could you help me? Jörg -- EMail:joerg at schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin js at cs.tu-berlin.de (uni) joerg.schilling at fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/ URL: http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
On 2 jan 2010, at 12.43, Joerg Schilling wrote:> Ragnar Sundblad <ragge at csc.kth.se> wrote: > >> I certainly agree, but there still isn''t much they can do about >> the WORM-like properties of flash chips, were reading is pretty >> fast, writing is not to bad, but erasing is very slow and must be >> done in pretty large pages which also means that active data >> probably have to be copied around before an erase. > > WORM devices do not allow to write a block a secdond time.(I know, that is why I wrote WORM-like.)> There is > a typical 5% reserve that would allow to reassign some blocks and to make it > appear they have been rewritten, but this is not what ZFS does.Well, zfs kind of does, but especially typical flash SSDs do it, they have a redirection layer so that any block can go anywhere, so they can use the flash media in a WORM like style with occasional bulk erases.> Well, you are > hoewever true that there is a slight relation as I did invent COW for a WORM > filesystem in 1989 ;-)Yes, there indeed are several similarities. /ragge
On 2 jan 2010, at 13.10, Erik Trimble wrote:> Joerg Schilling wrote: >> Ragnar Sundblad <ragge at csc.kth.se> wrote: >> >> >>> On 1 jan 2010, at 17.28, David Magda wrote: >>> >> >> >>>> Don''t really see how things are either hardware specific or dependent. >>>> >>> The inner workings of a SSD flash drive is pretty hardware (or >>> rather vendor) specific, and it may not be a good idea to move >>> any knowledge about that to the file system layer. >>> >> >> If ZFS likes to keep SSDs fast even after it was in use for a while, then >> even ZFS would need to tell the SSD which sectors are no longer in use. >> >> >> Such a mode may cause a noticable performance loss as ZFS for this reason >> may need to traverse freed outdated data trees but it will help the SSD >> to erase the needed space in advance. >> >> J?r > the TRIM command is what is intended for an OS to notify the SSD as to which blocks are deleted/erased, so the SSD''s internal free list can be updated (that is, it allows formerly-in-use blocks to be moved to the free list). This is necessary since only the OS has the information to determine which previous-written-to blocks are actually no longer in-use. > > See the parallel discussion here titled "preview of new SSD based on SandForce controller" for more about "smart" vs "dumb" SSD controllers. > > From ZFS''s standpoint, the optimal configuration would be for the SSD to inform ZFS as to it''s PAGE size, and ZFS would use this as the fundamental BLOCK size for that device (i.e. all writes are in integer multiples of the SSD page size). Reads could be in smaller sections, though. Which would be interesting: ZFS would write in Page Size increments, and read in Block Size amounts.Well, this could be useful if updates are larger than the block size, for example 512 K, as it is then possible to erase and rewrite without having to copy around other data from the page. If updates are smaller, zfs will have to reclaim erased space by itself, which if I am not mistaken it can not do today (but probably will in some future, I guess the BP Rewrite is what is needed). I am still not entirely convinced that it would be better to let the file system take care of that instead of a flash controller, there could be quite a lot of reading and writing going on for space reclamation (depending on the work load, of course). /ragge
Joerg Schilling wrote:> Erik Trimble <Erik.Trimble at sun.com> wrote: > > >> From ZFS''s standpoint, the optimal configuration would be for the SSD >> to inform ZFS as to its PAGE size, and ZFS would use this as the >> fundamental BLOCK size for that device (i.e. all writes are in integer >> > > It seems that a command to retrieve this information does not yet exist, > or could you help me? > > Jörg > >Sadly, no, there does not exist any way for the SSD to communicate that info back to the OS. Probably, the smart thing to push for is inclusion of some new command in the ATA standard (in a manner like TRIM). Likely something that would return both native Block and Page sizes upon query. I''m still trying to see if there will be any support for TRIM-like things in SAS. -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA
Ragnar Sundblad wrote:> On 2 jan 2010, at 13.10, Erik Trimble wrote >> Joerg Schilling wrote: >> >> the TRIM command is what is intended for an OS to notify the SSD as to which blocks are deleted/erased, so the SSD''s internal free list can be updated (that is, it allows formerly-in-use blocks to be moved to the free list). This is necessary since only the OS has the information to determine which previous-written-to blocks are actually no longer in-use. >> >> See the parallel discussion here titled "preview of new SSD based on SandForce controller" for more about "smart" vs "dumb" SSD controllers. >> >> From ZFS''s standpoint, the optimal configuration would be for the SSD to inform ZFS as to it''s PAGE size, and ZFS would use this as the fundamental BLOCK size for that device (i.e. all writes are in integer multiples of the SSD page size). Reads could be in smaller sections, though. Which would be interesting: ZFS would write in Page Size increments, and read in Block Size amounts. >> > > Well, this could be useful if updates are larger than the block size, for example 512 K, as it is then possible to erase and rewrite without having to copy around other data from the page. If updates are smaller, zfs will have to reclaim erased space by itself, which if I am not mistaken it can not do today (but probably will in some future, I guess the BP Rewrite is what is needed). >Sure, it does that today. What do you think happens on a standard COW action? Let''s be clear here: I''m talking about exactly the same thing that currently happens when you modify a ZFS "block" that spans multiple vdevs (say, in a RAIDZ). The entire ZFS block is read from disk/L2ARC, the modifications made, then it is written back to storage, likely in another LBA. The original ZFS block location ON THE VDEV is now available for re-use (i.e. the vdev adds it to it''s Free Block List). This is one of the things that leads to ZFS''s fragmentation issues (note, we''re talking about block fragmentation on the vdev, not ZFS block fragmentation), and something that we''re looking to BP rewrite to enable defragging to be implemented. In fact, I would argue that the biggest advantage of removing any advanced intelligence from the SSD controller is with small modifications to existing files. By using the L2ARC (and other features, like compression, encryption, and dedup), ZFS can composite the needed changes with an existing cached copy of the ZFS block(s) to be changed, then issue a full new block write to the SSD. This eliminates the need for the SSD to do the dreaded Read-Modify-Write cycle, and instead do just a Write. In this scenario, the ZFS block is likely larger than the SSD Page size, so more data will need to be written; however, given the highly parallel nature of SSDs, writing several SSD pages simultaneously is easy (and fast); let''s remember that a ZFS block is a maximum of only 8x the size of a SSD page, and writing 8 pages is only slightly more work than writing 1 page. This larger write is all a single IOP, where a R-M-W essentially requires 3 IOPS. 
If you want the SSD controller to do the work, then it ALWAYS has to read the to-be-modified page from NAND, do the mod itself, then issue the write - and, remember, ZFS likely has already issued a full ZFS-block write (due to the COW nature of ZFS, there is no concept of "just change this 1 bit and leave everything else on disk where it is"), so you likely don''t save on the number of pages that need to be written in any case.> I am still not entirely convinced that it would be better to let the file system take care of that instead of a flash controller, there could be quite a lot of reading and writing going on for space reclamation (depending on the work load, of course). > > /raggeThe point here is that regardless of the workload, there''s a R-M-W cycle that has to happen, whether that occurs at the ZFS level or at the SSD level. My argument is that the OS has a far better view of the whole data picture, and access to much higher performing caches (i.e. RAM/registers) than the SSD, so not only can the OS make far better decisions about the data and how (and how much of) it should be stored, but it''s almost certainly to be able to do so far faster than any little SSD controller can do. -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA
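As a rough illustration of the single-write versus read-modify-write argument (the latencies are invented round numbers, not measurements of any device): compositing the block in host RAM and issuing one full-block write avoids the device-side read and modify steps entirely.

# Toy latency model for updating one logical block, in microseconds.
# The figures are invented; only the ratio is meant to be illustrative.

NAND_READ, NAND_PROGRAM, CTRL_MODIFY = 50, 200, 20

def device_side_rmw():
    # SSD reads the old page, patches it, programs a new page (3 device operations)
    return NAND_READ + CTRL_MODIFY + NAND_PROGRAM

def host_side_cow_write():
    # Host already holds the block in its cache, composites the change in RAM,
    # and issues a single full-block write (1 device operation)
    return NAND_PROGRAM

print("device R-M-W  :", device_side_rmw(), "us")
print("host COW write:", host_side_cow_write(), "us")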
On 2 jan 2010, at 22.49, Erik Trimble wrote:> Ragnar Sundblad wrote: >> On 2 jan 2010, at 13.10, Erik Trimble wrote >>> Joerg Schilling wrote: >>> the TRIM command is what is intended for an OS to notify the SSD as to which blocks are deleted/erased, so the SSD''s internal free list can be updated (that is, it allows formerly-in-use blocks to be moved to the free list). This is necessary since only the OS has the information to determine which previous-written-to blocks are actually no longer in-use. >>> >>> See the parallel discussion here titled "preview of new SSD based on SandForce controller" for more about "smart" vs "dumb" SSD controllers. >>> >>> From ZFS''s standpoint, the optimal configuration would be for the SSD to inform ZFS as to it''s PAGE size, and ZFS would use this as the fundamental BLOCK size for that device (i.e. all writes are in integer multiples of the SSD page size). Reads could be in smaller sections, though. Which would be interesting: ZFS would write in Page Size increments, and read in Block Size amounts. >>> >> >> Well, this could be useful if updates are larger than the block size, for example 512 K, as it is then possible to erase and rewrite without having to copy around other data from the page. If updates are smaller, zfs will have to reclaim erased space by itself, which if I am not mistaken it can not do today (but probably will in some future, I guess the BP Rewrite is what is needed). >> > Sure, it does that today. What do you think happens on a standard COW action? Let''s be clear here: I''m talking about exactly the same thing that currently happens when you modify a ZFS "block" that spans multiple vdevs (say, in a RAIDZ). The entire ZFS block is read from disk/L2ARC, the modifications made, then it is written back to storage, likely in another LBA. The original ZFS block location ON THE VDEV is now available for re-use (i.e. the vdev adds it to it''s Free Block List). This is one of the things that leads to ZFS''s fragmentation issues (note, we''re talking about block fragmentation on the vdev, not ZFS block fragmentation), and something that we''re looking to BP rewrite to enable defragging to be implemented.What I am talking about is to be able to reuse the free space you get in the previously written data when you write modified data to new places on the disk, or just remove a file for that matter. To be able to reclaim that space with flash, you have to erase large pages (for example 512 KB), but before you erase, you will also have to save away all still valid data in that page and rewrite that to a free page. What I am saying is that I am not sure that this would be best done in the file system, since it could be quite a bit of data to shuffle around, and there could possibly be hardware specific optimizations that could be done here that zfs wouldn''t know about. A good flash controller could probably do it much better. (And a bad one worse, of course.) And as far as I know, zfs can not do that today - it can not move around already written data, not for defragmentation, not for adding or removing disks to stripes/raidz:s, not for deduping/duping and so on, and I have understood it as BP Rewrite could solve a lot of this. 
Still, it could certainly be useful if zfs could try to use a blocksize that matches the SSD erase page size - this could avoid having to copy and compact data before erasing, which could speed up writes in a typical flash SSD disk.> In fact, I would argue that the biggest advantage of removing any advanced intelligence from the SSD controller is with small modifications to existing files. By using the L2ARC (and other features, like compression, encryption, and dedup), ZFS can composite the needed changes with an existing cached copy of the ZFS block(s) to be changed, then issue a full new block write to the SSD. This eliminates the need for the SSD to do the dreaded Read-Modify-Write cycle, and instead do just a Write. In this scenario, the ZFS block is likely larger than the SSD Page size, so more data will need to be written; however, given the highly parallel nature of SSDs, writing several SSD pages simultaneously is easy (and fast); let''s remember that a ZFS block is a maximum of only 8x the size of a SSD page, and writing 8 pages is only slightly more work than writing 1 page. This larger write is all a single IOP, where a R-M-W essentially requires 3 IOPS. If you want the SSD controller to do the work, then it ALWAYS has to read the to-be-modified page from NAND, do the mod itself, then issue the write - and, remember, ZFS likely has already issued a full ZFS-block write (due to the COW nature of ZFS, there is no concept of "just change this 1 bit and leave everything else on disk where it is"), so you likely don''t save on the number of pages that need to be written in any case.I don''t think many SSDs do R-M-W, but rather just append blocks to free pages (pretty much as zfs works, if you will). They also have to do some space reclamation (copying/compacting blocks and erasing pages) in the background, of course.> I am still not entirely convinced that it would be better to let the file system take care of that instead of a flash controller, there could be quite a lot of reading and writing going on for space reclamation (depending on the work load, of course). >> >> /ragge > The point here is that regardless of the workload, there''s a R-M-W cycle that has to happen, whether that occurs at the ZFS level or at the SSD level. My argument is that the OS has a far better view of the whole data picture, and access to much higher performing caches (i.e. RAM/registers) than the SSD, so not only can the OS make far better decisions about the data and how (and how much of) it should be stored, but it''s almost certainly to be able to do so far faster than any little SSD controller can do.Well, inside the flash system you could possibly have a much better situation to shuffle data around for space reclamation - that is copying and compacting data and erasing flash pages. If the device has a good design, that is! If the SSD controller is some small slow sad thing it might be better to shuffle it up and down to the host and do it in the CPU, but I am not sure about that either since it typically is the very same slow controller that does the host communication. I certainly agree that there seems to be some redundancy when the flash SSD controller does a logging-file-system kind of work under zfs that does pretty much that by itself, and it could possibly be better to cut one of them (and not zfs). I am still not convinced that it won''t be better to do this in a good controller instead just for speed and to take advantage of new hardware that does this smarter than the devices of today. 
Do you know how the F5100 works for example? /ragge
On Jan 2, 2010, at 16:49, Erik Trimble wrote:> My argument is that the OS has a far better view of the whole data > picture, and access to much higher performing caches (i.e. RAM/ > registers) than the SSD, so not only can the OS make far better > decisions about the data and how (and how much of) it should be > stored, but it''s almost certainly to be able to do so far faster > than any little SSD controller can do.Though one advantage of doing it within the disk is that you''re not using up bus bandwidth. Probably not that big of a deal, but worth mentioning for completeness / fairness.
On Jan 2, 2010, at 1:47 AM, Andras Spitzer wrote:> Mike, > > As far as I know only Hitachi is using such a huge chunk size: > > "So each vendor''s implementation of TP uses a different block size. > HDS use 42MB on the USP, EMC use 768KB on DMX, IBM allow a variable > size from 32KB to 256KB on the SVC and 3Par use blocks of just 16KB. > The reasons for this are many and varied and for legacy hardware are > a reflection of the underlying hardware architecture." > > http://gestaltit.com/all/tech/storage/chris/thin-provisioning-holy-grail-utilisation/ > > Also, here Hu explains the reason why they believe 42M is the most > efficient: > > http://blogs.hds.com/hu/2009/07/chunk-size-matters.html > > He has some good points in his arguments.Yes, and they apply to ZFS dedup as well... :-) -- richard
Ragnar Sundblad wrote:> On 2 jan 2010, at 22.49, Erik Trimble wrote: > > >> Ragnar Sundblad wrote: >> >>> On 2 jan 2010, at 13.10, Erik Trimble wrote >>> >>>> Joerg Schilling wrote: >>>> the TRIM command is what is intended for an OS to notify the SSD as to which blocks are deleted/erased, so the SSD''s internal free list can be updated (that is, it allows formerly-in-use blocks to be moved to the free list). This is necessary since only the OS has the information to determine which previous-written-to blocks are actually no longer in-use. >>>> >>>> See the parallel discussion here titled "preview of new SSD based on SandForce controller" for more about "smart" vs "dumb" SSD controllers. >>>> >>>> From ZFS''s standpoint, the optimal configuration would be for the SSD to inform ZFS as to it''s PAGE size, and ZFS would use this as the fundamental BLOCK size for that device (i.e. all writes are in integer multiples of the SSD page size). Reads could be in smaller sections, though. Which would be interesting: ZFS would write in Page Size increments, and read in Block Size amounts. >>>> >>>> >>> Well, this could be useful if updates are larger than the block size, for example 512 K, as it is then possible to erase and rewrite without having to copy around other data from the page. If updates are smaller, zfs will have to reclaim erased space by itself, which if I am not mistaken it can not do today (but probably will in some future, I guess the BP Rewrite is what is needed). >>> >>> >> Sure, it does that today. What do you think happens on a standard COW action? Let''s be clear here: I''m talking about exactly the same thing that currently happens when you modify a ZFS "block" that spans multiple vdevs (say, in a RAIDZ). The entire ZFS block is read from disk/L2ARC, the modifications made, then it is written back to storage, likely in another LBA. The original ZFS block location ON THE VDEV is now available for re-use (i.e. the vdev adds it to it''s Free Block List). This is one of the things that leads to ZFS''s fragmentation issues (note, we''re talking about block fragmentation on the vdev, not ZFS block fragmentation), and something that we''re looking to BP rewrite to enable defragging to be implemented. >> > > What I am talking about is to be able to reuse the free space > you get in the previously written data when you write modified > data to new places on the disk, or just remove a file for that > matter. To be able to reclaim that space with flash, you have > to erase large pages (for example 512 KB), but before you erase, > you will also have to save away all still valid data in that > page and rewrite that to a free page. What I am saying is that > I am not sure that this would be best done in the file system, > since it could be quite a bit of data to shuffle around, and > there could possibly be hardware specific optimizations that > could be done here that zfs wouldn''t know about. A good flash > controller could probably do it much better. (And a bad one > worse, of course.) >You certainly DO get to reuse the free space again. Here''s what happens nowdays in an SSD: Let''s say I have 4k blocks, grouped into a 128k page. That is, the SSD''s fundamental minimum unit size is 4k, but the minimum WRITE size is 128k. Thus, 32 blocks in a page. So, I write a bit of data 100k in size. This occupies the first 25 blocks in the one page. The remaining 9 blocks are still one the SSD''s Free List (i.e. list of free space). 
Now, I want to change the last byte of the file, and add 10k more to the file. Currently, a non-COW filesystem will simply send the 1 byte modification request and the 10k addition to the SSD (all as one unit, if you are lucky - if not, it comes as two ops: 1 byte modification followed by a 10k append). The SSD now has to read all 25 blocks of the page back into it''s local cache on the controller, do the modification and append computing, then writes out 28 blocks to NAND. In all likelihood, if there is any extra pre-erased (or never written to) space on the drive, this 28 block write will go to a whole new page. The blocks in the original page will be moved over to the SSD Free List (and may or may not be actually erased, depending on the controller). For filesystems like ZFS, this is a whole lot of extra work being done that doesn''t need to happen (and, chews up valuable IOPS and time). For, when ZFS does a write, it doesn''t merely just twiddle the modified/appended bits - instead, it creates a whole new ZFS block to write. In essence, ZFS has already done all the work that the SSD controller is planning on doing. So why duplicate the effort? SSDs should simply notify ZFS about their block & page sizes, which would then allow ZFS to better align it''s own variable block size to optimally coincide with the SSD''s implementation.> And as far as I know, zfs can not do that today - it can not > move around already written data, not for defragmentation, not > for adding or removing disks to stripes/raidz:s, not for > deduping/duping and so on, and I have understood it as > BP Rewrite could solve a lot of this. >ZFS''s propensity to fragmentation doesn''t mean you lose space. Rather, it means that COW often results in frequently-modified files being distributed over the entire media, rather than being contiguous. So, over time, the actual media has very little (if any) contiguous free space, which is what the fragmentation problem is. BP rewrite will indeed allow us to create a de-fragger. Areas which used to hold a ZFS block (now vacated by a COW to somewhere else) are simply added back to the device''s Free List. Now, in SSD''s case, this isn''t a worry. Due to the completely even performance characteristics of NAND, it doesn''t make any difference if the physical layout of a file happens to be sections (e.g. ZFS blocks) scattered all over the SSD. Access time is identical, and so is read time. SSD''s don''t care about this kind of fragmentation. What SSD''s have to worry about is sub-page fragmentation. Which brings us back to the whole R-M-W mess.> Still, it could certainly be useful if zfs could try to use a > blocksize that matches the SSD erase page size - this could > avoid having to copy and compact data before erasing, which > could speed up writes in a typical flash SSD disk. > > >> In fact, I would argue that the biggest advantage of removing any advanced intelligence from the SSD controller is with small modifications to existing files. By using the L2ARC (and other features, like compression, encryption, and dedup), ZFS can composite the needed changes with an existing cached copy of the ZFS block(s) to be changed, then issue a full new block write to the SSD. This eliminates the need for the SSD to do the dreaded Read-Modify-Write cycle, and instead do just a Write. 
In this scenario, the ZFS block is likely larger than the SSD Page size, so more data will need to be written; however, given the highly parallel nature of SSDs, writing several SSD pages simultaneously is easy (and fast); let''s remember that a ZFS block is a maximum of only 8x the size of a SSD page, and writing 8 pages is only slightly more work than writing 1 page. This larger write is all a single IOP, where a R-M-W essentially requires 3 IOPS. If you want the SSD controller to do the work, then it ALWAYS has to read the to-be-modified page from NAND, do the mod itself, then issue the write - and, remember, ZFS likely has already issued a full ZFS-block write (due to the COW nature of ZFS, there is no concept of "just change this 1 bit and leave everything else on disk where it is"), so you likely don''t save on the number of pages that need to be written in any case. >> > > I don''t think many SSDs do R-M-W, but rather just append blocks > to free pages (pretty much as zfs works, if you will). They also > have to do some space reclamation (copying/compacting blocks and > erasing pages) in the background, of course.>MLC-based SSDs all do R-M-W. Now, they might not do Read-Modify-Erase-Write right away: But they''ll do R-M-W on ANY write which modifies existing data (unless you are extremely lucky and your data fully fills an existing page): the difference is that the final W is to previous-unused NAND page(s). However, when the SSD runs out of never-used space, it starts to have to add the E step on future writes. So far as I know, no SSD does space reclamation in the manner you refer to. That is, the SSD controller isn''t going to be moving data around on its own, with the exception of wear-leveling. TRIM is there so that the SSD can add stuff to it''s internal Free List more efficiently, but an SSD isn''t going (on its own) say: "Ooh, page 1004 has only 5 of 10 blocks used, so why don''t we merge it with page 20054, which has only 3 of 10 blocks used.">> I am still not entirely convinced that it would be better to let the file system take care of that instead of a flash controller, there could be quite a lot of reading and writing going on for space reclamation (depending on the work load, of course). >> >>> /ragge >>> >> The point here is that regardless of the workload, there''s a R-M-W cycle that has to happen, whether that occurs at the ZFS level or at the SSD level. My argument is that the OS has a far better view of the whole data picture, and access to much higher performing caches (i.e. RAM/registers) than the SSD, so not only can the OS make far better decisions about the data and how (and how much of) it should be stored, but it''s almost certainly to be able to do so far faster than any little SSD controller can do. >> > > Well, inside the flash system you could possibly have a much > better situation to shuffle data around for space reclamation - > that is copying and compacting data and erasing flash pages. > If the device has a good design, that is! If the SSD controller > is some small slow sad thing it might be better to shuffle it up > and down to the host and do it in the CPU, but I am not sure > about that either since it typically is the very same slow > controller that does the host communication. >It''s actually far more likely that a dumb SSD controller can handle high levels of pure data transfer faster than a smart SSD controller can actually manipulate that same data quickly. 
SSD controllers, by their very nature, need to be as small and cheap as possible, which means they have extremely limited computation ability. For a given compute level controller, one which is only "dumb" has to worry about 4 things: wear leveling, bad block remapping, and LBA->physical block mapping, and actual I/O transfer (i.e. managing data flow from the host to the NAND chips). A smart controller also has to worry about page alignment, page modification and rewriting, potentially RAID-like checksumming/parity, page/block fragmentation, and other things. So, if the compute amount is fixed, a dumb controller is going to be able to handle a /whole/ lot more I/O transfer than a smart controller. Which means, for the same level of I/O transfer, a dumb controller costs less than a smart controller.> I certainly agree that there seems to be some redundancy when > the flash SSD controller does a logging-file-system kind of work > under zfs that does pretty much that by itself, and it could > possibly be better to cut one of them (and not zfs). > I am still not convinced that it won''t be better to do this > in a good controller instead just for speed and to take advantage > of new hardware that does this smarter than the devices of today. > > Do you know how the F5100 works for example? > > /ragge >The point I''m making here is that the filesystem/OS can make all the same decisions that a good SSD controller can make, faster (as it has most of the data in local RAM or register already), and with a global system viewpoint that the SSD simply can''t have. Most importantly, it''s essentially free for the OS to do so - it has the spare cycles and bandwidth to do so. Putting this intelligence on the SSD costs money that is essentially wasted, not to mention being less efficient overall. -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (GMT-0800)
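As a toy model of the read-modify-write cost described above (whether real controllers actually work at this page granularity is disputed later in the thread), here is a short Python sketch comparing the in-drive R-M-W path with a host-side copy-on-write of a full block. The sizes are the illustrative ones from the example (4 KB blocks, 128 KB pages, a 100 KB file), not figures for any particular drive.

    # Toy cost model for a 1-byte logical update, using the sizes from the
    # example above (4 KB blocks, 32 blocks per 128 KB page, a 100 KB file).
    # It follows the page-granular R-M-W description in the message; it is
    # not a claim about how any particular drive behaves.
    BLOCK = 4 * 1024
    PAGE = 32 * BLOCK                     # 128 KB
    LIVE_BLOCKS = 25                      # the 100 KB of existing file data

    # Non-COW filesystem, the controller does the work ("smart" SSD):
    rmw_read = LIVE_BLOCKS * BLOCK        # pull the old data into controller DRAM
    rmw_write = PAGE                      # rewrite a whole page somewhere else
    print(f"in-drive R-M-W : read {rmw_read} B, program {rmw_write} B, ~3 operations")

    # COW filesystem and a "dumb" controller: the host ships one freshly
    # built block, already page-sized and page-aligned.
    print(f"host-side COW  : read 0 B, program {PAGE} B, ~1 operation")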
David Magda wrote:> On Jan 2, 2010, at 16:49, Erik Trimble wrote: > >> My argument is that the OS has a far better view of the whole data >> picture, and access to much higher performing caches (i.e. >> RAM/registers) than the SSD, so not only can the OS make far better >> decisions about the data and how (and how much of) it should be >> stored, but it''s almost certainly to be able to do so far faster than >> any little SSD controller can do. > > Though one advantage of doing it with-in the disk is that you''re not > using up bus bandwidth. Probably not that big of a deal, but worth > mentioning for completeness / fairness.This is true. But, also in fairness, this is /already/ being used by the COW nature of ZFS. Changing one bit in a file causes the /entire/ ZFS block containing that bit to be re-written. So I''m not really using much (if any) more bus bandwidth by doing the SSD page layout in the OS rather than in the SSD controller. Remember that I''m highly likely not to have to read anything from the SSD to do the page rewrite, as the data I want is already in the L2ARC. -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (GMT-0800)
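For completeness, a rough look at the bus-traffic side of that trade-off. The record size, update rate and link speed below are assumptions picked only to show the orders of magnitude involved.

    # Bytes crossing the host<->device bus for a stream of tiny logical updates.
    RECORDSIZE = 128 * 1024               # assumed ZFS recordsize
    UPDATES_PER_SEC = 1000                # hypothetical small-update workload

    cow_bus = UPDATES_PER_SEC * RECORDSIZE    # ZFS ships whole records either way
    print(f"COW record writes: {cow_bus / 1e6:.0f} MB/s of bus traffic")
    # Laying the pages out in the host adds little on top of this, because the
    # full record already crosses the bus; a SATA 3 Gb/s link (~300 MB/s usable)
    # still has headroom.  What is saved is the read-back inside the drive.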
On 3 jan 2010, at 04.19, Erik Trimble wrote:> Ragnar Sundblad wrote: >> On 2 jan 2010, at 22.49, Erik Trimble wrote: >> >> >>> Ragnar Sundblad wrote: >>> >>>> On 2 jan 2010, at 13.10, Erik Trimble wrote >>>> >>>>> Joerg Schilling wrote: >>>>> the TRIM command is what is intended for an OS to notify the SSD as to which blocks are deleted/erased, so the SSD''s internal free list can be updated (that is, it allows formerly-in-use blocks to be moved to the free list). This is necessary since only the OS has the information to determine which previous-written-to blocks are actually no longer in-use. >>>>> >>>>> See the parallel discussion here titled "preview of new SSD based on SandForce controller" for more about "smart" vs "dumb" SSD controllers. >>>>> >>>>> From ZFS''s standpoint, the optimal configuration would be for the SSD to inform ZFS as to it''s PAGE size, and ZFS would use this as the fundamental BLOCK size for that device (i.e. all writes are in integer multiples of the SSD page size). Reads could be in smaller sections, though. Which would be interesting: ZFS would write in Page Size increments, and read in Block Size amounts. >>>>> >>>> Well, this could be useful if updates are larger than the block size, for example 512 K, as it is then possible to erase and rewrite without having to copy around other data from the page. If updates are smaller, zfs will have to reclaim erased space by itself, which if I am not mistaken it can not do today (but probably will in some future, I guess the BP Rewrite is what is needed). >>>> >>> Sure, it does that today. What do you think happens on a standard COW action? Let''s be clear here: I''m talking about exactly the same thing that currently happens when you modify a ZFS "block" that spans multiple vdevs (say, in a RAIDZ). The entire ZFS block is read from disk/L2ARC, the modifications made, then it is written back to storage, likely in another LBA. The original ZFS block location ON THE VDEV is now available for re-use (i.e. the vdev adds it to it''s Free Block List). This is one of the things that leads to ZFS''s fragmentation issues (note, we''re talking about block fragmentation on the vdev, not ZFS block fragmentation), and something that we''re looking to BP rewrite to enable defragging to be implemented. >>> >> >> What I am talking about is to be able to reuse the free space >> you get in the previously written data when you write modified >> data to new places on the disk, or just remove a file for that >> matter. To be able to reclaim that space with flash, you have >> to erase large pages (for example 512 KB), but before you erase, >> you will also have to save away all still valid data in that >> page and rewrite that to a free page. What I am saying is that >> I am not sure that this would be best done in the file system, >> since it could be quite a bit of data to shuffle around, and >> there could possibly be hardware specific optimizations that >> could be done here that zfs wouldn''t know about. A good flash >> controller could probably do it much better. (And a bad one >> worse, of course.) >> > You certainly DO get to reuse the free space again. Here''s what happens nowdays in an SSD: > > Let''s say I have 4k blocks, grouped into a 128k page. That is, the SSD''s fundamental minimum unit size is 4k, but the minimum WRITE size is 128k. Thus, 32 blocks in a page.Do you know of SSD disks that have a minimum write size of 128 KB? I don''t understand why it would be designed that way. 
A typical flash chip has pretty small write block sizes, like 2 KB or so, but they can only erase in pages of 128 KB or so. (And then you are running a few of those in parallel to get some speed, so these numbers often multiply with the number of parallel chips, like 4 or 8 or so.) Typically, you have to write the 2 KB blocks consecutively in a page. Pretty much all set up for an append-style system. :-) In addition, flash SSDs typically have some DRAM write buffer that buffers up writes (like a txg, if you will), so small writes should not be a problem - just collect a few and append!> So, I write a bit of data 100k in size. This occupies the first 25 blocks in the one page. The remaining 9 blocks are still one the SSD''s Free List (i.e. list of free space). > > Now, I want to change the last byte of the file, and add 10k more to the file. Currently, a non-COW filesystem will simply send the 1 byte modification request and the 10k addition to the SSD (all as one unit, if you are lucky - if not, it comes as two ops: 1 byte modification followed by a 10k append). The SSD now has to read all 25 blocks of the page back into it''s local cache on the controller, do the modification and append computing, then writes out 28 blocks to NAND. In all likelihood, if there is any extra pre-erased (or never written to) space on the drive, this 28 block write will go to a whole new page. The blocks in the original page will be moved over to the SSD Free List (and may or may not be actually erased, depending on the controller).Do you know for sure that you have SSD flash disks that work this way? It seems incredibly stupid. It would also use up the available erase cycles much faster than necessary. What write speed do you get?> For filesystems like ZFS, this is a whole lot of extra work being done that doesn''t need to happen (and, chews up valuable IOPS and time). For, when ZFS does a write, it doesn''t merely just twiddle the modified/appended bits - instead, it creates a whole new ZFS block to write. In essence, ZFS has already done all the work that the SSD controller is planning on doing. So why duplicate the effort? SSDs should simply notify ZFS about their block & page sizes, which would then allow ZFS to better align it''s own variable block size to optimally coincide with the SSD''s implementation. > > >> And as far as I know, zfs can not do that today - it can not >> move around already written data, not for defragmentation, not >> for adding or removing disks to stripes/raidz:s, not for >> deduping/duping and so on, and I have understood it as >> BP Rewrite could solve a lot of this. >> > ZFS''s propensity to fragmentation doesn''t mean you lose space. Rather, it means that COW often results in frequently-modified files being distributed over the entire media, rather than being contiguous. So, over time, the actual media has very little (if any) contiguous free space, which is what the fragmentation problem is. BP rewrite will indeed allow us to create a de-fragger. Areas which used to hold a ZFS block (now vacated by a COW to somewhere else) are simply added back to the device''s Free List. > Now, in SSD''s case, this isn''t a worry. Due to the completely even performance characteristics of NAND, it doesn''t make any difference if the physical layout of a file happens to be sections (e.g. 
ZFS blocks) scattered all over the SSD.Yes, there is something to worry about, as you can only erase flash in large pages - you can not erase them only where the free data blocks in the Free List are.> Access time is identical, and so is read time. SSD''s don''t care about this kind of fragmentation. > What SSD''s have to worry about is sub-page fragmentation. Which brings us back to the whole R-M-W mess.Yes, why R-M-W of entire pages for every change is a really bad implementation of a flash SSD.> Still, it could certainly be useful if zfs could try to use a >> blocksize that matches the SSD erase page size - this could >> avoid having to copy and compact data before erasing, which >> could speed up writes in a typical flash SSD disk. >> >> >>> In fact, I would argue that the biggest advantage of removing any advanced intelligence from the SSD controller is with small modifications to existing files. By using the L2ARC (and other features, like compression, encryption, and dedup), ZFS can composite the needed changes with an existing cached copy of the ZFS block(s) to be changed, then issue a full new block write to the SSD. This eliminates the need for the SSD to do the dreaded Read-Modify-Write cycle, and instead do just a Write. In this scenario, the ZFS block is likely larger than the SSD Page size, so more data will need to be written; however, given the highly parallel nature of SSDs, writing several SSD pages simultaneously is easy (and fast); let''s remember that a ZFS block is a maximum of only 8x the size of a SSD page, and writing 8 pages is only slightly more work than writing 1 page. This larger write is all a single IOP, where a R-M-W essentially requires 3 IOPS. If you want the SSD controller to do the work, then it ALWAYS has to read the to-be-modified page from NAND, do the mod itself, then issue the write - and, remember, ZFS likely has already issued a full ZFS-block write (due to the COW nature of ZFS, there is no concept of "just change this 1 bit and leave everything else on disk where it is"), so you likely don''t save on the number of pages that need to be written in any case. >>> >> >> I don''t think many SSDs do R-M-W, but rather just append blocks >> to free pages (pretty much as zfs works, if you will). They also >> have to do some space reclamation (copying/compacting blocks and >> erasing pages) in the background, of course. > >> > MLC-based SSDs all do R-M-W. Now, they might not do Read-Modify-Erase-Write right away: But they''ll do R-M-W on ANY write which modifies existing data (unless you are extremely lucky and your data fully fills an existing page): the difference is that the final W is to previous-unused NAND page(s). However, when the SSD runs out of never-used space, it starts to have to add the E step on future writes. > > So far as I know, no SSD does space reclamation in the manner you refer to. That is, the SSD controller isn''t going to be moving data around on its own, with the exception of wear-leveling. TRIM is there so that the SSD can add stuff to it''s internal Free List more efficiently, but an SSD isn''t going (on its own) say: "Ooh, page 1004 has only 5 of 10 blocks used, so why don''t we merge it with page 20054, which has only 3 of 10 blocks used."(I don''t think they typically merge pages, I believe they rather just pick pages with some freed blocks, copies the active blocks to the "end" of the disk, and erases the page.) 
Well, the algorithms are often trade secrets, and if what you say is correct, and it was my product, then I wouldn''t even want to tell anyone about it, since it would be a horrible waste of both bandwidth and erase cycles. Using up the 10000 erase cycles of a MLC device 64 times faster than necessary seems like an extremely bad idea. But there sure is a lot of crap out there, I can''t say you are wrong (only hope :-). I doubt for example the F5100 works that way, it would be hard to get ~15000 4KB w/s per "flash-SODIMM" if it behaved like that (you typically can erase only 500-1000 pages a second, for example). I doubt the Intel X25 works that way, as their read performance suffers if they are written with smaller blocks and get internally fragmented - that problem could not exist if they always filled complete new pages in a R-M-W manner.>>> I am still not entirely convinced that it would be better to let the file system take care of that instead of a flash controller, there could be quite a lot of reading and writing going on for space reclamation (depending on the work load, of course). >>> >>>> /ragge >>>> >>> The point here is that regardless of the workload, there''s a R-M-W cycle that has to happen, whether that occurs at the ZFS level or at the SSD level. My argument is that the OS has a far better view of the whole data picture, and access to much higher performing caches (i.e. RAM/registers) than the SSD, so not only can the OS make far better decisions about the data and how (and how much of) it should be stored, but it''s almost certainly to be able to do so far faster than any little SSD controller can do. >> >> Well, inside the flash system you could possibly have a much >> better situation to shuffle data around for space reclamation - >> that is copying and compacting data and erasing flash pages. >> If the device has a good design, that is! If the SSD controller >> is some small slow sad thing it might be better to shuffle it up >> and down to the host and do it in the CPU, but I am not sure >> about that either since it typically is the very same slow >> controller that does the host communication. >> > It''s actually far more likely that a dumb SSD controller can handle high levels of pure data transfer faster than a smart SSD controller can actually manipulate that same data quickly. SSD controllers, by their very nature, need to be as small and cheap as possible, which means they have extremely limited computation ability. For a given compute level controller, one which is only "dumb" has to worry about 4 things: wear leveling, bad block remapping, and LBA->physical block mapping, and actual I/O transfer (i.e. managing data flow from the host to the NAND chips). A smart controller also has to worry about page alignment, page modification and rewriting, potentially RAID-like checksumming/parity, page/block fragmentation, and other things. So, if the compute amount is fixed, a dumb controller is going to be able to handle a /whole/ lot more I/O transfer than a smart controller. Which means, for the same level of I/O transfer, a dumb controller costs less than a smart controller.I am not convinced the compute amount needs to be fixed, or even that they by their nature need to be as cheap as possible - if that hurts performance. People are obviously willing to pay quite a lot to get high perf disk systems. The best flash SSDs out there are quite expensive. 
In addition the number of transistors per area (and monetary unit) tend to increase with time (that intel guy had some saying about that... :-).> I certainly agree that there seems to be some redundancy when >> the flash SSD controller does a logging-file-system kind of work >> under zfs that does pretty much that by itself, and it could >> possibly be better to cut one of them (and not zfs). >> I am still not convinced that it won''t be better to do this >> in a good controller instead just for speed and to take advantage >> of new hardware that does this smarter than the devices of today. >> >> Do you know how the F5100 works for example? >> >> /ragge >> > The point I''m making here is that the filesystem/OS can make all the same decisions that a good SSD controller can make, faster (as it has most of the data in local RAM or register already), and with a global system viewpoint that the SSD simply can''t have. Most importantly, it''s essentially free for the OS to do so - it has the spare cycles and bandwidth to do so. Putting this intelligence on the SSD costs money that is essentially wasted, not to mention being less efficient overall.I have not done the math here, but to me it isn''t obvious that the OS has spare cycles and bandwidth to do it, since space reclaiming (compacting and erasing) could potentially draw much more bandwidth than the actual workload, and since people have had problem already with to few spare cycles on the X4500 if they want it to do something more than only being a filer (and I guess is where there now is a X4550). The filesystem/OS will most probably *not* have most of the data in local ram when reclaiming space/compacting memory, it will most likely have to read it in to write it out again. /ragge
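A commonly used back-of-the-envelope model for that reclamation bandwidth, wherever it runs (host or controller): if the erase block chosen for cleaning is still a fraction u full of live data, then every byte of new writes drags roughly u/(1-u) bytes of copying along with it. A tiny sketch, with u values picked only for illustration:

    # Cleaning cost for a log-structured flash layout: if the victim erase
    # block is a fraction u live, write amplification is roughly 1/(1-u).
    for u in (0.10, 0.50, 0.80, 0.90, 0.95):
        copies_per_byte = u / (1.0 - u)
        print(f"victim {u:4.0%} live: copy {copies_per_byte:5.2f} B per new B, "
              f"write amplification ~{1.0 / (1.0 - u):5.2f}x")

The numbers explode as the device fills up, which is why this argument cuts both ways: whoever does the cleaning, host or controller, pays this bill.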
On 3 jan 2010, at 06.07, Ragnar Sundblad wrote:
> (I don't think they typically merge pages, I believe they rather
> just pick pages with some freed blocks, copies the active blocks
> to the "end" of the disk, and erases the page.)

(And of course you implement wear leveling with the same mechanism - when the wear differs too much, pick a page with low wear and copy it to a more worn page.)

I actually happened to stumble on an application note from Numonyx that describes the append-style SSD disk and space reclamation method I described, right here:

<http://www.numonyx.com/Documents/Application%20Notes/AN1821.pdf>

(No - I had not read this before writing my previous mail! :-)

To me, it seems also in this paper that it is common knowledge that this is how you should implement a flash SSD disk - if you don't do anything fancier, of course.

/ragge
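To make the append-and-reclaim scheme described in that app note concrete, here is a minimal toy simulation of such a layer in Python: logical writes always append to the block currently being filled, stale copies are simply marked dead, and a greedy cleaner evacuates the block with the fewest live pages before erasing it. The geometry, the 15% spare area and the random-overwrite workload are made up for illustration, and the survivors are assumed to be staged in controller RAM while their block is erased.

    import random

    PAGES_PER_BLOCK = 64          # pages per erase block (assumed 4 KB pages)
    BLOCKS = 256                  # a tiny 64 MB device, for speed
    SPARE = 0.15                  # assumed over-provisioning
    LOGICAL = int(BLOCKS * PAGES_PER_BLOCK * (1 - SPARE))

    block_of = {}                                   # logical page -> physical block
    members = [set() for _ in range(BLOCKS)]        # live logical pages per block
    free = list(range(1, BLOCKS))                   # erased blocks
    active, fill = 0, 0                             # block currently being appended to
    erases = copies = 0

    def place(lpn):
        """Append one logical page to the active block."""
        global fill
        block_of[lpn] = active
        members[active].add(lpn)
        fill += 1

    def open_new_block():
        """Take an erased block, or greedily clean the deadest block first."""
        global active, fill, erases, copies
        if free:
            active, fill = free.pop(), 0
            return
        # no erased blocks left (free is empty), so garbage-collect one
        victim = min((b for b in range(BLOCKS) if b != active),
                     key=lambda b: len(members[b]))
        survivors = list(members[victim])           # still-live pages must move
        copies += len(survivors)
        erases += 1
        members[victim].clear()
        active, fill = victim, 0                    # reuse the just-erased block
        for lpn in survivors:                       # staged in RAM in this toy model
            place(lpn)

    def host_write(lpn):
        if fill == PAGES_PER_BLOCK:
            open_new_block()
        old = block_of.get(lpn)
        if old is not None:                         # invalidate the previous copy
            members[old].discard(lpn)
        place(lpn)

    random.seed(1)
    for lpn in range(LOGICAL):                      # fill the device once...
        host_write(lpn)
    OVERWRITES = 200000
    for _ in range(OVERWRITES):                     # ...then overwrite at random
        host_write(random.randrange(LOGICAL))

    total = LOGICAL + OVERWRITES
    print("host page writes   :", total)
    print("GC copies          :", copies)
    print("block erases       :", erases)
    print("write amplification: %.2f" % ((total + copies) / total))

The printed write amplification is what decides both the sustained write speed and how quickly the erase budget is used up, which is the trade-off being argued over in this thread.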
Ragnar Sundblad wrote:> On 3 jan 2010, at 04.19, Erik Trimble wrote: > >> Let''s say I have 4k blocks, grouped into a 128k page. That is, the SSD''s fundamental minimum unit size is 4k, but the minimum WRITE size is 128k. Thus, 32 blocks in a page. >> > Do you know of SSD disks that have a minimum write size of > 128 KB? I don''t understand why it would be designed that way. > > A typical flash chip has pretty small write block sizes, like > 2 KB or so, but they can only erase in pages of 128 KB or so. > (And then you are running a few of those in parallel to get some > speed, so these numbers often multiply with the number of > parallel chips, like 4 or 8 or so.) > Typically, you have to write the 2 KB blocks consecutively > in a page. Pretty much all set up for an append-style system. > :-) > > In addition, flash SSDs typically have some DRAM write buffer > that buffers up writes (like a txg, if you will), so small > writes should not be a problem - just collect a few and append! >In MLC-style SSDs, you typically have a block size of 2k or 4k. However, you have a Page size of several multiples of that, 128k being common, but by no means ubiquitous. I think you''re confusing erasing with writing. When I say "minimum write size", I mean that for an MLC, no matter how small you make a change, the minimum amount of data actually being written to the SSD is a full page (128k in my example). There is no "append" down at this level. If I have a page of 128k, with data in 5 of the 4k blocks, and I then want to add another 2k of data to this, I have to READ all 5 4k blocks into the controller''s DRAM, add the 2k of data to that, then write out the full amount to a new page (if available), or wait for a older page to be erased before writing to it. Thus, in this case, in order to do an actual 2k write, the SSD must first read 10k of data, do some compositing, then write 12k to a fresh page. Thus, to change any data inside a single page, then entire contents of that page have to be read, the page modified, then the entire page written back out.>> So, I write a bit of data 100k in size. This occupies the first 25 blocks in the one page. The remaining 9 blocks are still one the SSD''s Free List (i.e. list of free space). >> >> Now, I want to change the last byte of the file, and add 10k more to the file. Currently, a non-COW filesystem will simply send the 1 byte modification request and the 10k addition to the SSD (all as one unit, if you are lucky - if not, it comes as two ops: 1 byte modification followed by a 10k append). The SSD now has to read all 25 blocks of the page back into it''s local cache on the controller, do the modification and append computing, then writes out 28 blocks to NAND. In all likelihood, if there is any extra pre-erased (or never written to) space on the drive, this 28 block write will go to a whole new page. The blocks in the original page will be moved over to the SSD Free List (and may or may not be actually erased, depending on the controller). >> > > Do you know for sure that you have SSD flash disks that > work this way? It seems incredibly stupid. It would also > use up the available erase cycles much faster than necessary. > What write speed do you get? >What I''m describing is how ALL MLC-based SSDs work. 
SLC-based SSDs work differently, but still have problems with what I''ll call "excess-writing".>>> And as far as I know, zfs can not do that today - it can not >>> move around already written data, not for defragmentation, not >>> for adding or removing disks to stripes/raidz:s, not for >>> deduping/duping and so on, and I have understood it as >>> BP Rewrite could solve a lot of this. >>> >> ZFS''s propensity to fragmentation doesn''t mean you lose space. Rather, it means that COW often results in frequently-modified files being distributed over the entire media, rather than being contiguous. So, over time, the actual media has very little (if any) contiguous free space, which is what the fragmentation problem is. BP rewrite will indeed allow us to create a de-fragger. Areas which used to hold a ZFS block (now vacated by a COW to somewhere else) are simply added back to the device''s Free List. >> Now, in SSD''s case, this isn''t a worry. Due to the completely even performance characteristics of NAND, it doesn''t make any difference if the physical layout of a file happens to be sections (e.g. ZFS blocks) scattered all over the SSD. >> > > Yes, there is something to worry about, as you can only > erase flash in large pages - you can not erase them only where > the free data blocks in the Free List are. >I''m not sure that SSDs actually _have_ to erase - they just overwrite anything there with new data. But this is implementation dependent, so I can say how /all/ MLC SSDs behave.> (I don''t think they typically merge pages, I believe they rather > just pick pages with some freed blocks, copies the active blocks > to the "end" of the disk, and erases the page.) > > Well, the algorithms are often trade secrets, and if what you say > is correct, and it was my product, then I wouldn''t even want to > tell anyone about it, since it would be a horrible waste of both > bandwidth and erase cycles. Using up the 10000 erase cycles of > a MLC device 64 times faster than necessary seems like an > extremely bad idea. But there sure is a lot of crap out there, > I can''t say you are wrong (only hope :-). > > I doubt for example the F5100 works that way, it would be hard to > get ~15000 4KB w/s per "flash-SODIMM" if it behaved like that > (you typically can erase only 500-1000 pages a second, for > example). > I doubt the Intel X25 works that way, as their read performance > suffers if they are written with smaller blocks and get internally > fragmented - that problem could not exist if they always filled > complete new pages in a R-M-W manner. >Once again, what I''m talking about is a characteristic of MLC SSDs, which are used in most consumer SSDS (the Intel X25-M, included). Sure, such an SSD will commit any new writes to pages drawn from the list of "never before used" NAND. However, at some point, this list becomes empty. In most current MLC SSDs, there''s about 10% "extra" (a 60GB advertised capacity is actually ~54GB usable with 6-8GB "extra"). Once this list is empty, the SSD has to start writing back to previous used pages, which may require an erase step first before any write. Which is why MLC SSDs slow down drastically once they''ve been fulled to capacity several times.> I am not convinced the compute amount needs to be fixed, or > even that they by their nature need to be as cheap as possible - > if that hurts performance. People are obviously willing to pay > quite a lot to get high perf disk systems. The best flash SSDs > out there are quite expensive. 
In addition the number of > transistors per area (and monetary unit) tend to increase > with time (that intel guy had some saying about that... :-). >My point there is that if you build a controller for $X, that will get you Y compute ability. For a dumb controller, less of this Y ability is going to be used up by "housekeeping" functions for the SSD, and more thus being available to manage I/O, than for a smart controller. Put it another way: For a giving throughput performance of X, it will cost less to build a dumb controller than a smart controller. And, yes, price is a concern, even at the Enterprise level. Being able to build a dumb controller for 50% (or less) of the cost of a smart controller is likely to get you noticed by your consumers. Or at least by your accountant, since your profit for the SSD will be higher.> I have not done the math here, but to me it isn''t obvious that > the OS has spare cycles and bandwidth to do it, since space > reclaiming (compacting and erasing) could potentially draw much > more bandwidth than the actual workload, and since people have > had problem already with to few spare cycles on the X4500 > if they want it to do something more than only being a > filer (and I guess is where there now is a X4550). > The filesystem/OS will most probably *not* have most of the > data in local ram when reclaiming space/compacting memory, > it will most likely have to read it in to write it out again. > > /ragge >The whole point behind ZFS is that CPU cycles are cheap and available, much more so than dedicated hardware of any sort. What I''m arguing here is that the controller on an SSD is in the same boat as a dedicated RAID HBA - in the latter case, use a cheap HBA instead and let the CPU & ZFS do the work, while in the former case, use a "dumb" controller for the SSD instead of a smart one. I''m pretty sure compacting doesn''t occur in ANY SSDs without any OS intervention (that is, the SSD itself doesn''t do it), and I''d be surprised to see an OS try to implement some sort of intra-page compaction - there benefit doesn''t seem to be there; it''s better just to optimize writes than try to compact existing pages. As far as reclaiming unused space, the TRIM command is there to allow the SSD to mark a page Free for reuse, and an SSD isn''t going to be erasing a page unless it''s right before something is to be written to that page. The X4500 was specifically designed to be a filer. It has more than enough CPU cycles to deal with pretty much all workloads it gets in that area - in fact, the major problem with the X4500 is insufficient response time of the SATA drives, which slows throughput. Sure you can run other things on it, but it''s really not designed for heavy-duty extra workloads - it''s a disk server, not a compute server. I''ve run compressed zvols on it, and have no problem saturating the 4x1Gbit interfaces while still not pegging both CPUs. I''d imaging that it starts to run into problems with multiple 10Gbit Ethernet interfaces, but that''s to be expected. Bus bandwidth isn''t really a concern, with SSDs using either SATA 3G or SAS 3G right now, and soon SATA 6G or SAS 12G in the near future. Likewise, system bus isn''t much of an immediate concern, as pretty much all SAS/SATA controllers use an 8x PCI-E attachment, for no more than 8 devices (SAS controllers which support more than 8 devices almost always have several 8x attachments). 
And, as I pointed out in another message, doing it my way doesn''t increase bus traffic that much over what is being done now, in any case. -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (GMT-0800)
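To put some rough numbers on the bandwidth question raised earlier in the thread, here is a quick sanity check of the links mentioned above against a hypothetical workload plus host-driven compaction traffic; the workload rate and the amplification factor are assumptions, and the link figures are nominal.

    # Nominal link budgets against a hypothetical workload plus host-driven
    # compaction traffic (all numbers are assumptions, before protocol overhead).
    GBIT = 1e9 / 8                            # bytes per second per Gbit/s
    links = {
        "SATA/SAS 3 Gb/s": 3 * GBIT * 0.8,    # ~80% usable after 8b/10b coding
        "4 x 1 GbE": 4 * GBIT,
        "PCIe 1.x x8": 8 * 250e6,
    }
    workload = 120e6                          # assumed client write rate, B/s
    amplification = 3.0                       # assumed: GC copies 2 B per new B
    need = workload * amplification

    for name, cap in links.items():
        verdict = "fits" if cap > need else "tight"
        print(f"{name:>16}: {cap / 1e6:5.0f} MB/s capacity, "
              f"{verdict} for {need / 1e6:.0f} MB/s of writes+copies")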
Erik Trimble wrote:
> Ragnar Sundblad wrote:
>> Yes, there is something to worry about, as you can only
>> erase flash in large pages - you can not erase them only where
>> the free data blocks in the Free List are.
> I'm not sure that SSDs actually _have_ to erase - they just overwrite
> anything there with new data. But this is implementation dependent, so
> I can say how /all/ MLC SSDs behave.

I meant to say that I DON'T know how all MLC drives deal with erasure.

>> (I don't think they typically merge pages, I believe they rather
>> just pick pages with some freed blocks, copies the active blocks
>> to the "end" of the disk, and erases the page.)

That is correct, as your pointer to the Numonyx doc explains.

> I'm pretty sure compacting doesn't occur in ANY SSDs without any OS
> intervention (that is, the SSD itself doesn't do it), and I'd be
> surprised to see an OS try to implement some sort of intra-page
> compaction - the benefit doesn't seem to be there; it's better just
> to optimize writes than try to compact existing pages. As far as
> reclaiming unused space, the TRIM command is there to allow the SSD to
> mark a page Free for reuse, and an SSD isn't going to be erasing a
> page unless it's right before something is to be written to that page.

My notion of what "compacting" meant doesn't match how the term is generally used in the SSD technical papers, so in this respect I'm wrong: compacting does occur, but only when there are no fully erased (or unused) pages available. Thus, compacting is done in the context of a write operation.

-- 
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)
On Sat, Jan 2 at 22:24, Erik Trimble wrote:>In MLC-style SSDs, you typically have a block size of 2k or 4k. >However, you have a Page size of several multiples of that, 128k >being common, but by no means ubiquitous.I believe your terminology is crossed a bit. What you call a block is usually called a sector, and what you call a page is known as a block. Sector is (usually) the unit of reading from the NAND flash. The unit of write in NAND flash is the page, typically 2k or 4k depending on NAND generation, and thus consisting of 4-8 ATA sectors (typically). A single page may be written at a time. I believe some vendors support partial-page programming as well, allowing a single sector "append" type operation where the previous write left off. Ordered pages are collected into the unit of erase, which is known as a block (or "erase block"), and is anywhere from 128KB to 512KB or more, depending again on NAND generation, manufacturer, and a bunch of other things. Some large number of blocks are grouped by chip enables, often 4K or 8K blocks.>I think you''re confusing erasing with writing. > >When I say "minimum write size", I mean that for an MLC, no matter >how small you make a change, the minimum amount of data actually >being written to the SSD is a full page (128k in my example). TherePage is the unit of write, but it''s much smaller in all NAND I am aware of.>is no "append" down at this level. If I have a page of 128k, with >data in 5 of the 4k blocks, and I then want to add another 2k of data >to this, I have to READ all 5 4k blocks into the controller''s DRAM, >add the 2k of data to that, then write out the full amount to a new >page (if available), or wait for a older page to be erased before >writing to it. Thus, in this case, in order to do an actual 2k >write, the SSD must first read 10k of data, do some compositing, then >write 12k to a fresh page. > >Thus, to change any data inside a single page, then entire contents >of that page have to be read, the page modified, then the entire page >written back out.See above.>What I''m describing is how ALL MLC-based SSDs work. SLC-based SSDs >work differently, but still have problems with what I''ll call >"excess-writing".I think you''re only describing dumb SSDs with erase-block granularity mapping. Most (all) vendors have moved away from that technique since random write performance is awful in those designs and they fall over dead from wAmp in a jiffy. SLC and MLC NAND is similar, and they are read/written/erased almost identically by the controller.>I''m not sure that SSDs actually _have_ to erase - they just overwrite >anything there with new data. But this is implementation dependent, >so I can say how /all/ MLC SSDs behave.Technically you can program the same NAND page repeatedly, but since bits can only transition from 1->0 on a program operation, the result wouldn''t be very meaningful. An erase sets all the bits in the block to 1, allowing you to store your data.>Once again, what I''m talking about is a characteristic of MLC SSDs, >which are used in most consumer SSDS (the Intel X25-M, included). > >Sure, such an SSD will commit any new writes to pages drawn from the >list of "never before used" NAND. However, at some point, this list >becomes empty. In most current MLC SSDs, there''s about 10% "extra" >(a 60GB advertised capacity is actually ~54GB usable with 6-8GB >"extra"). Once this list is empty, the SSD has to start writing >back to previous used pages, which may require an erase step first >before any write. 
Which is why MLC SSDs slow down drastically once >they''ve been fulled to capacity several times.From what I''ve seen, erasing a block typically takes a time in the same scale as programming an MLC page, meaning in flash with large page counts per block, the % of time spent erasing is not very large. Lets say that an erase took 100ms and a program took 10ms, in an MLC NAND device with 100 pages per block. In this design, it takes us 1s to program the entire block, but only 1/10 of the time to erase it. An infinitely fast erase would only make the design about 10% faster. For SLC the erase performance matters more since page writes are much faster on average and there are half as many pages, but we were talking MLC. The performance differences seen is because they were artificially fast to begin with because they were empty. It''s similar to destroking a rotating drive in many ways to speed seek times. Once the drive is full, it all comes down to raw NAND performance, controller design, reserve/extra area (or TRIM) and algorithmic quality. --eric -- Eric D. Mudama edmudama at mail.bounceswoosh.org
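Plugging Eric's hypothetical numbers into a two-line calculation makes the point explicit:

    # Eric's hypothetical numbers: 100 pages per block, 10 ms per MLC page
    # program, 100 ms per block erase.
    PAGES_PER_BLOCK = 100
    T_PROGRAM_MS = 10.0
    T_ERASE_MS = 100.0

    program_total = PAGES_PER_BLOCK * T_PROGRAM_MS
    cycle = program_total + T_ERASE_MS
    print(f"programming a full block: {program_total:.0f} ms, erasing it: {T_ERASE_MS:.0f} ms")
    print(f"erase share of the whole cycle: {T_ERASE_MS / cycle:.0%}")
    # Even an instantaneous erase would shave only ~9-10% off the cycle,
    # which is the point: MLC throughput is dominated by programming.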
Yet another way to thin out the backing devices for a zpool on a thin-provisioned storage host, today: resilver. If your zpool has some redundancy across the SAN backing LUNs, simply drop and replace one at a time and allow zfs to resilver only the blocks currently in use onto the replacement LUN.

-- 
Dan.
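A rough outline of how that one-LUN-at-a-time swap could be scripted, as a Python sketch. The pool name and device names are placeholders, the replacement LUNs are assumed to be freshly provisioned thin devices of at least the same size, and the wait loop just greps the text of 'zpool status', so treat it as a sketch of the procedure rather than a tested tool.

    import subprocess
    import time

    POOL = "tank"                        # placeholder pool name
    OLD_LUNS = ["c4t0d0", "c4t1d0"]      # placeholder "fat" backing LUNs
    NEW_LUNS = ["c5t0d0", "c5t1d0"]      # placeholder freshly provisioned thin LUNs

    def resilver_running(pool):
        out = subprocess.run(["zpool", "status", pool],
                             capture_output=True, text=True).stdout
        return "resilver in progress" in out       # crude scrape of status text

    for old, new in zip(OLD_LUNS, NEW_LUNS):
        # Replace one backing LUN; only blocks actually in use are resilvered.
        subprocess.run(["zpool", "replace", POOL, old, new], check=True)
        while resilver_running(POOL):
            time.sleep(60)
        print(f"{old} -> {new} done; the old LUN can now be reclaimed on the array")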
Eric D. Midama did a very good job answering this, and I don''t have much to add. Thanks Eric! On 3 jan 2010, at 07.24, Erik Trimble wrote:> I think you''re confusing erasing with writing.I am now quite certain that it actually was you who were confusing those. I hope this discussion has cleared things up a little though.> What I''m describing is how ALL MLC-based SSDs work. SLC-based SSDs work differently, but still have problems with what I''ll call "excess-writing".Eric already said it, but I need to say this myself too: SLC and MLC disks could be almost identical, only the storing of the bits in the flash chips differs a little (1 or 2 bits per storage cell). There is absolutely no other fundamental difference between the two. Hopefully no modern MLC *or* SLC disk works as you described, since it is a horrible design, and selling it would be close to robbery. It would be slow and it would wear out quite fast. Now, SLC disks are typically better overall, because those who want to pay for SLC flash typically also want to pay for better controllers, but otherwise those issues are really orthogonal.> I''m not sure that SSDs actually _have_ to erase - they just overwrite anything there with new data. But this is implementation dependent, so I can say how /all/ MLC SSDs behave.As Eric said - yes you have to erase, otherwise you can''t write new data. It is not implementation dependent, it is inherent in the flash technology. And, as has been said several times now, erasing can only be done in large chunks, but writing can be done in small chunks. I''d say that this is the main problem to handle when creating a good flash SSD.> The whole point behind ZFS is that CPU cycles are cheap and available, much more so than dedicated hardware of any sort. What I''m arguing here is that the controller on an SSD is in the same boat as a dedicated RAID HBA - in the latter case, use a cheap HBA instead and let the CPU & ZFS do the work, while in the former case, use a "dumb" controller for the SSD instead of a smart one.This could be true, I am still not sure. My main issues with this is that it would make the file system code dependent of a special hardware behavior (that of todays flash chips), and that it could be quite a lot of data to shuffle around when compacting. But we''ll see. If it could be cheap enough, it could absolutely happen and be worth it even if it has some drawbacks.> And, as I pointed out in another message, doing it my way doesn''t increase bus traffic that much over what is being done now, in any case.Yes, it would increase bus traffic, if you would handle flash the compacting in the host - which you have to with your idea - it could be many times the real workload bandwidth. But it could still be worth it, that is quite possible. --------- On 3 jan 2010, at 07.43, Erik Trimble wrote:> I meant to say that I DON''T know how all MLC drives deal with erasure.Again - yes they do. (Or they would be write-once only. :-)>> I''m pretty sure compacting doesn''t occur in ANY SSDs without any OS intervention (that is, the SSD itself doesn''t do it), and I''d be surprised to see an OS try to implement some sort of intra-page compaction - there benefit doesn''t seem to be there; it''s better just to optimize writes than try to compact existing pages. As far as reclaiming unused space, the TRIM command is there to allow the SSD to mark a page Free for reuse, and an SSD isn''t going to be erasing a page unless it''s right before something is to be written to that page. 
> My thinking of what compacting meant doesn''t match up with what I''m seeing general usage in the SSD technical papers is, so in this respect, I''m wrong: compacting does occur, but only when there are no fully erased (or unused) pages available. Thus, compacting is done in the context of a write operation.Exactly what and when it is that triggers compacting is another issue, and that could probably change with firmware revisions. It is wise to do it earlier than when you get that write that didn''t fit, since if you have some erased space you can then take burts of writes up to that size quickly. But compacting takes bandwidth from the flash chips and wears them out, so you don''t want to do it to early and to much. I guess this could be an interesting optimization problem, and optimal behavior probably depends on the workload too. Maybe it should be an adjustable knob. --------- On 3 jan 2010, at 10.57, Eric D. Mudama wrote:> On Sat, Jan 2 at 22:24, Erik Trimble wrote: >> In MLC-style SSDs, you typically have a block size of 2k or 4k. However, you have a Page size of several multiples of that, 128k being common, but by no means ubiquitous. > > I believe your terminology is crossed a bit. What you call a block is > usually called a sector, and what you call a page is known as a block. > > Sector is (usually) the unit of reading from the NAND flash.... Indeed, and I am partly guilty to that mess, but I didn''t want do change terminology in the middle of the discussion just to make it more flash-y. Maybe a mistake. :-) --------- Now, *my* view of how a typical, modern flash SSD works is as an appendable cyclic log. You can append blocks to it, but no two blocks can have the same address (the new block would mask away the old one), and there is a maximum address (dependent of the size of the disk), so the log has a maximum length. This has, in my head, some resemblance to the txg appending zfs does. On the inside, the flash SSD can''t just rewrite new blocks to any free space because of the the way erasing works on large chunks, "erase blocks" in the flash chips of today. Therefore, it has to internally take "erase blocks" with freed space in it and move all active blocks to the end of the log to save them and compact them. It can then erase the "erase block", and reuse that area for new pages. This activity competes with the normal disk activities. There are of course other issues two, like wear leveling, bad block handling and stuff. /ragge
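One way to picture that "adjustable knob" is a pair of watermarks on the pool of erased blocks: start background cleaning when the reserve drops below a low mark, stop once a high mark is rebuilt. A tiny policy sketch, with arbitrary thresholds:

    # Background-cleaning policy sketch: hysteresis on the number of erased
    # blocks held in reserve (thresholds are arbitrary).
    LOW_WATER = 8          # below this, cleaning must run
    HIGH_WATER = 32        # once reached, cleaning can stop again

    def should_clean(erased_blocks, cleaning_now):
        if erased_blocks < LOW_WATER:
            return True                    # no headroom left for write bursts
        if cleaning_now and erased_blocks < HIGH_WATER:
            return True                    # keep going until the reserve is rebuilt
        return False                       # idle: save wear and flash bandwidth

    # A controller (or a hypothetical host-side layer) would call this between
    # operations and run one evacuate-and-erase step whenever it returns True.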
>>>>> "dm" == David Magda <dmagda at ee.ryerson.ca> writes:

dm> 4096 - to-512 blocks

aiui NAND flash has a minimum write size (determined by ECC OOB bits) of 2 - 4kB, and a minimum erase size that's much larger. Remapping cannot abstract away the performance implication of the minimum write size if you are doing a series of synchronous writes smaller than the minimum size on a device with no battery/capacitor, although using a DRAM+supercap prebuffer might be able to abstract away some of it.
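A quick count of what that coalescing buys for a stream of small synchronous writes, assuming a 4 KB minimum program size (the sizes and counts are illustrative):

    # Page programs needed for a burst of small synchronous writes, with and
    # without a DRAM+supercap buffer that can coalesce them (assumed 4 KB
    # minimum program size; counts are illustrative).
    NAND_PAGE = 4096
    SYNC_WRITE = 512                       # e.g. one small intent-log record
    WRITES = 10000

    uncoalesced = WRITES                                         # one program each
    coalesced = (WRITES * SYNC_WRITE + NAND_PAGE - 1) // NAND_PAGE
    print(f"without coalescing: {uncoalesced} page programs")
    print(f"with coalescing   : {coalesced} page programs "
          f"(~{uncoalesced // coalesced}x fewer)")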
As a further update, I went back and re-read my SSD controller info, and then did some more Googling. Turns out, I'm about a year behind on State-of-the-SSD. Eric is correct on the way current SSDs implement writes (both SLC and MLC), so I'm issuing a mea culpa here. The change in implementation appears to occur sometime shortly after the introduction of the Indilinx controllers. My fault for not catching this.

-Erik

-- 
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)
>>>>> "ah" == Al Hopper <al at logical-approach.com> writes:

ah> The main issue is that most flash devices support 128k byte
ah> pages, and the smallest "chunk" (for want of a better word) of
ah> flash memory that can be written is a page - or 128kb. So if
ah> you have a write to an SSD that only changes 1 byte in one 512
ah> byte "disk" sector, the SSD controller has to either
ah> read/re-write the affected page or figure out how to update
ah> the flash memory with the minimum affect on flash wear.

yeah well, I'm not sure it matters, but that's untrue. there are two sizes for NAND flash, the minimum write size and the minimum erase size. The minimum write size is the size over which error correction is done, the unit at which inband and OOB data is interleaved, on NAND flash. The minimum erase size is just what it sounds, the size the cleaner/garbage collector must evacuate.

The minimum write size is I suppose likely to provoke read/modify/write and wasting of write and wear bandwidth for smaller writes in flashes which do not have a DRAM+supercap, if you ask to SYNCHRONIZE CACHE right after the write. If there is a supercap, or if you allow the drive to do write caching, then the smaller write could be coalesced making this size irrelevant. I think it's usually 2 - 4 kB. I would expect resistance to growing it larger than 4kB because of NTFS---electrical engineers are usually over-obsessed with Windows.

The minimum erase size you don't really care about at all. That's the one that's usually at least 128kB.

ah> For anyone who is interested in getting more details of the
ah> challenges with flash memory, when used to build solid state
ah> drives, reading the tech data sheets on the flash memory
ah> devices will give you a feel for the basic issues that must be
ah> solved.

and the linux-mtd list will give you a feel for how people are solving them, because that's the only place I know of where NAND filesystem work is going on in the open. There are a bunch of geezers saying ``I wrote one for BSD but my employer won't let me release it,'' and then the new crop of intel/sandforce/stec proprietary kids, but in the open world AFAIK there is just yaffs and ubifs. The tmobile G1 is yaffs.

ah> Bobs point is well made. The specifics of a given SSD
ah> implementation will make the performance characteristics of
ah> the resulting SSD very difficult to predict or even describe -

I'm really a fan of the idea of using ACARD ANS-9010 for a slog. It's basically all DRAM+battery, and uses a low performance CF card for durable storage if the battery starts to run low, or if you explicitly request it (to move data between ACARD units by moving the CF card maybe). It will even make non-ECC RAM into ECC storage (using a sector size and OOB data :). It seems like Zeus-like performance at 1/10th the price, but of course it's a little goofy, and I've never tried it. slog is where I'd expect the high synchronous workload to be, so this is where there are small writes that can't be coalesced, I would presume, and appropriate slog sizes are reachable with DRAM alone.
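On sizing such a DRAM-backed slog: a common rule of thumb is that it only needs to absorb a few transaction groups' worth of synchronous writes, since the data is written again to the main pool when the txg commits. A back-of-the-envelope sketch; the throughput and txg interval are assumptions to check against your own workload and ZFS release:

    # Back-of-the-envelope slog sizing: hold a few txg intervals of sync writes.
    SYNC_MB_PER_S = 150        # assumed peak synchronous write rate
    TXG_SECONDS = 30           # assumed worst-case txg commit interval
    TXG_HELD = 2               # keep a couple of txg's worth in hand

    slog_gib = SYNC_MB_PER_S * TXG_SECONDS * TXG_HELD / 1024.0
    print(f"suggested slog size: ~{slog_gib:.1f} GiB")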
>>>>> "et" == Erik Trimble <Erik.Trimble at Sun.COM> writes:

et> Probably, the smart thing to push for is inclusion of some new
et> command in the ATA standard (in a manner like TRIM). Likely
et> something that would return both native Block and Page sizes
et> upon query.

that would be the *sane* thing to do. The *smart* thing to do would be write a quick test to determine the apparent page size by performance-testing write-flush-write-flush-write-flush with various write sizes and finding the knee that indicates the smallest size at which read-before-write has stopped. The test could happen in 'zpool create' and have its result written into the vdev label.

Inventing ATA commands takes too long to propagate through the technosphere, and the EE's always implement them wrongly: for example, a device with SDRAM + supercap should probably report 512 byte sectors because the algorithm for copying from SDRAM to NAND is subject to change and none of your business, but EE's are not good with language and will try to apelike match up the paragraph in the spec with the disorganized thoughts in their head, fit pegs into holes, and will end up giving you the NAND page size without really understanding why you wanted it other than that some standard they can't control demands it. They may not even understand why their devices are faster and slower---they are probably just hurling shit against an NTFS and shipping whatever runs some testsuite fastest---so doing the empirical test is the only way to document what you really care about in a way that will make it across the language and cultural barriers between people who argue about javascript vs python and ones that argue about Agilent vs LeCroy. Within the proprietary wall of these flash filesystem companies the testsuites are probably worth as much as the filesystem code, and here without the wall an open-source statistical test is worth more than a haggled standard.

Remember the ``removeable'' bit in USB sticks and the mess that both software and hardware made out of it. (hot-swappable SATA drives are ``non-removeable'' and don't need rmformat while USB/firewire do? yeah, sorry, u fail abstraction. and USB drives have the ``removable medium'' bit set when the medium and the controller are inseparable, it's the _controller_ that's removeable? ya sorry u fail reading English.) If you can get an answer by testing, DO IT, and evolve the test to match products on the market as necessary. This promises to be a lot more resilient than the track record with bullshit ATA commands and will work with old devices too. By the time you iron out your standard we will be using optonanocyberflash instead: that's what happened with the removeable bit and r/w optical storage. BTW let me know when read/write UDF 2.0 on dvd+r is ready---the standard was only announced twelve years ago, thanks.
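A very rough sketch of the kind of probe described above: time synchronous writes of increasing size against a raw device and look for the size at which the cost per write stops being flat (the knee). The device path is a placeholder, the test overwrites whatever is on it, and a real version would have to control alignment, caching and queueing much more carefully; it also assumes the platform exposes O_DSYNC through the os module.

    import os
    import time

    DEV = "/dev/rdsk/c1t2d0s0"       # placeholder raw device -- contents destroyed!
    TRIALS = 64

    def avg_latency(fd, size):
        buf = os.urandom(size)
        start = time.time()
        for i in range(TRIALS):
            os.pwrite(fd, buf, i * size)     # O_DSYNC makes each write synchronous
        return (time.time() - start) / TRIALS

    fd = os.open(DEV, os.O_WRONLY | os.O_DSYNC)
    try:
        for size in (512, 1024, 2048, 4096, 8192, 16384, 32768, 65536):
            lat = avg_latency(fd, size)
            print(f"{size:6d} B: {lat * 1e6:9.1f} us per write")
    finally:
        os.close(fd)
    # Read the output by eye: while writes are smaller than the device's
    # internal page the cost per write tends to stay roughly flat (every
    # write still pays for a whole page); past that size it starts growing
    # with the write size.  The knee is the figure to feed back into the
    # filesystem's allocation size.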