Hi all

Is it possible to securely delete a file from a zfs dataset/zpool once it's been snapshotted, meaning "delete (and perhaps overwrite) all copies of this file"?

Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
roy at karlsbakk.net
http://blogg.karlsbakk.net/
--
In all pedagogy it is essential that the curriculum be presented intelligibly. It is an elementary imperative for all pedagogues to avoid excessive use of idioms of foreign origin. In most cases, adequate and relevant synonyms exist in Norwegian.
No, not until all snapshots referencing the file in question are removed.

The simplest way to understand snapshots is to think of them as references. Any file-system object (say, a file or a block) is only removed when its reference count drops to zero.

Regards,
Andrey

On Sat, Apr 10, 2010 at 10:20 PM, Roy Sigurd Karlsbakk <roy at karlsbakk.net> wrote:
> Is it possible to securely delete a file from a zfs dataset/zpool once it's been snapshotted, meaning "delete (and perhaps overwrite) all copies of this file"?
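To make the reference behavior concrete, here is a minimal sketch (pool, dataset, and file names are invented for illustration) showing that a removed file stays reachable through a snapshot until the snapshot itself is destroyed:

    # create a dataset, a file, and a snapshot referencing the file
    zfs create tank/demo
    cp secret.dat /tank/demo/
    zfs snapshot tank/demo@before

    # rm drops one reference, but the snapshot still holds the blocks
    rm /tank/demo/secret.dat
    cat /tank/demo/.zfs/snapshot/before/secret.dat   # data still readable

    # only now does the reference count drop to zero
    zfs destroy tank/demo@before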
On 10.04.10 21:06, Andrey Kuzmin wrote:
> No, not until all snapshots referencing the file in question are removed.
>
> The simplest way to understand snapshots is to think of them as references. Any file-system object (say, a file or a block) is only removed when its reference count drops to zero.

Another thing to consider is the copy-on-write nature of ZFS: "overwriting" a file will not actually write to the same place on disk, so it may be possible to retrieve supposedly deleted data if the disk(s) in question were examined at a lower level. (I seem to remember that something is/was in the works to address this; I can't recall any details, but I think Darren Moffat was involved.)

Michael
--
Michael Schuster
Oracle
Recursion, n.: see 'Recursion'
I guess that's the way I thought it was. Perhaps it would be nice to add such a feature? If something gets stuck in a truckload of snapshots, say a 40GB file in the root fs, it'd be nice to just

    rm --killemall largefile

Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
roy at karlsbakk.net
http://blogg.karlsbakk.net/

----- "Andrey Kuzmin" <andrey.v.kuzmin at gmail.com> wrote:
> No, not until all snapshots referencing the file in question are removed.
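Until something like that exists, the closest equivalent is destroying every snapshot that still references the file before deleting it. Roughly (dataset and snapshot names are invented for the sketch):

    # list the snapshots that pin the file's blocks
    zfs list -t snapshot -o name -r rpool/ROOT

    # destroy each one, then delete the file itself
    for snap in rpool/ROOT@monday rpool/ROOT@tuesday; do
        zfs destroy "$snap"
    done
    rm /largefile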
Roy Sigurd Karlsbakk <roy at karlsbakk.net> wrote:
> I guess that's the way I thought it was. Perhaps it would be nice to add such a feature? If something gets stuck in a truckload of snapshots, say a 40GB file in the root fs, it'd be nice to just rm --killemall largefile

Let us first assume the simple case where the file is not part of any snapshot. For a secure delete, a file needs to be overwritten in place, and this cannot be done on a filesystem that is always COW. The secure deletion of the data would be something that happens before the file is actually unlinked (e.g. by rm). This secure deletion would need to open the file in a non-COW mode.

Jörg
--
EMail: joerg at schily.isdn.cs.tu-berlin.de (home)  Jörg Schilling  D-13353 Berlin
       js at cs.tu-berlin.de (uni)
       joerg.schilling at fokus.fraunhofer.de (work)
Blog: http://schily.blogspot.com/
URL: http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org]
> <roy at karlsbakk.net> wrote:
> > Is it possible to securely delete a file from a zfs dataset/zpool once it's been snapshotted, meaning "delete (and perhaps overwrite) all copies of this file"?
>
> No, not until all snapshots referencing the file in question are removed.

Actually, the question was about secure delete. This means that even if you deleted all the snapshots, you'd still have to find some way of overwriting the blocks that formerly contained the file. This is ungraceful, but it will get the job done:

    for pass in 1 2 3 4 ; do
        for wiper in zero urandom ; do
            # fill all free space in the dataset with zeros/random data;
            # dd runs until the dataset is full
            dd if=/dev/$wiper of=junk.file bs=1024k
            # sleep 30 gives ZFS time to sync the data to disk
            sleep 30
            rm junk.file
        done
    done
Joerg.Schilling at fokus.fraunhofer.de wrote:
> The secure deletion of the data would be something that happens before the file is actually unlinked (e.g. by rm). This secure deletion would need to open the file in a non-COW mode.

That may not be sufficient. Earlier writes to the file might have left older copies of the blocks lying around which could be recovered.

My $0.02

-Manoj
On 04/11/10 10:19, Manoj Joseph wrote:
> Earlier writes to the file might have left older copies of the blocks lying around which could be recovered.

Indeed; to be really sure you need to overwrite all the free space in the pool.

If you limit yourself to worrying about data accessible via a regular read on the raw device, it's possible to do this without an outage if you have a spare disk and a lot of time. Rough process (one iteration is sketched below):

0) delete the files and snapshots containing the data you wish to purge.
1) replace a previously unreplaced disk in the pool with the spare disk using "zpool replace".
2) wait for the replace to complete.
3) wipe the removed disk, using the "purge" command of format(1M)'s analyze subsystem or equivalent; the wiped disk is now the spare disk.
4) if all disks have not been replaced yet, go back to step 1.

This relies on the fact that the resilver kicked off by "zpool replace" copies only allocated data.

There are some assumptions in the above. For one, I'm assuming that all disks in the pool are the same size. A bigger one is that a "purge" is sufficient to wipe the disks completely -- probably the biggest single assumption, given that the underlying storage devices themselves are increasingly using copy-on-write techniques.

The most paranoid will replace all the disks and then physically destroy the old ones.

- Bill
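A minimal sketch of one iteration, with invented pool and device names (c1t9d0 standing in for the spare disk):

    # step 1: swap the spare in for one not-yet-replaced disk
    zpool replace tank c1t2d0 c1t9d0

    # step 2: poll until the resilver has finished
    zpool status tank

    # step 3: wipe the removed disk; format(1M) is interactive, so run
    # it, select c1t2d0, then use analyze -> purge. The wiped c1t2d0
    # becomes the spare for the next iteration.
    format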
> The most paranoid will replace all the disks and then physically destroy the old ones.

I thought the most paranoid will encrypt everything and then forget the key... :-)

Seriously, once encrypted zfs is integrated, that's a viable method.

Regards -- Volker
--
Volker A. Brandt                  Consulting and Support for Sun Solaris
Brandt & Brandt Computer GmbH     WWW: http://www.bb-c.de/
Am Wiesenpfad 6, 53340 Meckenheim Email: vab at bb-c.de
Commercial register: Amtsgericht Bonn, HRB 10513    Shoe size: 45
Managing directors: Rainer J. H. Brandt and Volker A. Brandt
On 04/11/10 12:46, Volker A. Brandt wrote:
>> The most paranoid will replace all the disks and then physically destroy the old ones.
>
> I thought the most paranoid will encrypt everything and then forget the key... :-)

Actually, I hear that the most paranoid encrypt everything *and then* destroy the physical media when they're done with it.

> Seriously, once encrypted zfs is integrated, that's a viable method.

It's certainly a new tool to help with the problem, but consider that forgetting a key requires secure deletion of the key. Like most cryptographic techniques, filesystem encryption only changes the size of the problem we need to solve.

- Bill
OpenSolaris needs support for the TRIM command for SSDs. This command is issued to an SSD to indicate that a block is no longer in use and that the SSD may erase it in preparation for future writes.

A SECURE_FREE dataset property might be added that says that when a block is released to free space (and hence becomes eligible for TRIM), ZFS should overwrite the block with zeros (or better, ones). If a dataset has such a property set, then no "stray" copies of the data exist in free space, and deletion of the file and snapshots is sufficient to remove all instances of the data.

If a file exists before such a property is set, that's a problem. If it's really important - and it might be in some cases because of legal mandates - there could be a per-file flag SECURELY_FREED that is set on file creation iff the dataset's SECURE_FREE is set, and is reset if the file is ever changed while SECURE_FREE is clear; this indicates whether any file data "escaped" into free space at some point. Finally, an UNLINK_SECURE call would be needed to avoid race conditions at the end, so an app can be sure the data really was securely erased.

PS. It is faster for an SSD to write a block of 0xFF than of 0x00, and it's possible some might make that optimization. That's why I suggest erase-to-ones rather than erase-to-zero.
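None of this exists today; if it did, the administrator-facing side might look something like this (entirely hypothetical property and dataset names, sketched only to show the intent of the proposal):

    # hypothetical: scrub blocks with ones as they are freed on this dataset
    zfs set secure_free=on tank/legal
    zfs get secure_free tank/legal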
On Sun, Apr 11 at 22:45, James Van Artsdalen wrote:
> PS. It is faster for an SSD to write a block of 0xFF than of 0x00, and it's possible some might make that optimization. That's why I suggest erase-to-ones rather than erase-to-zero.

Do you have any data to back this up? While I understand the underlying hardware implementation of NAND, I am not sure SSDs would bother optimizing for this case. A block erase would be just as effective at hiding data.

I believe the reason strings of bits "leak" on rotating drives you've overwritten (other than grown defects) is minute off-track occurrences while writing (vibration, particles, etc.), causing off-center writes that can be recovered in the future with the right equipment.

Flash doesn't have this "analog positioning" problem. While each electron well is effectively analog, there's no "best guess" work involved in locating the wells.

--
Eric D. Mudama
edmudama at mail.bounceswoosh.org
Hi James:

On Mon, Apr 12, 2010 at 06:45, James Van Artsdalen <james-opensolaris at jrv.org> wrote:
> OpenSolaris needs support for the TRIM command for SSDs. This command is issued to an SSD to indicate that a block is no longer in use and that the SSD may erase it in preparation for future writes.

That's what this RFE is about:

6859245 Solaris needs to support the TRIM command for solid state drives (ssd)

--
Pablo Méndez Hernández
On Sun, 11 Apr 2010, James Van Artsdalen wrote:
> OpenSolaris needs support for the TRIM command for SSDs. This command is issued to an SSD to indicate that a block is no longer in use and that the SSD may erase it in preparation for future writes.

There does not seem to be very much 'need', since there are other ways that an SSD can know that a block is no longer in use so it can be erased. In fact, ZFS already uses an algorithm (COW) which is friendly to SSDs.

ZFS is designed for high throughput, and TRIM does not seem to improve throughput. Perhaps it is most useful for low-grade devices like USB dongles and compact flash.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
On 12 April, 2010 - Bob Friesenhahn sent me these 0,9K bytes:
> ZFS is designed for high throughput, and TRIM does not seem to improve throughput. Perhaps it is most useful for low-grade devices like USB dongles and compact flash.

For flash to overwrite a block, it needs to clear it first... so yes, by clearing blocks in the background (after they are freed) instead of just before the timing-critical write(), you can make stuff go faster.

/Tomas
--
Tomas Ögren, stric at acc.umu.se, http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se
On Mon, April 12, 2010 10:48, Tomas Ögren wrote:
> For flash to overwrite a block, it needs to clear it first... so yes, by clearing blocks in the background instead of just before the timing-critical write(), you can make stuff go faster.

Except that ZFS does not overwrite blocks, because it is copy-on-write.
On Mon, 12 Apr 2010, Tomas Ögren wrote:
> For flash to overwrite a block, it needs to clear it first... so yes, by clearing blocks in the background instead of just before the timing-critical write(), you can make stuff go faster.

Yes, of course. Properly built SSDs include considerable extra space to support wear leveling, and this same space may be used to store erased blocks. A block which is "overwritten" can simply be written to a block allocated from the extra free pool, and the existing block can be re-assigned to the free pool and scheduled for erasure. This is a fairly simple recirculating algorithm which just happens to also assist with wear management.

Filesystem blocks are rarely aligned and sized to match underlying FLASH device blocks, so FLASH devices would need to implement fancy accounting in order to decide when they should actually erase a FLASH block. Erasing a FLASH block may require moving some existing data which was still not erased. It is much easier to allocate a completely fresh block, update it as needed, and use some sort of ordered "atomic" operation to exchange the blocks so the data always exists in some valid state. Without care, existing data which should not be involved in the write may be destroyed due to a power failure.

This is why it is not extremely useful for Solaris to provide support for the "Windows 7 TRIM" command. Really low-grade devices might not have much smarts or do very good wear leveling, and those devices might benefit from the Windows 7 TRIM command.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
On Mon, 12 Apr 2010, David Magda wrote:
> Except that ZFS does not overwrite blocks, because it is copy-on-write.

At some time in the (possibly distant) future the ZFS block might become free, and then the Windows 7 TRIM command could be used to try to pre-erase it. This might help an intermittent benchmark. Of course, the background TRIM commands might clog other on-going operations, so it might hurt the benchmark.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
On 12 April, 2010 - David Magda sent me these 0,7K bytes:
> Except that ZFS does not overwrite blocks, because it is copy-on-write.

So CoW will enable infinite storage, so you never have to write on the same place again? Cool.

/Tomas
--
Tomas Ögren, stric at acc.umu.se, http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se
>> OpenSolaris needs support for the TRIM command for SSDs. This command is issued to an SSD to indicate that a block is no longer in use and that the SSD may erase it in preparation for future writes.
>
> There does not seem to be very much 'need', since there are other ways that an SSD can know that a block is no longer in use so it can be erased. In fact, ZFS already uses an algorithm (COW) which is friendly to SSDs.

What ways would those be?
On Mon, April 12, 2010 12:28, Tomas Ögren wrote:
> So CoW will enable infinite storage, so you never have to write on the same place again? Cool.

Your comment was about making write()s go faster by pre-clearing unused blocks so there are always writable blocks available. Because ZFS doesn't go back to the same LBAs when writing data, the SSD doesn't have to worry about read-modify-write circumstances like it does with traditional file systems.

Given that ZFS probably would not have to go back to "old" blocks until it has reached the end of the disk, that should give the SSDs' firmware plenty of time to do block remapping and background erasing -- something that's done now anyway, regardless of whether an SSD supports TRIM or not. You don't need TRIM to make ZFS go fast, though it doesn't hurt.

There will be no "timing critical" instances as long as there is a decent amount of free space available and ZFS can simply keep doing an LBA++. SSDs worked fine without TRIM; that command just helps them work more efficiently.
My point is not to advocate the TRIM command - those issues are already well known - but rather to suggest that the code that sends TRIM is also a good place to securely erase data on other media, such as a hard disk.

TRIM is not a Windows 7 command but rather a device command. FreeBSD's CAM layer also provides TRIM support, but I don't think any of the filesystems issue the request yet.
On Mon, 12 Apr 2010, James Van Artsdalen wrote:
> TRIM is not a Windows 7 command but rather a device command.

I only called it the "Windows 7 TRIM command" since that is how almost all of the original reports in the media described it. It seems best to preserve this original name (as used by media experts) when discussing the feature on a Solaris list. ;-)

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
On Mon, Apr 12, 2010 at 19:19, David Magda <dmagda at ee.ryerson.ca> wrote:
> Given that ZFS probably would not have to go back to "old" blocks until it has reached the end of the disk, that should give the SSDs' firmware plenty of time to do block remapping and background erasing -- something that's done now anyway, regardless of whether an SSD supports TRIM or not. You don't need TRIM to make ZFS go fast, though it doesn't hurt.

Why would the disk care whether the block was written recently? There is old data on it that has to be preserved anyway, because the SSD does not know whether the old data is important. ZFS will overwrite just as any other filesystem does. The only thing that makes ZFS SSD-friendly is that it tries to make large writes, but that only works if you have few synchronous writes.
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Eric D. Mudama
>
> I believe the reason strings of bits "leak" on rotating drives you've overwritten (other than grown defects) is minute off-track occurrences while writing (vibration, particles, etc.), causing off-center writes that can be recovered in the future with the right equipment.

That's correct. In spindle drives, even if you "zero" the drive, the imprecise positioning of the head is accurate enough for itself to later read "zeroes" accurately from that location; but if the platters are removed and placed into special high-precision hardware, the data can be forensically reconstructed by reading the slightly off-track traces. This process costs a few thousand per drive and takes about a week. So "zeroing" the drive is good enough data destruction for nearly all people in nearly all situations, but not good enough if a malicious person were willing to pay thousands to recover the data.

BTW, during the above process, they have to make intelligent guesses about when they're picking up formerly erased bits and when they're picking up noise. They have to know what to listen for, so they can identify things like "that sounds like a jpg file" and so on... but if the data itself were encrypted, and the empty space around the useful data were also encrypted, and then the whole thing were zeroed, it would be nearly impossible to recover the encrypted data after zeroing, because even the intended data signal would be indistinguishable from noise. And even if they were able to get that... they'd still have to decrypt it.

> Flash doesn't have this "analog positioning" problem. While each electron well is effectively analog, there's no "best guess" work involved in locating the wells.

Although flash doesn't have the tracking issue, it does have a similar stored-history characteristic, which at least theoretically could be used to read formerly erased data. Assuming the storage elements are 3-bit multi-level cells, the FG charge level should land in one of 8 bins... ideally at the precise center of each bin each time. But in reality, it never will. When programming or erasing the element, the tunnel injection or release is held at a known value for a known time, sufficiently long to bring the FG into the desired bin; if the final charge level lands within +/- 5% or even 10% or more off center from the precise center of the bin, that doesn't matter in normal operation, because it's still clearly identifiable which bin it's in. But if a flash device were "zeroed" or "erased" (all 1's) and a forensic examiner could directly access the word lines, then, using instrumentation of higher precision than the on-chip A2Ds, the former data could be extracted with a level of confidence similar to the aforementioned off-track forensic data reconstruction of a spindle drive.

The problem is how to access the word lines, because generally speaking they didn't bring 'em out to pins of the chip. So, like I said... theoretically possible. I expect the NSA or CIA could do it. But the local mom & pop drive-recovery shop... not so likely.
On Mon, Apr 12 at 10:50, Bob Friesenhahn wrote:
> Yes, of course. Properly built SSDs include considerable extra space to support wear leveling, and this same space may be used to store erased blocks. A block which is "overwritten" can simply be written to a block allocated from the extra free pool, and the existing block can be re-assigned to the free pool and scheduled for erasure. This is a fairly simple recirculating algorithm which just happens to also assist with wear management.

The point originally made is that if you eventually write to every LBA on a drive without TRIM support, your "considerable extra space" will only include the extra physical blocks that the manufacturer provided when they sold you the device, and for which you are paying.

The advantage of TRIM, even in high-end SSDs, is that it allows the device to effectively have additional "considerable extra space" available for garbage collection and wear management when not all sectors are in use on the device.

For most users, with anywhere from 5-15% of their device unused, this difference is significant and can improve performance greatly in some workloads. Without TRIM, the device has no way to use this space for anything but tracking the data that is no longer active.

Based on the above, I think TRIM has the potential to help every SSD, not just the "cheap" SSDs.

--
Eric D. Mudama
edmudama at mail.bounceswoosh.org
Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:
> There does not seem to be very much 'need', since there are other ways that an SSD can know that a block is no longer in use so it can be erased. In fact, ZFS already uses an algorithm (COW) which is friendly to SSDs.

Could you please explain what "other ways" you have in mind and how they work?

Jörg
--
EMail: joerg at schily.isdn.cs.tu-berlin.de (home)  Jörg Schilling  D-13353 Berlin
       js at cs.tu-berlin.de (uni)
       joerg.schilling at fokus.fraunhofer.de (work)
Blog: http://schily.blogspot.com/
URL: http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:
> Yes, of course. Properly built SSDs include considerable extra space to support wear leveling, and this same space may be used to store erased blocks. A block which is "overwritten" can simply be written to a block allocated from the extra free pool, and the existing block can be re-assigned to the free pool and scheduled for erasure. This is a fairly simple recirculating algorithm which just happens to also assist with wear management.

I believe you make a mistake with this assumption.

- The SSD cannot know which blocks are currently not in use.

- Especially with a COW filesystem, after some time all net space may have been written to, but the SSD does not know whether it is still used or not. So you see a mainly empty filesystem while the SSD does not know this fact.

- If you do not write too much to an SSD, it may be that the spare space for defect management is sufficient in order to have sufficient prepared erased space.

- Once you write more, I see no reason why a COW filesystem should be any better than a non-COW filesystem.

Jörg
--
EMail: joerg at schily.isdn.cs.tu-berlin.de (home)  Jörg Schilling  D-13353 Berlin
       js at cs.tu-berlin.de (uni)
       joerg.schilling at fokus.fraunhofer.de (work)
Blog: http://schily.blogspot.com/
URL: http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
"David Magda" <dmagda at ee.ryerson.ca> wrote:> Given that ZFS probably would not have to go back to "old" blocks until > it''s reached the end of the disk, that should give the SSDs'' firmware > plenty of time to do block-remapping and background erasing--something > that''s done now anyway regardless of whether an SSD supports TRIM or not. > You don''t need TRIM to make ZFS go fast, though it doesn''t hurt.This is only true as long as the filesystem is far awy from being full and as long as you write less than the spare size of the SSD. J?rg -- EMail:joerg at schily.isdn.cs.tu-berlin.de (home) J?rg Schilling D-13353 Berlin js at cs.tu-berlin.de (uni) joerg.schilling at fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/ URL: http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
If you're concerned about someone reading the charge level of a flash cell to infer the value of the cell before it was erased, then overwrite with random data twice before issuing TRIM (remapping in an SSD probably makes this ineffective). Most people needing a secure-erase feature need it to satisfy legal requirements, not national-security requirements.

Anyone needing a strong TINFOIL_HAT_ERASE feature is going to be encrypting the data anyway. A SECURE_ERASE for them is mainly to satisfy legal and statutory language requiring that the data actually be erased (when it's not worth the lawyer's fees to convince a court that loss of a key makes encrypted data unrecoverable).
On Mon, 12 Apr 2010, Eric D. Mudama wrote:
> The advantage of TRIM, even in high-end SSDs, is that it allows the device to effectively have additional "considerable extra space" available for garbage collection and wear management when not all sectors are in use on the device.
>
> For most users, with anywhere from 5-15% of their device unused, this difference is significant and can improve performance greatly in some workloads. Without TRIM, the device has no way to use this space for anything but tracking the data that is no longer active.
>
> Based on the above, I think TRIM has the potential to help every SSD, not just the "cheap" SSDs.

It seems that the "above" was missing. What concrete evidence were you citing? The value should be clearly demonstrated as fact (with many months of prototype testing with various devices) before the feature becomes a pervasive part of the operating system. Every article I have read about the value of TRIM is pure speculation.

Perhaps it will be found that TRIM has more value for SAN storage (to reclaim space for accounting purposes) than for SSDs.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
On Tue, 13 Apr 2010, Joerg Schilling wrote:
> I believe you make a mistake with this assumption.

I see that you make some mistakes with your own assumptions. :-)

> - The SSD cannot know which blocks are currently not in use.

It does know that blocks in its spare pool are not in use.

> - Especially with a COW filesystem, after some time all net space may have been written to, but the SSD does not know whether it is still used or not. So you see a mainly empty filesystem while the SSD does not know this fact.

You are assuming that a COW filesystem will tend to overwrite all disk blocks. That is not a good assumption, since filesystems are often optimized to prefer disk "outer tracks". FLASH does not have any tracks, but it is likely that existing optimizations remain. Filesystems are not of great value unless they store data, so optimizing for the empty case is not useful in the real world. The natives grow restless (and look for scalps) if they find that write performance goes away once the device contains data.

> - If you do not write too much to an SSD, it may be that the spare space for defect management is sufficient in order to have sufficient prepared erased space.
>
> - Once you write more, I see no reason why a COW filesystem should be any better than a non-COW filesystem.

The main reason why a COW filesystem like zfs may perform better is that zfs sends a fully defined block to the device. This reduces the probability that the FLASH device will need to update an existing FLASH block. COW increases the total amount of data written, but it also reduces the FLASH read/update/re-write cycle.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
On Tue, Apr 13 at 9:52, Bob Friesenhahn wrote:
> It seems that the "above" was missing. What concrete evidence were you citing?

Nothing concrete. It just makes sense to me that if ZFS has to work harder to garbage collect as a pool approaches 100% full, then SSDs that use variants of CoW would likewise have to work harder to garbage collect as they approach 100% written.

The purpose of TRIM is to tell the drive that some number of sectors are no longer important, so that it doesn't have to work as hard in its internal garbage collection.

> The value should be clearly demonstrated as fact (with many months of prototype testing with various devices) before the feature becomes a pervasive part of the operating system. Every article I have read about the value of TRIM is pure speculation.
>
> Perhaps it will be found that TRIM has more value for SAN storage (to reclaim space for accounting purposes) than for SSDs.

Perhaps, but that's not my gut feel. I believe it has real value for users in enterprise-type workloads, where performance comes down to a simple calculation of reserve area on the SSD.

--eric

--
Eric D. Mudama
edmudama at mail.bounceswoosh.org
On Thu, 15 Apr 2010, Eric D. Mudama wrote:
> The purpose of TRIM is to tell the drive that some number of sectors are no longer important, so that it doesn't have to work as hard in its internal garbage collection.

The sector size does not typically match the FLASH page size, so the SSD still has to do some heavy lifting. It has to keep track of many small "holes" in the FLASH pages. This seems pretty complicated, since all of this information needs to be well preserved in non-volatile storage.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
On 4/16/2010 10:30 AM, Bob Friesenhahn wrote:
> The sector size does not typically match the FLASH page size, so the SSD still has to do some heavy lifting. It has to keep track of many small "holes" in the FLASH pages. This seems pretty complicated, since all of this information needs to be well preserved in non-volatile storage.

But doesn't the TRIM command help here? If, as the OS goes along, it marks sectors as unused, then the SSD has a lighter lift: it might only need to read, for example, 1 sector out of 8 (assuming sectors of 512 bytes and 4K FLASH pages) before writing a new page with that 1 sector and 7 new ones.

Additionally, in the background I would think it would be able to find a page with 3 in-use sectors and another with 5, for example, write all 8 to a new page, remap those sectors to the new location, and then pre-erase the 2 pages just freed up.

How doesn't that help?

-Kyle
On Fri, 16 Apr 2010, Kyle McDonald wrote:
> But doesn't the TRIM command help here? If, as the OS goes along, it marks sectors as unused, then the SSD has a lighter lift: it might only need to read, for example, 1 sector out of 8 (assuming sectors of 512 bytes and 4K FLASH pages) before writing a new page with that 1 sector and 7 new ones.
>
> Additionally, in the background I would think it would be able to find a page with 3 in-use sectors and another with 5, for example, write all 8 to a new page, remap those sectors to the new location, and then pre-erase the 2 pages just freed up.

While I am not an SSD designer, I agree with you that a smart SSD designer would include a dereferencing table which maps sectors to pages so that FLASH pages can be completely filled, even if the stored sectors are not contiguous. It would allow sectors to be migrated to different pages in the background in order to support wear leveling and compaction. This is obviously challenging to do if FLASH is used to store this dereferencing table and the device does not at least include a super-capacitor which assures that the table will be fully written on power failure. If the table is corrupted, then the device is bricked.

It is much more efficient (from a housekeeping perspective) if filesystem sectors map directly to SSD pages, but we are not there yet.

As a devil's advocate, I am still waiting for someone to post a URL to a serious study which proves the long-term performance advantages of TRIM.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
On Fri, Apr 16 at 10:05, Bob Friesenhahn wrote:
> It is much more efficient (from a housekeeping perspective) if filesystem sectors map directly to SSD pages, but we are not there yet.

How would you stripe or manage a dataset across a mix of devices with different geometries? That would break many of the assumptions made by filesystems today.

I would argue it's easier to let the device virtualize this mapping and present a consistent interface, regardless of the underlying geometry.

> As a devil's advocate, I am still waiting for someone to post a URL to a serious study which proves the long-term performance advantages of TRIM.

I am absolutely sure these studies exist, but as to some entity publishing a long-term analysis that cost real money (many thousands of dollars) to create, I have no idea whether data like that exists in the public domain where anyone can see it. I can virtually guarantee every storage, SSD, and OS vendor is generating that data internally, however.

--eric

--
Eric D. Mudama
edmudama at mail.bounceswoosh.org
>>>>> "edm" == Eric D Mudama <edmudama at bounceswoosh.org> writes:edm> How would you stripe or manage a dataset across a mix of edm> devices with different geometries? the ``geometry'''' discussed is 1-dimensional: sector size. The way that you do it is to align all writes, and never write anything smaller than the sector size. The rule is very simple, and you can also start or stop following it at any moment without rewriting any of the dataset and still get the full benefit. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20100416/a27efe90/attachment.bin>
On Fri, 16 Apr 2010, Eric D. Mudama wrote:
> How would you stripe or manage a dataset across a mix of devices with different geometries? That would break many of the assumptions made by filesystems today.
>
> I would argue it's easier to let the device virtualize this mapping and present a consistent interface, regardless of the underlying geometry.

You must have misunderstood me. I was talking about functionality built into the device. As far as filesystems go, filesystems typically allocate much larger blocks than the sector size.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
On Fri, Apr 16 at 14:42, Miles Nordin wrote:
> The "geometry" discussed is one-dimensional: sector size.
>
> The way that you do it is to align all writes, and never write anything smaller than the sector size. The rule is very simple, and you can also start or stop following it at any moment without rewriting any of the dataset and still get the full benefit.

The response was regarding a filesystem with knowledge of the NAND geometry, to align writes to exact page granularity. My question was how to implement that if not all devices in a stripe set have the same page size.

What you're suggesting is exactly what SSD vendors already do. They present a 512B standard host-interface sector size, and perform their own translations and management inside the device.

--
Eric D. Mudama
edmudama at mail.bounceswoosh.org
On 16 apr 2010, at 17.05, Bob Friesenhahn wrote:
> While I am not an SSD designer, I agree with you that a smart SSD designer would include a dereferencing table which maps sectors to pages so that FLASH pages can be completely filled, even if the stored sectors are not contiguous.

This is exactly how they work, at least most of them. If they had to erase and reprogram each flash block (say, 128 KB blocks times the parallelism of the drive) for each 512 B block written, they would wear out in no time and performance would be horrible.

Eventually they have to gc because they are out of erased blocks, and then they have to copy the data in use to new places. In that process, it of course helps if some of the data is tagged as not needed anymore; the drive can then compact the data much more efficiently, and it doesn't have to copy around a lot of data that won't be used. It should also help save copy/erase cycles in the drive, since the data it moves is much more likely to actually be in use and probably won't be overwritten as fast as an unused block; it will in effect pack data actually in use into flash blocks. If the disk is nearly full, TRIM likely doesn't make much difference.

I'd guess TRIM should be very useful on a slog device.

> It is much more efficient (from a housekeeping perspective) if filesystem sectors map directly to SSD pages, but we are not there yet.

I agree with Eric that it could very well be better to let the device virtualize the thing and have control of, and knowledge about, all the hardware-specific implementation details. Flash chips, controllers, bus drivers, and configurations don't all come equal.

> As a devil's advocate, I am still waiting for someone to post a URL to a serious study which proves the long-term performance advantages of TRIM.

That would sure be interesting!

/ragge
>>>>> "edm" == Eric D Mudama <edmudama at bounceswoosh.org> writes:edm> What you''re suggesting is exactly what SSD vendors already do. no, it''s not. You have to do it for them. edm> They present a 512B standard host interface sector size, and edm> perform their own translations and management inside the edm> device. It is not nearly so magical! The pages are 2 - 4kB. They are this size for nothing to do with the erase block size or the secret blackbox filesystem running on the SSD. It''s because of the ECC, because the reed-solomon for the entire block must be recalculated if any of the block is changed. Therefore, changing 0.5kB means: for a 4kB page device: * read 4kB * write 4kB for a 2kB page device: * read 2kB * write 2kB and changing 4kB at offset <integer> * 4kB means: for a 4kB device: * write 4kB for a 2kB device: * write 4kB It does not matter if all devices have the same page size or not. Just write at the biggest size, or write at the appropriate size if you can. The important thing is that you write a whole page, even if you just pad with zeroes, so the controller does not have to do any reading. simple. the problem with big-sector spinning hard drives and alignment/blocksize is exactly the same problem. non-ZFS people discuss it a lot becuase ZFS filesystems start at <integer> * <rather large block> offset, thanks to all the disk label hokus pocus, but NTFS filesystems often start at 16065 * 0.5kB -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20100416/69d7fd97/attachment.bin>