Hi,

Has anyone heard about any plans for ZFS to support thin devices? I'm talking about the thin device feature of SAN frames (EMC, HDS), which provides more efficient space utilization. The concept is similar to ZFS with its pool and datasets, though the pool in this case lives in the SAN frame itself, so it can be shared among different systems attached to the same frame.

This topic is really complex, but I'm sure supporting it is inevitable for enterprise customers with SAN storage. Basically it brings out the difference between space used and space allocated, which can be huge in a large environment, and that difference matters at the financial level as well.

Veritas has already added support for thin devices: first, support in VxFS to be "thin-aware" (for example, how to handle over-subscribed thin devices); then a feature called SmartMove, a nice way to migrate from fat to thin devices; and the most brilliant feature of all (my personal opinion, of course) is the Veritas Thin Device Reclamation API, which provides an interface for reporting unused space to the SAN frame at the block level.

This API is a major hit, and even though SAN vendors don't support it today, HP and HDS are already working on it, and I assume EMC will have to follow as well. With this API Veritas can keep track of deleted files, for example, and with a simple command once a day (depending on your policy) it can report the unused space back to the frame, so thin devices *remain* thin.

I really believe that ZFS should support thin devices, especially the capability this API brings to the field, as it can make a huge cost difference for enterprise customers.

Regards,
sendai
Making transactional, logging filesystems thin-provisioning aware is probably hard to do, as every new and every changed block is written to a new location. So what applies to ZFS should also apply to btrfs or NILFS or similar filesystems.

I'm not sure there is a good way to make ZFS thin-provisioning aware/friendly, so you should wait for what a ZFS developer has to say about this. Not sure about VxFS, but I think VxFS is very different in its basic design and on-disk structure.
Devzero,

Unfortunately that was my assumption as well. I don't have source-level knowledge of ZFS, but based on what I know it wouldn't be easy to do. I'm not even sure it's only a technical question rather than a design question, which would make it even less feasible.

Apart from the technical possibilities, this feature looks inevitable to me in the long run, especially for enterprise customers with high-end SANs, as cost is always a major factor in a storage design and it's a huge difference whether you pay based on space used or space allocated (for example).

I agree with you, let's wait until some ZFS developer can provide us some insightful thoughts on this topic.

Regards,
sendai
On Wed, Dec 30, 2009 at 19:23, roland <devzero at web.de> wrote:
> Making transactional, logging filesystems thin-provisioning aware is
> probably hard to do, as every new and every changed block is written
> to a new location. So what applies to ZFS should also apply to btrfs
> or NILFS or similar filesystems.

If that were a problem it would be a problem for UFS when you write new files... ZFS knows what blocks are free, and that is all you need to send to the disk system.
> Making transactional, logging filesystems thin-provisioning aware is
> probably hard to do, as every new and every changed block is written
> to a new location. So what applies to ZFS should also apply to btrfs
> or NILFS or similar filesystems.
>
> I'm not sure there is a good way to make ZFS thin-provisioning
> aware/friendly, so you should wait for what a ZFS developer has to
> say about this.

ZFS already supports thin provisioning, and has since pretty much the beginning (the earliest I've used it in is ZFSv6). I may get the terms backwards here, but if the quota property is larger than the reservation, then you have a thin-provisioned volume or filesystem. The quota sets the "disk size" or "available space" that the OS sees, while the reservation sets the space that is guaranteed to be available. As the OS uses space in the volume/fs and approaches the reservation, you just increase that value. The "total size" that the OS sees doesn't change, but the amount of guaranteed space does. This is especially useful for volumes that are exported via iSCSI.
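A minimal sketch of the quota/reservation approach described above; the pool and dataset names are examples, not taken from the thread:

    # Thin-provisioned filesystem: large quota, small reservation
    zfs create tank/data
    zfs set quota=500G tank/data         # size the consumer sees
    zfs set reservation=50G tank/data    # space guaranteed up front

    # Grow the guarantee later, as usage approaches it
    zfs set reservation=100G tank/data

The quota stays fixed from the consumer's point of view; only the guaranteed backing grows over time.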
On Dec 30, 2009, at 10:53 AM, Andras Spitzer wrote:
> Unfortunately that was my assumption as well. I don't have source-level
> knowledge of ZFS, but based on what I know it wouldn't be easy to do.
> I'm not even sure it's only a technical question rather than a design
> question, which would make it even less feasible.

It is not hard, because ZFS knows the current free list, so walking that list and telling the storage about the freed blocks isn't very hard.

What is hard is figuring out if this would actually improve life. The reason I say this is that people like to use snapshots and clones on ZFS. If you keep snapshots, then you aren't freeing blocks, so the free list doesn't grow. This is a very different use case than UFS, as an example.

There are a few minor bumps in the road. The ATA PASSTHROUGH command, which allows TRIM to pass through the SATA drivers, was just integrated into b130. This will be more important to small servers than to SANs, but the point is that all parts of the software stack need to support the effort. As such, it is not clear to me who, if anyone, inside Sun is the champion for the effort -- it crosses multiple organizational boundaries.

> Apart from the technical possibilities, this feature looks inevitable
> to me in the long run, especially for enterprise customers with
> high-end SANs, as cost is always a major factor in a storage design and
> it's a huge difference whether you pay based on space used or space
> allocated (for example).

If the high cost of SAN storage is the problem, then I think there are better ways to solve that :-)
 -- richard
On 12/30/2009 2:40 PM, Richard Elling wrote:
> There are a few minor bumps in the road. The ATA PASSTHROUGH
> command, which allows TRIM to pass through the SATA drivers, was
> just integrated into b130. This will be more important to small servers
> than to SANs, but the point is that all parts of the software stack
> need to support the effort. As such, it is not clear to me who, if
> anyone, inside Sun is the champion for the effort -- it crosses
> multiple organizational boundaries.

I'd think it more important for devices where this is an issue, namely SSDs, than for spinning rust, though use of the TRIM command, or something like it, would fix a lot of the issues I've seen with thin provisioning over the last six years or so. However, I'm not sure it's going to have much of an impact until you can get the entire stack -- application to device -- rewired to work with the concept behind it. One of the biggest issues I've seen with thin provisioning is how the applications work, and you can't fix that in the file system code.
On Wed, Dec 30, 2009 at 1:40 PM, Richard Elling <richard.elling at gmail.com> wrote:
> It is not hard, because ZFS knows the current free list, so walking
> that list and telling the storage about the freed blocks isn't very
> hard.
>
> What is hard is figuring out if this would actually improve life. The
> reason I say this is that people like to use snapshots and clones on
> ZFS. If you keep snapshots, then you aren't freeing blocks, so the free
> list doesn't grow. This is a very different use case than UFS, as an
> example.

It seems as though the oft-mentioned block rewrite capabilities needed for pool shrinking and for changing things like compression, encryption, and deduplication would also show benefit here. That is, blocks would be rewritten in such a way as to minimize the number of chunks of storage that are allocated. The current HDS chunk size is 42 MB.

The most benefit would seem to come from having ZFS make a point of reusing old but freed blocks before doing an allocation that causes the back-end storage to allocate another chunk of disk to the thin-provisioned LUN. While it is important to be able to roll back a few transactions in the event of some widely discussed failure modes, it is probably reasonable to reuse a block freed by a txg that is 3,000 txgs old (about 1 day old at 1 txg per 30 seconds). Such a threshold could be used to determine whether to reuse a block or venture into previously untouched regions of the disk.

This strategy would allow the SAN administrator (who is a different person than the sysadmin) to allocate extra space to servers, while the sysadmin controls the amount of space really used via quotas. In the event that there is an emergency need for more space, the sysadmin can increase the quota and allow more of the allocated SAN space to be used. Assuming the block rewrite feature comes to fruition, this emergency growth could be shrunk back down to the original size once the surge in demand (or errant process) subsides.

> There are a few minor bumps in the road. The ATA PASSTHROUGH
> command, which allows TRIM to pass through the SATA drivers, was
> just integrated into b130. [...]
>
> If the high cost of SAN storage is the problem, then I think there are
> better ways to solve that :-)

The "SAN" could be an OpenSolaris device serving LUNs through COMSTAR. If those LUNs are used to hold a zpool, the zpool could notify the LUN that blocks are no longer used, and the "SAN" could reclaim those blocks.
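A rough sketch of the quota-based split of responsibilities Mike describes, with hypothetical dataset names and sizes; the SAN side presents an over-provisioned thin LUN, and the sysadmin caps real consumption from the host:

    # Day-to-day cap on a pool that sits on a 1 TB thin LUN
    zfs set quota=400G tank/apps

    # Emergency: allow more of the allocated-but-unused SAN space to be consumed
    zfs set quota=600G tank/apps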
This is just a variant of the same problem faced with expensive SAN devices that have thin provisioning allocation units measured in the tens of megabytes instead of hundreds to thousands of kilobytes. -- Mike Gerdts http://mgerdts.blogspot.com/
Richard,

That's an interesting question, whether it's worth it or not. I guess the question is always who the targets for ZFS are (I assume everyone, though in reality priorities have to be set, as developer resources are limited). For a home office, no doubt thin provisioning is not much use; for an enterprise company the numbers might really make a difference if we look at space used vs space allocated.

There are some studies showing that thin provisioning can reduce the physical space used by up to 30%, which is huge. (Even though I understand studies are not real life, and thin provisioning is not viable in every environment.)

Btw, I would like to discuss scenarios where, even though we have an over-subscribed pool in the SAN (meaning the overall space allocated to the systems is more than the physical space in the pool), with proper monitoring and proactive physical drive additions we never let any systems/applications attached to the SAN realize that we have thin devices.

Actually, that's why I believe configuring thin devices without periodically reclaiming space is just a time bomb; if you have the option to periodically reclaim space, you can maintain the pool in the SAN in a really efficient way. That's why I consider Veritas' Thin Reclamation API a milestone in the thin device field.

Anyway, only the future can tell whether thin provisioning will be a major feature in the storage world, but since Veritas already added this feature, I was wondering if ZFS has it at least on its roadmap.

Regards,
sendai
To some extent it already does.

If what you're talking about is filesystems/datasets, then all filesystems within a pool share the same free space, which is functionally very similar to each filesystem within the pool being thin-provisioned. To get a "thick" filesystem, you'd need to set at least the filesystem's reservation, and probably a quota as well. Basically, filesystems within a pool are thin by default, with the added bonus that space freed within a single filesystem is available for use in any other filesystem within the pool.

If you're talking about volumes provisioned from a pool, then volumes can be provisioned as "sparse", which is pretty much the same thing.

And if you happen to be providing iSCSI LUNs from files rather than volumes, then those files can be created sparse as well.

Reclaiming space from sparse volumes and files is not so easy, unfortunately!

If you're talking about the pool itself being thin... that's harder to do, although if you really needed it, I guess you could provision your pool from an array that itself provides thin provisioning.

Regards,
Tristan

On 30/12/2009 9:34 PM, Andras Spitzer wrote:
> Hi,
>
> Has anyone heard about any plans for ZFS to support thin devices? I'm
> talking about the thin device feature of SAN frames (EMC, HDS), which
> provides more efficient space utilization. The concept is similar to
> ZFS with its pool and datasets, though the pool in this case lives in
> the SAN frame itself, so it can be shared among different systems
> attached to the same frame. [...]
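For reference, a minimal sketch of the sparse options Tristan mentions; pool, dataset, and file names are illustrative only:

    # Sparse ("thin") volume: volsize is 100G but no space is reserved up front
    zfs create -s -V 100G tank/iscsi-lun0
    zfs get volsize,refreservation,used tank/iscsi-lun0

    # Sparse file to back a file-based iSCSI LUN; Solaris mkfile -n
    # sets the size without allocating the blocks
    mkfile -n 100g /tank/luns/lun1.img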
Now this is getting interesting :-)...

On Dec 30, 2009, at 12:13 PM, Mike Gerdts wrote:
> It seems as though the oft-mentioned block rewrite capabilities needed
> for pool shrinking and for changing things like compression,
> encryption, and deduplication would also show benefit here. That is,
> blocks would be rewritten in such a way as to minimize the number of
> chunks of storage that are allocated. The current HDS chunk size is
> 42 MB.

Good observation, Mike. ZFS divides a leaf vdev into approximately 200 metaslabs. Space is allocated in a metaslab, and at some point another metaslab will be chosen. The assumption is made that the outer tracks of a disk have higher bandwidth than the inner tracks, so allocations should be biased towards lower-numbered metaslabs. Let's ignore, for the moment, that SSDs, and to some degree RAID arrays, don't exhibit this behavior.

OK, so here's how it works, in a nutshell. Space is allocated in the same metaslab until it fills or becomes "fragmented", and then the next metaslab is used. You can see this in my "Spacemaps from Space" blog,
http://blogs.sun.com/relling/entry/space_maps_from_space
where in the lower-numbered tracks (towards the bottom) you can see occasional, small blank areas.
Note to self: a better picture would be useful :-)
Note: copies are intentionally spread to other, distant metaslabs for diversity.

Inside the metaslab, space is allocated on a first-fit basis until the space is mostly consumed, and then the algorithm changes to best-fit. The algorithm for these two decisions was changed in b129, in an effort to improve performance.

So, the questions that arise are:

Should the allocator be made aware of the chunk size of virtual storage vdevs? [hint: there is evidence of the intention to permit different allocators in the source, but I dunno if there is an intent to expose those through an interface.]

If the allocator can change, what sorts of policies should be implemented? Examples include:
	+ should the allocator stick with best-fit and encourage more
	  gangs when the vdev is virtual?
	+ should the allocator be aware of an SSD's page size? Is
	  said page size available to an OS?
	+ should the metaslab boundaries align with virtual storage
	  or SSD page boundaries?

And, perhaps most important, how can this be done automatically so that system administrators don't have to be rocket scientists to make a good choice?
 -- richard
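For readers who want to look at this on their own pool, zdb can dump the metaslab layout and space maps (read-only; the exact flags and output format vary by build, so treat this as an assumption to check against your zdb man page):

    # Summarize metaslabs per vdev: offsets, sizes, free space
    zdb -m tank

    # Repeat the flag for more detailed space map output
    zdb -mm tank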
Ack... I've just re-read your original post. :-) It's clear you are talking about support for thin devices behind the pool, not features inside the pool itself. Mea culpa.

So I guess we wait for TRIM to be fully supported... :-)

T.

On 31/12/2009 8:09 AM, Tristan Ball wrote:
> To some extent it already does.
>
> If what you're talking about is filesystems/datasets, then all
> filesystems within a pool share the same free space, which is
> functionally very similar to each filesystem within the pool being
> thin-provisioned. To get a "thick" filesystem, you'd need to set at
> least the filesystem's reservation, and probably a quota as well.
> Basically, filesystems within a pool are thin by default, with the
> added bonus that space freed within a single filesystem is available
> for use in any other filesystem within the pool.
>
> If you're talking about volumes provisioned from a pool, then volumes
> can be provisioned as "sparse", which is pretty much the same thing.
>
> Reclaiming space from sparse volumes and files is not so easy,
> unfortunately! [...]
On Dec 30, 2009, at 12:25 PM, Andras Spitzer wrote:
> That's an interesting question, whether it's worth it or not. [...]
>
> Anyway, only the future can tell whether thin provisioning will be a
> major feature in the storage world, but since Veritas already added
> this feature, I was wondering if ZFS has it at least on its roadmap.

Thin provisioning is absolutely, positively a wonderful, good thing! The question is, how does the industry handle the multitude of thin provisioning models, each layered on top of another? For example, here at the ranch I use VMware and Xen, which thinly provision virtual disks. I do this over iSCSI to a server running ZFS, which thinly provisions the iSCSI target. If I had a virtual RAID array, I would probably use that, too. Personally, I think being thinner closer to the application wins over being thinner closer to dumb storage devices (disk drives).

BTW, I do not see an RFE for this on http://bugs.opensolaris.org
Would you be so kind as to file one?
 -- richard
On Wed, Dec 30, 2009 at 3:12 PM, Richard Elling <richard.elling at gmail.com> wrote:
> If the allocator can change, what sorts of policies should be
> implemented? Examples include:
>	+ should the allocator stick with best-fit and encourage more
>	  gangs when the vdev is virtual?
>	+ should the allocator be aware of an SSD's page size? Is
>	  said page size available to an OS?
>	+ should the metaslab boundaries align with virtual storage
>	  or SSD page boundaries?

Wandering off topic a little bit... Should the block size be a tunable, so that it can match the page size of SSDs (typically 4K, right?) and of upcoming hard disks that sport a sector size > 512 bytes?

http://arc.opensolaris.org/caselog/PSARC/2008/769/final_spec.txt

> And, perhaps most important, how can this be done automatically
> so that system administrators don't have to be rocket scientists
> to make a good choice?

Didn't you read the marketing literature? ZFS is easy because you only need to know two commands: zpool and zfs. If you just ignore all the subcommands, the options to those subcommands, the evil tuning that is sometimes needed, and the effects of redundancy choices, then there is no need for any rocket scientists. :)

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
On 30 dec 2009, at 22.45, Richard Elling wrote:
> Thin provisioning is absolutely, positively a wonderful, good thing!
> The question is, how does the industry handle the multitude of thin
> provisioning models, each layered on top of another? For example, here
> at the ranch I use VMware and Xen, which thinly provision virtual
> disks. I do this over iSCSI to a server running ZFS, which thinly
> provisions the iSCSI target. If I had a virtual RAID array, I would
> probably use that, too. Personally, I think being thinner closer to the
> application wins over being thinner closer to dumb storage devices
> (disk drives).

I don't get it - why do we need anything more magic (or complicated) than support for TRIM from the filesystems and the storage systems?

I don't see why TRIM would be hard to implement for ZFS either, except that you may want to keep data from a few txgs back just for safety, which would probably call for some two-stage freeing of data blocks (those free blocks that are to be TRIMmed, and those that already are).

/ragge
On Wed, 30 Dec 2009, Mike Gerdts wrote:
> Should the block size be a tunable, so that it can match the page size
> of SSDs (typically 4K, right?) and of upcoming hard disks that sport a
> sector size > 512 bytes?

Enterprise SSDs are still in their infancy. The actual page size of an SSD could be almost anything. Due to the lack of seek time concerns and the high cost of erasing a page, an SSD could be designed with a level of indirection so that multiple logical writes to disjoint offsets could be combined into a single SSD physical page. Likewise, a large logical block could be subdivided into multiple SSD pages, which are allocated on demand. Logic is cheap and SSDs are full of logic, so it seems reasonable that future SSDs will do this, if they don't already, since similar logic enables wear-leveling.

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
On Dec 30, 2009, at 2:24 PM, Ragnar Sundblad wrote:
> On 30 dec 2009, at 22.45, Richard Elling wrote:
>> Thin provisioning is absolutely, positively a wonderful, good thing!
>> The question is, how does the industry handle the multitude of thin
>> provisioning models, each layered on top of another? [...]
>> Personally, I think being thinner closer to the application wins over
>> being thinner closer to dumb storage devices (disk drives).
>
> I don't get it - why do we need anything more magic (or complicated)
> than support for TRIM from the filesystems and the storage systems?

TRIM is just one part of the problem (or solution, depending on your point of view). The TRIM command is part of the T10 protocols that allows a host to tell a block device that the data in a set of blocks is no longer of any value, and the block device can destroy that data without adverse consequence.

In a world with copy-on-write and without snapshots, it is obvious that there will be a lot of blocks running around that are no longer in use. Snapshots (and their clones) change that use case. So in a world of snapshots, there will be fewer blocks which are not used. Remember, the TRIM command is very important to OSes like Windows or OSX which do not have file systems that are copy-on-write or have decent snapshots. OTOH, ZFS does copy-on-write, and lots of ZFS folks use snapshots.

That said, adding TRIM support is not hard in ZFS. But it depends on lower-level drivers to pass the TRIM commands down the stack.
These ducks are lining up now.

> I don't see why TRIM would be hard to implement for ZFS either,
> except that you may want to keep data from a few txgs back just
> for safety, which would probably call for some two-stage freeing
> of data blocks (those free blocks that are to be TRIMmed, and
> those that already are).

Once a block is freed in ZFS, it is no longer needed. So the "problem" of TRIM in ZFS is not related to the recent txg commit history. The issue is that traversing the free block list has to be protected by locks, so that the file system does not allocate a block while it is also TRIMming the block. Not so difficult, as long as the TRIM occurs relatively quickly.

I think that any TRIM implementation should be an administration command, like scrub. It probably doesn't make sense to have it running all of the time. But on occasion, it might make sense. My concern is that people will have an expectation that they can use snapshots and TRIM -- the former reduces the effectiveness of the latter. As the price of storing bytes continues to decrease, will the cost of not TRIMming be a long-term issue? I think not.
 -- richard
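To make the "administration command, like scrub" idea concrete: such an interface might look like the existing scrub subcommand. Note that no "zpool trim" subcommand existed at the time of this thread; the second command below is purely hypothetical:

    # Existing on-demand maintenance pass
    zpool scrub tank

    # Hypothetical on-demand pass that reports the pool's free space
    # back to the underlying devices
    zpool trim tank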
Let me sum up my thoughts on this topic.

To Richard [relling]: I agree with you that this topic is even more confusing if we are not careful enough to specify exactly what we are talking about. Thin provisioning can be done in multiple layers, and though you said you like it to be closer to the app than closer to the dumb disks (if you were referring to SAN), my opinion is that each and every scenario has its own pros/cons. I learned a long time ago not to declare a technology good/bad; there are technologies which are used properly (usually declared as good tech) and others which are not (usually declared as bad).

Let me clarify my case, and why I mentioned thin devices on SAN specifically. Many people replied about the thin device support of ZFS (which is called sparse volumes, if I'm correct), but what I was talking about is something else. It's thin device "awareness" on the SAN.

In this case you configure your LUN in the SAN as a thin device, a virtual LUN (or LUNs) backed by a pool of physical disks in the SAN. From the OS it's transparent, and so it is from the Volume Manager/Filesystem point of view.

That is the basic definition of my scenario with thin devices on SAN. High-end SAN frames like the HDS USP-V (feature called "Hitachi Dynamic Provisioning") and EMC Symmetrix V-Max (feature called "Virtual Provisioning") support this (and I'm sure many others as well). Once you discover the LUN in the OS, you start to use it: put it under the Volume Manager, create a filesystem, copy files; but the SAN only allocates physical blocks (more precisely groups of blocks called extents) as you write them, which means you use only as much (or a bit more, rounded up to the next extent) on the physical disks as you use in reality.

From this standpoint we can define two terms, thin-friendly and thin-hostile environments. Thin-friendly would be any environment where the OS/VM/FS doesn't write to blocks it doesn't really use (for example, during initialization it doesn't fill up the LUN with a pattern or zeros).

That's why Veritas' SmartMove is a nice feature: when you move from fat to thin devices (from the OS both LUNs look exactly the same), it copies only the blocks which are used by the VxFS files.

That is still just the basics of having thin devices on SAN and hoping for a thin-friendly environment. The next level is the management of the thin devices and of the physical pool the thin devices allocate their extents from.

Even if you get migrated to thin device LUNs, your thin devices will become fat again; if you fill up your filesystem even once, the thin device on the SAN will remain fat, and no space reclamation happens by default. The reason is pretty simple: the SAN storage has no knowledge of the filesystem structure, so it can't decide whether a block is still in use or can be released back to the pool.

Then came Veritas with this brilliant idea of building a bridge between the FS and the SAN frame (this became the Thin Reclamation API), so they can communicate which blocks are indeed not in use.

I really would like you to read this Quick Note from Veritas about this feature; it explains the concept far better than I did:
http://ftp.support.veritas.com/pub/support/products/Foundation_Suite/338546.pdf

Btw, in this concept VxVM can even detect (via an ASL) whether a LUN is thin device/thin reclamation capable or not.

Honestly, I have mixed feelings about ZFS.
I feel that this is obviously the VM/filesystem of the future, but at the same time I realize the roles of the individual parts in the big picture are getting mixed up. Am I the only one with the impression that ZFS will sooner or later evolve into a SAN OS, and the zfs and zpool commands will become just lightweight interfaces to control the SAN frame? :-) (like Solution Enabler for EMC)

If you ask me, the pool concept always works more efficiently if 1) you have more capacity in the pool and 2) you have more systems sharing the pool; that's why I see the thin device pool as more rational in a SAN frame.

Anyway, I'm sorry if you were already aware of what I explained above. I also hope I didn't offend anyone with my views.

Regards,
sendai
On 31 dec 2009, at 06.01, Richard Elling wrote:
> On Dec 30, 2009, at 2:24 PM, Ragnar Sundblad wrote:
>> I don't get it - why do we need anything more magic (or complicated)
>> than support for TRIM from the filesystems and the storage systems?
>
> TRIM is just one part of the problem (or solution, depending on your
> point of view). The TRIM command is part of the T10 protocols that
> allows a host to tell a block device that the data in a set of blocks
> is no longer of any value, and the block device can destroy that data
> without adverse consequence.
>
> In a world with copy-on-write and without snapshots, it is obvious that
> there will be a lot of blocks running around that are no longer in use.
> Snapshots (and their clones) change that use case. So in a world of
> snapshots, there will be fewer blocks which are not used. Remember, the
> TRIM command is very important to OSes like Windows or OSX which do not
> have file systems that are copy-on-write or have decent snapshots.
> OTOH, ZFS does copy-on-write, and lots of ZFS folks use snapshots.

I don't believe that there is such a big difference between those cases. Sure, snapshots may keep more data on disk, but only as much as the user chooses to keep.
There have been other ways to keep old data on disk before (RCS, Solaris patch backout blurbs, logs, caches, what have you), so there is not really a brand new world there. (BTW, once upon a time, real operating systems had (optional) file versioning built into the operating system or file system itself.)

If there were a mechanism that always tended to keep all of the disk full, that would be another case. Snapshots may do that with the autosnapshot and warn-and-clean-when-getting-full features of OpenSolaris, but servers especially will probably not be managed that way; they will probably have a much more controlled snapshot policy. (Especially if you want to save every possible bit of disk space, as those guys with the big fantastic and ridiculously expensive storage systems always want to do - maybe that will change in the future, though.)

> That said, adding TRIM support is not hard in ZFS. But it depends on
> lower-level drivers to pass the TRIM commands down the stack. These
> ducks are lining up now.

Good.

>> I don't see why TRIM would be hard to implement for ZFS either,
>> except that you may want to keep data from a few txgs back just
>> for safety, which would probably call for some two-stage freeing
>> of data blocks (those free blocks that are to be TRIMmed, and
>> those that already are).
>
> Once a block is freed in ZFS, it is no longer needed. So the "problem"
> of TRIM in ZFS is not related to the recent txg commit history.

It may be that you want to save a few txgs back, so that if you get a failure where parts of the last txg get lost, you will still be able to get an old (a few seconds/minutes) version of your data back. This could happen if the sync commands aren't correctly implemented all the way down (as we have seen some stories about on this list). Maybe someone disabled syncing somewhere to improve performance.

It could also happen if a "non-volatile" caching device, such as a storage controller, breaks in some bad way. Or maybe you just had a bad/old battery/supercap in a device that implements NV storage with batteries/supercaps.

> The issue is that traversing the free block list has to be protected by
> locks, so that the file system does not allocate a block while it is
> also TRIMming the block. Not so difficult, as long as the TRIM
> occurs relatively quickly.
>
> I think that any TRIM implementation should be an administration
> command, like scrub. It probably doesn't make sense to have it
> running all of the time. But on occasion, it might make sense.

I am not sure why it shouldn't run at all times, except for the fact that it seems to be badly implemented in some SATA devices, with high latencies, so that it will interrupt any data streaming to/from the disks. On a general-purpose system that may not be an issue, since you may read a lot from cache anyway, and synced writes may wait a little without anyone even noticing. On a special system that needs streaming performance, it might be interesting to trim only on certain occasions, but then you will probably have a service window for it, with a start and stop time, so you need to be able to control the trimming process pretty exactly for this feature to be interesting. It may turn out that such systems are better served by not trimming at all.
On a laptop, on the other hand, you typically don't have a service window and have no idea when it would be a good time to start TRIMing, so continuous TRIMing may be the best option.

> My concern is that people will have an expectation that they can
> use snapshots and TRIM -- the former reduces the effectiveness
> of the latter.

In my experience, disks tend to get full one way or another anyway if you don't manage your data. I don't really see that snapshots change that a whole lot.

> As the price of storing bytes continues to decrease,
> will the cost of not TRIMming be a long-term issue? I think not.
> -- richard

Maybe, maybe not. Storage will always have a cost; not even OpenStorage has really changed that by orders of magnitude (yet, at least).

Also, currently, when SSDs for some very strange reason are constructed from flash chips designed for firmware and slowly changing configuration data, and can only erase in very large chunks, TRIMing is good for the housekeeping in the SSD drive. A typical use case for this would be a laptop.

Happy new year, everybody!

/ragge s
On 31 dec 2009, at 00.31, Bob Friesenhahn wrote:
> Enterprise SSDs are still in their infancy. The actual page size of an
> SSD could be almost anything. Due to the lack of seek time concerns and
> the high cost of erasing a page, an SSD could be designed with a level
> of indirection so that multiple logical writes to disjoint offsets
> could be combined into a single SSD physical page. Likewise, a large
> logical block could be subdivided into multiple SSD pages, which are
> allocated on demand. Logic is cheap and SSDs are full of logic, so it
> seems reasonable that future SSDs will do this, if they don't already,
> since similar logic enables wear-leveling.

I believe that almost all flash devices are already doing this, and only the first-generation SD cards or something like that are not doing it and leave it to the host. But I could be wrong, of course.

/ragge s
On 31 dec 2009, at 10.43, Andras Spitzer wrote:
> Then came Veritas with this brilliant idea of building a bridge between
> the FS and the SAN frame (this became the Thin Reclamation API), so
> they can communicate which blocks are indeed not in use.

This is exactly what TRIM is for, but it could be implemented in a very lightweight, general-purpose way in all operating systems, file systems, and storage devices. Once things implement this, sparsing/thinning out disks will be a non-issue. It will be the same simple mechanism, useful all the way from the laptop to the enterprise virtual server environment. I don't see any need for a big complicated system for this.

> Honestly, I have mixed feelings about ZFS. I feel that this is
> obviously the VM/filesystem of the future, but at the same time I
> realize the roles of the individual parts in the big picture are
> getting mixed up. Am I the only one with the impression that ZFS will
> sooner or later evolve into a SAN OS, and the zfs and zpool commands
> will become just lightweight interfaces to control the SAN frame? :-)
> (like Solution Enabler for EMC)

ZFS et al. can already be a "SAN frame" today; that is what the OpenStorage product family is. (Not with all the bells and whistles of some of the other systems, but a lot cheaper (not ridiculously expensive, only very expensive (list price, which no one pays, of course)).)

Another possible interim solution for sparsing (thinning) out disks, if you have dedup or compression in your storage thingy: write large files of, for example, zeros over the free space on the clients and then remove them again; those blocks will dedup and/or compress nicely.

/ragge s
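A rough sketch of that zero-fill trick on a client filesystem; the path is illustrative, it only helps if the backing storage dedups or compresses zero-filled blocks, and it temporarily fills the filesystem until the filler file is removed:

    # Fill the free space with zeros (dd stops when the filesystem is full),
    # then delete the filler so the zeroed blocks become free again
    dd if=/dev/zero of=/mnt/data/zerofill bs=1M
    sync
    rm /mnt/data/zerofill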
On Thu, 31 Dec 2009, Ragnar Sundblad wrote:
> Also, currently, when SSDs for some very strange reason are constructed
> from flash chips designed for firmware and slowly changing
> configuration data, and can only erase in very large chunks, TRIMing is
> good for the housekeeping in the SSD drive. A typical use case for this
> would be a laptop.

I have heard quite a few times that TRIM is "good" for SSD drives, but I don't see much actual use for it. Every responsible SSD drive maintains a reserve of unused space (20-50%), since it is needed for wear leveling and to repair failing spots. This means that even when an SSD is 100% full, it still has considerable space remaining.

A very simple SSD design solution is that when an SSD block is "overwritten", it is replaced with an already-erased block from the free pool, and the old block is submitted to the free pool for eventual erasure and re-use. This approach avoids adding erase times to the write latency as long as the device can erase as fast as the average data write rate.

There are of course SSDs with hardly any (or no) reserve space, but while we might be willing to sacrifice an image or two to SSD block failure in our digital camera, that is just not acceptable for serious computer use.

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Just an update: I finally found some technical details about this Thin Reclamation API
(http://blogs.hds.com/claus/2009/12/i-love-it-when-a-plan-comes-together.html):

"This week (December 7th), Symantec announced their 'completing the thin provisioning ecosystem' that includes the necessary API calls for the file system to 'notify' the storage array when space is 'deleted'. The interface is a previously disused and now revised/reused/repurposed SCSI command (called Write Same) which was jointly worked out with Symantec, Hitachi, and 3PAR. This command allows the file systems (in this case Veritas VxFS) to notify the storage systems that space is no longer occupied. How cool is that! There is also a subcommittee of INCITS T10 studying the standardization of this, and SNIA is also studying this. It won't be long before most file systems, databases, and storage vendors adopt this technology."

So it's based on the SCSI Write Same/UNMAP command (and if I understand correctly, SATA TRIM is similar to this from the FS point of view), a standard which is not ratified yet.

Also, happy new year to everyone!

Regards,
sendai
On 31 dec 2009, at 17.18, Bob Friesenhahn wrote:
> I have heard quite a few times that TRIM is "good" for SSD drives, but
> I don't see much actual use for it. Every responsible SSD drive
> maintains a reserve of unused space (20-50%), since it is needed for
> wear leveling and to repair failing spots. This means that even when an
> SSD is 100% full, it still has considerable space remaining.

(At least as long as those blocks aren't used up in place of bad/worn-out blocks...)

> A very simple SSD design solution is that when an SSD block is
> "overwritten", it is replaced with an already-erased block from the
> free pool, and the old block is submitted to the free pool for eventual
> erasure and re-use. This approach avoids adding erase times to the
> write latency as long as the device can erase as fast as the average
> data write rate.

This is what they do, as far as I have understood, but more free space to play with makes the job easier and therefore faster, and gives you larger burst headroom before you hit the erase-speed limit of the disk.

> There are of course SSDs with hardly any (or no) reserve space, but
> while we might be willing to sacrifice an image or two to SSD block
> failure in our digital camera, that is just not acceptable for serious
> computer use.

I think the idea is that with TRIM you can also use the file system's unused space for wear leveling and flash block filling. If your disk is completely full, there is of course no gain.

/ragge s
On Dec 31, 2009, at 1:43 AM, Andras Spitzer wrote:
> Let me sum up my thoughts on this topic.
>
> To Richard [relling]: I agree with you that this topic is even more
> confusing if we are not careful enough to specify exactly what we are
> talking about. Thin provisioning can be done in multiple layers, and
> though you said you like it to be closer to the app than closer to the
> dumb disks (if you were referring to SAN), my opinion is that each and
> every scenario has its own pros/cons. I learned a long time ago not to
> declare a technology good/bad; there are technologies which are used
> properly (usually declared as good tech) and others which are not
> (usually declared as bad).

I hear you. But you are trapped thinking about 20th century designs, and ZFS is a 21st century design. More below...

> Let me clarify my case, and why I mentioned thin devices on SAN
> specifically. Many people replied about the thin device support of ZFS
> (which is called sparse volumes, if I'm correct), but what I was
> talking about is something else. It's thin device "awareness" on the
> SAN.
>
> In this case you configure your LUN in the SAN as a thin device, a
> virtual LUN (or LUNs) backed by a pool of physical disks in the SAN.
> From the OS it's transparent, and so it is from the Volume
> Manager/Filesystem point of view.
>
> That is the basic definition of my scenario with thin devices on SAN.
> High-end SAN frames like the HDS USP-V (feature called "Hitachi Dynamic
> Provisioning") and EMC Symmetrix V-Max (feature called "Virtual
> Provisioning") support this (and I'm sure many others as well). Once
> you discover the LUN in the OS, you start to use it: put it under the
> Volume Manager, create a filesystem, copy files; but the SAN only
> allocates physical blocks (more precisely groups of blocks called
> extents) as you write them, which means you use only as much (or a bit
> more, rounded up to the next extent) on the physical disks as you use
> in reality.
>
> From this standpoint we can define two terms, thin-friendly and
> thin-hostile environments. Thin-friendly would be any environment where
> the OS/VM/FS doesn't write to blocks it doesn't really use (for
> example, during initialization it doesn't fill up the LUN with a
> pattern or zeros).
>
> That's why Veritas' SmartMove is a nice feature: when you move from fat
> to thin devices (from the OS both LUNs look exactly the same), it
> copies only the blocks which are used by the VxFS files.

ZFS does this by design. There is no way in ZFS to not do this. I suppose it could be touted as a "feature" :-)
Maybe we should brand ZFS as "THINbyDESIGN(TM)"
Or perhaps we can rebrand SMARTMOVE(TM) as TRYINGTOCATCHUPWITHZFS(TM) :-)

> That is still just the basics of having thin devices on SAN and hoping
> for a thin-friendly environment. The next level is the management of
> the thin devices and of the physical pool the thin devices allocate
> their extents from.
>
> Even if you get migrated to thin device LUNs, your thin devices will
> become fat again; if you fill up your filesystem even once, the thin
> device on the SAN will remain fat, and no space reclamation happens by
> default. The reason is pretty simple: the SAN storage has no knowledge
> of the filesystem structure, so it can't decide whether a block is
> still in use or can be released back to the pool.
Then came Veritas with this brilliant idea of building a > bridge between the FS and the SAN frame (this became the Thin > Reclamation API), so they can communicate which blocks are not in > use indeed. > > I really would like you to read this Quick Note from Veritas about > this feature, it will explain way better the concept as I did : http://ftp.support.veritas.com/pub/support/products/Foundation_Suite/338546.pdf > > Btw, in this concept VxVM can even detect (via ASL) whether a LUN is > thin device/thin device reclamation capable or not.Correct. Since VxVM and VxFS are separate software, they have expanded the interface between them. Consider adding a mirror or replacing a drive. Prior to SMARTMOVE, VxVM had no idea what part of the volume was data and what was unused. So VxVM would silver the mirror by copying all of the blocks from one side to the other. Clearly this is uncool when your SAN storage is virtualized. With SMARTMOVE, VxFS has a method to tell VxVM that portions of the volume are unused. Now when you silver the mirror, VxVM knows that some bits are unused and it won''t bother to copy them. This is a bona fide good thing for virtualized SAN arrays. ZFS was designed with the knowledge that the limited interface between file systems and volume managers was a severe limitation that leads to all sorts of complexity and angst. So a different design is needed. ZFS has fully integrated RAID with the file system, so there is no need, by design, to create a new interface between these layers. In other words, the only way to silver a disk in ZFS is to silver the data. You can''t silver unused space. There are other advantages as well. For example, in ZFS silvers are done in time order, which has benefits for recovery when devices are breaking all around you. Jeff describes this rather nicely in his blog: http://blogs.sun.com/bonwick/entry/smokin_mirrors In short. ZFS doesn''t need SMARTMOVE because it doesn''t have the antiquated view of storage management that last century''s designs had. Also, ZFS users who don''t use snapshots could benefit from TRIM.> Honestly I have mixed feeling about ZFS. I feel that this is > obviously the future''s VM/Filesystem, but then I realize in the same > time the roles of the individual parts in the big picture are > getting mixed up. Am I the only one with the impression that ZFS > sooner or later will evolve to a SAN OS, and the zfs, zpool commands > will only become some lightweight interfaces to control the SAN > frame? :-) (like Solution Enabler for EMC)I don''t see that evolution. But I''ve always contended that storage arrays are just specialized servers which speak a limited set of protocols. After all, there is no such thing as "hardware RAID," all RAID is done in software. So my crystal ball says that such limited server OSes will have a hard life ahead of them.> If you ask me the pool concept always works more efficient if 1# you > have more capacity in the pool 2# if you have more systems to share > the pool, that''s why I see the thin device pool more rational in a > SAN frame. > > Anyway, I''m sorry if you were already aware what I explained above, > I also hope I didn''t offend anyone with my views,I have a much simpler view of VxFS and VxVM. They are neither open source nor free, but they are so last century :-) -- richard
[I TRIMmed the thread a bit ;-)] On Dec 31, 2009, at 1:43 AM, Ragnar Sundblad wrote:> On 31 dec 2009, at 06.01, Richard Elling wrote: >> >> In a world with copy-on-write and without snapshots, it is obvious >> that >> there will be a lot of blocks running around that are no longer in >> use. >> Snapshots (and their clones) changes that use case. So in a world of >> snapshots, there will be fewer blocks which are not used. Remember, >> the TRIM command is very important to OSes like Windows or OSX >> which do not have file systems that are copy-on-write or have decent >> snapshots. OTOH, ZFS does copy-on-write and lots of ZFS folks use >> snapshots. > > I don''t believe that there is such a big difference between those > cases.The reason you want TRIM for SSDs is to recover the write speed. A freshly cleaned page can be written faster than a dirty page. But in COW, you are writing to new pages and not rewriting old pages. This is fundamentally different than FAT, NTFS, or HFS+, but it is those markets which are driving TRIM adoption. [TRIMmed]>> Once a block is freed in ZFS, it no longer needs it. So the "problem" >> of TRIM in ZFS is not related to the recent txg commit history. > > It may be that you want to save a few txgs back, so if you get > a failure where parts of the last txg gets lost, you will still be > able to get an old (few seconds/minutes) version of your data back.This is already implemented. Blocks freed in the past few txgs are not returned to the freelist immediately. This was needed to enable uberblock recovery in b128. So TRIMming from the freelist is safe.> This could happen if the sync commands aren''t correctly implemented > all the way (as we have seen some stories about on this list). > Maybe someone disabled syncing somewhere to improve performance. > > It could also happen if a "non volatile" caching device, such as > a storage controller, breaks in some bad way. Or maybe you just > had a bad/old battery/supercap in a device that implements > NV storage with batteries/supercaps. > >> The >> issue is that traversing the free block list has to be protected by >> locks, so that the file system does not allocate a block when it is >> also TRIMming the block. Not so difficult, as long as the TRIM >> occurs relatively quickly. >> >> I think that any TRIM implementation should be an administration >> command, like scrub. It probably doesn''t make sense to have it >> running all of the time. But on occasion, it might make sense. > > I am not sure why it shouldn''t run at all times, except for the > fact that it seems to be badly implemented in some SATA devices > with high latencies, so that it will interrupt any data streaming > to/from the disks.I don''t see how it would not have negative performance impacts. -- richard
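A minimal sketch of the deferred-free behaviour Richard describes (blocks freed in the last few txgs are held back, and only older frees would be safe TRIM candidates). The data structure and the constant below are assumptions for illustration, not the actual ZFS implementation.

from collections import deque

TXG_DEFER = 3   # hypothetical: hold frees from the last 3 txgs off the freelist

class FreeTracker:
    def __init__(self):
        self.pending = deque([set()])   # one set of freed blocks per recent txg
        self.trim_ready = set()         # old enough to return/TRIM safely

    def free(self, blocks):
        self.pending[-1].update(blocks)

    def commit_txg(self):
        self.pending.append(set())
        while len(self.pending) > TXG_DEFER:
            self.trim_ready |= self.pending.popleft()

t = FreeTracker()
t.free({100, 101}); t.commit_txg()
t.free({200}); t.commit_txg()
t.commit_txg(); t.commit_txg()
print(sorted(t.trim_ready))   # the older frees [100, 101, 200] are now TRIM-safe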
Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:> I have heard quite a few times that TRIM is "good" for SSD drives but > I don''t see much actual use for it. Every responsible SSD drive > maintains a reserve of unused space (20-50%) since it is needed for > wear leveling and to repair failing spots. This means that even when > a SSD is 100% full it still has considerable space remaining. A very > simple SSD design solution is that when a SSD block is "overwritten" > it is replaced with an already-erased block from the free pool and the > old block is submitted to the free pool for eventual erasure and > re-use. This approach avoids adding erase times to the write latency > as long as the device can erase as fast as the average data write > rate.The question in the case of SSDs is: ZFS is COW, but does the SSD know which block is "in use" and which is not? If the SSD did know whether a block is in use, it could erase unused blocks in advance. But what is an "unused block" on a filesystem that supports snapshots? From the perspective of the SSD I see only the following difference between a COW filesystem and a conventional filesystem. A conventional filesystem may write more often to the same block number than a COW filesystem does. But even for the non-COW case, I would expect that the SSD frequently remaps overwritten blocks to previously erased spares. My conclusion is that ZFS on a SSD works fine in the case that the primary used blocks plus all active snapshots use less space than the official size - the spare reserve from the SSD. If you however fill up the medium, I expect a performance degradation. Jörg -- EMail:joerg at schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin js at cs.tu-berlin.de (uni) joerg.schilling at fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/ URL: http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
Richard Elling <richard.elling at gmail.com> wrote:> The reason you want TRIM for SSDs is to recover the write speed. > A freshly cleaned page can be written faster than a dirty page. > But in COW, you are writing to new pages and not rewriting old > pages. This is fundamentally different than FAT, NTFS, or HFS+, > but it is those markets which are driving TRIM adoption.Your mistake is to assume a maiden SSD and not to think about what''s happening after the SSD has been in use for a while. Even for the COW case, blocks are reused after some time and the "disk" has no way to know in advance which blocks are still in use and which blocks are no longer used and may be prepared for being overwritten. Jörg -- EMail:joerg at schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin js at cs.tu-berlin.de (uni) joerg.schilling at fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/ URL: http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
On 31 dec 2009, at 19.26, Richard Elling wrote:> [I TRIMmed the thread a bit ;-)] > > On Dec 31, 2009, at 1:43 AM, Ragnar Sundblad wrote: >> On 31 dec 2009, at 06.01, Richard Elling wrote: >>> >>> In a world with copy-on-write and without snapshots, it is obvious that >>> there will be a lot of blocks running around that are no longer in use. >>> Snapshots (and their clones) changes that use case. So in a world of >>> snapshots, there will be fewer blocks which are not used. Remember, >>> the TRIM command is very important to OSes like Windows or OSX >>> which do not have file systems that are copy-on-write or have decent >>> snapshots. OTOH, ZFS does copy-on-write and lots of ZFS folks use >>> snapshots. >> >> I don''t believe that there is such a big difference between those >> cases. > > The reason you want TRIM for SSDs is to recover the write speed. > A freshly cleaned page can be written faster than a dirty page. > But in COW, you are writing to new pages and not rewriting old > pages. This is fundamentally different than FAT, NTFS, or HFS+, > but it is those markets which are driving TRIM adoption.Flash SSDs actually always remap new writes into a only-append-to-new-pages style, pretty much as ZFS does itself. So for a SSD there is no big difference between ZFS and filesystems as UFS, NTFS, HFS+ et al, on the flash level they all work the same. The reason is that there is no way for it to rewrite single disk blocks, it can only fill up already erased pages of 512K (for example). When the old blocks get mixed with unused blocks (because of block rewrites, TRIM or Write Many/UNMAP), it needs to compact the data by copying all active blocks from those pages into previously erased pages, and there write the active data compacted/continuos. (When this happens, things tend to get really slow.) So TRIM is just as applicable to ZFS as any other file system for flash SSD, there is no real difference.> [TRIMmed] > >>> Once a block is freed in ZFS, it no longer needs it. So the "problem" >>> of TRIM in ZFS is not related to the recent txg commit history. >> >> It may be that you want to save a few txgs back, so if you get >> a failure where parts of the last txg gets lost, you will still be >> able to get an old (few seconds/minutes) version of your data back. > > This is already implemented. Blocks freed in the past few txgs are > not returned to the freelist immediately. This was needed to enable > uberblock recovery in b128. So TRIMming from the freelist is safe.I see, very good!>> This could happen if the sync commands aren''t correctly implemented >> all the way (as we have seen some stories about on this list). >> Maybe someone disabled syncing somewhere to improve performance. >> >> It could also happen if a "non volatile" caching device, such as >> a storage controller, breaks in some bad way. Or maybe you just >> had a bad/old battery/supercap in a device that implements >> NV storage with batteries/supercaps. >> >>> The >>> issue is that traversing the free block list has to be protected by >>> locks, so that the file system does not allocate a block when it is >>> also TRIMming the block. Not so difficult, as long as the TRIM >>> occurs relatively quickly. >>> >>> I think that any TRIM implementation should be an administration >>> command, like scrub. It probably doesn''t make sense to have it >>> running all of the time. But on occasion, it might make sense. 
>> >> I am not sure why it shouldn''t run at all times, except for the >> fact that it seems to be badly implemented in some SATA devices >> with high latencies, so that it will interrupt any data streaming >> to/from the disks. > > I don''t see how it would not have negative performance impacts.It will, I am sure! But *if* the user for one reason or another wants TRIM, it can not be assumed that TRIMing major bunches at certain times is any better than trimming small amounts all the time. Both behaviors may be useful, but I find it hard to see a really good use case where you want batch trimming, but easy to see cases where continuous trimming could be useful and hopefully hardly noticeable thanks to the file system caching. /ragge s
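The compaction step described above (copying still-live blocks out of partially valid erase pages before erasing them) can be sketched roughly as follows; the sizes and the victim-selection policy are hypothetical, and no particular controller works exactly this way.

# Sketch of flash garbage collection: pick the erase page with the fewest
# live blocks, relocate those blocks, then erase the page to reclaim space.

def collect_one_page(pages, free_pages):
    """pages: dict of page_id -> set of live block offsets."""
    victim = min(pages, key=lambda p: len(pages[p]))    # fewest live blocks
    live = pages.pop(victim)
    copied = len(live)                                  # the write-amplification cost
    if live:
        dest = free_pages.pop()
        pages[dest] = set(range(copied))                # live data rewritten, compacted
    free_pages.append(victim)                           # victim erased and reusable
    return copied

pages = {0: set(range(10)), 1: set(range(100)), 2: set(range(60))}
print("copied", collect_one_page(pages, free_pages=[7, 8]), "live blocks to free one page")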
On Dec 31, 2009, at 13:44, Joerg Schilling wrote:> ZFS is COW, but does the SSD know which block is "in use" and which > is not? > > If the SSD did know whether a block is in use, it could erase unused > blocks > in advance. But what is an "unused block" on a filesystem that > supports > snapshots?Personally, I think that at some point in the future there will need to be a command telling SSDs that the file system will take care of handling blocks, as new FS designs will be COW. ZFS is the first "mainstream" one to do it, but Btrfs is there as well, and it looks like Apple will be making its own FS. Just as the first 4096-byte block disks are silently emulating 4096-to-512 blocks, SSDs are currently re-mapping LBAs behind the scenes. Perhaps in the future there will be a setting to say "no really, I''m talking about the /actual/ LBA 123456".
On Thu, Dec 31 at 16:53, David Magda wrote:>Just as the first 4096-byte block disks are silently emulating 4096 - >to-512 blocks, SSDs are currently re-mapping LBAs behind the scenes. >Perhaps in the future there will be a setting to say "no really, I''m >talking about the /actual/ LBA 123456".What, exactly, is the "/actual/ LBA 123456" on a modern SSD? --eric -- Eric D. Mudama edmudama at mail.bounceswoosh.org
On Thu, Dec 31 at 10:18, Bob Friesenhahn wrote:>There are of course SSDs with hardly any (or no) reserve space, but >while we might be willing to sacrifice an image or two to SSD block >failure in our digital camera, that is just not acceptable for >serious computer use.Some people are doing serious computing on devices with 6-7% reserve. Devices with less enforced reserve will be significantly cheaper per exposed gigabyte, independent of all other factors, and always give the user the flexibility to increase their effective reserve by destroking the working area a little or a lot. If someone just needs blazing fast read access and isn''t expecting to put more than a few cycles/day on their devices, small reserve MLC drives may be very cost effective and just as fast as their 20-30% reserve SLC counterparts. -- Eric D. Mudama edmudama at mail.bounceswoosh.org
On 31 dec 2009, at 22.53, David Magda wrote:> On Dec 31, 2009, at 13:44, Joerg Schilling wrote: > >> ZFS is COW, but does the SSD know which block is "in use" and which is not? >> >> If the SSD did know whether a block is in use, it could erase unused blocks >> in advance. But what is an "unused block" on a filesystem that supports >> snapshots?Snapshots make no difference - when you delete the last dataset/snapshot that references a file you also delete the data. Snapshots are a way to keep more files around; they are not really a way to keep the disk entirely full or anything like that. There is obviously no problem distinguishing between used and unused blocks, and zfs (or btrfs or similar) makes no difference.> Personally, I think that at some point in the future there will need to be a command telling SSDs that the file system will take care of handling blocks, as new FS designs will be COW. ZFS is the first "mainstream" one to do it, but Btrfs is there as well, and it looks like Apple will be making its own FS.That could be an idea, but there still will be holes after deleted files that need to be reclaimed. Do you mean it would be a major win to have the file system take care of the space reclaiming instead of the drive?> Just as the first 4096-byte block disks are silently emulating 4096-to-512 blocks, SSDs are currently re-mapping LBAs behind the scenes. Perhaps in the future there will be a setting to say "no really, I''m talking about the /actual/ LBA 123456".A typical flash page size is 512 KB. You probably don''t want to use all the physical pages, since those could be worn out or bad, so those need to be remapped (or otherwise avoided) at some level anyway. These days, typically disks do the remapping without the host computer knowing (both SSDs and rotating rust). I see the possible win that you could always use all the working blocks on the disk, and when blocks go bad your disk will shrink. I am not sure that is really what people expect, though. Apart from that, I am not sure what the gain would be. Could you elaborate on why this would be called for? /ragge
On Jan 1, 2010, at 03:30, Eric D. Mudama wrote:> On Thu, Dec 31 at 16:53, David Magda wrote: >> Just as the first 4096-byte block disks are silently emulating >> 4096-to-512 blocks, SSDs are currently re-mapping LBAs behind the >> scenes. Perhaps in the future there will be a setting to say "no >> really, I''m talking about the /actual/ LBA 123456". > > What, exactly, is the "/actual/ LBA 123456" on a modern SSD?It doesn''t exist currently because of the behind-the-scenes re-mapping that''s being done by the SSD''s firmware. While arbitrary to some extent, an "actual" LBA would presumably be the number of a particular cell in the SSD.
On Jan 1, 2010, at 04:33, Ragnar Sundblad wrote:> I see the possible win that you could always use all the working > blocks on the disk, and when blocks go bad your disk will shrink. > I am not sure that is really what people expect, though. Apart from > that, I am not sure what the gain would be. > > Could you elaborate on why this would be called for?Currently you have SSDs that look like disks, but under certain circumstances the OS / FS know that it isn''t rotating rust--in which case the TRIM command is then used by the OS to help the SSD''s allocation algorithm(s). If the file system is COW, and knows about SSDs via TRIM, why not just skip the middle-man and tell the SSD "I''ll take care of managing blocks". In the ZFS case, I think it''s a logical extension of how RAID is handled: ZFS'' system is much more helpful in most cases than hardware- / firmware-based RAID, so it''s generally best just to expose the underlying hardware to ZFS. In the same way ZFS already does COW, so why bother with the SSD''s firmware doing it when giving extra knowledge to ZFS could be more useful?
On 1 jan 2010, at 14.14, David Magda wrote:> On Jan 1, 2010, at 04:33, Ragnar Sundblad wrote: > >> I see the possible win that you could always use all the working >> blocks on the disk, and when blocks goes bad your disk will shrink. >> I am not sure that is really what people expect, though. Apart from >> that, I am not sure what the gain would be. >> >> Could you elaborate on why this would be called for? > > Currently you have SSDs that look like disks, but under certain circumstances the OS / FS know that it isn''t rotating rust--in which case the TRIM command is then used by the OS to help the SSD''s allocation algorithm(s).(Note that TRIM and equivalents are not only useful on SSDs, but on other storage too, such as when using sparse/thin storage.)> If the file system is COW, and knows about SSDs via TRIM, why not just skip the middle-man and tell the SSD "I''ll take care of managing blocks". > > In the ZFS case, I think it''s a logical extension of how RAID is handling: ZFS'' system is much more helpful in most case that hardware- / firmware-based RAID, so it''s generally best just to expose the underlying hardware to ZFS. In the same way ZFS already does COW, so why bother with the SSD''s firmware doing it when giving extra knowledge to ZFS could be more useful?But that would only move the hardware specific and dependent flash chip handling code into the file system code, wouldn''t it? What is won with that? As long as the flash chips have larger pages than the file system blocks, someone will have to shuffle around blocks to reclaim space, why not let the one thing that knows the hardware and also is very close to the hardware do it? And if this is good for SSDs, why isn''t it as good for rotating rust? /ragge s
On Jan 1, 2010, at 11:04, Ragnar Sundblad wrote:> But that would only move the hardware specific and dependent flash > chip handling code into the file system code, wouldn''t it? What > is won with that? As long as the flash chips have larger pages than > the file system blocks, someone will have to shuffle around blocks > to reclaim space, why not let the one thing that knows the hardware > and also is very close to the hardware do it? > > And if this is good for SSDs, why isn''t it as good for rotating rust?Don''t really see how things are either hardware specific or dependent. COW is COW. Am I missing something? It''s done by code somewhere in the stack, if the FS knows about it, it can lay things out in sequential writes. If we''re talking about 512 KB blocks, ZFS in particular would create four 128 KB txgs--and 128 KB is simply the currently #define''d size, which can be changed in the future. One thing you gain is perhaps not requiring to have as much of a reserve. At most you have some hidden bad block re-mapping, similar to rotating rust nowadays. If you''re shuffling blocks around, you''re doing a read-modify-write, which if done in the file system, you could use as a mechanism to defrag on-the-fly or to group many small files together. Not quite sure what you mean by your last question.
On Dec 31, 2009, at 12:59 PM, Ragnar Sundblad wrote:> Flash SSDs actually always remap new writes into a > only-append-to-new-pages style, pretty much as ZFS does itself. > So for a SSD there is no big difference between ZFS and > filesystems as UFS, NTFS, HFS+ et al, on the flash level they > all work the same.> The reason is that there is no way for it to rewrite single > disk blocks, it can only fill up already erased pages of > 512K (for example). When the old blocks get mixed with unused > blocks (because of block rewrites, TRIM or Write Many/UNMAP), > it needs to compact the data by copying all active blocks from > those pages into previously erased pages, and there write the > active data compacted/continuos. (When this happens, things tend > to get really slow.)However, the quantity of small, overwritten pages is vastly different. I am not convinced that a workload that generates few overwrites will be penalized as much as a workload that generates a large number of overwrites. I think most folks here will welcome good, empirical studies, but thus far the only one I''ve found is from STEC and their disks behave very well after they''ve been filled and subjected to a rewrite workload. You get what you pay for. Additional pointers are always appreciated :-) http://www.stec-inc.com/ssd/videos/ssdvideo1.php -- richard
On Fri, 1 Jan 2010, David Magda wrote:> > It doesn''t exist currently because of the behind-the-scenes re-mapping that''s > being done by the SSD''s firmware. > > While arbitrary to some extent, an "actual" LBA would presumably be the number > of a particular cell in the SSD.There seems to be some severe misunderstanding of what a SSD is. This severe misunderstanding leads one to assume that a SSD has a "native" blocksize. SSDs (as used in computer drives) are comprised of many tens of FLASH memory chips which can be laid out and mapped in whatever fashion the designers choose to do. They could be mapped sequentially, in parallel, a combination of the two, or perhaps even change behavior depending on use. Individual FLASH devices usually have a much smaller page size than 4K. A 4K write would likely be striped across several/many FLASH devices. The construction of any given SSD is typically a closely-held trade secret and the vendor will not reveal how it is designed. You would have to chip away the epoxy yourself and reverse-engineer in order to gain some understanding of how a given SSD operates and even then it would be mostly guesswork. It would be wrong for anyone here, including someone who has participated in the design of an SSD, to claim that they know how a "SSD" will behave unless they have access to the design of that particular SSD. Bob -- Bob Friesenhahn bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
On Fri, Jan 1, 2010 at 11:17 AM, Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:> On Fri, 1 Jan 2010, David Magda wrote: >> >> It doesn''t exist currently because of the behind-the-scenes re-mapping >> that''s being done by the SSD''s firmware. >> >> While arbitrary to some extent, an "actual" LBA would presumably be the >> number of a particular cell in the SSD. > > There seems to be some severe misunderstanding of what a SSD is. This severe > misunderstanding leads one to assume that a SSD has a "native" blocksize. > SSDs (as used in computer drives) are comprised of many tens of FLASH > memory chips which can be laid out and mapped in whatever fashion the > designers choose to do. They could be mapped sequentially, in parallel, a > combination of the two, or perhaps even change behavior depending on use. > Individual FLASH devices usually have a much smaller page size than 4K. A > 4K write would likely be striped across several/many FLASH devices. > > The construction of any given SSD is typically a closely-held trade secret > and the vendor will not reveal how it is designed. You would have to chip > away the epoxy yourself and reverse-engineer in order to gain some > understanding of how a given SSD operates and even then it would be mostly > guesswork. > > It would be wrong for anyone here, including someone who has participated in > the design of an SSD, to claim that they know how a "SSD" will behave unless > they have access to the design of that particular SSD. >The main issue is that most flash devices support 128k byte pages, and the smallest "chunk" (for want of a better word) of flash memory that can be written is a page - or 128kb. So if you have a write to an SSD that only changes 1 byte in one 512 byte "disk" sector, the SSD controller has to either read/re-write the affected page or figure out how to update the flash memory with the minimum effect on flash wear. If one didn''t have to worry about flash wear levelling, one could read/update/write the affected page all day long..... And, to date, flash writes are much slower than flash reads - which is another basic property of the current generation of flash devices. For anyone who is interested in getting more details of the challenges with flash memory, when used to build solid state drives, reading the tech data sheets on the flash memory devices will give you a feel for the basic issues that must be solved. Bob''s point is well made. The specifics of a given SSD implementation will make the performance characteristics of the resulting SSD very difficult to predict or even describe - especially as the device hardware and firmware continue to evolve. And some SSDs change the algorithms they implement on-the-fly - depending on the characteristics of the current workload and of the (inbound) data being written. There are some links to well written articles in the URL I posted earlier this morning: http://www.anandtech.com/storage/showdoc.aspx?i=3702 Regards, -- Al Hopper Logical Approach Inc,Plano,TX al at logical-approach.com Voice: 972.379.2133 Timezone: US CDT OpenSolaris Governing Board (OGB) Member - Apr 2005 to Mar 2007 http://www.opensolaris.org/os/community/ogb/ogb_2005-2007/
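To put a number on the one-byte-update case described above: with an illustrative 128 KB erase unit and a naive read-modify-write, the amplification is enormous. This is a toy calculation, not a claim about any specific drive; real controllers avoid the worst case by remapping and wear levelling, as the quoted articles explain.

# Worst-case write amplification when a tiny logical update forces a whole
# flash erase unit to be rewritten. Sizes are illustrative only.

ERASE_UNIT = 128 * 1024        # bytes rewritten in the naive case
SECTOR = 512                   # size of the logical write the host issued

amplification = ERASE_UNIT / SECTOR
print(f"1 sector changed -> {ERASE_UNIT // 1024} KB rewritten "
      f"(~{amplification:.0f}x write amplification)")

# An append-style controller instead logs the changed sector to a pre-erased
# page and updates its mapping table, deferring the cost to garbage collection.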
On 1 jan 2010, at 17.44, Richard Elling wrote:> On Dec 31, 2009, at 12:59 PM, Ragnar Sundblad wrote: >> Flash SSDs actually always remap new writes into a >> only-append-to-new-pages style, pretty much as ZFS does itself. >> So for a SSD there is no big difference between ZFS and >> filesystems as UFS, NTFS, HFS+ et al, on the flash level they >> all work the same. > >> The reason is that there is no way for it to rewrite single >> disk blocks, it can only fill up already erased pages of >> 512K (for example). When the old blocks get mixed with unused >> blocks (because of block rewrites, TRIM or Write Many/UNMAP), >> it needs to compact the data by copying all active blocks from >> those pages into previously erased pages, and there write the >> active data compacted/continuos. (When this happens, things tend >> to get really slow.) > > However, the quantity of small, overwritten pages is vastly different. > I am not convinced that a workload that generates few overwrites > will be penalized as much as a workload that generates a large > number of overwrites.Zfs is not append only in itself, there will be holes from deleted files after a while, and space will have to be reclaimed sooner or later. I am not convinced that a zfs that has been in use for a while rewrites a lot less than other file systems. But maybe you are right, and if so, I agree that intuitively such a workload may be better matched to a flash based device. If you have a workload that only appends data and never changes or deletes it, zfs is probably a bit better than other file systems of not rewriting blocks. But that is a pretty special use case, and another file system could rewrite almost as little.> I think most folks here will welcome good, empirical studies, > but thus far the only one I''ve found is from STEC and their > disks behave very well after they''ve been filled and subjected > to a rewrite workload. You get what you pay for. Additional > pointers are always appreciated :-) > http://www.stec-inc.com/ssd/videos/ssdvideo1.phpThere certainly are big differences between the flash SSD drives out there, I wouldn''t argue about that for a second! /ragge
On 1 jan 2010, at 17.28, David Magda wrote:> On Jan 1, 2010, at 11:04, Ragnar Sundblad wrote: > >> But that would only move the hardware specific and dependent flash >> chip handling code into the file system code, wouldn''t it? What >> is won with that? As long as the flash chips have larger pages than >> the file system blocks, someone will have to shuffle around blocks >> to reclaim space, why not let the one thing that knows the hardware >> and also is very close to the hardware do it? >> >> And if this is good for SSDs, why isn''t it as good for rotating rust? > > Don''t really see how things are either hardware specific or dependent.The inner workings of a SSD flash drive is pretty hardware (or rather vendor) specific, and it may not be a good idea to move any knowledge about that to the file system layer.> COW is COW. Am I missing something? It''s done by code somewhere in the stack, if the FS knows about it, it can lay things out in sequential writes. If we''re talking about 512 KB blocks, ZFS in particular would create four 128 KB txgs--and 128 KB is simply the currently #define''d size, which can be changed in the future.As I said in another mail, zfs is not append only, especially not if it has been in random read write use for a while. There will be holes in the data and space to be reclaimed, something has to handle that, and I am not sure it is a good idea to move that into the host, since it it dependent of the design of the SSD drive.> One thing you gain is perhaps not requiring to have as much of a reserve. At most you have some hidden bad block re-mapping, similar to rotating rust nowadays. If you''re shuffling blocks around, you''re doing a read-modify-write, which if done in the file system, you could use as a mechanism to defrag on-the-fly or to group many small files together.Yes, defrag on the fly may be interesting. Otherwise I am not sure I think the file system should do any of that, since it may be that it can be done much faster and smarter in the SSD controller.> Not quite sure what you mean by your last question.I meant that if hardware dependent handling of the storage medium is good to move into the host, why isn''t the same true for spinning disks? But we can leave that for now. /ragge
On 1 jan 2010, at 18.17, Bob Friesenhahn wrote:> On Fri, 1 Jan 2010, David Magda wrote: >> >> It doesn''t exist currently because of the behind-the-scenes re-mapping that''s being done by the SSD''s firmware. >> >> While arbitrary to some extent, an "actual" LBA would presumably be the number of a particular cell in the SSD. > > There seems to be some severe misunderstanding of what a SSD is. This severe misunderstanding leads one to assume that a SSD has a "native" blocksize. SSDs (as used in computer drives) are comprised of many tens of FLASH memory chips which can be laid out and mapped in whatever fashion the designers choose to do. They could be mapped sequentially, in parallel, a combination of the two, or perhaps even change behavior depending on use. Individual FLASH devices usually have a much smaller page size than 4K. A 4K write would likely be striped across several/many FLASH devices.Yes, but erases are always much larger, right? (With the flash chips of today, I am not sure why there aren''t any flash chips with smaller erase page sizes yet.)> The construction of any given SSD is typically a closely-held trade secret and the vendor will not reveal how it is designed. You would have to chip away the epoxy yourself and reverse-engineer in order to gain some understanding of how a given SSD operates and even then it would be mostly guesswork. > > It would be wrong for anyone here, including someone who has participated in the design of an SSD, to claim that they know how a "SSD" will behave unless they have access to the design of that particular SSD.I certainly agree, but there still isn''t much they can do about the WORM-like properties of flash chips, where reading is pretty fast, writing is not too bad, but erasing is very slow and must be done in pretty large pages which also means that active data probably have to be copied around before an erase. I believe this is why even fast flash SSD devices can take tens or even hundreds of thousands of writes for a short burst, but then fall back to a few thousand writes/second sustained. /ragge
Mike, As far as I know only Hitachi is using such a huge chunk size: "So each vendor''s implementation of TP uses a different block size. HDS use 42MB on the USP, EMC use 768KB on DMX, IBM allow a variable size from 32KB to 256KB on the SVC and 3Par use blocks of just 16KB. The reasons for this are many and varied and for legacy hardware are a reflection of the underlying hardware architecture." http://gestaltit.com/all/tech/storage/chris/thin-provisioning-holy-grail-utilisation/ Also, here Hu explains the reason why they believe 42M is the most efficient: http://blogs.hds.com/hu/2009/07/chunk-size-matters.html He has some good points in his arguments. Regards, sendai -- This message posted from opensolaris.org
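A quick way to see why the chunk (extent) size matters: the array allocates whole chunks, so small scattered writes can pin far more physical space than the data they contain. The chunk sizes below mirror the figures quoted above; the write pattern is invented for illustration.

# Hypothetical worst case: 1000 writes of 64 KB, each landing in its own chunk.
WRITES, WRITE_KB = 1000, 64
used_gb = WRITES * WRITE_KB / 1024 / 1024

for name, chunk_kb in (("HDS 42MB", 42 * 1024), ("EMC 768KB", 768), ("3Par 16KB", 16)):
    allocated_kb = WRITES * max(chunk_kb, WRITE_KB)   # never less than the data itself
    print(f"{name:10s}: ~{allocated_kb / 1024 / 1024:6.2f} GB allocated "
          f"for {used_gb:.2f} GB of data")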
Ragnar Sundblad <ragge at csc.kth.se> wrote:> On 1 jan 2010, at 17.28, David Magda wrote:> > Don''t really see how things are either hardware specific or dependent. > > The inner workings of a SSD flash drive is pretty hardware (or > rather vendor) specific, and it may not be a good idea to move > any knowledge about that to the file system layer.If ZFS wants to keep SSDs fast even after they have been in use for a while, then even ZFS would need to tell the SSD which sectors are no longer in use. Such a mode may cause a noticeable performance loss as ZFS for this reason may need to traverse freed outdated data trees but it will help the SSD to erase the needed space in advance. Jörg -- EMail:joerg at schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin js at cs.tu-berlin.de (uni) joerg.schilling at fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/ URL: http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
Ragnar Sundblad <ragge at csc.kth.se> wrote:> I certainly agree, but there still isn''t much they can do about > the WORM-like properties of flash chips, where reading is pretty > fast, writing is not too bad, but erasing is very slow and must be > done in pretty large pages which also means that active data > probably have to be copied around before an erase.WORM devices do not allow writing a block a second time. There is a typical 5% reserve that would allow reassigning some blocks and making it appear they have been rewritten, but this is not what ZFS does. Well, you are however right that there is a slight relation as I did invent COW for a WORM filesystem in 1989 ;-) Jörg -- EMail:joerg at schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin js at cs.tu-berlin.de (uni) joerg.schilling at fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/ URL: http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
Joerg Schilling wrote:> Ragnar Sundblad <ragge at csc.kth.se> wrote: > > >> On 1 jan 2010, at 17.28, David Magda wrote: >> > > >>> Don''t really see how things are either hardware specific or dependent. >>> >> The inner workings of a SSD flash drive is pretty hardware (or >> rather vendor) specific, and it may not be a good idea to move >> any knowledge about that to the file system layer. >> > > If ZFS wants to keep SSDs fast even after they have been in use for a while, then > even ZFS would need to tell the SSD which sectors are no longer in use. > > > Such a mode may cause a noticeable performance loss as ZFS for this reason > may need to traverse freed outdated data trees but it will help the SSD > to erase the needed space in advance. > > Jörg the TRIM command is what is intended for an OS to notify the SSD as to which blocks are deleted/erased, so the SSD''s internal free list can be updated (that is, it allows formerly-in-use blocks to be moved to the free list). This is necessary since only the OS has the information to determine which previously-written-to blocks are actually no longer in-use. See the parallel discussion here titled "preview of new SSD based on SandForce controller" for more about "smart" vs "dumb" SSD controllers. From ZFS''s standpoint, the optimal configuration would be for the SSD to inform ZFS as to its PAGE size, and ZFS would use this as the fundamental BLOCK size for that device (i.e. all writes are in integer multiples of the SSD page size). Reads could be in smaller sections, though. Which would be interesting: ZFS would write in Page Size increments, and read in Block Size amounts. -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (GMT-0800)
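If such a query existed, the host side could be as simple as rounding every device write up to a whole number of pages. The page-size query is hypothetical (as the follow-ups note, no such command is defined today); the rounding itself is the whole idea.

def aligned_write_size(nbytes: int, page_size: int) -> int:
    """Round a write up to a whole number of device pages (hypothetical page_size query)."""
    pages = -(-nbytes // page_size)          # ceiling division
    return pages * page_size

# e.g. a 100 KB logical write against a device reporting 32 KB pages
print(aligned_write_size(100 * 1024, 32 * 1024) // 1024, "KB issued")   # 128 KB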
Erik Trimble <Erik.Trimble at sun.com> wrote:> From ZFS''s standpoint, the optimal configuration would be for the SSD > to inform ZFS as to its PAGE size, and ZFS would use this as the > fundamental BLOCK size for that device (i.e. all writes are in integer It seems that a command to retrieve this information does not yet exist, or could you help me? Jörg -- EMail:joerg at schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin js at cs.tu-berlin.de (uni) joerg.schilling at fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/ URL: http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
On 2 jan 2010, at 12.43, Joerg Schilling wrote:> Ragnar Sundblad <ragge at csc.kth.se> wrote: > >> I certainly agree, but there still isn''t much they can do about >> the WORM-like properties of flash chips, were reading is pretty >> fast, writing is not to bad, but erasing is very slow and must be >> done in pretty large pages which also means that active data >> probably have to be copied around before an erase. > > WORM devices do not allow to write a block a secdond time.(I know, that is why I wrote WORM-like.)> There is > a typical 5% reserve that would allow to reassign some blocks and to make it > appear they have been rewritten, but this is not what ZFS does.Well, zfs kind of does, but especially typical flash SSDs do it, they have a redirection layer so that any block can go anywhere, so they can use the flash media in a WORM like style with occasional bulk erases.> Well, you are > hoewever true that there is a slight relation as I did invent COW for a WORM > filesystem in 1989 ;-)Yes, there indeed are several similarities. /ragge
On 2 jan 2010, at 13.10, Erik Trimble wrote:> Joerg Schilling wrote: >> Ragnar Sundblad <ragge at csc.kth.se> wrote: >> >> >>> On 1 jan 2010, at 17.28, David Magda wrote: >>> >> >> >>>> Don''t really see how things are either hardware specific or dependent. >>>> >>> The inner workings of a SSD flash drive is pretty hardware (or >>> rather vendor) specific, and it may not be a good idea to move >>> any knowledge about that to the file system layer. >>> >> >> If ZFS likes to keep SSDs fast even after it was in use for a while, then >> even ZFS would need to tell the SSD which sectors are no longer in use. >> >> >> Such a mode may cause a noticable performance loss as ZFS for this reason >> may need to traverse freed outdated data trees but it will help the SSD >> to erase the needed space in advance. >> >> J?r > the TRIM command is what is intended for an OS to notify the SSD as to which blocks are deleted/erased, so the SSD''s internal free list can be updated (that is, it allows formerly-in-use blocks to be moved to the free list). This is necessary since only the OS has the information to determine which previous-written-to blocks are actually no longer in-use. > > See the parallel discussion here titled "preview of new SSD based on SandForce controller" for more about "smart" vs "dumb" SSD controllers. > > From ZFS''s standpoint, the optimal configuration would be for the SSD to inform ZFS as to it''s PAGE size, and ZFS would use this as the fundamental BLOCK size for that device (i.e. all writes are in integer multiples of the SSD page size). Reads could be in smaller sections, though. Which would be interesting: ZFS would write in Page Size increments, and read in Block Size amounts.Well, this could be useful if updates are larger than the block size, for example 512 K, as it is then possible to erase and rewrite without having to copy around other data from the page. If updates are smaller, zfs will have to reclaim erased space by itself, which if I am not mistaken it can not do today (but probably will in some future, I guess the BP Rewrite is what is needed). I am still not entirely convinced that it would be better to let the file system take care of that instead of a flash controller, there could be quite a lot of reading and writing going on for space reclamation (depending on the work load, of course). /ragge
Joerg Schilling wrote:> Erik Trimble <Erik.Trimble at sun.com> wrote: > > >> From ZFS''s standpoint, the optimal configuration would be for the SSD >> to inform ZFS as to its PAGE size, and ZFS would use this as the >> fundamental BLOCK size for that device (i.e. all writes are in integer >> > > It seems that a command to retrieve this information does not yet exist, > or could you help me? > > Jörg > >Sadly, no, there does not exist any way for the SSD to communicate that info back to the OS. Probably, the smart thing to push for is inclusion of some new command in the ATA standard (in a manner like TRIM). Likely something that would return both native Block and Page sizes upon query. I''m still trying to see if there will be any support for TRIM-like things in SAS. -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA
Ragnar Sundblad wrote:> On 2 jan 2010, at 13.10, Erik Trimble wrote >> Joerg Schilling wrote: >> >> the TRIM command is what is intended for an OS to notify the SSD as to which blocks are deleted/erased, so the SSD''s internal free list can be updated (that is, it allows formerly-in-use blocks to be moved to the free list). This is necessary since only the OS has the information to determine which previous-written-to blocks are actually no longer in-use. >> >> See the parallel discussion here titled "preview of new SSD based on SandForce controller" for more about "smart" vs "dumb" SSD controllers. >> >> From ZFS''s standpoint, the optimal configuration would be for the SSD to inform ZFS as to it''s PAGE size, and ZFS would use this as the fundamental BLOCK size for that device (i.e. all writes are in integer multiples of the SSD page size). Reads could be in smaller sections, though. Which would be interesting: ZFS would write in Page Size increments, and read in Block Size amounts. >> > > Well, this could be useful if updates are larger than the block size, for example 512 K, as it is then possible to erase and rewrite without having to copy around other data from the page. If updates are smaller, zfs will have to reclaim erased space by itself, which if I am not mistaken it can not do today (but probably will in some future, I guess the BP Rewrite is what is needed). >Sure, it does that today. What do you think happens on a standard COW action? Let''s be clear here: I''m talking about exactly the same thing that currently happens when you modify a ZFS "block" that spans multiple vdevs (say, in a RAIDZ). The entire ZFS block is read from disk/L2ARC, the modifications made, then it is written back to storage, likely in another LBA. The original ZFS block location ON THE VDEV is now available for re-use (i.e. the vdev adds it to it''s Free Block List). This is one of the things that leads to ZFS''s fragmentation issues (note, we''re talking about block fragmentation on the vdev, not ZFS block fragmentation), and something that we''re looking to BP rewrite to enable defragging to be implemented. In fact, I would argue that the biggest advantage of removing any advanced intelligence from the SSD controller is with small modifications to existing files. By using the L2ARC (and other features, like compression, encryption, and dedup), ZFS can composite the needed changes with an existing cached copy of the ZFS block(s) to be changed, then issue a full new block write to the SSD. This eliminates the need for the SSD to do the dreaded Read-Modify-Write cycle, and instead do just a Write. In this scenario, the ZFS block is likely larger than the SSD Page size, so more data will need to be written; however, given the highly parallel nature of SSDs, writing several SSD pages simultaneously is easy (and fast); let''s remember that a ZFS block is a maximum of only 8x the size of a SSD page, and writing 8 pages is only slightly more work than writing 1 page. This larger write is all a single IOP, where a R-M-W essentially requires 3 IOPS. 
If you want the SSD controller to do the work, then it ALWAYS has to read the to-be-modified page from NAND, do the mod itself, then issue the write - and, remember, ZFS likely has already issued a full ZFS-block write (due to the COW nature of ZFS, there is no concept of "just change this 1 bit and leave everything else on disk where it is"), so you likely don''t save on the number of pages that need to be written in any case.> I am still not entirely convinced that it would be better to let the file system take care of that instead of a flash controller, there could be quite a lot of reading and writing going on for space reclamation (depending on the work load, of course). > > /raggeThe point here is that regardless of the workload, there''s a R-M-W cycle that has to happen, whether that occurs at the ZFS level or at the SSD level. My argument is that the OS has a far better view of the whole data picture, and access to much higher performing caches (i.e. RAM/registers) than the SSD, so not only can the OS make far better decisions about the data and how (and how much of) it should be stored, but it''s almost certainly to be able to do so far faster than any little SSD controller can do. -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA
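As a rough illustration of the single-write versus read-modify-write argument (the latencies are invented round numbers, not measurements of any device): compositing the block in host RAM and issuing one full-block write avoids the device-side read and modify steps entirely.

# Toy latency model for updating one logical block, in microseconds.
# The figures are invented; only the ratio is meant to be illustrative.

NAND_READ, NAND_PROGRAM, CTRL_MODIFY = 50, 200, 20

def device_side_rmw():
    # SSD reads the old page, patches it, programs a new page (3 device operations)
    return NAND_READ + CTRL_MODIFY + NAND_PROGRAM

def host_side_cow_write():
    # Host already holds the block in its cache, composites the change in RAM,
    # and issues a single full-block write (1 device operation)
    return NAND_PROGRAM

print("device R-M-W  :", device_side_rmw(), "us")
print("host COW write:", host_side_cow_write(), "us")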
On 2 jan 2010, at 22.49, Erik Trimble wrote:> Ragnar Sundblad wrote: >> On 2 jan 2010, at 13.10, Erik Trimble wrote >>> Joerg Schilling wrote: >>> the TRIM command is what is intended for an OS to notify the SSD as to which blocks are deleted/erased, so the SSD''s internal free list can be updated (that is, it allows formerly-in-use blocks to be moved to the free list). This is necessary since only the OS has the information to determine which previous-written-to blocks are actually no longer in-use. >>> >>> See the parallel discussion here titled "preview of new SSD based on SandForce controller" for more about "smart" vs "dumb" SSD controllers. >>> >>> From ZFS''s standpoint, the optimal configuration would be for the SSD to inform ZFS as to it''s PAGE size, and ZFS would use this as the fundamental BLOCK size for that device (i.e. all writes are in integer multiples of the SSD page size). Reads could be in smaller sections, though. Which would be interesting: ZFS would write in Page Size increments, and read in Block Size amounts. >>> >> >> Well, this could be useful if updates are larger than the block size, for example 512 K, as it is then possible to erase and rewrite without having to copy around other data from the page. If updates are smaller, zfs will have to reclaim erased space by itself, which if I am not mistaken it can not do today (but probably will in some future, I guess the BP Rewrite is what is needed). >> > Sure, it does that today. What do you think happens on a standard COW action? Let''s be clear here: I''m talking about exactly the same thing that currently happens when you modify a ZFS "block" that spans multiple vdevs (say, in a RAIDZ). The entire ZFS block is read from disk/L2ARC, the modifications made, then it is written back to storage, likely in another LBA. The original ZFS block location ON THE VDEV is now available for re-use (i.e. the vdev adds it to it''s Free Block List). This is one of the things that leads to ZFS''s fragmentation issues (note, we''re talking about block fragmentation on the vdev, not ZFS block fragmentation), and something that we''re looking to BP rewrite to enable defragging to be implemented.What I am talking about is to be able to reuse the free space you get in the previously written data when you write modified data to new places on the disk, or just remove a file for that matter. To be able to reclaim that space with flash, you have to erase large pages (for example 512 KB), but before you erase, you will also have to save away all still valid data in that page and rewrite that to a free page. What I am saying is that I am not sure that this would be best done in the file system, since it could be quite a bit of data to shuffle around, and there could possibly be hardware specific optimizations that could be done here that zfs wouldn''t know about. A good flash controller could probably do it much better. (And a bad one worse, of course.) And as far as I know, zfs can not do that today - it can not move around already written data, not for defragmentation, not for adding or removing disks to stripes/raidz:s, not for deduping/duping and so on, and I have understood it as BP Rewrite could solve a lot of this. 
Still, it could certainly be useful if zfs could try to use a blocksize that matches the SSD erase page size - this could avoid having to copy and compact data before erasing, which could speed up writes in a typical flash SSD disk.> In fact, I would argue that the biggest advantage of removing any advanced intelligence from the SSD controller is with small modifications to existing files. By using the L2ARC (and other features, like compression, encryption, and dedup), ZFS can composite the needed changes with an existing cached copy of the ZFS block(s) to be changed, then issue a full new block write to the SSD. This eliminates the need for the SSD to do the dreaded Read-Modify-Write cycle, and instead do just a Write. In this scenario, the ZFS block is likely larger than the SSD Page size, so more data will need to be written; however, given the highly parallel nature of SSDs, writing several SSD pages simultaneously is easy (and fast); let''s remember that a ZFS block is a maximum of only 8x the size of a SSD page, and writing 8 pages is only slightly more work than writing 1 page. This larger write is all a single IOP, where a R-M-W essentially requires 3 IOPS. If you want the SSD controller to do the work, then it ALWAYS has to read the to-be-modified page from NAND, do the mod itself, then issue the write - and, remember, ZFS likely has already issued a full ZFS-block write (due to the COW nature of ZFS, there is no concept of "just change this 1 bit and leave everything else on disk where it is"), so you likely don''t save on the number of pages that need to be written in any case.I don''t think many SSDs do R-M-W, but rather just append blocks to free pages (pretty much as zfs works, if you will). They also have to do some space reclamation (copying/compacting blocks and erasing pages) in the background, of course.> I am still not entirely convinced that it would be better to let the file system take care of that instead of a flash controller, there could be quite a lot of reading and writing going on for space reclamation (depending on the work load, of course). >> >> /ragge > The point here is that regardless of the workload, there''s a R-M-W cycle that has to happen, whether that occurs at the ZFS level or at the SSD level. My argument is that the OS has a far better view of the whole data picture, and access to much higher performing caches (i.e. RAM/registers) than the SSD, so not only can the OS make far better decisions about the data and how (and how much of) it should be stored, but it''s almost certainly to be able to do so far faster than any little SSD controller can do.Well, inside the flash system you could possibly have a much better situation to shuffle data around for space reclamation - that is copying and compacting data and erasing flash pages. If the device has a good design, that is! If the SSD controller is some small slow sad thing it might be better to shuffle it up and down to the host and do it in the CPU, but I am not sure about that either since it typically is the very same slow controller that does the host communication. I certainly agree that there seems to be some redundancy when the flash SSD controller does a logging-file-system kind of work under zfs that does pretty much that by itself, and it could possibly be better to cut one of them (and not zfs). I am still not convinced that it won''t be better to do this in a good controller instead just for speed and to take advantage of new hardware that does this smarter than the devices of today. 
Do you know how the F5100 works for example? /ragge
On Jan 2, 2010, at 16:49, Erik Trimble wrote:> My argument is that the OS has a far better view of the whole data > picture, and access to much higher performing caches (i.e. RAM/ > registers) than the SSD, so not only can the OS make far better > decisions about the data and how (and how much of) it should be > stored, but it''s almost certainly to be able to do so far faster > than any little SSD controller can do.Though one advantage of doing it within the disk is that you''re not using up bus bandwidth. Probably not that big of a deal, but worth mentioning for completeness / fairness.
On Jan 2, 2010, at 1:47 AM, Andras Spitzer wrote:> Mike, > > As far as I know only Hitachi is using such a huge chunk size: > > "So each vendor''s implementation of TP uses a different block size. > HDS use 42MB on the USP, EMC use 768KB on DMX, IBM allow a variable > size from 32KB to 256KB on the SVC and 3Par use blocks of just 16KB. > The reasons for this are many and varied and for legacy hardware are > a reflection of the underlying hardware architecture." > > http://gestaltit.com/all/tech/storage/chris/thin-provisioning-holy-grail-utilisation/ > > Also, here Hu explains the reason why they believe 42M is the most > efficient: > > http://blogs.hds.com/hu/2009/07/chunk-size-matters.html > > He has some good points in his arguments.Yes, and they apply to ZFS dedup as well... :-) -- richard
Ragnar Sundblad wrote:> On 2 jan 2010, at 22.49, Erik Trimble wrote: > > >> Ragnar Sundblad wrote: >> >>> On 2 jan 2010, at 13.10, Erik Trimble wrote >>> >>>> Joerg Schilling wrote: >>>> the TRIM command is what is intended for an OS to notify the SSD as to which blocks are deleted/erased, so the SSD''s internal free list can be updated (that is, it allows formerly-in-use blocks to be moved to the free list). This is necessary since only the OS has the information to determine which previous-written-to blocks are actually no longer in-use. >>>> >>>> See the parallel discussion here titled "preview of new SSD based on SandForce controller" for more about "smart" vs "dumb" SSD controllers. >>>> >>>> From ZFS''s standpoint, the optimal configuration would be for the SSD to inform ZFS as to it''s PAGE size, and ZFS would use this as the fundamental BLOCK size for that device (i.e. all writes are in integer multiples of the SSD page size). Reads could be in smaller sections, though. Which would be interesting: ZFS would write in Page Size increments, and read in Block Size amounts. >>>> >>>> >>> Well, this could be useful if updates are larger than the block size, for example 512 K, as it is then possible to erase and rewrite without having to copy around other data from the page. If updates are smaller, zfs will have to reclaim erased space by itself, which if I am not mistaken it can not do today (but probably will in some future, I guess the BP Rewrite is what is needed). >>> >>> >> Sure, it does that today. What do you think happens on a standard COW action? Let''s be clear here: I''m talking about exactly the same thing that currently happens when you modify a ZFS "block" that spans multiple vdevs (say, in a RAIDZ). The entire ZFS block is read from disk/L2ARC, the modifications made, then it is written back to storage, likely in another LBA. The original ZFS block location ON THE VDEV is now available for re-use (i.e. the vdev adds it to it''s Free Block List). This is one of the things that leads to ZFS''s fragmentation issues (note, we''re talking about block fragmentation on the vdev, not ZFS block fragmentation), and something that we''re looking to BP rewrite to enable defragging to be implemented. >> > > What I am talking about is to be able to reuse the free space > you get in the previously written data when you write modified > data to new places on the disk, or just remove a file for that > matter. To be able to reclaim that space with flash, you have > to erase large pages (for example 512 KB), but before you erase, > you will also have to save away all still valid data in that > page and rewrite that to a free page. What I am saying is that > I am not sure that this would be best done in the file system, > since it could be quite a bit of data to shuffle around, and > there could possibly be hardware specific optimizations that > could be done here that zfs wouldn''t know about. A good flash > controller could probably do it much better. (And a bad one > worse, of course.) >You certainly DO get to reuse the free space again. Here''s what happens nowdays in an SSD: Let''s say I have 4k blocks, grouped into a 128k page. That is, the SSD''s fundamental minimum unit size is 4k, but the minimum WRITE size is 128k. Thus, 32 blocks in a page. So, I write a bit of data 100k in size. This occupies the first 25 blocks in the one page. The remaining 9 blocks are still one the SSD''s Free List (i.e. list of free space). 
Now, I want to change the last byte of the file, and add 10k more to the file. Currently, a non-COW filesystem will simply send the 1 byte modification request and the 10k addition to the SSD (all as one unit, if you are lucky - if not, it comes as two ops: 1 byte modification followed by a 10k append). The SSD now has to read all 25 blocks of the page back into it''s local cache on the controller, do the modification and append computing, then writes out 28 blocks to NAND. In all likelihood, if there is any extra pre-erased (or never written to) space on the drive, this 28 block write will go to a whole new page. The blocks in the original page will be moved over to the SSD Free List (and may or may not be actually erased, depending on the controller). For filesystems like ZFS, this is a whole lot of extra work being done that doesn''t need to happen (and, chews up valuable IOPS and time). For, when ZFS does a write, it doesn''t merely just twiddle the modified/appended bits - instead, it creates a whole new ZFS block to write. In essence, ZFS has already done all the work that the SSD controller is planning on doing. So why duplicate the effort? SSDs should simply notify ZFS about their block & page sizes, which would then allow ZFS to better align it''s own variable block size to optimally coincide with the SSD''s implementation.> And as far as I know, zfs can not do that today - it can not > move around already written data, not for defragmentation, not > for adding or removing disks to stripes/raidz:s, not for > deduping/duping and so on, and I have understood it as > BP Rewrite could solve a lot of this. >ZFS''s propensity to fragmentation doesn''t mean you lose space. Rather, it means that COW often results in frequently-modified files being distributed over the entire media, rather than being contiguous. So, over time, the actual media has very little (if any) contiguous free space, which is what the fragmentation problem is. BP rewrite will indeed allow us to create a de-fragger. Areas which used to hold a ZFS block (now vacated by a COW to somewhere else) are simply added back to the device''s Free List. Now, in SSD''s case, this isn''t a worry. Due to the completely even performance characteristics of NAND, it doesn''t make any difference if the physical layout of a file happens to be sections (e.g. ZFS blocks) scattered all over the SSD. Access time is identical, and so is read time. SSD''s don''t care about this kind of fragmentation. What SSD''s have to worry about is sub-page fragmentation. Which brings us back to the whole R-M-W mess.> Still, it could certainly be useful if zfs could try to use a > blocksize that matches the SSD erase page size - this could > avoid having to copy and compact data before erasing, which > could speed up writes in a typical flash SSD disk. > > >> In fact, I would argue that the biggest advantage of removing any advanced intelligence from the SSD controller is with small modifications to existing files. By using the L2ARC (and other features, like compression, encryption, and dedup), ZFS can composite the needed changes with an existing cached copy of the ZFS block(s) to be changed, then issue a full new block write to the SSD. This eliminates the need for the SSD to do the dreaded Read-Modify-Write cycle, and instead do just a Write. 
In this scenario, the ZFS block is likely larger than the SSD Page size, so more data will need to be written; however, given the highly parallel nature of SSDs, writing several SSD pages simultaneously is easy (and fast); let''s remember that a ZFS block is a maximum of only 8x the size of a SSD page, and writing 8 pages is only slightly more work than writing 1 page. This larger write is all a single IOP, where a R-M-W essentially requires 3 IOPS. If you want the SSD controller to do the work, then it ALWAYS has to read the to-be-modified page from NAND, do the mod itself, then issue the write - and, remember, ZFS likely has already issued a full ZFS-block write (due to the COW nature of ZFS, there is no concept of "just change this 1 bit and leave everything else on disk where it is"), so you likely don''t save on the number of pages that need to be written in any case. >> > > I don''t think many SSDs do R-M-W, but rather just append blocks > to free pages (pretty much as zfs works, if you will). They also > have to do some space reclamation (copying/compacting blocks and > erasing pages) in the background, of course.>MLC-based SSDs all do R-M-W. Now, they might not do Read-Modify-Erase-Write right away: But they''ll do R-M-W on ANY write which modifies existing data (unless you are extremely lucky and your data fully fills an existing page): the difference is that the final W is to previous-unused NAND page(s). However, when the SSD runs out of never-used space, it starts to have to add the E step on future writes. So far as I know, no SSD does space reclamation in the manner you refer to. That is, the SSD controller isn''t going to be moving data around on its own, with the exception of wear-leveling. TRIM is there so that the SSD can add stuff to it''s internal Free List more efficiently, but an SSD isn''t going (on its own) say: "Ooh, page 1004 has only 5 of 10 blocks used, so why don''t we merge it with page 20054, which has only 3 of 10 blocks used.">> I am still not entirely convinced that it would be better to let the file system take care of that instead of a flash controller, there could be quite a lot of reading and writing going on for space reclamation (depending on the work load, of course). >> >>> /ragge >>> >> The point here is that regardless of the workload, there''s a R-M-W cycle that has to happen, whether that occurs at the ZFS level or at the SSD level. My argument is that the OS has a far better view of the whole data picture, and access to much higher performing caches (i.e. RAM/registers) than the SSD, so not only can the OS make far better decisions about the data and how (and how much of) it should be stored, but it''s almost certainly to be able to do so far faster than any little SSD controller can do. >> > > Well, inside the flash system you could possibly have a much > better situation to shuffle data around for space reclamation - > that is copying and compacting data and erasing flash pages. > If the device has a good design, that is! If the SSD controller > is some small slow sad thing it might be better to shuffle it up > and down to the host and do it in the CPU, but I am not sure > about that either since it typically is the very same slow > controller that does the host communication. >It''s actually far more likely that a dumb SSD controller can handle high levels of pure data transfer faster than a smart SSD controller can actually manipulate that same data quickly. 
SSD controllers, by their very nature, need to be as small and cheap as possible, which means they have extremely limited computation ability. For a given compute level controller, one which is only "dumb" has to worry about 4 things: wear leveling, bad block remapping, and LBA->physical block mapping, and actual I/O transfer (i.e. managing data flow from the host to the NAND chips). A smart controller also has to worry about page alignment, page modification and rewriting, potentially RAID-like checksumming/parity, page/block fragmentation, and other things. So, if the compute amount is fixed, a dumb controller is going to be able to handle a /whole/ lot more I/O transfer than a smart controller. Which means, for the same level of I/O transfer, a dumb controller costs less than a smart controller.> I certainly agree that there seems to be some redundancy when > the flash SSD controller does a logging-file-system kind of work > under zfs that does pretty much that by itself, and it could > possibly be better to cut one of them (and not zfs). > I am still not convinced that it won''t be better to do this > in a good controller instead just for speed and to take advantage > of new hardware that does this smarter than the devices of today. > > Do you know how the F5100 works for example? > > /ragge >The point I''m making here is that the filesystem/OS can make all the same decisions that a good SSD controller can make, faster (as it has most of the data in local RAM or register already), and with a global system viewpoint that the SSD simply can''t have. Most importantly, it''s essentially free for the OS to do so - it has the spare cycles and bandwidth to do so. Putting this intelligence on the SSD costs money that is essentially wasted, not to mention being less efficient overall. -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (GMT-0800)
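As a toy model of the read-modify-write cost described above (whether real controllers actually work at this page granularity is disputed later in the thread), here is a short Python sketch comparing the in-drive R-M-W path with a host-side copy-on-write of a full block. The sizes are the illustrative ones from the example (4 KB blocks, 128 KB pages, a 100 KB file), not figures for any particular drive.

    # Toy cost model for a 1-byte logical update, using the sizes from the
    # example above (4 KB blocks, 32 blocks per 128 KB page, a 100 KB file).
    # It follows the page-granular R-M-W description in the message; it is
    # not a claim about how any particular drive behaves.
    BLOCK = 4 * 1024
    PAGE = 32 * BLOCK                     # 128 KB
    LIVE_BLOCKS = 25                      # the 100 KB of existing file data

    # Non-COW filesystem, the controller does the work ("smart" SSD):
    rmw_read = LIVE_BLOCKS * BLOCK        # pull the old data into controller DRAM
    rmw_write = PAGE                      # rewrite a whole page somewhere else
    print(f"in-drive R-M-W : read {rmw_read} B, program {rmw_write} B, ~3 operations")

    # COW filesystem and a "dumb" controller: the host ships one freshly
    # built block, already page-sized and page-aligned.
    print(f"host-side COW  : read 0 B, program {PAGE} B, ~1 operation")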
David Magda wrote:> On Jan 2, 2010, at 16:49, Erik Trimble wrote: > >> My argument is that the OS has a far better view of the whole data >> picture, and access to much higher performing caches (i.e. >> RAM/registers) than the SSD, so not only can the OS make far better >> decisions about the data and how (and how much of) it should be >> stored, but it''s almost certainly to be able to do so far faster than >> any little SSD controller can do. > > Though one advantage of doing it with-in the disk is that you''re not > using up bus bandwidth. Probably not that big of a deal, but worth > mentioning for completeness / fairness.This is true. But, also in fairness, this is /already/ being used by the COW nature of ZFS. Changing one bit in a file causes the /entire/ ZFS block containing that bit to be re-written. So I''m not really using much (if any) more bus bandwidth by doing the SSD page layout in the OS rather than in the SSD controller. Remember that I''m highly likely not to have to read anything from the SSD to do the page rewrite, as the data I want is already in the L2ARC. -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (GMT-0800)
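For completeness, a rough look at the bus-traffic side of that trade-off. The record size, update rate and link speed below are assumptions picked only to show the orders of magnitude involved.

    # Bytes crossing the host<->device bus for a stream of tiny logical updates.
    RECORDSIZE = 128 * 1024               # assumed ZFS recordsize
    UPDATES_PER_SEC = 1000                # hypothetical small-update workload

    cow_bus = UPDATES_PER_SEC * RECORDSIZE    # ZFS ships whole records either way
    print(f"COW record writes: {cow_bus / 1e6:.0f} MB/s of bus traffic")
    # Laying the pages out in the host adds little on top of this, because the
    # full record already crosses the bus; a SATA 3 Gb/s link (~300 MB/s usable)
    # still has headroom.  What is saved is the read-back inside the drive.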
On 3 jan 2010, at 04.19, Erik Trimble wrote:> Ragnar Sundblad wrote: >> On 2 jan 2010, at 22.49, Erik Trimble wrote: >> >> >>> Ragnar Sundblad wrote: >>> >>>> On 2 jan 2010, at 13.10, Erik Trimble wrote >>>> >>>>> Joerg Schilling wrote: >>>>> the TRIM command is what is intended for an OS to notify the SSD as to which blocks are deleted/erased, so the SSD''s internal free list can be updated (that is, it allows formerly-in-use blocks to be moved to the free list). This is necessary since only the OS has the information to determine which previous-written-to blocks are actually no longer in-use. >>>>> >>>>> See the parallel discussion here titled "preview of new SSD based on SandForce controller" for more about "smart" vs "dumb" SSD controllers. >>>>> >>>>> From ZFS''s standpoint, the optimal configuration would be for the SSD to inform ZFS as to it''s PAGE size, and ZFS would use this as the fundamental BLOCK size for that device (i.e. all writes are in integer multiples of the SSD page size). Reads could be in smaller sections, though. Which would be interesting: ZFS would write in Page Size increments, and read in Block Size amounts. >>>>> >>>> Well, this could be useful if updates are larger than the block size, for example 512 K, as it is then possible to erase and rewrite without having to copy around other data from the page. If updates are smaller, zfs will have to reclaim erased space by itself, which if I am not mistaken it can not do today (but probably will in some future, I guess the BP Rewrite is what is needed). >>>> >>> Sure, it does that today. What do you think happens on a standard COW action? Let''s be clear here: I''m talking about exactly the same thing that currently happens when you modify a ZFS "block" that spans multiple vdevs (say, in a RAIDZ). The entire ZFS block is read from disk/L2ARC, the modifications made, then it is written back to storage, likely in another LBA. The original ZFS block location ON THE VDEV is now available for re-use (i.e. the vdev adds it to it''s Free Block List). This is one of the things that leads to ZFS''s fragmentation issues (note, we''re talking about block fragmentation on the vdev, not ZFS block fragmentation), and something that we''re looking to BP rewrite to enable defragging to be implemented. >>> >> >> What I am talking about is to be able to reuse the free space >> you get in the previously written data when you write modified >> data to new places on the disk, or just remove a file for that >> matter. To be able to reclaim that space with flash, you have >> to erase large pages (for example 512 KB), but before you erase, >> you will also have to save away all still valid data in that >> page and rewrite that to a free page. What I am saying is that >> I am not sure that this would be best done in the file system, >> since it could be quite a bit of data to shuffle around, and >> there could possibly be hardware specific optimizations that >> could be done here that zfs wouldn''t know about. A good flash >> controller could probably do it much better. (And a bad one >> worse, of course.) >> > You certainly DO get to reuse the free space again. Here''s what happens nowdays in an SSD: > > Let''s say I have 4k blocks, grouped into a 128k page. That is, the SSD''s fundamental minimum unit size is 4k, but the minimum WRITE size is 128k. Thus, 32 blocks in a page.Do you know of SSD disks that have a minimum write size of 128 KB? I don''t understand why it would be designed that way. 
A typical flash chip has pretty small write block sizes, like 2 KB or so, but they can only erase in pages of 128 KB or so. (And then you are running a few of those in parallel to get some speed, so these numbers often multiply with the number of parallel chips, like 4 or 8 or so.) Typically, you have to write the 2 KB blocks consecutively in a page. Pretty much all set up for an append-style system. :-) In addition, flash SSDs typically have some DRAM write buffer that buffers up writes (like a txg, if you will), so small writes should not be a problem - just collect a few and append!> So, I write a bit of data 100k in size. This occupies the first 25 blocks in the one page. The remaining 9 blocks are still one the SSD''s Free List (i.e. list of free space). > > Now, I want to change the last byte of the file, and add 10k more to the file. Currently, a non-COW filesystem will simply send the 1 byte modification request and the 10k addition to the SSD (all as one unit, if you are lucky - if not, it comes as two ops: 1 byte modification followed by a 10k append). The SSD now has to read all 25 blocks of the page back into it''s local cache on the controller, do the modification and append computing, then writes out 28 blocks to NAND. In all likelihood, if there is any extra pre-erased (or never written to) space on the drive, this 28 block write will go to a whole new page. The blocks in the original page will be moved over to the SSD Free List (and may or may not be actually erased, depending on the controller).Do you know for sure that you have SSD flash disks that work this way? It seems incredibly stupid. It would also use up the available erase cycles much faster than necessary. What write speed do you get?> For filesystems like ZFS, this is a whole lot of extra work being done that doesn''t need to happen (and, chews up valuable IOPS and time). For, when ZFS does a write, it doesn''t merely just twiddle the modified/appended bits - instead, it creates a whole new ZFS block to write. In essence, ZFS has already done all the work that the SSD controller is planning on doing. So why duplicate the effort? SSDs should simply notify ZFS about their block & page sizes, which would then allow ZFS to better align it''s own variable block size to optimally coincide with the SSD''s implementation. > > >> And as far as I know, zfs can not do that today - it can not >> move around already written data, not for defragmentation, not >> for adding or removing disks to stripes/raidz:s, not for >> deduping/duping and so on, and I have understood it as >> BP Rewrite could solve a lot of this. >> > ZFS''s propensity to fragmentation doesn''t mean you lose space. Rather, it means that COW often results in frequently-modified files being distributed over the entire media, rather than being contiguous. So, over time, the actual media has very little (if any) contiguous free space, which is what the fragmentation problem is. BP rewrite will indeed allow us to create a de-fragger. Areas which used to hold a ZFS block (now vacated by a COW to somewhere else) are simply added back to the device''s Free List. > Now, in SSD''s case, this isn''t a worry. Due to the completely even performance characteristics of NAND, it doesn''t make any difference if the physical layout of a file happens to be sections (e.g. 
ZFS blocks) scattered all over the SSD.Yes, there is something to worry about, as you can only erase flash in large pages - you can not erase them only where the free data blocks in the Free List are.> Access time is identical, and so is read time. SSD''s don''t care about this kind of fragmentation. > What SSD''s have to worry about is sub-page fragmentation. Which brings us back to the whole R-M-W mess.Yes, why R-M-W of entire pages for every change is a really bad implementation of a flash SSD.> Still, it could certainly be useful if zfs could try to use a >> blocksize that matches the SSD erase page size - this could >> avoid having to copy and compact data before erasing, which >> could speed up writes in a typical flash SSD disk. >> >> >>> In fact, I would argue that the biggest advantage of removing any advanced intelligence from the SSD controller is with small modifications to existing files. By using the L2ARC (and other features, like compression, encryption, and dedup), ZFS can composite the needed changes with an existing cached copy of the ZFS block(s) to be changed, then issue a full new block write to the SSD. This eliminates the need for the SSD to do the dreaded Read-Modify-Write cycle, and instead do just a Write. In this scenario, the ZFS block is likely larger than the SSD Page size, so more data will need to be written; however, given the highly parallel nature of SSDs, writing several SSD pages simultaneously is easy (and fast); let''s remember that a ZFS block is a maximum of only 8x the size of a SSD page, and writing 8 pages is only slightly more work than writing 1 page. This larger write is all a single IOP, where a R-M-W essentially requires 3 IOPS. If you want the SSD controller to do the work, then it ALWAYS has to read the to-be-modified page from NAND, do the mod itself, then issue the write - and, remember, ZFS likely has already issued a full ZFS-block write (due to the COW nature of ZFS, there is no concept of "just change this 1 bit and leave everything else on disk where it is"), so you likely don''t save on the number of pages that need to be written in any case. >>> >> >> I don''t think many SSDs do R-M-W, but rather just append blocks >> to free pages (pretty much as zfs works, if you will). They also >> have to do some space reclamation (copying/compacting blocks and >> erasing pages) in the background, of course. > >> > MLC-based SSDs all do R-M-W. Now, they might not do Read-Modify-Erase-Write right away: But they''ll do R-M-W on ANY write which modifies existing data (unless you are extremely lucky and your data fully fills an existing page): the difference is that the final W is to previous-unused NAND page(s). However, when the SSD runs out of never-used space, it starts to have to add the E step on future writes. > > So far as I know, no SSD does space reclamation in the manner you refer to. That is, the SSD controller isn''t going to be moving data around on its own, with the exception of wear-leveling. TRIM is there so that the SSD can add stuff to it''s internal Free List more efficiently, but an SSD isn''t going (on its own) say: "Ooh, page 1004 has only 5 of 10 blocks used, so why don''t we merge it with page 20054, which has only 3 of 10 blocks used."(I don''t think they typically merge pages, I believe they rather just pick pages with some freed blocks, copies the active blocks to the "end" of the disk, and erases the page.) 
Well, the algorithms are often trade secrets, and if what you say is correct, and it was my product, then I wouldn''t even want to tell anyone about it, since it would be a horrible waste of both bandwidth and erase cycles. Using up the 10000 erase cycles of a MLC device 64 times faster than necessary seems like an extremely bad idea. But there sure is a lot of crap out there, I can''t say you are wrong (only hope :-). I doubt for example the F5100 works that way, it would be hard to get ~15000 4KB w/s per "flash-SODIMM" if it behaved like that (you typically can erase only 500-1000 pages a second, for example). I doubt the Intel X25 works that way, as their read performance suffers if they are written with smaller blocks and get internally fragmented - that problem could not exist if they always filled complete new pages in a R-M-W manner.>>> I am still not entirely convinced that it would be better to let the file system take care of that instead of a flash controller, there could be quite a lot of reading and writing going on for space reclamation (depending on the work load, of course). >>> >>>> /ragge >>>> >>> The point here is that regardless of the workload, there''s a R-M-W cycle that has to happen, whether that occurs at the ZFS level or at the SSD level. My argument is that the OS has a far better view of the whole data picture, and access to much higher performing caches (i.e. RAM/registers) than the SSD, so not only can the OS make far better decisions about the data and how (and how much of) it should be stored, but it''s almost certainly to be able to do so far faster than any little SSD controller can do. >> >> Well, inside the flash system you could possibly have a much >> better situation to shuffle data around for space reclamation - >> that is copying and compacting data and erasing flash pages. >> If the device has a good design, that is! If the SSD controller >> is some small slow sad thing it might be better to shuffle it up >> and down to the host and do it in the CPU, but I am not sure >> about that either since it typically is the very same slow >> controller that does the host communication. >> > It''s actually far more likely that a dumb SSD controller can handle high levels of pure data transfer faster than a smart SSD controller can actually manipulate that same data quickly. SSD controllers, by their very nature, need to be as small and cheap as possible, which means they have extremely limited computation ability. For a given compute level controller, one which is only "dumb" has to worry about 4 things: wear leveling, bad block remapping, and LBA->physical block mapping, and actual I/O transfer (i.e. managing data flow from the host to the NAND chips). A smart controller also has to worry about page alignment, page modification and rewriting, potentially RAID-like checksumming/parity, page/block fragmentation, and other things. So, if the compute amount is fixed, a dumb controller is going to be able to handle a /whole/ lot more I/O transfer than a smart controller. Which means, for the same level of I/O transfer, a dumb controller costs less than a smart controller.I am not convinced the compute amount needs to be fixed, or even that they by their nature need to be as cheap as possible - if that hurts performance. People are obviously willing to pay quite a lot to get high perf disk systems. The best flash SSDs out there are quite expensive. 
In addition the number of transistors per area (and monetary unit) tend to increase with time (that intel guy had some saying about that... :-).> I certainly agree that there seems to be some redundancy when >> the flash SSD controller does a logging-file-system kind of work >> under zfs that does pretty much that by itself, and it could >> possibly be better to cut one of them (and not zfs). >> I am still not convinced that it won''t be better to do this >> in a good controller instead just for speed and to take advantage >> of new hardware that does this smarter than the devices of today. >> >> Do you know how the F5100 works for example? >> >> /ragge >> > The point I''m making here is that the filesystem/OS can make all the same decisions that a good SSD controller can make, faster (as it has most of the data in local RAM or register already), and with a global system viewpoint that the SSD simply can''t have. Most importantly, it''s essentially free for the OS to do so - it has the spare cycles and bandwidth to do so. Putting this intelligence on the SSD costs money that is essentially wasted, not to mention being less efficient overall.I have not done the math here, but to me it isn''t obvious that the OS has spare cycles and bandwidth to do it, since space reclaiming (compacting and erasing) could potentially draw much more bandwidth than the actual workload, and since people have had problem already with to few spare cycles on the X4500 if they want it to do something more than only being a filer (and I guess is where there now is a X4550). The filesystem/OS will most probably *not* have most of the data in local ram when reclaiming space/compacting memory, it will most likely have to read it in to write it out again. /ragge
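A commonly used back-of-the-envelope model for that reclamation bandwidth, wherever it runs (host or controller): if the erase block chosen for cleaning is still a fraction u full of live data, then every byte of new writes drags roughly u/(1-u) bytes of copying along with it. A tiny sketch, with u values picked only for illustration:

    # Cleaning cost for a log-structured flash layout: if the victim erase
    # block is a fraction u live, write amplification is roughly 1/(1-u).
    for u in (0.10, 0.50, 0.80, 0.90, 0.95):
        copies_per_byte = u / (1.0 - u)
        print(f"victim {u:4.0%} live: copy {copies_per_byte:5.2f} B per new B, "
              f"write amplification ~{1.0 / (1.0 - u):5.2f}x")

The numbers explode as the device fills up, which is why this argument cuts both ways: whoever does the cleaning, host or controller, pays this bill.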
On 3 jan 2010, at 06.07, Ragnar Sundblad wrote:
> (I don't think they typically merge pages, I believe they rather
> just pick pages with some freed blocks, copies the active blocks
> to the "end" of the disk, and erases the page.)

(And of course you implement wear leveling with the same mechanism - when the wear differs too much, pick a page with low wear and copy it to a more worn page.)

I actually happened to stumble on an application note from Numonyx that describes the append-style SSD disk and space reclamation method I described, right here:

<http://www.numonyx.com/Documents/Application%20Notes/AN1821.pdf>

(No - I had not read this before writing my previous mail! :-)

To me, it seems also in this paper that it is common knowledge that this is how you should implement a flash SSD disk - if you don't do anything fancier, of course.

/ragge
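To make the append-and-reclaim scheme described in that app note concrete, here is a minimal toy simulation of such a layer in Python: logical writes always append to the block currently being filled, stale copies are simply marked dead, and a greedy cleaner evacuates the block with the fewest live pages before erasing it. The geometry, the 15% spare area and the random-overwrite workload are made up for illustration, and the survivors are assumed to be staged in controller RAM while their block is erased.

    import random

    PAGES_PER_BLOCK = 64          # pages per erase block (assumed 4 KB pages)
    BLOCKS = 256                  # a tiny 64 MB device, for speed
    SPARE = 0.15                  # assumed over-provisioning
    LOGICAL = int(BLOCKS * PAGES_PER_BLOCK * (1 - SPARE))

    block_of = {}                                   # logical page -> physical block
    members = [set() for _ in range(BLOCKS)]        # live logical pages per block
    free = list(range(1, BLOCKS))                   # erased blocks
    active, fill = 0, 0                             # block currently being appended to
    erases = copies = 0

    def place(lpn):
        """Append one logical page to the active block."""
        global fill
        block_of[lpn] = active
        members[active].add(lpn)
        fill += 1

    def open_new_block():
        """Take an erased block, or greedily clean the deadest block first."""
        global active, fill, erases, copies
        if free:
            active, fill = free.pop(), 0
            return
        # no erased blocks left (free is empty), so garbage-collect one
        victim = min((b for b in range(BLOCKS) if b != active),
                     key=lambda b: len(members[b]))
        survivors = list(members[victim])           # still-live pages must move
        copies += len(survivors)
        erases += 1
        members[victim].clear()
        active, fill = victim, 0                    # reuse the just-erased block
        for lpn in survivors:                       # staged in RAM in this toy model
            place(lpn)

    def host_write(lpn):
        if fill == PAGES_PER_BLOCK:
            open_new_block()
        old = block_of.get(lpn)
        if old is not None:                         # invalidate the previous copy
            members[old].discard(lpn)
        place(lpn)

    random.seed(1)
    for lpn in range(LOGICAL):                      # fill the device once...
        host_write(lpn)
    OVERWRITES = 200000
    for _ in range(OVERWRITES):                     # ...then overwrite at random
        host_write(random.randrange(LOGICAL))

    total = LOGICAL + OVERWRITES
    print("host page writes   :", total)
    print("GC copies          :", copies)
    print("block erases       :", erases)
    print("write amplification: %.2f" % ((total + copies) / total))

The printed write amplification is what decides both the sustained write speed and how quickly the erase budget is used up, which is the trade-off being argued over in this thread.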
Ragnar Sundblad wrote:> On 3 jan 2010, at 04.19, Erik Trimble wrote: > >> Let''s say I have 4k blocks, grouped into a 128k page. That is, the SSD''s fundamental minimum unit size is 4k, but the minimum WRITE size is 128k. Thus, 32 blocks in a page. >> > Do you know of SSD disks that have a minimum write size of > 128 KB? I don''t understand why it would be designed that way. > > A typical flash chip has pretty small write block sizes, like > 2 KB or so, but they can only erase in pages of 128 KB or so. > (And then you are running a few of those in parallel to get some > speed, so these numbers often multiply with the number of > parallel chips, like 4 or 8 or so.) > Typically, you have to write the 2 KB blocks consecutively > in a page. Pretty much all set up for an append-style system. > :-) > > In addition, flash SSDs typically have some DRAM write buffer > that buffers up writes (like a txg, if you will), so small > writes should not be a problem - just collect a few and append! >In MLC-style SSDs, you typically have a block size of 2k or 4k. However, you have a Page size of several multiples of that, 128k being common, but by no means ubiquitous. I think you''re confusing erasing with writing. When I say "minimum write size", I mean that for an MLC, no matter how small you make a change, the minimum amount of data actually being written to the SSD is a full page (128k in my example). There is no "append" down at this level. If I have a page of 128k, with data in 5 of the 4k blocks, and I then want to add another 2k of data to this, I have to READ all 5 4k blocks into the controller''s DRAM, add the 2k of data to that, then write out the full amount to a new page (if available), or wait for a older page to be erased before writing to it. Thus, in this case, in order to do an actual 2k write, the SSD must first read 10k of data, do some compositing, then write 12k to a fresh page. Thus, to change any data inside a single page, then entire contents of that page have to be read, the page modified, then the entire page written back out.>> So, I write a bit of data 100k in size. This occupies the first 25 blocks in the one page. The remaining 9 blocks are still one the SSD''s Free List (i.e. list of free space). >> >> Now, I want to change the last byte of the file, and add 10k more to the file. Currently, a non-COW filesystem will simply send the 1 byte modification request and the 10k addition to the SSD (all as one unit, if you are lucky - if not, it comes as two ops: 1 byte modification followed by a 10k append). The SSD now has to read all 25 blocks of the page back into it''s local cache on the controller, do the modification and append computing, then writes out 28 blocks to NAND. In all likelihood, if there is any extra pre-erased (or never written to) space on the drive, this 28 block write will go to a whole new page. The blocks in the original page will be moved over to the SSD Free List (and may or may not be actually erased, depending on the controller). >> > > Do you know for sure that you have SSD flash disks that > work this way? It seems incredibly stupid. It would also > use up the available erase cycles much faster than necessary. > What write speed do you get? >What I''m describing is how ALL MLC-based SSDs work. 
SLC-based SSDs work differently, but still have problems with what I''ll call "excess-writing".>>> And as far as I know, zfs can not do that today - it can not >>> move around already written data, not for defragmentation, not >>> for adding or removing disks to stripes/raidz:s, not for >>> deduping/duping and so on, and I have understood it as >>> BP Rewrite could solve a lot of this. >>> >> ZFS''s propensity to fragmentation doesn''t mean you lose space. Rather, it means that COW often results in frequently-modified files being distributed over the entire media, rather than being contiguous. So, over time, the actual media has very little (if any) contiguous free space, which is what the fragmentation problem is. BP rewrite will indeed allow us to create a de-fragger. Areas which used to hold a ZFS block (now vacated by a COW to somewhere else) are simply added back to the device''s Free List. >> Now, in SSD''s case, this isn''t a worry. Due to the completely even performance characteristics of NAND, it doesn''t make any difference if the physical layout of a file happens to be sections (e.g. ZFS blocks) scattered all over the SSD. >> > > Yes, there is something to worry about, as you can only > erase flash in large pages - you can not erase them only where > the free data blocks in the Free List are. >I''m not sure that SSDs actually _have_ to erase - they just overwrite anything there with new data. But this is implementation dependent, so I can say how /all/ MLC SSDs behave.> (I don''t think they typically merge pages, I believe they rather > just pick pages with some freed blocks, copies the active blocks > to the "end" of the disk, and erases the page.) > > Well, the algorithms are often trade secrets, and if what you say > is correct, and it was my product, then I wouldn''t even want to > tell anyone about it, since it would be a horrible waste of both > bandwidth and erase cycles. Using up the 10000 erase cycles of > a MLC device 64 times faster than necessary seems like an > extremely bad idea. But there sure is a lot of crap out there, > I can''t say you are wrong (only hope :-). > > I doubt for example the F5100 works that way, it would be hard to > get ~15000 4KB w/s per "flash-SODIMM" if it behaved like that > (you typically can erase only 500-1000 pages a second, for > example). > I doubt the Intel X25 works that way, as their read performance > suffers if they are written with smaller blocks and get internally > fragmented - that problem could not exist if they always filled > complete new pages in a R-M-W manner. >Once again, what I''m talking about is a characteristic of MLC SSDs, which are used in most consumer SSDS (the Intel X25-M, included). Sure, such an SSD will commit any new writes to pages drawn from the list of "never before used" NAND. However, at some point, this list becomes empty. In most current MLC SSDs, there''s about 10% "extra" (a 60GB advertised capacity is actually ~54GB usable with 6-8GB "extra"). Once this list is empty, the SSD has to start writing back to previous used pages, which may require an erase step first before any write. Which is why MLC SSDs slow down drastically once they''ve been fulled to capacity several times.> I am not convinced the compute amount needs to be fixed, or > even that they by their nature need to be as cheap as possible - > if that hurts performance. People are obviously willing to pay > quite a lot to get high perf disk systems. The best flash SSDs > out there are quite expensive. 
In addition the number of > transistors per area (and monetary unit) tend to increase > with time (that intel guy had some saying about that... :-). >My point there is that if you build a controller for $X, that will get you Y compute ability. For a dumb controller, less of this Y ability is going to be used up by "housekeeping" functions for the SSD, and more thus being available to manage I/O, than for a smart controller. Put it another way: For a giving throughput performance of X, it will cost less to build a dumb controller than a smart controller. And, yes, price is a concern, even at the Enterprise level. Being able to build a dumb controller for 50% (or less) of the cost of a smart controller is likely to get you noticed by your consumers. Or at least by your accountant, since your profit for the SSD will be higher.> I have not done the math here, but to me it isn''t obvious that > the OS has spare cycles and bandwidth to do it, since space > reclaiming (compacting and erasing) could potentially draw much > more bandwidth than the actual workload, and since people have > had problem already with to few spare cycles on the X4500 > if they want it to do something more than only being a > filer (and I guess is where there now is a X4550). > The filesystem/OS will most probably *not* have most of the > data in local ram when reclaiming space/compacting memory, > it will most likely have to read it in to write it out again. > > /ragge >The whole point behind ZFS is that CPU cycles are cheap and available, much more so than dedicated hardware of any sort. What I''m arguing here is that the controller on an SSD is in the same boat as a dedicated RAID HBA - in the latter case, use a cheap HBA instead and let the CPU & ZFS do the work, while in the former case, use a "dumb" controller for the SSD instead of a smart one. I''m pretty sure compacting doesn''t occur in ANY SSDs without any OS intervention (that is, the SSD itself doesn''t do it), and I''d be surprised to see an OS try to implement some sort of intra-page compaction - there benefit doesn''t seem to be there; it''s better just to optimize writes than try to compact existing pages. As far as reclaiming unused space, the TRIM command is there to allow the SSD to mark a page Free for reuse, and an SSD isn''t going to be erasing a page unless it''s right before something is to be written to that page. The X4500 was specifically designed to be a filer. It has more than enough CPU cycles to deal with pretty much all workloads it gets in that area - in fact, the major problem with the X4500 is insufficient response time of the SATA drives, which slows throughput. Sure you can run other things on it, but it''s really not designed for heavy-duty extra workloads - it''s a disk server, not a compute server. I''ve run compressed zvols on it, and have no problem saturating the 4x1Gbit interfaces while still not pegging both CPUs. I''d imaging that it starts to run into problems with multiple 10Gbit Ethernet interfaces, but that''s to be expected. Bus bandwidth isn''t really a concern, with SSDs using either SATA 3G or SAS 3G right now, and soon SATA 6G or SAS 12G in the near future. Likewise, system bus isn''t much of an immediate concern, as pretty much all SAS/SATA controllers use an 8x PCI-E attachment, for no more than 8 devices (SAS controllers which support more than 8 devices almost always have several 8x attachments). 
And, as I pointed out in another message, doing it my way doesn''t increase bus traffic that much over what is being done now, in any case. -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (GMT-0800)
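To put some rough numbers on the bandwidth question raised earlier in the thread, here is a quick sanity check of the links mentioned above against a hypothetical workload plus host-driven compaction traffic; the workload rate and the amplification factor are assumptions, and the link figures are nominal.

    # Nominal link budgets against a hypothetical workload plus host-driven
    # compaction traffic (all numbers are assumptions, before protocol overhead).
    GBIT = 1e9 / 8                            # bytes per second per Gbit/s
    links = {
        "SATA/SAS 3 Gb/s": 3 * GBIT * 0.8,    # ~80% usable after 8b/10b coding
        "4 x 1 GbE": 4 * GBIT,
        "PCIe 1.x x8": 8 * 250e6,
    }
    workload = 120e6                          # assumed client write rate, B/s
    amplification = 3.0                       # assumed: GC copies 2 B per new B
    need = workload * amplification

    for name, cap in links.items():
        verdict = "fits" if cap > need else "tight"
        print(f"{name:>16}: {cap / 1e6:5.0f} MB/s capacity, "
              f"{verdict} for {need / 1e6:.0f} MB/s of writes+copies")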
Erik Trimble wrote:
> Ragnar Sundblad wrote:
>> Yes, there is something to worry about, as you can only
>> erase flash in large pages - you can not erase them only where
>> the free data blocks in the Free List are.
> I'm not sure that SSDs actually _have_ to erase - they just overwrite
> anything there with new data. But this is implementation dependent, so
> I can say how /all/ MLC SSDs behave.

I meant to say that I DON'T know how all MLC drives deal with erasure.

>> (I don't think they typically merge pages, I believe they rather
>> just pick pages with some freed blocks, copies the active blocks
>> to the "end" of the disk, and erases the page.)

That is correct, as your pointer to the Numonyx doc explains.

> I'm pretty sure compacting doesn't occur in ANY SSDs without any OS
> intervention (that is, the SSD itself doesn't do it), and I'd be
> surprised to see an OS try to implement some sort of intra-page
> compaction - the benefit doesn't seem to be there; it's better just
> to optimize writes than try to compact existing pages. As far as
> reclaiming unused space, the TRIM command is there to allow the SSD to
> mark a page Free for reuse, and an SSD isn't going to be erasing a
> page unless it's right before something is to be written to that page.

My notion of what "compacting" meant doesn't match how the term is generally used in the SSD technical papers, so in this respect I'm wrong: compacting does occur, but only when there are no fully erased (or unused) pages available. Thus, compacting is done in the context of a write operation.

-- 
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)
On Sat, Jan 2 at 22:24, Erik Trimble wrote:>In MLC-style SSDs, you typically have a block size of 2k or 4k. >However, you have a Page size of several multiples of that, 128k >being common, but by no means ubiquitous.I believe your terminology is crossed a bit. What you call a block is usually called a sector, and what you call a page is known as a block. Sector is (usually) the unit of reading from the NAND flash. The unit of write in NAND flash is the page, typically 2k or 4k depending on NAND generation, and thus consisting of 4-8 ATA sectors (typically). A single page may be written at a time. I believe some vendors support partial-page programming as well, allowing a single sector "append" type operation where the previous write left off. Ordered pages are collected into the unit of erase, which is known as a block (or "erase block"), and is anywhere from 128KB to 512KB or more, depending again on NAND generation, manufacturer, and a bunch of other things. Some large number of blocks are grouped by chip enables, often 4K or 8K blocks.>I think you''re confusing erasing with writing. > >When I say "minimum write size", I mean that for an MLC, no matter >how small you make a change, the minimum amount of data actually >being written to the SSD is a full page (128k in my example). TherePage is the unit of write, but it''s much smaller in all NAND I am aware of.>is no "append" down at this level. If I have a page of 128k, with >data in 5 of the 4k blocks, and I then want to add another 2k of data >to this, I have to READ all 5 4k blocks into the controller''s DRAM, >add the 2k of data to that, then write out the full amount to a new >page (if available), or wait for a older page to be erased before >writing to it. Thus, in this case, in order to do an actual 2k >write, the SSD must first read 10k of data, do some compositing, then >write 12k to a fresh page. > >Thus, to change any data inside a single page, then entire contents >of that page have to be read, the page modified, then the entire page >written back out.See above.>What I''m describing is how ALL MLC-based SSDs work. SLC-based SSDs >work differently, but still have problems with what I''ll call >"excess-writing".I think you''re only describing dumb SSDs with erase-block granularity mapping. Most (all) vendors have moved away from that technique since random write performance is awful in those designs and they fall over dead from wAmp in a jiffy. SLC and MLC NAND is similar, and they are read/written/erased almost identically by the controller.>I''m not sure that SSDs actually _have_ to erase - they just overwrite >anything there with new data. But this is implementation dependent, >so I can say how /all/ MLC SSDs behave.Technically you can program the same NAND page repeatedly, but since bits can only transition from 1->0 on a program operation, the result wouldn''t be very meaningful. An erase sets all the bits in the block to 1, allowing you to store your data.>Once again, what I''m talking about is a characteristic of MLC SSDs, >which are used in most consumer SSDS (the Intel X25-M, included). > >Sure, such an SSD will commit any new writes to pages drawn from the >list of "never before used" NAND. However, at some point, this list >becomes empty. In most current MLC SSDs, there''s about 10% "extra" >(a 60GB advertised capacity is actually ~54GB usable with 6-8GB >"extra"). Once this list is empty, the SSD has to start writing >back to previous used pages, which may require an erase step first >before any write. 
Which is why MLC SSDs slow down drastically once >they''ve been fulled to capacity several times.From what I''ve seen, erasing a block typically takes a time in the same scale as programming an MLC page, meaning in flash with large page counts per block, the % of time spent erasing is not very large. Lets say that an erase took 100ms and a program took 10ms, in an MLC NAND device with 100 pages per block. In this design, it takes us 1s to program the entire block, but only 1/10 of the time to erase it. An infinitely fast erase would only make the design about 10% faster. For SLC the erase performance matters more since page writes are much faster on average and there are half as many pages, but we were talking MLC. The performance differences seen is because they were artificially fast to begin with because they were empty. It''s similar to destroking a rotating drive in many ways to speed seek times. Once the drive is full, it all comes down to raw NAND performance, controller design, reserve/extra area (or TRIM) and algorithmic quality. --eric -- Eric D. Mudama edmudama at mail.bounceswoosh.org
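Plugging Eric's hypothetical numbers into a two-line calculation makes the point explicit:

    # Eric's hypothetical numbers: 100 pages per block, 10 ms per MLC page
    # program, 100 ms per block erase.
    PAGES_PER_BLOCK = 100
    T_PROGRAM_MS = 10.0
    T_ERASE_MS = 100.0

    program_total = PAGES_PER_BLOCK * T_PROGRAM_MS
    cycle = program_total + T_ERASE_MS
    print(f"programming a full block: {program_total:.0f} ms, erasing it: {T_ERASE_MS:.0f} ms")
    print(f"erase share of the whole cycle: {T_ERASE_MS / cycle:.0%}")
    # Even an instantaneous erase would shave only ~9-10% off the cycle,
    # which is the point: MLC throughput is dominated by programming.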
Yet another way to thin out the backing devices for a zpool on a thin-provisioned storage host, today: resilver. If your zpool has some redundancy across the SAN backing LUNs, simply drop and replace one at a time and allow zfs to resilver only the blocks currently in use onto the replacement LUN.

-- 
Dan.
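A rough outline of how that one-LUN-at-a-time swap could be scripted, as a Python sketch. The pool name and device names are placeholders, the replacement LUNs are assumed to be freshly provisioned thin devices of at least the same size, and the wait loop just greps the text of 'zpool status', so treat it as a sketch of the procedure rather than a tested tool.

    import subprocess
    import time

    POOL = "tank"                        # placeholder pool name
    OLD_LUNS = ["c4t0d0", "c4t1d0"]      # placeholder "fat" backing LUNs
    NEW_LUNS = ["c5t0d0", "c5t1d0"]      # placeholder freshly provisioned thin LUNs

    def resilver_running(pool):
        out = subprocess.run(["zpool", "status", pool],
                             capture_output=True, text=True).stdout
        return "resilver in progress" in out       # crude scrape of status text

    for old, new in zip(OLD_LUNS, NEW_LUNS):
        # Replace one backing LUN; only blocks actually in use are resilvered.
        subprocess.run(["zpool", "replace", POOL, old, new], check=True)
        while resilver_running(POOL):
            time.sleep(60)
        print(f"{old} -> {new} done; the old LUN can now be reclaimed on the array")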
Eric D. Midama did a very good job answering this, and I don''t have much to add. Thanks Eric! On 3 jan 2010, at 07.24, Erik Trimble wrote:> I think you''re confusing erasing with writing.I am now quite certain that it actually was you who were confusing those. I hope this discussion has cleared things up a little though.> What I''m describing is how ALL MLC-based SSDs work. SLC-based SSDs work differently, but still have problems with what I''ll call "excess-writing".Eric already said it, but I need to say this myself too: SLC and MLC disks could be almost identical, only the storing of the bits in the flash chips differs a little (1 or 2 bits per storage cell). There is absolutely no other fundamental difference between the two. Hopefully no modern MLC *or* SLC disk works as you described, since it is a horrible design, and selling it would be close to robbery. It would be slow and it would wear out quite fast. Now, SLC disks are typically better overall, because those who want to pay for SLC flash typically also want to pay for better controllers, but otherwise those issues are really orthogonal.> I''m not sure that SSDs actually _have_ to erase - they just overwrite anything there with new data. But this is implementation dependent, so I can say how /all/ MLC SSDs behave.As Eric said - yes you have to erase, otherwise you can''t write new data. It is not implementation dependent, it is inherent in the flash technology. And, as has been said several times now, erasing can only be done in large chunks, but writing can be done in small chunks. I''d say that this is the main problem to handle when creating a good flash SSD.> The whole point behind ZFS is that CPU cycles are cheap and available, much more so than dedicated hardware of any sort. What I''m arguing here is that the controller on an SSD is in the same boat as a dedicated RAID HBA - in the latter case, use a cheap HBA instead and let the CPU & ZFS do the work, while in the former case, use a "dumb" controller for the SSD instead of a smart one.This could be true, I am still not sure. My main issues with this is that it would make the file system code dependent of a special hardware behavior (that of todays flash chips), and that it could be quite a lot of data to shuffle around when compacting. But we''ll see. If it could be cheap enough, it could absolutely happen and be worth it even if it has some drawbacks.> And, as I pointed out in another message, doing it my way doesn''t increase bus traffic that much over what is being done now, in any case.Yes, it would increase bus traffic, if you would handle flash the compacting in the host - which you have to with your idea - it could be many times the real workload bandwidth. But it could still be worth it, that is quite possible. --------- On 3 jan 2010, at 07.43, Erik Trimble wrote:> I meant to say that I DON''T know how all MLC drives deal with erasure.Again - yes they do. (Or they would be write-once only. :-)>> I''m pretty sure compacting doesn''t occur in ANY SSDs without any OS intervention (that is, the SSD itself doesn''t do it), and I''d be surprised to see an OS try to implement some sort of intra-page compaction - there benefit doesn''t seem to be there; it''s better just to optimize writes than try to compact existing pages. As far as reclaiming unused space, the TRIM command is there to allow the SSD to mark a page Free for reuse, and an SSD isn''t going to be erasing a page unless it''s right before something is to be written to that page. 
> My thinking of what compacting meant doesn''t match up with what I''m seeing general usage in the SSD technical papers is, so in this respect, I''m wrong: compacting does occur, but only when there are no fully erased (or unused) pages available. Thus, compacting is done in the context of a write operation.Exactly what and when it is that triggers compacting is another issue, and that could probably change with firmware revisions. It is wise to do it earlier than when you get that write that didn''t fit, since if you have some erased space you can then take burts of writes up to that size quickly. But compacting takes bandwidth from the flash chips and wears them out, so you don''t want to do it to early and to much. I guess this could be an interesting optimization problem, and optimal behavior probably depends on the workload too. Maybe it should be an adjustable knob. --------- On 3 jan 2010, at 10.57, Eric D. Mudama wrote:> On Sat, Jan 2 at 22:24, Erik Trimble wrote: >> In MLC-style SSDs, you typically have a block size of 2k or 4k. However, you have a Page size of several multiples of that, 128k being common, but by no means ubiquitous. > > I believe your terminology is crossed a bit. What you call a block is > usually called a sector, and what you call a page is known as a block. > > Sector is (usually) the unit of reading from the NAND flash.... Indeed, and I am partly guilty to that mess, but I didn''t want do change terminology in the middle of the discussion just to make it more flash-y. Maybe a mistake. :-) --------- Now, *my* view of how a typical, modern flash SSD works is as an appendable cyclic log. You can append blocks to it, but no two blocks can have the same address (the new block would mask away the old one), and there is a maximum address (dependent of the size of the disk), so the log has a maximum length. This has, in my head, some resemblance to the txg appending zfs does. On the inside, the flash SSD can''t just rewrite new blocks to any free space because of the the way erasing works on large chunks, "erase blocks" in the flash chips of today. Therefore, it has to internally take "erase blocks" with freed space in it and move all active blocks to the end of the log to save them and compact them. It can then erase the "erase block", and reuse that area for new pages. This activity competes with the normal disk activities. There are of course other issues two, like wear leveling, bad block handling and stuff. /ragge
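One way to picture that "adjustable knob" is a pair of watermarks on the pool of erased blocks: start background cleaning when the reserve drops below a low mark, stop once a high mark is rebuilt. A tiny policy sketch, with arbitrary thresholds:

    # Background-cleaning policy sketch: hysteresis on the number of erased
    # blocks held in reserve (thresholds are arbitrary).
    LOW_WATER = 8          # below this, cleaning must run
    HIGH_WATER = 32        # once reached, cleaning can stop again

    def should_clean(erased_blocks, cleaning_now):
        if erased_blocks < LOW_WATER:
            return True                    # no headroom left for write bursts
        if cleaning_now and erased_blocks < HIGH_WATER:
            return True                    # keep going until the reserve is rebuilt
        return False                       # idle: save wear and flash bandwidth

    # A controller (or a hypothetical host-side layer) would call this between
    # operations and run one evacuate-and-erase step whenever it returns True.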
>>>>> "dm" == David Magda <dmagda at ee.ryerson.ca> writes:

dm> 4096 - to-512 blocks

aiui NAND flash has a minimum write size (determined by ECC OOB bits) of 2 - 4kB, and a minimum erase size that's much larger. Remapping cannot abstract away the performance implication of the minimum write size if you are doing a series of synchronous writes smaller than the minimum size on a device with no battery/capacitor, although using a DRAM+supercap prebuffer might be able to abstract away some of it.
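A quick count of what that coalescing buys for a stream of small synchronous writes, assuming a 4 KB minimum program size (the sizes and counts are illustrative):

    # Page programs needed for a burst of small synchronous writes, with and
    # without a DRAM+supercap buffer that can coalesce them (assumed 4 KB
    # minimum program size; counts are illustrative).
    NAND_PAGE = 4096
    SYNC_WRITE = 512                       # e.g. one small intent-log record
    WRITES = 10000

    uncoalesced = WRITES                                         # one program each
    coalesced = (WRITES * SYNC_WRITE + NAND_PAGE - 1) // NAND_PAGE
    print(f"without coalescing: {uncoalesced} page programs")
    print(f"with coalescing   : {coalesced} page programs "
          f"(~{uncoalesced // coalesced}x fewer)")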
As a further update, I went back and re-read my SSD controller info, and then did some more Googling. Turns out, I'm about a year behind on State-of-the-SSD. Eric is correct on the way current SSDs implement writes (both SLC and MLC), so I'm issuing a mea culpa here. The change in implementation appears to occur sometime shortly after the introduction of the Indilinx controllers. My fault for not catching this.

-Erik

-- 
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)
>>>>> "ah" == Al Hopper <al at logical-approach.com> writes:

ah> The main issue is that most flash devices support 128k byte
ah> pages, and the smallest "chunk" (for want of a better word) of
ah> flash memory that can be written is a page - or 128kb. So if
ah> you have a write to an SSD that only changes 1 byte in one 512
ah> byte "disk" sector, the SSD controller has to either
ah> read/re-write the affected page or figure out how to update
ah> the flash memory with the minimum affect on flash wear.

yeah well, I'm not sure it matters, but that's untrue. there are two sizes for NAND flash, the minimum write size and the minimum erase size. The minimum write size is the size over which error correction is done, the unit at which inband and OOB data is interleaved, on NAND flash. The minimum erase size is just what it sounds, the size the cleaner/garbage collector must evacuate.

The minimum write size is I suppose likely to provoke read/modify/write and wasting of write and wear bandwidth for smaller writes in flashes which do not have a DRAM+supercap, if you ask to SYNCHRONIZE CACHE right after the write. If there is a supercap, or if you allow the drive to do write caching, then the smaller write could be coalesced making this size irrelevant. I think it's usually 2 - 4 kB. I would expect resistance to growing it larger than 4kB because of NTFS---electrical engineers are usually over-obsessed with Windows.

The minimum erase size you don't really care about at all. That's the one that's usually at least 128kB.

ah> For anyone who is interested in getting more details of the
ah> challenges with flash memory, when used to build solid state
ah> drives, reading the tech data sheets on the flash memory
ah> devices will give you a feel for the basic issues that must be
ah> solved.

and the linux-mtd list will give you a feel for how people are solving them, because that's the only place I know of where NAND filesystem work is going on in the open. There are a bunch of geezers saying ``I wrote one for BSD but my employer won't let me release it,'' and then the new crop of intel/sandforce/stec proprietary kids, but in the open world AFAIK there is just yaffs and ubifs. The tmobile G1 is yaffs.

ah> Bobs point is well made. The specifics of a given SSD
ah> implementation will make the performance characteristics of
ah> the resulting SSD very difficult to predict or even describe -

I'm really a fan of the idea of using ACARD ANS-9010 for a slog. It's basically all DRAM+battery, and uses a low performance CF card for durable storage if the battery starts to run low, or if you explicitly request it (to move data between ACARD units by moving the CF card maybe). It will even make non-ECC RAM into ECC storage (using a sector size and OOB data :). It seems like Zeus-like performance at 1/10th the price, but of course it's a little goofy, and I've never tried it. slog is where I'd expect the high synchronous workload to be, so this is where there are small writes that can't be coalesced, I would presume, and appropriate slog sizes are reachable with DRAM alone.
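On sizing such a DRAM-backed slog: a common rule of thumb is that it only needs to absorb a few transaction groups' worth of synchronous writes, since the data is written again to the main pool when the txg commits. A back-of-the-envelope sketch; the throughput and txg interval are assumptions to check against your own workload and ZFS release:

    # Back-of-the-envelope slog sizing: hold a few txg intervals of sync writes.
    SYNC_MB_PER_S = 150        # assumed peak synchronous write rate
    TXG_SECONDS = 30           # assumed worst-case txg commit interval
    TXG_HELD = 2               # keep a couple of txg's worth in hand

    slog_gib = SYNC_MB_PER_S * TXG_SECONDS * TXG_HELD / 1024.0
    print(f"suggested slog size: ~{slog_gib:.1f} GiB")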
>>>>> "et" == Erik Trimble <Erik.Trimble at Sun.COM> writes:

et> Probably, the smart thing to push for is inclusion of some new
et> command in the ATA standard (in a manner like TRIM). Likely
et> something that would return both native Block and Page sizes
et> upon query.

that would be the *sane* thing to do. The *smart* thing to do would be write a quick test to determine the apparent page size by performance-testing write-flush-write-flush-write-flush with various write sizes and finding the knee that indicates the smallest size at which read-before-write has stopped. The test could happen in 'zpool create' and have its result written into the vdev label.

Inventing ATA commands takes too long to propagate through the technosphere, and the EE's always implement them wrongly: for example, a device with SDRAM + supercap should probably report 512 byte sectors because the algorithm for copying from SDRAM to NAND is subject to change and none of your business, but EE's are not good with language and will try to apelike match up the paragraph in the spec with the disorganized thoughts in their head, fit pegs into holes, and will end up giving you the NAND page size without really understanding why you wanted it other than that some standard they can't control demands it. They may not even understand why their devices are faster and slower---they are probably just hurling shit against an NTFS and shipping whatever runs some testsuite fastest---so doing the empirical test is the only way to document what you really care about in a way that will make it across the language and cultural barriers between people who argue about javascript vs python and ones that argue about Agilent vs LeCroy. Within the proprietary wall of these flash filesystem companies the testsuites are probably worth as much as the filesystem code, and here without the wall an open-source statistical test is worth more than a haggled standard.

Remember the ``removeable'' bit in USB sticks and the mess that both software and hardware made out of it. (hot-swappable SATA drives are ``non-removeable'' and don't need rmformat while USB/firewire do? yeah, sorry, u fail abstraction. and USB drives have the ``removable medium'' bit set when the medium and the controller are inseparable, it's the _controller_ that's removeable? ya sorry u fail reading English.) If you can get an answer by testing, DO IT, and evolve the test to match products on the market as necessary. This promises to be a lot more resilient than the track record with bullshit ATA commands and will work with old devices too. By the time you iron out your standard we will be using optonanocyberflash instead: that's what happened with the removeable bit and r/w optical storage. BTW let me know when read/write UDF 2.0 on dvd+r is ready---the standard was only announced twelve years ago, thanks.
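A very rough sketch of the kind of probe described above: time synchronous writes of increasing size against a raw device and look for the size at which the cost per write stops being flat (the knee). The device path is a placeholder, the test overwrites whatever is on it, and a real version would have to control alignment, caching and queueing much more carefully; it also assumes the platform exposes O_DSYNC through the os module.

    import os
    import time

    DEV = "/dev/rdsk/c1t2d0s0"       # placeholder raw device -- contents destroyed!
    TRIALS = 64

    def avg_latency(fd, size):
        buf = os.urandom(size)
        start = time.time()
        for i in range(TRIALS):
            os.pwrite(fd, buf, i * size)     # O_DSYNC makes each write synchronous
        return (time.time() - start) / TRIALS

    fd = os.open(DEV, os.O_WRONLY | os.O_DSYNC)
    try:
        for size in (512, 1024, 2048, 4096, 8192, 16384, 32768, 65536):
            lat = avg_latency(fd, size)
            print(f"{size:6d} B: {lat * 1e6:9.1f} us per write")
    finally:
        os.close(fd)
    # Read the output by eye: while writes are smaller than the device's
    # internal page the cost per write tends to stay roughly flat (every
    # write still pays for a whole page); past that size it starts growing
    # with the write size.  The knee is the figure to feed back into the
    # filesystem's allocation size.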