Hello all,

While revising my home NAS which had dedup enabled before I gathered that its RAM capacity was too puny for the task, I found that there is some deduplication among the data bits I uploaded there (makes sense, since it holds backups of many of the computers I've worked on - some of my homedirs' contents were bound to intersect). However, a lot of the blocks are in fact "unique" - have entries in the DDT with count=1 and the blkptr_t bit set. In fact they are not deduped, and with my pouring of backups complete - they are unlikely to ever become deduped.

Thus these many unique "deduped" blocks are just a burden when my system writes into the datasets with dedup enabled, when it walks the superfluously large DDT, when it has to store this DDT on disk and in ARC, maybe during scrubbing... These entries bring lots of headache (or performance degradation) for zero gain.

So I thought it would be a nice feature to let ZFS go over the DDT (I won't care if it requires offlining/exporting the pool) and evict the entries with count==1, as well as locate the corresponding block-pointer tree entries on disk and clear the dedup bits, turning such blocks into regular unique ones. This would require rewriting metadata (a smaller DDT, new block pointers) but should not touch or reallocate the already-saved userdata (the blocks' contents) on disk. The new BP without the dedup bit set would have the same contents in its other fields (though its parents would of course have to change more - new DVAs, new checksums...).

In the end my pool would only track as deduped those blocks which do already have two or more references - which, given the "static" nature of such a backup box, should be enough (i.e. new full backups of the same source data would remain deduped and use no extra space, while unique data won't waste the resources being accounted as deduped).

What do you think?
//Jim
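For anyone wanting to gauge how much of their DDT consists of such unique entries before debating the feature, zdb can already report the reference-count histogram. A rough usage sketch, assuming a pool named "tank":

    # DDT summary: entry counts plus in-core and on-disk entry sizes
    zdb -D tank

    # DDT histogram bucketed by reference count; the "refcnt 1" bucket
    # is exactly the population of never-deduped blocks Jim describes
    zdb -DD tank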
I've wanted a system where dedup applies only to blocks being written that have a good chance of being dups of others.

I think one way to do this would be to keep a scalable Bloom filter (on disk) into which one inserts block hashes.

To decide if a block needs dedup one would first check the Bloom filter, then if the block is in it, use the dedup code path, else the non-dedup codepath and insert the block in the Bloom filter. This means that the filesystem would store *two* copies of any deduplicatious block, with one of those not being in the DDT.

This would allow most writes of non-duplicate blocks to be faster than normal dedup writes, but still slower than normal non-dedup writes: the Bloom filter will add some cost.

The nice thing about this is that Bloom filters can be sized to fit in main memory, and will be much smaller than the DDT.

It's very likely that this is a bit too obvious to just work.

Of course, it is easier to just use flash. It's also easier to just not dedup: the most highly deduplicatious data (VM images) is relatively easy to manage using clones and snapshots, to a point anyways.

Nico
--
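A minimal sketch of the check-then-insert logic Nico describes - my own illustration rather than anything resembling ZFS code. It treats the four 64-bit words of a block's SHA-256 checksum as the k hash functions (reasonable, since the output is already uniformly distributed); the filter size is an arbitrary assumption:

    #include <stdint.h>

    #define BF_BITS (1ULL << 30)    /* 2^30 bits = 128 MB; size to taste */

    typedef struct bloom {
            uint8_t *bits;          /* BF_BITS/8 bytes, allocated at pool open */
    } bloom_t;

    static int
    bf_test_bit(const bloom_t *bf, uint64_t h)
    {
            h &= (BF_BITS - 1);
            return ((bf->bits[h >> 3] >> (h & 7)) & 1);
    }

    static void
    bf_set_bit(bloom_t *bf, uint64_t h)
    {
            h &= (BF_BITS - 1);
            bf->bits[h >> 3] |= (uint8_t)(1 << (h & 7));
    }

    /*
     * Returns 1 if the checksum was probably seen before (take the dedup
     * write path), 0 if it is definitely new (take the plain write path,
     * and remember it so the next copy of the same block triggers dedup).
     */
    int
    bf_check_and_insert(bloom_t *bf, const uint64_t cksum[4])
    {
            int i, hit = 1;

            for (i = 0; i < 4; i++)
                    hit &= bf_test_bit(bf, cksum[i]);
            if (!hit) {
                    for (i = 0; i < 4; i++)
                            bf_set_bit(bf, cksum[i]);
            }
            return (hit);
    }

This is also why, as Nico notes, the filesystem ends up storing two copies of any deduplicatious block: the first copy never sees the DDT, only the second and later copies take the dedup path.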
bloom filters are a great fit for this :-)
 -- richard

On Jan 19, 2013, at 5:59 PM, Nico Williams <nico at cryptonector.com> wrote:

> I've wanted a system where dedup applies only to blocks being written
> that have a good chance of being dups of others.
>
> I think one way to do this would be to keep a scalable Bloom filter
> (on disk) into which one inserts block hashes.
>
> To decide if a block needs dedup one would first check the Bloom
> filter, then if the block is in it, use the dedup code path, else the
> non-dedup codepath and insert the block in the Bloom filter. This
> means that the filesystem would store *two* copies of any
> deduplicatious block, with one of those not being in the DDT.
>
> This would allow most writes of non-duplicate blocks to be faster than
> normal dedup writes, but still slower than normal non-dedup writes:
> the Bloom filter will add some cost.
>
> The nice thing about this is that Bloom filters can be sized to fit in
> main memory, and will be much smaller than the DDT.
>
> It's very likely that this is a bit too obvious to just work.
>
> Of course, it is easier to just use flash. It's also easier to just
> not dedup: the most highly deduplicatious data (VM images) is
> relatively easy to manage using clones and snapshots, to a point
> anyways.
>
> Nico
Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
2013-Jan-20 16:02 UTC
[zfs-discuss] RFE: Un-dedup for unique blocks
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Nico Williams
>
> I've wanted a system where dedup applies only to blocks being written
> that have a good chance of being dups of others.
>
> I think one way to do this would be to keep a scalable Bloom filter
> (on disk) into which one inserts block hashes.
>
> To decide if a block needs dedup one would first check the Bloom
> filter, then if the block is in it, use the dedup code path,

How is this different or better than the existing dedup architecture? If you found that some block about to be written in fact matches the hash of an existing block on disk, then you've already determined it's a duplicate block, exactly as you would if you had dedup enabled. In that situation, gosh, it sure would be nice to have the extra information like reference count, and pointer to the duplicate block, which exists in the dedup table. In other words, exactly the way existing dedup is already architected.

> The nice thing about this is that Bloom filters can be sized to fit in
> main memory, and will be much smaller than the DDT.

If you're storing all the hashes of all the blocks, how is that going to be smaller than the DDT storing all the hashes of all the blocks?
So ... The way things presently are, ideally you would know in advance what stuff you were planning to write that has duplicate copies. You could enable dedup, then write all the stuff that's highly duplicated, then turn off dedup and write all the non-duplicate stuff. Obviously, however, this is a fairly implausible actual scenario.

In reality, while you're writing, you're going to have duplicate blocks mixed in with your non-duplicate blocks, which fundamentally means the system needs to be calculating the cksums and entering them into the DDT, even for the unique blocks... just because the first time the system sees each duplicate block, it doesn't yet know that it's going to be duplicated later.

But as you said, after data is written and sits around for a while, the probability of duplicating unique blocks diminishes over time. So they're just a burden.

I would think the ideal situation would be to take your idea of un-dedup for unique blocks, and take it a step further: un-dedup unique blocks that are older than some configurable threshold. Maybe you could have a command for a sysadmin to run, to scan the whole pool performing this operation, but it's the kind of maintenance that really should be done upon access, too. Somebody goes back and reads a jpg from last year, the system reads it and consequently loads the DDT entry, discovers that it's unique and has been for a long time, so throw out the DDT info.

But, by talking about it, we're just smoking pipe dreams. Cuz we all know zfs is developmentally challenged now. But one can dream...

finglonger
Bloom filters are very small, that's the difference. You might only need a few bits per block for a Bloom filter. Compare that to the size of a DDT entry. A Bloom filter could be cached entirely in main memory.
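Rough numbers to make the comparison concrete (my own back-of-envelope figures: ~320 bytes per in-core DDT entry is the commonly quoted estimate, and ~10 bits per element gives a Bloom filter roughly a 1% false-positive rate):

    1 TB of data at 128K recordsize        ~= 8 million blocks
    DDT held in ARC:  8M x ~320 bytes      ~= 2.5 GB
    Bloom filter:     8M x ~10 bits        ~= 10 MB

So the filter is a couple of hundred times smaller than the in-core DDT, at the price of occasionally sending a genuinely unique block down the dedup path.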
Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
2013-Jan-20 18:29 UTC
[zfs-discuss] RFE: Un-dedup for unique blocks
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Nico Williams
>
> To decide if a block needs dedup one would first check the Bloom
> filter, then if the block is in it, use the dedup code path, else the
> non-dedup codepath and insert the block in the Bloom filter.

Sorry, I didn't know what a Bloom filter was before I replied before - now I've read the wikipedia article and am consequently an expert. *sic* ;-)

It sounds like, what you're describing... The first time some data gets written, it will not produce a hit in the Bloom filter, so it will get written to disk without dedup. But now it has an entry in the Bloom filter. So the second time the data block gets written (the first duplicate) it will produce a hit in the Bloom filter, and consequently get a dedup DDT entry. But since the system didn't dedup the first one, it means the second one still needs to be written to disk independently of the first one. So in effect, you'll always "miss" the first duplicated block write, but you'll successfully dedup n-1 duplicated blocks. Which is entirely reasonable, although not strictly optimal.

And sometimes you'll get a false positive out of the Bloom filter, so sometimes you'll be running the dedup code on blocks which are actually unique, but with some intelligently selected parameters such as Bloom table size, you can get this probability to be reasonably small, like less than 1%.

In the wikipedia article, they say you can't remove an entry from the Bloom filter table, which would over time cause a consistent increase of false positive probability (approaching 100% false positives) from the Bloom filter and consequently a high probability of dedup'ing blocks that are actually unique; but with even a minimal amount of thinking about it, I'm quite sure that's a solvable implementation detail. Instead of storing a single bit for each entry in the table, store a counter. Every time you create a new entry in the table, increment the different locations; every time you remove an entry from the table, decrement. Obviously a counter requires more bits than a bit, but it's a linear increase of size, exponential increase of utility, and within the implementation limits of available hardware. But there may be a more intelligent way of accomplishing the same goal. (Like I said, I've only thought about this minimally.)

Meh, well. Thanks for the interesting thought. For whatever it's worth.
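What Ed re-derives here is the standard "counting Bloom filter". A minimal sketch of that variant, under the same assumptions as the earlier sketch (four 64-bit checksum words as the hash functions, arbitrary filter size, not ZFS code):

    #include <stdint.h>

    #define CBF_SLOTS (1ULL << 27)  /* 2^27 one-byte counters = 128 MB */

    typedef struct cbloom {
            uint8_t *cnt;           /* CBF_SLOTS saturating 8-bit counters */
    } cbloom_t;

    static void
    cbf_insert(cbloom_t *bf, const uint64_t cksum[4])
    {
            int i;

            for (i = 0; i < 4; i++) {
                    uint64_t h = cksum[i] & (CBF_SLOTS - 1);

                    if (bf->cnt[h] != UINT8_MAX)    /* saturate, never wrap */
                            bf->cnt[h]++;
            }
    }

    /*
     * Called when a block is freed, so stale entries stop inflating the
     * false-positive rate -- the property a plain Bloom filter lacks.
     */
    static void
    cbf_remove(cbloom_t *bf, const uint64_t cksum[4])
    {
            int i;

            for (i = 0; i < 4; i++) {
                    uint64_t h = cksum[i] & (CBF_SLOTS - 1);

                    if (bf->cnt[h] > 0 && bf->cnt[h] != UINT8_MAX)
                            bf->cnt[h]--;           /* saturated slots stay put */
            }
    }

    /* Membership test: present only if every addressed counter is nonzero. */
    static int
    cbf_test(const cbloom_t *bf, const uint64_t cksum[4])
    {
            int i;

            for (i = 0; i < 4; i++) {
                    if (bf->cnt[cksum[i] & (CBF_SLOTS - 1)] == 0)
                            return (0);
            }
            return (1);
    }

In practice 4-bit counters are usually considered sufficient, which keeps the size penalty over a plain one-bit filter to 4x rather than 8x.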
On 19 January, 2013 - Jim Klimov sent me these 2,0K bytes:

> Hello all,
>
> While revising my home NAS which had dedup enabled before I gathered
> that its RAM capacity was too puny for the task, I found that there is
> some deduplication among the data bits I uploaded there (makes sense,
> since it holds backups of many of the computers I've worked on - some
> of my homedirs' contents were bound to intersect). However, a lot of
> the blocks are in fact "unique" - have entries in the DDT with count=1
> and the blkptr_t bit set. In fact they are not deduped, and with my
> pouring of backups complete - they are unlikely to ever become deduped.

Another RFE would be 'zfs dedup mypool/somefs', which would basically go through and do a one-shot dedup. Would be useful in various scenarios. Possibly go through the entire pool at once, to make dedups intra-datasets (like "the real thing").

/Tomas
--
Tomas Forsman, stric at acc.umu.se, http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se
On 2013-01-20 19:55, Tomas Forsman wrote:
> On 19 January, 2013 - Jim Klimov sent me these 2,0K bytes:
>
>> Hello all,
>>
>> While revising my home NAS which had dedup enabled before I gathered
>> that its RAM capacity was too puny for the task, I found that there is
>> some deduplication among the data bits I uploaded there (makes sense,
>> since it holds backups of many of the computers I've worked on - some
>> of my homedirs' contents were bound to intersect). However, a lot of
>> the blocks are in fact "unique" - have entries in the DDT with count=1
>> and the blkptr_t bit set. In fact they are not deduped, and with my
>> pouring of backups complete - they are unlikely to ever become deduped.
>
> Another RFE would be 'zfs dedup mypool/somefs' and basically go through
> and do a one-shot dedup. Would be useful in various scenarios. Possibly
> go through the entire pool at once, to make dedups intra-datasets (like
> "the real thing").

Yes, but that was asked before =)

Actually, the pool's metadata does contain all the needed bits (i.e. checksum and size of blocks) such that a scrub-like procedure could try and find same blocks among unique ones (perhaps with a filter of "this" block being referenced from a dataset that currently wants dedup), throw one out and add a DDT entry to another.

On 2013-01-20 17:16, Edward Harvey wrote:
> So ... The way things presently are, ideally you would know in
> advance what stuff you were planning to write that has duplicate
> copies. You could enable dedup, then write all the stuff that's
> highly duplicated, then turn off dedup and write all the
> non-duplicate stuff. Obviously, however, this is a fairly
> implausible actual scenario.

Well, I guess I could script a solution that uses ZDB to dump the block pointer tree (about 100GB of text on my system), and some perl or sort/uniq/grep parsing over this huge text to find blocks that are the same but not deduped - as well as those single-copy "deduped" ones - and toggle the dedup property while rewriting the block inside its parent file with dd. This would all be within current ZFS's capabilities and would ultimately reach the goals of deduping pre-existing data as well as dropping unique blocks from the DDT.

It would certainly not be a real-time solution (it might well take months on my box - just fetching the BP tree took a couple of days) and would require more resources than needed otherwise (rewrites of the same userdata, storing and parsing of addresses as text instead of binaries, etc.) But I do see how this is doable even today, even by a non-expert ;) (Not sure I'd ever get around to actually doing it this way, though - it is not a very "clean" solution nor a performant one.)

As a bonus, however, this ZDB dump would also provide an answer to a frequently-asked question: "which files on my system intersect or are the same - and have some/all blocks in common via dedup?" Knowledge of this answer might help admins with some policy decisions, be it a witch-hunt for hoarders of same files or some pattern-making to determine which datasets should keep "dedup=on"...

My few cents,
//Jim
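A rough sketch of the zdb-and-text-tools approach Jim outlines - heavily hedged, since the exact verbosity level and blkptr print format vary between zdb builds, and on a large pool this runs for days:

    # dump the dataset's object/blkptr tree; at high verbosity the
    # indirect blocks are printed with their "cksum=..." fields
    zdb -ddddd tank/backups > /var/tmp/bptree.txt

    # count how many blocks share each checksum
    grep -o 'cksum=[0-9a-fx:]*' /var/tmp/bptree.txt |
        sort | uniq -c | sort -rn > /var/tmp/cksum-counts.txt

    # count == 1 lines are the unique blocks that only bloat the DDT;
    # counts > 1 are candidates for re-writing with dedup=on
    awk '$1 > 1 { dups++ } END { print dups, "duplicated checksums" }' \
        /var/tmp/cksum-counts.txt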
On Jan 20, 2013, at 8:16 AM, Edward Harvey <imaginative at nedharvey.com> wrote:

> But, by talking about it, we're just smoking pipe dreams. Cuz we all know zfs is developmentally challenged now. But one can dream...

I disagree that ZFS is developmentally challenged. There is more development now than ever in every way: # of developers, companies, OSes, KLOCs, features. Perhaps the level of maturity makes progress appear to be moving slower than it did in early life?
 -- richard
On 2013-01-20 17:16, Edward Harvey wrote:
> But, by talking about it, we're just smoking pipe dreams. Cuz we all know zfs is developmentally challenged now. But one can dream...

I beg to disagree. While most of my contribution has so far been about learning stuff and sharing it with others, as well as planting some new ideas and (hopefully, seen as constructively) doubting others - including the implementation we have now - and I have yet to see someone pick up my ideas and turn them into code (or prove why they are rubbish) - overall I can't say that development has stagnated by any metric of stagnation or activity.

Yes, maybe there were more "cool new things" per year popping up with Sun's concentrated engineering talent and financing, but now it seems that most players - wherever they work now - took a pause from the marathon, to refine what was done in the decade before. And this is just as important as churning out innovations faster than people can comprehend or audit or use them.

As a loud example of present active development, take the LZ4 quests completed by Saso recently. From what I gather, this is a single man's job done "on-line" in the view of fellow list members over a few months, almost like a reality show; and I guess anyone with enough concentration, time and devotion could do likewise. I suspect many of my proposals to the list might also take some half of a man-year to complete.

Unfortunately for the community and for part of myself, I now have some higher daily priorities, so I likely won't sit down and code lots of stuff in the nearest years (until that Priority goes to school, or so). Maybe that's why I'm eager to suggest quests for brilliant coders here who can complete the job better and faster than I ever would ;) So I'm doing the next best things I can do to help the progress :)

And I don't believe this is in vain, that development has ceased and my writings are only destined to be "stuffed under the carpet". Be it these RFEs or some others, better and more useful, I believe they shall be coded and published in common ZFS code. Sometime...

//Jim
On Sun, Jan 20, 2013 at 6:19 PM, Richard Elling <richard.elling at gmail.com> wrote:

> On Jan 20, 2013, at 8:16 AM, Edward Harvey <imaginative at nedharvey.com> wrote:
> > But, by talking about it, we're just smoking pipe dreams. Cuz we all
> > know zfs is developmentally challenged now. But one can dream...
>
> I disagree that ZFS is developmentally challenged. There is more development
> now than ever in every way: # of developers, companies, OSes, KLOCs, features.
> Perhaps the level of maturity makes progress appear to be moving slower
> than it did in early life?
>
> -- richard

Well, perhaps a part of it is marketing. Maturity isn't really an excuse for not having a long-term feature roadmap. It seems as though "maturity" in this case equals stagnation. What are the features being worked on that we aren't aware of? The big ones that come to mind, that everyone else is talking about for not just ZFS but openindiana as a whole and other storage platforms, would be:

1. SMB3 - hyper-v WILL be gaining market share over the next couple of years; not supporting it means giving up a sizeable portion of the market. Not to mention finally being able to run SQL (again) and Exchange on a fileshare.
2. VAAI support.
3. The long-sought bp-rewrite.
4. Full-drive encryption support.
5. Tiering (although I'd argue caching is superior, it's still a checkbox).

There's obviously more, but those are just the ones off the top of my head that others are supporting/working on. Again, it just feels like all the work is going into fixing bugs and refining what is there, not adding new features. Obviously Saso personally added features, but overall there don't seem to be a ton of announcements to the list about features that have been added or are being actively worked on. It feels like all these companies are just adding niche functionality they need that may or may not be getting pushed back to mainline.

/debbie-downer
On Jan 20, 2013, at 4:51 PM, Tim Cook <tim at cook.ms> wrote:

> On Sun, Jan 20, 2013 at 6:19 PM, Richard Elling <richard.elling at gmail.com> wrote:
> > On Jan 20, 2013, at 8:16 AM, Edward Harvey <imaginative at nedharvey.com> wrote:
> > > But, by talking about it, we're just smoking pipe dreams. Cuz we all
> > > know zfs is developmentally challenged now. But one can dream...
> >
> > I disagree that ZFS is developmentally challenged. There is more development
> > now than ever in every way: # of developers, companies, OSes, KLOCs, features.
> > Perhaps the level of maturity makes progress appear to be moving slower
> > than it did in early life?
> >
> > -- richard
>
> Well, perhaps a part of it is marketing.

A lot of it is marketing :-/

> Maturity isn't really an excuse for not having a long-term feature
> roadmap. It seems as though "maturity" in this case equals stagnation.
> What are the features being worked on we aren't aware of?

Most of the illumos-centric discussion is on the developer's list. The ZFSonLinux and BSD communities are also quite active. Almost none of the ZFS developers hang out on this zfs-discuss at opensolaris.org anymore. In fact, I wonder why I'm still here...

> The big ones that come to mind that everyone else is talking about for
> not just ZFS but openindiana as a whole and other storage platforms
> would be:
> 1. SMB3 - hyper-v WILL be gaining market share over the next couple
> years, not supporting it means giving up a sizeable portion of the
> market. Not to mention finally being able to run SQL (again) and
> Exchange on a fileshare.

I know of at least one illumos community company working on this. However, I do not know their public plans.

> 2. VAAI support.

VAAI has 4 features, 3 of which have been in illumos for a long time. The remaining feature (SCSI UNMAP) was done by Nexenta and exists in their NexentaStor product, but the CEO made a conscious (and unpopular) decision to keep that code from the community. Over the summer, another developer picked up the work in the community, but I've lost track of the progress and haven't seen an RTI yet.

> 3. the long-sought bp-rewrite.

Go for it!

> 4. full drive encryption support.

This is a key management issue mostly. Unfortunately, the open source code for handling this (trousers) covers much more than keyed disks and can be unwieldy. I'm not sure which distros picked up trousers, but it doesn't belong in the illumos-gate and it doesn't expose itself to ZFS.

> 5. tiering (although I'd argue caching is superior, it's still a checkbox).

You want to add tiering to the OS? That has been available for a long time via the (defunct?) SAM-QFS project that actually delivered code:
http://hub.opensolaris.org/bin/view/Project+samqfs/
If you want to add it to ZFS, that is a different conversation.
 -- richard

> There's obviously more, but those are just ones off the top of my head
> that others are supporting/working on. Again, it just feels like all the
> work is going into fixing bugs and refining what is there, not adding
> new features. Obviously Saso personally added features, but overall
> there don't seem to be a ton of announcements to the list about features
> that have been added or are being actively worked on. It feels like all
> these companies are just adding niche functionality they need that may
> or may not be getting pushed back to mainline.
>
> /debbie-downer

--
Richard.Elling at RichardElling.com
+1-760-896-4422
Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
2013-Jan-21 13:28 UTC
[zfs-discuss] RFE: Un-dedup for unique blocks
> From: Richard Elling [mailto:richard.elling at gmail.com]
>
> I disagree that ZFS is developmentally challenged.

As an IT consultant, 8 years ago before I heard of ZFS, it was always easy to sell Ontap, as long as it fit into the budget. 5 years ago, whenever I told customers about ZFS, it was always a quick easy sell. Nowadays, anybody who's heard of it says they don't want it, because they believe it's a dying product, and they're putting their bets on linux instead. I try to convince them otherwise, but I'm trying to buck the word on the street. They don't listen, however much sense I make. I can only sell ZFS to customers nowadays, who have still never heard of it.

"Developmentally challenged" doesn't mean there is no development taking place. It means the largest development effort is working closed-source, and not available for free (except some purposes), so some consumers are going to follow their path, while others are going to follow the open source branch illumos path, which means both disunity amongst developers and disunity amongst consumers, and incompatibility amongst products. So far, in the illumos branch, I've only seen bugfixes introduced since zpool 28, no significant introduction of new features. (Unlike the oracle branch, which is just as easy to sell as ontap.) Which presents a challenge. Hence the term, "challenged."

Right now, ZFS is the leading product as far as I'm concerned. Better than MS VSS, better than Ontap, better than BTRFS. It is my personal opinion that one day BTRFS will eclipse ZFS due to oracle's unsupportive strategy causing disparity and lowering consumer demand for zfs, but of course, that's just a personal opinion prediction for the future, which has yet to be seen. So far, every time I evaluate BTRFS, it fails spectacularly, but the last time I did was about a year ago. I'm due for a BTRFS re-evaluation now.
Zfs on linux (ZOL) has made some pretty impressive strides over the last year or so...
On 01/21/2013 02:28 PM, Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) wrote:
>> From: Richard Elling [mailto:richard.elling at gmail.com]
>>
>> I disagree that ZFS is developmentally challenged.
>
> As an IT consultant, 8 years ago before I heard of ZFS, it was always easy
> to sell Ontap, as long as it fit into the budget. 5 years ago, whenever I
> told customers about ZFS, it was always a quick easy sell. Nowadays,
> anybody who's heard of it says they don't want it, because they believe
> it's a dying product, and they're putting their bets on linux instead. I
> try to convince them otherwise, but I'm trying to buck the word on the street.
> They don't listen, however much sense I make. I can only sell ZFS to
> customers nowadays, who have still never heard of it.

Yes, Oracle did some serious damage to ZFS' and its own reputation. My former employer used to be an almost exclusively Sun shop. The moment Oracle took over and decided to tank the products aimed at our segment, we waved our beloved Sun hardware goodbye. Larry has clearly delineated his marketing strategy: either you're a Fortune 500, or you can fuck right off.

> "Developmentally challenged" doesn't mean there is no development taking place.
> It means the largest development effort is working closed-source, and not
> available for free (except some purposes), so some consumers are going to
> follow their path,

I would contest that point. Besides encryption (which I think was already well underway by the time Oracle took over), AFAIK nothing much improved in Oracle ZFS. Oracle only considers Sun a vehicle to sell its software products on (DB, ERP, CRM, etc.). Anything that doesn't fit into that strategy (e.g. Thumper) got butchered and thrown to the side.

> while others are going to follow the open source branch illumos path, which
> means both disunity amongst developers and disunity amongst consumers, and
> incompatibility amongst products.

I can't talk about "disunity" among devs (how would that manifest itself?), but as far as incompatibility among products goes, I've yet to come across it. In fact, thanks to ZFS feature flags, different feature sets can coexist peacefully and give admins unprecedented control over their storage pools. Version control in ZFS used to be a "take it or leave it" approach; now you can selectively enable and use only the features you want to.

> So far, in the illumos branch, I've only seen bugfixes introduced since
> zpool 28, no significant introduction of new features.

I've had #3035 (LZ4 compression for ZFS and GRUB) integrated just a few days back and I've got #3137 (L2ARC compression) up for review as we speak. Once #3137 integrates, I'm looking to focus on multi-MB record sizes next, and then perhaps taking a long, hard look at reducing the in-memory DDT footprint.

> (Unlike the oracle branch, which is just as easy to sell as ontap).

Again, what significant features did they add besides encryption? I'm not saying they didn't, I'm just not aware of that many.

> Which presents a challenge. Hence the term, "challenged."

Agreed, it is a challenge and needs to be taken seriously. We are up against a lot of money and man-hours invested by big-name companies, so I fully agree there. We need to rally ourselves as a community and hold together tightly.

> Right now, ZFS is the leading product as far as I'm concerned. Better
> than MS VSS, better than Ontap, better than BTRFS. It is my personal
> opinion that one day BTRFS will eclipse ZFS due to oracle's unsupportive
> strategy causing disparity and lowering consumer demand for zfs, but of
> course, that's just a personal opinion prediction for the future, which
> has yet to be seen. So far, every time I evaluate BTRFS, it fails
> spectacularly, but the last time I did was about a year ago. I'm due
> for a BTRFS re-evaluation now.

Let us know at zfs at lists.illumos.org how that goes; perhaps write a blog post about your observations. I'm sure the BTRFS folks came up with some neat ideas which we might learn from.

Cheers,
--
Saso
Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
2013-Jan-22 02:56 UTC
[zfs-discuss] RFE: Un-dedup for unique blocks
> From: Sašo Kiselkov [mailto:skiselkov.ml at gmail.com]
>
> as far as incompatibility among products, I've yet to come
> across it

I was talking about ... install solaris 11, and it's using a new version of zfs that's incompatible with anything else out there. And vice-versa. (Not sure if feature flags is the default, or zpool 28 is the default, in various illumos-based distributions. But my understanding is that once you upgrade to feature flags, you can't go back to 28. Which means, mutually, anything >28 is incompatible with each other.) You have to typically make a conscious decision and plan ahead, and intentionally go to zpool 28 and no higher, if you want compatibility between systems.

> Let us know at zfs at lists.illumos.org how that goes, perhaps write a blog
> post about your observations. I'm sure the BTRFS folks came up with some
> neat ideas which we might learn from.

Actually - I've written about it before (but it'll be difficult to find, and nothing earth shattering, so not worth the search.) I don't think there's anything that zfs developers don't already know. Basic stuff like fsck, and the ability to shrink and remove devices - those are the things btrfs has and zfs doesn't. (But there's lots more stuff that zfs has and btrfs doesn't. Just making sure my previous comment isn't seen as a criticism of zfs, or a judgement in favor of btrfs.)

And even with a new evaluation, the conclusion can't be completely clear, nor immediate. Last evaluation started about 10 months ago, and we kept it in production for several weeks or a couple of months, because it appeared to be doing everything well. (Except for features that were known to be not yet implemented, such as read-only snapshots (aka quotas) and the btrfs equivalent of "zfs send".) Problem was, the system was unstable, crashing about once a week. No clues why. We tried all sorts of things in kernel, hardware, drivers, with and without support, to diagnose and capture the cause of the crashes. Then one day, I took a blind stab in the dark (for the ninetieth time) and I reformatted the storage volume ext4 instead of btrfs. After that, no more crashes. That was approx 8 months ago.

I think the only things I could learn upon a new evaluation are: #1 I hear "btrfs send" is implemented now. I'd like to see it with my own eyes before I believe it. #2 I hear quotas (read-only snapshots) are implemented now. Again, I'd like to see it before I believe it. #3 Proven stability. Never seen it yet with btrfs. Want to see it with my own eyes and stand the test of time before it earns my trust.
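For what it's worth, pinning a pool at the portable on-disk version is already possible at creation time; a quick sketch, with the pool name and device names made up:

    # create a pool that stays at the last pre-feature-flags version,
    # readable by Solaris 11, illumos, FreeBSD and ZFSonLinux alike
    zpool create -o version=28 tank mirror c0t0d0 c0t1d0

    # check what an existing pool is at and what an upgrade would enable
    zpool upgrade tank
    zpool upgrade -v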
On 01/22/2013 03:56 AM, Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) wrote:
>> From: Sašo Kiselkov [mailto:skiselkov.ml at gmail.com]
>>
>> as far as incompatibility among products, I've yet to come
>> across it
>
> I was talking about ... install solaris 11, and it's using a new version
> of zfs that's incompatible with anything else out there. And vice-versa.

Wait, you're complaining about a closed-source vendor who made a conscious effort to fuck the rest of the community over? I think you're crying on the wrong shoulder - it wasn't the open ZFS community that pulled this dick move. Yes, you can argue that the customer isn't interested in politics, but unfortunately, there are some things that we simply can't do anything about - the ball is in Oracle's court on this one.

> (Not sure if feature flags is the default, or zpool 28 is the default,
> in various illumos-based distributions. But my understanding is that
> once you upgrade to feature flags, you can't go back to 28. Which means,
> mutually, anything >28 is incompatible with each other.) You have to
> typically make a conscious decision and plan ahead, and intentionally go
> to zpool 28 and no higher, if you want compatibility between systems.

Yes, feature flags is the default, simply because it is a way for open ZFS vendors to interoperate. Oracle is an important player in ZFS for sure, but we can't let their unwillingness to cooperate with others hold the whole community in stasis - that is actually what they would have wanted.

>> Let us know at zfs at lists.illumos.org how that goes, perhaps write a blog
>> post about your observations. I'm sure the BTRFS folks came up with some
>> neat ideas which we might learn from.
>
> Actually - I've written about it before (but it'll be difficult to find,
> and nothing earth shattering, so not worth the search.) I don't think
> there's anything that zfs developers don't already know. Basic stuff like
> fsck, and ability to shrink and remove devices, those are the things btrfs
> has and zfs doesn't. (But there's lots more stuff that zfs has and btrfs
> doesn't. Just making sure my previous comment isn't seen as a criticism
> of zfs, or a judgement in favor of btrfs.)

Well, I learned of the LZ4 compression algorithm in a benchmark comparison of ZFS, BTRFS and other filesystems' compression. Seeing that there were better things out there, I decided to try and push the state of ZFS compression ahead a little.

> And even with a new evaluation, the conclusion can't be completely clear,
> nor immediate. Last evaluation started about 10 months ago, and we kept
> it in production for several weeks or a couple of months, because it
> appeared to be doing everything well. (Except for features that were known
> to be not yet implemented, such as read-only snapshots (aka quotas) and
> the btrfs equivalent of "zfs send".) Problem was, the system was unstable,
> crashing about once a week. No clues why. We tried all sorts of things
> in kernel, hardware, drivers, with and without support, to diagnose and
> capture the cause of the crashes. Then one day, I took a blind stab in the
> dark (for the ninetieth time) and I reformatted the storage volume ext4
> instead of btrfs. After that, no more crashes. That was approx 8 months ago.

Even negative results are results. I'm sure the BTRFS devs would be interested in your crash dumps. Not saying that you are in any way obligated to provide them - just pointing out that perhaps you were hitting some snag that could have been resolved (or not).

> I think the only things I could learn upon a new evaluation are: #1 I hear
> "btrfs send" is implemented now. I'd like to see it with my own eyes before
> I believe it. #2 I hear quotas (read-only snapshots) are implemented now.
> Again, I'd like to see it before I believe it. #3 Proven stability. Never
> seen it yet with btrfs. Want to see it with my own eyes and stand the test
> of time before it earns my trust.

Do not underestimate these guys. They could have come up with a cool new feature that we haven't heard anything about at all. One of the things knocking around in my head ever since it was mentioned a while back on these mailing lists is a metadata-caching device, i.e. a small yet super-fast device that would allow you to store just the pool topology for very fast scrub/resilver. These are the sorts of things that I mean - they could have thought about filesystems in ways that haven't been done widely before. While BTRFS may be developmentally behind ZFS, one still has to have great respect for the intellect of its developers - these guys are not dumb.

Cheers,
--
Saso
On 01/21/13 17:03, Sašo Kiselkov wrote:
> Again, what significant features did they add besides encryption? I'm
> not saying they didn't, I'm just not aware of that many.

Just a few examples:

Solaris ZFS already has support for 1MB block size.

Support for SCSI UNMAP - both issuing it and honoring it when it is the backing store of an iSCSI target.

It also has a lot of performance improvements and general bug fixes in the Solaris 11.1 release.

--
Darren J Moffat
On 22 January, 2013 - Darren J Moffat sent me these 0,6K bytes:

> On 01/21/13 17:03, Sašo Kiselkov wrote:
>> Again, what significant features did they add besides encryption? I'm
>> not saying they didn't, I'm just not aware of that many.
>
> Just a few examples:
>
> Solaris ZFS already has support for 1MB block size.
>
> Support for SCSI UNMAP - both issuing it and honoring it when it is the
> backing store of an iSCSI target.

Would this apply to, say, a SATA SSD used as ZIL? (which we have, a vertex2ex with supercap)

/Tomas
--
Tomas Forsman, stric at acc.umu.se, http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se
On 01/22/2013 12:30 PM, Darren J Moffat wrote:
> On 01/21/13 17:03, Sašo Kiselkov wrote:
>> Again, what significant features did they add besides encryption? I'm
>> not saying they didn't, I'm just not aware of that many.
>
> Just a few examples:
>
> Solaris ZFS already has support for 1MB block size.

Working on that as we speak. I'll see your 1MB and raise you another 7 :P

> Support for SCSI UNMAP - both issuing it and honoring it when it is the
> backing store of an iSCSI target.

AFAIK, the first isn't in Illumos' ZFS, while the latter one is (though I might be mistaken). In any case, interesting features.

> It also has a lot of performance improvements and general bug fixes in
> the Solaris 11.1 release.

Performance improvements such as?

Cheers,
--
Saso
On 01/22/13 11:57, Tomas Forsman wrote:
> On 22 January, 2013 - Darren J Moffat sent me these 0,6K bytes:
>
>> On 01/21/13 17:03, Sašo Kiselkov wrote:
>>> Again, what significant features did they add besides encryption? I'm
>>> not saying they didn't, I'm just not aware of that many.
>>
>> Just a few examples:
>>
>> Solaris ZFS already has support for 1MB block size.
>>
>> Support for SCSI UNMAP - both issuing it and honoring it when it is the
>> backing store of an iSCSI target.
>
> Would this apply to, say, a SATA SSD used as ZIL? (which we have, a
> vertex2ex with supercap)

If the device advertises the UNMAP feature and you are running Solaris 11.1, it should attempt to use it.

--
Darren J Moffat
Maybe 'shadow migration'? (e.g. zfs create -o shadow=nfs://server/dir pool/newfs)

Michel

> On 01/21/13 17:03, Sašo Kiselkov wrote:
>> Again, what significant features did they add besides encryption? I'm
>> not saying they didn't, I'm just not aware of that many.
>
> Just a few examples:
>
> Solaris ZFS already has support for 1MB block size.
>
> Support for SCSI UNMAP - both issuing it and honoring it when it is
> the backing store of an iSCSI target.
>
> It also has a lot of performance improvements and general bug fixes
> in the Solaris 11.1 release.
>
> --
> Darren J Moffat

Michel Jansens
mjansens at ulb.ac.be
On 01/22/13 13:20, Michel Jansens wrote:
>
> Maybe 'shadow migration'? (e.g. zfs create -o shadow=nfs://server/dir
> pool/newfs)

That isn't really a ZFS feature, since it happens at the VFS layer. The ZFS support there is really about getting the options passed through and checking status, but the core of the work happens at the VFS layer. Shadow migration works with UFS as well!

Since I'm replying, here are a few others that have been introduced in Solaris 11 or 11.1:

There is also the new improved ZFS share syntax for NFS and CIFS in Solaris 11.1, where you can much more easily inherit and also override individual share properties.

There are improved diagnostics rules.

ZFS support for Immutable Zones (mostly a VFS feature) & Extended (privilege) Policy, and aliasing of datasets in Zones (so you don't see the part of the dataset hierarchy above the bit delegated to the zone).

UEFI GPT label support for root pools, with GRUB2 and on SPARC with OBP.

New "sensitive" per-file flag.

Various ZIL and ARC performance improvements.

Preallocated ZVOLs - for swap/dump.

> Michel
>
>> On 01/21/13 17:03, Sašo Kiselkov wrote:
>>> Again, what significant features did they add besides encryption? I'm
>>> not saying they didn't, I'm just not aware of that many.
>>
>> Just a few examples:
>>
>> Solaris ZFS already has support for 1MB block size.
>>
>> Support for SCSI UNMAP - both issuing it and honoring it when it is
>> the backing store of an iSCSI target.
>>
>> It also has a lot of performance improvements and general bug fixes in
>> the Solaris 11.1 release.
>>
>> --
>> Darren J Moffat
>
> Michel Jansens
> mjansens at ulb.ac.be

--
Darren J Moffat
On 01/22/2013 02:20 PM, Michel Jansens wrote:
>
> Maybe 'shadow migration'? (e.g. zfs create -o shadow=nfs://server/dir
> pool/newfs)

Hm, interesting, so it works as a sort of replication system, except that the data needs to be read-only and you can start accessing it on the target before the initial sync. Did I get that right?

--
Saso
On 01/22/13 13:29, Sašo Kiselkov wrote:
> On 01/22/2013 02:20 PM, Michel Jansens wrote:
>>
>> Maybe 'shadow migration'? (e.g. zfs create -o shadow=nfs://server/dir
>> pool/newfs)
>
> Hm, interesting, so it works as a sort of replication system, except
> that the data needs to be read-only and you can start accessing it on
> the target before the initial sync. Did I get that right?

The source filesystem needs to be read-only. It works at the VFS layer, so it doesn't copy snapshots or clones over. Once mounted, it appears as if all the original data is instantly there. There is an (optional) shadowd that pushes the migration along, but it will complete on its own anyway. shadowstat(1M) gives information on the status of the migrations.

--
Darren J Moffat
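A sketch of the workflow as I understand it from the documentation - host, share and dataset names below are made up:

    # on the old server: the source must be shared read-only
    share -F nfs -o ro /export/olddata

    # on the Solaris 11.x box: create the shadowing dataset
    zfs create -o shadow=nfs://oldserver/export/olddata tank/newdata

    # tank/newdata is usable immediately; watch migration progress with
    shadowstat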
On 01/22/13 13:29, Darren J Moffat wrote:
> Since I'm replying, here are a few others that have been introduced in
> Solaris 11 or 11.1.

...and another one I can't believe I missed, since I was one of the people that helped design it and I did the code review...

Per-file sensitivity labels for TX (Trusted Extensions) configurations.

And I'm sure I'm still missing stuff that is in Solaris 11 and 11.1.

--
Darren J Moffat
On 01/22/2013 02:39 PM, Darren J Moffat wrote:
>
> On 01/22/13 13:29, Darren J Moffat wrote:
>> Since I'm replying, here are a few others that have been introduced in
>> Solaris 11 or 11.1.
>
> ...and another one I can't believe I missed, since I was one of the people
> that helped design it and I did the code review...
>
> Per-file sensitivity labels for TX (Trusted Extensions) configurations.

Can you give some details on that? Google searches are turning up pretty dry.

Cheers,
--
Saso
Casper.Dik at oracle.com
2013-Jan-22 14:30 UTC
[zfs-discuss] RFE: Un-dedup for unique blocks
> On 01/22/2013 02:39 PM, Darren J Moffat wrote:
>>
>> On 01/22/13 13:29, Darren J Moffat wrote:
>>> Since I'm replying, here are a few others that have been introduced in
>>> Solaris 11 or 11.1.
>>
>> ...and another one I can't believe I missed, since I was one of the people
>> that helped design it and I did the code review...
>>
>> Per-file sensitivity labels for TX (Trusted Extensions) configurations.
>
> Can you give some details on that? Google searches are turning up pretty dry.

Start here:
http://docs.oracle.com/cd/E26502_01/html/E29017/managefiles-1.html#scrolltoc

Look for "multilevel datasets".

Casper
On Mon, 21 Jan 2013, Jim Klimov wrote:
>
> Yes, maybe there were more "cool new things" per year popping up
> with Sun's concentrated engineering talent and financing, but now
> it seems that most players - wherever they work now - took a pause
> from the marathon, to refine what was done in the decade before.
> And this is just as important as churning out innovations faster
> than people can comprehend or audit or use them.

I am on most of the mailing lists where zfs is discussed and it is clear that significant issues/bugs are continually being discovered and fixed. Fixes come from both the Illumos community and from outside it (e.g. from FreeBSD). Zfs is already quite feature rich. Many of us would lobby for bug fixes and performance improvements over "features". Sašo Kiselkov's LZ4 compression additions may qualify as "features", yet they also offer rather profound performance improvements.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
2013-Jan-22 15:32 UTC
[zfs-discuss] RFE: Un-dedup for unique blocks
> From: Darren J Moffat [mailto:darrenm at opensolaris.org]
>
> Support for SCSI UNMAP - both issuing it and honoring it when it is the
> backing store of an iSCSI target.

When I search for SCSI UNMAP, I come up with all sorts of documentation that ... is ... like reading a medical journal when all you want to know is the conversion from 98.6F to C.

Would you mind momentarily describing what SCSI UNMAP is used for? If I were describing it to a customer (CEO, CFO), I'm not going to tell them about SCSI UNMAP; I'm going to say the new system has a new feature that enables ... or solves the ___ problem...

The customer doesn't *necessarily* have to be as clueless as a CEO/CFO. Perhaps just another IT person, or whatever.
On 01/22/2013 04:32 PM, Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) wrote:
>> From: Darren J Moffat [mailto:darrenm at opensolaris.org]
>>
>> Support for SCSI UNMAP - both issuing it and honoring it when it is the
>> backing store of an iSCSI target.
>
> When I search for SCSI UNMAP, I come up with all sorts of documentation
> that ... is ... like reading a medical journal when all you want to know
> is the conversion from 98.6F to C.
>
> Would you mind momentarily describing what SCSI UNMAP is used for? If I
> were describing it to a customer (CEO, CFO), I'm not going to tell them
> about SCSI UNMAP; I'm going to say the new system has a new feature that
> enables ... or solves the ___ problem...
>
> The customer doesn't *necessarily* have to be as clueless as a CEO/CFO.
> Perhaps just another IT person, or whatever.

SCSI UNMAP is a feature of the SCSI protocol that is used with SSDs to signal that a given data block is no longer in use by the filesystem and may be erased.

TL;DR: It makes writing to flash faster. Flash write latency degrades with time; this prevents it from happening. Keep in mind that this is only important for sync-write workloads (e.g. databases, NFS, etc.), not async-write workloads (file servers, bulk storage). For ZFS this is a win if you're using a flash-based slog (ZIL) device. You can entirely side-step this issue (and performance-sensitive applications often do) by placing the slog on a device not based on flash, e.g. DDRDrive X1, ZeusRAM, etc.

THE DETAILS: As you may know, flash memory cells, by design, cannot be overwritten. They can only be read (very fast), written when they are empty (called "programming", still quite fast) or erased (slow as hell). To implement overwriting, when a flash controller detects an attempt to overwrite an already programmed flash cell, it instead holds the write while it erases the block first (which takes a lot of time), and only then programs it with the new data.

Before SCSI UNMAP (also called TRIM in SATA) filesystems had no way of telling the underlying flash device that a given block of data had been freed (e.g. due to a user deleting a file). So sooner or later, a filesystem used up all the empty blocks on the flash device and essentially every write had to first erase some flash blocks to complete. This impacts synchronous I/O write latency (e.g. ZIL, sync database I/O, etc.).

With UNMAP, a filesystem can preemptively tell the flash controller that a given data block is no longer needed and the flash controller can, at its leisure, pre-erase it. Thus, as long as you have free space on your filesystem, most, if not all, of your writes will be direct program writes, not erase-then-program.

Cheers,
--
Saso
Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) wrote:
>> From: Darren J Moffat [mailto:darrenm at opensolaris.org]
>>
>> Support for SCSI UNMAP - both issuing it and honoring it when it is the
>> backing store of an iSCSI target.
>
> When I search for SCSI UNMAP, I come up with all sorts of documentation
> that ... is ... like reading a medical journal when all you want to know
> is the conversion from 98.6F to C.
>
> Would you mind momentarily describing what SCSI UNMAP is used for? If I
> were describing it to a customer (CEO, CFO), I'm not going to tell them
> about SCSI UNMAP; I'm going to say the new system has a new feature that
> enables ... or solves the ___ problem...
>
> The customer doesn't *necessarily* have to be as clueless as a CEO/CFO.
> Perhaps just another IT person, or whatever.

SCSI UNMAP (or SATA TRIM) is a means of telling a storage device that some blocks are no longer needed. (This might be because a file has been deleted in the filesystem on the device.)

In the case of a flash device, it can optimise usage by knowing this, e.g. it can perhaps perform a background erase on the real blocks so they're ready for reuse sooner, and/or better optimise wear leveling by having more spare space to play with. There are some devices in which this enables the device to improve its lifetime by performing better wear leveling when having more spare space. It can also help by avoiding some read-modify-write operations, if the device knows the data in the rest of the 4k block is no longer needed.

In the case of an iSCSI LUN target, these blocks no longer need to be archived, and if sparse space allocation is in use, the space they occupied can be freed off. In the particular case of ZFS provisioning the iSCSI LUN (COMSTAR), you might get performance improvements by having more free space to play with during other write operations, allowing better storage layout optimisation.

So, the bottom line is longer life for SSDs (maybe higher performance too, if there's less waiting for erases during writes), and better space utilisation and performance for a ZFS COMSTAR target.

--
Andrew Gabriel
On 01/22/13 15:32, Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) wrote:
>> From: Darren J Moffat [mailto:darrenm at opensolaris.org]
>>
>> Support for SCSI UNMAP - both issuing it and honoring it when it is the
>> backing store of an iSCSI target.
>
> Would you mind momentarily describing what SCSI UNMAP is used for? If I
> were describing it to a customer (CEO, CFO), I'm not going to tell them
> about SCSI UNMAP; I'm going to say the new system has a new feature that
> enables ... or solves the ___ problem...

It is a mechanism for the part of the storage system above the "disk" (e.g. ZFS) to inform the "disk" that it is no longer using a given set of blocks.

This is useful when using an SSD - see Saso's excellent response on that.

However, it can also be very useful when your "disk" is an iSCSI LUN. It allows the filesystem layer (e.g. ZFS or NTFS, etc.), when on an iSCSI LUN that advertises SCSI UNMAP, to tell the target that there are blocks in that LUN it isn't using any more (e.g. it just deleted some blocks). This means you can get more accurate space usage when using things like iSCSI.

ZFS in Solaris 11.1 issues SCSI UNMAP to devices that support it, and ZVOLs exported over COMSTAR advertise it too. In the iSCSI case it is mostly about improved space accounting and utilisation. This is particularly interesting with ZFS when snapshots and clones of ZVOLs come into play.

Some vendors call this (and things like it) "Thin Provisioning"; I'd say it is more "accurate communication between 'disk' and filesystem" about in-use blocks.

--
Darren J Moffat
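Purely illustrative - a sparse zvol exported over COMSTAR is the setup where this pays off; the volume name, size and LU GUID below are made up:

    # thin-provisioned (sparse) 500G volume backing an iSCSI LUN
    zfs create -s -V 500g tank/lun0

    # export it via COMSTAR; substitute the LU GUID that create-lu prints
    stmfadm create-lu /dev/zvol/rdsk/tank/lun0
    stmfadm add-view 600144F0...        # GUID printed by create-lu
    itadm create-target

    # with UNMAP flowing end to end, deletions on the initiator side
    # eventually show up as freed space under 'zfs list'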
Casper.Dik at oracle.com
2013-Jan-22 16:00 UTC
[zfs-discuss] RFE: Un-dedup for unique blocks
> Some vendors call this (and things like it) "Thin Provisioning"; I'd say
> it is more "accurate communication between 'disk' and filesystem" about
> in-use blocks.

In some cases, users of disks are charged by bytes in use; when not using SCSI UNMAP, a set of disks used for a zpool will in the end be charged for the whole reservation; this becomes costly when your standard usage is much less than your peak usage.

Thin provisioning can now be used for zpools as long as the underlying LUNs have support for SCSI UNMAP.

Casper
On 01/22/2013 05:00 PM, Casper.Dik at oracle.com wrote:
>> Some vendors call this (and things like it) "Thin Provisioning"; I'd say
>> it is more "accurate communication between 'disk' and filesystem" about
>> in-use blocks.
>
> In some cases, users of disks are charged by bytes in use; when not using
> SCSI UNMAP, a set of disks used for a zpool will in the end be charged for
> the whole reservation; this becomes costly when your standard usage is
> much less than your peak usage.
>
> Thin provisioning can now be used for zpools as long as the underlying
> LUNs have support for SCSI UNMAP.

Looks like an interesting technical solution to a political problem :D

Cheers,
--
Saso
On 01/22/13 16:02, Sašo Kiselkov wrote:
> On 01/22/2013 05:00 PM, Casper.Dik at oracle.com wrote:
>>> Some vendors call this (and things like it) "Thin Provisioning"; I'd say
>>> it is more "accurate communication between 'disk' and filesystem" about
>>> in-use blocks.
>>
>> In some cases, users of disks are charged by bytes in use; when not using
>> SCSI UNMAP, a set of disks used for a zpool will in the end be charged for
>> the whole reservation; this becomes costly when your standard usage is
>> much less than your peak usage.
>>
>> Thin provisioning can now be used for zpools as long as the underlying
>> LUNs have support for SCSI UNMAP.
>
> Looks like an interesting technical solution to a political problem :D

There is also a technical problem: if you can't inform the backing store that you no longer need the blocks, it can't free them either, so they get stuck in snapshots unnecessarily.

--
Darren J Moffat
On 01/22/2013 05:34 PM, Darren J Moffat wrote:
>
> On 01/22/13 16:02, Sašo Kiselkov wrote:
>> On 01/22/2013 05:00 PM, Casper.Dik at oracle.com wrote:
>>> Thin provisioning can now be used for zpools as long as the underlying
>>> LUNs have support for SCSI UNMAP.
>>
>> Looks like an interesting technical solution to a political problem :D
>
> There is also a technical problem: if you can't inform the backing store
> that you no longer need the blocks, it can't free them either, so they
> get stuck in snapshots unnecessarily.

Yes, I understand the technical merit of the solution. I'm just amused that a noticeable side effect is lower licensing costs (by which I don't of course mean that the issue is unimportant, just that I find it interesting what the world has come to) - I'm not trying to ridicule.

Cheers,
--
Saso
On 2013-01-22 14:29, Darren J Moffat wrote:
> Preallocated ZVOLs - for swap/dump.

Sounds like something I proposed on these lists, too ;)

Does this preallocation only mean filling an otherwise ordinary ZVOL with zeroes (or some other pattern) - and if so, to what effect? Or is it also supported to disable COW for such datasets, so that the preallocated swap/dump zvols might remain contiguous on the faster tracks of the drive (i.e. like a dedicated partition, but with benefits of ZFS checksums and maybe compression)?

Thanks,
//Jim
Darren J Moffat wrote:
> It is a mechanism for part of the storage system above the "disk" (eg
> ZFS) to inform the "disk" that it is no longer using a given set of
> blocks.
>
> This is useful when using an SSD - see Saso's excellent response on that.
>
> However it can also be very useful when your "disk" is an iSCSI LUN. It
> allows the filesystem layer (eg ZFS or NTFS, etc) when on an iSCSI LUN
> that advertises SCSI UNMAP to tell the target there are blocks in that
> LUN it isn't using any more (eg it just deleted some blocks).

That is something I have been waiting a long time for! I have to run a
periodic "fill the pool with zeros" cycle on a couple of iSCSI-backed
pools to reclaim free space.

I guess the big question is: do Oracle storage appliances advertise SCSI
UNMAP?

--
Ian.
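For illustration, a minimal sketch of the kind of zero-fill reclaim cycle
Ian describes, assuming the pool is mounted at /tank, compression is off
on that dataset, and the backing array reclaims zeroed regions (the path
and file name are made up for the example):

  # Fill the free space with zeros so the iSCSI backend can detect and
  # reclaim it; dd exits with "No space left on device" when the pool is
  # full, which is expected here.
  dd if=/dev/zero of=/tank/zerofill bs=1024k
  sync
  # Delete the filler file to give the space back to the pool.
  rm /tank/zerofill

With SCSI UNMAP advertised by the LUN, this whole cycle becomes
unnecessary, since ZFS can tell the target directly which blocks it has
freed.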
On 01/22/2013 10:45 PM, Jim Klimov wrote:
> On 2013-01-22 14:29, Darren J Moffat wrote:
>> Preallocated ZVOLs - for swap/dump.
>
> Or is it also supported to disable COW for such datasets, so that
> the preallocated swap/dump zvols might remain contiguous on the
> faster tracks of the drive (i.e. like a dedicated partition, but
> with benefits of ZFS checksums and maybe compression)?

I highly doubt it, as it breaks one of the fundamental design principles
behind ZFS (always maintain transactional consistency). Also,
contiguousness and compression are fundamentally at odds (contiguousness
requires each block to remain the same length regardless of contents,
compression varies block length depending on the entropy of the contents).

Cheers,
--
Saso
On 2013-01-22 23:03, Sašo Kiselkov wrote:
> On 01/22/2013 10:45 PM, Jim Klimov wrote:
>> On 2013-01-22 14:29, Darren J Moffat wrote:
>>> Preallocated ZVOLs - for swap/dump.
>>
>> Or is it also supported to disable COW for such datasets, so that
>> the preallocated swap/dump zvols might remain contiguous on the
>> faster tracks of the drive (i.e. like a dedicated partition, but
>> with benefits of ZFS checksums and maybe compression)?
>
> I highly doubt it, as it breaks one of the fundamental design principles
> behind ZFS (always maintain transactional consistency). Also,
> contiguousness and compression are fundamentally at odds (contiguousness
> requires each block to remain the same length regardless of contents,
> compression varies block length depending on the entropy of the contents).

Well, dump and swap devices are kind of special in that they need
verifiable storage (i.e. detectable to have no bit-errors) but not really
consistency as in sudden-power-off transaction protection. Both have a
lifetime span of a single system uptime - like L2ARC, for example - and
will be reused anew afterwards - after a reboot, a power surge, or a
kernel panic.

So while the metadata used to address the swap ZVOL contents may and
should be subject to common ZFS transactions and COW and so on, and jump
around the disk along with rewrites of blocks, the ZVOL userdata itself
may as well occupy the same positions on the disk, I think, rewriting
older stuff. With mirroring likely in place as well as checksums, there
are other ways than COW to ensure that the swap (or at least some
component thereof) contains what it should, even with intermittent errors
of some component devices.

Likewise, the swap/dump breed of zvols shouldn't really have snapshots,
especially not automatic ones (and the installer should take care of this
at least for the two zvols it creates) ;)

Compression for swap is an interesting matter... for example, how should
it be accounted? As dynamic expansion and/or shrinking of available swap
space (or just of the space needed to store it)? If the latter, and we
still intend to preallocate and guarantee that the swap has its
administratively predefined amount of gigabytes, compressed blocks can be
aligned on the same starting locations as if they were not compressed. In
effect this would just decrease the bandwidth requirements, maybe. For
dump this might be just a bulky compressed write from start to however
much it needs, within the preallocated psize limits...

//Jim
IIRC dump is special.

As for swap... really, you don't want to swap. If you're swapping you have
problems. Any swap space you have is to help you detect those problems and
correct them before apps start getting ENOMEM.

There *are* exceptions to this, such as Varnish. For Varnish and any other
apps like it I'd dedicate an entire flash drive to it, no ZFS, no nothing.

Nico
--
On 2013-01-22 23:32, Nico Williams wrote:
> IIRC dump is special.
>
> As for swap... really, you don't want to swap. If you're swapping you
> have problems. Any swap space you have is to help you detect those
> problems and correct them before apps start getting ENOMEM. There
> *are* exceptions to this, such as Varnish. For Varnish and any other
> apps like it I'd dedicate an entire flash drive to it, no ZFS, no
> nothing.

I know of this stance, and in general you're right. But... ;)

Sometimes there are once-in-a-longtime tasks that might require enormous
virtual memory that you wouldn't normally provision proper hardware for
(RAM, SSD), and/or cases when you have to run similarly greedy tasks on
hardware with limited specs (i.e. a home PC capped at 8GB RAM).

As an example I might think of a ZDB walk taking about 35-40GB of VM on my
box. This is not something I do every month, but when I do - I need it to
complete, regardless of the fact that I have 5 times less RAM on that box
(and the kernel's equivalent of that walk fails with scanrate hell because
it can't swap, btw).

On the other hand, there are tasks like VirtualBox which "require" swap to
be configured in amounts equivalent to the VM RAM size, but don't really
swap (most of the time). Setting aside SSDs for this task might be too
expensive if they are never to be used in real practice. But this point is
more of a task for swap device tiering (like Linux swap priorities), as I
proposed earlier last year...

//Jim
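A minimal sketch of provisioning temporary swap for the kind of one-off
task Jim describes (the dataset name rpool/swaptmp and the 48g size are
made up for the example):

  # Add a temporary swap zvol for a memory-hungry one-off job.
  zfs create -V 48g rpool/swaptmp
  swap -a /dev/zvol/dsk/rpool/swaptmp
  # ... run the zdb walk, then remove the extra swap again:
  swap -d /dev/zvol/dsk/rpool/swaptmp
  zfs destroy rpool/swaptmp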
On 01/22/2013 11:22 PM, Jim Klimov wrote:
> On 2013-01-22 23:03, Sašo Kiselkov wrote:
>> On 01/22/2013 10:45 PM, Jim Klimov wrote:
>>> On 2013-01-22 14:29, Darren J Moffat wrote:
>>>> Preallocated ZVOLs - for swap/dump.
>>>
>>> Or is it also supported to disable COW for such datasets, so that
>>> the preallocated swap/dump zvols might remain contiguous on the
>>> faster tracks of the drive (i.e. like a dedicated partition, but
>>> with benefits of ZFS checksums and maybe compression)?
>>
>> I highly doubt it, as it breaks one of the fundamental design principles
>> behind ZFS (always maintain transactional consistency). Also,
>> contiguousness and compression are fundamentally at odds (contiguousness
>> requires each block to remain the same length regardless of contents,
>> compression varies block length depending on the entropy of the
>> contents).
>
> Well, dump and swap devices are kind of special in that they need
> verifiable storage (i.e. detectable to have no bit-errors) but not
> really consistency as in sudden-power-off transaction protection.

I get your point, but I would argue that if you are willing to preallocate
storage for these, then putting dump/swap on an iSCSI LUN as opposed to
having it locally is kind of pointless anyway. Since they are used rarely,
having them "thin provisioned" is probably better in an iSCSI environment
than wasting valuable network-storage resources on something you rarely
need.

> Both have a lifetime span of a single system uptime - like L2ARC,
> for example - and will be reused anew afterwards - after a reboot,
> a power surge, or a kernel panic.

For the record, the L2ARC is not transactionally consistent. It uses a
completely different allocation strategy from the main pool (essentially a
simple rotor). Besides, if you plan to shred your dump contents after
reboot anyway, why fat-provision them? I can understand swap, but dump?

> So while the metadata used to address the swap ZVOL contents may and
> should be subject to common ZFS transactions and COW and so on,
> and jump around the disk along with rewrites of blocks, the ZVOL
> userdata itself may as well occupy the same positions on the disk,
> I think, rewriting older stuff. With mirroring likely in place as
> well as checksums, there are other ways than COW to ensure that
> the swap (or at least some component thereof) contains what it should,
> even with intermittent errors of some component devices.

You don't understand: the transactional integrity in ZFS isn't just to
protect the data you put in, it's also meant to protect ZFS' internal
structure (i.e. the metadata). This includes the layout of your zvols
(which are also just another dataset). I understand that you want to view
this kind of fat-provisioned zvol as a simple contiguous container block,
but it is probably more hassle to implement than it's worth.

> Likewise, the swap/dump breed of zvols shouldn't really have snapshots,
> especially not automatic ones (and the installer should take care
> of this at least for the two zvols it creates) ;)

If you are talking about the standard opensolaris-style boot environments,
then yes, this is taken into account. Your BE lives under rpool/ROOT,
while swap and dump are rpool/swap and rpool/dump respectively (both
thin-provisioned, since they are rarely needed).

> Compression for swap is an interesting matter... for example, how
> should it be accounted? As dynamic expansion and/or shrinking of
> available swap space (or just of the space needed to store it)?

Since compression occurs way below the dataset layer, your zvol capacity
doesn't change with compression, even though how much space it actually
uses in the pool can. A zvol's capacity pertains to its logical
attributes, i.e. most importantly the maximum byte offset within it
accessible to an application (in this case, swap). How the underlying
blocks are actually stored and how much space they take up is up to the
lower layers.

> If the latter, and we still intend to preallocate and guarantee
> that the swap has its administratively predefined amount of
> gigabytes, compressed blocks can be aligned on the same starting
> locations as if they were not compressed. In effect this would
> just decrease the bandwidth requirements, maybe.

But you forget that a compressed block's physical size fundamentally
depends on its contents. That's why compressed zvols still appear the same
size as before. What changes is how much space they occupy on the
underlying pool.

> For dump this might be just a bulky compressed write from start
> to however much it needs, within the preallocated psize limits...

I hope you now understand the distinction between the logical size of a
zvol and its actual in-pool size. We can't tie one to the other, since it
would result in unpredictable behavior for the application (write one set
of data, get capacity X; write another set, get capacity Y - how do you
determine in advance how much fits in? You can't).

Cheers,
--
Saso
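The distinction is easy to see on a live system; a quick sketch
(rpool/swap is just an example dataset):

  # volsize stays fixed - it is the logical capacity the consumer sees -
  # while used/referenced and compressratio reflect what the pool
  # actually stores for the zvol.
  zfs get volsize,used,referenced,compressratio rpool/swap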
Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
2013-Jan-22 23:54 UTC
[zfs-discuss] RFE: Un-dedup for unique blocks
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Nico Williams
>
> As for swap... really, you don't want to swap. If you're swapping you
> have problems.

For clarification, the above is true in Solaris and derivatives, but it's
not universally true for all OSes. I'll cite Linux as the example, because
I know it.

If you provide swap to a Linux kernel, it considers this a degree of
freedom when choosing between evicting data from the cache and swapping
out idle (or zombie) processes. As long as you swap out idle process
memory that is colder than some cache memory, swap actually improves
performance. But if you have any active process starved of RAM and
consequently thrashing swap, then of course you're right: it's bad bad bad
to use swap that way.

In Solaris, I've never seen it swap out idle processes; I've only seen it
use swap for the bad bad bad situation. I assume that's all it can do with
swap.
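On Linux this trade-off is tunable; a rough sketch (the value shown is
only an example):

  # vm.swappiness controls how eagerly the kernel pages out idle anonymous
  # memory in order to keep page cache: higher values favour swapping out
  # idle process pages, lower values favour dropping cache first.
  cat /proc/sys/vm/swappiness
  sysctl -w vm.swappiness=60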
The discussion suddenly gets hot and interesting - albeit quite diverged
from the original topic ;)

First of all, as a disclaimer: when I earlier proposed such changes to
datasets for swap (and maybe dump) use, I explicitly proposed that this be
a new dataset type - alongside the zvol, fs and snapshot types we have
today. Granted, this distinction was lost in today's exchange of words,
but it is still an important one - especially since it means that while
the basic ZFS (or rather zpool) rules are maintained, the dataset rules
might be redefined ;)

I'll try to reply to a few points below, snipping a lot of older text.

>> Well, dump and swap devices are kind of special in that they need
>> verifiable storage (i.e. detectable to have no bit-errors) but not
>> really consistency as in sudden-power-off transaction protection.
>
> I get your point, but I would argue that if you are willing to
> preallocate storage for these, then putting dump/swap on an iSCSI LUN as
> opposed to having it locally is kind of pointless anyway. Since they are
> used rarely, having them "thin provisioned" is probably better in an
> iSCSI environment than wasting valuable network-storage resources on
> something you rarely need.

I am not sure what in my post led you to think that I meant iSCSI or
otherwise networked storage to keep swap and dump. Some servers have local
disks, you know - and in networked storage environments the local disks
are only used to keep the OS image, swap and dump ;)

> Besides, if you plan to shred your dump contents after
> reboot anyway, why fat-provision them? I can understand swap, but dump?

To guarantee that the space is there... Given the recent mischiefs with
dumping (i.e. the context is quite stripped compared to general kernel
work, so multithreading broke somehow), I guess that pre-provisioned
sequential areas might also reduce some risks... though likely not -
random metadata would still have to get into the pool.

> You don't understand: the transactional integrity in ZFS isn't just to
> protect the data you put in, it's also meant to protect ZFS' internal
> structure (i.e. the metadata). This includes the layout of your zvols
> (which are also just another dataset). I understand that you want to
> view this kind of fat-provisioned zvol as a simple contiguous
> container block, but it is probably more hassle to implement than it's
> worth.

I'd argue that transactional integrity in ZFS primarily protects metadata,
so that there is a tree of always-current block pointers. There is this
octopus of a block-pointer tree whose leaf nodes point to data blocks -
but only as DVAs and checksums, basically. Nothing really requires data to
be (or not be) COWed and stored at a different location than the previous
version of the block at the same logical offset, as far as the data
consumers (FS users, zvol users) are concerned, except that we want that
data to be readable even after a catastrophic pool close (system crash,
poweroff, etc.). We don't (AFAIK) have such a requirement for swap. If the
pool which contained swap kicked the bucket, we probably have a larger
problem whose solution will likely involve a reboot and thus recycling of
all swap data. And for single-device errors with (contiguous) preallocated
unrelocatable swap, we can protect with mirrors and checksums (used upon
read, within the same uptime that wrote the bits).

>> Likewise, the swap/dump breed of zvols shouldn't really have snapshots,
>> especially not automatic ones (and the installer should take care
>> of this at least for the two zvols it creates) ;)
>
> If you are talking about the standard opensolaris-style boot
> environments, then yes, this is taken into account. Your BE lives
> under rpool/ROOT, while swap and dump are rpool/swap and rpool/dump
> respectively (both thin-provisioned, since they are rarely needed).

I meant the attribute for the zfs-auto-snapshot service, i.e.:

  rpool/swap  com.sun:auto-snapshot  false  local

As I wrote, I'd argue that for "new" swap (and maybe dump) datasets the
snapshot action should not even be implemented.

>> Compression for swap is an interesting matter... for example, how
>> should it be accounted? As dynamic expansion and/or shrinking of
>> available swap space (or just of the space needed to store it)?
>
> Since compression occurs way below the dataset layer, your zvol capacity
> doesn't change with compression, even though how much space it actually
> uses in the pool can. A zvol's capacity pertains to its logical
> attributes, i.e. most importantly the maximum byte offset within it
> accessible to an application (in this case, swap). How the underlying
> blocks are actually stored and how much space they take up is up to the
> lower layers.
...
> But you forget that a compressed block's physical size fundamentally
> depends on its contents. That's why compressed zvols still appear the
> same size as before. What changes is how much space they occupy on the
> underlying pool.

I won't argue with this, as it is perfectly correct for zvols and
undefined for the mythical new dataset type ;)

However, regarding dump and size prediction: when I created dump zvols
manually and fed them to dumpadm, it could complain that the device was
too small. Then at some point it accepted the given size, even though that
value did not obviously relate to the system RAM or anything. So I guess
the system also does some guessing in this case?.. If so, preallocating as
many bytes as it thinks are minimally required and then allowing
compression to stuff more data in might help to actually save larger dumps
in cases where the system (dumpadm) made a wrong guess.

//Jim
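For reference, the manual workflow described above looks roughly like this
(the 2g size and the rpool/dump2 name are arbitrary examples):

  # Create a dump zvol by hand and hand it to dumpadm; dumpadm refuses the
  # device if its own estimate says it is too small.
  zfs create -V 2g rpool/dump2
  dumpadm -d /dev/zvol/dsk/rpool/dump2
  dumpadm    # show the resulting dump configuration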
On Tue, Jan 22, 2013 at 11:54:53PM +0000, Edward Ned Harvey
(opensolarisisdeadlongliveopensolaris) wrote:
> > From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> > bounces at opensolaris.org] On Behalf Of Nico Williams
>
> In Solaris, I've never seen it swap out idle processes; I've only
> seen it use swap for the bad bad bad situation. I assume that's all
> it can do with swap.

You would be wrong. Solaris uses swap space for paging. Paging out unused
portions of an executing process from real memory to the swap device is
certainly beneficial. Swapping out complete processes is a desperation
move, but paging out most of an idle process is a good thing.

--
-Gary Mills-        -refurb-        -Winnipeg, Manitoba, Canada-
Casper.Dik at oracle.com
2013-Jan-23 08:41 UTC
[zfs-discuss] RFE: Un-dedup for unique blocks
> IIRC dump is special.
>
> As for swap... really, you don't want to swap. If you're swapping you
> have problems. Any swap space you have is to help you detect those
> problems and correct them before apps start getting ENOMEM. There
> *are* exceptions to this, such as Varnish. For Varnish and any other
> apps like it I'd dedicate an entire flash drive to it, no ZFS, no
> nothing.

Yes and no: the system reserves a lot of additional memory (Solaris
doesn't over-commit swap) and swap is needed to support those
reservations. Also, some pages are dirtied early on and never touched
again; those pages should not be kept in memory. But continuously swapping
is clearly a sign of a system too small for its job.

Of course, compressing and/or encrypting swap has interesting issues: in
order to free memory by swapping pages out, you need even more memory.

Casper
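The reservation accounting Casper mentions is visible with the ordinary
swap tools; a small sketch:

  # swap -s reports bytes allocated plus bytes reserved (reservations that
  # have not been touched yet) against what is still available; because
  # Solaris does not over-commit, reservations can exhaust swap long
  # before any page is actually written out.
  swap -s
  # Per-device view: configured size versus free blocks.
  swap -l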
On 2013-01-23 09:41, Casper.Dik at oracle.com wrote:
> Yes and no: the system reserves a lot of additional memory (Solaris
> doesn't over-commit swap) and swap is needed to support those
> reservations. Also, some pages are dirtied early on and never touched
> again; those pages should not be kept in memory.

I believe, judging by the symptoms, that this is what often happens to
Java processes (app servers and such) in particular - I regularly see
these have large "VM" sizes and much (3x) smaller "RSS" sizes.

One explanation I've seen is that the JVM nominally depends on a number of
shared libraries which are loaded to fulfill the runtime requirements, but
aren't actively used and thus go out into swap quickly. I chose to trust
that statement ;)

//Jim
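The symptom Jim describes is easy to check per process; a sketch (the java
pattern and <pid> are placeholders):

  # Compare virtual size (VSZ) with resident size (RSS); a large gap means
  # much of the address space is not currently in RAM.
  ps -eo pid,vsz,rss,args | grep java
  # Per-mapping breakdown for one process; library segments that were
  # paged out show a small RSS relative to their size.
  pmap -x <pid>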
Jim Klimov wrote:
> On 2013-01-23 09:41, Casper.Dik at oracle.com wrote:
>> Yes and no: the system reserves a lot of additional memory (Solaris
>> doesn't over-commit swap) and swap is needed to support those
>> reservations. Also, some pages are dirtied early on and never touched
>> again; those pages should not be kept in memory.
>
> I believe, judging by the symptoms, that this is what often happens to
> Java processes (app servers and such) in particular - I regularly see
> these have large "VM" sizes and much (3x) smaller "RSS" sizes.

Being swapped out is probably the best thing that can be done to most Java
processes :)

--
Ian.
Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
2013-Jan-23 12:36 UTC
[zfs-discuss] RFE: Un-dedup for unique blocks
> From: Gary Mills [mailto:gary_mills at fastmail.fm]
>
>> In Solaris, I've never seen it swap out idle processes; I've only
>> seen it use swap for the bad bad bad situation. I assume that's all
>> it can do with swap.
>
> You would be wrong. Solaris uses swap space for paging. Paging out
> unused portions of an executing process from real memory to the swap
> device is certainly beneficial. Swapping out complete processes is a
> desperation move, but paging out most of an idle process is a good
> thing.

You seem to be emphasizing the distinction between swapping and paging. My
point, though, is that I've never seen swap usage (which is what paging
uses) on any Solaris derivative become nonzero for the sake of keeping
something in cache. It seems to me that Solaris will always evict all
cache memory before it swaps (pages) out even the most idle process
memory.
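Whether a Solaris box is actually paging anonymous memory out (rather than
merely reserving swap) can be observed directly; a brief sketch:

  # vmstat -p breaks paging activity down by page type: the api/apo
  # columns are anonymous (process) page-ins/outs, epi/epo executable
  # pages, fpi/fpo file-system pages.
  vmstat -p 5
  # The free column shows whether swap blocks are in use at all.
  swap -l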
On 01/22/2013 10:50 PM, Gary Mills wrote:
> On Tue, Jan 22, 2013 at 11:54:53PM +0000, Edward Ned Harvey
> (opensolarisisdeadlongliveopensolaris) wrote:
> Paging out unused portions of an executing process from real memory to
> the swap device is certainly beneficial. Swapping out complete
> processes is a desperation move, but paging out most of an idle
> process is a good thing.

It gets even better. Executables become part of the swap space via mmap,
so that if you have a lot of copies of the same process running in memory,
the executable bits don't waste any more space (well, unless you use the
sticky bit, although that might be deprecated, or if you copy the binary
elsewhere.) There's lots of awesome fun optimizations in UNIX. :)
Casper.Dik at oracle.com
2013-Jan-23 19:48 UTC
[zfs-discuss] RFE: Un-dedup for unique blocks
> On 01/22/2013 10:50 PM, Gary Mills wrote:
>> On Tue, Jan 22, 2013 at 11:54:53PM +0000, Edward Ned Harvey
>> (opensolarisisdeadlongliveopensolaris) wrote:
>> Paging out unused portions of an executing process from real memory to
>> the swap device is certainly beneficial. Swapping out complete
>> processes is a desperation move, but paging out most of an idle
>> process is a good thing.
>
> It gets even better. Executables become part of the swap space via
> mmap, so that if you have a lot of copies of the same process running in
> memory, the executable bits don't waste any more space (well, unless you
> use the sticky bit, although that might be deprecated, or if you copy
> the binary elsewhere.) There's lots of awesome fun optimizations in
> UNIX. :)

The "sticky bit" has never been used in that way in SunOS for as long as I
can remember (SunOS 3.x), and probably before that. It no longer makes
sense for demand-paged executables.

Casper
On Tue, Jan 22, 2013 at 5:29 AM, Darren J Moffat
<darrenm at opensolaris.org> wrote:
> Preallocated ZVOLs - for swap/dump.

Darren, good to hear about the cool stuff in S11.

Just to clarify, is this preallocated ZVOL different than the preallocated
dump which has been there for quite some time (and is in Illumos)? Can you
use it for other zvols besides swap and dump?

Some background: the ZFS dump device has always been preallocated ("thick
provisioned"), so that we can reliably dump. By definition, something has
gone horribly wrong when we are dumping, so this code path needs to be as
small as possible to have any hope of getting a dump. So we preallocate
the space for dump, and store a simple linked list of disk segments where
it will be stored. The dump device is not COW, checksummed, deduped,
compressed, etc. by ZFS.

In Illumos (and S10), swap was treated more or less like a regular zvol.
This leads to some tricky code paths because ZFS allocates memory from
many points in the code as it is writing out changes. I could see
advantages to the simplicity of a preallocated swap volume, using the same
code that already exists for preallocated dump. Of course, the loss of
checksumming and encryption is much more of a concern with swap (which is
critical for correct behavior) than with dump (which is nice to have for
debugging).

--matt
On 01/24/13 00:04, Matthew Ahrens wrote:
> On Tue, Jan 22, 2013 at 5:29 AM, Darren J Moffat
> <darrenm at opensolaris.org <mailto:darrenm at opensolaris.org>> wrote:
>
>     Preallocated ZVOLs - for swap/dump.
>
> Darren, good to hear about the cool stuff in S11.
>
> Just to clarify, is this preallocated ZVOL different than the
> preallocated dump which has been there for quite some time (and is in
> Illumos)? Can you use it for other zvols besides swap and dump?

It is the same, but we are using it for swap now too. It isn't available
for general use.

> Some background: the ZFS dump device has always been preallocated
> ("thick provisioned"), so that we can reliably dump. By definition,
> something has gone horribly wrong when we are dumping, so this code path
> needs to be as small as possible to have any hope of getting a dump. So
> we preallocate the space for dump, and store a simple linked list of
> disk segments where it will be stored. The dump device is not COW,
> checksummed, deduped, compressed, etc. by ZFS.

For the sake of others (I know you know this, Matt): the dump system does
the compression, so ZFS didn't need to anyway.

> In Illumos (and S10), swap was treated more or less like a regular zvol.
> This leads to some tricky code paths because ZFS allocates memory from
> many points in the code as it is writing out changes. I could see
> advantages to the simplicity of a preallocated swap volume, using the
> same code that already exists for preallocated dump. Of course, the
> loss of checksumming and encryption is much more of a concern with swap
> (which is critical for correct behavior) than with dump (which is nice
> to have for debugging).

We have encryption for dump because it is hooked into the zvol code.

For encrypting swap, Illumos could do the same as Solaris 11 does and use
lofi. I changed swapadd so that if "encryption" is specified in the
options field of the vfstab entry, it creates a lofi shim over the swap
device using 'lofiadm -e'. This provides encrypted swap regardless of what
the underlying "disk" is (normal ZVOL, prealloc ZVOL, real disk slice, SVM
mirror etc).

--
Darren J Moffat
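As a sketch of the vfstab approach Darren describes (field order per the
standard vfstab layout; the zvol path is an example, and the exact option
string is whatever swapadd expects):

  # /etc/vfstab entry: requesting encryption in the options field makes
  # swapadd interpose an encrypted lofi device (lofiadm -e) over the swap
  # zvol at boot, instead of adding the raw zvol itself.
  /dev/zvol/dsk/rpool/swap  -  -  swap  -  no  encryption

After boot, 'swap -l' should then list the lofi device rather than the raw
zvol.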
On 2013-01-24 11:06, Darren J Moffat wrote:
> On 01/24/13 00:04, Matthew Ahrens wrote:
>> On Tue, Jan 22, 2013 at 5:29 AM, Darren J Moffat
>> <darrenm at opensolaris.org <mailto:darrenm at opensolaris.org>> wrote:
>>
>>     Preallocated ZVOLs - for swap/dump.
>>
>> Darren, good to hear about the cool stuff in S11.

Yes, thanks, Darren :)

>> Just to clarify, is this preallocated ZVOL different than the
>> preallocated dump which has been there for quite some time (and is in
>> Illumos)? Can you use it for other zvols besides swap and dump?
>
> It is the same, but we are using it for swap now too. It isn't available
> for general use.
>
>> Some background: the ZFS dump device has always been preallocated
>> ("thick provisioned"), so that we can reliably dump. By definition,
>> something has gone horribly wrong when we are dumping, so this code path
>> needs to be as small as possible to have any hope of getting a dump. So
>> we preallocate the space for dump, and store a simple linked list of
>> disk segments where it will be stored. The dump device is not COW,
>> checksummed, deduped, compressed, etc. by ZFS.

Comparing these two statements, can I say (and be correct) that the
preallocated swap devices would lack COW (as I proposed too), and thus
likely snapshots, but would also lack the checksums? (We might live
without compression, though that was once touted as a bonus for swap over
ZFS, and we can certainly do without dedup.)

Basically, they seem little different from preallocated disk slices - and
for those an admin might have better control over the dedicated disk
locations (i.e. faster tracks in a small-seek stroke range) - except that
ZFS datasets are easier to resize... right or wrong?

//Jim
>> It also has a lot of performance improvements and general bug fixes in
>> the Solaris 11.1 release.
>
> Performance improvements such as?

Dedup'ed ARC, for one.
All-zero blocks automatically "dedup'ed" in memory.
Improvements to ZIL performance.
Zero-copy zfs+nfs+iscsi.
...

--
Robert Milkowski
http://milek.blogspot.com
On 01/29/2013 02:59 PM, Robert Milkowski wrote:
>>> It also has a lot of performance improvements and general bug fixes in
>>> the Solaris 11.1 release.
>>
>> Performance improvements such as?
>
> Dedup'ed ARC, for one.
> All-zero blocks automatically "dedup'ed" in memory.
> Improvements to ZIL performance.
> Zero-copy zfs+nfs+iscsi.
> ...

Cool, thanks for the inspiration on my next work in Illumos' ZFS.

Cheers,
--
Saso
> From: Richard Elling
> Sent: 21 January 2013 03:51
>
> VAAI has 4 features, 3 of which have been in illumos for a long time.
> The remaining feature (SCSI UNMAP) was done by Nexenta and exists in
> their NexentaStor product, but the CEO made a conscious (and unpopular)
> decision to keep that code from the community. Over the summer, another
> developer picked up the work in the community, but I've lost track of
> the progress and haven't seen an RTI yet.

That is one thing that has always bothered me... so it is OK for others,
like Nexenta, to keep stuff closed and not in the open, while if Oracle
does it they are bad?

Isn't that at least a little bit hypocritical (bashing Oracle and doing
sort of the same)?

--
Robert Milkowski
http://milek.blogspot.com
On 01/29/2013 03:08 PM, Robert Milkowski wrote:
>> From: Richard Elling
>> Sent: 21 January 2013 03:51
>>
>> VAAI has 4 features, 3 of which have been in illumos for a long time.
>> The remaining feature (SCSI UNMAP) was done by Nexenta and exists in
>> their NexentaStor product, but the CEO made a conscious (and unpopular)
>> decision to keep that code from the community. Over the summer, another
>> developer picked up the work in the community, but I've lost track of
>> the progress and haven't seen an RTI yet.
>
> That is one thing that has always bothered me... so it is OK for others,
> like Nexenta, to keep stuff closed and not in the open, while if Oracle
> does it they are bad?
>
> Isn't that at least a little bit hypocritical (bashing Oracle and doing
> sort of the same)?

Nexenta is a downstream repository that chooses to keep some of their new
developments in-house while making others open. Most importantly, they
participate and make a conscious effort to play nice.

Contrast this with Oracle. Oracle swoops in and buys up Sun, closes *all*
of the technologies it can turn a profit on, changes licensing terms to
extremely draconian ones, and in the process takes a dump on all of the
open-source community and large numbers of their customers.

Now imagine which of these two is more popular in the community?
(Disclaimer: my company was formerly an almost exclusive Sun shop.)

Cheers,
--
Saso
Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
2013-Jan-29 15:03 UTC
[zfs-discuss] RFE: Un-dedup for unique blocks
> From: Robert Milkowski [mailto:rmilkowski at task.gda.pl]
>
> That is one thing that has always bothered me... so it is OK for others,
> like Nexenta, to keep stuff closed and not in the open, while if Oracle
> does it they are bad?

Oracle, like Nexenta, my own company CleverTrove, Microsoft, and NetApp,
has every right to close-source development if they believe it's
beneficial to their business. For all we know, Oracle might not even have
a choice about it - it might have been in the terms of the settlement with
NetApp (because open-source ZFS definitely hurt NetApp's business).

The real question is: in which situations is it beneficial to your
business to be closed source, as opposed to open source? There's the whole
Red Hat / CentOS dichotomy. At first blush it would seem Red Hat gets
screwed by CentOS (or Oracle Linux), but then you realize how many more
Red Hat derived systems are out there compared to SUSE, etc. By allowing
people to use it for free, it actually gains popularity, and then Red Hat
actually has a successful support business model, as compared to SUSE,
which tanked.

But it's useless to argue about whether Oracle is making the right
business choice, or whether open or closed source is better for their
business. It's their choice, regardless of who agrees, and arguing about
it here isn't going to do any good.

Those of us who gained something and no longer count on having that
benefit moving forward have a tendency to say "You gave it to me for free
before, now I'm pissed off because you're not giving it to me for free
anymore" instead of "thanks for what you gave before." The world moves on.
There's plenty of time to figure out which solution is best for you, the
consumer, among the future product offerings: a commercial closed-source
product, an open-source product, or something completely different such as
btrfs.
On Jan 29, 2013, at 6:08 AM, Robert Milkowski <rmilkowski at task.gda.pl> wrote:
>> From: Richard Elling
>> Sent: 21 January 2013 03:51
>>
>> VAAI has 4 features, 3 of which have been in illumos for a long time.
>> The remaining feature (SCSI UNMAP) was done by Nexenta and exists in
>> their NexentaStor product, but the CEO made a conscious (and unpopular)
>> decision to keep that code from the community. Over the summer, another
>> developer picked up the work in the community, but I've lost track of
>> the progress and haven't seen an RTI yet.
>
> That is one thing that has always bothered me... so it is OK for others,
> like Nexenta, to keep stuff closed and not in the open, while if Oracle
> does it they are bad?

Nexenta is just as bad. For the record, the illumos-community folks who
worked at Nexenta at the time were overruled by executive management. Some
of those folks are now executive management elsewhere :-)

> Isn't that at least a little bit hypocritical (bashing Oracle and doing
> sort of the same)?

No, not at all.
 -- richard

--
Richard.Elling at RichardElling.com
+1-760-896-4422
<Casper.Dik at oracle.com> wrote:
>> It gets even better. Executables become part of the swap space via
>> mmap, so that if you have a lot of copies of the same process running in
>> memory, the executable bits don't waste any more space (well, unless you
>> use the sticky bit, although that might be deprecated, or if you copy
>> the binary elsewhere.) There's lots of awesome fun optimizations in
>> UNIX. :)
>
> The "sticky bit" has never been used in that way in SunOS for as long as
> I can remember (SunOS 3.x), and probably before that. It no longer makes
> sense for demand-paged executables.

SunOS-3.0 introduced NFS root and swap on NFS. For that reason, the
meaning of the sticky bit was changed to mean "do not cache write this
file". Note that SunOS-3.0 appeared with the new Sun3 machines (first
build on 24.12.1985).

Jörg

--
EMail: joerg at schily.isdn.cs.tu-berlin.de  (home) Jörg Schilling D-13353 Berlin
       js at cs.tu-berlin.de                 (uni)
       joerg.schilling at fokus.fraunhofer.de (work)
Blog:  http://schily.blogspot.com/
URL:   http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
On Sun, Jan 20, 2013 at 07:51:15PM -0800, Richard Elling wrote:
>> 2. VAAI support.
>
> VAAI has 4 features, 3 of which have been in illumos for a long time.
> The remaining feature (SCSI UNMAP) was done by Nexenta and exists in
> their NexentaStor product, but the CEO made a conscious (and unpopular)
> decision to keep that code from the community. Over the summer, another
> developer picked up the work in the community, but I've lost track of
> the progress and haven't seen an RTI yet.

I assume SCSI UNMAP is implemented in Comstar in NexentaStor? Isn't
Comstar CDDL licensed?

There's also this: https://www.illumos.org/issues/701
.. which says UNMAP support was added to Illumos Comstar 2 years ago.

-- Pasi