Hello all,

While revising my home NAS which had dedup enabled before I gathered that its RAM capacity was too puny for the task, I found that there is some deduplication among the data bits I uploaded there (makes sense, since it holds backups of many of the computers I've worked on - some of my homedirs' contents were bound to intersect). However, a lot of the blocks are in fact "unique" - have entries in the DDT with count=1 and the blkptr_t bit set. In fact they are not deduped, and with my pouring of backups complete - they are unlikely to ever become deduped.

Thus these many unique "deduped" blocks are just a burden when my system writes into the datasets with dedup enabled, when it walks the superfluously large DDT, when it has to store this DDT on disk and in ARC, maybe during scrubbing... These entries bring lots of headache (or performance degradation) for zero gain.

So I thought it would be a nice feature to let ZFS go over the DDT (I won't care if it requires offlining/exporting the pool) and evict the entries with count==1, as well as locate the corresponding block-pointer tree entries on disk and clear the dedup bits, turning such blocks into regular unique ones. This would require rewriting metadata (a smaller DDT, new block pointers) but should not touch or reallocate the already-saved userdata (the blocks' contents) on disk. The new BP without the dedup bit set would have the same contents in its other fields (though its parents would of course have to change more - new DVAs, new checksums...).

In the end my pool would only track as deduped those blocks which do already have two or more references - which, given the "static" nature of such a backup box, should be enough (i.e. new full backups of the same source data would remain deduped and use no extra space, while unique data won't waste the resources being accounted as deduped).

What do you think?
//Jim
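For anyone wanting to gauge how much of their DDT consists of such unique entries before debating the feature, zdb can already report the reference-count histogram. A rough usage sketch, assuming a pool named "tank":

    # DDT summary: entry counts plus in-core and on-disk entry sizes
    zdb -D tank

    # DDT histogram bucketed by reference count; the "refcnt 1" bucket
    # is exactly the population of never-deduped blocks Jim describes
    zdb -DD tank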
I've wanted a system where dedup applies only to blocks being written that have a good chance of being dups of others.

I think one way to do this would be to keep a scalable Bloom filter (on disk) into which one inserts block hashes.

To decide if a block needs dedup one would first check the Bloom filter, then if the block is in it, use the dedup code path, else the non-dedup codepath and insert the block in the Bloom filter. This means that the filesystem would store *two* copies of any deduplicatious block, with one of those not being in the DDT.

This would allow most writes of non-duplicate blocks to be faster than normal dedup writes, but still slower than normal non-dedup writes: the Bloom filter will add some cost.

The nice thing about this is that Bloom filters can be sized to fit in main memory, and will be much smaller than the DDT.

It's very likely that this is a bit too obvious to just work.

Of course, it is easier to just use flash. It's also easier to just not dedup: the most highly deduplicatious data (VM images) is relatively easy to manage using clones and snapshots, to a point anyways.

Nico
--
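A minimal sketch of the check-then-insert logic Nico describes - my own illustration rather than anything resembling ZFS code. It treats the four 64-bit words of a block's SHA-256 checksum as the k hash functions (reasonable, since the output is already uniformly distributed); the filter size is an arbitrary assumption:

    #include <stdint.h>

    #define BF_BITS (1ULL << 30)    /* 2^30 bits = 128 MB; size to taste */

    typedef struct bloom {
            uint8_t *bits;          /* BF_BITS/8 bytes, allocated at pool open */
    } bloom_t;

    static int
    bf_test_bit(const bloom_t *bf, uint64_t h)
    {
            h &= (BF_BITS - 1);
            return ((bf->bits[h >> 3] >> (h & 7)) & 1);
    }

    static void
    bf_set_bit(bloom_t *bf, uint64_t h)
    {
            h &= (BF_BITS - 1);
            bf->bits[h >> 3] |= (uint8_t)(1 << (h & 7));
    }

    /*
     * Returns 1 if the checksum was probably seen before (take the dedup
     * write path), 0 if it is definitely new (take the plain write path,
     * and remember it so the next copy of the same block triggers dedup).
     */
    int
    bf_check_and_insert(bloom_t *bf, const uint64_t cksum[4])
    {
            int i, hit = 1;

            for (i = 0; i < 4; i++)
                    hit &= bf_test_bit(bf, cksum[i]);
            if (!hit) {
                    for (i = 0; i < 4; i++)
                            bf_set_bit(bf, cksum[i]);
            }
            return (hit);
    }

This is also why, as Nico notes, the filesystem ends up storing two copies of any deduplicatious block: the first copy never sees the DDT, only the second and later copies take the dedup path.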
bloom filters are a great fit for this :-)
 -- richard

On Jan 19, 2013, at 5:59 PM, Nico Williams <nico at cryptonector.com> wrote:

> I've wanted a system where dedup applies only to blocks being written
> that have a good chance of being dups of others.
>
> I think one way to do this would be to keep a scalable Bloom filter
> (on disk) into which one inserts block hashes.
>
> To decide if a block needs dedup one would first check the Bloom
> filter, then if the block is in it, use the dedup code path, else the
> non-dedup codepath and insert the block in the Bloom filter. This
> means that the filesystem would store *two* copies of any
> deduplicatious block, with one of those not being in the DDT.
>
> This would allow most writes of non-duplicate blocks to be faster than
> normal dedup writes, but still slower than normal non-dedup writes:
> the Bloom filter will add some cost.
>
> The nice thing about this is that Bloom filters can be sized to fit in
> main memory, and will be much smaller than the DDT.
>
> It's very likely that this is a bit too obvious to just work.
>
> Of course, it is easier to just use flash. It's also easier to just
> not dedup: the most highly deduplicatious data (VM images) is
> relatively easy to manage using clones and snapshots, to a point
> anyways.
>
> Nico
Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
2013-Jan-20 16:02 UTC
[zfs-discuss] RFE: Un-dedup for unique blocks
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Nico Williams
>
> I've wanted a system where dedup applies only to blocks being written
> that have a good chance of being dups of others.
>
> I think one way to do this would be to keep a scalable Bloom filter
> (on disk) into which one inserts block hashes.
>
> To decide if a block needs dedup one would first check the Bloom
> filter, then if the block is in it, use the dedup code path,

How is this different or better than the existing dedup architecture? If you found that some block about to be written in fact matches the hash of an existing block on disk, then you've already determined it's a duplicate block, exactly as you would if you had dedup enabled. In that situation, gosh, it sure would be nice to have the extra information like reference count, and pointer to the duplicate block, which exists in the dedup table. In other words, exactly the way existing dedup is already architected.

> The nice thing about this is that Bloom filters can be sized to fit in
> main memory, and will be much smaller than the DDT.

If you're storing all the hashes of all the blocks, how is that going to be smaller than the DDT storing all the hashes of all the blocks?
So ... The way things presently are, ideally you would know in advance what stuff you were planning to write that has duplicate copies. You could enable dedup, then write all the stuff that's highly duplicated, then turn off dedup and write all the non-duplicate stuff. Obviously, however, this is a fairly implausible actual scenario.

In reality, while you're writing, you're going to have duplicate blocks mixed in with your non-duplicate blocks, which fundamentally means the system needs to be calculating the cksums and entering them into the DDT, even for the unique blocks... just because the first time the system sees each duplicate block, it doesn't yet know that it's going to be duplicated later.

But as you said, after data is written and sits around for a while, the probability of duplicating unique blocks diminishes over time. So they're just a burden.

I would think the ideal situation would be to take your idea of un-dedup for unique blocks, and take it a step further: un-dedup unique blocks that are older than some configurable threshold. Maybe you could have a command for a sysadmin to run, to scan the whole pool performing this operation, but it's the kind of maintenance that really should be done upon access, too. Somebody goes back and reads a jpg from last year, the system reads it and consequently loads the DDT entry, discovers that it's unique and has been for a long time, so throw out the DDT info.

But, by talking about it, we're just smoking pipe dreams. Cuz we all know zfs is developmentally challenged now. But one can dream...

finglonger
Bloom filters are very small, that's the difference. You might only need a few bits per block for a Bloom filter. Compare that to the size of a DDT entry. A Bloom filter could be cached entirely in main memory.
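Rough numbers to make the comparison concrete (my own back-of-envelope figures: ~320 bytes per in-core DDT entry is the commonly quoted estimate, and ~10 bits per element gives a Bloom filter roughly a 1% false-positive rate):

    1 TB of data at 128K recordsize        ~= 8 million blocks
    DDT held in ARC:  8M x ~320 bytes      ~= 2.5 GB
    Bloom filter:     8M x ~10 bits        ~= 10 MB

So the filter is a couple of hundred times smaller than the in-core DDT, at the price of occasionally sending a genuinely unique block down the dedup path.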
Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
2013-Jan-20 18:29 UTC
[zfs-discuss] RFE: Un-dedup for unique blocks
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Nico Williams
>
> To decide if a block needs dedup one would first check the Bloom
> filter, then if the block is in it, use the dedup code path, else the
> non-dedup codepath and insert the block in the Bloom filter.

Sorry, I didn't know what a Bloom filter was before I replied before - now I've read the wikipedia article and am consequently an expert. *sic* ;-)

It sounds like, what you're describing... The first time some data gets written, it will not produce a hit in the Bloom filter, so it will get written to disk without dedup. But now it has an entry in the Bloom filter. So the second time the data block gets written (the first duplicate) it will produce a hit in the Bloom filter, and consequently get a dedup DDT entry. But since the system didn't dedup the first one, it means the second one still needs to be written to disk independently of the first one. So in effect, you'll always "miss" the first duplicated block write, but you'll successfully dedup n-1 duplicated blocks. Which is entirely reasonable, although not strictly optimal.

And sometimes you'll get a false positive out of the Bloom filter, so sometimes you'll be running the dedup code on blocks which are actually unique, but with some intelligently selected parameters such as Bloom table size, you can get this probability to be reasonably small, like less than 1%.

In the wikipedia article, they say you can't remove an entry from the Bloom filter table, which would over time cause a consistent increase of false positive probability (approaching 100% false positives) from the Bloom filter and consequently a high probability of dedup'ing blocks that are actually unique; but with even a minimal amount of thinking about it, I'm quite sure that's a solvable implementation detail. Instead of storing a single bit for each entry in the table, store a counter. Every time you create a new entry in the table, increment the different locations; every time you remove an entry from the table, decrement. Obviously a counter requires more bits than a bit, but it's a linear increase of size, exponential increase of utility, and within the implementation limits of available hardware. But there may be a more intelligent way of accomplishing the same goal. (Like I said, I've only thought about this minimally.)

Meh, well. Thanks for the interesting thought. For whatever it's worth.
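What Ed re-derives here is the standard "counting Bloom filter". A minimal sketch of that variant, under the same assumptions as the earlier sketch (four 64-bit checksum words as the hash functions, arbitrary filter size, not ZFS code):

    #include <stdint.h>

    #define CBF_SLOTS (1ULL << 27)  /* 2^27 one-byte counters = 128 MB */

    typedef struct cbloom {
            uint8_t *cnt;           /* CBF_SLOTS saturating 8-bit counters */
    } cbloom_t;

    static void
    cbf_insert(cbloom_t *bf, const uint64_t cksum[4])
    {
            int i;

            for (i = 0; i < 4; i++) {
                    uint64_t h = cksum[i] & (CBF_SLOTS - 1);

                    if (bf->cnt[h] != UINT8_MAX)    /* saturate, never wrap */
                            bf->cnt[h]++;
            }
    }

    /*
     * Called when a block is freed, so stale entries stop inflating the
     * false-positive rate -- the property a plain Bloom filter lacks.
     */
    static void
    cbf_remove(cbloom_t *bf, const uint64_t cksum[4])
    {
            int i;

            for (i = 0; i < 4; i++) {
                    uint64_t h = cksum[i] & (CBF_SLOTS - 1);

                    if (bf->cnt[h] > 0 && bf->cnt[h] != UINT8_MAX)
                            bf->cnt[h]--;           /* saturated slots stay put */
            }
    }

    /* Membership test: present only if every addressed counter is nonzero. */
    static int
    cbf_test(const cbloom_t *bf, const uint64_t cksum[4])
    {
            int i;

            for (i = 0; i < 4; i++) {
                    if (bf->cnt[cksum[i] & (CBF_SLOTS - 1)] == 0)
                            return (0);
            }
            return (1);
    }

In practice 4-bit counters are usually considered sufficient, which keeps the size penalty over a plain one-bit filter to 4x rather than 8x.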
On 19 January, 2013 - Jim Klimov sent me these 2,0K bytes:

> Hello all,
>
> While revising my home NAS which had dedup enabled before I gathered
> that its RAM capacity was too puny for the task, I found that there is
> some deduplication among the data bits I uploaded there (makes sense,
> since it holds backups of many of the computers I've worked on - some
> of my homedirs' contents were bound to intersect). However, a lot of
> the blocks are in fact "unique" - have entries in the DDT with count=1
> and the blkptr_t bit set. In fact they are not deduped, and with my
> pouring of backups complete - they are unlikely to ever become deduped.

Another RFE would be 'zfs dedup mypool/somefs', which would basically go through and do a one-shot dedup. Would be useful in various scenarios. Possibly go through the entire pool at once, to make dedups intra-datasets (like "the real thing").

/Tomas
--
Tomas Forsman, stric at acc.umu.se, http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se
On 2013-01-20 19:55, Tomas Forsman wrote:
> On 19 January, 2013 - Jim Klimov sent me these 2,0K bytes:
>
>> Hello all,
>>
>> While revising my home NAS which had dedup enabled before I gathered
>> that its RAM capacity was too puny for the task, I found that there is
>> some deduplication among the data bits I uploaded there (makes sense,
>> since it holds backups of many of the computers I've worked on - some
>> of my homedirs' contents were bound to intersect). However, a lot of
>> the blocks are in fact "unique" - have entries in the DDT with count=1
>> and the blkptr_t bit set. In fact they are not deduped, and with my
>> pouring of backups complete - they are unlikely to ever become deduped.
>
> Another RFE would be 'zfs dedup mypool/somefs' and basically go through
> and do a one-shot dedup. Would be useful in various scenarios. Possibly
> go through the entire pool at once, to make dedups intra-datasets (like
> "the real thing").

Yes, but that was asked before =)

Actually, the pool's metadata does contain all the needed bits (i.e. checksum and size of blocks) such that a scrub-like procedure could try and find same blocks among unique ones (perhaps with a filter of "this" block being referenced from a dataset that currently wants dedup), throw one out and add a DDT entry to another.

On 2013-01-20 17:16, Edward Harvey wrote:
> So ... The way things presently are, ideally you would know in
> advance what stuff you were planning to write that has duplicate
> copies. You could enable dedup, then write all the stuff that's
> highly duplicated, then turn off dedup and write all the
> non-duplicate stuff. Obviously, however, this is a fairly
> implausible actual scenario.

Well, I guess I could script a solution that uses ZDB to dump the block pointer tree (about 100GB of text on my system), and some perl or sort/uniq/grep parsing over this huge text to find blocks that are the same but not deduped - as well as those single-copy "deduped" ones - and toggle the dedup property while rewriting the block inside its parent file with dd. This would all be within current ZFS's capabilities and would ultimately reach the goals of deduping pre-existing data as well as dropping unique blocks from the DDT.

It would certainly not be a real-time solution (it might well take months on my box - just fetching the BP tree took a couple of days) and would require more resources than needed otherwise (rewrites of the same userdata, storing and parsing of addresses as text instead of binaries, etc.) But I do see how this is doable even today, even by a non-expert ;) (Not sure I'd ever get around to actually doing it this way, though - it is not a very "clean" solution nor a performant one.)

As a bonus, however, this ZDB dump would also provide an answer to a frequently-asked question: "which files on my system intersect or are the same - and have some/all blocks in common via dedup?" Knowledge of this answer might help admins with some policy decisions, be it a witch-hunt for hoarders of same files or some pattern-making to determine which datasets should keep "dedup=on"...

My few cents,
//Jim
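A rough sketch of the zdb-and-text-tools approach Jim outlines - heavily hedged, since the exact verbosity level and blkptr print format vary between zdb builds, and on a large pool this runs for days:

    # dump the dataset's object/blkptr tree; at high verbosity the
    # indirect blocks are printed with their "cksum=..." fields
    zdb -ddddd tank/backups > /var/tmp/bptree.txt

    # count how many blocks share each checksum
    grep -o 'cksum=[0-9a-fx:]*' /var/tmp/bptree.txt |
        sort | uniq -c | sort -rn > /var/tmp/cksum-counts.txt

    # count == 1 lines are the unique blocks that only bloat the DDT;
    # counts > 1 are candidates for re-writing with dedup=on
    awk '$1 > 1 { dups++ } END { print dups, "duplicated checksums" }' \
        /var/tmp/cksum-counts.txt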
On Jan 20, 2013, at 8:16 AM, Edward Harvey <imaginative at nedharvey.com> wrote:

> But, by talking about it, we're just smoking pipe dreams. Cuz we all know zfs is developmentally challenged now. But one can dream...

I disagree that ZFS is developmentally challenged. There is more development now than ever in every way: # of developers, companies, OSes, KLOCs, features. Perhaps the level of maturity makes progress appear to be moving slower than it did in early life?
 -- richard
On 2013-01-20 17:16, Edward Harvey wrote:
> But, by talking about it, we're just smoking pipe dreams. Cuz we all know zfs is developmentally challenged now. But one can dream...

I beg to disagree. While most of my contribution has so far been about learning stuff and sharing it with others, as well as planting some new ideas and (hopefully, seen as constructively) doubting others - including the implementation we have now - and I have yet to see someone pick up my ideas and turn them into code (or prove why they are rubbish) - overall I can't say that development has stagnated by any metric of stagnation or activity.

Yes, maybe there were more "cool new things" per year popping up with Sun's concentrated engineering talent and financing, but now it seems that most players - wherever they work now - took a pause from the marathon, to refine what was done in the decade before. And this is just as important as churning out innovations faster than people can comprehend or audit or use them.

As a loud example of present active development, take the LZ4 quests completed by Saso recently. From what I gather, this is a single man's job done "on-line" in the view of fellow list members over a few months, almost like a reality show; and I guess anyone with enough concentration, time and devotion could do likewise. I suspect many of my proposals to the list might also take some half of a man-year to complete.

Unfortunately for the community and for part of myself, I now have some higher daily priorities, so I likely won't sit down and code lots of stuff in the nearest years (until that Priority goes to school, or so). Maybe that's why I'm eager to suggest quests for brilliant coders here who can complete the job better and faster than I ever would ;) So I'm doing the next best things I can do to help the progress :)

And I don't believe this is in vain, that development has ceased and my writings are only destined to be "stuffed under the carpet". Be it these RFEs or some others, better and more useful, I believe they shall be coded and published in common ZFS code. Sometime...

//Jim
On Sun, Jan 20, 2013 at 6:19 PM, Richard Elling <richard.elling at gmail.com> wrote:

> On Jan 20, 2013, at 8:16 AM, Edward Harvey <imaginative at nedharvey.com> wrote:
> > But, by talking about it, we're just smoking pipe dreams. Cuz we all
> > know zfs is developmentally challenged now. But one can dream...
>
> I disagree that ZFS is developmentally challenged. There is more development
> now than ever in every way: # of developers, companies, OSes, KLOCs, features.
> Perhaps the level of maturity makes progress appear to be moving slower
> than it did in early life?
>
> -- richard

Well, perhaps a part of it is marketing. Maturity isn't really an excuse for not having a long-term feature roadmap. It seems as though "maturity" in this case equals stagnation. What are the features being worked on that we aren't aware of? The big ones that come to mind, that everyone else is talking about for not just ZFS but openindiana as a whole and other storage platforms, would be:

1. SMB3 - hyper-v WILL be gaining market share over the next couple of years; not supporting it means giving up a sizeable portion of the market. Not to mention finally being able to run SQL (again) and Exchange on a fileshare.
2. VAAI support.
3. The long-sought bp-rewrite.
4. Full-drive encryption support.
5. Tiering (although I'd argue caching is superior, it's still a checkbox).

There's obviously more, but those are just the ones off the top of my head that others are supporting/working on. Again, it just feels like all the work is going into fixing bugs and refining what is there, not adding new features. Obviously Saso personally added features, but overall there don't seem to be a ton of announcements to the list about features that have been added or are being actively worked on. It feels like all these companies are just adding niche functionality they need that may or may not be getting pushed back to mainline.

/debbie-downer
On Jan 20, 2013, at 4:51 PM, Tim Cook <tim at cook.ms> wrote:

> On Sun, Jan 20, 2013 at 6:19 PM, Richard Elling <richard.elling at gmail.com> wrote:
> > On Jan 20, 2013, at 8:16 AM, Edward Harvey <imaginative at nedharvey.com> wrote:
> > > But, by talking about it, we're just smoking pipe dreams. Cuz we all
> > > know zfs is developmentally challenged now. But one can dream...
> >
> > I disagree that ZFS is developmentally challenged. There is more development
> > now than ever in every way: # of developers, companies, OSes, KLOCs, features.
> > Perhaps the level of maturity makes progress appear to be moving slower
> > than it did in early life?
> >
> > -- richard
>
> Well, perhaps a part of it is marketing.

A lot of it is marketing :-/

> Maturity isn't really an excuse for not having a long-term feature
> roadmap. It seems as though "maturity" in this case equals stagnation.
> What are the features being worked on we aren't aware of?

Most of the illumos-centric discussion is on the developer's list. The ZFSonLinux and BSD communities are also quite active. Almost none of the ZFS developers hang out on this zfs-discuss at opensolaris.org anymore. In fact, I wonder why I'm still here...

> The big ones that come to mind that everyone else is talking about for
> not just ZFS but openindiana as a whole and other storage platforms
> would be:
> 1. SMB3 - hyper-v WILL be gaining market share over the next couple
> years, not supporting it means giving up a sizeable portion of the
> market. Not to mention finally being able to run SQL (again) and
> Exchange on a fileshare.

I know of at least one illumos community company working on this. However, I do not know their public plans.

> 2. VAAI support.

VAAI has 4 features, 3 of which have been in illumos for a long time. The remaining feature (SCSI UNMAP) was done by Nexenta and exists in their NexentaStor product, but the CEO made a conscious (and unpopular) decision to keep that code from the community. Over the summer, another developer picked up the work in the community, but I've lost track of the progress and haven't seen an RTI yet.

> 3. the long-sought bp-rewrite.

Go for it!

> 4. full drive encryption support.

This is a key management issue mostly. Unfortunately, the open source code for handling this (trousers) covers much more than keyed disks and can be unwieldy. I'm not sure which distros picked up trousers, but it doesn't belong in the illumos-gate and it doesn't expose itself to ZFS.

> 5. tiering (although I'd argue caching is superior, it's still a checkbox).

You want to add tiering to the OS? That has been available for a long time via the (defunct?) SAM-QFS project that actually delivered code:
http://hub.opensolaris.org/bin/view/Project+samqfs/
If you want to add it to ZFS, that is a different conversation.
 -- richard

> There's obviously more, but those are just ones off the top of my head
> that others are supporting/working on. Again, it just feels like all the
> work is going into fixing bugs and refining what is there, not adding
> new features. Obviously Saso personally added features, but overall
> there don't seem to be a ton of announcements to the list about features
> that have been added or are being actively worked on. It feels like all
> these companies are just adding niche functionality they need that may
> or may not be getting pushed back to mainline.
>
> /debbie-downer

--
Richard.Elling at RichardElling.com
+1-760-896-4422
Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
2013-Jan-21 13:28 UTC
[zfs-discuss] RFE: Un-dedup for unique blocks
> From: Richard Elling [mailto:richard.elling at gmail.com]
>
> I disagree that ZFS is developmentally challenged.

As an IT consultant, 8 years ago before I heard of ZFS, it was always easy to sell Ontap, as long as it fit into the budget. 5 years ago, whenever I told customers about ZFS, it was always a quick easy sell. Nowadays, anybody who's heard of it says they don't want it, because they believe it's a dying product, and they're putting their bets on linux instead. I try to convince them otherwise, but I'm trying to buck the word on the street. They don't listen, however much sense I make. I can only sell ZFS to customers nowadays, who have still never heard of it.

"Developmentally challenged" doesn't mean there is no development taking place. It means the largest development effort is working closed-source, and not available for free (except some purposes), so some consumers are going to follow their path, while others are going to follow the open source branch illumos path, which means both disunity amongst developers and disunity amongst consumers, and incompatibility amongst products. So far, in the illumos branch, I've only seen bugfixes introduced since zpool 28, no significant introduction of new features. (Unlike the oracle branch, which is just as easy to sell as ontap.) Which presents a challenge. Hence the term, "challenged."

Right now, ZFS is the leading product as far as I'm concerned. Better than MS VSS, better than Ontap, better than BTRFS. It is my personal opinion that one day BTRFS will eclipse ZFS due to oracle's unsupportive strategy causing disparity and lowering consumer demand for zfs, but of course, that's just a personal opinion prediction for the future, which has yet to be seen. So far, every time I evaluate BTRFS, it fails spectacularly, but the last time I did was about a year ago. I'm due for a BTRFS re-evaluation now.
Zfs on linux (ZOL) has made some pretty impressive strides over the last year or so...
On 01/21/2013 02:28 PM, Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) wrote:
>> From: Richard Elling [mailto:richard.elling at gmail.com]
>>
>> I disagree that ZFS is developmentally challenged.
>
> As an IT consultant, 8 years ago before I heard of ZFS, it was always easy
> to sell Ontap, as long as it fit into the budget. 5 years ago, whenever I
> told customers about ZFS, it was always a quick easy sell. Nowadays,
> anybody who's heard of it says they don't want it, because they believe
> it's a dying product, and they're putting their bets on linux instead. I
> try to convince them otherwise, but I'm trying to buck the word on the street.
> They don't listen, however much sense I make. I can only sell ZFS to
> customers nowadays, who have still never heard of it.

Yes, Oracle did some serious damage to ZFS' and its own reputation. My former employer used to be an almost exclusively Sun shop. The moment Oracle took over and decided to tank the products aimed at our segment, we waved our beloved Sun hardware goodbye. Larry has clearly delineated his marketing strategy: either you're a Fortune 500, or you can fuck right off.

> "Developmentally challenged" doesn't mean there is no development taking place.
> It means the largest development effort is working closed-source, and not
> available for free (except some purposes), so some consumers are going to
> follow their path,

I would contest that point. Besides encryption (which I think was already well underway by the time Oracle took over), AFAIK nothing much improved in Oracle ZFS. Oracle only considers Sun a vehicle to sell its software products on (DB, ERP, CRM, etc.). Anything that doesn't fit into that strategy (e.g. Thumper) got butchered and thrown to the side.

> while others are going to follow the open source branch illumos path, which
> means both disunity amongst developers and disunity amongst consumers, and
> incompatibility amongst products.

I can't talk about "disunity" among devs (how would that manifest itself?), but as far as incompatibility among products goes, I've yet to come across it. In fact, thanks to ZFS feature flags, different feature sets can coexist peacefully and give admins unprecedented control over their storage pools. Version control in ZFS used to be a "take it or leave it" approach; now you can selectively enable and use only the features you want to.

> So far, in the illumos branch, I've only seen bugfixes introduced since
> zpool 28, no significant introduction of new features.

I've had #3035 (LZ4 compression for ZFS and GRUB) integrated just a few days back and I've got #3137 (L2ARC compression) up for review as we speak. Once #3137 integrates, I'm looking to focus on multi-MB record sizes next, and then perhaps taking a long, hard look at reducing the in-memory DDT footprint.

> (Unlike the oracle branch, which is just as easy to sell as ontap).

Again, what significant features did they add besides encryption? I'm not saying they didn't, I'm just not aware of that many.

> Which presents a challenge. Hence the term, "challenged."

Agreed, it is a challenge and needs to be taken seriously. We are up against a lot of money and man-hours invested by big-name companies, so I fully agree there. We need to rally ourselves as a community and hold together tightly.

> Right now, ZFS is the leading product as far as I'm concerned. Better
> than MS VSS, better than Ontap, better than BTRFS. It is my personal
> opinion that one day BTRFS will eclipse ZFS due to oracle's unsupportive
> strategy causing disparity and lowering consumer demand for zfs, but of
> course, that's just a personal opinion prediction for the future, which
> has yet to be seen. So far, every time I evaluate BTRFS, it fails
> spectacularly, but the last time I did was about a year ago. I'm due
> for a BTRFS re-evaluation now.

Let us know at zfs at lists.illumos.org how that goes; perhaps write a blog post about your observations. I'm sure the BTRFS folks came up with some neat ideas which we might learn from.

Cheers,
--
Saso
Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
2013-Jan-22 02:56 UTC
[zfs-discuss] RFE: Un-dedup for unique blocks
> From: Sašo Kiselkov [mailto:skiselkov.ml at gmail.com]
>
> as far as incompatibility among products, I've yet to come
> across it

I was talking about ... install solaris 11, and it's using a new version of zfs that's incompatible with anything else out there. And vice-versa. (Not sure if feature flags is the default, or zpool 28 is the default, in various illumos-based distributions. But my understanding is that once you upgrade to feature flags, you can't go back to 28. Which means, mutually, anything >28 is incompatible with each other.) You have to typically make a conscious decision and plan ahead, and intentionally go to zpool 28 and no higher, if you want compatibility between systems.

> Let us know at zfs at lists.illumos.org how that goes, perhaps write a blog
> post about your observations. I'm sure the BTRFS folks came up with some
> neat ideas which we might learn from.

Actually - I've written about it before (but it'll be difficult to find, and nothing earth shattering, so not worth the search.) I don't think there's anything that zfs developers don't already know. Basic stuff like fsck, and the ability to shrink and remove devices - those are the things btrfs has and zfs doesn't. (But there's lots more stuff that zfs has and btrfs doesn't. Just making sure my previous comment isn't seen as a criticism of zfs, or a judgement in favor of btrfs.)

And even with a new evaluation, the conclusion can't be completely clear, nor immediate. Last evaluation started about 10 months ago, and we kept it in production for several weeks or a couple of months, because it appeared to be doing everything well. (Except for features that were known to be not yet implemented, such as read-only snapshots (aka quotas) and the btrfs equivalent of "zfs send".) Problem was, the system was unstable, crashing about once a week. No clues why. We tried all sorts of things in kernel, hardware, drivers, with and without support, to diagnose and capture the cause of the crashes. Then one day, I took a blind stab in the dark (for the ninetieth time) and I reformatted the storage volume ext4 instead of btrfs. After that, no more crashes. That was approx 8 months ago.

I think the only things I could learn upon a new evaluation are: #1 I hear "btrfs send" is implemented now. I'd like to see it with my own eyes before I believe it. #2 I hear quotas (read-only snapshots) are implemented now. Again, I'd like to see it before I believe it. #3 Proven stability. Never seen it yet with btrfs. Want to see it with my own eyes and stand the test of time before it earns my trust.
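For what it's worth, pinning a pool at the portable on-disk version is already possible at creation time; a quick sketch, with the pool name and device names made up:

    # create a pool that stays at the last pre-feature-flags version,
    # readable by Solaris 11, illumos, FreeBSD and ZFSonLinux alike
    zpool create -o version=28 tank mirror c0t0d0 c0t1d0

    # check what an existing pool is at and what an upgrade would enable
    zpool upgrade tank
    zpool upgrade -v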
On 01/22/2013 03:56 AM, Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) wrote:
>> From: Sašo Kiselkov [mailto:skiselkov.ml at gmail.com]
>>
>> as far as incompatibility among products, I've yet to come
>> across it
>
> I was talking about ... install solaris 11, and it's using a new version
> of zfs that's incompatible with anything else out there. And vice-versa.

Wait, you're complaining about a closed-source vendor who made a conscious effort to fuck the rest of the community over? I think you're crying on the wrong shoulder - it wasn't the open ZFS community that pulled this dick move. Yes, you can argue that the customer isn't interested in politics, but unfortunately, there are some things that we simply can't do anything about - the ball is in Oracle's court on this one.

> (Not sure if feature flags is the default, or zpool 28 is the default,
> in various illumos-based distributions. But my understanding is that
> once you upgrade to feature flags, you can't go back to 28. Which means,
> mutually, anything >28 is incompatible with each other.) You have to
> typically make a conscious decision and plan ahead, and intentionally go
> to zpool 28 and no higher, if you want compatibility between systems.

Yes, feature flags is the default, simply because it is a way for open ZFS vendors to interoperate. Oracle is an important player in ZFS for sure, but we can't let their unwillingness to cooperate with others hold the whole community in stasis - that is actually what they would have wanted.

>> Let us know at zfs at lists.illumos.org how that goes, perhaps write a blog
>> post about your observations. I'm sure the BTRFS folks came up with some
>> neat ideas which we might learn from.
>
> Actually - I've written about it before (but it'll be difficult to find,
> and nothing earth shattering, so not worth the search.) I don't think
> there's anything that zfs developers don't already know. Basic stuff like
> fsck, and ability to shrink and remove devices, those are the things btrfs
> has and zfs doesn't. (But there's lots more stuff that zfs has and btrfs
> doesn't. Just making sure my previous comment isn't seen as a criticism
> of zfs, or a judgement in favor of btrfs.)

Well, I learned of the LZ4 compression algorithm in a benchmark comparison of ZFS, BTRFS and other filesystems' compression. Seeing that there were better things out there, I decided to try and push the state of ZFS compression ahead a little.

> And even with a new evaluation, the conclusion can't be completely clear,
> nor immediate. Last evaluation started about 10 months ago, and we kept
> it in production for several weeks or a couple of months, because it
> appeared to be doing everything well. (Except for features that were known
> to be not yet implemented, such as read-only snapshots (aka quotas) and
> the btrfs equivalent of "zfs send".) Problem was, the system was unstable,
> crashing about once a week. No clues why. We tried all sorts of things
> in kernel, hardware, drivers, with and without support, to diagnose and
> capture the cause of the crashes. Then one day, I took a blind stab in the
> dark (for the ninetieth time) and I reformatted the storage volume ext4
> instead of btrfs. After that, no more crashes. That was approx 8 months ago.

Even negative results are results. I'm sure the BTRFS devs would be interested in your crash dumps. Not saying that you are in any way obligated to provide them - just pointing out that perhaps you were hitting some snag that could have been resolved (or not).

> I think the only things I could learn upon a new evaluation are: #1 I hear
> "btrfs send" is implemented now. I'd like to see it with my own eyes before
> I believe it. #2 I hear quotas (read-only snapshots) are implemented now.
> Again, I'd like to see it before I believe it. #3 Proven stability. Never
> seen it yet with btrfs. Want to see it with my own eyes and stand the test
> of time before it earns my trust.

Do not underestimate these guys. They could have come up with a cool new feature that we haven't heard anything about at all. One of the things knocking around in my head ever since it was mentioned a while back on these mailing lists is a metadata-caching device, i.e. a small yet super-fast device that would allow you to store just the pool topology for very fast scrub/resilver. These are the sorts of things that I mean - they could have thought about filesystems in ways that haven't been done widely before. While BTRFS may be developmentally behind ZFS, one still has to have great respect for the intellect of its developers - these guys are not dumb.

Cheers,
--
Saso
On 01/21/13 17:03, Sašo Kiselkov wrote:
> Again, what significant features did they add besides encryption? I'm
> not saying they didn't, I'm just not aware of that many.

Just a few examples:

Solaris ZFS already has support for 1MB block size.

Support for SCSI UNMAP - both issuing it and honoring it when it is the backing store of an iSCSI target.

It also has a lot of performance improvements and general bug fixes in the Solaris 11.1 release.

--
Darren J Moffat
On 22 January, 2013 - Darren J Moffat sent me these 0,6K bytes:

> On 01/21/13 17:03, Sašo Kiselkov wrote:
>> Again, what significant features did they add besides encryption? I'm
>> not saying they didn't, I'm just not aware of that many.
>
> Just a few examples:
>
> Solaris ZFS already has support for 1MB block size.
>
> Support for SCSI UNMAP - both issuing it and honoring it when it is the
> backing store of an iSCSI target.

Would this apply to, say, a SATA SSD used as ZIL? (which we have, a vertex2ex with supercap)

/Tomas
--
Tomas Forsman, stric at acc.umu.se, http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se
On 01/22/2013 12:30 PM, Darren J Moffat wrote:
> On 01/21/13 17:03, Sašo Kiselkov wrote:
>> Again, what significant features did they add besides encryption? I'm
>> not saying they didn't, I'm just not aware of that many.
>
> Just a few examples:
>
> Solaris ZFS already has support for 1MB block size.

Working on that as we speak. I'll see your 1MB and raise you another 7 :P

> Support for SCSI UNMAP - both issuing it and honoring it when it is the
> backing store of an iSCSI target.

AFAIK, the first isn't in Illumos' ZFS, while the latter one is (though I might be mistaken). In any case, interesting features.

> It also has a lot of performance improvements and general bug fixes in
> the Solaris 11.1 release.

Performance improvements such as?

Cheers,
--
Saso
On 01/22/13 11:57, Tomas Forsman wrote:
> On 22 January, 2013 - Darren J Moffat sent me these 0,6K bytes:
>
>> On 01/21/13 17:03, Sašo Kiselkov wrote:
>>> Again, what significant features did they add besides encryption? I'm
>>> not saying they didn't, I'm just not aware of that many.
>>
>> Just a few examples:
>>
>> Solaris ZFS already has support for 1MB block size.
>>
>> Support for SCSI UNMAP - both issuing it and honoring it when it is the
>> backing store of an iSCSI target.
>
> Would this apply to, say, a SATA SSD used as ZIL? (which we have, a
> vertex2ex with supercap)

If the device advertises the UNMAP feature and you are running Solaris 11.1, it should attempt to use it.

--
Darren J Moffat
Maybe 'shadow migration'? (e.g. zfs create -o shadow=nfs://server/dir pool/newfs)

Michel

> On 01/21/13 17:03, Sašo Kiselkov wrote:
>> Again, what significant features did they add besides encryption? I'm
>> not saying they didn't, I'm just not aware of that many.
>
> Just a few examples:
>
> Solaris ZFS already has support for 1MB block size.
>
> Support for SCSI UNMAP - both issuing it and honoring it when it is
> the backing store of an iSCSI target.
>
> It also has a lot of performance improvements and general bug fixes
> in the Solaris 11.1 release.
>
> --
> Darren J Moffat

Michel Jansens
mjansens at ulb.ac.be
On 01/22/13 13:20, Michel Jansens wrote:
>
> Maybe 'shadow migration'? (e.g. zfs create -o shadow=nfs://server/dir
> pool/newfs)

That isn't really a ZFS feature, since it happens at the VFS layer. The ZFS support there is really about getting the options passed through and checking status, but the core of the work happens at the VFS layer. Shadow migration works with UFS as well!

Since I'm replying, here are a few others that have been introduced in Solaris 11 or 11.1:

There is also the new improved ZFS share syntax for NFS and CIFS in Solaris 11.1, where you can much more easily inherit and also override individual share properties.

There are improved diagnostics rules.

ZFS support for Immutable Zones (mostly a VFS feature) & Extended (privilege) Policy, and aliasing of datasets in Zones (so you don't see the part of the dataset hierarchy above the bit delegated to the zone).

UEFI GPT label support for root pools, with GRUB2 and on SPARC with OBP.

New "sensitive" per-file flag.

Various ZIL and ARC performance improvements.

Preallocated ZVOLs - for swap/dump.

> Michel
>
>> On 01/21/13 17:03, Sašo Kiselkov wrote:
>>> Again, what significant features did they add besides encryption? I'm
>>> not saying they didn't, I'm just not aware of that many.
>>
>> Just a few examples:
>>
>> Solaris ZFS already has support for 1MB block size.
>>
>> Support for SCSI UNMAP - both issuing it and honoring it when it is
>> the backing store of an iSCSI target.
>>
>> It also has a lot of performance improvements and general bug fixes in
>> the Solaris 11.1 release.
>>
>> --
>> Darren J Moffat
>
> Michel Jansens
> mjansens at ulb.ac.be

--
Darren J Moffat
On 01/22/2013 02:20 PM, Michel Jansens wrote:
>
> Maybe 'shadow migration'? (e.g. zfs create -o shadow=nfs://server/dir
> pool/newfs)

Hm, interesting, so it works as a sort of replication system, except that the data needs to be read-only and you can start accessing it on the target before the initial sync. Did I get that right?

--
Saso
On 01/22/13 13:29, Sašo Kiselkov wrote:
> On 01/22/2013 02:20 PM, Michel Jansens wrote:
>>
>> Maybe 'shadow migration'? (e.g. zfs create -o shadow=nfs://server/dir
>> pool/newfs)
>
> Hm, interesting, so it works as a sort of replication system, except
> that the data needs to be read-only and you can start accessing it on
> the target before the initial sync. Did I get that right?

The source filesystem needs to be read-only. It works at the VFS layer, so it doesn't copy snapshots or clones over. Once mounted, it appears as if all the original data is instantly there. There is an (optional) shadowd that pushes the migration along, but it will complete on its own anyway. shadowstat(1M) gives information on the status of the migrations.

--
Darren J Moffat
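A sketch of the workflow as I understand it from the documentation - host, share and dataset names below are made up:

    # on the old server: the source must be shared read-only
    share -F nfs -o ro /export/olddata

    # on the Solaris 11.x box: create the shadowing dataset
    zfs create -o shadow=nfs://oldserver/export/olddata tank/newdata

    # tank/newdata is usable immediately; watch migration progress with
    shadowstat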
On 01/22/13 13:29, Darren J Moffat wrote:
> Since I'm replying, here are a few others that have been introduced in
> Solaris 11 or 11.1.

...and another one I can't believe I missed, since I was one of the people that helped design it and I did the code review...

Per-file sensitivity labels for TX (Trusted Extensions) configurations.

And I'm sure I'm still missing stuff that is in Solaris 11 and 11.1.

--
Darren J Moffat
On 01/22/2013 02:39 PM, Darren J Moffat wrote:
>
> On 01/22/13 13:29, Darren J Moffat wrote:
>> Since I'm replying, here are a few others that have been introduced in
>> Solaris 11 or 11.1.
>
> ...and another one I can't believe I missed, since I was one of the people
> that helped design it and I did the code review...
>
> Per-file sensitivity labels for TX (Trusted Extensions) configurations.

Can you give some details on that? Google searches are turning up pretty dry.

Cheers,
--
Saso
Casper.Dik at oracle.com
2013-Jan-22 14:30 UTC
[zfs-discuss] RFE: Un-dedup for unique blocks
> On 01/22/2013 02:39 PM, Darren J Moffat wrote:
>>
>> On 01/22/13 13:29, Darren J Moffat wrote:
>>> Since I'm replying, here are a few others that have been introduced in
>>> Solaris 11 or 11.1.
>>
>> ...and another one I can't believe I missed, since I was one of the people
>> that helped design it and I did the code review...
>>
>> Per-file sensitivity labels for TX (Trusted Extensions) configurations.
>
> Can you give some details on that? Google searches are turning up pretty dry.

Start here:
http://docs.oracle.com/cd/E26502_01/html/E29017/managefiles-1.html#scrolltoc

Look for "multilevel datasets".

Casper
On Mon, 21 Jan 2013, Jim Klimov wrote:
>
> Yes, maybe there were more "cool new things" per year popping up
> with Sun's concentrated engineering talent and financing, but now
> it seems that most players - wherever they work now - took a pause
> from the marathon, to refine what was done in the decade before.
> And this is just as important as churning out innovations faster
> than people can comprehend or audit or use them.

I am on most of the mailing lists where zfs is discussed and it is clear that significant issues/bugs are continually being discovered and fixed. Fixes come from both the Illumos community and from outside it (e.g. from FreeBSD). Zfs is already quite feature rich. Many of us would lobby for bug fixes and performance improvements over "features". Sašo Kiselkov's LZ4 compression additions may qualify as "features", yet they also offer rather profound performance improvements.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
2013-Jan-22 15:32 UTC
[zfs-discuss] RFE: Un-dedup for unique blocks
> From: Darren J Moffat [mailto:darrenm at opensolaris.org]
>
> Support for SCSI UNMAP - both issuing it and honoring it when it is the
> backing store of an iSCSI target.

When I search for SCSI UNMAP, I come up with all sorts of documentation that ... is ... like reading a medical journal when all you want to know is the conversion from 98.6F to C.

Would you mind momentarily describing what SCSI UNMAP is used for? If I were describing it to a customer (CEO, CFO), I'm not going to tell them about SCSI UNMAP; I'm going to say the new system has a new feature that enables ... or solves the ___ problem...

The customer doesn't *necessarily* have to be as clueless as a CEO/CFO. Perhaps just another IT person, or whatever.
On 01/22/2013 04:32 PM, Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) wrote:
>> From: Darren J Moffat [mailto:darrenm at opensolaris.org]
>>
>> Support for SCSI UNMAP - both issuing it and honoring it when it is the
>> backing store of an iSCSI target.
>
> When I search for SCSI UNMAP, I come up with all sorts of documentation
> that ... is ... like reading a medical journal when all you want to know
> is the conversion from 98.6F to C.
>
> Would you mind momentarily describing what SCSI UNMAP is used for? If I
> were describing it to a customer (CEO, CFO), I'm not going to tell them
> about SCSI UNMAP; I'm going to say the new system has a new feature that
> enables ... or solves the ___ problem...
>
> The customer doesn't *necessarily* have to be as clueless as a CEO/CFO.
> Perhaps just another IT person, or whatever.

SCSI UNMAP is a feature of the SCSI protocol that is used with SSDs to signal that a given data block is no longer in use by the filesystem and may be erased.

TL;DR: It makes writing to flash faster. Flash write latency degrades with time; this prevents it from happening. Keep in mind that this is only important for sync-write workloads (e.g. databases, NFS, etc.), not async-write workloads (file servers, bulk storage). For ZFS this is a win if you're using a flash-based slog (ZIL) device. You can entirely side-step this issue (and performance-sensitive applications often do) by placing the slog on a device not based on flash, e.g. DDRDrive X1, ZeusRAM, etc.

THE DETAILS: As you may know, flash memory cells, by design, cannot be overwritten. They can only be read (very fast), written when they are empty (called "programming", still quite fast) or erased (slow as hell). To implement overwriting, when a flash controller detects an attempt to overwrite an already programmed flash cell, it instead holds the write while it erases the block first (which takes a lot of time), and only then programs it with the new data.

Before SCSI UNMAP (also called TRIM in SATA) filesystems had no way of telling the underlying flash device that a given block of data had been freed (e.g. due to a user deleting a file). So sooner or later, a filesystem used up all the empty blocks on the flash device and essentially every write had to first erase some flash blocks to complete. This impacts synchronous I/O write latency (e.g. ZIL, sync database I/O, etc.).

With UNMAP, a filesystem can preemptively tell the flash controller that a given data block is no longer needed and the flash controller can, at its leisure, pre-erase it. Thus, as long as you have free space on your filesystem, most, if not all, of your writes will be direct program writes, not erase-then-program.

Cheers,
--
Saso
Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) wrote:
>> From: Darren J Moffat [mailto:darrenm at opensolaris.org]
>>
>> Support for SCSI UNMAP - both issuing it and honoring it when it is the
>> backing store of an iSCSI target.
>
> When I search for SCSI UNMAP, I come up with all sorts of documentation
> that ... is ... like reading a medical journal when all you want to know
> is the conversion from 98.6F to C.
>
> Would you mind momentarily describing what SCSI UNMAP is used for? If I
> were describing it to a customer (CEO, CFO), I'm not going to tell them
> about SCSI UNMAP; I'm going to say the new system has a new feature that
> enables ... or solves the ___ problem...
>
> The customer doesn't *necessarily* have to be as clueless as a CEO/CFO.
> Perhaps just another IT person, or whatever.

SCSI UNMAP (or SATA TRIM) is a means of telling a storage device that some blocks are no longer needed. (This might be because a file has been deleted in the filesystem on the device.)

In the case of a flash device, it can optimise usage by knowing this, e.g. it can perhaps perform a background erase on the real blocks so they're ready for reuse sooner, and/or better optimise wear leveling by having more spare space to play with. There are some devices in which this enables the device to improve its lifetime by performing better wear leveling when having more spare space. It can also help by avoiding some read-modify-write operations, if the device knows the data in the rest of the 4k block is no longer needed.

In the case of an iSCSI LUN target, these blocks no longer need to be archived, and if sparse space allocation is in use, the space they occupied can be freed off. In the particular case of ZFS provisioning the iSCSI LUN (COMSTAR), you might get performance improvements by having more free space to play with during other write operations, allowing better storage layout optimisation.

So, the bottom line is longer life for SSDs (maybe higher performance too, if there's less waiting for erases during writes), and better space utilisation and performance for a ZFS COMSTAR target.

--
Andrew Gabriel
On 01/22/13 15:32, Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) wrote:
>> From: Darren J Moffat [mailto:darrenm at opensolaris.org]
>>
>> Support for SCSI UNMAP - both issuing it and honoring it when it is the
>> backing store of an iSCSI target.
>
> Would you mind momentarily describing what SCSI UNMAP is used for? If I
> were describing it to a customer (CEO, CFO), I'm not going to tell them
> about SCSI UNMAP; I'm going to say the new system has a new feature that
> enables ... or solves the ___ problem...

It is a mechanism for the part of the storage system above the "disk" (e.g. ZFS) to inform the "disk" that it is no longer using a given set of blocks.

This is useful when using an SSD - see Saso's excellent response on that.

However, it can also be very useful when your "disk" is an iSCSI LUN. It allows the filesystem layer (e.g. ZFS or NTFS, etc.), when on an iSCSI LUN that advertises SCSI UNMAP, to tell the target that there are blocks in that LUN it isn't using any more (e.g. it just deleted some blocks). This means you can get more accurate space usage when using things like iSCSI.

ZFS in Solaris 11.1 issues SCSI UNMAP to devices that support it, and ZVOLs exported over COMSTAR advertise it too. In the iSCSI case it is mostly about improved space accounting and utilisation. This is particularly interesting with ZFS when snapshots and clones of ZVOLs come into play.

Some vendors call this (and things like it) "Thin Provisioning"; I'd say it is more "accurate communication between 'disk' and filesystem" about in-use blocks.

--
Darren J Moffat
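Purely illustrative - a sparse zvol exported over COMSTAR is the setup where this pays off; the volume name, size and LU GUID below are made up:

    # thin-provisioned (sparse) 500G volume backing an iSCSI LUN
    zfs create -s -V 500g tank/lun0

    # export it via COMSTAR; substitute the LU GUID that create-lu prints
    stmfadm create-lu /dev/zvol/rdsk/tank/lun0
    stmfadm add-view 600144F0...        # GUID printed by create-lu
    itadm create-target

    # with UNMAP flowing end to end, deletions on the initiator side
    # eventually show up as freed space under 'zfs list'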
Casper.Dik at oracle.com
2013-Jan-22 16:00 UTC
[zfs-discuss] RFE: Un-dedup for unique blocks
> Some vendors call this (and things like it) "Thin Provisioning"; I'd say
> it is more "accurate communication between 'disk' and filesystem" about
> in-use blocks.

In some cases, users of disks are charged by bytes in use; when not using SCSI UNMAP, a set of disks used for a zpool will in the end be charged for the whole reservation; this becomes costly when your standard usage is much less than your peak usage.

Thin provisioning can now be used for zpools as long as the underlying LUNs have support for SCSI UNMAP.

Casper
On 01/22/2013 05:00 PM, Casper.Dik at oracle.com wrote:
>> Some vendors call this (and things like it) "Thin Provisioning"; I'd say
>> it is more "accurate communication between 'disk' and filesystem" about
>> in-use blocks.
>
> In some cases, users of disks are charged by bytes in use; when not using
> SCSI UNMAP, a set of disks used for a zpool will in the end be charged for
> the whole reservation; this becomes costly when your standard usage is
> much less than your peak usage.
>
> Thin provisioning can now be used for zpools as long as the underlying
> LUNs have support for SCSI UNMAP.

Looks like an interesting technical solution to a political problem :D

Cheers,
--
Saso
On 01/22/13 16:02, Sašo Kiselkov wrote:
> On 01/22/2013 05:00 PM, Casper.Dik at oracle.com wrote:
>>> Some vendors call this (and things like it) "Thin Provisioning"; I'd say
>>> it is more "accurate communication between 'disk' and filesystem" about
>>> in-use blocks.
>>
>> In some cases, users of disks are charged by bytes in use; when not using
>> SCSI UNMAP, a set of disks used for a zpool will in the end be charged for
>> the whole reservation; this becomes costly when your standard usage is
>> much less than your peak usage.
>>
>> Thin provisioning can now be used for zpools as long as the underlying
>> LUNs have support for SCSI UNMAP.
>
> Looks like an interesting technical solution to a political problem :D

There is also a technical problem: if you can't inform the backing store that you no longer need the blocks, it can't free them either, so they get stuck in snapshots unnecessarily.

--
Darren J Moffat
On 01/22/2013 05:34 PM, Darren J Moffat wrote:
>
> On 01/22/13 16:02, Sašo Kiselkov wrote:
>> On 01/22/2013 05:00 PM, Casper.Dik at oracle.com wrote:
>>> Thin provisioning can now be used for zpools as long as the underlying
>>> LUNs have support for SCSI UNMAP.
>>
>> Looks like an interesting technical solution to a political problem :D
>
> There is also a technical problem: if you can't inform the backing store
> that you no longer need the blocks, it can't free them either, so they
> get stuck in snapshots unnecessarily.

Yes, I understand the technical merit of the solution. I'm just amused that a noticeable side effect is lower licensing costs (by which I don't of course mean that the issue is unimportant, just that I find it interesting what the world has come to) - I'm not trying to ridicule.

Cheers,
--
Saso
On 2013-01-22 14:29, Darren J Moffat wrote:
> Preallocated ZVOLs - for swap/dump.

Sounds like something I proposed on these lists, too ;)

Does this preallocation only mean filling an otherwise ordinary ZVOL with zeroes (or some other pattern) - and if so, to what effect? Or is it also supported to disable COW for such datasets, so that the preallocated swap/dump zvols might remain contiguous on the faster tracks of the drive (i.e. like a dedicated partition, but with benefits of ZFS checksums and maybe compression)?

Thanks,
//Jim
Darren J Moffat wrote:
> It is a mechanism for part of the storage system above the "disk" (eg
> ZFS) to inform the "disk" that it is no longer using a given set of
> blocks.
>
> This is useful when using an SSD - see Saso's excellent response on that.
>
> However it can also be very useful when your "disk" is an iSCSI LUN. It
> allows the filesystem layer (eg ZFS or NTFS, etc) when on an iSCSI LUN
> that advertises SCSI UNMAP to tell the target there are blocks in that
> LUN it isn't using any more (eg it just deleted some blocks).

That is something I have been waiting a long time for! I have to run a
periodic "fill the pool with zeros" cycle on a couple of iSCSI-backed
pools to reclaim free space.

I guess the big question is: do Oracle storage appliances advertise SCSI
UNMAP?

--
Ian.
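For illustration, a minimal sketch of the kind of zero-fill reclaim cycle
Ian describes, assuming the pool is mounted at /tank, compression is off
on that dataset, and the backing array reclaims zeroed regions (the path
and file name are made up for the example):

  # Fill the free space with zeros so the iSCSI backend can detect and
  # reclaim it; dd exits with "No space left on device" when the pool is
  # full, which is expected here.
  dd if=/dev/zero of=/tank/zerofill bs=1024k
  sync
  # Delete the filler file to give the space back to the pool.
  rm /tank/zerofill

With SCSI UNMAP advertised by the LUN, this whole cycle becomes
unnecessary, since ZFS can tell the target directly which blocks it has
freed.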
On 01/22/2013 10:45 PM, Jim Klimov wrote:
> On 2013-01-22 14:29, Darren J Moffat wrote:
>> Preallocated ZVOLs - for swap/dump.
>
> Or is it also supported to disable COW for such datasets, so that
> the preallocated swap/dump zvols might remain contiguous on the
> faster tracks of the drive (i.e. like a dedicated partition, but
> with benefits of ZFS checksums and maybe compression)?

I highly doubt it, as it breaks one of the fundamental design principles
behind ZFS (always maintain transactional consistency). Also,
contiguousness and compression are fundamentally at odds (contiguousness
requires each block to remain the same length regardless of contents,
compression varies block length depending on the entropy of the contents).

Cheers,
--
Saso
On 2013-01-22 23:03, Sašo Kiselkov wrote:
> On 01/22/2013 10:45 PM, Jim Klimov wrote:
>> On 2013-01-22 14:29, Darren J Moffat wrote:
>>> Preallocated ZVOLs - for swap/dump.
>>
>> Or is it also supported to disable COW for such datasets, so that
>> the preallocated swap/dump zvols might remain contiguous on the
>> faster tracks of the drive (i.e. like a dedicated partition, but
>> with benefits of ZFS checksums and maybe compression)?
>
> I highly doubt it, as it breaks one of the fundamental design principles
> behind ZFS (always maintain transactional consistency). Also,
> contiguousness and compression are fundamentally at odds (contiguousness
> requires each block to remain the same length regardless of contents,
> compression varies block length depending on the entropy of the contents).

Well, dump and swap devices are kind of special in that they need
verifiable storage (i.e. detectable to have no bit-errors) but not really
consistency as in sudden-power-off transaction protection. Both have a
lifetime span of a single system uptime - like L2ARC, for example - and
will be reused anew afterwards - after a reboot, a power surge, or a
kernel panic.

So while the metadata used to address the swap ZVOL contents may and
should be subject to common ZFS transactions and COW and so on, and jump
around the disk along with rewrites of blocks, the ZVOL userdata itself
may as well occupy the same positions on the disk, I think, rewriting
older stuff. With mirroring likely in place as well as checksums, there
are other ways than COW to ensure that the swap (or at least some
component thereof) contains what it should, even with intermittent errors
of some component devices.

Likewise, the swap/dump breed of zvols shouldn't really have snapshots,
especially not automatic ones (and the installer should take care of this
at least for the two zvols it creates) ;)

Compression for swap is an interesting matter... for example, how should
it be accounted? As dynamic expansion and/or shrinking of available swap
space (or just of the space needed to store it)? If the latter, and we
still intend to preallocate and guarantee that the swap has its
administratively predefined amount of gigabytes, compressed blocks can be
aligned on the same starting locations as if they were not compressed. In
effect this would just decrease the bandwidth requirements, maybe. For
dump this might be just a bulky compressed write from start to however
much it needs, within the preallocated psize limits...

//Jim
IIRC dump is special.

As for swap... really, you don't want to swap. If you're swapping you have
problems. Any swap space you have is to help you detect those problems and
correct them before apps start getting ENOMEM.

There *are* exceptions to this, such as Varnish. For Varnish and any other
apps like it I'd dedicate an entire flash drive to it, no ZFS, no nothing.

Nico
--
On 2013-01-22 23:32, Nico Williams wrote:
> IIRC dump is special.
>
> As for swap... really, you don't want to swap. If you're swapping you
> have problems. Any swap space you have is to help you detect those
> problems and correct them before apps start getting ENOMEM. There
> *are* exceptions to this, such as Varnish. For Varnish and any other
> apps like it I'd dedicate an entire flash drive to it, no ZFS, no
> nothing.

I know of this stance, and in general you're right. But... ;)

Sometimes there are once-in-a-longtime tasks that might require enormous
virtual memory that you wouldn't normally provision proper hardware for
(RAM, SSD), and/or cases when you have to run similarly greedy tasks on
hardware with limited specs (i.e. a home PC capped at 8GB RAM).

As an example I might think of a ZDB walk taking about 35-40GB of VM on my
box. This is not something I do every month, but when I do - I need it to
complete, regardless of the fact that I have 5 times less RAM on that box
(and the kernel's equivalent of that walk fails with scanrate hell because
it can't swap, btw).

On the other hand, there are tasks like VirtualBox which "require" swap to
be configured in amounts equivalent to the VM RAM size, but don't really
swap (most of the time). Setting aside SSDs for this task might be too
expensive if they are never to be used in real practice. But this point is
more of a task for swap device tiering (like Linux swap priorities), as I
proposed earlier last year...

//Jim
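A minimal sketch of provisioning temporary swap for the kind of one-off
task Jim describes (the dataset name rpool/swaptmp and the 48g size are
made up for the example):

  # Add a temporary swap zvol for a memory-hungry one-off job.
  zfs create -V 48g rpool/swaptmp
  swap -a /dev/zvol/dsk/rpool/swaptmp
  # ... run the zdb walk, then remove the extra swap again:
  swap -d /dev/zvol/dsk/rpool/swaptmp
  zfs destroy rpool/swaptmp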
On 01/22/2013 11:22 PM, Jim Klimov wrote:
> On 2013-01-22 23:03, Sašo Kiselkov wrote:
>> On 01/22/2013 10:45 PM, Jim Klimov wrote:
>>> On 2013-01-22 14:29, Darren J Moffat wrote:
>>>> Preallocated ZVOLs - for swap/dump.
>>>
>>> Or is it also supported to disable COW for such datasets, so that
>>> the preallocated swap/dump zvols might remain contiguous on the
>>> faster tracks of the drive (i.e. like a dedicated partition, but
>>> with benefits of ZFS checksums and maybe compression)?
>>
>> I highly doubt it, as it breaks one of the fundamental design principles
>> behind ZFS (always maintain transactional consistency). Also,
>> contiguousness and compression are fundamentally at odds (contiguousness
>> requires each block to remain the same length regardless of contents,
>> compression varies block length depending on the entropy of the
>> contents).
>
> Well, dump and swap devices are kind of special in that they need
> verifiable storage (i.e. detectable to have no bit-errors) but not
> really consistency as in sudden-power-off transaction protection.

I get your point, but I would argue that if you are willing to preallocate
storage for these, then putting dump/swap on an iSCSI LUN as opposed to
having it locally is kind of pointless anyway. Since they are used rarely,
having them "thin provisioned" is probably better in an iSCSI environment
than wasting valuable network-storage resources on something you rarely
need.

> Both have a lifetime span of a single system uptime - like L2ARC,
> for example - and will be reused anew afterwards - after a reboot,
> a power surge, or a kernel panic.

For the record, the L2ARC is not transactionally consistent. It uses a
completely different allocation strategy from the main pool (essentially a
simple rotor). Besides, if you plan to shred your dump contents after
reboot anyway, why fat-provision them? I can understand swap, but dump?

> So while the metadata used to address the swap ZVOL contents may and
> should be subject to common ZFS transactions and COW and so on,
> and jump around the disk along with rewrites of blocks, the ZVOL
> userdata itself may as well occupy the same positions on the disk,
> I think, rewriting older stuff. With mirroring likely in place as
> well as checksums, there are other ways than COW to ensure that
> the swap (or at least some component thereof) contains what it should,
> even with intermittent errors of some component devices.

You don't understand: the transactional integrity in ZFS isn't just to
protect the data you put in, it's also meant to protect ZFS' internal
structure (i.e. the metadata). This includes the layout of your zvols
(which are also just another dataset). I understand that you want to view
this kind of fat-provisioned zvol as a simple contiguous container block,
but it is probably more hassle to implement than it's worth.

> Likewise, the swap/dump breed of zvols shouldn't really have snapshots,
> especially not automatic ones (and the installer should take care
> of this at least for the two zvols it creates) ;)

If you are talking about the standard opensolaris-style boot environments,
then yes, this is taken into account. Your BE lives under rpool/ROOT,
while swap and dump are rpool/swap and rpool/dump respectively (both
thin-provisioned, since they are rarely needed).

> Compression for swap is an interesting matter... for example, how
> should it be accounted? As dynamic expansion and/or shrinking of
> available swap space (or just of the space needed to store it)?

Since compression occurs way below the dataset layer, your zvol capacity
doesn't change with compression, even though how much space it actually
uses in the pool can. A zvol's capacity pertains to its logical
attributes, i.e. most importantly the maximum byte offset within it
accessible to an application (in this case, swap). How the underlying
blocks are actually stored and how much space they take up is up to the
lower layers.

> If the latter, and we still intend to preallocate and guarantee
> that the swap has its administratively predefined amount of
> gigabytes, compressed blocks can be aligned on the same starting
> locations as if they were not compressed. In effect this would
> just decrease the bandwidth requirements, maybe.

But you forget that a compressed block's physical size fundamentally
depends on its contents. That's why compressed zvols still appear the same
size as before. What changes is how much space they occupy on the
underlying pool.

> For dump this might be just a bulky compressed write from start
> to however much it needs, within the preallocated psize limits...

I hope you now understand the distinction between the logical size of a
zvol and its actual in-pool size. We can't tie one to the other, since it
would result in unpredictable behavior for the application (write one set
of data, get capacity X; write another set, get capacity Y - how do you
determine in advance how much fits in? You can't).

Cheers,
--
Saso
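The distinction is easy to see on a live system; a quick sketch
(rpool/swap is just an example dataset):

  # volsize stays fixed - it is the logical capacity the consumer sees -
  # while used/referenced and compressratio reflect what the pool
  # actually stores for the zvol.
  zfs get volsize,used,referenced,compressratio rpool/swap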
Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
2013-Jan-22 23:54 UTC
[zfs-discuss] RFE: Un-dedup for unique blocks
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Nico Williams
>
> As for swap... really, you don't want to swap. If you're swapping you
> have problems.

For clarification, the above is true in Solaris and derivatives, but it's
not universally true for all OSes. I'll cite Linux as the example, because
I know it.

If you provide swap to a Linux kernel, it considers this a degree of
freedom when choosing between evicting data from the cache and swapping
out idle (or zombie) processes. As long as you swap out idle process
memory that is colder than some cache memory, swap actually improves
performance. But if you have any active process starved of RAM and
consequently thrashing swap, then of course you're right: it's bad bad bad
to use swap that way.

In Solaris, I've never seen it swap out idle processes; I've only seen it
use swap for the bad bad bad situation. I assume that's all it can do with
swap.
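On Linux this trade-off is tunable; a rough sketch (the value shown is
only an example):

  # vm.swappiness controls how eagerly the kernel pages out idle anonymous
  # memory in order to keep page cache: higher values favour swapping out
  # idle process pages, lower values favour dropping cache first.
  cat /proc/sys/vm/swappiness
  sysctl -w vm.swappiness=60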
The discussion suddenly gets hot and interesting - albeit quite diverged
from the original topic ;)

First of all, as a disclaimer: when I earlier proposed such changes to
datasets for swap (and maybe dump) use, I explicitly proposed that this be
a new dataset type - alongside the zvol, fs and snapshot types we have
today. Granted, this distinction was lost in today's exchange of words,
but it is still an important one - especially since it means that while
the basic ZFS (or rather zpool) rules are maintained, the dataset rules
might be redefined ;)

I'll try to reply to a few points below, snipping a lot of older text.

>> Well, dump and swap devices are kind of special in that they need
>> verifiable storage (i.e. detectable to have no bit-errors) but not
>> really consistency as in sudden-power-off transaction protection.
>
> I get your point, but I would argue that if you are willing to
> preallocate storage for these, then putting dump/swap on an iSCSI LUN as
> opposed to having it locally is kind of pointless anyway. Since they are
> used rarely, having them "thin provisioned" is probably better in an
> iSCSI environment than wasting valuable network-storage resources on
> something you rarely need.

I am not sure what in my post led you to think that I meant iSCSI or
otherwise networked storage to keep swap and dump. Some servers have local
disks, you know - and in networked storage environments the local disks
are only used to keep the OS image, swap and dump ;)

> Besides, if you plan to shred your dump contents after
> reboot anyway, why fat-provision them? I can understand swap, but dump?

To guarantee that the space is there... Given the recent mischiefs with
dumping (i.e. the context is quite stripped compared to general kernel
work, so multithreading broke somehow), I guess that pre-provisioned
sequential areas might also reduce some risks... though likely not -
random metadata would still have to get into the pool.

> You don't understand: the transactional integrity in ZFS isn't just to
> protect the data you put in, it's also meant to protect ZFS' internal
> structure (i.e. the metadata). This includes the layout of your zvols
> (which are also just another dataset). I understand that you want to
> view this kind of fat-provisioned zvol as a simple contiguous
> container block, but it is probably more hassle to implement than it's
> worth.

I'd argue that transactional integrity in ZFS primarily protects metadata,
so that there is a tree of always-current block pointers. There is this
octopus of a block-pointer tree whose leaf nodes point to data blocks -
but only as DVAs and checksums, basically. Nothing really requires data to
be (or not be) COWed and stored at a different location than the previous
version of the block at the same logical offset, as far as the data
consumers (FS users, zvol users) are concerned, except that we want that
data to be readable even after a catastrophic pool close (system crash,
poweroff, etc.). We don't (AFAIK) have such a requirement for swap. If the
pool which contained swap kicked the bucket, we probably have a larger
problem whose solution will likely involve a reboot and thus recycling of
all swap data. And for single-device errors with (contiguous) preallocated
unrelocatable swap, we can protect with mirrors and checksums (used upon
read, within the same uptime that wrote the bits).

>> Likewise, the swap/dump breed of zvols shouldn't really have snapshots,
>> especially not automatic ones (and the installer should take care
>> of this at least for the two zvols it creates) ;)
>
> If you are talking about the standard opensolaris-style boot
> environments, then yes, this is taken into account. Your BE lives
> under rpool/ROOT, while swap and dump are rpool/swap and rpool/dump
> respectively (both thin-provisioned, since they are rarely needed).

I meant the attribute for the zfs-auto-snapshot service, i.e.:

  rpool/swap  com.sun:auto-snapshot  false  local

As I wrote, I'd argue that for "new" swap (and maybe dump) datasets the
snapshot action should not even be implemented.

>> Compression for swap is an interesting matter... for example, how
>> should it be accounted? As dynamic expansion and/or shrinking of
>> available swap space (or just of the space needed to store it)?
>
> Since compression occurs way below the dataset layer, your zvol capacity
> doesn't change with compression, even though how much space it actually
> uses in the pool can. A zvol's capacity pertains to its logical
> attributes, i.e. most importantly the maximum byte offset within it
> accessible to an application (in this case, swap). How the underlying
> blocks are actually stored and how much space they take up is up to the
> lower layers.
...
> But you forget that a compressed block's physical size fundamentally
> depends on its contents. That's why compressed zvols still appear the
> same size as before. What changes is how much space they occupy on the
> underlying pool.

I won't argue with this, as it is perfectly correct for zvols and
undefined for the mythical new dataset type ;)

However, regarding dump and size prediction: when I created dump zvols
manually and fed them to dumpadm, it could complain that the device was
too small. Then at some point it accepted the given size, even though that
value did not obviously relate to the system RAM or anything. So I guess
the system also does some guessing in this case?.. If so, preallocating as
many bytes as it thinks are minimally required and then allowing
compression to stuff more data in might help to actually save larger dumps
in cases where the system (dumpadm) made a wrong guess.

//Jim
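For reference, the manual workflow described above looks roughly like this
(the 2g size and the rpool/dump2 name are arbitrary examples):

  # Create a dump zvol by hand and hand it to dumpadm; dumpadm refuses the
  # device if its own estimate says it is too small.
  zfs create -V 2g rpool/dump2
  dumpadm -d /dev/zvol/dsk/rpool/dump2
  dumpadm    # show the resulting dump configuration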
On Tue, Jan 22, 2013 at 11:54:53PM +0000, Edward Ned Harvey
(opensolarisisdeadlongliveopensolaris) wrote:
> > From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> > bounces at opensolaris.org] On Behalf Of Nico Williams
>
> In Solaris, I've never seen it swap out idle processes; I've only
> seen it use swap for the bad bad bad situation. I assume that's all
> it can do with swap.

You would be wrong. Solaris uses swap space for paging. Paging out unused
portions of an executing process from real memory to the swap device is
certainly beneficial. Swapping out complete processes is a desperation
move, but paging out most of an idle process is a good thing.

--
-Gary Mills-        -refurb-        -Winnipeg, Manitoba, Canada-
Casper.Dik at oracle.com
2013-Jan-23 08:41 UTC
[zfs-discuss] RFE: Un-dedup for unique blocks
> IIRC dump is special.
>
> As for swap... really, you don't want to swap. If you're swapping you
> have problems. Any swap space you have is to help you detect those
> problems and correct them before apps start getting ENOMEM. There
> *are* exceptions to this, such as Varnish. For Varnish and any other
> apps like it I'd dedicate an entire flash drive to it, no ZFS, no
> nothing.

Yes and no: the system reserves a lot of additional memory (Solaris
doesn't over-commit swap) and swap is needed to support those
reservations. Also, some pages are dirtied early on and never touched
again; those pages should not be kept in memory. But continuously swapping
is clearly a sign of a system too small for its job.

Of course, compressing and/or encrypting swap has interesting issues: in
order to free memory by swapping pages out, you need even more memory.

Casper
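The reservation accounting Casper mentions is visible with the ordinary
swap tools; a small sketch:

  # swap -s reports bytes allocated plus bytes reserved (reservations that
  # have not been touched yet) against what is still available; because
  # Solaris does not over-commit, reservations can exhaust swap long
  # before any page is actually written out.
  swap -s
  # Per-device view: configured size versus free blocks.
  swap -l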
On 2013-01-23 09:41, Casper.Dik at oracle.com wrote:
> Yes and no: the system reserves a lot of additional memory (Solaris
> doesn't over-commit swap) and swap is needed to support those
> reservations. Also, some pages are dirtied early on and never touched
> again; those pages should not be kept in memory.

I believe, judging by the symptoms, that this is what often happens to
Java processes (app servers and such) in particular - I regularly see
these have large "VM" sizes and much (3x) smaller "RSS" sizes.

One explanation I've seen is that the JVM nominally depends on a number of
shared libraries which are loaded to fulfill the runtime requirements, but
aren't actively used and thus go out into swap quickly. I chose to trust
that statement ;)

//Jim
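The symptom Jim describes is easy to check per process; a sketch (the java
pattern and <pid> are placeholders):

  # Compare virtual size (VSZ) with resident size (RSS); a large gap means
  # much of the address space is not currently in RAM.
  ps -eo pid,vsz,rss,args | grep java
  # Per-mapping breakdown for one process; library segments that were
  # paged out show a small RSS relative to their size.
  pmap -x <pid>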
Jim Klimov wrote:
> On 2013-01-23 09:41, Casper.Dik at oracle.com wrote:
>> Yes and no: the system reserves a lot of additional memory (Solaris
>> doesn't over-commit swap) and swap is needed to support those
>> reservations. Also, some pages are dirtied early on and never touched
>> again; those pages should not be kept in memory.
>
> I believe, judging by the symptoms, that this is what often happens to
> Java processes (app servers and such) in particular - I regularly see
> these have large "VM" sizes and much (3x) smaller "RSS" sizes.

Being swapped out is probably the best thing that can be done to most Java
processes :)

--
Ian.
Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
2013-Jan-23 12:36 UTC
[zfs-discuss] RFE: Un-dedup for unique blocks
> From: Gary Mills [mailto:gary_mills at fastmail.fm]
>
>> In Solaris, I've never seen it swap out idle processes; I've only
>> seen it use swap for the bad bad bad situation. I assume that's all
>> it can do with swap.
>
> You would be wrong. Solaris uses swap space for paging. Paging out
> unused portions of an executing process from real memory to the swap
> device is certainly beneficial. Swapping out complete processes is a
> desperation move, but paging out most of an idle process is a good
> thing.

You seem to be emphasizing the distinction between swapping and paging. My
point, though, is that I've never seen swap usage (which is what paging
uses) on any Solaris derivative become nonzero for the sake of keeping
something in cache. It seems to me that Solaris will always evict all
cache memory before it swaps (pages) out even the most idle process
memory.
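Whether a Solaris box is actually paging anonymous memory out (rather than
merely reserving swap) can be observed directly; a brief sketch:

  # vmstat -p breaks paging activity down by page type: the api/apo
  # columns are anonymous (process) page-ins/outs, epi/epo executable
  # pages, fpi/fpo file-system pages.
  vmstat -p 5
  # The free column shows whether swap blocks are in use at all.
  swap -l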
On 01/22/2013 10:50 PM, Gary Mills wrote:
> On Tue, Jan 22, 2013 at 11:54:53PM +0000, Edward Ned Harvey
> (opensolarisisdeadlongliveopensolaris) wrote:
> Paging out unused portions of an executing process from real memory to
> the swap device is certainly beneficial. Swapping out complete
> processes is a desperation move, but paging out most of an idle
> process is a good thing.

It gets even better. Executables become part of the swap space via mmap,
so that if you have a lot of copies of the same process running in memory,
the executable bits don't waste any more space (well, unless you use the
sticky bit, although that might be deprecated, or if you copy the binary
elsewhere.) There's lots of awesome fun optimizations in UNIX. :)
Casper.Dik at oracle.com
2013-Jan-23 19:48 UTC
[zfs-discuss] RFE: Un-dedup for unique blocks
> On 01/22/2013 10:50 PM, Gary Mills wrote:
>> On Tue, Jan 22, 2013 at 11:54:53PM +0000, Edward Ned Harvey
>> (opensolarisisdeadlongliveopensolaris) wrote:
>> Paging out unused portions of an executing process from real memory to
>> the swap device is certainly beneficial. Swapping out complete
>> processes is a desperation move, but paging out most of an idle
>> process is a good thing.
>
> It gets even better. Executables become part of the swap space via
> mmap, so that if you have a lot of copies of the same process running in
> memory, the executable bits don't waste any more space (well, unless you
> use the sticky bit, although that might be deprecated, or if you copy
> the binary elsewhere.) There's lots of awesome fun optimizations in
> UNIX. :)

The "sticky bit" has never been used in that way in SunOS for as long as I
can remember (SunOS 3.x), and probably before that. It no longer makes
sense for demand-paged executables.

Casper
On Tue, Jan 22, 2013 at 5:29 AM, Darren J Moffat
<darrenm at opensolaris.org> wrote:
> Preallocated ZVOLs - for swap/dump.

Darren, good to hear about the cool stuff in S11.

Just to clarify, is this preallocated ZVOL different than the preallocated
dump which has been there for quite some time (and is in Illumos)? Can you
use it for other zvols besides swap and dump?

Some background: the ZFS dump device has always been preallocated ("thick
provisioned"), so that we can reliably dump. By definition, something has
gone horribly wrong when we are dumping, so this code path needs to be as
small as possible to have any hope of getting a dump. So we preallocate
the space for dump, and store a simple linked list of disk segments where
it will be stored. The dump device is not COW, checksummed, deduped,
compressed, etc. by ZFS.

In Illumos (and S10), swap was treated more or less like a regular zvol.
This leads to some tricky code paths because ZFS allocates memory from
many points in the code as it is writing out changes. I could see
advantages to the simplicity of a preallocated swap volume, using the same
code that already exists for preallocated dump. Of course, the loss of
checksumming and encryption is much more of a concern with swap (which is
critical for correct behavior) than with dump (which is nice to have for
debugging).

--matt
On 01/24/13 00:04, Matthew Ahrens wrote:
> On Tue, Jan 22, 2013 at 5:29 AM, Darren J Moffat
> <darrenm at opensolaris.org <mailto:darrenm at opensolaris.org>> wrote:
>
>     Preallocated ZVOLs - for swap/dump.
>
> Darren, good to hear about the cool stuff in S11.
>
> Just to clarify, is this preallocated ZVOL different than the
> preallocated dump which has been there for quite some time (and is in
> Illumos)? Can you use it for other zvols besides swap and dump?

It is the same, but we are using it for swap now too. It isn't available
for general use.

> Some background: the ZFS dump device has always been preallocated
> ("thick provisioned"), so that we can reliably dump. By definition,
> something has gone horribly wrong when we are dumping, so this code path
> needs to be as small as possible to have any hope of getting a dump. So
> we preallocate the space for dump, and store a simple linked list of
> disk segments where it will be stored. The dump device is not COW,
> checksummed, deduped, compressed, etc. by ZFS.

For the sake of others (I know you know this, Matt): the dump system does
the compression, so ZFS didn't need to anyway.

> In Illumos (and S10), swap was treated more or less like a regular zvol.
> This leads to some tricky code paths because ZFS allocates memory from
> many points in the code as it is writing out changes. I could see
> advantages to the simplicity of a preallocated swap volume, using the
> same code that already exists for preallocated dump. Of course, the
> loss of checksumming and encryption is much more of a concern with swap
> (which is critical for correct behavior) than with dump (which is nice
> to have for debugging).

We have encryption for dump because it is hooked into the zvol code.

For encrypting swap, Illumos could do the same as Solaris 11 does and use
lofi. I changed swapadd so that if "encryption" is specified in the
options field of the vfstab entry, it creates a lofi shim over the swap
device using 'lofiadm -e'. This provides encrypted swap regardless of what
the underlying "disk" is (normal ZVOL, prealloc ZVOL, real disk slice, SVM
mirror etc).

--
Darren J Moffat
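As a sketch of the vfstab approach Darren describes (field order per the
standard vfstab layout; the zvol path is an example, and the exact option
string is whatever swapadd expects):

  # /etc/vfstab entry: requesting encryption in the options field makes
  # swapadd interpose an encrypted lofi device (lofiadm -e) over the swap
  # zvol at boot, instead of adding the raw zvol itself.
  /dev/zvol/dsk/rpool/swap  -  -  swap  -  no  encryption

After boot, 'swap -l' should then list the lofi device rather than the raw
zvol.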
On 2013-01-24 11:06, Darren J Moffat wrote:
> On 01/24/13 00:04, Matthew Ahrens wrote:
>> On Tue, Jan 22, 2013 at 5:29 AM, Darren J Moffat
>> <darrenm at opensolaris.org <mailto:darrenm at opensolaris.org>> wrote:
>>
>>     Preallocated ZVOLs - for swap/dump.
>>
>> Darren, good to hear about the cool stuff in S11.

Yes, thanks, Darren :)

>> Just to clarify, is this preallocated ZVOL different than the
>> preallocated dump which has been there for quite some time (and is in
>> Illumos)? Can you use it for other zvols besides swap and dump?
>
> It is the same, but we are using it for swap now too. It isn't available
> for general use.
>
>> Some background: the ZFS dump device has always been preallocated
>> ("thick provisioned"), so that we can reliably dump. By definition,
>> something has gone horribly wrong when we are dumping, so this code path
>> needs to be as small as possible to have any hope of getting a dump. So
>> we preallocate the space for dump, and store a simple linked list of
>> disk segments where it will be stored. The dump device is not COW,
>> checksummed, deduped, compressed, etc. by ZFS.

Comparing these two statements, can I say (and be correct) that the
preallocated swap devices would lack COW (as I proposed too), and thus
likely snapshots, but would also lack the checksums? (We might live
without compression, though that was once touted as a bonus for swap over
ZFS, and we can certainly do without dedup.)

Basically, they seem little different from preallocated disk slices - and
for those an admin might have better control over the dedicated disk
locations (i.e. faster tracks in a small-seek stroke range) - except that
ZFS datasets are easier to resize... right or wrong?

//Jim
>> It also has a lot of performance improvements and general bug fixes in
>> the Solaris 11.1 release.
>
> Performance improvements such as?

Dedup'ed ARC, for one.
All-zero blocks automatically "dedup'ed" in memory.
Improvements to ZIL performance.
Zero-copy zfs+nfs+iscsi.
...

--
Robert Milkowski
http://milek.blogspot.com
On 01/29/2013 02:59 PM, Robert Milkowski wrote:
>>> It also has a lot of performance improvements and general bug fixes in
>>> the Solaris 11.1 release.
>>
>> Performance improvements such as?
>
> Dedup'ed ARC, for one.
> All-zero blocks automatically "dedup'ed" in memory.
> Improvements to ZIL performance.
> Zero-copy zfs+nfs+iscsi.
> ...

Cool, thanks for the inspiration on my next work in Illumos' ZFS.

Cheers,
--
Saso
> From: Richard Elling
> Sent: 21 January 2013 03:51
>
> VAAI has 4 features, 3 of which have been in illumos for a long time.
> The remaining feature (SCSI UNMAP) was done by Nexenta and exists in
> their NexentaStor product, but the CEO made a conscious (and unpopular)
> decision to keep that code from the community. Over the summer, another
> developer picked up the work in the community, but I've lost track of
> the progress and haven't seen an RTI yet.

That is one thing that has always bothered me... so it is OK for others,
like Nexenta, to keep stuff closed and not in the open, while if Oracle
does it they are bad?

Isn't that at least a little bit hypocritical (bashing Oracle and doing
sort of the same)?

--
Robert Milkowski
http://milek.blogspot.com
On 01/29/2013 03:08 PM, Robert Milkowski wrote:
>> From: Richard Elling
>> Sent: 21 January 2013 03:51
>>
>> VAAI has 4 features, 3 of which have been in illumos for a long time.
>> The remaining feature (SCSI UNMAP) was done by Nexenta and exists in
>> their NexentaStor product, but the CEO made a conscious (and unpopular)
>> decision to keep that code from the community. Over the summer, another
>> developer picked up the work in the community, but I've lost track of
>> the progress and haven't seen an RTI yet.
>
> That is one thing that has always bothered me... so it is OK for others,
> like Nexenta, to keep stuff closed and not in the open, while if Oracle
> does it they are bad?
>
> Isn't that at least a little bit hypocritical (bashing Oracle and doing
> sort of the same)?

Nexenta is a downstream repository that chooses to keep some of their new
developments in-house while making others open. Most importantly, they
participate and make a conscious effort to play nice.

Contrast this with Oracle. Oracle swoops in and buys up Sun, closes *all*
of the technologies it can turn a profit on, changes licensing terms to
extremely draconian ones, and in the process takes a dump on all of the
open-source community and large numbers of their customers.

Now imagine which of these two is more popular in the community?
(Disclaimer: my company was formerly an almost exclusive Sun shop.)

Cheers,
--
Saso
Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
2013-Jan-29 15:03 UTC
[zfs-discuss] RFE: Un-dedup for unique blocks
> From: Robert Milkowski [mailto:rmilkowski at task.gda.pl]
>
> That is one thing that has always bothered me... so it is OK for others,
> like Nexenta, to keep stuff closed and not in the open, while if Oracle
> does it they are bad?

Oracle, like Nexenta, my own company CleverTrove, Microsoft, and NetApp,
has every right to close-source development if they believe it's
beneficial to their business. For all we know, Oracle might not even have
a choice about it - it might have been in the terms of the settlement with
NetApp (because open-source ZFS definitely hurt NetApp's business).

The real question is: in which situations is it beneficial to your
business to be closed source, as opposed to open source? There's the whole
Red Hat / CentOS dichotomy. At first blush it would seem Red Hat gets
screwed by CentOS (or Oracle Linux), but then you realize how many more
Red Hat derived systems are out there compared to SUSE, etc. By allowing
people to use it for free, it actually gains popularity, and then Red Hat
actually has a successful support business model, as compared to SUSE,
which tanked.

But it's useless to argue about whether Oracle is making the right
business choice, or whether open or closed source is better for their
business. It's their choice, regardless of who agrees, and arguing about
it here isn't going to do any good.

Those of us who gained something and no longer count on having that
benefit moving forward have a tendency to say "You gave it to me for free
before, now I'm pissed off because you're not giving it to me for free
anymore" instead of "thanks for what you gave before." The world moves on.
There's plenty of time to figure out which solution is best for you, the
consumer, among the future product offerings: a commercial closed-source
product, an open-source product, or something completely different such as
btrfs.
On Jan 29, 2013, at 6:08 AM, Robert Milkowski <rmilkowski at task.gda.pl> wrote:
>> From: Richard Elling
>> Sent: 21 January 2013 03:51
>>
>> VAAI has 4 features, 3 of which have been in illumos for a long time.
>> The remaining feature (SCSI UNMAP) was done by Nexenta and exists in
>> their NexentaStor product, but the CEO made a conscious (and unpopular)
>> decision to keep that code from the community. Over the summer, another
>> developer picked up the work in the community, but I've lost track of
>> the progress and haven't seen an RTI yet.
>
> That is one thing that has always bothered me... so it is OK for others,
> like Nexenta, to keep stuff closed and not in the open, while if Oracle
> does it they are bad?

Nexenta is just as bad. For the record, the illumos-community folks who
worked at Nexenta at the time were overruled by executive management. Some
of those folks are now executive management elsewhere :-)

> Isn't that at least a little bit hypocritical (bashing Oracle and doing
> sort of the same)?

No, not at all.
 -- richard

--
Richard.Elling at RichardElling.com
+1-760-896-4422
<Casper.Dik at oracle.com> wrote:
>> It gets even better. Executables become part of the swap space via
>> mmap, so that if you have a lot of copies of the same process running in
>> memory, the executable bits don't waste any more space (well, unless you
>> use the sticky bit, although that might be deprecated, or if you copy
>> the binary elsewhere.) There's lots of awesome fun optimizations in
>> UNIX. :)
>
> The "sticky bit" has never been used in that way in SunOS for as long as
> I can remember (SunOS 3.x), and probably before that. It no longer makes
> sense for demand-paged executables.

SunOS-3.0 introduced NFS root and swap on NFS. For that reason, the
meaning of the sticky bit was changed to mean "do not cache write this
file". Note that SunOS-3.0 appeared with the new Sun3 machines (first
build on 24.12.1985).

Jörg

--
EMail: joerg at schily.isdn.cs.tu-berlin.de  (home) Jörg Schilling D-13353 Berlin
       js at cs.tu-berlin.de                 (uni)
       joerg.schilling at fokus.fraunhofer.de (work)
Blog:  http://schily.blogspot.com/
URL:   http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
On Sun, Jan 20, 2013 at 07:51:15PM -0800, Richard Elling wrote:
>> 2. VAAI support.
>
> VAAI has 4 features, 3 of which have been in illumos for a long time.
> The remaining feature (SCSI UNMAP) was done by Nexenta and exists in
> their NexentaStor product, but the CEO made a conscious (and unpopular)
> decision to keep that code from the community. Over the summer, another
> developer picked up the work in the community, but I've lost track of
> the progress and haven't seen an RTI yet.

I assume SCSI UNMAP is implemented in Comstar in NexentaStor? Isn't
Comstar CDDL licensed?

There's also this: https://www.illumos.org/issues/701
.. which says UNMAP support was added to Illumos Comstar 2 years ago.

-- Pasi