Hi all,

I'm planning a new build based on a SuperMicro chassis with 16 bays. I am looking to use up to 4 of the bays for SSD devices. After reading many posts about SSDs I believe I have a _basic_ understanding of a reasonable approach to utilizing SSDs for ZIL and L2ARC. Namely:

  ZIL:   Intel X-25E
  L2ARC: Intel X-25M

So, I am somewhat unclear about a couple of details surrounding the deployment of these devices.

1) Mirroring. Leaving cost out of it, should ZIL and/or L2ARC SSDs be mirrored?

2) ZIL write cache. It appears some have disabled the write cache on the X-25E. This results in a 5 fold performance hit but it eliminates a potential mechanism for data loss. Is this valid? If I can mirror ZIL, I imagine this is no longer a concern?

3) SATA devices on a SAS backplane. Assuming the main drives are SAS, what impact do the SATA SSDs have? Any performance impact? I realize I could use an onboard SATA controller for the SSDs, however this complicates things in terms of the mounting of these drives.

thanks!
On Sat, 17 Apr 2010, Dave Vrona wrote:

> 1) Mirroring. Leaving cost out of it, should ZIL and/or L2ARC SSDs
> be mirrored ?

Mirroring the intent log is a good idea, particularly for ZFS versions which don't support removing the intent log device.

> 2) ZIL write cache. It appears some have disabled the write cache
> on the X-25E. This results in a 5 fold performance hit but it
> eliminates a potential mechanism for data loss. Is this valid? If
> I can mirror ZIL, I imagine this is no longer a concern?

It is not necessary to disable the write cache if the device responds correctly to cache flush requests. The intent log is flushed frequently. Previously some have reported (based on testing) that the X-25E does not flush the write cache reliably when it is enabled. It may be that some X-25E versions work better than others.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
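For reference, a mirrored intent log is added with the zpool add subcommand. A minimal sketch; the pool name "tank" and the c#t#d# device names below are placeholders, not anything from this thread:

  # Add a mirrored intent-log (slog) pair to an existing pool.
  zpool add tank log mirror c2t0d0 c2t1d0

  # Verify the layout; the log devices appear under a "logs" section.
  zpool status tank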
On 04/17/10 07:59, Dave Vrona wrote:

> 1) Mirroring. Leaving cost out of it, should ZIL and/or L2ARC SSDs
> be mirrored ?

L2ARC cannot be mirrored -- and doesn't need to be. The contents are checksummed; if the checksum doesn't match, it's treated as a cache miss and the block is re-read from the main pool disks.

The ZIL can be mirrored, and mirroring it improves your ability to recover the pool in the face of multiple failures.

> 2) ZIL write cache. It appears some have disabled the write cache on
> the X-25E. This results in a 5 fold performance hit but it
> eliminates a potential mechanism for data loss. Is this valid?

With the ZIL disabled, you may lose the last ~30s of writes to the pool (the transaction group being assembled and written at the time of the crash).

With the ZIL on a device with a write cache that ignores cache flush requests, you may lose the tail of some of the intent logs, starting with the first block in each log which wasn't readable after the restart. (I say "may" rather than "will" because some failures may not result in the loss of the write cache.) Depending on how quickly your ZIL device pushes writes from cache to stable storage, this may narrow the window from ~30s to less than 1s, but doesn't close the window entirely.

> If I can mirror ZIL, I imagine this is no longer a concern?

Mirroring a ZIL device with a volatile write cache doesn't eliminate this risk. Whether it reduces the risk depends on precisely *what* caused your system to crash and reboot; if the failure also causes loss of the write cache contents on both sides of the mirror, mirroring won't help.

- Bill
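Since L2ARC devices cannot be mirrored, they are simply added as cache vdevs and reads are spread across them. A minimal sketch with placeholder pool and device names:

  # Add two L2ARC devices; cached data is striped across them.
  zpool add tank cache c2t2d0 c2t3d0

  # A failed cache device only costs cache hits; cache devices can be
  # removed again at any time on any pool version that supports L2ARC.
  zpool remove tank c2t3d0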
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Dave Vrona
>
> 1) Mirroring. Leaving cost out of it, should ZIL and/or L2ARC SSDs be
> mirrored ?

IMHO, the best answer to this question is the one from the ZFS Best Practices guide. (I wrote part of it.) In short:

You have no need to mirror your L2ARC cache device, and it's impossible even if you want to for some bizarre reason.

For zpool < 19, which includes all present releases of Solaris 10 and OpenSolaris 2009.06, it is critical to mirror your ZIL log device. A failed unmirrored log device would be the permanent death of the pool.

For zpool >= 19, which is available in the developer builds downloadable from genunix, you need to make your own decision: If you have an unmirrored log device fail, *or* an ungraceful system crash, there is no problem. But if you have both, then you lose the latest writes leading up to the crash. You don't lose your whole pool. There are some scenarios where it's possible for the failing log device to go undetected until after the ungraceful reboot, in which case you lose the latest data, but not the whole pool.

Personally, I recommend the latest build from genunix, and I recommend no mirroring for log devices, except in the most critical of situations, such as a machine that processes credit card transactions or stuff like that.

> 2) ZIL write cache. It appears some have disabled the write cache on
> the X-25E. This results in a 5 fold performance hit but it eliminates
> a potential mechanism for data loss. Is this valid? If I can mirror
> ZIL, I imagine this is no longer a concern?

This disagrees with my measurements. If you have a dedicated log device, I found the best performance by disabling all the write cache on all the devices (disk and HBA). This is because ZFS has inner knowledge of the filesystem, and knowledge of the block-level devices, while the HBA only has knowledge of the block-level devices and no knowledge of the filesystem. Long story short, ZFS does a better job of write buffering and utilizing the devices available. Details are in the ZFS Best Practices guide.

> 3) SATA devices on a SAS backplane. Assuming the main drives are SAS,
> what impact do the SATA SSDs have? Any performance impact? I realize
> I could use an onboard SATA controller for the SSDs however this
> complicates things in terms of the mounting of these drives.

SATA SSD devices on the SAS backplane is precisely what you should do. This works perfectly, and this is the configuration I used when I produced the measurements described above.
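To check which of those categories a given system falls into, the pool version can be queried directly. A small sketch, assuming a pool named tank and placeholder device names:

  # Show the pool's on-disk version and the highest version this build supports.
  zpool get version tank
  zpool upgrade -v

  # On pool version 19 or later, a dedicated log device can be removed again,
  # for example if you reconfigure or the SSD starts misbehaving:
  zpool remove tank c2t5d0        # unmirrored slog
  zpool detach tank c2t1d0        # drop one side of a mirrored slog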
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Edward Ned Harvey
>
>> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
>> bounces at opensolaris.org] On Behalf Of Dave Vrona
>>
>> 2) ZIL write cache. It appears some have disabled the write cache on
>> the X-25E. This results in a 5 fold performance hit but it eliminates
>> a potential mechanism for data loss. Is this valid? If I can mirror
>> ZIL, I imagine this is no longer a concern?

Ahh, I see there may have been some confusion there, because your question wasn't asked right. ;-)

"Disabling ZIL" is not the same thing as "disabling write cache." Those two terms are not to be mixed.

The write cache is either the volatile memory on the disk, or the presumably nonvolatile memory in the HBA. You should never enable volatile disk write cache. You should only enable the HBA writeback cache if (a) the HBA has nonvolatile memory, such as battery backed up, and (b) you don't have a dedicated ZIL log device.

The ZIL, generally speaking, should not be disabled. There are some cases where it's OK, but generally speaking don't do it. The justification is thus: Disabling the ZIL makes would-be sync writes into async writes, which are faster, but prone to disappearance caused by ungraceful system shutdown. If you trust your applications to only issue sync writes when they actually need to, and to do async writes whenever that's OK, then you should not disable your ZIL. The only time to disable your ZIL is when you believe your applications are performing sync writes unnecessarily, hurting their own performance, and you're not worried about losing the latest ~30 seconds of supposedly already-written data after an ungraceful shutdown.

PS: if you are in the latter case and you do disable your ZIL, then there's no point in either HBA writeback cache or a ZIL log device. The ZIL log device is only used for sync writes, and it will be 100% unused if you disable the ZIL. Also, HBA writeback does not benefit ZFS for async writes.
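The distinction is easier to see from how each is controlled. On the Solaris 10 / OpenSolaris builds of this era, disabling the ZIL is a system-wide kernel tunable (this is the mechanism described in the ZFS Evil Tuning Guide), while the write cache is a per-device setting. A hedged sketch, for illustration only:

  # /etc/system -- disables the ZIL for ALL pools after the next reboot.
  # Sync write semantics are lost; only consider this if you accept losing
  # the last ~30 seconds of "committed" data after a crash.
  set zfs:zil_disable = 1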
On 17 apr 2010, at 20.51, Edward Ned Harvey wrote:

>> 1) Mirroring. Leaving cost out of it, should ZIL and/or L2ARC SSDs be
>> mirrored ?
...
> Personally, I recommend the latest build from genunix, and I recommend no
> mirroring for log devices, except in the most critical of situations, such
> as a machine that processes credit card transactions or stuff like that.

It depends on whether you think this is a risk or not. Personally I do - disk systems are always acting up at the same time as you have other problems, in addition to the other times. Those SSDs I have tried and read about seem to be even more crappy than disks in general, and I wouldn't trust them for about anything that I want to keep.

If you handle something that you really must not lose, you should probably have some other redundancy too, like parallel servers in different locations, and you may skip redundancy on many components in each individual server.

If you have a more standard application, a restricted budget, and still don't want to lose file system transactions, I believe you should have at least as good redundancy on your ZILs as on the rest of the disk system. Examples are:

- Mail servers (you are not allowed to lose email).
- NFS servers (to follow the protocol, not lose user data, and not leave clients in undefined/bad states).
- A general file server or other application server where people expect the bits they have put in there to be there, even though the server happened to crash.
- All other applications where you want to take as many steps as possible to not lose data.

On 17 apr 2010, at 21.09, Edward Ned Harvey wrote:

>>> 2) ZIL write cache. It appears some have disabled the write cache on
>>> the X-25E. This results in a 5 fold performance hit but it eliminates
>>> a potential mechanism for data loss. Is this valid? If I can mirror
>>> ZIL, I imagine this is no longer a concern?
...

I'd say it is of concern - the X25-E just straight-out ignores cache flush commands (idiots!), but disabling the write cache *seems* to put it in more of a write-through mode, so that cache flushing shouldn't be needed. Some tests have shown that it *may* lose one or a few transactions anyway if it suddenly loses power.

The X25-E is, sadly, not a storage device worth the name, at least not until Intel has fixed the problems with it, which doesn't seem to be happening. Sadly, Intel seems to just keep quiet and ignore those few that are actually checking out what their disks are doing - they still sell a lot to all the others, I guess.

...
> The write cache is either the volatile memory on the disk, or the presumably
> nonvolatile memory in the HBA. You should never enable volatile disk write
> cache.

This is not correct. ZFS normally enables the write cache, and assumes the devices correctly honor cache flush commands. Sadly, there are devices out there that ignore them, because they want to look like they have higher performance than they have, or because of bugs, or in some cases because of just plain ignorance from the implementors. The Intel X25-E is sadly one of those bad devices.

[Traditionally, Solaris has always tried to disable volatile disk write caches; zfs changes this.]

> You should only enable the HBA writeback cache if (a) the HBA has
> nonvolatile memory, such as battery backed up, and (b) you don't have a
> dedicated ZIL log device.

And (c) you don't mind having the same problem with non-redundancy as for the ZIL device: If you have writeback caching enabled in your HBA and your HBA fails, some of your data will be lost in the HBA cache, a bit similar to the ZIL case. Your file system may be in a little worse state, since zfs is always consistent on storage, but if some of the recently written storage is lost in the HBA cache, it won't be consistent on disk.

And HBAs do seem to fail a bit more often than most other computer boards, for some strange reason.

/ragge
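If you do want to experiment with turning a drive's volatile write cache off, Solaris exposes this through the expert mode of format(1M) for devices whose driver supports it. A hedged sketch of the interactive menu path; whether the cache menu appears for a given SATA SSD depends on how the HBA driver presents the device, so treat this as an assumption to verify on your own hardware:

  # Expert mode of format(1M).
  format -e
  #   > select the SSD from the disk list
  #   format> cache
  #   cache> write_cache
  #   write_cache> display     # show current state
  #   write_cache> disable     # turn the volatile write cache off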
>>> 2) ZIL write cache. It appears some have disabled the write cache on
>>> the X-25E. This results in a 5 fold performance hit but it eliminates
>>> a potential mechanism for data loss. Is this valid? If I can mirror
>>> ZIL, I imagine this is no longer a concern?
>
> Ahh, I see there may have been some confusion there, because your question
> wasn't asked right. ;-)
>
> "Disabling ZIL" is not the same thing as "disabling write cache." Those two
> terms are not to be mixed.

My statement was less than perfectly worded. I specifically meant disabling the write cache on the X-25E that is holding the ZIL. I certainly didn't mean to imply disabling the ZIL.
Ok, so originally I presented the X-25E as a "reasonable" approach. After reading the follow-ups, I'm second-guessing my statement.

Any decent alternatives at a reasonable price?
On 18 apr 2010, at 00.52, Dave Vrona wrote:

> Ok, so originally I presented the X-25E as a "reasonable" approach.
> After reading the follow-ups, I'm second-guessing my statement.
>
> Any decent alternatives at a reasonable price?

How much is reasonable? :-)

I guess there are STEC drives that should work for slogs (ZIL devices), but I haven't tried them yet, and haven't read about many that have, except in euphoric reports from Sun users that got them from Sun. Would be really interesting to try.

I think Sun/Oracle actually has sold X25-Es themselves, possibly with Sun/Oracle firmware. I don't know if those drives are as bad as the Intel-branded ones.

For L2ARC, about any drive would do, I think; it is just not critical for the file system in any way (but could be critical for your application).

I'd also really like to hear the zfs developers' view on this subject; I guess they have tested many of these drives and problems in their labs.

/ragge
On Apr 17, 2010, at 11:51 AM, Edward Ned Harvey wrote:

>> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
>> bounces at opensolaris.org] On Behalf Of Dave Vrona
>>
>> 1) Mirroring. Leaving cost out of it, should ZIL and/or L2ARC SSDs be
>> mirrored ?
>
> IMHO, the best answer to this question is the one from the ZFS Best
> Practices guide. (I wrote part of it.) In short:
>
> You have no need to mirror your L2ARC cache device, and it's impossible even
> if you want to for some bizarre reason.
>
> For zpool < 19, which includes all present releases of Solaris 10 and
> OpenSolaris 2009.06, it is critical to mirror your ZIL log device. A failed
> unmirrored log device would be the permanent death of the pool.

I do not believe this is a true statement. In large part it will depend on the nature of the failure -- all failures are not created equal. It has also been shown that such pools are recoverable, albeit with tedious, manual procedures required. Rather than saying this is a "critical" issue, I could say it is "preferred." Indeed, there are *many* SPOFs in the typical system (any x86 system) which can be considered similarly "critical."

Finally, you have choices -- you can use an HBA with nonvolatile write cache and avoid the need for a separate log device.
 -- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com
On 18 apr 2010, at 06.43, Richard Elling wrote:

>> For zpool < 19, which includes all present releases of Solaris 10 and
>> OpenSolaris 2009.06, it is critical to mirror your ZIL log device. A failed
>> unmirrored log device would be the permanent death of the pool.
>
> I do not believe this is a true statement. In large part it will depend on
> the nature of the failure -- all failures are not created equal. It has also
> been shown that such pools are recoverable, albeit with tedious, manual
> procedures required. Rather than saying this is a "critical" issue, I could
> say it is "preferred." Indeed, there are *many* SPOFs in the typical system
> (any x86 system) which can be considered similarly "critical."

Yes there are. The thing is that a common situation is that the most valuable thing is the data itself, often more than either 0.999[0-9]* uptime and certainly more than the machine itself. If so, you want very good redundancy on your data, and don't care much about (live) redundancy on the machine. You just take the disks and slam them into another machine - physically, by means of FC or SAS, virtually, or whatever. (You may want to have a spare machine standing by to save time, though.) It is often not very expensive to get quite a bit of redundancy in your data; running parallel systems is often much more complicated and expensive.

That the data possibly could be recovered with tedious procedures, with experts doing it by hand, is not good enough a crash recovery plan for many of us - in a crash situation you want your data to be there and be safe, and you just have to figure out how to access it, and you are probably interested in making that happen as quickly as possible. Hopefully you have planned for the procedure already. That said, it is good that the manual option is there if you get in deep trouble.

At least this is our reasoning when we set up our server machines...

> Finally, you have choices -- you can use an HBA with nonvolatile write
> cache and avoid the need for separate log device.

Except that then that HBA is a non-redundant place, a SPOF, where you store your data, and a place where you could lose data. As long as you know that and know that you can take that, everything is fine.

Again, it all depends on the application, I guess, and giving general advice is nearly impossible.

/ragge
>> On 18 apr 2010, at 00.52, Dave Vrona wrote:
>>
>>> Ok, so originally I presented the X-25E as a "reasonable" approach.
>>> After reading the follow-ups, I'm second-guessing my statement.
>>>
>>> Any decent alternatives at a reasonable price?
>>
>> How much is reasonable? :-)

How about $1000 per device? $2000 for a mirrored pair.
The Acard device mentioned in this thread looks interesting:

http://opensolaris.org/jive/thread.jspa?messageID=401719
> From: Richard Elling [mailto:richard.elling at gmail.com]
>
> On Apr 17, 2010, at 11:51 AM, Edward Ned Harvey wrote:
>
>> For zpool < 19, which includes all present releases of Solaris 10 and
>> OpenSolaris 2009.06, it is critical to mirror your ZIL log device. A failed
>> unmirrored log device would be the permanent death of the pool.
>
> I do not believe this is a true statement. In large part it will depend on
> the nature of the failure -- all failures are not created equal. It has also
> been shown that such pools are recoverable, albeit with tedious, manual
> procedures required. Rather than saying this is a "critical" issue, I could
> say it is "preferred." Indeed, there are *many* SPOFs in the typical system
> (any x86 system) which can be considered similarly "critical."

Could you please describe a type of failure of an unmirrored log device in zpool < 19 which does not result in the pool being faulted and unable to import? I don't know of any.

If you have a faulted zpool < 19, due to a faulted nonmirrored log device, could you describe how it's possible to recover that pool? I know I tried and couldn't do it, but then again, it was only a test pool. I only dedicated an hour of labor to trying.

> Finally, you have choices -- you can use an HBA with nonvolatile write
> cache and avoid the need for separate log device.

The HBA with nonvolatile cache gains a lot over just plain disks. By my measurement, 2x-3x faster for sync writes, but no improvement for async writes, or any reads. But it's not as effective as using a dedicated SSD for the log device. By my measurement, using an SSD for the log device (with all the HBA write cache disabled) was about 3x-4x faster than just plain disks for sync writes, but no different for async writes, or any reads.

I agree with you, HBA nonvolatile write cache is an option. It's cheaper than buying an SSD, and it doesn't consume a slot. Better than nothing. Depends on what your design requirements are, and how much you care about sync write performance.
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Dave Vrona
>
>> How much is reasonable? :-)
>
> How about $1000 per device? $2000 for a mirrored pair.

That's how much I paid for my Intel SSDs, Sun branded. I think the Intel SSDs are the industry standard, at least for now.
Or, a DDRdrive X1? Would the X1 need to be mirrored?
IMHO, whether a dedicated log device needs redundancy (mirrored) should be determined by the dynamics of each end-user environment (zpool version, goals/priorities, and budget).

If mirroring is deemed important, a key benefit of the DDRdrive X1 is the HBA / storage device integration. For example, to approach the redundancy of a mirrored DDRdrive X1 pair, a SATA Flash based SSD solution would require each SSD to have a dedicated HBA controller, as sharing an HBA between the two mirrored SSDs would introduce a single point of failure not existing in the X1 configuration. Even with dedicated HBAs, removing the need for SATA cables while halving both the controller count and data path travel will notably increase reliability.

It should be mentioned, one plus for a mirrored Flash SSD with dedicated HBAs (no cache, or write-through) is the lack of required power protection.

Thanks,

Christopher George
Founder/CTO
www.ddrdrive.com
>>>>> "re" == Richard Elling <richard.elling at gmail.com> writes:

 >> A failed unmirrored log device would be the
 >> permanent death of the pool.

 re> It has also been shown that such pools are recoverable, albeit
 re> with tedious, manual procedures required.

for the 100th time, No, they're not, not if you lose zpool.cache also.
> IMHO, whether a dedicated log device needs redundancy (mirrored), should
> be determined by the dynamics of each end-user environment (zpool version,
> goals/priorities, and budget).

Well, I populate a chassis with dual HBAs because my _perception_ is they tend to fail more than other cards. Please help me with my perception of the X1. :-)
On Apr 18, 2010, at 10:48 AM, Miles Nordin wrote:

>>> A failed unmirrored log device would be the
>>> permanent death of the pool.
>
> re> It has also been shown that such pools are recoverable, albeit
> re> with tedious, manual procedures required.
>
> for the 100th time, No, they're not, not if you lose zpool.cache also.

It is disingenuous to complain about multiple failures in a system which has so many single points of failure. Also, a well managed system will not lose zpool.cache or any other file.
 -- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com
On Apr 18, 2010, at 5:23 AM, Edward Ned Harvey wrote:

>> From: Richard Elling [mailto:richard.elling at gmail.com]
>>
>> On Apr 17, 2010, at 11:51 AM, Edward Ned Harvey wrote:
>>
>>> For zpool < 19, which includes all present releases of Solaris 10 and
>>> OpenSolaris 2009.06, it is critical to mirror your ZIL log device. A failed
>>> unmirrored log device would be the permanent death of the pool.
>>
>> I do not believe this is a true statement. In large part it will depend on
>> the nature of the failure -- all failures are not created equal. [...]
>
> Could you please describe a type of failure of an unmirrored log device in
> zpool < 19 which does not result in the pool being faulted and unable to
> import? I don't know of any.

The most common failure mode on HDDs and, it seems, SSDs is a nonrecoverable read. A nonrecoverable read failure on your separate log device will not cause the pool to fail import.
 -- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com
There is no definitive answer (yes or no) on whether to mirror a dedicated log device, as reliability is one of many variables. This leads me to the frequently given but never satisfying "it depends".

In a time when too many good questions go unanswered, let me take advantage of our less rigid "rules of engagement" and share some facts about the DDRdrive X1 which are uncommonly shared:

- 12 Layer PCB (layman translation - more layers, better SI, higher cost)
- Nelco N4000-13 EP Laminate (extremely high quality, a price to match)
- Solid Via Construction (hold a X1 in front of a bright light - no holes :-)
- "Best of Breed" components, all 520 of them
- Assembled and validated in Northern CA, USA
- 1.5 weeks of test/burn-in of every X1 (extensive DRAM validation)

In summary, the DDRdrive X1 is designed, built and tested with immense pride and an overwhelming attention to detail.

Thanks,

Christopher George
Founder/CTO
www.ddrdrive.com
On Sun, 18 Apr 2010, Christopher George wrote:

> In summary, the DDRdrive X1 is designed, built and tested with immense
> pride and an overwhelming attention to detail.

Sounds great. What performance does the DDRdrive X1 provide for this simple NFS write test from a single client over gigabit ethernet? This seems to be the test of the day.

  time tar jxf gcc-4.4.3.tar.bz2

I get 22 seconds locally and about 6-1/2 minutes from an NFS client.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
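A useful companion to this kind of client-side timing is watching the server's ZIL traffic while the test runs. One option is Richard Elling's zilstat, a DTrace-based script that is downloaded separately (it is not part of the base OS); another is plain iostat against the slog device. A sketch, with the device name below being a placeholder:

  # zilstat (third-party script): bytes and ops going through the ZIL,
  # one line per second; run it on the NFS server during the extraction.
  ./zilstat 1

  # Or watch the slog device itself (c2t0d0 is an example device name).
  iostat -xn c2t0d0 1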
>>>>> "re" == Richard Elling <richard.elling at gmail.com> writes:

    re> a well managed system will not lose zpool.cache or any other
    re> file.

I would complain this was circular reasoning if it weren't such obvious chest-puffing bullshit. It's normal, even to the extent of being a best practice, to have no redundancy for rpool on systems that can tolerate gaps in availability, because you can reinstall from the livecd relatively quickly.

    re> It is disingenuous to complain about multiple failures

strongly disagree. I'm quite genuine.

A really common and really terrible suggestion is, ``get an SSD, and put your rpool in one slice and your slog in another.'' If you do that and lose the SSD, you've lost the whole pool. You cannot recover with 'zpool clear' or any number of -f -F -FFF flags. This common scenario doesn't require any multiple failure.

Now, even among those who don't do this, people following your suggestions will not design their systems realizing the rpool and the SSD make up a redundant pair. They will not see: you can lose the rpool and import the pool IFF you have the SSD, and you can lose the SSD and force-online the pool IFF you have the rpool with the missing-slog pool already imported to it. They will instead design following the raidz/mirroring failure rules, treating the slog as disposable, like you've told them, and this is flat wrong.

Hiding behind fuzzy glossary terms like ``multiple failures'' is useless, IMHO to the point of being deliberately obtuse. Besides that, you don't need any multiple failures---all you need to do is make the mistake of typing the perfectly reasonable command 'zpool export' in the course of trying to fix your problem, and poof, your whole pool is gone.

A pool that runs fine until you try to export and re-import it, after which it is permanently lost, is a ticking time bomb. I don't think it's a good idea to run that way at all because of the flexible tools one needs to have available for maintenance in a disaster (ex., livecd of newer version with special import -F rescue-magic in it, WON'T WORK. moving drives to a different controller causing them to have a different devid, WON'T WORK. accumulate enough of these and not only does your toolkit get smaller and weaker, but you must move slowly and with great fear because the slightest move can make everything explode in totally unobvious ways.)

If you do want to run this way, as an absolute MINIMUM, you need to discuss this cannot-import case at moments like this one so that it can influence people's designs.

It seems if I say it the long way, I get ignored. If I say it the short way, you dive into every corner case. I don't know how to be any more clear, so... good luck out there, y'all.
So if the Intel X25-E is a bad device - can anyone recommend an SLC device with good firmware? (Or an MLC drive that performs as well?)

I've got 80 spindles in 5 16-bay drive shelves (76 15k RPM SAS drives in 19 4-disk raidz sets, 2 hot spares, and 2 bays set aside for a mirrored ZIL) connected to two servers (so if one fails I can import on the other one). Host based cards are not an option for my ZIL - I need something that sits in the array and can be imported by the other system.

I was planning on using a pair of mirrored SLC based Intel X25-Es because of their superior write performance, but if it's going to destroy my pool then it's useless.

Does anyone else have something that can match their write performance without breaking ZFS?
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Bob Friesenhahn
>
> Sounds great. What performance does the DDRdrive X1 provide for this
> simple NFS write test from a single client over gigabit ethernet?
> This seems to be the test of the day.
>
>   time tar jxf gcc-4.4.3.tar.bz2
>
> I get 22 seconds locally and about 6-1/2 minutes from an NFS client.

There's no point trying to accelerate your disks if you're only going to use a single client over gigabit. Assuming you've got some sort of nontrivial server infrastructure, and you've got many clients doing things simultaneously and more than a 1Gb network connection, then it can become worthwhile. Also, if you do work on the physical server on local disks, that can also be worthwhile.
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Don
>
> I've got 80 spindles in 5 16-bay drive shelves (76 15k RPM SAS drives
> in 19 4-disk raidz sets, 2 hot spares, and 2 bays set aside for a
> mirrored ZIL) connected to two servers (so if one fails I can import on
> the other one). Host based cards are not an option for my ZIL - I need
> something that sits in the array and can be imported by the other
> system.
>
> I was planning on using a pair of mirrored SLC based Intel X25-Es
> because of their superior write performance, but if it's going to
> destroy my pool then it's useless.
>
> Does anyone else have something that can match their write performance
> without breaking ZFS?

You're not going to break ZFS with the X25's, if you just get 2 of them and make them a mirror. But be aware that all sync writes will go to these devices, and if you've got 80 spindles, it's possible that 1 mirror might not be enough for your optimal performance. You might gain more by using more than one pair. I don't know any way to test it other than getting your hands on more than one pair and seeing what results you get.
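If one pair does turn out to be the bottleneck, additional mirrored pairs can be added later and ZFS will spread log writes across the log vdevs. A minimal sketch, again with placeholder pool and device names:

  # Start with one mirrored slog pair...
  zpool add tank log mirror c3t0d0 c3t1d0

  # ...and add a second pair later if zilstat/iostat show it saturating;
  # log writes are then balanced across both mirrors.
  zpool add tank log mirror c3t2d0 c3t3d0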
If you have a pair of heads talking to shared disks with ZFS - what can you do to ensure the second head always has a current copy of the zpool.cache file? I'd prefer not to lose the ZIL, fail over, and then suddenly find out I can't import the pool on my second head.
But if the X25-E doesn't honor cache flushes then it really doesn't matter if they are mirrored - they both may cache the data, not write it out, and leave me screwed. I'm running 2009.06 and not one of the newer developer candidates that handle ZIL losses gracefully (or at all - at least as far as I understand things).

As for optimal performance - a single pair probably won't give me optimal performance, but based on all the numbers I've seen it's still going to beat using the pool disks. If I find the ZIL is still a bottleneck I'll definitely add a second set of SSDs - but I've got a lot of testing to do before I get there.
On Sun, Apr 18, 2010 at 07:02:38PM -0700, Don wrote:
> If you have a pair of heads talking to shared disks with ZFS - what can
> you do to ensure the second head always has a current copy of the
> zpool.cache file? I'd prefer not to lose the ZIL, fail over, and then
> suddenly find out I can't import the pool on my second head.

Replicated backups of your running BE, like for many other reasons.

--
Dan.
I'm not sure to what you are referring when you say my "running BE".

I haven't looked at the zpool.cache file too closely, but if the devices don't match between the two systems for some reason - isn't that going to cause a problem? I was really asking if there is a way to build the cache file without importing the disks.
On Sun, 18 Apr 2010, Edward Ned Harvey wrote:

>> This seems to be the test of the day.
>>
>>   time tar jxf gcc-4.4.3.tar.bz2
>>
>> I get 22 seconds locally and about 6-1/2 minutes from an NFS client.
>
> There's no point trying to accelerate your disks if you're only going to use
> a single client over gigabit.

This is a really strange statement. It does not make any sense. It makes about as much sense as saying that if you have only one car there is no need for it to be able to go faster than 10 mph, but if you have 60 cars, then it is worthwhile for the cars to each be able to go 60 mph. The driver of that lone 10 mph car will not be very happy.

On a different discussion thread, one fellow was able to drop the tar file extraction time from 92 minutes to just under 7 minutes. As a user of the client system, he is much happier.

Probably the DDRDrive is able to go faster since it should have lower latency than a FLASH SSD drive. However, it may have some bandwidth limits on its interface.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
On Sun, Apr 18, 2010 at 10:33:36PM -0500, Bob Friesenhahn wrote:
> Probably the DDRDrive is able to go faster since it should have lower
> latency than a FLASH SSD drive. However, it may have some bandwidth
> limits on its interface.

It clearly has some. They're just as clearly well in excess of those applicable to a SATA-interface SSD, even a DRAM-based one like the acard. In return, the SATA SSD has some deployment options (in an external JBOD, for example) not as readily accessible to a PCI device.

I'd be curious to compare mirroring these kinds of devices across server heads, using comstar and some suitable interconnect, as a comparison to slogs colocated with the drives.

--
Dan.
On Apr 18, 2010, at 7:02 PM, Don wrote:

> If you have a pair of heads talking to shared disks with ZFS - what can you
> do to ensure the second head always has a current copy of the zpool.cache
> file?

By definition, the zpool.cache file is always up to date.

> I'd prefer not to lose the ZIL, fail over, and then suddenly find out I
> can't import the pool on my second head.

I'd rather not have multiple failures, either. But the information needed in the zpool.cache file for reconstructing a missing (as in destroyed) top-level vdev is easily recovered from a backup or snapshot.
 -- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com
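Keeping such a backup is a one-liner; a sketch, with the backup paths below being placeholders only. zdb -C prints the cached configuration in readable form, which is worth stashing alongside the binary file:

  # Periodically copy the cache file and a human-readable dump of it
  # somewhere off the root pool (destination paths are examples).
  cp /etc/zfs/zpool.cache /backup/zpool.cache.node1
  zdb -C > /backup/zpool-config.node1.txt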
On Sun, Apr 18, 2010 at 07:37:10PM -0700, Don wrote:
> I'm not sure to what you are referring when you say my "running BE"

Running boot environment - the filesystem holding /etc/zpool.cache

--
Dan.
On Mon, Apr 19, 2010 at 03:37:43PM +1000, Daniel Carosone wrote:
> the filesystem holding /etc/zpool.cache

or, indeed, /etc/zfs/zpool.cache :-)

--
Dan.
By the way, I would like to chip in about how informative this thread has been, at least for me, despite (and actually because of) the strong opinions on some of the posts about the issues involved.

From what I gather, there is still an interesting failure possibility with ZFS, although probably rare. In the case where a zil (aka slog) device fails, AND the zpool.cache information is not available, basically folks are toast?

In addition, the zpool.cache itself exhibits the following behaviors (and I could be totally wrong, this is why I ask):

A. It is not written to frequently, i.e., it is not a performance impact unless new zfs file systems (pardon me if I have the incorrect terminology) are not being fabricated and supplied to the underlying operating system.

B. The current implementation stores that cache file on the zil device, so if for some reason, that device is totally lost (along with said .cache file), it is nigh impossible to recover the entire pool it correlates with.

possible solutions:

1. Why not have an option to mirror that darn cache file (like to the root file system of the boot device at least as an initial implementation) no matter what intent log devices are present? Presuming that most folks at least want enough redundancy that their machine will boot, and if it boots - then they have a shot at recovery of the balance of the associated (zfs) directly attached storage, and with my other presumptions above, there is little reason not to offer a feature like this?

Respectfully,
- mike

On Apr 18, 2010, at 10:10 PM, Richard Elling wrote:

> On Apr 18, 2010, at 7:02 PM, Don wrote:
>
>> If you have a pair of heads talking to shared disks with ZFS - what can you
>> do to ensure the second head always has a current copy of the zpool.cache
>> file?
>
> By definition, the zpool.cache file is always up to date.
>
>> I'd prefer not to lose the ZIL, fail over, and then suddenly find out I
>> can't import the pool on my second head.
>
> I'd rather not have multiple failures, either. But the information needed in
> the zpool.cache file for reconstructing a missing (as in destroyed) top-level
> vdev is easily recovered from a backup or snapshot.
> -- richard
Also, pardon my typos, and my lack of re-titling my subject to note that it is a fork from the original topic. Corrections in text that I noticed after finally sorting out getting on the mailing list are below...

On Apr 19, 2010, at 3:26 AM, Michael DeMan wrote:

> A. It is not written to frequently, i.e., it is not a performance impact
> unless new zfs file systems (pardon me if I have the incorrect terminology)
> are not being fabricated and supplied to the underlying operating system.

The above 'are not being fabricated' should be 'are regularly being fabricated'.

> B. The current implementation stores that cache file on the zil device, so
> if for some reason, that device is totally lost (along with said .cache
> file), it is nigh impossible to recover the entire pool it correlates with.

The above, 'on the zil device', should say 'on the fundamental zfs file system itself, or a zil device if one is provisioned'.

> 1. Why not have an option to mirror that darn cache file (like to the root
> file system of the boot device at least as an initial implementation) no
> matter what intent log devices are present? Presuming that most folks at
> least want enough redundancy that their machine will boot, and if it boots -
> then they have a shot at recovery of the balance of the associated (zfs)
> directly attached storage, and with my other presumptions above, there is
> little reason not to offer a feature like this?

Missing final sentence: The vast amount of problems with computer and network reliability is typically related to human error. The more '9s' that can be intrinsically provided by the systems themselves helps mitigate this.
I would advise getting familiar with the basic terminology and vocabulary of ZFS first. Start with the Solaris 10 ZFS Administration Guide. It's a bit more complete for a newbie.

http://docs.sun.com/app/docs/doc/819-5461?l=en

You can then move on to the Best Practices Guide, Configuration Guide, Troubleshooting Guide and Evil Tuning Guide on solarisinternals.com:

http://www.solarisinternals.com//wiki/index.php?title=Category:ZFS

All of the features in ZFS on Solaris 10 appear in OpenSolaris; the inverse does not necessarily hold true, as active development occurs on the OpenSolaris trunk and updates take about a year to filter back down into Solaris due to integration concerns, testing, etc.

A Separate Log (SLOG) device can be used for a ZIL, but they are not necessarily the same thing. The ZIL always exists, and is part of the pool if you have not defined a SLOG device.

The zpool.cache file does not reside in the pool. It lives in /etc/zfs in the root file system of your OpenSolaris system. Thus, it does not reside "on the ZIL device" either, since there may not necessarily be a SLOG (what you would term a "ZIL device") anyway. (There is always a ZIL, though. See remarks above.)

Hopefully that clears up some of the misconceptions and misunderstandings you have.

Cheers!

On Mon, Apr 19, 2010 at 06:52, Michael DeMan <solaris at deman.com> wrote:

> Also, pardon my typos, and my lack of re-titling my subject to note that it
> is a fork from the original topic. Corrections in text that I noticed after
> finally sorting out getting on the mailing list are below...
> [...]

--
"You can choose your friends, you can choose the deals." - Equity Private
"If Linux is faster, it's a Solaris bug." - Phil Harman

Blog - http://whatderass.blogspot.com/
Twitter - @khyron4eva
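To see what the cache file actually contains, and to compare it with what is written on the disks themselves, zdb can read both. A small sketch; the device path is a placeholder, and slice 0 is assumed to hold the ZFS label:

  # Dump the cached pool configuration(s) from /etc/zfs/zpool.cache.
  zdb -C

  # Dump the ZFS labels written on a device itself; the same configuration
  # data lives on every vdev, which is what "zpool import" scans for.
  zdb -l /dev/dsk/c2t0d0s0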
> From: Bob Friesenhahn [mailto:bfriesen at simple.dallas.tx.us]
>
> On Sun, 18 Apr 2010, Edward Ned Harvey wrote:
>
>>> This seems to be the test of the day.
>>>
>>>   time tar jxf gcc-4.4.3.tar.bz2
>>>
>>> I get 22 seconds locally and about 6-1/2 minutes from an NFS client.
>>
>> There's no point trying to accelerate your disks if you're only going
>> to use a single client over gigabit.
>
> This is a really strange statement. It does not make any sense.

I'm saying that even a single pair of disks (maybe 4 disks if you're using cheap slow disks) will outperform a 1Gb Ethernet. So if your bottleneck is the 1Gb Ethernet, you won't gain anything (significant) by accelerating the stuff that isn't the bottleneck.
In all honesty, I haven't done much at sysadmin level with Solaris since it was SunOS 5.2. I found ZFS after becoming concerned with the reliability of traditional RAID5 and RAID6 systems once drives exceeded 500GB. I have a few months running ZFS on FreeBSD lately on a test/augmentation basis with 1TB drives in older hardware. Thus far, it seems very promising. As other people have pointed out though, one's mileage may vary. I am interested in a blend of performance, reliability and cost. I think ZFS can deliver all three across the board.

You are right - if I am not aware enough yet of the docs to know the difference between a zil device and a slog device, I guess I need to finally hit the books on this one some more. ZFS seems both stable enough and I think also has enough 'cool factor' to it, that it's probably about time there were some books available? Perhaps if/when Solaris 10 gets de-dupe that will be the breaker/maker?

I have a couple more comments down below. Thanks for the response, and once more - I have very much been enjoying the 'SSD best practices' thread.

On Apr 19, 2010, at 4:12 AM, Khyron wrote:

> All of the features in ZFS on Solaris 10 appear in OpenSolaris; the inverse
> does not necessarily hold true, as active development occurs on the
> OpenSolaris trunk and updates take about a year to filter back down into
> Solaris due to integration concerns, testing, etc.

Yes, I understand this. When the heck is de-dupe coming into Solaris 10? People could save enough money on disks (not to mention the power bills and the cooling costs) to upgrade maybe?

> The zpool.cache file does not reside in the pool. It lives in /etc/zfs in
> the root file system of your OpenSolaris system. Thus, it does not reside
> "on the ZIL device" either, since there may not necessarily be a SLOG (what
> you would term a "ZIL device") anyway. (There is always a ZIL, though. See
> remarks above.)

I have one test box, running FreeBSD 8, not Solaris, and have no /etc/zfs/zpool.cache or /usr/local/etc/zpool.cache. I will check on another list about that and how they are handling it.
Yes yes- /etc/zfs/zpool.cache - we all hate typos :) -- This message posted from opensolaris.org
I must note that you haven't answered my question... If the zpool.cache file differs between the two heads for some reason- how do I ensure that the second head has an accurate copy without importing the ZFS pool?
-- This message posted from opensolaris.org
I'm not certain if I'm misunderstanding you- or if you didn't read my post carefully.

Why would the zpool.cache file be current on the _second_ node? The first node is where I've added my zpools and so on. The second node isn't going to have an updated cache file until I export the zpool from the first system and import it to the second system, no?

In my case- I believe both nodes have exactly the same view of the disks- all the controllers and targets are identical- but there is no reason they have to be as far as I know. As such- simply backing up the primary system's zpool.cache to the secondary could cause problems.

I'm simply curious if there is a way for a node to keep its zpool.cache up to date without actually importing the zpool. i.e. is there a scandisks command that can scan for a zpool without importing it?

Am I misunderstanding something here?
-- This message posted from opensolaris.org
On Mon, April 19, 2010 07:32, Edward Ned Harvey wrote:

> I'm saying that even a single pair of disks (maybe 4 disks if you're using
> cheap slow disks) will outperform a 1Gb Ethernet.  So if your bottleneck
> is the 1Gb Ethernet, you won't gain anything (significant) by accelerating
> the stuff that isn't the bottleneck.

Would it help in improving IOPS or latency for more random workloads? It may not be that you're pushing bandwidth, but rather a lot of (say) NFS writes. It could potentially cause a lot of seeks, even with striped mirrors.
On Mon, April 19, 2010 06:26, Michael DeMan wrote:

> B. The current implementation stores that cache file on the zil device,
> so if for some reason, that device is totally lost (along with said .cache
> file), it is nigh impossible to recover the entire pool it correlates
> with.

Given that ZFS is always consistent on-disk, why would you lose a pool if you lose the ZIL and/or cache file? Theoretically shouldn't you lose, at most, the last few transactions?

With recent updates to ZFS you can do a forced import giving "informed consent" to go back to a previous uber-block.
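For reference, here is roughly what that forced import looks like on builds recent enough to support pool recovery (a sketch only; the pool name "tank" is just an example):

    # Ask ZFS what rolling back would discard, without actually doing it
    zpool import -F -n tank

    # Then perform the recovery import, discarding the last few
    # transactions and returning to an earlier consistent uberblock
    zpool import -F tank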
On Mon, 19 Apr 2010, Don wrote:

> If the zpool.cache file differs between the two heads for some
> reason- how do I ensure that the second head has an accurate copy
> without importing the ZFS pool?

The zpool.cache file can only be valid for one system at a time. If the pool is imported to a different system, then the zpool.cache file generated on that system will be different due to differing device names and a different host name.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Ok- I think perhaps I'm failing to explain myself.

I want to know if there is a way for a second node- connected to a set of shared disks- to keep its zpool.cache up to date _without_ actually importing the ZFS pool.

As I understand it- keeping the zpool.cache up to date on the second node would provide additional protection should the slog fail at the same time my primary head failed (it should also improve import times, if what I've read is true).

I understand that importing the disks to the second node will update the cache file- but by that time it may be too late. I'd like to update the cache file _before_ then. I see no reason why the second node couldn't scan the disks being used by the first node and then update its zpool.cache.
-- This message posted from opensolaris.org
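One partial answer, for the "scandisks" part at least (a sketch; this only discovers pools, it does not keep any cache file in sync): running "zpool import" with no pool name scans the devices and reports what it finds, without importing anything.

    # Scan the default /dev/dsk for pools that could be imported
    zpool import

    # Or point the scan at a specific device directory
    zpool import -d /dev/dsk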
On Mon, 19 Apr 2010, Edward Ned Harvey wrote:

>>> There's no point trying to accelerate your disks if you're only
>>> going to use a single client over gigabit.
>>
>> This is a really strange statement.  It does not make any sense.
>
> I'm saying that even a single pair of disks (maybe 4 disks if you're using
> cheap slow disks) will outperform a 1Gb Ethernet.  So if your bottleneck is
> the 1Gb Ethernet, you won't gain anything (significant) by accelerating the
> stuff that isn't the bottleneck.

That is true.

For the record, this is the size of the uncompressed tarball:

  % du -sh gcc-4.4.3.tar
  409M    gcc-4.4.3.tar

Expecting close to 100 MB/second for the data transfer over gigabit, this may place a cap on achievable performance at around 4 seconds. The test included bzip2 decompression, and it is not clear (without testing) whether the bzip2 decompression increases or decreases the available data flow to the network.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
On 19/04/2010 16:46, Don wrote:

> I want to know if there is a way for a second node- connected to
> a set of shared disks- to keep its zpool.cache up to date
> _without_ actually importing the ZFS pool.

See zpool(1M):

     cachefile=path | none

         Controls the location of where the pool configuration is
         cached. Discovering all pools on system startup requires a
         cached copy of the configuration data that is stored on the
         root file system. All pools in this cache are automatically
         imported when the system boots. Some environments, such as
         install and clustering, need to cache this information in a
         different location so that pools are not automatically
         imported. Setting this property caches the pool configuration
         in a different location that can later be imported with
         "zpool import -c". Setting it to the special value "none"
         creates a temporary pool that is never cached, and the
         special value "" (empty string) uses the default location.

         Multiple pools can share the same cache file. Because the
         kernel destroys and recreates this file when pools are added
         and removed, care should be taken when attempting to access
         this file. When the last pool using a cachefile is exported
         or destroyed, the file is removed.

--
Darren J Moffat
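To make that concrete, a rough sketch of how the property tends to be used in a two-headed setup (the pool name, cachefile path and devices below are only examples):

    # Create (or re-import) the shared pool with a non-default cachefile,
    # so neither head auto-imports it at boot
    zpool create -o cachefile=/etc/cluster/zpool.cache tank mirror c2t0d0 c2t1d0

    # On failover, the other head can import it with a plain device scan
    zpool import tank

    # Or, if it has a copy of the cachefile, read the config from that
    zpool import -c /etc/cluster/zpool.cache tank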
That section of the man page is actually helpful- as I wasn't sure what I was going to do to ensure the nodes didn't try to bring up the zpool on their own- outside of clustering software or my own intervention.

That said- it still doesn't explain how I would keep the secondary node's zpool.cache up to date.

If I create a zpool on the first node, import it on the second, then move it back to the first, now they both have a current zpool.cache. If I add additional disks to the first node- how do I get the second node's cache file current without first importing the disks?
-- This message posted from opensolaris.org
On 19/04/2010 17:13, Don wrote:

> That section of the man page is actually helpful- as I wasn't sure what I was
> going to do to ensure the nodes didn't try to bring up the zpool on their own-
> outside of clustering software or my own intervention.
>
> That said- it still doesn't explain how I would keep the secondary node's
> zpool.cache up to date.

That is the job of the cluster software.

> If I create a zpool on the first node, import it on the second,
> then move it back to the first, now they both have a
> current zpool.cache. If I add additional disks to the first node- how do I
> get the second node's cache file current without first importing the disks?

The point of the cachefile zpool option is that there aren't two copies of the zpool.cache file; there is only one.

--
Darren J Moffat
Now I'm simply confused.

Do you mean one cachefile shared between the two nodes for this zpool? How, may I ask, would this work?

The rpool should be in /etc/zfs/zpool.cache.

The shared pool should be in /etc/cluster/zpool.cache (or wherever you prefer to put it) so it won't come up on system start.

What I don't understand is how the second node is either a) supposed to share the first node's cachefile or b) create its own without importing the pool.

You say this is the job of the cluster software- does ha-cluster already handle this with their ZFS modules?

I've asked this question 5 different ways and I either still haven't gotten an answer- or still don't understand the problem. Is there a way for a passive node to generate its _own_ zpool.cache without importing the file system? If so- how? If not- why is this unimportant?
-- This message posted from opensolaris.org
On 19/04/2010 17:50, Don wrote:

> Now I'm simply confused.
>
> Do you mean one cachefile shared between the two nodes for this zpool? How, may I ask, would this work?

Either that or a way for the nodes to update each other's copy very quickly, such as a parallel filesystem. It is the job of the cluster software to provide a mechanism to do that.

> What I don't understand is how the second node is either a) supposed to share the first node's cachefile or b) create its own without importing the pool.

Are you writing your own cluster framework?

> You say this is the job of the cluster software- does ha-cluster already handle this with their ZFS modules?

Searching google for "solaris ha-cluster zfs zpool.cache" I found this opensolaris.org thread on the ha-cluster list:

http://opensolaris.org/jive/thread.jspa?messageID=338413

That thread has information on which cluster release/patch is needed.

> I've asked this question 5 different ways and I either still haven't gotten an answer- or still don't understand the problem.

My apologies, but I jumped in partway through the thread. Are you writing your own cluster software, or are you trying to use an already existing cluster framework that already supports ZFS?

--
Darren J Moffat
On Apr 19, 2010, at 9:50 AM, Don wrote:

> Now I'm simply confused.

In one sentence, the cachefile keeps track of what is currently imported.

> Do you mean one cachefile shared between the two nodes for this zpool? How, may I ask, would this work?

Each OS instance has a default cachefile.

> The rpool should be in /etc/zfs/zpool.cache.

Yes, this is the default for OpenSolaris distributions.

> The shared pool should be in /etc/cluster/zpool.cache (or wherever you prefer to put it) so it won't come up on system start.

Correct.

> What I don't understand is how the second node is either a) supposed to share the first node's cachefile or b) create its own without importing the pool.

a) it doesn't
b) it doesn't

> You say this is the job of the cluster software- does ha-cluster already handle this with their ZFS modules?

Yes.

> I've asked this question 5 different ways and I either still haven't gotten an answer- or still don't understand the problem.

See below.

> Is there a way for a passive node to generate its _own_ zpool.cache without importing the file system? If so- how? If not- why is this unimportant?

No. And this is unimportant.

The bit of context missing here is the answer to the question: why do we want to keep a backup of the cache file? The answer is that the cache file contains a record of the GUID for each disk. In the event that a disk is destroyed, there are some cases where the pool can be brought online if the GUIDs of the destroyed disks are known. This is not a typical recovery method and has rarely been needed.

Please do not confuse the discussion of the desire to keep a copy of the cachefile with the greater desire to keep a record of the GUIDs. In this context, the cachefile is a convenient record of the GUIDs.
-- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com
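If all you really want is a record of the GUIDs kept somewhere safe, you can collect them without copying zpool.cache at all. A sketch (pool and device names are only examples):

    # Dump the on-disk label of a pool device; the vdev tree it prints
    # includes the guid of every disk and log device
    zdb -l /dev/rdsk/c2t0d0s0

    # Or dump the cached configuration of the imported pool
    zdb -C tank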
On Apr 19, 2010, at 12:50 PM, Don <don at blacksun.org> wrote:

> Now I'm simply confused.
>
> Do you mean one cachefile shared between the two nodes for this
> zpool? How, may I ask, would this work?
>
> The rpool should be in /etc/zfs/zpool.cache.
>
> The shared pool should be in /etc/cluster/zpool.cache (or wherever
> you prefer to put it) so it won't come up on system start.
>
> What I don't understand is how the second node is either a) supposed
> to share the first node's cachefile or b) create its own without
> importing the pool.
>
> You say this is the job of the cluster software- does ha-cluster
> already handle this with their ZFS modules?
>
> I've asked this question 5 different ways and I either still haven't
> gotten an answer- or still don't understand the problem.
>
> Is there a way for a passive node to generate its _own_ zpool.cache
> without importing the file system? If so- how? If not- why is this
> unimportant?

I don't run the cluster suite, but I'd be surprised if the software doesn't copy the cache to the passive node whenever it's updated.

-Ross
I apologize- I didn't mean to come across as rude- I'm just not sure if I'm asking the right question.

I'm not ready to use the ha-cluster software yet as I haven't finished testing it. For now I'm manually failing over from the primary to the backup node. That will change- but I'm not ready to go there yet.

As such I'm trying to make sure both my nodes have a current cache file so that the targets and GUIDs are ready.
-- This message posted from opensolaris.org
Edward Ned Harvey wrote:

> I'm saying that even a single pair of disks (maybe 4 disks if you're using
> cheap slow disks) will outperform a 1Gb Ethernet.  So if your bottleneck is
> the 1Gb Ethernet, you won't gain anything (significant) by accelerating the
> stuff that isn't the bottleneck.

And you are confusing throughput with latency (in a sense). Yes, my raidz2 pool is much faster than GigE, measured in pipelined, streaming throughput (for those old fogeys like me, think good kermit / zmodem). Yet it still sucks for small non-pipelined synchronous writes, such as those triggered by NFS + tar (think xmodem *shudder*).

So spending around $150 for a greater than 10x performance improvement makes a hell of a lot of sense. And yes, spending even more on an ACARD flash-backed DRAM solution would probably make things even faster, but I was feeling cheap ;-)

As for the DDRdrive X1, it is not a solution I would recommend to anyone in its current state. If it gets a supercap / battery that enables it to dump its data to non-volatile storage following a power cut, that will change, and I know they have taken our feedback and are revising the product.

-- Carson
I understand that the important bit about having the cachefile is the GUIDs (although the disk record is, I believe, helpful in improving import speeds) so we can recover in certain oddball cases. As such- I'm still confused why you say it's unimportant.

Is it enough to simply copy the /etc/cluster/zpool.cache file from the primary node to the secondary so that I at least have the GUIDs, even if the disk references (the /dev/dsk sections) might not match?
-- This message posted from opensolaris.org
To clarify, the DDRdrive X1 is not an option for OpenSolaris today, irrespective of specific features, because the driver is not yet available. When our OpenSolaris device driver is released, later this quarter, the X1 will have updated firmware to automatically provide backup/restore based on an external power source. We hope the X1 will be the first in a family of products, where future iterations will also offer an internal power source option. Feedback from this list also played a decisive role in our forthcoming strategy to focus exclusively on serving the ZFS dedicated log market. Thanks, Christopher George Founder/CTO www.ddrdrive.com -- This message posted from opensolaris.org
>>>>> "dm" == David Magda <dmagda at ee.ryerson.ca> writes:dm> Given that ZFS is always consistent on-disk, why would you dm> lose a pool if you lose the ZIL and/or cache file? because of lazy assertions inside ''zpool import''. you are right there is no fundamental reason for it---it''s just code that doesn''t exist. If you are a developer you can probably still recover your pool, but there aren''t any commands with a supported interface to do it. ''zpool.cache'' doesn''t contain magical information, but it allows you to pass through a different code path that doesn''t include the ``BrrkBrrk, omg panic device missing, BAIL OUT HERE'''' checks. I don''t think squirreling away copies of zpool.cache is a great way to make your pool safe from slog failures because there may be other things about the different manual ''zpool import'' codepath that you need during a disaster, like -F, which will remain inaccessible to you if you rely on some saving-your-zpool.cache hack, even if your hack ends up actually working when the time comes, which it might not. I think is really interesting, the case of an HA cluster using a single-device slog made from a ramdisk on the passive node. This case would also become safer if slogs were fully disposeable. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20100419/1cddb626/attachment.bin>
I think the DDR drive has a battery and can dump to a CF card.

-B

Sent from my Nexus One.

On Apr 19, 2010 10:41 AM, "Carson Gaspar" <carson at taltos.org> wrote:

> Edward Ned Harvey wrote:
> > I'm saying that even a single pair of disks (maybe 4 disks if you're
> > using cheap slow disks) will outperform a 1Gb Ethernet.
>
> And you are confusing throughput with latency (in a sense). Yes, my
> raidz2 pool is much faster than GigE, measured in pipelined, streaming
> throughput (for those old fogeys like me, think good kermit / zmodem).
> Yet it still sucks for small non-pipelined synchronous writes, such as
> those triggered by NFS + tar (think xmodem *shudder*).
>
> So spending around $150 for a greater than 10x performance improvement
> makes a hell of a lot of sense. And yes, spending even more on an ACARD
> flash-backed DRAM solution would probably make things even faster, but I
> was feeling cheap ;-)
>
> As for the DDRdrive X1, it is not a solution I would recommend to anyone
> in its current state. If it gets a supercap / battery that enables it to
> dump its data to non-volatile storage following a power cut, that will
> change, and I know they have taken our feedback and are revising the
> product.
>
> -- Carson
Continuing on the best practices theme- how big should the ZIL slog disk be?

The ZFS evil tuning guide suggests enough space for 10 seconds of my synchronous write load- even assuming I could cram 20 gigabits/sec into the host (2 x 10 GigE NICs), that only comes out to 200 gigabits, which = 25 gigabytes.

I'm currently planning to use 4 x 32GB SSDs arranged in two 2-way mirrors, which should give me 64GB of log space. Is there any reason to believe that this would be insufficient (especially considering I can't begin to imagine being able to cram 5 Gb/s into the host- let alone 20)?

Are there any guidelines on how much ZIL performance should increase with 2 SSD slogs (4 disks with mirrors) over a single SSD slog (2 disks mirrored)?
-- This message posted from opensolaris.org
> I think the DDR drive has a battery and can dump to a CF card.

The DDRdrive X1's automatic backup/restore feature utilizes on-board SLC NAND (high quality Flash) and is completely self-contained. Neither the backup nor restore feature involves data transfer over the PCIe bus or to/from removable media.

Thanks,

Christopher George
Founder/CTO
www.ddrdrive.com
-- This message posted from opensolaris.org
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Don
>
> Continuing on the best practices theme- how big should the ZIL slog
> disk be?
>
> The ZFS evil tuning guide suggests enough space for 10 seconds of my
> synchronous write load- even assuming I could cram 20 gigabits/sec into
> the host (2 x 10 GigE NICs), that only comes out to 200 gigabits, which
> = 25 gigabytes.
>
> I'm currently planning to use 4 x 32GB SSDs arranged in two 2-way mirrors,
> which should give me 64GB of log space. Is there any reason to believe
> that this would be insufficient (especially considering I can't begin
> to imagine being able to cram 5 Gb/s into the host- let alone 20)?
>
> Are there any guidelines on how much ZIL performance should increase
> with 2 SSD slogs (4 disks with mirrors) over a single SSD slog (2 disks
> mirrored)?

I think the size of the ZIL log is basically irrelevant ... For example, I remember reading somewhere that the system refuses to use more than 50% of the size of RAM, yet you can hardly even think about buying an SSD smaller than 32G. If you've got a 64G RAM system, you're probably not going to use only a single SSD, just due to the fact that you've probably got dozens of disks attached, and you'll probably use multiple log devices striped just for the sake of performance.

Improbability assessment aside, suppose you use something like the DDRdrive X1 ... which might be more like 4G instead of 32G ... Is it even physically possible to write 4G to any device in less than 10 seconds? Remember, to achieve worst case, highest demand on the ZIL log device, these would all have to be <32kbyte writes (default configuration), because larger writes will go directly to primary storage, with only the intent landing on the ZIL.

To try and quantify this a little closer, suppose all your writes are 31K (worst case for a typical setup) ... meaning they're as large as possible while still going to the log device instead of primary storage. Suppose you get 2000 IOPS (which is roughly typical according to my benchmarks); then you're writing a little less than 64 Mbytes/sec, and you won't even come close to reaching 1G within 10 seconds.

As a cross-check, assume it's a PCIe 2.0 x1 bus. This is 500 Mbytes/sec theoretical maximum. So in 10 seconds, 5 Gbyte theoretical maximum. How about if you're using SAS 6Gbit devices, with the unrealistic assumption that you can write 6Gbits? Well, that's 750 Mbytes/sec, unrealistically high, so 7.5G in 10 seconds. Which I know to be not just unrealistic, but ridiculously overestimated, by at least one order of magnitude.

So, although I don't have any physical machine to test or verify this on, I have a very educated guess which says even the smallest nonvolatile device will be more than you can use for your ZIL log. Size doesn't matter. Just speed. (And reliability, price, etc.)
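Spelling out the arithmetic behind that worst case (these numbers simply restate the assumptions above; they are not measurements):

    31 KB/write x 2000 writes/sec  ~=  62 MB/sec to the log device
    62 MB/sec x 10 sec             ~= 620 MB in a 10-second window

So even the worst-case intent-log stream stays well under 1 GB in that window.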
> I think the size of the ZIL log is basically irrelevant

That was the understanding I got from reading the various blog posts and the tuning guide.

> only a single SSD, just due to the fact that you've probably got dozens of
> disks attached, and you'll probably use multiple log devices striped just
> for the sake of performance.

I've got 72 (possibly 76) 15k RPM 300GB and 600GB SAS drives, and my head has 16 GB of RAM, though that can be increased at any time to 32GB. My current plan is to use 4 x 32GB SLC write-optimized SSDs in a striped-mirrors configuration.

I'm curious if anyone knows how ZIL slog performance scales. For example- how much benefit would you expect from 2 SSD slogs over 1? Would there be a significant benefit to 3 over 2, or does it begin to taper off? I'm sure a lot of this is dependent on the environment- but rough ideas are good to know.

Is it safe to assume that a stripe across two mirrored write-optimized SSDs is going to give me the best performance for 4 available drive bays (assuming I want the ZIL to remain safe)?

> Is it even physically possible to write 4G to any device in less than 10 seconds?

I wasn't actually sure the 10 second number was still accurate- that was definitely part of my question. If it is- then yes- I could never fill a 32 GB ZIL, let alone a 64GB one.

Thanks for all of the help and advice.
-- This message posted from opensolaris.org
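For what it's worth, the layout described above would be added like this (hypothetical pool and device names); log writes are striped across the two mirrors automatically:

    # Two mirrored log devices = a stripe of two 2-way slog mirrors
    zpool add tank log mirror c3t0d0 c3t1d0 mirror c3t2d0 c3t3d0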
On Mon, 19 Apr 2010, Don wrote:

> Continuing on the best practices theme- how big should the ZIL slog disk be?
>
> The ZFS evil tuning guide suggests enough space for 10 seconds of my
> synchronous write load- even assuming I could cram 20 gigabits/sec
> into the host (2 x 10 GigE NICs), that only comes out to 200 gigabits,
> which = 25 gigabytes.

Note that large writes bypass the dedicated intent log device entirely and go directly to a ZIL on primary disk. This is because SSDs typically have much less raw bandwidth than primary disk does. If you are writing bulk data, the larger chunks will go directly to primary disk, and the smaller bits (e.g. smaller writes, metadata, filenames, directories, etc.) will go to the dedicated intent log device. This means that the device does not need to be as large as you may think it should be.

Use the 'zilstat' DTrace script to evaluate what really happens on your system before you invest in extra hardware.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
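A quick sketch of running it, assuming the downloaded script is saved as zilstat.ksh and run as root while a representative workload is active:

    # Sample ZIL activity every 10 seconds for 30 samples; the output
    # shows the bytes and operations going through the in-memory ZIL,
    # which is what a dedicated log device would have to absorb
    ./zilstat.ksh 10 30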
On Mon, 19 Apr 2010, Edward Ned Harvey wrote:

> Improbability assessment aside, suppose you use something like the DDRdrive
> X1 ... which might be more like 4G instead of 32G ... Is it even physically
> possible to write 4G to any device in less than 10 seconds? Remember, to
> achieve worst case, highest demand on the ZIL log device, these would all have
> to be <32kbyte writes (default configuration), because larger writes will go
> directly to primary storage, with only the intent landing on the ZIL.

Note that ZFS always writes data in order, so I believe that the statement "larger writes will go directly to primary storage" really should be "larger writes will go directly to the ZIL implemented in primary storage (which always exists)". Otherwise, ZFS would need to write a new TXG whenever a new "large" block of data appeared (which may be puny as far as the underlying store is concerned) in order to assure proper ordering. This would result in a very high TXG issue rate. Pool fragmentation would be increased.

I am sure that someone will correct me if this is wrong.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
I always try to plan for the worst case- I just wasn't sure how to arrive at the worst case. Thanks for providing the information- and I will definitely check out the zilstat DTrace script.

Considering the smallest SSD I can buy from a manufacturer that I trust seems to be 32GB- that's probably going to be my choice.

As for the choice of striping across two mirrored pairs- I want every last IOP I can get my hands on. An extra $700 isn't going to make much of a difference in a system involving 2 heads, 5 storage shelves, and 76 SAS drives. If I could think of something better to spend that money on, I would- but right now it seems like the best option.
-- This message posted from opensolaris.org
On Mon, 19 Apr 2010, Don wrote:

> I'm curious if anyone knows how ZIL slog performance scales. For
> example- how much benefit would you expect from 2 SSD slogs over 1?
> Would there be a significant benefit to 3 over 2, or does it begin to
> taper off? I'm sure a lot of this is dependent on the environment-
> but rough ideas are good to know.

I don't know the answer, but I expect that the answer depends quite a lot on the nature of the SSDs used. A STEC Zeus IOPS SSD (45K IOPS) will behave quite differently than an Intel X-25E (~3.3K IOPS). A SRAM- or DRAM-based "drive" (with FLASH backup) will behave dramatically differently than a typical SSD. If the SSD employed supports sufficient IOPS and bandwidth, then adding more will not help, since it is not the bottleneck.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
> A STEC Zeus IOPS SSD (45K IOPS) will behave quite differently than an Intel X-25E (~3.3K IOPS).

Where can you even get the Zeus drives? I thought they were only in the OEM market, and last time I checked they were ludicrously expensive. I'm looking for between 5k and 10k IOPS using up to 4 drive bays (so a 2 x 2 striped mirror would be fine). Right now we peak at about 3k IOPS (though that's not to a ZFS system), but I would like to be able to burst to double that. We do have a lot of small-size burst writes, hence our ZIL concerns.

> A SRAM- or DRAM-based "drive" (with FLASH backup) will behave dramatically differently than a typical SSD.

As long as it can speak SAS or SATA and I can put it in a drive shelf, I'd happily consider using it. All the DRAM devices I know of are host-based, and that won't help my cluster.

On that note- what write-optimized SSDs do you recommend? I don't actually know where to buy the Zeus drives even if they've become more reasonably priced.

Thanks for taking the time to share- it's been very informative.
-- This message posted from opensolaris.org
On Apr 19, 2010, at 7:02 PM, Bob Friesenhahn wrote:

> On Mon, 19 Apr 2010, Don wrote:
>
>> Continuing on the best practices theme- how big should the ZIL slog disk be?
>>
>> The ZFS evil tuning guide suggests enough space for 10 seconds of my
>> synchronous write load- even assuming I could cram 20 gigabits/sec into
>> the host (2 x 10 GigE NICs), that only comes out to 200 gigabits,
>> which = 25 gigabytes.
>
> Note that large writes bypass the dedicated intent log device entirely and
> go directly to a ZIL on primary disk. This is because SSDs typically have
> much less raw bandwidth than primary disk does.

That was last year. This year there are many SSDs which have sustained write bandwidth greater than the media speed on HDDs. The newer models with 6Gbps SAS can write > 200 MB/sec and read > 300 MB/sec. For comparison, a 15krpm Seagate Cheetah with 4 platters is rated at 116-195 MB/sec sustainable disk transfer rate. When it comes to performance, game over.

> If you are writing bulk data, the larger chunks will go directly to primary
> disk, and the smaller bits (e.g. smaller writes, metadata, filenames,
> directories, etc.) will go to the dedicated intent log device. This means
> that the device does not need to be as large as you may think it should be.
>
> Use the 'zilstat' DTrace script to evaluate what really happens on your
> system before you invest in extra hardware.

Yes, good idea.
http://www.richardelling.com/Home/scripts-and-programs-1/zilstat
-- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com
On Apr 19, 2010, at 7:11 PM, Bob Friesenhahn wrote:

> On Mon, 19 Apr 2010, Edward Ned Harvey wrote:
>> Improbability assessment aside, suppose you use something like the DDRdrive
>> X1 ... which might be more like 4G instead of 32G ... Is it even physically
>> possible to write 4G to any device in less than 10 seconds? Remember, to
>> achieve worst case, highest demand on the ZIL log device, these would all have
>> to be <32kbyte writes (default configuration), because larger writes will go
>> directly to primary storage, with only the intent landing on the ZIL.
>
> Note that ZFS always writes data in order, so I believe that the statement
> "larger writes will go directly to primary storage" really should be "larger
> writes will go directly to the ZIL implemented in primary storage (which
> always exists)". Otherwise, ZFS would need to write a new TXG whenever a new
> "large" block of data appeared (which may be puny as far as the underlying
> store is concerned) in order to assure proper ordering. This would result in
> a very high TXG issue rate. Pool fragmentation would be increased.
>
> I am sure that someone will correct me if this is wrong.

Actually, when (you are not using a separate log device and the block size is > 32kB) or (you are using a separate log and logbias=throughput), then the data is written once to the main pool and a reference record is written to the ZIL. When the txg commits, the reference record is discarded and the committed block pointer is correct. Upon rollback, all you need is the real data and the reference record from the ZIL to reconstruct.
-- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com
On Apr 19, 2010, at 12:44 PM, Miles Nordin wrote:

>>>>>> "dm" == David Magda <dmagda at ee.ryerson.ca> writes:
>
> dm> Given that ZFS is always consistent on-disk, why would you
> dm> lose a pool if you lose the ZIL and/or cache file?
>
> because of lazy assertions inside 'zpool import'. you are right there
> is no fundamental reason for it---it's just code that doesn't exist.

No, there is not a different code path. The information in the cache file is the pools to be imported and their configuration. The configuration contains the GUIDs for all disks in the pool. Disks are identified by GUID and not by their path, because as we all know, the paths can and do change.

> If you are a developer you can probably still recover your pool, but
> there aren't any commands with a supported interface to do it.

ZFS requires that non-optional top-level vdevs be accessible at import time. These include pool vdevs and log vdevs. In the case of a single-disk log, the log vdev will have only one disk of interest. A mirrored vdev will have two disks (children of the mirror vdev). Since the disks are referenced by GUID rather than path, the knowledge of which GUIDs are used to build the vdevs can be useful when you have to reconstruct by hand.

> 'zpool.cache' doesn't contain magical information, but it allows you
> to pass through a different code path that doesn't include the
> "BrrkBrrk, omg panic device missing, BAIL OUT HERE" checks. I don't
> think squirreling away copies of zpool.cache is a great way to make
> your pool safe from slog failures because there may be other things
> about the different manual 'zpool import' codepath that you need
> during a disaster, like -F, which will remain inaccessible to you if
> you rely on some saving-your-zpool.cache hack, even if your hack ends
> up actually working when the time comes, which it might not.

If there is but a single log disk, and it gets destroyed, and you are on b125 or older, then ZFS will not allow the pool to be imported. ZFS is looking for the GUID, but you won't know what the GUID is unless you have a copy of it somewhere (e.g. a backup of zpool.cache, or you wrote it on the bathroom wall :-)

> I think what is really interesting is the case of an HA cluster using a
> single-device slog made from a ramdisk on the passive node. This case
> would also become safer if slogs were fully disposable.

More interesting is the "look Ma, no directly connected shared storage for a shared storage cluster!" case, where each node acts as an iSCSI target for the mirrored storage. I don't have any direct experience with this, but you can read about it here:
http://docs.sun.com/app/docs/doc/820-7821/girgb?a=view
-- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com
> On Mon, 19 Apr 2010, Edward Ned Harvey wrote:
>> Improbability assessment aside, suppose you use something like the DDRdrive
>> X1 ... which might be more like 4G instead of 32G ... Is it even physically
>> possible to write 4G to any device in less than 10 seconds? Remember, to
>> achieve worst case, highest demand on the ZIL log device, these would all have
>> to be <32kbyte writes (default configuration), because larger writes will go
>> directly to primary storage, with only the intent landing on the ZIL.
>
> Note that ZFS always writes data in order, so I believe that the
> statement "larger writes will go directly to primary storage" really
> should be "larger writes will go directly to the ZIL implemented in
> primary storage (which always exists)". Otherwise, ZFS would need to
> write a new TXG whenever a new "large" block of data appeared (which
> may be puny as far as the underlying store is concerned) in order to
> assure proper ordering. This would result in a very high TXG issue
> rate. Pool fragmentation would be increased.
>
> I am sure that someone will correct me if this is wrong.

There's a difference between "written" and "the data is referenced by the uberblock". There is no need to start a new TXG when a large data block is written. (If the system resets, the data will be on disk but not referenced, and is lost unless the TXG it belongs to is committed.)

Casper
On Mon, April 19, 2010 23:05, Don wrote:

>> A STEC Zeus IOPS SSD (45K IOPS) will behave quite differently than an
>> Intel X-25E (~3.3K IOPS).
>
> Where can you even get the Zeus drives? I thought they were only in the
> OEM market, and last time I checked they were ludicrously expensive. I'm
> looking for between 5k and 10k IOPS using up to 4 drive bays (so a 2 x 2
> striped mirror would be fine). Right now we peak at about 3k IOPS (though
> that's not to a ZFS system), but I would like to be able to burst to
> double that. We do have a lot of small-size burst writes, hence our ZIL
> concerns.

They do have distributors:

http://www.stec-inc.com/support/global_contact.php
http://tinyurl.com/y2lrse2
http://www.stec-inc.com/support/oem_regional_sales_contacts.php?region=USA&subregion=New%20York

And though they do cost a pretty penny, getting the same number of IOPS out of a stack of 15 krpm disks would probably cost a lot more in hardware, power, and cooling.
I looked through that distributor page already, and none of the ones I visited listed the IOPS SSDs- they all listed DRAM and other memory from STEC- but not the SSDs.

I'm not looking to get the same number of IOPS out of 15k RPM drives. I'm looking for an appropriate number of IOPS for my environment- that is to say- twice what I'm currently getting. That would be 6k-10k IOPS. If I can do that with four Intel drives for 1/10th of what a pair of ZEUS SSDs are going to cost me- then that would seem to make a lot more sense. It would also be nice to be able to have a couple of spares on hand- just in case a mirror fails. That's a lot harder when the drives are as expensive as the ZEUS.

Who else, besides STEC, is making write-optimized drives, and what kind of IOPS performance can be expected?
-- This message posted from opensolaris.org
> From: casper at holland.sun.com [mailto:casper at holland.sun.com] On Behalf
> Of Casper.Dik at Sun.COM
>
> > On Mon, 19 Apr 2010, Edward Ned Harvey wrote:
> >> Improbability assessment aside, suppose you use something like the DDRdrive
> >> X1 ... which might be more like 4G instead of 32G ... Is it even physically
> >> possible to write 4G to any device in less than 10 seconds? Remember, to
> >> achieve worst case, highest demand on the ZIL log device, these would all have
> >> to be <32kbyte writes (default configuration), because larger writes will go
> >> directly to primary storage, with only the intent landing on the ZIL.
> >
> > Note that ZFS always writes data in order, so I believe that the
> > statement "larger writes will go directly to primary storage" really
> > should be "larger writes will go directly to the ZIL implemented in
> > primary storage (which always exists)". Otherwise, ZFS would need to
> > write a new TXG whenever a new "large" block of data appeared (which
> > may be puny as far as the underlying store is concerned) in order to
> > assure proper ordering. This would result in a very high TXG issue
> > rate. Pool fragmentation would be increased.
> >
> > I am sure that someone will correct me if this is wrong.
>
> There's a difference between "written" and "the data is referenced by the
> uberblock". There is no need to start a new TXG when a large data block
> is written. (If the system resets, the data will be on disk but not
> referenced, and is lost unless the TXG it belongs to is committed.)

*Also* it turns out, what I said was not strictly correct either. I think I'm too sleepy to get this correct right now, but ... My (hopefully corrected) understanding is now:

By default, all sync writes will go to the ZIL entirely, regardless of size. Only if you change the ... what is it ... logbias to ... throughput. Then, if you have a large sync write, the bulk of the data will be written to primary storage, while just a tiny little intent record will be written to the SSD.

I think I misunderstood the default. I previously thought throughput was the default, not latency.
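In practice that is a per-dataset property, and latency is the default; a small sketch (the dataset name is only an example):

    # Check the current setting
    zfs get logbias tank/nfs

    # Send the bulk of large synchronous writes to the main pool,
    # leaving only a small intent record on the slog
    zfs set logbias=throughput tank/nfs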
On 04/20/10 11:06 AM, Don wrote:

> Who else, besides STEC, is making write-optimized drives, and what
> kind of IOPS performance can be expected?

Just got a distributor email about Texas Memory Systems' RamSan-630, one of a range of huge non-volatile SAN products they make. Other than that it has a capacity of 4-10TB, looks like a 4U, and consumes an amazing 450W, I don't know anything about them. The IOPS are pretty impressive, but power-wise, at 45W/TB, even mirrored disks use quite a bit less power. But 500K random IOPS and 8GB/s might be worth it if the specs are to be believed...
On Apr 21, 2010, at 7:24 AM, Frank Middleton wrote:

> On 04/20/10 11:06 AM, Don wrote:
>
>> Who else, besides STEC, is making write-optimized drives, and what
>> kind of IOPS performance can be expected?
>
> Just got a distributor email about Texas Memory Systems' RamSan-630,
> one of a range of huge non-volatile SAN products they make. Other
> than that it has a capacity of 4-10TB, looks like a 4U, and consumes
> an amazing 450W, I don't know anything about them. The IOPS are
> pretty impressive, but power-wise, at 45W/TB even mirrored disks
> use quite a bit less power. But 500K random IOPS and 8GB/s might
> be worth it if the specs are to be believed...

They have been around for a long time and have a good track record in important markets. They do not cater to the home/hobbyist market.
-- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com
On Wed, Apr 21, 2010 at 7:24 AM, Frank Middleton <f.middleton at apogeect.com> wrote:

> On 04/20/10 11:06 AM, Don wrote:
>
> Just got a distributor email about Texas Memory Systems' RamSan-630,
> one of a range of huge non-volatile SAN products they make. Other
> than that it has a capacity of 4-10TB, looks like a 4U, and consumes
> an amazing 450W, I don't know anything about them. The IOPS are
> pretty impressive, but power-wise, at 45W/TB even mirrored disks
> use quite a bit less power. But 500K random IOPS and 8GB/s might
> be worth it if the specs are to be believed...

We use the RamSan 400 and 440 with some systems at work. I'm trying to get some RamSan 630 eval units for testing. Unfortunately, we're using Linux LVM and ext3, so I can't answer how well they work with ZFS.

TMS did caution us that the 630 would be slower than the 440 for our use case, which is a lot of synchronous random IOPS. The same probably holds true for use as a slog.

-B
--
Brandon High : bhigh at freaks.com
Someone on this list threw out the idea a year or so ago to just set up 2 ramdisk servers, export a ramdisk from each, and create a mirror slog from them. Assuming newer-version zpools, this sounds like it could be even safer, since there is (supposedly) less of a chance of catastrophic failure if your ramdisk setup fails. Use just one remote ramdisk or two with battery backup... whatever meets your paranoia level.

It's not SSD cheap, but I'm sure you could dream up several options that are less than STEC prices. You also could probably use these machines on multiple pools if you've got them.

I know, it still probably sounds a bit too cowboy for most on this list though.
-- This message posted from opensolaris.org
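For anyone curious what that would actually look like, a very rough sketch using a ramdisk exported via a COMSTAR iSCSI target (all names and addresses are made up, and this is exactly the sort of volatile slog the rest of this thread warns about unless the ramdisk boxes are battery/UPS protected):

    # On each ramdisk server: create a 4GB ramdisk and export it
    ramdiskadm -a slog0 4g
    svcadm enable stmf
    sbdadm create-lu /dev/ramdisk/slog0        # note the LU GUID it prints
    stmfadm add-view <lu-guid>
    itadm create-target
    svcadm enable -r svc:/network/iscsi/target:default

    # On the ZFS head: discover both targets, then mirror them as the slog
    iscsiadm add discovery-address 192.168.1.10
    iscsiadm add discovery-address 192.168.1.11
    iscsiadm modify discovery --sendtargets enable
    zpool add tank log mirror <remote-ramdisk-1> <remote-ramdisk-2>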
On Thu, Apr 22, 2010 at 09:58:12PM -0700, thomas wrote:

> Assuming newer-version zpools, this sounds like it could be even
> safer, since there is (supposedly) less of a chance of catastrophic
> failure if your ramdisk setup fails. Use just one remote ramdisk or
> two with battery backup... whatever meets your paranoia level.

If the iSCSI initiator worked for me at all, I would be trying this. I liked the idea, but it's just not accessible now.

--
Dan.
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of thomas
>
> Someone on this list threw out the idea a year or so ago to just set up
> 2 ramdisk servers, export a ramdisk from each, and create a mirror slog
> from them.

Isn't the whole point of a ramdisk to be fast? And now it's going to be at the other end of an Ethernet, with TCP and ... some additional filesystem overhead? No thank you.
On 23/04/2010 12:24, Edward Ned Harvey wrote:

>> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
>> bounces at opensolaris.org] On Behalf Of thomas
>>
>> Someone on this list threw out the idea a year or so ago to just set up
>> 2 ramdisk servers, export a ramdisk from each, and create a mirror slog
>> from them.
>
> Isn't the whole point of a ramdisk to be fast?
> And now it's going to be at the other end of an Ethernet, with TCP and ...
> some additional filesystem overhead?

iSCSI over 1G or even 10G Ethernet to something on the remote side can be very fast, faster than a 7200rpm drive and possibly faster than a 15k rpm drive. Or maybe it isn't Ethernet but InfiniBand; then we are looking at very fast.

The point of the ZFS L2ARC cache devices is to be faster than your main pool devices. In particular, the idea is to allow you to use cheaper 7200 rpm (or maybe even slower) disks rather than expensive 15k rpm drives, but to get equivalent or better performance for certain types of workload that have traditionally been dominated by 15k rpm drives.

If you are using this as a ZFS log device then you need to be more careful, as the log device does need to persist; otherwise there is no point in having it.

I remember many years ago, on SPARCstation ELC (sun4c) systems with only 8MB of RAM and local swap (IIRC local / but remote /usr too, so a dataless client), it was better to run some X applications remotely on another machine (that someone else was using) than to let them swap locally. The idea being that you had to be unlucky for both machines to need to swap and both to swap out the same program at the same time. That was only over 10BaseT.

What I'm saying is that this isn't new; don't assume that the path to/from local storage is faster than networking.

--
Darren J Moffat