We've been working on a prototype of a ZFS file server for a while now, based on Solaris 10. Now that official support is available for openSolaris, we are looking into that as a possible option as well. openSolaris definitely has a greater feature set, but is still a bit rough around the edges for production use.

I've heard that a considerable amount of ZFS improvements are slated to show up in S10U6. I was wondering if anybody could give an unofficial list of what will probably be deployed in S10U6, and how that will compare feature wise to openSolaris 05/08. Some rough guess at an ETA would also be nice :).

Thanks...

--
Paul B. Henson  |  (909) 979-6361  |  http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst  |  henson at csupomona.edu
California State Polytechnic University  |  Pomona CA 91768
I was hoping that at least ZFS version 5 would be included in U5, but it was not. Do you think that will be in U6?

On Fri, 16 May 2008, Robin Guo wrote:
> Most of the features and bug fixes up through roughly Nevada build 87 (or 88?)
> will be backported into s10u6.
>
> At least, s10u6 will contain the L2ARC cache, ZFS as a root filesystem, etc.
Hi, Paul

Most of the features and bug fixes up through roughly Nevada build 87 (or 88?) will be backported into s10u6. From the outside (user-visible) point of view it will be about the same as openSolaris 05/08, but certainly some features, such as the CIFS server, have no plan to be backported to s10u6 yet, so ZFS itself will be fully ready but will have no effect in those areas. That depends on how the projects cooperate.

At least, s10u6 will contain the L2ARC cache, ZFS as a root filesystem, etc.

Paul B. Henson wrote:
> I've heard that a considerable amount of ZFS improvements are slated to
> show up in S10U6. I was wondering if anybody could give an unofficial list
> of what will probably be deployed in S10U6, and how that will compare
> feature wise to openSolaris 05/08.
Hi, Krzys,

Definitely; the ZFS version in s10u6_01 is already 10, and I never expect it to be downgraded :) U5 only included bug fixes, without any great ZFS features (that's a pity), but anyway, s10u6 will come, sooner or later.

Krzys wrote:
> I was hoping that at least ZFS version 5 would be included in U5, but it
> was not. Do you think that will be in U6?

--
Regards,

Robin Guo, Xue-Bin Guo
Solaris Kernel and Data Service QE,
Sun China Engineering and Research Institute
Phone: +86 10 82618200 +82296
Email: robin.guo at sun.com
Blog: http://blogs.sun.com/robinguo
On Fri, May 16, 2008 at 09:30:27AM +0800, Robin Guo wrote:
> Hi, Paul
>
> At least, s10u6 will contain the L2ARC cache, ZFS as a root filesystem, etc.

As far as root zfs goes, are there any plans to support more than just single disks or mirrors in U6, or will that be for a later date?

-brian

--
"Coding in C is like sending a 3 year old to do groceries. You gotta
tell them exactly what you want or you'll end up with a cupboard full
of pop tarts and pancake mix." -- IRC User (http://www.bash.org/?841435)
Robin Guo wrote:
> I'm afraid it's still single disks or mirrors only. If OpenSolaris starts a
> new project for that kind of feature it'll be backported to s10u*
> eventually, but that needs some time; it sounds like there's no possibility
> for U6, I think.

Not necessarily true. Not all things in OpenSolaris get backported, and not all future ZFS features are guaranteed to get backported eventually. For example, I have no current plans to backport the ZFS Crypto functionality.

--
Darren J Moffat
Hi, Brian

You mean striping across multiple disks, or raidz? I'm afraid it's still single disks or mirrors only. If OpenSolaris starts a new project for that kind of feature it'll be backported to s10u* eventually, but that needs some time; it sounds like there's no possibility for U6, I think.

Brian Hechinger wrote:
> As far as root zfs goes, are there any plans to support more than just
> single disks or mirrors in U6, or will that be for a later date?
On Thu, 15 May 2008, Robin Guo wrote:
> Most of the features and bug fixes up through roughly Nevada build 87 (or
> 88?) will be backported into s10u6. From the outside point of view it will
> be about the same as openSolaris 05/08, but certainly some features, such
> as the CIFS server, have no plan to be backported to s10u6 yet.

Yah, I've heard that the CIFS stuff was way too many changes to backport; guess that is going to have to wait until Solaris 11.

So, from a feature perspective it looks like S10U6 is going to be in pretty good shape ZFS-wise. If only someone could speak to (perhaps under the cloak of anonymity ;) ) the timing side :). Given U5 barely came out, I wouldn't expect U6 anytime soon :(.

Thanks...

--
Paul B. Henson  |  (909) 979-6361  |  http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst  |  henson at csupomona.edu
California State Polytechnic University  |  Pomona CA 91768
The issue with CIFS is not just complexity; it's the total amount of incompatible change in the kernel that we had to make in order to make the CIFS protocol a first class citizen in Solaris. This includes changes in the VFS layer which would break all S10 file systems. So in a very real sense CIFS simply cannot be backported to S10.

-- Fred

On May 16, 2008, at 3:06 PM, Paul B. Henson wrote:
> Yah, I've heard that the CIFS stuff was way too many changes to backport;
> guess that is going to have to wait until Solaris 11.
>
> So, from a feature perspective it looks like S10U6 is going to be in
> pretty good shape ZFS-wise.

--
Fred Zlotnick
Senior Director, Solaris NAS
Sun Microsystems, Inc.
fred.zlotnick at sun.com
x81142/+1 650 352 9298
On Fri, May 16, 2008 at 03:12:02PM -0700, Zlotnick Fred wrote:
> The issue with CIFS is not just complexity; it's the total amount of
> incompatible change in the kernel that we had to make in order to make
> the CIFS protocol a first class citizen in Solaris. This includes changes
> in the VFS layer which would break all S10 file systems. So in a very
> real sense CIFS simply cannot be backported to S10.

However, the same arguments were made explaining the difficulty of backporting ZFS and GRUB boot to Solaris 10.

Adam

--
Adam Leventhal, Fishworks            http://blogs.sun.com/ahl
Hi Paul,

I believe the goal is to come out w/ new Solaris updates every 4-6 months; they are sometimes known as quarterly updates.

Regards.

-------- Original Message --------
Subject: Re: [zfs-discuss] ZFS in S10U6 vs openSolaris 05/08
From: Paul B. Henson <henson at acm.org>
To: Robin Guo <Robin.Guo at Sun.COM>
CC: zfs-discuss at opensolaris.org
Date: Fri May 16 15:06:02 2008

> So, from a feature perspective it looks like S10U6 is going to be in pretty
> good shape ZFS-wise. If only someone could speak to (perhaps under the
> cloak of anonymity ;) ) the timing side :). Given U5 barely came out, I
> wouldn't expect U6 anytime soon :(.
>
> Thanks..
Hi again,

I sort of take that back; here's the past history:

Solaris 10 3/05  = Solaris 10 RR 1/05
Solaris 10 1/06  = Update 1
Solaris 10 6/06  = Update 2
Solaris 10 11/06 = Update 3
Solaris 10 8/07  = Update 4
Solaris 10 5/08  = Update 5

I did say it was a "goal" though.

-------- Original Message --------
Subject: Re: [zfs-discuss] ZFS in S10U6 vs openSolaris 05/08
From: Daryl Doami <Daryl.Doami at Sun.COM>
To: Paul B. Henson <henson at acm.org>
CC: zfs-discuss at opensolaris.org
Date: Fri May 16 22:59:13 2008

> I believe the goal is to come out w/ new Solaris updates every 4-6
> months; they are sometimes known as quarterly updates.
| So, from a feature perspective it looks like S10U6 is going to be in
| pretty good shape ZFS-wise. If only someone could speak to (perhaps
| under the cloak of anonymity ;) ) the timing side :).

For what it's worth, back in January or so we were told that S10U6 was scheduled for August. Given that we were told more or less the same thing about S10U4 last year and it slipped somewhat, I'm not expecting S10U6 before about October or so.

	- cks
Robin Guo wrote:
| At least, s10u6 will contain the L2ARC cache, ZFS as a root filesystem, etc.

Any details about this L2ARC thing? I see some references on Google (a cache device) but no in-depth description.

--
Jesus Cea Avion
jcea at jcea.es - http://www.jcea.es/
jabber / xmpp:jcea at jabber.org
Jesus Cea wrote:
> Robin Guo wrote:
> | At least, s10u6 will contain the L2ARC cache, ZFS as a root filesystem, etc.
>
> Any details about this L2ARC thing? I see some references on Google (a
> cache device) but no in-depth description.

Sure. The concept is quite simple, really. We observe that solid state memories can be very fast when compared to spinning rust (disk) drives. The Adaptive Replacement Cache (ARC) uses main memory as a read cache. But sometimes people want high performance but don't want to spend money on main memory. So, the Level-2 ARC (L2ARC) can be placed on a block device, such as a fast [solid state] disk, which may even be volatile. This may be very useful for those cases where the actual drive is located far away in time (e.g. across the internet) but nearby, fast "disks" are readily available. Since the L2ARC is only a read cache, it doesn't have to be nonvolatile. This opens up some interesting possibilities for applications with large data sets (>> RAM), where you might get significant performance improvements with local, fast devices.

The PSARC case materials go into some detail:
http://opensolaris.org/os/community/arc/caselog/2007/618/
 -- richard
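For anyone who wants to experiment, the L2ARC is driven through the ordinary zpool commands; a rough sketch, on a build with cache-device support and with a hypothetical pool "tank" and device c2d0, would be:

     (pool and device names below are placeholders)

     Add a fast disk as an L2ARC (cache) device to an existing pool:
       # zpool add tank cache c2d0

     Confirm the cache vdev appears and watch it warm up:
       # zpool status tank
       # zpool iostat -v tank 5

     A cache device can later be removed without touching the pool data:
       # zpool remove tank c2d0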
On May 22, 2008, at 19:54, Richard Elling wrote:
> The Adaptive Replacement Cache (ARC) uses main memory as a read cache.
> But sometimes people want high performance but don't want to spend money
> on main memory. So, the Level-2 ARC can be placed on a block device,
> such as a fast [solid state] disk, which may even be volatile.

The remote-disk cache makes perfect sense. I'm curious if there are measurable benefits for caching local disks as well? NAND-flash SSD drives have good 'seek' and slow transfer, IIRC, but that might still be useful for lots of small reads where seek is everything.

I'm not quite understanding the argument for it being read-only so it can be used on volatile SDRAM-based SSDs, though. Those tend to be much, much more expensive than main memory, right? So, why would anybody buy one for cache - is it so they can front a really massive pool of disks that would exhaust market-available maximum main memory sizes?

-Bill

-----
Bill McGonigle, Owner           Work: 603.448.4440
BFC Computing, LLC              Home: 603.448.1668
bill at bfccomputing.com           Cell: 603.252.2606
http://www.bfccomputing.com/    Page: 603.442.1833
Blog: http://blog.bfccomputing.com/
VCard: http://bfccomputing.com/vcard/bill.vcf
On Fri, 23 May 2008, Bill McGonigle wrote:
> The remote-disk cache makes perfect sense. I'm curious if there are
> measurable benefits for caching local disks as well? NAND-flash SSD
> drives have good 'seek' and slow transfer, IIRC, but that might
> still be useful for lots of small reads where seek is everything.

NAND-flash SSD drives also wear out. They are not very useful as a cache device which is written to repetitively. A busy server could likely wear one out in just a day or two unless the drive contains aggressive hardware-based write leveling so that it might survive a few more days, depending on how large the device is.

Cache devices are usually much smaller and run a lot "hotter" than a normal filesystem.

Bob
=====================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
> measurable benefits for caching local disks as well? NAND-flash SSD

I'm confused; the only reason I can think of for making a cache device non-volatile would be to fill the ARC after reboot, and the in-RAM ARC pointers for the cache device will take quite a bit of RAM too, so perhaps spending the $$ on more system RAM rather than an SSD cache device would be better? Unless you have really slow iscsi vdevs :-)

From zpool(1M):

     To create a pool with cache devices, specify a "cache" vdev
     with any number of devices. For example:

       # zpool create pool c0d0 c1d0 cache c2d0 c3d0

     Cache devices cannot be mirrored or part of a raidz confi-
     guration. If a read error is encountered on a cache device,
     that read I/O is reissued to the original storage pool dev-
     ice, which might be part of a mirrored or raidz
     configuration.

     The content of the cache devices is considered volatile, as
     is the case with other system caches.

Rob
Rob at Logan.com wrote:
> I'm confused; the only reason I can think of for making a cache device
> non-volatile would be to fill the ARC after reboot, and the in-RAM ARC
> pointers for the cache device will take quite a bit of RAM too, so
> perhaps spending the $$ on more system RAM rather than an SSD cache
> device would be better? Unless you have really slow iscsi vdevs :-)

Consider a case where you might use large, slow SATA drives (1 TByte, 7,200 rpm) for the main storage, and a single small, fast (36 GByte, 15krpm) drive for the L2ARC. This might provide a reasonable cost/performance trade-off.
 -- richard
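As a concrete sketch of that trade-off, such a layout can be set up at pool-creation time; the device names below are made up (c1t0d0/c1t1d0 as the large SATA mirror, c2t0d0 as the small 15krpm drive):

       # zpool create tank mirror c1t0d0 c1t1d0 cache c2t0d0
       # zpool status tank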
On Sat, May 24, 2008 at 3:21 AM, Richard Elling <Richard.Elling at sun.com> wrote:
> Consider a case where you might use large, slow SATA drives (1 TByte,
> 7,200 rpm) for the main storage, and a single small, fast (36 GByte,
> 15krpm) drive for the L2ARC. This might provide a reasonable
> cost/performance trade-off.

In this case (or in any other case where a cache device is used), does the cache improve write performance or only reads? I presume it cannot increase write performance as the cache is considered volatile, so the write couldn't be committed until the data had left the cache device?

From the ZFS admin guide [1]: "Using cache devices provide the greatest performance improvement for random read-workloads of mostly static content." I'm not sure if that means no performance increase for writes, or just not very much?

[1] http://docs.sun.com/app/docs/doc/817-2271/gaynr?a=view

--
Hugh Saunders
On Fri, May 23, 2008 at 05:26:34PM -0500, Bob Friesenhahn wrote:
> NAND-flash SSD drives also wear out. They are not very useful as a
> cache device which is written to repetitively. A busy server could
> likely wear one out in just a day or two unless the drive contains
> aggressive hardware-based write leveling.
>
> Cache devices are usually much smaller and run a lot "hotter" than a
> normal filesystem.

Someone (Gigabyte, are you listening?) needs to make something like the iRAM, only with more capacity, and bump it up to 3.0Gbps. SAS would be nice since you could load a nice controller up with them.

Does anyone make a 3.5" HDD format RAM disk system that isn't horribly expensive? Backing to disk wouldn't matter to me, but a battery that could hold at least 30 minutes of data would be nice.

-brian

--
"Coding in C is like sending a 3 year old to do groceries. You gotta
tell them exactly what you want or you'll end up with a cupboard full
of pop tarts and pancake mix." -- IRC User (http://www.bash.org/?841435)
> cache improve write performance or only reads?

The L2ARC cache device is for reads; for writes you want an intent log device. From zpool(1M):

     The ZFS Intent Log (ZIL) satisfies POSIX requirements for
     synchronous transactions. For instance, databases often
     require their transactions to be on stable storage devices
     when returning from a system call. NFS and other applica-
     tions can also use fsync() to ensure data stability. By
     default, the intent log is allocated from blocks within the
     main pool. However, it might be possible to get better per-
     formance using separate intent log devices such as NVRAM or
     a dedicated disk. For example:

       # zpool create pool c0d0 c1d0 log c2d0

     Multiple log devices can also be specified, and they can be
     mirrored. See the EXAMPLES section for an example of mirror-
     ing multiple log devices. Log devices can be added,
     replaced, attached, detached, and imported and exported as
     part of the larger pool.

But don't underestimate the speed of several slow vdevs vs one fast vdev.

> Does anyone make a 3.5" HDD format RAM disk system that isn't horribly expensive?

http://mail.opensolaris.org/pipermail/zfs-discuss/2007-July/041956.html

Perhaps adding RAM to the system would be more flexible?

Rob
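To make the read/write split concrete, both kinds of device are added to an existing pool with the same syntax; a rough sketch with placeholder device names (note that, unlike cache devices, log devices can be mirrored, which is worth doing since the slog holds recently committed synchronous writes):

     (pool and device names below are placeholders)

     Separate, mirrored intent log for synchronous writes:
       # zpool add tank log mirror c3d0 c4d0

     L2ARC cache device for random reads (cannot be mirrored):
       # zpool add tank cache c5d0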
On Sat, May 24, 2008 at 4:00 PM, <Rob at logan.com> wrote:
>> cache improve write performance or only reads?
>
> The L2ARC cache device is for reads; for writes you want an intent log
> device.

Thanks for answering my question; I had seen mention of intent log devices, but wasn't sure of their purpose.

If only one significantly faster disk is available, would it make sense to slice it and use a slice for L2ARC and a slice for ZIL? Or would that cause horrible thrashing?

--
Hugh Saunders
Hugh Saunders wrote:
> If only one significantly faster disk is available, would it make
> sense to slice it and use a slice for L2ARC and a slice for ZIL? Or
> would that cause horrible thrashing?

I wouldn't recommend this configuration. As you say, it would thrash the head. Log devices mainly need to write fast, as they are only ever read once, on reboot, if there are uncommitted transactions. Cache devices, on the other hand, require fast reads, as the writes can be done slowly and asynchronously. So a common device sliced for both purposes wouldn't work well unless it was fast for both reads and writes and had minimal seek times (NVRAM, solid state disk).

Neil.
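For completeness, the configuration Hugh asked about would look roughly like the sketch below (the pool name and slice names are hypothetical, with the device split into two slices via format(1M)); it is workable on a solid state device with negligible seek time, but not recommended on a single spinning disk for the reasons above:

     Slice s0 as a separate intent log, slice s1 as L2ARC:
       # zpool add tank log c4t0d0s0
       # zpool add tank cache c4t0d0s1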
"Hugh Saunders" <hugh at mjr.org> writes:> On Sat, May 24, 2008 at 3:21 AM, Richard Elling <Richard.Elling at sun.com> wrote: >> Consider a case where you might use large, slow SATA drives (1 TByte, >> 7,200 rpm) >> for the main storage, and a single small, fast (36 GByte, 15krpm) drive >> for the >> L2ARC. This might provide a reasonable cost/performance trade-off. > > In this case (or in any other case where a cache device is used), does > the cache improve write performance or only reads? > I presume it cannot increase write performance as the cache is > considered volatile, so the write couldn''t be committed until the > data had left the cache device? > >>From the ZFS admin guide [1] "Using cache devices provide the greatest > performance improvement for random read-workloads of mostly static > content." I''m not sure if that means no performance increase for > writes, or just not very much? > > [1]http://docs.sun.com/app/docs/doc/817-2271/gaynr?a=viewMy understanding is that this is correct. The writes to the L2ARC are done from non-dirty pages in the in-core ARC, in the background, while nothing is waiting. Dirty pages are not written to the L2ARC at all. AIUI all this is intended to help in the case of flash-based SSDs which tend to have performance that''s heavily biased in one direction. Normally, they have outstanding read IOPS and terrible write IOPS, so retiring pages from the RAM ARC to the L2ARC in the background makes sense. There''s some good info in the comment here: http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/arc.c#3377 Boyd
On Sat, May 24, 2008 at 11:45 PM, Neil Perrin <Neil.Perrin at sun.com> wrote:
> Log devices mainly need to write fast, as they are only ever read once,
> on reboot, if there are uncommitted transactions. Cache devices, on the
> other hand, require fast reads, as the writes can be done slowly and
> asynchronously. So a common device sliced for both purposes wouldn't
> work well unless it was fast for both reads and writes and had minimal
> seek times (NVRAM, solid state disk).

An interesting thread. I sit in front of a heavily used SXDE (latest) desktop for many hours, and my home dir (where most of the work takes place) is based on a mirrored pair of 500Gb SATA drives. The real issue with this desktop is that it's based on a 939-pin AMD (x4400 IIRC) CPU and I can't go beyond 4Gb of main memory.

So .... after reading this thread, I pushed in a couple of 15k RPM SAS drives on an LSI Logic 1068 SAS controller and assigned one as a log device and one as a cache device.[1]

What a difference!!!! :) Now I've got the response time I wanted from the system to begin with[0], along with the capacity and low cost per Gb afforded by commodity SATA drives.

This is just an FYI to others on the list to "experiment" with this setup. You may be very pleasantly surprised by the results. I certainly am! :)

Excellent work Team ZFS.

[0] and which I had until I got towards 46 ZFS filesystems.... and an lofiadm mounted home dir that is shared between different (work) zones and just keeps on growing...

[1]
# psrinfo -v
Status of virtual processor 0 as of: 05/25/2008 21:57:37
  on-line since 05/24/2008 15:44:51.
  The i386 processor operates at 2420 MHz,
        and has an i387 compatible floating point processor.
Status of virtual processor 1 as of: 05/25/2008 21:57:37
  on-line since 05/24/2008 15:44:53.
  The i386 processor operates at 2420 MHz,
        and has an i387 compatible floating point processor.

# zpool status
  pool: tanku
 state: ONLINE
 scrub: scrub completed with 0 errors on Sat May 24 19:35:32 2008
config:

        NAME        STATE     READ WRITE CKSUM
        tanku       ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c1t1d0  ONLINE       0     0     0
            c2t0d0  ONLINE       0     0     0
        logs        ONLINE       0     0     0
          c4t2d0    ONLINE       0     0     0
        cache
          c4t1d0    ONLINE       0     0     0

errors: No known data errors

Regards,

--
Al Hopper  Logical Approach Inc, Plano, TX.  al at logical-approach.com
           Voice: 972.379.2133  Timezone: US CDT
OpenSolaris Governing Board (OGB) Member - Apr 2005 to Mar 2007
http://www.opensolaris.org/os/community/ogb/ogb_2005-2007/
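For reference, a layout like the status output above is what you would get by adding the two SAS drives to an existing mirrored pool with commands of roughly this form (device names taken from the zpool status listing above):

     Separate intent log on one 15k rpm SAS drive:
       # zpool add tanku log c4t2d0

     L2ARC read cache on the other:
       # zpool add tanku cache c4t1d0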
It's true that NAND-based flash devices wear out under heavy load. Regular consumer-grade NAND drives will wear out the extra cells pretty rapidly (in a year or so). However, enterprise-grade SSD disks are fine-tuned to withstand continuous writes for more than 10 years.

Best regards,
Mertol

Mertol Ozyoney
Storage Practice - Sales Manager
Sun Microsystems, TR
Istanbul TR
Phone +902123352200
Mobile +905339310752
Fax +902123352222
Email mertol.ozyoney at Sun.COM

-----Original Message-----
From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Bob Friesenhahn
Sent: Saturday, 24 May 2008 01:27
To: Bill McGonigle
Cc: ZFS Discuss
Subject: Re: [zfs-discuss] ZFS in S10U6 vs openSolaris 05/08

> NAND-flash SSD drives also wear out. They are not very useful as a
> cache device which is written to repetitively. A busy server could
> likely wear one out in just a day or two unless the drive contains
> aggressive hardware-based write leveling so that it might survive a
> few more days, depending on how large the device is.
On Mon, 26 May 2008, Mertol Ozyoney wrote:
> It's true that NAND-based flash devices wear out under heavy load. Regular
> consumer-grade NAND drives will wear out the extra cells pretty rapidly (in
> a year or so). However, enterprise-grade SSD disks are fine-tuned to
> withstand continuous writes for more than 10 years.

It is incorrect to classify wear in terms of years without also specifying update behavior. NAND FLASH sectors can withstand 100,000 to (sometimes) 1,000,000 write-erase cycles. In normal filesystem use, there are far more reads than writes and the size of the storage device is much larger than the data re-written. Even in server use, only a small fraction of the data is updated. A device used to cache writes will be written to as often as it is read from (or perhaps more often). If the cache device storage is fully occupied, then wear-leveling algorithms based on statistics do not have much opportunity to work.

If the underlying device sectors are good for 100,000 write-erase cycles and the entire device is re-written once per second, then the device is not going to last very long (27 hours). Of course the write performance for these devices is quite poor (8-120 MB/second) and the write performance seems to be proportional to the total storage size, so it is quite unlikely that you could re-write a suitably performant device once per second. The performance of FLASH SSDs does not seem very appropriate for use as a write cache device.

There is a useful guide to these devices at "http://www.storagesearch.com/ssd-buyers-guide.html".

SRAM-based cache devices which plug into a PCI-X or PCI-Express slot seem far more appropriate for use as a write cache than a slow SATA device. At least 5X or 10X the performance is available by this means.

Bob
=====================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
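The 27-hour figure follows directly from those stated assumptions (100,000 write-erase cycles per cell, the whole device rewritten once per second):

       $ echo "scale=1; 100000 / 3600" | bc
       27.7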
You're still concentrating on consumer-level drives. The STEC drives EMC is using, for instance, exhibit none of the behaviors you describe. While I agree from a cost perspective that Fusion-io's product is more attractive, your blanket statements on flash are inaccurate.

On 5/27/08, Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:
> It is incorrect to classify wear in terms of years without also
> specifying update behavior. NAND FLASH sectors can withstand 100,000
> to (sometimes) 1,000,000 write-erase cycles. [...] The performance of
> FLASH SSDs does not seem very appropriate for use as a write cache
> device.
On Tue, 27 May 2008, Tim wrote:
> You're still concentrating on consumer-level drives. The STEC drives
> EMC is using, for instance, exhibit none of the behaviors you describe.

How long have you been working for STEC? ;-)

Looking at the specifications for STEC SSDs I see that they are very good at IOPS (probably many times faster than the Solaris I/O stack). Write performance of the fastest product (ZEUS iops) is similar to a typical SAS hard drive, with the remaining products being much slower.

This is all that STEC has to say about FLASH lifetime in their products: "http://www.stec-inc.com/technology/flash_life_support.php". There are no "hard facts" to be found there.

The STEC SSDs are targeted towards being a replacement for a traditional hard drive. There is no mention of lifetime when used as a write-intensive cache device.

Bob
=====================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
There is something more to consider with SSDs used as cache devices. STEC mentions that they obtain improved reliability by employing error correction. The ZFS scrub operation is very good at testing filesystem blocks for errors by reading them. Besides corrections at the ZFS level, the SSD device could repair a weak block by moving it. The obvious way to detect failing blocks is by reading them. This means that SSDs will work well as normal filesystem storage devices.

In a write cache scenario, blocks may be written millions of times without ever being read. Unless the SSD device includes firmware which independently scrubs the blocks, or it always verifies that it can read what it just wrote (slowing the available transaction rate), the "write only" scenario will cause blocks to be written to extinction so that ultimately they cannot be recovered by any error correction technique.

Bob
=====================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
> There is something more to consider with SSDs used as cache devices.

Why use SATA as the interface? Perhaps
http://www.tgdaily.com/content/view/34065/135/
would be better? (no experience)

"cards will start at 80 GB and will scale to 320 and 640 GB next year. By the end of 2008, Fusion io also hopes to roll out a 1.2 TB card..... 160 parallel pipelines that can read data at 800 megabytes per second and write at 600 MB/sec.... 4K blocks and then streaming eight simultaneous 1 GB reads and writes. In that test, the ioDrive clocked in at 100,000 operations per second... beat $30 dollars a GB,"
Rob Logan wrote:
> Why use SATA as the interface? Perhaps
> http://www.tgdaily.com/content/view/34065/135/
> would be better? (no experience)
>
> "cards will start at 80 GB and will scale to 320 and 640 GB next year.
> By the end of 2008, Fusion io also hopes to roll out a 1.2 TB card.....
> 160 parallel pipelines that can read data at 800 megabytes per second
> and write at 600 MB/sec.... 4K blocks and then streaming eight
> simultaneous 1 GB reads and writes. In that test, the ioDrive
> clocked in at 100,000 operations per second... beat $30 dollars a GB,"

The key take-away here is that the SSD guys *could* do all sorts of neat things to optimize for speed, reliability, and cost. They have many more technology options than the spinning rust guys. My advice: don't bet against Moore's law :-)
 -- richard
On Tue, May 27, 2008 at 12:44 PM, Rob Logan <Rob at logan.com> wrote:
> Why use SATA as the interface? Perhaps
> http://www.tgdaily.com/content/view/34065/135/
> would be better? (no experience)
>
> "cards will start at 80 GB and will scale to 320 and 640 GB next year.
> By the end of 2008, Fusion io also hopes to roll out a 1.2 TB card.....
> 160 parallel pipelines that can read data at 800 megabytes per second
> and write at 600 MB/sec.... In that test, the ioDrive clocked in at
> 100,000 operations per second... beat $30 dollars a GB,"

These could be rather interesting as swap devices. On the face of it, $30/GB is pretty close to the list price of taking a T5240 from 32 GB to 64 GB. However, it is *a lot* less than feeding system-board DIMM slots to workloads that use a lot of RAM but are fairly inactive. As such, a $10k PCIe card may be able to allow a $42k 64 GB T5240 to handle 5+ times the number of not-too-busy J2EE instances.

If anyone's done any modelling or testing of such an idea, I'd love to hear about it.

--
Mike Gerdts
http://mgerdts.blogspot.com/
On May 27, 2008, at 1:44 PM, Rob Logan wrote:
>> There is something more to consider with SSDs used as cache devices.
>
> Why use SATA as the interface? Perhaps
> http://www.tgdaily.com/content/view/34065/135/
> would be better? (no experience)

We are pretty happy with RAMSAN SSDs (ours is RAM based, not flash).

-Andy
On May 23, 2008, at 22:21, Richard Elling wrote:
> Consider a case where you might use large, slow SATA drives (1 TByte,
> 7,200 rpm) for the main storage, and a single small, fast (36 GByte,
> 15krpm) drive for the L2ARC. This might provide a reasonable
> cost/performance trade-off.

Ooh, neat; I hadn't considered that. Cool, thanks. :)

-Bill

-----
Bill McGonigle, Owner           Work: 603.448.4440
BFC Computing, LLC              Home: 603.448.1668
bill at bfccomputing.com           Cell: 603.252.2606
http://www.bfccomputing.com/    Page: 603.442.1833
Blog: http://blog.bfccomputing.com/
VCard: http://bfccomputing.com/vcard/bill.vcf
I strongly agree with most of the comments. I guess I tried to keep it simple, perhaps a little bit too simple.

If I am not mistaken, most NAND disks will virtualize the underlying cells, so even if you update the same sector the update will be made somewhere else. So the time to corrupt an enterprise-grade (NAND-based) SSD will be quite long, although I wouldn't recommend keeping the swap file or any sort of fast-changing cache on those drives.

Think that you have a 146 GB SSD, the write cycle limit is around 100k, and you can write/update data at 10 MB/sec (depending on the I/O pattern it could be a lot slower or a lot higher). It will take 4 hours, or 14,400 seconds, to fully populate the drive. Multiply this by 100k and this is 45 years. If the virtualization algorithms work at 25% efficiency this will still be 10 years plus.

And if I am not mistaken, all enterprise NANDs and most consumer NANDs do read-after-write verify and will mark bad blocks. This will also increase the usable time, as you will not be marking a whole device failed, just a cell...

Please correct me where I am wrong, as I am not quite knowledgeable on this subject.

Mertol

Mertol Ozyoney
Storage Practice - Sales Manager
Sun Microsystems, TR
Istanbul TR
Phone +902123352200
Mobile +905339310752
Fax +902123352222
Email mertol.ozyoney at Sun.COM

-----Original Message-----
From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Bob Friesenhahn
Sent: Tuesday, 27 May 2008 18:55
To: Mertol Ozyoney
Cc: 'ZFS Discuss'
Subject: Re: [zfs-discuss] ZFS in S10U6 vs openSolaris 05/08

It is incorrect to classify wear in terms of years without also specifying update behavior. NAND FLASH sectors can withstand 100,000 to (sometimes) 1,000,000 write-erase cycles. [...] The performance of FLASH SSDs does not seem very appropriate for use as a write cache device.
By the way, all enterprise SSDs have an internal DRAM-based cache. Some vendors may quote the write performance of the internal RAM device. Normally NAND drives, due to read-after-write operations and several other reasons, will not perform very well under write-based loads.

Mertol

Mertol Ozyoney
Storage Practice - Sales Manager
Sun Microsystems, TR
Istanbul TR
Phone +902123352200
Mobile +905339310752
Fax +902123352222
Email mertol.ozyoney at Sun.COM

-----Original Message-----
From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Bob Friesenhahn
Sent: Tuesday, 27 May 2008 20:22
To: Tim
Cc: ZFS Discuss
Subject: Re: [zfs-discuss] ZFS in S10U6 vs openSolaris 05/08

Looking at the specifications for STEC SSDs I see that they are very good at IOPS (probably many times faster than the Solaris I/O stack). Write performance of the fastest product (ZEUS iops) is similar to a typical SAS hard drive, with the remaining products being much slower. [...] The STEC SSDs are targeted towards being a replacement for a traditional hard drive. There is no mention of lifetime when used as a write-intensive cache device.
On Wed, 28 May 2008, Mertol Ozyoney wrote:
> Think that you have a 146 GB SSD, the write cycle limit is around 100k,
> and you can write/update data at 10 MB/sec (depending on the I/O pattern
> it could be a lot slower or a lot higher). It will take 4 hours, or
> 14,400 seconds, to fully populate the drive. Multiply this by 100k and
> this is 45 years. If the virtualization algorithms work at 25% efficiency
> this will still be 10 years plus.
>
> Please correct me where I am wrong, as I am not quite knowledgeable on
> this subject.

It seems that we are in agreement that expected lifetime depends on the usage model. Lifetime will be vastly longer if the drive is used as a normal filesystem disk as compared to being used as a RAID write cache device.

I have not heard of any RAID arrays which use FLASH for their write cache. They all use battery-backed SRAM (or similar).

Bob
=====================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Yup, I'm watching that card closely. No Solaris drivers yet, but hopefully somebody will realise just how good that could be for the ZIL and work on some. Just the 80GB $2,400 card would make a huge difference to write performance. For use with VMware and NFS it would be a godsend.