Chris Greer
2008-Oct-05 05:37 UTC
[zfs-discuss] An slog experiment (my NAS can beat up your NAS)
I currently have a traditional NFS cluster hardware setup in the lab (2 hosts with FC-attached JBOD storage) but no cluster software yet. I've been wanting to try out the separate ZIL to see what it might do to boost performance. My problem is that I don't have any cool SSD devices, much less ones that I could share between two hosts. Commercial arrays have custom hardware with mirrored cache, which got me thinking about a way to do this with regular hardware.

So I tried this experiment this week...

On each host (OpenSolaris 2008.05), I created an 8GB ramdisk with ramdiskadm. I shared this ramdisk on each host via the iSCSI target and initiator over a 1Gb cross-connect cable (jumbo frames enabled). I added these as mirrored slog devices in a zpool.

The end result was a pool that I could import and export between hosts, and that can survive one of the hosts dying. I also copied a dd image of my ramdisk device to stable storage with the pool exported (thus flushed), which allowed me to shut the entire cluster down, power one node up, recreate the ramdisk, dd the image back, and re-import the pool. I'm not sure I could survive a crash of both nodes; going to try and test some more.

The big thing here is that I ended up getting a MASSIVE boost in performance even with the overhead of the 1Gb link and iSCSI. The iorate test I was using went from 3073 IOPS on 90% sequential writes to 23953 IOPS with the RAM slog added. The service time was also significantly better than the physical disk. It also boosted reads significantly, and I'm guessing this is because updating the access time on the files was completely cached.

So what are the downsides to this? If both nodes were to crash and I used the same technique to recreate the ramdisk, I would lose any transactions in the slog at the time of the crash, but the physical disk image is still in a consistent state, right (just not from my apps' point of view)?

Anyone have any idea what difference InfiniBand might make for the cross-connect? In some tests, I completely saturated the 1Gb link between the boxes.

So is this idea completely crazy?

It also brings up questions of correctly sizing your slog in relation to the physical disk on the backend. It looks like if the ZIL can handle significantly more I/O than the physical disks, the effect will be short-lived, as the system has to slow things down while it spends more time flushing from the slog to physical disk. The 8GB looked like overkill in my case, because in a lot of the tests it drove the individual disks in the system to 100% busy and caused service times on the physical disks in the 900-1000ms range (although my app never saw that because of the slog).

--
This message posted from opensolaris.org
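For anyone wanting to reproduce this, a minimal sketch of the per-node setup might look like the following. Device, pool, and target names and the peer IP are placeholders, and Chris doesn't say whether the local ramdisk was used directly or also looped through iSCSI; this sketch assumes both slog halves arrive via iSCSI so device paths stay stable when the pool moves between hosts:

# create the 8GB ramdisk (appears as /dev/ramdisk/slog0)
ramdiskadm -a slog0 8g

# export it with the (pre-COMSTAR) iSCSI target daemon
iscsitadm create target -b /dev/ramdisk/slog0 slog0

# discover the peer's target over the cross-connect
iscsiadm add discovery-address 192.168.10.2
iscsiadm modify discovery --sendtargets enable

# once both LUNs show up (cXtYdZ names are placeholders),
# attach them as a mirrored slog
zpool add tank log mirror c2t1d0 c3t1d0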
Brian Hechinger
2008-Oct-06 01:07 UTC
[zfs-discuss] An slog experiment (my NAS can beat up your NAS)
On Sat, Oct 04, 2008 at 10:37:26PM -0700, Chris Greer wrote:
> So I tried this experiment this week...
>
> On each host (OpenSolaris 2008.05), I created an 8GB ramdisk with
> ramdiskadm. I shared this ramdisk on each host via the iSCSI target
> and initiator over a 1Gb cross-connect cable (jumbo frames enabled).
> I added these as mirrored slog devices in a zpool.

Very interesting. This also gives me an idea: using COMSTAR you could build any number of RAM-based slog devices. They wouldn't need to be anything amazing, just a bunch of RAM and a supported FC card (or two).

> I'm not sure I could survive a crash of both nodes; going to try and
> test some more.

Ok, so taking my idea above, maybe a pair of 15K SAS disks in those boxes so that you could create a backing store. I wonder what the best way to set up realtime sync would be (without making the backing store responsible for slowing down the ramdisk; so no ZFS mirroring between ramdisk and SAS disk, in other words).

> So is this idea completely crazy?

I don't think so, no. ;)

-brian

--
"Coding in C is like sending a 3 year old to do groceries. You gotta
tell them exactly what you want or you'll end up with a cupboard full
of pop tarts and pancake mix." -- IRC User (http://www.bash.org/?841435)
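A rough sketch of Brian's COMSTAR variant using the sbdadm/stmfadm tools that were landing around this time (the GUID shown is a placeholder printed by sbdadm, and serving the LU over FC assumes an HBA that the COMSTAR fct driver can run in target mode):

# back a SCSI logical unit with a ramdisk
ramdiskadm -a slog0 8g
sbdadm create-lu /dev/ramdisk/slog0      # prints the LU's GUID

# expose the LU to initiators (all host/target groups by default)
stmfadm add-view 600144f0XXXXXXXXXXXXXXXXXXXXXXXX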
Adam Leventhal
2008-Oct-06 03:46 UTC
[zfs-discuss] An slog experiment (my NAS can beat up your NAS)
> So what are the downsides to this? If both nodes were to crash and
> I used the same technique to recreate the ramdisk I would lose any
> transactions in the slog at the time of the crash, but the physical
> disk image is still in a consistent state right (just not from my
> apps' point of view)?

You would lose transactions, but the pool would still reflect a consistent state.

> So is this idea completely crazy?

On the contrary; it's very clever.

Adam

--
Adam Leventhal, Fishworks                      http://blogs.sun.com/ahl
Nicolas Williams
2008-Oct-06 04:30 UTC
[zfs-discuss] An slog experiment (my NAS can beat up your NAS)
On Sun, Oct 05, 2008 at 09:07:31PM -0400, Brian Hechinger wrote:
> On Sat, Oct 04, 2008 at 10:37:26PM -0700, Chris Greer wrote:
> > I'm not sure I could survive a crash of both nodes; going to try and
> > test some more.
>
> Ok, so taking my idea above, maybe a pair of 15K SAS disks in those
> boxes so that you could create a backing store. I wonder what the best
> way to set up realtime sync would be (without making the backing store
> responsible for slowing down the ramdisk; so no ZFS mirroring between
> ramdisk and SAS disk, in other words).

There have been threads about adding a feature to support slow mirror devices that don't stay synced synchronously. At least IIRC. That would help. But then, if the pool is busy writing, your slow ZIL mirrors would generally be out of sync, thus being of no help in the event of a power failure given fast slog devices that don't survive power failure.

Also, using remote devices for a ZIL may defeat the purpose of fast ZILs, even if the actual devices are fast, because what really matters here is latency, and the farther away the device, the higher the latency.

> > So is this idea completely crazy?
>
> I don't think so, no. ;)

Yes, it's pretty smart. Add a UPS and it's sort of like battery-backed RAM. You can probably get a good enough reliability rate out of this for your purposes, though actual slog devices would be better if you can afford them.

Nico
--
Ross
2008-Oct-06 08:13 UTC
[zfs-discuss] An slog experiment (my NAS can beat up your NAS)
Very interesting idea, thanks for sharing it. InfiniBand would definitely be worth looking at for performance, although I think you'd need iSER to get the benefits, and that might still be a little new: http://www.opensolaris.org/os/project/iser/Release-notes/

It's also worth bearing in mind that you can have multiple mirrors. I don't know what effect that will have on performance, but it's an easy way to boost reliability even further. I think this idea configured on a set of 2-3 servers, with separate UPSes for each, and a script that can export the pool and save the ramdrive when the power fails, is potentially a very neat little system.

--
This message posted from opensolaris.org
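A power-fail script along the lines Ross suggests could be as small as this (the pool name, ramdisk device, and image path are hypothetical, and it assumes the UPS monitoring software can invoke a command when it goes on battery):

#!/bin/sh
# On UPS power-fail: flush and close the pool, then preserve the slog contents.
zpool export tank && \
    dd if=/dev/rramdisk/slog0 of=/var/adm/slog0.img bs=1024k

On recovery, the reverse sequence Chris already described (ramdiskadm -a, dd the image back, zpool import) restores the slog.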
Moore, Joe
2008-Oct-06 14:47 UTC
[zfs-discuss] An slog experiment (my NAS can beat up your NAS)
Nicolas Williams wrote:
> There have been threads about adding a feature to support slow mirror
> devices that don't stay synced synchronously. At least IIRC. That
> would help. But then, if the pool is busy writing, your slow ZIL
> mirrors would generally be out of sync, thus being of no help in the
> event of a power failure given fast slog devices that don't survive
> power failure.

I wonder if an AVS-replicated storage device on the backends would be appropriate?

write -> ZFS-mirrored slog -> ramdisk -AVS-> physical disk
                          \
                           +-iscsi-> ramdisk -AVS-> physical disk

You'd get the continuous replication of the ramdisk to the physical drive (and perhaps automagic recovery on reboot) but not pay the synchronous-write-to-remote-physical-disk penalty.

> Also, using remote devices for a ZIL may defeat the purpose of fast
> ZILs, even if the actual devices are fast, because what really matters
> here is latency, and the farther the device, the higher the latency.

A .5-ms RTT on an ethernet link to the iSCSI disk may be faster than a 9-ms latency on physical media.

There was a time when it was better to place workstations' swap files on the far side of a 100Mbps ethernet link rather than using the local spinning rust. Ah, the good old days...

--Joe
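To put rough numbers on Joe's comparison (a back-of-envelope bound that ignores everything except the per-write round trip): a synchronous write can't complete faster than one slog round trip, so latency alone caps a single writer at

    1 / 0.0005 s  =  2000 sync writes/s  (.5-ms network RTT to a ramdisk)
    1 / 0.009 s   ~=  111 sync writes/s  (9-ms physical disk)

an 18x gap from latency alone, roughly the same order of magnitude as the 3073 -> 23953 IOPS jump Chris measured with several writers in flight.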
Brian Hechinger
2008-Oct-06 21:38 UTC
[zfs-discuss] An slog experiment (my NAS can beat up your NAS)
On Sun, Oct 05, 2008 at 11:30:54PM -0500, Nicolas Williams wrote:
> There have been threads about adding a feature to support slow mirror
> devices that don't stay synced synchronously. At least IIRC. That
> would help. But then, if the pool is busy writing, your slow ZIL

That would definitely be a great help.

> mirrors would generally be out of sync, thus being of no help in the
> event of a power failure given fast slog devices that don't survive
> power failure.

Maybe not, but it would at least save *something* as opposed to not saving anything at all. Still, with enough UPS power, there should be at least enough run time left to get the rest of the ZIL to the disk mirror.

> Also, using remote devices for a ZIL may defeat the purpose of fast
> ZILs, even if the actual devices are fast, because what really matters
> here is latency, and the farther the device, the higher the latency.

4Gb FC is slow and high latency? Tell that to all my local fast disks that are attached via FC. :)

> Yes, it's pretty smart. Add a UPS and it's sort of like battery-backed
> RAM. You can probably get a good enough reliability rate out of this
> for your purposes, though actual slog devices would be better if you
> can afford them.

Or would they? A box dedicated to being a RAM-based slog is going to be faster than any SSD would be. Especially if you make the expensive jump to 8Gb FC.

-brian

--
"Coding in C is like sending a 3 year old to do groceries. You gotta
tell them exactly what you want or you'll end up with a cupboard full
of pop tarts and pancake mix." -- IRC User (http://www.bash.org/?841435)
Brian Hechinger
2008-Oct-06 21:44 UTC
[zfs-discuss] An slog experiment (my NAS can beat up your NAS)
On Mon, Oct 06, 2008 at 10:47:04AM -0400, Moore, Joe wrote:
> I wonder if an AVS-replicated storage device on the backends would be
> appropriate?
>
> write -> ZFS-mirrored slog -> ramdisk -AVS-> physical disk
>                           \
>                            +-iscsi-> ramdisk -AVS-> physical disk
>
> You'd get the continuous replication of the ramdisk to the physical
> drive (and perhaps automagic recovery on reboot) but not pay the
> synchronous-write-to-remote-physical-disk penalty.

Hmmm, AVS *might* just be the ticket here. Will have to look at that.

> A .5-ms RTT on an ethernet link to the iSCSI disk may be faster than
> a 9-ms latency on physical media.

Or, if you're looking into what I'm thinking with 4Gb/8Gb FC, it gets even better.

> There was a time when it was better to place workstations' swap files
> on the far side of a 100Mbps ethernet link rather than using the local
> spinning rust. Ah, the good old days...

I remember those days. My SPARCstation LX ran that way. Not due to speed, however; due to lack of disk space in the LX. ;)

-brian

--
"Coding in C is like sending a 3 year old to do groceries. You gotta
tell them exactly what you want or you'll end up with a cupboard full
of pop tarts and pancake mix." -- IRC User (http://www.bash.org/?841435)
Nicolas Williams
2008-Oct-06 21:51 UTC
[zfs-discuss] An slog experiment (my NAS can beat up your NAS)
On Mon, Oct 06, 2008 at 05:38:33PM -0400, Brian Hechinger wrote:
> On Sun, Oct 05, 2008 at 11:30:54PM -0500, Nicolas Williams wrote:
> > There have been threads about adding a feature to support slow mirror
> > devices that don't stay synced synchronously. At least IIRC. That
> > would help. But then, if the pool is busy writing, your slow ZIL
>
> That would definitely be a great help.
>
> > mirrors would generally be out of sync, thus being of no help in the
> > event of a power failure given fast slog devices that don't survive
> > power failure.
>
> Maybe not, but it would at least save *something* as opposed to not
> saving anything at all. Still, with enough UPS power, there should be
> at least enough run time left to get the rest of the ZIL to the disk
> mirror.

Yes. But again, you get somewhat more protection from writing to a write-biased SSD in that once the ZIL bits are committed you get protection from panics in the OS too, not just power failure.

> > Also, using remote devices for a ZIL may defeat the purpose of fast
> > ZILs, even if the actual devices are fast, because what really matters
> > here is latency, and the farther the device, the higher the latency.
>
> 4Gb FC is slow and high latency? Tell that to all my local fast disks
> that are attached via FC. :)

The comparison was to RAM, not "local fast disks." I'm pretty sure that local RAM beats remote-anything, no matter what the "anything" (as long as it isn't RAM) and what the protocol to get to it (as long as it isn't a normal backplane). (You could claim that with NUMA memory can be remote, so let's say that for a reasonable value of "remote.")

> > Yes, it's pretty smart. Add a UPS and it's sort of like battery-backed
> > RAM. You can probably get a good enough reliability rate out of this
> > for your purposes, though actual slog devices would be better if you
> > can afford them.
>
> Or would they? A box dedicated to being a RAM-based slog is going to be
> faster than any SSD would be. Especially if you make the expensive jump
> to 8Gb FC.

Unless the SSD had a battery-backed RAM cache, or were based entirely on battery-backed RAM (but then you have to worry about battery upkeep).

To me this is a performance/reliability trade-off. RAM slogs mirrored in cluster + UPS -> very fast, works as well as the UPS. Write-biased flash slogs -> fast, no UPS to worry about.

Nico
--
Brian Hechinger
2008-Oct-06 22:53 UTC
[zfs-discuss] An slog experiment (my NAS can beat up your NAS)
On Mon, Oct 06, 2008 at 10:47:04AM -0400, Moore, Joe wrote:
> I wonder if an AVS-replicated storage device on the backends would be
> appropriate?
>
> write -> ZFS-mirrored slog -> ramdisk -AVS-> physical disk
>                           \
>                            +-iscsi-> ramdisk -AVS-> physical disk
>
> You'd get the continuous replication of the ramdisk to the physical
> drive (and perhaps automagic recovery on reboot) but not pay the
> synchronous-write-to-remote-physical-disk penalty.

It looks like the answer is no.

wonko@wintermute$ sudo sndradm -e localhost /dev/rramdisk/avstest1 /dev/zvol/rdsk/SYS0/bitmap1 \
    wintermute /dev/zvol/dsk/SYS0/avstest2 /dev/zvol/rdsk/SYS0/bitmap2 ip async
Enable Remote Mirror? (Y/N) [N]: y
sndradm: Error: both localhost and wintermute are local

In order to use AVS, it looks like you'd have to replicate between two (or more) "ZIL boxes". Not the worst thing in the world to have to do, but it certainly complicates things. Also, you don't get that super fast RAM->disk sync anymore, as you now have to traverse an IP network to get there. Still, it might be an acceptable way to achieve the goals we are looking at here.

I guess at this point falling back to 'zfs send' run in a continuous loop might be an alternative.

-brian

--
"Coding in C is like sending a 3 year old to do groceries. You gotta
tell them exactly what you want or you'll end up with a cupboard full
of pop tarts and pancake mix." -- IRC User (http://www.bash.org/?841435)
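The "zfs send in a continuous loop" fallback might look roughly like this. Note the heavy assumptions: the data being protected must live in a ZFS dataset (say, a zvol standing in for the ramdisk, which goes beyond what was described here), an initial full send must already have been received on the peer, and the copy always trails the source by one iteration:

#!/bin/ksh
# Continuously ship incremental snapshots of tank/slog to backuphost.
# Assumes the copy was seeded once with:
#   zfs snapshot tank/slog@sync.0
#   zfs send tank/slog@sync.0 | ssh backuphost zfs recv tank/slog
i=0
while true; do
    zfs snapshot tank/slog@sync.$((i + 1))
    zfs send -i tank/slog@sync.$i tank/slog@sync.$((i + 1)) | \
        ssh backuphost zfs recv tank/slog
    zfs destroy tank/slog@sync.$i   # prune locally; snapshots pile up remotely
    i=$((i + 1))
    sleep 5
done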
Brian Hechinger
2008-Oct-06 22:59 UTC
[zfs-discuss] An slog experiment (my NAS can beat up your NAS)
On Mon, Oct 06, 2008 at 01:13:40AM -0700, Ross wrote:
> It's also worth bearing in mind that you can have multiple mirrors.
> I don't know what effect that will have on performance, but it's an
> easy way to boost reliability even further. I think this idea
> configured on a set of 2-3 servers, with separate UPSes for each, and
> a script that can export the pool and save the ramdrive when the power
> fails, is potentially a very neat little system.

The more slog devices, the better. :)

If the host using the slogs could trigger the shutdown, that would be even better, I think. Once we know the zpool is exported, the slogs have just entered a nicely consistent state, at which point the copies could be made.

Also, it would be nice if the host using these slogs could wait until enough of them are online before attempting to mount its pool. That shouldn't be too hard; nothing more than some startup script modifications.

-brian

--
"Coding in C is like sending a 3 year old to do groceries. You gotta
tell them exactly what you want or you'll end up with a cupboard full
of pop tarts and pancake mix." -- IRC User (http://www.bash.org/?841435)
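The startup-side change Brian describes could be little more than a wait loop ahead of the import (the device paths are placeholders for the iSCSI-backed slog LUNs; a production version would want a timeout and an alert):

#!/bin/ksh
# Hold off importing the pool until both slog devices have appeared.
until [ -c /dev/rdsk/c2t1d0s0 ] && [ -c /dev/rdsk/c3t1d0s0 ]; do
    sleep 2
done
zpool import tank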
> Or would they? A box dedicated to being a RAM-based slog is going to
> be faster than any SSD would be. Especially if you make the expensive
> jump to 8Gb FC.

Not necessarily. While this has some advantages in terms of price and performance, at ~$2400 the 80GB ioDrive would give it a run for its money: 600MB/s and enough capacity to (hopefully) use it as an L2ARC as well. When you consider that you need at least two machines, UPSes, and the supporting infrastructure for this idea, the ioDrive really isn't far off for cost.

--
This message posted from opensolaris.org
Robert Milkowski
2008-Oct-07 11:38 UTC
[zfs-discuss] An slog experiment (my NAS can beat up your NAS)
Hello Nicolas,

Monday, October 6, 2008, 10:51:58 PM, you wrote:

NW> I'm pretty sure that local RAM beats remote-anything, no matter what the
NW> "anything" (as long as it isn't RAM) and what the protocol to get to it
NW> (as long as it isn't a normal backplane). (You could claim that with NUMA
NW> memory can be remote, so let's say that for a reasonable value of
NW> "remote.")

IIRC the total throughput to remote memory over Sun Fire Link could be faster than to local memory... just a funny thing I remembered. Not that it is relevant here.

--
Best regards,
Robert                          mailto:milek@task.gda.pl
                                http://milek.blogspot.com
Moore, Joe
2008-Oct-08 12:50 UTC
[zfs-discuss] An slog experiment (my NAS can beat up your NAS)
Brian Hechinger wrote:
> On Mon, Oct 06, 2008 at 10:47:04AM -0400, Moore, Joe wrote:
> > I wonder if an AVS-replicated storage device on the backends would
> > be appropriate?
> >
> > write -> ZFS-mirrored slog -> ramdisk -AVS-> physical disk
> >                           \
> >                            +-iscsi-> ramdisk -AVS-> physical disk
> >
> > You'd get the continuous replication of the ramdisk to the physical
> > drive (and perhaps automagic recovery on reboot) but not pay the
> > synchronous-write-to-remote-physical-disk penalty.
>
> It looks like the answer is no.
>
> wonko@wintermute$ sudo sndradm -e localhost /dev/rramdisk/avstest1 /dev/zvol/rdsk/SYS0/bitmap1 \
>     wintermute /dev/zvol/dsk/SYS0/avstest2 /dev/zvol/rdsk/SYS0/bitmap2 ip async
> Enable Remote Mirror? (Y/N) [N]: y
> sndradm: Error: both localhost and wintermute are local

I've not worked with AVS other than looking at the basic concepts, but to me this looks like a don't-shoot-yourself-in-the-foot critical warning rather than an actual functionality restriction. Is there a -force option to override this normally quite reasonable sanity check?

--Joe
Wilkinson, Alex
2008-Oct-08 13:10 UTC
[zfs-discuss] An slog experiment (my NAS can beat up your NAS)
On Sat, Oct 04, 2008 at 10:37:26PM -0700, Chris Greer wrote:
> The big thing here is that I ended up getting a MASSIVE boost in
> performance even with the overhead of the 1Gb link and iSCSI. The
> iorate test I was using went from 3073 IOPS on 90% sequential writes
> to 23953 IOPS with the RAM slog added. The service time was also
> significantly better than the physical disk.

Curious, what tool did you use to benchmark your IOPS?

-aW

IMPORTANT: This email remains the property of the Australian Defence Organisation and is subject to the jurisdiction of section 70 of the CRIMES ACT 1914. If you have received this email in error, you are requested to contact the sender and delete the email.
Brian Hechinger
2008-Oct-08 15:25 UTC
[zfs-discuss] An slog experiment (my NAS can beat up your NAS)
On Wed, Oct 08, 2008 at 08:50:57AM -0400, Moore, Joe wrote:
> I've not worked with AVS other than looking at the basic concepts, but
> to me this looks like a don't-shoot-yourself-in-the-foot critical
> warning rather than an actual functionality restriction. Is there a
> -force option to override this normally quite reasonable sanity check?

There is no force option that I can see, but I've also never worked with AVS.

-brian

--
"Coding in C is like sending a 3 year old to do groceries. You gotta
tell them exactly what you want or you'll end up with a cupboard full
of pop tarts and pancake mix." -- IRC User (http://www.bash.org/?841435)
Chris Greer
2008-Oct-08 16:35 UTC
[zfs-discuss] An slog experiment (my NAS can beat up your NAS)
I was using EMC's iorate for the comparison:

ftp://ftp.emc.com/pub/symm3000/iorate/

I had 4 processes running on the pool in parallel doing 4K sequential writes. I've also been playing around with a few other benchmark tools (I just happened to have results from other storage tests with this same iorate test).

--
This message posted from opensolaris.org
Jim Dunham
2008-Oct-08 22:27 UTC
[zfs-discuss] An slog experiment (my NAS can beat up your NAS)
Joe,

> > wonko@wintermute$ sudo sndradm -e localhost /dev/rramdisk/avstest1 /dev/zvol/rdsk/SYS0/bitmap1 \
> >     wintermute /dev/zvol/dsk/SYS0/avstest2 /dev/zvol/rdsk/SYS0/bitmap2 ip async
> > Enable Remote Mirror? (Y/N) [N]: y
> > sndradm: Error: both localhost and wintermute are local
>
> I've not worked with AVS other than looking at the basic concepts,
> but to me this looks like a don't-shoot-yourself-in-the-foot critical
> warning rather than an actual functionality restriction. Is there a
> -force option to override this normally quite reasonable sanity check?

This is a hard restriction, with no override. AVS, or more specifically the remote replication component called SNDR, needs to know which end of the replica is the SNDR primary node and which end is the SNDR secondary node. Since SNDR requires this information to know which direction to replicate data, a single Solaris node cannot be both the primary and the secondary node.

If one wants this type of mirror functionality on a single node, use host-based or controller-based mirroring software.

Jim Dunham
Storage Platform Software Group
Sun Microsystems, Inc.
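For contrast with the failed attempt above, the shape of an enable that SNDR will accept names two distinct hosts, one per end. The second hostname and the volume paths below are placeholders following Brian's example, and the same command is normally run on both nodes so each end knows its role:

sndradm -e wintermute /dev/rramdisk/avstest1 /dev/zvol/rdsk/SYS0/bitmap1 \
        straylight /dev/zvol/rdsk/SYS0/avstest2 /dev/zvol/rdsk/SYS0/bitmap2 \
        ip async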
Brian Hechinger
2008-Oct-08 22:42 UTC
[zfs-discuss] An slog experiment (my NAS can beat up your NAS)
On Wed, Oct 08, 2008 at 06:27:51PM -0400, Jim Dunham wrote:
> If one wants this type of mirror functionality on a single node, use
> host-based or controller-based mirroring software.

Is there mirroring software that can do async copies to a mirror?

-brian

--
"Coding in C is like sending a 3 year old to do groceries. You gotta
tell them exactly what you want or you'll end up with a cupboard full
of pop tarts and pancake mix." -- IRC User (http://www.bash.org/?841435)
Keith Bierman
2008-Oct-10 02:32 UTC
[zfs-discuss] An slog experiment (my NAS can beat up your NAS)
On Oct 8, 2008, at 4:27 PM, Jim Dunham wrote:
> ... a single Solaris node cannot be both the primary and the secondary
> node.
>
> If one wants this type of mirror functionality on a single node, use
> host-based or controller-based mirroring software.

If one is running multiple zones, couldn't you fool AVS into thinking that one zone was the primary and the other the secondary?

--
Keith H. Bierman   khbkhb@gmail.com | AIM kbiermank
5430 Nassau Circle East | Cherry Hills Village, CO 80113 | 303-997-2749
<speaking for myself*> Copyright 2008