Let me frame this specifically in the context of VMware ESXi 4.x. If I create a zvol and give it to ESXi via iSCSI, our experience has been that it is very fast and guest response is excellent. If we use NFS without a ZIL accelerator (we use the DDRdrive X1, which is excellent), NFS performance is not very good, because VMware issues sync (Stable = FSYNC) writes. Once we enable our ZIL accelerator, NFS performance is approximately as fast as iSCSI. Enabling or disabling the ZIL accelerator has no measurable impact on iSCSI performance for us.

Does a zvol use the ZIL, then, or not? If it does, then iSCSI performance should also be slower without a ZIL accelerator, but it isn't. If it doesn't, then is it true that if the power goes off while I'm writing to iSCSI and I have no battery-backed HBA or RAID card, I'll lose data?

-- Eff Norwood
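For readers following along, the "ZIL accelerator" here is a dedicated log (slog) device attached to the pool. A minimal sketch of adding one; the pool and device names are placeholders, not from the thread:

    # add a dedicated log (slog) device to the pool
    # (tank and c2t0d0 are placeholder names)
    zpool add tank log c2t0d0

    # confirm the log device shows up under the pool
    zpool status tank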
Yes, ZVOLs do use the ZIL, if the write cache has been disabled on the zvol via the DKIOCSETWCE ioctl or if the sync property is set to always.

-- Darren J Moffat
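A minimal sketch of the second condition Darren mentions, assuming a hypothetical zvol named tank/vmvol (the name is a placeholder, not from the thread):

    # force every write to the zvol through the ZIL, whether or not
    # the initiator asks for a synchronous write
    zfs set sync=always tank/vmvol

    # confirm the current setting
    zfs get sync tank/vmvol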
Does the write cache referred to above refer to the "Writeback Cache" property listed by stmfadm list-lu -v (when a zvol is a target), or is that some other cache, and if so, how does it interact with the first one?

-- Maurice Volaski
On 21/10/2010 18:59, Maurice Volaski wrote:
> Does the write cache referred to above refer to the "Writeback Cache"
> property listed by stmfadm list-lu -v (when a zvol is a target), or
> is that some other cache, and if so, how does it interact with the
> first one?

Yes, it does. That basically results in the DKIOCGETWCE ioctl being called on the ZVOL (though you won't see that in truss, because it is called from the COMSTAR kernel modules, not directly from stmfadm in userland).

-- Darren J Moffat
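For reference, a quick way to see the property Darren refers to; the exact output line is recalled from memory, so treat its formatting as an assumption:

    # list each logical unit with all of its properties; one of the
    # lines reported per LU is the writeback cache state, e.g.
    #   Writeback Cache : Enabled
    stmfadm list-lu -v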
On Oct 21, 2010, at 6:19 AM, Eff Norwood wrote:
> [...] Does a zvol use the ZIL, then, or not? If it does, then iSCSI
> performance should also be slower without a ZIL accelerator, but it
> isn't. If it doesn't, then is it true that if the power goes off while
> I'm writing to iSCSI and I have no battery-backed HBA or RAID card,
> I'll lose data?

The risk here is not really different from that faced by normal disk drives which have nonvolatile buffers (e.g. virtually all HDDs and some SSDs). This is why applications can send cache flush commands when they need to ensure the data is on the media.

-- richard
On Thu, 2010-10-21 at 17:09 -0700, Richard Elling wrote:
> The risk here is not really different from that faced by normal disk
> drives which have nonvolatile buffers (e.g. virtually all HDDs and
> some SSDs). This is why applications can send cache flush commands
> when they need to ensure the data is on the media.

I think you mean "volatile buffers", right? You'll lose data if your HD or SSD has a volatile buffer (almost always DRAM chips with no battery or supercapacitor).

-- Erik Trimble
Java System Support
Mailstop: usca22-317
Phone: x67195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)
On Oct 21, 2010, at 5:26 PM, Erik Trimble wrote:
> I think you mean "volatile buffers", right? You'll lose data if your
> HD or SSD has a volatile buffer (almost always DRAM chips with no
> battery or supercapacitor).

Indeed... to quote someone (Erik) recently, (a) I'm not infallible. :-)

-- richard
>>>>> "re" == Richard Elling <richard.elling at gmail.com> writes:

    re> The risk here is not really different from that faced by
    re> normal disk drives which have nonvolatile buffers (eg
    re> virtually all HDDs and some SSDs). This is why applications
    re> can send cache flush commands when they need to ensure the
    re> data is on the media.

It's probably different because of the iSCSI target reboot problem I've written about before:

    iSCSI initiator         iSCSI target         nonvolatile medium

    write A    ------------>
               <-----        ack A
    write B    ------------>
               <-----        ack B
                             ---------->         [A]
                             [REBOOT]
    write C    ------------>
    [timeout!]
    reconnect  ------------>
               <-----        ack Connected
    write C    ------------>
               <-----        ack C
    flush      ------------>
                             ---------->         [C]
               <-----        ack Flush

In the above time chart, the initiator thinks A, B, and C are written, but in fact only A and C are written. I regard this as a failing of imagination in the SCSI protocol, but probably, with a better understanding of the details than I have, the initiator could be made to provably work around the problem. My guess has always been that no current initiators actually do, though.

I think it could happen also with a directly-attached SATA disk if you remove power from the disk without rebooting the host, so as Richard said it is not really different, except that in the real world it's much more common for an iSCSI target to lose power without the initiator also losing power than it is for a disk to lose power without its host adapter losing power.

The ancient practice of unix filesystem design always considers cord-yanking as something happening to the entire machine, and failing disks are not the filesystem's responsibility to work around, because how could it? This assumption should have been changed, and wasn't, when we entered the era of RAID and removable disks, where the connections to disks and the disks themselves are both allowed to fail. However, when NFS was designed, the assumption *WAS* changed: NFSv2 and earlier always operated with the write cache OFF to be safe from this, just as COMSTAR does in its (default?) abysmal-performance mode. (So campuses bought Prestoserve cards, equivalent to a DDRdrive except much less silly because they have onboard batteries, or Auspex servers with included NVRAM; these are analogous, outside the NFS world, to NetApp/Hitachi/EMC FC/iSCSI targets, which always have big NVRAMs so they can leave the write cache off.) NFSv3 has a commit protocol that is smart enough to replay the 'write B', which makes the nonvolatile caches less necessary (so long as you're not closing files frequently, I guess?).

I think it would be smart to design more storage systems so NFS can replace the role of iSCSI for disk access. In Isilon or Lustre clusters this trick is common when a node can settle for unshared access to a subtree: create an image file on the NFS/Lustre back end and fill it with an ext3 or XFS filesystem, and writes to that inner filesystem become much faster, because this Rube Goldberg arrangement discards the close-to-open consistency guarantee. We might use it in the ZFS world for actual physical disk access instead of iSCSI; for example, it should be possible to NFS-export a zvol and see a share with a single file in it named 'theTarget' or something, but this file would be without read-ahead. Better yet, to accommodate VMware limitations, would be to export a single fake /zvol share containing all NFS-shared zvols, with their files appearing within this share as you export them.

Also it should be possible to mount vdev elements over NFS without deadlocks -- I know that is difficult, but VMware does it. Perhaps it cannot be done through the existing NFS client, but obviously it can be done somehow, and it would both solve the iSCSI target reboot problem and also allow using more kinds of proprietary storage backend; the same reasons VMware wants to give admins a choice apply to ZFS. When NFS is used in this way the disk image file is never closed, so the NFS server will not need a slog to give good performance: the same job is accomplished by double-caching the uncommitted data on the client so it can be replayed if the time diagram above happens.
On Oct 22, 2010, at 10:40 AM, Miles Nordin wrote:
> It's probably different because of the iSCSI target reboot problem
> I've written about before: [time chart snipped]
>
> In the above time chart, the initiator thinks A, B, and C are written,
> but in fact only A and C are written. [...]
>
> I think it could happen also with a directly-attached SATA disk if you
> remove power from the disk without rebooting the host, so as Richard
> said it is not really different, except that in the real world it's
> much more common for an iSCSI target to lose power without the
> initiator also losing power than it is for a disk to lose power
> without its host adapter losing power.

I agree. I'd like to have some good field information on this, but I think it is safe to assume that for the average small server, when the local disks lose power the server also loses power, and the exposure to this issue is lost in the general recovery of the server. For the geezers who remember such pain as Netware or Aegis, NFS was a breath of fresh air and led to much more robust designs. In that respect, iSCSI or even FC is a step backwards, down the protocol stack.

> The ancient practice of unix filesystem design always considers
> cord-yanking as something happening to the entire machine [...]
> NFSv3 has a commit protocol that is smart enough to replay the
> 'write B', which makes the nonvolatile caches less necessary (so long
> as you're not closing files frequently, I guess?).

With COMSTAR, you can implement the commit-to-media policy in at least three ways:

1. server side: disable the writeback cache, per LUN
2. server side: change the sync policy to "always" for the zvol
3. client side: disable write cache enable, per LUN

For choices 1 and 2, the ZIL and separate log come into play. (A command sketch of these policies follows after this message.)

> I think it would be smart to design more storage systems so NFS can
> replace the role of iSCSI for disk access.

I agree.

> In Isilon or Lustre clusters this trick is common when a node can
> settle for unshared access to a subtree [...] When NFS is used in this
> way the disk image file is never closed, so the NFS server will not
> need a slog to give good performance: the same job is accomplished by
> double-caching the uncommitted data on the client so it can be
> replayed if the time diagram above happens.

In the case of VMs, I particularly dislike the failure policies. With NFS, by default, it was simple: if the client couldn't hear the server, the processes blocked on I/O remained blocked. Later, the "soft" option was added so that they would eventually return failures, but that just let system administrators introduce the same sort of failure mode as iSCSI. In the VM world, it seems the hypervisors try to do crazy things like make the disks readonly, which is perhaps the worst thing you can do to a guest OS, because now it needs to be rebooted -- kinda like the old Netware world that we so gladly left behind.

-- richard
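A minimal command-level sketch of the two server-side policies Richard lists above, assuming a hypothetical zvol tank/vmvol exported as an LU; the GUID is a placeholder and the wcd ("write cache disabled") property name is recalled from COMSTAR documentation, so verify before relying on it:

    # policy 2: force all writes to the zvol through the ZIL / slog
    zfs set sync=always tank/vmvol

    # policy 1: disable the writeback cache on the LU itself
    # (600144f0...beef is a placeholder GUID; wcd = write cache disabled)
    stmfadm modify-lu -p wcd=true 600144f00000000000000000deadbeef

    # verify; look for the "Writeback Cache" line in the output
    stmfadm list-lu -v

Policy 3 is set on the initiator side and depends on the client OS; on a Solaris initiator that would be the drive-level write cache setting (for example via the cache menu in format -e), which is outside the server's control.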
>>>>> "re" == Richard Elling <richard.elling at gmail.com> writes:

    re> it seems the hypervisors try to do crazy things like make the
    re> disks readonly,

haha!

    re> which is perhaps the worst thing you can do to a guest OS
    re> because now it needs to be rebooted

I might've set it up to "pause" the VM for most failures, and for punts like this read-only case, maybe leave it paused until someone comes along to turn it off or unpause it. But for loss of connection to an iSCSI-backed disk, I think that's wrong. I guess the truly correct failure handling would be to immediately power off the guest VM: pausing it tempts the sysadmin to fix the iSCSI connection and unpause it, which in this case is the only real disaster-begging thing to do. One would get a lot of complaints from sysadmins who don't understand the iSCSI write hole, but I think it's right. So, in that context, maybe read-only-until-reboot is actually not so dumb!

For guests unknowingly getting their disks via NFS, it would make sense to pause the VM to stop (some of) its interval timers (and hope the timer running the ATA/SCSI/... driver is among the stopped ones), because the guest's disk driver won't understand NFS hard-mount timeout rules: it won't understand that, for certain errors, you can pass "stale file handle" up the stack, but for other errors you must wait forever. Instead it will enforce a 30-second timeout, as for an ATA disk.

I think you could probably still avoid losing the 'write B' if the guest fired its ATA timeout with an NFS-backed disk, because the writes have already been handed off to the host. It might be a weird user experience in the VM manager, because whatever process is doing the NFS writes will be stuck unkillable in the 'D' state even if you power off the VM, but this weirdness is an expression of arcane reality, not a bug. It'd be a better sysadmin experience to avoid the guest ATA timeout, though: pause the VM and resume it, so that NFS server reboots would freeze guests for a while rather than require rebooting them, just as they do for nonvirtual NFSv3 clients. You would have to figure out the maximum number of seconds the guests can go without disk access, and deviously pause them before their buried / proprietary disk timeouts can fire.