Let me frame this specifically in the context of VMware ESXi 4.x. If I create a zvol and give it to ESXi via iSCSI, our experience has been that it is very fast and guest response is excellent. If we use NFS without a ZIL accelerator (we use the DDRdrive X1, which is excellent), NFS performance is not very good, because VMware issues sync (Stable = FSYNC) writes. Once we enable our ZIL accelerator, NFS performance is approximately as fast as iSCSI. Enabling or disabling the ZIL accelerator has no measurable impact on iSCSI performance for us.

Does a zvol use the ZIL, then, or not? If it does, then iSCSI performance should also be slower without a ZIL accelerator, but it isn't. If it doesn't, then is it true that if the power goes off while I'm writing to iSCSI and I have no battery-backed HBA or RAID card, I'll lose data?

-- Eff Norwood
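For readers following along, the "ZIL accelerator" here is a dedicated log (slog) device attached to the pool. A minimal sketch of adding one; the pool and device names are placeholders, not from the thread:

    # add a dedicated log (slog) device to the pool
    # (tank and c2t0d0 are placeholder names)
    zpool add tank log c2t0d0

    # confirm the log device shows up under the pool
    zpool status tank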
Yes, ZVOLs do use the ZIL, if the write cache has been disabled on the zvol via the DKIOCSETWCE ioctl or if the sync property is set to always.

-- Darren J Moffat
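A minimal sketch of the second condition Darren mentions, assuming a hypothetical zvol named tank/vmvol (the name is a placeholder, not from the thread):

    # force every write to the zvol through the ZIL, whether or not
    # the initiator asks for a synchronous write
    zfs set sync=always tank/vmvol

    # confirm the current setting
    zfs get sync tank/vmvol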
Does the write cache referred to above refer to the "Writeback Cache" property listed by stmfadm list-lu -v (when a zvol is a target), or is that some other cache, and if so, how does it interact with the first one?

-- Maurice Volaski
On 21/10/2010 18:59, Maurice Volaski wrote:
> Does the write cache referred to above refer to the "Writeback Cache"
> property listed by stmfadm list-lu -v (when a zvol is a target), or
> is that some other cache, and if so, how does it interact with the
> first one?

Yes, it does. That basically results in the DKIOCGETWCE ioctl being called on the ZVOL (though you won't see that in truss, because it is called from the COMSTAR kernel modules, not directly from stmfadm in userland).

-- Darren J Moffat
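For reference, a quick way to see the property Darren refers to; the exact output line is recalled from memory, so treat its formatting as an assumption:

    # list each logical unit with all of its properties; one of the
    # lines reported per LU is the writeback cache state, e.g.
    #   Writeback Cache : Enabled
    stmfadm list-lu -v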
On Oct 21, 2010, at 6:19 AM, Eff Norwood wrote:
> [...] Does a zvol use the ZIL, then, or not? If it does, then iSCSI
> performance should also be slower without a ZIL accelerator, but it
> isn't. If it doesn't, then is it true that if the power goes off while
> I'm writing to iSCSI and I have no battery-backed HBA or RAID card,
> I'll lose data?

The risk here is not really different from that faced by normal disk drives which have nonvolatile buffers (e.g. virtually all HDDs and some SSDs). This is why applications can send cache flush commands when they need to ensure the data is on the media.

-- richard
On Thu, 2010-10-21 at 17:09 -0700, Richard Elling wrote:
> The risk here is not really different from that faced by normal disk
> drives which have nonvolatile buffers (e.g. virtually all HDDs and
> some SSDs). This is why applications can send cache flush commands
> when they need to ensure the data is on the media.

I think you mean "volatile buffers", right? You'll lose data if your HD or SSD has a volatile buffer (almost always DRAM chips with no battery or supercapacitor).

-- Erik Trimble
Java System Support
Mailstop: usca22-317
Phone: x67195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)
On Oct 21, 2010, at 5:26 PM, Erik Trimble wrote:
> I think you mean "volatile buffers", right? You'll lose data if your
> HD or SSD has a volatile buffer (almost always DRAM chips with no
> battery or supercapacitor).

Indeed... to quote someone (Erik) recently, (a) I'm not infallible. :-)

-- richard
>>>>> "re" == Richard Elling <richard.elling at gmail.com> writes:

    re> The risk here is not really different from that faced by
    re> normal disk drives which have nonvolatile buffers (eg
    re> virtually all HDDs and some SSDs). This is why applications
    re> can send cache flush commands when they need to ensure the
    re> data is on the media.

It's probably different because of the iSCSI target reboot problem I've written about before:

    iSCSI initiator         iSCSI target         nonvolatile medium

    write A    ------------>
               <-----        ack A
    write B    ------------>
               <-----        ack B
                             ---------->         [A]
                             [REBOOT]
    write C    ------------>
    [timeout!]
    reconnect  ------------>
               <-----        ack Connected
    write C    ------------>
               <-----        ack C
    flush      ------------>
                             ---------->         [C]
               <-----        ack Flush

In the above time chart, the initiator thinks A, B, and C are written, but in fact only A and C are written. I regard this as a failing of imagination in the SCSI protocol, but probably, with a better understanding of the details than I have, the initiator could be made to provably work around the problem. My guess has always been that no current initiators actually do, though.

I think it could happen also with a directly-attached SATA disk if you remove power from the disk without rebooting the host, so as Richard said it is not really different, except that in the real world it's much more common for an iSCSI target to lose power without the initiator also losing power than it is for a disk to lose power without its host adapter losing power.

The ancient practice of unix filesystem design always considers cord-yanking as something happening to the entire machine, and failing disks are not the filesystem's responsibility to work around, because how could it? This assumption should have been changed, and wasn't, when we entered the era of RAID and removable disks, where the connections to disks and the disks themselves are both allowed to fail. However, when NFS was designed, the assumption *WAS* changed: NFSv2 and earlier always operated with the write cache OFF to be safe from this, just as COMSTAR does in its (default?) abysmal-performance mode. (So campuses bought Prestoserve cards, equivalent to a DDRdrive except much less silly because they have onboard batteries, or Auspex servers with included NVRAM; these are analogous, outside the NFS world, to NetApp/Hitachi/EMC FC/iSCSI targets, which always have big NVRAMs so they can leave the write cache off.) NFSv3 has a commit protocol that is smart enough to replay the 'write B', which makes the nonvolatile caches less necessary (so long as you're not closing files frequently, I guess?).

I think it would be smart to design more storage systems so NFS can replace the role of iSCSI for disk access. In Isilon or Lustre clusters this trick is common when a node can settle for unshared access to a subtree: create an image file on the NFS/Lustre back end and fill it with an ext3 or XFS filesystem, and writes to that inner filesystem become much faster, because this Rube Goldberg arrangement discards the close-to-open consistency guarantee. We might use it in the ZFS world for actual physical disk access instead of iSCSI; for example, it should be possible to NFS-export a zvol and see a share with a single file in it named 'theTarget' or something, but this file would be without read-ahead. Better yet, to accommodate VMware limitations, would be to export a single fake /zvol share containing all NFS-shared zvols, with their files appearing within this share as you export them.

Also it should be possible to mount vdev elements over NFS without deadlocks -- I know that is difficult, but VMware does it. Perhaps it cannot be done through the existing NFS client, but obviously it can be done somehow, and it would both solve the iSCSI target reboot problem and also allow using more kinds of proprietary storage backend; the same reasons VMware wants to give admins a choice apply to ZFS. When NFS is used in this way the disk image file is never closed, so the NFS server will not need a slog to give good performance: the same job is accomplished by double-caching the uncommitted data on the client so it can be replayed if the time diagram above happens.
On Oct 22, 2010, at 10:40 AM, Miles Nordin wrote:
> It's probably different because of the iSCSI target reboot problem
> I've written about before: [time chart snipped]
>
> In the above time chart, the initiator thinks A, B, and C are written,
> but in fact only A and C are written. [...]
>
> I think it could happen also with a directly-attached SATA disk if you
> remove power from the disk without rebooting the host, so as Richard
> said it is not really different, except that in the real world it's
> much more common for an iSCSI target to lose power without the
> initiator also losing power than it is for a disk to lose power
> without its host adapter losing power.

I agree. I'd like to have some good field information on this, but I think it is safe to assume that for the average small server, when the local disks lose power the server also loses power, and the exposure to this issue is lost in the general recovery of the server. For the geezers who remember such pain as Netware or Aegis, NFS was a breath of fresh air and led to much more robust designs. In that respect, iSCSI or even FC is a step backwards, down the protocol stack.

> The ancient practice of unix filesystem design always considers
> cord-yanking as something happening to the entire machine [...]
> NFSv3 has a commit protocol that is smart enough to replay the
> 'write B', which makes the nonvolatile caches less necessary (so long
> as you're not closing files frequently, I guess?).

With COMSTAR, you can implement the commit-to-media policy in at least three ways:

1. server side: disable the writeback cache, per LUN
2. server side: change the sync policy to "always" for the zvol
3. client side: disable write cache enable, per LUN

For choices 1 and 2, the ZIL and separate log come into play. (A command sketch of these policies follows after this message.)

> I think it would be smart to design more storage systems so NFS can
> replace the role of iSCSI for disk access.

I agree.

> In Isilon or Lustre clusters this trick is common when a node can
> settle for unshared access to a subtree [...] When NFS is used in this
> way the disk image file is never closed, so the NFS server will not
> need a slog to give good performance: the same job is accomplished by
> double-caching the uncommitted data on the client so it can be
> replayed if the time diagram above happens.

In the case of VMs, I particularly dislike the failure policies. With NFS, by default, it was simple: if the client couldn't hear the server, the processes blocked on I/O remained blocked. Later, the "soft" option was added so that they would eventually return failures, but that just let system administrators introduce the same sort of failure mode as iSCSI. In the VM world, it seems the hypervisors try to do crazy things like make the disks readonly, which is perhaps the worst thing you can do to a guest OS, because now it needs to be rebooted -- kinda like the old Netware world that we so gladly left behind.

-- richard
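A minimal command-level sketch of the two server-side policies Richard lists above, assuming a hypothetical zvol tank/vmvol exported as an LU; the GUID is a placeholder and the wcd ("write cache disabled") property name is recalled from COMSTAR documentation, so verify before relying on it:

    # policy 2: force all writes to the zvol through the ZIL / slog
    zfs set sync=always tank/vmvol

    # policy 1: disable the writeback cache on the LU itself
    # (600144f0...beef is a placeholder GUID; wcd = write cache disabled)
    stmfadm modify-lu -p wcd=true 600144f00000000000000000deadbeef

    # verify; look for the "Writeback Cache" line in the output
    stmfadm list-lu -v

Policy 3 is set on the initiator side and depends on the client OS; on a Solaris initiator that would be the drive-level write cache setting (for example via the cache menu in format -e), which is outside the server's control.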
>>>>> "re" == Richard Elling <richard.elling at gmail.com> writes:

    re> it seems the hypervisors try to do crazy things like make the
    re> disks readonly,

haha!

    re> which is perhaps the worst thing you can do to a guest OS
    re> because now it needs to be rebooted

I might've set it up to "pause" the VM for most failures, and for punts like this read-only case, maybe leave it paused until someone comes along to turn it off or unpause it. But for loss of connection to an iSCSI-backed disk, I think that's wrong. I guess the truly correct failure handling would be to immediately power off the guest VM: pausing it tempts the sysadmin to fix the iSCSI connection and unpause it, which in this case is the only real disaster-begging thing to do. One would get a lot of complaints from sysadmins who don't understand the iSCSI write hole, but I think it's right. So, in that context, maybe read-only-until-reboot is actually not so dumb!

For guests unknowingly getting their disks via NFS, it would make sense to pause the VM to stop (some of) its interval timers (and hope the timer running the ATA/SCSI/... driver is among the stopped ones), because the guest's disk driver won't understand NFS hard-mount timeout rules: it won't understand that, for certain errors, you can pass "stale file handle" up the stack, but for other errors you must wait forever. Instead it will enforce a 30-second timeout, as for an ATA disk.

I think you could probably still avoid losing the 'write B' if the guest fired its ATA timeout with an NFS-backed disk, because the writes have already been handed off to the host. It might be a weird user experience in the VM manager, because whatever process is doing the NFS writes will be stuck unkillable in the 'D' state even if you power off the VM, but this weirdness is an expression of arcane reality, not a bug. It'd be a better sysadmin experience to avoid the guest ATA timeout, though: pause the VM and resume it, so that NFS server reboots would freeze guests for a while rather than require rebooting them, just as they do for nonvirtual NFSv3 clients. You would have to figure out the maximum number of seconds the guests can go without disk access, and deviously pause them before their buried / proprietary disk timeouts can fire.