hi All,

I realize the subject is a bit incendiary, but we're running into what I view as a design omission in ZFS that is preventing us from building highly available storage infrastructure, and I want to bring some attention (again) to this major issue.

Currently we have a set of iSCSI targets published from a storage host and consumed by a ZFS host. If a _single_ disk on the storage host goes bad, ZFS pauses for a full 180 seconds before read/write operations resume. That is an aeon, well beyond any TCP timeout.

I've read the claims that ZFS is unconcerned with the underlying infrastructure, and I agree with the basic sense of those claims (see [1]). However:

* ZFS can see when a device's behaviour is inconsistent with its known recent performance norms, and
* ZFS knows when the data it is trying to fetch from that device is also resident on another device.

Why, then, would it not decide dynamically, based on a reasonably small sample of recent device performance, to drop its current attempt and instead fetch the data from the other device? I don't even think a configurable timeout is that useful; the decision should be based on a sample of performance from (say) a day. Or, hey, for the moment, just to make it easy: a configurable timeout!

As it is, I can't put this in production. 180 seconds is not "highly available"; it's users seeing "The connection has timed out".

Everything else about ZFS that I have seen and used - and I mean every other tiny detail - is crystalline perfection. So, for us, ZFS is a diamond with a little bit of volcanic crust remaining to be polished off. Is there any intention of dealing with this problem in the (hopefully very) near future?

If you're in the Bay Area, I will personally deliver two cases of the cold beer of your choice (including Trappist) if you solve this problem. If offering a bounty would have any effect, I'd offer one. We need this to work.

thanks,
_alex

Related:
[1] http://mail.opensolaris.org/pipermail/zfs-discuss/2008-August/thread.html#50609

--
alex black, founder
the turing studio, inc.
888.603.6023 / main
510.666.0074 / office
root at turingstudio.com
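For anyone wanting to see the pause for themselves, a minimal sketch of how to watch it from the consuming host while a disk on the storage host is failed; "tank" is only a placeholder pool name, and ptime/iostat/zpool are just the stock Solaris tools:

# Per-device service times show which vdev is stuck (watch asvc_t and %b):
iostat -xn 1 5

# Time how long a simple pool operation blocks during the fault:
ptime dd if=/dev/zero of=/tank/stall-test bs=128k count=16
ptime sync
zpool status tank    # has ZFS marked the device yet, or is it still ONLINE?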
I discussed this exact issue on the forums in February, and filed a bug at the time. I've also e-mailed and chatted with the iSCSI developers, and the iSER developers, a few times. There has also been another thread about making the iSCSI timeouts configurable a few months back, and finally, I started another discussion on ZFS availability and filed an RFE for pretty much exactly what you're asking for.

So the question is being asked, but as for how long it will be before Sun improve ZFS availability, I really wouldn't like to say. One potential problem is that Sun almost certainly have a pretty good HA system with Fishworks running on their own hardware, and I don't know how much they are going to want to create an open source alternative to that.

My original discussion in Feb:
http://opensolaris.org/jive/thread.jspa?messageID=213482

The iSCSI timeout bugs. The first one was raised in November 2006!!
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6497777
http://bugs.opensolaris.org/bugdatabase/view_bug.do;jsessionid=a1c19a874eb8bffffffffac94084acffabc5?bug_id=6670866

The ZFS availability thread:
http://www.opensolaris.org/jive/thread.jspa?messageID=274031

I can't find the RFE I filed on the back of that just yet; I'll have a look through my e-mails on Monday to find it for you.

The one bright point is that it does look like it would be possible to edit iscsi.h manually and recompile the driver, but that's a bit outside of my experience right now, so I'm leaving that until I have no other choice.

Ross
-- 
This message posted from opensolaris.org
David Anderson
2008-Dec-05 21:19 UTC
[zfs-discuss] zfs not yet suitable for HA applications?
Trying to keep this in the spotlight. Apologies for the lengthy post.

I'd really like to see features as described by Ross in his summary of the "Availability: ZFS needs to handle disk removal / driver failure better" thread (http://www.opensolaris.org/jive/thread.jspa?messageID=274031). I'd like to have these or similar features as well. Have there already been internal discussions about adding this type of functionality to ZFS itself, and was there approval, disapproval or no decision?

Unfortunately my situation has put me in urgent need of workarounds in the meantime.

My setup: I have two iSCSI target nodes, each with six drives exported via iscsi (Storage Nodes). I have a ZFS Node that logs into each target from both Storage Nodes and creates a mirrored zpool with one drive from each Storage Node comprising each half of the mirrored vdevs (6 x 2-way mirrors).

My problem: if a Storage Node crashes completely, is disconnected from the network, iscsitgt core dumps, a drive is pulled, or a drive has a problem accessing data (read retries), then my ZFS Node hangs while ZFS waits patiently for the layers below to report a problem and time out the devices. This can lead to a roughly 3 minute or longer halt when reading OR writing to the zpool on the ZFS Node. While this is acceptable in certain situations, I have a case where my availability demand is more severe.

My goal: figure out how to have the zpool pause for NO LONGER than 30 seconds (roughly within a typical HTTP request timeout) and then issue reads/writes to the good devices in the zpool/mirrors while the other side comes back online or is fixed.

My ideas (a rough /etc/system sketch of 1.a and 2.a is included after this message):

1. In the case of the iscsi targets disappearing (iscsitgt core dump, Storage Node crash, Storage Node disconnected from network), I need to lower the iSCSI login retry/timeout values. Am I correct in assuming the only way to accomplish this is to recompile the iscsi initiator? If so, can someone help point me in the right direction (I have never compiled ONNV sources - do I need to do this, or can I just recompile the iscsi initiator)?

1.a. I'm not sure in which initiator session states iscsi_sess_max_delay is applicable - only for the initial login, or also in the case of a reconnect? Ross, if you still have your test boxes available, can you please try setting "set iscsi:iscsi_sess_max_delay = 5" in /etc/system, reboot, and try failing your iscsi vdevs again? I can't find a case where this was tested for quick failover.

1.b. I would much prefer to have bug 6497777 addressed and fixed rather than having to resort to recompiling the iscsi initiator (if iscsi_sess_max_delay doesn't work). This seems like a trivial feature to implement. How can I sponsor development?

2. In the case of the iscsi target being reachable, but the physical disk having problems reading/writing data (retryable events that take roughly 60 seconds to time out), should I change the iscsi_rx_max_window tunable with mdb? Is there a tunable for iscsi_tx? Ross, I know you tried this recently in the thread referenced above (with value 15), which resulted in a 60 second hang. How did you offline the iscsi vol to test this failure? Unless iscsi uses a multiple of the value for retries, then maybe the way you failed the disk caused the iscsi system to follow a different failure path? Unfortunately I don't know of a way to introduce read/write retries to a disk while the disk is still reachable and presented via iscsitgt, so I'm not sure how to test this.
2.a. With the fix of http://bugs.opensolaris.org/view_bug.do?bug_id=6518995 , we can set sd_retry_count along with sd_io_time to cause I/O failure when a command takes longer than sd_retry_count * sd_io_time. Can (or should) these tunables be set on the imported iscsi disks in the ZFS Node, or can/should they be applied only to the local disks on the Storage Nodes? If there is a way to apply them to ONLY the imported iscsi disks (and not the local disks) of the ZFS Node, and without rebooting every time a new iscsi disk is imported, then I'm thinking this is the way to go.

In a year of having this setup in customer beta I have never had Storage Nodes (or both sides of a mirror) down at the same time. I'd like ZFS to take advantage of this. If (and only if) both sides fail, then ZFS can enter failmode=wait.

Currently using Nevada b96. Planning to move to >100 shortly to avoid zpool commands hanging while the zpool is waiting to reach a device.

David Anderson
Aktiom Networks, LLC

Ross wrote:
> I discussed this exact issue on the forums in February, and filed a bug at
> the time. I've also e-mailed and chatted with the iSCSI developers, and the
> iSER developers a few times. There has also been another thread about the
> iSCSI timeouts being made configurable a few months back, and finally, I
> started another discussion on ZFS availability, and filed an RFE for pretty
> much exactly what you're asking for.
>
> So the question is being asked, but as for how long it will be before Sun
> improve ZFS availability, I really wouldn't like to say. One potential
> problem is that Sun almost certainly have a pretty good HA system with
> Fishworks running on their own hardware, and I don't know how much they are
> going to want to create an open source alternative to that.
>
> My original discussion in Feb:
> http://opensolaris.org/jive/thread.jspa?messageID=213482
>
> The iSCSI timeout bugs. The first one was raised in November 2006!!
> http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6497777
> http://bugs.opensolaris.org/bugdatabase/view_bug.do;jsessionid=a1c19a874eb8bffffffffac94084acffabc5?bug_id=6670866
>
> The ZFS availability thread:
> http://www.opensolaris.org/jive/thread.jspa?messageID=274031
>
> I can't find the RFE I filed on the back of that just yet, I'll have a look
> through my e-mails on Monday to find it for you.
>
> The one bright point is that it does look like it would be possible to edit
> iscsi.h manually and recompile the driver, but that's a bit outside of my
> experience right now so I'm leaving that until I have no other choice.
>
> Ross
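For concreteness, the /etc/system changes discussed in 1.a and 2.a above would look roughly like the sketch below. The values are only illustrative, a reboot is needed for them to take effect, the sd: module prefix is an assumption based on the bug report, and note that the sd tunables are global - they apply to every sd(7D) device on the host, not just the imported iscsi disks:

* /etc/system sketch - illustrative values only, reboot required
* 1.a: cap the iscsi initiator's session re-login delay at 5 seconds
set iscsi:iscsi_sess_max_delay = 5
* 2.a: give up on a stuck sd command sooner than the default 60-second sd_io_time
* (these apply to every sd device on the box, not only the iscsi LUNs)
set sd:sd_io_time = 15
set sd:sd_retry_count = 2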
Hi Dan, replying in line:

On Fri, Dec 5, 2008 at 9:19 PM, David Anderson <projects-zfs at aktiom.net> wrote:
> Trying to keep this in the spotlight. Apologies for the lengthy post.

Heh, don't apologise, you should see some of my posts... o_0

> I'd really like to see features as described by Ross in his summary of the
> "Availability: ZFS needs to handle disk removal / driver failure better"
> (http://www.opensolaris.org/jive/thread.jspa?messageID=274031).
> I'd like to have these/similar features as well. Has there already been
> internal discussions regarding adding this type of functionality to ZFS
> itself, and was there approval, disapproval or no decision?
>
> Unfortunately my situation has put me in urgent need to find workarounds in
> the meantime.
>
> My setup: I have two iSCSI target nodes, each with six drives exported via
> iscsi (Storage Nodes). I have a ZFS Node that logs into each target from
> both Storage Nodes and creates a mirrored Zpool with one drive from each
> Storage Node comprising each half of the mirrored vdevs (6 x 2-way mirrors).
>
> My problem: If a Storage Node crashes completely, is disconnected from the
> network, iscsitgt core dumps, a drive is pulled, or a drive has a problem
> accessing data (read retries), then my ZFS Node hangs while ZFS waits
> patiently for the layers below to report a problem and timeout the devices.
> This can lead to a roughly 3 minute or longer halt when reading OR writing
> to the Zpool on the ZFS node. While this is acceptable in certain
> situations, I have a case where my availability demand is more severe.
>
> My goal: figure out how to have the zpool pause for NO LONGER than 30
> seconds (roughly within a typical HTTP request timeout) and then issue
> reads/writes to the good devices in the zpool/mirrors while the other side
> comes back online or is fixed.
>
> My ideas:
> 1. In the case of the iscsi targets disappearing (iscsitgt core dump,
> Storage Node crash, Storage Node disconnected from network), I need to lower
> the iSCSI login retry/timeout values. Am I correct in assuming the only way
> to accomplish this is to recompile the iscsi initiator? If so, can someone
> help point me in the right direction (I have never compiled ONNV sources -
> do I need to do this or can I just recompile the iscsi initiator)?

I believe it's possible to just recompile the initiator and install the new driver. I have some *very* rough notes that were sent to me about a year ago, but I've no experience compiling anything in Solaris, so don't know how useful they will be. I'll try to dig them out in case they're useful.

> 1.a. I'm not sure in what Initiator session states iscsi_sess_max_delay is
> applicable - only for the initial login, or also in the case of reconnect?
> Ross, if you still have your test boxes available, can you please try
> setting "set iscsi:iscsi_sess_max_delay = 5" in /etc/system, reboot and try
> failing your iscsi vdevs again? I can't find a case where this was tested
> for quick failover.

Will gladly have a go at this on Monday.

> 1.b. I would much prefer to have bug 6497777 addressed and fixed rather
> than having to resort to recompiling the iscsi initiator (if
> iscsi_sess_max_delay doesn't work). This seems like a trivial feature to
> implement. How can I sponsor development?
>
> 2. In the case of the iscsi target being reachable, but the physical disk
> is having problems reading/writing data (retryable events that take roughly
> 60 seconds to timeout), should I change the iscsi_rx_max_window tunable with
> mdb? Is there a tunable for iscsi_tx? Ross, I know you tried this recently
> in the thread referenced above (with value 15), which resulted in a 60
> second hang. How did you offline the iscsi vol to test this failure? Unless
> iscsi uses a multiple of the value for retries, then maybe the way you
> failed the disk caused the iscsi system to follow a different failure path?
> Unfortunately I don't know of a way to introduce read/write retries to a
> disk while the disk is still reachable and presented via iscsitgt, so I'm
> not sure how to test this.

So far I've just been shutting down the Solaris box hosting the iSCSI target. Next step will involve pulling some virtual cables. Unfortunately I don't think I've got a physical box handy to test drive failures right now, but my previous testing (of simply pulling drives) showed that it can be hit and miss as to how well ZFS detects these types of 'failure'. Like you I don't know yet how to simulate failures, so I'm doing simple tests right now, offlining entire drives or computers. Unfortunately I've found more than enough problems with just those tests to keep me busy.

> 2.a With the fix of
> http://bugs.opensolaris.org/view_bug.do?bug_id=6518995 , we can set
> sd_retry_count along with sd_io_time to cause I/O failure when a command
> takes longer than sd_retry_count * sd_io_time. Can (or should) these
> tunables be set on the imported iscsi disks in the ZFS Node, or can/should
> they be applied only to the local disk on the Storage Nodes? If there is a
> way to apply them to ONLY the imported iscsi disks (and not the local disks)
> of the ZFS Node, and without rebooting every time a new iscsi disk is
> imported, then I'm thinking this is the way to go.

Very interesting. I'll certainly be looking to see if this helps too.

> In a year of having this setup in customer beta I have never had Storage
> nodes (or both sides of a mirror) down at the same time. I'd like ZFS to
> take advantage of this. If (and only if) both sides fail then ZFS can enter
> failmode=wait.
>
> Currently using Nevada b96. Planning to move to >100 shortly to avoid zpool
> commands hanging while the zpool is waiting to reach a device.

They still hang if my experience last week is anything to go by. But at least they recover now and don't lock up the pool for good. Filing the bug for that is on my to do list :)

> David Anderson
> Aktiom Networks, LLC
>
> Ross wrote:
>> I discussed this exact issue on the forums in February, and filed a bug at
>> the time. I've also e-mailed and chatted with the iSCSI developers, and the
>> iSER developers a few times. There has also been another thread about the
>> iSCSI timeouts being made configurable a few months back, and finally, I
>> started another discussion on ZFS availability, and filed an RFE for pretty
>> much exactly what you're asking for.
>>
>> So the question is being asked, but as for how long it will be before Sun
>> improve ZFS availability, I really wouldn't like to say. One potential
>> problem is that Sun almost certainly have a pretty good HA system with
>> Fishworks running on their own hardware, and I don't know how much they are
>> going to want to create an open source alternative to that.
>>
>> My original discussion in Feb:
>> http://opensolaris.org/jive/thread.jspa?messageID=213482
>>
>> The iSCSI timeout bugs. The first one was raised in November 2006!!
>> http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6497777
>> http://bugs.opensolaris.org/bugdatabase/view_bug.do;jsessionid=a1c19a874eb8bffffffffac94084acffabc5?bug_id=6670866
>>
>> The ZFS availability thread:
>> http://www.opensolaris.org/jive/thread.jspa?messageID=274031
>>
>> I can't find the RFE I filed on the back of that just yet, I'll have a
>> look through my e-mails on Monday to find it for you.
>>
>> The one bright point is that it does look like it would be possible to
>> edit iscsi.h manually and recompile the driver, but that's a bit outside of
>> my experience right now so I'm leaving that until I have no other choice.
>>
>> Ross
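Two easy ways to induce the two failure modes on a Storage Node for this kind of testing; the iscsitgt service FMRI and the use of ipfilter are assumptions about a stock setup, so adjust for your own boxes:

# Case A: the target goes away cleanly (logins start failing straight away).
svcadm disable -t svc:/system/iscsitgt:default
# ...run your timed read/write on the ZFS Node, then...
svcadm enable svc:/system/iscsitgt:default

# Case B: leave iscsitgt running but silently drop its traffic, which looks
# more like a hung disk than a dead target (assumes ipfilter is enabled).
echo "block in quick proto tcp from any to any port = 3260" | ipf -f -
# ...run the timed test again...
ipf -Fa    # flush the rule afterwards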
Miles Nordin
2008-Dec-05 23:00 UTC
[zfs-discuss] zfs not yet suitable for HA applications?
>>>>> "da" == David Anderson <projects-zfs at aktiom.net> writes:

    da> (I have never compiled ONNV sources - do I need to do this or
    da> can I just recompile the iscsi initiator)?

The source offering is disorganized and spread over many ``consolidations'' which are pushed through ``gates'', similar to Linux with its source tarballs and kernel patch-kits, but only tens of consolidations instead of thousands of packages. The downside: the overall source-to-binary system you get with Gentoo portage or Debian dpkg or RedHat SRPMs to gather the consolidations and turn them into an .iso is Sun-proprietary. There was talk of an IPS ``distribution builder'' but it seems to be a binary-only FLAR replacement for replicating installed systems, not a proper open build system that consumes sources and produces IPS ``images''.

I don't know which consolidation holds the iSCSI initiator sources, or how to find it. Also, for sharing your experiences you need to get the exact same version of the sources as on other people's binary installs, so you can compare with others while changing only what you're trying to change: ``snv_101 + my timeout change''. I'm not sure how to do that---I see some bugs, for example 6684570, refer to versions like ``onnv-gate:2008-04-04'', but how does that map to the snv_<nn> version markers, or is it a different branch entirely? On Linux or BSD I'd use the package system to both find the source and get the exact version of it I'm running.

There are only a few consolidations to dig through, maybe either the ON consolidation or the Storage consolidation? Once you find it, all you have to do is solve the version question.
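The initiator does turn out to live in the ON consolidation (onnv-gate; see the next message). If the gate tags its builds the way I expect, something like the following sketch should pin the sources to a particular build; the onnv_<nn> tag naming is my assumption, so check `hg tags` first:

# clone the ON gate and check out the sources matching an snv_101 install
hg clone -U ssh://anon@hg.opensolaris.org/hg/onnv/onnv-gate
cd onnv-gate
hg tags | grep onnv_101        # assumes builds are tagged onnv_<nn>
hg update -r onnv_101
ls usr/src/uts/common/io/scsi/adapters/iscsi    # the initiator sources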
I compiled just the iscsi initiator this evening and remembered this thread, so here are the simple instructions. This is not the "right" way to do it, just the easiest. I should (and probably will) use a Makefile for this, but for now this works, and I've been able to tweak some of the timeouts in my system. Tests are still running, so I have no results yet on whether it's a good idea or not.

# Check out all of ONNV (mercurial does not have partial checkouts AFAIK)
hg clone -U ssh://anon@hg.opensolaris.org/hg/onnv/onnv-gate

# go into the iscsi initiator directory
cd onnv-gate/usr/src/uts/common/io/scsi/adapters/iscsi

# edit your files
vim iscsi.h

# compile!
PATH=/opt/SunStudioExpress/bin:/usr/ccs/bin:$PATH
export PATH
for cfile in *.c ../../../../../../common/iscsi/*.c ../../../../inet/kifconf/kifconf.c
do
    cc -I. -I../../../.. -D_KERNEL -D_SYSCALL32 -m64 -xO3 -xmodel=kernel -c $cfile
done
ld -m64 -dy -r -N misc/scsi -N fs/sockfs -N sys/doorfs -N misc/md5 -N inet/kifconf -o iscsi *.o

# back up the default driver
cp /kernel/drv/amd64/iscsi /root/iscsi.bak

# install
cp iscsi /kernel/drv/amd64/iscsi

# if you want to avoid reboot, shut down iscsi_initiator
update_drv iscsi

# reboot if necessary - I prefer to "zpool export" and "rm -f /etc/iscsi/*" before doing so in my lab

There was a bit more that went into this when I did it, but that should get you far enough to play with timeouts. Be sure to check out elfdump(1). It can tell you a lot about your new driver -- it's how I verified I got all the right bits in my build before I tried it in the kernel for the first time.

See also: http://docs.sun.com/app/docs/doc/819-3196/eqbvo?a=view

Enjoy!
-- 
This message posted from opensolaris.org
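A quick sanity check after the copy, as a sketch of the elfdump verification mentioned above (paths are the stock amd64 driver location):

# confirm the dependencies the new module declares match the -N list used at link time
elfdump -d /kernel/drv/amd64/iscsi | grep NEEDED

# confirm what the running kernel has actually loaded (name and version string)
modinfo | grep -i iscsi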
Ok. I figured out how to basically backport snv104's iscsi initiator (version 1.55 versus snv101b's 1.54) into snv101b. Set everything up as in my previous post, then create a new C file:

cat > boot_sym.c <<EOF
#include <sys/bootprops.h>
/*
 * Global iscsi boot prop
 */
ib_boot_prop_t *iscsiboot_prop = NULL;
EOF

Then update the file list in the for loop to also include "../../../../os/iscsiboot_prop.c". Here it is (yes mom, I'm doing a makefile tomorrow):

#!/bin/bash
PATH=/opt/SunStudioExpress/bin:/usr/ccs/bin:$PATH
export PATH
rm -f iscsi
rm -f *.o
for cfile in *.c ../../../../../../common/iscsi/*.c ../../../../inet/kifconf/kifconf.c ../../../../os/iscsiboot_prop.c
do
    cc -I. -I../../../.. -D_KERNEL -m64 -xO3 -xmodel=kernel -D_SYSCALL32 -c $cfile
done
ld -m64 -dy -r -N misc/scsi -N fs/sockfs -N sys/doorfs -N misc/md5 -o iscsi *.o
elfdump -d iscsi
elfdump -d /kernel/drv/amd64/iscsi
cp /kernel/drv/amd64/iscsi /root/iscsi.bak
cp iscsi /kernel/drv/amd64/iscsi
update_drv iscsi
-- 
This message posted from opensolaris.org
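If the rebuilt or backported driver misbehaves, rolling back is just the reverse of the install; this assumes the /root/iscsi.bak copy made in the script above is still in place:

# restore the stock driver saved earlier and reload it
cp /root/iscsi.bak /kernel/drv/amd64/iscsi
update_drv iscsi
# reboot (or export the pool and restart the initiator) if the reload is not picked up cleanly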