hi All,

I realize the subject is a bit incendiary, but we're running into what I view as a design omission in ZFS that is preventing us from building highly available storage infrastructure, and I want to bring some attention (again) to this major issue.

Currently we have a set of iSCSI targets published from a storage host and consumed by a ZFS host. If a _single_ disk on the storage host goes bad, ZFS pauses for a full 180 seconds before read/write operations resume. That is an aeon, well beyond any TCP timeout.

I've read the claims that ZFS is unconcerned with the underlying infrastructure, and I agree with the basic sense of those claims (see [1]). However:

* ZFS can see when a device's behaviour is inconsistent with its known recent performance norms, and
* ZFS knows when the data it is trying to fetch from that device is also resident on another device.

Why, then, would it not decide dynamically, based on a reasonably small sample of recent device performance, to drop its current attempt and instead fetch the data from the other device? I don't even think a configurable timeout is that useful; the decision should be based on a sample of performance from (say) a day. Or, hey, for the moment, just to make it easy: a configurable timeout!

As it is, I can't put this in production. 180 seconds is not "highly available"; it's users seeing "The connection has timed out".

Everything else about ZFS that I have seen and used - and I mean every other tiny detail - is crystalline perfection. So, for us, ZFS is a diamond with a little bit of volcanic crust remaining to be polished off. Is there any intention of dealing with this problem in the (hopefully very) near future?

If you're in the Bay Area, I will personally deliver two cases of the cold beer of your choice (including Trappist) if you solve this problem. If offering a bounty would have any effect, I'd offer one. We need this to work.

thanks,
_alex

Related:
[1] http://mail.opensolaris.org/pipermail/zfs-discuss/2008-August/thread.html#50609

--
alex black, founder
the turing studio, inc.
888.603.6023 / main
510.666.0074 / office
root at turingstudio.com
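For anyone wanting to see the pause for themselves, a minimal sketch of how to watch it from the consuming host while a disk on the storage host is failed; "tank" is only a placeholder pool name, and ptime/iostat/zpool are just the stock Solaris tools:

# Per-device service times show which vdev is stuck (watch asvc_t and %b):
iostat -xn 1 5

# Time how long a simple pool operation blocks during the fault:
ptime dd if=/dev/zero of=/tank/stall-test bs=128k count=16
ptime sync
zpool status tank    # has ZFS marked the device yet, or is it still ONLINE?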
I discussed this exact issue on the forums in February, and filed a bug at the time. I've also e-mailed and chatted with the iSCSI developers, and the iSER developers, a few times. There has also been another thread about making the iSCSI timeouts configurable a few months back, and finally, I started another discussion on ZFS availability and filed an RFE for pretty much exactly what you're asking for.

So the question is being asked, but as for how long it will be before Sun improve ZFS availability, I really wouldn't like to say. One potential problem is that Sun almost certainly have a pretty good HA system with Fishworks running on their own hardware, and I don't know how much they are going to want to create an open source alternative to that.

My original discussion in Feb:
http://opensolaris.org/jive/thread.jspa?messageID=213482

The iSCSI timeout bugs. The first one was raised in November 2006!!
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6497777
http://bugs.opensolaris.org/bugdatabase/view_bug.do;jsessionid=a1c19a874eb8bffffffffac94084acffabc5?bug_id=6670866

The ZFS availability thread:
http://www.opensolaris.org/jive/thread.jspa?messageID=274031

I can't find the RFE I filed on the back of that just yet; I'll have a look through my e-mails on Monday to find it for you.

The one bright point is that it does look like it would be possible to edit iscsi.h manually and recompile the driver, but that's a bit outside of my experience right now, so I'm leaving that until I have no other choice.

Ross
-- 
This message posted from opensolaris.org
David Anderson
2008-Dec-05 21:19 UTC
[zfs-discuss] zfs not yet suitable for HA applications?
Trying to keep this in the spotlight. Apologies for the lengthy post.

I'd really like to see features as described by Ross in his summary of the "Availability: ZFS needs to handle disk removal / driver failure better" thread (http://www.opensolaris.org/jive/thread.jspa?messageID=274031). I'd like to have these or similar features as well. Have there already been internal discussions about adding this type of functionality to ZFS itself, and was there approval, disapproval or no decision?

Unfortunately my situation has put me in urgent need of workarounds in the meantime.

My setup: I have two iSCSI target nodes, each with six drives exported via iscsi (Storage Nodes). I have a ZFS Node that logs into each target from both Storage Nodes and creates a mirrored zpool with one drive from each Storage Node comprising each half of the mirrored vdevs (6 x 2-way mirrors).

My problem: if a Storage Node crashes completely, is disconnected from the network, iscsitgt core dumps, a drive is pulled, or a drive has a problem accessing data (read retries), then my ZFS Node hangs while ZFS waits patiently for the layers below to report a problem and time out the devices. This can lead to a roughly 3 minute or longer halt when reading OR writing to the zpool on the ZFS Node. While this is acceptable in certain situations, I have a case where my availability demand is more severe.

My goal: figure out how to have the zpool pause for NO LONGER than 30 seconds (roughly within a typical HTTP request timeout) and then issue reads/writes to the good devices in the zpool/mirrors while the other side comes back online or is fixed.

My ideas (a rough /etc/system sketch of 1.a and 2.a is included after this message):

1. In the case of the iscsi targets disappearing (iscsitgt core dump, Storage Node crash, Storage Node disconnected from network), I need to lower the iSCSI login retry/timeout values. Am I correct in assuming the only way to accomplish this is to recompile the iscsi initiator? If so, can someone help point me in the right direction (I have never compiled ONNV sources - do I need to do this, or can I just recompile the iscsi initiator)?

1.a. I'm not sure in which initiator session states iscsi_sess_max_delay is applicable - only for the initial login, or also in the case of a reconnect? Ross, if you still have your test boxes available, can you please try setting "set iscsi:iscsi_sess_max_delay = 5" in /etc/system, reboot, and try failing your iscsi vdevs again? I can't find a case where this was tested for quick failover.

1.b. I would much prefer to have bug 6497777 addressed and fixed rather than having to resort to recompiling the iscsi initiator (if iscsi_sess_max_delay doesn't work). This seems like a trivial feature to implement. How can I sponsor development?

2. In the case of the iscsi target being reachable, but the physical disk having problems reading/writing data (retryable events that take roughly 60 seconds to time out), should I change the iscsi_rx_max_window tunable with mdb? Is there a tunable for iscsi_tx? Ross, I know you tried this recently in the thread referenced above (with value 15), which resulted in a 60 second hang. How did you offline the iscsi vol to test this failure? Unless iscsi uses a multiple of the value for retries, then maybe the way you failed the disk caused the iscsi system to follow a different failure path? Unfortunately I don't know of a way to introduce read/write retries to a disk while the disk is still reachable and presented via iscsitgt, so I'm not sure how to test this.
2.a. With the fix of http://bugs.opensolaris.org/view_bug.do?bug_id=6518995 , we can set sd_retry_count along with sd_io_time to cause I/O failure when a command takes longer than sd_retry_count * sd_io_time. Can (or should) these tunables be set on the imported iscsi disks in the ZFS Node, or can/should they be applied only to the local disks on the Storage Nodes? If there is a way to apply them to ONLY the imported iscsi disks (and not the local disks) of the ZFS Node, and without rebooting every time a new iscsi disk is imported, then I'm thinking this is the way to go.

In a year of having this setup in customer beta I have never had Storage Nodes (or both sides of a mirror) down at the same time. I'd like ZFS to take advantage of this. If (and only if) both sides fail, then ZFS can enter failmode=wait.

Currently using Nevada b96. Planning to move to >100 shortly to avoid zpool commands hanging while the zpool is waiting to reach a device.

David Anderson
Aktiom Networks, LLC

Ross wrote:
> I discussed this exact issue on the forums in February, and filed a bug at
> the time. I've also e-mailed and chatted with the iSCSI developers, and the
> iSER developers a few times. There has also been another thread about the
> iSCSI timeouts being made configurable a few months back, and finally, I
> started another discussion on ZFS availability, and filed an RFE for pretty
> much exactly what you're asking for.
>
> So the question is being asked, but as for how long it will be before Sun
> improve ZFS availability, I really wouldn't like to say. One potential
> problem is that Sun almost certainly have a pretty good HA system with
> Fishworks running on their own hardware, and I don't know how much they are
> going to want to create an open source alternative to that.
>
> My original discussion in Feb:
> http://opensolaris.org/jive/thread.jspa?messageID=213482
>
> The iSCSI timeout bugs. The first one was raised in November 2006!!
> http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6497777
> http://bugs.opensolaris.org/bugdatabase/view_bug.do;jsessionid=a1c19a874eb8bffffffffac94084acffabc5?bug_id=6670866
>
> The ZFS availability thread:
> http://www.opensolaris.org/jive/thread.jspa?messageID=274031
>
> I can't find the RFE I filed on the back of that just yet, I'll have a look
> through my e-mails on Monday to find it for you.
>
> The one bright point is that it does look like it would be possible to edit
> iscsi.h manually and recompile the driver, but that's a bit outside of my
> experience right now so I'm leaving that until I have no other choice.
>
> Ross
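For concreteness, the /etc/system changes discussed in 1.a and 2.a above would look roughly like the sketch below. The values are only illustrative, a reboot is needed for them to take effect, the sd: module prefix is an assumption based on the bug report, and note that the sd tunables are global - they apply to every sd(7D) device on the host, not just the imported iscsi disks:

* /etc/system sketch - illustrative values only, reboot required
* 1.a: cap the iscsi initiator's session re-login delay at 5 seconds
set iscsi:iscsi_sess_max_delay = 5
* 2.a: give up on a stuck sd command sooner than the default 60-second sd_io_time
* (these apply to every sd device on the box, not only the iscsi LUNs)
set sd:sd_io_time = 15
set sd:sd_retry_count = 2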
Hi Dan, replying in line:

On Fri, Dec 5, 2008 at 9:19 PM, David Anderson <projects-zfs at aktiom.net> wrote:
> Trying to keep this in the spotlight. Apologies for the lengthy post.

Heh, don't apologise, you should see some of my posts... o_0

> I'd really like to see features as described by Ross in his summary of the
> "Availability: ZFS needs to handle disk removal / driver failure better"
> (http://www.opensolaris.org/jive/thread.jspa?messageID=274031).
> I'd like to have these/similar features as well. Has there already been
> internal discussions regarding adding this type of functionality to ZFS
> itself, and was there approval, disapproval or no decision?
>
> Unfortunately my situation has put me in urgent need to find workarounds in
> the meantime.
>
> My setup: I have two iSCSI target nodes, each with six drives exported via
> iscsi (Storage Nodes). I have a ZFS Node that logs into each target from
> both Storage Nodes and creates a mirrored Zpool with one drive from each
> Storage Node comprising each half of the mirrored vdevs (6 x 2-way mirrors).
>
> My problem: If a Storage Node crashes completely, is disconnected from the
> network, iscsitgt core dumps, a drive is pulled, or a drive has a problem
> accessing data (read retries), then my ZFS Node hangs while ZFS waits
> patiently for the layers below to report a problem and timeout the devices.
> This can lead to a roughly 3 minute or longer halt when reading OR writing
> to the Zpool on the ZFS node. While this is acceptable in certain
> situations, I have a case where my availability demand is more severe.
>
> My goal: figure out how to have the zpool pause for NO LONGER than 30
> seconds (roughly within a typical HTTP request timeout) and then issue
> reads/writes to the good devices in the zpool/mirrors while the other side
> comes back online or is fixed.
>
> My ideas:
> 1. In the case of the iscsi targets disappearing (iscsitgt core dump,
> Storage Node crash, Storage Node disconnected from network), I need to lower
> the iSCSI login retry/timeout values. Am I correct in assuming the only way
> to accomplish this is to recompile the iscsi initiator? If so, can someone
> help point me in the right direction (I have never compiled ONNV sources -
> do I need to do this or can I just recompile the iscsi initiator)?

I believe it's possible to just recompile the initiator and install the new driver. I have some *very* rough notes that were sent to me about a year ago, but I've no experience compiling anything in Solaris, so don't know how useful they will be. I'll try to dig them out in case they're useful.

> 1.a. I'm not sure in what Initiator session states iscsi_sess_max_delay is
> applicable - only for the initial login, or also in the case of reconnect?
> Ross, if you still have your test boxes available, can you please try
> setting "set iscsi:iscsi_sess_max_delay = 5" in /etc/system, reboot and try
> failing your iscsi vdevs again? I can't find a case where this was tested
> for quick failover.

Will gladly have a go at this on Monday.

> 1.b. I would much prefer to have bug 6497777 addressed and fixed rather
> than having to resort to recompiling the iscsi initiator (if
> iscsi_sess_max_delay doesn't work). This seems like a trivial feature to
> implement. How can I sponsor development?
>
> 2. In the case of the iscsi target being reachable, but the physical disk
> is having problems reading/writing data (retryable events that take roughly
> 60 seconds to timeout), should I change the iscsi_rx_max_window tunable with
> mdb? Is there a tunable for iscsi_tx? Ross, I know you tried this recently
> in the thread referenced above (with value 15), which resulted in a 60
> second hang. How did you offline the iscsi vol to test this failure? Unless
> iscsi uses a multiple of the value for retries, then maybe the way you
> failed the disk caused the iscsi system to follow a different failure path?
> Unfortunately I don't know of a way to introduce read/write retries to a
> disk while the disk is still reachable and presented via iscsitgt, so I'm
> not sure how to test this.

So far I've just been shutting down the Solaris box hosting the iSCSI target. Next step will involve pulling some virtual cables. Unfortunately I don't think I've got a physical box handy to test drive failures right now, but my previous testing (of simply pulling drives) showed that it can be hit and miss as to how well ZFS detects these types of 'failure'. Like you I don't know yet how to simulate failures, so I'm doing simple tests right now, offlining entire drives or computers. Unfortunately I've found more than enough problems with just those tests to keep me busy.

> 2.a With the fix of
> http://bugs.opensolaris.org/view_bug.do?bug_id=6518995 , we can set
> sd_retry_count along with sd_io_time to cause I/O failure when a command
> takes longer than sd_retry_count * sd_io_time. Can (or should) these
> tunables be set on the imported iscsi disks in the ZFS Node, or can/should
> they be applied only to the local disk on the Storage Nodes? If there is a
> way to apply them to ONLY the imported iscsi disks (and not the local disks)
> of the ZFS Node, and without rebooting every time a new iscsi disk is
> imported, then I'm thinking this is the way to go.

Very interesting. I'll certainly be looking to see if this helps too.

> In a year of having this setup in customer beta I have never had Storage
> nodes (or both sides of a mirror) down at the same time. I'd like ZFS to
> take advantage of this. If (and only if) both sides fail then ZFS can enter
> failmode=wait.
>
> Currently using Nevada b96. Planning to move to >100 shortly to avoid zpool
> commands hanging while the zpool is waiting to reach a device.

They still hang if my experience last week is anything to go by. But at least they recover now and don't lock up the pool for good. Filing the bug for that is on my to do list :)

> David Anderson
> Aktiom Networks, LLC
>
> Ross wrote:
>> I discussed this exact issue on the forums in February, and filed a bug at
>> the time. I've also e-mailed and chatted with the iSCSI developers, and the
>> iSER developers a few times. There has also been another thread about the
>> iSCSI timeouts being made configurable a few months back, and finally, I
>> started another discussion on ZFS availability, and filed an RFE for pretty
>> much exactly what you're asking for.
>>
>> So the question is being asked, but as for how long it will be before Sun
>> improve ZFS availability, I really wouldn't like to say. One potential
>> problem is that Sun almost certainly have a pretty good HA system with
>> Fishworks running on their own hardware, and I don't know how much they are
>> going to want to create an open source alternative to that.
>>
>> My original discussion in Feb:
>> http://opensolaris.org/jive/thread.jspa?messageID=213482
>>
>> The iSCSI timeout bugs. The first one was raised in November 2006!!
>> http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6497777
>> http://bugs.opensolaris.org/bugdatabase/view_bug.do;jsessionid=a1c19a874eb8bffffffffac94084acffabc5?bug_id=6670866
>>
>> The ZFS availability thread:
>> http://www.opensolaris.org/jive/thread.jspa?messageID=274031
>>
>> I can't find the RFE I filed on the back of that just yet, I'll have a
>> look through my e-mails on Monday to find it for you.
>>
>> The one bright point is that it does look like it would be possible to
>> edit iscsi.h manually and recompile the driver, but that's a bit outside of
>> my experience right now so I'm leaving that until I have no other choice.
>>
>> Ross
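Two easy ways to induce the two failure modes on a Storage Node for this kind of testing; the iscsitgt service FMRI and the use of ipfilter are assumptions about a stock setup, so adjust for your own boxes:

# Case A: the target goes away cleanly (logins start failing straight away).
svcadm disable -t svc:/system/iscsitgt:default
# ...run your timed read/write on the ZFS Node, then...
svcadm enable svc:/system/iscsitgt:default

# Case B: leave iscsitgt running but silently drop its traffic, which looks
# more like a hung disk than a dead target (assumes ipfilter is enabled).
echo "block in quick proto tcp from any to any port = 3260" | ipf -f -
# ...run the timed test again...
ipf -Fa    # flush the rule afterwards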
Miles Nordin
2008-Dec-05 23:00 UTC
[zfs-discuss] zfs not yet suitable for HA applications?
>>>>> "da" == David Anderson <projects-zfs at aktiom.net> writes:

    da> (I have never compiled ONNV sources - do I need to do this or
    da> can I just recompile the iscsi initiator)?

The source offering is disorganized and spread over many ``consolidations'' which are pushed through ``gates'', similar to Linux with its source tarballs and kernel patch-kits, but only tens of consolidations instead of thousands of packages. The downside: the overall source-to-binary system you get with Gentoo portage or Debian dpkg or RedHat SRPMs to gather the consolidations and turn them into an .iso is Sun-proprietary. There was talk of an IPS ``distribution builder'' but it seems to be a binary-only FLAR replacement for replicating installed systems, not a proper open build system that consumes sources and produces IPS ``images''.

I don't know which consolidation holds the iSCSI initiator sources, or how to find it. Also, for sharing your experiences you need to get the exact same version of the sources as on other people's binary installs, so you can compare with others while changing only what you're trying to change: ``snv_101 + my timeout change''. I'm not sure how to do that---I see some bugs, for example 6684570, refer to versions like ``onnv-gate:2008-04-04'', but how does that map to the snv_<nn> version markers, or is it a different branch entirely? On Linux or BSD I'd use the package system to both find the source and get the exact version of it I'm running.

There are only a few consolidations to dig through, maybe either the ON consolidation or the Storage consolidation? Once you find it, all you have to do is solve the version question.
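The initiator does turn out to live in the ON consolidation (onnv-gate; see the next message). If the gate tags its builds the way I expect, something like the following sketch should pin the sources to a particular build; the onnv_<nn> tag naming is my assumption, so check `hg tags` first:

# clone the ON gate and check out the sources matching an snv_101 install
hg clone -U ssh://anon@hg.opensolaris.org/hg/onnv/onnv-gate
cd onnv-gate
hg tags | grep onnv_101        # assumes builds are tagged onnv_<nn>
hg update -r onnv_101
ls usr/src/uts/common/io/scsi/adapters/iscsi    # the initiator sources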
I compiled just the iscsi initiator this evening and remembered this thread, so here are the simple instructions. This is not the "right" way to do it, just the easiest. I should (and probably will) use a Makefile for this, but for now this works, and I've been able to tweak some of the timeouts in my system. Tests are still running, so I have no results yet on whether it's a good idea or not.

# Check out all of ONNV (mercurial does not have partial checkouts AFAIK)
hg clone -U ssh://anon@hg.opensolaris.org/hg/onnv/onnv-gate

# go into the iscsi initiator directory
cd onnv-gate/usr/src/uts/common/io/scsi/adapters/iscsi

# edit your files
vim iscsi.h

# compile!
PATH=/opt/SunStudioExpress/bin:/usr/ccs/bin:$PATH
export PATH
for cfile in *.c ../../../../../../common/iscsi/*.c ../../../../inet/kifconf/kifconf.c
do
    cc -I. -I../../../.. -D_KERNEL -D_SYSCALL32 -m64 -xO3 -xmodel=kernel -c $cfile
done
ld -m64 -dy -r -N misc/scsi -N fs/sockfs -N sys/doorfs -N misc/md5 -N inet/kifconf -o iscsi *.o

# back up the default driver
cp /kernel/drv/amd64/iscsi /root/iscsi.bak

# install
cp iscsi /kernel/drv/amd64/iscsi

# if you want to avoid reboot, shut down iscsi_initiator
update_drv iscsi

# reboot if necessary - I prefer to "zpool export" and "rm -f /etc/iscsi/*" before doing so in my lab

There was a bit more that went into this when I did it, but that should get you far enough to play with timeouts. Be sure to check out elfdump(1). It can tell you a lot about your new driver -- it's how I verified I got all the right bits in my build before I tried it in the kernel for the first time.

See also: http://docs.sun.com/app/docs/doc/819-3196/eqbvo?a=view

Enjoy!
-- 
This message posted from opensolaris.org
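A quick sanity check after the copy, as a sketch of the elfdump verification mentioned above (paths are the stock amd64 driver location):

# confirm the dependencies the new module declares match the -N list used at link time
elfdump -d /kernel/drv/amd64/iscsi | grep NEEDED

# confirm what the running kernel has actually loaded (name and version string)
modinfo | grep -i iscsi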
Ok. I figured out how to basically backport snv104's iscsi initiator (version 1.55 versus snv101b's 1.54) into snv101b. Set everything up as in my previous post, then create a new C file:

cat > boot_sym.c <<EOF
#include <sys/bootprops.h>
/*
 * Global iscsi boot prop
 */
ib_boot_prop_t *iscsiboot_prop = NULL;
EOF

Then update the file list in the for loop to also include "../../../../os/iscsiboot_prop.c". Here it is (yes mom, I'm doing a makefile tomorrow):

#!/bin/bash
PATH=/opt/SunStudioExpress/bin:/usr/ccs/bin:$PATH
export PATH
rm -f iscsi
rm -f *.o
for cfile in *.c ../../../../../../common/iscsi/*.c ../../../../inet/kifconf/kifconf.c ../../../../os/iscsiboot_prop.c
do
    cc -I. -I../../../.. -D_KERNEL -m64 -xO3 -xmodel=kernel -D_SYSCALL32 -c $cfile
done
ld -m64 -dy -r -N misc/scsi -N fs/sockfs -N sys/doorfs -N misc/md5 -o iscsi *.o
elfdump -d iscsi
elfdump -d /kernel/drv/amd64/iscsi
cp /kernel/drv/amd64/iscsi /root/iscsi.bak
cp iscsi /kernel/drv/amd64/iscsi
update_drv iscsi
-- 
This message posted from opensolaris.org
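If the rebuilt or backported driver misbehaves, rolling back is just the reverse of the install; this assumes the /root/iscsi.bak copy made in the script above is still in place:

# restore the stock driver saved earlier and reload it
cp /root/iscsi.bak /kernel/drv/amd64/iscsi
update_drv iscsi
# reboot (or export the pool and restart the initiator) if the reload is not picked up cleanly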