David Collier-Brown
2008-Aug-31 14:26 UTC
[zfs-discuss] Sidebar to ZFS Availability discussion
Re Availability: ZFS needs to handle disk removal / driver failure better

>> A better option would be to not use this to perform FMA diagnosis, but
>> instead work into the mirror child selection code. This has already
>> been alluded to before, but it would be cool to keep track of latency
>> over time, and use this to both a) prefer one drive over another when
>> selecting the child and b) proactively timeout/ignore results from one
>> child and select the other if it's taking longer than some historical
>> standard deviation. This keeps away from diagnosing drives as faulty,
>> but does allow ZFS to make better choices and maintain response times.
>> It shouldn't be hard to keep track of the average and/or standard
>> deviation and use it for selection; proactively timing out the slow I/Os
>> is much trickier.

Interestingly, tracking latency has come under discussion in the Linux
world, too, as they start to deal with developing resource management
for disks as well as CPU.

In fact, there are two cases where you can use a feedback loop to
adjust disk behavior, and a third to detect problems. The first loop
is the one you identified, for dealing with near/far and fast/slow
mirrors. The second is for resource management, where one throttles
disk-hog projects when one discovers latency growing without bound on
disk saturation, and the third is in case of a fault other than the
above.

For the latter to work well, I'd like to see the resource management
and fast/slow mirror adaptation be something one turns on explicitly,
because then when FMA discovered that you in fact have a fast/slow
mirror or a Dr. Evil program saturating the array, the "fix" could be
to notify the sysadmin that they had a problem and suggest built-in
tools to ameliorate it.

Ian Collins writes:
> One solution (again, to be used with a remote mirror) is the three way
> mirror. If two devices are local and one remote, data is safe once the
> two local writes return. I guess the issue then changes from "is my
> data safe" to "how safe is my data". I would be reluctant to deploy a
> remote mirror device without local redundancy, so this probably won't be
> an uncommon setup. There would have to be an acceptable window of risk
> when local data isn't replicated.

And in this case too, I'd prefer the sysadmin provide the information
to ZFS about what she wants, and have the system adapt to it, and
report how big the risk window is.

This would effectively change the FMA behavior, you understand, so as
to have it report failures to complete the local writes in time t0 and
remote in time t1, much as the resource management or fast/slow cases
would need to be visible to FMA.

--dave (at home) c-b
-- 
David Collier-Brown         | Always do right. This will gratify
Sun Microsystems, Toronto   | some people and astonish the rest
davecb at sun.com            |                      -- Mark Twain
cell: (647) 833-9377, bridge: (877) 385-4099 code: 506 9191#
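The mirror-child-selection idea quoted above is easy to sketch. The
following is a hypothetical illustration, not the actual ZFS
vdev_mirror code: keep a cheap smoothed latency estimate per mirror
child, updated on every I/O completion, and prefer the child with the
lowest estimate when picking where to send the next read. All names
here are invented for the example.

```c
#include <assert.h>

/* weight = 1/8; a power-of-two gain keeps the update to shifts/adds */
#define EWMA_SHIFT 3

typedef struct mirror_child {
	unsigned long avg_lat_us;	/* smoothed latency, microseconds */
} mirror_child_t;

/* Called on every I/O completion with the observed service time. */
static void
child_record_latency(mirror_child_t *mc, unsigned long lat_us)
{
	long diff = (long)lat_us - (long)mc->avg_lat_us;

	if (mc->avg_lat_us == 0)
		mc->avg_lat_us = lat_us;	/* first sample seeds the average */
	else
		/* avg += (sample - avg) / 8; the divide compiles to a shift */
		mc->avg_lat_us += diff / (1L << EWMA_SHIFT);
}

/* Pick the child with the lowest smoothed latency for the next read. */
static int
child_select(const mirror_child_t *mc, int nchildren)
{
	int i, best = 0;

	for (i = 1; i < nchildren; i++)
		if (mc[i].avg_lat_us < mc[best].avg_lat_us)
			best = i;
	return (best);
}
```

Note this only covers point (a) from the quote -- preferring one drive
over another. Point (b), proactively timing out an in-flight I/O that
exceeds the historical estimate, is the part the quote correctly calls
much trickier, since the I/O is already down in the device's queue.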
David Collier-Brown wrote:
> Re Availability: ZFS needs to handle disk removal / driver failure better
>
>>> A better option would be to not use this to perform FMA diagnosis, but
>>> instead work into the mirror child selection code. This has already
>>> been alluded to before, but it would be cool to keep track of latency
>>> over time, and use this to both a) prefer one drive over another when
>>> selecting the child and b) proactively timeout/ignore results from one
>>> child and select the other if it's taking longer than some historical
>>> standard deviation. This keeps away from diagnosing drives as faulty,
>>> but does allow ZFS to make better choices and maintain response times.
>>> It shouldn't be hard to keep track of the average and/or standard
>>> deviation and use it for selection; proactively timing out the slow I/Os
>>> is much trickier.
>
> Interestingly, tracking latency has come under discussion in the
> Linux world, too, as they start to deal with developing resource
> management for disks as well as CPU.
>
> In fact, there are two cases where you can use a feedback loop to
> adjust disk behavior, and a third to detect problems. The first
> loop is the one you identified, for dealing with near/far and
> fast/slow mirrors.

[what usually concerns me is that the software people spec'ing device
drivers don't seem to have much training in control systems, which is
what is being designed]

The feedback loop is troublesome because there is usually at least one
queue, perhaps 3 queues, between the host and the media. At each
queue, iops can be reordered. As Sommerfeld points out, we see the
same sort of thing in IP networks, but two things bother me about that:

1. Latency for disk seeks, rotates, and cache hits looks very different
   than random IP network latencies. For example, a TNF trace I
   recently examined for an IDE disk (no queues which reorder) running
   a single-thread read workload showed the following data:

      block     size   latency (ms)
      ----------------------------
      446464      48    1.18
      7180944     16   13.82   (long seek?)
      7181072    112    3.65   (some rotation?)
      7181184    112    2.16
      7181296     16    0.53   (track cache?)
      446512      16    0.57   (track cache?)

   This same system using a SATA disk might look very different,
   because there are 2 additional queues at work, and (expect) NCQ.
   OK, so the easy way around this is to build in a substantial guard
   band... no problem, but if you get above about a second, then you
   aren't much different than the B_FAILFAST solution even though...

2. The algorithm *must* be computationally efficient. We are looking
   down the tunnel at I/O systems that can deliver on the order of
   5 million iops. We really won't have many (any?) spare cycles to
   play with.

> The second is for resource management, where one throttles
> disk-hog projects when one discovers latency growing without
> bound on disk saturation, and the third is in case of a fault
> other than the above.

Resource management is difficult when you cannot directly attribute
physical I/O to a process.

> For the latter to work well, I'd like to see the resource management
> and fast/slow mirror adaptation be something one turns on explicitly,
> because then when FMA discovered that you in fact have a fast/slow
> mirror or a Dr. Evil program saturating the array, the "fix"
> could be to notify the sysadmin that they had a problem and
> suggest built-in tools to ameliorate it.

Agree 100%.

> Ian Collins writes:
>
>> One solution (again, to be used with a remote mirror) is the three way
>> mirror. If two devices are local and one remote, data is safe once the
>> two local writes return. I guess the issue then changes from "is my
>> data safe" to "how safe is my data". I would be reluctant to deploy a
>> remote mirror device without local redundancy, so this probably won't be
>> an uncommon setup. There would have to be an acceptable window of risk
>> when local data isn't replicated.
>
> And in this case too, I'd prefer the sysadmin provide the information
> to ZFS about what she wants, and have the system adapt to it, and
> report how big the risk window is.
>
> This would effectively change the FMA behavior, you understand, so as
> to have it report failures to complete the local writes in time t0 and
> remote in time t1, much as the resource management or fast/slow cases
> would need to be visible to FMA.

I think this can be reasonably accomplished within the scope of FMA.
Perhaps we should pick that up on fm-discuss?

But I think the bigger problem is that unless you can solve for the
general case, you *will* get nailed. I might even argue that we need a
way for storage devices to notify hosts of their characteristics,
which would require protocol adoption and would take years to
implement. Consider two scenarios:

Case 1. Fully redundant storage array with active/active controllers.
   A failed controller should cause the system to recover on the
   surviving controller. I have some lab test data for this sort of
   thing, and some popular arrays can take on the order of a minute to
   complete the failure detection and reconfiguration. You don't want
   to degrade the vdev when this happens; you just want to wait until
   the array is again ready for use (this works OK today). I would
   further argue that no "disk failure prediction" code would be
   useful for this case.

Case 2. Power-on test. I had a bruise (no scar :-) once from an
   integrated product we were designing
   http://docs.sun.com/app/docs/coll/cluster280-3
   which had a server (or two) and a RAID array (or two). If you build
   such a system from scratch, then it will fail a power-on test. If
   you power on the rack containing these systems, the time required
   for the RAID array to boot was longer than the time required for
   the server to boot *and* time out probes of the array. The result
   was that the volume manager would declare the disks bad, and system
   administration intervention was required to regain access to the
   data in the array. Since this was an integrated product, we solved
   it by inducing a delay loop in the server boot cycle to slow down
   the server. Was it the best possible solution? No, but it was the
   only solution which met our other design constraints.

In both of these cases, the solutions imply multi-minute timeouts are
required to maintain a stable system. For 101-level insight into this
sort of problem, see the Sun BluePrint article (an oldie, but goodie):
http://www.sun.com/blueprints/1101/clstrcomplex.pdf
 -- richard
>>>>> "dc" == David Collier-Brown <davecb at sun.com> writes:

    dc> one discovers latency growing without bound on disk
    dc> saturation,

yeah, ZFS needs the same thing just for scrub.

I guess if the disks don't let you tag commands with priorities, then
you have to run them at slightly below max throughput in order to QoS
them.

It's sort of like network QoS, but not quite, because:

 (a) you don't know exactly how big the ``pipe'' is, only
     approximately,

 (b) you're not QoS'ing half of a bidirectional link---you get instant
     feedback of how long it took to ``send'' each ``packet'' that you
     don't get with network QoS, and

 (c) all the fabrics are lossless, so while there are queues which
     undesirably fill up during congestion, these queues never drop
     ``packets'' but instead exert back-pressure all the way up to the
     top of the stack.

I'm surprised we survive as well as we do without disk QoS.  Are the
storage vendors already doing it somehow?
Miles Nordin wrote:
>>>>>> "dc" == David Collier-Brown <davecb at sun.com> writes:
>
>     dc> one discovers latency growing without bound on disk
>     dc> saturation,
>
> yeah, ZFS needs the same thing just for scrub.

ZFS already schedules scrubs at a low priority. However, once the iops
leave ZFS's queue, they can't be rescheduled by ZFS.

> I guess if the disks don't let you tag commands with priorities, then
> you have to run them at slightly below max throughput in order to QoS
> them.
>
> It's sort of like network QoS, but not quite, because:
>
>  (a) you don't know exactly how big the ``pipe'' is, only
>      approximately,
>
>  (b) you're not QoS'ing half of a bidirectional link---you get
>      instant feedback of how long it took to ``send'' each ``packet''
>      that you don't get with network QoS, and
>
>  (c) all the fabrics are lossless, so while there are queues which
>      undesirably fill up during congestion, these queues never drop
>      ``packets'' but instead exert back-pressure all the way up to
>      the top of the stack.
>
> I'm surprised we survive as well as we do without disk QoS.  Are the
> storage vendors already doing it somehow?

Excellent question. I hope someone will pipe up with an answer. In my
experience, they get by through overprovisioning. But I predict that
SSDs will render this question moot, at least for another generation
or so.
 -- richard
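The two constraints Richard names -- scrub is issued at low priority,
but I/Os already handed to the device cannot be rescheduled -- can be
sketched together. This is a hypothetical toy, not the real ZFS vdev
queue code: dispatch the highest-priority pending I/O first, and cap
the number of I/Os outstanding at the device, since anything past that
point is beyond the host's control.

```c
#include <assert.h>

enum { PRI_SYNC = 0, PRI_ASYNC = 1, PRI_SCRUB = 2, NPRI = 3 };

typedef struct vq {
	int pending[NPRI];	/* I/Os still queued in the host, by priority */
	int outstanding;	/* I/Os already handed to the device */
	int max_outstanding;	/* device queue depth we allow ourselves */
} vq_t;

/*
 * Returns the priority class to issue next, or -1 if nothing can be
 * issued (no work, or the device already holds max_outstanding I/Os).
 */
static int
vq_issue_next(vq_t *q)
{
	int pri;

	if (q->outstanding >= q->max_outstanding)
		return (-1);		/* don't overfill the device-side queue */
	for (pri = 0; pri < NPRI; pri++) {
		if (q->pending[pri] > 0) {
			q->pending[pri]--;
			q->outstanding++;
			return (pri);
		}
	}
	return (-1);			/* nothing pending */
}
```

Keeping max_outstanding small is the knob that matters: scrub I/O only
leaves the host when nothing more urgent is pending, and a newly
arrived synchronous I/O waits behind at most max_outstanding
unreschedulable commands rather than a deep device queue full of scrub.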
Robert Milkowski
2008-Sep-01 16:57 UTC
[zfs-discuss] Sidebar to ZFS Availability discussion
Hello Miles,

Sunday, August 31, 2008, 8:03:45 PM, you wrote:

>>>>>> "dc" == David Collier-Brown <davecb at sun.com> writes:

MN> dc> one discovers latency growing without bound on disk
MN> dc> saturation,

MN> yeah, ZFS needs the same thing just for scrub.

MN> I guess if the disks don't let you tag commands with priorities, then
MN> you have to run them at slightly below max throughput in order to QoS
MN> them.

MN> It's sort of like network QoS, but not quite, because:

MN> (a) you don't know exactly how big the ``pipe'' is, only
MN>     approximately,

MN> (b) you're not QoS'ing half of a bidirectional link---you get
MN>     instant feedback of how long it took to ``send'' each ``packet''
MN>     that you don't get with network QoS, and

MN> (c) all the fabrics are lossless, so while there are queues which
MN>     undesirably fill up during congestion, these queues never drop
MN>     ``packets'' but instead exert back-pressure all the way up to
MN>     the top of the stack.

MN> I'm surprised we survive as well as we do without disk QoS.  Are the
MN> storage vendors already doing it somehow?

I don't know the details and haven't actually tested it, but EMC
provides QoS in their CLARiiON line...

-- 
Best regards,
Robert Milkowski                       mailto:milek at task.gda.pl
                                       http://milek.blogspot.com
David Collier-Brown
2008-Sep-01 23:06 UTC
[zfs-discuss] Sidebar to ZFS Availability discussion
Richard Elling wrote:
> [what usually concerns me is that the software people spec'ing device
> drivers don't seem to have much training in control systems, which is
> what is being designed]

Or try to develop safety-critical systems based on "best effort"
instead of first developing a clear and verifiable idea of what is
required for correct functioning.

> The feedback loop is troublesome because there is usually at least one
> queue, perhaps 3 queues, between the host and the media. At each
> queue, iops can be reordered.

And that's evil... A former colleague did a study of how much
reordering could be done and still preserve correctness as his
master's thesis, and it was notable how easily one could mess up!

> As Sommerfeld points out, we see the same sort of thing in IP
> networks, but two things bother me about that:
>
> 1. Latency for disk seeks, rotates, and cache hits looks very different
>    than random IP network latencies. For example, a TNF trace I
>    recently examined for an IDE disk (no queues which reorder) running
>    a single-thread read workload showed the following data:
>
>       block     size   latency (ms)
>       ----------------------------
>       446464      48    1.18
>       7180944     16   13.82   (long seek?)
>       7181072    112    3.65   (some rotation?)
>       7181184    112    2.16
>       7181296     16    0.53   (track cache?)
>       446512      16    0.57   (track cache?)
>
>    This same system using a SATA disk might look very different,
>    because there are 2 additional queues at work, and (expect) NCQ.
>    OK, so the easy way around this is to build in a substantial guard
>    band... no problem, but if you get above about a second, then you
>    aren't much different than the B_FAILFAST solution even though...

Fortunately, latencies grow without bound after N*, the saturation
point, so one can distinguish overloads (insanely bad latency &
response time) from normal mismanagement (single orders of magnitude,
base 10 (;-))

> 2. The algorithm *must* be computationally efficient. We are looking
>    down the tunnel at I/O systems that can deliver on the order of
>    5 million iops. We really won't have many (any?) spare cycles to
>    play with.

Ok, I make it two comparisons and a subtract at the decision point,
but a lot of precalculation in user-space, over time. Very similar to
the IBM mainframe experience with goal-directed management.

>> The second is for resource management, where one throttles
>> disk-hog projects when one discovers latency growing without
>> bound on disk saturation, and the third is in case of a fault
>> other than the above.
>
> Resource management is difficult when you cannot directly attribute
> physical I/O to a process.

Agreed: we may need a way to associate logical I/Os with the project
which authored them.

>> For the latter to work well, I'd like to see the resource management
>> and fast/slow mirror adaptation be something one turns on explicitly,
>> because then when FMA discovered that you in fact have a fast/slow
>> mirror or a Dr. Evil program saturating the array, the "fix"
>> could be to notify the sysadmin that they had a problem and
>> suggest built-in tools to ameliorate it.
>
> Agree 100%.
>
>> Ian Collins writes:
>>
>>> One solution (again, to be used with a remote mirror) is the three
>>> way mirror. If two devices are local and one remote, data is safe
>>> once the two local writes return. I guess the issue then changes
>>> from "is my data safe" to "how safe is my data". I would be
>>> reluctant to deploy a remote mirror device without local redundancy,
>>> so this probably won't be an uncommon setup. There would have to be
>>> an acceptable window of risk when local data isn't replicated.
>>
>> And in this case too, I'd prefer the sysadmin provide the information
>> to ZFS about what she wants, and have the system adapt to it, and
>> report how big the risk window is.
>>
>> This would effectively change the FMA behavior, you understand, so
>> as to have it report failures to complete the local writes in time t0
>> and remote in time t1, much as the resource management or fast/slow
>> cases would need to be visible to FMA.
>
> I think this can be reasonably accomplished within the scope of FMA.
> Perhaps we should pick that up on fm-discuss?
>
> But I think the bigger problem is that unless you can solve for the
> general case, you *will* get nailed. I might even argue that we need
> a way for storage devices to notify hosts of their characteristics,
> which would require protocol adoption and would take years to
> implement.

Fortunately, the critical metric, latency, is easy to measure. Noisy!
Indeed, very noisy, but easy for specific cases, as noted above. The
general case you describe below is indeed harder. I suspect we may
need to statically annotate certain devices with critical behavior
information...

> Consider two scenarios:
>
> Case 1. Fully redundant storage array with active/active controllers.
>    A failed controller should cause the system to recover on the
>    surviving controller. I have some lab test data for this sort of
>    thing, and some popular arrays can take on the order of a minute to
>    complete the failure detection and reconfiguration. You don't want
>    to degrade the vdev when this happens; you just want to wait until
>    the array is again ready for use (this works OK today). I would
>    further argue that no "disk failure prediction" code would be
>    useful for this case.
>
> Case 2. Power-on test. I had a bruise (no scar :-) once from an
>    integrated product we were designing
>    http://docs.sun.com/app/docs/coll/cluster280-3
>    which had a server (or two) and a RAID array (or two). If you build
>    such a system from scratch, then it will fail a power-on test. If
>    you power on the rack containing these systems, the time required
>    for the RAID array to boot was longer than the time required for
>    the server to boot *and* time out probes of the array. The result
>    was that the volume manager would declare the disks bad, and
>    system administration intervention was required to regain access
>    to the data in the array. Since this was an integrated product, we
>    solved it by inducing a delay loop in the server boot cycle to
>    slow down the server. Was it the best possible solution? No, but
>    it was the only solution which met our other design constraints.
>
> In both of these cases, the solutions imply multi-minute timeouts are
> required to maintain a stable system. For 101-level insight into this
> sort of problem, see the Sun BluePrint article (an oldie, but goodie):
> http://www.sun.com/blueprints/1101/clstrcomplex.pdf

--dave
-- 
David Collier-Brown         | Always do right. This will gratify
Sun Microsystems, Toronto   | some people and astonish the rest
davecb at sun.com            |                      -- Mark Twain
cell: (647) 833-9377, bridge: (877) 385-4099 code: 506 9191#
Bill Sommerfeld
2008-Sep-02 17:56 UTC
[zfs-discuss] Sidebar to ZFS Availability discussion
On Sun, 2008-08-31 at 12:00 -0700, Richard Elling wrote:
> 2. The algorithm *must* be computationally efficient.
>    We are looking down the tunnel at I/O systems that can
>    deliver on the order of 5 million iops. We really won't
>    have many (any?) spare cycles to play with.

If you pick the constants carefully (powers of two) you can do the TCP
RTT + variance estimation using only a handful of shifts, adds, and
subtracts.

> In both of these cases, the solutions imply multi-minute timeouts are
> required to maintain a stable system.

Again, there are different uses for timeouts:

 1) how long should we wait on an ordinary request before deciding to
    try "plan B" and go elsewhere (a la B_FAILFAST)

 2) how long should we wait (while trying all alternatives) before
    declaring an overall failure and giving up.

The RTT estimation approach is really only suitable for the former,
where you have some alternatives available (retransmission in the case
of TCP; trying another disk in the case of mirrors, etc.). When you've
tried all the alternatives and nobody's responding, there's no
substitute for just retrying for a long time.

					- Bill
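The estimator Bill refers to is the classic TCP round-trip-time
calculation (Van Jacobson's, with gains of 1/8 on the mean and 1/4 on
the mean deviation, and a deadline of srtt + 4*var). The sketch below
applies it to disk service times; the names and units are ours, not
from any actual ZFS or TCP source.

```c
#include <assert.h>

typedef struct lat_est {
	long srtt;	/* smoothed latency, microseconds */
	long var;	/* smoothed mean deviation, microseconds */
} lat_est_t;

/* Update on each completion: only adds, subtracts, and shift-divides. */
static void
lat_update(lat_est_t *e, long sample)
{
	long err = sample - e->srtt;

	e->srtt += err / 8;		/* gain 1/8: compiles to a shift */
	if (err < 0)
		err = -err;		/* mean deviation uses |error| */
	e->var += (err - e->var) / 4;	/* gain 1/4 */
}

/* "Plan B" threshold: give up on this path past srtt + 4*var. */
static long
lat_deadline(const lat_est_t *e)
{
	return (e->srtt + 4 * e->var);
}
```

As Bill says, this only makes sense for the "plan B" timeout: when the
deadline expires you try a different mirror child, exactly as TCP
retransmits. It says nothing about when to declare the whole I/O
failed.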
Bill Sommerfeld
2008-Sep-02 18:21 UTC
[zfs-discuss] Sidebar to ZFS Availability discussion
On Sun, 2008-08-31 at 15:03 -0400, Miles Nordin wrote:
> It's sort of like network QoS, but not quite, because:
>
>  (a) you don't know exactly how big the ``pipe'' is, only
>      approximately,

In an IP network, end nodes generally know no more than the pipe size
of the first hop -- and in some cases (such as true CSMA networks like
classical Ethernet or wireless) only have an upper bound on the pipe
size. Beyond that, they can only estimate the characteristics of the
rest of the network by observing its behavior -- all they get is
end-to-end latency, and *maybe* a 'congestion observed' mark set by an
intermediate system.

>  (c) all the fabrics are lossless, so while there are queues which
>      undesirably fill up during congestion, these queues never drop
>      ``packets'' but instead exert back-pressure all the way up to
>      the top of the stack.

Hmm. I don't think the back pressure makes it all the way up to zfs
(the top of the block storage stack) except as added latency. (On the
other hand, if it did, zfs could schedule around it both for reads and
writes, avoiding pouring more work on already-congested paths..)

> I'm surprised we survive as well as we do without disk QoS.  Are the
> storage vendors already doing it somehow?

I bet that (as with networking) in many/most cases overprovisioning
the hardware and running at lower average utilization is often cheaper
in practice than running close to the edge and spending a lot of
expensive expert time monitoring performance and tweaking QoS
parameters.
>>>>> "bs" == Bill Sommerfeld <sommerfeld at sun.com> writes:

    bs> In an IP network, end nodes generally know no more than the
    bs> pipe size of the first hop -- and in some cases (such as true
    bs> CSMA networks like classical Ethernet or wireless) only have
    bs> an upper bound on the pipe size.

yeah, but the most complicated and well-studied queueing disciplines
(like, everything implemented in ALTQ, and I think everything
implemented by the two different Cisco queueing frameworks (the CBQ
process-switched one, and the diffserv-like cat6500 ASIC-switched
one)) are (a) hop-by-hop, so the algorithm one discusses only applies
to a single hop, a single transmit queue, never to a whole path, and
(b) assume a unidirectional link of known fixed size, not a broadcast
link or token ring or anything like that.

For wireless they are not using the fancy algorithms.  They're doing
really primitive things like ``unsolicited grants''---basically just
TDMA channels.

I wouldn't think of ECN as part of QoS exactly, because it separates
so cleanly from your choice of queue discipline.

    bs> hmm. I don't think the back pressure makes it all the way up
    bs> to zfs

I guess I was thinking of the lossless fabrics, which might change
some of the assumptions behind designing a scheduler that went into
IP QoS.

For example, most of the IP QoS systems divide the usual one-big-queue
into many smaller queues.  A ``classifier'' picks some packets as pink
ones and some as blue, and assigns them thusly to queues, and they
always get classified to the end of the queue.  The ``scheduler'' then
decides from which queue to take the next packet.  The primitive QoS
in Ethernet chips might give you 4 queues that are either
strict-priority or weighted-round-robin.  Link-sharing schedulers like
CBQ or HFSC make a hierarchy of queues where, to the extent that
they're work-conserving, child queues borrow unused transmission slots
from their ancestors.  Or a flat 256 hash-bucket queues for WFQ, which
just tries to separate one job from another.  But no matter which of
those you choose, within each of the smaller queues you get an
orthogonal choice of RED or FIFO.

There's no such thing as RED or FIFO with queues in storage networks
because there is no packet dropping.  This confuses the implementation
of the upper queueing discipline because what happens when one of the
small queues fills up?  How can you push up the stack, ``I will not
accept another CDB if I would classify it as a Pink CDB, because the
Pink queue is full.  I will still accept Blue CDBs.''  Needing to
express this destroys the modularity of the IP QoS model.  We can only
say ``block---no more CDBs accepted,'' but that defeats the whole
purpose of the QoS!  So how to say no more CDBs of the pink kind?
With normal hop-by-hop QoS, I don't think we can.

This inexpressibility of ``no more pink CDBs'' is the same reason
enterprise Ethernet switches never actually use the gigabit Ethernet
``flow control'' mechanism.  Yeah, they negotiate flow control and
obey received flow control signals, but they never _assert_ a flow
control signal, at least not for normal output-queue congestion,
because this would block reception of packets that would get switched
to uncongested output ports, too.  Proper enterprise switches would
assert flow control only for rare pathological cases like backplane
saturation or cheap oversubscribed line cards.  No matter what
overzealous powerpoint monkeys claim, CEE/FCoE is _not_ going to use
``pause frames.''

I guess you're right that some of the ``queues'' in storage are sort
of arbitrarily sized, like the write queue which could take up the
whole buffer cache, so back pressure might not be the right way to
imagine it.
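The per-class backpressure Miles says the storage stack cannot express
is easy to model in isolation. This toy (invented names, not any real
transport's API) judges admission per class: a full "pink" queue
rejects only pink commands while blue ones still flow -- whereas a
real SCSI/SATA device can only signal "queue full" for everything at
once.

```c
#include <assert.h>

enum { CLASS_PINK = 0, CLASS_BLUE = 1, NCLASS = 2 };

typedef struct classq {
	int depth[NCLASS];	/* commands currently queued, per class */
	int limit[NCLASS];	/* per-class queue limit */
} classq_t;

/* Returns 1 if the command was queued, 0 if its class is full. */
static int
classq_admit(classq_t *cq, int class)
{
	if (cq->depth[class] >= cq->limit[class])
		return (0);	/* back-pressure this class only */
	cq->depth[class]++;
	return (1);
}
```

The point of the thread is precisely that this return value has no
wire-level equivalent: a host can implement per-class queues
internally, but the fabric below it can only block all CDBs or none.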