for s10u2, documentation recommends 3 to 9 devices in raidz. what is the
basis for this recommendation? i assume it is performance and not failure
resilience, but i am just guessing... [i know, recommendation was intended
for people who know their raid cold, so it needed no further explanation]

thanks... oz

--
ozan s. yigit | oz at somanetworks.com | 416 977 1414 x 1540
I have a hard time getting enough time to do even trivial blogging:
being truly thoughtful takes a lot of time. -- james gosling
Hello ozan,

Friday, November 3, 2006, 3:57:00 PM, you wrote:

osy> for s10u2, documentation recommends 3 to 9 devices in raidz. what is the
osy> basis for this recommendation? i assume it is performance and not failure
osy> resilience, but i am just guessing... [i know, recommendation was intended
osy> for people who know their raid cold, so it needed no further explanation]

Performance reason for random reads.

ps. however, the bigger the raid-z group, the riskier it could be - but this
is obvious.

--
Best regards,
Robert                            mailto:rmilkowski at task.gda.pl
                                  http://milek.blogspot.com
ozan s. yigit wrote:
> for s10u2, documentation recommends 3 to 9 devices in raidz. what is the
> basis for this recommendation? i assume it is performance and not failure
> resilience, but i am just guessing... [i know, recommendation was intended
> for people who know their raid cold, so it needed no further explanation]

Both actually.
The small, random read performance will approximate that of a single disk.
The probability of data loss increases as you add disks to a RAID-5/6/Z/Z2
volume.

For example, suppose you have 12 disks and insist on RAID-Z.
Given
   1. small, random read iops for a single disk is 141 (eg. 2.5" SAS
      10k rpm drive)
   2. MTBF = 1.4M hours (0.63% AFR) (so says the disk vendor)
   3. no spares
   4. service time = 24 hours, resync rate 100 GBytes/hr, 50% space
      utilization
   5. infinite service life

Scenario 1: 12-way RAID-Z
   performance = 141 iops
   MTTDL[1] = 68,530 years
   space = 11 * disk size

Scenario 2: 2x 6-way RAID-Z+0
   performance = 282 iops
   MTTDL[1] = 150,767 years
   space = 10 * disk size

[1] Using MTTDL = MTBF^2 / (N * (N-1) * MTTR)
 -- richard
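[A minimal sketch of the MTTDL[1] model above, in Python. The per-disk
capacity is an assumption -- 146 GBytes, a common 2.5" SAS 10k size of the
era, not stated in the givens -- chosen because it reproduces the quoted
figures to within rounding; everything else comes from the givens.]

    HOURS_PER_YEAR = 8760

    MTBF = 1.4e6          # hours, per the vendor claim
    SERVICE_TIME = 24.0   # hours before a human swaps the failed disk
    RESYNC_RATE = 100.0   # GBytes/hr
    DISK_SIZE = 146.0     # GBytes -- assumed, see note above
    UTILIZATION = 0.5     # 50% space utilization

    # MTTR = service response time + resync time
    MTTR = SERVICE_TIME + DISK_SIZE * UTILIZATION / RESYNC_RATE  # ~24.7 hours

    def mttdl_years(n_disks, n_groups=1):
        """MTTDL in years for n_groups striped n_disks-way RAID-Z groups.

        Striping multiplies exposure: losing any one group loses the
        pool, so the combined MTTDL divides by the number of groups.
        """
        per_group = MTBF ** 2 / (n_disks * (n_disks - 1) * MTTR)
        return per_group / n_groups / HOURS_PER_YEAR

    print(f"12-way RAID-Z:     {mttdl_years(12):,.0f} years")    # ~68,500
    print(f"2x 6-way RAID-Z+0: {mttdl_years(6, 2):,.0f} years")  # ~150,800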
On Fri, 3 Nov 2006, Richard Elling - PAE wrote:

> ozan s. yigit wrote:
> > for s10u2, documentation recommends 3 to 9 devices in raidz. what is the
> > basis for this recommendation? i assume it is performance and not failure
> > resilience, but i am just guessing...
>
> Both actually.
> The small, random read performance will approximate that of a single disk.
> The probability of data loss increases as you add disks to a RAID-5/6/Z/Z2
> volume.
> [...]
> [1] Using MTTDL = MTBF^2 / (N * (N-1) * MTTR)

But ... I'm not sure I buy into your numbers given the probability that
more than one disk will fail inside the service window - given that the
disks are identical? Or ... a disk failure occurs at 5:01 PM (quitting
time) on a Friday and won't be replaced until 8:00 AM on Monday morning.
Does the failure data you have access to support my hypothesis that
failures of identical mechanical systems tend to occur in small clusters
within a relatively small window of time?

Call me paranoid, but I'd prefer to see a product like thumper configured
with 50% of the disks manufactured by vendor A and the other 50%
manufactured by someone else.

This paranoia is based on a personal experience, many years ago (before we
had smart fans etc), where we had a rack full of expensive custom
equipment cooled by (what we thought was) a highly redundant group of 5
fans. One fan suffered infant mortality and its failure went unnoticed,
leaving 4 fans running. Two of the fans died on the same extended weekend
(public holiday). It was an expensive and embarrassing disaster.

Regards,

Al Hopper  Logical Approach Inc, Plano, TX.  al at logical-approach.com
           Voice: 972.379.2133 Fax: 972.379.2134  Timezone: US CDT
OpenSolaris.Org Community Advisory Board (CAB) Member - Apr 2005
OpenSolaris Governing Board (OGB) Member - Feb 2006
Al Hopper wrote:
>> [1] Using MTTDL = MTBF^2 / (N * (N-1) * MTTR)
>
> But ... I'm not sure I buy into your numbers given the probability that
> more than one disk will fail inside the service window - given that the
> disks are identical? Or ... a disk failure occurs at 5:01 PM (quitting
> time) on a Friday and won't be replaced until 8:00 AM on Monday morning.
> Does the failure data you have access to support my hypothesis that
> failures of identical mechanical systems tend to occur in small clusters
> within a relatively small window of time?

Separating the right hand side:
	MTTDL = MTBF/N * MTBF/((N-1) * MTTR)
the right-most factor accounts for the probability that one of the N-1
remaining disks fails during the recovery window for the first disk's
failure. As the MTTR increases, the probability of the 2nd disk failure
also increases. RAIDoptimizer calculates the MTTR as:
	MTTR = service response time + resync time
where
	resync time = size * space used (%) / resync rate

Incidentally, since ZFS schedules the resync iops itself, it can
really move along on a mostly idle system. You should be able to resync
at near the media speed for an idle system. By contrast, a hardware RAID
array has no knowledge of the context of the data or the I/O scheduling,
so they will perform resyncs using a throttle. Not only do they end up
resyncing unused space, but they also take a long time (4-18 GBytes/hr for
some arrays) and thus expose you to a higher probability of second disk
failure.

> Call me paranoid, but I'd prefer to see a product like thumper configured
> with 50% of the disks manufactured by vendor A and the other 50%
> manufactured by someone else.

Diversity is usually a good thing. Unfortunately, this is often
impractical for a manufacturer.

> This paranoia is based on a personal experience, many years ago (before we
> had smart fans etc), where we had a rack full of expensive custom
> equipment cooled by (what we thought was) a highly redundant group of 5
> fans. One fan suffered infant mortality and its failure went unnoticed,
> leaving 4 fans running. Two of the fans died on the same extended weekend
> (public holiday). It was an expensive and embarrassing disaster.

Modelling such as this assumes independence of failures. Common cause or
bad lots are not that hard to model, but you may never find any failure
rate data for them. You can look at the MTBF sensitivities, though that is
an opening to another set of results. I prefer to ignore the absolute
values and judge competing designs by their relative results. To wit, I
fully expect to be beyond dust in 150,767 years, and the expected lifetime
of most disks is 5 years. But given two competing designs using the same
model, a design predicting an MTTDL of 150,767 years will very likely
demonstrate better MTTDL than a design predicting 68,530 years.
 -- richard
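[The resync-rate point can be made concrete with the same model. A sketch,
reusing the assumed 146 GByte disk from the earlier note; the further
assumption here is that a hardware array resyncs the whole disk (it cannot
know which space is unused), while ZFS resyncs only the allocated half.]

    MTBF, SERVICE_TIME = 1.4e6, 24.0     # hours
    DISK_SIZE, UTILIZATION = 146.0, 0.5  # GBytes, fraction in use

    def mttdl_years(resync_rate, whole_disk, n=12):
        # A slower resync lengthens MTTR, which directly lowers MTTDL.
        data = DISK_SIZE if whole_disk else DISK_SIZE * UTILIZATION
        mttr = SERVICE_TIME + data / resync_rate          # hours
        return MTBF ** 2 / (n * (n - 1) * mttr) / 8760    # years

    print(mttdl_years(4, whole_disk=True))     # array at 4 GB/hr:  ~28,000 yrs
    print(mttdl_years(18, whole_disk=True))    # array at 18 GB/hr: ~52,800 yrs
    print(mttdl_years(100, whole_disk=False))  # ZFS, used space:   ~68,500 yrs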
Richard Elling - PAE wrote:
> Incidentally, since ZFS schedules the resync iops itself, it can
> really move along on a mostly idle system. [...] Not only do they end up
> resyncing unused space, but they also take a long time (4-18 GBytes/hr for
> some arrays) and thus expose you to a higher probability of second disk
> failure.

Just as another data point: It is true that the array doesn't know the
context of the data or the i/o scheduling, but some arrays do watch the
incoming data rate and throttle accordingly. (T3 used to, for example.)
Hello Richard,

Saturday, November 4, 2006, 12:46:05 AM, you wrote:

REP> Incidentally, since ZFS schedules the resync iops itself, it can
REP> really move along on a mostly idle system. [...] Not only do they end up
REP> resyncing unused space, but they also take a long time (4-18 GBytes/hr for
REP> some arrays) and thus expose you to a higher probability of second disk
REP> failure.

However some mechanism to slow or freeze scrub/resilvering would be
useful. Especially in cases where the server does many other things and
not only file serving - scrub/resilver can take much CPU power on
slower servers.

Something like 'zpool scrub -r 10 pool' - which would mean 10% of
speed.

--
Best regards,
Robert                            mailto:rmilkowski at task.gda.pl
                                  http://milek.blogspot.com
Robert Milkowski wrote:
> However some mechanism to slow or freeze scrub/resilvering would be
> useful. Especially in cases where the server does many other things and
> not only file serving - scrub/resilver can take much CPU power on
> slower servers.
>
> Something like 'zpool scrub -r 10 pool' - which would mean 10% of
> speed.

I think this has some merit for scrubs, but I wouldn't suggest it for
resilver. If your data is at risk, there is nothing more important than
protecting it. While that sounds harsh, in reality there is a practical
limit determined by the ability of a single LUN to absorb a (large,
sequential?) write workload. For JBODs, that would be approximately the
media speed.

The big question, though, is "10% of what?"  User CPU?  iops?
 -- richard
Richard Elling - PAE schrieb:
> The big question, though, is "10% of what?" User CPU? iops?

Maybe something like the "slow" parameter of VxVM?

     slow[=iodelay]
            Reduces the system performance impact of copy
            operations. Such operations are usually per-
            formed on small regions of the volume (nor-
            mally from 16 kilobytes to 128 kilobytes).
            This option inserts a delay between the
            recovery of each such region. A specific
            delay can be specified with iodelay as a
            number of milliseconds; otherwise, a default
            is chosen (normally 250 milliseconds).

Daniel
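[To make the mechanism concrete, here is a minimal sketch of this style of
throttle in Python -- copy a small region, then sleep. It illustrates the
VxVM slow=iodelay idea only; it is not ZFS or VxVM code, and the region
size, delay, and paths are all assumptions.]

    import time

    REGION = 128 * 1024   # bytes per region (VxVM copies 16-128 KB regions)
    IODELAY = 0.250       # seconds between regions (the VxVM default)

    def throttled_copy(src_path, dst_path, region=REGION, iodelay=IODELAY):
        """Copy one region at a time, sleeping between regions.

        The sleep caps throughput at roughly region/iodelay bytes/sec
        (here 128 KB / 0.25 s = 512 KB/s), leaving the rest of the
        device bandwidth for foreground I/O.
        """
        with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
            while chunk := src.read(region):
                dst.write(chunk)
                time.sleep(iodelay)

    # throttled_copy("good_half.img", "new_disk.img")  # hypothetical paths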
Daniel Rock wrote:
> Richard Elling - PAE schrieb:
>> The big question, though, is "10% of what?" User CPU? iops?
>
> Maybe something like the "slow" parameter of VxVM?
>
>      slow[=iodelay]
>             Reduces the system performance impact of copy
>             operations. [...] This option inserts a delay between
>             the recovery of each such region. A specific delay can
>             be specified with iodelay as a number of milliseconds;
>             otherwise, a default is chosen (normally 250 milliseconds).

For modern machines, which *should* be the design point, the channel
bandwidth is underutilized, so why not use it?

NB. At 4 128kByte iops per second, it would take 11 days and 8 hours
to resilver a single 500 GByte drive -- feeling lucky?  In the bad old
days when disks were small, and the systems were slow, this made some
sense. The better approach is for the file system to do what it needs
to do as efficiently as possible, which is the current state of ZFS.
 -- richard
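[The 11-day figure checks out arithmetically, using decimal kBytes, which
the estimate evidently assumes:]

    rate = 4 * 128_000        # 4 regions/sec * 128 kBytes = 512,000 bytes/sec
    seconds = 500e9 / rate    # one 500 GByte drive
    print(seconds / 86400)    # ~11.3 days, i.e. about 11 days and 8 hours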
Richard Elling - PAE wrote:
> The better approach is for the file system to do what it needs
> to do as efficiently as possible, which is the current state of ZFS.

This implies that the filesystem has exclusive use of the channel - SAN
or otherwise - as well as the storage array front end controllers, cache,
and the raid groups that may be behind it. What we really need in this
case, and a few others, is the filesystem and backend storage working
together ... but I'll save that rant for another day. ;)
Richard Elling - PAE schrieb:
> For modern machines, which *should* be the design point, the channel
> bandwidth is underutilized, so why not use it?

And what about encrypted disks? Simply create a zpool with
checksum=sha256, fill it up, then scrub. I'd be happy if I could use my
machine during scrubbing. A throttling of scrubbing would help. Maybe
also running the scrubbing with a "high nice level" in kernel.

> NB. At 4 128kByte iops per second, it would take 11 days and 8 hours
> to resilver a single 500 GByte drive -- feeling lucky?

250ms is the Veritas default. It doesn't have to be the ZFS default also.

Daniel
Hello Richard,

Tuesday, November 7, 2006, 5:19:07 PM, you wrote:

REP> I think this has some merit for scrubs, but I wouldn't suggest it for
REP> resilver. If your data is at risk, there is nothing more important than
REP> protecting it. While that sounds harsh, in reality there is a practical
REP> limit determined by the ability of a single LUN to absorb a (large,
REP> sequential?) write workload. For JBODs, that would be approximately the
REP> media speed.

I can't agree. I have some performance sensitive environments and I
know that during the day I do not want to lose performance even if it
means longer resilvering times. That's exactly what I do on a HW RAID
controller. In other environments I do want to resilver ASAP, you're
right.

Also scrub can consume all CPU power on smaller and older machines and
that's not always what I would like.

REP> The big question, though, is "10% of what?"  User CPU?  iops?

Just slow the rate at which resilvering/scrub is done.
Insert some kind of delay, as someone else suggested.

--
Best regards,
Robert                            mailto:rmilkowski at task.gda.pl
                                  http://milek.blogspot.com
On Thu, 9 Nov 2006, Robert Milkowski wrote:

> I can't agree. I have some performance sensitive environments and I
> know that during the day I do not want to lose performance even if it
> means longer resilvering times. That's exactly what I do on a HW RAID
> controller. In other environments I do want to resilver ASAP, you're
> right.
>
> Also scrub can consume all CPU power on smaller and older machines and
> that's not always what I would like.
>
> REP> The big question, though, is "10% of what?"  User CPU?  iops?

Probably N% of I/O Ops/Second would work well.

> Just slow the rate at which resilvering/scrub is done.
> Insert some kind of delay, as someone else suggested.

Al Hopper  Logical Approach Inc, Plano, TX.  al at logical-approach.com
           Voice: 972.379.2133 Fax: 972.379.2134  Timezone: US CDT
OpenSolaris.Org Community Advisory Board (CAB) Member - Apr 2005
OpenSolaris Governing Board (OGB) Member - Feb 2006
Hello Al,

Friday, November 10, 2006, 2:21:38 PM, you wrote:

AH> Probably N% of I/O Ops/Second would work well.

Or if 100% means full speed, then 10% means that the expected time should
be approximately 10x longer (instead of 1h, make it 10h).

It would be more intuitive than specifying some numbers like IOPS, etc.

Additionally, setting it to 0 means freeze; then setting it above 0% means
continue (continue, not start from the beginning).

--
Best regards,
Robert                            mailto:rmilkowski at task.gda.pl
                                  http://milek.blogspot.com
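[One way to realize this percentage semantics is to translate the setting
into an inter-batch delay. A sketch; the batch-timing framing is an
illustrative assumption, not how ZFS implements anything:]

    def scrub_delay(batch_seconds, percent):
        """Sleep to insert after each scrub batch so progress runs at
        ~percent of full speed: total time per batch becomes
        batch_seconds * 100 / percent. percent == 0 means freeze."""
        if percent <= 0:
            return float("inf")   # frozen; caller parks until set above 0%
        return batch_seconds * (100.0 / percent - 1.0)

    # A batch that takes 50 ms at full speed, throttled to 10%:
    print(scrub_delay(0.050, 10))  # 0.45 s sleep -> ~10x the elapsed time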
Robert Milkowski wrote:
> AH> Probably N% of I/O Ops/Second would work well.
>
> Or if 100% means full speed, then 10% means that the expected time should
> be approximately 10x longer (instead of 1h, make it 10h).
>
> It would be more intuitive than specifying some numbers like IOPS, etc.

In any case you're still going to have to provide a tunable for this
even if the resulting algorithm works well on the host side. Keep in
mind that a scrub can also impact the array(s) your filesystem lives
on. If all my ZFS systems started scrubbing at full speed - because they
thought they weren't busy - at the same time, it might cause issues with
other I/O on the array itself.
Hello Torrey,

Friday, November 10, 2006, 11:31:31 PM, you wrote:

TM> In any case you're still going to have to provide a tunable for this
TM> even if the resulting algorithm works well on the host side. Keep in
TM> mind that a scrub can also impact the array(s) your filesystem lives
TM> on. If all my ZFS systems started scrubbing at full speed - because they
TM> thought they weren't busy - at the same time, it might cause issues with
TM> other I/O on the array itself.

Tunable in a form of pool property, with default 100%.

On the other hand maybe the simple algorithm Veritas has used is good
enough - a simple delay between scrubbing/resilvering some data.

--
Best regards,
Robert                            mailto:rmilkowski at task.gda.pl
                                  http://milek.blogspot.com
Robert Milkowski wrote:
> Tunable in a form of pool property, with default 100%.
>
> On the other hand maybe the simple algorithm Veritas has used is good
> enough - a simple delay between scrubbing/resilvering some data.

I think a not-too-convoluted algorithm as people have suggested would be
ideal and then let people override it as necessary. I would think a 100%
default might be a call generator but I'm up for debate. ("Hey my array
just went crazy. All the lights are blinking but my application isn't
doing any I/O. What gives?")
Hello Torrey,

Monday, November 13, 2006, 5:07:02 AM, you wrote:

TM> I think a not-too-convoluted algorithm as people have suggested would be
TM> ideal and then let people override it as necessary. I would think a 100%
TM> default might be a call generator but I'm up for debate. ("Hey my array
TM> just went crazy. All the lights are blinking but my application isn't
TM> doing any I/O. What gives?")

You've got the same behavior with any LVM when you replace a disk.
So it's not something unexpected for admins. Also, most of the time
they expect the LVM to resilver ASAP. With the default setting not being
100% you'll definitely see people complaining ZFS is slooow, etc.

--
Best regards,
Robert                            mailto:rmilkowski at task.gda.pl
                                  http://milek.blogspot.com
Howdy Robert.

Robert Milkowski wrote:
> You've got the same behavior with any LVM when you replace a disk.
> So it's not something unexpected for admins. Also, most of the time
> they expect the LVM to resilver ASAP. With the default setting not being
> 100% you'll definitely see people complaining ZFS is slooow, etc.

It's quite possible that I've only seen the other side of the coin, but in
my past I've had support calls where customers complained that they
{replaced a drive, resilvered a mirror, ... } and it knocked the
performance of other things. My fave was a set of A5200s on a hub: after
they cranked the i/o rate up on the mirror it caused some other app -
methinks it was Oracle - to get too slow, think there was a disk problem,
crash(!), and then initiate a cluster failover. Given the disk group was
not in perfect health ... oh the fun we had.

In any case the key is documenting the behavior well enough so people can
see what is going on, how to tune it slower or faster on the fly, etc. I'm
more concerned with that than the actual algorithm or method used.
>> Maybe something like the "slow" parameter of VxVM?
>>
>>      slow[=iodelay]
>>             Reduces the system performance impact of copy
>>             operations. [...] This option inserts a delay between
>>             the recovery of each such region. A specific delay can
>>             be specified with iodelay as a number of milliseconds;
>>             otherwise, a default is chosen (normally 250 milliseconds).
>
> For modern machines, which *should* be the design point, the channel
> bandwidth is underutilized, so why not use it?
>
> NB. At 4 128kByte iops per second, it would take 11 days and 8 hours
> to resilver a single 500 GByte drive -- feeling lucky?  In the bad old
> days when disks were small, and the systems were slow, this made some
> sense. The better approach is for the file system to do what it needs
> to do as efficiently as possible, which is the current state of ZFS.

Well, we are trying to balance the impact of resilvering on running
applications with the speed of resilvering.

I think that having an option to tell the filesystem to postpone
full-throttle resilvering till some quieter period of time may help.
This may be combined with some throttling mechanism so that during a
quieter period resilvering is done at full speed, and during a busy
period it may continue at reduced speed. Such an arrangement may be
useful for customers with e.g. well-defined SLAs.

Wbr,
Victor
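[Victor's quiet-period idea could pair with a load-adaptive throttle. A
sketch; the iops threshold and the delay values are illustrative
assumptions:]

    BUSY_IOPS = 500.0   # assumed: foreground load above this counts as busy

    def adaptive_delay(foreground_iops, max_delay=0.25):
        """Scale the inter-batch delay with observed foreground load:
        idle -> zero delay (full-speed resilver); at or above BUSY_IOPS
        -> back off to max_delay seconds between batches."""
        load = min(foreground_iops / BUSY_IOPS, 1.0)
        return load * max_delay

    for iops in (0, 100, 500, 2000):   # idle through heavily loaded
        print(f"{iops:4d} iops -> {adaptive_delay(iops):.3f} s between batches")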
Hi All,
From reading the docs, it seems that you can add devices (non-spares)
to a zpool, but you cannot take them away, right?
Best,
Mike

Victor Latushkin wrote:
> Well, we are trying to balance the impact of resilvering on running
> applications with the speed of resilvering.
>
> I think that having an option to tell the filesystem to postpone
> full-throttle resilvering till some quieter period of time may help.
> [...]
> Wbr,
> Victor
Hi Mike,

Yes, outside of the hot-spares feature, you can detach, offline, and
replace existing devices in a pool, but you can't remove devices, yet.

This feature work is being tracked under this RFE:

http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=4852783

Cindy

Mike Seda wrote:
> Hi All,
> From reading the docs, it seems that you can add devices (non-spares)
> to a zpool, but you cannot take them away, right?
> Best,
> Mike
> [...]
Torrey McMahon wrote:
> Robert Milkowski wrote:
>> Tunable in a form of pool property, with default 100%.
>>
>> On the other hand maybe the simple algorithm Veritas has used is good
>> enough - a simple delay between scrubbing/resilvering some data.
>
> I think a not-too-convoluted algorithm as people have suggested would
> be ideal and then let people override it as necessary. I would think a
> 100% default might be a call generator but I'm up for debate. ("Hey my
> array just went crazy. All the lights are blinking but my application
> isn't doing any I/O. What gives?")

I'll argue that *any* random % is bogus. What you really want to
do is prioritize activity where resources are constrained. From a RAS
perspective, idle systems are the devil's playground :-). ZFS already
does prioritize I/O that it knows about. Prioritizing on CPU might have
some merit, but to integrate into Solaris' resource management system
might bring some added system admin complexity which is unwanted.
 -- richard
Richard Elling - PAE wrote:
> I'll argue that *any* random % is bogus. What you really want to
> do is prioritize activity where resources are constrained. From a RAS
> perspective, idle systems are the devil's playground :-). ZFS already
> does prioritize I/O that it knows about. Prioritizing on CPU might have
> some merit, but to integrate into Solaris' resource management system
> might bring some added system admin complexity which is unwanted.

I agree, but the problem as I see it is that nothing has an overview of
the entire environment. ZFS knows what I/O is coming in and what it's
sending out, but that's it. Even if we had an easy to use resource
management framework across all the Sun applications and devices, we'd
still run into non-Sun bits that place demands on shared components
like networking, san, arrays, etc. Anything that can be auto-tuned is
great but I'm afraid we're still going to need manual tuning in some
cases.
Torrey McMahon wrote:
> I agree, but the problem as I see it is that nothing has an overview of
> the entire environment. ZFS knows what I/O is coming in and what it's
> sending out, but that's it. Even if we had an easy to use resource
> management framework across all the Sun applications and devices, we'd
> still run into non-Sun bits that place demands on shared components
> like networking, san, arrays, etc. Anything that can be auto-tuned is
> great but I'm afraid we're still going to need manual tuning in some
> cases.

I think this is reason #7429 why I hate SANs: no meaningful QoS
related to reason #85823 why I hate SANs: sdd_max_throttle is a butt-ugly
hack :-)
 -- richard
I noticed that there is still an open bug regarding removing devices
from a zpool:
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=4852783
Does anyone know if or when this feature will be implemented?

Cindy Swearingen wrote:
> Hi Mike,
>
> Yes, outside of the hot-spares feature, you can detach, offline, and
> replace existing devices in a pool, but you can't remove devices, yet.
>
> This feature work is being tracked under this RFE:
>
> http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=4852783
>
> Cindy
> [...]
Mike,

This RFE is still being worked and I have no ETA on completion...

cs

Mike Seda wrote:
> I noticed that there is still an open bug regarding removing devices
> from a zpool:
> http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=4852783
> Does anyone know if or when this feature will be implemented?
> [...]