Hi all,

I have a test system with snv134 and 8x2TB drives in RAIDz2 and currently no ZIL or L2ARC. I noticed the I/O speed to NFS shares on the testpool drops to something hardly usable while scrubbing the pool.

How can I address this? Will adding a ZIL or L2ARC help? Is it possible to tune down scrub's priority somehow?

Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
roy at karlsbakk.net
http://blogg.karlsbakk.net/
On Tue, 27 Apr 2010, Roy Sigurd Karlsbakk wrote:
> I have a test system with snv134 and 8x2TB drives in RAIDz2 and currently no ZIL or L2ARC. I noticed the I/O speed to NFS shares on the testpool drops to something hardly usable while scrubbing the pool.
>
> How can I address this? Will adding a ZIL or L2ARC help? Is it possible to tune down scrub's priority somehow?

Does the NFS performance problem seem to be mainly read performance, or write performance? If it is primarily a read performance issue, then adding lots more RAM and/or an L2ARC device should help, since that would reduce the need to (re-)read the underlying disks during the scrub. Likewise, adding an intent log SSD would help with NFS write performance.

ZFS scrub needs to access all written data on all disks and is usually disk-seek or disk I/O bound, so it is difficult to keep it from hogging the disk resources. A pool based on mirror devices will behave much more nicely while being scrubbed than one based on RAIDz2.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
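For reference, adding the devices Bob describes is a single zpool operation per device. A minimal sketch, assuming the pool is named testpool and that an SSD has been partitioned into a small log slice and a larger cache slice (the device names are placeholders, not taken from the thread):

  # zpool add testpool log c1t9d0s0     # separate intent log (slog) for synchronous NFS writes
  # zpool add testpool cache c1t9d0s1   # L2ARC cache device to absorb re-reads during a scrub
  # zpool status testpool               # verify the new log and cache vdevs appear
  # zpool iostat -v testpool 5          # watch per-vdev load while a scrub is running

Whether this actually helps depends on the workload mix, which is exactly the read-vs-write question Bob raises above.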
On 04/28/10 03:17 AM, Roy Sigurd Karlsbakk wrote:
> Hi all
>
> I have a test system with snv134 and 8x2TB drives in RAIDz2 and currently no ZIL or L2ARC. I noticed the I/O speed to NFS shares on the testpool drops to something hardly usable while scrubbing the pool.

Is that small random or block I/O?

I've found latency to be the killer rather than throughput, at least when receiving snapshots. In normal operation, receiving an empty snapshot is a sub-second operation. While resilvering, it can take up to 30 seconds. The write speed on bigger snapshots is still acceptable.

--
Ian.
On 04/28/10 10:01 AM, Bob Friesenhahn wrote:
> On Wed, 28 Apr 2010, Ian Collins wrote:
>> I've found latency to be the killer rather than throughput, at least when receiving snapshots. In normal operation, receiving an empty snapshot is a sub-second operation. While resilvering, it can take up to 30 seconds. The write speed on bigger snapshots is still acceptable.
>
> zfs scrub != zfs send

Where did I say it did? I didn't even mention zfs send. My observation concerned poor performance (latency) during a scrub/resilver.

--
Ian.
> ZFS scrub needs to access all written data on all disks and is usually disk-seek or disk I/O bound, so it is difficult to keep it from hogging the disk resources. A pool based on mirror devices will behave much more nicely while being scrubbed than one based on RAIDz2.

Experience seconded entirely. I'd like to repeat that I think we need more efficient load balancing functions in order to keep housekeeping payload manageable. Detrimental side effects of scrub should not be a decision point for choosing certain hardware or redundancy concepts, in my opinion.

Regards,

Tonmaus
--
This message posted from opensolaris.org
On Apr 28, 2010, at 1:34 AM, Tonmaus wrote:
>> ZFS scrub needs to access all written data on all disks and is usually disk-seek or disk I/O bound, so it is difficult to keep it from hogging the disk resources. A pool based on mirror devices will behave much more nicely while being scrubbed than one based on RAIDz2.

The data I have does not show a difference in the disk loading while scrubbing for different pool configs. All HDDs become IOPS bound. If you have SSDs, then there can be a bandwidth issue. As soon as someone sends me a big pile of SSDs, I'll characterize the scrub load :-)

> Experience seconded entirely. I'd like to repeat that I think we need more efficient load balancing functions in order to keep housekeeping payload manageable.

The current load balancing is based on the number of queued I/Os. For later builds, I think the algorithm might be out of balance, but the jury is still out.

> Detrimental side effects of scrub should not be a decision point for choosing certain hardware or redundancy concepts, in my opinion.

Scrub performance is directly impacted by the IOPS performance of the pool. Slow disks == poor performance.

-- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com
On Wed, Apr 28 at 1:34, Tonmaus wrote:
> ZFS scrub needs to access all written data on all disks and is usually disk-seek or disk I/O bound, so it is difficult to keep it from hogging the disk resources. A pool based on mirror devices will behave much more nicely while being scrubbed than one based on RAIDz2.
>
> Experience seconded entirely. I'd like to repeat that I think we need more efficient load balancing functions in order to keep housekeeping payload manageable. Detrimental side effects of scrub should not be a decision point for choosing certain hardware or redundancy concepts, in my opinion.

While there may be some possible optimizations, I'm sure everyone would love the random performance of mirror vdevs, combined with the redundancy of raidz3 and the space of a raidz1. However, as in all systems, there are tradeoffs.

To scrub a long-lived, full pool, you must read essentially every sector on every component device, and if you're going to do it in the order in which your transactions occurred, it'll wind up devolving to random I/O eventually.

You can choose to bias your workloads so that foreground I/O takes priority over scrub, but then you've got the cases where people complain that their scrub takes too long. There may be knobs for individuals to use, but I don't think overall there's a magic answer.

--eric

--
Eric D. Mudama
edmudama at mail.bounceswoosh.org
Hi Eric,

> While there may be some possible optimizations, I'm sure everyone would love the random performance of mirror vdevs, combined with the redundancy of raidz3 and the space of a raidz1. However, as in all systems, there are tradeoffs.

I think we all may agree that the topic here is scrub trade-offs, specifically. My question is whether manageability of the pool, and that includes periodic scrubs, is a trade-off as well. It would be very bad news if it were. Maintenance functions should be practicable on any supported configuration, if possible.

> You can choose to bias your workloads so that foreground I/O takes priority over scrub, but then you've got the cases where people complain that their scrub takes too long. There may be knobs for individuals to use, but I don't think overall there's a magic answer.

The priority balance only works as long as the I/O is within ZFS. As soon as the request is in the pipe of the controller/disk, no further bias will occur, as that subsystem is agnostic to ZFS rules. This is where Richard's answer (just above, if you read this from Jive) kicks in. This leads to the pool being basically not operational from a production POV during a scrub pass. From that perspective, any scrub pass exceeding a periodically acceptable service window is "too long". In such a situation, a "pause" option for resuming scrub passes upon the next service window might help. The advantage: such an option would be usable on any hardware.

Regards,

Tonmaus
--
This message posted from opensolaris.org
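There is no pausable scrub in the builds discussed here, but a crude approximation of the service-window idea is to start and stop scrubs from cron. A sketch, with a hypothetical pool name and schedule; note that "zpool scrub -s" cancels the scrub rather than pausing it, so a stopped pass starts over from the beginning next time:

  # root crontab fragment (times and pool name are placeholders)
  # start the scrub early Saturday morning, outside business hours
  0 1 * * 6 /usr/sbin/zpool scrub tank
  # cancel it (progress is lost, not paused) before Monday's load, if still running
  0 7 * * 1 /usr/sbin/zpool scrub -s tank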
On 28 April, 2010 - Eric D. Mudama sent me these 1,6K bytes:

> On Wed, Apr 28 at 1:34, Tonmaus wrote:
>> Experience seconded entirely. I'd like to repeat that I think we need more efficient load balancing functions in order to keep housekeeping payload manageable. Detrimental side effects of scrub should not be a decision point for choosing certain hardware or redundancy concepts, in my opinion.
>
> While there may be some possible optimizations, I'm sure everyone would love the random performance of mirror vdevs, combined with the redundancy of raidz3 and the space of a raidz1. However, as in all systems, there are tradeoffs.
>
> To scrub a long-lived, full pool, you must read essentially every sector on every component device, and if you're going to do it in the order in which your transactions occurred, it'll wind up devolving to random I/O eventually.
>
> You can choose to bias your workloads so that foreground I/O takes priority over scrub, but then you've got the cases where people complain that their scrub takes too long. There may be knobs for individuals to use, but I don't think overall there's a magic answer.

We have one system with a raidz2 of 8 SATA disks. If we start a scrub, then you can kiss any NFS performance goodbye. A single mkdir or creating a file can take 30 seconds. Single write()s can take 5-30 seconds. Without the scrub, it's perfectly fine. Local performance during scrub is fine. NFS performance becomes useless.

This means we can't do a scrub, because doing so will basically disable the NFS service for a day or three. If the scrub were less aggressive and took a week to perform, it would probably not kill the performance as badly.

/Tomas
--
Tomas Ögren, stric at acc.umu.se, http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se
On Wed, 28 Apr 2010, Richard Elling wrote:
>>> the disk resources. A pool based on mirror devices will behave much more nicely while being scrubbed than one based on RAIDz2.
>
> The data I have does not show a difference in the disk loading while scrubbing for different pool configs. All HDDs become IOPS bound.

It is true that all HDDs become IOPS bound, but the mirror configuration offers more usable IOPS, and therefore the user waits less time for their request to be satisfied.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
adding on...

On Apr 28, 2010, at 8:57 AM, Tomas Ögren wrote:
> [...]
> We have one system with a raidz2 of 8 SATA disks. If we start a scrub, then you can kiss any NFS performance goodbye. A single mkdir or creating a file can take 30 seconds. Single write()s can take 5-30 seconds. Without the scrub, it's perfectly fine. Local performance during scrub is fine. NFS performance becomes useless.
>
> This means we can't do a scrub, because doing so will basically disable the NFS service for a day or three. If the scrub were less aggressive and took a week to perform, it would probably not kill the performance as badly.

Which OS release? Later versions of ZFS have a scrub/resilver throttle of sorts. There is not an exposed interface to manage the throttle, and I doubt there is much (if any) community experience with using it in real-world situations.

-- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com
On Wed, April 28, 2010 10:16, Eric D. Mudama wrote:
> On Wed, Apr 28 at 1:34, Tonmaus wrote:
>> Experience seconded entirely. I'd like to repeat that I think we need more efficient load balancing functions in order to keep housekeeping payload manageable. Detrimental side effects of scrub should not be a decision point for choosing certain hardware or redundancy concepts, in my opinion.
>
> While there may be some possible optimizations, I'm sure everyone would love the random performance of mirror vdevs, combined with the redundancy of raidz3 and the space of a raidz1. However, as in all systems, there are tradeoffs.

The situations being mentioned are much worse than what seem reasonable tradeoffs to me. Maybe that's because my intuition is misleading me about what's available. But if the normal workload of a system uses 25% of its sustained IOPS, and a scrub is run at "low priority", I'd like to think that during a scrub I'd see a little degradation in performance, and that the scrub would take 25% or so longer than it would on an idle system. There's presumably some inefficiency, so the two loads don't just add perfectly; so maybe another 5% lost to that? That's the big uncertainty. I have a hard time believing in 20% lost to that.

Do you think that's a reasonable outcome to hope for? Do you think ZFS is close to meeting it?

People with systems that live at 75% all day are obviously going to have more problems than people who live at 25%!

--
David Dyer-Bennet, dd-b at dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info
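A back-of-the-envelope check of that expectation, assuming the scrub simply absorbs whatever sustained IOPS the foreground load leaves free (this is the ideal case, ignoring seek interference between the two workloads):

  scrub time on an idle pool           = T
  IOPS left for scrub under 25% load   = 75% of the pool's sustained IOPS
  expected scrub time under that load  = T / 0.75 ≈ 1.33 T

So even the ideal case is roughly a quarter to a third longer, and any seek interference between the random foreground I/O and the scrub only pushes it further out.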
Indeed the scrub seems to take too many resources from a live system.

For instance, I have a server with 24 disks (SATA 1TB) serving as NFS store to a Linux machine holding user mailboxes. I have around 200 users, with maybe 30-40% of users active at the same time. As soon as the scrub process kicks in, the Linux box starts to give messages like "nfs server not available" and the users start to complain that Outlook gives "connection timeout". Again, as soon as the scrub process stops, everything comes back to normal.

So for me, it's a real issue that the scrub takes so many resources of the system, making it pretty much unusable. In my case I did a workaround, where basically I have zfs send/receive from this server to another server and the scrub process is now running on the second server. I don't know if this is such a good idea, given the fact that I don't know for sure if the scrub process on the secondary machine will be useful in case of data corruption... but so far so good, and it's probably better than nothing.

I still remember, before ZFS, that any good RAID controller would have a background consistency check task, and such a task could be assigned a priority like "low, medium, high"... going back to ZFS, what's the possibility of getting this feature as well?

Just out of curiosity, do the Sun OpenStorage appliances, or Nexenta-based ones, have any scrub task enabled by default? I would like to get some feedback from users that run ZFS appliances regarding the impact of running a scrub on their appliances.

Bruno

On 28-4-2010 22:39, David Dyer-Bennet wrote:
> [...] But if the normal workload of a system uses 25% of its sustained IOPS, and a scrub is run at "low priority", I'd like to think that during a scrub I'd see a little degradation in performance, and that the scrub would take 25% or so longer than it would on an idle system. [...]
I got this hint from Richard Elling, but haven't had time to test it much. Perhaps someone else could help?

roy

> Interesting. If you'd like to experiment, you can change the limit of the number of scrub I/Os queued to each vdev. The default is 10, but that is too close to the normal limit. You can see the current scrub limit via:
>
> # echo zfs_scrub_limit/D | mdb -k
> zfs_scrub_limit:
> zfs_scrub_limit:10
>
> you can change it with:
>
> # echo zfs_scrub_limit/W0t2 | mdb -kw
> zfs_scrub_limit:0xa = 0x2
>
> # echo zfs_scrub_limit/D | mdb -k
> zfs_scrub_limit:
> zfs_scrub_limit:2
>
> In theory, this should help your scenario, but I do not believe this has been exhaustively tested in the lab. Hopefully, it will help.
> -- richard

----- "Bruno Sousa" <bsousa at epinfante.com> skrev:
> Indeed the scrub seems to take too many resources from a live system. For instance, I have a server with 24 disks (SATA 1TB) serving as NFS store to a Linux machine holding user mailboxes. [...] As soon as the scrub process kicks in, the Linux box starts to give messages like "nfs server not available" and the users start to complain that Outlook gives "connection timeout". [...]
On 28/04/2010 21:39, David Dyer-Bennet wrote:
> The situations being mentioned are much worse than what seem reasonable tradeoffs to me. Maybe that's because my intuition is misleading me about what's available. But if the normal workload of a system uses 25% of its sustained IOPS, and a scrub is run at "low priority", I'd like to think that during a scrub I'd see a little degradation in performance, and that the scrub would take 25% or so longer than it would on an idle system. There's presumably some inefficiency, so the two loads don't just add perfectly; so maybe another 5% lost to that? That's the big uncertainty. I have a hard time believing in 20% lost to that.

Well, it's not that easy, as there are many other factors you need to take into account. For example, how many I/Os are you allowing to be queued per device? This might affect latency for your application. Or if you have a disk array with its own cache - just by doing a scrub you might be pushing other entries out of that cache, which might impact the performance of your application. Then there might be a SAN and... and so on.

I'm not saying there is no room for improvement here. All I'm saying is that it is not as easy a problem as it seems.

--
Robert Milkowski
http://milek.blogspot.com
On 29 April, 2010 - Roy Sigurd Karlsbakk sent me these 10K bytes:

> I got this hint from Richard Elling, but haven't had time to test it much. Perhaps someone else could help?
>
>> Interesting. If you'd like to experiment, you can change the limit of the number of scrub I/Os queued to each vdev. The default is 10, but that is too close to the normal limit. You can see the current scrub limit via:
>>
>> # echo zfs_scrub_limit/D | mdb -k
>>
>> you can change it with:
>>
>> # echo zfs_scrub_limit/W0t2 | mdb -kw
>>
>> In theory, this should help your scenario, but I do not believe this has been exhaustively tested in the lab. Hopefully, it will help.
>> -- richard

If I'm reading the code right, it's only used when "creating" a new vdev (import, zpool create, maybe at boot), so I took an alternate route:

http://pastebin.com/hcYtQcJH

(spa_scrub_maxinflight used to be 0x46 (70 decimal) due to 7 devices * zfs_scrub_limit(10) = 70.)

With these lower numbers, our pool is much more responsive over NFS.

  scrub: scrub in progress for 0h40m, 0.10% done, 697h29m to go

Might take a while though. We've taken periodic snapshots and have snapshots from 2008, which has probably fragmented the pool beyond sanity or something.
/Tomas
--
Tomas Ögren, stric at acc.umu.se, http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se
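For readers without access to the pastebin, a runtime change along the lines Tomas describes would presumably look like the following mdb transcript. This is an untested sketch in the style of Richard's zfs_scrub_limit example; the value 2 is arbitrary, and spa_scrub_maxinflight is the global in-flight scrub I/O cap Tomas mentions:

  # echo spa_scrub_maxinflight/D | mdb -k      # show the current global scrub I/O cap
  # echo spa_scrub_maxinflight/W0t2 | mdb -kw  # lower it to 2 outstanding scrub I/Os, effective immediately
  # echo zfs_scrub_limit/W0t2 | mdb -kw        # so vdevs added or imported later inherit the lower per-vdev limit

As with Richard's hint, this is a live kernel tweak, not a supported interface, and it reverts at reboot.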
On 29 April, 2010 - Tomas Ögren sent me these 5,8K bytes:

> If I'm reading the code right, it's only used when "creating" a new vdev (import, zpool create, maybe at boot), so I took an alternate route:
>
> http://pastebin.com/hcYtQcJH
>
> (spa_scrub_maxinflight used to be 0x46 (70 decimal) due to 7 devices * zfs_scrub_limit(10) = 70.)
>
> With these lower numbers, our pool is much more responsive over NFS.

But taking snapshots is quite bad. A single recursive snapshot over ~800 filesystems took about 45 minutes, with NFS operations taking 5-10 seconds. Snapshots usually take 10-30 seconds.

>   scrub: scrub in progress for 0h40m, 0.10% done, 697h29m to go

  scrub: scrub in progress for 1h41m, 2.10% done, 78h35m to go

This is chugging along.

The server is a Fujitsu RX300 with a quad Xeon 1.6GHz, 6G RAM, 8x400G SATA through a U320 SCSI<->SATA box - an Infortrend A08U-G1410 - running Sol10u8. It should have enough oomph, but when you combine a snapshot with a scrub/resilver, sync performance gets abysmal. Should probably try adding a ZIL when u9 comes, so we can remove it again if performance goes to crap.

/Tomas
--
Tomas Ögren, stric at acc.umu.se, http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se
On Apr 29, 2010, at 5:52 AM, Tomas Ögren wrote:
> [...]
> The server is a Fujitsu RX300 with a quad Xeon 1.6GHz, 6G RAM, 8x400G SATA through a U320 SCSI<->SATA box - an Infortrend A08U-G1410 - running Sol10u8.

slow disks == poor performance

> It should have enough oomph, but when you combine a snapshot with a scrub/resilver, sync performance gets abysmal. Should probably try adding a ZIL when u9 comes, so we can remove it again if performance goes to crap.

A separate log will not help. Try faster disks.

-- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com
On 29 April, 2010 - Richard Elling sent me these 2,5K bytes:

>> The server is a Fujitsu RX300 with a quad Xeon 1.6GHz, 6G RAM, 8x400G SATA through a U320 SCSI<->SATA box - an Infortrend A08U-G1410 - running Sol10u8.
>
> slow disks == poor performance

I know they're not "fast", but they're not "should take 10-30 seconds to create a directory". They do perfectly well in all combinations, except when a scrub comes along (or sometimes when a snapshot feels like taking 45 minutes instead of 4.5 seconds). iostat says the disks aren't 100% busy, the storage box itself doesn't seem to be busy, yet with ZFS they go downhill in some conditions.

>> It should have enough oomph, but when you combine a snapshot with a scrub/resilver, sync performance gets abysmal. Should probably try adding a ZIL when u9 comes, so we can remove it again if performance goes to crap.
>
> A separate log will not help. Try faster disks.

/Tomas
--
Tomas Ögren, stric at acc.umu.se, http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se
>> The server is a Fujitsu RX300 with a quad Xeon 1.6GHz, 6G RAM, 8x400G SATA through a U320 SCSI<->SATA box - an Infortrend A08U-G1410 - running Sol10u8.
>
> slow disks == poor performance
>
>> It should have enough oomph, but when you combine a snapshot with a scrub/resilver, sync performance gets abysmal. Should probably try adding a ZIL when u9 comes, so we can remove it again if performance goes to crap.
>
> A separate log will not help. Try faster disks.

We're seeing the same thing in Sol10u8 with both 300GB 15k rpm SAS disks in-board on a Sun x4250 and an external chassis with 1TB 7200 rpm SATA disks connected via SAS. Faster disks aren't the problem; there's a fundamental issue with ZFS [iscsi;nfs;cifs] share performance under scrub & resilver.

-K

---
Karl Katzke
Systems Analyst II
TAMU DRGS
On Thu, 29 Apr 2010, Roy Sigurd Karlsbakk wrote:
> While there may be some possible optimizations, I'm sure everyone would love the random performance of mirror vdevs, combined with the redundancy of raidz3 and the space of a raidz1. However, as in all systems, there are tradeoffs.

In my opinion periodic scrubs are most useful for pools based on mirrors, or raidz1, and much less useful for pools based on raidz2 or raidz3. It is useful to run a scrub at least once on a well-populated new pool in order to validate the hardware and OS, but otherwise, the scrub is most useful for discovering bit-rot in singly-redundant pools.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
On 04/30/10 10:35 AM, Bob Friesenhahn wrote:
> On Thu, 29 Apr 2010, Roy Sigurd Karlsbakk wrote:
>> While there may be some possible optimizations, I'm sure everyone would love the random performance of mirror vdevs, combined with the redundancy of raidz3 and the space of a raidz1. However, as in all systems, there are tradeoffs.
>
> In my opinion periodic scrubs are most useful for pools based on mirrors, or raidz1, and much less useful for pools based on raidz2 or raidz3. It is useful to run a scrub at least once on a well-populated new pool in order to validate the hardware and OS, but otherwise, the scrub is most useful for discovering bit-rot in singly-redundant pools.

I agree. I look after an x4500 with a pool of raidz2 vdevs that I can't run scrubs on due to the dire impact on performance. That's one reason I'd never use raidz1 in a real system.

--
Ian.
> In my opinion periodic scrubs are most useful for pools based on mirrors, or raidz1, and much less useful for pools based on raidz2 or raidz3. It is useful to run a scrub at least once on a well-populated new pool in order to validate the hardware and OS, but otherwise, the scrub is most useful for discovering bit-rot in singly-redundant pools.
>
> Bob

Hi,

For one, well-populated pools are rarely new. Second, the Best Practices recommendations on scrubbing intervals are based on disk product line (Enterprise monthly vs. Consumer weekly), not on redundancy level or pool configuration. Obviously, the issue under discussion affects all imaginable configurations, though it may vary in degree. Recommending not to use scrub doesn't even qualify as a workaround, in my regard.

Regards,

Tonmaus
--
This message posted from opensolaris.org
On Thu, April 29, 2010 17:35, Bob Friesenhahn wrote:
> In my opinion periodic scrubs are most useful for pools based on mirrors, or raidz1, and much less useful for pools based on raidz2 or raidz3. It is useful to run a scrub at least once on a well-populated new pool in order to validate the hardware and OS, but otherwise, the scrub is most useful for discovering bit-rot in singly-redundant pools.

I've got 10 years of photos on my disk now, and it's growing at faster than one year per year (since I'm scanning backwards slowly through the negatives). Many of them don't get accessed very often; they're archival, not current use.

Scrub was one of the primary reasons I chose ZFS for the fileserver they live on -- I want some assurance, 20 years from now, that they're still valid. I needed something to check them periodically, and something to check *against*, and block checksums and scrub seemed to fill the bill.

So, yes, I want to catch bit rot -- on a pool of mirrored vdevs.

--
David Dyer-Bennet, dd-b at dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info
> On Thu, 29 Apr 2010, Tonmaus wrote:
>> Recommending not to use scrub doesn't even qualify as a workaround, in my regard.
>
> As a devoted believer in the power of scrub, I believe that after the OS, power supplies, and controller have been verified to function with a good scrubbing, if there is more than one level of redundancy, scrubs are not really warranted. With just one level of redundancy it becomes much more important to verify that both copies were written to disk correctly.

The scrub should still be available without slowing down the system to something barely usable - that's why it's there. Adding new layers of security is nice, but dropping scrub because of OS bugs is rather ugly.

roy
On Thu, 29 Apr 2010, Tonmaus wrote:
> Recommending not to use scrub doesn't even qualify as a workaround, in my regard.

As a devoted believer in the power of scrub, I believe that after the OS, power supplies, and controller have been verified to function with a good scrubbing, if there is more than one level of redundancy, scrubs are not really warranted. With just one level of redundancy it becomes much more important to verify that both copies were written to disk correctly.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
On Fri, Apr 30, 2010 at 11:35 AM, Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:
> On Thu, 29 Apr 2010, Tonmaus wrote:
>> Recommending not to use scrub doesn't even qualify as a workaround, in my regard.
>
> As a devoted believer in the power of scrub, I believe that after the OS, power supplies, and controller have been verified to function with a good scrubbing, if there is more than one level of redundancy, scrubs are not really warranted. With just one level of redundancy it becomes much more important to verify that both copies were written to disk correctly.

Without a periodic scrub that touches every single bit of data in the pool, how can you be sure that 10-year-old files that haven't been opened in 5 years are still intact?

Self-healing only comes into play when the file is read. If you don't read a file for years, how can you be sure that all copies of that file haven't succumbed to bit-rot? Do you really want that "oh shit" moment 5 years from now, when you go to open "Super Important Doc Saved for Legal Reasons" and find that all copies are corrupt?

Sure, you don't have to scrub every single week. But you definitely want to scrub more than once over the lifetime of the pool.

--
Freddie Cash
fjwcash at gmail.com
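The periodic check itself is cheap to automate. A sketch, with a hypothetical pool name and an arbitrary schedule:

  # root crontab entry: kick off a scrub at 02:00 on the 1st of each month
  0 2 1 * * /usr/sbin/zpool scrub tank

  # zpool scrub tank       # or run it by hand: reads and verifies every allocated block against its checksum
  # zpool status -v tank   # afterwards, look for "scrub completed ... with 0 errors"
                           # and for non-zero READ/WRITE/CKSUM counters on any device

How often to run it is exactly the judgment call being debated in this thread.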
On Fri, April 30, 2010 13:44, Freddie Cash wrote:
> Without a periodic scrub that touches every single bit of data in the pool, how can you be sure that 10-year-old files that haven't been opened in 5 years are still intact?
>
> Self-healing only comes into play when the file is read. If you don't read a file for years, how can you be sure that all copies of that file haven't succumbed to bit-rot?

Yes, that's precisely my point. That's why it's especially relevant to archival data -- it's important (to me), but not frequently accessed.

--
David Dyer-Bennet, dd-b at dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info
On Fri, 30 Apr 2010, Freddie Cash wrote:
> Without a periodic scrub that touches every single bit of data in the pool, how can you be sure that 10-year-old files that haven't been opened in 5 years are still intact?

You don't. But it seems that having two or three extra copies of the data on different disks should instill considerable confidence. With sufficient redundancy, chances are that the computer will explode before it loses data due to media corruption. The calculated time before data loss becomes longer than even the pyramids in Egypt could withstand.

The situation becomes similar to having a house with a heavy front door with three deadbolt locks, and many glass windows. The front door with its three locks is no longer a concern when you are evaluating your home for its security against burglary or home invasion, because the glass windows are so fragile and easily broken.

It is necessary to look at all the factors which might result in data loss before deciding what the most effective steps are to minimize the probability of loss.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Hi Bob,

> It is necessary to look at all the factors which might result in data loss before deciding what the most effective steps are to minimize the probability of loss.
>
> Bob

I am under the impression that exactly those were the considerations both for the ZFS designers to implement a scrub function in ZFS and for the author of the Best Practices guide to recommend performing this function frequently. I am hearing that you are coming to a different conclusion, and I would be interested in learning what could possibly be so open to interpretation in this.

Regards,

Tonmaus
--
This message posted from opensolaris.org
On Sun, 2 May 2010, Tonmaus wrote:
> I am under the impression that exactly those were the considerations both for the ZFS designers to implement a scrub function in ZFS and for the author of the Best Practices guide to recommend performing this function frequently. I am hearing that you are coming to a different conclusion, and I would be interested in learning what could possibly be so open to interpretation in this.

The value of periodic scrub is subject to opinion. There are some highly respected folks on this list who put less faith in scrub because they believe more in MTTDL statistical models and less in the value of early detection (scrub == early detection). With a single level of redundancy, early detection is more useful since there is just one opportunity to correct the error, and correcting the error early decreases the chance of a later uncorrectable error. Scrub will help repair the results of transient wrong hardware operation, or partial media failures, but will not keep a whole disk from failing.

Once the computed MTTDL for the storage configuration is sufficiently high, then other factors such as the reliability of ECC memory, kernel bugs, and hardware design flaws become dominant. The human factor is often the most dominant factor when it comes to data loss, since most data loss is still due to human error. Most data loss problems we see reported here are due to human error or hardware design flaws.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
----- "Roy Sigurd Karlsbakk" <roy at karlsbakk.net> skrev:> Hi all > > I have a test system with snv134 and 8x2TB drives in RAIDz2 and > currently no Zil or L2ARC. I noticed the I/O speed to NFS shares on > the testpool drops to something hardly usable while scrubbing the > pool. > > How can I address this? Will adding Zil or L2ARC help? Is it possible > to tune down scrub''s priority somehow?Further testing shows NFS speeds are acceptable after adding Zil and L2ARC (my test system has two SSDs for the root, so I detached on of them and split it into a 4GB slice for Zil and the rest for L2ARC). Best regards roy -- Roy Sigurd Karlsbakk (+47) 97542685 roy at karlsbakk.net http://blogg.karlsbakk.net/ -- I all pedagogikk er det essensielt at pensum presenteres intelligibelt. Det er et element?rt imperativ for alle pedagoger ? unng? eksessiv anvendelse av idiomer med fremmed opprinnelse. I de fleste tilfeller eksisterer adekvate og relevante synonymer p? norsk.
On May 1, 2010, at 1:56 PM, Bob Friesenhahn wrote:
> On Fri, 30 Apr 2010, Freddie Cash wrote:
>> Without a periodic scrub that touches every single bit of data in the pool, how can you be sure that 10-year-old files that haven't been opened in 5 years are still intact?
>
> You don't. But it seems that having two or three extra copies of the data on different disks should instill considerable confidence. With sufficient redundancy, chances are that the computer will explode before it loses data due to media corruption. The calculated time before data loss becomes longer than even the pyramids in Egypt could withstand.

These calculations are based on a fixed MTBF. But disk MTBF decreases with age. Most disks are only rated at 3-5 years of expected lifetime. Hence, archivists use solutions with longer lifetimes (high quality tape = 30 years) and plans for migrating the data to newer media before the expected media lifetime is reached. In short, if you don't expect to read your 5-year-lifetime-rated disk for another 5 years, then your solution is uhmm... shall we say... in need of improvement.

> The situation becomes similar to having a house with a heavy front door with three deadbolt locks, and many glass windows. The front door with its three locks is no longer a concern when you are evaluating your home for its security against burglary or home invasion, because the glass windows are so fragile and easily broken.
>
> It is necessary to look at all the factors which might result in data loss before deciding what the most effective steps are to minimize the probability of loss.

Yep... and manage the data over time. There is a good reason why library scientists will never worry about the future of their profession :-)

http://en.wikipedia.org/wiki/Library_science

-- richard

ZFS storage and performance consulting at http://www.RichardElling.com
On Sun, 2 May 2010, Richard Elling wrote:
> These calculations are based on a fixed MTBF. But disk MTBF decreases with age. Most disks are only rated at 3-5 years of expected lifetime. Hence, archivists use solutions with longer lifetimes (high quality tape = 30 years) and plans for migrating the data to newer media before the expected media lifetime is reached. In short, if you don't expect to read your 5-year-lifetime-rated disk for another 5 years, then your solution is uhmm... shall we say... in need of improvement.

Yes, the hardware does not last forever. It only needs to last while it is still being used and should only be used during its expected service life. Your point is a good one.

On the flip-side, using 'zfs scrub' puts more stress on the system, which may make it more likely to fail. It increases load on the power supplies, CPUs, interfaces, and disks. A system which might work fine under normal load may be stressed and misbehave under scrub. Using scrub on a weak system could actually increase the chance of data loss.

> ZFS storage and performance consulting at http://www.RichardElling.com

Please send $$$ to the above address in return for wisdom.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
On 5/2/10 3:12 PM, "Bob Friesenhahn" <bfriesen at simple.dallas.tx.us> wrote:> On the flip-side, using ''zfs scrub'' puts more stress on the system > which may make it more likely to fail. It increases load on the power > supplies, CPUs, interfaces, and disks. A system which might work fine > under normal load may be stressed and misbehave under scrub. Using > scrub on a weak system could actually increase the chance of data > loss.If my system is going to fail under the stress of a scrub, it''s going to fail under the stress of a resilver. From my perspective, I''m not as scared of data corruption as I am of data corruption *that I don''t know about.* I only keep backups for a finite amount of time. If I scrub every week, and my zpool dies during a scrub, then I know it''s time to pull out last week''s backup, where I know (thanks to scrubbing) the data was not corrupt. I''ve lived the experience where a user comes to me because he tried to open a seven-year-old file and it was corrupt. Not a blankety-blank thing I could do, because we only retain backup tapes for four years and the four-year-old tape had a backup of the file post-corruption. Data loss may be unavoidable, but that''s why we keep backups. It''s the invisible data loss that makes life suboptimal. -- Dave Pooser, ACSA Manager of Information Services Alford Media http://www.alfordmedia.com
On Sun, 2 May 2010, Dave Pooser wrote:
> If my system is going to fail under the stress of a scrub, it's going to fail under the stress of a resilver. From my perspective, I'm not as scared

I don't disagree with any of the opinions you stated, except to point out that resilver will usually hit the (old) hardware less severely than scrub. Resilver does not have to access any of the redundant copies of data or metadata, unless they are the only remaining good copy.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
On May 2, 2010, at 12:05 PM, Roy Sigurd Karlsbakk wrote:
> ----- "Roy Sigurd Karlsbakk" <roy at karlsbakk.net> skrev:
>> I have a test system with snv134 and 8x2TB drives in RAIDz2 and currently no ZIL or L2ARC. I noticed the I/O speed to NFS shares on the testpool drops to something hardly usable while scrubbing the pool.
>>
>> How can I address this? Will adding a ZIL or L2ARC help? Is it possible to tune down scrub's priority somehow?
>
> Further testing shows NFS speeds are acceptable after adding a ZIL and L2ARC (my test system has two SSDs for the root, so I detached one of them and split it into a 4GB slice for the ZIL and the rest for L2ARC).

Ok, this makes sense. If you are using a pool configuration which is not so good for high-IOPS workloads (raidz*) and you give it a latency-sensitive, synchronous IOPS workload (NFS) along with another high-IOPS workload (scrub), then the latency-sensitive workload will notice. Adding the SSD as a separate log is a good idea.

-- richard

ZFS storage and performance consulting at http://www.RichardElling.com
> On Sun, 2 May 2010, Dave Pooser wrote:
>> If my system is going to fail under the stress of a scrub, it's going to fail under the stress of a resilver. From my perspective, I'm not as scared
>
> I don't disagree with any of the opinions you stated, except to point out that resilver will usually hit the (old) hardware less severely than scrub. Resilver does not have to access any of the redundant copies of data or metadata, unless they are the only remaining good copy.
>
> Bob

Adding the perspective that scrub could consume my hard disks' life may sound like a really good reason why I should avoid scrub on my system as far as possible, and thus avoid experiencing the performance issues while scrubbing in the first place. I just don't buy this. Sorry. It's too far-fetched. I'd still prefer if the original issue could be fixed.

Regards,

Tonmaus
--
This message posted from opensolaris.org
On Sun, May 2, 2010 14:12, Richard Elling wrote:
> On May 1, 2010, at 1:56 PM, Bob Friesenhahn wrote:
>> On Fri, 30 Apr 2010, Freddie Cash wrote:
>>> Without a periodic scrub that touches every single bit of data in the
>>> pool, how can you be sure that 10-year files that haven't been opened
>>> in 5 years are still intact?
>>
>> You don't. But it seems that having two or three extra copies of the
>> data on different disks should instill considerable confidence. With
>> sufficient redundancy, chances are that the computer will explode before
>> it loses data due to media corruption. The calculated time before data
>> loss becomes longer than even the pyramids in Egypt could withstand.
>
> These calculations are based on fixed MTBF. But disk MTBF decreases with
> age. Most disks are only rated at 3-5 years of expected lifetime. Hence,
> archivists use solutions with longer lifetimes (high quality tape = 30 years)
> and plans for migrating the data to newer media before the expected media
> lifetime is reached.
> In short, if you don't expect to read your 5-year lifetime rated disk for
> another 5 years, then your solution is uhmm... shall we say... in need of
> improvement.

Are they giving tape that long an estimated life these days? They certainly weren't last time I looked.

And I basically don't trust tape; too many bad experiences (ever since I moved off of DECTape, I've been having bad experiences with tape). The drives are terribly expensive and I can't afford redundancy, and in thirty years I very probably could not buy a new drive for my old tapes.

I started out a big fan of tape, but the economics have been very much against it in the range I'm working (small; 1.2 terabytes usable on my server currently).

I don't expect I'll keep my hard disks for 30 years; I expect I'll upgrade them periodically, probably even within their MTBF. (Although note that, though tests haven't been run, the MTBF of a 5-year disk after 4 years is nearly certainly greater than 1 year.)
--
David Dyer-Bennet, dd-b at dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info
On May 3, 2010, at 2:38 PM, David Dyer-Bennet wrote:
> On Sun, May 2, 2010 14:12, Richard Elling wrote:
>> On May 1, 2010, at 1:56 PM, Bob Friesenhahn wrote:
>>> On Fri, 30 Apr 2010, Freddie Cash wrote:
>>>> Without a periodic scrub that touches every single bit of data in the
>>>> pool, how can you be sure that 10-year files that haven't been opened
>>>> in 5 years are still intact?
>>>
>>> You don't. But it seems that having two or three extra copies of the
>>> data on different disks should instill considerable confidence. With
>>> sufficient redundancy, chances are that the computer will explode before
>>> it loses data due to media corruption. The calculated time before data
>>> loss becomes longer than even the pyramids in Egypt could withstand.
>>
>> These calculations are based on fixed MTBF. But disk MTBF decreases with
>> age. Most disks are only rated at 3-5 years of expected lifetime. Hence,
>> archivists use solutions with longer lifetimes (high quality tape = 30 years)
>> and plans for migrating the data to newer media before the expected media
>> lifetime is reached.
>> In short, if you don't expect to read your 5-year lifetime rated disk for
>> another 5 years, then your solution is uhmm... shall we say... in need of
>> improvement.
>
> Are they giving tape that long an estimated life these days? They
> certainly weren't last time I looked.

Yes.
http://www.oracle.com/us/products/servers-storage/storage/tape-storage/036556.pdf
http://www.sunstarco.com/PDF%20Files/Quantum%20LTO3.pdf

> And I basically don't trust tape; too many bad experiences (ever since I
> moved off of DECTape, I've been having bad experiences with tape). The
> drives are terribly expensive and I can't afford redundancy, and in thirty
> years I very probably could not buy a new drive for my old tapes.
>
> I started out a big fan of tape, but the economics have been very much
> against it in the range I'm working (small; 1.2 terabytes usable on my
> server currently).
>
> I don't expect I'll keep my hard disks for 30 years; I expect I'll upgrade
> them periodically, probably even within their MTBF. (Although note that,
> though tests haven't been run, the MTBF of a 5-year disk after 4 years is
> nearly certainly greater than 1 year.)

Yes, but MTBF != expected lifetime. MTBF is defined as Mean Time Between Failures (a rate), not Time Until Death (a lifetime). If your MTBF was 1 year, then the probability of failing within 1 year would be approximately 63%, assuming an exponential distribution.
 -- richard

ZFS storage and performance consulting at http://www.RichardElling.com
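(For anyone wondering where the 63% comes from, a quick sketch under the exponential-failure assumption Richard names, nothing more:

  P(failure within time t) = 1 - exp(-t / MTBF)
  with t = MTBF:  P = 1 - e^(-1) ≈ 1 - 0.368 ≈ 0.63

i.e. roughly 63% of units would be expected to fail within one MTBF.)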
On Apr 30, 2010, at 11:44 AM, Freddie Cash wrote:
> Sure, you don't have to scrub every single week. But you definitely want to scrub more than once over the lifetime of the pool.

Yes. There have been studies of this, and the results depend on the technical side (probabilities) and the comfort level (feeling lucky?). The technical results will also depend on the quality and nature of the algorithms involved along with the quality of the hardware. I think it is safe to say that a scrub once per year is too infrequent and a weekly scrub is too frequent for most folks.
 -- richard

ZFS storage and performance consulting at http://www.RichardElling.com
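If a monthly scrub is the middle ground you settle on, it is easy to drive from cron. A minimal sketch, not a recommendation from this thread; the pool name and schedule are placeholders:

  # root's crontab: scrub 'testpool' at 02:00 on the 1st of every month
  0 2 1 * * /usr/sbin/zpool scrub testpool
  # check on progress later with: zpool status -v testpool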
On Apr 29, 2010, at 11:55 AM, Katzke, Karl wrote:
>>> The server is a Fujitsu RX300 with a Quad Xeon 1.6GHz, 6G ram, 8x400G
>>> SATA through a U320SCSI<->SATA box - Infortrend A08U-G1410, Sol10u8.
>
>> slow disks == poor performance
>
>>> Should have enough oompf, but when you combine snapshot with a
>>> scrub/resilver, sync performance gets abysmal. Should probably try
>>> adding a ZIL when u9 comes, so we can remove it again if performance
>>> goes crap.
>
>> A separate log will not help. Try faster disks.
>
> We're seeing the same thing in Sol10u8 with both 300gb 15k rpm SAS disks in-board on a Sun x4250 and an external chassis with 1tb 7200 rpm SATA disks connected via SAS. Faster disks aren't the problem; there's a fundamental issue with ZFS [iscsi;nfs;cifs] share performance under scrub & resilver.

In Solaris 10u8 (and prior releases) the default number of outstanding I/Os per vdev is 35 and (I trust, because Solaris 10 is not open source) the default max number of scrub I/Os is 10 per vdev. If your disks are slow, the service times are long enough that the queue grows to its full 35 entries, and the I/O scheduler in ZFS then has an opportunity to prioritize and reorder the queue. If your disks are fast, you won't see this and life will be good.

In recent OpenSolaris builds, the default number of outstanding I/Os is reduced to 4-10. For slow disks, the scheduler then has a greater probability of being able to prioritize non-scrub I/Os. Again, if your disks are fast, you won't see the queue depth reach 10 and life will be good. iostat is the preferred tool for measuring queue depth, though it would be easy to write a dedicated tool using DTrace.

Also in OpenSolaris, there is code to throttle the scrub based on bandwidth. But we've clearly ascertained that this is not a bandwidth problem, so a bandwidth throttle is mostly useless... unless the disks are fast.

P.S. I don't consider any HDDs to be fast. SSDs won. Game over :-)
 -- richard

ZFS storage and performance consulting at http://www.RichardElling.com
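If you want to see this on your own system, the queue depth shows up in iostat, and on OpenSolaris-era builds the per-vdev limit is a kernel tunable. A rough sketch; the tunable name below is the one used in builds of that period, so verify it exists on your release before changing anything:

  # 'actv' is I/Os the device is servicing, 'wait' is I/Os queued above it;
  # actv pinned near 35 during a scrub matches the behaviour described above
  iostat -xn 5

  # lower the per-vdev outstanding-I/O limit on the running kernel
  echo zfs_vdev_max_pending/W0t10 | mdb -kw
  # or persistently, in /etc/system:
  #   set zfs:zfs_vdev_max_pending = 10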
On Mon, May 3, 2010 17:02, Richard Elling wrote:
> On May 3, 2010, at 2:38 PM, David Dyer-Bennet wrote:
>> On Sun, May 2, 2010 14:12, Richard Elling wrote:
>>> On May 1, 2010, at 1:56 PM, Bob Friesenhahn wrote:
>>>> On Fri, 30 Apr 2010, Freddie Cash wrote:
>>>>> Without a periodic scrub that touches every single bit of data in the
>>>>> pool, how can you be sure that 10-year files that haven't been opened
>>>>> in 5 years are still intact?
>>>>
>>>> You don't. But it seems that having two or three extra copies of the
>>>> data on different disks should instill considerable confidence. With
>>>> sufficient redundancy, chances are that the computer will explode before
>>>> it loses data due to media corruption. The calculated time before data
>>>> loss becomes longer than even the pyramids in Egypt could withstand.
>>>
>>> These calculations are based on fixed MTBF. But disk MTBF decreases with
>>> age. Most disks are only rated at 3-5 years of expected lifetime. Hence,
>>> archivists use solutions with longer lifetimes (high quality tape = 30 years)
>>> and plans for migrating the data to newer media before the expected media
>>> lifetime is reached.
>>> In short, if you don't expect to read your 5-year lifetime rated disk for
>>> another 5 years, then your solution is uhmm... shall we say... in need of
>>> improvement.
>>
>> Are they giving tape that long an estimated life these days? They
>> certainly weren't last time I looked.
>
> Yes.
> http://www.oracle.com/us/products/servers-storage/storage/tape-storage/036556.pdf
> http://www.sunstarco.com/PDF%20Files/Quantum%20LTO3.pdf

Yep, they say 30 years. That's probably in the same "years" where the MAM gold archival DVDs are good for 200, I imagine (i.e. based on accelerated testing, with the lab knowing what answer the client wants). Although we may know more about tape aging, so the accelerated tests may be more valid for tapes?

But LTO-3 is a 400GB tape that costs, hmmm, maybe $40 each (maybe less with better shopping; that's a quick Amazon price rounded down). (I don't factor in compression in my own analysis because my data is overwhelmingly image files and MP3 files, which don't compress further very well.) Plus a $1000 drive, or $2000 for a 3-tape changer (and that's barely big enough to back up my small server without manual intervention, and might not be by the end of the year). Tape is a LOT more expensive than my current hard-drive based backup scheme, even if I use the backup drives only three years (and since they spin less than 10% of the time, they should last pretty well). Also, I lose my snapshots in a tape backup, whereas I keep them on my hard drive backups. (Or else I'm storing a ZFS send stream on tape and hoping it will actually restore.)

>> And I basically don't trust tape; too many bad experiences (ever since I
>> moved off of DECTape, I've been having bad experiences with tape). The
>> drives are terribly expensive and I can't afford redundancy, and in thirty
>> years I very probably could not buy a new drive for my old tapes.
>>
>> I started out a big fan of tape, but the economics have been very much
>> against it in the range I'm working (small; 1.2 terabytes usable on my
>> server currently).
>>
>> I don't expect I'll keep my hard disks for 30 years; I expect I'll upgrade
>> them periodically, probably even within their MTBF. (Although note that,
>> though tests haven't been run, the MTBF of a 5-year disk after 4 years is
>> nearly certainly greater than 1 year.)
>
> Yes, but MTBF != expected lifetime. MTBF is defined as Mean Time Between
> Failures (a rate), not Time Until Death (a lifetime). If your MTBF was 1 year,
> then the probability of failing within 1 year would be approximately 63%,
> assuming an exponential distribution.

Yeah, sorry, I stumbled into using the same wrong figures lots of people were.
--
David Dyer-Bennet, dd-b at dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info