Dear zfs fellows,

during a specific test I got the impression that scrub may have quite an impact on other I/O. CIFS throughput drops from 100 MB/s to 7 MB/s while a scrub is running on my main NAS. That is not a surprise, as scrubbing my raidz2 pool maxes out the CPU on that machine (1 Xeon L5410).

I am running scrubs during weekends, so this is not a problem. I am asking myself, however, what will happen on larger pools where a scrub pass will take days to weeks. Obviously, zfs file systems are much more scalable than CPU power ever will be.

Hence, I see a requirement to manage scrub activity so that trade-offs can be made to maintain availability and performance of the pool. Does anybody know how this is done?

Thanks in advance for any hints,

Regards,
Tonmaus
--
This message posted from opensolaris.org
Richard Elling
2010-Mar-14  04:54 UTC
[zfs-discuss] How to manage scrub priority or defer scrub?
On Mar 13, 2010, at 4:54 AM, Tonmaus wrote:
> Dear zfs fellows,
>
> during a specific test I have got the impression that scrub may have quite an impact on other I/O. CIFS throughput is down to 7 MB/s from 100 MB/s while scrub on my main NAS. That is not a surprise as scrub of my raidz2 pool maxes out CPU on that machine. (1 Xeon L5410).
> I am running scrubs during week-ends, so this is not a problem. I am asking myself however what will happen on larger pools where a scrub pass will take days to weeks. Obviously, zfs file systems are much more scalable than CPU power ever will be.
> Hence, I am seeing a requirement to manage scrub activity so that trade-offs can be done to maintain availability and performance of the pool. Does anybody know how this is done?

This is noticeable on systems with slow disks and is worse on releases prior to b129 or so. Can you describe your hardware and release?

 -- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
Atlanta, March 16-18, 2010 http://nexenta-atlanta.eventbrite.com
Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com
Hi Richard,

these are
- 11x WD1002fbys (7200rpm SATA drives) in 1 raidz2 group
- 4 GB RAM
- 1 CPU L5410
- snv_133 (where the current array was created as well)

Regards,
Tonmaus
Richard Elling
2010-Mar-14  16:33 UTC
[zfs-discuss] How to manage scrub priority or defer scrub?
On Mar 14, 2010, at 12:16 AM, Tonmaus wrote:
> Hi Richard,
>
> these are
> - 11x WD1002fbys (7200rpm SATA drives) in 1 raidz2 group
> - 4 GB RAM
> - 1 CPU L5410
> - snv_133 (where the current array was created as well)

These are slow drives and the configuration will have poor random read performance. Do not expect blazing fast scrubs. In b133, the priority scheduler will work better than on older releases. But it may not be enough to overcome a very wide raidz2 set.

 -- richard
Hi Richard,

thanks for the answer. I think I am aware of the properties of my configuration and how it will scale. Let me please stress that this is not the point of the discussion.

The target of this discussion should rather be whether scrubbing can co-exist with payload or whether we are thrown back to scrubbing in the after-hours.

So, do I have to conclude that zfs is not able to make good decisions about load prioritisation on commodity hardware and that there are no further options available to tweak scrub load impact, or are there other options?

I am thinking about managing pools capable of a hundred times the capacity of mine (currently there are 3,7 TB on disk, and it takes 2,5 h to scrub them on the double-parity pool) that practically would be un-scrub-able. (Yes, Enterprise HW is faster, but Enterprise service windows are much more narrow as well... you can't move around or offline 200 TB of live data for days only because you need to scrub the disks, can you?)

The only idea I could think of myself is to exchange individual drives in a round-robin fashion all the time and use re-silver instead of full scrubs. But actually I don't like the idea anymore at second glance.

Regards,
Tonmaus
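To put the "un-scrub-able" worry into numbers, here is a minimal back-of-the-envelope sketch. It assumes the scrub rate observed on the small pool (3,7 TB in 2,5 h) stays constant as the pool grows, which is a simplifying assumption and not something the thread establishes; a larger pool with more vdevs may well scrub faster.

```python
def scrub_hours(data_tb: float, rate_tb_per_hour: float) -> float:
    """Hours needed to scrub data_tb at a constant rate."""
    return data_tb / rate_tb_per_hour


# Figures from the post: 3.7 TB on disk, scrubbed in 2.5 hours.
observed_rate = 3.7 / 2.5          # TB per hour
big_pool_tb = 100 * 3.7            # "a hundred times the capacity"

# Assumption: the rate does not change with pool size.
hours = scrub_hours(big_pool_tb, observed_rate)

print(f"observed rate: {observed_rate:.2f} TB/h")
print(f"{big_pool_tb:.0f} TB at that rate: {hours:.0f} h ({hours / 24:.1f} days)")
```

Under that assumption a pass over the large pool takes on the order of ten days, which is consistent with the "days to weeks" estimate in the original question.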
Richard Elling
2010-Mar-14  20:08 UTC
[zfs-discuss] How to manage scrub priority or defer scrub?
On Mar 14, 2010, at 11:45 AM, Tonmaus wrote:
> Hi Richard,
>
> thanks for the answer. I think I am aware on the properties of my configuration and how it will scale. Let me please stress that this is not the point in the discussion.
> The target of this discussion should rather be if scrubbing can co-exist with payload or if we are thrown back to scrub in the after-hours.
> So, do I have to conclude that zfs is not able to make good decisions about load prioritisation on commodity hardware and that there are no further options available to tweak scrub load impact, or are there other options?

ZFS prioritizes I/O. Scrub has the lowest priority. The scrub will queue no more than 10 I/Os at one time to a device, so devices which can handle concurrent I/O are not consumed entirely by scrub I/O. This could be tuned lower, but your storage is slow and *any* I/O activity will be noticed.

> I am thinking about managing pools capable of hundred times the capacity of mine (currently there are 3,7 TB on disk, and it takes 2,5 h to scrub them on the double-parity pool) that practically would be un-scrub-able. (Yes, Enterprise HW is faster, but Enterprise service windows are much more narrow as well... you can't move around or offline 200 TB of live data for days only because you need to scrub the disks can you?)

I can't follow your logic here. Scrub is a low priority process and should be done at infrequent intervals. If you are concerned that a single 200TB pool would take a long time to scrub, then use more pools and scrub in parallel.

> The only idea I could think of myself is to exchange individual drives in a round-robin fashion all the time and use re-silver instead of full scrubs. But actually I don't like the idea anymore at second glance.

You are right, this would not be a good idea.
 -- richard
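The remark that "*any* I/O activity will be noticed" on slow storage can be illustrated with a small worked estimate. This is only a sketch under stated assumptions: I/Os already handed to the disk are served one at a time and cannot be preempted by a later, higher-priority request, and the disk manages about 90 small random I/Os per second (an assumed figure for a 7,200 rpm SATA drive). The function name is made up for illustration; it does not model the actual ZFS scheduler.

```python
def worst_case_wait_ms(ios_ahead: int, device_iops: float) -> float:
    """Time a newly arrived request waits if ios_ahead requests are served first."""
    service_time_ms = 1000.0 / device_iops
    return ios_ahead * service_time_ms


# Compare the limit of 10 scrub I/Os per device with smaller hypothetical limits.
for cap in (10, 5, 2, 1):
    wait = worst_case_wait_ms(ios_ahead=cap, device_iops=90)
    print(f"{cap:2d} scrub I/Os already queued: payload I/O waits up to {wait:.0f} ms")
```

With ten scrub I/Os already queued, a payload request could wait on the order of a tenth of a second under these assumptions, which would explain why even lowest-priority scrub traffic is visible on disks of this class.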
Hello again,

I am still concerned whether my points are being well taken.

> If you are concerned that a single 200TB pool would take a long
> time to scrub, then use more pools and scrub in parallel.

The main concern is not scrub time. Scrub time could be weeks if scrub would just behave. You may imagine that there are applications where segmentation is a pain point, too.

> The scrub will queue no more than 10 I/Os at one time to a device, so devices which
> can handle concurrent I/O are not consumed entirely by scrub I/O. This could be
> tuned lower, but your storage is slow and *any* I/O activity will be noticed.

There are a couple of things I maybe don't understand, then.

- zpool iostat is reporting more than 1k of outputs during the scrub
- throughput is as high as can be until maxing out CPU
- nominal I/O capacity of a single device is still around 90, so how can 10 I/Os already bring down payload?
- scrubbing the same pool, configured as raidz1, didn't max out CPU, which is no surprise (haha, slow storage...); the notable part is that it didn't slow down payload that much either.
- scrub is obviously fine with data added or deleted during a pass. So, it could be possible to pause and resume a pass, couldn't it?

My conclusion from these observations is that not only disk speed counts here, but other bottlenecks may strike as well. Solving the issue by the wallet is one way, solving it by configuration of parameters is another. So, is there a lever for scrub I/O prio, or not? Is there a possibility to pause scrub passes and resume them?

Regards,
Tonmaus
Richard Elling
2010-Mar-16  03:44 UTC
[zfs-discuss] How to manage scrub priority or defer scrub?
On Mar 14, 2010, at 11:25 PM, Tonmaus wrote:
> Hello again,
>
> I am still concerned if my points are being well taken.
>
>> If you are concerned that a single 200TB pool would take a long
>> time to scrub, then use more pools and scrub in parallel.
>
> The main concern is not scrub time. Scrub time could be weeks if scrub just would behave. You may imagine that there are applications where segmentation is a pain point, too.

I agree.

>> The scrub will queue no more than 10 I/Os at one time to a device, so devices which
>> can handle concurrent I/O are not consumed entirely by scrub I/O. This could be
>> tuned lower, but your storage is slow and *any* I/O activity will be noticed.
>
> There are a couple of things I maybe don't understand, then.
>
> - zpool iostat is reporting more than 1k of outputs while scrub

ok

> - throughput is as high as can be until maxing out CPU

You would rather your CPU be idle? What use is an idle CPU, besides wasting energy :-)?

> - nominal I/O capacity of a single device is still around 90, how can 10 I/Os already bring down payload

90 IOPS is approximately the worst-case rate for a 7,200 rpm disk for a small, random workload. ZFS tends to write sequentially, so "random writes" tend to become "sequential writes" on ZFS. So it is quite common to see scrub workloads with >> 90 IOPS.

> - scrubbing the same pool, configured as raidz1 didn't max out CPU which is no surprise (haha, slow storage...) the notable part is that it didn't slow down payload that much either.

raidz creates more, smaller writes than a mirror or simple stripe. If the disks are slow, then the IOPS will be lower and the scrub takes longer, but the I/O scheduler can manage the queue better (disks are slower).

> - scrub is obviously fine with data added or deleted during a pass. So, it could be possible to pause and resume a pass, couldn't it?

You can start or stop scrubs; there is no resume directive. There are several bugs/RFEs along these lines, something like:
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6743992

> My conclusion from these observations is that not only disk speed counts here, but other bottlenecks may strike as well. Solving the issue by the wallet is one way, solving it by configuration of parameters is another. So, is there a lever for scrub I/O prio, or not? Is there a possibility to pause scrub passed and resume?

Scrub is already the lowest priority. Would you like it to be lower? I think the issue is more related to which queue is being managed by the ZFS priority scheduler rather than the lack of scheduling priority.

 -- richard
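The "approximately 90 IOPS" figure for a 7,200 rpm disk can be reproduced with a common rule-of-thumb calculation: one small random I/O costs roughly an average seek plus half a rotation. The sketch below is only an estimate; the 7 ms average seek time is an assumed value for a SATA drive of that era, not a number from the thread, and transfer time is ignored.

```python
def random_iops(rpm: float, avg_seek_ms: float) -> float:
    """Approximate small-random-I/O rate of a single spinning disk."""
    rotation_ms = 60_000.0 / rpm                  # time for one full revolution
    avg_rotational_latency_ms = rotation_ms / 2   # on average, wait half a turn
    service_ms = avg_seek_ms + avg_rotational_latency_ms
    return 1000.0 / service_ms


# Assumption: 7 ms average seek time.
iops = random_iops(rpm=7200, avg_seek_ms=7.0)
print(f"estimated random IOPS: {iops:.0f}")
```

Sequential access avoids most of the seek and rotational delay, which is why the much higher operation counts reported by zpool iostat during a scrub do not contradict this per-disk figure.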
Hi Richard,

> > - scrubbing the same pool, configured as raidz1 didn't max out CPU which is no surprise (haha, slow
> > storage...) the notable part is that it didn't slow down payload that much either.
>
> raidz creates more, smaller writes than a mirror or simple stripe. If the disks are slow,
> then the IOPS will be lower and the scrub takes longer, but the I/O scheduler can
> manage the queue better (disks are slower).

This wasn't mirror vs. raidz but raidz1 vs. raidz2, where the latter maxes out CPU and the former maxes out physical disk I/O. Concurrent payload degradation isn't that extreme on raidz1 pools, as it seems. Hence the CPU theory that you still seem to be reluctant to follow.

> There are several bugs/RFEs along these lines, something like:
> http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6743992

Thanks for pointing this out. As it seems, it has been a problem for a couple of years already. Obviously, the opinion is shared that this is a management problem, not a HW issue.

As a Project Manager I will soon have to make a purchase decision for an archival storage system (A/V media), and one of the options we are looking into is a SAMFS/QFS solution including tiers on disk with ZFS. I will have to make up my mind whether the pool sizes we are looking into (typically we will need 150-200 TB) are really manageable under the current circumstances when we think about including zfs scrub in the picture. From what I have learned here it rather looks as if there will be an extra challenge, if not even a problem, for the system integrator. That's unfortunate.

Regards,
Tonmaus
In following this discussion, I get the feeling that you and Richard are somewhat talking past each other. He asked you about the hardware you are currently running on, whereas you seem to be interested in a model for the impact of scrubbing on I/O throughput that you can apply to some not-yet-acquired hardware.

It should be clear by now that the model you are looking for does not exist given how new ZFS is, and Richard has been focusing his comments on your existing (home) configuration since that is what you provided specs for.

Since you haven't provided specs for this larger system you may be purchasing in the future, I don't think anyone can give you specific guidance on what the I/O impact of scrubs on your configuration will be. Richard seems to be giving more design guidelines and hints, and just generally good to know information to keep in mind while designing your solution. Of course, he's been giving it in the context of your 11 disk wide RAIDZ2 and not the 200 TB monster you only described in the last e-mail.

Stepping back, it may be worthwhile to examine the advice Richard has given, in the context of the larger configuration.

First, you won't be using commodity hardware for your enterprise-class storage system, will you?

Second, I would imagine that as a matter of practice, most people schedule their pools to scrub as far away from prime hours as possible. Maybe it's possible, and maybe it's not. The question to the larger community should be "who is running a 100+ TB pool and how have you configured your scrubs?" Or even "for those running 100+ TB pools, do your scrubs interfere with your production traffic/throughput? If so, how do you compensate for this?"

Third, as for ZFS scrub prioritization, Richard answered your question about that. He said it is low priority and can be tuned lower.
However, he was answering within the context of an 11 disk RAIDZ2 with slow disks. His exact words were:

"This could be tuned lower, but your storage is slow and *any* I/O activity will be noticed."

If you had asked about a 200 TB enterprise-class pool, he may have had a different response. I don't know if ZFS will make different decisions regarding I/O priority on commodity hardware as opposed to enterprise hardware, but I imagine it does *not*. If I'm mistaken, someone should correct me. Richard also said "In b133, the priority scheduler will work better than on older releases." That may not be an issue since you haven't acquired your hardware YET, but again, Richard didn't know that you were talking about a 200 TB behemoth because you never said that.

Fourth, Richard mentioned a wide RAIDZ2 set. Hopefully, if nothing else, we've seen that designing larger ZFS storage systems with pools composed of smaller top level VDEVs works better, and preferably mirrored top level VDEVs in the case of lots of small, random reads. You didn't indicate the profile of the data to be stored on your system, so no one can realistically speak to that. I think the general guidance is sound. Multiple top level VDEVs, preferably mirrors. If you're creating RAIDZ2 top level VDEVs, then they should be smaller (narrower) in terms of the number of disks in the set. 11 would be too many, based on what I have seen and heard on this list cross referenced with the (little) information you have provided.

RAIDZ2 would appear to require more CPU power than RAIDZ, based on the report you gave, and thus RAIDZ may have less negative impact on the performance of your storage system. I'll cop to that. However, you never mentioned how your 200 TB behemoth system will be used, besides an off-hand remark about CIFS. Will it be serving CIFS? NFS? Raw ZVOLs over iSCSI? You never mentioned any of that. Asking about CIFS if you're not going to serve CIFS doesn't make much sense.
That would appear to be another question for the ZFS gurus here -- WHY does RAIDZ2 cause so much negative performance impact on your CIFS service while RAIDZ does not? Your experience is that a scrub of a RAIDZ2 maxed CPU while a RAIDZ scrub did not, right?

Fifth, the pool scrub should probably be as far away from peak usage times as possible. That may or may not be feasible, but I don't think anyone would disagree with that advice. Again, I know there are people running large pools who perform scrubs. It might be worthwhile to directly ask what these people have experienced in terms of scrub performance on RAIDZ vs. RAIDZ2, or in general.

Finally, I would also note that Richard has been very responsive to your questions (in his own way) but you increasingly seem to be hostile and even disrespectful toward him. (I've noticed this in more than one of your e-mails; they sound progressively more self-centered and selfish. That's just my opinion.) If this is a community, that's not a helpful way to treat a senior member of the community, even if he's not answering the question you want answered.

Keep in mind that asking the wrong questions is the leading cause of wrong answers, as a former boss of mine likes to say. Maybe you would have gotten responses you found more useful and less insulting had you phrased your questions in a different way?

And no, Richard doesn't need me to defend him, especially since I don't know him from a can of paint. Your attacks (for lack of a better word) on him seem unwarranted and I can't stay silent about it any longer.

Anyway, hopefully that helps in some way. And hopefully you'll get how you're appearing to others who are reading your words. Right now, in MY opinion alone, you look like a n00b who isn't respectful enough to deserve the help of anyone here.
--
"You can choose your friends, you can choose the deals." - Equity Private
"If Linux is faster, it's a Solaris bug." - Phil Harman
Blog - http://whatderass.blogspot.com/
Twitter - @khyron4eva
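The advice to keep scrubs away from peak hours can be sketched as a small scheduling helper. This is a hedged illustration, not an established tool: the function names, the pool name "tank" and the weekend-only policy are all assumptions made for the example. The only real commands referenced are `zpool scrub <pool>` to start a scrub and `zpool scrub -s <pool>` to stop one. As Richard noted, there is no resume directive, so a stopped scrub presumably has to start over, which makes this approach questionable for pools whose scrub does not fit into one window.

```python
from datetime import datetime
from typing import List, Optional


def in_scrub_window(now: datetime) -> bool:
    """True on Saturday or Sunday (datetime.weekday() returns 5 or 6)."""
    return now.weekday() >= 5


def scrub_command(pool: str, stop: bool = False) -> List[str]:
    """Build the argv for starting a scrub, or for stopping one with -s."""
    if stop:
        return ["zpool", "scrub", "-s", pool]
    return ["zpool", "scrub", pool]


def action_for(now: datetime, pool: str, scrub_running: bool) -> Optional[List[str]]:
    """Decide which command, if any, a periodic job should run right now."""
    if in_scrub_window(now) and not scrub_running:
        return scrub_command(pool)
    if not in_scrub_window(now) and scrub_running:
        return scrub_command(pool, stop=True)
    return None


# "tank" is a hypothetical pool name.
print(action_for(datetime(2010, 3, 14, 2, 0), "tank", scrub_running=False))
print(action_for(datetime(2010, 3, 16, 9, 0), "tank", scrub_running=True))
```

A periodic job (cron, for example) could call `action_for` and pass the result to `subprocess.run`; whether a scrub is currently running would have to be determined separately, for instance from `zpool status`.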
Bruno Sousa
2010-Mar-16  16:02 UTC
[zfs-discuss] How to manage scrub priority or defer scrub?
Well...i can only say "well said".

BTW i have a raidz2 with 9 vdevs with 4 disks each (sata enterprise disks) and the scrub of the pool takes between 12 to 39 hours..depends on the workload of the server. So far it's acceptable but each case is a case i think...

Bruno

On 16-3-2010 14:04, Khyron wrote:
> In following this discussion, I get the feeling that you and Richard are somewhat talking past each other.
> [...]
Bob Friesenhahn
2010-Mar-16  16:38 UTC
[zfs-discuss] How to manage scrub priority or defer scrub?
On Tue, 16 Mar 2010, Tonmaus wrote:
>
> This wasn't mirror vs. raidz but raidz1 vs. raidz2, whereas the
> latter maxes out CPU and the former maxes out physical disc I/O.
> Concurrent payload degradation isn't that extreme on raidz1 pools,
> as it seems. Hence, the CPU theory that you still seem to be
> reluctant to follow.

If CPU is maxed out then that usually indicates some severe problem with choice of hardware or a misbehaving device driver. Modern systems have an abundance of CPU.

I don't think that the size of the pool is particularly significant since zfs scrubs in a particular order and scrub throughput is dictated by access times and bandwidth. In fact there should be less impact from scrub in a larger pool (even though scrub may take much longer) since the larger pool will have more vdevs. The vdev design is most important. Too many drives per vdev leads to poor performance, particularly if the drives are huge sluggish ones.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
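The point about vdev design can be made concrete with a rough comparison of two layouts built from the same number of disks. The sketch relies on a widely quoted rule of thumb, treated here as an assumption: each raidz vdev delivers roughly the small-random-read IOPS of a single disk, so pool IOPS scale with the number of vdevs rather than the number of disks. The 90 IOPS per disk and the 36-disk examples are illustrative values only.

```python
def raidz_layout(total_disks: int, disks_per_vdev: int, parity: int,
                 disk_iops: float = 90.0) -> dict:
    """Summarise a pool made of equally sized raidz vdevs."""
    vdevs = total_disks // disks_per_vdev
    data_disks = vdevs * (disks_per_vdev - parity)
    return {
        "vdevs": vdevs,
        "data_disks": data_disks,
        # Rule-of-thumb assumption: one disk's worth of random IOPS per vdev.
        "approx_random_read_iops": vdevs * disk_iops,
    }


narrow = raidz_layout(total_disks=36, disks_per_vdev=4, parity=2)
wide = raidz_layout(total_disks=36, disks_per_vdev=12, parity=2)

print("9 x 4-disk raidz2: ", narrow)
print("3 x 12-disk raidz2:", wide)
```

The narrow layout gives up usable capacity in exchange for more vdevs, and under this rule of thumb more headroom for payload I/O while a scrub is running.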
Even if it might not be the best technical solution, I think what a lot of people are looking for when this comes up is a knob they can use to say "I only want X IOPS per vdev" (in addition to low prioritization) to be used while scrubbing. Doing so probably helps them feel more at ease that they have some excess capacity on cpu and vdev if production traffic should come along.

That's probably a false sense of moderating resource usage when the current "full speed, but lowest prioritization" is just as good and would finish quicker.. but, it gives them peace of mind?
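The "X IOPS per vdev" knob described here is essentially a rate limiter. As a purely hypothetical sketch of what such a knob could do (it is not a ZFS feature discussed in this thread, and the class and parameter names are invented), a token bucket admits at most a configured number of scrub I/Os per second, plus a small burst:

```python
class TokenBucket:
    """Hypothetical per-vdev limiter: roughly rate_per_sec I/Os per second."""

    def __init__(self, rate_per_sec: float, burst: float, now: float = 0.0):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = burst      # start with a full bucket
        self.last = now

    def try_issue(self, now: float) -> bool:
        """Return True if one I/O may be issued at time now (in seconds)."""
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# Offer one scrub I/O every 10 ms for about a second, capped at 20 IOPS.
limiter = TokenBucket(rate_per_sec=20, burst=5)
offered = 100
issued = sum(limiter.try_issue(now=i * 0.01) for i in range(offered))
print(f"issued {issued} of {offered} offered scrub I/Os")
```

As the post itself points out, a hard cap like this leaves the disks idle whenever there is no production traffic, so the scrub finishes later than it would under "full speed, but lowest prioritization".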
David Dyer-Bennet
2010-Mar-16  17:41 UTC
[zfs-discuss] How to manage scrub priority or defer scrub?
On Tue, March 16, 2010 11:53, thomas wrote:
> Even if it might not be the best technical solution, I think what a lot of
> people are looking for when this comes up is a knob they can use to say "I
> only want X IOPS per vdev" (in addition to low prioritization) to be used
> while scrubbing. Doing so probably helps them feel more at ease that they
> have some excess capacity on cpu and vdev if production traffic should
> come along.
>
> That's probably a false sense of moderating resource usage when the
> current "full speed, but lowest prioritization" is just as good and would
> finish quicker.. but, it gives them peace of mind?

I may have been reading too quickly, but I have the impression that at least some of the people not happy with the current prioritization were reporting severe impacts to non-scrub performance when a scrub was in progress. If that's the case, then they have a real problem; they're not just looking for more peace of mind in a hypothetical situation.
--
David Dyer-Bennet, dd-b at dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info
The issue as presented by Tonmaus was that a scrub was negatively impacting his RAIDZ2 CIFS performance, but he didn't see the same impact with RAIDZ. I'm not going to say whether that is a "problem" one way or the other; it may be expected behavior under the circumstances. That's for ZFS developers to speak on. (This was one of many issues Tonmaus mentioned.)

However, what was lost was the context. Tonmaus reported this behavior on a commodity server using slow disks in an 11 disk RAIDZ2 set. However, he *really* wants to know if this will be an issue on a 100+ TB pool. So his examples were given on a pool that was possibly 5% of the size of the pool that he actually wants to deploy. He never said any of this in the original e-mail, so Richard assumed the context to be the smaller system. That's why I pointed out all of the discrepancies and questions he could/should have asked which might have yielded more useful answers.

There's quite a difference between the 11 disk RAIDZ2 set and a 100+ TB ZFS pool, especially when the use case, VDEV layout and other design aspects of the 100+ TB pool have not been described.

--
"You can choose your friends, you can choose the deals." - Equity Private
"If Linux is faster, it's a Solaris bug." - Phil Harman
Blog - http://whatderass.blogspot.com/
Twitter - @khyron4eva
Hello,

> In following this discussion, I get the feeling that
> you and Richard are somewhat talking past each
> other.

Talking past each other is a problem I have noted and remarked on earlier. I have to admit that I got frustrated about the discussion narrowing down to a certain perspective that was quite the opposite of my own observations and of what I had initially described. It may be that I have been harsher than I should have been. Please accept my apology.

I was trying from the outset to obtain a perspective on the matter that is independent of an actual configuration. I firmly believe that the scrub function is more meaningful if it can be applied in a variety of implementations. I think, however, that the insight that there seem to be no specific scrub management functions is transferable from a commodity implementation to an enterprise configuration.

Regards,

Tonmaus
--
This message posted from opensolaris.org
> If CPU is maxed out then that usually indicates some severe problem
> with choice of hardware or a misbehaving device driver. Modern
> systems have an abundance of CPU.

AFAICS the CPU loads are only high while scrubbing a double parity pool. I have no indication of a technical misbehaviour with the exception of dismal concurrent performance.

What I cannot get past is the notion that even if I *had* a storage configuration with 20 times more I/O capacity, it would still max out any CPU I could buy that is better than the single L5410 I am currently running on. I am seeing CPU performance being a pain point on any "software" based array I have used so far. From SOHO NAS boxes (the usual Thecus stuff) to NetApp 3200 filers, all exposed a nominal performance drop once parity configurations were employed.

Performance of the L5410 is abundant for the typical operation of my system, btw. It can easily saturate the dual 1000 Mbit NICs for iSCSI and CIFS services. I am slightly reluctant to buy a second L5410 just to provide more headroom during maintenance operations, as the device will be idle otherwise, consuming power.

Regards,

Tonmaus
--
This message posted from opensolaris.org
Bob Friesenhahn
2010-Mar-16  23:07 UTC
[zfs-discuss] How to manage scrub priority or defer scrub?
On Tue, 16 Mar 2010, Tonmaus wrote:
> AFAICS the CPU loads are only high while scrubbing a double parity
> pool. I have no indication of a technical misbehaviour with the
> exception of dismal concurrent performance.

This seems pretty weird to me. I have not heard anyone else complain about this sort of problem before in the several years I have been on this list. Are you sure that you didn't also enable something which does consume lots of CPU such as enabling some sort of compression, sha256 checksums, or deduplication?

> running from currently. I am seeing CPU performance being a pain
> point on any "software" based array I have used so far. From SOHO
> NAS boxes (the usual Thecus stuff) to NetApp 3200 filers, all
> exposed a nominal performance drop once parity configurations were
> employed.

The main concern that one should have is I/O bandwidth rather than CPU consumption since "software" based RAID must handle the work using the system's CPU rather than expecting it to be done by some other CPU. There are more I/Os and (in the case of mirroring) more data transferred.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
> Are you sure that you didn't also enable something which
> does consume lots of CPU such as enabling some sort
> of compression, sha256 checksums, or deduplication?

None of them is active on that pool or in any existing file system. Maybe the issue is particular to RAIDZ2, which is comparably recent. On that occasion: does anybody know if ZFS reads all parities during a scrub? Wouldn't it be sufficient for stale corruption detection to read only one parity set unless an error occurs there?

> The main concern that one should have is I/O bandwidth rather than CPU
> consumption since "software" based RAID must handle the work using the
> system's CPU rather than expecting it to be done by some other CPU.
> There are more I/Os and (in the case of mirroring) more data
> transferred.

What I am trying to say is that CPU may become the bottleneck for I/O in the case of parity-secured stripe sets. Mirrors and simple stripe sets have almost zero impact on CPU. Those are my observations so far, at least. Moreover, x86 processors are not optimized for that kind of work as much as, e.g., an Areca controller with a dedicated XOR chip is in its targeted field.

Regards,

Tonmaus
--
This message posted from opensolaris.org
Bob Friesenhahn
2010-Mar-17  14:31 UTC
[zfs-discuss] How to manage scrub priority or defer scrub?
On Tue, 16 Mar 2010, Tonmaus wrote:
>
> None of them is active on that pool or in any existing file system.
> Maybe the issue is particular to RAIDZ2, which is comparably recent.
> On that occasion: does anybody know if ZFS reads all parities during
> a scrub? Wouldn't it be sufficient for stale corruption detection to
> read only one parity set unless an error occurs there?

Zfs scrub reads and verifies everything. That is its purpose.

> What I am trying to say is that CPU may become the bottleneck for
> I/O in case of parity-secured stripe sets. Mirrors and simple stripe
> sets have almost 0 impact on CPU. So far at least my observations.
> Moreover, x86 processors not optimized for that kind of work as much
> as i.e. an Areca controller with a dedicated XOR chip is, in its
> targeted field.

It would be astonishing if the XOR algorithm consumed very much CPU with modern CPUs. Zfs's own checksum is more brutal than XOR. The scrub re-assembles full (usually 128K) data blocks and verifies the zfs checksum.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
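[Editor's illustration] To make the "more brutal than XOR" remark concrete, here is a minimal sketch that assumes a Fletcher-4 style checksum: four running sums over 32-bit words, each wrapping at 64 bits. Which checksum a given dataset really uses depends on its checksum property and the release, and the function names below are illustrative only, not taken from the ZFS source.

```python
import struct

MASK64 = (1 << 64) - 1  # accumulators wrap at 64 bits


def fletcher4(data: bytes) -> tuple:
    """Fletcher-4 style checksum over little-endian 32-bit words."""
    if len(data) % 4:
        raise ValueError("length must be a multiple of 4 bytes")
    a = b = c = d = 0
    for (word,) in struct.iter_unpack("<I", data):
        # Four dependent additions per word, versus one XOR below.
        a = (a + word) & MASK64
        b = (b + a) & MASK64
        c = (c + b) & MASK64
        d = (d + c) & MASK64
    return a, b, c, d


def xor_words(data: bytes) -> int:
    """Plain XOR of all 32-bit words, for comparison."""
    acc = 0
    for (word,) in struct.iter_unpack("<I", data):
        acc ^= word
    return acc


block = struct.pack("<2I", 1, 2)
swapped = struct.pack("<2I", 2, 1)
print(fletcher4(block), fletcher4(swapped))
print(xor_words(block), xor_words(swapped))
```

The running sums make the result depend on word order, so swapped words change the checksum, while a plain XOR cannot tell the two blocks apart. That extra detection power is paid for with more arithmetic per word.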
Hi,

I got a message from you off-list that doesn't show up in the thread even after hours. As you mentioned here as well the aspect I'd like to respond to, I'll do it from here:

> Third, as for ZFS scrub prioritization, Richard
> answered your question about that. He said it is
> low priority and can be tuned lower. However, he was
> answering within the context of an 11 disk RAIDZ2
> with slow disks. His exact words were:
>
> > This could be tuned lower, but your storage
> > is slow and *any* I/O activity will be noticed.

Richard told us two times that scrub already is as low in priority as can be. From another message:

"Scrub is already the lowest priority. Would you like it to be lower?"

As far as the comparison between "slow" and "fast" storage goes: I have understood Richard's message to be that with storage providing better random I/O, zfs priority scheduling will perform significantly better, providing less degradation of concurrent load. While I am even inclined to buy that, nobody will be able to tell me how a certain system will behave until it has been tested, and to what degree concurrent scrubbing will still be possible.

Another thing: people are talking a lot about narrow vdevs and mirrors. However, when you need to build a 200 TB pool you end up with a lot of disks in the first place. You will need at least double failover resilience for such a pool. If one did that with mirrors, ending up with approx. 600 TB gross to provide 200 TB net capacity is definitely NOT an option.

Regards,

Tonmaus
--
This message posted from opensolaris.org
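[Editor's illustration] The 600 TB figure follows from simple ratio arithmetic: a mirror that survives two failures keeps three copies of everything. A rough sketch of that calculation follows. The function name and the 9-wide raidz2 comparison are assumptions for illustration, and metadata, padding and spare overhead are ignored.

```python
def gross_capacity_tb(net_tb: float, data_disks: int, redundancy_disks: int) -> float:
    """Raw capacity needed so that net_tb remains after redundancy."""
    width = data_disks + redundancy_disks
    return net_tb * width / data_disks


# Double-failure tolerance with mirrors means 3-way mirrors (1 data + 2 copies).
three_way_mirror = gross_capacity_tb(200, data_disks=1, redundancy_disks=2)
# Hypothetical 9-wide raidz2 vdevs (7 data + 2 parity) for comparison.
raidz2_9_wide = gross_capacity_tb(200, data_disks=7, redundancy_disks=2)

print(f"3-way mirror:  {three_way_mirror:.0f} TB gross")
print(f"9-wide raidz2: {raidz2_9_wide:.0f} TB gross")
```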
Ugh! I meant that to go to the list, so I'll probably re-send it for the benefit of everyone involved in the discussion. There were parts of that that I wanted others to read.

From a re-read of Richard's e-mail, maybe he meant that the number of I/Os queued to a device can be tuned lower and not the priority of the scrub (as I took him to mean). Hopefully Richard can clear that up. I personally stand corrected for mis-reading Richard there.

Of course the performance of a given system cannot be described until it is built. Again, my interpretation of your e-mail was that you were looking for a model for the performance of concurrent scrub and I/O load of a RAIDZ2 VDEV that you could scale up from your "test" environment of 11 disks to a 200+ TB behemoth. As I mentioned several times, I doubt such a model exists, and I have not seen anything published to that effect. I don't know how useful it would be if it did exist because the performance of your disks would be a critical factor. (Although *any* model beats no model any day.) Let's just face it. You're using a new storage system that has not been modeled. To get the model you seek, you will probably have to create it yourself.

(It's notable that most of the ZFS models that I have seen have been done by Richard. Of course, they were MTTDL models, not scrub vs. I/O performance models for different VDEV types.)

As for your point about building large pools from lots of mirror VDEVs, my response is "meh". I've said several times, and maybe you've missed it several times, that there may be pathologies for which YOU should open bugs. RAIDZ3 may exhibit the same kind of pathologies you observed with RAIDZ2. Apparently RAIDZ does not. I've also noticed (and I'm sure I'll be corrected if I'm mistaken) that there is not a limit on the number of VDEVs in a pool, but RAIDZ VDEVs with a single-digit number of disks are recommended. So there is nothing preventing you from building (for example) VDEVs from 1 TB disks. If you take 9 x 1 TB disks per VDEV, and use RAIDZ2, you get 7 TB usable. That means about 29 VDEVs to get 200 TB. Double the disk capacity and you can probably get to 15 top level VDEVs. (And you'll want that RAIDZ2 as well since I don't know if you could trust that many disks, whether enterprise or consumer.) However, that number of top level VDEVs sounds reasonable based on what others have reported. What's been proven to be "A Bad Idea(TM)" is putting lots of disks in a single VDEV.

Remember that ZFS is a *new* software system. It is complex. It will have bugs. You have chosen ZFS; it didn't choose you. So I'd say you can contribute to the community by reporting back your experiences, opening bugs on things which make sense to open bugs on, testing configurations, modeling, documenting and sharing. So far, you just seem to be interested in taking w/o so much as an offer of helping the community or developers to understand what works and what doesn't. All take and no give is not cool. And if you don't like ZFS, then choose something else. I'm sure EMC or NetApp will willingly sell you all the spindles you want. However, I think it is still early to write off ZFS as a losing proposition, but that's my opinion.

So far, you seem to be spending a lot of time complaining about a *new* software system that you're not paying for. That's pretty tasteless, IMO.

And now I'll re-send that e-mail...

P.S.: Did you remember to re-read this e-mail? Read it 2 or 3 times and be clear about what I said and what I did _not_ say.

--
"You can choose your friends, you can choose the deals." - Equity Private
"If Linux is faster, it's a Solaris bug." - Phil Harman
Blog - http://whatderass.blogspot.com/
Twitter - @khyron4eva
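[Editor's illustration] The VDEV counts quoted in the message above (about 29 with 1 TB disks, about 15 with 2 TB disks) can be reproduced with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the helper name is made up, and it ignores metadata overhead, hot spares and the difference between marketed and formatted capacity.

```python
import math


def raidz_vdevs_needed(target_tb, disks_per_vdev, disk_tb, parity=2):
    """Return (usable TB per vdev, vdev count, total disks) for a target size."""
    usable_per_vdev = (disks_per_vdev - parity) * disk_tb
    count = math.ceil(target_tb / usable_per_vdev)
    return usable_per_vdev, count, count * disks_per_vdev


for disk_tb in (1, 2):
    usable, vdevs, disks = raidz_vdevs_needed(200, disks_per_vdev=9, disk_tb=disk_tb)
    print(f"{disk_tb} TB disks: {usable} TB per vdev, {vdevs} vdevs, {disks} disks")
```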
For those following along, this is the e-mail I meant to send to the list but instead sent directly to Tonmaus. My mistake, and I apologize for having to re-send.

=== Start ==

My understanding, limited though it may be, is that a scrub touches ALL data that has been written, including the parity data. It confirms the validity of every bit that has been written to the array. Now, there may be an implementation detail that is responsible for the pathology that you observed. More than likely, I'd imagine. Filing a bug may be in order. Since triple parity RAIDZ exists now, you may want to test with that by grabbing a LiveCD or LiveUSB image from genunix.org. Maybe RAIDZ3 has the same (or worse) problems?

As for "scrub management", I pointed out the specific responses from Richard where he noted that scrub I/O priority *can* be tuned. How you do that, I'm not sure. Richard, how does one tune scrub I/O priority? Other than that, as I said, I don't think there is a model (publicly available anyway) describing scrub behavior and how it scales with pool size (< 5 TB, 5 TB - 50 TB, > 50 TB, etc.) or data layout (mirror vs. RAIDZ vs. RAIDZ2). ZFS is really that new, that all of this needs to be reconsidered and modeled. Maybe this is something you can contribute to the community? ZFS is a new storage system, not the same old file systems whose behaviors and quirks are well known because of 20+ years of history. We're all writing a new chapter in data storage here, so it is incumbent upon us to share knowledge in order to answer these types of questions.

I think the questions I raised in my longer response are also valid and need to be re-considered. There are large pools in production today. So how are people scrubbing these pools? Please post your experiences with scrubbing 100+ TB pools.

Tonmaus, maybe you should repost my other questions in a new, separate thread?

=== End ==

--
"You can choose your friends, you can choose the deals." - Equity Private
"If Linux is faster, it's a Solaris bug." - Phil Harman
Blog - http://whatderass.blogspot.com/
Twitter - @khyron4eva
Richard Elling
2010-Mar-18  11:51 UTC
[zfs-discuss] How to manage scrub priority or defer scrub?
On Mar 16, 2010, at 4:41 PM, Tonmaus wrote:
>> Are you sure that you didn't also enable something which
>> does consume lots of CPU such as enabling some sort
>> of compression, sha256 checksums, or deduplication?
>
> None of them is active on that pool or in any existing file system. Maybe the issue is particular to RAIDZ2, which is comparably recent. On that occasion: does anybody know if ZFS reads all parities during a scrub?

Yes

> Wouldn't it be sufficient for stale corruption detection to read only one parity set unless an error occurs there?

No, because the parity itself is not verified.

>> The main concern that one should have is I/O bandwidth rather than CPU
>> consumption since "software" based RAID must handle the work using the
>> system's CPU rather than expecting it to be done by some other CPU.
>> There are more I/Os and (in the case of mirroring) more data
>> transferred.
>
> What I am trying to say is that CPU may become the bottleneck for I/O in case of parity-secured stripe sets. Mirrors and simple stripe sets have almost 0 impact on CPU. So far at least my observations. Moreover, x86 processors not optimized for that kind of work as much as i.e. an Areca controller with a dedicated XOR chip is, in its targeted field.

All x86 processors you care about do XOR at memory bandwidth speed. XOR is one of the simplest instructions to implement on a microprocessor. The need for a dedicated XOR chip for older "hardware RAID" systems is because they use very slow processors with low memory bandwidth. Cheap is as cheap does :-)

However, the issue for raidz2 and above (including RAID-6) is that the second parity is a more computationally complex Reed-Solomon code, not a simple XOR. So there is more computing required and that would be reflected in the CPU usage.

-- richard
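[Editor's illustration] The difference between the two parities can be sketched in a few lines. This is the textbook RAID-6 construction (P as plain XOR, Q as a weighted sum in GF(2^8) with the reduction polynomial 0x11d), offered as an assumption about the general technique and not as a copy of the ZFS raidz2 code. All names are illustrative.

```python
from functools import reduce


def gf_mul2(x: int) -> int:
    """Multiply a byte by 2 in GF(2^8), reducing by the polynomial 0x11d."""
    x <<= 1
    if x & 0x100:
        x ^= 0x11D
    return x & 0xFF


def p_parity(columns):
    """First parity: byte-wise XOR of all data columns."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*columns))


def q_parity(columns):
    """Second parity: sum of 2^i * D_i in GF(2^8), evaluated by Horner's rule."""
    q = bytearray(len(columns[0]))
    for column in reversed(columns):
        for i, byte in enumerate(column):
            # One field multiplication plus one XOR per byte, versus
            # a single XOR per byte for P.
            q[i] = gf_mul2(q[i]) ^ byte
    return bytes(q)


data = [bytes([0x01]), bytes([0x80]), bytes([0xFF])]  # made-up one-byte columns
print(p_parity(data).hex(), q_parity(data).hex())
```

Each byte of Q costs a conditional shift-and-reduce on top of the XOR, which is the kind of extra work that would show up as additional CPU load when every stripe is verified during a scrub.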
> > On that occasion: does anybody know if ZFS reads all parities
> > during a scrub?
>
> Yes
>
> > Wouldn't it be sufficient for stale corruption detection to read
> > only one parity set unless an error occurs there?
>
> No, because the parity itself is not verified.

Aha. Well, my understanding was that a scrub basically means reading all data and comparing it with the parities, which means that these have to be re-computed. Is that correct?

Regards,

Tonmaus
--
This message posted from opensolaris.org
Daniel Carosone
2010-Mar-18  20:53 UTC
[zfs-discuss] How to manage scrub priority or defer scrub?
On Thu, Mar 18, 2010 at 05:21:17AM -0700, Tonmaus wrote:
> > No, because the parity itself is not verified.
>
> Aha. Well, my understanding was that a scrub basically means reading
> all data, and compare with the parities, which means that these have
> to be re-computed. Is that correct?

A scrub does, yes. It reads all data and metadata and checksums and verifies they're correct.

A read of the pool might not - for example, it might:
 - read only one side of a mirror
 - read only one instance of a ditto block (metadata or copies>1)
 - use cached copies of data or metadata; for a long-running system it might be a long time since some metadata blocks were ever read, if they're frequently used.

Roughly speaking, reading through the filesystem does the least work possible to return the data. A scrub does the most work possible to check the disks (and returns none of the data).

For the OP: scrub issues low-priority IO (and the details of how much and how low have changed a few times along the version trail). However, that prioritisation applies only within the kernel; sata disks don't understand the prioritisation, so once the requests are with the disk they can still saturate out other IOs that made it to the front of the kernel's queue faster. If you're looking for something to tune, you may want to look at limiting the number of concurrent IOs handed to the disk to try and avoid saturating the heads.

You also want to confirm that your disks are on an NCQ-capable controller (eg sata rather than cmdk) otherwise they will be severely limited to processing one request at a time, at least for reads if you have write-cache on (they will be saturated at the stop-and-wait channel, long before the heads).

--
Dan.
Hello Dan,

Thank you very much for this interesting reply.

> roughly speaking, reading through the filesystem does the least work
> possible to return the data. A scrub does the most work possible to
> check the disks (and returns none of the data).

Thanks for the clarification. That's what I had thought.

> For the OP: scrub issues low-priority IO (and the details of how much
> and how low have changed a few times along the version trail).

Is there any documentation about this, besides source code?

> However, that prioritisation applies only within the kernel; sata disks
> don't understand the prioritisation, so once the requests are with the
> disk they can still saturate out other IOs that made it to the front
> of the kernel's queue faster.

I am not sure what you are hinting at. I initially thought about TCQ vs. NCQ when I read this. But I am not sure which detail of TCQ would allow for I/O discrimination that NCQ doesn't have. All I know about command queueing is that it is about optimising DMA strategies and optimising the handling of the I/O requests currently issued with respect to what to do first to return all data in the least possible time. (??)

> If you're looking for something to tune, you may want to look at
> limiting the number of concurrent IO's handed to the disk to try and
> avoid saturating the heads.

Indeed, that was what I had in mind. With the addition that I think it is necessary as well to avoid saturating other components, such as CPU.

> You also want to confirm that your disks are on an NCQ-capable
> controller (eg sata rather than cmdk) otherwise they will be severely
> limited to processing one request at a time, at least for reads if you
> have write-cache on (they will be saturated at the stop-and-wait
> channel, long before the heads).

I have two systems here, a production system that is on LSI SAS (mpt) controllers, and another one that is on ICH-9 (ahci). Disks are SATA-2. The plan was that this combo will have NCQ support. On the other hand, do you know if there is a method to verify that it is functioning?

Best regards,

Tonmaus
--
This message posted from opensolaris.org
Daniel Carosone
2010-Mar-19  05:37 UTC
[zfs-discuss] How to manage scrub priority or defer scrub?
On Thu, Mar 18, 2010 at 09:54:28PM -0700, Tonmaus wrote:> > (and the details of how much and how low have changed a few times > > along the version trail). > > Is there any documentation about this, besides source code?There are change logs and release notes, and random blog postings along the way - they''re less structured but often more informative. There were some good descriptions about the scrub improvements 6-12 months ago. The bugid''s listed in change logs that mention scrub should be pretty simple to find and sequence with versions.> > However, that prioritisation applies only within the kernel; sata > > disks don''t understand the prioritisation, so once the requests > > are with the disk they can still saturate out other IOs that made > > it to the front of the kernel''s queue faster. > > I am not sure what you are hinting at. I initially thought about TCQ > vs. NCQ when I read this. But I am not sure which detail of TCQ > would allow for I/O discrimination that NCQ doesn''t have.Er, the point was exactly that there is no discrimination, once the request is handed to the disk. If the internal-to-disk queue is enough to keep the heads saturated / seek bound, then a new high-priority-in-the-kernel request will get to the disk sooner, but may languish once there. You''ll get best overall disk throughput by letting the disk firmware optimise seeks, but your priority request won''t get any further preference. Shortening the list of requests handed to the disk in parallel may help, and still keep the channel mostly busy, perhaps at the expense of some extra seek length and lower overall throughput. You can shorten the number of outstanding IO''s per vdev for the pool overall, or preferably the number scrub will generate (to avoid penalising all IO). 
The tunables for each of these should be found readily, probably in the Evil Tuning Guide.> All I know about command cueing is that it is about optimising DMA > strategies and optimising the handling of the I/O requests currently > issued in respect to what to do first to return all data in the > least possible time. (??)Mostly, as above it''s about giving the disk controller more than one thing to work on at a time, and having the issuance of a request and its completion overlap with others, so the head movement can be optimised and the controller channel can be busy with data transfer for another while seeking. Disks with write cache effectively do this for writes, by pretending they complete immediately, but reads would block the channel until satisfied. (This is all for ATA which lacked this, before NCQ. SCSI has had these capabilities for a long time).> > If you''re looking for something to tune, you may want to look at > > limiting the number of concurrent IO''s handed to the disk to try > > and avoid saturating the heads. > > Indeed, that was what I had in mind. With the addition that I think > it is as well necessary to avoid saturating other components, such > as CPU.Less important, since prioritisation can be applied there too, but potentially also an issue. Perhaps you want to keep the cpu fan speed/noise down for a home server, even if the scrub runs longer.> I have two systems here, a production system that is on LSI SAS > (mpt) controllers, and another one that is on ICH-9 (ahci). Disks > are SATA-2. The plan was that this combo will have NCQ support. On > the other hand, do you know if there a method to verify if its > functioning?AHCI should be fine. In practice if you see actv > 1 (with a small margin for sampling error) then ncq is working. -- Dan. -------------- next part -------------- A non-text attachment was scrubbed... 
> > > sata disks don't understand the prioritisation, so
>
> Er, the point was exactly that there is no discrimination, once the
> request is handed to the disk.

So, do you say that SCSI drives do understand prioritisation (i.e. TCQ supports the schedule from ZFS), while SATA/NCQ drives don't, or is it just boiling down to what Richard told us, SATA disks being too slow?

> If the internal-to-disk queue is enough to keep the heads saturated /
> seek bound, then a new high-priority-in-the-kernel request will get
> to the disk sooner, but may languish once there.

Thanks. That makes sense to me.

> You can shorten the number of outstanding IOs per vdev for the pool
> overall, or preferably the number scrub will generate (to avoid
> penalising all IO).

That sounds like a meaningful approach to addressing bottlenecks caused by zpool scrub to me.

> The tunables for each of these should be found readily, probably in
> the Evil Tuning Guide.

I think I should try to digest the Evil Tuning Guide with respect to this topic. Thanks for pointing me in a direction. Maybe what you have suggested above (shortening the number of I/Os issued by scrub) is already possible? If not, I think it would be a meaningful improvement to request.

> Disks with write cache effectively do this [command queuing] for
> writes, by pretending they complete immediately, but reads would
> block the channel until satisfied. (This is all for ATA which lacked
> this, before NCQ. SCSI has had these capabilities for a long time).

As scrub is about reads, are you saying that this is still a problem with SATA/NCQ drives, or not? I am unsure what you mean at this point.

> > > limiting the number of concurrent IOs handed to the disk to try
> > > and avoid saturating the heads.
> >
> > Indeed, that was what I had in mind. With the addition that I think
> > it is as well necessary to avoid saturating other components, such
> > as CPU.
>
> Less important, since prioritisation can be applied there too, but
> potentially also an issue. Perhaps you want to keep the cpu fan
> speed/noise down for a home server, even if the scrub runs longer.

Well, the only thing that was really remarkable while scrubbing was CPU load constantly near 100%. I still think that is at least contributing to the collapse of concurrent payload. I.e., it's all about services that take place in the kernel: CIFS, ZFS, iSCSI... Mostly, it's about concurrent load within ZFS itself. That means an implicit trade-off while, for example, a file is being provided over CIFS.

> AHCI should be fine. In practice if you see actv > 1 (with a small
> margin for sampling error) then ncq is working.

OK, and how is that with respect to mpt? My assertion that mpt will support NCQ is mainly based on the marketing information provided by LSI that these controllers offer NCQ support with SATA drives. How (by which tool) do I get to this "actv" parameter?

Regards,

Tonmaus
Lutz Schumann
2010-May-01  12:28 UTC
[zfs-discuss] How to manage scrub priority or defer scrub?
I was going through this posting and it seems that there is some "personal tension" :). However, going back to the technical problem of scrubbing a 200 TB pool, I think this issue needs to be addressed.

One warning up front: this writing is rather long, and if you would like to jump to the part dealing with scrub, jump to "Scrub implementation" below.

From my perspective:

- ZFS is great for huge amounts of data
That's what it was made for, with 128-bit and JBOD design in mind. So ZFS is perfect for internet multimedia in terms of scalability.

- ZFS is great for commodity hardware
OK, you should use 24x7 drives, but 2 TB 7200 rpm disks are fine for internet media mass storage. We want huge amounts of data stored, and in the internet age nobody pays for this. So you must use low-cost hardware (well, it must be compatible), but you should not need enterprise components; that's what we have ZFS as clever software for. For mass storage internet services, the alternative is NOT EMC or NetApp (remember, nobody pays a lot for the services because you can get them free at Google); the alternative is Linux-based HW RAID (with its well-known limitations) and home-grown solutions. Those do not have the nice ZFS features mentioned below.

- ZFS guarantees data integrity by self-healing silent data corruption (that's what the checksums are for)
But only if you have redundancy. There are a lot of posts on the net describing when people notice the bad blocks: it happens when a disk in a RAID 5 fails and they have to resilver everything. Then you detect the missing redundancy. So people use RAID 6 and "hope" that everything works. Or people do scrubs on their advanced RAID controllers (if they provide internal checksumming). The same problem exists for huge, passive, raidz1 data sets in ZFS. If you do not scrub the array regularly, chances are higher that you will hit a bad block during resilvering, and then ZFS cannot help.
For active data sets the problem is not as critical, because the checksum is verified on every read. But the problem still exists, because once data is in the ARC nobody checks it. So we need scrub!

- ZFS can do many other nice things
There's compression, dedupe etc. However, I look at them as "nice to have".

- ZFS needs proper pool design
Using ZFS right is not easy, and sizing the system is even more complicated. There are a lot of threads regarding pool design. The easiest is to say "do a lot of mirrors", because then the read performance really scales. However, in internet mass media services you can't: it is too expensive, because mirrored ZFS is more expensive than HW RAID 6 with Linux. How many members to a vdev? Multiple pools or a single pool?

- ZFS is open and community based
... well, let's see how this goes with Oracle "financing" the whole stuff :)

And some of those points make ZFS a hit for internet service providers and mass media requirements (VOD etc.)!

So what's my point, you may ask? My experience with ZFS is that some points are simply not addressed well enough yet. BUT ZFS is a living piece of software, and thanks to the many great people developing it, it evolves faster than all the other storage solutions. So for the longer term I believe ZFS will (hopefully) have all the "enterprise-ness" it needs, and it will revolutionize the storage industry (like Cisco did in the 80's). I really believe that.

From my perspective some of the points not addressed well in ZFS are:

- Pool defragmentation - you need this for a COW filesystem
I think the ZFS developers are working on this with the background rewriter, so I hope it will come in 2010. With the rewriter, the on-disk layout can be optimized for read performance for sequential workloads, also for raidz1 and raidz2, meaning ZFS can compete with RAID 5 and RAID 6, also for wider vdevs. And wider vdevs mean more effective capacity.
If the vdev read-ahead cache works nicely with a sequentially aligned on-disk layout, then read performance (from disk) will be great.

- I/O prioritization for zvols / ZFS filesystems (aka storage QoS)
Unfortunately you cannot prioritize I/O to ZFS filesystems and zvols right now. I think this is one of the missing features that make ZFS unsuitable for 1st tier storage (like EMC Symmetrix or NetApp FAS6000 series). You need prioritization here, because your SAP system really is more important than my MP3 web server :)

- Deduplication not ready for production
Currently dedup is nice, but the DDT handling and memory sizing is tricky and hardly usable for larger pools (my perspective). The DDT is handled like any other component, meaning user I/O can push the DDT out of the ARC (and the L2ARC), even with primarycache=metadata and secondarycache=metadata. For typical mass media storage applications, the working set is much larger than the memory (and L2ARC), meaning your DDT will come from disk, causing real performance degradation. This is especially true for COMSTAR environments with small block sizes (8k etc.). Here the DDT becomes really huge. Once it no longer fits into memory: bummer. There is an open bug for this: http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6913566 and I hope it will be addressed soon.

- Scrub implementation
And that's the point mentioned in this thread. Currently you cannot manage scrub very well. Before snv_133 scrub was very aggressive. When scrub was running, you really had a dramatic performance penalty (more full strokes etc., latency goes up). After snv_133 scrub is less aggressive. That sounds nice, but it makes scrubbing take longer. This is bad if your disk failed. So either way it is not optimal, and I believe a system cannot automate this process very well. To schedule scrub I/O right, the storage system must "predict" the future I/O access pattern, and I believe this is not possible.
So scrub management must be manageable by the storage user.

For very large pools another problem comes up. You simply cannot scrub a 200 TB++ pool over the weekend; it WILL take longer. However, your users will come in on Monday and they want to work. Currently scrub cannot be prioritized, paused, aborted or resumed. This makes it very difficult to make the scrub manageable. If scrub could be prioritized, paused, resumed and aborted, the admin (or a management software like NexentaStor) could schedule the scrub according to the user policy (OK, we scrub weekdays 18:00 - 20:00 at low prio, 20:00 - 04:00 at high prio, 04:00 - 18:00 we do not scrub at all, BUT if a device is degraded we resilver with maximum priority).

So I think this feature would be VERY essential, because it makes the nice features of ZFS (checksumming, data integrity) really usable for enterprise use cases AND large data sets. Otherwise the features are "nice - but you can't really use them with more than 24 TB".

Conclusion (in the context of scrub):

- People want 200 TB++ pools for mass media applications
- People will use huge low-cost drives to store huge amounts of data
- People don't use mirrors because 50% effective capacity is not enough
- People NEED to scrub to repair bad blocks because of the cheap drives with high capacity

--> Scrub / resilver management needs to be improved.

There are (a lot of) open bugs for this:

- http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6743992
- http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6494473
...
- http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6888481

So I hope this will be addressed for scrub and resilver.

Regards,
Robert

P.S. Sorry for the rather long writing :)
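The example policy from the message above (18:00 - 20:00 low priority, 20:00 - 04:00 high priority, no scrub from 04:00 to 18:00, and maximum priority whenever a device is degraded) can be written down as a small lookup. This is only a sketch in Python of what a management layer might compute: the function name and the priority labels are hypothetical, it ignores the weekday/weekend distinction, and per the post there was no such priority control in ZFS to feed the result into.

```python
from datetime import time


def scrub_priority(now, degraded=False):
    """Map a time of day to a scrub priority under the example policy.

    Returns "max", "high", "low", or None when no scrub should run.
    The labels are illustrative, not real ZFS settings.
    """
    if degraded:
        return "max"              # a degraded device always resilvers at full speed
    if time(18, 0) <= now < time(20, 0):
        return "low"              # early evening: users may still be active
    if now >= time(20, 0) or now < time(4, 0):
        return "high"             # night window wraps past midnight
    return None                   # 04:00 - 18:00: leave the pool to the users


for t in (time(12, 0), time(19, 0), time(23, 30), time(3, 59)):
    print(t.strftime("%H:%M"), scrub_priority(t))
```

A scheduler built on this would also need pause and resume support in scrub itself, which is exactly what the post asks for.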