Dear zfs fellows,

during a specific test I have got the impression that scrub can have quite an impact on other I/O. CIFS throughput drops from 100 MB/s to 7 MB/s while a scrub is running on my main NAS. That is no surprise, as a scrub of my raidz2 pool maxes out the CPU on that machine (1 Xeon L5410).

I run scrubs during week-ends, so this is not a problem for me. I am asking myself, however, what will happen on larger pools where a scrub pass will take days to weeks. Obviously, zfs file systems are much more scalable than CPU power ever will be.

Hence, I see a requirement to manage scrub activity so that trade-offs can be made to maintain availability and performance of the pool. Does anybody know how this is done?

Thanks in advance for any hints,

Regards,

Tonmaus
Richard Elling
2010-Mar-14 04:54 UTC
[zfs-discuss] How to manage scrub priority or defer scrub?
On Mar 13, 2010, at 4:54 AM, Tonmaus wrote:
> during a specific test I have got the impression that scrub can have quite an impact on other I/O. CIFS throughput drops from 100 MB/s to 7 MB/s while a scrub is running on my main NAS. [...]
> Hence, I see a requirement to manage scrub activity so that trade-offs can be made to maintain availability and performance of the pool. Does anybody know how this is done?

This is noticeable on systems with slow disks and is worse on releases prior to b129 or so. Can you describe your hardware and release?
 -- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
Atlanta, March 16-18, 2010 http://nexenta-atlanta.eventbrite.com
Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com
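For anyone wanting to report this kind of information about their own box, it can usually be collected with a few standard commands; a sketch (the pool name "tank" is a placeholder):

    # OS build and kernel
    cat /etc/release
    uname -a

    # pool layout and on-disk version
    zpool status tank
    zpool get version tank

    # CPU and memory
    psrinfo -pv
    prtconf | grep -i memory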
Hi Richard,

these are:
- 11x WD1002FBYS (7,200 rpm SATA drives) in 1 raidz2 group
- 4 GB RAM
- 1 CPU (Xeon L5410)
- snv_133 (where the current array was created as well)

Regards,

Tonmaus
Richard Elling
2010-Mar-14 16:33 UTC
[zfs-discuss] How to manage scrub priority or defer scrub?
On Mar 14, 2010, at 12:16 AM, Tonmaus wrote:
> these are:
> - 11x WD1002FBYS (7,200 rpm SATA drives) in 1 raidz2 group
> - 4 GB RAM
> - 1 CPU (Xeon L5410)
> - snv_133 (where the current array was created as well)

These are slow drives and the configuration will have poor random read performance. Do not expect blazing fast scrubs. In b133, the priority scheduler will work better than on older releases. But it may not be enough to overcome a very wide raidz2 set.
 -- richard
Hi Richard,

thanks for the answer. I am aware of the properties of my configuration and how it will scale, but let me stress that this is not the point of the discussion. The question should rather be whether scrubbing can co-exist with payload, or whether we are thrown back to scrubbing in the after-hours.

So, do I have to conclude that zfs is not able to make good decisions about load prioritisation on commodity hardware and that there are no further options available to tweak scrub load impact, or are there other options? I am thinking about managing pools with a hundred times the capacity of mine (currently there are 3.7 TB on disk, and it takes 2.5 h to scrub them on the double-parity pool) that would be practically un-scrub-able. (Yes, enterprise HW is faster, but enterprise service windows are much narrower as well... you can't move around or offline 200 TB of live data for days just because you need to scrub the disks, can you?)

The only idea I could come up with myself is to exchange individual drives in a round-robin fashion all the time and use resilver instead of full scrubs. But at second glance I don't like that idea anymore.

Regards,

Tonmaus
Richard Elling
2010-Mar-14 20:08 UTC
[zfs-discuss] How to manage scrub priority or defer scrub?
On Mar 14, 2010, at 11:45 AM, Tonmaus wrote:
> So, do I have to conclude that zfs is not able to make good decisions about load prioritisation on commodity hardware and that there are no further options available to tweak scrub load impact, or are there other options?

ZFS prioritizes I/O. Scrub has the lowest priority. The scrub will queue no more than 10 I/Os at one time to a device, so devices which can handle concurrent I/O are not consumed entirely by scrub I/O. This could be tuned lower, but your storage is slow and *any* I/O activity will be noticed.

> I am thinking about managing pools with a hundred times the capacity of mine (currently there are 3.7 TB on disk, and it takes 2.5 h to scrub them on the double-parity pool) that would be practically un-scrub-able. (Yes, enterprise HW is faster, but enterprise service windows are much narrower as well... you can't move around or offline 200 TB of live data for days just because you need to scrub the disks, can you?)

I can't follow your logic here. Scrub is a low-priority process and should be done at infrequent intervals. If you are concerned that a single 200 TB pool would take a long time to scrub, then use more pools and scrub in parallel.

> The only idea I could come up with myself is to exchange individual drives in a round-robin fashion all the time and use resilver instead of full scrubs. But at second glance I don't like that idea anymore.

You are right, this would not be a good idea.
 -- richard
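The "10 I/Os per device, could be tuned lower" remark maps onto a kernel tunable whose name and default vary by build; on snv-era bits it is commonly reported as zfs_scrub_limit, but treat that name and the values below as assumptions to verify against your own release (e.g. via the Evil Tuning Guide) rather than a documented interface. A rough sketch:

    # read the current scrub queue limit (assumed tunable name; verify on your build)
    echo "zfs_scrub_limit/D" | mdb -k

    # lower it on the running kernel, effective immediately
    echo "zfs_scrub_limit/W0t5" | mdb -kw

    # persistent form, in /etc/system:
    #   set zfs:zfs_scrub_limit = 5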
Hello again,

I am still concerned that my points are not being taken well.

> If you are concerned that a single 200 TB pool would take a long time to scrub, then use more pools and scrub in parallel.

The main concern is not scrub time. Scrub time could be weeks if scrub just would behave. You may imagine that there are applications where segmentation is a pain point, too.

> The scrub will queue no more than 10 I/Os at one time to a device, so devices which can handle concurrent I/O are not consumed entirely by scrub I/O. This could be tuned lower, but your storage is slow and *any* I/O activity will be noticed.

There are a couple of things I maybe don't understand, then.

- zpool iostat reports more than 1k read operations while scrubbing
- throughput is as high as it can be, until the CPU maxes out
- the nominal I/O capacity of a single device is still around 90 IOPS; how can 10 I/Os already bring down payload?
- scrubbing the same pool configured as raidz1 didn't max out the CPU, which is no surprise (haha, slow storage...); the notable part is that it didn't slow down payload that much either
- scrub is obviously fine with data added or deleted during a pass, so it could be possible to pause and resume a pass, couldn't it?

My conclusion from these observations is that not only disk speed counts here; other bottlenecks may strike as well. Solving the issue by the wallet is one way, solving it by configuration of parameters is another. So, is there a lever for scrub I/O priority or not? Is there a possibility to pause a scrub pass and resume it?

Regards,

Tonmaus
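For reference, the observations above are the kind of thing you see when watching the pool while a scrub runs; a minimal sketch (pool name "tank" is a placeholder):

    # per-vdev operations and bandwidth, sampled every 5 seconds
    zpool iostat -v tank 5

    # scrub progress, rate and estimated completion time
    zpool status tank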
Richard Elling
2010-Mar-16 03:44 UTC
[zfs-discuss] How to manage scrub priority or defer scrub?
On Mar 14, 2010, at 11:25 PM, Tonmaus wrote:
> The main concern is not scrub time. Scrub time could be weeks if scrub just would behave. You may imagine that there are applications where segmentation is a pain point, too.

I agree.

> - zpool iostat reports more than 1k read operations while scrubbing

ok

> - throughput is as high as it can be, until the CPU maxes out

You would rather your CPU be idle? What use is an idle CPU, besides wasting energy :-)?

> - the nominal I/O capacity of a single device is still around 90 IOPS; how can 10 I/Os already bring down payload?

90 IOPS is approximately the worst-case rate for a 7,200 rpm disk for a small, random workload. ZFS tends to write sequentially, so "random writes" tend to become "sequential writes" on ZFS. So it is quite common to see scrub workloads with >> 90 IOPS.

> - scrubbing the same pool configured as raidz1 didn't max out the CPU, which is no surprise (haha, slow storage...); the notable part is that it didn't slow down payload that much either

raidz creates more, smaller writes than a mirror or simple stripe. If the disks are slow, then the IOPS will be lower and the scrub takes longer, but the I/O scheduler can manage the queue better (disks are slower).

> - scrub is obviously fine with data added or deleted during a pass, so it could be possible to pause and resume a pass, couldn't it?

You can start or stop scrubs; there is no resume directive. There are several bugs/RFEs along these lines, something like:
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6743992

> My conclusion from these observations is that not only disk speed counts here; other bottlenecks may strike as well. Solving the issue by the wallet is one way, solving it by configuration of parameters is another. So, is there a lever for scrub I/O priority or not? Is there a possibility to pause a scrub pass and resume it?

Scrub is already the lowest priority. Would you like it to be lower? I think the issue is more related to which queue is being managed by the ZFS priority scheduler rather than the lack of scheduling priority.
 -- richard
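The start/stop controls mentioned above are the only scrub controls exposed at the command line in this era; stopping discards progress rather than pausing it. For reference:

    # start a scrub of pool 'tank' (placeholder name)
    zpool scrub tank

    # stop (cancel) the running scrub; there is no resume, a new scrub starts over
    zpool scrub -s tank

    # check progress
    zpool status tank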
Hi Richard,

> raidz creates more, smaller writes than a mirror or simple stripe. If the disks are slow, then the IOPS will be lower and the scrub takes longer, but the I/O scheduler can manage the queue better (disks are slower).

This wasn't mirror vs. raidz but raidz1 vs. raidz2, whereas the latter maxes out CPU and the former maxes out physical disk I/O. Concurrent payload degradation isn't that extreme on raidz1 pools, as it seems. Hence the CPU theory that you still seem to be reluctant to follow.

> There are several bugs/RFEs along these lines, something like:
> http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6743992

Thanks for pointing at this. As it seems, it has been a problem for a couple of years already. Obviously, the opinion is shared that this is a management problem, not a HW issue.

As a Project Manager I will soon have to take a purchase decision for an archival storage system (A/V media), and one of the options we are looking into is a SAMFS/QFS solution including tiers on disk with ZFS. I will have to make up my mind whether the pool sizes we are looking at (typically we will need 150-200 TB) are really manageable under the current circumstances when we think about including zfs scrub in the picture. From what I have learned here, it rather looks as if there will be an extra challenge, if not even a problem, for the system integrator. That's unfortunate.

Regards,

Tonmaus
In following this discussion, I get the feeling that you and Richard are somewhat talking past each other. He asked you about the hardware you are currently running on, whereas you seem to be interested in a model for the impact of scrubbing on I/O throughput that you can apply to some not-yet-acquired hardware.

It should be clear by now that the model you are looking for does not exist, given how new ZFS is, and Richard has been focusing his comments on your existing (home) configuration since that is what you provided specs for. Since you haven't provided specs for the larger system you may be purchasing in the future, I don't think anyone can give you specific guidance on what the I/O impact of scrubs on your configuration will be. Richard seems to be giving more design guidelines and hints, and just generally good-to-know information to keep in mind while designing your solution. Of course, he's been giving it in the context of your 11-disk-wide RAIDZ2 and not the 200 TB monster you only described in the last e-mail.

Stepping back, it may be worthwhile to examine the advice Richard has given in the context of the larger configuration.

First, you won't be using commodity hardware for your enterprise-class storage system, will you?

Second, I would imagine that as a matter of practice, most people schedule their pools to scrub as far away from prime hours as possible. Maybe it's possible, and maybe it's not. The question to the larger community should be "who is running a 100+ TB pool and how have you configured your scrubs?" Or even "for those running 100+ TB pools, do your scrubs interfere with your production traffic/throughput? If so, how do you compensate for this?"

Third, as for ZFS scrub prioritization, Richard answered your question about that. He said it is low priority and can be tuned lower. However, he was answering within the context of an 11-disk RAIDZ2 with slow disks. His exact words were: "This could be tuned lower, but your storage is slow and *any* I/O activity will be noticed." If you had asked about a 200 TB enterprise-class pool, he may have had a different response. I don't know if ZFS will make different decisions regarding I/O priority on commodity hardware as opposed to enterprise hardware, but I imagine it does *not*. If I'm mistaken, someone should correct me. Richard also said "In b133, the priority scheduler will work better than on older releases." That may not be an issue since you haven't acquired your hardware YET, but again, Richard didn't know that you were talking about a 200 TB behemoth because you never said that.

Fourth, Richard mentioned a wide RAIDZ2 set. Hopefully, if nothing else, we've seen that designing larger ZFS storage systems with pools composed of smaller top-level VDEVs works better, and preferably mirrored top-level VDEVs in the case of lots of small, random reads. You didn't indicate the profile of the data to be stored on your system, so no one can realistically speak to that. I think the general guidance is sound: multiple top-level VDEVs, preferably mirrors. If you're creating RAIDZ2 top-level VDEVs, then they should be smaller (narrower) in terms of the number of disks in the set. 11 would be too many, based on what I have seen and heard on this list, cross-referenced with the (little) information you have provided.

RAIDZ2 would appear to require more CPU power than RAIDZ, based on the report you gave, and thus RAIDZ may have less negative impact on the performance of your storage system. I'll cop to that.
However, you never mentioned how your 200 TB behemoth system will be used, besides an off-hand remark about CIFS. Will it be serving CIFS? NFS? Raw ZVOLs over iSCSI? You never mentioned any of that. Asking about CIFS if you're not going to serve CIFS doesn't make much sense. That would appear to be another question for the ZFS gurus here -- WHY does RAIDZ2 cause so much negative performance impact on your CIFS service while RAIDZ does not? Your experience is that a scrub of a RAIDZ2 maxed CPU while a RAIDZ scrub did not, right?

Fifth, the pool scrub should probably be as far away from peak usage times as possible. That may or may not be feasible, but I don't think anyone would disagree with that advice. Again, I know there are people running large pools who perform scrubs. It might be worthwhile to directly ask what these people have experienced in terms of scrub performance on RAIDZ vs. RAIDZ2, or in general.

Finally, I would also note that Richard has been very responsive to your questions (in his own way), but you increasingly seem to be hostile and even disrespectful toward him. (I've noticed this in more than one of your e-mails; they sound progressively more self-centered and selfish. That's just my opinion.) If this is a community, that's not a helpful way to treat a senior member of the community, even if he's not answering the question you want answered.

Keep in mind that asking the wrong questions is the leading cause of wrong answers, as a former boss of mine likes to say. Maybe you would have gotten responses you found more useful and less insulting had you phrased your questions in a different way?

And no, Richard doesn't need me to defend him, especially since I don't know him from a can of paint. Your attacks (for lack of a better word) on him seem unwarranted and I can't stay silent about it any longer.

Anyway, hopefully that helps in some way. And hopefully you'll get how you're appearing to others who are reading your words. Right now, in MY opinion alone, you look like a n00b who isn't respectful enough to deserve the help of anyone here.
-- 
"You can choose your friends, you can choose the deals." - Equity Private
"If Linux is faster, it's a Solaris bug." - Phil Harman
Blog - http://whatderass.blogspot.com/
Twitter - @khyron4eva
Bruno Sousa
2010-Mar-16 16:02 UTC
[zfs-discuss] How to manage scrub priority or defer scrub?
Well... I can only say "well said".

BTW, I have a pool with 9 raidz2 vdevs of 4 disks each (enterprise SATA disks), and a scrub of the pool takes between 12 and 39 hours, depending on the workload of the server. So far it's acceptable, but each case is a case, I think...

Bruno
Bob Friesenhahn
2010-Mar-16 16:38 UTC
[zfs-discuss] How to manage scrub priority or defer scrub?
On Tue, 16 Mar 2010, Tonmaus wrote:
> This wasn't mirror vs. raidz but raidz1 vs. raidz2, whereas the latter maxes out CPU and the former maxes out physical disk I/O. Concurrent payload degradation isn't that extreme on raidz1 pools, as it seems. Hence the CPU theory that you still seem to be reluctant to follow.

If CPU is maxed out, then that usually indicates some severe problem with the choice of hardware or a misbehaving device driver. Modern systems have an abundance of CPU.

I don't think that the size of the pool is particularly significant, since zfs scrubs in a particular order and scrub throughput is dictated by access times and bandwidth. In fact, there should be less impact from scrub in a larger pool (even though the scrub may take much longer), since the larger pool will have more vdevs. The vdev design is most important. Too many drives per vdev leads to poor performance, particularly if the drives are huge, sluggish ones.

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
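To make the "more, narrower vdevs" point concrete, here is a hedged sketch of the same disks laid out as one wide raidz2 versus two narrower ones; the device names are placeholders and six disks per vdev is just one reasonable width:

    # one wide 12-disk raidz2 top-level vdev (long scrubs/resilvers, low IOPS):
    #   zpool create tank raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 \
    #                            c1t6d0 c1t7d0 c1t8d0 c1t9d0 c1t10d0 c1t11d0

    # the same 12 disks as two 6-disk raidz2 top-level vdevs; scrub and payload
    # I/O are then spread across twice as many vdevs
    zpool create tank \
        raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 \
        raidz2 c1t6d0 c1t7d0 c1t8d0 c1t9d0 c1t10d0 c1t11d0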
Even if it might not be the best technical solution, I think what a lot of people are looking for when this comes up is a knob they can use to say "I only want X IOPS per vdev" (in addition to low prioritization) to be used while scrubbing. Doing so probably helps them feel more at ease that they have some excess capacity on CPU and vdev if production traffic should come along.

That's probably a false sense of moderating resource usage when the current "full speed, but lowest prioritization" is just as good and would finish quicker... but it gives them peace of mind?
David Dyer-Bennet
2010-Mar-16 17:41 UTC
[zfs-discuss] How to manage scrub priority or defer scrub?
On Tue, March 16, 2010 11:53, thomas wrote:
> Even if it might not be the best technical solution, I think what a lot of people are looking for when this comes up is a knob they can use to say "I only want X IOPS per vdev" (in addition to low prioritization) to be used while scrubbing. Doing so probably helps them feel more at ease that they have some excess capacity on CPU and vdev if production traffic should come along.
>
> That's probably a false sense of moderating resource usage when the current "full speed, but lowest prioritization" is just as good and would finish quicker... but it gives them peace of mind?

I may have been reading too quickly, but I have the impression that at least some of the people not happy with the current prioritization were reporting severe impacts to non-scrub performance when a scrub was in progress. If that's the case, then they have a real problem; they're not just looking for more peace of mind in a hypothetical situation.
-- 
David Dyer-Bennet, dd-b at dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info
The issue as presented by Tonmaus was that a scrub was negatively impacting his RAIDZ2 CIFS performance, but he didn't see the same impact with RAIDZ. I'm not going to say whether that is a "problem" one way or the other; it may be expected behavior under the circumstances. That's for ZFS developers to speak on. (This was one of many issues Tonmaus mentioned.)

However, what was lost was the context. Tonmaus reported this behavior on a commodity server using slow disks in an 11-disk RAIDZ2 set. However, he *really* wants to know if this will be an issue on a 100+ TB pool. So his examples were given on a pool that was possibly 5% of the size of the pool that he actually wants to deploy. He never said any of this in the original e-mail, so Richard assumed the context to be the smaller system. That's why I pointed out all of the discrepancies and questions he could/should have asked which might have yielded more useful answers.

There's quite a difference between the 11-disk RAIDZ2 set and a 100+ TB ZFS pool, especially when the use case, VDEV layout and other design aspects of the 100+ TB pool have not been described.
Hello,

> In following this discussion, I get the feeling that you and Richard are somewhat talking past each other.

Talking past each other is a problem I have noted and remarked on earlier. I have to admit that I got frustrated as the discussion narrowed down to a certain perspective that was quite the opposite of my own observations and of what I had initially described. It may be that I have been harsher than I should have been. Please accept my apology.

I was trying from the outset to obtain a perspective on the matter that is independent of an actual configuration. I firmly believe that the scrub function is more meaningful if it can be applied in a variety of implementations. I think, however, that the insight that there seem to be no specific scrub management functions is transferable from a commodity implementation to an enterprise configuration.

Regards,

Tonmaus
> If CPU is maxed out, then that usually indicates some severe problem with the choice of hardware or a misbehaving device driver. Modern systems have an abundance of CPU.

AFAICS the CPU load is only high while scrubbing a double-parity pool. I have no indication of technical misbehaviour, with the exception of dismal concurrent performance. What I can't get past is the notion that even if I *had* a storage configuration with 20 times more I/O capacity, it would still max out any CPU I could buy, even one better than the single L5410 I am currently running. I have seen CPU performance being a pain point on any "software" based array I have used so far. From SOHO NAS boxes (the usual Thecus stuff) to NetApp 3200 filers, all showed a nominal performance drop once parity configurations were employed.

Performance of the L5410 is abundant for the typical operation of my system, btw. It can easily saturate the dual 1000 Mbit NICs for iSCSI and CIFS services. I am slightly reluctant to buy a second L5410 just to provide more headroom during maintenance operations, as the device will be idle otherwise, consuming power.

Regards,

Tonmaus
Bob Friesenhahn
2010-Mar-16 23:07 UTC
[zfs-discuss] How to manage scrub priority or defer scrub?
On Tue, 16 Mar 2010, Tonmaus wrote:
> AFAICS the CPU load is only high while scrubbing a double-parity pool. I have no indication of technical misbehaviour, with the exception of dismal concurrent performance.

This seems pretty weird to me. I have not heard anyone else complain about this sort of problem before in the several years I have been on this list. Are you sure that you didn't also enable something which does consume lots of CPU, such as some sort of compression, sha256 checksums, or deduplication?

> [...] I have seen CPU performance being a pain point on any "software" based array I have used so far. From SOHO NAS boxes (the usual Thecus stuff) to NetApp 3200 filers, all showed a nominal performance drop once parity configurations were employed.

The main concern that one should have is I/O bandwidth rather than CPU consumption, since "software" based RAID must handle the work using the system's CPU rather than expecting it to be done by some other CPU. There are more I/Os and (in the case of mirroring) more data transferred.

Bob
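A quick way to rule those features out is to check the relevant dataset and pool properties; a sketch ("tank" is a placeholder, and the dedup property only exists on builds new enough to have deduplication):

    # checksum, compression and dedup settings for the pool and all datasets
    zfs get -r checksum,compression,dedup tank

    # pool-wide dedup ratio; 1.00x means dedup has never taken effect
    zpool get dedupratio tank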
> Are you sure that you didn't also enable something which does consume lots of CPU, such as some sort of compression, sha256 checksums, or deduplication?

None of them is active on that pool or in any existing file system. Maybe the issue is particular to RAIDZ2, which is comparatively recent. On that occasion: does anybody know if ZFS reads all parities during a scrub? Wouldn't it be sufficient for stale-corruption detection to read only one parity set unless an error occurs there?

> The main concern that one should have is I/O bandwidth rather than CPU consumption, since "software" based RAID must handle the work using the system's CPU rather than expecting it to be done by some other CPU. There are more I/Os and (in the case of mirroring) more data transferred.

What I am trying to say is that the CPU may become the bottleneck for I/O in the case of parity-secured stripe sets. Mirrors and simple stripe sets have almost zero impact on CPU, at least in my observations so far. Moreover, x86 processors are not optimized for that kind of work as much as, e.g., an Areca controller with a dedicated XOR chip is in its targeted field.

Regards,

Tonmaus
Bob Friesenhahn
2010-Mar-17 14:31 UTC
[zfs-discuss] How to manage scrub priority or defer scrub?
On Tue, 16 Mar 2010, Tonmaus wrote:
> None of them is active on that pool or in any existing file system. Maybe the issue is particular to RAIDZ2, which is comparatively recent. On that occasion: does anybody know if ZFS reads all parities during a scrub? Wouldn't it be sufficient for stale-corruption detection to read only one parity set unless an error occurs there?

Zfs scrub reads and verifies everything. That is its purpose.

> What I am trying to say is that the CPU may become the bottleneck for I/O in the case of parity-secured stripe sets. Mirrors and simple stripe sets have almost zero impact on CPU, at least in my observations so far. Moreover, x86 processors are not optimized for that kind of work as much as, e.g., an Areca controller with a dedicated XOR chip is in its targeted field.

It would be astonishing if the XOR algorithm consumed very much CPU with modern CPUs. Zfs's own checksum is more brutal than XOR. The scrub re-assembles full (usually 128K) data blocks and verifies the zfs checksum.

Bob
Hi,

I got a message from you off-list that doesn't show up in the thread even after hours. As you mentioned the aspect I'd like to respond to here as well, I'll do it from here:

> Third, as for ZFS scrub prioritization, Richard answered your question about that. He said it is low priority and can be tuned lower. However, he was answering within the context of an 11-disk RAIDZ2 with slow disks. His exact words were:
>
> "This could be tuned lower, but your storage is slow and *any* I/O activity will be noticed."

Richard told us two times that scrub is already as low in priority as can be. From another message: "Scrub is already the lowest priority. Would you like it to be lower?"

As for the comparison between "slow" and "fast" storage: I understood Richard's message to be that with storage providing better random I/O, zfs priority scheduling will perform significantly better, causing less degradation of concurrent load. While I am even inclined to buy that, nobody will be able to tell me how a certain system will behave until it has been tested, and to what degree concurrent scrubbing will still be possible.

Another thing: people are talking a lot about narrow vdevs and mirrors. However, when you need to build a 200 TB pool you end up with a lot of disks in the first place, and you will need at least double failure resilience for such a pool. If one were to do that with mirrors, ending up with approx. 600 TB gross to provide 200 TB net capacity is definitely NOT an option.

Regards,

Tonmaus
Ugh! I meant that to go to the list, so I'll probably re-send it for the benefit of everyone involved in the discussion. There were parts of it that I wanted others to read.

From a re-read of Richard's e-mail, maybe he meant that the number of I/Os queued to a device can be tuned lower, and not the priority of the scrub (as I took him to mean). Hopefully Richard can clear that up. I personally stand corrected for mis-reading Richard there.

Of course the performance of a given system cannot be described until it is built. Again, my interpretation of your e-mail was that you were looking for a model for the performance of concurrent scrub and I/O load on a RAIDZ2 VDEV that you could scale up from your "test" environment of 11 disks to a 200+ TB behemoth. As I mentioned several times, I doubt such a model exists, and I have not seen anything published to that effect. I don't know how useful it would be if it did exist, because the performance of your disks would be a critical factor. (Although *any* model beats no model any day.)

Let's just face it. You're using a new storage system that has not been modeled. To get the model you seek, you will probably have to create it yourself. (It's notable that most of the ZFS models that I have seen have been done by Richard. Of course, they were MTTDL models, not scrub vs. I/O performance models for different VDEV types.)

As for your point about building large pools from lots of mirror VDEVs, my response is "meh". I've said several times, and maybe you've missed it several times, that there may be pathologies for which YOU should open bugs. RAIDZ3 may exhibit the same kind of pathologies you observed with RAIDZ2. Apparently RAIDZ does not. I've also noticed (and I'm sure I'll be corrected if I'm mistaken) that there is not a limit on the number of VDEVs in a pool, but single-digit RAID VDEVs are recommended. So there is nothing preventing you from building (for example) VDEVs from 1 TB disks. If you take 9 x 1 TB disks per VDEV and use RAIDZ2, you get 7 TB usable. That means about 29 VDEVs to get 200 TB. Double the disk capacity and you can probably get to 15 top-level VDEVs. (And you'll want that RAIDZ2 as well, since I don't know if you could trust that many disks, whether enterprise or consumer.) However, that number of top-level VDEVs sounds reasonable based on what others have reported. What's been proven to be "A Bad Idea(TM)" is putting lots of disks in a single VDEV.

Remember that ZFS is a *new* software system. It is complex. It will have bugs. You have chosen ZFS; it didn't choose you. So I'd say you can contribute to the community by reporting back your experiences, opening bugs on things which make sense to open bugs on, testing configurations, modeling, documenting and sharing. So far, you just seem to be interested in taking without so much as an offer of helping the community or developers to understand what works and what doesn't. All take and no give is not cool.

And if you don't like ZFS, then choose something else. I'm sure EMC or NetApp will willingly sell you all the spindles you want. However, I think it is still early to write off ZFS as a losing proposition, but that's my opinion. So far, you seem to be spending a lot of time complaining about a *new* software system that you're not paying for. That's pretty tasteless, IMO.

And now I'll re-send that e-mail...

P.S.: Did you remember to re-read this e-mail? Read it 2 or 3 times and be clear about what I said and what I did _not_ say.
For those following along, this is the e-mail I meant to send to the list but instead sent directly to Tonmaus. My mistake, and I apologize for having to re-send.

=== Start ===

My understanding, limited though it may be, is that a scrub touches ALL data that has been written, including the parity data. It confirms the validity of every bit that has been written to the array. Now, there may be an implementation detail that is responsible for the pathology that you observed. More than likely, I'd imagine. Filing a bug may be in order. Since triple-parity RAIDZ exists now, you may want to test with that by grabbing a LiveCD or LiveUSB image from genunix.org. Maybe RAIDZ3 has the same (or worse) problems?

As for "scrub management", I pointed out the specific responses from Richard where he noted that scrub I/O priority *can* be tuned. How you do that, I'm not sure. Richard, how does one tune scrub I/O priority?

Other than that, as I said, I don't think there is a model (publicly available, anyway) describing scrub behavior and how it scales with pool size (< 5 TB, 5 TB - 50 TB, > 50 TB, etc.) or data layout (mirror vs. RAIDZ vs. RAIDZ2). ZFS is really that new, so all of this needs to be reconsidered and modeled. Maybe this is something you can contribute to the community? ZFS is a new storage system, not the same old file system whose behaviors and quirks are well known because of 20+ years of history. We're all writing a new chapter in data storage here, so it is incumbent upon us to share knowledge in order to answer these types of questions.

I think the questions I raised in my longer response are also valid and need to be re-considered. There are large pools in production today. So how are people scrubbing these pools? Please post your experiences with scrubbing 100+ TB pools. Tonmaus, maybe you should repost my other questions in a new, separate thread?

=== End ===
Richard Elling
2010-Mar-18 11:51 UTC
[zfs-discuss] How to manage scrub priority or defer scrub?
On Mar 16, 2010, at 4:41 PM, Tonmaus wrote:
> None of them is active on that pool or in any existing file system. Maybe the issue is particular to RAIDZ2, which is comparatively recent. On that occasion: does anybody know if ZFS reads all parities during a scrub?

Yes.

> Wouldn't it be sufficient for stale-corruption detection to read only one parity set unless an error occurs there?

No, because the parity itself is not verified.

> What I am trying to say is that the CPU may become the bottleneck for I/O in the case of parity-secured stripe sets. Mirrors and simple stripe sets have almost zero impact on CPU, at least in my observations so far. Moreover, x86 processors are not optimized for that kind of work as much as, e.g., an Areca controller with a dedicated XOR chip is in its targeted field.

All x86 processors you care about do XOR at memory bandwidth speed. XOR is one of the simplest instructions to implement on a microprocessor. The need for a dedicated XOR chip in older "hardware RAID" systems is because they use very slow processors with low memory bandwidth. Cheap is as cheap does :-)

However, the issue for raidz2 and above (including RAID-6) is that the second parity is a more computationally complex Reed-Solomon code, not a simple XOR. So there is more computing required, and that would be reflected in the CPU usage.
 -- richard
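One way to check whether the extra CPU during a raidz2 scrub really goes into the parity and checksum math, rather than a misbehaving driver, is a quick kernel profile taken while the scrub runs; a hedged sketch using stock Solaris tools (seeing vdev_raidz or checksum routines near the top of the output would support the theory, but the exact function names are an assumption):

    # profile kernel CPU for 30 seconds and print the hottest functions
    lockstat -kIW -D 20 sleep 30

    # per-thread user/system CPU breakdown, 5-second samples
    prstat -m 5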
> > On that occasion: does anybody know if ZFS reads all parities during a scrub?
>
> Yes.
>
> > Wouldn't it be sufficient for stale-corruption detection to read only one parity set unless an error occurs there?
>
> No, because the parity itself is not verified.

Aha. Well, my understanding was that a scrub basically means reading all data and comparing with the parities, which means that these have to be re-computed. Is that correct?

Regards,

Tonmaus
Daniel Carosone
2010-Mar-18 20:53 UTC
[zfs-discuss] How to manage scrub priority or defer scrub?
On Thu, Mar 18, 2010 at 05:21:17AM -0700, Tonmaus wrote:
> > No, because the parity itself is not verified.
>
> Aha. Well, my understanding was that a scrub basically means reading all data and comparing with the parities, which means that these have to be re-computed. Is that correct?

A scrub does, yes. It reads all data and metadata and checksums and verifies they're correct. A read of the pool might not - for example, it might:

- read only one side of a mirror
- read only one instance of a ditto block (metadata or copies>1)
- use cached copies of data or metadata; for a long-running system it might be a long time since some metadata blocks were ever read from disk, if they're frequently used.

Roughly speaking, reading through the filesystem does the least work possible to return the data. A scrub does the most work possible to check the disks (and returns none of the data).

For the OP: scrub issues low-priority IO (and the details of how much and how low have changed a few times along the version trail). However, that prioritisation applies only within the kernel; sata disks don't understand the prioritisation, so once the requests are with the disk they can still saturate out other IOs that made it to the front of the kernel's queue faster. If you're looking for something to tune, you may want to look at limiting the number of concurrent IOs handed to the disk, to try and avoid saturating the heads.

You also want to confirm that your disks are on an NCQ-capable controller (e.g. sata rather than cmdk); otherwise they will be severely limited to processing one request at a time, at least for reads if you have write-cache on (they will be saturated at the stop-and-wait channel, long before the heads).

-- 
Dan.
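A sketch of what "limiting the number of concurrent IOs handed to the disk" usually means in practice on builds of this vintage; zfs_vdev_max_pending is the commonly cited knob, but treat the name and the example value as assumptions to check against your release and the Evil Tuning Guide:

    # current per-device queue depth limit (assumed tunable name)
    echo "zfs_vdev_max_pending/D" | mdb -k

    # reduce it on the running kernel (shorter device queue = less head
    # saturation, usually at some cost in total throughput)
    echo "zfs_vdev_max_pending/W0t4" | mdb -kw

    # persistent form, in /etc/system:
    #   set zfs:zfs_vdev_max_pending = 4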
Hello Dan,

Thank you very much for this interesting reply.

> Roughly speaking, reading through the filesystem does the least work
> possible to return the data. A scrub does the most work possible to
> check the disks (and returns none of the data).

Thanks for the clarification. That's what I had thought.

> For the OP: scrub issues low-priority I/O (and the details of how much
> and how low have changed a few times along the version trail).

Is there any documentation about this, besides the source code?

> However, that prioritisation applies only within the kernel; SATA disks
> don't understand the prioritisation, so once the requests are with the
> disk they can still crowd out other I/Os that made it to the front of
> the kernel's queue faster.

I am not sure what you are hinting at. I initially thought about TCQ vs. NCQ when I read this, but I am not sure which detail of TCQ would allow for I/O discrimination that NCQ doesn't have. All I know about command queueing is that it is about optimising DMA strategies and optimising the handling of the currently issued I/O requests with respect to what to do first in order to return all data in the least possible time. (??)

> If you're looking for something to tune, you may want to look at limiting
> the number of concurrent I/Os handed to the disk, to try and avoid
> saturating the heads.

Indeed, that was what I had in mind - with the addition that I think it is also necessary to avoid saturating other components, such as the CPU.

> You also want to confirm that your disks are on an NCQ-capable controller
> (e.g. sata rather than cmdk), otherwise they will be severely limited to
> processing one request at a time, at least for reads if you have the
> write cache on (they will be saturated at the stop-and-wait channel,
> long before the heads).

I have two systems here: a production system that is on LSI SAS (mpt) controllers, and another one that is on ICH-9 (ahci). Disks are SATA-2. The plan was that this combination would have NCQ support. On the other hand, do you know if there is a method to verify that it is functioning?

Best regards,

Tonmaus
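One quick way to at least see which HBA drivers the disks are actually sitting behind (a rough sketch only; the exact output obviously depends on the hardware and build):

   # which storage drivers are bound on this box?
   prtconf -D | egrep -i 'ahci|mpt|cmdk|sata'

   # SATA attachment points show up here when a SATA framework driver
   # (e.g. ahci) is in use
   cfgadm -al | grep sata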
Daniel Carosone
2010-Mar-19 05:37 UTC
[zfs-discuss] How to manage scrub priority or defer scrub?
On Thu, Mar 18, 2010 at 09:54:28PM -0700, Tonmaus wrote:
> > (and the details of how much and how low have changed a few times
> > along the version trail).
>
> Is there any documentation about this, besides the source code?

There are change logs and release notes, and random blog postings along the way - they're less structured but often more informative. There were some good descriptions of the scrub improvements 6-12 months ago. The bug IDs listed in change logs that mention scrub should be pretty simple to find and to sequence with versions.

> > However, that prioritisation applies only within the kernel; SATA
> > disks don't understand the prioritisation, so once the requests
> > are with the disk they can still crowd out other I/Os that made
> > it to the front of the kernel's queue faster.
>
> I am not sure what you are hinting at. I initially thought about TCQ
> vs. NCQ when I read this, but I am not sure which detail of TCQ
> would allow for I/O discrimination that NCQ doesn't have.

Er, the point was exactly that there is no discrimination once the request is handed to the disk. If the internal-to-disk queue is enough to keep the heads saturated / seek bound, then a new high-priority-in-the-kernel request will get to the disk sooner, but may languish once there. You'll get the best overall disk throughput by letting the disk firmware optimise seeks, but your priority request won't get any further preference.

Shortening the list of requests handed to the disk in parallel may help, and still keep the channel mostly busy, perhaps at the expense of some extra seek length and lower overall throughput. You can shorten the number of outstanding I/Os per vdev for the pool overall, or preferably the number scrub will generate (to avoid penalising all I/O). The tunables for each of these should be found readily, probably in the Evil Tuning Guide.

> All I know about command queueing is that it is about optimising DMA
> strategies and optimising the handling of the currently issued I/O
> requests with respect to what to do first in order to return all data
> in the least possible time. (??)

Mostly. As above, it's about giving the disk controller more than one thing to work on at a time, and having the issuance of a request and its completion overlap with others, so that the head movement can be optimised and the controller channel can be busy with data transfer for one request while another is seeking. Disks with the write cache on effectively do this for writes, by pretending they complete immediately, but reads would block the channel until satisfied. (This is all for ATA, which lacked this before NCQ. SCSI has had these capabilities for a long time.)

> > If you're looking for something to tune, you may want to look at
> > limiting the number of concurrent I/Os handed to the disk, to try
> > and avoid saturating the heads.
>
> Indeed, that was what I had in mind - with the addition that I think
> it is also necessary to avoid saturating other components, such as
> the CPU.

Less important, since prioritisation can be applied there too, but potentially also an issue. Perhaps you want to keep the CPU fan speed/noise down for a home server, even if the scrub runs longer.

> I have two systems here: a production system that is on LSI SAS
> (mpt) controllers, and another one that is on ICH-9 (ahci). Disks
> are SATA-2. The plan was that this combination would have NCQ support.
> On the other hand, do you know if there is a method to verify that it
> is functioning?

AHCI should be fine.
In practice, if you see actv > 1 (with a small margin for sampling error) then NCQ is working.

-- Dan.
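For reference, the tunables being alluded to here were, at around this time, zfs_vdev_max_pending (the per-vdev I/O queue depth) and zfs_scrub_limit (a cap on outstanding scrub I/Os per leaf vdev). Names and defaults have moved between builds, so treat the following only as a sketch with example values and check the Evil Tuning Guide for your release first:

   * /etc/system entries - take effect after a reboot
   * queue fewer I/Os per disk:
   set zfs:zfs_vdev_max_pending = 10
   * allow fewer outstanding scrub I/Os per leaf vdev:
   set zfs:zfs_scrub_limit = 5

   # ... or poke a running kernel with mdb (same example values); whether a
   # live change is picked up by an already-running scrub depends on the build
   echo zfs_vdev_max_pending/W0t10 | mdb -kw
   echo zfs_scrub_limit/W0t5 | mdb -kw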
> > > sata disks don't understand the prioritisation, so
>
> Er, the point was exactly that there is no discrimination once the
> request is handed to the disk.

So, are you saying that SCSI drives do understand prioritisation (i.e. TCQ supports the schedule from ZFS) while SATA/NCQ drives don't, or does it just boil down to what Richard told us, SATA disks being too slow?

> If the internal-to-disk queue is enough to keep the heads saturated /
> seek bound, then a new high-priority-in-the-kernel request will get to
> the disk sooner, but may languish once there.

Thanks. That makes sense to me.

> You can shorten the number of outstanding I/Os per vdev for the pool
> overall, or preferably the number scrub will generate (to avoid
> penalising all I/O).

That sounds like a meaningful approach to addressing bottlenecks caused by zpool scrub to me.

> The tunables for each of these should be found readily, probably in the
> Evil Tuning Guide.

I think I should try to digest the Evil Tuning Guide at some point with respect to this topic. Thanks for pointing me in that direction. Maybe what you have suggested above (shortening the number of I/Os issued by scrub) is already possible? If not, I think it would be a meaningful improvement to request.

> Disks with the write cache on effectively do this [command queueing] for
> writes, by pretending they complete immediately, but reads would block
> the channel until satisfied. (This is all for ATA, which lacked this
> before NCQ. SCSI has had these capabilities for a long time.)

As scrub is about reads, are you saying that this is still a problem with SATA/NCQ drives, or not? I am unsure what you mean at this point.

> > limiting the number of concurrent I/Os handed to the disk, to try
> > and avoid saturating the heads.
> >
> > Indeed, that was what I had in mind - with the addition that I think
> > it is also necessary to avoid saturating other components, such as
> > the CPU.
>
> Less important, since prioritisation can be applied there too, but
> potentially also an issue. Perhaps you want to keep the CPU fan
> speed/noise down for a home server, even if the scrub runs longer.

Well, the only thing that was really remarkable while scrubbing was CPU load constantly near 100%. I still think that is at least contributing to the collapse of concurrent payload. That is, it's all about services that take place in the kernel - CIFS, ZFS, iSCSI... - and mostly about concurrent load within ZFS itself. That means an implicit trade-off while a file is being provided over CIFS, for example.

> AHCI should be fine. In practice, if you see actv > 1 (with a small
> margin for sampling error) then NCQ is working.

OK, and how is that with respect to mpt? My assertion that mpt will support NCQ is mainly based on the marketing information provided by LSI that these controllers offer NCQ support with SATA drives. How (with which tool) do I get at this "actv" parameter?

Regards,

Tonmaus
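To answer that last question in passing: actv is a column of iostat's extended device statistics, and it is reported per device regardless of whether the disk hangs off ahci or mpt. Something along these lines, run while the pool is busy (e.g. during a scrub), is enough to check:

   # watch the actv column while the pool is under load
   iostat -xn 5
   # actv persistently > 1 for a disk  -> the drive is accepting multiple
   #                                      outstanding commands (NCQ/TCQ active)
   # actv stuck at about 1 under load  -> one command at a time, no queueing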
Lutz Schumann
2010-May-01 12:28 UTC
[zfs-discuss] How to manage scrub priority or defer scrub?
I was going through this posting and it seems that there is some "personal tension" :). However, going back to the technical problem of scrubbing a 200 TB pool, I think this issue needs to be addressed. One warning up front: this writing is rather long, and if you want to jump to the part dealing with scrub, jump to "Scrub implementation" below.

From my perspective:

- ZFS is great for huge amounts of data
That's what it was made for, with 128-bit addressing and a JBOD design in mind. So ZFS is perfect for internet multimedia in terms of scalability.

- ZFS is great for commodity hardware
OK, you should use 24x7 drives, but 2 TB 7200 rpm disks are fine for internet media mass storage. We want huge amounts of data stored, and in the internet age nobody pays for this. So you must use low-cost hardware (well, it must be compatible) - but you should not need enterprise components; that's what we have ZFS as clever software for. For mass-storage internet services, the alternative is NOT EMC or NetApp (remember, nobody pays a lot for the services because you can get them free at Google) - the alternative is Linux-based HW RAID (with its well-known limitations) and home-grown solutions. Those do not have the nice ZFS features mentioned below.

- ZFS guarantees data integrity by self-healing silent data corruption (that's what the checksums are for)
But only if you have redundancy. There are a lot of posts on the net about when people notice bad blocks - it happens when a disk in a RAID-5 fails and they have to resilver everything. Then you detect the missing redundancy. So people use RAID-6 and "hope" that everything works, or they do scrubs on their advanced RAID controllers (if those provide internal checksumming). The same problem exists for huge, passive, raidz1 data sets in ZFS. If you do not scrub the array regularly, chances are higher that you will hit a bad block during resilvering, and then ZFS cannot help. For active data sets the problem is not as critical, because on every read the checksum is verified - but still, because once data is in the ARC nobody checks it again, the problem exists. So we need scrub!

- ZFS can do many other nice things
There's compression, dedupe etc., but I look at them as "nice to have".

- ZFS needs proper pool design
Using ZFS right is not easy; sizing the system is even more complicated. There are a lot of threads regarding pool design - the easiest answer is "do a lot of mirrors", because then the read performance really scales. However, for internet mass-media services you can't - too expensive - because mirrored ZFS is more expensive than HW RAID-6 with Linux. How many members per vdev? Multiple pools or a single pool?

- ZFS is open and community based
... well, let's see how this goes with Oracle "financing" the whole stuff :)

And some of those points make ZFS a hit for internet service providers and mass-media requirements (VOD etc.)!

So what's my point, you may ask? My experience with ZFS is that some points are simply not addressed well enough yet - BUT ZFS is a living piece of software and, thanks to the many great people developing it, it evolves faster than all the other storage solutions. So for the longer term I believe ZFS will (hopefully) gain all the "enterprise-ness" it needs, and it will revolutionize the storage industry (like Cisco did for networking). I really believe that.

From my perspective, some of the points not addressed well in ZFS yet are:

- Pool defragmentation - you need this for a COW filesystem
I think the ZFS developers are working on this with the background rewriter.
So I hope it will come in 2010. With the rewriter, the on-disk layout can be optimized for read performance on sequential workloads - also for raidz1 and raidz2 - meaning ZFS can compete with RAID-5 and RAID-6, even for wider vdevs. And wider vdevs mean more effective capacity. If the vdev read-ahead cache works nicely with a sequentially aligned on-disk layout, then (from-disk) read performance will be great.

- I/O prioritization for zvols / zfs filesystems (aka storage QoS)
Unfortunately you cannot prioritize I/O to zfs filesystems and zvols right now. I think this is one of the features that make ZFS unsuitable for 1st-tier storage (like an EMC Symmetrix or the NetApp FAS6000 series). You need prioritization here - because your SAP system really is more important than my MP3 web server :)

- Deduplication not ready for production
Currently dedup is nice, but the DDT handling and memory sizing are tricky and hardly usable for larger pools (my perspective). The DDT is handled like any other component, meaning user I/O can push the DDT out of the ARC (and the L2ARC) - even with primarycache=secondarycache=metadata. For typical mass-media storage applications the working set is much larger than memory (and the L2ARC), meaning your DDT will come from disk - causing real performance degradation. This is especially true for COMSTAR environments with small block sizes (8k etc.); here the DDT becomes really huge. Once it no longer fits into memory - bummer. There is an open bug for this: http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6913566 and I hope it will be addressed soon.

- Scrub implementation
And that is the point mentioned in this thread. Currently you cannot manage scrub very well. Pre-snv_133, scrub was very aggressive: when a scrub was running you really had a dramatic performance penalty (more full strokes etc., latency goes up). Post-snv_133, scrub is less aggressive - sounds nice - but this makes scrubbing take longer, which is bad if a disk has failed. So either way it is not optimal, and I believe a system cannot automate this process very well. To schedule scrub I/O right, the storage system would have to "predict" the future I/O access pattern, and I believe this is not possible. So scrub activity must be manageable by the storage administrator.

For very large pools another problem comes up. You simply cannot scrub a 200 TB++ pool over the weekend - it WILL take longer. However, your users will come in on Monday and they want to work. Currently scrub cannot be prioritized, paused or resumed; it can only be cancelled outright, which throws away all progress. This makes scrub very difficult to manage. If scrub could be prioritized, paused, resumed and aborted, the admin (or management software like NexentaStor) could schedule the scrub according to a user policy (e.g. we scrub weekdays 18:00-20:00 at low priority, 20:00-04:00 at high priority, and 04:00-18:00 not at all - BUT if a device is degraded, we resilver with maximum priority); a rough cron-based approximation of such a window policy is sketched at the end of this message. So I think this feature would be VERY essential, because it is what makes the nice features of ZFS (checksumming, data integrity) really usable for enterprise use cases AND large data sets. Otherwise the features are "nice - but you can't really use them with more than 24 TB".

Conclusion (in the context of scrub):
- People want 200 TB++ pools for mass-media applications
- People will use huge low-cost drives to store huge amounts of data
- People don't use mirrors, because 50% effective
capacity is not enough
- People NEED to scrub to repair bad blocks, because of the cheap high-capacity drives

--> Scrub / resilver management needs to be improved.

There are a lot of open bugs for this:
- http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6743992
- http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6494473
...
- http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6888481

So I hope this will be addressed for scrub and resilver.

Regards,
Robert

P.s. Sorry for the rather long writing :)
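P.p.s. The window policy mentioned under "Scrub implementation" can only be crudely approximated today, since the one control that does exist is starting a scrub and cancelling it again (losing its progress). A rough sketch, with "tank" standing in for the real pool name:

   # root crontab: scrub only between 18:00 and 04:00 on weekdays,
   # cancelling whatever is still running at 04:00 (progress is discarded,
   # so this caps the window but does not make the scrub incremental)
   0 18 * * 1-5  /usr/sbin/zpool scrub tank
   0  4 * * *    /usr/sbin/zpool scrub -s tank

On a pool that cannot finish inside one window this of course never completes, which is exactly why real pause/resume support is needed.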