Hi,

I have a large pool (~50TB total, ~42TB usable), composed of 4 raidz1
volumes (of 7 x 2TB disks each):

# zpool iostat -v | grep -v c4
                capacity     operations    bandwidth
pool          used  avail   read  write   read  write
------------  -----  -----  -----  -----  -----  -----
backup        35.2T  15.3T    602    272  15.3M  11.1M
  raidz1      11.6T  1.06T    138     49  2.99M  2.33M
  raidz1      11.8T   845G    163     54  3.82M  2.57M
  raidz1      6.00T  6.62T    161     84  4.50M  3.16M
  raidz1      5.88T  6.75T    139     83  4.01M  3.09M
------------  -----  -----  -----  -----  -----  -----

Originally there were only the first two raidz1 volumes, and the two at
the bottom were added later.

You can notice that by the amount of used / free space. The first two
volumes have ~11TB used and ~1TB free, while the other two have around
~6TB used and ~6TB free.

I have hundreds of zfs'es storing backups from several servers. Each ZFS
has about 7 snapshots of older backups.

I have the impression I'm getting degradation in performance due to the
limited space in the first two volumes, especially the second, which has
only 845GB free.

Is there any way to re-stripe the pool, so I can take advantage of all
spindles across the raidz1 volumes? Right now it looks like the newer
volumes are doing the heavy lifting while the other two just hold old data.

Thanks,
Eduardo Bragatto
Short answer: No.

Long answer: Not without rewriting the previously written data. Data is
being striped over all of the top level VDEVs, or at least it should be.
But there is no way, at least not built into ZFS, to re-allocate the
storage to perform I/O balancing. You would basically have to do this
manually.

Either way, I'm guessing this isn't the answer you wanted but hey, you
get what you get.

--
"You can choose your friends, you can choose the deals." - Equity Private
"If Linux is faster, it's a Solaris bug." - Phil Harman
Blog - http://whatderass.blogspot.com/
Twitter - @khyron4eva
On Aug 3, 2010, at 10:08 PM, Khyron wrote:

> Long answer: Not without rewriting the previously written data. Data is
> being striped over all of the top level VDEVs, or at least it should be.
> But there is no way, at least not built into ZFS, to re-allocate the
> storage to perform I/O balancing. You would basically have to do this
> manually.
>
> Either way, I'm guessing this isn't the answer you wanted but hey, you
> get what you get.

Actually, that was the answer I was expecting, yes. The real question,
then, is: what data should I rewrite? I want to rewrite the data that's
written on the nearly full volumes so it gets spread to the volumes with
more space available.

Should I simply do a "zfs send | zfs receive" on all ZFSes I have? (we are
talking about 400 ZFSes with about 7 snapshots each, here)... Or is there
a way to rearrange specifically the data from the nearly full volumes?

Thanks,
Eduardo Bragatto
On Aug 3, 2010, at 10:52 AM, Eduardo Bragatto wrote:

> Hi,
>
> I have a large pool (~50TB total, ~42TB usable), composed of 4 raidz1
> volumes (of 7 x 2TB disks each):
>
> # zpool iostat -v | grep -v c4

Unfortunately, zpool iostat is completely useless at describing
performance. The only thing it can do is show device bandwidth, and
everyone here knows that bandwidth is not performance, right? Nod along,
thank you.

>                 capacity     operations    bandwidth
> pool          used  avail   read  write   read  write
> ------------  -----  -----  -----  -----  -----  -----
> backup        35.2T  15.3T    602    272  15.3M  11.1M
>   raidz1      11.6T  1.06T    138     49  2.99M  2.33M
>   raidz1      11.8T   845G    163     54  3.82M  2.57M
>   raidz1      6.00T  6.62T    161     84  4.50M  3.16M
>   raidz1      5.88T  6.75T    139     83  4.01M  3.09M
> ------------  -----  -----  -----  -----  -----  -----
>
> Originally there were only the first two raidz1 volumes, and the two at
> the bottom were added later.
>
> You can notice that by the amount of used / free space. The first two
> volumes have ~11TB used and ~1TB free, while the other two have around
> ~6TB used and ~6TB free.

Yes, and you also notice that the writes are biased towards the raidz1
sets that are less full. This is exactly what you want :-) Eventually,
when the less-empty (fuller) sets free up space, the writes will
rebalance. OTOH, reads will come from whence they were written.

> I have hundreds of zfs'es storing backups from several servers. Each ZFS
> has about 7 snapshots of older backups.
>
> I have the impression I'm getting degradation in performance due to the
> limited space in the first two volumes, especially the second, which has
> only 845GB free.

Impressions work well for dating, but not so well for performance.
Does your application run faster or slower?

> Is there any way to re-stripe the pool, so I can take advantage of all
> spindles across the raidz1 volumes? Right now it looks like the newer
> volumes are doing the heavy lifting while the other two just hold old
> data.

Yes, of course. But it requires copying the data, which probably isn't
feasible.
 -- richard

--
Richard Elling
richard at nexenta.com   +1-760-896-4422
Enterprise class storage for everyone
www.nexenta.com
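[As a side note: plain "zpool iostat -v" reports averages since boot. A
minimal sketch of sampling current activity instead; the pool name is from
this thread and the 5-second interval is arbitrary.]

    # The first report is the since-boot average; subsequent reports show
    # activity over each 5-second interval.
    zpool iostat -v backup 5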
On Aug 3, 2010, at 10:57 PM, Richard Elling wrote:

> Unfortunately, zpool iostat is completely useless at describing
> performance. The only thing it can do is show device bandwidth, and
> everyone here knows that bandwidth is not performance, right? Nod along,
> thank you.

I totally understand that, I only used the output to show the space
utilization per raidz1 volume.

> Yes, and you also notice that the writes are biased towards the raidz1
> sets that are less full. This is exactly what you want :-) Eventually,
> when the less-empty (fuller) sets free up space, the writes will
> rebalance.

Actually, if we are going to consider the values from zpool iostat, they
are only slightly biased towards the volumes I would want -- for example,
in the first post I made, the volume with the least free space had 845GB
free.. that same volume now has 833GB -- I really would like to just stop
writing to that volume at this point, as I've experienced very bad
performance in the past when a volume gets nearly full.

As a reference, here's the information I posted less than 12 hours ago:

# zpool iostat -v | grep -v c4
                capacity     operations    bandwidth
pool          used  avail   read  write   read  write
------------  -----  -----  -----  -----  -----  -----
backup        35.2T  15.3T    602    272  15.3M  11.1M
  raidz1      11.6T  1.06T    138     49  2.99M  2.33M
  raidz1      11.8T   845G    163     54  3.82M  2.57M
  raidz1      6.00T  6.62T    161     84  4.50M  3.16M
  raidz1      5.88T  6.75T    139     83  4.01M  3.09M
------------  -----  -----  -----  -----  -----  -----

And here's the info from the same system, as I write now:

# zpool iostat -v | grep -v c4
                capacity     operations    bandwidth
pool          used  avail   read  write   read  write
------------  -----  -----  -----  -----  -----  -----
backup        35.3T  15.2T    541    208  9.90M  6.45M
  raidz1      11.6T  1.06T    116     38  2.16M  1.41M
  raidz1      11.8T   833G    122     39  2.28M  1.49M
  raidz1      6.02T  6.61T    152     64  2.72M  1.78M
  raidz1      5.89T  6.73T    149     66  2.73M  1.77M
------------  -----  -----  -----  -----  -----  -----

As you can see, the second raidz1 volume is not being spared and has been
receiving almost as many writes as the others (and even more than the
first volume).

>> I have the impression I'm getting degradation in performance due to the
>> limited space in the first two volumes, especially the second, which
>> has only 845GB free.
>
> Impressions work well for dating, but not so well for performance.
> Does your application run faster or slower?

You're a funny guy. :)

Let me re-phrase it: I'm sure I'm getting degradation in performance, as
my applications are waiting more on I/O now than they used to (based on
CPU utilization graphs I have). The impression part is that the reason is
the limited space in those two volumes -- as I said, I already experienced
bad performance on zfs systems running nearly out of space before.

>> Is there any way to re-stripe the pool, so I can take advantage of all
>> spindles across the raidz1 volumes? Right now it looks like the newer
>> volumes are doing the heavy lifting while the other two just hold old
>> data.
>
> Yes, of course. But it requires copying the data, which probably isn't
> feasible.

I'm willing to copy data around to get this accomplished, I'm really just
looking for the best method -- I have more than 10TB free, so I have some
space to play with if I have to duplicate some data and erase the old
copy, for example.

Thanks,
Eduardo Bragatto
I notice you use the word "volume" which really isn't accurate or
appropriate here. If all of these VDEVs are part of the same pool, which
as I recall you said they are, then writes are striped across all of them
(with bias for the more empty aka less full VDEVs).

You probably want to "zfs send" the oldest dataset (ZFS terminology for a
file system) into a new dataset. That oldest dataset was created when
there were only 2 top level VDEVs, most likely. If you have multiple
datasets created when you had only 2 VDEVs, then send/receive them both
(in serial fashion, one after the other). If you have room for the
snapshots too, then send all of it and then delete the source dataset when
done. I think this will achieve what you want.

You may want to get a bit more specific and choose from the oldest
datasets THEN find the smallest of those oldest datasets and send/receive
it first. That way, the send/receive completes in less time, and when you
delete the source dataset, you've now created more free space on the
entire pool but without the risk of a single dataset exceeding your 10 TiB
of workspace.

ZFS's copy-on-write nature really wants no less than 20% free because you
never update data in place; a new copy is always written to disk.

You might want to consider turning on compression on your new datasets
too, especially if you have free CPU cycles to spare. I don't know how
compressible your data is, but if it's fairly compressible, say lots of
text, then you might get some added benefit when you copy the old data
into the new datasets. Saving more space, then deleting the source
dataset, should help your pool have more free space, and thus influence
your writes for better I/O balancing when you do the next (and the next)
dataset copies.

HTH.
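[A minimal sketch of the send/receive approach described above, assuming a
hypothetical dataset backup/serverA in the "backup" pool from this thread;
the snapshot and target names are illustrative only.]

    # Replicate the dataset and its existing snapshots into a new dataset,
    # so the rewritten blocks get striped across all four raidz1 vdevs:
    zfs snapshot backup/serverA@rebalance
    zfs send -R backup/serverA@rebalance | zfs receive backup/serverA-new

    # Only after verifying the copy: free the old blocks and swap names.
    zfs destroy -r backup/serverA
    zfs rename backup/serverA-new backup/serverA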
On Aug 3, 2010, at 8:55 PM, Eduardo Bragatto wrote:

> On Aug 3, 2010, at 10:57 PM, Richard Elling wrote:
>
>> Unfortunately, zpool iostat is completely useless at describing
>> performance. The only thing it can do is show device bandwidth, and
>> everyone here knows that bandwidth is not performance, right? Nod
>> along, thank you.
>
> I totally understand that, I only used the output to show the space
> utilization per raidz1 volume.
>
>> Yes, and you also notice that the writes are biased towards the raidz1
>> sets that are less full. This is exactly what you want :-) Eventually,
>> when the less-empty (fuller) sets free up space, the writes will
>> rebalance.
>
> Actually, if we are going to consider the values from zpool iostat, they
> are only slightly biased towards the volumes I would want -- for example,
> in the first post I made, the volume with the least free space had 845GB
> free.. that same volume now has 833GB -- I really would like to just stop
> writing to that volume at this point, as I've experienced very bad
> performance in the past when a volume gets nearly full.

The tipping point for the change in the first fit/best fit allocation
algorithm is now 96%. Previously, it was 70%. Since you don't specify
which OS, build, or zpool version, I'll assume you are on something
modern.

NB, "zdb -m" will show the pool's metaslab allocations. If there are no
100% free metaslabs, then it is a clue that the allocator might be working
extra hard.

> As a reference, here's the information I posted less than 12 hours ago:
>
> # zpool iostat -v | grep -v c4
>                 capacity     operations    bandwidth
> pool          used  avail   read  write   read  write
> ------------  -----  -----  -----  -----  -----  -----
> backup        35.2T  15.3T    602    272  15.3M  11.1M
>   raidz1      11.6T  1.06T    138     49  2.99M  2.33M
>   raidz1      11.8T   845G    163     54  3.82M  2.57M
>   raidz1      6.00T  6.62T    161     84  4.50M  3.16M
>   raidz1      5.88T  6.75T    139     83  4.01M  3.09M
> ------------  -----  -----  -----  -----  -----  -----
>
> And here's the info from the same system, as I write now:
>
> # zpool iostat -v | grep -v c4
>                 capacity     operations    bandwidth
> pool          used  avail   read  write   read  write
> ------------  -----  -----  -----  -----  -----  -----
> backup        35.3T  15.2T    541    208  9.90M  6.45M
>   raidz1      11.6T  1.06T    116     38  2.16M  1.41M
>   raidz1      11.8T   833G    122     39  2.28M  1.49M
>   raidz1      6.02T  6.61T    152     64  2.72M  1.78M
>   raidz1      5.89T  6.73T    149     66  2.73M  1.77M
> ------------  -----  -----  -----  -----  -----  -----
>
> As you can see, the second raidz1 volume is not being spared and has
> been receiving almost as many writes as the others (and even more than
> the first volume).

Yes, perhaps 1.5-2x data written to the less full raidz1 sets. The exact
amount of data is not shown, because zpool iostat doesn't show how much
data is written, it shows the bandwidth.

>>> I have the impression I'm getting degradation in performance due to
>>> the limited space in the first two volumes, especially the second,
>>> which has only 845GB free.
>>
>> Impressions work well for dating, but not so well for performance.
>> Does your application run faster or slower?
>
> You're a funny guy. :)
>
> Let me re-phrase it: I'm sure I'm getting degradation in performance, as
> my applications are waiting more on I/O now than they used to (based on
> CPU utilization graphs I have). The impression part is that the reason
> is the limited space in those two volumes -- as I said, I already
> experienced bad performance on zfs systems running nearly out of space
> before.

OK, so how long are they waiting? Try "iostat -zxCn" and look at the
asvc_t column. This will show how the disk is performing, though it won't
show the performance delivered by the file system to the application. To
measure the latter, try "fsstat zfs" (assuming you are on a Solaris
distro).

Also, if these are HDDs, the media bandwidth decreases and seeks increase
as they fill. ZFS tries to favor the outer cylinders (lower numbered
metaslabs) to take this into account.

>>> Is there any way to re-stripe the pool, so I can take advantage of all
>>> spindles across the raidz1 volumes? Right now it looks like the newer
>>> volumes are doing the heavy lifting while the other two just hold old
>>> data.
>>
>> Yes, of course. But it requires copying the data, which probably isn't
>> feasible.
>
> I'm willing to copy data around to get this accomplished, I'm really
> just looking for the best method -- I have more than 10TB free, so I
> have some space to play with if I have to duplicate some data and erase
> the old copy, for example.

zfs send/receive is usually the best method.
 -- richard
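[A hedged sketch of the two checks suggested above; the pool name is from
this thread, and the pager and interval are assumptions about convenient
usage, not required syntax.]

    # Dump the metaslab map and look for metaslabs with little or no free
    # space (the exact output format varies by release):
    zdb -m backup | more

    # Per-device latency, sampled every 5 seconds; watch the asvc_t
    # (active service time) column rather than the bandwidth columns.
    iostat -zxCn 5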
On Tue, 3 Aug 2010, Eduardo Bragatto wrote:

> You're a funny guy. :)
>
> Let me re-phrase it: I'm sure I'm getting degradation in performance, as
> my applications are waiting more on I/O now than they used to (based on
> CPU utilization graphs I have). The impression part is that the reason
> is the limited space in those two volumes -- as I said, I already
> experienced bad performance on zfs systems running nearly out of space
> before.

Assuming that your impressions are correct, are you sure that your new
disk drives are similar to the older ones? Are they an identical model?
Design trade-offs are now often resulting in larger capacity drives with
reduced performance.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
On Aug 4, 2010, at 12:26 AM, Richard Elling wrote:

> The tipping point for the change in the first fit/best fit allocation
> algorithm is now 96%. Previously, it was 70%. Since you don't specify
> which OS, build, or zpool version, I'll assume you are on something
> modern.

I'm running Solaris 10 10/09 s10x_u8wos_08a, ZFS Pool version 15.

> NB, "zdb -m" will show the pool's metaslab allocations. If there are no
> 100% free metaslabs, then it is a clue that the allocator might be
> working extra hard.

On the first two VDEVs there are no metaslabs 100% free (most are nearly
full)... The two newer ones, however, do have several metaslabs of 128GB
each, 100% free.

If I understand correctly, in that scenario the allocator will have to
work extra hard, is that correct?

> OK, so how long are they waiting? Try "iostat -zxCn" and look at the
> asvc_t column. This will show how the disk is performing, though it
> won't show the performance delivered by the file system to the
> application. To measure the latter, try "fsstat zfs" (assuming you are
> on a Solaris distro).

Checking with iostat, I noticed the average wait time to be between 40ms
and 50ms for all disks. Which doesn't seem too bad.

And this is the output of fsstat:

# fsstat zfs
 new  name   name  attr  attr lookup rddir  read read  write write
 file remov  chng   get   set    ops   ops   ops bytes   ops bytes
3.26M 1.34M 3.22M  161M 13.4M  1.36G  9.6M 10.5M  899G 22.0M  625G zfs

However I did have CPU spikes at 100% where the kernel was taking all CPU
time.

I have reduced my zfs_arc_max parameter as it seemed the applications were
struggling for RAM, and things are looking better now.

Thanks for your time,
Eduardo Bragatto.
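[For reference, a sketch of how zfs_arc_max is commonly capped on
Solaris 10 via /etc/system; the 4 GB value is an example, not the value
used in this thread, and the setting takes effect at the next reboot.]

    # Append the tunable to /etc/system (example value: 4 GB), then reboot:
    echo "set zfs:zfs_arc_max = 4294967296" >> /etc/system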
On Aug 4, 2010, at 12:20 AM, Khyron wrote:

> I notice you use the word "volume" which really isn't accurate or
> appropriate here.

Yeah, it didn't seem right to me, but I wasn't sure about the
nomenclature, thanks for clarifying.

> You may want to get a bit more specific and choose from the oldest
> datasets THEN find the smallest of those oldest datasets and
> send/receive it first. That way, the send/receive completes in less
> time, and when you delete the source dataset, you've now created more
> free space on the entire pool but without the risk of a single dataset
> exceeding your 10 TiB of workspace.

That makes sense, I'll try send/receiving a few of those datasets and see
how it goes. I believe I can find the ones that were created before the
two new VDEVs were added by comparing the creation times from "zfs get
creation".

> ZFS's copy-on-write nature really wants no less than 20% free because
> you never update data in place; a new copy is always written to disk.

Right, and my problem is that I have two VDEVs with less than 10% free at
this point -- although the other two have around 50% free each.

> You might want to consider turning on compression on your new datasets
> too, especially if you have free CPU cycles to spare. I don't know how
> compressible your data is, but if it's fairly compressible, say lots of
> text, then you might get some added benefit when you copy the old data
> into the new datasets. Saving more space, then deleting the source
> dataset, should help your pool have more free space, and thus influence
> your writes for better I/O balancing when you do the next (and the next)
> dataset copies.

Unfortunately the data taking most of the space is already compressed, so
while I would gain some space from the many text files I also have, those
are not the majority of my content, and the effort would probably not
justify the small gain.

Thanks,
Eduardo Bragatto
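[A small sketch of one way to list datasets by creation time and find the
oldest ones; the pool name is from this thread, and the exact flags are an
assumption about what is convenient rather than the only approach.]

    # -H drops the headers, -p prints the creation time as seconds since
    # the epoch so it sorts numerically; oldest datasets come out first.
    zfs get -r -H -p -o name,value creation backup | sort -n -k 2 | head -20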
On Wed, 4 Aug 2010, Eduardo Bragatto wrote:

> Checking with iostat, I noticed the average wait time to be between 40ms
> and 50ms for all disks. Which doesn't seem too bad.

Actually, this is quite high. I would not expect such long wait times
except when under extreme load such as a benchmark. If the wait times are
this long under normal use, then there is something wrong.

> However I did have CPU spikes at 100% where the kernel was taking all
> CPU time.
>
> I have reduced my zfs_arc_max parameter as it seemed the applications
> were struggling for RAM, and things are looking better now.

Odd. What type of applications are you running on this system? Are
applications running on the server competing with client accesses?

Bob
On Aug 4, 2010, at 11:18 AM, Bob Friesenhahn wrote:

> Assuming that your impressions are correct, are you sure that your new
> disk drives are similar to the older ones? Are they an identical model?
> Design trade-offs are now often resulting in larger capacity drives with
> reduced performance.

Yes, the disks are the same, no problems there.

On Aug 4, 2010, at 2:11 PM, Bob Friesenhahn wrote:

> On Wed, 4 Aug 2010, Eduardo Bragatto wrote:
>>
>> Checking with iostat, I noticed the average wait time to be between
>> 40ms and 50ms for all disks. Which doesn't seem too bad.
>
> Actually, this is quite high. I would not expect such long wait times
> except when under extreme load such as a benchmark. If the wait times
> are this long under normal use, then there is something wrong.

That's a backup server, and I usually have 10 rsync instances running
simultaneously, so there's a lot of random disk access going on -- I think
that explains the high average time.

Also, I recently enabled graphing of the IOPS per disk (reading it via
net-snmp) and I see most disks are operating near their limit -- except
for some disks from the older VDEVs, which is what I'm trying to address
here.

>> However I did have CPU spikes at 100% where the kernel was taking all
>> CPU time.
>>
>> I have reduced my zfs_arc_max parameter as it seemed the applications
>> were struggling for RAM, and things are looking better now.
>
> Odd. What type of applications are you running on this system? Are
> applications running on the server competing with client accesses?

I noticed some of those rsync processes were using almost 1GB of RAM each,
and the server has only 8GB. I started seeing the server swapping a bit
during the CPU spikes at 100%, so I figured it would be better to cap the
ARC and leave some room for the rsync processes.

I will also start using rsync v3 to reduce the memory footprint, so I
might be able to give back some RAM to the ARC, and I'm thinking of going
to 16GB RAM, as the pool is quite large and I'm sure more ARC wouldn't
hurt.

Thanks,
Eduardo Bragatto.
On Wed, 4 Aug 2010, Eduardo Bragatto wrote:

> I will also start using rsync v3 to reduce the memory footprint, so I
> might be able to give back some RAM to the ARC, and I'm thinking of
> going to 16GB RAM, as the pool is quite large and I'm sure more ARC
> wouldn't hurt.

It is definitely a wise idea to use rsync v3. Previous versions had to
recurse the whole tree on both sides (storing what was learned in memory)
before doing anything.

Bob
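[A small, hedged illustration of the point above: rsync's incremental
recursion only kicks in when both ends run rsync 3.x. The host name and
paths below are hypothetical, not from this thread.]

    # Confirm both sides speak protocol 30 or newer (rsync 3.x):
    rsync --version | head -1
    ssh sourcehost 'rsync --version | head -1'

    # With 3.x on both ends, the file list is built incrementally during
    # the transfer instead of being held entirely in memory up front.
    rsync -aH --delete sourcehost:/export/data/ /backup/serverA/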
On Aug 4, 2010, at 9:03 AM, Eduardo Bragatto wrote:

> On Aug 4, 2010, at 12:26 AM, Richard Elling wrote:
>
>> The tipping point for the change in the first fit/best fit allocation
>> algorithm is now 96%. Previously, it was 70%. Since you don't specify
>> which OS, build, or zpool version, I'll assume you are on something
>> modern.
>
> I'm running Solaris 10 10/09 s10x_u8wos_08a, ZFS Pool version 15.

Then the first fit/best fit threshold is 96%.

>> NB, "zdb -m" will show the pool's metaslab allocations. If there are no
>> 100% free metaslabs, then it is a clue that the allocator might be
>> working extra hard.
>
> On the first two VDEVs there are no metaslabs 100% free (most are nearly
> full)... The two newer ones, however, do have several metaslabs of 128GB
> each, 100% free.
>
> If I understand correctly, in that scenario the allocator will have to
> work extra hard, is that correct?

Yes, and this can be measured, but...

>> OK, so how long are they waiting? Try "iostat -zxCn" and look at the
>> asvc_t column. This will show how the disk is performing, though it
>> won't show the performance delivered by the file system to the
>> application. To measure the latter, try "fsstat zfs" (assuming you are
>> on a Solaris distro).
>
> Checking with iostat, I noticed the average wait time to be between 40ms
> and 50ms for all disks. Which doesn't seem too bad.

... actually, that is pretty bad. Look for an average around 10 ms and
peaks around 20 ms. Solve this problem first -- the system can do a huge
amount of allocations for any algorithm in 1 ms.

> And this is the output of fsstat:
>
> # fsstat zfs
>  new  name   name  attr  attr lookup rddir  read read  write write
>  file remov  chng   get   set    ops   ops   ops bytes   ops bytes
> 3.26M 1.34M 3.22M  161M 13.4M  1.36G  9.6M 10.5M  899G 22.0M  625G zfs

Unfortunately, the first line is useless, it is the summary since boot.
Try adding a sample interval to see how things are moving now.

> However I did have CPU spikes at 100% where the kernel was taking all
> CPU time.

Again, this can be analyzed using baseline performance analysis
techniques. The "prstat" command should show how CPU is being used. I'm
not running Solaris 10 10/09, but IIRC, it has the ZFS enhancement where
CPU time is attributed to the pool, as seen in prstat.
 -- richard
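[A minimal sketch of the follow-up measurements suggested above; the
interval and count are arbitrary.]

    # 5-second samples, 12 reports; the first report is still the
    # since-boot summary, the rest show current activity.
    fsstat zfs 5 12

    # Per-thread microstate accounting, refreshed every 5 seconds, to see
    # where CPU time goes during the 100% spikes.
    prstat -mL 5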