Hi,

I have a weird write performance problem and need your help.

One day, the write performance of zfs degraded. Sequential write throughput dropped from 60MB/s to about 6MB/s.

Command:
date;dd if=/dev/zero of=block bs=1024*128 count=10000;date

The hardware is one Dell MD3000 and one MD1000 with 30 disks in total. The OS is Solaris 10U8, zpool version 15, zfs version 4.

I ran DTrace to measure zfs_write latency:

fbt:zfs:zfs_write:entry
{
        self->ts = timestamp;
}

fbt:zfs:zfs_write:return
/self->ts/
{
        @time = quantize(timestamp - self->ts);
        self->ts = 0;
}

It shows:

           value  ------------- Distribution ------------- count
            8192 |                                         0
           16384 |                                         16
           32768 |@@@@@@@@@@@@@@@@@@@@@@@@                 3270
           65536 |@@@@@@@                                  898
          131072 |@@@@@@@                                  985
          262144 |                                         33
          524288 |                                         1
         1048576 |                                         1
         2097152 |                                         3
         4194304 |                                         0
         8388608 |@                                        180
        16777216 |                                         33
        33554432 |                                         0
        67108864 |                                         0
       134217728 |                                         0
       268435456 |                                         1
       536870912 |                                         1
      1073741824 |                                         2
      2147483648 |                                         0
      4294967296 |                                         0
      8589934592 |                                         0
     17179869184 |                                         2
     34359738368 |                                         3
     68719476736 |                                         0

On a storage system that is working well (a single MD3000), the largest zfs_write latency bucket is 4294967296 ns, so that system is roughly 10 times faster at the tail.

Any suggestions?

Thanks
Ding
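A follow-up script in the same style can show where the slow writes burn their time while on CPU. This is a sketch, not from the original post: it samples the kernel stack of whatever thread is currently inside zfs_write, so it only catches on-CPU time, not time spent blocked.

fbt:zfs:zfs_write:entry
{
        self->inwrite = 1;
}

fbt:zfs:zfs_write:return
/self->inwrite/
{
        self->inwrite = 0;
}

/* sample the kernel stack ~997 times/sec whenever a thread is inside zfs_write */
profile-997
/self->inwrite/
{
        @oncpu[stack()] = count();
}

If the allocator is the problem, the top stacks should land in metaslab/space map code rather than in the disk drivers.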
And one comment: when we run the write workload (the dd command above), heavy read activity appears on every disk, going from zero to about 3MB/s per disk, while write bandwidth stays poor. The disk %b rises from 0 to about 60. I don't understand why this happens.

                                            capacity     operations    bandwidth
pool                                      used  avail   read  write   read  write
--------------------------------------   -----  -----  -----  -----  -----  -----
datapool                                 19.8T  5.48T    543     47  1.74M  5.89M
  raidz1                                 5.64T   687G    146     13   480K  1.66M
    c3t6002219000854867000003B2490FB009d0    -      -     49     13  3.26M   293K
    c3t6002219000854867000003B4490FB063d0    -      -     48     13  3.19M   296K
    c3t60022190008528890000055F4CB79C10d0    -      -     48     13  3.19M   293K
    c3t6002219000854867000003B8490FB0FFd0    -      -     50     13  3.28M   284K
    c3t6002219000854867000003BA490FB14Fd0    -      -     50     13  3.31M   287K
    c3t60022190008528890000041C490FAFA0d0    -      -     49     14  3.27M   297K
    c3t6002219000854867000003C0490FB27Dd0    -      -     48     14  3.24M   300K
  raidz1                                 5.73T   594G    102      7   337K   996K
    c3t6002219000854867000003C2490FB2BFd0    -      -     52      5  3.59M   166K
    c3t60022190008528890000041F490FAFD0d0    -      -     54      5  3.72M   166K
    c3t600221900085288900000428490FB0D8d0    -      -     55      5  3.79M   166K
    c3t600221900085288900000422490FB02Cd0    -      -     52      5  3.57M   166K
    c3t600221900085288900000425490FB07Cd0    -      -     53      5  3.64M   166K
    c3t600221900085288900000434490FB24Ed0    -      -     55      5  3.76M   166K
    c3t60022190008528890000043949100968d0    -      -     55      5  3.83M   166K
  raidz1                                 5.81T   519G    117     10   388K  1.26M
    c3t60022190008528890000056B4CB79D66d0    -      -     46      9  3.09M   215K
    c3t6002219000854867000004B94CB79F91d0    -      -     44      9  2.91M   215K
    c3t6002219000854867000004BB4CB79FE1d0    -      -     44      9  2.97M   224K
    c3t6002219000854867000004BD4CB7A035d0    -      -     44      9  2.96M   215K
    c3t6002219000854867000004BF4CB7A0ABd0    -      -     44      9  2.97M   216K
    c3t60022190008528890000055C4CB79BB8d0    -      -     45      9  3.04M   215K
    c3t6002219000854867000004C14CB7A0FDd0    -      -     46      9  3.02M   215K
  raidz1                                 2.59T  3.72T    176     16   581K  2.00M
    c3t60022190008528890000042B490FB124d0    -      -     48      5  3.21M   342K
    c3t6002219000854867000004C54CB7A199d0    -      -     46      5  2.99M   342K
    c3t6002219000854867000004C74CB7A1D5d0    -      -     49      5  3.27M   342K
    c3t6002219000852889000005594CB79B64d0    -      -     46      6  3.00M   342K
    c3t6002219000852889000005624CB79C86d0    -      -     47      6  3.11M   342K
    c3t6002219000852889000005654CB79CCCd0    -      -     50      6  3.29M   342K
    c3t6002219000852889000005684CB79D1Ed0    -      -     45      5  2.98M   342K
  c3t6B8AC6F0000F8376000005864DC9E9F1d0     4K   928G      0      0      0      0
--------------------------------------   -----  -----  -----  -----  -----  -----
^C
root at nas-hz-01:~#

On 06/08/2011 11:07 AM, Ding Honghui wrote:
> Hi,
>
> I have a weird write performance problem and need your help.
>
> One day, the write performance of zfs degraded. Sequential write
> throughput dropped from 60MB/s to about 6MB/s.
> [...]
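The surprise reads during a pure write workload are often the allocator pulling space maps in from disk. A quick way to check which kernel paths are issuing those reads is to aggregate I/O stacks while dd is running; this one-liner is a sketch, and the 30-second window and top-5 truncation are arbitrary choices:

dtrace -n 'io:::start { @[stack(20)] = count(); }
    tick-30s { trunc(@, 5); printa(@); exit(0); }'

If space_map_load shows up near the top of those stacks, the reads are metaslab space maps being loaded while the allocator hunts for free segments.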
>> One day, the write performance of zfs degraded.
>> Sequential write throughput dropped from 60MB/s to about 6MB/s.
>>
>> Command:
>> date;dd if=/dev/zero of=block bs=1024*128 count=10000;date

See this thread:

http://www.opensolaris.org/jive/thread.jspa?threadID=139317&tstart=45

And search in the page for:
"metaslab_min_alloc_size"

Try adjusting that tunable and see if it fixes your performance problem.

-Don
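On builds where the tunable exists, it can also be made persistent across reboots via /etc/system. This is a sketch under the assumption that the symbol lives in the zfs module; verify it exists on your kernel before relying on it:

* /etc/system: lower metaslab_min_alloc_size to 4K (0x1000)
set zfs:metaslab_min_alloc_size = 0x1000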
On 08 June, 2011 - Donald Stahl sent me these 0,6K bytes:

>>> One day, the write performance of zfs degraded.
>>> Sequential write throughput dropped from 60MB/s to about 6MB/s.
>>>
>>> Command:
>>> date;dd if=/dev/zero of=block bs=1024*128 count=10000;date
>
> See this thread:
>
> http://www.opensolaris.org/jive/thread.jspa?threadID=139317&tstart=45
>
> And search in the page for:
> "metaslab_min_alloc_size"
>
> Try adjusting that tunable and see if it fixes your performance problem.

And if pool usage is >90%, then there's another problem as well: the
allocator switches to a different algorithm for finding free space.

/Tomas
--
Tomas Ögren, stric at acc.umu.se, http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se
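A quick way to see how close the pool is to that threshold, using the pool name from the iostat output earlier in the thread:

zpool list datapool      # the CAP column shows how full the pool is
zfs list datapool        # dataset-level view of used vs. available space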
Hi, also see:

http://www.mail-archive.com/zfs-discuss at opensolaris.org/msg45408.html

We hit this with Solaris 11, though; I'm not sure whether it can happen on Solaris 10.

Yours
Markus Kovero

-----Original Message-----
From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Ding Honghui
Sent: 8. kesäkuuta 2011 6:07
To: zfs-discuss at opensolaris.org
Subject: [zfs-discuss] Weird write performance problem

Hi,

I have a weird write performance problem and need your help.

One day, the write performance of zfs degraded. Sequential write
throughput dropped from 60MB/s to about 6MB/s.
[...]
On 06/08/2011 12:12 PM, Donald Stahl wrote:
>>> One day, the write performance of zfs degraded.
>>> Sequential write throughput dropped from 60MB/s to about 6MB/s.
>>>
>>> Command:
>>> date;dd if=/dev/zero of=block bs=1024*128 count=10000;date
>
> See this thread:
>
> http://www.opensolaris.org/jive/thread.jspa?threadID=139317&tstart=45
>
> And search in the page for:
> "metaslab_min_alloc_size"
>
> Try adjusting that tunable and see if it fixes your performance problem.
>
> -Don

"metaslab_min_alloc_size" is not used when the block allocator is the dynamic block allocator [1], so it is not a tunable parameter in my case.

Thanks anyway.

[1] http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/metaslab.c#496
For now, I find that a lot of time is spent in the function metaslab_block_picker in metaslab.c. I guess there may be many AVL-tree searches. I am still not sure what causes all the AVL searches, or whether there is any parameter to tune for it.

Any suggestions?

On 06/08/2011 05:57 PM, Markus Kovero wrote:
> Hi, also see:
>
> http://www.mail-archive.com/zfs-discuss at opensolaris.org/msg45408.html
>
> We hit this with Solaris 11, though; I'm not sure whether it can happen on Solaris 10.
>
> Yours
> Markus Kovero
> [...]
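One way to confirm where that time goes, modeled on the zfs_write script earlier in the thread, is to quantize the time spent per allocation call. This is a sketch: metaslab_block_picker and metaslab_df_alloc are static functions in metaslab.c, so their fbt probes may be missing on some builds if the compiler inlined them.

fbt:zfs:metaslab_df_alloc:entry
{
        self->mts = timestamp;
}

fbt:zfs:metaslab_df_alloc:return
/self->mts/
{
        /* nanoseconds spent searching for a free segment */
        @alloc = quantize(timestamp - self->mts);
        self->mts = 0;
}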
On 06/08/2011 04:05 PM, Tomas Ögren wrote:
> On 08 June, 2011 - Donald Stahl sent me these 0,6K bytes:
>
>> See this thread:
>>
>> http://www.opensolaris.org/jive/thread.jspa?threadID=139317&tstart=45
>>
>> And search in the page for:
>> "metaslab_min_alloc_size"
>>
>> Try adjusting that tunable and see if it fixes your performance problem.
>
> And if pool usage is >90%, then there's another problem as well: the
> allocator switches to a different algorithm for finding free space.
>
> /Tomas

Tomas,

Thanks for your suggestion. You are right: some days ago I tuned the parameter metaslab_df_free_pct from 35 down to 4 to reduce this problem. Performance stayed good for about a week, then degraded again.

I am still not sure how many allocations end up using the best-fit policy and how many use the first-fit policy in the current situation.

Any help would be much appreciated.

Regards,
Ding
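For reference, the metaslab_df_free_pct change described above would typically be made like this. A sketch; 0t4 is decimal 4 in mdb notation, and the symbol name should be verified on your kernel first:

echo "metaslab_df_free_pct/D" | mdb -k        # read the current value
echo "metaslab_df_free_pct/W 0t4" | mdb -kw   # lower the first-fit/best-fit switch point to 4% free

It can be made persistent with a line in /etc/system:

set zfs:metaslab_df_free_pct = 4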
On 06/08/2011 09:15 PM, Donald Stahl wrote:
>> "metaslab_min_alloc_size" is not used when the block allocator is the
>> dynamic block allocator [1], so it is not a tunable parameter in my case.
>
> May I ask where it says this is not a tunable in that case? I've read
> through the code and I don't see what you are talking about.
>
> The problem you are describing, including the "long time in function
> metaslab_block_picker", exactly matches the block picker trying to find
> a large enough block and failing.
>
> What value do you get when you run:
> echo "metaslab_min_alloc_size/K" | mdb -kw
> ?
>
> You can always try setting it via:
> echo "metaslab_min_alloc_size/Z 1000" | mdb -kw
>
> and if that doesn't work, set it right back.
>
> I'm not familiar with the specifics of Solaris 10u8, so perhaps this is
> not a tunable in that version, but if it is, I would suggest you try
> changing it. If your performance is as bad as you say, then it can't
> hurt to try.
>
> -Don

Thanks very much, Don.

In Solaris 10u8:

root at nas-hz-01:~# uname -a
SunOS nas-hz-01 5.10 Generic_141445-09 i86pc i386 i86pc
root at nas-hz-01:~# echo "metaslab_min_alloc_size/K" | mdb -kw
mdb: failed to dereference symbol: unknown symbol name
root at nas-hz-01:~#

The pool version is 15 and the zfs version is 4.

The parameter does exist on my OpenIndiana build 148 machine, whose zpool version is 28 and zfs version is 5:

ops at oi:~$ echo "metaslab_min_alloc_size/Z 1000" | pfexec mdb -kw
metaslab_min_alloc_size:        0x1000          =       0x1000
ops at oi:~$

I'm not sure which build introduced this parameter. Should I move this pool to OpenIndiana? Any suggestions?

Regards,
Ding
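Whether a given tunable exists on a given kernel can also be checked directly against the on-disk zfs module, without guessing in mdb. A sketch, assuming the usual 64-bit module path on x86:

nm /kernel/fs/amd64/zfs | grep metaslab_min_alloc_size

If nothing comes back, the variable simply is not present in that build.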
> In Solaris 10u8:
> root at nas-hz-01:~# uname -a
> SunOS nas-hz-01 5.10 Generic_141445-09 i86pc i386 i86pc
> root at nas-hz-01:~# echo "metaslab_min_alloc_size/K" | mdb -kw
> mdb: failed to dereference symbol: unknown symbol name

Fair enough. I don't have anything older than b147 at this point, so I wasn't sure if that was in there or not.

If you delete a bunch of data (perhaps old files you have lying around), does your performance go back up, even if only temporarily?

The problem we had matches your description word for word. All of a sudden we had terrible write performance with a ton of time spent in the metaslab allocator. Then we'd delete a big chunk of data (100 gigs or so) and poof: performance would get better for a short while.

Several people suggested changing the allocation free percent from 30 to 4, but that change was already incorporated into the b147 box we were testing. The only thing that made a difference (and I mean a night and day difference) was the change above.

That said, I have no idea how that part of the code works in 10u8.

-Don
On 06/08/11 01:05, Tomas Ögren wrote:
> And if pool usage is >90%, then there's another problem as well: the
> allocator switches to a different algorithm for finding free space.

Another (less satisfying) workaround is to increase the amount of free space in the pool, either by reducing usage or by adding more storage. The observed behavior is that allocation is fast until usage crosses a threshold, then performance hits a wall.

I have a small sample size (maybe 2-3 pools), but the threshold point varies from pool to pool and tends to be consistent for a given pool. I suspect some artifact of layout/fragmentation is at play. I've seen things hit the wall at as low as 70% full on one pool.

The original poster's pool is about 78% full. If possible, try freeing stuff until usage goes back under 75% or 70% and see if your performance returns.
> Another (less satisfying) workaround is to increase the amount of free space
> in the pool, either by reducing usage or by adding more storage. The observed
> behavior is that allocation is fast until usage crosses a threshold, then
> performance hits a wall.

We actually tried this solution. We were at 70% usage and performance hit a wall. We figured it was because of the change of fit algorithm, so we added 16 2TB disks in mirrors (adding 16TB to an 18TB pool). It made almost no difference in our pool performance. It wasn't until we told the metaslab allocator to stop looking for such large chunks that the problem went away.

> The original poster's pool is about 78% full. If possible, try freeing
> stuff until usage goes back under 75% or 70% and see if your performance
> returns.

Freeing stuff did fix the problem for us (temporarily), but only in an indirect way. When we freed up a bunch of space, the metaslab allocator was able to find large enough blocks to write to without searching all over the place. That fixed the performance problem until those large free blocks got used up. Then, even though we were below the earlier usage threshold, we would still have the performance problem.

-Don
On 06/09/2011 12:23 AM, Donald Stahl wrote:
> We actually tried this solution. We were at 70% usage and performance
> hit a wall. We figured it was because of the change of fit algorithm,
> so we added 16 2TB disks in mirrors (adding 16TB to an 18TB pool). It
> made almost no difference in our pool performance. It wasn't until we
> told the metaslab allocator to stop looking for such large chunks that
> the problem went away.
>
> Freeing stuff did fix the problem for us (temporarily), but only in an
> indirect way. When we freed up a bunch of space, the metaslab
> allocator was able to find large enough blocks to write to without
> searching all over the place. That fixed the performance problem
> until those large free blocks got used up. Then, even though we were
> below the earlier usage threshold, we would still have the performance
> problem.
>
> -Don

Don,

From your description, my symptom is almost the same as yours.

We examined the metaslab layout. When metaslab_df_free_pct was 35, there were 65 completely free metaslabs (64G each), the write performance was very low, and a rough test showed that no new free metaslab was being loaded and activated.

Then we tuned metaslab_df_free_pct down to 4. Performance stayed good for one week and the number of free metaslabs dropped to 51. But now the write bandwidth is poor again (maybe I should trace the free space of each metaslab?).

Maybe there is a problem in the metaslab rating score (weight) used to select a metaslab, or in the block allocator algorithm?

Here is a snapshot of the metaslab layout; the last 51 metaslabs each have 64G free:

vdev          offset                 spacemap         free
----------    -------------------    ---------------  -------------
... snip
vdev 3        offset 27000000000     spacemap 440     free 21.0G
vdev 3        offset 28000000000     spacemap  31     free 7.36G
vdev 3        offset 29000000000     spacemap  32     free 2.44G
vdev 3        offset 2a000000000     spacemap  33     free 2.91G
vdev 3        offset 2b000000000     spacemap  34     free 3.25G
vdev 3        offset 2c000000000     spacemap  35     free 3.03G
vdev 3        offset 2d000000000     spacemap  36     free 3.20G
vdev 3        offset 2e000000000     spacemap  90     free 3.28G
vdev 3        offset 2f000000000     spacemap  91     free 2.46G
vdev 3        offset 30000000000     spacemap  92     free 2.98G
vdev 3        offset 31000000000     spacemap  93     free 2.19G
vdev 3        offset 32000000000     spacemap  94     free 2.42G
vdev 3        offset 33000000000     spacemap  95     free 2.83G
vdev 3        offset 34000000000     spacemap 252     free 41.6G
vdev 3        offset 35000000000     spacemap   0     free 64G
vdev 3        offset 36000000000     spacemap   0     free 64G
vdev 3        offset 37000000000     spacemap   0     free 64G
vdev 3        offset 38000000000     spacemap   0     free 64G
vdev 3        offset 39000000000     spacemap   0     free 64G
vdev 3        offset 3a000000000     spacemap   0     free 64G
vdev 3        offset 3b000000000     spacemap   0     free 64G
vdev 3        offset 3c000000000     spacemap   0     free 64G
vdev 3        offset 3d000000000     spacemap   0     free 64G
vdev 3        offset 3e000000000     spacemap   0     free 64G
...snip
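For anyone wanting to reproduce this kind of layout dump, zdb can print per-metaslab space map usage. A sketch, using the pool name from earlier in the thread; the doubled -m form adds per-segment detail and can take a long time on a pool this size:

zdb -m datapool      # one line per metaslab: vdev, offset, spacemap object, free space
zdb -mm datapool     # also dumps the individual space map segments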
On 06/09/2011 10:14 AM, Ding Honghui wrote:
> Don,
>
> From your description, my symptom is almost the same as yours.
>
> [...] Here is a snapshot of the metaslab layout; the last 51 metaslabs
> each have 64G free. [...]
I freed up some disk space (about 300GB) and the performance is back again. I'm sure it will degrade again soon.
> Here is a snapshot of the metaslab layout; the last 51 metaslabs each
> have 64G free.

After we added all the disks to our system we had lots of free metaslabs, but that didn't seem to matter. I don't know if perhaps the system was attempting to balance the writes across more of our devices, but whatever the reason, the percentage didn't seem to matter. All that mattered was changing the size of the min_alloc tunable.

You seem to have gotten a lot deeper into some of this analysis than I did, so I'm not sure I can really add anything. Since 10u8 doesn't support that tunable, I'm not really sure where to go from there.

If you can take the pool offline, you might try connecting it to a b148 box and see if that tunable makes a difference. Beyond that I don't really have any suggestions.

Your problem description, including the return of performance when freeing space, is _identical_ to the problem we had. After checking every single piece of hardware, replacing countless pieces, and removing COMSTAR and other pieces from the puzzle, the only change that helped was changing that tunable.

I wish I could be of more help, but I have not had the time to dive into the ZFS code with any gusto.

-Don
Have you determined this is not bug 7000208? It sounds much like it. You could run:

/usr/sbin/lockstat -HcwP -n 100000 -x aggrate=10hz -D 20 -s 40 sleep 2
/usr/sbin/lockstat -CcwP -n 100000 -x aggrate=10hz -D 20 -s 40 sleep 2

to find the hottest callers (space_map_load, kmem_cache_free) while the issue is occurring.

Yours
Markus Kovero

-----Original Message-----
From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Donald Stahl
Sent: 9. kesäkuuta 2011 6:27
To: Ding Honghui
Cc: zfs-discuss at opensolaris.org
Subject: Re: [zfs-discuss] Weird write performance problem

> Here is a snapshot of the metaslab layout; the last 51 metaslabs each
> have 64G free.

After we added all the disks to our system we had lots of free metaslabs, but that didn't seem to matter.
[...]
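Alongside lockstat, a simple kernel profile makes a hot space_map_load path stand out immediately. A sketch; the 30-second window and top-10 truncation are arbitrary choices:

dtrace -n 'profile-997 /arg0/ { @[stack()] = count(); }
    tick-30s { trunc(@, 10); printa(@); exit(0); }'

Stacks dominated by space_map_load and AVL lookups would point in the same direction as the bug mentioned above.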