Hi,

I have a weird write performance problem and need your help.

One day, the write performance of zfs degraded. Sequential write throughput dropped from 60MB/s to about 6MB/s.

Command:
date;dd if=/dev/zero of=block bs=1024*128 count=10000;date

The hardware is one Dell MD3000 and one MD1000 with 30 disks in total. The OS is Solaris 10U8, zpool version 15, zfs version 4.

I ran DTrace to measure zfs_write latency:

fbt:zfs:zfs_write:entry
{
        self->ts = timestamp;
}

fbt:zfs:zfs_write:return
/self->ts/
{
        @time = quantize(timestamp - self->ts);
        self->ts = 0;
}

It shows:

           value  ------------- Distribution ------------- count
            8192 |                                         0
           16384 |                                         16
           32768 |@@@@@@@@@@@@@@@@@@@@@@@@                 3270
           65536 |@@@@@@@                                  898
          131072 |@@@@@@@                                  985
          262144 |                                         33
          524288 |                                         1
         1048576 |                                         1
         2097152 |                                         3
         4194304 |                                         0
         8388608 |@                                        180
        16777216 |                                         33
        33554432 |                                         0
        67108864 |                                         0
       134217728 |                                         0
       268435456 |                                         1
       536870912 |                                         1
      1073741824 |                                         2
      2147483648 |                                         0
      4294967296 |                                         0
      8589934592 |                                         0
     17179869184 |                                         2
     34359738368 |                                         3
     68719476736 |                                         0

On a storage system that is working well (a single MD3000), the largest zfs_write latency bucket is 4294967296 ns, so that system is roughly 10 times faster at the tail.

Any suggestions?

Thanks
Ding
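A follow-up script in the same style can show where the slow writes burn their time while on CPU. This is a sketch, not from the original post: it samples the kernel stack of whatever thread is currently inside zfs_write, so it only catches on-CPU time, not time spent blocked.

fbt:zfs:zfs_write:entry
{
        self->inwrite = 1;
}

fbt:zfs:zfs_write:return
/self->inwrite/
{
        self->inwrite = 0;
}

/* sample the kernel stack ~997 times/sec whenever a thread is inside zfs_write */
profile-997
/self->inwrite/
{
        @oncpu[stack()] = count();
}

If the allocator is the problem, the top stacks should land in metaslab/space map code rather than in the disk drivers.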
And one comment: when we run the write workload (the dd command above), heavy read activity appears on every disk, going from zero to about 3MB/s per disk, while write bandwidth stays poor. The disk %b rises from 0 to about 60. I don't understand why this happens.

                                            capacity     operations    bandwidth
pool                                      used  avail   read  write   read  write
--------------------------------------   -----  -----  -----  -----  -----  -----
datapool                                 19.8T  5.48T    543     47  1.74M  5.89M
  raidz1                                 5.64T   687G    146     13   480K  1.66M
    c3t6002219000854867000003B2490FB009d0    -      -     49     13  3.26M   293K
    c3t6002219000854867000003B4490FB063d0    -      -     48     13  3.19M   296K
    c3t60022190008528890000055F4CB79C10d0    -      -     48     13  3.19M   293K
    c3t6002219000854867000003B8490FB0FFd0    -      -     50     13  3.28M   284K
    c3t6002219000854867000003BA490FB14Fd0    -      -     50     13  3.31M   287K
    c3t60022190008528890000041C490FAFA0d0    -      -     49     14  3.27M   297K
    c3t6002219000854867000003C0490FB27Dd0    -      -     48     14  3.24M   300K
  raidz1                                 5.73T   594G    102      7   337K   996K
    c3t6002219000854867000003C2490FB2BFd0    -      -     52      5  3.59M   166K
    c3t60022190008528890000041F490FAFD0d0    -      -     54      5  3.72M   166K
    c3t600221900085288900000428490FB0D8d0    -      -     55      5  3.79M   166K
    c3t600221900085288900000422490FB02Cd0    -      -     52      5  3.57M   166K
    c3t600221900085288900000425490FB07Cd0    -      -     53      5  3.64M   166K
    c3t600221900085288900000434490FB24Ed0    -      -     55      5  3.76M   166K
    c3t60022190008528890000043949100968d0    -      -     55      5  3.83M   166K
  raidz1                                 5.81T   519G    117     10   388K  1.26M
    c3t60022190008528890000056B4CB79D66d0    -      -     46      9  3.09M   215K
    c3t6002219000854867000004B94CB79F91d0    -      -     44      9  2.91M   215K
    c3t6002219000854867000004BB4CB79FE1d0    -      -     44      9  2.97M   224K
    c3t6002219000854867000004BD4CB7A035d0    -      -     44      9  2.96M   215K
    c3t6002219000854867000004BF4CB7A0ABd0    -      -     44      9  2.97M   216K
    c3t60022190008528890000055C4CB79BB8d0    -      -     45      9  3.04M   215K
    c3t6002219000854867000004C14CB7A0FDd0    -      -     46      9  3.02M   215K
  raidz1                                 2.59T  3.72T    176     16   581K  2.00M
    c3t60022190008528890000042B490FB124d0    -      -     48      5  3.21M   342K
    c3t6002219000854867000004C54CB7A199d0    -      -     46      5  2.99M   342K
    c3t6002219000854867000004C74CB7A1D5d0    -      -     49      5  3.27M   342K
    c3t6002219000852889000005594CB79B64d0    -      -     46      6  3.00M   342K
    c3t6002219000852889000005624CB79C86d0    -      -     47      6  3.11M   342K
    c3t6002219000852889000005654CB79CCCd0    -      -     50      6  3.29M   342K
    c3t6002219000852889000005684CB79D1Ed0    -      -     45      5  2.98M   342K
  c3t6B8AC6F0000F8376000005864DC9E9F1d0     4K   928G      0      0      0      0
--------------------------------------   -----  -----  -----  -----  -----  -----
^C
root at nas-hz-01:~#

On 06/08/2011 11:07 AM, Ding Honghui wrote:
> Hi,
>
> I have a weird write performance problem and need your help.
>
> One day, the write performance of zfs degraded. Sequential write
> throughput dropped from 60MB/s to about 6MB/s.
> [...]
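The surprise reads during a pure write workload are often the allocator pulling space maps in from disk. A quick way to check which kernel paths are issuing those reads is to aggregate I/O stacks while dd is running; this one-liner is a sketch, and the 30-second window and top-5 truncation are arbitrary choices:

dtrace -n 'io:::start { @[stack(20)] = count(); }
    tick-30s { trunc(@, 5); printa(@); exit(0); }'

If space_map_load shows up near the top of those stacks, the reads are metaslab space maps being loaded while the allocator hunts for free segments.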
>> One day, the write performance of zfs degraded.
>> Sequential write throughput dropped from 60MB/s to about 6MB/s.
>>
>> Command:
>> date;dd if=/dev/zero of=block bs=1024*128 count=10000;date

See this thread:

http://www.opensolaris.org/jive/thread.jspa?threadID=139317&tstart=45

And search in the page for:
"metaslab_min_alloc_size"

Try adjusting that tunable and see if it fixes your performance problem.

-Don
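On builds where the tunable exists, it can also be made persistent across reboots via /etc/system. This is a sketch under the assumption that the symbol lives in the zfs module; verify it exists on your kernel before relying on it:

* /etc/system: lower metaslab_min_alloc_size to 4K (0x1000)
set zfs:metaslab_min_alloc_size = 0x1000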
On 08 June, 2011 - Donald Stahl sent me these 0,6K bytes:

>>> One day, the write performance of zfs degraded.
>>> Sequential write throughput dropped from 60MB/s to about 6MB/s.
>>>
>>> Command:
>>> date;dd if=/dev/zero of=block bs=1024*128 count=10000;date
>
> See this thread:
>
> http://www.opensolaris.org/jive/thread.jspa?threadID=139317&tstart=45
>
> And search in the page for:
> "metaslab_min_alloc_size"
>
> Try adjusting that tunable and see if it fixes your performance problem.

And if pool usage is >90%, then there's another problem as well: the
allocator switches to a different algorithm for finding free space.

/Tomas
--
Tomas Ögren, stric at acc.umu.se, http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se
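A quick way to see how close the pool is to that threshold, using the pool name from the iostat output earlier in the thread:

zpool list datapool      # the CAP column shows how full the pool is
zfs list datapool        # dataset-level view of used vs. available space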
Hi, also see:

http://www.mail-archive.com/zfs-discuss at opensolaris.org/msg45408.html

We hit this with Solaris 11, though; I'm not sure whether it can happen on Solaris 10.

Yours
Markus Kovero

-----Original Message-----
From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Ding Honghui
Sent: 8. kesäkuuta 2011 6:07
To: zfs-discuss at opensolaris.org
Subject: [zfs-discuss] Weird write performance problem

Hi,

I have a weird write performance problem and need your help.

One day, the write performance of zfs degraded. Sequential write
throughput dropped from 60MB/s to about 6MB/s.
[...]
On 06/08/2011 12:12 PM, Donald Stahl wrote:
>>> One day, the write performance of zfs degraded.
>>> Sequential write throughput dropped from 60MB/s to about 6MB/s.
>>>
>>> Command:
>>> date;dd if=/dev/zero of=block bs=1024*128 count=10000;date
>
> See this thread:
>
> http://www.opensolaris.org/jive/thread.jspa?threadID=139317&tstart=45
>
> And search in the page for:
> "metaslab_min_alloc_size"
>
> Try adjusting that tunable and see if it fixes your performance problem.
>
> -Don

"metaslab_min_alloc_size" is not used when the block allocator is the dynamic block allocator [1], so it is not a tunable parameter in my case.

Thanks anyway.

[1] http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/metaslab.c#496
For now, I find that a lot of time is spent in the function metaslab_block_picker in metaslab.c. I guess there may be many AVL-tree searches. I am still not sure what causes all the AVL searches, or whether there is any parameter to tune for it.

Any suggestions?

On 06/08/2011 05:57 PM, Markus Kovero wrote:
> Hi, also see:
>
> http://www.mail-archive.com/zfs-discuss at opensolaris.org/msg45408.html
>
> We hit this with Solaris 11, though; I'm not sure whether it can happen on Solaris 10.
>
> Yours
> Markus Kovero
> [...]
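One way to confirm where that time goes, modeled on the zfs_write script earlier in the thread, is to quantize the time spent per allocation call. This is a sketch: metaslab_block_picker and metaslab_df_alloc are static functions in metaslab.c, so their fbt probes may be missing on some builds if the compiler inlined them.

fbt:zfs:metaslab_df_alloc:entry
{
        self->mts = timestamp;
}

fbt:zfs:metaslab_df_alloc:return
/self->mts/
{
        /* nanoseconds spent searching for a free segment */
        @alloc = quantize(timestamp - self->mts);
        self->mts = 0;
}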
On 06/08/2011 04:05 PM, Tomas Ögren wrote:
> On 08 June, 2011 - Donald Stahl sent me these 0,6K bytes:
>
>> See this thread:
>>
>> http://www.opensolaris.org/jive/thread.jspa?threadID=139317&tstart=45
>>
>> And search in the page for:
>> "metaslab_min_alloc_size"
>>
>> Try adjusting that tunable and see if it fixes your performance problem.
>
> And if pool usage is >90%, then there's another problem as well: the
> allocator switches to a different algorithm for finding free space.
>
> /Tomas

Tomas,

Thanks for your suggestion. You are right: some days ago I tuned the parameter metaslab_df_free_pct from 35 down to 4 to reduce this problem. Performance stayed good for about a week, then degraded again.

I am still not sure how many allocations end up using the best-fit policy and how many use the first-fit policy in the current situation.

Any help would be much appreciated.

Regards,
Ding
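For reference, the metaslab_df_free_pct change described above would typically be made like this. A sketch; 0t4 is decimal 4 in mdb notation, and the symbol name should be verified on your kernel first:

echo "metaslab_df_free_pct/D" | mdb -k        # read the current value
echo "metaslab_df_free_pct/W 0t4" | mdb -kw   # lower the first-fit/best-fit switch point to 4% free

It can be made persistent with a line in /etc/system:

set zfs:metaslab_df_free_pct = 4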
On 06/08/2011 09:15 PM, Donald Stahl wrote:
>> "metaslab_min_alloc_size" is not used when the block allocator is the
>> dynamic block allocator [1], so it is not a tunable parameter in my case.
>
> May I ask where it says this is not a tunable in that case? I've read
> through the code and I don't see what you are talking about.
>
> The problem you are describing, including the "long time in function
> metaslab_block_picker", exactly matches the block picker trying to find
> a large enough block and failing.
>
> What value do you get when you run:
> echo "metaslab_min_alloc_size/K" | mdb -kw
> ?
>
> You can always try setting it via:
> echo "metaslab_min_alloc_size/Z 1000" | mdb -kw
>
> and if that doesn't work, set it right back.
>
> I'm not familiar with the specifics of Solaris 10u8, so perhaps this is
> not a tunable in that version, but if it is, I would suggest you try
> changing it. If your performance is as bad as you say, then it can't
> hurt to try.
>
> -Don

Thanks very much, Don.

In Solaris 10u8:

root at nas-hz-01:~# uname -a
SunOS nas-hz-01 5.10 Generic_141445-09 i86pc i386 i86pc
root at nas-hz-01:~# echo "metaslab_min_alloc_size/K" | mdb -kw
mdb: failed to dereference symbol: unknown symbol name
root at nas-hz-01:~#

The pool version is 15 and the zfs version is 4.

The parameter does exist on my OpenIndiana build 148 machine, whose zpool version is 28 and zfs version is 5:

ops at oi:~$ echo "metaslab_min_alloc_size/Z 1000" | pfexec mdb -kw
metaslab_min_alloc_size:        0x1000          =       0x1000
ops at oi:~$

I'm not sure which build introduced this parameter. Should I move this pool to OpenIndiana? Any suggestions?

Regards,
Ding
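Whether a given tunable exists on a given kernel can also be checked directly against the on-disk zfs module, without guessing in mdb. A sketch, assuming the usual 64-bit module path on x86:

nm /kernel/fs/amd64/zfs | grep metaslab_min_alloc_size

If nothing comes back, the variable simply is not present in that build.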
> In Solaris 10u8:
> root at nas-hz-01:~# uname -a
> SunOS nas-hz-01 5.10 Generic_141445-09 i86pc i386 i86pc
> root at nas-hz-01:~# echo "metaslab_min_alloc_size/K" | mdb -kw
> mdb: failed to dereference symbol: unknown symbol name

Fair enough. I don't have anything older than b147 at this point, so I wasn't sure if that was in there or not.

If you delete a bunch of data (perhaps old files you have lying around), does your performance go back up, even if only temporarily?

The problem we had matches your description word for word. All of a sudden we had terrible write performance with a ton of time spent in the metaslab allocator. Then we'd delete a big chunk of data (100 gigs or so) and poof: performance would get better for a short while.

Several people suggested changing the allocation free percent from 30 to 4, but that change was already incorporated into the b147 box we were testing. The only thing that made a difference (and I mean a night and day difference) was the change above.

That said, I have no idea how that part of the code works in 10u8.

-Don
On 06/08/11 01:05, Tomas Ögren wrote:
> And if pool usage is >90%, then there's another problem as well: the
> allocator switches to a different algorithm for finding free space.

Another (less satisfying) workaround is to increase the amount of free space in the pool, either by reducing usage or by adding more storage. The observed behavior is that allocation is fast until usage crosses a threshold, then performance hits a wall.

I have a small sample size (maybe 2-3 pools), but the threshold point varies from pool to pool and tends to be consistent for a given pool. I suspect some artifact of layout/fragmentation is at play. I've seen things hit the wall at as low as 70% full on one pool.

The original poster's pool is about 78% full. If possible, try freeing stuff until usage goes back under 75% or 70% and see if your performance returns.
> Another (less satisfying) workaround is to increase the amount of free space
> in the pool, either by reducing usage or by adding more storage. The observed
> behavior is that allocation is fast until usage crosses a threshold, then
> performance hits a wall.

We actually tried this solution. We were at 70% usage and performance hit a wall. We figured it was because of the change of fit algorithm, so we added 16 2TB disks in mirrors (adding 16TB to an 18TB pool). It made almost no difference in our pool performance. It wasn't until we told the metaslab allocator to stop looking for such large chunks that the problem went away.

> The original poster's pool is about 78% full. If possible, try freeing
> stuff until usage goes back under 75% or 70% and see if your performance
> returns.

Freeing stuff did fix the problem for us (temporarily), but only in an indirect way. When we freed up a bunch of space, the metaslab allocator was able to find large enough blocks to write to without searching all over the place. That fixed the performance problem until those large free blocks got used up. Then, even though we were below the earlier usage threshold, we would still have the performance problem.

-Don
On 06/09/2011 12:23 AM, Donald Stahl wrote:
> We actually tried this solution. We were at 70% usage and performance
> hit a wall. We figured it was because of the change of fit algorithm,
> so we added 16 2TB disks in mirrors (adding 16TB to an 18TB pool). It
> made almost no difference in our pool performance. It wasn't until we
> told the metaslab allocator to stop looking for such large chunks that
> the problem went away.
>
> Freeing stuff did fix the problem for us (temporarily), but only in an
> indirect way. When we freed up a bunch of space, the metaslab
> allocator was able to find large enough blocks to write to without
> searching all over the place. That fixed the performance problem
> until those large free blocks got used up. Then, even though we were
> below the earlier usage threshold, we would still have the performance
> problem.
>
> -Don

Don,

From your description, my symptom is almost the same as yours.

We examined the metaslab layout. When metaslab_df_free_pct was 35, there were 65 completely free metaslabs (64G each), the write performance was very low, and a rough test showed that no new free metaslab was being loaded and activated.

Then we tuned metaslab_df_free_pct down to 4. Performance stayed good for one week and the number of free metaslabs dropped to 51. But now the write bandwidth is poor again (maybe I should trace the free space of each metaslab?).

Maybe there is a problem in the metaslab rating score (weight) used to select a metaslab, or in the block allocator algorithm?

Here is a snapshot of the metaslab layout; the last 51 metaslabs each have 64G free:

vdev          offset                 spacemap         free
----------    -------------------    ---------------  -------------
... snip
vdev 3        offset 27000000000     spacemap 440     free 21.0G
vdev 3        offset 28000000000     spacemap  31     free 7.36G
vdev 3        offset 29000000000     spacemap  32     free 2.44G
vdev 3        offset 2a000000000     spacemap  33     free 2.91G
vdev 3        offset 2b000000000     spacemap  34     free 3.25G
vdev 3        offset 2c000000000     spacemap  35     free 3.03G
vdev 3        offset 2d000000000     spacemap  36     free 3.20G
vdev 3        offset 2e000000000     spacemap  90     free 3.28G
vdev 3        offset 2f000000000     spacemap  91     free 2.46G
vdev 3        offset 30000000000     spacemap  92     free 2.98G
vdev 3        offset 31000000000     spacemap  93     free 2.19G
vdev 3        offset 32000000000     spacemap  94     free 2.42G
vdev 3        offset 33000000000     spacemap  95     free 2.83G
vdev 3        offset 34000000000     spacemap 252     free 41.6G
vdev 3        offset 35000000000     spacemap   0     free 64G
vdev 3        offset 36000000000     spacemap   0     free 64G
vdev 3        offset 37000000000     spacemap   0     free 64G
vdev 3        offset 38000000000     spacemap   0     free 64G
vdev 3        offset 39000000000     spacemap   0     free 64G
vdev 3        offset 3a000000000     spacemap   0     free 64G
vdev 3        offset 3b000000000     spacemap   0     free 64G
vdev 3        offset 3c000000000     spacemap   0     free 64G
vdev 3        offset 3d000000000     spacemap   0     free 64G
vdev 3        offset 3e000000000     spacemap   0     free 64G
...snip
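For anyone wanting to reproduce this kind of layout dump, zdb can print per-metaslab space map usage. A sketch, using the pool name from earlier in the thread; the doubled -m form adds per-segment detail and can take a long time on a pool this size:

zdb -m datapool      # one line per metaslab: vdev, offset, spacemap object, free space
zdb -mm datapool     # also dumps the individual space map segments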
On 06/09/2011 10:14 AM, Ding Honghui wrote:
> Don,
>
> From your description, my symptom is almost the same as yours.
>
> [...] Here is a snapshot of the metaslab layout; the last 51 metaslabs
> each have 64G free. [...]
I freed up some disk space (about 300GB) and the performance is back again. I'm sure it will degrade again soon.
> Here is a snapshot of the metaslab layout; the last 51 metaslabs each
> have 64G free.

After we added all the disks to our system we had lots of free metaslabs, but that didn't seem to matter. I don't know if perhaps the system was attempting to balance the writes across more of our devices, but whatever the reason, the percentage didn't seem to matter. All that mattered was changing the size of the min_alloc tunable.

You seem to have gotten a lot deeper into some of this analysis than I did, so I'm not sure I can really add anything. Since 10u8 doesn't support that tunable, I'm not really sure where to go from there.

If you can take the pool offline, you might try connecting it to a b148 box and see if that tunable makes a difference. Beyond that I don't really have any suggestions.

Your problem description, including the return of performance when freeing space, is _identical_ to the problem we had. After checking every single piece of hardware, replacing countless pieces, and removing COMSTAR and other pieces from the puzzle, the only change that helped was changing that tunable.

I wish I could be of more help, but I have not had the time to dive into the ZFS code with any gusto.

-Don
Have you determined this is not bug 7000208? It sounds much like it. You could run:

/usr/sbin/lockstat -HcwP -n 100000 -x aggrate=10hz -D 20 -s 40 sleep 2
/usr/sbin/lockstat -CcwP -n 100000 -x aggrate=10hz -D 20 -s 40 sleep 2

to find the hottest callers (space_map_load, kmem_cache_free) while the issue is occurring.

Yours
Markus Kovero

-----Original Message-----
From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Donald Stahl
Sent: 9. kesäkuuta 2011 6:27
To: Ding Honghui
Cc: zfs-discuss at opensolaris.org
Subject: Re: [zfs-discuss] Weird write performance problem

> Here is a snapshot of the metaslab layout; the last 51 metaslabs each
> have 64G free.

After we added all the disks to our system we had lots of free metaslabs, but that didn't seem to matter.
[...]
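Alongside lockstat, a simple kernel profile makes a hot space_map_load path stand out immediately. A sketch; the 30-second window and top-10 truncation are arbitrary choices:

dtrace -n 'profile-997 /arg0/ { @[stack()] = count(); }
    tick-30s { trunc(@, 10); printa(@); exit(0); }'

Stacks dominated by space_map_load and AVL lookups would point in the same direction as the bug mentioned above.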