Hello, I'm investigating a problem with ZFS over NFS. The problems started about 2 weeks ago; most NFS threads are hanging in txg_wait_open. The sync thread is consuming one processor all the time, and the average spa_sync time from entry to return is 2 minutes. I can't use dtrace to examine the problem, because I keep getting:

dtrace: processing aborted: Abort due to systemic unresponsiveness

Using mdb and examining tx_sync_thread with ::findstack I keep getting this stack:

fffffe8002da1410 _resume_from_idle+0xf8()
fffffe8002da1570 avl_walk+0x39()
fffffe8002da15a0 space_map_alloc+0x21()
fffffe8002da1620 metaslab_group_alloc+0x1a2()
fffffe8002da16b0 metaslab_alloc_dva+0xab()
fffffe8002da1700 metaslab_alloc+0x51()
fffffe8002da1720 zio_dva_allocate+0x3f()
fffffe8002da1730 zio_next_stage+0x72()
fffffe8002da1750 zio_checksum_generate+0x5f()
fffffe8002da1760 zio_next_stage+0x72()
fffffe8002da17b0 zio_write_compress+0x136()
fffffe8002da17c0 zio_next_stage+0x72()
fffffe8002da17f0 zio_wait_for_children+0x49()
fffffe8002da1800 zio_wait_children_ready+0x15()
fffffe8002da1810 zio_next_stage_async+0xae()
fffffe8002da1820 zio_nowait+9()
fffffe8002da18b0 arc_write+0xe7()
fffffe8002da19a0 dbuf_sync+0x274()
fffffe8002da1a10 dnode_sync+0x2e3()
fffffe8002da1a60 dmu_objset_sync_dnodes+0x7b()
fffffe8002da1af0 dmu_objset_sync+0x6a()
fffffe8002da1b10 dsl_dataset_sync+0x23()
fffffe8002da1b60 dsl_pool_sync+0x7b()
fffffe8002da1bd0 spa_sync+0x116()

I also managed to sum the metaslab space maps:

::walk spa | ::walk metaslab | ::print struct metaslab ms_smo.smo_objsize

and I got 1GB.

I have a 1.3T pool with 500G of available space. The pool was created about 3 months ago. I'm using Solaris 10 U3.

Do you think changing the system to Nevada will help? I read that there are some changes that can help:

http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6512391
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6532056
Prabahar Jeyaram
2007-Jul-05 17:22 UTC
[zfs-discuss] ZFS performance and memory consumption
This system exhibits the symptoms of:
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6495013

Moving to Nevada would certainly help, as it has many more bug fixes and performance improvements over S10U3.

--
Prabahar.

Łukasz wrote:
> Hello, I'm investigating a problem with ZFS over NFS.
> [...]
The field ms_smo.smo_objsize in the metaslab struct is the size of the space map data on disk. I checked the size of the metaslabs in memory:

::walk spa | ::walk metaslab | ::print struct metaslab ms_map.sm_root.avl_numnodes

and I got 1GB. But only some metaslabs are loaded:

::walk spa | ::walk metaslab | ::print struct metaslab ms_map.sm_root.avl_numnodes ! grep "0x" | wc -l
231

out of 664 metaslabs, and the number of loaded metaslabs changes very fast.

Is there a way to keep all metaslabs in RAM? Is there any limit?

I encourage other administrators to check the size of their space maps.
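A rough way to turn those avl_numnodes counts into bytes is sketched below. It is only a back-of-the-envelope estimate: the 40-64 bytes per segment is an assumed cost for a space segment node plus allocator overhead, not a figure taken from the source.

/*
 * Back-of-the-envelope sketch (not ZFS code): turn the avl_numnodes values
 * printed by the mdb pipeline above into an approximate RAM figure.  Feed
 * it the hex counts on stdin (strip the field names first, e.g. with
 * awk '{print $NF}').  The 40-64 bytes per segment is an assumption, not a
 * number from the source.
 */
#include <stdio.h>

int
main(void)
{
	unsigned long long numnodes, segments = 0;

	while (scanf("%llx", &numnodes) == 1)
		segments += numnodes;

	printf("free-space segments in memory: %llu\n", segments);
	printf("approx. RAM: %llu - %llu MB\n",
	    segments * 40 >> 20, segments * 64 >> 20);
	return (0);
}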
After a few hours with dtrace and source code browsing I found that in my space map there are no 128K blocks left. Try this on your ZFS:

dtrace -n fbt::metaslab_group_alloc:return'/arg1 == -1/{}'

If you get probe firings, then you have the same problem. Allocating from the space map works like this:

1. metaslab_group_alloc wants to allocate a 128K block
2. for (all metaslabs) {
       read the space map and look for a free 128K block;
       if there is no such block, remove the METASLAB_ACTIVE_MASK flag
   }
3. unload the space maps of all metaslabs without METASLAB_ACTIVE_MASK

That is why spa_sync takes so much time.

Now the workaround:

zfs set recordsize=8K pool

Now spa_sync takes 1-2 seconds, the processor is idle, and only a few metaslab space maps are loaded:

> 00000600103ee500::walk metaslab | ::print struct metaslab ms_map.sm_loaded ! grep -c "0x"
3

But now I have another question: how will 8k blocks impact performance?
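To make that search loop concrete, here is a heavily simplified, stand-alone C sketch of the behaviour (toy structures and invented field names, not the actual OpenSolaris metaslab code). The point is that a 128K request ends up loading and then rejecting every metaslab, while a smaller request succeeds immediately.

/*
 * Toy model of the behaviour described above -- NOT the real
 * metaslab_group_alloc().  Names and fields are invented for illustration.
 */
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define METASLAB_ACTIVE_MASK	0x1

struct metaslab {
	uint64_t ms_flags;
	bool     ms_loaded;		/* space map read into memory? */
	uint64_t ms_largest_free;	/* largest free segment in this metaslab */
};

/* Try to find a metaslab that can satisfy 'size'; deactivate those that can't. */
static struct metaslab *
group_alloc(struct metaslab *ms, int count, uint64_t size)
{
	for (int i = 0; i < count; i++) {
		if (!ms[i].ms_loaded)		/* expensive: read space map from disk */
			ms[i].ms_loaded = true;
		if (ms[i].ms_largest_free >= size)
			return (&ms[i]);	/* found a fit */
		ms[i].ms_flags &= ~METASLAB_ACTIVE_MASK;	/* no block of this size here */
	}
	/* Later, sync unloads every space map without METASLAB_ACTIVE_MASK. */
	for (int i = 0; i < count; i++)
		if (!(ms[i].ms_flags & METASLAB_ACTIVE_MASK))
			ms[i].ms_loaded = false;
	return (NULL);	/* every metaslab was loaded and rejected */
}

int
main(void)
{
	struct metaslab pool[3] = {
		{ METASLAB_ACTIVE_MASK, false, 64 * 1024 },
		{ METASLAB_ACTIVE_MASK, false, 96 * 1024 },
		{ METASLAB_ACTIVE_MASK, false, 32 * 1024 },
	};

	/* A 128K request loads and rejects every metaslab; an 8K request succeeds at once. */
	printf("128K: %s\n", group_alloc(pool, 3, 128 * 1024) ? "ok" : "failed");
	printf("  8K: %s\n", group_alloc(pool, 3, 8 * 1024) ? "ok" : "failed");
	return (0);
}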
If you want to know which block sizes you no longer have:

dtrace -n fbt::metaslab_group_alloc:entry'{ self->s = arg1; }' \
       -n fbt::metaslab_group_alloc:return'/arg1 != -1/{ self->s = 0; }' \
       -n fbt::metaslab_group_alloc:return'/self->s && (arg1 == -1)/{ @s = quantize(self->s); self->s = 0; }' \
       -n tick-10s'{ printa(@s); }'

and which block sizes you no longer have in individual metaslabs:

dtrace -n fbt::space_map_alloc:entry'{ self->s = arg1; }' \
       -n fbt::space_map_alloc:return'/arg1 != -1/{ self->s = 0; }' \
       -n fbt::space_map_alloc:return'/self->s && (arg1 == -1)/{ @s = quantize(self->s); self->s = 0; }' \
       -n tick-10s'{ printa(@s); }'

If the metaslab_group_alloc distribution looks like this:

           value  ------------- Distribution ------------- count
           65536 |                                         0
          131072 |@@@@@@@@@@                                9065
          262144 |                                         0

then you can set the zfs recordsize to 64k.
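As a toy illustration of that rule of thumb (purely an assumption about how to read the quantize output, nothing taken from ZFS itself): if allocations of a given size keep failing, drop the recordsize one power of two below it, within the 512-byte to 128K range that recordsize accepts.

/*
 * Toy helper, not ZFS code: map the smallest failing allocation size seen
 * in the quantize output to a suggested recordsize one power of two below.
 */
#include <stdio.h>
#include <stdint.h>

static uint64_t
suggest_recordsize(uint64_t failing_size)
{
	uint64_t rs = failing_size >> 1;

	if (rs < 512)
		rs = 512;
	if (rs > 128 * 1024)
		rs = 128 * 1024;
	return (rs);
}

int
main(void)
{
	/* 128K allocations failing in the distribution above -> try 64K. */
	printf("suggested recordsize: %lluK\n",
	    (unsigned long long)(suggest_recordsize(131072) / 1024));
	return (0);
}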
Victor Latushkin
2007-Jul-06 17:55 UTC
[zfs-discuss] ZFS performance and memory consumption
Łukasz wrote:
> After a few hours with dtrace and source code browsing I found that in my space map there are no 128K blocks left.

Actually you may have some 128k or larger free space segments, but the alignment requirements will not allow them to be allocated. Consider the following example:

1. The space map starts at 0 and its size is 256KB.
2. There are two 512-byte blocks allocated from the space map - one at the beginning, another one at the end - so the space map contains exactly 1 space segment with start 512 and size 255k.

Let's allocate a 128k block from such a space map. avl_find() will return this space segment, then we calculate the offset for the segment we are going to allocate:

align = size & -size = 128k & -128k = 0x20000 & 0xfffffffffffe0000 = 0x20000 = 128k

offset = P2ROUNDUP(ss->ss_start, align)
       = P2ROUNDUP(512, 128k)
       = -(-(512) & -(128k))
       = -(0xfffffffffffffe00 & 0xfffffffffffe0000)
       = -(0xfffffffffffe0000)
       = -(-128k)
       = 128k

Then we check whether offset + size is less than or equal to the space segment end, which is not true in this case:

offset + size = 128k + 128k = 256k > 255.5k = ss->ss_end

So even though we have 255k free in a contiguous space segment, we cannot allocate a 128k block out of it due to the alignment requirements.

What is the reason for such alignment requirements? I can see at least two:

a) reduce the number of search locations for big blocks, to reduce the number of iterations in the 'while' loop inside metaslab_ff_alloc();

b) since we use cursors to remember where the last allocated block of each size ended, this helps ensure that allocations of smaller sizes have a chance not to loop in the 'while' loop inside metaslab_ff_alloc().

There may be other reasons also.

Bug 6495013 "Loops and recursion in metaslab_ff_alloc can kill performance, even on a pool with lots of free data" fixes a rather nasty race condition which further degrades metaslab_ff_alloc() performance on a fragmented pool:

http://src.opensolaris.org/source/diff/onnv/onnv-gate/usr/src/uts/common/fs/zfs/metaslab.c?r2=3848&r1=3713

But the loops (and recursion) are still there.

> Now the workaround:
> zfs set recordsize=8K pool

Good idea, but it may have some drawbacks.

> But now I have another question.
> How will 8k blocks impact performance?

First of all, you will need more block pointers to address the same amount of data, which is not good if your files are big and mostly static. If the files change frequently this may increase fragmentation further...

Victor.
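If it helps to see the arithmetic run, here is a small stand-alone C program reproducing the example (just an illustration of the calculation above, not ZFS code; P2ROUNDUP is written exactly as it is expanded in the calculation, and the segment values are the ones from the example - a 0-256K space map with 512 bytes allocated at each end).

/*
 * Stand-alone illustration of the alignment example above.
 */
#include <stdio.h>
#include <stdint.h>

#define P2ROUNDUP(x, align)	(-(-(uint64_t)(x) & -(uint64_t)(align)))

int
main(void)
{
	uint64_t size     = 128 * 1024;		/* block we want to allocate */
	uint64_t ss_start = 512;		/* free segment start */
	uint64_t ss_end   = 256 * 1024 - 512;	/* free segment end (255.5k) */

	uint64_t align  = size & -size;			/* 0x20000 = 128k */
	uint64_t offset = P2ROUNDUP(ss_start, align);	/* rounds 512 up to 128k */

	printf("align = 0x%llx, offset = 0x%llx\n",
	    (unsigned long long)align, (unsigned long long)offset);
	printf("offset + size = %llu, ss_end = %llu, fits: %s\n",
	    (unsigned long long)(offset + size), (unsigned long long)ss_end,
	    (offset + size <= ss_end) ? "yes" : "no");
	return (0);
}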
johansen-osdev at sun.com
2007-Jul-06 18:02 UTC
[zfs-discuss] ZFS performance and memory consumption
> But now I have another question.
> How will 8k blocks impact performance?

When tuning recordsize for things like databases, we try to recommend that the customer's recordsize match the I/O size of the database record. I don't think that's the case in your situation.

ZFS is clever enough that changes to recordsize only affect new blocks written to the filesystem. If you're seeing metaslab fragmentation problems now, changing your recordsize to 8k is likely to increase your performance. This is because you're out of free 128k blocks, so using a smaller size lets you make better use of the remaining space. This also means you won't have to iterate through all of the metaslabs looking for a free 128k block.

If you're asking, "How does setting the recordsize to 8k affect performance when I'm not encountering fragmentation?", I would guess that there would be some reduction. However, you can adjust the recordsize once you encounter this problem with the default size.

-j
> When tuning recordsize for things like databases, we try to recommend
> that the customer's recordsize match the I/O size of the database record.

On this filesystem I have:
- file links, and they are rather static
- small files (about 8kB) that keep changing
- big files (1MB - 20MB) used as temporary files (create, write, read, unlink); operations on these files are about 50% of all I/O

I think I need a defragmentation tool. Do you think there will be one?

For now, all I can do is copy the filesystems from this zpool to another one. After that the new zpool will not be fragmented. But it takes time, and I have several zpools like this.