Arne Jansen
2010-Sep-09 15:00 UTC
[zfs-discuss] NFS performance near zero on a very full pool
Hi,

currently I'm trying to debug a very strange phenomenon on a nearly full pool (96%). Here are the symptoms: over NFS, a find on the pool takes a very long time, up to 30s (!) for each file. Locally, the performance is quite normal.

What I have found out so far: it seems that every NFS write (rfs3_write) blocks until the txg is flushed, which means a write can take up to 30 seconds. During this time the NFS calls block, occupying all NFS server threads. With all server threads blocked, all other ops (LOOKUP, GETATTR, ...) have to wait until the writes finish, bringing the performance of the server effectively down to zero.

The trigger for this behavior may be around 95%: I managed to bring the pool down to 95%, and now the writes get served continuously, as they should be.

What is the explanation for this behaviour? Is it intentional, and can the threshold be tuned? I experienced this on Sol10 U8.

Thanks,
Arne
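The thread-pool starvation described above can be illustrated with a small simulation (Python, purely illustrative and not the actual NFS server code; the thread count and delay are made-up stand-ins for the real values):

```python
import time
from concurrent.futures import ThreadPoolExecutor

NUM_NFS_THREADS = 4   # stand-in for the server's fixed thread pool
TXG_FLUSH = 0.2       # stand-in for the ~30 s txg flush

def sync_write(i):
    # Each write blocks until the txg is flushed.
    time.sleep(TXG_FLUSH)
    return ("WRITE", i)

def lookup(i):
    # A LOOKUP is cheap by itself...
    return ("LOOKUP", i)

pool = ThreadPoolExecutor(max_workers=NUM_NFS_THREADS)
t0 = time.monotonic()
# Enough blocking writes to occupy every server thread.
writes = [pool.submit(sync_write, i) for i in range(NUM_NFS_THREADS)]
lk = pool.submit(lookup, 0)   # ...but it has to queue behind the writers
lk.result()
elapsed = time.monotonic() - t0
pool.shutdown()
print("LOOKUP waited %.2fs" % elapsed)
```

The cheap LOOKUP only completes after a full write delay, which is the effect seen on the real server: read-side ops stall for the length of the txg flush even though they do no I/O themselves.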
Neil Perrin
2010-Sep-09 16:36 UTC
[zfs-discuss] NFS performance near zero on a very full pool
Arne,

NFS often demands its transactions are stable before returning. This forces ZFS to perform the system call synchronously. Usually the ZIL (code) allocates and writes a new block in the intent log chain to achieve this. If it ever fails to allocate a block (of the size requested), it is forced to close the txg containing the system call. Yes, this can be extremely slow, but there is no other option for the ZIL. I'm surprised the wait is 30 seconds; I would expect much less, but finding room for the rest of the txg data and metadata would also be a challenge.

Most (maybe all?) file systems perform badly when out of space. I believe we give a recommended free threshold, and I thought it was 90%.

Neil.

On 09/09/10 09:00, Arne Jansen wrote:
> Hi,
>
> currently I'm trying to debug a very strange phenomenon on a nearly full
> pool (96%). Here are the symptoms: over NFS, a find on the pool takes
> a very long time, up to 30s (!) for each file. Locally, the performance
> is quite normal.
> What I found out so far: It seems that every nfs write (rfs3_write) blocks
> until the txg is flushed. This means a write takes up to 30 seconds. During
> this time, the nfs calls block, occupying all NFS server threads. With all
> server threads blocked, all other OPs (LOOKUP, GETATTR, ...) have to wait
> until the writes finish, bringing the performance of the server effectively
> down to zero.
> It may be that the trigger for this behavior is around 95%. I managed to bring
> the pool down to 95%, now the writes get served continuously as it should be.
>
> What is the explanation for this behaviour? Is it intentional and can the
> threshold be tuned? I experienced this on Sol10 U8.
>
> Thanks,
> Arne
> _______________________________________________
> zfs-discuss mailing list
> zfs-discuss at opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
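The fallback Neil describes can be sketched roughly as follows (hypothetical Python pseudocode, not the actual ZIL implementation; `ToyPool` and all names are made up for illustration):

```python
class NoSpaceError(Exception):
    pass

class ToyPool:
    """Stand-in for a pool; only models whether log allocation can succeed."""
    def __init__(self, can_allocate):
        self.can_allocate = can_allocate
        self.flushed = False
    def allocate_log_block(self, size):
        if not self.can_allocate:
            raise NoSpaceError(size)
        return ("log-block", size)
    def write_log_block(self, block):
        pass
    def txg_wait_synced(self):
        self.flushed = True

def zil_commit(pool, size):
    """Sketch of the described fallback, not actual ZFS code."""
    try:
        # Normal path: extend the intent log chain with one new block.
        block = pool.allocate_log_block(size)
        pool.write_log_block(block)
        return "zil"    # fast: one log write, then the sync call returns
    except NoSpaceError:
        # No log block of the requested size: the only remaining way to
        # make the write stable is to close and flush the whole txg.
        pool.txg_wait_synced()
        return "txg"    # slow: bounded by the txg interval
```

The point of the sketch is the asymmetry: the fast path costs one small log write, while the fallback costs an entire txg sync, which is where the multi-second latencies come from.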
Neil Perrin
2010-Sep-09 17:00 UTC
[zfs-discuss] NFS performance near zero on a very full pool
I should also have mentioned that if the pool has a separate log device then this shouldn't happen. Assuming the slog is big enough, it should have enough blocks to not be forced into using main-pool device blocks.

Neil.

On 09/09/10 10:36, Neil Perrin wrote:
> Arne,
>
> NFS often demands its transactions are stable before returning.
> This forces ZFS to do the system call synchronously. Usually the
> ZIL (code) allocates and writes a new block in the intent log chain to
> achieve this.
> If it ever fails to allocate a block (of the size requested) it is forced
> to close the txg containing the system call. Yes this can be extremely
> slow but there is no other option for the ZIL. I'm surprised the wait
> is 30 seconds.
> I would expect much less, but finding room for the rest of the txg
> data and metadata would also be a challenge.
>
> Most (maybe all?) file systems perform badly when out of space. I
> believe we give a recommended free size and I thought it was 90%.
>
> Neil.
Arne Jansen
2010-Sep-09 17:09 UTC
[zfs-discuss] NFS performance near zero on a very full pool
Hi Neil,

Neil Perrin wrote:
> NFS often demands its transactions are stable before returning.
> This forces ZFS to do the system call synchronously. Usually the
> ZIL (code) allocates and writes a new block in the intent log chain to
> achieve this.
> If it ever fails to allocate a block (of the size requested) it is forced
> to close the txg containing the system call. Yes this can be extremely
> slow but there is no other option for the ZIL. I'm surprised the wait is
> 30 seconds.
> I would expect much less, but finding room for the rest of the txg data
> and metadata would also be a challenge.

I think this is not what we saw, for two reasons:
a) we have a mirrored slog device. According to zpool iostat -v, only 16MB
   out of 4GB were in use.
b) it didn't seem like the txg was being closed early. Rather, it kept
   approximately to the 30-second intervals.

Internally we came up with a different explanation, without any backing that
it might be correct: when the pool reaches 96%, ZFS goes into a 'self-defense'
mode. Instead of allocating blocks from the ZIL, every write turns synchronous
and has to wait for the txg to finish naturally. The reasoning behind this
might be that even if the ZIL is available, there might not be enough space
left to commit the ZIL to the pool. To prevent this, ZFS doesn't use the ZIL
when the pool is above 96%. While this might be proper for small pools, on
large pools 4% is still some TB of free space, so there should be an upper
limit of maybe 10GB on this hidden reserve. Also, this sudden switch of
behavior is completely unexpected and at least under-documented.

> Most (maybe all?) file systems perform badly when out of space. I
> believe we give a recommended free size and I thought it was 90%.

In this situation not only writes suffered; as a side effect, reads also
came to a nearly complete halt.

-- Arne
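Arne's hypothesis, expressed as a sketch (purely illustrative; the 96% threshold and the ZIL bypass are the conjecture under discussion here, not confirmed ZFS behavior, and all names are made up):

```python
def choose_sync_path(pool_used_pct, slog_present, threshold_pct=96):
    """Hypothesized decision for a sync write; not actual ZFS code."""
    if pool_used_pct >= threshold_pct:
        # Conjecture: above the threshold the ZIL is bypassed entirely,
        # so every sync write waits for the natural txg close (~30 s),
        # regardless of how much room the slog still has.
        return "wait_for_txg"
    if slog_present:
        return "zil_commit"            # microsecond-scale, via the slog
    return "zil_commit_main_pool"      # log blocks from the main pool
```

This would explain both observations: the abrupt change exactly at the threshold, and the fact that the slog (16MB of 4GB used) sat idle while writes stalled.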
Richard Elling
2010-Sep-09 17:43 UTC
[zfs-discuss] NFS performance near zero on a very full pool
On Sep 9, 2010, at 10:09 AM, Arne Jansen wrote:
> I think this is not what we saw, for two reasons:
> a) we have a mirrored slog device. According to zpool iostat -v, only 16MB
>    out of 4GB were in use.
> b) it didn't seem like the txg was being closed early. Rather, it kept
>    approximately to the 30-second intervals.
>
> Internally we came up with a different explanation, without any backing that
> it might be correct: when the pool reaches 96%, ZFS goes into a 'self-defense'
> mode. Instead of allocating blocks from the ZIL, every write turns synchronous
> and has to wait for the txg to finish naturally. The reasoning behind this
> might be that even if the ZIL is available, there might not be enough space
> left to commit the ZIL to the pool. To prevent this, ZFS doesn't use the ZIL
> when the pool is above 96%. While this might be proper for small pools, on
> large pools 4% is still some TB of free space, so there should be an upper
> limit of maybe 10GB on this hidden reserve.

I do not believe this is correct. At 96% the first-fit algorithm changes to
best-fit, and ganging can be expected. This has nothing to do with the ZIL.
There is already a reserve set aside for metadata and the ZIL so that you can
remove files when the file system is 100% full. This reserve is 32 MB or 1/64
of the pool size.

> Also this sudden switch of behavior is completely unexpected and at least
> under-documented.

Methinks you are just seeing the change in performance from the allocation
algorithm change.

>> Most (maybe all?) file systems perform badly when out of space. I believe
>> we give a recommended free size and I thought it was 90%.
>
> In this situation, not only writes suffered, but as a side effect reads also
> came to a nearly complete halt.

If you have atime=on, then reads create writes.
 -- richard

--
OpenStorage Summit, October 25-27, Palo Alto, CA
http://nexenta-summit2010.eventbrite.com
ZFS and performance consulting
http://www.RichardElling.com
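For scale, the figures mentioned here work out as follows (a sketch; the assumption that the two numbers combine as "the larger of a 32 MB floor and 1/64 of the pool size" is mine, not confirmed by the thread):

```python
MB = 1024**2
GB = 1024**3
TB = 1024**4

def pool_reserve_bytes(pool_size):
    # Assumption: the reserve is whichever is larger,
    # the 32 MB floor or 1/64 of the pool size.
    return max(32 * MB, pool_size // 64)

# Arne's point about large pools: at the 96% threshold, a 100 TB pool
# still has 4% = 4 TB free...
free_at_threshold = 100 * TB * 4 // 100
# ...while the metadata/ZIL reserve on that same pool is 1/64 of it.
reserve = pool_reserve_bytes(100 * TB)
print(free_at_threshold // TB, reserve // GB)
```

So on a small pool the 32 MB floor dominates, while on a TB-scale pool the 1/64 reserve is orders of magnitude larger than 32 MB, and the space left free at 96% is larger still, which is what motivates the request for an upper cap.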
Arne Jansen
2010-Sep-09 17:56 UTC
[zfs-discuss] NFS performance near zero on a very full pool
Richard Elling wrote:
> I do not believe this is correct. At 96% the first-fit algorithm changes to
> best-fit, and ganging can be expected. This has nothing to do with the ZIL.
> There is already a reserve set aside for metadata and the ZIL so that you
> can remove files when the file system is 100% full. This reserve is 32 MB
> or 1/64 of the pool size.

Maybe it is some side effect of this change of allocation scheme, but I'm very
sure about what I saw. The change was drastic and abrupt. I had a dtrace script
running that measured the time for rfs3_write to complete. With the pool >96%
I saw a burst of writes every 30 seconds, with completion times of up to 30s.
With the pool <96%, I saw a continuous stream of writes with completion times
of mostly a few microseconds.

> If you have atime=on, then reads create writes.

atime is off. The impact on reads/lookups/getattrs came, imho, because all
server threads had been occupied by blocking writes for a prolonged time.

I'll try to reproduce this on a test machine.

-- Arne