Arne Jansen
2010-Sep-09 15:00 UTC
[zfs-discuss] NFS performance near zero on a very full pool
Hi,

currently I'm trying to debug a very strange phenomenon on a nearly full pool (96%). Here are the symptoms: over NFS, a find on the pool takes a very long time, up to 30s (!) for each file. Locally, the performance is quite normal.

What I have found out so far: it seems that every NFS write (rfs3_write) blocks until the txg is flushed, which means a write can take up to 30 seconds. During this time the NFS calls block, occupying all NFS server threads. With all server threads blocked, all other ops (LOOKUP, GETATTR, ...) have to wait until the writes finish, bringing the performance of the server effectively down to zero.

The trigger for this behavior may be around 95%: I managed to bring the pool down to 95%, and now the writes get served continuously, as they should be.

What is the explanation for this behaviour? Is it intentional, and can the threshold be tuned? I experienced this on Sol10 U8.

Thanks,
Arne
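The thread-pool starvation described above can be illustrated with a small simulation (Python, purely illustrative and not the actual NFS server code; the thread count and delay are made-up stand-ins for the real values):

```python
import time
from concurrent.futures import ThreadPoolExecutor

NUM_NFS_THREADS = 4   # stand-in for the server's fixed thread pool
TXG_FLUSH = 0.2       # stand-in for the ~30 s txg flush

def sync_write(i):
    # Each write blocks until the txg is flushed.
    time.sleep(TXG_FLUSH)
    return ("WRITE", i)

def lookup(i):
    # A LOOKUP is cheap by itself...
    return ("LOOKUP", i)

pool = ThreadPoolExecutor(max_workers=NUM_NFS_THREADS)
t0 = time.monotonic()
# Enough blocking writes to occupy every server thread.
writes = [pool.submit(sync_write, i) for i in range(NUM_NFS_THREADS)]
lk = pool.submit(lookup, 0)   # ...but it has to queue behind the writers
lk.result()
elapsed = time.monotonic() - t0
pool.shutdown()
print("LOOKUP waited %.2fs" % elapsed)
```

The cheap LOOKUP only completes after a full write delay, which is the effect seen on the real server: read-side ops stall for the length of the txg flush even though they do no I/O themselves.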
Neil Perrin
2010-Sep-09 16:36 UTC
[zfs-discuss] NFS performance near zero on a very full pool
Arne,

NFS often demands its transactions are stable before returning. This forces ZFS to perform the system call synchronously. Usually the ZIL (code) allocates and writes a new block in the intent log chain to achieve this. If it ever fails to allocate a block (of the size requested), it is forced to close the txg containing the system call. Yes, this can be extremely slow, but there is no other option for the ZIL. I'm surprised the wait is 30 seconds; I would expect much less, but finding room for the rest of the txg data and metadata would also be a challenge.

Most (maybe all?) file systems perform badly when out of space. I believe we give a recommended free threshold, and I thought it was 90%.

Neil.

On 09/09/10 09:00, Arne Jansen wrote:
> Hi,
>
> currently I'm trying to debug a very strange phenomenon on a nearly full
> pool (96%). Here are the symptoms: over NFS, a find on the pool takes
> a very long time, up to 30s (!) for each file. Locally, the performance
> is quite normal.
> What I found out so far: It seems that every nfs write (rfs3_write) blocks
> until the txg is flushed. This means a write takes up to 30 seconds. During
> this time, the nfs calls block, occupying all NFS server threads. With all
> server threads blocked, all other OPs (LOOKUP, GETATTR, ...) have to wait
> until the writes finish, bringing the performance of the server effectively
> down to zero.
> It may be that the trigger for this behavior is around 95%. I managed to bring
> the pool down to 95%, now the writes get served continuously as it should be.
>
> What is the explanation for this behaviour? Is it intentional and can the
> threshold be tuned? I experienced this on Sol10 U8.
>
> Thanks,
> Arne
> _______________________________________________
> zfs-discuss mailing list
> zfs-discuss at opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
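The fallback Neil describes can be sketched roughly as follows (hypothetical Python pseudocode, not the actual ZIL implementation; `ToyPool` and all names are made up for illustration):

```python
class NoSpaceError(Exception):
    pass

class ToyPool:
    """Stand-in for a pool; only models whether log allocation can succeed."""
    def __init__(self, can_allocate):
        self.can_allocate = can_allocate
        self.flushed = False
    def allocate_log_block(self, size):
        if not self.can_allocate:
            raise NoSpaceError(size)
        return ("log-block", size)
    def write_log_block(self, block):
        pass
    def txg_wait_synced(self):
        self.flushed = True

def zil_commit(pool, size):
    """Sketch of the described fallback, not actual ZFS code."""
    try:
        # Normal path: extend the intent log chain with one new block.
        block = pool.allocate_log_block(size)
        pool.write_log_block(block)
        return "zil"    # fast: one log write, then the sync call returns
    except NoSpaceError:
        # No log block of the requested size: the only remaining way to
        # make the write stable is to close and flush the whole txg.
        pool.txg_wait_synced()
        return "txg"    # slow: bounded by the txg interval
```

The point of the sketch is the asymmetry: the fast path costs one small log write, while the fallback costs an entire txg sync, which is where the multi-second latencies come from.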
Neil Perrin
2010-Sep-09 17:00 UTC
[zfs-discuss] NFS performance near zero on a very full pool
I should also have mentioned that if the pool has a separate log device then this shouldn't happen. Assuming the slog is big enough, it should have enough blocks to not be forced into using main-pool device blocks.

Neil.

On 09/09/10 10:36, Neil Perrin wrote:
> Arne,
>
> NFS often demands its transactions are stable before returning.
> This forces ZFS to do the system call synchronously. Usually the
> ZIL (code) allocates and writes a new block in the intent log chain to
> achieve this.
> If it ever fails to allocate a block (of the size requested) it is forced
> to close the txg containing the system call. Yes this can be extremely
> slow but there is no other option for the ZIL. I'm surprised the wait
> is 30 seconds.
> I would expect much less, but finding room for the rest of the txg
> data and metadata would also be a challenge.
>
> Most (maybe all?) file systems perform badly when out of space. I
> believe we give a recommended free size and I thought it was 90%.
>
> Neil.
Arne Jansen
2010-Sep-09 17:09 UTC
[zfs-discuss] NFS performance near zero on a very full pool
Hi Neil,

Neil Perrin wrote:
> NFS often demands its transactions are stable before returning.
> This forces ZFS to do the system call synchronously. Usually the
> ZIL (code) allocates and writes a new block in the intent log chain to
> achieve this.
> If it ever fails to allocate a block (of the size requested) it is forced
> to close the txg containing the system call. Yes this can be extremely
> slow but there is no other option for the ZIL. I'm surprised the wait is
> 30 seconds.
> I would expect much less, but finding room for the rest of the txg data
> and metadata would also be a challenge.

I think this is not what we saw, for two reasons:
a) we have a mirrored slog device. According to zpool iostat -v, only 16MB
   out of 4GB were in use.
b) it didn't seem like the txg was being closed early. Rather, it kept
   approximately to the 30-second intervals.

Internally we came up with a different explanation, without any backing that
it might be correct: when the pool reaches 96%, ZFS goes into a 'self-defense'
mode. Instead of allocating blocks from the ZIL, every write turns synchronous
and has to wait for the txg to finish naturally. The reasoning behind this
might be that even if the ZIL is available, there might not be enough space
left to commit the ZIL to the pool. To prevent this, ZFS doesn't use the ZIL
when the pool is above 96%. While this might be proper for small pools, on
large pools 4% is still some TB of free space, so there should be an upper
limit of maybe 10GB on this hidden reserve. Also, this sudden switch of
behavior is completely unexpected and at least under-documented.

> Most (maybe all?) file systems perform badly when out of space. I
> believe we give a recommended free size and I thought it was 90%.

In this situation not only writes suffered; as a side effect, reads also
came to a nearly complete halt.

-- Arne
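Arne's hypothesis, expressed as a sketch (purely illustrative; the 96% threshold and the ZIL bypass are the conjecture under discussion here, not confirmed ZFS behavior, and all names are made up):

```python
def choose_sync_path(pool_used_pct, slog_present, threshold_pct=96):
    """Hypothesized decision for a sync write; not actual ZFS code."""
    if pool_used_pct >= threshold_pct:
        # Conjecture: above the threshold the ZIL is bypassed entirely,
        # so every sync write waits for the natural txg close (~30 s),
        # regardless of how much room the slog still has.
        return "wait_for_txg"
    if slog_present:
        return "zil_commit"            # microsecond-scale, via the slog
    return "zil_commit_main_pool"      # log blocks from the main pool
```

This would explain both observations: the abrupt change exactly at the threshold, and the fact that the slog (16MB of 4GB used) sat idle while writes stalled.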
Richard Elling
2010-Sep-09 17:43 UTC
[zfs-discuss] NFS performance near zero on a very full pool
On Sep 9, 2010, at 10:09 AM, Arne Jansen wrote:
> I think this is not what we saw, for two reasons:
> a) we have a mirrored slog device. According to zpool iostat -v, only 16MB
>    out of 4GB were in use.
> b) it didn't seem like the txg was being closed early. Rather, it kept
>    approximately to the 30-second intervals.
>
> Internally we came up with a different explanation, without any backing that
> it might be correct: when the pool reaches 96%, ZFS goes into a 'self-defense'
> mode. Instead of allocating blocks from the ZIL, every write turns synchronous
> and has to wait for the txg to finish naturally. The reasoning behind this
> might be that even if the ZIL is available, there might not be enough space
> left to commit the ZIL to the pool. To prevent this, ZFS doesn't use the ZIL
> when the pool is above 96%. While this might be proper for small pools, on
> large pools 4% is still some TB of free space, so there should be an upper
> limit of maybe 10GB on this hidden reserve.

I do not believe this is correct. At 96% the first-fit algorithm changes to
best-fit, and ganging can be expected. This has nothing to do with the ZIL.
There is already a reserve set aside for metadata and the ZIL so that you can
remove files when the file system is 100% full. This reserve is 32 MB or 1/64
of the pool size.

> Also this sudden switch of behavior is completely unexpected and at least
> under-documented.

Methinks you are just seeing the change in performance from the allocation
algorithm change.

>> Most (maybe all?) file systems perform badly when out of space. I believe
>> we give a recommended free size and I thought it was 90%.
>
> In this situation, not only writes suffered, but as a side effect reads also
> came to a nearly complete halt.

If you have atime=on, then reads create writes.
 -- richard

--
OpenStorage Summit, October 25-27, Palo Alto, CA
http://nexenta-summit2010.eventbrite.com
ZFS and performance consulting
http://www.RichardElling.com
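For scale, the figures mentioned here work out as follows (a sketch; the assumption that the two numbers combine as "the larger of a 32 MB floor and 1/64 of the pool size" is mine, not confirmed by the thread):

```python
MB = 1024**2
GB = 1024**3
TB = 1024**4

def pool_reserve_bytes(pool_size):
    # Assumption: the reserve is whichever is larger,
    # the 32 MB floor or 1/64 of the pool size.
    return max(32 * MB, pool_size // 64)

# Arne's point about large pools: at the 96% threshold, a 100 TB pool
# still has 4% = 4 TB free...
free_at_threshold = 100 * TB * 4 // 100
# ...while the metadata/ZIL reserve on that same pool is 1/64 of it.
reserve = pool_reserve_bytes(100 * TB)
print(free_at_threshold // TB, reserve // GB)
```

So on a small pool the 32 MB floor dominates, while on a TB-scale pool the 1/64 reserve is orders of magnitude larger than 32 MB, and the space left free at 96% is larger still, which is what motivates the request for an upper cap.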
Arne Jansen
2010-Sep-09 17:56 UTC
[zfs-discuss] NFS performance near zero on a very full pool
Richard Elling wrote:
> I do not believe this is correct. At 96% the first-fit algorithm changes to
> best-fit, and ganging can be expected. This has nothing to do with the ZIL.
> There is already a reserve set aside for metadata and the ZIL so that you
> can remove files when the file system is 100% full. This reserve is 32 MB
> or 1/64 of the pool size.

Maybe it is some side effect of this change of allocation scheme, but I'm very
sure about what I saw. The change was drastic and abrupt. I had a dtrace script
running that measured the time for rfs3_write to complete. With the pool >96%
I saw a burst of writes every 30 seconds, with completion times of up to 30s.
With the pool <96%, I saw a continuous stream of writes with completion times
of mostly a few microseconds.

> If you have atime=on, then reads create writes.

atime is off. The impact on reads/lookups/getattrs came, imho, because all
server threads had been occupied by blocking writes for a prolonged time.

I'll try to reproduce this on a test machine.

-- Arne