I've started porting a video streaming application to OpenSolaris on ZFS, and am hitting some pretty weird performance issues. The thing I'm trying to do is run 77 concurrent video capture processes (roughly 430 Mbit/s in total), all writing into separate files on a 12 TB J4200 storage array. The disks in the array are arranged into a single RAID-0 ZFS volume (though I've tried different RAID levels, none helped). CPU performance is not an issue (barely hitting 35% utilization on a single quad-core X2250). I/O bottlenecks can also be ruled out, since the storage array's sequential write performance is around 600 MB/s.

The problem is the bursty behavior of ZFS writes. All the capture processes do, in essence, is poll() on a socket and then read() and write() any available data from it to a file. The poll() call is done with a timeout of 250 ms, the expectation being that if no data arrives within 0.25 seconds, the input is dead and recording stops (I tried increasing this value, but the problem still arises, although not as frequently). When ZFS decides that it wants to commit a transaction group to disk (every 30 seconds), the system stalls for a short amount of time, and depending on the number of capture processes currently running, the poll() call (which usually blocks for 1-2 ms) takes on the order of hundreds of ms, sometimes even longer.

I figured that I might be able to resolve this by lowering the txg timeout to something like 1-2 seconds (I need ZFS to write as soon as data arrives, since it will likely never be overwritten), but I couldn't find any tunable parameter for it anywhere on the net. On FreeBSD, I think this can be done via the vfs.zfs.txg_timeout sysctl. A glimpse into the source at http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/txg.c on line 40 made me worry that somebody may have hard-coded this value into the kernel, in which case I'd be pretty much screwed on OpenSolaris.

Any help would be greatly appreciated.

Regards,
--
Saso
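(For concreteness, here is a minimal sketch of the capture loop described above. It is illustrative only; the function name, buffer size and error handling are assumptions, not the actual application code.)

    #include <poll.h>
    #include <unistd.h>

    #define IDLE_TIMEOUT_MS 250             /* input considered dead after 0.25 s */
    #define BUFSZ           (128 * 1024)

    /* One capture process: copy whatever arrives on the socket into the file. */
    static int
    capture_loop(int sock_fd, int file_fd)
    {
            char buf[BUFSZ];
            struct pollfd pfd;

            pfd.fd = sock_fd;
            pfd.events = POLLIN;

            for (;;) {
                    int n = poll(&pfd, 1, IDLE_TIMEOUT_MS);
                    if (n == 0)
                            return (0);     /* timeout: input is dead, stop recording */
                    if (n < 0)
                            return (-1);    /* poll() error */
                    ssize_t got = read(sock_fd, buf, sizeof (buf));
                    if (got <= 0)
                            return (got == 0 ? 0 : -1);
                    if (write(file_fd, buf, (size_t)got) != got)
                            return (-1);    /* short write treated as an error here */
            }
    }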
Richard Elling
2009-Dec-25 20:50 UTC
[zfs-discuss] ZFS write bursts cause short app stalls
On Dec 25, 2009, at 9:57 AM, Saso Kiselkov wrote:

> The problem is the bursty behavior of ZFS writes. All the capture
> processes do, in essence, is poll() on a socket and then read() and
> write() any available data from it to a file.

There have been some changes recently, including one in b130 that might apply to this workload. What version of the OS are you running? If not b130, try b130.
-- richard
Hi there,

Try:

    zfs set logbias=throughput <yourdataset>

Good luck,
LK
-- This message posted from opensolaris.org
Hi,

I'm not sure what "b130" means; I'm fairly new to OpenSolaris. How do I find out? As for the OS version, it is OpenSolaris 2009.06.

Regards,
--
Saso

Richard Elling wrote:
> There have been some changes recently, including one in b130 that
> might apply to this workload. What version of the OS are you running?
> If not b130, try b130.
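(For what it's worth, the build number can usually be read from the kernel version string; on a stock 2009.06 install, uname presumably reports something like snv_111b, i.e. build 111b.)

    $ uname -v
    snv_111b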
Hi,

I tried it and I got the following error message:

# zfs set logbias=throughput content
cannot set property for 'content': invalid property 'logbias'

Is it because I'm running some older version which does not have this feature? (2009.06)

Regards,
--
Saso

Leonid Kogan wrote:
> Hi there,
> Try:
> zfs set logbias=throughput <yourdataset>
On Fri, Dec 25, 2009 at 11:57 AM, Saso Kiselkov <skiselkov at gmail.com> wrote:

> The disks in the array are arranged into a single RAID-0 ZFS volume
> [...] I/O bottlenecks can also be ruled out, since the storage array's
> sequential write performance is around 600MB/s.

Hang on... if you've got 77 concurrent threads going, I don't see how that's a "sequential" I/O load. To the backend storage it's going to look like the equivalent of random I/O. I'd also be surprised to see 12 1TB disks supporting 600MB/sec throughput and would be interested in hearing where you got those numbers from.

Is your video capture doing 430MB or 430Mbit?

--
--Tim
On Fri, Dec 25, 2009 at 7:47 PM, Tim Cook <tim at cook.ms> wrote:
>
> Hang on... if you've got 77 concurrent threads going, I don't see how
> that's a "sequential" I/O load. To the backend storage it's going to
> look like the equivalent of random I/O. I'd also be surprised to see
> 12 1TB disks supporting 600MB/sec throughput and would be interested
> in hearing where you got those numbers from.
>
> Is your video capture doing 430MB or 430Mbit?

Think he said 430Mbit/sec, which, if these are security cameras, would be a good-sized installation (30+ cameras). We have a similar system, albeit running on Windows. Writing about 400Mbit/sec using just six 1TB SATA drives is entirely possible, and it is working quite well on our system without any frame loss or much latency.

The write lag is noticeable with ZFS, however, because of the way transaction group writes behave. If you have a big write that needs to land on disk, it seems all other I/O, CPU and "niceness" is thrown out the window in favor of getting all that data on disk.
I was on a watch list for a ZFS I/O scheduler bug with my paid Solaris support; I'll try to find that bug number, but I believe some improvements were made in builds 129 and 130.

--
Brent Jones
brent at servuhome.net
On Fri, Dec 25, 2009 at 11:43 PM, Brent Jones <brent at servuhome.net> wrote:
>
> Think he said 430Mbit/sec, which, if these are security cameras, would
> be a good-sized installation (30+ cameras). We have a similar system,
> albeit running on Windows. Writing about 400Mbit/sec using just six
> 1TB SATA drives is entirely possible, and it is working quite well on
> our system without any frame loss or much latency.

Once again, Mb or MB? They're two completely different numbers. As for getting 400Mbit out of six SATA drives, that's not really impressive at all. If you're saying you got 400MB, that's a different story entirely, and while possible with sequential I/O and a proper raid setup, it isn't happening with random.

--
--Tim
Try b130.
http://genunix.org/

Cheers,
LK

On 12/26/2009 12:59 AM, Saso Kiselkov wrote:
> I tried it and I got the following error message:
>
> # zfs set logbias=throughput content
> cannot set property for 'content': invalid property 'logbias'
>
> Is it because I'm running some older version which does not have this
> feature? (2009.06)
The application I'm working on is a kind of large-scale network-PVR system for our IPTV services. It records all running TV channels in an X-hour carousel (typically 24 or 48 hours), retaining only those bits which users have marked as being interesting to them. The current setup I'm doing development on is a small 12TB array; future deployment is planned on several 96TB X4540 machines.

I agree that I kind of misused the term `sequential' - it really is 77 concurrent sequential writes. However, as I explained, I/O is not the bottleneck here, as the array is capable of writes around 600MBytes/s, and the write load I'm putting on it is around 55MBytes/s (430Mbit/s). The problem is, as Brent explained, that as soon as the OS decides it wants to write the transaction group to disk, it totally ignores all other time-critical activity in the system and focuses on just that, causing an input poll() stall on all network sockets. What I'd need to do is force it to commit transactions to disk more often, so as to even the load out over a longer period of time and bring the CPU usage spikes down to a more manageable and predictable level.

Regards,
--
Saso

Tim Cook wrote:
> Once again, Mb or MB? They're two completely different numbers. As for
> getting 400Mbit out of six SATA drives, that's not really impressive
> at all. If you're saying you got 400MB, that's a different story
> entirely, and while possible with sequential I/O and a proper raid
> setup, it isn't happening with random.
Would an upgrade to the development repository of 2010.02 do the same? I'd like to avoid having to do a complete reinstall, since I've got quite a bit of custom software in the system already in various places, and recompiling and fine-tuning would take me another 1-2 days.

Regards,
--
Saso

Leonid Kogan wrote:
> Try b130.
> http://genunix.org/
On Fri, Dec 25, 2009 at 9:56 PM, Tim Cook <tim at cook.ms> wrote:
>
> Once again, Mb or MB? They're two completely different numbers. As for
> getting 400Mbit out of six SATA drives, that's not really impressive
> at all. If you're saying you got 400MB, that's a different story
> entirely, and while possible with sequential I/O and a proper raid
> setup, it isn't happening with random.

Mb, megabit. 400 megabit is not terribly high; a single SATA drive could write that 24/7 without breaking a sweat. Which is why he is reporting his issue.

Sequential or random, any modern system should be able to perform that task without causing disruption to other processes running on the system (if Windows can, Solaris/ZFS most definitely should be able to).

I have a similar workload on my X4540s, streaming backups from multiple systems at a time. These are very high-end machines: dual quad-core Opterons, 64GB RAM, and 48x 1TB drives in 5-6 disk RAIDZ vdevs.

The "write stalls" have been a significant problem since ZFS came out, and they haven't really been addressed in an acceptable fashion yet, though work has been done to improve it.

I'm still trying to find the case number I have open with Sunsolve or whatever; it was for exactly this issue, and I believe the fix was to add dozens more "classes" to the scheduler, to allow more fair disk I/O and overall "niceness" on the system when ZFS commits a transaction group.

--
Brent Jones
brent at servuhome.net
Brent Jones wrote:
> The "write stalls" have been a significant problem since ZFS came out,
> and they haven't really been addressed in an acceptable fashion yet,
> though work has been done to improve it.
>
> I'm still trying to find the case number I have open with Sunsolve or
> whatever; it was for exactly this issue, and I believe the fix was to
> add dozens more "classes" to the scheduler, to allow more fair disk
> I/O and overall "niceness" on the system when ZFS commits a
> transaction group.

Wow, if there were a production-release solution to the problem, that would be great! Reading the mailing list, I had almost given up hope that I'd be able to work around this issue without upgrading to the latest bleeding-edge development version.

Regards,
--
Saso
Fajar A. Nugraha
2009-Dec-26 10:30 UTC
[zfs-discuss] ZFS write bursts cause short app stalls
On Sat, Dec 26, 2009 at 4:10 PM, Saso Kiselkov <skiselkov at gmail.com> wrote:
>> I'm still trying to find the case number I have open with Sunsolve or
>> whatever; it was for exactly this issue, and I believe the fix was to
>> add dozens more "classes" to the scheduler, to allow more fair disk
>> I/O and overall "niceness" on the system when ZFS commits a
>> transaction group.
>
> Wow, if there were a production-release solution to the problem, that
> would be great!

Have you checked this thread?
http://www.mail-archive.com/zfs-discuss at opensolaris.org/msg28704.html

> Reading the mailing list, I had almost given up hope that I'd be able
> to work around this issue without upgrading to the latest
> bleeding-edge development version.

Isn't OpenSolaris already bleeding edge?

--
Fajar
Thank you, the post you mentioned helped me move a bit forward. I tried putting:

zfs:zfs_txg_timeout = 1

in /etc/system, and now I'm getting a much more even write load (a burst every 5 seconds), which no longer causes any significant poll() stalling. So far I have failed to find the timer in the ZFS source code which causes the 5-second timeout instead of what I asked for (1 second).

Another thing that's left on my mind is why I'm still getting a very slight burst every 60 seconds (causing a poll() delay of around 20-30ms, instead of the usual 0-2ms). It's not that big a problem; it's just that I'm curious as to where it's coming from. I assume some 60-second timer is firing, but I don't know where.

Regards,
--
Saso

Fajar A. Nugraha wrote:
> Have you checked this thread?
> http://www.mail-archive.com/zfs-discuss at opensolaris.org/msg28704.html
>
> Isn't OpenSolaris already bleeding edge?
Bob Friesenhahn
2009-Dec-26 15:53 UTC
[zfs-discuss] ZFS write bursts cause short app stalls
On Fri, 25 Dec 2009, Saso Kiselkov wrote:

> sometimes even longer. I figured that I might be able to resolve this
> by lowering the txg timeout to something like 1-2 seconds (I need ZFS
> to write as soon as data arrives, since it will likely never be
> overwritten), but I couldn't find any tunable parameter for it
> anywhere on the net.

While there are some useful tunable parameters, another approach is to consider requesting a synchronous write using fdatasync(3RT) or fsync(3C) immediately after the final write() request in one of your poll() time quanta. This will cause the data to be written immediately. System behavior will then seem totally different. Unfortunately, it will also be less efficient.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
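(A minimal sketch of that suggestion, assuming the illustrative capture loop sketched earlier in the thread; the function name is made up.)

    #include <stdio.h>
    #include <unistd.h>

    /*
     * Call this after the last write() of a poll() quantum to push the
     * dirty data to stable storage right away instead of waiting for
     * the next txg sync.  Smoother writes, but less efficient overall.
     */
    static int
    flush_quantum(int file_fd)
    {
            if (fdatasync(file_fd) != 0) {  /* fsync(file_fd) would also do */
                    perror("fdatasync");
                    return (-1);
            }
            return (0);
    }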
On 12/26/09 09:53, Brent Jones wrote:
> The "write stalls" have been a significant problem since ZFS came out,
> and they haven't really been addressed in an acceptable fashion yet,
> though work has been done to improve it.
>
> I'm still trying to find the case number I have open with Sunsolve or
> whatever; it was for exactly this issue, and I believe the fix was to
> add dozens more "classes" to the scheduler, to allow more fair disk
> I/O and overall "niceness" on the system when ZFS commits a
> transaction group.

That would be the new System Duty Cycle Scheduling Class that was putback in build 129:

Author: Jonathan Adams <Jonathan.Adams at Sun.COM>
Repository: /export/onnv-gate
Total changesets: 1
Changeset: 87f3734e64df
Comments:
6881015 ZFS write activity prevents other threads from running in a timely manner
6899867 mstate_thread_onproc_time() doesn't account for runnable time correctly
PSARC/2009/615 System Duty Cycle Scheduling Class and ZFS IO Observability

See http://arc.opensolaris.org/caselog/PSARC/2009/615/ for more information.

If you're using the "dev" repository, you can pkg image-update to get this new functionality.

Cheers,
Menno
--
Menno Lageman - Sun Microsystems - http://blogs.sun.com/menno
Richard Elling
2009-Dec-26 16:36 UTC
[zfs-discuss] ZFS write bursts cause short app stalls
On Dec 26, 2009, at 1:10 AM, Saso Kiselkov wrote:

> Brent Jones wrote:
>> The "write stalls" have been a significant problem since ZFS came
>> out, and they haven't really been addressed in an acceptable fashion
>> yet, though work has been done to improve it.

PSARC case 2009/615: System Duty Cycle Scheduling Class and ZFS IO Observability was integrated into b129. This creates a scheduling class for ZFS IO and automatically places the zio threads into that class. This is not really an earth-shattering change; Solaris has had a very flexible scheduler for almost 20 years now. Another example is that on a desktop, the application which has mouse focus runs in the interactive scheduling class. This is completely transparent to most folks and there is no tweaking required.

Also fixed in b129 is BUG/RFE 6881015: ZFS write activity prevents other threads from running in a timely manner, which is related to the above.

>> I'm still trying to find the case number I have open with Sunsolve or
>> whatever; it was for exactly this issue, and I believe the fix was to
>> add dozens more "classes" to the scheduler, to allow more fair disk
>> I/O and overall "niceness" on the system when ZFS commits a
>> transaction group.
>
> Wow, if there were a production-release solution to the problem, that
> would be great! Reading the mailing list, I had almost given up hope
> that I'd be able to work around this issue without upgrading to the
> latest bleeding-edge development version.

Changes have to occur someplace first.
In the OpenSolaris world, the changes occur first in the dev train and are then backported to Solaris 10 (sometimes, not always). You should try the latest build first -- be sure to follow the release notes. Then, if the problem persists, you might consider tuning zfs_txg_timeout, which can be done on a live system.
-- richard
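(As a sanity check after upgrading, the list of configured scheduling classes can be inspected with priocntl; the assumption here is that on b129 and later the new SDC (system duty-cycle) class shows up in that list.)

    # priocntl -l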
Thanks for the advice. I did an in-place upgrade to the latest development b130 release, and it seems that the change in scheduling classes for the kernel writer threads worked (without even having to fiddle around with logbias) - now I'm just getting small delays every 60 seconds (on the order of 20-30ms). I'm not sure these have anything to do with ZFS, though... they happen outside of the write bursts.

Thank you all for the valuable advice!

Regards,
--
Saso

Richard Elling wrote:
> You should try the latest build first -- be sure to follow the release
> notes. Then, if the problem persists, you might consider tuning
> zfs_txg_timeout, which can be done on a live system.
On 12/26/2009 10:41 AM, Saso Kiselkov wrote:
> Would an upgrade to the development repository of 2010.02 do the same?
> I'd like to avoid having to do a complete reinstall, since I've got
> quite a bit of custom software in the system already in various
> places, and recompiling and fine-tuning would take me another 1-2
> days.

AFAIK yes.

LK
Robert Milkowski
2009-Dec-27 14:22 UTC
[zfs-discuss] ZFS write bursts cause short app stalls
On 26/12/2009 12:22, Saso Kiselkov wrote:
> Thank you, the post you mentioned helped me move a bit forward. I
> tried putting:
>
> zfs:zfs_txg_timeout = 1

BTW: you can tune it on a live system without needing to reboot.

milek at r600:~# echo zfs_txg_timeout/D | mdb -k
zfs_txg_timeout:
zfs_txg_timeout:                30
milek at r600:~# echo zfs_txg_timeout/W0t1 | mdb -kw
zfs_txg_timeout:                0x1e            =       0x1
milek at r600:~# echo zfs_txg_timeout/D | mdb -k
zfs_txg_timeout:
zfs_txg_timeout:                1
milek at r600:~# echo zfs_txg_timeout/W0t30 | mdb -kw
zfs_txg_timeout:                0x1             =       0x1e
milek at r600:~# echo zfs_txg_timeout/D | mdb -k
zfs_txg_timeout:
zfs_txg_timeout:                30
milek at r600:~#

--
Robert Milkowski
http://milek.blogspot.com
Thanks for the mdb syntax - I wasn't sure how to set it using mdb at runtime, which is why I used /etc/system. I was quite intrigued to find out that the Solaris kernel was in fact designed to be tuned at runtime using a generic debugging mechanism, rather than, like other traditional kernels, through a defined kernel settings interface (sysctl comes to mind).

Anyway, upgrading to b130 helped my issue, and I hope that by the time we start selling this product, OpenSolaris 2010.02 will be out, so that I can tell people to just grab the latest stable OpenSolaris release, rather than having to go to a development branch or tune kernel parameters to even get the software working as it should.

Regards,
--
Saso

Robert Milkowski wrote:
> BTW: you can tune it on a live system without needing to reboot.
>
> milek at r600:~# echo zfs_txg_timeout/W0t1 | mdb -kw
> zfs_txg_timeout:                0x1e            =       0x1
Roch Bourbonnais
2009-Dec-27 19:38 UTC
[zfs-discuss] ZFS write bursts cause short app stalls
On Dec 26, 2009, at 04:47, Tim Cook wrote:

> Hang on... if you've got 77 concurrent threads going, I don't see how
> that's a "sequential" I/O load. To the backend storage it's going to
> look like the equivalent of random I/O.

I see this posted once in a while and I'm not sure where it comes from. Sequential workloads are important inasmuch as the FS/VM can detect them and issue large requests to disk (followed by cache hits) instead of multiple small ones. The detection in ZFS is done at the file level, so the fact that one has N concurrent streams going is not relevant.

On writes, ZFS and the copy-on-write model make the sequential/random distinction not very meaningful: all writes target free blocks.

-r
On Sun, Dec 27, 2009 at 1:38 PM, Roch Bourbonnais <Roch.Bourbonnais at sun.com> wrote:
>
> I see this posted once in a while and I'm not sure where it comes
> from. Sequential workloads are important inasmuch as the FS/VM can
> detect them and issue large requests to disk (followed by cache hits)
> instead of multiple small ones. The detection in ZFS is done at the
> file level, so the fact that one has N concurrent streams going is not
> relevant.
>
> On writes, ZFS and the copy-on-write model make the sequential/random
> distinction not very meaningful: all writes target free blocks.

That is ONLY true when there's significant free space available/a fresh pool. Once those files have been deleted and the blocks put back into the free pool, they're no longer "sequential" on disk; they're all over the disk. So it makes a VERY big difference. I'm not sure why you'd be shocked someone would bring this up.

--
--Tim
Bob Friesenhahn
2009-Dec-28 00:43 UTC
[zfs-discuss] ZFS write bursts cause short app stalls
On Sun, 27 Dec 2009, Tim Cook wrote:
>
> That is ONLY true when there's significant free space available/a
> fresh pool. Once those files have been deleted and the blocks put back
> into the free pool, they're no longer "sequential" on disk; they're
> all over the disk. So it makes a VERY big difference. I'm not sure why
> you'd be shocked someone would bring this up.

While I don't know what zfs actually does, I do know that it performs large disk allocations (e.g. 1MB) and then parcels 128K zfs blocks from those allocations. If the zfs designers are wise, then they will use knowledge of sequential access to ensure that all of the 128K blocks from a metaslab allocation are pre-assigned for use by that file, and they will try to choose metaslabs which are followed by free metaslabs, or close to other free metaslabs. This approach would tend to limit the sequential-access damage caused by COW and free-block fragmentation on a "dirty" disk.

This sort of planning is not terribly different from detecting sequential read I/O and scheduling data reads in advance of application requirements. If you can intelligently pre-fetch data blocks, then you can certainly intelligently pre-allocate data blocks.

Today I did an interesting (to me) test where I ran two copies of iozone at once on huge (up to 64GB) files. The results were somewhat amazing to me, because the reported data rates from iozone did not drop very much (e.g. a single-process write rate of 359MB/second dropped to 298MB/second with two processes). This clearly showed that zfs does quite a lot of smart things when writing files and that it is optimized for several/many writers rather than just one.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
On Sun, Dec 27, 2009 at 6:43 PM, Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:
>
> While I don't know what zfs actually does, I do know that it performs
> large disk allocations (e.g. 1MB) and then parcels 128K zfs blocks
> from those allocations. If the zfs designers are wise, then they will
> use knowledge of sequential access to ensure that all of the 128K
> blocks from a metaslab allocation are pre-assigned for use by that
> file, and they will try to choose metaslabs which are followed by free
> metaslabs, or close to other free metaslabs. This approach would tend
> to limit the sequential-access damage caused by COW and free-block
> fragmentation on a "dirty" disk.

How is that going to prevent blocks being spread all over the disk when you've got files several GB in size being written concurrently and deleted at random? And then throw in a mix of small files as well, and kiss that goodbye.

> This sort of planning is not terribly different from detecting
> sequential read I/O and scheduling data reads in advance of
> application requirements. If you can intelligently pre-fetch data
> blocks, then you can certainly intelligently pre-allocate data blocks.

Pre-allocating data blocks is also not going to cure head seeks and the latency they induce on slow 7200/5400RPM drives.

> Today I did an interesting (to me) test where I ran two copies of
> iozone at once on huge (up to 64GB) files. The results were somewhat
> amazing to me, because the reported data rates from iozone did not
> drop very much (e.g. a single-process write rate of 359MB/second
> dropped to 298MB/second with two processes). This clearly showed that
> zfs does quite a lot of smart things when writing files and that it is
> optimized for several/many writers rather than just one.

On a new, empty pool, or a pool that's been filled completely and emptied several times? It's not amazing to me on a new pool. I would be surprised to see you accomplish this feat repeatedly after filling and emptying the drives. It's a drawback of every implementation of copy-on-write I've ever seen. By its very nature, I have no idea how you would avoid it.

--
--Tim
Bob Friesenhahn
2009-Dec-28 02:40 UTC
[zfs-discuss] ZFS write bursts cause short app stalls
On Sun, 27 Dec 2009, Tim Cook wrote:
> How is that going to prevent blocks being spread all over the disk when
> you've got files several GB in size being written concurrently and deleted
> at random? And then throw in a mix of small files as well, and kiss that
> goodbye.

There would certainly be blocks spread all over the disk, but a (possible) seek every 1MB of data is not too bad (not considering metadata seeks). If the pool is allowed to get very full, then optimizations based on pre-allocated space stop working.

> Pre-allocating data blocks is also not going to cure head seek and the
> latency it induces on slow 7200/5400RPM drives.

But if the next seek to a data block is on a different drive, that drive can be seeking for the next block while the current block is already being read.

> On a new, empty pool, or a pool that's been filled completely and emptied
> several times? It's not amazing to me on a new pool. I would be surprised
> to see you accomplish this feat repeatedly after filling and emptying the
> drives. It's a drawback of every implementation of copy-on-write I've ever
> seen. By its very nature, I have no idea how you would avoid it.

This is a 2-year-old pool which has typically been filled (to about 80%) and "emptied" (reduced to 25%) many times. However, when it is "emptied", all of the new files get removed, since the extra space is used for testing. I have only seen this pool get faster over time.

For example, when the pool was first created, iozone measured a single-thread large-file (64GB) write rate of only 148MB/second, but now it is up to 380MB/second with the same hardware. The performance improvement is due to improvements to Solaris 10 software and array (STK2540) firmware.

Original vs current:

              KB  reclen   write  rewrite    read  reread
        67108864     256  148995   165041  463519  453896
        67108864     256  380286   377397  551060  550414

Here is an ancient blog entry where Jeff Bonwick discusses ZFS block allocation:

http://blogs.sun.com/bonwick/entry/zfs_block_allocation

and a somewhat newer one where Jeff describes space maps:

http://blogs.sun.com/bonwick/entry/space_maps

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
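For reference, a two-writer run of the kind described above can be approximated with something like the following (file size and record length chosen to match the table; the pool path and file names are placeholders):

  # iozone -s 64g -r 256k -i 0 -i 1 -f /pool/iozone.1 &
  # iozone -s 64g -r 256k -i 0 -i 1 -f /pool/iozone.2 &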
On Sun, Dec 27, 2009 at 8:40 PM, Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:

> There would certainly be blocks spread all over the disk, but a (possible)
> seek every 1MB of data is not too bad (not considering metadata seeks). If
> the pool is allowed to get very full, then optimizations based on
> pre-allocated space stop working.

I guess it depends entirely on the space map :)

> But if the next seek to a data block is on a different drive, that drive
> can be seeking for the next block while the current block is already being
> read.

Well of course. The argument of "if you just throw more disks at the problem" will be valid in almost all situations. Expecting to get the same performance out of drives when they're full and used as when they're empty and new is, in my experience, crazy. My point from the start was that you will see a significant performance decrease as time passes and fragmentation sets in.

> This is a 2-year-old pool which has typically been filled (to about 80%)
> and "emptied" (reduced to 25%) many times. [...] For example, when the
> pool was first created, iozone measured a single-thread large-file (64GB)
> write rate of only 148MB/second, but now it is up to 380MB/second with the
> same hardware. The performance improvement is due to improvements to
> Solaris 10 software and array (STK2540) firmware.

C'mon, saying "all I did was change code and firmware" isn't a valid comparison at all. Ignoring that, I'm still referring to multiple streams which create random I/O to the backend disk.

--Tim
I progressed with testing a bit further and found that I was hitting another scheduling bottleneck - the network. While the write burst was running and ZFS was committing data to disk, the server was dropping incoming UDP packets ("netstat -s | grep udpInOverflows" grew by about 1000-2000 packets during every write burst).

To work around that I had to boost the scheduling priority of the recorder processes to the real-time class, and I also had to lower zfs_txg_timeout=1 (there was still minor packet drop after just doing priocntl on the processes) to even out the CPU load.

Any ideas on why ZFS should completely thrash the network layer and make it drop incoming packets?

Regards,
--
Saso

Robert Milkowski wrote:
> On 26/12/2009 12:22, Saso Kiselkov wrote:
>> Thank you, the post you mentioned helped me move a bit forward. I tried
>> putting:
>>
>> zfs:zfs_txg_timeout = 1
>
> btw: you can tune it on a live system without a need to do reboots.
>
> milek at r600:~# echo zfs_txg_timeout/D | mdb -k
> zfs_txg_timeout:
> zfs_txg_timeout:30
> milek at r600:~# echo zfs_txg_timeout/W0t1 | mdb -kw
> zfs_txg_timeout:0x1e = 0x1
> milek at r600:~# echo zfs_txg_timeout/D | mdb -k
> zfs_txg_timeout:
> zfs_txg_timeout:1
> milek at r600:~# echo zfs_txg_timeout/W0t30 | mdb -kw
> zfs_txg_timeout:0x1 = 0x1e
> milek at r600:~# echo zfs_txg_timeout/D | mdb -k
> zfs_txg_timeout:
> zfs_txg_timeout:30
> milek at r600:~#
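For reference, a minimal sketch of the priority boost described above; the process name "capture" is only a placeholder for whatever the recorder binaries are called, and moving processes into the RT class requires root (or the proc_priocntl privilege):

  # priocntl -s -c RT -i pid `pgrep capture`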
Hi, try to add a flow for the traffic you want to get prioritized. I noticed that opensolaris tends to drop network connectivity without priority flows defined; I believe this is a feature presented by crossbow itself. flowadm is your friend, that is.

I found this particularly annoying if you monitor servers with icmp-ping and high load causes checks to fail, therefore triggering unnecessary alarms.

Yours
Markus Kovero

-----Original Message-----
From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Saso Kiselkov
Sent: 28. joulukuuta 2009 15:25
To: zfs-discuss at opensolaris.org
Subject: Re: [zfs-discuss] ZFS write bursts cause short app stalls

I progressed with testing a bit further and found that I was hitting another scheduling bottleneck - the network. While the write burst was running and ZFS was committing data to disk, the server was dropping incoming UDP packets ("netstat -s | grep udpInOverflows" grew by about 1000-2000 packets during every write burst). [...]

Any ideas on why ZFS should completely thrash the network layer and make it drop incoming packets?

Regards,
--
Saso
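A sketch of the kind of flow Markus suggests, using the link and flow names that appear later in this thread and the multicast range as the local address; treat the exact attributes as an example rather than a prescription:

  # flowadm add-flow -l e1000g1 -a local_ip=224.0.0.0/4 iptv
  # flowadm set-flowprop -p priority=high iptv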
Thank you for the advice. After trying flowadm the situation improved somewhat, but I'm still getting occasional packet overflows (10-100 packets about every 10-15 minutes). This is somewhat unnerving, because I don't know how to track it down. I need all IP multicast input traffic on e1000g1 to get the highest possible priority.

Here are the flowadm settings I use:

# flowadm show-flow iptv
FLOW        LINK        IPADDR            PROTO  LPORT   RPORT   DSFLD
iptv        e1000g1     LCL:224.0.0.0/4   --     --      --      --

# flowadm show-flowprop iptv
FLOW        PROPERTY    VALUE           DEFAULT        POSSIBLE
iptv        maxbw       --              --             ?
iptv        priority    high            --             high

I also tuned udp_max_buf to 256MB. All recording processes are boosted to the RT priority class and zfs_txg_timeout=1 to force the system to commit data to disk in smaller and more manageable chunks. Is there any further tuning you could recommend?

Regards,
--
Saso

Markus Kovero wrote:
> Hi, try to add a flow for the traffic you want to get prioritized. I
> noticed that opensolaris tends to drop network connectivity without
> priority flows defined; I believe this is a feature presented by crossbow
> itself. flowadm is your friend, that is.
> [...]
>
> Yours
> Markus Kovero
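For completeness, udp_max_buf on this vintage of OpenSolaris is normally raised with ndd; the value below corresponds to the 256MB mentioned above and is shown as an example rather than a recommendation:

  # ndd -set /dev/udp udp_max_buf 268435456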
Bob Friesenhahn
2009-Dec-28 16:07 UTC
[zfs-discuss] ZFS write bursts cause short app stalls
On Sun, 27 Dec 2009, Tim Cook wrote:
>
> C'mon, saying "all I did was change code and firmware" isn't a valid
> comparison at all. Ignoring that, I'm still referring to multiple
> streams which create random I/O to the backend disk.

I do agree with you that this is a problematic scenario. The issue is with how fast the data arrives. If the data is written quickly, then quite a lot of data will be written in each transaction group and zfs can usefully optimize that transaction group. If the data trickles in, then it is a difficult problem for any general-purpose filesystem to solve.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Roch Bourbonnais
2009-Dec-28 17:19 UTC
[zfs-discuss] ZFS write bursts cause short app stalls
On Dec 28, 2009, at 00:59, Tim Cook wrote:
>
> On Sun, Dec 27, 2009 at 1:38 PM, Roch Bourbonnais <Roch.Bourbonnais at sun.com> wrote:
>
>> On Dec 26, 2009, at 04:47, Tim Cook wrote:
>>
>>> On Fri, Dec 25, 2009 at 11:57 AM, Saso Kiselkov <skiselkov at gmail.com> wrote:
>>>> [original problem statement snipped]
>>>
>>> Hang on... if you've got 77 concurrent threads going, I don't see
>>> how that's a "sequential" I/O load. To the backend storage it's
>>> going to look like the equivalent of random I/O.
>>
>> I see this posted once in a while and I'm not sure where that comes
>> from. Sequential workloads are important inasmuch as the FS/VM can
>> detect and issue large requests to disk (followed by cache hits)
>> instead of multiple small ones. The detection for ZFS is done at
>> the file level, and so the fact that one has N concurrent streams
>> going is not relevant.
>> On writes, ZFS and the Copy-On-Write model make the sequential/random
>> distinction not very defining. All writes target free blocks.
>>
>> -r
>
> That is ONLY true when there's significant free space available/a
> fresh pool. Once those files have been deleted and the blocks put
> back into the free pool, they're no longer "sequential" on disk,
> they're all over the disk. So it makes a VERY big difference. I'm
> not sure why you'd be shocked someone would bring this up.
So on writes, the performance is defined more by the availability of sequential blocks than by the application's write access pattern or its concurrency. On reads, multiple concurrent sequential streams are sequential to the filesystem independent of the number of streams, leading to some optimisation at that level. The on-disk I/O pattern is governed by the layout, and again the concurrency of streams does not come into play when trying to understand the performance.

So IMO, having N files with a sequential access pattern does not imply that performance will be governed by a random I/O response from disks.

-r
Robert Milkowski
2009-Dec-29 11:03 UTC
[zfs-discuss] ZFS write bursts cause short app stalls
I included networking-discuss@

On 28/12/2009 15:50, Saso Kiselkov wrote:
> Thank you for the advice. After trying flowadm the situation improved
> somewhat, but I'm still getting occasional packet overflows (10-100
> packets about every 10-15 minutes). This is somewhat unnerving, because
> I don't know how to track it down.
> [...]
> I also tuned udp_max_buf to 256MB. All recording processes are boosted
> to the RT priority class and zfs_txg_timeout=1 to force the system to
> commit data to disk in smaller and more manageable chunks. Is there any
> further tuning you could recommend?
>
> Regards,
> --
> Saso
I tried removing the flow and subjectively packet loss occurs a bit less often, but it is still happening. Right now I'm trying to figure out if it's due to the load on the server or not - I've left only about 15 concurrent recording instances running, producing < 8% load on the system. If the packet loss still occurs, I guess I'll have to disregard the loss measurements as irrelevant, since at such a load the server should not be dropping packets at all... I guess.

Regards,
--
Saso

Robert Milkowski wrote:
> I included networking-discuss@
>
> On 28/12/2009 15:50, Saso Kiselkov wrote:
>> Thank you for the advice. After trying flowadm the situation improved
>> somewhat, but I'm still getting occasional packet overflows (10-100
>> packets about every 10-15 minutes). This is somewhat unnerving, because
>> I don't know how to track it down.
>> [...]
Ok, I figured out that apparently I was the idiot in this story, again. I forgot to set SO_RCVBUF higher on my network sockets, so that's why I was dropping input packets.

The zfs_txg_timeout=1 setting is still necessary (or else dropping occurs when committing data to disk), but by increasing the network input buffer sizes it seems I was able to cut input packet loss to zero.

Thanks for all the valuable advice!

Regards,
--
Saso

Saso Kiselkov wrote:
> I tried removing the flow and subjectively packet loss occurs a bit less
> often, but it is still happening. Right now I'm trying to figure out if
> it's due to the load on the server or not - I've left only about 15
> concurrent recording instances running, producing < 8% load on the
> system. If the packet loss still occurs, I guess I'll have to disregard
> the loss measurements as irrelevant, since at such a load the server
> should not be dropping packets at all... I guess.
>
> Regards,
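For anyone following along, the fix Saso describes boils down to something like the following on each receiving UDP socket; the 4MB figure is purely illustrative (the effective ceiling is governed by udp_max_buf, mentioned earlier in the thread):

    #include <sys/socket.h>
    #include <stdio.h>

    /* Enlarge the receive buffer on one capture socket so that input
     * bursts survive short stalls in the writer.  Returns 0 on success. */
    static int
    grow_rcvbuf(int sock_fd)
    {
            int rcvbuf = 4 * 1024 * 1024;   /* example size only */

            if (setsockopt(sock_fd, SOL_SOCKET, SO_RCVBUF,
                &rcvbuf, sizeof (rcvbuf)) != 0) {
                    perror("setsockopt(SO_RCVBUF)");
                    return (-1);
            }
            return (0);
    }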
Thanks for this thread! I was just coming here to discuss this very same problem. I'm running 2009.06 on a Q6600 with 8GB of RAM. I have a Windows system writing multiple OTA HD video streams via CIFS to the 2009.06 system running Samba.

I then have multiple clients reading back other HD video streams. The write client never skips a beat, but the read clients have constant problems getting data when the "burst" writes occur.

I am now going to try the txg_timeout change and see if that helps. It would be nice if these tunables were settable on a per-pool basis though.
--
This message posted from opensolaris.org
Be sure to also update to the latest dev b130 release, as that also helps - it uses a smoother scheduling class for the zfs threads. If the upgrade breaks anything, you can always just boot back into the old boot environment from before the upgrade.

Regards,
--
Saso

Bill Werner wrote:
> Thanks for this thread! I was just coming here to discuss this very same
> problem. I'm running 2009.06 on a Q6600 with 8GB of RAM. I have a Windows
> system writing multiple OTA HD video streams via CIFS to the 2009.06
> system running Samba.
>
> I then have multiple clients reading back other HD video streams. The
> write client never skips a beat, but the read clients have constant
> problems getting data when the "burst" writes occur.
>
> I am now going to try the txg_timeout change and see if that helps. It
> would be nice if these tunables were settable on a per-pool basis though.
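As a rough sketch of the upgrade path being suggested (the dev repository URL and commands are as commonly used at the time; adjust for your own setup):

  # pkg set-publisher -O http://pkg.opensolaris.org/dev opensolaris.org
  # pkg image-update
  (reboot into the new boot environment; if it misbehaves, run
   "beadm activate <old-BE>" and reboot to fall back)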
Tim Cook writes:
> On Sun, Dec 27, 2009 at 6:43 PM, Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:
>> While I don't know what zfs actually does, I do know that it performs
>> large disk allocations (e.g. 1MB) and then parcels 128K zfs blocks out of
>> those allocations. [...]
>
> How is that going to prevent blocks being spread all over the disk when
> you've got files several GB in size being written concurrently and deleted
> at random? And then throw in a mix of small files as well, and kiss that
> goodbye.

Big files being deleted create big chunks of space for reuse. That is a great way to clean up the layout. Within a metaslab, ZFS uses cursors to bunch small objects closer together.

http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/metaslab.c#501

>> This sort of planning is not terribly different from detecting sequential
>> read I/O and scheduling data reads in advance of application
>> requirements. If you can intelligently pre-fetch data blocks, then you
>> can certainly intelligently pre-allocate data blocks.
>
> Pre-allocating data blocks is also not going to cure head seek and the
> latency it induces on slow 7200/5400RPM drives.
>
>> Today I did an interesting (to me) test where I ran two copies of iozone
>> at once on huge (up to 64GB) files. [...]
>
> On a new, empty pool, or a pool that's been filled completely and emptied
> several times? It's not amazing to me on a new pool. I would be surprised
> to see you accomplish this feat repeatedly after filling and emptying the
> drives. It's a drawback of every implementation of copy-on-write I've
> ever seen. By its very nature, I have no idea how you would avoid it.

If you empty the drives you're back to all free space:

http://blogs.sun.com/bonwick/entry/space_maps

If you leave yourself a nice cushion of free space, and if your profile of object sizes does not radically change over time, I think people should be fine when it comes to free-space fragmentation issues. That said, slab and block selection is still on our radar for improvements.

-r
I've encountered a new problem on the opposite end of my app - the write() calls to disk sometimes block for a terribly long time (5-10 seconds) when I start deleting stuff on the filesystem where my recorder processes are writing. Looking at iostat I can see that the disk load is strongly uneven - with a lowered zfs_txg_timeout=1 I get normal writes every second, but when I start deleting stuff (e.g. "rm -r *"), huge load spikes appear from time to time, even to the level of blocking all processes writing to the filesystem and filling up the network input buffer and starting to drop packets.

Is there a way that I can increase the write I/O priority, or increase the write buffer in ZFS so that write()s won't block?

Regards,
--
Saso

Saso Kiselkov wrote:
> Ok, I figured out that apparently I was the idiot in this story, again.
> I forgot to set SO_RCVBUF higher on my network sockets, so that's why I
> was dropping input packets.
>
> The zfs_txg_timeout=1 setting is still necessary (or else dropping occurs
> when committing data to disk), but by increasing the network input buffer
> sizes it seems I was able to cut input packet loss to zero.
>
> Thanks for all the valuable advice!
>
> Regards,
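One way to put numbers on stalls like these is a DTrace quantization of write(2) latency for the affected processes; "recorder" below is only a placeholder for the actual executable name:

  # dtrace -n '
      syscall::write:entry /execname == "recorder"/ { self->ts = timestamp; }
      syscall::write:return /self->ts/ {
              @["write latency (ns)"] = quantize(timestamp - self->ts);
              self->ts = 0;
      }'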
Bob Friesenhahn
2010-Jan-06 18:44 UTC
[zfs-discuss] ZFS write bursts cause short app stalls
On Wed, 6 Jan 2010, Saso Kiselkov wrote:
> I've encountered a new problem on the opposite end of my app - the
> write() calls to disk sometimes block for a terribly long time (5-10
> seconds) when I start deleting stuff on the filesystem where my recorder
> processes are writing. [...]
>
> Is there a way that I can increase the write I/O priority, or increase
> the write buffer in ZFS so that write()s won't block?

Deleting stuff results in many small writes to the pool in order to free up blocks and update metadata. It is one of the most challenging tasks that any filesystem will do.

It seems that the most recent development OpenSolaris has added use of a new scheduling class in order to limit the impact of such "load spikes". I am eagerly looking forward to being able to use this.

It is difficult for your application to do much if the network device driver fails to work, but your application can do some of its own buffering and use multithreading so that even a long delay can be handled. Use of the asynchronous write APIs may also help. Writes should be blocked up to the size of the zfs block (e.g. 128K), and also aligned to the zfs block if possible.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
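A minimal sketch of the asynchronous-write idea Bob mentions, using POSIX aio with 128K records; buffer management and completion handling are omitted, and this is only an illustration, not the recorder's actual code:

    #include <aio.h>
    #include <stdio.h>
    #include <string.h>

    /* Submit one 128K record asynchronously so the network loop never
     * blocks on the filesystem.  The caller must keep record_buf and *cb
     * alive until the write completes. */
    static int
    submit_record(int out_fd, void *record_buf, off_t file_offset,
        struct aiocb *cb)
    {
            memset(cb, 0, sizeof (*cb));
            cb->aio_fildes = out_fd;
            cb->aio_buf = record_buf;
            cb->aio_nbytes = 128 * 1024;    /* matches the default zfs recordsize */
            cb->aio_offset = file_offset;
            if (aio_write(cb) != 0) {
                    perror("aio_write");
                    return (-1);
            }
            /* later: poll aio_error(cb); once it is no longer EINPROGRESS,
             * reap the result with aio_return(cb) and recycle the buffer */
            return (0);
    }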
I'm aware of the theory and realize that deleting stuff requires writes. I'm also running on the latest b130 and write stuff to disk in large 128k chunks. The thing I was wondering about is whether there is a mechanism that might lower the I/O scheduling priority of a given process (e.g. lower the priority of the rm command) in a manner similar to CPU scheduling priority. Another solution would be to increase the max size of the ZFS write buffer, so that writes would not block.

What I'd specifically like to avoid doing is buffering writes in the recorder process. Besides being complicated to do (the process periodically closes and reopens several output files at specific moments in time, and keeping them in sync is a bit hairy), I need the written data to appear in the filesystem very soon after being received from the network. The logic behind this is that this is streaming media data which a user can immediately start playing back while it's being recorded. It's crucial that the user be able to follow the real-time recording with at most a 1-2 second delay (in fact, at the moment I can get down to 1 second behind live TV). If I buffer writes for up to 10 seconds in user-space, other playback processes can fail due to running out of data.

Regards,
--
Saso

Bob Friesenhahn wrote:
> Deleting stuff results in many small writes to the pool in order to free
> up blocks and update metadata. It is one of the most challenging tasks
> that any filesystem will do.
>
> It seems that the most recent development OpenSolaris has added use of a
> new scheduling class in order to limit the impact of such "load spikes".
> I am eagerly looking forward to being able to use this.
>
> It is difficult for your application to do much if the network device
> driver fails to work, but your application can do some of its own
> buffering and use multithreading so that even a long delay can be
> handled. Use of the asynchronous write APIs may also help. Writes
> should be blocked up to the size of the zfs block (e.g. 128K), and also
> aligned to the zfs block if possible.
Bob Friesenhahn
2010-Jan-06 22:21 UTC
[zfs-discuss] ZFS write bursts cause short app stalls
On Wed, 6 Jan 2010, Saso Kiselkov wrote:
> I'm aware of the theory and realize that deleting stuff requires writes.
> I'm also running on the latest b130 and write stuff to disk in large
> 128k chunks. The thing I was wondering about is whether there is a
> mechanism that might lower the I/O scheduling priority of a given
> process (e.g. lower the priority of the rm command) in a manner similar
> to CPU scheduling priority. Another solution would be to increase the
> max size of the ZFS write buffer, so that writes would not block.

Disks only have so many IOPS available and commands like 'rm -rf' use quite a lot of them. The 'rm' command actually does hardly anything since unlink() is just one system call which could cause a flurry of activity. A simple solution is to ban use of 'rm -rf' and use a substitute which intentionally works slowly. The source code for 'rm' is available, so you could even modify OpenSolaris 'rm' so that it does a bit of a sleep after removing each file.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
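A crude version of the throttled delete Bob suggests could look like the following sketch (the path is only an example, and the one-second pause per file is a knob to tune):

  # find /tank/recordings/expired -type f | while read f; do rm "$f"; sleep 1; done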
Buffering the writes in the OS would work for me as well - I've got RAM to spare. Slowing down rm is perhaps one way to go, but definitely not a real solution. On rare occasions I could still get lockups, leading to screwed-up recordings, and if there's one thing people don't like about IPTV, it's packet loss. Eliminating even the possibility of packet loss completely would be the best way to go, I think.

Regards,
--
Saso

Bob Friesenhahn wrote:
> Disks only have so many IOPS available and commands like 'rm -rf' use
> quite a lot of them. The 'rm' command actually does hardly anything
> since unlink() is just one system call which could cause a flurry of
> activity. A simple solution is to ban use of 'rm -rf' and use a
> substitute which intentionally works slowly. The source code for 'rm'
> is available, so you could even modify OpenSolaris 'rm' so that it does
> a bit of a sleep after removing each file.
On Wed, Jan 6, 2010 at 2:40 PM, Saso Kiselkov <skiselkov at gmail.com> wrote:
> Buffering the writes in the OS would work for me as well - I've got RAM
> to spare. Slowing down rm is perhaps one way to go, but definitely not a
> real solution. On rare occasions I could still get lockups, leading to
> screwed-up recordings, and if there's one thing people don't like about
> IPTV, it's packet loss. Eliminating even the possibility of packet loss
> completely would be the best way to go, I think.
>
> Regards,
> --
> Saso

I shouldn't dare suggest this, but what about disabling the ZIL? Since this sounds like transient data to begin with, any risks would be pretty low, I'd imagine.

--
Brent Jones
brent at servuhome.net
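For the record, on builds of that era the ZIL was typically disabled via the zil_disable tunable (quoted here from memory of the pre-b140 tunables, so treat it as an assumption); note also that the ZIL only affects synchronous writes, which may be why it makes no difference for plain write() traffic like this:

  * in /etc/system:        set zfs:zil_disable = 1
  * or on a live system:   # echo zil_disable/W0t1 | mdb -kw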
Just tried it, and it didn't help :-(.

Regards,
--
Saso

Brent Jones wrote:
> I shouldn't dare suggest this, but what about disabling the ZIL? Since
> this sounds like transient data to begin with, any risks would be
> pretty low, I'd imagine.