We've got an interesting application which involves receiving lots of
multicast groups, and writing the data to disc as a cache.  We're
currently using ZFS for this cache, as we're potentially dealing with a
couple of TB at a time.

The threads writing to the filesystem have real-time SCHED_FIFO priorities
set to 25.  The processes recovering data from the cache and moving it
elsewhere are niced at +10.

We're seeing the writes stall in favour of the reads.  For normal
workloads I can understand the reasons, but I was under the impression
that real-time processes essentially trump all others, and I'm surprised
by this behaviour; I had a dozen or so RT-processes sat waiting for disc
for about 20s.

My questions:

  * Is this a ZFS issue?  Would we be better using another filesystem?
  * Is there any way to mitigate against it?  Reduce the number of iops
    available for reading, say?
  * Is there any way to disable or invert this behaviour?
  * Is this a bug, or should it be considered one?

Thanks.

--
Dickon Hood

Due to digital rights management, my .sig is temporarily unavailable.
Normal service will be resumed as soon as possible.  We apologise for the
inconvenience in the meantime.

No virus was found in this outgoing message as I didn't bother looking.
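[For reference, a minimal sketch of how a writer thread might be placed in
the SCHED_FIFO class at priority 25 as described above.  Only the policy
and priority value come from the post; the surrounding code is hypothetical
and error handling is minimal.]

    /* Sketch only: give the calling thread the real-time SCHED_FIFO
     * policy at priority 25.  Typically requires appropriate privileges. */
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>
    #include <string.h>

    static int make_thread_realtime(void)
    {
        struct sched_param sp;
        memset(&sp, 0, sizeof(sp));
        sp.sched_priority = 25;   /* the priority quoted above */
        int rc = pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp);
        if (rc != 0) {
            fprintf(stderr, "pthread_setschedparam: %s\n", strerror(rc));
            return -1;
        }
        return 0;
    }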
Dickon Hood wrote:
> We've got an interesting application which involves receiving lots of
> multicast groups, and writing the data to disc as a cache.  We're
> currently using ZFS for this cache, as we're potentially dealing with a
> couple of TB at a time.
>
> The threads writing to the filesystem have real-time SCHED_FIFO priorities
> set to 25.  The processes recovering data from the cache and moving it
> elsewhere are niced at +10.
>
> We're seeing the writes stall in favour of the reads.  For normal
> workloads I can understand the reasons, but I was under the impression
> that real-time processes essentially trump all others, and I'm surprised
> by this behaviour; I had a dozen or so RT-processes sat waiting for disc
> for about 20s.

Are the files opened with O_DSYNC or does the application call fsync?

--
Darren J Moffat
On Fri, Dec 07, 2007 at 12:38:11 +0000, Darren J Moffat wrote:
: Dickon Hood wrote:
: >We've got an interesting application which involves receiving lots of
: >multicast groups, and writing the data to disc as a cache.  We're
: >currently using ZFS for this cache, as we're potentially dealing with a
: >couple of TB at a time.

: >The threads writing to the filesystem have real-time SCHED_FIFO priorities
: >set to 25.  The processes recovering data from the cache and moving it
: >elsewhere are niced at +10.

: >We're seeing the writes stall in favour of the reads.  For normal
: >workloads I can understand the reasons, but I was under the impression
: >that real-time processes essentially trump all others, and I'm surprised
: >by this behaviour; I had a dozen or so RT-processes sat waiting for disc
: >for about 20s.

: Are the files opened with O_DSYNC or does the application call fsync?

No.  O_WRONLY|O_CREAT|O_LARGEFILE|O_APPEND.  Would that help?

--
Dickon Hood
Dickon Hood wrote:
> On Fri, Dec 07, 2007 at 12:38:11 +0000, Darren J Moffat wrote:
> : Dickon Hood wrote:
> : >We've got an interesting application which involves receiving lots of
> : >multicast groups, and writing the data to disc as a cache.  We're
> : >currently using ZFS for this cache, as we're potentially dealing with a
> : >couple of TB at a time.
>
> : >The threads writing to the filesystem have real-time SCHED_FIFO priorities
> : >set to 25.  The processes recovering data from the cache and moving it
> : >elsewhere are niced at +10.
>
> : >We're seeing the writes stall in favour of the reads.  For normal
> : >workloads I can understand the reasons, but I was under the impression
> : >that real-time processes essentially trump all others, and I'm surprised
> : >by this behaviour; I had a dozen or so RT-processes sat waiting for disc
> : >for about 20s.
>
> : Are the files opened with O_DSYNC or does the application call fsync?
>
> No.  O_WRONLY|O_CREAT|O_LARGEFILE|O_APPEND.  Would that help?

Don't know if it will help, but it will be different :-).  I suspected
that since you put the processes in the RT class you would also be doing
synchronous writes.

If you can test this it may be worth doing so for the sake of gathering
another data point.

--
Darren J Moffat
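[A minimal sketch of the suggested experiment, assuming a hypothetical file
path and mode: it takes the flag set quoted above and simply adds O_DSYNC,
so each write() returns only once the data has been committed to stable
storage.]

    /* Sketch only: open the cache file with the flags quoted above,
     * plus O_DSYNC for synchronous writes. */
    #include <sys/types.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <stdio.h>

    int open_cache_file(const char *path)
    {
        int fd = open(path,
                      O_WRONLY | O_CREAT | O_LARGEFILE | O_APPEND | O_DSYNC,
                      0644);
        if (fd < 0)
            perror("open");
        return fd;
    }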
On Fri, Dec 07, 2007 at 12:58:17 +0000, Darren J Moffat wrote:
: Dickon Hood wrote:
: >On Fri, Dec 07, 2007 at 12:38:11 +0000, Darren J Moffat wrote:
: >: Dickon Hood wrote:
: >: >We've got an interesting application which involves receiving lots of
: >: >multicast groups, and writing the data to disc as a cache.  We're
: >: >currently using ZFS for this cache, as we're potentially dealing with a
: >: >couple of TB at a time.

: >: >The threads writing to the filesystem have real-time SCHED_FIFO
: >: >priorities set to 25.  The processes recovering data from the cache
: >: >and moving it elsewhere are niced at +10.

: >: >We're seeing the writes stall in favour of the reads.  For normal
: >: >workloads I can understand the reasons, but I was under the impression
: >: >that real-time processes essentially trump all others, and I'm surprised
: >: >by this behaviour; I had a dozen or so RT-processes sat waiting for disc
: >: >for about 20s.

: >: Are the files opened with O_DSYNC or does the application call fsync?

: >No.  O_WRONLY|O_CREAT|O_LARGEFILE|O_APPEND.  Would that help?

: Don't know if it will help, but it will be different :-).  I suspected
: that since you put the processes in the RT class you would also be doing
: synchronous writes.

Right.  I'll let you know on Monday; I'll need to restart it in the
morning.

I put the processes in the RT class as without it they dropped packets
once in a while, especially on lesser hardware (a Netra T1 can't cope
without it; a Niagara usually can...).  Very odd.

: If you can test this it may be worth doing so for the sake of gathering
: another data point.

Noted.  I suspect (from reading the man pages) it won't make much
difference, as to my mind it looks like a scheduling issue.

Just for interest's sake: when everything is behaving and we're writing
only, 'zpool iostat 10' normally looks like:

               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
content     56.9G  2.66T      0    118      0  9.64M

whilst reading and writing it looks like:

content     69.8G  2.65T    435    103  54.3M  9.63M

and when everything breaks, it looks like:

content      119G  2.60T    564      0  66.3M      0

prstat usually shows processes idling, a priority of 125 for a moment, and
other behaviour that I'd expect.  When it all breaks, I get most of them
sat at priority 125, thumb-twiddling.

Perplexing.

--
Dickon Hood
> I was under the impression that real-time processes essentially trump all
> others, and I'm surprised by this behaviour; I had a dozen or so
> RT-processes sat waiting for disc for about 20s.

Process priorities on Solaris affect CPU scheduling, but not (currently)
I/O scheduling or memory usage.

> * Is this a ZFS issue?  Would we be better using another filesystem?

It is a ZFS issue, though depending on your I/O patterns, you might be
able to see similar starvation on other file systems.  In general, other
file systems issue I/O independently, so on average each process will
make roughly equal forward progress on a continuous basis.  You still
don't have guaranteed I/O rates (in the sense that XFS on SGI, for
instance, provides).

> * Is there any way to mitigate against it?  Reduce the number of iops
>   available for reading, say?
> * Is there any way to disable or invert this behaviour?

I'll let the ZFS developers tackle this one....

---

Have you considered using two systems (or two virtual systems) to ensure
that the writer isn't affected by reads?  Some QFS customers use this
configuration, with one system writing to disk and another system reading
from the same disk.  This requires the use of a SAN file system, but it
provides the potential for much greater (and controllable) throughput.
If your I/O needs are modest (less than a few GB/second), this is
overkill.

Anton
On Fri, Dec 07, 2007 at 05:27:25 -0800, Anton B. Rang wrote:
: > I was under the impression that real-time processes essentially trump all
: > others, and I'm surprised by this behaviour; I had a dozen or so
: > RT-processes sat waiting for disc for about 20s.

: Process priorities on Solaris affect CPU scheduling, but not (currently)
: I/O scheduling or memory usage.

Ah, hmm.  I hadn't appreciated that.  I'm surprised.

: > * Is this a ZFS issue?  Would we be better using another filesystem?

: It is a ZFS issue, though depending on your I/O patterns, you might be
: able to see similar starvation on other file systems.  In general, other
: file systems issue I/O independently, so on average each process will
: make roughly equal forward progress on a continuous basis.  You still
: don't have guaranteed I/O rates (in the sense that XFS on SGI, for
: instance, provides).

That would make sense.  I've not seen this before on any other filesystem.

: > * Is there any way to mitigate against it?  Reduce the number of iops
: >   available for reading, say?
: > * Is there any way to disable or invert this behaviour?

: I'll let the ZFS developers tackle this one....

: Have you considered using two systems (or two virtual systems) to ensure
: that the writer isn't affected by reads?  Some QFS customers use this
: configuration, with one system writing to disk and another system reading
: from the same disk.  This requires the use of a SAN file system, but it
: provides the potential for much greater (and controllable) throughput.
: If your I/O needs are modest (less than a few GB/second), this is
: overkill.

We're writing (currently) about 10MB/s; this may rise to about double
that if we add the other multiplexes.

We're taking the BBC's DVB content off-air, splitting it into programme
chunks, and moving it from the machine that's doing the recording to a
filestore.  As it's off-air streams, we have no control over the inbound
data -- it just arrives whether we like it or not.  We do control the
movement from the recorder to the filestore, but as this is largely
achieved via a Perl module calling sendfile(), even that's mostly out of
our hands.

Definitely a headscratcher.

--
Dickon Hood
On Fri, Dec 07, 2007 at 13:14:56 +0000, I wrote:
: On Fri, Dec 07, 2007 at 12:58:17 +0000, Darren J Moffat wrote:
: : Dickon Hood wrote:
: : >On Fri, Dec 07, 2007 at 12:38:11 +0000, Darren J Moffat wrote:
: : >: Dickon Hood wrote:

: : >: >We're seeing the writes stall in favour of the reads.  For normal
: : >: >workloads I can understand the reasons, but I was under the impression
: : >: >that real-time processes essentially trump all others, and I'm
: : >: >surprised by this behaviour; I had a dozen or so RT-processes sat
: : >: >waiting for disc for about 20s.

: : >: Are the files opened with O_DSYNC or does the application call fsync?

: : >No.  O_WRONLY|O_CREAT|O_LARGEFILE|O_APPEND.  Would that help?

: : Don't know if it will help, but it will be different :-).  I suspected
: : that since you put the processes in the RT class you would also be doing
: : synchronous writes.

: Right.  I'll let you know on Monday; I'll need to restart it in the
: morning.

I was a tad busy yesterday and didn't have the time, but I've switched one
of our recorder processes (the one doing the HD stream; ~17Mb/s,
broadcasting a preview we don't mind trashing) to a version of the code
which opens its file O_DSYNC as suggested.

We've gone from ~130 write ops per second and 10MB/s to ~450 write ops per
second and 27MB/s, with marginally higher CPU usage.  This is roughly what
I'd expect.

We've artificially throttled the reads, which has helped (but not fixed;
it isn't as deterministic as we'd like) the starvation problem, at the
expense of increasing a latency we'd rather have as close to zero as
possible.

Any ideas?

Thanks.

--
Dickon Hood
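[Purely as an illustration of what "throttling the reads" can mean -- not
the code actually in use here, which goes through a Perl module and
sendfile() -- a crude sketch that paces a reader by sleeping between
fixed-size chunks.  The chunk size and bandwidth cap are made up.]

    /* Crude rate limiter: sleep after each read so that, ignoring the
     * time spent in read() itself, throughput stays near MAX_BPS. */
    #include <sys/types.h>
    #include <stdint.h>
    #include <unistd.h>

    #define CHUNK    (1 << 20)            /* read 1 MB at a time   */
    #define MAX_BPS  (20 * 1024 * 1024)   /* cap reads at ~20 MB/s */

    ssize_t throttled_read(int fd, void *buf)
    {
        ssize_t n = read(fd, buf, CHUNK);
        if (n > 0) {
            useconds_t us = (useconds_t)((uint64_t)n * 1000000 / MAX_BPS);
            usleep(us);
        }
        return n;
    }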
Dickon Hood writes:

> I was a tad busy yesterday and didn't have the time, but I've switched one
> of our recorder processes (the one doing the HD stream; ~17Mb/s,
> broadcasting a preview we don't mind trashing) to a version of the code
> which opens its file O_DSYNC as suggested.
>
> We've gone from ~130 write ops per second and 10MB/s to ~450 write ops per
> second and 27MB/s, with marginally higher CPU usage.  This is roughly what
> I'd expect.
>
> We've artificially throttled the reads, which has helped (but not fixed;
> it isn't as deterministic as we'd like) the starvation problem, at the
> expense of increasing a latency we'd rather have as close to zero as
> possible.
>
> Any ideas?

O_DSYNC was a good idea.  Then, if you have a recent Nevada build, you can
use a separate intent log (the 'log' keyword in 'zpool create') to absorb
those writes without spindle competition from the reads.  Your write
workload should then be well handled here (unless the incoming network
processing is itself delayed).

-r
On Wed, Dec 12, 2007 at 10:27:56 +0100, Roch - PAE wrote:
: O_DSYNC was a good idea.  Then, if you have a recent Nevada build, you
: can use a separate intent log (the 'log' keyword in 'zpool create') to
: absorb those writes without spindle competition from the reads.  Your
: write workload should then be well handled here (unless the incoming
: network processing is itself delayed).

Thanks for the suggestion -- I'll see if we can give that a go.

--
Dickon Hood
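[For illustration, a sketch of how a separate intent log device could be
attached, using the pool name 'content' from the iostat output earlier.
The device names and pool layout are hypothetical and would depend on the
actual hardware.]

    # Create a new pool with a dedicated log device:
    zpool create content raidz c1t0d0 c1t1d0 c1t2d0 log c2t0d0

    # Or add a log device to an existing pool (recent Nevada builds):
    zpool add content log c2t0d0

    # Mirrored log devices are also possible:
    zpool add content log mirror c2t0d0 c2t1d0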