It has been quite some time (about a year) since I did testing of batch processing with my software (GraphicsMagick). In between time, ZFS added write-throttling. I am using Solaris 10 with kernel 141415-03.

Quite a while back I complained that ZFS was periodically stalling the writing process (which UFS did not do). The ZFS write-throttling feature was supposed to avoid that. In my testing today I am still seeing ZFS stall the writing process periodically. When the process is stalled, there is a burst of disk activity, a burst of context switching, and total CPU use drops to almost zero. Zpool iostat says that read bandwidth is 15.8M and write bandwidth is 15.8M over a 60 second averaging interval. Since my drive array is good for writing over 250MB/second, this is a very small write load and the array is loafing.

My program uses the simple read->process->write approach. Each file written (about 8MB/file) is written contiguously and written just once. Data is read and written in 128K blocks. For this application there is no value obtained by caching the file just written. From what I am seeing, reading occurs as needed, but writes are being batched up until the next ZFS synchronization cycle. During the ZFS synchronization cycle it seems that processes are blocked from writing. Since my system has a lot of memory and the ARC is capped at 10GB, quite a lot of data can be queued up to be written. The ARC is currently running at its limit of 10GB.

If I tell my software to invoke fsync() before closing each written file, then the stall goes away, but the program then needs to block so there is less beneficial use of the CPU.

If this application stall annoys me, I am sure that it would really annoy a user with mission-critical work which needs to get done on a uniform basis.

If I run this little script then the application runs more smoothly but I see evidence of many shorter stalls:

while true
do
  sleep 3
  sync
done

Is there a solution in the works for this problem?

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
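[A minimal sketch of how the stall pattern described above can be watched from two terminals while the batch job runs; "Sun_2540" is the pool name used later in this thread, adjust to taste.]

  # Terminal 1: pool-level I/O once per second -- the sync cycle shows up
  # as several seconds of near-zero writes followed by one large burst.
  zpool iostat Sun_2540 1

  # Terminal 2: CPU use and context switches -- the "cs" column spikes and
  # the idle column jumps while the TXG is being synced.
  vmstat 1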
is this a direct write to a zfs filesystem or is it some kind of zvol export?

anyway, sounds similar to this:

http://opensolaris.org/jive/thread.jspa?threadID=105702&tstart=0
On Tue, 23 Jun 2009, milosz wrote:

> is this a direct write to a zfs filesystem or is it some kind of zvol export?

This is direct write to a zfs filesystem implemented as six mirrors of 15K RPM 300GB drives on a Sun StorageTek 2500. This setup tests very well under iozone and performs remarkably well when extracting from large tar files.

> anyway, sounds similar to this:
>
> http://opensolaris.org/jive/thread.jspa?threadID=105702&tstart=0

Yes, this does sound very similar. It looks to me like data from read files is clogging the ARC so that there is no more room for more writes when ZFS periodically goes to commit unwritten data. The "Perfmeter" tool shows that almost all disk I/O occurs during a brief interval of time. The storage array is capable of writing at high rates, but ZFS is coming at it with huge periodic writes which are surely much larger than what the array's internal buffering can handle.

What is clear to me is that my drive array is "loafing". The application runs much slower than expected and zfs is to blame for this. Observed write performance could be sustained by a single fast disk drive. In fact, if I direct the output to a single SAS drive formatted with UFS, the observed performance is fairly similar except there are no stalls until iostat reports that the drive is extremely (close to 99%) busy. When the UFS-formatted drive is reported to be 60% busy (at 48MB/second), application execution is very smooth. If a similar rate is sent to the ZFS pool (52.9MB/second according to zpool iostat) and the individual drives in the pool are reported to be 5 to 33% busy (24-31% for 60 second average), then execution stutters for three seconds at a time as the 1.5GB to 3GB of "written" data which has been batched up is suddenly written.

Something else interesting I notice is that performance is not consistent over time:

% zpool iostat Sun_2540 60
              capacity     operations    bandwidth
pool        used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
Sun_2540     460G  1.18T    368    447  45.7M  52.9M
Sun_2540     463G  1.18T    336    400  42.1M  47.5M
Sun_2540     465G  1.17T    341    400  42.6M  47.2M
Sun_2540     469G  1.17T    280    473  34.8M  55.9M
Sun_2540     472G  1.17T    286    449  35.5M  52.5M
Sun_2540     474G  1.17T    338    391  42.1M  45.7M
Sun_2540     477G  1.16T    332    400  41.3M  47.0M
Sun_2540     479G  1.16T    300    356  37.5M  41.4M
Sun_2540     482G  1.16T    314    381  39.3M  43.8M
Sun_2540     485G  1.15T    520    479  63.0M  55.9M
Sun_2540     490G  1.15T    564    722  67.3M  84.7M
Sun_2540     494G  1.15T    586    539  70.4M  63.1M
Sun_2540     499G  1.14T    549    698  66.9M  81.9M
Sun_2540     504G  1.14T    547    749  65.6M  87.7M
Sun_2540     507G  1.13T    584    495  70.8M  57.8M
Sun_2540     512G  1.13T    544    822  64.9M  91.1M
Sun_2540     516G  1.13T    596    527  72.0M  60.4M
Sun_2540     521G  1.12T    561    759  68.0M  87.2M
Sun_2540     526G  1.12T    548    779  65.9M  88.6M

A 2X variation in minute-to-minute performance while performing consistently similar operations is remarkable. Also notice that the write data rates are gradually increasing (on average) even though the task being performed remains the same.

Here is a Perfmeter graph showing what is happening in normal operation:

http://www.simplesystems.org/users/bfriesen/zfs-discuss/perfmeter-stalls.png

and here is one which shows what happens if fsync() is used to force the file data entirely to disk immediately after each file has been written:

http://www.simplesystems.org/users/bfriesen/zfs-discuss/perfmeter-fsync.png

Bob
>> http://opensolaris.org/jive/thread.jspa?threadID=105702&tstart=0
>
> Yes, this does sound very similar. It looks to me like data from read files is clogging the ARC so that there is no more room for more writes when ZFS periodically goes to commit unwritten data.

I'm wondering if changing txg_time to a lower value might help.
On Wed, 24 Jun 2009, Ethan Erchinger wrote:

>>> http://opensolaris.org/jive/thread.jspa?threadID=105702&tstart=0
>>
>> Yes, this does sound very similar. It looks to me like data from read files is clogging the ARC so that there is no more room for more writes when ZFS periodically goes to commit unwritten data.
>
> I'm wondering if changing txg_time to a lower value might help.

There is no doubt that having ZFS sync the written data more often would help. However, it should not be necessary to tune the OS for such a common task as batch processing a bunch of files. A more appropriate solution is for ZFS to notice that more than XXX megabytes are uncommitted, so maybe it should wake up and go write some data.

It is useful for ZFS to defer data writes in case the same file is updated many times. In the case where the same file is updated many times, the total uncommitted data is still limited by the amount of data which is re-written and so the 30 second cycle is fine. In my case the amount of uncommitted data is limited by available RAM and how fast my application is able to produce new data to write.

The problem is very much related to how fast the data is output. If the new data is created at a slower rate (output files are smaller) then the problem just goes away.

Bob
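[For completeness, a hedged sketch of the tuning Ethan suggests. It assumes the tunable on this kernel is still named txg_time, as in Roch's write-throttle blog quoted later in the thread; later builds rename it (e.g. zfs_txg_timeout), and as Bob notes this is a workaround rather than a fix.]

  # live change with mdb (value is in seconds; reverts at reboot)
  echo txg_time/W0t1 | mdb -kw

  # or persistently, via /etc/system (takes effect after a reboot):
  #   set zfs:txg_time = 1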
Hello Bob,

I think that is related with my post about "zio_taskq_threads and TXG sync":
( http://www.opensolaris.org/jive/thread.jspa?threadID=105703&tstart=0 )

Roch did say that this is on top of the performance problems, and in the same email I did talk about the change from 5s to 30s, which I think makes this problem worse if this txg sync interval is "fixed".

Leal
[ http://www.eall.com.br/blog ]
On Wed, 24 Jun 2009, Marcelo Leal wrote:

> I think that is related with my post about "zio_taskq_threads and TXG sync":
> ( http://www.opensolaris.org/jive/thread.jspa?threadID=105703&tstart=0 )
> Roch did say that this is on top of the performance problems, and in the same email I did talk about the change from 5s to 30s, which I think makes this problem worse if this txg sync interval is "fixed".

The problem is that basing disk writes on a simple timeout and available memory does not work. It is easy for an application to write considerable amounts of new data in 30 seconds, or even 5 seconds. If the application blocks while the data is being committed, then the application is not performing any useful function during that time.

Current ZFS write behavior makes it not very useful for the creative media industries even though otherwise it should be a perfect fit, since hundreds of terabytes of working disk (or even petabytes) are normal for this industry. For example, when data is captured to disk from film via a datacine (real time = 24 files/second and 6MB to 50MB per file), or captured to disk from a high-definition video camera, there is little margin for error and blocking on writes will result in missed frames or other malfunction. Current ZFS write behavior is based on timing and the amount of system memory and it does not seem that throwing more storage hardware at the problem solves anything at all.

Bob
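[To put numbers on that capture rate, simple arithmetic from the figures above:]

  # 24 files/second at 6 MB to 50 MB per file
  echo '24 * 6'  | bc    # 144  MB/s sustained at the low end
  echo '24 * 50' | bc    # 1200 MB/s sustained at the high end
  # so a 5-second TXG can accumulate roughly 0.7 GB to 6 GB of capture data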
Wouldn't it make sense for the timing technique to be used if the data is coming in at a rate slower than the underlying disk storage?

But then if the data starts to come at a faster rate, ZFS needs to start streaming to disk as quickly as it can, and instead of re-ordering writes in blocks, it should just do the best it can with whatever is currently in memory. And when that mode activates, inbound data should be throttled to match the current throughput to disk.

That preserves the efficient write ordering that ZFS was originally designed for, but means a more graceful degradation under load, with the system tending towards a steady state of throughput that matches what you would expect from other filesystems on those physical disks.

Of course, I have no idea how difficult this is technically. But the idea seems reasonable to me.
Bob Friesenhahn wrote:

> Current ZFS write behavior makes it not very useful for the creative media industries even though otherwise it should be a perfect fit, since hundreds of terabytes of working disk (or even petabytes) are normal for this industry. For example, when data is captured to disk from film via a datacine (real time = 24 files/second and 6MB to 50MB per file), or captured to disk from a high-definition video camera, there is little margin for error and blocking on writes will result in missed frames or other malfunction.

I wonder whether a filesystem property "streamed" might be appropriate? This could act as a hint to ZFS that the data is sequential and should be streamed direct to disk.

--
Ian.
On Wed, 24 Jun 2009, Ross wrote:

> Wouldn't it make sense for the timing technique to be used if the data is coming in at a rate slower than the underlying disk storage?

I am not sure how zfs would know the rate of the underlying disk storage without characterizing it for a while with actual I/O. Regardless, buffering up to 3GB of data and then writing it all at once does not make sense regardless of the write rate of the underlying disk storage. It results in the I/O channel being completely clogged for 3-7 seconds.

> But then if the data starts to come at a faster rate, ZFS needs to start streaming to disk as quickly as it can, and instead of re-ordering writes in blocks, it should just do the best it can with whatever is currently in memory. And when that mode activates, inbound data should be throttled to match the current throughput to disk.

In my case, the data is produced at a continual rate (40-80MB/s). ZFS batches it up in a huge buffer for 30 seconds and then writes it all at once. It is not clear to me if the writer is blocking, or if the reader is blocking due to ZFS's sudden huge use of the I/O channel. I am sure that I could find the answer via dtrace.

> That preserves the efficient write ordering that ZFS was originally designed for, but means a more graceful degradation under load, with the system tending towards a steady state of throughput that matches what you would expect from other filesystems on those physical disks.

In this case the files are complete and ready to be written in optimum order. Of course ZFS has no way to know that the application won't try to update them again.

Bob
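[A sketch of the kind of dtrace one-liner that could answer that question, assuming the GraphicsMagick binary shows up as execname "gm"; adjust the name for the actual application.]

  dtrace -n '
  syscall::read:entry, syscall::write:entry
  /execname == "gm"/
  {
          self->ts = timestamp;
  }

  syscall::read:return, syscall::write:return
  /self->ts/
  {
          /* latency histograms, one per syscall name */
          @[probefunc] = quantize(timestamp - self->ts);
          self->ts = 0;
  }'

A long tail on the read histogram that lines up with the TXG sync bursts would point at stalled reads; a long tail on write would point at throttled writers.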
I think that is the purpose of the current implementation:

http://blogs.sun.com/roch/entry/the_new_zfs_write_throttle

But it seems like it is not that easy... As I understood what Roch said, the cause is not always a "hardy" writer.

Leal
[ http://www.eall.com.br/blog ]
On Wed, 24 Jun 2009, Marcelo Leal wrote:

> I think that is the purpose of the current implementation:
> http://blogs.sun.com/roch/entry/the_new_zfs_write_throttle
> But it seems like it is not that easy... As I understood what Roch said, the cause is not always a "hardy" writer.

I see this:

  "The new code keeps track of the amount of data accepted in a TXG and the time it takes to sync. It dynamically adjusts that amount so that each TXG sync takes about 5 seconds (txg_time variable). It also clamps the limit to no more than 1/8th of physical memory."

It is interesting that it was decided that a TXG sync should take 5 seconds by default. That does seem to be about what I am seeing here. There is no mention of the devastation to the I/O channel which occurs if the kernel writes 5 seconds worth of data (e.g. 2GB) as fast as possible on a system using mirroring (2GB becomes 4GB of writes). If it writes 5 seconds of data as fast as possible, then it seems that this blocks any opportunity to read more data so that application processing can continue during the TXG sync.

Bob
On Thu, 25 Jun 2009, Ian Collins wrote:

> I wonder whether a filesystem property "streamed" might be appropriate? This could act as a hint to ZFS that the data is sequential and should be streamed direct to disk.

ZFS does not seem to offer an ability to stream direct to disk other than perhaps via the special "raw" mode known to database developers.

It seems that current ZFS behavior is "works as designed". The write transaction time is currently tuned for 5 seconds and so it writes data intensely for 5 seconds while either starving the readers and/or blocking the writers. Notice that by the end of the TXG write, zpool iostat is reporting zero reads:

% zpool iostat Sun_2540 1
              capacity     operations    bandwidth
pool        used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
Sun_2540     456G  1.18T     14      0  1.86M      0
Sun_2540     456G  1.18T      0     19      0  1.47M
Sun_2540     456G  1.18T      0  3.11K      0   385M
Sun_2540     456G  1.18T      0  3.00K      0   385M
Sun_2540     456G  1.18T      0  3.34K      0   387M
Sun_2540     456G  1.18T      0  3.01K      0   386M
Sun_2540     458G  1.18T     19  1.87K  30.2K   220M
Sun_2540     458G  1.18T      0      0      0      0
Sun_2540     458G  1.18T    275      0  34.4M      0
Sun_2540     458G  1.18T    448      0  56.1M      0
Sun_2540     458G  1.18T    468      0  58.5M      0
Sun_2540     458G  1.18T    425      0  53.2M      0
Sun_2540     458G  1.18T    402      0  50.4M      0
Sun_2540     458G  1.18T    364      0  45.5M      0
Sun_2540     458G  1.18T    339      0  42.4M      0
Sun_2540     458G  1.18T    376      0  47.0M      0
Sun_2540     458G  1.18T    307      0  38.5M      0
Sun_2540     458G  1.18T    380      0  47.5M      0
Sun_2540     458G  1.18T    148  1.35K  18.3M   117M
Sun_2540     458G  1.18T     20  3.01K  2.60M   385M
Sun_2540     458G  1.18T     15  3.00K  1.98M   384M
Sun_2540     458G  1.18T      4  3.03K   634K   388M
Sun_2540     458G  1.18T      0  3.01K      0   386M
Sun_2540     460G  1.18T    142    792  15.8M  82.7M
Sun_2540     460G  1.18T    375      0  46.9M      0

Here is an interesting discussion thread on another list that I had not seen before:

http://opensolaris.org/jive/thread.jspa?messageID=347212

Bob
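[A hedged way to see the same starvation at the individual-device level rather than the pool level, complementing the zpool iostat output above; these are standard Solaris iostat flags, with -z suppressing idle devices.]

  # pool-level, 1-second samples (as above)
  zpool iostat Sun_2540 1

  # per-device service times and %busy, 1-second samples
  iostat -xnz 1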
Bob Friesenhahn wrote:

> I see this:
>
> "The new code keeps track of the amount of data accepted in a TXG and the time it takes to sync. It dynamically adjusts that amount so that each TXG sync takes about 5 seconds (txg_time variable). It also clamps the limit to no more than 1/8th of physical memory."

hmmm... methinks there is a chance that the 1/8th rule might not work so well for machines with lots of RAM and slow I/O. I'm also reasonably sure that that sort of machine is not what Sun would typically build for performance lab testing, as a rule. Hopefully Roch will comment when it is morning in Europe.

-- richard
On Wed, 24 Jun 2009, Richard Elling wrote:

>> "The new code keeps track of the amount of data accepted in a TXG and the time it takes to sync. It dynamically adjusts that amount so that each TXG sync takes about 5 seconds (txg_time variable). It also clamps the limit to no more than 1/8th of physical memory."
>
> hmmm... methinks there is a chance that the 1/8th rule might not work so well for machines with lots of RAM and slow I/O. I'm also reasonably sure that that sort of machine is not what Sun would typically build for performance lab testing, as a rule. Hopefully Roch will comment when it is morning in Europe.

Slow I/O is relative. If I install more memory does that make my I/O even slower?

I did some more testing. I put the input data on a different drive and sent application output to the ZFS pool. I no longer noticed any stalls in the execution even though the large ZFS flushes are taking place. This proves that my application is seeing stalled reads rather than stalled writes.

Bob
> I did some more testing. I put the input data on a different drive and sent application output to the ZFS pool. I no longer noticed any stalls in the execution even though the large ZFS flushes are taking place. This proves that my application is seeing stalled reads rather than stalled writes.

There is a bug in the database about reads blocked by writes which may be related:

http://bugs.opensolaris.org/view_bug.do?bug_id=6471212

The symptom is sometimes reducing queue depth makes read perform better.
> I am not sure how zfs would know the rate of the underlying disk storage

Easy: is the buffer growing? :-)

If the amount of data in the buffer is growing, you need to throttle back a bit until the disks catch up. Don't stop writes until the buffer is empty, just slow them down to match the rate at which you're clearing data from the buffer.

In your case I'd expect to see ZFS buffer the early part of the write (so you'd see a very quick initial burst), but from then on you would want a continual stream of data to disk, at a steady rate. To the client it should respond just like storing to disk, the only difference is there's actually a small delay before the data hits the disk, which will be proportional to the buffer size. ZFS won't have so much opportunity to optimize writes, but you wouldn't get such stuttering performance.

However, reading through the other messages, if it's a known bug and ZFS is blocking reads while writing, there may not be any need for this idea. But then, that bug has been open since 2006, is flagged as fix in progress, and was planned for snv_51... o_0. So it probably is worth having this discussion.

And I may be completely wrong here, but reading that bug, it sounds like ZFS issues a whole bunch of writes at once as it clears the buffer, which ties in with the experiences of stalling actually being caused by reads being blocked. I'm guessing given ZFS's aims it made sense to code it that way - if you're going to queue a bunch of transactions to make them efficient on disk, you don't want to interrupt that batch with a bunch of other (less efficient) reads.

But the unintended side effect of this is that ZFS's attempt to optimize writes will cause jerky read and write behaviour any time you have a large amount of writes going on, and when you should be pushing the disks to 100% usage you're never going to reach that, as it's always going to have 5s of inactivity followed by 5s of running the disks flat out.

In fact, I wonder if it's as simple as the disks ending up doing 5s of reads, a delay for processing, 5s of writes, 5s of reads, etc...

It's probably efficient, but it's going to *feel* horrible; a 5s delay is easily noticeable by the end user, and is a deal breaker for many applications. In situations like that, 5s is a *huge* amount of time, especially so if you're writing to a disk or storage device which has its own caching!

Might it be possible to keep the 5s buffer for ordering transactions, but then commit that as a larger number of small transactions instead of one huge one? The number of transactions could even be based on how busy the system is - if there are a lot of reads coming in, I'd be quite happy to split that into 50 transactions. On 10GbE, 5s is potentially 6.25GB of data. Even split into 50 transactions you're writing 128MB at a time, and that sounds plenty big enough to me!

Either way, something needs to be done. If we move to ZFS our users are not going to be impressed with 5s delays on the storage system.

Finally, I do have one question for the ZFS guys: how does the L2ARC interact with this? Are reads from the L2ARC blocked, or will they happen in parallel with the writes to the main storage? I suspect that a large L2ARC (potentially made up of SSD disks) would eliminate this problem the majority of the time.
On Wed, 24 Jun 2009, Lejun Zhu wrote:

> There is a bug in the database about reads blocked by writes which may be related:
>
> http://bugs.opensolaris.org/view_bug.do?bug_id=6471212
>
> The symptom is sometimes reducing queue depth makes read perform better.

This one certainly sounds promising. Since Matt Ahrens has been working on it for almost a year, it must be almost fixed by now. :-)

I am not sure how the queue depth is managed, but it seems possible to detect when reads are blocked by bulk writes and make some automatic adjustments to improve balance.

Bob
On Thu, 25 Jun 2009, Ross wrote:

> But the unintended side effect of this is that ZFS's attempt to optimize writes will cause jerky read and write behaviour any time you have a large amount of writes going on, and when you should be pushing the disks to 100% usage you're never going to reach that, as it's always going to have 5s of inactivity followed by 5s of running the disks flat out.
>
> In fact, I wonder if it's as simple as the disks ending up doing 5s of reads, a delay for processing, 5s of writes, 5s of reads, etc...
>
> It's probably efficient, but it's going to *feel* horrible; a 5s delay is easily noticeable by the end user, and is a deal breaker for many applications.

Yes, 5 seconds is a long time. For an application which mixes computation with I/O it is not really acceptable for read I/O to go away for up to 5 seconds. This represents time that the CPU is not being used, and a time that the application may be unresponsive to the user. When compression is used the impact is different, but the compression itself consumes considerable CPU (and quite abruptly) so that other applications (e.g. X11) stop responding during the compress/write cycle.

The read problem is one of congestion. If I/O is congested with massive writes, then reads don't work. It does not really matter how fast your storage system is. If the 5 seconds of buffered writes are larger than what the device driver and storage system buffering allows for, then the I/O channel will be congested. As an example, my storage array is demonstrated to be able to write 359MB/second but ZFS will blast data from memory as fast as it can, and the storage path can not effectively absorb 1.8GB (359*5) of data since the StorageTek 2500's internal buffers are much smaller than that, and fiber channel device drivers are not allowed to consume much memory either. To make matters worse, I am using ZFS mirrors so the amount of data written to the array in those five seconds is doubled to 3.6GB.

Bob
On Wed, 24 Jun 2009, Lejun Zhu wrote:

> There is a bug in the database about reads blocked by writes which may be related:
>
> http://bugs.opensolaris.org/view_bug.do?bug_id=6471212
>
> The symptom is sometimes reducing queue depth makes read perform better.

I have been banging away at this issue without resolution. Based on Roch Bourbonnais's blog description of the ZFS write throttle code, it seems that I am facing a perfect storm. Both the storage write bandwidth (800+ MB/second) and the memory size of my system (20 GB) result in the algorithm batching up 2.5 GB of user data to write. Since I am using mirrors, this results in 5 GB of data being written at full speed to the array on a very precise schedule, since my application is processing fixed-sized files with a fixed algorithm. The huge writes lead to at least 3 seconds of read starvation, resulting in a stalled application and a square wave of system CPU utilization. I could attempt to modify my application to read ahead by 3 seconds but that would require gigabytes of memory, lots of complexity, and would not be efficient.

Richard Elling thinks that my array is pokey, but based on write speed and memory size, ZFS is always going to be batching up data to fill the write channel for 5 seconds so it does not really matter how fast that write channel is. If I had 32GB of RAM and 2X the write speed, the situation would be identical.

Hopefully someone at Sun is indeed working this read starvation issue and it will be resolved soon.

Bob
On Mon, Jun 29, 2009 at 2:48 PM, Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:

> The huge writes lead to at least 3 seconds of read starvation, resulting in a stalled application and a square wave of system CPU utilization.

I see similar square-wave performance. However, my load is primarily write-based; when those commits happen, I see all network activity pause while the buffer is committed to disk.

I write about 750Mbit/sec over the network to the X4540s during backup windows, using primarily iSCSI. When those writes occur to my RaidZ volume, all activity pauses until the writes are fully flushed.

One thing to note: on 117, the effects are seemingly reduced and performance is a bit more even, but it is still there.

--
Brent Jones
brent at servuhome.net
> I have been banging away at this issue without resolution. Based on Roch Bourbonnais's blog description of the ZFS write throttle code, it seems that I am facing a perfect storm. Both the storage write bandwidth (800+ MB/second) and the memory size of my system (20 GB) result in the algorithm batching up 2.5 GB of user data to write.

With ZFS write throttle, the number 2.5GB is tunable. From what I've read in the code, it is possible to e.g. set zfs:zfs_write_limit_override = 0x8000000 (bytes) to make it write 128M instead.
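[A sketch of applying that override persistently, assuming the usual /etc/system module:variable syntax works for this tunable as written above; it takes effect after a reboot, and a value of 0 appears to mean "no override" (the dynamic throttle decides).]

  # /etc/system -- clamp each TXG to 128 MB of dirty data (0x8000000 bytes)
  set zfs:zfs_write_limit_override = 0x8000000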
> backup windows using primarily iSCSI. When those writes occur to my RaidZ volume, all activity pauses until the writes are fully flushed.

The more I read about this, the worse it sounds. The thing is, I can see where the ZFS developers are coming from - in theory this is a more efficient use of the disk, and with that being the slowest part of the system, there probably is a slight benefit in computational time.

However, it completely breaks any process like this that can't afford 3-5s delays in processing, it makes ZFS a nightmare for things like audio or video editing (where it would otherwise be a perfect fit), and it's also horrible from the perspective of the end user.

Does anybody know if an L2ARC would help this? Does that work off a different queue, or would reads still be blocked?

I still think a simple solution to this could be to split the ZFS writes into smaller chunks. That creates room for reads to be squeezed in (with the ratio of reads to writes something that should be automatically balanced by the software), but you still get the benefit of ZFS write ordering with all the work that's gone into perfecting that. Regardless of whether there are reads or not, your data is always going to be written to disk in an optimized fashion, and you could have a property on the pool that specifies how finely chopped up writes should be, allowing this to be easily tuned.

We're considering ZFS as storage for our virtualization solution, and this could be a big concern. We really don't want the entire network pausing for 3-5 seconds any time there is a burst of write activity.
On Tue, 30 Jun 2009, Ross wrote:

> However, it completely breaks any process like this that can't afford 3-5s delays in processing, it makes ZFS a nightmare for things like audio or video editing (where it would otherwise be a perfect fit), and it's also horrible from the perspective of the end user.

Yes. I updated the image at

http://www.simplesystems.org/users/bfriesen/zfs-discuss/perfmeter-stalls.png

so that it shows the execution impact with more processes running. This is taken with three processes running in parallel so that there can be no doubt that I/O is being globally blocked and it is not just misbehavior of a single process.

Bob
For what it is worth, I too have seen this behavior when load testing our zfs box. I used iometer and the RealLife profile (1 worker, 1 target, 65% reads, 60% random, 8k, 32 IOs in the queue). When writes are being dumped, reads drop close to zero, from 600-700 read IOPS to 15-30 read IOPS.

zpool iostat data01 1      (data01 is my pool name)

              capacity     operations    bandwidth
pool        used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
data01      55.5G  20.4T    691      0  4.21M      0
data01      55.5G  20.4T    632      0  3.80M      0
data01      55.5G  20.4T    657      0  3.93M      0
data01      55.5G  20.4T    669      0  4.12M      0
data01      55.5G  20.4T    689      0  4.09M      0
data01      55.5G  20.4T    488  1.77K  2.94M  9.56M
data01      55.5G  20.4T     29  4.28K   176K  23.5M
data01      55.5G  20.4T     25  4.26K   165K  23.7M
data01      55.5G  20.4T     20  3.97K   133K  22.0M
data01      55.6G  20.4T    170  2.26K  1.01M  11.8M
data01      55.6G  20.4T    678      0  4.05M      0
data01      55.6G  20.4T    625      0  3.74M      0
data01      55.6G  20.4T    685      0  4.17M      0
data01      55.6G  20.4T    690      0  4.04M      0
data01      55.6G  20.4T    679      0  4.02M      0
data01      55.6G  20.4T    664      0  4.03M      0
data01      55.6G  20.4T    699      0  4.27M      0
data01      55.6G  20.4T    423  1.73K  2.66M  9.32M
data01      55.6G  20.4T     26  3.97K   151K  21.8M
data01      55.6G  20.4T     34  4.23K   223K  23.2M
data01      55.6G  20.4T     13  4.37K  87.1K  23.9M
data01      55.6G  20.4T     21  3.33K   136K  18.6M
data01      55.6G  20.4T    468    496  2.89M  1.82M
data01      55.6G  20.4T    687      0  4.13M      0

-Scott
On Mon, 29 Jun 2009, Lejun Zhu wrote:

> With ZFS write throttle, the number 2.5GB is tunable. From what I've read in the code, it is possible to e.g. set zfs:zfs_write_limit_override = 0x8000000 (bytes) to make it write 128M instead.

This works, and the difference in behavior is profound. Now it is a matter of finding the "best" value which optimizes both usability and performance. A tuning for 384 MB:

# echo zfs_write_limit_override/W0t402653184 | mdb -kw
zfs_write_limit_override:       0x30000000      =       0x18000000

CPU is smoothed out quite a lot and write latencies (as reported by a zio_rw.d dtrace script) are radically different than before.

Perfmeter display for 256 MB:
http://www.simplesystems.org/users/bfriesen/zfs-discuss/perfmeter-256mb.png

Perfmeter display for 384 MB:
http://www.simplesystems.org/users/bfriesen/zfs-discuss/perfmeter-384mb.png

Perfmeter display for 768 MB:
http://www.simplesystems.org/users/bfriesen/zfs-discuss/perfmeter-768mb.png

Bob
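[For reference, the 0t decimal value passed to mdb above is just the size in bytes; the three sizes tested work out as follows.]

  echo '256 * 1024 * 1024' | bc    # 268435456  (256 MB)
  echo '384 * 1024 * 1024' | bc    # 402653184  (384 MB, the value used above)
  echo '768 * 1024 * 1024' | bc    # 805306368  (768 MB)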
On Tue, Jun 30, 2009 at 12:25 PM, Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:

> This works, and the difference in behavior is profound. Now it is a matter of finding the "best" value which optimizes both usability and performance.

Maybe there could be a supported ZFS tuneable (per file system even?) that is optimized for 'background' tasks, or 'foreground'.

Beyond that, I will give this tuneable a shot and see how it impacts my own workload. Thanks!

--
Brent Jones
brent at servuhome.net
On Tue, 30 Jun 2009, Brent Jones wrote:

> Maybe there could be a supported ZFS tuneable (per file system even?) that is optimized for 'background' tasks, or 'foreground'.
>
> Beyond that, I will give this tuneable a shot and see how it impacts my own workload.

Note that this issue does not apply at all to NFS service, database service, or any other usage which does synchronous writes.

Bob
> CPU is smoothed out quite a lot

yes, but the area under the CPU graph is less, so the rate of real work performed is less, so the entire job took longer (albeit "smoother").

Rob
Interesting to see that it makes such a difference, but I wonder what effect it has on ZFS's write ordering, and its attempts to prevent fragmentation? By reducing the write buffer, are you losing those benefits?

Although on the flip side, I guess this is no worse off than any other filesystem, and as SSD drives take off, fragmentation is going to be less and less of an issue.
On Tue, 30 Jun 2009, Rob Logan wrote:

>> CPU is smoothed out quite a lot
>
> yes, but the area under the CPU graph is less, so the rate of real work performed is less, so the entire job took longer (albeit "smoother").

For the purpose of illustration, the case showing the huge sawtooth was when running three processes at once. The period/duration of the sawtooth was pretty similar, but the magnitude changes. I agree that there is a size which provides the best balance of smoothness and application performance. Probably the value should be dialed down to just below the point where the sawtooth occurs. More at 11.

Bob
> Note that this issue does not apply at all to NFS service, database service, or any other usage which does synchronous writes.

I see read starvation with NFS. I was using iometer on a Windows VM, connecting to an NFS mount on a 2008.11 physical box.

iometer params: 65% read, 60% random, 8k blocks, 32 outstanding IO requests, 1 worker, 1 target.

NFS testing:

              capacity     operations    bandwidth
pool        used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
data01      59.6G  20.4T     46     24   757K  3.09M
data01      59.6G  20.4T     39     24   593K  3.09M
data01      59.6G  20.4T     45     25   687K  3.22M
data01      59.6G  20.4T     45     23   683K  2.97M
data01      59.6G  20.4T     33     23   492K  2.97M
data01      59.6G  20.4T     16     41   214K  1.71M
data01      59.6G  20.4T      3  2.36K  53.4K  30.4M
data01      59.6G  20.4T      1  2.23K  20.3K  29.2M
data01      59.6G  20.4T      0  2.24K  30.2K  28.9M
data01      59.6G  20.4T      0  1.93K  30.2K  25.1M
data01      59.6G  20.4T      0  2.22K      0  28.4M
data01      59.7G  20.4T     21    295   317K  4.48M
data01      59.7G  20.4T     32     12   495K  1.61M
data01      59.7G  20.4T     35     25   515K  3.22M
data01      59.7G  20.4T     36     11   522K  1.49M
data01      59.7G  20.4T     33     24   508K  3.09M
data01      59.7G  20.4T     35     23   536K  2.97M
data01      59.7G  20.4T     32     23   483K  2.97M
data01      59.7G  20.4T     37     37   538K  4.70M

While writes are being committed to the ZIL all the time, periodic dumping to the pool still occurs, and during those times reads are starved. Maybe this doesn't happen in the 'real world'?

-Scott
Even if I set zfs_write_limit_override to 8053063680 I am unable to achieve the massive writes that Solaris 10 (141415-03) sends to my drive array by default.

When I read the blog entry at http://blogs.sun.com/roch/entry/the_new_zfs_write_throttle, I see this statement:

  "The new code keeps track of the amount of data accepted in a TXG and the time it takes to sync. It dynamically adjusts that amount so that each TXG sync takes about 5 seconds (txg_time variable). It also clamps the limit to no more than 1/8th of physical memory."

On my system I see that the "about 5 seconds" rule is being followed, but I see no sign of clamping the limit to no more than 1/8th of physical memory. There is no sign of clamping at all. The written data is captured and does take about 5 seconds to write (good estimate).

On my system with 20GB of RAM, and the ARC memory limit set to 10GB (zfs:zfs_arc_max = 0x280000000), the maximum zfs_write_limit_override value I can set is on the order of 8053063680, yet this results in a much smaller amount of data being written per write cycle than the Solaris 10 default operation. The default operation is 24 seconds of no write activity followed by 5 seconds of write.

On my system, 1/8 of memory would be 2.5GB. If I set the zfs_write_limit_override value to 2684354560 then it seems that about 1.2 seconds of data is captured for write. In this case I see 5 seconds of no write followed by maybe a second of write.

This causes me to believe that the algorithm is not implemented as described in Solaris 10.

Bob
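[One way to check what the throttle actually computed would be to read the related kernel variables directly; this is only a sketch, and the variable names follow the OpenSolaris dsl_pool.c source quoted below, so they may differ on this Solaris 10 kernel.]

  # current values, printed as 64-bit hex
  echo zfs_write_limit_max/J      | mdb -k
  echo zfs_write_limit_inflated/J | mdb -k
  echo zfs_write_limit_override/J | mdb -k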
> Note that this issue does not apply at all to NFS service, database service, or any other usage which does synchronous writes.
>
> Bob

Hello Bob,

There is impact for "all" workloads. Whether the write is sync or not is just a question of writing to the slog (SSD) or not; the txg interval and sync time are the same. Actually the zil code is just there to preserve that exact same thing for synchronous writes.

Leal
[ http://www.eall.com.br/blog ]
Actually it seems to be 3/4:

dsl_pool.c:

        zfs_write_limit_max = ptob(physmem) >> zfs_write_limit_shift;
        zfs_write_limit_inflated = MAX(zfs_write_limit_min,
            spa_get_asize(dp->dp_spa, zfs_write_limit_max));

While spa_get_asize is:

spa_misc.c:

uint64_t
spa_get_asize(spa_t *spa, uint64_t lsize)
{
        /*
         * For now, the worst case is 512-byte RAID-Z blocks, in which
         * case the space requirement is exactly 2x; so just assume that.
         * Add to this the fact that we can have up to 3 DVAs per bp, and
         * we have to multiply by a total of 6x.
         */
        return (lsize * 6);
}

Which will result in:

        zfs_write_limit_inflated = MAX((32 << 20), (ptob(physmem) >> 3) * 6);

Bob Friesenhahn wrote:

> On my system I see that the "about 5 seconds" rule is being followed, but I see no sign of clamping the limit to no more than 1/8th of physical memory. There is no sign of clamping at all.
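[Plugging the 20 GB machine from this thread into that formula, as a quick check of the 3/4 figure:]

  # zfs_write_limit_max      = physmem / 8       -> 2.5 GB of user data
  # zfs_write_limit_inflated = 6 * (physmem / 8) -> 15 GB (3/4 of RAM)
  echo '20 / 8'     | bc -l    # => 2.5
  echo '20 / 8 * 6' | bc -l    # => 15.0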
On Thu, 2 Jul 2009, Zhu, Lejun wrote:

> Actually it seems to be 3/4:

3/4 is an awful lot. That would be 15 GB on my system, which explains why the "5 seconds to write" rule is dominant. It seems that both rules are worthy of re-consideration.

There is also still the little problem that zfs is incapable of reading during all/much of the time it is syncing a TXG. Even if the TXG is written more often, readers will still block, resulting in a similar cumulative effect on performance.

Bob
On 02.07.09 22:05, Bob Friesenhahn wrote:

> On Thu, 2 Jul 2009, Zhu, Lejun wrote:
>
>> Actually it seems to be 3/4:
>
> 3/4 is an awful lot. That would be 15 GB on my system, which explains why the "5 seconds to write" rule is dominant.

3/4 is 1/8 * 6, where 6 is the worst-case inflation factor (for raid-z2 it is actually 9, and considering a ganged 1k block on raid-z2 in the really bad case it should be even bigger than that). DSL does inflate write sizes too, so inflated write sizes are compared against an inflated limit, so it should be fine.

victor
Is the system otherwise responsive during the zfs sync cycles?

I ask because I think I'm seeing a similar thing - except that it's not only other writers that block, it seems like other interrupts are blocked. Pinging my zfs server in 1s intervals results in large delays while the system syncs, followed by normal response times while the system buffers more input...

Thanks,
Tristan.

Bob Friesenhahn wrote:

> Quite a while back I complained that ZFS was periodically stalling the writing process (which UFS did not do). The ZFS write-throttling feature was supposed to avoid that. In my testing today I am still seeing ZFS stall the writing process periodically.
With regards to http://blogs.sun.com/roch/entry/the_new_zfs_write_throttle

I would have thought that if you have enough data to be written, it is worth just writing it, and not waiting X seconds or trying to adjust things so it only takes 5 seconds.

For example, different disk buses have different data sizes, e.g. (if I remember correctly) the Fiber Channel packet size is 2MB. If you have 2MB you can write to a single disk/lun, so why not just write it straight away?

If the transaction group meta data/log reaches a certain size (say 128k??), why not write the TXG?

If the transaction group meta data/log is estimated to take more than X ms, why not write the TXG? (Assume reads are stopped while this happens, to prevent large pauses.)

If any file has 2MB?? of outstanding data to be written, why not do a TXG and stall the process's write thread until the data is written? i.e. to prevent:

  "And to avoid the system wide and seconds long throttle effect, the new code will detect when we are dangerously close to that situation (7/8th of the limit) and will **insert 1 tick** delays for applications issuing writes. This prevents a write intensive thread from hogging the available space starving out other threads. This delay should also generally prevent the system wide throttle."

A write limit on an individual thread/file would prevent a single file filling up the ARC.

By write/TXG I mean close the existing open TXG and place it into quiescing, ready for syncing.

So the real question is: why wait, and why give the system the chance to stall? If you have enough data to write out to disk to allow the target disk(s) to perform at optimal performance, why not write the data out to the disks? (i.e. a decent write I/O size, even if the quiescing state needed to be turned into a small FIFO queue)

Back to the Tennis (Wimbledon)

Cheers
Red herring...

Actually, I had compression=gzip-9 enabled on that filesystem, which is apparently too much for the old Xeons in that server (it's a Dell 1850). The CPU was sitting at 100% kernel time while it tried to compress + sync. Switching to compression=off or compression=on (lzjb) makes the problem go away.

Interestingly, creating a second processor set also alleviates many of the symptoms - certainly the slow ping goes away. Assigning the ssh + shell session I had on the machine while running these tests to the second set restores responsiveness to that too; it appears that all the compression happens in set 0.

Regards,
Tristan.

Tristan Ball wrote:
> Is the system otherwise responsive during the zfs sync cycles?
>
> I ask because I think I'm seeing a similar thing - except that it's not
> only other writers that block, it seems like other interrupts are
> blocked. Pinging my zfs server in 1s intervals results in large delays
> while the system syncs, followed by normal response times while the
> system buffers more input...
>
> [remainder of the quoted thread snipped]
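For anyone wanting to try the processor-set workaround described above, a minimal sketch; the CPU IDs and the resulting set id are illustrative, so check psrinfo for the actual layout:

  # Create a second processor set from two CPUs (IDs are examples),
  # then bind the interactive shell to it so it is not starved by
  # compression running on the default set.
  psrset -c 2 3      # prints the id of the new set, e.g. 1
  psrset -b 1 $$     # bind this shell to set 1 (id taken from the line above)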
On Fri, 3 Jul 2009, Victor Latushkin wrote:
> On 02.07.09 22:05, Bob Friesenhahn wrote:
>> On Thu, 2 Jul 2009, Zhu, Lejun wrote:
>>
>>> Actually it seems to be 3/4:
>>
>> 3/4 is an awful lot. That would be 15 GB on my system, which explains
>> why the "5 seconds to write" rule is dominant.
>
> 3/4 is 1/8 * 6, where 6 is the worst-case inflation factor (for raid-z2 it
> is actually 9, and considering a ganged 1k block on raid-z2 in a really bad
> case it should be even bigger than that). DSL does inflate write sizes too,
> so inflated write sizes are compared against an inflated limit, so it
> should be fine.

But blocking read I/O for several seconds is not so fine.

There are various amounts of buffering and caching in the write pipeline, which suggests that there is a certain amount of write data which the pipeline handles efficiently. Once the buffers and caches fill, and the disks are maximally busy with write I/O, there is no further opportunity to do a read from the same disks for several seconds (up to five seconds). When a TXG is written, the system writes just as fast and hard as it can (for up to five seconds) without considering other requirements.

ZFS's asynchronous write caching is speculative: it hopes that the application will update the data just written several times, so that only the final version needs to be written and disk I/O and precious IOPS are saved. Unfortunately, not all applications work that way.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
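As a back-of-the-envelope check on the numbers above, assuming the base write limit is 1/8 of physical memory and that the box has roughly 20GB of RAM (the 20GB figure is inferred from the "15 GB" statement, not stated anywhere in the thread):

  # Rough arithmetic only; 20480 MB of physical memory is an assumption.
  echo $((20480 / 8))       # 2560 MB  -- 1/8 of memory, the base write limit
  echo $((20480 / 8 * 6))   # 15360 MB -- about 15 GB after 6x worst-case inflation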
On Sat, 4 Jul 2009, Tristan Ball wrote:
> Is the system otherwise responsive during the zfs sync cycles?
>
> I ask because I think I'm seeing a similar thing - except that it's not
> only other writers that block, it seems like other interrupts are
> blocked. Pinging my zfs server in 1s intervals results in large delays
> while the system syncs, followed by normal response times while the
> system buffers more input...

I don't see any such problems unless compression is enabled. When compression is enabled, the TXG sync causes definite response time issues in the system.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
> This causes me to believe that the algorithm is not
> implemented as described in Solaris 10.

I was all ready to write about my frustrations with this problem, but I upgraded to snv_117 last night to fix some iscsi bugs, and now it seems that the write throttling is working as described in that blog. If a process starts filling the ARC it is throttled, and the data is written at a nice constant rate using just about all of the disk bandwidth, without freezing the system every 5 seconds.

However, with gzip-1 compression the symptoms return, but for my system I think it's because the gzip compression is not multi-threaded? I'm only getting 50% utilization on a dual-core system. LZJB seems to work well though.

Is anyone aware of any bug fixes since 111b that would have helped to mitigate the freezing with the cache flushes?

-John
--
This message posted from opensolaris.org
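If it is single-threaded compression pinning one core, something along these lines should make it visible; the interval/count and the dataset name are placeholders, not taken from this thread:

  # Watch per-CPU system time during a txg sync; a single-threaded
  # compressor should show one CPU pegged in sys while the others idle.
  mpstat 1 10

  # The workaround discussed in this thread: fall back to the cheaper
  # compressor on the affected dataset ("tank/data" is a placeholder).
  zfs set compression=lzjb tank/data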
> I was all ready to write about my frustrations with
> this problem, but I upgraded to snv_117 last night to
> fix some iscsi bugs and now it seems that the write
> throttling is working as described in that blog.

I may have been a little premature. While everything is much improved for Samba and local disk operations (dd, cp) on snv_117, Comstar iSCSI writes still seem to incur this "write a bit, block, write a bit, block" pattern every 5 seconds.

On top of that, I am getting relatively poor iSCSI performance for some reason over a direct gigabit link with MTU=9000. I'm not sure what that is about yet.

-John
--
This message posted from opensolaris.org
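For the jumbo-frame link, a couple of quick sanity checks that might be worth running; the link name and target host below are placeholders:

  # Confirm the link really negotiated a 9000-byte MTU
  # ("e1000g1" stands in for the actual interface).
  dladm show-linkprop -p mtu e1000g1

  # Push ICMP payloads sized near the jumbo limit at the target
  # (Solaris ping arguments: host, data size, packet count;
  #  8972 = 9000 - 20 bytes IP - 8 bytes ICMP).
  ping -s iscsi-target 8972 10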