Hi, I was directed here after posting in CIFS discuss (as I first thought that it could be a CIFS problem). I posted the following in CIFS:

When using iometer from Windows to the file share on OpenSolaris snv_101 and snv_111 I get pauses every 5 seconds of around 5 seconds (maybe a little less) where no data is transferred. When data is transferred it is at a fair speed and gets around 1000-2000 IOPS with 1 thread (depending on the work type). The maximum read response time is 200ms and the maximum write response time is 9824ms, which is very bad: an almost 10 second delay in being able to send data to the server.

This has been experienced on 2 test servers. The same servers have also been tested with Windows Server 2008 and they haven't shown this problem (the share performance was slightly lower than CIFS, but it was consistent, and the average access time and maximums were very close).

I just noticed that if the server hasn't hit its target ARC size, the pauses are for maybe .5 seconds, but as soon as it hits its ARC target, the IOPS drop to around 50% of what they were and then there are the longer pauses of around 4-5 seconds, and after every pause the performance slows even more. So it appears it is definitely server side.

This is with 100% random IO with a spread of 33% write / 66% read, 2KB blocks, over a 50GB file, no compression, and a 5.5GB target ARC size.

Also, I have just run some tests with different IO patterns. 100% sequential writes produce a consistent 2100 IOPS, except when it pauses for maybe .5 seconds every 10-15 seconds.

100% random writes produce around 200 IOPS with a 4-6 second pause around every 10 seconds.

100% sequential reads produce around 3700 IOPS with no pauses, just random peaks in response time (only 16ms) after about 1 minute of running, so nothing to complain about.

100% random reads produce around 200 IOPS, with no pauses.

So it appears that writes cause a problem. What is causing these very long write delays?

A network capture shows that the server doesn't respond to the write from the client when these pauses occur.

Also, when using iometer, the initial file creation doesn't have any pauses, so it might only happen when modifying files.

Any help on finding a solution to this would be really appreciated.

David
--
This message posted from opensolaris.org
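(One simple way to confirm this from the server side while iometer runs is to watch the pool and the underlying disks at one-second intervals; a sketch only, with "tank" used as a placeholder pool name since the real pool name isn't given here:

zpool iostat tank 1
iostat -xn 1

If the client-side pauses line up with bursts of write activity and idle gaps in this output, the stall is in the storage stack rather than in CIFS or the network.)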
On Aug 27, 2009, at 4:30 AM, David Bond <david.bond at tag.no> wrote:

> Hi,
>
> I was directed here after posting in CIFS discuss (as I first thought
> that it could be a CIFS problem).
>
> [original report trimmed]
>
> Any help on finding a solution to this would be really appreciated.

What version? And system configuration?

I think it might be the issue where ZFS/ARC write caches more than the underlying storage can handle writing in a reasonable time.

There is a parameter to control how much is write cached, I believe it is zfs_write_limit_override.

-Ross
On Thu, 27 Aug 2009, David Bond wrote:

> I just noticed that if the server hasn't hit its target ARC size, the
> pauses are for maybe .5 seconds, but as soon as it hits its ARC target,
> the IOPS drop to around 50% of what they were and then there are the
> longer pauses of around 4-5 seconds, and after every pause the
> performance slows even more. So it appears it is definitely server side.

This is known behavior of zfs for asynchronous writes. Recent zfs defers/aggregates writes up to one of these limits:

* 7/8ths of available RAM
* 5 seconds worth of write I/O (full speed write)
* 30 seconds aggregation time

Notice the 5 seconds. This 5 seconds results in the 4-6 second pause, and it seems that the aggregation time is 10 seconds on your system with this write load. Systems with large amounts of RAM encounter this issue more than systems with limited RAM.

I encountered the same problem so I put this in /etc/system:

* Set ZFS maximum TXG group size to 3932160000
set zfs:zfs_write_limit_override = 0xea600000

By limiting the TXG group size, the size of the data burst is limited, but since zfs still writes the TXG as fast as it can, other I/O will cease during that time.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
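(For experimenting without a reboot, the same variable can, as far as I know, be poked on a live kernel with mdb; a sketch only, and the 512MB value (0x20000000) is just an example, not something recommended in the thread:

echo 'zfs_write_limit_override/Z 0x20000000' | mdb -kw
echo 'zfs_write_limit_override/J' | mdb -k

The first line writes the new limit, the second reads it back, so different sizes can be tried while the iometer run is in progress.)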
Hi David,

Just wanted to ask you: how does your Windows server behave during these pauses? Are there any clients connected to it?

The issue you've described might be related to one I saw on my server, see here:
http://www.opensolaris.org/jive/thread.jspa?threadID=110013&tstart=0

I just wonder how Windows behaves during these pauses.

--
Roman Naumenko
roman at frontline.ca
--
This message posted from opensolaris.org
Ross Walker wrote:
> On Aug 27, 2009, at 4:30 AM, David Bond <david.bond at tag.no> wrote:
>
>> [original report trimmed]
>
> What version? And system configuration?
>
> I think it might be the issue where ZFS/ARC write caches more than the
> underlying storage can handle writing in a reasonable time.
>
> There is a parameter to control how much is write cached, I believe it
> is zfs_write_limit_override.

You should be able to disable the write throttle mechanism altogether with the undocumented zfs_no_write_throttle tunable.

I never got around to testing this though ...

> -Ross

--
Med venlig hilsen / Best Regards

Henrik Johansen
henrik at scannet.dk
Tlf. 75 53 35 00

ScanNet Group
A/S ScanNet
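(For anyone wanting to try that, the /etc/system form would presumably follow the same pattern as the write-limit example earlier in the thread; a sketch only, the tunable is undocumented and untested here, and I'm assuming a non-zero value disables the throttle:

* Disable the ZFS write throttle (undocumented, untested)
set zfs:zfs_no_write_throttle = 1

Note that turning the throttle off entirely may make the write bursts larger rather than smoother, so the write-limit override is probably the safer experiment.)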
I saw similar behavior when I was running under the kernel debugger (the -k switch to the kernel). It largely went away when I went back to "normal".

T

David Bond wrote:
> Hi,
>
> I was directed here after posting in CIFS discuss (as I first thought
> that it could be a CIFS problem).
>
> [original report trimmed]
On Aug 27, 2009, at 11:29 AM, Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:

> This is known behavior of zfs for asynchronous writes. Recent zfs
> defers/aggregates writes up to one of these limits:
>
> * 7/8ths of available RAM
> * 5 seconds worth of write I/O (full speed write)
> * 30 seconds aggregation time
>
> [...]
>
> I encountered the same problem so I put this in /etc/system:
>
> * Set ZFS maximum TXG group size to 3932160000
> set zfs:zfs_write_limit_override = 0xea600000

That's the option.

When I was experiencing my writes starving reads I set this to 512MB, the size of the NVRAM cache on my controller, and everything was happy again. Write flushes happened in less than a second and my IO flattened out nicely.

-Ross
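(For reference, a 512MB limit in the /etc/system form Bob showed would presumably look like the line below; the value is just 512 * 1024 * 1024 expressed in hex, sized to match the controller's NVRAM write-back cache:

* Limit TXG size to 512MB to match the controller NVRAM cache
set zfs:zfs_write_limit_override = 0x20000000
)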
Hi, this happens on OpenSolaris builds 101b and 111b. The ARC max is set to 6GB, the server is joined to a Windows 2003 R2 AD domain, and the pool is 4 15Krpm drives in a 2-way mirror. The bnx driver has been changed to have offloading enabled. Not much else has been changed.

Ok, so when the cache fills and needs to be flushed, the flush locks access to it, so no reads or writes can occur from cache, and as everything goes through the ARC, nothing can happen until the ARC has finished its flush?

And to compensate for this, I would have to reduce the cache size to one that is small enough that the disk array can write it at such a speed that the pauses are reduced to ones that are not really noticeable. Wouldn't that then impact the overall burst write performance also?

Why doesn't the ARC allow writes while flushing? Or just have 2 caches so that one can keep taking writes while the other flushes. If it allowed writes to the buffer while it was flushing, it would just reduce the write speed down to what the disks can handle, wouldn't it?

Anyway, thanks for the info. I will give that parameter a go and see how it works.

Thanks
--
This message posted from opensolaris.org
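(For completeness, a 6GB ARC cap of the kind David describes is normally set in /etc/system along the same lines as the other tunables in this thread; a sketch, assuming the usual zfs_arc_max tunable, with the value being 6 * 1024^3 bytes in hex:

* Cap the ARC at 6GB
set zfs:zfs_arc_max = 0x180000000
)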
Ok, so by limiting the write cache to that of the controller you were able to remove the pauses? How did that affect your overall write performance, if at all?

Thanks, I will give that a go.

David
--
This message posted from opensolaris.org
I don't have any Windows machine connected to it over iSCSI (yet). My reference to the Windows servers was that the same hardware running Windows doesn't show these problems on reads and writes, so it isn't the hardware causing it. But when I do eventually get iSCSI going I will send a message if I have the same problems.

Also, with your replication, what's the performance like? Does having it enabled impact the overall write performance of your server? Is the replication continuous?

David
--
This message posted from opensolaris.org
On Sat, 29 Aug 2009, David Bond wrote:

> Ok, so when the cache fills and needs to be flushed, the flush locks
> access to it, so no reads or writes can occur from cache, and as
> everything goes through the ARC, nothing can happen until the ARC has
> finished its flush.

It has not been proven that reads from the ARC stop. It is clear that reads from physical disk temporarily stop. It is not clear (to me) if reads from physical disk stop because of the huge number of TXG sync write operations (up to 5 seconds worth) which are queued prior to the read request, or if reads are intentionally blocked due to some sort of coherency management.

> And to compensate for this, I would have to reduce the cache size to
> one that is small enough that the disk array can write it at such a
> speed that the pauses are reduced to ones that are not really
> noticeable.

That would work. There is likely to be more total physical I/O though, since delaying the writes tends to eliminate many redundant writes. For example, an application which re-writes the same file over and over again would be sending more of that data to physical disk.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
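(One way to see whether reads stop at the physical disks, as opposed to in the ARC, is to watch per-device statistics while the burst is being written; a sketch:

iostat -xn 1

If the r/s column for the pool's disks drops to zero exactly while w/s spikes, then reads that miss the ARC are being starved by the TXG sync at the device level.)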
"100% random writes produce around 200 IOPS with a 4-6 second pause around every 10 seconds. " This indicates that the bandwidth you''re able to transfer through the protocol is about 50% greater than the bandwidth the pool can offer to ZFS. Since, this is is not sustainable, you see here ZFS trying to balance the 2 numbers. -r David Bond writes: > Hi, > > I was directed here after posting in CIFS discuss (as i first thought that it could be a CIFS problem). > > I posted the following in CIFS: > > When using iometer from windows to the file share on opensolaris > svn101 and svn111 I get pauses every 5 seconds of around 5 seconds > (maybe a little less) where no data is transfered, when data is > transfered it is at a fair speed and gets around 1000-2000 iops with 1 > thread (depending on the work type). The maximum read response time is > 200ms and the maximum write response time is 9824ms, which is very > bad, an almost 10 seconds delay in being able to send data to the > server. > This has been experienced on 2 test servers, the same servers have > also been tested with windows server 2008 and they havent shown this > problem (the share performance was slightly lower than CIFS, but it > was consistent, and the average access time and maximums were very > close. > > > I just noticed that if the server hasnt hit its target arc size, the > pauses are for maybe .5 seconds, but as soon as it hits its arc > target, the iops drop to around 50% of what it was and then there are > the longer pauses around 4-5 seconds. and then after every pause the > performance slows even more. So it appears it is definately server > side. > > This is with 100% random io with a spread of 33% write 66% read, 2KB > blocks. over a 50GB file, no compression, and a 5.5GB target arc > size. > > > > Also I have just ran some tests with different IO patterns and 100 > sequencial writes produce and consistent IO of 2100IOPS, except when > it pauses for maybe .5 seconds every 10 - 15 seconds. > > 100% random writes produce around 200 IOPS with a 4-6 second pause > around every 10 seconds. > > 100% sequencial reads produce around 3700IOPS with no pauses, just > random peaks in response time (only 16ms) after about 1 minute of > running, so nothing to complain about. > > 100% random reads produce around 200IOPS, with no pauses. > > So it appears that writes cause a problem, what is causing these very > long write delays? > > A network capture shows that the server doesnt respond to the write > from the client when these pauses occur. > > Also, when using iometer, the initial file creation doesnt have and > pauses in the creation, so it might only happen when modifying > files. > > Any help on finding a solution to this would be really appriciated. > > David > -- > This message posted from opensolaris.org > _______________________________________________ > zfs-discuss mailing list > zfs-discuss at opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Roch Bourbonnais wrote:
> "100% random writes produce around 200 IOPS with a 4-6 second pause
> around every 10 seconds."
>
> This indicates that the bandwidth you're able to transfer through the
> protocol is about 50% greater than the bandwidth the pool can offer to
> ZFS. Since this is not sustainable, you see here ZFS trying to balance
> the 2 numbers.

When I have tested using 50% reads, 60% random using iometer over NFS, I can see the data going straight to disk due to the sync nature of NFS. But I also see writes coming to a standstill every 10 seconds or so, which I have attributed to the ZIL dumping to disk. Therefore I conclude that it is the process of dumping the ZIL to disk that (mostly?) blocks writes during the dumping.

I do agree with Bob and others who suggest that making the size of the dump smaller will mask this behavior, and that seems like a good idea, although I have not yet tried and tested it myself.

-Scott
--
This message posted from opensolaris.org
On 09/04/09 09:54, Scott Meilicke wrote:

> When I have tested using 50% reads, 60% random using iometer over NFS,
> I can see the data going straight to disk due to the sync nature of
> NFS. But I also see writes coming to a standstill every 10 seconds or
> so, which I have attributed to the ZIL dumping to disk. Therefore I
> conclude that it is the process of dumping the ZIL to disk that
> (mostly?) blocks writes during the dumping.

The ZIL does not work like that. It is not a journal.

Under a typical write load, write transactions are batched and written out in a group transaction (txg). This txg sync occurs every 30s under light load but more frequently or continuously under heavy load.

When writing synchronous data (eg NFS) the transactions get written immediately to the intent log and are made stable. When the txg later commits, the intent log blocks containing those committed transactions can be freed.

So as you can see, there is no periodic dumping of the ZIL to disk. What you are probably observing is the periodic txg commit.

Hope that helps: Neil.
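(If you want to check that the stalls line up with the txg commit Neil describes, one way is to time spa_sync() in the kernel with DTrace; a rough sketch, not verified on the exact builds discussed here:

dtrace -n 'fbt:zfs:spa_sync:entry { self->ts = timestamp; }
           fbt:zfs:spa_sync:return /self->ts/ {
               printf("spa_sync took %d ms", (timestamp - self->ts) / 1000000);
               self->ts = 0;
           }'

If each client-side pause coincides with a multi-second spa_sync, the culprit is the txg commit rather than the ZIL.)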
So what happens during the txg commit?

For example, if the ZIL is a separate device, SSD for this example, does it not work like:

1. A sync operation commits the data to the SSD
2. A txg commit happens, and the data from the SSD are written to the spinning disk

So this is two writes, correct?

-Scott
--
This message posted from opensolaris.org
On Fri, 4 Sep 2009, Scott Meilicke wrote:

> So what happens during the txg commit?
>
> For example, if the ZIL is a separate device, SSD for this example,
> does it not work like:
>
> 1. A sync operation commits the data to the SSD
> 2. A txg commit happens, and the data from the SSD are written to the
>    spinning disk
>
> So this is two writes, correct?

From past descriptions, the slog is basically a list of pending write system calls. The only time the slog is read is after a reboot. Otherwise, the slog is simply updated as write operations proceed.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
I am still not buying it :) I need to research this to satisfy myself.

I can understand that the writes come from memory to disk during a txg write for async, and that is the behavior I see in testing.

But for sync, data must be committed, and an SSD/ZIL makes that faster because you are writing to the SSD/ZIL, and not to spinning disk. Eventually that data on the SSD must get to spinning disk.

To the books I go!

-Scott
--
This message posted from opensolaris.org
Scott Meilicke wrote:
> So what happens during the txg commit?
>
> For example, if the ZIL is a separate device, SSD for this example,
> does it not work like:
>
> 1. A sync operation commits the data to the SSD
> 2. A txg commit happens, and the data from the SSD are written to the
>    spinning disk

#1 is correct. #2 is incorrect. The TXG commit goes from memory into the main pool. The SSD data is simply left there in case something bad happens before the TXG commit succeeds. Once it succeeds, then the SSD data can be overwritten.

The only time you need to read from a ZIL device is if a crash occurs and you need those blocks to repair the pool.

Eric
Doh! I knew that, but then forgot...

So, for the case of no separate device for the ZIL, the ZIL lives on the disk pool. In which case, the data are written to the pool twice during a sync:

1. To the ZIL (on disk)
2. From RAM to disk during the txg commit

If this is correct (and my history in this thread is not so good, so...), would that then explain some sort of pulsing write behavior for sync write operations?
--
This message posted from opensolaris.org
So, I just re-read the thread, and you can forget my last post. I had thought the argument was that the data were not being written to disk twice (assuming no separate device for the ZIL), but it was just explained to me that the data are not read from the ZIL and written to disk, but rather written from memory to disk. I need more coffee...
--
This message posted from opensolaris.org
Scott Meilicke wrote:
> I am still not buying it :) I need to research this to satisfy myself.
>
> I can understand that the writes come from memory to disk during a txg
> write for async, and that is the behavior I see in testing.
>
> But for sync, data must be committed, and an SSD/ZIL makes that faster
> because you are writing to the SSD/ZIL, and not to spinning disk.
> Eventually that data on the SSD must get to spinning disk.

But the txg (which may contain more data than just the sync data that was written to the ZIL) is still written from memory. Just because the sync data was written to the ZIL doesn't mean it's not still in memory.

-Kyle

> To the books I go!
>
> -Scott
On Sep 4, 2009, at 2:22 PM, Scott Meilicke <scott.meilicke at craneaerospace.com> wrote:

> So, I just re-read the thread, and you can forget my last post. I had
> thought the argument was that the data were not being written to disk
> twice (assuming no separate device for the ZIL), but it was just
> explained to me that the data are not read from the ZIL and written to
> disk, but rather written from memory to disk. I need more coffee...

I think you're confusing ARC write-back with the ZIL, and it isn't the sync writes that are blocking IO, it's the async writes that have been cached and are now being flushed.

Just tell the ARC to cache less IO for your hardware with the kernel config Bob mentioned way back.

-Ross
Yes, I was getting confused. Thanks to you (and everyone else) for clarifying.

Sync or async, I see the txg flushing to disk starve read IO.

Scott
--
This message posted from opensolaris.org
On Sep 4, 2009, at 4:33 PM, Scott Meilicke <scott.meilicke at craneaerospace.com> wrote:

> Yes, I was getting confused. Thanks to you (and everyone else) for
> clarifying.
>
> Sync or async, I see the txg flushing to disk starve read IO.

Well, try the kernel setting and see how it helps.

Honestly though, if you can say it's all sync writes with certainty and IO is still blocking, you need a better storage sub-system, or an additional pool.

-Ross
I only see the blocking while load testing, not during regular usage, so I am not so worried. I will try the kernel settings to see if that helps if/when I see the issue in production.

For what it is worth, here is the pattern I see when load testing NFS (iometer, 60% random, 65% read, 8k chunks, 32 outstanding I/Os):

data01      59.6G  20.4T     46     24   757K  3.09M
data01      59.6G  20.4T     39     24   593K  3.09M
data01      59.6G  20.4T     45     25   687K  3.22M
data01      59.6G  20.4T     45     23   683K  2.97M
data01      59.6G  20.4T     33     23   492K  2.97M
data01      59.6G  20.4T     16     41   214K  1.71M
data01      59.6G  20.4T      3  2.36K  53.4K  30.4M
data01      59.6G  20.4T      1  2.23K  20.3K  29.2M
data01      59.6G  20.4T      0  2.24K  30.2K  28.9M
data01      59.6G  20.4T      0  1.93K  30.2K  25.1M
data01      59.6G  20.4T      0  2.22K      0  28.4M
data01      59.7G  20.4T     21    295   317K  4.48M
data01      59.7G  20.4T     32     12   495K  1.61M
data01      59.7G  20.4T     35     25   515K  3.22M
data01      59.7G  20.4T     36     11   522K  1.49M
data01      59.7G  20.4T     33     24   508K  3.09M

LSI SAS HBA, 3 x 5 disk raidz, Dell 2950, 16GB RAM.

-Scott
--
This message posted from opensolaris.org
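(That output looks like one-second zpool iostat samples, i.e. presumably something along the lines of:

zpool iostat data01 1

The columns are pool, allocated, free, read ops/s, write ops/s, read bandwidth and write bandwidth. The run of rows with ~2.2K write ops and reads pinned at 0-3 is presumably the txg commit discussed above starving the reads.)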
On Fri, 4 Sep 2009, Scott Meilicke wrote:

> I only see the blocking while load testing, not during regular usage,
> so I am not so worried. I will try the kernel settings to see if that
> helps if/when I see the issue in production.

The flipside of the "pulsing" is that the deferred writes diminish contention for precious read IOPS, and quite a few programs have a habit of updating/rewriting a file over and over again. If the file is completely asynchronously rewritten once per second and zfs writes a transaction group every 30 seconds, then 29 of those updates avoided consuming write IOPS. Another benefit is that if zfs has more data in hand to write, then it can do a much better job of avoiding fragmentation, avoid unnecessary COW by diminishing short tail writes, and achieve more optimum write patterns.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
On Sep 4, 2009, at 6:33 PM, Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:

> The flipside of the "pulsing" is that the deferred writes diminish
> contention for precious read IOPS, and quite a few programs have a
> habit of updating/rewriting a file over and over again. If the file is
> completely asynchronously rewritten once per second and zfs writes a
> transaction group every 30 seconds, then 29 of those updates avoided
> consuming write IOPS. Another benefit is that if zfs has more data in
> hand to write, then it can do a much better job of avoiding
> fragmentation, avoid unnecessary COW by diminishing short tail writes,
> and achieve more optimum write patterns.

I guess one can find a silver lining in any grey cloud, but for myself I'd just rather see a more linear approach to writes. Anyway, I have never seen any reads happen during these write flushes.

-Ross
On Sep 4, 2009, at 5:25 PM, Scott Meilicke <scott.meilicke at craneaerospace.com> wrote:

> I only see the blocking while load testing, not during regular usage,
> so I am not so worried. I will try the kernel settings to see if that
> helps if/when I see the issue in production.
>
> For what it is worth, here is the pattern I see when load testing NFS
> (iometer, 60% random, 65% read, 8k chunks, 32 outstanding I/Os):
>
> [zpool iostat output trimmed]
>
> LSI SAS HBA, 3 x 5 disk raidz, Dell 2950, 16GB RAM.

With that setup you'll see at most 3x the IOPS of the type of disks used, which is not really the kind of setup for a 60% random workload. Assuming 2TB SATA drives, the max would be around 240 IOPS. Now if it were mirror vdevs you'd get 7x, or 560 IOPS.

Is this for VMware or data warehousing?

You'll also need an SSD drive in the mix if you're not using a controller with NVRAM write-back, especially when sharing over NFS.

I guess since it's 15 drives it's an MD1000. I might have gone with the newer 2.5" drive enclosure as it holds 24 drives over 15, and most SSDs come in 2.5". Since you've got it already, invest in a PERC 6/E with 512MB of cache and stick it in the other PCIe 8x slot.

-Ross
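(The arithmetic behind those estimates, assuming roughly 80 random IOPS per 7,200 rpm SATA drive, which is the figure the 240 number implies:

3 x 5-disk raidz vdevs        ->  3 vdevs * ~80 IOPS  ~= 240 IOPS
7 x 2-way mirrors (+1 spare)  ->  7 vdevs * ~80 IOPS  ~= 560 IOPS

Each raidz vdev behaves like a single disk for small random I/O, so it is the vdev count, not the disk count, that sets the random IOPS ceiling.)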
On Fri, 4 Sep 2009, Ross Walker wrote:

> I guess one can find a silver lining in any grey cloud, but for myself
> I'd just rather see a more linear approach to writes. Anyway, I have
> never seen any reads happen during these write flushes.

I have yet to see a read happen during the write flush either. That impacts my application since it needs to read in order to proceed, and it does a similar amount of writes as it does reads.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
On Sep 4, 2009, at 8:59 PM, Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:

> I have yet to see a read happen during the write flush either. That
> impacts my application since it needs to read in order to proceed, and
> it does a similar amount of writes as it does reads.

The ARC makes it hard to tell if they are satisfied from cache or blocked due to writes.

I suppose if you have the hardware to go sync, that might be the best bet. That and limiting the write cache.

Though I have only heard good comments from my ESX admins since moving the VMs off iSCSI and on to ZFS over NFS, so it can't be that bad.

-Ross
On Fri, 4 Sep 2009, Ross Walker wrote:

> > I have yet to see a read happen during the write flush either. That
> > impacts my application since it needs to read in order to proceed,
> > and it does a similar amount of writes as it does reads.
>
> The ARC makes it hard to tell if they are satisfied from cache or
> blocked due to writes.

The existing prefetch bug makes it doubly hard. :-)

First I complained about the blocking reads, and then I complained about the blocking writes (presumed responsible for the blocking reads), and now I am waiting for working prefetch in order to feed my hungry application.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
On Sep 4, 2009, at 21:44, Ross Walker wrote:

> Though I have only heard good comments from my ESX admins since moving
> the VMs off iSCSI and on to ZFS over NFS, so it can't be that bad.

What's your pool configuration? Striped mirrors? RAID-Z with SSDs? Other?
On Sep 4, 2009, at 10:02 PM, David Magda <dmagda at ee.ryerson.ca> wrote:

> What's your pool configuration? Striped mirrors? RAID-Z with SSDs?
> Other?

Striped mirrors off an NVRAM-backed controller (Dell PERC 6/E).

RAID-Z isn't the best for many VMs as the whole vdev acts as a single disk for random IO.

-Ross
True, this setup is not designed for high random I/O, but rather lots of storage with fair performance. This box is for our dev/test backend storage. Our production VI runs in the 500-700 IOPS range (80+ VMs, production plus dev/test) on average, so for our development VI we are expecting half of that at most, on average. Testing with parameters that match the observed behavior of the production VI gets us about 750 IOPS with compression (NFS, 2009.06), so I am happy with the performance and very happy with the amount of available space.

Striped mirrors are much faster, ~2200 IOPS with 16 disks (but alas, tested with iSCSI on 2008.11, compression on; we got about 1,000 IOPS with the 3x5 raidz setup with compression, to compare iSCSI on 2008.11 vs NFS on 2009.06), but again we are shooting for available space, with performance being a secondary goal. And yes, we would likely get much better performance using SSDs for the ZIL and L2ARC.

This has been an interesting thread! Sorry for the bit of hijacking...
--
This message posted from opensolaris.org
I'm playing around with a home raidz2 install and I can see this pulsing as well. The only difference is I have 6 external USB drives with activity lights on them, so I can see what's actually being written to the disk and when :)

What I see is about 8 second pauses while data is being sent over the network into what appears to be some sort of in-memory cache. Then the cache is flushed to disk and the drives all spring into life, and the network activity drops to zero. After a few seconds of writing, the drives stop and the whole process begins again.

Something funny is going on there...
--
This message posted from opensolaris.org