An HTML attachment was scrubbed...
URL: <http://mail.opensolaris.org/pipermail/zfs-code/attachments/20080722/e54f3bd7/attachment.html>
ZFS is designed to "sync" a transaction group about every 5 seconds under normal work loads. So your system looks to be operating as designed. Is there some specific reason why you need to reduce this interval? In general, this is a bad idea, as there is somewhat of a "fixed overhead" associated with each sync, so increasing the sync frequency could result in increased IO. -Mark Tharindu Rukshan Bamunuarachchi wrote:> Dear ZFS Gurus, > > We are developing low latency transaction processing systems for stock > exchanges. > Low latency high performance file system is critical component of our > trading systems. > > We have choose ZFS as our primary file system. > But we saw periodical disk write peaks every 4-5 second. > > Please refer first column of below output. (marked in bold) > Output is generated from our own Disk performance measuring tool. i.e > DTool (please find attachment) > > Compared UFS/VxFS , ZFS is performing very well, but we could not > minimize periodical peaks. > We used autoup and tune_r_fsflush flags for UFS tuning. > > Are there any ZFS specific tuning, which will reduce file system flush > interval of ZFS. > > I have tried all parameters specified in "solarisinternals" and google.com. > I would like to go for ZFS code change/recompile if necessary. > > Please advice. > > Cheers > Tharindu > > > > cpu4600-100 /tantan >./*DTool -f M -s 1000 -r 10000 -i 1 -W* > System Tick = 100 usecs > Clock resolution 10 > HR Timer created for 100usecs > z_FileName = M > i_Rate = 10000 > l_BlockSize = 1000 > i_SyncInterval = 0 > l_TickInterval = 100 > i_TicksPerIO = 1 > i_NumOfIOsPerSlot = 1 > Max (us)| Min (us) | Avg (us) | MB/S | File > Freq Distribution > 336 | 4 | 10.5635 | 4.7688 | M > 50(98.55), 200(1.09), 500(0.36), 2000(0.00), 5000(0.00), 10000(0.00), > 100000(0.00), 200000(0.00), > *1911 * | 4 | 10.3152 | 9.4822 | M > 50(98.90), 200(0.77), 500(0.32), 2000(0.01), 5000(0.00), 10000(0.00), > 100000(0.00), 200000(0.00), > 307 | 4 | 9.9386 | 9.5324 | M > 50(99.03), 200(0.66), 500(0.31), 2000(0.00), 5000(0.00), 10000(0.00), > 100000(0.00), 200000(0.00), > 331 | 4 | 9.9465 | 9.5332 | M > 50(99.04), 200(0.72), 500(0.24), 2000(0.00), 5000(0.00), 10000(0.00), > 100000(0.00), 200000(0.00), > 318 | 4 | 10.1241 | 9.5309 | M > 50(99.07), 200(0.66), 500(0.27), 2000(0.00), 5000(0.00), 10000(0.00), > 100000(0.00), 200000(0.00), > 303 | 4 | 9.9236 | 9.5296 | M > 50(99.13), 200(0.59), 500(0.28), 2000(0.00), 5000(0.00), 10000(0.00), > 100000(0.00), 200000(0.00), > 560 | 4 | 10.2604 | 9.4565 | M > 50(98.82), 200(0.86), 500(0.31), 2000(0.01), 5000(0.00), 10000(0.00), > 100000(0.00), 200000(0.00), > 376 | 4 | 9.9975 | 9.5176 | M > 50(99.05), 200(0.63), 500(0.32), 2000(0.00), 5000(0.00), 10000(0.00), > 100000(0.00), 200000(0.00), > *9783 * | 4 | 10.8216 | 9.5301 | M > 50(99.05), 200(0.58), 500(0.36), 2000(0.00), 5000(0.00), 10000(0.01), > 100000(0.00), 200000(0.00), > 332 | 4 | 9.9345 | 9.5252 | M > 50(99.06), 200(0.61), 500(0.33), 2000(0.00), 5000(0.00), 10000(0.00), > 100000(0.00), 200000(0.00), > 355 | 4 | 9.9906 | 9.5315 | M > 50(99.01), 200(0.69), 500(0.30), 2000(0.00), 5000(0.00), 10000(0.00), > 100000(0.00), 200000(0.00), > 356 | 4 | 10.2341 | 9.5207 | M > 50(98.96), 200(0.76), 500(0.28), 2000(0.00), 5000(0.00), 10000(0.00), > 100000(0.00), 200000(0.00), > 320 | 4 | 9.8893 | 9.5279 | M > 50(99.10), 200(0.59), 500(0.31), 2000(0.00), 5000(0.00), 10000(0.00), > 100000(0.00), 200000(0.00), > *10005* | 4 | 10.8956 | 9.5258 | M > 50(99.07), 200(0.63), 500(0.29), 2000(0.00), 5000(0.00), 10000(0.00), 
> 100000(0.01), 200000(0.00), > 308 | 4 | 9.8417 | 9.5312 | M > 50(99.07), 200(0.64), 500(0.29), 2000(0.00), 5000(0.00), 10000(0.00), > 100000(0.00), 200000(0.00),
Tharindu Rukshan Bamunuarachchi
2008-Jul-23 05:35 UTC
[zfs-discuss] [zfs-code] Peak every 4-5 second
Dear Mark/All,

Our trading system writes to a local and/or array volume at 10k messages per second. Each message is about 700 bytes in size.

Before ZFS, we used UFS. Even with UFS, there was a peak every 5 seconds due to fsflush invocation. Each peak is about ~5 ms, and our application cannot recover from such high latency. So we used several tuning parameters (tune_r_* and autoup) to decrease the flush interval. As a result the peaks came down to ~1.5 ms, but that is still too high for our application.

I believe that if we could reduce the ZFS sync interval to ~1 s, the peaks would come down to ~1 ms or less. We would rather have <1 ms peaks every second than a 5 ms peak every 5 seconds :-)

Is there any tunable I can use to reduce the ZFS sync interval? If there is no tunable, can I use "mdb" for the job? This is not a general-purpose deployment, and we are OK with the increased I/O rate.

Please advise/help. Thanks in advance.

tharindu

Mark Maybee wrote:> ZFS is designed to "sync" a transaction group about every 5 seconds > under normal work loads. So your system looks to be operating as > designed. Is there some specific reason why you need to reduce this > interval? In general, this is a bad idea, as there is somewhat of a > "fixed overhead" associated with each sync, so increasing the sync > frequency could result in increased IO. > > -Mark > > Tharindu Rukshan Bamunuarachchi wrote: >> Dear ZFS Gurus, >> >> We are developing low latency transaction processing systems for >> stock exchanges. >> Low latency high performance file system is critical component of our >> trading systems. >> >> We have choose ZFS as our primary file system. >> But we saw periodical disk write peaks every 4-5 second. >> >> Please refer first column of below output. (marked in bold) >> Output is generated from our own Disk performance measuring tool. i.e >> DTool (please find attachment) >> >> Compared UFS/VxFS , ZFS is performing very well, but we could not >> minimize periodical peaks. >> We used autoup and tune_r_fsflush flags for UFS tuning. >> >> Are there any ZFS specific tuning, which will reduce file system >> flush interval of ZFS. >> >> I have tried all parameters specified in "solarisinternals" and >> google.com. >> I would like to go for ZFS code change/recompile if necessary. >> >> Please advice.
>> >> Cheers >> Tharindu >> >> >> >> cpu4600-100 /tantan >./*DTool -f M -s 1000 -r 10000 -i 1 -W* >> System Tick = 100 usecs >> Clock resolution 10 >> HR Timer created for 100usecs >> z_FileName = M >> i_Rate = 10000 >> l_BlockSize = 1000 >> i_SyncInterval = 0 >> l_TickInterval = 100 >> i_TicksPerIO = 1 >> i_NumOfIOsPerSlot = 1 >> Max (us)| Min (us) | Avg (us) | MB/S | >> File Freq Distribution >> 336 | 4 | 10.5635 | 4.7688 | M >> 50(98.55), 200(1.09), 500(0.36), 2000(0.00), 5000(0.00), 10000(0.00), >> 100000(0.00), 200000(0.00), >> *1911 * | 4 | 10.3152 | 9.4822 | M >> 50(98.90), 200(0.77), 500(0.32), 2000(0.01), 5000(0.00), 10000(0.00), >> 100000(0.00), 200000(0.00), >> 307 | 4 | 9.9386 | 9.5324 | M >> 50(99.03), 200(0.66), 500(0.31), 2000(0.00), 5000(0.00), 10000(0.00), >> 100000(0.00), 200000(0.00), >> 331 | 4 | 9.9465 | 9.5332 | M >> 50(99.04), 200(0.72), 500(0.24), 2000(0.00), 5000(0.00), 10000(0.00), >> 100000(0.00), 200000(0.00), >> 318 | 4 | 10.1241 | 9.5309 | M >> 50(99.07), 200(0.66), 500(0.27), 2000(0.00), 5000(0.00), 10000(0.00), >> 100000(0.00), 200000(0.00), >> 303 | 4 | 9.9236 | 9.5296 | M >> 50(99.13), 200(0.59), 500(0.28), 2000(0.00), 5000(0.00), 10000(0.00), >> 100000(0.00), 200000(0.00), >> 560 | 4 | 10.2604 | 9.4565 | M >> 50(98.82), 200(0.86), 500(0.31), 2000(0.01), 5000(0.00), 10000(0.00), >> 100000(0.00), 200000(0.00), >> 376 | 4 | 9.9975 | 9.5176 | M >> 50(99.05), 200(0.63), 500(0.32), 2000(0.00), 5000(0.00), 10000(0.00), >> 100000(0.00), 200000(0.00), >> *9783 * | 4 | 10.8216 | 9.5301 | M >> 50(99.05), 200(0.58), 500(0.36), 2000(0.00), 5000(0.00), 10000(0.01), >> 100000(0.00), 200000(0.00), >> 332 | 4 | 9.9345 | 9.5252 | M >> 50(99.06), 200(0.61), 500(0.33), 2000(0.00), 5000(0.00), 10000(0.00), >> 100000(0.00), 200000(0.00), >> 355 | 4 | 9.9906 | 9.5315 | M >> 50(99.01), 200(0.69), 500(0.30), 2000(0.00), 5000(0.00), 10000(0.00), >> 100000(0.00), 200000(0.00), >> 356 | 4 | 10.2341 | 9.5207 | M >> 50(98.96), 200(0.76), 500(0.28), 2000(0.00), 5000(0.00), 10000(0.00), >> 100000(0.00), 200000(0.00), >> 320 | 4 | 9.8893 | 9.5279 | M >> 50(99.10), 200(0.59), 500(0.31), 2000(0.00), 5000(0.00), 10000(0.00), >> 100000(0.00), 200000(0.00), >> *10005* | 4 | 10.8956 | 9.5258 | M >> 50(99.07), 200(0.63), 500(0.29), 2000(0.00), 5000(0.00), 10000(0.00), >> 100000(0.01), 200000(0.00), >> 308 | 4 | 9.8417 | 9.5312 | M >> 50(99.07), 200(0.64), 500(0.29), 2000(0.00), 5000(0.00), 10000(0.00), >> 100000(0.00), 200000(0.00), >> >> >> ------------------------------------------------------------------------ >> >> ******************************************************************************************************************************************************************* >> >> >> "The information contained in this email including in any attachment >> is confidential and is meant to be read only by the person to whom it >> is addressed. If you are not the intended recipient(s), you are >> prohibited from printing, forwarding, saving or copying this email. >> If you have received this e-mail in error, please immediately notify >> the sender and delete this e-mail and its attachments from your >> computer." 
Hello Tharindu,

Tuesday, July 22, 2008, 5:56:58 PM, you wrote:

> Dear ZFS Gurus,
>
> We are developing low latency transaction processing systems for stock exchanges.
> Low latency high performance file system is critical component of our trading systems.
>
> We have choose ZFS as our primary file system.
> But we saw periodical disk write peaks every 4-5 second.
>
> Please refer first column of below output. (marked in bold)
> Output is generated from our own Disk performance measuring tool. i.e DTool (please find attachment)
>
> Compared UFS/VxFS , ZFS is performing very well, but we could not minimize periodical peaks.
> We used autoup and tune_r_fsflush flags for UFS tuning.
>
> Are there any ZFS specific tuning, which will reduce file system flush interval of ZFS.

You can tune it by changing txg_time via mdb or /etc/system. By default currently is set to 5 seconds.

However as Mark pointed out - first ask yourself why you think it is a problem for you.

--
Best regards,
 Robert Milkowski   mailto:milek@task.gda.pl
                    http://milek.blogspot.com
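For reference, a minimal sketch of the /etc/system form of the change Robert describes; it takes effect at the next boot. The symbol name is an assumption to verify against your kernel: it was txg_time on builds of that era and was renamed zfs_txg_timeout on later ones.

    * /etc/system -- shorten the ZFS transaction group sync interval
    * from the default 5 seconds to 1 second (symbol may be named
    * zfs_txg_timeout on newer builds)
    set zfs:txg_time = 1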
Hello Tharindu,

Wednesday, July 23, 2008, 6:35:33 AM, you wrote:

TRB> Dear Mark/All, TRB> Our trading system is writing to local and/or array volume at 10k TRB> messages per second. TRB> Each message is about 700bytes in size. TRB> Before ZFS, we used UFS. TRB> Even with UFS, there was evey 5 second peak due to fsflush invocation. TRB> However each peak is about ~5ms. TRB> Our application can not recover from such higher latency. TRB> So we used several tuning parameters (tune_r_* and autoup) to decrease TRB> the flush interval. TRB> As a result peaks came down to ~1.5ms. But it is still too high for our TRB> application. TRB> I believe, if we could reduce ZFS sync interval down to ~1s, peaks will TRB> be reduced to ~1ms or less. TRB> We like <1ms peaks per second than 5ms peak per 5 second :-) TRB> Are there any tunable, so i can reduce ZFS sync interval. TRB> If there is no any tunable, can not I use "mdb" for the job ...? TRB> This is not general and we are ok with increased I/O rate. TRB> Please advice/help.

txg_time/D

btw: 10,000 * 700 = ~7MB

What's your storage subsystem? Any, even small, raid device with write cache should help.

-- Best regards, Robert Milkowski mailto:milek at task.gda.pl http://milek.blogspot.com
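Robert's "txg_time/D" is an mdb expression for reading the current value; a rough sketch of using it on a live system, with the same caveat that the symbol may be named zfs_txg_timeout on newer builds, and that the change does not survive a reboot:

    # echo "txg_time/D" | mdb -k        (print the current interval, in decimal seconds)
    # echo "txg_time/W 0t1" | mdb -kw   (set it to 1 second on the running kernel)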
An HTML attachment was scrubbed...
URL: <http://mail.opensolaris.org/pipermail/zfs-code/attachments/20080723/124fec9e/attachment.html>
Tharindu Rukshan Bamunuarachchi
2008-Jul-23 09:03 UTC
[zfs-discuss] [zfs-code] Peak every 4-5 second
An HTML attachment was scrubbed...
URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20080723/bfd2f680/attachment.html>
Tharindu Rukshan Bamunuarachchi
2008-Jul-23 09:05 UTC
[zfs-discuss] [zfs-code] Peak every 4-5 second
An HTML attachment was scrubbed...
URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20080723/0a650762/attachment.html>
Frank.Hofmann at Sun.COM
2008-Jul-23 11:04 UTC
[zfs-discuss] [zfs-code] Peak every 4-5 second
On Wed, 23 Jul 2008, Tharindu Rukshan Bamunuarachchi wrote:> 10,000 x 700 = 7MB per second ...... > > We have this rate for whole day .... > > 10,000 orders per second is minimum requirments of modern day stock exchanges ... > > Cache still help us for ~1 hours, but after that who will help us ... > > We are using 2540 for current testing ... > I have tried same with 6140, but no significant improvement ... only one or two hours ...

It might not be exactly what you have in mind, but this "how do I get latency down at all costs" thing reminded me of this old paper: http://www.sun.com/blueprints/1000/layout.pdf I'm not a storage architect, someone with more experience in the area care to comment on this ? With huge disks as we have these days, the "wide thin" idea has gone under a bit - but how to replace such setups with modern arrays, if the workload is such that caches eventually must get blown and you're down to spindle speed ?

FrankH.

> Robert Milkowski wrote: > > Hello Tharindu, > > Wednesday, July 23, 2008, 6:35:33 AM, you wrote: > > TRB> Dear Mark/All, > > TRB> Our trading system is writing to local and/or array volume at 10k > TRB> messages per second. > TRB> Each message is about 700bytes in size. > > TRB> Before ZFS, we used UFS. > TRB> Even with UFS, there was evey 5 second peak due to fsflush invocation. > > TRB> However each peak is about ~5ms. > TRB> Our application can not recover from such higher latency. > > TRB> So we used several tuning parameters (tune_r_* and autoup) to decrease > TRB> the flush interval. > TRB> As a result peaks came down to ~1.5ms. But it is still too high for our > TRB> application. > > TRB> I believe, if we could reduce ZFS sync interval down to ~1s, peaks will > TRB> be reduced to ~1ms or less. > TRB> We like <1ms peaks per second than 5ms peak per 5 second :-) > > TRB> Are there any tunable, so i can reduce ZFS sync interval. > TRB> If there is no any tunable, can not I use "mdb" for the job ...? > > TRB> This is not general and we are ok with increased I/O rate. > TRB> Please advice/help. > > txg_time/D > > btw: > 10,000 * 700 = ~7MB > > What's your storage subsystem? Any, even small, raid device with write > cache should help. > > > > > >------------------------------------------------------------------------------ No good can come from selling your freedom, not for all the gold in the world, for the value of this heavenly gift far exceeds that of any fortune on earth. ------------------------------------------------------------------------------
On Wed, 23 Jul 2008, Tharindu Rukshan Bamunuarachchi wrote:> 10,000 x 700 = 7MB per second ...... > > We have this rate for whole day .... > > 10,000 orders per second is minimum requirments of modern day stock exchanges ... > > Cache still help us for ~1 hours, but after that who will help us ... > > We are using 2540 for current testing ... > I have tried same with 6140, but no significant improvement ... only one or two hours ...

Does your application request synchronous file writes or use fsync()? While normally fsync() slows performance, I think that it will also serve to even out the write response since ZFS will not be buffering lots of unwritten data. However, there may be buffered writes from other applications which get written periodically and which may delay the writes from your critical application. In this case reducing the ARC size may help so that the ZFS sync takes less time. You could also run a script which executes 'sync' every second or two in order to convince ZFS to cache less unwritten data. This will cause a bit of a performance hit for the whole system though.

Your 7MB per second is a very tiny write load so it is worthwhile investigating to see if there are other factors which are causing your storage system to not perform correctly. The 2540 is capable of supporting writes at hundreds of MB per second. As an example of "another factor", let's say that you used the 2540 to create 6 small LUNs and then put them into a ZFS zraid. However, in this case the 2540 allocated all of the LUNs from the same disk (which it is happy to do by default) so now that disk is being severely thrashed since it is one disk rather than six.

Bob
=====================================
Bob Friesenhahn bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
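If the application does not already force each message to stable storage, the behaviour Bob is asking about can be made explicit in the writer itself. A rough C sketch of the pattern (the file path and message size are placeholders, not details from this thread):

    /* Append fixed-size messages with per-write durability (O_DSYNC), so
     * each record is committed via the ZIL instead of waiting for the next
     * transaction group sync; calling fsync(fd) after each write is the
     * equivalent alternative. */
    #include <sys/types.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        char msg[700];
        int fd, i;

        (void) memset(msg, 'x', sizeof (msg));
        fd = open("/pool/trading/journal",
            O_WRONLY | O_CREAT | O_APPEND | O_DSYNC, 0644);
        if (fd < 0) {
            perror("open");
            return (1);
        }
        for (i = 0; i < 10000; i++) {
            if (write(fd, msg, sizeof (msg)) != (ssize_t)sizeof (msg)) {
                perror("write");
                break;
            }
        }
        (void) close(fd);
        return (0);
    }

Bob's alternative of a periodic 'sync' achieves a similar smoothing effect system-wide, at the cost of touching every dirty filesystem rather than just the journal file.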
Frank.Hofmann at Sun.COM wrote:> On Wed, 23 Jul 2008, Tharindu Rukshan Bamunuarachchi wrote: > > >> 10,000 x 700 = 7MB per second ...... >> >> We have this rate for whole day .... >> >> 10,000 orders per second is minimum requirments of modern day stock exchanges ... >> >> Cache still help us for ~1 hours, but after that who will help us ... >> >> We are using 2540 for current testing ... >> I have tried same with 6140, but no significant improvement ... only one or two hours ... >> > > It might not be exactly what you have in mind, but this "how do I get > latency down at all costs" thing reminded me of this old paper: > > http://www.sun.com/blueprints/1000/layout.pdf > > I'm not a storage architect, someone with more experience in the area care > to comment on this ? With huge disks as we have these days, the "wide > thin" idea has gone under a bit - but how to replace such setups with > modern arrays, if the workload is such that caches eventually must get > blown and you're down to spindle speed ?

Bob Larson wrote that article, and I would love to ask him for an update. Unfortunately, he passed away a few years ago :-( http://blogs.sun.com/relling/entry/bob_larson_my_friend

I think the model still holds true, the per-disk performance hasn't significantly changed since it was written. This particular problem screams for a queuing model. You don't really need to have a huge cache as long as you can de-stage efficiently. However, the original poster hasn't shared the read workload details... if you never read, it is a trivial problem to solve with a WOM.
-- richard
On Tue, Jul 22, 2008 at 10:35 PM, Tharindu Rukshan Bamunuarachchi <tharindub at millenniumit.com> wrote:> > Dear Mark/All, > > Our trading system is writing to local and/or array volume at 10k > messages per second. > Each message is about 700bytes in size. > > Before ZFS, we used UFS. > Even with UFS, there was evey 5 second peak due to fsflush invocation. > > However each peak is about ~5ms. > Our application can not recover from such higher latency.Is the pool using raidz, raidz2, or mirroring? How many drives are you using? -B -- Brandon High bhigh at freaks.com "The good is the enemy of the best." - Nietzsche
Tharindu Rukshan Bamunuarachchi
2008-Jul-24 05:02 UTC
[zfs-discuss] [zfs-code] Peak every 4-5 second
An HTML attachment was scrubbed... URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20080724/6d88dbfc/attachment.html> -------------- next part -------------- ******************************************************************************************************************************************************************* "The information contained in this email including in any attachment is confidential and is meant to be read only by the person to whom it is addressed. If you are not the intended recipient(s), you are prohibited from printing, forwarding, saving or copying this email. If you have received this e-mail in error, please immediately notify the sender and delete this e-mail and its attachments from your computer." *******************************************************************************************************************************************************************
On Wed, Jul 23, 2008 at 10:02 PM, Tharindu Rukshan Bamunuarachchi <tharindub at millenniumit.com> wrote:> We do not use raidz*. Virtually, no raid or stripe through OS.

So it's ZFS on a single LUN exported from the 2540? Or have you created a zpool from multiple raid1 LUNs on the 2540? Have you tried exporting the individual drives and using zfs to handle the mirroring? It might have better performance in your situation. -B -- Brandon High bhigh at freaks.com "The good is the enemy of the best." - Nietzsche
Tharindu Rukshan Bamunuarachchi
2008-Jul-24 11:55 UTC
[zfs-discuss] [zfs-code] Peak every 4-5 second
An HTML attachment was scrubbed... URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20080724/586b5ff4/attachment.html> -------------- next part -------------- ******************************************************************************************************************************************************************* "The information contained in this email including in any attachment is confidential and is meant to be read only by the person to whom it is addressed. If you are not the intended recipient(s), you are prohibited from printing, forwarding, saving or copying this email. If you have received this e-mail in error, please immediately notify the sender and delete this e-mail and its attachments from your computer." *******************************************************************************************************************************************************************
Hmmn, that *sounds* as if you are saying you've a very-high-redundancy RAID1 mirror, 4 disks deep, on an 'enterprise-class tier 2 storage' array that doesn't support RAID 1+0 or 0+1. That sounds weird: the 2540 supports RAID levels 0, 1, (1+0), 3 and 5, and deep mirrors are normally only used on really fast equipment in mission-critical tier 1 storage... Are you sure you don't mean you have raid 0 (stripes) 4 disks wide, each stripe presented as a LUN?

If you really have 4-deep RAID 1, you have a configuration that will perform somewhat slower than any single disk, as the array launches 4 writes to 4 drives in parallel, and returns success when they all complete. If you had 4-wide RAID 0, with mirroring done at the host, you would have a configuration that would (probabilistically) perform better than a single drive when writing to each side of the mirror, and the write would return success when the slowest side of the mirror completed.

--dave (puzzled!) c-b

Tharindu Rukshan Bamunuarachchi wrote:> We do not use raidz*. Virtually, no raid or stripe through OS. > > We have 4 disk RAID1 volumes. RAID1 was created from CAM on 2540. > > 2540 does not have RAID 1+0 or 0+1. > > cheers > tharindu > > Brandon High wrote: > >>On Tue, Jul 22, 2008 at 10:35 PM, Tharindu Rukshan Bamunuarachchi >><tharindub at millenniumit.com> wrote: >> >> >>>Dear Mark/All, >>> >>>Our trading system is writing to local and/or array volume at 10k >>>messages per second. >>>Each message is about 700bytes in size. >>> >>>Before ZFS, we used UFS. >>>Even with UFS, there was evey 5 second peak due to fsflush invocation. >>> >>>However each peak is about ~5ms. >>>Our application can not recover from such higher latency. >>> >>> >> >>Is the pool using raidz, raidz2, or mirroring? How many drives are you using? >> >>-B

-- David Collier-Brown | Always do right. This will gratify Sun Microsystems, Toronto | some people and astonish the rest davecb at sun.com | -- Mark Twain (905) 943-1983, cell: (647) 833-9377, (800) 555-9786 x56583 bridge: (877) 385-4099 code: 506 9191#
On Thu, 24 Jul 2008, Tharindu Rukshan Bamunuarachchi wrote:> We do not use raidz*. Virtually, no raid or stripe through OS. > > We have 4 disk RAID1 volumes. RAID1 was created from CAM on 2540.

What ZFS block size are you using? Are you using synchronous writes for each 700byte message? 10k synchronous writes per second is pretty high and would depend heavily on the 2540's write cache and how the 2540's firmware behaves.

You will find some cache tweaks for the 2540 in my writeup available at http://www.simplesystems.org/users/bfriesen/zfs-discuss/2540-zfs-performance.pdf. Without these tweaks, the 2540 waits for the data to be written to disk rather than written to its NVRAM whenever ZFS flushes the write cache.

Bob
=====================================
Bob Friesenhahn bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
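For reference, checking and matching the dataset block size Bob asks about is a one-line change; the pool/dataset name below is hypothetical, and recordsize only applies to files created after the change:

    # zfs get recordsize tank/trading
    # zfs set recordsize=8k tank/trading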
On Thu, 24 Jul 2008, Brandon High wrote:> > Have you tried exporting the individual drives and using zfs to handle > the mirroring? It might have better performance in your situation.

It should indeed have better performance. The single LUN exported from the 2540 will be treated like a single drive from ZFS's perspective. The data written needs to be serialized in the same way that it would be for a drive. ZFS has no understanding that some offsets will access a different drive so it may be that one pair of drives is experiencing all of the load.

The most performant configuration would be to export a LUN from each of the 2540's 12 drives and create a pool of 6 mirrors. In this situation, ZFS will load share across the 6 mirrors so that each pair gets its fair share of the IOPS based on its backlog. The 2540 cache tweaks will also help tremendously for this sort of work load. Since this is for critical data I would not disable the cache mirroring in the 2540's controllers.

Bob
=====================================
Bob Friesenhahn bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
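A sketch of the layout Bob describes: twelve single-disk LUNs exported from the array and mirrored in pairs by ZFS. The device names below are placeholders, not from this thread:

    # zpool create tank \
          mirror c4t0d0 c4t1d0 \
          mirror c4t2d0 c4t3d0 \
          mirror c4t4d0 c4t5d0 \
          mirror c4t6d0 c4t7d0 \
          mirror c4t8d0 c4t9d0 \
          mirror c4t10d0 c4t11d0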
Hello Tharindu,

Wednesday, July 23, 2008, 10:03:15 AM, you wrote:

> 10,000 x 700 = 7MB per second ...... We have this rate for whole day .... 10,000 orders per second is minimum requirments of modern day stock exchanges ... Cache still help us for ~1 hours, but after that who will help us ...

Have you disabled SCSI flushes on the zfs side? Or have you disabled them on the array? On both the 2540 and 6540, if you do not disable them your performance will be very bad, especially for synchronous IOs, as the ZIL will force your array to flush its cache every time. If you are not using ZFS on any other storage than the 2540 on your servers, then put "set zfs:zfs_nocacheflush=1" in /etc/system and do a reboot. If you haven't done so, it should help you considerably.

With such relatively low throughput and with ZFS, plus cache on the array (after the above correction), plus you stated in another email that you are basically not reading at all, you should cache everything in the array and then stream it to disks (partly thanks to CoW in ZFS).

Additional question: how do you write your data? Are you updating larger files or creating a new file each time, or...?

-- Best regards, Robert Milkowski mailto:milek@task.gda.pl http://milek.blogspot.com
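The /etc/system fragment Robert refers to, for completeness; note Bob's caution later in the thread that with flushes disabled, data sitting in the array cache can be lost on a power failure unless the cache is battery-backed and mirrored:

    * /etc/system -- stop ZFS from issuing SCSI cache-flush commands
    * (only reasonable when every pool sits behind battery-backed array cache)
    set zfs:zfs_nocacheflush = 1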
Hello Tharindu, Thursday, July 24, 2008, 6:02:31 AM, you wrote: > We do not use raidz*. Virtually, no raid or stripe through OS. We have 4 disk RAID1 volumes. RAID1 was created from CAM on 2540. 2540 does not have RAID 1+0 or 0+1. Of course it does 1+0. Just add more drives to RAID-1 -- Best regards, Robert Milkowski mailto:milek@task.gda.pl http://milek.blogspot.com
On Fri, 25 Jul 2008, Robert Milkowski wrote:> Both on 2540 and 6540 if you do not disable it your performance will > be very bad especially for synchronous IOs as ZIL will force your > array to flush its cache every time. If you are not using ZFS on any > other storage than 2540 on your servers then put "set > zfs:zfs_nocacheflush=1" in /etc/system and do a reboot. If you > haven't done so it should help you considerably.

This does not seem wise since then data (records of trades) may be lost if the system crashes or loses power. It is much better to apply the firmware tweaks so that the 2540 reports that the data is written as soon as it is safely in its NVRAM rather than waiting for it to be on disk. ZFS should then perform rather well with low latency.

However, I have yet to see any response from Tharindu which indicates he has seen any of my emails regarding this (or many emails from others). Based on his responses I would assume that Tharindu is seeing less than a third of the response messages regarding his topic.

Bob
=====================================
Bob Friesenhahn bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
And do you really have 4-sided raid 1 mirrors, not 4-wide raid-0 stripes??? --dave Robert Milkowski wrote:> Hello Tharindu, > > > Thursday, July 24, 2008, 6:02:31 AM, you wrote: > > >> > > > > We do not use raidz*. Virtually, no raid or stripe through OS. > > > We have 4 disk RAID1 volumes. RAID1 was created from CAM on 2540. > > > 2540 does not have RAID 1+0 or 0+1. > > > > > Of course it does 1+0. Just add more drives to RAID-1 > > > > > -- > > Best regards, > > Robert Milkowski mailto:milek at task.gda.pl > > http://milek.blogspot.com
-- David Collier-Brown | Always do right. This will gratify Sun Microsystems, Toronto | some people and astonish the rest davecb at sun.com | -- Mark Twain (905) 943-1983, cell: (647) 833-9377, (800) 555-9786 x56583 bridge: (877) 385-4099 code: 506 9191#
On Fri, Jul 25, 2008 at 9:17 AM, David Collier-Brown <davecb at sun.com> wrote:> And do you really have 4-sided raid 1 mirrors, not 4-wide raid-0 stripes???Or perhaps 4 RAID1 mirrors concatenated? -B -- Brandon High bhigh at freaks.com "The good is the enemy of the best." - Nietzsche
Brandon High wrote:> On Fri, Jul 25, 2008 at 9:17 AM, David Collier-Brown <davecb at sun.com> wrote: > >>And do you really have 4-sided raid 1 mirrors, not 4-wide raid-0 stripes??? > > > Or perhaps 4 RAID1 mirrors concatenated? >

I wondered that too, but he insists he doesn't have 0+1 or 1+0... Tharindu, could you clarify this for us? It significantly affects what advice we give! --dave (former tech lead, performance engineering at ACE) c-b -- David Collier-Brown | Always do right. This will gratify Sun Microsystems, Toronto | some people and astonish the rest davecb at sun.com | -- Mark Twain (905) 943-1983, cell: (647) 833-9377, (800) 555-9786 x56583 bridge: (877) 385-4099 code: 506 9191#
Tharindu Rukshan Bamunuarachchi
2008-Jul-26 17:02 UTC
[zfs-discuss] [zfs-code] Peak every 4-5 second
An HTML attachment was scrubbed...
URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20080726/43bc0268/attachment.html>
On Sat, 26 Jul 2008, Tharindu Rukshan Bamunuarachchi wrote:> > 1.Re configure array with 12 independent disks > 2. Allocate disks to RAIDZed poolUsing raidz will penalize your transaction performance since all disks will need to perform I/O for each write. It is definitely better to use load-shared mirrors for this purpose.> 3. Fine tune the 2540 according to Bob''s 2540-ZFS-Performance.pdf (Thankx Bob) > 4. Apply ZFS tunings (i.e. zfs_nocacheflush=1 etc.)Hopefully after step #3, step#4 will not be required. Step #4 puts data at risk if there is a system crash.> However, I could not find additional cards to support I/O Multipath. Hope that would not affect > on latency.Probably not. It will effect sequential I/O performance but latency is primarily dependent on disk configuration and ZFS filesystem block size. I have performed some tests here of synchronous writes using iozone with multi-threaded readers/writers. This is for the same 2540 configuration that I wrote about earlier. For this particular test, the ZFS filesystem blocksize is 8K and the size of the I/Os is 8K. This may not be a good representation of your own workload since the threads are contending for I/O with random access. In your case, it seems that writes may be written in a sequential append mode. I also have test results handy for similar test parameters but using various ZFS filesystem settings (8K/128K block size, checksum enable/disable, noatime, and sha256 checksum), and 8K or 128K I/O block sizes. Let me know if there is something you would like for me to measure. It should be easy to simulate your application behavior using iozone. Bob =====================================Bob Friesenhahn bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/ Iozone: Performance Test of File I/O Version $Revision: 3.283 $ Compiled for 64 bit mode. Build: Solaris10gcc-64 Contributors:William Norcott, Don Capps, Isom Crawford, Kirby Collins Al Slater, Scott Rhine, Mike Wisner, Ken Goss Steve Landherr, Brad Smith, Mark Kelly, Dr. Alain CYR, Randy Dunlap, Mark Montague, Dan Million, Jean-Marc Zucconi, Jeff Blomberg, Benny Halevy, Erik Habbinga, Kris Strecker, Walter Wong. Run began: Wed Jul 2 10:54:19 2008 Multi_buffer. Work area 16777216 bytes OPS Mode. Output is in operations per second. Record Size 8 KB SYNC Mode. File size set to 2097152 KB Command line used: iozone -m -t 8 -T -O -r 8k -o -s 2G Time Resolution = 0.000001 seconds. Processor cache size set to 1024 Kbytes. Processor cache line size set to 32 bytes. File stride size set to 17 * record size. 
Throughput test with 8 threads Each thread writes a 2097152 Kbyte file in 8 Kbyte records Children see throughput for 8 initial writers = 4315.57 ops/sec Parent sees throughput for 8 initial writers = 4266.15 ops/sec Min throughput per thread = 532.18 ops/sec Max throughput per thread = 543.36 ops/sec Avg throughput per thread = 539.45 ops/sec Min xfer = 256746.00 ops Children see throughput for 8 rewriters = 2595.08 ops/sec Parent sees throughput for 8 rewriters = 2595.06 ops/sec Min throughput per thread = 322.07 ops/sec Max throughput per thread = 326.15 ops/sec Avg throughput per thread = 324.38 ops/sec Min xfer = 258867.00 ops Children see throughput for 8 readers = 53462.03 ops/sec Parent sees throughput for 8 readers = 53451.08 ops/sec Min throughput per thread = 6340.39 ops/sec Max throughput per thread = 6859.59 ops/sec Avg throughput per thread = 6682.75 ops/sec Min xfer = 242368.00 ops Children see throughput for 8 re-readers = 54585.11 ops/sec Parent sees throughput for 8 re-readers = 54573.08 ops/sec Min throughput per thread = 6022.81 ops/sec Max throughput per thread = 7164.78 ops/sec Avg throughput per thread = 6823.14 ops/sec Min xfer = 220373.00 ops Children see throughput for 8 reverse readers = 56755.70 ops/sec Parent sees throughput for 8 reverse readers = 56667.52 ops/sec Min throughput per thread = 5893.60 ops/sec Max throughput per thread = 7554.16 ops/sec Avg throughput per thread = 7094.46 ops/sec Min xfer = 204744.00 ops Children see throughput for 8 stride readers = 11964.43 ops/sec Parent sees throughput for 8 stride readers = 11959.61 ops/sec Min throughput per thread = 1353.59 ops/sec Max throughput per thread = 1545.83 ops/sec Avg throughput per thread = 1495.55 ops/sec Min xfer = 229619.00 ops Children see throughput for 8 random readers = 3314.17 ops/sec Parent sees throughput for 8 random readers = 3314.11 ops/sec Min throughput per thread = 367.38 ops/sec Max throughput per thread = 482.99 ops/sec Avg throughput per thread = 414.27 ops/sec Min xfer = 199395.00 ops Children see throughput for 8 mixed workload = 2438.01 ops/sec Parent sees throughput for 8 mixed workload = 2414.88 ops/sec Min throughput per thread = 77.17 ops/sec Max throughput per thread = 528.42 ops/sec Avg throughput per thread = 304.75 ops/sec Min xfer = 38284.00 ops Children see throughput for 8 random writers = 3176.50 ops/sec Parent sees throughput for 8 random writers = 3141.77 ops/sec Min throughput per thread = 394.89 ops/sec Max throughput per thread = 400.16 ops/sec Avg throughput per thread = 397.06 ops/sec Min xfer = 258695.00 ops iozone test complete.
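Following up on Bob's offer to simulate the application with iozone, a rough sketch of an invocation that approximates small synchronous writes with ops/sec output; the record and file sizes are assumptions, not figures from the thread:

    iozone -t 1 -T -O -o -r 1k -s 512m -i 0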
Tharindu Rukshan Bamunuarachchi
2008-Jul-26 17:35 UTC
[zfs-discuss] [zfs-code] Peak every 4-5 second
An HTML attachment was scrubbed...
URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20080726/c839a338/attachment.html>
A non-text attachment was scrubbed...
Name: DTool
Type: application/octet-stream
Size: 109888 bytes
Desc: not available
URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20080726/c839a338/attachment.obj>
On Sat, 26 Jul 2008, Tharindu Rukshan Bamunuarachchi wrote:> It is impossible to simulate my scenario with iozone. iozone performs very well for ZFS. OTOH, > iozone does not measure latency. > > Please find attached tool (Solaris x86), which we have written to measure latency.Very interesting software. I ran it for a little while in a ZFS filesystem configured with 8K zfs blocksize and produced a 673MB file. This is with a full graphical login environment running. I did a short run using 128K zfs blocksize and notice that the average peak latencies are about 2X as high as with 8K blocks, but the maximum peak latencies are similar (i.e. somewhat under 10,000 us). I suspect that the maximum peak latencies have something to do with zfs itself (or something in the test program) rather than the pool configuration. Here is the text output with 8k filesystem blocks: % ./DTool -W -i 1 -s 700 -r 10000 -f file System Tick = 100 usecs Clock resolution 10 HR Timer created for 100usecs z_FileName = file i_Rate = 10000 l_BlockSize = 700 i_SyncInterval = 0 l_TickInterval = 100 i_TicksPerIO = 1 i_NumOfIOsPerSlot = 1 Max (us)| Min (us) | Avg (us) | MB/S | File Freq Distribution 80 | 5 | 6.5637 | 3.3371 | file 50(99.99), 200(0.01), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 2116 | 4 | 7.2277 | 6.6429 | file 50(99.88), 200(0.10), 500(0.00), 2000(0.01), 5000(0.01), 10000(0.00), 100000(0.00), 200000(0.00), 60 | 5 | 6.7522 | 6.6733 | file 50(99.99), 200(0.01), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 64 | 5 | 6.6542 | 6.6753 | file 50(99.99), 200(0.01), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 46 | 4 | 6.5489 | 6.6753 | file 50(100.00), 200(0.00), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 68 | 5 | 6.5236 | 6.6726 | file 50(99.99), 200(0.01), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 8694 | 4 | 8.7859 | 6.4859 | file 50(99.39), 200(0.54), 500(0.03), 2000(0.03), 5000(0.00), 10000(0.01), 100000(0.00), 200000(0.00), 70 | 4 | 6.5669 | 6.6753 | file 50(99.98), 200(0.02), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 48 | 5 | 6.5907 | 6.6733 | file 50(100.00), 200(0.00), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 49 | 5 | 6.5948 | 6.6753 | file 50(100.00), 200(0.00), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 47 | 4 | 6.5437 | 6.6753 | file 50(100.00), 200(0.00), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 7991 | 4 | 8.7452 | 6.5043 | file 50(99.45), 200(0.45), 500(0.06), 2000(0.03), 5000(0.00), 10000(0.01), 100000(0.00), 200000(0.00), 57 | 4 | 6.7606 | 6.6753 | file 50(99.98), 200(0.02), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 49 | 5 | 6.6358 | 6.6753 | file 50(100.00), 200(0.00), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 46 | 5 | 6.4603 | 6.6726 | file 50(100.00), 200(0.00), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 60 | 5 | 6.4511 | 6.6727 | file 50(99.99), 200(0.01), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 9099 | 4 | 9.0321 | 6.4891 | file 50(99.37), 200(0.51), 500(0.07), 2000(0.04), 5000(0.00), 10000(0.01), 100000(0.00), 200000(0.00), 48 | 5 | 6.5132 | 6.6727 | file 50(100.00), 200(0.00), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 72 | 5 | 6.5453 | 
6.6726 | file 50(99.99), 200(0.01), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 44 | 5 | 6.5788 | 6.6753 | file 50(100.00), 200(0.00), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 71 | 5 | 6.5554 | 6.6727 | file 50(99.98), 200(0.02), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 9138 | 4 | 8.9271 | 6.5061 | file 50(99.43), 200(0.48), 500(0.03), 2000(0.04), 5000(0.01), 10000(0.01), 100000(0.00), 200000(0.00), 45 | 5 | 6.5028 | 6.6753 | file 50(100.00), 200(0.00), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 51 | 5 | 6.5297 | 6.6733 | file 50(99.99), 200(0.01), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 47 | 5 | 6.6340 | 6.6727 | file 50(100.00), 200(0.00), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 49 | 4 | 6.6172 | 6.6753 | file 50(100.00), 200(0.00), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 8671 | 4 | 8.5149 | 6.4436 | file 50(99.58), 200(0.35), 500(0.03), 2000(0.02), 5000(0.01), 10000(0.01), 100000(0.00), 200000(0.00), 51 | 5 | 6.5969 | 6.6754 | file 50(99.99), 200(0.01), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 523 | 4 | 6.6691 | 6.6753 | file 50(99.98), 200(0.01), 500(0.00), 2000(0.01), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 68 | 5 | 6.6747 | 6.6726 | file 50(99.98), 200(0.02), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 41 | 5 | 6.5438 | 6.6734 | file 50(100.00), 200(0.00), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 8713 | 4 | 8.8272 | 6.4589 | file 50(99.36), 200(0.50), 500(0.11), 2000(0.02), 5000(0.00), 10000(0.01), 100000(0.00), 200000(0.00), 50 | 4 | 6.4549 | 6.6754 | file 50(100.00), 200(0.00), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 49 | 5 | 6.5445 | 6.6726 | file 50(100.00), 200(0.00), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 69 | 5 | 6.4427 | 6.6753 | file 50(99.98), 200(0.02), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 71 | 5 | 6.5174 | 6.6727 | file 50(99.99), 200(0.01), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 8745 | 4 | 9.0431 | 6.4960 | file 50(99.34), 200(0.55), 500(0.08), 2000(0.01), 5000(0.01), 10000(0.01), 100000(0.00), 200000(0.00), 61 | 5 | 6.6875 | 6.6727 | file 50(99.99), 200(0.01), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 62 | 5 | 6.5137 | 6.6726 | file 50(99.99), 200(0.01), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 59 | 5 | 6.6640 | 6.6727 | file 50(99.99), 200(0.01), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 206 | 4 | 6.8049 | 6.6353 | file 50(99.86), 200(0.13), 500(0.01), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 9237 | 4 | 8.3959 | 6.5524 | file 50(99.59), 200(0.30), 500(0.07), 2000(0.03), 5000(0.00), 10000(0.01), 100000(0.00), 200000(0.00), 63 | 5 | 6.4380 | 6.6726 | file 50(99.99), 200(0.01), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 44 | 5 | 6.3695 | 6.6754 | file 50(100.00), 200(0.00), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 54 | 5 | 6.5120 | 6.6753 | file 50(99.98), 200(0.02), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 1166 | 4 | 7.2351 | 6.5967 | file 
50(99.67), 200(0.29), 500(0.03), 2000(0.01), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 9103 | 4 | 7.9335 | 6.5807 | file 50(99.83), 200(0.12), 500(0.03), 2000(0.01), 5000(0.00), 10000(0.01), 100000(0.00), 200000(0.00), 60 | 5 | 6.5613 | 6.6727 | file 50(99.98), 200(0.02), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 66 | 5 | 6.4940 | 6.6733 | file 50(99.99), 200(0.01), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 51 | 5 | 6.5826 | 6.6726 | file 50(99.99), 200(0.01), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 1549 | 4 | 7.8373 | 6.4979 | file 50(99.48), 200(0.43), 500(0.06), 2000(0.03), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 9221 | 4 | 7.6834 | 6.6720 | file 50(99.96), 200(0.03), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.01), 100000(0.00), 200000(0.00), 64 | 4 | 6.5624 | 6.6733 | file 50(99.99), 200(0.01), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 47 | 5 | 6.5555 | 6.6747 | file 50(100.00), 200(0.00), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 47 | 5 | 6.5979 | 6.6725 | file 50(100.00), 200(0.00), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 9219 | 4 | 9.0427 | 6.4955 | file 50(99.45), 200(0.43), 500(0.08), 2000(0.03), 5000(0.00), 10000(0.01), 100000(0.00), 200000(0.00), 71 | 5 | 6.4686 | 6.6725 | file 50(99.99), 200(0.01), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 59 | 5 | 6.6367 | 6.6727 | file 50(99.99), 200(0.01), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 49 | 4 | 6.6251 | 6.6752 | file 50(100.00), 200(0.00), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 59 | 5 | 6.5028 | 6.6727 | file 50(99.99), 200(0.01), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 8722 | 4 | 8.7847 | 6.5003 | file 50(99.21), 200(0.72), 500(0.02), 2000(0.04), 5000(0.00), 10000(0.01), 100000(0.00), 200000(0.00), 46 | 5 | 6.4410 | 6.6727 | file 50(100.00), 200(0.00), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 45 | 5 | 6.5116 | 6.6726 | file 50(100.00), 200(0.00), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 70 | 5 | 6.5629 | 6.6753 | file 50(99.99), 200(0.01), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 41 | 5 | 6.4026 | 6.6727 | file 50(100.00), 200(0.00), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 8728 | 4 | 8.5549 | 6.4809 | file 50(99.47), 200(0.46), 500(0.05), 2000(0.01), 5000(0.00), 10000(0.01), 100000(0.00), 200000(0.00), 64 | 5 | 6.6568 | 6.6753 | file 50(99.98), 200(0.02), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 59 | 5 | 6.3614 | 6.6726 | file 50(99.99), 200(0.01), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 89 | 5 | 6.5293 | 6.6726 | file 50(99.99), 200(0.01), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 47 | 5 | 6.4494 | 6.6733 | file 50(100.00), 200(0.00), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 8752 | 4 | 8.7413 | 6.4929 | file 50(99.51), 200(0.40), 500(0.04), 2000(0.04), 5000(0.00), 10000(0.01), 100000(0.00), 200000(0.00), 73 | 5 | 6.5893 | 6.6726 | file 50(99.97), 200(0.03), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 51 | 5 | 6.6581 | 6.6726 | file 50(99.99), 200(0.01), 
500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 62 | 5 | 6.6587 | 6.6726 | file 50(99.99), 200(0.01), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 77 | 5 | 7.2512 | 6.6753 | file 50(99.94), 200(0.06), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 9169 | 4 | 9.7338 | 6.4627 | file 50(99.04), 200(0.88), 500(0.04), 2000(0.02), 5000(0.01), 10000(0.01), 100000(0.00), 200000(0.00), 69 | 5 | 7.2407 | 6.6727 | file 50(99.96), 200(0.04), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 97 | 4 | 7.3353 | 6.6753 | file 50(99.93), 200(0.07), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 382 | 4 | 7.2239 | 6.6753 | file 50(99.93), 200(0.06), 500(0.01), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 109 | 4 | 7.2756 | 6.6753 | file 50(99.94), 200(0.06), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 8015 | 4 | 9.7219 | 6.4942 | file 50(98.94), 200(0.92), 500(0.10), 2000(0.03), 5000(0.00), 10000(0.01), 100000(0.00), 200000(0.00), 70 | 5 | 7.2720 | 6.6753 | file 50(99.92), 200(0.08), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 99 | 5 | 7.3499 | 6.6753 | file 50(99.90), 200(0.10), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 64 | 4 | 7.2259 | 6.6753 | file 50(99.96), 200(0.04), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 69 | 4 | 7.2479 | 6.6753 | file 50(99.94), 200(0.06), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 8778 | 4 | 10.0523 | 6.4904 | file 50(98.99), 200(0.88), 500(0.08), 2000(0.03), 5000(0.01), 10000(0.01), 100000(0.00), 200000(0.00), 72 | 5 | 7.3874 | 6.6753 | file 50(99.90), 200(0.10), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 69 | 5 | 7.4387 | 6.6726 | file 50(99.94), 200(0.06), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 69 | 5 | 7.4582 | 6.6726 | file 50(99.95), 200(0.05), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 71 | 4 | 7.4296 | 6.6753 | file 50(99.91), 200(0.09), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 8756 | 4 | 9.4746 | 6.4835 | file 50(99.08), 200(0.85), 500(0.04), 2000(0.02), 5000(0.00), 10000(0.01), 100000(0.00), 200000(0.00), 70 | 5 | 7.4215 | 6.6753 | file 50(99.96), 200(0.04), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 66 | 4 | 7.3560 | 6.6753 | file 50(99.94), 200(0.06), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 65 | 4 | 7.3263 | 6.6754 | file 50(99.94), 200(0.06), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 87 | 4 | 7.2929 | 6.6753 | file 50(99.95), 200(0.05), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 8753 | 4 | 9.4104 | 6.4477 | file 50(99.16), 200(0.77), 500(0.03), 2000(0.03), 5000(0.00), 10000(0.01), 100000(0.00), 200000(0.00), 71 | 4 | 7.4860 | 6.6754 | file 50(99.90), 200(0.10), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 66 | 5 | 7.2805 | 6.6753 | file 50(99.96), 200(0.04), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 78 | 5 | 7.4491 | 6.6740 | file 50(99.92), 200(0.08), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 75 | 5 | 7.4095 | 6.6753 | file 50(99.95), 200(0.05), 500(0.00), 2000(0.00), 
5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00),

======================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
On Sat, 26 Jul 2008, Bob Friesenhahn wrote:

> I suspect that the maximum peak latencies have something to do with
> zfs itself (or something in the test program) rather than the pool
> configuration.

As confirmation that the reported timings have virtually nothing to do
with the pool configuration, I ran the program on a two-drive ZFS mirror
pool consisting of two cheap 500MB USB drives.  The average latency was
not much worse.  The peak latency values are often larger but the
maximum peak is still on the order of 9000 microseconds.

I then ran the test on a single-drive UFS filesystem (300GB 15K RPM SAS
drive) which is freshly created and see that the average latency is
somewhat lower but the maximum peak for each interval is typically much
higher (at least 1200 but often 4000).  I even saw a measured peak as
high as 22224.

Based on the findings, it seems that using the 2540 is a complete waste
if two cheap USB drives in a zfs mirror pool can almost obtain the same
timings.  UFS on the fast SAS drive performed worse.

I did not run your program in a real-time scheduling class (see
priocntl).  Perhaps it would perform better using real-time scheduling.
It might also do better in a fixed-priority class.

Bob

======================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
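To make the scheduling-class suggestion concrete, the invocations below are a sketch only: the priority values are arbitrary examples, and ./write_latency_test is a placeholder name, not the test program discussed in this thread.

    # Real-time class at an example priority of 30
    priocntl -e -c RT -p 30 ./write_latency_test <options>

    # Fixed-priority class, so the process is not demoted by the timeshare scheduler
    priocntl -e -c FX -m 60 -p 60 ./write_latency_test <options>

Running a disk benchmark in the RT class can starve other work on the box, so it is best tried on a test system first.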
Bob Friesenhahn wrote:
> On Sat, 26 Jul 2008, Bob Friesenhahn wrote:
>
>> I suspect that the maximum peak latencies have something to do with
>> zfs itself (or something in the test program) rather than the pool
>> configuration.
>
> As confirmation that the reported timings have virtually nothing to do
> with the pool configuration, I ran the program on a two-drive ZFS
> mirror pool consisting of two cheap 500MB USB drives.  The average
> latency was not much worse.  The peak latency values are often larger
> but the maximum peak is still on the order of 9000 microseconds.

Is it doing buffered or sync writes?  I'll try it later today or
tomorrow...

> I then ran the test on a single-drive UFS filesystem (300GB 15K RPM
> SAS drive) which is freshly created and see that the average latency
> is somewhat lower but the maximum peak for each interval is typically
> much higher (at least 1200 but often 4000).  I even saw a measured peak
> as high as 22224.
>
> Based on the findings, it seems that using the 2540 is a complete
> waste if two cheap USB drives in a zfs mirror pool can almost obtain
> the same timings.  UFS on the fast SAS drive performed worse.
>
> I did not run your program in a real-time scheduling class (see
> priocntl).  Perhaps it would perform better using real-time
> scheduling.  It might also do better in a fixed-priority class.

This might be more important.  But a better solution is to assign a
processor set to run only the application -- a good idea any time you
need a predictable response.
 -- richard
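A processor set is created and used roughly as follows; the CPU ids and the set id are illustrative values only, and ./write_latency_test is again a placeholder binary name, so adjust both for the machine at hand.

    # Create a processor set from CPUs 2 and 3 (psrset prints the new set id)
    psrset -c 2 3

    # Optionally keep device interrupts off those CPUs as well
    psradm -i 2 3

    # Run the test bound to the set, assuming the id printed above was 1
    psrset -e 1 ./write_latency_test <options>

The set can be torn down afterwards with psrset -d, and the CPUs returned to normal interrupt handling with psradm -n.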
On Sat, 26 Jul 2008, Richard Elling wrote:
>
> Is it doing buffered or sync writes?  I'll try it later today or
> tomorrow...

I have not seen the source code but truss shows that this program is
doing more than expected such as using send/recv to send a message.  In
fact, send(), pollsys(), recv(), and write() constitute most of the
activity.  A POSIX.4 real-time timer is created.  Perhaps it uses two
threads, with one sending messages to the other over a socketpair, and
the second thread does the actual write.

>> I did not run your program in a real-time scheduling class (see priocntl).
>> Perhaps it would perform better using real-time scheduling.  It might also
>> do better in a fixed-priority class.
>
> This might be more important.  But a better solution is to assign a
> processor set to run only the application -- a good idea any time you
> need a predictable response.

Later on I did try running the program in the real time scheduling
class with high priority and it made no difference at all.  While it is
clear that filesystem type (ZFS or UFS) does make a significant
difference, it seems that the program is doing more than simply timing
the write system call.  A defect in the program could easily account
for the long delays.

It would help if source code for the program can be posted.

Bob

======================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
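As a point of comparison for the truss observation above, a harness that times nothing but the write(2) call can be very small.  The sketch below is not the test program from this thread; the file name, block size and iteration count are arbitrary assumptions, and it simply reports the worst-case write(2) latency it saw using the Solaris gethrtime() high-resolution timer.

/*
 * Minimal sketch: time only the write(2) call, nothing else.
 * NOT the program discussed in this thread; file name, block size
 * and iteration count are arbitrary example values.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/time.h>   /* gethrtime() on Solaris */

int
main(void)
{
        char buf[1000];
        hrtime_t t0, t1, max = 0;
        int fd, i;

        (void) memset(buf, 'x', sizeof (buf));
        fd = open("latency.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) {
                perror("open");
                return (1);
        }
        for (i = 0; i < 100000; i++) {
                t0 = gethrtime();       /* nanoseconds since boot */
                if (write(fd, buf, sizeof (buf)) != sizeof (buf)) {
                        perror("write");
                        return (1);
                }
                t1 = gethrtime();
                if (t1 - t0 > max)
                        max = t1 - t0;
        }
        (void) printf("max write() latency: %lld us\n",
            (long long)(max / 1000));
        (void) close(fd);
        return (0);
}

If a stripped-down harness like this still shows the 4-5 second spikes, the cause is in the filesystem; if not, the overhead is in the messaging layer of the original tool.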
Bob Says:

"But a better solution is to assign a processor set to run only the
application -- a good idea any time you need a predictable response."

Bob's suggestion above along with "no interrupts on that pset", and a
fixed scheduling class for the application/processes in question could
also be helpful.

Tharindu, would you be able to share the source of your
write-latency-measuring application?  This might give us a better idea
of exactly what it's measuring and how.  This might allow people (way
smarter than me) to do some additional/alternative DTRACE work to help
further drill down towards the source-and-resolution of the issue.

Thanks,

-- MikeE

-----Original Message-----
From: zfs-discuss-bounces at opensolaris.org
[mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Richard Elling
Sent: Saturday, July 26, 2008 3:33 PM
To: Bob Friesenhahn
Cc: zfs-discuss at opensolaris.org
Subject: Re: [zfs-discuss] [zfs-code] Peak every 4-5 second

Bob Friesenhahn wrote:
> On Sat, 26 Jul 2008, Bob Friesenhahn wrote:
>
>> I suspect that the maximum peak latencies have something to do with
>> zfs itself (or something in the test program) rather than the pool
>> configuration.
>
> As confirmation that the reported timings have virtually nothing to do
> with the pool configuration, I ran the program on a two-drive ZFS
> mirror pool consisting of two cheap 500MB USB drives.  The average
> latency was not much worse.  The peak latency values are often larger
> but the maximum peak is still on the order of 9000 microseconds.

Is it doing buffered or sync writes?  I'll try it later today or
tomorrow...

> I then ran the test on a single-drive UFS filesystem (300GB 15K RPM
> SAS drive) which is freshly created and see that the average latency
> is somewhat lower but the maximum peak for each interval is typically
> much higher (at least 1200 but often 4000).  I even saw a measured peak
> as high as 22224.
>
> Based on the findings, it seems that using the 2540 is a complete
> waste if two cheap USB drives in a zfs mirror pool can almost obtain
> the same timings.  UFS on the fast SAS drive performed worse.
>
> I did not run your program in a real-time scheduling class (see
> priocntl).  Perhaps it would perform better using real-time
> scheduling.  It might also do better in a fixed-priority class.

This might be more important.  But a better solution is to assign a
processor set to run only the application -- a good idea any time you
need a predictable response.
 -- richard

_______________________________________________
zfs-discuss mailing list
zfs-discuss at opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Tharindu Rukshan Bamunuarachchi
2008-Jul-28 12:43 UTC
[zfs-discuss] [zfs-code] Peak every 4-5 second
An HTML attachment was scrubbed... URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20080728/a1200262/attachment.html>
On Mon, 28 Jul 2008, Tharindu Rukshan Bamunuarachchi wrote:
>
> I have tried your pdf but did not get good latency numbers even after
> array tuning...

Right.  And since I observed only slightly less optimal performance
from a mirror pair of USB drives, it seems that your requirement is not
challenging at all for the storage hardware.  USB does not offer very
much throughput.  A $200 portable disk drive is able to almost match a
$23K drive array for this application.

I did test your application with output to /dev/null and /tmp (memory
based) and it did report consistently tiny numbers in that case.  It
seems likely that writes to ZFS encounter a "hiccup" every so often in
the write system call.  There is still the possibility that a bug in
your application is causing the "hiccup" in the write timings.

Bob

======================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Hello Bob,

Friday, July 25, 2008, 4:58:54 PM, you wrote:

BF> On Fri, 25 Jul 2008, Robert Milkowski wrote:
>> Both on 2540 and 6540, if you do not disable it your performance will
>> be very bad, especially for synchronous IOs, as the ZIL will force your
>> array to flush its cache every time.  If you are not using ZFS on any
>> other storage than 2540 on your servers then put "set
>> zfs:zfs_nocacheflush=1" in /etc/system and do a reboot.  If you
>> haven't done so, it should help you considerably.

BF> This does not seem wise since then data (records of trades) may be
BF> lost if the system crashes or loses power.  It is much better to apply
BF> the firmware tweaks so that the 2540 reports that the data is written
BF> as soon as it is safely in its NVRAM rather than waiting for it to be
BF> on disk.  ZFS should then perform rather well with low latency.

Both cases are basically the same.  Please notice I'm not talking about
disabling the ZIL, I'm talking about disabling cache flushes in ZFS.
ZFS will still wait for the array to confirm that it did receive the
data (into NVRAM).  If you lose power the behavior will be the same --
no difference here.

-- 
Best regards,
 Robert Milkowski                           mailto:milek at task.gda.pl
                                            http://milek.blogspot.com
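For reference, the setting Robert describes is normally applied in one of two ways.  The /etc/system line is the one quoted above; the mdb form was, to the best of my knowledge, the commonly documented way of that era to flip the value on a running kernel, so verify it against your own release before relying on it.

    # /etc/system -- takes effect at the next reboot, as described above
    set zfs:zfs_nocacheflush=1

    # Change the value on the live kernel (0t1 = decimal 1); Solaris/OpenSolaris of that vintage
    echo zfs_nocacheflush/W0t1 | mdb -kw

Either way, the caveat from the thread applies: this is only safe when every pool sits on storage with protected (battery-backed/NVRAM) write cache.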
On Wed, 30 Jul 2008, Robert Milkowski wrote:
>
> Both cases are basically the same.  Please notice I'm not talking about
> disabling the ZIL, I'm talking about disabling cache flushes in ZFS.
> ZFS will still wait for the array to confirm that it did receive the
> data (into NVRAM).

So it seems that in your opinion, the periodic "burp" in system call
completion time is due to ZFS's periodic cache flush.  That is
certainly quite possible.  Testing will prove it, but the testing can
be on someone else's system rather than my own. :)

Bob

======================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
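One way to do that testing (a sketch, not something posted in this thread) is to quantize write(2) latency for the test process while counting transaction-group syncs in the same window, so the periodic spikes can be lined up against sync activity.  The process name below is a placeholder, the fbt probe depends on the kernel build in use, and the script must run as root.

    dtrace -n '
    syscall::write:entry /execname == "write_latency_test"/ { self->ts = timestamp; }
    syscall::write:return /self->ts/ {
            @lat["write(2) latency (ns)"] = quantize(timestamp - self->ts);
            self->ts = 0;
    }
    fbt::spa_sync:entry { @sync["spa_sync() calls in interval"] = count(); }
    tick-5s { printa(@lat); printa(@sync); clear(@lat); clear(@sync); }'

If the tail of the write(2) distribution only fattens in the intervals where spa_sync() fires, the periodic sync/cache-flush explanation fits; if not, the delay is coming from somewhere else.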
Hello Bob,

Wednesday, July 30, 2008, 3:07:05 AM, you wrote:

BF> On Wed, 30 Jul 2008, Robert Milkowski wrote:
>> Both cases are basically the same.  Please notice I'm not talking about
>> disabling the ZIL, I'm talking about disabling cache flushes in ZFS.
>> ZFS will still wait for the array to confirm that it did receive the
>> data (into NVRAM).

BF> So it seems that in your opinion, the periodic "burp" in system call
BF> completion time is due to ZFS's periodic cache flush.  That is
BF> certainly quite possible.

Could be.  Additionally, he will end up with up to 35 IOs queued per
LUN, and if he doesn't effectively have an NVRAM cache there, the
latency can dramatically increase during these periods.

-- 
Best regards,
 Robert Milkowski                           mailto:milek at task.gda.pl
                                            http://milek.blogspot.com
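The figure of 35 lines up with the default of the zfs_vdev_max_pending tunable in ZFS of that era, which caps how many I/Os ZFS keeps outstanding per vdev.  If deep per-LUN queues turn out to be behind the latency spikes, the cap was commonly lowered via /etc/system; the value below is an illustrative example, not a recommendation for this workload.

    # Example only: reduce the per-vdev I/O queue depth (the default was 35 at the time)
    set zfs:zfs_vdev_max_pending=10

Lowering it trades some throughput for shorter device queues, which is usually the right direction for a latency-sensitive workload, but the effect should be measured rather than assumed.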