Hi List,

I'm wondering if one of you expert DTrace gurus can help me.  I want to
write a DTrace script to print out a histogram of how long IO requests sit
in the service queue.  I can output the results with the quantize method.
I'm not sure which provider I should be using for this.  Does anyone know?
I can easily adapt one of the DTrace Toolkit routines for this, if I can
find the provider.

I'll also throw out the problem I'm trying to meter.  We are using ZFS on
a large SAN array (4TB).  The pool on this array serves up a lot of users
(250 home file systems/directories) and also /usr/local and other OTS
software.  It works fine most of the time, but then gets overloaded during
busy periods.  I'm going to reconfigure the array to help with this, but I
sure would love to have some metrics to know how big a difference my
tweaks are making.  Basically, the problem users experience when the load
shoots up is huge latencies.  An ls on a non-cached directory, which
usually is instantaneous, will take 20, 30, 40 seconds or more.  Then when
the storage array catches up, things get better.  My clients are not happy
campers.

I know, I know, I should have gone with a JBOD setup, but it's too late
for that in this iteration of this server.  When we set this up, I had the
gear already, and it's not in my budget to get new stuff right now.

Thanks for any help anyone can offer.

Jon
jloran at ssl.berkeley.edu said:
> I sure would love to have some metrics to know how big a difference my
> tweaks are making.  Basically, the problem users experience when the load
> shoots up is huge latencies.  An ls on a non-cached directory, which
> usually is instantaneous, will take 20, 30, 40 seconds or more.  Then when
> the storage array catches up, things get better.  My clients are not
> happy campers.
>
> I know, I know, I should have gone with a JBOD setup, but it's too late
> for that in this iteration of this server.  When we set this up, I had
> the gear already, and it's not in my budget to get new stuff right now.

What kind of array are you seeing this problem with?  It sounds very much
like our experience here with a 3-yr-old HDS ATA array.  When the crunch
came here, I didn't know enough dtrace to help, but I threw the following
into crontab to run every five minutes (24x7), and it at least collected
the info I needed to see what LUN/filesystem was busying things out.

Way crude, but effective enough:

  /bin/ksh -c "date && mpstat 2 20 && iostat -xn 2 20 \
    && fsstat $(zfs list -H -o mountpoint -t filesystem | egrep '^/') 2 20 \
    && vmstat 2 20" >> /var/tmp/iostats.log 2>&1 </dev/null

A quick scan using "egrep" could pull out trouble spots; e.g. the
following would identify "iostat" lines that showed 90-100% busy:

  egrep '^Sun |^Mon |^Tue |^Wed |^Thu |^Fri |^Sat | 1[0-9][0-9] c6| 9[0-9] c6' \
    /var/tmp/iostats.log

I know it's not the Dtrace you were asking for, but maybe it'll inspire
something more useful than the above shotgun approach.

Regards,

Marion
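To make the crontab step concrete: a minimal sketch, assuming the
collection command above is wrapped in a hypothetical script
/usr/local/bin/iostats.sh (Solaris cron has no */5 syntax, so the minutes
are spelled out):

  # as root: crontab -e, then add:
  0,5,10,15,20,25,30,35,40,45,50,55 * * * * /usr/local/bin/iostats.sh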
> Way crude, but effective enough:

Kinda cool, but isn't that what

  sar -f /var/adm/sa/sa`date +%d` -A | grep -v ","

is for?  Run "crontab -e sys" to start the collection.  For more fun:

  acctadm -e extended -f /var/adm/exacct/proc process

Rob
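For reference, the sa data that sar -f reads comes from the sys crontab;
the stock entries are shipped commented out.  From memory (so double-check
against your own /var/spool/cron/crontabs/sys), they look roughly like:

  0 * * * 0-6 /usr/lib/sa/sa1
  20,40 8-17 * * 1-5 /usr/lib/sa/sa1
  5 18 * * 1-5 /usr/lib/sa/sa2 -s 8:00 -e 18:01 -i 1200 -A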
Marion Hakanson wrote:
> jloran at ssl.berkeley.edu said:
>> ...
>>
>> I know, I know, I should have gone with a JBOD setup, but it's too late
>> for that in this iteration of this server.  When we set this up, I had
>> the gear already, and it's not in my budget to get new stuff right now.
>
> What kind of array are you seeing this problem with?  It sounds very much
> like our experience here with a 3-yr-old HDS ATA array.

It's not that old.  It's a Supermicro system with a 3ware 9650SE-8LP and
an Open-E iSCSI-R3 DOM module.  The system is plenty fast.  I can pretty
handily pull 120MB/sec from it, and write at over 100MB/sec.  It falls
apart more on random I/O.  The server/initiator side is a T2000 with
Solaris 10u4.  It never sees over 25% CPU, ever.  Oh yeah, and there are
two 1GB network links to the SAN.

> When the crunch came here, I didn't know enough dtrace to help, but I
> threw the following into crontab to run every five minutes (24x7), and it
> at least collected the info I needed to see what LUN/filesystem was
> busying things out.
>
> Way crude, but effective enough:
>
>   /bin/ksh -c "date && mpstat 2 20 && iostat -xn 2 20 \
>     && fsstat $(zfs list -H -o mountpoint -t filesystem | egrep '^/') 2 20 \
>     && vmstat 2 20" >> /var/tmp/iostats.log 2>&1 </dev/null
>
> A quick scan using "egrep" could pull out trouble spots; e.g. the
> following would identify "iostat" lines that showed 90-100% busy:
>
>   egrep '^Sun |^Mon |^Tue |^Wed |^Thu |^Fri |^Sat | 1[0-9][0-9] c6| 9[0-9] c6' \
>     /var/tmp/iostats.log

Yeah, I have some of the traditional *stat utilities running.  If I
capture more than a second at a time, things look good.  I was hoping to
get a real distribution of service times, to catch the outliers that don't
get absorbed into the average, which is why I wanted to use dtrace.

My opinion is that if, when the array got really loaded up, everything
slowed down evenly, users wouldn't mind or notice much.  But when every
20th or so read/write gets delayed by 10s of seconds, the users start to
line up at my door.

Thanks for the tips.

Jon
jloran at ssl.berkeley.edu said:
> It's not that old.  It's a Supermicro system with a 3ware 9650SE-8LP and
> an Open-E iSCSI-R3 DOM module.  The system is plenty fast.  I can pretty
> handily pull 120MB/sec from it, and write at over 100MB/sec.  It falls
> apart more on random I/O.  The server/initiator side is a T2000 with
> Solaris 10u4.  It never sees over 25% CPU, ever.  Oh yeah, and there are
> two 1GB network links to the SAN.
> . . .
> My opinion is that if, when the array got really loaded up, everything
> slowed down evenly, users wouldn't mind or notice much.  But when every
> 20th or so read/write gets delayed by 10s of seconds, the users start to
> line up at my door.

Hmm, I have no experience with iSCSI yet.  But the behavior of our T2000
file/NFS server connected via a 2Gbit fiber channel SAN is exactly as you
describe when our HDS SATA array gets behind.  Access to other ZFS pools
remains unaffected, but any access to the busy pool just hangs.  Some
Oracle apps on NFS clients die due to excessive delays.

In our case, this old HDS array's SATA shelves have a very limited queue
depth (four per RAID controller) in the "back end" loop, plus every write
is hit with the added overhead of an in-array read-back verification.
Maybe your iSCSI situation injects enough latency at higher loads to cause
something like our FC queue limitations.

Good luck,

Marion
Marion Hakanson wrote:
> jloran at ssl.berkeley.edu said:
>> It's not that old.  It's a Supermicro system with a 3ware 9650SE-8LP and
>> an Open-E iSCSI-R3 DOM module.  The system is plenty fast.  I can pretty
>> handily pull 120MB/sec from it, and write at over 100MB/sec.  It falls
>> apart more on random I/O.  The server/initiator side is a T2000 with
>> Solaris 10u4.  It never sees over 25% CPU, ever.  Oh yeah, and there are
>> two 1GB network links to the SAN.
>> . . .
>> My opinion is that if, when the array got really loaded up, everything
>> slowed down evenly, users wouldn't mind or notice much.  But when every
>> 20th or so read/write gets delayed by 10s of seconds, the users start to
>> line up at my door.
>
> Hmm, I have no experience with iSCSI yet.  But the behavior of our T2000
> file/NFS server connected via a 2Gbit fiber channel SAN is exactly as you
> describe when our HDS SATA array gets behind.  Access to other ZFS pools
> remains unaffected, but any access to the busy pool just hangs.  Some
> Oracle apps on NFS clients die due to excessive delays.
>
> In our case, this old HDS array's SATA shelves have a very limited queue
> depth (four per RAID controller) in the "back end" loop, plus every write
> is hit with the added overhead of an in-array read-back verification.
> Maybe your iSCSI situation injects enough latency at higher loads to
> cause something like our FC queue limitations.

The iSCSI array has 2GB RAM as a cache.  Writes to cache complete very
fast.  I'm not sure yet, but I would love to get some metering going on
this guy to find out whether it's really the reads that cause the issue.
It seems, though I'm not totally sure yet, that heavy random read loads
are when things break down.  I'll pass on anything I find to the list,
'cause I'm sure there are a lot of folks with ZFS on a SAN.  The
flexibility of having the SAN is still seductive, even though the benefits
to ZFS performance of direct attached storage are pulling us the other
way.

Jon
Brendan Gregg - Sun Microsystems
2008-Feb-14 21:48 UTC
[zfs-discuss] Which DTrace provider to use
G'Day Jon,

For disk layer metrics, you could try Disk/iopending from the
DTraceToolkit to check how saturated the disks become with requests (which
answers that question with much higher definition than iostat).  I'd also
run disktime.d, which should be in the next DTraceToolkit release (it's
pretty obvious), and is included below.  disktime.d measures disk delta
times - time from request to completion.

#!/usr/sbin/dtrace -s

#pragma D option quiet
#pragma D option dynvarsize=16m

BEGIN
{
        trace("Tracing... Hit Ctrl-C to end.\n");
}

io:::start
{
        start[args[0]->b_edev, args[0]->b_blkno] = timestamp;
}

io:::done
/start[args[0]->b_edev, args[0]->b_blkno]/
{
        this->delta = timestamp - start[args[0]->b_edev, args[0]->b_blkno];
        @[args[1]->dev_statname] = quantize(this->delta);
}

The iopattern script will also give you a measure of random vs sequential
I/O - which would be interesting to see.

...

For latencies in ZFS (such as ZIO pipeline latencies), we don't have a
stable provider yet.  It is possible to write fbt based scripts to do this
- but they'll only work on a particular version of Solaris.

fsinfo would be a good provider to hit up for the VFS layer.

I'd also check syscall latencies - it might be too obvious, but it can be
worth checking (eg, if you discover those long latencies are only on the
open syscall)...

Brendan

--
Brendan
[CA, USA]
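On the syscall suggestion: a quick, generic sketch (my own, not a
DTraceToolkit script) that quantizes latency for a few likely syscalls;
adjust the probe list to whatever your workload actually issues:

#!/usr/sbin/dtrace -s

#pragma D option quiet

syscall::open*:entry,
syscall::read:entry,
syscall::write:entry
/execname != "dtrace"/
{
        self->ts = timestamp;
}

syscall::open*:return,
syscall::read:return,
syscall::write:return
/self->ts/
{
        /* nanoseconds from syscall entry to return, per syscall name */
        @[probefunc] = quantize(timestamp - self->ts);
        self->ts = 0;
}

If the huge latencies only show up under one histogram (say, open), that
narrows down where to look next.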
Hi Brendan,

I have been using iopending, though I'm not sure how to interpret it.  Is
it true that the column on the left is how deep in the queue requests are,
and the histogram represents how many requests there are at each queue
depth?  Then I would guess that if there are lots of requests with high
queue depth, that's bad.  Once in a while, I see some pretty long queues,
but they only last a second, and then things even right out again.

I'll try your disktime.d script below and the other checks you recommend.
May have more questions to follow.

Thanks!

Jon

Brendan Gregg - Sun Microsystems wrote:
> G'Day Jon,
>
> For disk layer metrics, you could try Disk/iopending from the
> DTraceToolkit to check how saturated the disks become with requests
> (which answers that question with much higher definition than iostat).
> I'd also run disktime.d, which should be in the next DTraceToolkit
> release (it's pretty obvious), and is included below.  disktime.d
> measures disk delta times - time from request to completion.
>
> #!/usr/sbin/dtrace -s
>
> #pragma D option quiet
> #pragma D option dynvarsize=16m
>
> BEGIN
> {
>         trace("Tracing... Hit Ctrl-C to end.\n");
> }
>
> io:::start
> {
>         start[args[0]->b_edev, args[0]->b_blkno] = timestamp;
> }
>
> io:::done
> /start[args[0]->b_edev, args[0]->b_blkno]/
> {
>         this->delta = timestamp - start[args[0]->b_edev, args[0]->b_blkno];
>         @[args[1]->dev_statname] = quantize(this->delta);
> }
>
> The iopattern script will also give you a measure of random vs sequential
> I/O - which would be interesting to see.
>
> ...
>
> For latencies in ZFS (such as ZIO pipeline latencies), we don't have a
> stable provider yet.  It is possible to write fbt based scripts to do
> this - but they'll only work on a particular version of Solaris.
>
> fsinfo would be a good provider to hit up for the VFS layer.
>
> I'd also check syscall latencies - it might be too obvious, but it can be
> worth checking (eg, if you discover those long latencies are only on the
> open syscall)...
>
> Brendan

--
- _____/     _____/      /    - Jonathan Loran -                        -
-      /          /      /      IT Manager                              -
-   _____  /   _____  /  /    Space Sciences Laboratory, UC Berkeley
-         /          /   /    (510) 643-5146 jloran at ssl.berkeley.edu
- ______/   ______/  ______/  AST:7731^29u18e3
Brendan Gregg - Sun Microsystems
2008-Feb-14 23:57 UTC
[zfs-discuss] Which DTrace provider to use
G'Day Jon,

On Thu, Feb 14, 2008 at 02:54:58PM -0800, Jonathan Loran wrote:
> Hi Brendan,
>
> I have been using iopending, though I'm not sure how to interpret it.

Did you read Examples/iopending_example.txt?  That's why it's there. :)

> Is it true that the column on the left is how deep in the queue requests
> are, and the histogram represents how many requests there are at each
> queue depth?

Yes, the counts are samples.

> Then I would guess that if there are lots of requests with high queue
> depth, that's bad.

If it remains that way, yes.

> Once in a while, I see some pretty long queues, but they only last a
> second, and then things even right out again.

That sounds fine - operations like zil_sync() queue a bunch of I/Os
briefly, which is usually fine.

Brendan

--
Brendan
[CA, USA]
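For anyone curious what iopending is doing under the hood, here is a
stripped-down sketch of the same sampling idea (not the Toolkit script
itself, which breaks the counts down per device and formats the output);
the single racy global counter is fine for a rough picture:

#!/usr/sbin/dtrace -s

#pragma D option quiet

int pending;    /* outstanding disk I/Os, all devices combined */

io:::start
{
        pending++;
}

io:::done
/pending > 0/
{
        pending--;
}

profile:::profile-997
{
        /* sample the outstanding count 997 times a second on each CPU */
        @["outstanding I/Os (sampled)"] = lquantize(pending, 0, 64, 1);
}

profile:::tick-5sec
{
        printa(@);
        trunc(@);
}

The left column of the output is the sampled queue depth, and the counts
are how many samples landed at that depth - exactly how iopending's
histograms read.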
On Tue, Feb 12, 2008 at 10:21:44PM -0800, Jonathan Loran wrote:
> Hi List,
>
> I'm wondering if one of you expert DTrace gurus can help me.  I want to
> write a DTrace script to print out a histogram of how long IO requests
> sit in the service queue.  I can output the results with the quantize
> method.  I'm not sure which provider I should be using for this.  Does
> anyone know?  I can easily adapt one of the DTrace Toolkit routines for
> this, if I can find the provider.
>
> I'll also throw out the problem I'm trying to meter.  We are using ZFS on
> a large SAN array (4TB).  The pool on this array serves up a lot of users
> (250 home file systems/directories) and also /usr/local and other OTS
> software.  It works fine most of the time, but then gets overloaded
> during busy periods.  I'm going to reconfigure the array to help with
> this, but I sure would love to have some metrics to know how big a
> difference my tweaks are making.  Basically, the problem users experience
> when the load shoots up is huge latencies.  An ls on a non-cached
> directory, which usually is instantaneous, will take 20, 30, 40 seconds
> or more.  Then when the storage array catches up, things get better.  My
> clients are not happy campers.
>
> I know, I know, I should have gone with a JBOD setup, but it's too late
> for that in this iteration of this server.  When we set this up, I had
> the gear already, and it's not in my budget to get new stuff right now.
>
> Thanks for any help anyone can offer.

I have faced a similar problem (although not exactly the same) and was
going to monitor the disk queue with dtrace, but couldn't find any
docs/urls about it.  Finally I asked Chris Gerhard for help.  He partially
answered via his blog:
http://blogs.sun.com/chrisg/entry/latency_bubble_in_your_io

Maybe it helps you.

Regards
przemol

--
http://przemol.blogspot.com/
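Chris's blog entry has the real details, but for a flavour of the
approach, here is a minimal sketch (mine, not his script) using the io
provider's wait-start/wait-done probes, which bracket the time a thread
spends blocked waiting on an individual I/O:

#!/usr/sbin/dtrace -s

#pragma D option quiet

io:::wait-start
{
        /* this thread is about to block waiting for an I/O to complete */
        self->wait = timestamp;
}

io:::wait-done
/self->wait/
{
        /* nanoseconds spent waiting, broken down by device */
        @[args[1]->dev_statname] = quantize(timestamp - self->wait);
        self->wait = 0;
}

Hit Ctrl-C and the per-device histograms should make any latency bubbles
stand out as outliers in the high buckets.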
przemolicc at poczta.fm wrote:
> On Tue, Feb 12, 2008 at 10:21:44PM -0800, Jonathan Loran wrote:
>> Thanks for any help anyone can offer.
>
> I have faced a similar problem (although not exactly the same) and was
> going to monitor the disk queue with dtrace, but couldn't find any
> docs/urls about it.  Finally I asked Chris Gerhard for help.  He partially
> answered via his blog:
> http://blogs.sun.com/chrisg/entry/latency_bubble_in_your_io
>
> Maybe it helps you.
>
> Regards
> przemol

This is perfect.  Thank you.

Jon
On Feb 14, 2008, at 02:22, Marion Hakanson wrote:
> jloran at ssl.berkeley.edu said:
>> It's not that old.  It's a Supermicro system with a 3ware 9650SE-8LP and
>> an Open-E iSCSI-R3 DOM module.  The system is plenty fast.  I can pretty
>> handily pull 120MB/sec from it, and write at over 100MB/sec.  It falls
>> apart more on random I/O.  The server/initiator side is a T2000 with
>> Solaris 10u4.  It never sees over 25% CPU, ever.  Oh yeah, and there are
>> two 1GB network links to the SAN.
>> . . .
>> My opinion is that if, when the array got really loaded up, everything
>> slowed down evenly, users wouldn't mind or notice much.  But when every
>> 20th or so read/write gets delayed by 10s of seconds, the users start to
>> line up at my door.

This is the write throttling problem.  I've tested code that radically
changes the situation for the better.  We just need to go through
performance validation before putback.

http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6429205

-r