Paul Clayton
2008-Dec-01 03:05 UTC
[dtrace-discuss] Concurrent Disk Activity from "Solaris Performance & Tools" Chap. 4.17.4
Hello..

Due to growing performance problems on numerous systems, I have been reading through a lot of information about DTrace and what it can find out about a system. It is a great tool, and while there is a learning curve to using it successfully, getting useful information quickly is helped by such books as the "Solaris Performance & Tools" book.

In light of us also ramping up new SANs and ever larger SAN fabrics, I have long been wondering what the times are for getting data in and out of the SANs. If we talk with the SAN vendor, they say the SAN is not a problem and we should look elsewhere. Talk with the fabric folks and they say no problems, look elsewhere.

So, it was with high interest that I have been reading Chapter 4 in the book about disk activity multiple times and trying various commands out on systems here. One thing I have been puzzled by a lot this weekend is the information and plot in Figure 4.7. If I understand this section correctly, it offers the means to track the actual time from when an IO starts in the kernel to when it completes, implying the time to either read or write from disk or memory cache. I have been using a data file for an Oracle database as the test subject for this work.

I have several confusion points with this section:

- The graph mentions "strategy" and "biodone", which seem to imply TNF-based data, not output from the DTrace script above Figure 4.7.

- Looking at the data from the DTrace script, I see no way to generate the graph of Figure 4.7, specifically the time difference between "0" and the points for "strategy". With the DTrace script we have the start time of the IO; I see no way to determine some amount of time between "start" and something earlier. The time values on the X axis also don't fall out of the data generated by the DTrace script.

- How can it be determined whether the IO completed from a memory cache or required an I/O to a physical disk?
I have a lot of times less than 0.5 ms, but also a fair number in the range of 1 ms to 300 ms.

I modified the script to dump out the size of each I/O being done, and it was interesting to see some unexpected lengths. I also added "start" and "end" to the appropriate lines as a sanity check, to make it easier to pair up the entries; there should always be one start/end pair for a block address.

I have attached an Excel spreadsheet with what I was able to create based on the data collected.

My thanks for any clarifications to these confusions.

pdc

--
This message posted from opensolaris.org

[Attachment: Disk_IO_Speed.xls, application/octet-stream, 268800 bytes
URL: <http://mail.opensolaris.org/pipermail/dtrace-discuss/attachments/20081130/c1b19f23/attachment.obj>]
Jim Mauro
2008-Dec-01 04:13 UTC
[dtrace-discuss] Concurrent Disk Activity from "Solaris Performance & Tools" Chap. 4.17.4
Hi Paul -

> The graph mentions "strategy" and "biodone", which seem to imply
> TNF-based data, not output from the DTrace script above Figure 4.7.

The DTrace script uses the IO provider's io:::start and io:::done probes. io:genunix::start enables several probes, including one in the bdev_strategy kernel routine:

nv98> dtrace -n 'io:genunix::start'
dtrace: description 'io:genunix::start' matched 3 probes
CPU     ID                    FUNCTION:NAME
  0  22920              bdev_strategy:start
  0  22920              bdev_strategy:start
  0  22920              bdev_strategy:start

io:genunix::done enables a probe in biodone:

nv98> dtrace -n 'io:genunix::done'
dtrace: description 'io:genunix::done' matched 1 probe
CPU     ID                    FUNCTION:NAME
  0  22908                    biodone:done

The actions in the DTrace script gather timestamps and block numbers at each firing (assuming the predicates evaluate true).

> In looking at the data gotten from the DTrace script I see no way to
> generate the graph of Figure 4.7. Specifically the time difference
> between "0" and the points for "strategy". The time values on the X
> axis also don't fall out of the data generated by the DTrace script.

The script in the book generates output that looks like this:

nv98> ./iot.d
122065039977,2100,
122065039980,,2100
122070310632,72,
122070310637,,72
122070310833,81,
122070310836,,81
. . .

The value on the left is the timestamp, the value on the right is the block number.
The data was imported into a spreadsheet, and the math was done on the start time and stop (done) time for each block, resulting in IO times on a per-block basis.

> How can it be determined if the IO completed from a memory cache or
> required an I/O to a physical disk? I have a lot of times less than
> 0.5 ms but also have a fair number that are in the range of 1 ms to
> 300 ms.

The lookup in the file system memory cache happens above the bio layer. Put another way, if we're hitting bdev_strategy, we need to do a disk IO to get the data (we already missed in the cache).

HTH,

Thanks,
/jim
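The spreadsheet math Jim describes can be sketched in a few lines. This is a minimal, hypothetical post-processor (not from the book) that pairs each start record with its done record by block number and computes the per-IO latency, assuming the two-record CSV format of the sample iot.d output quoted above ("timestamp,block," for start, "timestamp,,block" for done), with timestamps in nanoseconds:

```python
def io_times(lines):
    """Return (block, latency_ns) pairs from iot.d-style output lines."""
    pending = {}   # block number -> start timestamp
    results = []
    for line in lines:
        fields = line.strip().split(",")
        ts = int(fields[0])
        if fields[1]:                     # "ts,block,"  -> start record
            pending[int(fields[1])] = ts
        else:                             # "ts,,block"  -> done record
            block = int(fields[2])
            if block in pending:
                results.append((block, ts - pending.pop(block)))
    return results

# The first two start/done pairs from the sample output above:
sample = [
    "122065039977,2100,",
    "122065039980,,2100",
    "122070310632,72,",
    "122070310637,,72",
]
print(io_times(sample))   # [(2100, 3), (72, 5)]
```

Note this pairs by block number only, matching Paul's sanity-check observation that there should always be one start/end pair per block address; concurrent IOs to the same block would need a smarter key.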
James Litchfield
2008-Dec-01 04:17 UTC
[dtrace-discuss] Concurrent Disk Activity from "Solaris Performance & Tools" Chap. 4.17.4
Paul Clayton wrote:

> If we talk with the SAN vendor they say the SAN is not a problem and
> we should look elsewhere. Talk with the fabric folks and they say no
> problems, look elsewhere.

They always do until you show them otherwise.

> The graph mentions "strategy" and "biodone", which seem to imply
> TNF-based data, not output from the DTrace script above Figure 4.7.

Simply put, when a physical IO is started (which may be connected to a read or write from a program, or from some other place), the 'strategy' routine is called for the driver underlying the file system the program is writing to. The driver may be for a real device (e.g., ssd) or for a pseudo device (e.g., a Solaris Volume Manager device). The strategy routine figures out what to do to satisfy the IO request.
If the device is a pseudo device, the strategy routine calls into underlying strategy routines (which may themselves be pseudo devices; tracking IOs on a clustered system with SVM devices takes you through the md layer, the did layer, and the underlying device layer). biodone is a kernel convenience function typically invoked when the physical IO is actually finished. TNF has its hooks in the same places, but you should ignore TNF.

> How can it be determined if the IO completed from a memory cache or
> required an I/O to a physical disk? I have a lot of times less than
> 0.5 ms but also have a fair number that are in the range of 1 ms to
> 300 ms.

DTrace provides the io provider, which has hooks in various kernel functions related to handling physical disk IO. io:::start is invoked when a *physical* IO is started. It is not invoked when the IO request is satisfied in cache.
Jim Mauro
2008-Dec-01 04:19 UTC
[dtrace-discuss] Concurrent Disk Activity from "Solaris Performance & Tools" Chap. 4.17.4
Hey Paul - I should add that "iostat -xnz 1" is a great way to determine how well the SAN is performing. The asvc_t times are disk IO service times in milliseconds. I usually start there to sanity check disk IO times...

Thanks,
/jim
James Litchfield
2008-Dec-01 04:34 UTC
[dtrace-discuss] Concurrent Disk Activity from "Solaris Performance & Tools" Chap. 4.17.4
Keep in mind that the service times are *averages* over the interval. The average for the data in his spreadsheet would look pretty nice, but would obscure the two pretty bad times he saw...

Jim
---
Jim Mauro wrote:
> Hey Paul - I should add that "iostat -xnz 1" is a great way to
> determine how well the SAN is performing.
> The asvc_t times are disk IO service times in milliseconds.
> I usually start there to sanity check disk IO times...
>
> Thanks,
> /jim
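James's point about interval averages hiding outliers is easy to demonstrate with made-up numbers. This hypothetical sketch (the latencies are invented, not from Paul's spreadsheet) mimics a workload that is mostly sub-millisecond with two bad outliers, the shape Paul reported:

```python
# 98 fast IOs plus two outliers, roughly the pattern described in the
# thread (mostly < 0.5 ms, a few in the hundreds of ms). Values in ms.
latencies_ms = [0.4] * 98 + [250.0, 300.0]

mean = sum(latencies_ms) / len(latencies_ms)
worst = max(latencies_ms)

print(f"mean  = {mean:.1f} ms")    # mean  = 5.9 ms
print(f"worst = {worst:.1f} ms")   # worst = 300.0 ms
```

An interval average like iostat's asvc_t behaves like `mean` here: it can look tolerable while the worst IOs are two orders of magnitude slower, which is why the per-IO DTrace data is more revealing.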
Clayton, Paul D
2008-Dec-01 04:46 UTC
[dtrace-discuss] Concurrent Disk Activity from "Solaris Performance & Tools" Chap. 4.17.4
Jim, James..

My thanks for the info! The fog is starting to lift. It is that averaging from iostat and other utilities, which has been making it hard to track down the actual activities taking place, that forced me to DTrace.

One last question about the graph in Figure 4.7. I understand the bit about dealing with start/end times to figure out the length of time; I did that in my spreadsheet. The part of the graph that is still confusing me is the offset of the 'strategy' line from 0 (zero) time. What is involved in figuring out that ~100us offset from 0 for a bunch of block addresses? Where would THAT time offset come from with the DTrace script?

Take care.

pdc

-----Original Message-----
From: James.Litchfield at Sun.COM [mailto:James.Litchfield at Sun.COM]
Sent: Sunday, November 30, 2008 11:35 PM
To: Jim Mauro
Cc: Clayton, Paul D; dtrace-discuss at opensolaris.org
Subject: Re: [dtrace-discuss] Concurrent Disk Activity from "Solaris Performance & Tools" Chap. 4.17.4

Keep in mind that the service times are *averages* over the interval. The average for the data in his spreadsheet would look pretty nice, but would obscure the two pretty bad times he saw...

Jim
---
Jim Mauro wrote:
> Hey Paul - I should add that "iostat -xnz 1" is a great way to
> determine how well the SAN is performing.
> The asvc_t times are disk IO service times in milliseconds.
> I usually start there to sanity check disk IO times...
>
> Thanks,
> /jim
Jim Mauro
2008-Dec-07 02:13 UTC
[dtrace-discuss] Concurrent Disk Activity from "Solaris Performance & Tools" Chap. 4.17.4
> I did that in my spreadsheet. The part of the graph that is still
> confusing me is the offset of the 'strategy' line from 0 (zero) time.
> What is involved in figuring out that ~100us offset from 0 for a bunch
> of block addresses? Where would THAT time offset come from with the
> DTrace script?

I'm not actually sure, but I'm guessing it's that all the first, outer-cylinder blocks hit on the same track, not inducing an electro-mechanical delay (no seeking; just the heads sitting over the right cylinders while the disk spins).

Thanks,
/jim
Clayton, Paul D
2008-Dec-08 04:00 UTC
[dtrace-discuss] Concurrent Disk Activity from "Solaris Performance & Tools" Chap. 4.17.4
James..

Maybe. I would have thought that what you mention would be more in line with what it takes to go from the strategy point to the equivalent biodone point. Still wondering about that offset. And its not being a perfectly vertical line just adds to the confusion.

Thanks.

pdc

-----Original Message-----
From: James.Mauro at Sun.COM [mailto:James.Mauro at Sun.COM]
Sent: Saturday, December 06, 2008 9:14 PM
To: Clayton, Paul D
Cc: James.Litchfield at Sun.COM; dtrace-discuss at opensolaris.org
Subject: Re: [dtrace-discuss] Concurrent Disk Activity from "Solaris Performance & Tools" Chap. 4.17.4

> I did that in my spreadsheet. The part of the graph that is still
> confusing me is the offset of the 'strategy' line from 0 (zero) time.
> What is involved in figuring out that ~100us offset from 0 for a bunch
> of block addresses? Where would THAT time offset come from with the
> DTrace script?

I'm not actually sure, but I'm guessing it's that all the first, outer-cylinder blocks hit on the same track, not inducing an electro-mechanical delay (no seeking; just the heads sitting over the right cylinders while the disk spins).

Thanks,
/jim