In order to get more information on IO performance problems I created the script below:

#!/usr/sbin/dtrace -s

#pragma D option flowindent

syscall::*write*:entry
/pid == $1 && guard++ == 0/
{
        self->ts = timestamp;
        self->traceme = 1;
        printf("fd: %d", arg0);
}

fbt:::
/self->traceme/
{
        /*
        this->elapsed = timestamp - self->ts;
        printf(" elapsed : %d", this->elapsed);
        */
        printf(" timestamp : %d", timestamp);
}

syscall::*write*:return
/self->traceme/
{
        self->traceme = 0;
        this->elapsed = timestamp - self->ts;
        printf(" timestamp : %d", timestamp);
        printf("\telapsed : %d", this->elapsed);
        exit(0);
}

It gives me the timestamp for every fbt call during a write system call. A snippet is below:

  8 <- schedctl_save     timestamp : 1627201334052600
  8 <- savectx           timestamp : 1627201334053000
  0 -> restorectx        timestamp : 1627202741110300  <-- difference = 1,407,057,300 ns
  0 -> schedctl_restore  timestamp : 1627202741115100
  0 <- schedctl_restore  timestamp : 1627202741116100

Visible is that the thread is off-CPU for 1.4 s. Storage is on a SAN, with Fibre Channel between the system and the storage.

Is it possible to dig deeper with DTrace to see how the HBAs are performing? Other suggestions are welcome too.

Regards, Hans-Peter
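For reference, rather than eyeballing fbt timestamps around savectx/restorectx, the off-CPU time itself can be measured directly with the sched provider. A minimal sketch, assuming the target pid is passed as $1 as in the script above:

#!/usr/sbin/dtrace -s

sched:::off-cpu
/pid == $1/
{
        self->off = timestamp;
}

sched:::on-cpu
/self->off/
{
        /* distribution of how long each off-CPU episode lasted, in ns */
        @offcpu["off-CPU time (ns)"] = quantize(timestamp - self->off);
        self->off = 0;
}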
Start with iostat. It's simple, and provides an average of service times for disk IOs (iostat -xnz 1; the asvc_t column is average service time in milliseconds). As Jim Litchfield pointed out in a previous thread, keep in mind it is an average, so you won't see nasty peaks, but if the average is bad, it's time to talk to the SAN/storage folks.

Use DTrace with the io provider to get more precise measurements. You can borrow from existing scripts in /usr/demo/dtrace to add time stamps, and use aggregations to capture IO times. If you need assistance with such a script, please let us know and we'll be glad to post one (it looks like you're already pretty comfortable with DTrace...).

Thanks,
/jim

Hans-Peter wrote:
> Is it possible to dig deeper with DTrace to see how the HBAs are performing?
> Other suggestions are welcome too.
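A minimal sketch of the kind of script Jim describes - the io provider plus an aggregation; the timing pattern follows /usr/demo/dtrace/iotime.d:

#!/usr/sbin/dtrace -s

io:::start
{
        /* arg0 is the buf pointer; key the start time on it */
        start[arg0] = timestamp;
}

io:::done
/start[arg0]/
{
        /* per-device distribution of physical IO times, in ns */
        @iotime[args[1]->dev_statname] = quantize(timestamp - start[arg0]);
        start[arg0] = 0;
}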
Hello Jim,

The iostat averages are not too bad, but %b (busy) is at 100% most of the time.

Regards, Hans-Peter
Hello all,

I added a clause to my script:

sysinfo:::
/self->traceme == 1 && pid == $1/
{
        trace(execname);
        printf("sysinfo: timestamp : %d", timestamp);
}

A subsequent trace created a file of about 19000 lines. I loaded it into Excel to be able to subtract timestamps etc. The longest jump in timestamp is between the first pair of savectx and restorectx, at line 70. I count 50 savectx/restorectx calls in my trace. But the actual physical write, as indicated by sysinfo, is almost at the end of the file, directly after the ufs_write and fop_write calls end (line 18746):

24 <- tsd_agent_get oracle timestamp 1795469946839100
24 -> ufs_lockfs_top_vop_return oracle timestamp 1795469946841500
24 <- ufs_lockfs_top_vop_return oracle timestamp 1795469946844300
24 <- ufs_lockfs_end oracle timestamp 1795469946846700
24 <- ufs_write oracle timestamp 1795469946849600
24 <- fop_write oracle timestamp 1795469946853500    57,365,700
24 | pwrite syswrite oracle sysinfo timestamp 1795469946856800
24 | pwrite writech oracle sysinfo timestamp 1795469946860200

So it seems that the actual write is not the bottleneck. I attached a zip file with the Excel document.

Am I right in thinking that this is more an OS issue than a storage issue?

Regards, Hans-Peter

(Attachment: 53811297.zip)
%busy is meaningless unless you're looking at a single disk that can only have 1 outstanding IO in its active queue, which is to say %busy is a useless metric for any disk that's been designed and built in the last decade. Ignore %busy.

Focus on queue depths and queue service times, both of which are reported in iostat. And again, averages can be misleading. Use the iostat data as a starting point, but if you really think you have some disk IO issues, use DTrace and aggregations to get more precise data on disk IO times.

Thanks,
/jim

Hans-Peter wrote:
> The iostat averages are not too bad, but %b (busy) is at 100% most of the time.
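A minimal sketch for watching queue depth with DTrace. The single global counter is an approximation, assuming io:::start and io:::done pair up cleanly:

#!/usr/sbin/dtrace -s

io:::start
{
        pending++;
        /* distribution of how many IOs were already outstanding
           at the moment each new IO was issued */
        @depth["outstanding IOs at issue"] = quantize(pending);
}

io:::done
/pending > 0/
{
        pending--;
}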
It seems VOP_RWLOCK takes a long time. "ufs_rwlock" starts at line 12, but ends at line 15611. Is there another application that keeps reading/writing in your system?

Hans-Peter wrote:
> So it seems that the actual write is not the bottleneck. I attached a zip file with the Excel document.
> Am I right in thinking that this is more an OS issue than a storage issue?
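To put numbers on that suspicion, a minimal fbt sketch timing ufs_rwlock acquisition - treating the function name, which appears in the trace above, as matching this kernel build:

#!/usr/sbin/dtrace -s

fbt:ufs:ufs_rwlock:entry
/execname == "oracle"/
{
        self->rw = timestamp;
}

fbt:ufs:ufs_rwlock:return
/self->rw/
{
        /* time spent acquiring the per-file reader/writer lock, in ns */
        @rwlock["ufs_rwlock wait (ns)"] = quantize(timestamp - self->rw);
        self->rw = 0;
}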
The sysinfo provider isn't the best choice for measuring disk IO times. Run:

# dtrace -s /usr/demo/dtrace/iotime.d

/jim

Hans-Peter wrote:
> Am I right in thinking that this is more an OS issue than a storage issue?
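A variant sketch that aggregates by file rather than printing every IO, which makes it easy to separate redo logs from datafiles (fi_pathname comes from the io provider's fileinfo_t):

#!/usr/sbin/dtrace -s

io:::start
{
        start[arg0] = timestamp;
}

io:::done
/start[arg0]/
{
        /* average physical IO time per file, in ns */
        @byfile[args[2]->fi_pathname] = avg(timestamp - start[arg0]);
        start[arg0] = 0;
}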
Also - since this is Oracle, are the Oracle files on a file system, or raw devices? If a file system, which one?

/jim
Hi Lejun,

No, there is only 1 Oracle database running on this system.

Regards, Hans-Peter

> It seems VOP_RWLOCK takes a long time. "ufs_rwlock" starts at line 12, but ends at line 15611.
> Is there another application that keeps reading/writing in your system?
Hi Mauro,

Yes, I understand why sysinfo is not the best way to measure IO. But I just wanted to see when, in the whole trace, the actual physical write was being done. So it seems to me that, because the sysinfo:::pwrite is right at the end, the performance bottleneck is more the locking etc.

The database files are on UFS filesystems. Is ZFS better?

Regards, Hans-Peter
Do you have directio enabled on UFS? Especially for the redo logs?

With directio enabled, UFS writes to the log do not serialize on the RW lock for the log file(s).

directio will also bypass the memory cache, so you need to increase the Oracle db_block_buffers when enabling UFS directio.

Thanks,
/jim

Hans-Peter wrote:
> The database files are on UFS filesystems. Is ZFS better?
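One way to check whether directio is actually being used is to watch the UFS direct write path with fbt - assuming a kernel where that path goes through ufs_directio_write (treat the function name as an assumption for your build):

# dtrace -n 'fbt:ufs:ufs_directio_write:entry { @[execname, pid] = count(); }'

If the log writer's pid never shows up here while redo is being written, its writes are going through the buffered path.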
> Do you have directio enabled on UFS?
> Especially for the redo logs?

That is the strange thing. filesystemio_options has been set to setall, which is async and directio. But when I dtrace the fbt calls I see only directio calls for the database writer, not for the log writer.

Perhaps we should mount the filesystems with forcedirectio. Would noatime also help?

Regards, Hans-Peter
On Thu, Dec 04, 2008 at 04:55:04AM -0800, Hans-Peter wrote:
> That is the strange thing. filesystemio_options has been set to setall, which is async and directio.
> But when I dtrace the fbt calls I see only directio calls for the database writer, not for the log writer.
> Perhaps we should mount the filesystems with forcedirectio.

Forcedirectio forces direct IO on _all_ files, while 'setall' lets the database decide which files should be accessed with direct options. It (setall) means the database has the ability to write to some files with buffered access, e.g. log files (which is better). Forcedirectio disables that choice, and the database is forced to use direct access for all files.

So if you have Oracle, use setall. If you have a non-Oracle database without a setall-like option - use forcedirectio.

> Would noatime also help?

Usually setall/forcedirectio gives you enough.

Regards
Przemyslaw Bak (przemol)
--
http://przemol.blogspot.com/
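For illustration, the two places this gets configured. filesystemio_options is a real Oracle init parameter; the vfstab device names below are purely hypothetical:

# Oracle-managed direct IO (preferred for Oracle): init.ora / spfile
filesystemio_options = SETALL

# Filesystem-forced direct IO (all files, all processes): /etc/vfstab
/dev/md/dsk/d10  /dev/md/rdsk/d10  /u01  ufs  1  yes  forcedirectio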
Hi Hans-Peter,

There is an Oracle bug, not fixed until 10.2.0.4 (if I remember right), where filesystemio_options is broken and so it won't do the correct thing with directio and UFS. The locks you are seeing are classic single-writer-lock issues and indicate that you aren't getting directio. The Oracle bug number is Oracle Bug ID 5752399 (Oracle setall w/ UFS). You might check to be sure you have the fix for that. The only workaround is to use "forcedirectio" in the mount options, which, as has been pointed out, has other issues for some activities, so getting the bug fixed would be a better idea.

Regards,
Dave Miller

Hans-Peter wrote, On 12/04/08 06:55:
> filesystemio_options has been set to setall, which is async and directio.
> But when I dtrace the fbt calls I see only directio calls for the database writer, not for the log writer.
For the record, my friend Phil Harman reminded me that it's not the log files we care about for directio in terms of single-writer lock break-up. We care about directio for redo logs to avoid read-modify-write, which happens when the write is not memory-page aligned. Sorry about that.

Nonetheless, you must still pursue getting UFS directio enabled and working on your system (dealing with the single-writer lock problem is still important, it's just not directly applicable to the redo logs).

Thanks,
/jim
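A minimal sketch for spotting the unaligned writes that would trigger read-modify-write. The 8192-byte boundary is an assumption (the UFS block size here), and the fds[] array requires a DTrace build that provides it:

#!/usr/sbin/dtrace -s

syscall::pwrite:entry
/execname == "oracle" && (arg3 % 8192 != 0 || arg2 % 8192 != 0)/
{
        /* arg2 = size, arg3 = file offset; count writes that do not
           start and end on a filesystem block boundary */
        @unaligned[fds[arg0].fi_pathname] = count();
}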
Ok, thanks a lot so far.

We have planned downtime for remounting the filesystems this weekend. We will see what happens.

Regards, Hans-Peter
On Fri, Dec 05, 2008 at 05:40:19AM -0800, Hans-Peter wrote:
> We have planned downtime for remounting the filesystems this weekend.
> We will see what happens.

You can remount it online:

mount -o remount,forcedirectio /<mount point>
mount -o remount,noforcedirectio /<mount point>

Regards
Przemyslaw Bak (przemol)
--
http://przemol.blogspot.com/
If you just do the remount, and do NOT increase db_block_buffers, there is a better-than-average chance you'll see poor (worse) performance, and you'll end up setting the mounts back to non-directio, thinking you need to look elsewhere for the problem.

Why? Because without directio, the UFS data is getting cached in memory, and your workload _may_ be benefitting from hitting in the file system page cache (on reads, and non-sync writes, although I'm not sure if Oracle does non-sync writes). By enabling directio across the board, you're turning off caching for all the file systems, which means all those memory page cache hits are going to turn into physical disk IOs, which, as you know, take about 1000 to 10000 times longer (i.e. reading a disk takes at least 1 millisecond; getting data from memory takes less than 1 microsecond). That is why it is necessary to tell Oracle to increase its memory cache, so the data that was in the kernel's page cache will get cached by Oracle.

There are rare cases where the page cache was not a contributing factor to the performance of the workload, but that's less than 5% in my experience.

/jim

Hans-Peter wrote:
> We have planned downtime for remounting the filesystems this weekend.
> We will see what happens.
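For illustration: on Oracle 9i and later the buffer cache is usually sized with db_cache_size rather than db_block_buffers. The 8G figure below is purely hypothetical; size it to absorb what the kernel page cache was holding:

SQL> alter system set db_cache_size = 8G scope=spfile;
SQL> shutdown immediate
SQL> startup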
The buffer cache was already huge, so I decided not to increase it. There is a KEEP pool of 5G which is hardly used; if needed I will sacrifice this cache and add it to the DEFAULT cache.

Until now it looks promising: the average log file sync wait time has dropped from about 70 ms to 7 ms. But we will see how things develop.

Thanks so far to everybody who helped.

Regards, Hans-Peter