Shane, I've seen a couple of references to ORNL using xdd versus sgp_dd for low-level disk performance benchmarking. Could you please summarize the differences and advise whether our engineering team and Lustre partners should be considering this alternative? Thanks, Bojanic
On Sat, 2008-05-03 at 11:54 -0700, Peter Bojanic wrote:
> I've seen a couple of references to ORNL using xdd versus sgp_dd for
> low-level disk performance benchmarking. Could you please summarize
> the differences and advise whether our engineering team and Lustre
> partners should be considering this alternative?

We originally started using xdd for testing because it has features that make it easy to synchronize runs involving multiple hosts -- this is important for the testing we've been doing against LSI's XBB-2 system and DDN's 9900. For example, the 9900 was able to hit ~1550 to 1600 MB/s against a single IB port, but each singlet topped out at ~2650 to 2700 MB/s or so when hit by two hosts. Getting realistic aggregate numbers for both systems requires that we hit them with four IO hosts or OSSes.

When run in direct IO (-dio) mode against the SCSI disk device on recent kernels, xdd takes a path very similar to Lustre's use case -- building up bio's and using submit_bio() directly, without going through the page cache and triggering the read-ahead code and its associated problems. In this mode, xdd gave us an aggregate bandwidth of ~5500 MB/s, which matched up nicely with the ~5000 MB/s we obtained from an IOR run against a Lustre filesystem on the same hardware. We saw the expected ~10% hit from the filesystem versus raw disk.

In contrast, sgp_dd gave us ~1100 MB/s from a single port, which would indicate a maximum of 4400 MB/s from the array assuming perfect scaling. That would mean the filesystem result was 113.6% of raw performance, which doesn't sit well.

That said, there are a few caveats to using xdd -- the largest being that it does not issue perfectly sequential requests when run with a queue depth greater than 1. It uses multiple threads when it wants more than one request in flight, which leads to requests that are generally ascending but not perfectly sequential. This can cause performance regressions when the array does not internally reorder requests.

It is only possible to run xdd in direct IO mode against block devices on recent kernels -- 2.6.23, I believe, is the cutoff. On older kernels it must go through the page cache, and that may cause lower performance to be measured.

Aborted shutdowns of xdd will often leave SysV semaphores orphaned, which require manual cleanup once you hit the system limit.

It looks like it should be possible to run xdd in a manner suitable for sgpdd-survey, so that we could run tests against multiple regions of the disk at the same time. I've not spent any time looking closely at that option.

I'm not sure why sgp_dd was getting lower numbers on the 2.6.24 kernel I was testing against -- there may be a performance regression with the SCSI generic devices.

Hope this helps; feel free to ask further questions.
-- 
Dave Dillow
National Center for Computational Science
Oak Ridge National Laboratory
(865) 241-6602 office
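
For reference, the -dio path described above amounts to opening the block device with O_DIRECT and issuing aligned transfers, so the page cache and read-ahead never get involved. A minimal sketch of that pattern, assuming a 4 KiB-aligned buffer and a placeholder device path (this is not xdd's actual code; the request size and count are arbitrary):

    /* Minimal O_DIRECT read loop against a block device -- illustrative
     * only, not xdd's actual code.  /dev/sdX, the request size, and the
     * request count are placeholders. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        const char *dev = "/dev/sdX";       /* hypothetical block device */
        const size_t reqsize = 1024 * 1024; /* 1 MiB per request */
        const long nreqs = 1024;            /* 1 GiB total */
        void *buf;
        int fd;
        long i;

        /* O_DIRECT needs an aligned buffer; 4096 covers common sector sizes. */
        if (posix_memalign(&buf, 4096, reqsize)) {
            perror("posix_memalign");
            return 1;
        }

        fd = open(dev, O_RDONLY | O_DIRECT);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        for (i = 0; i < nreqs; i++) {
            ssize_t n = pread(fd, buf, reqsize, (off_t)i * (off_t)reqsize);
            if (n != (ssize_t)reqsize) {
                fprintf(stderr, "short read at request %ld\n", i);
                break;
            }
        }

        close(fd);
        free(buf);
        return 0;
    }

With a single thread like this, the requests really are strictly sequential; the caveat about "generally ascending" offsets only appears once multiple threads are used to keep more requests in flight.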
Dave, thanks for the great response -- this could easily be elaborated into a short LCE whitepaper, btw. I look forward to hearing from Andreas, Alex and other Lustre engineers on this.

Bojanic

On 4-May-08, at 17:40, David Dillow <dillowda at ornl.gov> wrote:
> [...]
On May 04, 2008 21:45 -0300, Peter Bojanic wrote:
> Dave, thanks for the great response -- this could easily be elaborated
> into a short LCE whitepaper, btw.
>
> I look forward to hearing from Andreas, Alex and other Lustre
> engineers on this.

I haven't personally been using sgp_dd or xdd very much, but the requirement for kernels >= 2.6.23 pretty much rules this out for use at most of our customers, since the latest vendor kernel (RHEL5) is based on 2.6.18.

As for the issue of multi-threaded processes not issuing perfectly sequential IO, that is fine as well: the way we use sgp_dd already has similar issues, and the same is true of Lustre OSTs.
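
To make the "generally ascending, but not perfectly sequential" behaviour concrete: when each thread pulls its next offset from a shared counter, the offsets are assigned in order, but the threads race to submit them, so the stream the array sees is interleaved. A rough sketch of that pattern (illustrative only, not xdd's or sgp_dd's actual implementation; the device path, request size, and thread count are placeholders):

    /* Sketch of "queue depth via threads": offsets are handed out in
     * ascending order from a shared counter, but the threads race to
     * submit them, so the request stream is not strictly sequential.
     * Illustrative only -- /dev/sdX, sizes, and thread count are
     * placeholders. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define NTHREADS 4                     /* "queue depth" */
    #define REQSIZE  (1024 * 1024)         /* 1 MiB per request */
    #define NREQS    1024                  /* 1 GiB total */

    static const char *dev = "/dev/sdX";   /* hypothetical block device */
    static long next_req;                  /* shared request counter */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg)
    {
        void *buf;
        int fd;

        (void)arg;
        fd = open(dev, O_RDONLY | O_DIRECT);
        if (fd < 0 || posix_memalign(&buf, 4096, REQSIZE) != 0)
            return NULL;

        for (;;) {
            long req;

            pthread_mutex_lock(&lock);
            req = next_req++;              /* offsets ascend globally ... */
            pthread_mutex_unlock(&lock);
            if (req >= NREQS)
                break;

            /* ... but submission order interleaves across threads, so the
             * device sees an ascending yet not perfectly sequential stream. */
            if (pread(fd, buf, REQSIZE, (off_t)req * REQSIZE) < 0)
                break;
        }

        free(buf);
        close(fd);
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];
        int i;

        for (i = 0; i < NTHREADS; i++)
            pthread_create(&tid[i], NULL, worker, NULL);
        for (i = 0; i < NTHREADS; i++)
            pthread_join(tid[i], NULL);
        return 0;
    }

Unless the array reorders requests internally, those small inversions in the stream are what can cost bandwidth relative to a strictly sequential, single-threaded run.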
Cheers, Andreas
-- 
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.