I'm getting my first exposure to a DDN storage array and it's enlightening :-}

Mostly it just does what I expected of it with little trouble, but I'm having a
hard time getting the read side working as fast as it ought to be able to go.
Does anybody have experience they'd like to share, tuning the kernel/driver or
the array?

I'm using a 2.6.15 kernel and QLogic 2462 HBAs with the 8.01.07 driver. Using
the anticipatory scheduler, and tweaking up the readahead size for the blockdev,
I can get around 300MB/s by using 4 threads on a port, or about 3/4 of the
expected max. Writes max out easily. The DDN's stats say that the large
majority of my reads are only 256K, even though the requests are larger than
that.

I tried incorporating the blkdev-max-io-size-selection and
increase-sglist-size patches from CFS, but that didn't really help; my reads
are still maxing out at 256K.

If anybody's been through this kind of thing and has experiences, rumors, or
war stories about what kinds of tuning in this area yield good results, I'd
love to talk to you!
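The readahead and scheduler settings mentioned above are per-blockdev knobs
that can be inspected and changed from the shell; a minimal sketch, assuming a
placeholder device /dev/sdb and purely illustrative values:

    # readahead, in 512-byte sectors (16384 sectors is about 8MB)
    blockdev --getra /dev/sdb
    blockdev --setra 16384 /dev/sdb

    # which elevator the queue is using, and switching it
    cat /sys/block/sdb/queue/scheduler
    echo deadline > /sys/block/sdb/queue/scheduler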
John R. Dunning wrote:
> I'm getting my first exposure to a DDN storage array and it's enlightening :-}
>
> Mostly it just does what I expected of it with little trouble, but I'm having
> a hard time getting the read side working as fast as it ought to be able to
> go. Does anybody have experience they'd like to share, tuning the
> kernel/driver or the array?
>
> I'm using a 2.6.15 kernel and QLogic 2462 HBAs with the 8.01.07 driver. Using
> the anticipatory scheduler, and tweaking up the readahead size for the
> blockdev, I can get around 300MB/s by using 4 threads on a port, or about 3/4
> of the expected max. Writes max out easily. The DDN's stats say that the
> large majority of my reads are only 256K, even though the requests are larger
> than that.
>
> I tried incorporating the blkdev-max-io-size-selection and
> increase-sglist-size patches from CFS, but that didn't really help; my reads
> are still maxing out at 256K.
>
> If anybody's been through this kind of thing and has experiences, rumors, or
> war stories about what kinds of tuning in this area yield good results, I'd
> love to talk to you!

I know this won't help you, but for posterity's sake: use IBGD 1.8.2 if you
have an InfiniBand DDN array. I tried for 4 days with OFED 1.1.1 to get decent
I/O performance and never could. 30 minutes to install IBGD and I was pushing
700MB/sec through the port using dd.
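The kind of dd run described here would look something like the following; the
device name, block size, and count are placeholders rather than the values
actually used, and iflag=direct requires a reasonably recent coreutils:

    # sequential read straight off the SRP block device, bypassing the page cache
    dd if=/dev/sdb of=/dev/null bs=1M count=4096 iflag=direct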
On Friday 18 May 2007 10:06:35 am Daniel Leaberry wrote:
> John R. Dunning wrote:
> > I'm getting my first exposure to a DDN storage array and it's
> > enlightening :-}
> >
> > Mostly it just does what I expected of it with little trouble, but I'm
> > having a hard time getting the read side working as fast as it ought to be
> > able to go. Does anybody have experience they'd like to share, tuning the
> > kernel/driver or the array?
> >
> > I'm using a 2.6.15 kernel and QLogic 2462 HBAs with the 8.01.07 driver.
> > Using the anticipatory scheduler, and tweaking up the readahead size for
> > the blockdev, I can get around 300MB/s by using 4 threads on a port, or
> > about 3/4 of the expected max. Writes max out easily. The DDN's stats say
> > that the large majority of my reads are only 256K, even though the
> > requests are larger than that.
> >
> > I tried incorporating the blkdev-max-io-size-selection and
> > increase-sglist-size patches from CFS, but that didn't really help; my
> > reads are still maxing out at 256K.
> >
> > If anybody's been through this kind of thing and has experiences, rumors,
> > or war stories about what kinds of tuning in this area yield good
> > results, I'd love to talk to you!
>
> I know this won't help you, but for posterity's sake: use IBGD 1.8.2 if
> you have an InfiniBand DDN array. I tried for 4 days with OFED 1.1.1 to
> get decent I/O performance and never could. 30 minutes to install IBGD
> and I was pushing 700MB/sec through the port using dd.

Do you have actual numbers for your OFED test? If so, please send a message to
the OpenFabrics General mailing list (general@lists.openfabrics.org) letting
them know of this performance degradation. The more details we have of a
slow-down in the SRP performance, the more chance we have of OFED finally
fixing whatever the problem is (or at least getting Mellanox to pony up what
the difference is between IBGD's and OFED's SRP client code and explain why
they haven't submitted changes).

-- 
Makia Minich <minich@ornl.gov>
National Center for Computation Science
Oak Ridge National Laboratory
Phone: 865.574.7460
--*--
Imagine no possessions
I wonder if you can
- John Lennon
In message <17997.38024.295869.482039@gs105.sicortex.com>, "John R. Dunning" writes:
> I tried incorporating the blkdev-max-io-size-selection and
> increase-sglist-size patches from cfs, but that didn't really help, my reads
> are still maxing out at 256K.

the srp initiator creates a virtual scsi device driver.  this virtual
device driver has a .max_sectors parameter associated with it.  you can
tune this with max_sect= during login for the openfabrics stack.
no idea how this is tuned on ibgold.

take a look at
/sys/block/sd<whatever>/queue/{max_hw_sectors_kb,max_sectors_kb}

if you aren't using direct i/o, use direct i/o.  you could just tune
the page size of the ddn to 256k.
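A sketch of what this looks like on the OFED stack; the SRP target identifiers
and device names below are placeholders, and max_sect support depends on the
ib_srp version in use:

    # what the block layer will currently build per request
    cat /sys/block/sdb/queue/max_hw_sectors_kb
    cat /sys/block/sdb/queue/max_sectors_kb

    # SRP login asking for larger transfers (2048 sectors = 1MB)
    echo "id_ext=...,ioc_guid=...,dgid=...,pkey=ffff,service_id=...,max_sect=2048" \
        > /sys/class/infiniband_srp/srp-mthca0-1/add_target

    # direct-I/O read, so the page cache and readahead are out of the picture
    dd if=/dev/sdb of=/dev/null bs=1M count=1024 iflag=direct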
How much luck did you have with this tuning and OFED's SRP? What performance
are you seeing? We had done quite a bit of testing playing with this option,
but saw very little improvement in performance (if I remember correctly, the
block sizes did increase, but performance was still down).

On Friday 18 May 2007 10:38:29 am chas williams - CONTRACTOR wrote:
> In message <17997.38024.295869.482039@gs105.sicortex.com>, "John R. Dunning" writes:
> > I tried incorporating the blkdev-max-io-size-selection and
> > increase-sglist-size patches from cfs, but that didn't really help, my
> > reads are still maxing out at 256K.
>
> the srp initiator creates a virtual scsi device driver.  this virtual
> device driver has a .max_sectors parameter associated with it.  you can
> tune this with max_sect= during login for the openfabrics stack.
> no idea how this is tuned on ibgold.
>
> take a look at
> /sys/block/sd<whatever>/queue/{max_hw_sectors_kb,max_sectors_kb}
>
> if you aren't using direct i/o, use direct i/o.  you could just tune
> the page size of the ddn to 256k.

-- 
Makia Minich <minich@ornl.gov>
National Center for Computation Science
Oak Ridge National Laboratory
Phone: 865.574.7460
--*--
Imagine no possessions
I wonder if you can
- John Lennon
Makia Minich wrote:
> How much luck did you have with this tuning and OFED's SRP? What performance
> are you seeing? We had done quite a bit of testing playing with this option,
> but saw very little improvement in performance (if I remember correctly, the
> block sizes did increase, but performance was still down).

That's what I saw as well. I eventually got great performance writing with
/dev/sg* devices by tuning srp_sg_tablesize (it defaults to 12, which sent
48KB I/Os to the array), but I could never get /dev/sd* devices to perform,
and reading was always stuck at 128KB I/Os no matter what I passed in to
srp_sg_tablesize.

> On Friday 18 May 2007 10:38:29 am chas williams - CONTRACTOR wrote:
> > In message <17997.38024.295869.482039@gs105.sicortex.com>, "John R. Dunning" writes:
> > > I tried incorporating the blkdev-max-io-size-selection and
> > > increase-sglist-size patches from cfs, but that didn't really help, my
> > > reads are still maxing out at 256K.
> >
> > the srp initiator creates a virtual scsi device driver.  this virtual
> > device driver has a .max_sectors parameter associated with it.  you can
> > tune this with max_sect= during login for the openfabrics stack.
> > no idea how this is tuned on ibgold.
> >
> > take a look at
> > /sys/block/sd<whatever>/queue/{max_hw_sectors_kb,max_sectors_kb}
> >
> > if you aren't using direct i/o, use direct i/o.  you could just tune
> > the page size of the ddn to 256k.
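For anyone trying to reproduce this, srp_sg_tablesize is a module parameter of
ib_srp; a minimal sketch of setting it, with the value chosen purely for
illustration:

    # load ib_srp with a larger scatter/gather table (the default was 12)
    modprobe ib_srp srp_sg_tablesize=255

    # or persistently, in /etc/modprobe.conf on kernels of that vintage:
    #   options ib_srp srp_sg_tablesize=255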
well... i suspect tuning the i/o sizes to be larger didn't make a big
difference on reads.  it helps to get the write to at least match the page
size of the ddn's memory cache (2MB as i recall, but this can be tuned to a
smaller value).  this will let most devices "write through" the memory cache
directly to disk.  as you get farther and farther away from your storage, you
need to increase the message size to offset bandwidth*delay.

after a bit of fiddling, we managed to get:

        Using Minimum Record Size 1024 KB
        Auto Mode 2. This option is obsolete. Use -az -i0 -i1
        O_DIRECT feature enabled
        Command line used: /data1/iozone.ia64 -f testfile -y 1024k -A -I
        Output is in Kbytes/sec
        Time Resolution = 0.000001 seconds.
        Processor cache size set to 1024 Kbytes.
        Processor cache line size set to 32 bytes.
        File stride size set to 17 * record size.

              KB  reclen    write  rewrite     read    reread
          524288    1024   270275   430822   354823    355126
          524288    2048   412733   545186   673329    679848
          524288    4096   533884   619260  1048551   1053483
          524288    8192   606596   665102  1201478   1192968
          524288   16384   662077   698136  1333838   1334341

this was a single host using 2 ddr adapters striped across 8 luns on the ddn.
each lun on the ddn was across 14 tiers.  obviously a 512MB test file fits
inside the ddn cache.  the ddn should be able to go faster, but my single
host couldn't push harder.

In message <200705181114.50119.minich@ornl.gov>, Makia Minich writes:
> How much luck did you have with this tuning and OFED's SRP? What performance
> are you seeing? We had done quite a bit of testing playing with this option,
> but saw very little improvement in performance (if I remember correctly, the
> block sizes did increase, but performance was still down).
>
> On Friday 18 May 2007 10:38:29 am chas williams - CONTRACTOR wrote:
> > In message <17997.38024.295869.482039@gs105.sicortex.com>, "John R. Dunning" writes:
> > > I tried incorporating the blkdev-max-io-size-selection and
> > > increase-sglist-size patches from cfs, but that didn't really help, my
> > > reads are still maxing out at 256K.
> >
> > the srp initiator creates a virtual scsi device driver.  this virtual
> > device driver has a .max_sectors parameter associated with it.  you can
> > tune this with max_sect= during login for the openfabrics stack.
> > no idea how this is tuned on ibgold.
> >
> > take a look at
> > /sys/block/sd<whatever>/queue/{max_hw_sectors_kb,max_sectors_kb}
> >
> > if you aren't using direct i/o, use direct i/o.  you could just tune
> > the page size of the ddn to 256k.
>
> --
> Makia Minich <minich@ornl.gov>
> National Center for Computation Science
> Oak Ridge National Laboratory
> Phone: 865.574.7460
> --*--
> Imagine no possessions
> I wonder if you can
> - John Lennon
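To make the "match the cache page size" point concrete, a direct-I/O write at
a 2MB block size would look roughly like this. The device name and count are
placeholders, and note that this overwrites whatever is on the device:

    # 2MB direct writes, sized to write through a 2MB cache segment
    dd if=/dev/zero of=/dev/sdb bs=2M count=512 oflag=direct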
On May 18, 2007  07:56 -0400, John R. Dunning wrote:
> I'm using a 2.6.15 kernel and QLogic 2462 HBAs with the 8.01.07 driver. Using
> the anticipatory scheduler, and tweaking up the readahead size for the
> blockdev, I

For a DDN you should probably use the noop or deadline scheduler.
Anticipatory is really tuned for desktop workloads.

> can get around 300MB/s by using 4 threads on a port, or about 3/4 of the
> expected max. Writes max out easily. The DDN's stats say that the large
> majority of my reads are only 256K, even though the requests are larger than
> that.

What tool are you using to measure performance?  I'd strongly suggest using
the lustre-iokit, which has several components in order to test the bare-disk,
local filesystem, network, and Lustre-filesystem layers independently.

Lustre can consistently generate 1MB IOs to the underlying filesystem because
it submits the IO in 1MB chunks, unlike the kernel's read() and write() calls,
which submit IO in 4kB chunks and hope the elevator can merge them.

See also the DDN tuning section in the Lustre manual.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
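For the archives: the bare-disk piece of the lustre-iokit is sgpdd-survey,
which is driven by environment variables. The exact variable names differ
between iokit versions, so treat this as an approximate sketch rather than a
recipe, and the sg device is a placeholder:

    # scan thread and region counts against one sg device
    size=8192 crghi=16 thrhi=64 scsidevs=/dev/sg0 ./sgpdd-survey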
From: Andreas Dilger <adilger@clusterfs.com>
Date: Fri, 18 May 2007 12:43:48 -0600

    On May 18, 2007  07:56 -0400, John R. Dunning wrote:
    > I'm using a 2.6.15 kernel and QLogic 2462 HBAs with the 8.01.07 driver.
    > Using the anticipatory scheduler, and tweaking up the readahead size for
    > the blockdev, I

    For a DDN you should probably use the noop or deadline scheduler.
    Anticipatory is really tuned for desktop workloads.

Yes, others have said the same thing. I've tried them both, but so far there's
not much difference. The evidence is that something in the block layer is
breaking up read requests, which seems to negate any effect I might be getting
from the iosched.

I found /proc/sys/vm/block_dump, added some extra instrumentation to it, and
turned it on. On the write side, I'm seeing nice big requests (though the
sizes are a bit all over the place), but on the read side it seems to be
willing to go up to 32 elements in the bio and never any higher. That
statement so far seems to be true regardless of what I use for readahead
values, what scheduler tuning params I give it, what kind of request size the
higher level thinks it's issuing, etc. It's behaving like there's something
with an arbitrary limit on the size of a read request, but I haven't yet
figured out what that is.

    > can get around 300MB/s by using 4 threads on a port, or about 3/4 of the
    > expected max. Writes max out easily. The DDN's stats say that the large
    > majority of my reads are only 256K, even though the requests are larger
    > than that.

    What tool are you using to measure performance?

Various. Mostly iozone and timing dd and stuff like that. I'm not (yet)
running Lustre against the DDN.

    I'd strongly suggest using the lustre-iokit, which has several components
    in order to test the bare-disk, local filesystem, network, and
    Lustre-filesystem layers independently.

Ok. I tried an older version of it last year, and it didn't seem to be telling
me anything I hadn't already found out by other means. EEB shipped me a newer
version, which I've unpacked and am currently trying to figure out how to
build. It seems to be set up such that I have to autoconf it, but trying to do
that causes errors. Hints?

    Lustre can consistently generate 1MB IOs to the underlying filesystem
    because it submits the IO in 1MB chunks, unlike the kernel's read() and
    write() calls, which submit IO in 4kB chunks and hope the elevator can
    merge them.

    See also the DDN tuning section in the Lustre manual.

Ok, will do. Thanks.
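For reference, the stock block_dump sysctl (without the extra instrumentation
mentioned above) logs one line per submitted request to the kernel log; the
device name below is a placeholder:

    echo 1 > /proc/sys/vm/block_dump    # noisy; remember to turn it back off
    dd if=/dev/sdb of=/dev/null bs=1M count=16 iflag=direct
    dmesg | tail    # lines roughly like: dd(1234): READ block 123456 on sdb
    echo 0 > /proc/sys/vm/block_dump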
From: "John R. Dunning" <jrd@sicortex.com> Date: Fri, 18 May 2007 15:10:45 -0400 It seems to be set up such that I have to autoconf it, but trying to do that causes errors. Hints? Ok, never mind, I realized that this version doesn''t contain ior, so it''s all shell scripts, no building required.