Anand Bisen
2008-Mar-22 18:55 UTC
[Lustre-discuss] Slow performance on the first IOR / iozone (Lustre)
Hi,

I have been trying to understand why successive executions of iozone (and also of dd and IOR) produce increasing performance from the first run to the third run. After the third run the performance is constant. So when a new OSS is put into service, or when an OSS server is rebooted and rejoins the Lustre file system, this characteristic is visible for each OST that it serves. Once the OSTs reach peak performance, that performance is consistent, but once the OSTs are unmounted and mounted back, the same escalating performance is seen again. Once an OST has been primed, mounting from a new client also gets the same peak performance. This characteristic is only visible for writes, not for reads.

We have tried multiple scenarios, from a single OST up to multiple OSTs. We are using "noop" as the I/O scheduler on our OSS servers.

--- Example with dd ---

mds2:/mnt/lustre/test # lfs setstripe foo 0 2 1    (we have tried each OST here, same effect)
mds2:/mnt/lustre/test # dd if=/dev/zero of=foo bs=1048576 count=8192
8192+0 records in
8192+0 records out
8589934592 bytes (8.6 GB) copied, 43.7827 seconds, 196 MB/s
mds2:/mnt/lustre/test # dd if=/dev/zero of=foo bs=1048576 count=8192
8192+0 records in
8192+0 records out
8589934592 bytes (8.6 GB) copied, 33.153 seconds, 259 MB/s
mds2:/mnt/lustre/test # dd if=/dev/zero of=foo bs=1048576 count=8192
8192+0 records in
8192+0 records out
8589934592 bytes (8.6 GB) copied, 32.9747 seconds, 261 MB/s
mds2:/mnt/lustre/test # dd if=/dev/zero of=foo bs=1048576 count=8192
8192+0 records in
8192+0 records out
8589934592 bytes (8.6 GB) copied, 29.3566 seconds, 293 MB/s
mds2:/mnt/lustre/test # dd if=/dev/zero of=foo bs=1048576 count=8192
8192+0 records in
8192+0 records out
8589934592 bytes (8.6 GB) copied, 22.23 seconds, 386 MB/s
mds2:/mnt/lustre/test # dd if=/dev/zero of=foo bs=1048576 count=8192
8192+0 records in
8192+0 records out
8589934592 bytes (8.6 GB) copied, 22.236 seconds, 386 MB/s
mds2:/mnt/lustre/test # dd if=/dev/zero of=foo bs=1048576 count=8192
8192+0 records in
8192+0 records out
8589934592 bytes (8.6 GB) copied, 22.2838 seconds, 385 MB/s

--- Example with iozone ---

client-mnt # iozone -i 0 -s 10g -r 1m -t 2
        File size set to 10485760 KB
        Record Size 1024 KB
        Command line used: iozone -i 0 -s 10g -r 1m -t 2
        Output is in Kbytes/sec
        Time Resolution = 0.000001 seconds.
        Processor cache size set to 1024 Kbytes.
        Processor cache line size set to 32 bytes.
        File stride size set to 17 * record size.
        Throughput test with 2 processes
        Each process writes a 10485760 Kbyte file in 1024 Kbyte records

        Children see throughput for 2 initial writers = 151909.33 KB/sec
        Parent sees throughput for 2 initial writers  = 150657.64 KB/sec
        Min throughput per process                    =  75476.66 KB/sec
        Max throughput per process                    =  76432.67 KB/sec
        Avg throughput per process                    =  75954.66 KB/sec
        Min xfer                                      = 10354688.00 KB

        Children see throughput for 2 rewriters       = 249051.45 KB/sec
        Parent sees throughput for 2 rewriters        = 247016.15 KB/sec
        Min throughput per process                    = 124407.29 KB/sec
        Max throughput per process                    = 124644.16 KB/sec
        Avg throughput per process                    = 124525.72 KB/sec
        Min xfer                                      = 10465280.00 KB

####### Second RUN

        Children see throughput for 2 initial writers = 244629.49 KB/sec
        Parent sees throughput for 2 initial writers  = 242007.64 KB/sec
        Min throughput per process                    = 121223.02 KB/sec
        Max throughput per process                    = 123406.48 KB/sec
        Avg throughput per process                    = 122314.75 KB/sec
        Min xfer                                      = 10300416.00 KB

        Children see throughput for 2 rewriters       = 239316.48 KB/sec
        Parent sees throughput for 2 rewriters        = 238836.01 KB/sec
        Min throughput per process                    = 119017.80 KB/sec
        Max throughput per process                    = 120298.68 KB/sec
        Avg throughput per process                    = 119658.24 KB/sec
        Min xfer                                      = 10375168.00 KB

####### Third RUN

        Children see throughput for 2 initial writers = 245567.60 KB/sec
        Parent sees throughput for 2 initial writers  = 241539.16 KB/sec
        Min throughput per process                    = 121565.91 KB/sec
        Max throughput per process                    = 124001.69 KB/sec
        Avg throughput per process                    = 122783.80 KB/sec
        Min xfer                                      = 10279936.00 KB

        Children see throughput for 2 rewriters       = 240782.11 KB/sec
        Parent sees throughput for 2 rewriters        = 240390.69 KB/sec
        Min throughput per process                    = 119717.36 KB/sec
        Max throughput per process                    = 121064.75 KB/sec
        Avg throughput per process                    = 120391.05 KB/sec
        Min xfer                                      = 10370048.00 KB

Any insight on explaining this behavior would be really appreciated.

Thanks
Anand
Oleg Drokin
2008-Mar-23 00:26 UTC
[Lustre-discuss] Slow performance on the first IOR / iozone (Lustre)
Hello!

On Mar 22, 2008, at 2:55 PM, Anand Bisen wrote:
> I have been trying to understand why successive execution of iozone
> (also with dd and IOR) produces increasing performance from first run
> to third run. After the third run the performance is constant.

I believe what you are seeing is a direct result of the bitmap blocks not being read at mount time. It takes some time to cache the bitmaps (and other data structures) into memory, and this happens during writes, which hinders performance. Once the bitmaps are all read, performance stabilizes.

Hopefully Alex or Andreas can comment further on this, or perhaps remember a bug number; Cray filed one some time ago, I think.

Bye,
    Oleg
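P.S. For a rough sense of scale (a back-of-the-envelope sketch only; the 1 TB OST size and 4 KB block size below are assumptions, not figures from Anand's system): each ldiskfs block group covers 128 MB and carries one 4 KB block bitmap, so a 1 TB OST has on the order of 8192 bitmap blocks, roughly 32 MB of metadata scattered across the device that gets pulled in as small random reads during the first writes.

    # Back-of-the-envelope only; OST size and block size are assumptions.
    ost_bytes=$((1024 * 1024 * 1024 * 1024))   # 1 TB OST (example)
    group_bytes=$((32768 * 4096))              # 32768 blocks/group * 4 KB blocks = 128 MB/group
    groups=$((ost_bytes / group_bytes))        # ~8192 block groups
    echo "$groups groups, $((groups * 4096 / 1024 / 1024)) MB of block bitmaps"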
Cédric Lambert
2008-Mar-25 09:37 UTC
[Lustre-discuss] Slow performance on the first IOR / iozone (Lustre)
Hello,

I am not a kernel hacker ;-) and I would like to understand something: does this mean that once the Lustre client is mounted, I have to issue some I/O to force the bitmaps into cache, or do I have to wait a certain time before making any writes? If yes, can I monitor this information under /proc?

Thanks
Cédric

Oleg Drokin wrote:
> I believe what you are seeing is a direct result of the bitmap blocks
> not being read at mount time. It takes some time to cache the bitmaps
> (and other data structures) into memory, and this happens during
> writes, which hinders performance. Once the bitmaps are all read,
> performance stabilizes.
>
> Hopefully Alex or Andreas can comment further on this, or perhaps
> remember a bug number; Cray filed one some time ago, I think.
>
> Bye,
>     Oleg
Oleg Drokin
2008-Mar-25 14:37 UTC
[Lustre-discuss] Slow performance on the first IOR / iozone (Lustre)
Hello!

You are certainly not obliged to do any I/O to force the bitmaps in: Lustre will continue to work, it will just pay some penalty until all the caches are populated.

You can certainly prefetch that data sooner as a workaround, with zero kernel hacking required, by e.g. running dumpe2fs on every device holding a Lustre backend fs after the Lustre servers are started (it is important to do this after mount, since the kernel discards all block device data on the last close of the device).

I am not aware of any place in /proc where you can track the current status of fs-specific cached metadata, and it may also change over time; e.g. for a filesystem that is not in use, the blocks might be forced out of memory.

Bye,
    Oleg

On Mar 25, 2008, at 5:37 AM, Cédric Lambert wrote:
> Hello,
>
> I am not a kernel hacker ;-) and I would like to understand something:
> does this mean that once the Lustre client is mounted, I have to issue
> some I/O to force the bitmaps into cache, or do I have to wait a
> certain time before making any writes? If yes, can I monitor this
> information under /proc?
>
> Thanks
> Cédric
>
> Oleg Drokin wrote:
>> I believe what you are seeing is a direct result of the bitmap blocks
>> not being read at mount time. It takes some time to cache the bitmaps
>> (and other data structures) into memory, and this happens during
>> writes, which hinders performance. Once the bitmaps are all read,
>> performance stabilizes.
>>
>> Hopefully Alex or Andreas can comment further on this, or perhaps
>> remember a bug number; Cray filed one some time ago, I think.
>>
>> Bye,
>>     Oleg
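P.S. A minimal sketch of that workaround (the device names below are placeholders; use whatever block devices your OSTs actually sit on):

    # Run on each OSS *after* the Lustre targets are mounted; otherwise the
    # kernel drops the cached blocks on the last close of the device.
    # dumpe2fs only reads (superblock, group descriptors, bitmaps), so it is
    # safe on a mounted ldiskfs target.
    for dev in /dev/sdb /dev/sdc /dev/sdd; do    # placeholder device list
        dumpe2fs "$dev" > /dev/null 2>&1
    done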
Brian J. Murrell
2008-Mar-25 14:45 UTC
[Lustre-discuss] Slow performance on the first IOR / iozone (Lustre)
On Tue, 2008-03-25 at 10:37 -0400, Oleg Drokin wrote:
> You can certainly prefetch that data sooner as a workaround, with zero
> kernel hacking required, by e.g. running dumpe2fs on every device
> holding a Lustre backend fs after the Lustre servers are started (it is
> important to do this after mount, since the kernel discards all block
> device data on the last close of the device).

Cédric,

If nothing else, you could use this technique to confirm Oleg's theory that it is indeed bitmap caching that is producing your performance observations. Not that I doubt Oleg's expertise here; it would just be scientific data confirming your situation.

b.
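A hedged sketch of such a confirmation run (the device, mount point and OST index below are illustrative, not taken from Anand's configuration, and unmounting an OST while clients are connected means waiting through normal Lustre recovery when it comes back):

    # On the OSS: remount the OST so its on-disk metadata caches start cold.
    umount /mnt/ost0 && mount -t lustre /dev/sdb /mnt/ost0    # placeholder device/mount point

    # On a client: time a first write striped onto that OST (index 2, as in Anand's example).
    lfs setstripe cold 0 2 1
    dd if=/dev/zero of=cold bs=1048576 count=8192

    # On the OSS: remount again, but this time prefetch the metadata before any writes.
    umount /mnt/ost0 && mount -t lustre /dev/sdb /mnt/ost0
    dumpe2fs /dev/sdb > /dev/null 2>&1

    # On the client: repeat the timed write.
    lfs setstripe warm 0 2 1
    dd if=/dev/zero of=warm bs=1048576 count=8192

If the prefetched run starts much closer to the steady-state ~385 MB/s than to the ~196 MB/s of the cold run, that points at bitmap caching; if not, something else is warming up.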
Cédric Lambert
2008-Mar-25 16:43 UTC
[Lustre-discuss] Slow performance on the first IOR / iozone (Lustre)
Brian,

Your remark is completely justified: theory can be limited by external variables.

For example, we saw that there was a limit in the "ext2" file system: EXT2_MAX_GROUP_LOADED was used to prevent too many cached bitmaps from accumulating in memory. As Lustre is based on ext3, is there such a limit in ldiskfs? If so, we could still see poor performance with very large files across all OSTs (for example).

I will let Anand tell us whether the results are better on his benchmarks after a dumpe2fs on every device...

Cédric

Brian J. Murrell wrote:
> On Tue, 2008-03-25 at 10:37 -0400, Oleg Drokin wrote:
>> You can certainly prefetch that data sooner as a workaround, with zero
>> kernel hacking required, by e.g. running dumpe2fs on every device
>> holding a Lustre backend fs after the Lustre servers are started (it
>> is important to do this after mount, since the kernel discards all
>> block device data on the last close of the device).
>
> Cédric,
>
> If nothing else, you could use this technique to confirm Oleg's theory
> that it is indeed bitmap caching that is producing your performance
> observations. Not that I doubt Oleg's expertise here; it would just be
> scientific data confirming your situation.
>
> b.
Andreas Dilger
2008-Mar-25 21:08 UTC
[Lustre-discuss] Slow performance on the first IOR/iozone (Lustre)
On Mar 25, 2008 17:43 +0100, Cédric Lambert wrote:
> Your remark is completely justified: theory can be limited by external
> variables.
>
> For example, we saw that there was a limit in the "ext2" file system:
> EXT2_MAX_GROUP_LOADED was used to prevent too many cached bitmaps from
> accumulating in memory. As Lustre is based on ext3, is there such a
> limit in ldiskfs? If so, we could still see poor performance with very
> large files across all OSTs (for example).

The MAX_GROUP_LOADED parameter was removed, and it was pointless in any case because the block bitmaps were still kept in the buffer cache if they were used frequently.

> I will let Anand tell us whether the results are better on his
> benchmarks after a dumpe2fs on every device...

I'm not sure that dumpe2fs will be sufficient, because the ldiskfs mballoc code also generates buddy bitmaps for every group it is doing allocations in. Reading the block bitmaps into cache in advance will definitely help, but dumpe2fs will not trigger the buddy bitmap generation.

> Brian J. Murrell wrote:
>> On Tue, 2008-03-25 at 10:37 -0400, Oleg Drokin wrote:
>>> You can certainly prefetch that data sooner as a workaround, with
>>> zero kernel hacking required, by e.g. running dumpe2fs on every
>>> device holding a Lustre backend fs after the Lustre servers are
>>> started (it is important to do this after mount, since the kernel
>>> discards all block device data on the last close of the device).
>>
>> Cédric,
>>
>> If nothing else, you could use this technique to confirm Oleg's theory
>> that it is indeed bitmap caching that is producing your performance
>> observations. Not that I doubt Oleg's expertise here; it would just be
>> scientific data confirming your situation.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
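Since the buddy bitmaps are only built when mballoc actually allocates in a group, one hedged way to warm them as well is a small write on each OST from a client (a sketch only; the OST count, mount point and file size are assumptions, and the positional lfs setstripe syntax follows Anand's earlier example). Note this only builds buddy bitmaps for the groups those writes happen to allocate from, so it narrows the first-write penalty rather than eliminating it.

    # On a Lustre client: do one small striped write per OST index so mballoc
    # generates buddy bitmaps through real allocations, not just bitmap reads.
    NUM_OSTS=4                                  # placeholder for the real OST count
    mkdir -p /mnt/lustre/warmup && cd /mnt/lustre/warmup
    for i in $(seq 0 $((NUM_OSTS - 1))); do
        lfs setstripe warm.$i 0 $i 1            # default stripe size, OST index $i, 1 stripe
        dd if=/dev/zero of=warm.$i bs=1048576 count=1024
        rm -f warm.$i
    done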