Anand Bisen
2008-Mar-22 18:55 UTC
[Lustre-discuss] Slow performance on the first IOR / iozone (Lustre)
Hi,
I have been trying to understand why successive executions of iozone
(and also dd and IOR) produce increasing performance from the first run
to the third run. After the third run the performance is constant.
When a new OSS is put into service, or when an OSS server is rebooted
and rejoins the Lustre file system, this characteristic is visible for
each OST that it serves. Once the OSTs reach peak performance, that
performance is consistent, but once the OSTs are unmounted and mounted
again the same escalating performance is seen. Once an OST has been
primed, mounting from a new client also gets the same peak performance.
This characteristic is only visible for writes, not reads.
We have tried multiple scenarios, from one OST up to multiple OSTs. We
are using "noop" as the I/O scheduler on our OSS servers.
---Example with dd---
mds2:/mnt/lustre/test # lfs setstripe foo 0 2 1    (we have tried each
OST here, with the same effect)
mds2:/mnt/lustre/test # dd if=/dev/zero of=foo bs=1048576 count=8192
8192+0 records in
8192+0 records out
8589934592 bytes (8.6 GB) copied, 43.7827 seconds, 196 MB/s
mds2:/mnt/lustre/test # dd if=/dev/zero of=foo bs=1048576 count=8192
8192+0 records in
8192+0 records out
8589934592 bytes (8.6 GB) copied, 33.153 seconds, 259 MB/s
mds2:/mnt/lustre/test # dd if=/dev/zero of=foo bs=1048576 count=8192
8192+0 records in
8192+0 records out
8589934592 bytes (8.6 GB) copied, 32.9747 seconds, 261 MB/s
mds2:/mnt/lustre/test # dd if=/dev/zero of=foo bs=1048576 count=8192
8192+0 records in
8192+0 records out
8589934592 bytes (8.6 GB) copied, 29.3566 seconds, 293 MB/s
mds2:/mnt/lustre/test # dd if=/dev/zero of=foo bs=1048576 count=8192
8192+0 records in
8192+0 records out
8589934592 bytes (8.6 GB) copied, 22.23 seconds, 386 MB/s
mds2:/mnt/lustre/test # dd if=/dev/zero of=foo bs=1048576 count=8192
8192+0 records in
8192+0 records out
8589934592 bytes (8.6 GB) copied, 22.236 seconds, 386 MB/s
mds2:/mnt/lustre/test # dd if=/dev/zero of=foo bs=1048576 count=8192
8192+0 records in
8192+0 records out
8589934592 bytes (8.6 GB) copied, 22.2838 seconds, 385 MB/s
---Example with IOZONE---
client-mnt # iozone -i 0 -s 10g -r 1m -t 2
File size set to 10485760 KB
Record Size 1024 KB
Command line used: iozone -i 0 -s 10g -r 1m -t 2
Output is in Kbytes/sec
Time Resolution = 0.000001 seconds.
Processor cache size set to 1024 Kbytes.
Processor cache line size set to 32 bytes.
File stride size set to 17 * record size.
Throughput test with 2 processes
Each process writes a 10485760 Kbyte file in 1024 Kbyte records
Children see throughput for 2 initial writers = 151909.33 KB/sec
Parent sees throughput for 2 initial writers  = 150657.64 KB/sec
Min throughput per process                    =  75476.66 KB/sec
Max throughput per process                    =  76432.67 KB/sec
Avg throughput per process                    =  75954.66 KB/sec
Min xfer                                      = 10354688.00 KB
Children see throughput for 2 rewriters       = 249051.45 KB/sec
Parent sees throughput for 2 rewriters        = 247016.15 KB/sec
Min throughput per process                    = 124407.29 KB/sec
Max throughput per process                    = 124644.16 KB/sec
Avg throughput per process                    = 124525.72 KB/sec
Min xfer                                      = 10465280.00 KB
#######Second RUN
Children see throughput for 2 initial writers = 244629.49 KB/sec
Parent sees throughput for 2 initial writers  = 242007.64 KB/sec
Min throughput per process                    = 121223.02 KB/sec
Max throughput per process                    = 123406.48 KB/sec
Avg throughput per process                    = 122314.75 KB/sec
Min xfer                                      = 10300416.00 KB
Children see throughput for 2 rewriters       = 239316.48 KB/sec
Parent sees throughput for 2 rewriters        = 238836.01 KB/sec
Min throughput per process                    = 119017.80 KB/sec
Max throughput per process                    = 120298.68 KB/sec
Avg throughput per process                    = 119658.24 KB/sec
Min xfer                                      = 10375168.00 KB
#######Third RUN
Children see throughput for 2 initial writers = 245567.60 KB/sec
Parent sees throughput for 2 initial writers  = 241539.16 KB/sec
Min throughput per process                    = 121565.91 KB/sec
Max throughput per process                    = 124001.69 KB/sec
Avg throughput per process                    = 122783.80 KB/sec
Min xfer                                      = 10279936.00 KB
Children see throughput for 2 rewriters       = 240782.11 KB/sec
Parent sees throughput for 2 rewriters        = 240390.69 KB/sec
Min throughput per process                    = 119717.36 KB/sec
Max throughput per process                    = 121064.75 KB/sec
Avg throughput per process                    = 120391.05 KB/sec
Min xfer                                      = 10370048.00 KB
Any insight into this behavior would be really appreciated.
Thanks
Anand
Oleg Drokin
2008-Mar-23 00:26 UTC
[Lustre-discuss] Slow performance on the first IOR / iozone (Lustre)
Hello!

On Mar 22, 2008, at 2:55 PM, Anand Bisen wrote:
> I have been trying to understand why successive executions of iozone
> (and also dd and IOR) produce increasing performance from the first run
> to the third run. After the third run the performance is constant.

I believe what you are seeing is a direct result of bitmap blocks not
being read at mount time. It takes some time to cache the bitmaps (and
other data structures) into memory, and this happens during writes,
which hinders performance. Once the bitmaps are all read, performance
stabilizes.

Hopefully Alex or Andreas might comment further on this, or perhaps
remember a bug number; Cray filed one some time ago, I think.

Bye,
Oleg
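One rough way to observe this (a sketch, assuming a stock Linux
/proc/meminfo on the OSS; it is not something reported in this thread)
is to watch the OSS buffer cache grow while the first write runs:

    # On the OSS, while a client performs the first write after a fresh mount.
    # Assumption: bitmap and other metadata blocks show up under "Buffers".
    watch -n 1 'grep Buffers /proc/meminfo'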
Cédric Lambert
2008-Mar-25 09:37 UTC
[Lustre-discuss] Slow performance on the first IOR / iozone (Lustre)
Hello,

I am not a kernel hacker ;-) and I would like to understand something:
does this mean that when a Lustre client is mounted, I have to issue
some I/Os to force the bitmaps into cache, or do I have to wait a
certain time before doing any writes? If so, can I monitor this
information under /proc?

Thanks
Cédric

Oleg Drokin wrote:
> I believe what you are seeing is a direct result of bitmap blocks not
> being read at mount time. It takes some time to cache the bitmaps (and
> other data structures) into memory, and this happens during writes,
> which hinders performance. Once the bitmaps are all read, performance
> stabilizes.
>
> Hopefully Alex or Andreas might comment further on this, or perhaps
> remember a bug number; Cray filed one some time ago, I think.
>
> Bye,
> Oleg
Oleg Drokin
2008-Mar-25 14:37 UTC
[Lustre-discuss] Slow performance on the first IOR / iozone (Lustre)
Hello!
You are certainly not obliged to do I/Os to force the bitmaps in, since
Lustre will continue to work; it will just bear some penalties until
all caches are populated.
You can certainly prefetch that data sooner as a workaround, with zero
kernel hacking required, by e.g. running dumpe2fs on every device with
a Lustre backend fs after the Lustre servers are started (it is
important to do this after mount, since the kernel discards all block
device data on the last device close).
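A minimal sketch of that prefetch (the device names below are
placeholders, not taken from this thread; substitute the real OST
backend devices):

    # Run on each OSS after its Lustre targets are mounted.
    # /dev/sdb and /dev/sdc are placeholder device names.
    for dev in /dev/sdb /dev/sdc; do
        dumpe2fs "$dev" > /dev/null 2>&1   # reads group descriptors and bitmaps into the buffer cache
    done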
I am not aware of any place in /proc where you can track the current
status of fs-specific cached metadata; it may also change over time,
e.g. for an fs that is not being used, blocks might be forced out of
memory.
Bye,
Oleg
On Mar 25, 2008, at 5:37 AM, Cédric Lambert wrote:
> Hello,
>
> I am not a kernel hacker ;-) and I would like to understand something:
> does this mean that when a Lustre client is mounted, I have to issue
> some I/Os to force the bitmaps into cache, or do I have to wait a
> certain time before doing any writes? If so, can I monitor this
> information under /proc?
>
> Thanks
> Cédric
>
>
> Oleg Drokin wrote:
>>
>> I believe what you are seeing is a direct result of bitmap blocks not
>> being read at mount time. It takes some time to cache the bitmaps (and
>> other data structures) into memory, and this happens during writes,
>> which hinders performance. Once the bitmaps are all read, performance
>> stabilizes.
>>
>> Hopefully Alex or Andreas might comment further on this, or
>> perhaps remember
>> a bug number, Cray filed one some time ago, I think.
>>
>> Bye,
>> Oleg
>>
Brian J. Murrell
2008-Mar-25 14:45 UTC
[Lustre-discuss] Slow performance on the first IOR / iozone (Lustre)
On Tue, 2008-03-25 at 10:37 -0400, Oleg Drokin wrote:
> You certainly can prefetch that data sooner as a workaround, with zero
> kernel hacking required, by e.g. running dumpe2fs on every device with
> a Lustre backend fs after the Lustre servers are started (it is
> important to do this after mount, since the kernel discards all block
> device data on the last device close).

Cédric,

If nothing else, you could use this technique to confirm Oleg's theory
that it is indeed bitmaps being cached that is resulting in your
performance observations. Not that I doubt Oleg's expertise here; it
would just be scientific data to confirm your situation.

b.
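A rough sketch of such a confirmation test (the device, mount point,
and file names are placeholders, not taken from the thread):

    # 1. Cold case: remount the OST on the OSS, then time the first write from a client.
    dd if=/dev/zero of=/mnt/lustre/test/foo bs=1048576 count=8192

    # 2. Primed case: remount the OST again, run dumpe2fs against its backend
    #    device on the OSS (after the mount), then repeat the same first write.
    dumpe2fs /dev/sdX > /dev/null        # on the OSS; /dev/sdX is a placeholder
    dd if=/dev/zero of=/mnt/lustre/test/foo bs=1048576 count=8192

If the primed first write already runs near the ~385 MB/s steady-state
rate seen in the dd runs above, that would support the bitmap-caching
explanation.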
Cédric Lambert
2008-Mar-25 16:43 UTC
[Lustre-discuss] Slow performance on the first IOR / iozone (Lustre)
Brian,

Your remark is completely justified: theory can be limited by some
external variables.

For example, we saw there was a limit in the "ext2" file system:
EXT2_MAX_GROUP_LOADED is used to prevent too many bitmaps from being
cached in memory. As Lustre is based on ext3, is there such a limit in
ldiskfs? If so, we could still see bad performance with very large
files across all OSTs, for example.

I will let Anand tell us whether the results are better on his benches
after a dumpe2fs on every device...

Cédric

Brian J. Murrell wrote:
> On Tue, 2008-03-25 at 10:37 -0400, Oleg Drokin wrote:
>> You certainly can prefetch that data sooner as a workaround, with zero
>> kernel hacking required, by e.g. running dumpe2fs on every device with
>> a Lustre backend fs after the Lustre servers are started (it is
>> important to do this after mount, since the kernel discards all block
>> device data on the last device close).
>
> Cédric,
>
> If nothing else, you could use this technique to confirm Oleg's theory
> that it is indeed bitmaps being cached that is resulting in your
> performance observations. Not that I doubt Oleg's expertise here; it
> would just be scientific data to confirm your situation.
>
> b.
Andreas Dilger
2008-Mar-25 21:08 UTC
[Lustre-discuss] Slow performance on the first IOR/iozone (Lustre)
On Mar 25, 2008 17:43 +0100, Cédric Lambert wrote:
> Your remark is completely justified: theory can be limited by some
> external variables.
>
> For example, we saw there was a limit in the "ext2" file system:
> EXT2_MAX_GROUP_LOADED is used to prevent too many bitmaps from being
> cached in memory. As Lustre is based on ext3, is there such a limit in
> ldiskfs? If so, we could still see bad performance with very large
> files across all OSTs, for example.

This MAX_GROUP_LOADED parameter was removed, and was pointless in any
case because the block bitmaps were still kept in the buffer cache if
used frequently.

> I will let Anand tell us whether the results are better on his benches
> after a dumpe2fs on every device...

I'm not sure that dumpe2fs will be sufficient, because the ldiskfs
mballoc code also generates buddy bitmaps for every group it is doing
allocations in. Reading the block bitmaps into cache in advance will
definitely help, but dumpe2fs will not trigger the buddy bitmap
generation.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
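If dumpe2fs alone turns out not to be enough, one hypothetical
follow-up (an untested sketch, not proposed anywhere in the thread)
would be a throwaway warm-up write per OST so that mballoc actually
allocates and builds its buddy bitmaps; note it would only cover the
groups that write happens to touch:

    # Warm up the OST at index 2 (placeholder index; repeat for each OST).
    # Uses the same positional lfs setstripe syntax as the dd example above.
    lfs setstripe /mnt/lustre/test/warmup 0 2 1
    dd if=/dev/zero of=/mnt/lustre/test/warmup bs=1048576 count=8192
    rm /mnt/lustre/test/warmup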