Anand Bisen
2008-Mar-22 18:55 UTC
[Lustre-discuss] Slow performance on the first IOR / iozone (Lustre)
Hi,

I have been trying to understand why successive executions of iozone (and also of dd and IOR) produce increasing performance from the first run to the third run. After the third run the performance is constant. So when a new OSS is put into service, or when an OSS server is rebooted and rejoins the Lustre file system, this characteristic is visible for each OST that it serves. Once the OSTs reach peak performance, that performance is consistent, but once the OSTs are unmounted and mounted back, the same escalating performance is seen again. Once an OST has been primed, mounting from a new client also gets the same peak performance. This characteristic is only visible for writes, not for reads.

We have tried multiple scenarios, from a single OST up to multiple OSTs. We are using "noop" as the I/O scheduler on our OSS servers.

--- Example with dd ---

mds2:/mnt/lustre/test # lfs setstripe foo 0 2 1    (we have tried each OST here, same effect)
mds2:/mnt/lustre/test # dd if=/dev/zero of=foo bs=1048576 count=8192
8192+0 records in
8192+0 records out
8589934592 bytes (8.6 GB) copied, 43.7827 seconds, 196 MB/s
mds2:/mnt/lustre/test # dd if=/dev/zero of=foo bs=1048576 count=8192
8192+0 records in
8192+0 records out
8589934592 bytes (8.6 GB) copied, 33.153 seconds, 259 MB/s
mds2:/mnt/lustre/test # dd if=/dev/zero of=foo bs=1048576 count=8192
8192+0 records in
8192+0 records out
8589934592 bytes (8.6 GB) copied, 32.9747 seconds, 261 MB/s
mds2:/mnt/lustre/test # dd if=/dev/zero of=foo bs=1048576 count=8192
8192+0 records in
8192+0 records out
8589934592 bytes (8.6 GB) copied, 29.3566 seconds, 293 MB/s
mds2:/mnt/lustre/test # dd if=/dev/zero of=foo bs=1048576 count=8192
8192+0 records in
8192+0 records out
8589934592 bytes (8.6 GB) copied, 22.23 seconds, 386 MB/s
mds2:/mnt/lustre/test # dd if=/dev/zero of=foo bs=1048576 count=8192
8192+0 records in
8192+0 records out
8589934592 bytes (8.6 GB) copied, 22.236 seconds, 386 MB/s
mds2:/mnt/lustre/test # dd if=/dev/zero of=foo bs=1048576 count=8192
8192+0 records in
8192+0 records out
8589934592 bytes (8.6 GB) copied, 22.2838 seconds, 385 MB/s

--- Example with iozone ---

client-mnt # iozone -i 0 -s 10g -r 1m -t 2
        File size set to 10485760 KB
        Record Size 1024 KB
        Command line used: iozone -i 0 -s 10g -r 1m -t 2
        Output is in Kbytes/sec
        Time Resolution = 0.000001 seconds.
        Processor cache size set to 1024 Kbytes.
        Processor cache line size set to 32 bytes.
        File stride size set to 17 * record size.
        Throughput test with 2 processes
        Each process writes a 10485760 Kbyte file in 1024 Kbyte records

        Children see throughput for 2 initial writers = 151909.33 KB/sec
        Parent sees throughput for 2 initial writers  = 150657.64 KB/sec
        Min throughput per process                    =  75476.66 KB/sec
        Max throughput per process                    =  76432.67 KB/sec
        Avg throughput per process                    =  75954.66 KB/sec
        Min xfer                                      = 10354688.00 KB

        Children see throughput for 2 rewriters       = 249051.45 KB/sec
        Parent sees throughput for 2 rewriters        = 247016.15 KB/sec
        Min throughput per process                    = 124407.29 KB/sec
        Max throughput per process                    = 124644.16 KB/sec
        Avg throughput per process                    = 124525.72 KB/sec
        Min xfer                                      = 10465280.00 KB

####### Second RUN

        Children see throughput for 2 initial writers = 244629.49 KB/sec
        Parent sees throughput for 2 initial writers  = 242007.64 KB/sec
        Min throughput per process                    = 121223.02 KB/sec
        Max throughput per process                    = 123406.48 KB/sec
        Avg throughput per process                    = 122314.75 KB/sec
        Min xfer                                      = 10300416.00 KB

        Children see throughput for 2 rewriters       = 239316.48 KB/sec
        Parent sees throughput for 2 rewriters        = 238836.01 KB/sec
        Min throughput per process                    = 119017.80 KB/sec
        Max throughput per process                    = 120298.68 KB/sec
        Avg throughput per process                    = 119658.24 KB/sec
        Min xfer                                      = 10375168.00 KB

####### Third RUN

        Children see throughput for 2 initial writers = 245567.60 KB/sec
        Parent sees throughput for 2 initial writers  = 241539.16 KB/sec
        Min throughput per process                    = 121565.91 KB/sec
        Max throughput per process                    = 124001.69 KB/sec
        Avg throughput per process                    = 122783.80 KB/sec
        Min xfer                                      = 10279936.00 KB

        Children see throughput for 2 rewriters       = 240782.11 KB/sec
        Parent sees throughput for 2 rewriters        = 240390.69 KB/sec
        Min throughput per process                    = 119717.36 KB/sec
        Max throughput per process                    = 121064.75 KB/sec
        Avg throughput per process                    = 120391.05 KB/sec
        Min xfer                                      = 10370048.00 KB

Any insight on explaining this behavior would be really appreciated.

Thanks
Anand
Oleg Drokin
2008-Mar-23 00:26 UTC
[Lustre-discuss] Slow performance on the first IOR / iozone (Lustre)
Hello!

On Mar 22, 2008, at 2:55 PM, Anand Bisen wrote:
> I have been trying to understand why successive execution of iozone
> (also with dd and IOR) produces increasing performance from first run
> to third run. After the third run the performance is constant.

I believe what you are seeing is a direct result of the bitmap blocks not being read at mount time. It takes some time to cache the bitmaps (and other data structures) into memory, and this happens during writes, which hinders performance. Once the bitmaps are all read, performance stabilizes.

Hopefully Alex or Andreas can comment further on this, or perhaps remember a bug number; Cray filed one some time ago, I think.

Bye,
    Oleg
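P.S. For a rough sense of scale (a back-of-the-envelope sketch only; the 1 TB OST size and 4 KB block size below are assumptions, not figures from Anand's system): each ldiskfs block group covers 128 MB and carries one 4 KB block bitmap, so a 1 TB OST has on the order of 8192 bitmap blocks, roughly 32 MB of metadata scattered across the device that gets pulled in as small random reads during the first writes.

    # Back-of-the-envelope only; OST size and block size are assumptions.
    ost_bytes=$((1024 * 1024 * 1024 * 1024))   # 1 TB OST (example)
    group_bytes=$((32768 * 4096))              # 32768 blocks/group * 4 KB blocks = 128 MB/group
    groups=$((ost_bytes / group_bytes))        # ~8192 block groups
    echo "$groups groups, $((groups * 4096 / 1024 / 1024)) MB of block bitmaps"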
Cédric Lambert
2008-Mar-25 09:37 UTC
[Lustre-discuss] Slow performance on the first IOR / iozone (Lustre)
Hello,

I am not a kernel hacker ;-) and I would like to understand something: does this mean that once the Lustre client is mounted, I have to issue some I/O to force the bitmaps into cache, or do I have to wait a certain time before making any writes? If yes, can I monitor this information under /proc?

Thanks
Cédric

Oleg Drokin wrote:
> I believe what you are seeing is a direct result of the bitmap blocks
> not being read at mount time. It takes some time to cache the bitmaps
> (and other data structures) into memory, and this happens during
> writes, which hinders performance. Once the bitmaps are all read,
> performance stabilizes.
>
> Hopefully Alex or Andreas can comment further on this, or perhaps
> remember a bug number; Cray filed one some time ago, I think.
>
> Bye,
>     Oleg
Oleg Drokin
2008-Mar-25 14:37 UTC
[Lustre-discuss] Slow performance on the first IOR / iozone (Lustre)
Hello!

You are certainly not obliged to do any I/O to force the bitmaps in: Lustre will continue to work, it will just pay some penalty until all the caches are populated.

You can certainly prefetch that data sooner as a workaround, with zero kernel hacking required, by e.g. running dumpe2fs on every device holding a Lustre backend fs after the Lustre servers are started (it is important to do this after mount, since the kernel discards all block device data on the last close of the device).

I am not aware of any place in /proc where you can track the current status of fs-specific cached metadata, and it may also change over time; e.g. for a filesystem that is not in use, the blocks might be forced out of memory.

Bye,
    Oleg

On Mar 25, 2008, at 5:37 AM, Cédric Lambert wrote:
> Hello,
>
> I am not a kernel hacker ;-) and I would like to understand something:
> does this mean that once the Lustre client is mounted, I have to issue
> some I/O to force the bitmaps into cache, or do I have to wait a
> certain time before making any writes? If yes, can I monitor this
> information under /proc?
>
> Thanks
> Cédric
>
> Oleg Drokin wrote:
>> I believe what you are seeing is a direct result of the bitmap blocks
>> not being read at mount time. It takes some time to cache the bitmaps
>> (and other data structures) into memory, and this happens during
>> writes, which hinders performance. Once the bitmaps are all read,
>> performance stabilizes.
>>
>> Hopefully Alex or Andreas can comment further on this, or perhaps
>> remember a bug number; Cray filed one some time ago, I think.
>>
>> Bye,
>>     Oleg
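P.S. A minimal sketch of that workaround (the device names below are placeholders; use whatever block devices your OSTs actually sit on):

    # Run on each OSS *after* the Lustre targets are mounted; otherwise the
    # kernel drops the cached blocks on the last close of the device.
    # dumpe2fs only reads (superblock, group descriptors, bitmaps), so it is
    # safe on a mounted ldiskfs target.
    for dev in /dev/sdb /dev/sdc /dev/sdd; do    # placeholder device list
        dumpe2fs "$dev" > /dev/null 2>&1
    done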
Brian J. Murrell
2008-Mar-25 14:45 UTC
[Lustre-discuss] Slow performance on the first IOR / iozone (Lustre)
On Tue, 2008-03-25 at 10:37 -0400, Oleg Drokin wrote:
> You can certainly prefetch that data sooner as a workaround, with zero
> kernel hacking required, by e.g. running dumpe2fs on every device
> holding a Lustre backend fs after the Lustre servers are started (it is
> important to do this after mount, since the kernel discards all block
> device data on the last close of the device).

Cédric,

If nothing else, you could use this technique to confirm Oleg's theory that it is indeed bitmap caching that is producing your performance observations. Not that I doubt Oleg's expertise here; it would just be scientific data confirming your situation.

b.
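A hedged sketch of such a confirmation run (the device, mount point and OST index below are illustrative, not taken from Anand's configuration, and unmounting an OST while clients are connected means waiting through normal Lustre recovery when it comes back):

    # On the OSS: remount the OST so its on-disk metadata caches start cold.
    umount /mnt/ost0 && mount -t lustre /dev/sdb /mnt/ost0    # placeholder device/mount point

    # On a client: time a first write striped onto that OST (index 2, as in Anand's example).
    lfs setstripe cold 0 2 1
    dd if=/dev/zero of=cold bs=1048576 count=8192

    # On the OSS: remount again, but this time prefetch the metadata before any writes.
    umount /mnt/ost0 && mount -t lustre /dev/sdb /mnt/ost0
    dumpe2fs /dev/sdb > /dev/null 2>&1

    # On the client: repeat the timed write.
    lfs setstripe warm 0 2 1
    dd if=/dev/zero of=warm bs=1048576 count=8192

If the prefetched run starts much closer to the steady-state ~385 MB/s than to the ~196 MB/s of the cold run, that points at bitmap caching; if not, something else is warming up.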
Cédric Lambert
2008-Mar-25 16:43 UTC
[Lustre-discuss] Slow performance on the first IOR / iozone (Lustre)
Brian,

Your remark is completely justified: theory can be limited by external variables.

For example, we saw that there was a limit in the "ext2" file system: EXT2_MAX_GROUP_LOADED was used to prevent too many cached bitmaps from accumulating in memory. As Lustre is based on ext3, is there such a limit in ldiskfs? If so, we could still see poor performance with very large files across all OSTs (for example).

I will let Anand tell us whether the results are better on his benchmarks after a dumpe2fs on every device...

Cédric

Brian J. Murrell wrote:
> On Tue, 2008-03-25 at 10:37 -0400, Oleg Drokin wrote:
>> You can certainly prefetch that data sooner as a workaround, with zero
>> kernel hacking required, by e.g. running dumpe2fs on every device
>> holding a Lustre backend fs after the Lustre servers are started (it
>> is important to do this after mount, since the kernel discards all
>> block device data on the last close of the device).
>
> Cédric,
>
> If nothing else, you could use this technique to confirm Oleg's theory
> that it is indeed bitmap caching that is producing your performance
> observations. Not that I doubt Oleg's expertise here; it would just be
> scientific data confirming your situation.
>
> b.
Andreas Dilger
2008-Mar-25 21:08 UTC
[Lustre-discuss] Slow performance on the first IOR/iozone (Lustre)
On Mar 25, 2008 17:43 +0100, Cédric Lambert wrote:
> Your remark is completely justified: theory can be limited by external
> variables.
>
> For example, we saw that there was a limit in the "ext2" file system:
> EXT2_MAX_GROUP_LOADED was used to prevent too many cached bitmaps from
> accumulating in memory. As Lustre is based on ext3, is there such a
> limit in ldiskfs? If so, we could still see poor performance with very
> large files across all OSTs (for example).

The MAX_GROUP_LOADED parameter was removed, and it was pointless in any case because the block bitmaps were still kept in the buffer cache if they were used frequently.

> I will let Anand tell us whether the results are better on his
> benchmarks after a dumpe2fs on every device...

I'm not sure that dumpe2fs will be sufficient, because the ldiskfs mballoc code also generates buddy bitmaps for every group it is doing allocations in. Reading the block bitmaps into cache in advance will definitely help, but dumpe2fs will not trigger the buddy bitmap generation.

> Brian J. Murrell wrote:
>> On Tue, 2008-03-25 at 10:37 -0400, Oleg Drokin wrote:
>>> You can certainly prefetch that data sooner as a workaround, with
>>> zero kernel hacking required, by e.g. running dumpe2fs on every
>>> device holding a Lustre backend fs after the Lustre servers are
>>> started (it is important to do this after mount, since the kernel
>>> discards all block device data on the last close of the device).
>>
>> Cédric,
>>
>> If nothing else, you could use this technique to confirm Oleg's theory
>> that it is indeed bitmap caching that is producing your performance
>> observations. Not that I doubt Oleg's expertise here; it would just be
>> scientific data confirming your situation.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
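Since the buddy bitmaps are only built when mballoc actually allocates in a group, one hedged way to warm them as well is a small write on each OST from a client (a sketch only; the OST count, mount point and file size are assumptions, and the positional lfs setstripe syntax follows Anand's earlier example). Note this only builds buddy bitmaps for the groups those writes happen to allocate from, so it narrows the first-write penalty rather than eliminating it.

    # On a Lustre client: do one small striped write per OST index so mballoc
    # generates buddy bitmaps through real allocations, not just bitmap reads.
    NUM_OSTS=4                                  # placeholder for the real OST count
    mkdir -p /mnt/lustre/warmup && cd /mnt/lustre/warmup
    for i in $(seq 0 $((NUM_OSTS - 1))); do
        lfs setstripe warm.$i 0 $i 1            # default stripe size, OST index $i, 1 stripe
        dd if=/dev/zero of=warm.$i bs=1048576 count=1024
        rm -f warm.$i
    done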