Jagga Soorma
2010-Mar-03 19:50 UTC
[Lustre-discuss] Curious about iozone findings of new Lustre FS
Hi Guys,

I have just deployed a new Lustre FS with 2 MDS servers, 2 active OSS servers (5x2TB OSTs per OSS) and 16 compute nodes. Attached are our findings from the iozone tests. The iozone throughput tests have demonstrated almost linear scalability of Lustre, except when WRITING files that exceed 128MB in size. When multiple clients create/write files larger than 128MB, Lustre throughput levels off at approximately 1GB/s. This behavior has been observed with almost all tested block sizes except for 4KB. I don't have any explanation as to why Lustre performs poorly when writing large files.

Has anyone experienced this behaviour? Any comments on our findings?

Thanks,
-J

[Attachment: Lustre_FS_Report.pdf (application/pdf, 4428948 bytes) -- http://lists.lustre.org/pipermail/lustre-discuss/attachments/20100303/0f09fd1b/attachment-0001.pdf]
Jagga Soorma
2010-Mar-03 21:02 UTC
[Lustre-discuss] Curious about iozone findings of new Lustre FS
Hi Guys,

(P.S. My apologies if this is a duplicate - I sent this email earlier today with an attachment and did not see it show up in Google Groups, so I was not sure if attachments were allowed.)

I have just deployed a new Lustre FS with 2 MDS servers, 2 active OSS servers (5x2TB OSTs per OSS) and 16 compute nodes. The iozone throughput tests have demonstrated almost linear scalability of Lustre, except when WRITING files that exceed 128MB in size. When multiple clients create/write files larger than 128MB, Lustre throughput levels off at approximately 1GB/s. This behavior has been observed with almost all tested block sizes except for 4KB. I don't have any explanation as to why Lustre performs poorly when writing large files.

Here is the iozone report:
http://docs.google.com/fileview?id=0Bz8GxDEZOhnwYjQyMDlhMWMtODVlYi00MTgwLTllN2QtYzU2MWJlNTEwMjA1&hl=en

The only changes I have made to the defaults are:
stripe_count: 2
stripe_size: 1048576
stripe_offset: -1

I am using Lustre 1.8.1.1 and my MDS/OSS servers are running RHEL 5.3. All the clients are SLES 11.

Has anyone experienced this behaviour? Any comments on our findings?

Thanks,
-J
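(The exact iozone invocation is not given in the thread. For anyone wanting to reproduce this kind of multi-client throughput test, a run along the following lines is typical; the client list file, file size and record size here are illustrative assumptions, not the poster's actual settings.)

  # clients.cfg lists one "hostname  /mnt/lustre/workdir  /path/to/iozone" entry per client
  # -t 16: 16 concurrent processes; -s 4g -r 1m: 4 GB files in 1 MB records
  # -i 0 -i 1: write and read tests; -c -e: include close() and fsync() in the timing
  iozone -+m clients.cfg -t 16 -s 4g -r 1m -i 0 -i 1 -c -e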
Andreas Dilger
2010-Mar-03 22:30 UTC
[Lustre-discuss] Curious about iozone findings of new Lustre FS
On 2010-03-03, at 12:50, Jagga Soorma wrote:
> I have just deployed a new Lustre FS with 2 MDS servers, 2 active OSS
> servers (5x2TB OSTs per OSS) and 16 compute nodes.

Does this mean you are using 5 2TB disks in a single RAID-5 OST per OSS (i.e. total OST size is 8TB), or are you using 5 separate 2TB OSTs?

> Attached are our findings from the iozone tests. The iozone throughput
> tests have demonstrated almost linear scalability of Lustre, except when
> WRITING files that exceed 128MB in size. When multiple clients
> create/write files larger than 128MB, Lustre throughput levels off at
> approximately 1GB/s. This behavior has been observed with almost all
> tested block sizes except for 4KB. I don't have any explanation as to
> why Lustre performs poorly when writing large files.
>
> Has anyone experienced this behaviour? Any comments on our findings?

The default client tunable max_dirty_mb=32MB per OSC (i.e. the maximum amount of unwritten dirty data per OST before blocking the process submitting IO). If you have 2 OST/OSCs and you have a stripe count of 2 then you can cache up to 64MB on the client without having to wait for any RPCs to complete. That is why you see a performance cliff for writes beyond 32MB.

It should be clear that the read graphs are meaningless, due to local cache of the file. I'd hazard a guess that you are not getting 100GB/s from 2 OSS nodes.

Also, what is the interconnect on the client? If you are using a single 10GigE then 1GB/s is as fast as you can possibly write large files to the OSTs, regardless of the striping.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
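(For reference, the per-OSC dirty-cache limit Andreas describes can be inspected and changed at runtime with lctl on the clients. A minimal sketch, assuming Lustre 1.8 parameter names; the value 256 is only an illustrative choice.)

  # on a client: show the current per-OSC limit (default 32 MB)
  lctl get_param osc.*.max_dirty_mb
  # raise it, e.g. to 256 MB per OSC; the setting does not persist across a remount
  lctl set_param osc.*.max_dirty_mb=256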
Jagga Soorma
2010-Mar-03 23:27 UTC
[Lustre-discuss] Curious about iozone findings of new Lustre FS
On Wed, Mar 3, 2010 at 2:30 PM, Andreas Dilger <adilger at sun.com> wrote:
> Does this mean you are using 5 2TB disks in a single RAID-5 OST per OSS
> (i.e. total OST size is 8TB), or are you using 5 separate 2TB OSTs?

No, I am using 5 independent 2TB OSTs per OSS.

> The default client tunable max_dirty_mb=32MB per OSC (i.e. the maximum
> amount of unwritten dirty data per OST before blocking the process
> submitting IO). If you have 2 OST/OSCs and you have a stripe count of 2
> then you can cache up to 64MB on the client without having to wait for
> any RPCs to complete. That is why you see a performance cliff for writes
> beyond 32MB.

So the true write performance should be measured from the data captured for files larger than 128MB? If we do see a large number of large files being created on the Lustre FS, is this something that can be tuned on the client side? If so, where/how can I get this done and what would be the recommended settings?

> It should be clear that the read graphs are meaningless, due to local
> cache of the file. I'd hazard a guess that you are not getting 100GB/s
> from 2 OSS nodes.

Agreed. Is there a way to find out the size of the local cache on the clients?

> Also, what is the interconnect on the client? If you are using a single
> 10GigE then 1GB/s is as fast as you can possibly write large files to
> the OSTs, regardless of the striping.

I am using Infiniband (QDR) interconnects for all nodes.
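(On the cache-size question: the client-side limits are exposed through lctl. A minimal sketch, assuming Lustre 1.8 parameter names; these commands are not stated in the thread itself.)

  # client-wide cap on cached file data in the llite layer, in MB
  lctl get_param llite.*.max_cached_mb
  # per-OSC cap on dirty (unwritten) data, in MB -- the max_dirty_mb discussed above
  lctl get_param osc.*.max_dirty_mb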
Jagga Soorma
2010-Mar-03 23:29 UTC
[Lustre-discuss] Curious about iozone findings of new Lustre FS
Or would it be better to increase the stripe count for my Lustre filesystem to the maximum number of OSTs?

On Wed, Mar 3, 2010 at 3:27 PM, Jagga Soorma <jagga13 at gmail.com> wrote:
> So the true write performance should be measured from the data captured
> for files larger than 128MB? If we do see a large number of large files
> being created on the Lustre FS, is this something that can be tuned on
> the client side? If so, where/how can I get this done and what would be
> the recommended settings?
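(The stripe count being discussed is set with lfs setstripe. A minimal sketch; the directory path is an assumption, and -c -1 means stripe over all available OSTs.)

  # set the default striping for new files created under this directory
  lfs setstripe -c -1 /mnt/lustre/scratch
  # confirm the resulting layout
  lfs getstripe /mnt/lustre/scratch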
Jeffrey Bennett
2010-Mar-03
[Lustre-discuss] One or two OSS, no difference?
Hi Lustre experts,

We are building a very small Lustre cluster with 32 clients (patchless) and two OSS servers. Each OSS server has 1 OST with 1 TB of solid state drives. Everything is connected using dual-port DDR IB.

For testing purposes, I am enabling/disabling one of the OSS/OSTs by using the "lfs setstripe" command. I am running XDD and vdbench benchmarks.

Does anybody have an idea why there is no difference in MB/sec or random IOPS when using one OSS or two OSS? A quick test with "dd" also shows the same MB/sec when using one or two OSTs.

Thanks,
jab
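(A sketch of the kind of one-OST vs. two-OST comparison described above; the mount point and sizes are assumptions, not taken from the post.)

  # create a test file placed on OST index 0 only
  lfs setstripe -c 1 -i 0 /mnt/lustre/one_ost.dat
  # create a test file striped across both OSTs
  lfs setstripe -c 2 /mnt/lustre/two_ost.dat
  # simple streaming write check that bypasses the client page cache
  dd if=/dev/zero of=/mnt/lustre/two_ost.dat bs=1M count=4096 oflag=direct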
Hello!

On Mar 3, 2010, at 6:35 PM, Jeffrey Bennett wrote:
> We are building a very small Lustre cluster with 32 clients (patchless)
> and two OSS servers. Each OSS server has 1 OST with 1 TB of solid state
> drives. Everything is connected using dual-port DDR IB.
>
> For testing purposes, I am enabling/disabling one of the OSS/OSTs by
> using the "lfs setstripe" command. I am running XDD and vdbench
> benchmarks.
>
> Does anybody have an idea why there is no difference in MB/sec or random
> IOPS when using one OSS or two OSS? A quick test with "dd" also shows
> the same MB/sec when using one or two OSTs.

I wonder if you just don't saturate even one OST (both the backend SSD and the IB interconnect) with this number of clients? Does the total throughput decrease as you reduce the number of active clients, and increase as you add more? Increasing the maximum number of in-flight RPCs might help in that case. Also, are all of your clients writing to the same file, or does each client do IO to a separate file (I hope)?

Bye,
Oleg
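(The in-flight RPC limit Oleg mentions is a per-OSC client tunable. A minimal sketch, assuming Lustre 1.8 parameter names; 32 is an illustrative value, the default is 8.)

  # on a client: current limit of concurrent RPCs per OSC
  lctl get_param osc.*.max_rpcs_in_flight
  # raise it; the setting does not persist across a remount
  lctl set_param osc.*.max_rpcs_in_flight=32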
Hi Oleg, thanks for your reply.

I was actually testing with only one client. When adding a second client using a different file, one client gets all the performance and the other one gets very low performance. Any recommendation?

Thanks in advance,
jab
Hello!

This is pretty strange. Are there any differences in network topology that can explain this? If you remove the first client, does the second one show performance at the level of the first, but as soon as you start the load on the first again, the second client's performance drops?

Bye,
Oleg

On Mar 4, 2010, at 1:45 PM, Jeffrey Bennett wrote:
> I was actually testing with only one client. When adding a second client
> using a different file, one client gets all the performance and the
> other one gets very low performance. Any recommendation?
Hi Oleg,

I just noticed that the sequential performance is OK, but the random IO (which is what I am measuring) is not. Is there any way to increase random IO performance on Lustre? We have LUNs that can provide around 250,000 random 4kB read IOPS, but we are only seeing 3,000 to 10,000 on Lustre.

jab
On 2010-03-04, at 14:18, Jeffrey Bennett wrote:
> I just noticed that the sequential performance is OK, but the random IO
> (which is what I am measuring) is not. Is there any way to increase
> random IO performance on Lustre? We have LUNs that can provide around
> 250,000 random 4kB read IOPS, but we are only seeing 3,000 to 10,000 on
> Lustre.

There is work currently underway to improve the SMP scaling performance of the RPC handling layer in Lustre. Currently that limits the delivered RPC rate to 10-15k/sec or so.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
Andreas, if we are using 4kB blocks I understand we only transfer one page per RPC, so are we limited to 10-15k RPCs per second, or, which is the same thing, 10,000-15,000 IOPS?

jab

On 2010-03-05, Andreas Dilger wrote:
> There is work currently underway to improve the SMP scaling performance
> of the RPC handling layer in Lustre. Currently that limits the delivered
> RPC rate to 10-15k/sec or so.
On 2010-03-05, at 14:53, Jeffrey Bennett wrote:
> Andreas, if we are using 4kB blocks I understand we only transfer one
> page per RPC, so are we limited to 10-15k RPCs per second, or, which is
> the same thing, 10,000-15,000 IOPS?

That depends on whether you are doing read or write requests, whether it is in the client cache, etc. Random read requests would definitely fall under the RPC limit; random write requests can benefit from aggregation on the client, assuming you aren't doing O_DIRECT or O_SYNC IO operations.

Increasing your max_rpcs_in_flight and max_dirty_mb on the clients can improve IOPS, assuming the servers are not handling enough requests from the clients. Check the RPC req_waittime, req_qdepth, and ost_{read,write} service times via:

  lctl get_param ost.OSS.ost_io.stats

to see whether the servers are saturated or idle. CPU usage may also be a factor.

There was also bug 22074, fixed recently (post 1.8.2), that addresses a performance problem with lots of small IOs to different files (NOT related to small IOs to a single file).

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
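(To pull out just the counters Andreas names from that stats file, something along these lines works on the OSS; the egrep filter is a convenience, not part of Lustre.)

  # request wait time, queue depth and read/write service counters for the ost_io service
  lctl get_param ost.OSS.ost_io.stats | egrep 'req_waittime|req_qdepth|req_active|ost_read|ost_write'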
Andreas and Oleg,

Sorry to bother you again.

Is there any way to "force" an RPC size of 256kB and pack several 4kB random operations into a single 256kB RPC so that the final throughput can increase?

Thanks

Jeffrey A. Bennett
HPC Systems Engineer
San Diego Supercomputer Center
http://users.sdsc.edu/~jab
Hello!

This only works if all the requests are for the same file; in that case it is done for you automatically (assuming these are write requests and there is no sync in between). It's impossible to do for reads, for the obvious reason that a read is a synchronous operation: by the time we issue the second read request, the first one must already have completed and returned its data.

Bye,
Oleg

On Mar 9, 2010, at 7:40 PM, Jeffrey Bennett wrote:
> Is there any way to "force" an RPC size of 256kB and pack several 4kB
> random operations into a single 256kB RPC so that the final throughput
> can increase?
On 2010-03-09, at 17:40, Jeffrey Bennett wrote:
> Is there any way to "force" an RPC size of 256kB and pack several 4kB
> random operations into a single 256kB RPC so that the final throughput
> can increase?

As Oleg mentioned, Lustre will in fact pack up to 256 random 4kB writes into a single RPC/RDMA over the network, IFF they are for the same object.

There is the long-standing bug 988 that relates to packing multiple small writes for different objects on the same OST into a single RPC/RDMA request, but there was never enough demand to finish it.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
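(Whether this packing actually happens for a given workload can be checked from the client side; a sketch, assuming the Lustre 1.8 osc rpc_stats interface.)

  # per-OSC histogram of pages per RPC: mostly 1-page write RPCs mean little
  # aggregation is happening; 256-page RPCs mean full 1 MB transfers
  lctl get_param osc.*.rpc_stats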