On Dec 15, 2006 15:23 -0400, Peter Bojanic wrote:
> In a call today with Cray, we briefly discussed 4MB IOs on the
> DDN 9500. I've heard and seen numbers from one customer (I'm not
> sure I should name them, so I will not) that demonstrate they get
> better performance with our patch that coalesces 4MB IOs for the
> Lustre VFS client. Nic Henke mentioned a discussion with eeb that
> debated the advantages of 4MB IOs wrt LNET.
>
> Can you advise:
> - is the 4MB IO scenario specific to the Elan configuration of our
> first customer who used DDN 9500s
No, I don't think the 4MB IO performance relates to the network at all,
except that a low-bandwidth network like GigE wouldn't show any
difference because the network, not the disk IO, is the bottleneck.
> - what is the drawback of 4MB IOs wrt LNET
Generating large IOVs is problematic for some LNDs because they can't
handle more than 256 pages at a time in the scatter-gather list; this
limitation is LND-specific.
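For concreteness: with 4KB pages a 4MB bulk spans 1024 pages, so an LND
limited to 256 scatter-gather entries would have to split it into four
separate transfers. A minimal arithmetic sketch (the constant names are
stand-ins for illustration, not the real LNET definitions):

#include <stdio.h>

#define PAGE_SIZE_4K   4096
#define LND_MAX_FRAGS  256      /* scatter-gather entries per message */

int main(void)
{
        unsigned long io_size = 4UL << 20;              /* 4MB IO */
        unsigned long npages  = io_size / PAGE_SIZE_4K; /* 1024 pages */
        unsigned long nmsgs   =
                (npages + LND_MAX_FRAGS - 1) / LND_MAX_FRAGS;

        /* 1024 pages / 256 frags per message = 4 transfers */
        printf("%lu pages -> %lu messages\n", npages, nmsgs);
        return 0;
}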
> - is there a clear verdict for Lustre VFS clients?
We'll have to wait until Jody can test Lustre clients against the DDN
9500; I don't recall offhand whether the previous results were
end-to-end or with sgp-dd.
> - does this even matter for liblustre, for which we're unable to
> aggregate IOs?
Yes, possibly even more so, because if liblustre can't do asynchronous
IOs then IO performance is all the more important for applications (on
Linux the application can resume computation while Lustre flushes the
cache in the background). That said, liblustre would only be able to
take advantage of larger RPC IO sizes if the application is itself
doing such large IOs, while Linux clients can aggregate multiple
smaller IOs into a single larger one on the wire.
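To make that contrast concrete, here is a toy sketch (all names are
hypothetical, not the actual client code): a caching client batches
page-sized writes until a full RPC's worth accumulates, while a
cache-less client like liblustre sends whatever size the application
hands it:

#include <stddef.h>
#include <stdio.h>

#define PAGE_SIZE_4K  4096
#define RPC_SIZE      (1 << 20)                 /* 1MB wire RPC */
#define RPC_PAGES     (RPC_SIZE / PAGE_SIZE_4K)

struct write_cache {
        void   *pages[RPC_PAGES];
        size_t  count;
};

/* Stub: hand the accumulated pages to the network as one bulk RPC. */
static void flush_rpc(void **pages, size_t count)
{
        (void)pages;    /* a real client would map these for the bulk */
        printf("sending RPC of %zu pages (%zu bytes)\n",
               count, count * PAGE_SIZE_4K);
}

/* Caching (Linux-client style): write() returns once the page is
 * queued; a full 1MB RPC goes out only when enough pages accumulate,
 * so many small application writes become one large wire IO. */
static void cached_write(struct write_cache *wc, void *page)
{
        wc->pages[wc->count++] = page;
        if (wc->count == RPC_PAGES) {
                flush_rpc(wc->pages, wc->count);
                wc->count = 0;
        }
}

/* Cache-less (liblustre style): each application write is one RPC, so
 * the wire IO is only as large as the application's own IO. */
static void direct_write(void *page)
{
        flush_rpc(&page, 1);
}

int main(void)
{
        struct write_cache wc = { .count = 0 };
        char page[PAGE_SIZE_4K];
        int i;

        for (i = 0; i < RPC_PAGES; i++)
                cached_write(&wc, page);   /* one 1MB RPC at the end */
        direct_write(page);                /* one 4KB RPC immediately */
        return 0;
}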
I actually thought of a very interesting solution to this that allows
large disk IO sizes without even changing the wire protocol. It would
also allow per-client bandwidth throttling, which some customers have
expressed interest in. Eric has said repeatedly that a 1MB IO size is
plenty large enough to saturate the network, and that the reason 4MB IO
is faster is purely a function of the disk IOs. Jody's recent sgp-dd
testing has shown that raw disk performance does increase dramatically
for 4MB IOs compared to 1MB IOs.
The changes needed would be as follows (a rough sketch in code follows
the list):
- clients would submit IOs as normal (1MB) to the OSTs
- at the OST side the requests are immediately added to that client's
  export instead of waiting in the incoming request queue for an IO
  thread to handle them (likely a single thread would decode enough of
  each request to figure out which export to attach it to), adding
  the exports to a list of exports with pending requests
- the OST service threads would walk the pending-export list and
  process some number of requests from each export. If a thread
  processed 4 bulk IO requests together it would give us 4MB IOs to
  disk with no change to the wire protocol. It could even be smart and
  pick among the pending requests to submit them in file-offset order.
- the threads could process more or fewer requests per export to
  provide more or less throughput on a per-export basis
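To make the idea concrete, here is a rough sketch in C (all names are
hypothetical; this is not existing OST code): requests are queued per
export on arrival, a service thread drains up to four from one export,
sorts them by file offset, and submits them as a single larger disk IO.
The per-export quota doubles as the bandwidth throttle:

#include <stdlib.h>
#include <stdio.h>

#define RPC_SIZE      (1 << 20)   /* 1MB wire RPC, unchanged */
#define COALESCE_MAX  4           /* 4 x 1MB -> one 4MB disk IO */

struct io_req {
        long long        offset;  /* file offset of this 1MB request */
        struct io_req   *next;
};

struct export_sketch {
        struct io_req   *pending; /* queued on arrival, no IO thread */
        int              quota;   /* per-export throttle knob */
};

static int cmp_offset(const void *a, const void *b)
{
        const struct io_req *ra = *(const struct io_req *const *)a;
        const struct io_req *rb = *(const struct io_req *const *)b;
        return (ra->offset > rb->offset) - (ra->offset < rb->offset);
}

/* Called when an OST service thread visits this export on the
 * pending-export list. */
static void service_export(struct export_sketch *exp)
{
        struct io_req *batch[COALESCE_MAX];
        int n = 0;

        /* Take up to 'quota' requests; raising or lowering the quota
         * gives more or less throughput to this client. */
        while (exp->pending && n < exp->quota && n < COALESCE_MAX) {
                batch[n++] = exp->pending;
                exp->pending = exp->pending->next;
        }
        if (n == 0)
                return;

        /* Submit in file-offset order so adjacent 1MB requests merge
         * into one large (up to 4MB) disk IO. */
        qsort(batch, n, sizeof(batch[0]), cmp_offset);
        printf("disk IO: %d MB starting at offset %lld\n",
               n, batch[0]->offset);
}

int main(void)
{
        /* Two 1MB requests arriving out of offset order. */
        struct io_req r2 = { 0, NULL };
        struct io_req r1 = { RPC_SIZE, &r2 };
        struct export_sketch exp = { &r1, COALESCE_MAX };

        service_export(&exp);   /* one 2MB IO at offset 0 */
        return 0;
}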
Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.