On Dec 15, 2006 15:23 -0400, Peter Bojanic wrote:
> In a call today with Cray, we briefly discussed 4MB IOs on the
> DDN 9500. I've heard and seen numbers from one customer (I'm not
> sure I should name them, so I will not) that demonstrate they get
> better performance with our patch that coalesces 4MB IOs for the
> Lustre VFS client. Nic Henke mentioned a discussion with eeb that
> debated the advantages of 4MB IOs wrt LNET.
>
> Can you advise:
> - is the 4MB IO scenario specific to the Elan configuration of our
> first customer who used DDN 9500s
No, I don't think the 4MB IO performance relates to the network at all,
except that a low-bandwidth network like GigE wouldn't show any
difference because the network, not the disk IO, is the bottleneck.
> - what is the drawback of 4MB IOs wrt LNET
Generating large IOVs is problematic for some LNDs because they can't
handle more than 256 pages at a time in the scatter-gather list; this
limitation is LND-specific.
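For concreteness: with 4KB pages a 4MB bulk spans 1024 pages, so an LND
limited to 256 scatter-gather entries would have to split it into four
separate transfers. A minimal arithmetic sketch (the constant names are
stand-ins for illustration, not the real LNET definitions):

#include <stdio.h>

#define PAGE_SIZE_4K   4096
#define LND_MAX_FRAGS  256      /* scatter-gather entries per message */

int main(void)
{
        unsigned long io_size = 4UL << 20;              /* 4MB IO */
        unsigned long npages  = io_size / PAGE_SIZE_4K; /* 1024 pages */
        unsigned long nmsgs   =
                (npages + LND_MAX_FRAGS - 1) / LND_MAX_FRAGS;

        /* 1024 pages / 256 frags per message = 4 transfers */
        printf("%lu pages -> %lu messages\n", npages, nmsgs);
        return 0;
}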
> - is there a clear verdict for Lustre VFS clients?
We'll have to wait until Jody can test Lustre clients against the DDN
9500; I don't recall offhand whether the previous results were
end-to-end or with sgp-dd.
> - does this even matter for liblustre, for which we're unable to
> aggregate IOs?
Yes, possibly even more so, because if liblustre can't do asynchronous
IOs then IO performance is all the more important for applications (on
Linux the application can resume computation while Lustre flushes the
cache in the background). That said, liblustre would only be able to
take advantage of larger RPC IO sizes if the application is itself
doing such large IOs, while Linux clients can aggregate multiple
smaller IOs into a single larger one on the wire.
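To make that contrast concrete, here is a toy sketch (all names are
hypothetical, not the actual client code): a caching client batches
page-sized writes until a full RPC's worth accumulates, while a
cache-less client like liblustre sends whatever size the application
hands it:

#include <stddef.h>
#include <stdio.h>

#define PAGE_SIZE_4K  4096
#define RPC_SIZE      (1 << 20)                 /* 1MB wire RPC */
#define RPC_PAGES     (RPC_SIZE / PAGE_SIZE_4K)

struct write_cache {
        void   *pages[RPC_PAGES];
        size_t  count;
};

/* Stub: hand the accumulated pages to the network as one bulk RPC. */
static void flush_rpc(void **pages, size_t count)
{
        (void)pages;    /* a real client would map these for the bulk */
        printf("sending RPC of %zu pages (%zu bytes)\n",
               count, count * PAGE_SIZE_4K);
}

/* Caching (Linux-client style): write() returns once the page is
 * queued; a full 1MB RPC goes out only when enough pages accumulate,
 * so many small application writes become one large wire IO. */
static void cached_write(struct write_cache *wc, void *page)
{
        wc->pages[wc->count++] = page;
        if (wc->count == RPC_PAGES) {
                flush_rpc(wc->pages, wc->count);
                wc->count = 0;
        }
}

/* Cache-less (liblustre style): each application write is one RPC, so
 * the wire IO is only as large as the application's own IO. */
static void direct_write(void *page)
{
        flush_rpc(&page, 1);
}

int main(void)
{
        struct write_cache wc = { .count = 0 };
        char page[PAGE_SIZE_4K];
        int i;

        for (i = 0; i < RPC_PAGES; i++)
                cached_write(&wc, page);   /* one 1MB RPC at the end */
        direct_write(page);                /* one 4KB RPC immediately */
        return 0;
}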
I actually thought of a very interesting solution to this that allows
large disk IO sizes without even changing the wire protocol. It would
also allow per-client bandwidth throttling, which some customers have
expressed interest in. Eric has said repeatedly that a 1MB IO size is
plenty large enough to saturate the network, and that the reason 4MB IO
is faster is purely a function of the disk IOs. Jody's recent sgp-dd
testing has shown that raw disk performance does increase dramatically
for 4MB IOs compared to 1MB IOs.
The changes needed would be as follows (a rough sketch in code follows
the list):
- clients would submit IOs as normal (1MB) to the OSTs
- at the OST side the requests are immediately added to that client's
  export instead of waiting in the incoming request queue for an IO
  thread to handle them (likely a single thread would decode enough of
  each request to figure out which export to attach it to), adding
  the exports to a list of exports with pending requests
- the OST service threads would walk the pending-export list and
  process some number of requests from each export. If a thread
  processed 4 bulk IO requests together it would give us 4MB IOs to
  disk with no change to the wire protocol. It could even be smart and
  pick among the pending requests to submit them in file-offset order.
- the threads could process more or fewer requests per export to
  provide more or less throughput on a per-export basis
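To make the idea concrete, here is a rough sketch in C (all names are
hypothetical; this is not existing OST code): requests are queued per
export on arrival, a service thread drains up to four from one export,
sorts them by file offset, and submits them as a single larger disk IO.
The per-export quota doubles as the bandwidth throttle:

#include <stdlib.h>
#include <stdio.h>

#define RPC_SIZE      (1 << 20)   /* 1MB wire RPC, unchanged */
#define COALESCE_MAX  4           /* 4 x 1MB -> one 4MB disk IO */

struct io_req {
        long long        offset;  /* file offset of this 1MB request */
        struct io_req   *next;
};

struct export_sketch {
        struct io_req   *pending; /* queued on arrival, no IO thread */
        int              quota;   /* per-export throttle knob */
};

static int cmp_offset(const void *a, const void *b)
{
        const struct io_req *ra = *(const struct io_req *const *)a;
        const struct io_req *rb = *(const struct io_req *const *)b;
        return (ra->offset > rb->offset) - (ra->offset < rb->offset);
}

/* Called when an OST service thread visits this export on the
 * pending-export list. */
static void service_export(struct export_sketch *exp)
{
        struct io_req *batch[COALESCE_MAX];
        int n = 0;

        /* Take up to 'quota' requests; raising or lowering the quota
         * gives more or less throughput to this client. */
        while (exp->pending && n < exp->quota && n < COALESCE_MAX) {
                batch[n++] = exp->pending;
                exp->pending = exp->pending->next;
        }
        if (n == 0)
                return;

        /* Submit in file-offset order so adjacent 1MB requests merge
         * into one large (up to 4MB) disk IO. */
        qsort(batch, n, sizeof(batch[0]), cmp_offset);
        printf("disk IO: %d MB starting at offset %lld\n",
               n, batch[0]->offset);
}

int main(void)
{
        /* Two 1MB requests arriving out of offset order. */
        struct io_req r2 = { 0, NULL };
        struct io_req r1 = { RPC_SIZE, &r2 };
        struct export_sketch exp = { &r1, COALESCE_MAX };

        service_export(&exp);   /* one 2MB IO at offset 0 */
        return 0;
}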
Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.