Rick Rothstein
2009-Aug-04 14:30 UTC
[Lustre-discuss] Lustre v1.8.0.1 slower than expected large-file, sequential-buffered-file-read speed
Hi - I'm new to Lustre (v1.8.0.1), and I've verified that I can get about 1000 MB/s aggregate throughput for large-file sequential reads using direct I/O (limited only by the speed of my 10GbE NIC with TCP offload engine).

My simple I/O test has the client on a separate machine from the OSTs, with 16 background "dd" processes reading 16 separate files, each residing on a separate disk (OST); e.g., running on the client machine:

  dd if=/mnt/lustre/testfile01 of=/dev/null bs=2097152 count=500 iflag=direct &
  ...
  dd if=/mnt/lustre/testfile16 of=/dev/null bs=2097152 count=500 iflag=direct &

As I said, the direct-I/O "dd" tests above achieve about 1000 MB/s aggregate throughput, but when I try the same tests with normal buffered I/O (by just running "dd" without "iflag=direct"), the runs only reach about 550 MB/s aggregate.

I suspect this slowdown may have something to do with client-side caching, but normal buffered reads have not sped up even after I've tried adjustments such as:

  - lowering the value of max_cached_mb;
  - turning off server-side caching via read_cache_enable;
  - dropping the Linux page cache via /proc/sys/vm/drop_caches;
  - turning debugging off via /proc/sys/lnet/debug.

I have also tried the suggestions discussed in the July 22 lustre-discuss thread "Lustre client memory usage very high"; they did not change my slower-than-expected results.

I'm now going to spend some time reading the detailed Lustre tuning documentation and running the Lustre test programs, and I'd also appreciate any advice from experienced Lustre users on how to speed up these large-file, buffered-I/O, sequential reads.

Thanks for any help.

Rick Rothstein
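P.S. For the record, this is roughly how I applied the adjustments listed above (a sketch only; the parameter names follow my reading of the 1.8 docs, and the specific values and wildcards are examples that would need adapting to your own devices):

  # on the client: shrink the Lustre page cache (128 MB is an arbitrary example value)
  lctl set_param llite.*.max_cached_mb=128

  # on each OSS: disable the server-side read cache
  lctl set_param obdfilter.*.read_cache_enable=0

  # on the client: drop the Linux page/dentry/inode caches between runs
  echo 3 > /proc/sys/vm/drop_caches

  # on the client: turn off Lustre debug logging
  echo 0 > /proc/sys/lnet/debug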
Andreas Dilger
2009-Aug-04 22:08 UTC
[Lustre-discuss] Lustre v1.8.0.1 slower than expected large-file, sequential-buffered-file-read speed
On Aug 04, 2009 10:30 -0400, Rick Rothstein wrote:
> I'm new to Lustre (v1.8.0.1), and I've verified that I can get about
> 1000 MB/s aggregate throughput for large-file sequential reads using
> direct I/O (limited only by the speed of my 10GbE NIC with TCP offload
> engine).
>
> The direct-I/O "dd" tests above achieve about 1000 MB/s aggregate
> throughput, but when I try the same tests with normal buffered I/O
> (by just running "dd" without "iflag=direct"), the runs only reach
> about 550 MB/s aggregate.
>
> I suspect this slowdown may have something to do with client-side
> caching, but normal buffered reads have not sped up even after I've
> tried adjustments such as:

Note that there is significant CPU overhead on the client when using
buffered I/O, simply from copying the data between userspace and the
kernel. Having multiple cores on the client (one per dd process) allows
this copy overhead to be distributed across cores.

You could also run "oprofile" to see if there is anything else of
interest that is consuming a lot of CPU.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
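P.S. A rough oprofile session on the client might look like the following (the vmlinux path is just a placeholder for wherever your debug kernel image lives; adjust to your distro):

  opcontrol --init
  opcontrol --vmlinux=/usr/lib/debug/lib/modules/$(uname -r)/vmlinux
  opcontrol --start

  # ... run the 16-process dd workload here ...

  opcontrol --stop
  opcontrol --dump
  opreport --symbols | head -40    # top kernel/userspace symbols by sample count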
Rick Rothstein
2009-Aug-05 17:30 UTC
[Lustre-discuss] Lustre v1.8.0.1 slower than expected large-file, sequential-buffered-file-read speed
Hi Andreas -

Thanks for the advice. I will gather additional CPU stats and see what shows up.

However, CPU does not seem to be a factor in the slower-than-expected large-file buffered-I/O reads. My machines have dual quad-core 2.66 GHz processors, and gross CPU usage hovers around 50% when I'm running 16 "dd" read jobs.

But a suspected client-side caching problem crops up when I just run a simple "dd" read job twice. The first time I run the single "dd" read job I get an expected throughput of 60 MB/s or so. However, the second time I run the job, I get a throughput of about 2 GB/s, which is twice the top speed of my 10GbE NIC, and only possible, I think, if the entire file was cached on the client during the first "dd" run.

So, if I run 16 "dd" jobs, each trying to cache an entire large file on the client, that could explain the unexpectedly slow aggregate throughput. I would have thought that setting a low value for "max_cached_mb" would have solved this problem, but it made no difference.

A further indication that client-side caching is at the root of the slowdown is that when I run my single "dd" job twice but drop the client-side cache after the first run (via "/proc/sys/vm/drop_caches"), I get the expected 60 MB/s or so throughput for both runs.

Until I learn how to overcome this slowdown problem, I'll see if I can obtain my required concurrent, multi-large-file read speed by carefully striping the files over a few boxes.

Again, thanks for your help, and I'll appreciate any other suggestions you might have, or any ideas for other diagnostics we might run.

Rick
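P.S. For clarity, the repeat-read test looked roughly like this (the file name and sizes are just my example values from the earlier tests):

  # first run: data comes over the wire, ~60 MB/s
  dd if=/mnt/lustre/testfile01 of=/dev/null bs=2097152 count=500

  # second run without dropping caches: served from the client cache, ~2 GB/s
  dd if=/mnt/lustre/testfile01 of=/dev/null bs=2097152 count=500

  # dropping the client page cache between runs brings both back to ~60 MB/s
  echo 3 > /proc/sys/vm/drop_caches
  dd if=/mnt/lustre/testfile01 of=/dev/null bs=2097152 count=500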
Andreas Dilger
2009-Aug-06 00:25 UTC
[Lustre-discuss] Lustre v1.8.0.1 slower than expected large-file, sequential-buffered-file-read speed
On Aug 05, 2009 13:30 -0400, Rick Rothstein wrote:
> My machines have dual quad-core 2.66 GHz processors, and gross CPU
> usage hovers around 50% when I'm running 16 "dd" read jobs.

Be cautious of nice round numbers for CPU usage. Sometimes this means that one CPU is 100% busy and another is 0% busy. With 16 tasks on an 8-core system you are going to get some kind of CPU contention, but whether it is too much is hard to say without digging much more deeply into the code.

> But a suspected client-side caching problem crops up when I just run
> a simple "dd" read job twice. The first time I run the single "dd"
> read job I get an expected throughput of 60 MB/s or so. However, the
> second time I run the job, I get a throughput of about 2 GB/s, which
> is twice the top speed of my 10GbE NIC, and only possible, I think,
> if the entire file was cached on the client during the first "dd" run.

That is a feature. Lustre is cache coherent, so the fact that the whole file can be read from cache on the client with no network I/O is totally safe. The fact that the second read is much faster does not, in itself, indicate any problem.

> So, if I run 16 "dd" jobs, each trying to cache an entire large file
> on the client, that could explain the unexpectedly slow aggregate
> throughput.
>
> A further indication that client-side caching is at the root of the
> slowdown is that when I run my single "dd" job twice but drop the
> client-side cache after the first run (via "/proc/sys/vm/drop_caches"),
> I get the expected 60 MB/s or so throughput for both runs.

Well, that isn't surprising, but it doesn't necessarily indicate why the reads are going _slower_ than without any cache. It is of course very possible that there is some kind of lock contention on the client, but I thought this had been fixed in the 1.8 release (bug 11817).

Note that using O_DIRECT will of course bypass caching, which is still desirable if you know you are not re-using the data.

> Until I learn how to overcome this slowdown problem, I'll see if I can
> obtain my required concurrent, multi-large-file read speed by carefully
> striping the files over a few boxes.

I would run tests with 1, 2, 4, 6, 8, 12, and 16 processes, and see what the per-task performance is. Examining oprofile data per run will tell you what functions become more heavily used when there are more tasks involved.

You might also consider looking at the client rpc_stats, or the (corresponding) server brw_stats, to see if the read RPCs become badly formed with many threads. Also, the client read_ahead_stats would help tell you whether readahead is going badly with many threads.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
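P.S. A rough way to collect those counters around a test run (the wildcarded parameter names follow the usual 1.8 naming; writing 0 to a stats file is the usual way to clear it, and the OSS step runs on the server, not the client):

  # on the client: clear the per-OSC RPC stats and the readahead stats
  lctl set_param osc.*.rpc_stats=0
  lctl set_param llite.*.read_ahead_stats=0

  # ... run the 1/2/4/.../16-process dd test ...

  # on the client: RPC size/concurrency histograms and readahead hit/miss counters
  lctl get_param osc.*.rpc_stats
  lctl get_param llite.*.read_ahead_stats

  # on each OSS: disk-level I/O size histograms for the same run
  lctl get_param obdfilter.*.brw_stats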