Rick Rothstein
2009-Aug-04 14:30 UTC
[Lustre-discuss] Lustre v1.8.0.1 slower than expected large-file, sequential-buffered-file-read speed
Hi - I'm new to Lustre (v1.8.0.1), and I've verified that I can get about 1000 MB/s aggregate throughput for large-file sequential reads using direct I/O (limited only by the speed of my 10GbE NIC with TCP offload engine).

My simple I/O test has the client on a separate machine from the OSTs, with 16 background "dd" processes reading 16 separate files, each residing on a separate disk (OST); e.g., running on the client machine:

  dd if=/mnt/lustre/testfile01 of=/dev/null bs=2097152 count=500 iflag=direct &
  ...
  dd if=/mnt/lustre/testfile16 of=/dev/null bs=2097152 count=500 iflag=direct &

As I said, the direct-I/O "dd" tests above achieve about 1000 MB/s aggregate throughput, but when I try the same tests with normal buffered I/O (by just running "dd" without "iflag=direct"), the runs only reach about 550 MB/s aggregate.

I suspect this slowdown may have something to do with client-side caching, but normal buffered reads have not sped up even after I've tried adjustments such as:

  - lowering the value of max_cached_mb;
  - turning off server-side caching via read_cache_enable;
  - dropping the Linux page cache via /proc/sys/vm/drop_caches;
  - turning debugging off via /proc/sys/lnet/debug.

I have also tried the suggestions discussed in the July 22 lustre-discuss thread "Lustre client memory usage very high"; they did not change my slower-than-expected results.

I'm now going to spend some time reading the detailed Lustre tuning documentation and running the Lustre test programs, and I'd also appreciate any advice from experienced Lustre users on how to speed up these large-file, buffered-I/O, sequential reads.

Thanks for any help.

Rick Rothstein
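P.S. For the record, this is roughly how I applied the adjustments listed above (a sketch only; the parameter names follow my reading of the 1.8 docs, and the specific values and wildcards are examples that would need adapting to your own devices):

  # on the client: shrink the Lustre page cache (128 MB is an arbitrary example value)
  lctl set_param llite.*.max_cached_mb=128

  # on each OSS: disable the server-side read cache
  lctl set_param obdfilter.*.read_cache_enable=0

  # on the client: drop the Linux page/dentry/inode caches between runs
  echo 3 > /proc/sys/vm/drop_caches

  # on the client: turn off Lustre debug logging
  echo 0 > /proc/sys/lnet/debug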
Andreas Dilger
2009-Aug-04 22:08 UTC
[Lustre-discuss] Lustre v1.8.0.1 slower than expected large-file, sequential-buffered-file-read speed
On Aug 04, 2009 10:30 -0400, Rick Rothstein wrote:
> I'm new to Lustre (v1.8.0.1), and I've verified that I can get about
> 1000 MB/s aggregate throughput for large-file sequential reads using
> direct I/O (limited only by the speed of my 10GbE NIC with TCP offload
> engine).
>
> The direct-I/O "dd" tests above achieve about 1000 MB/s aggregate
> throughput, but when I try the same tests with normal buffered I/O
> (by just running "dd" without "iflag=direct"), the runs only reach
> about 550 MB/s aggregate.
>
> I suspect this slowdown may have something to do with client-side
> caching, but normal buffered reads have not sped up even after I've
> tried adjustments such as:

Note that there is significant CPU overhead on the client when using
buffered I/O, simply from copying the data between userspace and the
kernel. Having multiple cores on the client (one per dd process) allows
this copy overhead to be distributed across cores.

You could also run "oprofile" to see if there is anything else of
interest that is consuming a lot of CPU.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
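P.S. A rough oprofile session on the client might look like the following (the vmlinux path is just a placeholder for wherever your debug kernel image lives; adjust to your distro):

  opcontrol --init
  opcontrol --vmlinux=/usr/lib/debug/lib/modules/$(uname -r)/vmlinux
  opcontrol --start

  # ... run the 16-process dd workload here ...

  opcontrol --stop
  opcontrol --dump
  opreport --symbols | head -40    # top kernel/userspace symbols by sample count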
Rick Rothstein
2009-Aug-05 17:30 UTC
[Lustre-discuss] Lustre v1.8.0.1 slower than expected large-file, sequential-buffered-file-read speed
Hi Andreas -

Thanks for the advice. I will gather additional CPU stats and see what shows up.

However, CPU does not seem to be a factor in the slower-than-expected large-file buffered-I/O reads. My machines have dual quad-core 2.66 GHz processors, and gross CPU usage hovers around 50% when I'm running 16 "dd" read jobs.

But a suspected client-side caching problem crops up when I just run a simple "dd" read job twice. The first time I run the single "dd" read job I get an expected throughput of 60 MB/s or so. However, the second time I run the job, I get a throughput of about 2 GB/s, which is twice the top speed of my 10GbE NIC, and only possible, I think, if the entire file was cached on the client during the first "dd" run.

So, if I run 16 "dd" jobs, each trying to cache an entire large file on the client, that could explain the unexpectedly slow aggregate throughput. I would have thought that setting a low value for "max_cached_mb" would have solved this problem, but it made no difference.

A further indication that client-side caching is at the root of the slowdown is that when I run my single "dd" job twice but drop the client-side cache after the first run (via "/proc/sys/vm/drop_caches"), I get the expected 60 MB/s or so throughput for both runs.

Until I learn how to overcome this slowdown problem, I'll see if I can obtain my required concurrent, multi-large-file read speed by carefully striping the files over a few boxes.

Again, thanks for your help, and I'll appreciate any other suggestions you might have, or any ideas for other diagnostics we might run.

Rick
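P.S. For clarity, the repeat-read test looked roughly like this (the file name and sizes are just my example values from the earlier tests):

  # first run: data comes over the wire, ~60 MB/s
  dd if=/mnt/lustre/testfile01 of=/dev/null bs=2097152 count=500

  # second run without dropping caches: served from the client cache, ~2 GB/s
  dd if=/mnt/lustre/testfile01 of=/dev/null bs=2097152 count=500

  # dropping the client page cache between runs brings both back to ~60 MB/s
  echo 3 > /proc/sys/vm/drop_caches
  dd if=/mnt/lustre/testfile01 of=/dev/null bs=2097152 count=500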
Andreas Dilger
2009-Aug-06 00:25 UTC
[Lustre-discuss] Lustre v1.8.0.1 slower than expected large-file, sequential-buffered-file-read speed
On Aug 05, 2009 13:30 -0400, Rick Rothstein wrote:
> My machines have dual quad-core 2.66 GHz processors, and gross CPU
> usage hovers around 50% when I'm running 16 "dd" read jobs.

Be cautious of nice round numbers for CPU usage. Sometimes this means that one CPU is 100% busy and another is 0% busy. With 16 tasks on an 8-core system you are going to get some kind of CPU contention, but whether it is too much is hard to say without digging much more deeply into the code.

> But a suspected client-side caching problem crops up when I just run
> a simple "dd" read job twice. The first time I run the single "dd"
> read job I get an expected throughput of 60 MB/s or so. However, the
> second time I run the job, I get a throughput of about 2 GB/s, which
> is twice the top speed of my 10GbE NIC, and only possible, I think,
> if the entire file was cached on the client during the first "dd" run.

That is a feature. Lustre is cache coherent, so the fact that the whole file can be read from cache on the client with no network I/O is totally safe. The fact that the second read is much faster does not, in itself, indicate any problem.

> So, if I run 16 "dd" jobs, each trying to cache an entire large file
> on the client, that could explain the unexpectedly slow aggregate
> throughput.
>
> A further indication that client-side caching is at the root of the
> slowdown is that when I run my single "dd" job twice but drop the
> client-side cache after the first run (via "/proc/sys/vm/drop_caches"),
> I get the expected 60 MB/s or so throughput for both runs.

Well, that isn't surprising, but it doesn't necessarily indicate why the reads are going _slower_ than without any cache. It is of course very possible that there is some kind of lock contention on the client, but I thought this had been fixed in the 1.8 release (bug 11817).

Note that using O_DIRECT will of course bypass caching, which is still desirable if you know you are not re-using the data.

> Until I learn how to overcome this slowdown problem, I'll see if I can
> obtain my required concurrent, multi-large-file read speed by carefully
> striping the files over a few boxes.

I would run tests with 1, 2, 4, 6, 8, 12, and 16 processes, and see what the per-task performance is. Examining oprofile data per run will tell you what functions become more heavily used when there are more tasks involved.

You might also consider looking at the client rpc_stats, or the (corresponding) server brw_stats, to see if the read RPCs become badly formed with many threads. Also, the client read_ahead_stats would help tell you whether readahead is going badly with many threads.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
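P.S. A rough way to collect those counters around a test run (the wildcarded parameter names follow the usual 1.8 naming; writing 0 to a stats file is the usual way to clear it, and the OSS step runs on the server, not the client):

  # on the client: clear the per-OSC RPC stats and the readahead stats
  lctl set_param osc.*.rpc_stats=0
  lctl set_param llite.*.read_ahead_stats=0

  # ... run the 1/2/4/.../16-process dd test ...

  # on the client: RPC size/concurrency histograms and readahead hit/miss counters
  lctl get_param osc.*.rpc_stats
  lctl get_param llite.*.read_ahead_stats

  # on each OSS: disk-level I/O size histograms for the same run
  lctl get_param obdfilter.*.brw_stats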