I'm running an OST (not a client on an OST) with SW RAID 5. My interconnect is IB, and I'm using OFED 1.3.1. I added OPROFILE to my kernel, to see if I could find a bottleneck. The biggest CPU user, at 25%, was copy_user_generic_c.

Grepping through the linux, ofed, and lustre code, I cannot find where this is being called. Can anyone suggest where this is being called, and why?

-Roger

Roger Spellman
Staff Engineer
Terascala, Inc.
508-588-1501
www.terascala.com <http://www.terascala.com/>
On Jul 07, 2008 17:42 -0400, Roger Spellman wrote:
> I'm running an OST (not a client on an OST) with SW RAID 5. My
> interconnect is IB, and I'm using OFED 1.3.1. I added OPROFILE to my
> kernel, to see if I could find a bottleneck. The biggest CPU user, at
> 25%, was copy_user_generic_c.
>
> Grepping through the linux, ofed, and lustre code, I cannot find where
> this is being called. Can anyone suggest where this is being called,
> and why?

This is a well-known problem - this kernel function copies data from userspace to the kernel buffers on a write, and vice versa on a read.

The way to avoid this is by using O_DIRECT, but as a result you will not get cached data on the client. That means you will not be able to do cached writes (i.e. write behind) and will have to wait for IO completion on each write (i.e. sync writes).

If you are doing enough IO to hit a bottleneck with copy_{to,from}_user() then you can probably also do large enough IOs to make the sync IO performance hit of O_DIRECT negligible.

We are looking at how to spread the load of copy_{to,from}_user() over more CPUs, but that is not likely to make it into a Lustre release for some time yet. Completely avoiding the copy while allowing a cache on the client would require major VFS/VM surgery (e.g. ensuring the buffers are aligned and marking the pages read-only, forcing the client to fault them if it changes the buffer again).

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
Andreas,

Thanks for this information.

But, I'm seeing this problem on an OST, not on a client. Why would an OST be doing copy_to/from_user()? On a write, the IB card should be directly placing the data. So, shouldn't the data already be in kernel space?

Thanks.

-Roger

> This is a well-known problem - this kernel function is copying data
> from userspace to the kernel buffers on a write, and vice versa on
> a read.
On Jul 07, 2008 20:36 -0400, Roger Spellman wrote:
> Thanks for this information.
> But, I'm seeing this problem on an OST, not on a client. Why would
> an OST be doing copy_to/from_user()? On a write, the IB card should
> be directly placing the data. So, shouldn't the data already be in
> kernel space?

Yes, by all means it shouldn't need a copy on the OST - that is what RDMA is for. You definitely are not running Samba exports on the OST node? I can't imagine what else would be doing this on an OST.

Your oprofile output should be able to show the callchain for the busiest callpaths. Alternately, if this is active 25% of the time it may be enough to do "echo p > /proc/sysrq-trigger" 16 times and see what the resulting stacks are. In theory 4 of them should have copy_{to,from}_user() at the top of the stack.

> > This is a well-known problem - this kernel function is copying data
> > from userspace to the kernel buffers on a write, and vice versa on
> > a read.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
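A minimal sketch of the repeated sysrq sampling suggested above. The function name sample_stacks, its arguments, and the one-second default interval are illustrative assumptions, not part of Lustre or the kernel; the trigger path is a parameter only so that the loop can be dry-run against a scratch file.

```shell
#!/bin/sh
# Sketch: take N register/stack dumps via sysrq "p", pausing between them.
# sample_stacks and its parameters are hypothetical names for illustration.
sample_stacks() {
    count="${1:-16}"                      # number of samples to take
    interval="${2:-1}"                    # seconds to wait between samples
    trigger="${3:-/proc/sysrq-trigger}"   # overridable for a dry run
    i=1
    while [ "$i" -le "$count" ]; do
        echo p > "$trigger" || return 1   # needs root on a real system
        echo "sample $i/$count"
        sleep "$interval"
        i=$((i + 1))
    done
}
```

Run as root, "sample_stacks 16 1" would request 16 dumps one second apart. The dumps themselves land in the kernel log, so they would be read afterwards with dmesg or from the console.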
> Yes, by all means it shouldn't need a copy on the OST - that is what
> RDMA is for.

Agreed!

> You definitely are not running Samba exports on the OST
> node?

Certainly not.

I can't imagine what else would be doing this on an OST.

> Your oprofile output should be able to show the callchain for the
> busiest callpaths. Alternately, if this is active 25% of the time it
> may be enough to do "echo p > /proc/sysrq-trigger" 16 times and see
> what the resulting stacks are. In theory 4 of them should have
> copy_{to,from}_user() at the top of the stack.

Andreas, why would 4 threads have copy_{to,from}_user() at the top of the stack? Are certain threads supposed to be doing that on an OST?

Thanks,

Roger
On Tue, 2008-07-08 at 09:46 -0400, Roger Spellman wrote:
> > Yes, by all means it shouldn't need a copy on the OST - that is what
> > RDMA is for.
>
> Agreed!
>
> > You definitely are not running Samba exports on the OST
> > node?
>
> Certainly not.
>
> I can't imagine what else would be doing this on an OST.
>
> > Your oprofile output should be able to show the callchain for the
> > busiest callpaths. Alternately, if this is active 25% of the time it
> > may be enough to do "echo p > /proc/sysrq-trigger" 16 times and see
> > what the resulting stacks are. In theory 4 of them should have
> > copy_{to,from}_user() at the top of the stack.
>
> Andreas, why would 4 threads have copy_(to,from)_user at the top of the
> stack? Are certain threads supposed to be doing that on an OST?

Since the CPU on the OST is active 25% of the time, triggering a stack trace 16 times should give us the stack trace for the copy_{to,from}_user() functions around 4 times.

I don't think copy_{to,from}_user() is expected to be called on the OST with that frequency (if any), so having the stack trace will help us determine where it is being called from.

Thanks,
Kalpak

> Thanks,
>
> Roger
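The arithmetic behind "around 4 times" can be sketched as follows, assuming the 16 samples are independent and each one lands in the function with probability 0.25. The helper name expected_hits is hypothetical, and the independence assumption is an idealisation of real sampling.

```shell
#!/bin/sh
# Sketch: expected number of hits in n samples, and the chance of seeing none.
# expected_hits is a hypothetical helper; it assumes independent samples.
expected_hits() {
    # $1 = number of samples, $2 = fraction of time spent in the function
    awk -v n="$1" -v p="$2" 'BEGIN {
        printf "expected=%.1f p_zero=%.4f\n", n * p, (1 - p) ^ n
    }'
}
```

"expected_hits 16 0.25" prints "expected=4.0 p_zero=0.0100". Under these assumptions there is only about a 1% chance that none of the 16 traces shows the function, so if it never appears, the 25% figure itself deserves a second look.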
Kalpak,

Thank you for the clarification.

-Roger

> Since the CPU on the OST is active 25% of the time, triggering a stack
> trace 16 times, should give us the stack trace for the
> copy_{to,from}_user() functions around 4 times.
>
> I don't think copy_{to,from}_user() is expected to be called on the OST
> with that frequency(if any) so having the stack trace will help us
> determine from where it is being called.
>
> Thanks,
> Kalpak
It turns out that there was a problem in how I was using oprofile. I was doing opcontrol --start and opcontrol --stop, but I forgot to do an opcontrol --reset in between. So, in addition to recording my OST results, oprofile was picking up some old data, which is probably where the copy_{to,from}_user() came from.

Lesson learned: Always do opcontrol --reset before opcontrol --start.

Thanks to everyone who helped me out.

Roger Spellman
Staff Engineer
Terascala, Inc.
508-588-1501
www.terascala.com

> I don't think copy_{to,from}_user() is expected to be called on the OST
> with that frequency(if any) so having the stack trace will help us
> determine from where it is being called.
>
> Thanks,
> Kalpak
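A minimal sketch of a profiling run that follows this lesson, assuming the legacy opcontrol interface used in this thread. The function name profile_clean, the OPCONTROL variable, and the duration argument are illustrative assumptions; only the --reset, --start, and --stop options come from the messages above.

```shell
#!/bin/sh
# Sketch: one oprofile session that always discards stale samples first.
# profile_clean and OPCONTROL are hypothetical names for illustration.
OPCONTROL="${OPCONTROL:-opcontrol}"

profile_clean() {
    duration="${1:-60}"                  # seconds to collect samples
    "$OPCONTROL" --reset || return 1     # drop data left from earlier sessions
    "$OPCONTROL" --start || return 1     # begin collecting
    sleep "$duration"                    # run the workload during this window
    "$OPCONTROL" --stop  || return 1     # stop collecting
}
```

Wrapping the three steps in one function makes it hard to start a session without resetting first. The report would then be read with opreport as usual.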