Hello, I am new to Lustre and wanted to run a simple small-file copy test between 2 virtual machines, from the MDT/OST server to the client's local disk. I realize small-file performance is never fast, but this seems particularly slow considering the data is all buffered in memory with little to no disk activity.

Setup Info
- Version is 2.4.50
- Average file size is small, < 10 KB
- The amount of data being copied is about 250 MB
- The VMs are on separate hosts

Performance
- 7 minutes over a gigabit network
- NFS takes only 3 minutes

Observations
- iostat on the OST/MDT is usually 0% during the copy, so I assume it is all buffered
- Additional network traffic is minimal
- CPU load on the VMs is 15-20% during the copy
- RPC stats on the client show only 1 RPC in flight at a time; max_rpcs_in_flight is set to 64. Is that expected behavior for a copy?

Here is a snapshot of rpc_stats early during the copy:

                        read                     write
  pages per rpc    rpcs   %  cum %   |   rpcs   %  cum %
  1:               1653  90     90   |      0   0      0
  2:                164   8     98   |      0   0      0
  4:                  7   0     99   |      0   0      0
  8:                  3   0     99   |      0   0      0
  16:                 3   0     99   |      0   0      0
  32:                 5   0     99   |      0   0      0
  64:                 0   0     99   |      0   0      0
  128:                1   0    100   |      0   0      0

                        read                     write
  rpcs in flight   rpcs   %  cum %   |   rpcs   %  cum %
  0:                  0   0      0   |      0   0      0
  1:               1836 100    100   |      0   0      0

                        read                     write
  offset           rpcs   %  cum %   |   rpcs   %  cum %
  0:               1836 100    100   |      0   0      0

As I am new, any suggestions for what to look for or improve would be greatly appreciated.

_______________________________________________
Lustre-discuss mailing list
Lustre-discuss-aLEFhgZF4x6X6Mz3xDxJMA@public.gmane.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
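[For anyone reproducing this: the counters above come from the client's osc layer and can be read with lctl. A minimal sketch follows; the parameter names are the usual ones on Lustre 2.x clients (verify with `lctl list_param osc.*` on your build), and it is guarded so it exits cleanly on a machine without Lustre.]

```shell
# Sketch: inspect client-side RPC tuning on a Lustre 2.x client.
# Assumes the standard osc.* parameter names; verify on your version.
show_rpc_tuning() {
    if command -v lctl >/dev/null 2>&1; then
        lctl get_param osc.*.max_rpcs_in_flight   # per-OSC cap on concurrent RPCs
        lctl get_param osc.*.max_pages_per_rpc    # upper bound on RPC size
        lctl set_param osc.*.rpc_stats=clear      # zero the histogram
        # ... run the test copy here, then re-read the histogram:
        lctl get_param osc.*.rpc_stats
    else
        echo "lctl not found; run this on a Lustre client"
    fi
}
out=$(show_rpc_tuning)
printf '%s\n' "$out"
```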
"test between 2 virtual machines from MDT/OST server to client's local disk."

Andrew,

I'm confused by the description of your test. Can you clarify?

--
Brett Lee
Sr. Systems Engineer
Intel High Performance Data Division

From: lustre-discuss-bounces-aLEFhgZF4x6X6Mz3xDxJMA@public.gmane.org [mailto:lustre-discuss-bounces@lists.lustre.org] On Behalf Of Andrew Mast
Sent: Friday, June 21, 2013 3:42 PM
To: lustre-discuss-aLEFhgZF4x6X6Mz3xDxJMA@public.gmane.org
Subject: [Lustre-discuss] Slow Copy (Small Files) 1 RPC In Flight?
Hello!

On Jun 21, 2013, at 5:42 PM, Andrew Mast wrote:
> Hello, I am new to Lustre and wanted to run a simple small-file copy test between 2 virtual machines, from the MDT/OST server to the client's local disk.
> I realize small-file performance is never fast, but this seems particularly slow considering the data is all buffered in memory with little to no disk activity.
>
> RPC stats on the client show only 1 RPC in flight at a time; max_rpcs_in_flight is set to 64. Is that expected behavior for a copy?

Well, it seems you are reading from Lustre, and small files at that. So Lustre reads a single file at a time (I assume you copy with something like cp, single-threadedly), and readahead does not come into play because the file size is smaller than 1 RPC. So before we are done with a single file, we cannot guess there'd be another request for the next file. That's why you have only one RPC in flight.

Also, the Lustre metadata protocol is somewhat heavier than NFS, which would explain why it's slower than NFS. The situation should improve once you start trying bigger files.

Bye,
Oleg
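[A back-of-envelope sketch of the serialization cost Oleg describes. The file count follows from the thread (about 250 MB at roughly 10 KB per file); the RPCs-per-file and round-trip figures are illustrative assumptions, not measurements from this thread.]

```shell
# Rough cost of fully serialized small-file reads over the network.
files=25000          # ~250 MB / ~10 KB per file (from the thread)
rpcs_per_file=5      # ASSUMED: lookup/open, lock, read, close, etc.
rtt_ms=1             # ASSUMED: per-RPC round trip on a gigabit LAN
total_s=$(( files * rpcs_per_file * rtt_ms / 1000 ))
echo "estimated serialized RPC latency: ${total_s}s"
```

Even with these optimistic guesses, pure round-trip latency alone accounts for minutes, the same order of magnitude as the 7-minute copy observed, which is why concurrency (not bandwidth) is the lever here.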
Hi Brett,

Sorry, I think my choice of wording was not correct.

One VM holds the metadata and the objects. I guess that would mean it is the OSS and MDS? Another VM is the client. It has mounted the Lustre filesystem and also has some local disks. The test is just to use cp to read data to local disk.

Thanks,
Andy

On Fri, Jun 21, 2013 at 3:22 PM, Lee, Brett <brett.lee-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> wrote:
> "test between 2 virtual machines from MDT/OST server to client's local disk."
>
> Andrew,
>
> I'm confused by the description of your test. Can you clarify?
>
> --
> Brett Lee
> Sr. Systems Engineer
> Intel High Performance Data Division
Oleg,

Very clear, thank you for the explanation; I misunderstood readahead. Yes, the 1 GB and 10 GB file transfer tests were on par with NFS.

Our use case is typically compiling and find/grep through large (30 GB) amounts of source code, so it seems we are stuck with small files.

Andy

On Fri, Jun 21, 2013 at 3:42 PM, Drokin, Oleg <oleg.drokin-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> wrote:
> Well, it seems you are reading from Lustre, and small files at that.
> So Lustre reads a single file at a time (I assume you copy with
> something like cp, single-threadedly), and readahead does not come
> into play because the file size is smaller than 1 RPC. So before we
> are done with a single file, we cannot guess there'd be another
> request for the next file. That's why you have only one RPC in flight.
>
> Also, the Lustre metadata protocol is somewhat heavier than NFS, which
> would explain why it's slower than NFS. The situation should improve
> once you start trying bigger files.
>
> Bye,
> Oleg
Hello!

On Jun 21, 2013, at 9:07 PM, Andrew Mast wrote:
> Very clear, thank you for the explanation; I misunderstood readahead. Yes, the 1 GB and 10 GB file transfer tests were on par with NFS.
>
> Our use case is typically compiling and find/grep through large (30 GB) amounts of source code, so it seems we are stuck with small files.

Generally this sort of workload is pretty bad for network filesystems due to the large amount of synchronous RPC traffic that you cannot easily predict. You can get a certain speedup by doing several copies in parallel (e.g. one copy per top-level subtree, or whatever), as then you'll at least get concurrent RPCs.

I know some people try to combat this by running a block device on top of the network filesystem and then running some sort of local fs (say, ext4) on top of that block device (loopback based). That allows readahead and caching to work much better, and so on. But this is not without limitations either: only a single node can have this filesystem-file mounted at any one time.

If you do not have any significant writes to this fileset (if any at all), but a lot of consecutive reads/greps, you might want to just store the entire workset as a tar file that you read and unpack locally on a client (should be pretty fast), say to a ramfs (you need tons of RAM, of course), and then do the searches there. Also not ideal, but at least the network filesystem would then be doing what it is best suited for: large transfers.

If you can come up with some other way of storing a large number of smaller files in a single large combined file that you then access with special tools (like, I dunno, a fuse-tarfs or whatever, assuming those don't read unneeded data but just skip over it, or something more specific to your case), this might be a winner too.

Bye,
Oleg
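[A rough sketch of the first and third suggestions above (parallel per-subtree copies, and packing the tree into one tar file to unpack locally). It uses throwaway local directories as a stand-in for the Lustre mount; all paths are illustrative.]

```shell
# Build a small stand-in source tree (in place of the Lustre mount).
src=/tmp/lustre_demo_src; dst=/tmp/lustre_demo_dst
rm -rf "$src" "$dst"
mkdir -p "$src/a" "$src/b" "$dst"
for i in 1 2 3; do echo data > "$src/a/f$i"; echo data > "$src/b/f$i"; done

# 1) Parallel copy: one cp per top-level subtree, so several file reads
#    (and hence several RPCs) are in flight at once.
find "$src" -mindepth 1 -maxdepth 1 -type d | xargs -P 4 -I{} cp -r {} "$dst/"

# 2) Pack-and-ship: store the tree as one tar file so the network
#    filesystem does a single large streaming read, then unpack locally
#    (e.g. to a ramfs) before searching.
tar -C "$src" -cf /tmp/lustre_demo.tar .
mkdir -p "$dst/unpacked"
tar -C "$dst/unpacked" -xf /tmp/lustre_demo.tar
```

On a real Lustre mount the parallel variant only helps up to max_rpcs_in_flight concurrent requests per OSC, and the tar variant assumes the workset is read-mostly, as Oleg notes above.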