zhengfeng
2012-Mar-15 04:10 UTC
[Lustre-discuss] Lustre read performance decay when OSSes are assigned in two different subnet
Dear all,

We have run into a problem: Lustre read performance drops when the OSSes are assigned to two different subnets. The two setups are described in the diagrams below.

Diagram 1, OSSes in different subnets:

            Client
       (subnet 10.0.1.2)
              |
              |
              |
            Switch
            |    |
            |    |
            |    |
          OSS1  OSS2
    (10.0.2.2)  (10.0.3.2)

For diagram 1, we put the client, OSS1 and OSS2 in 3 different subnets. The switch is able to forward all packets. We used dd to test read/write performance, writing/reading data to/from OSS1 and OSS2 at the same time.

Test result:

    [root@client client]# time dd if=test2 of=/dev/null bs=1M count=2000
    2000+0 records in
    2000+0 records out
    2097152000 bytes (2.1 GB) copied, 53.5922 seconds, 39.1 MB/s

    real 0m53.796s
    user 0m0.005s
    sys 0m2.914s

Diagram 2, OSSes in the same subnet:

            Client
       (subnet 10.0.1.2)
              |
              |
              |
            Switch
            |    |
            |    |
            |    |
          OSS1  OSS2
    (10.0.2.2, 10.0.2.3, in the same subnet)

For diagram 2, we assigned OSS1 and OSS2 to the same subnet, then tested again.

Test result:

    [root@client219 client]# time dd of=/dev/null if=test1 bs=1M
    10000+0 records in
    10000+0 records out
    10485760000 bytes (10 GB) copied, 193.07 seconds, 54.3 MB/s

Conclusion: with the OSSes in different subnets, read performance is 39.1 MB/s, while with the OSSes in the same subnet it is 54.3 MB/s. That is a large drop.

Question: why does performance drop when different subnets are used in Lustre? Has anyone met this problem before?

Many thanks for your answers and advice.

B.R.
Feng
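As a quick sanity check on the figures in this message, the sketch below recomputes both throughput values from the byte counts and elapsed times that dd printed, along with the relative drop between them. It assumes that dd's "MB" means 10^6 bytes, which matches the reported numbers; the helper name `throughput_mb_s` is illustrative only.

```python
def throughput_mb_s(num_bytes: int, seconds: float) -> float:
    """Return throughput in MB/s, with 1 MB = 10**6 bytes (assumed dd convention)."""
    return num_bytes / seconds / 1e6


# Byte counts and elapsed times copied from the two dd runs in the message.
different_subnets = throughput_mb_s(2_097_152_000, 53.5922)
same_subnet = throughput_mb_s(10_485_760_000, 193.07)

# Relative slowdown of the three-subnet layout versus the same-subnet layout.
drop = (same_subnet - different_subnets) / same_subnet

print(f"different subnets: {different_subnets:.1f} MB/s")
print(f"same subnet:       {same_subnet:.1f} MB/s")
print(f"relative drop:     {drop:.0%}")
```

The drop works out to a bit over a quarter. Note that the two runs are not identical (2000 records from test2 versus 10000 records from test1), so the comparison is only approximate.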
Hammitt, Charles Allen
2012-Mar-15 12:50 UTC
[Lustre-discuss] Lustre read performance decay when OSSes are assigned in two different subnet
Networking overhead? VLAN routing, perhaps: either 1) an extra network device hop, with latency added by a network device/router, or 2) an overburdened switch handling the routing itself and still introducing network latency. Latency is the storage and network I/O bandwidth killer.

I'm willing to bet two things:

1) Changing your stripe count from 2 to 1 will give bandwidth results similar to diagram 2 [54.3 MB/s], even if the layout is as in diagram 1 [separate nets].
2) If all your OSS/MDS and client nodes were in the same single VLAN network, you'd see better performance than diagram 2's 54.3 MB/s throughput.

So, drop classful subnets; go with CIDR / supernetting networks to get the IP spaces you need, and drop the extra routing latency.

Regards,
Charles

--
Charles Hammitt
Storage Systems Specialist
ITS Research Computing @ The University of North Carolina-CH
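To make the supernetting suggestion concrete, here is a minimal sketch using Python's standard `ipaddress` module. It assumes each address from diagram 1 sits in a /24; the thread never states the netmasks actually configured, so the prefixes are hypothetical.

```python
import ipaddress

# Assumption: every host in diagram 1 uses a /24 netmask (not stated in the thread).
client = ipaddress.ip_interface("10.0.1.2/24")
oss1 = ipaddress.ip_interface("10.0.2.2/24")
oss2 = ipaddress.ip_interface("10.0.3.2/24")

# Three distinct networks: traffic between any two of them has to be routed.
networks = {client.network, oss1.network, oss2.network}
print(sorted(networks))

# Smallest single network that contains all three hosts.
supernet = oss1.network.supernet(new_prefix=22)
print(supernet)

# With this wider prefix on every node, all three hosts are on-link to each other.
print(all(host.ip in supernet for host in (client, oss1, oss2)))
```

Under that /24 assumption, a single /22 covers the client and both OSSes, so the nodes could reach each other without a routing hop.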
zhengfeng
2012-Mar-15 13:38 UTC
[Lustre-discuss] Lustre read performance decay when OSSes are assigned in two different subnet
Thanks a lot, Charles.

I agree with you about this problem, and I did more tests with the following steps:

0) Use 3 subnets for the 3 nodes.
1) Run "netperf" on the two OSSes separately and run "netserver" on the client. This step simulates the networking scenario in which the client reads data from two OSSes, but with no disk I/O or other reads/writes.
2) The two OSSes' netperf results are about 200 M/s each, 400 M/s in total. So low!
3) Running netperf on only one OSS, the test result is 950 M/s. This result is OK.
4) All the steps above show that the network is the bottleneck for read performance.

When 2 nodes send a TCP stream at the same time and only 1 node receives, the total throughput is half of the normal value. So odd. What caused that?

Thanks a lot.

Best regards,
feng
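The arithmetic behind "half of the normal value" can be spelled out as a short sketch. The figures are the netperf results reported in this message; "M/s" is kept as written because the thread does not say whether it means Mbit/s or MByte/s. The "ideal fair share" is an assumption made here for comparison: if the client's link were the only limit, two senders would split the single-sender rate evenly.

```python
# netperf figures from the message, in the unit the poster used ("M/s").
single_sender = 950.0          # one OSS sending alone
per_sender_concurrent = 200.0  # each OSS when both send at the same time
num_senders = 2

aggregate = per_sender_concurrent * num_senders

# Assumed ideal: the receiver's link is the only bottleneck and is shared evenly.
ideal_share = single_sender / num_senders

print(f"aggregate with two senders:          {aggregate:.0f} M/s")
print(f"fraction of single-sender rate:      {aggregate / single_sender:.0%}")
print(f"per-sender rate vs ideal fair share: {per_sender_concurrent / ideal_share:.0%}")
```

Both ratios come out well below 100%, which is what points at the network path rather than at Lustre or the disks.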
Hammitt, Charles Allen
2012-Mar-15 13:50 UTC
[Lustre-discuss] Lustre read performance decay when OSSes are assigned in two different subnet
I'd take a look at how the switch ASICs line up with the connections in the system, to see whether one set is more burdened than another.

And, if routing is not handled by the switch, I'd check whether there are issues with the way routing works for the different networks; perhaps there is a different path, or a similar ASIC or other network performance/congestion problem.

A simple network traceroute might help flush out some questions, as would a conversation with your networking team, tracing cables and looking at interface stats.

Regards,
Charles
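For the host side of "looking at interface stats", a minimal sketch is to read the kernel's per-interface counters and flag any interface with non-zero errors or drops. This assumes the client and OSS nodes run Linux and expose `/proc/net/dev`; the function names `parse_net_dev` and `suspicious` are illustrative. Switch and router port counters are not covered and have to come from the device's own management interface.

```python
from pathlib import Path


def parse_net_dev(text: str) -> dict:
    """Parse Linux /proc/net/dev text into {interface: {counter: value}}."""
    # Order of the 16 per-interface counters in /proc/net/dev.
    rx = ["bytes", "packets", "errs", "drop", "fifo", "frame", "compressed", "multicast"]
    tx = ["bytes", "packets", "errs", "drop", "fifo", "colls", "carrier", "compressed"]
    names = [f"rx_{n}" for n in rx] + [f"tx_{n}" for n in tx]
    stats = {}
    for line in text.splitlines()[2:]:  # skip the two header lines
        if ":" not in line:
            continue
        iface, values = line.split(":", 1)
        stats[iface.strip()] = dict(zip(names, map(int, values.split())))
    return stats


def suspicious(stats: dict) -> dict:
    """Return only the interfaces whose error or drop counters are non-zero."""
    keys = ("rx_errs", "rx_drop", "tx_errs", "tx_drop")
    return {
        iface: {k: counters[k] for k in keys if counters[k]}
        for iface, counters in stats.items()
        if any(counters[k] for k in keys)
    }


# Only runs on hosts that actually provide /proc/net/dev (i.e. Linux).
path = Path("/proc/net/dev")
if path.exists():
    print(suspicious(parse_net_dev(path.read_text())))
```

Running it before and after a test and comparing the two outputs shows whether drops or errors grow while the two OSSes are sending.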
Peter Grandi
2012-Mar-15 23:14 UTC
[Lustre-discuss] Lustre read performance decay when OSSes are assigned in two different subnet
[ ... ]

> 0) Use 3 subnets to assign the 3 nodes.
> 1) Run "netperf" in the two OSS separately, run "netserver" in
> "client";this step could simulate the networking scenario:
> "client" reads data from two OSS, but here is no disk i/o or
> other r/w;

It might be useful for you to see what the OSS-OSS transfer rate is too, and what the transfer rate in the client->OSS direction is, but since this is purely a networking issue it is a bit off-topic here.

> 2) two OSS netperf's results are about 200 M/s, totally are
> 400M/s. so low - -!

I usually prefer to use 'nuttcp' for this; it is far more convenient to run.

> 3) run only netperf at one OSS, the test result is 950M/s..
> this res is ok.

That is unusually good.

> 4) All the upper steps prove that, the networking is the
> bottleneck of the read performance.

> When 2 NODEs send TCP stream at the same time, and only 1 NODE
> recv TCP stream. The total throughput is half of normal value.

Well, if you introduce routing, and your router is perhaps not fully non-blocking 10Gb/s, or it has insufficient or excessive buffering, or it subtly changes latency patterns, that's pretty unexceptional. A lot of studying 'wireshark' traces will show which particular limitation applies. For example, enabling TSO can give very bad results with some NICs.

But when setting up Lustre one usually tries to engineer the simplest/best networking case, not a more complicated one.

> so oddball.. What induced that? Thanks a lot

Q: "Doctor, if I stab my hand with a fork it really hurts".
A: "Don't do it". :-)