Olivier Hargoaa
2010-May-20 14:27 UTC
[Lustre-discuss] Modifying Lustre network (good practices)
Dear All,

We have a cluster holding critical data on Lustre. Each Lustre server and client is attached to three networks: one Ethernet network for administration (eth0), and two other Ethernet networks bonded together (bond0: eth1 & eth2). On Lustre we get poor read performance and good write performance, so we have decided to modify the Lustre network to see whether the problem comes from the network layer.

Currently the Lustre network is bond0. We want to set it to eth0, then eth1, then eth2, and finally back to bond0 in order to compare performance.

To do so, we plan the following steps: unmount the filesystem, reformat the MGS, change the lnet options in the modprobe file, start the new MGS, and finally update our OSTs and MDT with tunefs.lustre, giving the failover and new MGS NIDs with the "--erase-params" and "--writeconf" options.

We tested this successfully on a test filesystem, but we read in the manual that it can be really dangerous. Do you agree with this procedure? Do you have any advice or good practice for this kind of change? What is the danger?

Regards.
--
Olivier, Hargoaa
Phone: + 33 4 76 29 76 25
Brian J. Murrell
2010-May-20 14:43 UTC
[Lustre-discuss] Modifying Lustre network (good practices)
On Thu, 2010-05-20 at 16:27 +0200, Olivier Hargoaa wrote:
> On Lustre we get poor read performances
> and good write performances so we decide to modify Lustre network in
> order to see if problems comes from network layer.

Without any information beyond your statement that "performance is good in one direction but not the other", I wonder why you consider the network the most likely culprit for this problem. I haven't come across very many networks that (weren't designed to be and yet) are fast in one direction and slow in the other.

> Therefore, we'll perform the following steps: we will umount the
> filesystem, reformat the mgs, change lnet options in modprobe file,
> start new mgs server, and finally modify our ost and mdt with
> tunefs.lustre with failover and mgs new nids using "--erase-params" and
> "--writeconf" options.

Sounds like a lot of rigmarole to test something that I would consider to be of low probability (given the brief amount of information you have provided). But even if I did suspect the network were slow in only one direction, before I started mucking with reconfiguring Lustre for different networks, I would do some basic network throughput testing to verify my hypothesis and adjust the probability of the network being the problem accordingly.

Did you do any hardware profiling (i.e. using the lustre-iokit) before deploying Lustre on this hardware? We always recommend profiling the hardware for exactly this reason: explaining performance problems.

Unfortunately, now that you have data on the hardware, it's much more difficult to profile the hardware because to do it properly, you need to be able to write to the disks.

b.
Andreas Dilger
2010-May-20 14:59 UTC
[Lustre-discuss] Modifying Lustre network (good practices)
You should really be using the LNET Self Test (LST) to do network testing. You can do this without changing the Lustre config at all.

Cheers, Andreas
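For reference, an LNET self test session of the kind suggested here can be sketched as below. This is a minimal sketch, not a tested procedure: the function name `lst_plan`, the group names `c` and `s`, the batch name `bulk`, and the NIDs in the usage note are placeholders, and it assumes the lnet_selftest module is already loaded on every node involved. The function only prints the commands, so they can be reviewed before being run on the console node.

```shell
# Print the lst commands for a bulk read or write test between two NID groups.
# Nothing is executed here; the output is a plan to review first.
# Usage: lst_plan <read|write> <client-nids> <server-nids>
lst_plan() {
    op=$1; clients=$2; servers=$3
    cat <<EOF
export LST_SESSION=\$\$
lst new_session bulk_$op
lst add_group c $clients
lst add_group s $servers
lst add_batch bulk
lst add_test --batch bulk --from c --to s brw $op check=simple size=1M
lst run bulk
lst stat c s & sleep 30; kill \$!
lst end_session
EOF
}
```

For example, `lst_plan read '192.168.1.[10-20]@tcp' '192.168.1.[1-4]@tcp'` prints a read test plan for those (made-up) client and server NIDs. Running the same plan once with `read` and once with `write` gives the two directions to compare.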
Nate Pearlstein
2010-May-20 15:03 UTC
[Lustre-discuss] Modifying Lustre network (good practices)
Which bonding method are you using? Has the performance always been this way? Depending on which bonding type you are using and the network hardware involved you might see the behavior you are describing.

--
Sent from my wired giant hulking workstation
Nate Pearlstein - npearl at sgi.com - Product Support Engineer
Johann Lombardi
2010-May-20 15:10 UTC
[Lustre-discuss] Modifying Lustre network (good practices)
On Thu, May 20, 2010 at 10:43:58AM -0400, Brian J. Murrell wrote:
> Without having any other information other than your statement that
> "performance is good in one direction but not the other" I wonder why
> you consider the network as being the most likely candidate as a culprit
> for this problem.

Maybe you should start by running LNET self test to compare read and write performance?

http://wiki.lustre.org/manual/LustreManual18_HTML/LustreIOKit.html#50598014_pgfId-1290255

Johann
Olivier Hargoaa
2010-May-20 15:39 UTC
[Lustre-discuss] Modifying Lustre network (good practices)
Nate Pearlstein wrote:
> Which bonding method are you using? Has the performance always been
> this way? Depending on which bonding type you are using and the network
> hardware involved you might see the behavior you are describing.

Hi,

Here is our bonding configuration.

On the Linux side:

  mode=4 - to use 802.3ad
  miimon=100 - to set the link check interval (ms)
  xmit_hash_policy=layer2+3 - to set the XOR hashing method
  lacp_rate=fast - to set the LACPDU tx rate to request (slow=30s, fast=1s)

On the Ethernet switch side, load balancing is configured as:

  # port-channel load-balance src-dst-mac

Thanks.
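The parameters listed in this message can be cross-checked against what the bonding driver itself reports in /proc/net/bonding/bond0. A minimal sketch, assuming the usual layout of that file (one "Slave Interface: ethN" line per slave); the function name `bond_slaves` is made up for illustration.

```shell
# List the slave interfaces of a bond as reported by the bonding driver.
# Usage: bond_slaves [file]   (defaults to /proc/net/bonding/bond0)
bond_slaves() {
    # Split each line on ": " and print the value of the slave entries only.
    awk -F': ' '/^Slave Interface:/ { print $2 }' "${1:-/proc/net/bonding/bond0}"
}
```

On the setup described above, one would expect it to print eth1 and eth2. The same file also shows the bonding mode and the transmit hash policy actually in effect.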
James Robnett
2010-May-20 15:48 UTC
[Lustre-discuss] Modifying Lustre network (good practices)
Can't really help with your larger question, but I had a similar experience, with write rates appropriate for the network and slower reads.

You might check that you have enabled TCP selective acknowledgments:

  echo 1 > /proc/sys/net/ipv4/tcp_sack

or, as a sysctl setting,

  net.ipv4.tcp_sack = 1

This can help in cases where your OSSs have larger pipes than your clients and your files are striped across multiple OSSs. When multiple OSSs are transmitting to a single client they can overrun the switch buffers and drop packets. This is particularly noticeable when doing IOzone-type benchmarking from a single client with a wide lfs stripe setting.

With selective ACKs enabled the client will request a more minimal set of packets be retransmitted ... or at least that's what I finally deduced when I ran into it.

James Robnett
NRAO/NM
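Before changing anything, it may be worth checking the current setting on both clients and servers. The following is a small sketch; the function name `sack_status` is hypothetical, and the optional argument exists only so the check can be pointed at another file.

```shell
# Report whether TCP selective acknowledgments are enabled.
# Usage: sack_status [file]   (defaults to /proc/sys/net/ipv4/tcp_sack)
sack_status() {
    f=${1:-/proc/sys/net/ipv4/tcp_sack}
    if [ "$(cat "$f")" = "1" ]; then
        echo "enabled"
    else
        echo "disabled"
    fi
}
```

If it reports "disabled", `sysctl -w net.ipv4.tcp_sack=1` turns the setting on at runtime (as root); the `net.ipv4.tcp_sack = 1` line mentioned above, placed in /etc/sysctl.conf, keeps it across reboots.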
Balagopal Pillai
2010-May-20 16:12 UTC
[Lustre-discuss] Modifying Lustre network (good practices)
If you use EtherChannel/ISL on the switch side, wouldn't it perform better with bonding mode 0 on the server side rather than mode 4 for LACP? As far as I know, mode 0 is the only mode capable of providing more bandwidth than a single port, given switch support such as Cisco EtherChannel.

I use mode 6 on several HPC clusters and the performance is not bad, as long as the compute nodes pick up different MAC addresses of the available interfaces on the storage boxes. That is worth a try. It needs no switch support, but works only if all clients are on the same subnet.

On the other hand, I have also seen some weird behavior with ProCurve switches before: two switches had 4 interfaces on a trunk with LACP, and I then put the storage server on mode 4 as well, with LACP, connected to one of the switches. No matter what I tried, there was significant traffic on just one of the interfaces. So I went back to mode 6 on the server and LACP between the switches.

Please look at the interface statistics of each bonded interface on the server and see whether there is an imbalance, such as one interface carrying the bulk of the traffic while the others get nothing.

When looking at performance, there might also be an issue with iowait: check whether dirty pages are piling up and the storage server is hitting the dirty_ratio limit. That could cause temporary lockups and performance issues under heavy load. More aggressive flushing and increasing the dirty_ratio limit might help if that is the case.

Hope this helps!
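The per-interface imbalance check suggested in this message can be done from the byte counters under /sys/class/net. A minimal sketch, assuming the standard sysfs layout (`<iface>/statistics/tx_bytes`); the function name `tx_share` is made up, and the directory is passed explicitly so the function can be tried on a copy of the counters.

```shell
# Print each interface's share of the total transmitted bytes.
# Usage: tx_share <sysfs-net-dir> <iface>...
tx_share() {
    dir=$1; shift
    total=0
    # First pass: sum the transmit counters of all listed interfaces.
    for i in "$@"; do
        total=$((total + $(cat "$dir/$i/statistics/tx_bytes")))
    done
    # Second pass: print each interface's integer percentage of that sum.
    for i in "$@"; do
        b=$(cat "$dir/$i/statistics/tx_bytes")
        if [ "$total" -gt 0 ]; then
            echo "$i $((100 * b / total))%"
        else
            echo "$i 0%"
        fi
    done
}
```

On a server, `tx_share /sys/class/net eth1 eth2` run before and after a read test shows whether one slave carries nearly all outgoing traffic. The counters are cumulative since the interface came up, so the difference between two runs is what matters.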
Olivier Hargoaa
2010-May-20 16:27 UTC
[Lustre-discuss] Modifying Lustre network (good practices)
Hi Brian and all others,

I'm sorry for not giving you all the details. Here is all the information I have.

Regarding our configuration: the Lustre I/O nodes are connected with two bonded 10 Gb/s links. The compute nodes are connected with two bonded 1 Gb/s links.

Raw performance on the servers is fine for both write and read on each OST.

First we ran iperf (several times) and obtained the expected read and write rates. The results are symmetric (read and write) with any number of threads.

Then we tested with LNET self test. Here is our lst command for the write test:

  lst add_test --batch bulkr --from c --to s brw write check=simple size=1M

and the results are:

  [LNet Rates of c]
  [R] Avg: 110 RPC/s Min: 110 RPC/s Max: 110 RPC/s
  [W] Avg: 219 RPC/s Min: 219 RPC/s Max: 219 RPC/s
  [LNet Bandwidth of c]
  [R] Avg: 0.02 MB/s Min: 0.02 MB/s Max: 0.02 MB/s
  [W] Avg: 109.20 MB/s Min: 109.20 MB/s Max: 109.20 MB/s
  [LNet Rates of c]
  [R] Avg: 109 RPC/s Min: 109 RPC/s Max: 109 RPC/s
  [W] Avg: 217 RPC/s Min: 217 RPC/s Max: 217 RPC/s
  [LNet Bandwidth of c]
  [R] Avg: 0.02 MB/s Min: 0.02 MB/s Max: 0.02 MB/s
  [W] Avg: 108.40 MB/s Min: 108.40 MB/s Max: 108.40 MB/s
  [LNet Rates of c]
  [R] Avg: 109 RPC/s Min: 109 RPC/s Max: 109 RPC/s
  [W] Avg: 217 RPC/s Min: 217 RPC/s Max: 217 RPC/s
  [LNet Bandwidth of c]
  [R] Avg: 0.02 MB/s Min: 0.02 MB/s Max: 0.02 MB/s
  [W] Avg: 108.40 MB/s Min: 108.40 MB/s Max: 108.40 MB/s

and now for read:

  [LNet Rates of c]
  [R] Avg: 10 RPC/s Min: 10 RPC/s Max: 10 RPC/s
  [W] Avg: 5 RPC/s Min: 5 RPC/s Max: 5 RPC/s
  [LNet Bandwidth of c]
  [R] Avg: 4.59 MB/s Min: 4.59 MB/s Max: 4.59 MB/s
  [W] Avg: 0.00 MB/s Min: 0.00 MB/s Max: 0.00 MB/s
  [LNet Rates of c]
  [R] Avg: 10 RPC/s Min: 10 RPC/s Max: 10 RPC/s
  [W] Avg: 5 RPC/s Min: 5 RPC/s Max: 5 RPC/s
  [LNet Bandwidth of c]
  [R] Avg: 4.79 MB/s Min: 4.79 MB/s Max: 4.79 MB/s
  [W] Avg: 0.00 MB/s Min: 0.00 MB/s Max: 0.00 MB/s
  [LNet Rates of c]
  [R] Avg: 10 RPC/s Min: 10 RPC/s Max: 10 RPC/s
  [W] Avg: 5 RPC/s Min: 5 RPC/s Max: 5 RPC/s
  [LNet Bandwidth of c]
  [R] Avg: 4.79 MB/s Min: 4.79 MB/s Max: 4.79 MB/s
  [W] Avg: 0.00 MB/s Min: 0.00 MB/s Max: 0.00 MB/s

Iozone shows the same asymmetric results as LNET. With just one OST:

On the WRITE side, we get 233 MB/sec. Given that the theoretical maximum is 250 MB/sec, this is a very good result: it works fine.

On the READ side, the maximum we get is 149 MB/sec with three threads (processes: -t 3). If we configure four threads (-t 4) we get 50 MB/sec.

We also verified in the brw_stats file that we use a 1 MB block size (both read and write).

So we only have problems with iozone/Lustre and LNET self test.

Thanks to all.
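To compare runs like the ones above without reading every sample, the bandwidth sections of the `lst stat` output can be averaged. This is a sketch under the assumption that the output has one bracketed header or `[R]`/`[W]` entry per line, as lst prints it on a terminal; the function name `lst_avg_bw` is hypothetical.

```shell
# Average the [R] and [W] "Avg:" bandwidth figures from lst stat output on stdin.
# Only lines under an "[LNet Bandwidth ...]" header are counted; the RPC rate
# sections are skipped.
lst_avg_bw() {
    awk '
        /\[LNet Bandwidth/ { bw = 1; next }
        /\[LNet Rates/     { bw = 0; next }
        bw && $1 == "[R]"  { r += $3; nr++ }
        bw && $1 == "[W]"  { w += $3; nw++ }
        END {
            if (nr) printf "read %.2f MB/s\n", r / nr
            if (nw) printf "write %.2f MB/s\n", w / nw
        }'
}
```

Something like `lst stat c | lst_avg_bw`, interrupted after a few samples, would summarize a run. For scale, a single 1 Gb/s link tops out at 125 MB/s, so the write figures reported above are close to one link's worth while the read figures are well over an order of magnitude lower.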
Olivier Hargoaa
2010-May-20 17:12 UTC
[Lustre-discuss] Modifying Lustre network (good practices)
Thanks Johann,

You couldn't have known, but we already ran LNET self test, without success. I posted the results in my answer to Brian. What I do not know is whether the LNET test results were good with bonding deactivated. I will ask the administrators to test it.

Regards.

Johann Lombardi wrote:
> Maybe you should start with running lnet self test to compare read & write
> performance?
>
> http://wiki.lustre.org/manual/LustreManual18_HTML/LustreIOKit.html#50598014_pgfId-1290255

--
Olivier, Hargoaa
Phone: + 33 4 76 29 76 25
Johann Lombardi
2010-May-21 10:52 UTC
[Lustre-discuss] Modifying Lustre network (good practices)
Hi Olivier,

On Thu, May 20, 2010 at 07:12:45PM +0200, Olivier Hargoaa wrote:
> But you couldn't know but we already ran lnet self test unsuccessfully.
> I wrote results as answer to Brian.

ok. To get back to your original question:

> Currently Lustre network is bond0. We want to set it as eth0, then eth1,
> then eth2 and finally back to bond0 in order to compare performances.
> Therefore, we'll perform the following steps: we will umount the
> filesystem, reformat the mgs, change lnet options in modprobe file,
> start new mgs server, and finally modify our ost and mdt with
> tunefs.lustre with failover and mgs new nids using "--erase-params" and
> "--writeconf" options.

Since you can reproduce the perf drop with LST, you don't need to change the filesystem configuration to test the different lnet configs. I would just umount the filesystem, reconfigure lnet by changing the modprobe file and run LST.

Johann
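The modprobe change described in this message amounts to one `options lnet` line per interface under test. A minimal sketch, assuming the TCP network is named tcp0 (the thread does not state the network name actually in use); the function name `lnet_options_line` is hypothetical, and the file the line belongs in depends on the distribution.

```shell
# Print the modprobe "options" line that binds LNET's tcp0 network to one interface.
# Usage: lnet_options_line <interface>
lnet_options_line() {
    printf 'options lnet networks=tcp0(%s)\n' "$1"
}
```

For example, `lnet_options_line eth1` prints `options lnet networks=tcp0(eth1)`. With the filesystem unmounted and the Lustre modules unloaded (for instance with `lustre_rmmod`), that line replaces the existing lnet line in the modprobe configuration; after `modprobe lnet` and `lctl network up`, `lctl list_nids` should show the NID on the chosen interface before LST is run.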
Olivier Hargoaa
2010-May-21 13:25 UTC
[Lustre-discuss] Modifying Lustre network (good practices)
Johann Lombardi wrote:
> Since you can reproduce the perf drop with LST, you don't need to change
> the filesystem configuration to test the different lnet configs. I would
> just umount the filesystem, reconfigure lnet by changing the modprobe file
> and run LST.

Hi Johann,

Yes, you are right. We plan to run more LNET tests on Wednesday. First we will test bonding again with LNET (read and write), then deactivate bonding and test eth2, then eth3, and finally run lst with both eth2 and eth3 without bonding. Depending on the results, we will decide whether to try a Lustre-level test (iozone), and for that we would modify the Lustre filesystem lnet parameters.

Thanks everybody.

--
Olivier, Hargoaa
Phone: + 33 4 76 29 76 25
Olivier Hargoaa
2010-Jul-09 07:50 UTC
[Lustre-discuss] Modifying Lustre network (good practices)
Thank you very much. The problem was solved by activating selective acknowledgments.

James Robnett wrote:
> You might check that you have enabled TCP selective acknowledgments,
> echo 1 > /proc/sys/net/ipv4/tcp_sack
> or
> net.ipv4.tcp_sack = 1