Hi!

Does anyone know how to use QoS with Lustre's o2ib LND? The Voltaire IB
LND allowed to #define a service level, but I couldn't find a similar
facility in o2ib. Is there a different way to apply QoS rules?

Thanks,

Daniel.
On Mon, May 18, 2009 at 12:04:37PM +0200, Daniel Kobras wrote:
> Hi!
>
> Does anyone know how to use QoS with Lustre's o2ib LND? The Voltaire IB
> LND allowed to #define a service level, but I couldn't find a similar
> facility in o2ib. Is there a different way to apply QoS rules?
>
> Thanks,
>
> Daniel.

Hi, I don't know much about this stuff, but our IB guys did use QoS to
help us when we found LNET was falling apart when we brought up our
first 1K node cluster based on quad socket, quad core Opterons, and ran
MPI collective stress tests on all cores.

Here are some notes they put together - see the "QoS Policy File" section.

Jim
____________________________________
QoS configuration on Infiniband

May 18, 2009

Albert Chu
chu11 at llnl.gov

Overview
--------

Quality of Service (QoS) is offered in Infiniband as a means to offer
some guarantees/minimum requirements for certain applications on the
fabric.

Definitions
-----------

Virtual Lanes (VLs): Infiniband supports up to 15 (numbered 0-14)
Virtual Lanes (VLs) for traffic. The virtual lanes provide independent
virtual transmit/receive buffers for each port on the fabric.

Service Level (SL): A number (0-15) that can be assigned to any
Infiniband packet. The definition/purpose of an SL is not defined; it
is up to the user to determine.

Basic QoS Implementation in Infiniband
--------------------------------------

There are three basic parts to QoS in Infiniband.

1) Assign/configure protocols/tools/applications to use appropriate SLs.

   Normally, you assign different SLs to different protocols,
   applications, etc. (e.g. MPI, Lustre). This allows each
   protocol/application to be given unique QoS requirements.

2) Configure SL2VL mapping.

   Map SLs to VLs. For example, SL0->VL0, SL1->VL1, etc.

3) Configure VL Arbitration.

   Determines VL transmission rules based on a set of prioritization
   rules.

It is the responsibility of administrators/users to use and configure
the SLs/VLs properly. By themselves, VLs and SLs do nothing and mean
nothing in the Infiniband card.

SL2VL Mapping Configuration
---------------------------

This is pretty basic. You assign an SL to a VL. It's a direct
one-to-one mapping, e.g. SL1->VL1, SL2->VL2.

Normally, you map SLX -> VLX. If you do otherwise, you're starting to
do something pretty crazy.

VL Arbitration Configuration
----------------------------

This is not so basic. There are three components to VL Arbitration
configuration: the High-Priority Table, the Low-Priority Table, and the
Limit of High Priority.

High/Low VL Arbitration Tables
------------------------------

The High and Low Priority VL Arbitration Tables are lists of VL number
(0-14) and weighting value (0-255) pairs. The weighting value indicates
the number of 64-byte units that can be transmitted from that VL when
it is that VL's turn to transmit. A weight of 0 means no data can be
transferred. Counters are rounded up as needed for packets (i.e. with a
weight of 1, a packet larger than 64 bytes can still be sent). The High
Priority VL Arbitration Table holds weights for "high priority" data
while the Low Priority VL Arbitration Table holds weights for "low
priority" data (the usefulness will make more sense after you read
"Limit of High Priority" below).

Note that 64*255 =~ 16K, which is a small number for many institutions.
I think it is easiest to think of the weights as ratios of the
bandwidth each VL gets if the network is completely flooded with data
from all protocols/applications.

For example:

A) VL0 Weight = 255, VL1 Weight = 255

   50% bandwidth for VL0 and VL1 each.

B) VL0 Weight = 255, VL1 Weight = 255, VL2 Weight = 255

   33% bandwidth for VL0, VL1, and VL2 each.

C) VL0 Weight = 200, VL1 Weight = 100

   66% bandwidth for VL0, 33% bandwidth for VL1.

D) VL0 Weight = 200, VL1 Weight = 100, VL2 Weight = 100

   50% bandwidth for VL0, 25% bandwidth for VL1 and VL2 each.

Limit of High Priority
----------------------

Indicates the amount of high-priority data (from the High VL
Arbitration Table) that can be sent without an opportunity to send a
low priority packet (from the Low VL Arbitration Table). Increments are
in 4K bytes (special values: 0 = one packet, 255 = unlimited data).

4K*254 =~ 1M, which again is a small number for many institutions. The
most likely values to consider using are:

0   - one packet
254 - maximum high-priority data without being unlimited
255 - unlimited data

VL Arbitration Examples
-----------------------

When you combine the High/Low VL Arbitration tables with the Limit of
High Priority, you can create some interesting QoS behavior.

Example 1:

(The following example is borrowed from the "Quality and Service in
OFED 3.1" presentation listed below.)

High-Limit:  0
VL-Arb-High: VL2 Weight = 1
VL-Arb-Low:  VL0 Weight = 200, VL1 Weight = 50

Effectively, anytime data is available on VL2, send at most one packet
from VL2 before sending data from VL0 or VL1. If no VL2 data is
available, VL0 gets 80% of bandwidth and VL1 gets 20% of bandwidth.

Idea:

(Assume Lustre Meta Data Servers and Lustre OSTs are on the same
fabric.)

MPI              -> SL0 -> VL0
Lustre OST Data  -> SL1 -> VL1
Lustre Meta Data -> SL2 -> VL2

In this example, Lustre metadata traffic is assumed to be low, but with
the high priority it is serviced faster, theoretically allowing for
better Lustre interaction. When there is no Lustre metadata traffic on
the fabric, MPI is given the majority share of bandwidth b/c it is more
timing sensitive.

Example 2:

High-Limit:  254
VL-Arb-High: VL0 Weight = 255
VL-Arb-Low:  VL1 Weight = 1

Effectively, whenever there is data on VL0, always send it before VL1,
but do not allow VL0 to starve VL1. Let VL1 send *something* once in a
while.

Idea:

MPI    -> SL0 -> VL0
Lustre -> SL1 -> VL1

So MPI always gets priority over Lustre, but cannot starve it out. The
High-Limit of 254 means a low priority packet must be sent once in a
while. This could be important if Lustre "pings" are done to keep some
services alive.

Configuring for OpenSM
----------------------

Currently configured in /var/cache/opensm/opensm.opts (later to be in
/etc/opensm/opensm.conf).

#
# QoS OPTIONS
#
qos TRUE

qos_policy_file /var/cache/opensm/qos-policy.conf

# QoS default options
qos_max_vls 2
qos_high_limit 254
qos_vlarb_high 0:255
qos_vlarb_low 1:1
qos_sl2vl 0,1,15,15,15,15,15,15,15,15,15,15,15,15,15,15

qos_ca_max_vls 2
qos_ca_high_limit 254
qos_ca_vlarb_high 0:255
qos_ca_vlarb_low 1:1
qos_ca_sl2vl 0,1,15,15,15,15,15,15,15,15,15,15,15,15,15,15

# achu: VL2 not used, need to give non-null input to buggy opensm
qos_swe_max_vls 2
qos_swe_high_limit 255
qos_swe_vlarb_high 0:225,1:25
qos_swe_vlarb_low 2:1
qos_swe_sl2vl 0,1,15,15,15,15,15,15,15,15,15,15,15,15,15,15

Notes/Comments:

There are default QoS options, and specific QoS options for channel
adapters, switches, etc. They allow you to configure the different
port types across the fabric differently.

The "max_vls" entries can be ignored.

The "high_limit", "vlarb_high", and "vlarb_low" fields are hopefully
self-explanatory. The "vlarb_high"/"vlarb_low" entries take input as
<VL>:<Weight> pairs.

In the above example, channel adapters have:

VL0 Weight = 255 -> for MPI
VL1 Weight = 1   -> for Lustre

Idea: With the High Limit of 254, MPI always gets priority, but cannot
starve Lustre.

In the above example, switches have:

VL0 Weight = 225 -> for MPI
VL1 Weight = 25  -> for Lustre

Idea: Across the entire cluster, MPI, Lustre, etc. traffic from
different jobs/tasks is going on at once. We don't want MPI to starve
out other traffic, so we give it a nice chunk of bandwidth but not all
of it (in this example 90% for MPI, 10% for Lustre).

SLs are mapped to VLs by listing the VL for each SL in increasing
order. In the above example, SL0 -> VL0 and SL1 -> VL1. An input of 15
is used for SLs you don't care about.

Assigning SLs
-------------

The configuration of QoS is now over, but we still need to make
protocols/applications use the appropriate SLs.

Some tools allow you to pick an SL when you run, e.g.:

> mpirun -sl 0

However, it may not be easy to force/change users/applications to use
different SLs. The easiest way to configure SLs is through the OpenSM
QoS policy file.

QoS Policy File
---------------

Depending on the OpenSM version, this file is in
/var/cache/opensm/qos-policy.conf or /etc/opensm/qos-policy.conf.

The following is a short summary of the options I think are needed for
our environment. See "QoS Management in OpenSM" for the full set of
options.

Format:

qos-ulps
    <user level protocol>, <options> : <SL level>
end-qos-ulps

<user level protocol> = IPoIB, SDP, SRP, iSER

<options> = port-num, pkey, service-id, target-port-guid
(Note: the available options depend on which user level protocol is
selected.)

<SL level> = SL level 0-15.

Example:

qos-ulps
    default : 0
    any, target-port-guid 0x0002c9030002879d,0x0002c90300028765 : 1
end-qos-ulps

Idea:

Everything (most notably MPI) defaults to SL0. Anything destined for
one of the listed target port GUIDs gets SL1.

If the target-port-guid list contains the GUIDs of Lustre routers, that
means Lustre data gets SL 1. In combination with the VL Arbitration and
SL2VL Mapping configuration listed above, hopefully it can be seen how
MPI gets priority over Lustre, but does not starve it out.

Note that files with target-port-guids must be kept up to date if GUIDs
change. You can determine GUIDs via /usr/sbin/ibstat.

Verifying Configuration
-----------------------

The tool smpquery can be used to verify that the VL Arbitration tables
and SL2VL tables have been configured properly in cards/switches.

# > /usr/sbin/smpquery sl2vl 346
# SL2VL table: Lid 346
#                 SL: | 0| 1| 2| 3| 4| 5| 6| 7| 8| 9|10|11|12|13|14|15|
ports: in 0, out 0:   | 0| 1|15|15|15|15|15|15|15|15|15|15|15|15|15|15|

# > /usr/sbin/smpquery vlarb 346
# VLArbitration tables: Lid 346 port 0 LowCap 8 HighCap 8
# Low priority VL Arbitration Table:
VL    : |0x1 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |
WEIGHT: |0x1 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |
# High priority VL Arbitration Table:
VL    : |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |
WEIGHT: |0xFF|0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |

The high limit can be determined by issuing portinfo queries via
/usr/sbin/smpquery.

# > /usr/sbin/smpquery portinfo 346 | grep Limit
VLHighLimit:.....................0

Random Configuration Notes
--------------------------

SLs are most often assigned at Infiniband Queue Pair (QP) creation
time. So, if you change your QoS settings, any tools/applications
(including Lustre) that are currently running and have already created
QPs may not have picked up the newest QoS policy. The appropriate
tools/applications should be restarted.

Not all Infiniband adapters support VLs, and those that do may not
support all 15 VLs. You can determine what your system supports by
issuing portinfo queries via /usr/sbin/smpquery.

References
----------

QoS Management in OpenSM

(This is a link to the Git tree - hopefully the URL is always legit.)

http://www.openfabrics.org/git/?p=~sashak/management.git;a=blob_plain;f=opensm/doc/QoS_management_in_OpenSM.txt;hb=HEAD

Quality and Service in OFED 3.1 - Liran Liss

http://www.openfabrics.org/archives/spring2008sonoma/Tuesday/qos_sonoma08_ofa_v1.ppt

QoS support in OFED

(This is a link to the Git tree - the URL is on the ofed_1_4 branch, so
it probably will change at some point.)

http://www.openfabrics.org/git/?p=~tziporet/docs.git;a=blob_plain;f=QoS_architecture.txt;hb=ofed_1_4
Sébastien Buisson
2009-May-19 15:55 UTC
[Lustre-discuss] InfiniBand QoS with Lustre ko2iblnd.
Hi,

We took a slightly different approach to deal with IB QoS in Lustre.

We decided to assign a specific service-id to Lustre: in ofa-kernel we
added a new value in the rdma_port_space enum, which we called
RDMA_PS_LUSTRE. Then we modified the calls to rdma_create_id in
o2iblnd.c and o2iblnd_cb.c to use this new port space value instead of
RDMA_PS_TCP (well, we did a little more than that in the Lustre code,
because we wanted the service-id to be a ko2iblnd module parameter, so
we added some stuff in o2iblnd_modparams.c, for instance).

The next step is to tell OpenSM to assign an SL to this service-id.
Here is an extract of our "QoS policy file":

qos-ulps
    default : 0
    any, service-id=0x.....: 3
end-qos-ulps

The major drawback of this solution is that the modification we made in
ofa-kernel is not OpenFabrics Alliance compliant, because the port
space list is defined in the IB standard.

Cheers,
Sebastien.

Jim Garlick wrote:
> On Mon, May 18, 2009 at 12:04:37PM +0200, Daniel Kobras wrote:
>> Does anyone know how to use QoS with Lustre's o2ib LND? The Voltaire IB
>> LND allowed to #define a service level, but I couldn't find a similar
>> facility in o2ib. Is there a different way to apply QoS rules?
>
> Hi, I don't know much about this stuff, but our IB guys did use QoS
> to help us when we found LNET was falling apart when we brought up
> our first 1K node cluster based on quad socket, quad core Opterons,
> and ran MPI collective stress tests on all cores.
>
> Here are some notes they put together - see the "QoS Policy File" section.
>
> Jim
> [...]
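For illustration, a rough sketch of the kind of change described above;
the RDMA_PS_LUSTRE value and its number are made up here, the other enum
values are the usual RDMA CM port spaces of that era, and the call is
shown in the three-argument rdma_create_id() form used by OFED 1.3/1.4:

    /* ofa-kernel include/rdma/rdma_cm.h (sketch; RDMA_PS_LUSTRE and its
     * value are hypothetical) */
    enum rdma_port_space {
            RDMA_PS_SDP    = 0x0001,
            RDMA_PS_IPOIB  = 0x0002,
            RDMA_PS_TCP    = 0x0106,
            RDMA_PS_UDP    = 0x0111,
            RDMA_PS_LUSTRE = 0x0120,  /* hypothetical new port space */
    };

    /* lnet/klnds/o2iblnd/o2iblnd_cb.c, kiblnd_connect_peer() (sketch):
     * create the connection id in the new port space instead of
     * RDMA_PS_TCP */
    cmid = rdma_create_id(kiblnd_cm_callback, peer, RDMA_PS_LUSTRE);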
Hi!

On Mon, May 18, 2009 at 01:34:03PM -0700, Jim Garlick wrote:
> Hi, I don't know much about this stuff, but our IB guys did use QoS
> to help us when we found LNET was falling apart when we brought up
> our first 1K node cluster based on quad socket, quad core Opterons,
> and ran MPI collective stress tests on all cores.
>
> Here are some notes they put together - see the "QoS Policy File" section.

Great summary, thanks for sharing! Seems like qos-ulp is a rather
recent OpenSM-specific feature, and the SMs in our switches apparently
don't offer a similar SID-to-SL mapping either, but it certainly got me
a leap further.

Thanks,

Daniel.
On Mon, May 18, 2009 at 12:04:37PM +0200, Daniel Kobras wrote:
> Hi!
>
> Does anyone know how to use QoS with Lustre's o2ib LND? The Voltaire IB
> LND allowed to #define a service level, but I couldn't find a similar
> facility in o2ib. Is there a different way to apply QoS rules?

The o2iblnd SL is set by the OFED RDMA CM, indirectly based on the
o2iblnd service port (set via the ko2iblnd option 'service', 987 by
default) and its port space (RDMA_PS_TCP). For a complete, and more
complicated, story please see:
https://bugzilla.lustre.org/show_bug.cgi?id=18360#c2

Isaac
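For reference, 'service' here is an ordinary ko2iblnd module parameter;
on a typical installation (exact modprobe configuration file path
depends on the distribution) it would be set with an options line such
as the following, 987 being the default:

    options ko2iblnd service=987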
On Tue, May 19, 2009 at 05:55:21PM +0200, Sébastien Buisson wrote:
> Hi,
>
> We took a slightly different approach to deal with IB QoS in Lustre.
>
> We decided to assign a specific service-id to Lustre: in ofa-kernel we
> added a new value in the rdma_port_space enum, which we called
> RDMA_PS_LUSTRE. Then we modified the calls to rdma_create_id in
> o2iblnd.c and o2iblnd_cb.c to use this new port space value instead of
> RDMA_PS_TCP (well, we did a little more than that in the Lustre code,
> because we wanted the service-id to be a ko2iblnd module parameter, so
> we added some stuff in o2iblnd_modparams.c, for instance).

Maybe I missed something, but it seems to me an overkill to specify the
service-id this way. Without any code changes, you can figure out the
service-id from the ko2iblnd 'service' option:

rdma_resolve_route -> cma_resolve_ib_route -> cma_query_ib_route -> cma_get_service_id

Isaac
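To make that mapping concrete, here is a small self-contained sketch of
what the call chain above computes, assuming cma_get_service_id()
follows the usual RDMA CM convention of (port space << 16) | port and
that RDMA_PS_TCP is 0x0106; with the default ko2iblnd port of 987 this
yields 0x00000000010603db:

    #include <stdio.h>
    #include <stdint.h>

    /* Assumed to mirror the RDMA CM's cma_get_service_id():
     * service-id = (port space << 16) | port number */
    int main(void)
    {
            uint64_t rdma_ps_tcp = 0x0106;  /* RDMA_PS_TCP port space */
            uint64_t port        = 987;     /* default ko2iblnd 'service' */
            uint64_t sid         = (rdma_ps_tcp << 16) | port;

            printf("service-id = 0x%016llx\n", (unsigned long long)sid);
            /* prints: service-id = 0x00000000010603db */
            return 0;
    }

Under that assumption, a qos-ulps rule along the lines of
"any, service-id=0x00000000010603db : 1" would then match o2iblnd
connections without any code changes.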
Sébastien Buisson
2009-Jun-22 14:49 UTC
[Lustre-discuss] InfiniBand QoS with Lustre ko2iblnd.
Hi all,

We have been thinking about this IB QoS thing in Lustre for a while,
and we would like to express a need that may not be satisfied by the
current solution exposed by Isaac (which consists of using the ko2iblnd
'service' option).

Let's consider we have two sets of OSSes, each set serving a different
Lustre file system (i.e. all the OSTs of an OSS are part of the same
Lustre file system). The same Lustre clients have access to both
file systems. In these conditions, how can we enforce different IB QoS
in Lustre for the two file systems?

- By using the ko2iblnd 'service' option, the o2iblnd SL would be the
same for all connections initiated by a given Lustre client, regardless
of the destination file system. So we would not achieve our goal.
Unless what really matters is the SL of the connections created by the
servers (I think I have seen in the Lustre debug logs that the 'real'
data transfers are always done via the server connections).
What do you think?

- If the 'service id' information was stored on the MGS on a
per-file-system basis, one could imagine retrieving it at mount time on
the clients. The 'service id' information stored on the MGS could
consist of a port space and a port id. Thus it would be possible to
assign different service ports to the various connections initiated by
the client, depending on the target file system.
What do you think? Would you say this is feasible, or can you see major
issues with this proposal?

Thanks in advance.
Sebastien.

Isaac Huang wrote:
> On Mon, May 18, 2009 at 12:04:37PM +0200, Daniel Kobras wrote:
>> Hi!
>>
>> Does anyone know how to use QoS with Lustre's o2ib LND? The Voltaire IB
>> LND allowed to #define a service level, but I couldn't find a similar
>> facility in o2ib. Is there a different way to apply QoS rules?
>
> The o2iblnd SL is set by the OFED RDMA CM, indirectly based on the
> o2iblnd service port (set via the ko2iblnd option 'service', 987 by
> default) and its port space (RDMA_PS_TCP). For a complete, and more
> complicated, story please see:
> https://bugzilla.lustre.org/show_bug.cgi?id=18360#c2
>
> Isaac
Sébastien Buisson
2009-Jun-24 07:46 UTC
[Lustre-discuss] InfiniBand QoS with Lustre ko2iblnd.
Sébastien Buisson wrote:
> Hi all,
>
> We have been thinking about this IB QoS thing in Lustre for a while,
> and we would like to express a need that may not be satisfied by the
> current solution exposed by Isaac (which consists of using the
> ko2iblnd 'service' option).
>
> Let's consider we have two sets of OSSes, each set serving a different
> Lustre file system (i.e. all the OSTs of an OSS are part of the same
> Lustre file system). The same Lustre clients have access to both
> file systems. In these conditions, how can we enforce different IB QoS
> in Lustre for the two file systems?
>
> - By using the ko2iblnd 'service' option, the o2iblnd SL would be the
> same for all connections initiated by a given Lustre client, regardless
> of the destination file system. So we would not achieve our goal.
> Unless what really matters is the SL of the connections created by the
> servers (I think I have seen in the Lustre debug logs that the 'real'
> data transfers are always done via the server connections).
> What do you think?

I have tried to make a client, for which I set the ko2iblnd 'service'
option to 986, communicate with a server for which I set the ko2iblnd
'service' option to 987: it does not work. This is not surprising,
because the ko2iblnd 'service' parameter is used on the client side in
the kiblnd_connect_peer function to designate the port of the remote
peer (the server in this case).

So the ko2iblnd 'service' option must be the same for all the nodes
participating in the same file system. In our case, where the same
clients access both file systems, it means that we will not be able to
set different o2iblnd SLs for the two file systems.

> - If the 'service id' information was stored on the MGS on a
> per-file-system basis, one could imagine retrieving it at mount time
> on the clients. The 'service id' information stored on the MGS could
> consist of a port space and a port id. Thus it would be possible to
> assign different service ports to the various connections initiated by
> the client, depending on the target file system.
> What do you think? Would you say this is feasible, or can you see major
> issues with this proposal?

The peer's port information could be stored in the kib_peer_t
structure. That way, it would be possible to make clients connect to
servers which listen on different ports. What do you think?
Hi Sébastien!

On Wed, Jun 24, 2009 at 09:46:19AM +0200, Sébastien Buisson wrote:
>> - If the 'service id' information was stored on the MGS on a
>> per-file-system basis, one could imagine retrieving it at mount time
>> on the clients. The 'service id' information stored on the MGS could
>> consist of a port space and a port id. Thus it would be possible to
>> assign different service ports to the various connections initiated by
>> the client, depending on the target file system.
>> What do you think? Would you say this is feasible, or can you see major
>> issues with this proposal?
>
> The peer's port information could be stored in the kib_peer_t
> structure. That way, it would be possible to make clients connect to
> servers which listen on different ports. What do you think?

Why do you want to distinguish the two file systems solely by service
id rather than, say, service id plus the port GUIDs of the respective
Lustre servers? You'll need a full QoS policy file instead of the
simplified syntax, and the configuration needs to be adapted on
hardware changes, but this still looks simpler to me than modifying the
wire protocol.

Regards,

Daniel.
On Mon, Jun 22, 2009 at 04:49:03PM +0200, Sébastien Buisson wrote:
> ......
> Let's consider we have two sets of OSSes, each set serving a different
> Lustre file system (i.e. all the OSTs of an OSS are part of the same
> Lustre file system). The same Lustre clients have access to both
> file systems. In these conditions, how can we enforce different IB QoS
> in Lustre for the two file systems?

By assigning different SLs to the two sets of servers based on server
GUIDs, i.e. target-port-guid in the QoS policy file.

> - By using the ko2iblnd 'service' option, the o2iblnd SL would be the
> same for all connections initiated by a given Lustre client, regardless
> of the destination file system. So we would not achieve our goal.

Not necessarily. The service-id would be the same, but the SLs could be
different if the SM has been configured in a way that doesn't determine
SLs solely based on service-id (e.g. also based on target GUIDs).

> ......
> - If the 'service id' information was stored on the MGS on a
> per-file-system basis, one could imagine retrieving it at mount time
> on the clients. The 'service id' information stored on the MGS could
> consist of a port space and a port id. Thus it would be possible to
> assign different service ports to the various connections initiated by
> the client, depending on the target file system.
> What do you think? Would you say this is feasible, or can you see major
> issues with this proposal?

The LNet configuration could not reside on the MGS, because LNet must
already be properly configured before any configuration on the MGS
could be fetched over the network.

Isaac
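For illustration, using the simplified qos-ulps syntax from the notes
earlier in the thread, that could look roughly like the following; the
GUIDs shown are placeholders for the port GUIDs of the servers of each
file system:

qos-ulps
    default : 0
    any, target-port-guid 0x0002c9030002aaaa,0x0002c9030002aaab : 1
    any, target-port-guid 0x0002c9030002bbbb,0x0002c9030002bbbc : 2
end-qos-ulps

SL 1 and SL 2 can then be given different SL2VL mappings and VL
arbitration weights as described in those notes.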
On Wed, Jun 24, 2009 at 09:46:19AM +0200, Sébastien Buisson wrote:
> ......
> The peer's port information could be stored in the kib_peer_t
> structure. That way, it would be possible to make clients connect to
> servers which listen on different ports. What do you think?

At this point it can't be done. But we have in our development plans to
implement dynamic LNet configuration, which includes per-NI options
(i.e. it'd be possible to specify the 'service' option on a per-NI
basis instead of it being just LND-global), and once that is
implemented you'd be able to specify a different 'service' option if
you created two server networks for the two file systems.

For your current concern of setting up different SLs, I believe it
could be achieved via target GUIDs as mentioned in my previous reply.

Hope this helps,
Isaac
Sébastien Buisson
2009-Jun-26 11:42 UTC
[Lustre-discuss] InfiniBand QoS with Lustre ko2iblnd.
Isaac Huang wrote:
> On Wed, Jun 24, 2009 at 09:46:19AM +0200, Sébastien Buisson wrote:
>> ......
>> The peer's port information could be stored in the kib_peer_t
>> structure. That way, it would be possible to make clients connect to
>> servers which listen on different ports. What do you think?
>
> At this point it can't be done. But we have in our development plans to
> implement dynamic LNet configuration, which includes per-NI options
> (i.e. it'd be possible to specify the 'service' option on a per-NI
> basis instead of it being just LND-global), and once that is
> implemented you'd be able to specify a different 'service' option if
> you created two server networks for the two file systems.

OK, if I understand correctly, the major hurdle with what I proposed is
that LNET is not able to get configuration information dynamically at
the moment, right?

I agree with you, I think the per-NI options in LNET would do the
trick. Do you have plans for when this feature would be available? Have
you already begun to work on it? If you have some pre-alpha work, we
would be glad to evaluate it.

> For your current concern of setting up different SLs, I believe it
> could be achieved via target GUIDs as mentioned in my previous reply.

Unfortunately, configuring IB QoS via target GUIDs quickly becomes too
complicated. As the size of clusters grows, it would require listing
hundreds of GUIDs in the QoS policy rules.

Sebastien.
On Fri, Jun 26, 2009 at 01:42:53PM +0200, Sébastien Buisson wrote:
> Isaac Huang wrote:
>> On Wed, Jun 24, 2009 at 09:46:19AM +0200, Sébastien Buisson wrote:
>>> ......
>>> The peer's port information could be stored in the kib_peer_t
>>> structure. That way, it would be possible to make clients connect to
>>> servers which listen on different ports. What do you think?
>>
>> At this point it can't be done. But we have in our development plans to
>> implement dynamic LNet configuration, which includes per-NI options
>> (i.e. it'd be possible to specify the 'service' option on a per-NI
>> basis instead of it being just LND-global), and once that is
>> implemented you'd be able to specify a different 'service' option if
>> you created two server networks for the two file systems.
>
> OK, if I understand correctly, the major hurdle with what I proposed is
> that LNET is not able to get configuration information dynamically at
> the moment, right?

Yes.

> I agree with you, I think the per-NI options in LNET would do the
> trick. Do you have plans for when this feature would be available? Have
> you already begun to work on it?

It's too early to make any realistic estimate at the moment. Though
it's already on the LNet roadmap, I'm not sure when we're going to
start working on it.

> If you have some pre-alpha work, we would be glad to evaluate it.

Thanks, I'll remember to ping you when it's available.

>> For your current concern of setting up different SLs, I believe it
>> could be achieved via target GUIDs as mentioned in my previous reply.
>
> Unfortunately, configuring IB QoS via target GUIDs quickly becomes too
> complicated. As the size of clusters grows, it would require listing
> hundreds of GUIDs in the QoS policy rules.

Yes, it's rather cumbersome at bigger scales.

Thanks,
Isaac
On Wed, Jul 01, 2009 at 02:07:33AM -0400, Isaac Huang wrote:
> ......
>>> For your current concern of setting up different SLs, I believe it
>>> could be achieved via target GUIDs as mentioned in my previous reply.
>>
>> Unfortunately, configuring IB QoS via target GUIDs quickly becomes too
>> complicated. As the size of clusters grows, it would require listing
>> hundreds of GUIDs in the QoS policy rules.
>
> Yes, it's rather cumbersome at bigger scales.

It just occurred to me that it might work to configure the QoS policy
based on IB partition keys. It's just an initial thought: if you
configured two @o2ib networks over two IB partitions on the same
fabric, one for each file system, then you could differentiate the
traffic of the two file systems based on their partition keys. I think
it'd be much easier to configure an additional @o2ib network than to
maintain hundreds of GUIDs, which could change, in the policy file.

By default the o2iblnd runs over the default IB partition. Please see
bug 18602 for how to configure the o2iblnd over a non-default
partition.

Thanks,
Isaac
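Assuming the simplified qos-ulps syntax accepts pkey matching, as the
option list in the notes earlier in the thread suggests, the policy
side of such a setup might look roughly like this; the partition key
values are placeholders for the partitions carrying the two @o2ib
networks:

qos-ulps
    default : 0
    any, pkey 0x8001 : 1
    any, pkey 0x8002 : 2
end-qos-ulps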