Hi!

Does anyone know how to use QoS with Lustre's o2ib LND? The Voltaire IB
LND allowed to #define a service level, but I couldn't find a similar
facility in o2ib. Is there a different way to apply QoS rules?

Thanks,

Daniel.
On Mon, May 18, 2009 at 12:04:37PM +0200, Daniel Kobras wrote:
> Hi!
>
> Does anyone know how to use QoS with Lustre's o2ib LND? The Voltaire IB
> LND allowed to #define a service level, but I couldn't find a similar
> facility in o2ib. Is there a different way to apply QoS rules?
>
> Thanks,
>
> Daniel.

Hi, I don't know much about this stuff, but our IB guys did use QoS to
help us when we found LNET was falling apart when we brought up our
first 1K node cluster based on quad socket, quad core Opterons, and ran
MPI collective stress tests on all cores.

Here are some notes they put together - see the "QoS Policy File" section.

Jim
____________________________________
QoS configuration on Infiniband

May 18, 2009

Albert Chu
chu11 at llnl.gov

Overview
--------

Quality of Service (QoS) is offered in Infiniband as a means to offer
some guarantees/minimum requirements for certain applications on the
fabric.

Definitions
-----------

Virtual Lanes (VLs): Infiniband supports up to 15 (numbered 0-14)
Virtual Lanes (VLs) for traffic. The virtual lanes provide independent
virtual transmit/receive buffers for each port on the fabric.

Service Level (SL): A number (0-15) that can be assigned to any
Infiniband packet. The definition/purpose of an SL is not defined; it
is up to the user to determine.

Basic QoS Implementation in Infiniband
--------------------------------------

There are three basic parts to QoS in Infiniband.

1) Assign/configure protocols/tools/applications to use appropriate SLs.

   Normally, you assign different SLs to different protocols,
   applications, etc. (e.g. MPI, Lustre). This allows each
   protocol/application to be given unique QoS requirements.

2) Configure SL2VL mapping.

   Map SLs to VLs. For example, SL0->VL0, SL1->VL1, etc.

3) Configure VL Arbitration.

   Determines VL transmission rules based on a set of prioritization
   rules.

It is the responsibility of administrators/users to use and configure
the SLs/VLs properly. By themselves, VLs and SLs do nothing and mean
nothing in the Infiniband card.

SL2VL Mapping Configuration
---------------------------

This is pretty basic. You assign an SL to a VL. It's a direct
one-to-one mapping, e.g. SL1->VL1, SL2->VL2.

Normally, you map SLX -> VLX. If you do otherwise, you're starting to
do something pretty crazy.

VL Arbitration Configuration
----------------------------

This is not so basic. There are three components to VL Arbitration
configuration: the High-Priority Table, the Low-Priority Table, and the
Limit of High Priority.

High/Low VL Arbitration Tables
------------------------------

The High and Low Priority VL Arbitration Tables are lists of VL number
(0-14) and weighting value (0-255) pairs. The weighting value indicates
the number of 64-byte units that can be transmitted from that VL when
it is that VL's turn to transmit. A weight of 0 means no data can be
transferred. Counters are rounded up as needed for packets (i.e. with a
weight of 1, a packet larger than 64 bytes can still be sent). The High
Priority VL Arbitration Table holds weights for "high priority" data
while the Low Priority VL Arbitration Table holds weights for "low
priority" data (the usefulness will make more sense after you read
"Limit of High Priority" below).

Note that 64*255 =~ 16K, which is a small number for many institutions.
I think it is easiest to think of the weights as ratios of the
bandwidth each VL gets if the network is completely flooded with data
from all protocols/applications.

For example:

A) VL0 Weight = 255, VL1 Weight = 255

   50% bandwidth for VL0 and VL1 each.

B) VL0 Weight = 255, VL1 Weight = 255, VL2 Weight = 255

   33% bandwidth for VL0, VL1, and VL2 each.

C) VL0 Weight = 200, VL1 Weight = 100

   66% bandwidth for VL0, 33% bandwidth for VL1.

D) VL0 Weight = 200, VL1 Weight = 100, VL2 Weight = 100

   50% bandwidth for VL0, 25% bandwidth for VL1 and VL2 each.

Limit of High Priority
----------------------

Indicates the amount of high-priority data (from the High VL
Arbitration Table) that can be sent without an opportunity to send a
low priority packet (from the Low VL Arbitration Table). Increments are
in 4K bytes (special values: 0 = one packet, 255 = unlimited data).

4K*254 =~ 1M, which again is a small number for many institutions. The
most likely values to consider using are:

0   - one packet
254 - maximum high-priority data without being unlimited
255 - unlimited data

VL Arbitration Examples
-----------------------

When you combine the High/Low VL Arbitration tables with the Limit of
High Priority, you can create some interesting QoS behavior.

Example 1:

(The following example is borrowed from the "Quality and Service in
OFED 3.1" presentation listed below.)

High-Limit:  0
VL-Arb-High: VL2 Weight = 1
VL-Arb-Low:  VL0 Weight = 200, VL1 Weight = 50

Effectively, anytime data is available on VL2, send at most one packet
from VL2 before sending data from VL0 or VL1. If no VL2 data is
available, VL0 gets 80% of bandwidth and VL1 gets 20% of bandwidth.

Idea:

(Assume Lustre Meta Data Servers and Lustre OSTs are on the same
fabric.)

MPI              -> SL0 -> VL0
Lustre OST Data  -> SL1 -> VL1
Lustre Meta Data -> SL2 -> VL2

In this example, Lustre metadata traffic is assumed to be low, but with
the high priority it is serviced faster, theoretically allowing for
better Lustre interaction. When there is no Lustre metadata traffic on
the fabric, MPI is given the majority share of bandwidth b/c it is more
timing sensitive.

Example 2:

High-Limit:  254
VL-Arb-High: VL0 Weight = 255
VL-Arb-Low:  VL1 Weight = 1

Effectively, whenever there is data on VL0, always send it before VL1,
but do not allow VL0 to starve VL1. Let VL1 send *something* once in a
while.

Idea:

MPI    -> SL0 -> VL0
Lustre -> SL1 -> VL1

So MPI always gets priority over Lustre, but cannot starve it out. The
High-Limit of 254 means a low priority packet must be sent once in a
while. This could be important if Lustre "pings" are done to keep some
services alive.

Configuring for OpenSM
----------------------

Currently configured in /var/cache/opensm/opensm.opts (later to be in
/etc/opensm/opensm.conf).

#
# QoS OPTIONS
#
qos TRUE

qos_policy_file /var/cache/opensm/qos-policy.conf

# QoS default options
qos_max_vls 2
qos_high_limit 254
qos_vlarb_high 0:255
qos_vlarb_low 1:1
qos_sl2vl 0,1,15,15,15,15,15,15,15,15,15,15,15,15,15,15

qos_ca_max_vls 2
qos_ca_high_limit 254
qos_ca_vlarb_high 0:255
qos_ca_vlarb_low 1:1
qos_ca_sl2vl 0,1,15,15,15,15,15,15,15,15,15,15,15,15,15,15

# achu: VL2 not used, need to give non-null input to buggy opensm
qos_swe_max_vls 2
qos_swe_high_limit 255
qos_swe_vlarb_high 0:225,1:25
qos_swe_vlarb_low 2:1
qos_swe_sl2vl 0,1,15,15,15,15,15,15,15,15,15,15,15,15,15,15

Notes/Comments:

There are default QoS options, and specific QoS options for channel
adapters, switches, etc. They allow you to configure the different
port types across the fabric differently.

The "max_vls" entries can be ignored.

The "high_limit", "vlarb_high", and "vlarb_low" fields are hopefully
self-explanatory. The "vlarb_high"/"vlarb_low" entries take input as
<VL>:<Weight> pairs.

In the above example, channel adapters have:

VL0 Weight = 255 -> for MPI
VL1 Weight = 1   -> for Lustre

Idea: With the High Limit of 254, MPI always gets priority, but cannot
starve Lustre.

In the above example, switches have:

VL0 Weight = 225 -> for MPI
VL1 Weight = 25  -> for Lustre

Idea: Across the entire cluster, MPI, Lustre, etc. traffic from
different jobs/tasks is going on at once. We don't want MPI to starve
out other traffic, so we give it a nice chunk of bandwidth but not all
of it (in this example 90% for MPI, 10% for Lustre).

SLs are mapped to VLs by listing the VL for each SL in increasing
order. In the above example, SL0 -> VL0 and SL1 -> VL1. An input of 15
is used for SLs you don't care about.

Assigning SLs
-------------

The configuration of QoS is now over, but we still need to make
protocols/applications use the appropriate SLs.

Some tools allow you to pick an SL when you run, e.g.:

> mpirun -sl 0

However, it may not be easy to force/change users/applications to use
different SLs. The easiest way to configure SLs is through the OpenSM
QoS policy file.

QoS Policy File
---------------

Depending on the OpenSM version, this file is in
/var/cache/opensm/qos-policy.conf or /etc/opensm/qos-policy.conf.

The following is a short summary of the options I think are needed for
our environment. See "QoS Management in OpenSM" for the full set of
options.

Format:

qos-ulps
    <user level protocol>, <options> : <SL level>
end-qos-ulps

<user level protocol> = IPoIB, SDP, SRP, iSER

<options> = port-num, pkey, service-id, target-port-guid
(Note: the available options depend on which user level protocol is
selected.)

<SL level> = SL level 0-15.

Example:

qos-ulps
    default : 0
    any, target-port-guid 0x0002c9030002879d,0x0002c90300028765 : 1
end-qos-ulps

Idea:

Everything (most notably MPI) defaults to SL0. Anything destined for
one of the listed target port GUIDs gets SL1.

If the target-port-guid list contains the GUIDs of Lustre routers, that
means Lustre data gets SL 1. In combination with the VL Arbitration and
SL2VL Mapping configuration listed above, hopefully it can be seen how
MPI gets priority over Lustre, but does not starve it out.

Note that files with target-port-guids must be kept up to date if GUIDs
change. You can determine GUIDs via /usr/sbin/ibstat.

Verifying Configuration
-----------------------

The tool smpquery can be used to verify that the VL Arbitration tables
and SL2VL tables have been configured properly in cards/switches.

# > /usr/sbin/smpquery sl2vl 346
# SL2VL table: Lid 346
#                 SL: | 0| 1| 2| 3| 4| 5| 6| 7| 8| 9|10|11|12|13|14|15|
ports: in 0, out 0:   | 0| 1|15|15|15|15|15|15|15|15|15|15|15|15|15|15|

# > /usr/sbin/smpquery vlarb 346
# VLArbitration tables: Lid 346 port 0 LowCap 8 HighCap 8
# Low priority VL Arbitration Table:
VL    : |0x1 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |
WEIGHT: |0x1 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |
# High priority VL Arbitration Table:
VL    : |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |
WEIGHT: |0xFF|0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |

The high limit can be determined by issuing portinfo queries via
/usr/sbin/smpquery.

# > /usr/sbin/smpquery portinfo 346 | grep Limit
VLHighLimit:.....................0

Random Configuration Notes
--------------------------

SLs are most often assigned at Infiniband Queue Pair (QP) creation
time. So, if you change your QoS settings, any tools/applications
(including Lustre) that are currently running and have already created
QPs may not have picked up the newest QoS policy. The appropriate
tools/applications should be restarted.

Not all Infiniband adapters support VLs, and those that do may not
support all 15 VLs. You can determine what your system supports by
issuing portinfo queries via /usr/sbin/smpquery.

References
----------

QoS Management in OpenSM

(This is a link to the Git tree - hopefully the URL is always legit.)

http://www.openfabrics.org/git/?p=~sashak/management.git;a=blob_plain;f=opensm/doc/QoS_management_in_OpenSM.txt;hb=HEAD

Quality and Service in OFED 3.1 - Liran Liss

http://www.openfabrics.org/archives/spring2008sonoma/Tuesday/qos_sonoma08_ofa_v1.ppt

QoS support in OFED

(This is a link to the Git tree - the URL is on the ofed_1_4 branch, so
it probably will change at some point.)

http://www.openfabrics.org/git/?p=~tziporet/docs.git;a=blob_plain;f=QoS_architecture.txt;hb=ofed_1_4
Sébastien Buisson
2009-May-19 15:55 UTC
[Lustre-discuss] InfiniBand QoS with Lustre ko2iblnd.
Hi,

We took a slightly different approach to deal with IB QoS in Lustre.

We decided to assign a specific service-id to Lustre: in ofa-kernel we
added a new value in the rdma_port_space enum, which we called
RDMA_PS_LUSTRE. Then we modified the calls to rdma_create_id in
o2iblnd.c and o2iblnd_cb.c to use this new port space value instead of
RDMA_PS_TCP (well, we did a little more than that in the Lustre code,
because we wanted the service-id to be a ko2iblnd module parameter, so
we added some stuff in o2iblnd_modparams.c, for instance).

The next step is to tell OpenSM to assign an SL to this service-id.
Here is an extract of our "QoS policy file":

qos-ulps
    default : 0
    any, service-id=0x.....: 3
end-qos-ulps

The major drawback of this solution is that the modification we made in
ofa-kernel is not OpenFabrics Alliance compliant, because the port
space list is defined in the IB standard.

Cheers,
Sebastien.

Jim Garlick wrote:
> On Mon, May 18, 2009 at 12:04:37PM +0200, Daniel Kobras wrote:
>> Does anyone know how to use QoS with Lustre's o2ib LND? The Voltaire IB
>> LND allowed to #define a service level, but I couldn't find a similar
>> facility in o2ib. Is there a different way to apply QoS rules?
>
> Hi, I don't know much about this stuff, but our IB guys did use QoS
> to help us when we found LNET was falling apart when we brought up
> our first 1K node cluster based on quad socket, quad core Opterons,
> and ran MPI collective stress tests on all cores.
>
> Here are some notes they put together - see the "QoS Policy File" section.
>
> Jim
> [...]
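For illustration, a rough sketch of the kind of change described above;
the RDMA_PS_LUSTRE value and its number are made up here, the other enum
values are the usual RDMA CM port spaces of that era, and the call is
shown in the three-argument rdma_create_id() form used by OFED 1.3/1.4:

    /* ofa-kernel include/rdma/rdma_cm.h (sketch; RDMA_PS_LUSTRE and its
     * value are hypothetical) */
    enum rdma_port_space {
            RDMA_PS_SDP    = 0x0001,
            RDMA_PS_IPOIB  = 0x0002,
            RDMA_PS_TCP    = 0x0106,
            RDMA_PS_UDP    = 0x0111,
            RDMA_PS_LUSTRE = 0x0120,  /* hypothetical new port space */
    };

    /* lnet/klnds/o2iblnd/o2iblnd_cb.c, kiblnd_connect_peer() (sketch):
     * create the connection id in the new port space instead of
     * RDMA_PS_TCP */
    cmid = rdma_create_id(kiblnd_cm_callback, peer, RDMA_PS_LUSTRE);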
Hi!

On Mon, May 18, 2009 at 01:34:03PM -0700, Jim Garlick wrote:
> Hi, I don't know much about this stuff, but our IB guys did use QoS
> to help us when we found LNET was falling apart when we brought up
> our first 1K node cluster based on quad socket, quad core Opterons,
> and ran MPI collective stress tests on all cores.
>
> Here are some notes they put together - see the "QoS Policy File" section.

Great summary, thanks for sharing! Seems like qos-ulp is a rather
recent OpenSM-specific feature, and the SMs in our switches apparently
don't offer a similar SID-to-SL mapping either, but it certainly got me
a leap further.

Thanks,

Daniel.
On Mon, May 18, 2009 at 12:04:37PM +0200, Daniel Kobras wrote:
> Hi!
>
> Does anyone know how to use QoS with Lustre's o2ib LND? The Voltaire IB
> LND allowed to #define a service level, but I couldn't find a similar
> facility in o2ib. Is there a different way to apply QoS rules?

The o2iblnd SL is set by the OFED RDMA CM, indirectly based on the
o2iblnd service port (set via the ko2iblnd option 'service', 987 by
default) and its port space (RDMA_PS_TCP). For a complete, and more
complicated, story please see:
https://bugzilla.lustre.org/show_bug.cgi?id=18360#c2

Isaac
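For reference, 'service' here is an ordinary ko2iblnd module parameter;
on a typical installation (exact modprobe configuration file path
depends on the distribution) it would be set with an options line such
as the following, 987 being the default:

    options ko2iblnd service=987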
On Tue, May 19, 2009 at 05:55:21PM +0200, Sébastien Buisson wrote:
> Hi,
>
> We took a slightly different approach to deal with IB QoS in Lustre.
>
> We decided to assign a specific service-id to Lustre: in ofa-kernel we
> added a new value in the rdma_port_space enum, which we called
> RDMA_PS_LUSTRE. Then we modified the calls to rdma_create_id in
> o2iblnd.c and o2iblnd_cb.c to use this new port space value instead of
> RDMA_PS_TCP (well, we did a little more than that in the Lustre code,
> because we wanted the service-id to be a ko2iblnd module parameter, so
> we added some stuff in o2iblnd_modparams.c, for instance).

Maybe I missed something, but it seems to me an overkill to specify the
service-id this way. Without any code changes, you can figure out the
service-id from the ko2iblnd 'service' option:

rdma_resolve_route -> cma_resolve_ib_route -> cma_query_ib_route -> cma_get_service_id

Isaac
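To make that mapping concrete, here is a small self-contained sketch of
what the call chain above computes, assuming cma_get_service_id()
follows the usual RDMA CM convention of (port space << 16) | port and
that RDMA_PS_TCP is 0x0106; with the default ko2iblnd port of 987 this
yields 0x00000000010603db:

    #include <stdio.h>
    #include <stdint.h>

    /* Assumed to mirror the RDMA CM's cma_get_service_id():
     * service-id = (port space << 16) | port number */
    int main(void)
    {
            uint64_t rdma_ps_tcp = 0x0106;  /* RDMA_PS_TCP port space */
            uint64_t port        = 987;     /* default ko2iblnd 'service' */
            uint64_t sid         = (rdma_ps_tcp << 16) | port;

            printf("service-id = 0x%016llx\n", (unsigned long long)sid);
            /* prints: service-id = 0x00000000010603db */
            return 0;
    }

Under that assumption, a qos-ulps rule along the lines of
"any, service-id=0x00000000010603db : 1" would then match o2iblnd
connections without any code changes.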
Sébastien Buisson
2009-Jun-22 14:49 UTC
[Lustre-discuss] InfiniBand QoS with Lustre ko2iblnd.
Hi all,

We have been thinking about this IB QoS thing in Lustre for a while,
and we would like to express a need that may not be satisfied by the
current solution exposed by Isaac (which consists of using the ko2iblnd
'service' option).

Let's consider we have two sets of OSSes, each set serving a different
Lustre file system (i.e. all the OSTs of an OSS are part of the same
Lustre file system). The same Lustre clients have access to both
file systems. In these conditions, how can we enforce different IB QoS
in Lustre for the two file systems?

- By using the ko2iblnd 'service' option, the o2iblnd SL would be the
same for all connections initiated by a given Lustre client, regardless
of the destination file system. So we would not achieve our goal.
Unless what really matters is the SL of the connections created by the
servers (I think I have seen in the Lustre debug logs that the 'real'
data transfers are always done via the server connections).
What do you think?

- If the 'service id' information was stored on the MGS on a
per-file-system basis, one could imagine retrieving it at mount time on
the clients. The 'service id' information stored on the MGS could
consist of a port space and a port id. Thus it would be possible to
assign different service ports to the various connections initiated by
the client, depending on the target file system.
What do you think? Would you say this is feasible, or can you see major
issues with this proposal?

Thanks in advance.
Sebastien.

Isaac Huang wrote:
> On Mon, May 18, 2009 at 12:04:37PM +0200, Daniel Kobras wrote:
>> Hi!
>>
>> Does anyone know how to use QoS with Lustre's o2ib LND? The Voltaire IB
>> LND allowed to #define a service level, but I couldn't find a similar
>> facility in o2ib. Is there a different way to apply QoS rules?
>
> The o2iblnd SL is set by the OFED RDMA CM, indirectly based on the
> o2iblnd service port (set via the ko2iblnd option 'service', 987 by
> default) and its port space (RDMA_PS_TCP). For a complete, and more
> complicated, story please see:
> https://bugzilla.lustre.org/show_bug.cgi?id=18360#c2
>
> Isaac
Sébastien Buisson
2009-Jun-24 07:46 UTC
[Lustre-discuss] InfiniBand QoS with Lustre ko2iblnd.
Sébastien Buisson wrote:
> Hi all,
>
> We have been thinking about this IB QoS thing in Lustre for a while,
> and we would like to express a need that may not be satisfied by the
> current solution exposed by Isaac (which consists of using the
> ko2iblnd 'service' option).
>
> Let's consider we have two sets of OSSes, each set serving a different
> Lustre file system (i.e. all the OSTs of an OSS are part of the same
> Lustre file system). The same Lustre clients have access to both
> file systems. In these conditions, how can we enforce different IB QoS
> in Lustre for the two file systems?
>
> - By using the ko2iblnd 'service' option, the o2iblnd SL would be the
> same for all connections initiated by a given Lustre client, regardless
> of the destination file system. So we would not achieve our goal.
> Unless what really matters is the SL of the connections created by the
> servers (I think I have seen in the Lustre debug logs that the 'real'
> data transfers are always done via the server connections).
> What do you think?

I have tried to make a client, for which I set the ko2iblnd 'service'
option to 986, communicate with a server for which I set the ko2iblnd
'service' option to 987: it does not work. This is not surprising,
because the ko2iblnd 'service' parameter is used on the client side in
the kiblnd_connect_peer function to designate the port of the remote
peer (the server in this case).

So the ko2iblnd 'service' option must be the same for all the nodes
participating in the same file system. In our case, where the same
clients access both file systems, it means that we will not be able to
set different o2iblnd SLs for the two file systems.

> - If the 'service id' information was stored on the MGS on a
> per-file-system basis, one could imagine retrieving it at mount time
> on the clients. The 'service id' information stored on the MGS could
> consist of a port space and a port id. Thus it would be possible to
> assign different service ports to the various connections initiated by
> the client, depending on the target file system.
> What do you think? Would you say this is feasible, or can you see major
> issues with this proposal?

The peer's port information could be stored in the kib_peer_t
structure. That way, it would be possible to make clients connect to
servers which listen on different ports. What do you think?
Hi Sébastien!

On Wed, Jun 24, 2009 at 09:46:19AM +0200, Sébastien Buisson wrote:
>> - If the 'service id' information was stored on the MGS on a
>> per-file-system basis, one could imagine retrieving it at mount time
>> on the clients. The 'service id' information stored on the MGS could
>> consist of a port space and a port id. Thus it would be possible to
>> assign different service ports to the various connections initiated by
>> the client, depending on the target file system.
>> What do you think? Would you say this is feasible, or can you see major
>> issues with this proposal?
>
> The peer's port information could be stored in the kib_peer_t
> structure. That way, it would be possible to make clients connect to
> servers which listen on different ports. What do you think?

Why do you want to distinguish the two file systems solely by service
id rather than, say, service id plus the port GUIDs of the respective
Lustre servers? You'll need a full QoS policy file instead of the
simplified syntax, and the configuration needs to be adapted on
hardware changes, but this still looks simpler to me than modifying the
wire protocol.

Regards,

Daniel.
On Mon, Jun 22, 2009 at 04:49:03PM +0200, Sébastien Buisson wrote:
> ......
> Let's consider we have two sets of OSSes, each set serving a different
> Lustre file system (i.e. all the OSTs of an OSS are part of the same
> Lustre file system). The same Lustre clients have access to both
> file systems. In these conditions, how can we enforce different IB QoS
> in Lustre for the two file systems?

By assigning different SLs to the two sets of servers based on server
GUIDs, i.e. target-port-guid in the QoS policy file.

> - By using the ko2iblnd 'service' option, the o2iblnd SL would be the
> same for all connections initiated by a given Lustre client, regardless
> of the destination file system. So we would not achieve our goal.

Not necessarily. The service-id would be the same, but the SLs could be
different if the SM has been configured in a way that doesn't determine
SLs solely based on service-id (e.g. also based on target GUIDs).

> ......
> - If the 'service id' information was stored on the MGS on a
> per-file-system basis, one could imagine retrieving it at mount time
> on the clients. The 'service id' information stored on the MGS could
> consist of a port space and a port id. Thus it would be possible to
> assign different service ports to the various connections initiated by
> the client, depending on the target file system.
> What do you think? Would you say this is feasible, or can you see major
> issues with this proposal?

The LNet configuration could not reside on the MGS, because LNet must
already be properly configured before any configuration on the MGS
could be fetched over the network.

Isaac
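For illustration, using the simplified qos-ulps syntax from the notes
earlier in the thread, that could look roughly like the following; the
GUIDs shown are placeholders for the port GUIDs of the servers of each
file system:

qos-ulps
    default : 0
    any, target-port-guid 0x0002c9030002aaaa,0x0002c9030002aaab : 1
    any, target-port-guid 0x0002c9030002bbbb,0x0002c9030002bbbc : 2
end-qos-ulps

SL 1 and SL 2 can then be given different SL2VL mappings and VL
arbitration weights as described in those notes.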
On Wed, Jun 24, 2009 at 09:46:19AM +0200, Sébastien Buisson wrote:
> ......
> The peer's port information could be stored in the kib_peer_t
> structure. That way, it would be possible to make clients connect to
> servers which listen on different ports. What do you think?

At this point it can't be done. But we have in our development plans to
implement dynamic LNet configuration, which includes per-NI options
(i.e. it'd be possible to specify the 'service' option on a per-NI
basis instead of it being just LND-global), and once that is
implemented you'd be able to specify a different 'service' option if
you created two server networks for the two file systems.

For your current concern of setting up different SLs, I believe it
could be achieved via target GUIDs as mentioned in my previous reply.

Hope this helps,
Isaac
Sébastien Buisson
2009-Jun-26 11:42 UTC
[Lustre-discuss] InfiniBand QoS with Lustre ko2iblnd.
Isaac Huang wrote:
> On Wed, Jun 24, 2009 at 09:46:19AM +0200, Sébastien Buisson wrote:
>> ......
>> The peer's port information could be stored in the kib_peer_t
>> structure. That way, it would be possible to make clients connect to
>> servers which listen on different ports. What do you think?
>
> At this point it can't be done. But we have in our development plans to
> implement dynamic LNet configuration, which includes per-NI options
> (i.e. it'd be possible to specify the 'service' option on a per-NI
> basis instead of it being just LND-global), and once that is
> implemented you'd be able to specify a different 'service' option if
> you created two server networks for the two file systems.

OK, if I understand correctly, the major hurdle with what I proposed is
that LNET is not able to get configuration information dynamically at
the moment, right?

I agree with you, I think the per-NI options in LNET would do the
trick. Do you have plans for when this feature would be available? Have
you already begun to work on it? If you have some pre-alpha work, we
would be glad to evaluate it.

> For your current concern of setting up different SLs, I believe it
> could be achieved via target GUIDs as mentioned in my previous reply.

Unfortunately, configuring IB QoS via target GUIDs quickly becomes too
complicated. As the size of clusters grows, it would require listing
hundreds of GUIDs in the QoS policy rules.

Sebastien.
On Fri, Jun 26, 2009 at 01:42:53PM +0200, Sébastien Buisson wrote:
> Isaac Huang wrote:
>> On Wed, Jun 24, 2009 at 09:46:19AM +0200, Sébastien Buisson wrote:
>>> ......
>>> The peer's port information could be stored in the kib_peer_t
>>> structure. That way, it would be possible to make clients connect to
>>> servers which listen on different ports. What do you think?
>>
>> At this point it can't be done. But we have in our development plans to
>> implement dynamic LNet configuration, which includes per-NI options
>> (i.e. it'd be possible to specify the 'service' option on a per-NI
>> basis instead of it being just LND-global), and once that is
>> implemented you'd be able to specify a different 'service' option if
>> you created two server networks for the two file systems.
>
> OK, if I understand correctly, the major hurdle with what I proposed is
> that LNET is not able to get configuration information dynamically at
> the moment, right?

Yes.

> I agree with you, I think the per-NI options in LNET would do the
> trick. Do you have plans for when this feature would be available? Have
> you already begun to work on it?

It's too early to make any realistic estimate at the moment. Though
it's already on the LNet roadmap, I'm not sure when we're going to
start working on it.

> If you have some pre-alpha work, we would be glad to evaluate it.

Thanks, I'll remember to ping you when it's available.

>> For your current concern of setting up different SLs, I believe it
>> could be achieved via target GUIDs as mentioned in my previous reply.
>
> Unfortunately, configuring IB QoS via target GUIDs quickly becomes too
> complicated. As the size of clusters grows, it would require listing
> hundreds of GUIDs in the QoS policy rules.

Yes, it's rather cumbersome at bigger scales.

Thanks,
Isaac
On Wed, Jul 01, 2009 at 02:07:33AM -0400, Isaac Huang wrote:
> ......
>>> For your current concern of setting up different SLs, I believe it
>>> could be achieved via target GUIDs as mentioned in my previous reply.
>>
>> Unfortunately, configuring IB QoS via target GUIDs quickly becomes too
>> complicated. As the size of clusters grows, it would require listing
>> hundreds of GUIDs in the QoS policy rules.
>
> Yes, it's rather cumbersome at bigger scales.

It just occurred to me that it might work to configure the QoS policy
based on IB partition keys. It's just an initial thought: if you
configured two @o2ib networks over two IB partitions on the same
fabric, one for each file system, then you could differentiate the
traffic of the two file systems based on their partition keys. I think
it'd be much easier to configure an additional @o2ib network than to
maintain hundreds of GUIDs, which could change, in the policy file.

By default the o2iblnd runs over the default IB partition. Please see
bug 18602 for how to configure the o2iblnd over a non-default
partition.

Thanks,
Isaac
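Assuming the simplified qos-ulps syntax accepts pkey matching, as the
option list in the notes earlier in the thread suggests, the policy
side of such a setup might look roughly like this; the partition key
values are placeholders for the partitions carrying the two @o2ib
networks:

qos-ulps
    default : 0
    any, pkey 0x8001 : 1
    any, pkey 0x8002 : 2
end-qos-ulps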