Hello,

At Bull we would like to use a specific machine as a Lustre OSS. This is a NUMA IO machine with 2 InfiniBand interfaces for the connections to the clients, and 2 Fibre Channel interfaces giving access to 8 LUNs.

Given this architecture, the challenge is to suffer as little as possible from the NUMA factor. Ideally, this would require the ability in Lustre to 'bind' a given OST to a given IB interface (let's assume we know which IB interface best matches a given FC interface). The goal is to ensure that no 'NUMA IO tax' is paid when data is transferred between an FC interface and an IB interface, i.e. when a Lustre client reads from or writes to an OST.

Concretely, we would like to know if it is possible in Lustre to bind an OST to a specific network interface, so that this OST is only reached through this interface (thus avoiding the NUMA IO factor in our case)? For instance, we would like to have 4 OSTs attached to ib0 and the 4 other OSTs attached to ib1.

We had a look at the Lustre Manual, but we did not find any way to set up such a binding with the usual "mkfs.lustre" and "mount -t lustre" commands.

Cheers,
Sebastien.
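P.S. For reference, our current setup looks roughly like the sketch below (fsname, NIDs and device paths are made up). Both IB interfaces are declared to LNET, and we found no per-OST option to restrict which one is used:

    # /etc/modprobe.conf on the OSS: LNET is told about both IB interfaces
    options lnet networks="o2ib0(ib0),o2ib1(ib1)"

    # format one of the eight OSTs (fsname, MGS NID and device are examples)
    mkfs.lustre --fsname=lustre --ost --mgsnode=10.0.0.1@o2ib0 /dev/mapper/lun0

    # start the OST; we see no option here to tie it to a single IB interface
    mount -t lustre /dev/mapper/lun0 /mnt/ost0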
Hi!

On Tue, May 12, 2009 at 04:07:51PM +0200, Sébastien Buisson wrote:
> Concretely, we would like to know if it is possible in Lustre to bind an
> OST to a specific network interface, so that this OST is only reached
> through this interface (thus avoiding the NUMA IO factor in our case)?
> For instance, we would like to have 4 OSTs attached to ib0 and the 4
> other OSTs attached to ib1.

As I recently learned the hard way, the network settings of an OST are fixed when the OST connects to the MGS for the first time. Hence, you could ifup ib0, ifdown ib1, start LNET and fire up the first set of OSTs, then down LNET again, ifdown ib0, ifup ib1, restart LNET and start up the rest of the OSTs. Subsequently, you should be able to run with both IB NIDs enabled, but clients should still only know about a single IB NID for each OST.

As for a less hackish solution, wouldn't it be cleaner to just run, say, two Xen domUs on the machine and map the HBAs/HCAs as appropriate?

Regards,

Daniel.
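P.S. Spelled out, the registration trick would look roughly like this (device paths, mount points and the exact LNET start/stop commands depend on your setup, so treat this only as a sketch):

    # 1) register the first four OSTs while only ib0 is visible to LNET
    ifdown ib1
    ifup ib0
    modprobe lnet
    lctl network up
    mount -t lustre /dev/mapper/lun0 /mnt/ost0    # likewise for OSTs 1-3
    umount /mnt/ost0                              # stop them again
    lctl network down
    lustre_rmmod

    # 2) register the remaining four OSTs while only ib1 is visible
    ifdown ib0
    ifup ib1
    modprobe lnet
    lctl network up
    mount -t lustre /dev/mapper/lun4 /mnt/ost4    # likewise for OSTs 5-7
    umount /mnt/ost4
    lctl network down
    lustre_rmmod

    # 3) bring both interfaces up and mount all eight OSTs for normal
    #    operation; the MGS has recorded a single IB NID per OST, and that
    #    is the one the clients will keep using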
Hi Daniel,

Daniel Kobras wrote:
> As I recently learned the hard way, the network settings of an OST are
> fixed when the OST connects to the MGS for the first time. Hence, you
> could ifup ib0, ifdown ib1, start LNET and fire up the first set of
> OSTs, then down LNET again, ifdown ib0, ifup ib1, restart LNET and start
> up the rest of the OSTs. Subsequently, you should be able to run with
> both IB NIDs enabled, but clients should still only know about a single
> IB NID for each OST.

Your workaround sounds very interesting, because it shows that the network settings of an OST cannot be set by any configuration directive: this happens in an opaque way at mkfs.lustre time.

> As for a less hackish solution, wouldn't it be cleaner to just run,
> say, two Xen domUs on the machine and map the HBAs/HCAs as appropriate?

Virtualization is the solution we were considering in case of a negative answer to our question about binding OSTs to network interfaces. :) If we are able to isolate the appropriate HBAs/HCAs in separate VMs, that could do the trick. We would prefer KVM over Xen, but the principle is the same. Of course this is a bit cumbersome, because each node will be seen as 2 different Lustre servers (OSSes), but it seems to be the only available solution at the moment.

Thanks,
Sebastien.
On May 12, 2009, at 4:07 PM, Sébastien Buisson wrote:
> Concretely, we would like to know if it is possible in Lustre to bind an
> OST to a specific network interface, so that this OST is only reached
> through this interface (thus avoiding the NUMA IO factor in our case)?

No, I fear this cannot be done easily today, since server_sb2mti() registers all the available NIDs. However, it should not be very difficult to add an option that only registers a subset of the available NIDs for a given target.

Johann
Hi!

On Tue, May 12, 2009 at 05:49:26PM +0200, Sébastien Buisson wrote:
> Your workaround sounds very interesting, because it shows that the
> network settings of an OST cannot be set by any configuration
> directive: this happens in an opaque way at mkfs.lustre time.

To clarify: it's the initial mount rather than mkfs. As far as I understood, the process works as follows (and hopefully someone will correct me if I got it wrong):

* OST is formatted.
* OST is mounted for the first time.
  + OST queries the current list of NIDs on its OSS.
  + OST sends its list of NIDs off to the MGS.
* MGS registers the new OST (and its NIDs).
* MGS advertises the new OST (and its NIDs) to all (present and future) clients.

It's important to note that MGS registration is a one-off process that cannot be changed or redone later on, unless you wipe the complete Lustre configuration from all servers using the infamous --writeconf method and restart all Lustre clients to remove any stale cached info. (Which of course implies a downtime of the complete filesystem.) Once the registration is done, you can change a server's LNET configuration, but the MGS won't care, and the clients will never be notified of it.

Regards,

Daniel.
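P.S. For completeness, the --writeconf procedure I mentioned goes roughly as follows (device names and mount points are placeholders; see the Lustre manual for the exact steps):

    # unmount all clients, then all OSTs, then the MDT, and stop LNET everywhere

    # regenerate the configuration logs, MDT first, then every OST
    tunefs.lustre --writeconf /dev/mdt_device
    tunefs.lustre --writeconf /dev/ost_device     # on each OSS, for each OST

    # remount in order: MGS/MDT first, then the OSTs, then the clients
    mount -t lustre /dev/mdt_device /mnt/mdt
    mount -t lustre /dev/ost_device /mnt/ost
    # each target re-registers with the MGS using whatever NIDs LNET
    # reports at that point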
On May 12, 2009 16:07 +0200, Sébastien Buisson wrote:
> At Bull we would like to use a specific machine as a Lustre OSS.
> This is a NUMA IO machine with 2 InfiniBand interfaces for the
> connections to the clients, and 2 Fibre Channel interfaces giving
> access to 8 LUNs.
>
> Given this architecture, the challenge is to suffer as little as
> possible from the NUMA factor. Ideally, this would require the ability
> in Lustre to 'bind' a given OST to a given IB interface (let's assume
> we know which IB interface best matches a given FC interface). The
> goal is to ensure that no 'NUMA IO tax' is paid when data is
> transferred between an FC interface and an IB interface, i.e. when a
> Lustre client reads from or writes to an OST.

Note that the OST threads are already bound to a particular NUMA node. This means that the pages used for the IO are CPU-local and are not accessed from a remote CPU's cache.

I don't know if there is a CPU affinity option for the IB interfaces, but that is definitely possible.

> Concretely, we would like to know if it is possible in Lustre to bind an
> OST to a specific network interface, so that this OST is only reached
> through this interface (thus avoiding the NUMA IO factor in our case)?
> For instance, we would like to have 4 OSTs attached to ib0 and the 4
> other OSTs attached to ib1.

Do you know if there is a particular performance problem with the current Lustre code, or are you only speculating?

Note that there is already work underway to make the IB driver and the ptlrpc service handling use per-CPU threads, so if you are interested in testing this we could give you an early version of the patch when it is available.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
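P.S. Regarding CPU affinity for the IB interfaces: one generic, non-Lustre-specific approach is to pin each HCA's interrupts to the CPUs of its local NUMA node, roughly as sketched below (the IRQ numbers and CPU masks are only examples; look them up in /proc/interrupts and your machine's topology, and note that the irqbalance daemon may rewrite these masks unless it is stopped):

    # find the IRQs used by each HCA (names depend on the driver, e.g. mlx4 or mthca)
    grep -i mlx /proc/interrupts

    # pin the IRQs of the HCA behind ib0 to the CPUs of its local node
    echo f > /proc/irq/123/smp_affinity

    # and the IRQs of the HCA behind ib1 to the CPUs of the other node
    echo f0 > /proc/irq/124/smp_affinity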
Andreas Dilger wrote:
> Note that the OST threads are already bound to a particular NUMA node.
> This means that the pages used for the IO are CPU-local and are not
> accessed from a remote CPU's cache.

Indeed, I have seen the following piece of code in lustre/ptlrpc/service.c:

#if defined(HAVE_NODE_TO_CPUMASK) && defined(CONFIG_NUMA)
        /* we need to do this before any per-thread allocation is done so
         * that we get the per-thread allocations on local node. bug 7342 */
        if (svc->srv_cpu_affinity) {
                int cpu, num_cpu;

                for (cpu = 0, num_cpu = 0; cpu < num_possible_cpus(); cpu++) {
                        if (!cpu_online(cpu))
                                continue;
                        if (num_cpu == thread->t_id % num_online_cpus())
                                break;
                        num_cpu++;
                }
                set_cpus_allowed(cfs_current(),
                                 node_to_cpumask(cpu_to_node(cpu)));
        }
#endif

> I don't know if there is a CPU affinity option for the IB interfaces,
> but that is definitely possible.
>
> Do you know if there is a particular performance problem with the current
> Lustre code, or are you only speculating?

We have not looked at the Lustre code itself, as we have not had the opportunity to run Lustre tests so far. But we carried out basic tests (using xdd and ib_rdma_bw, for instance) on the machine, which showed that the NUMA IO factor does harm performance. This is why we are looking for ways to avoid this NUMA IO factor with Lustre.

> Note that there is already work underway to make the IB driver and the
> ptlrpc service handling use per-CPU threads, so if you are interested
> in testing this we could give you an early version of the patch when it
> is available.

Yes, we would be very interested in testing an early version of the patch that makes it possible for the IB driver and the ptlrpc service handling to use per-CPU threads. Is there a bugzilla entry for this?

This feature is necessary for what we are trying to achieve, but I do not know whether it will be enough. Indeed, what will ensure that all clients that want to reach an OST use the right IB interface, i.e. the one for which there is no NUMA IO factor to the FC adapter that connects the LUN? What do you think?

Cheers,
Sebastien.
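P.S. For the record, the "basic tests" mentioned above were of this kind (node numbers and the peer hostname are just examples); comparing the run pinned to the HCA-local node against the run pinned to the remote node is what exposes the NUMA IO factor:

    # ib_rdma_bw runs without arguments on the peer acting as server

    # client side, pinned to the node local to the HCA...
    numactl --cpunodebind=0 --membind=0 ib_rdma_bw peer_host

    # ...then pinned to the remote node, and compare the bandwidth figures
    numactl --cpunodebind=1 --membind=1 ib_rdma_bw peer_host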