Hello,

At Bull we would like to use a specific machine as a Lustre OSS. This is a NUMA IO machine with 2 InfiniBand interfaces for the connections to the clients, and 2 Fibre Channel interfaces giving access to 8 LUNs.

Given this architecture, the challenge is to suffer as little as possible from the NUMA factor. Ideally, this would require the ability in Lustre to 'bind' a given OST to a given IB interface (let's assume we know which IB interface best matches a given FC interface). The goal is to ensure that no 'NUMA IO tax' is paid when data is transferred between an FC interface and an IB interface, i.e. when a Lustre client reads from or writes to an OST.

Concretely, we would like to know if it is possible in Lustre to bind an OST to a specific network interface, so that this OST is only reached through this interface (thus avoiding the NUMA IO factor in our case)? For instance, we would like to have 4 OSTs attached to ib0 and the 4 other OSTs attached to ib1.

We had a look at the Lustre Manual, but we did not find any way to set up such a binding with the usual "mkfs.lustre" and "mount -t lustre" commands.

Cheers,
Sebastien.
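P.S. For reference, our current setup looks roughly like the sketch below (fsname, NIDs and device paths are made up). Both IB interfaces are declared to LNET, and we found no per-OST option to restrict which one is used:

    # /etc/modprobe.conf on the OSS: LNET is told about both IB interfaces
    options lnet networks="o2ib0(ib0),o2ib1(ib1)"

    # format one of the eight OSTs (fsname, MGS NID and device are examples)
    mkfs.lustre --fsname=lustre --ost --mgsnode=10.0.0.1@o2ib0 /dev/mapper/lun0

    # start the OST; we see no option here to tie it to a single IB interface
    mount -t lustre /dev/mapper/lun0 /mnt/ost0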
Hi!

On Tue, May 12, 2009 at 04:07:51PM +0200, Sébastien Buisson wrote:
> Concretely, we would like to know if it is possible in Lustre to bind an
> OST to a specific network interface, so that this OST is only reached
> through this interface (thus avoiding the NUMA IO factor in our case)?
> For instance, we would like to have 4 OSTs attached to ib0 and the 4
> other OSTs attached to ib1.

As I recently learned the hard way, the network settings of an OST are fixed when the OST connects to the MGS for the first time. Hence, you could ifup ib0, ifdown ib1, start LNET and fire up the first set of OSTs, then down LNET again, ifdown ib0, ifup ib1, restart LNET and start up the rest of the OSTs. Subsequently, you should be able to run with both IB NIDs enabled, but clients should still only know about a single IB NID for each OST.

As for a less hackish solution, wouldn't it be cleaner to just run, say, two Xen domUs on the machine and map the HBAs/HCAs as appropriate?

Regards,

Daniel.
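P.S. Spelled out, the registration trick would look roughly like this (device paths, mount points and the exact LNET start/stop commands depend on your setup, so treat this only as a sketch):

    # 1) register the first four OSTs while only ib0 is visible to LNET
    ifdown ib1
    ifup ib0
    modprobe lnet
    lctl network up
    mount -t lustre /dev/mapper/lun0 /mnt/ost0    # likewise for OSTs 1-3
    umount /mnt/ost0                              # stop them again
    lctl network down
    lustre_rmmod

    # 2) register the remaining four OSTs while only ib1 is visible
    ifdown ib0
    ifup ib1
    modprobe lnet
    lctl network up
    mount -t lustre /dev/mapper/lun4 /mnt/ost4    # likewise for OSTs 5-7
    umount /mnt/ost4
    lctl network down
    lustre_rmmod

    # 3) bring both interfaces up and mount all eight OSTs for normal
    #    operation; the MGS has recorded a single IB NID per OST, and that
    #    is the one the clients will keep using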
Hi Daniel,

Daniel Kobras wrote:
> As I recently learned the hard way, the network settings of an OST are
> fixed when the OST connects to the MGS for the first time. Hence, you
> could ifup ib0, ifdown ib1, start LNET and fire up the first set of
> OSTs, then down LNET again, ifdown ib0, ifup ib1, restart LNET and start
> up the rest of the OSTs. Subsequently, you should be able to run with
> both IB NIDs enabled, but clients should still only know about a single
> IB NID for each OST.

Your workaround sounds very interesting, because it shows that the network settings of an OST cannot be set by any configuration directive: this happens in an opaque way at mkfs.lustre time.

> As for a less hackish solution, wouldn't it be cleaner to just run,
> say, two Xen domUs on the machine and map the HBAs/HCAs as appropriate?

Virtualization is the solution we were considering in case of a negative answer to our question about binding OSTs to network interfaces. :) If we are able to isolate the appropriate HBAs/HCAs in separate VMs, that could do the trick. We would prefer KVM over Xen, but the principle is the same. Of course this is a bit cumbersome, because each node will be seen as 2 different Lustre servers (OSSes), but it seems to be the only available solution at the moment.

Thanks,
Sebastien.
On May 12, 2009, at 4:07 PM, Sébastien Buisson wrote:
> Concretely, we would like to know if it is possible in Lustre to bind an
> OST to a specific network interface, so that this OST is only reached
> through this interface (thus avoiding the NUMA IO factor in our case)?

No, I fear this cannot be done easily today, since server_sb2mti() registers all the available NIDs. However, it should not be very difficult to add an option that only registers a subset of the available NIDs for a given target.

Johann
Hi!

On Tue, May 12, 2009 at 05:49:26PM +0200, Sébastien Buisson wrote:
> Your workaround sounds very interesting, because it shows that the
> network settings of an OST cannot be set by any configuration
> directive: this happens in an opaque way at mkfs.lustre time.

To clarify: it's the initial mount rather than mkfs. As far as I understood, the process works as follows (and hopefully someone will correct me if I got it wrong):

* OST is formatted.
* OST is mounted for the first time.
  + OST queries the current list of NIDs on its OSS.
  + OST sends its list of NIDs off to the MGS.
* MGS registers the new OST (and its NIDs).
* MGS advertises the new OST (and its NIDs) to all (present and future) clients.

It's important to note that MGS registration is a one-off process that cannot be changed or redone later on, unless you wipe the complete Lustre configuration from all servers using the infamous --writeconf method and restart all Lustre clients to remove any stale cached info. (Which of course implies a downtime of the complete filesystem.) Once the registration is done, you can change a server's LNET configuration, but the MGS won't care, and the clients will never be notified of it.

Regards,

Daniel.
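P.S. For completeness, the --writeconf procedure I mentioned goes roughly as follows (device names and mount points are placeholders; see the Lustre manual for the exact steps):

    # unmount all clients, then all OSTs, then the MDT, and stop LNET everywhere

    # regenerate the configuration logs, MDT first, then every OST
    tunefs.lustre --writeconf /dev/mdt_device
    tunefs.lustre --writeconf /dev/ost_device     # on each OSS, for each OST

    # remount in order: MGS/MDT first, then the OSTs, then the clients
    mount -t lustre /dev/mdt_device /mnt/mdt
    mount -t lustre /dev/ost_device /mnt/ost
    # each target re-registers with the MGS using whatever NIDs LNET
    # reports at that point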
On May 12, 2009 16:07 +0200, Sébastien Buisson wrote:
> At Bull we would like to use a specific machine as a Lustre OSS.
> This is a NUMA IO machine with 2 InfiniBand interfaces for the
> connections to the clients, and 2 Fibre Channel interfaces giving
> access to 8 LUNs.
>
> Given this architecture, the challenge is to suffer as little as
> possible from the NUMA factor. Ideally, this would require the ability
> in Lustre to 'bind' a given OST to a given IB interface (let's assume
> we know which IB interface best matches a given FC interface). The
> goal is to ensure that no 'NUMA IO tax' is paid when data is
> transferred between an FC interface and an IB interface, i.e. when a
> Lustre client reads from or writes to an OST.

Note that the OST threads are already bound to a particular NUMA node. This means that the pages used for the IO are CPU-local and are not accessed from a remote CPU's cache.

I don't know if there is a CPU affinity option for the IB interfaces, but that is definitely possible.

> Concretely, we would like to know if it is possible in Lustre to bind an
> OST to a specific network interface, so that this OST is only reached
> through this interface (thus avoiding the NUMA IO factor in our case)?
> For instance, we would like to have 4 OSTs attached to ib0 and the 4
> other OSTs attached to ib1.

Do you know if there is a particular performance problem with the current Lustre code, or are you only speculating?

Note that there is already work underway to make the IB driver and the ptlrpc service handling use per-CPU threads, so if you are interested in testing this we could give you an early version of the patch when it is available.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
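P.S. Regarding CPU affinity for the IB interfaces: one generic, non-Lustre-specific approach is to pin each HCA's interrupts to the CPUs of its local NUMA node, roughly as sketched below (the IRQ numbers and CPU masks are only examples; look them up in /proc/interrupts and your machine's topology, and note that the irqbalance daemon may rewrite these masks unless it is stopped):

    # find the IRQs used by each HCA (names depend on the driver, e.g. mlx4 or mthca)
    grep -i mlx /proc/interrupts

    # pin the IRQs of the HCA behind ib0 to the CPUs of its local node
    echo f > /proc/irq/123/smp_affinity

    # and the IRQs of the HCA behind ib1 to the CPUs of the other node
    echo f0 > /proc/irq/124/smp_affinity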
Andreas Dilger wrote:
> Note that the OST threads are already bound to a particular NUMA node.
> This means that the pages used for the IO are CPU-local and are not
> accessed from a remote CPU's cache.

Indeed, I have seen the following piece of code in lustre/ptlrpc/service.c:

#if defined(HAVE_NODE_TO_CPUMASK) && defined(CONFIG_NUMA)
        /* we need to do this before any per-thread allocation is done so
         * that we get the per-thread allocations on local node. bug 7342 */
        if (svc->srv_cpu_affinity) {
                int cpu, num_cpu;

                for (cpu = 0, num_cpu = 0; cpu < num_possible_cpus(); cpu++) {
                        if (!cpu_online(cpu))
                                continue;
                        if (num_cpu == thread->t_id % num_online_cpus())
                                break;
                        num_cpu++;
                }
                set_cpus_allowed(cfs_current(),
                                 node_to_cpumask(cpu_to_node(cpu)));
        }
#endif

> I don't know if there is a CPU affinity option for the IB interfaces,
> but that is definitely possible.
>
> Do you know if there is a particular performance problem with the current
> Lustre code, or are you only speculating?

We have not looked at the Lustre code itself, as we have not had the opportunity to run Lustre tests so far. But we carried out basic tests (using xdd and ib_rdma_bw, for instance) on the machine, which showed that the NUMA IO factor does harm performance. This is why we are looking for ways to avoid this NUMA IO factor with Lustre.

> Note that there is already work underway to make the IB driver and the
> ptlrpc service handling use per-CPU threads, so if you are interested
> in testing this we could give you an early version of the patch when it
> is available.

Yes, we would be very interested in testing an early version of the patch that makes it possible for the IB driver and the ptlrpc service handling to use per-CPU threads. Is there a bugzilla entry for this?

This feature is necessary for what we are trying to achieve, but I do not know whether it will be enough. Indeed, what will ensure that all clients that want to reach an OST use the right IB interface, i.e. the one for which there is no NUMA IO factor to the FC adapter that connects the LUN? What do you think?

Cheers,
Sebastien.
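P.S. For the record, the "basic tests" mentioned above were of this kind (node numbers and the peer hostname are just examples); comparing the run pinned to the HCA-local node against the run pinned to the remote node is what exposes the NUMA IO factor:

    # ib_rdma_bw runs without arguments on the peer acting as server

    # client side, pinned to the node local to the HCA...
    numactl --cpunodebind=0 --membind=0 ib_rdma_bw peer_host

    # ...then pinned to the remote node, and compare the bandwidth figures
    numactl --cpunodebind=1 --membind=1 ib_rdma_bw peer_host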