Chris Worley
2008-Apr-22 03:22 UTC
[Lustre-discuss] Added Dual-homed OSS; ethernet clients confused
The only configuration error on my OSS was: I initially only had "o2ib0(ib0)" in my modprobe.conf. After unmounting all the OSTs, and getting the modprobe.conf right:

    options lnet networks=o2ib0(ib0),tcp0(eth0)

...and remounting from scratch, both ksocklnd and ko2iblnd are now loaded properly.

But, I still can't mount the partition on the ethernet-only client nodes.

They get the error:

    LustreError: 8439:0:(events.c:401:ptlrpc_uuid_to_peer()) No NID found for 36.102.29.4@o2ib
    LustreError: 8439:0:(client.c:58:ptlrpc_uuid_to_connection()) cannot find peer 36.102.29.4@o2ib!
    LustreError: 8439:0:(ldlm_lib.c:312:client_obd_setup()) can't add initial connection
    LustreError: 8439:0:(obd_config.c:325:class_setup()) setup lfs-OST0026-osc-0000010753919000 failed (-2)
    LustreError: 8439:0:(obd_config.c:1062:class_config_llog_handler()) Err -2 on cfg command:
    Lustre: cmd=cf003 0:lfs-OST0026-osc 1:lfs-OST0026_UUID 2:36.102.29.4@o2ib
    LustreError: 15c-8: MGC36.101.29.1@tcp: The configuration from log 'lfs-client' failed (-2).

The 36.102.29.4 is the IPoIB address of the added OSS. It shouldn't want it "@o2ib".

I've also unmounted all Lustre mounts on the MGS/MDS, unloaded all the modules and remounted. Still no joy.

The file systems were created on the new OST, just as on all the others:

    for i in b c d e f g h i j k l; do
        mkfs.lustre --ost \
            --mgsnode="36.102.29.1@o2ib0,36.101.29.1@tcp0" \
            --fsname=lfs --param sys.timeout=40 \
            --param lov.stripesize=2M /dev/sd$i &
    done

The client has the right modprobe.conf, which worked before the additional OST:

    options lnet networks=tcp0(eth0)

...and I'm using the same mount command that worked previously:

    mount -t lustre 36.101.29.1@tcp:/lfs /lfs

From the OST I can ping the client:

    # lctl list_nids
    36.102.29.4@o2ib
    36.101.29.4@tcp
    # lctl ping 36.101.255.10@tcp
    12345-0@lo
    12345-36.101.255.10@tcp

From the client, I can ping the OST and MDS/MGS:

    # lctl list_nids
    36.101.255.10@tcp
    # lctl ping 36.101.29.4@tcp
    12345-0@lo
    12345-36.102.29.4@o2ib
    12345-36.101.29.4@tcp
    # lctl ping 36.101.29.1@tcp
    12345-0@lo
    12345-36.102.29.1@o2ib
    12345-36.101.29.1@tcp

So, somehow, not having the right modprobe.conf the first time I mounted the partitions on the new OST has made it permanently not want to mount properly on Ethernet clients (it mounts fine on IB clients).

Any ideas?

Thanks,

Chris
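Editor's sketch, not part of the original post: the log says the client was handed the NID 36.102.29.4@o2ib, while `lctl list_nids` on that client shows only a tcp NID. Absent LNET routing, a client can only reach peers on a network it has configured itself. The helper names below (`nid_net`, `nid_reachable`) are hypothetical and only compare NID strings; they do not talk to LNET.

```shell
#!/bin/sh
# Print the LNET network part of a NID, normalising "tcp" to "tcp0".
# (A network name without a trailing number refers to network 0.)
nid_net() {
    net=${1#*@}
    case $net in
        *[0-9]) printf '%s\n' "$net" ;;
        *)      printf '%s\n' "${net}0" ;;
    esac
}

# Succeed if the first NID is on the same network as any of the
# remaining NIDs (the local NIDs from "lctl list_nids").
nid_reachable() {
    target=$(nid_net "$1")
    shift
    for local_nid in "$@"; do
        [ "$(nid_net "$local_nid")" = "$target" ] && return 0
    done
    return 1
}

# Values taken from the post: the NID in the failing config record,
# and the single NID of the Ethernet-only client.
if nid_reachable "36.102.29.4@o2ib" "36.101.255.10@tcp"; then
    echo "reachable"
else
    echo "no local network for this NID"
fi
```

With the values from the post this prints "no local network for this NID", which matches the "No NID found for 36.102.29.4@o2ib" error: the client has no o2ib network on which to look for that peer.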
Chris Worley
2008-Apr-22 03:31 UTC
[Lustre-discuss] Added Dual-homed OSS; ethernet clients confused
On Mon, Apr 21, 2008 at 9:22 PM, Chris Worley <worleys at gmail.com> wrote:
> I've also unmounted all Lustre mounts on the MGS/MDS, unloaded all the
> modules and remounted. Still no joy.

From this point forward, every time I say "OST" I mean "OSS"...

> The file systems were created on the new OST, just as on all the others:
Chris Worley
2008-Apr-22 17:36 UTC
[Lustre-discuss] Added Dual-homed OSS; ethernet clients confused
Does anybody have any clues, or do I need to rebuild the entire FS from scratch?
Cliff White
2008-Apr-22 20:21 UTC
[Lustre-discuss] Added Dual-homed OSS; ethernet clients confused
Chris Worley wrote:
> Does anybody have any clues, or do I need to rebuild the entire FS from scratch?

First, what is in your client modprobe.conf? Should only be 'tcp' for tcp-only clients.

Second, I don't think you can use an ipoib address as a tcp connection. If it's ipoib, LNET is going to use o2ib.

cliffw
Chris Worley
2008-Apr-22 20:44 UTC
[Lustre-discuss] Added Dual-homed OSS; ethernet clients confused
On Tue, Apr 22, 2008 at 2:21 PM, Cliff White <Cliff.White at sun.com> wrote:
> First, what is in your client modprobe.conf? Should only be 'tcp' for
> tcp-only clients.

It is/was:

    options lnet networks=tcp0(eth0)

...and this worked fine before I added the new OSS.

> Second, I don't think you can use an ipoib address as a tcp connection.
> If it's ipoib, LNET is going to use o2ib.

I don't quite follow. The specific client doesn't have IB.

The IPoIB addresses in the network are 36.102.x.x. The Ethernet addresses in the network are 36.101.x.x. Both use 16-bit netmasks.

The only place I use IPoIB addresses is in the file system creation on the OSSes, as in:

    for i in b c d e f g h i j k l; do
        mkfs.lustre --ost \
            --mgsnode="36.102.29.1@o2ib0,36.101.29.1@tcp0" \
            --fsname=lfs --param sys.timeout=40 \
            --param lov.stripesize=2M /dev/sd$i &
    done

...and that has worked well, up until I added another OSS.

Did I do something wrong?

The only thing I know I did wrong was, when I first mounted the created file systems, I had my new OSS's modprobe.conf set for IB only:

    options lnet networks=o2ib(ib0)

I changed that to be the same as my existing OSSes:

    options lnet networks=o2ib0(ib0),tcp0(eth0)

...after I realized my Ethernet-only clients weren't working, and reloaded everything from scratch. At this point, I have unmounted all clients, unmounted all Lustre OST/MDT file systems on the servers, removed all Lustre modules from all clients and servers, rebooted the Ethernet client, then remounted all the file systems everywhere... but still no joy on the Ethernet-only clients.

At this point I'm guessing that when I made the file systems on the new OSS, even though I had properly set:

    --mgsnode="36.102.29.1@o2ib0,36.101.29.1@tcp0"

...in the mkfs, the incorrectly set modprobe.conf screwed this mkfs up irrevocably, and since the file system has been in use from IB clients after adding the new OSS, my only recourse is to 1) back up the file system, and 2) rebuild everything (all OSTs and the MDT) from scratch (mkfs) on all OSSes and the MDS.

Is that correct?

Thanks,

Chris
Chris Worley
2008-Apr-22 23:35 UTC
[Lustre-discuss] Added Dual-homed OSS; ethernet clients confused
On Tue, Apr 22, 2008 at 5:01 PM, Cliff White <Cliff.White at sun.com> wrote:
> You don't need to reformat or rebuild. The new OST registers with the
> MGS on first startup, and since it didn't know about the TCP address is
> only registered as IB. You need to regenerate the config, which can be
> done with 'tunefs.lustre --writeconf' on the OSS providing the new OST.

I unmounted everything lustre from clients and servers. I didn't unload any modules.

On the OSS in question, for each OST, I did:

    # tunefs.lustre --writeconf --ost --mgsnode="36.102.29.1@o2ib0,36.101.29.1@tcp0" --fsname=lfs --param sys.timeout=40 --param lov.stripesize=2M /dev/sdl
    checking for existing Lustre data: found CONFIGS/mountdata
    Reading CONFIGS/mountdata

    Read previous values:
    Target:     lfs-OST0030
    Index:      48
    Lustre FS:  lfs
    Mount type: ldiskfs
    Flags:      0x2
                (OST )
    Persistent mount opts: errors=remount-ro,extents,mballoc
    Parameters: mgsnode=36.102.29.1@o2ib,36.101.29.1@tcp sys.timeout=40 lov.stripesize=2M

    Permanent disk data:
    Target:     lfs-OST0030
    Index:      48
    Lustre FS:  lfs
    Mount type: ldiskfs
    Flags:      0x142
                (OST update writeconf )
    Persistent mount opts: errors=remount-ro,extents,mballoc
    Parameters: mgsnode=36.102.29.1@o2ib,36.101.29.1@tcp sys.timeout=40 lov.stripesize=2M mgsnode=36.102.29.1@o2ib,36.101.29.1@tcp sys.timeout=40 lov.stripesize=2M

I remounted all the OSTs and MDT, then tried the Ethernet-only client mount. Still, the same error:

    # mount -t lustre 36.101.29.1@tcp:/lfs /lfs
    mount.lustre: mount 36.101.29.1@tcp:/lfs at /lfs failed: No such file or directory
    Is the MGS specification correct?
    Is the filesystem name correct?
    If upgrading, is the copied client log valid? (see upgrade docs)
    # dmesg
    Lustre: OBD class driver, info@clusterfs.com
    Lustre Version: 1.6.4.2
    Build Version: 1.6.4.2-19691231190000-PRISTINE-.usr.src.linux-2.6.9-67.0.4.EL-Lustre-1.6.4.2
    Lustre: Added LNI 36.101.255.10@tcp [8/256]
    Lustre: Accept secure, port 988
    Lustre: Lustre Client File System; info@clusterfs.com
    Lustre: Binding irq 177 to CPU 0 with cmd: echo 1 > /proc/irq/177/smp_affinity
    Lustre: lfs-clilov-000001076591e400.lov: set parameter stripesize=4194304
    LustreError: 6934:0:(events.c:401:ptlrpc_uuid_to_peer()) No NID found for 36.102.29.4@o2ib
    LustreError: 6934:0:(client.c:58:ptlrpc_uuid_to_connection()) cannot find peer 36.102.29.4@o2ib!
    LustreError: 6934:0:(ldlm_lib.c:312:client_obd_setup()) can't add initial connection
    LustreError: 7045:0:(connection.c:142:ptlrpc_put_connection()) NULL connection
    LustreError: 6934:0:(obd_config.c:325:class_setup()) setup lfs-OST0026-osc-000001076591e400 failed (-2)
    LustreError: 6934:0:(obd_config.c:1062:class_config_llog_handler()) Err -2 on cfg command:
    Lustre: cmd=cf003 0:lfs-OST0026-osc 1:lfs-OST0026_UUID 2:36.102.29.4@o2ib
    LustreError: 15c-8: MGC36.101.29.1@tcp: The configuration from log 'lfs-client' failed (-2). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
    LustreError: 6934:0:(llite_lib.c:1021:ll_fill_super()) Unable to process log: -2
    LustreError: 6934:0:(mdc_request.c:1273:mdc_precleanup()) client import never connected
    LustreError: 7045:0:(connection.c:142:ptlrpc_put_connection()) NULL connection
    LustreError: 6934:0:(obd_config.c:392:class_cleanup()) Device 41 not setup
    Lustre: client 000001076591e400 umount complete
    LustreError: 6934:0:(obd_mount.c:1924:lustre_fill_super()) Unable to mount (-2)

Did I do something wrong? The "man" page says to only do the "--writeconf" on the MDT node... I did it on the OSS as instructed.

Thanks,

Chris
Chris Worley
2008-Apr-23 00:08 UTC
[Lustre-discuss] Added Dual-homed OSS; ethernet clients confused
The error specifically complains about the first OST/disk on the new OSS, OST0026. Its tunefs.lustre output was:

    # tunefs.lustre --writeconf --ost --mgsnode="36.102.29.1@o2ib0,36.101.29.1@tcp0" --fsname=lfs --param sys.timeout=40 --param lov.stripesize=2M /dev/sdb
    checking for existing Lustre data: found CONFIGS/mountdata
    Reading CONFIGS/mountdata

    Read previous values:
    Target:     lfs-OST0026
    Index:      38
    Lustre FS:  lfs
    Mount type: ldiskfs
    Flags:      0x2
                (OST )
    Persistent mount opts: errors=remount-ro,extents,mballoc
    Parameters: mgsnode=36.102.29.1@o2ib,36.101.29.1@tcp sys.timeout=40 lov.stripesize=2M

    Permanent disk data:
    Target:     lfs-OST0026
    Index:      38
    Lustre FS:  lfs
    Mount type: ldiskfs
    Flags:      0x142
                (OST update writeconf )
    Persistent mount opts: errors=remount-ro,extents,mballoc
    Parameters: mgsnode=36.102.29.1@o2ib,36.101.29.1@tcp sys.timeout=40 lov.stripesize=2M mgsnode=36.102.29.1@o2ib,36.101.29.1@tcp sys.timeout=40 lov.stripesize=2M

    Writing CONFIGS/mountdata

In comparing with the first OST of an OSS that is (has been) working (doing a dryrun tunefs), I see no differences:

    # tunefs.lustre --dryrun --writeconf /dev/sdb
    checking for existing Lustre data: found CONFIGS/mountdata
    Reading CONFIGS/mountdata

    Read previous values:
    Target:     lfs-OST0006
    Index:      6
    Lustre FS:  lfs
    Mount type: ldiskfs
    Flags:      0x2
                (OST )
    Persistent mount opts: errors=remount-ro,extents,mballoc
    Parameters: mgsnode=36.102.29.1@o2ib,36.101.29.1@tcp sys.timeout=40 lov.stripesize=2M

    Permanent disk data:
    Target:     lfs-OST0006
    Index:      6
    Lustre FS:  lfs
    Mount type: ldiskfs
    Flags:      0x102
                (OST writeconf )
    Persistent mount opts: errors=remount-ro,extents,mballoc
    Parameters: mgsnode=36.102.29.1@o2ib,36.101.29.1@tcp sys.timeout=40 lov.stripesize=2M

    exiting before disk write.

Any clues there?

Thanks,

Chris
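Editor's sketch, not part of the original post: the message above ties the failing config record to OST0026 by eye. The same lookup can be scripted. The function names are hypothetical, and the conversion assumes the four characters after "OST" in a target name are the index in hexadecimal, which is consistent with the pairs shown in the thread (OST0026 / Index 38, OST0030 / Index 48).

```shell
#!/bin/sh
# Extract "<target> <nid>" from a client-side "cmd=cf003" log line.
cfg_target_nid() {
    printf '%s\n' "$1" |
        sed -n 's/.*1:\([^ ]*\)_UUID 2:\([^ ]*\).*/\1 \2/p'
}

# Convert a target name such as lfs-OST0026 to its decimal index.
# Assumption: the digits after "OST" are hexadecimal.
ost_index() {
    printf '%d\n' "0x${1##*OST}"
}

# Log line copied from the dmesg output quoted in the thread.
line='Lustre: cmd=cf003 0:lfs-OST0026-osc 1:lfs-OST0026_UUID 2:36.102.29.4@o2ib'
target_nid=$(cfg_target_nid "$line")
target=${target_nid% *}
nid=${target_nid#* }
echo "target=$target index=$(ost_index "$target") nid=$nid"
```

For the line from the thread this prints `target=lfs-OST0026 index=38 nid=36.102.29.4@o2ib`, agreeing with the "Index: 38" reported by tunefs.lustre for that disk.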
Andreas Dilger
2008-Apr-23 06:42 UTC
[Lustre-discuss] Added Dual-homed OSS; ethernet clients confused
On Apr 22, 2008 18:08 -0600, Chris Worley wrote:
> The error specifically complains about the first OST/disk on the new
> OSS, OST0026. It's tunefs.lustre output was:
>
> > On the OSS in question, for each OST, I did:
> >
> > # tunefs.lustre --writeconf --ost
> > --mgsnode="36.102.29.1@o2ib0,36.101.29.1@tcp0" --fsname=lfs --param
> > sys.timeout=40 --param lov.stripesize=2M /dev/sdl
>
> > Lustre: cmd=cf003 0:lfs-OST0026-osc 1:lfs-OST0026_UUID 2:36.102.29.4@o2ib
> > LustreError: 15c-8: MGC36.101.29.1@tcp: The configuration from log
> > 'lfs-client' failed (-2). This may be the result of communication
> > errors between this node and the MGS, a bad configuration, or other
> > errors. See the syslog for more information.

The problem is that the NID for the new OST is the IPoIB address, and this is what the TCP client is trying to connect to. If you specify the TCP NID first this may help. Also note that the client does not get the config from the OSTs, but rather the MGS, so you need to do a --writeconf on there.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
Chris Worley
2008-Apr-23 14:10 UTC
[Lustre-discuss] Added Dual-homed OSS; ethernet clients confused
On Wed, Apr 23, 2008 at 12:42 AM, Andreas Dilger <adilger at sun.com> wrote:
> The problem is that the NID for the new OST is the IPoIB address, and this
> is what the TCP client is trying to connect to. If you specify the TCP
> NID first this may help. Also note that the client does not get the
> config from the OSTs, but rather the MGS, so you need to do a --write-conf
> on there.

This is confusing as the man page for "tunefs.lustre" wants a device name at the end of the command... and the device is on another OSS... "/dev/sda" on the MGS is a totally different drive. Can I use the label?

Thanks,

Chris
Chris Worley
2008-Apr-23 14:48 UTC
[Lustre-discuss] Added Dual-homed OSS; ethernet clients confused
On Wed, Apr 23, 2008 at 8:10 AM, Chris Worley <worleys at gmail.com> wrote:
> On Wed, Apr 23, 2008 at 12:42 AM, Andreas Dilger <adilger at sun.com> wrote:
> > The problem is that the NID for the new OST is the IPoIB address, and this
> > is what the TCP client is trying to connect to. If you specify the TCP
> > NID first this may help. Also note that the client does not get the
> > config from the OSTs, but rather the MGS, so you need to do a --write-conf
> > on there.
>
> This is confusing as the man page for "tunefs.lustre" wants a device
> name at the end of the command... and the device is on another OSS...
> "/dev/sda" on the MGS is a totally different drive. Can I use the
> label?

Another point of confusion with tunefs.lustre. When I add, as you suggest (although I'm still doing this on the OSS, as I don't know how to specify /dev/sd? on another node from the MGS), the Ethernet IP for both tcp and o2ib (and have tcp first), as in:

    --mgsnode="36.101.29.1@tcp0,36.101.29.1@o2ib0"

tunefs.lustre adds this, rather than replacing it, so now I get:

    ...
    Parameters: mgsnode=36.102.29.1@o2ib,36.101.29.1@tcp sys.timeout=40 lov.stripesize=2M mgsnode=36.102.29.1@o2ib,36.101.29.1@tcp sys.timeout=40 lov.stripesize=2M mgsnode=36.101.29.1@tcp,36.101.29.1@o2ib
    ...

The initial setting, yesterday's change, and today's change are in there... I have no clue which takes precedence.

Chris
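Editor's sketch, not part of the original post: accumulated parameters like the ones above are easy to miss in a long "Parameters:" line. A small text filter can list every key that appears more than once. The function name `dup_param_keys` is hypothetical, and the filter only inspects text pasted from tunefs.lustre output; it does not read or change the disk.

```shell
#!/bin/sh
# Print each parameter key that occurs more than once in a
# "Parameters:" line copied from tunefs.lustre output.
# Steps: split on spaces, keep only key=value tokens,
# keep the key, then report keys seen more than once.
dup_param_keys() {
    printf '%s\n' "$1" |
        tr ' ' '\n' |
        grep '=' |
        cut -d= -f1 |
        sort |
        uniq -d
}

# Shortened example based on the output quoted in the post.
params='Parameters: mgsnode=36.102.29.1@o2ib,36.101.29.1@tcp sys.timeout=40 lov.stripesize=2M mgsnode=36.101.29.1@tcp,36.101.29.1@o2ib'
dup_param_keys "$params"
```

For this shortened line the filter prints `mgsnode`, the only key given twice. On the full line from the post it would also report `sys.timeout` and `lov.stripesize`.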
D. Marc Stearman
2008-Apr-23 14:50 UTC
[Lustre-discuss] Added Dual-homed OSS; ethernet clients confused
On Apr 23, 2008, at 7:10 AM, Chris Worley wrote:
> On Wed, Apr 23, 2008 at 12:42 AM, Andreas Dilger <adilger at sun.com> wrote:
>> On Apr 22, 2008 18:08 -0600, Chris Worley wrote:
>>> The error specifically complains about the first OST/disk on the new
>>> OSS, OST0026. Its tunefs.lustre output was:
>>>
>>>> On the OSS in question, for each OST, I did:
>>>>
>>>> # tunefs.lustre --writeconf --ost \
>>>>     --mgsnode="36.102.29.1@o2ib0,36.101.29.1@tcp0" --fsname=lfs \
>>>>     --param sys.timeout=40 --param lov.stripesize=2M /dev/sdl
>>
>>>> Lustre: cmd=cf003 0:lfs-OST0026-osc 1:lfs-OST0026_UUID 2:36.102.29.4@o2ib
>>>> LustreError: 15c-8: MGC36.101.29.1@tcp: The configuration from log
>>>> 'lfs-client' failed (-2). This may be the result of communication
>>>> errors between this node and the MGS, a bad configuration, or other
>>>> errors. See the syslog for more information.
>>
>> The problem is that the NID for the new OST is the IPoIB address, and
>> this is what the TCP client is trying to connect to. If you specify
>> the TCP NID first this may help. Also note that the client does not
>> get the config from the OSTs, but rather the MGS, so you need to do a
>> --writeconf on there.
>
> This is confusing as the man page for "tunefs.lustre" wants a device
> name at the end of the command... and the device is on another OSS...
> "/dev/sda" on the MGS is a totally different drive. Can I use the
> label?

A --writeconf on the MGS will remove the file system config information, which forces the MGS to recreate it. Try this:

1. Stop all clients and servers (unload the modules on clients so that they don't have any devices lingering about).

2. Run your tunefs.lustre command on the new OST, and use --erase-params so you don't have duplicate parameters. (I noticed you had multiple mgsnode params.) Some params will be added multiple times, so if you want to change them, you need to erase all the params and start over.

3. Re-run a write-conf on your MGS node. Something like this:

       tunefs.lustre --writeconf --fsname=lfs --mdt --mgs \
           --param mdt.group_upcall=/usr/sbin/l_getgroups \
           --param lov.stripesize=2M \
           --param lov.stripecount=2 /dev/sda

4. Start the MGS/MDS node.

5. Start the OSTs (if you care about device ordering, start them one at a time in index order; we like to have our 'lctl dl' output list all the OSTs in index order).

6. Mount the clients.

Step 3 should re-create the client config information on the MGS, and when you start all your OSTs, the client configs will be updated with the proper NIDs.

When clients mount, they ask the MGS what devices (OSTs) make up the filesystem, and then try to connect. If the MGS is unaware of the tcp NID on the new OST (because it had the wrong modprobe.conf when it first registered), the clients will not know that the new OST has a NID on tcp0. The write-conf on the MGS will fix that.

-Marc

----
D. Marc Stearman
LC Lustre Administration Lead
marc at llnl.gov
925.423.9670
Pager: 1.888.203.0641
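The six steps in Marc's message above can be laid out as a dry-run shell sketch. It only prints commands and executes nothing. The NIDs, file system name, and `/dev/sda` come from the thread; the mount points, the `print_plan` helper name, and the use of `lustre_rmmod` to unload client modules are assumptions for illustration. Marc's site-specific `--param` settings are left out, so add your own original settings back where noted.

```shell
#!/bin/sh
# Dry-run sketch: prints the command sequence, executes nothing.
# Values marked "thread" come from the messages; the rest are placeholders.

MGS_NIDS="36.102.29.1@o2ib0,36.101.29.1@tcp0"   # thread
FSNAME="lfs"                                    # thread
MDT_DEV="/dev/sda"                              # thread (Marc's example device)
MDT_MNT="/mnt/mdt"                              # hypothetical mount point
OST_MNT_BASE="/mnt/ost"                         # hypothetical mount point prefix

# print_plan DEV...  : arguments are the OST devices on the new OSS, in index order.
print_plan() {
    echo "# 1. stop all clients and servers; on clients also unload the modules"
    echo "umount /lfs && lustre_rmmod"

    echo "# 2. on the new OSS: rewrite each OST's parameters from scratch"
    for dev in "$@"; do
        echo "tunefs.lustre --erase-params --writeconf --ost --mgsnode=\"$MGS_NIDS\" --fsname=$FSNAME --param sys.timeout=40 --param lov.stripesize=2M $dev"
    done

    echo "# 3. on the MGS/MDS: regenerate the config logs (add your original --param settings)"
    echo "tunefs.lustre --writeconf --fsname=$FSNAME --mdt --mgs $MDT_DEV"

    echo "# 4. start the MGS/MDS"
    echo "mount -t lustre $MDT_DEV $MDT_MNT"

    echo "# 5. start the OSTs one at a time, in index order (all OSSes, not only the new one)"
    i=0
    for dev in "$@"; do
        echo "mount -t lustre $dev $OST_MNT_BASE$i"
        i=$((i + 1))
    done

    echo "# 6. mount the clients"
    echo "mount -t lustre 36.101.29.1@tcp:/lfs /lfs"
}

print_plan /dev/sdb /dev/sdc
```

Printing the plan rather than running it keeps the ordering visible for review before anything touches a device; each printed command still has to be run on the right node by hand.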
Chris Worley
2008-Apr-23 15:51 UTC
[Lustre-discuss] Added Dual-homed OSS; ethernet clients confused
On Wed, Apr 23, 2008 at 8:50 AM, D. Marc Stearman <marc at llnl.gov> wrote:
> 3. Re-run a write-conf on your MGS node. Something like this:
> "tunefs.lustre --writeconf --fsname=lfs --mdt --mgs \
> --param mdt.group_upcall=/usr/sbin/l_getgroups \
> --param lov.stripesize=2M \
> --param lov.stripecount=2 /dev/sda"

Marc,

Thanks for all the detailed help.

My problem is w/ the above command: if I run that on the MGS, then the /dev/sda referenced is the /dev/sda on the MGS node, not the one on the OSS I need to change. How do I tell the MGS of a device on another system?

Thanks!

Chris
Chris Worley
2008-Apr-23 16:17 UTC
[Lustre-discuss] Added Dual-homed OSS; ethernet clients confused
Never mind the last question... the device in step #3 is the MDT.
Chris Worley
2008-Apr-23 16:30 UTC
[Lustre-discuss] Added Dual-homed OSS; ethernet clients confused
Sorry to require remedial instructions... it works now.

What I had misunderstood was that tunefs.lustre needed to be run on the MGS, against the MDT. I've rerun tunefs on the new OSS for all the OSTs w/ the original settings, and likewise on the MGS for the MDT, with all original settings, and the multi-homed setup works again.

Thanks to all that pitched in on this. Sorry I was so dense.

Chris
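Since the root cause was an OSS that first registered without its tcp NID, one precondition worth checking before any write-conf is that the server actually lists a NID on the network the clients use. A minimal sketch, assuming `lctl list_nids` prints one NID per line as shown earlier in the thread; `has_nid_on` is a hypothetical helper name, not part of Lustre.

```shell
#!/bin/sh
# Succeeds if stdin (one NID per line, as printed by `lctl list_nids`)
# contains at least one NID on the given LNET network type, e.g. tcp or o2ib.
# A trailing instance number (tcp0, o2ib0) is accepted as well.
# has_nid_on is a hypothetical helper, not a Lustre command.
has_nid_on() {
    grep -Eq "@$1[0-9]*\$"
}

# Sample input taken from the dual-homed OSS in this thread.
printf '%s\n' '36.102.29.4@o2ib' '36.101.29.4@tcp' | has_nid_on tcp \
    && echo "tcp NID present"
```

On a live OSS the same check would be `lctl list_nids | has_nid_on tcp`. It only confirms what the node has loaded locally; whether the MGS has recorded that NID is settled by the write-conf and by an Ethernet client mounting successfully.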