Hi all,
I'm running a small Lustre system (1.8.1.1): 1 MDS, 1 OSS, 2 clients.
Each node has Gigabit Ethernet and InfiniBand (mlx4_0) with IPoIB set
up. I'm trying to use the IB transport.
The /etc/modprobe.conf is the same for all nodes:
----------
alias eth0 e1000e
alias eth1 e1000e
alias eth2 8139too
alias scsi_hostadapter aic79xx
alias scsi_hostadapter1 ata_piix
alias ib0 ib_ipoib
alias ib1 ib_ipoib
options lnet ip2nets="o2ib0(ib0) 172.16.0.[1-255]"
-----------------------
MDS has IP: 192.168.104.4, IPoIB: 172.16.0.3
OSS has IP: 192.168.104.5, IPoIB: 172.16.0.4
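
To double-check which of the addresses above the ip2nets pattern covers,
here is a minimal Python sketch. It is a simplified model of the match,
not LNet's actual parser: it assumes only literal octets, "*", and a
single "[lo-hi]" range per octet, and the helper names are made up for
illustration.

```python
import re


def octet_matches(pattern: str, octet: int) -> bool:
    """Match one octet against '*', a literal number, or a '[lo-hi]' range."""
    if pattern == "*":
        return True
    m = re.fullmatch(r"\[(\d+)-(\d+)\]", pattern)
    if m:
        return int(m.group(1)) <= octet <= int(m.group(2))
    return pattern.isdigit() and int(pattern) == octet


def ip_matches(pattern: str, ip: str) -> bool:
    """Return True if every octet of `ip` matches the dotted pattern."""
    pat_parts = pattern.split(".")
    ip_parts = ip.split(".")
    if len(pat_parts) != 4 or len(ip_parts) != 4:
        return False
    return all(octet_matches(p, int(o)) for p, o in zip(pat_parts, ip_parts))


# Pattern taken from the "options lnet ip2nets=..." line in modprobe.conf.
PATTERN = "172.16.0.[1-255]"

for name, ip in [
    ("MDS IPoIB", "172.16.0.3"),
    ("OSS IPoIB", "172.16.0.4"),
    ("MDS eth", "192.168.104.4"),
    ("OSS eth", "192.168.104.5"),
]:
    print(name, ip, ip_matches(PATTERN, ip))
```

Under this simplified model, only the two IPoIB addresses fall inside
the pattern; the 192.168.104.x addresses do not.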
On the MDS, "/usr/sbin/mkfs.lustre" succeeded, but "mount -t lustre"
failed, and the OSS is unable to mount against the MDS. Below is some
log output.
Here is the /var/log/messages from MDS:
----------------------
Apr 19 09:59:18 i3 kernel: Lustre: Added LNI 172.16.0.3 at o2ib [8/64/0/0]
Apr 19 09:59:19 i3 kernel: Lustre: Lustre Client File System;
http://www.lustre.org/
Apr 19 09:59:19 i3 kernel: kjournald starting. Commit interval 5 seconds
Apr 19 09:59:19 i3 kernel: LDISKFS FS on dm-2, internal journal
Apr 19 09:59:19 i3 kernel: LDISKFS-fs: recovery complete.
Apr 19 09:59:19 i3 kernel: LDISKFS-fs: mounted filesystem with ordered
data mode.
Apr 19 09:59:19 i3 kernel: kjournald starting. Commit interval 5 seconds
Apr 19 09:59:19 i3 kernel: LDISKFS FS on dm-2, internal journal
Apr 19 09:59:19 i3 kernel: LDISKFS-fs: mounted filesystem with ordered
data mode.
Apr 19 09:59:19 i3 kernel: Lustre: MGS MGS started
Apr 19 09:59:19 i3 kernel: Lustre: MGC172.16.0.3@o2ib: Reactivating import
Apr 19 09:59:19 i3 kernel: Inside function: ldlm_cli_enqueue
Apr 19 09:59:19 i3 kernel: Inside function:ldlm_handle_enqueue
Apr 19 09:59:19 i3 kernel: Inside function ldlm_lock_enqueue
Apr 19 09:59:19 i3 kernel: Time spent in policy function to grab the lock5
Apr 19 09:59:19 i3 kernel: Time spent in ldlm_lock_enqueue 24
Apr 19 09:59:19 i3 kernel: Time spent in ptlrpc_queue_wait 269
Apr 19 09:59:19 i3 kernel: Inside function ldlm_lock_enqueue
Apr 19 09:59:19 i3 kernel: Inside function:ldlm_completion_ast
Apr 19 09:59:19 i3 kernel: Lustre: Enabling user_xattr
Apr 19 09:59:19 i3 kernel: Lustre: MDT lustre-MDT0000 now serving
lustre-MDT0000_UUID
(lustre-MDT0000/d92432e1-1f9a-b963-44cf-c7d529e44575) with recovery
enabled
Apr 19 09:59:19 i3 kernel: Lustre:
24287:0:(lproc_mds.c:271:lprocfs_wr_group_upcall()) lustre-MDT0000:
group upcall set to /usr/sbin/l_getgroups
Apr 19 09:59:19 i3 kernel: Lustre: lustre-MDT0000.mdt: set parameter
group_upcall=/usr/sbin/l_getgroups
Apr 19 09:59:19 i3 kernel: LustreError:
24287:0:(events.c:460:ptlrpc_uuid_to_peer()) No NID found for
192.168.104.5@tcp
Apr 19 09:59:19 i3 kernel: LustreError:
24287:0:(client.c:69:ptlrpc_uuid_to_connection()) cannot find peer
192.168.104.5@tcp!
Apr 19 09:59:19 i3 kernel: LustreError:
24287:0:(ldlm_lib.c:329:client_obd_setup()) can't add initial
connection
Apr 19 09:59:19 i3 kernel: LustreError:
24287:0:(obd_config.c:370:class_setup()) setup lustre-OST0000-osc
failed (-2)
Apr 19 09:59:19 i3 kernel: LustreError:
24287:0:(obd_config.c:1197:class_config_llog_handler()) Err -2 on cfg
command:
Apr 19 09:59:19 i3 kernel: Lustre: cmd=cf003 0:lustre-OST0000-osc
1:lustre-OST0000_UUID 2:192.168.104.5@tcp
Apr 19 09:59:19 i3 kernel: LustreError: 15c-8: MGC172.16.0.3@o2ib: The
configuration from log 'lustre-MDT0000' failed (-2). This may be the
result of communication errors between this node and the MGS, a bad
configuration, or other errors. See the syslog for more information.
Apr 19 09:59:19 i3 kernel: LustreError:
23990:0:(obd_mount.c:1114:server_start_targets()) failed to start
server lustre-MDT0000: -2
Apr 19 09:59:19 i3 kernel: LustreError:
23990:0:(obd_mount.c:1629:server_fill_super()) Unable to start
targets: -2
Apr 19 09:59:19 i3 kernel: Lustre: Failing over lustre-MDT0000
Apr 19 09:59:19 i3 kernel: Lustre: *** setting obd lustre-MDT0000
device 'dm-2' read-only ***
Apr 19 09:59:19 i3 kernel: Turning device dm-2 (0xfd00002) read-only
Apr 19 09:59:19 i3 kernel: Lustre: Failing over lustre-mdtlov
Apr 19 09:59:19 i3 kernel: Lustre: lustre-MDT0000: shutting down for
failover; client state will be preserved.
Apr 19 09:59:19 i3 kernel: Lustre: MDT lustre-MDT0000 has stopped.
Apr 19 09:59:19 i3 kernel: LustreError:
23990:0:(ldlm_request.c:1074:ldlm_cli_cancel_req()) Got rc -108 from
cancel RPC: canceling anyway
Apr 19 09:59:19 i3 kernel: LustreError:
23990:0:(ldlm_request.c:1579:ldlm_cli_cancel_list())
ldlm_cli_cancel_list: -108
Apr 19 09:59:19 i3 kernel: Lustre: MGS has stopped.
Apr 19 09:59:19 i3 kernel: Removing read-only on unknown block (0xfd00002)
Apr 19 09:59:19 i3 kernel: Lustre: server umount lustre-MDT0000 complete
Apr 19 09:59:19 i3 kernel: LustreError:
23990:0:(obd_mount.c:1997:lustre_fill_super()) Unable to mount (-2)
Apr 19 09:59:30 i3 kernel: Lustre:
24058:0:(lib-move.c:1782:lnet_parse_put()) Dropping PUT from
12345-172.16.0.4@o2ib portal 26 match 1333458968248321 offset 0
length 368: 2
Apr 19 09:59:36 i3 kernel: Lustre:
24060:0:(lib-move.c:1782:lnet_parse_put()) Dropping PUT from
12345-172.16.0.1@o2ib portal 26 match 1333458974539777 offset 0
length 368: 2
Apr 19 09:59:42 i3 kernel: Lustre:
24061:0:(lib-move.c:1782:lnet_parse_put()) Dropping PUT from
12345-172.16.0.2@o2ib portal 26 match 1333458980831233 offset 0
length 368: 2
-------------------
Here is /var/log/messages from OSS:
----------------
Apr 19 09:59:30 i4 kernel: kjournald starting. Commit interval 5 seconds
Apr 19 09:59:30 i4 kernel: LDISKFS FS on dm-2, internal journal
Apr 19 09:59:30 i4 kernel: LDISKFS-fs: mounted filesystem with ordered
data mode.
Apr 19 09:59:30 i4 restorecond: Will not restore a file with more than
one hard link (/etc/resolv.conf) Invalid argument
Apr 19 09:59:30 i4 restorecond: Will not restore a file with more than
one hard link (/etc/resolv.conf) Invalid argument
Apr 19 09:59:30 i4 kernel: Lustre: OBD class driver, http://www.lustre.org/
Apr 19 09:59:30 i4 kernel: Lustre: Lustre Version: 1.8.1.1
Apr 19 09:59:30 i4 kernel: Lustre: Build Version:
1.8.1.1-20091009095116-PRISTINE-2.6.18-128.7.1.el5-lustre.1.8.1.1smp-cust
Apr 19 09:59:30 i4 kernel: Lustre: Listener bound to ib0:172.16.0.4:987:mlx4_0
Apr 19 09:59:30 i4 kernel: Lustre: Register global MR array, MR size:
0xffffffffffffffff, array size: 1
Apr 19 09:59:30 i4 kernel: Lustre: Added LNI 172.16.0.4@o2ib [8/64/0/0]
Apr 19 09:59:30 i4 kernel: Lustre: Lustre Client File System;
http://www.lustre.org/
Apr 19 09:59:30 i4 kernel: kjournald starting. Commit interval 5 seconds
Apr 19 09:59:30 i4 kernel: LDISKFS FS on dm-2, internal journal
Apr 19 09:59:30 i4 kernel: LDISKFS-fs: mounted filesystem with ordered
data mode.
Apr 19 09:59:30 i4 kernel: kjournald starting. Commit interval 5 seconds
Apr 19 09:59:30 i4 kernel: LDISKFS FS on dm-2, internal journal
Apr 19 09:59:30 i4 kernel: LDISKFS-fs: mounted filesystem with ordered
data mode.
Apr 19 09:59:30 i4 kernel: LDISKFS-fs: file extents enabled
Apr 19 09:59:30 i4 kernel: LDISKFS-fs: mballoc enabled
Apr 19 09:59:35 i4 kernel: Lustre:
6853:0:(client.c:1383:ptlrpc_expire_one_request()) @@@ Request
x1333458968248321 sent from MGC172.16.0.3@o2ib to NID 172.16.0.3@o2ib
5s ago has timed out (limit 5s).
Apr 19 09:59:35 i4 kernel: req@ffff8100bfc31400 x1333458968248321/t0
o250->MGS@MGC172.16.0.3@o2ib_0:26/25 lens 368/584 e 0 to 1 dl
1271685575 ref 1 fl Rpc:N/0/0 rc 0/0
Apr 19 09:59:35 i4 kernel: LustreError:
6775:0:(obd_mount.c:1085:server_start_targets()) Required registration
failed for lustre-OSTffff: -5
Apr 19 09:59:35 i4 kernel: LustreError: 15f-b: Communication error
with the MGS. Is the MGS running?
Apr 19 09:59:35 i4 kernel: LustreError:
6775:0:(obd_mount.c:1629:server_fill_super()) Unable to start targets:
-5
Apr 19 09:59:35 i4 kernel: LustreError:
6775:0:(obd_mount.c:1412:server_put_super()) no obd lustre-OSTffff
Apr 19 09:59:35 i4 kernel: LustreError:
6775:0:(obd_mount.c:136:server_deregister_mount()) lustre-OSTffff not
registered
Apr 19 09:59:35 i4 kernel: LDISKFS-fs: mballoc: 0 blocks 0 reqs (0 success)
Apr 19 09:59:35 i4 kernel: LDISKFS-fs: mballoc: 0 extents scanned, 0
goal hits, 0 2^N hits, 0 breaks, 0 lost
Apr 19 09:59:35 i4 kernel: LDISKFS-fs: mballoc: 0 generated and it took 0
Apr 19 09:59:35 i4 kernel: LDISKFS-fs: mballoc: 0 preallocated, 0 discarded
Apr 19 09:59:35 i4 kernel: Lustre: server umount lustre-OSTffff complete
Apr 19 09:59:35 i4 kernel: LustreError:
6775:0:(obd_mount.c:1997:lustre_fill_super()) Unable to mount (-5)
-------------------------
What's wrong with my config? Thanks!
-Neutron