Mc Carthy, Fergal
2006-May-19 07:36 UTC
[Lustre-discuss] Difficulties with initial install of Lustre
Adam,
I suspect that your network settings are misconfigured.
Do you have a line like:
options lnet networks=tcp0(ethX)
in your /etc/modprobe.conf, where ethX is the Ethernet device associated
with your hostname of cf2?
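For anyone repeating this check on their own node, the following is a minimal Python sketch that pulls the network/interface pairs out of modprobe.conf-style text. Only the "options lnet networks=tcp0(ethX)" line format comes from the message above; the function name lnet_networks and the regular expressions are illustrative assumptions and not part of Lustre.

```python
import re

# Illustrative patterns (an assumption, not the kernel's own parser).
# They match lines such as: options lnet networks=tcp0(eth0)
_LNET_RE = re.compile(r"^\s*options\s+lnet\s+.*?\bnetworks=(\S+)")
_NET_RE = re.compile(r"(\w+)\((\w+)\)")


def lnet_networks(modprobe_text):
    """Return (network, interface) pairs found on 'options lnet' lines."""
    pairs = []
    for line in modprobe_text.splitlines():
        line = line.split("#", 1)[0]  # ignore comments
        match = _LNET_RE.match(line)
        if match:
            # One networks= value may list several net(interface) entries.
            pairs.extend(_NET_RE.findall(match.group(1)))
    return pairs
```

An empty list from lnet_networks(open("/etc/modprobe.conf").read()) would suggest the option is missing, in which case LNET is left to pick an interface by itself.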
Also, if this is a Red Hat system, make sure that your hostname is not
listed on the localhost (127.0.0.1) line in /etc/hosts.
I suggest this because I see connection attempts to 127.0.0.1 in the
logs you sent.
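The /etc/hosts check can be sketched in Python as below. This is a rough illustration only: the helper names loopback_names and hostname_on_loopback are made up for this note, and the sketch assumes the usual hosts layout of an address followed by one or more names.

```python
def loopback_names(hosts_text):
    """Return every name that the hosts-file text maps to a 127.x.x.x address."""
    names = set()
    for line in hosts_text.splitlines():
        fields = line.split("#", 1)[0].split()  # strip comments, then tokenize
        if len(fields) >= 2 and fields[0].startswith("127."):
            names.update(fields[1:])
    return names


def hostname_on_loopback(hosts_text, hostname):
    """True if the given hostname appears on a loopback line."""
    return hostname in loopback_names(hosts_text)
```

If hostname_on_loopback(open("/etc/hosts").read(), "cf2") returns True, the node name resolves to 127.0.0.1, which would be consistent with the loopback connection attempts seen in the logs.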
Fergal.
--
Fergal.McCarthy@HP.com
-----Original Message-----
From: lustre-discuss-bounces@clusterfs.com [mailto:lustre-discuss-bounces@clusterfs.com] On Behalf Of Adam D'Auria
Sent: 13 March 2006 09:08
To: lustre-discuss@clusterfs.com
Subject: [Lustre-discuss] Difficulties with initial install of Lustre
Greetings all,
I've read through the documentation and the email list archive for the last
few months but I am still having trouble getting Lustre running on a single
system - for test purposes only.
I have installed the RHEL 2.6.9-22.0.2 i686 kernel and associated modules
for Lustre 1.4.6. I have created an xml file based on the Single Node
Client, MDS and Two OSTs in the docs. I could not find the utilities for
'lwizard' or the lustre-lite-utils with a matching version number (1.4.6).
Situation:
When I execute 'lconf --reformat cf2.xml', all seems to go well until the
first NETWORK: line. The line before it in the example shows an 'llite'
module being loaded. I do not show that module as being loaded - I assume
it is part of the Lustre-lite utilities which, I hope, are not necessary
and that is why I couldn't locate them.
On the NETWORK line itself, mine differs from that shown in the
documentation at the end of the line:
Docs: 'NETWORK: NET_host_tcp NET_host_tcp_UUID tcp host 988'
Mine: 'NETWORK: NET_cf2_tcp NET_cf2_tcp_UUID tcp cf2'
The next few lines in the docs show OSD and OST messages before showing a
MDSDEV: line. I do not have those in my output and suspect that this is a
major clue as to my troubles. I include the console error output below as
well as my lmc commands that I used to make my xml file.
Note: lconf does not finish, it keeps running and generating console errors
- I assume because it cannot complete the connection it is attempting.
All help and suggestions are appreciated.
Adam
------------------------ make_xml ----------------------------------
#!/bin/bash
lmc --add node --node cf2 --output cf2.xml
lmc --add net --node cf2 --nid cf2 --nettype tcp --merge cf2.xml
lmc --add net --node client --nid \* --nettype tcp --merge cf2.xml
lmc --add mds --node cf2 --mds mds1 --fstype ldiskfs --dev /dev/hdb1 \
    --journal_size 400 --merge cf2.xml
lmc --add lov --lov lov1 --mds mds1 --stripe_sz 1048576 --stripe_cnt 1 \
    --stripe_pattern 0 --merge cf2.xml
lmc --add ost --node cf2 --ost ost1 --lov lov1 --fstype ldiskfs \
    --dev /dev/hdb2 --merge cf2.xml
lmc --add ost --node cf2 --ost ost2 --lov lov1 --fstype ldiskfs \
    --dev /dev/hdb3 --merge cf2.xml
lmc --add mtpt --node cf2 --mds mds1 --lov lov1 --path /mnt/lustre \
    --clientoptions async --merge cf2.xml
------------------------- Output from 'lconf --reformat cf2.xml' --------------------
loading module: libcfs srcdir None devdir libcfs
loading module: lnet srcdir None devdir lnet
loading module: ksocklnd srcdir None devdir klnds/socklnd
loading module: lvfs srcdir None devdir lvfs
loading module: obdclass srcdir None devdir obdclass
loading module: ptlrpc srcdir None devdir ptlrpc
loading module: mdc srcdir None devdir mdc
loading module: osc srcdir None devdir osc
loading module: lov srcdir None devdir lov
loading module: mds srcdir None devdir mds
loading module: ldiskfs srcdir None devdir ldiskfs
loading module: fsfilt_ldiskfs srcdir None devdir lvfs
NETWORK: NET_cf2_tcp NET_cf2_tcp_UUID tcp cf2
MDSDEV: mds1 mds1_UUID /dev/hdb1 ldiskfs no
recording clients for filesystem: FS_fsname_UUID
Recording log mds1 on mds1
LOV: lov_mds1 c05a1_lov_mds1_79ddc06aa5 mds1_UUID 1 1048576 0 0 [u'ost1_UUID'] mds1
OSC: OSC_cf2_ost1_mds1 c05a1_lov_mds1_79ddc06aa5 ost1_UUID
End recording log mds1 on mds1
MDSDEV: mds1 mds1_UUID /dev/hdb1 ldiskfs 0 no
MDS mount options: errors=remount-ro
----------------------- output from /var/log/messages -------------------------
Mar 13 09:41:15 cf2 kernel: Lustre: 17655:0:(module.c:381:init_libcfs_module()) maximum lustre stack 8192
Mar 13 09:41:15 cf2 kernel: Lustre: OBD class driver Build Version: 1.4.6-19691231190000-PRISTINE-.tmp.lbuild.lbuild-v1_4_6_RC3-2.6-rhel4-i686.lbuild.BUILD.lustre-kernel-2.6.9.lustre.linux-2.6.9-22.0.2.EL_lustre.1.4.6smp, info@clusterfs.com
Mar 13 09:41:15 cf2 kernel: Lustre: Added LNI 217.117.230.93@tcp [8/256]
Mar 13 09:41:15 cf2 kernel: Lustre: Accept secure, port 988
Mar 13 09:41:21 cf2 kernel: kjournald starting. Commit interval 5 seconds
Mar 13 09:41:21 cf2 kernel: LDISKFS FS on hdb1, internal journal
Mar 13 09:41:21 cf2 kernel: LDISKFS-fs: mounted filesystem with ordered data mode.
Mar 13 09:41:21 cf2 kernel: Lustre: 17709:0:(mds_fs.c:239:mds_init_server_data()) mds1: initializing new last_rcvd
Mar 13 09:41:21 cf2 kernel: Lustre: MDT mds1 now serving /dev/hdb1 (bec9544b-28f5-415d-a592-0d467eeb07d0) with recovery enabled
Mar 13 09:41:23 cf2 kernel: Lustre: MDT mds1 has stopped.
Mar 13 09:41:23 cf2 kernel: kjournald starting. Commit interval 5 seconds
Mar 13 09:41:23 cf2 kernel: LDISKFS FS on hdb1, internal journal
Mar 13 09:41:23 cf2 kernel: LDISKFS-fs: mounted filesystem with ordered data mode.
Mar 13 09:41:23 cf2 kernel: LustreError: Refusing connection from 127.0.0.1 for 127.0.0.1@tcp: No matching NI
Mar 13 09:41:23 cf2 kernel: LustreError: 17679:0:(socklnd_cb.c:1429:ksocknal_send_hello()) Error -32 sending HELLO payload (1) to 127.0.0.1/988
Mar 13 09:41:23 cf2 kernel: LustreError: Unexpected error -32 connecting to 127.0.0.1@tcp at host 127.0.0.1 on port 988
Mar 13 09:41:23 cf2 kernel: LustreError: 17679:0:(socklnd_cb.c:396:ksocknal_txlist_done()) Deleting packet type 1 len 240 217.117.230.93@tcp->127.0.0.1@tcp
Mar 13 09:41:23 cf2 kernel: LustreError: 17679:0:(events.c:54:request_out_callback()) @@@ type 4, status -5 req@df800600 x1/t0 o8->ost1_UUID@ex-ost_UUID:6 lens 240/272 ref 2 fl Rpc:/0/0 rc 0/0
Mar 13 09:41:23 cf2 kernel: LustreError: 17859:0:(client.c:951:ptlrpc_expire_one_request()) @@@ timeout (sent at 1142239283, 0s ago) req@df800600 x1/t0 o8->ost1_UUID@ex-ost_UUID:6 lens 240/272 ref 1 fl Rpc:/0/0 rc 0/0
Mar 13 09:41:31 cf2 kernel: LustreError: 17830:0:(mds_lov.c:592:mds_lov_start_synchronize()) mds1: error starting mds_lov_synchronize: -4
Mar 13 09:41:31 cf2 kernel: Lustre: MDT mds1 now serving /dev/hdb1 (bec9544b-28f5-415d-a592-0d467eeb07d0) with recovery enabled
Mar 13 09:41:31 cf2 kernel: LustreError: 17830:0:(genops.c:1108:ping_evictor_start()) Cannot start ping evictor thread: -4
Mar 13 09:41:48 cf2 kernel: LustreError: Refusing connection from 127.0.0.1 for 127.0.0.1@tcp: No matching NI
Mar 13 09:41:48 cf2 kernel: LustreError: 17680:0:(socklnd_cb.c:1429:ksocknal_send_hello()) Error -32 sending HELLO payload (1) to 127.0.0.1/988
Mar 13 09:41:48 cf2 kernel: LustreError: Unexpected error -32 connecting to 127.0.0.1@tcp at host 127.0.0.1 on port 988
Mar 13 09:41:48 cf2 kernel: LustreError: 17680:0:(socklnd_cb.c:396:ksocknal_txlist_done()) Deleting packet type 1 len 240 217.117.230.93@tcp->127.0.0.1@tcp
Mar 13 09:41:48 cf2 kernel: LustreError: 17680:0:(events.c:54:request_out_callback()) @@@ type 4, status -5 req@de8f5400 x3/t0 o8->ost1_UUID@ex-ost_UUID:6 lens 240/272 ref 2 fl Rpc:/0/0 rc 0/0
Mar 13 09:41:48 cf2 kernel: LustreError: 17859:0:(client.c:951:ptlrpc_expire_one_request()) @@@ timeout (sent at 1142239308, 0s ago) req@de8f5400 x3/t0 o8->ost1_UUID@ex-ost_UUID:6 lens 240/272 ref 1 fl Rpc:/0/0 rc 0/0
Mar 13 09:42:00 cf2 kernel: Lustre: MDT mds1 has stopped.
Mar 13 09:42:00 cf2 kernel: Lustre: Acceptor stopping
Mar 13 09:42:01 cf2 kernel: Lustre: Removed LNI 217.117.230.93@tcp
Mar 13 09:42:01 cf2 kernel: LustreError: 17909:0:(class_obd.c:783:cleanup_obdclass()) obd mem max: 3767487 leaked: 8
Mar 13 09:42:01 cf2 kernel: LustreError: 17917:0:(lvfs_linux.c:506:lvfs_linux_exit()) obd mem max: 3767487 leaked: 8
<snip> errors continue to repeat
-------------------- Traceback when I do Ctrl-C after approx 15 minutes ----------------
Traceback (most recent call last):
  File "/usr/sbin/lconf", line 2827, in ?
    main()
  File "/usr/sbin/lconf", line 2820, in main
    doHost(lustreDB, node_list)
  File "/usr/sbin/lconf", line 2264, in doHost
    for_each_profile(node_db, prof_list, doSetup)
  File "/usr/sbin/lconf", line 2044, in for_each_profile
    operation(services)
  File "/usr/sbin/lconf", line 2064, in doSetup
    n.prepare()
  File "/usr/sbin/lconf", line 1321, in prepare
    setup ="%s %s %s %s %s" %(blkdev, self.fstype, self.name,
  File "/usr/sbin/lconf", line 397, in newdev
    self.setup(name, setup)
  File "/usr/sbin/lconf", line 376, in setup
    self.run(cmds)
  File "/usr/sbin/lconf", line 278, in run
    ready = select.select([outfd,errfd],[],[]) # Wait for input
KeyboardInterrupt
-------------------- devices -----------------------
   Device Boot      Start         End      Blocks   Id  System
/dev/hdb1               1        1939      977224+  83  Linux
/dev/hdb2            1940       21316     9766008   83  Linux
/dev/hdb3           21317       40693     9766008   83  Linux
------------------- end of message -----------------------
Adam D'Auria
2006-May-19 07:36 UTC
[Lustre-discuss] Difficulties with initial install of Lustre
Thank you, Fergal.
I have made the following changes to my system and all seems to work now:
Added line to /etc/hosts:
    192.168.1.20 cftest #IP address for eth0
Changed XML file, replacing all instances of 'cf2' with 'cftest', and saved as cftest.xml
Added 'options lnet networks=tcp0(eth0)' to /etc/modprobe.conf
Executed 'lconf -d --node cf2 cf2.xml'
Executed 'lconf --reformat --node cftest --verbose cftest.xml'
Much more output, I now see the OST and OSD lines and it formatted and mounted the cluster. I have run some of the tests and they are returning proper results. I shall now proceed to set up a multiple machine cluster for more thorough testing.
Thank you again for your assistance.
Adam
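One way to confirm that a fix like this has taken hold is to scan the kernel log for LustreError lines that still mention a loopback NID such as 127.0.0.1@tcp. The Python sketch below is only an illustration: the function name loopback_nid_lines is made up for this note, and it assumes each log entry sits on a single line, as it would in /var/log/messages.

```python
import re

# NIDs such as "127.0.0.1@tcp" appear in the LustreError lines quoted in
# this thread; the pattern itself is an illustrative assumption.
_LOOPBACK_NID = re.compile(r"\b127\.\d+\.\d+\.\d+@tcp\b")


def loopback_nid_lines(log_text):
    """Return the LustreError lines that reference a loopback NID."""
    return [
        line
        for line in log_text.splitlines()
        if "LustreError" in line and _LOOPBACK_NID.search(line)
    ]
```

If loopback_nid_lines returns nothing for the entries logged after the lconf --reformat run, the node is probably no longer trying to reach itself through 127.0.0.1.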