Mc Carthy, Fergal
2006-May-19 07:36 UTC
[Lustre-discuss] Difficulties with initial install of Lustre
Adam,

I suspect that your network settings are misconfigured. Do you have a line like:

    options lnet networks=tcp0(ethX)

in your /etc/modprobe.conf, where ethX is the Ethernet device associated with your hostname of cf2?

Also, if this is a Red Hat system, make sure that your hostname is not associated with the localhost entry in /etc/hosts. I suggest this because I see connection attempt messages related to 127.0.0.1 in the logs you sent.

Fergal.

--
Fergal.McCarthy@HP.com

-----Original Message-----
From: lustre-discuss-bounces@clusterfs.com [mailto:lustre-discuss-bounces@clusterfs.com] On Behalf Of Adam D'Auria
Sent: 13 March 2006 09:08
To: lustre-discuss@clusterfs.com
Subject: [Lustre-discuss] Difficulties with initial install of Lustre
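The /etc/hosts check suggested above can be sketched in a few lines of Python. This is only an illustration: the function names and the two sample hosts files are hypothetical, and the address used for cf2 in the "good" sample is simply the one that appears in the "Added LNI" log line.

```python
import ipaddress


def loopback_names(hosts_text):
    """Return every name that a hosts-format text maps to a loopback address."""
    names = set()
    for raw in hosts_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        fields = line.split()
        try:
            addr = ipaddress.ip_address(fields[0])
        except ValueError:
            continue  # not an address line; ignore it
        if addr.is_loopback:
            names.update(fields[1:])
    return names


def hostname_on_loopback(hosts_text, hostname):
    """True if the hostname is listed on a loopback line."""
    return hostname in loopback_names(hosts_text)


# Hypothetical samples: the first shows the suspected misconfiguration.
SAMPLE_BAD = "127.0.0.1 cf2 localhost.localdomain localhost\n"
SAMPLE_GOOD = (
    "127.0.0.1 localhost.localdomain localhost\n"
    "217.117.230.93 cf2\n"
)

if __name__ == "__main__":
    print(hostname_on_loopback(SAMPLE_BAD, "cf2"))
    print(hostname_on_loopback(SAMPLE_GOOD, "cf2"))
```

On a real system the text would come from reading /etc/hosts and the name from the node's own hostname.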
Adam D'Auria
2006-May-19 07:36 UTC
[Lustre-discuss] Difficulties with initial install of Lustre
Thank you, Fergal.

I have made the following changes to my system and all seems to work now:

- Added a line to /etc/hosts:
      192.168.1.20 cftest #IP address for eth0
- Changed the XML file, replacing all instances of 'cf2' with 'cftest', and saved it as cftest.xml
- Added 'options lnet networks=tcp0(eth0)' to /etc/modprobe.conf
- Executed 'lconf -d --node cf2 cf2.xml'
- Executed 'lconf --reformat --node cftest --verbose cftest.xml'

There is much more output now: I see the OST and OSD lines, and it formatted and mounted the cluster. I have run some of the tests and they are returning proper results. I shall now proceed to set up a multiple-machine cluster for more thorough testing.

Thank you again for your assistance.

Adam
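As a rough sketch of how the modprobe.conf change could be sanity-checked, the following Python pulls the network name and interface out of an 'options lnet networks=...' line. The function name is hypothetical, and the parser only assumes the 'name(interface)' form quoted in this thread; it is not a full parser for every value LNET may accept.

```python
import re


def parse_lnet_networks(modprobe_text):
    """Return (network, [interfaces]) pairs found on 'options lnet' lines."""
    pairs = []
    for raw in modprobe_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # ignore comments
        m = re.match(r"options\s+lnet\s+.*?\bnetworks=(\S+)", line)
        if not m:
            continue
        value = m.group(1).strip("'\"")
        # Each entry looks like tcp0(eth0); the parenthesised part is optional here.
        for net, ifaces in re.findall(r"([A-Za-z0-9]+)(?:\(([^)]*)\))?", value):
            pairs.append((net, ifaces.split(",") if ifaces else []))
    return pairs


if __name__ == "__main__":
    print(parse_lnet_networks("options lnet networks=tcp0(eth0)\n"))
```

An empty result would suggest the line is missing, which is the situation the original question was asked about.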
Adam D'Auria
2006-May-19 07:36 UTC
[Lustre-discuss] Difficulties with initial install of Lustre
Greetings all,

I've read through the documentation and the email list archive for the last few months, but I am still having trouble getting Lustre running on a single system, for test purposes only.

I have installed the RHEL 2.6.9-22.0.2 i686 kernel and associated modules for Lustre 1.4.6. I have created an XML file based on the "Single Node Client, MDS and Two OSTs" example in the docs. I could not find the utilities for 'lwizard' or the lustre-lite-utils with a matching version number (1.4.6).

Situation: When I execute 'lconf --reformat cf2.xml', all seems to go well until the first NETWORK: line. The line before it in the example shows an 'llite' module being loaded. I do not see that module being loaded; I assume it is part of the Lustre-lite utilities, which I hope are not necessary, and that this is why I couldn't locate them.

On the NETWORK line itself, mine differs from that shown in the documentation at the end of the line:

Docs: 'NETWORK: NET_host_tcp NET_host_tcp _UUID tcp host 988'
Mine: 'NETWORK: NET_cf2_tcp NET_cf2_tcp_UUID tcp cf2'

The next few lines in the docs show OSD and OST messages before showing an MDSDEV: line. I do not have those in my output and suspect that this is a major clue as to my troubles.

I include the console error output below, as well as the lmc commands that I used to make my XML file.

Note: lconf does not finish; it keeps running and generating console errors, I assume because it cannot complete the connection it is attempting.

All help and suggestions are appreciated.
Adam

------------------------ make_xml ----------------------------------
#!/bin/bash
lmc --add node --node cf2 --output cf2.xml
lmc --add net --node cf2 --nid cf2 --nettype tcp --merge cf2.xml
lmc --add net --node client --nid \* --nettype tcp --merge cf2.xml
lmc --add mds --node cf2 --mds mds1 --fstype ldiskfs --dev /dev/hdb1 --journal_size 400 --merge cf2.xml
lmc --add lov --lov lov1 --mds mds1 --stripe_sz 1048576 --stripe_cnt 1 --stripe_pattern 0 --merge cf2.xml
lmc --add ost --node cf2 --ost ost1 --lov lov1 --fstype ldiskfs --dev /dev/hdb2 --merge cf2.xml
lmc --add ost --node cf2 --ost ost2 --lov lov1 --fstype ldiskfs --dev /dev/hdb3 --merge cf2.xml
lmc --add mtpt --node cf2 --mds mds1 --lov lov1 --path /mnt/lustre --clientoptions async --merge cf2.xml

------------------------- Output from 'lconf --reformat cf2.xml' --------------------
loading module: libcfs srcdir None devdir libcfs
loading module: lnet srcdir None devdir lnet
loading module: ksocklnd srcdir None devdir klnds/socklnd
loading module: lvfs srcdir None devdir lvfs
loading module: obdclass srcdir None devdir obdclass
loading module: ptlrpc srcdir None devdir ptlrpc
loading module: mdc srcdir None devdir mdc
loading module: osc srcdir None devdir osc
loading module: lov srcdir None devdir lov
loading module: mds srcdir None devdir mds
loading module: ldiskfs srcdir None devdir ldiskfs
loading module: fsfilt_ldiskfs srcdir None devdir lvfs
NETWORK: NET_cf2_tcp NET_cf2_tcp_UUID tcp cf2
MDSDEV: mds1 mds1_UUID /dev/hdb1 ldiskfs no
recording clients for filesystem: FS_fsname_UUID
Recording log mds1 on mds1
LOV: lov_mds1 c05a1_lov_mds1_79ddc06aa5 mds1_UUID 1 1048576 0 0 [u'ost1_UUID'] mds1
OSC: OSC_cf2_ost1_mds1 c05a1_lov_mds1_79ddc06aa5 ost1_UUID
End recording log mds1 on mds1
MDSDEV: mds1 mds1_UUID /dev/hdb1 ldiskfs 0 no
MDS mount options: errors=remount-ro

----------------------- output from /var/log/messages -------------------------
Mar 13 09:41:15 cf2 kernel: Lustre: 17655:0:(module.c:381:init_libcfs_module()) maximum lustre stack 8192
Mar 13 09:41:15 cf2 kernel: Lustre: OBD class driver Build Version: 1.4.6-19691231190000-PRISTINE-.tmp.lbuild.lbuild-v1_4_6_RC3-2.6-rhel4-i686.lbuild.BUILD.lustre-kernel-2.6.9.lustre.linux-2.6.9-22.0.2.EL_lustre.1.4.6smp, info@clusterfs.com
Mar 13 09:41:15 cf2 kernel: Lustre: Added LNI 217.117.230.93@tcp [8/256]
Mar 13 09:41:15 cf2 kernel: Lustre: Accept secure, port 988
Mar 13 09:41:21 cf2 kernel: kjournald starting. Commit interval 5 seconds
Mar 13 09:41:21 cf2 kernel: LDISKFS FS on hdb1, internal journal
Mar 13 09:41:21 cf2 kernel: LDISKFS-fs: mounted filesystem with ordered data mode.
Mar 13 09:41:21 cf2 kernel: Lustre: 17709:0:(mds_fs.c:239:mds_init_server_data()) mds1: initializing new last_rcvd
Mar 13 09:41:21 cf2 kernel: Lustre: MDT mds1 now serving /dev/hdb1 (bec9544b-28f5-415d-a592-0d467eeb07d0) with recovery enabled
Mar 13 09:41:23 cf2 kernel: Lustre: MDT mds1 has stopped.
Mar 13 09:41:23 cf2 kernel: kjournald starting. Commit interval 5 seconds
Mar 13 09:41:23 cf2 kernel: LDISKFS FS on hdb1, internal journal
Mar 13 09:41:23 cf2 kernel: LDISKFS-fs: mounted filesystem with ordered data mode.
Mar 13 09:41:23 cf2 kernel: LustreError: Refusing connection from 127.0.0.1 for 127.0.0.1@tcp: No matching NI
Mar 13 09:41:23 cf2 kernel: LustreError: 17679:0:(socklnd_cb.c:1429:ksocknal_send_hello()) Error -32 sending HELLO payload (1) to 127.0.0.1/988
Mar 13 09:41:23 cf2 kernel: LustreError: Unexpected error -32 connecting to 127.0.0.1@tcp at host 127.0.0.1 on port 988
Mar 13 09:41:23 cf2 kernel: LustreError: 17679:0:(socklnd_cb.c:396:ksocknal_txlist_done()) Deleting packet type 1 len 240 217.117.230.93@tcp->127.0.0.1@tcp
Mar 13 09:41:23 cf2 kernel: LustreError: 17679:0:(events.c:54:request_out_callback()) @@@ type 4, status -5 req@df800600 x1/t0 o8->ost1_UUID@ex-ost_UUID:6 lens 240/272 ref 2 fl Rpc:/0/0 rc 0/0
Mar 13 09:41:23 cf2 kernel: LustreError: 17859:0:(client.c:951:ptlrpc_expire_one_request()) @@@ timeout (sent at 1142239283, 0s ago) req@df800600 x1/t0 o8->ost1_UUID@ex-ost_UUID:6 lens 240/272 ref 1 fl Rpc:/0/0 rc 0/0
Mar 13 09:41:31 cf2 kernel: LustreError: 17830:0:(mds_lov.c:592:mds_lov_start_synchronize()) mds1: error starting mds_lov_synchronize: -4
Mar 13 09:41:31 cf2 kernel: Lustre: MDT mds1 now serving /dev/hdb1 (bec9544b-28f5-415d-a592-0d467eeb07d0) with recovery enabled
Mar 13 09:41:31 cf2 kernel: LustreError: 17830:0:(genops.c:1108:ping_evictor_start()) Cannot start ping evictor thread: -4
Mar 13 09:41:48 cf2 kernel: LustreError: Refusing connection from 127.0.0.1 for 127.0.0.1@tcp: No matching NI
Mar 13 09:41:48 cf2 kernel: LustreError: 17680:0:(socklnd_cb.c:1429:ksocknal_send_hello()) Error -32 sending HELLO payload (1) to 127.0.0.1/988
Mar 13 09:41:48 cf2 kernel: LustreError: Unexpected error -32 connecting to 127.0.0.1@tcp at host 127.0.0.1 on port 988
Mar 13 09:41:48 cf2 kernel: LustreError: 17680:0:(socklnd_cb.c:396:ksocknal_txlist_done()) Deleting packet type 1 len 240 217.117.230.93@tcp->127.0.0.1@tcp
Mar 13 09:41:48 cf2 kernel: LustreError: 17680:0:(events.c:54:request_out_callback()) @@@ type 4, status -5 req@de8f5400 x3/t0 o8->ost1_UUID@ex-ost_UUID:6 lens 240/272 ref 2 fl Rpc:/0/0 rc 0/0
Mar 13 09:41:48 cf2 kernel: LustreError: 17859:0:(client.c:951:ptlrpc_expire_one_request()) @@@ timeout (sent at 1142239308, 0s ago) req@de8f5400 x3/t0 o8->ost1_UUID@ex-ost_UUID:6 lens 240/272 ref 1 fl Rpc:/0/0 rc 0/0
Mar 13 09:42:00 cf2 kernel: Lustre: MDT mds1 has stopped.
Mar 13 09:42:00 cf2 kernel: Lustre: Acceptor stopping
Mar 13 09:42:01 cf2 kernel: Lustre: Removed LNI 217.117.230.93@tcp
Mar 13 09:42:01 cf2 kernel: LustreError: 17909:0:(class_obd.c:783:cleanup_obdclass()) obd mem max: 3767487 leaked: 8
Mar 13 09:42:01 cf2 kernel: LustreError: 17917:0:(lvfs_linux.c:506:lvfs_linux_exit()) obd mem max: 3767487 leaked: 8
<snip> errors continue to repeat

-------------------- Traceback when I do Ctrl-C after approx 15 minutes ----------------
Traceback (most recent call last):
  File "/usr/sbin/lconf", line 2827, in ?
    main()
  File "/usr/sbin/lconf", line 2820, in main
    doHost(lustreDB, node_list)
  File "/usr/sbin/lconf", line 2264, in doHost
    for_each_profile(node_db, prof_list, doSetup)
  File "/usr/sbin/lconf", line 2044, in for_each_profile
    operation(services)
  File "/usr/sbin/lconf", line 2064, in doSetup
    n.prepare()
  File "/usr/sbin/lconf", line 1321, in prepare
    setup ="%s %s %s %s %s" %(blkdev, self.fstype, self.name,
  File "/usr/sbin/lconf", line 397, in newdev
    self.setup(name, setup)
  File "/usr/sbin/lconf", line 376, in setup
    self.run(cmds)
  File "/usr/sbin/lconf", line 278, in run
    ready = select.select([outfd,errfd],[],[]) # Wait for input
KeyboardInterrupt

-------------------- devices -----------------------
   Device Boot      Start         End      Blocks   Id  System
/dev/hdb1               1        1939      977224+  83  Linux
/dev/hdb2            1940       21316     9766008   83  Linux
/dev/hdb3           21317       40693     9766008   83  Linux

------------------- end of message -----------------------
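The loopback symptom in the /var/log/messages output above can also be picked out mechanically. The following is a small sketch, with a hypothetical function name, that scans log text for the "Refusing connection ... No matching NI" message and reports the cases where the peer address is in the 127.x range. The message format is taken from the log in this post and is not guaranteed for other Lustre versions.

```python
import re

# Message format as it appears in the log posted above.
REFUSED = re.compile(
    r"Refusing connection from (\S+) for (\S+): No matching NI"
)


def refused_loopback(log_text):
    """Return (source address, requested NID) pairs for refused loopback peers."""
    hits = []
    for m in REFUSED.finditer(log_text):
        src, nid = m.groups()
        if src.startswith("127."):
            hits.append((src, nid))
    return hits


# One line copied from the log above.
SAMPLE = (
    "Mar 13 09:41:23 cf2 kernel: LustreError: Refusing connection "
    "from 127.0.0.1 for 127.0.0.1@tcp: No matching NI\n"
)

if __name__ == "__main__":
    print(refused_loopback(SAMPLE))
```

Any hit suggests that a node name is resolving to a loopback address rather than to the interface LNET is using.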