Verdi March
2007-Apr-20 05:05 UTC
[Lustre-discuss] Example "local" fails on node with two IP addresses
Hi,

I'm encountering a problem when starting the "local" example (one MDS, LOV, OST, and client, all on node "sun-n1-console").

# lmc -m test.xml --batch test.txt
# cat test.txt
--add node --node sun-n1-console
--add net --node sun-n1-console --nettype lnet --nid sun-n1-console@tcp
--add mds --node sun-n1-console --mds mds1 --fstype ldiskfs --dev /tmp/mds1-sun-n1-console --size 400000
--add lov --lov lov1 --mds mds1 --stripe_sz 1048576 --stripe_cnt 1 --stripe_pattern 0
--add ost --node sun-n1-console --lov lov1 --ost ost1-sun-n1-console --fstype ldiskfs --dev /tmp/ost1-sun-n1-console --size 400000
--add mtpt --node sun-n1-console --path /mnt/lustre --mds mds1 --lov lov1

The node has two Ethernet interfaces, eth0 and eth1, on separate subnets. I deploy all Lustre components on eth1 (IP: 192.168.123.45, hostname: sun-n1-console).

# cat /etc/hosts
127.0.0.1 localhost.localdomain localhost
xxx.yyy.zzz.ab public-host
192.168.123.45 sun-n1-console

When eth0 is down, the "local" example deploys successfully. Lustre fails to start only when eth0 is up (see attachment).

The error messages in /var/log/messages indicate that the MDS does not respond (see below).
I believe it's not caused by the firewall, because I've switched it off:

# iptables -L
Chain INPUT (policy ACCEPT)
target prot opt source destination

Chain FORWARD (policy ACCEPT)
target prot opt source destination

Chain OUTPUT (policy ACCEPT)
target prot opt source destination

And here are the error messages:

# tail /var/log/messages
Apr 20 17:37:35 sun-n1-console kernel: LustreError: 6840:0:(events.c:53:request_out_callback()) @@@ type 4, status -5 req@f7fe7e00 x22/t0 o8->ost1-sun-n1-console_UUID@sun-n1-console_UUID:6 lens 240/272 ref 2 fl Rpc:/0/0 rc 0/0
Apr 20 17:37:35 sun-n1-console kernel: LustreError: 6840:0:(client.c:947:ptlrpc_expire_one_request()) @@@ timeout (sent at 1177061855, 0s ago) req@f7fe7e00 x22/t0 o8->ost1-sun-n1-console_UUID@sun-n1-console_UUID:6 lens 240/272 ref 1 fl Rpc:/0/0 rc 0/0
Apr 20 17:37:35 sun-n1-console kernel: LustreError: 6840:0:(client.c:947:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
Apr 20 17:38:00 sun-n1-console kernel: LustreError: 6840:0:(events.c:53:request_out_callback()) @@@ type 4, status -5 req@ed133e00 x23/t0 o8->ost1-sun-n1-console_UUID@sun-n1-console_UUID:6 lens 240/272 ref 2 fl Rpc:/0/0 rc 0/0
Apr 20 17:38:25 sun-n1-console kernel: audit(1177061905.683:64): avc: denied { rawip_recv } for pid=6537 comm="socknal_cd03" saddr=192.168.123.45 src=1023 daddr=192.168.123.45 dest=988 netif=lo scontext=system_u:object_r:unlabeled_t tcontext=system_u:object_r:netif_lo_t tclass=netif
Apr 20 17:38:25 sun-n1-console kernel: audit(1177061905.884:65): avc: denied { rawip_recv } for saddr=192.168.123.45 src=1023 daddr=192.168.123.45 dest=988 netif=lo scontext=system_u:object_r:unlabeled_t tcontext=system_u:object_r:netif_lo_t tclass=netif
Apr 20 17:38:26 sun-n1-console kernel: audit(1177061906.286:66): avc: denied { rawip_recv } for saddr=192.168.123.45 src=1023 daddr=192.168.123.45 dest=988 netif=lo scontext=system_u:object_r:unlabeled_t tcontext=system_u:object_r:netif_lo_t tclass=netif
Apr 20 17:38:27 sun-n1-console kernel: audit(1177061907.090:67): avc: denied { rawip_recv } for saddr=192.168.123.45 src=1023 daddr=192.168.123.45 dest=988 netif=lo scontext=system_u:object_r:unlabeled_t tcontext=system_u:object_r:netif_lo_t tclass=netif
Apr 20 17:38:28 sun-n1-console kernel: audit(1177061908.698:68): avc: denied { rawip_recv } for saddr=192.168.123.45 src=1023 daddr=192.168.123.45 dest=988 netif=lo scontext=system_u:object_r:unlabeled_t tcontext=system_u:object_r:netif_lo_t tclass=netif
Apr 20 17:38:30 sun-n1-console kernel: LustreError: 6539:0:(acceptor.c:442:lnet_acceptor()) Error -11 reading connection request from 192.168.123.45
Apr 20 17:38:30 sun-n1-console kernel: audit(1177061910.683:69): avc: denied { rawip_send } for pid=6539 comm="acceptor_988" saddr=192.168.123.45 src=988 daddr=192.168.123.45 dest=1023 netif=lo scontext=system_u:object_r:unlabeled_t tcontext=system_u:object_r:netif_lo_t tclass=netif
Apr 20 17:38:30 sun-n1-console kernel: LustreError: 6537:0:(socklnd_cb.c:2160:ksocknal_recv_hello()) Error -104 reading HELLO from 192.168.123.45
Apr 20 17:38:30 sun-n1-console kernel: LustreError: Connection to 192.168.123.45@tcp at host 192.168.123.45 on port 988 was reset: is it running a compatible version of Lustre and is 192.168.123.45@tcp one of its NIDs?
Apr 20 17:38:50 sun-n1-console kernel: LustreError: 6840:0:(events.c:53:request_out_callback()) @@@ type 4, status -5 req@ec698e00 x25/t0 o8->ost1-sun-n1-console_UUID@sun-n1-console_UUID:6 lens 240/272 ref 2 fl Rpc:/0/0 rc 0/0
Apr 20 17:39:15 sun-n1-console kernel: LustreError: 6840:0:(events.c:53:request_out_callback()) @@@ type 4, status -5 req@e97c8c00 x26/t0 o8->ost1-sun-n1-console_UUID@sun-n1-console_UUID:6 lens 240/272 ref 2 fl Rpc:/0/0 rc 0/0

Any advice on how to make this simple example work?

Regards,
Verdi

-------------- next part --------------
[root@sun-n1-console tmp]# lconf --reformat --verbose hoho.xml
configuring for host: ['sun-n1-console']
setting /proc/sys/net/core/rmem_max to at least 16777216
setting /proc/sys/net/core/wmem_max to at least 16777216
Service: network NET_sun-n1-console_lnet NET_sun-n1-console_lnet_UUID
loading module: libcfs srcdir None devdir libcfs
+ /sbin/modprobe libcfs
loading module: lnet srcdir None devdir lnet
+ /sbin/modprobe lnet
+ /sbin/modprobe lnet
loading module: ksocklnd srcdir None devdir klnds/socklnd
+ /sbin/modprobe ksocklnd
Service: ldlm ldlm ldlm_UUID
loading module: lvfs srcdir None devdir lvfs
+ /sbin/modprobe lvfs
loading module: obdclass srcdir None devdir obdclass
+ /sbin/modprobe obdclass
loading module: ptlrpc srcdir None devdir ptlrpc
+ /sbin/modprobe ptlrpc
Service: osd OSD_ost1-sun-n1-console_sun-n1-console -n1-console_sun-n1-console_UUID
loading module: ost srcdir None devdir ost
+ /sbin/modprobe ost
loading module: ldiskfs srcdir None devdir ldiskfs
+ /sbin/modprobe ldiskfs
loading module: fsfilt_ldiskfs srcdir None devdir lvfs
+ /sbin/modprobe fsfilt_ldiskfs
loading module: obdfilter srcdir None devdir obdfilter
+ /sbin/modprobe obdfilter
Service: mdsdev MDD_mds1_sun-n1-console MDD_mds1_sun-n1-console_UUID
original inode_size 0
stripe_count 1 inode_size 512
loading module: mdc srcdir None devdir mdc
+ /sbin/modprobe mdc
loading module: osc srcdir None devdir osc
+ /sbin/modprobe osc
loading module: lov srcdir None devdir lov
+ /sbin/modprobe lov
loading module: mds srcdir None devdir mds
+ /sbin/modprobe mds
Service: mountpoint MNT_sun-n1-console MNT_sun-n1-console_UUID
get_lov_tgts failed, using get_refs
dbg LOV __init__: [(<__main__.OSC instance at 0xb7cd952c>, 0, 1, 1)] [u'ost1-sun-n1-console_UUID'] 1
loading module: llite srcdir None devdir llite
+ /sbin/modprobe llite
+ sysctl lnet/debug_path /tmp/lustre-log-sun-n1-console
+ /usr/sbin/lctl modules > /tmp/ogdb-sun-n1-console
Service: network NET_sun-n1-console_lnet NET_sun-n1-console_lnet_UUID
NETWORK: NET_sun-n1-console_lnet NET_sun-n1-console_lnet_UUID lnet sun-n1-console@tcp
Service: ldlm ldlm ldlm_UUID
Service: osd OSD_ost1-sun-n1-console_sun-n1-console -n1-console_sun-n1-console_UUID
OSD: ost1-sun-n1-console ost1-sun-n1-console_UUID obdfilter /tmp/ost1-sun-n1-console 400000 ldiskfs no 0 256
+ losetup /dev/loop0
+ losetup /dev/loop1
+ losetup /dev/loop2
+ losetup /dev/loop3
+ losetup /dev/loop4
+ losetup /dev/loop5
+ losetup /dev/loop6
+ losetup /dev/loop7
+ dd if=/dev/zero bs=1k count=0 seek=400000 of=/tmp/ost1-sun-n1-console
+ mkfs.ext2 -j -b 4096 -F -I 256 /tmp/ost1-sun-n1-console 100000
+ tune2fs -O dir_index /tmp/ost1-sun-n1-console
+ losetup /dev/loop0
+ losetup /dev/loop0 /tmp/ost1-sun-n1-console
+ dumpe2fs -f -h /dev/loop0
no external journal found for /dev/loop0
OST mount options: errors=remount-ro
+ /usr/sbin/lctl attach obdfilter ost1-sun-n1-console ost1-sun-n1-console_UUID quit
+ /usr/sbin/lctl cfg_device ost1-sun-n1-console setup /dev/loop0 ldiskfs f errors=remount-ro quit
+ /usr/sbin/lctl attach ost OSS OSS_UUID quit
+ /usr/sbin/lctl cfg_device OSS setup quit
Service: mdsdev MDD_mds1_sun-n1-console MDD_mds1_sun-n1-console_UUID
original inode_size 0
stripe_count 1 inode_size 512
MDSDEV: mds1 mds1_UUID /tmp/mds1-sun-n1-console ldiskfs no
+ losetup /dev/loop0
+ losetup /dev/loop1
+ losetup /dev/loop2
+ losetup /dev/loop3
+ losetup /dev/loop4
+ losetup /dev/loop5
+ losetup /dev/loop6
+ losetup /dev/loop7
+ dd if=/dev/zero bs=1k count=0 seek=400000 of=/tmp/mds1-sun-n1-console
+ mkfs.ext2 -j -b 4096 -F -i 4096 -I 512 /tmp/mds1-sun-n1-console 100000
+ tune2fs -O dir_index /tmp/mds1-sun-n1-console
+ losetup /dev/loop0
+ losetup /dev/loop1
+ losetup /dev/loop1 /tmp/mds1-sun-n1-console
+ /usr/sbin/lctl attach mds mds1 mds1_UUID quit
+ /usr/sbin/lctl cfg_device mds1 setup /dev/loop1 ldiskfs quit
recording clients for filesystem: FS_fsname_UUID
get_lov_tgts failed, using get_refs
dbg LOV __init__: [(<__main__.OSC instance at 0xb7cd988c>, 0, 1, 1)] [u'ost1-sun-n1-console_UUID'] 1
+ /usr/sbin/lctl device $mds1 probe clear_log mds1 quit
Recording log mds1 on mds1
dbg LOV prepare
dbg LOV prepare: [(<__main__.OSC instance at 0xb7cd988c>, 0, 1, 1)] [u'ost1-sun-n1-console_UUID']
LOV: lov_mds1 4300b_lov_mds1_fe6fd41018 mds1_UUID 1 1048576 0 0 [u'ost1-sun-n1-console_UUID'] mds1
+ /usr/sbin/lctl device $mds1 record mds1 attach lov lov_mds1 4300b_lov_mds1_fe6fd41018 lov_setup lov1_UUID 1 1048576 0 0 quit
OSC: OSC_sun-n1-console_ost1-sun-n1-console_mds1 4300b_lov_mds1_fe6fd41018 ost1-sun-n1-console_UUID
dbg CLIENT __prepare__: ost1-sun-n1-console_UUID [<__main__.Network instance at 0xb7cd9c6c>]
+ /usr/sbin/lctl device $mds1 record mds1 add_uuid sun-n1-console_UUID sun-n1-console@tcp
ost1-sun-n1-console_UUID active
+ /usr/sbin/lctl device $mds1 record mds1 attach osc OSC_sun-n1-console_ost1-sun-n1-console_mds1 4300b_lov_mds1_fe6fd41018 quit
+ /usr/sbin/lctl device $mds1 record mds1 cfg_device OSC_sun-n1-console_ost1-sun-n1-console_mds1 setup ost1-sun-n1-console_UUID sun-n1-console_UUID quit
+ /usr/sbin/lctl device $mds1 record mds1 cfg_device lov_mds1 lov_modify_tgts add lov_mds1 ost1-sun-n1-console_UUID 0 1 quit
+ /usr/sbin/lctl device $mds1 record mds1 mount_option mds1 lov_mds1 quit
End recording log mds1 on mds1
Recording log sun-n1-console on mds1
+ /usr/sbin/lconf -v --record --nomod --old_conf --record_log sun-n1-console --record_device mds1 --node sun-n1-console hoho.xml
record> configuring for host: ['sun-n1-console']
record> Checking XML modification time
record> + debugfs -c -R 'stat /LOGS' /tmp/mds1-sun-n1-console 2>&1 | grep mtime
record> Can not get mtime info of MDS LOGS directory
record> + /usr/sbin/lctl
record> device $mds1
record> probe
record> clear_log sun-n1-console
record> quit
record> Recording log sun-n1-console on mds1
record> Service: network NET_sun-n1-console_lnet NET_sun-n1-console_lnet_UUID
record> Service: ldlm ldlm ldlm_UUID
record> Service: osd OSD_ost1-sun-n1-console_sun-n1-console -n1-console_sun-n1-console_UUID
record> Service: mdsdev MDD_mds1_sun-n1-console MDD_mds1_sun-n1-console_UUID
record> original inode_size 0
record> stripe_count 1 inode_size 512
record> Service: mountpoint MNT_sun-n1-console MNT_sun-n1-console_UUID
record> get_lov_tgts failed, using get_refs
record> dbg LOV __init__: [(<__main__.OSC instance at 0xb7cf64cc>, 0, 1, 1)] [u'ost1-sun-n1-console_UUID'] 1
record> dbg LOV prepare
record> dbg LOV prepare: [(<__main__.OSC instance at 0xb7cf64cc>, 0, 1, 1)] [u'ost1-sun-n1-console_UUID']
record> LOV: lov1 028ec_lov1_fa9d4fa5b7 mds1_UUID 1 1048576 0 0 [u'ost1-sun-n1-console_UUID'] mds1
record> + /usr/sbin/lctl
record> device $mds1
record> record sun-n1-console
record>
record> attach lov lov1 028ec_lov1_fa9d4fa5b7
record> lov_setup lov1_UUID 1 1048576 0 0
record> quit
record> OSC: OSC_sun-n1-console_ost1-sun-n1-console_MNT_sun-n1-console 028ec_lov1_fa9d4fa5b7 ost1-sun-n1-console_UUID
record> dbg CLIENT __prepare__: ost1-sun-n1-console_UUID [<__main__.Network instance at 0xb7cf66cc>]
record> + /usr/sbin/lctl
record> device $mds1
record> record sun-n1-console
record>
record> add_uuid sun-n1-console_UUID sun-n1-console@tcp
record> ost1-sun-n1-console_UUID active
record> + /usr/sbin/lctl
record> device $mds1
record> record sun-n1-console
record>
record> attach osc OSC_sun-n1-console_ost1-sun-n1-console_MNT_sun-n1-console 028ec_lov1_fa9d4fa5b7
record> quit
record> + /usr/sbin/lctl
record> device $mds1
record> record sun-n1-console
record>
record> cfg_device OSC_sun-n1-console_ost1-sun-n1-console_MNT_sun-n1-console
record> setup ost1-sun-n1-console_UUID sun-n1-console_UUID
record> quit
record> + /usr/sbin/lctl
record> device $mds1
record> record sun-n1-console
record>
record> cfg_device lov1
record> lov_modify_tgts add lov1 ost1-sun-n1-console_UUID 0 1
record> quit
record> MDC: MDC_sun-n1-console_mds1_MNT_sun-n1-console 0cf7b_MNT_sun-n1-console_dd8b963906 mds1_UUID
record> dbg CLIENT __prepare__: mds1_UUID [<__main__.Network instance at 0xb7cf6a4c>]
record> + /usr/sbin/lctl
record> device $mds1
record> record sun-n1-console
record>
record> add_uuid sun-n1-console_UUID sun-n1-console@tcp
record> mds1_UUID active
record> + /usr/sbin/lctl
record> device $mds1
record> record sun-n1-console
record>
record> attach mdc MDC_sun-n1-console_mds1_MNT_sun-n1-console 0cf7b_MNT_sun-n1-console_dd8b963906
record> quit
record> + /usr/sbin/lctl
record> device $mds1
record> record sun-n1-console
record>
record> cfg_device MDC_sun-n1-console_mds1_MNT_sun-n1-console
record> setup mds1_UUID sun-n1-console_UUID
record> quit
record> MTPT: MNT_sun-n1-console MNT_sun-n1-console_UUID /mnt/lustre mds1_UUID lov1_UUID
record> + /usr/sbin/lctl
record> device $mds1
record> record sun-n1-console
record>
record> mount_option sun-n1-console lov1 MDC_sun-n1-console_mds1_MNT_sun-n1-console
record> quit
record> End recording log sun-n1-console on mds1
+ /usr/sbin/lctl ignore_errors cfg_device $mds1 cleanup detach quit
+ losetup /dev/loop0
+ losetup /dev/loop1
+ losetup -d /dev/loop1
changing mtime of LOGS to 1177060884
+ mktemp /tmp/lustre-cmd.XXXXXXXX
+ debugfs -w -R "mi /LOGS" </tmp/lustre-cmd.mEPL5082 /tmp/mds1-sun-n1-console
MDSDEV: mds1 mds1_UUID /tmp/mds1-sun-n1-console ldiskfs 400000 no
+ losetup /dev/loop0
+ losetup /dev/loop1
+ losetup /dev/loop2
+ losetup /dev/loop3
+ losetup /dev/loop4
+ losetup /dev/loop5
+ losetup /dev/loop6
+ losetup /dev/loop7
+ losetup /dev/loop0
+ losetup /dev/loop1
+ losetup /dev/loop1 /tmp/mds1-sun-n1-console
+ /usr/sbin/lctl attach mdt MDT MDT_UUID quit
+ /usr/sbin/lctl cfg_device MDT setup quit
+ dumpe2fs -f -h /dev/loop1
no external journal found for /dev/loop1
MDS mount options: errors=remount-ro
+ /usr/sbin/lctl attach mds mds1 mds1_UUID quit
+ /usr/sbin/lctl cfg_device mds1 setup /dev/loop1 ldiskfs mds1 errors=remount-ro quit
Alexey Lyashkov
2007-Apr-20 05:19 UTC
[Lustre-discuss] Example "local" fails on node with two IP addresses
It looks like you need to disable SELinux:

> Apr 20 17:38:26 sun-n1-console kernel: audit(1177061906.286:66): avc: denied { rawip_recv } for saddr=192.168.123.45 src=1023 daddr=192.168.123.45 dest=988 netif=lo

-- 
Alexey Lyashkov <shadow@clusterfs.com>
Beaver team
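Before acting on a suggestion like this, it can help to confirm how SELinux is configured on the node. The following is a minimal sketch, assuming the usual /etc/selinux/config layout with a SELINUX= line; the function name selinux_mode is made up for illustration.

```shell
# selinux_mode FILE: print the value of the SELINUX= line in FILE.
# The SELINUXTYPE= line is not matched, because the pattern requires
# "=" directly after "SELINUX".
selinux_mode() {
    sed -n 's/^SELINUX=\(.*\)$/\1/p' "$1"
}

# Typical use (commented out so the sketch has no side effects):
#   selinux_mode /etc/selinux/config
# At runtime, getenforce prints the current mode and "setenforce 0"
# switches to permissive until the next reboot; both come with the
# SELinux userland tools and are assumed to be installed.
```

The configured mode only takes effect after a reboot, so the runtime check is the one that matters when comparing against fresh audit messages.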
Nathaniel Rutman
2007-Apr-20 13:56 UTC
[Lustre-discuss] Example "local" fails on node with two IP addresses
Wouldn't it be awesome to write a script that would look for various common configuration errors in the logs and print out a sensible message?

e.g. why_no_lustre.sh would look for:
  SELinux messages
  iptables
  disks set read-only

I'm sure there's more...

Alexey Lyashkov wrote:
> looks you need selinux disable.
>
> Apr 20 17:38:26 sun-n1-console kernel: audit(1177061906.286:66): avc:
> denied { rawip_recv } for saddr=192.168.123.45 src=1023
> daddr=192.168.123.45 dest=988 netif=lo
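A rough sketch of what such a script might look like is below. It is hypothetical, not an existing Lustre tool: the name diagnose and the hint wording are invented here, and it only matches the two message patterns that actually appear in this thread's logs ("avc: denied" and "No matching NI"). Checks for iptables rules or read-only disks are left out because the thread contains no sample messages for them.

```shell
#!/bin/sh
# why_no_lustre.sh: hypothetical sketch, not an official Lustre tool.
# Scans a syslog file for messages seen in this thread and prints a hint.

# diagnose FILE: print one hint per known pattern found in FILE.
diagnose() {
    found=0

    # SELinux AVC denials, as in the "avc: denied { rawip_recv }" lines.
    if grep -q 'avc: *denied' "$1"; then
        echo "SELinux is denying traffic: check the audit messages."
        found=1
    fi

    # LNET refused a connection addressed to an NID it does not own.
    if grep -q 'No matching NI' "$1"; then
        echo "LNET has no matching NI: check the lnet 'networks' option."
        found=1
    fi

    if [ "$found" -eq 0 ]; then
        echo "No known problem found."
    fi
}

# Run only when a log file is given, so the function can also be sourced.
if [ "$#" -gt 0 ]; then
    diagnose "$1"
fi
```

Run as `sh why_no_lustre.sh /var/log/messages`. New patterns can be added as further grep blocks once a representative log line is known.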
Verdi March
2007-Apr-23 00:29 UTC
[Lustre-discuss] Example "local" fails on node with two IP addresses
Hi Alexey,

I'm still encountering a problem even after disabling SELinux.

# cat /proc/cmdline
ro root=LABEL=/ splash=0 rhgb selinux=0 quiet
# grep ^SELINUX /etc/selinux/config
SELINUX=disabled
SELINUXTYPE=targeted

Below is a snippet of /var/log/messages (a more complete log is attached):

=========
Apr 23 12:57:06 sun-n1-console kernel: Lustre: OBD class driver Build Version: 1.4.10-19691231170000-PRISTINE-.testsuite.tmp.lbuild-boulder.lbuild-v1_4_10_RC2-2.6-rhel4-i686.lbuild.BUILD.lustre-kernel-2.6.9.lustre.linux-2.6.9-42.0.10.EL_lustre.1.4.10smp, info@clusterfs.com
Apr 23 12:57:07 sun-n1-console kernel: Lustre: Added LNI 129.158.130.75@tcp [8/256]
Apr 23 12:57:07 sun-n1-console kernel: Lustre: Accept secure, port 988
Apr 23 12:57:12 sun-n1-console kernel: LustreError: Refusing connection from 192.168.123.45 for 192.168.123.45@tcp: No matching NI
Apr 23 12:57:12 sun-n1-console kernel: LustreError: 4416:0:(socklnd_cb.c:2160:ksocknal_recv_hello()) Error -104 reading HELLO from 192.168.123.45
Apr 23 12:57:12 sun-n1-console kernel: LustreError: Connection to 192.168.123.45@tcp at host 192.168.123.45 on port 988 was reset: is it running a compatible version of Lustre and is 192.168.123.45@tcp one of its NIDs?
Apr 23 12:57:12 sun-n1-console kernel: Lustre: 10:0:(linux-debug.c:98:libcfs_run_upcall()) Invoked LNET upcall /usr/lib/lustre/lnet_upcall ROUTER_NOTIFY,192.168.123.45@tcp,down,1177304206
Apr 23 12:57:17 sun-n1-console kernel: LustreError: 4854:0:(client.c:947:ptlrpc_expire_one_request()) @@@ timeout (sent at 1177304232, 5s ago) req@ef64ec00 x1/t0 o8->ost1-sun-n1-console_UUID@sun-n1-console_UUID:6 lens 240/272 ref 1 fl Rpc:/0/0 rc 0/0
Apr 23 12:57:31 sun-n1-console kernel: LustreError: 5170:0:(mds_lov.c:589:mds_lov_start_synchronize()) mds1: error starting mds_lov_synchronize: -4
Apr 23 12:57:31 sun-n1-console kernel: LustreError: 5170:0:(quota_master.c:1103:mds_quota_recovery()) Cannot start quota recovery thread: rc -4
Apr 23 12:57:37 sun-n1-console kernel: LustreError: Refusing connection from 192.168.123.45 for 192.168.123.45@tcp: No matching NI
Apr 23 12:57:37 sun-n1-console kernel: LustreError: 4417:0:(socklnd_cb.c:2160:ksocknal_recv_hello()) Error -104 reading HELLO from 192.168.123.45
Apr 23 12:57:37 sun-n1-console kernel: LustreError: Connection to 192.168.123.45@tcp at host 192.168.123.45 on port 988 was reset: is it running a compatible version of Lustre and is 192.168.123.45@tcp one of its NIDs?
Apr 23 12:57:42 sun-n1-console kernel: LustreError: 4854:0:(client.c:947:ptlrpc_expire_one_request()) @@@ timeout (sent at 1177304257, 5s ago) req@f5024a00 x3/t0 o8->ost1-sun-n1-console_UUID@sun-n1-console_UUID:6 lens 240/272 ref 1 fl Rpc:/0/0 rc 0/0
=========

It looks to me like there is confusion over which network interface to use (eth0 = 129.158.130.75, eth1 = 192.168.123.45). I intended to deploy the MDS on eth1; this is specified by IP address when creating the node:

--add net --node sun-n1-console --nettype lnet --nid 192.168.123.45@tcp

I've emptied /etc/resolv.conf to ensure that "sun-n1-console" resolves to 192.168.123.45:

# cat /etc/hosts
127.0.0.1 localhost.localdomain localhost
192.168.123.45 sun-n1-console
129.158.130.75 public-host
# hostname -f ; hostname -i
sun-n1-console
192.168.123.45

And the results of ifconfig:

eth0 Link encap:Ethernet HWaddr 00:07:E9:06:AC:5C
     inet addr:129.158.130.75 Bcast:129.158.130.255 Mask:255.255.255.0
eth1 Link encap:Ethernet HWaddr 00:07:E9:06:AC:5D
     inet addr:192.168.123.45 Bcast:192.168.123.255 Mask:255.255.255.0

Is there anything else that I missed?

Regards,
Verdi

Alexey Lyashkov wrote:
> looks you need selinux disable.
>
> Apr 20 17:38:26 sun-n1-console kernel: audit(1177061906.286:66): avc:
> denied { rawip_recv } for saddr=192.168.123.45 src=1023
> daddr=192.168.123.45 dest=988 netif=lo
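The interface mismatch suspected in this message can be read straight from the log: LNET reports the NID it registered in the "Added LNI" line (129.158.130.75@tcp here), while the refused connections are addressed to 192.168.123.45@tcp. A small sketch of that comparison follows; the function names added_lnis and lni_matches are invented for illustration, and the only assumption about the log format is the "Added LNI <nid>" wording shown in the snippet.

```shell
# added_lnis FILE: print every NID that LNET reported as "Added LNI <nid>".
added_lnis() {
    sed -n 's/.*Added LNI \([^ ]*\).*/\1/p' "$1"
}

# lni_matches NID FILE: succeed if NID is among the registered NIDs.
# -x matches the whole line, -F treats the NID as a fixed string.
lni_matches() {
    added_lnis "$2" | grep -qxF -- "$1"
}

# Typical use (commented out so the sketch has no side effects):
#   lni_matches 192.168.123.45@tcp /var/log/messages ||
#       echo "LNET did not register 192.168.123.45@tcp"
```

If the intended NID is missing from the registered list, the NID given to lmc and the interface LNET is actually using disagree.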
Oleg Drokin
2007-Apr-23 02:37 UTC
[Lustre-discuss] Example "local" fails on node with two IP addresses
Hello!

On Mon, Apr 23, 2007 at 08:28:59AM +0200, Verdi March wrote:
> Apr 23 12:57:07 sun-n1-console kernel: Lustre: Added LNI 129.158.130.75@tcp [8/256]

You should have included a full log like this from the very beginning.

> Apr 23 12:57:12 sun-n1-console kernel: LustreError: Refusing connection from 192.168.123.45 for 192.168.123.45@tcp: No matching NI
> Apr 23 12:57:12 sun-n1-console kernel: LustreError: 4416:0:(socklnd_cb.c:2160:ksocknal_recv_hello()) Error -104 reading HELLO from 192.168.123.45
> Apr 23 12:57:12 sun-n1-console kernel: LustreError: Connection to 192.168.123.45@tcp at host 192.168.123.45 on port 988 was reset: is it running a compatible version of Lustre and is 192.168.123.45@tcp one of its NIDs?
> It looks to me that there's a confusion over which network interface
> to use (eth0 = 129.158.130.75, and eth1 = 192.168.123.45).

Right.

> I intended to deploy MDS on eth1; this is specified using IP address
> when creating a node:
> --add net --node sun-n1-console --nettype lnet --nid 192.168.123.45@tcp

This won't help.

> Are there anything else that I missed?

Yes, you need to pass the lnet module option 'networks' like this in your /etc/modprobe.conf:

options lnet networks=tcp(eth1)

(naturally, replacing eth1 with the interface that has the address you want to listen on).

Bye,
Oleg
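The "Added LNI 129.158.130.75@tcp" line quoted above suggests that, without a 'networks' option, LNET bound to eth0 rather than eth1. A minimal sketch for adding the option only when it is not already present is shown below; the helper names lnet_options_line and has_lnet_networks are made up for illustration, and the file path is the one named in this message.

```shell
# lnet_options_line IFACE: print the modprobe.conf line for one TCP interface.
lnet_options_line() {
    printf 'options lnet networks=tcp(%s)\n' "$1"
}

# has_lnet_networks FILE: succeed if FILE already sets the lnet
# 'networks' option on an "options lnet ..." line.
has_lnet_networks() {
    grep -Eq '^[[:space:]]*options[[:space:]]+lnet[[:space:]].*networks=' "$1"
}

# Typical use as root (commented out so the sketch has no side effects):
#   has_lnet_networks /etc/modprobe.conf ||
#       lnet_options_line eth1 >> /etc/modprobe.conf
```

The option is read when the lnet module loads, so the Lustre modules need to be reloaded (or the node rebooted) before it takes effect. Afterwards, `lctl list_nids` should, as far as I know, report the NID on the chosen interface.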
Verdi March
2007-Apr-23 04:45 UTC
[Lustre-discuss] Example "local" fails on node with two IP addresses
Hi Oleg,

Oleg Drokin wrote:
> Yes, you need to pass lnet module option 'networks' like this in your
> /etc/modprobe.conf:
> options lnet networks=tcp(eth1)
>
> (naturally replacing eth1 with interface that has the address you want to
> listen on)

Thanks. With this, I managed to get Lustre working even with SELinux enabled.

Regards,
Verdi