Sten Wolf
2013-Dec-17 14:10 UTC
Setting up a lustre zfs dual mgs/mdt over tcp - help requested
Hi all,

Here is the situation: I have 2 nodes, MDS1 and MDS2 (10.0.0.22, 10.0.0.23), which I wish to use as a failover MGS and an active/active MDT pair with ZFS. I have a JBOD shelf with 12 disks, seen by both nodes as DAS (the shelf has 2 SAS ports, connected to a SAS HBA on each node), and I am using Lustre 2.4 on CentOS 6.4 x64.

I have created 3 ZFS pools:

1. mgs:

# zpool create -f -o ashift=12 -O canmount=off lustre-mgs mirror /dev/disk/by-id/wwn-0x50000c0f012306fc /dev/disk/by-id/wwn-0x50000c0f01233aec
# mkfs.lustre --mgs --servicenode=mds1@tcp0 --servicenode=mds2@tcp0 --param sys.timeout=5000 --backfstype=zfs lustre-mgs/mgs

   Permanent disk data:
Target:     MGS
Index:      unassigned
Lustre FS:
Mount type: zfs
Flags:      0x1064 (MGS first_time update no_primnode )
Persistent mount opts:
Parameters: failover.node=10.0.0.22@tcp failover.node=10.0.0.23@tcp sys.timeout=5000

2. mdt0:

# zpool create -f -o ashift=12 -O canmount=off lustre-mdt0 mirror /dev/disk/by-id/wwn-0x50000c0f01d07a34 /dev/disk/by-id/wwn-0x50000c0f01d110c8
# mkfs.lustre --mdt --fsname=fs0 --servicenode=mds1@tcp0 --servicenode=mds2@tcp0 --param sys.timeout=5000 --backfstype=zfs --mgsnode=mds1@tcp0 --mgsnode=mds2@tcp0 lustre-mdt0/mdt0

warning: lustre-mdt0/mdt0: for Lustre 2.4 and later, the target index must be specified with --index

   Permanent disk data:
Target:     fs0:MDT0000
Index:      0
Lustre FS:  fs0
Mount type: zfs
Flags:      0x1061 (MDT first_time update no_primnode )
Persistent mount opts:
Parameters: failover.node=10.0.0.22@tcp failover.node=10.0.0.23@tcp sys.timeout=5000 mgsnode=10.0.0.22@tcp mgsnode=10.0.0.23@tcp

checking for existing Lustre data: not found
mkfs_cmd = zfs create -o canmount=off -o xattr=sa lustre-mdt0/mdt0
Writing lustre-mdt0/mdt0 properties
  lustre:version=1
  lustre:flags=4193
  lustre:index=0
  lustre:fsname=fs0
  lustre:svname=fs0:MDT0000
  lustre:failover.node=10.0.0.22@tcp
  lustre:failover.node=10.0.0.23@tcp
  lustre:sys.timeout=5000
  lustre:mgsnode=10.0.0.22@tcp
  lustre:mgsnode=10.0.0.23@tcp
3. mdt1:

# zpool create -f -o ashift=12 -O canmount=off lustre-mdt1 mirror /dev/disk/by-id/wwn-0x50000c0f01d113e0 /dev/disk/by-id/wwn-0x50000c0f01d116fc
# mkfs.lustre --mdt --fsname=fs0 --servicenode=mds2@tcp0 --servicenode=mds1@tcp0 --param sys.timeout=5000 --backfstype=zfs --index=1 --mgsnode=mds1@tcp0 --mgsnode=mds2@tcp0 lustre-mdt1/mdt1

   Permanent disk data:
Target:     fs0:MDT0001
Index:      1
Lustre FS:  fs0
Mount type: zfs
Flags:      0x1061 (MDT first_time update no_primnode )
Persistent mount opts:
Parameters: failover.node=10.0.0.23@tcp failover.node=10.0.0.22@tcp sys.timeout=5000 mgsnode=10.0.0.22@tcp mgsnode=10.0.0.23@tcp

checking for existing Lustre data: not found
mkfs_cmd = zfs create -o canmount=off -o xattr=sa lustre-mdt1/mdt1
Writing lustre-mdt1/mdt1 properties
  lustre:version=1
  lustre:flags=4193
  lustre:index=1
  lustre:fsname=fs0
  lustre:svname=fs0:MDT0001
  lustre:failover.node=10.0.0.23@tcp
  lustre:failover.node=10.0.0.22@tcp
  lustre:sys.timeout=5000
  lustre:mgsnode=10.0.0.22@tcp
  lustre:mgsnode=10.0.0.23@tcp

A few basic sanity checks:

# zfs list
NAME               USED  AVAIL  REFER  MOUNTPOINT
lustre-mdt0        824K  3.57T   136K  /lustre-mdt0
lustre-mdt0/mdt0   136K  3.57T   136K  /lustre-mdt0/mdt0
lustre-mdt1        716K  3.57T   136K  /lustre-mdt1
lustre-mdt1/mdt1   136K  3.57T   136K  /lustre-mdt1/mdt1
lustre-mgs        4.78M  3.57T   136K  /lustre-mgs
lustre-mgs/mgs    4.18M  3.57T  4.18M  /lustre-mgs/mgs

# zpool list
NAME          SIZE  ALLOC   FREE  CAP  DEDUP  HEALTH  ALTROOT
lustre-mdt0  3.62T  1.00M  3.62T   0%  1.00x  ONLINE  -
lustre-mdt1  3.62T   800K  3.62T   0%  1.00x  ONLINE  -
lustre-mgs   3.62T  4.86M  3.62T   0%  1.00x  ONLINE  -

# zpool status
  pool: lustre-mdt0
 state: ONLINE
  scan: none requested
config:
        NAME                        STATE     READ WRITE CKSUM
        lustre-mdt0                 ONLINE       0     0     0
          mirror-0                  ONLINE       0     0     0
            wwn-0x50000c0f01d07a34  ONLINE       0     0     0
            wwn-0x50000c0f01d110c8  ONLINE       0     0     0
errors: No known data errors

  pool: lustre-mdt1
 state: ONLINE
  scan: none requested
config:
        NAME                        STATE     READ WRITE CKSUM
        lustre-mdt1                 ONLINE       0     0     0
          mirror-0                  ONLINE       0     0     0
            wwn-0x50000c0f01d113e0  ONLINE       0     0     0
            wwn-0x50000c0f01d116fc  ONLINE       0     0     0
errors: No known data errors

  pool: lustre-mgs
 state: ONLINE
  scan: none requested
config:
        NAME                        STATE     READ WRITE CKSUM
        lustre-mgs                  ONLINE       0     0     0
          mirror-0                  ONLINE       0     0     0
            wwn-0x50000c0f012306fc  ONLINE       0     0     0
            wwn-0x50000c0f01233aec  ONLINE       0     0     0
errors: No known data errors

# zfs get lustre:svname lustre-mgs/mgs
NAME            PROPERTY       VALUE  SOURCE
lustre-mgs/mgs  lustre:svname  MGS    local
# zfs get lustre:svname lustre-mdt0/mdt0
NAME              PROPERTY       VALUE        SOURCE
lustre-mdt0/mdt0  lustre:svname  fs0:MDT0000  local
# zfs get lustre:svname lustre-mdt1/mdt1
NAME              PROPERTY       VALUE        SOURCE
lustre-mdt1/mdt1  lustre:svname  fs0:MDT0001  local

So far, so good.

My /etc/ldev.conf:

mds1 mds2 MGS         zfs:lustre-mgs/mgs
mds1 mds2 fs0-MDT0000 zfs:lustre-mdt0/mdt0
mds2 mds1 fs0-MDT0001 zfs:lustre-mdt1/mdt1

My /etc/modprobe.d/lustre.conf:

# options lnet networks=tcp0(em1)
options lnet ip2nets="tcp0 10.0.0.[22,23]; tcp0 10.0.0.*;"

-----------------------------------------------------------------------------

Now, when starting the services, I get strange errors:

# service lustre start local
Mounting lustre-mgs/mgs on /mnt/lustre/local/MGS
Mounting lustre-mdt0/mdt0 on /mnt/lustre/local/fs0-MDT0000
mount.lustre: mount lustre-mdt0/mdt0 at /mnt/lustre/local/fs0-MDT0000 failed: Input/output error
Is the MGS running?
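[Aside: before chasing the mount error further, it may be worth confirming that the ip2nets line above actually produced the expected NIDs on both nodes, since the MDT has to reach the MGS over 10.0.0.22@tcp / 10.0.0.23@tcp. A minimal check with stock lctl commands, assuming the LNet modules are loaded on both nodes and mds2 is reachable over ssh:]

# on mds1: show the NIDs LNet configured locally (expect 10.0.0.22@tcp)
lctl list_nids

# ping the peer NID in both directions
lctl ping 10.0.0.23@tcp
ssh mds2 lctl ping 10.0.0.22@tcp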
# service lustre status local
running

attached lctl-dk.local01

If I run the same command again, I get a different error:

# service lustre start local
Mounting lustre-mgs/mgs on /mnt/lustre/local/MGS
mount.lustre: according to /etc/mtab lustre-mgs/mgs is already mounted on /mnt/lustre/local/MGS
Mounting lustre-mdt0/mdt0 on /mnt/lustre/local/fs0-MDT0000
mount.lustre: mount lustre-mdt0/mdt0 at /mnt/lustre/local/fs0-MDT0000 failed: File exists

attached lctl-dk.local02

What am I doing wrong?

I have tested lnet self-test as well, using the following script:

# cat lnet-selftest.sh
#!/bin/bash
export LST_SESSION=$$
lst new_session read/write
lst add_group servers 10.0.0.[22,23]@tcp
lst add_group readers 10.0.0.[22,23]@tcp
lst add_group writers 10.0.0.[22,23]@tcp
lst add_batch bulk_rw
lst add_test --batch bulk_rw --from readers --to servers \
    brw read check=simple size=1M
lst add_test --batch bulk_rw --from writers --to servers \
    brw write check=full size=4K
# start running
lst run bulk_rw
# display server stats for 30 seconds
lst stat servers & sleep 30; kill $!
# tear down
lst end_session

and it seemed OK:

# modprobe lnet-selftest && ssh mds2 modprobe lnet-selftest
# ./lnet-selftest.sh
SESSION: read/write FEATURES: 0 TIMEOUT: 300 FORCE: No
10.0.0.[22,23]@tcp are added to session
10.0.0.[22,23]@tcp are added to session
10.0.0.[22,23]@tcp are added to session
Test was added successfully
Test was added successfully
bulk_rw is running now

[LNet Rates of servers]
[R] Avg: 19486   RPC/s  Min: 19234   RPC/s  Max: 19739   RPC/s
[W] Avg: 19486   RPC/s  Min: 19234   RPC/s  Max: 19738   RPC/s
[LNet Bandwidth of servers]
[R] Avg: 1737.60 MB/s   Min: 1680.70 MB/s   Max: 1794.51 MB/s
[W] Avg: 1737.60 MB/s   Min: 1680.70 MB/s   Max: 1794.51 MB/s
[LNet Rates of servers]
[R] Avg: 19510   RPC/s  Min: 19182   RPC/s  Max: 19838   RPC/s
[W] Avg: 19510   RPC/s  Min: 19182   RPC/s  Max: 19838   RPC/s
[LNet Bandwidth of servers]
[R] Avg: 1741.67 MB/s   Min: 1679.51 MB/s   Max: 1803.83 MB/s
[W] Avg: 1741.67 MB/s   Min: 1679.51 MB/s   Max: 1803.83 MB/s
[LNet Rates of servers]
[R] Avg: 19458   RPC/s  Min: 19237   RPC/s  Max: 19679   RPC/s
[W] Avg: 19458   RPC/s  Min: 19237   RPC/s  Max: 19679   RPC/s
[LNet Bandwidth of servers]
[R] Avg: 1738.87 MB/s   Min: 1687.28 MB/s   Max: 1790.45 MB/s
[W] Avg: 1738.87 MB/s   Min: 1687.28 MB/s   Max: 1790.45 MB/s
[LNet Rates of servers]
[R] Avg: 19587   RPC/s  Min: 19293   RPC/s  Max: 19880   RPC/s
[W] Avg: 19586   RPC/s  Min: 19293   RPC/s  Max: 19880   RPC/s
[LNet Bandwidth of servers]
[R] Avg: 1752.62 MB/s   Min: 1695.38 MB/s   Max: 1809.85 MB/s
[W] Avg: 1752.62 MB/s   Min: 1695.38 MB/s   Max: 1809.85 MB/s
[LNet Rates of servers]
[R] Avg: 19528   RPC/s  Min: 19232   RPC/s  Max: 19823   RPC/s
[W] Avg: 19528   RPC/s  Min: 19232   RPC/s  Max: 19824   RPC/s
[LNet Bandwidth of servers]
[R] Avg: 1741.63 MB/s   Min: 1682.29 MB/s   Max: 1800.98 MB/s
[W] Avg: 1741.63 MB/s   Min: 1682.29 MB/s   Max: 1800.98 MB/s

session is ended
./lnet-selftest.sh: line 17:  8835 Terminated              lst stat servers
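[Aside: the "Is the MGS running?" failure, followed by "File exists" on the retry, is consistent with the MDT trying to register before an MGS connection is available. One way to narrow it down is to take the init script out of the picture and mount the targets by hand, MGS first. A minimal sketch, assuming the lustre-mgs and lustre-mdt0 pools are imported on this node and the mount points from the transcript above already exist:]

# import the pools on this node if they are not already imported
zpool import lustre-mgs
zpool import lustre-mdt0

# mount the MGS first, then the MDT that has to register with it
mount -t lustre lustre-mgs/mgs /mnt/lustre/local/MGS
mount -t lustre lustre-mdt0/mdt0 /mnt/lustre/local/fs0-MDT0000

# confirm both targets show up as configured devices
lctl dl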
Sten Wolf
2013-Dec-17 15:29 UTC
Setting up a lustre zfs dual mgs/mdt over tcp - help requested
Addendum: I can start the MGS service on the 2nd node, and then start the mdt0 service on the local node:

# ssh mds2 service lustre start MGS
Mounting lustre-mgs/mgs on /mnt/lustre/foreign/MGS
# service lustre start fs0-MDT0000
Mounting lustre-mdt0/mdt0 on /mnt/lustre/local/fs0-MDT0000
# service lustre status
unhealthy
# service lustre status local
running
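[Aside: the bare "unhealthy" line comes from Lustre's overall health flag and does not say which target is complaining. A few follow-up commands that can help narrow it down, assuming the standard Lustre 2.4 proc layout; the debug-dump file name is just an example:]

# overall health flag; the init script's "unhealthy" most likely reflects this
lctl get_param health_check          # or: cat /proc/fs/lustre/health_check

# list the configured Lustre devices and their state on this node
lctl dl

# dump the kernel debug log around the failure (as was done for lctl-dk.local01/02)
lctl dk /tmp/lctl-dk.$(hostname)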
Mohr Jr, Richard Frank (Rick Mohr)
2013-Dec-17 17:52 UTC
Re: Setting up a lustre zfs dual mgs/mdt over tcp - help requested
On Dec 17, 2013, at 10:29 AM, Sten Wolf <sten-dX0jVuv5p8QybS5Ee8rs3A@public.gmane.org> wrote:

I'm afraid I don't have any suggested solutions to your problem, but I did notice something about your lnet selftest script.

> lst add_group servers 10.0.0.[22,23]@tcp
> lst add_group readers 10.0.0.[22,23]@tcp
> lst add_group writers 10.0.0.[22,23]@tcp
> lst add_batch bulk_rw
> lst add_test --batch bulk_rw --from readers --to servers \
>     brw read check=simple size=1M
> lst add_test --batch bulk_rw --from writers --to servers \
>     brw write check=full size=4K

You may want to try swapping the order of the nids in the "servers" group. If I recall correctly, the default distribution method for lnet selftest is 1:1. This means that your clients and servers will be paired like this:

10.0.0.22@tcp <--> 10.0.0.22@tcp
10.0.0.23@tcp <--> 10.0.0.23@tcp

So you are not testing any lnet traffic between nodes. (That being said, the lnet connectivity between your nodes is still probably fine otherwise the lnet selftest would likely not have run at all.)

--
Rick Mohr
Senior HPC System Administrator
National Institute for Computational Sciences
http://www.nics.tennessee.edu
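[Aside: a sketch of the adjustment Rick is describing. Reversing the NID order in the "servers" group, while leaving the readers/writers groups as they were, should make the default 1:1 pairing cross the nodes, assuming lst keeps the NIDs in the order they are listed; everything else is taken from the original script:]

#!/bin/bash
export LST_SESSION=$$
lst new_session read/write
# servers listed in reverse order so that the default 1:1 pairing becomes
#   10.0.0.22@tcp (client) <--> 10.0.0.23@tcp (server)
#   10.0.0.23@tcp (client) <--> 10.0.0.22@tcp (server)
lst add_group servers 10.0.0.[23,22]@tcp
lst add_group readers 10.0.0.[22,23]@tcp
lst add_group writers 10.0.0.[22,23]@tcp
lst add_batch bulk_rw
lst add_test --batch bulk_rw --from readers --to servers \
    brw read check=simple size=1M
lst add_test --batch bulk_rw --from writers --to servers \
    brw write check=full size=4K
lst run bulk_rw
# display server stats for 30 seconds, then tear down
lst stat servers & sleep 30; kill $!
lst end_session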