Sten Wolf
2013-Dec-17 14:10 UTC
Setting up a lustre zfs dual mgs/mdt over tcp - help requested
Hi all,
Here is the situation:
I have two nodes, MDS1 and MDS2 (10.0.0.22 and 10.0.0.23), that I wish to use
as a failover MGS and an active/active MDT pair on ZFS.
I have a JBOD shelf with 12 disks, seen by both nodes as DAS (the
shelf has two SAS ports, one connected to a SAS HBA on each node), and I
am running Lustre 2.4 on CentOS 6.4 x86_64.
I have created three ZFS pools:
1. mgs:
# zpool create -f -o ashift=12 -O canmount=off lustre-mgs mirror
/dev/disk/by-id/wwn-0x50000c0f012306fc
/dev/disk/by-id/wwn-0x50000c0f01233aec
# mkfs.lustre --mgs --servicenode=mds1@tcp0 --servicenode=mds2@tcp0
--param sys.timeout=5000 --backfstype=zfs lustre-mgs/mgs
Permanent disk data:
Target: MGS
Index: unassigned
Lustre FS:
Mount type: zfs
Flags: 0x1064
(MGS first_time update no_primnode )
Persistent mount opts:
Parameters: failover.node=10.0.0.22@tcp failover.node=10.0.0.23@tcp
sys.timeout=5000
2. mdt0:
# zpool create -f -o ashift=12 -O canmount=off lustre-mdt0 mirror
/dev/disk/by-id/wwn-0x50000c0f01d07a34
/dev/disk/by-id/wwn-0x50000c0f01d110c8
# mkfs.lustre --mdt --fsname=fs0 --servicenode=mds1@tcp0
--servicenode=mds2@tcp0 --param sys.timeout=5000 --backfstype=zfs
--mgsnode=mds1@tcp0 --mgsnode=mds2@tcp0 lustre-mdt0/mdt0
warning: lustre-mdt0/mdt0: for Lustre 2.4 and later, the target
index must be specified with --index
Permanent disk data:
Target: fs0:MDT0000
Index: 0
Lustre FS: fs0
Mount type: zfs
Flags: 0x1061
(MDT first_time update no_primnode )
Persistent mount opts:
Parameters: failover.node=10.0.0.22@tcp failover.node=10.0.0.23@tcp
sys.timeout=5000 mgsnode=10.0.0.22@tcp mgsnode=10.0.0.23@tcp
checking for existing Lustre data: not found
mkfs_cmd = zfs create -o canmount=off -o xattr=sa lustre-mdt0/mdt0
Writing lustre-mdt0/mdt0 properties
lustre:version=1
lustre:flags=4193
lustre:index=0
lustre:fsname=fs0
lustre:svname=fs0:MDT0000
lustre:failover.node=10.0.0.22@tcp
lustre:failover.node=10.0.0.23@tcp
lustre:sys.timeout=5000
lustre:mgsnode=10.0.0.22@tcp
lustre:mgsnode=10.0.0.23@tcp
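Note the warning above: I did not pass --index when formatting mdt0, so
mkfs.lustre chose index 0 on its own. If that turns out to matter, re-creating
the target with the index given explicitly would look roughly like this (a
sketch only; --reformat wipes any existing Lustre data on the dataset):
# mkfs.lustre --reformat --mdt --fsname=fs0 --index=0 \
    --servicenode=mds1@tcp0 --servicenode=mds2@tcp0 \
    --mgsnode=mds1@tcp0 --mgsnode=mds2@tcp0 \
    --param sys.timeout=5000 --backfstype=zfs lustre-mdt0/mdt0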
3. mdt1:
# zpool create -f -o ashift=12 -O canmount=off lustre-mdt1 mirror
/dev/disk/by-id/wwn-0x50000c0f01d113e0
/dev/disk/by-id/wwn-0x50000c0f01d116fc
# mkfs.lustre --mdt --fsname=fs0 --servicenode=mds2@tcp0
--servicenode=mds1@tcp0 --param sys.timeout=5000 --backfstype=zfs
--index=1 --mgsnode=mds1@tcp0 --mgsnode=mds2@tcp0 lustre-mdt1/mdt1
Permanent disk data:
Target: fs0:MDT0001
Index: 1
Lustre FS: fs0
Mount type: zfs
Flags: 0x1061
(MDT first_time update no_primnode )
Persistent mount opts:
Parameters: failover.node=10.0.0.23@tcp failover.node=10.0.0.22@tcp
sys.timeout=5000 mgsnode=10.0.0.22@tcp mgsnode=10.0.0.23@tcp
checking for existing Lustre data: not found
mkfs_cmd = zfs create -o canmount=off -o xattr=sa lustre-mdt1/mdt1
Writing lustre-mdt1/mdt1 properties
lustre:version=1
lustre:flags=4193
lustre:index=1
lustre:fsname=fs0
lustre:svname=fs0:MDT0001
lustre:failover.node=10.0.0.23@tcp
lustre:failover.node=10.0.0.22@tcp
lustre:sys.timeout=5000
lustre:mgsnode=10.0.0.22@tcp
lustre:mgsnode=10.0.0.23@tcp
A few basic sanity checks:
# zfs list
NAME USED AVAIL REFER MOUNTPOINT
lustre-mdt0 824K 3.57T 136K /lustre-mdt0
lustre-mdt0/mdt0 136K 3.57T 136K /lustre-mdt0/mdt0
lustre-mdt1 716K 3.57T 136K /lustre-mdt1
lustre-mdt1/mdt1 136K 3.57T 136K /lustre-mdt1/mdt1
lustre-mgs 4.78M 3.57T 136K /lustre-mgs
lustre-mgs/mgs 4.18M 3.57T 4.18M /lustre-mgs/mgs
# zpool list
NAME SIZE ALLOC FREE CAP DEDUP HEALTH ALTROOT
lustre-mdt0 3.62T 1.00M 3.62T 0% 1.00x ONLINE -
lustre-mdt1 3.62T 800K 3.62T 0% 1.00x ONLINE -
lustre-mgs 3.62T 4.86M 3.62T 0% 1.00x ONLINE -
# zpool status
pool: lustre-mdt0
state: ONLINE
scan: none requested
config:
NAME STATE READ WRITE CKSUM
lustre-mdt0 ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
wwn-0x50000c0f01d07a34 ONLINE 0 0 0
wwn-0x50000c0f01d110c8 ONLINE 0 0 0
errors: No known data errors
pool: lustre-mdt1
state: ONLINE
scan: none requested
config:
NAME STATE READ WRITE CKSUM
lustre-mdt1 ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
wwn-0x50000c0f01d113e0 ONLINE 0 0 0
wwn-0x50000c0f01d116fc ONLINE 0 0 0
errors: No known data errors
pool: lustre-mgs
state: ONLINE
scan: none requested
config:
NAME STATE READ WRITE CKSUM
lustre-mgs ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
wwn-0x50000c0f012306fc ONLINE 0 0 0
wwn-0x50000c0f01233aec ONLINE 0 0 0
errors: No known data errors
# zfs get lustre:svname lustre-mgs/mgs
NAME PROPERTY VALUE SOURCE
lustre-mgs/mgs lustre:svname MGS local
# zfs get lustre:svname lustre-mdt0/mdt0
NAME PROPERTY VALUE SOURCE
lustre-mdt0/mdt0 lustre:svname fs0:MDT0000 local
# zfs get lustre:svname lustre-mdt1/mdt1
NAME PROPERTY VALUE SOURCE
lustre-mdt1/mdt1 lustre:svname fs0:MDT0001 local
So far, so good.
My /etc/ldev.conf:
mds1 mds2 MGS zfs:lustre-mgs/mgs
mds1 mds2 fs0-MDT0000 zfs:lustre-mdt0/mdt0
mds2 mds1 fs0-MDT0001 zfs:lustre-mdt1/mdt1
My /etc/modprobe.d/lustre.conf:
# options lnet networks=tcp0(em1)
options lnet ip2nets="tcp0 10.0.0.[22,23]; tcp0 10.0.0.*;"
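For reference, the same targets can also be mounted by hand, bypassing the
init script (a sketch; the mount points are assumed to exist already):
# mkdir -p /mnt/lustre/local/MGS /mnt/lustre/local/fs0-MDT0000
# mount -t lustre lustre-mgs/mgs /mnt/lustre/local/MGS
# mount -t lustre lustre-mdt0/mdt0 /mnt/lustre/local/fs0-MDT0000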
-----------------------------------------------------------------------------
Now, when starting the services, I get strange errors:
# service lustre start local
Mounting lustre-mgs/mgs on /mnt/lustre/local/MGS
Mounting lustre-mdt0/mdt0 on /mnt/lustre/local/fs0-MDT0000
mount.lustre: mount lustre-mdt0/mdt0 at
/mnt/lustre/local/fs0-MDT0000 failed: Input/output error
Is the MGS running?
# service lustre status local
running
attached lctl-dk.local01
If I run the same command again, I get a different error:
# service lustre start local
Mounting lustre-mgs/mgs on /mnt/lustre/local/MGS
mount.lustre: according to /etc/mtab lustre-mgs/mgs is already
mounted on /mnt/lustre/local/MGS
Mounting lustre-mdt0/mdt0 on /mnt/lustre/local/fs0-MDT0000
mount.lustre: mount lustre-mdt0/mdt0 at
/mnt/lustre/local/fs0-MDT0000 failed: File exists
attached lctl-dk.local02
What am I doing wrong?
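In case it helps, here are a few checks that can be run from either node
without touching the on-disk data (plain lctl/mount queries; outputs omitted):
# lctl list_nids            # NIDs this node advertises
# lctl ping 10.0.0.22@tcp   # LNet reachability of mds1
# lctl ping 10.0.0.23@tcp   # LNet reachability of mds2
# lctl dl                   # Lustre devices currently set up
# mount -t lustre           # Lustre targets currently mounted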
I have also run an LNet self-test, using the following script:
# cat lnet-selftest.sh
#!/bin/bash
export LST_SESSION=$$
lst new_session read/write
lst add_group servers 10.0.0.[22,23]@tcp
lst add_group readers 10.0.0.[22,23]@tcp
lst add_group writers 10.0.0.[22,23]@tcp
lst add_batch bulk_rw
lst add_test --batch bulk_rw --from readers --to servers \
brw read check=simple size=1M
lst add_test --batch bulk_rw --from writers --to servers \
brw write check=full size=4K
# start running
lst run bulk_rw
# display server stats for 30 seconds
lst stat servers & sleep 30; kill $!
# tear down
lst end_session
and it seemed OK:
# modprobe lnet-selftest && ssh mds2 modprobe lnet-selftest
# ./lnet-selftest.sh
SESSION: read/write FEATURES: 0 TIMEOUT: 300 FORCE: No
10.0.0.[22,23]@tcp are added to session
10.0.0.[22,23]@tcp are added to session
10.0.0.[22,23]@tcp are added to session
Test was added successfully
Test was added successfully
bulk_rw is running now
[LNet Rates of servers]
[R] Avg: 19486 RPC/s Min: 19234 RPC/s Max: 19739 RPC/s
[W] Avg: 19486 RPC/s Min: 19234 RPC/s Max: 19738 RPC/s
[LNet Bandwidth of servers]
[R] Avg: 1737.60 MB/s Min: 1680.70 MB/s Max: 1794.51 MB/s
[W] Avg: 1737.60 MB/s Min: 1680.70 MB/s Max: 1794.51 MB/s
[LNet Rates of servers]
[R] Avg: 19510 RPC/s Min: 19182 RPC/s Max: 19838 RPC/s
[W] Avg: 19510 RPC/s Min: 19182 RPC/s Max: 19838 RPC/s
[LNet Bandwidth of servers]
[R] Avg: 1741.67 MB/s Min: 1679.51 MB/s Max: 1803.83 MB/s
[W] Avg: 1741.67 MB/s Min: 1679.51 MB/s Max: 1803.83 MB/s
[LNet Rates of servers]
[R] Avg: 19458 RPC/s Min: 19237 RPC/s Max: 19679 RPC/s
[W] Avg: 19458 RPC/s Min: 19237 RPC/s Max: 19679 RPC/s
[LNet Bandwidth of servers]
[R] Avg: 1738.87 MB/s Min: 1687.28 MB/s Max: 1790.45 MB/s
[W] Avg: 1738.87 MB/s Min: 1687.28 MB/s Max: 1790.45 MB/s
[LNet Rates of servers]
[R] Avg: 19587 RPC/s Min: 19293 RPC/s Max: 19880 RPC/s
[W] Avg: 19586 RPC/s Min: 19293 RPC/s Max: 19880 RPC/s
[LNet Bandwidth of servers]
[R] Avg: 1752.62 MB/s Min: 1695.38 MB/s Max: 1809.85 MB/s
[W] Avg: 1752.62 MB/s Min: 1695.38 MB/s Max: 1809.85 MB/s
[LNet Rates of servers]
[R] Avg: 19528 RPC/s Min: 19232 RPC/s Max: 19823 RPC/s
[W] Avg: 19528 RPC/s Min: 19232 RPC/s Max: 19824 RPC/s
[LNet Bandwidth of servers]
[R] Avg: 1741.63 MB/s Min: 1682.29 MB/s Max: 1800.98 MB/s
[W] Avg: 1741.63 MB/s Min: 1682.29 MB/s Max: 1800.98 MB/s
session is ended
./lnet-selftest.sh: line 17:  8835 Terminated              lst stat servers
Sten Wolf
2013-Dec-17 15:29 UTC
Setting up a lustre zfs dual mgs/mdt over tcp - help requested
Addendum: I can start the MGS service on the second node, and then
start the mdt0 service on the local node:
# ssh mds2 service lustre start MGS
Mounting lustre-mgs/mgs on /mnt/lustre/foreign/MGS
# service lustre start fs0-MDT0000
Mounting lustre-mdt0/mdt0 on /mnt/lustre/local/fs0-MDT0000
# service lustre status
unhealthy
# service lustre status local
running
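The "unhealthy" state presumably comes from the kernel health flag, which can
be read directly (a sketch; these paths are what I would expect on Lustre 2.4):
# lctl get_param -n health_check
# cat /proc/fs/lustre/health_check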
Mohr Jr, Richard Frank (Rick Mohr)
2013-Dec-17 17:52 UTC
Re: Setting up a lustre zfs dual mgs/mdt over tcp - help requested
On Dec 17, 2013, at 10:29 AM, Sten Wolf <sten-dX0jVuv5p8QybS5Ee8rs3A@public.gmane.org> wrote:
I'm afraid I don't have any suggested solutions to your problem, but I did
notice something about your lnet selftest script.
> lst add_group servers 10.0.0.[22,23]@tcp
> lst add_group readers 10.0.0.[22,23]@tcp
> lst add_group writers 10.0.0.[22,23]@tcp
> lst add_batch bulk_rw
> lst add_test --batch bulk_rw --from readers --to servers \
> brw read check=simple size=1M
> lst add_test --batch bulk_rw --from writers --to servers \
> brw write check=full size=4K
You may want to try swapping the order of the NIDs in the "servers" group.
If I recall correctly, the default distribution method for lnet selftest is
1:1. This means that your clients and servers will be paired like this:
10.0.0.22@tcp <--> 10.0.0.22@tcp
10.0.0.23@tcp <--> 10.0.0.23@tcp
So you are not testing any LNet traffic between nodes. (That being said, the
LNet connectivity between your nodes is still probably fine; otherwise the
lnet selftest would likely not have run at all.)
--
Rick Mohr
Senior HPC System Administrator
National Institute for Computational Sciences
http://www.nics.tennessee.edu
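To illustrate the point, here is one variant of the selftest script that forces
the traffic to cross between the nodes. It puts all clients on mds1 and all
servers on mds2 rather than literally reordering the NIDs, so it is a sketch of
the idea rather than Rick's exact suggestion; repeat it with the roles reversed
to exercise the other direction:
#!/bin/bash
export LST_SESSION=$$
lst new_session cross_node
# servers on mds2 only, clients on mds1 only, so every RPC crosses the wire
lst add_group servers 10.0.0.23@tcp
lst add_group readers 10.0.0.22@tcp
lst add_group writers 10.0.0.22@tcp
lst add_batch bulk_rw
lst add_test --batch bulk_rw --from readers --to servers \
    brw read check=simple size=1M
lst add_test --batch bulk_rw --from writers --to servers \
    brw write check=full size=4K
lst run bulk_rw
# display server stats for 30 seconds, then tear down
lst stat servers & sleep 30; kill $!
lst end_session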