Canon, Richard Shane
2007-Feb-03 11:26 UTC
[Lustre-discuss] Routing between two XT systems
Greetings,

We are preparing to transition users from a Cray XT3 system to an XT4 system. In order to transfer the data between the two Lustre filesystems, we are looking at the possibility of having one system mount the other system's file system. This would allow data to be copied directly. However, we aren't certain if this will work. It's not clear to me that the current routing implementation will handle two portals networks. I can sort of see how we would define the ip2nets. However, in looking at the router lines, it looks like we will need to specify portals NIDs. It isn't clear to me how the system would distinguish a local portals NID from the same NID on the other system. Am I overlooking something?

This is a temporary situation. We just need these mounts in place while we move the data. So, even a temporary workaround may be useful.

Thanks,
--Shane
Shane,

If the XT4 clients are on the SIO nodes then you should only have to specify one ptl network in ip2nets and your routing would only require one hop. If you wanted XT4 compute PEs to be clients then I think you'd need to have @ptl0 and @ptl1 (one for each Cray) and specify a hop count of 2. If you're interested in the latter let me know and I'll send you a config.

paul
> We are preparing to transition users from a Cray XT3 system to
> an XT4 system. In order to transfer the data between the two
> Lustre filesystems, we are looking at the possibility of having
> one system mount the other system's file system...

Tell LNET there are 2 portals networks so one cluster has LNET NIDs that are <portals_nid>@ptl0 and the other has <portals_nid>@ptl1.

Let me assume for the purposes of this discussion...

1. The XT3 service nodes have IP addresses in the range 192.168.1.* on some interface.
2. The XT3 system has 8 nodes that connect to the site IP network via eth2 which have IP addresses in the range 100.1.1.[1-8] and NIDs in the range 16-23.
3. The XT4 service nodes have IP addresses in the range 192.168.2.* on some interface.
4. The XT4 system has 8 nodes that connect to the site IP network via eth3 which have IP addresses in the range 100.1.2.[1-8] and NIDs in the range 32-39.

...so you can set the following lnet module parameters...

    options lnet ip2nets="ptl0 192.168.1.* # XT3 service nodes;\
                          ptl1 192.168.2.* # XT4 service nodes;\
                          tcp0(eth2) 100.1.1.[1-8] # XT3 lnet routers;\
                          tcp0(eth3) 100.1.2.[1-8] # XT4 lnet routers;"\
                 routes="ptl0 1 100.1.1.[1-8]@tcp0 # XT3 <- IP network;\
                         ptl0 2 [32-39]@ptl1 # XT3 <- XT4;\
                         ptl1 1 100.1.2.[1-8]@tcp0 # XT4 <- IP network;\
                         ptl1 2 [16-23]@ptl0 # XT4 <- XT3;\
                         tcp0 1 [16-23]@ptl0 # IP network <- XT3;\
                         tcp0 1 [32-39]@ptl1 # IP network <- XT4;"

You may also want to set the following additional LNET parameters...

    options lnet check_routers_before_use=1\
                 dead_router_check_interval=50\
                 live_router_check_interval=50

...which enable automatic router health checks (these are disabled by default) so that dead routers are avoided.

If you want catamount applications to run while this network is in place, you'll need the following environment variables set...

1. XT4 catamount apps need LNET_NETWORKS="ptl1" so they know they are in the ptl1 network. XT3 nodes don't need this because ptl0 is the default.
2. XT3 catamount apps wishing to access XT4 servers need LNET_ROUTES="ptl1 [16-23]@ptl0".
3. XT4 catamount apps wishing to access XT3 servers need LNET_ROUTES="ptl0 [32-39]@ptl1".

Please note that catamount apps are not tolerant of router failure, so any downed routers must be omitted from the LNET_ROUTES environment variable.

Cheers,
Eric
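As an illustration of that last point (this is not part of the original message), here is a minimal Python sketch of how a job launch script might build the LNET_ROUTES value while leaving out downed routers. The helper names `compress_nids` and `lnet_routes` are hypothetical, and the sketch assumes the target LNET version accepts a comma-separated list inside the brackets (for example `[16,18-23]`), which should be checked against the installed release before relying on it.

```python
def compress_nids(nids):
    """Collapse integer NIDs into a bracketed range list such as [16,18-23].

    Assumption: LNET accepts comma-separated numbers and ranges in brackets.
    """
    nids = sorted(set(nids))
    if not nids:
        raise ValueError("no live routers left")
    spans = []
    start = prev = nids[0]
    for n in nids[1:]:
        if n == prev + 1:
            # Still inside a contiguous run; extend it.
            prev = n
            continue
        spans.append((start, prev))
        start = prev = n
    spans.append((start, prev))
    parts = [str(a) if a == b else f"{a}-{b}" for a, b in spans]
    return "[" + ",".join(parts) + "]"


def lnet_routes(remote_net, local_net, router_nids, down=()):
    """Build an LNET_ROUTES value, omitting routers listed in `down`."""
    live = set(router_nids) - set(down)
    return f"{remote_net} {compress_nids(live)}@{local_net}"


if __name__ == "__main__":
    # XT3 catamount apps reaching XT4 servers, with router NID 17 down.
    print(lnet_routes("ptl1", "ptl0", range(16, 24), down={17}))
```

With all eight XT3 routers up, the helper reproduces the value given above, `ptl1 [16-23]@ptl0`. How the list of downed routers is obtained is left open here; it would have to come from whatever health monitoring the site already runs.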
Canon, Richard Shane
2007-Feb-06 07:09 UTC
[Lustre-discuss] Routing between two XT systems
Eric and Paul,

Thanks for your help. We tested cross-mounting a dev XT system's file system on another XT. We still haven't tested having both file systems mounted concurrently, but I'm optimistic given the success of this first test.

Thanks again,
--Shane

-----Original Message-----
From: Eric Barton [mailto:eeb@bartonsoftware.com]
Sent: Sunday, February 04, 2007 9:28 AM
To: Canon, Richard Shane; lustre-discuss@clusterfs.com
Subject: RE: [Lustre-discuss] Routing between two XT systems