Porter Don wrote:
> 1) If I have a lustre lov set up as follows:
>
> node1: client
> node2: client and mds
> node3 & 4: ost
>
> If I reboot nodes 2-4 and leave the system mounted on node1, when the other
> nodes come up and restart lustre, node1 cannot seamlessly restore its state.
> In fact, the other three nodes seem to get messed up to the point that node
> 2 cannot mount the lov until all nodes, client and server, are rebooted and
> lustre is restarted.
>
> Is this how the system is supposed to work? If not, am I making a newbie
> mistake?

Lustre 1.x can recover from any single failure at a time -- but we
explicitly do not support recovery from multiple simultaneous failures.

Even if you just reboot node 2 here, you will not have a seamless
recovery, because you have failed two components simultaneously: a
client and an MDS. In this case, node1 will be told to flush its caches
and abort any in-progress operations.

If you try to start the MDS while one or more OSTs are down, it will
fail unless you explicitly tell lconf to ignore the inactive OSTs. This
is so that administrators do not accidentally start up file systems in
degraded mode, without all of the servers, in which some data is
inaccessible -- we thought it best for that decision to be made very
consciously and explicitly.

> 2) If I get a new client node, is there a way to have it join without
> bringing down all nodes? It would be ok to bring down the servers, but as
> noted above, doing so without bringing down all clients is causing me
> problems.

Adding new client nodes is standard practice -- you can mount a new
client normally, as if it had been there from the beginning.

If you are having problems mounting additional clients, please let us
know -- that is a bug!

Thanks--

-Phil
Phil,

Thanks for the help. As long as I know what to look for in dmesg, I
think that is ok. Yeah, running the abort command seems to fix things
up.

Thanks!

Don Porter

-----Original Message-----
From: Phil Schwan
To: Porter Don
Cc: 'lustre-discuss@lists.clusterfs.com'
Sent: 4/21/04 12:11 PM
Subject: RE: [Lustre-discuss] Can lustre dynamically add clients?

Hi Don--

On Tue, 2004-04-20 at 16:44, Porter Don wrote:
> Ok. I ran
>
> lmc -m orch_test.xml --add net --node client --nid '*' --nettype tcp
>
> on all mds and osd nodes.
>
> Then restarted the nodes. When I tried:
>
> [root@jawa051 root]# mount -t lustre jawa046:/mds1/client /mnt/lustre
> /sbin/mount.lustre: Invalid argument

This is not a great error message, I will be the first to agree. Let me
see what I can do about improving that, although it may take a lot of
plumbing to get a useful error back from the kernel.

> I also got this on the mds:
>
> [root@jawa046 root]# dmesg
> LustreError: 1556:(../ldlm/ldlm_lib.c:474:target_handle_connect()) denying
> connection for new client b992fe62-fd97-4670-ba25-27a0c7e943f1: 10 clients
> in recovery for 120s

This is the key message. If the MDS was not shut down cleanly, it will
wait for all old clients to reconnect, so that it can complete recovery.
Please see https://bugzilla.lustre.org/show_bug.cgi?id=2398 for more
details.

If you wait the 120 seconds, or perform the abort-recovery recipe
described in issue 2397, are you able to mount again?

-Phil
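[Editor's note: Don mentions knowing "what to look for in dmesg". As a hedged illustration of that, here is a small sketch that scans log text for the recovery-denial message quoted later in this thread. The pattern is taken from the LustreError lines in the thread; `count_recovery_denials` is a hypothetical helper written for this note, not a Lustre tool, and the exact message wording may differ between versions.]

```shell
#!/bin/sh
# Hypothetical helper: count how many client connection attempts were
# denied because the MDS was still in its recovery window. The pattern
# matches the "denying connection ... in recovery for" LustreError
# lines shown in this thread; adjust it if your version logs differently.
count_recovery_denials() {
    grep -c 'denying connection .* in recovery for'
}

# Typical usage on the MDS would be (illustrative only):
#   dmesg | count_recovery_denials
```

If the count is non-zero, the MDS is still waiting for its old clients; either wait out the recovery window or abort recovery as described in the thread.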
1) I got this to work by a setup like the following:

node 1: mds
node 2 & 3: ost
node 4: client

I could reboot node 2 and everything still worked.

2) So, how do I add a node 5 to the mix without adding an entry to the
config.xml file and restarting all other nodes? I suppose this is just
something I am having trouble finding in the documentation.

Thanks,

don

On Tue, 2004-04-13 at 00:58, Phil Schwan wrote:
> Porter Don wrote:
> >
> > 1) If I have a lustre lov set up as follows:
> >
> > node1: client
> > node2: client and mds
> > node3 & 4: ost
> >
> > If I reboot nodes 2-4 and leave the system mounted on node1, when the other
> > nodes come up and restart lustre, node1 cannot seamlessly restore its state.
> > In fact, the other three nodes seem to get messed up to the point that node
> > 2 cannot mount the lov until all nodes, client and server, are rebooted and
> > lustre is restarted.
> >
> > Is this how the system is supposed to work? If not, am I making a newbie
> > mistake?
>
> Lustre 1.x can recover from any single failure at a time -- but we
> explicitly do not support recovery from multiple simultaneous failures.
>
> Even if you just reboot node 2 here, you will not have a seamless
> recovery, because you have failed two components simultaneously: a
> client and an MDS. In this case, node1 will be told to flush its caches
> and abort any in-progress operations.
>
> If you try to start the MDS while one or more OSTs are down, it will
> fail unless you explicitly tell lconf to ignore the inactive OSTs. This
> is so that administrators do not accidentally start up file systems in
> degraded mode, without all of the servers, in which some data is
> inaccessible -- we thought it best for that decision to be made very
> consciously and explicitly.
>
> > 2) If I get a new client node, is there a way to have it join without
> > bringing down all nodes? It would be ok to bring down the servers, but as
> > noted above, doing so without bringing down all clients is causing me
> > problems.
>
> Adding new client nodes is standard practice -- you can mount it
> normally, as if it had been there from the beginning.
>
> If you are having problems mounting additional clients, please let us
> know -- that is a bug!
>
> Thanks--
>
> -Phil
Hi--

On Mon, 2004-04-19 at 21:31, Don Porter wrote:
>
> 2) so, how do I add a node 5 to the mix without adding an entry to the
> config.xml file and restarting all other nodes? I suppose this is just
> something I am having trouble finding in the documentation.

Ah! Now I understand what you're having trouble with. Clients are
usually configured with a '*' rule, so they all use the same profile.
For example:

lmc -m config.xml --add net --node client --nid '*' --nettype tcp

Then, to start any client, you run:

mount -t lustre mds.host.name:/mds_name/client_profile /mnt/lustre

In your configuration, assuming that you used the same service names as
the examples, this might well be:

mount -t lustre mds:/mds1/client /mnt/lustre

(or if you are still using lconf: "lconf --node client config.xml")

Hope that helps--

-Phil
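[Editor's note: the mount source above follows the pattern `<mds_host>:/<mds_service_name>/<client_profile>`. A minimal sketch of assembling it, using the hostname and service names assumed in this thread (`mds`, `mds1`, `client`):]

```shell
#!/bin/sh
# Assemble the Lustre 1.x client mount source from its three parts:
#   <mds_host>:/<mds_service_name>/<client_profile>
# The values below come from the examples in this thread and are
# illustrative only; substitute your own configuration's names.
mds_host=mds
mds_name=mds1
client_profile=client

src="${mds_host}:/${mds_name}/${client_profile}"
echo "$src"

# The assembled source would then be used as (not run here):
#   mount -t lustre "$src" /mnt/lustre
```

Because every client configured with the `'*'` wildcard uses the same profile, this one mount command works unchanged on any new client node.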
Ahh. That makes life SO much better. Having to specify each client at
startup time would be a major challenge to the usability of lustre.

I didn't see anything about the wildcard node id in this doc:

https://wiki.clusterfs.com/lustre/LustreHowto

Is there a more recent one? If not, it might be good to update that.

Thanks for the help,

Don

-----Original Message-----
From: Phil Schwan
To: Don Porter
Cc: 'lustre-discuss@lists.clusterfs.com'
Sent: 4/20/04 9:53 AM
Subject: Re: [Lustre-discuss] Can lustre dynamically add clients?

Hi--

On Mon, 2004-04-19 at 21:31, Don Porter wrote:
>
> 2) so, how do I add a node 5 to the mix without adding an entry to the
> config.xml file and restarting all other nodes? I suppose this is just
> something I am having trouble finding in the documentation.

Ah! Now I understand what you're having trouble with. Clients are
usually configured with a '*' rule, so they all use the same profile.
For example:

lmc -m config.xml --add net --node client --nid '*' --nettype tcp

Then, to start any client, you run:

mount -t lustre mds.host.name:/mds_name/client_profile /mnt/lustre

In your configuration, assuming that you used the same service names as
the examples, this might well be:

mount -t lustre mds:/mds1/client /mnt/lustre

(or if you are still using lconf: "lconf --node client config.xml")

Hope that helps--

-Phil
Ok. I ran

lmc -m orch_test.xml --add net --node client --nid '*' --nettype tcp

on all mds and osd nodes.

Then restarted the nodes. When I tried:

[root@jawa051 root]# mount -t lustre jawa046:/mds1/client /mnt/lustre
/sbin/mount.lustre: Invalid argument

[root@jawa051 root]# dmesg
LustreError: 3583:(client.c:445:ptlrpc_check_status()) @@@ type = PTL_RPC_MSG_ERR
  req@f6a14000 x11/t0 o38->mds1@MDS_PEER_UUID:12 lens 168/64 ref 1 fl RPC:R/0/50000 rc 0/-16
LustreError: 3583:(llite_lib.c:446:lustre_process_log()) cannot connect to mds1: rc = -16
LustreError: 3583:(llite_lib.c:547:lustre_fill_super()) No profile found: client
LustreError: 3583:(client.c:445:ptlrpc_check_status()) @@@ type = PTL_RPC_MSG_ERR
  req@f6d0c200 x12/t0 o38->mds1@MDS_PEER_UUID:12 lens 168/64 ref 1 fl RPC:R/0/50000 rc 0/-16
LustreError: 3583:(llite_lib.c:446:lustre_process_log()) cannot connect to mds1: rc = -16

I also got this on the mds:

[root@jawa046 root]# dmesg
LustreError: 1556:(../ldlm/ldlm_lib.c:474:target_handle_connect()) denying
connection for new client b992fe62-fd97-4670-ba25-27a0c7e943f1: 10 clients
in recovery for 120s
LustreError: 1556:(../ldlm/ldlm_lib.c:1056:target_send_reply()) @@@ processing
error (-16) req@f6c72c00 x11/t0 o38-><?>@:-1 lens 168/64 ref 0 fl ?phase?:/0/50000 rc -16/0
Lustre: 1474:(socknal_cb.c:1544:ksocknal_process_receive()) [c6368800] EOF from 0xa540450 ip 10.84.4.80:32806
LustreError: 1557:(../ldlm/ldlm_lib.c:474:target_handle_connect()) denying
connection for new client 2e71937c-2275-4c06-ac3c-36ab10215d60: 10 clients
in recovery for 120s
LustreError: 1557:(../ldlm/ldlm_lib.c:1056:target_send_reply()) @@@ processing
error (-16) req@f6ee8c00 x12/t0 o38-><?>@:-1 lens 168/64 ref 0 fl ?phase?:/0/50000 rc -16/0
Lustre: 1474:(socknal_cb.c:1544:ksocknal_process_receive()) [f71e1800] EOF from 0xa540450 ip 10.84.4.80:32809

Any suggestions? I can send my complete xml if you need it.
Thanks,

Don

-----Original Message-----
From: Phil Schwan
To: Don Porter
Cc: 'lustre-discuss@lists.clusterfs.com'
Sent: 4/20/04 9:53 AM
Subject: Re: [Lustre-discuss] Can lustre dynamically add clients?

Hi--

On Mon, 2004-04-19 at 21:31, Don Porter wrote:
>
> 2) so, how do I add a node 5 to the mix without adding an entry to the
> config.xml file and restarting all other nodes? I suppose this is just
> something I am having trouble finding in the documentation.

Ah! Now I understand what you're having trouble with. Clients are
usually configured with a '*' rule, so they all use the same profile.
For example:

lmc -m config.xml --add net --node client --nid '*' --nettype tcp

Then, to start any client, you run:

mount -t lustre mds.host.name:/mds_name/client_profile /mnt/lustre

In your configuration, assuming that you used the same service names as
the examples, this might well be:

mount -t lustre mds:/mds1/client /mnt/lustre

(or if you are still using lconf: "lconf --node client config.xml")

Hope that helps--

-Phil
Hi Don--

On Tue, 2004-04-20 at 16:44, Porter Don wrote:
> Ok. I ran
>
> lmc -m orch_test.xml --add net --node client --nid '*' --nettype tcp
>
> on all mds and osd nodes.
>
> Then restarted the nodes. When I tried:
>
> [root@jawa051 root]# mount -t lustre jawa046:/mds1/client /mnt/lustre
> /sbin/mount.lustre: Invalid argument

This is not a great error message, I will be the first to agree. Let me
see what I can do about improving that, although it may take a lot of
plumbing to get a useful error back from the kernel.

> I also got this on the mds:
>
> [root@jawa046 root]# dmesg
> LustreError: 1556:(../ldlm/ldlm_lib.c:474:target_handle_connect()) denying
> connection for new client b992fe62-fd97-4670-ba25-27a0c7e943f1: 10 clients
> in recovery for 120s

This is the key message. If the MDS was not shut down cleanly, it will
wait for all old clients to reconnect, so that it can complete recovery.
Please see https://bugzilla.lustre.org/show_bug.cgi?id=2398 for more
details.

If you wait the 120 seconds, or perform the abort-recovery recipe
described in issue 2397, are you able to mount again?

-Phil
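[Editor's note: "wait the 120 seconds, then try to mount again" can be automated with a simple retry loop. This is a generic sketch written for this note, not a Lustre tool; the mount command in the comment uses the hostnames from this thread and the timeout matches the 120 s recovery window in the log.]

```shell
#!/bin/sh
# Hypothetical retry helper: run a command repeatedly until it succeeds
# or the attempt budget is exhausted. With enough attempts this rides
# out the MDS recovery window (120 s in the log above), after which the
# client mount should be accepted.
retry_mount() {
    tries=$1; shift
    i=0
    while [ "$i" -lt "$tries" ]; do
        if "$@"; then
            return 0
        fi
        i=$((i + 1))
        sleep 1   # in practice a longer delay between attempts is sensible
    done
    return 1
}

# Real usage would look like (not run here):
#   retry_mount 130 mount -t lustre jawa046:/mds1/client /mnt/lustre
```

Alternatively, aborting recovery (the recipe in bugzilla issue 2397) ends the wait immediately, at the cost of evicting the old clients.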
Hello all,

I have recently started using lustre 1.0.4 on a few x86 machines
running the provided 2.4.20-28.9_lustre.1.0.4smp kernel. I have
encountered two things that confuse me a bit.

1) If I have a lustre lov set up as follows:

node1: client
node2: client and mds
node3 & 4: ost

If I reboot nodes 2-4 and leave the system mounted on node1, when the
other nodes come up and restart lustre, node1 cannot seamlessly restore
its state. In fact, the other three nodes seem to get messed up to the
point that node 2 cannot mount the lov until all nodes, client and
server, are rebooted and lustre is restarted.

Is this how the system is supposed to work? If not, am I making a
newbie mistake?

2) If I get a new client node, is there a way to have it join without
bringing down all nodes? It would be ok to bring down the servers, but
as noted above, doing so without bringing down all clients is causing
me problems.

Any advice/suggestions/help would be greatly appreciated. Also, if I
can provide any more helpful information with this problem, I would be
happy to.

Thanks,

Don Porter