On Feb 02, 2006 15:52 -0500, Jaya Natarajan wrote:
> - Created ost, mds and mounted /mnt/lustre in the client.
> - In the client, copied some big files into /mnt/lustre
> - In mds server, renamed /tmp/mds1 as /tmp/mds1.bkup
> - But now back in the client, still I could list, view and create files.
> - dmesg in the mds server displays these lines among other things:
> - Tested with --failover option and with two mds. Still see the same
>   behavior.

Just renaming the /tmp/mds1 file is not actually doing anything to "remove
the MDS server".  Unix can access open files even if they are renamed or
unlinked, until they are closed.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
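
(A minimal shell sketch of the open-file behaviour Andreas describes, assuming
a bash-like shell; the file names are made up for illustration and do not
appear in the thread:)

  # create a file and keep it open on descriptor 3
  echo "metadata" > /tmp/demo-mds
  exec 3< /tmp/demo-mds

  # rename it out from under the open descriptor
  mv /tmp/demo-mds /tmp/demo-mds.bkup

  # the already-open descriptor still reads the original data
  cat <&3            # prints "metadata"
  exec 3<&-          # close the descriptor

In Jaya's test the MDS backing file /tmp/mds1 is served through a loop device
(the dmesg below shows /dev/loop0), which keeps the file open, so renaming it
has no effect on the running MDS.
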
Jaya Natarajan wrote:
> Hi,
>
> I am testing mds failover in a three node cluster (Lustre 1.4.4). As an
> initial test, I wanted to see what happens when an mds is removed.
> After I started the client I went back and removed the mds manually. But
> I could still list, view and create files in lustre file system. Can
> some one explain this behaviour? Following are the steps I did:
>
> - Created ost, mds and mounted /mnt/lustre in the client.
> - In the client, copied some big files into /mnt/lustre
> - In mds server, renamed /tmp/mds1 as /tmp/mds1.bkup
> - But now back in the client, still I could list, view and create files.
> - dmesg in the mds server displays these lines among other things:
> - Tested with --failover option and with two mds. Still see the same
>   behavior.
>
> ....
> Lustre: MDT mds1 has stopped.
> kjournald starting.  Commit interval 5 seconds
> LDISKFS FS on loop0, internal journal
> LDISKFS-fs: mounted filesystem with ordered data mode.
> Lustre: 23679:0:(socknal.c:325:ksocknal_associate_route_conn_locked())
>   Binding 0xc094fa3b 192.148.250.59 to 192.148.250.57
> Lustre: 24003:0:(mds_lov.c:216:mds_lov_connect()) got last object 0 from
>   OST 0
> Lustre: MDT mds1 now serving /dev/loop0 with recovery enabled.
> ....

If the file is already open, renaming it does nothing; Linux will access
the file via the existing descriptor.

To "fail" the MDS, run this on the mds node:

  # lconf --cleanup --failover nomdsfailover.xml

Or, just pull the plug on that system. :)

cliffw

> Lustre Script file:
> ------------------
>
> #!/bin/sh
> #
> #
> # Configure nodes and net
> lmc -o nomdsfailover.xml --add net --node sanjay --nid sanjay.sf.osc.edu --nettype tcp
> lmc -m nomdsfailover.xml --add net --node uma --nid uma.sf.osc.edu --nettype tcp
> lmc -m nomdsfailover.xml --add net --node pria --nid pria.sf.osc.edu --nettype tcp
>
> # Configure OSTs
> # Size is in kilobytes; size of OST should be at least 8MB
> lmc -m nomdsfailover.xml --add ost --node sanjay --ost ost-test --fstype ext3 --dev /tmp/ost --size 50000
>
> # Configure MDS
> # Size is in kilobytes; size of MDS should be at least 8MB
> lmc -m nomdsfailover.xml --add mds --node uma --mds mds1 --fstype ext3 --dev /tmp/mds1 --size 50000
>
> # Configure client
> lmc -m nomdsfailover.xml --add mtpt --node pria --path /mnt/lustre --mds mds1 --ost ost-test
>
> --------------
>
> Thanks,
> Jaya
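
(A sketch of the failover test this implies, using the nomdsfailover.xml config
from Jaya's script; the bare "lconf nomdsfailover.xml" restart step is an
assumption about the 1.4 lconf behaviour rather than something stated in the
thread, so check lconf --help on your install:)

  # On the mds node (uma): stop the MDS as if it had failed, so clients
  # go into recovery instead of seeing a clean shutdown.
  lconf --cleanup --failover nomdsfailover.xml

  # On the client (pria): metadata operations such as ls or touch should
  # now block until the MDS returns.

  # On the mds node (or its configured failover peer): restart the services.
  lconf nomdsfailover.xml
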
Hi,

I am testing mds failover in a three-node cluster (Lustre 1.4.4). As an
initial test, I wanted to see what happens when an mds is removed. After I
started the client, I went back and removed the mds manually. But I could
still list, view and create files in the Lustre file system. Can someone
explain this behaviour? Following are the steps I did:

- Created ost, mds and mounted /mnt/lustre in the client.
- In the client, copied some big files into /mnt/lustre
- In the mds server, renamed /tmp/mds1 as /tmp/mds1.bkup
- But now, back in the client, I could still list, view and create files.
- dmesg in the mds server displays these lines among other things:
- Tested with --failover option and with two mds. Still see the same behavior.

....
Lustre: MDT mds1 has stopped.
kjournald starting.  Commit interval 5 seconds
LDISKFS FS on loop0, internal journal
LDISKFS-fs: mounted filesystem with ordered data mode.
Lustre: 23679:0:(socknal.c:325:ksocknal_associate_route_conn_locked()) Binding 0xc094fa3b 192.148.250.59 to 192.148.250.57
Lustre: 24003:0:(mds_lov.c:216:mds_lov_connect()) got last object 0 from OST 0
Lustre: MDT mds1 now serving /dev/loop0 with recovery enabled.
....

Lustre Script file:
------------------

#!/bin/sh
#
#
# Configure nodes and net
lmc -o nomdsfailover.xml --add net --node sanjay --nid sanjay.sf.osc.edu --nettype tcp
lmc -m nomdsfailover.xml --add net --node uma --nid uma.sf.osc.edu --nettype tcp
lmc -m nomdsfailover.xml --add net --node pria --nid pria.sf.osc.edu --nettype tcp

# Configure OSTs
# Size is in kilobytes; size of OST should be at least 8MB
lmc -m nomdsfailover.xml --add ost --node sanjay --ost ost-test --fstype ext3 --dev /tmp/ost --size 50000

# Configure MDS
# Size is in kilobytes; size of MDS should be at least 8MB
lmc -m nomdsfailover.xml --add mds --node uma --mds mds1 --fstype ext3 --dev /tmp/mds1 --size 50000

# Configure client
lmc -m nomdsfailover.xml --add mtpt --node pria --path /mnt/lustre --mds mds1 --ost ost-test

--------------

Thanks,
Jaya
--
Jaya Natarajan <jaya@osc.edu>
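
(For reference, a sketch of how the config produced by this script is
typically brought up with the Lustre 1.4 tools; the script file name and the
--reformat/--node flags are assumptions based on the 1.4 utilities, not part
of the original mail:)

  # generate nomdsfailover.xml by running the lmc script above
  sh ./nomdsfailover.sh          # hypothetical script name

  # format and start the services on each server node
  lconf --reformat --node sanjay nomdsfailover.xml   # OST node
  lconf --reformat --node uma nomdsfailover.xml      # MDS node

  # mount /mnt/lustre on the client node
  lconf --node pria nomdsfailover.xml
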