On Feb 02, 2006 15:52 -0500, Jaya Natarajan wrote:
> - Created ost, mds and mounted /mnt/lustre in the client.
> - In the client, copied some big files into /mnt/lustre
> - In mds server, renamed /tmp/mds1 as /tmp/mds1.bkup
> - But now back in the client, still I could list, view and create files.
> - dmesg in the mds server displays these lines among other things:
> - Tested with --failover option and with two mds. Still see the same
>   behavior.

Just renaming the /tmp/mds1 file is not actually doing anything to "remove
the MDS server".  Unix can access open files even if they are renamed or
unlinked, until they are closed.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
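
(A minimal shell sketch of the open-file behaviour Andreas describes, assuming
a bash-like shell; the file names are made up for illustration and do not
appear in the thread:)

  # create a file and keep it open on descriptor 3
  echo "metadata" > /tmp/demo-mds
  exec 3< /tmp/demo-mds

  # rename it out from under the open descriptor
  mv /tmp/demo-mds /tmp/demo-mds.bkup

  # the already-open descriptor still reads the original data
  cat <&3            # prints "metadata"
  exec 3<&-          # close the descriptor

In Jaya's test the MDS backing file /tmp/mds1 is served through a loop device
(the dmesg below shows /dev/loop0), which keeps the file open, so renaming it
has no effect on the running MDS.
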
Jaya Natarajan wrote:
> Hi,
>
> I am testing mds failover in a three node cluster (Lustre 1.4.4). As an
> initial test, I wanted to see what happens when an mds is removed.
> After I started the client I went back and removed the mds manually. But
> I could still list, view and create files in lustre file system. Can
> some one explain this behaviour? Following are the steps I did:
>
> - Created ost, mds and mounted /mnt/lustre in the client.
> - In the client, copied some big files into /mnt/lustre
> - In mds server, renamed /tmp/mds1 as /tmp/mds1.bkup
> - But now back in the client, still I could list, view and create files.
> - dmesg in the mds server displays these lines among other things:
> - Tested with --failover option and with two mds. Still see the same
>   behavior.
>
> ....
> Lustre: MDT mds1 has stopped.
> kjournald starting.  Commit interval 5 seconds
> LDISKFS FS on loop0, internal journal
> LDISKFS-fs: mounted filesystem with ordered data mode.
> Lustre: 23679:0:(socknal.c:325:ksocknal_associate_route_conn_locked())
>   Binding 0xc094fa3b 192.148.250.59 to 192.148.250.57
> Lustre: 24003:0:(mds_lov.c:216:mds_lov_connect()) got last object 0 from
>   OST 0
> Lustre: MDT mds1 now serving /dev/loop0 with recovery enabled.
> ....

If the file is already open, renaming it does nothing; Linux will access
the file via the existing descriptor.

To "fail" the MDS, run this on the mds node:

  # lconf --cleanup --failover nomdsfailover.xml

Or, just pull the plug on that system. :)

cliffw

> Lustre Script file:
> ------------------
>
> #!/bin/sh
> #
> #
> # Configure nodes and net
> lmc -o nomdsfailover.xml --add net --node sanjay --nid sanjay.sf.osc.edu --nettype tcp
> lmc -m nomdsfailover.xml --add net --node uma --nid uma.sf.osc.edu --nettype tcp
> lmc -m nomdsfailover.xml --add net --node pria --nid pria.sf.osc.edu --nettype tcp
>
> # Configure OSTs
> # Size is in kilobytes; size of OST should be at least 8MB
> lmc -m nomdsfailover.xml --add ost --node sanjay --ost ost-test --fstype ext3 --dev /tmp/ost --size 50000
>
> # Configure MDS
> # Size is in kilobytes; size of MDS should be at least 8MB
> lmc -m nomdsfailover.xml --add mds --node uma --mds mds1 --fstype ext3 --dev /tmp/mds1 --size 50000
>
> # Configure client
> lmc -m nomdsfailover.xml --add mtpt --node pria --path /mnt/lustre --mds mds1 --ost ost-test
>
> --------------
>
> Thanks,
> Jaya
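
(A sketch of the failover test this implies, using the nomdsfailover.xml config
from Jaya's script; the bare "lconf nomdsfailover.xml" restart step is an
assumption about the 1.4 lconf behaviour rather than something stated in the
thread, so check lconf --help on your install:)

  # On the mds node (uma): stop the MDS as if it had failed, so clients
  # go into recovery instead of seeing a clean shutdown.
  lconf --cleanup --failover nomdsfailover.xml

  # On the client (pria): metadata operations such as ls or touch should
  # now block until the MDS returns.

  # On the mds node (or its configured failover peer): restart the services.
  lconf nomdsfailover.xml
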
Hi,

I am testing mds failover in a three-node cluster (Lustre 1.4.4). As an
initial test, I wanted to see what happens when an mds is removed. After I
started the client, I went back and removed the mds manually. But I could
still list, view and create files in the Lustre file system. Can someone
explain this behaviour? Following are the steps I did:

- Created ost, mds and mounted /mnt/lustre in the client.
- In the client, copied some big files into /mnt/lustre
- In the mds server, renamed /tmp/mds1 as /tmp/mds1.bkup
- But now, back in the client, I could still list, view and create files.
- dmesg in the mds server displays these lines among other things:
- Tested with --failover option and with two mds. Still see the same behavior.

....
Lustre: MDT mds1 has stopped.
kjournald starting.  Commit interval 5 seconds
LDISKFS FS on loop0, internal journal
LDISKFS-fs: mounted filesystem with ordered data mode.
Lustre: 23679:0:(socknal.c:325:ksocknal_associate_route_conn_locked()) Binding 0xc094fa3b 192.148.250.59 to 192.148.250.57
Lustre: 24003:0:(mds_lov.c:216:mds_lov_connect()) got last object 0 from OST 0
Lustre: MDT mds1 now serving /dev/loop0 with recovery enabled.
....

Lustre Script file:
------------------

#!/bin/sh
#
#
# Configure nodes and net
lmc -o nomdsfailover.xml --add net --node sanjay --nid sanjay.sf.osc.edu --nettype tcp
lmc -m nomdsfailover.xml --add net --node uma --nid uma.sf.osc.edu --nettype tcp
lmc -m nomdsfailover.xml --add net --node pria --nid pria.sf.osc.edu --nettype tcp

# Configure OSTs
# Size is in kilobytes; size of OST should be at least 8MB
lmc -m nomdsfailover.xml --add ost --node sanjay --ost ost-test --fstype ext3 --dev /tmp/ost --size 50000

# Configure MDS
# Size is in kilobytes; size of MDS should be at least 8MB
lmc -m nomdsfailover.xml --add mds --node uma --mds mds1 --fstype ext3 --dev /tmp/mds1 --size 50000

# Configure client
lmc -m nomdsfailover.xml --add mtpt --node pria --path /mnt/lustre --mds mds1 --ost ost-test

--------------

Thanks,
Jaya
--
Jaya Natarajan <jaya@osc.edu>
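
(For reference, a sketch of how the config produced by this script is
typically brought up with the Lustre 1.4 tools; the script file name and the
--reformat/--node flags are assumptions based on the 1.4 utilities, not part
of the original mail:)

  # generate nomdsfailover.xml by running the lmc script above
  sh ./nomdsfailover.sh          # hypothetical script name

  # format and start the services on each server node
  lconf --reformat --node sanjay nomdsfailover.xml   # OST node
  lconf --reformat --node uma nomdsfailover.xml      # MDS node

  # mount /mnt/lustre on the client node
  lconf --node pria nomdsfailover.xml
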