Dear All,

I'm in the middle of creating a new Lustre setup as a replacement for our current one. The current one is a single machine with MGS/MDT/OST all living on that one box. In the new setup I have 4 machines: two MDTs and two OSTs.

We want to use keepalived as a failover mechanism between the two MDTs. To keep the MDTs in sync, I'm using a DRBD disk between the two. Keepalived uses a VIP in an active/passive state; in a failover situation the VIP gets transferred to the passive node.

The problem I'm experiencing is that I can't seem to get the VIP listed as a NID, so the OSS can only connect on the real IP, which is unwanted in this situation. Is there an easy way to change the NID on the MGS machine to the VIP? See below for setup details. The last output, from lctl list_nids, is the problem area: where is that NID coming from?

I hope someone can shed some light on this...

Cheers,
Leen

Hosts:
192.168.21.32   fs-mgs-001
192.168.21.33   fs-mgs-002
192.168.21.34   fs-ost-001
192.168.21.35   fs-ost-002
192.168.21.40   fs-mgs-vip

mkfs.lustre --reformat --fsname datafs --mgs --mgsnode=fs-mgs-vip@tcp /dev/VG1/mgs
mkfs.lustre --reformat --fsname datafs --mdt --mgsnode=fs-mgs-vip@tcp /dev/drbd1
mount -t lustre /dev/VG1/mgs mgs/
mount -t lustre /dev/drbd1 /mnt/mdt/

fs-mgs-001:/mnt# lctl dl
  0 UP mgs MGS MGS 9
  1 UP mgc MGC192.168.21.40@tcp 8f8dfecc-44bd-caae-3ed4-cd23168d59ab 5
  2 UP mdt MDS MDS_uuid 3
  3 UP lov datafs-mdtlov datafs-mdtlov_UUID 4
  4 UP mds datafs-MDT0000 datafs-MDT0000_UUID 3

fs-mgs-001:/mnt# lctl list_nids
192.168.21.32@tcp
On Thu, May 20, 2010 at 12:46:42PM +0200, leen smit wrote:
> In the new setup I have 4 machines, two MDTs and two OSTs.
> We want to use keepalived as a failover mechanism between the two MDTs.
> To keep the MDTs in sync, I'm using a DRBD disk between the two.
>
> Keepalived uses a VIP in an active/passive state. In a failover situation
> the VIP gets transferred to the passive one.

Lustre uses stateful client/server connections. You don't need to - and cannot - use a virtual IP. The Lustre protocol already takes care of reconnection & recovery.

> The problem I'm experiencing is that I can't seem to get the VIP listed
> as a NID, so the OSS can only connect on the real IP, which is
> unwanted in this situation. Is there an easy way to change the NID on
> the MGS machine to the VIP?

No, you have to list the NIDs of all the MGS nodes at mkfs time (i.e. "--mgsnode=192.168.21.32@tcp --mgsnode=192.168.21.33@tcp" in your case).

Johann
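[Spelled out with the hostnames and devices from Leen's mail, Johann's suggestion could look roughly like the following. This is only a sketch combining his --mgsnode advice with the --failnode and client mount syntax that comes up later in the thread; it is not a tested recipe.]

# Format the MGS on shared/replicated storage, naming the standby's real NID
mkfs.lustre --reformat --mgs --failnode=192.168.21.33@tcp /dev/VG1/mgs

# The MDT (and every OST) lists *both* possible MGS NIDs instead of a VIP
mkfs.lustre --reformat --fsname datafs --mdt \
    --mgsnode=192.168.21.32@tcp --mgsnode=192.168.21.33@tcp \
    --failnode=192.168.21.33@tcp /dev/drbd1

# Clients likewise give both MGS NIDs at mount time
mount -t lustre 192.168.21.32@tcp:192.168.21.33@tcp:/datafs /data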
On Thu, 2010-05-20 at 12:46 +0200, leen smit wrote:
> Keepalived uses a VIP in an active/passive state. In a failover situation
> the VIP gets transferred to the passive one.

Don't use virtual IPs with Lustre. Lustre clients know how to deal with failover nodes that have different IP addresses, and using a virtual, floating IP address will just confuse it.

b.
Ok, no VIPs then... But how does failover work in Lustre then?

If I set up everything using the real IPs and then mount from a client and bring down the active MGS, the client will just sit there until it comes back up again. As in, there is no failover to the second node. So how does this internal Lustre failover mechanism work?

I've been going through the docs, and I must say there is very little on the failover mechanism, apart from mentions that a separate app should take care of that. That's the reason I'm implementing keepalived...

At this stage I really am clueless, and can only think of creating a TUN interface which will have the VIP address (thus it becomes a real IP, not just a VIP). But I got a feeling that isn't the right approach either... Is there any documentation available where an active/passive MGS setup is described? Is it sufficient to define a --failnode=nid,... at creation time?

Any help would be greatly appreciated!

Leen

On 05/20/2010 01:45 PM, Brian J. Murrell wrote:
> Don't use virtual IPs with Lustre. Lustre clients know how to deal with
> failover nodes that have different IP addresses, and using a virtual,
> floating IP address will just confuse it.
leen smit wrote:
> Ok, no VIPs then... But how does failover work in Lustre then?
> If I set up everything using the real IPs and then mount from a client and
> bring down the active MGS, the client will just sit there until it comes
> back up again. As in, there is no failover to the second node. So how does
> this internal Lustre failover mechanism work?
>
> I've been going through the docs, and I must say there is very little on
> the failover mechanism, apart from mentions that a separate app should
> take care of that. That's the reason I'm implementing keepalived...

Right: the external service needs to keep the "mount" active/healthy on one of the servers. Lustre handles reconnecting clients/servers as long as the volume is mounted where it expects it (i.e., the mkfs node or the --failover node).

> At this stage I really am clueless, and can only think of creating a TUN
> interface which will have the VIP address (thus it becomes a real IP,
> not just a VIP). But I got a feeling that isn't the right approach either...
> Is there any documentation available where an active/passive MGS setup is described?
> Is it sufficient to define a --failnode=nid,... at creation time?

Yep. See Johann's email on the MGS, but for the MDTs and OSTs that's all you have to do (besides listing both MGS NIDs at mkfs time).

For the clients, you specify both MGS NIDs at mount time, so the client can mount regardless of which node has the active MGS.

Kevin
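[Purely as an illustration of what Kevin means by the external service keeping the mount active: the HA tool (keepalived without a VIP, heartbeat, or even a hand-run script) would perform something like the following on the standby MDS node once it is sure the primary is dead. Device and mount-point names are the ones from Leen's first mail; the DRBD resource name 'r0' is made up, and the MGS device is assumed to be on storage both nodes can see.]

#!/bin/sh
# Hypothetical "take over MGS/MDT" action for the standby node.
# The primary must really be down (or fenced) before this runs; mounting
# the same target on two nodes at once would corrupt it.
set -e

drbdadm primary r0                      # promote the replicated MDT disk (resource name assumed)
mount -t lustre /dev/VG1/mgs /mnt/mgs   # start the MGS service on this node
mount -t lustre /dev/drbd1  /mnt/mdt    # start the MDT service on this node

# Nothing needs to be done on clients or OSSes: they reconnect by themselves,
# because this node's NID was given as --failnode / second --mgsnode at format time.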
For clarification, in a two-server configuration:

server1 -> 192.168.2.20  MGS + MDT + OST0
server2 -> 192.168.2.22  OST1
/dev/sdb is a LUN shared between server1 and server2

from server1: mkfs.lustre --mgs --failnode=192.168.2.22 --reformat /dev/sdb1
from server1: mkfs.lustre --reformat --mdt --mgsnode=192.168.2.20 --fsname=prova --failover=192.168.2.22 /dev/sdb4
from server1: mkfs.lustre --reformat --ost --mgsnode=192.168.2.20 --failover=192.168.2.22 --fsname=prova /dev/sdb2
from server2: mkfs.lustre --reformat --ost --mgsnode=192.168.2.20 --failover=192.168.2.20 --fsname=prova /dev/sdb3

from server1: mount -t lustre /dev/sdb1 /lustre/mgs_prova
from server1: mount -t lustre /dev/sdb4 /lustre/mdt_prova
from server1: mount -t lustre /dev/sdb2 /lustre/ost0_prova
from server2: mount -t lustre /dev/sdb3 /lustre/ost1_prova

from client:
modprobe lustre
mount -t lustre 192.168.2.20@tcp:192.168.2.22@tcp:/prova /prova

Now halt server1 and mount MGS, MDT and OST0 on server2; the client should continue its activity without problems.

On 05/20/2010 02:55 PM, Kevin Van Maren wrote:
> Right: the external service needs to keep the "mount" active/healthy on
> one of the servers. [...]
--
Gabriele Paciucci
http://www.linkedin.com/in/paciucci
You need two MGS nodes in the 'mount' command on the clients, e.g.:

mount -t lustre 192.168.1.10@tcp:192.168.1.11@tcp:/lustre /lustre

The client will attempt to connect to the secondary MGS once the primary is not available.

Thanks
Ihara

(5/20/10 9:22 PM), leen smit wrote:
> Ok, no VIPs then... But how does failover work in Lustre then?
> [...]
> Is it sufficient to define a --failnode=nid,... at creation time?
Ok, I started from scratch, using your kind replies as a guideline. Yet, still no failover when bringing down the first MGS. Below are the steps I've taken to set up; hopefully someone here can spot my error. I got rid of keepalived and DRBD (was this wise? or should I keep them for the MGS/MDT syncing?) and set up just Lustre.

Two nodes for MGS/MDT, and two nodes for OSTs.

fs-mgs-001:~# mkfs.lustre --mgs --failnode=fs-mgs-002@tcp --reformat /dev/VG1/mgs
fs-mgs-001:~# mkfs.lustre --mdt --mgsnode=fs-mgs-001@tcp --failnode=fs-mgs-002@tcp --fsname=datafs --reformat /dev/VG1/mdt
fs-mgs-001:~# mount -t lustre /dev/VG1/mgs /mnt/mgs/
fs-mgs-001:~# mount -t lustre /dev/VG1/mdt /mnt/mdt/

fs-mgs-002:~# mkfs.lustre --mgs --failnode=fs-mgs-001@tcp --reformat /dev/VG1/mgs
fs-mgs-002:~# mkfs.lustre --mdt --mgsnode=fs-mgs-001@tcp --failnode=fs-mgs-001@tcp --fsname=datafs --reformat /dev/VG1/mdt
fs-mgs-002:~# mount -t lustre /dev/VG1/mgs /mnt/mgs/
fs-mgs-002:~# mount -t lustre /dev/VG1/mdt /mnt/mdt/

fs-ost-001:~# mkfs.lustre --ost --mgsnode=fs-mgs-001@tcp --mgsnode=fs-mgs-002@tcp --failnode=fs-ost-002@tcp --reformat --fsname=datafs /dev/VG1/ost1
fs-ost-001:~# mount -t lustre /dev/VG1/ost1 /mnt/ost/

fs-ost-002:~# mkfs.lustre --ost --mgsnode=fs-mgs-001@tcp --mgsnode=fs-mgs-002@tcp --failnode=fs-ost-001@tcp --reformat --fsname=datafs /dev/VG1/ost1
fs-ost-002:~# mount -t lustre /dev/VG1/ost1 /mnt/ost/

fs-mgs-001:~# lctl dl
  0 UP mgs MGS MGS 7
  1 UP mgc MGC192.168.21.33@tcp 5b8fb365-ae8e-9742-f374-539d8876276f 5
  2 UP mgc MGC127.0.1.1@tcp 380bc932-eaf3-9955-7ff0-af96067a2487 5
  3 UP mdt MDS MDS_uuid 3
  4 UP lov datafs-mdtlov datafs-mdtlov_UUID 4
  5 UP mds datafs-MDT0000 datafs-MDT0000_UUID 5
  6 UP osc datafs-OST0000-osc datafs-mdtlov_UUID 5
  7 UP osc datafs-OST0001-osc datafs-mdtlov_UUID 5

fs-mgs-001:~# lctl list_nids
192.168.21.32@tcp

client:~# mount -t lustre 192.168.21.32@tcp:192.168.21.33@tcp:/datafs /data
client:~# time cp test.file /data/
real    0m47.793s
user    0m0.001s
sys     0m3.155s

So far, so good. Let's try that again, now bringing down mgs-001:

client:~# time cp test.file /data/

fs-mgs-001:~# umount /mnt/mdt && umount /mnt/mgs

fs-mgs-002:~# mount -t lustre /dev/VG1/mgs /mnt/mgs
fs-mgs-002:~# mount -t lustre /dev/VG1/mdt /mnt/mdt
fs-mgs-002:~# lctl dl
  0 UP mgs MGS MGS 5
  1 UP mgc MGC192.168.21.32@tcp 82b34916-ed89-f5b9-026e-7f8e1370765f 5
  2 UP mdt MDS MDS_uuid 3
  3 UP lov datafs-mdtlov datafs-mdtlov_UUID 4
  4 UP mds datafs-MDT0000 datafs-MDT0000_UUID 3

The OSTs are missing here, so I (try to) remount these too:

fs-ost-001:~# umount /mnt/ost/
fs-ost-001:~# mount -t lustre /dev/VG1/ost1 /mnt/ost/
mount.lustre: mount /dev/mapper/VG1-ost1 at /mnt/ost failed: No such device or address
The target service failed to start (bad config log?) (/dev/mapper/VG1-ost1). See /var/log/messages.

After this I can only get back to a running state by unmounting everything on mgs-002 and remounting on mgs-001. What am I missing here? Am I messing things up by creating two MGSes, one on each MGS node?
Leen

On 05/20/2010 03:40 PM, Gabriele Paciucci wrote:
> For clarification, in a two-server configuration:
> [...]
> Now halt server1 and mount MGS, MDT and OST0 on server2; the client
> should continue its activity without problems.
Hi,
be careful with LVM: you have to export and import the volume group when you move it from one machine to the other!!!

Please refer to: http://kbase.redhat.com/faq/docs/DOC-4124

On 05/21/2010 11:57 AM, leen smit wrote:
> Ok, I started from scratch, using your kind replies as a guideline.
> Yet, still no failover when bringing down the first MGS.
> [...]
> fs-mgs-001:~# mkfs.lustre --mgs --failnode=fs-mgs-002@tcp --reformat /dev/VG1/mgs
> fs-mgs-001:~# mkfs.lustre --mdt --mgsnode=fs-mgs-001@tcp --failnode=fs-mgs-002@tcp --fsname=datafs --reformat /dev/VG1/mdt
> fs-mgs-001:~# mount -t lustre /dev/VG1/mgs /mnt/mgs/
> fs-mgs-001:~# mount -t lustre /dev/VG1/mdt /mnt/mdt/
>
> fs-mgs-002:~# mkfs.lustre --mgs --failnode=fs-mgs-001@tcp --reformat /dev/VG1/mgs
> fs-mgs-002:~# mkfs.lustre --mdt --mgsnode=fs-mgs-001@tcp --failnode=fs-mgs-001@tcp --fsname=datafs --reformat /dev/VG1/mdt
> fs-mgs-002:~# mount -t lustre /dev/VG1/mgs /mnt/mgs/
> fs-mgs-002:~# mount -t lustre /dev/VG1/mdt /mnt/mdt/

This is an error ^ .. don't do it!!!

> fs-ost-001:~# mkfs.lustre --ost --mgsnode=fs-mgs-001@tcp --mgsnode=fs-mgs-002@tcp --failnode=fs-ost-002@tcp --reformat --fsname=datafs /dev/VG1/ost1
> fs-ost-001:~# mount -t lustre /dev/VG1/ost1 /mnt/ost/
>
> fs-ost-002:~# mkfs.lustre --ost --mgsnode=fs-mgs-001@tcp --mgsnode=fs-mgs-002@tcp --failnode=fs-ost-001@tcp --reformat --fsname=datafs /dev/VG1/ost1
> fs-ost-002:~# mount -t lustre /dev/VG1/ost1 /mnt/ost/

This is an error ^ .. don't do it!!! The correct way is (WARNING: please use the IP addresses):

fs-mgs-001:~# mkfs.lustre --mgs --failnode=fs-mgs-002@tcp --reformat /dev/VG1/mgs
fs-mgs-001:~# mount -t lustre /dev/VG1/mgs /mnt/mgs/

fs-mgs-001:~# mkfs.lustre --mdt --mgsnode=fs-mgs-001@tcp --failnode=fs-mgs-002@tcp --fsname=datafs --reformat /dev/VG1/mdt
fs-mgs-001:~# mount -t lustre /dev/VG1/mdt /mnt/mdt/

fs-ost-001:~# mkfs.lustre --ost --mgsnode=fs-mgs-001@tcp --failnode=fs-ost-002@tcp --reformat --fsname=datafs /dev/VG1/ost1
fs-ost-001:~# mount -t lustre /dev/VG1/ost1 /mnt/ost/

Just this, nothing to do on the second node!!! Then on the client:

mount -t lustre fs-mgs-001@tcp:fs-mgs-002@tcp:/datafs /data

Bye
--
Gabriele Paciucci
http://www.linkedin.com/in/paciucci
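[To make the vgexport/vgimport step behind Gabriele's warning concrete: with a shared LUN carrying an LVM volume group, the failover sequence is roughly the following. A sketch only, assuming the volume group is called VG1 as in Leen's commands; the Red Hat document he links describes the procedure in full.]

# On the node giving up the targets (if it is still reachable):
umount /mnt/mdt && umount /mnt/mgs
vgchange -a n VG1        # deactivate all logical volumes in the group
vgexport VG1             # mark the volume group as exported

# On the node taking over:
pvscan                   # rescan so the exported VG becomes visible
vgimport VG1             # import the volume group on this host
vgchange -a y VG1        # activate its logical volumes
mount -t lustre /dev/VG1/mgs /mnt/mgs
mount -t lustre /dev/VG1/mdt /mnt/mdt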
Wouldn't it be easier then to use DRBD on the MGS disk, so you don't have to move the LVM over to a new node?

On 05/21/2010 12:14 PM, Gabriele Paciucci wrote:
> Hi,
> be careful with LVM: you have to export and import the volume group when
> you move it from one machine to the other!!!
>
> Please refer to: http://kbase.redhat.com/faq/docs/DOC-4124
>> In the new setup I have 4 machines, two MDTs and two OSTs.
>> We want to use keepalived as a failover mechanism between the
>> two MDTs. To keep the MDTs in sync, I'm using a DRBD disk
>> between the two. Keepalived uses a VIP in an active/passive
>> state. In a failover situation the VIP gets transferred to
>> the passive one.

> Lustre uses stateful client/server connections. You don't need
> to - and cannot - use a virtual IP. The Lustre protocol
> already takes care of reconnection & recovery.

Sure, for access-to-the-server purposes, but there is a good way to achieve *network* routing failover using something like VIPs (in an IP-only setup). What you do is simple:

#1 On a server with multiple interfaces, for example 192.168.1.1 and 192.168.2.1, add a 'dummy0' interface, e.g. 192.168.42.1.

#2 Run OSPF on the server, advertising a *host route* to the 'dummy0' interface, 192.168.42.1/32 (as well as the real interfaces of course).

#3 Bind the Lustre daemons to the 'dummy0' interface address. As long as there is a route to it, all network reconfigurations will be transparent to Lustre.

Of course one can have, for MGSes, MDSes, and OSSes, two or more servers with different 'dummy0' addresses to use as different NIDs, to let Lustre handle *server* (as opposed to network) failures. The price is a host route per server, but that usually is quite insignificant.

The only problem with the setup above is that #3 "Bind the Lustre daemons to the 'dummy0' interface address" seems impossible to achieve (not tried directly, so I am told), and while client packets sent to the 'dummy0' address reach the server in a fully resilient way, reply packets often/usually have as source address one of the addresses of the real interfaces, instead of that of the 'dummy0' interface, and this of course breaks the scheme.

Note that the scheme above is fairly valuable, as it gives full *dynamic* network resilience. Is there a simple way to get the Lustre daemons in the kernel to bind to a specific address instead of [0.0.0.0], like most server daemons in UNIX/Linux can?

[ ... ]
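[For what it's worth, LNET can at least be told which interface, and therefore which local address, to use as its NID via the 'networks' module option, so the scheme above might be approximated as follows. Whether the socket LND then also sources its reply packets from that address is exactly the open question raised here, so treat this as an untested sketch; interface name and addresses are the example ones from the message, and the modprobe.d file name is arbitrary.]

# Create the dummy interface and give it the per-server service address
modprobe dummy
ip link add dummy0 type dummy 2>/dev/null || true   # modprobe may already have created dummy0
ip addr add 192.168.42.1/32 dev dummy0
ip link set dummy0 up

# (Advertise the /32 host route via the local OSPF daemon, e.g. ospfd from quagga;
#  that configuration is outside the scope of this sketch.)

# Tell LNET to use dummy0 for the tcp network, so this server's NID becomes
# 192.168.42.1@tcp rather than the address of a physical interface.
echo 'options lnet networks="tcp0(dummy0)"' > /etc/modprobe.d/lustre-lnet.conf

# After unloading and reloading the lustre/lnet modules, verify:
lctl list_nids    # expected to show 192.168.42.1@tcp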