Dear All,

I'm in the middle of creating a new Lustre setup as a replacement for our current one. The current one is a single machine with MGS/MDT/OST all living on that one box. In the new setup I have 4 machines: two MDTs and two OSTs.

We want to use keepalived as a failover mechanism between the two MDTs. To keep the MDTs in sync, I'm using a DRBD disk between the two. Keepalived uses a VIP in an active/passive state; in a failover situation the VIP gets transferred to the passive node.

The problem I'm experiencing is that I can't seem to get the VIP listed as a NID, so the OSS can only connect on the real IP, which is unwanted in this situation. Is there an easy way to change the NID on the MGS machine to the VIP? See below for setup details. The last output, from lctl list_nids, is the problem area: where is that NID coming from?

I hope someone can shed some light on this...

Cheers,
Leen

Hosts:
192.168.21.32   fs-mgs-001
192.168.21.33   fs-mgs-002
192.168.21.34   fs-ost-001
192.168.21.35   fs-ost-002
192.168.21.40   fs-mgs-vip

mkfs.lustre --reformat --fsname datafs --mgs --mgsnode=fs-mgs-vip@tcp /dev/VG1/mgs
mkfs.lustre --reformat --fsname datafs --mdt --mgsnode=fs-mgs-vip@tcp /dev/drbd1
mount -t lustre /dev/VG1/mgs mgs/
mount -t lustre /dev/drbd1 /mnt/mdt/

fs-mgs-001:/mnt# lctl dl
  0 UP mgs MGS MGS 9
  1 UP mgc MGC192.168.21.40@tcp 8f8dfecc-44bd-caae-3ed4-cd23168d59ab 5
  2 UP mdt MDS MDS_uuid 3
  3 UP lov datafs-mdtlov datafs-mdtlov_UUID 4
  4 UP mds datafs-MDT0000 datafs-MDT0000_UUID 3

fs-mgs-001:/mnt# lctl list_nids
192.168.21.32@tcp
On Thu, May 20, 2010 at 12:46:42PM +0200, leen smit wrote:
> In the new setup I have 4 machines, two MDTs and two OSTs.
> We want to use keepalived as a failover mechanism between the two MDTs.
> To keep the MDTs in sync, I'm using a DRBD disk between the two.
>
> Keepalived uses a VIP in an active/passive state. In a failover situation
> the VIP gets transferred to the passive one.

Lustre uses stateful client/server connections. You don't need to - and cannot - use a virtual IP. The Lustre protocol already takes care of reconnection & recovery.

> The problem I'm experiencing is that I can't seem to get the VIP listed
> as a NID, so the OSS can only connect on the real IP, which is
> unwanted in this situation. Is there an easy way to change the NID on
> the MGS machine to the VIP?

No, you have to list the NIDs of all the MGS nodes at mkfs time (i.e. "--mgsnode=192.168.21.32@tcp --mgsnode=192.168.21.33@tcp" in your case).

Johann
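[Spelled out with the hostnames and devices from Leen's mail, Johann's suggestion could look roughly like the following. This is only a sketch combining his --mgsnode advice with the --failnode and client mount syntax that comes up later in the thread; it is not a tested recipe.]

# Format the MGS on shared/replicated storage, naming the standby's real NID
mkfs.lustre --reformat --mgs --failnode=192.168.21.33@tcp /dev/VG1/mgs

# The MDT (and every OST) lists *both* possible MGS NIDs instead of a VIP
mkfs.lustre --reformat --fsname datafs --mdt \
    --mgsnode=192.168.21.32@tcp --mgsnode=192.168.21.33@tcp \
    --failnode=192.168.21.33@tcp /dev/drbd1

# Clients likewise give both MGS NIDs at mount time
mount -t lustre 192.168.21.32@tcp:192.168.21.33@tcp:/datafs /data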
On Thu, 2010-05-20 at 12:46 +0200, leen smit wrote:
> Keepalived uses a VIP in an active/passive state. In a failover situation
> the VIP gets transferred to the passive one.

Don't use virtual IPs with Lustre. Lustre clients know how to deal with failover nodes that have different IP addresses, and using a virtual, floating IP address will just confuse it.

b.
Ok, no VIPs then... But how does failover work in Lustre then?

If I set up everything using the real IPs and then mount from a client and bring down the active MGS, the client will just sit there until it comes back up again. As in, there is no failover to the second node. So how does this internal Lustre failover mechanism work?

I've been going through the docs, and I must say there is very little on the failover mechanism, apart from mentions that a separate app should take care of that. That's the reason I'm implementing keepalived...

At this stage I really am clueless, and can only think of creating a TUN interface which will have the VIP address (thus it becomes a real IP, not just a VIP). But I got a feeling that isn't the right approach either... Is there any documentation available where an active/passive MGS setup is described? Is it sufficient to define a --failnode=nid,... at creation time?

Any help would be greatly appreciated!

Leen

On 05/20/2010 01:45 PM, Brian J. Murrell wrote:
> Don't use virtual IPs with Lustre. Lustre clients know how to deal with
> failover nodes that have different IP addresses, and using a virtual,
> floating IP address will just confuse it.
leen smit wrote:
> Ok, no VIPs then... But how does failover work in Lustre then?
> If I set up everything using the real IPs and then mount from a client and
> bring down the active MGS, the client will just sit there until it comes
> back up again. As in, there is no failover to the second node. So how does
> this internal Lustre failover mechanism work?
>
> I've been going through the docs, and I must say there is very little on
> the failover mechanism, apart from mentions that a separate app should
> take care of that. That's the reason I'm implementing keepalived...

Right: the external service needs to keep the "mount" active/healthy on one of the servers. Lustre handles reconnecting clients/servers as long as the volume is mounted where it expects it (i.e., the mkfs node or the --failover node).

> At this stage I really am clueless, and can only think of creating a TUN
> interface which will have the VIP address (thus it becomes a real IP,
> not just a VIP). But I got a feeling that isn't the right approach either...
> Is there any documentation available where an active/passive MGS setup is described?
> Is it sufficient to define a --failnode=nid,... at creation time?

Yep. See Johann's email on the MGS, but for the MDTs and OSTs that's all you have to do (besides listing both MGS NIDs at mkfs time).

For the clients, you specify both MGS NIDs at mount time, so the client can mount regardless of which node has the active MGS.

Kevin
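[Purely as an illustration of what Kevin means by the external service keeping the mount active: the HA tool (keepalived without a VIP, heartbeat, or even a hand-run script) would perform something like the following on the standby MDS node once it is sure the primary is dead. Device and mount-point names are the ones from Leen's first mail; the DRBD resource name 'r0' is made up, and the MGS device is assumed to be on storage both nodes can see.]

#!/bin/sh
# Hypothetical "take over MGS/MDT" action for the standby node.
# The primary must really be down (or fenced) before this runs; mounting
# the same target on two nodes at once would corrupt it.
set -e

drbdadm primary r0                      # promote the replicated MDT disk (resource name assumed)
mount -t lustre /dev/VG1/mgs /mnt/mgs   # start the MGS service on this node
mount -t lustre /dev/drbd1  /mnt/mdt    # start the MDT service on this node

# Nothing needs to be done on clients or OSSes: they reconnect by themselves,
# because this node's NID was given as --failnode / second --mgsnode at format time.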
For clarification, in a two-server configuration:

server1 -> 192.168.2.20  MGS + MDT + OST0
server2 -> 192.168.2.22  OST1
/dev/sdb is a LUN shared between server1 and server2

from server1: mkfs.lustre --mgs --failnode=192.168.2.22 --reformat /dev/sdb1
from server1: mkfs.lustre --reformat --mdt --mgsnode=192.168.2.20 --fsname=prova --failover=192.168.2.22 /dev/sdb4
from server1: mkfs.lustre --reformat --ost --mgsnode=192.168.2.20 --failover=192.168.2.22 --fsname=prova /dev/sdb2
from server2: mkfs.lustre --reformat --ost --mgsnode=192.168.2.20 --failover=192.168.2.20 --fsname=prova /dev/sdb3

from server1: mount -t lustre /dev/sdb1 /lustre/mgs_prova
from server1: mount -t lustre /dev/sdb4 /lustre/mdt_prova
from server1: mount -t lustre /dev/sdb2 /lustre/ost0_prova
from server2: mount -t lustre /dev/sdb3 /lustre/ost1_prova

from client:
modprobe lustre
mount -t lustre 192.168.2.20@tcp:192.168.2.22@tcp:/prova /prova

Now halt server1 and mount MGS, MDT and OST0 on server2; the client should continue its activity without problems.

On 05/20/2010 02:55 PM, Kevin Van Maren wrote:
> Right: the external service needs to keep the "mount" active/healthy on
> one of the servers. [...]
--
Gabriele Paciucci
http://www.linkedin.com/in/paciucci
You need two MGS nodes in the 'mount' command on the clients, e.g.:

mount -t lustre 192.168.1.10@tcp:192.168.1.11@tcp:/lustre /lustre

The client will attempt to connect to the secondary MGS once the primary is not available.

Thanks
Ihara

(5/20/10 9:22 PM), leen smit wrote:
> Ok, no VIPs then... But how does failover work in Lustre then?
> [...]
> Is it sufficient to define a --failnode=nid,... at creation time?
Ok, I started from scratch, using your kind replies as a guideline. Yet, still no failover when bringing down the first MGS. Below are the steps I've taken to set up; hopefully someone here can spot my error. I got rid of keepalived and DRBD (was this wise? or should I keep them for the MGS/MDT syncing?) and set up just Lustre.

Two nodes for MGS/MDT, and two nodes for OSTs.

fs-mgs-001:~# mkfs.lustre --mgs --failnode=fs-mgs-002@tcp --reformat /dev/VG1/mgs
fs-mgs-001:~# mkfs.lustre --mdt --mgsnode=fs-mgs-001@tcp --failnode=fs-mgs-002@tcp --fsname=datafs --reformat /dev/VG1/mdt
fs-mgs-001:~# mount -t lustre /dev/VG1/mgs /mnt/mgs/
fs-mgs-001:~# mount -t lustre /dev/VG1/mdt /mnt/mdt/

fs-mgs-002:~# mkfs.lustre --mgs --failnode=fs-mgs-001@tcp --reformat /dev/VG1/mgs
fs-mgs-002:~# mkfs.lustre --mdt --mgsnode=fs-mgs-001@tcp --failnode=fs-mgs-001@tcp --fsname=datafs --reformat /dev/VG1/mdt
fs-mgs-002:~# mount -t lustre /dev/VG1/mgs /mnt/mgs/
fs-mgs-002:~# mount -t lustre /dev/VG1/mdt /mnt/mdt/

fs-ost-001:~# mkfs.lustre --ost --mgsnode=fs-mgs-001@tcp --mgsnode=fs-mgs-002@tcp --failnode=fs-ost-002@tcp --reformat --fsname=datafs /dev/VG1/ost1
fs-ost-001:~# mount -t lustre /dev/VG1/ost1 /mnt/ost/

fs-ost-002:~# mkfs.lustre --ost --mgsnode=fs-mgs-001@tcp --mgsnode=fs-mgs-002@tcp --failnode=fs-ost-001@tcp --reformat --fsname=datafs /dev/VG1/ost1
fs-ost-002:~# mount -t lustre /dev/VG1/ost1 /mnt/ost/

fs-mgs-001:~# lctl dl
  0 UP mgs MGS MGS 7
  1 UP mgc MGC192.168.21.33@tcp 5b8fb365-ae8e-9742-f374-539d8876276f 5
  2 UP mgc MGC127.0.1.1@tcp 380bc932-eaf3-9955-7ff0-af96067a2487 5
  3 UP mdt MDS MDS_uuid 3
  4 UP lov datafs-mdtlov datafs-mdtlov_UUID 4
  5 UP mds datafs-MDT0000 datafs-MDT0000_UUID 5
  6 UP osc datafs-OST0000-osc datafs-mdtlov_UUID 5
  7 UP osc datafs-OST0001-osc datafs-mdtlov_UUID 5

fs-mgs-001:~# lctl list_nids
192.168.21.32@tcp

client:~# mount -t lustre 192.168.21.32@tcp:192.168.21.33@tcp:/datafs /data
client:~# time cp test.file /data/
real    0m47.793s
user    0m0.001s
sys     0m3.155s

So far, so good. Let's try that again, now bringing down mgs-001:

client:~# time cp test.file /data/

fs-mgs-001:~# umount /mnt/mdt && umount /mnt/mgs

fs-mgs-002:~# mount -t lustre /dev/VG1/mgs /mnt/mgs
fs-mgs-002:~# mount -t lustre /dev/VG1/mdt /mnt/mdt
fs-mgs-002:~# lctl dl
  0 UP mgs MGS MGS 5
  1 UP mgc MGC192.168.21.32@tcp 82b34916-ed89-f5b9-026e-7f8e1370765f 5
  2 UP mdt MDS MDS_uuid 3
  3 UP lov datafs-mdtlov datafs-mdtlov_UUID 4
  4 UP mds datafs-MDT0000 datafs-MDT0000_UUID 3

The OSTs are missing here, so I (try to) remount these too:

fs-ost-001:~# umount /mnt/ost/
fs-ost-001:~# mount -t lustre /dev/VG1/ost1 /mnt/ost/
mount.lustre: mount /dev/mapper/VG1-ost1 at /mnt/ost failed: No such device or address
The target service failed to start (bad config log?) (/dev/mapper/VG1-ost1). See /var/log/messages.

After this I can only get back to a running state by unmounting everything on mgs-002 and remounting on mgs-001. What am I missing here? Am I messing things up by creating two MGSes, one on each MGS node?
Leen

On 05/20/2010 03:40 PM, Gabriele Paciucci wrote:
> For clarification, in a two-server configuration:
> [...]
> Now halt server1 and mount MGS, MDT and OST0 on server2; the client
> should continue its activity without problems.
Hi,
be careful with LVM: you have to export and import the volume group when you move it from one machine to the other!!!

Please refer to: http://kbase.redhat.com/faq/docs/DOC-4124

On 05/21/2010 11:57 AM, leen smit wrote:
> Ok, I started from scratch, using your kind replies as a guideline.
> Yet, still no failover when bringing down the first MGS.
> [...]
> fs-mgs-001:~# mkfs.lustre --mgs --failnode=fs-mgs-002@tcp --reformat /dev/VG1/mgs
> fs-mgs-001:~# mkfs.lustre --mdt --mgsnode=fs-mgs-001@tcp --failnode=fs-mgs-002@tcp --fsname=datafs --reformat /dev/VG1/mdt
> fs-mgs-001:~# mount -t lustre /dev/VG1/mgs /mnt/mgs/
> fs-mgs-001:~# mount -t lustre /dev/VG1/mdt /mnt/mdt/
>
> fs-mgs-002:~# mkfs.lustre --mgs --failnode=fs-mgs-001@tcp --reformat /dev/VG1/mgs
> fs-mgs-002:~# mkfs.lustre --mdt --mgsnode=fs-mgs-001@tcp --failnode=fs-mgs-001@tcp --fsname=datafs --reformat /dev/VG1/mdt
> fs-mgs-002:~# mount -t lustre /dev/VG1/mgs /mnt/mgs/
> fs-mgs-002:~# mount -t lustre /dev/VG1/mdt /mnt/mdt/

This is an error ^ .. don't do it!!!

> fs-ost-001:~# mkfs.lustre --ost --mgsnode=fs-mgs-001@tcp --mgsnode=fs-mgs-002@tcp --failnode=fs-ost-002@tcp --reformat --fsname=datafs /dev/VG1/ost1
> fs-ost-001:~# mount -t lustre /dev/VG1/ost1 /mnt/ost/
>
> fs-ost-002:~# mkfs.lustre --ost --mgsnode=fs-mgs-001@tcp --mgsnode=fs-mgs-002@tcp --failnode=fs-ost-001@tcp --reformat --fsname=datafs /dev/VG1/ost1
> fs-ost-002:~# mount -t lustre /dev/VG1/ost1 /mnt/ost/

This is an error ^ .. don't do it!!! The correct way is (WARNING: please use the IP addresses):

fs-mgs-001:~# mkfs.lustre --mgs --failnode=fs-mgs-002@tcp --reformat /dev/VG1/mgs
fs-mgs-001:~# mount -t lustre /dev/VG1/mgs /mnt/mgs/

fs-mgs-001:~# mkfs.lustre --mdt --mgsnode=fs-mgs-001@tcp --failnode=fs-mgs-002@tcp --fsname=datafs --reformat /dev/VG1/mdt
fs-mgs-001:~# mount -t lustre /dev/VG1/mdt /mnt/mdt/

fs-ost-001:~# mkfs.lustre --ost --mgsnode=fs-mgs-001@tcp --failnode=fs-ost-002@tcp --reformat --fsname=datafs /dev/VG1/ost1
fs-ost-001:~# mount -t lustre /dev/VG1/ost1 /mnt/ost/

Just this, nothing to do on the second node!!! Then on the client:

mount -t lustre fs-mgs-001@tcp:fs-mgs-002@tcp:/datafs /data

Bye
--
Gabriele Paciucci
http://www.linkedin.com/in/paciucci
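[To make the vgexport/vgimport step behind Gabriele's warning concrete: with a shared LUN carrying an LVM volume group, the failover sequence is roughly the following. A sketch only, assuming the volume group is called VG1 as in Leen's commands; the Red Hat document he links describes the procedure in full.]

# On the node giving up the targets (if it is still reachable):
umount /mnt/mdt && umount /mnt/mgs
vgchange -a n VG1        # deactivate all logical volumes in the group
vgexport VG1             # mark the volume group as exported

# On the node taking over:
pvscan                   # rescan so the exported VG becomes visible
vgimport VG1             # import the volume group on this host
vgchange -a y VG1        # activate its logical volumes
mount -t lustre /dev/VG1/mgs /mnt/mgs
mount -t lustre /dev/VG1/mdt /mnt/mdt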
Wouldn't it be easier then to use DRBD on the MGS disk, so you don't have to move the LVM over to a new node?

On 05/21/2010 12:14 PM, Gabriele Paciucci wrote:
> Hi,
> be careful with LVM: you have to export and import the volume group when
> you move it from one machine to the other!!!
>
> Please refer to: http://kbase.redhat.com/faq/docs/DOC-4124
>> In the new setup I have 4 machines, two MDTs and two OSTs.
>> We want to use keepalived as a failover mechanism between the
>> two MDTs. To keep the MDTs in sync, I'm using a DRBD disk
>> between the two. Keepalived uses a VIP in an active/passive
>> state. In a failover situation the VIP gets transferred to
>> the passive one.

> Lustre uses stateful client/server connections. You don't need
> to - and cannot - use a virtual IP. The Lustre protocol
> already takes care of reconnection & recovery.

Sure, for access-to-the-server purposes, but there is a good way to achieve *network* routing failover using something like VIPs (in an IP-only setup). What you do is simple:

#1 On a server with multiple interfaces, for example 192.168.1.1 and 192.168.2.1, add a 'dummy0' interface, e.g. 192.168.42.1.

#2 Run OSPF on the server, advertising a *host route* to the 'dummy0' interface, 192.168.42.1/32 (as well as the real interfaces of course).

#3 Bind the Lustre daemons to the 'dummy0' interface address. As long as there is a route to it, all network reconfigurations will be transparent to Lustre.

Of course one can have, for MGSes, MDSes, and OSSes, two or more servers with different 'dummy0' addresses to use as different NIDs, to let Lustre handle *server* (as opposed to network) failures. The price is a host route per server, but that usually is quite insignificant.

The only problem with the setup above is that #3 "Bind the Lustre daemons to the 'dummy0' interface address" seems impossible to achieve (not tried directly, so I am told), and while client packets sent to the 'dummy0' address reach the server in a fully resilient way, reply packets often/usually have as source address one of the addresses of the real interfaces, instead of that of the 'dummy0' interface, and this of course breaks the scheme.

Note that the scheme above is fairly valuable, as it gives full *dynamic* network resilience. Is there a simple way to get the Lustre daemons in the kernel to bind to a specific address instead of [0.0.0.0], like most server daemons in UNIX/Linux can?

[ ... ]
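[For what it's worth, LNET can at least be told which interface, and therefore which local address, to use as its NID via the 'networks' module option, so the scheme above might be approximated as follows. Whether the socket LND then also sources its reply packets from that address is exactly the open question raised here, so treat this as an untested sketch; interface name and addresses are the example ones from the message, and the modprobe.d file name is arbitrary.]

# Create the dummy interface and give it the per-server service address
modprobe dummy
ip link add dummy0 type dummy 2>/dev/null || true   # modprobe may already have created dummy0
ip addr add 192.168.42.1/32 dev dummy0
ip link set dummy0 up

# (Advertise the /32 host route via the local OSPF daemon, e.g. ospfd from quagga;
#  that configuration is outside the scope of this sketch.)

# Tell LNET to use dummy0 for the tcp network, so this server's NID becomes
# 192.168.42.1@tcp rather than the address of a physical interface.
echo 'options lnet networks="tcp0(dummy0)"' > /etc/modprobe.d/lustre-lnet.conf

# After unloading and reloading the lustre/lnet modules, verify:
lctl list_nids    # expected to show 192.168.42.1@tcp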