Hi Guys,

My MDT is set up with LVM, and I was able to test failover based on the Volume Group failing on my MDS (by unplugging both fibre cables). For my OSTs, however, I created the filesystems directly on the SAN LUNs. When I unplug the fibre cables on my OSS, heartbeat does not detect the failure because the filesystem still shows as mounted. Is there some way to trigger a failure based on multipath failing on the OSS?

Any assistance would be greatly appreciated.

Thanks in advance,
-J
Jagga Soorma wrote:
> Hi Guys,
>
> My MDT is setup with LVM and I was able to test failover based on the
> Volume Group failing on my MDS (by unplugging both fibre cables).
> However, for my OSTs, I have created filesystems directly on the SAN
> luns and when I unplug the fibre cables on my OSS, heartbeat does not
> detect failure for the filesystem since it shows as mounted. Is there
> somehow we can trigger a failure based on multipath failing on the OSS?

Hi-

It depends on the version of heartbeat you are using. Heartbeat v1 did not do any resource-level monitoring, and if that is what you are using, you are out of luck.

If you are using the v2 CRM and/or Pacemaker, you have two options:

1. Modify the monitor operation of the Filesystem OCF script so that it checks the actual health of the filesystem and/or multipath in addition to the status of the mount, and returns accordingly. The Filesystem OCF agent is located at /usr/lib/ocf/resource.d/heartbeat/Filesystem
2. Create your own resource agent that interacts with dm/multipath to start/stop/monitor it. Then constrain that resource to start before, stop after, and run together with the Filesystem resource. The filesystem will then depend on the health of the multipath resource.

I recommend the second for the sake of thoroughness. Including multipath monitoring in the Filesystem OCF agent may "just work", but it leaves room for other multipath-related failures to go unnoticed. Writing your own OCF agent is fairly straightforward and is documented somewhere on www.clusterlabs.org. There is an OCF script that does the same for LVM, which would serve as a good example of what needs to be done. Or maybe someone else has already created one? The Linux-HA or Pacemaker lists might be a good place to ask.

Good luck

--
Adam Gandelman
LINBIT | Your Way to High Availability
http://www.linbit.com
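As a rough illustration of the monitor step in the second option above, the logic could be sketched as follows. This is only a sketch under stated assumptions: real OCF agents are usually shell scripts, and Python is used here just to show the decision logic. The names `count_paths`, `monitor_status`, and `SAMPLE` are made up for this example, the sample text only imitates what `multipath -ll` prints (the exact layout differs between multipath-tools versions), and the regular expression is a guess at a pattern that fits the common layouts. The exit codes 0, 1, and 7 are the standard OCF values for success, generic error, and not running.

```python
import re

# Standard OCF exit codes used by resource agents.
OCF_SUCCESS = 0
OCF_ERR_GENERIC = 1
OCF_NOT_RUNNING = 7

# Assumption: a path line carries an H:C:T:L address, a device name and a
# major:minor pair, followed by state words such as "active ready".
PATH_LINE = re.compile(r"(\d+:\d+:\d+:\d+)\s+(\S+)\s+\d+:\d+\s+(.*)$")

# Illustrative text only; it imitates `multipath -ll` output for one map.
SAMPLE = """\
mpatha (360000000000000001) dm-2 VENDOR,PRODUCT
size=100G features='0' hwhandler='0' wp=rw
|-+- policy='round-robin 0' prio=1 status=active
| `- 3:0:0:1 sdb 8:16 active ready running
`-+- policy='round-robin 0' prio=1 status=enabled
  `- 4:0:0:1 sdc 8:32 failed faulty offline
"""


def count_paths(multipath_output):
    """Return (active, failed) path counts parsed from the given text."""
    active = failed = 0
    for line in multipath_output.splitlines():
        match = PATH_LINE.search(line)
        if not match:
            continue  # header, size or path-group line
        state = match.group(3)
        if "failed" in state or "faulty" in state:
            failed += 1
        elif "active" in state:
            active += 1
    return active, failed


def monitor_status(multipath_output):
    """Map the parsed path counts to an OCF exit code."""
    active, failed = count_paths(multipath_output)
    if active == 0 and failed == 0:
        return OCF_NOT_RUNNING   # no paths listed: map not present
    if active == 0:
        return OCF_ERR_GENERIC   # every listed path is down
    return OCF_SUCCESS           # at least one usable path remains


if __name__ == "__main__":
    print(monitor_status(SAMPLE))
```

In a real agent the text would come from running the multipath command for the map in question, and the agent would exit with the returned code so that the cluster manager can react to it.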
On Thursday 21 January 2010, Adam Gandelman wrote:
> If using v2 CRM and/or Pacemaker, you have two options:
>
> 1, Modify the Filesystem OCF script's monitor operation to check the
> actual health of the filesystem and/or multipath in addition to the
> status of the mount and return accordingly. The Filesystem OCF agent
> is located at /usr/lib/ocf/resource.d/heartbeat/Filesystem
> 2, Create your own resource agent that interacts with dm/multipath to
> start/stop/monitor it. Then constrain the resource to start before/stop
> after and run with the Filesystem resource. Then the filesystem will be
> dependent on the health of the multipath resource.

I guess you want to use the pacemaker agent I posted into this bugzilla:

https://bugzilla.lustre.org/show_bug.cgi?id=20807

It does not interact with multipath, but it knows about several Lustre details. How would you monitor multipath? If one of your several paths fails, what do you want to do? If all paths fail, the answer is clear, but what should happen on a partial path failure? I think OCF defines a return code for that? I also think multipath should be a separate agent, to reduce the complexity of the script.

Cheers,
Bernd

--
Bernd Schubert
DataDirect Networks
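One possible answer to the partial-failure question is a simple threshold policy: keep reporting success while enough paths remain, but log the degraded state so that someone notices. The sketch below only illustrates that idea and is not taken from any existing agent. The function name `evaluate_paths` and the parameter `min_active` are hypothetical, and whether a degraded map should trigger failover is a site decision that this sketch does not settle. The exit codes 0 and 1 are the standard OCF success and generic-error values.

```python
# Standard OCF exit codes used by resource agents.
OCF_SUCCESS = 0
OCF_ERR_GENERIC = 1


def evaluate_paths(active, total, min_active=1):
    """Map path counts to (exit_code, message) under a threshold policy.

    active:     number of paths currently usable
    total:      number of paths configured for the map
    min_active: hypothetical tunable; fewer active paths counts as failure
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= active <= total:
        raise ValueError("active must be between 0 and total")

    if active < min_active:
        # Too few paths left: report a failure so the cluster can react.
        return OCF_ERR_GENERIC, f"failed: {active}/{total} paths active"
    if active < total:
        # Partial failure: stay up, but make the degraded state visible.
        return OCF_SUCCESS, f"degraded: {active}/{total} paths active"
    return OCF_SUCCESS, f"healthy: {active}/{total} paths active"


if __name__ == "__main__":
    for active in (2, 1, 0):
        code, message = evaluate_paths(active, total=2)
        print(code, message)
```

With `min_active=1`, losing one of two paths keeps the resource running and only changes the message, while losing both reports a failure. Raising `min_active` to 2 would make the first lost path count as a failure instead.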
Hi Jagga,

You can simply mount your MDT and OST with the option "errors=panic". In case of a fibre disconnection, the Lustre OSS/MDS will get I/O errors and panic the node, so that the other node can take over.

With this option, your /etc/ha.d/haresources file would look like:

    mds1 LVM::homevg Filesystem::/dev/homevg/home::/mnt/home::lustre::errors=panic

Cheers,
_Atul
Michael Schwartzkopff
2010-Jan-22 08:07 UTC
[Lustre-discuss] Filesystem monitoring in Heartbeat
On Thursday, 21 January 2010 23:09:37, Bernd Schubert wrote:
> On Thursday 21 January 2010, Adam Gandelman wrote:
(...)
> I guess you want to use the pacemaker agent I posted into this bugzilla:
>
> https://bugzilla.lustre.org/show_bug.cgi?id=20807

Hello,

how far along is the development of the agent? Is it more or less finished? Publishable?

Greetings,

--
Dr. Michael Schwartzkopff
MultiNET Services GmbH
Address: Bretonischer Ring 7; 85630 Grasbrunn; Germany
Tel: +49 - 89 - 45 69 11 0
Fax: +49 - 89 - 45 69 11 21
mob: +49 - 174 - 343 28 75
mail: misch at multinet.de
web: www.multinet.de
Registered office: 85630 Grasbrunn
Register court: Amtsgericht München HRB 114375
Managing directors: Günter Jurgeneit, Hubert Martens
---
PGP Fingerprint: F919 3919 FF12 ED5A 2801 DEA6 AA77 57A4 EDD8 979B
Skype: misch42