Is there a good way to detect a down MDS or OST from a Lustre client?

We have a basic sanity check that runs at the start of jobs and does a test like this (ksh syntax):

    # Check that the Lustre filesystem is mounted before running
    # (where /wrkdir is a Lustre filesystem).
    if [ ! -d "/wrkdir/$USER" ]; then
        :   # placeholder: note the error and requeue the job.
    fi

We had an issue with our Lustre filesystem a few days ago which caused several of the OSTs to go down. Unfortunately, the test above simply hung on the directory test when this happened, which caused the script to time out.

I'm wondering if there is a standard way to test Lustre status from a client that won't lock up when there is an issue with Lustre.

Thanks,
Don

Donald Bahls | HPC Specialist | Arctic Region Supercomputing Center
Kilian CAVALOTTI
2007-May-04 14:46 UTC
[Lustre-discuss] Detecting Lustre Problems in a script
On Friday 04 May 2007 12:57:31 pm Don Bahls wrote:
> I'm wondering if there is a standard way to test Lustre status from
> a client that won't lock up when there is an issue with Lustre.

You can see the device states in /proc/fs/lustre/devices. This should be available even if the filesystem has problems.

Cheers,
--
Kilian
Kilian CAVALOTTI wrote:
> On Friday 04 May 2007 12:57:31 pm Don Bahls wrote:
>> I'm wondering if there is a standard way to test Lustre status from
>> a client that won't lock up when there is an issue with Lustre.
>
> You can see the device states in /proc/fs/lustre/devices. This should be
> available even if the filesystem has problems.
>
> Cheers,

You can:
- cat /proc/mounts and check for your mount point; this will not hang when servers are down
- cat /proc/fs/lustre/health_check and check for the string 'healthy'
- check /proc/fs/lustre/devices and look for 'UP'

cliffw
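The three checks above could be sketched as shell functions along the following lines. This is a minimal sketch, not a tested production check: the function names are hypothetical, each path is passed as an argument so the functions can be tried against ordinary files, and the devices check assumes that the second whitespace-separated column of /proc/fs/lustre/devices holds the device state, which may vary between Lustre versions.

```shell
#!/bin/sh
# Sketch of the three /proc-based checks. All function names are
# hypothetical; none of these read from the Lustre mount itself.

# Succeed if mount point $1 appears as the second field of the
# mounts file $2 (default /proc/mounts).
lustre_is_mounted() {
    lm_mount_point=$1
    lm_mounts_file=${2:-/proc/mounts}
    awk -v mp="$lm_mount_point" \
        '$2 == mp { found = 1 } END { exit found ? 0 : 1 }' "$lm_mounts_file"
}

# Succeed if the health file $1 contains a line that is exactly "healthy".
lustre_is_healthy() {
    lh_health_file=${1:-/proc/fs/lustre/health_check}
    [ -r "$lh_health_file" ] && grep -qx 'healthy' "$lh_health_file"
}

# Succeed if the devices file $1 lists at least one device and every
# device has "UP" in its second column (ASSUMED file layout).
lustre_devices_up() {
    ld_devices_file=${1:-/proc/fs/lustre/devices}
    [ -r "$ld_devices_file" ] || return 1
    awk '$2 != "UP" { bad = 1 } END { exit (NR == 0 || bad) ? 1 : 0 }' \
        "$ld_devices_file"
}
```

In a job prologue this might be used as `lustre_is_mounted /wrkdir && lustre_is_healthy && lustre_devices_up || requeue_job`, where `requeue_job` stands in for whatever the site's batch system provides. Whether these files reflect a failed OST quickly enough for a given site is something to verify locally.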
On Tue, 15 May 2007, Cliff White wrote:
> Kilian CAVALOTTI wrote:
> >On Friday 04 May 2007 12:57:31 pm Don Bahls wrote:
> >>I'm wondering if there is a standard way to test Lustre status from
> >>a client that won't lock up when there is an issue with Lustre.
> >
> >You can see the device states in /proc/fs/lustre/devices. This should be
> >available even if the filesystem has problems.
> >
> >Cheers,
> You can:
> - cat /proc/mounts and check for your mount point; this will not hang
> when servers are down
> - cat /proc/fs/lustre/health_check and check for the string 'healthy'
> - check /proc/fs/lustre/devices and look for 'UP'

Don,

Here's a simple nagios plugin, adapted from the CFS Mon script, that we use for monitoring lustre health.

<begin check_lustre>
#!/usr/bin/perl -T
#
# $Id: check_lustre 131 2007-04-16 13:37:26Z bret $
# nagios plugin to check health of lustre filesystem:
#   healthy:   return 0
#   otherwise: return 2 (Critical)

use strict;

my $health_check = '/proc/fs/lustre/health_check';

# this is based on the lustre Mon monitor from the lustre manual

# 1) lustre modules should be loaded
# Is the lustre module check necessary, since /proc/fs/lustre is already
# being checked?

# 2) lustre kernel directory should exist
if ( ! -d "/proc/fs/lustre" ) {
    print "no lustre kernel proc directory\n";
    exit 2;
}

# 3) health check must pass
open ( HEALTH, "< $health_check" ) or exit 2;
while ( <HEALTH> ) {
    if ( /^healthy$/ ) {
        print "healthy\n";
        exit 0;
    } else {
        print $_;
        while ( <HEALTH> ) {
            print $_;
        }
        exit 2;
    }
}
<end check_lustre>

>
> cliffw

bret
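To tie the plugin back to the job-start check from the original question, a small wrapper could run any check command and use its exit status to decide whether the job proceeds. This is only a sketch: `gate_on_check` is a hypothetical name, it relies on the plugin's convention that 0 means healthy and any other status means a problem, and the actual requeue action is site-specific, so it is reduced here to a message.

```shell
#!/bin/sh
# Hypothetical helper: run a health-check command (for example the
# check_lustre plugin above) and report whether the job should proceed.
# Assumed convention: exit status 0 = healthy, anything else = problem.
gate_on_check() {
    goc_output=$("$@" 2>&1)
    goc_rc=$?
    if [ "$goc_rc" -eq 0 ]; then
        echo "proceed"
    else
        # Keep the check's own output so the job log records the reason.
        echo "requeue: ${goc_output:-check exited with status $goc_rc}"
    fi
    return "$goc_rc"
}
```

A prologue might call it as `gate_on_check /path/to/check_lustre || exit 1`, where the plugin path is a placeholder for wherever the script is installed locally.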