I have been testing more since my last (premature) post. Some questions come to mind, and I am most likely just doing something wrong here...

I have five OSTs, and one of them is the MGS/MDT. Yes, it is a totally bad idea to have the MGS/MDT on the same node as an OST/OSS, but this is only a test. I down one of the servers (normal shutdown, not the MGS of course). OK, so the clients seem to be frozen with regard to Lustre. Many here have noted that it should be OK, with the exception of files that were stored on the downed server, but that does not seem to be the case here.

That is not my main concern, however; the real question is this: I bring the server back up, check its ID by issuing lctl dl, and check the MGS with a cat of /proc/fs/lustre/devices, and I see the ID in there as UP. OK, so it all seems well again, but the client is still (somewhat) stuck. I unmount and mount the client back as per the Lustre FAQ, but it still has problems. I reboot the client, and it still cannot perform certain filesystem operations (ls -lR, df, du, and find . all hang). I can create files and read files if I know their location, but I cannot seem to perform any "recursive" type actions on the mount point on the client. I note also that the client seems to see all of the servers in a cat of /proc/fs/lustre/devices.

I was going to restart the MGS/OSS servers, but the last time I did that nothing worked again and I had to start over. I have to be missing something here. I thought you could reboot an OST at will with more or less no side effects other than clients not seeing the files that were on that OST. I assume that is actually true, but that I am doing something wrong when bringing it back up. Any ideas?
On Tue, 2009-02-03 at 10:29 -0600, Robert Minvielle wrote:

> I have five OSTs,

Do you really mean OSTs here or OSSes? An OST is a disk device. An OSS is the server that an OST is serviced by.

> one of them is the MGS/MDT. Yes, it is a totally bad
> idea to have the MGS/MDT on the same node as an OST/OSS,

Yes, it is. If you really do have 5 OSSes (and not 5 OSTs in a single OSS, for example), why don't you just dedicate one of those OSSes to being an MDS/MGS?

> I down one of the servers (normal shutdown, not the MGS of course).
> OK, so the clients seem to be frozen with regard to Lustre.

Only if they want to access objects (files, or file stripes) on the server that you shut down, yes.

> Many here
> have noted that it should be OK, with the exception of files that were
> stored on the downed server,

Yes.

> but that does not seem to be the case here.
> That is not my main concern, however; the real question is this: I bring the server
> back up, check its ID by issuing lctl dl, and check the MGS with a cat of
> /proc/fs/lustre/devices, and I see the ID in there as UP. OK, so it all seems
> well again, but the client is still (somewhat) stuck.

How long are you waiting after you bring the server up? Recovery is not instantaneous.

> I reboot the client, and it still
> cannot perform certain filesystem operations (ls -lR, df, du, and find . all hang).
> I can create files and read files if I know their location, but I cannot seem
> to perform any "recursive" type actions on the mount point on the client.

Because you are likely looking for something from the down (and maybe now recovering) OSS.

> I was going to restart the MGS/OSS servers, but the last time I did that
> nothing worked again and I had to start over.

Yes. This is exactly why MDS/OSS is a bad idea. When you reset one of those, recovery has to be aborted because you took a (hidden -- the MDS is a client of the OSTs) client down with the server.

> I have to be missing something
> here. I thought you could reboot an OST at will with more or less no side effects
> other than clients not seeing the files that were on that OST.

Yes, until it comes back up and recovery is finished. Look at the syslog of the OSS that you rebooted for details about the recovery.

b.
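To make "look at the recovery" concrete: on a 1.6 server the progress is visible in the per-target recovery_status file as well as in the kernel log. A minimal sketch, using the datafs-OST0001 target name that appears later in this thread:

    # on the rebooted OSS, once the OST is mounted again
    cat /proc/fs/lustre/obdfilter/datafs-OST0001/recovery_status

    # recovery start/completion messages also land in the kernel log
    dmesg | grep -i lustre | tail -20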
On Feb 3, 2009, at 11:42 AM, Brian J. Murrell wrote:

>> I down one of the servers (normal shutdown, not the MGS of course).
>> OK, so the clients seem to be frozen with regard to Lustre.
>
> Only if they want to access objects (files, or file stripes) on that
> server that you shut down, yes.

In our experience, despite what has been said and what we have read, if we lose or take down a single OSS, our clients lose access (I/O seems blocked) to the file system until that OSS is back up and has completed recovery. That's just our experience, and it has been very consistent. We've never seen otherwise, though we would like to. :)

>> Many here
>> have noted that it should be OK, with the exception of files that were
>> stored on the downed server,

Again, not in our experience. We are currently running 1.6.4.2 and have never seen this work. Losing a single OSS renders the file system pretty much unusable until the OSS has recovered. We could be doing something wrong, I suppose, but I'm not sure what.

>> but that does not seem to be the case here.
>> That is not my main concern, however; the real question is this: I bring the server
>> back up, check its ID by issuing lctl dl, and check the MGS with a cat of
>> /proc/fs/lustre/devices, and I see the ID in there as UP. OK, so it all seems
>> well again, but the client is still (somewhat) stuck.

You have to wait for recovery to complete. You can check the recovery status on the OSSs and MGS/MDS by:

    cd /proc/fs/lustre; find . -name "*recov*" -exec cat {} \;

Once all the OSSs/MGS show recovery "COMPLETE", clients will be able to access the file system again.

We've been running three separate Lustre file systems for over a year now and are *very* happy with it. There are a few things that we still don't understand, and this is one of them. We wish that when an OSS went down, we only lost access to files/objects on *that* OSS but, again, that has not been our experience. Still, we've kissed a lot of distributed/parallel file system frogs. We'll take Lustre, hands down.

Charlie Taylor
UF HPC Center
On Tue, 2009-02-03 at 12:21 -0500, Charles Taylor wrote:

> In our experience, despite what has been said and what we have read,
> if we lose or take down a single OSS, our clients lose access (I/O
> seems blocked) to the file system until that OSS is back up and has
> completed recovery.

That is likely the "real world" result of taking down an OSS, indeed. But that is more likely simply due to the "random distribution" of files/stripes around your filesystem, and the fact that it won't take long for all active clients to eventually want something from that missing OSS.

> Again, not in our experience.

Have you actually tested your theory in a controlled environment where you could be sure that the clients that got hung up have never tried to access an OST on the missing OSS? If so, and you are still finding that clients that don't touch the downed OSS are getting hung up, please, by all means, file a bug.

> We've been running three separate Lustre file systems for over a year
> now and are *very* happy with it.

Glad to hear that, sincerely.

> We wish that when an
> OSS went down, we only lost access to files/objects on *that* OSS but,
> again, that has not been our experience.

It's certainly supposed to be. As above, if you find otherwise, please let us know.

> Still, we've kissed a lot
> of distributed/parallel file system frogs. We'll take Lustre, hands
> down.

Thanx for the vote of confidence. It's always nice to hear about people who are happy.

Cheers,
b.
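For what it's worth, such a controlled test is fairly easy to set up by pinning test files to known OSTs before taking an OSS down. A sketch using the 1.6-era positional form of lfs setstripe (arguments: file, stripe size, starting OST index, stripe count); check lfs help on your release for the exact syntax, and note the file names are just examples:

    # one single-stripe file on OST index 0, one on OST index 1
    lfs setstripe /datafs/on_ost0 0 0 1
    lfs setstripe /datafs/on_ost1 0 1 1

    # confirm where the objects actually landed
    lfs getstripe /datafs/on_ost0 /datafs/on_ost1

    # then stop the OSS serving OST0001: I/O to /datafs/on_ost0 should keep
    # working, while I/O to /datafs/on_ost1 blocks until recovery completes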
Charles Taylor wrote:
> On Feb 3, 2009, at 11:42 AM, Brian J. Murrell wrote:
>
>>> I down one of the servers (normal shutdown, not the MGS of course).
>>> OK, so the clients seem to be frozen with regard to Lustre.
>>
>> Only if they want to access objects (files, or file stripes) on that
>> server that you shut down, yes.
>
> In our experience, despite what has been said and what we have read,
> if we lose or take down a single OSS, our clients lose access (I/O
> seems blocked) to the file system until that OSS is back up and has
> completed recovery. That's just our experience, and it has been very
> consistent. We've never seen otherwise, though we would like to. :)

You are probably both correct. Only nodes with files on the down OSTs should be impacted, but it is very easy to use/access files on the down OST. If your home directory is on Lustre, then a login will certainly hang, as you will likely have dotfiles on nearly all OSTs. If you do an "ls -l" you will hang, as most likely some file in the directory will be on the hung OST. If you do an "lsof" and verify that there are no open files on that OST, then when that OST goes down the jobs on that node should continue to run, assuming nothing tries to access the down OST.
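To check in advance which files a given outage will touch, the striping tools are the place to look. A sketch (lfs find --obd is described in the 1.6 manual for locating files on a particular OST; verify the option spelling on your version, and the paths here are only examples):

    # show which OSTs the files in a directory have objects on
    lfs getstripe /datafs/test

    # list every file with an object on a specific OST
    lfs find --obd datafs-OST0001_UUID /datafs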
Hello!

On Feb 3, 2009, at 12:21 PM, Charles Taylor wrote:

>>> Many here
>>> have noted that it should be OK, with the exception of files that were
>>> stored on the downed server,
>
> Again, not in our experience. We are currently running 1.6.4.2 and
> have never seen this work. Losing a single OSS renders the file
> system pretty much unusable until the OSS has recovered. We could
> be doing something wrong, I suppose, but I'm not sure what.

After one of the OSSes is down, what sort of error messages do you get on stuck clients that do not try to access files from those OSSes? Is it anything about problems contacting the MDS, by any chance?

There were some bugs fixed in 1.6.6 and 1.6.7 that could ease this situation. E.g. see bugs 13375 and 16006. So perhaps consider upgrading your system and let us know if it still does not work for you.

Bye,
Oleg
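A quick way to gather what Oleg is asking about on a stuck client (ordinary tools plus lctl; nothing version-specific assumed):

    # kernel messages around the time of the hang
    dmesg | grep -iE 'lustre|lnet' | tail -50

    # which Lustre devices the client has, and whether they are UP
    lctl dl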
On Feb 3, 2009, at 12:28 PM, Brian J. Murrell wrote:

> On Tue, 2009-02-03 at 12:21 -0500, Charles Taylor wrote:
>>
>> In our experience, despite what has been said and what we have read,
>> if we lose or take down a single OSS, our clients lose access (I/O
>> seems blocked) to the file system until that OSS is back up and has
>> completed recovery.
>
> That is likely the "real world" result of taking down an OSS, indeed.
> But that is more likely simply due to the "random distribution" of
> files/stripes around your filesystem, and the fact that it won't take
> long for all active clients to eventually want something from that
> missing OSS.

That could certainly be the case.

>> Again, not in our experience.
>
> Have you actually tested your theory in a controlled environment where
> you could be sure that the clients that got hung up have never tried to
> access an OST on the missing OSS?

No, we've never set out to prove that it works or doesn't. We are not complaining, though; just saying that for us the "practical" ramification of an OSS going down is that the file system will be unusable until the OSS is back in service and recovery is complete.

> If so, and you are still finding that
> clients that don't touch the downed OSS are getting hung up, please,
> by all means, file a bug.

Will do. We'll be upgrading to 1.6.6 pretty soon, and perhaps we'll do some more extensive testing then.

Regards,

Charlie
Interesting, if I perform that on the OSS/OST that I restarted I get:

    status: INACTIVE

On the MGS I get...

    status: COMPLETE
    recovery_start: 1233260818
    recovery_duration: 0
    completed_clients: 1/1
    replayed_requests: 0
    last_transno: 13758
    status: INACTIVE

Does inactive mean it is not rebuilding the file system? Is there a way to force it to rebuild? Performing the check on any other OSS/OST box gives...

    status: COMPLETE
    recovery_start: 1233242886
    recovery_duration: 309
    completed_clients: 1/2
    replayed_requests: 0
    last_transno: 0

Hrm, the clients are still unable to do anything that gets to files on that OSS/OST.

----- "Charles Taylor" <taylor at hpc.ufl.edu> wrote:

> On Feb 3, 2009, at 11:42 AM, Brian J. Murrell wrote:
>
> You have to wait for recovery to complete. You can check the
> recovery status on the OSSs and MGS/MDS by:
>
>     cd /proc/fs/lustre; find . -name "*recov*" -exec cat {} \;
>
> Once all the OSSs/MGS show recovery "COMPLETE", clients will be able
> to access the file system again.
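One small refinement to that check: printing the file names along with their contents makes it obvious which device each status block belongs to (plain shell, nothing version-specific):

    find /proc/fs/lustre -name "*recov*" | while read f; do
        echo "== $f"; cat "$f"
    done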
On Tue, 2009-02-03 at 14:54 -0600, Robert Minvielle wrote:

> Interesting, if I perform that on the OSS/OST that I restarted I get:
>
> status: INACTIVE

Are you actually remounting the OST after the reboot?

What does "lctl dl" say on that OSS?

b.
Yes.

lctl dl:

    0 UP mgc MGC10.1.15.6@tcp f5c832f9-d4fb-837f-6782-a4c2b461c2b7 5
    1 UP ost OSS OSS_uuid 3
    2 UP obdfilter datafs-OST0001 datafs-OST0001_UUID 3

----- "Brian J. Murrell" <Brian.Murrell at Sun.COM> wrote:

> On Tue, 2009-02-03 at 14:54 -0600, Robert Minvielle wrote:
> > Interesting, if I perform that on the OSS/OST that I restarted I get:
> >
> > status: INACTIVE
>
> Are you actually remounting the OST after the reboot?
>
> What does "lctl dl" say on that OSS?
>
> b.
On Feb 03, 2009 12:21 -0500, Charles Taylor wrote:

> In our experience, despite what has been said and what we have read,
> if we lose or take down a single OSS, our clients lose access (I/O
> seems blocked) to the file system until that OSS is back up and has
> completed recovery. That's just our experience, and it has been very
> consistent. We've never seen otherwise, though we would like to. :)

To be clear - a client process will wait indefinitely until an OST is back alive, unless either the process is killed (this should be possible after the Lustre recovery timeout is exceeded, 100s by default), or the OST is explicitly marked "inactive" on the clients:

    lctl --device {failed OSC device on client} deactivate

After the OSC is marked inactive, all I/O to that OST should immediately return with -EIO, and not hang.

If you have experiences other than this, it is a bug. If this isn't explained in the documentation, it is a documentation bug.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
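For anyone wanting to apply that, the device number comes from lctl dl on the client in question. A sketch of the whole sequence (the device number and name below are illustrative only):

    # find the OSC that talks to the failed OST
    lctl dl | grep osc
    #   e.g.  11 UP osc datafs-OST0001-osc-...   (your output will differ)

    # stop I/O to that OST from hanging; it returns -EIO instead
    lctl --device 11 deactivate

    # once the OST is back up and recovered, re-enable it
    lctl --device 11 activate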
On Feb 4, 2009, at 4:33 AM, Andreas Dilger wrote:

> On Feb 03, 2009 12:21 -0500, Charles Taylor wrote:
>> In our experience, despite what has been said and what we have read,
>> if we lose or take down a single OSS, our clients lose access (I/O
>> seems blocked) to the file system until that OSS is back up and has
>> completed recovery. That's just our experience, and it has been very
>> consistent. We've never seen otherwise, though we would like to. :)
>
> To be clear - a client process will wait indefinitely until an OST
> is back alive, unless either the process is killed (this should be
> possible after the Lustre recovery timeout is exceeded, 100s by
> default), or the OST is explicitly marked "inactive" on the clients:
>
>     lctl --device {failed OSC device on client} deactivate
>
> After the OSC is marked inactive, all I/O to that OST should
> immediately return with -EIO, and not hang.

Thanks Andreas, I think that clears things up and will help us understand what to expect going forward.

> If you have experiences other than this, it is a bug. If this isn't
> explained in the documentation, it is a documentation bug.

If that is spelled out clearly in the documentation, I missed it (certainly possible). I hope I indicated that this business has never been a show-stopper for us. Typically, if we lose an OSS or OST, our top priority is getting it back in service. As you indicate, most clients wait and resume when recovery is complete, and this is usually fine with us. In fact, it's awesome, and users understand it since it is akin to what they were used to with NFS - back in the day.

We love you, man! :)

Charlie Taylor
UF HPC Center
I still cannot seem to get this OST to come online. The clients are still exhibiting the same behaviour as before. Is there any way to force the OST into active? I ran an ext3 check on it using the Sun-modified e2fsprogs and it returns:

    e2fsck 1.40.11.sun1 (17-June-2008)
    datafs-OST0001: recovering journal
    datafs-OST0001: clean, 472/25608192 files, 1862944/102410358 blocks

Yet, I still get:

    cd /proc/fs/lustre; find . -name "*recov*" -exec cat {} \;
    status: INACTIVE

On the MGS, it seems to show as active...

    [root@l1storage1 ~]# cat /proc/fs/lustre/lov/datafs-mdtlov/target_obd
    0: datafs-OST0000_UUID ACTIVE
    1: datafs-OST0001_UUID ACTIVE
    4: datafs-OST0004_UUID ACTIVE
    5: datafs-OST0005_UUID ACTIVE
    6: datafs-OST0006_UUID ACTIVE

I cannot work out from the FAQ/manual how to bring the OST up, other than sections 4.2.1 and 4.2.2, which do not seem to work on this OST (when I do a lctl --device <devno> conf_param datafs-OST0001.osc.active=1 it fails; no matter what I put in for <devno> it gives me an error).

Any help would be much appreciated.

----- "Robert Minvielle" <robert at lite3d.com> wrote:

> Yes.
>
> lctl dl:
>
>     0 UP mgc MGC10.1.15.6@tcp f5c832f9-d4fb-837f-6782-a4c2b461c2b7 5
>     1 UP ost OSS OSS_uuid 3
>     2 UP obdfilter datafs-OST0001 datafs-OST0001_UUID 3
>
> ----- "Brian J. Murrell" <Brian.Murrell at Sun.COM> wrote:
>
> > On Tue, 2009-02-03 at 14:54 -0600, Robert Minvielle wrote:
> > > Interesting, if I perform that on the OSS/OST that I restarted I get:
> > >
> > > status: INACTIVE
> >
> > Are you actually remounting the OST after the reboot?
> >
> > What does "lctl dl" say on that OSS?
> >
> > b.
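Two different commands may be getting conflated here, which could explain the errors. As documented for the 1.6 tools (worth double-checking against the manual for your release): conf_param is issued on the MGS node and takes no --device argument, while the per-device activate/deactivate that Andreas described is run locally on a client (or the MDS) against the matching OSC. A sketch:

    # on the MGS: permanently flag the OST inactive or active, filesystem-wide
    lctl conf_param datafs-OST0001.osc.active=0
    lctl conf_param datafs-OST0001.osc.active=1

    # on a client: temporary, local-only deactivation of the matching OSC
    lctl dl | grep datafs-OST0001-osc
    lctl --device <devno from the line above> deactivate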
On Feb 4, 2009, at 10:39 AM, Robert Minvielle wrote:

> I still cannot seem to get this OST to come online. The clients
> are still exhibiting the same behaviour as before. Is there any
> way to force the OST into active? I ran an ext3 check on it using
> the Sun-modified e2fsprogs and it returns:
>
>     e2fsck 1.40.11.sun1 (17-June-2008)
>     datafs-OST0001: recovering journal
>     datafs-OST0001: clean, 472/25608192 files, 1862944/102410358 blocks
>
> Yet, I still get:
>
>     cd /proc/fs/lustre; find . -name "*recov*" -exec cat {} \;
>     status: INACTIVE
>
> On the MGS, it seems to show as active...
>
>     [root@l1storage1 ~]# cat /proc/fs/lustre/lov/datafs-mdtlov/target_obd
>     0: datafs-OST0000_UUID ACTIVE
>     1: datafs-OST0001_UUID ACTIVE
>     4: datafs-OST0004_UUID ACTIVE
>     5: datafs-OST0005_UUID ACTIVE
>     6: datafs-OST0006_UUID ACTIVE

We've seen OSTs come up as INACTIVE before. We are not sure why it happens. Sometimes it will transition into RECOVERY if you remount it (umount, mount). Sometimes you may find that the OST is mounted read-only and you can force it back to read-write with mount (as in mount -o rw,remount <device>). Sometimes, if you wait, it will transition to ACTIVE on its own (perhaps passing through RECOVERY first, I don't know). We've intentionally and unintentionally experienced all three. I think Brian and/or Andreas have already mentioned the remount route.

Don't worry though. Lustre really does work. This sounds like normal tooth cutting. You'll be OK. :)

Charlie Taylor
UF HPC Center
On Wed, 2009-02-04 at 09:39 -0600, Robert Minvielle wrote:

> I still cannot seem to get this OST to come online. The clients
> are still exhibiting the same behaviour as before. Is there any
> way to force the OST into active? I ran an ext3 check on it using
> the Sun-modified e2fsprogs and it returns:
>
>     e2fsck 1.40.11.sun1 (17-June-2008)
>     datafs-OST0001: recovering journal
>     datafs-OST0001: clean, 472/25608192 files, 1862944/102410358 blocks
>
> Yet, I still get:
>
>     cd /proc/fs/lustre; find . -name "*recov*" -exec cat {} \;
>     status: INACTIVE
>
> On the MGS, it seems to show as active...
>
>     [root@l1storage1 ~]# cat /proc/fs/lustre/lov/datafs-mdtlov/target_obd
>     0: datafs-OST0000_UUID ACTIVE
>     1: datafs-OST0001_UUID ACTIVE
>     4: datafs-OST0004_UUID ACTIVE
>     5: datafs-OST0005_UUID ACTIVE
>     6: datafs-OST0006_UUID ACTIVE
>
> I cannot work out from the FAQ/manual how to bring the OST up, other than
> sections 4.2.1 and 4.2.2, which do not seem to work on this OST
> (when I do a lctl --device <devno> conf_param datafs-OST0001.osc.active=1
> it fails; no matter what I put in for <devno> it gives me an error).
>
> Any help would be much appreciated.

You are trying way too hard. The process is simply to mount the OST and wait for recovery to complete. If that is not working, then that needs to be debugged. All of these other things you are attempting are likely just confusing things more than helping.

So after you mount the OST, you should get a bunch of messages in your "kernel log". What are they?

Also, can you explain, exactly, step by step, what you are doing to invoke this failure and recovery?

b.
I mounted the OST rw. How long do you wait? It has been more than 24 hours, and this drive is 400GB. The kernel only reports:

    LDISKFS FS on sda4, internal journal
    LDISKFS-fs: mounted filesystem with ordered data mode.
    LDISKFS-fs: file extents enabled
    LDISKFS-fs: mballoc enabled
    Lustre: OST datafs-OST0001 now serving dev (datafs-OST0001/a4ca1e1f-5fb6-98bb-4001-179eef95f576) with recovery enabled
    Lustre: Server datafs-OST0001 on device /dev/sda4 has started

Step by step for the problem:

Go to a server at random.

    shutdown -h now

Wait 10 minutes, restart the server, and remount Lustre:

    mount -t lustre -o rw /dev/sda4 /mnt/data/ost5

Check it:

    cd /proc/fs/lustre; find . -name "*recov*" -exec cat {} \;
    status: INACTIVE

A cat /proc/fs/lustre/devices shows:

    0 UP mgc MGC10.1.15.6@tcp 6d8c5b4e-d22d-e17c-030b-0bf2a01defca 5
    1 UP ost OSS OSS_uuid 3
    2 UP obdfilter datafs-OST0001 datafs-OST0001_UUID 3

That seems correct; 10.1.15.6 is l1storage1, the MGS/MDT server. Check to see if the MGS sees it:

    [root@l1storage1 ~]# cat /proc/fs/lustre/lov/datafs-mdtlov/target_obd
    0: datafs-OST0000_UUID ACTIVE
    1: datafs-OST0001_UUID ACTIVE
    4: datafs-OST0004_UUID ACTIVE
    5: datafs-OST0005_UUID ACTIVE
    6: datafs-OST0006_UUID ACTIVE

Again:

    [root@l1storage1 ~]# cat /proc/fs/lustre/devices
    0 UP mgs MGS MGS 13
    1 UP mgc MGC10.1.15.6@tcp efa6505e-238d-7107-7a7c-c64208640f9f 5
    2 UP mdt MDS MDS_uuid 3
    3 UP lov datafs-mdtlov datafs-mdtlov_UUID 4
    4 UP mds datafs-MDT0000 datafs-MDT0000_UUID 5
    5 UP ost OSS OSS_uuid 3
    6 UP obdfilter datafs-OST0000 datafs-OST0000_UUID 5
    7 UP osc datafs-OST0000-osc datafs-mdtlov_UUID 5
    8 UP osc datafs-OST0006-osc datafs-mdtlov_UUID 5
    9 UP osc datafs-OST0005-osc datafs-mdtlov_UUID 5
    10 UP osc datafs-OST0004-osc datafs-mdtlov_UUID 5
    11 UP osc datafs-OST0001-osc datafs-mdtlov_UUID 5

Now, just to make sure it should be OK, I go to a client, restart the client, make sure it sees the mount, then test. The mount command shows it is mounted:

    l1storage1@tcp0:/datafs on /datafs type lustre (rw)

ls -l of /datafs shows my test data:

    drwxr-xr-x 2 root root   4096 Jan 30 08:47 t
    drwxr-xr-x 2 root root 221184 Feb  2 12:36 test
    drwxr-xr-x 2 root root 221184 Feb  2 12:35 test2

As I previously noted, I can create/delete files, but ls -lR hangs, df hangs, etc., etc.

"You are trying way too hard." -- I do not think that is possible...
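One more data point worth collecting on that client: whether its OSC for the rebooted target ever reconnected. On a 1.6 client something along these lines should show it (the exact proc file names vary a little between releases, so treat this as a sketch):

    # every datafs OSC on the client should be UP
    lctl dl | grep osc

    # connection state of the client's import for that OST
    # (e.g. FULL when connected, DISCONN while it is unreachable)
    cat /proc/fs/lustre/osc/datafs-OST0001-osc-*/ost_server_uuid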
Hello!

On Feb 4, 2009, at 11:43 AM, Robert Minvielle wrote:

> l1storage1@tcp0:/datafs on /datafs type lustre (rw)
>
> ls -l of /datafs shows my test data:
>
>     drwxr-xr-x 2 root root   4096 Jan 30 08:47 t
>     drwxr-xr-x 2 root root 221184 Feb  2 12:36 test
>     drwxr-xr-x 2 root root 221184 Feb  2 12:35 test2
>
> As I previously noted, I can create/delete files, but
> ls -lR hangs, df hangs, etc., etc.

Any error messages in the kernel log when it hangs (either on the affected OSS or the client, or both)?

After a fresh reboot of the client:

    modprobe lustre
    echo -1 >/proc/sys/lnet/debug
    echo 50 >/proc/sys/lnet/debug_mb
    mount l1storage1:/datafs /datafs -t lustre
    df

After it hangs, from another terminal:

    lctl dk >/tmp/lustre.log

gzip the /tmp/lustre.log, file a bug at bugzilla.lustre.org, and attach the log there.

Bye,
Oleg
Thanks. I did that and posted the logs. There were no kernel messages on the console/syslog.

As an interesting aside, I really thought remounting the OSTs would work. I unmounted all of the servers, then the MGS/MDT. Then I brought the MGS/MDT back up and remounted all of the servers. This still does not fix the issue. I keep thinking something on my end has to be misconfigured. I have gone through the steps again and looked at my history files, but I do not see anything wrong with the install... yet.

----- "Oleg Drokin" <Oleg.Drokin at Sun.COM> wrote:

> Hello!
>
> On Feb 4, 2009, at 11:43 AM, Robert Minvielle wrote:
>
> > l1storage1@tcp0:/datafs on /datafs type lustre (rw)
> >
> > ls -l of /datafs shows my test data:
> >
> >     drwxr-xr-x 2 root root   4096 Jan 30 08:47 t
> >     drwxr-xr-x 2 root root 221184 Feb  2 12:36 test
> >     drwxr-xr-x 2 root root 221184 Feb  2 12:35 test2
> >
> > As I previously noted, I can create/delete files, but
> > ls -lR hangs, df hangs, etc., etc.
>
> Any error messages in the kernel log when it hangs
> (either on the affected OSS or the client, or both)?
>
> After a fresh reboot of the client:
>
>     modprobe lustre
>     echo -1 >/proc/sys/lnet/debug
>     echo 50 >/proc/sys/lnet/debug_mb
>     mount l1storage1:/datafs /datafs -t lustre
>     df
>
> After it hangs, from another terminal:
>
>     lctl dk >/tmp/lustre.log
>
> gzip the /tmp/lustre.log, file a bug at bugzilla.lustre.org, and attach the log there.
>
> Bye,
> Oleg