Harald van Pee
2008-Jan-21 16:33 UTC
[Lustre-discuss] files/directories are temporarily unavailable on patchless clients
Hi,

this is Lustre 1.6.0.1 on Red Hat 4 update 4, with Debian etch patchless clients running Debian kernels 2.6.18. All machines are AMD64. We use 1 MGS, 1 MDT and 5 OSS with 10 OSTs in total.

The cluster has been running for roughly 6 months with no problems on the server side so far. The only problem on the client side was that two clients lost their connection to the MDT or an OST, and after recovery some files could not be accessed on those clients. After rebooting those clients everything was o.k. again.

But now we have a much more serious problem. It seems to have existed for a couple of days, but we have only just found out what is going on: a directory of one user, containing a couple of subdirectories, becomes temporarily unavailable on some clients. After a while the directory can be accessed again, but it is not clear to me what happens, and it vanishes again after a while. Rebooting the client does not change the situation.

Now I have to ask for help; unfortunately this is a production system and it is needed as soon as possible. An update is planned, but most likely not realistic before the middle of next month.

What should I do next to solve or investigate the problem?

Thanks in advance
Harald
--
Harald van Pee
Helmholtz-Institut fuer Strahlen- und Kernphysik der Universitaet Bonn
Bernd Schubert
2008-Jan-21 16:44 UTC
[Lustre-discuss] files/directories are temporarily unavailable on patchless clients
On Monday 21 January 2008 17:33:31 Harald van Pee wrote:
> Hi,
>
> this is Lustre 1.6.0.1 on Red Hat 4 update 4, with Debian etch patchless
> clients running Debian kernels 2.6.18. All machines are AMD64.
> We use 1 MGS, 1 MDT and 5 OSS with 10 OSTs in total.
>
> The cluster has been running for roughly 6 months with no problems on the
> server side so far. The only problem on the client side was that two
> clients lost their connection to the MDT or an OST, and after recovery
> some files could not be accessed on those clients. After rebooting those
> clients everything was o.k. again.
>
> But now we have a much more serious problem. It seems to have existed for
> a couple of days, but we have only just found out what is going on:
> a directory of one user, containing a couple of subdirectories, becomes
> temporarily unavailable on some clients.
> After a while the directory can be accessed again, but it is not clear to
> me what happens, and it vanishes again after a while.

What does this mean in detail? Does access to the directory just hang, or is the directory not there?

> Reboot of the client does not change the situation.
>
> Now I have to ask for help; unfortunately this is a production system and
> it is needed as soon as possible. An update is planned, but most likely
> not realistic before the middle of next month.
>
> What should I do next to solve or investigate the problem?

Lustre writes lots of debug messages. For a start, logs from a client would help.

Cheers,
Bernd
--
Bernd Schubert
Q-Leap Networks GmbH
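[For reference, on a 1.6-era patchless client the relevant messages can usually be collected along these lines. This is only a sketch; the debug mask value and the proc path are assumptions that may differ on a given kernel and Lustre build:

  # console-level Lustre messages already land in syslog/dmesg
  dmesg | grep -i lustre > /tmp/lustre-console.log

  # optionally widen the debug mask and dump the in-kernel debug buffer
  echo -1 > /proc/sys/lnet/debug       # enable all debug flags (very verbose)
  lctl dk /tmp/lustre-debug.log        # dump and clear the Lustre debug buffer

The syslog excerpts later in this thread were gathered from the client-side kernel log in exactly this way.]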
Harald van Pee
2008-Jan-21 17:55 UTC
[Lustre-discuss] files/directories are temporarily unavailable on patchless clients
On Monday 21 January 2008 05:44 pm, you wrote:
> On Monday 21 January 2008 17:33:31 Harald van Pee wrote:
<snip>
> > A directory of one user, containing a couple of subdirectories, becomes
> > temporarily unavailable on some clients.
> > After a while the directory can be accessed again, but it is not clear
> > to me what happens, and it vanishes again after a while.
>
> What does this mean in detail? Does access to the directory just hang, or
> is the directory not there?

The directory is just not there: "Directory or file not found".

> > Reboot of the client does not change the situation.
<snip>
> > What should I do next to solve or investigate the problem?
>
> Lustre writes lots of debug messages. For a start, logs from a client
> would help.

In my opinion there is no error message on the clients which is directly related to the problem. On our node0010 I have seen the problem several times today; mostly the directory is simply not visible, while (probably) all of the other directories can still be accessed at the same time.

Here are all Lustre-related messages from the last days (the others are mostly timestamps):
Jan 17 07:41:16 node0010 kernel: Lustre: 5723:0: (namei.c:235:ll_mdc_blocking_ast()) More than 1 alias dir 133798800 alias 2
Jan 17 07:41:16 node0010 kernel: Lustre: 5723:0: (namei.c:235:ll_mdc_blocking_ast()) Skipped 2 previous similar messages
Jan 17 07:41:16 node0010 kernel: Lustre: 5716:0: (namei.c:235:ll_mdc_blocking_ast()) More than 1 alias dir 133799550 alias 2
Jan 17 07:41:16 node0010 kernel: Lustre: 5716:0: (namei.c:235:ll_mdc_blocking_ast()) Skipped 1 previous similar message
Jan 17 07:41:28 node0010 kernel: Lustre: 5721:0: (namei.c:235:ll_mdc_blocking_ast()) More than 1 alias dir 133891907 alias 2
Jan 17 07:41:28 node0010 kernel: Lustre: 5721:0: (namei.c:235:ll_mdc_blocking_ast()) More than 1 alias dir 133891908 alias 2
Jan 18 14:37:58 node0010 kernel: Lustre: 5716:0: (namei.c:235:ll_mdc_blocking_ast()) More than 1 alias dir 134059689 alias 2
Jan 18 14:52:48 node0010 kernel: Lustre: 5723:0: (namei.c:235:ll_mdc_blocking_ast()) More than 1 alias dir 134092437 alias 2
Jan 18 14:54:04 node0010 kernel: Lustre: 5717:0: (namei.c:235:ll_mdc_blocking_ast()) More than 1 alias dir 134092815 alias 2
Jan 18 14:54:37 node0010 kernel: Lustre: 5719:0: (namei.c:235:ll_mdc_blocking_ast()) More than 1 alias dir 134092988 alias 2
Jan 18 15:02:34 node0010 kernel: Lustre: 5723:0: (namei.c:235:ll_mdc_blocking_ast()) More than 1 alias dir 134093893 alias 2
Jan 18 15:05:33 node0010 kernel: Lustre: 5722:0: (namei.c:235:ll_mdc_blocking_ast()) More than 1 alias dir 134094054 alias 2
Jan 18 15:06:04 node0010 kernel: Lustre: 5721:0: (namei.c:235:ll_mdc_blocking_ast()) More than 1 alias dir 134094179 alias 2
Jan 18 15:10:53 node0010 kernel: Lustre: 5723:0: (namei.c:235:ll_mdc_blocking_ast()) More than 1 alias dir 134094366 alias 2
Jan 18 15:14:59 node0010 kernel: Lustre: 5718:0: (namei.c:235:ll_mdc_blocking_ast()) More than 1 alias dir 134094555 alias 2
Jan 18 15:15:44 node0010 kernel: Lustre: 5723:0: (namei.c:235:ll_mdc_blocking_ast()) More than 1 alias dir 134094694 alias 2
Jan 18 15:22:55 node0010 kernel: Lustre: 5721:0: (namei.c:235:ll_mdc_blocking_ast()) More than 1 alias dir 134094871 alias 2
Jan 18 15:23:30 node0010 kernel: Lustre: 5719:0: (namei.c:235:ll_mdc_blocking_ast()) More than 1 alias dir 134119425 alias 2
Jan 18 15:28:37 node0010 kernel: Lustre: 5721:0: (namei.c:235:ll_mdc_blocking_ast()) More than 1 alias dir 134119833 alias 2
Jan 18 16:15:58 node0010 kernel: Lustre: 5718:0: (namei.c:235:ll_mdc_blocking_ast()) More than 1 alias dir 134059687 alias 2
Jan 18 22:27:36 node0010 kernel: Lustre: 5723:0: (namei.c:235:ll_mdc_blocking_ast()) More than 1 alias dir 134092437 alias 2
Jan 18 22:27:36 node0010 kernel: Lustre: 5723:0: (namei.c:235:ll_mdc_blocking_ast()) Skipped 2 previous similar messages
Jan 18 22:27:49 node0010 kernel: Lustre: 5723:0: (namei.c:235:ll_mdc_blocking_ast()) More than 1 alias dir 134092648 alias 2
Jan 18 22:28:45 node0010 kernel: Lustre: 5721:0: (namei.c:235:ll_mdc_blocking_ast()) More than 1 alias dir 134092815 alias 2
Jan 18 22:29:46 node0010 kernel: Lustre: 5720:0: (namei.c:235:ll_mdc_blocking_ast()) More than 1 alias dir 134092988 alias 2
Jan 18 22:29:53 node0010 kernel: Lustre: 5722:0: (namei.c:235:ll_mdc_blocking_ast()) More than 1 alias dir 134093165 alias 2
Jan 18 22:30:03 node0010 kernel: Lustre: 5718:0: (namei.c:235:ll_mdc_blocking_ast()) More than 1 alias dir 134093358 alias 2
Jan 18 22:30:39 node0010 kernel: Lustre: 5722:0: (namei.c:235:ll_mdc_blocking_ast()) More than 1 alias dir 134093732 alias 2
Jan 18 22:31:40 node0010 kernel: Lustre: 5718:0: (namei.c:235:ll_mdc_blocking_ast()) More than 1 alias dir 134093893 alias 2
Jan 18 22:32:46 node0010 kernel: Lustre: 5721:0: (namei.c:235:ll_mdc_blocking_ast()) More than 1 alias dir 134094054 alias 2
Jan 18 22:34:27 node0010 kernel: Lustre: 5716:0: (namei.c:235:ll_mdc_blocking_ast()) More than 1 alias dir 134094179 alias 2
Jan 18 22:35:37 node0010 kernel: Lustre: 5720:0: (namei.c:235:ll_mdc_blocking_ast()) More than 1 alias dir 134094555 alias 2
Jan 18 22:35:37 node0010 kernel: Lustre: 5720:0: (namei.c:235:ll_mdc_blocking_ast()) Skipped 1 previous similar message
Jan 18 22:37:47 node0010 kernel: Lustre: 5716:0: (namei.c:235:ll_mdc_blocking_ast()) More than 1 alias dir 134094694 alias 2
Jan 19 08:30:49 node0010 kernel: Lustre: 5717:0: (namei.c:235:ll_mdc_blocking_ast()) More than 1 alias dir 134059690 alias 2
Jan 19 08:30:49 node0010 kernel: Lustre: 5717:0: (namei.c:235:ll_mdc_blocking_ast()) Skipped 3 previous similar messages
Jan 19 08:32:33 node0010 kernel: Lustre: 5721:0: (namei.c:235:ll_mdc_blocking_ast()) More than 1 alias dir 134060498 alias 2
Jan 19 08:34:02 node0010 kernel: Lustre: 5716:0: (namei.c:235:ll_mdc_blocking_ast()) More than 1 alias dir 134061258 alias 2
Jan 19 08:35:21 node0010 kernel: Lustre: 5716:0: (namei.c:235:ll_mdc_blocking_ast()) More than 1 alias dir 134091392 alias 2
Jan 19 08:35:21 node0010 kernel: Lustre: 5716:0: (namei.c:235:ll_mdc_blocking_ast()) Skipped 1 previous similar message
Jan 19 08:37:41 node0010 kernel: Lustre: 5722:0: (namei.c:235:ll_mdc_blocking_ast()) More than 1 alias dir 134091529 alias 2
Jan 19 08:37:41 node0010 kernel: Lustre: 5722:0: (namei.c:235:ll_mdc_blocking_ast()) Skipped 1 previous similar message
Jan 19 08:43:42 node0010 kernel: Lustre: 5718:0: (namei.c:235:ll_mdc_blocking_ast()) More than 1 alias dir 134091822 alias 2
Jan 19 08:43:42 node0010 kernel: Lustre: 5718:0: (namei.c:235:ll_mdc_blocking_ast()) Skipped 2 previous similar messages
Jan 19 08:53:04 node0010 kernel: Lustre: 5721:0: (namei.c:235:ll_mdc_blocking_ast()) More than 1 alias dir 134059688 alias 2
Jan 19 08:53:04 node0010 kernel: Lustre: 5721:0: (namei.c:235:ll_mdc_blocking_ast()) Skipped 2 previous similar messages
Jan 19 09:05:22 node0010 kernel: Lustre: 5722:0: (namei.c:235:ll_mdc_blocking_ast()) More than 1 alias dir 134093165 alias 2
Jan 19 09:05:22 node0010 kernel: Lustre: 5722:0: (namei.c:235:ll_mdc_blocking_ast()) Skipped 4 previous similar messages
Jan 19 09:15:42 node0010 kernel: Lustre: 5721:0: (namei.c:235:ll_mdc_blocking_ast()) More than 1 alias dir 134094179 alias 2
Jan 19 09:15:42 node0010 kernel: Lustre: 5721:0: (namei.c:235:ll_mdc_blocking_ast()) Skipped 4 previous similar messages
Jan 19 09:50:20 node0010 kernel: Lustre: 5719:0: (namei.c:235:ll_mdc_blocking_ast()) More than 1 alias dir 134094871 alias 2
Jan 19 09:50:20 node0010 kernel: Lustre: 5719:0: (namei.c:235:ll_mdc_blocking_ast()) Skipped 3 previous similar messages
Jan 19 09:56:23 node0010 kernel: Lustre: 5717:0: (namei.c:235:ll_mdc_blocking_ast()) More than 1 alias dir 134119425 alias 2
Jan 19 10:01:28 node0010 kernel: Lustre: 5718:0: (namei.c:235:ll_mdc_blocking_ast()) More than 1 alias dir 134119630 alias 2
Jan 19 10:11:43 node0010 kernel: Lustre: 5723:0: (namei.c:235:ll_mdc_blocking_ast()) More than 1 alias dir 134119833 alias 2
Jan 20 00:58:03 node0010 kernel: Lustre: 5717:0: (namei.c:235:ll_mdc_blocking_ast()) More than 1 alias dir 134094366 alias 2
Jan 20 00:58:03 node0010 kernel: Lustre: 5717:0: (namei.c:235:ll_mdc_blocking_ast()) Skipped 3 previous similar messages
Jan 20 01:03:04 node0010 kernel: Lustre: 5722:0: (namei.c:235:ll_mdc_blocking_ast()) More than 1 alias dir 134094555 alias 2
Jan 20 01:59:56 node0010 kernel: Lustre: 5722:0: (namei.c:235:ll_mdc_blocking_ast()) More than 1 alias dir 134060498 alias 2
Jan 20 02:00:18 node0010 kernel: Lustre: 5721:0: (namei.c:235:ll_mdc_blocking_ast()) More than 1 alias dir 134059690 alias 2
Jan 20 02:01:31 node0010 kernel: Lustre: 5717:0: (namei.c:235:ll_mdc_blocking_ast()) More than 1 alias dir 134059688 alias 2
Jan 20 02:47:29 node0010 kernel: Lustre: 5716:0: (namei.c:235:ll_mdc_blocking_ast()) More than 1 alias dir 134059687 alias 2
Jan 20 02:47:29 node0010 kernel: Lustre: 5716:0: (namei.c:235:ll_mdc_blocking_ast()) Skipped 1 previous similar message
Jan 20 22:28:27 node0010 kernel: Lustre: 5721:0: (namei.c:235:ll_mdc_blocking_ast()) More than 1 alias dir 134092396 alias 2
Jan 20 22:28:27 node0010 kernel: Lustre: 5721:0: (namei.c:235:ll_mdc_blocking_ast()) Skipped 1 previous similar message
Jan 20 22:28:59 node0010 kernel: Lustre: 5718:0: (namei.c:235:ll_mdc_blocking_ast()) More than 1 alias dir 134092437 alias 2
Jan 20 23:19:09 node0010 kernel: Lustre: 5722:0: (namei.c:235:ll_mdc_blocking_ast()) More than 1 alias dir 134093165 alias 2
Jan 20 23:39:11 node0010 kernel: Lustre: 5716:0: (namei.c:235:ll_mdc_blocking_ast()) More than 1 alias dir 134093358 alias 2
Jan 21 11:10:39 node0010 kernel: Lustre: 5720:0: (namei.c:235:ll_mdc_blocking_ast()) More than 1 alias dir 134059689 alias 2
Jan 21 14:22:44 node0010 kernel: Lustre: 5721:0: (namei.c:235:ll_mdc_blocking_ast()) More than 1 alias dir 134091343 alias 3
Jan 21 14:38:43 node0010 kernel: Lustre: 5720:0: (namei.c:235:ll_mdc_blocking_ast()) More than 1 alias dir 134091486 alias 3
Jan 21 15:08:20 node0010 kernel: Lustre: 5716:0: (namei.c:235:ll_mdc_blocking_ast()) More than 1 alias dir 134091822 alias 3
Jan 21 15:31:26 node0010 kernel: Lustre: 5723:0: (namei.c:235:ll_mdc_blocking_ast()) More than 1 alias dir 134059690 alias 3
Jan 21 15:32:42 node0010 kernel: Lustre: 5719:0: (namei.c:235:ll_mdc_blocking_ast()) More than 1 alias dir 134092333 alias 3
Jan 21 15:37:00 node0010 kernel: Lustre: 5719:0: (namei.c:235:ll_mdc_blocking_ast()) More than 1 alias dir 134060498 alias 3
Jan 21 15:42:40 node0010 kernel: Lustre: 5718:0: (namei.c:235:ll_mdc_blocking_ast()) More than 1 alias dir 134061258 alias 3
Jan 21 15:57:33 node0010 kernel: Lustre: 5721:0: (namei.c:235:ll_mdc_blocking_ast()) More than 1 alias dir 134091529 alias 3
Jan 21 16:32:29 node0010 kernel: Lustre: 5719:0: (namei.c:235:ll_mdc_blocking_ast()) More than 1 alias dir 134091683 alias 3
Jan 21 17:11:16 node0010 kernel: Lustre: 5720:0: (namei.c:235:ll_mdc_blocking_ast()) More than 1 alias dir 134091985 alias 3
Jan 21 17:32:25 node0010 kernel: Lustre: 5717:0: (namei.c:235:ll_mdc_blocking_ast()) More than 1 alias dir 134059688 alias 3
Jan 21 17:32:25 node0010 kernel: Lustre: 5721:0: (namei.c:235:ll_mdc_blocking_ast()) More than 1 alias dir 134092182 alias 3
Jan 21 18:12:51 node0010 kernel: Lustre: 5717:0: (namei.c:235:ll_mdc_blocking_ast()) More than 1 alias dir 134059687 alias 3
Jan 21 18:12:51 node0010 kernel: Lustre: 5722:0: (namei.c:235:ll_mdc_blocking_ast()) More than 1 alias dir 134094694 alias 2
Jan 21 18:12:51 node0010 kernel: Lustre: 5722:0: (namei.c:235:ll_mdc_blocking_ast()) Skipped 4 previous similar messages
Jan 21 18:12:51 node0010 kernel: Lustre: 5717:0: (namei.c:235:ll_mdc_blocking_ast()) More than 1 alias dir 134120476 alias 2
Jan 21 18:12:51 node0010 kernel: Lustre: 5717:0: (namei.c:235:ll_mdc_blocking_ast()) Skipped 6 previous similar messages

>
> Cheers,
> Bernd

--
Harald van Pee
Helmholtz-Institut fuer Strahlen- und Kernphysik der Universitaet Bonn
Andreas Dilger
2008-Jan-21 22:55 UTC
[Lustre-discuss] files/directories are temporarily unavailable on patchless clients
On Jan 21, 2008 18:55 +0100, Harald van Pee wrote:
> The directory is just not there: "Directory or file not found".
>
> In my opinion there is no error message on the clients which is directly
> related to the problem. On our node0010 I have seen the problem several
> times today; mostly the directory is simply not visible, while (probably)
> all of the other directories can still be accessed at the same time.
>
> Here are all Lustre-related messages from the last days (the others are
> mostly timestamps):
>
> Jan 17 07:41:16 node0010 kernel: Lustre: 5723:0:
> (namei.c:235:ll_mdc_blocking_ast()) More than 1 alias dir 133798800 alias

A quick search in bugzilla for this error message shows bug 12123, which is fixed in the 1.6.1 release, and also has a patch.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
Harald van Pee
2008-Jan-22 08:39 UTC
[Lustre-discuss] files/directories are temporarily unavailable on patchless clients
On Monday 21 January 2008 11:55 pm, Andreas Dilger wrote:
> > Jan 17 07:41:16 node0010 kernel: Lustre: 5723:0:
> > (namei.c:235:ll_mdc_blocking_ast()) More than 1 alias dir 133798800 alias
>
> A quick search in bugzilla for this error message shows bug 12123,
> which is fixed in the 1.6.1 release, and also has a patch.

O.k., this I understand, but is it related to my problem that one directory can only be seen sometimes? If I just try to access the directory and it is not found, I get no error message. It is hard to find out in which situations the directory can be accessed, and whether the error is logged exactly then. At least a complete copy of the directory is not possible, because it is gone again before the copy finishes.

The directory itself is not that important, because it can be reproduced. But what I need is to keep this filesystem alive for at least the next 2, better 3, weeks. Most of the tasks will be read operations, but small write or at least copy operations cannot be avoided.

Any suggestions?

Harald
--
Harald van Pee
Helmholtz-Institut fuer Strahlen- und Kernphysik der Universitaet Bonn
Bernd Schubert
2008-Jan-24 19:13 UTC
[Lustre-discuss] files/directories are temporarily unavailable on patchless clients
Hello Harald,

> Jan 21 18:12:51 node0010 kernel: Lustre: 5717:0:
> (namei.c:235:ll_mdc_blocking_ast()) More than 1 alias dir 134120476 alias 2
> Jan 21 18:12:51 node0010 kernel: Lustre: 5717:0:
> (namei.c:235:ll_mdc_blocking_ast()) Skipped 6 previous similar messages

this looks very much like a real bug (and I don't have time to look into it). I would also guess it is fixed by a more recent Lustre version; I think there have been many changes to the patchless client between 1.6.0.1 and 1.6.1 or 1.6.2. You really can't update your client systems by now?

Cheers,
Bernd
--
Bernd Schubert
Q-Leap Networks GmbH
Harald van Pee
2008-Jan-24 20:11 UTC
[Lustre-discuss] files/directories are temporarily unavailable on patchless clients
On Thursday 24 January 2008 08:13 pm, you wrote:
> Hello Harald,
>
> > Jan 21 18:12:51 node0010 kernel: Lustre: 5717:0:
> > (namei.c:235:ll_mdc_blocking_ast()) More than 1 alias dir 134120476 alias 2
> > Jan 21 18:12:51 node0010 kernel: Lustre: 5717:0:
> > (namei.c:235:ll_mdc_blocking_ast()) Skipped 6 previous similar messages
>
> this looks very much like a real bug (and I don't have time to look into
> it). I would also guess it is fixed by a more recent Lustre version; I
> think there have been many changes to the patchless client between 1.6.0.1
> and 1.6.1 or 1.6.2.
> You really can't update your client systems by now?

Hm, at the moment not; we have urgent jobs running around the clock. We have moved all heavy writing tasks to local disks now, and none of the error messages has occurred since. But it is worth thinking about: updating the clients alone should be possible much earlier than updating all machines, and of course it can be done machine by machine.

However, I would assume that, to be sure no serious filesystem corruption has happened, I should also run a filesystem check on all the OSTs and maybe also the MDT?

But you are right: 1.6.0.1 servers with 1.6.1 clients is a supported configuration, right? So updating the clients asap would be a good idea. Any objections to that?

Harald
--
Harald van Pee
Helmholtz-Institut fuer Strahlen- und Kernphysik der Universitaet Bonn
Harald van Pee
2008-Mar-04 18:52 UTC
[Lustre-discuss] files/directories are temporarily unavailable on patchless clients
Hi,

I have updated all clients to the patched version 1.6.1; the servers are still 1.6.0.1. No Lustre-related error message has occurred since (2 weeks).

I think it is reasonable (necessary?) to e2fsck all OSTs and the MDT? The MDT resides on a drbd device configured for failover.

I now have the following questions:

1. Is there a recommended order for the filesystem checks? MDT first and then the OSTs, or vice versa?

2. If I umount the MDT, should I use -f? I assume no filesystem access will be possible until the MDT is back again. Would it be better to umount all servers and clients and then the MDT?

3. I think each OST can be checked while the others keep working, but I am unsure whether I should use -f to umount or not.

4. Should I unmount all clients? If this is recommended anyway, it is maybe better to stop filesystem access for a couple of hours (2 TB, 70% used) and do the filesystem checks in parallel.

Thanks in advance
Harald

On Monday 21 January 2008 11:55 pm, Andreas Dilger wrote:
> On Jan 21, 2008 18:55 +0100, Harald van Pee wrote:
> > The directory is just not there: "Directory or file not found".
<snip>
> > Jan 17 07:41:16 node0010 kernel: Lustre: 5723:0:
> > (namei.c:235:ll_mdc_blocking_ast()) More than 1 alias dir 133798800 alias
>
> A quick search in bugzilla for this error message shows bug 12123,
> which is fixed in the 1.6.1 release, and also has a patch.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
--
Harald van Pee
Helmholtz-Institut fuer Strahlen- und Kernphysik der Universitaet Bonn
Andreas Dilger
2008-Mar-05 00:06 UTC
[Lustre-discuss] files/directories are temporarily unavailable on patchless clients
On Mar 04, 2008 19:52 +0100, Harald van Pee wrote:
> I have updated all clients to the patched version 1.6.1; the servers are
> still 1.6.0.1. No Lustre-related error message has occurred since (2 weeks).
>
> I think it is reasonable (necessary?) to e2fsck all OSTs and the MDT?
> The MDT resides on a drbd device configured for failover.
>
> I now have the following questions:
>
> 1. Is there a recommended order for the filesystem checks? MDT first and
> then the OSTs, or vice versa?
>
> 2. If I umount the MDT, should I use -f? I assume no filesystem access
> will be possible until the MDT is back again. Would it be better to
> umount all servers and clients and then the MDT?
>
> 3. I think each OST can be checked while the others keep working, but I
> am unsure whether I should use -f to umount or not.
>
> 4. Should I unmount all clients? If this is recommended anyway, it is
> maybe better to stop filesystem access for a couple of hours (2 TB,
> 70% used) and do the filesystem checks in parallel.

If you are expecting to fix the filesystem, it is best to just unmount everything and run e2fsck in parallel. Alternately, you can just force unmount the MDT+OST filesystems and let the clients hang until the MDT+OSTs are restarted, but this can be more troublesome in some cases.

> On Monday 21 January 2008 11:55 pm, Andreas Dilger wrote:
<snip>

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
Harald van Pee
2008-Mar-05 00:19 UTC
[Lustre-discuss] files/directories are temporarily unavailable on patchless clients
On Wednesday 05 March 2008 01:06 am, Andreas Dilger wrote:
> On Mar 04, 2008 19:52 +0100, Harald van Pee wrote:
<snip>
>
> If you are expecting to fix the filesystem, it is best to just unmount
> everything and run e2fsck in parallel. Alternately, you can just force
> unmount the MDT+OST filesystems and let the clients hang until the
> MDT+OSTs are restarted, but this can be more troublesome in some cases.

O.k., thanks. Then I will unmount all clients first, then all OSTs, and the MDT last. If possible, should I try to avoid the -f flag?

<snip>

--
Harald van Pee
Helmholtz-Institut fuer Strahlen- und Kernphysik der Universitaet Bonn
Andreas Dilger
2008-Mar-05 06:42 UTC
[Lustre-discuss] files/directories are temporarily unavailable on patchless clients
On Mar 05, 2008 01:19 +0100, Harald van Pee wrote:
> On Wednesday 05 March 2008 01:06 am, Andreas Dilger wrote:
> > If you are expecting to fix the filesystem, it is best to just unmount
> > everything and run e2fsck in parallel. Alternately, you can just force
> > unmount the MDT+OST filesystems and let the clients hang until the
> > MDT+OSTs are restarted, but this can be more troublesome in some cases.
>
> O.k., thanks. Then I will unmount all clients first, then all OSTs, and
> the MDT last.

Actually, it is better to unmount clients, then the MDT, then the OSTs last, because the MDT is a "client" on the OSTs.

> If possible, should I try to avoid the -f flag?

You shouldn't need to use -f if you unmount in the above order.

<snip>

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
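[Put together, the shutdown and check sequence from the two answers above would look roughly like this. This is only a sketch: the mount points and device names are placeholders, and the e2fsck here is assumed to be the Lustre-patched e2fsprogs:

  # 1. on every client
  umount /mnt/lustre

  # 2. on the MDS (the MDT itself acts as a client of the OSTs, so it goes second)
  umount /mnt/mdt

  # 3. on each OSS, for every OST it serves
  umount /mnt/ost0          # repeat for the other OST mounts

  # 4. run the checks in parallel, one per target (read-only first pass)
  e2fsck -fn /dev/mdtdev    # on the MDS
  e2fsck -fn /dev/ostdev0   # on each OSS, once per OST device

Unmounting in this order should make the -f (force) umount flag unnecessary, as noted above.]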
Harald van Pee
2008-Mar-12 10:07 UTC
[Lustre-discuss] files/directories are temporarily unavailable on patchless clients
On Wednesday 05 March 2008 07:42 am, Andreas Dilger wrote:
> On Mar 05, 2008 01:19 +0100, Harald van Pee wrote:
> > On Wednesday 05 March 2008 01:06 am, Andreas Dilger wrote:
<snip>
> > > If you are expecting to fix the filesystem, it is best to just unmount
> > > everything and run e2fsck in parallel. Alternately, you can just force
> > > unmount the MDT+OST filesystems and let the clients hang until the
> > > MDT+OSTs are restarted, but this can be more troublesome in some cases.
> >
> > O.k., thanks. Then I will unmount all clients first, then all OSTs, and
> > the MDT last.
>
> Actually, it is better to unmount clients, then MDT, then OSTs last,
> because the MDT is a "client" on the OSTs.
>
> > If possible, should I try to avoid the -f flag?
>
> You shouldn't need to use -f if you unmount in the above order.

O.k., done! There was no error on the MDS, and only a single error on some of the OSTs:

  Inode 2 has a extra size (2) which is invalid
  Fix? yes

Then I wanted to run lfsck, but on Debian etch I got the following error:

  lfsck -n -v --mdsdb ./mdsdb --ostdb ./ostdb0 ./ostdb1 ./ostdb2 ./ostdb3 ./ostdb4 ./ostdb5 ./ostdb6 ./ostdb7 ./ostdb8 ./ostdb9 /mnt/
  lfsck 1.40.4.cfs1 (31-Dec-2007)
  MDSDB: ./mdsdb
  OSTDB[0]: ./ostdb0
  OSTDB[1]: ./ostdb1
  OSTDB[2]: ./ostdb2
  OSTDB[3]: ./ostdb3
  OSTDB[4]: ./ostdb4
  OSTDB[5]: ./ostdb5
  OSTDB[6]: ./ostdb6
  OSTDB[7]: ./ostdb7
  OSTDB[8]: ./ostdb8
  OSTDB[9]: ./ostdb9
  MOUNTPOINT: /mnt/
  lfsck: symbol lookup error: lfsck: undefined symbol: db_env_create

On Red Hat I got:

  lfsck -n -v --mdsdb ./mdsdb --ostdb ./ostdb0 ./ostdb1 ./ostdb2 ./ostdb3 ./ostdb4 ./ostdb5 ./ostdb6 ./ostdb7 ./ostdb8 ./ostdb9 /mnt/
  lfsck 1.40.4.cfs1 (31-Dec-2007)
  MDSDB: ./mdsdb
  OSTDB[0]: ./ostdb0
  OSTDB[1]: ./ostdb1
  OSTDB[2]: ./ostdb2
  OSTDB[3]: ./ostdb3
  OSTDB[4]: ./ostdb4
  OSTDB[5]: ./ostdb5
  OSTDB[6]: ./ostdb6
  OSTDB[7]: ./ostdb7
  OSTDB[8]: ./ostdb8
  OSTDB[9]: ./ostdb9
  MOUNTPOINT: /mnt/
  error: can't get lov name.: Inappropriate ioctl for device (25)

Any ideas?

Harald

<snip>

--
Harald van Pee
Helmholtz-Institut fuer Strahlen- und Kernphysik der Universitaet Bonn
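[For completeness: the mdsdb/ostdb files that lfsck consumes are produced beforehand by the Lustre-patched e2fsck. A rough sketch of the full sequence, with placeholder device names and options that may differ between e2fsprogs-cfs releases:

  # on the MDS, with the MDT unmounted: build the MDS database
  e2fsck -n -v --mdsdb ./mdsdb /dev/mdtdev

  # on each OSS, with the OST unmounted: build that OST's database against the MDS db
  e2fsck -n -v --mdsdb ./mdsdb --ostdb ./ostdb0 /dev/ostdev0

  # on a client with the filesystem mounted at /mnt/: cross-check MDS and OST databases
  lfsck -n -v --mdsdb ./mdsdb --ostdb ./ostdb0 ./ostdb1 /mnt/

db_env_create is a Berkeley DB symbol, so the Debian failure above looks like a libdb linkage mismatch in the lfsck binary; that reading is an assumption, not something verified in this thread.]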