Alex Lee
2008-Aug-27 05:56 UTC
[Lustre-discuss] Seeing OST errors on the OSS that doesnt have it mounted
I'm getting these error messages on my OSS. However, OST0006 is not mounted on this OSS (lustre-oss-0-0). OST0006 is visible to the server because lustre-oss-0-0 is the failover node for lustre-oss-0-1, which does mount OST0006.

Should I be worried about these errors? I don't understand why the OSS is even reporting them, since there is no hardware issue that I can see. Also, since the OST is not mounted on that OSS, I would have thought it was "invisible" to the OSS.

I only get these errors during Lustre usage. When the filesystem is not in use I never get any errors. A clip of the log is below.

Thanks,
-Alex

Aug 23 12:27:52 lustre-oss-0-0 kernel: LustreError: 2918:0:(ldlm_lib.c:1536:target_send_reply_msg()) @@@ processing error (-19) req@ffff81026f189a00 x52/t0 o8-><?>@<?>:0/0 lens 240/0 e 0 to 0 dl 1219462372 ref 1 fl Interpret:/0/0 rc -19/0
Aug 23 12:27:58 lustre-oss-0-0 kernel: LustreError: 137-5: UUID 'lfs-OST0006_UUID' is not available for connect (no target)
Aug 23 12:27:58 lustre-oss-0-0 kernel: LustreError: 21174:0:(ldlm_lib.c:1536:target_send_reply_msg()) @@@ processing error (-19) req@ffff810211ddb050 x52/t0 o8-><?>@<?>:0/0 lens 240/0 e 0 to 0 dl 1219462378 ref 1 fl Interpret:/0/0 rc -19/0
Aug 23 12:28:07 lustre-oss-0-0 kernel: LustreError: 137-5: UUID 'lfs-OST0006_UUID' is not available for connect (no target)
Aug 23 12:28:07 lustre-oss-0-0 kernel: LustreError: 22646:0:(ldlm_lib.c:1536:target_send_reply_msg()) @@@ processing error (-19) req@ffff81015e1e6200 x52/t0 o8-><?>@<?>:0/0 lens 240/0 e 0 to 0 dl 1219462387 ref 1 fl Interpret:/0/0 rc -19/0
Andreas Dilger
2008-Aug-27 07:51 UTC
[Lustre-discuss] Seeing OST errors on the OSS that doesnt have it mounted
On Aug 26, 2008 22:56 -0700, Alex Lee wrote:
> I'm getting these error messages on my OSS. However the OST0006 is not mounted on this OSS (lustre-oss-0-0). OST0006 is visible to the server because lustre-oss-0-0 is the failover node for lustre-oss-0-1 which does mount OST0006.
>
> Should I be worried about these errors? I don't understand why the OSS is even giving these errors out since there is no hardware issue that I can see. Also the OST is not mounted on that OSS I would think its "invisible" to the OSS.
>
> I only get these errors during lustre usage. When the filesystem is not used I never get any errors.

When a client has a problem accessing a service (OST or MDT) on the primary node (e.g. RPC timeout) it will retry on the same node first, then try the backup, and continue to try both until one of them answers...

> Aug 23 12:27:52 lustre-oss-0-0 kernel: LustreError: 2918:0:(ldlm_lib.c:1536:target_send_reply_msg()) @@@ processing error (-19) req@ffff81026f189a00 x52/t0 o8-><?>@<?>:0/0 lens 240/0 e 0 to 0 dl 1219462372 ref 1 fl Interpret:/0/0 rc -19/0

The fact that lustre-oss-0-0 returns -ENODEV (-19) isn't a reason to stop trying there, because it may take some time for the OST to fail over from the primary server to the backup.

What this really means is that your primary server is having network trouble, or is so severely overloaded that the client has given up on it and is trying the backup. It could also be a problem on the client, I guess.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
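
To see which targets clients are trying to reach on the failover node, one option is to count the "not available for connect" messages per UUID. This is only a minimal sketch: `count_connect_failures` is a made-up helper name, and the log path in the usage line is an assumption about where syslog writes kernel messages.

```shell
# Hypothetical helper: count "is not available for connect" messages
# per target UUID. Reads log lines on stdin, so any log source works.
count_connect_failures() {
    grep 'LustreError: 137-5: UUID' |
        sed -n "s/.*UUID '*\([^']*\)'* is not available.*/\1/p" |
        sort | uniq -c | sort -rn
}

# Usage (the log path is an assumption, adjust for your system):
# count_connect_failures < /var/log/messages
```

A UUID with a steadily growing count on the backup OSS would be consistent with clients repeatedly giving up on the primary for that target.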
Alex Lee
2008-Aug-28 03:41 UTC
[Lustre-discuss] Seeing OST errors on the OSS that doesnt have it mounted
Andreas Dilger wrote:
> What this really means is that your primary server is having network
> trouble, or is so severely overloaded that the client has given up on
> it and is trying the backup. It could also be a problem on the client
> I guess.

Andreas,

Thanks for the info. Is there any documentation on how to decode the error messages? I feel bad about posting to the list for every single error message I don't understand.

Thanks,
-Alex
Bernd Schubert
2008-Aug-28 10:03 UTC
[Lustre-discuss] Seeing OST errors on the OSS that doesnt have it mounted
On Thursday 28 August 2008 05:41:24 Alex Lee wrote:
> Is there any documentation on how to decode the error messages? I feel
> bad about posting to the list for every single error message I don't
> understand.

I don't think so, you have the source ;)

Nathaniel Rutman posted a quite useful bash errno function some time ago:

    # errnos: look up an errno number in the system headers
    function errno() {
        # quote the pattern so the shell does not expand it before find sees it
        for i in $(find /usr/include -name 'errno*.h'); do
            # expand turns tabs into spaces so the number can be matched as a word
            expand "$i" | grep " $1 "
        done
    }

So to find out what error 19 is:

    bschubert@lanczos ~> errno 19
    #define ENODEV 19 /* No such device */

Cheers, Bernd
--
Bernd Schubert
Q-Leap Networks GmbH
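
Lustre logs print the codes as negative numbers (e.g. -19), so a small variation on the same idea can strip the sign first. This is a sketch under assumptions, not a tested tool: `errno_lookup` is a hypothetical name, and the default header directory `/usr/include` is an assumption about the system.

```shell
# Hypothetical variant: accept negative codes such as -19 and search a
# header directory (default /usr/include, an assumption) for the name.
errno_lookup() {
    local code dir
    code=${1#-}                 # strip a leading minus sign, if present
    dir=${2:-/usr/include}
    find "$dir" -name 'errno*.h' -exec cat {} + 2>/dev/null |
        expand |                # tabs to spaces, so the regex below is simple
        grep -E "^#define +E[A-Z0-9]+ +$code( |\$)" |
        sort -u                 # the same define may appear in several headers
}

# Usage:
# errno_lookup -19
```

Anchoring the match on `#define` and on the end of the number avoids hits such as 190 when looking for 19.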
Andreas Dilger
2008-Aug-28 17:09 UTC
[Lustre-discuss] Seeing OST errors on the OSS that doesnt have it mounted
On Aug 28, 2008 12:41 +0900, Alex Lee wrote:
> Andreas Dilger wrote:
>>> Aug 23 12:27:52 lustre-oss-0-0 kernel: LustreError: 2918:0:(ldlm_lib.c:1536:target_send_reply_msg()) @@@ processing error (-19) req@ffff81026f189a00 x52/t0 o8-><?>@<?>:0/0 lens 240/0 e 0 to 0 dl 1219462372 ref 1 fl Interpret:/0/0 rc -19/0
>>
>> The fact that lustre-oss-0-0 returns -ENODEV (-19) isn't a reason to stop
>> trying there, because it may take some time for OST to failover from primary
>> server to backup.
>>
>> What this really means is that your primary server is having network
>> trouble, or is so severely overloaded that the client has given up on
>> it and is trying the backup. It could also be a problem on the client
>> I guess.
>
> Is there any documentation on how to decode the error messages? I feel
> bad about posting to the list for every single error message I don't
> understand.

There is some documentation about error messages in the manual (in particular how to decode the above RPC message).

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
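
As a rough illustration of pulling the key fields out of such a request line, the sketch below prints the opcode and return code. The field reading is an assumption based on the layout of the message (the number after `o` taken as the RPC opcode, the first number after `rc` as the return code), not something quoted from the manual, and `rpc_fields` is a hypothetical helper name.

```shell
# Hypothetical helper: print opcode and return code from LustreError
# request lines read on stdin. The field meanings are assumptions.
rpc_fields() {
    sed -n 's/.* o\([0-9][0-9]*\)->.* rc \(-\{0,1\}[0-9][0-9]*\)\/.*/opcode=\1 rc=\2/p'
}

# Usage (the log path is an assumption):
# grep 'processing error' /var/log/messages | rpc_fields
```

For the log line quoted in this thread it prints `opcode=8 rc=-19`; the rc value can then be looked up as an errno.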