I have a system that's been spitting out OST disconnect messages under heavy load. I'm guessing the OST eventually reconnects. I want to say this happens when the OSS is extremely overloaded, but I did notice it happening even under light load. Only the OSS seems to spit out any error messages; I don't see anything on the client side.

Should I be concerned? Or does this typically happen at other sites too?

-Alex

clip off one of the OSS:

Aug 13 17:26:48 lustre-oss-0-1 kernel: LustreError: 137-5: UUID 'lfs-OST0004_UUID' is not available for connect (no target)
Aug 13 17:26:48 lustre-oss-0-1 kernel: LustreError: 11094:0:(ldlm_lib.c:1536:target_send_reply_msg()) @@@ processing error (-19) req@ffff8101f4570600 x54/t0 o8-><?>@<?>:0/0 lens 240/0 e 0 to 0 dl 1218616308 ref 1 fl Interpret:/0/0 rc -19/0
Aug 13 17:26:48 lustre-oss-0-1 kernel: LustreError: 11094:0:(ldlm_lib.c:1536:target_send_reply_msg()) Skipped 3 previous similar messages
Aug 13 17:26:48 lustre-oss-0-1 kernel: LustreError: Skipped 3 previous similar messages
Aug 13 17:48:56 lustre-oss-0-1 kernel: LustreError: 137-5: UUID 'lfs-OST0004_UUID' is not available for connect (no target)
Aug 13 17:48:56 lustre-oss-0-1 kernel: LustreError: 10984:0:(ldlm_lib.c:1536:target_send_reply_msg()) @@@ processing error (-19) req@ffff81010fc86600 x50/t0 o8-><?>@<?>:0/0 lens 240/0 e 0 to 0 dl 1218617636 ref 1 fl Interpret:/0/0 rc -19/0
Aug 13 17:48:56 lustre-oss-0-1 kernel: LustreError: 10984:0:(ldlm_lib.c:1536:target_send_reply_msg()) Skipped 1 previous similar message
Aug 13 17:48:56 lustre-oss-0-1 kernel: LustreError: Skipped 1 previous similar message
Aug 13 18:47:39 lustre-oss-0-1 kernel: LustreError: 137-5: UUID 'lfs-OST0005_UUID' is not available for connect (no target)
Aug 13 18:47:39 lustre-oss-0-1 kernel: LustreError: 11070:0:(ldlm_lib.c:1536:target_send_reply_msg()) @@@ processing error (-19) req@ffff81022861b400 x49/t0 o8-><?>@<?>:0/0 lens 240/0 e 0 to 0 dl 1218621159 ref 1 fl Interpret:/0/0 rc -19/0
Aug 13 18:47:39 lustre-oss-0-1 kernel: LustreError: 11070:0:(ldlm_lib.c:1536:target_send_reply_msg()) Skipped 1 previous similar message

Different OSS:

Aug 12 20:13:49 lustre-oss-6-0 kernel: LustreError: 137-5: UUID 'lfs-OST0050_UUID' is not available for connect (no target)
Aug 12 20:13:49 lustre-oss-6-0 kernel: LustreError: 13527:0:(ldlm_lib.c:1536:target_send_reply_msg()) @@@ processing error (-19) req@ffff8103d3b79a00 x124/t0 o8-><?>@<?>:0/0 lens 240/0 e 0 to 0 dl 1218539929 ref 1 fl Interpret:/0/0 rc -19/0
Aug 12 20:13:49 lustre-oss-6-0 kernel: LustreError: 13527:0:(ldlm_lib.c:1536:target_send_reply_msg()) Skipped 1 previous similar message
Aug 12 20:13:49 lustre-oss-6-0 kernel: LustreError: Skipped 1 previous similar message
Aug 12 20:13:55 lustre-oss-6-0 kernel: LustreError: 137-5: UUID 'lfs-OST004f_UUID' is not available for connect (no target)
Aug 12 20:13:55 lustre-oss-6-0 kernel: LustreError: 13521:0:(ldlm_lib.c:1536:target_send_reply_msg()) @@@ processing error (-19) req@ffff8103d3e92a00 x125/t0 o8-><?>@<?>:0/0 lens 240/0 e 0 to 0 dl 1218539935 ref 1 fl Interpret:/0/0 rc -19/0
Aug 12 20:13:55 lustre-oss-6-0 kernel: LustreError: 13521:0:(ldlm_lib.c:1536:target_send_reply_msg()) Skipped 1 previous similar message
Aug 12 20:13:55 lustre-oss-6-0 kernel: LustreError: Skipped 1 previous similar message
Aug 12 20:13:58 lustre-oss-6-0 kernel: LustreError: 137-5: UUID 'lfs-OST004f_UUID' is not available for connect (no target)
Aug 12 20:13:58 lustre-oss-6-0 kernel: LustreError: 28121:0:(ldlm_lib.c:1536:target_send_reply_msg()) @@@ processing error (-19) req@ffff8103d3983c00 x125/t0 o8-><?>@<?>:0/0 lens 240/0 e 0 to 0 dl 1218539938 ref 1 fl Interpret:/0/0 rc -19/0
Aug 12 20:13:58 lustre-oss-6-0 kernel: LustreError: 28121:0:(ldlm_lib.c:1536:target_send_reply_msg()) Skipped 5 previous similar messages
On Wed, 2008-08-13 at 22:39 +0900, Alex Lee wrote:
> I have a system that's been spitting out OST disconnect messages under
> heavy load. I'm guessing the OST eventually reconnects.
> [...]
> Aug 13 17:26:48 lustre-oss-0-1 kernel: LustreError: 137-5: UUID
> 'lfs-OST0004_UUID' is not available for connect (no target)

This means that the device an OSS is using for an OST has become unavailable (i.e. ENODEV -- No such device). The question becomes, why? What kind of disk is the OST? You might want to look into your storage hardware's logs to see if there is any indication of trouble.

b.
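As an aside, the "-19" in those log lines is just a negated Linux errno value, so the mapping to ENODEV can be confirmed from any Python prompt (nothing Lustre-specific involved):

```python
import errno
import os

# Lustre logs kernel return codes as negative errno values,
# so rc -19 corresponds to errno 19.
rc = -19
name = errno.errorcode[-rc]   # symbolic name: ENODEV
message = os.strerror(-rc)    # the platform's description of errno 19
print(name, "-", message)
```

On Linux this prints "ENODEV - No such device", matching the "(no target)" text in the 137-5 console message.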
The disks are DDN S2A 9550. I didn't see any issues on the hardware side, and my IO tests, which were using all of the OSTs, finished fine. Could a long IO wait cause an error like that? I've seen the OSS load hit 300 on an IO stress test.

-Alex
> clip off one of the OSS:
>
> Aug 13 17:26:48 lustre-oss-0-1 kernel: LustreError: 137-5: UUID
> 'lfs-OST0004_UUID' is not available for connect (no target)

I don't see any hardware issues. Could this be caused by extremely high load? I'm seeing up to 300 load on a dual-processor quad-core Opteron.

Is there some sort of check ping that the OSS does on the OST? If there is, is there any way I can extend the timeout?

Thanks,
-Alex
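On the timeout question: the global Lustre RPC timeout (obd_timeout) governs how long requests may take before clients and servers give up and reconnect. A rough sketch of inspecting and raising it on a 1.6-era system follows; the value 300 is purely illustrative, and whether `lctl set_param` or the procfs path applies depends on the exact release, so check your version's documentation:

```shell
# Inspect the current global Lustre timeout (seconds)
lctl get_param timeout

# Raise it on the running system (not persistent across reboots)
lctl set_param timeout=300

# Older releases expose the same tunable via procfs:
cat /proc/sys/lustre/timeout
echo 300 > /proc/sys/lustre/timeout
```

Note that masking overload symptoms with a larger timeout does not remove the underlying load problem; it only makes evictions less likely during transient stalls.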
Hi everybody

We just hit the same problem last evening: the OSTs were suddenly disconnecting from the OSS. I saw that we have manually limited the number of OSS threads to 128, while we are exporting 4 OSTs on that server and the file system is mounted by about 100 clients. I think this may be an issue? Were you able to find the reason for your errors? I will now remove this thread limitation and see if it helps.

Kind regards
Reto Gantenbein

On Aug 13, 2008, at 3:39 PM, Alex Lee wrote:
> I have a system that's been spitting out OST disconnect messages under
> heavy load. I'm guessing the OST eventually reconnects.
> [...]
> Aug 13 17:26:48 lustre-oss-0-1 kernel: LustreError: 137-5: UUID
> 'lfs-OST0004_UUID' is not available for connect (no target)
> [...]

--
Universität Bern
Abt. Informatikdienste
Gruppe Zentrale Systeme

Reto Gantenbein
Administrator UBELIX
Gesellschaftsstrasse 6
CH-3012 Bern
Raum -104
Tel. +41 (0)31 631 87 97
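If the OSS thread limit is the suspect, the cap is normally set as a module option on the OSS at module load time. A hedged sketch for a 1.6-era setup follows; the value 256 is illustrative, and the parameter name and the `lctl` paths should be verified against your release's manual:

```shell
# /etc/modprobe.conf on the OSS: set the OST service thread count.
# oss_num_threads is the ost module option used by 1.6-era Lustre.
options ost oss_num_threads=256
```

On releases that expose the thread counters through `lctl`, the running state can be checked with something like `lctl get_param ost.OSS.*.threads_started ost.OSS.*.threads_max` to see whether the service is actually pinned at its configured maximum.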