Hi there!

In the messages file on my object storage servers, I find many "processing errors". They are all -16, -19 or -30. They all seem to come from ldlm_lib.c:1826:target_send_reply_msg, which seems to be the only place in the source code where the string "processing error" occurs.

I haven't found any document so far that explains those error numbers. I also grepped through the source code, but since I'm no programmer at all, I'm not good at browsing source code efficiently :-)

Background: I have 3 file systems with 2 OSTs each, distributed among 3 OSDs. RHEL 5.3, Lustre 1.8.1 installed from RPMs. The OSTs are in an HA configuration with Linux HA cluster (version 2), and some OSTs have trouble on failover (the OST won't start). Most (if not all) of these processing errors occur while an OST resource is being switched over. I'd like to find out what's going wrong here.

Any help/hints appreciated.

wolfgang

On Thu, 2009-08-13 at 15:43 +0200, Wolfgang Stief wrote:
> Hi there!

Hi,

> I didn't find any document so far, where those error numbers are
> explained. I did also some grep through the source code. Since I'm no
> programmer at all, I'm not familiar in browsing through source code
> efficiently :-)

Without seeing the full context of the messages to which you are referring, I can say that those values look like "errno"s.

I use the following little script to resolve errnos to symbolic types:

    #!/bin/bash
    grep " $1" /usr/include/*/*errno*

b.

Hello!

On Thu, 13 Aug 2009 10:23:57 -0400 "Brian J. Murrell" <Brian.Murrell at Sun.COM> wrote:
> > I didn't find any document so far, where those error numbers are
> > explained. I did also some grep through the source code. Since I'm
> > no programmer at all, I'm not familiar in browsing through source
> > code efficiently :-)
>
> Without seeing the full context of the messages to which you are
> referring, I can say that those values look like "errno"s.

OK, good hint, that sounds reasonable. So each error is one of "resource busy" (-16), "no such device" (-19) or "read-only filesystem" (-30).

Log file entries look like this (IP address changed to x.x.x.x):

Aug 9 22:09:45 sososd2 kernel: Lustre: 19798:0:(ldlm_lib.c:815:target_handle_connect()) lustre-OST0000: refuse reconnection from 2084b6a3-47a3-ffb8-bc0f-640589c690ec@x.x.x.x@tcp to 0xffff810030576000; still busy with 12 active RPCs

Aug 9 22:09:45 osd2 kernel: LustreError: 19798:0:(ldlm_lib.c:1826:target_send_reply_msg()) @@@ processing error (-16) req@ffff8101249a3000 x2140997709/t 0 o8->2084b6a3-47a3-ffb8-bc0f-640589c690ec@:0/0 lens 368/264 e 0 to 0 dl 1249870285 ref 1 fl Interpret:/0/0 rc -16/0

Aug 9 22:09:45 osd2 kernel: LustreError: 19798:0:(ldlm_lib.c:1826:target_send_reply_msg()) Skipped 11 previous similar messages

wolfgang
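
For anyone who wants to see which of these errno values dominates in a log, here is a minimal sketch that counts "processing error (-N)" lines per code and prints the symbolic name next to each. The function names errno_name and summarize_processing_errors are made up for this example, and the lookup table is hard-coded for only the three Linux errno values mentioned in this thread (16, 19, 30); anything else is reported as unknown.

```shell
#!/bin/bash
# Hypothetical helper: map the three errno values seen in this thread to
# their Linux symbolic names. The table is hard-coded for illustration only.
errno_name() {
    local n=${1#-}          # accept "-16" as well as "16"
    case "$n" in
        16) echo "EBUSY (Device or resource busy)" ;;
        19) echo "ENODEV (No such device)" ;;
        30) echo "EROFS (Read-only file system)" ;;
        *)  echo "unknown errno $n; look it up in the errno headers" ;;
    esac
}

# Hypothetical helper: read log lines on stdin, pull the numeric code out of
# each "processing error (-N)" line, and print "count code name" per code.
summarize_processing_errors() {
    sed -n 's/.*processing error (\(-[0-9][0-9]*\)).*/\1/p' |
        sort | uniq -c |
        while read -r count code; do
            echo "$count $code $(errno_name "$code")"
        done
}
```

Assuming the kernel messages end up in /var/log/messages (the usual place on RHEL, but check your syslog configuration), it could be used as: summarize_processing_errors < /var/log/messages. Note that lines such as "Skipped 11 previous similar messages" are not counted, so the totals are a lower bound.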