Jody McIntyre
2006-Dec-28 14:46 UTC
[Lustre-devel] RFC: Proposed format for Lustre messages
RATIONALE:

To improve the serviceability of Lustre, we need a common format for messages (errors and information) printed by Lustre.

GOALS:

- Consistent, non-reusable message IDs. Each ID must uniquely identify a message. This will make it easy to find messages in troubleshooting documentation, and allow us to produce a web-based tool that can interpret messages automatically.

- It would be nice to have a consistent format for all variables printed as part of a message.

DETAILS OF PROPOSED FORMAT:

Messages are of the form:

  Lustre [ID 1234]: MESSAGE

Messages must be as readable and useful to a Lustre administrator as is practicable. Messages should not contain any information that is only useful to engineers familiar with Lustre internals - these should be saved to the debug log instead.

The message ID (1234 above) is a 4-digit decimal number. This should meet our needs for several years. To allow for future growth, message IDs beyond 4 digits are OK and must be accepted by any tool that parses these messages.

Message IDs must never be reused. If a message or its format changes, a new message ID must be allocated and used instead.

These message IDs will be assigned using a page on the Lustre wiki listing the engineer allocating the message and the source file in which it will be used. This will become obsolete quickly but that's OK - we only need to know that a message ID has been "claimed", by whom, and for what purpose.

Once code that prints a message using a particular message ID has been committed to any branch of CVS, the format of the message may no longer be changed. Details of what the message means and how to interpret any variables in the message must be sent to an email address (to be determined). This will go to the team(s) responsible for updating the troubleshooting documentation and the web-based analysis tool.

Variables printed as part of messages must be formatted so that parsing by the web-based analysis tool can be done with regular expressions, and should be as readable and grammatical as possible. The value of the variable can be inserted as part of a text message, or it can be printed in the format 'name: value', depending on which is more legible. Values need not be human-readable (for example, printing -ERRNO return codes is still acceptable) provided that they can be translated into human-readable form by the web-based analysis tool.

EXAMPLE MESSAGES:

  Lustre [ID 1000]: Lustre version 1.5.97 loaded

  Lustre [ID 1234]: Server handling error on server foo@o2ib:
  transaction 11602746/0, opcode 42 returned -2
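As an illustration of the "parse with regular expressions" requirement, here is a minimal Python sketch of how an analysis tool might read the proposed format. Nothing here comes from an actual Lustre tool: the names `MESSAGE_RE`, `SERVER_ERROR_RE`, `parse_line` and `describe_rc` are hypothetical, and the per-message pattern for ID 1234 is an assumption based only on the example message above. The ID pattern accepts four or more digits, as the proposal requires.

```python
import errno
import re

# Assumed pattern for "Lustre [ID 1234]: MESSAGE".
# \d{4,} accepts 4-digit IDs and any longer ID allocated in the future.
MESSAGE_RE = re.compile(r"^Lustre \[ID (?P<id>\d{4,})\]: (?P<text>.*)$")

# Hypothetical per-ID pattern for the second example message (ID 1234).
SERVER_ERROR_RE = re.compile(
    r"^Server handling error on server (?P<nid>\S+): "
    r"transaction (?P<xid>\d+)/(?P<sub>\d+), "
    r"opcode (?P<opcode>\d+) returned (?P<rc>-?\d+)$"
)


def parse_line(line):
    """Return (message_id, text), or None if the line is not in the format."""
    m = MESSAGE_RE.match(line)
    if m is None:
        return None
    return int(m.group("id")), m.group("text")


def describe_rc(rc):
    """Translate a -ERRNO return code into its symbolic name, if known."""
    return errno.errorcode.get(-rc, "unknown")
```

A tool built this way would first call `parse_line` to obtain the ID, then select the per-ID pattern to pull out variables such as the NID and the return code, and finally pass the return code through something like `describe_rc` to make it human-readable.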
>>>>> Jody McIntyre (JM) writes:

 JM> DETAILS OF PROPOSED FORMAT:
 JM> Messages are of the form: Lustre [ID 1234]: MESSAGE
 JM> Messages must be as readable and useful to a Lustre administrator as
 JM> is practicable. Messages should not contain any information that is
 JM> only useful to engineers familiar with Lustre internals - these should
 JM> be saved to the debug log instead.

Would it be helpful to add a standard mnemonic such as 'ha' (recovery) or 'lnet' just before the ID?

thanks, Alex
Jody McIntyre
2006-Dec-28 15:17 UTC
[Lustre-devel] RFC: Proposed format for Lustre messages
Hi Alex,

On Fri, Dec 29, 2006 at 12:50:39AM +0300, Alex Tomas wrote:
> would it be helpful to add some standard mnemonic like 'ha' (recovery)
> or 'lnet' just before ID ?

That could be useful, but it may just add noise. Ideally, the message itself will be readable enough that a mnemonic is unnecessary, so I lean towards leaving it out.

Do you have an example of a message where such a mnemonic would be useful?

Cheers,
Jody

> thanks, Alex
Nathaniel Rutman
2006-Dec-28 16:05 UTC
[Lustre-devel] RFC: Proposed format for Lustre messages
Jody McIntyre wrote:
> Once code that prints a message using a particular message ID has been
> committed to any branch of CVS, the format of the message may no
> longer be changed. Details of what the message means and how to
> interpret any variables in the message must be sent to an email
> address (to be determined.) This will go to the team(s) responsible
> for updating the troubleshooting documentation and the web-based
> analysis tool.

Can we relax this to say "no substantive changes"? If we fix a typo or even wording:

  "[1234] lustre has dropped the ball and erased all your data"

to

  "[1234] an irrecoverable error has occurred and erased all filesystem data"

it doesn't seem to me that it should require a new message number. Parsing tools should just check the number and not worry about the exact contents of the message. I think that's the whole point of having a [number] in the first place.

I'll agree that for messages like

  Lustre [ID 1234]: Server handling error on server foo@o2ib:
  transaction 11602746/0, opcode 42 returned -2

we can't change the keyword preceding and data items
Jody McIntyre
2006-Dec-28 16:29 UTC
[Lustre-devel] RFC: Proposed format for Lustre messages
Hi Nathan,

> Can we relax this to say "no substantive changes"?
> If we fix a typo or even wording:
> "[1234] lustre has dropped the ball and erased all your data" to
> "[1234] an irrecoverable error has occurred and erased all filesystem data"
> doesn't seem to me that it should require a new message number. Parsing
> tools should
> just check the number and not worry about the exact contents of the message.
> I think that's the whole point of having a [number] in the first place.

Perhaps, but what about a typo fix, for example:

-Lustre [ID 1234]: Server handling error on servr foo@o2ib:
+Lustre [ID 1234]: Server handling error on server foo@o2ib:
 transaction 11602746/0, opcode 42 returned -2

Looks innocent enough, except the web-based parser may be depending on the word "servr" to pick out the "foo@o2ib" NID.

I think to avoid problems like this, we need to ban reuse across the board. I don't think Lustre messages are changed often enough that we need to worry about running out of numbers.

Cheers,
Jody

> I'll agree that for messages like
>
> Lustre [ID 1234]: Server handling error on server foo@o2ib:
> transaction 11602746/0, opcode 42 returned -2
>
> we can't change the keyword preceding and data items
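The failure mode Jody describes can be reproduced in a few lines of Python. This is only a sketch under assumptions: `OLD_PATTERN` and `extract_nid` are hypothetical names, and the pattern stands in for whatever a shipped analysis tool might have written against the misspelled text.

```python
import re

# Hypothetical pattern an analysis tool might have shipped for ID 1234,
# written against the original (misspelled) message text.
OLD_PATTERN = re.compile(r"Server handling error on servr (?P<nid>\S+):")

BEFORE = ("Lustre [ID 1234]: Server handling error on servr foo@o2ib: "
          "transaction 11602746/0, opcode 42 returned -2")
AFTER = ("Lustre [ID 1234]: Server handling error on server foo@o2ib: "
         "transaction 11602746/0, opcode 42 returned -2")


def extract_nid(line):
    """Return the NID if the shipped pattern matches, else None."""
    m = OLD_PATTERN.search(line)
    return m.group("nid") if m else None


print(extract_nid(BEFORE))  # NID found in the misspelled message
print(extract_nid(AFTER))   # pattern no longer matches after the typo fix
```

The ID is identical in both lines, so a tool that selects its pattern by ID alone has no way to tell which wording it is looking at. That is the argument for allocating a new ID on any change to the text.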
Andreas Dilger
2006-Dec-28 16:43 UTC
[Lustre-devel] RFC: Proposed format for Lustre messages
On Dec 28, 2006 16:46 -0500, Jody McIntyre wrote:
> These message IDs will be assigned using a page on the Lustre wiki
> listing the engineer allocating the message and the source file in
> which it will be used. This will become obsolete quickly but that's
> OK - we only need to know that a message ID has been "claimed", by
> whom, and for what purpose.

Other companies use messages of the form MMMM-NNNN (component-msgnum) so that messages can be allocated separately between different pieces of software or components (e.g. MGS, LMV, or other parts of the code that are being worked on independently).

> Once code that prints a message using a particular message ID has been
> committed to any branch of CVS, the format of the message may no
> longer be changed. Details of what the message means and how to
> interpret any variables in the message must be sent to an email
> address (to be determined.) This will go to the team(s) responsible
> for updating the troubleshooting documentation and the web-based
> analysis tool.

One issue with this is that if a message is unclear or otherwise lacking information and needs to be fixed, then it presumably needs a new message ID. That in turn means that the message database will have duplicate information, or there needs to be a facility to link different messages together like "XXXX: (previously YYYY, ZZZZ)"...

There are already 2224 CWARN and CERROR messages in the current code base, so I'm not sure a use-once 4-digit number is large enough.

Another alternative is to allow the format to "grow" by adding elements to the end, and to allow the "non-format" parts of the message to change (improved wording, etc.) so long as there are no changes in the order of existing format elements.

As for "committed to any branch of CVS", that imposes a burden on in-development code, which might have numerous changes before the code is first released...

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
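To make the MMMM-NNNN (component-msgnum) idea concrete, here is a small Python sketch of per-component allocation. It is an assumption-laden illustration, not part of any proposal in this thread: the class name `ComponentIdAllocator`, the zero-padded four-digit rendering, and starting each component at 1 are all hypothetical choices.

```python
from collections import defaultdict


class ComponentIdAllocator:
    """Hand out message IDs independently for each component.

    Each component (e.g. "MGS", "LMV") has its own counter, so teams
    working on different components never contend for the same number.
    """

    def __init__(self):
        # Next unused number per component; every component starts at 1.
        self._next = defaultdict(lambda: 1)

    def allocate(self, component):
        number = self._next[component]
        self._next[component] += 1
        # Pad to at least four digits; longer numbers are not truncated.
        return f"{component}-{number:04d}"


alloc = ComponentIdAllocator()
print(alloc.allocate("MGS"))
print(alloc.allocate("LMV"))
print(alloc.allocate("MGS"))
```

The trade-off against a single unified number space is message length: every message carries the component prefix in exchange for allocation that needs no central coordination.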
Jody McIntyre
2006-Dec-28 17:05 UTC
[Lustre-devel] RFC: Proposed format for Lustre messages
Hi Andreas,

On Thu, Dec 28, 2006 at 04:43:14PM -0700, Andreas Dilger wrote:
> Other companies uses messages of the form MMMM-NNNN (component-msgnum)
> so that messages can be allocated separately between different pieces
> of software or components (e.g. MGS, LMV, or other parts of the code
> that are being worked on independently).

That could be useful, but it makes the message longer. Sun managed to come up with a unified set of message numbers (with 6 digits) for all of Solaris, so we should be able to do this as well.

> One issue with this is if a message is unclear or otherwise lacking
> information and it needs to be fixed then it presumably needs to have
> a new message ID. That in turn means that the message database will
> have duplicate information, or there needs to be a facility to link
> different messages together like "XXXX: (previously YYYY, ZZZZ)"...

I don't understand why this duplication is a problem or why we would need to "link" back to previous messages.

> There are already 2224 CWARN and CERROR messages in the current code
> base, so I'm not sure a use-once 4-digit number is large enough.

We can extend to 5 or more digits later if needed. That's explicitly part of the proposal and required of any parsing tool.

> Another alternative is to allow the format to "grow" by adding on
> elements to the end and allowing the "non-format" parts of the
> message to change (improved wording, etc) so long as there are no
> changes in the order of existing format elements.

Yes. See my comments to Nathan about parsing messages, but maybe I'm being too rigid here and wording changes are OK.

I do not want to allow "growing" the format under any circumstances, though. The reason is that we need to distribute the web-based parser so that it can be used by secure sites, and I want the parser to be able to say "no, I don't know how to deal with this message; you need to upgrade me" rather than provide an incomplete (or incorrect) interpretation. Allocating a new message ID (which would be recognized by an outdated parser as "too new") is the most reliable way to do this.

> As for "committed to any branch of CVS", that imposes a burden
> on in-development code which might have numerous changes before
> the code is first released...

True. My goal here was partially paranoia and partially to make sure we can always interpret messages in Buffalo, which sometimes tests non-production branches. Perhaps this requirement could be relaxed to "committed to any production branch" - but then we need to define "production branch." Can we guarantee that customers (including partners) will never pull or be given code from branches other than b1_4, b1_5, and the release branches?

Cheers,
Jody
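The "you need to upgrade me" behaviour can be sketched as a lookup against the catalog that ships with the tool. This is a minimal Python illustration under assumptions: `ID_RE`, `CATALOG` and `interpret` are hypothetical names, and the catalog entries are placeholder text rather than real documentation.

```python
import re

# Assumed prefix of the proposed format; four or more decimal digits.
ID_RE = re.compile(r"^Lustre \[ID (\d{4,})\]: ")

# Hypothetical catalog shipped with one particular version of the tool.
CATALOG = {
    1000: "Informational: the Lustre modules were loaded.",
    1234: "A server returned an error while handling a request.",
}


def interpret(line, catalog=CATALOG):
    """Explain a message, or say clearly that this tool is too old for it."""
    m = ID_RE.match(line)
    if m is None:
        return "not a Lustre message in the proposed format"
    msg_id = int(m.group(1))
    if msg_id not in catalog:
        # An ID missing from the catalog is refused outright instead of
        # being interpreted with a best guess.
        return f"unknown message ID {msg_id}: this tool may need upgrading"
    return catalog[msg_id]
```

Because a changed message always gets a new ID, an outdated copy of the tool running at an offline site falls into the "unknown" branch instead of applying an old interpretation to new text.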
Andreas Dilger
2006-Dec-28 18:36 UTC
[Lustre-devel] RFC: Proposed format for Lustre messages
On Dec 28, 2006 19:05 -0500, Jody McIntyre wrote:
> On Thu, Dec 28, 2006 at 04:43:14PM -0700, Andreas Dilger wrote:
> > One issue with this is if a message is unclear or otherwise lacking
> > information and it needs to be fixed then it presumably needs to have
> > a new message ID. That in turn means that the message database will
> > have duplicate information, or there needs to be a facility to link
> > different messages together like "XXXX: (previously YYYY, ZZZZ)"...
>
> I don't understand why this duplication is a problem or why we would
> need to "link" back to previous messages.

Because if there is some knowledge accumulated with message XXXX (that is also applicable to the "same" message YYYY and ZZZZ) then it will be a nightmare to keep all of these entries in sync if there isn't some kind of message linking.

Consider a step-by-step debugging map that says "if you see message YYYY proceed to step 20 to debug a network connection problem". Or if there are translations of the message catalog, it only makes sense to do that for "current" messages, but it is useful to know that someone running an old version of Lustre that hits YYYY or ZZZZ can look at the current web page and find the translated message.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
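One way to picture the "XXXX: (previously YYYY, ZZZZ)" linking is a table that maps each retired ID to the ID that replaced it, so that notes and translations are stored once against the current ID. The Python sketch below is purely illustrative: `SUPERSEDED_BY`, `current_id`, `previous_ids` and the specific numbers are hypothetical.

```python
# Hypothetical link table: each retired ID points at its replacement.
SUPERSEDED_BY = {
    1234: 1301,  # wording fixed
    1301: 1420,  # wording fixed again
}


def current_id(msg_id, superseded_by=SUPERSEDED_BY):
    """Follow the chain of replacements to the ID currently in use."""
    seen = set()
    while msg_id in superseded_by:
        if msg_id in seen:
            raise ValueError(f"cycle in link table at ID {msg_id}")
        seen.add(msg_id)
        msg_id = superseded_by[msg_id]
    return msg_id


def previous_ids(msg_id, superseded_by=SUPERSEDED_BY):
    """List every retired ID that resolves to the given current ID."""
    return sorted(
        old for old in superseded_by
        if current_id(old, superseded_by) == msg_id
    )
```

With such a table, someone running an old release who hits ID 1234 can be sent to the page for 1420, and that page can list "previously 1234, 1301" without duplicating its contents.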
Nathaniel Rutman
2006-Dec-28 19:58 UTC
[Lustre-devel] RFC: Proposed format for Lustre messages
Andreas Dilger wrote:
> On Dec 28, 2006 19:05 -0500, Jody McIntyre wrote:
>> On Thu, Dec 28, 2006 at 04:43:14PM -0700, Andreas Dilger wrote:
>>> One issue with this is if a message is unclear or otherwise lacking
>>> information and it needs to be fixed then it presumably needs to have
>>> a new message ID. That in turn means that the message database will
>>> have duplicate information, or there needs to be a facility to link
>>> different messages together like "XXXX: (previously YYYY, ZZZZ)"...
>>
>> I don't understand why this duplication is a problem or why we would
>> need to "link" back to previous messages.
>
> Because if there is some knowledge accumulated with message XXXX (that
> is also applicable to the "same" message YYYY and ZZZZ) then it will
> be a nightmare to keep all of these entries in sync if there isn't
> some kind of message linking.
>
> Consider a step-by-step debugging map that says "if you see message YYYY
> proceed to step 20 to debug a network connection problem". Or if there
> are translations of the message catalog, it only makes sense to do that
> for "current" messages, but it is useful to know that someone running an
> old version of lustre that hits YYYY or ZZZZ can look at the current web
> page and find the translated message.

It also _requires_ that every analysis tool (including customers') written to look for particular messages _must_ change, even for trivial changes.

Maybe a smarter renumbering: XXXX becomes XXXX.1 etc. for changes (with a simple unbounded increment). That way tools can be written to look for either the generic message or a specific version of it.
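The XXXX.1 idea can be sketched as an ID with an optional revision suffix. The Python below is an illustration only: `VERSIONED_ID_RE`, `parse_id` and `matches` are hypothetical names, and treating an ID printed without a suffix as revision 0 is an assumption the thread does not settle.

```python
import re

# Base ID of four or more digits, optionally followed by ".N".
VERSIONED_ID_RE = re.compile(r"^(\d{4,})(?:\.(\d+))?$")


def parse_id(text):
    """Split '1234' or '1234.2' into (base, revision); revision may be None."""
    m = VERSIONED_ID_RE.match(text)
    if m is None:
        raise ValueError(f"not a message ID: {text!r}")
    revision = int(m.group(2)) if m.group(2) is not None else None
    return int(m.group(1)), revision


def matches(wanted, seen):
    """Does the ID a tool is looking for match the ID seen in a log?

    A wanted ID without a revision matches every revision of that message;
    a wanted ID with a revision matches only that exact revision.
    """
    wanted_base, wanted_rev = parse_id(wanted)
    seen_base, seen_rev = parse_id(seen)
    if wanted_base != seen_base:
        return False
    if wanted_rev is None:
        return True
    # Assumption: an ID printed without a suffix is revision 0.
    return wanted_rev == (seen_rev or 0)
```

A customer script that only cares that "message 1234 happened" would search for `1234` and keep working across wording fixes, while the official parser could key its field extraction on the exact revision.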
>>>>> Jody McIntyre (JM) writes:

 JM> Hi Alex,
 JM> On Fri, Dec 29, 2006 at 12:50:39AM +0300, Alex Tomas wrote:
 >> would it be helpful to add some standard mnemonic like 'ha' (recovery)
 >> or 'lnet' just before ID ?

 JM> That could be useful, but it may just add noise. Ideally, the message
 JM> itself will be readable enough that a mnemonic is unnecessary so I lean
 JM> towards leaving it out.

 JM> Do you have an example of a message where such a mnemonic would be
 JM> useful?

Say we have a list of components. Some people understand llite better, others mds, ptlrpc, lnet, etc. A mnemonic would tell you the component directly, so you would not have to look the ID up in the database to find out which component printed the message.

thanks, Alex
Nathan wrote:
> > Can we relax this to say "no substantive changes"?
> > If we fix a typo or even wording:
> > "[1234] lustre has dropped the ball and erased all your data" to
> > "[1234] an irrecoverable error has occurred and erased all filesystem data"
> > doesn't seem to me that it should require a new message number. Parsing
> > tools should
> > just check the number and not worry about the exact contents of the message.
> > I think that's the whole point of having a [number] in the first place.

Jody wrote:
> Perhaps, but what about a typo fix, for example:
>
> -Lustre [ID 1234]: Server handling error on servr foo@o2ib:
> +Lustre [ID 1234]: Server handling error on server foo@o2ib:
> transaction 11602746/0, opcode 42 returned -2
>
> Looks innocent enough, except the web-based parser may be depending on
> the word "servr" to pick out the "foo@o2ib" NID.
>
> I think to avoid problems like this, we need to ban reuse across the
> board. I don't think Lustre messages are changed often enough that we
> need to worry about running out of numbers.

Jody, that looks like a pretty brain-dead parser to me.

I agree with points made by various people, in particular:

1. If _new fields_ are added to a message, then I think that it deserves a new number (for the reasons mentioned earlier).

2. But if text is modified in some minor way to improve readability or correct spelling errors, it should use the same number. Too many numbers that mean the same thing (in different releases) is problematic.

3. The programmer should recognize certain unusual cases (like the one above, where the "field name" is misspelled), and create a new number in those cases.

4. All fields should be surrounded by spaces (never extra colons, semi-colons, commas, or periods). This should make parsing easier, because the parser does not have to remove them.

5. (I have not seen this topic addressed.) If a message can occur in multiple places in the code, then there must be some distinguishing feature so that the reader can know which line of code generated the message (if applicable). Personally, I like __FILE__:__LINE__. But there can be other ways to distinguish which instance of a particular message is the culprit.

Just my 2 cents.

-Roger
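Point 4 above can be illustrated with a short Python sketch. If every field is delimited only by spaces, a parser can split on whitespace and read the token after each keyword, with no punctuation to strip. The function name `fields_after`, the keyword set, and the punctuation-free variant of the example message are all hypothetical.

```python
def fields_after(text, keywords):
    """Map each keyword to the whitespace-delimited token that follows it.

    Only the first occurrence of each keyword is used. Matching is
    case-sensitive, so "Server" at the start of the message is not
    confused with the keyword "server".
    """
    tokens = text.split()
    found = {}
    for keyword, value in zip(tokens, tokens[1:]):
        if keyword in keywords and keyword not in found:
            found[keyword] = value
    return found


# Hypothetical rewrite of the example message with space-delimited fields.
TEXT = ("Server handling error on server foo@o2ib "
        "transaction 11602746/0 opcode 42 returned -2")
print(fields_after(TEXT, {"server", "transaction", "opcode", "returned"}))
```

The cost of this convention is that the message reads less like a sentence, which is the tension with the "readable and grammatical" goal in the original proposal.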
Solofo.Ramangalahy@bull.net
2007-Jan-08 05:31 UTC
[Lustre-devel] RFC: Proposed format for Lustre messages
Jody McIntyre writes:
 > The message ID (1234 above) is a 4-digit decimal number.

Why is it decimal (as opposed to hexadecimal, ASCII...)? For example, "1234" could be replaced by "DOC ".

 > Messages are of the form: Lustre [ID 1234]: MESSAGE

What is the need for "ID" and the brackets? Would it be OK to use

  Lustre 1234: MESSAGE

instead (without the brackets, the space and "ID")? This is 5 characters fewer, which may be used for something else, e.g. longer identifiers.

Is there an (implicit) 80 chars per line constraint?

 > EXAMPLE MESSAGES: [...]
 > Lustre [ID 1234]: Server handling error on server foo@o2ib:
 > transaction 11602746/0, opcode 42 returned -2

Supposing messages are long, would this be:

  Lustre [ID 1234]: Server handling error on server foo@o2ib:
  Lustre [ID 1234]: transaction 11602746/0, opcode 42 returned -2

or does the identifier appear only on the first line of the MESSAGE?

--
Solofo.Ramangalahy@bull.net        | Tel: +33 (0)4 76 29 72 48
Bull SAS, Linux R&D, HPC/CI/Lustre | Fax: +33 (0)4 76 61 52 52
1, Rue de Provence. BP208          | Office B1/386
38432 Echirolles Cedex, France     | Mail Stop B1/167
Jody McIntyre
2007-Jan-08 09:44 UTC
[Lustre-devel] RFC: Proposed format for Lustre messages
Hi Solofo,

On Mon, Jan 08, 2007 at 01:34:13PM +0100, Solofo.Ramangalahy@bull.net wrote:
> Why is it decimal? (as opposed to hexadecimal, ascii...)
> For example, "1234" could be replaced by "DOC ".

Humans are generally more comfortable with decimal numbers, and the goal of this task is to make things friendlier to humans.

> > Messages are of the form: Lustre [ID 1234]: MESSAGE
>
> What is the need for "ID" and the brackets?
> Would it be ok to use
> Lustre 1234: MESSAGE
> instead?
> (without brackets, space and "ID")

I think it looks better, and it clearly indicates that 1234 is a message ID and not something else like a PID or a return code. I think that's worth 5 characters, but of course all of this is up for discussion - does anyone else have an opinion?

> This is 5 chars less, which may be used for something else, e.g. longer
> identifiers.
>
> Is there an (implicit) 80 chars per line constraint?

No. I realize many current kernel messages are limited to 80 characters, but I don't want to impose such a limit since it may compromise readability. However, all things being equal, shorter messages are preferable.

> > EXAMPLE MESSAGES:
> [...]
> > Lustre [ID 1234]: Server handling error on server foo@o2ib:
> > transaction 11602746/0, opcode 42 returned -2
>
> Supposing messages are long, would this be:
> Lustre [ID 1234]: Server handling error on server foo@o2ib:
> Lustre [ID 1234]: transaction 11602746/0, opcode 42 returned -2
> or does the identifier appears only on the first line of the MESSAGE?

No. I'm not proposing to line-wrap messages. One of the example messages was wrapped by my editor in the proposed format - I will clarify this in future versions of the document. The message is printed on a single line:

Lustre [ID 1234]: Server handling error on server foo@o2ib: transaction 11602746/0, opcode 42 returned -2

Cheers,
Jody
Nicholas Henke
2007-Jan-08 10:00 UTC
[Lustre-devel] RFC: Proposed format for Lustre messages
Jody McIntyre wrote:
> Hi Solofo,
>
> On Mon, Jan 08, 2007 at 01:34:13PM +0100, Solofo.Ramangalahy@bull.net wrote:
>
>> Why is it decimal? (as opposed to hexadecimal, ascii...)
>> For example, "1234" could be replaced by "DOC ".
>
> Humans are generally more comfortable with decimal numbers, and the goal
> of this task is to make things friendlier to humans.
>
>> > Messages are of the form: Lustre [ID 1234]: MESSAGE
>>
>> What is the need for "ID" and the brackets?
>> Would it be ok to use
>> Lustre 1234: MESSAGE
>> instead?
>> (without brackets, space and "ID")
>
> I think it looks better, and it clearly indicates that 1234 is a message
> ID and not something else like a PID or a return code. I think that's
> worth 5 characters but of course all of this is up for discussion - does
> anyone else have an opinion?

Actually -- yes. If you consider a machine with 100K clients, that is 500K fewer characters going into a log each time every client prints one message, or almost 0.5 MB less text. Given that Lustre rarely spits out just one error message, this could start to add up quite quickly. I'd encourage the fewest possible additions, and keeping the messages short but readable.

While I'm commenting on this, there was one item I wanted to mention - using this "interpreter" offline. More and more customers are running Lustre behind secure walls, where access to this tool needs to work 100% offline. I would hope that this either runs as a text-only tool or, if it is a "web based" tool, runs from a provided small http server (python/perl/ruby/sanskrit) or can also run from an internal apache server. Basically -- this should be pretty quick & easy to install and not require anything too fancy.

That said, I've not heard any concrete plans for this yet -- are there any?

Nic
Jody McIntyre
2007-Jan-10 16:35 UTC
[Lustre-devel] RFC: Proposed format for Lustre messages
Hi Nic,

On Mon, Jan 08, 2007 at 11:00:15AM -0600, Nicholas Henke wrote:
> [...]
> While I'm commenting on this, there was one item I wanted to mention -
> using this "interpreter" offline. More and more customers are running
> Lustre behind secure walls, where the ability to access this tool needs
> to be done 100% offline. I would hope that this either runs as a text
> only tool, or if a "web based" tool, running from a provided small http
> server (python/perl/ruby/sanskrit) or having the ability to run from a
> internal apache server too. Basically -- this should be pretty quick &
> easy to install and not require anything too fancy.
>
> That said, I've not heard any concrete plans for this yet -- are there any?

We're still planning exactly what we're going to deliver as a first cut of the interpretation tool, but we're well aware of the needs of secure sites. Whatever we provide will definitely be usable on non-CFS machines with no outside connectivity.

Cheers,
Jody