thr3ads.net - Lustre devel - [Lustre-devel] RFC: Proposed format for Lustre messages [Dec 2006]

Jody McIntyre

2006-Dec-28 14:46 UTC

[Lustre-devel] RFC: Proposed format for Lustre messages

RATIONALE: To improve the serviceability of Lustre, we need a common
format for messages (errors and information) printed by Lustre.

GOALS:

 - Consistent, non-reusable message IDs.  Each ID must uniquely
identify a message.  This will make it easy to find messages in
troubleshooting documentation, and allow us to produce a web-based
tool that can interpret messages automatically.

 - It would be nice to have a consistent format for all variables
printed as part of a message.

DETAILS OF PROPOSED FORMAT:

Messages are of the form: Lustre [ID 1234]: MESSAGE

Messages must be as readable and useful to a Lustre administrator as
is practicable.  Messages should not contain any information that is
only useful to engineers familiar with Lustre internals - these should
be saved to the debug log instead.

The message ID (1234 above) is a 4-digit decimal number.  This should
meet our needs for several years.  To allow for future growth, message
IDs beyond 4 digits are OK and must be accepted by any tool that
parses these messages.

Message IDs must never be reused.  If a message or its format changes,
a new message ID must be allocated and used instead.

These message IDs will be assigned using a page on the Lustre wiki
listing the engineer allocating the message and the source file in
which it will be used.  This will become obsolete quickly but that's
OK - we only need to know that a message ID has been "claimed", by
whom, and for what purpose.

Once code that prints a message using a particular message ID has been
committed to any branch of CVS, the format of the message may no
longer be changed.  Details of what the message means and how to
interpret any variables in the message must be sent to an email
address (to be determined.)  This will go to the team(s) responsible
for updating the troubleshooting documentation and the web-based
analysis tool.

Variables printed as part of messages must be formatted so that
parsing by the web-based analysis tool can be done with regular
expressions, and should be as readable and grammatical as possible.

The value of the variable can be inserted as part of a text message,
or it can be printed in the format: 'name: value', depending on what
is the most legible.

Values need not be human-readable (for example, printing -ERRNO return
codes is still acceptable) provided that they can be translated into
human-readable form by the web-based analysis tool.

EXAMPLE MESSAGES:

Lustre [ID 1000]: Lustre version 1.5.97 loaded

Lustre [ID 1234]: Server handling error on server foo@o2ib:
transaction 11602746/0, opcode 42 returned -2
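
As an illustration of the format described above (not part of the original proposal), a tool could recognize these messages with a single regular expression. This is a minimal Python sketch; the names MESSAGE_RE and parse_lustre_message are made up for the example, and the only rules it encodes are the "Lustre [ID nnnn]: MESSAGE" layout and the requirement that IDs longer than 4 digits be accepted.

```python
import re

# Hypothetical pattern for "Lustre [ID 1234]: MESSAGE".
# \d{4,} accepts the 4-digit IDs of the proposal and any longer ID.
MESSAGE_RE = re.compile(r"^Lustre \[ID (\d{4,})\]: (.*)$")


def parse_lustre_message(line):
    """Return (message_id, text), or None if the line is not in the proposed format."""
    match = MESSAGE_RE.match(line)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)


print(parse_lustre_message("Lustre [ID 1000]: Lustre version 1.5.97 loaded"))
print(parse_lustre_message("Lustre [ID 123456]: a message with a longer ID"))
print(parse_lustre_message("kernel: something unrelated"))
```

Lines that do not match are reported as None rather than guessed at, so unrelated kernel output can be skipped safely.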

Alex Tomas

2006-Dec-28 14:50 UTC


>>>>> Jody McIntyre (JM) writes:
 JM> DETAILS OF PROPOSED FORMAT:

 JM> Messages are of the form: Lustre [ID 1234]: MESSAGE

 JM> Messages must be as readable and useful to a Lustre administrator as
 JM> is practicable.  Messages should not contain any information that is
 JM> only useful to engineers familiar with Lustre internals - these should
 JM> be saved to the debug log instead.

would it be helpful to add some standard mnemonic like 'ha' (recovery)
or 'lnet' just before ID ?

thanks, Alex

Jody McIntyre

2006-Dec-28 15:17 UTC


Hi Alex,

On Fri, Dec 29, 2006 at 12:50:39AM +0300, Alex Tomas wrote:
> would it be helpful to add some standard mnemonic like 'ha' (recovery)
> or 'lnet' just before ID ?
That could be useful, but it may just add noise.  Ideally, the message
itself will be readable enough that a mnemonic is unnecessary so I lean
towards leaving it out.

Do you have an example of a message where such a mnemonic would be
useful?

Cheers,
Jody
> 
> thanks, Alex
--

Nathaniel Rutman

2006-Dec-28 16:05 UTC


Jody McIntyre wrote:
> Once code that prints a message using a particular message ID has been
> committed to any branch of CVS, the format of the message may no
> longer be changed.  Details of what the message means and how to
> interpret any variables in the message must be sent to an email
> address (to be determined.)  This will go to the team(s) responsible
> for updating the troubleshooting documentation and the web-based
> analysis tool.

Can we relax this to say "no substantive changes"?
If we fix a typo or even wording:
"[1234] lustre has dropped the ball and erased all your data" to
"[1234] an irrecoverable error has occurred and erased all filesystem data"
doesn't seem to me that it should require a new message number.  Parsing
tools should just check the number and not worry about the exact contents
of the message.  I think that's the whole point of having a [number] in
the first place.

I'll agree that for messages like

Lustre [ID 1234]: Server handling error on server foo@o2ib:
transaction 11602746/0, opcode 42 returned -2

we can't change the keyword preceding and data items

Jody McIntyre

2006-Dec-28 16:29 UTC


Hi Nathan,
> Can we relax this to say "no substantive changes"?
> If we fix a typo or even wording:
> "[1234] lustre has dropped the ball and erased all your data" to
> "[1234] an irrecoverable error has occurred and erased all filesystem data"
> doesn't seem to me that it should require a new message number.  Parsing
> tools should just check the number and not worry about the exact contents
> of the message.  I think that's the whole point of having a [number] in
> the first place.
Perhaps, but what about a typo fix, for example:

-Lustre [ID 1234]: Server handling error on servr foo@o2ib:
+Lustre [ID 1234]: Server handling error on server foo@o2ib:
transaction 11602746/0, opcode 42 returned -2

Looks innocent enough, except the web-based parser may be depending on
the word "servr" to pick out the "foo@o2ib" NID.

I think to avoid problems like this, we need to ban reuse across the
board.  I don't think Lustre messages are changed often enough that we
need to worry about running out of numbers.
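
To illustrate the concern: the following Python sketch is not taken from any actual tool. It assumes a hypothetical per-ID pattern (OLD_1234_RE) that was written against the misspelled text and uses the literal word "servr" to find the NID. The message texts are the ones from the example above, joined onto one line.

```python
import re

# Hypothetical per-ID pattern, written against the original (misspelled) text.
# It relies on the literal word "servr" to locate the NID.
OLD_1234_RE = re.compile(r"error on servr (\S+):")

old_text = "Server handling error on servr foo@o2ib: transaction 11602746/0, opcode 42 returned -2"
new_text = "Server handling error on server foo@o2ib: transaction 11602746/0, opcode 42 returned -2"


def extract_nid(text):
    """Return the NID found after the keyword, or None if the pattern fails."""
    match = OLD_1234_RE.search(text)
    return match.group(1) if match else None


print(extract_nid(old_text))  # the NID is found
print(extract_nid(new_text))  # None: the typo fix broke the pattern
```

A pattern keyed on the surrounding words stops matching as soon as those words change, even though the ID stayed the same.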

Cheers,
Jody
> 
> I'll agree that for messages like
> 
> Lustre [ID 1234]: Server handling error on server foo@o2ib:
> transaction 11602746/0, opcode 42 returned -2
> 
> we can't change the keyword preceding and data items

Andreas Dilger

2006-Dec-28 16:43 UTC


On Dec 28, 2006  16:46 -0500, Jody McIntyre wrote:
> These message IDs will be assigned using a page on the Lustre wiki
> listing the engineer allocating the message and the source file in
> which it will be used.  This will become obsolete quickly but that's
> OK - we only need to know that a message ID has been "claimed", by
> whom, and for what purpose.
Other companies use messages of the form MMMM-NNNN (component-msgnum)
so that messages can be allocated separately between different pieces
of software or components (e.g. MGS, LMV, or other parts of the code
that are being worked on independently).
> Once code that prints a message using a particular message ID has been
> committed to any branch of CVS, the format of the message may no
> longer be changed.  Details of what the message means and how to
> interpret any variables in the message must be sent to an email
> address (to be determined.)  This will go to the team(s) responsible
> for updating the troubleshooting documentation and the web-based
> analysis tool.
One issue with this is if a message is unclear or otherwise lacking
information and it needs to be fixed then it presumably needs to have
a new message ID.  That in turn means that the message database will
have duplicate information, or there needs to be a facility to link
different messages together like "XXXX: (previously YYYY, ZZZZ)"...
There are already 2224 CWARN and CERROR messages in the current code
base, so I'm not sure a use-once 4-digit number is large enough.

Another alternative is to allow the format to "grow" by adding on
elements to the end and allowing the "non-format" parts of the
message to change (improved wording, etc) so long as there are no
changes in the order of existing format elements.

As for "committed to any branch of CVS", that imposes a burden
on in-development code which might have numerous changes before
the code is first released...

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

Jody McIntyre

2006-Dec-28 17:05 UTC


Hi Andreas,

On Thu, Dec 28, 2006 at 04:43:14PM -0700, Andreas Dilger wrote:
> Other companies uses messages of the form MMMM-NNNN (component-msgnum)
> so that messages can be allocated separately between different pieces
> of software or components (e.g. MGS, LMV, or other parts of the code
> that are being worked on independently).
That could be useful, but it makes the message longer.  Sun managed to
come up with a unified set of message numbers (with 6 digits) for all of
Solaris so we should be able to do this as well.
> One issue with this is if a message is unclear or otherwise lacking
> information and it needs to be fixed then it presumably needs to have
> a new message ID.  That in turn means that the message database will
> have duplicate information, or there needs to be a facility to link
> different messages together like "XXXX: (previously YYYY, ZZZZ)"...
I don't understand why this duplication is a problem or why we would
need to "link" back to previous messages.
> There are already 2224 CWARN and CERROR messages in the current code
> base, so I'm not sure a use-once 4-digit number is large enough.
We can extend to 5 or more digits later if needed.  That's explicitly
part of the proposal and required of any parsing tool.
> Another alternative is to allow the format to "grow" by adding on
> elements to the end and allowing the "non-format" parts of the
> message to change (improved wording, etc) so long as there are no
> changes in the order of existing format elements.
Yes.  See my comments to Nathan about parsing messages, but maybe I'm
being too rigid here and wording changes are OK.

I do not want to allow "growing" the format under any circumstances
though.  The reason is that we need to distribute the web-based parser
so that it can be used by secure sites, and I want the parser to be able
to say "no, I don't know how to deal with this message; you need to
upgrade me" rather than provide an incomplete (or incorrect)
interpretation.  Allocating a new message ID (which would be recognized
by an outdated parser as "too new") is the most reliable way to do
this.
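
The "too new" behaviour can be sketched as follows. This is only an illustration under assumptions: the catalog KNOWN_MESSAGES, its two entries, and the function interpret are invented for the example, and a real tool would ship a much larger table with each release.

```python
import re

MESSAGE_RE = re.compile(r"^Lustre \[ID (\d{4,})\]: (.*)$")

# Hypothetical catalog shipped with one release of the tool: ID -> explanation.
KNOWN_MESSAGES = {
    1000: "Lustre module loaded; informational only.",
    1234: "A server returned an error while handling a request.",
}


def interpret(line):
    """Explain a message, or refuse if its ID is newer than this catalog."""
    match = MESSAGE_RE.match(line)
    if match is None:
        return "not a Lustre message"
    message_id = int(match.group(1))
    if message_id not in KNOWN_MESSAGES:
        # Unknown ID: refuse to guess rather than give a partial interpretation.
        return "message ID %d is unknown to this version; please upgrade the tool" % message_id
    return KNOWN_MESSAGES[message_id]


print(interpret("Lustre [ID 1000]: Lustre version 1.5.97 loaded"))
print(interpret("Lustre [ID 4321]: some message added in a later release"))
```

Because a changed message always gets a fresh ID, an old catalog can never silently misread it; the lookup simply fails.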
> As for "committed to any branch of CVS", that imposes a burden
> on in-development code which might have numerous changes before
> the code is first released...
True.  My goal here was partially paranoia and partially to make sure we
can always interpret messages in Buffalo, which sometimes tests
non-production branches.  Perhaps this requirement could be relaxed to
"committed to any production branch" - but then we need to define
"production branch."  Can we guarantee that customers (including
partners) will never pull or be given code from branches other than
b1_4, b1_5, and the release branches?

Cheers,
Jody
> 
> Cheers, Andreas
> --
> Andreas Dilger
> Principal Software Engineer
> Cluster File Systems, Inc.
--

Andreas Dilger

2006-Dec-28 18:36 UTC


On Dec 28, 2006  19:05 -0500, Jody McIntyre wrote:
> On Thu, Dec 28, 2006 at 04:43:14PM -0700, Andreas Dilger wrote:
> > One issue with this is if a message is unclear or otherwise lacking
> > information and it needs to be fixed then it presumably needs to have
> > a new message ID.  That in turn means that the message database will
> > have duplicate information, or there needs to be a facility to link
> > different messages together like "XXXX: (previously YYYY, ZZZZ)"...
> 
> I don't understand why this duplication is a problem or why we would
> need to "link" back to previous messages.
Because if there is some knowledge accumulated with message XXXX (that
is also applicable to the "same" message YYYY and ZZZZ) then it will
be a nightmare to keep all of these entries in sync if there isn't
some kind of message linking.

Consider a step-by-step debugging map that says "if you see message YYYY
proceed to step 20 to debug a network connection problem".  Or if there
are translations of the message catalog, it only makes sense to do that
for "current" messages, but it is useful to know that someone running an
old version of lustre that hits YYYY or ZZZZ can look at the current web
page and find the translated message.
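
One possible shape for such message linking, as a Python sketch. Everything here is hypothetical: the IDs, the CATALOG and SUPERSEDED_BY tables, and the function names are invented for illustration. Documentation is kept only for the current ID, and retired IDs point forward to it.

```python
# Hypothetical catalog: only "current" IDs carry documentation.
CATALOG = {
    2001: "Network connection problem; see step 20 of the debugging map.",
}

# Hypothetical links from retired IDs to the ID that replaced them.
SUPERSEDED_BY = {
    1234: 1500,
    1500: 2001,
}


def resolve(message_id):
    """Follow the links from a retired ID to the current one."""
    seen = set()
    while message_id in SUPERSEDED_BY:
        if message_id in seen:
            raise ValueError("cycle in SUPERSEDED_BY at ID %d" % message_id)
        seen.add(message_id)
        message_id = SUPERSEDED_BY[message_id]
    return message_id


def lookup(message_id):
    """Return (current_id, documentation or None) for any ID, old or new."""
    current = resolve(message_id)
    return current, CATALOG.get(current)


print(lookup(1234))
print(lookup(2001))
```

With this layout the accumulated knowledge lives in one entry, and someone running an old release still lands on it.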

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

Nathaniel Rutman

2006-Dec-28 19:58 UTC


Andreas Dilger wrote:
> On Dec 28, 2006  19:05 -0500, Jody McIntyre wrote:
>> On Thu, Dec 28, 2006 at 04:43:14PM -0700, Andreas Dilger wrote:
>>> One issue with this is if a message is unclear or otherwise lacking
>>> information and it needs to be fixed then it presumably needs to have
>>> a new message ID.  That in turn means that the message database will
>>> have duplicate information, or there needs to be a facility to link
>>> different messages together like "XXXX: (previously YYYY, ZZZZ)"...
>>
>> I don't understand why this duplication is a problem or why we would
>> need to "link" back to previous messages.
>
> Because if there is some knowledge accumulated with message XXXX (that
> is also applicable to the "same" message YYYY and ZZZZ) then it will
> be a nightmare to keep all of these entries in sync if there isn't
> some kind of message linking.
>
> Consider a step-by-step debugging map that says "if you see message YYYY
> proceed to step 20 to debug a network connection problem".  Or if there
> are translations of the message catalog, it only makes sense to do that
> for "current" messages, but it is useful to know that someone running an
> old version of lustre that hits YYYY or ZZZZ can look at the current web
> page and find the translated message.

It also _requires_ that every analysis tool (including customers') written
to look for particular messages _must_ change, even for trivial changes.
Maybe a smarter renumber: XXXX becomes XXXX.1 etc for changes
(with a simple unbounded increment). That way tools can be written to
look for generic or a specific version of the message.
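
A rough Python sketch of how a tool might handle the XXXX.N idea. The exact syntax "Lustre [ID 1234.2]: MESSAGE" is an assumption made for this example (the thread does not fix one), as are the function names and the choice to treat a missing suffix as revision 0.

```python
import re

# Hypothetical extension of the format: "Lustre [ID 1234.2]: MESSAGE",
# where the optional ".2" is the revision of message 1234.
VERSIONED_RE = re.compile(r"^Lustre \[ID (\d{4,})(?:\.(\d+))?\]: (.*)$")


def parse_versioned(line):
    """Return (base_id, revision, text), or None if the line does not match."""
    match = VERSIONED_RE.match(line)
    if match is None:
        return None
    base = int(match.group(1))
    # No suffix is treated as revision 0 (an assumption of this sketch).
    revision = int(match.group(2)) if match.group(2) else 0
    return base, revision, match.group(3)


def matches(line, base, revision=None):
    """True if the line carries message `base`, optionally at one exact revision."""
    parsed = parse_versioned(line)
    if parsed is None or parsed[0] != base:
        return False
    return revision is None or parsed[1] == revision


line = "Lustre [ID 1234.2]: Server handling error on server foo@o2ib"
print(matches(line, 1234))     # generic match on the base ID
print(matches(line, 1234, 1))  # specific revision requested, different one seen
```

A tool that only cares about the base ID ignores the suffix, while a stricter tool can pin one revision.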

Alex Tomas

2006-Dec-30 02:14 UTC


>>>>> Jody McIntyre (JM) writes:
 JM> Hi Alex,
 JM> On Fri, Dec 29, 2006 at 12:50:39AM +0300, Alex Tomas wrote:

 >> would it be helpful to add some standard mnemonic like 'ha' (recovery)
 >> or 'lnet' just before ID ?

 JM> That could be useful, but it may just add noise.  Ideally, the message
 JM> itself will be readable enough that a mnemonic is unnecessary so I lean
 JM> towards leaving it out.

 JM> Do you have an example of a message where such a mnemonic would be
 JM> useful?

say, we have a component list. some people understand llite better,
others - mds, ptlrpc, lnet, etc. a mnemonic could tell you the component,
so that we'd avoid having to look up the id in the database to know the component.

thanks, Alex

RS RS

2007-Jan-03 14:49 UTC


Nathan Wrote:
> > Can we relax this to say "no substantive changes"?
> > If we fix a typo or even wording:
> > "[1234] lustre has dropped the ball and erased all your data" to
> > "[1234] an irrecoverable error has occurred and erased all filesystem data"
> > doesn't seem to me that it should require a new message number.  Parsing
> > tools should just check the number and not worry about the exact contents
> > of the message.  I think that's the whole point of having a [number] in
> > the first place.
>
Jody Wrote:
>Perhaps, but what about a typo fix, for example:
>
>-Lustre [ID 1234]: Server handling error on servr foo@o2ib:
>+Lustre [ID 1234]: Server handling error on server foo@o2ib:
>transaction 11602746/0, opcode 42 returned -2
>
>Looks innocent enough, except the web-based parser may be depending on
>the word "servr" to pick out the "foo@o2ib" NID.
>
>I think to avoid problems like this, we need to ban reuse across the
>board.  I don't think Lustre messages are changed often enough that we
>need to worry about running out of numbers.
>
Jody, that looks like a pretty brain-dead parser to me.  I agree with points 
taken by various people, in particular:

1.  If _new fields_ are added to a message, then I think that it deserves a 
new number (for the reasons mentioned earlier)
2.  But, if text is modified in some minor way to improve readability or 
correct spelling errors, it should use the same number.  Too many numbers 
that mean the same thing (in different releases) is problematic.
3.  The programmer should recognize certain unusual cases (like the one 
above where the "field name" is mis-spelled), and then create a new number 
in this case.
4. All fields should be surrounded by spaces (never extra colons, 
semi-colons, commas, or periods) -- this should make parsing easier. That 
saves the parser from having to remove them.
5.  (I have not seen this topic addressed)  If a message can occur in the 
code in multiple places, then there must be some distinguishing feature so 
that the reader can know which line of code generated the message (if 
applicable).  Personally, I like __FILE__:__LINE__.  But, there can be other 
ways to distinguish which instance of a particular message is the culprit.
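
Point 4 can be illustrated with a small Python sketch. It is only one reading of the suggestion: the reworded message below (the example message with its colon and commas removed) and the function fields_after_keywords are assumptions made for the example. With every field delimited by spaces, a plain whitespace split is enough and no punctuation has to be stripped.

```python
def fields_after_keywords(text, keywords):
    """Return {keyword: next token} for a message whose fields are space-delimited."""
    tokens = text.split()
    found = {}
    for index, token in enumerate(tokens[:-1]):
        if token in keywords and token not in found:
            found[token] = tokens[index + 1]
    return found


# Hypothetical rewording of the example message with no punctuation around fields.
text = "Server handling error on server foo@o2ib transaction 11602746/0 opcode 42 returned -2"
print(fields_after_keywords(text, {"server", "transaction", "opcode", "returned"}))
```

The keyword match is case-sensitive, so the leading "Server" is not mistaken for the "server" field name.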

Just my 2 cents.

-Roger


Solofo.Ramangalahy@bull.net

2007-Jan-08 05:31 UTC


Jody McIntyre writes:
 > The message ID (1234 above) is a 4-digit decimal number.

Why is it decimal? (as opposed to hexadecimal, ascii...)
For example, "1234" could be replaced by "DOC ".

 > Messages are of the form: Lustre [ID 1234]: MESSAGE

What is the need for "ID" and the brackets?
Would it be ok to use
Lustre 1234: MESSAGE
instead?
(without brackets, space and "ID")
This is 5 chars less, which may be used for something else, e.g. longer
identifiers.

Is there an (implicit) 80 chars per line constraint?
 > EXAMPLE MESSAGES:
 [...]
 > Lustre [ID 1234]: Server handling error on server foo@o2ib:
 > transaction 11602746/0, opcode 42 returned -2

Supposing messages are long, would this be:
Lustre [ID 1234]: Server handling error on server foo@o2ib:
Lustre [ID 1234]: transaction 11602746/0, opcode 42 returned -2
or does the identifier appear only on the first line of the MESSAGE?

-- 
Solofo.Ramangalahy@bull.net            | Tel: +33 (0)4 76 29 72 48
Bull SAS, Linux R&D, HPC/CI/Lustre     | Fax: +33 (0)4 76 61 52 52
1, Rue de Provence. BP208              | Office B1/386
38432 Echirolles Cedex, France         | Mail Stop B1/167

Jody McIntyre

2007-Jan-08 09:44 UTC


Hi Solofo,

On Mon, Jan 08, 2007 at 01:34:13PM +0100, Solofo.Ramangalahy@bull.net wrote:
> Why is it decimal? (as opposed to hexadecimal, ascii...)
> For example, "1234" could be replaced by "DOC ".
Humans are generally more comfortable with decimal numbers, and the goal
of this task is to make things friendlier to humans.
>  > Messages are of the form: Lustre [ID 1234]: MESSAGE
> 
> What is the need for "ID" and the brackets?
> Would it be ok to use
> Lustre 1234: MESSAGE
> instead?
> (without brackets, space and "ID")
I think it looks better, and it clearly indicates that 1234 is a message
ID and not something else like a PID or a return code.  I think that's
worth 5 characters but of course all of this is up for discussion - does
anyone else have an opinion?
> This is 5 chars less, which may be used for something else, e.g. longer
> identifiers.
> 
> 
> Is there an (implicit) 80 chars per line constraint?
No.  I realize many current kernel messages are limited to 80
characters, but I don't want to impose such a limit since it may
compromise readability.  However, all things being equal, shorter
messages are preferable.
>  > EXAMPLE MESSAGES:
>  [...]
>  > Lustre [ID 1234]: Server handling error on server foo@o2ib:
>  > transaction 11602746/0, opcode 42 returned -2
> 
> Supposing messages are long, would this be:
> Lustre [ID 1234]: Server handling error on server foo@o2ib:
> Lustre [ID 1234]: transaction 11602746/0, opcode 42 returned -2
> or does the identifier appears only on the first line of the MESSAGE?
No.  I'm not proposing to line-wrap messages.  One of the example
messages was wrapped by my editor in the proposed format - I will
clarify this in future versions of the document.

Lustre [ID 1234]: Server handling error on server foo@o2ib: transaction 11602746/0, opcode 42 returned -2

Cheers,
Jody
> 
> 
> -- 
> Solofo.Ramangalahy@bull.net            | Tel: +33 (0)4 76 29 72 48
> Bull SAS, Linux R&D, HPC/CI/Lustre     | Fax: +33 (0)4 76 61 52 52
> 1, Rue de Provence. BP208              | Office B1/386
> 38432 Echirolles Cedex, France         | Mail Stop B1/167
> 
--

Nicholas Henke

2007-Jan-08 10:00 UTC


Jody McIntyre wrote:
> Hi Solofo,
>
> On Mon, Jan 08, 2007 at 01:34:13PM +0100, Solofo.Ramangalahy@bull.net wrote:
>> Why is it decimal? (as opposed to hexadecimal, ascii...)
>> For example, "1234" could be replaced by "DOC ".
>
> Humans are generally more comfortable with decimal numbers, and the goal
> of this task is to make things friendlier to humans.
>
>>  > Messages are of the form: Lustre [ID 1234]: MESSAGE
>>
>> What is the need for "ID" and the brackets?
>> Would it be ok to use
>> Lustre 1234: MESSAGE
>> instead?
>> (without brackets, space and "ID")
>
> I think it looks better, and it clearly indicates that 1234 is a message
> ID and not something else like a PID or a return code.  I think that's
> worth 5 characters but of course all of this is up for discussion - does
> anyone else have an opinion?

Actually -- yes. If you consider a machine with 100K clients, that is 
500K characters fewer into a log, or almost 0.5 MB less text. Given that 
Lustre rarely spits out just one error message, this could start to add 
up quite quickly.
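
The arithmetic behind that estimate, as a small Python sketch. It assumes one message per client and one byte per character; the function name saved_bytes is made up for the example. The 5 characters are the "[", "ID", space and "]" that the shorter form drops.

```python
def saved_bytes(clients, messages_per_client=1, chars_saved=5):
    """Characters saved in the log when each message is `chars_saved` shorter."""
    return clients * messages_per_client * chars_saved


total = saved_bytes(100000)           # 100K clients, one message each
print(total)                          # 500000 characters
print(round(total / 2 ** 20, 2))      # about 0.48 MiB
```

Each additional message per client scales the saving linearly, for example saved_bytes(100000, 10) for ten messages each.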

I'd encourage the fewest possible additions and for keeping the messages
short but readable.

While I'm commenting on this, there was one item I wanted to mention - 
using this "interpreter" offline. More and more customers are running 
Lustre behind secure walls, where the ability to access this tool needs 
to be done 100% offline. I would hope that this either runs as a text 
only tool, or if a "web based" tool, running from a provided small http
server (python/perl/ruby/sanskrit) or having the ability to run from an 
internal apache server too. Basically -- this should be pretty quick & 
easy to install and not require anything too fancy.

That said, I've not heard any concrete plans for this yet -- are there any?

Nic

Jody McIntyre

2007-Jan-10 16:35 UTC


Hi Nic,

On Mon, Jan 08, 2007 at 11:00:15AM -0600, Nicholas Henke wrote:
> [...]
> While I'm commenting on this, there was one item I wanted to mention -
> using this "interpreter" offline. More and more customers are running
> Lustre behind secure walls, where the ability to access this tool needs 
> to be done 100% offline. I would hope that this either runs as a text 
> only tool, or if a "web based" tool, running from a provided small http
> server (python/perl/ruby/sanskrit) or having the ability to run from a 
> internal apache server too. Basically -- this should be pretty quick & 
> easy to install and not require anything too fancy.
> 
> That said, I've not heard any concrete plans for this yet -- are there any?
We're still planning exactly what we're going to deliver as a first cut
of the interpretation tool, but we're well aware of the needs of secure
sites.  Whatever we provide will definitely be usable on non-CFS
machines with no outside connectivity.

Cheers,
Jody
