thr3ads.net - Markdown Discuss - Formal Grammar


Allan Odgaard

2006-Jul-29 16:38 UTC

Formal Grammar — some thoughts

I recently subscribed and saw in the archive that Eric Astor was
asking for a formal grammar (unlikely to be the first time for such a request).

Currently there are a few problems in making such a thing, so I was
curious whether Mr. Gruber has given any thought to moving toward one.

This would also allow a more "clean" parser, which would get rid of
some of the current problems (bad nesting[^1], styles which cross
environments[^2], and problems with the md5 checksums[^3]), and I am
sure it would also improve performance significantly.

Some of the problems with a formal grammar (as I see it) are:

1. interpreting tokens as literal text when the end token is missing,
example: `this is __not starting bold`. For bold it doesn't matter
IMO (having to escape the token), but having to escape all single
appearances of `_` and `*` could be irritating. That said, at present `_`
often does come in pairs, so one often already needs to wrap
filenames, environment variables and similar names which use the underscore
in a raw environment.

2. using back-references in end-tokens, example: `a ``` ``raw`` ```
environment`. A formal grammar can't really do that, and IMO the
clean solution would be to define single-quoted (backticked) raw as
supporting no escaping and ending with the first `` ` ``, whereas double-quoted
(backticked) raw would support escaping of `` ` `` and `\`.

5. heuristically defined end of lists, sub-lists and block-quotes.  
This would need to be more strict. I am not entirely sure what the  
current definition is, so I am wary of reformulating a strict  
version. From the source it seems that a sub-list is started when a  
line is a list item with a different (exact) indent than the first list
item, allowing for some fun flexibility:

            * item 1
          * item 1a
            * item 2
         * item 2a
          * item 2b
           * item 2c
            * item 3

     There is also an ambiguity between `*` used for bold and used  
for a list item.

         A minor problem is that, inside a list item environment, the
rule for e.g. raw blocks needs to be redefined (to require 2 tabs or
8 spaces), and that would be necessary for each new level (to add an
extra indent to the requirement), with the likely outcome that raw
blocks would only be supported in e.g. the first 3 levels of list
items. OTOH I doubt anyone would feel safe using raw blocks in deeply
nested list items, given the (IMHO) rather vague definitions of
when lists stop and how they interact with raw environments, block-quotes etc.
Take the following relatively simple code, which produces bogus markup, as
an example of how fragile this stuff currently is:

         * this is list item
         > * this item is in a block quote
         more block quoting?

         are we still in list and block quote?

         > is this a new block quote?

Thanks for reading this far.



[^1]: example: `__bold _and__ italic_`.

[^2]: example: `*not italic [link*text](#)`.

[^3]: I have only experienced this with MultiMarkdown, for which the  
problem is easy to reproduce by using styles in footnotes.

Eric Astor

2006-Jul-29 17:22 UTC

RE: Formal Grammar — some thoughts

> -----Original Message-----
> From: markdown-discuss-bounces@six.pairlist.net [mailto:markdown-discuss-bounces@six.pairlist.net] On Behalf Of Allan Odgaard
> Sent: Saturday, July 29, 2006 4:38 PM
> To: markdown-discuss@six.pairlist.net
> Subject: Formal Grammar — some thoughts
> 
> I recently subscribed and saw in the archive that Eric Astor was
> asking for a formal grammar (unlikely to be the first time for such a request).
I should now add that since asking, I've started work on a parser for a
Markdown variant, coding in Python and using the Martel parsing framework
(http://www.dalkescientific.com/Martel/). A VERY large fraction of what I've
written could likely be re-used in building a true Markdown parser with this
framework, which uses a regex-based specification of the parsing format and
supports various features (including back-references) that are not possible
in formal grammars. In the process, I've learned a lot that may also be
applicable to building a true formal grammar for Markdown.
> Some of the problems with a formal grammar (as I see it) are:
> 
> 1. interpreting tokens as literal text when the end token is missing,
> example: `this is __not starting bold`.
This is actually simple to deal with in most formal grammars - since formal
grammars are recursive, you simply define bold (for example) as:

    bold := ('__' SPAN '__') | ('**' SPAN '**')

When no closing marker follows, the `bold` rule fails to match and the marker
token falls through to be interpreted as literal text. The
Martel 'grammar' works similarly.
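
The rule above can be sketched as a tiny hand-written parser. This is only an illustration in plain Python, not Martel code: the names `parse_span` and `merge_text` and the tuple-based node format are assumptions made for the example. An opening marker becomes bold only if the same marker appears again later; otherwise that alternative fails and the marker is kept as literal text.

```python
def parse_span(text):
    """Parse inline text into ('text', str) and ('bold', children) nodes."""
    nodes, i = [], 0
    while i < len(text):
        marker = text[i:i + 2]
        if marker in ("__", "**"):
            close = text.find(marker, i + 2)
            if close > i + 2:
                # bold := marker SPAN marker (SPAN is parsed recursively)
                nodes.append(("bold", parse_span(text[i + 2:close])))
                i = close + 2
                continue
            # No closing marker: the bold alternative fails, so the
            # marker is emitted as literal text instead.
            nodes.append(("text", marker))
            i += 2
            continue
        nodes.append(("text", text[i]))
        i += 1
    return merge_text(nodes)


def merge_text(nodes):
    """Join adjacent text nodes so the result is easier to read."""
    merged = []
    for kind, value in nodes:
        if kind == "text" and merged and merged[-1][0] == "text":
            merged[-1] = ("text", merged[-1][1] + value)
        else:
            merged.append((kind, value))
    return merged
```

With this sketch, `parse_span("this is __not starting bold")` yields a single text node, while `parse_span("a **b** c")` yields a bold node between two text nodes.
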
> 2. using back-references in end-tokens, example: `a ``` ``raw`` ```
> environment`. A formal grammar can't really do that, and IMO the
> clean solution would be to define single-quoted (backticked) raw as
> supporting no escaping and ending with the first `` ` ``, whereas double-quoted
> (backticked) raw would support escaping of `` ` `` and `\`.
True, back-references are not possible in any formal grammar - but given a
parsing framework that supports a fixed amount of lookahead, it's easy to
support nearly the same functionality that Markdown does using alternative
definitions. Now this can get ugly, as when six definitions are required to
support a code span using up to six backticks as markers, but it is somewhat
manageable.

Regardless, I definitely prefer Allan's proposed revision of the syntax -
but if we need the backwards-compatibility, then it should be possible to
support it with only a minor mess resulting.
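
To make the "six definitions" point concrete, here is a rough sketch using Python's `re` module rather than Martel. The names `BACKREF`, `alternatives`, `EXPANDED` and `code_spans` are made up for this example, and the patterns only approximate how code spans are delimited. The first pattern relies on a back-reference; the second spells out one alternative per marker length, which is the kind of redundancy a grammar without back-references would need.

```python
import re

# Back-reference version: \1 must repeat the exact run of backticks
# that opened the span.
BACKREF = re.compile(r"(?<!`)(`+)(?!`)(.+?)(?<!`)\1(?!`)", re.S)


def alternatives(max_ticks=6):
    """Build one alternative per marker length, with no back-reference."""
    parts = []
    for n in range(1, max_ticks + 1):
        ticks = "`" * n
        parts.append(rf"(?<!`){ticks}(?!`)(.+?)(?<!`){ticks}(?!`)")
    return re.compile("|".join(parts), re.S)


EXPANDED = alternatives()


def code_spans(pattern, text):
    """Return the stripped contents of every code span matched by pattern."""
    spans = []
    for match in pattern.finditer(text):
        # The span content is the last group that took part in the match.
        content = [g for g in match.groups() if g is not None][-1]
        spans.append(content.strip())
    return spans
```

Both patterns find the same span in Allan's example: the text between the triple-backtick markers, including the inner double backticks.
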
> 5. heuristically defined end of lists, sub-lists and block-quotes.
> This would need to be more strict. I am not entirely sure what the
> current definition is, so I am wary of reformulating a strict
> version.
This would indeed have to be more strict - and I really think some sort of
stricter specification would be very valuable, particularly since this is an
issue the various Markdown implementations tend to disagree on. Personally,
I think sublisting should require at least 4 spaces (or 1 tab) of
indentation past the previous list's indenting level, which would keep
consistency with the rest of Markdown. This sort of thing would also have
the benefit of making it MUCH easier to write a decent semi-formal lexer,
which could then simplify the task of writing the parser. (That reminds me -
like other languages that are indentation-sensitive, it's impossible to
specify Markdown both formally and completely... some part of the parser
will need to be informal.)
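
As a rough illustration of such a semi-formal lexer (a sketch only; `lex_list`, the token names and the choice to round partial indents down are assumptions, not part of any Markdown implementation), the fixed 4-space rule lets the indentation be turned into explicit INDENT and DEDENT tokens, which a context-free grammar can then treat like ordinary brackets:

```python
import re

# A list item: optional indent, a bullet, at least one space, then the text.
ITEM = re.compile(r"^([ \t]*)[*+-][ \t]+(.*)$")


def lex_list(lines, step=4):
    """Turn list lines into ITEM/INDENT/DEDENT tokens under a fixed-step rule."""
    tokens, depth = [], 0
    for line in lines:
        match = ITEM.match(line)
        if match is None:
            continue  # non-item lines are out of scope for this sketch
        width = len(match.group(1).expandtabs(step))
        # Partial indents round down; an item opens at most one new level.
        level = min(width // step, depth + 1)
        while depth < level:
            tokens.append("INDENT")
            depth += 1
        while depth > level:
            tokens.append("DEDENT")
            depth -= 1
        tokens.append(("ITEM", match.group(2)))
    tokens.extend(["DEDENT"] * depth)  # close any levels still open
    return tokens
```

Under this rule an item indented by only two spaces stays at the level of the enclosing list instead of starting a new sub-list.
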

Anyway, that's a bit of what I've picked up in the process of writing my
parser... I'd be glad to help out with finding a way to formalize Markdown,
however I can.

- Eric Astor

A. Pagaltzis

2006-Jul-29 17:54 UTC

Formal Grammar — some thoughts

* Allan Odgaard <29mtuz102@sneakemail.com> [2006-07-29 22:40]:
> 1. interpreting tokens as literal text when the end token is
> missing, example: `this is __not starting bold`. For bold it
> doesn't matter IMO (having to escape the token), but having to
> escape all single appearances of `_` and `*` could be
> irritating. That said, at present `_` often does come in pairs, so
> one often already needs to wrap filenames, environment
> variables and similar names which use the underscore in a raw
> environment.
I wouldn't go for a pure formal grammar. If you don't, then it's
easy to tolerate ambiguity in the language by deferring
disambiguation until possible. Just accumulate potential tokens
and only assign meaning once it's decidable.
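
One way to picture this deferred disambiguation is the following Python sketch. It is an assumption-laden illustration, not code from any Markdown implementation: `resolve`, `TOKEN` and the HTML tag mapping are invented for the example, and intraword underscores are not treated specially. Every marker is first stored as literal text and only becomes emphasis once a matching closer shows up.

```python
import re

# Split the input into emphasis markers and runs of ordinary text.
TOKEN = re.compile(r"\*\*|__|[*_]|[^*_]+")


def resolve(text):
    """Tokenise first; decide later whether a marker is emphasis or literal."""
    out = []      # output fragments; markers start out as literal text
    pending = {}  # marker -> index in out of an opener still awaiting a match
    for tok in TOKEN.findall(text):
        if tok[0] not in "*_":
            out.append(tok)
        elif tok in pending:
            # A matching closer arrived: only now do both markers get meaning.
            tag = "strong" if len(tok) == 2 else "em"
            start = pending.pop(tok)
            out[start] = f"<{tag}>"
            out.append(f"</{tag}>")
            # Openers inside the span that never closed stay literal.
            pending = {m: i for m, i in pending.items() if i < start}
        else:
            pending[tok] = len(out)
            out.append(tok)
    return "".join(out)
```

In this sketch an unmatched `__` is simply left in the output, and a marker that would cross an enclosing span stays literal rather than producing badly nested tags.
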
> 2. using back-references in end-tokens, example: `a ``` ``raw``
> ``` environment`. A formal grammar can't really do that,
I'm pretty sure it can. You just need a couple of redundant
non-terminals.
> 5. heuristically defined end of lists, sub-lists and
> block-quotes. This would need to be more strict. I am not
> entirely sure what the current definition is, so I am wary of
> reformulating a strict version. From the source it seems that
> a sub-list is started when a line is a list item with a
> different (exact) indent than the first list item, allowing for
> some fun flexibility:
That was recently discussed. It will be stricter in future
versions, requiring a certain amount of indentation.
> There is also an ambiguity between `*` used for bold and used
> for a list item.
That one is helped if the vocabulary contains newlines as
terminals, and gets easy if you allow deferred disambiguation.
> A minor problem is that, inside a list item environment, the
> rule for e.g. raw blocks needs to be redefined (to require 2
> tabs or 8 spaces), and that would be necessary for each new
> level (to add an extra indent to the requirement), with the
> likely outcome that raw blocks would only be supported in e.g.
> the first 3 levels of list items.
Objection. To me, a great feature of Markdown over nearly every
wiki markup out there is that nested block structures are
composable with straightforward rules. If a pure formal grammar
can't cope, then to hell with pure formal grammars. It's quite
easy to cope with nesting once you leave the purely declarative
path. Heck, Perl 5 pattern matches can do it.
> Take the following relatively simple code, which produces bogus
> markup, as an example of how fragile this stuff currently is:
The current reference implementation of Markdown, frankly, isn't
very good. It's a search&replace train, which makes it inherently
fragile and painful to extend. It's just valuable anyway because
it's actual running code that works without breaking badly too
often (cf. Anthony DeBoer's delectably sardonic definition of
"legacy").

Regards,
-- 
Aristotle Pagaltzis // <http://plasmasturm.org/>

Allan Odgaard

2006-Jul-29 21:12 UTC

Re: Formal Grammar — some thoughts

On 29/7/2006, at 23:54, A. Pagaltzis wrote:
>> [ list stuff ]
> Objection. To me, a great feature of Markdown over nearly every  
> wiki markup out there is that nested block structures are  
> composable with straightforward rules. If a pure formal grammar  
> can't cope, then to hell with pure formal grammars.
Well, I'd like to have these "straightforward rules" defined before I
judge whether or not a formal grammar can cope ;)

The syntax page [1] only mentions nesting of block quotes (in block
quotes) and doesn't mention nested lists at all. The "reference
implementation", as you say, "isn't very good" -- at least not for
extracting concise syntax rules.

[1] http://daringfireball.net/projects/markdown/syntax

Allan Odgaard

2006-Aug-01 11:17 UTC

Re: Formal Grammar — some thoughts

On 31/7/2006, at 4:56, A. Pagaltzis wrote:
>> [...] it can't be done without revising some parts of the syntax,
>> OTOH the problematic parts (e.g. nested block elements)
> You keep asserting that, and it keeps failing to make any sense.
Asserting what? That nested block elements are problematic?
> [...] grammars can express nested constructs. I wonder what you are  
> talking about.
Nested expressions in context free grammars [1] are not directly  
comparable with nested block elements in Markdown.
> [...] That only means the current implementations need to be fixed.
Fixed how? From the syntax I get that I can do `> block quote` and
`* list item`, and from the cheat sheet that I can nest things like:

       * > list item with quoting

Ignoring here that the HTML produced for that is:

     <ul>
     <li>> list item with quoting</li>
     </ul>

Let's continue with the syntax, which further says that I can be lazy
and leave out the indent for list items and `>` for block quotes. So  
what should this convert to:

       * > list item with quoting
     more text here

Does the additional line belong to the block quote or the list item?  
I'm inclined to say block quote, since a block quote is not
terminated before there is a blank line, at least from how I read the  
syntax. But then what about this:

       * > list item with quoting
     more text here
       * another list item

There is no blank line after the block quote, so does
`* another list item` also belong to the block quote? This particular example does
not parse as a block quote, but if we make one that does, we will see
that Markdown does indeed make the list item part of the block quote.
For example, this:

       * leading text as block quote can't be first
         > some block quoted text
       * another list item

Turns into this (wrongly nested, but at least it got the block quote)  
HTML:

     <p><ul>
     <li>leading text as block quote can't be first</p>

     <blockquote>
       <p>some block quoted text</li>
       <li>another list item</li>
       </ul></p>
     </blockquote>

This however raises two questions:

1. Should it actually be a list item in the block quote? Take this
example:

         > block quote
         * more block quote

     Which turns into this markup (i.e. there is no list item):

         <blockquote>
           <p>block quote
           * more block quote</p>
         </blockquote>

2. If `* another list item` gets nested into the block quote, we need
a blank line in front of that item to make it part of the root list.
But then, will we get the "spaced out" version of the list, where each
item is wrapped in `<p>…</p>` (which is normal for such "spaced out"
lists)?

These are the problems I want to have addressed/fixed! And this is
where I think the syntax needs revising. I have shown above the
problems with the ambiguity of nesting, but I also think lazy mode for block
quoting should not be a feature at all. Take this example:

     > > I wrote something
     > you replied
     and now here is my reply to your reply.

This turns into the following markup:

     <blockquote>
       <blockquote>
         <p>I wrote something
         you replied
         and now here is my reply to your reply.</p>
       </blockquote>
     </blockquote>

Would people actually expect that?
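
The difference between the lazy rule and a stricter one can be sketched as follows. This only models the behaviour described in this message, not the actual Markdown source; `quote_depths`, `PREFIX` and the `lazy` flag are names invented for the example. Under the lazy rule, a non-blank line with fewer `>` markers keeps the depth of the line before it; under the strict rule, each line's depth is exactly the number of markers it carries.

```python
import re

# Leading run of block quote markers, each optionally followed by a space.
PREFIX = re.compile(r"^((?:[ ]{0,3}>[ ]?)*)")


def quote_depths(lines, lazy=True):
    """Return the block quote depth assigned to each line."""
    depths, previous = [], 0
    for line in lines:
        if not line.strip():
            previous = 0  # a blank line ends every open block quote
            depths.append(0)
            continue
        marked = PREFIX.match(line).group(1).count(">")
        # Lazy rule: a line with fewer markers continues the open paragraph.
        depth = max(marked, previous) if lazy else marked
        depths.append(depth)
        previous = depth
    return depths
```

For the three-line example above, the lazy rule puts every line at depth 2, whereas the strict rule gives depths 2, 1 and 0, which is probably closer to what the writer intended.
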

Anyway, this thread was just to gauge the interest in a more strict
(but potentially revised) syntax and John's future direction and
thoughts on this.

It's clear that this thread did not turn out to be productive, so I
will retreat from this discussion.

[1] http://en.wikipedia.org/wiki/Context_free_grammars
