I recently subscribed and saw in the archive that Eric Astor was asking for a formal grammar (unlikely to be the first such request). Currently there are a few obstacles to writing one, so I am curious whether Mr. Gruber has given any thought to moving toward one.

This would also allow a more "clean" parser, which would get rid of some of the current problems (bad nesting[^1], styles which cross environments[^2], and problems with the md5 checksums[^3]), and I am sure it would also improve performance significantly.

Some of the problems with a formal grammar (as I see it) are:

1.  Interpreting tokens as literal text when the end token is missing, for example: `this is __not starting bold`. For bold it doesn't matter IMO (having to escape the token), but having to escape every single appearance of `_` and `*` could be irritating. That said, `_` often does come in pairs at present, so one often already needs to wrap filenames, environment variables and similar names that use the underscore in a raw environment.

2.  Using back-references in end tokens, for example: `a ``` ``raw`` ``` environment`. A formal grammar can't really do that, and IMO the clean solution would be to define single-quoted (backticked) raw as supporting no escaping and ending at the first `` ` ``, whereas double-quoted (backticked) raw would support escaping of `` ` `` and `\`.

5.  Heuristically defined end of lists, sub-lists and block quotes. This would need to be more strict. I am not entirely sure what the current definition is, so I am wary of formulating a strict version. From the source it seems that a sub-list is started when a line is a list item with a different (exact) indent than the first list item, allowing for some fun flexibility:

    * item 1
      * item 1a
    * item 2
      * item 2a
      * item 2b
      * item 2c
    * item 3

There is also an ambiguity between `*` used for bold and `*` used for a list item.

A minor problem is that inside a list item the rule for, e.g., raw blocks needs to be redefined (to require 2 tabs or 8 spaces), and that would be necessary for each new level (adding an extra indent to the requirement each time). The likely outcome is that raw blocks would only be supported in, say, the first 3 levels of list items. OTOH I doubt anyone would feel safe using raw blocks in deeply nested list items, given the (IMHO) rather vague definitions of when lists stop and how they interact with raw environments, block quotes, etc.

Take the following relatively simple input, which produces bogus markup, as an example of how fragile this stuff currently is:

    * this is list item
    > * this item is in a block quote
    more block quoting?
    are we still in list and block quote?
    > is this a new block quote?

Thanks for reading this far.

[^1]: Example: `__bold _and__ italic_`.
[^2]: Example: `*not italic [link*text](#)`.
[^3]: I have only experienced this with MultiMarkdown, where the problem is easy to reproduce by using styles in footnotes.
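A minimal sketch of the raw-span rule proposed in point 2, in Python. The function name `parse_raw_span` is hypothetical, and treating a lone backtick inside a double-backticked span as literal text is an assumption; the proposal above does not settle that case.

```python
def parse_raw_span(text, start):
    """Parse a raw span beginning at text[start].

    Returns (content, index_after_span), or None if the span is unterminated.
    """
    if text.startswith("``", start):
        # Double-backticked form: backslash escapes ` and \, ends at the next ``.
        out = []
        i = start + 2
        while i < len(text):
            ch = text[i]
            if ch == "\\" and i + 1 < len(text) and text[i + 1] in "`\\":
                out.append(text[i + 1])  # keep the escaped character only
                i += 2
            elif text.startswith("``", i):
                return "".join(out), i + 2
            else:
                out.append(ch)  # assumption: a lone backtick is literal here
                i += 1
        return None
    if text.startswith("`", start):
        # Single-backticked form: no escaping, ends at the first backtick.
        end = text.find("`", start + 1)
        if end == -1:
            return None
        return text[start + 1:end], end + 1
    return None
```

Because neither form needs to remember how long the opening delimiter was, no back-reference is involved.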
> -----Original Message-----
> From: markdown-discuss-bounces@six.pairlist.net [mailto:markdown-discuss-bounces@six.pairlist.net] On Behalf Of Allan Odgaard
> Sent: Saturday, July 29, 2006 4:38 PM
> To: markdown-discuss@six.pairlist.net
> Subject: Formal Grammar - some thoughts
>
> I recently subscribed and saw in the archive that Eric Astor was
> asking for a formal grammar (unlikely the first time for such request.)

I should now add that since asking, I've started work on a parser for a Markdown variant, coding in Python and using the Martel parsing framework (http://www.dalkescientific.com/Martel/). A VERY large fraction of what I've written could likely be re-used in building a true Markdown parser with this framework, which uses a regex-based specification of the parsing format and supports various features (including back-references) that are not possible in formal grammars. In the process, I've learned a lot that may also be applicable to building a true formal grammar for Markdown.

> Some of the problems with a formal grammar (as I see it) are:
>
> 1. interpreting tokens as literal text when end token is missing,
> example: `this is __not starting bold`.

This is actually simple to deal with in most formal grammars. Since formal grammars are recursive, you simply define bold (for example) as:

    bold := ('__' SPAN '__') | ('**' SPAN '**')

When no closing marker follows, the `bold` rule simply fails to match, which allows the marker token to be interpreted as literal text. The Martel 'grammar' works similarly.

> 2. using back-references in end-tokens, example: `a ``` ``raw`` ```
> environment`. A formal grammar can't really do that, and IMO the
> clean solution would be to define single-quoted (backticked) raw as
> supporting no escaping and end with the first `` ` ``, where
> double-quoted (backticked) raw would support escaping of `` ` `` and `\`.

True, back-references are not possible in any formal grammar - but given a parsing framework that supports a fixed amount of lookahead, it's easy to support nearly the same functionality that Markdown does by using alternative definitions. Now this can get ugly, as when six definitions are required to support a code span using up to six backticks as markers, but it is somewhat manageable. Regardless, I definitely prefer Allan's proposed revision of the syntax - but if we need the backwards compatibility, then it should be possible to support it with only a minor mess resulting.

> 5. heuristically defined end of lists, sub-lists and block-quotes.
> This would need to be more strict. I am not entirely sure what the
> current definition is, so I am wary of reformulating a strict
> version.

This would indeed have to be more strict - and I really think some sort of stricter specification would be very valuable, particularly since this is an issue the various Markdown implementations tend to disagree on. Personally, I think sublisting should require at least 4 spaces (or 1 tab) of indentation past the previous list's indenting level, which would keep consistency with the rest of Markdown. This sort of thing would also have the benefit of making it MUCH easier to write a decent semi-formal lexer, which could then simplify the task of writing the parser. (That reminds me - like other languages that are indentation-sensitive, it's impossible to specify Markdown both formally and completely... some part of the parser will need to be informal.)

Anyway, that's a bit of what I've picked up in the process of writing my parser... I'd be glad to help out with finding a way to formalize Markdown, however I can.

- Eric Astor
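A minimal sketch of the "alternative definitions" idea for code spans, in Python with the standard `re` module rather than Martel. It builds one alternative per delimiter length instead of using a back-reference. The names `MAX_TICKS`, `code_span_alternative` and `find_code_spans` are hypothetical, the limit of six comes from the remark above, and the one-character lookbehind is an extra assumption beyond the fixed lookahead mentioned there.

```python
import re

MAX_TICKS = 6  # assumption: support delimiters of one to six backticks


def code_span_alternative(n):
    """Pattern for a code span delimited by exactly n backticks on each side."""
    ticks = "`" * n
    # The lookarounds make sure a run of n backticks is not part of a longer run.
    return rf"(?<!`){ticks}(?!`)(.+?)(?<!`){ticks}(?!`)"


# Longest delimiter first; each alternative has a fixed length, so no
# back-reference to the opening delimiter is needed.
CODE_SPAN = re.compile(
    "|".join(code_span_alternative(n) for n in range(MAX_TICKS, 0, -1)),
    re.DOTALL,
)


def find_code_spans(text):
    """Return the contents of all code spans in text, with padding stripped."""
    spans = []
    for match in CODE_SPAN.finditer(text):
        # Exactly one of the six capture groups took part in the match.
        content = next(g for g in match.groups() if g is not None)
        spans.append(content.strip())
    return spans
```

For the input `a ``` ``raw`` ``` environment`, `find_code_spans` returns the single span ``` ``raw`` ```, and an unmatched backtick is left alone because no alternative matches it.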
* Allan Odgaard <29mtuz102@sneakemail.com> [2006-07-29 22:40]:

> 1. interpreting tokens as literal text when end token is
> missing, example: `this is __not starting bold`. For bold it
> doesn't matter IMO (having to escape the token,) but having to
> escape all single appearances of `_` and `*` could be
> irritating, although presently _ often do come in pairs, so
> here one often already do need to wrap filenames, environment
> variables and similar which use the underscore in a raw
> environment.

I wouldn't go for a pure formal grammar. If you don't, then it's easy to tolerate ambiguity in the language by deferring disambiguation until possible. Just accumulate potential tokens and only assign meaning once it's decidable.

> 2. using back-references in end-tokens, example: `a ``` ``raw``
> ``` environment`. A formal grammar can't really do that,

I'm pretty sure it can. You just need a couple of redundant non-terminals.

> 5. heuristically defined end of lists, sub-lists and
> block-quotes. This would need to be more strict. I am not
> entirely sure what the current definition is, so I am wary of
> reformulating a strict version. From the source it seems that
> a sub-list is started when a line is a list item with a
> different (exact) indent as the first list item, allowing for
> some fun flexibility:

That was recently discussed. It will be stricter in future versions, requiring a certain amount of indentation.

> There is also an ambiguity between `*` used for bold and used
> for a list item.

That one is helped if the vocabulary contains newlines as terminals, and gets easy if you allow deferred disambiguation.

> A minor problem is that when in a list item environment the
> rule e.g. for raw blocks needs to be redefined (to require 2
> tabs or 8 spaces) and that would be necessary for each new
> level (to add an extra indent in the requirement) with the
> likely outcome that raw blocks would only be supported in e.g.
> the 3 first levels of list items.

Objection. To me, a great feature of Markdown over nearly every wiki markup out there is that nested block structures are composable with straightforward rules. If a pure formal grammar can't cope, then to hell with pure formal grammars. It's quite easy to cope with nesting once you leave the purely declarative path. Heck, Perl 5 pattern matches can do it.

> Take the following relative simple code which produce bogus
> markup as an example of how fragile this stuff currently is:

The current reference implementation of Markdown, frankly, isn't very good. It's a search-and-replace train, which makes it inherently fragile and painful to extend. It's just valuable anyway because it's actual running code that works without breaking badly too often (cf. Anthony DeBoer's delectably sardonic definition of "legacy").

Regards,
-- 
Aristotle Pagaltzis // <http://plasmasturm.org/>
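A minimal sketch of deferred disambiguation for bold markers, in Python: potential tokens are accumulated as literal text and only rewritten to markup once a matching closer shows up. The names `TOKEN` and `render_bold` are hypothetical, only `**` and `__` are handled, and no HTML escaping is done; this illustrates the idea and is not how any existing Markdown implementation works.

```python
import re

TOKEN = re.compile(r"(\*\*|__)")


def render_bold(text):
    """Turn matched ** or __ pairs into <strong>; leave unmatched markers literal."""
    out = []      # output pieces, in order
    pending = {}  # marker -> index in `out` of an opener whose meaning is undecided
    for part in TOKEN.split(text):
        if part == "":
            continue
        if part in ("**", "__"):
            if part in pending:
                # A closer arrived, so the earlier token is now decidable.
                opener = pending.pop(part)
                out[opener] = "<strong>"
                out.append("</strong>")
                # Openers inside the closed span could only match by crossing
                # it, so they are dropped and stay literal.
                pending = {m: i for m, i in pending.items() if i < opener}
            else:
                pending[part] = len(out)
                out.append(part)  # literal unless a closer is found later
        else:
            out.append(part)
    return "".join(out)
```

With this approach `this is __not starting bold` comes back unchanged, while `this is __bold__ text` gains a `<strong>` element, and a crossing input such as `**a __b** c__` keeps the inner `__` markers as text.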
On 29/7/2006, at 23:54, A. Pagaltzis wrote:

>> [ list stuff ]
> Objection. To me, a great feature of Markdown over nearly every
> wiki markup out there is that nested block structures are
> composable with straightforward rules. If a pure formal grammar
> can't cope, then to hell with pure formal grammars.

Well, I'd like to have these "straightforward rules" defined before I judge whether or not a formal grammar can cope ;)

The syntax page [1] only mentions nesting of block quotes (in block quotes) and doesn't mention nested lists at all. The "reference implementation", as you say, "isn't very good" -- at least not for extracting concise syntax rules.

[1] http://daringfireball.net/projects/markdown/syntax
On 31/7/2006, at 4:56, A. Pagaltzis wrote:

>> [...] it can't be done without revising some parts of the syntax,
>> OTOH the problematic parts (e.g. nested block elements)
> You keep asserting that, and it keeps failing to make any sense.

Asserting what? That nested block elements are problematic?

> [...] grammars can express nested constructs. I wonder what you are
> talking about.

Nested expressions in context-free grammars [1] are not directly comparable with nested block elements in Markdown.

> [...] That only means the current implementations need to be fixed.

Fixed how?

From the syntax I get that I can do `> block quote` and `* list item`, and from the cheat sheet that I can nest things like:

    * > list item with quoted text

Ignoring here that the HTML produced for that is:

    <ul>
    <li>> list item with quoted text</li>
    </ul>

Let's continue with the syntax, which further says that I can be lazy and leave out the indent for list items and the `>` for block quotes. So what should this convert to:

    * > list item with quoting
    more text here

Does the additional line belong to the block quote or the list item? I'm inclined to say block quote, since a block quote is not terminated before there is a blank line, at least from how I read the syntax. But then what about this:

    * > list item with quoting
    more text here
    * another list item

There is no blank line after the block quote, so does `* another list item` also belong to the block quote? This particular example does not parse as a block quote, but if we make one that does, we will see that Markdown does indeed make the list item part of the block quote. For example, this input:

    * leading text as block quote can't be first
    > some block quoted text
    * another list item

turns into this HTML (wrongly nested, but at least it got the block quote):

    <p><ul>
    <li>leading text as block quote can't be first</p>

    <blockquote>
      <p>some block quoted text</li>
    <li>another list item</li>
    </ul></p>
    </blockquote>

This however raises two questions:

1.  Should it actually be a list item in the block quote? For example, take this input:

        > block quote
        * more block quote

    which turns into this markup (i.e. there is no list item):

        <blockquote>
          <p>block quote
          * more block quote</p>
        </blockquote>

2.  If `* another list item` gets nested into the block quote, we need a blank line in front of that item to make it part of the root list. But then, will we get the "spaced out" version of the list, where each item is wrapped in `<p>...</p>` (which is normal for such "spaced out" lists)?

These are the problems I want to have addressed/fixed! And this is where I think the syntax needs revising.

I have shown above the problems with the ambiguity of nesting, but I think lazy mode for block quoting should not be a feature at all. Take this example:

    > > I wrote something
    > you replied
    and now here is my reply to your reply.

This turns into the following markup:

    <blockquote>
      <blockquote>
        <p>I wrote something
        you replied
        and now here is my reply to your reply.</p>
      </blockquote>
    </blockquote>

Would people actually expect that?

Anyway, this thread was just to learn the interest in a more strict (but potentially revised) syntax, and John's future direction and thoughts on this. It's clear that this thread did not turn out to be productive, so I will retreat from this discussion.

[1] http://en.wikipedia.org/wiki/Context_free_grammars
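A minimal sketch, in Python, of what a strict (non-lazy) reading of block quote markers would give for the last example: each line's quote depth is simply the number of leading `>` markers it carries itself. The helper names `quote_depth` and `strict_depths` are hypothetical, and splitting the example into three lines is an assumption about the original layout.

```python
def quote_depth(line):
    """Return (depth, remaining_text), counting only the line's own '>' markers."""
    depth = 0
    rest = line.lstrip(" ")
    while rest.startswith(">"):
        depth += 1
        rest = rest[1:].lstrip(" ")
    return depth, rest


def strict_depths(lines):
    """Quote depth per line when lazy continuation is not allowed."""
    return [quote_depth(line)[0] for line in lines]


example = [
    "> > I wrote something",
    "> you replied",
    "and now here is my reply to your reply.",
]
print(strict_depths(example))
```

Under this rule the three lines land at depths 2, 1 and 0, instead of all being folded into the innermost block quote.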