In article <6CDFE6FE-CE6C-461D-BB95-18C2A86CB75F at attacklab.net>,
John Fraser <markdown-discuss at six.pairlist.net> wrote:
>
>On Feb 29, 2008, at 10:13 AM, Michel Fortin wrote:
>
>> I think the syntax needs to be defined unambiguously, not
>> necessarily as a formal grammar, but certainly not with code either.
>> My idea, currently, is to write a parsing procedure which is easy to
>> read and implement in various ways, using a formal grammar to define
>> various constructs of the syntax and plain english to link things
>> together. I also intend to keep the spec implementable as an
>> incremental parser, but that will require backtracking.
>
>
>I agree that Markdown needs to be defined unambiguously, but I don't
>think that's feasible with plain English in the loop. For something
>as complex and flighty as Markdown, we need working code.
I'm not so sure about this. I managed to write a markdown
implementation without using anything other than the daring fireball
syntax document and MarkdownTest_1.0. And I am by no means a
Perl programmer.
If it's possible to write a Markdown (that passes MarkdownTest) with
the current documentation, describing it in plain english isn't as
difficult as it might seem. And plain english has the advantage
that it doesn't require knowledge of the implementation language to
understand the document.
-david parsons
On Mar 1, 2008, at 1:19 PM, david parsons wrote: >> I agree that Markdown needs to be defined unambiguously, but I don't >> think that's feasible with plain English in the loop. For something >> as complex and flighty as Markdown, we need working code. > > I'm not so sure about this. I managed to write a markdown > implementation without using anything other than the daring > fireball > syntax document and MarkdownTest_1.0. And I am by no means a > Perl programmer. Okay, but I'd argue that your success had a lot more to do with the test suite than the syntax document. I'll admit it: I'm probably more suspicious of paper specs than I should be. But I can't help thinking that (1) any natural-language Markdown spec will have holes; (2) any test suite will have littler holes; and (3) the most popular implementation will always be the de facto standard. My JavaScript port of Markdown needs to match up perfectly with a server-side version in order to be useful, so I'm probably a little more sensitive to underspecification than most. But a spec's not worth much if implementations aren't interchangeable. And since Markdown has to continue silently when it gets confused, we'd need to define all the corner cases completely -- or risk locking users into a particular reading of the spec. I'm all for writing a specification, but I think its purpose should be to inform and to justify a reference implementation and test suite. - John Fraser
On 1 Mar 2008, at 19:19, David Parsons wrote: >> I agree that Markdown needs to be defined unambiguously, but I don't >> think that's feasible with plain English [...] > > I'm not so sure about this. I managed to write a markdown > implementation without using anything other than the daring > fireball > syntax document and MarkdownTest_1.0. And I am by no means a > Perl programmer. And no offense, but there must be hundreds of edge-cases where your implementation disagrees with Markdown.pl. Have a look e.g. at http://six.pairlist.net/pipermail/markdown-discuss/2006-August/000151.html for some edge cases related to nesting block elements, where the outcome is ambiguous, and I am quite sure there are no tests, as Markdown.pl generates bad HTML for most of them. A formal grammar defines _exactly_ how things should be, as I argue here http://six.pairlist.net/pipermail/markdown-discuss/2007-August/000746.html where I also show how Markdown.pl presently has unintuitive precedence which is definitely not defined in the syntax document and something I doubt is in the tests (as the behavior to me is unattractive, but stems from how Markdown.pl is implemented, and thus most of the ports as well). The problem so far has been that the formal syntax normally used to define grammars does not support Markdown?s notion of embedding, but as mentioned here http://six.pairlist.net/pipermail/markdown-discuss/2008-February/001002.html I have had some success with a rule-based implementation that uses a stack for aggregating rules that needs to be applied to the current line before it is handed to the regular parser -- this allows a specification without code and which is unambiguous to edge-cases since the rules are exhaustive, unlike a document written in English. Though without changing a lot of edge-case behavior, I find it hard to see Markdown using such rule-based implementation, so personally I am favoring a new Markdown-inspired language.
Allan Odgaard wrote: > Though without changing a lot of edge-case behavior, I find it hard > to see Markdown using such rule-based implementation, so personally > I am favoring a new Markdown-inspired language. For my part, I'm currently trying to specify parsing rules Markdown Extra, and make the specification usable to parse Markdown too. The idea is to preserve the way it is working now, but to handle edge cases in a consistent and predictable manner. What I want to achieve is interoperability between implementations for the current Markdown and Markdown Extra languages, not creating a new look-alike language. > The problem so far has been that the formal syntax normally used to > define grammars does not support Markdown?s notion of embedding, but > as mentioned here http://six.pairlist.net/pipermail/markdown-discuss/2008-February/001002.html > I have had some success with a rule-based implementation that uses > a stack for aggregating rules that needs to be applied to the > current line before it is handed to the regular parser -- this > allows a specification without code and which is unambiguous to edge- > cases since the rules are exhaustive, unlike a document written in > English. I'd like to point out a thing: you can always write in english what you can with a formal grammar; if you write things correctly, they'll be precise and unambiguous. This has the disadvantage of being more verbose, but the advantage that you don't need to learn a new "language", which is the grammar. That said, I'm currently looking at how to specify Markdown formally. Whether to use a grammar or english, that is to be decided later. I'm looking at the general form of a rule, and I find the post you linked above gives a pretty good insight at what I need. Each rule in your lost rule-based implementation had this (quoting): > 1. A regexp that makes the parser enter the context the rule > represents (e.g. block quote, list, raw, etc.). > > 2. A list of which rules are allowed in the context of this rule. > > 3. A regexp for leaving the context of this rule. > > 4. A regexp which is pushed onto a stack when entering the context of > this rule, and popped again when leaving this rule. > > The fourth item here is really the interesting part, because it is > what made Markdown nesting work (99% of the time) despite this being > 100% rule-driven. I'm not sure that the regular expression in 4 does, beside being pushed and popped from the stack (perhaps it's the end of block expression), but overall it looks pretty good, and is pretty similar to how I'm currently approaching the problem. There are a couple of subtleties I'm not sure if these rules can catch though. In my idea, you'd have parametrized rules. For instance, the list of allowed rules (2) should change depending on the context: you shouldn't have a link within a link, but you can have emphasis in your link; therefore, the emphasis rule when within a link shouldn't have a link rule in it's list of sub rules (2). You also need a way for the regular expression in 3 to be variable depending on what you caught in 1 (to match the same number of backticks in a code span for instance; to catch a matching closing HTML tag, etc.). The way I see it, rules need to be parametrized so the above can be changed without having to define 2^(number of syntax elements) rules, such as EmphasisWithinLink, LinkWihtinEmphasis, CodeSpanWithinLinkWithinEmphasis, and so on. Michel Fortin michel.fortin at michelf.com http://michelf.com/
On Mar 3, 2008, at 7:30 AM, Michel Fortin wrote: > Allan Odgaard wrote: >> 4. A regexp which is pushed onto a stack when entering the context of >> this rule, and popped again when leaving this rule. >> >> The fourth item here is really the interesting part, because it is >> what made Markdown nesting work (99% of the time) despite this being >> 100% rule-driven. > > I'm not sure that the regular expression in 4 does, beside being > pushed and popped from the stack (perhaps it's the end of block > expression), but overall it looks pretty good, and is pretty similar > to how I'm currently approaching the problem. There are a couple of > subtleties I'm not sure if these rules can catch though. I assume Allan let the grammar refer back to this stack as if it were an ordinary rule, so you could use the stack to collect levels of indentation. It's like a limited kind of parameterization. I'd been planning to use recursive transformation to handle nesting, since it makes memoization easier and ought to be a little more readable. But I'll try Allan's idea if mine gets hairy. I like the direction you're both going, and I'm hoping we can come up with a definition that doesn't use any English at all. Admittedly, that'll be a lot easier for a version that does change some behavior at the edges -- like ditching Markdown's 'undocumented *precedence' rules* (<http://six.pairlist.net/pipermail/markdown-discuss/2007-August/000746.html >). I'm going to build my own little prototype to experiment with this stuff (<http://six.pairlist.net/pipermail/markdown-discuss/2008-February/001042.html >). My goal is to come up with a formal grammar that doubles as a (slow) reference implementation. You'll feed a grammar and an input file into a generic text-munging tool, which will spit out either the transformed output or an AST. The tool will be small, easy to port, and completely general -- you could use it to implement html2txt or smartypants or an HTML sanitizer, for example. That's the plan, anyway; we'll how the first iteration turns out. > The way I see it, rules need to be parametrized so the above can be > changed without having to define 2^(number of syntax elements) > rules, such as EmphasisWithinLink, LinkWihtinEmphasis, > CodeSpanWithinLinkWithinEmphasis, and so on. Since I'm doing something packrat-ish, I'm hoping I can use lookahead to keep the rules from exploding. John Fraser
> Since I'm doing something packrat-ish, I'm hoping I can use lookahead > to keep the rules from exploding. I didn't know what packrat was, so I googled it. It looks interesting: from what I understand, it's a more user-friendly way of specifying languages. Here's a good link, if anyone else is interested: http://pdos.csail.mit.edu/~baford/packrat/ -- Andrea Censi PhD student, Control & Dynamical Systems, Caltech http://www.cds.caltech.edu/~andrea/ "Life is too important to be taken seriously" (Oscar Wilde)
On Mar 3, 2008, at 3:36 PM, Andrea Censi wrote: > I didn't know what packrat was, so I googled it. It looks interesting: > from what I understand, it's a more user-friendly way of specifying > languages. > Here's a good link, if anyone else is interested: > > http://pdos.csail.mit.edu/~baford/packrat/ Sorry, I should have provided a link. And since you're looking at the papers, I should mention that what I'm building isn't strictly a packrat parser -- specifically, it won't guarantee O(n) running time. Two reasons: 1) it lets you match against regular expressions as well as literal strings; and 2) I'll probably handle nesting by modifying the results of one rule (i.e. stripping one level of indentation) and recursively calling a parser on the result.
On 3 Mar 2008, at 13:30, Michel Fortin wrote:
> [...]
>> 1. A regexp that makes the parser enter the context the rule
>> represents (e.g. block quote, list, raw, etc.).
>>
>> 2. A list of which rules are allowed in the context of this rule.
>>
>> 3. A regexp for leaving the context of this rule.
>>
>> 4. A regexp which is pushed onto a stack when entering the context of
>> this rule, and popped again when leaving this rule.
>>
>> The fourth item here is really the interesting part, because it is
>> what made Markdown nesting work (99% of the time) despite this being
>> 100% rule-driven.
>
> I'm not sure that the regular expression in 4 does, beside being
> pushed and popped from the stack
Yeah, I accidentally sent the letter w/o noticing I forgot to explain
the fourth rule.
The regexps which end on this stack are used to preprocess the current
line, so for example the rule for code blocks is:
RAW[1] = /\g {4}/ # Four spaces starts raw.
RAW[2] = [ RAW_TEXT ] # No other rules are active inside
raw, RAW_TEXT is a dummy .+ rule
RAW[4] = /\g( {4}| {,3}$)/ # While in the raw context, we need to
eat the first
# four spaces of each line, or the
line must be empty.
Two things to notice here:
1. I don?t use an explicit ?end? rule since we automatically leave
the context if RAW[4] doesn?t successfully match.
2. I use \g instead of ^ since we need to anchor to where the last
block-rule stopped matching, not necessarily BOL.
Now take the rule for block quote:
BQ[1] = /\g {,3}> {,3}/ # We start it for lines with > allowing
# up to 3 spaces before/after.
BQ[2] = [ BQ, RAW, PAR, ? ] # Basically all block elements
# can go inside block quote.
BQ[3] = /\g( *$|?hr?)/ # We leave block quote at empty lines or
# horizontal rulers?. The actual
pattern for
# ?hr? is something like:
# [ ]{,3}(?<M>[-*_])([
]{,2}\k<M>)
{2,}[ \t]*+$
BQ[4] = /\g( {,3}> ?)?/ # While in BQ eat leading quote
characters.
? I am actually not sure if this is ?the spec? or just a bug. But
placing a horizontal ruler just below a block quoted paragraph does
not give the expected ?lazy mode? and places the <hr> inside the block
quote, instead it leaves the block quote.
Just to make the example more complete, let us also have a paragraph
rule:
PAR[1] = /\g {,3}(?=[^ >])/ # Any non-special character with less
than
# 4 leading spaces starts a paragraph.
PAR[2] = [ B, EM, LINK, TEXT, ? ] # All the inline stuff works in
this context
PAR[3] = /\g(?= | {,3}>| {,3}$)/ # We exit the paragraph when
the line
# is starting raw, block
quote, or is
# empty. In practice
paragraphs do end
# with block quote, but not
with raw.
Now we have 3 rules, be aware I typed all this just now without actual
testing, and the goal is not to replicate Markdown.pl 100%, just to
give an example of how the rule-system works.
So our ROOT rule looks like this:
ROOT[1] = //
ROOT[2] = [ RAW, BQ, PAR ]
So when we start to process a document, using this root rule, we will
get a match (without actually advancing our position in the document,
since zero characters were matched).
After this match we have RAW, BQ, and PAR as active rules. Say our
document looks like this:
> A normal paragaph
> Some raw text
> Normal text again
Out of the block quote
The first line is ?> A normal paragaph? and we have 3 rules to apply,
BQ[1], RAW[1], and PAR[1].
Since all of these regexps starts with \g, they are anchored to the
first byte of the document, and only BQ[1] will match.
This ?eats? the ?> ? prefix, pushes BQ[4] on our stack, and makes BQ,
RAW, and PAR our new active rules (yeah, the same as before).
So we now have ?A normal paragaph? and again apply our 3 active rules,
this time PAR[1] will match, it won?t actually eat any characters, and
it won?t push additional rules onto our stack, but ti will change the
active rules to: B, EM, LINK, TEXT, ?
I didn?t define TEXT, but that is a fallback rule for non-special text-
runs. We apply these rules to the line, and TEXT will match the line.
Now comes the special part, when we move to next line, which is ?>
Some raw text? we start by applying the rules from our stack to this
line, we have BQ[4] on the stack, which will eat the leading ?> ?. The
line is now: ? Some raw text? and we have no more rules on the
stack. Before we apply the active rules though, we need to check if we
need to leave the current context, which is PAR, thus we try to apply
PAR[3], and we do get a match, so we leave PAR.
The active rules now revert to those active before we entered PAR,
i.e. RAW, BQ, and PAR. Applying these will give a match for RAW, so we
eat the match (the leading four spaces), push RAW[4] on the stack, and
set the new active rules to RAW[2], i.e. RAW_TEXT.
The line is now ?Some raw text? which will be eaten by the RAW_TEXT
rule.
Next line is ?> Normal text again? and we have both BQ[4] and RAW[4]
on the stack. We apply these in a FIFO order, so first BQ[4] which
eats ?> ?, then RAW[4], which fails to match, instructing us to leave
RAW, ?
Okay, enough writing ? I hope the above gives a better understanding
of how the rules are used.
> [...] You also need a way for the regular expression in 3 to be
> variable depending on what you caught in 1 (to match the same number
> of backticks in a code span for instance; to catch a matching
> closing HTML tag, etc.).
I allow captures from the match done by 1 to be referenced in 3.
Le 2008-03-04 ? 0:49, Allan Odgaard a ?crit :
> On 3 Mar 2008, at 13:30, Michel Fortin wrote:
>
>> [...]
>>> 1. A regexp that makes the parser enter the context the rule
>>> represents (e.g. block quote, list, raw, etc.).
>>>
>>> 2. A list of which rules are allowed in the context of this rule.
>>>
>>> 3. A regexp for leaving the context of this rule.
>>>
>>> 4. A regexp which is pushed onto a stack when entering the context
>>> of
>>> this rule, and popped again when leaving this rule.
>>>
>>> The fourth item here is really the interesting part, because it is
>>> what made Markdown nesting work (99% of the time) despite this
being
>>> 100% rule-driven.
>>
>> I'm not sure that the regular expression in 4 does, beside being
>> pushed and popped from the stack
>
> Yeah, I accidentally sent the letter w/o noticing I forgot to
> explain the fourth rule.
>
> [big explanation]
So you're basically using a line by line approach. I was thinking
about that as a possibility for parsing blocks, but I don't think I'll
do that because I need backtracking to be able to rewind beyond the
current line. Or can you do it?
I'm particularly curious about how you can handle headers of this form:
Header
======
> Now take the rule for block quote:
>
> BQ[1] = /\g {,3}> {,3}/ # We start it for lines with > allowing
> # up to 3 spaces before/after.
>
> BQ[2] = [ BQ, RAW, PAR, ? ] # Basically all block elements
> # can go inside block quote.
>
> BQ[3] = /\g( *$|?hr?)/ # We leave block quote at empty lines or
> # horizontal rulers?. The actual
> pattern for
> # ?hr? is something like:
> # [ ]{,3}(?<M>[-*_])([
]{,2}\k<M>)
> {2,}[ \t]*+$
>
> BQ[4] = /\g( {,3}> ?)?/ # While in BQ eat leading quote
> characters.
>
> ? I am actually not sure if this is ?the spec? or just a bug. But
> placing a horizontal ruler just below a block quoted paragraph does
> not give the expected ?lazy mode? and places the <hr> inside the
> block quote, instead it leaves the block quote.
I'm not sure what's the problem with horizontal rules in blockquotes.
I've tried many variations of:
> test
>
> ***
>
> test
and couldn't make it end the blockquote prematurely. If it did, I'd
say it'd be a bug because I see no way the user would expect the
horizontal rule to break the blockquote and no reason for the parser
to do so either.
> [...]
>
> Okay, enough writing ? I hope the above gives a better understanding
> of how the rules are used.
Indeed, it was quite insightful. Thank you.
Michel Fortin
michel.fortin at michelf.com
http://michelf.com/
On Tue, Mar 4, 2008 at 8:02 PM, Michel Fortin <michel.fortin at michelf.com> wrote: > > I'm not sure what's the problem with horizontal rules in blockquotes. > I've tried many variations of: ... trying to reproduce this on the df Markdown dingus lead to a weird result. Here's the input: > test > ---- and here's the output: <blockquote> <p>test</p> <h2>></h2> </blockquote>
On 5 Mar 2008, at 05:02, Michel Fortin wrote:
>> [big explanation]
> So you're basically using a line by line approach.
Yes, seeing how the block-level nesting stuff affects things ?line by
line?, this seems like the best approach :)
> I was thinking about that as a possibility for parsing blocks, but I
> don't think I'll do that because I need backtracking to be able to
> rewind beyond the current line. Or can you do it?
Backtracking? I am not sure when you want to do it and why you think
looking at things line-by-line will prevent you from doing it.
> I'm particularly curious about how you can handle headers of this
> form:
>
> Header
> ======
One approach is (for the regexp which starts a context) to allow an
array instead of just a single regexp?. The index of the regexp in
this array is the relative line offset (to current line) that the
regexp should be matched against.
So for the setext style header the rule would be:
H1[1][0] = /\g.+/ # anything can go into a heading
H1[1][1] = /\g={3,}/ # but line below must have at least three
equal signs
Of course when testing the regexps on lines, these must be
preprocessed as if in the current context, i.e. all the regexps on the
stack are applied to that line.
? In practice one could allow a composite regexp that uses \n and
simply call split on that, then insert the regexps from the stack
before all but the first regexp resulting from this split. This would
have the advantage of making the implementation hidden from the actual
rules and simpler for the person tweaking the rules.
>> [...] placing a horizontal ruler just below a block quoted
>> paragraph does not give the expected ?lazy mode? and places the
>> <hr> inside the block quote, instead it leaves the block quote.
>
> I'm not sure what's the problem with horizontal rules in blockquotes
> [...]
This is what I was referring to:
> Test
bla bla
> Test
- - -
The result becomes:
<blockquote>
<p>Test
bla bla</p>
<p>Test</p>
</blockquote>
<hr>