Hi,
I'm writing a Markdown Parser in Scheme by porting bits of Markdown.pl.
As you're probably aware, the Perl version massages the file into the
final output with a number of regexes. In my version I'm trying to use
the regexes to detect the starts and ends of the features and then take
specific action to emit the final representation. I'm doing it this way
because I want to emit an SXML tree data structure rather than an opaque
string.
I read the source file into a list of lines and can run regexes on
individual lines.
I can successfully extract link references.
I'm currently trying to extract inline HTML, basing my code on regexes
from _HashHTMLBlocks.
I can detect the start of an inline block:
"^<(block-tags-a\b.*
...but I'm having trouble detecting the end of the blocks in the same
way as the Perl version.
There seems to be a discrepancy between the "Markdown: Syntax" document
and the implementation in _HashHTMLBlocks.
The syntax document says "...and the start and end tags of the block
should not be indented with tabs or spaces."
Whilst this is true for the first regex in _HashHTMLBlocks, the second
regex (block_tags_b) will sweep up malformed entries like so:
-----
<div>
what happens when we have a <div>nested block</div> and
then the <div>nested block</div>
ends at the end of a line,
but no proper end tag?
-----
becomes
-----
<div>
what happens when we have a <div>nested block</div> and
then the <div>nested block</div>
<p>ends at the end of a line,
but no proper end tag?</p>
-----
i.e. the </div> at the end of the third line is detected as the end of
the outer block.
If that block is not at the end of the file and is followed later by a
properly formed block with the same tag, then everything between the
first opening tag and the first properly formed closing tag will get
entirely consumed... which is correct per the Syntax document.
Furthermore, the syntax document does not require the user to indent the
block contents, although the example implies it:
-----
<div>
<div>
Test nested HTML without indents
</div>
</div>
-----
becomes
-----
<div>
<div>
Test nested HTML without indents
</div>
<p></div></p>
-----
Finally, capitalised tag names appear to get wrapped in <p>s:
-----
<div>
<div>
tags for inner block must be indented.
</div>
</div>
<DIV>
<DIV>
TAGS FOR INNER BLOCK MUST BE INDENTED.
</DIV>
</DIV>
-----
becomes
-----
<div>
<div>
tags for inner block must be indented.
</div>
</div>
<p><DIV>
<DIV>
TAGS FOR INNER BLOCK MUST BE INDENTED.
</DIV>
</DIV></p>
-----
What is the correct way to parse these examples? Should I aim to produce
the same output as the Perl implementation in all cases?
I'm not entirely sure what purpose the 2nd regex in _HashHTMLBlocks
serves (block_tags_b): I can't find a reference to that type of syntax in
the syntax document. Why is the tag list different from block_tags_a? It
strikes me that perhaps the block_tags_b regex shouldn't match over
multiple lines.
In my line based parser, to match the same way as the Perl parser I'd
have to backtrack when I didn't find a valid end tag before the end of
the document and then sweep up with the same logic as the block_tags_b
regex.
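The backtracking described above can be avoided by remembering a fallback while scanning forward. What follows is only a rough sketch in Python rather than Scheme, and it is not a port of _HashHTMLBlocks. The tag list is a shortened, hypothetical stand-in for block_tags_a, the function name find_block_end is made up, and the two closing patterns are assumptions based on the behaviour shown in the examples earlier in this message.

```python
import re

# Hypothetical, shortened tag list; the real block_tags_a is longer.
BLOCK_TAGS = ("div", "p", "pre", "table", "blockquote")
OPEN_RE = re.compile(r"^<(%s)\b" % "|".join(BLOCK_TAGS))


def find_block_end(lines, start):
    """Return the index of the line that closes the block opened at
    lines[start], or None if lines[start] does not open a block or no
    closing tag is found. Lines are assumed to have no trailing newline."""
    m = OPEN_RE.match(lines[start])
    if m is None:
        return None
    tag = re.escape(m.group(1))
    # Strict form: the closing tag sits alone at the left margin.
    strict = re.compile(r"^</%s>[ \t]*$" % tag)
    # Loose form: the closing tag merely ends a line.
    loose = re.compile(r"</%s>[ \t]*$" % tag)
    fallback = None
    for i in range(start, len(lines)):
        if i > start and strict.match(lines[i]):
            return i
        if fallback is None and loose.search(lines[i]):
            fallback = i  # remember the first loose match, keep scanning
    return fallback


malformed = [
    "<div>",
    "what happens when we have a <div>nested block</div> and",
    "then the <div>nested block</div>",
    "ends at the end of a line,",
    "but no proper end tag?",
]
nested = ["<div>", "<div>", "Test nested HTML without indents", "</div>", "</div>"]
print(find_block_end(malformed, 0))  # index of the line ending in </div>
print(find_block_end(nested, 0))     # index of the first </div> at the margin
```

A single forward pass is enough because the first loosely matching line is recorded as it goes by and is only used if no strict closing line turns up before the end of the document.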
I've attached the test cases that I've thought of so far.
I felt inclined to build up the SXML tree by parsing the original
document, rather than transforming the original into XHTML and then
parsing that into SXML at the end, because if I can detect the features
myself then I don't need to handle escaping and encoding in the parser.
SXML data structures are escaped and encoded when they are finally rendered.
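To illustrate why building the tree directly sidesteps escaping in the parser, here is a minimal sketch in Python rather than Scheme. The tuple layout only loosely imitates SXML (attributes are omitted), and render is a hypothetical name, not part of any SXML library.

```python
from html import escape


def render(node):
    """Serialise a nested (tag, child, ...) tuple to markup.
    Text is escaped here, at render time, and nowhere else."""
    if isinstance(node, str):
        return escape(node, quote=False)
    tag, *children = node
    body = "".join(render(child) for child in children)
    return "<%s>%s</%s>" % (tag, body, tag)


# The parser stores raw text; it never needs to know about entities.
tree = ("p", "AT&T says 1 < 2 in ", ("em", "emphasis"))
print(render(tree))
```

Raw inline HTML blocks would need a separate node type that is passed through unescaped; that is left out of this sketch.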
Many thanks for any guidance you can offer.
Regards,
@ndy
--
andyjpb at ashurst.eu.org
http://www.ashurst.eu.org/
0x7EBA75FF
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: html.md
Url:
<http://six.pairlist.net/pipermail/markdown-discuss/attachments/20111130/06ddd578/attachment.ksh>
On Wed, Nov 30, 2011 at 09:27, Andy Bennett <andyjpb at ashurst.eu.org> wrote:
>
> What is the correct way to parse these examples? Should I aim to produce
> the same output as the Perl implementation in all cases?

You've discovered the first and second rules of creating your own
Markdown implementation: there isn't a singular standard or correct
implementation. Every implementation differs from every other
implementation in some way. They resolve the edge cases and bugs in
different ways and they add new, sometimes conflicting features (e.g.
table or footnote implementations) and parsing methods.

And Markdown as an idea, I think, is far past the point of achieving
such a standard. Individuals have taken on the task of creating
structured, unambiguous grammars, but there are too many implementations
that wouldn't fit or would be broken to adopt a single grammar. Plus,
consolidating everything would be impossible due to conflicting
features.

The "correct" way to parse the given examples is essentially however you
want them to be parsed. Make a decision about how you think it should be
done or how you think your users will want it to behave. Document that
decision; try to be consistent.

It _may_ be easier to just handle the necessary escapes from within your
parser. That could also give you the option of changing Markdown
implementations as desired.

--
arno s hautala    /-|   arno at alum.wpi.edu

pgp b2c9d448
On Wed, Nov 30, 2011 at 9:27 AM, Andy Bennett <andyjpb at ashurst.eu.org> wrote:
> I'm writing a Markdown Parser in Scheme by porting bits of Markdown.pl.
>
[clip]
>
> There seems to be a discrepancy between the "Markdown: Syntax" document
> and the implementation in _HashHTMLBlocks.

I suspect this post [1] by Gruber himself in the list archive will shed
some light on your conundrum. The issue has come up numerous times
since, but that is the latest response I could find by JG on the
subject.

The point is, when you find a conflict between the documentation and the
implementation - the documentation rules. However, when the
documentation is silent, most of us rely on the implementation as a
guide.

Personally, what I find helpful is the existing test suite. Some of the
examples in there shed light on the intended behavior. It doesn't hurt
to run the test suites from other implementations as well.

If you haven't already, you might want to run your test cases through
babelmark [2] and see what results you get. Sometimes when I can't find
an existing test and no specific documentation on an edge case, I go
with the most common behavior among implementations on babelmark.
Although, be aware that some of those implementations are a little
outdated.

[1]: http://six.pairlist.net/pipermail/markdown-discuss/2008-February/001001.html
[2]: http://babelmark.bobtfish.net/

--
----
\X/ /-\ `/ |_ /-\ |\|
Waylan Limberg
+++ Andy Bennett [Nov 30 11 14:27 ]:
> Furthermore, the syntax document does not mandate the user to indent the
> block contents, although the example implies it:
>
> -----
> <div>
> <div>
> Test nested HTML without indents
> </div>
> </div>
> -----
> becomes
> -----
> <div>
> <div>
> Test nested HTML without indents
> </div>
>
> <p></div></p>

Note that John Gruber released a beta version of Markdown that fixes
this bug (I believe it uses perl's Text::Balanced module). You can find
it by searching the list.

    % Markdown.pl --version

    This is Markdown, version 1.0.2b8.
    Copyright 2004 John Gruber
    http://daringfireball.net/projects/markdown/

    % Markdown.pl
    <div>
    <div>
    Test nested HTML
    </div>
    </div>
    ^D
    <div>
    <div>
    Test nested HTML
    </div>
    </div>

Have you considered using a PEG instead of regexes? There are PEGs for
markdown, and there seems to be a nice PEG generator for scheme:

http://planet.plt-scheme.org/display.ss?package=peg.plt&owner=kazzmir

John
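
For a line-based parser, one way to get the balanced behaviour of the nested <div> case above is to count tag depth. The sketch below is in Python and makes no claim to reproduce what Text::Balanced or the 1.0.2b8 beta actually does. The name find_balanced_end is hypothetical, and the function assumes lines[start] contains the opening tag.

```python
import re


def find_balanced_end(lines, start, tag):
    """Return the index of the line on which the element opened at
    lines[start] is closed, counting nested elements of the same tag.
    Returns None if the tags never balance."""
    open_re = re.compile(r"<%s\b" % re.escape(tag))
    close_re = re.compile(r"</%s>" % re.escape(tag))
    depth = 0
    for i in range(start, len(lines)):
        depth += len(open_re.findall(lines[i]))
        depth -= len(close_re.findall(lines[i]))
        if depth <= 0:
            return i
    return None


nested = ["<div>", "<div>", "Test nested HTML", "</div>", "</div>"]
print(find_balanced_end(nested, 0, "div"))  # index of the outer </div>
```

With this approach an outer tag that is never closed yields None, so the caller can decide whether to treat the rest as ordinary paragraphs.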