thr3ads.net - Markdown Discuss - Inline HTML legalities [Nov 2011]

Andy Bennett

2011-Nov-30 14:27 UTC

Inline HTML legalities

Hi,

I'm writing a Markdown Parser in Scheme by porting bits of Markdown.pl.

As you're probably aware, the Perl version massages the file into the
final output with a number of regexes. In my version I'm trying to use
the regexes to detect the starts and ends of the features and then take
specific action to emit the final representation. I'm doing it this way
because I want to emit an SXML tree data structure rather than an opaque
string.

I read the source file into a list of lines and can run regexes on
individual lines.

I can successfully extract link references.

I'm currently trying to extract inline HTML, basing my code on regexes
from _HashHTMLBlocks.

I can detect the start of an inline block:
"^<(block-tags-a\b.*

...but I'm having trouble detecting the end of the blocks in the same
way as the Perl version.
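
As a rough illustration of the line-based approach described above, here is a minimal sketch in Python rather than Scheme. It assumes the document is already a list of lines without trailing newlines. The tag list is a shortened, hypothetical one, and `find_strict_block` is an invented name; the end-tag rule only approximates "end tag at the left margin", it is not a port of the Perl regex.

```python
import re

# Hypothetical, shortened tag list for illustration only.
BLOCK_TAGS = ("p", "div", "blockquote", "pre", "table")
START_RE = re.compile(r"^<(%s)\b" % "|".join(BLOCK_TAGS))


def find_strict_block(lines, start):
    """Return (start, end) line indices of an HTML block, or None.

    The block opens when lines[start] begins with a known tag at the left
    margin, and closes at the first later line that consists of the
    matching end tag at the left margin (trailing spaces/tabs allowed).
    """
    m = START_RE.match(lines[start])
    if m is None:
        return None
    end_re = re.compile(r"^</%s>[ \t]*$" % re.escape(m.group(1)))
    for i in range(start, len(lines)):
        if end_re.match(lines[i]):
            return (start, i)
    return None
```

Because the search stops at the first unindented end tag, an unindented nested block ends early under this rule, while an indented inner block is skipped over.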


There seems to be a discrepancy between the "Markdown: Syntax" document
and the implementation in _HashHTMLBlocks.

The syntax document says "...and the start and end tags of the block
should not be indented with tabs or spaces."
Whilst this is true for the first regex in _HashHTMLBlocks, the second
regex (block_tags_b) will sweep up malformed entries like so:


-----
<div>
 what happens when we have a <div>nested block</div> and
 then the <div>nested block</div>
 ends at the end of a line,
 but no proper end tag?
-----
becomes
-----
<div>
 what happens when we have a <div>nested block</div> and
 then the <div>nested block</div>

<p>ends at the end of a line,
 but no proper end tag?</p>
-----

i.e. the trailing </div> is detected as the end of the block.

If that block is not at the end of the file and is followed later by a
properly formed block with the same tag, then everything between the
first opening tag and the first properly formed closing tag will get
entirely consumed... which is correct per the Syntax document.


Furthermore, the syntax document does not mandate the user to indent the
block contents, although the example implies it:

-----
<div>
<div>
Test nested HTML without indents
</div>
</div>
-----
becomes
-----
<div>
<div>
Test nested HTML without indents
</div>

<p></div></p>
-----


Finally, capitalised tag names appear to get wrapped in <p>s:
-----
<div>
	<div>
	tags for inner block must be indented.
	</div>
</div>

<DIV>
	<DIV>
	TAGS FOR INNER BLOCK MUST BE INDENTED.
	</DIV>
</DIV>
-----
becomes
-----
<div>
    <div>
    tags for inner block must be indented.
    </div>
</div>

<p><DIV>
    <DIV>
    TAGS FOR INNER BLOCK MUST BE INDENTED.
    </DIV>
</DIV></p>
-----
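
The difference between the two halves of this example is consistent with a case-sensitive match on the tag name. A minimal Python sketch of that distinction follows; the two-tag list and the name `opens_block` are hypothetical, and nothing here is taken from the Perl source.

```python
import re

# Hypothetical two-tag list, compiled with and without case folding.
START_CASE_SENSITIVE = re.compile(r"^<(div|p)\b")
START_CASE_INSENSITIVE = re.compile(r"^<(div|p)\b", re.IGNORECASE)


def opens_block(line, ignore_case=False):
    """Return True if the line starts with a known block tag at the margin."""
    pattern = START_CASE_INSENSITIVE if ignore_case else START_CASE_SENSITIVE
    return pattern.match(line) is not None
```

With `ignore_case=False`, `<DIV>` is not recognised as a block start and would fall through to paragraph handling. With `ignore_case=True` both spellings are treated alike, which is in line with HTML tag names being case-insensitive.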


What is the correct way to parse these examples? Should I aim to produce
the same output as the Perl implementation in all cases?

I'm not entirely sure what purpose the 2nd regex in _HashHTMLBlocks
serves (block_tags_b): I can't find reference to that type of syntax in
the syntax document. Why is the tag list different from block_tags_a? It
strikes me that perhaps the block_tags_b regex shouldn't match over
multiple lines.

In my line based parser, to match the same way as the Perl parser I'd
have to backtrack when I didn't find a valid end tag before the end of
the document and then sweep up with the same logic as the block_tags_b
regex.
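
One way to picture that backtracking in a line-based parser is a two-pass search: look for a strict end tag first, and only if the end of the document is reached without one, rescan from the start tag with a looser rule. The Python sketch below assumes a list of lines without trailing newlines. `find_block` and both end-tag patterns are illustrative approximations, not ports of the two Perl regexes.

```python
import re


def find_block(lines, start, tag="div"):
    """Return (start, end, kind) for the block opened at lines[start], or None.

    Pass 1 ("strict"): the end tag must sit at the left margin.
    Pass 2 ("liberal"): the end tag may appear anywhere, as long as it is
    the last thing on its line. Pass 2 only runs if pass 1 reached the end
    of the document without a match, which is the backtracking step.
    """
    if not re.match(r"^<%s\b" % re.escape(tag), lines[start]):
        return None
    strict_end = re.compile(r"^</%s>[ \t]*$" % re.escape(tag))
    liberal_end = re.compile(r"</%s>[ \t]*$" % re.escape(tag))
    for end_re, kind in ((strict_end, "strict"), (liberal_end, "liberal")):
        for i in range(start, len(lines)):
            if end_re.search(lines[i]):
                return (start, i, kind)
    return None
```

Run on the first malformed example from this message, the strict pass fails and the liberal pass stops at the line ending in `</div>`, so the remaining two lines are left for paragraph handling.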


I've attached the test cases that I've thought of so far.




I felt inclined to build up the SXML tree by parsing the original
document, rather than transforming the original into XHTML and then
parsing that into SXML at the end, because if I can detect the features
myself then I don't need to handle escaping and encoding in the parser.
SXML data structures are escaped and encoded when they are finally rendered.
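
To illustrate the point about deferring escaping until rendering, here is a small Python sketch that uses nested tuples as a stand-in for an SXML tree. The tuple layout and the `render` function are assumptions made for this example (attributes are left out entirely); it is not how any SXML library works.

```python
from html import escape


def render(node):
    """Serialise a tree of (tag, child, ...) tuples and plain strings.

    Text is stored raw in the tree and escaped only here, at output time.
    """
    if isinstance(node, str):
        return escape(node, quote=False)
    tag, *children = node
    return "<%s>%s</%s>" % (tag, "".join(render(c) for c in children), tag)


tree = ("p", "AT&T says 1 < 2, see ", ("em", "here"))
print(render(tree))  # <p>AT&amp;T says 1 &lt; 2, see <em>here</em></p>
```

The parser only ever appends raw strings and child nodes to the tree, so it never needs to know which characters are special in the output format.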



Many thanks for any guidance you can offer.




Regards,
@ndy

-- 
andyjpb at ashurst.eu.org
http://www.ashurst.eu.org/
0x7EBA75FF

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: html.md
Url:
<http://six.pairlist.net/pipermail/markdown-discuss/attachments/20111130/06ddd578/attachment.ksh>

Arno Hautala

2011-Nov-30 15:02 UTC

On Wed, Nov 30, 2011 at 09:27, Andy Bennett <andyjpb at ashurst.eu.org> wrote:
> What is the correct way to parse these examples? Should I aim to produce
> the same output as the Perl implementation in all cases?

You've discovered the first and second rules of creating your own
Markdown implementation: there isn't a singular standard or correct
implementation. Every implementation differs from every other
implementation in some way. They resolve the edge cases and bugs in
different ways and they add new, sometimes conflicting features (e.g.
table or footnote implementations) and parsing methods.

And Markdown as an idea, I think, is far past the point of achieving
such a standard. Individuals have taken the task of creating
structured, unambiguous grammars, but there are too many
implementations that wouldn't fit or would be broken to adopt a single
grammar. Plus, consolidating everything would be impossible due to
conflicting features.

The "correct" way to parse the given examples is essentially however
you want them to be parsed. Make a decision about how you think it
should be done or how you think your users will want it to behave.
Document that decision; try to be consistent.

It _may_ be easier to just handle the necessary escapes from within
your parser. That could also give you the option of changing Markdown
implementations as desired.

-- 
arno s hautala  /-|  arno at alum.wpi.edu

pgp b2c9d448

Waylan Limberg

2011-Nov-30 15:55 UTC

On Wed, Nov 30, 2011 at 9:27 AM, Andy Bennett <andyjpb at ashurst.eu.org> wrote:
> I'm writing a Markdown Parser in Scheme by porting bits of Markdown.pl.
>
[clip]
>
> There seems to be a discrepancy between the "Markdown: Syntax" document
> and the implementation in _HashHTMLBlocks.

I suspect this post [1] by Gruber himself in the list archive will
shed some light on your conundrum. The issue has come up numerous
times since, but that is the latest response I could find by JG on the
subject.

The point is, when you find a conflict between the documentation and
the implementation - the documentation rules. However, when the
documentation is silent, most of us rely on the implementation as a
guide.

Personally, what I find helpful is the existing test suite. Some of
the examples in there shed light on the intended behavior. It doesn't
hurt to run the test suites from other implementations as well.

If you haven't already, you might want to run your test cases through
babelmark [2] and see what results you get. Sometimes when I can't
find an existing test and no specific documentation on an edge case, I
go with the most common behavior among implementations on babelmark.
Although, be aware that some of those implementations are a little
outdated.

[1]:
http://six.pairlist.net/pipermail/markdown-discuss/2008-February/001001.html
[2]: http://babelmark.bobtfish.net/

-- 
----
\X/ /-\ `/ |_ /-\ |\|
Waylan Limberg

John MacFarlane

2011-Nov-30 17:28 UTC

+++ Andy Bennett [Nov 30 11 14:27 ]:
> Furthermore, the syntax document does not mandate the user to indent the
> block contents, although the example implies it:
> 
> -----
> <div>
> <div>
> Test nested HTML without indents
> </div>
> </div>
> -----
> becomes
> -----
> <div>
> <div>
> Test nested HTML without indents
> </div>
> 
> <p></div></p>

Note that John Gruber released a beta version of Markdown that
fixes this bug (I believe it uses Perl's Text::Balanced module).
You can find it by searching the list.

    % Markdown.pl --version

    This is Markdown, version 1.0.2b8.
    Copyright 2004 John Gruber
    http://daringfireball.net/projects/markdown/

    % Markdown.pl
    <div>
    <div>
    Test nested HTML
    </div>
    </div>
    ^D
    <div>
    <div>
    Test nested HTML
    </div>
    </div>
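
For a line-based parser, one simple way to get the balanced behaviour shown in this transcript is to count opening and closing tags. The Python sketch below is only a depth counter under stated assumptions: `find_balanced_end` is a hypothetical name, the input is a list of lines, and self-closing tags, comments and tags split across lines are not handled. It is not a description of how the beta or Text::Balanced works.

```python
import re


def find_balanced_end(lines, start, tag="div"):
    """Return the index of the line where the tag opened at lines[start]
    is closed, counting nested open/close tags, or None if it never is."""
    token_re = re.compile(r"<(/?)%s\b[^>]*>" % re.escape(tag))
    depth = 0
    for i in range(start, len(lines)):
        for m in token_re.finditer(lines[i]):
            # group(1) is "/" for a closing tag and "" for an opening tag.
            depth += -1 if m.group(1) else 1
            if depth == 0:
                return i
    return None
```

On the five-line nested example in the transcript this returns the index of the final `</div>`, so the whole outer block is treated as one unit regardless of indentation.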

Have you considered using a PEG instead of regexes?  There are PEGs
for markdown, and there seems to be a nice PEG generator for scheme:
http://planet.plt-scheme.org/display.ss?package=peg.plt&owner=kazzmir

John
