This is a new release for PHP Markdown, following Markdown.pl 1.0.2b7 from a few weeks ago. It fixes the same bugs, and some more; it also introduces more radical backend changes. It can be downloaded here:

<http://www.michelf.com/docs/projets/php-markdown-1.0.2b7.zip>

and you can test it on the PHP Markdown Dingus:

<http://www.michelf.com/projects/php-markdown/dingus/>

This version inaugurates the truly extensible version of PHP Markdown, which should make it a lot easier to write extensions, like my own PHP Markdown Extra. If you want to create your own extension, I suggest taking a look at the code for the new PHP Markdown Extra I'll release in a few minutes. The most interesting part is probably the constructor for the MarkdownExtra_Parser class.

Another big change is the automatic hashing of all Markdown-generated HTML content. Previous versions of PHP Markdown Extra were already doing this, but it was limited to block-level elements and was done to make fewer calls to the expensive HTML block parser. This has been ported to the more basic PHP Markdown, and in addition to hashing block-level content it now also hashes span-level elements: this has the benefit of preventing bad nesting of elements, so something like this:

    *Some **strange* emphasis**

will now give valid HTML:

    *Some <strong>strange* emphasis</strong>

It should be noted however that fixing this introduced other nesting problems -- like being able to put a [link [inside](#) a link](#). These problems will be addressed in a future release.

Original improvements in PHP Markdown 1.0.2b7:

*   Changed span and block gamut methods so that they loop over a customizable list of methods. This makes subclassing the parser a more interesting option for creating syntax extensions.

*   Also added a "document" gamut loop which can be used to hook document-level methods (like for stripping link definitions).

*   Changed all methods which were inserting HTML code so that they now return a hashed representation of the code. New methods `hashSpan` and `hashBlock` are used to hash span- and block-level generated content respectively. This has a couple of significant effects:

    1.  It prevents invalid nesting of Markdown-generated elements which could occur with constructs like `*something [link*] [1]`.
    2.  It prevents problems occurring with deeply nested lists on which paragraphs were ill-formed.
    3.  It removes the need to call `hashHTMLBlocks` twice during the block gamut.

    Hashes are turned back to HTML prior to output.

*   Made the block-level HTML parser smarter using a specially-crafted regular expression capable of handling nested tags.

*   Solved backtick issues in tag attributes by rewriting the HTML tokenizer to be aware of code spans. All these lines should work correctly now:

        <span attr='`ticks`'>bar</span>
        <span attr='``double ticks``'>bar</span>
        `<test a="` content of attribute `">`

*   Changed the parsing of HTML comments to match simply from `<!--` to `-->` instead of using the more complicated SGML-style rule with paired `--`. This is how most browsers parse comments and how XML defines them too.

*   `<address>` has been added to the list of block-level elements and is no longer being incorrectly wrapped within paragraph tags.

Improvements borrowed from Markdown.pl:

*   Now only trim trailing newlines from code blocks, instead of trimming all trailing whitespace characters.

*   Fixed bug where this:

        [text](http://m.com "title" )

    wasn't working as expected, because the parser wasn't allowing for spaces before the closing paren.

*   Filthy hack to support markdown='1' in div tags.

*   _DoAutoLinks() now supports the 'dict://' URL scheme.

*   PHP- and ASP-style processor instructions are now protected as raw HTML blocks.

        <? ... ?>
        <% ... %>

*   Experimental support for [this] as a synonym for [this][].

*   Fix for escaped backticks still triggering code spans:

        There are two raw backticks here: \` and here: \`, not a code span

Michel Fortin
michel.fortin@michelf.com
http://www.michelf.com/
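The span hashing described in the announcement can be illustrated with a small Python sketch. This is a toy model, not PHP Markdown's code: the function names `transform` and `emit`, the placeholder format, and the two simplified emphasis patterns are all assumptions made for illustration. It only shows why replacing generated HTML with an opaque token keeps a later pass from pairing delimiters across an element boundary.

```python
import re

def transform(text, hashing=True):
    """Toy two-pass emphasis converter (hypothetical, for illustration only)."""
    hashes = {}

    def emit(html):
        # With hashing on, store the generated HTML and hand back an opaque
        # placeholder that contains no Markdown delimiters.
        if not hashing:
            return html
        key = "\x1a%d\x1a" % len(hashes)
        hashes[key] = html
        return key

    # Pass 1: strong emphasis. Pass 2: regular emphasis.
    text = re.sub(r"\*\*(.+?)\*\*",
                  lambda m: emit("<strong>%s</strong>" % m.group(1)), text)
    text = re.sub(r"\*(.+?)\*",
                  lambda m: emit("<em>%s</em>" % m.group(1)), text)

    # Turn placeholders back into HTML prior to output.
    return re.sub("\x1a\\d+\x1a", lambda m: hashes[m.group(0)], text)

source = "*Some **strange* emphasis**"
print(transform(source))                 # hashed: the asterisk inside <strong> stays hidden
print(transform(source, hashing=False))  # unhashed: <em> and <strong> overlap
```

With hashing enabled the sketch reproduces the output quoted in the announcement; with it disabled, the second pass pairs the leading asterisk with the one inside the `<strong>` element and produces overlapping tags.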
Michel Fortin <michel.fortin@michelf.com> wrote on 9/16/06 at 5:23 PM:

> Another big change is the automatic hashing of all Markdown-generated
> HTML content. Previous versions of PHP Markdown Extra were already
> doing this, but it was limited to block-level elements and was
> done to make fewer calls to the expensive HTML block parser.
> This has been ported to the more basic PHP Markdown, and in addition
> to hashing block-level content it now also hashes span-level elements:
> this has the benefit of preventing bad nesting of elements, so
> something like this:
>
>     *Some **strange* emphasis**
>
> will now give valid HTML:
>
>     *Some <strong>strange* emphasis</strong>

That's interesting, and because the output is valid, it's probably better than what Markdown.pl currently generates.

I've been thinking that a better solution for input like that would be to generate markup like this:

    <em>Some <strong>strange</strong></em><strong> emphasis</strong>

Which is more of a "do what I mean" solution. However, I've given no thought whatsoever to how this would be done algorithmically.

> * Made the block-level HTML parser smarter using a
>   specially-crafted regular expression capable of handling
>   nested tags.

A single pattern that matches nested tags?!

    $me == "downloading now";

-J.G.
Michel Fortin <michel.fortin@michelf.com> wrote on 9/16/06 at 5:23 PM:

> * Changed the parsing of HTML comments to match simply from
>   `<!--` to `-->` instead of using the more complicated
>   SGML-style rule with paired `--`. This is how most browsers
>   parse comments and how XML defines them too.

Interesting. I had no idea that SGML comment rules were being officially or semi-officially abandoned for HTML parsers. I certainly welcome this change.

This page, and the included tests, seem like a good resource:

<http://www.howtocreate.co.uk/SGMLComments.html>

Test 4 is interesting:

<http://www.howtocreate.co.uk/sgml/test4.html>

The test is of this:

    <!-- -- -->bar<!-- -- -->

SGML-compliant browsers will treat that not as two comments with "bar" in the middle, but instead as a comment tag containing three different comments:

    -- --
    -->bar<!--
    -- --

Safari 2.0.4 (419.3) treats it the simple (obvious) way, as two comments with "bar" in the middle.

OmniWeb 5.5 treats it the SGML way. (I hate the way OmniWeb overrides WebKit in so many edge cases, but that's another story.)

Firefox 1.5.0.6 and Camino 1.0.1 treat it the SGML way. I also downloaded a nightly build of Firefox, and it still treats it the SGML way.

Opera 9.0 treats it the simple, obvious way.

* * *

I'd like to make this change to Markdown, because I genuinely believe HTML comments *should* follow the simple, obvious rule, not the SGML rules. But I'd feel better about making this change in Markdown if Gecko were on board.

-J.G.
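The difference between the two readings of Test 4 can be reproduced with a short Python sketch. The function names are hypothetical, and the SGML-style scanner is a simplified assumption (a declaration opened by `<!`, holding `--`-delimited comments separated by whitespace and closed by `>`); it is not any browser's or Markdown's actual parser, and it simply swallows the rest of the input when the declaration is malformed or unterminated.

```python
import re

# Simple rule: a comment runs from "<!--" to the first following "-->".
SIMPLE_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)

def strip_simple(html):
    return SIMPLE_COMMENT.sub("", html)

def scan_sgml_declaration(html, start):
    """Scan an SGML-style comment declaration beginning at html[start:] == "<!--...".

    Returns (comments, end), where `end` is the index just past the closing ">".
    Malformed or unterminated input consumes the rest of the string (an assumption).
    """
    comments = []
    pos = start + 2  # just past "<!"
    while pos < len(html):
        if html.startswith("--", pos):
            close = html.find("--", pos + 2)
            if close == -1:
                break
            comments.append(html[pos:close + 2])
            pos = close + 2
        elif html[pos].isspace():
            pos += 1
        elif html[pos] == ">":
            return comments, pos + 1
        else:
            break
    return comments, len(html)

def strip_sgml(html):
    out, pos = [], 0
    while True:
        start = html.find("<!--", pos)
        if start == -1:
            out.append(html[pos:])
            return "".join(out)
        out.append(html[pos:start])
        _, pos = scan_sgml_declaration(html, start)

test4 = "<!-- -- -->bar<!-- -- -->"
print(repr(strip_simple(test4)))          # text left after the simple rule
print(repr(strip_sgml(test4)))            # text left after the SGML-style scan
print(scan_sgml_declaration(test4, 0)[0]) # the comments found inside the declaration
```

Under the simple rule "bar" survives as visible text; under the SGML-style scan the whole string is one declaration containing three comments, so nothing is left.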
On 18/9/2006, at 1:30, John Gruber gruber-at-fedora.net |Markdown| wrote:

> [...] Interesting. I had no idea that SGML comment rules were being
> officially or semi-officially abandoned for HTML parsers. I
> certainly welcome this change.

The HTML specification sort of defines comments [1]:

> [...] A common error is to include a string of hyphens
> ("---") within a comment. Authors should avoid putting two
> or more adjacent hyphens inside comments.

So it is an error, and authors *should* avoid it, which I read as, don't do it because it gives problems, but it's not really illegal.

Of course another place [2] in the specification they write:

> Please consult the SGML standard for information about rules
> governing elements (e.g., they must be properly nested, an
> end tag closes, back to the matching start tag, all unclosed
> intervening start tags with omitted end tags (section
> 7.5.1), etc.).

I.e. HTML is an SGML application, and thus normal SGML rules apply, hinting that the bit about comments is really just trying to say that because of de facto standards, one should not expect full SGML comments to be supported.

I find the above (quoted) paragraph a bit ironic, considering that I have not found a browser which fully adheres to the rule they quote (about an end-tag closing back to the matching start tag).

[1] http://www.w3.org/TR/html4/intro/sgmltut.html#h-3.2.4
[2] http://www.w3.org/TR/html4/intro/sgmltut.html#h-3.2.1
On Sep 19, 2006, at 0:01, A. Pagaltzis wrote:

> * Michel Fortin <michel.fortin@michelf.com> [2006-09-18 03:55]:
>> Maybe Markdown should do something about double-hyphens inside
>> comments. This way, a user could comment out any piece of
>> Markdown text -- like this paragraph -- without having to
>> worry about the double hyphens it contains.
>
> FWIW, HTML Tidy converts hyphen sequences in comments into equals
> sign sequences.

How about entity-encoding the hyphens? Or will that not help wrt. the XML syntax? (The semantics of the text would at least remain the same... No?)

--
Magnus Lie Hetland
http://hetland.org
* Magnus Lie Hetland <magnus@hetland.org> [2006-09-19 09:10]:

> How about entity-encoding the hyphens? Or will that not help
> wrt. the XML syntax? (The semantics of the text would at least
> remain the same... No?)

It would make the comment legal, but it would not preserve semantics, no. Comment text is always verbatim; no entity (or other) parsing is performed on the text between comment markers.

I don't see why anyone would care about lossy munging, though. Does anyone roundtrip Markdown-generated HTML documents back to Markdown and if so, do they really need to have their hyphens preserved exactly? It seems to me that simply substituting a similar-enough-looking character is Good Enough.

Regards,
--
Aristotle Pagaltzis // <http://plasmasturm.org/>
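The lossy substitution discussed in this subthread could look roughly like the following Python sketch. It is only an illustration of the idea: `munge_comment_hyphens` is a hypothetical name, comments are located with the simple `<!--` to `-->` rule, and swapping hyphen runs for equals signs mimics the behaviour attributed to HTML Tidy above without claiming to match its actual algorithm.

```python
import re

# Locate comments with the simple rule and capture the body between the markers.
COMMENT = re.compile(r"<!--(.*?)-->", re.DOTALL)

def munge_comment_hyphens(html):
    """Replace runs of two or more hyphens inside comment bodies with '='."""
    def fix(match):
        body = re.sub(r"-{2,}", lambda run: "=" * len(run.group(0)), match.group(1))
        # A body ending in "-" would yield "--->", which XML also disallows.
        if body.endswith("-"):
            body = body[:-1] + "="
        return "<!--" + body + "-->"
    return COMMENT.sub(fix, html)

print(munge_comment_hyphens("Text -- kept <!-- Markdown text -- like this -- here -->"))
```

Hyphens outside comments are left alone, and the comment markers themselves are preserved; only the text between them is altered, which is exactly the lossy step being debated.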