Eric Wong
2009-Sep-05 21:50 UTC
[Mongrel-development] merging Unicorn HTTP parser back to Mongrel
Hello,

(ok, this email got longer than expected; I now consider the most important parts to be the first and last paragraphs of the last footnote)

The Unicorn HTTP parser is feature-complete as far as I can tell and supports things the Mongrel one does not. I would very much like to see it used in places that Unicorn isn't suited for[1]. In fact, a chunk of the new features are much better suited to a server with better slow-client handling, like Mongrel.

The big roadblock to getting this back into Mongrel is the Java/JRuby version of the parser Mongrel uses. Simply put, I don't do Java; somebody else will have to port it. But I'll also have to convince you that these features are worth going into Mongrel, too :)

I could provide a standalone C parser that can be wrapped with FFI, but I'm not sure the performance would be acceptable. I'm fairly certain that a pure-Ruby version with Ragel-generated code would not provide acceptable performance anywhere; maybe a hand-coded one could, but I'm not particularly excited about writing that...

The MRI-C parser should just work on Win32. Unlike the rest of Unicorn, the HTTP parser remains portable to non-UNIX platforms and is thread-safe. It makes no system calls directly (only memory allocations through the Ruby C API).

New features that aren't in Mongrel:

* HTTP/0.9 support - blame a network BOFH hell-bent on saving bytes with a health-checker config for this :)  The HttpParser#headers? method has been added to determine whether headers should be sent in the response at all (HTTP/0.9 didn't have response headers); see the response sketch after this list.

* "Transfer-Encoding: chunked" request decoding - I've been told mobile devices[2] do uploads like this (since they may lack the storage capacity to buffer large files locally). This will be useful to Mongrel since Mongrel handles slow clients (such as mobile devices) better. I also have a use case that goes like this:

    tar zc $BIG_DIRECTORY | curl -T- http://unicorn/path/to/upload

  The decoder is designed to be slurp-resistant, so clients cannot control the memory usage of the server and DoS it, even with huge chunk sizes. (A streaming sketch follows this list.)

* Trailers support (with "Transfer-Encoding: chunked") - I haven't run across applications that use this yet (Amazon S3 maybe?), but one use case I can foresee is generating a Content-MD5 trailer with the above "tar | curl" command.

* Multiline continuation headers - Pound sends them. I don't care for Pound, but I figured I might as well handle them in case somebody else starts sending them...

* Absolute request-URI parsing - this was done with URI.parse originally; I figured I might as well do it in Ragel since it's part of RFC 2616. I think client-side proxies use it, so maybe one day somebody can turn Mongrel or a derived server into a client-side HTTP proxy...

* Repeated headers handling - they're now joined with commas, since Rack doesn't accept arrays in HTTP_* entries (illustrated after this list). I posted a standalone patch for this in <20090810001022.GA17572 at dcvr.yhbt.net>

* HttpParser#keepalive? method - the parser can tell you whether it's safe to handle a keepalive request. Not used by Unicorn at the moment.

Chunk extensions are one thing the parser currently just ignores; I've yet to see them used anywhere, and Rack does not mention them.
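To make the streaming use case concrete, here is a minimal sketch of a Rack application that digests an upload as it arrives. It is illustrative only and assumes nothing beyond the Rack API plus the at-most-N read semantics described in footnote [2] below; the UploadDigester name and the 16K read size are made up for the example:

  require 'digest/md5'

  # hypothetical app: hash the request body as it streams in,
  # holding at most one buffer's worth of data in memory
  class UploadDigester
    def call(env)
      md5 = Digest::MD5.new
      input = env["rack.input"]

      # read(n) may return fewer than n bytes, and returns nil at EOF,
      # so loop until the body is exhausted
      while buf = input.read(16384)
        md5.update(buf)
      end

      [ 200, { "Content-Type" => "text/plain" }, [ md5.hexdigest ] ]
    end
  end

With the "tar | curl" pipe above, something like this starts hashing while the upload is still in flight instead of waiting for a Tempfile to fill up first.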
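For the repeated-headers change, the effect on the Rack env looks roughly like this (addresses are made up; the comma-joining is the point):

  # a request carrying:
  #   X-Forwarded-For: 192.0.2.1
  #   X-Forwarded-For: 192.0.2.2
  # shows up in the env as a single comma-joined value:
  env["HTTP_X_FORWARDED_FOR"] # => "192.0.2.1,192.0.2.2"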
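And here is a rough sketch of how a server might use the two new predicates when writing a response. Everything in it except HttpParser#headers? and HttpParser#keepalive? (both in the RDoc linked below) is assumed for illustration: `parser` is the HttpParser that parsed the request, `socket` is the client connection, and the response parts are pre-rendered strings.

  def write_response(parser, socket, status_line, header_block, body)
    if parser.headers?            # false for an HTTP/0.9 request,
      socket.write(status_line)   # which predates response headers
      socket.write(header_block)
    end
    socket.write(body)

    # reuse the connection only when the parser says it is safe
    socket.close unless parser.keepalive?
  end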
Parser Limits:

Request body: the maximum Content-Length is the maximum value of off_t. I don't think this should be a problem for anyone, as Ruby defaults to _FILE_OFFSET_BITS=64 even on 32-bit arches. Mongrel does not have this limit in the parser, but since it buffers large uploads to a Tempfile, the limit effectively existed anyway. The maximum chunk size is also the maximum value of off_t, which is usually a 64-bit integer (again, since Ruby defaults to _FILE_OFFSET_BITS=64 on my 32-bit boxes). I don't expect valid clients to send values anywhere close to this limit, but that's what it is.

Headers: mostly the same as Mongrel; all headers must fit into the same <=112K string object, which shouldn't be a problem for anything capable of running Ruby. Continuation lines can bypass the per-header size limit, but everything still stays under 112K, which is a pretty large limit.

Trailers: these fit into another <=112K string; space taken up during header processing doesn't count against trailer processing, so you could end up with 224K of combined metadata.

You can get a full changelog since I branched from fauna/mongrel via:

  git log v0.0.0.. -- ext

Finally, the new API is documented via RDoc here:

  http://unicorn.bogomips.org/Unicorn/HttpParser.html

I don't consider the API set in stone, but I do consider the header-handling part a bit simpler and less error-prone than the old one.

Disclaimer: due to the large amount of changes to the C/Ragel portions, another security audit/pair of eyes would be nice. All use of Unicorn so far has been on LANs with trusted clients or with nginx in front. While I'm very comfortable with C and fairly comfortable with Ragel, I'm far from infallible, so close review from a second pair of eyes would be greatly appreciated.

Future: I'm also planning on porting this to Rubinius. I haven't had a chance to look at it yet, but the Mongrel/C parser has already been ported, so it shouldn't be too hard (I only know/can stomach a small amount of C++, though I suspect I won't even need it...)

Footnotes:

[1] - Comet/long-polling/reverse HTTP, and sites that rely heavily on external services (including OpenID), are all badly suited to Unicorn.

[2] - As a side effect, Unicorn also uses a TeeInput class that allows the request body to be read in real time within the Rack application (while "tee-ing" it to a temporary file to provide rewindability). This also allows Mongrel Upload Progress to be implemented in the future in a Rack::Lint-compliant manner. The one weird thing about TeeInput is that:

    env["rack.input"].read(NR_BYTES)

is not guaranteed to return NR_BYTES, only NR_BYTES at most. So every #read can provide "last block" semantics (hence the read loop in the digest sketch above). Rack does not mandate full-length reads, so this should be fine. It should not be a problem in practice, since most read() and read()-like APIs provide no such guarantee (even if one is implied when reading from "fast" devices like the filesystem), and CGI apps that get a socket as stdin see similar semantics to what apps under Unicorn get. I imagine this feature will be hugely useful for slow mobile clients that stream data slowly, as it allows the server to start processing data while it is still being uploaded.

-- 
Eric Wong