Hi, by inspecting the Ragel grammar of the HTTP parser (coming from Mongrel) I''ve realized of some possible issues and bugs: hostname = (alnum | "-" | "." | "_")+; - It doesn''t allow IPv6. This is important IMHO. - It allows "_" which is an invalid symbol (not valid for a domain). I suggest: hostname = (alnum | "-" | "." | "[" | "]" | ":")+; ------------------ host_with_port = (hostname (":" digit*)?) >mark %host; - It allows something ugly as "mydomain.org:" I suggest: host_with_port = (hostname (":" digit{1,5})?) >mark %host; ------------------ message_header = ((field_name ":" " "* field_value)|value_cont) :> CRLF; - It doesn''t allow valid spaces before ":" as: Host : mydomain.org - Tabulators are also allowed. I suggest: message_header = ((field_name [ \t]* ":" [ \t]* field_value)|value_cont) :> CRLF; Regards. -- I?aki Baz Castillo <ibc at aliax.net>
I?aki Baz Castillo <ibc at aliax.net> wrote:> Hi, by inspecting the Ragel grammar of the HTTP parser (coming from > Mongrel) I''ve realized of some possible issues and bugs: > > > hostname = (alnum | "-" | "." | "_")+; > > - It doesn''t allow IPv6. This is important IMHO. > - It allows "_" which is an invalid symbol (not valid for a domain). > > I suggest: > hostname = (alnum | "-" | "." | "[" | "]" | ":")+;Hi I?aki, Underscore isn''t valid for hostnames, but it is allowed in domain names and most DNS servers will resolve them. I''ve personally seen websites with underscores in their domain names in the wild[1]. We''ll have to test the IPv6 addresses and probably split that out into a separate regexp since ":" would raise issues with the port number in existing cases. This is probably something for post-1.0.> ------------------ > > host_with_port = (hostname (":" digit*)?) >mark %host; > > - It allows something ugly as "mydomain.org:" > > I suggest: > host_with_port = (hostname (":" digit{1,5})?) >mark %host;It''s ugly, but section 3.2.2 of RFC 2396 appears to allows it.> ------------------ > > message_header = ((field_name ":" " "* field_value)|value_cont) :> CRLF; > > - It doesn''t allow valid spaces before ":" as: > Host : mydomain.orgSpaces before the ":" aren''t allowed in rfc2616, and I have yet to see evidence of clients sending headers like this in ~4 years of using this parser.> - Tabulators are also allowed. > > I suggest: > message_header = ((field_name [ \t]* ":" [ \t]* > field_value)|value_cont) :> CRLF;I just pushed this out to unicorn.git to allow horizontal tabs:>From 935912a422cabfd323f9b4ff268ded09a2ebe7a6 Mon Sep 17 00:00:00 2001From: Eric Wong <normalperson at yhbt.net> Date: Fri, 7 May 2010 18:20:49 +0000 Subject: [PATCH] http: allow horizontal tab as leading whitespace in header values MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit This is allowed by RFC 2616, section 2.2, where spaces and horizontal tabs are counted as linear white space and linear white space (not just regular spaces) may prefix field-values (section 4.2). This has _not_ been a real issue in ~4 years of using this parser (starting with Mongrel) with clients in the wild. Thanks to I?aki Baz Castillo for pointing this out. --- ext/unicorn_http/unicorn_http_common.rl | 2 +- test/unit/test_http_parser.rb | 8 ++++++++ 2 files changed, 9 insertions(+), 1 deletions(-) diff --git a/ext/unicorn_http/unicorn_http_common.rl b/ext/unicorn_http/unicorn_http_common.rl index 6fca604..f165e3f 100644 --- a/ext/unicorn_http/unicorn_http_common.rl +++ b/ext/unicorn_http/unicorn_http_common.rl @@ -54,7 +54,7 @@ value_cont = lws+ any* >start_value %write_cont_value; - message_header = ((field_name ":" " "* field_value)|value_cont) :> CRLF; + message_header = ((field_name ":" lws* field_value)|value_cont) :> CRLF; chunk_ext_val = token*; chunk_ext_name = token*; chunk_extension = ( ";" " "* chunk_ext_name ("=" chunk_ext_val)? )*; diff --git a/test/unit/test_http_parser.rb b/test/unit/test_http_parser.rb index 5b0ca9f..b7c8a1c 100644 --- a/test/unit/test_http_parser.rb +++ b/test/unit/test_http_parser.rb @@ -51,6 +51,14 @@ class HttpParserTest < Test::Unit::TestCase assert parser.keepalive? end + def test_tab_lws + parser = HttpParser.new + req = {} + tmp = "GET / HTTP/1.1\r\nHost:\tfoo.bar\r\n\r\n" + assert_equal req.object_id, parser.headers(req, tmp).object_id + assert_equal "foo.bar", req[''HTTP_HOST''] + end + def test_connection_close_no_ka parser = HttpParser.new req = {} -- [1] - and those were wild NSFW sites, of course :> -- Eric Wong
2010/5/7 Eric Wong <normalperson at yhbt.net>:> Underscore isn''t valid for hostnames, but it is allowed in domain names > and most DNS servers will resolve them. ?I''ve personally seen websites > with underscores in their domain names in the wild[1].Hi Eric, could you point me to the spec stating that underscore is allowed for a domain? In the past I''ve done a SIP parser [*] with Ragel, being 100% strict at BNF grammar, and note that SIP reuses 80% of the grammar of HTTP. I''m pretty sure that "_" is not valid in a domain (host, hostname or whatever). Anyhow it''s better just to allow it at parsing level :)> We''ll have to test the IPv6 addresses and probably split that out into a > separate regexp since ":" would raise issues with the port number in > existing cases. ?This is probably something for post-1.0.There is a IETF draft to improve and *fix* the existing BNF grammar for IPv6. It also improves the grammar for IPv4 (by dissallowing values greater than 255): http://tools.ietf.org/html/draft-ietf-sip-ipv6-abnf-fix I''ve already implemented it in Ragel and I can sure that it''s 100% valid and strict (I''ve done lots of tests): alphanum = ALPHA / DIGIT domainlabel = alphanum | ( alphanum ( alphanum | "-" )* alphanum ); toplabel = ALPHA | ( ALPHA ( alphanum | "-" )* alphanum ); hostname = ( domainlabel "." )* toplabel "."?; dec_octet = DIGIT | ( 0x31..0x39 DIGIT ) | ( "1" DIGIT{2} ) | ( "2" 0x30..0x34 DIGIT ) | ( "25" 0x30..0x35 ); IPv4address = dec_octet "." dec_octet "." dec_octet "." dec_octet; h16 = HEXDIG{1,4}; ls32 = ( h16 ":" h16 ) | IPv4address; IPv6address = ( ( h16 ":" ){6} ls32 ) | ( "::" ( h16 ":" ){5} ls32 ) | ( h16? "::" ( h16 ":" ){4} ls32 ) | ( ( ( h16 ":" )? h16 )? "::" ( h16 ":" ){3} ls32 ) | ( ( ( h16 ":" ){,2} h16 )? "::" ( h16 ":" ){2} ls32 ) | ( ( ( h16 ":" ){,3} h16 )? "::" h16 ":" ls32 ) | ( ( ( h16 ":" ){,4} h16 )? "::" ls32 ) | ( ( ( h16 ":" ){,5} h16 )? "::" h16 ) | ( ( ( h16 ":" ){,6} h16 )? "::" ); IPv6reference = "[" IPv6address "]"; host = hostname | IPv4address | IPv6reference; port = DIGIT{1,5}; hostport = host ( ":" port )?; This is much better than the deprecated and bogus grammar in RFC 2396 ;)>> ------------------ >> >> host_with_port = (hostname (":" digit*)?) >mark %host; >> >> - It allows something ugly as "mydomain.org:" >> >> I suggest: >> ? host_with_port = (hostname (":" digit{1,5})?) >mark %host; > > It''s ugly, but section 3.2.2 of RFC 2396 appears to allows it.Sometimes there are bugs in the RFC''s related to parsing and BNF grammars. I know several cases. Unfortunatelly RFC''s cannot be fixed, instead the errors are reported and a new draft or RFC "xxx-fix" appears some years later.>> message_header = ((field_name ":" " "* field_value)|value_cont) :> CRLF; >> >> - It doesn''t allow valid spaces before ":" as: >> ? ? ?Host : mydomain.org > > Spaces before the ":" aren''t allowed in rfc2616, and I have yet to see > evidence of clients sending headers like this in ~4 years of using this > parser.In SIP protocol spaces and tabulators before ":" are allowed, I really expected that in HTTP the same occurs as SIP grammar is based on HTTP grammar. But it could be different in some aspects, of course.>> - Tabulators are also allowed. >> >> I suggest: >> ? message_header = ((field_name [ \t]* ":" [ \t]* >> field_value)|value_cont) :> CRLF; > > I just pushed this out to unicorn.git to allow horizontal tabs:Thanks. [*] http://dev.sipdoc.net/projects/ragel-sip-parser/wiki/Phase1 -- I?aki Baz Castillo <ibc at aliax.net>
I?aki Baz Castillo <ibc at aliax.net> wrote:> 2010/5/7 Eric Wong <normalperson at yhbt.net>: > > Underscore isn''t valid for hostnames, but it is allowed in domain names > > and most DNS servers will resolve them. ?I''ve personally seen websites > > with underscores in their domain names in the wild[1]. > > Hi Eric, could you point me to the spec stating that underscore is > allowed for a domain? In the past I''ve done a SIP parser [*] with > Ragel, being 100% strict at BNF grammar, and note that SIP reuses 80% > of the grammar of HTTP. I''m pretty sure that "_" is not valid in a > domain (host, hostname or whatever). Anyhow it''s better just to allow > it at parsing level :)http://www.ietf.org/rfc/rfc2782.txt Even if it''s not part of the RFC, our parser will match reality and accommodate broken things we see in the wild, as it has done in the past: http://mid.gmane.org/20080327215027.GA14531 at untitled> > We''ll have to test the IPv6 addresses and probably split that out into a > > separate regexp since ":" would raise issues with the port number in > > existing cases. ?This is probably something for post-1.0. > > There is a IETF draft to improve and *fix* the existing BNF grammar for IPv6. > It also improves the grammar for IPv4 (by dissallowing values greater than 255): > > http://tools.ietf.org/html/draft-ietf-sip-ipv6-abnf-fix > > > I''ve already implemented it in Ragel and I can sure that it''s 100% > valid and strict (I''ve done lots of tests): > > alphanum = ALPHA / DIGIT > domainlabel = alphanum | ( alphanum ( alphanum | "-" )* alphanum ); > toplabel = ALPHA | ( ALPHA ( alphanum | "-" )* alphanum ); > hostname = ( domainlabel "." )* toplabel "."?; > dec_octet = DIGIT | ( 0x31..0x39 DIGIT ) | ( "1" DIGIT{2} ) | ( "2" > 0x30..0x34 DIGIT ) | ( "25" 0x30..0x35 ); > IPv4address = dec_octet "." dec_octet "." dec_octet "." dec_octet; > h16 = HEXDIG{1,4}; > ls32 = ( h16 ":" h16 ) | IPv4address; > IPv6address = ( ( h16 ":" ){6} ls32 ) | ( "::" ( h16 ":" ){5} ls32 ) | > ( h16? "::" ( h16 ":" ){4} ls32 ) | ( ( ( h16 ":" )? h16 )? "::" ( h16 > ":" ){3} ls32 ) | ( ( ( h16 ":" ){,2} h16 )? "::" ( h16 ":" ){2} ls32 > ) | ( ( ( h16 ":" ){,3} h16 )? "::" h16 ":" ls32 ) | ( ( ( h16 ":" > ){,4} h16 )? "::" ls32 ) | ( ( ( h16 ":" ){,5} h16 )? "::" h16 ) | ( ( > ( h16 ":" ){,6} h16 )? "::" ); > IPv6reference = "[" IPv6address "]"; > host = hostname | IPv4address | IPv6reference; > port = DIGIT{1,5}; > hostport = host ( ":" port )?; > > > This is much better than the deprecated and bogus grammar in RFC 2396 ;)Thanks, it might be worth simplifying a bit for readability, simplicity (and possibly performance) at the expense of 100% conformance. -- Eric Wong
2010/5/8 Eric Wong <normalperson at yhbt.net>:> I?aki Baz Castillo <ibc at aliax.net> wrote: >> 2010/5/7 Eric Wong <normalperson at yhbt.net>: >> > Underscore isn''t valid for hostnames, but it is allowed in domain names >> > and most DNS servers will resolve them. ?I''ve personally seen websites >> > with underscores in their domain names in the wild[1]. >> >> Hi Eric, could you point me to the spec stating that underscore is >> allowed for a domain? In the past I''ve done a SIP parser [*] with >> Ragel, being 100% strict at BNF grammar, and note that SIP reuses 80% >> of the grammar of HTTP. I''m pretty sure that "_" is not valid in a >> domain (host, hostname or whatever). Anyhow it''s better just to allow >> it at parsing level :) > > http://www.ietf.org/rfc/rfc2782.txtHi Eric. DNS SRV are not domain names, but DNS queries. For example, in SIP (and also in XMPP) if a phone is configured to use "myproxy.org" as proxy then it must perform a DNS SRV query for these entries: _sip._udp.myproxy.org _sip._tcp.myproxy.org Then the DNS query recevives some DNS A records for which the client also retrieves the associated IP''s. But after it, when the client generates a SIP request it uses "myproxy.org" rather than "_sip._udp.myproxy.org". This is, "_sip._udp.myproxy.org" is not a domain/hostname, but a format for querying DNS SRV records.> Even if it''s not part of the RFC, our parser will match reality and > accommodate broken things we see in the wild, as it has done in the > past: > > ?http://mid.gmane.org/20080327215027.GA14531 at untitledYes, I agree.>> alphanum ?= ?ALPHA / DIGIT >> domainlabel = alphanum | ( alphanum ( alphanum | "-" )* alphanum ); >> toplabel = ALPHA | ( ALPHA ( alphanum | "-" )* alphanum ); >> hostname = ( domainlabel "." )* toplabel "."?; >> dec_octet = DIGIT | ( 0x31..0x39 DIGIT ) | ( "1" DIGIT{2} ) | ( "2" >> 0x30..0x34 DIGIT ) | ( "25" 0x30..0x35 ); >> IPv4address = dec_octet "." dec_octet "." dec_octet "." dec_octet; >> h16 = HEXDIG{1,4}; >> ls32 = ( h16 ":" h16 ) | IPv4address; >> IPv6address = ( ( h16 ":" ){6} ls32 ) | ( "::" ( h16 ":" ){5} ls32 ) | >> ( h16? "::" ( h16 ":" ){4} ls32 ) | ( ( ( h16 ":" )? h16 )? "::" ( h16 >> ":" ){3} ls32 ) | ( ( ( h16 ":" ){,2} h16 )? "::" ( h16 ":" ){2} ls32 >> ) | ( ( ( h16 ":" ){,3} h16 )? "::" h16 ":" ls32 ) | ( ( ( h16 ":" >> ){,4} h16 )? "::" ls32 ) | ( ( ( h16 ":" ){,5} h16 )? "::" h16 ) | ( ( >> ( h16 ":" ){,6} h16 )? "::" ); >> IPv6reference = "[" IPv6address "]"; >> host = hostname | IPv4address | IPv6reference; >> port = DIGIT{1,5}; >> hostport = host ( ":" port )?; >> >> >> This is much better than the deprecated and bogus grammar in RFC 2396 ;) > > Thanks, it might be worth simplifying a bit for readability, simplicity > (and possibly performance) at the expense of 100% conformance.You can try to simplicity it, but note that some previous IPv6 BNF grammars failed to cover all the valid cases and they are bogus. For example, the original IPv6 BNF grammar appearing in RFC 3261 (SIP protocol) is buggy (even if it seems simpler) and the specification has been fixed with the draft I linked in my previous mail. Regards. -- I?aki Baz Castillo <ibc at aliax.net>