I am trying to get a page whose URL contains special characters:

> http://www.example.comr/details-kontakt-1274/[theater]-Dimbeldu-Puppentheater-Kinderschminken-Maerchen-.html

When I open it directly with Mechanize:

> agent.get(url)

I get:

> URI::InvalidURIError: bad URI(is not URI?): http://www.example.comr/details-kontakt-1274/[theater]-Dimbeldu-Puppentheater-Kinderschminken-Maerchen-.html
> from /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/uri/common.rb:436:in `split'
> from /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/uri/common.rb:485:in `parse'

And if I escape the URL with

> agent.get(CGI.escape(url))

then I get:

> RuntimeError: need absolute URL
> from /Library/Ruby/Gems/1.8/gems/mechanize-1.0.0/lib/mechanize/chain/uri_resolver.rb:52:in `handle'
> from /Library/Ruby/Gems/1.8/gems/mechanize-1.0.0/lib/mechanize/chain.rb:24:in `handle'
> from /Library/Ruby/Gems/1.8/gems/mechanize-1.0.0/lib/mechanize.rb:509:in `fetch_page'
> from /Library/Ruby/Gems/1.8/gems/mechanize-1.0.0/lib/mechanize.rb:259:in `get'

Could you please shed some light on this?

--
Regards
Leon Du
Alex Young
2010-Jul-21 16:38 UTC
[Mechanize-users] weird url question when using Mechanize
On Wed, 2010-07-21 at 23:57 +0800, Leon Du wrote:
> http://www.example.comr/details-kontakt-1274/[theater]-Dimbeldu-Puppentheater-Kinderschminken-Maerchen-.html

[ and ] aren't strictly legal characters in an href. I use this as a pre-filter:

href.gsub(%r{[^%A-Za-z0-9:@\-._~!$&'\/\(\)*+,;=]}){|m| "%%%x" % m[0] }

--
Alex
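The pre-filter quoted above relies on Ruby 1.8 behaviour, where `m[0]` returns the character code as an integer. On Ruby 1.9 and later `m[0]` is a one-character string, so `"%%%x" % m[0]` raises an ArgumentError. The following is a minimal sketch of the same idea for newer Rubies, not the code from the thread: the name `escape_href` and the constant `UNSAFE_HREF_CHARS` are hypothetical, and the sketch uses `%02X` (zero-padded, uppercase) instead of `%x` so that bytes below 0x10 still produce two hex digits.

```ruby
require "uri"

# Same allow-list as the pre-filter above: anything NOT in this set
# gets percent-encoded. "%" is allowed so existing escapes survive.
UNSAFE_HREF_CHARS = %r{[^%A-Za-z0-9:@\-._~!$&'/()*+,;=]}

# Hypothetical helper: percent-encode each byte of every disallowed
# character, leaving the URL structure (scheme, slashes) untouched.
def escape_href(href)
  href.gsub(UNSAFE_HREF_CHARS) do |char|
    char.bytes.map { |byte| "%%%02X" % byte }.join
  end
end

url = "http://www.example.comr/details-kontakt-1274/[theater]-Dimbeldu-Puppentheater-Kinderschminken-Maerchen-.html"
safe_url = escape_href(url)
puts safe_url
puts URI.parse(safe_url).scheme
```

The escaped string can then be passed to `agent.get` in place of the raw URL.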
Thanks Alex, that actually works :)

But my question remains: why doesn't CGI.escape work? Shouldn't it do the same thing?

--
Regards
Leon Du
Alex Young
2010-Jul-22 08:10 UTC
[Mechanize-users] weird url question when using Mechanize
On Thu, 2010-07-22 at 15:24 +0800, Leon Du wrote:
> thanks Alex, that actually works :)
>
> but still my question is why the CGI.escape doesn't work?
> shouldn't it do the same?

The list of characters that CGI.escape escapes includes ':' and '/', so it turns "http://" into "http%3A%2F%2F". URI.parse doesn't recognise that as the start of an absolute URI.

--
Alex
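The effect described above can be checked in a few lines. This is a small sketch rather than anything taken from Mechanize itself; it only shows what CGI.escape does to the scheme separator and that URI.parse then finds no scheme, which is consistent with the "need absolute URL" error from the first message.

```ruby
require "cgi"
require "uri"

url = "http://www.example.comr/details-kontakt-1274/[theater]-Dimbeldu-Puppentheater-Kinderschminken-Maerchen-.html"

# CGI.escape treats the whole string as data, so ":" and "/" are
# encoded along with "[" and "]".
escaped = CGI.escape(url)
puts escaped

# With "://" gone, the parser sees no scheme and treats the string
# as a relative reference.
uri = URI.parse(escaped)
puts uri.scheme.inspect
```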
Yes, you are right, that is the problem: CGI.escape just won't work together with URI.parse. Isn't this a bug?

--
Regards
Leon Du
Alex Young
2010-Jul-22 10:02 UTC
[Mechanize-users] weird url question when using Mechanize
On Thu, 2010-07-22 at 16:46 +0800, Leon Du wrote:
> yes, you are right, that is the problem,
> CGI escape just won't work together with URI.parse,
> isn't this a bug?

No. CGI.escape is intended for encoding strings *within* a URL so that you can, for instance, include path separators in your data string without the URL handlers at either end mistaking them for *actual* path separators. It's not supposed to handle encoding the URL itself.

--
Alex
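As an illustration of the intended use described above, here is a short sketch that applies CGI.escape to a single data value before placing it in a query string. The base URL, the parameter name `q`, and the sample value are all made up for the example.

```ruby
require "cgi"
require "uri"

# Hypothetical endpoint and data value; the value contains characters
# that would otherwise be read as URL structure.
base  = "http://www.example.com/search"
value = "a/b?c=d&e"

# Escape only the data, never the URL as a whole.
url = "#{base}?q=#{CGI.escape(value)}"
uri = URI.parse(url)
puts uri.query

# The receiving side can recover the original string.
decoded = CGI.unescape(uri.query.split("=", 2).last)
puts decoded
```

Note that CGI.escape encodes a space as "+", which suits query strings.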
Thanks for the explanation :)

--
Regards
Leon Du