Chris Riddoch
2007-Jul-25 22:14 UTC
[Mechanize-users] Being a polite client: maintaining history
Hi, folks.

I'm investigating libraries to use in a rather specialized feed reader. Some of the sites I want to follow don't have RSS feeds (or have hopelessly broken feeds), so I was already planning on using Hpricot anyway -- Mechanize is looking good, here.

In my research for my project, recipe 11.16 in O'Reilly's Ruby Cookbook references a website[1] discussing the importance of the If-Modified-Since header in polite RSS readers. It mentions that the ETag header is also important.

I see in Mechanize's code that if conditional_requests is set, it'll add the If-Modified-Since header. But this requires that the page is already in the history, and there's currently no provision for caching the history. Since RSS readers (and most scrapers in general) are likely to be run periodically, Mechanize should try to maintain this kind of state between runs, don't you think?

You might see a patch from me, unless someone beats me to it.

[1] http://fishbowl.pastiche.org/2002/10/21/http_conditional_get_for_rss_hackers

--
epistemological humility
Chris Riddoch
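
A minimal sketch of what keeping that state between runs could look like, using only Ruby's standard library rather than Mechanize itself. ValidatorStore, polite_fetch, and the JSON file are hypothetical names made up for illustration; none of this is Mechanize's API or the patch mentioned above. It remembers the ETag and Last-Modified values per URL and sends them back as If-None-Match and If-Modified-Since on the next run.

```ruby
require "json"
require "net/http"
require "uri"

# Hypothetical helper (not part of Mechanize): remembers the validators
# seen for each URL and saves them to a JSON file between runs.
class ValidatorStore
  def initialize(path)
    @path = path
    @entries = File.exist?(path) ? JSON.parse(File.read(path)) : {}
  end

  # Conditional headers to send with the next request for this URL.
  def conditional_headers(url)
    entry = @entries.fetch(url, {})
    headers = {}
    headers["If-None-Match"] = entry["etag"] if entry["etag"]
    headers["If-Modified-Since"] = entry["last_modified"] if entry["last_modified"]
    headers
  end

  # Record the validators from a successful response (nil values are dropped).
  def remember(url, etag, last_modified)
    @entries[url] = { "etag" => etag, "last_modified" => last_modified }.compact
  end

  def save
    File.write(@path, JSON.pretty_generate(@entries))
  end
end

# Fetch a URL with a conditional GET. Returns the body, or nil when the
# server answers 304 Not Modified. Redirects are not followed in this sketch.
def polite_fetch(url, store)
  uri = URI.parse(url)
  request = Net::HTTP::Get.new(uri)
  store.conditional_headers(url).each { |name, value| request[name] = value }

  response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
    http.request(request)
  end

  case response
  when Net::HTTPNotModified
    nil
  when Net::HTTPSuccess
    store.remember(url, response["ETag"], response["Last-Modified"])
    store.save
    response.body
  else
    response.value # raises for any other status
  end
end
```

A periodic job would create one ValidatorStore pointing at the same file on every run and call polite_fetch for each feed; a nil result means there is nothing new to parse.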