Hi!

I have a problem when using Mechanize as a crawler with some pages which have a long-delayed meta refresh to the same url.

First, some context: I want to reach "the real thing" when navigating to urls with delays... so following all kinds of redirects (including meta refresh) is on my wish list.

What I have found is that pages such as TechMeme have a meta refresh to the same URL with very long waits (1800 seconds). For these pages it takes too long to reach the maximum number of redirects (and there is no real value in following the redirects either).

For these situations, it is not clear what the best option is, since several factors are at play.

I would propose having a flag which avoids redirecting (and waiting) when the refresh is to the same url. This would be off by default, allowing other use cases.

Another option, which could be useful for a wider range of use cases, is ignoring waits on meta refresh.

This is also summarized as an issue on GitHub: http://github.com/tenderlove/mechanize/issues/issue/67

I am very new to Mechanize and I might be abusing it :-). Or just not seeing a better way of handling this.

So, before I embark on forking & everything... feedback on this issue is welcome!
--
Abel Muiño
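For reference, the tags I am talking about boil down to parsing the refresh `content` attribute. A minimal sketch in plain Ruby (this is my own illustration, not Mechanize's actual parser; the helper name is hypothetical):

```ruby
# Split a <meta http-equiv="refresh"> content attribute into its delay and
# optional target URL. Hypothetical helper, not part of Mechanize.
def parse_meta_refresh(content)
  m = /\A\s*(\d+)\s*(?:;\s*url\s*=\s*(\S+))?\s*\z/i.match(content)
  return nil unless m
  [m[1].to_i, m[2]]
end

parse_meta_refresh('1800;url=http://www.techmeme.com/101014/p67')
# -> [1800, "http://www.techmeme.com/101014/p67"]
#    an 1800-second wait before "redirecting" to the very same page
parse_meta_refresh('600')
# -> [600, nil] : a refresh with no target URL at all
```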
Hi!

On Tue, Oct 26, 2010 at 9:18 AM, Abel Muiño Vizcaino <amuino at gmail.com> wrote:

> I have a problem when using Mechanize as a crawler with some pages which
> have long-delayed meta refresh to the same url.
>
> What I have found is that pages such as TechMeme
> <http://www.techmeme.com/101014/p67#a101014p67> have a meta refresh to the
> same URL with very long waits (1800 seconds). For these pages it takes too
> long to reach the maximum number of redirects (and there is no real value
> in following the redirects either).

What value is there in following meta-refreshes with long waits? Why not just turn off follow_meta_refresh? I think I don't totally understand what it is you're trying to do.
Below:

El 26/10/2010, a las 17:58, Mike Dalessio escribió:

> What value is there in following meta-refreshes with long waits? Why not
> just turn off follow_meta_refresh? I think I don't totally understand what
> it is you're trying to do.

For my problem, I want to reach the last page in any sequence of redirects (be it http or meta refresh) and do some work with that last page. Additionally, I don't know the pages I'll be crawling beforehand, so I can't enable/disable meta refresh depending on whether it is going to cause trouble or not.

It is great that Mechanize can get me halfway there (other engines don't even handle meta refresh) but it is not fully solving my problem.

From a more general point of view, this issue with meta refresh pointing to the same page is going to be a problem for any crawler (which follows meta refresh tags, anyway).

Of course I could turn off follow_meta_refresh and implement my own meta refresh handling outside Mechanize, but since there is already some support for it, I think it would be better to enhance it than to reimplement it in my code.

What I would like to do (maybe through options) is to ignore waits when redirecting to a different url and to stop following meta refreshes when they point to the same url.
On Tue, Oct 26, 2010 at 12:54 PM, Abel Muiño Vizcaino <amuino at gmail.com> wrote:

> What I would like to do (maybe through options) is to ignore waits when
> redirecting to a different url and to stop following meta refreshes when
> they point to the same url.

Ok, let me rephrase what I think you're asking for, and you tell me if I'm correct.

You'd like to ignore tags like: <meta http-equiv="refresh" content="600">

You'd like to follow tags like: <meta http-equiv="refresh" content="2;url=http://sample.net/">

That is, if content contains a URL to be followed, you'd like to follow it, but otherwise, not.

Is that correct?
Trimmed the email for readability.

El 26/10/2010, a las 21:40, Mike Dalessio escribió:

> That is, if content contains a URL to be followed, you'd like to follow it,
> but otherwise, not.
>
> Is that correct?

Yes, that's correct, with an extra: if the url is the current url, don't follow it. Example:

mech.get("http://sample.net/")

should not follow (or sleep on) tags like:

<meta http-equiv="refresh" content="2;url=http://sample.net/">

--
Abel Muiño
On Tue, Oct 26, 2010 at 4:11 PM, Abel Muiño Vizcaino <amuino at gmail.com> wrote:

> Yes, that's correct, with an extra: if the url is the current url, don't
> follow it. Example:
>
> mech.get("http://sample.net/")
>
> should not follow (or sleep on) tags like:
>
> <meta http-equiv="refresh" content="2;url=http://sample.net/">

This seems like reasonable behavior, which could be made the Mechanize default behavior.

Does anybody have objections to changing Mechanize's behavior to *not* follow meta refreshes that are either for the current page URI or contain no URI?
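As a sketch of that rule in plain Ruby (assuming the refresh target has already been extracted; the helper name is hypothetical, not Mechanize API):

```ruby
require 'uri'

# Follow a meta refresh only when it names a target URL and that target
# differs from the page we are already on. Hypothetical helper, not
# part of Mechanize.
def follow_refresh?(refresh_url, current_url)
  return false if refresh_url.nil? || refresh_url.empty?
  URI.parse(refresh_url) != URI.parse(current_url)
end

follow_refresh?(nil, 'http://sample.net/')                  # false: no target URI
follow_refresh?('http://sample.net/', 'http://sample.net/') # false: refresh to self
follow_refresh?('http://other.net/', 'http://sample.net/')  # true: a real redirect
```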
No objection here . . . but:

The only reason I can imagine to keep requesting such URLs is for a web site that (for whatever reason) counts the refreshes and does something based on that count, or otherwise depends on a state change provoked by the refreshes.

If the URL is for the current page URI, it almost seems worthy of an exception.

On Tue, Oct 26, 2010 at 3:25 PM, Mike Dalessio <mike at csa.net> wrote:

> This seems like reasonable behavior, which could be made the Mechanize
> default behavior.
>
> Does anybody have objections to changing Mechanize's behavior to *not*
> follow meta refreshes that are either for the current page URI or contain
> no URI?
El 27/10/2010, a las 00:05, John Norman escribió:

> No objection here . . . but:
>
> The only reason I can imagine to keep requesting such URLs is for a web
> site that (for whatever reason) counts the refreshes and does something
> based on that count, or otherwise depends on a state change provoked by
> the refreshes.

As a developer, I have used this "refresh to self" strategy to wait for a background job (and maybe display some progress information too). So, in order to keep Mechanize useful for end-to-end testing of webapps, this scenario should be supported.

> If the URL is for the current page URI, it almost seems worthy of an
> exception.

I'm not sure about the exception. Some kind of information would be nice (especially since Mechanize apparently "hangs" when following a <meta http-equiv="refresh" content="1800">), as would a feature to avoid or limit waiting. But that's probably a different topic. Just allowing an option to ignore or follow the meta redirection to the same page would be nice.
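For what it's worth, the "refresh to self while a background job runs" scenario amounts to a capped polling loop. A self-contained sketch in plain Ruby; `fetch`, the page hashes, and the option names are all assumptions of mine, not Mechanize API:

```ruby
# Follow "refresh to self" a bounded number of times, capping each wait so a
# crawler never honors an 1800-second delay. Everything here is hypothetical.
def poll_self_refresh(fetch, url, max_follows: 3, max_wait: 0.01)
  page = fetch.call(url)
  max_follows.times do
    delay = page[:refresh_delay] or break # no self-refresh left: job is done
    sleep([delay, max_wait].min)          # cap the advertised wait
    page = fetch.call(url)
  end
  page
end

# Simulate a background job that finishes on the third request.
responses = [{ refresh_delay: 1800 }, { refresh_delay: 1800 }, { body: 'done' }]
fetch = ->(_url) { responses.shift }
poll_self_refresh(fetch, 'http://sample.net/') # => { body: 'done' }
```

With something like this, an end-to-end test still reaches the finished page, while a crawler hitting TechMeme-style pages gives up after `max_follows` attempts instead of sleeping for half an hour.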