Hi!

I have a problem when using Mechanize as a crawler with some pages which have a long-delayed meta refresh to the same url.

First, some context: I want to reach "the real thing" when navigating to urls with delays... so following all kinds of redirects (including meta refresh) is on my wish list.

What I have found is that pages such as TechMeme have a meta refresh to the same URL with very long waits (1800 seconds). For these pages it takes too long to reach the maximum number of redirects (and there is no real value in following the redirects either).

For these situations, it is not clear what the best option is, since several factors are at play.

I would propose having a flag which avoids redirecting (and waiting) when the refresh is to the same url. This would be off by default, allowing other use cases.

Another option, which could be useful for a wider range of use cases, is ignoring waits on meta refresh.

This is also summarized as an issue on GitHub: http://github.com/tenderlove/mechanize/issues/issue/67

I am very new to Mechanize and I might be abusing it :-). Or just not seeing a better way of handling this.

So, before I embark on forking & everything... feedback on this issue is welcome!
--
Abel Muiño
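For reference, the tags I am talking about boil down to parsing the refresh `content` attribute. A minimal sketch in plain Ruby (this is my own illustration, not Mechanize's actual parser; the helper name is hypothetical):

```ruby
# Split a <meta http-equiv="refresh"> content attribute into its delay and
# optional target URL. Hypothetical helper, not part of Mechanize.
def parse_meta_refresh(content)
  m = /\A\s*(\d+)\s*(?:;\s*url\s*=\s*(\S+))?\s*\z/i.match(content)
  return nil unless m
  [m[1].to_i, m[2]]
end

parse_meta_refresh('1800;url=http://www.techmeme.com/101014/p67')
# -> [1800, "http://www.techmeme.com/101014/p67"]
#    an 1800-second wait before "redirecting" to the very same page
parse_meta_refresh('600')
# -> [600, nil] : a refresh with no target URL at all
```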
Hi!

On Tue, Oct 26, 2010 at 9:18 AM, Abel Muiño Vizcaino <amuino at gmail.com> wrote:

> I have a problem when using Mechanize as a crawler with some pages which
> have long-delayed meta refresh to the same url.
>
> What I have found is that pages such as TechMeme
> <http://www.techmeme.com/101014/p67#a101014p67> have a meta refresh to the
> same URL with very long waits (1800 seconds). For these pages it takes too
> long to reach the maximum number of redirects (and there is no real value
> in following the redirects either).

What value is there in following meta-refreshes with long waits? Why not just turn off follow_meta_refresh? I think I don't totally understand what it is you're trying to do.
Below:

El 26/10/2010, a las 17:58, Mike Dalessio escribió:

> What value is there in following meta-refreshes with long waits? Why not
> just turn off follow_meta_refresh? I think I don't totally understand what
> it is you're trying to do.

For my problem, I want to reach the last page in any sequence of redirects (be it http or meta refresh) and do some work with that last page. Additionally, I don't know the pages I'll be crawling beforehand, so I can't enable/disable meta refresh depending on whether it is going to cause trouble or not.

It is great that Mechanize can get me halfway there (other engines don't even handle meta refresh) but it is not fully solving my problem.

From a more general point of view, this issue with meta refresh pointing to the same page is going to be a problem for any crawler (which follows meta refresh tags, anyway).

Of course I could turn off follow_meta_refresh and implement my own meta refresh handling outside Mechanize, but since there is already some support for it, I think it would be better to enhance it than to reimplement it in my code.

What I would like to do (maybe through options) is to ignore waits when redirecting to a different url and to stop following meta refreshes when they point to the same url.
On Tue, Oct 26, 2010 at 12:54 PM, Abel Muiño Vizcaino <amuino at gmail.com> wrote:

> What I would like to do (maybe through options) is to ignore waits when
> redirecting to a different url and to stop following meta refreshes when
> they point to the same url.

Ok, let me rephrase what I think you're asking for, and you tell me if I'm correct.

You'd like to ignore tags like: <meta http-equiv="refresh" content="600">

You'd like to follow tags like: <meta http-equiv="refresh" content="2;url=http://sample.net/">

That is, if content contains a URL to be followed, you'd like to follow it, but otherwise, not.

Is that correct?
Trimmed the email for readability.

El 26/10/2010, a las 21:40, Mike Dalessio escribió:

> That is, if content contains a URL to be followed, you'd like to follow it,
> but otherwise, not.
>
> Is that correct?

Yes, that's correct, with an extra: if the url is the current url, don't follow it. Example:

mech.get("http://sample.net/")

should not follow (or sleep on) tags like:

<meta http-equiv="refresh" content="2;url=http://sample.net/">

--
Abel Muiño
On Tue, Oct 26, 2010 at 4:11 PM, Abel Muiño Vizcaino <amuino at gmail.com> wrote:

> Yes, that's correct, with an extra: if the url is the current url, don't
> follow it. Example:
>
> mech.get("http://sample.net/")
>
> should not follow (or sleep on) tags like:
>
> <meta http-equiv="refresh" content="2;url=http://sample.net/">

This seems like reasonable behavior, which could be made the Mechanize default behavior.

Does anybody have objections to changing Mechanize's behavior to *not* follow meta refreshes that are either for the current page URI or contain no URI?
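As a sketch of that rule in plain Ruby (assuming the refresh target has already been extracted; the helper name is hypothetical, not Mechanize API):

```ruby
require 'uri'

# Follow a meta refresh only when it names a target URL and that target
# differs from the page we are already on. Hypothetical helper, not
# part of Mechanize.
def follow_refresh?(refresh_url, current_url)
  return false if refresh_url.nil? || refresh_url.empty?
  URI.parse(refresh_url) != URI.parse(current_url)
end

follow_refresh?(nil, 'http://sample.net/')                  # false: no target URI
follow_refresh?('http://sample.net/', 'http://sample.net/') # false: refresh to self
follow_refresh?('http://other.net/', 'http://sample.net/')  # true: a real redirect
```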
No objection here . . . but:

The only reason I can imagine to keep requesting such URLs is for a web site that (for whatever reason) counts the refreshes and does something based on that count, or otherwise depends on a state change provoked by the refreshes.

If the URL is for the current page URI, it almost seems worthy of an exception.

On Tue, Oct 26, 2010 at 3:25 PM, Mike Dalessio <mike at csa.net> wrote:

> This seems like reasonable behavior, which could be made the Mechanize
> default behavior.
>
> Does anybody have objections to changing Mechanize's behavior to *not*
> follow meta refreshes that are either for the current page URI or contain
> no URI?
El 27/10/2010, a las 00:05, John Norman escribió:

> No objection here . . . but:
>
> The only reason I can imagine to keep requesting such URLs is for a web
> site that (for whatever reason) counts the refreshes and does something
> based on that count, or otherwise depends on a state change provoked by
> the refreshes.

As a developer, I have used this "refresh to self" strategy to wait for a background job (and maybe display some progress information too). So, in order to keep Mechanize useful for end-to-end testing of webapps, this scenario should be supported.

> If the URL is for the current page URI, it almost seems worthy of an
> exception.

I'm not sure about the exception. Some kind of information would be nice (especially since Mechanize apparently "hangs" when following a <meta http-equiv="refresh" content="1800">), as would a feature to avoid or limit waiting. But that's probably a different topic. Just allowing an option to ignore or follow the meta redirection to the same page would be nice.
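For what it's worth, the "refresh to self while a background job runs" scenario amounts to a capped polling loop. A self-contained sketch in plain Ruby; `fetch`, the page hashes, and the option names are all assumptions of mine, not Mechanize API:

```ruby
# Follow "refresh to self" a bounded number of times, capping each wait so a
# crawler never honors an 1800-second delay. Everything here is hypothetical.
def poll_self_refresh(fetch, url, max_follows: 3, max_wait: 0.01)
  page = fetch.call(url)
  max_follows.times do
    delay = page[:refresh_delay] or break # no self-refresh left: job is done
    sleep([delay, max_wait].min)          # cap the advertised wait
    page = fetch.call(url)
  end
  page
end

# Simulate a background job that finishes on the third request.
responses = [{ refresh_delay: 1800 }, { refresh_delay: 1800 }, { body: 'done' }]
fetch = ->(_url) { responses.shift }
poll_self_refresh(fetch, 'http://sample.net/') # => { body: 'done' }
```

With something like this, an end-to-end test still reaches the finished page, while a crawler hitting TechMeme-style pages gives up after `max_follows` attempts instead of sleeping for half an hour.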