thr3ads.net - Mechanize users - [Mechanize-users] How to stop downloading of non text (PDFs, Images etc.) [Jan 2010]

Jimmy McGrath

2010-Jan-19 01:42 UTC

[Mechanize-users] How to stop downloading of non text (PDFs, Images etc.)

Howdy All,

I am creating a tool that allows a user to request a URL to be
downloaded and information viewed about the downloaded content. I would
like to restrict the user to being able to only request URLs that
resolve to html or xml (or other text documents e.g. txt, jsp, asp etc),
as the tool has no useful functionality if someone specifies a PNG, PDF
or any other binary file. I would like the tool to fail fast instead of
trying to download a 20 MB PowerPoint file which is of no use.

At first I thought I would just validate the URL to ensure that it does
not end with certain suffixes, but quickly realised that since a server
can redirect a URL with impunity, screening out URLs at the start won't
catch all instances. Trying to maintain a list of all valid (or all
invalid) file extensions would also be a painful maintenance overhead. I
thought that there may be a way to set valid mime-types for mechanize, or
to perform tests on the final URL after all redirects have finished but
before the download starts.
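
To make the second idea concrete, here is a minimal sketch that uses only
Ruby's standard library (Net::HTTP), not Mechanize itself. It follows
redirects with HEAD requests, so no body is transferred, and then checks
the Content-Type of the final response. The names TEXT_MIME_PATTERNS,
textual_content_type? and resolve_content_type are hypothetical, the
allow-list is only an example, and it assumes the server answers HEAD
requests and reports an honest Content-Type, which is not guaranteed.

```ruby
require "net/http"
require "uri"

# Hypothetical allow-list of "text-like" media types; adjust as needed.
TEXT_MIME_PATTERNS = [
  %r{\Atext/},                            # text/html, text/plain, text/xml, ...
  %r{\Aapplication/(xhtml\+xml|xml)\z},   # common XML/XHTML types
  %r{\+xml\z}                             # e.g. application/atom+xml
].freeze

# Returns true when a Content-Type header value names a text-like media type.
# Parameters such as "; charset=UTF-8" are ignored.
def textual_content_type?(header_value)
  return false if header_value.nil?

  media_type = header_value.split(";").first.to_s.strip.downcase
  TEXT_MIME_PATTERNS.any? { |pattern| pattern.match?(media_type) }
end

# Follows redirects using HEAD requests and returns [final_uri, content_type].
# Raises ArgumentError if more than `limit` hops are needed.
def resolve_content_type(url, limit = 5)
  raise ArgumentError, "too many redirects" if limit.zero?

  uri = URI.parse(url)
  response = Net::HTTP.start(uri.host, uri.port,
                             use_ssl: uri.scheme == "https") do |http|
    http.head(uri.request_uri)
  end

  if response.is_a?(Net::HTTPRedirection) && response["location"]
    # Location may be relative, so resolve it against the current URL.
    next_url = URI.join(url, response["location"]).to_s
    return resolve_content_type(next_url, limit - 1)
  end

  [uri, response["content-type"]]
end
```

A caller could run resolve_content_type on the requested URL, pass the
returned header to textual_content_type?, and only hand the final URL to
the real download step when the check passes. Since some servers reject
HEAD or mislabel content, a size cap on the actual download would still be
a sensible second line of defence.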

Anyway, I am hoping somebody could give me a suggestion on how I could
achieve this filtering, or even just steer me in the right direction
(even a suggestion of a good google query would be helpful!).

Thanks,

-Jimmy
