Jimmy McGrath
2010-Jan-19 01:42 UTC
[Mechanize-users] How to stop downloading of non text (PDFs, Images etc.)
Howdy All,

I am creating a tool that lets a user request a URL to be downloaded and then view information about the downloaded content. I would like to restrict users to URLs that resolve to HTML or XML (or other text documents, e.g. txt, jsp, asp), as the tool has no useful functionality if someone specifies a PNG, PDF or any other binary file. I would like the tool to fail fast instead of trying to download a 20 MB PowerPoint file which is of no use.

At first I thought I would just validate the URL to ensure that it does not end with certain suffixes, but I quickly realised that since a server can redirect a URL with impunity, screening URLs at the start won't catch all instances. Trying to maintain a list of all valid (or all invalid) file extensions would also be a painful maintenance overhead.

I thought there might be a way to set valid MIME types for mechanize, or to perform tests on the final URL after all redirects have finished but before the download starts.

Anyway, I am hoping somebody can give me a suggestion on how to achieve this filtering, or even just steer me in the right direction (even a suggestion of a good Google query would be helpful!).

Thanks,
-Jimmy
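
For illustration, here is a minimal sketch of the "check after redirects, before the download" idea. It uses Python's standard urllib rather than Mechanize, so it says nothing about what Mechanize itself offers; the names `is_text_type`, `fetch_if_text`, `TEXT_TYPES` and the `max_bytes` cap are hypothetical. It relies on the server sending an honest Content-Type header, which is not guaranteed.

```python
import urllib.request

# Assumed allow-list of non-"text/*" types that are still text; adjust as needed.
TEXT_TYPES = {"application/xml", "application/xhtml+xml"}


def is_text_type(content_type):
    """Return True if a Content-Type header value looks like a text document."""
    # Drop parameters such as "; charset=utf-8" and normalise case.
    mime = content_type.split(";", 1)[0].strip().lower()
    return mime.startswith("text/") or mime in TEXT_TYPES or mime.endswith("+xml")


def fetch_if_text(url, max_bytes=1_000_000):
    """Follow redirects, inspect the headers, and read the body only if it is text."""
    # urlopen follows redirects and returns once the response headers are in;
    # the body is not consumed until read() is called.
    with urllib.request.urlopen(url, timeout=10) as resp:
        content_type = resp.headers.get("Content-Type", "")
        if not is_text_type(content_type):
            raise ValueError("refusing non-text content: %r" % content_type)
        # geturl() reports the final URL after any redirects.
        return resp.geturl(), resp.read(max_bytes)
```

Because the decision is made from the response headers of the final URL, no list of file extensions is needed, and the `max_bytes` cap bounds the download even when a server mislabels a large file as text.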