Hi,

I'm having some trouble downloading a .csv file from a particular website. The file isn't part of a URL; you need to click on a link in order to get the file sent. I don't know how to get mechanize to correctly identify that.

Here is the link to the file I'm trying to retrieve:

    <td style="vertical-align: bottom; text-align: center;">
      <a href="#" onClick="dataExport( 'csv' ); return false;"><img src="/img/buttons/bu_csv.gif" width="37" height="17" style="border: none;" alt="Export to CVS"></a>
    </td>

Here is my code (partial):

    agent = WWW::Mechanize.new { |a| a.log = Logger.new("mech.log") }
    agent.keep_alive = false
    agent.read_timeout = 60 # the page would timeout sometimes
    url = "https://website.com/page.php4"
    page = agent.get(url)
    page.links.text(/Export to CVS/).each { |link|
      file_page = agent.click(link)
      file_page.save_as('output.csv')
      return
    }

What I get in output.csv is just the original page, not the .csv file. If someone could please help me understand how I can nab the file contents instead, I'd greatly appreciate it. (I actually want to eventually parse the csv within the code, not just save it.)

Thanks!
Mat Schaffer
2008-Apr-29 20:34 UTC
[Mechanize-users] Intercepting an onClick file download
Mechanize would need JavaScript support to make this work, which I'm pretty sure it doesn't have. Maybe Aaron has some trick up his sleeve though, I dunno.

What I usually do in these cases is manually trace the JavaScript (in this case the dataExport function) using Firebug and the Firefox Web Developer toolbar. Once I get a handle on what the JavaScript is doing, I replicate that in Ruby to build the appropriate URL and finally just use mechanize to make the GET request.

Good luck with your project!
-Mat

On Apr 29, 2008, at 4:23 PM, Sell Trino wrote:
> What I get in output.csv is just the original page, not the .csv file.
> If someone could please help me understand how I can nab the file
> contents instead, I'd greatly appreciate it.
Sell,

As Mechanize doesn't interpret JavaScript, you will need to dissect the function "dataExport". If you need help with that, paste the source for that function here and perhaps we can help more. Or if the website is publicly accessible, let us know the URL and we can take a look at it.

Matt White

----- Original Message ----
From: Sell Trino <selltrino at gmail.com>
To: mechanize-users at rubyforge.org
Sent: Tuesday, April 29, 2008 2:23:26 PM
Subject: [Mechanize-users] Intercepting an onClick file download

What I get in output.csv is just the original page, not the .csv file. If someone could please help me understand how I can nab the file contents instead, I'd greatly appreciate it.
Chris Riddoch
2008-Apr-29 21:06 UTC
[Mechanize-users] Intercepting an onClick file download
This sort of question is clearly frequent enough to warrant documenting. Expect a patch from me soon for this...

--
epistemological humility
Chris Riddoch
Thanks guys for the feedback. I understand now the issue about this being JavaScript.

I installed Firebug and the Firefox Web Developer toolbar (which look very helpful, btw) and got a full dump of the JavaScript code (the dataExport script wasn't shown on a dump of the page source until this tool dug it out). Here is the code for it:

    /**
     * Initiate Export, type is either csv or xls
     **/
    function dataExport( type ) {

        document.getElementById('export').value = 'Y';
        document.getElementById('exportType').value = type;
        document.getElementById('selectform').method = 'POST';

        switchTarget('self');

        document.getElementById('export').value = '';
        document.getElementById('exportType').value = '';

    } // dataExport

I then made my URL "https://website.com/page.php4?export=Y&exportType=csv" and did a GET on that and it worked! (Apparently their server doesn't require it to be a POST...) Thanks everyone for the help!

One last thing though: when I get the page, page.body.class is String. I had set up the CSVParser via:

    require 'csv' # needed for CSV.parse below

    class CSVParser < WWW::Mechanize::File
      attr_reader :csv
      def initialize(uri=nil, response=nil, body=nil, code=nil)
        super(uri, response, body, code)
        @csv = CSV.parse(body)
      end
    end
    agent = WWW::Mechanize.new
    agent.pluggable_parser.csv = CSVParser

And it doesn't seem to autorecognize the file as CSV. I think it's because the content encoding is gzip, as per the log file:

    response-header: vary => User-Agent,Accept-Encoding
    response-header: cache-control => must-revalidate, post-check=0,pre-check=0
    response-header: connection => close
    response-header: x-cache => MISS from 284720
    response-header: expires => 0
    response-header: content-type => application/octetstream
    response-header: date => Tue, 29 Apr 2008 22:23:15 GMT
    response-header: content-encoding => gzip
    response-header: content-disposition => attachment; filename=file.csv
    response-header: server => Apache
    response-header: content-length => 5837
    response-header: pragma => public
    gunzip body

Not like it's a big deal to just call CSV.parse(page.body), but I'm wondering if I'm right about why it didn't recognize and parse this automatically as .csv.

Thanks!
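The URL construction described in the message above can be sketched in plain Ruby. This is only a minimal illustration: `export_url` is a hypothetical helper name, not part of mechanize, and it assumes (as the poster observed) that the server accepts the two form fields set by dataExport as GET query parameters.

```ruby
require 'uri'

# Hypothetical helper (not a mechanize API): mirrors the two form fields
# that the dataExport() JavaScript sets before submitting the form.
def export_url(base, type)
  uri = URI.parse(base)
  # 'export' and 'exportType' are the element ids taken from dataExport();
  # treating them as query parameter names is an assumption that happened
  # to hold for this particular server.
  uri.query = URI.encode_www_form('export' => 'Y', 'exportType' => type)
  uri.to_s
end

url = export_url('https://website.com/page.php4', 'csv')
puts url # prints https://website.com/page.php4?export=Y&exportType=csv
```

The resulting string can then be passed to `agent.get(url)` exactly as in the original script. Building the query with `URI.encode_www_form` rather than string concatenation keeps the values correctly escaped if the export type ever contains unusual characters.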
Aaron Patterson
2008-Apr-29 23:05 UTC
[Mechanize-users] Intercepting an onClick file download
On Tue, Apr 29, 2008 at 3:29 PM, Sell Trino <selltrino at gmail.com> wrote:
> [...]
> And it doesn't seem to autorecognize the file as CSV. I think it's
> because the content encoding is gzip, as per the log file:
>
> [...]
> response-header: content-type => application/octetstream
> [...]
> response-header: content-encoding => gzip
> response-header: content-disposition => attachment; filename=file.csv
> [...]
>
> Not like it's a big deal to just call CSV.parse(page.body), but I'm
> wondering if I'm right about why it didn't recognize and parse this
> automatically as .csv.

Mechanize uses the content-type header to determine which parser to use. The response header indicated 'application/octetstream', which doesn't really give any hints as to the type of data you are receiving.

--
Aaron Patterson
http://tenderlovemaking.com/
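To illustrate the point about content-type dispatch, here is a stand-alone sketch in plain Ruby. It is not Mechanize's internal code: `PARSERS` and `parse_body` are hypothetical names, and the fallback on the content-disposition filename is one possible workaround, not something Mechanize is stated to do in this thread.

```ruby
require 'csv'

# Illustrative lookup table, NOT Mechanize's internals: maps a content type
# to a callable that turns a response body into a parsed value.
PARSERS = {
  'text/csv' => ->(body) { CSV.parse(body) }
}.freeze

# Hypothetical helper: pick a parser from the content-type header, and fall
# back to the filename in content-disposition when the type is too generic
# to match anything in the table.
def parse_body(headers, body)
  type = headers['content-type'].to_s.split(';').first.to_s.strip.downcase
  parser = PARSERS[type]
  disposition = headers['content-disposition'].to_s
  if parser.nil? && disposition =~ /filename="?[^";]*\.csv"?\s*(?:;|\z)/i
    parser = PARSERS['text/csv']
  end
  # With no matching parser the body is returned unchanged, as a String.
  parser ? parser.call(body) : body
end

# Headers taken from the log in the quoted message; the body is made up.
headers = {
  'content-type' => 'application/octetstream',
  'content-disposition' => 'attachment; filename=file.csv'
}
rows = parse_body(headers, "a,b\n1,2\n")
p rows
```

With only the content type to go on, the lookup finds nothing for 'application/octetstream' and the body stays a plain String, which matches what the poster saw. In Mechanize itself, a likely alternative is to register the parser under the content type the server actually sends, along the lines of `agent.pluggable_parser['application/octetstream'] = CSVParser`; treat that as a suggestion to verify against the installed version rather than a confirmed fix.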
Aaron Patterson
2008-Apr-29 23:08 UTC
[Mechanize-users] Intercepting an onClick file download
On Tue, Apr 29, 2008 at 1:34 PM, Mat Schaffer <mat.schaffer at gmail.com> wrote:
> Mechanize would need javascript support to make this work. Which I'm pretty
> sure it doesn't have. Maybe Aaron has some trick up his sleeve though, I
> dunno.

Not yet. I'm working on it though..... See these:

http://tenderlovemaking.com/2008/04/23/take-it-to-the-limit-one-more-time/
http://github.com/jbarnette/johnson/tree/master

Unfortunately this project isn't my day job. ;-)

--
Aaron Patterson
http://tenderlovemaking.com/