thr3ads.net - Mechanize users - [Mechanize-users] Intercepting an onClick file download [Apr 2008]

Sell Trino

2008-Apr-29 20:23 UTC

[Mechanize-users] Intercepting an onClick file download

Hi,

I'm having some trouble downloading a .csv file from a particular
website.  The file isn't part of a url, you need to click on a link in
order to get the file sent.  I don't know how to get mechanize to
correctly identify that.

Here is the link to the file I''m trying to retrieve:

 <td style="vertical-align: bottom; text-align: center;">
   <a href="#" onClick="dataExport( 'csv' ); return false;"><img
     src="/img/buttons/bu_csv.gif" width="37" height="17"
     style="border: none;" alt="Export to CVS"></a>
 </td>

Here is my code (partial):

agent = WWW::Mechanize.new { |a| a.log = Logger.new("mech.log") }
agent.keep_alive = false
agent.read_timeout = 60 # the page would timeout sometimes
url = "https://website.com/page.php4"
page = agent.get(url)
page.links.text(/Export to CVS/).each { |link|
  file_page = agent.click(link)
  file_page.save_as('output.csv')
  return
}

What I get in output.csv is just the original page, not the .csv file.
If someone could please help me understand how I can nab the file
contents instead, I'd greatly appreciate it.  (I actually want to
eventually parse the csv within the code, not just save it)

Thanks!

Mat Schaffer

2008-Apr-29 20:34 UTC


Mechanize would need javascript support to make this work, which I'm
pretty sure it doesn't have.  Maybe Aaron has some trick up his sleeve
though, I dunno.

What I usually do in these cases is manually trace the javascript (in
this case the dataExport function) using Firebug and the Firefox Web
Developer toolbar.  Once I get a handle on what the javascript is
doing, I replicate that in ruby to build the appropriate URL and
finally just use mechanize to make the GET request.

Good luck with your project!
-Mat


Matt White

2008-Apr-29 20:41 UTC


Sell,

As Mechanize doesn't interpret Javascript, you will need to dissect the
function "dataExport". If you need help with that, paste the source
for that function here and perhaps we can help more. Or if the website is
publicly accessible, let us know the URL and we can take a look at it.

Matt White



Chris Riddoch

2008-Apr-29 21:06 UTC


This sort of question is clearly frequent enough to warrant
documenting.  Expect a patch from me soon for this...

-- 
epistemological humility
 Chris Riddoch

Sell Trino

2008-Apr-29 22:29 UTC


Thanks guys for the feedback.  I understand now the issue about this
being javascript.

I installed Firebug and the Firefox Web Developer toolbar (which look
very helpful, btw) and got a full dump of the javascript code (the
dataExport script wasn't shown on a dump of the page source until this
tool dug it out) and here is the code for it:

/**
 * Initiate Export, type is either csv or xls
 **/
function dataExport( type ) {

	document.getElementById('export').value = 'Y';
	document.getElementById('exportType').value = type;
	document.getElementById('selectform').method = 'POST';

	switchTarget('self');

	document.getElementById('export').value = '';
	document.getElementById('exportType').value = '';

} // dataExport

I then made my url
"https://website.com/page.php4?export=Y&exportType=csv" and did a get
on that and it worked!  (apparently their server doesn't require it to
be a post...)  Thanks everyone for the help!
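
For anyone following along, the two request shapes discussed here can be
sketched with Ruby's standard library alone. This is a minimal sketch,
not the site's documented interface: the field names export and
exportType come from the dataExport function above, the URL is the
placeholder used in this thread, and the helper names export_params,
export_url and export_post are made up for illustration. Nothing is sent
over the network.

```ruby
require 'uri'
require 'net/http'

# Placeholder URL from this thread; not a real endpoint.
BASE_URL = 'https://website.com/page.php4'

# Fields that dataExport() sets on the form before it is submitted.
def export_params(type)
  { 'export' => 'Y', 'exportType' => type }
end

# GET variant: the same fields encoded into the query string.
def export_url(type, base = BASE_URL)
  uri = URI.parse(base)
  uri.query = URI.encode_www_form(export_params(type))
  uri.to_s
end

# POST variant: what the javascript asks the browser to do.
# The request object is only built here, never sent.
def export_post(type, base = BASE_URL)
  request = Net::HTTP::Post.new(URI.parse(base))
  request.set_form_data(export_params(type))
  request
end

puts export_url('csv')
puts export_post('csv').body
```

The GET URL is what gets handed to agent.get. If a server insisted on a
POST, the same hash of fields is what would need to be posted instead.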

One last thing though: when I get the page, page.body.class is
String.  I had set up the CSVParser via:

 class CSVParser < WWW::Mechanize::File
   attr_reader :csv
   def initialize(uri=nil, response=nil, body=nil, code=nil)
     super(uri, response, body, code)
     @csv = CSV.parse(body)
   end
 end
 agent = WWW::Mechanize.new
 agent.pluggable_parser.csv = CSVParser

And it doesn't seem to autorecognize the file as CSV.  I think it's
because the content encoding is gzip, as per the log file:

response-header: vary => User-Agent,Accept-Encoding
response-header: cache-control => must-revalidate, post-check=0,pre-check=0
response-header: connection => close
response-header: x-cache => MISS from 284720
response-header: expires => 0
response-header: content-type => application/octetstream
response-header: date => Tue, 29 Apr 2008 22:23:15 GMT
response-header: content-encoding => gzip
response-header: content-disposition => attachment; filename=file.csv
response-header: server => Apache
response-header: content-length => 5837
response-header: pragma => public
gunzip body

Not like it's a big deal to just CSV.parse(page.body), but just
wondering if I'm right in why it didn't recognize and parse this
automatically as .csv

Thanks!






Aaron Patterson

2008-Apr-29 23:05 UTC


On Tue, Apr 29, 2008 at 3:29 PM, Sell Trino <selltrino at gmail.com> wrote:
> [...]
> Not like it's a big deal to just CSV.parse(page.body), but just
> wondering if I'm right in why it didn't recognize and parse this
> automatically as .csv

Mechanize uses the content-type header to determine which parser to
use.  The response header indicated 'application/octetstream', which
doesn't really give any hints as to the type of data you are
receiving.
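
To make the point concrete, here is a rough sketch of the kind of
decision involved, using only Ruby's standard library. It is not
Mechanize's actual code: the helper names looks_like_csv? and
parse_if_csv, the list of CSV content types, and the fallback of peeking
at the content-disposition filename are all assumptions for
illustration. The sample headers are taken from the log above; the
sample body is made up.

```ruby
require 'csv'

# Content types treated as CSV in this sketch (an assumption,
# not Mechanize's own list).
CSV_CONTENT_TYPES = ['text/csv', 'text/comma-separated-values'].freeze

# True if the headers suggest CSV, either by content-type or, as a
# fallback, by a .csv filename in content-disposition.
def looks_like_csv?(headers)
  type = headers['content-type'].to_s.split(';').first.to_s.strip.downcase
  return true if CSV_CONTENT_TYPES.include?(type)

  disposition = headers['content-disposition'].to_s
  !!(disposition =~ /filename="?[^";]*\.csv"?\s*(?:;|\z)/i)
end

# Returns parsed rows for CSV-looking responses, otherwise the raw body.
def parse_if_csv(headers, body)
  looks_like_csv?(headers) ? CSV.parse(body) : body
end

headers = {
  'content-type'        => 'application/octetstream',
  'content-disposition' => 'attachment; filename=file.csv'
}
body = "name,qty\nwidget,3\n" # made-up sample data

# Content-type alone gives no hint that this is CSV.
p looks_like_csv?('content-type' => headers['content-type'])
# With the filename fallback the body is parsed into rows.
p parse_if_csv(headers, body)
```

Going by content-type alone, the response in the log is not recognizable
as CSV; only the filename in content-disposition hints at it, which is
why calling CSV.parse(page.body) by hand is a reasonable workaround.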

-- 
Aaron Patterson
http://tenderlovemaking.com/

Aaron Patterson

2008-Apr-29 23:08 UTC


On Tue, Apr 29, 2008 at 1:34 PM, Mat Schaffer <mat.schaffer at gmail.com> wrote:
> Mechanize would need javascript support to make this work.  Which I'm pretty
> sure it doesn't have.  Maybe Aaron has some trick up his sleeve though, I
> dunno.

Not yet.  I'm working on it though.....

See these:

http://tenderlovemaking.com/2008/04/23/take-it-to-the-limit-one-more-time/
http://github.com/jbarnette/johnson/tree/master

Unfortunately this project isn't my day job.  ;-)

-- 
Aaron Patterson
http://tenderlovemaking.com/
