thr3ads.net - Mechanize users - [Mechanize-users] problem scrapping ATnT site (Matt White) [Mar 2009]

gmoraes
2009-Mar-11 15:08 UTC
[Mechanize-users] problem scrapping ATnT site (Matt White)

Try using Firebug to help you find these changes. I have never used the AT&T
website, but you may need to log in and find the download URL using Firebug,
as I did:

http://zenmachine.wordpress.com/2007/11/11/scraping-with-firebug-and-wwwmechanize/

regards,
gm

On Tue, Mar 10, 2009 at 4:12 PM, <mechanize-users-request at rubyforge.org> wrote:
>
> Today's Topics:
>
>   1. Re: problem scrapping ATnT site (Matt White)
>   2. need guidance on following links to download files (Reid Thompson)
>   3. Re: problem scrapping ATnT site (subhransu behera)
>   4. Mechanize, history and memory (barsalou)
>   5. [ANN] mechanize 0.9.2 Released (Aaron Patterson)
>   6. weird problem with cookies (Harm Aarts)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Tue, 24 Feb 2009 07:18:09 -0800 (PST)
> From: Matt White <whitethunder922 at yahoo.com>
> Subject: Re: [Mechanize-users] problem scrapping ATnT site
> To: Ruby Mechanize Users List <mechanize-users at rubyforge.org>
> Message-ID: <190289.59284.qm at web53309.mail.re2.yahoo.com>
> Content-Type: text/plain; charset="us-ascii"
>
> One thing to be aware of is that Mechanize doesn't interpret JavaScript. If
> the page changes dynamically as you select things on the page, Mechanize
> will not recognize these changes. If this is the problem you are having, you
> will have to have the script do whatever the JavaScript is doing to get
> everything right.
>
> Matt White
>
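As a rough illustration of "having the script do whatever the JavaScript is doing": if the page's script ends up sending a form POST, the same request can be built by hand. The sketch below uses only Ruby's standard library and sends nothing over the network. The endpoint and the field names (`billPeriod`, `downloadFormat`) are hypothetical placeholders; the real ones would have to be read from the browser while clicking through the page manually.

```ruby
require 'net/http'
require 'uri'

# Hypothetical endpoint and field names, for illustration only. The real
# values must be observed in the browser (for example with Firebug).
download_uri = URI.parse('https://example.com/billing/download')

# Build the POST that the page's JavaScript would otherwise have triggered.
request = Net::HTTP::Post.new(download_uri.request_uri)
request.set_form_data('billPeriod' => '2009-01', 'downloadFormat' => 'csv')

# Inspect what would be sent; nothing goes over the network here.
puts request.method
puts request.path
puts request['Content-Type']
puts request.body
```

Once the real URL and parameters are known, the equivalent request can presumably be issued through the logged-in Mechanize agent (for example with `agent.post`), so that the session cookies are sent along with it.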
>
>
>
> ________________________________
> From: subhransu behera <arya.subhransu at gmail.com>
> To: mechanize-users at rubyforge.org
> Sent: Tuesday, February 24, 2009 1:32:08 AM
> Subject: [Mechanize-users] problem scrapping ATnT site
>
> Hi,
>
> I am trying to download past call details from the AT&T site
> in CSV format.
>
> The page requires selecting the bill period and clicking a radio button.
> Clicking the "Submit" link then downloads the call summary for
> that period.
>
> I tried to do it in Mechanize in the following way, but it downloads
> the source of the page instead of the actual CSV file.
>
> # get the download page
>
> page_download = agent.get "https://www.wireless.att.com/view/billPayDownloadDetail.doview?execdownloadPage=true"
>
> # get the form for bill_period and select a bill period
>
> bill_period_form = page_download.forms[2]
> bill_period_form.field.options[2].select
>
> # click on the csv radio button
>
> download_format_form =  page_download.forms[3]
> download_format_form.radiobuttons[1].click
>
> # click on the submit link that downloads the csv file.
>
> download_file = agent.click page_download.search("a")[41]
> download_file.save_as("<path_to_file>.csv")
>
> The problems I am facing with the above code are:
>
> + Nothing special happens after selecting a particular bill period from
> the select options.
> + It downloads the page source instead of the actual CSV file.
>
> Can you suggest something? Am I missing something here?
>
> Thanks,
> Shubh
>
>
>
>
>
> ------------------------------
>
> Message: 2
> Date: Tue, 24 Feb 2009 10:19:22 -0500
> From: Reid Thompson <reid.thompson at ateb.com>
> Subject: [Mechanize-users] need guidance on following links to
>        download files
> To: mechanize-users at rubyforge.org
> Message-ID: <1235488762.32688.25.camel at raker>
> Content-Type: text/plain
>
> The script below is a mod of one I found via Google.  I'm trying to
> figure out what I'm missing in order to download the files associated
> with the links.
>
>
>
> require 'mechanize'
>
> agent = WWW::Mechanize.new
> pagent = WWW::Mechanize.new
> agent.get("http://www.daytrotter.com/songs?offset=60/")
> links = agent.page.search('a')
> # just links ending in .mp3.link
> hrefs = links.map { |m| m['href'] }.select { |u| u =~ /\.mp3\.link$/ }
> #puts hrefs
> #FileUtils.mkdir_p('daytrotter') # keep it neat
> hrefs.each { |mfile|
>    if mfile.match(/^\/download/) then next end
>    #puts mfile
>    filename = "#{mfile.split('/')[-1]}"
>    filename.gsub!('.link','')
>
>    puts "Saving #{mfile} as #{filename}"
>
>    agent.get(mfile).save_as(filename)
> }
>
> This results in output of the following format:
> Saving http://daytrotter.com/file_download/76/TwoGallants_DaytrotterSession_2.mp3.link as TwoGallants_DaytrotterSession_2.mp3
>
> I can't seem to get the final result to resolve to the actual file...
> I'd appreciate any pointers.
>
> Thanks,
> reid
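The link filtering and filename step of the script above can be checked on its own, without touching the network. This is only a sketch: `mp3_link_targets` is a hypothetical helper name, and the sample list reuses the URL from the output above plus a few made-up entries. It does not address why the `.link` URL fails to resolve to the actual file.

```ruby
# Hypothetical helper: keep hrefs ending in ".mp3.link", skip those starting
# with "/download", and derive the local filename by dropping ".link".
def mp3_link_targets(hrefs)
  hrefs.compact
       .select { |href| href =~ /\.mp3\.link\z/ }
       .reject { |href| href.start_with?('/download') }
       .map { |href| [href, File.basename(href).sub(/\.link\z/, '')] }
end

# Sample input; only the first entry comes from the message above.
sample = [
  'http://daytrotter.com/file_download/76/TwoGallants_DaytrotterSession_2.mp3.link',
  '/download/ignored.mp3.link',
  'http://daytrotter.com/about',
  nil
]

mp3_link_targets(sample).each do |href, filename|
  puts "Saving #{href} as #{filename}"
end
```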
>
>
>
> ------------------------------
>
> Message: 3
> Date: Wed, 25 Feb 2009 00:53:36 +0530
> From: subhransu behera <arya.subhransu at gmail.com>
> Subject: Re: [Mechanize-users] problem scrapping ATnT site
> To: Ruby Mechanize Users List <mechanize-users at rubyforge.org>
> Message-ID:
>        <8f00add50902241123r403fc219ua5f30a9110b6e615 at mail.gmail.com>
> Content-Type: text/plain; charset="iso-8859-1"
>
> Hi Matt,
>
> I did exactly what you suggested. And now it works as expected.
> Thanks a ton buddy!
>
> Regards,
> Shubh
>
>
>
>
> --
> Innovator, Pune - India
> Phone       : (+91)-98605-59976
> Blog          : http://sbehera.livejournal.com/
>
> ------------------------------
>
> Message: 4
> Date: Tue, 24 Feb 2009 13:24:53 -0900
> From: barsalou <barjunk at attglobal.net>
> Subject: [Mechanize-users] Mechanize, history and memory
> To: Ruby Mechanize Users List <mechanize-users at rubyforge.org>
> Message-ID: <20090224132453.4yf9cr4so4ckw84g at lcgalaska.com>
> Content-Type: text/plain; charset=ISO-8859-1; DelSp="Yes";
>        format="flowed"
>
> I recently wrote a script to read a web page over and over.  I ran
> into an issue where the script would stop for a seemingly unknown
> reason.
>
> Turns out "browser history" was continually growing.
>
> The answer of course is to set agent.max_history to some lower number,
> in my case one.
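To see why capping the history helps, here is a small stand-alone sketch of a bounded history. `BoundedHistory` is an illustrative class written for this note, not Mechanize's own history implementation; in a real script the fix is the one-line setting mentioned above (`agent.max_history = 1`).

```ruby
# Illustrative only: a capped page history, not Mechanize's own History class.
class BoundedHistory
  def initialize(max_size = nil)
    @max_size = max_size # nil means "never discard", the case that keeps growing
    @pages = []
  end

  # Remember a page, dropping the oldest entries once the cap is exceeded.
  def push(page)
    @pages << page
    @pages.shift while @max_size && @pages.size > @max_size
    self
  end

  def size
    @pages.size
  end
end

unbounded = BoundedHistory.new
capped    = BoundedHistory.new(1)

# Simulate a script that reads a page over and over.
1000.times do |i|
  unbounded.push("page #{i}")
  capped.push("page #{i}")
end

puts "unbounded: #{unbounded.size}, capped: #{capped.size}"
```

With no cap, every fetched page stays referenced and cannot be garbage collected, which is consistent with the memory growth described in the message.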
>
> Have you ever considered implementing a warning or changing the
> default for max_history to something that won't eat up memory?
>
> Maybe a note in GUIDE.txt?
>
> I haven't tested 0.9.1 yet, so you may have changed the default...but
> the docs for 0.9.1 don't seem to be very specific about that.
>
> I'll provide a patch, but wanted to see which way you'd want to go.
>
> Mike B.
>
>
>
>
> ------------------------------
>
> Message: 5
> Date: Thu, 5 Mar 2009 09:54:43 -0800
> From: Aaron Patterson <aaron at tenderlovemaking.com>
> Subject: [Mechanize-users] [ANN] mechanize 0.9.2 Released
> To: Seattle Ruby Brigade! <ruby at zenspider.com>,
>        ruby-talk at ruby-lang.org,        mechanize-users at rubyforge.org
> Message-ID: <20090305175443.GA5166 at Jordan2.local>
> Content-Type: text/plain; charset=us-ascii
>
> mechanize version 0.9.2 has been released!
>
> * <http://mechanize.rubyforge.org/>
> * <http://github.com/tenderlove/mechanize/tree/master>
>
> The Mechanize library is used for automating interaction with websites.
> Mechanize automatically stores and sends cookies, follows redirects,
> can follow links, and submit forms.  Form fields can be populated and
> submitted.  Mechanize also keeps track of the sites that you have visited
> as a history.
>
> Changes:
>
> ### 0.9.2 / 2009/03/05
>
> * New Features:
>  * Mechanize#submit and Form#submit take arbitrary headers (thanks
>    penguincoder)
>
> * Bug Fixes:
>  * Fixed a bug with bad cookie parsing
>  * Form::RadioButton#click unchecks other buttons (RF #24159)
>  * Fixed problems with Iconv (RF #24190, RF #24192, RF #24043)
>  * POST parameters should be CGI escaped
>  * Made Content-Type match case insensitive (Thanks Kelly Reynolds)
>  * Non-string form parameters work
>
> --
> Aaron Patterson
> http://tenderlovemaking.com/
>
>
> ------------------------------
>
> Message: 6
> Date: Tue, 10 Mar 2009 20:12:31 +0100
> From: Harm Aarts <harmaarts at gmail.com>
> Subject: [Mechanize-users] weird problem with cookies
> To: mechanize-users at rubyforge.org
> Message-ID:
>        <ef362f8e0903101212i73b1b65ehb0938a80104d59d3 at
mail.gmail.com>
> Content-Type: text/plain; charset="iso-8859-1"
>
> Hi,
>
> I wrote a script to log in to some page. This works fine on my local machine,
> but fails on my server. Both run Mechanize 0.9.0 and Nokogiri 1.2.1.
> Turning debugging on, I get this pastie for the succeeding run:
> http://pastie.org/413049. For the failing run I get this one:
> http://pastie.org/413052. Note the absence of the cookie request-header in
> the failing run (from the succeeding run):
> D, [2009-03-10T19:22:51.409080 #99291] DEBUG -- : request-header: cookie =>
> orangeSessionID=SID%3D178CF332CDB2DAAB051AB16E7A675073227EE45A09DD8AF280A0BA2D64D03E2782EF51FFB3E0756D44AEC76F18668B182179A1A2F06C6C9D4B976C4A322EF6CF%26SID1%3DBA218F59EEEF8F541F9464732E16A148
>
> How is this possible? Both save the cookie created in the previous request:
> D, [2009-03-10T19:22:51.389433 #99291] DEBUG -- : saved cookie:
> orangeSessionID=SID%3D178CF332CDB2DAAB051AB16E7A675073227EE45A09DD8AF280A0BA2D64D03E2782EF51FFB3E0756D44AEC76F18668B182179A1A2F06C6C9D4B976C4A322EF6CF%26SID1%3DBA218F59EEEF8F541F9464732E16A148
>
> Where does Mechanize save its cookies? Maybe it is a permissions issue? And
> how does it determine when to send the cookie header?
> I am at a loss and any help would be appreciated.
>
> With kind regards,
> Harm
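On the question of when a cookie header is sent: HTTP clients generally attach a stored cookie only if it has not expired and its domain and path match the request URL. The sketch below is a simplified version of those general rules, written as an assumption for illustration; it is not Mechanize's actual cookie jar code, and the `Cookie` struct, helper names, and example.com URLs are all hypothetical.

```ruby
require 'uri'

# Simplified sketch of common cookie matching rules. This is an assumption
# for illustration, not Mechanize's actual cookie jar implementation.
Cookie = Struct.new(:name, :domain, :path, :expires)

# A host matches if it equals the cookie domain or is a subdomain of it.
def domain_match?(host, cookie_domain)
  host = host.downcase
  domain = cookie_domain.downcase.sub(/\A\./, '')
  host == domain || host.end_with?(".#{domain}")
end

# A path matches if the cookie path is a prefix ending on a "/" boundary.
def path_match?(request_path, cookie_path)
  request_path = '/' if request_path.empty?
  return true if request_path == cookie_path
  return false unless request_path.start_with?(cookie_path)
  cookie_path.end_with?('/') || request_path[cookie_path.length] == '/'
end

# A cookie is sent only if it is unexpired and both domain and path match.
def send_cookie?(cookie, url, now = Time.now)
  uri = URI.parse(url)
  return false if cookie.expires && cookie.expires <= now
  domain_match?(uri.host, cookie.domain) && path_match?(uri.path, cookie.path)
end

session = Cookie.new('orangeSessionID', '.example.com', '/', nil)
puts send_cookie?(session, 'http://www.example.com/login')
puts send_cookie?(session, 'http://other.test/login')
```

Because the expiry check depends on the local clock, comparing the system time on the two machines may be a cheap thing to rule out when a cookie is saved but never sent.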
>
> ------------------------------
>
> _______________________________________________
> Mechanize-users mailing list
> Mechanize-users at rubyforge.org
> http://rubyforge.org/mailman/listinfo/mechanize-users
>
> End of Mechanize-users Digest, Vol 24, Issue 1
> **********************************************
>


-- 
More cowbell, please !