Felipe Jordão A. P. Mattosinho
2010-Jan-25 05:29 UTC
[Mechanize-users] Does Amazon.com block scraping?
Hi there,

Does anyone know if Amazon.com has any sort of server-side script that tries to block scraping activities? I first noticed that if I didn't change the agent alias, it would fetch a page exactly like the normal one, but without the initial search field (maybe a silly way to prevent scraping). Then I changed to some other alias and submitted a search. I got the result page as a response, but right after getting the page, I received a message that Amazon.com had closed my connection, and it redirected me somewhere else.

If anyone has succeeded in circumventing this protection on Amazon.com, please send me some info.

Regards,

Felipe
Hi Felipe,

I was unable to reproduce your "Amazon closing the connection" issue. Could you perhaps post a sequence of commands from irb that consistently reproduces it?

Also, what happens if you change the user agent alias before the first connection to Amazon? If you do not experience the disconnection problem, is there any reason why changing the user agent before first contact is not a satisfactory solution?

Cheers,
-Jimmy
Hi Felipe,

Just had another thought. It's probably something you've already considered, so I apologise if I'm pointing out the obvious, but have you checked out Amazon's A2S (formerly ECS) web services?

https://affiliate-program.amazon.com/gp/advertising/api/detail/main.html

It's been a while since I played with this stuff, but it provides the ability to search for Amazon products and list details, reviews, etc. I used to use the following gem:

http://www.pluitsolutions.com/projects/amazon-ecs

But I'm not sure if that is still the one to beat (there were a number of them some years ago). Of course, these services may not do what you need, but I thought it worth suggesting just in case.

Cheers.
Felipe Jordão A. P. Mattosinho
2010-Jan-26 18:44 UTC
[Mechanize-users] Does Amazon.com block scraping?
Thanks Jimmy,

First, thanks for answering. About the web service: I know about it, but the thing is that I am doing a project at my university where I need to scrape content, or to be more specific, product reviews. I just used Amazon.com as an example; I would never scrape a site when there is a service that can give me everything I need. The point is precisely to use a scraping technique! But thanks anyway for your suggestion.

Well, besides that, these are the commands:

@@mech = WWW::Mechanize.new

# If I don't set the alias here, I receive the same main page but
# without the search field.
@@mech.user_agent_alias = 'Mac Safari'

page = @@mech.get("http://www.amazon.com")
search_form = page.form("site-search")
search_form["field-keywords"] = "Nikon Coolpix P90"
@page = @@mech.submit(search_form, search_form.buttons.first)

pp @page

# After printing the page I confirm that I received the page I was
# expecting; however, on the console I get this:
# ActionController::RoutingError (No route matches
# "/aan/2009-09-09/static/amazon/iframeproxy-1.html" with {:method=>:get})

# Now when I try to get the first match in the result page with this
# XPath, I receive a nil object (which is strange to me, since I have
# the content stored in the @page variable):
@match = @page.search("/html/body/div[4]/div/div/div[2]/div[3]/div/div/div[3]/div/a")

# Here @match is nil, which seems strange to me.

Now a question concerning the mailing list itself: is there a way to receive both single messages and the digest? Or if I enable digest mode, do I stop receiving single messages?
Felipe
Apologies to the list if this is the second time this message is sent; I had an issue with my mail client and I think my original attempt failed.

No worries Felipe, I thought you would probably already be on the case about the Amazon web services, but it couldn't hurt to suggest it just in case.

Now, on to your issue. You say you are getting the following on the console:

> ActionController::RoutingError (No route matches
> "/aan/2009-09-09/static/amazon/iframeproxy-1.html" with {:method=>:get})

This screams Rails to me. I doubt it has anything to do with Mechanize at all (anyone with more Mechanize experience, please feel free to correct me). Are you sure you are not making a request to the Rails framework somewhere?

As to the command:

> @match = @page.search("/html/body/div[4]/div/div/div[2]/div[3]/div/div/div[3]/div/a")

If I perform the following search:

@page.search("/html/body/div[4]/div")

I get an empty array as a response, so that search argument is borked. I don't think an absolute XPath is the best way to get a link out of the page anyway, as it will be very brittle. You should look into the Nokogiri documentation for a way to search for links with a specific CSS class. I can't tell you off the top of my head how to do it, and I need to get back to work, so I'll have to leave you to work it out yourself.

BTW, in answer to the original question, "Does Amazon block scraping?": I don't think they are attempting to block scraping at all; more likely they don't recognise the Mechanize user agent string and get confused. It would be an interesting exercise to use the user agent switcher plug-in in Firefox, set it to the Mechanize string, and see how Amazon renders in Firefox, but as I said, I have to get back to work.

Ciao.