Felipe Jordão A. P. Mattosinho
2010-Jan-25 05:29 UTC
[Mechanize-users] Does Amazon.com block scraping?
Hi there,

Does anyone know if Amazon.com has any sort of server-side script that tries to block scraping activities? I first noticed that if I didn't change the agent alias, it would fetch a page exactly like the normal one, but without the initial search field (maybe a silly way to prevent scraping). Then I changed to some other alias and submitted a search. I got the result page as a response, but right after getting the page, I received a message that Amazon.com had closed my connection, and it redirected me somewhere else.

If anyone has succeeded with Amazon.com, circumventing this protection, please send me some info.

Regards,

Felipe
Jimmy McGrath
2010-Jan-25 12:35 UTC
Re: [Mechanize-users] Does Amazon.com block scraping?
Hi Felipe,

I was unable to reproduce your "Amazon closing the connection" issue. Could you perhaps post a sequence of commands from irb that can consistently reproduce it?

Also, what happens if you change the user agent alias before the first connection with Amazon? If you do not experience the disconnection problem, is there any reason why changing the user agent before first contact is not a satisfactory solution?

Cheers,
-Jimmy
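For what it's worth, here is a minimal irb-style sketch of the ordering Jimmy suggests, with the alias set before anything is fetched. It assumes the 0.9-era WWW::Mechanize namespace used elsewhere in this thread; nothing in it is Amazon-specific beyond the form name Felipe reports.

require 'rubygems'
require 'mechanize'

agent = WWW::Mechanize.new

# Set the alias BEFORE the first request, so the site never sees the
# default "WWW-Mechanize/x.y" User-Agent string.
agent.user_agent_alias = 'Mac Safari'

page = agent.get("http://www.amazon.com")

# If the alias took effect, the home page should contain the search
# form Felipe mentions ("site-search").
puts(page.form("site-search") ? "search form present" : "search form missing")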
Jimmy McGrath
2010-Jan-25 21:16 UTC
Re: [Mechanize-users] Does Amazon.com block scraping?
Hi Felipe,

Just had another thought, and it's probably something you've already considered, so I apologise if I'm pointing out the obvious, but have you checked out Amazon's A2S (formerly ECS) web services?

https://affiliate-program.amazon.com/gp/advertising/api/detail/main.html

It's been a while since I played with this stuff, but it provides the ability to search for Amazon products and list details, reviews, etc. I used to use the following gem:

http://www.pluitsolutions.com/projects/amazon-ecs

But I'm not sure if that is still the one to beat (there were a number of them some years ago). Of course, these services may not do what you need, but I thought it worth suggesting just in case.

Cheers.
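For reference, usage of that amazon-ecs gem looked roughly like the sketch below. This is reconstructed from memory of the gem's README, so the exact option keys, response-group names, and element paths should be checked against its documentation; the access key is a placeholder, as A2S requires a registered AWS account.

require 'rubygems'
require 'amazon/ecs'

# Placeholder credential; A2S requires a registered AWS access key.
Amazon::Ecs.options = { :aWS_access_key_id => 'YOUR_ACCESS_KEY_ID' }

# Search for a product and print each item's title. A response group
# such as 'Reviews' was the route to customer reviews, which is what
# Felipe is ultimately after.
res = Amazon::Ecs.item_search('Nikon Coolpix P90', :response_group => 'Small')
res.items.each do |item|
  puts item.get('itemattributes/title')  # element path as remembered from the README
end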
Felipe Jordão A. P. Mattosinho
2010-Jan-26 18:44 UTC
[Mechanize-users] Does Amazon.com block scraping?
Thanks, Jimmy.

First, thanks for answering. About the web service: I know about it; however, the thing is that I am doing a project at my university where I need to scrape content, or to be more specific, product reviews. So I just used Amazon.com as an example; I would never scrape something when I have a service that can give me everything I need. The point is to use a scraping technique! But thanks anyway for your suggestion.

Well, besides that, these are the commands:
require 'rubygems'
require 'mechanize'   # the 0.9-era gem, WWW::Mechanize namespace
require 'pp'

@@mech = WWW::Mechanize.new

# If I don't set the alias here, I receive the same main page but
# without the search field.
@@mech.user_agent_alias = 'Mac Safari'

page = @@mech.get("http://www.amazon.com")

search_form = page.form("site-search")
search_form["field-keywords"] = "Nikon Coolpix P90"

@page = @@mech.submit(search_form, search_form.buttons.first)

pp @page

# After printing the page I confirm that I received the page I was
# expecting; however, on the console I get this:
#   ActionController::RoutingError (No route matches
#   "/aan/2009-09-09/static/amazon/iframeproxy-1.html" with {:method=>:get})

# Now, when I want to get the first match in the result page from this
# XPath, I receive a null object (which is strange to me, since I have
# the content stored in the @page variable).
@match =
  @page.search("/html/body/div[4]/div/div/div[2]/div[3]/div/div/div[3]/div/a")

# Here @match is nil, which sounds strange to me.
Now a question concerning the mailing list itself: is there a way to receive both single messages and the digest? Or if I enable digest mode, do I stop receiving single messages?

Felipe
Apologies to the list if this is the second time this message is sent; I had an issue with my mail client and I think my original attempt failed.

------------------------------------------------------------------------

No worries, Felipe. I thought you would probably have been on the case already about the Amazon web services, but it couldn't hurt to suggest it just in case.

Now, on to your issue. You say you are getting the following on the console:

> ActionController::RoutingError (No route matches
> "/aan/2009-09-09/static/amazon/iframeproxy-1.html" with {:method=>:get})

This screams Rails to me. I doubt it has anything to do with Mechanize at all (anyone with more experience with Mechanize, please feel free to correct me). Are you sure you are not making a request to the Rails framework somewhere?

As to the command:

> @match = @page.search("/html/body/div[4]/div/div/div[2]/div[3]/div/div/div[3]/div/a")

If I perform the following search:

@page.search("/html/body/div[4]/div")

I get an empty array as a response, so that search argument is borked. I don't think that is the best way to get a link out of the page anyway, as an absolute XPath like that will be very brittle. I think you should look into the Nokogiri documentation for a way to search for links with a specific CSS class. I can't tell you off the top of my head how to do it, and I need to get back to work, so I'll have to leave you to work it out yourself.

BTW, in answer to the original question, "Does Amazon block scraping?": I don't think they are attempting to block scraping at all; more likely they don't recognise the Mechanize user agent string and get confused. It would be an interesting exercise to use the user agent switcher plug-in in Firefox, set the agent to Mechanize, and see how Amazon renders in Firefox, but as I said, I have to get back to work.

Ciao.
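To illustrate the class-based approach Jimmy points at: the sketch below is untested, the CSS class name is invented for illustration (inspect the live Amazon result page for the real markup), and links_with assumes Mechanize 0.9 or newer.

# Instead of a brittle absolute XPath, search by element and class.
# "productTitle" is a made-up example class; check the actual page.
@match = @page.search("div.productTitle a").first

# Note: Nokogiri's search returns an empty NodeSet rather than nil
# when nothing matches, so test with .empty? while debugging.

# Mechanize's own link helpers avoid XPath entirely:
link = @page.links_with(:href => /coolpix/i).first
link.click if link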