Hi,

I want to grab some information about university names, and I found the term "web scraping". I searched for it on Google, and there are tools for it in Ruby. One of them is Nokogiri, but I'm a bit confused, because it seems that it only gets information that is already in an HTML or XML document.

I found a web page that has a list of university names in a <select> </select> (HTML tag), and I want to grab that information.

The question is: can I do that with Nokogiri or another tool? The list is like a country list, but with the names of the universities of my country.

It seems that the page gets that information from a DB using Ajax, and what I'm trying to do may not be legal or possible.

I'd really appreciate it if someone could help me understand what this tool is used for, and whether what I'm trying to do is possible.

Thanks

Javier Q

--
You received this message because you are subscribed to the Google Groups "Ruby on Rails: Talk" group.
To post to this group, send email to rubyonrails-talk-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To unsubscribe from this group, send email to rubyonrails-talk+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
For more options, visit this group at http://groups.google.com/group/rubyonrails-talk?hl=en.

On Mon, Dec 5, 2011 at 4:05 PM, JavierQQ <jquarites-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:

> I want to grab some information about university names, and I found
> this term called "web scraping"
> [...]
> The question is... can I do that with nokogiri or another tool?

Hi,

Take a look at these screencasts and articles:

http://railscasts.com/episodes?utf8=%E2%9C%93&search=mechanize
http://railscasts.com/episodes/190-screen-scraping-with-nokogiri
http://www.engineyard.com/blog/2010/getting-started-with-nokogiri/

With Nokogiri, you can use CSS3 selectors to grab the information you want.

Best Regards,
Everaldo

On Dec 5, 2011, at 1:05 PM, JavierQQ wrote:

> I want to grab some information about university names, and I found
> this term called "web scraping"
> I search about it in google, and there are tools in ruby.
> One of them is nokogiri but I'm a bit confused because it seems that
> it only gets information that its already in an html or xml

Yes, Nokogiri is a toolkit for (among lots of other things) running XPath or CSS queries against a text file. That text can come from anywhere: an IO stream of one sort or another with textual data in it will do.

> I found a webpage that have a list of university names as a
>
> <select> </select> (html label)
>
> and I want to grab that information
>
> The question is... can I do that with nokogiri or another tool?
> The list is like a country list, but with the names of the
> universities of my country.

A select can be traversed like any other DOM object. This should be fairly close:

    # given doc is a Nokogiri::XML or Nokogiri::HTML document
    doc.css('#yourPickerId option').each do |opt|
      foo = opt['value']
      # whatever else you want to do with foo here
    end

> It seems that it get that information from an DB using ajax, and what
> I'm trying to do may not be legal or possible

If it's Ajax, you'll need to run a JavaScript interpreter against it. Rails 3.1 shows the way to do that server-side. Once you have munged the page into a text stream that includes the desired data (flattened it down to the result of the Ajax call plus the base code), then Nokogiri or Hpricot or any other XML/HTML parser can rip through that DOM and give you individual nodes to play with.

> I'll really appreciate if someone can help me to understand what this
> tool is used for, and if what I'm trying to do is possible

Possible, sure. It's never entirely clear why someone would run an Ajax request to populate a page. They may have done it to keep scrapers (like you) out, or they may have done it to isolate and accelerate a laggy part of the initial page load. If the latter (so they aren't actually discouraging you -- did you ask them if you could do this?), then you might also want to look into loading the endpoint of that Ajax request instead of the surrounding page, as that would eliminate the JavaScript layer entirely. You'd have one HTTP request, and unless that endpoint is restricted to only accept requests from within its own domain, you would likely get JSON or some other structured data in return, and that could be even easier to interpret in your application.

Walter
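
A minimal sketch of the endpoint approach described above, using only Ruby's standard library. The endpoint URL is not known here, and the record shape (a JSON array of objects with a "name" key) is purely an assumption for illustration; the real response would have to be inspected first.

```ruby
require 'json'
require 'net/http'
require 'uri'

# Fetch the raw body of the Ajax endpoint with a single GET request.
# The URL is whatever the page's JavaScript calls (hypothetical here).
def fetch_endpoint(url)
  Net::HTTP.get(URI(url))
end

# Turn a JSON array of records into a list of names.
# Assumes each record is an object with a "name" key (hypothetical shape).
def university_names(json_text)
  JSON.parse(json_text).map { |record| record['name'] }
end

# Stand-in for a real response body, so the parsing step can be tried offline.
sample = '[{"id": 1, "name": "Universidad A"}, {"id": 2, "name": "Universidad B"}]'
puts university_names(sample)
```

With a real endpoint, the two steps would chain as university_names(fetch_endpoint(url)), and no HTML parsing would be needed at all.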

On 5 dic, 13:32, Walter Lee Davis <wa...-HQgmohHLjDZWk0Htik3J/w@public.gmane.org> wrote:

> A select can be traversed like any other DOM object, this should be fairly close:
>
>     # given doc is a Nokogiri::XML or Nokogiri::HTML document
>     doc.css('#yourPickerId option').each do |opt|
>       foo = opt['value']
>       # whatever else you want to do with foo here
>     end

Thanks. In the Nokogiri example the result is read with something like "link.content", and that's why I was wondering how I can grab that information from the select group.

> [...] then you might also want to look into loading the endpoint of that
> Ajax request instead of the surrounding page [...] you would likely have
> JSON or some other structured data in return, and that could be even
> easier to interpret in your application.
>
> Walter

You mean that in order to make a better application I have to deliver the information as JSON?

I'm kind of new to Rails (not a complete newbie, but... sort of :D)

Thanks for your help

Javier Q

On Dec 5, 2011, at 1:55 PM, JavierQQ wrote:

> Thanks, in nokogiri example the result is like "link.content" and
> that's why I wondering how I can grab that information from the select
> group

There are some basic things you can do with nodes once you find them. content() spills out the textual content of any node (in the case of an option, that might give you the same thing as the Option.text attribute in JavaScript, but I wouldn't count on it specifically). In the case of a div, for example, content would give you the textual content of that div, minus any HTML tags, while inner_html would give you the actual HTML code defining all of the content tags as well as their text content.

For everything else, you access any named attribute on the given node simply by putting the name of the attribute in as a key: my_option['label'] or my_option['value'] or my_option['selected'], for example. Nokogiri's [] method looks the attribute up on the node and gives you its value if it's available (nil otherwise).

> You mean that in order to make a better application I have to deliver
> the information as JSON ?

I have seen this technique used for this reason, by splitting the application load over time on the same server or across servers. But then I would just throw a caching layer at the problem. Much less heartache. I've also seen this technique used to obfuscate the data source, or simply to integrate third-party data sources into an existing site.

> I'm kind of new with rails (not a completly newbie but... sort of :D )

Me too, but I've done quite a lot of Nokogiri recently, so it's all fairly fresh.

Walter

Hi,

It's me again. I was doing some easy examples and they worked, but now I've run into some trouble. Is there a way to give Nokogiri data such as a username and password? On one website I have to log in first. Scrapy provides a way to simulate a user login, and I was wondering if Nokogiri can do the same.

Javier

You wouldn't do it at the Nokogiri level. You need to read up on the open-uri library; there are all sorts of goodies in there to manage authentication, sessions, and everything else needed to create a Web client. That layer of your application gets the text stream that you then send on to Nokogiri. There's nothing in Nokogiri that is specific to solving that problem. It starts from the assumption that you have a text file locally or a stream from another client like open-uri.

Walter

On Dec 6, 2011, at 10:21 AM, JavierQQ wrote:

> Hi,
> It's me again, I was doing some easy example and it worked... but now
> I've got some trouble
> Is there a way to provide nokogiri data such as username and password?
> because in a web I have to login first
> Scrapy gives a way to simulate user login, and I was wonderin if
> nokogiri can do the same
>
> Javier
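
A rough sketch of that layering with open-uri from the standard library. The helper names fetch_html and open_uri_options are made up for illustration, and this covers only HTTP Basic authentication through open-uri's :http_basic_authentication option; a form-based login would need a different client. On current Ruby versions the call is URI.open rather than the bare open used elsewhere in this thread.

```ruby
require 'open-uri'

# Build the option hash for open-uri. Credentials are optional; when present
# they are passed through open-uri's :http_basic_authentication option.
# (Hypothetical helper, not part of open-uri.)
def open_uri_options(user: nil, pass: nil)
  user ? { http_basic_authentication: [user, pass] } : {}
end

# Fetch a page and return its body as a String. This is the "Web client"
# layer; the returned text is what gets handed to the HTML parser.
# (Hypothetical helper, not part of open-uri.)
def fetch_html(url, user: nil, pass: nil)
  URI.open(url, open_uri_options(user: user, pass: pass), &:read)
end

# Usage (not executed here, since it needs a real URL and credentials):
#   html = fetch_html('https://example.com/private', user: 'me', pass: 'secret')
#   doc  = Nokogiri::HTML(html)
```

Keeping the fetch in its own method means the credentials, or the whole HTTP client, can be swapped without touching the parsing code.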

It seems that :http_basic_authentication => [user, pass] no longer works. I've tested it with 2 websites and got nothing. Is there any other way?

Thanks

Javier

Can you post some code surrounding this, and show the open-uri method call you're using?

Walter

On Dec 6, 2011, at 11:28 AM, Javier Quarite wrote:

> It seems that :http_basic_authentication [user, pass]
> no longer works, I've tested with 2 webs and nothing,
> Is there any other way?
>
> Thanks
>
> Javier

On Tue, Dec 6, 2011 at 11:58 AM, Walter Lee Davis <waltd-HQgmohHLjDZWk0Htik3J/w@public.gmane.org> wrote:

> Can you post some code surrounding this, show the open-uri method call
> you're using?
>
> Walter

    require 'nokogiri'
    require 'open-uri'

    doc = Nokogiri::HTML(open(url, :http_basic_authentication => [user, pass]))
    doc.xpath('//select/option').each do |opt|
      puts opt.content
    end

I can grab some info from the main page of the URL (so it works), but when I go to its login page with user/pass and try to get some, it seems to get information from some other place (I'm not even sure from where).

Javier

> doc = Nokogiri::HTML(open(url, :http_basic_authentication => [user, pass]))

I made a mistake; that was another file. What I'm using is:

    open(url, :http_basic_authentication => [user, pass])
    doc = Nokogiri::HTML(open(url))

Javier

On Dec 6, 2011, at 12:17 PM, Javier Quarite wrote:

> I grab some info from tha main page of the url (so it works) but when I enter to its login page with user/pass and try to get some, it seems to get information from other place (I'm not even sure from where)

Try all this out in a terminal with telnet or cURL, and see where you're actually going when you log in. You may be redirected in some subtle way. Also, a browser may throw up what looks like a "basic authentication" dialog box when you're actually being challenged for digest authentication. :http_basic_authentication is not the same thing.

I think your real solution here will be to abstract out the open() bit inside the Nokogiri::HTML() call. Look for a gem that accepts a URL, returns a text stream, and offers a whole bunch of configuration options for authentication. I am certain there are at least a handful of them out there. By separating your concerns in this way, you'll end up with a more modular solution, so that you can swap in different credentials for each site you're scraping.

Walter
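
The same check can be scripted instead of done by hand with telnet or cURL. This is only a sketch using Net::HTTP from the standard library; describe_response and probe are hypothetical helper names. It makes one request without following redirects and reports either the redirect target or the authentication scheme the server asks for (Basic or Digest).

```ruby
require 'net/http'
require 'uri'

# Summarise what the server answered: a redirect target, the authentication
# scheme it challenges for (e.g. Basic or Digest), or plain success.
# (Hypothetical helper for illustration.)
def describe_response(response)
  case response
  when Net::HTTPRedirection
    "redirect to #{response['location']}"
  when Net::HTTPUnauthorized
    scheme = response['www-authenticate'].to_s.split.first
    "authentication required (#{scheme || 'unknown scheme'})"
  when Net::HTTPSuccess
    'ok'
  else
    "unexpected status #{response.code}"
  end
end

# Issue a single GET without following redirects, so each hop stays visible.
# (Hypothetical helper for illustration.)
def probe(url)
  uri = URI(url)
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    describe_response(http.request(Net::HTTP::Get.new(uri)))
  end
end

# Usage (not executed here, since it needs a real URL):
#   puts probe('http://example.com/login')
```

If the answer is a redirect, the login page probably uses an HTML form and a session cookie rather than HTTP authentication, in which case no open-uri authentication option will help.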

Hi,

> The question is... can I do that with nokogiri or another tool?
> The list is like a country list, but with the names of the
> universities of my country.

Besides Nokogiri, there is another tool called Hpricot.

> It seems that it get that information from an DB using ajax, and what
> I'm trying to do may not be legal or possible

Yes, it is possible. See some examples that I tried with Nokogiri and Ruby:

Nokogiri
http://sathia27.wordpress.com/2011/09/06/tbus-version-1-search-bus-routes-from-terminal/
http://sathia27.wordpress.com/2011/12/05/english-to-tamil-translator-script/

Hpricot
http://sathia27.wordpress.com/2010/10/29/learned-ruby-and-hpricot/

--
Regards
sathia
Here I share my experiments with open source.
http://www.sathia27.wordpress.com
http://www.lquery.com