For an application I am working on I have to extract URLs and the text used to link.

For example,

..... <a href="http://www.rubyonrails.org" title="rails" >Ruby on Rails</a>....

I have been trying all night but cannot come up with the regular expression needed to extract the URLs and the text.

I have tried:

myurls=response.scan(/href\s*=\s*["'](http|https)(.*)["']\s*.*>(.*)<\/a>/)

However I am left with:

://domain.com/filename" rel="tag

and

://domain.com/filename " title="permanent link

Can anyone please help me as to how I can specify to extract everything till the next single or double quote character? Or how can I go about extracting URL and the linked text?

I will greatly appreciate it.

Thanks
Frank
irb> response = %{Here is some link <a href="http://www.rubyonrails.org" title="rails" >Ruby on Rails</a> and <a href="http://www.google.com">Google ofcourse</a> and <a href="ftp://www.foo.bar" title="bar">Foo!</a>}
irb> response.scan(/href="([^"]+)".*?>([^>]+)</)
=> [["http://www.rubyonrails.org", "Ruby on Rails"], ["http://www.google.com", "Google ofcourse"], ["ftp://www.foo.bar", "Foo!"]]

What you're looking for is the negated character class, so

href="([^"]+)"
       ^^^^^
matches anything that is not a double quote, all the way until you bump into one.

And similarly

>([^>]+)<
  ^^^^^
matches the text between the > that closes the opening tag and the next <.

cheers,
-Mehryar

On Fri, 17 Feb 2006, softwareengineer 99 wrote:
> Can anyone please help me as to how I can specify to extract everything till the next single or double quote character? Or how can I go about extracting URL and the linked text?

... with proper design, the features come cheaply. This approach is arduous, but continues to succeed. ---Dennis Ritchie
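The question also asked about values delimited by either a single or a double quote, which the pattern above does not cover because it matches double quotes only. A minimal sketch of one way to handle both, assuming the markup is simple enough for a regex: capture the opening quote and reuse it through a backreference, so the URL runs until that same quote character appears again. The names LINK_PATTERN and extract_links are hypothetical and not from the thread, and the second link in the sample is switched to single quotes purely for illustration.

```ruby
# Hypothetical names for illustration; not part of the original thread.
# Group 1: the opening quote (" or '), reused via \1 to find the closing quote.
# Group 2: the href value, matched lazily so it stops at the first closing quote.
# Group 3: the link text, matched lazily up to the nearest </a>.
LINK_PATTERN = /<a\b[^>]*?\bhref\s*=\s*(["'])(.*?)\1[^>]*>(.*?)<\/a>/im

# Returns an array of [url, link_text] pairs found in the given HTML string.
def extract_links(html)
  html.scan(LINK_PATTERN).map { |_quote, url, text| [url, text.strip] }
end

response = %{Here is some link <a href="http://www.rubyonrails.org" title="rails" >Ruby on Rails</a> and <a href='http://www.google.com'>Google ofcourse</a> and <a href="ftp://www.foo.bar" title="bar">Foo!</a>}

extract_links(response).each { |url, text| puts "#{url} -> #{text}" }
```

The original attempt went wrong in two places. The separate (http|https) group captured the scheme on its own, which is why the leftover values begin with "://", and the greedy (.*) ran past the closing quote of the href to a later quote inside the same tag. The lazy (.*?) together with the backreference stops at the first matching quote instead. For messy real-world markup an HTML parser is usually more robust than a pattern like this.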
Hello Mehryar,

This works like a charm :)

Thank you so much. I really appreciate it.

Frank

mehryar <mehryar@mehryar.com> wrote:
> what you're looking for is the negation class