thr3ads.net - Rails - [Rails] Extracting URL and text from HTML? [Feb 2006]

softwareengineer 99

2006-Feb-18 07:59 UTC

[Rails] Extracting URL and text from HTML?

For an application I am working on I have to extract URLs and the text used to
link.
  
  For example,
  
  ..... <a href="http://www.rubyonrails.org" title="rails" >Ruby on Rails</a>....
  
  I have been trying all night but cannot come up with the regular expression
needed to extract the URLs and the text.
  
  I have tried:
  
  
  myurls=response.scan(/href\s*=\s*["'](http|https)(.*)["']\s*.*>(.*)<\/a>/)
  
  However I am left with :
  
  ://domain.com/filename" rel="tag
  
  and 
  
  ://domain.com/filename " title="permanent link
  
  Can anyone please help me specify how to extract everything up to the next
single or double quote character? Or how else can I go about extracting the URL
and the linked text?
  
  I will greatly appreciate it.
  
  Thanks
  Frank
  
		

mehryar

2006-Feb-18 08:26 UTC

[Rails] Extracting URL and text from HTML?

irb> response = %{Here is some link <a href="http://www.rubyonrails.org" title="rails" >Ruby on Rails</a>
and <a href="http://www.google.com">Google ofcourse</a>
and <a href="ftp://www.foo.bar" title="bar">Foo!</a>}

irb> response.scan(/href="([^"]+)".*?>([^>]+)</)
=> [["http://www.rubyonrails.org", "Ruby on Rails"],
["http://www.google.com", "Google ofcourse"],
["ftp://www.foo.bar", "Foo!"]]

what you're looking for is the negated character class, so
href="([^"]+)"
       ^^^^^
       match anything that is not a double quote, all the way until you bump
into one.

and similarly
>([^>]+)<
  ^^^^^
  match the text between a > and the next <, as long as it contains no other >
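
A minimal sketch of the point above, also covering the single-or-double-quote case from the original question. The sample markup, the variable names and the LINK constant are made up for illustration; the pattern assumes reasonably tidy HTML and is not a replacement for a real HTML parser.

```ruby
# Made-up sample input: one double-quoted href and one single-quoted href.
html = %q{<a href="http://domain.com/filename" rel="tag">First</a> } +
       %q{<a href='http://www.rubyonrails.org' title="rails" >Ruby on Rails</a>}

# Greedy .* runs on to the LAST double quote it can reach, so the capture
# swallows the attributes that follow the URL (the symptom in the question).
greedy = html[/href="(.*)"/, 1]

# The negated class stops at the first double quote after the URL starts.
negated = html[/href="([^"]+)"/, 1]

# Accept either quote style: capture the opening quote in group 1 and
# require the same character to close the value via the \1 backreference.
# Lazy .*? keeps both the URL and the link text as short as possible.
LINK = /<a\b[^>]*?\bhref\s*=\s*(["'])(.*?)\1[^>]*>(.*?)<\/a>/mi

# scan returns [quote, url, text] per match; drop the quote character.
links = html.scan(LINK).map { |_quote, url, text| [url, text.strip] }

p negated
# => "http://domain.com/filename"
p links
# => [["http://domain.com/filename", "First"], ["http://www.rubyonrails.org", "Ruby on Rails"]]
```

The backreference matters because a URL quoted with ' may legally contain a " (and the other way round), so the value should end only at the same character that opened it.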


cheers,
-Mehryar


-------------------------------------------------------
... with proper design, the features come cheaply. This
approach is arduous, but continues to succeed.
                                     ---Dennis Ritchie

softwareengineer 99

2006-Feb-18 09:16 UTC

[Rails] Extracting URL and text from HTML?

Hello Mehryar,
  This works like a charm :)
  
  Thank you so much. I really appreciate it.
  
  Frank
  
  
  
_______________________________________________
Rails mailing list
Rails@lists.rubyonrails.org
http://lists.rubyonrails.org/mailman/listinfo/rails


		
