thr3ads.net - Mechanize users - [Mechanize-users] Mechanize/Hpricot -- Strings parsing question [Oct 2008]


Cindy Schaller

2008-Oct-31 15:50 UTC

[Mechanize-users] Mechanize/Hpricot -- Strings parsing question

Hello.

I am new to Hpricot and Mechanize, but so far I am loving it.

I am trying to parse out some text inside of an HTML page and hoped that I
could get some advice from this group.

I have the following code:

<strong>
     Wii Game for Sale<br />
     American Idol<br />
     Ad #: 12345
</strong>

I want to get each line as a separate value to insert into a database.  What
is the best way to get each line?  Can I use the HTML tags in some way as
the beginning and ends of the strings to get the values in between?

Thanks!!

Cindy


Aaron Patterson

2008-Oct-31 16:16 UTC


Hi Cindy,

On Fri, Oct 31, 2008 at 8:50 AM, Cindy Schaller <cschaller at gmail.com> wrote:
> I have the following code:
>
> <strong>
>      Wii Game for Sale<br />
>      American Idol<br />
>      Ad #: 12345
> </strong>

Assuming you have already been able to find the "strong" tag, I would
do something like this:

  strong_tag.inner_text.split(/<br[^>]*>/).map { |x| x.strip }

Hope that helps.

-- 
Aaron Patterson
http://tenderlovemaking.com/
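
For readers following along, here is a minimal sketch of the split-and-strip idiom on a plain Ruby string. The variable `markup` is a hypothetical stand-in for a string that still contains the `<br />` tags; no Hpricot call is made, so this only illustrates what the regular expression and `strip` do.

```ruby
# Hypothetical stand-in for a string that still contains the <br /> markup.
markup = "\n     Wii Game for Sale<br />\n     American Idol<br>\n     Ad #: 12345\n"

# /<br[^>]*>/ matches "<br>", "<br/>" and "<br />" alike:
# "<br", then any run of characters other than ">", then ">".
lines = markup.split(/<br[^>]*>/).map { |x| x.strip }

p lines  # => ["Wii Game for Sale", "American Idol", "Ad #: 12345"]
```

The split only finds boundaries if the tags are actually present in the string being split.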

Cindy Schaller

2008-Oct-31 21:23 UTC


Thanks Aaron.

That worked great.

Now I know that this is going to show my lack of Ruby knowledge, but I'm
still learning.

How can I parse the same set of HTML, but only get the first 2 lines and not
the third line?

My current code is this:
listing.inner_text.split(/<br[^>]*>/).map { |x| x.strip }

THANKS!!!



Matt White

2008-Nov-01 07:46 UTC


Cindy,

When you call split on this string, it splits it into 3 pieces and puts them
into an array. If you do:

array = listing.inner_text.split(/<br[^>]*>/).map { |x| x.strip }

you can then access the first line with array[0] and the second with array[1].
If you want to just discard the last item in the array, just call pop on the
array. This example should illustrate:

irb(main):001:0> string = "this is\na string\nto split"
=> "this is\na string\nto split"
irb(main):002:0> array = string.split("\n").map{|s| s.strip}
=> ["this is", "a string", "to split"]
irb(main):003:0> array.pop
=> "to split"
irb(main):004:0> array
=> ["this is", "a string"]

If this is confusing or unclear, you might try a basic Ruby tutorial like the
one at tryruby.hobix.com. Good luck!

Matt White
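
As a possible alternative to `pop`, which modifies the array in place, the sketch below keeps the first two elements without touching the original. It reuses the sample array from the irb session above; `first_two` is a name introduced here purely for illustration.

```ruby
array = ["this is", "a string", "to split"]

# Array#first(n) returns a new array holding the first n elements
# and leaves the original array unchanged, unlike pop.
first_two = array.first(2)

p first_two   # => ["this is", "a string"]
p array.size  # => 3
```

This can be handy when the full array is still needed later, for example for logging.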





Cindy Schaller

2008-Nov-03 19:03 UTC


Hi Matt -

I'm not sure what the problem is, but when I run the script through my
code this is what I see.  The output is cleaned up for readability.

The value of array[0] appears to be the entire array instead of just the
first value.  If I take the string

Thanks!

[code]

result.search('strong').each do |listing|
  puts "listing here: #{listing}"
  puts "==="
  array = listing.inner_text.split(/<br[^>]*>/).map { |x| x.strip }
  puts "array: #{array}"
  puts "==="
  puts "array 0: #{array[0]}"
  puts "==="
end

[output]

listing here:
<strong>
Wii Game for Sale<br />
American Idol52<br />
Ad #: 12345
<img class="noLink" src="img_source" alt="image" />
</strong>
==
array: Wii Game for Sale
American Idol
Ad #: 12345
==
array 0: Wii Game for Sale
American Idol
Ad #:12345
==

Cindy


Matt White

2008-Nov-03 20:38 UTC


Heh, I'm going to blame Aaron for this one because he wrote the code
that I cut and pasted, but it's my fault for not noticing either.
Instead of listing.inner_text, try using listing.inner_html. That should do it.

Matt
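
To see why the switch matters: inner_text hands back the text with the tags already stripped, so there is no `<br />` left for split to match and the whole text comes back as a single element. The rough sketch below imitates that with plain strings. `html` is a hypothetical stand-in for what `listing.inner_html` would return for a listing like the one in this thread, and `text` only approximates `inner_text` by deleting tags with a regular expression; neither value is produced by Hpricot here.

```ruby
# Stand-in for listing.inner_html: the markup is still present.
html = "\nWii Game for Sale<br />\nAmerican Idol<br />\nAd #: 12345\n" \
       "<img class=\"noLink\" src=\"img_source\" alt=\"image\" />\n"

# Rough approximation of inner_text: same content with every tag removed.
text = html.gsub(/<[^>]*>/, "")

pattern = /<br[^>]*>/

from_text = text.split(pattern).map { |x| x.strip }
from_html = html.split(pattern).map { |x| x.strip }

p from_text.size      # => 1, no <br /> left to split on
p from_html.size      # => 3
p from_html.first(2)  # => ["Wii Game for Sale", "American Idol"]
```

In this sketch the third element of `from_html` still carries the `<img />` tag, which does no harm when only the first two values are kept.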





Cindy Schaller

2008-Nov-03 21:12 UTC


Ahh, many thanks!  Works like a charm.

I see that my last email was pre-empted before I finished the sentence.  But
I had meant to type that in my console, the code to split the string had
worked great.

Lots to learn.  Thanks for being great teachers!



Chris McMahon

2008-Nov-03 21:20 UTC


> Lots to learn.  Thanks for being great teachers!

FWIW, open source projects don't need more developers; they need more
users.  I've really enjoyed reading this thread, thanks.
-Chris
