Cindy Schaller
2008-Oct-31 15:50 UTC
[Mechanize-users] Mechanize/Hpricot -- Strings parsing question
Hello.
I am new to Hpricot and Mechanize, but so far I am loving it.
I am trying to parse out some text inside of an HTML page and hoped that I
could get some advice from this group.
I have the following code:
<strong>
Wii Game for Sale<br />
American Idol<br />
Ad #: 12345
</strong>
I want to get each line as a separate value to insert into a database. What
is the best way to get each line? Can I use the HTML tags in some way as
the beginning and ends of the strings to get the values in between?
Thanks!!
Cindy
Aaron Patterson
2008-Oct-31 16:16 UTC
[Mechanize-users] Mechanize/Hpricot -- Strings parsing question
Hi Cindy,
Assuming you have already been able to find the "strong" tag, I would
do something like this:
strong_tag.inner_text.split(/<br[^>]*>/).map { |x| x.strip }
Hope that helps.
--
Aaron Patterson
http://tenderlovemaking.com/
Cindy Schaller
2008-Oct-31 21:23 UTC
[Mechanize-users] Mechanize/Hpricot -- Strings parsing question
Thanks Aaron.
That worked great.
Now I know that this is going to show my lack of Ruby knowledge, but I'm
still learning.
How can I parse the same set of HTML, but only get the first 2 lines and not
the third line?
My current code is this:
listing.inner_text.split(/<br[^>]*>/).map { |x| x.strip }
THANKS!!!
Matt White
2008-Nov-01 07:46 UTC
[Mechanize-users] Mechanize/Hpricot -- Strings parsing question
Cindy,
When you call split on this string, it splits it into 3 pieces and puts them
into an array. If you do:
array = listing.inner_text.split(/<br[^>]*>/).map { |x| x.strip }
you can then access the first line with array[0] and the second with array[1].
If you want to just discard the last item in the array, just call pop on the
array. This example should illustrate:
irb(main):001:0> string = "this is\na string\nto split"
=> "this is\na string\nto split"
irb(main):002:0> array = string.split("\n").map{|s| s.strip}
=> ["this is", "a string", "to split"]
irb(main):003:0> array.pop
=> "to split"
irb(main):004:0> array
=> ["this is", "a string"]
If this is confusing or unclear, you might try a basic Ruby tutorial like the
one at tryruby.hobix.com. Good luck!
Matt White
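As a side note to the pop suggestion above: pop modifies the array in place, so the third line is gone afterwards. If the full array is still needed, a non-mutating alternative is sketched below. The `lines` array is an assumed stand-in for the result of the split/strip call, not output captured from the real page.

```ruby
# Stand-in for the three stripped lines from the listing (assumed data).
lines = ["Wii Game for Sale", "American Idol", "Ad #: 12345"]

# Array#first(n) returns a new array holding the first n elements.
first_two = lines.first(2)

# A range ending at -2 keeps everything except the last element.
all_but_last = lines[0..-2]

# Unlike pop, neither call changes the original array.
title, subtitle = first_two
puts title
puts subtitle
puts lines.length
```

Either form leaves `lines` intact, which may be handy if the ad number is wanted later for a different database column.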
Cindy Schaller
2008-Nov-03 19:03 UTC
[Mechanize-users] Mechanize/Hpricot -- Strings parsing question
Hi Matt -
I'm not sure what the problem is, but when I run the script through my
code this is what I see. The output is cleaned up for readability.
The value of array[0] appears to be the entire array instead of just the
first value. If I take the string
Thanks!
[code]
result.search('strong').each do |listing|
  puts "listing here: #{listing}"
  puts "==="
  array = listing.inner_text.split(/<br[^>]*>/).map { |x| x.strip }
  puts "array: #{array}"
  puts "==="
  puts "array 0: #{array[0]}"
  puts "==="
end
[output]
listing here:
<strong>
Wii Game for Sale<br />
American Idol52<br />
Ad #: 12345
<img class="noLink" src="img_source"
alt="image" />
</strong>
==
array: Wii Game for Sale
American Idol
Ad #: 12345
==
array 0: Wii Game for Sale
American Idol
Ad #:12345
==
Cindy
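A debugging aside on the output above: interpolating an array into a string does not make its boundaries obvious, and on the Ruby 1.8 interpreters of the time the elements were, as far as I recall, simply concatenated. Printing with inspect and checking the size shows whether the split produced one element or three. The `text` value below is an assumed stand-in for whatever the listing returned, not data from the real page.

```ruby
# Assumed stand-in for the string being split; note it contains no <br /> tags.
text = "Wii Game for Sale\nAmerican Idol\nAd #: 12345"

array = text.split(/<br[^>]*>/).map { |x| x.strip }

# inspect shows the brackets, the quotes, and any embedded "\n" characters.
puts "array: #{array.inspect}"

# The element count settles whether the split matched anything at all.
puts "size: #{array.size}"
```

With this input the size is 1, which would explain array[0] looking like "the entire array".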
Matt White
2008-Nov-03 20:38 UTC
[Mechanize-users] Mechanize/Hpricot -- Strings parsing question
Heh, I'm going to blame Aaron for this one because he wrote the code
that I cut and pasted, but it's my fault for not noticing either.
Instead of listing.inner_text, try using listing.inner_html. That should do it.
Matt
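Why the switch matters can be sketched in plain Ruby without Hpricot installed. The two strings below are assumed stand-ins for what inner_html and inner_text return for the example element: the first keeps the markup, the second has already dropped it, so the `<br />` regex has nothing to match.

```ruby
# Assumed stand-ins for the two Hpricot results (not captured from a real page).
inner_html = "\nWii Game for Sale<br />\nAmerican Idol<br />\nAd #: 12345\n"
inner_text = "\nWii Game for Sale\nAmerican Idol\nAd #: 12345\n"

# Same pattern as in the thread: a <br> tag with optional attributes or slash.
br = /<br[^>]*>/

from_html = inner_html.split(br).map { |x| x.strip }
from_text = inner_text.split(br).map { |x| x.strip }

p from_html           # three separate lines
p from_text           # one element holding all the text
p from_html.first(2)  # only the first two lines
```

On the actual page the last piece would presumably still carry the `<img>` markup, since inner_html keeps tags, which is one more reason to keep only the first two elements.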
Cindy Schaller
2008-Nov-03 21:12 UTC
[Mechanize-users] Mechanize/Hpricot -- Strings parsing question
Ahh, many thanks! Works like a charm.
I see that my last email was pre-empted before I finished the sentence. But
I had meant to type that in my console, the code to split the string had
worked great.
Lots to learn. Thanks for being great teachers!