Cindy Schaller
2008-Oct-31 15:50 UTC
[Mechanize-users] Mechanize/Hpricot -- Strings parsing question
Hello.

I am new to Hpricot and Mechanize, but so far I am loving it.

I am trying to parse out some text inside of an HTML page and hoped that I could get some advice from this group.

I have the following code:

    <strong>
    Wii Game for Sale<br />
    American Idol<br />
    Ad #: 12345
    </strong>

I want to get each line as a separate value to insert into a database. What is the best way to get each line? Can I use the HTML tags in some way as the beginning and ends of the strings to get the values in between?

Thanks!!

Cindy
Aaron Patterson
2008-Oct-31 16:16 UTC
[Mechanize-users] Mechanize/Hpricot -- Strings parsing question
Hi Cindy,

On Fri, Oct 31, 2008 at 8:50 AM, Cindy Schaller <cschaller at gmail.com> wrote:
> I have the following code:
>
> <strong>
> Wii Game for Sale<br />
> American Idol<br />
> Ad #: 12345
> </strong>

Assuming you have already been able to find the "strong" tag, I would do something like this:

    strong_tag.inner_text.split(/<br[^>]*>/).map { |x| x.strip }

Hope that helps.

-- 
Aaron Patterson
http://tenderlovemaking.com/
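To see what the regular expression in this suggestion matches, here is a minimal sketch that runs the same split on a plain Ruby string. The `markup` variable is a hypothetical stand-in for the raw HTML inside the strong tag (it is not produced by Hpricot here), and the split only finds anything if the `<br>` tags are still present in the string being split.

```ruby
# Hypothetical stand-in for the markup inside the <strong> tag.
# One break is written as <br /> and one as <br> to show both forms.
markup = "\n Wii Game for Sale<br />\n American Idol<br>\n Ad #: 12345\n"

# /<br[^>]*>/ matches "<br", then any run of characters that are not ">",
# then the closing ">", so <br>, <br/> and <br /> are all treated alike.
lines = markup.split(/<br[^>]*>/).map { |x| x.strip }

p lines
```

Each element of `lines` is one line of the listing with the surrounding whitespace removed.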
Cindy Schaller
2008-Oct-31 21:23 UTC
[Mechanize-users] Mechanize/Hpricot -- Strings parsing question
Thanks Aaron. That worked great.

Now I know that this is going to show my lack of Ruby knowledge, but I'm still learning.

How can I parse the same set of HTML, but only get the first 2 lines and not the third line? My current code is this:

    listing.inner_text.split(/<br[^>]*>/).map { |x| x.strip }

THANKS!!!
Matt White
2008-Nov-01 07:46 UTC
[Mechanize-users] Mechanize/Hpricot -- Strings parsing question
Cindy,

When you call split on this string, it splits it into 3 pieces and puts them into an array. If you do:

    array = listing.inner_text.split(/<br[^>]*>/).map { |x| x.strip }

you can then access the first line with array[0] and the second with array[1]. If you want to just discard the last item in the array, just call pop on the array. This example should illustrate:

    irb(main):001:0> string = "this is\na string\nto split"
    => "this is\na string\nto split"
    irb(main):002:0> array = string.split("\n").map{|s| s.strip}
    => ["this is", "a string", "to split"]
    irb(main):003:0> array.pop
    => "to split"
    irb(main):004:0> array
    => ["this is", "a string"]

If this is confusing or unclear, you might try a basic Ruby tutorial like the one at tryruby.hobix.com.

Good luck!

Matt White
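For keeping only the first two lines, here is a short sketch of a few options on a plain array. The sample `array` is made up for illustration. `Array#first(2)` and `array[0, 2]` return a new array and leave the original untouched, while `pop` removes the last element in place.

```ruby
# Sample data standing in for the result of the split.
array = ["Wii Game for Sale", "American Idol", "Ad #: 12345"]

# Non-destructive: both forms return a new two-element array.
first_two = array.first(2)
also_two  = array[0, 2]      # start at index 0, take 2 elements

# Unpack the two values into separate variables, e.g. for a database insert.
title, show = first_two

# Destructive: pop removes and returns the last element of the array itself.
removed = array.pop

p first_two
p removed
p array
```

The non-destructive forms are a reasonable default when the full array is still needed later. `pop` is fine when the last item is never used again.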
Cindy Schaller
2008-Nov-03 19:03 UTC
[Mechanize-users] Mechanize/Hpricot -- Strings parsing question
Hi Matt -

I'm not sure what the problem is, but when I run the script through my code this is what I see. The output is cleaned up for readability. The value of array[0] appears to be the entire array instead of just the first value.

If I take the string

Thanks!

[code]

    result.search('strong').each do |listing|
      puts "listing here: #{listing}"
      puts "==="
      array = listing.inner_text.split(/<br[^>]*>/).map { |x| x.strip }
      puts "array: #{array}"
      puts "==="
      puts "array 0: #{array[0]}"
      puts "==="
    end

[output]

    listing here: <strong>
    Wii Game for Sale<br />
    American Idol52<br />
    Ad #: 12345
    <img class="noLink" src="img_source" alt="image" />
    </strong>
    ==
    array: Wii Game for Sale
    American Idol
    Ad #: 12345
    ==
    array 0: Wii Game for Sale
    American Idol
    Ad #:12345
    ==

Cindy
Matt White
2008-Nov-03 20:38 UTC
[Mechanize-users] Mechanize/Hpricot -- Strings parsing question
Heh, I'm going to blame Aaron for this one because he wrote the code that I cut and pasted, but it's my fault for not noticing either. Instead of listing.inner_text, try using listing.inner_html. That should do it.

Matt
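A rough sketch of why the switch matters, using plain strings rather than Hpricot itself. Here `inner_html` is a hypothetical copy of the markup from Cindy's listing, and `text_only` approximates what `inner_text` returns by stripping tags with a regular expression. That stripping is an assumption made for illustration, not Hpricot's actual implementation.

```ruby
# Hypothetical copy of the markup inside the <strong> element.
inner_html = <<~HTML
  Wii Game for Sale<br />
  American Idol<br />
  Ad #: 12345
  <img class="noLink" src="img_source" alt="image" />
HTML

# Crude stand-in for inner_text: remove every tag, keep only the text.
text_only = inner_html.gsub(/<[^>]+>/, "")

pattern = /<br[^>]*>/

# No <br> tags are left in the text, so split returns a single element.
text_pieces = text_only.split(pattern).map { |x| x.strip }

# The tags are still present in the HTML, so split finds both breaks.
html_pieces = inner_html.split(pattern).map { |x| x.strip }

puts "from text: #{text_pieces.length} piece(s)"
puts "from html: #{html_pieces.length} piece(s)"
p html_pieces.first(2)
```

With this markup, the third piece from the HTML version still carries the `<img>` tag after the ad number, which does not matter when only the first two pieces are kept.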
Cindy Schaller
2008-Nov-03 21:12 UTC
[Mechanize-users] Mechanize/Hpricot -- Strings parsing question
Ahh, many thanks! Works like a charm.

I see that my last email was pre-empted before I finished the sentence. But I had meant to type that in my console, the code to split the string had worked great.

Lots to learn. Thanks for being great teachers!