I have some HTML:

<h3> <span class="mw-headline">Central Kolkata</span></h3>
<li> <b>Eden Gardens</b> (one of the most famous cricket stadiums in the world), </li>
<li> <b>Akashwani Bhavan</b>, All India Radio building </li>
<li> <b>Indoor Stadium</b> </li>
<li> <b>Fort William</b>, the massive and impregnable British Citadel built in 1773. The fort is still in use and retains its well-guarded grandeur. Visitors are allowed in with special permission only. </li>
<li> <b>Victoria Memorial</b> <a href=" http://www.victoriamemorial-cal.org" class="external autonumber" title=" http://www.victoriamemorial-cal.org">[10]</a> Along St. George’s Gate Road, on the southern fringe of the Maidan, you will find Kolkata's most famous landmark , a splendid white marble monument (CLOSED MONDAYS). </li>
<li> <b>Calcutta Racecourse</b> </li>
<li> <b>Chowringee</b>, is the Market place of Kolkata. You will find shops ranging from Computer Periferals to cloth merchants. Even tailors and a few famous Movie theaters too. This place is a favourite pass time for local people. </li>
<h3> <span class="mw-headline">Northern Kolkata</span></h3>
<li> <b>Nakhoda Mosque1</b> (the largest mosque in Kolkata) and the </li>
<li> <b>Shobhabajar Rajbari</b> the ancestral house of Rja Naba Krishna, one of the rich locals to side with Clive during his war with Nabab Siraj-Ud-Daula. </li>
<li> <b>Jorasanko Thakur Bari1</b> (Tagore Family residence). </li>
<li> <b>Parashnath Jain Temple</b>, near the Belgachia metro station. </li>
<li> <b>Parashnath Jain Temple</b>, at Gouribari, less visited, reachable from the Sovabazar Metro station (take an auto rickshaw). </li>

I want to search the inner_text of only the <li> tags that come before <span class="mw-headline">Northern Kolkata</span>, but my code searches every <li> in the document:

doc.search("li").each do |y|

I know this matches all <li> tags, but I want to stop at that <span> element, using Hpricot.

Regards,
Prashant

--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups "Ruby on Rails: Talk" group.
To post to this group, send email to rubyonrails-talk-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org
To unsubscribe from this group, send email to rubyonrails-talk+unsubscribe@googlegroups.com
For more options, visit this group at http://groups.google.com/group/rubyonrails-talk?hl=en
-~----------~----~----~----~------~----~------~--~---
Is this your html, or are you scraping someone else's html?

If it's yours, organize your html differently... if you know you want to be processing a section at a time, wrap those sections with an identifiable container, then scope your searches by the container.

<div>
  <h3>blah</h3>
  <li>a</li>
  <li>b</li>
</div>
<div>
  <h3>blah2</h3>
  <li>c</li>
  <li>d</li>
</div>

(doc/"div").each do |dv|
  this_h3 = (dv/"h3")
  if this_h3.inner_html == "blah2"
    (dv/"li").each do |li|
      puts li.inner_html
    end
  end
end

emits just c and d.

If it's someone else's html in that format, you'll probably have to go elem by elem for the whole doc with state machine-ish code to track what you've seen previously, since there doesn't seem to be any real 'path' to the li's per h3.
--
Posted via http://www.ruby-forum.com/.
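For the flat case, the "state machine-ish" idea can be sketched roughly as follows. This is only an illustration under stated assumptions: it tokenizes the markup with a regular expression rather than walking an Hpricot document, it assumes headings are <h3> elements and items are non-nested <li> elements, and the method name items_by_section is made up for this example.

```ruby
# Walk a flat run of <h3> headings and <li> items in document order,
# remembering the most recent heading (the "state"), and collect the
# text of each item under the heading it follows.
# ASSUMPTION: a regex tokenizer stands in for a real parser here. It is
# not part of Hpricot and only suits simple, non-nested markup.
def items_by_section(html)
  sections = {}
  current = nil # the heading we are currently under, nil before the first

  html.scan(%r{<(h3|li)\b[^>]*>(.*?)</\1>}m) do |tag, body|
    text = body.gsub(/<[^>]+>/, "").strip # drop inner tags such as <b>, <span>
    if tag == "h3"
      current = text
      sections[current] = []
    elsif current
      sections[current] << text # items before any heading are ignored
    end
  end

  sections
end

sample = "<h3><span>Central</span></h3><li><b>Eden Gardens</b></li>" \
         "<h3><span>Northern</span></h3><li><b>Nakhoda Mosque</b></li>"
puts items_by_section(sample)["Central"]
```

Looking up one heading in the returned hash gives only the items that appeared between that heading and the next one, which is what the original question asks for.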
Yes, I am scraping someone else's HTML, so I can't change the format. Can you help me?

Regards,
Prashant

On Tue, Sep 8, 2009 at 10:40 PM, Ar Chron <rails-mailing-list-ARtvInVfO7ksV2N9l4h3zg@public.gmane.org> wrote:
> Is this your html, or are you scraping someone else's html?
Hi,

Otherwise, I have another option: I can change the HTML into the form you suggested. This is the HTML I have:

<see>
<span>Central Kolkat</span>
<li> <b>Eden Gardens</b> (one of the most famous cricket stadiums in the world), </li>
<li> <b>Akashwani Bhavan</b>, All India Radio building </li>
<li> <b>Indoor Stadium</b> </li>
<li> <b>Fort William</b>, the massive and impregnable British Citadel built in 1773. The fort is still in use and retains its well-guarded grandeur. Visitors are allowed in with special permission only. </li>
<li> <b>Victoria Memorial</b> <a href=" http://www.victoriamemorial-cal.org" class="external autonumber" title=" http://www.victoriamemorial-cal.org">[10]</a> Along St. George’s Gate Road, on the southern fringe of the Maidan, you will find Kolkata's most famous landmark , a splendid white marble monument (CLOSED MONDAYS). </li>
<li> <b>Calcutta Racecourse</b> </li>
<li> <b>Chowringee</b>, is the Market place of Kolkata. You will find shops ranging from Computer Periferals to cloth merchants. Even tailors and a few famous Movie theaters too. This place is a favourite pass time for local people. </li>
<span>Red Fort</span>
<li> <b>Eden Gardens</b> (one of the most famous cricket stadiums in the world), </li>
<li> <b>Akashwani Bhavan</b>, All India Radio building </li>
<li> <b>Indoor Stadium</b> </li>
</see>

I want to add a new element, e.g. <div> </div>, to this HTML, like this:

<see>
<div>
<span>Central Kolkat</span>
<li> <b>Eden Gardens</b> (one of the most famous cricket stadiums in the world), </li>
<li> <b>Akashwani Bhavan</b>, All India Radio building </li>
<li> <b>Indoor Stadium</b> </li>
<li> <b>Fort William</b>, the massive and impregnable British Citadel built in 1773. The fort is still in use and retains its well-guarded grandeur. Visitors are allowed in with special permission only. </li>
<li> <b>Victoria Memorial</b> <a href=" http://www.victoriamemorial-cal.org" class="external autonumber" title=" http://www.victoriamemorial-cal.org">[10]</a> Along St. George’s Gate Road, on the southern fringe of the Maidan, you will find Kolkata's most famous landmark , a splendid white marble monument (CLOSED MONDAYS). </li>
<li> <b>Calcutta Racecourse</b> </li>
<li> <b>Chowringee</b>, is the Market place of Kolkata. You will find shops ranging from Computer Periferals to cloth merchants. Even tailors and a few famous Movie theaters too. This place is a favourite pass time for local people. </li>
</div>
<div>
<span>Red Fort</span>
<li> <b>Eden Gardens</b> (one of the most famous cricket stadiums in the world), </li>
<li> <b>Akashwani Bhavan</b>, All India Radio building </li>
<li> <b>Indoor Stadium</b> </li>
</div>
</see>

Please note that each <div> has to be inserted before a <span> and closed where the next <span> tag starts. Is this possible? Can anybody help me do this using Hpricot?

Regards,
Prashanth

On Wed, Sep 9, 2009 at 9:21 AM, prashanth hiremath <prashanthhiremath2-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> Yes, I am scraping someone else's HTML, so I can't change the format.
> Can you help me?
Your html is still flat, so you have to work with the patterns that you see.
You have:
span
li
li
li
span
li
li
li
etc...

An ugly, brute force, one case solution is to:

read the page with Hpricot
remove the header
convert it to a simple string representation
stick your opening tag '<see>' at the head
stick your closing tag and a div end '</div></see>' at the tail
change all '<span>' to '</div><div><span>'
doctor up the new head from '<see></div><div>' to just '<see><div>'
re-create your Hpricot doc from the modified string

which takes about 8 lines of code.

YMMV
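The string-manipulation part of that recipe might look something like the sketch below. It assumes the input is already the flat run of span/li markup with the outer <see> wrapper and any page header removed, that every section starts with a plain <span> with no attributes, and that no <li> contains a <span> of its own. The method name wrap_sections is invented for this example.

```ruby
# Wrap each "<span>heading</span><li>...</li>..." run in its own <div>.
# ASSUMPTIONS: `flat` holds only the flat span/li markup (no <see>
# wrapper), contains at least one <span>, and spans appear only as
# section headings.
def wrap_sections(flat)
  # Opening tag at the head, closing tag plus a div end at the tail.
  wrapped = "<see>" + flat + "</div></see>"

  # Every heading closes the previous section and opens a new one.
  wrapped = wrapped.gsub("<span>", "</div><div><span>")

  # The very first heading has no previous section to close, so drop
  # the stray </div> that now follows <see>.
  wrapped.sub(%r{\A<see>\s*</div>}, "<see>")
end

puts wrap_sections("<span>A</span><li>a</li><span>B</span><li>b</li>")
# The result is still a String. With the hpricot gem installed it can be
# parsed again by passing it to Hpricot(), e.g. Hpricot(wrap_sections(flat)).
```

Plain string replacement is fragile against attribute or whitespace variations in the source markup, which is presumably why the approach is described as a one-case solution.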
Thank you. I have done what you suggested: using gsub, I replaced the tags to get the form you described. But the problem is that with this code:

doc = Hpricot(open('Delhi.txt'))
x = doc.to_s
doc1 = x.gsub(/<(\/?)li><span>/, '</li></see><see><span>')

puts doc1
doc1.search('span').each do |y|
  puts y.inner_text
end

it gives this error:

undefined method `search' for #<String:0xb7d0bc74> (NoMethodError)

because doc1 is a String. How can I convert it so that Hpricot can read it again?

Regards,
Prashanth Hiremath

On Wed, Sep 9, 2009 at 10:38 PM, Ar Chron <rails-mailing-list-ARtvInVfO7ksV2N9l4h3zg@public.gmane.org> wrote:
> Your html is still flat, so you have to work with the patterns that you
> see.
I don't know how to re-create the Hpricot doc from the modified string. Please help me.

On Thu, Sep 10, 2009 at 10:20 AM, prashanth hiremath <prashanthhiremath2-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> undefined method `search' for #<String:0xb7d0bc74> (NoMethodError)
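On the NoMethodError: gsub returns a plain String, and search belongs to Hpricot documents, so the modified string has to be parsed again before it can be searched. Hpricot() accepts a String as well as an IO object. Here is a hedged sketch of that round trip. The method names regroup and span_texts are made up for illustration, and the pattern is simplified to match only a closing </li> followed by <span>, which is an assumption about what the original regex was meant to catch.

```ruby
# Close the current <see> and open a new one wherever a list item is
# followed by the next heading. Returns a String, not a document.
def regroup(html)
  html.gsub(%r{</li>\s*<span>}, "</li></see><see><span>")
end

# Parse the modified String again so that Hpricot's search methods are
# available. Returns nil when the hpricot gem is not installed.
def span_texts(html)
  require "hpricot"              # third-party gem, may be missing
  doc1 = Hpricot(regroup(html))  # String in, Hpricot document out
  (doc1 / "span").map { |span| span.inner_text }
rescue LoadError
  nil
end

flat = "<see><span>A</span><li>a</li> <span>B</span><li>b</li></see>"
puts regroup(flat)
```

In the code from the question, the equivalent change would be to search Hpricot(doc1) rather than doc1 itself.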
2009/9/10 prashanth hiremath <prashanthhiremath2-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>:
> its giving error
>
> undefined method `search' for #<String:0xb7d0bc74> (NoMethodError)
> because doc1 is string how can i conevrt so that i can read the file again
> by hpricot

Please don't top post, it annoys readers on this list and makes it less likely that you will get help.

I have not used hpricot but if I were in your situation the first thing I would do is carefully look through the documentation for hpricot. Have you done that?

Colin
OK, I won't post like that again, sorry. Can you help me extend this?

<div>
  <h3>blah</h3>
  <li>a</li>
  <li>b</li>
</div>
<div>
  <h3>blah2</h3>
  <li>c</li>
  <li>d</li>
</div>

(doc/"div").each do |dv|
  this_h3 = (dv/"h3")
  if this_h3.inner_html == "blah2"
    (dv/"li").each do |li|
      puts li.inner_html
    end
  end
end

This is your code. I want to check the inner_text of all the <h3> tags. For example, you tested for "blah2"; how can I test for "blah1" and all the others in a loop?

Regards,
Prashanth

On Thu, Sep 10, 2009 at 1:32 PM, Colin Law <clanlaw-gM/Ye1E23mwN+BqQ9rBEUg@public.gmane.org> wrote:
> Please don't top post, it annoys readers on this list and makes it
> less likely that you will get help.
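One way to handle every heading in a single pass, instead of comparing against one hard-coded value, is to build a hash keyed by the heading text. The sketch below assumes the markup has already been wrapped so that each <div> holds one <h3> and its <li> items, as in the sample above. The method name sections_from is invented for this example; it relies only on the / and at search methods and on inner_html.

```ruby
# Build { heading => [item, item, ...] } for every <div> section.
# `doc` is expected to be an Hpricot document (or any object offering
# the same `/`, `at`, and `inner_html` methods).
def sections_from(doc)
  sections = {}
  (doc / "div").each do |dv|
    h3 = dv.at("h3")   # first <h3> inside this div, or nil
    next if h3.nil?    # skip divs that have no heading
    sections[h3.inner_html] = (dv / "li").map { |li| li.inner_html }
  end
  sections
end

# Usage, assuming the hpricot gem is installed and `html` holds the
# wrapped markup:
#
#   require "hpricot"
#   sections_from(Hpricot(html)).each do |heading, items|
#     puts heading
#     items.each { |item| puts "  #{item}" }
#   end
```

With the headings as hash keys, a test for any particular heading becomes a lookup such as sections["blah2"], with no change to the loop.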