I have this HTML:
<h3> <span class="mw-headline">Central
Kolkata</span></h3><li> <b>Eden
Gardens</b> (one of the most famous cricket stadiums in the world),
</li><li> <b>Akashwani Bhavan</b>, All India Radio
building
</li><li> <b>Indoor Stadium</b>
</li><li> <b>Fort William</b>, the massive and
impregnable British Citadel
built in 1773. The fort is still in use and retains its well-guarded
grandeur. Visitors are allowed in with special permission only.
</li><li> <b>Victoria Memorial</b> <a href="
http://www.victoriamemorial-cal.org" class="external autonumber"
title="
http://www.victoriamemorial-cal.org">[10]</a> Along St. George’s
Gate Road,
on the southern fringe of the Maidan, you will find Kolkata's most
famous
landmark , a splendid white marble monument (CLOSED MONDAYS).
</li><li> <b>Calcutta Racecourse</b>
</li><li> <b>Chowringee</b>, is the Market place of
Kolkata. You will find
shops ranging from Computer Periferals to cloth merchants. Even tailors and
a few famous Movie theaters too. This place is a favourite pass time for
local people.
</li><h3> <span class="mw-headline">Northern
Kolkata</span></h3><li>
<b>Nakhoda Mosque1</b> (the largest mosque in Kolkata) and the
</li><li> <b>Shobhabajar Rajbari</b> the ancestral house
of Rja Naba
Krishna, one of the rich locals to side with Clive during his war with Nabab
Siraj-Ud-Daula.
</li><li> <b>Jorasanko Thakur Bari1</b> (Tagore Family
residence).
</li><li> <b>Parashnath Jain Temple</b>, near the
Belgachia metro station.
</li><li> <b>Parashnath Jain Temple</b>, at Gouribari,
less visited,
reachable from the Sovabazar Metro station (take an auto rickshaw).
</li>
I want to search the inner_text of the <li> tags only up to <span
class="mw-headline">Northern Kolkata</span>,
but my code searches all <li> tags.
My code is:
doc.search("li").each do |y|
I know this matches every <li>, but I want to stop at that <span> element.
I am using Hpricot.
Regards,
Prashant
--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups
"Ruby on Rails: Talk" group.
To post to this group, send email to
rubyonrails-talk-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org
To unsubscribe from this group, send email to
rubyonrails-talk+unsubscribe@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/rubyonrails-talk?hl=en
-~----------~----~----~----~------~----~------~--~---
Is this your HTML, or are you scraping someone else's HTML?
If it's yours, organize your HTML differently: if you know you want to
process a section at a time, wrap those sections in an
identifiable container, then scope your searches by the container.
<div>
  <h3>blah</h3>
  <li>a</li>
  <li>b</li>
</div>
<div>
  <h3>blah2</h3>
  <li>c</li>
  <li>d</li>
</div>
(doc/"div").each do |dv|
  this_h3 = (dv/"h3")
  if this_h3.inner_html == "blah2"
    (dv/"li").each do |li|
      puts li.inner_html
    end
  end
end
emits just c and d.
If it's someone else's HTML in that format, you'll probably have to go
element by element through the whole doc with state-machine-ish code that
tracks what you've seen previously, since there doesn't seem to be any real
'path' to the li's per h3.
--
Posted via http://www.ruby-forum.com/.
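The "state machine-ish" idea above can be sketched roughly as follows. This is only an illustration under stated assumptions: it scans the markup with plain Ruby regular expressions instead of Hpricot so that it runs without the gem, the method name group_items_by_heading is made up for this example, and the sample HTML is a shortened version of the page in the question. With Hpricot the same loop would walk the parsed elements in document order instead of regex matches.

```ruby
# Group each <li> under the most recent <h3> seen, walking the markup in order.
# NOTE: regex scanning is a stand-in for an HTML parser here (an assumption
# made so the sketch needs no gems); it only suits simple, flat markup.
def group_items_by_heading(html)
  sections = Hash.new { |hash, key| hash[key] = [] }
  current = nil # state: the heading we are currently "inside"

  # Collect <h3>...</h3> and <li>...</li> in document order, then walk them.
  html.scan(%r{<(h3|li)\b[^>]*>(.*?)</\1>}m).each do |tag, body|
    # Drop nested tags and squeeze whitespace to get something like inner_text.
    text = body.gsub(/<[^>]+>/, '').gsub(/\s+/, ' ').strip
    if tag == 'h3'
      current = text
      sections[current] # touch the key so headings without items are kept
    elsif current
      sections[current] << text
    end
  end
  sections
end

html = <<~HTML
  <h3> <span class="mw-headline">Central
  Kolkata</span></h3><li> <b>Eden
  Gardens</b> (a cricket stadium),
  </li><li> <b>Indoor Stadium</b>
  </li><h3> <span class="mw-headline">Northern
  Kolkata</span></h3><li>
  <b>Nakhoda Mosque</b>
  </li>
HTML

# Only the items that appear before the "Northern Kolkata" heading:
puts group_items_by_heading(html)['Central Kolkata']
```

The only state kept is the current heading, so every <li> is filed under whichever <h3> came last.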
Yes, I am scraping someone else's HTML, so I can't change the format. Can you help me?
Regards,
Prashant
Hi,
Otherwise, I have another option: I can change the HTML into the form you suggested. This is the HTML I have:
<see><span>Central Kolkat</span>
<li> <b>Eden Gardens</b> (one of the most famous cricket stadiums in the world), </li>
<li> <b>Akashwani Bhavan</b>, All India Radio building </li>
<li> <b>Indoor Stadium</b> </li>
<li> <b>Fort William</b>, the massive and impregnable British Citadel built in 1773. The fort is still in use and retains its well-guarded grandeur. Visitors are allowed in with special permission only. </li>
<li> <b>Victoria Memorial</b> <a href="http://www.victoriamemorial-cal.org" class="external autonumber" title="http://www.victoriamemorial-cal.org">[10]</a> Along St. George’s Gate Road, on the southern fringe of the Maidan, you will find Kolkata's most famous landmark , a splendid white marble monument (CLOSED MONDAYS). </li>
<li> <b>Calcutta Racecourse</b> </li>
<li> <b>Chowringee</b>, is the Market place of Kolkata. You will find shops ranging from Computer Periferals to cloth merchants. Even tailors and a few famous Movie theaters too. This place is a favourite pass time for local people. </li>
<span>Red Fort</span>
<li> <b>Eden Gardens</b> (one of the most famous cricket stadiums in the world), </li>
<li> <b>Akashwani Bhavan</b>, All India Radio building </li>
<li> <b>Indoor Stadium</b> </li>
</see>
I want to add a new element, e.g. <div> </div>, to this HTML, like this:
<see><div><span>Central Kolkat</span>
<li> <b>Eden Gardens</b> (one of the most famous cricket stadiums in the world), </li>
<li> <b>Akashwani Bhavan</b>, All India Radio building </li>
<li> <b>Indoor Stadium</b> </li>
<li> <b>Fort William</b>, the massive and impregnable British Citadel built in 1773. The fort is still in use and retains its well-guarded grandeur. Visitors are allowed in with special permission only. </li>
<li> <b>Victoria Memorial</b> <a href="http://www.victoriamemorial-cal.org" class="external autonumber" title="http://www.victoriamemorial-cal.org">[10]</a> Along St. George’s Gate Road, on the southern fringe of the Maidan, you will find Kolkata's most famous landmark , a splendid white marble monument (CLOSED MONDAYS). </li>
<li> <b>Calcutta Racecourse</b> </li>
<li> <b>Chowringee</b>, is the Market place of Kolkata. You will find shops ranging from Computer Periferals to cloth merchants. Even tailors and a few famous Movie theaters too. This place is a favourite pass time for local people. </li>
</div><div> <span>Red Fort</span>
<li> <b>Eden Gardens</b> (one of the most famous cricket stadiums in the world), </li>
<li> <b>Akashwani Bhavan</b>, All India Radio building </li>
<li> <b>Indoor Stadium</b> </li>
</div></see>
Please note that a <div> has to be opened before each <span> and closed just before the next <span> tag starts. Is this possible? Can anybody help me do this using Hpricot?
Regards,
Prashanth
Your HTML is still flat, so you have to work with the patterns that you see.
You have:
span
li
li
li
span
li
li
li
etc...

An ugly, brute-force, one-case solution is to:

read the page with Hpricot
remove the header
convert it to a simple string representation
stick your opening tag '<see>' at the head
stick your closing tag and a div end '</div></see>' at the tail
change all '<span>' to '</div><div><span>'
doctor up the new head from '<see></div><div>' to just '<see><div>'
re-create your Hpricot doc from the modified string

which takes about 8 lines of code.

YMMV
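The string-rewriting steps listed above might look roughly like this. It is a sketch, not the poster's actual code: the method name wrap_sections and the shortened sample fragment are made up for illustration, and it assumes the header and the original <see> wrapper have already been removed so that the fragment starts directly with a <span>. Only plain Ruby string methods are used, so it runs without Hpricot. The last step of the list is just passing the resulting string back to Hpricot().

```ruby
# Wrap each <span>-led run of <li> items in its own <div>, treating the
# markup as a plain string. Assumes the fragment starts with a <span>.
def wrap_sections(fragment)
  # Opening tag at the head; a div end plus the closing tag at the tail.
  wrapped = "<see>#{fragment.strip}</div></see>"
  # Every <span> closes the previous section and opens a new one.
  wrapped = wrapped.gsub('<span>', '</div><div><span>')
  # The first section has no predecessor to close, so repair the head.
  wrapped.sub('<see></div><div>', '<see><div>')
end

flat = '<span>Central Kolkat</span><li>Eden Gardens</li><li>Indoor Stadium</li>' \
       '<span>Red Fort</span><li>Akashwani Bhavan</li>'

puts wrap_sections(flat)
```

Because the substitution keys on the literal text '<span>', it only works while every section heading is a bare <span> with no attributes, which is why the approach is described as a one-case solution.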
Thank you. I have done what you suggested: using gsub, I replaced the tags to
get the form you described. But the problem is that with this code:
doc = Hpricot(open('Delhi.txt'))
x = doc.to_s
doc1 = x.gsub(/<(\/?)li><span>/, '</li></see><see><span>')
puts doc1
doc1.search('span').each do |y|
  puts y.inner_text
end
it gives this error:
undefined method `search' for #<String:0xb7d0bc74> (NoMethodError)
This happens because doc1 is a String. How can I convert it so that Hpricot
can read it again?
Regards,
Prashanth Hiremath
I don't know how to re-create the Hpricot doc from the modified string. Please help me.
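A small sketch of why the NoMethodError appears: String#gsub returns a new String, so the result no longer has Hpricot's search method and has to be parsed again. The runnable part below uses only plain Ruby. The method name split_sections and the sample string are made up for illustration, and the pattern is loosened to allow whitespace between </li> and <span>, on the assumption that the real file has such whitespace, as the HTML pasted in this thread appears to. The Hpricot call itself is shown in comments because it needs the gem installed.

```ruby
# Insert a </see><see> boundary wherever a <span> follows a closing </li>.
# \s* allows optional whitespace between the two tags.
def split_sections(html)
  html.gsub(%r{</li>\s*<span>}, '</li></see><see><span>')
end

modified = split_sections('<li>Indoor Stadium</li> <span>Red Fort</span><li>Eden Gardens</li>')

puts modified
puts modified.class                 # String
puts modified.respond_to?(:search)  # false, hence the NoMethodError

# With the hpricot gem installed, parse the string again by passing it to Hpricot():
#   require 'hpricot'
#   doc1 = Hpricot(modified)
#   doc1.search('span').each { |y| puts y.inner_text }
```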
Please don't top post; it annoys readers on this list and makes it
less likely that you will get help.

I have not used Hpricot, but if I were in your situation the first
thing I would do is look carefully through the Hpricot documentation.
Have you done that?

Colin
OK, I won't top post again, sorry. Can you help me extend this?
<div>
  <h3>blah</h3>
  <li>a</li>
  <li>b</li>
</div>
<div>
  <h3>blah2</h3>
  <li>c</li>
  <li>d</li>
</div>
(doc/"div").each do |dv|
  this_h3 = (dv/"h3")
  if this_h3.inner_html == "blah2"
    (dv/"li").each do |li|
      puts li.inner_html
    end
  end
end
This is your code. I want to check the inner_text of all the <h3> tags. For
example, you used "blah2"; how can I test for "blah1" and all the others in a loop?
Regards,
Prashanth
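One way to handle every heading in a single loop is to drop the comparison against "blah2" and read the heading on each pass. The sketch below shows that shape with plain Ruby string scanning standing in for Hpricot, so it runs without the gem. The method name sections_of is made up for illustration, and the regexes assume the simple, attribute-free tags of the sample. In the Hpricot loop above, the equivalent change would be to remove the if and use (dv/"h3").inner_html as the label for each div.

```ruby
# Return [heading, items] pairs, one per <div>, in document order.
# NOTE: regex scanning is an assumption made to keep the sketch gem-free;
# it expects plain <div>, <h3> and <li> tags without attributes.
def sections_of(html)
  html.scan(%r{<div>(.*?)</div>}m).map do |(div_body)|
    heading = div_body[%r{<h3>(.*?)</h3>}m, 1]
    items = div_body.scan(%r{<li>(.*?)</li>}m).flatten
    [heading, items]
  end
end

html = <<~HTML
  <div>
    <h3>blah</h3>
    <li>a</li>
    <li>b</li>
  </div>
  <div>
    <h3>blah2</h3>
    <li>c</li>
    <li>d</li>
  </div>
HTML

# Every section is visited; no heading is hard-coded.
sections_of(html).each do |heading, items|
  puts "#{heading}: #{items.join(', ')}"
end
# prints:
#   blah: a, b
#   blah2: c, d
```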