thr3ads.net - Ferret talk - [Ferret-talk] memory leak in index build? [Mar 2007]

Caleb Clausen

2007-Mar-09 19:22 UTC

[Ferret-talk] memory leak in index build?

I have a script (below) which attempts to make an index out of all the 
man pages on my system. It takes a while, mostly because it runs man 
over and over... but anyway, as time goes on the memory usage goes up 
and up and never down. Eventually, it runs out of ram and just starts 
thrashing up the swap space, pretty much grinding to a halt.

The workaround would seem to be to index documents in batches in the 
background, shutting down the index process every so often to recover 
its memory. I'm about to try that, because I'm really hunting a
different bug... however, the memory problem concerns me.


require 'rubygems'
require 'ferret'
require 'set'

dir = "temp_index"

if ARGV.first=="-p"
    ARGV.shift
    prefix=ARGV.shift
end

fi= Ferret::Index::FieldInfos.new
fi.add_field :name,
     :index => :yes, :store => :yes, :term_vector => :with_positions

%w[data field1 field2 field3].each{|fieldname|
   fi.add_field fieldname.to_sym,
        :index => :yes, :store => :no, :term_vector => :with_positions
}

i = Ferret::Index::IndexWriter.new(:path=>dir, :create=>true,
                                   :field_infos=>fi)

list=Dir["/usr/share/man/*/#{prefix}*.gz"]
numpages=(ARGV.last||list.size).to_i

list[0...numpages].each{|manfile|
   all,name,section=*/\A(.*)\.([^.]+)\Z/.match(File.basename(manfile, ".gz"))
   tttt=`man #{section} #{name}`.gsub(/.\b/m, '')

   i << {
   :data=>tttt.to_s,
   :name=>name,
   :field1=>name,
   :field2=>name,
   :field3=>name,
   }
}

i.close


i=Ferret::Index::IndexReader.new dir

i.max_doc.times{|n|
   # add up the position counts of every term in this document's :data field
   i.term_vector(n,:data).terms \
    .inject(0){|sum,tvt| sum + tvt.positions.size } > 1_000_000 and
      puts "heinous term count for #{i[n][:name]}"
}

seenterms=Set[]
begin
i.terms(:data).each{|term,df|
seenterms.include? term and next
i.term_docs_for(:data,term)
seenterms << term
}
rescue Exception
   raise
end
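
The batching workaround described at the top of this message could be sketched roughly as follows. This is a minimal illustration, not code from the thread: `index_in_batches` is a hypothetical helper, and it assumes a platform where `fork` is available. Each batch runs in a short-lived child process, so whatever memory the batch accumulates goes back to the operating system when that child exits.

```ruby
# Hypothetical helper (not part of Ferret): run the given block once per
# batch, each time in a freshly forked child process.
# Returns the number of batches processed.
def index_in_batches(items, batch_size)
  batches = 0
  items.each_slice(batch_size) do |batch|
    pid = fork do
      yield batch   # the caller's indexing work for this batch
      exit!(0)      # leave the child without running the parent's at_exit hooks
    end
    Process.wait(pid)   # one batch at a time; sets $? to the child's status
    raise "batch #{batches} failed" unless $?.success?
    batches += 1
  end
  batches
end
```

In the script above, the block would open an IndexWriter on the same path, add the documents for that batch, and close it. Presumably only the first batch would pass :create=>true; that detail is an assumption and is not confirmed anywhere in the thread.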

David Balmain

2007-Mar-09 23:09 UTC

[Ferret-talk] memory leak in index build?

On 3/10/07, Caleb Clausen <caleb at inforadical.net> wrote:
> I have a script (below) which attempts to make an index out of all the
> man pages on my system. It takes a while, mostly because it runs man
> over and over... but anyway, as time goes on the memory usage goes up
> and up and never down. Eventually, it runs out of ram and just starts
> thrashing up the swap space, pretty much grinding to a halt.

Hey Caleb,

Running your test for 15 minutes my memory usage climbed to 30Mb. It
was still slowly climbing which is not a good sign but not enough to
bring my system to a halt. Anyway, I tried using valgrind's memcheck
on it and I couldn't find a leak in the Ferret code. Perhaps it is a
leak in your version of Ruby, although I doubt it. Here is the most
significant output from valgrind with --show-reachable=yes set;

==7636== 110,880 bytes in 6,930 blocks are still reachable in loss record 15 of 20
==7636==    at 0x4020396: malloc (vg_replace_malloc.c:149)
==7636==    by 0x40C175F: st_insert (st.c:288)
==7636==    by 0x40D1E55: rb_ivar_set (variable.c:1056)
==7636==    by 0x40D1FC2: rb_iv_set (variable.c:1959)
==7636==    by 0x40D2003: rb_name_class (variable.c:282)
==7636==    by 0x408BCBB: boot_defclass (object.c:2462)
==7636==    by 0x408D020: Init_Object (object.c:2549)
==7636==    by 0x40798A0: rb_call_inits (inits.c:54)
==7636==    by 0x4061E5C: ruby_init (eval.c:1382)
==7636==    by 0x8048600: main (in /usr/bin/ruby1.8)
==7636==
==7636== 187,248 bytes in 11,703 blocks are still reachable in loss record 16 of 20
==7636==    at 0x4020396: malloc (vg_replace_malloc.c:149)
==7636==    by 0x40C184F: st_init_table_with_size (st.c:154)
==7636==    by 0x40C18B6: st_init_strtable_with_size (st.c:193)
==7636==    by 0x4095FBD: Init_sym (parse.y:5885)
==7636==    by 0x4079896: rb_call_inits (inits.c:52)
==7636==    by 0x4061E5C: ruby_init (eval.c:1382)
==7636==    by 0x8048600: main (in /usr/bin/ruby1.8)
==7636==
==7636== 514,228 bytes in 11,687 blocks are still reachable in loss record 17 of 20
==7636==    at 0x401F6D5: calloc (vg_replace_malloc.c:279)
==7636==    by 0x40C1870: st_init_table_with_size (st.c:158)
==7636==    by 0x40C1914: st_init_table (st.c:167)
==7636==    by 0x40C196F: st_init_numtable (st.c:173)
==7636==    by 0x40CFEB6: Init_var_tables (variable.c:28)
==7636==    by 0x407989B: rb_call_inits (inits.c:53)
==7636==    by 0x4061E5C: ruby_init (eval.c:1382)
==7636==    by 0x8048600: main (in /usr/bin/ruby1.8)
==7636==
==7636== 965,584 bytes in 60,349 blocks are still reachable in loss record 18 of 20
==7636==    at 0x4020396: malloc (vg_replace_malloc.c:149)
==7636==    by 0x40C1692: st_add_direct (st.c:307)
==7636==    by 0x4095D1A: rb_intern (parse.y:6067)
==7636==    by 0x40CFED7: Init_var_tables (variable.c:30)
==7636==    by 0x407989B: rb_call_inits (inits.c:53)
==7636==    by 0x4061E5C: ruby_init (eval.c:1382)
==7636==    by 0x8048600: main (in /usr/bin/ruby1.8)
==7636==
==7636== 1,088,800 bytes in 50,609 blocks are still reachable in loss record 19 of 20
==7636==    at 0x4020396: malloc (vg_replace_malloc.c:149)
==7636==    by 0x4074E50: ruby_xmalloc (gc.c:121)
==7636==    by 0x40CF72F: ruby_strdup (util.c:634)
==7636==    by 0x4095CFF: rb_intern (parse.y:6066)
==7636==    by 0x40CFED7: Init_var_tables (variable.c:30)
==7636==    by 0x407989B: rb_call_inits (inits.c:53)
==7636==    by 0x4061E5C: ruby_init (eval.c:1382)
==7636==    by 0x8048600: main (in /usr/bin/ruby1.8)
==7636==
==7636== 2,374,520 bytes in 4 blocks are still reachable in loss record 20 of 20
==7636==    at 0x4020396: malloc (vg_replace_malloc.c:149)
==7636==    by 0x40737F9: add_heap (gc.c:351)
==7636==    by 0x4061D74: ruby_init (eval.c:1372)
==7636==    by 0x8048600: main (in /usr/bin/ruby1.8)

As you can see, none of this has anything to do with Ferret. If you
haven't used valgrind before and you want to try it, here is how:

    valgrind --leak-check=yes ruby calebs_test.rb 2> res

You'll probably want to capture the output (like I have here) as it is
*very* long for ruby scripts. Lots of warnings from the ruby
internals. Let me know if you try this and you find anything unusual.

Incidentally, I'm not sure what the other bug you are chasing is but
it may have something to do with the encoding of the man pages. I
don't think they are UTF-8 so if your locale is set to UTF-8 it will
cause some problems in the analysis.

Cheers,
Dave



-- 
Dave Balmain
http://www.davebalmain.com/

Caleb Clausen

2007-Mar-13 00:17 UTC

[Ferret-talk] memory leak in index build?

Dave Balmain said:
> Running your test for 15 minutes my memory usage climbed to 30Mb. It
> was still slowly climbing which is not a good sign but not enough to
> bring my system to a halt. Anyway, I tried using valgrind's memcheck
> on it and I couldn't find a leak in the Ferret code. Perhaps it is a
> leak in your version of Ruby, although I doubt it. Here is the most
> significant output from valgrind with --show-reachable=yes set;

Ok, so my ruby is version 1.8.2, kinda old, so maybe there is an old bug
in it. Recent experiments on another machine (running a newer ruby,
1.8.5, I think) didn't seem to have the same memory leak.

What version do you run, by the way?

> Incidentally, I'm not sure what the other bug you are chasing is but
> it may have something to do with the encoding of the man pages. I

I know the man output is some encoding I don't understand; I'm just
trying to generate a lot of data to feed into ferret. I don't care if
it's correct. I'm still having quite a few crashes with ferret, though
the situation has improved. I'm trying to reproduce those without
handing you my entire codebase. So far, without success. :(

> don't think they are UTF-8 so if your locale is set to UTF-8 it will
> cause some problems in the analysis.

I know I'm not on the UTF-8 locale. Actually, I've been trying to figure
out how to set my locale to UTF-8. I don't suppose you'd know? I'm using
Debian stable.

David Balmain

2007-Mar-13 06:20 UTC

[Ferret-talk] memory leak in index build?

On 3/13/07, Caleb Clausen <caleb at inforadical.net> wrote:
> Dave Balmain said:
> > Running your test for 15 minutes my memory usage climbed to 30Mb. It
> > was still slowly climbing which is not a good sign but not enough to
> > bring my system to a halt. Anyway, I tried using valgrind's memcheck
> > on it and I couldn't find a leak in the Ferret code. Perhaps it is a
> > leak in your version of Ruby, although I doubt it. Here is the most
> > significant output from valgrind with --show-reachable=yes set;
>
> Ok, so my ruby is version 1.8.2, kinda old, so maybe there is an old bug
> in it. Recent experiments on another machine (running a newer ruby,
> 1.8.5, I think) didn't seem to have the same memory leak.
>
> What version do you run, by the way?

I'm on 1.8.5.

> > Incidentally, I'm not sure what the other bug you are chasing is but
> > it may have something to do with the encoding of the man pages. I
>
> I know the man output is some encoding I don't understand; I'm just
> trying to generate a lot of data to feed into ferret. I don't care if
> it's correct. I'm still having quite a few crashes with ferret, though
> the situation has improved. I'm trying to reproduce those without
> handing you my entire codebase. So far, without success. :(

Let me know when you do find the problem. It is possible that it has
something to do with a mismatch of encodings. Feeding ISO-8859-1 data
(which is what my man pages are encoded in) to a UTF-8 analyzer might
cause Ferret to crash. I've tried to fix this so that it doesn't
happen but I might have missed something.
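
One way to spot this kind of mismatch before the text reaches an analyzer is to check whether the bytes are valid UTF-8 at all. The sketch below is an illustration only: `valid_utf8?` is a hypothetical helper, not part of Ferret, and it relies on `String#force_encoding` and `String#valid_encoding?`, which exist in Ruby 1.9 and later but not in the 1.8 series discussed in this thread.

```ruby
# Hypothetical helper: true if the bytes of `text` form valid UTF-8.
def valid_utf8?(text)
  # dup first so the caller's string keeps its original encoding tag
  text.dup.force_encoding('UTF-8').valid_encoding?
end

latin1_bytes = "na\xEFve".b   # "naive" with i-diaeresis, encoded as ISO-8859-1

puts valid_utf8?(latin1_bytes)   # false: 0xEF starts a UTF-8 sequence that never completes
puts valid_utf8?("na\u00EFve")   # true: the same word as real UTF-8
```

A passing check does not prove the text really is UTF-8, since plain ASCII and some ISO-8859-1 byte runs are also valid UTF-8, but a failing check is a reliable sign of a mismatch.
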

> > don't think they are UTF-8 so if your locale is set to UTF-8 it will
> > cause some problems in the analysis.
>
> I know I'm not on the UTF-8 locale. Actually, I've been trying to figure
> out how to set my locale to UTF-8. I don't suppose you'd know? I'm using
> Debian stable.

It's not too hard. Something like;

$ sudo apt-get install debconf
$ sudo dpkg-reconfigure locales

Cheers,
Dave

-- 
Dave Balmain
http://www.davebalmain.com/

Jonathan Weiss

2007-Mar-13 11:59 UTC

[Ferret-talk] memory leak in index build?

> It's not too hard. Something like;
>
> $ sudo apt-get install debconf
> $ sudo dpkg-reconfigure locales


On the notion of the locale stuff, would it be possible to create a 
configuration option that explicitly sets Ferret to UTF-8 mode?

I think that a lot of people have been bitten by this and an explicit 
configuration option IMHO makes a lot of sense. With acts_as_ferret it 
would look maybe like this


class A < ActiveRecord::Base
   acts_as_ferret :encoding => 'utf8'
end



 >
 > Cheers,
 > Dave
 >

Regards,
Jonathan
-- 
Jonathan Weiss
http://blog.innerewut.de

David Balmain

2007-Mar-14 02:07 UTC

[Ferret-talk] memory leak in index build?

On 3/13/07, Jonathan Weiss <jw at innerewut.de> wrote:
> > It's not too hard. Something like;
> >
> > $ sudo apt-get install debconf
> > $ sudo dpkg-reconfigure locales
>
> On the notion of the locale stuff, would it be possible to create a
> configuration option that explicitly sets Ferret to UTF-8 mode?
>
> I think that a lot of people have been bitten by this and an explicit
> configuration option IMHO makes a lot of sense. With acts_as_ferret it
> would look maybe like this
>
> class A < ActiveRecord::Base
>    acts_as_ferret :encoding => 'utf8'
> end

The problem is that this may give people the false impression that
Ferret will handle UTF-8 even when they don't have a UTF-8 locale
installed. For example, adding this configuration option wouldn't have
helped Caleb.

I guess one possibility would be to raise an exception if the locale
isn't available. You could also automatically convert all text to
UTF-8 using iconv. I don't know how much this would help but I would
certainly commit a patch along these lines if anyone is up for it.
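
For anyone picking this up, the two ideas above might look something like the sketch below. It is only an illustration under stated assumptions: `ctype_locale`, `ensure_utf8_locale!` and `to_utf8` are hypothetical helpers, not part of Ferret or acts_as_ferret. The locale check only reads the usual POSIX environment variables, so it cannot tell whether the named locale is actually installed. The conversion uses `String#encode` from Ruby 1.9 and later in place of the iconv library that a Ruby 1.8 patch would have used.

```ruby
# Hypothetical helpers; none of these are part of Ferret or acts_as_ferret.

# Name of the character-type locale in effect, following the usual POSIX
# precedence LC_ALL > LC_CTYPE > LANG. Returns nil if none is set.
def ctype_locale(env = ENV)
  %w[LC_ALL LC_CTYPE LANG].map { |key| env[key] }.find { |v| v && !v.empty? }
end

# Raise unless the environment names a UTF-8 locale.
# Assumption: this only inspects the variables, not the installed locales.
def ensure_utf8_locale!(env = ENV)
  name = ctype_locale(env)
  return name if name.to_s.match?(/utf-?8/i)
  raise ArgumentError, "UTF-8 locale required, got #{name.inspect}"
end

# Re-encode text from a known source encoding (ISO-8859-1 by default,
# as with the man pages mentioned in this thread) to UTF-8.
def to_utf8(text, from = 'ISO-8859-1')
  text.dup.force_encoding(from).encode('UTF-8')
end
```

Calling `ensure_utf8_locale!` once before the index is opened and passing each field value through `to_utf8` would cover both suggestions, provided the source encoding really is known in advance.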

Cheers,
Dave

-- 
Dave Balmain
http://www.davebalmain.com/
