I have a script (below) which attempts to make an index out of all the man pages on my system. It takes a while, mostly because it runs man over and over... but anyway, as time goes on the memory usage goes up and up and never down. Eventually, it runs out of ram and just starts thrashing up the swap space, pretty much grinding to a halt.

The workaround would seem to be to index documents in batches in the background, shutting down the index process every so often to recover its memory. I'm about to try that, because I'm really hunting a different bug... however, the memory problem concerns me.

    require 'rubygems'
    require 'ferret'
    require 'set'

    dir = "temp_index"

    # optional "-p PREFIX" limits the run to man pages whose names start with PREFIX
    if ARGV.first=="-p"
      ARGV.shift
      prefix=ARGV.shift
    end

    fi = Ferret::Index::FieldInfos.new
    fi.add_field :name, :index => :yes, :store => :yes, :term_vector => :with_positions
    %w[data field1 field2 field3].each{|fieldname|
      fi.add_field fieldname.to_sym, :index => :yes, :store => :no, :term_vector => :with_positions
    }

    i = Ferret::Index::IndexWriter.new(:path=>dir, :create=>true, :field_infos=>fi)

    list=Dir["/usr/share/man/*/#{prefix}*.gz"]
    # optional last argument caps the number of pages indexed
    numpages=(ARGV.last||list.size).to_i

    list[0...numpages].each{|manfile|
      all,name,section=*/\A(.*)\.([^.]+)\Z/.match(File.basename(manfile, ".gz"))
      tttt=`man #{section} #{name}`.gsub(/.\b/m, '')
      i << {
        :data=>tttt.to_s,
        :name=>name,
        :field1=>name,
        :field2=>name,
        :field3=>name,
      }
    }
    i.close

    i=Ferret::Index::IndexReader.new dir
    i.max_doc.times{|n|
      # sum the position counts of every term in the document
      i.term_vector(n,:data).terms \
       .inject(0){|sum,tvt| sum + tvt.positions.size } > 1_000_000 and
          puts "heinous term count for #{i[n][:name]}"
    }

    seenterms=Set[]
    begin
      i.terms(:data).each{|term,df|
        seenterms.include? term and next
        i.term_docs_for(:data,term)
        seenterms << term
      }
    rescue Exception
      raise
    end
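The batch workaround described above can be sketched roughly as follows. This is only an illustration under assumptions: `each_batch_in_child` is a hypothetical helper (not part of Ferret), it relies on `fork` and so will not run on Windows, and it is written for a current Ruby rather than 1.8. Each batch runs in a short-lived child process, so whatever memory the batch accumulates goes back to the OS when the child exits. In the real script, each child would presumably need to open the index for appending rather than recreating it.

```ruby
# Hypothetical helper: run each batch of items in its own child process so
# that memory accumulated while handling the batch is released when the
# child exits. Requires a platform that supports fork.
def each_batch_in_child(items, batch_size)
  items.each_slice(batch_size) do |batch|
    pid = Process.fork do
      yield batch   # e.g. open the index, add this batch, close the index
      exit!(0)      # leave at once, skipping handlers inherited from the parent
    end
    _, status = Process.wait2(pid)
    raise "batch starting at #{batch.first.inspect} failed" unless status.success?
  end
end

# Usage sketch: the block stands in for the indexing work.
each_batch_in_child(%w[ls cat grep sed awk], 2) do |batch|
  puts "child #{Process.pid} handling #{batch.join(', ')}"
end
```

Waiting for each child before starting the next keeps the batches sequential, which avoids two processes writing to the same index at once.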
On 3/10/07, Caleb Clausen <caleb at inforadical.net> wrote:
> I have a script (below) which attempts to make an index out of all the
> man pages on my system. It takes a while, mostly because it runs man
> over and over... but anyway, as time goes on the memory usage goes up
> and up and never down. Eventually, it runs out of ram and just starts
> thrashing up the swap space, pretty much grinding to a halt.

Hey Caleb,

Running your test for 15 minutes my memory usage climbed to 30Mb. It was still slowly climbing which is not a good sign but not enough to bring my system to a halt. Anyway, I tried using valgrind's memcheck on it and I couldn't find a leak in the Ferret code. Perhaps it is a leak in your version of Ruby, although I doubt it. Here is the most significant output from valgrind with --show-reachable=yes set;

==7636== 110,880 bytes in 6,930 blocks are still reachable in loss record 15 of 20
==7636==    at 0x4020396: malloc (vg_replace_malloc.c:149)
==7636==    by 0x40C175F: st_insert (st.c:288)
==7636==    by 0x40D1E55: rb_ivar_set (variable.c:1056)
==7636==    by 0x40D1FC2: rb_iv_set (variable.c:1959)
==7636==    by 0x40D2003: rb_name_class (variable.c:282)
==7636==    by 0x408BCBB: boot_defclass (object.c:2462)
==7636==    by 0x408D020: Init_Object (object.c:2549)
==7636==    by 0x40798A0: rb_call_inits (inits.c:54)
==7636==    by 0x4061E5C: ruby_init (eval.c:1382)
==7636==    by 0x8048600: main (in /usr/bin/ruby1.8)
==7636==
==7636==
==7636== 187,248 bytes in 11,703 blocks are still reachable in loss record 16 of 20
==7636==    at 0x4020396: malloc (vg_replace_malloc.c:149)
==7636==    by 0x40C184F: st_init_table_with_size (st.c:154)
==7636==    by 0x40C18B6: st_init_strtable_with_size (st.c:193)
==7636==    by 0x4095FBD: Init_sym (parse.y:5885)
==7636==    by 0x4079896: rb_call_inits (inits.c:52)
==7636==    by 0x4061E5C: ruby_init (eval.c:1382)
==7636==    by 0x8048600: main (in /usr/bin/ruby1.8)
==7636==
==7636==
==7636== 514,228 bytes in 11,687 blocks are still reachable in loss record 17 of 20
==7636==    at 0x401F6D5: calloc (vg_replace_malloc.c:279)
==7636==    by 0x40C1870: st_init_table_with_size (st.c:158)
==7636==    by 0x40C1914: st_init_table (st.c:167)
==7636==    by 0x40C196F: st_init_numtable (st.c:173)
==7636==    by 0x40CFEB6: Init_var_tables (variable.c:28)
==7636==    by 0x407989B: rb_call_inits (inits.c:53)
==7636==    by 0x4061E5C: ruby_init (eval.c:1382)
==7636==    by 0x8048600: main (in /usr/bin/ruby1.8)
==7636==
==7636==
==7636== 965,584 bytes in 60,349 blocks are still reachable in loss record 18 of 20
==7636==    at 0x4020396: malloc (vg_replace_malloc.c:149)
==7636==    by 0x40C1692: st_add_direct (st.c:307)
==7636==    by 0x4095D1A: rb_intern (parse.y:6067)
==7636==    by 0x40CFED7: Init_var_tables (variable.c:30)
==7636==    by 0x407989B: rb_call_inits (inits.c:53)
==7636==    by 0x4061E5C: ruby_init (eval.c:1382)
==7636==    by 0x8048600: main (in /usr/bin/ruby1.8)
==7636==
==7636==
==7636== 1,088,800 bytes in 50,609 blocks are still reachable in loss record 19 of 20
==7636==    at 0x4020396: malloc (vg_replace_malloc.c:149)
==7636==    by 0x4074E50: ruby_xmalloc (gc.c:121)
==7636==    by 0x40CF72F: ruby_strdup (util.c:634)
==7636==    by 0x4095CFF: rb_intern (parse.y:6066)
==7636==    by 0x40CFED7: Init_var_tables (variable.c:30)
==7636==    by 0x407989B: rb_call_inits (inits.c:53)
==7636==    by 0x4061E5C: ruby_init (eval.c:1382)
==7636==    by 0x8048600: main (in /usr/bin/ruby1.8)
==7636==
==7636==
==7636== 2,374,520 bytes in 4 blocks are still reachable in loss record 20 of 20
==7636==    at 0x4020396: malloc (vg_replace_malloc.c:149)
==7636==    by 0x40737F9: add_heap (gc.c:351)
==7636==    by 0x4061D74: ruby_init (eval.c:1372)
==7636==    by 0x8048600: main (in /usr/bin/ruby1.8)

As you can see, none of this has anything to do with Ferret. If you haven't used valgrind before and you want to try it there, here is how;

    valgrind --leak-check=yes ruby calebs_test.rb 2> res

You'll probably want to capture the output (like I have here) as it is *very* long for ruby scripts. Lots of warnings from the ruby internals.
Let me know if you try this and you find anything unusual.

Incidentally, I'm not sure what the other bug you are chasing is but it may have something to do with the encoding of the man pages. I don't think they are UTF-8 so if your locale is set to UTF-8 it will cause some problems in the analysis.

Cheers,
Dave

--
Dave Balmain
http://www.davebalmain.com/
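As a rough complement to the valgrind run, one way to tell whether growth is happening inside the Ruby heap is to count the objects that survive a full garbage collection at a few points during the run. The sketch below is an assumption-laden illustration: `live_object_count` is a hypothetical helper, and `ObjectSpace.count_objects` is specific to the standard C implementation of Ruby and is not available on the 1.8 series discussed in this thread. If this number stays flat while the process keeps growing, the growth is more likely outside the Ruby heap (for example in a C extension or in the allocator).

```ruby
# Hypothetical helper: number of Ruby heap slots still in use after a full
# garbage collection.
def live_object_count
  GC.start
  counts = ObjectSpace.count_objects
  counts[:TOTAL] - counts[:FREE]
end

# Usage sketch: sample before and after a chunk of work.
before = live_object_count
junk = Array.new(10_000) { |n| "page #{n}" }  # stand-in for indexing work
after = live_object_count
puts "live objects changed by #{after - before} while holding #{junk.size} strings"
```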
Dave Balmain said:
> Running your test for 15 minutes my memory usage climbed to 30Mb. It
> was still slowly climbing which is not a good sign but not enough to
> bring my system to a halt. Anyway, I tried using valgrind's memcheck
> on it and I couldn't find a leak in the Ferret code. Perhaps it is a
> leak in your version of Ruby, although I doubt it. Here is the most
> significant output from valgrind with --show-reachable=yes set;

Ok, so my ruby is version 1.8.2, kinda old, so maybe there is an old bug in it. Recent experiments on another machine (running a newer ruby, 1.8.5, I think) didn't seem to have the same memory leak.

What version do you run, by the way?

> Incidentally, I'm not sure what the other bug you are chasing is but
> it may have something to do with the encoding of the man pages. I

I know the man output is some encoding I don't understand; I'm just trying to generate a lot of data to feed into ferret. I don't care if it's correct. I'm still having quite a few crashes with ferret, though the situation has improved. I'm trying to reproduce those without handing you my entire codebase. So far, without success. :(

> don't think they are UTF-8 so if your locale is set to UTF-8 it will
> cause some problems in the analysis.

I know I'm not on the UTF-8 locale. Actually, I've been trying to figure out how to set my locale to UTF-8. I don't suppose you'd know? I'm using Debian stable.
On 3/13/07, Caleb Clausen <caleb at inforadical.net> wrote:
> Dave Balmain said:
> > Running your test for 15 minutes my memory usage climbed to 30Mb. It
> > was still slowly climbing which is not a good sign but not enough to
> > bring my system to a halt. Anyway, I tried using valgrind's memcheck
> > on it and I couldn't find a leak in the Ferret code. Perhaps it is a
> > leak in your version of Ruby, although I doubt it. Here is the most
> > significant output from valgrind with --show-reachable=yes set;
>
> Ok, so my ruby is version 1.8.2, kinda old, so maybe there is an old bug
> in it. Recent experiments on another machine (running a newer ruby,
> 1.8.5, I think) didn't seem to have the same memory leak.
>
> What version do you run, by the way?

I'm on 1.8.5.

> > Incidentally, I'm not sure what the other bug you are chasing is but
> > it may have something to do with the encoding of the man pages. I
>
> I know the man output is some encoding I don't understand; I'm just
> trying to generate a lot of data to feed into ferret. I don't care if
> it's correct. I'm still having quite a few crashes with ferret, though
> the situation has improved. I'm trying to reproduce those without
> handing you my entire codebase. So far, without success. :(

Let me know when you do find the problem. It is possible that it has something to do with a mismatch of encodings. Feeding ISO-8859-1 data (which is what my man pages are encoded in) to a UTF-8 analyzer might cause Ferret to crash. I've tried to fix this so that it doesn't happen but I might have missed something.

> > don't think they are UTF-8 so if your locale is set to UTF-8 it will
> > cause some problems in the analysis.
>
> I know I'm not on the UTF-8 locale. Actually, I've been trying to figure
> out how to set my locale to UTF-8. I don't suppose you'd know? I'm using
> Debian stable.

It's not too hard. Something like;

    $ sudo apt-get install debconf
    $ sudo dpkg-reconfigure locales

Cheers,
Dave

--
Dave Balmain
http://www.davebalmain.com/
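The encoding mismatch described above can be illustrated without Ferret at all. The sketch below, written for a current Ruby (the String encoding API did not exist in the 1.8 series), shows why ISO-8859-1 bytes can trip up code that expects UTF-8: an accented Latin-1 character is a single byte that is not a valid UTF-8 sequence on its own. `invalid_utf8?` is a hypothetical helper name, and how Ferret's analyzers actually react to such input is not shown here.

```ruby
# Hypothetical helper: true when the given bytes are not well-formed UTF-8.
def invalid_utf8?(bytes)
  !bytes.dup.force_encoding(Encoding::UTF_8).valid_encoding?
end

latin1_bytes = "caf\xE9".b    # "cafe" with e-acute, encoded as ISO-8859-1
utf8_text    = "caf\u00E9"    # the same word encoded as UTF-8

puts invalid_utf8?(latin1_bytes)  # true: 0xE9 alone is an incomplete UTF-8 sequence
puts invalid_utf8?(utf8_text)     # false
```

A check like this could be run on each document before it is handed to the indexer, to find out whether the crashes line up with mis-encoded input.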
>> It's not too hard. Something like;
>
> $ sudo apt-get install debconf
> $ sudo dpkg-reconfigure locales

On the notion of the locale stuff, would it be possible to create a configuration option that explicitly sets Ferret to UTF-8 mode?

I think that a lot of people have been bitten by this, and an explicit configuration option would IMHO make a lot of sense. With acts_as_ferret it would look maybe like this:

    class A < ActiveRecord::Base
      acts_as_ferret :encoding => 'utf8'
    end

>
> Cheers,
> Dave
>

Regards,
Jonathan

--
Jonathan Weiss
http://blog.innerewut.de
On 3/13/07, Jonathan Weiss <jw at innerewut.de> wrote:
> > It's not too hard. Something like;
> >
> > $ sudo apt-get install debconf
> > $ sudo dpkg-reconfigure locales
>
> On the notion of the locale stuff, would it be possible to create a
> configuration option that explicitly sets Ferret to UTF-8 mode?
>
> I think that a lot of people have been bitten by this and an explicit
> configuration option IMHO make a lot of sense. With acts_as_ferret it
> would look maybe like this
>
> class A < ActiveRecord::Base
>   acts_as_ferret :encoding => 'utf8'
> end

The problem is that this may give people the false impression that Ferret will handle UTF-8 even when they don't have a UTF-8 locale installed. For example, adding this configuration option wouldn't have helped Caleb. I guess one possibility would be to raise an exception if the locale isn't available. You could also automatically convert all text to UTF-8 using iconv. I don't know how much this would help but I would certainly commit a patch along these lines if anyone is up for it.

Cheers,
Dave

--
Dave Balmain
http://www.davebalmain.com/
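The two ideas in this last message (fail loudly when no UTF-8 locale is installed, and convert incoming text to UTF-8) could look roughly like the sketch below. Everything here is an assumption for illustration: none of these helpers exist in Ferret or acts_as_ferret, the locale check only inspects a listing such as the output of `locale -a` passed in as a string, and the conversion uses Ruby's built-in `String#encode` in place of the iconv library mentioned above, since iconv is no longer part of the Ruby standard library.

```ruby
# Hypothetical helper: reinterpret raw bytes as source_encoding and
# transcode them to UTF-8. ISO-8859-1 is assumed as the default source.
def to_utf8(text, source_encoding = Encoding::ISO_8859_1)
  text.dup.force_encoding(source_encoding).encode(Encoding::UTF_8)
end

# Hypothetical helper: does a `locale -a` style listing name a UTF-8 locale?
def utf8_locale_listed?(locale_listing)
  locale_listing.each_line.any? { |line| line.strip =~ /\.utf-?8(@|\z)/i }
end

# Hypothetical guard: raise instead of silently indexing with the wrong locale.
def require_utf8_locale!(locale_listing)
  return if utf8_locale_listed?(locale_listing)
  raise ArgumentError, "no UTF-8 locale is installed"
end

# Usage sketch: on a real system the listing might come from `locale -a`.
listing = "C\nPOSIX\nen_US.utf8\n"
require_utf8_locale!(listing)
puts to_utf8("caf\xE9".b).encoding  # UTF-8
```

Passing the listing in as an argument keeps the check testable and leaves it to the caller to decide how the installed locales are discovered.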