I'm getting a segmentation fault on a large index (15GB). I'm running ferret 0.11.4 on OpenSuSE 10.2 with ruby 1.8.6. The segmentation fault appeared after I optimized the index; see further below for the error message I got before that. Ferret works perfectly on other (smaller) indexes. Is this a known issue, and if so, is there a workaround?

--------------------- after optimizing the index -----------------------

$ irb
irb(main):001:0> require 'rubygems'
=> true
irb(main):002:0> require 'ferret'
=> true
irb(main):003:0> index = Ferret::Index::Index.new(:path => "/tmp/myindex")
=> #<Ferret::Index::Index:0xb7b23330 @writer=nil, @mon_entering_queue=[], @default_input_field=:id, @mon_count=0, @qp=nil, @default_field=:*, @options={:dir=>#<Ferret::Store::FSDirectory:0xb7b23308>, :path=>"/tmp/myindex", :lock_retry_time=>2, :analyzer=>#<Ferret::Analysis::StandardAnalyzer:0xb7b23268>, :default_field=>:*}, @mon_owner=nil, @auto_flush=false, @open=true, @dir=#<Ferret::Store::FSDirectory:0xb7b23308>, @id_field=:id, @searcher=nil, @mon_waiting_queue=[], @reader=nil, @key=nil, @close_dir=true>
irb(main):004:0> index.search_each("*:foo") {|id, score| doc index[id].load; puts doc.inspect}
/usr/local/lib/ruby/gems/1.8/gems/ferret-0.11.4/lib/ferret/index.rb:411: [BUG] Segmentation fault
ruby 1.8.6 (2007-03-13) [i686-linux]
Aborted

---------------------- before optimizing the index ---------------------

IOError (IO Error occured at <except.c>:93 in xraise
Error occured in fs_store.c:293 - fsi_seek_i
seeking pos -1175113459: <Invalid argument>
):
/usr/local/lib/ruby/gems/1.8/gems/ferret-0.11.4/lib/ferret/index.rb:411:in `[]'
/usr/local/lib/ruby/gems/1.8/gems/ferret-0.11.4/lib/ferret/index.rb:411:in `[]'
/usr/local/lib/ruby/1.8/monitor.rb:238:in `synchronize'
/usr/local/lib/ruby/gems/1.8/gems/ferret-0.11.4/lib/ferret/index.rb:403:in `[]'
/app/controllers/search_controller.rb:133:in `do_search'
/usr/local/lib/ruby/gems/1.8/gems/ferret-0.11.4/lib/ferret/index.rb:385:in `search_each'
/usr/local/lib/ruby/gems/1.8/gems/ferret-0.11.4/lib/ferret/index.rb:384:in `search_each'
/usr/local/lib/ruby/1.8/monitor.rb:238:in `synchronize'
/usr/local/lib/ruby/gems/1.8/gems/ferret-0.11.4/lib/ferret/index.rb:380:in `search_each'
/app/controllers/search_controller.rb:131:in `do_search'
/app/controllers/search_controller.rb:54:in `index'
/usr/local/lib/ruby/1.8/benchmark.rb:293:in `measure'
/app/controllers/search_controller.rb:53:in `index'
/usr/local/lib/ruby/1.8/benchmark.rb:293:in `measure'
/app/controllers/search_controller.rb:19:in `index'
/usr/local/lib/ruby/gems/1.8/gems/actionpack-1.13.3/lib/action_controller/base.rb:1095:in `send'
/usr/local/lib/ruby/gems/1.8/gems/actionpack-1.13.3/lib/action_controller/base.rb:1095:in `perform_action_without_filters'
/usr/local/lib/ruby/gems/1.8/gems/actionpack-1.13.3/lib/action_controller/filters.rb:632:in `call_filter'
/usr/local/lib/ruby/gems/1.8/gems/actionpack-1.13.3/lib/action_controller/filters.rb:619:in `perform_action_without_benchmark'
/usr/local/lib/ruby/gems/1.8/gems/actionpack-1.13.3/lib/action_controller/benchmarking.rb:66:in `perform_action_without_rescue'
/usr/local/lib/ruby/1.8/benchmark.rb:293:in `measure'
/usr/local/lib/ruby/gems/1.8/gems/actionpack-1.13.3/lib/action_controller/benchmarking.rb:66:in `perform_action_without_rescue'
/usr/local/lib/ruby/gems/1.8/gems/actionpack-1.13.3/lib/action_controller/rescue.rb:83:in `perform_action'
/usr/local/lib/ruby/gems/1.8/gems/actionpack-1.13.3/lib/action_controller/base.rb:430:in `send'
/usr/local/lib/ruby/gems/1.8/gems/actionpack-1.13.3/lib/action_controller/base.rb:430:in `process_without_filters'
/usr/local/lib/ruby/gems/1.8/gems/actionpack-1.13.3/lib/action_controller/filters.rb:624:in `process_without_session_management_support'
/usr/local/lib/ruby/gems/1.8/gems/actionpack-1.13.3/lib/action_controller/session_management.rb:114:in `process'
/usr/local/lib/ruby/gems/1.8/gems/actionpack-1.13.3/lib/action_controller/base.rb:330:in `process'
/usr/local/lib/ruby/gems/1.8/gems/rails-1.2.3/lib/dispatcher.rb:41:in `dispatch'
/usr/local/lib/ruby/gems/1.8/gems/rails-1.2.3/lib/fcgi_handler.rb:168:in `process_request'
/usr/local/lib/ruby/gems/1.8/gems/rails-1.2.3/lib/fcgi_handler.rb:143:in `process_each_request!'
/usr/local/lib/ruby/gems/1.8/gems/rails-1.2.3/lib/fcgi_handler.rb:109:in `with_signal_handler'
/usr/local/lib/ruby/gems/1.8/gems/rails-1.2.3/lib/fcgi_handler.rb:142:in `process_each_request!'
/usr/local/lib/ruby/gems/1.8/gems/fcgi-0.8.7/lib/fcgi.rb:612:in `each_cgi'
/usr/local/lib/ruby/gems/1.8/gems/fcgi-0.8.7/lib/fcgi.rb:609:in `each'
/usr/local/lib/ruby/gems/1.8/gems/fcgi-0.8.7/lib/fcgi.rb:609:in `each_cgi'
/usr/local/lib/ruby/gems/1.8/gems/rails-1.2.3/lib/fcgi_handler.rb:141:in `process_each_request!'
/usr/local/lib/ruby/gems/1.8/gems/rails-1.2.3/lib/fcgi_handler.rb:55:in `process!'
/usr/local/lib/ruby/gems/1.8/gems/rails-1.2.3/lib/fcgi_handler.rb:25:in `process!'
/ma/www/virtual/ferret.marketaudit.no/Site/public/dispatch.fcgi:24

--
Best regards,
Stian Grytøyr
Jeremy Hinegardner
2007-May-14 03:48 UTC
[Ferret-talk] Ferret Query Language as categorizer?
Hi all,

I'm looking at using Ferret for categorizing documents. Essentially what I have are thousands of query rules: if a document matches a rule, then it belongs to the category that is associated with that rule. Normally what we all do is have documents indexed and then run a query against the index to get back the documents that match the query.

What I want to do is the inverse. I have thousands of queries and I want to run all of them against one document at a time. The queries that match the document essentially categorize the document into the associated category.

Yes, I am aware that this may not be the best way to approach a categorization problem, but it is a portion of how our current system works, and I want to investigate ways to replace it and move on to a better mechanism for categorization.

I'm considering using our current query language and having it be a DSL that generates Ruby code. Essentially my first whack at using Ferret for this was the following:

  doc = File.read(OPTIONS.input_file)
  Ferret::I.new do |index|
    index << doc
    FasterCSV.foreach(OPTIONS.category_csv, { :headers => headers }) do |row|
      next unless row[:boolean]
      top_docs = index.search(row[:boolean])
      if top_docs.hits.size > 0 then
        puts "Matches : #{row[:name]}"
      end
    end
  end

Short and sweet, eh? Basically I'm looking for suggestions on better ways to have thousands of Ferret queries (as FQL) run against a single document. Are there other approaches that would be better? API calls that would do this more efficiently? A means to serialize FQL so that it doesn't have to be parsed?

Thoughts, comments, rants, raves, brainstorms?

enjoy,

-jeremy

--
=======================================================================
Jeremy Hinegardner                              jeremy at hinegardner.org
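[Editor's note: the loop above can be sketched in plain, runnable Ruby by stubbing the Ferret call with a substring matcher, so the control flow is visible on its own. `CATEGORY_RULES`, `categorize`, and the matcher are hypothetical stand-ins; in the real version the match test would be an `index.search` call.]

```ruby
# Hypothetical sketch of the categorizer loop, with Ferret stubbed out by a
# plain substring matcher. CATEGORY_RULES stands in for the rows read from
# OPTIONS.category_csv.
CATEGORY_RULES = [
  { :name => "animals", :boolean => "fox" },
  { :name => "weather", :boolean => "rain" },
  { :name => "speed",   :boolean => "quick" },
]

def categorize(doc, rules)
  # In the real version this predicate would be an index.search call on a
  # one-document in-memory index, checking whether anything matched.
  matches = lambda { |rule| doc.include?(rule[:boolean]) }
  rules.select(&matches).map { |rule| rule[:name] }
end

doc = "the quick brown fox jumped over the lazy dog"
puts categorize(doc, CATEGORY_RULES).inspect  # => ["animals", "speed"]
```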
Hi Jeremy,

interesting approach. You might build your Query objects once by calling QueryParser#parse, serialize these Query objects and reuse them. IMHO your problem wouldn't be query parsing but the number of queries that you are issuing against each document. On the other hand, Ferret is quite fast, and it may work out if your process is not that time critical.

Have you considered combining queries? Ferret's query language is quite powerful, and you might bring down the number of queries if you combine the ones that only feed a single categorization anyway. Check out the QueryParser API regarding this approach.

At least the lines

  top_docs = index.search(row[:boolean])
  if top_docs.hits.size > 0 then

should read

  if index.search(row[:boolean]).total_hits > 0

so that you don't need to read in the hits array just to get its size.

As a last tip, you might be interested in the underlying code of aaf's more_like_this method to get the most used terms in your documents. That might let your categorizations "learn" while documents get categorized.

Cheers,
Jan

2007/5/14, Jeremy Hinegardner <jeremy at hinegardner.org>:
> Hi all,
>
> I'm looking at using Ferret for categorizing documents.
> Essentially what I have are thousands of query rules that if a document
> matches, then it belongs to the category that is associated with that
> rule.
<snip>
> Thought, comments, rants, raves, brainstorms?
>
> enjoy,
>
> -jeremy

--
Jan Prill
Rechtsanwalt
Grünebergstraße 38
22763 Hamburg
Tel +49 (0)40 41265809
Fax +49 (0)40 380178-73022
Mobil +49 (0)171 3516667
http://www.inviado.de
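[Editor's note: Jan's parse-once suggestion can be sketched in runnable Ruby with the real QueryParser stubbed by a toy parser (OR of bare terms). `ToyQueryParser` is hypothetical; in real code the value cached per rule would be the Query object returned by QueryParser#parse, passed straight to Index#search.]

```ruby
# Toy stand-in for Ferret::QueryParser: parses "a OR b" into a callable
# predicate. The point is the caching pattern, not the parser itself.
class ToyQueryParser
  def parse(fql)
    terms = fql.split(" OR ")
    lambda { |doc| terms.any? { |t| doc.include?(t) } }
  end
end

parser = ToyQueryParser.new
rules  = { "animals" => "fox OR hound", "weather" => "rain OR snow" }

# Parse each rule once, up front -- not once per document.
compiled = rules.map { |name, fql| [name, parser.parse(fql)] }

%w[dog fox snow].each do |doc|
  hits = compiled.select { |_, query| query.call(doc) }.map(&:first)
  puts "#{doc}: #{hits.inspect}"
end
```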
Jeremy Hinegardner wrote:
> Hi all,
>
> I'm looking at using Ferret for categorizing documents.
> Essentially what I have are thousands of query rules that if a document
> matches, then it belongs to the category that is associated with that
> rule. Normally what we all do is have documents indexed and then run a
> query against the index to get back the documents that match the query.
>
> What I want to do is the inverse. I have thousands of queries and I
> want to run all of them against one document at a time. The queries
> that match the document essentially categorize the document into the
> associated category.
<snip>
> Thought, comments, rants, raves, brainstorms?

Random thought that might or might not work, depending on whether your queries are simple enough and how much data you want back: just invert the problem. Store the queries in Ferret, and treat your document as the query. Random example:

  irb(main):015:0> index = Index::Index.new
  irb(main):016:0> index << "hat"
  irb(main):017:0> index << "fox"
  irb(main):018:0> doc = "the quick brown fox jumped over the lazy dog"
  irb(main):019:0> index.search_each(doc) { |id, score| puts index[id].load.to_yaml + score.to_s }
  --- !map:Ferret::Index::LazyDoc
  :id: fox
  0.0425622686743736
  => 1

I've got absolutely no idea how well the query parser will handle larger documents, but it's worth a try...

--
Alex
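[Editor's note: Alex's inversion can be reduced to a runnable toy, with the Ferret index replaced by a plain Hash: each "stored query" is a keyword mapped to its category, and "searching with the document" becomes a lookup per token. `STORED_QUERIES` and `categories_for` are hypothetical names.]

```ruby
# Each stored "query" is a single keyword; categorizing a document means
# tokenizing it and collecting the categories of any matching keywords.
STORED_QUERIES = { "fox" => "animals", "hat" => "clothing" }

def categories_for(doc)
  doc.downcase.scan(/\w+/).map { |token| STORED_QUERIES[token] }.compact.uniq
end

puts categories_for("The quick brown fox jumped over the lazy dog").inspect
# => ["animals"]
```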
Jeremy Hinegardner
2007-May-14 18:55 UTC
[Ferret-talk] Ferret Query Language as categorizer?
On Mon, May 14, 2007 at 11:11:50AM +0100, Alex Young wrote:
> Jeremy Hinegardner wrote:
> > Hi all,
> >
> > I'm looking at using Ferret for categorizing documents.
<snip>
> Random thought that might or might not work, depending on whether your
> queries are simple enough and how much data you want back: just invert
> the problem. Store the queries in Ferret, and treat your document as
> the query.
>
> I've got absolutely no idea how well the query parser will handle larger
> documents, but it's worth a try...

I did give some thought to this, but we have some fairly complex categorization queries, some of which are the equivalent of SpanTermQuery. Since there is no FQL for those types of queries yet, I don't think your approach will work for me. But it is a good idea.

enjoy,

-jeremy

--
=======================================================================
Jeremy Hinegardner                              jeremy at hinegardner.org
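[Editor's note: for readers unfamiliar with span queries, this is roughly the check a NEAR-style rule has to compute, sketched in plain runnable Ruby: two terms count as "near" when their token positions differ by at most a slop value. The function name and tokenizer are hypothetical; Ferret's span queries work against the positions stored in the index rather than re-tokenizing the text.]

```ruby
# Returns true when term_a and term_b occur within `slop` token positions
# of each other anywhere in doc.
def near?(doc, term_a, term_b, slop)
  tokens = doc.downcase.scan(/\w+/)
  pos_a  = tokens.each_index.select { |i| tokens[i] == term_a }
  pos_b  = tokens.each_index.select { |i| tokens[i] == term_b }
  pos_a.product(pos_b).any? { |i, j| (i - j).abs <= slop }
end

doc = "the quick brown fox jumped over the lazy dog"
puts near?(doc, "fox", "dog", 5)    # => true  (positions 3 and 8)
puts near?(doc, "quick", "dog", 5)  # => false (positions 1 and 8)
```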
Jeremy Hinegardner
2007-May-14 19:01 UTC
[Ferret-talk] Ferret Query Language as categorizer?
On Mon, May 14, 2007 at 08:00:02AM +0000, Jan Prill wrote:
> interesting approach. You might build your Query objects once by calling
> QueryParser#parse, serialize these Query objects and reuse them.

Yup, that's one item I need to look into. One of the issues is that the query language we're using right now has 'NEAR' keywords, so we'll need to convert those into SpanTermQuerys. I'm thinking of having the DSL generate Ruby code, then serializing those Query objects, or maybe just running them as code.

> IMHO your problem wouldn't be query parsing but the number of queries that
> you are issuing against each document. On the other hand, Ferret is quite
> fast, and it may work out if your process is not that time critical. Have
> you considered combining queries? Ferret's query language is quite powerful,
> and you might bring down the number of queries if you combine the ones that
> only feed a single categorization anyway. Check out the QueryParser API
> regarding this approach.

I will investigate the API more. Currently we don't have multiple queries that equate to a single category; it's a one-to-one relationship between category and query. The speed of my initial experiments is within our tolerances, but may not be good for serial execution. Of course, since all of this is in a single memory index, per document, it could be parallelized.

> At least the lines
>
>   top_docs = index.search(row[:boolean])
>   if top_docs.hits.size > 0 then
>
> should read "if index.search(row[:boolean]).total_hits > 0" so that you
> don't need to read in the hits array just to get its size.

Good tip, thanks.

> As a last tip, you might be interested in the underlying code of aaf's
> more_like_this method to get the most used terms in your documents. That
> might let your categorizations "learn" while documents get categorized.

I will definitely check more into that. Who knows, maybe a categorization engine based on Ferret will fall out of this :-)

enjoy,

-jeremy

--
=======================================================================
Jeremy Hinegardner                              jeremy at hinegardner.org
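[Editor's note: the "serialize the parsed rules once, load them later" idea can be sketched with Marshal and a plain Struct standing in for a parsed rule. Whether Ferret's C-backed Query objects survive Marshal.dump is an open question worth testing before relying on it; data-only rule objects like this one do round-trip.]

```ruby
# A parsed NEAR rule reduced to plain data: category name, the terms
# involved, and the allowed slop. Marshal round-trips this faithfully.
Rule = Struct.new(:category, :terms, :slop)

rules = [
  Rule.new("animals", %w[fox hound], 4),
  Rule.new("weather", %w[rain snow], 2),
]

blob     = Marshal.dump(rules)   # write this to disk once, at build time
restored = Marshal.load(blob)    # load it in each worker process

puts restored.first.category  # => "animals"
puts(restored == rules)       # => true (Structs compare by member values)
```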