I'm getting a segmentation fault on a large index (15GB). I'm running
ferret 0.11.4 on OpenSuSE 10.2 with ruby 1.8.6. The segmentation fault
appeared after I optimized the index; see further below for the error
message I got before that. Ferret works perfectly on other (smaller)
indexes.
Is this a known issue, and if so, is there a workaround?
--------------------- after optimizing the index -----------------------
$ irb
irb(main):001:0> require 'rubygems'
=> true
irb(main):002:0> require 'ferret'
=> true
irb(main):003:0> index = Ferret::Index::Index.new(:path =>
"/tmp/myindex")
=> #<Ferret::Index::Index:0xb7b23330 @writer=nil,
@mon_entering_queue=[], @default_input_field=:id, @mon_count=0,
@qp=nil, @default_field=:*,
@options={:dir=>#<Ferret::Store::FSDirectory:0xb7b23308>,
:path=>"/tmp/myindex", :lock_retry_time=>2,
:analyzer=>#<Ferret::Analysis::StandardAnalyzer:0xb7b23268>,
:default_field=>:*}, @mon_owner=nil, @auto_flush=false, @open=true,
@dir=#<Ferret::Store::FSDirectory:0xb7b23308>, @id_field=:id,
@searcher=nil, @mon_waiting_queue=[], @reader=nil, @key=nil,
@close_dir=true>
irb(main):004:0> index.search_each("*:foo") {|id, score| doc =
index[id].load; puts doc.inspect}
/usr/local/lib/ruby/gems/1.8/gems/ferret-0.11.4/lib/ferret/index.rb:411:
[BUG] Segmentation fault
ruby 1.8.6 (2007-03-13) [i686-linux]
Aborted
---------------------- before optimizing the index ---------------------
IOError (IO Error occured at <except.c>:93 in xraise
Error occured in fs_store.c:293 - fsi_seek_i
seeking pos -1175113459: <Invalid argument>
):
/usr/local/lib/ruby/gems/1.8/gems/ferret-0.11.4/lib/ferret/index.rb:411:in
`[]'
/usr/local/lib/ruby/gems/1.8/gems/ferret-0.11.4/lib/ferret/index.rb:411:in
`[]'
/usr/local/lib/ruby/1.8/monitor.rb:238:in `synchronize'
/usr/local/lib/ruby/gems/1.8/gems/ferret-0.11.4/lib/ferret/index.rb:403:in
`[]'
/app/controllers/search_controller.rb:133:in `do_search'
/usr/local/lib/ruby/gems/1.8/gems/ferret-0.11.4/lib/ferret/index.rb:385:in
`search_each'
/usr/local/lib/ruby/gems/1.8/gems/ferret-0.11.4/lib/ferret/index.rb:384:in
`search_each'
/usr/local/lib/ruby/1.8/monitor.rb:238:in `synchronize'
/usr/local/lib/ruby/gems/1.8/gems/ferret-0.11.4/lib/ferret/index.rb:380:in
`search_each'
/app/controllers/search_controller.rb:131:in `do_search'
/app/controllers/search_controller.rb:54:in `index'
/usr/local/lib/ruby/1.8/benchmark.rb:293:in `measure'
/app/controllers/search_controller.rb:53:in `index'
/usr/local/lib/ruby/1.8/benchmark.rb:293:in `measure'
/app/controllers/search_controller.rb:19:in `index'
/usr/local/lib/ruby/gems/1.8/gems/actionpack-1.13.3/lib/action_controller/base.rb:1095:in
`send'
/usr/local/lib/ruby/gems/1.8/gems/actionpack-1.13.3/lib/action_controller/base.rb:1095:in
`perform_action_without_filters'
/usr/local/lib/ruby/gems/1.8/gems/actionpack-1.13.3/lib/action_controller/filters.rb:632:in
`call_filter'
/usr/local/lib/ruby/gems/1.8/gems/actionpack-1.13.3/lib/action_controller/filters.rb:619:in
`perform_action_without_benchmark'
/usr/local/lib/ruby/gems/1.8/gems/actionpack-1.13.3/lib/action_controller/benchmarking.rb:66:in
`perform_action_without_rescue'
/usr/local/lib/ruby/1.8/benchmark.rb:293:in `measure'
/usr/local/lib/ruby/gems/1.8/gems/actionpack-1.13.3/lib/action_controller/benchmarking.rb:66:in
`perform_action_without_rescue'
/usr/local/lib/ruby/gems/1.8/gems/actionpack-1.13.3/lib/action_controller/rescue.rb:83:in
`perform_action'
/usr/local/lib/ruby/gems/1.8/gems/actionpack-1.13.3/lib/action_controller/base.rb:430:in
`send'
/usr/local/lib/ruby/gems/1.8/gems/actionpack-1.13.3/lib/action_controller/base.rb:430:in
`process_without_filters'
/usr/local/lib/ruby/gems/1.8/gems/actionpack-1.13.3/lib/action_controller/filters.rb:624:in
`process_without_session_management_support'
/usr/local/lib/ruby/gems/1.8/gems/actionpack-1.13.3/lib/action_controller/session_management.rb:114:in
`process'
/usr/local/lib/ruby/gems/1.8/gems/actionpack-1.13.3/lib/action_controller/base.rb:330:in
`process'
/usr/local/lib/ruby/gems/1.8/gems/rails-1.2.3/lib/dispatcher.rb:41:in
`dispatch'
/usr/local/lib/ruby/gems/1.8/gems/rails-1.2.3/lib/fcgi_handler.rb:168:in
`process_request'
/usr/local/lib/ruby/gems/1.8/gems/rails-1.2.3/lib/fcgi_handler.rb:143:in
`process_each_request!'
/usr/local/lib/ruby/gems/1.8/gems/rails-1.2.3/lib/fcgi_handler.rb:109:in
`with_signal_handler'
/usr/local/lib/ruby/gems/1.8/gems/rails-1.2.3/lib/fcgi_handler.rb:142:in
`process_each_request!'
/usr/local/lib/ruby/gems/1.8/gems/fcgi-0.8.7/lib/fcgi.rb:612:in
`each_cgi'
/usr/local/lib/ruby/gems/1.8/gems/fcgi-0.8.7/lib/fcgi.rb:609:in
`each'
/usr/local/lib/ruby/gems/1.8/gems/fcgi-0.8.7/lib/fcgi.rb:609:in
`each_cgi'
/usr/local/lib/ruby/gems/1.8/gems/rails-1.2.3/lib/fcgi_handler.rb:141:in
`process_each_request!'
/usr/local/lib/ruby/gems/1.8/gems/rails-1.2.3/lib/fcgi_handler.rb:55:in
`process!'
/usr/local/lib/ruby/gems/1.8/gems/rails-1.2.3/lib/fcgi_handler.rb:25:in
`process!'
/ma/www/virtual/ferret.marketaudit.no/Site/public/dispatch.fcgi:24
--
Best regards,
Stian Grytøyr
Jeremy Hinegardner
2007-May-14 03:48 UTC
[Ferret-talk] Ferret Query Language as categorizer?
Hi all,

I'm looking at using Ferret for categorizing documents.
Essentially what I have are thousands of query rules; if a document
matches a rule, then it belongs to the category that is associated with
that rule. Normally what we all do is have documents indexed and then
run a query against the index to get back the documents that match the
query.

What I want to do is the inverse. I have thousands of queries and I
want to run all of them against one document at a time. The queries
that match the document essentially categorize the document into the
associated category.

Yes, I am aware that this may not be the best way to approach a
categorization problem, but it is a portion of how our current system
works and I want to investigate ways to replace it and move on to a
better mechanism for categorization.

I'm considering using our current query language and having it be a DSL
that generates Ruby code.

Essentially, my first whack at using Ferret for this was the
following:
doc = File.read(OPTIONS.input_file)
Ferret::I.new do |index|
  index << doc
  FasterCSV.foreach(OPTIONS.category_csv, { :headers => headers }) do |row|
    next unless row[:boolean]
    top_docs = index.search(row[:boolean])
    if top_docs.hits.size > 0 then
      puts "Matches : #{row[:name]}"
    end
  end
end

Short and sweet, eh? Basically I'm looking for suggestions on better
ways to have thousands of Ferret queries (as FQL) run against a single
document. Are there other approaches that would be better? API calls
that would do this more efficiently? Means to serialize FQL so that it
doesn't have to be parsed?

Thoughts, comments, rants, raves, brainstorms?
enjoy,
-jeremy
--
=======================================================================
Jeremy Hinegardner                              jeremy at hinegardner.org
Hi Jeremy,
interesting approach. You might build your Query objects once by calling
QueryParser#parse, serialize these Query objects, and reuse them.
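
For instance (just a sketch: CATEGORY_RULES is a made-up name for your
{category => FQL string} mapping, and whether Ferret's C-backed Query
objects survive Marshal is something you'd have to test):

  parser = Ferret::QueryParser.new(:default_field => :content)

  # Parse every rule string once, up front, instead of per document.
  compiled = CATEGORY_RULES.map { |name, fql| [name, parser.parse(fql)] }

  # Index#search also accepts a pre-built Query object, so no
  # re-parsing happens when you run the rules against a document.
  compiled.each do |name, query|
    puts "Matches : #{name}" if index.search(query).total_hits > 0
  end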

IMHO your problem wouldn't be query parsing but the number of queries
that you are issuing on each document. On the other hand, Ferret is
quite fast, and it may work out if your process is not that
time-critical. Have you considered combining queries? Ferret's query
language is quite powerful, and you might bring down the number of
queries if you combine the queries that are useful to only one
categorization anyway. Check out the QueryParser API regarding this
approach.
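
For example (again just a sketch, with rules_for_category standing in
for however you group the FQL strings that map to one category),
several rules can be OR'd into a single query before parsing:

  combined = rules_for_category.map { |fql| "(#{fql})" }.join(" OR ")
  puts "Matches : #{category}" if index.search(combined).total_hits > 0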

At least the lines

  top_docs = index.search(row[:boolean])
  if top_docs.hits.size > 0 then

should read "if index.search(row[:boolean]).total_hits > 0" so that you
don't need to read in the hits array to get the size.
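
Applied to the loop above, that becomes (same behaviour, but without
building the hits array just to test for a match):

  next unless row[:boolean]
  if index.search(row[:boolean]).total_hits > 0
    puts "Matches : #{row[:name]}"
  end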

As a last tip, you might be interested in the underlying code of the
more_like_this method of aaf to get the most used terms in your
documents. This might let your categorizations "learn" while documents
get categorized.
Cheers,
Jan
2007/5/14, Jeremy Hinegardner <jeremy at hinegardner.org>:
>
> Hi all,
>
> I'm looking at using Ferret for categorizing documents.
> Essentially what I have are thousands of query rules; if a document
> matches a rule, then it belongs to the category that is associated with
> that rule. Normally what we all do is have documents indexed and then
> run a query against the index to get back the documents that match the
> query.
>
> What I want to do is the inverse. I have thousands of queries and I
> want to run all of them against one document at a time. The queries
> that match the document essentially categorize the document into the
> associated category.
>
> Yes, I am aware that this may not be the best way to approach a
> categorization problem, but it is a portion of how our current system
> works and I want to investigate ways to replace it and move on to a
> better mechanism for categorization.
>
> I'm considering using our current query language and having it be a DSL
> that generates Ruby code.
>
> Essentially, my first whack at using Ferret for this was the
> following:
>
> doc = File.read(OPTIONS.input_file)
> Ferret::I.new do |index|
>   index << doc
>   FasterCSV.foreach(OPTIONS.category_csv, { :headers => headers }) do |row|
>     next unless row[:boolean]
>     top_docs = index.search(row[:boolean])
>     if top_docs.hits.size > 0 then
>       puts "Matches : #{row[:name]}"
>     end
>   end
> end
>
> Short and sweet, eh? Basically I'm looking for suggestions on better
> ways to have thousands of Ferret queries (as FQL) run against a single
> document. Are there other approaches that would be better? API calls
> that would do this more efficiently? Means to serialize FQL so that it
> doesn't have to be parsed?
>
> Thoughts, comments, rants, raves, brainstorms?
>
> enjoy,
>
> -jeremy
>
> --
> =======================================================================
> Jeremy Hinegardner                            jeremy at hinegardner.org
>
> _______________________________________________
> Ferret-talk mailing list
> Ferret-talk at rubyforge.org
> http://rubyforge.org/mailman/listinfo/ferret-talk
>
--
Jan Prill
Rechtsanwalt
Grünebergstraße 38
22763 Hamburg
Tel +49 (0)40 41265809 Fax +49 (0)40 380178-73022
Mobil +49 (0)171 3516667
http://www.inviado.de
Alex Young
2007-May-14 10:11 UTC
[Ferret-talk] Ferret Query Language as categorizer?
Jeremy Hinegardner wrote:
> Hi all,
>
> I'm looking at using Ferret for categorizing documents.
> Essentially what I have are thousands of query rules; if a document
> matches a rule, then it belongs to the category that is associated with
> that rule. Normally what we all do is have documents indexed and then
> run a query against the index to get back the documents that match the
> query.
>
> What I want to do is the inverse. I have thousands of queries and I
> want to run all of them against one document at a time. The queries
> that match the document essentially categorize the document into the
> associated category.
<snip>
> Thoughts, comments, rants, raves, brainstorms?
Random thought that might or might not work, depending on whether your
queries are simple enough and how much data you want back: just invert
the problem. Store the queries in Ferret, and treat your document as
the query. Random example:

irb(main):015:0> index = Index::Index.new
irb(main):016:0> index << "hat"
irb(main):017:0> index << "fox"
irb(main):018:0> doc = "the quick brown fox jumped over the lazy dog"
irb(main):019:0> index.search_each(doc) { |id, score| puts
index[id].load.to_yaml + score.to_s }
--- !map:Ferret::Index::LazyDoc
:id: fox
0.0425622686743736
=> 1

I've got absolutely no idea how well the query parser will handle larger
documents, but it's worth a try...
--
Alex
Jeremy Hinegardner
2007-May-14 18:55 UTC
[Ferret-talk] Ferret Query Language as categorizer?
On Mon, May 14, 2007 at 11:11:50AM +0100, Alex Young wrote:
> Jeremy Hinegardner wrote:
> > Hi all,
> >
> > I'm looking at using Ferret for categorizing documents.
> > Essentially what I have are thousands of query rules; if a document
> > matches a rule, then it belongs to the category that is associated
> > with that rule. Normally what we all do is have documents indexed
> > and then run a query against the index to get back the documents
> > that match the query.
> >
> > What I want to do is the inverse. I have thousands of queries and I
> > want to run all of them against one document at a time. The queries
> > that match the document essentially categorize the document into the
> > associated category.
> <snip>
> > Thoughts, comments, rants, raves, brainstorms?
> Random thought that might or might not work, depending on whether your
> queries are simple enough and how much data you want back: just invert
> the problem. Store the queries in Ferret, and treat your document as
> the query. Random example:
>
> irb(main):015:0> index = Index::Index.new
> irb(main):016:0> index << "hat"
> irb(main):017:0> index << "fox"
> irb(main):018:0> doc = "the quick brown fox jumped over the lazy dog"
> irb(main):019:0> index.search_each(doc) { |id, score| puts
> index[id].load.to_yaml + score.to_s }
> --- !map:Ferret::Index::LazyDoc
> :id: fox
> 0.0425622686743736
> => 1
>
> I've got absolutely no idea how well the query parser will handle
> larger documents, but it's worth a try...

I did give some thought to this, but we have some fairly complex
categorization queries, some of which are the equivalent of
SpanTermQuery. Since there is no FQL for those types of queries yet, I
don't think your approach will work for me. But it is a good idea.

enjoy,
-jeremy

--
=======================================================================
Jeremy Hinegardner                              jeremy at hinegardner.org
Jeremy Hinegardner
2007-May-14 19:01 UTC
[Ferret-talk] Ferret Query Language as categorizer?
On Mon, May 14, 2007 at 08:00:02AM +0000, Jan Prill wrote:
> interesting approach. You might build your Query objects once by
> calling QueryParser#parse, serialize these Query objects, and reuse
> them.

Yup, that's one item I need to look into. One of the issues is that the
query language we're using right now has 'NEAR' keywords, so we'll need
to convert those into SpanTermQuery's (a rough sketch of such a
conversion follows below). I'm thinking of having the DSL generate Ruby
code, then serializing those Query objects, or maybe just running them
as code.

> IMHO your problem wouldn't be query parsing but the number of queries
> that you are issuing on each document. On the other hand, Ferret is
> quite fast, and it may work out if your process is not that
> time-critical. Have you considered combining queries? Ferret's query
> language is quite powerful, and you might bring down the number of
> queries if you combine the queries that are useful to only one
> categorization anyway. Check out the QueryParser API regarding this
> approach.

I will investigate the API more. Currently we don't have multiple
queries that equate to a single category; it's a one-to-one relationship
between category and query. The speed of my initial experiments is
within our tolerances, but may not be good enough for serial execution.
Of course, since all of this is in a single memory index, per document,
it could be parallelized.

> At least the lines
>
>   top_docs = index.search(row[:boolean])
>   if top_docs.hits.size > 0 then
>
> should read "if index.search(row[:boolean]).total_hits > 0" so that
> you don't need to read in the hits array to get the size.

Good tip, thanks.

> As a last tip, you might be interested in the underlying code of the
> more_like_this method of aaf to get the most used terms in your
> documents. This might let your categorizations "learn" while documents
> get categorized.

I will definitely check more into that. Who knows, maybe a
categorization engine based on Ferret will fall out of this :-)

enjoy,
-jeremy

--
=======================================================================
Jeremy Hinegardner                              jeremy at hinegardner.org
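
For the NEAR-to-span conversion mentioned above, a rough sketch of what
the DSL-generated Ruby could look like. This assumes Ferret 0.11's
Ferret::Search::Spans API (SpanTermQuery.new(field, term) and
SpanNearQuery.new taking :clauses, :slop, and :in_order options); the
helper name near_query is made up, so verify the signatures against the
rdoc for your Ferret version:

  require 'rubygems'
  require 'ferret'
  include Ferret::Search::Spans

  # Hypothetical translation of a rule like `foo NEAR bar`: match
  # documents where the two terms occur within `slop` positions of
  # each other, in either order.
  def near_query(field, word_a, word_b, slop = 5)
    SpanNearQuery.new(:clauses => [SpanTermQuery.new(field, word_a),
                                   SpanTermQuery.new(field, word_b)],
                      :slop => slop,
                      :in_order => false)
  end

  # e.g. index.search(near_query(:content, "ferret", "categorize"))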