I'm getting a segmentation fault on a large index (15GB). I'm running ferret 0.11.4 on OpenSuSE 10.2 with ruby 1.8.6. The segmentation fault appeared after I optimized the index; see further below for the error message I got before that. Ferret works perfectly on other (smaller) indexes. Is this a known issue, and if so, is there a workaround?

--------------------- after optimizing the index -----------------------

$ irb
irb(main):001:0> require 'rubygems'
=> true
irb(main):002:0> require 'ferret'
=> true
irb(main):003:0> index = Ferret::Index::Index.new(:path => "/tmp/myindex")
=> #<Ferret::Index::Index:0xb7b23330 @writer=nil, @mon_entering_queue=[], @default_input_field=:id, @mon_count=0, @qp=nil, @default_field=:*, @options={:dir=>#<Ferret::Store::FSDirectory:0xb7b23308>, :path=>"/tmp/myindex", :lock_retry_time=>2, :analyzer=>#<Ferret::Analysis::StandardAnalyzer:0xb7b23268>, :default_field=>:*}, @mon_owner=nil, @auto_flush=false, @open=true, @dir=#<Ferret::Store::FSDirectory:0xb7b23308>, @id_field=:id, @searcher=nil, @mon_waiting_queue=[], @reader=nil, @key=nil, @close_dir=true>
irb(main):004:0> index.search_each("*:foo") {|id, score| doc index[id].load; puts doc.inspect}
/usr/local/lib/ruby/gems/1.8/gems/ferret-0.11.4/lib/ferret/index.rb:411: [BUG] Segmentation fault
ruby 1.8.6 (2007-03-13) [i686-linux]
Aborted

---------------------- before optimizing the index ---------------------

IOError (IO Error occured at <except.c>:93 in xraise
Error occured in fs_store.c:293 - fsi_seek_i
seeking pos -1175113459: <Invalid argument>
):
/usr/local/lib/ruby/gems/1.8/gems/ferret-0.11.4/lib/ferret/index.rb:411:in `[]'
/usr/local/lib/ruby/gems/1.8/gems/ferret-0.11.4/lib/ferret/index.rb:411:in `[]'
/usr/local/lib/ruby/1.8/monitor.rb:238:in `synchronize'
/usr/local/lib/ruby/gems/1.8/gems/ferret-0.11.4/lib/ferret/index.rb:403:in `[]'
/app/controllers/search_controller.rb:133:in `do_search'
/usr/local/lib/ruby/gems/1.8/gems/ferret-0.11.4/lib/ferret/index.rb:385:in `search_each'
/usr/local/lib/ruby/gems/1.8/gems/ferret-0.11.4/lib/ferret/index.rb:384:in `search_each'
/usr/local/lib/ruby/1.8/monitor.rb:238:in `synchronize'
/usr/local/lib/ruby/gems/1.8/gems/ferret-0.11.4/lib/ferret/index.rb:380:in `search_each'
/app/controllers/search_controller.rb:131:in `do_search'
/app/controllers/search_controller.rb:54:in `index'
/usr/local/lib/ruby/1.8/benchmark.rb:293:in `measure'
/app/controllers/search_controller.rb:53:in `index'
/usr/local/lib/ruby/1.8/benchmark.rb:293:in `measure'
/app/controllers/search_controller.rb:19:in `index'
/usr/local/lib/ruby/gems/1.8/gems/actionpack-1.13.3/lib/action_controller/base.rb:1095:in `send'
/usr/local/lib/ruby/gems/1.8/gems/actionpack-1.13.3/lib/action_controller/base.rb:1095:in `perform_action_without_filters'
/usr/local/lib/ruby/gems/1.8/gems/actionpack-1.13.3/lib/action_controller/filters.rb:632:in `call_filter'
/usr/local/lib/ruby/gems/1.8/gems/actionpack-1.13.3/lib/action_controller/filters.rb:619:in `perform_action_without_benchmark'
/usr/local/lib/ruby/gems/1.8/gems/actionpack-1.13.3/lib/action_controller/benchmarking.rb:66:in `perform_action_without_rescue'
/usr/local/lib/ruby/1.8/benchmark.rb:293:in `measure'
/usr/local/lib/ruby/gems/1.8/gems/actionpack-1.13.3/lib/action_controller/benchmarking.rb:66:in `perform_action_without_rescue'
/usr/local/lib/ruby/gems/1.8/gems/actionpack-1.13.3/lib/action_controller/rescue.rb:83:in `perform_action'
/usr/local/lib/ruby/gems/1.8/gems/actionpack-1.13.3/lib/action_controller/base.rb:430:in `send'
/usr/local/lib/ruby/gems/1.8/gems/actionpack-1.13.3/lib/action_controller/base.rb:430:in `process_without_filters'
/usr/local/lib/ruby/gems/1.8/gems/actionpack-1.13.3/lib/action_controller/filters.rb:624:in `process_without_session_management_support'
/usr/local/lib/ruby/gems/1.8/gems/actionpack-1.13.3/lib/action_controller/session_management.rb:114:in `process'
/usr/local/lib/ruby/gems/1.8/gems/actionpack-1.13.3/lib/action_controller/base.rb:330:in `process'
/usr/local/lib/ruby/gems/1.8/gems/rails-1.2.3/lib/dispatcher.rb:41:in `dispatch'
/usr/local/lib/ruby/gems/1.8/gems/rails-1.2.3/lib/fcgi_handler.rb:168:in `process_request'
/usr/local/lib/ruby/gems/1.8/gems/rails-1.2.3/lib/fcgi_handler.rb:143:in `process_each_request!'
/usr/local/lib/ruby/gems/1.8/gems/rails-1.2.3/lib/fcgi_handler.rb:109:in `with_signal_handler'
/usr/local/lib/ruby/gems/1.8/gems/rails-1.2.3/lib/fcgi_handler.rb:142:in `process_each_request!'
/usr/local/lib/ruby/gems/1.8/gems/fcgi-0.8.7/lib/fcgi.rb:612:in `each_cgi'
/usr/local/lib/ruby/gems/1.8/gems/fcgi-0.8.7/lib/fcgi.rb:609:in `each'
/usr/local/lib/ruby/gems/1.8/gems/fcgi-0.8.7/lib/fcgi.rb:609:in `each_cgi'
/usr/local/lib/ruby/gems/1.8/gems/rails-1.2.3/lib/fcgi_handler.rb:141:in `process_each_request!'
/usr/local/lib/ruby/gems/1.8/gems/rails-1.2.3/lib/fcgi_handler.rb:55:in `process!'
/usr/local/lib/ruby/gems/1.8/gems/rails-1.2.3/lib/fcgi_handler.rb:25:in `process!'
/ma/www/virtual/ferret.marketaudit.no/Site/public/dispatch.fcgi:24

--
Best regards,
Stian Grytøyr
Jeremy Hinegardner
2007-May-14 03:48 UTC
[Ferret-talk] Ferret Query Language as categorizer?
Hi all,

I'm looking at using Ferret for categorizing documents. Essentially what I have are thousands of query rules: if a document matches a rule, then it belongs to the category that is associated with that rule. Normally what we all do is have documents indexed and then run a query against the index to get back the documents that match the query.

What I want to do is the inverse. I have thousands of queries and I want to run all of them against one document at a time. The queries that match the document essentially categorize the document into the associated category.

Yes, I am aware that this may not be the best way to approach a categorization problem, but it is a portion of how our current system works, and I want to investigate ways to replace it and move on to a better mechanism for categorization.

I'm considering using our current query language and having it be a DSL that generates Ruby code. Essentially my first whack at using Ferret for this was the following:

  doc = File.read(OPTIONS.input_file)
  Ferret::I.new do |index|
    index << doc
    FasterCSV.foreach(OPTIONS.category_csv, { :headers => headers }) do |row|
      next unless row[:boolean]
      top_docs = index.search(row[:boolean])
      if top_docs.hits.size > 0 then
        puts "Matches : #{row[:name]}"
      end
    end
  end

Short and sweet, eh? Basically I'm looking for suggestions on better ways to have thousands of Ferret queries (as FQL) run against a single document. Are there other approaches that would be better? API calls that would do this more efficiently? A means to serialize FQL so that it doesn't have to be parsed?

Thoughts, comments, rants, raves, brainstorms?

enjoy,

-jeremy

--
=======================================================================
Jeremy Hinegardner                              jeremy at hinegardner.org
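[Editor's note: the loop above can be sketched in plain, runnable Ruby by stubbing the Ferret call with a substring matcher, so the control flow is visible on its own. `CATEGORY_RULES`, `categorize`, and the matcher are hypothetical stand-ins; in the real version the match test would be an `index.search` call.]

```ruby
# Hypothetical sketch of the categorizer loop, with Ferret stubbed out by a
# plain substring matcher. CATEGORY_RULES stands in for the rows read from
# OPTIONS.category_csv.
CATEGORY_RULES = [
  { :name => "animals", :boolean => "fox" },
  { :name => "weather", :boolean => "rain" },
  { :name => "speed",   :boolean => "quick" },
]

def categorize(doc, rules)
  # In the real version this predicate would be an index.search call on a
  # one-document in-memory index, checking whether anything matched.
  matches = lambda { |rule| doc.include?(rule[:boolean]) }
  rules.select(&matches).map { |rule| rule[:name] }
end

doc = "the quick brown fox jumped over the lazy dog"
puts categorize(doc, CATEGORY_RULES).inspect  # => ["animals", "speed"]
```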
Hi Jeremy,

interesting approach. You might build your Query objects once by calling QueryParser#parse, serialize these Query objects and reuse them. IMHO your problem wouldn't be query parsing but the number of queries that you are issuing against each document. On the other hand, Ferret is quite fast, and it may work out if your process is not that time critical.

Have you considered combining queries? Ferret's query language is quite powerful, and you might bring down the number of queries if you combine the ones that only feed a single categorization anyway. Check out the QueryParser API regarding this approach.

At least the lines

  top_docs = index.search(row[:boolean])
  if top_docs.hits.size > 0 then

should read

  if index.search(row[:boolean]).total_hits > 0

so that you don't need to read in the hits array just to get its size.

As a last tip, you might be interested in the underlying code of aaf's more_like_this method to get the most used terms in your documents. That might let your categorizations "learn" while documents get categorized.

Cheers,
Jan

2007/5/14, Jeremy Hinegardner <jeremy at hinegardner.org>:
> Hi all,
>
> I'm looking at using Ferret for categorizing documents.
> Essentially what I have are thousands of query rules that if a document
> matches, then it belongs to the category that is associated with that
> rule.
<snip>
> Thought, comments, rants, raves, brainstorms?
>
> enjoy,
>
> -jeremy

--
Jan Prill
Rechtsanwalt
Grünebergstraße 38
22763 Hamburg
Tel +49 (0)40 41265809
Fax +49 (0)40 380178-73022
Mobil +49 (0)171 3516667
http://www.inviado.de
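[Editor's note: Jan's parse-once suggestion can be sketched in runnable Ruby with the real QueryParser stubbed by a toy parser (OR of bare terms). `ToyQueryParser` is hypothetical; in real code the value cached per rule would be the Query object returned by QueryParser#parse, passed straight to Index#search.]

```ruby
# Toy stand-in for Ferret::QueryParser: parses "a OR b" into a callable
# predicate. The point is the caching pattern, not the parser itself.
class ToyQueryParser
  def parse(fql)
    terms = fql.split(" OR ")
    lambda { |doc| terms.any? { |t| doc.include?(t) } }
  end
end

parser = ToyQueryParser.new
rules  = { "animals" => "fox OR hound", "weather" => "rain OR snow" }

# Parse each rule once, up front -- not once per document.
compiled = rules.map { |name, fql| [name, parser.parse(fql)] }

%w[dog fox snow].each do |doc|
  hits = compiled.select { |_, query| query.call(doc) }.map(&:first)
  puts "#{doc}: #{hits.inspect}"
end
```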
Jeremy Hinegardner wrote:
> Hi all,
>
> I'm looking at using Ferret for categorizing documents.
> Essentially what I have are thousands of query rules that if a document
> matches, then it belongs to the category that is associated with that
> rule. Normally what we all do is have documents indexed and then run a
> query against the index to get back the documents that match the query.
>
> What I want to do is the inverse. I have thousands of queries and I
> want to run all of them against one document at a time. The queries
> that match the document essentially categorize the document into the
> associated category.
<snip>
> Thought, comments, rants, raves, brainstorms?

Random thought that might or might not work, depending on whether your queries are simple enough and how much data you want back: just invert the problem. Store the queries in Ferret, and treat your document as the query. Random example:

  irb(main):015:0> index = Index::Index.new
  irb(main):016:0> index << "hat"
  irb(main):017:0> index << "fox"
  irb(main):018:0> doc = "the quick brown fox jumped over the lazy dog"
  irb(main):019:0> index.search_each(doc) { |id, score| puts index[id].load.to_yaml + score.to_s }
  --- !map:Ferret::Index::LazyDoc
  :id: fox
  0.0425622686743736
  => 1

I've got absolutely no idea how well the query parser will handle larger documents, but it's worth a try...

--
Alex
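[Editor's note: Alex's inversion can be reduced to a runnable toy, with the Ferret index replaced by a plain Hash: each "stored query" is a keyword mapped to its category, and "searching with the document" becomes a lookup per token. `STORED_QUERIES` and `categories_for` are hypothetical names.]

```ruby
# Each stored "query" is a single keyword; categorizing a document means
# tokenizing it and collecting the categories of any matching keywords.
STORED_QUERIES = { "fox" => "animals", "hat" => "clothing" }

def categories_for(doc)
  doc.downcase.scan(/\w+/).map { |token| STORED_QUERIES[token] }.compact.uniq
end

puts categories_for("The quick brown fox jumped over the lazy dog").inspect
# => ["animals"]
```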
Jeremy Hinegardner
2007-May-14 18:55 UTC
[Ferret-talk] Ferret Query Language as categorizer?
On Mon, May 14, 2007 at 11:11:50AM +0100, Alex Young wrote:
> Jeremy Hinegardner wrote:
> > Hi all,
> >
> > I'm looking at using Ferret for categorizing documents.
<snip>
> Random thought that might or might not work, depending on whether your
> queries are simple enough and how much data you want back: just invert
> the problem. Store the queries in Ferret, and treat your document as
> the query.
>
> I've got absolutely no idea how well the query parser will handle larger
> documents, but it's worth a try...

I did give some thought to this, but we have some fairly complex categorization queries, some of which are the equivalent of SpanTermQuery. Since there is no FQL for those types of queries yet, I don't think your approach will work for me. But it is a good idea.

enjoy,

-jeremy

--
=======================================================================
Jeremy Hinegardner                              jeremy at hinegardner.org
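[Editor's note: for readers unfamiliar with span queries, this is roughly the check a NEAR-style rule has to compute, sketched in plain runnable Ruby: two terms count as "near" when their token positions differ by at most a slop value. The function name and tokenizer are hypothetical; Ferret's span queries work against the positions stored in the index rather than re-tokenizing the text.]

```ruby
# Returns true when term_a and term_b occur within `slop` token positions
# of each other anywhere in doc.
def near?(doc, term_a, term_b, slop)
  tokens = doc.downcase.scan(/\w+/)
  pos_a  = tokens.each_index.select { |i| tokens[i] == term_a }
  pos_b  = tokens.each_index.select { |i| tokens[i] == term_b }
  pos_a.product(pos_b).any? { |i, j| (i - j).abs <= slop }
end

doc = "the quick brown fox jumped over the lazy dog"
puts near?(doc, "fox", "dog", 5)    # => true  (positions 3 and 8)
puts near?(doc, "quick", "dog", 5)  # => false (positions 1 and 8)
```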
Jeremy Hinegardner
2007-May-14 19:01 UTC
[Ferret-talk] Ferret Query Language as categorizer?
On Mon, May 14, 2007 at 08:00:02AM +0000, Jan Prill wrote:
> interesting approach. You might build your Query objects once by calling
> QueryParser#parse, serialize these Query objects and reuse them.

Yup, that's one item I need to look into. One of the issues is that the query language we're using right now has 'NEAR' keywords, so we'll need to convert those into SpanTermQuerys. I'm thinking of having the DSL generate Ruby code, then serializing those Query objects, or maybe just running them as code.

> IMHO your problem wouldn't be query parsing but the number of queries that
> you are issuing against each document. On the other hand, Ferret is quite
> fast, and it may work out if your process is not that time critical. Have
> you considered combining queries? Ferret's query language is quite powerful,
> and you might bring down the number of queries if you combine the ones that
> only feed a single categorization anyway. Check out the QueryParser API
> regarding this approach.

I will investigate the API more. Currently we don't have multiple queries that equate to a single category; it's a one-to-one relationship between category and query. The speed of my initial experiments is within our tolerances, but may not be good for serial execution. Of course, since all of this is in a single memory index, per document, it could be parallelized.

> At least the lines
>
>   top_docs = index.search(row[:boolean])
>   if top_docs.hits.size > 0 then
>
> should read "if index.search(row[:boolean]).total_hits > 0" so that you
> don't need to read in the hits array just to get its size.

Good tip, thanks.

> As a last tip, you might be interested in the underlying code of aaf's
> more_like_this method to get the most used terms in your documents. That
> might let your categorizations "learn" while documents get categorized.

I will definitely check more into that. Who knows, maybe a categorization engine based on Ferret will fall out of this :-)

enjoy,

-jeremy

--
=======================================================================
Jeremy Hinegardner                              jeremy at hinegardner.org
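[Editor's note: the "serialize the parsed rules once, load them later" idea can be sketched with Marshal and a plain Struct standing in for a parsed rule. Whether Ferret's C-backed Query objects survive Marshal.dump is an open question worth testing before relying on it; data-only rule objects like this one do round-trip.]

```ruby
# A parsed NEAR rule reduced to plain data: category name, the terms
# involved, and the allowed slop. Marshal round-trips this faithfully.
Rule = Struct.new(:category, :terms, :slop)

rules = [
  Rule.new("animals", %w[fox hound], 4),
  Rule.new("weather", %w[rain snow], 2),
]

blob     = Marshal.dump(rules)   # write this to disk once, at build time
restored = Marshal.load(blob)    # load it in each worker process

puts restored.first.category  # => "animals"
puts(restored == rules)       # => true (Structs compare by member values)
```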