thr3ads.net - Ferret talk - [Ferret-talk] Ferret not able to read a Lucene Index? [May 2006]


steven shingler

2006-May-15 16:08 UTC

[Ferret-talk] Ferret not able to read a Lucene Index?

Hi all,

Having problems trying to get Ferret to read an index generated by 
Lucene.

Am I right in thinking Ferret should be able to read a Lucene generated 
index no problem?

Using the code snippets detailed in 
http://www.ruby-forum.com/topic/64099#new

Any advice gratefully received.
Many Thanks,
Steven

-- 
Posted via http://www.ruby-forum.com/.

Erik Hatcher

2006-May-15 16:15 UTC

On May 15, 2006, at 12:08 PM, steven shingler wrote:
> Am I right in thinking Ferret should be able to read a Lucene
> generated index no problem?

That would be nice, but it is not currently the case because of
Java's wacky "modified" UTF-8 serialization.  I've seen that plain
ol' ASCII text indexes will be compatible, but once you put in some
higher order characters things go askew.

	Erik

steven shingler

2006-May-16 09:55 UTC

Hi Erik, Thanks for getting back to me.

Ahh yes, I see what you mean - if I "Lucene-Index" only plain text
files, Ferret can search that index fine (it seems).

However, what I'm trying to do is index pdfs, using PDFBox to create
the Lucene documents - but Ferret isn't at all pleased when I try to
search:

NoMethodError: You have a nil object when you didn't expect it!
The error occured while evaluating nil.name
    c:/ruby/lib/ruby/site_ruby/1.8/ferret/index/term_buffer.rb:31:in `read'
    c:/ruby/lib/ruby/site_ruby/1.8/ferret/index/segment_term_enum.rb:90:in `next?'
    c:/ruby/lib/ruby/site_ruby/1.8/ferret/index/segment_term_enum.rb:118:in `scan_to'
    c:/ruby/lib/ruby/site_ruby/1.8/ferret/index/term_infos_io.rb:285:in `scan_for_term_info'
    c:/ruby/lib/ruby/site_ruby/1.8/ferret/index/term_infos_io.rb:163:in `get_term_info'
    c:/ruby/lib/ruby/site_ruby/1.8/ferret/index/segment_reader.rb:176:in `doc_freq'
    c:/ruby/lib/ruby/site_ruby/1.8/ferret/index/multi_reader.rb:169:in `doc_freq'
    c:/ruby/lib/ruby/site_ruby/1.8/ferret/index/multi_reader.rb:169:in `each'
    c:/ruby/lib/ruby/site_ruby/1.8/ferret/index/multi_reader.rb:169:in `doc_freq'
    c:/ruby/lib/ruby/site_ruby/1.8/ferret/search/index_searcher.rb:47:in `doc_freq'
    c:/ruby/lib/ruby/site_ruby/1.8/ferret/search/term_query.rb:13:in `initialize'
    c:/ruby/lib/ruby/site_ruby/1.8/ferret/search/term_query.rb:99:in `new'
    c:/ruby/lib/ruby/site_ruby/1.8/ferret/search/term_query.rb:99:in `create_weight'
    c:/ruby/lib/ruby/site_ruby/1.8/ferret/search/boolean_query.rb:113:in `initialize'
    c:/ruby/lib/ruby/site_ruby/1.8/ferret/search/boolean_query.rb:112:in `each'
    c:/ruby/lib/ruby/site_ruby/1.8/ferret/search/boolean_query.rb:112:in `initialize'
    c:/ruby/lib/ruby/site_ruby/1.8/ferret/search/boolean_query.rb:209:in `new'
    c:/ruby/lib/ruby/site_ruby/1.8/ferret/search/boolean_query.rb:209:in `create_weight'
    c:/ruby/lib/ruby/site_ruby/1.8/ferret/search/query.rb:51:in `weight'
    c:/ruby/lib/ruby/site_ruby/1.8/ferret/search/index_searcher.rb:107:in `search'
    c:/ruby/lib/ruby/site_ruby/1.8/ferret/index/index.rb:660:in `do_search'
    c:/ruby/lib/ruby/site_ruby/1.8/ferret/index/index.rb:331:in `search_each'
    c:/ruby/lib/ruby/site_ruby/1.8/ferret/index/index.rb:330:in `synchronize'
    c:/ruby/lib/ruby/site_ruby/1.8/ferret/index/index.rb:330:in `search_each'
    ./lib/ferret_client.rb:34:in `search_index'
    test/functional/ferret_client_test.rb:12:in `test_search_index'

This is a shame, as I thought I was onto a winner with the Lucene/Ferret 
combo - especially with PDFBox able to create Lucene Docs so easily.

This may not actually relate to your point of higher order chars...?

Does anyone have any experience of indexing pdfs in Lucene (using 
PDFBox) and searching with Ferret? Or of course creating Ferret Index 
Docs from pdf files in ruby?

Any ideas or advice gratefully received.
Thanks,
Steven



Jan Prill

2006-May-16 10:02 UTC

Hi, steven,

first of all: would you mind providing a little more info on the
environment you are on: OS, version of Ferret, version of Ruby et al.

second: You might be interested in the FerretFinder utility as well as RDig.
You'll find links to both of them at the bottom of the HowTo section on the
Ferret trac: http://ferret.davebalmain.com/trac/wiki/HowTos . Both of these
tools seem to use pdftotext to extract content from PDFs but might be of
help to you anyway.

Regards
Jan Prill



steven shingler

2006-May-16 10:07 UTC

Hi Jan,

Right - sorry.

I'm on Windows XP (Pro); ferret 0.9.1 (pure ruby); ruby 1.8.2

I'll look into those links now.
Many Thanks
Steven



Jan Prill

2006-May-16 10:17 UTC

hey steven,

do you have a Linux box available too? It would be interesting to know
whether the problem persists with Ferret 0.9.3. If you have any scripts and
test data for your PDFs, I could also check this out for you on Linux with
Ferret 0.9.3 and Ruby 1.8.4.

regards
Jan


steven shingler

2006-May-16 10:54 UTC

Hi Jan,

Yes, I've got an Ubuntu box I can try it on - just updated to ferret
0.9.3 and ruby 1.8.4 on it.

Will have a look now and report back.

Many Thanks for your help.
S~

p.s. the ferret_helper finder utils look v interesting


David Balmain

2006-May-16 14:53 UTC

On 5/16/06, Erik Hatcher <erik at ehatchersolutions.com> wrote:
> On May 15, 2006, at 12:08 PM, steven shingler wrote:
> > Am I right in thinking Ferret should be able to read a Lucene
> > generated index no problem?
>
> That would be nice, but it is not currently the case because of
> Java's wacky "modified" UTF-8 serialization.  I've seen that plain
> ol' ASCII text indexes will be compatible, but once you put in some
> higher order characters things go askew.

Hey guys,

What Erik said is exactly correct. Marvin Humphrey (author of
KinoSearch, a Perl port of Lucene) has submitted a patch to Lucene so
that non-Java ports of Lucene will be able to read Lucene indexes. At
the moment it slows Lucene down by about 25% (I think??) so
I'm going to be working with him to improve the performance of the
patch so that it can one day be included in Lucene. Don't hold your
breath though. It's going to take us a while to get it in there. For
now, I'd recommend using pdftotext as Jan already mentioned. I'm not
sure what is available on Windows but I'm sure it would be trivial to
write your own pdftotext using Java's PDFBox and then call it from
Ruby.
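
A minimal sketch of the "call it from Ruby" route, assuming the pdftotext command-line tool is installed and on the PATH. The method name extract_pdf_text and its command: parameter are made up for this illustration; a PDFBox-based extractor could be swapped in by passing a different command, provided it accepts the same "<input file> -" argument convention. Nothing here is Ferret-specific.

```ruby
require "open3"

# Hypothetical helper: run an external text extractor on a PDF and return
# the extracted text. By default this shells out to pdftotext, where the
# trailing "-" asks it to write the text to stdout instead of a file.
def extract_pdf_text(pdf_path, command: ["pdftotext", "-enc", "UTF-8"])
  stdout, stderr, status = Open3.capture3(*command, pdf_path, "-")
  raise "extractor failed: #{stderr.strip}" unless status.success?
  stdout
end

# Usage (assumes report.pdf exists and pdftotext is installed):
#   text = extract_pdf_text("report.pdf")
#   # hand `text` to whatever code builds the Ferret document
```

Keeping the extractor behind a single method means the indexing code never needs to know whether the text came from pdftotext or from a small Java wrapper around PDFBox.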

Cheers,
Dave

Marvin Humphrey

2006-May-16 16:51 UTC

On May 16, 2006, at 7:53 AM, David Balmain wrote:
> On 5/16/06, Erik Hatcher <erik at ehatchersolutions.com> wrote:
>>
>> On May 15, 2006, at 12:08 PM, steven shingler wrote:
>>> Am I right in thinking Ferret should be able to read a Lucene
>>> generated index no problem?
>>
>> That would be nice, but it is not currently the case because of
>> Java's wacky "modified" UTF-8 serialization.  I've seen that plain
>> ol' ASCII text indexes will be compatible, but once you put in some
>> higher order characters things go askew.
>
> Hey guys,
>
> What Erik said is exactly correct. Marvin Humphrey, (author of
> KinoSearch, a Perl port of Lucene) has submitted a patch to Lucene so
> that non-java ports of Lucene will be able to read Lucene indexes. It
> currently slows Lucene down by about 25% at the moment (I think??)

Around 20% for indexing according to my benchmarker.  I don't have a
benchmark for searching.

Modified UTF-8 is not so much the problem for performance of my  
patch, nor is it actually causing the index incompatibility in this  
case.  Modified UTF-8 is problematic for a couple other reasons.

When text contains either null bytes or Unicode code points above the
Basic Multilingual Plane (values 2^16 and up, such as U+1D160
"MUSICAL SYMBOL EIGHTH NOTE"), KinoSearch and Ferret, if they write
legal UTF-8, would write indexes which would cause Lucene to crash
from time to time with a baffling "read past EOF" error.  Therefore,
to be Lucene-compatible they'd have to pre-scan all text to detect
those conditions, which would impose a performance burden and require
some crufty auxiliary code to turn the legal UTF-8 into Modified UTF-8.
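
To make the two problem cases concrete, here is a small Ruby sketch of the difference. modified_utf8_bytes is a hypothetical helper written for this illustration, not part of Ferret, KinoSearch or Lucene. It follows the Modified UTF-8 rules as documented for Java's DataOutput.writeUTF: U+0000 becomes the two bytes C0 80, and a code point above the BMP is written as a surrogate pair with each half encoded separately as three bytes.

```ruby
# Hypothetical helper: encode a Ruby string the way Java's Modified UTF-8
# does, working on UTF-16 code units rather than on code points.
def modified_utf8_bytes(str)
  out = []
  str.encode("UTF-16BE").unpack("n*").each do |unit|
    if unit.zero?
      out.push(0xC0, 0x80)                      # null as an overlong 2-byte form
    elsif unit < 0x80
      out << unit                               # plain ASCII, same as UTF-8
    elsif unit < 0x800
      out.push(0xC0 | (unit >> 6), 0x80 | (unit & 0x3F))
    else
      # Includes each half of a surrogate pair, encoded on its own.
      out.push(0xE0 | (unit >> 12),
               0x80 | ((unit >> 6) & 0x3F),
               0x80 | (unit & 0x3F))
    end
  end
  out
end

def hex(bytes)
  bytes.map { |b| format("%02X", b) }.join(" ")
end

null_char   = "\u0000"
eighth_note = "\u{1D160}"

puts hex(null_char.bytes)                      # 00
puts hex(modified_utf8_bytes(null_char))       # C0 80
puts hex(eighth_note.bytes)                    # F0 9D 85 A0
puts hex(modified_utf8_bytes(eighth_note))     # ED A0 B4 ED B5 A0
```

For ASCII and for BMP text without nulls the two encodings produce the same bytes, which is why only these two cases would need the pre-scan.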

Also, non-shortest-form UTF-8 presents a theoretical security risk,
and Perl is set up to issue a warning whenever a scalar which is
marked as UTF-8 isn't shortest-form.  That condition would occur
whenever Modified UTF-8 containing null bytes or code points above
the BMP was read in -- thus requiring that all incoming text be
pre-scanned as well.

Those are rare conditions, but it isn't realistic to just say
"KinoSearch|Ferret doesn't support null bytes or characters above the
BMP", because a lot of times the source text that goes into an index
isn't under the full control of the indexing/search app's author.

To be fair to Java and Lucene, they are paying a price for early
commitment to the Unicode standard.  Lucene's UTF-8 encoding/decoding
hasn't been touched since Doug Cutting wrote it in 1998, when
non-shortest-form UTF-8 was still legal and Unicode was still 16-bit.
You could argue that the Unicode consortium pulled the rug out from
under its early champions by changing the spec so that existing
implementations were no longer compliant.

The performance problems of my patch and the crashing are actually
tied to the Lucene File Format's definition of a String.  A String in
Lucene is the length of the string in Java chars, followed by the
character data translated to Modified UTF-8.  A String in KinoSearch,
and if I am not mistaken in Ferret as well, is the length of the
character data in bytes, followed by the character data.

Those two definitions of String result in identical indexes so long
as your text is pure ASCII, but as Erik noted, when you add higher
order characters to the mix, problems arise.  You end up reading
either too few bytes or too many, the stream gets out of sync, and
whammo: 'Read past EOF'.
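
The desynchronization is easy to reproduce in a few lines of Ruby. This is only a sketch under simplifying assumptions: the length prefix is a single byte rather than Lucene's variable-length integer (the two coincide for lengths under 128), plain UTF-8 stands in for Modified UTF-8 (identical for the sample strings), and the helper names are invented for this example.

```ruby
require "stringio"

# Writer in the style described for Lucene: the prefix counts Java chars
# (UTF-16 code units), not bytes.
def write_char_counted(io, str)
  char_count = str.encode("UTF-16BE").bytesize / 2
  io.write([char_count].pack("C"))
  io.write(str.b)
end

# Reader in the style described for KinoSearch/Ferret: the prefix is taken
# to be the number of bytes that follow.
def read_byte_counted(io)
  prefix = io.read(1)
  raise EOFError, "Read past EOF" if prefix.nil?
  length = prefix.unpack("C").first
  data = io.read(length) || "".b
  raise EOFError, "Read past EOF" if data.bytesize < length
  data.force_encoding("UTF-8")
end

io = StringIO.new("".b)
write_char_counted(io, "caf\u00E9")  # 4 chars, but 5 bytes in UTF-8
write_char_counted(io, "bar")
io.rewind

first = read_byte_counted(io)
puts first.valid_encoding?   # false: the read stopped halfway through the e-acute
begin
  read_byte_counted(io)      # the leftover byte 0xA9 is misread as a length of 169
rescue EOFError => e
  puts e.message
end
```

With pure ASCII input the char count and the byte count are equal, so the same reader and writer round-trip cleanly.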

My patch modifies Lucene to use bytecounts as the prefix to its
Strings.  Unfortunately, there are encoding/decoding inefficiencies
associated with the new way of doing things.  Under Lucene's current
definition of a string you allocate an array of Java char then read
characters into it one by one.  With the new patch, you don't know
how many chars you need, so you might have to re-allocate several
times.  There are ways to address that inefficiency, but they'd take
a while to explain.

> Don't hold your
> breath though. It's going to take us a while to get it in there.

Yeah.  Modifying Lucene so that it can read both the old index format
and the new without suffering a performance degradation in either
case is going to be non-trivial.  I'm sympathetic to the notion that
it may not be worth it and that Lucene should declare its file format
private.  There are a lot of issues in play.

No KinoSearch user has yet complained about Lucene/KinoSearch
file-format compatibility.  The only thing I miss is Luke -- which is
significant, because Luke is really handy.

How many users here care about Lucene compatibility, and why?

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

Erik Hatcher

2006-May-16 17:17 UTC

On May 16, 2006, at 12:51 PM, Marvin Humphrey wrote:
> How many users here care about Lucene compatibility, and why?

Personally I'm putting my eggs into the Solr basket -
http://incubator.apache.org/solr

Solr has a ton of benefits over using raw Lucene with its caching and
configurable handling of putting new searchers online, etc. It's got
plenty of room for improvement, and those improvements are in
progress. I am integrating Solr into a Ruby on Rails front-end as we
speak, crudely for now through a rough HTTP API, but abstracting
that communication layer behind a nice Rubyish DSL would be quite cool.

I used to really really want Lucene index compatibility at the file
format layer along with a really fast Ruby implementation. At this
point I've changed my mind and Solr is my recommended basis for
search integration into non-Java (and even Java perhaps) applications.

I just wanted to toss out my thoughts since I've been mostly silent
on the Ferret/KinoSearch issues. I still daydream of GCJ'd Java
Lucene being the basis for cross-language integration, using PyLucene
as a great example. They achieve 100% index compatibility with Java
Lucene because it *is* Java Lucene. I'm still extremely pleased to
see folks like Dave and Marvin digging deep into Ruby and Perl
integration and starting to work together. Very promising no matter
how this ends up. I'm optimistic we'll have Lucene in Ruby one of
these days in a compatible and incredibly performant way!

Erik

Nick Snels

2006-May-16 19:30 UTC

I don't care about the fact that Ferret isn't able to read a Lucene
index. The only problem is that when the Ferret index isn't compatible
with Lucene as is the case right now (damn EOF errors), you are not able
to use Luke to take a quick peek inside the index. So a port of Luke to
access Ferret would be great.

Ferret should be fast, have the power of Lucene searches and be easy to
access from Ruby, as it is right now. If you are going to use Lucene, go
all the way and stick to Java. The only problem with Ferret is that the C
version isn't available on Windows (for testing purposes) yet, but that
is being worked on. GCJ and SWIG sound great but setting them up is a
real pain in the ass, great for techies, but horrible for all the
others.

Solr looks like a promising project; the only problem I have with it is
that you need Tomcat and a JVM. This adds two more variables to your
configuration that you have to control. Great if you know Java, but I'm
programming in Ruby so I don't have to program in Java or .NET, or
whatever. So I prefer a Ruby-only environment for its simplicity.

So Luke is a definite plus as a debugging tool.

Kind regards,

Nick


Erik Hatcher

2006-May-16 19:45 UTC

On May 16, 2006, at 3:30 PM, Nick Snels wrote:
> Solr looks a promising project, only problem I have with it is that
> you need Tomcat and a JVM. This adds two more variables to your
> configuration you have to control. Great if you know Java, but I'm
> programming in Ruby so I don't have to program in Java or .NET, or
> whatever. So I prefer a Ruby only environment for its simplicity.

A fair and expected critique of using Solr in a Ruby environment.
Every language enjoys a bit of lock-in and programmers obviously
would prefer to work with native APIs.

It is true you need a JVM to run Solr, but it doesn't have to be
Tomcat.  I use Jetty.  Firing up Solr in my Rails environment only
required that I customize its schema.xml and solrconfig.xml files and run
"java -jar start.jar".  And voila, it's up and running.  So while it
does add an entirely new moving piece, I view it as something akin to
adding a database.  As long as there is a good way to communicate
with it natively (a Ruby/Solr API would be well received, methinks)
then Solr isn't any more, actually less, overhead to a project's
deployment than adding a database server.
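
For a rough idea of what that HTTP communication looks like from Ruby, here is a sketch that only builds a query URL. It assumes the stock Solr example setup (Jetty listening on localhost port 8983, with the standard /solr/select handler and its q and rows parameters); solr_select_uri is a made-up helper name, not an existing Ruby/Solr API.

```ruby
require "uri"

# Hypothetical helper: build the URL for a Solr select query.
# Host, port and path are assumptions matching the Solr example setup.
def solr_select_uri(query, rows: 10, host: "localhost", port: 8983)
  URI::HTTP.build(
    host: host,
    port: port,
    path: "/solr/select",
    query: URI.encode_www_form(q: query, rows: rows)
  )
end

uri = solr_select_uri("ferret lucene", rows: 5)
puts uri
# Fetching would then be something like Net::HTTP.get(uri) (after
# require "net/http", with a Solr instance running), which returns the
# XML response body as a string.
```

A Rubyish wrapper would mostly be a matter of hiding this URL building and the XML parsing behind ordinary method calls.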

	Erik

Marvin Humphrey

2006-May-16 21:27 UTC

On May 16, 2006, at 12:30 PM, Nick Snels wrote:
> I don't care about the fact that Ferret isn't able to read a Lucene
> index. The only problem is that when the Ferret index isn't compatible
> with Lucene as is the case right now (damn EOF errors), you are not
> able to use Luke to take a quick peek inside the index. So a port of
> Luke to access Ferret would be great.

You know what... I think using Luke powered by a version of Lucene
with my patch applied would allow it to read Ferret indexes.

I don't have time to check this out right now.  And ironically, I've
made further mods to KinoSearch's file format, so it wouldn't make
Luke available to KinoSearch users unless I change it back.  hahaha. :o

The patch was prepared against subversion, but it might work against
1.9.1.  If it doesn't, it would be trivial to finish it and package it
up.  Maybe we can convince the Lucene folks to distribute it through
their channels... or I can put it up at my site.  Maybe Luke's author
would be amenable to distributing it from his site, but I dunno about
that - people might blame him rather than me or Balmain when stuff
fails to work.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

David Balmain

2006-May-17 07:12 UTC

On 5/17/06, Marvin Humphrey <marvin at rectangular.com> wrote:
> How many users here care about Lucene compatibility, and why?

Great question. Who does care, and why? Performance used to be a very
good reason but that doesn't apply anymore. Is it Java's libraries?
Java does have PDFBox for example. Unfortunately Ruby doesn't yet have
an equivalent but there are ways around this. The only good reason I
can think of is the lack of a Luke port. Anyone care to enlighten us?

Cheers,
Dave

Jan Prill

2006-May-17 07:12 UTC

hey Marvin,

is there a link in this thread already? I've found
http://issues.apache.org/jira/browse/LUCENE-510?page=comments#action_12378519
as well as the links at the bottom of
http://www.archivum.info/java-dev at lucene.apache.org/2005-09/msg00025.html
with google. Is there anything else? I'll definitely try this out but wanted
to make sure that this is the latest development...

Regards
Jan


Jan Prill

2006-May-17 07:30 UTC

Hi Dave,

IMHO there are two things:

1. these little marketing and management issues that often have no valid
reason but make a big difference:

Programmer / Freelancer: let's use Ruby, we'll even be able to build a
superfast search interface to all your great marketing docs with Ferret,
Rails and Ruby
Manager: I think we've got this, it's implemented by something called
bluezeneeee

P/F: yes, we might even use its indexes and perform searches with the
old system while we are changing...
M: changing what?

P/F: the system to Ruby, Ferret...
M: WTF?

For these conversations it would help to keep the changes in the
background as much as possible...

2. Tools around Lucene

I think people will now give Marvin's patch and Luke a try, but Luke is not
the only thing. Thanks to Erik for bringing up Solr. I think it's a little
bit of the old Java 90%/10% thingy. For 90% of webapps all the Java,
Spring, Hibernate stuff is damn complex and you'll be faster with Ruby. But
the 10 or less percent, often the big money stuff of Fortune companies, of
banks etc., made their management decision for either J2EE or .NET. And for
these projects the programming teams often need distributed and high volume
things, see CNET and Solr.

I've heard about Solr on this thread for the first time and wonder a little
how it works together with Nutch / Hadoop for the distributed things, but I
will do some googling on this myself. I think there is definitely a need -
also in the Ruby world - for search engines and crawlers. And nutch has some
nifty features about RDig. Discussions about the interchangeability between
Nutch and Ferret show that people are interested in using Lucene tools but
with a front end in Ruby, Rails and Ferret. I've for example tried to work
with Ferret on a Nutch index and luckily Ferret didn't choke on the index
because there were no utf-8 chars in there. So I could extract url, segment,
docno, but then there came this nfs / hadoop thing to extract content and
summaries as well and I gave up.

There also seems to be interest and need in distributed search architectures
as the p2p efforts of hyperestraier as well as nfs / hadoop and solr
(rsync?) are showing...

Regards
Jan


steven shingler

2006-May-17 10:15 UTC

I agree with Jan's 'real-world' scenario - it is the reason I started
this thread in the first place... :)

...not so much because of management pressures, but I see merit in being 
able to create indexes in either Java or Ruby, then use Rails to present 
a query interface.

It keeps one's options open - particularly with PDFBox and POI in the
Java space, although I'm looking into both routes of the
pdftotext/ferret_helper tools, and applying Marvin's patch - so perhaps
both paths can remain open.

Thanks to all though, for contributing to this very interesting thread! 
:)
Cheers
Steven


