Hi there,

What's the current status of the Windows port? I may be in a position to lend a hand over the next couple of weeks - where should I start looking?

And what's the best way to get SVN HEAD? This happens:

$ svn checkout svn://www.davebalmain.com/ferret/trunk ferret
svn: Can't connect to host 'www.davebalmain.com': Connection refused

--
Alex
On 6/1/06, Alex Young <alex at blackkettle.org> wrote:
> Hi there,
>
> What's the current status of the Windows port? I may be in a position
> to lend a hand over the next couple of weeks - where should I start
> looking?

Hi Alex,

Thanks for your interest. I got Ferret to compile with Visual Studio Express 2005. Unfortunately you currently need to use Visual C 6 to create Ruby bindings. This proved a lot more difficult so I decided to take a different route. Marvin Humphrey (author of KinoSearch, a Perl port of Lucene) and I are about to start a new project at Apache called Lucy (http://wiki.apache.org/jakarta-lucene/LucyProposal) which will aim to create a C port of Lucene that can be used as a backend in all dynamic languages. This time around, portability will be a much higher priority.

Lucy may or may not one day become the back end to Ferret. At the same time I'm experimenting with some different options using the Ferret codebase. Now that Lucy is happening I'm not going to worry about Lucene index compatibility (which was still a long way off in Ferret due to Java's modified UTF-8 encoding). This experimental code is in:

svn://www.davebalmain.com/exp

This code is much more portable and will compile with VC6. So if you want a Windows port quickly you can try merging this code back into Ferret proper. Or if you are really interested in the library's internals you could join me working on this experimental code or join Marvin and me on the Lucy project (still waiting on Apache approval). Whichever route you choose, your help will be most appreciated. Let me know your thoughts.

Cheers,
Dave

> And what's the best way to get SVN HEAD? This happens:
> $ svn checkout svn://www.davebalmain.com/ferret/trunk ferret
> svn: Can't connect to host 'www.davebalmain.com': Connection refused

Sorry about that. Subversion is up and running again.
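For readers wondering why Java's modified UTF-8 gets in the way of index compatibility, here is a minimal Python sketch of how that encoding differs from standard UTF-8. It is an illustration under stated assumptions, not Ferret or Lucene code: the function name `modified_utf8` is made up for this example, and it follows the encoding as documented for Java's `DataOutput.writeUTF`, leaving out the two-byte length prefix that `writeUTF` also emits.

```python
def modified_utf8(text: str) -> bytes:
    """Encode text using Java-style modified UTF-8 (hypothetical helper).

    Two differences from standard UTF-8 are modelled here:
    U+0000 becomes the two bytes C0 80, and characters above U+FFFF are
    written as a UTF-16 surrogate pair, three bytes per surrogate.
    """
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp == 0:
            # Never emit a raw 0x00 byte; use the overlong form instead.
            out += b"\xc0\x80"
        elif cp >= 0x10000:
            # Split the supplementary character into its surrogate pair.
            cp -= 0x10000
            high = 0xD800 + (cp >> 10)
            low = 0xDC00 + (cp & 0x3FF)
            for surrogate in (high, low):
                # "surrogatepass" lets Python encode a lone surrogate.
                out += chr(surrogate).encode("utf-8", "surrogatepass")
        else:
            # Everything else matches standard UTF-8 byte for byte.
            out += ch.encode("utf-8", "surrogatepass")
    return bytes(out)


if __name__ == "__main__":
    smiley = "\U0001F600"
    print(smiley.encode("utf-8").hex())   # standard UTF-8: 4 bytes
    print(modified_utf8(smiley).hex())    # modified UTF-8: 6 bytes
```

A library that writes ordinary UTF-8 produces the four-byte form, so the two sides disagree on any text outside the Basic Multilingual Plane and on embedded NUL characters.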
David Balmain wrote:
> On 6/1/06, Alex Young <alex at blackkettle.org> wrote:
>> Hi there,
>>
>> What's the current status of the Windows port? I may be in a position
>> to lend a hand over the next couple of weeks - where should I start
>> looking?
>
> Hi Alex,
>
> Thanks for your interest. I got Ferret to compile with Visual Studio
> Express 2005. Unfortunately you currently need to use Visual C 6 to
> create Ruby bindings.

A few groups have been bitten by this. I believe this is something Curt Hibbs is going to be addressing with the next One-Click Installer. I don't know if you've been following ruby-lang, but there are noises about moving over to a mingw32 build instead of VC6, which would sort a *lot* of things out. If that ends up happening, extension building on Windows will get much simpler. As far as I know, the OCI only uses VC6 because it was believed at the time that it would be compatible with mingw32 extensions.

For my purposes, I don't especially mind building my own Ruby to make Ferret compatible with it, but I can see that approach may not have too many adherents :-) Do you see any reason why that wouldn't work with the current Ferret source? Would that not be the shortest path to getting it working?

> This proved a lot more difficult so I decided to
> take a different route. Marvin Humphrey (author of KinoSearch, a Perl
> port of Lucene) and I are about to start a new project at Apache
> called Lucy (http://wiki.apache.org/jakarta-lucene/LucyProposal) which
> will aim to create a C port of Lucene that can be used as a backend in
> all dynamic languages. This time around, portability will be a much
> higher priority.

I'm sure you've considered this, but what does that add compared to a GCJ+SWIG approach, as with PyLucene? Without having looked at it, is there anything which prevents that method from being applied to Ruby?

> Lucy may or may not one day become the back end to Ferret.
> At the same
> time I'm experimenting with some different options using the Ferret
> codebase. Now that Lucy is happening I'm not going to worry about
> Lucene index compatibility (which was still a long way off
> in Ferret due to Java's modified UTF-8 encoding). This experimental
> code is in:
>
> svn://www.davebalmain.com/exp
>
> This code is much more portable and will compile with VC6. So if you
> want a Windows port quickly you can try merging this code back into
> Ferret proper. Or if you are really interested in the library's
> internals you could join me working on this experimental code or join
> Marvin and me on the Lucy project (still waiting on Apache approval).
> Whichever route you choose, your help will be most appreciated. Let me
> know your thoughts.

From my personal point of view, I'm most interested in having the same codebase work fast on both Linux and Windows, and, like I say, I don't mind rebuilding Ruby to do it. Right now, I'd be most interested in patching the current cFerret to work under mingw32, unless you know of any reasons that's just not going to work. I'll certainly take a look at the new code and see if there's anything I can usefully add there, too.

Thanks,
--
Alex
On 6/1/06, Alex Young <alex at blackkettle.org> wrote:
> David Balmain wrote:
> > On 6/1/06, Alex Young <alex at blackkettle.org> wrote:
> >> Hi there,
> >>
> >> What's the current status of the Windows port? I may be in a position
> >> to lend a hand over the next couple of weeks - where should I start
> >> looking?
> >
> > Hi Alex,
> >
> > Thanks for your interest. I got Ferret to compile with Visual Studio
> > Express 2005. Unfortunately you currently need to use Visual C 6 to
> > create Ruby bindings.
> A few groups have been bitten by this. I believe this is something Curt
> Hibbs is going to be addressing with the next One-Click Installer. I
> don't know if you've been following ruby-lang, but there are noises about
> moving over to a mingw32 build instead of VC6, which would sort a *lot*
> of things out. If that ends up happening, extension building on Windows
> will get much simpler. As far as I know, the OCI only uses VC6 because
> it was believed at the time that it would be compatible with mingw32
> extensions.

Actually the main reason I haven't finished porting to Windows yet is that it seemed like too much work if the One-Click Installer is going to change to mingw32 anyway. I hope it happens soon.

> For my purposes, I don't especially mind building my own Ruby to make
> Ferret compatible with it, but I can see that approach may not have too
> many adherents :-) Do you see any reason why that wouldn't work with
> the current Ferret source? Would that not be the shortest path to
> getting it working?

Yes, this would probably be the shortest path to get it working. Plus you'll have much better locale support (i.e. UTF-8 support).

> > This proved a lot more difficult so I decided to
> > take a different route.
> > Marvin Humphrey (author of KinoSearch, a Perl
> > port of Lucene) and I are about to start a new project at Apache
> > called Lucy (http://wiki.apache.org/jakarta-lucene/LucyProposal) which
> > will aim to create a C port of Lucene that can be used as a backend in
> > all dynamic languages. This time around, portability will be a much
> > higher priority.
> I'm sure you've considered this, but what does that add compared to a
> GCJ+SWIG approach, as with PyLucene? Without having looked at it, is
> there anything which prevents that method from being applied to Ruby?

It can be done but it's still a lot of work and I just didn't feel up to the task. Plus we get better performance this way with a much smaller download.

> > Lucy may or may not one day become the back end to Ferret. At the same
> > time I'm experimenting with some different options using the Ferret
> > codebase. Now that Lucy is happening I'm not going to worry about
> > Lucene index compatibility (which was still a long way off
> > in Ferret due to Java's modified UTF-8 encoding). This experimental
> > code is in:
> >
> > svn://www.davebalmain.com/exp
> >
> > This code is much more portable and will compile with VC6. So if you
> > want a Windows port quickly you can try merging this code back into
> > Ferret proper. Or if you are really interested in the library's
> > internals you could join me working on this experimental code or join
> > Marvin and me on the Lucy project (still waiting on Apache approval).
> > Whichever route you choose, your help will be most appreciated. Let me
> > know your thoughts.
> From my personal point of view, I'm most interested in having the same
> codebase work fast on both Linux and Windows, and, like I say, I don't
> mind rebuilding Ruby to do it. Right now, I'd be most interested in
> patching the current cFerret to work under mingw32, unless you know of
> any reasons that's just not going to work.
> I'll certainly take a look
> at the new code and see if there's anything I can usefully add there, too.

Have fun. I don't think it'll be too much work getting it to compile under mingw32. I guess we'll see.

Cheers,
Dave
On Jun 1, 2006, at 6:15 AM, David Balmain wrote:
>
>>> This proved a lot more difficult so I decided to
>>> take a different route. Marvin Humphrey (author of KinoSearch, a
>>> Perl port of Lucene) and I are about to start a new project at Apache
>>> called Lucy (http://wiki.apache.org/jakarta-lucene/LucyProposal)
>>> which will aim to create a C port of Lucene that can be used as a
>>> backend in all dynamic languages. This time around, portability
>>> will be a much higher priority.
>> I'm sure you've considered this, but what does that add compared to a
>> GCJ+SWIG approach, as with PyLucene? Without having looked at it, is
>> there anything which prevents that method from being applied to Ruby?
>
> It can be done but it's still a lot of work and I just didn't feel up
> to the task. Plus we get better performance this way with a much
> smaller download.

Java Lucene is built on the assumption, quite reasonable for Java as a compiled language[1], that method calls are cheap and object creation and destruction are cheap. The fact that they are much more expensive in an interpreted language is the main reason the pure-Perl port of Lucene, Plucene, runs so slowly (<http://www.rectangular.com/kinosearch/benchmarks.html>). Lack of access to primitive data types such as int is another reason, but it's actually not that great a factor compared to the OO overhead (I did extensive hacking on Plucene before deciding I had no choice but to start from scratch, and rewriting the IO classes in C didn't help as much as anyone expected). Presumably similar factors are at work slowing down the pure-Ruby Ferret.

The OO overhead problems are mitigated by going the GCJ route, but not eliminated. Say you want to subclass Analyzer -- which most significant deployments of Lucene will want to do eventually. The way a TokenStream works in Lucene, several method calls are required for each and every token -- one for each Analyzer the token passes through.
That gets extremely expensive in an interpreted language. Furthermore, none of Perl's native string manipulation tools work with UTF-16 strings. So if you wanted to, say, insert a custom Perl TokenFilter into a Lucene Analysis chain, you'd have to translate between UTF-8 and UTF-16 each time you cross the Perl/Java boundary, making the TokenStream concept a double disaster.

An alternate way of processing Tokens is to have each link in the Analyzer chain accept a "TokenBatch" instead of a TokenStream: an array of Tokens, rather than a stream of Tokens. That way, each Analyzer can iterate over all the Tokens in a tight loop, either natively or in C. The downside of this technique is that it's not possible to feed it directly from a filehandle/Reader, but that's small potatoes.

It would be possible to graft the TokenBatch concept onto a GCJ'd Lucene: create a native full analysis chain which spits out a TokenBatch, then have the TokenBatch pretend it's a TokenStream, feeding Tokens to Lucene using a C version of next(). That would perform OK -- but you couldn't ever mix and match Java Lucene Analyzers with native Analyzers, only prepend the native onto the front. Therefore, you'd have to rewrite the entire org.apache.lucene.analysis package anyway -- it's the only way you're going to get both full flexibility and performance. And once you've started down the path of rewriting large portions of Lucene, it's hard to see why you'd put up with the headache of the GCJ approach.

There are many other areas where Lucene's architecture is poorly suited for use with an interpreted language. Dave has solved those problems mainly by rewriting the whole thing in C. KinoSearch has taken that approach in some cases, but more often than Ferret, it uses modified algorithms instead. TokenBatch is one example; the best one, which is harder to explain here, is how KinoSearch merges together inverted documents during indexing.
(In summary, it's faster, simpler, and requires far, far fewer objects.)

It would be possible to port some of these algorithm changes to Lucene, but they would be pretty disruptive. Lucene's a mature, heavily-used library and changing anything at all requires a lot of consideration. Some of the changes I would like to see, I don't think I could lobby for in good conscience. The bytecounts-as-string-headers patch is a good example. For Ferret and KinoSearch its adoption would yield a very significant benefit, as it would open the door to using Luke to browse indexes. For Java Lucene, though, it can only be justified by further changes which build upon it.

The downside of the full-port approach that Dave and I have taken is that it's a lot of work to build and maintain. However, we've already done the vast majority of the up-front work once. Re-doing it for Lucy will be a cakewalk in comparison. The maintenance problem that KinoSearch and Ferret currently face, we're addressing by sharing the C core. We would not be surprised if others join us -- I know of at least one other person who rewrote Lucene in C: Robert Kirchgessner, who did a partial PHP/C port. Heck, it will presumably be easier to maintain a Python port against Lucy than against GCJ'd Lucene, provided that we achieve what we've set out to achieve.

The only question remaining, I think, is whether the project will actually be hosted at Apache. When Dave and I approached Doug Cutting about it, he specifically requested that development take place there -- before Dave or I had had a chance to indicate that that was our preference as well. However, we've been waiting for approval by the Lucene PMC for a couple of weeks now, and I'm not sure it's coming. I'm guessing that Erik "One Lucene To Rule Them All" Hatcher hasn't cast his +1. ;) IMO, it would be best for everybody if we did this within the Lucene family, but we'll just have to see.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

[1] What constitutes a compiled vs. a dynamic language is debatable -- see <http://en.wikipedia.org/wiki/Interpreted_language>. It might be more accurate to describe Java as a "more compiled" language.
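To make the TokenStream versus TokenBatch contrast in the message above concrete, here is a rough Python sketch of the two styles. It is not code from Lucene, KinoSearch, or Ferret: every class and function name below is invented for illustration, tokens are reduced to plain strings (real tokens also carry offsets and positions), and the two filters (lowercasing and stop-word removal) were picked only because they are easy to follow.

```python
from typing import Iterable, List, Optional, Set


class ListTokenStream:
    """Source of a stream-style chain: one token per next_token() call."""

    def __init__(self, tokens: Iterable[str]) -> None:
        self._tokens = iter(tokens)

    def next_token(self) -> Optional[str]:
        return next(self._tokens, None)  # None signals end of stream


class LowerCaseStreamFilter:
    """Stream-style link: lowercases one token per call."""

    def __init__(self, source) -> None:
        self.source = source

    def next_token(self) -> Optional[str]:
        token = self.source.next_token()
        return None if token is None else token.lower()


class StopStreamFilter:
    """Stream-style link: skips stop words, pulling upstream token by token."""

    def __init__(self, source, stop_words: Set[str]) -> None:
        self.source = source
        self.stop_words = stop_words

    def next_token(self) -> Optional[str]:
        while True:
            token = self.source.next_token()
            if token is None or token not in self.stop_words:
                return token


def analyze_stream(tokens: List[str], stop_words: Set[str]) -> List[str]:
    """Pull tokens through the chain one at a time.

    Each token costs one method call per link in the chain."""
    chain = StopStreamFilter(
        LowerCaseStreamFilter(ListTokenStream(tokens)), stop_words
    )
    result = []
    while True:
        token = chain.next_token()
        if token is None:
            break
        result.append(token)
    return result


def lowercase_batch(batch: List[str]) -> List[str]:
    """Batch-style link: one call processes every token in a tight loop."""
    return [token.lower() for token in batch]


def stop_batch(batch: List[str], stop_words: Set[str]) -> List[str]:
    """Batch-style link: one call filters the whole array."""
    return [token for token in batch if token not in stop_words]


def analyze_batch(tokens: List[str], stop_words: Set[str]) -> List[str]:
    """Hand the whole array to each link in turn: one call per link."""
    batch = lowercase_batch(list(tokens))
    return stop_batch(batch, stop_words)


if __name__ == "__main__":
    words = ["The", "Quick", "the", "Fox"]
    print(analyze_stream(words, {"the"}))
    print(analyze_batch(words, {"the"}))
```

Both functions produce the same tokens. The difference is where the per-token work happens: the stream version makes a method call per token per link, while the batch version makes one call per link and keeps the per-token loop inside it, which is the part that could be pushed down into C.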
On Jun 1, 2006, at 11:00 AM, Marvin Humphrey wrote:
> IMO, it would be best for everybody
> if we did this within the Lucene family,

... and that's what's going to happen. I just got an email from Doug. We're good to go.

Thank you, Lucene PMC.

:)

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
You're welcome! I'm looking forward to Ruby Lucene goodness!!!

Erik

On Jun 1, 2006, at 2:32 PM, Marvin Humphrey wrote:
>
> On Jun 1, 2006, at 11:00 AM, Marvin Humphrey wrote:
>
>> IMO, it would be best for everybody
>> if we did this within the Lucene family,
>
> ... and that's what's going to happen. I just got an email from Doug.
> We're good to go.
>
> Thank you, Lucene PMC.
>
> :)
>
> Marvin Humphrey
> Rectangular Research
> http://www.rectangular.com/
>
> _______________________________________________
> Ferret-talk mailing list
> Ferret-talk at rubyforge.org
> http://rubyforge.org/mailman/listinfo/ferret-talk
On 6/1/06, Marvin Humphrey <marvin at rectangular.com> wrote:
>
> On Jun 1, 2006, at 11:00 AM, Marvin Humphrey wrote:
>
> > IMO, it would be best for everybody
> > if we did this within the Lucene family,
>
> ... and that's what's going to happen. I just got an email from Doug.
> We're good to go.
>
> Thank you, Lucene PMC.
>
> :)

Other than the initial proposal, any pointers to websites or mailing lists where we can track the development of this project?

Thanks.

-F
On Jun 5, 2006, at 7:48 AM, Finn Smith wrote:
> Other than the initial proposal, any pointers to websites or mailing
> lists where we can track the development of this project?

We're now waiting for our Apache accounts to be set up, the mailing lists and the subversion repositories to be created, etc. If I'm not mistaken, all the infrastructure support work at Apache is done by volunteers, so patience is the watchword. Once there is a Lucy mailing list, we'll send a notification to this list.

Not a lot is going on right now besides the occasional spasm of high-level planning on either the KinoSearch list or the Ferret list. That's because in order to avoid the Apache Incubator process, all development needs to take place "on the record" in Apache forums and repositories.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/